Mimir: Bringing CTables into Practice
Traditional data analytics requires upfront data sanitization to get reliable query results. There is an implicit expectation that database engines return absolutely correct answers to all queries. Relaxing this expectation allows data curation tasks to be automated, yielding a spectrum of trade-offs between the time invested in curating data and the data quality achieved. Mimir is a system that cleans data with minimal user effort, built on a recently introduced class of operators called lenses. Lenses provide a general, composable framework for applying various data cleaning tasks to messy data while preserving the provenance of all sources of uncertainty in the cleaned data. Mimir can sit atop any database with a mature JDBC driver and extend it to support lenses and probabilistic queries over them. The chief limitation of the work that introduced lenses was that join queries over uncertain predicates decomposed into cross products. This thesis presents several approaches to making query processing with lenses scalable. Experimental evidence of the viability of using Mimir to process queries on large datasets is presented, using select-project-join (SPJ) queries based on the TPC-H benchmark. This thesis also describes Mimir's GUI, which annotates each uncertain data element in query results with quality metrics as well as its provenance. A third contribution of this work is Mimir's ability to ingest raw CSV data and give it structure with a type inference lens.
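To make the idea of a type inference lens concrete, the following is a minimal sketch in Python. It is not Mimir's implementation: the function names, the candidate type list, and the `threshold` parameter are all hypothetical, chosen only to illustrate guessing a column type from raw CSV strings while recording which rows disagree with the guess, so that the uncertainty stays traceable.

```python
# Illustrative sketch only; names and the threshold rule are assumptions,
# not taken from Mimir.

CANDIDATE_TYPES = ["int", "float", "string"]  # most specific first


def _matches(value, type_name):
    """Return True if the raw string parses as the given type."""
    value = value.strip()
    if type_name == "int":
        try:
            int(value)
            return True
        except ValueError:
            return False
    if type_name == "float":
        try:
            float(value)
            return True
        except ValueError:
            return False
    return True  # every value is a valid string


def infer_column_type(values, threshold=0.9):
    """Guess a column type and report the rows that contradict the guess.

    Picks the most specific candidate type that at least `threshold`
    of the non-empty values conform to. The row indices that do not
    conform are returned so a caller can flag them as uncertain.
    """
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return {"type": "string", "support": 0.0, "exceptions": []}
    for type_name in CANDIDATE_TYPES:
        bad = [
            i for i, v in enumerate(values)
            if v.strip() != "" and not _matches(v, type_name)
        ]
        support = (len(non_empty) - len(bad)) / len(non_empty)
        if support >= threshold:
            return {"type": type_name, "support": support, "exceptions": bad}


column = ["1", "2", "3", "x", "5", "6", "7", "8", "9", "10"]
result = infer_column_type(column)
```

Here `result["exceptions"]` lists the row positions whose values do not fit the guessed type, which is the kind of information a user interface could surface next to an uncertain value.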