Computing Provenance for Database Updates and Transactions
Speaker
Boris Glavic
Assistant Professor of Computer Science, Illinois Institute of Technology
http://cs.iit.edu/~glavic/
Description
Abstract: Data provenance, information about the origin and creation process of data, has been used to debug queries and clean data in data warehouses, to understand and correct complex data integration transformations, for auditing, and to assess the quality of data in Big Data analytics and Data Science. Automatic provenance generation is of immense importance in Big Data and data science, where the size and heterogeneity of the data and the time requirements for analysis results make it infeasible to generate provenance information manually. Most of the literature on database provenance has focused on tracing the provenance of queries, i.e., mapping each output row of a query to the rows from the query's input that were used to compute it. However, use cases such as auditing need to trace the origin of a row through database updates, which are usually executed as part of a transaction to preserve consistency under concurrent access and to support recovery from failures. In this talk I give an overview of my group's research on computing provenance for updates and transactions. Similar to most approaches for computing the provenance of queries, we use query rewrite techniques to generate queries that compute provenance as a side effect. Our approach is based on transaction time histories for tables and an encoding of updates as queries over past states of tables. This work is partially supported by Oracle.
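
To give a flavor of the encoding mentioned in the abstract, the sketch below (in Python, building a SQL string) shows how an UPDATE statement might be replayed as a query over a table's past state recorded in a transaction-time history. It is an illustrative sketch only, not the speaker's system: the function, table, and column names are hypothetical, and the temporal syntax follows SQL:2011's FOR SYSTEM_TIME clause rather than any particular database.

# Sketch: rewrite an UPDATE into an equivalent SELECT over the table's
# state just before the update (its transaction-time snapshot). Rows
# matching the WHERE condition get the new values; other rows pass
# through unchanged, mirroring the semantics of the update.
def reenact_update(table, set_clauses, where_cond, as_of):
    projections = []
    for col, new_val in set_clauses.items():
        projections.append(
            f"CASE WHEN {where_cond} THEN {new_val} ELSE {col} END AS {col}"
        )
    # A full implementation would also project the columns not mentioned
    # in SET unchanged; omitted here to keep the sketch short.
    return (
        f"SELECT {', '.join(projections)} "
        f"FROM {table} FOR SYSTEM_TIME AS OF '{as_of}'"  # past table state
    )

# Example: reenact  UPDATE accounts SET balance = balance - 100 WHERE id = 42
print(reenact_update(
    table="accounts",
    set_clauses={"balance": "balance - 100"},
    where_cond="id = 42",
    as_of="2015-06-01 10:00:00",
))

Because the update is now expressed as an ordinary query over a past table state, the same query rewrite techniques used for query provenance can be applied to it to capture provenance as a side effect.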
Event Topic
Data Science