An Automatic Physical Design Tool for Clustered Column-stores
Description
Abstract
There has been a significant amount of prior work on automating physical database design. The goal of an automated designer is to produce auxiliary structures that speed up user queries without exceeding the allotted resource budget (typically disk space). Most existing research has been done in the context of commercial row-store databases such as Microsoft SQL Server, IBM DB2 or Oracle. In fact, every commercial database ships with some sort of tool that provides design recommendations for the database administrator to consider.
We have worked on automating the database design process in a column-store database. Our experiments primarily used Vertica, a commercial column-store database based on the C-Store research prototype. On the surface, it may seem that we are simply changing the underlying storage system while the problem of designing the physical structures remains essentially the same. We have found, however, that several fundamental differences turn this into a new and unsolved problem: many of the basic axioms used in row-store design do not hold in a column-store setting (and vice versa). In this talk, we demonstrate the construction of an effective design tool and an analytic cost model for a column-store like C-Store. We show that some techniques from machine learning, such as clustering, can reduce and simplify this design problem. To our knowledge, there has been little work on the problem of physical design in the context of column-stores, and none in the context of column-stores like C-Store or Vertica.
Biography
Alexander Rasin is an Assistant Professor in the College of Computing and Digital Media (CDM) at DePaul University. He received his Ph.D. and M.Sc. in Computer Science from Brown University, Providence. His current research centers on high-performance data warehouses and large-scale data analytics. Dr. Rasin's other research interests include resource provisioning and high-availability guarantees in distributed systems.