Professor, CSA, Indian Institute of Science
In this article, we introduce the basics of database query optimization and processing, and the strategies currently employed by database systems in this exercise. We then present a sampling of recent techniques that take significant steps forward towards addressing the sub-optimality issue, providing for the first time, provable and attractive performance guarantees. Finally, we conclude with exemplars of open research problems that need to be addressed before a holistic solution to robust query processing in database systems can be achieved.
Organizations typically collect vast quantities of data relating to their operations. For example, the Income Tax department in India has accumulated a huge repository of information pertaining to taxpayer returns. In order to provide a convenient and efficient environment to productively use these enormous data collections, software packages called “Data Base Management Systems” (DBMS) have been developed and refined over the past half-century, beginning in the early 1960s. These packages are now extensively used throughout the world in virtually all spheres of human activity, including banking, insurance, governance, business, travel, manufacturing, education and health-care. Popular commercial offerings of DBMS software include IBM’s DB2  , Oracle Database  and Microsoft’s SQL Server  , while PostgreSQL  and MySQL  are well-known public-domain packages.
Research related to database systems has a rich history, going back to almost the origins of computing itself, and the fruits of this research have long been an organic component of the core computer science curriculum. In the 1960s and 70s, the focus was on developing expressive data models, with the relational model  finally emerging triumphant as the workhorse of enterprise data processing. The following decade of the 80s saw the transaction concept gaining centrestage, the emphasis being on developing efficient mechanisms to provide the powerful ACID (Atomicity, Consistency, Isolation, Durability) semantics that are the hallmark of this concept  . Included in its ambit were data recovery mechanisms, concurrency control techniques, indexing strategies, and memory management. The value of these contributions is testified to by the fact that both Edgar Codd and Jim Gray, the primary proponents of the relational model and the transaction concept, respectively, were awarded the Turing Award in 1981 and 1998, respectively (the Turing Award is considered to be the equivalent of the Nobel Prize in computer science).
Concurrently with the above advances, the automated identification of efficient execution strategies for declarative query processing gained tremendous ground through the development of dynamic-programming based techniques for navigating the exponentially large strategy search space, and it is this aspect of database technology that we focus on in the remainder of this article.
1.1 Declarative User Interface
An organic reason for the popularity of relational DBMS is that they provide a simple but powerful logical model wherein all data is stored in the form of tables called relations. Each row in the table represents information about a real-world entity, whereas the columns reflect the attributes of the entity. To make this concept clear, sample relations of a University database are shown in Figure 1. Here, information is maintained in three relations: STUDENT, COURSE and REGISTER, which tabulate data about students, courses, and the course registrations of students, respectively. For instance, we find that Jim Gray, a doctoral candidate, has registered for the 4-credit database course taught by Edgar Codd.
A variety of friendly and powerful mechanisms to ask questions, called queries, on the information stored in database relations, have been proposed in the literature. Over time, however, the Structured Query Language (SQL)  has become the global de facto standard for interacting with relational DBMS. Therefore, queries are typically expressed in SQL, either directly or through form-based interfaces that are translated to SQL equivalents.
A particularly appealing feature of SQL is that it is “declarative” in nature, meaning that the user only states what is wanted without having to specify the procedure for obtaining the information. That is, in this framework, the user only specifies the end objectives and the database system is responsible for identifying and executing the most efficient means to achieve these objectives.
To make the declarative notion concrete, consider again the sample university database shown in Figure 1. Assume that the user’s goal is to extract the names of the students and the courses for which they are registered. A sample SQL query that achieves this goal is shown in Figure 2, producing the result “Jim Gray, Database Systems”. In this query, the desired output is obtained by combining the data across the three tables using the roll numbers and the course numbers as the
connectors. These cross-table connectors are known as “joins” in the database terminology, and represented with a ⋈ symbol.
*Portions of the material presented in this article are sourced from technical papers published by the author’s lab  .