A QUARTERLY PUBLICATION OF ACCS
Robust Query Processing in Database Systems*


Jayant Ramaswamy Haritsa, Professor, CSA, Indian Institute of Science

Abstract

 

Database management systems constitute the backbone of today’s information-rich society, providing a congenial environment for handling enterprise data during its entire life cycle of generation, storage, maintenance and processing. The de facto standard user interface to query the information present in the database is SQL (Structured Query Language). An organic USP of SQL is its declarative persona, which enables users to focus solely on query formulation, leaving it to the database system to identify an efficient execution strategy. Paradoxically, however, the declarative constitution of SQL is also its Achilles’ heel. This is because the execution strategies chosen by the system often turn out, in hindsight, to be highly sub-optimal as compared to the ideal choice. Unfortunately, due to the intrinsic technical complexities and challenges, solutions to this long-standing problem have remained chronically elusive despite intensive study by the database research community over the past five decades.

In this article, we introduce the basics of database query optimization and processing, and the strategies currently employed by database systems in this exercise. We then present a sampling of recent techniques that take significant steps forward towards addressing the sub-optimality issue, providing for the first time, provable and attractive performance guarantees. Finally, we conclude with exemplars of open research problems that need to be addressed before a holistic solution to robust query processing in database systems can be achieved.

1 Introduction

Organizations typically collect vast quantities of data relating to their operations. For example, the Income Tax department in India has accumulated a huge repository of information pertaining to taxpayer returns. In order to provide a convenient and efficient environment to productively use these enormous data collections, software packages called “Data Base Management Systems” (DBMS) have been developed and refined over the past half-century, beginning in the early 1960s. These packages are now extensively used throughout the world in virtually all spheres of human activity, including banking, insurance, governance, business, travel, manufacturing, education and health-care. Popular commercial offerings of DBMS software include IBM’s DB2 [53], Oracle Database [55] and Microsoft’s SQL Server [58], while PostgreSQL [56] and MySQL [54] are well-known public-domain packages.

Research related to database systems has a rich history, going back to almost the origins of computing itself, and the fruits of this research have long been an organic component of the core computer science curriculum. In the 1960s and 70s, the focus was on developing expressive data models, with the relational model [14] finally emerging triumphant as the workhorse of enterprise data processing. The following decade of the 80s saw the transaction concept gaining centre stage, the emphasis being on developing efficient mechanisms to provide the powerful ACID (Atomicity, Consistency, Isolation, Durability) semantics that are the hallmark of this concept [20]. Included in its ambit were data recovery mechanisms, concurrency control techniques, indexing strategies, and memory management. The value of these contributions is testified to by the fact that Edgar Codd and Jim Gray, the primary proponents of the relational model and the transaction concept, were awarded the Turing Award in 1981 and 1998, respectively (the Turing Award is considered to be the equivalent of the Nobel Prize in computer science).

Concurrently with the above advances, the automated identification of efficient execution strategies for declarative query processing gained tremendous ground through the development of dynamic-programming based techniques for navigating the exponentially large strategy search space, and it is this aspect of database technology that we focus on in the remainder of this article.

1.1 Declarative User Interface

An organic reason for the popularity of relational DBMS is that they provide a simple but powerful logical model wherein all data is stored in the form of tables called relations. Each row in the table represents information about a real-world entity, whereas the columns reflect the attributes of the entity. To make this concept clear, sample relations of a University database are shown in Figure 1. Here, information is maintained in three relations: STUDENT, COURSE and REGISTER, which tabulate data about students, courses, and the course registrations of students, respectively. For instance, we find that Jim Gray, a doctoral candidate, has registered for the 4-credit database course taught by Edgar Codd.

A variety of friendly and powerful mechanisms to ask questions, called queries, on the information stored in database relations, have been proposed in the literature. Over time, however, the Structured Query Language (SQL) [57] has become the global de facto standard for interacting with relational DBMS. Therefore, queries are typically expressed in SQL, either directly or through form-based interfaces that are translated to SQL equivalents.

A particularly appealing feature of SQL is that it is “declarative” in nature, meaning that the user only states what is wanted without having to specify the procedure for obtaining the information. That is, in this framework, the user only specifies the end objectives and the database system is responsible for identifying and executing the most efficient means to achieve these objectives.

To make the declarative notion concrete, consider again the sample university database shown in Figure 1. Assume that the user’s goal is to extract the names of the students and the courses for which they are registered. A sample SQL query that achieves this goal is shown in Figure 2, producing the result “Jim Gray, Database Systems”. In this query, the desired output is obtained by combining the data across the three tables using the roll numbers and the course numbers as the connectors. These cross-table connectors are known as “joins” in the database terminology, and are represented with a ⋈ symbol.
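To make this interaction tangible, the following self-contained Python sketch issues a query of this flavour against an in-memory SQLite database. The schema and column names (sname, cname, rollno, courseno, and so on) are illustrative assumptions on our part, since Figures 1 and 2 are not reproduced in the text; the point is only that the user states what is wanted, while the engine decides how to compute it.

```python
import sqlite3

# Illustrative schema and data; column names are assumed, not taken from Figure 1.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE student  (rollno INTEGER PRIMARY KEY, sname TEXT, degree TEXT);
    CREATE TABLE course   (courseno INTEGER PRIMARY KEY, cname TEXT,
                           credits INTEGER, instructor TEXT);
    CREATE TABLE register (rollno INTEGER, courseno INTEGER);

    INSERT INTO student  VALUES (1, 'Jim Gray', 'PhD');
    INSERT INTO course   VALUES (10, 'Database Systems', 4, 'Edgar Codd');
    INSERT INTO register VALUES (1, 10);
""")

# The declarative query: only the desired result (student names and their
# registered courses) is specified; the join order and the join algorithms
# are left entirely to the database engine.
cur.execute("""
    SELECT s.sname, c.cname
    FROM   student s, register r, course c
    WHERE  s.rollno = r.rollno AND r.courseno = c.courseno
""")
print(cur.fetchall())   # [('Jim Gray', 'Database Systems')]
```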


*Portions of the material presented in this article are sourced from technical papers published by the author’s lab [52] .

1.2 Query Execution Plans

The important point to note in the above formulation is that several decisions are left unspecified by the user. For instance, the sequence in which the tables are combined – candidate sequences could be ((STUDENT ⋈ REGISTER) ⋈ COURSE) or ((REGISTER ⋈ COURSE) ⋈ STUDENT). The result of the joins would be identical in both cases because the ⋈ operator is, from a mathematical perspective, both associative and commutative – however, the effort involved in the two executions could be vastly different. Moreover, even after a sequence has been decided, we still have to decide the implementation mechanism for each of the logical join operators. A spectrum of join algorithms has been developed in the literature, including NESTED-LOOPS JOIN, SORT-MERGE JOIN, HASH JOIN, etc., and a choice from among these has to be made for each logical join.
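To give a flavour of how different physical operators can implement the same logical join, here is a minimal Python sketch contrasting the NESTED-LOOPS and HASH join strategies on in-memory lists of tuples. It is only meant to convey the algorithmic shapes; real engines operate on disk pages with many further refinements.

```python
# Two textbook join implementations over Python lists of tuples.

def nested_loops_join(outer, inner, outer_key, inner_key):
    """Compare every outer row with every inner row: O(|outer| * |inner|)."""
    return [(o, i) for o in outer for i in inner
            if o[outer_key] == i[inner_key]]

def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on one input, then probe it with the other:
    roughly O(|build| + |probe|) when the build side fits in memory."""
    table = {}
    for b in build:
        table.setdefault(b[build_key], []).append(b)
    return [(b, p) for p in probe for b in table.get(p[probe_key], [])]

# Toy inputs: (rollno, name) and (rollno, courseno) tuples.
student  = [(1, 'Jim Gray')]
register = [(1, 10)]
print(nested_loops_join(student, register, 0, 0))
print(hash_join(student, register, 0, 0))
```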

All the above decisions have to be made by the database engine, which is responsible for identifying and executing the most efficient means to process the user query. These two steps are implemented by the query optimizer and the query executor components within the core of the engine, and over the past half century, research on the design and implementation of these components has been a foundational topic for both the academic and industrial database communities.

The alternative strategies considered by the query optimizer, in its quest to find the ideal, are referred to as “plans” in the database jargon. They are usually costed and compared in terms of their estimated query response times. It is important to note here that optimization is a mandatory exercise since the difference between the cost of the best plan and a random choice could be in orders of magnitude. The role of query optimizers has become especially critical in recent times due to the high degree of query complexity characterizing current decision-support applications, as exemplified by the industry-standard TPC-H and TPC-DS performance benchmarks [59] [60] .

1.3 Query Optimization Challenges

In spite of the long-standing research mentioned above, the query processing domain has largely remained a “black art”. This is due to a variety of well-documented complexities and challenges [8] [9] , discussed in Section 2. In fact, even as recently as 2014, a highly-respected industry veteran had this extremely pessimistic statement to make in a blog post [34]: The wonder isn’t “Why did the optimizer pick a bad plan?” Rather, the wonder is “Why would the optimizer ever pick a decent plan?”! Similar sentiments have been expressed by other academic and industrial experts, including: “Query optimizers do a terrible job of producing reliable, good plans (for complex queries) without a lot of hand tuning.” [16] , and “Almost all of us who have worked on query optimization find the current state of the art unsatisfactory with known big gaps in the technology.” [10]

The scale of performance degradation faced by database queries arising out of poor choices of execution plans can be huge – often in orders of magnitude – as compared to an oracular ideal that magically knows the optimal plan for query processing. As a case in point, when Query 19 of the TPC-DS benchmark is executed on contemporary database engines, the worst-case slowdown, relative to the hypothetical oracle, can exceed a factor of a million [17]. Therefore, it is self-evident that the current design of database systems is extremely brittle, with large-scale performance variations arising out of inadequate design methodologies. In a nutshell, they substantively fail to provide robust performance.

Finally, apart from the obvious negative impact on user productivity and satisfaction, there are also financial implications of this performance degradation – for instance, it is reported in [49] that the lack of robustness can contribute as much as a third to the total cost of ownership for a database system.

1.4 Recent Research Advances

In the midst of this gloom and doom, the good news is that in recent times, there have been a host of exciting research advances, which collectively promise to provide strong foundations for designing the next generation of query engines. The expectation is that these advances will eventually organically support robust query processing (RQP), relegating to the past the above-mentioned cynicism on this bedrock feature. Many of the new ideas owe their genesis to a series of influential and well-attended Dagstuhl Seminars in Germany on the topic of Robust Query Processing over the last decade (2010 [49] , 2012 [50] , 2017 [51] ). Further, they have arisen from research teams located at diverse locations across the world, including the US, Europe and Asia.

1.5 Organization

The remainder of this article is organized as follows: In Section 2, we outline the basics of the query optimization process. Then in Section 3, we present the recently developed PlanBouquet technique which provides performance guarantees and takes an important step towards addressing the RQP problem. Finally, in Section 4, we outline future research avenues that can be explored in the RQP context.

2 Query Optimization

Given its industrial relevance, it is not surprising that a rich body of literature has developed on database query processing. A sampling of this prior work is given in the Bibliography – in particular, we direct the reader to the comprehensive and readable surveys presented in [19] [8] [9] . In this section, we will only provide an intuitive overview of the process to serve as motivation for explicitly incorporating robustness in database engine design.

Given a generic SQL query that requires combining information across N relations, there are in principle N! different permutations of the combination sequence, implying that the strategy search space is at least exponential in the query size. The automated identification of efficient plans from the humongous search space is the responsibility of the database query optimizer.

Plans are typically comprised of a tree of data processing operators that are evaluated in a bottom-up paradigm. A sample plan is shown in Figure 3 for the example query of Figure 2, where the STUDENT and REGISTER relations are first combined with a NESTED-LOOP join operator, and this intermediate result is then combined with the COURSE relation using a HASH-JOIN operator. The plan is evaluated in a bottom-up traversal, beginning with the leaf nodes, and working our way up to the root, which provides the final results. Further, the bracketed numbers within each operator node indicate the estimated aggregate processing costs incurred until this stage in the evaluation. While operator costs could be defined with regard to various measures, including execution time, output latency, or resource consumption, we will assume in this article that the costs are indicative of execution time.

Modern query optimizers each have their own “secret sauce” to identify the best (i.e. cheapest/fastest) plan for answering these declarative SQL queries. However, the de-facto standard underlying strategy that is present in all these systems, and which was pioneered by the System R project at IBM Research [42] , is the following:

1. First apply a variety of heuristics to restrict the exponential plan search space to a manageable size. For instance, the early database systems only considered left-deep trees, wherein at least one of the relations in each join is a base user relation appearing in the database schema.
2. Next, estimate the efficiency of each of these candidate plans using a cost model in conjunction with a dynamic-programming-based search algorithm.
3. Finally, choose the plan with the lowest estimated cost.

 

A pictorial representation of the canonical query optimization framework is shown in Figure 5. The input is the declarative SQL query and the output is the optimal (i.e. cheapest/fastest) execution plan. The core processing is implemented by the dynamic programming-based search algorithm, which leverages the fact that the globally optimal plan can be incrementally built up using the locally optimal solution for each operator. Further, there are two essential models that serve as inputs to this process, namely, the cardinality estimation model and the cost estimation model, which are described below.
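As a concrete, if drastically simplified, illustration of this dynamic-programming search, the following Python sketch enumerates left-deep join orders bottom-up, in the spirit of the System R approach [42]. The cost model is a toy one of our own (join cost equal to the product of input cardinalities, with an arbitrary 0.001 join selectivity), standing in for the cardinality and cost models of Figure 5.

```python
from itertools import combinations

def best_left_deep_plan(cardinalities):
    """Selinger-style sketch: cardinalities maps relation name -> row count."""
    rels = tuple(sorted(cardinalities))
    # best[S] = (cost, plan string, output rows) for joining the subset S
    best = {(r,): (0.0, r, cardinalities[r]) for r in rels}
    for size in range(2, len(rels) + 1):
        for subset in combinations(rels, size):
            candidates = []
            for r in subset:                       # r joins as the base (inner) relation
                rest = tuple(x for x in subset if x != r)
                cost, plan, rows = best[rest]
                out_rows = rows * cardinalities[r] * 0.001   # toy join selectivity
                candidates.append((cost + rows * cardinalities[r],
                                   f"({plan} JOIN {r})", out_rows))
            best[subset] = min(candidates)         # keep only the cheapest sub-plan
    return best[rels]

print(best_left_deep_plan({"STUDENT": 10_000, "REGISTER": 50_000, "COURSE": 500}))
```

The key property exploited is the one stated above: the cheapest plan for a relation subset can be built from the cheapest plans of its smaller subsets, so the exponential space of join orders is explored without re-examining sub-plans.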

2.1 Cardinality Estimation Model

The cardinality model estimates the volume of data, measured in number of database rows, that flows from one operator to the next in the plan tree. As shown in Figure 5, this volume is a function of the data distributions within the relational columns, and the data correlations across the columns. The individual column distributions are usually approximated, in a piece-wise manner, using histogram-based summaries.
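The following Python sketch illustrates the basic idea behind such summaries with an equi-width histogram answering a "column < constant" predicate. Production systems use more refined structures (for example, equi-depth histograms), but the principle, estimating from a compact summary under a uniformity assumption within buckets, is the same.

```python
import random

def build_histogram(values, num_buckets, lo, hi):
    """Equi-width histogram: count how many values fall in each bucket."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        b = min(int((v - lo) / width), num_buckets - 1)
        counts[b] += 1
    return counts, lo, width

def estimate_lt(histogram, constant):
    """Estimated selectivity of the predicate 'column < constant'."""
    counts, lo, width = histogram
    total = sum(counts)
    sel = 0.0
    for b, count in enumerate(counts):
        b_lo, b_hi = lo + b * width, lo + (b + 1) * width
        if constant >= b_hi:
            sel += count                 # bucket fully satisfies the predicate
        elif constant > b_lo:
            # uniformity assumption within the bucket (a common error source)
            sel += count * (constant - b_lo) / width
    return sel / total

prices = [random.uniform(0, 10_000) for _ in range(100_000)]
hist = build_histogram(prices, num_buckets=20, lo=0, hi=10_000)
print(estimate_lt(hist, 1000))   # close to 0.1 for this uniformly generated data
```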

Typically, a host of cardinalities have to be estimated while constructing efficient execution plans for declarative decision-support queries. For instance, consider the example query shown in Figure 4 for enumerating orders of cheap parts (costing less than Rs. 1000) from a manufacturing database – here, the optimizer has to estimate the cardinalities of the three bold-faced constraints: a filter predicate (p_retailprice < 1000) and two join predicates (part ⋈ lineitem, orders ⋈ lineitem). Unfortunately, in practice, these estimates are often significantly in error with respect to the actual values subsequently encountered during query execution. Such errors, which can even be in orders of magnitude in real database environments [37] [34], arise due to a variety of well-documented reasons [43] [34], including outdated statistics, coarse summaries, attribute-value independence assumptions, complex user-defined predicates, and error propagation in the query plan tree [29]. Moreover, in industrial environments such as ETL (Extract, Transform, Load) workflows, the statistics may actually be unavailable due to data source constraints, forcing the optimizer to resort to “magic numbers” for the cardinality values (e.g. 0.1R for equality selections on columns of a relation with R rows [42]). The net outcome of these erroneous estimates is that the execution plans recommended by the query optimizer may turn out to be very poor choices at run-time, resulting in grossly inflated query response times.

A considerable body of literature exists on proposals to tackle this classical problem. For instance, techniques for improving the statistical quality of the associated meta-data include improved summary structures [2] [38] , feedback-based adjustments [43] , and on-the-fly reoptimization of queries [30] [6] [39]. A complementary approach is to identify robust plans that are relatively less sensitive to estimation errors [13] [5] [6] [25] . While all these prior techniques provide novel and innovative formulations, a common limitation is their inability to furnish performance guarantees. That is, we can only hope for good performance, but cannot provide provable bounds in this regard.

2.2 Cost Estimation Model

We now turn our attention to the cost estimation model, which is responsible for estimating the time taken for processing the data at each operator in the plan. As shown in Figure 5, its estimates, which are usually computed on a normalized per-row basis, are dependent on the underlying computing platform and the software implementation of the database engine. The overall cost of each operator is the product of the estimated row cardinalities (as obtained from the cardinality model) and the per-row cost estimates.
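A minimal sketch of this combination is shown below, assuming purely illustrative per-row constants and a plan represented as a nested tuple; it simply accumulates (estimated rows × per-row cost) bottom-up over the plan tree, loosely mirroring the bracketed aggregate costs of Figure 3.

```python
# Per-row cost constants are illustrative stand-ins for a calibrated cost model.
PER_ROW_COST = {"SCAN": 1.0, "NL_JOIN": 0.5, "HASH_JOIN": 0.2}

def plan_cost(node):
    """node = (operator, estimated input/output rows, list of child nodes)."""
    op, est_rows, children = node
    # cost of this operator plus the accumulated cost of its children
    return est_rows * PER_ROW_COST[op] + sum(plan_cost(c) for c in children)

# A toy plan shaped like Figure 3: (STUDENT NL-join REGISTER) hash-joined with COURSE.
plan = ("HASH_JOIN", 60_000, [
          ("NL_JOIN", 50_000, [("SCAN", 10_000, []), ("SCAN", 50_000, [])]),
          ("SCAN", 500, []),
       ])
print(plan_cost(plan))
```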

Just as in the cardinality model, the cost model can also be subject to errors that adversely impact robustness. These errors arise out of difficulties in predicting runtime resource availability, modeling complex interactions among database and platform components, catering to heterogeneity in the hardware, etc. Again, similar to the cardinality model, the cost model has also been the subject of several studies in the prior literature, and has seen renewed interest in the last decade. For instance, the effectiveness of machine learning techniques to predict query execution time, by using both plan-level and operator-level models, was demonstrated in [3]. Their features included optimizer cost estimates, query parameters and the actual runtime statistics. In marked contrast, another research group showed in [47] how the existing statistical models themselves, with some degree of augmentation and proper tuning, could produce satisfactory estimates of query execution times with much less computational effort. In their approach, an initial offline profiling phase was used to establish the unit temporal costs for utilizing various system resources. Subsequently, online sampling was employed to estimate the number of usages for a given query. In a series of follow-up papers [46] [48], stronger statistical models were incorporated in their algorithmic suite to maintain the prediction quality in the presence of uncertainty and concurrency.

2.3 Characterization of Robustness

As discussed above, the current performance of database systems is extremely sensitive to the accuracy of the underlying estimation models. Therefore, it is highly desirable to design robust solutions that provide performance stability. However, the definition of robustness itself has been a subject of intense debate, especially at the Dagstuhl seminars, and a consensus has been difficult to achieve. For instance, if worst-case performance is improved at the expense of average-case performance, is that an acceptable notion of robustness? Rather than get sidetracked into this vexed semantic tangle, we instead advocate that the database engine should have a multiplicity of components that separately but cooperatively cater to the various environments. For instance, if the cardinality estimates are expected to be reasonably accurate, then the traditional query optimizer is a reasonable choice for arriving at the right plan. On the other hand, if the estimates are expected to be brittle, then an alternative mechanism kicks in – as a case in point, the Plan Bouquet technique [17] discussed in Section 3, which does not use cardinality estimates at all.

Further, there are different levels at which notions of robustness can be introduced, ranging from individual operators (e.g. [7]) to entire query plans (e.g. [12]), or even end-to-end query executions (e.g. [17]). Moreover, one can take recourse to algorithmic (e.g. [44]), statistical (e.g. [47]) or learning-based (e.g. [36]) approaches to incorporate the robustness features at these various levels. The big picture is that a rich variety of possibilities are currently available, and a judicious choice could potentially lead to the desired robustness. Most importantly, with the impending advent of the so-called Big Data world, wherein data will be the engine driving virtually all aspects of human endeavour, the role of robust query processing (RQP) will soon assume critical proportions.

2.4 RQP Metric

For the purpose of this article, we will define robustness to be “the worst-case sub-optimality in plan performance that can arise due to modeling errors resulting in inaccurate estimates”. We denote this metric as Maximum Sub-Optimality, or MSO, since it captures the worst-case ratio, over the entire modeling space, of the execution time taken by the chosen plan with respect to the optimal time incurred by an oracular system that magically knows all the correct parameter values. The precise mathematical definition of MSO is given below.

Assume that we have an n-dimensional estimation space, referred to as the ESS. Then, the optimizer-estimated location of the query in the ESS is denoted by qe, whereas the actual location is denoted by qa. The optimal plan at qe, as determined by the native optimizer, is denoted by Popt(qe), and similarly the optimal plan at qa by Popt(qa). Further, assume that the query locations and the associated estimation errors range over the entire estimation space, that is, all (qe , qa) combinations are possible. Finally, the cost of a generic execution plan Pi at an arbitrary query location q in the ESS is denoted by cost(Pi , q).

With the above notation, the sub-optimality incurred due to using plan Popt(qe) at location qa is simply defined as the ratio:

SubOpt(qe, qa) = cost(Popt(qe), qa) / cost(Popt(qa), qa)
The worst-case SubOpt for a given qa is defined with respect to the qe that results in the maximum sub-optimality, that is, where modeling inaccuracies have the maximum adverse performance impact:

SubOpt(*, qa) = max_{qe ∈ ESS} SubOpt(qe, qa)
Now, the global worst-case situation is simply defined as the (qe, qa) error combination that results in the maximum value of SubOpt over the entire ESS, that is:

MSO = max_{qa ∈ ESS} SubOpt(*, qa)
As per this formulation, MSO values range over the interval [1, ∞).
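To make the metric concrete, the following Python sketch computes MSO by brute force over a discretized one-dimensional ESS, given the cost function of every plan. The two synthetic plan cost functions are illustrative stand-ins of our own for the optimizer-costed plans; in practice these curves come from the optimizer itself.

```python
def mso(plan_costs, ess):
    """plan_costs: dict plan_name -> cost function over selectivities.
       ess: iterable of selectivity locations q in the discretized ESS."""
    def popt(q):                      # the optimizer's choice if q were the truth
        return min(plan_costs, key=lambda p: plan_costs[p](q))
    worst = 1.0
    for qa in ess:                    # actual location
        opt_cost = plan_costs[popt(qa)](qa)
        for qe in ess:                # estimated location
            sub_opt = plan_costs[popt(qe)](qa) / opt_cost
            worst = max(worst, sub_opt)
    return worst

ess = [i / 200 for i in range(1, 201)]            # selectivities in (0, 1]
plans = {"P1": lambda q: 10 + 1000 * q,           # cheap start, steep growth
         "P2": lambda q: 200 + 50 * q}            # expensive start, flat growth
print(mso(plans, ess))   # worst-case sub-optimality over the grid
```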

2.5 RQP Problem Definition

Given the above framework, the problem of robust query processing is defined as follows: For a given input SQL query Q with its associated ESS, and the search space consisting of tuples < q , Popt(q), cost(Popt(q) , q) > covering all locations q ∈ ESS, develop a query processing approach that minimizes the MSO value for Q.

Unfortunately, the current situation is that, even with industrial-strength database systems, there are no inherent bounds on the MSO values. In fact, it has been empirically observed in [17] that MSO values can reach very large magnitudes in practice, often in the several tens to hundreds, and sometimes even as high as a million! However, as demonstrated in the following section, approaching classical problems with a fresh perspective can yield surprisingly potent results, and it is indeed possible to provide attractive MSO guarantees.

3 The PlanBouquet Approach to RQP [17]

In this section, we present a recently developed and radically different approach to the cardinality modeling problem. In our new approach, called PlanBouquet, the classical compile-time estimation process is completely eschewed for error-prone cardinalities. Instead, these cardinalities are systematically discovered at run-time through a calibrated sequence of cost-limited plan executions. That is, the cardinality estimation problem is side-stepped, rather than addressed head-on, by adopting a “seeing is believing” perspective on these values.

To enable uniform treatment of all predicates, we will hereafter substitute cardinality values with normalized equivalents called selectivities – that is, selectivity is defined as the ratio of the estimated output number of rows from an operator to the maximum feasible number of output rows. This formulation causes all selectivities to range over the [0, 1] interval, resulting in the ESS becoming an n-dimensional unit hypercube. Further, for ease of presentation, we will use the percentage form of selectivity in our discussion, i.e. selectivities range over [0%, 100%].

3.1 Plan Cost Behavior

An assumption that fundamentally underlies the entire PlanBouquet mechanism is that of Plan Cost Monotonicity (PCM) – that is, the costs of plans increase monotonically with increasing selectivity values. It captures the intuitive observation that when more data is processed by a plan, signified by larger selectivities, the cost of processing also increases. It has been empirically observed that this assumption generally holds for the plans generated by current database systems on decision support queries [22]. Apart from monotonicity, we also assume the cost functions to be continuous (smooth) throughout the ESS, again a commonplace feature in practice.

3.2 One-dimensional ESS

We introduce the PlanBouquet approach through a restricted one-dimensional version of the earlier example query (Figure 4), wherein only the p_retailprice < 1000 filter predicate is assumed to be error-prone. For this scenario, we first, through repeated invocations of the optimizer, identify the “parametric optimal set of plans” (POSP) [27] [28] that cover the entire selectivity range of the predicate. A sample outcome of this process is shown in Figure 6, wherein the POSP set is comprised of plans P1 through P5. Further, each plan is annotated with the selectivity range over which it is optimal – for instance, plan P3 is optimal in the (1.0%, 7.5%] interval. (In Figure 6, P = Part, L = Lineitem, O = Order, NL = Nested Loops Join, MJ = Sort Merge Join, and HJ = Hash Join).

The optimizer-computed costs of these POSP plans over the selectivity range are shown (on a log-log scale) in Figure 7. On this figure, we first construct the “POSP infimum curve” (PIC), defined as the trajectory of the minimum cost from among the POSP plans – this curve represents the ideal performance. The next step, which is a distinctive feature of our approach, is to discretize the PIC by projecting a graded progression of isocost (IC) steps onto the curve. For example, in Figure 7, the dotted horizontal lines represent a geometric progression of isocost steps, IC1 through IC7, with each step being double the preceding value. The intersection of each IC with the PIC (indicated by ■) provides an associated selectivity, along with the identity of the best POSP plan for this selectivity. For example, in Figure 7, the intersection of IC5 with the PIC corresponds to a selectivity of 0.65% with associated POSP plan P2. We term the subset of POSP plans that are associated with the intersections as the “plan bouquet” for the given query – in Figure 7, the bouquet consists of {P1, P2, P3, P5}.

The above exercise is carried out at query compilation time. Subsequently, at run-time, the correct query selectivities are implicitly discovered through a sequence of cost-limited executions of bouquet plans. Specifically, beginning with the cheapest cost step, we iteratively execute the bouquet plan assigned to each step until either:

1. The partial execution overheads exceed the step’s cost value – in this case, we know that the actual selectivity location lies beyond the current step, motivating a switch to the next step in the sequence; or
2. The current plan completes execution within the budget – in this case, we know that the actual selectivity location has been reached, and a plan that is at least 2-optimal with respect to the ideal choice was used for the final execution.

Example  To make the above process concrete, consider the case where the selectivity of p_retailprice is 5%. Here, we begin by partially executing plan P1 until the execution overheads reach IC1 (1.2E4 | 0.015%). Then, we extend our cost horizon to IC2, and continue executing P1 until the overheads reach IC2 (2.4E4 | 0.03%), and so on until the overheads reach IC4 (9.6E4 | 0.2%). At this juncture, there is a change of plan to P2 as we look ahead to IC5 (1.9E5 | 0.65%), and during this switching all the intermediate results (if any) produced thus far by P1 are discarded. The new plan P2 is executed until the associated overhead limit (1.9E5) is reached. The cost horizon is now extended to IC6 (3.8E5 | 6.5%), in the process discarding P2’s intermediate results and executing P3 instead. The execution in this case will complete before the cost limit is reached since the actual location, 5%, is less than the selectivity limit of IC6. Viewed in toto, the net sub-optimality turns out to be 1.78 since the exploratory overheads are 0.78 times the optimal cost, and the optimal plan itself was (coincidentally) employed for the final execution.
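The following Python sketch simulates this one-dimensional bouquet execution, under the simplifying assumption that a plan run with cost budget B either completes (if its true cost at the actual selectivity is at most B) or is abandoned after spending exactly B. The cost budgets follow the doubling progression, but the plan-to-step assignment and the true plan costs are synthetic values of our own, not the actual TPC-H numbers behind Figures 6–8, so the printed sub-optimality will not reproduce the 1.78 of the example above.

```python
def bouquet_execute(budgets, step_plans, true_cost):
    """budgets: IC cost budgets in increasing order (geometric progression).
       step_plans: bouquet plan assigned to each step.
       true_cost: dict plan -> true execution cost at the actual location qa."""
    spent = 0.0
    for budget, plan in zip(budgets, step_plans):
        if true_cost[plan] <= budget:      # plan finishes within this budget
            return spent + true_cost[plan]
        spent += budget                    # budget exhausted; discard and move on
    raise RuntimeError("the last budget should suffice by construction")

budgets    = [1.2e4 * 2 ** k for k in range(7)]              # doubling IC steps
step_plans = ["P1", "P1", "P1", "P1", "P2", "P3", "P5"]      # plausible Figure 7 assignment
true_costs = {"P1": 9.9e5, "P2": 6.0e5, "P3": 3.0e5, "P5": 5.0e5}   # synthetic costs at qa
total = bouquet_execute(budgets, step_plans, true_costs)
# total effort, and its sub-optimality relative to P3 (here the optimal plan at qa)
print(total, total / true_costs["P3"])
```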

3.2.1 Performance Characteristics

At first glance, the plan bouquet approach, as described above, may appear to be utterly absurd and self-defeating because: (a) At compile time, considerable preprocessing may be required to identify the POSP plan set and the associated PIC; and (b) At run-time, the overheads may be hugely expensive since there are multiple plan executions for a single query – in the worst scenario, as many plans as are present in the bouquet!

However, we will now make the case that it is indeed possible, through careful design, to have plan bouquets efficiently provide robustness profiles that are markedly superior to the native optimizer’s profile. Specifically, the bouquet mechanism delivers substantial improvements with regard to MSO, the robustness metric defined in Section 2.

For instance, the runtime performance of the bouquet technique on the example query is profiled in Figure 8 (dark blue curve). We observe that its performance is much closer to the PIC (dark green) as compared to the worst-case profile for the native optimizer (dark red), which is comprised of the supremum of the individual plan profiles. In fact, the MSO for the bouquet is only 3.6 (at 6.5%), whereas the native optimizer suffers a sub-optimality of around 100 when P5, which is optimal for large selectivities, is mistakenly chosen to execute a query with a small selectivity of only 0.01%.

3.2.2 MSO Guarantees

Our motivation for the cost-based discretization of the PIC is that it leads to guaranteed bounds on the MSO value. For instance, we now prove that the cost-doubling strategy used in the 1D example results in an MSO upper-bound of 4 – this bound is inclusive of all exploratory overheads incurred by the partial executions, and is irrespective of the query’s actual selectivity. In fact, we can go further to show that 4 is the best guarantee achievable by any deterministic algorithm.

By virtue of our assumptions on plan cost behavior, the PIC turns out to be a monotonically increasing and continuous function throughout the ESS; its minimum and maximum costs are denoted by Cmin and Cmax, respectively. This PIC is discretized by projecting a graded progression of cost steps onto the curve. Specifically, consider the case wherein the steps are organized in a geometric progression with initial value a (a > 0) and common ratio r (r > 1), such that the cost budget of the kth step is cost(ICk) = a · r^(k-1), and the final step ICm is the first step whose cost reaches or exceeds Cmax (as depicted in Figure 9).

Figure 9: 1D selectivity space

For 1 ≤ k ≤ m, denote the selectivity location where the kth cost step (ICk) intersects the PIC by qk, and the corresponding bouquet plan as Pk. All the qk locations are unique, by definition, due to the monotonicity and continuity features of the PIC. Finally, for mathematical convenience, assign q0 to be 0.

With this framework, the bouquet execution algorithm, outlined in Algorithm 1, operates as follows in the most general case, where a different plan is associated with each step: We start with plan P1 and budget cost(IC1), progressively working our way up through the successive bouquet plans P2 , P3, … until we reach the first plan Pk that is able to fully execute the query within its assigned budget cost(ICk). It is easy to see that the following lemma holds:

Lemma 3.1. If qa resides in the range (qk-1 , qk], 1 ≤ k ≤ m, then plan Pk executes it to completion in the bouquet algorithm.

Proof. We prove by contradiction: If qa was located in the region (qk, qk+1], then Pk could not have completed the query due to the PCM restriction. Conversely, if qa was located in (qk-2, qk-1], Pk-1 itself would have successfully executed the query to completion. With similar reasoning, we can prove the same for the remaining regions that are beyond qk+1 or before qk-2.

Performance Bounds  Consider the generic case where qa lies in the range (qk-1, qk]. Based on Lemma 3.1, the associated worst-case cost of the bouquet execution algorithm is the sum of the budgets of all the steps up to and including ICk, leading to the following expression:

SubOpt(*, qa) = [cost(IC1) + cost(IC2) + ... + cost(ICk)] / cost(Popt(qa), qa)
             ≤ [a + a·r + ... + a·r^(k-1)] / [a·r^(k-2)]
             = [r^k − 1] / [(r − 1)·r^(k-2)]
             < r^2 / (r − 1)

where the second step uses the fact that, since qa lies beyond qk-1, the oracular cost cost(Popt(qa), qa) is at least cost(ICk-1) = a·r^(k-2), and the “*” in the qe location indicates that, unlike conventional optimizers, the PlanBouquet algorithm does not make any such estimations. Further, note that the final expression is independent of k, and hence of the specific location of qa. Therefore, we can state, for the entire selectivity space, that:

Theorem 3.2. Given a query Q with a one-dimensional ESS, and the associated PIC discretized with a geometric progression having common ratio r, the PlanBouquet execution algorithm ensures that MSO ≤ r²/(r − 1).

Further, the choice of r can be optimized to minimize this value – the RHS reaches its minimum at r = 2, for which the value of MSO is 4 (a short derivation is given below). It is important to note here that an MSO guarantee of 4 is impressive, given that conventional database systems are incapable of providing any guarantee at all! Moreover, the subsequent theorem (Theorem 3.3) shows that this guarantee is the best performance achievable by any deterministic online algorithm – leading us to conclude that the doubling-based discretization is the ideal solution.
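The optimization of the common ratio is an elementary calculus exercise, sketched here for completeness:

```latex
% Minimizing the MSO bound r^2/(r-1) over the common ratio r > 1
\[
\frac{d}{dr}\!\left(\frac{r^{2}}{r-1}\right)
   = \frac{2r(r-1) - r^{2}}{(r-1)^{2}}
   = \frac{r(r-2)}{(r-1)^{2}} = 0
   \;\Longrightarrow\; r = 2 \quad (\text{since } r > 1),
\qquad
\left.\frac{r^{2}}{r-1}\right|_{r=2} = 4 .
\]
```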

Theorem 3.3. Given a universe of cost-limited executions of POSP plans, no deterministic online algorithm can ensure MSO lower than 4 in the one-dimensional scenario.

Proof. We prove by contradiction, assuming there exists an optimal online robust algorithm, R*, with an MSO of f, f < 4. The proof is divided into two parts: First, we show that R* must be a monotonically increasing sequence of plan execution costs, [a1, a2, . . . , am]; and second, we demonstrate that achieving an MSO of less than 4 requires the ratio of cumulative costs for consecutive steps in the sequence to be strictly decreasing – however, this is fundamentally impossible and hence the contradiction.

(a) Assume that R* has cost sequence [a1, . . . , ai, aj , . . . ,am+1] which is sorted in increasing order except for the inversion caused by aj < ai.

Now, let us define a plan execution to be useful if its execution covers a hitherto uncovered region of the selectivity space. With this definition, an execution of aj after ai is clearly useless since no fresh selectivity ground is covered by this cheaper execution. A sample instance with reference to Figure 9 is executing P2, which covers the selectivity region (0, q2), after P3, which covers the region (0, q3) – this does not add any value since the latter subsumes the former.

In summary, an out-of-order execution sequence cannot provide any improvement over an ordered sequence, which is why aj can be safely discarded to give a completely sorted sequence [a1, . . . , ai, . . . , am].

(b) For the sorted execution sequence R*, denote the cumulative cost at each step by Aj = a1 + a2 + ... + aj, and the ratio between the cumulative costs of consecutive steps by Yj = Aj+1/Aj. Note that, by definition, Aj+1 > Aj.

Now, since R* has an MSO of f, the sub-optimality caused by each and every step should be at most f, that is,

Aj+1 / aj ≤ f    for all j,
and therefore

Aj+1 ≤ f · aj = f · (Aj − Aj−1).
After dividing both sides by Aj, we get

Yj ≤ f · (1 − 1/Yj−1).
Through elementary algebra, it is known that ∀z > 0, (1 − 1/z) ≤ z/4. Therefore, we get

Yj ≤ (f/4) · Yj−1.
Since f < 4, it implies that the sequence Yj is strictly decreasing with multiplicative factor f/4 < 1. With repeated applications of the same inequality, we obtain

Yj ≤ (f/4)^(j−1) · Y1.
For sufficiently large j, this results in

Yj < 1, that is, Aj+1 < Aj,
which is a contradiction to our earlier observation that Aj+1 > Aj.

3.3 Multi-dimensional ESS

When the above 1D approach is generalized to a multi-dimensional selectivity environment, the IC steps and the PIC curve become surfaces, and their intersections represent selectivity surfaces on which multiple bouquet plans may be present. For example, in the 2D case, the IC steps are horizontal planes cutting through a hollow three-dimensional PIC surface, typically resulting in hyperbolic intersection contours featuring a multitude of plans covering disjoint segments of the contours. A sample 2D scenario is shown in Figure 10a, wherein the isosurfaces ICk are contours that represent a continuous sequence of selectivity locations (in contrast to the single location in the 1D case). Further, multiple bouquet plans may be present on each contour, as shown for ICk, wherein four plans, Pk1, Pk2, Pk3 and Pk4, are the optimizer’s choices over disjoint (x, y) selectivity ranges on the contour.

Notwithstanding these changes, the basic mechanics of the bouquet algorithm remain virtually identical. The primary difference is that we jump from one isosurface to the next only after it is determined that none of the bouquet plans present on the current isosurface can completely execute the given query within the associated cost budget. This is because, in order to decide whether qa lies below or beyond ICk, in principle every plan on the ICk contour has to be executed – only if none complete, do we know that the actual location definitely lies beyond the contour.
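A minimal sketch of this contour-by-contour execution is given below, under the same simplifying assumption as in the 1D sketch (a plan run with budget B either completes when its true cost is at most B, or burns the full budget). The contour structure and costs are synthetic illustrations, not derived from Figure 10a.

```python
def bouquet_execute_md(contours, true_cost):
    """contours: list of (budget, [plans on that contour]), cheapest first.
       true_cost: dict plan -> true execution cost at the actual location qa."""
    spent = 0.0
    for budget, plans in contours:
        for plan in plans:
            if true_cost[plan] <= budget:      # qa lies under the current contour
                return spent + true_cost[plan]
            spent += budget                    # abandoned execution; try next plan
        # none of the contour's plans completed: qa lies beyond this contour
    raise RuntimeError("the final contour should cover the entire ESS")

contours = [(1e4, ["P1"]),
            (2e4, ["P1", "P2"]),
            (4e4, ["P2", "P3", "P4"])]
print(bouquet_execute_md(contours, {"P1": 9e4, "P2": 3.5e4, "P3": 5e4, "P4": 8e4}))
```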




4 Future Research Directions

In the previous section, we demonstrated the possibility of achieving progress in robust query processing by re-examining the fundamentals of the exercise. We now move on to outlining, in the final stage of this article, a sample set of future research directions. These open problems require leveraging of ideas across diverse branches of computer science, and their solutions will represent significant steps towards the ultimate quest for robust query processing.

Handling Database Scaleup: While PlanBouquet is inherently robust to changes in data distribution,
since these changes only shift the location of qa in the ESS, the same is not true with regard to database scale-up. That is, if the database size increases significantly, then the original ESS no longer covers the entire error space. An obvious solution to handle this problem is to recompute the bouquet from scratch, but most of the processing may turn out to be redundant. Therefore, the development of incremental and efficient techniques for bouquet maintenance is essential for catering to dynamic databases.

Query Graph-sensitive Robustness: The robustness techniques that have been developed in the literature thus far are largely agnostic to the specifics of the join graph of the query under consideration – that is, to whether the graph is a chain, cycle, star, clique, etc. with regard to its structure. It appears quite logical to expect that this information could be gainfully used to improve the robustness of the resulting execution – for instance, it should be simpler to assure good performance for chain queries, where the optimization choices are limited, as opposed to star queries.

Geometries of Plan Cost Functions: Most of the prior work has only assumed that plan cost functions are monotonic with regard to selectivities. However, in practice, the functions often exhibit greater regularity in their behavior – for example, the Bounded Cost Growth (BCG) property, recently leveraged in [18] to ensure bounded sub-optimalities of the Parametric Query Optimization choices. In BCG, the relative increase of plan costs is modeled as a low-order polynomial function of the relative increase in plan selectivities. A stronger constraint that has also been found to be generally true in practice is to assume concavity in cost functions, wherein they exhibit monotonically non-increasing slopes. Such constraints can be utilized to improve the robustness of the query processing solutions.

Dimensionality Reduction of the ESS: An important question is what predicates constitute the ESS in the PlanBouquet framework. A simple conservative option would be to assign all predicates (filter, join, projection) to be uncertain, but this would needlessly increase the MSO guarantee. Therefore, an interesting exercise would be to use a combination of domain knowledge, query logs, cost behavior and machine learning techniques to reduce the number of dimensions to the bare minimum required in the ESS.

Portable MSO Guarantees: The PlanBouquet formulation, while breaking new ground, suffers from a systemic drawback – the specific value of ρ (the number of bouquet plans on the isocost contours), and therefore the bound, is a function of not only the query, but also the optimizer’s behavioral profile over the underlying database platform (including data contents, physical schema, hardware configuration, etc.). As a result, there are adverse consequences: (i) The bound value becomes highly variable, depending on the specifics of the current operating environment; (ii) It becomes infeasible to compute the value without substantial investments in preprocessing overheads; and (iii) Ensuring a bound that is small enough to be of practical value is contingent on the heuristic of “anorexic reduction” [24] holding true.

In a recent work [31] , we have made initial progress on addressing the above limitations. Specifically, we have presented an algorithm called SpillBound, which achieves portable MSO guarantees. That is, its MSO bound is only a function of D, the dimensionality of the ESS. However, the function is quadratic in D, which means that MSO values rapidly increase with increasing dimensionality. Therefore, an interesting future problem is to investigate whether it would be feasible to provide portable MSO bounds that are linear in D.

Machine Learning Techniques for Component Selection: We advocate that the database engine should have a multiplicity of components that separately but cooperatively cater to the various query processing environments. An obvious question that arises with such an architecture is how to determine the specific environment that is currently operational, and hence the associated component. We are of the view that machine learning techniques could be used to judiciously make this choice, similar to the exercise recently carried out in [26] in the context of analytical data flows.

Graceful Performance Degradation: A major problem faced in real deployments is the presence of “performance cliffs”, where the performance suddenly degrades precipitously although there has only been a minor change in the operational environment. This is particularly true with regard to hardware resources, such as memory. So, an important future challenge is to design algorithms that provably degrade gracefully with regard to all their performance related parameters.

In closing, we would like to use this platform to reiterate that there are a host of conceptually challenging and eminently practical problems to be addressed in the database query processing domain – it is our earnest hope that this article would encourage at least a few readers, especially young students, to investigate further and take the field forward.

Bibliography

[1] M. Abhirama, S. Bhaumik, A. Dey, H. Shrimal and J. Haritsa, “On the Stability of Plan Costs and the Costs of Plan Stability”, PVLDB Journal, 3(1), 2010.
[2] A. Aboulnaga and S. Chaudhuri, “Self-tuning Histograms: Building Histograms Without Looking at Data”, Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 1999.
[3] M. Akdere, U. Cetintemel, M. Riondato, E. Upfal and S. Zdonik, “Learning-based query performance modeling and prediction”, Proc. of IEEE Intl. Conf. on Data Engineering (ICDE), 2012.
[4] R. Avnur and J. Hellerstein, “Eddies: Continuously Adaptive Query Processing”, Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 2000.
[5] B. Babcock and S. Chaudhuri, “Towards a Robust Query Optimizer: A Principled and Practical Approach”, Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 2005.
[6] S. Babu, P. Bizarro, and D. DeWitt, “Proactive re-optimization”, ACM SIGMOD Intl. Conf. on Management of Data, 2005.
[7] R. Borovica-Gajic, S. Idreos, A. Ailamaki, M. Zukowski and C. Fraser, “Smooth Scan: Statistics-oblivious access paths”, Proc. of IEEE Intl. Conf. on Data Engineering (ICDE), 2015.
[8] S. Chaudhuri, “An Overview of Query Optimization in Relational Systems”, Proc. of ACM Symp. on Principles of Database Systems (PODS), 1998.
[9] S. Chaudhuri, “Query Optimizers: Time to rethink the contract?”, Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 2009.
[10] S. Chaudhuri, Interview in ACM XRDS, 19(1), 2012.
[11] S. Chen, P. Gibbons, and S. Nath, “Rethinking Database Algorithms for Phase Change Memory”, Proc. of 5th Biennial Conf. on Innovative Data Systems Research (CIDR), 2011.
[12] F. Chu, J. Halpern and P. Seshadri, “Least Expected Cost Query Optimization: An Exercise in Utility”, Proc. of ACM Symp. on Principles of Database Systems (PODS), 1999.
[13] F. Chu, J. Halpern and J. Gehrke, “Least Expected Cost Query Optimization: What can we expect?”, Proc. of ACM Symp. on Principles of Database Systems (PODS), 2002.
[14] E. Codd, “A Relational Model of Data for Large Shared Data Banks”, Comm. of the ACM , 13 (6), 1970.
[15] A. Deshpande, Z. Ives and V. Raman, “Adaptive Query Processing”, Foundations and Trends in Databases, Now Publishers, 1 (1), 2007.
[16] D. DeWitt, Interview in ACM Sigmod Record, 31(2), 2002.
[17] A. Dutt and J. Haritsa, “Plan Bouquets: A Fragrant Approach to Robust Query Processing”, ACM Trans. on Database Systems (TODS), 41(2), 2016.
[18] A. Dutt, V. Narasayya and S. Chaudhuri, “Leveraging re-costing for online optimization of parameterized queries with guarantees”, Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 2016.
[19] G. Graefe, “Query evaluation techniques for large databases”, ACM Computing Surveys, 25(2), 1993.
[20] J. Gray, “The Transaction Concept: Virtues and Limitations”, Proc. of 7th Intl. Conf. on Very Large Data Bases (VLDB), 1981.
[21] G. Graefe, “New algorithms for join and grouping operations”, Computer Science – R&D, 27(1), 2012.
[22] J. Haritsa, “The Picasso Database Query Optimizer Visualizer”, PVLDB Journal, 3(2), 2010.
[23] J. Haritsa, “Plan Diagrams: Visualizing Database Query Optimizers”, Annals of Indian National Academy of Engineering (INAE), Volume VIII, 2011.
[24] D. Harish, P. Darera and J. Haritsa, “On the Production of Anorexic Plan Diagrams”, Proc. of 33rd Intl. Conf. on Very Large Data Bases (VLDB), 2007.
[25] D. Harish, P. Darera and J. Haritsa, “Identifying Robust Plans through Plan Diagram Reduction”, PVLDB Journal, 1(1), 2008.
[26] F. Hueske, “Specification and Optimization of Analytical Data Flows”, PhD Thesis, 2016. depositonce.tu-berlin.de/bitstream/11303/5482/4/hueske_fabian.pdf
[27] A. Hulgeri and S. Sudarshan, “Parametric Query Optimization for Linear and Piecewise Linear Cost Functions”, Proc. of 28th Intl. Conf. on Very Large Data Bases (VLDB), 2002.
[28] A. Hulgeri and S. Sudarshan, “AniPQO: Almost Non-intrusive Parametric Query Optimization for Nonlinear Cost Functions”, Proc. of 29th Intl. Conf. on Very Large Data Bases (VLDB), 2003.
[29] Y. Ioannidis and S. Christodoulakis, “On the propagation of errors in the size of join results”, Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 1991.
[30] N. Kabra and D. DeWitt, “Efficient Mid-query Re-optimization of Sub-optimal Query Execution Plans”, Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 1998.
[31] S. Karthik, J. Haritsa, S. Kenkre, V. Pandit and L. Krishnan, “Platform-independent Robust Query Processing”, IEEE Trans. on Knowledge and Data Engineering (TKDE), 2017.
[32] V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper and T. Neumann, “How Good are Query Optimizers, Really?”, PVLDB Journal, 9(3), 2015.
[33] V. Leis, B. Radke, A. Gubichev, A. Kemper and T. Neumann, “Cardinality Estimation Done Right: Index-based Join Sampling”, Proc. of Conf. on Innovative Data Systems Research (CIDR), 2017.
[34] G. Lohman, “Is Query Optimization a ‘Solved’ Problem?”, wp.sigmod.org/?p=1075.
[35] L. Mackert and G. Lohman, “R* Optimizer Validation and Performance Evaluation for Local Queries”, Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 1986.
[36] T. Malik, R. Burns and N. Chawla, “A Black-Box Approach to Query Cardinality Estimation”, Proc. of Conf. on Innovative Data Systems Research (CIDR), 2007.
[37] V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh, and M. Cilimdzic, “Robust query processing through progressive optimization”, Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 2004.
[38] G. Moerkotte, T. Neumann and G. Steidl, “Preventing Bad Plans by Bounding the Impact of Cardinality Estimation Errors”, PVLDB Journal, 2(1), 2009.
[39] T. Neumann and C. Galindo-Legaria, “Taking the Edge off Cardinality Estimation Errors using Incremental Execution”, Proc. of BTW Conf., 2013.
[40] N. Reddy and J. Haritsa, “Analyzing Plan Diagrams of Database Query Optimizers”, Proc. of 31st Intl. Conf. on Very Large Data Bases (VLDB), 2005.
[41] W. Rodiger, S. Idicula, A. Kemper and T. Neumann, “Flow-join: Adaptive skew handling for distributed joins over high-speed networks”, Proc. of IEEE Intl. Conf. on Data Engineering (ICDE), 2016.
[42] P. Griffiths Selinger, M. Astrahan, D. Chamberlin, R. Lorie and T. Price, “Access Path Selection in a Relational Database Management System”, Proc. of ACM SIGMOD Intl. Conf. on Management of Data, 1979.
[43] M. Stillger, G. Lohman, V. Markl and M. Kandil, “LEO – DB2’s LEarning Optimizer”, Proc. of 27th Intl. Conf. on Very Large Data Bases (VLDB), 2001.
[44] K. Tzoumas, A. Deshpande and C. Jensen, “Lightweight graphical models for selectivity estimation without independence assumptions”, PVLDB Journal, 4(11), 2011.
[45] K. Tzoumas, A. Deshpande and C. Jensen, “Efficiently adapting graphical models for selectivity estimation”, VLDB Journal, 22(1), 2013.
[46] W. Wu, Y. Chi, H. Hacigumus and J. Naughton, “Towards predicting query execution time for concurrent and dynamic database workloads”, PVLDB Journal, 6(10), 2013.
[47] W. Wu, Y. Chi, S. Zhu, J. Tatemura, H. Hacigumus and J. Naughton, “Predicting query execution time: Are optimizer cost models really unusable?”, Proc. of IEEE Intl. Conf. on Data Engineering (ICDE), 2012.
[48] W. Wu, X. Wu, H. Hacigumus and J. Naughton, “Uncertainty Aware Query Execution Time Prediction”, PVLDB Journal, 7(14), 2014.
[49] Dagstuhl Seminar on Robust Query Processing, 2010.
www.dagstuhl.de/en/program/calendar/semhp/?semnr=10381
[50] Dagstuhl Seminar on Robust Query Processing, 2012.
www.dagstuhl.de/en/program/calendar/semhp/?semnr=12321
[51] Dagstuhl Seminar on Robust Query Processing, 2017.
www.dagstuhl.de/en/program/calendar/semhp/?semnr=17222
[52] Database Systems Lab, IISc. dsl.cds.iisc.ac.in
[53] DB2. www.ibm.com/db2
[54] MySQL. www.mysql.com
[55] Oracle. www.oracle.com/technology/products/database/oracle11gbitem
[56] PostgreSQL. www.postgresql.org
[57] SQL. en.wikipedia.org/wiki/SQL:2008
[58] SQLServer. www.microsoft.com/sqlserver/2008/
[59] TPC-H Benchmark. www.tpc.org/tpch
[60] TPC-DS Benchmark. www.tpc.org/tpcds