Query Optimization

We want queries to be both correct and fast.

To make it easier to reason about queries, we split query processing into a planning stage (e.g. compiling SQL into a plan) and an optimisation stage: the first is responsible for correctness, the second for performance.

Query optimisation is enabled by the algebraic properties of plans: the plan algebra is closed and easily composable, so plans can be rewritten into other plans. The hard part is ensuring the rewritten plans are semantically equivalent.

1. Peephole Optimisation

Since rewriting a subplan into an equivalent subplan yields an equivalent whole plan, we can apply local transformations to small parts of the plan. To do this, we need to identify patterns in the plan tree and apply rewrite rules.

To apply peephole optimisations (see the sketch after this list):

  1. Traverse the plan from the root (in any order).
  2. For every traversed node, check whether a pattern matches.
  3. If so, replace the node with its rewrite and start again at the root.
  4. Repeat until no more patterns match.
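Below is a minimal sketch of this loop in Python; the Node class and the (matches, rewrite) rule representation are illustrative assumptions, not any particular engine's API.

from dataclasses import dataclass, field


@dataclass
class Node:
    op: str                      # e.g. "Select", "Join", "Scan"
    children: list = field(default_factory=list)
    args: dict = field(default_factory=dict)


def rewrite_once(node, rules):
    """Try every rule at this node, then recurse into children.
    Returns (new_node, True) after the first successful rewrite."""
    for matches, rewrite in rules:
        if matches(node):
            return rewrite(node), True
    for i, child in enumerate(node.children):
        new_child, changed = rewrite_once(child, rules)
        if changed:
            node.children[i] = new_child
            return node, True
    return node, False


def optimise(plan, rules, max_steps=1000):
    """Apply rules until a fixed point; max_steps guards against
    cyclic rule sets that would otherwise loop forever."""
    for _ in range(max_steps):
        plan, changed = rewrite_once(plan, rules)
        if not changed:
            return plan
    raise RuntimeError("rule set did not converge (possible cycle)")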

To decide on the rules and when they should be applied, there are four main strategies, classified along two axes: whether we take note of the operator implementation (physical, i.e. algorithm aware, vs logical, i.e. algorithm agnostic) and whether we take note of the data characteristics (cost based, i.e. data aware, vs rule based, i.e. data agnostic).

             Logical                            Physical
Rule Based   Algorithm Agnostic, Data Agnostic  Algorithm Aware, Data Agnostic
Cost Based   Algorithm Agnostic, Data Aware     Algorithm Aware, Data Aware

Although peephole optimisation is simple, fast, verifiable and composable, it is only locally optimal and may miss global optimisations. It also has the potential for infinite optimisation loops.

2. Rule Based, Logical Optimisation

These do not take into account the data or the algorithm, only the structure of the plan. They are heavily heuristics based, almost universally beneficial, and very portable (implementation independent and robust).

We can use the fact that some relational operators are associative and commutative (joins, unions), that selections commute with each other, and that some are distributive (selections over joins, selections over unions and differences) to swap their order. Additionally, we can use the following cost heuristics:

Rule: Selection Pushdown

Selections can be pushed through joins if they only refer to attributes from one side of the join.

This is usually a good optimisation because selections are pipelineable.
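As an illustration, selection pushdown can be written as a rule for the driver sketched above; the Cond class and the attrs helper are assumptions about how predicates and schemas might be represented.

from dataclasses import dataclass


@dataclass
class Cond:
    attributes: set              # attributes the predicate refers to


def attrs(plan):
    """Attributes produced by a subtree: scans declare their columns,
    every other operator passes through the union of its children."""
    if plan.op == "Scan":
        return set(plan.args["columns"])
    return set().union(*(attrs(c) for c in plan.children))


def matches_pushdown(node):
    """Select directly above a Join, predicate touching one side only."""
    if node.op != "Select" or node.children[0].op != "Join":
        return False
    left, right = node.children[0].children
    cond = node.args["cond"]
    return cond.attributes <= attrs(left) or cond.attributes <= attrs(right)


def rewrite_pushdown(node):
    """Move the Select below the Join, onto the side it refers to."""
    join = node.children[0]
    cond = node.args["cond"]
    side = 0 if cond.attributes <= attrs(join.children[0]) else 1
    join.children[side] = Node("Select", [join.children[side]], {"cond": cond})
    return join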

Plans must be totally ordered by the rules: the peephole optimisation rule graph cannot have cycles, as otherwise we will never reach a stable, optimised plan. For example, unconditional selection reordering has a cycle; we can keep applying it forever:

Select[Select[input, cond1], cond2] -> Select[Select[input, cond2], cond1]

To fix this, we can impose an ordering on conditions:

Select[Select[input, cond1], cond2]
  /; (cond1.cmp == '>' and cond2.cmp == '=')
  -> Select[Select[input, cond2], cond1]
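In the Python sketch, the /; guard becomes an explicit check in the match function (cond.cmp is an assumed field naming the comparison operator):

def matches_ordered_swap(node):
    """Only fire when the conditions are out of order, so the rule
    can no longer undo its own work."""
    if node.op != "Select" or node.children[0].op != "Select":
        return False
    outer, inner = node.args["cond"], node.children[0].args["cond"]
    return inner.cmp == ">" and outer.cmp == "="   # the /; guard


def rewrite_ordered_swap(node):
    """Rebuild with the '=' selection innermost:
    Select[Select[src, cond2], cond1]."""
    inner_select = node.children[0]
    source = inner_select.children[0]
    return Node("Select",
                [Node("Select", [source], {"cond": node.args["cond"]})],
                {"cond": inner_select.args["cond"]})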

However, we also need to be careful about the order in which we apply optimisations, as this can affect the final plan: rule based rewriting is not guaranteed to be confluent, so different rule orderings may produce different results.

3. Cost Based, Logical Optimisation

To order plans by quality, we define a cost metric (e.g. the sum of the number of tuples produced by each operator). However, we don't know cardinalities before running the query, so we need to estimate them using statistics about the data. For example, assuming a uniform distribution of values, we can estimate the cardinality of an equality selection as $|\sigma_{A=v}(R)| \approx |R| / V(R, A)$, where $V(R, A)$ is the number of distinct values of attribute $A$. This may not be very accurate, so we could instead keep a histogram of value distributions as a statistic in the catalogue.
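A sketch of both estimates, assuming the catalogue exposes a row count, a distinct-value count and an optional per-value histogram (all field names here are illustrative):

from dataclasses import dataclass, field


@dataclass
class ColumnStats:
    row_count: int               # |R|
    distinct: int                # V(R, A): number of distinct values
    histogram: dict = field(default_factory=dict)  # value -> count


def estimate_equality(stats, value):
    """Prefer the histogram when it covers the value; otherwise fall
    back to the uniform-distribution estimate |R| / V(R, A)."""
    if value in stats.histogram:
        return stats.histogram[value]
    return stats.row_count / stats.distinct


# Example: 10,000 orders with 4 statuses, but a skewed distribution.
status = ColumnStats(10_000, 4, {"pending": 6_000, "failed": 50})
print(estimate_equality(status, "pending"))   # 6000 (histogram)
print(estimate_equality(status, "shipped"))   # 2500 (uniform fallback)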

We may also want to keep track of attribute correlation (e.g. most urgent orders are pending). To do this, we use a multidimensional histogram $h$ over attribute combinations. Now, for example, a first selection's cardinality is $|\sigma_{A=a}(R)| \approx \sum_b h(a, b)$, and a second selection's cardinality is $|\sigma_{B=b}(\sigma_{A=a}(R))| \approx h(a, b)$, the joint bucket count, rather than the product of independent selectivities. However, the number of histograms grows combinatorially with the number of attribute combinations.
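A small illustration with a hypothetical two-dimensional histogram over (urgency, status): reading the joint bucket captures the correlation that multiplying independent selectivities would miss.

# (urgency, status) -> row count; 10,000 rows in total.
joint = {
    ("urgent", "pending"):  900,
    ("urgent", "shipped"):  100,
    ("normal", "pending"): 1000,
    ("normal", "shipped"): 8000,
}
total = sum(joint.values())

# First selection: urgency = 'urgent' -> sum over matching buckets.
urgent = sum(n for (u, s), n in joint.items() if u == "urgent")   # 1000

# Second selection: status = 'pending' on top of the first
# -> read the joint bucket directly.
urgent_pending = joint[("urgent", "pending")]                     # 900

# Assuming independence instead would badly underestimate this:
pending = sum(n for (u, s), n in joint.items() if s == "pending")
independent = total * (urgent / total) * (pending / total)        # 190
print(urgent_pending, round(independent))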

4. Rule Based, Physical Optimisation

This usually comes with more rules, often taking hardware into account (e.g. CPU vs IO costs). With more indices, we have more rules (e.g. B-trees vs hashes for equalities, B+-trees vs bitmaps for ranges).
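As a sketch, such a physical rule might pick an access path from the predicate shape; the index catalogue, index kinds and operator names here are illustrative:

# Stand-in index catalogue: (table, attribute) -> index kind.
INDEXES = {("orders", "id"): "hash", ("orders", "created"): "btree"}


def pick_access_path(table, cond):
    """Hash indexes serve equalities; B-tree indexes serve both
    equalities and ranges. Fall back to a full scan otherwise.
    cond.attribute and cond.cmp are assumed predicate fields."""
    kind = INDEXES.get((table, cond.attribute))
    if kind == "hash" and cond.cmp == "=":
        return "HashIndexLookup"
    if kind == "btree" and cond.cmp in ("=", "<", ">", "<=", ">="):
        return "BTreeIndexScan"
    return "FullScan"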

5. Cost Based, Physical Optimisation

This is likely the most advanced optimisation strategy, and it is strongly dependent on the metric used (e.g. number of function calls, CPU cost, page faults, max(IO, CPU), etc.). For example, a nested loop join requires less space and does not require hashing, but induces more comparisons.
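A sketch comparing the two under a simple page-IO metric; the formulas are the classic textbook estimates (page-nested loop, Grace hash join) and the inputs are illustrative:

def nested_loop_cost(pages_r, pages_s):
    """Page-nested loop: scan S once per page of R.
    Needs almost no memory and no hashing, but many comparisons."""
    return pages_r + pages_r * pages_s


def hash_join_cost(pages_r, pages_s):
    """Grace hash join: partition both inputs, then probe
    (write and re-read each page once: roughly 3 passes)."""
    return 3 * (pages_r + pages_s)


def choose_join(pages_r, pages_s):
    """Pick the implementation with the lower estimated cost."""
    costs = {
        "nested_loop": nested_loop_cost(pages_r, pages_s),
        "hash_join": hash_join_cost(pages_r, pages_s),
    }
    return min(costs, key=costs.get), costs


print(choose_join(2, 1000))     # tiny outer input: nested loop wins
print(choose_join(1000, 1000))  # large inputs: hash join wins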
