Query Optimization
We want queries to be:
- Correct - captured with a boolean. However, this is hard to prove in the general case.
- Performant - captured with a cost metric.
To make it easier to reason about queries, we split query processing into a planning stage (e.g. compiling SQL into a plan) and an optimisation stage: the first is responsible for correctness, the second for performance.
Query optimisation is enabled by the algebraic properties of plans: the plan algebra is closed and easily composable. However, it is hard to guarantee that rewritten plans remain semantically equivalent.
1. Peephole Optimisation
Since transforming a subplan yields a valid transformation of the whole plan, we can apply local transformations to small parts of the plan. To do this, we need to identify patterns in the plan tree and apply rewrite rules.
To apply peephole optimisations (a minimal sketch follows the list below):
- Traverse the plan from the root (in any order).
- For every traversed node, see if a pattern matches.
- If so, replace with rewrite and start again at root.
- Repeat until no more matches.
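A minimal sketch of this loop in Python, assuming toy plan node classes (Scan, Select, Join) and a single illustrative rule that merges stacked selections; a real optimiser would carry many such rules:

```python
# Toy peephole optimiser over a plan tree (hypothetical node classes and rule).
from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Select:
    child: object
    cond: str

@dataclass
class Join:
    left: object
    right: object
    cond: str

def merge_selects(node):
    """Example rewrite rule: collapse Select(Select(x, c1), c2) into one Select."""
    if isinstance(node, Select) and isinstance(node.child, Select):
        inner = node.child
        return Select(inner.child, f"({inner.cond}) AND ({node.cond})")
    return None  # pattern did not match

RULES = [merge_selects]

def children(node):
    if isinstance(node, Select):
        return [node.child]
    if isinstance(node, Join):
        return [node.left, node.right]
    return []

def rewrite_once(node):
    """Try every rule at this node, then recurse into the children.
    Returns a rewritten plan, or None if no pattern matched anywhere."""
    for rule in RULES:
        new = rule(node)
        if new is not None:
            return new
    for i, child in enumerate(children(node)):
        new_child = rewrite_once(child)
        if new_child is not None:
            if isinstance(node, Select):
                return Select(new_child, node.cond)
            parts = [node.left, node.right]
            parts[i] = new_child
            return Join(parts[0], parts[1], node.cond)
    return None

def optimise(plan):
    """Apply rules until a fixpoint: restart from the root after every rewrite."""
    while True:
        new = rewrite_once(plan)
        if new is None:
            return plan
        plan = new

# optimise(Select(Select(Scan("orders"), "a > 1"), "b = 2"))
# -> Select(Scan("orders"), "(a > 1) AND (b = 2)")
```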
To decide which rules to use and when to apply them, there are four main strategies, classified along two axes: whether we take operator implementations into account (physical = algorithm aware, logical = algorithm agnostic) and whether we take data characteristics into account (cost based = data aware, rule based = data agnostic).
| | Logical | Physical |
|---|---|---|
| Rule Based | Algorithm Agnostic, Data Agnostic | Algorithm Aware, Data Agnostic |
| Cost Based | Algorithm Agnostic, Data Aware | Algorithm Aware, Data Aware |
Although peephole optimisation is simple, fast, verifiable and composable, it is only locally optimal and may miss global optimisations. Additionally, it can fall into infinite optimisation loops.
2. Rule Based, Logical Optimisation
These do not take into account the data or the algorithm, only the structure of the plan. They are heavily heuristics based, almost universally beneficial, and very portable (implementation independent and robust).
We can use the fact that some relational operators are associative and commutative (joins, unions), that selections commute with one another, and that selections distribute over joins and over unions, to reorder operators. Additionally, we can use the following cost heuristics:
- Cost grows with cardinality (generally true).
- Joins tend to increase cardinality (or leave unchanged).
- Selections tend to decrease cardinality (or leave unchanged).
- Aggregations tend to decrease cardinality (or leave unchanged).
- Data access is more expensive than function evaluation.
Rule: Selection Pushdown
Selections can be pushed through joins if they only refer to attributes from one side of the join.
This is usually a good optimisation because selections are pipelineable.
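In relational-algebra terms (standard notation; the predicate $p$ here mentions only attributes of $R$), the rule is the equivalence:

$$\sigma_p(R \bowtie S) \;\equiv\; \sigma_p(R) \bowtie S$$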
Plans must be totally ordered by the rewrite rules: the peephole optimisation rule graph cannot have cycles, otherwise we will never reach a stable, optimised plan. For example, selection reordering on its own has a cycle, since we can keep applying it:
Select[Select[input, cond1], cond2] -> Select[Select[input, cond2], cond1]
To fix this, we can impose an ordering on conditions:
Select[Select[input, cond1], cond2] /; (cond1.cmp == '>' and cond2.cmp == '=') -> Select[Select[input, cond2], cond1]
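A minimal sketch of such a guarded rule in Python, assuming a toy condition representation with a cmp field and an arbitrary canonical ranking of comparators (both are assumptions for illustration):

```python
# Hypothetical guard that breaks the selection-reordering cycle: two stacked
# selections are swapped only when the swap moves the comparators toward a
# fixed canonical order, so the rule can never undo its own work.
from dataclasses import dataclass

@dataclass
class Cond:
    cmp: str     # comparator, e.g. '=', '<', '>'
    text: str    # the condition itself (representation is assumed)

@dataclass
class Select:
    child: object
    cond: Cond

RANK = {'=': 0, '<': 1, '>': 2}   # assumed canonical order: equalities first

def swap_selects(node):
    if isinstance(node, Select) and isinstance(node.child, Select):
        inner = node.child
        if RANK[inner.cond.cmp] > RANK[node.cond.cmp]:
            # outer condition ranks earlier, so push it below the inner one
            return Select(Select(inner.child, node.cond), inner.cond)
    return None   # guard failed: rule does not fire, so no cycle is possible
```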
However, we also need to be careful about the order in which we apply optimisations, as this can affect the final plan. In general, rule based optimisation is:
- Unprincipled - no guarantee of optimality.
- Very Brittle - a missing rule can have devastating effects.
- Cycle-prone - contradicting rules often lead to infinite loops.
- Often wrong - it does not take data characteristics into account.
3. Cost Based, Logical Optimisation
To order plans by quality, we define a cost metric (e.g. the sum of the number of tuples produced by each operator). However, we don't know cardinalities before running the query, so we need to estimate them using statistics about the data. For example, assuming a uniform distribution, we can estimate the cardinality of an equality selection as:
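$$|\sigma_{A = c}(R)| \approx \frac{|R|}{V(R, A)}$$

where $|R|$ is the total number of tuples in $R$ and $V(R, A)$ is the number of distinct values of attribute $A$.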
We may also want to keep track of attribute correlation (e.g. most urgent orders are pending). To do this, we use a multidimensional histogram. Now, for example, a first selection's cardinality is the sum of the matching histogram buckets, and a follow-up selection on a correlated attribute is estimated from the joint bucket counts rather than by multiplying independent selectivities.
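A minimal sketch of this estimate in Python, using a toy 2-D histogram over the hypothetical attributes (status, urgency) of an orders table; all bucket counts are made up for illustration:

```python
# 2-D histogram: bucket counts per (status, urgency) pair (illustrative figures).
hist = {
    ('pending', 'urgent'): 900,
    ('pending', 'normal'): 100,
    ('shipped', 'urgent'): 50,
    ('shipped', 'normal'): 950,
}

def est_selection(pred):
    """Estimated output cardinality of a selection: sum of matching buckets."""
    return sum(count for key, count in hist.items() if pred(*key))

# First selection: status = 'pending'
est_pending = est_selection(lambda status, urgency: status == 'pending')   # 1000
# Correlated follow-up: status = 'pending' AND urgency = 'urgent'
est_both = est_selection(lambda s, u: s == 'pending' and u == 'urgent')    # 900
# An independence assumption (two 1-D histograms) would instead estimate
# 1000 * (950 / 2000) = 475, badly underestimating the correlated case.
```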
4. Rule Based, Physical Optimisation
This usually comes with more rules, often taking hardware into account (e.g. CPU vs IO costs). With more indices, we have more rules (e.g. B-trees vs hashes for equalities, B+-trees vs bitmaps for ranges).
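A minimal sketch of what such a physical, data-agnostic rule might look like, with a hypothetical choose_access_path helper and an assumed indexes map from attribute to index kind:

```python
# Hypothetical physical rewrite rule: choose an access path from the shape of
# the predicate and the indexes that happen to exist (no data statistics used).
from collections import namedtuple

Cond = namedtuple('Cond', ['attr', 'cmp', 'value'])

def choose_access_path(table, cond, indexes):
    """indexes maps attribute name -> index kind ('hash' or 'btree') for `table`."""
    kind = indexes.get(cond.attr)
    if kind == 'hash' and cond.cmp == '=':
        return ('HashIndexScan', table, cond)      # equality: hash lookup
    if kind == 'btree' and cond.cmp in ('=', '<', '>', '<=', '>='):
        return ('BTreeIndexScan', table, cond)     # equality or range: B-tree
    return ('FullScan+Filter', table, cond)        # fallback: scan then filter

# e.g. choose_access_path('orders', Cond('id', '=', 42), {'id': 'hash'})
# -> ('HashIndexScan', 'orders', Cond(attr='id', cmp='=', value=42))
```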
5. Cost Based, Physical Optimisation
This is likely the most advanced optimisation strategy, strongly dependent on the metric used (e.g. number of function calls, CPU cost, page faults, max(IO, CPU), etc.). For example, a nested loop join requires less space and does not require hashing, but incurs more comparisons.
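As an illustration, a toy cost comparison under an assumed page-I/O metric; the relation sizes and buffer budget are made-up figures, and the formulas are the textbook I/O counts for a block nested-loop join and a Grace hash join:

```python
# Toy cost model (page I/Os only) comparing two join algorithms for one input pair.
pages_r, pages_s = 1_000, 5_000      # assumed relation sizes in pages
buffer_pages = 100                   # assumed memory budget in pages

# Block nested-loop: read R once; for each memory-sized block of R, scan S.
blocks_of_r = -(-pages_r // (buffer_pages - 2))       # ceiling division -> 11
nested_loop_cost = pages_r + blocks_of_r * pages_s    # 1000 + 11 * 5000 = 56000

# Grace hash join: partition both inputs, then read each partition back.
hash_join_cost = 3 * (pages_r + pages_s)              # 3 * 6000 = 18000

# Under this metric the hash join wins, but the nested loop needs almost no
# working memory and no hashing, so the preferred plan depends on the metric.
```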