Algorithms & Indexing

1. Joins

Joins are very important in databases: due to normalisation, value is produced by joining tables back together. A join is a cross product followed by a selection whose predicate involves both inputs. Joins are classified by their predicate:

A predicate can be an equality (equi join), an inequality such as < or <= (inequality join), a non-equality, i.e. != (anti join), or anything else (theta join).

1.1 Nested Loop Join

The simplest join algorithm is the nested loop join:

    using Table = vector<Tuple>;
    Table left, right;

    for (size_t i = 0; i < left.size(); i++) {
        auto leftInput = left[i];
        for (size_t j = 0; j < right.size(); j++) {
            auto rightInput = right[j];
            if (predicate(leftInput, rightInput)) {
                writeToOutput({leftInput, rightInput});
            }
        }
    }

This is simple, has sequential IO (good spatial locality), and is trivial to parallelise. However, it is O(n * m), i.e. quadratic in the input sizes. If join values are assumed to be unique, the inner loop can stop after the first match, halving the expected cost.
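As a minimal sketch of this early-exit variant (assuming a simplified Tuple = vector<int> representation and hypothetical column indices leftAttr/rightAttr):

    #include <cstddef>
    #include <vector>

    using Tuple = std::vector<int>;   // simplified tuple representation (assumption)
    using Table = std::vector<Tuple>;

    // Equi-join assuming the right side's join values are unique: once a match
    // is found, no further right-side tuple can match, so the inner loop stops early.
    void nestedLoopJoinUnique(const Table& left, const Table& right,
                              size_t leftAttr, size_t rightAttr, Table& output) {
        for (const auto& l : left) {
            for (const auto& r : right) {
                if (l[leftAttr] == r[rightAttr]) {
                    Tuple joined = l;
                    joined.insert(joined.end(), r.begin(), r.end());
                    output.push_back(joined);
                    break;   // at most one match per outer tuple
                }
            }
        }
    }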

How can we do better? In databases, the answer is always either sorting or hashing.

1.2 Sort-Merge Join

The sort-merge join assumes values are unique and sorted. The following implementation is for an equi-join:

    auto leftI = 0;
    auto rightI = 0;
    while (leftI < left.size() && rightI < right.size()) {
        auto leftInput = left[leftI];
        auto rightInput = right[rightI];
        if (leftInput[leftAttr] < rightInput[rightAttr]) {
            leftI++;
        } else if (leftInput[leftAttr] > rightInput[rightAttr]) {
            rightI++;
        } else {
            writeToOutput({leftInput, rightInput});
            leftI++;
            rightI++;
        }
    }

This only works when the inputs are sorted, using two pointers to step through the data. Including the sort, it is O(n log n + m log m); if the inputs are already sorted, the merge itself is O(n + m).

It is sequential in the merge phase, and also works for inequality joins. However, it is tricky to parallelise.

1.3 Hash Join

We distinguish the build side (the side buffered in the hashtable) and the probe side (used to look up tuples in the hashtable). The hash join is as follows:

    vector<optional<Tuple>> hashTable; // Slots may be empty!
    int hash(int);
    int nextSlot(int); // Probe function

    // Build phase: insert every build-side tuple into the hashtable.
    for (size_t i = 0; i < build.size(); i++) {
        auto buildInput = build[i];
        auto hash_value = hash(buildInput[buildAttr]);
        while (hashTable[hash_value].has_value()) { // Careful of infinite loops!
            hash_value = nextSlot(hash_value);
        }
        hashTable[hash_value] = buildInput;
    }

    // Probe phase: look up every probe-side tuple.
    for (size_t i = 0; i < probe.size(); i++) {
        auto probeInput = probe[i];
        auto hash_value = hash(probeInput[probeAttr]);
        while (hashTable[hash_value].has_value() &&
               hashTable[hash_value].value()[buildAttr] != probeInput[probeAttr]) {
            hash_value = nextSlot(hash_value);
        }
        if (hashTable[hash_value].has_value() &&
            hashTable[hash_value].value()[buildAttr] == probeInput[probeAttr]) {
            writeToOutput({hashTable[hash_value].value(), probeInput});
        }
    }

However, to implement this we need a hash and nextSlot function.

1.3.1 Hash Functions

A hash function has to be pure (stateless) and have a known output domain. Ideally it also has a contiguous output domain and a uniform distribution of outputs.

1.3.2 Probing Functions

This handles conflicts in the hash function (two values hashing to the same slot). We want locality, but too much locality leads to long probe chains. Probing functions can also lead to holes in the table.

Example: Modulo hashing & linear probing

    int hash(int v) { return v % TABLE_SIZE; }
    int probe(int v) { return (v + 1) % TABLE_SIZE; }

1.3.3 Hash Join Properties

Slots are often allocated in buckets. Here, we round each hash value down to the nearest multiple of the bucket size, and store multiple tuples in each bucket. This reduces the number of cache misses, but increases the number of tuples per bucket (leading to longer probe chains). Bucket-chaining is bad for lookup performance, but very resilient to skewed data.
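As a small illustration of the rounding described above (the bucket size is an assumed value; real systems often size buckets to a cache line):

    #include <cstddef>

    constexpr size_t BUCKET_SIZE = 4;   // tuples per bucket (assumed value)

    // Round a hash slot down to the first slot of its bucket; all tuples that hash
    // anywhere inside the bucket are stored and probed in slots [start, start + BUCKET_SIZE).
    size_t bucketStart(size_t slot) {
        return (slot / BUCKET_SIZE) * BUCKET_SIZE;
    }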

Hashtables also use extra memory (usually around the size of the build input), and are probed randomly. So, we want to make sure they stay in cache. If the hashtable does not fit, every access pays a roughly constant cache-miss penalty.

We use hash tables when one relation is much smaller than the other.

1.4 Partitioning

Random access is much more expensive than sequential access. So, if the relation is larger than half of the buffer pool, it pays off to invest in partitioning first. This involves splitting the hashtable (and its inputs) up into partitions, using another hash function with a small output domain. Each partition's probing then has close to sequential access, and partitions can even be processed in parallel. This is the state of the art.
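A rough sketch of the partitioning step (assuming the simplified Tuple/Table types from earlier and modulo as the small-domain hash function): tuples with equal join keys always land in the same partition, so each partition pair can then be built and probed independently.

    #include <cstddef>
    #include <vector>

    using Tuple = std::vector<int>;
    using Table = std::vector<Tuple>;

    // Split an input into numPartitions partitions on its join attribute.
    // The same function is applied to both sides, so matching tuples
    // end up in matching partitions.
    std::vector<Table> partitionTable(const Table& input, size_t attr, size_t numPartitions) {
        std::vector<Table> partitions(numPartitions);
        for (const auto& tuple : input) {
            size_t p = static_cast<size_t>(tuple[attr]) % numPartitions;   // small output domain
            partitions[p].push_back(tuple);
        }
        return partitions;
    }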

2. Indexing

With indexing, we can maintain a pre-built build side or pre-sorted data to make algorithms (e.g. joins) faster. We store replicated data in secondary structures (the opposite of normalisation). Indices increase performance, are invisible to users, and can be created / destroyed dynamically. However, they take up space, need to be maintained on updates & stress the query optimiser.

2.1 Hash Indexing

Hash indexing is a persistent version of the hash table from hash joins. In an unclustered index, values point to tuples. In a clustered index, values are the tuples.

Persistent hash tables may grow arbitrarily large, so we overallocate by a lot. We could rebuild the table if the fill-factor (percentage of used slots) exceeds a threshold, but rebuilding is very expensive and causes load spikes.
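A tiny sketch of such a trigger (the 0.7 threshold is an assumed value):

    #include <cstddef>

    // Rebuild (e.g. into a table of twice the size) once the fill factor crosses
    // a threshold; the rebuild rehashes every entry, hence the load spike.
    bool needsRebuild(size_t usedSlots, size_t totalSlots, double threshold = 0.7) {
        return static_cast<double>(usedSlots) / static_cast<double>(totalSlots) > threshold;
    }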

On delete, a value has to remain in the slot, or probe chains would break. We can either mark the slot as deleted (tombstone), or move the last value of the probe chain into it, to keep the chain short. To delete a key k (a sketch follows the list below):

  1. Hash k, find its slot s, and keep a pointer to s.
  2. Continue probing until the end of the probe chain.
  3. If the last value in the probe chain has the same hash as k, move it into slot s.
  4. Otherwise, mark s as deleted (tombstone).
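A minimal sketch of these steps on a linear-probing table of integer keys (the TOMBSTONE sentinel, hashKey and nextSlot are assumptions matching the earlier sketches):

    #include <cstddef>
    #include <optional>
    #include <vector>

    constexpr size_t TABLE_SIZE = 1024;
    constexpr int TOMBSTONE = -1;   // sentinel marking a deleted slot (assumes -1 is never a key)

    size_t hashKey(int v) { return static_cast<size_t>(v) % TABLE_SIZE; }
    size_t nextSlot(size_t s) { return (s + 1) % TABLE_SIZE; }

    void eraseKey(std::vector<std::optional<int>>& table, int k) {
        // 1. Hash k and find its slot, keeping a pointer to it.
        size_t slot = hashKey(k);
        while (table[slot].has_value() && table[slot].value() != k) {
            slot = nextSlot(slot);   // careful: assumes the table is never completely full
        }
        if (!table[slot].has_value()) return;   // key not present

        // 2. Continue probing until the end of the probe chain.
        size_t last = slot;
        for (size_t cur = nextSlot(slot); table[cur].has_value(); cur = nextSlot(cur)) {
            last = cur;
        }

        // 3. If the last value in the chain has the same hash as k, move it into
        //    k's slot; its old slot (the end of the chain) can simply be emptied.
        if (last != slot && hashKey(table[last].value()) == hashKey(k)) {
            table[slot] = table[last];
            table[last].reset();
        } else {
            // 4. Otherwise mark the slot as deleted so later probe chains stay intact.
            table[slot] = TOMBSTONE;
        }
    }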

Persistent hash tables are good for hash joins and aggregations, but bad for range queries.

2.2 Bitmap Indexing

A bitmap index is a collection of (usually disjoint) bitvectors on a column, one per distinct value. They are useful if the column has few distinct values. We can use them to evaluate selection queries:

    unsigned char** bitmaps; // A collection of bitmaps on a column

    void scanBitmap(byte* column, size_t inputSize, byte value) {
        unsigned char* scannedBitmap = bitmapForValue(bitmaps, value);
        for (size_t i = 0; i < inputSize / 8; i++) {
            if (scannedBitmap[i] != 0) {
                // Look at 1 bit at a time, most significant bit first:
                unsigned char mask = 128;
                for (size_t j = 0; j < 8; j++) {
                    if ((mask & scannedBitmap[i]) && column[i * 8 + j] == value) {
                        writeToOutput(column[i * 8 + j]);
                    }
                    mask >>= 1;
                }
            }
        }
    }

These indices are cheap to combine: conjunctive and disjunctive predicates can be evaluated with bitwise AND/OR over the bitvectors.

In binned bitmaps, we group values into bins (e.g. ages grouped into ranges). Each bin has a bitmap, and we store the actual values in a separate structure. This reduces the number of bitmaps needed, but increases the number of candidates returned (requiring a second pass to filter them). There are many different binning strategies (e.g. equal-width or equal-depth bins).
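A small sketch of that second pass (assuming an MSB-first bit layout, as in the scan above, and that the correct bin's bitmap has already been selected):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // The bin's bitmap only marks *candidate* rows; a second pass over the base
    // column filters out values that fall in the same bin but do not match.
    std::vector<size_t> scanBinnedBitmap(const std::vector<uint8_t>& binBitmap,
                                         const std::vector<int>& column, int value) {
        std::vector<size_t> matches;
        for (size_t i = 0; i < column.size(); i++) {
            bool candidate = (binBitmap[i / 8] >> (7 - i % 8)) & 1;   // MSB-first bit layout
            if (candidate && column[i] == value) {                    // verify the candidate
                matches.push_back(i);
            }
        }
        return matches;
    }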

2.3 Bitmap RLE Indexing

We can RLE-compress the bitmaps to save space. This works well on data with high locality. However, it then requires a sequential scan to find a specific position. Instead, we can store prefix sums of the run lengths and use a binary search to find the right run.
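A sketch of such a positional lookup, assuming runStarts holds the prefix sums of the run lengths (the starting position of each run, with runStarts[0] == 0), runValues holds the bit of each run, and pos lies inside the bitmap:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Find the bit at position pos by binary-searching the run that contains it,
    // instead of scanning the runs sequentially.
    bool bitAt(const std::vector<size_t>& runStarts,
               const std::vector<bool>& runValues, size_t pos) {
        // First run starting strictly after pos; the run containing pos is the one before it.
        auto it = std::upper_bound(runStarts.begin(), runStarts.end(), pos);
        size_t runIndex = static_cast<size_t>(it - runStarts.begin()) - 1;
        return runValues[runIndex];
    }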

2.4 B Trees

Databases are IO bound, with many equality lookups and insertions. So, use a tree! We want a high fanout to minimise page IO.

A B-Tree node of out-degree k contains up to k - 1 keys and k pointers to children. The keys are sorted and act as separators: the i-th child subtree holds the values between the (i-1)-th and i-th key.
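A sketch of what such a node might look like in memory (simplified int keys; the concrete out-degree K is an assumption, typically chosen so a node roughly fills a page):

    #include <array>
    #include <cstddef>

    constexpr size_t K = 128;   // out-degree (assumed; sized so a node roughly fills a page)

    struct BTreeNode {
        size_t numKeys = 0;                      // number of key slots in use
        std::array<int, K - 1> keys;             // sorted separator keys
        std::array<BTreeNode*, K> children{};    // children[i] holds keys below keys[i]; all nullptr in leaves
        bool isLeaf = true;
    };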

2.4.1 Maintaining B Trees

To maintain a balanced B-tree on insertion:

  1. Find the right leaf node and insert the key into it.
  2. If the node overflows (has more than k - 1 keys), split it in half (see the sketch after this list).
  3. Insert a new split element in the parent node.
  4. If the parent overflows, repeat from step 2.
  5. If the parent is the root, create a new root.
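A minimal sketch of step 2, representing a node simply as a sorted vector of int keys (an assumption): the middle key moves up into the parent as the new split element (step 3), and the upper half becomes a new node.

    #include <cstddef>
    #include <vector>

    struct SplitResult {
        std::vector<int> right;   // upper half of the keys, forming the new node
        int separator;            // middle key, to be inserted into the parent
    };

    SplitResult splitNode(std::vector<int>& node) {
        size_t mid = node.size() / 2;
        SplitResult result;
        result.separator = node[mid];                              // moves up into the parent
        result.right.assign(node.begin() + mid + 1, node.end());   // upper half
        node.erase(node.begin() + mid, node.end());                // lower half stays in place
        return result;
    }

In a B+ tree (section 2.4.2) the separator would instead be a copy of a leaf value rather than being removed from the leaf.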

To maintain a balanced B-tree on deletion:

  1. Find the value to delete. If it is in a leaf, delete it. Done!
  2. Otherwise, replace it with the maximum leaf-node value from the left child, or the minimum leaf-node value of the right child (removing that value from its leaf).
  3. If the affected leaf node underflows, rebalance the tree bottom-up:
  4. Try to borrow an element from a neighbouring node: it becomes the new splitting key in the parent, and the old splitting key moves into the underflowing node.
  5. If this fails, the neighbouring node cannot be more than half full, so the two nodes can be merged. Remove the splitting key from the parent (it moves into the merged node).
  6. If the parent underflows, rebalance from it.

2.4.2 B+ Trees

Although B-Trees can support range queries, doing so is complicated and causes many node traversals. We usually make B-Tree nodes the same size as a page. We also waste a lot of space on (unused) child pointers in leaf nodes.

In B+ trees, all data is kept in the leaf nodes. The leaf nodes are linked together into a linked list through their otherwise unused pointers, for fast range scans. Split values in the inner nodes are replicas of leaf-node values. A B+ node therefore either holds only separator keys and child pointers (inner node), or keys and values plus the next-leaf pointer (leaf node).
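A sketch of a leaf and a range scan over the leaf chain (simplified to int keys and values; the root-to-leaf descent that finds the starting leaf is omitted):

    #include <cstddef>
    #include <vector>

    struct BPlusLeaf {
        std::vector<int> keys;       // sorted keys stored in this leaf
        std::vector<int> values;     // one value per key
        BPlusLeaf* next = nullptr;   // link to the neighbouring leaf for fast range scans
    };

    // Collect all values with lo <= key <= hi, starting from the leaf containing lo.
    std::vector<int> rangeScan(BPlusLeaf* leaf, int lo, int hi) {
        std::vector<int> result;
        for (; leaf != nullptr; leaf = leaf->next) {
            for (size_t i = 0; i < leaf->keys.size(); i++) {
                if (leaf->keys[i] < lo) continue;
                if (leaf->keys[i] > hi) return result;   // keys are globally sorted, so we are done
                result.push_back(leaf->values[i]);
            }
        }
        return result;
    }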

2.5 Foreign Key Indexing

A foreign key references exactly one primary key column of another table. We can index foreign keys by storing pointers to the actual primary key instead of a supposedly matching value.

In a PK/FK constraint, each foreign-key value is implied to have exactly one join partner. So, FK indices are basically precalculated joins. They incur insignificant extra work on updates, do not cost significant space and require no extra query optimisation.
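A sketch of how such a precalculated join might look (the fact/dimension naming and the row-id representation are assumptions): each foreign-key entry stores the row id of its single matching primary-key tuple, so the join degenerates into an array lookup.

    #include <cstddef>
    #include <vector>

    using Tuple = std::vector<int>;
    using Table = std::vector<Tuple>;

    // fkIndex[i] is the row id of the dimension tuple matching fact[i]'s foreign key.
    void fkJoin(const Table& fact, const std::vector<size_t>& fkIndex,
                const Table& dimension, Table& output) {
        for (size_t i = 0; i < fact.size(); i++) {
            const Tuple& pkTuple = dimension[fkIndex[i]];   // precomputed join partner
            Tuple joined = fact[i];
            joined.insert(joined.end(), pkTuple.begin(), pkTuple.end());
            output.push_back(joined);
        }
    }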
