Algorithms & Indexing

1. Joins

Joins are very important in databases: due to normalisation, value is produced by joining tables back together. A join is a cross product followed by a selection whose predicate involves both inputs. Joins are classified by their predicate:

A predicate can be an equality (equi join), an inequality such as < or <= (inequality join), a non-equality, i.e. != (anti join), or anything else (theta join).

1.1 Nested Loop Join

The simplest join algorithm is the nested loop join:

    using Table = vector<Tuple>;
    Table left, right;

    for (size_t i = 0; i < left.size(); i++) {
        auto leftInput = left[i];
        for (size_t j = 0; j < right.size(); j++) {
            auto rightInput = right[j];
            if (predicate(leftInput, rightInput)) {
                writeToOutput({leftInput, rightInput});
            }
        }
    }

This is simple, has sequential IO (good spatial locality), and is trivial to parallelise. However, it is O(n * m), i.e. quadratic in the input sizes. If join values are assumed to be unique, the inner loop can stop after the first match, halving the expected cost.
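As a minimal sketch of this early-exit variant (assuming a simplified Tuple = vector<int> representation and hypothetical column indices leftAttr/rightAttr):

    #include <cstddef>
    #include <vector>

    using Tuple = std::vector<int>;   // simplified tuple representation (assumption)
    using Table = std::vector<Tuple>;

    // Equi-join assuming the right side's join values are unique: once a match
    // is found, no further right-side tuple can match, so the inner loop stops early.
    void nestedLoopJoinUnique(const Table& left, const Table& right,
                              size_t leftAttr, size_t rightAttr, Table& output) {
        for (const auto& l : left) {
            for (const auto& r : right) {
                if (l[leftAttr] == r[rightAttr]) {
                    Tuple joined = l;
                    joined.insert(joined.end(), r.begin(), r.end());
                    output.push_back(joined);
                    break;   // at most one match per outer tuple
                }
            }
        }
    }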

How can we do better? In databases, the answer is always either sorting or hashing.

1.2 Sort-Merge Join

The sort-merge join assumes values are unique and sorted. The following implementation is for an equi-join:

    auto leftI = 0;
    auto rightI = 0;
    while (leftI < left.size() && rightI < right.size()) {
        auto leftInput = left[leftI];
        auto rightInput = right[rightI];
        if (leftInput[leftAttr] < rightInput[rightAttr]) {
            leftI++;
        } else if (leftInput[leftAttr] > rightInput[rightAttr]) {
            rightI++;
        } else {
            writeToOutput({leftInput, rightInput});
            leftI++;
            rightI++;
        }
    }

This only works when the inputs are sorted, using two pointers to step through the data. Including the sort, it is O(n log n + m log m); if the inputs are already sorted, the merge itself is O(n + m).

It is sequential in the merge phase, and also works for inequality joins. However, it is tricky to parallelise.

1.3 Hash Join

We distinguish the build side (the side buffered in the hashtable) and the probe side (used to look up tuples in the hashtable). The hash join is as follows:

    vector<optional<Tuple>> hashTable; // Slots may be empty!
    int hash(int);
    int nextSlot(int); // Probe function

    // Build phase: insert every build-side tuple into the hashtable.
    for (size_t i = 0; i < build.size(); i++) {
        auto buildInput = build[i];
        auto hash_value = hash(buildInput[buildAttr]);
        while (hashTable[hash_value].has_value()) { // Careful of infinite loops!
            hash_value = nextSlot(hash_value);
        }
        hashTable[hash_value] = buildInput;
    }

    // Probe phase: look up every probe-side tuple.
    for (size_t i = 0; i < probe.size(); i++) {
        auto probeInput = probe[i];
        auto hash_value = hash(probeInput[probeAttr]);
        while (hashTable[hash_value].has_value() &&
               hashTable[hash_value].value()[buildAttr] != probeInput[probeAttr]) {
            hash_value = nextSlot(hash_value);
        }
        if (hashTable[hash_value].has_value() &&
            hashTable[hash_value].value()[buildAttr] == probeInput[probeAttr]) {
            writeToOutput({hashTable[hash_value].value(), probeInput});
        }
    }

However, to implement this we need a hash and nextSlot function.

1.3.1 Hash Functions

A hash function has to be pure (stateless) and have a known output domain. Ideally it also has a contiguous output domain and a uniform distribution of outputs.

1.3.2 Probing Functions

This handles conflicts in the hash function (two values hashing to the same slot). We want locality, but too much locality leads to long probe chains. Probing functions can also lead to holes in the table.

Example: Modulo hashing & linear probing

    int hash(int v) { return v % TABLE_SIZE; }
    int probe(int v) { return (v + 1) % TABLE_SIZE; }

1.3.3 Hash Join Properties

Slots are often allocated in buckets. Here, we round each hash value down to the nearest multiple of the bucket size, and store multiple tuples in each bucket. This reduces the number of cache misses, but increases the number of tuples per bucket (leading to longer probe chains). Bucket-chaining is bad for lookup performance, but very resilient to skewed data.
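As a small illustration of the rounding described above (the bucket size is an assumed value; real systems often size buckets to a cache line):

    #include <cstddef>

    constexpr size_t BUCKET_SIZE = 4;   // tuples per bucket (assumed value)

    // Round a hash slot down to the first slot of its bucket; all tuples that hash
    // anywhere inside the bucket are stored and probed in slots [start, start + BUCKET_SIZE).
    size_t bucketStart(size_t slot) {
        return (slot / BUCKET_SIZE) * BUCKET_SIZE;
    }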

Hashtables also use extra memory (usually around the size of the build input), and are probed randomly. So, we want to make sure they stay in cache. If the hashtable does not fit, every access pays a roughly constant cache-miss penalty.

We use hash tables when one relation is much smaller than the other.

1.4 Partitioning

Random access is much more expensive than sequential access. So, if the relation is larger than half of the buffer pool, it pays off to invest in partitioning first. This involves splitting the hashtable (and its inputs) up into partitions, using another hash function with a small output domain. Each partition's probing then has close to sequential access, and partitions can even be processed in parallel. This is the state of the art.
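A rough sketch of the partitioning step (assuming the simplified Tuple/Table types from earlier and modulo as the small-domain hash function): tuples with equal join keys always land in the same partition, so each partition pair can then be built and probed independently.

    #include <cstddef>
    #include <vector>

    using Tuple = std::vector<int>;
    using Table = std::vector<Tuple>;

    // Split an input into numPartitions partitions on its join attribute.
    // The same function is applied to both sides, so matching tuples
    // end up in matching partitions.
    std::vector<Table> partitionTable(const Table& input, size_t attr, size_t numPartitions) {
        std::vector<Table> partitions(numPartitions);
        for (const auto& tuple : input) {
            size_t p = static_cast<size_t>(tuple[attr]) % numPartitions;   // small output domain
            partitions[p].push_back(tuple);
        }
        return partitions;
    }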

2. Indexing

With indexing, we can maintain a pre-built build side or pre-sorted data to make algorithms (e.g. joins) faster. We store replicated data in secondary structures (the opposite of normalisation). Indices increase performance, are invisible to users, and can be created / destroyed dynamically. However, they take up space, need to be maintained on updates & stress the query optimiser.

2.1 Hash Indexing

Hash indexing is a persistent version of the hash table from hash joins. In an unclustered index, values point to tuples. In a clustered index, values are the tuples.

Persistent hash tables may grow arbitrarily large, so we overallocate by a lot. We could rebuild the table if the fill-factor (percentage of used slots) exceeds a threshold, but rebuilding is very expensive and causes load spikes.
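A tiny sketch of such a trigger (the 0.7 threshold is an assumed value):

    #include <cstddef>

    // Rebuild (e.g. into a table of twice the size) once the fill factor crosses
    // a threshold; the rebuild rehashes every entry, hence the load spike.
    bool needsRebuild(size_t usedSlots, size_t totalSlots, double threshold = 0.7) {
        return static_cast<double>(usedSlots) / static_cast<double>(totalSlots) > threshold;
    }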

On delete, a value has to remain in the slot, or probe chains would break. We can either mark the slot as deleted (tombstone), or move the last value of the probe chain into it, to keep the chain short. To delete a key k (a sketch follows the list below):

  1. Hash k, find its slot s, and keep a pointer to s.
  2. Continue probing until the end of the probe chain.
  3. If the last value in the probe chain has the same hash as k, move it into slot s.
  4. Otherwise, mark s as deleted (tombstone).
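A minimal sketch of these steps on a linear-probing table of integer keys (the TOMBSTONE sentinel, hashKey and nextSlot are assumptions matching the earlier sketches):

    #include <cstddef>
    #include <optional>
    #include <vector>

    constexpr size_t TABLE_SIZE = 1024;
    constexpr int TOMBSTONE = -1;   // sentinel marking a deleted slot (assumes -1 is never a key)

    size_t hashKey(int v) { return static_cast<size_t>(v) % TABLE_SIZE; }
    size_t nextSlot(size_t s) { return (s + 1) % TABLE_SIZE; }

    void eraseKey(std::vector<std::optional<int>>& table, int k) {
        // 1. Hash k and find its slot, keeping a pointer to it.
        size_t slot = hashKey(k);
        while (table[slot].has_value() && table[slot].value() != k) {
            slot = nextSlot(slot);   // careful: assumes the table is never completely full
        }
        if (!table[slot].has_value()) return;   // key not present

        // 2. Continue probing until the end of the probe chain.
        size_t last = slot;
        for (size_t cur = nextSlot(slot); table[cur].has_value(); cur = nextSlot(cur)) {
            last = cur;
        }

        // 3. If the last value in the chain has the same hash as k, move it into
        //    k's slot; its old slot (the end of the chain) can simply be emptied.
        if (last != slot && hashKey(table[last].value()) == hashKey(k)) {
            table[slot] = table[last];
            table[last].reset();
        } else {
            // 4. Otherwise mark the slot as deleted so later probe chains stay intact.
            table[slot] = TOMBSTONE;
        }
    }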

Persistent hash tables are good for hash joins and aggregations, but bad for range queries.

2.2 Bitmap Indexing

A bitmap index is a collection of (usually disjoint) bitvectors on a column, one per distinct value. They are useful if the column has few distinct values. We can use them to evaluate selection queries:

    unsigned char** bitmaps; // A collection of bitmaps on a column

    void scanBitmap(byte* column, size_t inputSize, byte value) {
        unsigned char* scannedBitmap = bitmapForValue(bitmaps, value);
        for (size_t i = 0; i < inputSize / 8; i++) {
            if (scannedBitmap[i] != 0) {
                // Look at 1 bit at a time, most significant bit first:
                unsigned char mask = 128;
                for (size_t j = 0; j < 8; j++) {
                    if ((mask & scannedBitmap[i]) && column[i * 8 + j] == value) {
                        writeToOutput(column[i * 8 + j]);
                    }
                    mask >>= 1;
                }
            }
        }
    }

These indices are cheap to combine: conjunctive and disjunctive predicates can be evaluated with bitwise AND/OR over the bitvectors.

In binned bitmaps, we group values into bins (e.g. ages grouped into ranges). Each bin has a bitmap, and we store the actual values in a separate structure. This reduces the number of bitmaps needed, but increases the number of candidates returned (requiring a second pass to filter them). There are many different binning strategies (e.g. equal-width or equal-depth bins).
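A small sketch of that second pass (assuming an MSB-first bit layout, as in the scan above, and that the correct bin's bitmap has already been selected):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // The bin's bitmap only marks *candidate* rows; a second pass over the base
    // column filters out values that fall in the same bin but do not match.
    std::vector<size_t> scanBinnedBitmap(const std::vector<uint8_t>& binBitmap,
                                         const std::vector<int>& column, int value) {
        std::vector<size_t> matches;
        for (size_t i = 0; i < column.size(); i++) {
            bool candidate = (binBitmap[i / 8] >> (7 - i % 8)) & 1;   // MSB-first bit layout
            if (candidate && column[i] == value) {                    // verify the candidate
                matches.push_back(i);
            }
        }
        return matches;
    }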

2.3 Bitmap RLE Indexing

We can RLE-compress the bitmaps to save space. This works well on data with high locality. However, it then requires a sequential scan to find a specific position. Instead, we can store prefix sums of the run lengths and use a binary search to find the right run.
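A sketch of such a positional lookup, assuming runStarts holds the prefix sums of the run lengths (the starting position of each run, with runStarts[0] == 0), runValues holds the bit of each run, and pos lies inside the bitmap:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Find the bit at position pos by binary-searching the run that contains it,
    // instead of scanning the runs sequentially.
    bool bitAt(const std::vector<size_t>& runStarts,
               const std::vector<bool>& runValues, size_t pos) {
        // First run starting strictly after pos; the run containing pos is the one before it.
        auto it = std::upper_bound(runStarts.begin(), runStarts.end(), pos);
        size_t runIndex = static_cast<size_t>(it - runStarts.begin()) - 1;
        return runValues[runIndex];
    }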

2.4 B Trees

Databases are IO bound, with many equality lookups and insertions. So, use a tree! We want a high fanout to minimise page IO.

A B-Tree node of out-degree k contains up to k - 1 keys and k pointers to children. The keys are sorted and act as separators: the i-th child subtree holds the values between the (i-1)-th and i-th key.
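A sketch of what such a node might look like in memory (simplified int keys; the concrete out-degree K is an assumption, typically chosen so a node roughly fills a page):

    #include <array>
    #include <cstddef>

    constexpr size_t K = 128;   // out-degree (assumed; sized so a node roughly fills a page)

    struct BTreeNode {
        size_t numKeys = 0;                      // number of key slots in use
        std::array<int, K - 1> keys;             // sorted separator keys
        std::array<BTreeNode*, K> children{};    // children[i] holds keys below keys[i]; all nullptr in leaves
        bool isLeaf = true;
    };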

2.4.1 Maintaining B Trees

To maintain a balanced B-tree on insertion:

  1. Find the right leaf node and insert the key into it.
  2. If the node overflows (has more than k - 1 keys), split it in half (see the sketch after this list).
  3. Insert a new split element in the parent node.
  4. If the parent overflows, repeat from step 2.
  5. If the parent is the root, create a new root.
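A minimal sketch of step 2, representing a node simply as a sorted vector of int keys (an assumption): the middle key moves up into the parent as the new split element (step 3), and the upper half becomes a new node.

    #include <cstddef>
    #include <vector>

    struct SplitResult {
        std::vector<int> right;   // upper half of the keys, forming the new node
        int separator;            // middle key, to be inserted into the parent
    };

    SplitResult splitNode(std::vector<int>& node) {
        size_t mid = node.size() / 2;
        SplitResult result;
        result.separator = node[mid];                              // moves up into the parent
        result.right.assign(node.begin() + mid + 1, node.end());   // upper half
        node.erase(node.begin() + mid, node.end());                // lower half stays in place
        return result;
    }

In a B+ tree (section 2.4.2) the separator would instead be a copy of a leaf value rather than being removed from the leaf.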

To maintain a balanced B-tree on deletion:

  1. Find the value to delete. If it is in a leaf, delete it. Done!
  2. Otherwise, replace it with the maximum leaf-node value from the left child, or the minimum leaf-node value of the right child (removing that value from its leaf).
  3. If the affected leaf node underflows, rebalance the tree bottom-up:
  4. Try to borrow an element from a neighbouring node: it becomes the new splitting key in the parent, and the old splitting key moves into the underflowing node.
  5. If this fails, the neighbouring node cannot be more than half full, so the two nodes can be merged. Remove the splitting key from the parent (it moves into the merged node).
  6. If the parent underflows, rebalance from it.

2.4.2 B+ Trees

Although B-Trees can support range queries, doing so is complicated and causes many node traversals. We usually make B-Tree nodes the same size as a page. We also waste a lot of space on (unused) child pointers in leaf nodes.

In B+ trees, all data is kept in the leaf nodes. The leaf nodes are linked together into a linked list through their otherwise unused pointers, for fast range scans. Split values in the inner nodes are replicas of leaf-node values. A B+ node therefore either holds only separator keys and child pointers (inner node), or keys and values plus the next-leaf pointer (leaf node).
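A sketch of a leaf and a range scan over the leaf chain (simplified to int keys and values; the root-to-leaf descent that finds the starting leaf is omitted):

    #include <cstddef>
    #include <vector>

    struct BPlusLeaf {
        std::vector<int> keys;       // sorted keys stored in this leaf
        std::vector<int> values;     // one value per key
        BPlusLeaf* next = nullptr;   // link to the neighbouring leaf for fast range scans
    };

    // Collect all values with lo <= key <= hi, starting from the leaf containing lo.
    std::vector<int> rangeScan(BPlusLeaf* leaf, int lo, int hi) {
        std::vector<int> result;
        for (; leaf != nullptr; leaf = leaf->next) {
            for (size_t i = 0; i < leaf->keys.size(); i++) {
                if (leaf->keys[i] < lo) continue;
                if (leaf->keys[i] > hi) return result;   // keys are globally sorted, so we are done
                result.push_back(leaf->values[i]);
            }
        }
        return result;
    }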

2.5 Foreign Key Indexing

A foreign key references exactly one primary key column of another table. We can index foreign keys by storing pointers to the actual primary key instead of a supposedly matching value.

In a PK/FK constraint, each foreign-key value is implied to have exactly one join partner. So, FK indices are basically precalculated joins. They incur insignificant extra work on updates, do not cost significant space and require no extra query optimisation.
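A sketch of how such a precalculated join might look (the fact/dimension naming and the row-id representation are assumptions): each foreign-key entry stores the row id of its single matching primary-key tuple, so the join degenerates into an array lookup.

    #include <cstddef>
    #include <vector>

    using Tuple = std::vector<int>;
    using Table = std::vector<Tuple>;

    // fkIndex[i] is the row id of the dimension tuple matching fact[i]'s foreign key.
    void fkJoin(const Table& fact, const std::vector<size_t>& fkIndex,
                const Table& dimension, Table& output) {
        for (size_t i = 0; i < fact.size(); i++) {
            const Tuple& pkTuple = dimension[fkIndex[i]];   // precomputed join partner
            Tuple joined = fact[i];
            joined.insert(joined.end(), pkTuple.begin(), pkTuple.end());
            output.push_back(joined);
        }
    }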
