Storage

1. DBMS Kernel

The database architecture is composed of layers:

External Interface (e.g. SQL)
Logical Plan (e.g. Relational Algebra)
Physical Plan (Actual execution plan in more detail)
DB Kernel (Library of functions implementing algorithms, manages CPU, disk, memory, etc)

The kernel is the key part of the DBMS system - everything else is optional. Contains IO, memman, operations and provides an interface to subsystems.

2. Kernel Architecture

The kernel is composed of 4 main components:

Catalog - metadata on the actual data (e.g. schema, constraints, etc).
Data - the data itself.
Storage Manager - component of the kernel that interfaces with the data and catalog.
Operator Implementations - functionality of the DB that uses the storage manager.

Lets create a kernel interface in C++:

1class StorageManager;
2class OperatorImplementations;
3class DatabaseKernel {
4  StorageManager& getStorageManager();
5  OperatorImplementations& getOperatorImplementations();
6}

3. Storage Manager

The storage manager is the first thing every database needs to support - specifically inserting tuples. Inserts are not usually optimised.

Lets design a storage manager interface:

1struct Table;
2class StorageManager {
3  map<string, Table> catalog;
4
5public:
6  Table& getTable(string name) { return catalog[name]; }
7};

To decide how to store the data, we need to decide on where to store it (disk, mem, etc) and how to store it (n-ary or decomposed, sorted, compressed, etc).

1struct Table {
2  using AttributeValue = variant<int, float, string>;
3  using Tuple = vector<AttributeValue>;
4
5  void insert(Tuple) {};
6  vector<Tuple> findTuplesWithAttributeValue(int attributePosition, AttributeValue) {};
7  void deleteTuplesWithAttributeValue(int attributePosition, AttributeValue) {};
8};

There is a fundamental mismatch between two dimensional relational data and one dimensional memory. So, tuples need to be linearised.

3.1 N-ary Storage Model (NSM)

In NSM, the tuples are stored one after another in a contiguous block of memory. This is also known as a row store. We can implement this as:

1struct NSMTable : public Table {
2  struct InternalTuple {
3    Tuple actualTuple;
4    bool deleted = false;
5    InternalTuple(Tuple t): actualTuple(t) {}
6  };
7  vector<InternalTuple> data;
8
9  void insert(Tuple t);
10  vector<Tuple> findTuplesWithAttributeValue(int attributePosition, AttributeValue value);
11  void deleteTuplesWithAttributeValue(int attributePosition, AttributeValue value);
12};

Then we can implement the methods as:

1void NSMTable::insert(Tuple t) { data.push_back(NSMTable::InternalTuple(t)); };
2
3vector<Table::Tuple> NSMTable::findTuplesWithAttributeValue(int attributePosition, AttributeValue value) {
4  vector<Tuple> result;
5  for (size_t i = 0; i < data.size(); i++) {
6    if (!data[i].deleted && data[i].actualTuple[attributePosition] == value) {
7      result.push_back(data[i].actualTuple);
8    }
9  }
10  return result;
11};
12
13void NSMTable::deleteTuplesWithAttributeValue(int attributePosition, AttributeValue value) {
14  for (size_t i = 0; i < data.size(); i++) {
15    if (data[i].actualTuple[attributePosition] == value) {
16      data[i].deleted = true;
17    }
18  }
19};

Data locality is very important for performance. This is important even on RAM, we want to make sure that data is being brought into CPU cache in 64B cache lines. So, having a high stride between data points significantly hurts performance.

For N-ary storage, insertion achieves high locality as it writes to a contiguous memory block. Reading a single tuple also achieves high locality. However, reading a single attribute across many tuples has poor locality as it has a high stride. This includes searching and deleting.

3.2 Decomposed Storage Model (DSM)

In a decomposed storage model, each attribute is stored in its own contiguous block of memory. In practice, extra memory is left between each attribute to allow for faster insertions. This is also known as a column store. We can implement this as:

1struct DSMTable : public Table {
2  using Column = vector<AttributeValue>;
3  vector<Column> data;
4  vector<bool> deleteMarkers;
5
6  void insert(Tuple t);
7  vector<Tuple> findTuplesWithAttributeValue(int attributePosition, AttributeValue value);
8  void deleteTuplesWithAttributeValue(int attributePosition, AttributeValue value);
9};

Then we can implement the methods as:

1void DSMTable::insert(Tuple t) {
2  for(size_t i = 0; i < tuple.size(); i++) {
3    data[i].push_back(t[i]);
4  }
5};
6
7vector<Table::Tuple> DSMTable::findTuplesWithAttributeValue(int attributePosition, AttributeValue value) {
8  vector<Tuple> result;
9  for(size_t i = 0; i < tuple.size(); i++) {
10    if(!deleteMarkers[i] && data[attributePosition][i] == value) {
11      Tuple reconstructed;
12      for (size_t column = 0; column < data.size(); column++) {
13        reconstructed.push_back(data[column][i]);
14      }
15      result.push_back(reconstructed);
16    }
17  }
18  return result;
19};
20
21void DSMTable::deleteTuplesWithAttributeValue(int attributePosition, AttributeValue value) {
22  for(size_t i = 0; i < tuple.size(); i++) {
23    if(data[attributePosition][i] == value) {
24      deleteMarkers[i] = true;
25    }
26  }
27};

Inserting into / accessing a tuple in a DSM tables has low data locality. However, accessing a single attribute across many tuples has high data locality.

3.3 NSM vs DSM

DSM works well for scan-heavy queries. Common in analytical processing (e.g. aggregate, join) and historial data (OLAP).
NSM works well for lookups and inserts. Common in transactional processing (OLTP).

This naturally leads to the development of hybrids.

3.4 Delta-Main Design

A delta-main storage manager uses a hybrid table, having both an NSM table and DSM table as members. The tuples in each are disjoint (no tuples live in both!):

1struct HybridTable : public Table {
2  DSMTable main;
3  NSMTable delta;
4
5  void insert(Tuple);
6  vector<Tuple> findTuplesWithAttributeValue(int attributePosition, AttributeValue value);
7  void deleteTuplesWithAttributeValue(int attributePosition, AttributeValue value);
8  void merge();
9}
10
11// Insert into the delta for high locality.
12void HybridTable::insert(Tuple t) { delta.insert(t); };
13
14// Search both tables
15vector<Table::Tuple> HybridTable::findTuplesWithAttributeValue(int attributePosition, AttributeValue value) {
16  auto mainResults = main.findTuplesWithAttributeValue(attributePosition, value);
17  auto deltaResults = delta.findTuplesWithAttributeValue(attributePosition, value);
18  mainResults.insert(mainResults.end(), deltaResults.begin(), deltaResults.end());
19  return mainResults;
20};
21
22// Delete from both tables
23void HybridTable::deleteTuplesWithAttributeValue(int attributePosition, AttributeValue value) {
24  main.deleteTuplesWithAttributeValue(attributePosition, value);
25  delta.deleteTuplesWithAttributeValue(attributePosition, value);
26};
27
28// Occasionally, we move data from the delta store to the main store
29void HybridTable::merge() {
30  for (auto i = 0u; i < delta.data.size(); i++) {
31    main.insert(delta.data[i].actualTuple);
32    delta.data[i].deleted = true;
33  }
34};

Delta-main stores aim to exploit the benefits of both NSM and DSM models. However, regular migrations are required which lock the database, and lookups are also complicated.

Delta-main model is often implemented as two separate DBMSs, one for delta and one for main.

4. Catalog

The catalog stores metadata about the database. Real life data follows patterns, that if recognised can optimise the DBMS. The catalog can store things like datatypes, min/max values, histograms, etc. Lets implement sortedness and denseness:

1class Table {
2  vector<Tuple> storage;
3  bool isFirstColSorted = true; // e.g. 1, 3, 7, 12
4  bool isFirstColDense = true; // e.g. 3, 4, 5, 6 (Stricter than sorted!)
5
6public:
7  void insert(Tuple t) {
8    if (storage.size() > 0) {
9      // Some metadata is easy to check in insertion 
10      isFirstColSorted &= (t[0] >= storage.back()[0]);
11      isFirstColDense &= (t[0] == storage.back()[0] + 1);
12    }
13    storage.push_back(t);
14  };
15
16  vector<Tuple> findTuplesWithAttributeValue(int attribute, AttributeValue value) {
17    if (attribute == 0 && isFirstColDense) {
18      // If the column is dense we can calculate the index without looping
19      return { data[value - storage.front()[0]] };
20    } else if (attribute == 0 && isFirstColSorted && bin_search(data, attribute, value)[0] == value) {
21      // If its sorted, we can perform a binary search (instead of a linear one!!)
22      return { bin_search(data, attribute, value) };
23    }
24    vector<Tuple> result;
25    for (size_t i = 0; i < data.size(); i++) {
26      if (data[i].actualTuple[attributePosition] == value) {
27        result.push_back(data[i]);
28      }
29    }
30    return result;
31  }
32};

4.1 Re-Establishing Metadata

If we want, we can enforce some of these metadata as constraints by, for example, sorting columns ourselves:

1class Table {
2  vector<Tuple> storage;
3  bool isSorted = true;
4  bool isDense = true;
5
6public:
7  void analyze() {
8    sort(storage, [](auto l, auto r) { return l.id < r.id });
9  
10    isDense = true;
11    for (size_t i = 1; i < storage.size(); i++) {
12      isDense &= storage[i].id == storage[i - 1].id + 1;
13    }
14  }
15};

5. Variable Size Datatypes

Datatypes should be able to store data of unknown sizes (e.g. strings). For easier data handling, we would like them to occupy fixed memory spaces. We can either:

Overallocate space e.g. varchars. This is simple and good for locality, but it wastes lots of space.
Store out of place with a dictionary for variable size datatypes. This is space conservative, but bad for locality and harder to implement. Also, garbage-collection is more difficult.

Out of place storage can also benefit from dictionary compression. Before each insert, we check if it is present. If so, reuse the existing values.

6. Disk Based Storage

How are disks different?

Larger pages (KB instead of B)

Much higher latency (ms instead of ns)

Much lower throughput (100MB instead of 10GB) - this causes DBMS to be IO-bound

OS gets in the way (e.g. file sizes may be limited)

Because of these differences:

Disks are dominating costs, so complicated strategies start to pay off.
Pages are large, so each page starts behaving like a mini DB (in n-ary storage).

We can extend the kernel to store the data in disk instead of memory, and add a buffer manager used by the storage manager and operator implementations to interface with the data. A buffer manager manages disk resident data by:

Mapping unstructured files to structured tables for reading / writing.
Making sure files have a fixed size (not trusting the OS).
Safely writing data to disk when necessary.

Buffers write into open pages (a memory mapped file). We can assume that there is one open page per relation, and that pages can overflow slightly.

6.1 Disk Based Storage Manager

1class Table {
2  BufferManager& bufferManager;
3  string relationName = "Employee";
4
5public:
6  void insert(Tuple t) {
7    bufferManager.getOpenPageForRelation(relationName).push_back(t);
8    bufferManager.commitOpenPageForRelation(relationName);
9  };
10
11  vector<Tuple> findTuplesWithAttributeValue(int attribute, AttributeValue value) {
12    vector<Tuple> result;
13    auto pages = bufferManager.getPagesForRelation(relationName);
14    for (size_t i = 0; i < pages.size(); i++) {
15      auto page = pages[i];
16      for (size_t j = 0; j < page.size(); j++) {
17        if (page[j][attribute] == value) {
18          result.push_back(page[j]);
19        }
20      }
21    }
22    return result;
23  };
24};

And to implement the BufferManager:

1struct BufferManager {
2  using Tuple = vector<AttributeValue>;
3  using Page = vector<Tuple>; // Assuming n-ary storage
4
5  size_t tupleSizeInBytes;
6  map<string, Page> openPages; // Map one page per relation (for simplicity)
7  map<string, vector<string>> pagesOnDisk; // Maps one relation to many files on disk that are not open
8
9  size_t numberOfTuplesPerPage();
10
11  Page& getOpenPageForRelation(string relationName);
12  void commitOpenPageForRelation(string relationName);
13  vector<Page> getPagesForRelation(string relationName);
14};
15
16Page& BufferManager::getOpenPageForRelation(string relationName) {
17  return openPages[relationName]; // Should also create one if necessary
18};
19
20// As long as a page doesn't overflow, we don't commit it to disk. This is probably not good, but its okay for this example.
21void BufferManager::commitOpenPageForRelation(string relationName) {
22  while (openPages[relationName].size() >= numberOfTuplesPerPage()) { // While we have stuff to flush to disk...
23    Page newPage;
24    while (newPage.size() <= numberOfTuplesPerPage()) {
25       // Move stuff from the openPage to the newPage:
26      newPage.push_back(openPages[relationName].back());
27      openPages[relationName].pop_back();
28    }
29    // Write to disk and store the handle:
30    pagesOnDisk[relationName].push_back(writeToDisk(newPage));
31  }
32};
33
34vector<Page> BufferManager::getPagesForRelation(string relationName) {
35  // Return the openPage not committed to disk and the rest of the pages on disk
36  vector<Page> result = { openPages[relationName] };
37  for (size_t i = 0; i < pagesOnDisk[relationName].size(); i++) {
38    result.push_back(readFromDisk(pagesOnDisk[relationName][i]));
39  }
40  return result;
41};

How do we decide how many tuples to put in each page?

6.2 Unspanned Pages

Pages can be thought of as mini-databases - an unspanned page. These are simple, and give good random access performance. However, to do this, we must ensure constant numbers of tuples per page, we need to introduce padding which wastes space. We also cannot deal with large records (e.g. 4KB) and no in-page random access for variable sized records.

1const long pageSizeInBytes = 4096;
2size_t BufferManager::numberOfTuplesPerPage() {
3  return floor(pageSizeInBytes / tupleSizeInBytes);
4}

We can simply say that tuples will use .

6.3 Spanned Pages

In this case, tuples are split over multiple pages. This minimises space waste and supports large records. However, it requires a complicated implementation, hurts random access performance. Also, much like unspanned pages, there is no in-page random access for variable sized records.

To find the number of pages per relation, we can do .

However, finding the numberOfTuplesPerPage() is impossible, as it is not constant.

6.4 Slotted Pages

In slotted pages, we:

Store tuples in in-place n-ary format (spanned or unspanned).
Variable datatypes are not overallocated, but null terminated.
Store tuple count in page header.
The offsets to every tuple in page footer. The offsets type is only large enough to address the page.

The tuples themselves and offset table grow in opposite directions until the page is filled. Slotted pages strike a balance between random access performance and space waste.

6.5 In-Page Dictionaries

Global dictionaries are a horrible idea - if we put this in separate pages, each string access in the worst case will cause every page to be brought into disk. Instead, we can use a dictionary per page. This solves the problem of variable sized-records and also allows some duplicate elimination.

Back to Home

Table of Contents