Relational Algebra

1. Data Processing Systems

Typical data intensive applications involve:

Online Transaction Processing (OLTP): lots of small updates to a persistent database, focus on throughput & ACID.
Online Analytical Processing (OLAP): running a single data analysis task, focus on latency.
Reporting: running many data analysis tasks within a time budget, focus on resource efficiency. Queries are known in advance.
Hybrid Transactional/Analytical Processing (HTAP): small updates interwoven with larger analytics.

A databus management system should be Atomic, Consistent, Isolated, Durable (ACID).

2. Relational Algebra with C++

A logical representation of relational operations. Actual systems don't implement this directly. A relation is almost a set of tuples:

Each row represents an -tuple of -ary relation .
The ordering of rows is immaterial.
All rows are distinct.
The ordering of columns is significant.
The significance of a column is partially conveyed by its name.

1template<template... types> struct Relation {
2  using OutputType = tuple<types...>;
3  set<tuple<types...>> data;
4  array<string, sizeof...(types)> schema;
5
6  Relation(array<string, sizeof...(types)> schema, set<tuple<types...>> data): schema(schema), data(data) {}
7};
8
9auto createCustomerTable() {
10  Relation<int, string, string> customer(
11    {"id", "name", "address"},
12    {{1, "Alice", "123 Main St"},
13     {2, "Bob", "456 Maple Ave"},
14     {3, "Charlie", "789 Oak Dr"}});
15  return customer;
16}

To make relational operators we can make use of the OutputType alias.

2.1 Relational Expressions

An expression is composed of operators. Cardinality is the number of tuples in a set (used to estimate cost).

template<typename... types> struct Operator : public Relation<types...> {};

2.2 Projection

A projection extracts a subset of attributes from a relation, changing the schema. It preserves relational semantics (duplicates removed).

Because duplicates may be removed, we don't know the cardinality of the result beforehand. The upper cardinality is the maximum number of tuples that can be returned, which is known beforehand.

1template<typename InputOperator, typename... outputTypes> struct Project : public Operator<outputTypes...> {
2  InputOperator input;
3  set<pair<string, string>> projections;
4
5  Project(InputOperator input, set<pair<string, string>> projections): input(input), projections(projections) {};
6};
7
8auto customer = createCustomerTable();
9auto project = Project<decltype(customer), string>(customer, {{"name", "customer_name"}});

We might instead want to implement a projection that performs a scalar function on each tuple:

1template<typename InputOperator, typename... outputTypes> struct Project : public Operator<outputTypes...> {
2  InputOperator input;
3  function<tuple<outputTypes...>(typename InputOperator::OutputType)> projectFunc;
4
5  Project(InputOperator input, function<tuple<outputTypes...>(typename InputOperator::OutputType)> projectFunc):
6    input(input), projectFunc(projectFunc) {};
7};
8
9auto customer = createCustomerTable();
10auto project = Project<decltype(customer), string>(customer, [](auto input) {
11  return get<1>(input); // Get second attribute (name)
12});

In reality, you would be able to specify either a function or a set of projections:

1template<typename InputOperator, typename... outputTypes> struct Project : public Operator<outputTypes...> {
2  InputOperator input;
3  variant<function<tuple<outputTypes...>(typename InputOperator::OutputType)>, set<pair<string, string>>> proj;
4
5  Project(InputOperator input, function<tuple<outputTypes...>(typename InputOperator::OutputType)> proj):
6    input(input), proj(proj) {};
7
8  Project(InputOperator input, set<pair<string, string>> proj):
9    input(input), proj(proj) {};
10};
11
12auto customer = createCustomerTable();
13auto p1 = Project<decltype(customer), string>(customer, {{"name", "customer_name"}});
14auto p2 = Project<decltype(customer), string>(customer, [](auto input) {
15  return get<1>(input); // Get second attribute (name)
16});

2.3 Selection

A selection produces a new relation containing tuples satisfying a condition. Doesn't change schema, changes cardinality.

1enum class Comparator { LT, LE, EQ, GE, GT };
2
3struct Column {
4  string name;
5  Column(string name): name(name) {};
6};
7
8using Value = variant<string, int, float>;
9
10struct Condition {
11  Column lhs;
12  Comparator compare;
13  variant<Column, Value> rhs;
14  Condition(Column lhs, Comparator compare, variant<Column, Value> rhs): lhs(lhs), compare(compare), rhs(rhs) {};
15};
16
17template<typename InputOperator> struct Select : public Operator<typename InputOperator::OutputType> {
18  InputOperator input;
19  Condition condition;
20
21  Select(InputOperator input, Condition condition): input(input), condition(condition) {};
22};
23
24auto customer = createCustomerTable();
25auto c2v = Condition(Column("Name"), Comparator::EQ, Value("TSNotes"));
26auto p1 = Select<decltype(customer)>(customer, c2v);

2.4 Cross (Cartesian) Product

A cross product takes 2 inputs to produce a new relation combining every tuple from the first with every tuple from the second. Changes schema.

1template<typename LeftInputOperator, typename RightInputOperator> struct CrossProduct : public Operator<Concat<typename LeftInputOperator::OutputType, typename RightInputOperator::OutputType>> {
2  LeftInputOperator left;
3  RightInputOperator right;
4
5  CrossProduct(LeftInputOperator left, RightInputOperator right): left(left), right(right) {};
6};

2.5 Union

Union produces new relation containing all tuples in either input. Requires both inputs to have the same schema (left unchanged).

1template<typename LeftInputOperator, typename RightInputOperator> struct Union : public Operator<typename LeftInputOperator::OutputType> {
2  LeftInputOperator left;
3  RightInputOperator right;
4
5  Union(LeftInputOperator left, RightInputOperator right): left(left), right(right) {};
6};

2.6 Set Difference

Difference produces a new relation containing tuples present in the left but not the right input. Doesn't change schema.

1template<typename LeftInputOperator, typename RightInputOperator> struct Difference : public Operator<typename LeftInputOperator::OutputType> {
2  LeftInputOperator left;
3  RightInputOperator right;
4
5  Difference(LeftInputOperator left, RightInputOperator right): left(left), right(right) {};
6};

2.7 Grouped Aggregation

Grouped aggregation produces a new relation from one input, by grouping together tuples with an equal attribute and performing an aggregation function on each group. Changes schema & cardinality.

1enum class AggregationFunc { MIN, MAX, SUM, AVG, COUNT };
2
3template<typename InputOperator, typename... Output> struct GroupedAggregation : public Operator<Output...> {
4  InputOperator input;
5  set<string> groupAttributes;
6  set<tuple<string, AggregationFunc, string>> aggregations;
7
8  GroupedAggregation(InputOperator input, set<string> groupAttributes, set<tuple<string, AggregationFunc, string>> aggregations):
9    input(input), groupAttributes(groupAttributes), aggregations(aggregations) {};
10};

2.8 Top-N

Top-N produces a new relation from one input selecting the tuples with the greatest values with respect to an attribute. Changes cardinality, maintains schema.

1template<typename InputOperator> struct TopN : public Operator<typename InputOperator::OutputType> {
2  InputOperator input;
3  size_t N;
4  string predicate;
5
6  TopN(InputOperator input, size_t N, string predicate) : input(input), N(N), predicate(predicate) {};
7};
8
9auto customer = createCustomerTable();
10auto top2 = TopN<decltype(customer)>(customer, 2, "id");

Back to Home

Table of Contents