Relational Algebra
1. Data Processing Systems
Typical data intensive applications involve:
- Online Transaction Processing (OLTP): lots of small updates to a persistent database, focus on throughput & ACID.
- Online Analytical Processing (OLAP): running a single data analysis task, focus on latency.
- Reporting: running many data analysis tasks within a time budget, focus on resource efficiency. Queries are known in advance.
- Hybrid Transactional/Analytical Processing (HTAP): small updates interwoven with larger analytics.
A databus management system should be Atomic, Consistent, Isolated, Durable (ACID).
2. Relational Algebra with C++
A logical representation of relational operations. Actual systems don't implement this directly. A relation is almost a set of tuples:
- Each row represents an
-tuple of -ary relation . - The ordering of rows is immaterial.
- All rows are distinct.
- The ordering of columns is significant.
- The significance of a column is partially conveyed by its name.
1template<template... types> struct Relation { 2 using OutputType = tuple<types...>; 3 set<tuple<types...>> data; 4 array<string, sizeof...(types)> schema; 5 6 Relation(array<string, sizeof...(types)> schema, set<tuple<types...>> data): schema(schema), data(data) {} 7}; 8 9auto createCustomerTable() { 10 Relation<int, string, string> customer( 11 {"id", "name", "address"}, 12 {{1, "Alice", "123 Main St"}, 13 {2, "Bob", "456 Maple Ave"}, 14 {3, "Charlie", "789 Oak Dr"}}); 15 return customer; 16}
To make relational operators we can make use of the OutputType alias.
2.1 Relational Expressions
An expression is composed of operators. Cardinality is the number of tuples in a set (used to estimate cost).
template<typename... types> struct Operator : public Relation<types...> {};
2.2 Projection
A projection
Because duplicates may be removed, we don't know the cardinality of the result beforehand. The upper cardinality is the maximum number of tuples that can be returned, which is known beforehand.
1template<typename InputOperator, typename... outputTypes> struct Project : public Operator<outputTypes...> { 2 InputOperator input; 3 set<pair<string, string>> projections; 4 5 Project(InputOperator input, set<pair<string, string>> projections): input(input), projections(projections) {}; 6}; 7 8auto customer = createCustomerTable(); 9auto project = Project<decltype(customer), string>(customer, {{"name", "customer_name"}});
We might instead want to implement a projection that performs a scalar function on each tuple:
1template<typename InputOperator, typename... outputTypes> struct Project : public Operator<outputTypes...> { 2 InputOperator input; 3 function<tuple<outputTypes...>(typename InputOperator::OutputType)> projectFunc; 4 5 Project(InputOperator input, function<tuple<outputTypes...>(typename InputOperator::OutputType)> projectFunc): 6 input(input), projectFunc(projectFunc) {}; 7}; 8 9auto customer = createCustomerTable(); 10auto project = Project<decltype(customer), string>(customer, [](auto input) { 11 return get<1>(input); // Get second attribute (name) 12});
In reality, you would be able to specify either a function or a set of projections:
1template<typename InputOperator, typename... outputTypes> struct Project : public Operator<outputTypes...> { 2 InputOperator input; 3 variant<function<tuple<outputTypes...>(typename InputOperator::OutputType)>, set<pair<string, string>>> proj; 4 5 Project(InputOperator input, function<tuple<outputTypes...>(typename InputOperator::OutputType)> proj): 6 input(input), proj(proj) {}; 7 8 Project(InputOperator input, set<pair<string, string>> proj): 9 input(input), proj(proj) {}; 10}; 11 12auto customer = createCustomerTable(); 13auto p1 = Project<decltype(customer), string>(customer, {{"name", "customer_name"}}); 14auto p2 = Project<decltype(customer), string>(customer, [](auto input) { 15 return get<1>(input); // Get second attribute (name) 16});
2.3 Selection
A selection
1enum class Comparator { LT, LE, EQ, GE, GT }; 2 3struct Column { 4 string name; 5 Column(string name): name(name) {}; 6}; 7 8using Value = variant<string, int, float>; 9 10struct Condition { 11 Column lhs; 12 Comparator compare; 13 variant<Column, Value> rhs; 14 Condition(Column lhs, Comparator compare, variant<Column, Value> rhs): lhs(lhs), compare(compare), rhs(rhs) {}; 15}; 16 17template<typename InputOperator> struct Select : public Operator<typename InputOperator::OutputType> { 18 InputOperator input; 19 Condition condition; 20 21 Select(InputOperator input, Condition condition): input(input), condition(condition) {}; 22}; 23 24auto customer = createCustomerTable(); 25auto c2v = Condition(Column("Name"), Comparator::EQ, Value("TSNotes")); 26auto p1 = Select<decltype(customer)>(customer, c2v);
2.4 Cross (Cartesian) Product
A cross product
1template<typename LeftInputOperator, typename RightInputOperator> struct CrossProduct : public Operator<Concat<typename LeftInputOperator::OutputType, typename RightInputOperator::OutputType>> { 2 LeftInputOperator left; 3 RightInputOperator right; 4 5 CrossProduct(LeftInputOperator left, RightInputOperator right): left(left), right(right) {}; 6};
2.5 Union
Union
1template<typename LeftInputOperator, typename RightInputOperator> struct Union : public Operator<typename LeftInputOperator::OutputType> { 2 LeftInputOperator left; 3 RightInputOperator right; 4 5 Union(LeftInputOperator left, RightInputOperator right): left(left), right(right) {}; 6};
2.6 Set Difference
Difference
1template<typename LeftInputOperator, typename RightInputOperator> struct Difference : public Operator<typename LeftInputOperator::OutputType> { 2 LeftInputOperator left; 3 RightInputOperator right; 4 5 Difference(LeftInputOperator left, RightInputOperator right): left(left), right(right) {}; 6};
2.7 Grouped Aggregation
Grouped aggregation
1enum class AggregationFunc { MIN, MAX, SUM, AVG, COUNT }; 2 3template<typename InputOperator, typename... Output> struct GroupedAggregation : public Operator<Output...> { 4 InputOperator input; 5 set<string> groupAttributes; 6 set<tuple<string, AggregationFunc, string>> aggregations; 7 8 GroupedAggregation(InputOperator input, set<string> groupAttributes, set<tuple<string, AggregationFunc, string>> aggregations): 9 input(input), groupAttributes(groupAttributes), aggregations(aggregations) {}; 10};
2.8 Top-N
Top-N
1template<typename InputOperator> struct TopN : public Operator<typename InputOperator::OutputType> { 2 InputOperator input; 3 size_t N; 4 string predicate; 5 6 TopN(InputOperator input, size_t N, string predicate) : input(input), N(N), predicate(predicate) {}; 7}; 8 9auto customer = createCustomerTable(); 10auto top2 = TopN<decltype(customer)>(customer, 2, "id");