Vector Databases
Vector databases store vector embeddings of data to enable similarity searches. To support this, we need to be able to do certain operations on vectors efficiently.
1. Vector Arithmetic
These are essentially element-wise operations on vectors. For example:
```cpp
#include <array>
#include <cstddef>

constexpr size_t Dimension = 128;  // example dimensionality

// Element-wise vector addition.
std::array<float, Dimension> add(std::array<float, Dimension> a,
                                 std::array<float, Dimension> b) {
  std::array<float, Dimension> result = a;
  for (size_t i = 0; i < Dimension; i++) {
    result[i] += b[i];
  }
  return result;
}
```
2. Clustering
This involves grouping similar vectors together. There are many methods for this (hierarchical, DBSCAN, Voronoi tessellation, k-means, etc.). K-means works by:
1. Randomly selecting k centroids.
2. Assigning each vector to the nearest centroid.
3. Recomputing centroids as the mean of assigned vectors.
4. Repeating steps 2 and 3 until convergence.
Let's implement k-means:
```cpp
#include <algorithm>
#include <array>
#include <random>
#include <vector>

typedef std::array<float, Dimension> Datapoint;

// Helpers assumed to be defined elsewhere: a distance metric (see 2.1) and
// element-wise arithmetic on Datapoints.
float distance(Datapoint, Datapoint);
Datapoint& operator+=(Datapoint&, Datapoint);
Datapoint& operator/=(Datapoint&, size_t);

size_t findClosestCentroid(Datapoint point, const std::vector<Datapoint>& centroids);
std::vector<Datapoint> calculateCentroids(
    const std::vector<size_t>& assignments, const std::vector<Datapoint>& data, size_t k);

auto rng = std::mt19937(std::random_device{}());

std::vector<std::vector<Datapoint>> kMeans(std::vector<Datapoint> data, size_t k) {
  // Step 1: randomly select k centroids from the data.
  auto centroids = std::vector<Datapoint>();
  std::sample(data.begin(), data.end(), std::back_inserter(centroids), k, rng);

  auto centroidAssignments = std::vector<size_t>(data.size(), 0u);
  for (bool converged = false; !converged;) {
    // Step 2: assign each vector to the nearest centroid.
    for (size_t i = 0u; i < data.size(); i++) {
      centroidAssignments[i] = findClosestCentroid(data[i], centroids);
    }
    // Step 3: recompute centroids; we have converged once they stop moving.
    auto newCentroids = calculateCentroids(centroidAssignments, data, k);
    converged = (newCentroids == centroids);
    centroids = newCentroids;
  }

  // Group the data by assigned centroid.
  auto result = std::vector<std::vector<Datapoint>>(k, std::vector<Datapoint>());
  for (size_t i = 0u; i < data.size(); i++) {
    result[centroidAssignments[i]].push_back(data[i]);
  }
  return result;
}

size_t findClosestCentroid(Datapoint point, const std::vector<Datapoint>& centroids) {
  size_t closestIndex = 0u;
  float closestDistance = distance(point, centroids[0]);
  for (size_t i = 1u; i < centroids.size(); i++) {
    float dist = distance(point, centroids[i]);
    if (dist < closestDistance) {
      closestDistance = dist;
      closestIndex = i;
    }
  }
  return closestIndex;
}

std::vector<Datapoint> calculateCentroids(
    const std::vector<size_t>& assignments, const std::vector<Datapoint>& data, size_t k) {
  // Sum the vectors assigned to each centroid, then divide by the count.
  auto result = std::vector<Datapoint>(k, Datapoint{});
  auto sizes = std::vector<size_t>(k, 0u);
  for (size_t i = 0u; i < data.size(); i++) {
    result[assignments[i]] += data[i];
    sizes[assignments[i]] += 1;
  }
  for (size_t i = 0u; i < k; i++) {
    result[i] /= sizes[i];  // NB: assumes every cluster is non-empty
  }
  return result;
}
```
2.1 Distance Metrics
To implement the distance function, we can use various metrics (sketched after this list):
- Euclidean Distance: the Pythagorean theorem (commonly optimised by omitting the square root, which is valid because the square root is monotonic). Used when position in vector space matters.
- Cosine Similarity: measures the cosine of the angle between two vectors. Useful when only direction matters.
- Manhattan Distance: the sum of absolute differences. Useful with high-dimensional data.
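As minimal sketches, reusing `Datapoint` and `Dimension` from above (the function names here are illustrative):

```cpp
#include <cmath>

// Squared Euclidean distance: the square root is omitted, which preserves ordering.
float squaredEuclidean(Datapoint a, Datapoint b) {
  float sum = 0.0f;
  for (size_t i = 0; i < Dimension; i++) {
    float d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Cosine similarity: dot product divided by the product of magnitudes.
// Note this is a similarity, not a distance; 1 - similarity can serve as one.
float cosineSimilarity(Datapoint a, Datapoint b) {
  float dot = 0.0f, normA = 0.0f, normB = 0.0f;
  for (size_t i = 0; i < Dimension; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Manhattan distance: sum of absolute coordinate differences.
float manhattan(Datapoint a, Datapoint b) {
  float sum = 0.0f;
  for (size_t i = 0; i < Dimension; i++) {
    sum += std::abs(a[i] - b[i]);
  }
  return sum;
}
```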
3. Nearest Neighbour Search
A brute-force k-nearest-neighbour search scans the whole database, keeping the best candidates seen so far in a max-heap keyed on distance:
```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

std::vector<Datapoint> database;  // the stored vectors

std::vector<Datapoint> nearestNeighbours(Datapoint query, size_t numNeighbours) {
  // Max-heap ordered by distance, so the top element is the worst candidate
  // found so far. (Comparing whole pairs with std::less would order by the
  // Datapoint itself, not by the distance.)
  auto cmp = [](const std::pair<Datapoint, float>& a,
                const std::pair<Datapoint, float>& b) { return a.second < b.second; };
  auto candidates = std::priority_queue<
      std::pair<Datapoint, float>,
      std::vector<std::pair<Datapoint, float>>,
      decltype(cmp)>(cmp);

  // Seed the heap with the first numNeighbours vectors.
  size_t i = 0u;
  for (; i < std::min(numNeighbours, database.size()); i++) {
    candidates.push({database[i], distance(query, database[i])});
  }
  // For each remaining vector, replace the worst candidate if this one is closer.
  for (; i < database.size(); i++) {
    float dist = distance(query, database[i]);
    if (dist < candidates.top().second) {
      candidates.pop();
      candidates.push({database[i], dist});
    }
  }

  // The heap pops farthest-first, so reverse to return nearest-first.
  auto result = std::vector<Datapoint>();
  while (!candidates.empty()) {
    result.push_back(candidates.top().first);
    candidates.pop();
  }
  std::reverse(result.begin(), result.end());
  return result;
}
```
4. Index
Joins are a natural extension of similarity search: iterate over the larger relation and perform similarity lookups into the smaller one. However, the smaller relation can now be indexed to accelerate those lookups. There are many ways to index vector data.
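As a minimal sketch of the unindexed case (the `similarityJoin` name and `maxDist` threshold are illustrative assumptions):

```cpp
#include <utility>
#include <vector>

// Naive nested-loop similarity join: for each vector in the larger relation,
// scan the smaller relation for vectors within maxDist. An index on `smaller`
// would replace the inner loop with an index lookup.
std::vector<std::pair<Datapoint, Datapoint>> similarityJoin(
    const std::vector<Datapoint>& larger,
    const std::vector<Datapoint>& smaller,
    float maxDist) {
  std::vector<std::pair<Datapoint, Datapoint>> result;
  for (const auto& l : larger) {
    for (const auto& s : smaller) {
      if (distance(l, s) < maxDist) {
        result.push_back({l, s});
      }
    }
  }
  return result;
}
```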
4.1 Z-Ordering
Z-ordering reduces a multi-dimensional point to a single value by interleaving the bits of its coordinates:
```cpp
#include <array>
#include <cstdint>

constexpr auto Dimensionality = 2;

// Interleave the bits of each coordinate into a single z-value.
uint64_t zorder(std::array<int, Dimensionality> point) {
  uint64_t zvalue = 0;
  for (int bit = sizeof(int) * 8 - 1; bit >= 0; bit--) {
    for (int dim = 0; dim < Dimensionality; dim++) {
      zvalue |= (uint64_t)((point[dim] >> bit) & 1) << (bit * Dimensionality + dim);
    }
  }
  return zvalue;
}
```
This ordering preserves spatial locality: points that are close in the vector space tend to have similar z-values. Z-ordering is simple, reasonably effective, and combines well with existing indexing structures like B-trees. However, it creates long codewords for high-dimensional inputs, and locality is only approximate: points on opposite sides of one of the Z's diagonal jumps (the shaft of the Z) are spatially close but far apart in z-order, so worst-case performance remains poor.
4.2 Locality Sensitive Hashing
Regular hash functions aim to avoid collisions between unequal values. Locality Sensitive Hashing (LSH) instead aims to avoid collisions between dissimilar values, while letting similar values collide often.
For example, we could hash each value to a subset of its bits, either a uniformly random sample of bit positions or simply the higher-order bits.
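A minimal sketch of the bit-sampling idea, assuming unsigned integer coordinates (the `lshBucket` helper and its parameters are hypothetical):

```cpp
#include <array>
#include <cstdint>

constexpr size_t Dims = 2;  // coordinates per point
constexpr int Bits = 4;     // high-order bits kept per coordinate

// Keep only the top `Bits` bits of each coordinate: points that agree in
// those bits (i.e. are numerically close) fall into the same bucket.
uint64_t lshBucket(std::array<uint32_t, Dims> point) {
  uint64_t bucket = 0;
  for (size_t dim = 0; dim < Dims; dim++) {
    uint32_t highBits = point[dim] >> (32 - Bits);
    bucket = (bucket << Bits) | highBits;
  }
  return bucket;
}
```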
4.3 Inverted File Index
An Inverted File (IVF) index builds on clustering by using the centroids as index keys: each centroid maps to the vectors in its cluster. To build and query an IVF index:
```cpp
#include <utility>
#include <vector>

// Mean of a cluster, reusing the Datapoint operators declared earlier.
Datapoint calcCentroid(const std::vector<Datapoint>& points) {
  Datapoint centroid = {};
  for (auto point : points) {
    centroid += point;
  }
  centroid /= points.size();
  return centroid;
}

class IVF {
  // Each entry maps a centroid to the vectors of its cluster.
  std::vector<std::pair<Datapoint, std::vector<Datapoint>>> index;

public:
  IVF(std::vector<Datapoint> data, size_t k) {
    for (auto cluster : kMeans(data, k)) {
      index.push_back({calcCentroid(cluster), cluster});
    }
  }

  std::vector<Datapoint> findNeighbours(Datapoint query, float maxDist) {
    // Find the centroid closest to the query...
    size_t closestCentroid = 0;
    for (size_t i = 0; i < index.size(); i++) {
      if (distance(query, index[i].first) < distance(query, index[closestCentroid].first)) {
        closestCentroid = i;
      }
    }
    // ...then scan only that centroid's cluster.
    auto result = std::vector<Datapoint>();
    for (auto& it : index[closestCentroid].second) {
      if (distance(query, it) < maxDist) {
        result.push_back(it);
      }
    }
    return result;
  }
};
```
If we have many centroids, finding the closest one is itself expensive, so a second-level index structure (e.g., a KD-tree) can be used for the centroids.
4.4 Graph-based Vector Indices
A Hierarchical Navigable Small World (HNSW) graph is built by connecting each vector to its nearest neighbours. The graph has multiple layers, with higher layers containing fewer nodes. Searching starts at the top layer and descends to lower layers, refining the search at each level.
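As a rough sketch of the per-layer step (the `Graph` type and `greedySearch` helper below are illustrative assumptions, not the full HNSW algorithm), each layer is searched greedily and the result seeds the next layer down:

```cpp
#include <vector>

using NodeId = size_t;

// One layer of a navigable small world graph.
struct Graph {
  std::vector<Datapoint> points;
  std::vector<std::vector<NodeId>> neighbours;  // adjacency lists
};

// Greedily walk towards the query: keep moving to any neighbour that is
// closer until no neighbour improves on the current node.
NodeId greedySearch(const Graph& g, Datapoint query, NodeId entry) {
  NodeId current = entry;
  bool improved = true;
  while (improved) {
    improved = false;
    for (NodeId next : g.neighbours[current]) {
      if (distance(query, g.points[next]) < distance(query, g.points[current])) {
        current = next;
        improved = true;
      }
    }
  }
  // A local minimum on this layer; HNSW uses it as the entry point for the
  // next (denser) layer down.
  return current;
}
```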