Scale Out Systems
Scaling out increases performance by adding more machines (clusters), as opposed to scaling up, which adds more resources to a single machine.
1. Design Considerations
To design well, we need to take into account metrics & tools.
1.1 Metrics
We want to take into account the tenant (client), who has a service level objective (SLO) describing what the tenant wants, caring about performance per dollar and latency. The operator offers a service level agreement (SLA) describing what the tenant pays for, caring about infrastructure cost.
An SLO should define target metrics and the required performance for each. This is done by studying the structure and performance of the application's critical path. We should derive a metric SLO for every component.
- Latency varies, so we should define a latency SLO as a percentile of the latency distribution (e.g., 99th percentile latency < 100ms). This ensures that the system meets performance requirements for most requests, even under variable load.
- Ideally, energy consumption is proportional to utilization, but in reality there is often a significant fixed energy cost, and hardware is not energy proportional. We should define an energy SLO as a target energy per request (e.g., < 1 Joule/request) to ensure energy efficiency.
- We want to almost maximize utilization to minimize cost, but we also need to maintain headroom for traffic spikes and avoid performance degradation. We should define a utilization SLO as a target average utilization (e.g., 70% average CPU utilization) to balance cost and performance.
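The percentile-based SLO checks above can be sketched in a few lines. This is a minimal sketch over synthetic latency samples; the latency distribution and SLO thresholds are illustrative assumptions, not measurements.

```python
import random

def percentile(samples, p):
    """Return the p-th percentile (0-100) of a list of samples (nearest rank)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, int(round(p / 100 * len(ordered))) - 1))
    return ordered[k]

# Synthetic per-request latencies in milliseconds (exponential, mean 20 ms).
random.seed(42)
latencies_ms = [random.expovariate(1 / 20) for _ in range(10_000)]

p99 = percentile(latencies_ms, 99)
meets_latency_slo = p99 < 100   # SLO: 99th-percentile latency < 100 ms
```

Checking the 99th percentile rather than the mean is what makes the SLO robust to variable load: a few slow outliers move the mean little but show up directly in the tail.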
1.2 Tools
Many components communicate over the network, and we want to maximize CPU performance. We can use asynchronous communication over the network so that CPUs do not block waiting on it.
We also want an elastic system - one that can scale up and down in response to load. This requires automated scaling based on monitoring metrics (e.g., CPU utilization, request latency) to add or remove resources as needed.
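Automated scaling usually follows a simple proportional rule. The sketch below assumes the rule used by autoscalers such as Kubernetes' HPA (desired = ceil(current × observed / target)); the function name and bounds are illustrative.

```python
import math

def desired_replicas(current, avg_utilization, target=0.70,
                     min_replicas=1, max_replicas=32):
    """Scale replica count so average utilization approaches the target.

    Proportional rule: desired = ceil(current * observed / target),
    clamped to [min_replicas, max_replicas].
    """
    desired = math.ceil(current * avg_utilization / target)
    return max(min_replicas, min(max_replicas, desired))

# Overloaded: 4 replicas at 90% utilization -> scale out to 6.
# Underloaded: 4 replicas at 35% utilization -> scale in to 2.
```

Running this rule periodically on monitored metrics is the core of an elastic control loop: utilization above target adds replicas, utilization below target removes them.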
2. Scale Out Architectures
A simple recipe for scaling out is to break the application up into concurrent components. This is done by defining asynchronous communication APIs and protocols. We can then deploy components on different nodes elastically (different number of processors per component). We can handle failures by monitoring and restarting dead processes.
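The recipe above can be sketched with message queues standing in for the network, assuming asyncio-style components; the component names are illustrative, and failure monitoring/restart is elided to a comment.

```python
import asyncio

async def worker(name, inbox, outbox):
    # One concurrent component: consume requests, produce responses.
    while True:
        item = await inbox.get()
        await outbox.put(f"{name} handled {item}")
        inbox.task_done()

async def main():
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    # Elastic deployment: pick a different worker count per component.
    tasks = [asyncio.create_task(worker(f"w{i}", inbox, outbox))
             for i in range(3)]
    for req in range(5):
        await inbox.put(req)
    await inbox.join()                   # wait until every request is processed
    results = [outbox.get_nowait() for _ in range(5)]
    for t in tasks:
        t.cancel()                       # monitoring/restart of dead workers elided
    return results

results = asyncio.run(main())
```

The queues are the asynchronous API boundary: adding or removing workers changes capacity without changing the protocol, which is exactly what makes elastic deployment possible.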
There is a productivity / performance trade-off from scaling out. We can achieve better performance by breaking the application into more components, but this increases complexity and reduces productivity. This is made worse by unexpected failures and slowdown.
Self-managing scale-out systems give all the benefits of consolidation (lower cost), but also all the disadvantages of self-management (monitoring, maintenance, scheduling, etc.).
2.1 Microservices
We think of an application as a collection of microservices - one service (program) per small task. System capacity can be increased by increasing service instances, all connected through a network. Some kind of monitoring is required to scale up and down the number of instances.
This has good application architecture (modular), but systems management is complex. We have to manage many expensive VMs (long boot time, duplicate memory). Additionally, management is complex across monitoring, VM image distribution, and life-cycle management.
Instead of VMs, we can use containers (lightweight, fast startup, shared memory), but we still have to manage the container images and life-cycle.
2.2 Serverless
Here, the tenant just declares what to run (a function), which is simpler to manage, and pays per use. The operator handles how to run it by abstracting microservice instances, load balancing between instances, and auto-scaling based on load.
Aside: Kubernetes Autoscaling Example
For example, we could use Kubernetes autoscaling:
```yaml
type: Resource
resource:
  name: cpu
  target:
    type: Utilization
    averageUtilization: 70
```

This configuration tells Kubernetes to automatically scale the number of instances of a microservice based on CPU utilization, aiming to maintain an average CPU utilization of 70%. If the CPU utilization exceeds this threshold, Kubernetes adds more instances to handle the load; if it falls below, it reduces the number of instances to save resources.
We don't set the CPU utilization target to 100%, so the system has headroom to catch up on small inefficiencies.
This is a simple architecture for the tenant, but the operator has to manage many short-lived instances (standing problem), which is a challenge for scheduling and monitoring.
2.3 Function as a Service (FaaS)
This builds on serverless by decoupling tenant logic from the service. The tenant only provides the business logic (the function) and the trigger. It is a stateless function with very fine granularity. The operator provides and manages the VMs, containers, and the runtime. Often there is a limited set of predefined runtimes (e.g., Python, Node.js) that tenants can choose from.
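A FaaS function is just the business logic behind a trigger. This sketch assumes the common event/context handler convention (used, e.g., by AWS Lambda-style platforms); the handler name and event fields are illustrative.

```python
import json

def handler(event, context=None):
    """Stateless FaaS-style function: business logic only.

    The operator supplies the runtime, container, and trigger wiring;
    the tenant supplies just this function. No state survives between calls.
    """
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

Statelessness is what lets the operator freely start, stop, and multiplex instances: any instance can serve any invocation.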
There are limited VM/container/runtime configurations, which simplifies management for the operator (deep ecosystem integration, easy to predict, prefetch & reuse).
3. Communication Mechanisms
We need to define how application data maps to network traffic. This is complicated, requiring a network stack (low-level code for manipulating raw bits), encryption, compression, etc.
3.1 Web Based Technologies
HTTP REST + JSON is a common solution with a self-describing representation. It is a server API on top of HTTP, with plenty of frameworks, and it is easy to debug. It has loose principles, with headers for caching, statelessness, layering, etc. However, it is inefficient for high-performance applications.
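The self-describing/inefficient trade-off can be made concrete by comparing a JSON payload with an equivalent fixed-schema binary encoding. The field names and values below are illustrative.

```python
import json
import struct

# Self-describing JSON: field names travel with every message.
json_payload = json.dumps({"user_id": 42, "balance": 1050}).encode()

# Equivalent binary encoding when both sides share the schema: two 32-bit ints.
binary_payload = struct.pack("!ii", 42, 1050)

# The JSON message is several times larger for the same information,
# and it also costs CPU cycles to parse text instead of fixed offsets.
ratio = len(json_payload) / len(binary_payload)
```

This size and parsing overhead on every message is exactly why binary RPC encodings are preferred for high-performance paths.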
3.2 Remote Procedure Calls (RPC)
RPCs define the server API as function calls. They can be asynchronous, with many frameworks available, and typically operate on binary data. They are simple to understand, requiring only function-call definitions, with the RPC framework generating client/server code stubs. However, RPC is tightly coupled to the server API and can be difficult to debug.
Aside: gRPC Example
For example:
```proto
service Greeter {
  rpc SayHello(HelloRequest) returns (HelloReply);
}

message HelloRequest { string name = 1; }
message HelloReply { string message = 1; }
```
Both REST and RPC end up transferring data buffers, which means lots of cycles spent on buffer transport (standing problem!).
3.3 Remote DMA (RDMA)
RDMA can move data directly between the memory of two machines without involving the CPU, which can significantly reduce latency and increase throughput. It is often used in high-performance computing and data center environments. Contains streams (similar to sockets) and direct memory access. Can operate in reliable / unreliable modes (similar to TCP/UDP). This moves OS network stack into NIC, but hard to program and debug, and is tightly coupled to the server API. To allow this:
- Application preregisters memory buffer with the NIC using the OS.
- OS driver sends virtual memory / physical memory translations to the NIC (for the buffer).
- NIC associates translations to each buffer ID + offset.
- Reception of RDMA operations uses the NIC's translations for that buffer.
- Paging application memory out requires invalidating the NIC's translations (very expensive!).
Data Center Tax
The data center tax says that around 30% of CPU cycles are spent on communication. This tax grows with growing data center size. RDMA helps with this.
4. Taming Latency
Not all requests take the same time (transient overheads and fundamental application-level differences). Utilization matters: reducing utilization reduces tail latency, and increasing utilization reduces costs.
4.1 Load Balancing
Load Balancing means distributing tasks (requests) over a set of resources.
Service Latency is defined by arrival distribution (time between arrivals, typically Poisson), arrival assignment (who and when is assigned to a process) and service time distribution (time to process a request, ideally constant but typically exponential).
Queueing theory statistically models the cumulative effects of arrival and service. The parameters are whether it is a single queue (all requests go to one queue) or multiple queues (requests are distributed across multiple queues), and whether it is first-come-first-served (FCFS) or processor sharing (PS) (requests are processed in parallel). The goals are to minimize tail latency while remaining work-conserving (a core is never idle if there's work to do).
Head of Line (HOL) Blocking is when a request at the front of the queue is delayed, causing all subsequent requests to be delayed as well. This can lead to increased latency and reduced utilization.
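The effect of queue structure on tail latency can be demonstrated with a small simulation: Poisson arrivals and exponential service times, comparing one shared FCFS queue against static random assignment to per-core queues. Parameters (16 cores, 80% load) are illustrative assumptions.

```python
import random

def simulate(num_cores, single_queue, num_requests=20_000, load=0.8):
    """Return the p99 latency of an FCFS system.

    single_queue=True: a request goes to whichever core frees up first
    (one shared queue). single_queue=False: each request is randomly
    assigned a core up front (independent per-core queues).
    """
    random.seed(0)                       # deterministic comparison
    mean_service = 1.0
    arrival_rate = load * num_cores / mean_service
    t = 0.0
    free_at = [0.0] * num_cores          # when each core next becomes free
    latencies = []
    for _ in range(num_requests):
        t += random.expovariate(arrival_rate)      # Poisson arrivals
        service = random.expovariate(1 / mean_service)
        if single_queue:
            core = min(range(num_cores), key=lambda c: free_at[c])
        else:
            core = random.randrange(num_cores)     # static random assignment
        start = max(t, free_at[core])
        free_at[core] = start + service
        latencies.append(free_at[core] - t)        # queueing + service time
    latencies.sort()
    return latencies[int(0.99 * len(latencies))]

p99_shared = simulate(num_cores=16, single_queue=True)
p99_percore = simulate(num_cores=16, single_queue=False)
```

The shared queue has a far lower tail: a request is never stuck behind a long job on one core while other cores sit idle, which is the HOL-blocking and work-conservation argument in miniature.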
Models include:
- IX (Scale Up): a multi-queue system using 32 cores, 1 queue per core, with FCFS scheduling. It is perfect on constant / exponential service times, with low tail latency and high throughput.
- ZygOS (Scale Up): almost a single-queue system, but gracefully supports large service-time dispersion. It is work-conserving, avoiding HOL blocking.
- R2P2 (Scale Out): has a single queue for all service instances, with a global queue for establishing connections. It will not rebalance long lived connections.
4.2 Load Proportionality
When using a scale-up model, we must monitor application metrics and determine when to scale the system up or down (core count, frequency, power, etc.).
When using a scale-out model, there is large waste when multiple nodes are underutilized, so we must exploit elasticity. This is simple with microservices, and efficient with serverless / FaaS.