System Interfaces
We want abstractions to make developers productive, to multiplex hardware (unused hardware is wasted purchase cost & energy), to cache (storage cost increases with performance), and to avoid communication (which is expensive).
1. OS Interfaces
Abstraction is everywhere:
- CPU / threads: scheduling.
- Memory: paging.
- Security: syscalls.
The OS implements protected objects inside the kernel, which mediate all accesses (e.g. sockets, files, etc). This can become expensive! A syscall:
1. Store arguments in registers.
2. Execute the syscall instruction.
3. Sanitize the environment.
4. Save state.
5. Execute the syscall handler.
6. Restore state.
7. Execute the system return instruction.
Instead, we can use a library OS: the OS kernel is linked as a library into a single application. This is very adaptable to application needs and avoids privilege-separation overhead, but gives up privilege separation itself.
1.1 Memory & Paging
Fast storage is expensive, so we want to limit its size by caching. However, a cache miss incurs an expensive copy, so we must reduce misses by predicting application access patterns.
1.2 Security & Syscall
Buffered IPC is when the kernel mediates communication between processes. This is simple but can be expensive: we must maintain two copies of the data, one in user space and one in kernel space. However, the sender does not block as long as buffer space is available.
Unbuffered IPC is when processes communicate directly. This is more efficient but more complex: a process blocks if the other is not ready to send/receive.
2. CPU Interfaces
The instruction set architecture (ISA) also has tradeoffs in:
- Security / Performance (privilege levels).
- Complexity / Performance (memory consistency models).
- Adaptability / Performance (specialized hardware vs volume cost).
2.1 Caching
A CPU chip has many caches:
- Exploit spatial & temporal locality.
- ISA extended to control & flush entries for cross-core consistency.
- Software must be optimised to maximise cache hits.
We have a hardware cache hierarchy (L1, L2, ..., LLC) and an MMU cache (the TLB) to translate virtual addresses to physical addresses.
The MMU also has a page walk cache (PWC), which caches intermediate page table entries to speed up address translation on a TLB miss. Each level in the MMU page table has a separate PWC, the OS can control the size of each PWC, and all four per-level PWCs can be accessed in parallel.
We must also keep the memory-translation caches (TLB, PWC) consistent on a page downgrade (e.g. from read-write to read-only). A TLB shootdown is when the OS sends an inter-processor interrupt (IPI) to all cores to invalidate their TLB entries for the downgraded page. This is expensive and scales poorly with the number of cores. To mitigate this, we could:
- Avoid downgrading pages.
- Asynchronous downgrade syscalls.
- Hardware acceleration.
2.2 NUMA
The CPU is connected to memory through an extensible bus. Non Uniform Memory Access (NUMA) is when each CPU has its own local memory. This is usually paired with a cache coherent interconnect (e.g. Intel QPI). Accessing your own memory is always faster than accessing another CPU's memory, so we want to write code with good locality.
2.3 Virtualization
Virtualization allows multiple virtual machines to run on a single physical machine, reducing cost with higher security than containers / processes. However, VM exits are very expensive, involving long microcode sequences to switch between guest and host. Examples of VM exits are hypercalls, trapped instructions, trapped memory accesses & interrupts. A VM exit:
1. Trapped instruction / interrupt.
2. Sanitize the environment.
3. Save state.
4. Execute the handler.
5. Restore state.
6. Execute the VM resume instruction.
VMs also use nested page tables, which means a PT traversal goes from 4 to 24 memory accesses, since each guest page-table access must itself be translated through the host page tables. To mitigate this, we could:
- Let the hypervisor manage PTs (shadow paging), but this requires VM-exits to manage guest PTs.
- Use guest-host page walk caches, of which we need 6.
3. IO Interfaces
IO interfaces describe data going in/out of the machine (e.g. network, storage, GPUs, etc).
3.1 Device as Memory
Modern devices are accessed as memory using memory-mapped IO (MMIO). Each device is assigned a range of physical addresses at boot time, corresponding to its registers and buffers (exposed through Base Address Registers, BARs).
A modern device may have both:
- Circular Queues with request / response buffers in the host memory.
- A BAR to check / configure queue changes using MMIO.
3.2 Device Interconnect
PCIe is a common interconnect for devices, organised as point-to-point serial links (grouped into lanes) in a tree topology rooted at the CPU.
3.3 Device Models
We could use an interrupt driven model. This works well but has an overhead of context switching:
- MMIO writes to program device.
- Device reads request buffers or writes response buffers.
- OS waits for an interrupt.
- MMIO reads to check result.
A polling model has no context switch and can give an answer faster (if the polling time is short):
- MMIO writes to program device.
- Device reads request buffers or writes response buffers.
- OS repeatedly polls MMIO to check result.
A hybrid model would poll for some time, then tell the device to interrupt if not done. This is more complex but can give better performance.
3.4 Device Virtualization
Accessing a virtual device is a memory trap. By default, we trap & emulate, where the hypervisor unmaps device BARs to trap MMIO accesses, and emulates device operation on every trap. This is extremely expensive, requiring multiple VM exits per device operation.
We could paravirtualize the device by mapping shared memory between the VM and hypervisor; a single trapped MMIO access (a doorbell) notifies the hypervisor of new requests. This is much faster, but requires changes to the guest OS and device drivers.
Alternatively, we could passthrough by mapping device BARs directly to the VM, and configure the IOMMU to allow the VM to access the device. This is the fastest, but eliminates isolation between VMs, can cause security issues, and requires hardware support from IO devices to have a different state per guest VM.
3.5 OS-level Interfaces
Traditional blocking interfaces (POSIX read/write) are simple but incur high overhead. Every call requires a hardware transition to kernel mode, at least one data copy, and potentially two context switches if the thread must be rescheduled while waiting.
Non-blocking interfaces aim to eliminate context switches and data copies to preserve cache locality.
- Asynchronous: The application submits an operation and continues other work until the kernel signals completion (e.g., Linux AIO for storage).
- Event-based: The application expresses interest in specific file descriptors and only operates on them after polling the kernel to confirm the operation will not block (e.g., Linux epoll for networking).
For maximum efficiency, direct device assignment (passthrough) allows user-mode applications to perform MMIO directly to a device's virtual function. This bypasses the OS entirely to eliminate syscalls and intermediate copies. Applications typically use specialized libraries like DPDK (networking) or SPDK (storage) to manage these devices.