System Interfaces
We want abstractions to make developers productive, to multiplex hardware (unused hardware is wasted purchase cost & energy), to cache (storage cost increases with performance), and to avoid communication (which is expensive).
1. OS Interfaces
Abstraction is everywhere:
- CPU / threads: scheduling.
- Memory: paging.
- Security: syscalls.
The OS implements protected objects inside the kernel, which mediate all accesses (e.g. sockets, files, etc). This can become expensive! A syscall:
1. Store arguments in registers.
2. Execute the syscall instruction.
3. Sanitize the environment.
4. Save state.
5. Execute the syscall handler.
6. Restore state.
7. Execute the system return instruction.
Instead, we can use a library OS: the OS kernel is linked as a library into a single application. This is very adaptable to application needs and avoids privilege-separation overhead, but gives up privilege separation itself.
1.1 Memory & Paging
Fast storage is expensive, so we want to limit its size by caching. However, a cache miss incurs an expensive copy, so we must reduce misses by predicting application access patterns.
1.2 Security & Syscall
Buffered IPC is when the kernel mediates communication between processes. This is simple but can be expensive: we must maintain two copies of the data, one in user space and one in kernel space. However, the sender does not block as long as buffer space is available.
Unbuffered IPC is when processes communicate directly. This is more efficient but more complex: a process blocks if the other is not ready to send/receive.
2. CPU Interfaces
The instruction set architecture (ISA) also has tradeoffs in:
- Security / Performance (privilege levels).
- Complexity / Performance (memory consistency models).
- Adaptability / Performance (specialized hardware vs volume cost).
2.1 Caching
A CPU chip has many caches:
- Exploit spatial & temporal locality.
- ISA extended to control & flush entries for cross-core consistency.
- Software must be optimised to maximise cache hits.
We have a hardware cache hierarchy (L1, L2, ..., LLC) and an MMU cache (the TLB) to translate virtual addresses to physical addresses.
The MMU also has a page walk cache (PWC), which caches intermediate page table entries to speed up address translation on a TLB miss. Each level in the MMU page table has a separate PWC, the OS can control the size of each PWC, and all four per-level PWCs can be accessed in parallel.
We must also keep the memory-translation caches (TLB, PWC) consistent on a page downgrade (e.g. from read-write to read-only). A TLB shootdown is when the OS sends an inter-processor interrupt (IPI) to all cores to invalidate their TLB entries for the downgraded page. This is expensive and scales poorly with the number of cores. To mitigate this, we could:
- Avoid downgrading pages.
- Asynchronous downgrade syscalls.
- Hardware acceleration.
2.2 NUMA
The CPU is connected to memory through an extensible bus. Non Uniform Memory Access (NUMA) is when each CPU has its own local memory. This is usually paired with a cache coherent interconnect (e.g. Intel QPI). Accessing your own memory is always faster than accessing another CPU's memory, so we want to write code with good locality.
2.3 Virtualization
Virtualization allows multiple virtual machines to run on a single physical machine, reducing cost with higher security than containers / processes. However, VM exits are very expensive, involving long microcode sequences to switch between guest and host. Examples of VM exits are hypercalls, trapped instructions, trapped memory accesses & interrupts. A VM exit:
1. Trapped instruction / interrupt.
2. Sanitize the environment.
3. Save state.
4. Execute the handler.
5. Restore state.
6. Execute the VM resume instruction.
VMs also use nested page tables, which means a PT traversal goes from 4 to 24 memory accesses, since each guest page-table access must itself be translated through the host page tables. To mitigate this, we could:
- Let the hypervisor manage PTs (shadow paging), but this requires VM-exits to manage guest PTs.
- Use guest-host page walk caches, of which we need 6.
3. IO Interfaces
IO interfaces describe data going in/out of the machine (e.g. network, storage, GPUs, etc).
3.1 Device as Memory
Modern devices are accessed as memory using memory-mapped IO (MMIO). Each device is assigned a range of physical addresses at boot time, corresponding to its registers and buffers (exposed through Base Address Registers, BARs).
A modern device may have both:
- Circular Queues with request / response buffers in the host memory.
- A BAR to check / configure queue changes using MMIO.
3.2 Device Interconnect
PCIe is a common interconnect for devices, organised as point-to-point serial links (grouped into lanes) in a tree topology rooted at the CPU.
3.3 Device Models
We could use an interrupt driven model. This works well but has an overhead of context switching:
- MMIO writes to program device.
- Device reads request buffers or writes response buffers.
- OS waits for an interrupt.
- MMIO reads to check result.
A polling model has no context switch and can give an answer faster (if the polling time is short):
- MMIO writes to program device.
- Device reads request buffers or writes response buffers.
- OS repeatedly polls MMIO to check result.
A hybrid model would poll for some time, then tell the device to interrupt if not done. This is more complex but can give better performance.
3.4 Device Virtualization
Accessing a virtual device is a memory trap. By default, we trap & emulate, where the hypervisor unmaps device BARs to trap MMIO accesses, and emulates device operation on every trap. This is extremely expensive, requiring multiple VM exits per device operation.
We could paravirtualize the device by mapping shared memory between the VM and hypervisor; a single trapped MMIO access (a doorbell) notifies the hypervisor of new requests. This is much faster, but requires changes to the guest OS and device drivers.
Alternatively, we could passthrough by mapping device BARs directly to the VM, and configure the IOMMU to allow the VM to access the device. This is the fastest, but eliminates isolation between VMs, can cause security issues, and requires hardware support from IO devices to have a different state per guest VM.
3.5 OS-level Interfaces
Traditional blocking interfaces (POSIX read/write) are simple but incur high overhead. Every call requires a hardware transition to kernel mode, at least one data copy, and potentially two context switches if the thread must be rescheduled while waiting.
Non-blocking interfaces aim to eliminate context switches and data copies to preserve cache locality.
- Asynchronous: The application submits an operation and continues other work until the kernel signals completion (e.g., Linux AIO for storage).
- Event-based: The application expresses interest in specific file descriptors and only operates on them after polling the kernel to confirm the operation will not block (e.g., Linux epoll for networking).
For maximum efficiency, direct device assignment (passthrough) allows user-mode applications to perform MMIO directly to a device's virtual function. This bypasses the OS entirely to eliminate syscalls and intermediate copies. Applications typically use specialized libraries like DPDK (networking) or SPDK (storage) to manage these devices.