Virtualization

1. Virtualization

Usually, the OS abstracts the hardware for each process: the CPU's non-privileged instructions and registers, virtual memory, a file system, and syscalls for IO and signals. A process therefore sees only a subset of the machine's instructions.

What if the abstraction looked like hardware? All CPU instructions and registers, physical and virtual memory, raw disk access, full IO access and interrupts would be available. This is called virtualization.

Now, instead of a single program, each virtual machine (VM) runs a full OS. This provides better security through isolation and support for legacy software. It also allows better server consolidation - allocating VMs to different workloads on demand. It can be used to test software, easily rolling back to earlier snapshots during testing, and to run multiple OSs on the same hardware.

1.1 Virtual Machine Monitor

A virtual machine monitor (VMM) partitions the hardware resources and provides each guest OS with a virtual machine (VM). A naive VMM intercepts all instructions and emulates their execution on the actual hardware. This results in overhead for both CPU-bound and IO-bound processes. To reduce it, we only need to trap and emulate sensitive instructions and can run everything else natively. This can cause problems:

On some architectures, sensitive instructions don't trap.
Some instructions behave differently in user and kernel mode.
Visibility of privilege level - the guest OS should not be able to tell that it is being virtualized.

A CPU is virtualizable if all sensitive instructions trap. x86 has been virtualizable since around 2005, with Intel VT-x and AMD-V.

1.2 Hypervisor

A hypervisor is another name for a VMM. It can be type 1 (bare-metal), which runs directly on the hardware, or type 2 (hosted), which runs as a program on top of a host OS. Type 1 is more efficient since it has direct access to hardware, but type 2 is more flexible since it can run on top of an existing OS.

A type 1 hypervisor needs hardware support to trap sensitive instructions:

Guest OS executes sensitive instruction.
CPU traps to hypervisor.
Hypervisor checks the instruction.
Hypervisor emulates the instruction.
Control is returned to the guest OS.
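
The steps above can be sketched as a small simulation. This is only an illustration, not how any real hypervisor is written: the instruction names, the `SENSITIVE` set, and the `VirtualCPU` fields are all assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical instruction names; a real ISA defines which ones are sensitive.
SENSITIVE = {"cli", "sti", "load_cr3"}


@dataclass
class VirtualCPU:
    """Virtual state the hypervisor keeps on behalf of one guest."""
    interrupts_enabled: bool = True
    cr3: int = 0
    acc: int = 0
    trap_count: int = 0


def emulate(vcpu, op, arg):
    """Hypervisor side: check the instruction, then apply it to virtual state."""
    if op == "cli":
        vcpu.interrupts_enabled = False
    elif op == "sti":
        vcpu.interrupts_enabled = True
    elif op == "load_cr3":
        vcpu.cr3 = arg
    else:
        raise ValueError(f"unexpected trap for {op!r}")


def run_guest(vcpu, program):
    """Run a list of (op, arg) pairs, trapping only on sensitive instructions."""
    for op, arg in program:
        if op in SENSITIVE:
            vcpu.trap_count += 1    # the CPU traps to the hypervisor
            emulate(vcpu, op, arg)  # the hypervisor checks and emulates
            # control returns to the guest on the next loop iteration
        elif op == "add":
            vcpu.acc += arg         # non-sensitive: stands in for native execution
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return vcpu
```

Running `run_guest(VirtualCPU(), [("add", 2), ("cli", None), ("add", 3)])` traps once, for `cli`, and leaves the real machine's interrupt state untouched.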

2. Binary Translation

Instead of emulating code, we can dynamically translate it so that it can run natively, which is faster.

Scan each basic block before it executes.
If it contains privileged instructions, replace them with hypercalls.
Replace the last instruction with a hypercall, so that control returns to the hypervisor at the end of the block.
Execute the basic block natively.

Cache the translated code to avoid retranslation. The translation itself can be done with just-in-time (JIT) compilation. There is no need to translate user code, only the guest kernel.
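
A minimal sketch of the translate-and-cache idea follows. Instructions are modelled as `(op, arg)` tuples, and the `PRIVILEGED` set, the `Translator` class, and the `"hypercall"` marker are hypothetical names chosen for the example; a real translator works on machine code.

```python
# Hypothetical privileged instruction names, for illustration only.
PRIVILEGED = {"cli", "sti", "out"}


def translate_block(block):
    """Rewrite one basic block (a non-empty list of (op, arg) tuples)."""
    translated = []
    for op, arg in block[:-1]:
        if op in PRIVILEGED:
            # Privileged instruction: replace it with a hypercall.
            translated.append(("hypercall", (op, arg)))
        else:
            # Everything else is left as-is and would run natively.
            translated.append((op, arg))
    # The last instruction becomes a hypercall so the hypervisor
    # regains control and can choose the next block.
    translated.append(("hypercall", block[-1]))
    return translated


class Translator:
    """Translates blocks on first use and caches the result."""

    def __init__(self, code):
        self.code = code   # block start address -> original basic block
        self.cache = {}    # block start address -> translated block
        self.misses = 0    # number of translations actually performed

    def lookup(self, addr):
        if addr not in self.cache:
            self.cache[addr] = translate_block(self.code[addr])
            self.misses += 1
        return self.cache[addr]
```

Calling `lookup` twice on the same address translates the block only once; the second call is served from the cache.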

3. Paravirtualization

We can change OS source code to replace sensitive instructions with hypercalls. This can handle unvirtualizable architectures while achieving near-native performance.

3.1 Virtual Machine Interface (VMI)

However, paravirtualization requires source code modification, which must be done separately for each OS. The VMI is a standard interface for paravirtualization: a set of hypercalls that can be used by any OS. This allows a single hypervisor to run multiple OSs.
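
To make the idea of a shared hypercall interface concrete, here is a toy sketch. The hypercall numbers, the `Hypervisor` and `GuestKernel` classes, and their methods are invented for illustration and do not correspond to the actual VMI specification.

```python
# Hypothetical hypercall numbers agreed on by guest and hypervisor.
HC_SET_INTERRUPTS = 0
HC_SET_PAGE_TABLE_BASE = 1


class Hypervisor:
    """Exposes a small, fixed table of hypercalls to any guest."""

    def __init__(self):
        self.interrupts_enabled = True
        self.page_table_base = 0
        self._handlers = {
            HC_SET_INTERRUPTS: self._set_interrupts,
            HC_SET_PAGE_TABLE_BASE: self._set_page_table_base,
        }

    def hypercall(self, number, arg):
        handler = self._handlers.get(number)
        if handler is None:
            raise ValueError(f"unknown hypercall {number}")
        return handler(arg)

    def _set_interrupts(self, enabled):
        self.interrupts_enabled = bool(enabled)

    def _set_page_table_base(self, base):
        self.page_table_base = base


class GuestKernel:
    """A paravirtualized kernel: sensitive instructions replaced by hypercalls."""

    def __init__(self, hypervisor):
        self.hv = hypervisor

    def disable_interrupts(self):
        # An unmodified kernel would execute a sensitive instruction here.
        self.hv.hypercall(HC_SET_INTERRUPTS, False)

    def switch_address_space(self, base):
        self.hv.hypercall(HC_SET_PAGE_TABLE_BASE, base)
```

Any guest kernel written against the same hypercall numbers can run on this hypervisor without further changes, which is the point of a standard interface.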

4. Memory Virtualization

Memory virtualization keeps a physical map (PMAP) structure for each VM, which maps the guest's physical addresses to actual machine addresses. This is stored in the hypervisor. The guest OS keeps its own page tables, which map virtual addresses to physical addresses. The hypervisor also keeps separate shadow page tables, which map virtual addresses directly to machine addresses.


4.1 Shadow Page Tables

The hardware's memory management unit (MMU) uses the shadow page tables. The hardware's translation lookaside buffer (TLB) caches mappings from virtual addresses to machine addresses. On a TLB miss:

The MMU searches for the mapping in the shadow page table. If it is found, it is loaded into the TLB, and the instruction is re-executed.
Otherwise, the page fault must be handled by the VMM. It tries to find the virtual to physical address mapping in the guest OS's page table. If it is not found, a true page fault occurs, which is forwarded to the guest OS. If it is found, a hidden page fault occurs: the VMM combines the mapping with the pmap and adds the resulting virtual to machine mapping to the shadow page table.
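
The TLB-miss handling described above can be sketched with plain dictionaries standing in for the tables. This is a simplification under several assumptions: addresses are page numbers, the `VMM` class and `TruePageFault` exception are hypothetical names, and the pmap is assumed to have an entry for every physical page the guest maps.

```python
class TruePageFault(Exception):
    """The guest page table has no mapping; the guest OS must handle this."""


class VMM:
    def __init__(self, guest_page_table, pmap):
        self.guest_page_table = guest_page_table  # virtual -> physical (guest-owned)
        self.pmap = pmap                          # physical -> machine (VMM-owned)
        self.shadow = {}                          # virtual -> machine (walked by the MMU)
        self.tlb = {}                             # cached virtual -> machine entries
        self.hidden_faults = 0

    def translate(self, vpage):
        if vpage in self.tlb:
            return self.tlb[vpage]
        # TLB miss: the MMU looks in the shadow page table.
        if vpage not in self.shadow:
            # Not in the shadow table: fault into the VMM.
            if vpage not in self.guest_page_table:
                raise TruePageFault(vpage)        # forwarded to the guest OS
            # Hidden page fault: fill in the shadow entry from the
            # guest page table combined with the pmap.
            ppage = self.guest_page_table[vpage]
            self.shadow[vpage] = self.pmap[ppage]
            self.hidden_faults += 1
        # Load the mapping into the TLB; the instruction would now be retried.
        self.tlb[vpage] = self.shadow[vpage]
        return self.tlb[vpage]
```

The first access to a page costs one hidden fault; later accesses hit the TLB or the shadow table without involving the VMM.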

4.2 Hardware Support

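The VMM must keep guest and shadow page tables synchronized. This can be done by trapping into the VMM whenever the guest OS changes its page table, which is expensive. Newer CPUs have hardware support that avoids this by letting the MMU walk both the guest page table and the pmap. On a TLB miss: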

MMU searches for the mapping in the guest page table. If it is not found, a true page fault occurs, forwarded to the guest OS.
If it is found, the MMU searches for a mapping in the pmap. If it is found, the TLB is updated and the instruction is re-executed. Otherwise, a hidden page fault occurs, which the VMM handles.
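
The two-stage lookup can be sketched as a single function. As before, this is a toy model: tables are dictionaries keyed by page number, and `nested_walk` and the two exception classes are hypothetical names used only for the example.

```python
class TruePageFault(Exception):
    """No virtual -> physical mapping; forwarded to the guest OS."""


class HiddenPageFault(Exception):
    """No physical -> machine mapping; handled by the VMM."""


def nested_walk(vpage, guest_page_table, pmap, tlb):
    """Translate a virtual page to a machine page using both tables."""
    if vpage in tlb:
        return tlb[vpage]
    # Stage 1: guest page table (virtual -> physical).
    if vpage not in guest_page_table:
        raise TruePageFault(vpage)
    ppage = guest_page_table[vpage]
    # Stage 2: pmap (physical -> machine).
    if ppage not in pmap:
        raise HiddenPageFault(ppage)
    # Both stages succeeded: update the TLB; the instruction would be retried.
    tlb[vpage] = pmap[ppage]
    return tlb[vpage]
```

Unlike the shadow page table scheme, no third table has to be kept in sync: the guest can edit its own page table freely, and the VMM is involved only on a hidden page fault.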

5. VM Memory Management

How does a hypervisor reclaim memory from a VM if it doesn't know which pages are in use? A double paging problem occurs:

The VMM, under memory pressure, selects a page P to be paged out.
The guest OS, also under memory pressure, selects the same page P to be paged out.
P will be brought back from disk only to be written out again.

To fix this, we use ballooning:

The hypervisor installs a balloon driver in the guest OS.
The hypervisor tells the balloon driver to inflate, which makes it allocate memory inside the guest.
The guest OS, now under memory pressure, chooses which of its own pages to swap out.
The hypervisor reclaims the memory allocated by the balloon driver.
When the pressure eases, the driver deflates and returns the memory to the guest OS.
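
The effect of inflating and deflating can be shown with simple page counts. This sketch assumes a hypothetical `GuestOS` model that tracks only how many pages are in use, swapped out, or held by the balloon; it ignores which pages the guest would actually choose to evict.

```python
class GuestOS:
    """Toy model of a guest's memory, counted in pages."""

    def __init__(self, total_pages, used_pages):
        self.total = total_pages
        self.used = used_pages   # pages holding guest data in memory
        self.swapped = 0         # pages the guest has swapped to its own disk
        self.balloon = 0         # pages held by the balloon driver

    def free_pages(self):
        return self.total - self.used - self.balloon

    def inflate(self, n):
        """Balloon driver allocates n pages; the guest swaps if it runs short."""
        shortfall = n - self.free_pages()
        if shortfall > 0:
            if shortfall > self.used:
                raise MemoryError("balloon larger than guest memory")
            # The guest OS decides what to swap out, using its own policy.
            self.used -= shortfall
            self.swapped += shortfall
        self.balloon += n
        return n  # pages the hypervisor can now reclaim

    def deflate(self, n):
        """Balloon driver frees up to n pages back to the guest OS."""
        n = min(n, self.balloon)
        self.balloon -= n
        return n
```

For example, a guest with 100 pages and 80 in use that is asked to inflate by 30 must swap out 10 of its own pages. Because the guest makes that choice itself, the hypervisor never pages out a page the guest is about to evict, which avoids the double paging problem.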

We can share memory between VMs by mapping two physical memory addresses to one machine address. To do this we need to find pages with identical content:

Compute a hash of the page content.
Index into a hash table to find existing page with the same hash.
Do a full comparison to confirm the match.
Map both physical pages to a single machine page, marked copy-on-write so that a later write gets a private copy.
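
The steps above can be sketched as follows. The data layout is an assumption for illustration: `pmap` maps `(vm, physical page)` pairs to machine page numbers, `machine_memory` maps machine page numbers to their contents as `bytes`, and SHA-256 from the standard library stands in for whatever hash a real hypervisor would use. Copy-on-write handling is left out.

```python
import hashlib


def share_identical_pages(pmap, machine_memory):
    """Remap identical pages to one machine page; return the number of pages freed."""
    canonical = {}  # content hash -> machine page kept as the shared copy
    freed = 0
    for key, mpage in list(pmap.items()):
        content = machine_memory[mpage]
        # Step 1: hash the page content.
        digest = hashlib.sha256(content).digest()
        # Step 2: look for an existing page with the same hash.
        match = canonical.get(digest)
        if match is None:
            canonical[digest] = mpage
        # Step 3: full comparison to rule out a hash collision.
        elif match != mpage and machine_memory[match] == content:
            # Step 4: point this physical page at the shared machine page.
            pmap[key] = match
            if mpage not in pmap.values():
                del machine_memory[mpage]  # no longer referenced, so reclaim it
                freed += 1
    return freed
```

With two VMs that each hold a page of identical content, the second VM's physical page is remapped to the first VM's machine page and one machine page is reclaimed.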
