Cache & Memory

Average memory access time (AMAT) is given by: $\text{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty}$.
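
As an illustration (the numbers are arbitrary), with a 1-cycle hit time, a 5% miss rate and a 20-cycle miss penalty:

$$\text{AMAT} = 1 + 0.05 \times 20 = 2 \text{ cycles}$$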

1. Miss Rate (Hardware)

There are 4 kinds of cache misses (4 Cs):

  - Compulsory: the first access to a block can never hit (cold-start misses).
  - Capacity: the working set is larger than the cache, so blocks are evicted and later re-fetched.
  - Conflict: blocks that map to the same set evict each other, even though the cache as a whole has room.
  - Coherence: blocks invalidated by other processors to keep shared data consistent.

1.1 Cache Anatomy

1.2 Increasing Block Size

By increasing the block size we decrease the miss rate: each miss speculatively loads in more adjacent data, which helps programs with high spatial locality. However, larger blocks mean fewer blocks fit in the cache and more unneeded data is fetched on each miss, so making the block size too large increases the miss rate again (and raises the miss penalty).
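
A rough illustration (the array and block sizes are arbitrary assumptions): a stride-1 traversal touches every word of a block before moving on, so a larger block turns several potential misses into hits, whereas a large-stride traversal wastes most of each fetched block:

```c
#include <stddef.h>

/* Sum an array with a given stride. With stride 1, a 64-byte block
 * (16 ints) is fetched once and then serves 15 further hits.
 * With stride 16, every access lands in a new block, so a larger
 * block size buys nothing and just wastes fetch bandwidth. */
long sum_strided(const int *a, size_t n, size_t stride)
{
    long total = 0;
    for (size_t i = 0; i < n; i += stride)
        total += a[i];
    return total;
}
```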

1.3 Victim Caching

In a victim caching structure, we combine the speed of a direct mapped cache with the flexibility of a set associative cache. A small fully associative cache (the victim cache) is placed between the main cache and main memory. When a block is evicted from the main cache, it is placed in the victim cache. On a miss in the main cache, the victim cache is checked. If there is a hit in the victim cache, the block is swapped with the evicted block from the main cache. A victim cache has a cost of extra comparators and selectors, but can significantly reduce conflict misses.
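
A minimal sketch of the lookup path, assuming a direct-mapped main cache and a tiny fully associative victim cache; the sizes and field names are illustrative, not a particular real design:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAIN_SETS   256   /* direct-mapped main cache (illustrative size) */
#define VICTIM_WAYS 4     /* small fully associative victim cache */

typedef struct { bool valid; uint32_t tag; }        main_line_t;
typedef struct { bool valid; uint32_t block_addr; } victim_line_t;

static main_line_t   main_cache[MAIN_SETS];
static victim_line_t victim[VICTIM_WAYS];

/* Returns true on a hit in either structure. On a victim-cache hit, the
 * block is swapped back into the main cache and the block it displaces
 * takes its place in the victim cache. */
bool cache_lookup(uint32_t block_addr)
{
    uint32_t set = block_addr % MAIN_SETS;
    uint32_t tag = block_addr / MAIN_SETS;

    if (main_cache[set].valid && main_cache[set].tag == tag)
        return true;                               /* fast direct-mapped hit */

    for (int w = 0; w < VICTIM_WAYS; w++) {
        if (victim[w].valid && victim[w].block_addr == block_addr) {
            main_line_t evicted = main_cache[set]; /* swap the two blocks */
            main_cache[set] = (main_line_t){ true, tag };
            victim[w].valid = evicted.valid;
            victim[w].block_addr = evicted.tag * MAIN_SETS + set;
            return true;                           /* slower victim-cache hit */
        }
    }
    return false;                /* miss in both: fetch from the next level */
}
```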

1.4 Skewed Associative Caching

In a conventional set associative cache, we get conflicts when more than $n$ blocks share the same address index bits (where $n$ is the number of ways). In a skewed associative cache, each way uses a different hash function to map the address to a cache set (usually by XORing index bits with tag bits and reordering bits). Blocks that would conflict in a conventional set associative cache are therefore likely to map to different sets in at least one way, reducing conflict misses. The cost is extra hash logic on the index path and a more complex replacement policy, since the candidate lines for a block no longer form a single fixed set.
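
A sketch of the per-way index computation, assuming 256 sets per way and a simple XOR-based hash; the exact hash function and bit widths are illustrative:

```c
#include <stdint.h>

#define SET_BITS 8                      /* 256 sets per way (illustrative) */
#define SET_MASK ((1u << SET_BITS) - 1)

/* Conventional set-associative index: every way uses the same bits,
 * so blocks that share these bits always collide. */
uint32_t conventional_index(uint32_t block_addr)
{
    return block_addr & SET_MASK;
}

/* Skewed index: each way XORs the index bits with a different slice of
 * the tag bits, so two blocks that collide in way 0 usually map to
 * different sets in way 1. */
uint32_t skewed_index(uint32_t block_addr, int way)
{
    uint32_t index = block_addr & SET_MASK;
    uint32_t tag   = block_addr >> SET_BITS;
    /* Mix two per-way slices of the tag (a simple stand-in for the
     * "XOR and reorder bits" hash described above). */
    uint32_t hash  = (tag >> (3 * way)) ^ (tag >> (3 * way + SET_BITS));
    return (index ^ hash) & SET_MASK;
}
```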

1.5 Prefetching

A stream buffer is used. When a cache miss occurs in the L1 cache, a stream buffer fetches the missed block and several subsequent blocks from memory into a small FIFO buffer. Subsequent accesses to these blocks can be serviced from the stream buffer, reducing the effective miss penalty. If the stream buffer is exhausted or misses, a new block and the blocks following it are fetched from memory. On every access, the stream buffer is checked in parallel with the cache.

A multi-way stream buffer has multiple stream buffers, each fetching a different stream of data. This is useful for programs with multiple streams of data being accessed simultaneously.
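
A rough sketch of one stream buffer's behaviour, assuming a 4-entry FIFO of block addresses; the depth and the refill policy shown are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define STREAM_DEPTH 4   /* FIFO of prefetched blocks (illustrative depth) */

typedef struct {
    bool     valid[STREAM_DEPTH];
    uint32_t block[STREAM_DEPTH];   /* block addresses, oldest at index 0 */
} stream_buffer_t;

/* Refill the buffer with the blocks that follow `start`. */
static void stream_refill(stream_buffer_t *sb, uint32_t start)
{
    for (int i = 0; i < STREAM_DEPTH; i++) {
        sb->valid[i] = true;
        sb->block[i] = start + 1 + (uint32_t)i;   /* prefetch sequentially */
    }
}

/* Called when the L1 cache misses on block_addr; in hardware the buffer
 * is checked in parallel with the cache. Returns true if the buffer can
 * supply the block. */
bool stream_buffer_access(stream_buffer_t *sb, uint32_t block_addr)
{
    if (sb->valid[0] && sb->block[0] == block_addr) {
        /* Hit at the head: shift the FIFO up and prefetch one more block. */
        memmove(&sb->block[0], &sb->block[1],
                (STREAM_DEPTH - 1) * sizeof sb->block[0]);
        sb->block[STREAM_DEPTH - 1] = block_addr + STREAM_DEPTH;
        return true;
    }
    /* Miss: restart the stream at the new address. */
    stream_refill(sb, block_addr);
    return false;
}
```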

Decoupled Access Execute

We can separate instructions that generate addresses from those that use memory results, letting the address-generating instructions run ahead and prefetch. To do this, two separate processors run the access and execute instruction streams.

2. Miss Rate (Software)

Software prefetch instructions are rarely used in practice: they require compiler and instruction set support and are difficult to apply effectively. However, the compiler (or programmer) can restructure code to reduce the miss rate, for example by:

  - Loop interchange: reorder nested loops so arrays are traversed in the order they are laid out in memory (sketched below).
  - Loop fusion: merge loops that touch the same data so it is reused while still in the cache.
  - Blocking (tiling): operate on sub-blocks that fit in the cache instead of whole rows or columns.
  - Merging arrays: combine arrays that are accessed together to improve spatial locality.
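
As an illustration of loop interchange (array dimensions are arbitrary assumptions), traversing a row-major C array along its rows walks memory sequentially, while traversing along columns strides across it and misses far more often:

```c
#define ROWS 1024
#define COLS 1024

/* Column-order traversal of a row-major array: consecutive iterations
 * are COLS * sizeof(double) bytes apart, so nearly every access can miss. */
double sum_bad(const double a[ROWS][COLS])
{
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}

/* Loop interchange: the inner loop now walks consecutive addresses, so
 * every block fetched into the cache is fully used (spatial locality). */
double sum_good(const double a[ROWS][COLS])
{
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}
```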

3. Miss Penalty

3.1 Write Policy

When we use caches, we must consider write policies, which determine how stores are propagated to memory. We can either write through or write back:

  - Write through: every write updates both the cache and the next level of memory. Simple and keeps memory up to date, but generates a lot of write traffic.
  - Write back: writes update only the cache, and a dirty bit marks modified blocks; a dirty block is written to memory only when it is evicted (sketched after these lists). Less traffic, but evictions are slower and coherence is more complex.

We can also use write allocate or write around on a write miss:

  - Write allocate: the block is fetched into the cache and then written (usually paired with write back).
  - Write around (no-write-allocate): the write goes straight to the next level without filling the cache (usually paired with write through).
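
A minimal sketch of the write-back + write-allocate combination described above, assuming a direct-mapped cache; the structure names, sizes and memory helper functions are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define SETS 256   /* illustrative direct-mapped cache */

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    /* data omitted */
} line_t;

static line_t cache[SETS];

/* hypothetical stand-ins for the next level of the memory hierarchy */
void mem_write_block(uint32_t block_addr);
void mem_read_block(uint32_t block_addr);

/* Write-back + write-allocate: writes stay in the cache and only reach
 * memory when a dirty block is evicted. */
void cache_write(uint32_t block_addr)
{
    uint32_t set = block_addr % SETS;
    uint32_t tag = block_addr / SETS;
    line_t *l = &cache[set];

    if (!(l->valid && l->tag == tag)) {        /* write miss */
        if (l->valid && l->dirty)              /* evict: flush the old block */
            mem_write_block(l->tag * SETS + set);
        mem_read_block(block_addr);            /* write-allocate: fetch it */
        l->valid = true;
        l->tag   = tag;
    }
    l->dirty = true;      /* the write itself only updates the cache copy */
}
```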

3.2 Write Buffer

Introducing a write buffer between the cache and main memory can help reduce the effective miss penalty for write through caches. When a write occurs, data is placed in the write buffer and the CPU can continue executing without waiting for the write to complete. The write buffer then writes data to main memory in the background.

However, we now have to check the buffer in parallel with the cache on reads, to ensure we get the most up-to-date data. Write buffers typically hold 2-8 cache lines. Buffers work well for write bursts, since multiple writes to the same block can be coalesced.
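
A sketch of a small coalescing write buffer, assuming four entries tracked by block address; the entry count and field names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4   /* write buffers typically hold 2-8 cache lines */

typedef struct { bool valid; uint32_t block_addr; } wb_entry_t;

static wb_entry_t write_buffer[WB_ENTRIES];

/* Queue a write without stalling the CPU. Writes to a block that is
 * already queued are coalesced into the existing entry. Returns false
 * if the buffer is full (the CPU would have to stall). */
bool wb_insert(uint32_t block_addr)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr)
            return true;                   /* coalesce with a pending write */

    for (int i = 0; i < WB_ENTRIES; i++)
        if (!write_buffer[i].valid) {
            write_buffer[i] = (wb_entry_t){ true, block_addr };
            return true;                   /* buffered; memory write happens later */
        }
    return false;                          /* buffer full: stall */
}

/* Reads must snoop the buffer so they see data that has not yet reached
 * memory; this check happens in parallel with the cache lookup. */
bool wb_holds(uint32_t block_addr)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr)
            return true;
    return false;
}
```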

3.3 Early Restart

In early restart, when a cache miss occurs we fetch the block from main memory in its normal order, but as soon as the requested word arrives we resume execution, even though the rest of the block has not yet been fetched. The remaining words are fetched in the background. This reduces the effective miss penalty, as we do not wait for the entire block before resuming.

In critical word first, we fetch the requested word first, then the rest of the block. This further reduces the effective miss penalty, as we can resume execution as soon as the requested word is available.

However, with early restart we can get a race condition if another load to the same block occurs before the rest of the block has been fetched. To fix this, we can divide the cache line into sectors, each with its own valid bit: we allocate in units of cache lines but deliver (and validate) data in units of sectors.
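
A sketch of the sectored line described above, assuming four sectors per line, each with its own valid bit (sizes illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTORS_PER_LINE 4   /* illustrative */

typedef struct {
    uint32_t tag;
    bool     line_valid;                     /* allocation is per line...  */
    bool     sector_valid[SECTORS_PER_LINE]; /* ...but fill is per sector  */
} sectored_line_t;

/* A load hits only if its sector has actually arrived; other sectors of
 * the same line may still be in flight after an early restart. */
bool sector_hit(const sectored_line_t *l, uint32_t tag, unsigned sector)
{
    return l->line_valid && l->tag == tag && l->sector_valid[sector];
}
```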

3.4 Non-blocking Cache

A non-blocking cache can keep supplying data (hits) to the CPU while a miss is being serviced. Hit under miss reduces the effective miss penalty by doing useful work during miss service; hit under multiple miss (or miss under miss) lowers it further by overlapping the servicing of several misses. This significantly increases cache controller complexity and requires multiple memory banks.
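
The notes do not name the bookkeeping structure, but non-blocking caches are commonly built around miss status holding registers (MSHRs) that track outstanding misses; a rough sketch, with all sizes and names illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 8   /* max outstanding misses (illustrative) */

typedef struct { bool busy; uint32_t block_addr; } mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* On a miss: if the block is already being fetched, attach to that
 * request (miss under miss); otherwise allocate a new MSHR. Returns
 * false when all MSHRs are busy, which is when the cache must block. */
bool handle_miss(uint32_t block_addr)
{
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshrs[i].busy && mshrs[i].block_addr == block_addr)
            return true;               /* merge with the in-flight miss */
        if (!mshrs[i].busy && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;                  /* structural stall */
    mshrs[free_slot] = (mshr_t){ true, block_addr };
    return true;                       /* request sent; hits can continue */
}
```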

3.5 Multi-level Cache

We can have a multi-level cache. Then the miss penalty of L1 is the AMAT of L2:

$\text{AMAT} = \text{hit time}_{L1} + \text{miss rate}_{L1} \times \left(\text{hit time}_{L2} + \text{miss rate}_{L2} \times \text{miss penalty}_{L2}\right)$

If there were an L3 cache, the miss penalty of L2 would be the AMAT of L3, and so on.
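
As an illustration (again with arbitrary numbers), with a 1-cycle L1 hit time, 10% L1 miss rate, 10-cycle L2 hit time, 20% L2 miss rate and a 100-cycle memory access:

$$\text{AMAT} = 1 + 0.1 \times (10 + 0.2 \times 100) = 1 + 0.1 \times 30 = 4 \text{ cycles}$$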

Multilevel inclusion is where all data in L1 is also in L2, and all data in L2 is also in L3. This simplifies cache coherence, but can increase miss rates (evicting from L2 evicts from L1 too).

4. Hit Time

We can pipeline the cache by splitting the pipeline stages that access it into multiple cycles. Cache hits then take multiple cycles, but a new access can start every cycle, so this improves cache throughput at the potential cost of hit latency.

To support multiple parallel accesses to the cache, we can use:

  - True multi-porting: extra read/write ports on every cell (fast but expensive in area).
  - Replication: multiple copies of the cache, one per port (reads scale, but writes must update every copy).
  - Banking (interleaving): split the cache into independently accessible banks; accesses to different banks proceed in parallel.

5. Address Translation

To securely run programs in their own virtual address spaces we need to translate virtual addresses to physical addresses. There are three common translation schemes:

  - PIPT (physically indexed, physically tagged): the CPU's virtual address goes through the TLB first, and the cache is accessed with the physical address before going to memory.
  - VIVT (virtually indexed, virtually tagged): the cache is accessed directly with the virtual address; translation happens only on a miss, on the way to memory.
  - VIPT (virtually indexed, physically tagged): the cache is indexed with the virtual address while the TLB translates in parallel; the physical tag is compared afterwards, and misses continue to the physically addressed L2 cache and memory.

5.1 Paging

The virtual address space is divided into pages of equal size, and main memory is divided into page frames of the same size. This allows non-contiguous allocation of memory, reducing fragmentation and allowing pages to be swapped to disk more easily.

A virtual address is divided into a virtual page number (VPN) and a page offset. The VPN is used to index into a page table, which maps VPNs to physical page numbers (PPNs). The page offset is then used to access the specific byte within the physical page.

The translation lookaside buffer (TLB) is a small cache of page table entries, storing recent VPN to PPN mappings. This allows faster address translation, reducing the overhead of page table lookups. The TLB is often highly associative.
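
A sketch of the translation path, assuming 4 KiB pages and a small TLB; the page size, TLB size, replacement policy and the page_table_lookup helper are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS 12                 /* 4 KiB pages => 12-bit page offset */
#define PAGE_OFFSET(va)  ((va) & ((1u << PAGE_BITS) - 1))
#define VPN(va)          ((va) >> PAGE_BITS)

#define TLB_ENTRIES 16               /* small, fully associative */

typedef struct { bool valid; uint64_t vpn, ppn; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* hypothetical page-table walk used on a TLB miss */
uint64_t page_table_lookup(uint64_t vpn);

/* Translate a virtual address: check the TLB first, fall back to the
 * page table, and rebuild the physical address from PPN + page offset. */
uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn = VPN(vaddr);

    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)              /* TLB hit */
            return (tlb[i].ppn << PAGE_BITS) | PAGE_OFFSET(vaddr);

    uint64_t ppn = page_table_lookup(vpn);                  /* TLB miss */
    /* Trivial replacement policy, just for the sketch. */
    tlb[vpn % TLB_ENTRIES] = (tlb_entry_t){ true, vpn, ppn };
    return (ppn << PAGE_BITS) | PAGE_OFFSET(vaddr);
}
```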

5.2 Synonyms & Homonyms

A homonym occurs when a single virtual address maps to multiple physical addresses in multiple processes. If the cache is indexed and tagged with virtual addresses, a lookup in one process can return data cached by another process under the same virtual address. To avoid this, we can flush the cache on context switches, or include a process identifier (PID) in the cache tag.

A synonym occurs when multiple virtual addresses (from one or more processes) map to the same physical address. In a virtually indexed cache, the same physical block can be cached twice under different virtual indices, so updates to one copy are not reflected in the other. To avoid this, we must force synonyms to have the same index bits in the cache. This is called page colouring.

5.3 Avoiding Translation

In a VIPT memory scheme, we can avoid waiting for translation before starting the cache access:

[Diagram: VIPT L1 access. The CPU sends the page offset to index the L1 cache (physical tags + data) while the page number goes to the TLB in parallel; a comparator checks the translated physical page number against the stored physical tag. On a hit, data is returned; on a miss, the physical address is sent to the L2 cache and then memory.]

If the cache index consists only of untranslated address bits (bits within the page offset, which are the same in the virtual and physical address), we can read the tags in parallel with translation. However, this limits the cache size to at most the page size multiplied by the associativity (increase the associativity for a bigger cache!). L1 caches are often highly associative for this reason.

The L2 cache is indexed with the translated (physical) address, so which blocks conflict in it depends on the virtual-to-physical mapping. Here, the OS should use page colouring to stop synonyms (and unlucky mappings) causing conflicts.

5.4 Page Colouring

Alternatively to avoiding translation, we can also use page colouring in the L1 cache to ensure that synonyms map to the same cache index. This involves controlling the allocation of physical pages such that the index bits of the physical address match those of the virtual address. This can be done by maintaining multiple free lists of physical pages, each corresponding to a different colour (i.e., cache index). When allocating a new page, we select a physical page from the appropriate free list based on the virtual address being mapped.
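
A sketch of colour-aware frame allocation, assuming 4 KiB pages and a cache whose index uses two bits above the page offset (so four colours); the free-list interface pop_free_frame is a hypothetical stand-in:

```c
#include <stdint.h>

#define PAGE_BITS   12                /* 4 KiB pages (illustrative)        */
#define COLOUR_BITS 2                 /* index bits above the page offset  */
#define NUM_COLOURS (1u << COLOUR_BITS)

/* The colour is the part of the cache index that lies above the page
 * offset; synonyms conflict only if their colours differ. */
static unsigned colour_of(uint64_t addr)
{
    return (addr >> PAGE_BITS) & (NUM_COLOURS - 1);
}

/* One free list of physical page frames per colour (hypothetical). */
extern uint64_t pop_free_frame(unsigned colour);

/* Allocate a physical frame whose colour matches the virtual address it
 * will back, so the physical and virtual index bits agree. */
uint64_t allocate_frame_for(uint64_t vaddr)
{
    return pop_free_frame(colour_of(vaddr));
}
```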

6. DRAM

In DRAM, data is stored in a square array of cells, with the address split into a row part and a column part. To access data:

  1. The row address selects a row to activate, via its wordline.
  2. The cells in that row share their charge onto the bitlines (a destructive read).
  3. The cell values are latched by per-column sense amplifiers.
  4. The column address selects which latched columns are driven to the output.
  5. The data must be written back (restored) to the selected row.

Since the row is held latched in the sense amplifiers, subsequent accesses to the same row are faster (page mode). To access a different row, we must first precharge the bitlines, then activate the new row. This means the row cycle time is larger than the row access time, as the data must be restored to the row (and the bitlines precharged) after it is read.
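
In standard DRAM timing terms (parameter names assumed, not spelled out in the notes), the row cycle time is approximately the row active time plus the precharge time:

$$t_{RC} \approx t_{RAS} + t_{RP}$$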

After a while the charge leaks away, so every cell must be refreshed periodically (typically every 64 ms). Refresh is managed by the memory controller together with refresh logic on the DRAM itself. The DRAM is unavailable while a refresh is in progress, so we should aim to minimise its overhead.

6.1 Error Correction

We can add redundancy via parity / ECC bits. Typically a Hamming code extended with an extra overall parity bit is used, giving single error correction and double error detection (SECDED).
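
A sketch of the single-error-correcting core, using the classic Hamming(7,4) code on 4 data bits; real SECDED memory extends this with an overall parity bit and much wider words:

```c
#include <stdint.h>

/* Encode 4 data bits d3..d0 into a 7-bit Hamming(7,4) codeword laid out
 * as [d3 d2 d1 p4 d0 p2 p1] (bit 0 = p1, i.e. codeword position 1). */
uint8_t hamming74_encode(uint8_t data)
{
    uint8_t d0 = (data >> 0) & 1, d1 = (data >> 1) & 1;
    uint8_t d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;

    uint8_t p1 = d0 ^ d1 ^ d3;    /* covers codeword positions 1,3,5,7 */
    uint8_t p2 = d0 ^ d2 ^ d3;    /* covers codeword positions 2,3,6,7 */
    uint8_t p4 = d1 ^ d2 ^ d3;    /* covers codeword positions 4,5,6,7 */

    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Recompute the parity checks; a non-zero syndrome is the (1-based)
 * position of the single flipped bit, which we correct in place. */
uint8_t hamming74_correct(uint8_t code)
{
    uint8_t s1 = ((code >> 0) ^ (code >> 2) ^ (code >> 4) ^ (code >> 6)) & 1;
    uint8_t s2 = ((code >> 1) ^ (code >> 2) ^ (code >> 5) ^ (code >> 6)) & 1;
    uint8_t s4 = ((code >> 3) ^ (code >> 4) ^ (code >> 5) ^ (code >> 6)) & 1;
    uint8_t syndrome = (uint8_t)(s1 | (s2 << 1) | (s4 << 2));

    if (syndrome)
        code ^= (uint8_t)(1u << (syndrome - 1));   /* flip the faulty bit */
    return code;
}
```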

Memory upsets can also be induced by repeatedly activating rows, disturbing the charge in physically adjacent cells (row hammering). This is possible even in modern DRAM devices. Error correcting codes can help mitigate this, as can adaptive refresh schemes such as targeted row refresh (refreshing the neighbours of frequently activated rows).

7. Memory Parallelism

We can achieve memory parallelism, in both addressing and data transfer, by having multiple memory banks. Each bank can be accessed independently, allowing multiple memory operations to occur simultaneously. This increases overall memory bandwidth and helps hide the latency of individual accesses.
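
A sketch of bank interleaving, assuming four banks selected by the low bits of the block address (bank count and block size are illustrative):

```c
#include <stdint.h>

#define NUM_BANKS  4    /* illustrative; must be a power of two here */
#define BLOCK_BITS 6    /* 64-byte blocks */

/* Interleave consecutive blocks across banks so that a sequential access
 * stream keeps all banks busy, giving parallel addressing and transfer. */
unsigned bank_of(uint64_t addr)
{
    return (addr >> BLOCK_BITS) & (NUM_BANKS - 1);
}
```

Consecutive blocks then land in banks 0, 1, 2, 3, 0, ..., so a sequential stream of accesses can overlap across all the banks.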
