Cache & Memory
Average memory access time (AMAT) is given by:

AMAT = Hit Time + Miss Rate × Miss Penalty
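As an illustrative example (numbers assumed): with a 1-cycle hit time, a 5% miss rate and a 100-cycle miss penalty, AMAT = 1 + 0.05 × 100 = 6 cycles. The sections below look at reducing each of these terms in turn.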
1. Miss Rate (Hardware)
There are 4 kinds of cache misses (4 Cs):
- Compulsory: the first access to a block can never hit, since the block has not yet been brought into the cache. Also called cold start misses or first reference misses.
- Capacity: blocks are discarded because the cache cannot hold all the blocks the program needs, so later accesses to them miss.
- Conflict: if block placement strategy is set associative or direct mapped, a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
- Coherence: data may have been invalidated by another processor or IO device.
1.1 Cache Anatomy
- A cache block/line is the unit of storage allocation (and of transfer to and from the next level).
- A tag holds the upper address bits of the block stored in a cache location (the bits not used as index or offset); one tag is associated with each block in the cache.
- A set is the group of cache blocks selected by a given cache index.
- A way is one of the alternative locations within a set where a block may be stored (an n-way set associative cache has n ways).
- A comparator checks whether a tag matches an address.
- A selector picks data from the way with the matching tag (if any).
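A minimal sketch of how these pieces combine on a lookup, assuming an illustrative 8-way set associative cache with 64-byte lines and 64 sets (all parameters and names are made up for the example):

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64   /* block/line size                 */
#define NUM_SETS   64   /* sets selected by the index bits */
#define NUM_WAYS    8   /* alternative locations per set   */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_BYTES];
} line_t;

static line_t cache[NUM_SETS][NUM_WAYS];

/* Look up an address: split it into offset, index and tag,
 * then compare the tag against every way in the selected set. */
bool cache_lookup(uint64_t addr, uint8_t *out_byte)
{
    uint64_t offset = addr % LINE_BYTES;
    uint64_t index  = (addr / LINE_BYTES) % NUM_SETS;
    uint64_t tag    = addr / (LINE_BYTES * NUM_SETS);

    for (int way = 0; way < NUM_WAYS; way++) {         /* comparators */
        line_t *l = &cache[index][way];
        if (l->valid && l->tag == tag) {               /* tag match   */
            *out_byte = l->data[offset];               /* selector    */
            return true;                               /* hit         */
        }
    }
    return false;                                      /* miss        */
}
```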
1.2 Increasing Block Size
By increasing the block size we decrease the miss rate: we speculatively load in more data, which helps programs with high spatial locality. However, loading more data than is needed has a cost, so too large a block size increases the miss penalty (more data per fill) and eventually the miss rate too, since fewer distinct blocks fit in the cache.
1.3 Victim Caching
In a victim caching structure, we combine the speed of a direct mapped cache with the flexibility of a set associative cache. A small fully associative cache (the victim cache) is placed between the main cache and main memory. When a block is evicted from the main cache, it is placed in the victim cache. On a miss in the main cache, the victim cache is checked. If there is a hit in the victim cache, the block is swapped with the evicted block from the main cache. A victim cache has a cost of extra comparators and selectors, but can significantly reduce conflict misses.
1.4 Skewed Associative Caching
In a conventional set associative cache, all ways share the same index bits, so we get conflicts whenever more blocks map to a set than there are ways. In a skewed associative cache, each way is indexed by a different hash of the address, so blocks that conflict in one way are unlikely to conflict in the others. This:
- Possibly reduces the associativity needed for a given conflict miss rate
- Provides a more predictable average performance
- Has some extra costs for hash function & implementing LRU.
1.5 Prefetching
Hardware prefetching can be done with a stream buffer. When a miss occurs in the L1 cache, the missed block is fetched into the cache and the stream buffer prefetches the following sequential blocks from memory into a small FIFO buffer. Subsequent accesses to these blocks can be serviced from the stream buffer, reducing the effective miss penalty; as entries are consumed, the buffer fetches further sequential blocks to stay ahead. On each access, the stream buffer is checked in parallel with the cache.
A multi-way stream buffer has multiple stream buffers, each fetching a different stream of data. This is useful for programs with multiple streams of data being accessed simultaneously.
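A minimal sketch of a single (next-block) stream buffer, with hypothetical names throughout (request_block stands in for the memory interface; in-flight tracking is omitted):

```c
#include <stdbool.h>
#include <stdint.h>

#define SB_DEPTH 4                 /* blocks held by the stream buffer */

typedef struct {
    uint64_t block[SB_DEPTH];      /* block-aligned addresses, FIFO order */
    int      head;                 /* oldest (next expected) entry        */
    uint64_t next_prefetch;        /* next sequential block to request    */
} stream_buffer_t;

/* Hypothetical memory-side request; assumed to deliver the block later. */
void request_block(uint64_t block_addr);

/* On an L1 miss that also misses the buffer: restart the stream just
 * past the miss address and prefetch SB_DEPTH sequential blocks.       */
void sb_allocate(stream_buffer_t *sb, uint64_t miss_block)
{
    sb->head = 0;
    sb->next_prefetch = miss_block + 1;
    for (int i = 0; i < SB_DEPTH; i++) {
        sb->block[i] = sb->next_prefetch;
        request_block(sb->next_prefetch++);
    }
}

/* Checked in parallel with the cache. A hit pops the head entry (the
 * block is moved into the cache) and refills the freed slot with the
 * next sequential block, keeping the buffer ahead of the program.     */
bool sb_hit(stream_buffer_t *sb, uint64_t block_addr)
{
    if (sb->block[sb->head] != block_addr)
        return false;
    sb->block[sb->head] = sb->next_prefetch;   /* freed slot becomes tail */
    request_block(sb->next_prefetch++);
    sb->head = (sb->head + 1) % SB_DEPTH;
    return true;
}
```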
Decoupled Access Execute
We can separate instructions that generate addresses from those that use memory results, letting the address-generating instructions run ahead and prefetch. To do this, two separate processors (or pipelines) execute the access and execute instruction streams, communicating through queues.
2. Miss Rate (Software)
Software prefetch instructions are rarely used in practice: they require compiler and instruction set support and are difficult to use effectively. Nevertheless, software techniques for reducing the miss rate include:
- Letting the compiler decide where to insert prefetch instructions based on analysis of memory access patterns.
- By choosing instruction memory layout based on the program's control flow graph, we can reduce instruction cache misses (a link time optimisation).
- Storage layout transformations for spatial locality:
- Array Merging: improve spatial locality by merging arrays that are accessed together, e.g. choosing between an array of structs and a struct of arrays depending on the access pattern (see the first sketch after this list).
- Permuting multi-dimensional arrays: improve spatial locality by matching array layout to traversal order.
- Iteration space transformations for temporal locality:
- Loop Interchange: change order of nested loops to improve locality.
- Loop Fusion: combine adjacent loops that iterate over the same range, so data is reused while still in cache. This is not always straightforward (e.g. chains of convolution filters, where one loop consumes values the previous loop produces); shifting one loop by a few iterations can make the loops fusable.
- Blocking: improve temporal locality by working on blocks (tiles) of data that fit in cache and reusing them before moving on (see the second sketch after this list).
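As an illustration of array merging (all names and sizes here are made up for the example), merging fields that are always read together into one array of structs puts them in the same cache line:

```c
/* Before: three separate arrays. x[i], y[i] and mass[i] live in three
 * different cache lines even though they are always used together.   */
struct particles_soa {
    float x[1024];
    float y[1024];
    float mass[1024];
};

/* After array merging: fields accessed together now share a cache line. */
struct particle {
    float x, y, mass;
};
struct particle particles_aos[1024];

/* The traversal that benefits: each iteration touches one line instead
 * of three. (If a loop only ever read x, the struct-of-arrays layout
 * would be the better choice instead.)                                 */
float total_mass_weighted(const struct particle *p, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += p[i].mass * (p[i].x + p[i].y);
    return sum;
}
```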
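And a minimal sketch of loop interchange and blocking (C stores arrays row-major; N and B are illustrative values, with B chosen so a tile fits in cache):

```c
#define N 1024
#define B 64    /* tile size */

/* Loop interchange: with row-major storage, iterating j in the inner
 * loop walks memory sequentially instead of striding by N elements.  */
void sum_rows(double a[N][N], double row_sum[N])
{
    for (int i = 0; i < N; i++) {      /* bad order would be j outer, i inner */
        row_sum[i] = 0.0;
        for (int j = 0; j < N; j++)
            row_sum[i] += a[i][j];
    }
}

/* Blocking (tiling): transpose in BxB tiles so both a and b are reused
 * while their tiles are still resident in the cache.                   */
void transpose_blocked(double a[N][N], double b[N][N])
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    b[j][i] = a[i][j];
}
```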
3. Miss Penalty
3.1 Write Policy
When we use caches, we must consider policies which determine how data is managed. We can either write through or write back:
- Write through: all writes update both the cache and the next level of memory, so a block can always be discarded from the cache without losing data. A valid bit indicates whether a block contains valid data. This is simpler and makes evictions fast, but every write consumes memory bandwidth.
- Write back: writes only update the cache; modified data is written to the next level only when the block is evicted. This is faster for writes but needs more complex management to keep data consistent. A dirty bit indicates whether a block has been modified since it was loaded, and a valid bit is also used. Write back uses much less memory bandwidth and tolerates long-latency main memory better (see the sketch after this list).
We can also use write allocate or write around:
- Write allocate: on a write miss, we load the block into cache, then perform the write. This is better for temporal locality.
- Write around: on a write miss, we just write directly to main memory, without loading the block into cache. This is better for workloads with poor temporal locality.
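A minimal sketch of how a write is handled in a write-back, write-allocate cache, with a hypothetical line structure and memory interface (indexing and replacement are omitted):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;          /* cached copy differs from memory */
    uint64_t tag;
    uint8_t  data[64];
} cache_line_t;

/* Hypothetical memory interface for the sketch. */
void mem_write_block(uint64_t block_tag, const uint8_t *data);
void mem_read_block(uint64_t block_tag, uint8_t *data);

/* Write one byte in a write-back, write-allocate cache.
 * 'line' is the location the address maps to (lookup omitted). */
void cache_write(cache_line_t *line, uint64_t tag, unsigned offset, uint8_t value)
{
    if (!line->valid || line->tag != tag) {          /* write miss           */
        if (line->valid && line->dirty)
            mem_write_block(line->tag, line->data);  /* write back old block */
        mem_read_block(tag, line->data);             /* write allocate       */
        line->tag   = tag;
        line->valid = true;
        line->dirty = false;
    }
    line->data[offset] = value;  /* update only the cache...                 */
    line->dirty = true;          /* ...and write back to memory on eviction  */
}
```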
3.2 Write Buffer
Introducing a write buffer between the cache and main memory can help reduce the effective miss penalty for write through caches. When a write occurs, data is placed in the write buffer and the CPU can continue executing without waiting for the write to complete. The write buffer then writes data to main memory in the background.
However, reads must now check the write buffer in parallel with the cache, to ensure they see the most up-to-date data. Write buffers typically hold 2-8 cache lines. They work well for write bursts, since multiple writes to the same block can be coalesced.
3.3 Early Restart
In early restart, when a cache miss occurs we begin fetching the missed block from main memory as usual, but resume execution as soon as the requested word arrives, rather than waiting for the whole block; the remaining data is fetched in the background. This reduces the effective miss penalty, as we do not wait for the entire block before resuming execution.
In critical word first, we fetch the requested word first, then the rest of the block. This further reduces the effective miss penalty, as we can resume execution as soon as the requested word is available.
However, with early restart we can get a race condition: another load to the same block may arrive before the rest of the block has been fetched. To handle this, we can divide the cache line into sectors, each with its own valid bit: we allocate whole cache lines but deliver and validate data in sectors.
3.4 Non-blocking Cache
A non-blocking cache allows supplying data to the CPU while a miss is being serviced. Hit under miss reduces effective penalty by working during miss service. Hit under multiple miss or miss under miss further lowers effective penalty by allowing overlapping miss servicing. This significantly increases cache controller complexity and requires multiple memory banks.
3.5 Multi-level Cache
We can have a multi-level cache. Then:

AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2

In other words, the miss penalty of L1 is the AMAT of L2. If there were an L3 cache, the miss penalty of L2 would be the AMAT of L3, and so on.
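As an illustrative example (all numbers assumed): with a 1-cycle L1 hit time, a 5% L1 miss rate, a 10-cycle L2 hit time, a 20% local L2 miss rate and a 100-cycle main memory access time, AMAT = 1 + 0.05 × (10 + 0.2 × 100) = 1 + 0.05 × 30 = 2.5 cycles.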
- Local miss rate is the number of misses at a given level divided by the number of accesses to that level.
- Global miss rate is the number of misses at a given level divided by the number of accesses from the CPU. This is what matters!
Multilevel inclusion is where all data in L1 is also in L2, and all data in L2 is also in L3. This simplifies cache coherence, but can increase miss rates (evicting from L2 evicts from L1 too).
4. Hit Time
We can pipeline the cache by splitting pipeline stages that require a cache access over multiple cycles. Cache hits then take multiple cycles, but a new access can start every cycle: this improves cache throughput at the potential cost of hit latency.
To support multiple parallel accesses to cache, we can use:
- Banking: divide the cache into independently accessible banks, selected by low-order address bits or a hash function (see the sketch after this list).
- Duplicate the cache.
- Multiport RAM: support multiple simultaneous accesses, with two wordlines per row and two bitlines per column. This is expensive and complex (but probably better than duplicating the cache).
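A tiny sketch of bank selection by low-order block-address bits (sizes are illustrative):

```c
#include <stdint.h>

#define LINE_BYTES 64
#define NUM_BANKS   4   /* must be a power of two for the mask below */

/* Consecutive cache lines map to different banks, so sequential and
 * independent accesses can proceed in parallel.                      */
unsigned bank_of(uint64_t addr)
{
    uint64_t block = addr / LINE_BYTES;          /* drop the offset bits  */
    return (unsigned)(block & (NUM_BANKS - 1));  /* low-order block bits  */
}
```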
5. Address Translation
To securely run programs in their own virtual address spaces we need to translate virtual addresses to physical addresses. There are three common translation schemes:
- Physically Indexed, Physically Tagged (PIPT) uses physical addresses for both indexing and tagging the cache. This requires a translation lookaside buffer (TLB) lookup before accessing the cache, which can add latency.
- Virtually Indexed, Virtually Tagged (VIVT) uses virtual addresses for both indexing and tagging the cache. This allows for faster cache access, reducing hit time. However, it can lead to issues with synonyms (multiple virtual addresses mapping to the same physical address) and requires additional complexity to maintain coherence.
- Virtually Indexed, Physically Tagged (VIPT) combines the benefits of both PIPT and VIVT. The data and tag can be accessed in parallel with the TLB lookup, reducing hit time while avoiding synonym issues. However, it requires careful design to ensure that the cache size and page size are compatible.
5.1 Paging
Each virtual address space is divided into pages of equal size, and main memory is divided into page frames of the same size. This allows non-contiguous allocation of memory, reducing fragmentation and making it easier to swap pages to disk.
A virtual address is divided into a virtual page number (VPN) and a page offset. The VPN is used to index into a page table, which maps VPNs to physical page numbers (PPNs). The page offset is then used to access the specific byte within the physical page.
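A small sketch of the split and translation, assuming 4 KiB pages and a hypothetical page_table_lookup:

```c
#include <stdint.h>

#define PAGE_SIZE   4096u                 /* 4 KiB pages (assumed) */
#define OFFSET_BITS 12                    /* log2(PAGE_SIZE)       */

/* Hypothetical page table: maps VPN -> PPN. */
uint64_t page_table_lookup(uint64_t vpn);

uint64_t translate(uint64_t vaddr)
{
    uint64_t offset = vaddr & (PAGE_SIZE - 1);   /* byte within the page */
    uint64_t vpn    = vaddr >> OFFSET_BITS;      /* virtual page number  */
    uint64_t ppn    = page_table_lookup(vpn);    /* TLB would cache this */
    return (ppn << OFFSET_BITS) | offset;        /* physical address     */
}
```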
The translation lookaside buffer is a small cache of the page table, storing recent VPN to PPN mappings. This allows for faster address translation, reducing the overhead of page table lookups. This is often a highly associative cache.
5.2 Synonyms & Homonyms
A homonym occurs when the same virtual address maps to different physical addresses in different processes. If the cache is indexed and tagged with virtual addresses, a process can hit on stale data cached by another process under the same virtual address. To avoid this, we can flush the cache on context switches, or include a process identifier (PID) in the cache tag.
A synonym occurs when multiple virtual addresses (from one or more processes) map to the same physical address. In a virtually indexed cache, the same physical block could be cached twice under different virtual addresses, so updates to one copy are not reflected in the other. To avoid this, we must force synonyms to share the same cache index bits. This is called page colouring.
5.3 Avoiding Translation
In a VIPT scheme, we can keep address translation off the critical path of the cache access:
If the cache index consists only of page-offset (untranslated) address bits, we can read the index and tags in parallel with the TLB translation. However, this limits the cache size to no more than the page size multiplied by the associativity, so to build a bigger cache we must increase the associativity. L1 caches are often highly associative for this reason.
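For example (illustrative numbers): with 4 KiB pages the index and offset must fit within 12 untranslated bits, so an 8-way set associative L1 can be at most 8 × 4 KiB = 32 KiB.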
The L2 cache is indexed with the translated (physical) address, so which blocks conflict in it depends on the virtual-to-physical mapping. Here, the OS should use page colouring so that page placement does not cause unnecessary conflicts.
5.4 Page Colouring
As an alternative to restricting the index to untranslated bits, we can use page colouring in the L1 cache to ensure that synonyms map to the same cache index. This involves controlling the allocation of physical pages so that the relevant index bits of the physical address match those of the virtual address. This can be done by maintaining multiple free lists of physical pages, one per colour (i.e., per group of cache index bits); when allocating a new page, we select a physical page from the free list whose colour matches the virtual address being mapped.
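A small sketch of how the OS might compute a page's colour, with illustrative cache parameters; the colour is the part of the cache index that comes from the page number:

```c
#include <stdint.h>

#define PAGE_SIZE   4096u
#define CACHE_SIZE  (256u * 1024u)   /* physically indexed cache (assumed) */
#define CACHE_WAYS  4u

/* Number of distinct colours: how many pages span one cache way. */
#define NUM_COLOURS (CACHE_SIZE / (CACHE_WAYS * PAGE_SIZE))

/* Pages with the same colour compete for the same cache sets, so the
 * allocator keeps one free list per colour and picks the list whose
 * colour matches that of the virtual page being mapped.              */
unsigned page_colour(uint64_t page_frame_number)
{
    return (unsigned)(page_frame_number % NUM_COLOURS);
}

unsigned colour_for_vaddr(uint64_t vaddr)
{
    return (unsigned)((vaddr / PAGE_SIZE) % NUM_COLOURS);
}
```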
6. DRAM
In DRAM data is stored in a square array of cells, with the address split into a row and column. To access data:
- The row address selects a row to access, via its wordline.
- The cells in that row discharge onto the bitlines.
- The cell state is sensed and latched by per-column sense amplifiers.
- The column address selects which of the latched columns drives the output.
- The data must then be written back (restored) to the selected row, since reading it is destructive.
Since the row is latched, subsequent accesses to the same row are faster (page mode). To access a different row, we must first precharge the bitlines, then activate the new row. This means row access cycle time is larger than the row access time, as the data needs to be written back after it is read.
After a while the charge leaks away, so every cell must be refreshed periodically (typically of the order of every 64 ms).
6.1 Error Correction
We can add redundancy via parity or ECC bits, typically using a Hamming-based SECDED code (single error correction, double error detection). For example, protecting a 64-bit word with SECDED needs 8 check bits, which is why ECC DIMMs are 72 bits wide.
Memory upsets can also be caused by repeatedly activating neighbouring rows (row hammering), which disturbs the charge in nearby cells; this is possible even in modern DRAM devices. Error correcting codes can help mitigate it, as can adaptive refresh schemes such as target row refresh (refreshing the neighbours of heavily activated rows).
7. Memory Parallelism
We can achieve memory parallelism, in both addressing and data transfer, by having multiple memory banks. Each bank can be accessed independently, allowing multiple memory operations to proceed simultaneously; this increases overall memory bandwidth and helps hide memory access latency.