Efficient Code
We want to build a balanced system, with no bottlenecks. In practice, true balance is unachievable: code is always either compute-bound, latency-bound, or bandwidth-bound.
1. CPU Efficiency
Applications typically become CPU bound when they are poorly implemented, operate on small datasets, are math heavy, or have already had memory-oriented optimisations applied. The target metric is wall-clock time (the time taken to execute a program). Stall cycles are also a good indicator of CPU efficiency; stalls are caused by hazards:
- Control Hazards are stalls due to data-dependent changes in the control flow.
- Structural Hazards are stalls due to lack of execution resources.
- Data Hazards are stalls due to operands being unavailable on time.
CPU efficiency is strongly influenced by pipeline design. Is it:
- Speculative? This keeps the pipeline full even when it is not yet known which instructions will execute next. Addresses some control hazards, but occasionally the pipeline must be flushed on a misprediction.
- Doing superscalar execution? Allows multiple instructions to occupy the same pipeline stage simultaneously.
- Doing out-of-order execution? Exploits instruction independence to execute instructions as soon as their operands are ready, rather than strictly following program order. Addresses data & structural hazards.
- Executing statically parallel (e.g. SIMD)? Allows performing the same operation on multiple data points simultaneously. Addresses data hazards.
Write runtime predictable code: evaluate code as early & as little as possible.
1.1 Partial Evaluation
Program evaluation is a multi-phase process, with the fixpoint being the final result. Partial evaluation optimizes code by precomputing the parts of it for which the inputs are already known at compile time. Examples include function inlining, JIT compilation and constant folding.
1.2 Loop Specialization
Loop specialization is an optimisation technique wherein a loop is transformed to handle specific cases more efficiently. Compilers can be instructed to generate such specializations via metaprogramming.
1.3 If Conversion
If-conversion / Predication can be used to eliminate control hazards by converting conditional branches into straight-line code with predicated instructions. However, there are cases where this can lead to performance degradation, such as when the condition is rarely true or when the predicated instructions are more expensive than the original branch.
1.4 SIMD Vectorization
SIMD Vectorization allows performing the same operation on multiple data points simultaneously, which can significantly improve performance for data-parallel tasks. However, it may not always be beneficial, especially if the data is not well-aligned or if the overhead of setting up SIMD operations outweighs the performance gains.
2. Data Hazards
Caches have a fixed cache line size (e.g. 64 bytes) and fixed capacity. When data accessed by an instruction is not in the cache, a cache miss occurs, stalling the pipeline and causing a data hazard. If the data has been accessed before, it is a capacity miss; otherwise, it is a compulsory miss.
- Code is fully memory bound if all stalls are due to data hazards.
- If the memory bus is fully utilized, it is memory bandwidth bound.
- Otherwise, it is memory latency bound.
2.1 Prefetching
If code is latency bound & cache misses are compulsory, we may prefetch. CPUs will speculatively load the next cache line by recognizing memory access patterns (hardware prefetching). This works well for regular memory accesses but breaks down for irregular ones.
Software Prefetching can be used to explicitly prefetch data into cache, optimising irregular access patterns. This is done with CPU intrinsics.
2.2 Increasing Utilization
If code is bandwidth bound, we should increase cache line utilization, i.e. ensure that as many bytes of each fetched cache line as possible are actually used.
2.3 Reducing Cache Footprint
Capacity misses are caused by thrashing: when the working set of data exceeds the cache capacity, leading to frequent evictions and reloads of data.
Loop tiling is an optimization technique that breaks down a loop into smaller blocks or tiles, allowing for better cache utilization and reducing cache misses. By processing data in smaller chunks that fit into the cache, we can minimize thrashing and improve performance.
Multipass Algorithms that perform strictly more work can be more efficient if they reduce the cache footprint. For example, a single-pass algorithm that processes a large dataset may cause many cache misses, while a multi-pass algorithm that processes the data in smaller chunks can fit into the cache and reduce misses, ultimately improving performance despite doing more work.
3. False Sharing Hazards
MESI (Modified, Exclusive, Shared, Invalid) is a cache coherence protocol that ensures consistency of data across multiple cores. It defines the states of cache lines and the rules for transitioning between these states to maintain coherence:
- Modified: The cache line is modified and only present in the current cache. It must be written back to memory before being evicted.
- Exclusive: The cache line is unmodified and only present in the current cache.
- Shared: The cache line is unmodified and may be present in multiple caches.
- Invalid: The cache line is invalid and cannot be used.
False Sharing is a data hazard wherein multiple threads access different variables that reside on the same cache line, leading to unnecessary cache coherence traffic and performance degradation.
To avoid this, pad data structures to ensure that variables accessed by different threads do not share the same cache line.