Efficient Code

We want to build a balanced system, one with no bottlenecks. In practice, true balance is unachievable: code is always compute-bound, latency-bound, or bandwidth-bound.

1. CPU Efficiency

CPU-bound applications are usually poorly implemented, operate on small datasets, are math-heavy, or have already had memory-oriented optimisations applied. The target metric is wall-clock time (the time taken to execute a program). Stall cycles (hazards) are also a good indicator of CPU efficiency.

CPU efficiency is strongly influenced by pipeline design: hazards that stall the pipeline waste issue slots.

Write runtime-predictable code: evaluate as much as possible ahead of time, and as little as possible at runtime.

1.1 Partial Evaluation

Program evaluation is a multi-phase process, with the fixpoint being the final result. Partial evaluation optimises code by precomputing the parts of it whose inputs are known at compile time. Examples include function inlining, JIT compilation, and constant folding.

1.2 Loop Specialization

Loop specialization is an optimisation technique wherein a loop is transformed to handle specific cases (e.g. a known trip count or stride) more efficiently. With metaprogramming, these cases can be specified so the compiler generates a dedicated version of the loop for each.

1.3 If Conversion

If-conversion / Predication can be used to eliminate control hazards by converting conditional branches into straight-line code with predicated instructions. However, there are cases where this can lead to performance degradation, such as when the condition is rarely true or when the predicated instructions are more expensive than the original branch.

1.4 SIMD Vectorization

SIMD Vectorization allows performing the same operation on multiple data points simultaneously, which can significantly improve performance for data-parallel tasks. However, it may not always be beneficial, especially if the data is not well-aligned or if the overhead of setting up SIMD operations outweighs the performance gains.

2. Data Hazards

Caches have a fixed cache line size (e.g. 64 bytes) and a fixed capacity. When data accessed by an instruction is not in cache, a cache miss occurs, stalling the pipeline and causing a data hazard. If the data has been accessed before, it's a capacity miss; otherwise, it's a compulsory miss.

2.1 Prefetching

If code is latency-bound and the cache misses are compulsory, we may prefetch. CPUs speculatively load upcoming cache lines by recognising memory access patterns (hardware prefetching). This works well for regular (sequential or strided) accesses but breaks down for irregular ones.

Software prefetching can be used to explicitly prefetch data into cache, optimising irregular access patterns. This is done with CPU intrinsics.

2.2 Increasing Utilization

If code is bandwidth-bound, we should increase cache line utilization. This is done by changing the data layout in memory so that more of the data loaded into cache is actually used by instructions. For example, if we have a struct with many fields but only use a few, we can rearrange the struct to group the hot fields together, improving cache line utilization.

2.3 Reducing Cache Footprint

Capacity misses are caused by thrashing: the working set of data exceeds the cache capacity, leading to frequent evictions and reloads of data.

Loop tiling is an optimization technique that breaks down a loop into smaller blocks or tiles, allowing for better cache utilization and reducing cache misses. By processing data in smaller chunks that fit into the cache, we can minimize thrashing and improve performance.

Multipass algorithms that perform strictly more work can still be more efficient if they reduce the cache footprint. For example, a single-pass algorithm over a large dataset may cause many cache misses, while a multi-pass algorithm that processes the data in cache-sized chunks can avoid them, improving performance despite doing more work.

3. False Sharing Hazards

MESI (Modified, Exclusive, Shared, Invalid) is a cache coherence protocol that ensures consistency of data across multiple cores. It defines the states a cache line can be in and the rules for transitioning between them to maintain coherence.

False sharing is a coherence hazard wherein multiple threads access different variables that reside on the same cache line, leading to unnecessary cache coherence traffic and performance degradation.

To avoid this, pad data structures to ensure that variables accessed by different threads do not share the same cache line.
