Profiling
1. Events
To identify optimisation opportunities we can:
- Identify the hot path.
- Identify the bottleneck resource.
Both are functions of system behaviour.
System behaviour is described by events, i.e. changes in system state, usually restricted to a certain granularity. Events can be:
- Simple / Atomic: e.g. sent packet, executed instruction, clock tick.
- Complex: e.g. cache line evicted, instruction aborted due to misspeculation.
Events may have a payload. In this case they also have accuracy (the degree to which the payload represents reality).
Event sources are composed of a generator (observing system state) and a consumer (processing events). Consumption can be online (real-time) or offline (post-mortem).
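To make the terminology concrete, here is a small sketch of an event record and a generator/consumer pair; all names are illustrative rather than taken from any particular tool:

```c
/* Illustrative shape of an event source: a generator observes system state
 * and hands events (kind + optional payload) to a consumer, which may run
 * online (as here) or offline (over a recorded log). All names are made up. */
#include <stdint.h>
#include <stdio.h>

enum event_kind { EV_PACKET_SENT, EV_CACHE_LINE_EVICTED };

struct event {
    enum event_kind kind;
    uint64_t timestamp;   /* e.g. CPU reference cycles */
    uint64_t payload;     /* e.g. the evicted cache line's address */
};

typedef void (*consumer_fn)(const struct event *);

static void print_consumer(const struct event *ev) {    /* online consumer */
    printf("t=%llu kind=%d payload=%#llx\n",
           (unsigned long long)ev->timestamp, ev->kind,
           (unsigned long long)ev->payload);
}

static void generator(consumer_fn consume) {             /* fake generator */
    struct event ev = { EV_CACHE_LINE_EVICTED, 1234, 0xdeadbeef };
    consume(&ev);
}

int main(void) { generator(print_consumer); return 0; }
```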
2. Tracing
A trace is a complete log of every state the system has ever been in, i.e. an ordered sequence of events. Accuracy is inherited from the events, but the overhead of collecting every event may be high.
Example: Call Stack Tracing
A call stack is made of stack frames, each containing local variables, a return address and a pointer to the caller's stack frame.
Recording the entire call stack is expensive, as the tracer must walk the stack on every function call/return. For small, cheap functions this overhead is significant, causing perturbation.
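As an illustration, one common stack-walking technique follows the frame-pointer chain. A minimal sketch for x86-64 with GCC/Clang, assuming the binary preserves frame pointers (e.g. built with -O0 or -fno-omit-frame-pointer); real unwinders validate pointers and fall back to unwind tables:

```c
/* Walking the call stack via the frame-pointer chain: on x86-64, each frame
 * stores the caller's frame pointer at fp[0] and the return address at fp[1].
 * A sketch only; the depth cap stands in for the bounds checks a real
 * unwinder would do. */
#include <stdint.h>
#include <stdio.h>

static void backtrace_fp(void) {
    uintptr_t *fp = __builtin_frame_address(0);   /* this frame's pointer */
    for (int depth = 0; fp && depth < 32; depth++) {
        printf("return address: %#lx\n", (unsigned long)fp[1]);
        fp = (uintptr_t *)fp[0];                  /* hop to caller's frame */
    }
}

static void leaf(void)  { backtrace_fp(); }
static void inner(void) { leaf(); }

int main(void) { inner(); return 0; }
```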
Tracing collects events but is subject to perturbation, which must be managed. Analysing traces is also tedious.
2.1 Perturbation
Perturbation is the effect of the measurement process on the system being measured, which can distort the results. To reduce perturbation we can reduce fidelity (the amount of detail recorded). Perfect fidelity records all events, while reduced fidelity does not.
2.2 Sampling
Fidelity can be reduced with sampling: rather than recording every event, we sample regularly or randomly. Skipping events risks missing a critical one; fortunately, critical events tend to run for longer and are therefore more likely to be sampled.
The sampling interval is the distance between two consecutive samples. It can be:
- Time Based: set a hardware timer and sample when it fires, using CPU reference cycles as a proxy metric. Inaccurate, non-deterministic and noisy, but easy to interpret.
- Event Based: defined in terms of occurrences of an event (e.g. every N cache misses). Accurate, deterministic and less noisy, but harder to interpret.
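As a sketch of time-based sampling, a POSIX profiling timer can drive the samples; a real profiler's handler would capture the program counter or call stack rather than just counting:

```c
/* Time-based sampling sketch: SIGPROF fires every 10 ms of consumed CPU
 * time, and the handler records a "sample" (here just a counter). */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t samples;

static void on_sigprof(int sig) { (void)sig; samples++; }

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigprof;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it = { .it_interval = { 0, 10000 },   /* 10 ms period */
                            .it_value    = { 0, 10000 } };
    setitimer(ITIMER_PROF, &it, NULL);

    for (volatile long i = 0; i < 500000000; i++) ;        /* busy work */

    printf("collected %d samples\n", (int)samples);
    return 0;
}
```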
Quantisation errors occur when interval resolution is limited while time is continuous: some events may be missed or misattributed.
2.3 Indirect Tracing
Some trace events are dominated by others, i.e. they only execute depending on the dominating event's outcome. This can be used to reduce overhead: we trace only the dominating events and infer the dominated ones, retaining good fidelity and accuracy.
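A contrived sketch of the idea, with a made-up trace_loop_entry helper: logging the dominating loop-entry event once, with the trip count as payload, is enough to reconstruct the per-iteration events offline:

```c
/* Indirect tracing sketch: instead of one event per iteration (n dominated
 * events), log the dominating loop-entry event once with the trip count as
 * payload. The helper and names are made up for illustration. */
#include <stdio.h>

static void trace_loop_entry(const char *name, long trips) {
    fprintf(stderr, "loop %s: %ld iterations\n", name, trips);
}

int main(void) {
    long n = 1000000, sum = 0;
    trace_loop_entry("sum_loop", n);   /* one dominating event...          */
    for (long i = 0; i < n; i++)       /* ...implies n dominated iteration */
        sum += i;                      /* events, inferred offline         */
    printf("sum = %ld\n", sum);
    return 0;
}
```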
3. Profiling
A profile is a representation of information relating to particular characteristics of a system, in terms of the resources it spends in certain states, recorded in quantified form. Essentially, it's an aggregate of events over a specific metric (e.g. total cache misses, cycles per instruction, cache misses per line of code). Profiles lose information, but can be maintained in real time with reduced perturbation.
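A minimal sketch of the aggregation step, using made-up function ids: each incoming sample bumps a counter instead of being logged, so the profile stays bounded in size while the event order is lost:

```c
/* Profiling as aggregation: each incoming event bumps a counter keyed by
 * state (here, a made-up function id) instead of being logged individually. */
#include <stdio.h>

#define N_FUNCS 3
static unsigned long profile[N_FUNCS];              /* samples per function */

static void on_sample(int func_id) { profile[func_id]++; }

int main(void) {
    int sampled_funcs[] = { 0, 2, 2, 1, 2, 0, 2 };  /* pretend sample stream */
    for (unsigned i = 0; i < sizeof sampled_funcs / sizeof *sampled_funcs; i++)
        on_sample(sampled_funcs[i]);
    for (int f = 0; f < N_FUNCS; f++)
        printf("func %d: %lu samples\n", f, profile[f]);
    return 0;
}
```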
3.1 Flame Graphs
- X-axis shows stack profile population, sorted alphabetically (not by time).
- Y-axis shows stack depth.
- Each rectangle represents a stack frame; its width is proportional to the number of collected samples.
- Colours are arbitrary, used to distinguish functions.
4. Event Sources
Event sources should be detailed, accurate and cause little perturbation. We can achieve this either with:
- Software: using a library (manual instrumentation / logging), using a compiler (automatic instrumentation), or with the OS (kernel counters).
- Hardware: using performance monitoring counters (PMCs). Expensive to implement in the CPU, but cause no observable perturbation to software.
4.1 Instrumentation
Instrumentation means augmenting the program with event logging code. Despite high overhead, there is no need for hardware support and it is very flexible. It can be:
- Manual (using libraries to log events). Gives fine control, with no hardware / compiler support required. However, it has high implementation & runtime overhead, is typically disabled for release builds, and needs recompilation for selective instrumentation.
- Automatic (using the compiler to insert logging code). Gives less control & needs compiler support (see the sketch after this list).
- Binary Instrumentation (modifying the binary to insert logging code). No source code or recompilation needed, but requires sophisticated tools. Static instrumentation modifies the binary before execution; dynamic instrumentation modifies it during execution. Static is simple & portable but inflexible; dynamic is flexible (and also works with JIT compilation) but complex & platform dependent.
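As an example of the automatic flavour, GCC and Clang support -finstrument-functions, which makes the compiler insert calls to two hooks at every function entry and exit. A minimal sketch:

```c
/* Build with: gcc -finstrument-functions <file>.c
 * The attribute keeps the hooks themselves uninstrumented, which would
 * otherwise cause infinite recursion. */
#include <stdio.h>

__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site) {
    fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site) {
    (void)call_site;
    fprintf(stderr, "exit  %p\n", fn);
}

static int square(int x) { return x * x; }   /* gets instrumented */

int main(void) { return square(3) - 9; }
```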
4.2 Performance Counters
Software Performance Counters (OS) are kernel-level counters that log events like context switches, page faults and I/O operations. They have low overhead, but limited accuracy & detail.
Hardware Performance Counters are special registers that can be configured to count low-level events like cache misses, branch mispredictions and instructions executed. Unfortunately, they are often buggy, poorly documented and may have poor accuracy.
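For instance, on Linux hardware counters are exposed through the perf_event_open(2) system call. A minimal sketch counting instructions retired over a region (error handling omitted for brevity):

```c
/* Reading a hardware counter on Linux via perf_event_open(2); there is no
 * glibc wrapper, so it is invoked through syscall(). Counts user-space
 * instructions retired in the measured region. */
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* count for this process, on any CPU */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    for (volatile long i = 0; i < 1000000; i++) ;   /* measured region */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count;
    read(fd, &count, sizeof count);
    printf("instructions retired: %lld\n", count);
    close(fd);
    return 0;
}
```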
5. Bottleneck Analysis
CPUs can stall on control dependencies (branch mispredictions), due to a lack of compute resources, and on data accesses (e.g. cache misses).
To analyse this, we can count the cycles during which certain conditions hold. For example: