Multicore & Parallelism
1. Parallelism
Parallelism can be:
- Data Level Parallelism (DLP) - e.g. vector instructions.
- Instruction Level Parallelism (ILP) - e.g. superscalar out-of-order processors.
- Task Level Parallelism (TLP) - e.g. multiple threads or processes. This became more important at the end of Dennard scaling.
Parallelism means performing multiple computations simultaneously. Concurrency means managing multiple computations at the same time, which may or may not be executed simultaneously. Parallelism is a subset of concurrency.
Parallelism is good for performance, cost efficiency (can share RAM, disk, etc). Datacenters are optimised for total cost of ownership, including hardware purchase, power usage, cooling, security and Mean Time To Failure (MTTF).
Amdahl's Law

$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$$

where $P$ is the proportion of the program that can be parallelized, and $N$ is the number of processors. As $N \to \infty$, $S(N) \to \frac{1}{1 - P}$, which means that the maximum speedup is limited by the sequential portion of the program.
2. Multithreading
Threads are concurrent, and share memory within a process. If mapped to separate cores, we have parallelism. In reality, low-level operations are complex to program, and thread management / communication is expensive.
2.1 Performance Analysis
Naive multithreading is as follows:
- Start with a sequential application.
- Apply existing tools for performance analysis.
- Use Amdahl's law to identify what to parallelize.
- Use multithreading to execute in parallel.
However, this is often wrong as it ignores the critical path (the sequence of tasks determining the minimum time for an operation). Often, the best choice is to parallelize the critical path, even if it has a smaller proportion of the runtime.
For example, if `a()` takes 7s, `b()` takes 7s and `c()` takes 5s, but `a()` and `b()` are parallelized, then `c()` is the best function to speed up as it is on the critical path, even though it has a smaller proportion of the runtime.
2.2 Communication & Synchronization
2.2.1 Explicit Sharing
Explicit Sharing is when threads share data through shared memory, requiring synchronization to avoid race conditions.
Cache coherency keeps data consistent across cores. The MSI protocol is a cache coherency protocol that maintains three states for each cache line: Modified, Shared, and Invalid.
- Modified: The cache line is modified and only present in the current cache. It must be written back to memory before being evicted.
- Shared: The cache line is unmodified and may be present in multiple caches.
- Invalid: The cache line is invalid and cannot be used.
To build synchronization on top of coherency, we need hardware support for atomic read/modify/write (RMW). We can add a locked state, which allows a core to lock a cache line while it is being modified, preventing other cores from accessing it until the modification is complete.
The universal atomic instruction is compare-and-swap (CAS): `if (*addr == old_val) *addr = new_val;`. This is used to implement synchronization primitives like locks and semaphores. Hardware provides more atomic functions for performance, but CAS is enough to implement any synchronization primitive.
2.2.2 Critical Sections
It is possible to write code that can produce incorrect states due to concurrency issues. This section of code is a critical section. We use a lock to guard related critical sections.
2.2.3 Synchronization
Threads need to synchronize to access critical sections. There are many primitives for this, including mutexes, semaphores, shared locks, condition variables & barriers.
In exclusive mode, only one thread can access a critical section at a time. In shared mode, multiple threads can access a critical section simultaneously, but only if they are not modifying shared data.
User Level Lock Implementation
We use a CAS to signal when a lock is acquired:
- On lock, loop until a CAS succeeds in switching the lock from released to acquired.
- On unlock, set the lock to released.
However, this wastes cycles, and has potential for thread starvation (as no fairness guarantee on acquire order).
Kernel Level Lock Implementation
We can use a futex (fast user-space mutex) to avoid busy-waiting. This syscall reschedules the thread if it blocks trying to acquire. This is expensive if the lock is already released (`CAS + syscall` vs `CAS`), but it keeps fairness/ordering of blocked threads.
Hybrid Lock Implementation
We can use a hybrid approach, where we first try to acquire the lock using CAS, and if it fails, we fall back to using a futex. This allows us to avoid the overhead of a syscall in the common case where the lock is not contended, while still providing fairness and avoiding starvation in cases where the lock is contended.
2.2.4 False Sharing
When multiple threads access separate variables that sit on the same cache line, they can cause false sharing, which leads to performance degradation due to cache coherency traffic. To avoid this, we can pad data structures to ensure that frequently accessed variables are on separate cache lines (a performance / memory tradeoff):
```cpp
#define CACHE_LINE_SIZE 64
struct __attribute__((aligned(CACHE_LINE_SIZE))) PaddedData {
    uint32_t data;
    char padding[CACHE_LINE_SIZE - sizeof(uint32_t)];
};
```
2.3 Thread & Task Management
Synchronization and shared variables are key for correctness, but multi-threading has more factors that affect performance.
2.3.1 On Demand
On Demand threading creates a thread for each task that arrives in the system:

```cpp
void process_burst(const std::vector<size_t>& jobs) {
    for (size_t job : jobs) {
        // Capture job by value: a detached thread must not reference loop state.
        std::thread t([job]() { process_job(job); });
        t.detach();
    }
}
```
This is the simplest parallel approach, but it is expensive if num_threads > num_cores, especially when holding locks. We could create new threads for every batch (similar to #pragma omp for):
```cpp
void process_burst(const std::vector<size_t>& jobs) {
    std::vector<std::thread> threads;
    // Capture j by value so each thread processes its own job.
    for (size_t j : jobs) { threads.emplace_back([j]() { process_job(j); }); }
    for (auto& t : threads) { t.join(); }
}
```

Here, each batch runs its tasks in parallel, but frequent thread creation / destruction causes large overhead.
2.3.2 Worker Threads
Instead, we could create a thread pool, where a fixed number of worker threads are created at startup, and tasks are added to a queue for the workers to process. This is thread dispatching:
```cpp
size_t worker_idx;
std::vector<worker_deque> workers;

void process_burst(const std::vector<size_t>& jobs) {
    for (auto job : jobs) {
        workers[worker_idx++ % workers.size()].add_job(job);
    }
}
```
This has minimal queue contention, but it is event driven, and the producer is unaware of consumer imbalance. Instead, with work stealing, consumers pull jobs from a shared job queue:
```cpp
worker_deque job_queue;

void process_burst(const std::vector<size_t>& jobs) {
    for (auto job : jobs) { job_queue.add_job(job); }
}
```
This has optimal consumer balancing. However, being event driven means expensive queue manipulation due to contention.
2.3.3 Streaming & SEDA
Streaming (pipelining) means breaking a task into stages. Each function (stage) is given a thread and an input queue, and each function enqueues its result into the next function's input queue. This is event driven, so queue manipulation is expensive, but it has good code & temporary-data locality; locality for IO data is poor.
Staged Event Driven Architecture (SEDA) is designed for massive concurrency and dynamic load. It generalises streaming by allowing multiple threads per stage, and dynamic load balancing. Each stage has a thread pool, and tasks are enqueued to stages based on their type. This allows for better performance under high load, but can be complex to implement and manage. All that the programmer needs to do is to define the stages and the tasks, and the SEDA framework will handle the rest.
3. Multiprocessing
Multiprocessing uses multiple processes to solve a single problem. It has the same concurrency / parallelism capabilities, but different communication & synchronisation tradeoffs.
- No implicitly shared memory by default.
- Communication is explicit and expensive.
- Allows multiple functions / tasks that execute independently and have simple, explicitly expressed communication.
We can assign processes to each group of tasks.
Without shared memory, communication is key:
- System overheads - syscall, scheduling & memcpy costs.
- Programmability - we cannot pass complex structures, so we need serialization / deserialization, which is expensive and error-prone.
We could use explicit shared memory (supports zero-copy, no standard interface, local-only), sockets (intermediate copies, network stack processing even when local) or pipes (intermediate copies, same read-write interface, local-only).