Lecture from: 22.04.2024 | Video: Video ETHZ
We’ve established that concurrent programming is essential for harnessing the power of modern hardware, which boasts incredible transistor densities and core counts, enabling massive computations like those powering AI and large-scale simulations on supercomputers like Summit.
Context: Large Scale Computing
The scale continues to grow, with systems like Cerebras WSE-3 pushing towards a million cores on a single piece of silicon. Managing concurrency correctly and efficiently at this scale is paramount.
Learning Goals for Today
Building on Last Time:
- We saw how bad interleavings and, more fundamentally, data races (caused by memory reordering) can break concurrent programs.
- We introduced memory models as the contract defining allowable behavior and explored formal proofs.
- We examined implementations of 2-process (Peterson) and N-process locks (Filter, Bakery) using only atomic or safe/regular registers, highlighting their theoretical possibility but practical limitations (efficiency, space complexity).
Today’s New Topics:
- Locks via Read-Modify-Write (RMW): Implementing locks using atomic hardware operations.
- Higher-Level Concurrency: Discussing Deadlocks, Semaphores, and Barriers (though Semaphores/Barriers might carry over).
Learning Goals:
- Understand the pitfalls and implementation details of synchronization algorithms built on hardware primitives.
- Recognize the importance of these concepts for designing correct parallel code, even if you primarily use library functions.
Recap: Implementations using Atomic/Safe Registers
Let’s briefly recall the lock implementations from last time that relied solely on atomic/regular memory reads and writes.
Peterson Lock (Recap)
Peterson’s lock provided an elegant mutual exclusion solution for two processes using `flag` variables (intent) and a `victim` variable (tie-breaker). It guarantees mutual exclusion and freedom from starvation but is limited to two threads.
Atomic Registers (Recap)
We defined atomic registers where read/write operations appear to take effect at a single point in time (τ), ensuring reads observe the value from the immediately preceding write (providing a total order). This formalism helped us prove the correctness of algorithms like Peterson’s, assuming atomic registers for shared variables.
Filter Lock (Recap)
The Filter Lock extended Peterson’s idea to N processes using N-1 levels of “filters”. Threads progress through the levels, and at each level i a Peterson-like victim mechanism allows at most N-i threads to pass. It ensures mutual exclusion, deadlock-freedom, and starvation-freedom, but it is not FCFS and can be inefficient because each thread reads N-1 variables per level in its wait loop.
Bakery Algorithm (Recap)
Lamport’s Bakery Algorithm provides N-process mutual exclusion using only single-writer/multiple-reader (SWMR) regular registers. It uses a “take-a-ticket” system, ensuring FCFS fairness and thus starvation-freedom.
Its Java implementation requires `AtomicIntegerArray` for correctness under the JMM. While correct and fair, it also suffers from inefficiency, as threads must read all other threads’ flags and labels in the waiting loop.
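As a reference point, the core of the Bakery algorithm on top of `AtomicIntegerArray` can be sketched as follows (class and method names are illustrative, not the course’s reference code):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class BakeryLock {
    final int n;
    final AtomicIntegerArray flag;   // 1 = thread is interested
    final AtomicIntegerArray label;  // "ticket" numbers

    public BakeryLock(int n) {
        this.n = n;
        flag = new AtomicIntegerArray(n);
        label = new AtomicIntegerArray(n);
    }

    public void lock(int i) {
        flag.set(i, 1);
        // Take a ticket larger than every ticket seen so far.
        int max = 0;
        for (int k = 0; k < n; k++) max = Math.max(max, label.get(k));
        label.set(i, max + 1);
        // Wait while any interested thread has a smaller (label, id) pair.
        for (int k = 0; k < n; k++) {
            while (k != i && flag.get(k) == 1 &&
                   (label.get(k) < label.get(i) ||
                    (label.get(k) == label.get(i) && k < i))) {
                // spin
            }
        }
    }

    public void unlock(int i) {
        flag.set(i, 0);
    }
}
```

Note how lock/unlock both take the thread’s index i — the O(N) space (two arrays) is exactly the limitation discussed next.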
Limitations of Register-Based Locks
These algorithms (Filter, Bakery) demonstrate it’s possible to achieve mutual exclusion with only atomic/regular registers. However, they have practical drawbacks:
- Space Complexity: They often require shared variables proportional to the number of threads (N). Theorem 5.1 by Burns shows that any deadlock-free mutual exclusion algorithm using only read/write atomic registers needs at least N variables. This doesn’t scale well if N is potentially huge (e.g., millions of threads/tasks).
- Performance: The waiting loops involve reading multiple shared variables, leading to high contention and poor cache performance.
- Reliance on Memory Model / `volatile`: The analysis assumed no memory reordering, but real implementations require careful use of `volatile` or memory barriers to prevent failures due to compiler/hardware reordering, adding complexity and overhead.
So, while we proved these software-only constructions work (under certain assumptions), they aren’t how high-performance locks are typically implemented. We need a better approach, leveraging direct hardware support. We need to extend our model to include specialized atomic instructions provided by modern architectures.
Hardware Support for Parallelism: Read-Modify-Write Operations
Modern processors provide special instructions that can atomically read a value from memory, modify it, and write the new value back, all as a single, indivisible hardware operation. These are crucial for efficient synchronization.
Example: x86 Compare-and-Swap (CMPXCHG)
`CMPXCHG mem, reg`:
- Atomically compares the value in memory location `mem` with the value in a designated register (e.g., EAX).
- If they are equal, it writes the value from the source register `reg` to the memory location `mem` and sets the Zero Flag (ZF).
- If they are not equal, it loads the value from `mem` into the designated register (EAX) and clears the ZF.

`LOCK` Prefix: Prepending the `LOCK` instruction prefix to `CMPXCHG` (and certain other instructions) ensures that the entire read-modify-write sequence is performed atomically with respect to other processors, typically by locking the memory bus or using cache locking mechanisms.
Example: ARM Load-Linked / Store-Conditional (LDREX/STREX)
ARM uses a different approach with a pair of instructions:
- `LDREX rd, [rn]` (Load Register Exclusive): Loads a value from memory location `[rn]` into register `rd`. Crucially, it also “marks” the physical address as having an exclusive access pending by this processor (using a local “exclusive monitor”).
- `STREX rd, rm, [rn]` (Store Register Exclusive): Attempts to store the value from register `rm` into memory location `[rn]`. The store succeeds only if the executing processor still has exclusive access marked for that address (i.e., no other processor has written to it since the `LDREX`). The store’s status (success = 0, failure = 1) is written to register `rd`.
This pair allows building atomic RMW sequences: load exclusively, perform modifications in registers, and then attempt to store conditionally. If the store fails, it means another thread interfered, and the operation needs to be retried.
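Java does not expose LL/SC directly, but the same load/modify/conditional-store retry pattern shows up when building operations out of CAS. A sketch (the `AtomicMax` class and `updateMax` helper are illustrative): atomically raising a shared maximum.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicMax {
    // Retry loop in the LL/SC spirit: read the current value ("load"),
    // compute the update privately, then attempt to publish it with CAS
    // ("conditional store"). If CAS fails, another thread interfered
    // between the read and the write, so we retry.
    public static void updateMax(AtomicInteger max, int candidate) {
        while (true) {
            int current = max.get();
            if (candidate <= current) return;               // nothing to do
            if (max.compareAndSet(current, candidate)) return; // published
            // CAS failed: someone else wrote in between -- loop and retry
        }
    }

    public static void main(String[] args) {
        AtomicInteger max = new AtomicInteger(0);
        updateMax(max, 5);
        updateMax(max, 3);
        System.out.println(max.get()); // prints 5
    }
}
```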
Typical Atomic Hardware Instructions
Common atomic RMW primitives found across different architectures include:
- Test-And-Set (TAS): Atomically sets a memory location to 1 (or true) and returns its previous value. (e.g., the Motorola 68000 `TAS` instruction).
- Compare-And-Swap (CAS): Atomically compares the value in memory with an expected `old` value. If they match, it replaces the memory value with a `new` value and returns success (or the old value). If they don’t match, it does nothing and returns failure (or the current value). (e.g., x86 `LOCK CMPXCHG`, SPARC `CASA`).
- Load-Linked / Store-Conditional (LL/SC): Paired instructions as seen in ARM, MIPS, POWER, RISC-V. The load marks exclusive access; the conditional store succeeds only if exclusivity is maintained.
- Fetch-And-Add: Atomically reads a value, adds a specified amount to it, writes the result back, and returns the original value.
Cost: These atomic instructions are typically much slower than simple load or store operations because they often require complex cache coherence interactions or bus locking to ensure atomicity across multiple cores.
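Fetch-and-add is directly available in Java as `AtomicInteger.getAndAdd` (the demo class name is illustrative); it is the primitive behind “take-a-ticket” schemes:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FetchAndAddDemo {
    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(0);
        // getAndAdd atomically returns the old value and adds the delta --
        // exactly the fetch-and-add primitive described above.
        int first  = counter.getAndAdd(1); // returns 0, counter becomes 1
        int second = counter.getAndAdd(1); // returns 1, counter becomes 2
        System.out.println(first + " " + second + " " + counter.get()); // prints 0 1 2
    }
}
```

Even when many threads call `getAndAdd` concurrently, each one receives a distinct “ticket” and no increments are lost.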
Semantics of TAS and CAS
Let’s formalize the basic semantics:
`boolean TAS(memref s)`:

```
// ---- Start Atomic ----
if (mem[s] == 0) {
    mem[s] = 1;
    return true;  // Indicate success (acquired lock)
} else {
    return false; // Indicate failure (lock already held)
}
// ---- End Atomic ----
```

(Note: Some definitions return the old value, so `TAS` returning 0 means success.)

`int CAS(memref a, int old, int new)`:

```
// ---- Start Atomic ----
int oldval = mem[a];
if (old == oldval) {
    mem[a] = new;
}
return oldval; // Return the value originally read
// ---- End Atomic ----
```

(Check the return value: if `return == old`, the swap succeeded.)
Importance:
- These are fundamental Read-Modify-Write (RMW) operations.
- They enable building mutexes (locks) with O(1) space complexity (just need one shared variable for the lock state), unlike the O(N) requirement for register-only algorithms like Filter/Bakery.
- They are essential primitives for implementing lock-free programming techniques (a more advanced topic).
Implementation of a Spinlock using Atomic Operations
A spinlock is a lock where a thread attempting to acquire it repeatedly checks the lock status in a loop (it “spins”) until the lock becomes available. Here’s how to implement one using TAS and CAS:
Using Test-and-Set (TAS):
- State: `int lock = 0;` (0 = free, 1 = held)
- `Init(lock)`: `lock = 0;`
- `Acquire(lock)`: `while (!TAS(lock));` — spin while TAS returns false (lock was already 1)
- `Release(lock)`: `lock = 0;` — a simple write is enough if Release is only called by the holder
Using Compare-and-Swap (CAS):
- State: `int lock = 0;` (0 = free, 1 = held)
- `Init(lock)`: `lock = 0;`
- `Acquire(lock)`: `while (CAS(lock, 0, 1) != 0);` — spin while CAS fails to change 0 to 1 (returns the non-zero old value)
- `Release(lock)`: `CAS(lock, 1, 0);` — atomically set back to 0 (ignore result)
Read-Modify-Write in Java (`java.util.concurrent.atomic`)
Java provides high-level support for atomic operations since JDK 5 in the `java.util.concurrent.atomic` package.
- Classes: `AtomicBoolean`, `AtomicInteger`, `AtomicLong`, `AtomicReference`, etc.
- Key Operations (e.g., `AtomicBoolean`):
  - `boolean get()`: Atomic read.
  - `void set(boolean newValue)`: Atomic write.
  - `boolean compareAndSet(boolean expect, boolean update)`: Atomic CAS. Returns `true` if successful (the current value was `expect` and has been set to `update`).
  - `boolean getAndSet(boolean newValue)`: Atomic TAS-like operation. Sets the value to `newValue` and returns the previous value.
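The semantics of these operations can be checked with a few lines (the demo class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class AtomicBooleanDemo {
    public static void main(String[] args) {
        AtomicBoolean state = new AtomicBoolean(false);

        // compareAndSet succeeds only if the current value equals `expect`.
        boolean ok1 = state.compareAndSet(false, true); // true: was false, now true
        boolean ok2 = state.compareAndSet(false, true); // false: value is already true

        // getAndSet unconditionally writes and returns the previous value.
        boolean prev = state.getAndSet(false);          // returns true, value now false

        System.out.println(ok1 + " " + ok2 + " " + prev); // prints true false true
    }
}
```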
How does this work? (Experts):
- JVM bytecode itself doesn’t directly expose RMW instructions like CAS; standard synchronization relies on `monitorenter`/`monitorexit`.
- However, the JDK implementation uses a special, internal (and officially undocumented/unsupported) class `sun.misc.Unsafe` as an escape hatch.
- `sun.misc.Unsafe` provides methods that map directly (or as closely as possible) to the underlying hardware atomic operations (like CAS).
- This mapping is not strictly guaranteed, and the performance/lock-free nature depends on the specific JVM, OS, and hardware.

Example: `AtomicInteger.compareAndSet` internally often calls `Unsafe.compareAndSwapInt`.
TASLock in Java
Using `AtomicBoolean`:
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Lock;
public class TASLock implements Lock {
AtomicBoolean state = new AtomicBoolean(false); // false = free, true = held
public void lock() {
// Spin: Keep trying getAndSet(true).
// getAndSet returns the *previous* value.
// Loop continues as long as the previous value was true (lock held).
// Loop exits when previous value was false (we successfully acquired).
while (state.getAndSet(true)) {
// Body of loop is empty - just spin
}
}
public void unlock() {
// Release the lock
state.set(false);
}
// Note: tryLock(), lockInterruptibly() etc. not implemented here
}
Spinlock Behavior: The `while` loop in `lock()` continuously executes the atomic `getAndSet` operation, consuming CPU cycles while waiting.
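To see the lock in action, a small harness (restating the lock compactly; the harness names are illustrative) shows that no increments are lost when two threads contend:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TASLockDemo {
    // Compact restatement of the TASLock above (lock/unlock only).
    static class TASLock {
        final AtomicBoolean state = new AtomicBoolean(false);
        void lock()   { while (state.getAndSet(true)) {} }
        void unlock() { state.set(false); }
    }

    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        TASLock lock = new TASLock();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                lock.lock();
                try { counter++; } finally { lock.unlock(); }
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter); // prints 200000 -- no lost updates
    }
}
```

Without the lock, the unsynchronized `counter++` would lose updates; the `AtomicBoolean` also provides the happens-before edge that makes the counter’s value visible across threads.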
Simple TAS Spin Lock - Performance
Measurements show that the runtime of this simple TAS spinlock scales poorly as the number of threads (n) increases: the normalized time per thread grows significantly.
Why the Poor Performance?
- Sequential Bottleneck / Contention: The atomic `getAndSet` operation requires exclusive access to the `state` variable in memory (or cache). All spinning threads constantly fight for access to this single location via the memory bus.
- Cache Coherence Traffic: Every `getAndSet` attempt by one thread potentially modifies the cache line containing `state`. The cache coherence protocol must then invalidate the copies of this cache line held by all other spinning processors, causing them to re-fetch it over the bus on their next attempt. This generates massive amounts of bus traffic and contention.
Test-and-Test-and-Set (TATAS) Lock
To reduce the contention caused by TAS, we can use the Test-and-Test-and-Set (TATAS) optimization.
public void lock() {
do {
// 1. Test (Spin Read): First, spin locally reading the state.
// Only proceeds if the lock *appears* free (state == false).
// This read usually hits the local cache and avoids bus traffic.
while (state.get()) {}
// 2. Test-and-Set (Atomic Attempt): Once lock appears free,
// *then* try the atomic operation to acquire it.
// getAndSet returns previous value. Loop if it was already true
// (meaning someone else got the lock between our get() and getAndSet()).
} while (state.getAndSet(true));
// Alternative using compareAndSet:
// } while (!state.compareAndSet(false, true));
}
public void unlock() {
state.set(false);
}
Improvement: Threads spend most of their spinning time executing the local `state.get()`, which typically hits their cache. Only when `get()` returns `false` do they attempt the expensive atomic RMW operation (`getAndSet` or `compareAndSet`) over the bus. This significantly reduces bus traffic and contention compared to the simple TAS lock.
Measurement: The graph shows TATAS performs much better than TAS, although runtime still increases with the number of threads due to contention during the actual atomic attempt phase. (Note: Performance varies significantly across machines/JVMs; this is qualitative.)
TATAS Limitation: While better, TATAS doesn’t solve all problems. For example, the infamous “Double-Checked Locking” pattern for lazy initialization often fails due to memory reordering issues, even if atomic primitives seem involved. Relying on low-level spinning reads without proper memory barriers can be unsafe.
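For reference, the standard repair of the double-checked locking pattern under the JMM is to declare the shared reference `volatile`; a sketch (the class name and `payload` field are illustrative):

```java
public class LazySingleton {
    // Without `volatile`, double-checked locking is broken: the JMM allows
    // the store of the reference to be reordered before the constructor's
    // field stores, so another thread can observe a non-null reference to a
    // not-yet-initialized object. `volatile` forbids that reordering.
    private static volatile LazySingleton instance;

    private final int payload;
    private LazySingleton() { payload = 42; }

    public static LazySingleton getInstance() {
        LazySingleton local = instance;      // first (unsynchronized) check
        if (local == null) {
            synchronized (LazySingleton.class) {
                local = instance;            // second check, under the lock
                if (local == null) {
                    instance = local = new LazySingleton();
                }
            }
        }
        return local;
    }
}
```

The pattern only pays off because the common path (instance already created) is a single volatile read with no locking.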
TATAS with Backoff
Observation: Even with TATAS, if many threads detect the lock is free at around the same time, they will all execute the atomic `compareAndSet` or `getAndSet`, causing a burst of contention. This slows down overall progress.
Solution: Backoff: When a thread fails to acquire the lock (i.e., the atomic operation fails because someone else got it first), it shouldn’t immediately retry. Instead, it should back off – wait for a short duration before trying again.
- Random Duration: Wait for a random amount of time to reduce the probability that all waiting threads retry simultaneously.
- Exponential Backoff: Increase the expected waiting time exponentially each time the thread fails to acquire the lock (up to a maximum). This adapts to the level of contention.
Lock with Backoff Code Structure:
public void lock() {
Backoff backoff = null; // Lazy initialization of backoff helper
while (true) {
// Spin-read phase (TATAS)
while (state.get()) {};
// Attempt atomic acquire
if (!state.getAndSet(true)) {
return; // Success! Acquired the lock
} else {
// Failure - another thread got the lock
try {
if (backoff == null) { // Create backoff object on first failure
backoff = new Backoff(MIN_DELAY, MAX_DELAY);
}
backoff.backoff(); // Wait for a random, increasing duration
} catch (InterruptedException ex) {}
}
}
}
Exponential Backoff Implementation:
class Backoff {
final int minDelay, maxDelay;
int limit;
final Random random;
Backoff(int min, int max) {
minDelay = min;
maxDelay = max;
limit = minDelay;
random = new Random();
}
public void backoff() throws InterruptedException {
int delay = random.nextInt(limit); // Random delay up to current limit
if (limit < maxDelay) { // Double limit for next time, up to max
limit = 2 * limit;
}
Thread.sleep(delay); // Pause the thread
}
}
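The doubling behavior can be traced with a tiny driver (restating the Backoff class above inside an illustrative demo class so the snippet is self-contained):

```java
import java.util.Random;

public class BackoffDemo {
    // Compact restatement of the Backoff class above.
    static class Backoff {
        final int minDelay, maxDelay;
        int limit;
        final Random random = new Random();
        Backoff(int min, int max) { minDelay = min; maxDelay = max; limit = minDelay; }
        public void backoff() throws InterruptedException {
            int delay = random.nextInt(limit); // random delay up to current limit
            if (limit < maxDelay) limit = 2 * limit; // double for next time
            Thread.sleep(delay);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Backoff b = new Backoff(1, 8);
        System.out.println(b.limit); // 1
        b.backoff();
        System.out.println(b.limit); // 2
        b.backoff();
        System.out.println(b.limit); // 4
        b.backoff();
        System.out.println(b.limit); // 8
        b.backoff();
        System.out.println(b.limit); // 8 (capped at maxDelay)
    }
}
```

Only the upper bound grows deterministically; the actual sleep each round is a random draw below it, which is what de-synchronizes the retrying threads.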
Measurement: Backoff locks typically show much better scalability than simple spinlocks (TAS or TATAS) under high contention, as threads yield the CPU (`Thread.sleep`) instead of spinning uselessly and fighting over the bus. The runtime per thread remains relatively flat.
Deadlock
We’ve focused on ensuring mutual exclusion (only one thread in the CS). But using locks introduces a new potential problem: Deadlock.
Deadlock Motivation: Bank Transfer
Consider adding a `transferTo` method to our `BankAccount` class, using `synchronized` methods (which acquire the intrinsic lock of the `BankAccount` instance).
class BankAccount {
// ... balance, etc. ...
synchronized void withdraw(int amount) {/*...*/}
synchronized void deposit(int amount) {/*...*/}
// Method to transfer money FROM this account TO account 'a'
synchronized void transferTo(int amount, BankAccount a) {
this.withdraw(amount); // Acquires lock on 'this'
a.deposit(amount); // Attempts to acquire lock on 'a'
}
}
Problem: What happens if two threads try to transfer money in opposite directions simultaneously?
- Thread 1: Calls `x.transferTo(1, y)`
- Thread 2: Calls `y.transferTo(1, x)`
Scenario:
- Thread 1: Enters `x.transferTo`, acquires the lock on `x`.
- Thread 1: Executes `this.withdraw(amount)` successfully (using the lock on `x`).
- Thread 2: Enters `y.transferTo`, acquires the lock on `y`.
- Thread 2: Executes `this.withdraw(amount)` successfully (using the lock on `y`).
- Thread 1: Attempts `a.deposit(amount)` (where `a` is `y`). It tries to acquire the lock on `y`, but Thread 2 holds it. Thread 1 blocks.
- Thread 2: Attempts `a.deposit(amount)` (where `a` is `x`). It tries to acquire the lock on `x`, but Thread 1 holds it. Thread 2 blocks.
DEADLOCK! Both threads are blocked indefinitely, each waiting for a lock held by the other. This forms a circular wait.
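One standard remedy, previewing the next section, is to break the circular wait by acquiring locks in a globally consistent order. A sketch (the class name and the `id` field used for ordering are illustrative additions, not part of the lecture’s `BankAccount`):

```java
public class OrderedAccount {
    private static int nextId = 0;
    private final int id;     // hypothetical unique id, used only for lock ordering
    private int balance;

    public OrderedAccount(int initial) {
        synchronized (OrderedAccount.class) { id = nextId++; }
        balance = initial;
    }

    public synchronized int balance() { return balance; }

    // Both locks are always requested smaller-id first, so two opposite
    // transfers ask for the locks in the SAME order and the circular wait
    // from the scenario above cannot form.
    public void transferTo(int amount, OrderedAccount to) {
        OrderedAccount first  = this.id < to.id ? this : to;
        OrderedAccount second = this.id < to.id ? to : this;
        synchronized (first) {
            synchronized (second) {
                this.balance -= amount;
                to.balance   += amount;
            }
        }
    }
}
```

With this ordering, `x.transferTo(1, y)` and `y.transferTo(1, x)` running concurrently simply serialize instead of deadlocking.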
Deadlock Definition
Deadlock: A situation where two or more processes (or threads) are mutually blocked, each waiting for a resource held by another process in the cycle. No process can proceed. (Illustrated by the traffic gridlock and the 4-way intersection).
Continue here: 18 Deadlocks, Semaphores, Barriers