Lecture from 14.05.2025 | Video: Video ETHZ
We continue our exploration into advanced concurrent programming. Building on our knowledge of locks and their limitations, we’ll focus on lock-free algorithms, specifically for queues, which are vital in systems like OS schedulers. This path, however, is fraught with subtlety, most notably the ABA problem, which arises from memory reuse. We will examine solutions like pointer tagging and hazard pointers. Finally, we’ll formalize our notion of correctness for concurrent objects by introducing Linearizability.
Summary/Recap Last Lecture
Previously, we investigated several fine-grained locking strategies:
- Optimistic List-Based Sets: Traversed lock-free, then locked and validated.
- Lazy Synchronization: Introduced logical deletion (marking) to potentially simplify `remove` and enable a wait-free `contains` (single lock-free traversal).
- Lazy Skip Lists: Presented skip lists as a practical, probabilistic data structure amenable to lazy/optimistic techniques.
- Lock-Free Basics & CAS: Introduced the concept of lock-freedom and the crucial role of Compare-and-Swap (CAS) atomic operations.
Learning Goals Today
Our main goals for this lecture are:
- Reuse and ABA Problem: Deep dive into the ABA problem that occurs when nodes are reused in lock-free CAS-based algorithms. Explore solutions:
- Pointer Tagging
- Hazard Pointers
- Correctness in Parallel Programs (Linearizability): Introduce a formal correctness condition for concurrent objects.
Reuse and the ABA Problem
The previous lock-free algorithms assumed that memory nodes, once removed from the data structure, were not reused. What happens if we do reuse nodes, for example, by managing them in a memory pool?
Back to the Lock-Free Stack (with Reuse)
Let’s modify our lock-free stack to use a `NodePool` with `get()` and `put()` methods instead of `new Node()` and garbage collection. The `NodePool` itself is implemented lock-free using CAS on its internal `top` reference. `push` now calls `pool.get(item)` and `pop` calls `pool.put(head)`.
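A rough sketch of such a stack, with illustrative names (`Node`, `NodePool`) rather than the lecture's exact code. Note that this version is deliberately still vulnerable to the ABA problem examined below:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: a lock-free stack that recycles nodes through a lock-free pool.
// Names and layout are illustrative, not the lecture's exact code.
class LockFreeStack<T> {
    static class Node<T> {
        T item;
        Node<T> next;
    }

    static class NodePool<T> {
        private final AtomicReference<Node<T>> top = new AtomicReference<>();

        Node<T> get(T item) {
            Node<T> n;
            do {                                   // try to reuse a pooled node
                n = top.get();
                if (n == null) { n = new Node<>(); break; }
            } while (!top.compareAndSet(n, n.next));
            n.item = item;
            return n;
        }

        void put(Node<T> n) {
            Node<T> oldTop;
            do {                                   // push the freed node onto the pool
                oldTop = top.get();
                n.next = oldTop;
            } while (!top.compareAndSet(oldTop, n));
        }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();
    private final NodePool<T> pool = new NodePool<>();

    public void push(T item) {
        Node<T> n = pool.get(item);                // reuse instead of new Node()
        Node<T> oldTop;
        do {
            oldTop = top.get();
            n.next = oldTop;
        } while (!top.compareAndSet(oldTop, n));
    }

    public T pop() {
        Node<T> head, next;
        do {
            head = top.get();
            if (head == null) return null;
            next = head.next;                      // ABA hazard: head may be recycled here
        } while (!top.compareAndSet(head, next));
        T item = head.item;
        pool.put(head);                            // return node to the pool for reuse
        return item;
    }
}
```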
Experiment and Failure
Run N producer/consumer threads, each doing 10k push/pops using the stack with/without reuse. Check correctness (sum pushed == sum popped) and measure time.
Initial results might look okay…
But other runs fail! We get exceptions, sums don’t match, or the program hangs. Why?
The ABA Problem Explained
The reuse of nodes breaks the logic of the CAS-based `pop` operation.

- Initial State: `top` → A → B → NULL.
- Thread X starts `pop()`:
  - Reads `top` → `head = A`.
  - Reads `head.next` → `next = B`.
  - Thread X is preempted before `compareAndSet(A, B)`.
- Thread Y `pop()`: Reads `top` (A) and `next` (B). CAS(A, B) succeeds, so `top` → B. Puts Node A in the pool.
- Thread Z `push(New)`: Gets Node A from the pool. Sets `A.next = B`. CAS(B, A) succeeds, so `top` → A → B → NULL.
- Thread X resumes: Executes `compareAndSet(head, next)`, which is `compareAndSet(A, B)`.
  - It reads the current `top`, which is A.
  - It compares the current `top` (A) with its local `head` (A). They match!
  - The CAS succeeds incorrectly and sets `top` to `next` (B).
- Result: The stack becomes `top` → B → NULL. Node A (with Z’s data) is lost. Thread X thinks it popped the original A, but its operation corrupted the stack due to the reuse of Node A’s memory address.
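This interleaving can be replayed deterministically in a single thread. The `Node` class and the manual "preemption" points below are illustrative; the point is that the final CAS succeeds even though `top` changed twice in between:

```java
import java.util.concurrent.atomic.AtomicReference;

// A single-threaded replay of the ABA interleaving above. Memory reuse is
// simulated by reusing the same Node object for thread Z's push.
public class AbaDemo {
    static class Node {
        String item;
        Node next;
        Node(String item, Node next) { this.item = item; this.next = next; }
    }

    public static void main(String[] args) {
        Node b = new Node("B", null);
        Node a = new Node("A", b);
        AtomicReference<Node> top = new AtomicReference<>(a);

        // Thread X: reads head = A and next = B, then is "preempted".
        Node head = top.get();
        Node next = head.next;

        // Thread Y: pops A; node A goes back to the pool.
        top.compareAndSet(a, b);

        // Thread Z: pushes a new item, reusing node A's memory.
        a.item = "New";
        a.next = b;
        top.compareAndSet(b, a);

        // Thread X resumes: CAS(A, B) succeeds although top changed twice.
        boolean ok = top.compareAndSet(head, next);
        System.out.println(ok);               // true -- the ABA problem
        System.out.println(top.get().item);   // B -- Z's "New" item is lost
    }
}
```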
ABA Problem Definition: A CAS operation `CAS(location, expected_A, new_B)` can succeed erroneously if the value at `location` was `A`, changed to `B` (or something else), and then changed back to `A` before the CAS executed. The CAS fails to detect the intermediate changes.
How to Solve the ABA Problem
The core issue with the ABA problem in concurrent programming is that a memory location can be read (A), modified to something else (B), and then modified back to the original value (A) by another thread between the time a thread reads the initial value and attempts a Compare-and-Swap (CAS) operation. The CAS incorrectly succeeds because the value appears unchanged, potentially leading to data corruption.
Since ABA stems from reusing a location with the exact same value, we need ways to distinguish reused locations from the original. Here are several common techniques:
1. Double Compare-and-Swap (DCAS) / Multi-Word CAS
This technique involves atomically operating on two adjacent memory words instead of just one. Typically, this involves pairing the pointer (or value) of interest with a version counter or tag.
- Mechanism: The CAS operation checks both the pointer/value and the version counter simultaneously. If another thread modifies the value and restores it, it would likely also modify the version counter.
- Hardware Support: Some hardware architectures provide native instructions for DCAS or multi-word atomic operations, offering a direct and efficient solution.
- Software Simulation: In environments lacking hardware support, it can be simulated. Java provides `java.util.concurrent.atomic.AtomicMarkableReference`, which associates a boolean “mark” with a reference, allowing atomic updates to both.
Intuition
Imagine you’re not just checking the house number (the memory address/value) but also a unique serial number written on the house (the version counter). If the house is demolished and an identical one is rebuilt (value restored), it will get a new serial number. DCAS checks both the house number and the serial number, so it detects the replacement.
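As a small sketch of this API (not from the lecture), `AtomicMarkableReference` couples a reference with a boolean mark, and `compareAndSet` only succeeds if both match the expected values:

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// compareAndSet checks BOTH the reference and the mark atomically.
public class MarkableDemo {
    public static void main(String[] args) {
        String a = "A";
        AtomicMarkableReference<String> ref =
                new AtomicMarkableReference<>(a, false);

        // Succeeds: reference is "A" and mark is false, as expected.
        boolean ok = ref.compareAndSet(a, "B", false, true);
        System.out.println(ok);                    // true

        // Fails: the reference is now "B" and the mark is now true.
        boolean stale = ref.compareAndSet(a, "C", false, true);
        System.out.println(stale);                 // false

        System.out.println(ref.getReference());    // B
        System.out.println(ref.isMarked());        // true
    }
}
```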
2. Garbage Collection (GC)
In languages with automatic garbage collection (like Java or C#), the ABA problem is often implicitly mitigated, especially when dealing with objects.
- Mechanism: When an object is no longer referenced, it becomes eligible for garbage collection. If the memory address previously occupied by object A is later reused for a new object A’, even if A’ has the same value content as A, it usually has a distinct object identity in the managed runtime. A CAS operation comparing object references will compare these identities (or the new memory address), not just the underlying value, and will correctly fail if the object has been replaced.
Intuition
The garbage collector acts like a meticulous record keeper. Even if a new tenant moves into the same apartment number (memory address) and looks identical to the previous tenant (same value), the GC knows it’s a different person (new object identity). Comparing IDs reveals the change.
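A small illustration of this point (an assumed example, not from the lecture): a reference CAS compares object identity, so an equal-looking but distinct replacement object makes the CAS fail correctly:

```java
import java.util.concurrent.atomic.AtomicReference;

// With reference CAS under GC, content equality is irrelevant:
// compareAndSet compares identities, so a replacement object is detected.
public class GcIdentityDemo {
    public static void main(String[] args) {
        AtomicReference<String> ref = new AtomicReference<>(new String("A"));
        String seen = ref.get();

        // Another thread replaces the value with an equal but distinct object.
        ref.set(new String("A"));

        // CAS with the old reference fails, because the identity differs.
        System.out.println(ref.compareAndSet(seen, "B"));   // false
        System.out.println(ref.get());                      // A
    }
}
```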
3. Pointer Tagging
This method embeds version information directly within the pointer itself, leveraging unused bits.
- Mechanism: Memory pointers are often aligned to word boundaries (e.g., 4 or 8 bytes), meaning the lower few bits of the address are always zero. These spare bits can be repurposed to store a small “tag” or version counter. When the pointer’s target value is modified, the tag is incremented along with the pointer update. The CAS operation then checks both the pointer address and the tag bits.
- Implementation: Java’s `java.util.concurrent.atomic.AtomicStampedReference` uses this approach, associating an integer “stamp” with an object reference.
- Likelihood: A reused memory address is highly unlikely to still carry the exact tag value it had previously.
Intuition
Think of adding a small, incrementing version number (like a sticky note counter) directly onto the pointer address itself using otherwise wasted space. Each time the data is genuinely changed, the counter increases. When you CAS, you check both the address and this counter. If the address was reused, the counter value will almost certainly be different, exposing the ABA situation.
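A small sketch (not from the lecture) of how the stamp exposes an A → B → A change using `AtomicStampedReference`: the reference returns to its old value, but the stamp keeps increasing, so a CAS expecting the old stamp fails.

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// The value goes A -> B -> A, but the stamp goes 0 -> 1 -> 2,
// so the delayed CAS detects the intermediate changes.
public class StampedDemo {
    public static void main(String[] args) {
        AtomicStampedReference<String> top =
                new AtomicStampedReference<>("A", 0);

        // Thread X reads the value and the stamp it will later expect.
        int[] stampHolder = new int[1];
        String seen = top.get(stampHolder);        // "A", stamp 0
        int seenStamp = stampHolder[0];

        // Meanwhile, other threads change A -> B -> A, bumping the stamp.
        top.compareAndSet("A", "B", 0, 1);
        top.compareAndSet("B", "A", 1, 2);

        // Thread X's CAS fails: the value is "A" again, but the stamp is 2.
        boolean ok = top.compareAndSet(seen, "C", seenStamp, seenStamp + 1);
        System.out.println(ok);                    // false
    }
}
```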
4. Hazard Pointers
Hazard Pointers are a specific memory reclamation scheme primarily used in lock-free data structures, especially in environments without garbage collection (like C/C++), to prevent the ABA problem and ensure memory safety.
Intuition
Imagine threads need to temporarily “reserve” or put a “hazard cone” around the specific memory nodes they are currently inspecting or about to use in a CAS. Before any thread can recycle or free a node, it must check everyone else’s hazard cones. If a node is marked as hazardous by any thread, it cannot be touched.
The Core Idea
Threads publicly declare which specific nodes (memory locations) they are currently accessing and might potentially use in an upcoming operation (like CAS). This declaration prevents other threads from reclaiming or reusing that specific node while it’s “hazardous.”
How it Works
- Hazard List: A small, shared list (often a simple array, one entry per thread) holds these “hazard pointers.”
- Marking: Before a thread reads a pointer `p` that it intends to use (e.g., `p = shared_ptr.get()`), it must first place `p` into its own slot in the hazard list: `hazard[myThreadID] = p`. Crucially, it must then re-read the `shared_ptr` after setting the hazard pointer to confirm that `p` is still the current value. If the `shared_ptr` changed between the initial read and setting the hazard pointer, the thread must retry.
- Reclaiming: When a thread wants to free or reuse a node `node_to_free`, it must first scan the entire hazard list (all other threads’ entries). If `node_to_free` appears in any thread’s hazard slot, it cannot be reclaimed at this time and must be deferred.
- Clearing: Once a thread is finished with `p` (e.g., after a successful CAS or deciding not to proceed), it clears its hazard pointer: `hazard[myThreadID] = null`.
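The hazard list itself can be sketched roughly as follows; the class name, thread-ID scheme, and fixed thread count are illustrative assumptions, not the lecture's code:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch: one hazard slot per thread, indexed by a small integer thread ID.
class HazardRegistry<T> {
    private final AtomicReferenceArray<T> hazard;

    HazardRegistry(int maxThreads) {
        hazard = new AtomicReferenceArray<>(maxThreads);
    }

    // Marking: publish the node this thread is about to use.
    void setHazardous(int threadId, T node) {
        hazard.set(threadId, node);
    }

    // Clearing: the thread is done with its current node.
    void clear(int threadId) {
        hazard.set(threadId, null);
    }

    // Reclaiming: a node may only be recycled if no thread has it marked.
    boolean isHazardous(T node) {
        for (int i = 0; i < hazard.length(); i++) {
            if (hazard.get(i) == node) return true;
        }
        return false;
    }
}
```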
Example: Lock-Free Stack Operations
Consider modifying the `push` and `pop` operations of a lock-free stack:

`pop`:
1. Read the current `head` node.
2. Mark `head` as hazardous: `setHazardous(head)`.
3. Re-check: Verify that `head` is still the current top: `head == top.get()`. If not, retry from step 1.
4. Read `head.next`.
5. Attempt the CAS to replace `head` with `head.next`.
6. Clear the hazard pointer: `setHazardous(null)`.
7. Before returning the old `head` node to a reuse pool (e.g., `pool.put(head)`), check if it’s hazardous: `if (!isHazardous(head))`. If it is hazardous (meaning another thread marked it between the CAS and this check), the node cannot be immediately reused. (Handling this typically involves placing it on a thread-local list for later reclamation attempts.)

`push`:
- When acquiring a new node from a shared pool (`pool.get()`), the pool mechanism must ensure it doesn’t return a node that is currently marked as hazardous by any thread.
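Under the simplifying assumption of a single hazard slot standing in for the real per-thread array, the `pop` protocol might look roughly like this (names are illustrative; the deferred-reuse check of step 7 is omitted):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a hazard-protected pop; the order of the steps is the point.
class HazardPop<T> {
    static class Node<T> {
        T item;
        Node<T> next;
        Node(T item) { this.item = item; }
    }

    final AtomicReference<Node<T>> top = new AtomicReference<>();
    final AtomicReference<Node<T>> myHazard = new AtomicReference<>(); // this thread's slot

    void push(T item) {
        Node<T> n = new Node<>(item);
        Node<T> oldTop;
        do {
            oldTop = top.get();
            n.next = oldTop;
        } while (!top.compareAndSet(oldTop, n));
    }

    T pop() {
        while (true) {
            Node<T> head = top.get();              // step 1: read head
            if (head == null) return null;
            myHazard.set(head);                    // step 2: mark hazardous
            if (head != top.get()) continue;       // step 3: re-check, else retry
            Node<T> next = head.next;              // step 4: read head.next
            if (top.compareAndSet(head, next)) {   // step 5: CAS
                myHazard.set(null);                // step 6: clear hazard
                return head.item;                  // step 7 (reuse check) omitted
            }
            myHazard.set(null);                    // CAS failed: clear and retry
        }
    }
}
```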
Protecting the Node Pool
The shared pool used to manage free nodes often requires its own protection mechanisms, which might themselves involve techniques like hazard pointers or using thread-local pools to reduce contention.
Remarks in Java
- Complexity: Hazard Pointers are significantly more complex to implement correctly than relying on GC or using tagged pointers like `AtomicStampedReference`.
- Performance: In managed environments like Java, the performance benefits over standard GC might be negligible or even negative due to the overhead of maintaining and scanning hazard lists. However, they are crucial in non-GC environments (C/C++).
- Optimization: Performance can be improved if hazard lists utilize processor-local storage to reduce cache contention.
Lessons Learned (Lock-Free)
- Lock-free programming avoids lock issues but brings new challenges.
- Atomically updating multiple locations requires CAS retry loops, helping, or specialized atomics (DCAS, `AtomicMarkableReference`).
- The ABA problem is a major hazard when memory is reused, requiring explicit solutions (GC, tagging, hazard pointers).
Overall Recap & Need for Formalism
We’ve covered lock implementation algorithms, hardware support, reasoning with state diagrams, higher-level constructs like monitors/semaphores, and lock-free implementations. However, reasoning about correctness, especially with memory reordering and complex interleavings, has been informal. We need robust theoretical tools.
Correctness in Parallel Programs: Linearizability
Example of a correct bounded FIFO queue for a single enqueuer and single dequeuer.
How do we formally define correctness for concurrent objects?
Sequential Objects vs. Concurrent Objects
Sequential objects (like in single-threaded Java) have a well-defined state between method calls. Correctness can be proven using pre/post conditions (Floyd-Hoare logic).
A method call spans an interval from invocation to response.
- Sequential: State is meaningful only between calls. Methods are analyzed in isolation. Global clock applies.
- Concurrent: Method calls overlap. State might never be “between calls”. All interactions matter. Adding methods can affect others. Need per-object/thread clocks.
With locking, operations become sequentialized based on lock acquisition order. Can we formalize this apparent sequential behavior?
Linearizability Definition
A concurrent object implementation is linearizable if every execution history appears as if each operation took effect atomically at some single point in time (linearization point) between its invocation and response.
Formalism
An execution history `H` is linearizable if it can be mapped (by completing or removing pending calls) to a history `G` such that:

- `G` is equivalent (same per-thread projections) to a legal sequential history `S` (one that obeys the sequential object semantics, e.g., FIFO).
- `S` respects the real-time order of `G`: if operation `m0` finished before operation `m1` started in `G`, then `m0` must come before `m1` in the sequential history `S` (`→G ⊆ →S`).
Linearizability Examples
Linearizable. The two enqueues overlap, so they may be ordered either way; the sequential order `enq(y), enq(x), deq()->y, deq()->x` respects both FIFO and the real-time order.
Not Linearizable. The dequeue returns `y` first, so FIFO requires `enq(y)` to precede `enq(x)` in any legal sequential order. But `enq(x)` finished before `enq(y)` started, so the real-time order forces `enq(x)` before `enq(y)` in `S`. These two requirements conflict: no valid linearization points exist.
Continue here: 25 Linearizability Formalism, Sequential Consistency (Weaker Alternative), Quiescent Consistency