Lecture from 14.05.2025 | Video: Video ETHZ

We continue our exploration into advanced concurrent programming. Building on our knowledge of locks and their limitations, we’ll focus on lock-free algorithms, specifically for queues, which are vital in systems like OS schedulers. This path, however, is fraught with subtlety, most notably the ABA problem, which arises from memory reuse. We will examine solutions like pointer tagging and hazard pointers. Finally, we’ll formalize our notion of correctness for concurrent objects by introducing Linearizability.

Summary/Recap Last Lecture

Previously, we investigated several fine-grained locking strategies:

  • Optimistic List-Based Sets: Traversed lock-free, then locked and validated.
  • Lazy Synchronization: Introduced logical deletion (marking) to potentially simplify remove and enable wait-free contains (single lock-free traversal).
  • Lazy Skip Lists: Presented skip lists as a practical, probabilistic data structure amenable to lazy/optimistic techniques.
  • Lock-Free Basics & CAS: Introduced the concept of lock-freedom and the crucial role of Compare-and-Swap (CAS) atomic operations.

Learning Goals Today

Our main goals for this lecture are:

  1. Reuse and ABA Problem: Deep dive into the ABA problem that occurs when nodes are reused in lock-free CAS-based algorithms. Explore solutions:
    • Pointer Tagging
    • Hazard Pointers
  2. Correctness in Parallel Programs (Linearizability): Introduce a formal correctness condition for concurrent objects.

Reuse and the ABA Problem

The previous lock-free algorithms assumed that memory nodes, once removed from the data structure, were not reused. What happens if we do reuse nodes, for example, by managing them in a memory pool?

Back to the Lock-Free Stack (with Reuse)

Let’s modify our lock-free stack to use a NodePool with get() and put() methods instead of new Node() and garbage collection.

NodePool itself is implemented lock-free using CAS on its internal top reference.

push now calls pool.get(item) and pop calls pool.put(head).
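
A minimal sketch of this reuse variant in Java (a Treiber-style stack; the item type and method bodies are illustrative, not the exact lecture code):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the lock-free stack with node reuse; names are illustrative.
class Node {
    long item;
    Node next;
    Node(long item) { this.item = item; }
}

class NodePool {
    // The pool is itself a lock-free stack of retired nodes.
    private final AtomicReference<Node> top = new AtomicReference<>();

    void put(Node n) {
        Node head;
        do {
            head = top.get();
            n.next = head;
        } while (!top.compareAndSet(head, n));
    }

    Node get(long item) {
        Node head, next;
        do {
            head = top.get();
            if (head == null) return new Node(item); // pool empty: allocate
            next = head.next;
        } while (!top.compareAndSet(head, next));     // same ABA hazard here!
        head.item = item;
        return head;
    }
}

class LockFreeStackWithReuse {
    private final AtomicReference<Node> top = new AtomicReference<>();
    final NodePool pool = new NodePool();

    public void push(long item) {
        Node n = pool.get(item);                      // was: new Node(item)
        Node head;
        do {
            head = top.get();
            n.next = head;
        } while (!top.compareAndSet(head, n));
    }

    public Long pop() {
        Node head, next;
        do {
            head = top.get();
            if (head == null) return null;            // empty stack
            next = head.next;
        } while (!top.compareAndSet(head, next));     // ABA hazard lives here
        long item = head.item;
        pool.put(head);                               // was: left to the GC
        return item;
    }
}
```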

Experiment and Failure

Run N producer and N consumer threads, each performing 10k pushes/pops on the stack, with and without node reuse. Check correctness (the sum of pushed values must equal the sum of popped values) and measure the running time.
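
A possible harness, assuming the stack sketch above (thread and iteration counts as stated; all names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical harness: N producers each push 10_000 items, N consumers
// pop until everything is drained, then the two sums are compared.
class StackExperiment {
    static final int N = 4, OPS = 10_000;

    public static void main(String[] args) throws InterruptedException {
        LockFreeStackWithReuse stack = new LockFreeStackWithReuse();
        AtomicLong pushedSum = new AtomicLong();
        AtomicLong poppedSum = new AtomicLong();
        AtomicLong poppedCount = new AtomicLong();

        Runnable producer = () -> {
            for (long i = 1; i <= OPS; i++) {
                stack.push(i);
                pushedSum.addAndGet(i);
            }
        };
        Runnable consumer = () -> {
            while (poppedCount.get() < (long) N * OPS) {
                Long v = stack.pop();
                if (v != null) {
                    poppedSum.addAndGet(v);
                    poppedCount.incrementAndGet();
                }
            }
        };

        Thread[] threads = new Thread[2 * N];
        for (int i = 0; i < N; i++) {
            threads[i] = new Thread(producer);
            threads[N + i] = new Thread(consumer);
        }
        long start = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        System.out.printf("pushed=%d popped=%d time=%d ms%n",
                pushedSum.get(), poppedSum.get(),
                (System.nanoTime() - start) / 1_000_000);
        // With reuse enabled, some runs print mismatching sums, throw, or hang.
    }
}
```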

Initial results might look okay…

But other runs fail! We get exceptions, sums don’t match, or the program hangs. Why?

The ABA Problem Explained

The reuse of nodes breaks the logic of the CAS-based pop operation.

  1. Initial State: top → A → B → NULL.
  2. Thread X starts pop():
    • Reads top: head = A.
    • Reads head.next: next = B.
    • Thread X is preempted before compareAndSet(A, B).
  3. Thread Y runs pop(): reads top (A) and next (B). CAS(A, B) succeeds: top → B → NULL. Puts Node A into the pool.
  4. Thread Z runs push(New): gets Node A back from the pool, sets A.next = B. CAS(B, A) succeeds: top → A → B → NULL again.
  5. Thread X resumes: executes compareAndSet(head, next), which is compareAndSet(A, B).
    • It reads the current top, which is A.
    • It compares the current top (A) with its local head (A). They match!
    • The CAS succeeds incorrectly and sets top to next (B).
  6. Result: the stack becomes top → B → NULL. Node A (now carrying Z’s data) is lost. Thread X believes it popped the original A, but because Node A’s memory address was reused, its operation corrupted the stack.

ABA Problem Definition: A CAS operation CAS(location, expected_A, new_B) can succeed erroneously if the value at location was A, changed to B (or something else), and then changed back to A before the CAS executed. The CAS fails to detect the intermediate changes.

How to Solve the ABA Problem

The core issue with the ABA problem in concurrent programming is that a memory location can be read (A), modified to something else (B), and then modified back to the original value (A) by another thread between the time a thread reads the initial value and attempts a Compare-and-Swap (CAS) operation. The CAS incorrectly succeeds because the value appears unchanged, potentially leading to data corruption.

Since ABA stems from reusing a location with the exact same value, we need ways to distinguish reused locations from the original. Here are several common techniques:

1. Double Compare-and-Swap (DCAS) / Multi-Word CAS

This technique involves atomically operating on two adjacent memory words instead of just one. Typically, this involves pairing the pointer (or value) of interest with a version counter or tag.

  • Mechanism: The CAS operation checks both the pointer/value and the version counter simultaneously. Every update also increments the counter, so even if another thread restores the original value, the counter differs and the stale CAS fails.
  • Hardware Support: Some hardware architectures provide native instructions for DCAS or multi-word atomic operations, offering a direct and efficient solution.
  • Software Simulation: In environments lacking hardware support, it can be simulated. Java provides java.util.concurrent.atomic.AtomicMarkableReference, which associates a boolean “mark” with a reference, allowing atomic updates to both (see the sketch below).

Intuition

Imagine you’re not just checking the house number (the memory address/value) but also a unique serial number written on the house (the version counter). If the house is demolished and an identical one is rebuilt (value restored), it will get a new serial number. DCAS checks both the house number and the serial number, so it detects the replacement.
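
Java’s AtomicMarkableReference can demonstrate the pairing. Note that it carries a single mark bit rather than a full counter, so it detects one flip rather than arbitrary change histories (a minimal demo with illustrative names):

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// Demo: compareAndSet on an AtomicMarkableReference checks the
// reference AND the mark atomically.
class MarkedRefDemo {
    public static void main(String[] args) {
        String a = "A", b = "B";
        AtomicMarkableReference<String> ref =
                new AtomicMarkableReference<>(a, false);

        // Another thread swings the reference away and back, setting the mark.
        ref.compareAndSet(a, b, false, true);   // A -> B, mark becomes true
        ref.compareAndSet(b, a, true, true);    // back to A, mark stays true

        // A stale CAS expecting (A, unmarked) now fails: ABA is detected.
        System.out.println(ref.compareAndSet(a, b, false, true)); // false
    }
}
```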

2. Garbage Collection (GC)

In languages with automatic garbage collection (like Java or C#), the ABA problem is often implicitly mitigated, especially when dealing with objects.

  • Mechanism: When an object is no longer referenced, it becomes eligible for garbage collection. If the memory address previously occupied by object A is later reused for a new object A’, even if A’ has the same value content as A, it usually has a distinct object identity in the managed runtime. A CAS operation comparing object references will compare these identities (or the new memory address), not just the underlying value, and will correctly fail if the object has been replaced (see the demo below).

Intuition

The garbage collector acts like a meticulous record keeper. Even if a new tenant moves into the same apartment number (memory address) and looks identical to the previous tenant (same value), the GC knows it’s a different person (new object identity). Comparing IDs reveals the change.
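
A small demo of why this works: as long as any thread still holds a reference to the old object, the runtime cannot give its identity to anything else, and a reference CAS compares identities, not contents (names illustrative):

```java
import java.util.concurrent.atomic.AtomicReference;

// Demo: a CAS on references compares object identity, not contents.
class GcIdentityDemo {
    static class Cell { final int value; Cell(int v) { value = v; } }

    public static void main(String[] args) {
        Cell original = new Cell(42);
        AtomicReference<Cell> top = new AtomicReference<>(original);

        // A look-alike with the same contents but a fresh identity.
        top.set(new Cell(42));

        // The stale CAS still holds 'original'; identities differ, CAS fails.
        System.out.println(top.compareAndSet(original, null)); // false
    }
}
```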

3. Pointer Tagging

This method embeds version information directly within the pointer itself, leveraging unused bits.

  • Mechanism: Memory pointers are often aligned to word boundaries (e.g., 4 or 8 bytes), meaning the lower few bits of the address are always zero. These spare bits can be repurposed to store a small “tag” or version counter. When the pointer’s target value is modified, the tag is incremented along with the pointer update. The CAS operation then checks both the pointer address and the tag bits.
  • Implementation: Java’s java.util.concurrent.atomic.AtomicStampedReference uses this approach, associating an integer “stamp” with an object reference (see the sketch below).
  • Likelihood: Because only a few spare bits are available, the tag can in principle wrap around, so ABA is not strictly impossible; but a reused address is highly unlikely to carry the exact same tag it had before.

Intuition

Think of adding a small, incrementing version number (like a sticky note counter) directly onto the pointer address itself using otherwise wasted space. Each time the data is genuinely changed, the counter increases. When you CAS, you check both the address and this counter. If the address was reused, the counter value will almost certainly be different, exposing the ABA situation.
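
A sketch of a stamped top pointer using AtomicStampedReference, reusing the Node class from the stack sketch above; every successful update bumps the stamp:

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Sketch: the stamp is incremented on every successful update, so a
// recycled node address alone no longer satisfies the CAS.
class StampedStack {
    private final AtomicStampedReference<Node> top =
            new AtomicStampedReference<>(null, 0);

    public void push(Node n) {
        int[] stamp = new int[1];
        Node head;
        do {
            head = top.get(stamp);                // reads reference and stamp
            n.next = head;
        } while (!top.compareAndSet(head, n, stamp[0], stamp[0] + 1));
    }

    public Node pop() {
        int[] stamp = new int[1];
        Node head, next;
        do {
            head = top.get(stamp);
            if (head == null) return null;
            next = head.next;
        } while (!top.compareAndSet(head, next, stamp[0], stamp[0] + 1));
        return head;
    }
}
```

Note that AtomicStampedReference boxes the reference/stamp pair in an internal object, so updates allocate; native pointer tagging avoids this by stealing the unused alignment bits of the pointer itself.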

4. Hazard Pointers

Hazard Pointers are a specific memory reclamation scheme primarily used in lock-free data structures, especially in environments without garbage collection (like C/C++), to prevent the ABA problem and ensure memory safety.

Intuition

Imagine threads need to temporarily “reserve” or put a “hazard cone” around the specific memory nodes they are currently inspecting or about to use in a CAS. Before any thread can recycle or free a node, it must check everyone else’s hazard cones. If a node is marked as hazardous by any thread, it cannot be touched.

The Core Idea

Threads publicly declare which specific nodes (memory locations) they are currently accessing and might potentially use in an upcoming operation (like CAS). This declaration prevents other threads from reclaiming or reusing that specific node while it’s “hazardous.”

How it Works

  1. Hazard List: A small, shared list (often a simple array, one entry per thread) holds these “hazard pointers.”
  2. Marking: Before a thread reads a pointer p that it intends to use (e.g., p = shared_ptr.get()), it must first place p into its own slot in the hazard list: hazard[myThreadID] = p. Crucially, it must then re-read the shared_ptr after setting the hazard pointer to confirm that p is still the current value. If the shared_ptr changed between the initial read and setting the hazard pointer, the thread must retry.
  3. Reclaiming: When a thread wants to free or reuse a node node_to_free, it must first scan the entire hazard list (all other threads’ entries). If node_to_free appears in any thread’s hazard slot, it cannot be reclaimed at this time and must be deferred.
  4. Clearing: Once a thread is finished with p (e.g., after a successful CAS or deciding not to proceed), it clears its hazard pointer: hazard[myThreadID] = null.
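
A minimal sketch of such a hazard list, assuming a fixed maximum thread count and the Node class from above (method names mirror the setHazardous/isHazardous calls used below):

```java
import java.util.concurrent.atomic.AtomicReference;

// One published hazard slot per thread; sizes and names are illustrative.
class HazardPointers {
    static final int MAX_THREADS = 16;
    private final AtomicReference<Node>[] hazard;

    @SuppressWarnings("unchecked")
    HazardPointers() {
        hazard = (AtomicReference<Node>[]) new AtomicReference[MAX_THREADS];
        for (int i = 0; i < MAX_THREADS; i++)
            hazard[i] = new AtomicReference<>();
    }

    // Marking (step 2) and clearing (step 4): publish into my own slot.
    void setHazardous(int tid, Node n) { hazard[tid].set(n); }

    // Reclaiming (step 3): scan all slots before freeing/reusing a node.
    boolean isHazardous(Node n) {
        for (AtomicReference<Node> slot : hazard)
            if (slot.get() == n) return true;
        return false;
    }
}
```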

Example: Lock-Free Stack Operations

Consider modifying push and pop operations for a lock-free stack:

  • pop (sketched in code after this list):
    1. Read the current head node.
    2. Mark head as hazardous: setHazardous(head).
    3. Re-check: Verify that head is still the current top: head == top.get(). If not, retry from step 1.
    4. Read head.next.
    5. Attempt the CAS to replace head with head.next.
    6. Clear the hazard pointer: setHazardous(null).
    7. Before returning the old head node to a reuse pool (e.g., pool.put(head)), check if it’s hazardous: if (!isHazardous(head)). If it is hazardous (meaning another thread marked it between the CAS and this check), the node cannot be immediately reused. (Handling this typically involves placing it on a thread-local list for later reclamation attempts).
  • push:
    1. When acquiring a new node from a shared pool (pool.get()), the pool mechanism must ensure it doesn’t return a node that is currently marked as hazardous by any thread.
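
Putting the pop steps together, assuming the Node, NodePool, and HazardPointers sketches from earlier (tid identifies the calling thread’s hazard slot):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: pop with hazard pointers; step numbers refer to the list above.
class HazardStack {
    private final AtomicReference<Node> top = new AtomicReference<>();
    private final NodePool pool = new NodePool();
    private final HazardPointers hp = new HazardPointers();

    public Long pop(int tid) {
        Node head, next;
        do {
            do {
                head = top.get();                 // step 1: read the head
                if (head == null) return null;
                hp.setHazardous(tid, head);       // step 2: mark as hazardous
            } while (head != top.get());          // step 3: re-check, else retry
            next = head.next;                     // step 4: read head.next
        } while (!top.compareAndSet(head, next)); // step 5: attempt the CAS
        hp.setHazardous(tid, null);               // step 6: clear the hazard
        long item = head.item;
        if (!hp.isHazardous(head))                // step 7: only reuse if safe
            pool.put(head);
        // else: defer, e.g., keep on a thread-local retire list
        return item;
    }
}
```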

Protecting the Node Pool

The shared pool used to manage free nodes often requires its own protection mechanisms, which might themselves involve techniques like hazard pointers or using thread-local pools to reduce contention.

Remarks in Java

  • Complexity: Hazard Pointers are significantly more complex to implement correctly compared to relying on GC or using tagged pointers like AtomicStampedReference.
  • Performance: In managed environments like Java, the performance benefits over standard GC might be negligible or even negative due to the overhead of maintaining and scanning hazard lists. However, they are crucial in non-GC environments (C/C++).
  • Optimization: Performance can be improved if hazard lists utilize processor-local storage to reduce cache contention.

Lessons Learned (Lock-Free)

  • Lock-free programming avoids lock issues but brings new challenges.
  • Atomically updating multiple locations requires CAS retry loops, helping, or specialized atomics (DCAS, AtomicMarkableReference).
  • The ABA problem is a major hazard when memory is reused, requiring explicit solutions (GC, tagging, hazard pointers).

Overall Recap & Need for Formalism

We’ve covered lock implementation algorithms, hardware support, reasoning with state diagrams, higher-level constructs like monitors/semaphores, and lock-free implementations. However, reasoning about correctness, especially with memory reordering and complex interleavings, has been informal. We need robust theoretical tools.

Correctness in Parallel Programs: Linearizability

As a motivating example, recall the bounded FIFO queue that is correct for a single enqueuer and a single dequeuer even without locks (sketched below).
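
A sketch of that queue, a Lamport-style ring buffer: head is written only by the dequeuer and tail only by the enqueuer, so plain volatile reads/writes suffice and no CAS or lock is needed (details illustrative):

```java
// Single-producer/single-consumer bounded FIFO: no locks, no CAS.
class SpscQueue {
    private final long[] items;
    private volatile int head = 0, tail = 0;    // head: next deq, tail: next enq

    SpscQueue(int capacity) { items = new long[capacity + 1]; }

    boolean enq(long x) {                        // called by the one producer
        int next = (tail + 1) % items.length;
        if (next == head) return false;          // full
        items[tail] = x;
        tail = next;                             // volatile write publishes item
        return true;
    }

    Long deq() {                                 // called by the one consumer
        if (head == tail) return null;           // empty
        long x = items[head];
        head = (head + 1) % items.length;
        return x;
    }
}
```

With two or more enqueuers or dequeuers the unsynchronized index updates race, which is exactly why the general case needs CAS-based designs and a formal correctness notion.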

How do we formally define correctness for concurrent objects?

Sequential Objects vs. Concurrent Objects

Sequential objects (like in single-threaded Java) have a well-defined state between method calls. Correctness can be proven using pre/post conditions (Floyd-Hoare logic).

A method call spans an interval from invocation to response.

  • Sequential: State is meaningful only between calls. Methods are analyzed in isolation. Global clock applies.
  • Concurrent: Method calls overlap. State might never be “between calls”. All interactions matter. Adding methods can affect others. Need per-object/thread clocks.

With locking, operations become sequentialized based on lock acquisition order. Can we formalize this apparent sequential behavior?

Linearizability Definition

A concurrent object implementation is linearizable if every execution history appears as if each operation took effect atomically at some single point in time (linearization point) between its invocation and response.

Formalism

An execution history H is linearizable if it can be mapped (by completing/removing pending calls) to a history G such that:

  1. G is equivalent (same per-thread projections) to a legal sequential history S (obeys sequential object semantics, e.g., FIFO).
  2. S respects the real-time order of G: if operation m0 finished before operation m1 started in G, then m0 must come before m1 in the sequential history S (→G ⊆ →S).

Linearizability Examples

Example 1 (linearizable): enq(x) and enq(y) overlap in real time; then deq()->y and deq()->x follow. A legal sequential order exists: enq(y), enq(x), deq()->y, deq()->x. It respects FIFO semantics, and because the two enqueues overlapped, ordering enq(y) first does not violate real-time order.

Example 2 (not linearizable): here enq(x) finishes before enq(y) starts, yet the first deq() returns y. Real-time order forces enq(x) before enq(y) in any sequential history S, and FIFO then forces the first dequeue to return x. Since the history’s first dequeue returns y, no valid linearization points exist.
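
Since the timeline figures are missing here, a rough reconstruction (thread assignment is illustrative; only the overlap structure matters):

```text
Example 1 (linearizable):
  A: |--enq(x)--|              |--deq()->y--|
  B:       |--enq(y)--|                       |--deq()->x--|
     (enq(x) and enq(y) overlap: either sequential order is allowed)

Example 2 (not linearizable):
  A: |--enq(x)--|                  |--deq()->y--|
  B:              |--enq(y)--|
     (enq(x) precedes enq(y) in real time: x must be dequeued first)
```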

Continue here: 25 Linearizability Formalism, Sequential Consistency (Weaker Alternative), Quiescent Consistency