Lecture from 07.05.2025 | Video: Video ETHZ

Today, we’ll expand on fine-grained locking techniques, exploring practical implementations and trade-offs. This leads us naturally into the realm of lock-free programming, where we aim to achieve concurrency without using traditional mutual exclusion locks.

Summary/Recap Last Lecture

In our previous lecture, we covered:

  • Explicit/Interface Locks (Lock, Condition): Explored the Java Lock interface and associated Condition variables, noting their increased power and flexibility compared to intrinsic monitors (synchronized/wait/notify), but also their increased potential for programmer error.
  • Sleeping Barber Example: Analyzed this classic problem to illustrate how to avoid excessive/useless notifications in condition variable patterns by keeping track of waiting thread counts, preventing lost signals.
  • Reader/Writer Locks (ReentrantReadWriteLock): Discussed locks that allow multiple concurrent readers or a single exclusive writer, examining different fairness policies (reader vs. writer priority).

Learning Goals Today

Building on that foundation, our goals for today are:

  • Lock Granularity in Practice: Apply fine-grained, optimistic, and lazy locking strategies to list and stack examples.
  • Various Locking Modes: Understand the implementation details and trade-offs of these different locking approaches.
  • Lock-Free Data Structures:
    • Implement a lock-free stack using atomic operations (Compare-and-Swap).
    • Examine lock-free queues, a common requirement in OS structures.
  • The ABA Problem: Understand this subtle issue that arises with Compare-and-Swap when memory is reused.
  • Avoiding the ABA Problem: Discuss techniques like pointer tagging and hazard pointers.
  • (Introduction to Formal Models): Briefly introduce Linearizability and Sequential Consistency.
  • (Introduction to Consensus): Define the consensus problem and its relation to the power of atomic operations.

Compare-and-Swap (CAS) Revisited

  • CAS(memory_location, expected_old_value, new_value)
  • Atomically:
    1. Reads current value at memory_location.
    2. Compares it with expected_old_value.
    3. If they match, writes new_value to memory_location and returns success (or old value).
    4. If they don’t match, does not write, and returns failure (or current value).
  • More powerful than TAS (Test-and-Set). Can often be implemented wait-free in hardware.
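As a concrete illustration (a minimal sketch, not from the lecture), Java's `AtomicInteger` exposes exactly these semantics via `compareAndSet`, which returns `true` on success and `false` when the expected value doesn't match:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
    public static boolean tryCas() {
        AtomicInteger x = new AtomicInteger(5);
        boolean ok = x.compareAndSet(5, 7);   // expected matches current -> write happens
        boolean stale = x.compareAndSet(5, 9); // expected is stale -> no write
        return ok && !stale && x.get() == 7;
    }

    public static void main(String[] args) {
        System.out.println(tryCas()); // prints "true"
    }
}
```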

Non-blocking Counter using CAS

  • Mechanism: Read-modify-try-CAS. If compareAndSet fails, it means another thread modified value between the read and the CAS, so we read again and retry.
  • Progress: This is lock-free. If multiple threads attempt inc() concurrently, at least one will eventually succeed in its CAS operation, ensuring system progress. It’s not wait-free, as a thread might theoretically fail CAS indefinitely if there’s extreme contention (though unlikely in practice).
  • Deadlock/Starvation: Immune to deadlock. Starvation is possible but rare.
  • Thread Death: If a thread dies between reading v and the CAS, other threads are unaffected and can continue to make progress.
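A sketch of such a counter, assuming the value lives in an `AtomicInteger` (class and method names here are illustrative, not lecture code):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Read-modify-try-CAS loop: retry until our CAS wins.
    public int inc() {
        while (true) {
            int v = value.get();                    // read
            if (value.compareAndSet(v, v + 1))      // try-CAS
                return v + 1;                       // success: our increment took effect
            // CAS failed: another thread changed value in between; retry
        }
    }

    public int get() { return value.get(); }

    public static void main(String[] args) throws InterruptedException {
        CasCounter c = new CasCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < 1000; j++) c.inc(); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.get()); // 4000: no increments are lost
    }
}
```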

Handle CAS With Care

  • A successful CAS suggests no other thread wrote between the read (get) and the CAS.
  • Warning: This suggestion is not always true due to the ABA problem.
  • CAS (or LL/SC) is the fundamental mechanism for checking conditions and making atomic changes in most lock-free algorithms.
  • Alternatives like Transactional Memory might simplify things but have their own challenges.

Lock-Free Stack

Let’s implement a lock-free stack using an AtomicReference for the top pointer.

A lock-based version (push/pop guarded by a single lock) serves as the performance baseline for comparison.

Lock-free stack structure:

public class ConcurrentStack {
    // Use AtomicReference to allow atomic updates (CAS) on the top pointer
    AtomicReference<Node> top = new AtomicReference<Node>();
 
    // Methods use CAS loops instead of locks
    public void push(Long item) { /* ... */ }
    public Long pop() { /* ... */ }
}

Lock-Free Pop

Logic: Repeatedly read the current head and its next. Try to atomically swing the top pointer from head to next. If top changed between the get() and the compareAndSet() (because another thread pushed or popped), the CAS fails, and the loop repeats.

Lock-Free Push

Logic: Create the new node newi. Repeatedly read the current head, set newi.next to head, and try to atomically swing top from head to newi. If top changed concurrently, the CAS fails, and the loop retries with the updated head.
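Putting the two CAS loops together, a complete sketch of the stack (the Node class and its fields are assumptions filling in the skeleton above, not the lecture's exact code):

```java
import java.util.concurrent.atomic.AtomicReference;

public class ConcurrentStack {
    private static class Node {
        final Long item;
        Node next;
        Node(Long item) { this.item = item; }
    }

    private final AtomicReference<Node> top = new AtomicReference<>();

    public void push(Long item) {
        Node newi = new Node(item);
        while (true) {
            Node head = top.get();
            newi.next = head;                  // link new node to current head
            if (top.compareAndSet(head, newi)) // swing top from head to newi
                return;
            // top changed concurrently: retry with fresh head
        }
    }

    public Long pop() {
        while (true) {
            Node head = top.get();
            if (head == null) return null;     // empty stack
            Node next = head.next;
            if (top.compareAndSet(head, next)) // swing top from head to next
                return head.item;
            // top changed concurrently: retry
        }
    }
}
```

Note that `newi.next` is written before the CAS publishes the node, so no other thread can observe a half-initialized node.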

Benefits and Performance

  • Benefit: Lock-free algorithms are inherently deadlock-free.
  • Performance:
    • Benchmark results (shown in the lecture) indicate that without backoff, the lock-free stack performs worse than the simple blocking (synchronized) stack under contention. This is because failed CAS attempts still consume resources and potentially cause cache contention, similar to spinlocks.
    • Atomic operations (like CAS used by AtomicReference) are relatively expensive compared to uncontended lock acquires/releases.

Lock-free doesn’t automatically mean faster. Contention on the underlying atomic operations can still be a bottleneck.

Backoff: Adding exponential backoff to the retry loops (pausing after a failed CAS) significantly improves the performance of the lock-free stack under contention, making it outperform the blocking version.
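One common shape for such a backoff helper (a sketch; the class name, constants, and use of randomized delays are illustrative, not the lecture's code):

```java
import java.util.concurrent.ThreadLocalRandom;

// Exponential backoff: after each failed CAS, sleep for a random duration
// whose upper bound doubles on every failure, up to a fixed cap.
public class Backoff {
    private final int maxDelayMillis;
    private int limit; // current upper bound for the random delay

    public Backoff(int minDelayMillis, int maxDelayMillis) {
        this.limit = minDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    public void backoff() throws InterruptedException {
        int delay = ThreadLocalRandom.current().nextInt(limit) + 1;
        limit = Math.min(maxDelayMillis, 2 * limit); // widen the window
        Thread.sleep(delay);
    }
}
```

In the stack's retry loops, a thread would call `backoff()` after each failed `compareAndSet`; the randomization spreads retries out in time so competing threads stop hammering the same cache line.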

Lock-Free List-Based Set (Introduction)

Implementing a lock-free linked list is significantly more complex than a stack: operations may need to modify two pieces of state somewhat "atomically" — pred.next plus the new node's next for add, or pred.next plus a deletion mark for remove. (We are not discussing skip lists here.)

Can simple CAS handle concurrent add and remove? A: remove(c) needs CAS(b.next, c, d). B: add(b') needs CAS(b.next, c, b'). CAS hardware arbitrates, only one succeeds, the other retries. This seems okay.

What about A: remove(c) needing CAS(b.next, c, d) and B: remove(b) needing CAS(a.next, b, c)? If B succeeds first, A’s CAS on b.next will fail (as b is removed), but c remains incorrectly linked from a. Simple CAS on single pointers is insufficient.

Adding a mark bit for lazy deletion (A: CAS(c.mark, false, true) then CAS(b.next, c, d)) also doesn’t solve the problem directly. Thread B might add c' (CAS(c.next, d, c')) after A marks c but before A physically removes c. c' gets added to a logically deleted node.

The Problem: We need to atomically modify or validate two pieces of information: the next pointer and a state marker (like a delete bit) or potentially two pointers (tail.next and tail in a queue). Standard CAS only works on a single memory location.

Java Solution: AtomicMarkableReference

Java provides AtomicMarkableReference<V> to address this. It atomically manages a reference V and a single boolean mark.

  • compareAndSet(expectedRef, newRef, expectedMark, newMark): Atomically sets the reference to newRef and the mark to newMark if and only if the current reference is expectedRef AND the current mark is expectedMark.
  • attemptMark(expectedRef, newMark): Atomically sets the mark to newMark if the current reference is expectedRef.
  • Other methods: get(), getReference(), isMarked(), set().

This effectively provides a way to perform a Double CAS (DCAS) on the reference and the mark bit together (often leveraging hardware support like double-word CAS if available, or internal locking if not). The slide notes that one bit is often “free” in aligned pointer references, allowing this mark bit to be stored efficiently alongside the pointer.
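A small demonstration of these semantics (a sketch, not lecture code): once the mark is set, a CAS that expects an unmarked reference fails even though the reference itself is unchanged.

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

public class MarkDemo {
    public static boolean demo() {
        String c = "c", d = "d";
        AtomicMarkableReference<String> next = new AtomicMarkableReference<>(c, false);

        // Succeeds: both the reference (c) and the mark (false) match.
        boolean marked = next.compareAndSet(c, c, false, true);

        // Fails: the reference still matches, but the mark is now true,
        // so expectedMark=false does not match.
        boolean swung = next.compareAndSet(c, d, false, false);

        return marked && !swung && next.isMarked();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "true"
    }
}
```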

Algorithm using AtomicMarkableReference

We can now revisit the lazy list approach using AtomicMarkableReference for the next field in each node.

  • The remove operation becomes two distinct conceptual steps:
    1. Logical Delete: Atomically set the mark bit of the target node’s next field (using attemptMark or compareAndSet on that field).
    2. Physical Delete: Atomically update the predecessor’s next field to skip the marked node (using compareAndSet, checking both the reference and the mark of the predecessor’s next field).

Example: Removing c (between b and d).

  1. Mark c: c.next.attemptMark(d, true) — the mark is stored on c’s next reference, so this logically deletes c.
  2. Swing b.next: b.next.compareAndSet(c, d, false, false) (assuming b.next is unmarked). This tries to change b’s next from (ref=c, mark=false) to (ref=d, mark=false), physically removing c.
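The two steps can be sketched as follows, assuming nodes whose next field is an AtomicMarkableReference (the Node class and the removeMiddle helper are illustrative, not lecture code):

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

public class LazyRemoveSketch {
    static class Node {
        final long key;
        final AtomicMarkableReference<Node> next; // reference + mark bit together
        Node(long key, Node succ) {
            this.key = key;
            this.next = new AtomicMarkableReference<>(succ, false);
        }
    }

    // Remove c, the node between b and d, in two steps.
    public static boolean removeMiddle(Node b) {
        Node c = b.next.getReference();
        Node d = c.next.getReference();
        // 1. Logical delete: mark c's next reference (expects ref=d, mark=false).
        boolean snip = c.next.attemptMark(d, true);
        if (!snip) return false; // someone else already marked or changed c.next
        // 2. Physical delete: swing b.next from (c, unmarked) to (d, unmarked).
        return b.next.compareAndSet(c, d, false, false);
    }
}
```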

If A tries to remove c and B tries to remove b concurrently:

  • A marks c.next pointer (not c itself).
  • B marks b.next pointer (not b itself).
  • A tries b.next.compareAndSet(c, d, false, false). Fails because b.next is now marked by B. A might then help physically remove b before retrying on c.
  • B tries a.next.compareAndSet(b, c, false, false). Succeeds (assuming a, b, c initially unmarked). b is physically removed.

Helping: If a thread encounters a marked node during traversal, it should attempt to “finish the job” by performing the physical deletion (CASing the predecessor’s next pointer) before continuing its own operation. This “helping” mechanism is crucial for ensuring lock-free progress, as it prevents threads from getting stuck behind nodes that are logically deleted but not yet physically unlinked.

The find method now incorporates logic to detect marked nodes (marked[0]) and attempt physical deletion (pred.next.compareAndSet(curr, succ, false, false)) before restarting the search if the physical deletion fails due to interference.

The remove method uses find to locate pred and curr. It first tries to logically delete curr by marking its successor reference (curr.next.attemptMark(succ, true)). If successful, it then tries to physically delete it (pred.next.compareAndSet(curr, succ, false, false)). If logical deletion fails (!snip), it continues the outer retry loop.

The add method uses find to locate pred and curr. It creates the new node and attempts to insert it using pred.next.compareAndSet(curr, node, false, false), retrying if the CAS fails (because pred or curr changed or were marked).

  • Using AtomicMarkableReference effectively simulates a DCAS on the reference and the mark bit.
  • The lazy marking combined with helping (physically deleting marked nodes encountered during traversal) is key to making the list operations lock-free.
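A sketch of such a find with helping, in the style of the Harris/Michael list (the Node, Window, and find shown here are assumptions modeled on the description above, not the lecture's exact code):

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

public class FindSketch {
    static class Node {
        final long key;
        final AtomicMarkableReference<Node> next;
        Node(long key, Node succ) {
            this.key = key;
            this.next = new AtomicMarkableReference<>(succ, false);
        }
    }

    // A (pred, curr) pair: curr is the first node with key >= search key.
    static class Window {
        final Node pred, curr;
        Window(Node pred, Node curr) { this.pred = pred; this.curr = curr; }
    }

    // Traverse from head; physically unlink any marked node encountered
    // ("helping"), restarting from head if a CAS fails due to interference.
    static Window find(Node head, long key) {
        boolean[] marked = { false };
        retry:
        while (true) {
            Node pred = head;
            Node curr = pred.next.getReference();
            while (curr != null) {
                Node succ = curr.next.get(marked); // reads ref and mark together
                while (marked[0]) {
                    // curr is logically deleted: help finish the physical removal
                    if (!pred.next.compareAndSet(curr, succ, false, false))
                        continue retry;            // interference: restart search
                    curr = succ;
                    if (curr == null) return new Window(pred, null);
                    succ = curr.next.get(marked);
                }
                if (curr.key >= key) return new Window(pred, curr);
                pred = curr;
                curr = succ;
            }
            return new Window(pred, null);
        }
    }
}
```

The `get(marked)` call is what makes this work: it returns the successor reference and fills `marked[0]` with the mark bit in one atomic read, so the traversal never sees a stale (reference, mark) pair.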

Lock-Free Unbounded Queue

Motivation: Implementing core OS or runtime system components, like thread schedulers, often requires highly efficient, non-blocking concurrent data structures. Scheduler queues are a prime example.

These queues need protection from concurrent access by multiple threads and potentially interrupt handlers running on different cores. Using locks can lead to the problems discussed earlier (deadlocks, convoying, priority inversion), making lock-free implementations desirable.
