Lecture from 13.05.2025 | Video: Video ETHZ

Summary/Recap Last Lecture

In our previous session, we covered several advanced locking concepts:

  • Optimistic & Lazy List-Based Sets: We explored fine-grained locking strategies for linked lists, moving from basic hand-over-hand locking to optimistic locking (lock-free traversal, lock-then-validate) and lazy synchronization (using markers for logical deletion, enabling wait-free contains).
  • Lazy Skip Lists: Introduced skip lists as a practical, probabilistic alternative to balanced trees for concurrent sets, suitable for applying lazy/optimistic locking techniques.
  • Lock-Free Introduction & CAS: Defined lock-freedom and wait-freedom, contrasting them with blocking synchronization, and re-introduced Compare-and-Swap (CAS) as the key atomic primitive for building lock-free algorithms.

Learning Goals Today

Today, we will delve into:

  • Lock-Free Unbounded Queue: Implementing a queue without locks, suitable for scenarios like OS schedulers, using sentinel nodes.
  • Reuse and the ABA Problem: Understanding how memory reuse (common in systems without garbage collection or using object pools) can break naive lock-free algorithms relying on CAS.
  • Avoiding the ABA Problem: Examining techniques like Pointer Tagging and Hazard Pointers to mitigate the ABA issue.

Lock-Free Unbounded Queue

Queues are fundamental in many systems, especially operating systems and runtime environments, for managing tasks, threads, or requests.

Motivation: A Lock-Free Operating System Kernel

Schedulers lie at the heart of operating systems, moving tasks between states (ready, running, waiting); the sets of tasks in each state are typically implemented as queues. Scheduling decisions happen frequently (thread creation/termination, blocking/unblocking).

These internal kernel/runtime data structures must be protected against concurrent access from multiple threads running on different cores, and potentially even from interrupt service routines. While spinlocks are conventionally used, their blocking nature can cause problems (priority inversion, convoying, etc.).

Therefore, implementing scheduler queues (and other core structures) in a lock-free manner is highly desirable for performance, scalability, and responsiveness. This requires a lock-free unbounded queue algorithm. However, OS kernels typically cannot rely on automatic Garbage Collection (GC) and must manage memory explicitly, often reusing queue nodes. This reuse creates a significant challenge for lock-free algorithms: the ABA problem.

Basic Queue Node Structure

We start with a simple node containing an item and a next pointer.
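Such a node can be sketched in Java as follows (the generic type parameter and constructor are assumptions; the lecture's version may differ in detail):

```java
// A minimal queue node: holds an item and a reference to its successor.
public class Node<T> {
    T item;        // the payload stored in this node
    Node<T> next;  // successor in the queue; null if this is the last node

    Node(T item) {
        this.item = item;
        this.next = null;
    }
}
```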

Blocking Unbounded Queue (Recap)

For reference, a simple blocking queue using synchronized provides mutual exclusion for enqueue and dequeue operations, handling the head and tail pointers. However, this suffers from the general drawbacks of locking.
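A minimal sketch of such a blocking queue (the class name SimpleBlockingQueue and the empty-queue handling are assumptions):

```java
// A simple blocking unbounded queue: one intrinsic lock guards head and tail.
public class SimpleBlockingQueue<T> {
    private static class Node<T> {
        T item;
        Node<T> next;
        Node(T item) { this.item = item; }
    }

    private Node<T> head = null; // first node; null if empty
    private Node<T> tail = null; // last node; null if empty

    public synchronized void enqueue(T item) {
        Node<T> node = new Node<>(item);
        if (tail == null) {        // empty queue: node becomes both head and tail
            head = node;
        } else {
            tail.next = node;
        }
        tail = node;
    }

    public synchronized T dequeue() {
        if (head == null) return null;  // empty queue
        T item = head.item;
        head = head.next;
        if (head == null) tail = null;  // queue became empty
        return item;
    }
}
```

Both methods acquire the same object lock, so enqueue and dequeue exclude each other entirely, which is exactly the contention the lock-free version avoids.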

Challenges for Lock-Free Queues

Implementing this lock-free is tricky because operations involve multiple pointers:

  • Enqueue: Modifies tail.next and tail.
  • Dequeue: Modifies head and reads head.next.

These operations need to appear atomic, but a single CAS operates on only one memory location.

Idea: Sentinel Node

To simplify edge cases (especially handling the empty queue), a common technique is to always have a dummy “sentinel” node at the front of the queue. The head pointer always points to the sentinel. The queue is empty when head.next is null (or, more robustly, when head == tail). The actual first element is head.next.

Enqueue with Sentinel

  1. The new node is added after the node currently pointed to by tail. Requires updating tail.next.
  2. The tail pointer itself needs to be updated to point to the newly added node.

These two steps (updating tail.next and tail) are ideally atomic but involve two separate memory locations.

Dequeue with Sentinel

  1. The item to be dequeued is in the node after head (i.e., head.next). The value is read from head.next.item.
  2. The head pointer needs to be updated to point to the dequeued node (head = head.next). This involves reading head.next and updating head.

Does the sentinel solve the atomicity problem? No. Enqueue still needs to update two pointers (tail.next and tail). This can lead to a transient inconsistency: a thread might update tail.next but get delayed before updating tail. During this time, tail doesn’t point to the true last node. Waiting for consistency reintroduces blocking (“locking camouflaged”). The solution involves helping: threads detect inconsistency and help complete the pending operation (like advancing tail) before proceeding.

Lock-Free Queue Node with AtomicReference

To implement the queue lock-free, we need atomic operations on the pointers. We use AtomicReference for the head, tail, and the next pointer within each Node.
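A sketch of such a node, with next wrapped in an AtomicReference so it can be updated via compareAndSet (the class name LockFreeNode is an assumption):

```java
import java.util.concurrent.atomic.AtomicReference;

// Node for the lock-free queue: the next pointer is an AtomicReference,
// so linking a successor can be done atomically with compareAndSet.
// The queue itself holds head and tail as AtomicReference<LockFreeNode<T>>.
public class LockFreeNode<T> {
    final T item;
    final AtomicReference<LockFreeNode<T>> next;

    LockFreeNode(T item) {
        this.item = item;
        this.next = new AtomicReference<>(null);
    }
}
```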

Michael & Scott Lock-Free Queue Protocol

This describes the widely used Michael & Scott lock-free queue algorithm:

Enqueuer Protocol:

  1. Read current tail into a local variable last.
  2. Read last.next into next.
  3. (Consistency Check) If tail still points to last:
     a. If next is null (meaning last is the actual end): Try to atomically link the new_node using CAS(last.next, null, new_node).
     b. If the CAS in (a) succeeds, try to swing the tail pointer: CAS(tail, last, new_node). This second CAS is a “helping” step; it’s okay if it fails (another thread might have already done it), but attempting it helps maintain progress.
     c. If next is not null (meaning tail is lagging behind the actual end): Help another thread by trying to swing the tail forward: CAS(tail, last, next).
  4. (Retry) If tail changed between reading last and the consistency check, or if the primary CAS(last.next, ...) failed due to a race, retry the entire process.

Why might the CAS(tail, last, new_node) fail? Only because another thread has already swung tail forward as a helping step (for instance, because the original enqueuer was delayed, or even died, after linking its node but before updating tail). Such a failure is therefore harmless.

Dequeuer Protocol:

  1. Read head into first and tail into last.
  2. Read first.next into next.
  3. (Consistency Check) If head still points to first:
     a. If first == last (head and tail point to the same node, possibly just the sentinel):
        i. If next == null, the queue is truly empty. Return null/failure.
        ii. If next != null, the queue is in a transient intermediate state (an enqueue finished linking but didn’t update tail yet). Help by trying CAS(tail, last, next), then retry the dequeue.
     b. If first != last (queue appears non-empty):
        i. Read the value from the node to be dequeued: value = next.item.
        ii. Try to swing the head pointer forward: CAS(head, first, next).
        iii. If the CAS succeeds, return value.
        iv. If the CAS fails (another dequeue happened first), retry the entire process.
  4. (Retry) If head changed between reading first and the consistency check, retry.

Why might CAS(head, first, next) fail? Another dequeuer successfully executed its CAS first.

A specific inconsistency: Thread A enqueues the first element (node) by setting S.next = node but gets delayed before setting tail = node. Thread B then dequeues S (the sentinel) by setting head = node. Now tail (still pointing to S) points to a dequeued node. The helping mechanism (tail.compareAndSet(last, next) in both enqueue and dequeue) addresses this.

The enqueue implementation follows this protocol directly, including the helping step (else tail.compareAndSet(last, next)) for a lagging tail.

The dequeue implementation likewise follows the protocol, including the check distinguishing the empty state from the transient intermediate state (if (first == last)) and the corresponding helping step.
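Putting both protocols together, the Michael & Scott queue can be sketched in Java as follows (class and method names are assumptions; the logic follows the protocol steps above):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the Michael & Scott lock-free unbounded queue.
// head always points to a sentinel node; the first real element is head.next.
public class LockFreeQueue<T> {
    private static class Node<T> {
        final T item;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T item) { this.item = item; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    public LockFreeQueue() {
        Node<T> sentinel = new Node<>(null);
        head = new AtomicReference<>(sentinel);
        tail = new AtomicReference<>(sentinel);
    }

    public void enqueue(T item) {
        Node<T> node = new Node<>(item);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last == tail.get()) {                    // consistency check
                if (next == null) {
                    // last is the true end: try to link the new node
                    if (last.next.compareAndSet(null, node)) {
                        // swing tail; failure is fine, another thread helped
                        tail.compareAndSet(last, node);
                        return;
                    }
                } else {
                    // tail is lagging: help advance it, then retry
                    tail.compareAndSet(last, next);
                }
            }
        }
    }

    public T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first == head.get()) {                   // consistency check
                if (first == last) {
                    if (next == null) return null;       // truly empty
                    // intermediate state: help finish the pending enqueue
                    tail.compareAndSet(last, next);
                } else {
                    T value = next.item;
                    // swing head past the old sentinel
                    if (head.compareAndSet(first, next)) return value;
                    // CAS failed: another dequeuer won the race, retry
                }
            }
        }
    }
}
```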

This Michael & Scott algorithm provides a correct lock-free unbounded queue implementation.

Reuse and the ABA Problem

The lock-free stack and queue implementations work fine if nodes, once removed, are never reused. However, in many contexts (OS kernels, systems without GC, performance-critical code using memory pools), node reuse is necessary. This reuse exposes a critical flaw in algorithms relying solely on CAS: the ABA problem.

Back to the Lock-Free Stack
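As a recap of the lock-free (Treiber-style) stack from the earlier lectures, here is a minimal sketch (names are assumptions); the single CAS on top is exactly where node reuse makes the ABA problem strike:

```java
import java.util.concurrent.atomic.AtomicReference;

// Recap: a Treiber-style lock-free stack using CAS on the top pointer.
public class LockFreeStack<T> {
    private static class Node<T> {
        final T item;
        Node<T> next;
        Node(T item) { this.item = item; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>(null);

    public void push(T item) {
        Node<T> node = new Node<>(item);
        while (true) {
            Node<T> old = top.get();
            node.next = old;
            if (top.compareAndSet(old, node)) return;
        }
    }

    public T pop() {
        while (true) {
            Node<T> old = top.get();
            if (old == null) return null;  // empty stack
            Node<T> next = old.next;
            // ABA hazard: if old is popped and reused between get() and the
            // CAS below, the CAS can still succeed although next is stale.
            if (top.compareAndSet(old, next)) return old.item;
        }
    }
}
```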

Continue here: 24 ABA Problem, DCAS, GC, Pointer Tagging, Hazard Pointers, Linearizability