Lecture from 13.05.2025 | Video: Video ETHZ
Summary/Recap Last Lecture
In our previous session, we covered several advanced locking concepts:
- Optimistic & Lazy List-Based Sets: We explored fine-grained locking strategies for linked lists, moving from basic hand-over-hand locking to optimistic locking (lock-free traversal, lock-then-validate) and lazy synchronization (using markers for logical deletion, enabling a wait-free `contains`).
- Lazy Skip Lists: Introduced skip lists as a practical, probabilistic alternative to balanced trees for concurrent sets, suitable for applying lazy/optimistic locking techniques.
- Lock-Free Introduction & CAS: Defined lock-freedom and wait-freedom, contrasting them with blocking synchronization, and re-introduced Compare-and-Swap (CAS) as the key atomic primitive for building lock-free algorithms.
Learning Goals Today
Today, we will delve into:
- Lock-Free Unbounded Queue: Implementing a queue without locks, suitable for scenarios like OS schedulers, using sentinel nodes.
- Reuse and the ABA Problem: Understanding how memory reuse (common in systems without garbage collection or using object pools) can break naive lock-free algorithms relying on CAS.
- Avoiding the ABA Problem: Examining techniques like Pointer Tagging and Hazard Pointers to mitigate the ABA issue.
Lock-Free Unbounded Queue
Queues are fundamental in many systems, especially operating systems and runtime environments, for managing tasks, threads, or requests.
Motivation: A Lock-Free Operating System Kernel
Schedulers lie at the heart of operating systems, moving tasks between states (ready, running, waiting); the sets of tasks in these states are often implemented using queues. Scheduling decisions happen frequently (thread creation/termination, blocking/unblocking).
These internal kernel/runtime data structures must be protected against concurrent access from multiple threads running on different cores, and potentially even from interrupt service routines. While spinlocks are conventionally used, their blocking nature can cause problems (priority inversion, convoying, etc.).
Therefore, implementing scheduler queues (and other core structures) in a lock-free manner is highly desirable for performance, scalability, and responsiveness. This requires a lock-free unbounded queue algorithm. However, OS kernels typically cannot rely on automatic Garbage Collection (GC) and must manage memory explicitly, often reusing queue nodes. This reuse creates a significant challenge for lock-free algorithms: the ABA problem.
Basic Queue Node Structure
We start with a simple node containing an item and a `next` pointer.
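A minimal sketch of such a node in Java (the field names `item` and `next` follow the description above; this is an illustration, not necessarily the lecture's exact code):

```java
// Sketch of a simple queue node: an item plus a reference to the next node.
class Node<T> {
    final T item;
    Node<T> next;   // successor in the queue; null if this is the last node

    Node(T item) {
        this.item = item;
    }
}
```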
Blocking Unbounded Queue (Recap)
For reference, a simple blocking queue using `synchronized` provides mutual exclusion for the `enqueue` and `dequeue` operations, handling the `head` and `tail` pointers. However, this suffers from the general drawbacks of locking.
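One possible sketch of such a blocking queue, reusing the `Node` class above and protecting both operations with the queue's monitor lock (an illustration under these assumptions, not the lecture's exact code):

```java
// Sketch of a blocking unbounded queue: a single monitor lock protects head and tail.
class BlockingUnboundedQueue<T> {
    private Node<T> head = null;   // first node; null when the queue is empty
    private Node<T> tail = null;   // last node; null when the queue is empty

    public synchronized void enqueue(T item) {
        Node<T> node = new Node<>(item);
        if (tail == null) {
            head = node;           // empty queue: node becomes both head and tail
        } else {
            tail.next = node;      // link behind the current last node
        }
        tail = node;
    }

    public synchronized T dequeue() {
        if (head == null) {
            return null;           // empty queue
        }
        T item = head.item;
        head = head.next;          // unlink the first node
        if (head == null) {
            tail = null;           // the queue became empty
        }
        return item;
    }
}
```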
Challenges for Lock-Free Queues
Implementing this lock-free is tricky because operations involve multiple pointers:
- Enqueue: modifies `tail.next` and `tail`.
- Dequeue: modifies `head` and reads `head.next`.
These operations need to appear atomic, but a single CAS operates on only one memory location.
Idea: Sentinel Node
To simplify edge cases (especially handling the empty queue), a common technique is to always have a dummy “sentinel” node at the front of the queue. The `head` pointer always points to the sentinel. The queue is empty when `head.next` is null (or, more robustly, when `head == tail`). The actual first element is `head.next`.
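A small sketch of the sentinel setup, sequential code only, just to illustrate the invariant (the `isEmpty` check is an assumption based on the description above):

```java
// Sketch: the queue is initialized with a dummy sentinel node; head always points to it.
class SentinelQueue<T> {
    private Node<T> head;   // always points to the sentinel
    private Node<T> tail;   // points to the last node (the sentinel when empty)

    SentinelQueue() {
        Node<T> sentinel = new Node<>(null);  // dummy node carrying no item
        head = sentinel;
        tail = sentinel;
    }

    boolean isEmpty() {
        // Empty iff no node follows the sentinel (equivalently, head == tail here).
        return head.next == null;
    }
}
```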
Enqueue with Sentinel
- The new node is added after the node currently pointed to by `tail`. This requires updating `tail.next`.
- The `tail` pointer itself needs to be updated to point to the newly added node.
These two steps (updating `tail.next` and `tail`) are ideally atomic but involve two separate memory locations.
Dequeue with Sentinel
- The item to be dequeued is in the node after `head` (i.e., `head.next`). The value is read from `head.next.item`.
- The `head` pointer needs to be updated to point to the dequeued node (`head = head.next`).
This involves reading `head.next` and updating `head`.
Does the sentinel solve the atomicity problem? No. Enqueue still needs to update two pointers (`tail.next` and `tail`). This can lead to a transient inconsistency: a thread might update `tail.next` but get delayed before updating `tail`. During this time, `tail` doesn’t point to the true last node. Waiting for consistency would reintroduce blocking (“locking camouflaged”). The solution involves helping: threads detect the inconsistency and help complete the pending operation (like advancing `tail`) before proceeding.
Lock-Free Queue Node with AtomicReference
To implement the queue lock-free, we need atomic operations on the pointers. We use `AtomicReference` for `head`, `tail`, and the `next` pointer within each `Node`.
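A sketch of the node and queue skeleton with atomic pointers, initialized with a sentinel (names follow the protocol description below; the lecture's code may differ in detail):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: queue skeleton with atomic head/tail/next pointers and a sentinel node.
class LockFreeQueue<T> {
    // Unlike the plain Node earlier, next is now an AtomicReference so it can be CASed.
    static class Node<E> {
        final E item;
        final AtomicReference<Node<E>> next = new AtomicReference<>(null);

        Node(E item) {
            this.item = item;
        }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    public LockFreeQueue() {
        Node<T> sentinel = new Node<>(null);    // dummy sentinel node
        head = new AtomicReference<>(sentinel);
        tail = new AtomicReference<>(sentinel);
    }

    // enqueue and dequeue follow the Michael & Scott protocol, sketched further below.
}
```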
Michael & Scott Lock-Free Queue Protocol
This describes the widely used Michael & Scott lock-free queue algorithm:
Enqueuer Protocol:
- Read the current `tail` into a local variable `last`.
- Read `last.next` into `next`.
- (Consistency Check) If `tail` still points to `last`:
  a. If `next` is `null` (meaning `last` is the actual end): try to atomically link the `new_node` using `CAS(last.next, null, new_node)`.
  b. If the CAS in (a) succeeds, try to swing the `tail` pointer: `CAS(tail, last, new_node)`. This second CAS is a “helping” step; it’s okay if it fails (another thread might have already done it), but attempting it helps maintain progress.
  c. If `next` is not `null` (meaning `tail` is lagging behind the actual end): help another thread by trying to swing the tail forward: `CAS(tail, last, next)`.
- (Retry) If `tail` changed between reading `last` and the consistency check, or if the primary `CAS(last.next, ...)` failed due to a race, retry the entire process.
Why might the `CAS(tail, last, new_node)` fail or never take effect? Either another thread already updated `tail` (by helping), or the thread responsible for the CAS was delayed or died before completing it, which is exactly why other threads must be prepared to help.
Dequeuer Protocol:
- Read `head` into `first` and `tail` into `last`.
- Read `first.next` into `next`.
- (Consistency Check) If `head` still points to `first`:
  a. If `first == last` (head and tail point to the same node, possibly the sentinel):
     i. If `next == null`, the queue is truly empty. Return null/failure.
     ii. If `next != null`, the queue is in a transient intermediate state (an enqueue finished linking but didn’t update `tail` yet). Help by trying `CAS(tail, last, next)`, then retry the dequeue.
  b. If `first != last` (queue appears non-empty):
     i. Read the value from the node to be dequeued: `value = next.item`.
     ii. Try to swing the `head` pointer forward: `CAS(head, first, next)`.
     iii. If the CAS succeeds, return `value`.
     iv. If the CAS fails (another dequeue happened), retry the entire process.
- (Retry) If `head` changed between reading `first` and the consistency check, retry.
Why might `CAS(head, first, next)` fail? Another dequeuer successfully executed its CAS first.
A specific inconsistency: Thread A enqueues the first element (`node`) by setting `S.next = node` but gets delayed before setting `tail = node`. Thread B then dequeues `S` (the sentinel) by setting `head = node`. Now `tail` (still pointing to `S`) points to a dequeued node. The helping mechanism (`tail.compareAndSet(last, next)` in both enqueue and dequeue) addresses this.
The enqueue code implements this logic with the helping step (`else tail.compareAndSet(last, next)`), as sketched below.
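A sketch of that enqueue operation as a method of the `LockFreeQueue` skeleton above, following the enqueuer protocol (an illustration of the Michael & Scott logic; the lecture's exact code may differ):

```java
// Method of LockFreeQueue<T> from the skeleton above.
// Sketch: lock-free enqueue following the Michael & Scott protocol.
public void enqueue(T item) {
    Node<T> node = new Node<>(item);
    while (true) {
        Node<T> last = tail.get();
        Node<T> next = last.next.get();
        if (last == tail.get()) {                    // consistency check
            if (next == null) {
                // last is the true end: try to link the new node
                if (last.next.compareAndSet(null, node)) {
                    // linking succeeded; try to swing tail (ok if this fails)
                    tail.compareAndSet(last, node);
                    return;
                }
                // linking failed: another enqueuer won; retry
            } else {
                // tail is lagging behind: help advance it, then retry
                tail.compareAndSet(last, next);
            }
        }
    }
}
```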
The dequeue code includes the check for empty vs. intermediate state (`if (first == last)`) and the helping step, as sketched below.
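A corresponding sketch of the dequeue operation, again as a method of the same skeleton and following the dequeuer protocol (an illustration, not necessarily the lecture's exact code):

```java
// Method of LockFreeQueue<T> from the skeleton above.
// Sketch: lock-free dequeue following the Michael & Scott protocol.
public T dequeue() {
    while (true) {
        Node<T> first = head.get();
        Node<T> last = tail.get();
        Node<T> next = first.next.get();
        if (first == head.get()) {                   // consistency check
            if (first == last) {
                if (next == null) {
                    return null;                     // queue is truly empty
                }
                // intermediate state: a node is linked but tail lags; help, then retry
                tail.compareAndSet(last, next);
            } else {
                T value = next.item;                 // item sits in the node after the sentinel
                if (head.compareAndSet(first, next)) {
                    return value;                    // next becomes the new sentinel
                }
                // CAS failed: another dequeuer won; retry
            }
        }
    }
}
```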
This Michael & Scott algorithm provides a correct lock-free unbounded queue implementation.
Reuse and the ABA Problem
The lock-free stack and queue implementations work fine if nodes, once removed, are never reused. However, in many contexts (OS kernels, systems without GC, performance-critical code using memory pools), node reuse is necessary. This reuse exposes a critical flaw in algorithms relying solely on CAS: the ABA problem.
Back to the Lock-Free Stack
Continue here: 24 ABA Problem, DCAS, GC, Pointer Tagging, Hazard Pointers, Linearizability