Lecture from 07.05.2025 | Video: Video ETHZ
Today, we’ll expand on fine-grained locking techniques, exploring practical implementations and trade-offs. This leads us naturally into the realm of lock-free programming, where we aim to achieve concurrency without using traditional mutual exclusion locks.
Summary/Recap Last Lecture
In our previous lecture, we covered:
- Explicit/Interface Locks (`Lock`, `Condition`): Explored the Java `Lock` interface and associated `Condition` variables, noting their increased power and flexibility compared to intrinsic monitors (`synchronized`/`wait`/`notify`), but also their increased potential for programmer error.
- Sleeping Barber Example: Analyzed this classic problem to illustrate how to avoid excessive/useless notifications in condition variable patterns by keeping track of waiting thread counts, preventing lost signals.
- Reader/Writer Locks (`ReentrantReadWriteLock`): Discussed locks that allow multiple concurrent readers or a single exclusive writer, examining different fairness policies (reader vs. writer priority).
Learning Goals Today
Building on that foundation, our goals for today are:
- Lock Granularity in Practice: Apply fine-grained, optimistic, and lazy locking strategies to list and stack examples.
- Various Locking Modes: Understand the implementation details and trade-offs of these different locking approaches.
- Lock-Free Data Structures:
  - Implement a lock-free stack using atomic operations (Compare-and-Swap).
  - Examine lock-free queues, a common requirement in OS structures.
- The ABA Problem: Understand this subtle issue that arises with Compare-and-Swap when memory is reused.
- Avoiding the ABA Problem: Discuss techniques like pointer tagging and hazard pointers.
- (Introduction to Formal Models): Briefly introduce Linearizability and Sequential Consistency.
- (Introduction to Consensus): Define the consensus problem and its relation to the power of atomic operations.
Compare-and-Swap (CAS) Revisited
`CAS(memory_location, expected_old_value, new_value)` atomically:
- Reads the current value at `memory_location`.
- Compares it with `expected_old_value`.
- If they match, writes `new_value` to `memory_location` and returns success (or the old value).
- If they don't match, does not write, and returns failure (or the current value).
- More powerful than TAS. Can often be implemented wait-free in hardware.
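As a concrete illustration of these semantics, Java exposes CAS through the `java.util.concurrent.atomic` classes; a minimal sketch (the class name `CasDemo` is ours):

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasDemo {
    public static void main(String[] args) {
        AtomicLong loc = new AtomicLong(10);

        // Matches: expected value is 10, so 42 is written and CAS succeeds.
        boolean ok = loc.compareAndSet(10, 42);

        // Does not match: the value is now 42, not 10, so nothing is written.
        boolean fail = loc.compareAndSet(10, 99);

        System.out.println(ok + " " + fail + " " + loc.get()); // true false 42
    }
}
```

Note that the failed CAS leaves the location untouched; the caller sees the failure and can re-read and retry.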
Non-blocking Counter using CAS
- Mechanism: Read-modify-try-CAS. If `compareAndSet` fails, another thread modified `value` between the read (a) and the CAS (c), so we read again and retry.
- Progress: This is lock-free. If multiple threads attempt `inc()` concurrently, at least one will eventually succeed in its CAS operation, ensuring system progress. It is not wait-free, as a thread might theoretically fail CAS indefinitely under extreme contention (though this is unlikely in practice).
- Deadlock/Starvation: Immune to deadlock. Starvation is possible but rare.
- Thread Death: If a thread dies between reading `v` and the CAS, other threads are unaffected and can continue to make progress.
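The counter code itself is not reproduced above; a minimal sketch matching the described mechanism (the class name and `get()` helper are our assumptions, `inc()` follows the lecture):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int inc() {
        while (true) {
            int v = value.get();                 // (a) read current value
            int next = v + 1;                    // (b) compute new value locally
            if (value.compareAndSet(v, next)) {  // (c) try to publish atomically
                return next;                     // success: nobody interfered
            }
            // CAS failed: another thread changed value between (a) and (c); retry.
        }
    }

    public int get() {
        return value.get();
    }
}
```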
Handle CAS With Care
- A successful CAS suggests that no other thread wrote between the read (`get`) and the CAS.
- Warning: This suggestion is not always true, due to the ABA problem.
- CAS (or LL/SC) is the fundamental mechanism for checking conditions and making atomic changes in most lock-free algorithms.
- Alternatives like Transactional Memory might simplify things but have their own challenges.
Lock-Free Stack
Let’s implement a lock-free stack using an `AtomicReference` for the `top` pointer.
This is the baseline lock-based version for comparison.
Lock-free stack structure:

```java
import java.util.concurrent.atomic.AtomicReference;

public class ConcurrentStack {
    // Use AtomicReference to allow atomic updates (CAS) on the top pointer
    AtomicReference<Node> top = new AtomicReference<Node>();
    // Methods use CAS loops instead of locks
    public void push(Long item) { /* ... */ }
    public Long pop() { /* ... */ }
}
```
Lock-Free Pop
Logic: Repeatedly read the current `head` and its `next`. Try to atomically swing the `top` pointer from `head` to `next`. If `top` changed between the `get()` and the `compareAndSet()` (because another thread pushed or popped), the CAS fails and the loop repeats.
Lock-Free Push
Logic: Create the new node `newi`. Repeatedly read the current `head`, set `newi.next` to `head`, and try to atomically swing `top` from `head` to `newi`. If `top` changed concurrently, the CAS fails and the loop retries with the updated `head`.
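Putting the two loops together, a sketch of the complete stack (the `Node` class and the null-on-empty behavior of `pop` are our assumptions, not the lecture's exact code):

```java
import java.util.concurrent.atomic.AtomicReference;

public class LockFreeStack {
    static class Node {
        final Long item;
        Node next;
        Node(Long item) { this.item = item; }
    }

    private final AtomicReference<Node> top = new AtomicReference<Node>();

    public void push(Long item) {
        Node newi = new Node(item);
        Node head;
        do {
            head = top.get();       // read current top
            newi.next = head;       // link new node in front of it
            // Swing top from head to newi; fails if top changed concurrently.
        } while (!top.compareAndSet(head, newi));
    }

    public Long pop() {
        Node head, next;
        do {
            head = top.get();       // read current top
            if (head == null) return null; // empty (design choice: null, not exception)
            next = head.next;       // candidate new top
            // Swing top from head to next; fails if another push/pop intervened.
        } while (!top.compareAndSet(head, next));
        return head.item;
    }
}
```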
Benefits and Performance
- Benefit: Lock-free algorithms are inherently deadlock-free.
- Performance:
  - The graph shows that without backoff, the lock-free stack performs worse than the simple blocking (`synchronized`) stack under contention: failed CAS attempts still consume resources and can cause cache contention, similar to spinlocks.
  - Atomic operations (like the CAS used by `AtomicReference`) are relatively expensive compared to uncontended lock acquires/releases.
Lock-free doesn’t automatically mean faster. Contention on the underlying atomic operations can still be a bottleneck.
Backoff: Adding exponential backoff to the retry loops (pausing after a failed CAS) significantly improves the performance of the lock-free stack under contention, making it outperform the blocking version.
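A sketch of a push retry loop with exponential backoff (the delay bounds and the use of `LockSupport.parkNanos` are illustrative assumptions, not the lecture's code; `peek` is a helper added only for illustration):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

public class BackoffStack {
    static class Node {
        final Long item;
        Node next;
        Node(Long item) { this.item = item; }
    }

    private final AtomicReference<Node> top = new AtomicReference<Node>();

    // Illustrative backoff bounds in nanoseconds.
    private static final long MIN_DELAY_NS = 1_000;
    private static final long MAX_DELAY_NS = 1_000_000;

    public void push(Long item) {
        Node newi = new Node(item);
        long limit = MIN_DELAY_NS;
        while (true) {
            Node head = top.get();
            newi.next = head;
            if (top.compareAndSet(head, newi)) {
                return; // success
            }
            // Failed CAS: pause for a random time, then double the limit.
            // Spreading retries out reduces contention on the atomic variable.
            LockSupport.parkNanos(ThreadLocalRandom.current().nextLong(limit));
            limit = Math.min(limit * 2, MAX_DELAY_NS);
        }
    }

    public Long peek() { // helper for illustration only
        Node head = top.get();
        return head == null ? null : head.item;
    }
}
```

The randomized, doubling pause after each failed CAS is what the slides credit with making the lock-free stack outperform the blocking one under contention.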
Lock-Free List-Based Set (Introduction)
Implementing a lock-free linked list is significantly more complex than a stack, because we potentially need to modify two pointers (`pred.next` and the new node’s `next` for add, or `pred.next` and a mark for remove) somewhat “atomically”. (We are not discussing skip lists here.)
Can simple CAS handle concurrent add and remove? A: `remove(c)` needs `CAS(b.next, c, d)`. B: `add(b')` needs `CAS(b.next, c, b')`. The CAS hardware arbitrates: only one succeeds, the other retries. This seems okay.
What about A: `remove(c)` needing `CAS(b.next, c, d)` and B: `remove(b)` needing `CAS(a.next, b, c)`? If B succeeds first, A’s CAS on `b.next` still succeeds, but it modifies a node that is no longer in the list: `c` remains linked from `a`, so the removal of `c` is silently lost. Simple CAS on single pointers is insufficient.
Adding a mark bit for lazy deletion (A: `CAS(c.mark, false, true)` then `CAS(b.next, c, d)`) also doesn’t solve the problem directly: thread B might add `c'` (`CAS(c.next, d, c')`) after A marks `c` but before A physically removes `c`, so `c'` gets added to a logically deleted node.
The Problem: We need to atomically modify or validate two pieces of information: the `next` pointer and a state marker (like a delete bit), or potentially two pointers (`tail.next` and `tail` in a queue). Standard CAS only works on a single memory location.
Java Solution: AtomicMarkableReference
Java provides `AtomicMarkableReference<V>` to address this. It atomically manages a reference `V` and a single boolean mark.

- `compareAndSet(expectedRef, newRef, expectedMark, newMark)`: Atomically sets the reference to `newRef` and the mark to `newMark` if and only if the current reference is `expectedRef` AND the current mark is `expectedMark`.
- `attemptMark(expectedRef, newMark)`: Atomically sets the mark to `newMark` if the current reference is `expectedRef`.
- Other methods: `get()`, `getReference()`, `isMarked()`, `set()`.
This effectively provides a way to perform a Double CAS (DCAS) on the reference and the mark bit together. (Java’s implementation packs both into an internal pair object updated with a single CAS.) The slide notes that one bit is often “free” in aligned pointer references, so a native implementation can store the mark bit efficiently alongside the pointer itself.
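A minimal sketch of this API in action (the class name is ours):

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

public class MarkableDemo {
    public static void main(String[] args) {
        String a = "A", b = "B";
        // Reference "A", initially unmarked.
        AtomicMarkableReference<String> ref =
                new AtomicMarkableReference<String>(a, false);

        // Succeeds: current state is (A, false), so it becomes (B, true).
        boolean ok = ref.compareAndSet(a, b, false, true);

        // Fails: the reference matches, but the expected mark (false) does not.
        boolean fail = ref.compareAndSet(b, a, false, false);

        boolean[] mark = new boolean[1];
        String cur = ref.get(mark); // reads reference and mark together
        System.out.println(ok + " " + fail + " " + cur + " " + mark[0]); // true false B true
    }
}
```

Because reference and mark change together or not at all, a remover can fail another thread's CAS just by flipping the mark, which is exactly what the list algorithm below exploits.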
Algorithm using AtomicMarkableReference
We can now revisit the lazy list approach using `AtomicMarkableReference` for the `next` field in each node.

The `remove` operation becomes two distinct conceptual steps:
- Logical Delete: Atomically set the mark bit of the target node’s `next` field (using `attemptMark`, or `compareAndSet` on the target node itself if the mark is stored there).
- Physical Delete: Atomically update the predecessor’s `next` field to skip the marked node (using `compareAndSet`, checking both the reference and the mark of the predecessor’s `next` field).
Example: Removing `c` (between `b` and `d`):
- Mark `c`: `c.attemptMark(false, true)` (assuming the mark is on the node).
- Swing `b.next`: `b.next.compareAndSet(c, d, false, false)` (assuming `c` and `d` are initially unmarked). This tries to change `b`’s `next` from `(ref=c, mark=false)` to `(ref=d, mark=false)`, physically removing `c`.
If A tries to remove `c` and B tries to remove `b` concurrently:
- A marks the `c.next` pointer (not `c` itself).
- B marks the `b.next` pointer (not `b` itself).
- A tries `b.next.compareAndSet(c, d, false, false)`. This fails because `b.next` is now marked by B. A might then help physically remove `b` before retrying on `c`.
- B tries `a.next.compareAndSet(b, c, false, false)`. This succeeds (assuming `a`, `b`, `c` were initially unmarked); `b` is physically removed.
Helping: If a thread encounters a marked node during traversal, it should attempt to “finish the job” by performing the physical deletion (CASing the predecessor’s `next` pointer) before continuing its own operation. This “helping” mechanism is crucial for ensuring lock-free progress, as it prevents threads from getting stuck behind nodes that are logically deleted but not yet physically unlinked.
The `find` method now incorporates logic to detect marked nodes (`marked[0]`) and attempt physical deletion (`pred.next.compareAndSet(curr, succ, false, false)`), restarting the search if the physical deletion fails due to interference.

The `remove` method uses `find` to locate `pred` and `curr`. It first tries to logically delete `curr` by marking its successor reference (`curr.next.attemptMark(succ, true)`). If successful, it then tries to physically delete it (`pred.next.compareAndSet(curr, succ, false, false)`). If logical deletion fails (`!snip`), it continues the outer retry loop.

The `add` method uses `find` to locate `pred` and `curr`. It creates the new node and attempts to insert it using `pred.next.compareAndSet(curr, node, false, false)`, retrying if the CAS fails (because `pred` or `curr` changed or was marked).
- Using `AtomicMarkableReference` effectively simulates a DCAS on the reference and the mark bit.
- The lazy marking, combined with helping (physically deleting marked nodes encountered during traversal), is key to making the list operations lock-free.
Lock-Free Unbounded Queue
Motivation: Implementing core OS or runtime system components, like thread schedulers, often requires highly efficient, non-blocking concurrent data structures. Scheduler queues are a prime example.
These queues need protection from concurrent access by multiple threads and potentially interrupt handlers running on different cores. Using locks can lead to the problems discussed earlier (deadlocks, convoying, priority inversion), making lock-free implementations desirable.
Continue here: 23 Lock-Free Unbounded Queue Protocol