Lecture from 06.05.2025 | Video: Video ETHZ

We’ve seen the immense computational demands, particularly in AI, driving the need for massive parallelism. Understanding how to manage concurrency correctly and efficiently is crucial, whether dealing with multi-core CPUs or specialized hardware like Tesla’s FSD chip, which, according to Elon Musk, aims for high fault tolerance without traditional locks (though the specifics of how its concurrency control actually works are complex).

Last Week’s Lecture Recap

Let’s briefly summarize the key points from our previous lecture:

  • Explicit/Interface Locks (java.util.concurrent.locks): We saw that explicit Lock and Condition objects offer more flexibility and power than intrinsic monitors (synchronized/wait/notify) but are also more complex and error-prone.
  • Sleeping Barber Example: This illustrated a technique using counters alongside condition variables to avoid unnecessary notifications when no threads are actually waiting, preventing “lost signals”. The core idea is to keep counts of active/waiting threads.
  • Reader/Writer Locks: We introduced ReentrantReadWriteLock as a mechanism to allow concurrent reads while ensuring exclusive writes, discussing various implementations and fairness considerations (e.g., reader vs. writer priority).

Learning Goals Today

Today, we’ll focus on:

  • Lock Granularity in Practice: Applying different locking strategies (fine-grained, optimistic, lazy) to concrete data structures like linked lists and stacks.
  • Various Locking Modes: Understanding the trade-offs and implementation details of these different approaches through examples.

Fine-Grained Locking Revisited

Recall the core idea of fine-grained locking:

  • Goal: Increase concurrency by splitting a data structure into smaller pieces, each protected by its own separate lock.
  • Benefit: Allows operations on disjoint pieces to proceed in parallel without mutual exclusion.
  • Challenge: Often significantly more intricate than coarse-grained locking, requiring careful analysis of special cases and potential interactions between threads operating on adjacent pieces.

Example: Fine-Grained Locking on a Sorted Linked List Set

We previously introduced the concept of hand-over-hand locking (or lock coupling) for traversing and modifying a linked list, where a thread always holds locks on two adjacent nodes (pred and curr) before proceeding or modifying pointers.

Let’s reconsider removing c from a list a -> b -> c -> d. Locking just b is insufficient.

Concurrent removes of b and c can interfere: remove(b) locks a and sets a.next = c, while remove(c) locks b and sets b.next = d. The result is a -> c -> d, leaving c undeleted.

The problem is that modifying b.next requires reading c.next safely and preventing concurrent interference with b itself. Locking both pred and curr is necessary. (Note that d does not need to be locked: as long as the link b -> d is established correctly, we don’t care how d.next changes afterwards, and a concurrent delete of d must itself acquire these same locks, so it cannot interfere.)

The hand-over-hand locking protocol for remove ensures safety by locking curr before unlocking pred during traversal, and holding locks on both pred and curr during the actual pointer update (pred.next = curr.next).
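As a concrete reference, here is a minimal Java sketch of hand-over-hand remove. The Node class, the sentinel setup, and the use of integer keys are illustrative assumptions, not the lecture’s exact code:

```java
import java.util.concurrent.locks.ReentrantLock;

class Node {
    final int key;
    volatile Node next;
    final ReentrantLock lock = new ReentrantLock();
    Node(int key, Node next) { this.key = key; this.next = next; }
}

class FineGrainedList {
    // Sentinels guarantee every node has a predecessor and a successor.
    private final Node head =
        new Node(Integer.MIN_VALUE, new Node(Integer.MAX_VALUE, null));

    public boolean remove(int key) {
        Node pred = head;
        pred.lock.lock();
        Node curr = pred.next;
        curr.lock.lock();
        try {
            // Hand-over-hand: always lock curr before unlocking pred.
            while (curr.key < key) {
                pred.lock.unlock();
                pred = curr;
                curr = curr.next;
                curr.lock.lock();
            }
            if (curr.key == key) {
                pred.next = curr.next;  // both pred and curr are locked here
                return true;
            }
            return false;
        } finally {
            curr.lock.unlock();
            pred.lock.unlock();
        }
    }
}
```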

Disadvantages of Hand-Over-Hand Locking:

  • Overhead: Operations require a sequence of lock acquire/release operations during traversal, adding overhead.
  • Blocking: A slow thread holding a lock on an early node can block other threads wanting to access nodes further down the list, even if the operations are on logically disjoint parts. Concurrency is limited by the sequential nature of the traversal locking.

Can we do better for list-based structures?

Optimistic Synchronization (Locking)

Idea

Assume conflicts (interference from other threads) are rare. Perform the operation in stages:

  1. Find Nodes (Lock-Free): Traverse the list without acquiring any locks to find the relevant nodes (e.g., pred and curr for an add/remove operation).
  2. Lock Nodes: Once the target nodes are found, acquire locks only on those specific nodes.
  3. Validate: After acquiring locks, check if the situation is still valid. Did another thread modify the list structure (e.g., delete pred or curr, insert between them) after stage 1 but before we acquired the locks in stage 2?
  4. Perform Operation: If validation succeeds, perform the actual modification (e.g., update pointers).
  5. Unlock: Release the locks. If validation fails, release locks and retry (or signal failure).

Example: add(c) (sketched in code after the steps)

  1. Traverse without locks, finding that c should go between b and d.
  2. Lock b and d.
  3. Validate: Is b still reachable? Is d still the successor of b?
  4. If valid, set c.next = d, b.next = c.
  5. Unlock b and d.
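In code, the staged protocol might look like the following sketch, reusing the illustrative Node class from the hand-over-hand example (validate is sketched after the validation summary below):

```java
class OptimisticList {
    private final Node head =
        new Node(Integer.MIN_VALUE, new Node(Integer.MAX_VALUE, null));

    public boolean add(int key) {
        while (true) {
            // Stage 1: traverse without any locks to find pred and curr.
            Node pred = head;
            Node curr = pred.next;
            while (curr.key < key) {
                pred = curr;
                curr = curr.next;
            }
            // Stage 2: lock only the two nodes the operation touches.
            pred.lock.lock();
            curr.lock.lock();
            try {
                // Stage 3: validate; if another thread interfered, retry.
                if (validate(pred, curr)) {
                    if (curr.key == key) return false; // already in the set
                    pred.next = new Node(key, curr);   // Stage 4: splice in
                    return true;
                }
            } finally {
                curr.lock.unlock();                    // Stage 5: unlock
                pred.lock.unlock();
            }
        }
    }

    // boolean validate(Node pred, Node curr) is sketched after the
    // validation summary below.
}
```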

Validation Failure Example 1: Deletion

  • Thread A: add(c). Finds insertion point between b and d.
  • Thread B: remove(b). Runs concurrently. Finds a and b, locks them, changes a.next to d, unlocks. b is now removed.
  • Thread A: Tries to lock b and d. Let’s say it succeeds (if b wasn’t fully reclaimed).
  • Thread A: Validates. Rescans from the head. Finds a points to d. Node b is no longer reachable. Validation fails.
  • Thread A: Unlocks b and d, returns false (or retries).

Validation Failure Example 2: Insertion

  • Thread A: add(c). Finds insertion point between b and d.
  • Thread B: insert(b'). Runs concurrently. Finds insertion point between b and d, locks b and d, inserts b' (b.next = b', b'.next = d), unlocks.
  • Thread A: Tries to lock b and d. Succeeds.
  • Thread A: Validates. Rescans or checks b.next. Finds b.next is now b', not d. Validation fails because d != succ(b).
  • Thread A: Unlocks b and d, returns false (or retries).

Validation Summary: The validate(pred, curr) method (sketched in code below) needs to:

  1. Lock pred and curr.
  2. Check if pred is still reachable from the head.
  3. Check if curr is still the immediate successor of pred (pred.next == curr).
  4. (Implicitly) Check if pred and curr are not marked deleted (relevant for lazy lists).
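For the OptimisticList sketched above, such a validate could look as follows. The rescan from the head is what makes optimistic validation relatively expensive; point 4 (mark bits) only becomes relevant for the lazy list later:

```java
// Called with the locks on pred and curr already held (point 1).
private boolean validate(Node pred, Node curr) {
    Node node = head;
    while (node.key <= pred.key) {      // rescan from the head (point 2)
        if (node == pred) {
            return pred.next == curr;   // reachable: still the successor? (point 3)
        }
        node = node.next;
    }
    return false;                       // pred is no longer reachable
}
```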

Correctness relies on Validation

  • For remove(c) (where b is pred, c is curr): If validation (b reachable, c == b.next, neither marked) passes while holding locks on b and c, then no other thread can be concurrently deleting b or c, nor inserting between them. It’s safe to proceed.
  • For remove(c) (where c not found, b is pred, d is curr): If validation (b reachable, d == b.next, neither marked) passes while holding locks on b and d, then no other thread can have inserted c between b and d after our initial scan but before we locked b and d. It’s safe to return false.

Optimistic List Trade-offs

  • Good:
    • No Contention on Traversals: The initial scan is lock-free, allowing many concurrent readers/traversers.
    • Wait-Free Traversals: Finding the nodes doesn’t involve waiting for locks (assuming contains doesn’t lock). A wait-free operation guarantees completion in a finite number of its own steps, regardless of other threads’ speeds or pausing.
    • Fewer Lock Acquisitions: Locks are only acquired briefly on the target nodes during the modify phase.
  • Bad:
    • Double Traversal: Need to traverse (or at least rescan for validation) twice in the worst case (scan, lock, validate-scan, modify).
    • contains() Needs Locks?: A simple contains that just scans might return true for a node that’s concurrently being deleted (after scan, before validation). A correct contains might also need to lock/validate, losing the wait-free benefit.
    • Not Starvation-Free: Threads might repeatedly fail validation due to high contention and have to retry indefinitely.

Lazy Synchronization (Locking)

Idea: Improve upon optimistic locking, particularly for remove. Removing nodes often requires locking three nodes (pred, curr, succ) in fine-grained schemes or careful validation in optimistic schemes. Lazy synchronization aims to simplify remove.

Lazy List Approach:

  • Similar to optimistic list, but modifies the remove operation.
  • Scan Once: Traverse the list once (optimistically or using hand-over-hand).
  • contains() Never Locks: A simple scan is sufficient.
  • Logical Deletion (Marking): Instead of physically removing a node immediately, remove first marks the node as “logically deleted”.
  • Physical Deletion (Lazy): The actual removal (updating predecessor’s next pointer) is done later, potentially by the same remove operation or even by subsequent add or remove operations that happen to traverse over the marked node.
  • Key Invariant: Every unmarked node in the list must always be reachable from the head. contains only returns true for unmarked nodes.

Lazy List Remove Steps:

  1. Scan: Find pred and curr (where curr is the node to be removed).
  2. Lock: Lock pred and curr.
  3. Validate: Check if pred is reachable and points to curr, AND crucially, check that curr is not already marked as deleted.
  4. Logical Delete: If validation passes, mark curr as deleted.
  5. Physical Delete: Change pred.next to point to curr.next.
  6. Unlock: Release locks on pred and curr.

Key Invariant for Lazy List: If a node is not marked as deleted, it must be reachable from the head via a path of unmarked nodes. contains checks this invariant by ignoring marked nodes.

Lazy Remove Code Sketch:
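A minimal Java sketch of the steps above, assuming the Node class from earlier gains a volatile boolean marked field (initially false):

```java
public boolean remove(int key) {
    while (true) {
        // 1. Scan once, without locks.
        Node pred = head;
        Node curr = pred.next;
        while (curr.key < key) {
            pred = curr;
            curr = curr.next;
        }
        pred.lock.lock();                           // 2. Lock
        curr.lock.lock();
        try {
            // 3. Validate: constant-time, no rescan from the head needed.
            if (!pred.marked && !curr.marked && pred.next == curr) {
                if (curr.key != key) return false;  // key not present
                curr.marked = true;                 // 4. Logical delete
                pred.next = curr.next;              // 5. Physical delete
                return true;
            }
            // Validation failed: fall through and retry.
        } finally {
            curr.lock.unlock();                     // 6. Unlock
            pred.lock.unlock();
        }
    }
}
```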

Wait-Free contains():

Since contains only performs reads and doesn’t acquire locks, it’s wait-free. The marked flag ensures it doesn’t report true for nodes that have been logically removed.
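Under the same assumptions as the remove sketch above:

```java
public boolean contains(int key) {
    Node curr = head;
    while (curr.key < key) {   // read-only traversal, no locks
        curr = curr.next;
    }
    // Present only if the key matches and the node is not logically deleted.
    return curr.key == key && !curr.marked;
}
```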

Lazy List Trade-offs:

  • Pros: contains is wait-free and fast (single scan, no locks). remove might be simpler as physical deletion is separated.
  • Cons: add still needs validation similar to optimistic lists. Memory isn’t reclaimed until marked nodes are physically unlinked (which might be delayed).

More Practical: Lazy Skip Lists

While linked lists illustrate locking concepts well, they aren’t typically the best structure for concurrent sets/maps due to their O(n) traversal. Skip lists (invented by Bill Pugh) offer a more practical alternative.

Skip List: A probabilistic data structure representing a set (sorted elements without duplicates).

  • Interface: add(x), remove(x), find(x) (or contains(x)).
  • Typical Workload: Assumes many calls to find, fewer add, and even fewer remove.

Why not balanced trees (AVL, Red-Black)?

  • Rebalancing after add/remove can be expensive.
  • Rebalancing often requires global restructuring (locking large parts or the whole tree), hindering concurrency.
  • Implementing lock-free or highly concurrent balanced trees is extremely complex.

Skip Lists

A probabilistic approach (Las Vegas style - always correct, probably fast) to achieve logarithmic performance on average without complex rebalancing.

Structure

  • A sorted multi-level linked list.
  • Each node has a probabilistic height (e.g., P(height=n) = 0.5^n). Higher nodes act as express lanes.
  • No rebalancing is needed.

Property: Nodes present at higher levels are always present in all lower levels. Level 0 contains all elements.
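A node in such a skip list can be sketched as follows; the field names (marked, fullyLinked) anticipate the lazy synchronization below and are illustrative assumptions:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.locks.ReentrantLock;

class SkipNode {
    final int key;
    final SkipNode[] next;                 // next[i] = successor at level i
    final ReentrantLock lock = new ReentrantLock();
    volatile boolean marked = false;       // logically deleted?
    volatile boolean fullyLinked = false;  // linked at all its levels?

    SkipNode(int key, int height) {
        this.key = key;
        this.next = new SkipNode[height];
    }

    // P(height = n) = 0.5^n: flip coins until tails (capped at maxHeight).
    static int randomHeight(int maxHeight) {
        int h = 1;
        while (h < maxHeight && ThreadLocalRandom.current().nextBoolean()) {
            h++;
        }
        return h;
    }
}
```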

Operations

We can apply the lazy synchronization strategy to skip lists.

Searching

  • Start at the highest-level list at the head (-∞).
  • Traverse forward until the next node’s key exceeds the target (e.g., 8).
  • Drop down one level.
  • Repeat traversing and dropping down until level 0 is reached.
  • Expected search time is O(log n).

A search for 6 (which doesn’t exist) proceeds the same way, reaching level 0 without finding the key.

The find operation typically returns the predecessors (preds) and successors (succs) at each level to facilitate a subsequent add or remove, as in the sketch below.
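A lock-free find along these lines, assuming a head sentinel with key -∞ (Integer.MIN_VALUE) whose next pointers initially all reference a tail sentinel with key +∞ (Integer.MAX_VALUE):

```java
static final int MAX_LEVEL = 32;        // assumed maximum height

// Fills preds/succs at every level; returns the highest level at which
// key was found, or -1 if it is not in the list.
int find(int key, SkipNode[] preds, SkipNode[] succs) {
    int foundLevel = -1;
    SkipNode pred = head;
    for (int level = MAX_LEVEL - 1; level >= 0; level--) {
        SkipNode curr = pred.next[level];
        while (curr.key < key) {        // move right on this level
            pred = curr;
            curr = curr.next[level];
        }
        if (foundLevel == -1 && curr.key == key) {
            foundLevel = level;
        }
        preds[level] = pred;            // remember for a later add/remove
        succs[level] = curr;            // then drop down one level
    }
    return foundLevel;
}
```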

Adding

add(6) Example (a condensed code sketch follows the steps):

  1. Find Predecessors (Lock-Free): Traverse lock-free to find the predecessors at each level where the new node (6) should be inserted (nodes 5 and -inf in this example).
  2. Lock Predecessors: Acquire locks on the predecessor nodes found (lock node 5 at levels 0, 1, 2; lock head at level 3).
  3. Validate: Rescan from predecessors to ensure they are still reachable and their successors haven’t changed (i.e., no concurrent add/remove interfered).
  4. Splice: If validation succeeds, create the new node (6) with its random height (say, 3 levels), and update the next pointers of the locked predecessors at the relevant levels. Finally, mark the node as fully linked.
  5. Unlock: Release locks on predecessors.
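A condensed sketch of this add, building on SkipNode and find above; it glosses over corner cases the full algorithm must handle (e.g., a concurrently added duplicate that is marked or not yet fully linked):

```java
boolean add(int key) {
    int height = SkipNode.randomHeight(MAX_LEVEL);
    SkipNode[] preds = new SkipNode[MAX_LEVEL];
    SkipNode[] succs = new SkipNode[MAX_LEVEL];
    while (true) {
        // 1. Find predecessors and successors at every level, lock-free.
        if (find(key, preds, succs) != -1) {
            return false;               // simplified: key already present
        }
        int locked = 0;
        try {
            // 2./3. Lock predecessors bottom-up and validate each level.
            boolean valid = true;
            for (int l = 0; valid && l < height; l++) {
                preds[l].lock.lock();   // ReentrantLock: the same pred may
                locked = l + 1;         // safely be locked at several levels
                valid = !preds[l].marked && !succs[l].marked
                        && preds[l].next[l] == succs[l];
            }
            if (!valid) continue;       // interference detected: retry
            // 4. Splice in the new node, then publish it.
            SkipNode node = new SkipNode(key, height);
            for (int l = 0; l < height; l++) node.next[l] = succs[l];
            for (int l = 0; l < height; l++) preds[l].next[l] = node;
            node.fullyLinked = true;    // contains may now report it
            return true;
        } finally {
            // 5. Unlock everything we locked.
            for (int l = 0; l < locked; l++) preds[l].lock.unlock();
        }
    }
}
```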

Removing

remove(5) Example (Lazy; a condensed code sketch follows the steps):

  1. Find Predecessors: Traverse lock-free to find predecessors of node 5 at all levels.
  2. Lock Victim: Acquire lock on the node to be removed (node 5).
  3. Logical Remove (Mark): Mark node 5 as logically deleted at all its levels. This prevents contains(5) from returning true.
  4. Lock Predecessors & Validate: Acquire locks on predecessors. Validate that predecessors are still linked correctly and point to the (now marked) victim node 5.
  5. Physical Remove: If validation passes, update predecessors’ next pointers to bypass node 5.
  6. Unlock: Release locks on predecessors and victim.
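A condensed remove sketch following these steps (again simplified; the full algorithm also checks that the victim is fully linked and was found at its top level):

```java
boolean remove(int key) {
    SkipNode[] preds = new SkipNode[MAX_LEVEL];
    SkipNode[] succs = new SkipNode[MAX_LEVEL];
    // 1. Find predecessors of the victim at all levels, lock-free.
    int levelFound = find(key, preds, succs);
    if (levelFound == -1) return false;             // key not present
    SkipNode victim = succs[levelFound];
    int height = victim.next.length;
    victim.lock.lock();                             // 2. Lock victim
    try {
        if (victim.marked) return false;            // another remove won the race
        victim.marked = true;                       // 3. Logical remove
        while (true) {                              // until physical removal succeeds
            find(key, preds, succs);                // refresh the predecessors
            int locked = 0;
            try {
                // 4. Lock predecessors and validate each level.
                boolean valid = true;
                for (int l = 0; valid && l < height; l++) {
                    preds[l].lock.lock();
                    locked = l + 1;
                    valid = !preds[l].marked && preds[l].next[l] == victim;
                }
                if (!valid) continue;               // retry with fresh predecessors
                for (int l = height - 1; l >= 0; l--) {
                    preds[l].next[l] = victim.next[l];  // 5. Physical remove
                }
                return true;
            } finally {
                for (int l = 0; l < locked; l++) {  // 6. Unlock predecessors
                    preds[l].lock.unlock();
                }
            }
        }
    } finally {
        victim.lock.unlock();                       // 6. Unlock victim
    }
}
```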

Contains

contains(8) Example (code sketch after the list):

  • Perform a standard skip list find operation (traverse high levels, drop down). No locks are acquired.
  • If node 8 is found, check if it is marked for deletion AND if it is fully linked (relevant during add operations).
  • Return true only if found, not marked, and fully linked.
  • This operation is wait-free because it involves no locking and the traversal is guaranteed to finish (assuming no infinite loops in the list structure itself, which marking prevents for removals).
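In code, using the illustrative fields from above:

```java
boolean contains(int key) {
    SkipNode[] preds = new SkipNode[MAX_LEVEL];
    SkipNode[] succs = new SkipNode[MAX_LEVEL];
    int levelFound = find(key, preds, succs);   // lock-free traversal
    return levelFound != -1
        && succs[levelFound].fullyLinked        // a concurrent add has finished
        && !succs[levelFound].marked;           // not logically deleted
}
```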

Skip lists provide a practical concurrent set/map data structure. The full implementation involves handling many details carefully (e.g., node marking, validation logic).

Motivating Lock-Free Programming

We’ve explored various locking techniques (spinlocks, locks with scheduling/waiting, different granularities). Now, let’s consider why we might want to move beyond locks entirely.

Reminder: Problems with Spinlocks

  • Poor performance under contention (bus traffic, cache invalidation).
  • Waste CPU resources spinning.
  • No scheduling fairness (can lead to starvation).
  • No notification mechanism for condition waiting.

Locks with Waiting/Scheduling (Mutexes, Semaphores, Monitors)

  • Avoid wasteful spinning by suspending threads (yielding CPU).
  • Require support from OS/runtime scheduler.
  • Internal waiting queues need protection (often short-term spinlocks!).
  • Higher latency for wakeup (involves scheduler).
  • Hybrid “competitive spinning” tries spinning for a short time before blocking.

Performance: Locks have minimal overhead in the uncontended case (often just the cost of one atomic RMW). However, in the contended case, performance degrades significantly due to blocking, context switching, and scheduler overhead. Starvation is possible depending on the implementation.

Fundamental Disadvantages of Locking

  1. Pessimistic: Locks assume the worst (a conflict might happen) and enforce mutual exclusion always, even if no conflict actually occurs in a particular execution.
  2. Performance: Overhead even when uncontended; significant degradation under contention. Bottlenecks (Amdahl’s Law!).
  3. Blocking Semantics:
    • Priority Inversion: A high-priority thread can be blocked waiting for a low-priority thread holding a lock.
    • Convoying: If a thread holding a lock is descheduled (e.g., due to time slice end, page fault), all other threads waiting for that lock are blocked, even if they are ready to run.
    • Fault Tolerance: If a thread dies while holding a lock, the lock might never be released, blocking other threads permanently.
    • Deadlocks/Livelocks: Prone to these issues.
    • Interrupt Handlers: Cannot typically acquire blocking locks within interrupt handlers.

Lock-Free Programming

Can we achieve synchronization without using locks (mutual exclusion)? Yes, using lock-free techniques.

Recap: Blocking Synchronization Issues

  • Deadlock: Cycle of processes waiting for resources held by others.
  • Livelock: Processes repeatedly change state in response to others but make no overall progress (e.g., two people trying politely to get out of each other’s way in a corridor).
  • Starvation: A process is indefinitely prevented from making progress, even though the system might be progressing overall.

Definitions for Non-Blocking Synchronization

  • Lock-Freedom: Guarantees that at least one thread always makes progress (completes its operation) in a finite number of steps, regardless of the state of other threads (even if some are slow or paused). Implies system-wide progress but does not guarantee freedom from starvation (a specific thread might repeatedly fail or be preempted just before completing).
  • Wait-Freedom: Stronger guarantee. Every thread is guaranteed to make progress (complete its operation) in a finite number of its own steps, regardless of other threads. Implies freedom from starvation.

Progress Conditions Compared

                        Non-blocking (No Locks)    Blocking (Locks)
Everyone Progresses     Wait-free                  Starvation-free
Someone Progresses      Lock-free                  Deadlock-free

Non-Blocking Algorithms

  • Key Property: The failure or suspension of one thread cannot cause the failure or indefinite suspension of another thread. This is fundamentally different from locks, where one thread holding a lock can indefinitely block others.
  • Implementation: Rely on atomic hardware primitives like Compare-and-Swap (CAS) to make progress safely without acquiring locks (see the counter sketch below).
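As a preview, the canonical building block is a CAS retry loop. Here is a small Java illustration using AtomicInteger.compareAndSet (the counter class itself is made up for this example):

```java
import java.util.concurrent.atomic.AtomicInteger;

class LockFreeCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            // CAS succeeds only if no other thread changed value in between.
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // CAS failed: another thread made progress; retry.
        }
    }
}
```

This loop is lock-free but not wait-free: a thread’s CAS can fail only because another thread’s CAS succeeded in the same window, so someone always makes progress, while a particular thread may in principle retry forever.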

Continue here: 22 Lock-Free Stack, Lock-Free List-Based Set, Lock-Free Queue