Lecture from: 22.04.2024 | Video: Video ETHZ
We’ve established that concurrent programming is essential for harnessing the power of modern hardware, which boasts incredible transistor densities and core counts, enabling massive computations like those powering AI and large-scale simulations on supercomputers like Summit.
Context: Large Scale Computing
The scale continues to grow, with systems like Cerebras WSE-3 pushing towards a million cores on a single piece of silicon. Managing concurrency correctly and efficiently at this scale is paramount.
Learning Goals for Today
Building on Last Time:
- We saw how bad interleavings and, more fundamentally, data races (caused by memory reordering) can break concurrent programs.
- We introduced memory models as the contract defining allowable behavior and explored formal proofs.
- We examined implementations of 2-process (Peterson) and N-process locks (Filter, Bakery) using only atomic or safe/regular registers, highlighting their theoretical possibility but practical limitations (efficiency, space complexity).
Today’s New Topics:
- Locks via Read-Modify-Write (RMW): Implementing locks using atomic hardware operations.
- Higher-Level Concurrency: Discussing Deadlocks, Semaphores, and Barriers (though Semaphores/Barriers might carry over).
Learning Goals:
- Understand the pitfalls and implementation details of synchronization algorithms built on hardware primitives.
- Recognize the importance of these concepts for designing correct parallel code, even if you primarily use library functions.
Recap: Implementations using Atomic/Safe Registers
Let’s briefly recall the lock implementations from last time that relied solely on atomic/regular memory reads and writes.
Peterson Lock (Recap)
Peterson’s lock provided an elegant mutual exclusion solution for two processes using `flag` variables (intent) and a `victim` variable (tie-breaker). It guarantees mutual exclusion and freedom from starvation but is limited to two threads.
Atomic Registers (Recap)
We defined atomic registers where read/write operations appear to take effect at a single point in time (τ), ensuring reads observe the value from the immediately preceding write (providing a total order). This formalism helped us prove the correctness of algorithms like Peterson’s, assuming atomic registers for shared variables.
Filter Lock (Recap)
The Filter Lock extended Peterson’s idea to N processes using N-1 levels of “filters”. Threads progress through the levels, and at each level i a Peterson-like victim mechanism allows at most N-i threads to pass. It ensures mutual exclusion, deadlock-freedom, and starvation-freedom, but it is not FCFS and can be inefficient because each thread reads N-1 variables per level in its wait loop.
Bakery Algorithm (Recap)
Lamport’s Bakery Algorithm provides N-process mutual exclusion using only single-writer/multiple-reader (SWMR) regular registers. It uses a “take-a-ticket” system, ensuring FCFS fairness and thus starvation-freedom.
Its Java implementation requires `AtomicIntegerArray` for correctness under the JMM. While correct and fair, it also suffers from inefficiency, as threads must read all other threads’ flags and labels in the waiting loop.
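As a reference point, the core of the Bakery algorithm on top of `AtomicIntegerArray` can be sketched as follows (class and method names are illustrative, not the course’s reference code):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class BakeryLock {
    final int n;
    final AtomicIntegerArray flag;   // 1 = thread is interested
    final AtomicIntegerArray label;  // "ticket" numbers

    public BakeryLock(int n) {
        this.n = n;
        flag = new AtomicIntegerArray(n);
        label = new AtomicIntegerArray(n);
    }

    public void lock(int i) {
        flag.set(i, 1);
        // Take a ticket larger than every ticket seen so far.
        int max = 0;
        for (int k = 0; k < n; k++) max = Math.max(max, label.get(k));
        label.set(i, max + 1);
        // Wait while any interested thread has a smaller (label, id) pair.
        for (int k = 0; k < n; k++) {
            while (k != i && flag.get(k) == 1 &&
                   (label.get(k) < label.get(i) ||
                    (label.get(k) == label.get(i) && k < i))) {
                // spin
            }
        }
    }

    public void unlock(int i) {
        flag.set(i, 0);
    }
}
```

Note how lock/unlock both take the thread’s index i — the O(N) space (two arrays) is exactly the limitation discussed next.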
Limitations of Register-Based Locks
These algorithms (Filter, Bakery) demonstrate it’s possible to achieve mutual exclusion with only atomic/regular registers. However, they have practical drawbacks:
- Space Complexity: They often require shared variables proportional to the number of threads (N). Theorem 5.1 by Burns shows that any deadlock-free mutual exclusion algorithm using only read/write atomic registers needs at least N variables. This doesn’t scale well if N is potentially huge (e.g., millions of threads/tasks).
- Performance: The waiting loops involve reading multiple shared variables, leading to high contention and poor cache performance.
- Reliance on Memory Model / `volatile`: The analysis assumed no memory reordering, but real implementations require careful use of `volatile` or memory barriers to prevent failures due to compiler/hardware reordering, adding complexity and overhead.
So, while we proved these software-only constructions work (under certain assumptions), they aren’t how high-performance locks are typically implemented. We need a better approach, leveraging direct hardware support. We need to extend our model to include specialized atomic instructions provided by modern architectures.
Hardware Support for Parallelism: Read-Modify-Write Operations
Modern processors provide special instructions that can atomically read a value from memory, modify it, and write the new value back, all as a single, indivisible hardware operation. These are crucial for efficient synchronization.
Example: x86 Compare-and-Swap (CMPXCHG)
`CMPXCHG mem, reg`:
- Atomically compares the value in memory location `mem` with the value in a designated register (e.g., EAX).
- If they are equal, it writes the value from the source register `reg` to the memory location `mem` and sets the Zero Flag (ZF).
- If they are not equal, it loads the value from `mem` into the designated register (EAX) and clears the ZF.

`LOCK` Prefix: Prepending the `LOCK` instruction prefix to `CMPXCHG` (and certain other instructions) ensures that the entire read-modify-write sequence is performed atomically with respect to other processors, typically by locking the memory bus or using cache locking mechanisms.
Example: ARM Load-Linked / Store-Conditional (LDREX/STREX)
ARM uses a different approach with a pair of instructions:
- `LDREX rd, [rn]` (Load Register Exclusive): Loads a value from memory location `[rn]` into register `rd`. Crucially, it also “marks” the physical address as having an exclusive access pending by this processor (using a local “exclusive monitor”).
- `STREX rd, rm, [rn]` (Store Register Exclusive): Attempts to store the value from register `rm` into memory location `[rn]`. The store succeeds only if the executing processor still has exclusive access marked for that address (i.e., no other processor has written to it since the `LDREX`). The store’s status (success = 0, failure = 1) is written to register `rd`.
This pair allows building atomic RMW sequences: load exclusively, perform modifications in registers, and then attempt to store conditionally. If the store fails, it means another thread interfered, and the operation needs to be retried.
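Java does not expose LL/SC directly, but the same load/modify/conditional-store retry pattern shows up when building operations out of CAS. A sketch (the `AtomicMax` class and `updateMax` helper are illustrative): atomically raising a shared maximum.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicMax {
    // Retry loop in the LL/SC spirit: read the current value ("load"),
    // compute the update privately, then attempt to publish it with CAS
    // ("conditional store"). If CAS fails, another thread interfered
    // between the read and the write, so we retry.
    public static void updateMax(AtomicInteger max, int candidate) {
        while (true) {
            int current = max.get();
            if (candidate <= current) return;               // nothing to do
            if (max.compareAndSet(current, candidate)) return; // published
            // CAS failed: someone else wrote in between -- loop and retry
        }
    }

    public static void main(String[] args) {
        AtomicInteger max = new AtomicInteger(0);
        updateMax(max, 5);
        updateMax(max, 3);
        System.out.println(max.get()); // prints 5
    }
}
```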
Typical Atomic Hardware Instructions
Common atomic RMW primitives found across different architectures include:
- Test-And-Set (TAS): Atomically sets a memory location to 1 (or true) and returns its previous value. (e.g., the Motorola 68000 `TAS` instruction).
- Compare-And-Swap (CAS): Atomically compares the value in memory with an expected `old` value. If they match, it replaces the memory value with a `new` value and returns success (or the old value). If they don’t match, it does nothing and returns failure (or the current value). (e.g., x86 `LOCK CMPXCHG`, SPARC `CASA`).
- Load-Linked / Store-Conditional (LL/SC): Paired instructions as seen in ARM, MIPS, POWER, RISC-V. The load marks exclusive access; the conditional store succeeds only if exclusivity is maintained.
- Fetch-And-Add: Atomically reads a value, adds a specified amount to it, writes the result back, and returns the original value.
Cost: These atomic instructions are typically much slower than simple load or store operations because they often require complex cache coherence interactions or bus locking to ensure atomicity across multiple cores.
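Fetch-and-add is directly available in Java as `AtomicInteger.getAndAdd` (the demo class name is illustrative); it is the primitive behind “take-a-ticket” schemes:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FetchAndAddDemo {
    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(0);
        // getAndAdd atomically returns the old value and adds the delta --
        // exactly the fetch-and-add primitive described above.
        int first  = counter.getAndAdd(1); // returns 0, counter becomes 1
        int second = counter.getAndAdd(1); // returns 1, counter becomes 2
        System.out.println(first + " " + second + " " + counter.get()); // prints 0 1 2
    }
}
```

Even when many threads call `getAndAdd` concurrently, each one receives a distinct “ticket” and no increments are lost.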
Semantics of TAS and CAS
Let’s formalize the basic semantics:
`boolean TAS(memref s)`:

```
// ---- Start Atomic ----
if (mem[s] == 0) {
    mem[s] = 1;
    return true;  // Indicate success (acquired lock)
} else {
    return false; // Indicate failure (lock already held)
}
// ---- End Atomic ----
```

(Note: Some definitions return the old value, so `TAS` returning 0 means success.)

`int CAS(memref a, int old, int new)`:

```
// ---- Start Atomic ----
int oldval = mem[a];
if (old == oldval) {
    mem[a] = new;
}
return oldval; // Return the value originally read
// ---- End Atomic ----
```

(Check the return value: if `return == old`, the swap succeeded.)
Importance:
- These are fundamental Read-Modify-Write (RMW) operations.
- They enable building mutexes (locks) with O(1) space complexity (just need one shared variable for the lock state), unlike the O(N) requirement for register-only algorithms like Filter/Bakery.
- They are essential primitives for implementing lock-free programming techniques (a more advanced topic).
Implementation of a Spinlock using Atomic Operations
A spinlock is a lock where a thread attempting to acquire it repeatedly checks the lock status in a loop (it “spins”) until the lock becomes available. Here’s how to implement one using TAS and CAS:
Using Test-and-Set (TAS):
- State: `int lock = 0;` (0 = free, 1 = held)
- `Init(lock)`: `lock = 0;`
- `Acquire(lock)`: `while (!TAS(lock));` — spin while TAS returns false (lock was already 1)
- `Release(lock)`: `lock = 0;` — a simple write is enough if Release is only called by the holder
Using Compare-and-Swap (CAS):
- State: `int lock = 0;` (0 = free, 1 = held)
- `Init(lock)`: `lock = 0;`
- `Acquire(lock)`: `while (CAS(lock, 0, 1) != 0);` — spin while CAS fails to change 0 to 1 (returns the non-zero old value)
- `Release(lock)`: `CAS(lock, 1, 0);` — atomically set back to 0 (ignore result)
Read-Modify-Write in Java (`java.util.concurrent.atomic`)
Java provides high-level support for atomic operations since JDK 5 in the `java.util.concurrent.atomic` package.
- Classes: `AtomicBoolean`, `AtomicInteger`, `AtomicLong`, `AtomicReference`, etc.
- Key Operations (e.g., `AtomicBoolean`):
  - `boolean get()`: Atomic read.
  - `void set(boolean newValue)`: Atomic write.
  - `boolean compareAndSet(boolean expect, boolean update)`: Atomic CAS. Returns `true` if successful (the current value was `expect` and has been set to `update`).
  - `boolean getAndSet(boolean newValue)`: Atomic TAS-like operation. Sets the value to `newValue` and returns the previous value.
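The semantics of these operations can be checked with a few lines (the demo class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class AtomicBooleanDemo {
    public static void main(String[] args) {
        AtomicBoolean state = new AtomicBoolean(false);

        // compareAndSet succeeds only if the current value equals `expect`.
        boolean ok1 = state.compareAndSet(false, true); // true: was false, now true
        boolean ok2 = state.compareAndSet(false, true); // false: value is already true

        // getAndSet unconditionally writes and returns the previous value.
        boolean prev = state.getAndSet(false);          // returns true, value now false

        System.out.println(ok1 + " " + ok2 + " " + prev); // prints true false true
    }
}
```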
How does this work? (Experts):
- JVM bytecode itself doesn’t directly expose RMW instructions like CAS; standard synchronization relies on `monitorenter`/`monitorexit`.
- However, the JDK implementation uses a special, internal (and officially undocumented/unsupported) class `sun.misc.Unsafe` as an escape hatch.
- `sun.misc.Unsafe` provides methods that map directly (or as closely as possible) to the underlying hardware atomic operations (like CAS).
- This mapping is not strictly guaranteed, and the performance/lock-free nature depends on the specific JVM, OS, and hardware.

Example: `AtomicInteger.compareAndSet` internally often calls `Unsafe.compareAndSwapInt`.
TASLock in Java
Using `AtomicBoolean`:
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Lock;
public class TASLock implements Lock {
AtomicBoolean state = new AtomicBoolean(false); // false = free, true = held
public void lock() {
// Spin: Keep trying getAndSet(true).
// getAndSet returns the *previous* value.
// Loop continues as long as the previous value was true (lock held).
// Loop exits when previous value was false (we successfully acquired).
while (state.getAndSet(true)) {
// Body of loop is empty - just spin
}
}
public void unlock() {
// Release the lock
state.set(false);
}
// Note: tryLock(), lockInterruptibly() etc. not implemented here
}
Spinlock Behavior: The `while` loop in `lock()` continuously executes the atomic `getAndSet` operation, consuming CPU cycles while waiting.
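To see the lock in action, a small harness (restating the lock compactly; the harness names are illustrative) shows that no increments are lost when two threads contend:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TASLockDemo {
    // Compact restatement of the TASLock above (lock/unlock only).
    static class TASLock {
        final AtomicBoolean state = new AtomicBoolean(false);
        void lock()   { while (state.getAndSet(true)) {} }
        void unlock() { state.set(false); }
    }

    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        TASLock lock = new TASLock();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                lock.lock();
                try { counter++; } finally { lock.unlock(); }
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter); // prints 200000 -- no lost updates
    }
}
```

Without the lock, the unsynchronized `counter++` would lose updates; the `AtomicBoolean` also provides the happens-before edge that makes the counter’s value visible across threads.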
Simple TAS Spin Lock - Performance
Measurements show that the runtime of this simple TAS spinlock scales poorly as the number of threads (n) increases: the normalized time per thread grows significantly.
Why the Poor Performance?
- Sequential Bottleneck / Contention: The atomic `getAndSet` operation requires exclusive access to the `state` variable in memory (or cache). All spinning threads constantly fight for access to this single location via the memory bus.
- Cache Coherence Traffic: Every `getAndSet` attempt by one thread potentially modifies the cache line containing `state`. The cache coherence protocol must then invalidate the copies of this cache line held by all other spinning processors, causing them to re-fetch it over the bus on their next attempt. This generates massive amounts of bus traffic and contention.
Test-and-Test-and-Set (TATAS) Lock
To reduce the contention caused by TAS, we can use the Test-and-Test-and-Set (TATAS) optimization.
public void lock() {
do {
// 1. Test (Spin Read): First, spin locally reading the state.
// Only proceeds if the lock *appears* free (state == false).
// This read usually hits the local cache and avoids bus traffic.
while (state.get()) {}
// 2. Test-and-Set (Atomic Attempt): Once lock appears free,
// *then* try the atomic operation to acquire it.
// getAndSet returns previous value. Loop if it was already true
// (meaning someone else got the lock between our get() and getAndSet()).
} while (state.getAndSet(true));
// Alternative using compareAndSet:
// } while (!state.compareAndSet(false, true));
}
public void unlock() {
state.set(false);
}
Improvement: Threads spend most of their spinning time executing the local `state.get()`, which typically hits their cache. Only when `get()` returns `false` do they attempt the expensive atomic RMW operation (`getAndSet` or `compareAndSet`) over the bus. This significantly reduces bus traffic and contention compared to the simple TAS lock.
Measurement: The graph shows TATAS performs much better than TAS, although runtime still increases with the number of threads due to contention during the actual atomic attempt phase. (Note: Performance varies significantly across machines/JVMs; this is qualitative.)
TATAS Limitation: While better, TATAS doesn’t solve all problems. For example, the infamous “Double-Checked Locking” pattern for lazy initialization often fails due to memory reordering issues, even if atomic primitives seem involved. Relying on low-level spinning reads without proper memory barriers can be unsafe.
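For reference, the standard repair of the double-checked locking pattern under the JMM is to declare the shared reference `volatile`; a sketch (the class name and `payload` field are illustrative):

```java
public class LazySingleton {
    // Without `volatile`, double-checked locking is broken: the JMM allows
    // the store of the reference to be reordered before the constructor's
    // field stores, so another thread can observe a non-null reference to a
    // not-yet-initialized object. `volatile` forbids that reordering.
    private static volatile LazySingleton instance;

    private final int payload;
    private LazySingleton() { payload = 42; }

    public static LazySingleton getInstance() {
        LazySingleton local = instance;      // first (unsynchronized) check
        if (local == null) {
            synchronized (LazySingleton.class) {
                local = instance;            // second check, under the lock
                if (local == null) {
                    instance = local = new LazySingleton();
                }
            }
        }
        return local;
    }
}
```

The pattern only pays off because the common path (instance already created) is a single volatile read with no locking.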
TATAS with Backoff
Observation: Even with TATAS, if many threads detect the lock is free at around the same time, they will all execute the atomic `compareAndSet` or `getAndSet`, causing a burst of contention. This slows down overall progress.
Solution: Backoff: When a thread fails to acquire the lock (i.e., the atomic operation fails because someone else got it first), it shouldn’t immediately retry. Instead, it should back off – wait for a short duration before trying again.
- Random Duration: Wait for a random amount of time to reduce the probability that all waiting threads retry simultaneously.
- Exponential Backoff: Increase the expected waiting time exponentially each time the thread fails to acquire the lock (up to a maximum). This adapts to the level of contention.
Lock with Backoff Code Structure:
public void lock() {
Backoff backoff = null; // Lazy initialization of backoff helper
while (true) {
// Spin-read phase (TATAS)
while (state.get()) {};
// Attempt atomic acquire
if (!state.getAndSet(true)) {
return; // Success! Acquired the lock
} else {
// Failure - another thread got the lock
try {
if (backoff == null) { // Create backoff object on first failure
backoff = new Backoff(MIN_DELAY, MAX_DELAY);
}
backoff.backoff(); // Wait for a random, increasing duration
} catch (InterruptedException ex) {}
}
}
}
Exponential Backoff Implementation:
class Backoff {
final int minDelay, maxDelay;
int limit;
final Random random;
Backoff(int min, int max) {
minDelay = min;
maxDelay = max;
limit = minDelay;
random = new Random();
}
public void backoff() throws InterruptedException {
int delay = random.nextInt(limit); // Random delay up to current limit
if (limit < maxDelay) { // Double limit for next time, up to max
limit = 2 * limit;
}
Thread.sleep(delay); // Pause the thread
}
}
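The doubling behavior can be traced with a tiny driver (restating the Backoff class above inside an illustrative demo class so the snippet is self-contained):

```java
import java.util.Random;

public class BackoffDemo {
    // Compact restatement of the Backoff class above.
    static class Backoff {
        final int minDelay, maxDelay;
        int limit;
        final Random random = new Random();
        Backoff(int min, int max) { minDelay = min; maxDelay = max; limit = minDelay; }
        public void backoff() throws InterruptedException {
            int delay = random.nextInt(limit); // random delay up to current limit
            if (limit < maxDelay) limit = 2 * limit; // double for next time
            Thread.sleep(delay);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Backoff b = new Backoff(1, 8);
        System.out.println(b.limit); // 1
        b.backoff();
        System.out.println(b.limit); // 2
        b.backoff();
        System.out.println(b.limit); // 4
        b.backoff();
        System.out.println(b.limit); // 8
        b.backoff();
        System.out.println(b.limit); // 8 (capped at maxDelay)
    }
}
```

Only the upper bound grows deterministically; the actual sleep each round is a random draw below it, which is what de-synchronizes the retrying threads.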
Measurement: Backoff locks typically show much better scalability than simple spinlocks (TAS or TATAS) under high contention, as threads yield the CPU (`Thread.sleep`) instead of spinning uselessly and fighting over the bus. The runtime per thread remains relatively flat.
Deadlock
We’ve focused on ensuring mutual exclusion (only one thread in the CS). But using locks introduces a new potential problem: Deadlock.
Deadlock Motivation: Bank Transfer
Consider adding a `transferTo` method to our `BankAccount` class, using `synchronized` methods (which acquire the intrinsic lock of the `BankAccount` instance).
class BankAccount {
// ... balance, etc. ...
synchronized void withdraw(int amount) {/*...*/}
synchronized void deposit(int amount) {/*...*/}
// Method to transfer money FROM this account TO account 'a'
synchronized void transferTo(int amount, BankAccount a) {
this.withdraw(amount); // Acquires lock on 'this'
a.deposit(amount); // Attempts to acquire lock on 'a'
}
}
Problem: What happens if two threads try to transfer money in opposite directions simultaneously?
- Thread 1: Calls `x.transferTo(1, y)`
- Thread 2: Calls `y.transferTo(1, x)`
Scenario:
- Thread 1: Enters `x.transferTo`, acquires the lock on `x`.
- Thread 1: Executes `this.withdraw(amount)` successfully (using the lock on `x`).
- Thread 2: Enters `y.transferTo`, acquires the lock on `y`.
- Thread 2: Executes `this.withdraw(amount)` successfully (using the lock on `y`).
- Thread 1: Attempts `a.deposit(amount)` (where `a` is `y`). It tries to acquire the lock on `y`, but Thread 2 holds it. Thread 1 blocks.
- Thread 2: Attempts `a.deposit(amount)` (where `a` is `x`). It tries to acquire the lock on `x`, but Thread 1 holds it. Thread 2 blocks.
DEADLOCK! Both threads are blocked indefinitely, each waiting for a lock held by the other. This forms a circular wait.
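One standard remedy, previewing the next section, is to break the circular wait by acquiring locks in a globally consistent order. A sketch (the class name and the `id` field used for ordering are illustrative additions, not part of the lecture’s `BankAccount`):

```java
public class OrderedAccount {
    private static int nextId = 0;
    private final int id;     // hypothetical unique id, used only for lock ordering
    private int balance;

    public OrderedAccount(int initial) {
        synchronized (OrderedAccount.class) { id = nextId++; }
        balance = initial;
    }

    public synchronized int balance() { return balance; }

    // Both locks are always requested smaller-id first, so two opposite
    // transfers ask for the locks in the SAME order and the circular wait
    // from the scenario above cannot form.
    public void transferTo(int amount, OrderedAccount to) {
        OrderedAccount first  = this.id < to.id ? this : to;
        OrderedAccount second = this.id < to.id ? to : this;
        synchronized (first) {
            synchronized (second) {
                this.balance -= amount;
                to.balance   += amount;
            }
        }
    }
}
```

With this ordering, `x.transferTo(1, y)` and `y.transferTo(1, x)` running concurrently simply serialize instead of deadlocking.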
Deadlock Definition
Deadlock: A situation where two or more processes (or threads) are mutually blocked, each waiting for a resource held by another process in the cycle. No process can proceed. (Illustrated by the traffic gridlock and the 4-way intersection).
Continue here: 18 Deadlocks, Semaphores, Barriers