Lecture from 27.05.2025 | Video: Video ETHZ
Today, we delve into two important advanced topics in concurrent programming. First, we revisit Consensus, a fundamental theoretical problem that helps us understand the relative power of different atomic operations – why primitives like Compare-and-Swap (CAS) are essential for building sophisticated lock-free data structures. Second, we introduce Transactional Memory (TM), an alternative concurrency control paradigm that aims to simplify parallel programming by shifting the burden of synchronization from the programmer to the system, offering an abstraction similar to database transactions. The backdrop for these topics is the ever-increasing demand for computational power, particularly driven by fields like deep learning, necessitating both powerful hardware and effective parallel programming models.
Last Lecture Recap
In our last lecture, we established a formal foundation for reasoning about concurrency:
- Theory Introduction: We motivated the need for formalism beyond intuitive arguments.
- Histories: Defined histories as sequences of invocation and response events, along with concepts like projections, completeness, equivalence, and legality, providing a language to precisely describe concurrent executions.
- Linearizability: Introduced this strong correctness condition, requiring that every concurrent execution appears equivalent to a legal sequential execution that respects the real-time order of non-overlapping operations. We emphasized its composability, a key feature for modular software design.
- Sequential Consistency: Contrasted linearizability with this weaker model, which only requires preserving per-thread program order in the equivalent sequential execution and is not composable.
Learning Goals for Today
Building on our theoretical foundation and practical knowledge of locks and lock-freedom, today we aim to:
- Understand Consensus (and the Consensus Hierarchy):
- Formally define the wait-free Consensus problem.
- Introduce the concept of the Consensus Number to classify the power of atomic primitives (Atomic Registers, CAS, Queues).
- Understand why CAS is fundamentally more powerful than simpler primitives like atomic registers or Test-and-Set.
- Explore Transactional Memory (TM):
- Motivation: Summarize the weaknesses of locking and the difficulties of lock-free programming that motivate TM.
- Semantics: Define the core concepts of Atomicity and Isolation in TM.
- Discuss implementation aspects like conflict detection and concurrency control.
- Examine practical considerations like nesting, interaction with non-transactional code, and different TM approaches (Hardware, Software, Hybrid).
Consensus
We’ve used various atomic operations (registers, TAS, CAS) and built concurrent objects (locks, queues, stacks). A fundamental question arises: are all atomic operations created equal? Can we implement any concurrent object using, say, only atomic registers? The theory of Consensus helps answer this.
(Wait-free) Consensus Problem Definition
Consider a shared object `c` implementing the `Consensus<T>` interface with a single method: `T decide(T value)`.
- Scenario: `N` threads participate. Each thread `i` proposes its own input value `v_i` by calling `c.decide(v_i)`.
- Goal: The threads must agree on a single, common decision value that was one of the proposed inputs.
A Consensus Protocol (the implementation of the `decide` method) must satisfy the following requirements:
- Wait-free: Every call to `decide` must return a value in a finite number of steps, regardless of the execution speed or potential pausing of other threads. (This is a strong progress guarantee, stronger than lock-freedom.)
- Consistent: All threads that receive a return value must decide on the exact same value.
- Valid: The common decision value must be one of the values initially proposed by some participating thread.
Furthermore, the consensus object itself must be linearizable. This implies there is a conceptual point in time where the "decision" is made. The thread whose `decide` call linearizes first effectively determines the outcome for all threads.
This timeline shows threads A, B, and C calling `decide` with inputs x, y, and z, respectively. Thread A's call linearizes first (indicated by the red square). As a result, all threads (A, B, and C) consistently decide on value `x` (A's input), satisfying the requirements.
Consensus Number
To classify the power of different synchronization primitives (or object types), we use the concept of the Consensus Number.
- Definition: A class of objects `C` (e.g., atomic registers, CAS objects, FIFO queues) can solve n-thread consensus if it is possible to implement a wait-free consensus protocol for `n` threads using any number of objects of class `C` and any number of atomic read/write registers (atomic registers are considered the baseline).
- Consensus Number of `C`: The largest integer `n` such that objects of class `C` can solve n-thread consensus wait-free. If a class can solve consensus for any `n`, its consensus number is infinite (∞).
The consensus number quantifies the maximum number of threads that can reach agreement wait-free using a given type of object as the primary synchronization primitive.
Consensus Power of Primitives
What are the consensus numbers of the atomic operations we’ve seen?
1. Atomic Registers:
- Theorem I: Atomic Read/Write Registers have consensus number 1.
- Proof: (Detailed proof in Herlihy, Chapter 5, or the linked video). The intuition is that with only reads and writes, if two threads propose different values, there’s no way for them to definitively agree on one winner in a wait-free manner. If thread A writes its value and thread B writes its value, a subsequent read might see either value, and there’s no atomic way to “claim victory” based just on reads and writes that guarantees agreement and wait-freedom for both threads.
- Corollary: It is impossible to implement a wait-free solution to n-thread consensus for n > 1 using only atomic read/write registers.
2. Compare-and-Swap (CAS):
- Theorem II: Compare-and-Swap (CAS) has infinite consensus number (∞).
- Proof (by construction): We can construct a wait-free consensus protocol for any number of threads `N` using a single `AtomicInteger` (which provides CAS) and atomic registers (an `AtomicIntegerArray` for storing the proposals).
- Explanation: Each thread stores its proposed value. Then, each thread attempts to atomically change the shared `winner` register from its initial state (`FIRST`) to its own thread ID (`i`) using `compareAndSet`. Because CAS is atomic, only one thread will succeed in this operation – the one whose CAS executes first. All subsequent CAS attempts by other threads fail because the `winner` register no longer holds the `FIRST` value. Every thread then reads the final value of the `winner` register (which now holds the ID of the winning thread) and returns the value proposed by that winner.
- Wait-free? Yes. Each step involves a finite number of atomic reads, writes, or a single CAS. No unbounded loops or blocking waits.
- Consistent? Yes. All threads read the same final `winner.get()` value and return the corresponding proposed value.
- Valid? Yes. The decided value is `proposed.get(winner.get())`, which was proposed by the winning thread.
- Since this protocol works for any `N` using one CAS object and atomic registers, CAS has consensus number ∞.
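The protocol described above can be sketched in Java. The names (`winner`, `proposed`, `FIRST`) follow the description; the exact code on the lecture slides may differ in detail.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Wait-free N-thread consensus from a single CAS object plus atomic registers.
class CASConsensus {
    private static final int FIRST = -1;                 // "no winner yet"
    private final AtomicInteger winner = new AtomicInteger(FIRST);
    private final AtomicIntegerArray proposed;           // proposal registers

    CASConsensus(int n) { proposed = new AtomicIntegerArray(n); }

    // Called by thread i (0 <= i < n) with its proposed value.
    int decide(int i, int value) {
        proposed.set(i, value);            // announce the proposal
        winner.compareAndSet(FIRST, i);    // exactly one CAS can succeed
        return proposed.get(winner.get()); // everyone returns the winner's proposal
    }
}
```

Whichever thread's `compareAndSet` executes first fixes the outcome; every later CAS fails because `winner` no longer holds `FIRST`, so all threads converge on the winner's proposal in a constant number of steps.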
Using Consensus Numbers: Impossibility Results
Consensus numbers provide a powerful tool for proving what cannot be built wait-free using only certain primitives.
No Wait-Free Implementation of a FIFO Queue Using Atomic Registers
- Theorem III: There is no wait-free implementation of a FIFO queue using only atomic registers.
- Proof Strategy: Proof by contradiction. We will show that if such a queue did exist, we could use it (along with atomic registers) to solve 2-thread consensus. But since atomic registers have consensus number 1, this would lead to a contradiction.
Step 1: Show FIFO Queue can implement 2-thread consensus
- Setup: Two threads, 0 (Blue) and 1 (Red). Shared atomic array `proposed[2]`. Shared wait-free FIFO queue `q`, pre-filled with two distinct markers: a "red ball" (the winner marker) followed by a "black ball".
- Protocol Step 1: Thread `i` writes its proposed value `v_i` into `proposed[i]`. (Uses atomic registers.)
- Protocol Step 2: Thread `i` dequeues one item (a ball) from the shared FIFO queue `q`. Since the queue is FIFO and wait-free, each thread gets exactly one unique ball in finite time.
- Protocol Step 3 (Decision):
  - If thread `i` dequeues the "red ball" (the winner marker), it decides its own proposed value `v_i`.
  - If thread `i` dequeues the "black ball", it decides the other thread's value, which it reads from `proposed[other_id]`.
- Why it works:
  - Because the queue is FIFO, one thread is guaranteed to dequeue the red ball, and the other the black ball.
  - The thread that dequeues the red ball (the "winner" based on queue order) decides its own value.
  - The thread that dequeues the black ball (the "loser") reads the winner's value from the `proposed` array (which was written before dequeuing, ensuring validity).
  - All steps are wait-free (assuming the queue and registers are). Agreement is reached because the loser reads the winner's chosen value.
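The two-thread protocol can be sketched in Java. Note the hedge: `ConcurrentLinkedQueue` is only lock-free, not wait-free, so it merely stands in here for the hypothetical wait-free queue of the proof; the marker constants and class name are made up for illustration.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch: 2-thread consensus from a shared FIFO queue plus registers.
class QueueConsensus {
    private static final int RED = 0, BLACK = 1;  // winner / loser markers
    private final Queue<Integer> q = new ConcurrentLinkedQueue<>();
    private final AtomicIntegerArray proposed = new AtomicIntegerArray(2);

    QueueConsensus() { q.add(RED); q.add(BLACK); }  // red ball enqueued first

    int decide(int i, int value) {
        proposed.set(i, value);     // step 1: announce proposal
        int ball = q.remove();      // step 2: dequeue exactly one marker
        return ball == RED          // step 3: first in queue order wins
                ? value                   // winner decides its own value
                : proposed.get(1 - i);    // loser returns the winner's value
    }
}
```

Since each thread removes exactly one of the two markers, FIFO order guarantees a unique winner, and the loser's read of `proposed[1 - i]` happens after the winner wrote it.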
Step 2: The Contradiction
- Given: We designed a wait-free 2-thread consensus protocol using only a wait-free FIFO queue and atomic registers.
- Assume: A wait-free FIFO queue implementation exists using only atomic registers.
- Substitution: If assumption (2) is true, we can substitute the queue implementation into the protocol from (1), resulting in a wait-free 2-thread consensus protocol built entirely from atomic registers.
- Contradiction: This conclusion (that atomic registers can solve 2-thread consensus wait-free) contradicts Theorem I (atomic registers have consensus number 1).
- Conclusion: Therefore, the initial assumption (2) must be false. No wait-free FIFO queue can be implemented using only atomic registers.
Importance and the Consensus Hierarchy
Consensus numbers establish a hierarchy of synchronization power:
- Atomic Registers: Consensus Number 1.
- Test-And-Set, Fetch&Add, Swap, FIFO Queues, Stacks: Consensus Number 2. (We showed only that the queue's consensus number is at least 2; proofs for the others exist.)
- Compare-and-Swap (CAS), LL/SC: Consensus Number ∞.
Implication: Objects with consensus number `k` cannot be used to construct a wait-free implementation of an object with consensus number `m > k`.
- You cannot build wait-free queues or CAS using only atomic registers.
- You cannot build wait-free CAS using only queues or Fetch&Add.
This table visualizes the hierarchy. Primitives like CAS are “universal” in the sense that they can implement any other object wait-free (though the implementation might be complex).
Understanding this hierarchy is crucial. It tells us the fundamental limitations of certain tools. Trying to build, for example, a wait-free queue using only atomic reads/writes is provably impossible, akin to trying to square the circle with only a compass and straightedge. Knowing these impossibility results saves wasted effort.
Motivation for Transactional Memory (TM)
We’ve seen that traditional locking, while conceptually simpler than lock-free programming, suffers from significant practical problems. Lock-free programming, while avoiding those problems, introduces its own extreme complexities (subtle algorithms, ABA). Is there a middle ground? Can we get the benefits of fine-grained or lock-free concurrency without the immense programming burden?
Motivation: Programming with locks is difficult (deadlocks, performance issues, composability problems). Lock-free programming is even harder and more error-prone. Goal: Shift the burden of implementing complex synchronization logic from the programmer to the system (hardware, software runtime, or a combination).
Recap: What’s Wrong with Locking?
Deadlocks: Circular dependencies when acquiring multiple locks.
Convoying: Threads pile up waiting for a lock held by a descheduled thread.
Priority Inversion: High-priority thread blocked by low-priority thread holding a needed lock.
Lack of Composability: Combining independently developed thread-safe modules using locks is difficult and error-prone. Deadlocks can emerge from unforeseen interactions. Requires understanding internal locking details.
Pessimistic: Locks always enforce mutual exclusion, incurring overhead even when no actual conflict occurs.
Hard-wired: Locking logic is intertwined with application logic, making it hard to change or adapt synchronization strategies.
What’s Wrong with Lock-Free (using CAS)?
Lock-free algorithms, while avoiding lock issues, are notoriously difficult to design correctly.
- Complexity: Algorithms like the M&S queue are subtle and hard to prove correct.
- Atomicity Illusion: Operations often require multiple CAS steps (like enqueue updating `tail.next` then `tail`). Ensuring atomicity and handling intermediate states requires complex "helping" protocols.
- ABA Problem: Requires careful handling via tagging, hazard pointers, or reliance on GC.
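As a concrete illustration of the tagging approach to ABA, Java's `AtomicStampedReference` pairs a reference with an integer stamp that is bumped on every update, so a CAS holding a stale stamp fails even if the value has meanwhile returned to its old state. A minimal sketch (the class and method names here are made up for the demo):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

// Tagging against ABA: the stamp changes on every update, so a CAS
// that observed ("A", stamp 0) fails after an intervening A -> B -> A.
class AbaDemo {
    static boolean staleCasSucceeds() {
        AtomicStampedReference<String> top = new AtomicStampedReference<>("A", 0);
        int[] stampHolder = new int[1];
        String seen = top.get(stampHolder);   // read value and stamp together
        int stamp = stampHolder[0];
        // Another thread performs A -> B -> A, bumping the stamp twice:
        top.compareAndSet("A", "B", stamp, stamp + 1);
        top.compareAndSet("B", "A", stamp + 1, stamp + 2);
        // The value is "A" again, but the stale stamp makes this CAS fail:
        return top.compareAndSet(seen, "C", stamp, stamp + 1);
    }
}
```

A plain `AtomicReference` CAS would wrongly succeed in the same scenario, which is exactly the ABA hazard described above.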
Hypothetical Simplicity: Multi-Compare-And-Set (DCAS)
If we had a magical atomic `multiCompareAndSet` that could atomically perform the two CAS operations needed for enqueue (on the locations `{last.next, tail}`, with expected values `{next, last}` and new values `{node, node}`), the code would become much simpler and ensure consistency directly. This highlights the desire for higher-level atomic operations.
Motivating Example: Bank Account Transfer
Consider a simple bank account with synchronized `withdraw` and `deposit` methods.
A naive `transfer_unsafe` method simply calls `a.withdraw` then `b.deposit`. This is not atomic. Another thread could observe the state after the withdrawal but before the deposit, seeing an inconsistent total balance.
Trying to make it atomic by synchronizing on both accounts (`synchronized(a) { synchronized(b) {...} }`) introduces deadlock if another thread concurrently transfers from `b` to `a`, acquiring the locks in the opposite order.
The standard fix involves imposing a global lock order (e.g., based on account ID) to prevent cycles.
This requires careful, ad-hoc logic within the transfer function itself.
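A sketch of that fix in Java. Using a numeric `id` field as the global order key is an assumption for illustration; any total order on accounts works.

```java
// Deadlock-free transfer via a global lock order: always lock the
// account with the smaller id first, regardless of transfer direction.
class Account {
    final int id;              // assumed unique; defines the global lock order
    private int balance;

    Account(int id, int balance) { this.id = id; this.balance = balance; }

    void withdraw(int amount) { balance -= amount; }
    void deposit(int amount)  { balance += amount; }
    synchronized int balance() { return balance; }

    static void transfer(Account from, Account to, int amount) {
        Account first  = from.id < to.id ? from : to;  // lower id first
        Account second = first == from ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.withdraw(amount);
                to.deposit(amount);
            }
        }
    }
}
```

Concurrent `transfer(a, b, …)` and `transfer(b, a, …)` now both acquire the lock on the lower-id account first, so the circular wait needed for deadlock cannot form.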
This illustrates the lack of composability with locks – combining two individually thread-safe operations (`withdraw`, `deposit`) requires complex, non-local reasoning and specific ordering protocols to ensure correctness and avoid deadlock in the composite operation (`transfer`).
Solution: Atomic Blocks (Transactional Memory)
What if the programmer could simply declare the intended atomic unit?
```
atomic { // conceptual syntax
    a.withdraw(amount);
    b.deposit(amount);
}
```
Idea: The programmer specifies what should execute atomically (the `withdraw` and `deposit` together). The how is left to the underlying system. This is the core idea behind Transactional Memory (TM).
Difference from Locks: While locks also aim for atomicity, they do so via mutual exclusion. TM aims to achieve atomicity often without mandatory locking, typically using optimistic execution. The difference lies in the execution strategy.
Transactional Memory (TM):
Programmer: Defines `atomic` code sections (transactions). Focuses on what operations belong together atomically.
System (Hardware/Software/Hybrid): Responsible for how to execute these transactions atomically and in isolation, usually via:
- Optimistic execution (run transaction speculatively).
- Conflict detection (monitor reads/writes).
- Validation/Commit or Abort/Rollback.
Potential Benefits of TM:
- Simplicity: Programmer focuses on atomicity, not low-level locking protocols. Less error-prone.
- Declarative: Higher-level abstraction.
- Composable: Atomic blocks can often be nested or combined more naturally and safely than lock-based code (though nesting semantics vary).
- Optimistic: Avoids the pessimism of locks. If conflicts are rare, performance can be better as mutual exclusion isn’t strictly required by default.
TM Semantics
Atomicity
Changes made by a transaction become visible to other threads all at once (on commit) or not at all (on abort). No intermediate states are visible externally. Unlike locks, TM achieves this without necessarily requiring mutual exclusion during execution.
Isolation
Transactions execute as if they are running alone. A running transaction does not observe the partial effects of other concurrently running, uncommitted transactions. Conceptually, it’s often implemented as if the transaction takes a consistent snapshot of memory when it begins and operates on that snapshot (or ensures its reads remain consistent throughout).
Serializability
Serializability: The concurrent execution of multiple transactions must be equivalent to some sequential execution of those same transactions. The system ensures that even though transactions run in parallel, the final outcome is consistent with a serial ordering.
Database Analogy: TM is heavily inspired by database ACID properties (Atomicity, Consistency, Isolation, Durability). TM typically focuses on A and I in main memory. Consistency (application-level invariants) is the programmer’s responsibility but aided by A and I. Durability (persistence to disk) is usually not part of TM.
How is TM Implemented?
- Naive Approach (Big Lock): A single global lock around all atomic sections. Ensures atomicity and isolation but offers no scalability (worse than fine-grained locking). Not practical.
- Realistic Approach: The system needs to track the memory locations read (read-set) and written (write-set) by each active transaction. This information is used for concurrency control.
Concurrency Control and Conflicts
Conflict: Occurs when concurrent transactions access the same memory location, and at least one access is a write.
Example:
- Initially `a = 0`.
- Transaction TX_A starts and reads `a` (gets 0).
- Transaction TX_B starts, writes `a = 10` (often buffered locally), and commits. The change `a = 10` becomes globally visible.
- TX_A now holds a value (0) that is inconsistent with the committed state of the system (`a` is now 10).
Problem: Can TX_A commit now? If it commits based on having read `a == 0`, the final state might be inconsistent. If TX_A were placed after TX_B in the serial order, it should have read `a == 10`.
Serialization: If TX_B commits first, the only valid serial order is TX_B then TX_A. In this order, TX_A must read `a == 10`. Any execution trace where TX_A read `a == 0` is invalid in this serialization.
Concurrency Control (CC) Mechanism: The TM system must detect this conflict (e.g., TX_A read from a location subsequently modified and committed by TX_B before TX_A committed).
Abort and Retry: The CC mechanism handles the conflict, typically by aborting the offending transaction (TX_A in this case). The aborted transaction’s effects are discarded (rolled back), and it can then be automatically retried by the TM system or signal an error to the user.
TM Execution Examples
Two transactions: TX_A transfers 10 from a to b. TX_B transfers 5 from b to a. Initially a=100, b=100.
TX_A starts. Reads `a=100`. Local speculative state: `a=90`.
TX_B starts concurrently. Reads `b=100`. Local speculative state: `b=95`. Reads `a=100`. Local speculative state: `a=105`.
TX_B reaches its commit point. Assume it commits successfully. The global state becomes `a=105`, `b=95`.
TX_A continues. It tries to execute `Deposit(b, 10)` and reads `b`. What value does it read? The TM system must ensure TX_A either sees the initial state (`b=100`, requiring a potential abort since TX_B already committed) or the consistent state after TX_B (`b=95`). It cannot proceed based on its initial read `a=100` if TX_B has already committed, changing `a` to 105. The TM's concurrency control (using read/write sets) detects this conflict (TX_A read `a=100`; TX_B wrote `a=105` and committed). TX_A will likely be aborted and retried.
Consistency Guarantee
Consider TX_A: `a+=10; b+=10; c=1/(a-b);` and TX_B: `a-=10; b+=10; c=1/(a-b);`. Initially `a=10, b=0`.
TX_A starts and calculates local `a=20`.
TX_B runs, calculates local `a=0`, `b=10`, and commits. Global state: `a=0, b=10`.
TX_A continues. If it were allowed to proceed using its local `a=20` but read the committed `b=10`, it would calculate `c = 1 / (20 - 10)`. This is inconsistent. If it read the committed `a=0` and `b=10`, it would calculate `c = 1 / (0 - 10)`. If it somehow saw its local `a=20` and the initial `b=0`, it would calculate `c = 1 / (20 - 0)`. A "zombie" transaction operating on inconsistent data can lead to catastrophic failures (like division by zero if it read the committed `b=10` but somehow kept its initial `a=10`, making `a-b=0`).
Transactional Memory Guarantee
The TM system guarantees that a running transaction always sees consistent data. It achieves this through its concurrency control mechanism, ensuring isolation.
Possibilities (Conceptual):
- Snapshot Isolation: Transaction logically operates on a snapshot of memory taken at the beginning. Conflicts detected at commit time.
- Early Abort: Conflicts are detected eagerly (e.g., when a write conflicts with an active read). Transaction aborts immediately.
Where to Implement TM?
Hardware TM
Implemented directly in processor hardware (e.g., using cache coherence protocol extensions to track read/write sets).
- Pros: Can be very fast, low overhead for common cases.
- Cons: Limited by hardware resources (cache size, buffer sizes). Often cannot handle large transactions that exceed these bounds. Aborts can be expensive.
- Examples: Intel Haswell (TSX/RTM - largely deprecated/disabled due to security flaws), IBM Blue Gene/Q, Sun Rock (cancelled). The future of widespread HTM is unclear.
Software TM
Implemented purely in software, often as a library or language feature.
- Pros: Greater flexibility, not limited by hardware resources (can handle large transactions).
- Cons: Can have significant performance overhead (instrumentation, logging reads/writes). Achieving good performance is challenging.
- Examples: Libraries/languages like Haskell, Clojure, Scala-STM. Research systems like Deuce for Java.
Hybrid TM
Combines hardware and software (e.g., use HTM for small transactions, fall back to STM for large ones). Active research area.
Status
TM is still largely experimental or niche. Implementations are often immature. Intel’s widely available HTM (RTM in TSX) suffered from robustness/security issues and is often disabled. STM is used in some functional languages but hasn’t seen widespread adoption in mainstream imperative languages like Java or C++.
TM Design Choices
Isolation (Strong vs Weak)
Does the TM system protect transactional accesses from conflicting non-transactional accesses?
- Strong: Yes. Easier to port code, but harder/more expensive for the TM to implement (needs to monitor all memory accesses).
- Weak: No. Assumes all accesses to potentially shared data happen within transactions. Simpler TM, but requires programmer discipline (JVM uses this).
Nesting (Flattened vs Closed vs Open)
How are `atomic` blocks inside other `atomic` blocks handled? Important for composability.
Flattened Nesting
Inner transactions merge with the outer one. An inner abort aborts the whole thing. Inner commits only become visible if the outer one commits. Simple, but limits modularity.
Closed Nesting
Inner transactions can abort independently without aborting the outer one (allowing retry/alternative logic). Inner commits make changes visible only to the outer transaction until the outer one commits globally. More complex but more flexible, and useful for modular software engineering.
Other Nestings
Open Nesting: Allows inner transactions to commit independently and make changes visible globally before the outer transaction commits (breaks isolation, used carefully for specific patterns like I/O).
Scope
Scope: What memory is tracked?
- All Variables: Easier to port, but requires instrumenting/checking every memory operation – potentially high overhead.
- Reference-Based: Only accesses via special “transactional reference” types are tracked by the STM system. Explicitly marks mutable shared state. Everything else is assumed immutable or thread-local. Requires programmer discipline but reduces TM overhead.
Example: Scala-STM
While Java doesn’t have native STM, libraries like Scala-STM (with a Java interface) exist.
It follows the reference-based approach: shared mutable state must be wrapped in `Ref` objects.
Transactions are defined explicitly, often using closures or `Runnable` objects (especially in older Java versions without lambdas).
There are no compile-time guarantees that Refs are only accessed within transactions.
The `balance` is stored in a `Ref.View<Integer>`.
Ideally, we'd write simple `atomic { ... }` blocks. In practice (with the Java API, especially pre-Java 8), we wrap the code in `STM.atomic(new Runnable() { ... })`. Returning values requires using `Callable` instead of `Runnable`.
The `transfer` method becomes simply wrapping the `withdraw` and `deposit` calls inside `STM.atomic`. The STM system handles atomicity and isolation, avoiding the manual lock ordering needed previously.
Conditional Waiting (Retry)
How to handle conditions like insufficient funds? Instead of condition variables, STM offers `STM.retry()`. Calling `retry()` aborts the current transaction and puts the thread to sleep.
The STM system tracks the read-set of the aborted transaction. The thread automatically wakes up and retries the transaction only when another transaction commits a write to any of the memory locations (Refs) that were in the aborted transaction's read-set. In the example, the transfer retries when `a.balance` is potentially updated by another transaction.
Simplest STM Implementation Concepts (Conceptual)
- Threads: Have states (active, aborted, committed).
- Transactional Objects: Represent shared state; support read (`get`), write (`set`), and, crucially, versioning or copying.
- Concurrency Control: Needed to ensure isolation and atomicity.
Clock Based STM (One approach)
- A global version clock advances on every commit.
- Each transaction gets a `birthdate` (the clock value when it starts).
- Each transactional object has a `timestamp` (the clock value when it was last committed).
- Transactions maintain local read-sets and write-sets.
Read `obj`:
- Check the thread's local write-set first. If found, return the buffered value.
- Otherwise, check `obj.timestamp <= tx.birthdate`. If not, the object was modified after the transaction started → conflict → abort the transaction.
- If the timestamp is okay, read the object's committed value and add `obj` to the read-set.
Write `obj`:
- If `obj` is not in the write-set, create a private copy in the write-set (write buffering). Update the private copy.
Commit:
- Acquire the commit lock (the simplest CC).
- Validate the read-set: for every `obj` in the read-set, check `obj.timestamp <= tx.birthdate` again (to catch conflicts that occurred during execution). If validation fails, abort.
- If valid, increment the global clock, update the timestamps of all objects in the write-set to the new clock value, write the buffered values back to main memory, and release the commit lock.
This timeline illustrates the validation: if T reads Y, then X, then Z, it must ensure that during its lifetime (since its `birthdate`), none of the versions of X, Y, or Z it read has been overwritten by a later committing transaction (i.e., their `timestamp` must still be less than or equal to T's `birthdate` at commit time). If Z's timestamp becomes greater than T's birthdate before T commits, T must abort.
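The read/write/commit rules above can be condensed into a toy Java sketch. This is a minimal illustration under simplifying assumptions (single-word integer objects, one global commit lock, no automatic retry); the class names are made up.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

class AbortException extends RuntimeException {}

class TxObject {                    // one transactional memory word
    volatile long timestamp = 0;    // clock value of the last commit to it
    volatile int value;
    TxObject(int v) { value = v; }
}

class Transaction {
    static final AtomicLong clock = new AtomicLong(0);  // global version clock
    static final Object commitLock = new Object();      // simplest CC

    final long birthdate = clock.get();                 // clock when we started
    final Set<TxObject> readSet = new HashSet<>();
    final Map<TxObject, Integer> writeSet = new HashMap<>();

    int read(TxObject obj) {
        Integer buffered = writeSet.get(obj);
        if (buffered != null) return buffered;              // our own write wins
        if (obj.timestamp > birthdate) throw new AbortException(); // modified since start
        readSet.add(obj);
        return obj.value;
    }

    void write(TxObject obj, int v) { writeSet.put(obj, v); }   // write buffering

    void commit() {
        synchronized (commitLock) {
            for (TxObject obj : readSet)                    // re-validate read-set
                if (obj.timestamp > birthdate) throw new AbortException();
            long version = clock.incrementAndGet();         // advance global clock
            for (Map.Entry<TxObject, Integer> e : writeSet.entrySet()) {
                e.getKey().value = e.getValue();            // write back buffers
                e.getKey().timestamp = version;             // stamp new version
            }
        }
    }
}
```

A transaction that read a location later committed by another transaction fails validation and throws `AbortException`; a real runtime would roll back and automatically re-execute the transaction body.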
This provides a glimpse into how STM systems work internally, managing versions, detecting conflicts, and ensuring atomic commits or rollbacks. We’ll take a deeper look next time.
Continue here: 27 Transactional Memory, Distributed Memory, Message Passing, Go Concurrency Model, Message Passing Interface