In previous lectures, we explored concepts like locks, critical sections, and the challenges posed by bad interleavings. Today, we delve deeper into the complexities of shared memory concurrency. We’ll examine the fundamental issue of data races, which arise from memory reordering by compilers and hardware. Then, we’ll investigate how mutual exclusion, the core problem locks solve, can be implemented using only basic atomic memory operations (atomic registers), looking at classic algorithms like Dekker’s and Peterson’s.
Before we dive into the technical details, it’s worth remembering the broader context. The drive for parallel programming and understanding these low-level details is fueled by the immense computational demands of modern applications, particularly in fields like Artificial Intelligence (AI) and High-Performance Computing (HPC). As noted by AI pioneers like Geoffrey Hinton and Yann LeCun, advances in compute power have been a critical factor in the AI revolution. Companies like Google, Meta, Nvidia, OpenAI, and others are building massive supercomputers, often using specialized hardware like TPUs and GPUs, and scaling systems to thousands of nodes. Running large-scale parallel applications efficiently and correctly on such systems requires a deep understanding of the underlying principles of concurrency and parallelism.
Learning Goals for Today
Recap (So Far):
We know how to use locks to protect critical sections.
We understand key guidelines (like lock granularity) and trade-offs.
We’ve seen how bad interleavings (high-level race conditions) can occur even with locks.
Today’s Focus (Now):
Memory Models: The often counter-intuitive reality of how parallel programs behave due to hardware and compiler optimizations.
Data Races: Understanding why we must avoid low-level data races, which are often caused by memory reordering.
Mutex Implementation: Exploring how mutual exclusion (the primitive behind locks) can be implemented using only atomic registers, examining Dekker’s and Peterson’s algorithms.
Important Context: While you will likely use pre-built lock implementations provided by your programming language or libraries in practice, understanding how they could be built from simpler primitives provides crucial insights into the challenges of concurrency. As the saying goes, “Tell me and I forget, teach me and I may remember, involve me and I learn.”
Motivation: Can This Fail?
Consider the following simple Java class with two shared integer variables, x and y, initially 0.
class C {
    private int x = 0;
    private int y = 0;

    // Executed by Thread 1
    void f() {
        x = 1;    // A
        y = 1;    // B
    }

    // Executed by Thread 2
    void g() {
        int a = y;    // C
        int b = x;    // D
        assert(b >= a);
    }
}
Assume Thread 1 calls f() and Thread 2 calls g(). Can the assertion assert(b >= a) ever fail?
Let’s analyze the possible interleavings of the four operations (A, B, C, D). We assume for now that each operation executes indivisibly and in program order within each thread.
The assertion fails only if b < a, which means b must be 0 and a must be 1.
For a to be 1, operation C (a = y) must execute after operation B (y = 1). (C after B)
For b to be 0, operation D (b = x) must execute before operation A (x = 1). (D before A)
Can we find an interleaving where (C after B) AND (D before A)?
Let’s list the possibilities (the checkmarks indicate the assertion holds):
A B C D: a=1, b=1. OK. ✓
A C B D: a=0, b=1. OK. ✓
A C D B: a=0, b=1. OK. ✓
C A B D: a=0, b=1. OK. ✓
C A D B: a=0, b=1. OK. ✓
C D A B: a=0, b=0. OK. ✓
It seems, by exhaustion, that no interleaving can cause the assertion to fail if we assume sequential consistency (operations execute in program order and results are immediately visible to all threads).
Combinatorial Excursion
How many interleavings are there? If we have 2 threads, each executing k statements, the total number of statements is 2k. An interleaving is determined by choosing which k positions in the final merged sequence belong to thread 1. This is equivalent to choosing k items from 2k without replacement:
\binom{2k}{k} = O\!\left(\frac{4^k}{\sqrt{2k}}\right)
This number grows very rapidly, making proof by exhaustion impractical for larger examples.
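A few concrete values illustrate the growth (k = 2 corresponds to the six interleavings enumerated above):

\binom{4}{2} = 6, \qquad \binom{20}{10} = 184{,}756, \qquad \binom{40}{20} \approx 1.4 \times 10^{11}

Already for two threads of 20 statements each there are over a hundred billion interleavings, so exhaustive case analysis is hopeless.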
Another Proof (By Contradiction)
Let’s try proving the assertion cannot fail using contradiction, still assuming sequential consistency:
Assume the assertion fails, i.e., b < a: this implies a == 1 and b == 0.
If a == 1: The read a = y (C) must have happened after the write y = 1 (B). So, B → C.
If b == 0: The read b = x (D) must have happened before the write x = 1 (A). So, D → A.
Program Order: Within thread 1, x = 1 happens before y = 1. So, A → B. Within thread 2, a = y happens before b = x. So, C → D.
Transitivity: Combining these: D → A (from 3), A → B (program order 1), B → C (from 2), C → D (program order 2). This gives us D → A → B → C → D.
Contradiction: This implies D → D, meaning an event must happen before itself, which is impossible.
Therefore, our initial assumption (b < a) must be false. The assertion cannot fail under the assumption of sequential consistency.
Let’s Try That on My Laptop…
What happens when we actually run this code? Surprisingly, the assertion can fail in practice! This happens because compiler optimizations only guarantee semantic correctness for a sequential (single-threaded) execution.
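To try this yourself, a stress-test harness along the following lines can be used (a sketch, not the lecture's benchmark code; the class and variable names are made up, and whether the failure actually shows up depends on the JIT compiler, the hardware, and luck):

// Hypothetical stress test: run f() and g() concurrently many times and
// count how often the "impossible" outcome b < a is observed.
class ReorderingTest {
    static int x, y;

    public static void main(String[] args) throws InterruptedException {
        int failures = 0;
        for (int run = 0; run < 100_000; run++) {
            x = 0;
            y = 0;
            int[] result = new int[2];                        // {a, b}
            Thread t1 = new Thread(() -> { x = 1; y = 1; });  // plays f()
            Thread t2 = new Thread(() -> {
                result[0] = y;                                // a
                result[1] = x;                                // b
            });                                               // plays g()
            t1.start(); t2.start();
            t1.join(); t2.join();
            if (result[1] < result[0]) failures++;            // b < a
        }
        System.out.println("Assertion violations observed: " + failures);
    }
}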
Why It Still Can Fail: Memory Reordering
Our proofs relied on the assumption that operations execute strictly in the order they appear in the code (program order) and that writes are instantly visible everywhere. This idealized model is called sequential consistency.
Reality Check: Modern compilers and hardware do not guarantee sequential consistency. They are allowed to reorder memory operations (reads and writes) as long as the semantics of a sequentially executed program are preserved.
Rule of thumb: the compiler and hardware are allowed to make any changes that do not affect the observable behavior of a single-threaded program.
Example reorderings valid in a sequential world:
// Original
void f() {
    x = 1;
    y = x + 1;    // Depends on x=1
    z = x + 1;    // Depends on x=1
}

// Reordered (y and z swapped) - OK sequentially
void f() {
    x = 1;
    z = x + 1;    // Depends on x=1
    y = x + 1;    // Depends on x=1
}

// Optimized (constants computed) - OK sequentially
void f() {
    x = 1;
    z = 2;
    y = 2;
}
These reorderings are fine for single-threaded execution. However, in a multithreaded context, they can lead to unexpected behavior because one thread might observe the effects of another thread’s operations in an order different from the program order.
Memory Reordering: Software View (Compilers)
Compilers perform many optimizations that can reorder or even eliminate memory accesses:
Dead Code Elimination: Removing code whose results are never used.
Register Hoisting: Moving memory reads out of loops into registers.
Locality Optimizations: Rearranging code to improve cache usage.
…and many more complex transformations.
These optimizations assume single-threaded semantics by default. Without specific instructions (like volatile or synchronized), the compiler doesn’t guarantee a specific global ordering of memory operations visible to other threads.
Example: Compiler Optimization Failure
Consider this attempt at a simple synchronization mechanism (rendezvous) using a shared variable x:
int x;

// Thread A
void wait() {
    x = 1;
    while (x == 1);   // Spin until x changes
}

// Thread B
void arrive() {
    x = 2;
}
Naively, we expect Thread A to set x=1, loop, and then Thread B to set x=2, allowing Thread A to exit the loop.
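What actually goes wrong: the loop body never writes x, so a compiler that assumes single-threaded semantics is free to hoist the read of x out of the loop. Conceptually, Thread A's code may be transformed into something like the following sketch (illustrative only; no particular compiler is guaranteed to emit exactly this):

// Possible effect of register hoisting on Thread A's code
void wait() {
    x = 1;
    int tmp = x;       // x is read once into a register...
    while (tmp == 1);  // ...and the loop spins on the register forever:
                       // Thread B's write x = 2 is never re-read
}

Thread A then never observes Thread B's write and the rendezvous hangs, even though a naive interleaving argument says it must terminate.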
Memory Reordering: Hardware View (Processors)
Modern multi-core processors also reorder instructions for performance.
Pipelining: Processors execute multiple instructions concurrently in different stages.
Out-of-Order Execution: Instructions might be executed out of program order to keep execution units busy.
Caches: Each core has its own local caches. A store to memory by one core might not be immediately visible to other cores; it takes time for cache coherence protocols to propagate the update.
Memory Hierarchy: Access times vary dramatically: Registers (~0.5ns), L1 Cache (~1ns), L2 Cache (~7ns), System Memory (~100ns). Processors heavily optimize to hide memory latency, often involving reordering.
Multi-Core Hierarchy: Each core has its own L1 cache (and often L2), with potentially a shared L3 cache and system memory. Writes made by one core need to propagate through this hierarchy to become visible to others.
Real-life Analogy: Imagine Anna, Beat, and Zoe working on a shared blackboard (global data). Each also has their own notepad (local data/cache). When Anna writes something on the blackboard, it takes time for Beat and Zoe to notice the change. Furthermore, they might perform calculations on their notepads based on what they last saw on the blackboard, even if Anna has updated it since.
Schematic: CPUs connect via a system bus to shared memory. Caches sit between cores and the bus/memory. Writes update local caches first, and coherence protocols eventually make them visible elsewhere.
Memory Models: The Contract
Because compiler and hardware reorderings can break naively written concurrent code, we need a memory model.
Definition: A memory model is a specification, usually part of a programming language (like Java) or hardware architecture, that defines the guarantees for how memory operations by different threads appear to interact.
Purpose: It acts as a contract between the programmer and the system (compiler/hardware).
Trade-off: It restricts some optimizations (those that would violate the model’s guarantees) but still allows many optimizations for performance, while providing programmers with rules to write correct multithreaded programs.
Implications:
We need to understand (at least the basics of) the memory model of our language (e.g., the Java Memory Model - JMM) to write correct concurrent code.
Constructs like synchronized and volatile provide specific guarantees within the JMM that prevent problematic reorderings.
Memory Reordering in Parallel Programming: A Deeper Look
In parallel programming, where multiple threads or processes access shared memory concurrently, the order in which these operations appear to happen in the source code might not be the actual order in which they are executed and become visible to other threads. This phenomenon is known as memory reordering.
What Causes Memory Reordering?
Several components in a modern computer system can contribute to memory reordering:
The Central Processing Unit (CPU):
Out-of-Order Execution: CPUs often reorder instructions internally to improve performance. If an instruction is waiting for data (e.g., from memory), the CPU might execute subsequent independent instructions. This can lead to memory accesses happening in a different order than programmed.
Write Buffers: CPUs frequently use write buffers to hold pending writes to memory. Instead of immediately writing to slower main memory, the CPU stores the data in a fast buffer and continues processing. These writes are then flushed to memory later, potentially in a different order than they were initiated.
Cache Coherence Protocols: While these protocols ensure that all cores have a consistent view of shared memory, the mechanisms used to maintain this consistency can sometimes lead to observable reordering from the perspective of different cores.
The Memory System: In some advanced memory architectures, the memory controller itself might reorder memory requests to optimize bus utilization and overall memory throughput.
Why Memory Reordering is Performed:
The primary motivation behind memory reordering is to enhance performance by:
Reducing Stalls: Memory access is often a bottleneck. By reordering operations, the CPU can avoid waiting for slow memory operations to complete before proceeding with other computations.
Improving Cache Efficiency: Reordering can group memory accesses to the same cache lines, increasing the likelihood of cache hits and reducing the need to fetch data from main memory.
Exploiting Instruction-Level Parallelism: Out-of-order execution allows the CPU to execute multiple independent instructions in parallel, even if they are not sequentially ordered in the code.
The Challenge for Parallel Programs:
While beneficial for single-threaded performance, memory reordering can introduce significant challenges in parallel programming. Different threads might observe the effects of memory operations in inconsistent orders, leading to:
Race Conditions: When the outcome of a program depends on the unpredictable order of execution of operations by multiple threads accessing shared resources.
Unexpected Program Behavior: Logic that seems correct based on the sequential order of operations might fail when those operations are reordered at the hardware level.
Illustrative Example:
Consider two threads sharing two variables, x and y, initialized to 0.
Thread 1:
x = 1;
y = 2;
Thread 2:
while (y != 2);
print(x);
Naively, one might expect Thread 2 to always print 1. Thread 1 first sets x and then y. Thread 2 waits for y to become 2 and then reads x. However, due to memory reordering, the CPU executing Thread 1 might write y = 2 to memory before x = 1 becomes visible to other cores. In this scenario, Thread 2 could observe y becoming 2 and then read the value of x before Thread 1's write to x has propagated, resulting in Thread 2 printing 0.
Difference from Compiler Reordering:
Compiler reordering happens at compile time: the compiler rearranges instructions while preserving the observable behavior of a single thread. Hardware memory reordering happens at run time, inside the processor and memory system. From the point of view of a parallel program, both have the same effect: other threads may observe memory operations in an order different from program order, which is what introduces concurrency-specific issues.
Managing Memory Reordering:
To write correct and reliable parallel programs, it’s crucial to manage memory reordering. Common techniques include:
Memory Barriers (or Fences): These are special instructions that enforce ordering constraints on memory operations. A memory barrier ensures that all memory operations preceding it in program order are completed before any memory operations following it can begin. Different types of memory barriers exist with varying levels of ordering guarantees.
Memory Models: Programming languages and hardware architectures define memory models that specify the rules governing how and when memory operations become visible to different threads. These models dictate the allowed reorderings. Examples include sequential consistency (strongest ordering) and relaxed consistency (allows more reordering).
Synchronization Primitives: High-level synchronization constructs like mutexes, locks, semaphores, and atomic operations often internally use memory barriers to ensure proper synchronization and prevent data races caused by memory reordering. When a thread acquires a lock, for instance, a memory barrier might be inserted to ensure that all its subsequent memory operations happen after the lock acquisition and are visible to other threads after the lock is released.
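As a concrete instance of the last point, consider again the illustrative example above (Thread 1 writes x then y, Thread 2 waits for y and then reads x). In Java, declaring y volatile is one way to rule out the bad outcome; a minimal sketch, assuming we only need this one-way hand-off:

// Sketch: a volatile flag makes the earlier example well-defined.
// The volatile write to y synchronizes-with the volatile read that sees 2,
// so the plain write x = 1 happens-before the read of x in Thread 2.
class Handoff {
    int x = 0;
    volatile int y = 0;

    void thread1() {
        x = 1;    // ordinary write
        y = 2;    // volatile write: "publishes" x
    }

    void thread2() {
        while (y != 2) { /* spin */ }   // volatile read
        System.out.println(x);          // guaranteed to print 1
    }
}

A lock around the accesses would give the same visibility guarantee, at the cost of mutual exclusion that this particular pattern does not need.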
Fixing Our Example (Using JMM Guarantees)
If we use synchronized blocks to guard all accesses to the shared variables x and y, we prevent data races and problematic reorderings relative to these accesses.
class C {
    private int x = 0;
    private int y = 0;

    void f() {
        synchronized (this) { x = 1; }
        synchronized (this) { y = 1; }
    }

    void g() {
        int a, b;
        synchronized (this) { a = y; }
        synchronized (this) { b = x; }
        assert(b >= a);
    }
}
Under the rules of the JMM (specifically, the guarantees provided by entering and exiting synchronized blocks), the assertion in this version cannot fail. Synchronization creates necessary happens-before relationships.
Another Fix: volatile
Java’s volatile keyword provides weaker guarantees than synchronized but is sufficient for this specific example:
class C {
    private volatile int x = 0;
    private volatile int y = 0;
    // ... f() and g() as before ...
}
Accesses to volatile fields are guaranteed not to be reordered with respect to other volatile accesses by the compiler/hardware in ways that violate the JMM’s rules for volatile.
Accesses to volatile variables do not count as data races.
volatile is generally slower than regular field access but faster than locking.
Warning: volatile is tricky and primarily for experts. It prevents certain reorderings but doesn’t provide mutual exclusion or atomicity for compound operations. Using standard concurrent libraries or synchronized is usually safer.
synchronized and volatile
Let’s break down synchronized and volatile in Java to understand how they address concurrency issues like memory reordering.
The Core Problem: Shared Mutable State
In concurrent programming, multiple threads can access and modify the same data (shared mutable state). Without proper control, this can lead to race conditions where the final outcome depends on the unpredictable order in which threads execute, often resulting in incorrect results.
synchronized: Mutual Exclusion and Visibility
The synchronized keyword in Java provides a mechanism for mutual exclusion. This means that only one thread can hold the lock associated with a synchronized block or method at any given time for a specific object.
synchronized (this) {
    // Only one thread can execute this block for the 'this' object
    // at a time.
}
But synchronized does more than just mutual exclusion. It also establishes happens-before relationships. When a thread exits a synchronized block, all the changes it made within that block are guaranteed to be visible to any other thread that subsequently enters a synchronized block on the same object. This prevents the kind of memory reordering that could lead to the assertion failure in the example. The JMM guarantees that the write to x in f() will be visible to the read of x in g() because they are both protected by synchronization on the same object (this).
volatile: Visibility Without Mutual Exclusion
The volatile keyword provides a weaker form of synchronization. When a variable is declared volatile, the Java Memory Model (JMM) guarantees that:
Writes to the variable are immediately visible to other threads. The value is written directly to main memory (or a shared cache that’s immediately visible to other cores), not just the local processor cache.
Reads of the variable will always retrieve the most recent value from main memory. The thread will not use a stale value from its local cache.
Memory operations involving volatile variables cannot be reordered with respect to other volatile accesses in a way that violates the JMM’s rules for volatile. This means that if you have two volatile writes in your code, they will appear to happen in that order to other threads.
private volatile int count;
In the example with volatile, when Thread 1 writes to x and then y, these writes are guaranteed to become visible to Thread 2 eventually. When Thread 2 reads y, it will see the updated value. Because the writes to x and y are volatile, the JMM ensures that the write to x happens-before the write to y from the perspective of another thread. This ensures that when y becomes 1 (in the modified example), x will have already been set to 1.
Key Differences Summarized:
synchronized: Provides both mutual exclusion (only one thread can access) and visibility (changes are guaranteed to be seen by other synchronized blocks on the same object). It’s like having a lock on a room – only one person can be inside, and when they leave, they tidy up (make changes visible).
volatile: Only guarantees visibility (changes are immediately seen by other threads). It does not provide mutual exclusion. It’s like a whiteboard – when someone writes on it, everyone else can see the changes immediately, but there’s no control over who can write at what time.
Why volatile is Tricky: volatile is simpler and often faster than synchronized because it avoids the overhead of acquiring and releasing locks. However, it doesn’t protect compound operations (like count++). If multiple threads try to increment a volatile variable concurrently, you can still have issues because the increment operation (read-modify-write) is not atomic. That’s why the warning advises using synchronized or the concurrency utilities for more complex scenarios where atomicity is required.
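To make the compound-operation problem concrete, here is a small hypothetical sketch: count++ on a volatile field is still a separate read, add, and write, so two threads can interleave between the read and the write and lose an increment. java.util.concurrent.atomic.AtomicInteger (or a synchronized block) makes the whole read-modify-write atomic.

import java.util.concurrent.atomic.AtomicInteger;

class Counters {
    private volatile int unsafeCount = 0;                 // visibility only
    private final AtomicInteger safeCount = new AtomicInteger();

    void incrementUnsafe() {
        unsafeCount++;                // read, add, write: NOT atomic;
                                      // concurrent increments can be lost
    }

    void incrementSafe() {
        safeCount.incrementAndGet();  // atomic read-modify-write
    }
}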
In essence, both synchronized and volatile help manage shared mutable state in concurrent Java programs by ensuring that changes made by one thread are reliably seen by other threads, thus preventing errors caused by memory reordering and race conditions. synchronized provides a stronger guarantee by also ensuring exclusive access.
A More Realistic Broken Example
Consider this common pattern:
class C {
    boolean stop = false;

    // Thread 1
    void f() {
        while (!stop) {
            // draw a monster or do other work
        }
    }

    // Thread 2
    void g() {
        stop = didUserQuit();   // didUserQuit() returns true eventually
    }
}
Problem: There is no guarantee that Thread 1 will ever see the update to stop made by Thread 2.
Compiler Optimization: The compiler might optimize while(!stop) by reading stop into a register once outside the loop, leading to an infinite loop.
Hardware Caching: The value stop = true written by Thread 2 might remain in Thread 2’s cache and never become visible to Thread 1.
While this code might often “work in practice” due to cache coherence or luck, it’s fundamentally broken according to the JMM because there’s a data race (unsynchronized read in Thread 1 and write in Thread 2). Using volatile boolean stop or synchronizing accesses to stop would fix this.
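A minimal fix, assuming the only requirement is that Thread 1 eventually notices the updated flag, is to declare the flag volatile (a synchronized getter/setter pair would also work):

class C {
    volatile boolean stop = false;   // no data race: the read in the loop cannot be
                                     // hoisted away, and the update becomes visible

    // Thread 1
    void f() {
        while (!stop) {
            // draw a monster or do other work
        }
    }

    // Thread 2
    void g() {
        stop = didUserQuit();        // as in the original example
    }

    private boolean didUserQuit() {  // stub so the sketch is self-contained
        return true;
    }
}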
What Did We Learn?
Reordering is Real: Compilers and hardware will reorder memory operations for performance, potentially breaking concurrent code that relies on strict program order.
Language Constructs: Languages provide constructs (synchronized, volatile) to forbid certain reorderings and establish happens-before relationships, ensuring correct behavior in concurrent contexts.
Memory Models: These define the rules (“contract”) governing memory operation visibility and ordering between threads. They are complex but necessary.
Why Architectural Memory Models?
Different architectures (x86, ARM, POWER, etc.) have different hardware-level reordering rules and guarantees; which kinds of loads and stores may be reordered with one another varies from one architecture to the next.
Programming language memory models (like the JMM) aim to abstract over these hardware differences, providing a consistent model for programmers, but they must account for the weakest guarantees of the underlying hardware they run on.
x86 Example: Even on x86, which has a comparatively strong memory model, the processor buffers stores and may reorder a store with a later load (store buffering). In the classic litmus test where Thread 1 executes x = 1; i = y and Thread 2 executes y = 1; j = x (x and y initially 0), the outcome i = 0, j = 0 can therefore be observed on real hardware, even though no sequentially consistent interleaving produces it.
Java Memory Model (JMM): Necessary Basics
The JMM has evolved significantly since its initial flawed version in 1995. The modern JMM (since Java 5, 2005) provides a formal basis for Java concurrency.
Purpose: Restricts the allowable outcomes of concurrent programs. Without synchronization (volatile, synchronized, etc.), outcomes can be unexpected due to reordering.
Actions: Defines basic operations like read(x):1 (read variable x, saw value 1), write(x):1, lock(m), unlock(m).
Orders: Defines relationships between actions:
Program Order (PO): The order of actions within a single thread.
Synchronizes-With (SW): Specific pairing between synchronization actions (e.g., an unlock on a monitor synchronizes-with a subsequent lock on the same monitor).
Synchronization Order (SO): A total order over all synchronization actions in an execution.
Happens-Before (HB): The transitive closure of Program Order and Synchronizes-With. This is the most important ordering guarantee. If action A happens-before action B, then the effects of A are guaranteed to be visible to B.
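As a small illustration of how these orders compose, consider the standard publish-via-volatile-flag pattern (a sketch; the class and field names are made up): program order inside each thread plus the synchronizes-with edge between the volatile write and the volatile read yields a happens-before path from the data write to the data read.

class Publisher {
    int data = 0;                    // plain field
    volatile boolean ready = false;  // synchronization variable

    void writer() {                  // Thread 1
        data = 42;                   // (1)  PO: (1) before (2)
        ready = true;                // (2)  volatile write
    }

    void reader() {                  // Thread 2
        if (ready) {                 // (3)  volatile read; if it sees true,
                                     //      then (2) synchronizes-with (3)
            int r = data;            // (4)  PO: (3) before (4)
            // HB: (1) -> (2) -> (3) -> (4), so r is guaranteed to be 42
        }
    }
}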
Program Order
Program Order (PO): An order within a thread. It doesn’t guarantee memory access order across threads. Its main purpose is to link the execution trace back to the original source code.
Synchronization
Synchronization Actions (SA): Special actions that create ordering constraints. Includes reads/writes to volatile variables, lock/unlock operations, thread start/join, etc.
Synchronization Order (SO): A global total order agreed upon by all threads for all synchronization actions.
(In the example shown on the accompanying slide, which is not reproduced here, (0,0) is not a possible outcome.)
Synchronizes-With (SW): Pairs specific synchronization actions (e.g., volatile write synchronizes-with subsequent volatile read of the same variable).
Happens-Before (HB): If A happens-before B (A →HB B), then the memory effects of A are visible to B. This is the key guarantee we rely on.
HB Consistency: A read is allowed to see the last write that happens-before it, OR any other write that does not have a happens-before relationship (i.e., it’s in a race with the read). Data races mean reads can see unexpected values!
Example with volatile:
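The code from the slide is not reproduced here; the following hypothetical sketch makes the same point: each individual access to a volatile field is well-ordered and visible, but a check-then-act sequence built from such accesses is not atomic.

class OneTimeInit {
    private volatile boolean initialized = false;

    void initOnce() {
        if (!initialized) {          // volatile read
            // Two threads can both see 'false' here before either performs
            // the write below, so the "one-time" body may run twice: the read
            // and the write are individually safe, but the sequence is not atomic.
            expensiveInit();
            initialized = true;      // volatile write
        }
    }

    private void expensiveInit() { /* ... */ }
}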
This highlights that volatile prevents some reorderings but doesn’t guarantee atomicity for sequences of operations.