Lecture from: 15.04.2024 | Video: Video ETHZ

With this lecture, we begin Part 2 of the course.

In the first part of the course, we focused on using abstractions like locks to manage concurrency and protect critical sections, acknowledging challenges like choosing lock granularity and avoiding high-level race conditions (bad interleavings).

Now, in Part 2, we dive deeper into the foundations of shared memory concurrency. We’ll confront the often counter-intuitive behavior caused by memory reordering—optimizations performed by compilers and hardware that can break seemingly correct code. This leads us to the critical concept of data races and why they must be avoided. We will then explore how the fundamental problem of mutual exclusion (the core idea behind locks) can theoretically be solved using only basic atomic memory operations, examining classic algorithms like Dekker’s and Peterson’s.

Understanding these low-level details is crucial, even if you primarily use high-level library constructs in practice. The relentless push for computational power, especially in demanding fields like AI and High-Performance Computing (HPC), necessitates this deeper knowledge. As AI pioneers like Geoffrey Hinton and Yann LeCun have emphasized, massive increases in compute power have been instrumental in recent AI breakthroughs. Tech giants are building vast supercomputers, often with thousands of nodes and specialized hardware (GPUs, TPUs). Effectively programming these systems requires a solid grasp of the underlying concurrency principles to ensure both correctness and performance.

Learning Goals for Today

Recap (Previous Topics):

  • We know how to use locks (synchronized, ReentrantLock, etc.) to protect critical sections of code.
  • We understand practical guidelines (e.g., minimizing lock scope, avoiding holding locks during long operations) and associated trade-offs.
  • We’ve seen examples of high-level race conditions (incorrect program logic due to interleaving, even with locks).

Today’s Focus (New Topics):

  • Memory Models & Reordering: Understanding how compiler and hardware optimizations can reorder memory operations, leading to unexpected behavior in concurrent programs.
  • Data Races: Defining low-level data races (conflicting, unsynchronized accesses to shared memory) and understanding why they are fundamentally problematic and must be eliminated.
  • Mutex Implementation Fundamentals: Exploring how mutual exclusion primitives could theoretically be built from the ground up using only atomic read/write operations (atomic registers), looking at algorithms like Dekker’s and Peterson’s.

Important Context: While you’ll typically rely on robust, pre-built lock implementations, examining how they could be constructed illuminates the inherent difficulties of concurrency control. As Confucius purportedly said (paraphrased): “Tell me and I forget, teach me and I may remember, involve me and I learn.”

Motivation: Can This Simple Code Fail?

Consider this basic Java class with two shared integer fields, x and y, initialized to 0.

class C {
    private int x = 0;
    private int y = 0;
 
    // Executed by Thread 1
    void f() {
        x = 1; // Operation A
        y = 1; // Operation B
    }
 
    // Executed by Thread 2
    void g() {
        int a = y; // Operation C
        int b = x; // Operation D
        assert(b >= a);
    }
}

Suppose Thread 1 executes f() and Thread 2 executes g() concurrently. Can the assertion assert(b >= a) ever fail?

Let’s analyze based on the sequential order of operations within each thread. The assertion fails if and only if b < a, which requires b == 0 and a == 1.

  • For a (read from y) to be 1, operation C must happen after operation B (the write y = 1). (Constraint: B must precede C)
  • For b (read from x) to be 0, operation D must happen before operation A (the write x = 1). (Constraint: D must precede A)

Can we find a single, valid interleaving of A, B, C, and D that satisfies both constraints? Let’s assume, for now, that operations within a thread execute in program order and writes become instantly visible to all threads (Sequential Consistency assumption).

Possible interleavings (checking if b >= a):

  • A B C D: a=1, b=1. Assertion holds. ✓
  • A C B D: a=0, b=1. Assertion holds. ✓
  • A C D B: a=0, b=1. Assertion holds. ✓
  • C A B D: a=0, b=1. Assertion holds. ✓
  • C A D B: a=0, b=1. Assertion holds. ✓
  • C D A B: a=0, b=0. Assertion holds. ✓

Based on this exhaustive check, it appears no interleaving allows the assertion to fail under the assumption of sequential consistency.

Combinatorial Excursion: Why Exhaustion Fails

How many possible interleavings exist for n threads each executing k statements? The total number of statements is nk. An interleaving is formed by merging these nk statements while preserving the internal order within each thread, so there are (nk)! / (k!)^n of them. For 2 threads with k statements each, this is the central binomial coefficient (2k choose k) = (2k)! / (k! · k!), which grows exponentially (on the order of 4^k / √(πk)). Proving correctness by enumerating interleavings quickly becomes infeasible.
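For intuition, here is a small sketch (not part of the lecture code) that computes these counts exactly for two threads; for k = 2 it prints 6, matching the enumeration above.

import java.math.BigInteger;

// Counts interleavings of two threads with k statements each:
// C(2k, k) = (2k)! / (k! * k!).
class InterleavingCount {
    static BigInteger choose(int n, int k) {
        BigInteger r = BigInteger.ONE;
        for (int i = 1; i <= k; i++) {
            // Multiplicative formula; r stays an exact integer at every step.
            r = r.multiply(BigInteger.valueOf(n - k + i))
                 .divide(BigInteger.valueOf(i));
        }
        return r;
    }

    public static void main(String[] args) {
        for (int k : new int[]{2, 5, 10, 20}) {
            System.out.println("k = " + k + ": " + choose(2 * k, k) + " interleavings");
        }
    }
}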

Another Proof Attempt (By Contradiction, Assuming Sequential Consistency)

Let’s formally prove the assertion holds under sequential consistency:

  1. Assume Assertion Fails: Suppose assert(b >= a) fails. This implies a == 1 and b == 0.
  2. Implication of a == 1: The read a = y (C) must occur after the write y = 1 (B). Let’s denote this temporal order as B → C.
  3. Implication of b == 0: The read b = x (D) must occur before the write x = 1 (A). So, D → A.
  4. Program Order: Within Thread 1, A (x = 1) occurs before B (y = 1). So, A → B. Within Thread 2, C (a = y) occurs before D (b = x). So, C → D.
  5. Transitivity: Combining these required orderings: D → A (from step 3), A → B (program order T1), B → C (from step 2), C → D (program order T2). This creates a cycle: D → A → B → C → D.
  6. Contradiction: This chain implies D → D, meaning operation D must happen before itself, which is logically impossible. 

Conclusion: The initial assumption (that the assertion fails) must be false. Therefore, the assertion cannot fail if the system behaves with sequential consistency.

Reality Check: Let’s Run It…

[Figure: program output showing the assertion failing in practice]

Despite our proofs, when this code is executed on standard hardware with optimizing compilers, the assertion can and often does fail! Why? Because our fundamental assumption of sequential consistency is violated in practice.

Why It Fails: The Reality of Memory Reordering

Our proofs relied on an idealized model called Sequential Consistency, where:

  1. Operations within each thread execute in the order specified by the program source code.
  2. The effects of operations (especially writes) become instantaneously visible to all other threads in a single, globally agreed-upon order.

Reality: Modern computer systems (compilers and hardware) do not guarantee sequential consistency by default because enforcing it would severely hinder performance. Instead, they are allowed to reorder memory operations (reads and writes) relative to each other, as long as the observable behavior of a single-threaded program is preserved.

Rule of Thumb: Optimizations are generally permitted if they don’t change the outcome when viewed from the perspective of the thread performing the operations itself.

Consider these valid sequential transformations:

// Original Code
void f() {
  x = 1;
  y = x + 1; // Depends on x=1
  z = x + 1; // Depends on x=1
}
// Reordered: y and z swapped (OK sequentially)
void f() {
  x = 1;
  z = x + 1; // Depends on x=1
  y = x + 1; // Depends on x=1
}
// Optimized: Constants computed (OK sequentially)
void f() {
  x = 1;
  z = 2;
  y = 2;
}

These are perfectly safe in a single thread. However, in a multithreaded program, another thread might observe the effects of these reordered operations, leading to outcomes impossible under sequential consistency. For instance, in our initial example, Thread 2 might observe y=1 before observing x=1 if the hardware/compiler reordered the writes in Thread 1 (y=1 effectively happening before x=1 from Thread 2’s perspective).
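Concretely, a hedged sketch of this effective behavior (illustrative only; either the compiler or the CPU's store handling can produce it):

// Effective behavior the system may produce for f() when x and y are plain fields:
void f() {
    y = 1;   // operation B takes effect first
    x = 1;   // operation A takes effect second
}
// Thread 2 can then observe a = y == 1 followed by b = x == 0,
// and assert(b >= a) fails.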

Memory Reordering: Software View (Compilers)

  • Compilers perform many optimizations that can reorder or even eliminate memory accesses:
    • Dead Code Elimination: Removing code whose results are never used.
    • Register Hoisting: Moving memory reads out of loops into registers.
    • Locality Optimizations: Rearranging code to improve cache usage.
    • …and many more complex transformations.
  • These optimizations assume single-threaded semantics by default. Without specific instructions (like volatile or synchronized), the compiler doesn’t guarantee a specific global ordering of memory operations visible to other threads.

Example: Compiler Optimization Failure

Consider this attempt at a simple synchronization mechanism (rendezvous) using a shared variable x:

int x;
 
// Thread A
void wait() {
  x = 1;
  while(x == 1); // Spin until x changes
}
 
// Thread B
void arrive() {
  x = 2;
}

Naively, we expect Thread A to set x = 1, spin, and then exit the loop once Thread B sets x = 2. In practice, however, the compiler (assuming single-threaded semantics) sees that nothing inside wait() modifies x after the store x = 1, so it may hoist the read of x out of the loop or fold the loop condition into a constant; Thread A then spins forever and never observes Thread B's write, as sketched below.
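An illustrative sketch of what the optimized wait() may effectively look like (assuming x is a plain shared field; actual compiler output varies):

// Illustrative only: a transformation permitted under single-threaded semantics.
void wait() {
    x = 1;
    int r = x;            // read of x hoisted out of the loop; r == 1 here
    while (r == 1) { }    // the condition never re-reads x: Thread A spins forever
}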

Memory Reordering: Hardware View (Processors)

  • Modern multi-core processors also reorder instructions for performance.
  • Pipelining: Processors execute multiple instructions concurrently in different stages.
  • Out-of-Order Execution: Instructions might be executed out of program order to keep execution units busy.
  • Caches: Each core has its own local caches. A store to memory by one core might not be immediately visible to other cores; it takes time for cache coherence protocols to propagate the update.

Memory Hierarchy: Access times vary dramatically: Registers (~0.5ns), L1 Cache (~1ns), L2 Cache (~7ns), System Memory (~100ns). Processors heavily optimize to hide memory latency, often involving reordering.

Multi-Core Hierarchy: Each core has its own L1 cache (and often L2), with potentially a shared L3 cache and system memory. Writes made by one core need to propagate through this hierarchy to become visible to others.

Real-life Analogy: Imagine Anna, Beat, and Zoe working on a shared blackboard (global data). Each also has their own notepad (local data/cache). When Anna writes something on the blackboard, it takes time for Beat and Zoe to notice the change. Furthermore, they might perform calculations on their notepads based on what they last saw on the blackboard, even if Anna has updated it since.

Schematic: CPUs connect via a system bus to shared memory. Caches sit between cores and the bus/memory. Writes update local caches first, and coherence protocols eventually make them visible elsewhere.

Memory Models: The Contract

Because compiler and hardware reorderings can break naively written concurrent code, we need a memory model.

  • Definition: A memory model is a specification, usually part of a programming language (like Java) or hardware architecture, that defines the guarantees for how memory operations by different threads appear to interact.
  • Purpose: It acts as a contract between the programmer and the system (compiler/hardware).
  • Trade-off: It restricts some optimizations (those that would violate the model’s guarantees) but still allows many optimizations for performance, while providing programmers with rules to write correct multithreaded programs.

Implications:

  • We need to understand (at least the basics of) the memory model of our language (e.g., the Java Memory Model - JMM) to write correct concurrent code.
  • Constructs like synchronized and volatile provide specific guarantees within the JMM that prevent problematic reorderings.

Memory Reordering in Parallel Programming: A Deeper Look

In parallel programming, where multiple threads or processes access shared memory concurrently, the order in which these operations appear to happen in the source code might not be the actual order in which they are executed and become visible to other threads. This phenomenon is known as memory reordering.

What Causes Memory Reordering?

Several components in a modern computer system can contribute to memory reordering:

  • The Central Processing Unit (CPU):
    • Out-of-Order Execution: CPUs often reorder instructions internally to improve performance. If an instruction is waiting for data (e.g., from memory), the CPU might execute subsequent independent instructions. This can lead to memory accesses happening in a different order than programmed.
    • Write Buffers: CPUs frequently use write buffers to hold pending writes to memory. Instead of immediately writing to slower main memory, the CPU stores the data in a fast buffer and continues processing. These writes are then flushed to memory later, potentially in a different order than they were initiated.
    • Cache Coherence Protocols: While these protocols ensure that all cores have a consistent view of shared memory, the mechanisms used to maintain this consistency can sometimes lead to observable reordering from the perspective of different cores.
  • The Memory System: In some advanced memory architectures, the memory controller itself might reorder memory requests to optimize bus utilization and overall memory throughput.

Why Memory Reordering is Performed:

The primary motivation behind memory reordering is to enhance performance by:

  • Reducing Stalls: Memory access is often a bottleneck. By reordering operations, the CPU can avoid waiting for slow memory operations to complete before proceeding with other computations.
  • Improving Cache Efficiency: Reordering can group memory accesses to the same cache lines, increasing the likelihood of cache hits and reducing the need to fetch data from main memory.
  • Exploiting Instruction-Level Parallelism: Out-of-order execution allows the CPU to execute multiple independent instructions in parallel, even if they are not sequentially ordered in the code.

The Challenge for Parallel Programs:

While beneficial for single-threaded performance, memory reordering can introduce significant challenges in parallel programming. Different threads might observe the effects of memory operations in inconsistent orders, leading to:

  • Race Conditions: When the outcome of a program depends on the unpredictable order of execution of operations by multiple threads accessing shared resources.
  • Unexpected Program Behavior: Logic that seems correct based on the sequential order of operations might fail when those operations are reordered at the hardware level.

Illustrative Example:

Consider two threads sharing two variables, x and y, initialized to 0.

Thread 1:

x = 1;
y = 2;

Thread 2:

while (y != 2);
print(x);

Naively, one might expect Thread 2 to always print 1. Thread 1 first sets x and then y. Thread 2 waits for y to become 2 and then reads x. However, due to memory reordering, the CPU executing Thread 1 might write y = 2 to memory before x = 1 becomes visible to other cores. In this scenario, Thread 2 could observe y becoming 2 and then read the value of x before Thread 1’s write to x has propagated, resulting in Thread 2 printing 0.
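In Java, one way to restore the intended behavior is to make the flag variable y volatile, so that the write to x happens-before any read that sees y == 2. A minimal sketch, assuming the two methods run in different threads:

class Publication {
    private int x = 0;
    private volatile int y = 0;   // volatile: creates the needed happens-before edge

    // Thread 1
    void writer() {
        x = 1;   // ordinary write
        y = 2;   // volatile write: everything written before it becomes visible
    }            // to a thread whose read of y sees the value 2

    // Thread 2
    void reader() {
        while (y != 2) { }        // volatile read; loops until it sees 2
        System.out.println(x);    // guaranteed to print 1
    }
}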

Difference from Compiler Reordering:

Compiler reordering happens at compile time and is constrained only to preserve the observable behavior of a single thread; hardware memory reordering happens at run time, in the processor and memory system. Both can change the order in which memory operations become visible to other threads in a parallel program, which is where the concurrency-specific issues arise.

Managing Memory Reordering:

To write correct and reliable parallel programs, it’s crucial to manage memory reordering. Common techniques include:

  • Memory Barriers (or Fences): These are special instructions that enforce ordering constraints on memory operations. A full memory barrier ensures that all memory operations preceding it in program order complete (become visible) before any memory operations following it can begin; weaker barrier types (acquire, release, store-store, etc.) give correspondingly weaker guarantees. A sketch follows this list.
  • Memory Models: Programming languages and hardware architectures define memory models that specify the rules governing how and when memory operations become visible to different threads. These models dictate the allowed reorderings. Examples include sequential consistency (strongest ordering) and relaxed consistency (allows more reordering).
  • Synchronization Primitives: High-level synchronization constructs like mutexes, locks, semaphores, and atomic operations often internally use memory barriers to ensure proper synchronization and prevent data races caused by memory reordering. When a thread acquires a lock, for instance, a memory barrier might be inserted to ensure that all its subsequent memory operations happen after the lock acquisition and are visible to other threads after the lock is released.
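To make the memory-barrier idea concrete, here is a hedged sketch in Java using the fence methods of java.lang.invoke.VarHandle (JDK 9+). Note that the plain fields below still constitute a data race under the JMM; the sketch only illustrates what the fences constrain, and in real code you would prefer volatile, locks, or the java.util.concurrent utilities.

import java.lang.invoke.VarHandle;

class FenceSketch {
    int data = 0;            // plain fields: still a data race under the JMM;
    boolean ready = false;   // shown only to illustrate the barriers

    // Publisher thread
    void publish() {
        data = 42;
        VarHandle.releaseFence(); // prior loads/stores are not reordered with later stores
        ready = true;
    }

    // Consumer thread
    void consume() {
        if (ready) {
            VarHandle.acquireFence(); // prior loads are not reordered with later loads/stores
            System.out.println(data); // intended to observe 42
        }
    }
}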

Fixing Our Example (Using JMM Guarantees)

If we use synchronized blocks to guard all accesses to the shared variables x and y, we prevent data races and problematic reorderings relative to these accesses.

class C {
    private int x = 0;
    private int y = 0;
 
    void f() {
        synchronized(this) { x = 1; }
        synchronized(this) { y = 1; }
    }
 
    void g() {
        int a, b;
        synchronized(this) { a = y; }
        synchronized(this) { b = x; }
        assert(b >= a);
    }
}

Under the rules of the JMM (specifically, the guarantees provided by entering and exiting synchronized blocks), the assertion in this version cannot fail. Synchronization creates necessary happens-before relationships.

Another Fix: volatile

Java’s volatile keyword provides weaker guarantees than synchronized but is sufficient for this specific example:

class C {
    private volatile int x = 0;
    private volatile int y = 0;
    // ... f() and g() as before ...
}

  • Accesses to volatile fields are guaranteed not to be reordered with respect to other volatile accesses by the compiler/hardware in ways that violate the JMM’s rules for volatile.
  • Accesses to volatile variables do not count as data races.
  • volatile is generally slower than regular field access but faster than locking.
  • Warning: volatile is tricky and primarily for experts. It prevents certain reorderings but doesn’t provide mutual exclusion or atomicity for compound operations. Using standard concurrent libraries or synchronized is usually safer.

synchronized and volatile

Let’s break down synchronized and volatile in Java to understand how they address concurrency issues like memory reordering.

The Core Problem: Shared Mutable State In concurrent programming, multiple threads can access and modify the same data (shared mutable state). Without proper control, this can lead to race conditions where the final outcome depends on the unpredictable order in which threads execute, often resulting in incorrect results.

synchronized: Mutual Exclusion and Visibility The synchronized keyword in Java provides a mechanism for mutual exclusion. This means that only one thread can hold the lock associated with a synchronized block or method at any given time for a specific object.

synchronized(this) {
    // Only one thread can execute this block for the 'this' object
    // at a time.
}

But synchronized does more than just mutual exclusion. It also establishes happens-before relationships. When a thread exits a synchronized block, all the changes it made within that block are guaranteed to be visible to any other thread that subsequently enters a synchronized block on the same object. This prevents the kind of memory reordering that could lead to the assertion failure in the example. The JMM guarantees that the write to x in f() will be visible to the read of x in g() because they are both protected by synchronization on the same object (this).

volatile: Visibility Without Mutual Exclusion The volatile keyword provides a weaker form of synchronization. When a variable is declared volatile, the Java Memory Model (JMM) guarantees that:

  • Writes to the variable become visible to other threads: the value is not left sitting only in the writing core’s local cache or a register, so a later read of the volatile field by any thread will observe it.
  • Reads of the variable observe the most recent volatile write; the thread will not keep re-using a stale copy it cached earlier (e.g., a value hoisted into a register).
  • Memory operations involving volatile variables cannot be reordered with respect to other volatile accesses in a way that violates the JMM’s rules for volatile. This means that if you have two volatile writes in your code, they will appear to happen in that order to other threads.
private volatile int count;

In the example with volatile, the volatile write y = 1 in Thread 1 synchronizes-with any read of y that observes the value 1. Combined with program order (x = 1 comes before y = 1 in f(), and a = y comes before b = x in g()), this means that whenever Thread 2 reads a == 1, the write x = 1 happens-before its subsequent read of x, so b must be 1 and the assertion b >= a cannot fail.

Key Differences Summarized:

  • synchronized: Provides both mutual exclusion (only one thread can access) and visibility (changes are guaranteed to be seen by other synchronized blocks on the same object). It’s like having a lock on a room – only one person can be inside, and when they leave, they tidy up (make changes visible).
  • volatile: Only guarantees visibility (changes are immediately seen by other threads). It does not provide mutual exclusion. It’s like a whiteboard – when someone writes on it, everyone else can see the changes immediately, but there’s no control over who can write at what time.

Why volatile is Tricky: volatile is simpler and often faster than synchronized because it avoids the overhead of acquiring and releasing locks. However, it doesn’t protect compound operations (like count++). If multiple threads try to increment a volatile variable concurrently, you can still have issues because the increment operation (read-modify-write) is not atomic. That’s why the warning advises using synchronized or the concurrency utilities for more complex scenarios where atomicity is required.
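To make the count++ caveat concrete, here is a hedged sketch contrasting a volatile counter, whose read-modify-write increments can be lost under contention, with java.util.concurrent.atomic.AtomicInteger, whose incrementAndGet() performs the update atomically:

import java.util.concurrent.atomic.AtomicInteger;

class Counters {
    private volatile int volatileCount = 0;                        // visibility only
    private final AtomicInteger atomicCount = new AtomicInteger(); // atomic updates

    void increment() {
        volatileCount++;               // NOT atomic: read, add, write can interleave
                                       // with another thread, losing an update
        atomicCount.incrementAndGet(); // atomic read-modify-write: no lost updates
    }
}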

In essence, both synchronized and volatile help manage shared mutable state in concurrent Java programs by ensuring that changes made by one thread are reliably seen by other threads, thus preventing errors caused by memory reordering and race conditions. synchronized provides a stronger guarantee by also ensuring exclusive access.

A More Realistic Broken Example

Consider this common pattern:

class C {
    boolean stop = false;
 
    // Thread 1
    void f() {
        while (!stop) {
            // draw a monster or do other work
        }
    }
 
    // Thread 2
    void g() {
        stop = didUserQuit(); // didUserQuit() returns true eventually
    }
}

Problem: There is no guarantee that Thread 1 will ever see the update to stop made by Thread 2.

  • Compiler Optimization: The compiler might optimize while(!stop) by reading stop into a register once outside the loop, leading to an infinite loop.
  • Hardware Caching: The value stop = true written by Thread 2 might remain in Thread 2’s cache and never become visible to Thread 1.

While this code might often “work in practice” due to cache coherence or luck, it’s fundamentally broken according to the JMM because there’s a data race (unsynchronized read in Thread 1 and write in Thread 2). Using volatile boolean stop or synchronizing accesses to stop would fix this.
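A minimal sketch of the volatile fix suggested above (the flag is simply set to true here in place of the didUserQuit() call from the original snippet):

class C {
    private volatile boolean stop = false;  // volatile: no data race, and the loop
                                            // is guaranteed to observe the update

    // Thread 1
    void f() {
        while (!stop) {
            // draw a monster or do other work
        }
    }

    // Thread 2
    void g() {
        stop = true;  // e.g., once the user quits
    }
}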

What Did We Learn?

  • Reordering is Real: Compilers and hardware will reorder memory operations for performance, potentially breaking concurrent code that relies on strict program order.
  • Language Constructs: Languages provide constructs (synchronized, volatile) to forbid certain reorderings and establish happens-before relationships, ensuring correct behavior in concurrent contexts.
  • Memory Models: These define the rules (“contract”) governing memory operation visibility and ordering between threads. They are complex but necessary.

Why Architectural Memory Models?

  • Different architectures (x86, ARM, POWER, etc.) have different hardware-level reordering rules and guarantees; the lecture slides show a table comparing which kinds of reordering each architecture permits.
  • Programming language memory models (like the JMM) aim to abstract over these hardware differences, providing a consistent model for programmers, but they must account for the weakest guarantees of the underlying hardware they run on.

x86 Example Revisited: Even on x86 (which has a relatively strong memory model compared to others), the initial example assert(b >= a) can fail due to subtle reorderings allowed by the processor (specifically related to store buffering), resulting in the outcome i=0, j=0 (where i corresponds to a and j to b).

Java Memory Model (JMM): Necessary Basics

The JMM has evolved significantly since its initial flawed version in 1995. The modern JMM (since Java 5, 2005) provides a formal basis for Java concurrency.

  • Purpose: Restricts the allowable outcomes of concurrent programs. Without synchronization (volatile, synchronized, etc.), outcomes can be unexpected due to reordering.
  • Actions: Defines basic operations like read(x):1 (read variable x, saw value 1), write(x):1, lock(m), unlock(m).
  • Orders: Defines relationships between actions:
    • Program Order (PO): The order of actions within a single thread.
    • Synchronizes-With (SW): Specific pairing between synchronization actions (e.g., an unlock on a monitor synchronizes-with a subsequent lock on the same monitor).
    • Synchronization Order (SO): A total order over all synchronization actions in an execution.
    • Happens-Before (HB): The transitive closure of Program Order and Synchronizes-With. This is the most important ordering guarantee. If action A happens-before action B, then the effects of A are guaranteed to be visible to B.
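As a small illustration of happens-before edges beyond locks and volatile (a sketch, not from the slides): Thread.start() and Thread.join() are synchronization actions, so writes made before start() are visible inside the new thread, and writes made by the thread are visible after join() returns.

class HappensBeforeDemo {
    static int data = 0;   // plain field, yet the accesses below are race-free

    public static void main(String[] args) throws InterruptedException {
        data = 1;                       // happens-before t.start()
        Thread t = new Thread(() -> {
            data = data + 1;            // sees 1, writes 2; happens-before join()
        });
        t.start();
        t.join();                       // returns only after the thread has finished
        System.out.println(data);       // guaranteed to print 2
    }
}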

Program Order

Program Order (PO): An order within a thread. It doesn’t guarantee memory access order across threads. Its main purpose is to link the execution trace back to the original source code.

Synchronization

  • Synchronization Actions (SA): Special actions that create ordering constraints. Includes reads/writes to volatile variables, lock/unlock operations, thread start/join, etc.
  • Synchronization Order (SO): A global total order, agreed upon by all threads, over all synchronization actions. In the slide example shown at this point, the outcome (0,0) is not possible precisely because all threads must observe the synchronization actions in this single total order.
  • Synchronizes-With (SW): Pairs specific synchronization actions (e.g., a volatile write synchronizes-with a subsequent volatile read of the same variable).
  • Happens-Before (HB): If A happens-before B (A →HB B), then the memory effects of A are visible to B. This is the key guarantee we rely on.
  • HB Consistency: A read is allowed to see the last write that happens-before it, OR any other write that does not have a happens-before relationship (i.e., it’s in a race with the read). Data races mean reads can see unexpected values!

Example with volatile: volatile prevents some reorderings, but it does not guarantee atomicity for sequences of operations, as the sketch below illustrates.
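A generic illustration of this point (a sketch, not the slide’s own example; expensiveSetup() is a hypothetical placeholder): even with a volatile field, a check-then-act sequence is not atomic, so two threads can both pass the check and both run the setup.

class LazyInit {
    private volatile boolean initialized = false;

    void ensureInit() {
        if (!initialized) {        // check (volatile read)
            expensiveSetup();      // two threads that both saw 'false' both end up here
            initialized = true;    // act (volatile write)
        }
    }

    private void expensiveSetup() { /* hypothetical placeholder for real setup work */ }
}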

Continue here: 16 Implementation of Mutual Exclusion, Critical Section, Dekker’s Algorithm, Peterson’s Lock, Filter Lock, Lamport’s Bakery Algorithm