Lecture from: 08.04.2024 | Video: Video ETHZ

Shared Memory Concurrency, Locks, and Data Races

Continuation

Getting Concurrency Right: Guidelines

Avoiding race conditions is notoriously difficult. Decades of experience have led to some conventional wisdom.

The 3 Choices for Memory Locations

For every memory location (field, array element, etc.) in your program, you must ensure at least one of these conditions holds:

  1. Thread-Local: The location is only ever accessed by a single thread. (No sharing = No problem).
  2. Immutable: The location’s value is never written to after initialization. (Read-only = No problem).
  3. Synchronized: Accesses (reads and writes) to the location are controlled using synchronization mechanisms (like locks) to prevent data races and harmful interleavings.

Guideline: Prefer Thread-Local:

  • Whenever possible, avoid sharing resources. Give each thread its own copy.
  • This only works if threads don’t need to communicate through that specific resource.
  • Example: java.util.Random objects are often best kept thread-local (a sketch follows this list).
  • Local variables within methods are always thread-local because each thread has its own call stack.
  • Minimize shared mutable memory.
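A minimal sketch of the thread-local approach (the class and field names are illustrative, not from the lecture): each thread gets its own Random, so no synchronization is needed.

import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

public class ThreadLocalDemo {
    // Each thread that touches this ThreadLocal lazily gets its own Random instance.
    private static final ThreadLocal<Random> perThreadRandom =
            ThreadLocal.withInitial(Random::new);

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            // No lock needed: this Random is never shared between threads.
            int n = perThreadRandom.get().nextInt(100);
            // The standard library also provides a ready-made per-thread generator:
            int m = ThreadLocalRandom.current().nextInt(100);
            System.out.println(Thread.currentThread().getName() + ": " + n + ", " + m);
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}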

Guideline: Prefer Immutable:

  • Whenever possible, make objects immutable (their state cannot change after creation).
  • Create new objects instead of modifying existing ones (see the sketch after this list).
  • This is a core principle of functional programming and greatly simplifies concurrency.
  • If a location is only read (effectively immutable), no synchronization is needed for reads. Simultaneous reads are safe.
  • Minimize mutation.
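A minimal sketch of an immutable value class (illustrative, not from the lecture): all fields are final, there are no setters, and “modification” returns a new object.

public final class Point {
    private final double x;
    private final double y;

    public Point(double x, double y) {
        this.x = x;
        this.y = y;
    }

    public double x() { return x; }
    public double y() { return y; }

    // "Mutation" produces a new object; the original never changes,
    // so Point instances can be shared between threads without locks.
    public Point translate(double dx, double dy) {
        return new Point(x + dx, y + dy);
    }
}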

Guidelines for Synchronized Data

When you must have shared, mutable data, you need synchronization.

Guideline #0: No Data Races.

  • Never allow two threads to simultaneously read/write or write/write the same memory location without synchronization.
  • Data races cause undefined behavior. Avoid them rigorously.
  • Necessary but not sufficient: Our original peek example had no data races but was still wrong due to bad interleavings.

Guideline #1: Consistent Locking.

  • For each piece of shared mutable data (“location”) that needs synchronization, designate one specific lock that guards that data.
  • Always hold that specific lock when reading or writing the guarded data.
  • Document which lock guards which data.
  • Often, the lock used is the intrinsic lock of the object containing the data (this).
  • Sometimes, one lock guards multiple locations (e.g., all fields of an object, or even an entire data structure).

  • The mapping from data locations to guarding locks is a convention the programmer must follow.
  • Consistent locking prevents data races.
  • Not sufficient: Consistent locking alone does not prevent bad interleavings. Our corrected peek example uses consistent locking (all accesses to the stack’s internal state happen under the stack’s lock), but what actually prevents bad interleavings there is holding that lock for the entire operation, making it atomic.

While consistent locking is an excellent default guideline, it’s not strictly required for correctness if more complex synchronization protocols involving multiple phases are used (this is advanced and error-prone).
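As an illustration of the convention (a sketch, not code from the lecture): a counter whose single field is guarded by the object’s intrinsic lock, with the convention documented.

public class GuardedCounter {
    // Convention: all reads and writes of 'count' must hold the intrinsic lock of 'this'.
    private int count = 0;

    public synchronized void increment() {
        count++;
    }

    public synchronized int get() {
        return count;
    }
}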

Lock Granularity

How much data should a single lock protect?

  • Coarse-grained Locking: Fewer locks, each protecting more data.
    • Example: One lock for an entire data structure (like a hash map), or one lock for all bank accounts.
  • Fine-grained Locking: More locks, each protecting less data.
    • Example: One lock per hash bucket, or one lock per bank account.

This is a continuum, not a binary choice.

Trade-offs:

  • Coarse-grained:
    • Advantages: Simpler to implement and reason about. Easier to implement operations that modify multiple locations or the structure’s shape (since you hold the single lock).
    • Disadvantages: Reduces concurrency. If threads need to access different parts of the data protected by the same lock, they block each other unnecessarily (contention).
  • Fine-grained:
    • Advantages: Allows more simultaneous access, potentially improving performance by reducing contention.
    • Disadvantages: More complex to implement correctly. Operations spanning multiple locations require acquiring multiple locks (increasing deadlock risk). Higher overhead due to managing more locks.

Guideline #2: Start with coarse-grained locking (simpler). Only move to fine-grained locking if performance measurements show that contention on the coarse lock is a bottleneck. Be wary, as moving to fine-grained locking often introduces subtle bugs.
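To make the trade-off concrete, here is a hedged sketch (class and method names are illustrative) of a bank with one lock for all accounts versus one lock per account:

import java.util.HashMap;
import java.util.Map;

// Coarse-grained: one lock guards every balance. Simple, but two deposits to
// different accounts still block each other.
class CoarseBank {
    private final Object lock = new Object();
    private final Map<Integer, Integer> balances = new HashMap<>();

    void deposit(int account, int amount) {
        synchronized (lock) {
            balances.merge(account, amount, Integer::sum);
        }
    }
}

// Fine-grained: each account carries its own lock, so deposits to different
// accounts proceed in parallel. Operations spanning two accounts (e.g. a transfer)
// would now need two locks, which is where deadlock risk creeps in.
class FineBank {
    static class Account {
        int balance = 0; // guarded by this Account's intrinsic lock
    }

    private final Map<Integer, Account> accounts = new HashMap<>();

    FineBank(int numAccounts) {
        // The map is only written here and is effectively immutable afterwards,
        // so concurrent reads of it need no lock.
        for (int i = 0; i < numAccounts; i++) {
            accounts.put(i, new Account());
        }
    }

    void deposit(int account, int amount) {
        Account a = accounts.get(account);
        synchronized (a) {
            a.balance += amount;
        }
    }
}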

Critical-Section Granularity

Another granularity issue is the size of the critical section: how much code executes while holding a lock?

  • Too Long: Holding a lock for too long blocks other threads unnecessarily, hurting performance.
  • Too Short: Breaking down an operation that needs to be atomic into multiple smaller critical sections can introduce bad interleavings (allowing other threads to see inconsistent intermediate states). It can also hurt performance due to the overhead of acquiring/releasing locks frequently.

Guideline #3: Do not perform expensive computations (long-running calculations) or I/O operations inside critical sections. Move such operations outside the locked region if possible, but be careful not to introduce race conditions by doing so.

Example: Hashtable Update

Suppose we want to update a value in a hashtable by looking it up, performing an expensive calculation on it, and inserting the new value. Assume lock guards the entire table.

Too Long:

synchronized(lock) {
    v1 = table.lookup(k);
    v2 = expensive(v1); // Expensive computation *inside* lock
    table.remove(k);
    table.insert(k, v2);
}

Problem: The table is locked during the potentially long expensive(v1) call, blocking all other threads.

Too Short:

synchronized(lock) {
    v1 = table.lookup(k);
} // Lock released
v2 = expensive(v1);
synchronized(lock) { // Lock re-acquired
    table.remove(k);
    table.insert(k, v2);
}

Problem: Another thread could modify the entry for key k between the two synchronized blocks. We might remove the wrong entry or insert based on stale data (v1). This is a bad interleaving.

Just Right (Optimistic Approach):

done = false;
while (!done) {
    synchronized(lock) {
        v1 = table.lookup(k); // Read under lock
    } // Release lock for expensive computation
    v2 = expensive(v1);
    synchronized(lock) { // Re-acquire lock
        // Check if the value hasn't changed since we last looked
        if (table.lookup(k) == v1) {
            table.remove(k);
            table.insert(k, v2);
            done = true; // Success!
        }
        // else: Value changed, loop again to retry
    }
}

This performs the expensive computation outside the lock but re-checks the state before committing the update, retrying if necessary. (This is a form of optimistic concurrency).

Atomicity

Atomicity: An operation is atomic if it appears to other threads as if it occurred instantaneously and indivisibly. No other thread can observe the operation partly executed.

Guideline #4: Think in terms of what operations need to be atomic.

  1. Identify the high-level operations that must be atomic to maintain correctness (e.g., withdraw, peek, the hashtable update).
  2. Make critical sections just long enough to preserve the atomicity of these operations.
  3. Design your locking strategy (which locks to use, when to acquire/release) to correctly implement these critical sections.

Think about atomicity first, then locks.
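For example, a check-then-act sequence such as “pop if non-empty” is only correct if the whole sequence forms one critical section. The sketch below assumes a stack whose methods synchronize on the stack object itself (as java.util.Stack does), so client-side locking on that object works:

// Broken: each call is individually synchronized, but another thread may pop
// the last element between isEmpty() and pop().
if (!stack.isEmpty()) {
    stack.pop();
}

// Correct: the compound check-then-act is made atomic under the stack's lock.
synchronized (stack) {
    if (!stack.isEmpty()) {
        stack.pop();
    }
}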

Don’t Roll Your Own Concurrent Data Structures

  • Implementing correct and efficient concurrent data structures (especially with fine-grained locking) is extremely difficult and error-prone.
  • Standard libraries provide thread-safe concurrent data structures (e.g., java.util.concurrent.ConcurrentHashMap) written and tested by experts.

Practical Guideline: Use built-in concurrent libraries whenever they meet your needs.

Guideline for this course: Implement things yourself to understand the concepts and challenges involved, but be aware that production code should rely on well-tested library implementations.
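For instance, java.util.concurrent.ConcurrentHashMap already provides atomic per-key updates via compute, so the manual retry loop from the hashtable example is unnecessary for cheap updates. A sketch (with one caveat: the lambda runs while the map holds an internal lock for that entry, so expensive computations still belong outside, as in the optimistic pattern above):

import java.util.concurrent.ConcurrentHashMap;

public class LibraryUpdate {
    private final ConcurrentHashMap<String, Integer> table = new ConcurrentHashMap<>();

    // Atomically read-modify-write the entry for key k.
    void increment(String k) {
        table.compute(k, (key, old) -> old == null ? 1 : old + 1);
    }
}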

Recap

Here is a recap of what we have covered so far:

Virtual Threads – Project Loom

Not exam relevant

We delve into a significant advancement in Java concurrency: Virtual Threads, introduced by Project Loom. We’ll explore how they aim to simplify high-throughput concurrent applications by rethinking the relationship between Java threads and operating system threads.

Big Picture (Part I)

Let’s quickly revisit our system overview. We have the hardware (CPU cores, Memory), the Operating System (managing resources, OS threads, OS scheduler), and the JVM (managing memory, JVM threads, JVM scheduler). Traditionally, Java threads (JVM threads in this diagram) have had a close relationship with OS threads. Project Loom, represented by the “Virtual threads” addition in the diagram, introduces a new layer of abstraction here.

Plan

  1. Java Threads Recap: Briefly review the traditional threading model in Java.
  2. Virtual Threads (Project Loom): Introduce the concepts, motivation, and mechanics behind virtual threads.
  3. Code Examples: See how to create and use virtual threads in practice.


Java Threads Recap

Before diving into virtual threads, let’s refresh our understanding of traditional Java threads.

Threads Overview

  • Definition: A thread is essentially a sequence of instructions executed one at a time. It’s the smallest unit of processing that can be managed independently by a scheduler and performed concurrently with other threads. Every Java application starts with at least one thread: the main thread.
  • Importance: Threads enable multitasking (doing multiple things seemingly at once) and can lead to more efficient CPU utilization by allowing computation to proceed while other parts of the application might be waiting (e.g., for I/O).
  • Java Concurrency Model (Traditional): Relies heavily on the concepts of Thread objects, locks (often intrinsic monitor locks via synchronized), and coordination mechanisms like wait() and notify().

Creating Traditional Threads

There are two primary ways to define the code a thread will execute:

  1. Extending Thread Class: Create a subclass of java.lang.Thread and override the public void run() method. Instantiate your subclass and call the start() method (which internally calls your run() method in a new thread of execution).

    class ConcurrWriter extends Thread {
        public void run() {
            // code here executes concurrently with caller
        }
    }
    // ...
    ConcurrWriter writerThread = new ConcurrWriter();
    writerThread.start(); // Creates a new OS thread and calls run()
  2. Implementing Runnable Interface: Implement the java.lang.Runnable interface, which requires defining the public void run() method. Create an instance of your Runnable class and pass it to the Thread constructor. Then call start() on the Thread object. This approach is often preferred as it separates the task (the Runnable) from the execution mechanism (the Thread).

    public class ConcurrWriter implements Runnable {
        public void run() {
            // code here executes concurrently with caller
        }
    }
    // ...
    ConcurrWriter writerTask = new ConcurrWriter();
    Thread t = new Thread(writerTask);
    t.start(); // Creates a new OS thread and calls writerTask.run()

Thread State Model

Threads transition through various states during their lifecycle: NEW, RUNNABLE (potentially waiting for the scheduler or actually running), BLOCKED/WAITING/TIMED_WAITING (e.g., waiting for I/O, a lock, wait(), sleep()), and TERMINATED.

Synchronization and Thread Safety

When multiple threads access shared, mutable resources, we need:

  • Synchronization: Mechanisms to control access, preventing data corruption and inconsistencies.
  • Thread Safety: Ensuring our code behaves correctly even when accessed by multiple threads concurrently, avoiding issues like race conditions and deadlocks.

Java Concurrency Model Summary

The traditional model revolves around:

  • Intrinsic Locks (synchronized): Every object has a lock. The synchronized keyword acquires and releases this lock automatically. Locks are re-entrant.
  • wait()/notify()/notifyAll(): Mechanisms for threads to coordinate by waiting for conditions and notifying others when conditions change (used in conjunction with locks; a sketch follows this list).
  • java.util.concurrent: A package providing higher-level concurrency utilities like explicit locks, atomic variables, thread pools (ExecutorService), and concurrent collections, often offering better performance and scalability than basic synchronized/wait/notify.
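As a reminder, a minimal guarded-block sketch using these primitives (illustrative only; a single-slot mailbox where put overwrites any unread message):

class Mailbox {
    private String message; // guarded by the intrinsic lock of 'this'

    public synchronized void put(String m) {
        message = m;
        notifyAll(); // wake up threads waiting in take()
    }

    public synchronized String take() throws InterruptedException {
        while (message == null) { // always re-check the condition in a loop
            wait();               // releases the lock while waiting
        }
        String m = message;
        message = null;
        return m;
    }
}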

Challenges with Traditional Threads

While powerful, the traditional Java threading model faces several challenges, especially in modern high-throughput applications:

  1. Resource Consumption:
    • High Memory Usage: Each thread requires a significant amount of memory for its stack (typically 1MB or more).
    • CPU Overhead: Creating, managing, and context-switching threads incurs CPU overhead.
  2. Scalability Issues:
    • Limited Thread Count: Operating systems impose limits on the number of threads that can be created (due to memory and scheduling constraints). Creating thousands of threads is often feasible, but millions typically are not.
    • Thread Starvation: If many threads are competing for limited CPU resources or locks, some threads might get very little execution time.
  3. Complexity:
    • Concurrency Bugs: Writing correct concurrent code is difficult. Race conditions, deadlocks, and livelocks are common pitfalls.
    • Lifecycle Management: Explicitly managing thread lifecycles can be cumbersome.
  4. Blocking I/O:
    • When a traditional thread performs a blocking I/O operation (e.g., reading from a network socket), the underlying OS thread is typically blocked and cannot be used for other work, even though the CPU might be idle. This leads to inefficient resource utilization, especially in I/O-bound applications.

The OS Thread Bottleneck

A core reason for these challenges is that traditional Java threads (often called “platform threads”) are essentially thin wrappers around Operating System (OS) threads.

Consequences of the 1:1 Mapping:

  • start() Creates OS Thread: Calling Thread.start() usually results in a system call to create a new, relatively heavyweight OS thread.
  • Limited OS Threads: The number of OS threads a system can support is finite and limited by system resources (memory, kernel structures).
  • Limited Throughput: Because OS threads are expensive, you cannot efficiently create and manage millions of them. This limits the throughput of applications that try to use a “thread-per-task” model for a very large number of tasks.
  • Wasted Resources: Blocking a platform thread (e.g., on I/O) blocks the underlying OS thread, wasting that resource.

OS Scheduling Limitations:

  • The OS scheduler manages OS threads but is generally agnostic to the application-level logic.
  • No Control Over Next Thread: You can’t easily control which Java thread runs next. Priorities are hints at best.
  • Inefficient Waking: notify() wakes an arbitrary waiting thread; notifyAll() wakes all, which can be inefficient (thundering herd problem). There’s no built-in way to wake a specific thread efficiently without external constructs.
  • CPU Scheduling Inefficiency: Threads working on related data might be scheduled on different CPU cores, potentially leading to cache misses and reduced performance. The OS scheduler doesn’t have this application-level context.

Virtual Threads – Project Loom

Project Loom introduces virtual threads to overcome the limitations of the traditional thread-per-OS-thread model.

Motivation

“Project Loom aims to drastically reduce the effort of writing, maintaining, and observing high-throughput concurrent applications that make the best use of available hardware.” - Ron Pressler (Tech Lead, Project Loom)

The goal is to make it easier to write scalable concurrent applications using simple, familiar blocking code styles, rather than resorting to complex asynchronous or reactive programming models just to handle a large number of concurrent operations.

Project Loom Components

Project Loom consists of several related features, but the two main ones are:

  • JEP 425: Virtual Threads: Lightweight, user-mode threads managed by the Java Virtual Machine (JVM). (This is our main focus today).
  • JEP 428: Structured Concurrency: An API to treat groups of related tasks running in different threads as a single unit of work (simplifies error handling and cancellation).

(Note: Project Loom is still evolving, and terminology has changed over time - virtual threads were previously called fibers. Feature status depends on the Java version: virtual threads are a standard, non-preview feature as of Java 21.)

Big Picture with Virtual Threads

Virtual threads are managed by the JVM scheduler, which maps them onto a smaller pool of OS threads (often called “carrier threads”).

Virtual Threads in a Nutshell

  • Decoupling: Virtual threads decouple the application’s unit of concurrency (the virtual thread) from the OS’s unit of concurrency (the OS thread).
  • Carrier Threads: A pool of standard platform threads (OS threads) is still used under the hood. These are called carrier threads.
  • JVM Scheduling: The JVM scheduler efficiently maps (mounts) virtual threads onto carrier threads to execute their code.
  • Blocking: Crucially, when a virtual thread blocks on an operation (like I/O), it unmounts from its carrier thread, freeing the carrier thread to run another virtual thread. The OS thread itself does not block.
  • Efficiency: This allows a small number of OS threads to handle a very large number (potentially millions) of virtual threads, significantly reducing resource consumption and context-switching overhead compared to using platform threads directly for every task.

Platform Threads vs. Virtual Threads

Project Loom introduces two distinct types of threads in Java:

  1. Platform Threads: The traditional Java threads we’ve always used. They correspond directly (1:1 mapping) to an underlying OS thread. They are relatively heavyweight and resource-intensive. The number you can create is limited (thousands).
  2. Virtual Threads: Lightweight threads managed entirely by the JVM. They do not have their own dedicated OS thread. Many virtual threads run their code on a shared pool of platform (carrier) threads.

Key Advantage: Project Loom allows the creation of millions of virtual threads, whereas platform threads are typically limited to the thousands.

What is a Virtual Thread?

  • It’s a java.lang.Thread: From the perspective of Java code, a virtual thread is an instance of Thread. You interact with it using the same API (mostly). It appears in debuggers and profilers like any other thread.
  • JVM Entity, Not OS Wrapper: Unlike platform threads, it’s not just a wrapper around an OS thread. It’s managed within the JVM.
  • Cheap to Create: Creating a virtual thread is very fast and consumes minimal memory compared to a platform thread. Don’t pool virtual threads – create them as needed.
  • Cheap to Block: Blocking a virtual thread (e.g., on I/O, sleep, BlockingQueue.take) typically does not block the underlying OS (carrier) thread. The virtual thread is simply unmounted, and the carrier thread is free to run another virtual thread. This allows for writing simple, synchronous, blocking code that scales very well.
  • Minimal Language Changes: Designed to work with existing Java code and APIs with few (if any) changes required.

When to Use Virtual Threads

Virtual threads excel in high-throughput concurrent applications where tasks spend significant time waiting, such as:

  • Server applications handling many concurrent client connections (thread-per-request model).
  • Tasks involving frequent blocking I/O operations (file access, database queries).
  • Tasks making network calls to other services.

They are less beneficial for long-running, CPU-intensive operations. If a task continuously uses the CPU without blocking, running it on a virtual thread provides little advantage over a platform thread, as it will occupy a carrier thread (and thus an OS thread and CPU core) for its entire duration anyway. Virtual threads primarily save resources during waiting periods.

Continuations and Schedulers: The Magic Behind Loom

How do virtual threads work internally? The implementation relies on two key concepts:

  1. Continuation: A representation of a computation’s state at a specific point, allowing it to be suspended and resumed later. When a virtual thread blocks, its state is captured in a continuation object stored in the JVM’s heap memory.
  2. Scheduler: Responsible for assigning continuations (representing runnable virtual threads) to carrier threads for execution. Project Loom implements its own scheduler(s) within the JVM (typically a work-stealing ForkJoinPool).

Previously, Java relied on the OS for both thread management (scheduling) and representing the execution state (stack). Project Loom moves these responsibilities into the JVM for virtual threads, enabling much lighter-weight suspension and resumption.

The conceptual pseudocode below (loosely modeled on the JDK-internal continuation support, which is not a public API) illustrates how continuations allow suspending (yield) and resuming execution flow.
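// Conceptual pseudocode only: the real classes live in JDK internals
// (jdk.internal.vm.Continuation) and are not exposed to application code.
ContinuationScope scope = new ContinuationScope("example");

Continuation cont = new Continuation(scope, () -> {
    System.out.println("part 1");
    Continuation.yield(scope); // suspend: the stack/state is saved on the heap
    System.out.println("part 2");
});

cont.run(); // prints "part 1", then returns to the caller at the yield point
// ... the carrier thread is free to run other work here ...
cont.run(); // resumes where it left off and prints "part 2"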

OS vs. Runtime Schedulers

  • Platform Threads: Scheduled by the OS scheduler.
  • Virtual Threads: Scheduled by a JVM runtime scheduler (e.g., a ForkJoinPool).

A runtime scheduler has more application-level context than an OS scheduler. For example, if virtual thread A produces data for virtual thread B, the JVM scheduler could try to schedule both A and B on the same carrier thread (and thus the same CPU core) to improve data locality and cache performance. An OS scheduler, being agnostic, cannot easily make such optimizations.

Schedulers in Project Loom

  • The default scheduler for virtual threads is a work-stealing ForkJoinPool, similar to the one used by parallel streams and CompletableFuture.
  • The ExecutorService framework, which we’ve seen before, is used extensively with virtual threads.

Crucially, while the underlying scheduling becomes much more efficient with virtual threads, most application code using standard concurrency APIs (like ExecutorService) remains largely the same. The main change is how you obtain an ExecutorService optimized for virtual threads and the understanding that you can now submit many more blocking tasks without overwhelming the system.

Virtual Threads – Code Examples

Let’s see how to create and use virtual threads.

Creation Method 1: Thread.ofVirtual()

The Thread class now has factory methods to create virtual threads directly. Thread.startVirtualThread(Runnable) is a convenient shortcut.

// Create and start a virtual thread directly
Thread thread = Thread.ofVirtual().start(new SomeRunnable());
 
try {
    thread.join(); // Wait for the virtual thread to finish
} catch (InterruptedException e) {
    e.printStackTrace();
}
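The shortcut mentioned above is equivalent, assuming the same SomeRunnable task:

// Creates and starts a virtual thread in a single call
Thread thread2 = Thread.startVirtualThread(new SomeRunnable());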

Creation Method 2: Thread.Builder Factory

You can use a Thread.Builder for more configuration options.

Thread.Builder builder = Thread.ofVirtual().name("worker-", 0); // Name pattern
 
Runnable task = new SomeRunnable();
 
// name "worker-0"
Thread t1 = builder.start(task);
 
// name "worker-1"
Thread t2 = builder.start(task);
 
try {
    t1.join();
    t2.join();
} catch (InterruptedException e) {
    e.printStackTrace();
}
 
System.out.println(t1.getName() + " terminated");
System.out.println(t2.getName() + " terminated");

Creation Method 3: Executors.newVirtualThreadPerTaskExecutor()

This is often the most practical approach for managing many tasks. It creates an ExecutorService that starts a new virtual thread for every submitted task. There’s no pooling of virtual threads because they are so cheap.

// Executor that creates a new virtual thread for each task
ExecutorService myExecutor = Executors.newVirtualThreadPerTaskExecutor();
 
Future<?> future = myExecutor.submit(new SomeRunnable());
 
try {
    future.get(); // Wait for the task completion
} catch (InterruptedException e) {
    e.printStackTrace();
} catch (ExecutionException e) {
    e.printStackTrace();
} finally {
    myExecutor.shutdown(); // Remember to shut down the executor
}

Comparison: Number of Threads

This code compares creating 1 million virtual threads using newVirtualThreadPerTaskExecutor versus attempting to create 1 million platform threads directly.

int upto = 1_000_000;
 
// Start 1 million virtual threads
System.out.println("virtualExecutor started...");
ExecutorService virtualExecutor = Executors.newVirtualThreadPerTaskExecutor();
for (int i = 0; i < upto; i++) {
    virtualExecutor.submit(new SomeRunnable()); // Submits task to run on a new virtual thread
}
virtualExecutor.shutdown();
// Wait for virtual executor to finish (not shown, but typically needed)
System.out.printf("virtualExecutor finished\nStarting platform threads...\n");
 
// Compare that to 1 million platform threads (Likely fails!)
for (int i = 0; i < upto; i++) {
    Thread t = new Thread(new SomeRunnable()); // Creates a platform thread
    t.start();
    // Joining these would be necessary but omitted for brevity
}
System.out.println("platformExecutor finished"); // Might not reach here

Attempting to create a million platform threads will almost certainly fail with an OutOfMemoryError because each requires significant OS resources. Virtual threads handle this scale easily.

Runtime Comparison

The runtime comparison (shown as a graph in the lecture) is dramatic: the cost of creating platform threads grows steeply and quickly becomes prohibitive, while creating virtual threads remains cheap and scales almost linearly, even for millions of threads.

Example Use Case: Thread-Per-Request Server

Virtual threads make the simple “thread-per-request” server model highly scalable again.

while (true) {
    Socket clientSocket = serverSocket.accept();
    // Start a new VIRTUAL thread for each incoming connection
    Thread.ofVirtual().start(() -> {
        try {
            PrintWriter out = new PrintWriter(clientSocket.getOutputStream(), true);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(clientSocket.getInputStream()));
 
            String inputLine;
            while ((inputLine = in.readLine()) != null) { // Blocking read
                System.out.println(inputLine);
                out.println(inputLine); // Blocking write
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
             try { clientSocket.close(); } catch (IOException e) {}
        }
    });
}

Even though readLine() and println might block, they block the virtual thread, not the underlying OS thread. The server can handle thousands or millions of concurrent connections with a small number of OS threads.

Comparison Side-by-Side

| Feature    | Platform Threads                       | Virtual Threads                        |
|------------|----------------------------------------|----------------------------------------|
| Nature     | Scarce, precious resource (OS thread)  | Plentiful (JVM entity)                 |
| Management | Often pooled                           | Never pool; create per task            |
| Scheduling | Managed by OS                          | Managed by JVM (e.g., ForkJoinPool)    |
| Use Case   | Heavier, CPU-intensive workloads       | Small tasks, I/O- or network-intensive |
| Blocking   | Blocks OS thread                       | Unmounts from OS thread                |
| Cost       | High (memory, creation time)           | Very low                               |

Continue here: 14 Exam Preparation (First Half)