title: Learn Java Concurrency & Correctness - Part 035
description: Final production architecture playbook for choosing concurrency models, reviewing correctness, preventing failure modes, and migrating Java systems safely.
series: learn-java-concurrency-correctness
seriesTitle: Learn Java Concurrency & Correctness
order: 35
partTitle: Production Architecture and Final Playbook
tags:
- java
- concurrency
- correctness
- architecture
- production
- review-checklist
- virtual-threads
- structured-concurrency
- reactive
date: 2026-06-28
Part 035 — Production Architecture and Final Playbook
Goal: unify the whole series into a production-grade concurrency architecture playbook: when to choose thread-per-request, virtual threads, executor pools, structured concurrency, event loop, reactive streams, locking, atomics, queues, or actor-like confinement.
This is the final part of the series. The focus is not new APIs but the skill that separates an ordinary senior engineer from one who is genuinely strong at concurrency:
The ability to choose the right concurrency model for a given problem, define invariants, prevent failure modes, test correctness, observe production, and migrate systems without creating hidden risk.
This series has already covered:
- concurrency vs parallelism vs async,
- correctness-first thinking,
- Java Memory Model,
- volatile, final, safe publication,
- locks, conditions, wait/notify,
- liveness failure,
- synchronizers,
- blocking queues,
- concurrent collections,
- atomics and VarHandle,
- executors and thread pools,
- fork/join and parallel streams,
- CompletableFuture,
- async API design,
- virtual threads,
- pinning and JDK 24+,
- structured concurrency,
- scoped values,
- ThreadLocal leakage,
- Reactive Streams and Flow,
- Reactor/RxJava boundaries,
- NIO/event loop,
- timeout, cancellation, deadline,
- testing,
- observability and forensics.
Now we turn all of it into one operational framework.
1. Final Mental Model
Concurrency architecture is not about “using the newest feature”. It is about answering five questions:
- What is the unit of work?
- Who owns the state?
- How is progress guaranteed?
- How is overload bounded?
- How are failures cancelled, cleaned up, and observed?
If even one of these is unclear, the concurrent system will be fragile.
2. Kaufman Final Deconstruction
Josh Kaufman emphasizes deconstruction, self-correction, practice, and removing barriers. For concurrency, the final decomposition looks like this:
| Layer | Competency | Failure if weak |
|---|---|---|
| Memory model | Understand visibility/ordering/happens-before | stale reads, unsafe publication |
| State ownership | Decide who may mutate | data race, invariant corruption |
| Coordination | Choose lock/queue/synchronizer | deadlock, missed signal, starvation |
| Execution | Choose thread/pool/virtual/event-loop | starvation, saturation, wasted CPU |
| Composition | Combine subtasks with a clear lifecycle | orphan work, leak, hidden failure |
| Backpressure | Bound queue/resource/demand | OOM, retry storm, latency collapse |
| Cancellation | Stop work that is no longer useful | pool exhaustion, late side effects |
| Testing | Force interleavings and invariants | false confidence |
| Observability | See failure modes in production | incident blind spot |
| Architecture | Choose a model that fits the workload | accidental complexity |
Top 1% does not mean memorizing every class. Top 1% means being able to see hidden coupling:
“This is not just a timeout bug; this is a missing ownership/cancellation/backpressure invariant across the boundary.”
3. The Concurrency Decision Tree
Use this decision tree before writing code.
A concurrency model is not chosen in isolation. It is chosen together with:
- state ownership model,
- cancellation semantics,
- timeout/deadline policy,
- backpressure policy,
- observability strategy,
- test strategy.
4. Model Selection Matrix
4.1 Execution model
| Model | Best for | Avoid when | Main risk |
|---|---|---|---|
| Plain sequential | Simple logic, no concurrency need | Latency from independent IO dominates | Underutilization |
| Platform thread per request | Low concurrency, legacy stack | Many blocking requests | OS thread exhaustion |
| Virtual thread per task | High-throughput blocking IO, simple code | CPU-bound parallelism or unbounded resource use | Resource saturation masked by cheap threads |
| Fixed thread pool | Bounded CPU/IO work | Pool sizing unknown, hidden blocking | Starvation/deadlock |
| ForkJoinPool | Recursive CPU decomposition | Blocking IO tasks | Common-pool contamination |
| CompletableFuture | One-shot async composition | Complex lifecycle/cancellation | Orphan work, executor ambiguity |
| Structured concurrency | Parent-child subtask lifetime | API unavailable/preview constraints | Misconfigured scope/deadline |
| Event loop | Protocol engine, proxy, streaming transport | Blocking business workflow | Blocking contamination |
| Reactive streams | Continuous async streams with backpressure | Simple CRUD with blocking dependencies | Debuggability/context complexity |
4.2 State model
| State problem | Preferred model |
|---|---|
| Immutable snapshot | record, final fields, safe publication |
| Per-request temporary state | local variables, virtual-thread stack |
| Per-connection state | event-loop attachment or actor-owned state |
| Shared cache value | concurrent collection + immutable values |
| Counter/statistics | LongAdder / AtomicLong |
| Multi-field invariant | lock-protected aggregate |
| Multi-entity workflow | durable transaction/workflow state, not memory lock |
| Cross-node coordination | database/lease/consensus, not JVM lock |
| Request context | explicit parameter or ScopedValue |
| Logging/security context legacy | disciplined ThreadLocal with cleanup |
5. Correctness Architecture: Define Invariants First
Before implementation, write invariants. No concurrency review is serious without them.
5.1 Invariant template
Component:
State:
Owner:
Invariant:
Mutation protocol:
Read protocol:
Failure behavior:
Timeout/cancellation:
Backpressure:
Observability:
Test strategy:
Example:
Component: CaseAssignmentQueue
State:
- pendingAssignments
- assignedOfficerByCaseId
Owner:
- single AssignmentCoordinator actor
Invariant:
- a case has at most one active assigned officer
- pending case cannot also be assigned
Mutation protocol:
- all mutations go through coordinator mailbox
Read protocol:
- snapshot reads only
Failure behavior:
- failed assignment returns case to pending with attempt record
Timeout/cancellation:
- assignment attempt deadline 2s
Backpressure:
- mailbox bounded by tenant quota
Observability:
- queue depth, assignment latency, rejected assignments, duplicate prevention
Test strategy:
- stress concurrent assignment requests, assert uniqueness
This is much more valuable than:
Use ConcurrentHashMap.
ConcurrentHashMap can make map operations safe. It does not make business invariants safe.
6. Ownership Patterns
6.1 Immutable by default
Best concurrency design: no shared mutable state.
public record DecisionSnapshot(
    CaseId caseId,
    CaseStatus status,
    OfficerId owner,
    Instant capturedAt
) {}
Safe if:
- fields are immutable or treated immutably,
- object is safely published,
- readers never mutate nested state.
6.2 Thread confinement
State belongs to one thread or one task.
Virtual thread example:
void handle(Request request) {
    var draft = new DecisionDraft();
    validate(request, draft);
    enrich(request, draft);
    persist(draft);
}
No synchronization required if draft never escapes.
6.3 Actor/event-loop confinement
State belongs to one serialized executor.
coordinator.submit(() -> {
    // only coordinator mutates aggregate state
    assign(caseId, officerId);
});
Good for:
- aggregate invariants,
- per-connection state,
- routing tables,
- sequential protocols,
- state machine transitions.
Risk:
- actor mailbox overload,
- single owner bottleneck,
- blocking inside actor,
- hidden queue latency.
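The confinement pattern above can be sketched with a plain JDK executor. This is a minimal illustration, not the series' implementation: the class name AssignmentCoordinator, the String ids, and the mailbox capacity are assumptions. One worker thread owns the map, and the bounded mailbox rejects new commands instead of growing without limit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical coordinator: one thread owns the state, callers only submit commands. */
final class AssignmentCoordinator {
    // Touched only by the single worker thread, so a plain HashMap is sufficient.
    private final Map<String, String> officerByCaseId = new HashMap<>();

    // Exactly one worker plus a bounded mailbox; AbortPolicy throws
    // RejectedExecutionException when the mailbox is full.
    private final ThreadPoolExecutor mailbox;

    AssignmentCoordinator(int mailboxCapacity) {
        this.mailbox = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(mailboxCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Completes with true if this call assigned the case, false if it was already assigned. */
    Future<Boolean> assign(String caseId, String officerId) {
        return mailbox.submit(() -> officerByCaseId.putIfAbsent(caseId, officerId) == null);
    }

    void shutdown() {
        mailbox.shutdown();
    }
}
```

Because every mutation runs on the same thread, the "at most one officer per case" invariant needs no lock. The cost is the one named in the risk list: all commands queue behind each other, so nothing inside a command should block.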
6.4 Lock-protected aggregate
Use lock when multiple fields must change atomically.
final class CaseState {
    private final ReentrantLock lock = new ReentrantLock();
    private CaseStatus status;
    private OfficerId owner;
    private long version;

    Assignment assign(OfficerId officer, Deadline deadline) throws Exception {
        if (!lock.tryLock(deadline.remainingNanos(), TimeUnit.NANOSECONDS)) {
            throw new TimeoutException("case lock timeout");
        }
        try {
            if (status != CaseStatus.OPEN) {
                throw new InvalidTransitionException();
            }
            status = CaseStatus.ASSIGNED;
            owner = officer;
            version++;
            return new Assignment(owner, version);
        } finally {
            lock.unlock();
        }
    }
}
Lock rule:
Lock protects invariants, not code blocks.
If you cannot name the invariant, you probably should not add a lock yet.
6.5 Atomic state transition
Use atomic when invariant fits one reference.
record State(CaseStatus status, OfficerId owner, long version) {}

final AtomicReference<State> state =
    new AtomicReference<>(new State(CaseStatus.OPEN, null, 0));

boolean assign(OfficerId officer) {
    while (true) {
        State current = state.get();
        if (current.status() != CaseStatus.OPEN) {
            return false;
        }
        State next = new State(CaseStatus.ASSIGNED, officer, current.version() + 1);
        if (state.compareAndSet(current, next)) {
            return true;
        }
    }
}
Good:
- small immutable state,
- clear transition,
- low contention,
- no blocking.
Bad:
- complex multi-step workflows,
- blocking inside CAS loop,
- side effects during retry,
- large mutable object graphs.
7. Capacity Architecture
Concurrency without capacity limits is not architecture. It is delayed failure.
7.1 Every queue must answer these questions
Queue:
Owner:
Capacity:
Enqueue policy:
Dequeue policy:
Rejection policy:
Metrics:
Example:
Queue: risk-evaluation-worker-queue
Owner: RiskEvaluationExecutor
Capacity: 500 tasks
Enqueue policy: offer until request deadline
Dequeue policy: FIFO, priority for regulatory deadline cases
Rejection policy: return 503 / reschedule async workflow
Metrics: depth, age, rejected, execution latency
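A minimal sketch of the "offer until request deadline" enqueue policy, assuming a hypothetical BoundedWorkQueue wrapper around ArrayBlockingQueue. The caller passes the time left in its request budget; a false return is the rejection signal that the caller maps to 503 or a reschedule.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical bounded queue whose enqueue waits no longer than the caller's remaining budget. */
final class BoundedWorkQueue<T> {
    private final BlockingQueue<T> queue;

    BoundedWorkQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false when the task is rejected: budget already spent, or still full at the deadline. */
    boolean enqueue(T task, long remainingNanos) throws InterruptedException {
        if (remainingNanos <= 0) {
            return false;
        }
        return queue.offer(task, remainingNanos, TimeUnit.NANOSECONDS);
    }

    /** Blocks until a task is available (FIFO order). */
    T dequeue() throws InterruptedException {
        return queue.take();
    }

    /** Current depth, suitable for export as a metric. */
    int depth() {
        return queue.size();
    }
}
```

Priority handling and queue-age tracking from the example are left out of this sketch; they would need a different queue type and a timestamp per task.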
7.2 Bulkheads
Bulkhead by resource, not by random service name.
| Resource | Bulkhead |
|---|---|
| Database pool | max connections + query timeout |
| External API | semaphore + deadline |
| CPU-heavy validator | fixed pool |
| Event loop | no blocking + offload cap |
| Case aggregate actor | bounded mailbox |
| Tenant | quota by tenant |
| Notification sender | bounded queue + retry budget |
7.3 Little’s Law intuition
For a stable system:
concurrency ≈ throughput × latency
If average latency rises while arrival rate stays constant, in-flight work rises. If in-flight work is unbounded, memory and queues grow.
This is why slow dependencies cause queue explosion.
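A small worked example of that arithmetic, with illustrative numbers that are not from any measurement: at 200 requests per second and 50 ms average latency, about 10 requests are in flight. If a dependency slows the same traffic to 2 s, about 400 are in flight. The helper name LittlesLaw is an assumption.

```java
/** Hypothetical helper for back-of-envelope capacity estimates using Little's Law. */
final class LittlesLaw {
    private LittlesLaw() {}

    /** Average in-flight work = arrival rate (per second) x average latency (seconds). */
    static double inFlight(double arrivalsPerSecond, double latencySeconds) {
        return arrivalsPerSecond * latencySeconds;
    }
}
```

Comparing the estimate against pool sizes, semaphore permits, and queue capacity shows quickly which limit a slow dependency will hit first.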
8. Deadline Architecture
Use one top-level deadline and propagate it.
Rules:
- Parent creates deadline.
- Child receives deadline.
- Child fails fast if expired.
- Resource acquisition respects deadline.
- Client-native timeout respects deadline.
- Retry consumes same deadline.
- Cancellation cleans up child work.
- Side-effecting operations use idempotency key.
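The code in this part calls Deadline.after, remainingNanos, and throwIfExpired without showing the type. The following is one possible sketch of such a class, not the series' definition; isExpired is an added convenience. It stores an absolute point on the System.nanoTime() clock so that every child sees the same budget shrink.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;

/** Hypothetical Deadline: an absolute instant on the monotonic System.nanoTime() clock. */
final class Deadline {
    private final long deadlineNanos;

    private Deadline(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    /** Created once by the parent, then passed down unchanged to children. */
    static Deadline after(Duration budget) {
        return new Deadline(System.nanoTime() + budget.toNanos());
    }

    /** Remaining budget in nanoseconds, never negative. */
    long remainingNanos() {
        // Subtracting nanoTime values stays correct even if the counter wraps around.
        return Math.max(0L, deadlineNanos - System.nanoTime());
    }

    boolean isExpired() {
        return remainingNanos() == 0L;
    }

    /** Fail fast before starting work that can no longer finish in time. */
    void throwIfExpired() throws TimeoutException {
        if (isExpired()) {
            throw new TimeoutException("deadline expired");
        }
    }
}
```

Resource acquisition and retries then take deadline.remainingNanos() as their timeout, so a retry consumes the same budget instead of starting a new one.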
8.1 Timeout hierarchy
Total request deadline
├── queue wait budget
├── auth budget
├── data load budget
│ ├── connection acquire timeout
│ ├── query timeout
│ └── result mapping time
├── external enrichment budget
│ ├── bulkhead acquire timeout
│ ├── connect timeout
│ └── read timeout
└── response write budget
Each timeout should be observable separately.
9. Cancellation Architecture
Cancellation must follow ownership.
9.1 Cancellation propagation
9.2 Cancellation contract
Every public async/concurrent API should state:
If caller cancels:
- does waiting stop?
- does underlying work stop?
- is thread interrupted?
- is socket/request closed?
- are permits released?
- are child tasks cancelled?
- can operation still complete late?
- are side effects idempotent?
If the answer is “unknown”, the API is not production-ready.
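One way to answer "does underlying work stop?" is to probe it. The sketch below is a hypothetical probe (CancellationProbe is not from the series) for one specific case: a task submitted to an ExecutorService and cancelled with Future.cancel(true), where Thread.sleep stands in for a blocking call.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical probe: checks that cancelling a Future really stops the work behind it. */
final class CancellationProbe {
    private CancellationProbe() {}

    static boolean cancelStopsWork() throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch stopped = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);
        try {
            Future<?> work = executor.submit(() -> {
                started.countDown();
                try {
                    Thread.sleep(60_000); // stands in for a blocking dependency call
                } catch (InterruptedException e) {
                    interrupted.set(true);
                    Thread.currentThread().interrupt(); // keep the interrupt visible to callers
                } finally {
                    stopped.countDown(); // cleanup point: release permits, close resources
                }
            });
            started.await();
            work.cancel(true); // true = interrupt the worker thread if the task is running
            boolean finished = stopped.await(5, TimeUnit.SECONDS);
            return finished && interrupted.get();
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The same probe written against CompletableFuture would give a different answer: its cancel method documents that the mayInterruptIfRunning argument has no effect, so the waiting stops while the underlying work keeps running.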
10. Choosing Between Virtual Threads and Reactive
This deserves a dedicated final rule set.
10.1 Prefer virtual threads when
- workflow is request/response,
- business logic is naturally sequential,
- dependencies are blocking but bounded,
- team values debuggable stack traces,
- context propagation should be simple,
- per-request state fits stack/local variables,
- backpressure can be represented by semaphores/queues/pools,
- streaming is not the core abstraction.
Example:
Response handle(Request request) throws InterruptedException {
    Deadline deadline = Deadline.after(Duration.ofMillis(500));
    try (var scope = StructuredTaskScope.open(joiner, config)) {
        var customer = scope.fork(() -> customerClient.get(request.customerId(), deadline));
        var risk = scope.fork(() -> riskClient.evaluate(request.customerId(), deadline));
        scope.join(); // waits for both subtasks; interruptible
        return compose(customer.get(), risk.get());
    }
}
10.2 Prefer reactive/event-loop when
- protocol is streaming,
- backpressure is part of contract,
- many mostly-idle connections,
- framework stack is already reactive,
- transport is Netty/event-loop based,
- you need demand-driven pipelines,
- you are building a gateway/proxy/stream processor,
- non-blocking client ecosystem is mature for dependencies.
Example:
Flux<Event> stream =
    source.receive()
        .limitRate(256)
        .flatMap(this::enrichAsync, 32)
        .timeout(Duration.ofSeconds(2))
        .onBackpressureBuffer(1000);
10.3 Avoid hybrid confusion
Dangerous hybrid:
- reactive pipeline with random blocking calls,
- virtual threads calling .block() inside event-loop thread,
- CompletableFuture using common pool for blocking IO,
- ThreadLocal context expected to cross reactive boundaries,
- unbounded flatMap,
- event-loop code waiting on future.
Bridge with explicit boundaries:
- boundedElastic or equivalent for blocking in reactive,
- virtual-thread executor for blocking adapters,
- ScopedValue only where lexical/thread inheritance is valid,
- explicit context for reactive,
- semaphores for external dependency concurrency,
- deadlines everywhere.
11. Choosing Between Locks, Atomics, Queues, and Actors
11.1 Quick matrix
| Need | Use |
|---|---|
| Protect multiple mutable fields | Lock |
| Wait for condition with explicit predicate | Condition or blocking queue |
| Single numeric metric under high contention | LongAdder |
| Single state reference transition | AtomicReference |
| Producer-consumer handoff | BlockingQueue |
| Limit concurrent access to resource | Semaphore |
| Coordinate one-time start/finish | CountDownLatch |
| Repeated phase coordination | Phaser |
| Serialize aggregate mutations | Actor/serial executor |
| Avoid sharing entirely | Immutability/confinement |
11.2 What not to do
- Do not use atomics to protect multi-field mutable object graphs unless state is immutable snapshot.
- Do not use ConcurrentHashMap as substitute for transaction/invariant.
- Do not use synchronized around blocking external calls.
- Do not use unbounded queues to “absorb spikes”.
- Do not use parallel streams in request path without measuring and isolating pool.
- Do not use CompletableFuture without explicit executor ownership.
- Do not assume timeout cancels underlying work.
- Do not assume virtual threads remove resource limits.
12. Production Review Template
Use this template for design review.
# Concurrency Design Review
## Component
Name:
Owner:
Runtime:
Criticality:
## Workload
Request rate:
Peak rate:
Latency target:
Work type:
- CPU-bound:
- IO-bound:
- blocking:
- streaming:
- bursty:
## Execution Model
Chosen model:
Alternatives considered:
Why chosen:
Thread/executor/event-loop ownership:
Blocking policy:
Offload policy:
## State Ownership
Mutable state:
Owner:
Invariant:
Mutation protocol:
Read protocol:
Publication protocol:
## Capacity
Queues:
Limits:
Bulkheads:
Rejection behavior:
Tenant fairness:
Slow consumer policy:
## Timeout / Deadline
Top-level deadline:
Phase timeouts:
Dependency timeouts:
Retry budget:
Shutdown deadline:
## Cancellation
Cancellation triggers:
Propagation path:
Underlying resource cancellation:
Cleanup:
Late completion policy:
## Failure Modes
Data race:
Deadlock:
Starvation:
Livelock:
Pool exhaustion:
Queue growth:
Slow dependency:
Slow consumer:
Orphan work:
Context leak:
Unknown side effect:
## Observability
Metrics:
Logs:
Traces:
JFR events:
Thread dump strategy:
Alert thresholds:
## Tests
Unit deterministic:
Stress:
JCStress-style:
Timeout/cancellation:
Leak:
Load:
Failure injection:
Regression from incidents:
This template forces architecture-level thinking.
13. Failure Mode Catalogue
13.1 Data race
Signal:
- non-deterministic state,
- impossible values,
- flakiness,
- only appears under load.
Prevention:
- immutability,
- safe publication,
- lock ownership,
- actor confinement,
- concurrent collection with immutable values.
13.2 Unsafe publication
Signal:
- partially initialized objects observed,
- stale config,
- null field impossible by constructor logic.
Prevention:
- final fields,
- immutable objects,
- volatile reference,
- lock handoff,
- static initialization,
- thread-safe containers.
13.3 Deadlock
Signal:
- thread dump shows cycles,
- no CPU,
- requests hang.
Prevention:
- lock ordering,
- no external calls under lock,
- timed lock,
- smaller critical sections,
- actor confinement.
13.4 Starvation
Signal:
- some work never runs,
- pool busy with wrong work,
- priority inversion.
Prevention:
- separate pools,
- bounded queues,
- fair locks only when justified,
- queue age metrics,
- bulkheads.
13.5 Thread pool deadlock
Signal:
- tasks wait for subtasks submitted to same saturated pool.
Prevention:
- structured concurrency,
- avoid blocking joins inside fixed pool,
- use separate executor,
- use virtual threads for blocking composition,
- enforce pool ownership.
13.6 Event-loop blocking
Signal:
- p99 spikes,
- low CPU,
- event loop thread stack in DB/logging/lock/sleep.
Prevention:
- offload blocking work,
- blocking guard,
- event-loop lag metric,
- bounded worker queue.
13.7 Queue explosion
Signal:
- memory growth,
- old work,
- latency grows before error rate.
Prevention:
- bounded queues,
- rejection policy,
- deadline-aware queueing,
- load shedding,
- queue age metrics.
13.8 Orphan work
Signal:
- work continues after request timeout,
- executor busy with irrelevant tasks.
Prevention:
- structured scopes,
- cancellation handles,
- deadline check before execution,
- Future.cancel,
- underlying resource close.
13.9 Context leak
Signal:
- wrong tenant/user/correlation id,
- security context bleed,
- MDC wrong in logs.
Prevention:
- ScopedValue,
- explicit context,
- ThreadLocal.remove,
- executor wrappers,
- reactive context discipline.
13.10 Unknown side effect
Signal:
- caller timed out but external action may have happened.
Prevention:
- idempotency key,
- operation status table,
- reconciliation,
- compensation workflow,
- audit trail.
14. Migration Playbook: Legacy Thread Pools to Virtual Threads
Virtual threads are powerful, but migration must be controlled.
14.1 Good candidates
- synchronous REST handlers,
- blocking HTTP clients,
- blocking database access with bounded connection pool,
- per-request orchestration,
- file/network IO where JDK/libraries cooperate,
- code where stack traces matter.
14.2 Bad first candidates
- CPU-bound workloads,
- code relying heavily on thread-local mutable caches,
- code with unbounded fan-out,
- event-loop frameworks where blocking is forbidden,
- code with hidden native blocking,
- code without timeout/cancellation discipline.
14.3 Migration steps
- Inventory blocking paths.
- Inventory resource pools: DB, HTTP, file, locks, semaphores.
- Add top-level deadlines.
- Add dependency-native timeouts.
- Add bulkheads.
- Replace platform-thread request executor with virtual-thread-per-task executor.
- Remove unnecessary large IO thread pools.
- Keep CPU-bound pools bounded.
- Audit ThreadLocal usage.
- Add JFR/thread dump observability.
- Load test with slow dependencies.
- Roll out gradually.
14.4 Virtual thread anti-patterns
Bad:
// "Threads are cheap, so no limit"
for (Item item : millionItems) {
    Thread.startVirtualThread(() -> callExternalApi(item));
}
Better:
Semaphore permits = new Semaphore(100);
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (Item item : items) {
        executor.submit(() -> {
            permits.acquire(); // may throw InterruptedException
            try {
                callExternalApi(item);
            } finally {
                permits.release();
            }
            return null; // makes the lambda a Callable, which may throw checked exceptions
        });
    }
}
Cheap threads do not make external systems infinite.
15. Migration Playbook: Ad Hoc Futures to Structured Concurrency
Ad hoc future code often leaks lifecycle.
Bad:
CompletableFuture<A> a = callA();
CompletableFuture<B> b = callB();
return a.thenCombine(b, this::combine);
Potential issues:
- executor unclear,
- cancellation unclear,
- one failure may not cancel sibling,
- deadline not shared,
- child work may outlive parent,
- context propagation unclear.
Migration:
- Identify parent operation.
- Identify child tasks.
- Define success policy: all success? first success? quorum?
- Define failure policy.
- Define deadline.
- Use structured scope where available.
- Pass deadline to child dependencies.
- Cancel siblings on failure/deadline.
- Observe scope as one operation.
Structured design:
try (var scope = StructuredTaskScope.open(joiner, config)) {
    var a = scope.fork(() -> callA(deadline));
    var b = scope.fork(() -> callB(deadline));
    scope.join();
    return combine(a.get(), b.get());
}
Where structured concurrency is preview/unavailable, emulate the same lifecycle discipline manually.
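A rough sketch of that manual emulation, using only ExecutorService and Future. ManualScope and its both method are hypothetical names. It is deliberately weaker than a real structured scope: it waits for A before B, so a fast failure in B is noticed only after A finishes or the deadline passes, and it does not wait for cancelled children to terminate.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BiFunction;

/** Hypothetical helper: two children share one deadline and never outlive the parent call. */
final class ManualScope {
    private ManualScope() {}

    static <A, B, R> R both(ExecutorService executor,
                            Callable<A> taskA,
                            Callable<B> taskB,
                            BiFunction<A, B, R> combine,
                            long timeout,
                            TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        long deadlineNanos = System.nanoTime() + unit.toNanos(timeout);
        Future<A> a = executor.submit(taskA);
        Future<B> b = executor.submit(taskB);
        try {
            // Both waits draw on the same budget, not a fresh timeout each.
            A resultA = a.get(deadlineNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
            B resultB = b.get(deadlineNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
            return combine.apply(resultA, resultB);
        } finally {
            // No-op for completed children; interrupts any child still running
            // after a failure, a timeout, or an interrupt of the parent.
            a.cancel(true);
            b.cancel(true);
        }
    }
}
```

The executor is passed in explicitly so that ownership is visible at the call site, which addresses the "executor unclear" issue in the list above.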
16. Migration Playbook: ThreadLocal to ScopedValue
16.1 When to migrate
Migrate request-scoped immutable context:
- correlation id,
- tenant id,
- deadline,
- read-only security principal,
- tracing metadata,
- locale,
- request feature flags.
Be careful with mutable context:
- transaction/session,
- persistence context,
- mutable security context,
- accumulators,
- caches.
16.2 ThreadLocal risk
USER_CONTEXT.set(user);
try {
    service.handle();
} finally {
    USER_CONTEXT.remove();
}
Problems:
- forgotten cleanup,
- thread pool reuse leak,
- async boundary loss,
- virtual thread memory overhead if abused,
- unclear ownership.
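The "thread pool reuse leak" can be reproduced in a few lines. This is an illustrative sketch with hypothetical names (ThreadLocalLeakDemo, the "alice" value): two tasks run on the same reused worker thread, and the second observes whatever the first left behind unless the first removes it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical demo: a ThreadLocal value surviving from one pooled task to the next. */
final class ThreadLocalLeakDemo {
    static final ThreadLocal<String> USER_CONTEXT = new ThreadLocal<>();

    private ThreadLocalLeakDemo() {}

    /** Runs two "requests" on one reused worker and returns what the second one observes. */
    static String contextSeenBySecondRequest(boolean cleanUp) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable firstRequest = () -> {
                USER_CONTEXT.set("alice");
                try {
                    // ... handle the first request ...
                } finally {
                    if (cleanUp) {
                        USER_CONTEXT.remove();
                    }
                }
            };
            pool.submit(firstRequest).get();

            // The second request never sets a context of its own.
            Callable<String> secondRequest = USER_CONTEXT::get;
            return pool.submit(secondRequest).get();
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Without cleanup the second request sees "alice"; with cleanup it sees null. In a real service that difference is a cross-user or cross-tenant context bleed.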
16.3 ScopedValue pattern
static final ScopedValue<RequestContext> REQUEST_CONTEXT =
    ScopedValue.newInstance();

ScopedValue.where(REQUEST_CONTEXT, context)
    .run(() -> service.handle(request));
Benefits:
- bounded lifetime,
- immutable sharing,
- easier reasoning,
- child task inheritance with structured concurrency,
- lower risk of stale context.
Rule:
Use ScopedValue for immutable scoped context; use explicit parameters for domain-critical values; avoid hidden mutable global state.
17. Architecture Patterns
17.1 Thread-per-request with virtual threads
Use for:
- synchronous service endpoints,
- request orchestration,
- blocking dependencies,
- straightforward business workflow.
Shape:
Key controls:
- DB pool,
- HTTP client connection pool,
- semaphores,
- deadline,
- cancellation,
- structured subtasks,
- context via scoped value.
17.2 Event-loop transport with worker offload
Use for:
- gateway,
- protocol server,
- high connection count,
- streaming transport.
Shape:
Key controls:
- event-loop non-blocking invariant,
- bounded worker queue,
- outbound pending bytes,
- interest ops,
- close reasons,
- event-loop lag.
17.3 Reactive pipeline
Use for:
- async streams,
- backpressure protocol,
- data/event pipeline,
- non-blocking clients.
Shape:
Key controls:
- bounded concurrency,
- scheduler discipline,
- timeout,
- cancellation,
- context propagation,
- blocking isolation,
- backpressure policy.
17.4 Actor-like aggregate coordinator
Use for:
- complex aggregate invariants,
- sequential state machine,
- workflow coordination,
- high correctness requirements.
Shape:
Key controls:
- bounded mailbox,
- no blocking in actor,
- persistent outbox if side effects,
- snapshot reads,
- command idempotency,
- replay/recovery if durable.
18. Regulatory Workflow Angle
For regulatory systems, concurrency correctness is not merely technical. It affects defensibility.
Common high-risk cases:
- duplicate enforcement actions,
- missed escalation deadline,
- inconsistent case status,
- notification sent twice,
- audit log order mismatch,
- stale authorization context,
- cross-tenant context leak,
- timeout interpreted as failure when outcome unknown,
- race between manual override and automated escalation.
18.1 Domain invariant examples
A case cannot be simultaneously CLOSED and PENDING_REVIEW.
An enforcement action cannot be issued twice for same violation and version.
A manual override must supersede automated escalation if committed first.
Every external notification must have an idempotency key.
Every status transition must have audit causality.
A timeout from remote registry update produces UNKNOWN_REMOTE_OUTCOME, not FAILED.
18.2 Concurrency design implications
- Use versioned state transitions.
- Use optimistic locking at durable boundary.
- Use idempotency keys for side effects.
- Use outbox pattern for events.
- Use reconciliation for unknown outcomes.
- Keep audit append independent but causally linked.
- Use command IDs to deduplicate retries.
- Avoid in-memory locks as source of cross-node truth.
JVM concurrency protects in-process correctness. Regulatory correctness often needs durable concurrency control too.
19. Code Review Red Flags
If you see these, pause review.
19.1 Execution red flags
CompletableFuture.supplyAsync(() -> blockingCall());
No executor.
parallelStream().forEach(item -> repository.save(item));
Blocking side effects in parallel stream.
Executors.newCachedThreadPool();
Unbounded thread creation.
new LinkedBlockingQueue<>();
Unbounded queue by default.
future.get();
No timeout.
Thread.sleep(...);
Suspicious coordination.
19.2 State red flags
static Map<K, V> cache = new HashMap<>();
Shared non-thread-safe state.
volatile List<Item> items;
items.add(x);
Volatile reference does not make list mutation safe.
ConcurrentHashMap<K, MutableValue>
Map is concurrent; value may not be.
if (!map.containsKey(k)) {
    map.put(k, v);
}
Check-then-act race.
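For the check-then-act case, one common repair is to let the map perform the check and the write as a single atomic step. A minimal sketch, assuming a hypothetical Registry wrapper around ConcurrentHashMap:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical registry: "insert only if missing" done as one atomic map operation. */
final class Registry<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<>();

    /** Returns true only for the single caller whose value was stored. */
    boolean registerIfAbsent(K key, V value) {
        // putIfAbsent returns the previous value, or null if this call inserted.
        return map.putIfAbsent(key, value) == null;
    }

    V get(K key) {
        return map.get(key);
    }
}
```

This fixes the race on one key in one map. An invariant that spans several keys or several maps still needs a lock, an owner thread, or a durable transaction.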
19.3 Lock red flags
synchronized (publicObject) { ... }
External code can lock it.
lock.lock();
externalCall();
lock.unlock();
External call under lock.
catch (InterruptedException ignored) {}
Cancellation swallowed.
condition.await();
No loop predicate.
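For the last red flag, the usual repair is to wait in a loop that re-checks the predicate under the lock, since a wakeup can be spurious or can arrive after the state has changed again. A minimal sketch with a hypothetical ReadyGate class:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical one-shot gate showing Condition.await() guarded by a loop predicate. */
final class ReadyGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition readyChanged = lock.newCondition();
    private boolean ready; // guarded by lock

    void awaitReady() throws InterruptedException {
        lock.lock();
        try {
            while (!ready) { // re-check after every wakeup
                readyChanged.await();
            }
        } finally {
            lock.unlock();
        }
    }

    void markReady() {
        lock.lock();
        try {
            ready = true;
            readyChanged.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

The loop also covers the case where markReady runs before awaitReady: the waiter sees ready == true and never blocks, so no signal can be missed.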
19.4 Reactive/event-loop red flags
mono.map(x -> blockingCall(x));
Blocking inside operator without scheduler boundary.
.flatMap(this::call)
No concurrency bound.
eventLoop.execute(() -> repository.findAll());
Blocking event loop.
key.interestOps(OP_READ | OP_WRITE);
Always write-ready spin risk.
20. Production Readiness Checklist
20.1 Design
- Execution model is explicitly chosen.
- State ownership is documented.
- Invariants are written.
- Shared mutable state is minimized.
- Resource limits are explicit.
- Deadlines are propagated.
- Cancellation is defined.
- Backpressure behavior is defined.
- Failure taxonomy is defined.
- Test strategy includes concurrency-specific tests.
20.2 Implementation
- No unbounded executor/queue accidentally.
- No blocking call on event-loop/reactive scheduler.
- No external call under lock.
- No swallowed interrupt.
- No unsafe publication.
- No mutable object stored in concurrent map without ownership.
- No common-pool blocking.
- No timeout without cleanup.
- No retry without deadline.
- No context without cleanup/boundary.
20.3 Operations
- Thread names are meaningful.
- Thread dumps are actionable.
- JFR profile can be captured in production.
- Executor metrics exist.
- Queue depth and age are measured.
- Lock contention is observable.
- Event-loop lag is measured where relevant.
- Timeout phase is labeled.
- Cancellation reason is labeled.
- Late completion/orphan work is observable.
20.4 Resilience
- Slow dependency test passed.
- Slow consumer test passed.
- Pool saturation test passed.
- Queue saturation test passed.
- Cancellation storm test passed.
- Shutdown drain test passed.
- Context leak test passed.
- Duplicate side-effect test passed.
- Unknown outcome reconciliation exists.
21. Load Testing Playbook
Load tests must include failure, not only happy path.
21.1 Baseline tests
- normal load,
- peak load,
- burst load,
- soak test,
- high cardinality tenant test,
- large payload test.
21.2 Failure injection
- dependency latency +500ms,
- dependency timeout,
- connection pool exhaustion,
- slow DB query,
- slow client read,
- event loop blocking injection,
- executor saturation,
- queue full,
- lock holder pause,
- cancellation storm,
- client disconnect storm,
- retry storm,
- GC pressure.
21.3 What to measure
- p50/p95/p99/p999 latency,
- throughput,
- error rate by failure class,
- queue depth,
- queue age,
- active threads,
- virtual thread count,
- carrier thread utilization,
- blocked/waiting thread count,
- lock contention,
- DB pool active/waiting,
- external API concurrency,
- timeouts by phase,
- cancellation latency,
- CPU,
- heap/direct memory,
- GC pause,
- JFR events.
21.4 Pass/fail criteria
A good load test has explicit fail conditions:
p99 < 300ms under normal load
p99 < 800ms under dependency p95=500ms
queue age < request deadline
no unbounded memory growth for 30 min soak
timeout error has phase label
no thread leak after cancellation storm
no pending futures after client disconnect
slow clients are closed before pending bytes > cap
22. Incident Response Playbook
When a concurrency incident happens, avoid random fixes. Follow a structure.
22.1 First classify symptom
| Symptom | Likely class |
|---|---|
| High latency, low CPU | blocking, deadlock, starvation, pool wait |
| High CPU, low throughput | spin, retry storm, busy loop |
| Memory growth | unbounded queue, slow consumer, leak |
| Requests hang | missing timeout, deadlock, orphan wait |
| Wrong user/tenant context | context leak |
| Duplicate action | idempotency/race |
| Thread count explosion | unbounded executor/platform threads |
| DB pool exhausted | leaked connection, slow query, orphan work |
| Event loop lag | blocking contamination |
22.2 Capture evidence
- Thread dump.
- JFR recording.
- Executor/queue metrics.
- DB pool metrics.
- Dependency latency.
- GC logs/metrics.
- Error taxonomy.
- Recent deployment diff.
- Load/retry rate.
- Context/correlation samples.
22.3 Build wait-for graph
Ask:
Request waits for what?
Thread waits for what?
Task waits for what?
Pool waits for what?
Lock waits for what?
Dependency waits for what?
Queue waits for what?
Diagram it:
This often reveals circular dependency or missing bulkhead.
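Once the answers are written down as "X waits for Y" edges, checking for a circular wait can be automated. The sketch below is a plain depth-first search over a hypothetical WaitForGraph, with node names supplied by whoever builds the graph from thread dumps and metrics.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: detects a circular wait in a hand-built wait-for graph. */
final class WaitForGraph {
    private WaitForGraph() {}

    /** Returns true if some node can reach itself by following "waits for" edges. */
    static boolean hasCycle(Map<String, List<String>> waitsFor) {
        Set<String> finished = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String node : waitsFor.keySet()) {
            if (visit(node, waitsFor, finished, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String node,
                                 Map<String, List<String>> waitsFor,
                                 Set<String> finished,
                                 Set<String> onPath) {
        if (onPath.contains(node)) {
            return true; // back edge: the current path loops onto itself
        }
        if (finished.contains(node)) {
            return false; // fully explored earlier, no cycle through it
        }
        onPath.add(node);
        List<String> targets = waitsFor.getOrDefault(node, Collections.<String>emptyList());
        for (String next : targets) {
            if (visit(next, waitsFor, finished, onPath)) {
                return true;
            }
        }
        onPath.remove(node);
        finished.add(node);
        return false;
    }
}
```

A typical finding is "worker-pool waits for subtask, subtask waits for worker-pool", which is the thread pool deadlock from the failure mode catalogue.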
22.4 Mitigate before optimizing
Possible mitigations:
- lower timeout,
- disable retry,
- shed load,
- reduce concurrency to dependency,
- increase connection pool only if dependency can handle it,
- close slow consumers,
- restart stuck nodes,
- enable fallback,
- isolate tenant,
- roll back blocking change,
- switch off expensive feature.
Do not “just increase thread pool” unless wait-for graph proves it is safe.
23. Final Capstone: Designing a Concurrent Case Evaluation Service
23.1 Requirements
A service receives a case evaluation request:
- authenticate user,
- load case,
- load regulatory rules,
- call risk scoring service,
- call sanctions registry,
- evaluate decision,
- persist audit,
- return result under 800ms,
- never duplicate enforcement action,
- support cancellation on client disconnect,
- survive slow risk service,
- preserve tenant context,
- expose diagnostics.
23.2 Architecture
23.3 Execution model
- Use virtual thread per request.
- Use structured concurrency for independent subtasks.
- Use deadline propagated to every dependency.
- Use scoped value for immutable request context.
- Use DB optimistic versioning for durable case transition.
- Use idempotency key for enforcement action.
- Use outbox for external notifications.
- Use semaphores/bulkheads for risk and sanctions calls.
23.4 Critical invariants
Invariant 1:
A case transition is valid only from the latest persisted version.
Invariant 2:
An enforcement action for (caseId, violationId, decisionVersion) is created at most once.
Invariant 3:
A timed-out external side effect is UNKNOWN_REMOTE_OUTCOME until reconciled.
Invariant 4:
Tenant context used for authorization, query, audit, and outbox must be identical.
Invariant 5:
If request deadline expires, child tasks are cancelled and no new side effect begins.
Invariant 6:
Audit record must causally reference decision input version and command id.
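Invariant 2 can be illustrated with an in-memory stand-in for what would normally be a unique constraint in the database. This is a sketch under assumptions: `ActionKey`, `EnforcementActionRecord`, and `EnforcementActionRegistry` are hypothetical names, and the durable version would rely on the database, not on a `ConcurrentHashMap`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical key mirroring Invariant 2: (caseId, violationId, decisionVersion). */
record ActionKey(String caseId, String violationId, long decisionVersion) {}

/** Hypothetical action record; remembers which command created it. */
record EnforcementActionRecord(ActionKey key, String createdByCommandId) {}

/** In-memory stand-in for a table with a unique constraint on ActionKey. */
final class EnforcementActionRegistry {
    private final Map<ActionKey, EnforcementActionRecord> actions = new ConcurrentHashMap<>();
    private final AtomicInteger creations = new AtomicInteger();

    /** Creates the action only if no action exists for the key; otherwise returns the existing one. */
    EnforcementActionRecord createIfAbsent(ActionKey key, String commandId) {
        // computeIfAbsent runs the mapping function at most once per key, atomically.
        return actions.computeIfAbsent(key, k -> {
            creations.incrementAndGet();
            return new EnforcementActionRecord(k, commandId);
        });
    }

    /** Number of actions actually created, for checking the invariant in tests. */
    int creations() {
        return creations.get();
    }
}
```

A retry or a racing duplicate request gets back the record created by the first caller instead of producing a second action, which is the behavior the invariant asks for.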
23.5 Example pseudo-code
// Pseudo-code: authService, caseRepository, rulesCache, riskClient, sanctionsClient,
// decisionEngine, transactionTemplate, joiner, and configWithDeadline are assumed collaborators.
Response evaluate(EvaluateCommand command) throws Exception {
    Deadline deadline = Deadline.after(Duration.ofMillis(800));
    RequestContext context = RequestContext.from(command, deadline);

    // Bind the immutable request context; tasks forked in the scope below inherit the binding.
    return ScopedValue.where(RequestContexts.CURRENT, context)
            .call(() -> evaluateWithinContext(command, deadline));
}

Response evaluateWithinContext(EvaluateCommand command, Deadline deadline)
        throws Exception {
    deadline.throwIfExpired();

    // joiner decides the completion policy; configWithDeadline is expected to set the scope timeout.
    try (var scope = StructuredTaskScope.open(joiner, configWithDeadline(deadline))) {
        var auth = scope.fork(() -> authService.authorize(command, deadline));
        var caseData = scope.fork(() -> caseRepository.load(command.caseId(), deadline));
        var rules = scope.fork(() -> rulesCache.load(command.ruleSetId(), deadline));
        var risk = scope.fork(() -> riskClient.score(command.subject(), deadline));
        var sanctions = scope.fork(() -> sanctionsClient.check(command.subject(), deadline));

        // Wait for the subtasks; failure and timeout handling depend on the joiner policy.
        scope.join();

        Authorization authorization = auth.get();
        CaseRecord record = caseData.get();
        authorization.assertCanEvaluate(record);

        Decision decision = decisionEngine.evaluate(
                record,
                rules.get(),
                risk.get(),
                sanctions.get()
        );

        // Invariant 5: do not begin a side effect after the deadline has expired.
        deadline.throwIfExpired();
        return persistDecision(command, record, decision, deadline);
    }
}

Response persistDecision(
        EvaluateCommand command,
        CaseRecord record,
        Decision decision,
        Deadline deadline
) throws Exception {
    String idempotencyKey = command.commandId().value();

    return transactionTemplate.execute(deadline, tx -> {
        CaseRecord latest = caseRepository.loadForUpdate(record.id(), deadline);

        // Invariant 1: transition only from the latest persisted version.
        // Assumes version() returns a primitive; use equals() if it is a boxed type.
        if (latest.version() != record.version()) {
            throw new ConcurrentCaseUpdateException();
        }

        // Invariant 2: a retried command reuses the existing action instead of creating another.
        EnforcementAction action = enforcementActions.createIfAbsent(
                idempotencyKey,
                latest.id(),
                decision
        );
        caseRepository.transition(latest, decision.nextStatus(), deadline);
        auditRepository.append(decision.auditEvent(command.commandId()), deadline);

        // The outbox row commits in the same transaction; the notification is sent later.
        outboxRepository.enqueue(action.notificationEvent(), deadline);
        return Response.from(decision, action);
    });
}
This design intentionally combines:
- virtual threads,
- structured concurrency,
- scoped context,
- deadlines,
- durable concurrency control,
- idempotency,
- transactional outbox.
This is the level where concurrency becomes system correctness.
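The pseudo-code in 23.5 uses a `Deadline` type that is not shown in this section. One possible shape is sketched below. Everything here is an assumption: `DeadlineExceededException`, `remaining()`, and `timeoutFor()` are hypothetical, and the choice of an unchecked exception is only one option.

```java
import java.time.Duration;

/** Hypothetical exception thrown when the request budget is spent. */
final class DeadlineExceededException extends RuntimeException {
    DeadlineExceededException(String message) {
        super(message);
    }
}

/** Hypothetical absolute deadline based on the monotonic clock. */
final class Deadline {
    private final long deadlineNanos;

    private Deadline(long deadlineNanos) {
        this.deadlineNanos = deadlineNanos;
    }

    /** Creates a deadline that expires after the given budget, measured from now. */
    static Deadline after(Duration budget) {
        return new Deadline(System.nanoTime() + budget.toNanos());
    }

    /** Remaining budget, never negative. */
    Duration remaining() {
        long left = deadlineNanos - System.nanoTime(); // subtraction is safe across nanoTime wrap
        return left <= 0 ? Duration.ZERO : Duration.ofNanos(left);
    }

    boolean isExpired() {
        return deadlineNanos - System.nanoTime() <= 0;
    }

    void throwIfExpired() {
        if (isExpired()) {
            throw new DeadlineExceededException("request deadline expired");
        }
    }

    /** Timeout for one dependency call: the per-call cap, but never more than what is left. */
    Duration timeoutFor(Duration perCallCap) {
        Duration left = remaining();
        return left.compareTo(perCallCap) < 0 ? left : perCallCap;
    }
}
```

Passing one `Deadline` object down the call chain, and deriving each dependency timeout from `timeoutFor`, keeps the layers from each spending a full fixed timeout of their own.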
24. Final “Do This, Not That”
| Do this | Not that |
|---|---|
| Define invariants before locks | Add locks until bug disappears |
| Prefer immutable snapshots | Share mutable DTOs everywhere |
| Use deadline propagation | Give every layer fixed timeout |
| Bound every queue | Use unbounded queues as buffer |
| Use virtual threads for blocking IO orchestration | Use virtual threads as excuse for unbounded dependency calls |
| Use structured concurrency for child tasks | Fire futures and forget lifecycle |
| Use scoped values for immutable request context | Depend on leaking thread locals |
| Use event loop only with non-blocking discipline | Call blocking repositories on event loop |
| Use reactive when backpressure is core | Use reactive for simple CRUD because it looks advanced |
| Use lock for multi-field invariant | Use atomics for complex mutable graphs |
| Observe queue age and close reasons | Only observe average latency |
| Test cancellation and late completion | Only test happy path |
| Treat timeout as unknown outcome for side effects | Assume timeout means operation failed |
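The last row of the table can be made concrete with a small classification step. The sketch below is illustrative only: `RemoteOutcome` and `SideEffectCaller` are hypothetical names, the remote call is assumed to complete with `true` for an explicit confirmation and `false` for an explicit rejection, and treating every failure as unknown is a deliberately conservative assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical outcome of a remote side effect as seen by the caller. */
enum RemoteOutcome { CONFIRMED, REJECTED, UNKNOWN_REMOTE_OUTCOME }

final class SideEffectCaller {
    private SideEffectCaller() {}

    /**
     * Waits for a remote side effect. Only an explicit answer counts as known;
     * a timeout, a failure, or an interrupt leaves the outcome unknown until reconciled.
     */
    static RemoteOutcome await(CompletableFuture<Boolean> remoteCall, long timeoutMillis) {
        try {
            Boolean confirmed = remoteCall.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Boolean.TRUE.equals(confirmed)
                    ? RemoteOutcome.CONFIRMED
                    : RemoteOutcome.REJECTED;
        } catch (TimeoutException e) {
            // The request may still have been applied on the remote side.
            return RemoteOutcome.UNKNOWN_REMOTE_OUTCOME;
        } catch (ExecutionException e) {
            // Conservative assumption: the failure may have happened after the remote applied it.
            return RemoteOutcome.UNKNOWN_REMOTE_OUTCOME;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the cancellation signal visible
            return RemoteOutcome.UNKNOWN_REMOTE_OUTCOME;
        }
    }
}
```

A caller that receives `UNKNOWN_REMOTE_OUTCOME` should record it and reconcile later, for example by querying the remote system with the idempotency key, instead of blindly retrying or reporting failure.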
25. What “Top 1%” Looks Like in Concurrency
A top-tier engineer does these consistently:
- Names the invariant.
- Chooses ownership before choosing primitive.
- Rejects unbounded queues by default.
- Asks what happens when dependency is slow.
- Asks what happens when caller goes away.
- Separates timeout from cancellation.
- Knows when virtual threads simplify and when they hide saturation.
- Knows when reactive is essential and when it is accidental complexity.
- Avoids shared mutable state unless there is a protocol.
- Reads thread dumps and JFR recordings.
- Tests interleavings, not just examples.
- Designs for unknown remote outcomes.
- Documents concurrency contracts in APIs.
- Makes failure modes observable.
- Does not confuse throughput with correctness.
26. Final Study Path After This Series
To keep improving, build projects that force different models.
Project 1 — Virtual-thread service
Build:
- REST endpoint,
- blocking DB,
- external HTTP call,
- deadline propagation,
- structured concurrency,
- scoped context,
- cancellation on client disconnect,
- JFR profiling.
Project 2 — Event-loop protocol server
Build:
- NIO selector server,
- length-prefixed protocol,
- bounded outbound queue,
- slow client close,
- worker offload,
- event-loop lag metric.
Project 3 — Reactive stream processor
Build:
- `Flow.Publisher`,
- bounded demand,
- cancellation,
- timeout,
- bridge to blocking operation,
- tests for protocol violations.
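As a possible starting point for Project 3, the sketch below shows a `Flow.Subscriber` that keeps demand bounded by requesting one item at a time and cancels after a fixed number of items. `BoundedDemandSubscriber`, its `limit` parameter, and `awaitItems` are hypothetical names chosen for this illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.TimeUnit;

/** Hypothetical subscriber: demand of one at a time, cancels after `limit` items. */
final class BoundedDemandSubscriber<T> implements Flow.Subscriber<T> {
    private final int limit;
    private final List<T> received = new CopyOnWriteArrayList<>();
    private final CountDownLatch done = new CountDownLatch(1);
    private Flow.Subscription subscription; // signals are delivered serially by the publisher

    BoundedDemandSubscriber(int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
    }

    @Override
    public void onSubscribe(Flow.Subscription subscription) {
        this.subscription = subscription;
        subscription.request(1); // never ask for more than we can handle right now
    }

    @Override
    public void onNext(T item) {
        received.add(item);
        if (received.size() >= limit) {
            subscription.cancel(); // stop the stream instead of requesting more
            done.countDown();
        } else {
            subscription.request(1);
        }
    }

    @Override
    public void onError(Throwable throwable) {
        done.countDown();
    }

    @Override
    public void onComplete() {
        done.countDown();
    }

    /** Waits until the stream ends, is cancelled, or the timeout elapses; returns what arrived. */
    List<T> awaitItems(long timeoutMillis) {
        try {
            done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return List.copyOf(received);
    }
}
```

Subscribing an instance with `limit = 3` to a `java.util.concurrent.SubmissionPublisher<Integer>` and submitting ten items should deliver only the first three, because no demand is signalled after the cancel.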
Project 4 — Concurrent aggregate coordinator
Build:
- actor-like case state coordinator,
- bounded mailbox,
- snapshot reads,
- idempotent commands,
- stress tests for invariant.
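A minimal sketch of the Project 4 idea follows, under stated assumptions: `CaseCoordinator`, `CaseSnapshot`, and the string-based status are hypothetical simplifications. One worker thread owns the mutable state, commands arrive through a bounded mailbox, duplicates are dropped by command id, and readers only see immutable snapshots.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical immutable view of the case state. */
record CaseSnapshot(String status, long version) {}

/** Hypothetical actor-like coordinator: a single thread owns all mutation. */
final class CaseCoordinator implements AutoCloseable {
    private record Command(String commandId, String nextStatus) {}

    private final BlockingQueue<Command> mailbox;
    // Touched only by the worker thread, so no lock is needed. Unbounded in this sketch.
    private final Set<String> appliedCommandIds = new HashSet<>();
    private volatile CaseSnapshot snapshot = new CaseSnapshot("OPEN", 0L);
    private final Thread worker = new Thread(this::drain, "case-coordinator");

    private CaseCoordinator(int mailboxCapacity) {
        this.mailbox = new ArrayBlockingQueue<>(mailboxCapacity);
    }

    static CaseCoordinator start(int mailboxCapacity) {
        CaseCoordinator coordinator = new CaseCoordinator(mailboxCapacity);
        coordinator.worker.setDaemon(true);
        coordinator.worker.start();
        return coordinator;
    }

    /** Non-blocking submit; returns false when the mailbox is full so the caller can shed load. */
    boolean offer(String commandId, String nextStatus) {
        return mailbox.offer(new Command(commandId, nextStatus));
    }

    /** Lock-free read of the latest published snapshot. */
    CaseSnapshot snapshot() {
        return snapshot;
    }

    private void drain() {
        try {
            while (true) {
                Command command = mailbox.take();
                if (appliedCommandIds.add(command.commandId())) { // idempotent: skip duplicates
                    CaseSnapshot current = snapshot;
                    snapshot = new CaseSnapshot(command.nextStatus(), current.version() + 1);
                }
            }
        } catch (InterruptedException e) {
            // Interrupt is the shutdown signal for this sketch.
        }
    }

    @Override
    public void close() {
        worker.interrupt();
    }
}
```

Because only the worker thread writes, the invariant "version increases by exactly one per distinct command" does not depend on any lock, and a stress test can check it by comparing the final version with the number of distinct command ids submitted.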
Project 5 — JCStress-style experiments
Build:
- unsafe publication test,
- volatile visibility test,
- atomic transition test,
- lock ordering failure,
- `ThreadLocal` leak simulation.
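Before reaching for the JCStress harness itself, the atomic transition experiment can be approximated in plain Java. The sketch below is not a JCStress test and gives weaker guarantees about exploring interleavings; `CaseState` and `AtomicTransitionExperiment` are hypothetical names. It races several threads on one `compareAndSet` transition and counts the winners.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical case lifecycle states. */
enum CaseState { OPEN, UNDER_REVIEW, CLOSED }

final class AtomicTransitionExperiment {
    private AtomicTransitionExperiment() {}

    /** Races `threads` threads on OPEN -> UNDER_REVIEW and returns how many succeeded. */
    static int raceOnce(int threads) {
        AtomicReference<CaseState> state = new AtomicReference<>(CaseState.OPEN);
        AtomicInteger winners = new AtomicInteger();
        CountDownLatch startSignal = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];

        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    startSignal.await(); // line up all threads to maximize contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // Only the thread that observes OPEN and swaps it atomically wins.
                if (state.compareAndSet(CaseState.OPEN, CaseState.UNDER_REVIEW)) {
                    winners.incrementAndGet();
                }
            });
            workers[i].start();
        }

        startSignal.countDown();
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining workers", e);
            }
        }
        return winners.get();
    }
}
```

Running `raceOnce` many times and asserting that the result is always `1` checks the "exactly one transition wins" property. Replacing `compareAndSet` with a separate `get()` followed by `set()` is a useful counter-experiment, because that version can let more than one thread win.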
27. Final Summary
Concurrency correctness is not one topic. It is a stack:
A system is only as reliable as the weakest layer.
Final rules:
- Start from invariants.
- Minimize shared mutable state.
- Choose ownership model first.
- Choose execution model second.
- Choose primitive last.
- Bound queues and resources.
- Propagate deadlines.
- Treat cancellation as part of API contract.
- Make timeout phases observable.
- Test failure interleavings.
- Use virtual threads to simplify blocking orchestration, not to ignore capacity.
- Use reactive/event-loop when backpressure and streaming are central.
- Use structured concurrency to prevent orphan work.
- Use scoped values for bounded immutable context.
- Treat production incidents as feedback into tests and design templates.
This completes the series.
References
- Java SE 25 API, `ExecutorService`: https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/ExecutorService.html
- Java SE 25 API, `StructuredTaskScope`: https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/StructuredTaskScope.html
- Oracle Java SE 25, Structured Concurrency: https://docs.oracle.com/en/java/javase/25/core/structured-concurrency.html
- Java SE 25 API, `ScopedValue`: https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/lang/ScopedValue.html
- Java SE 25 API, `Flow`: https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/util/concurrent/Flow.html
- Java SE 25 API, `Thread`: https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/lang/Thread.html
- Oracle Java SE 21 Guide, Virtual Threads: https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html
- JEP 444, Virtual Threads: https://openjdk.org/jeps/444
- JEP 491, Synchronize Virtual Threads without Pinning: https://openjdk.org/jeps/491
- JEP 505, Structured Concurrency: https://openjdk.org/jeps/505
- JEP 506, Scoped Values: https://openjdk.org/jeps/506
- Java SE 25 Troubleshooting, Diagnostic Tools / JFR: https://docs.oracle.com/en/java/javase/25/troubleshoot/diagnostic-tools.html
- Java SE 25 API, `jdk.jfr`: https://docs.oracle.com/en/java/javase/25/docs/api/jdk.jfr/module-summary.html