
IO Performance Diagnostics and Tuning

Learn Java IO, Modern IO, Streams, Buffers, Resources, Serialization & Data Boundaries - Part 031

IO performance diagnostics and tuning for production Java systems: latency, throughput, buffer sizing, page cache behavior, direct memory, profiling, JFR, JMH pitfalls, and review checklists.


Part 031 — IO Performance Diagnostics and Tuning

Goal: be able to diagnose and tune Java IO performance with the right mental model: not guessing buffer sizes, not switching everything to NIO, and not treating local benchmarks as production truth.

At the top-engineer level, the IO performance question is not:

“Which API is the fastest?”

The right question is:

“At which boundary do time, memory, syscalls, contention, copies, page faults, flushes, and backpressure actually occur?”

IO performance is the combined result of several systems:

  • Java API layer: InputStream, OutputStream, Reader, Writer, ByteBuffer, Channel, Files.
  • JVM layer: heap, direct memory, GC, JIT, allocation, safepoint.
  • OS layer: syscall, page cache, scheduler, socket buffer, file descriptor, permissions, disk cache.
  • Storage/network layer: SSD/HDD/NFS/object storage/TLS/proxy/load balancer.
  • Application layer: parsing, compression, validation, persistence, retry, queueing, cancellation.

When one layer is slow, another layer often looks guilty.

Classic examples:

  • “File reads are slow” when the bottleneck is actually CSV parsing.
  • “The network is slow” when the downstream consumer is not draining the stream.
  • “GC spikes” when the IO pipeline materializes a large body into a byte[].
  • “FileChannel.transferTo is not fast” when the target is not a socket/file the OS can optimize.
  • “Direct buffers are fast” when a direct buffer is allocated per request.
  • “flush speeds up output” when flush actually breaks up batches and multiplies syscalls.

This part is a diagnostic map for telling these cases apart.


1. Kaufman Deconstruction: IO Performance Sub-Skills

The skill “IO performance” has to be broken down into the following sub-skills.

Sub-skill | Question it answers | Failure mode if not mastered
--- | --- | ---
Workload characterization | What is actually being read/written? | Tuning based on a misleading benchmark
Boundary localization | Is time lost in the API, JVM, OS, disk, network, or parser? | Optimizing in the wrong place
Copy accounting | How many times is the data copied? | Memory bandwidth and GC balloon
Buffer tuning | Which buffer actually reduces syscalls? | Buffers too small, too large, or stacked
Allocation control | Does IO create garbage per chunk/request? | GC latency and memory pressure
Direct memory diagnosis | Is off-heap memory used and bounded? | Native OOM, slice retention, hidden memory
Page cache reasoning | Is data coming from disk or from cache? | Benchmark looks fast but is unrealistic
Flush/durability reasoning | Do you need visible, sent, or durable? | Latency spikes and false durability
Backpressure diagnosis | Is the producer faster than the consumer? | Queue growth, OOM, timeout cascade
Benchmark discipline | Is the microbenchmark valid? | Wrong performance conclusions

The 20 hours of practice for this part are not “try an 8 KB buffer vs a 64 KB buffer”. The practice is to form a hypothesis, measure, separate the layers, and then prove where the bottleneck is.


2. Mental Model: IO Performance Equation

For many IO systems, rough throughput can be understood as:

throughput = useful_bytes / total_elapsed_time

But total_elapsed_time is itself a sum:

total_elapsed_time = queue_time
                   + open/setup_time
                   + syscall_time
                   + copy_time
                   + kernel_wait_time
                   + storage_or_network_time
                   + decode_parse_transform_time
                   + allocation_gc_time
                   + flush_sync_time
                   + downstream_wait_time
                   + cleanup_time

If you want to tune, do not jump straight to swapping APIs. Break the elapsed time into its components first.

Every term in this sum, and every handoff between layers, can become the bottleneck.
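One way to start breaking elapsed time into components is to time each stage separately instead of only the whole request. The following is a minimal sketch under stated assumptions: `StageTimer` and the stage names are hypothetical, the class is not thread-safe, and wall-clock time per stage still mixes in GC pauses and scheduling delays.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper: accumulates wall-clock nanoseconds per named stage.
// Not thread-safe; intended for one instance per request or per worker.
final class StageTimer {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    // Runs the work and adds its elapsed time to the given stage, even if it throws.
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    // Copy of the accumulated timings, in first-seen stage order.
    Map<String, Long> snapshot() {
        return new LinkedHashMap<>(nanosByStage);
    }
}
```

Wrapping read, parse, and write in separate `time(...)` calls gives a first rough split of total_elapsed_time. A profiler or JFR is still needed to explain why a stage is slow.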


3. Workload Characterization Before Tuning

Before touching code, answer these questions.

3.1 Read or write?

Read-heavy and write-heavy workloads have different bottlenecks.

Read-heavy:

  • cache hit ratio matters;
  • random vs sequential access is often decisive;
  • decoding/parsing is often more expensive than the read itself;
  • mmap can help random access but complicates lifecycle.

Write-heavy:

  • flush/fsync policy dominates;
  • batching determines throughput;
  • append and overwrite behave differently;
  • the atomic rename pattern adds write amplification;
  • the durability requirement must be explicit.

3.2 Sequential or random?

Sequential read/write fits:

  • BufferedInputStream/BufferedOutputStream;
  • Files.copy;
  • FileChannel.transferTo/transferFrom;
  • large chunk pipeline.

Random access fits:

  • FileChannel positional read/write;
  • SeekableByteChannel;
  • mmap windowing;
  • index + data file design.

Anti-pattern:

// Looks innocent; terrible for many random reads from large file if each call opens a new stream.
byte[] readRange(Path file, long offset, int length) throws IOException {
    try (InputStream in = Files.newInputStream(file)) {
        in.skipNBytes(offset);
        return in.readNBytes(length);
    }
}

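Better:

// Uses java.util.Arrays in addition to the NIO types.
byte[] readRange(Path file, long offset, int length) throws IOException {
    ByteBuffer buffer = ByteBuffer.allocate(length);

    // Positional reads go straight to the offset; nothing is skipped byte by byte.
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, offset + buffer.position());
            if (n < 0) {
                break; // EOF before the requested range was filled
            }
        }
    }

    // Return only the bytes actually read, not zero padding after a short read.
    return Arrays.copyOf(buffer.array(), buffer.position());
}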
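The positional-read version above still opens a channel on every call, which keeps the open/close cost that made the anti-pattern expensive. For many random reads against the same file, one option is to hold a single FileChannel open and reuse it. The sketch below assumes a hypothetical `RangeReader` class and treats a range that extends past end of file as an error; whether that is correct depends on the caller's contract.

```java
import java.io.Closeable;
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: one open channel serves many positional reads.
final class RangeReader implements Closeable {
    private final FileChannel channel;

    RangeReader(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
    }

    // Reads exactly `length` bytes starting at `offset`, or throws EOFException.
    byte[] readRange(long offset, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        while (buffer.hasRemaining()) {
            // Positional read: does not move the channel's own position.
            int n = channel.read(buffer, offset + buffer.position());
            if (n < 0) {
                throw new EOFException("range extends past end of file");
            }
        }
        return buffer.array();
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Because positional reads do not modify the channel position, the same instance can serve concurrent readers, at the cost of one long-lived file descriptor that must be closed deterministically.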

3.3 Small files or large files?

Small files usually bottleneck on:

  • open/close;
  • metadata lookup;
  • directory traversal;
  • permission checks;
  • object allocation;
  • scheduling overhead.

Large files usually bottleneck on:

  • transfer bandwidth;
  • buffer copy;
  • page cache;
  • decompression;
  • downstream write;
  • disk/network throughput.

3.4 Local disk, network filesystem, or object storage?

The Java API may be the same, but the semantics differ.

Storage | Characteristics | Risk
--- | --- | ---
Local SSD | low latency, high IOPS | benchmark too optimistic
HDD | good sequential, poor random | random IO collapse
NFS/SMB | network latency + filesystem semantics | locking, atomicity, metadata cache ambiguity
Container volume | host dependent | fsync/rename semantics may differ
Object storage gateway | not a true filesystem | rename/copy/list consistency trap

Do not assume Files.move(..., ATOMIC_MOVE) behaves the same on every provider. If an atomic move is not supported, the API can throw AtomicMoveNotSupportedException.
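A hedged sketch of handling that exception explicitly follows. The `Moves` class and the fallback policy are assumptions for illustration: falling back to a plain move gives up atomicity, so the method reports which path was taken and leaves it to the caller to decide whether a non-atomic commit is acceptable.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: tries an atomic move and makes the fallback visible.
final class Moves {
    // Returns true if the move was atomic, false if it fell back to a plain move.
    static boolean moveAtomicallyIfPossible(Path source, Path target) throws IOException {
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
            return true;
        } catch (AtomicMoveNotSupportedException e) {
            // Assumption: the caller tolerates a non-atomic replace on this storage.
            Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
            return false;
        }
    }
}
```

Counting how often the fallback path runs is a cheap way to discover that a deployment sits on storage where the commit protocol is weaker than it was on the developer machine.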


4. Bottleneck Localization

Use the following diagnosis hierarchy.

4.1 Symptom: high CPU

Possible causes:

  • text decoded too often;
  • expensive line-by-line regex parsing;
  • expensive checksum/compression/encryption;
  • many byte[] to byte[] copies;
  • allocation per chunk;
  • verbose logging per record;
  • heavy charset decoder fallback/replacement.

Diagnosis:

  • CPU profiler;
  • allocation profiler;
  • JFR method profiling;
  • stack samples under load;
  • temporarily disable the transform to isolate it.

4.2 Symptom: low CPU, high latency

Possible causes:

  • thread blocked in read/write;
  • slow downstream;
  • socket/file buffer full;
  • fsync/force waiting on storage;
  • open file descriptor starvation;
  • lock contention;
  • directory with too many entries;
  • remote filesystem.

Diagnosis:

  • thread dump;
  • JFR file/socket events;
  • OS metrics: iowait, disk util, read/write latency;
  • request timeline;
  • queue depth;
  • timeout histogram.

4.3 Symptom: memory growth

Possible causes:

  • readAllBytes() on a large body;
  • unbounded queue between producer and consumer;
  • unbounded direct buffer pool;
  • ByteBuffer.slice() retaining a large parent;
  • Files.lines() stream not closed;
  • compression bomb;
  • process output buffered without limit.

Diagnosis:

  • heap dump;
  • native memory tracking;
  • BufferPoolMXBean;
  • queue size metric;
  • per-request memory budget;
  • direct memory max.

5. Copy Accounting

IO optimization often fails because engineers do not count copies.

Example of a poor upload pipeline:

socket -> byte[] all body -> String -> JSON object -> byte[] -> temp file -> byte[] -> downstream

A better pipeline:

socket -> bounded chunks -> temp file + checksum -> metadata validation -> committed file -> downstream stream

5.1 Copy map

Build a table for every pipeline.

Step | From | To | Copy? | Allocation? | Can stream?
--- | --- | --- | --- | --- | ---
receive | socket buffer | heap chunk | yes | reusable? | yes
validate size | chunk | counter | no | no | yes
checksum | chunk | digest state | no | no | yes
persist | chunk | file/page cache | yes | no if reused | yes
parse | file | domain object | depends | yes | maybe

The goal is not to eliminate every copy. The goal is to eliminate copies that provide no boundary value.

Legitimate boundary value:

  • validation before commit;
  • checksum/integrity;
  • charset decoding;
  • decompression;
  • encryption/decryption;
  • durable staging;
  • protocol framing;
  • ownership isolation.

Copies that are usually wasteful:

  • InputStream -> byte[] -> ByteArrayInputStream just to make an API fit;
  • byte[] -> String -> byte[] for binary data;
  • ByteBuffer -> byte[] on every loop iteration;
  • Files.readAllBytes followed by Files.write to copy a large file;
  • a StringBuilder holding an entire large log file.

6. Buffer Tuning

A buffer is not magic. A buffer changes how often you cross a boundary.

6.1 Buffers reduce syscalls

Without a buffer:

read 1 byte -> syscall
read 1 byte -> syscall
read 1 byte -> syscall

With a buffer:

read 8192 bytes -> syscall
serve many small reads from memory
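The effect can be made visible without touching the OS by counting how many read calls reach the underlying stream. In this sketch, `ReadCallCounter` is a hypothetical wrapper and a ByteArrayInputStream stands in for a file or socket, so the count is only a proxy for syscalls, not a measurement of them.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: counts every read call that reaches the wrapped stream.
final class ReadCallCounter extends FilterInputStream {
    long calls;

    ReadCallCounter(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        calls++;
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        calls++;
        return super.read(b, off, len);
    }

    // Consumes the data one byte at a time and reports how many calls hit the source.
    static long countCalls(byte[] data, boolean buffered) throws IOException {
        ReadCallCounter counter = new ReadCallCounter(new ByteArrayInputStream(data));
        InputStream in = buffered ? new BufferedInputStream(counter, 8192) : counter;
        while (in.read() >= 0) {
            // The consumer reads single bytes in both cases.
        }
        return counter.calls;
    }
}
```

For the same byte-at-a-time consumer, the unbuffered run reaches the source once per byte, while the buffered run reaches it roughly once per 8 KB. On a real file or socket, each of those calls would be a boundary crossing into the kernel.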

6.2 Buffer too small

Symptoms:

  • high syscall count;
  • kernel-mode CPU rises;
  • low throughput;
  • many context switches;
  • many tiny file/socket reads.

6.3 Buffer too large

Symptoms:

  • high memory per request;
  • poor cache locality;
  • latency rises because batches are too large;
  • direct memory pressure;
  • GC/native memory spikes;
  • throughput stops improving past a certain point.

6.4 A safer rule of thumb

Start from:

  • 8 KB as the mental model for the classic buffered stream default;
  • 16–64 KB for a general file copy pipeline;
  • 64–256 KB for high-throughput sequential transfer, if measurement shows it helps;
  • larger only if the workload and memory budget prove the benefit.

Do not treat these numbers as dogma. Measure.

6.5 Buffer budget formula

total_buffer_memory = concurrent_operations
                    * buffers_per_operation
                    * buffer_size

Example:

2,000 concurrent uploads
* 3 buffers per upload
* 256 KB
= 1,536,000 KB, roughly 1.5 GB of buffer memory

That does not yet include parser objects, queues, direct buffers, TLS, or application state.
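The same arithmetic can be turned into a startup or configuration check. This is a small sketch, assuming a hypothetical `BufferBudget` helper; the numbers in the usage note are the ones from the example above.

```java
// Hypothetical helper: applies the buffer budget formula with overflow checks.
final class BufferBudget {
    // total = concurrent operations * buffers per operation * buffer size in bytes
    static long totalBytes(long concurrentOperations, long buffersPerOperation, long bufferSizeBytes) {
        return Math.multiplyExact(
            Math.multiplyExact(concurrentOperations, buffersPerOperation),
            bufferSizeBytes);
    }

    // True if the computed total stays within the configured budget.
    static boolean fits(long totalBytes, long budgetBytes) {
        return totalBytes <= budgetBytes;
    }
}
```

`BufferBudget.totalBytes(2_000, 3, 256 * 1024)` yields 1,572,864,000 bytes, so a service configured this way with a 1 GiB buffer budget would fail the `fits` check before the first request arrives.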

6.6 Avoid double buffering blindly

This is often redundant:

try (InputStream in = new BufferedInputStream(
        Files.newInputStream(path));
     BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {
    // ...
}

BufferedReader already buffers character reads. An extra BufferedInputStream can help in specific cases, but do not stack buffers automatically without a reason.

Clearer:

try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    String line;
    while ((line = reader.readLine()) != null) {
        consume(line);
    }
}

7. Page Cache Reasoning

File IO on a modern OS often does not go straight to disk. Many reads and writes pass through the page cache.

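7.1 Read path simplified

application read -> JVM -> read syscall -> page cache lookup
page cache hit  -> copy from page cache to the user buffer
page cache miss -> read from storage into page cache -> copy to the user buffer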

7.2 Benchmark trap: hot cache

Local benchmarks often read a file that is already in the page cache. The result measures memory copy, not disk.

Symptoms:

  • read throughput higher than the disk can deliver;
  • second run is much faster;
  • CPU copy dominates;
  • low disk utilization.

For production diagnosis, record:

  • cold read vs hot read;
  • repeated access pattern;
  • file working set size;
  • memory size vs dataset size;
  • page cache eviction;
  • container memory limit.

7.3 Write path simplified

write() often only copies data into the kernel/page cache. That does not mean the data is durable on storage.

Application write success != durable on disk
close success != crash-safe at the application protocol level
atomic rename != directory entry necessarily fsynced

Part 012 already covered crash consistency. The performance point here: durability is expensive and must be batched deliberately.


8. Flush, Force, Sync: Performance Consequences

8.1 flush()

flush() pushes buffered data to the sink in the next layer down. On a BufferedOutputStream, flush writes the buffer to the underlying stream. On a writer, flush pushes the encoded character output downward.

flush() is not a guarantee of durability on disk.

8.2 FileChannel.force(boolean)

force asks for updates to the channel's file to be forced to the storage device. force(true) covers both content and metadata; force(false) only needs to cover content. Either way it is an expensive primitive. Do not call it per record unless the durability requirement truly demands it.

Bad:

for (Record record : records) {
    writeRecord(channel, record);
    channel.force(true); // extremely expensive under load
}

Better, if the business allows group commit:

int sinceLastForce = 0;

for (Record record : records) {
    writeRecord(channel, record);
    sinceLastForce++;

    if (sinceLastForce >= 1_000) {
        channel.force(false);
        sinceLastForce = 0;
    }
}

channel.force(false);

Trade-off:

  • throughput rises;
  • the last batch can be lost on a crash;
  • the recovery protocol must be clearly defined.

8.3 SYNC and DSYNC

StandardOpenOption.SYNC and DSYNC make every update synchronous to storage: SYNC covers content and metadata, DSYNC covers content only. They are not defaults to adopt without a benchmark and a requirement.

Use them only if:

  • the data loss window must be very small;
  • the throughput requirement is realistic;
  • hardware/storage latency is known;
  • the recovery design relies on durable-per-write semantics.

9. Direct Buffer Diagnostics

Direct buffers help some native IO paths because the buffer lives outside the Java heap and can avoid certain copies. But direct buffers are not free.

9.1 Failure modes

  • allocating a direct buffer per request;
  • direct memory not explicitly bounded;
  • a small slice retaining a large direct buffer;
  • pool has no max size;
  • cleaner delay lets native memory grow;
  • heap looks healthy but process RSS is large;
  • container killed for native memory, not Java heap OOM.

9.2 Direct memory instrumentation

Use BufferPoolMXBean to inspect pools such as direct and mapped.

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public final class BufferPoolSnapshot {
    public static void printBufferPools() {
        for (BufferPoolMXBean bean : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf(
                "%s: count=%d, used=%d, capacity=%d%n",
                bean.getName(),
                bean.getCount(),
                bean.getMemoryUsed(),
                bean.getTotalCapacity()
            );
        }
    }
}

Useful metrics:

  • direct buffer count;
  • direct memory used;
  • mapped memory used;
  • allocation rate;
  • pool borrow latency;
  • pool exhaustion count;
  • request count using direct buffer;
  • largest retained buffer.

9.3 Pooling direct buffer

Pooling is useful if:

  • buffers are large;
  • allocation is frequent;
  • lifecycle is clear;
  • concurrency is bounded;
  • the pool has a cap;
  • borrowers are required to return buffers.

Simple bounded pool:

import java.nio.ByteBuffer;
import java.util.ArrayDeque;

public final class BoundedByteBufferPool {
    private final int bufferSize;
    private final int maxIdle;
    private final ArrayDeque<ByteBuffer> idle = new ArrayDeque<>();
    private int allocated;

    public BoundedByteBufferPool(int bufferSize, int maxIdle) {
        if (bufferSize <= 0 || maxIdle <= 0) {
            throw new IllegalArgumentException("bufferSize and maxIdle must be positive");
        }
        this.bufferSize = bufferSize;
        this.maxIdle = maxIdle;
    }

    public synchronized ByteBuffer borrow() {
        ByteBuffer buffer = idle.pollFirst();
        if (buffer != null) {
            buffer.clear();
            return buffer;
        }
        allocated++;
        return ByteBuffer.allocateDirect(bufferSize);
    }

    public synchronized void release(ByteBuffer buffer) {
        if (buffer == null) {
            return;
        }
        buffer.clear();
        if (idle.size() < maxIdle && buffer.capacity() == bufferSize) {
            idle.addFirst(buffer);
        }
        // Else let it be reclaimed eventually.
    }

    public synchronized int idleCount() {
        return idle.size();
    }

    public synchronized int allocatedCount() {
        return allocated;
    }
}

A production pool needs to be more robust:

  • close/shutdown behavior;
  • leak detection;
  • max allocated, bukan hanya max idle;
  • metrics;
  • timeout borrow;
  • owner token;
  • no double release;
  • no release after close.
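Two items from that list, a cap on allocated buffers and a borrow timeout, can be sketched with a Semaphore. This is an illustration under stated assumptions, not a production pool: `CappedDirectBufferPool` is a hypothetical name, there is no leak detection, no shutdown handling, and no protection against releasing the same buffer twice.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: caps buffers in use and bounds the wait for one.
final class CappedDirectBufferPool {
    private final int bufferSize;
    private final Semaphore permits;              // one permit per buffer that may be in use
    private final BlockingQueue<ByteBuffer> idle; // returned buffers waiting for reuse

    CappedDirectBufferPool(int bufferSize, int maxAllocated) {
        if (bufferSize <= 0 || maxAllocated <= 0) {
            throw new IllegalArgumentException("bufferSize and maxAllocated must be positive");
        }
        this.bufferSize = bufferSize;
        this.permits = new Semaphore(maxAllocated);
        this.idle = new ArrayBlockingQueue<>(maxAllocated);
    }

    // Returns a cleared buffer, or null if none became available within the timeout.
    ByteBuffer borrow(long timeout, TimeUnit unit) throws InterruptedException {
        if (!permits.tryAcquire(timeout, unit)) {
            return null; // pool exhausted: the caller should shed load or fail fast
        }
        ByteBuffer buffer = idle.poll();
        return buffer != null ? buffer : ByteBuffer.allocateDirect(bufferSize);
    }

    // Returns a buffer to the pool. Double release is NOT detected in this sketch.
    void release(ByteBuffer buffer) {
        if (buffer == null || !buffer.isDirect() || buffer.capacity() != bufferSize) {
            throw new IllegalArgumentException("buffer does not belong to this pool");
        }
        buffer.clear();
        idle.offer(buffer);  // make the buffer visible before freeing the permit
        permits.release();
    }
}
```

Returning null on timeout forces the caller to make an explicit decision under exhaustion, which is also the natural place to increment a pool exhaustion metric.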

10. Allocation Control in IO Loops

Bad:

try (InputStream in = Files.newInputStream(source);
     OutputStream out = Files.newOutputStream(target)) {

    while (true) {
        byte[] buffer = new byte[8192]; // allocates every loop
        int n = in.read(buffer);
        if (n < 0) {
            break;
        }
        out.write(buffer, 0, n);
    }
}

Good:

try (InputStream in = Files.newInputStream(source);
     OutputStream out = Files.newOutputStream(target)) {

    byte[] buffer = new byte[64 * 1024];
    int n;
    while ((n = in.read(buffer)) >= 0) {
        out.write(buffer, 0, n);
    }
}

Bad with ByteBuffer:

while (channel.read(ByteBuffer.allocateDirect(64 * 1024)) >= 0) {
    // lost data and allocates direct buffer repeatedly
}

Good:

ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
while (channel.read(buffer) >= 0) {
    buffer.flip();
    consume(buffer);
    buffer.clear();
}
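The loop above assumes consume drains the buffer completely before clear() is called. When records can straddle chunk boundaries, a common variant uses compact() so unread bytes survive into the next read. The sketch below assumes a hypothetical stream of 4-byte big-endian integers and a blocking channel; `IntFrameReader` and `sumInts` are illustrative names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Hypothetical example: sums 4-byte integers that may be split across reads.
final class IntFrameReader {
    static long sumInts(ReadableByteChannel channel, int bufferSize) throws IOException {
        if (bufferSize < Integer.BYTES) {
            throw new IllegalArgumentException("buffer must hold at least one record");
        }
        ByteBuffer buffer = ByteBuffer.allocate(bufferSize); // allocated once, outside the loop
        long sum = 0;
        boolean eof = false;
        while (!eof) {
            eof = channel.read(buffer) < 0;
            buffer.flip();
            while (buffer.remaining() >= Integer.BYTES) {
                sum += buffer.getInt(); // consume only complete records
            }
            buffer.compact(); // keep the partial record for the next read
        }
        if (buffer.position() != 0) {
            throw new EOFException("truncated record at end of stream");
        }
        return sum;
    }
}
```

With clear() instead of compact(), the trailing bytes of a split record would be discarded silently, which is a correctness bug that often gets misread as a performance anomaly.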

Allocation checklist:

  • Is buffer allocated outside loop?
  • Is String created per byte/chunk unnecessarily?
  • Is byte[] materialized for reusable stream?
  • Is line parsing creating many intermediate arrays?
  • Are direct buffers allocated per request?
  • Are slices retaining large parent buffers?

11. JMH Pitfalls for IO Benchmarking

JMH is useful, but IO microbenchmarking is dangerous.

11.1 What JMH can measure well

  • parser function over in-memory buffer;
  • charset decoder performance on fixed input;
  • checksum/compression CPU cost;
  • buffer manipulation overhead;
  • allocation rate of API variants;
  • ByteBuffer state machine overhead.

11.2 What JMH often measures badly

  • real disk latency;
  • page cache behavior;
  • network jitter;
  • fsync latency;
  • object storage semantics;
  • filesystem metadata contention;
  • production concurrency;
  • cold start file open cost.

11.3 Benchmark traps

Trap | Why wrong | Better approach
--- | --- | ---
Reading same file repeatedly | page cache dominates | separate hot/cold tests
Tiny file benchmark | metadata/open cost dominates | benchmark representative file sizes
No checksum of result | dead-code elimination or incomplete work | consume result with blackhole/checksum
Benchmark on laptop | storage and CPU differ | run on production-like host
Single thread only | misses queue/backpressure | run concurrency tests
Measuring whole pipeline only | no localization | stage-level timing
Using tmpfs accidentally | not disk | verify mount/storage

11.4 Structure of useful benchmark suite

benchmarks/
  parser/
    CsvRecordParserBenchmark.java
    BinaryFrameParserBenchmark.java
  buffer/
    ByteBufferFlipCompactBenchmark.java
    HeapVsDirectCopyBenchmark.java
  transfer/
    FileCopyHotCacheBenchmark.java
    FileCopyColdCacheHarness.md
  integration/
    UploadPipelineLoadTest.md
    ArchiveExtractionStressTest.md

Keep microbenchmark and load test separate. They answer different questions.


12. JFR-Based IO Diagnostics

Java Flight Recorder is often the most practical first tool because it correlates JVM events, threads, allocation, file IO, socket IO, and method samples.

Useful investigation questions:

  • Which files are read/written most?
  • Which thread blocks on IO?
  • Is latency dominated by file read, socket read, write, allocation, or locks?
  • Are there allocation spikes during transfer?
  • Are there long pauses near direct/mapped buffer usage?
  • Are requests timing out while IO thread is blocked?

12.1 What to capture

At minimum:

  • CPU samples;
  • allocation samples;
  • file read/write events;
  • socket read/write events;
  • thread park/block events;
  • GC events;
  • exception events if relevant;
  • object allocation outside TLAB if memory pressure exists.
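A recording that covers part of this list can also be started programmatically through the jdk.jfr API. The sketch below is one possible setup, not a recommended configuration: `IoRecording` is a hypothetical helper, the thresholds are arbitrary assumptions, and only a subset of the events above is enabled.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import jdk.jfr.Recording;

// Hypothetical helper: records selected IO-related JFR events around a workload.
final class IoRecording {
    // Runs the workload under a recording and returns the path of the dumped .jfr file.
    static Path recordWhile(Runnable workload) throws IOException {
        Path dump = Files.createTempDirectory("jfr").resolve("io-diagnostics.jfr");
        try (Recording recording = new Recording()) {
            // Thresholds are assumptions; zero captures every file event and can be noisy.
            recording.enable("jdk.FileRead").withThreshold(Duration.ZERO);
            recording.enable("jdk.FileWrite").withThreshold(Duration.ZERO);
            recording.enable("jdk.SocketRead").withThreshold(Duration.ofMillis(1));
            recording.enable("jdk.SocketWrite").withThreshold(Duration.ofMillis(1));
            recording.enable("jdk.ThreadPark");
            recording.enable("jdk.JavaMonitorEnter");
            recording.enable("jdk.GarbageCollection");
            recording.start();
            workload.run();
            recording.stop();
            recording.dump(dump);
        }
        return dump;
    }
}
```

The resulting file can be opened in JDK Mission Control or inspected with the `jfr` tool shipped with the JDK. In production, a continuous recording started at JVM launch is usually more practical than wrapping code like this.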

12.2 How to read result

Avoid staring at averages first. Look at:

  • p95/p99 duration of file/socket events;
  • top paths/sockets by bytes;
  • longest blocked threads;
  • allocation hot methods;
  • correlation between GC and IO latency;
  • event timeline around spikes.

12.3 Diagnostic flow with JFR


13. OS-Level Metrics to Correlate

Java metrics alone are insufficient.

Correlate with:

  • disk read/write throughput;
  • disk latency;
  • disk queue depth;
  • iowait;
  • filesystem mount options;
  • network throughput;
  • retransmits/errors;
  • open file descriptor count;
  • process RSS;
  • page faults;
  • container memory limit;
  • cgroup throttling;
  • CPU steal time in virtualized environments.

If Java says “write took 2 seconds”, OS metrics help answer whether it was:

  • storage saturated;
  • network filesystem slow;
  • process throttled;
  • GC paused;
  • downstream not reading;
  • lock contention;
  • flush/force latency.

14. Metrics for Production IO Components

Every serious IO component should expose metrics at the boundary.

14.1 File ingestion metrics

  • files discovered;
  • files claimed;
  • files skipped;
  • bytes read;
  • read duration;
  • parse duration;
  • validation failure count;
  • quarantine count;
  • commit duration;
  • temp file cleanup count;
  • retry count;
  • oldest unprocessed file age;
  • active workers;
  • queue depth;
  • in-flight bytes.

14.2 Transfer metrics

  • bytes transferred;
  • transfer duration;
  • throughput histogram;
  • partial transfer count;
  • resume count;
  • checksum mismatch count;
  • cancellation count;
  • timeout count;
  • downstream write latency;
  • buffer pool borrow time;
  • buffer pool exhaustion.

14.3 Direct memory metrics

  • direct buffer count;
  • direct memory used;
  • mapped memory used;
  • allocation count;
  • pool idle count;
  • pool active count;
  • borrow timeout;
  • leak suspicion count.

14.4 Process IO metrics

  • process start latency;
  • stdout bytes;
  • stderr bytes;
  • drain duration;
  • exit code distribution;
  • timeout kill count;
  • output truncation count;
  • process tree kill failures.

15. Tuning Patterns

15.1 Replace materialization with streaming

Bad:

byte[] body = input.readAllBytes();
validate(body);
Files.write(target, body);

Better:

MessageDigest digest = MessageDigest.getInstance("SHA-256");
long bytes = 0;

try (InputStream in = source.openStream();
     OutputStream out = Files.newOutputStream(temp, StandardOpenOption.CREATE_NEW)) {

    byte[] buffer = new byte[64 * 1024];
    int n;
    while ((n = in.read(buffer)) >= 0) {
        bytes += n;
        if (bytes > maxBytes) {
            throw new IOException("payload too large");
        }
        digest.update(buffer, 0, n);
        out.write(buffer, 0, n);
    }
}

15.2 Batch small writes

Bad:

for (String line : lines) {
    writer.write(line);
    writer.write('\n');
    writer.flush();
}

Better:

for (String line : lines) {
    writer.write(line);
    writer.write('\n');
}
writer.flush();

For stronger control, batch records into chunks.

15.3 Avoid per-record open/close

Bad:

for (Record record : records) {
    Files.writeString(logFile, record.toLine(), StandardOpenOption.APPEND);
}

Better:

try (BufferedWriter writer = Files.newBufferedWriter(
        logFile,
        StandardCharsets.UTF_8,
        StandardOpenOption.CREATE,
        StandardOpenOption.APPEND)) {

    for (Record record : records) {
        writer.write(record.toLine());
        writer.newLine();
    }
}

15.4 Use transfer APIs when moving bytes unchanged

If you are only copying bytes, don't parse them.

try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
     FileChannel out = FileChannel.open(target,
         StandardOpenOption.CREATE_NEW,
         StandardOpenOption.WRITE)) {

    long position = 0;
    long size = in.size();

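    while (position < size) {
        long transferred = in.transferTo(position, size - position, out);
        if (transferred <= 0) {
            // No progress: fail loudly instead of leaving a silently truncated copy.
            throw new IOException("transferTo made no progress at position " + position);
        }
        position += transferred;
    }
}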

Always loop. Transfer methods may transfer fewer bytes than requested.

15.5 Limit concurrency by bytes, not only tasks

Bad:

maxWorkers = 200

Better:

maxWorkers = 32
maxInFlightBytes = 512 MB
maxBufferMemory = 128 MB
maxOpenFiles = 256

A thousand small files and a thousand 2 GB files are not equivalent.
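One way to enforce a byte-based limit alongside the worker limit is a semaphore whose permits represent bytes rather than tasks. The sketch below assumes a hypothetical `InFlightByteBudget` class and a 1 KiB permit granularity, chosen so that budgets larger than 2 GiB still fit in the semaphore's int permit count.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: admission control by payload size instead of task count.
final class InFlightByteBudget {
    private static final int UNIT_BYTES = 1024; // one permit represents 1 KiB
    private final int totalPermits;
    private final Semaphore permits;

    InFlightByteBudget(long maxInFlightBytes) {
        this.totalPermits = toPermits(maxInFlightBytes);
        this.permits = new Semaphore(totalPermits);
    }

    // Non-blocking: returns false if admitting this payload would exceed the budget.
    boolean tryReserve(long bytes) {
        int needed = toPermits(bytes);
        if (needed > totalPermits) {
            throw new IllegalArgumentException("payload larger than the whole budget");
        }
        return permits.tryAcquire(needed);
    }

    // Must be called with the same byte count that was reserved.
    void release(long bytes) {
        permits.release(toPermits(bytes));
    }

    private static int toPermits(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must not be negative");
        }
        return Math.toIntExact((bytes + UNIT_BYTES - 1) / UNIT_BYTES); // round up
    }
}
```

A worker would call tryReserve with the declared or measured payload size before starting and release in a finally block. When the size is unknown up front, reserving per chunk is an alternative, with the usual risk of stalling mid-transfer.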

15.6 Separate metadata scan from content processing

Directory scan can be cheap or expensive depending on filesystem. Do not mix scan latency with processing latency without measuring separately.

Expose separate metrics for each stage.


16. Performance Anti-Patterns

16.1 available() as size estimate

Bad:

byte[] buffer = new byte[in.available()];
in.read(buffer);

available() is not total stream size. It is at most a non-blocking availability hint.
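When the intent is “read the whole payload, but never more than a limit”, a bounded read expresses that directly. A minimal sketch, assuming Java 11+ for readNBytes(int) and a hypothetical `BoundedReads.readAtMost` helper:

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: reads a full payload while enforcing an upper bound.
final class BoundedReads {
    // Returns all remaining bytes, or throws if the stream holds more than maxBytes.
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        byte[] data = in.readNBytes(maxBytes); // blocks until maxBytes are read or EOF
        if (data.length == maxBytes && in.read() >= 0) {
            throw new IOException("stream exceeds " + maxBytes + " bytes");
        }
        return data;
    }
}
```

This still materializes the payload, so it only suits bodies that are small by contract; for anything larger, the streaming pattern in section 15.1 applies.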

16.2 Assuming read(byte[]) fills the array

Bad:

byte[] header = new byte[16];
in.read(header); // may read fewer than 16 bytes

Good:

byte[] header = in.readNBytes(16);
if (header.length != 16) {
    throw new EOFException("truncated header");
}

16.3 Logging per chunk at info level

Bad:

log.info("copied {} bytes", n);

inside every loop.

Better:

  • aggregate metrics;
  • debug-level sample logs;
  • final transfer summary;
  • structured metrics.

16.4 Creating strings for binary payload

Bad:

String body = new String(bytes, StandardCharsets.UTF_8);
byte[] again = body.getBytes(StandardCharsets.UTF_8);

If payload is binary, keep it binary.

16.5 Using mmap for everything

mmap can improve some random access/read-heavy workloads, but it introduces page fault behavior, lifecycle complexity, mapping window design, and unmapping concerns.

16.6 Using async for slow CPU work

AsynchronousFileChannel does not make parsing faster. If bottleneck is CPU parser/compression, async IO only complicates code.


17. Review Checklist for IO Performance

Use this during design review.

Workload

  • Is workload read/write/mixed?
  • Is access sequential/random?
  • Are file sizes and concurrency known?
  • Is storage local, networked, containerized, or object-backed?
  • Is performance target latency, throughput, or durability?

Memory

  • Is there a per-request memory budget?
  • Are buffers allocated outside loops?
  • Are queues bounded?
  • Is direct memory measured?
  • Are large payloads streamed rather than materialized?

IO API

  • Is chosen API aligned with boundary contract?
  • Are partial reads/writes handled?
  • Are transfer APIs looped?
  • Is flush/force policy explicit?
  • Are resources closed deterministically?

Diagnostics

  • Are bytes, duration, queue depth, and failures measured?
  • Are JFR/OS metrics available?
  • Can performance be broken down by stage?
  • Are p95/p99 measured, not only average?
  • Are test workloads representative?

Tuning discipline

  • Is there a hypothesis?
  • Is only one variable changed at a time?
  • Is benchmark hot/cold cache aware?
  • Is production-like concurrency used?
  • Is correctness preserved after tuning?

18. Deliberate Practice

Exercise 1 — Copy accounting

Take one existing file upload/download path in your project. Draw copy map:

source -> buffer -> parser -> temp -> committed -> downstream

For each step, mark:

  • copy yes/no;
  • allocation yes/no;
  • blocking yes/no;
  • replayable yes/no;
  • bounded yes/no.

Then remove one unnecessary materialization.

Exercise 2 — Buffer experiment

Implement file copy using:

  • InputStream with 8 KB buffer;
  • InputStream with 64 KB buffer;
  • FileChannel.transferTo;
  • Files.copy.

Measure:

  • elapsed time;
  • bytes/s;
  • allocation;
  • CPU;
  • hot cache vs cold-ish cache behavior;
  • correctness via checksum.

Do not conclude “winner” universally. Conclude for this workload.

Exercise 3 — JFR diagnosis

Run a load test for an IO-heavy endpoint. Capture JFR. Identify:

  • top file/socket events;
  • allocation hotspots;
  • blocked threads;
  • p99 event duration;
  • correlation with GC.

Write one hypothesis and test it.

Exercise 4 — Direct memory audit

Add BufferPoolMXBean metrics to a service that uses direct/mapped buffers. Stress it. Verify:

  • direct count stabilizes;
  • memory used stabilizes;
  • no per-request unbounded growth;
  • container RSS matches expectation.

19. Summary

IO performance engineering is not API superstition. It is boundary accounting.

A top engineer should be able to say:

  • where time is spent;
  • where bytes are copied;
  • where memory is allocated;
  • where backpressure is applied;
  • where data becomes durable;
  • where partial failure is handled;
  • what benchmark actually measured;
  • what production metrics prove.

The strongest default is:

stream large data, bound memory, batch writes, measure stages, avoid hidden materialization, and tune only after locating the bottleneck.

Part 032 closes the series with a capstone: a production-grade IO design combining safe ingestion, staging, validation, idempotency, durability, retry, and resumable transfer.
