Build CoreOrdered learning track

JDBC Streaming Large Result

Learn Java Data Access Pattern In Action - Part 016

Streaming large result JDBC secara production-grade: ResultSet cursor, fetch size, driver behavior, transaction scope, memory safety, callback vs Stream, chunking, export, cancellation, timeout, backpressure illusion, dan operational playbook.

21 min read4150 words
PrevNext
Lesson 1660 lesson track12–33 Build Core
#java#data-access#jdbc#resultset+6 more

Part 016 — JDBC Streaming Large Result

Membaca banyak row bukan hanya masalah while (rs.next()).

Large result adalah masalah resource lifecycle:

  • berapa banyak row yang ada di memory;
  • berapa lama connection dipegang;
  • apakah transaction/snapshot tetap terbuka;
  • apakah database cursor benar-benar dipakai;
  • apakah client/downstream lebih lambat dari database;
  • apakah cancellation bekerja;
  • apakah result bisa diulang setelah failure;
  • apakah export menahan pool selama user download.

Streaming yang salah hanya memindahkan masalah dari heap ke connection pool dan database session.

Part ini membahas large result dengan JDBC secara production-grade.


1. Core Thesis

ResultSet terlihat seperti cursor row-by-row:

while (rs.next()) {
    Row row = mapRow(rs);
    handle(row);
}

Tetapi behavior fisik bisa berbeda:

driver may fetch all rows eagerly
driver may fetch rows in chunks
driver may require fetchSize
driver may require autoCommit=false
driver may use server-side cursor
driver may stream only in special mode

Jadi mental model yang benar:

ResultSet API is cursor-like.
Actual memory/network behavior is driver/database-specific.
Verify it.

2. The Problem With findAll

Bad:

public List<CaseRow> findAllOpenCases() {
    List<CaseRow> rows = new ArrayList<>();

    try (...) {
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                rows.add(mapRow(rs));
            }
        }
    }

    return rows;
}

If result has 5 million rows:

  • heap explodes;
  • GC pressure;
  • long connection hold;
  • transaction/snapshot long;
  • request timeout;
  • network large;
  • serialization impossible;
  • user cannot consume.

Rule:

No unbounded list query in production application path.

3. Large Result Use Cases

Large result appears in:

  • export CSV/XLSX;
  • regulatory reporting;
  • data migration/backfill;
  • ETL pipeline;
  • repair job;
  • audit archive;
  • search reindex;
  • cache rebuild;
  • analytics extract;
  • outbox publisher catch-up;
  • reconciliation job.

Each use case needs different design.

Use CaseBetter Pattern
User dashboardbounded pagination
Infinite scrollkeyset pagination
CSV exportasync export job + chunking
Backfillchunk reader/writer
ETL extractcursor/chunk + checkpoint
Reindexkeyset scan + resume
Audit archivepartition/chunk
API responsestrict max limit
Report snapshotmaterialized/read model

4. Streaming vs Chunking

Streaming

One query/result set stays open while processing rows.

open connection
execute query
while rows:
  fetch/process
close result/statement/connection

Pros:

  • simple sequential scan;
  • low heap if driver streams;
  • no repeated query overhead;
  • stable snapshot possible.

Cons:

  • connection held long;
  • transaction/snapshot may be long;
  • failure restarts from beginning unless checkpointed externally;
  • downstream slowness holds DB resource;
  • cancellation must work;
  • driver-specific behavior.

Chunking

Repeated bounded queries.

read 500 rows after cursor
close connection
process/write
save cursor
repeat

Pros:

  • bounded connection time;
  • resumable;
  • easier retry;
  • easier throttling;
  • lower lock/snapshot duration;
  • easier operational control.

Cons:

  • more queries;
  • cursor correctness required;
  • data can change between chunks;
  • snapshot consistency harder.

Default recommendation:

Use chunking for jobs/backfills/exports unless you explicitly need one database cursor/snapshot.

5. ResultSet Lifecycle for Streaming

Callback pattern:

public void scanOpenCases(CaseRowHandler handler) {
    String sql = """
        select id, case_number, status, updated_at
        from case_file
        where status = ?
        order by id
        """;

    try (
        Connection connection = dataSource.getConnection();
        PreparedStatement ps = connection.prepareStatement(
                sql,
                ResultSet.TYPE_FORWARD_ONLY,
                ResultSet.CONCUR_READ_ONLY
        )
    ) {
        ps.setFetchSize(500);
        ps.setString(1, "OPEN");

        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                handler.handle(mapRow(rs));
            }
        }
    } catch (SQLException e) {
        throw translator.translate("CaseFile.scanOpenCases", e);
    }
}

Handler runs while DB resources are open.

Therefore handler must be fast and bounded.

Bad handler:

row -> externalApi.call(row)

This holds connection while waiting on external network.

Better:

  • write row to bounded local buffer;
  • process in chunks;
  • close DB resource before slow external call;
  • use staging file/table.

6. Fetch Size

setFetchSize hints how many rows driver should fetch at a time.

ps.setFetchSize(500);

Meaning conceptually:

fetch 500 rows from DB
consume them
fetch next 500
...

But actual behavior depends driver/database.

Some drivers may:

  • ignore fetch size;
  • fetch all rows unless special mode;
  • require auto-commit disabled;
  • require forward-only result set;
  • require connection property;
  • treat special values specially.

Rule:

Do not claim streaming until you have measured memory and observed database/driver behavior.

7. Forward-Only Read-Only ResultSet

For large scans, request simple result set:

PreparedStatement ps = connection.prepareStatement(
        sql,
        ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY
);

Avoid scrollable/updatable result sets for large data.

Scrollable result sets may require buffering more data.

Production default for large scan:

TYPE_FORWARD_ONLY
CONCUR_READ_ONLY
explicit fetch size
explicit order
bounded use case

8. Auto-Commit and Cursor Behavior

Some drivers/databases require transaction context for cursor-based fetching.

Pattern:

try (Connection connection = dataSource.getConnection()) {
    boolean oldAutoCommit = connection.getAutoCommit();

    try {
        connection.setAutoCommit(false);

        try (PreparedStatement ps = connection.prepareStatement(SQL)) {
            ps.setFetchSize(500);

            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    handle(mapRow(rs));
                }
            }
        }

        connection.commit();
    } catch (Exception ex) {
        connection.rollback();
        throw ex;
    } finally {
        connection.setAutoCommit(oldAutoCommit);
    }
}

Caution:

  • long read transaction can hold snapshot;
  • commit/rollback required;
  • connection held until scan ends.

If you do not need one snapshot, chunking is often safer.


9. Long Read Transaction Risk

A long read transaction can:

  • hold old snapshot;
  • delay cleanup/vacuum in MVCC databases;
  • consume database resources;
  • keep connection active;
  • conflict with DDL/migration;
  • make failover/cancellation slower;
  • cause read replica lag impact.

Streaming one million rows may avoid heap OOM but still create operational risk.

Trade-off:

Heap safety does not imply database safety.

10. Processing Speed Controls Resource Duration

Connection usage duration:

total time = query execution + row fetching + mapping + handler processing

If handler is slow, connection is held longer.

Example:

5 ms per row * 100,000 rows = 500 seconds

A single scan can hold connection for 8+ minutes.

Better:

  • map rows quickly;
  • write to local file quickly;
  • chunk;
  • avoid remote call per row;
  • decouple DB read from slow downstream;
  • use bounded queue carefully.

11. Backpressure Illusion

ResultSet loop does not create full reactive backpressure.

It gives pull-style reading from Java code, but:

  • driver may prefetch;
  • database may execute query fully before rows consumed;
  • server may hold resources;
  • network buffers exist;
  • downstream slowness still holds DB connection.

Backpressure should be designed:

  • chunk size;
  • bounded queues;
  • worker concurrency;
  • rate limit;
  • timeout;
  • cancellation;
  • progress checkpoint;
  • retry.

Do not say "we stream, therefore it is safe".


12. Callback vs Stream Return

Callback

query.scan(filter, row -> writer.write(row));

Pros:

  • DAO owns resource lifecycle;
  • resources always closed;
  • simple.

Cons:

  • handler runs inside DB lifecycle;
  • harder composition;
  • handler must obey rules.

Java Stream

try (Stream<Row> rows = query.stream(filter)) {
    rows.forEach(...);
}

Pros:

  • composable;
  • familiar API.

Cons:

  • caller must close stream;
  • easy to leak connection;
  • exceptions inside stream can be awkward;
  • resource lifecycle less visible;
  • lazy operations can escape boundary.

For most internal data access, callback or chunking is safer than returning Stream.


13. Safe Stream Implementation Pattern

If you must return stream, attach close handler.

Conceptual example:

public Stream<CaseRow> streamOpenCases() {
    Connection connection = dataSource.getConnection();
    PreparedStatement ps = connection.prepareStatement(SQL);
    ps.setFetchSize(500);
    ps.setString(1, "OPEN");
    ResultSet rs = ps.executeQuery();

    Iterator<CaseRow> iterator = new ResultSetIterator<>(rs, this::mapRow);

    Spliterator<CaseRow> spliterator =
            Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);

    return StreamSupport.stream(spliterator, false)
            .onClose(() -> closeAll(rs, ps, connection));
}

Problems:

  • checked exceptions must be wrapped;
  • if caller forgets try-with-resources, leak;
  • if stream short-circuits, close required;
  • if parallel stream used, disaster;
  • transaction cleanup must be handled.

If you expose stream, document:

Caller must close stream.
Stream must not be parallelized.
Rows must be consumed within method scope.

14. Prefer Chunk Reader for Jobs

public List<CaseExportRow> readAfter(
        CaseExportCursor cursor,
        int limit
) {
    String sql = """
        select id, case_number, status, updated_at
        from case_file
        where (updated_at, id) > (?, ?)
        order by updated_at asc, id asc
        limit ?
        """;

    ...
}

Job:

CaseExportCursor cursor = cursorStore.load(jobId);

while (true) {
    List<CaseExportRow> rows = reader.readAfter(cursor, 1000);
    if (rows.isEmpty()) {
        break;
    }

    csvWriter.writeRows(rows);

    cursor = CaseExportCursor.fromLast(rows);
    cursorStore.save(jobId, cursor);
}

Benefits:

  • each DB read bounded;
  • resource closed before next chunk;
  • resumable;
  • progress visible;
  • no long transaction;
  • easier cancellation.

15. Choosing Cursor Key

A good cursor key must be:

  • deterministic;
  • unique or tie-broken;
  • indexed;
  • monotonic for scan purpose;
  • stable during scan if possible.

Common:

id
(created_at, id)
(updated_at, id)
(sequence_number)

If updated_at changes during scan, rows can move. For backfill, prefer immutable key like id or created sequence.

Keyset condition:

where id > ?
order by id
limit ?

Compound cursor:

where (updated_at > ?)
   or (updated_at = ? and id > ?)
order by updated_at, id
limit ?

Database support for tuple comparison varies; explicit OR form is more portable.


16. Offset Pagination Is Bad for Deep Scans

select ...
from case_file
order by id
limit 1000 offset 5000000;

Problems:

  • database still skips many rows;
  • latency grows with offset;
  • unstable under concurrent changes;
  • not resumable cleanly.

For large scan, use keyset/cursor.

where id > ?
order by id
limit ?

17. Export Architecture

Synchronous export:

HTTP request -> query all rows -> write response

Problems:

  • connection held while client downloads;
  • client disconnect wastes work;
  • request timeout;
  • no resume;
  • pool pressure;
  • memory risk;
  • hard audit.

Better:

Export job:

  • reads chunk;
  • writes to file/storage;
  • saves progress;
  • handles cancellation;
  • records audit;
  • exposes status.

18. CSV Export Chunk Pattern

public void exportCases(ExportJob job, Writer writer) {
    CaseExportCursor cursor = job.cursor();

    while (!job.isCancellationRequested()) {
        List<CaseExportRow> rows = reader.readAfter(cursor, 1000);

        if (rows.isEmpty()) {
            job.markCompleted();
            return;
        }

        for (CaseExportRow row : rows) {
            csv.write(writer, row);
        }

        writer.flush();

        cursor = CaseExportCursor.fromLast(rows);
        job.saveCursor(cursor);
    }

    job.markCancelled();
}

Important:

  • DB resource closed before writing huge next chunk;
  • writer flush can be slow but does not hold DB connection if rows already read;
  • cursor saved after file write;
  • if crash after file write before cursor save, duplicate rows may be written unless output is idempotent/replaceable.

Better for exact export:

  • write to temp file;
  • if job restarts, restart file from beginning; or
  • checkpoint file offset and cursor together; or
  • write chunk files and compose; or
  • stage export snapshot.

19. Snapshot Consistency for Export

Question:

Should export reflect one consistent point in time?

If yes:

Options:

  1. Long read transaction snapshot.
  2. Create export snapshot table.
  3. Materialized view/versioned read model.
  4. Record cutoff time and query only rows <= cutoff.
  5. Use database-specific snapshot/export features.
  6. Pause writes? Usually bad.

Long transaction snapshot:

  • simple conceptually;
  • can be harmful for large export.

Snapshot table:

create table export_case_snapshot as
select ...
from case_file
where ...

Then export from snapshot in chunks.

Trade-off:

  • storage cost;
  • job cleanup;
  • consistency clearer;
  • DB load concentrated during snapshot creation.

20. Read Replica Consideration

Large reads can run on read replica if:

  • eventual consistency acceptable;
  • replica lag monitored;
  • query load won't break replica;
  • export/report can tolerate lag;
  • transaction snapshot semantics understood.

Do not route command consistency reads to replica unless designed.

For export/reporting, replica/read model often better than OLTP primary.


21. Streaming with Transaction Boundary

If using cursor streaming:

connection.setAutoCommit(false);
connection.setReadOnly(true);

try {
    scan(connection);
    connection.commit();
} catch (Exception ex) {
    connection.rollback();
    throw ex;
}

Even read-only scan should end with commit/rollback to close transaction.

If handler fails midway, rollback/close.

Do not leave read transaction open.


22. Timeout for Large Result

Large scans need timeout strategy, but not always same as API query.

For batch/export:

  • per-query timeout;
  • per-chunk timeout;
  • job deadline;
  • cancellation flag;
  • max runtime;
  • lock timeout;
  • statement timeout;
  • read socket timeout if applicable.

If streaming one long query, query timeout may abort even though rows are being consumed depending driver/database behavior. Test.

Chunking makes timeout easier:

each chunk must complete under N seconds

23. Cancellation

For chunk job:

while (!cancellationRequested()) {
    processChunk();
}

Cancellation boundary is chunk boundary.

For streaming query, cancellation can require:

  • Statement.cancel();
  • connection close;
  • thread interruption may not be enough;
  • database-side cancel support;
  • timeout.

Chunking gives cleaner cancellation.


24. Result Mapping Cost

Mapping can dominate.

Example:

DB fetch row: cheap
Java JSON parse per row: expensive

If mapping heavy:

  • connection stays open longer;
  • DB cursor held longer;
  • throughput drops.

Better:

  • select simpler columns;
  • avoid per-row object mapper if possible;
  • move heavy transform after chunk read;
  • batch transform outside DB lifecycle;
  • precompute read model;
  • profile allocation.

25. Avoid One-to-Many Streaming Explosion

Streaming joined one-to-many can create huge duplicate parent data.

select c.*, a.*
from case_file c
join action a on a.case_id = c.id

If each case has many actions, rows multiply.

For export/report, consider:

  • separate export sections;
  • aggregate child data in SQL if appropriate;
  • two-step chunk by parent IDs;
  • denormalized read model;
  • avoid exporting nested object graph as flat repeated rows unless intended.

26. Streaming and Sorting

Large result with order by may require database sort.

If sort not supported by index:

  • DB may sort huge dataset;
  • temp files;
  • slow first row;
  • high memory/disk;
  • timeout.

For streaming scans, order by indexed cursor key.

where id > ?
order by id
limit ?

For full export sorted by non-indexed display field, consider materialized export/read model.


27. Streaming and Index-Only Reads

If exporting only selected columns, index-only scan may help if database supports and visibility conditions allow.

Select minimal columns:

select id, case_number, status, updated_at

Avoid:

select *

Large result magnifies every extra column.


28. Streaming and Large Columns

Avoid selecting BLOB/CLOB/large JSON in large scan unless necessary.

If needed:

  • stream large object separately;
  • store object externally;
  • batch small metadata query first;
  • process one object at a time;
  • set size limits.

Large JSON per row can dominate network and memory.


29. Streaming and Network

Large result transfers data over network.

Consider:

  • row width;
  • compression;
  • database/app network bandwidth;
  • cross-AZ/region cost;
  • fetch round trips;
  • packet size;
  • TLS overhead;
  • slow consumer.

Fetch size too small can cause many round trips. Too large can increase memory.

Measure.


30. Streaming and Connection Pool

A streaming scan can hold one connection for minutes. If many exports run:

10 exports * 1 connection each = 10 pool slots held long

Controls:

  • limit concurrent exports;
  • use separate export pool;
  • use read replica;
  • use chunking;
  • queue jobs;
  • enforce max runtime;
  • monitor connection usage duration.

31. Streaming and Isolation

Isolation affects what rows are seen.

At read committed with chunking:

  • each chunk may see newer data;
  • row changes can move between chunks;
  • inserts can appear later;
  • updates to cursor column can skip/duplicate.

At repeatable read/serializable long transaction:

  • snapshot stable;
  • longer resource impact;
  • possible serialization issues depending database.

For backfill, use immutable cursor and idempotent writes. Do not rely on perfect snapshot unless required.


32. Backfill Scan Pattern

Backfill table by primary key:

select id, old_value
from case_file
where id > ?
  and new_value is null
order by id
limit ?

Then update:

update case_file
set new_value = ?
where id = ?
  and new_value is null;

If row already updated, count 0 acceptable.

Cursor advances by max ID read, not updated count.

Idempotent because update predicate prevents duplicate effect.


33. Reindex Scan Pattern

Search reindex:

read cases by id chunk
send documents to search index
save cursor

Problem: external search index side effect.

Design:

  • outbox/indexing table better;
  • or save cursor only after successful index batch;
  • idempotent document ID in search index;
  • retry same chunk safe;
  • handle deletes/tombstones;
  • full rebuild can use index alias swap.

Do not hold DB connection while sending each document individually to search service.


34. Outbox Publisher Scan

Outbox publisher reads unpublished events.

select id, event_type, payload
from outbox_event
where published_at is null
order by created_at asc, id asc
limit ?

Then publish and mark.

Concurrency pattern may need claiming:

update outbox_event
set claimed_by = ?,
    claimed_at = ?
where id in (...)
  and published_at is null
  and claimed_at is null

Or database-specific skip locked.

Large outbox catch-up should be chunked. Do not stream all unpublished events in one transaction.


35. SKIP LOCKED and Work Queues

Some databases support selecting rows while skipping locked rows.

Concept:

select id
from outbox_event
where published_at is null
order by created_at
limit ?
for update skip locked

Use with care:

  • database-specific;
  • can improve worker concurrency;
  • ordering may become approximate under contention;
  • starvation possible;
  • transaction must be short.

Good for job queues/outbox claiming, not general query pagination.


36. Server-Side Cursor Caveats

Server-side cursor may:

  • hold database resources;
  • require transaction;
  • be closed on commit;
  • not survive connection close;
  • consume memory on server;
  • interact with fetch size;
  • have different behavior for scrollable cursors.

Application must close ResultSet/statement/connection.

Cursor is not free.


37. Fetch Size Tuning

Trade-off:

Fetch SizeEffect
too smallmany round trips
too largememory spike
moderatebalanced

Tune based on:

  • row size;
  • network latency;
  • memory;
  • processing speed;
  • database driver behavior.

Example starting points:

100, 500, 1000

Benchmark actual.

Do not copy values blindly.


38. Large Result Query Plan

Before streaming, inspect plan.

Questions:

  • does query use index for ordering?
  • does it scan table intentionally?
  • does it sort huge dataset?
  • does it join huge tables?
  • does it filter by tenant/partition?
  • does it select large columns?
  • are statistics accurate?
  • can it be partitioned by date/id?
  • can read model help?

Streaming a bad query still hurts database.


39. Partitioned Scan

For very large tables:

scan by date partition
scan by tenant
scan by ID range
scan by hash bucket

Example ID range:

where id >= ?
  and id < ?
order by id

Benefits:

  • parallelizable;
  • resumable;
  • bounded;
  • easier progress estimate.

Risks:

  • uneven distribution;
  • hot partitions;
  • duplicates if ranges overlap;
  • missing rows if ranges wrong.

40. Parallel Scan

Parallelizing scans can improve throughput but increases DB load.

Control:

  • fixed worker count;
  • separate ranges;
  • no overlap;
  • rate limit;
  • DB metrics feedback;
  • stop on error spike.

Do not parallelize by default.

If one worker uses 20% DB capacity,
five workers may not use 100%;
they may use 300% through contention.

41. Streaming API Response

Direct streaming endpoint:

@GetMapping("/export")
public void export(HttpServletResponse response) {
    query.scanRows(row -> writeCsv(response.getWriter(), row));
}

Risks:

  • DB connection held while client receives;
  • slow client controls DB resource duration;
  • client disconnect creates partial response;
  • retry starts over;
  • no durable audit/status;
  • hard timeout.

Prefer async export job.

If direct streaming is unavoidable:

  • require small bounded result;
  • set timeout;
  • limit concurrency;
  • use read-only replica;
  • handle client abort;
  • do not hold transaction if possible;
  • monitor connection usage.

42. Streaming to File Safely

Better direct job:

Path tempFile = Files.createTempFile("case-export-", ".csv");

try (BufferedWriter writer = Files.newBufferedWriter(tempFile)) {
    CaseExportCursor cursor = CaseExportCursor.start();

    while (true) {
        List<CaseExportRow> rows = reader.readAfter(cursor, 1000);
        if (rows.isEmpty()) {
            break;
        }

        csv.writeRows(writer, rows);
        cursor = CaseExportCursor.fromLast(rows);
    }
}

storage.upload(tempFile);

DB connection is held only per chunk, not during entire file upload.


43. Streaming and Checkpoint Consistency

If output is append-only file and cursor saved after each chunk, crash cases matter.

Scenario:

write chunk to file
process crashes before cursor save
restart reads same chunk
file gets duplicate rows

Solutions:

  • write chunk files named by cursor range, idempotently;
  • rebuild entire export from scratch on restart;
  • save cursor and file offset atomically if storage supports;
  • stage rows in export table;
  • use deterministic output and replace final artifact only after success.

44. Large Result and select *

Never select * for large result.

Cost multiplies:

extra columns * millions rows = network + memory + CPU + security risk

Use explicit projection.


45. Large Result and Compression

Compression can help file output, not necessarily DB transfer unless driver/protocol supports.

CSV gzip output:

try (
    OutputStream file = Files.newOutputStream(path);
    GZIPOutputStream gzip = new GZIPOutputStream(file);
    OutputStreamWriter writer = new OutputStreamWriter(gzip, StandardCharsets.UTF_8)
) {
    ...
}

But compression CPU happens while job runs. If compression occurs inside handler while DB connection open, connection held longer. Chunking decouples this.


46. Large Result and Object Allocation

Mapping millions of rows creates allocation pressure.

Optimizations after measuring:

  • use records/simple DTO;
  • avoid unnecessary string transformations;
  • avoid per-row formatter creation;
  • reuse date/time formatters;
  • avoid converting to domain object if export only needs strings;
  • write directly to output from row values if safe;
  • avoid intermediate lists for streaming writer.

But do not prematurely write unreadable code.


47. Direct Row Writer Pattern

For export, sometimes map less:

public void writeCsvRows(Connection connection, Writer writer)
        throws SQLException, IOException {
    try (PreparedStatement ps = connection.prepareStatement(SQL)) {
        ps.setFetchSize(500);

        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                csv.writeField(writer, rs.getString("case_number"));
                csv.writeField(writer, rs.getString("status"));
                csv.endRecord(writer);
            }
        }
    }
}

This is efficient but couples SQL to file format. Use only in dedicated export infrastructure, not general repository.


48. Large Result and Error Recovery

If streaming query fails at row 800,000:

What now?

Options:

  • restart from beginning;
  • resume by cursor if you tracked last row;
  • discard partial output;
  • repair output;
  • mark job failed and require rerun.

Chunking makes recovery clearer because last committed cursor is known.

If streaming, you can still track last processed key, but consistency is more complex.


49. Large Result and Idempotent Output

For re-runnable jobs:

  • output record key deterministic;
  • destination upsert by key;
  • file chunks replaceable;
  • cursor replay safe;
  • downstream side effects idempotent.

Example search indexing:

document id = case:{caseId}
index operation = upsert
retry same row safe

Example CSV file:

retry append is not safe unless file chunking/replacement designed

50. Large Result and Data Changes During Scan

If scanning live OLTP table:

Possible anomalies:

  • row inserted after cursor passed may be missed;
  • row updated to match filter may appear later;
  • row updated out of filter may disappear;
  • cursor column update may duplicate/skip;
  • deleted row disappears.

Mitigations:

  • immutable cursor;
  • cutoff timestamp;
  • snapshot table;
  • repeatable read transaction if acceptable;
  • idempotent periodic reconciliation;
  • scan by primary key and predicate update idempotently;
  • domain event/outbox instead of table scan.

51. Cutoff Timestamp Pattern

For export/backfill:

cutoff = now()
scan rows where created_at <= cutoff

SQL:

where created_at <= ?
  and id > ?
order by id
limit ?

This avoids chasing rows inserted after job start.

But if rows update after cutoff, snapshot still not perfect unless versioned/snapshot semantics exist.


52. High-Water Mark Pattern

For append-only/event tables:

max_id_at_start = select max(id)
scan id > last_id and id <= max_id_at_start

Works if ID is monotonic and append-only.

Good for:

  • event archive;
  • outbox catch-up snapshot;
  • immutable audit export.

Not good if rows can be updated/deleted and state at time matters.


53. Large Result and Regulatory Evidence

For regulatory reports, "which data did this report include?" matters.

Store report metadata:

create table report_run (
    id uuid primary key,
    report_type text not null,
    requested_by uuid not null,
    cutoff_at timestamptz,
    filter_payload jsonb not null,
    status text not null,
    created_at timestamptz not null,
    completed_at timestamptz
);

For stronger evidence, store:

  • snapshot table ID;
  • source schema version;
  • query version;
  • generated file hash;
  • row count;
  • generation logs;
  • policy version.

Large result architecture is not only technical; it can be evidentiary.


54. Large Result and Testing

Test:

  • query returns expected ordered chunks;
  • cursor resumes correctly;
  • empty result;
  • exactly chunk size;
  • more than chunk size;
  • row inserted during scan behavior if relevant;
  • duplicate cursor tie-breaker;
  • cancellation after chunk;
  • failure after writing chunk before cursor save;
  • large-ish dataset memory smoke;
  • fetch size behavior if using streaming;
  • stream close releases connection;
  • slow handler does not hold connection unexpectedly if chunked.

Use real database for fetch/cursor behavior.


55. Memory Smoke Test

Create enough rows to reveal unbounded list.

@Test
void exportReadsInChunks() {
    insertCases(10_000);

    exportJob.run();

    assertThat(metrics.maxRowsLoadedInMemory()).isLessThanOrEqualTo(1000);
}

You may not literally track memory in unit test, but can instrument reader calls and chunk sizes.

For true memory behavior, use performance/integration test.


56. Connection Usage Test

If custom streaming API returns Stream, test close.

@Test
void streamCloseReleasesConnection() {
    Stream<CaseRow> stream = query.streamOpenCases();

    stream.limit(10).toList();

    stream.close();

    assertThat(pool.activeConnections()).isZero();
}

In practice, pool metrics in tests may be awkward, but lifecycle tests are valuable if exposing streams.


57. Observability for Large Result

Metrics:

large_query.rows_read{query="CaseExport.readChunk"}
large_query.chunk.duration{query="CaseExport.readChunk"}
large_query.chunk.size{query="CaseExport.readChunk"}
large_query.cursor.lag{job="case-export"}
large_query.connection.usage.duration
large_query.fetch.error.count
export.bytes_written
export.rows_written
export.job.duration
export.job.failure.count

Logs per chunk:

{
  "job": "case-export",
  "jobId": "...",
  "cursor": "...",
  "rows": 1000,
  "durationMs": 230
}

Avoid logging row payload.


58. Incident Playbook: Export Kills API

Symptoms:

  • API latency spikes;
  • DB connections active high;
  • export jobs running;
  • slow clients downloading;
  • DB CPU or IO high;
  • pool acquisition timeout.

Immediate actions:

  1. pause export workers;
  2. limit export concurrency;
  3. move export to read replica if available;
  4. reduce chunk size;
  5. kill runaway queries carefully;
  6. shed load;
  7. disable direct HTTP streaming.

Long-term fixes:

  • async export architecture;
  • separate pool;
  • chunking;
  • read model;
  • throttling;
  • query/index tuning;
  • report snapshot table.

59. Incident Playbook: Streaming Query OOM

Symptoms:

  • heap rises during export;
  • GC pressure;
  • OOM;
  • large result query in logs.

Causes:

  • driver fetched all rows;
  • code accumulated list;
  • fetch size ignored;
  • row object too heavy;
  • JSON/BLOB selected;
  • writer buffers too much.

Fix:

  • verify driver streaming config;
  • use chunking;
  • select fewer columns;
  • avoid unbounded list;
  • stream file writer carefully;
  • profile allocation.

60. Incident Playbook: Long Snapshot Blocks Cleanup

Symptoms:

  • long read transaction;
  • vacuum/cleanup lag in MVCC database;
  • replica lag;
  • old snapshots;
  • export/report running.

Fix:

  • shorten transaction;
  • chunk without long transaction;
  • snapshot table;
  • run on replica;
  • limit report duration;
  • cancel runaway job.

61. Decision Matrix

RequirementPrefer
Small API listbounded pagination
Deep scrollingkeyset pagination
Backfill mutable tablechunk by immutable key
Full export with evidenceasync job + snapshot/read model
Search reindexchunk + idempotent upsert
One consistent DB snapshotcursor/transaction or snapshot table
Minimal DB resource durationchunking
Highest sequential throughputstreaming cursor, carefully
Slow downstreamchunk and decouple
Failure resumechunk + cursor
Direct client downloadavoid unless bounded

62. Anti-Pattern: Streaming to Slow Client

ResultSet -> HTTP response -> user's slow network

Connection held hostage by client.

Fix: async export.


63. Anti-Pattern: Returning Stream Without Close Contract

Stream<Row> rows = dao.streamAll();
rows.limit(10).toList(); // stream not closed

Connection leak.

Fix:

try (Stream<Row> rows = dao.streamAll()) {
    rows.limit(10).toList();
}

Better: avoid returning stream unless necessary.


64. Anti-Pattern: Fetch Size Without Verification

ps.setFetchSize(1000);
// assume memory safe

Maybe driver ignores it.

Fix:

  • integration/performance test;
  • driver docs;
  • observe memory/network;
  • chunk if uncertain.

65. Anti-Pattern: Offset for Millions

limit 1000 offset 9000000

Fix: keyset cursor.


66. Anti-Pattern: Large Scan With Non-Indexed Sort

order by officer_display_name

On huge table this may sort massive result.

Fix:

  • index if valid;
  • read model;
  • precomputed export;
  • accept unsorted by display;
  • sort outside DB only if bounded.

67. Anti-Pattern: External Call Per Row While ResultSet Open

while (rs.next()) {
    externalApi.send(mapRow(rs));
}

Fix:

  • read chunk;
  • close DB resource;
  • send batch to external;
  • checkpoint after successful send;
  • use outbox.

68. Safe Large Result Checklist

  • Result is bounded or intentionally chunked/streamed.
  • select * avoided.
  • Deterministic order exists.
  • Cursor key is indexed and stable.
  • Fetch size behavior verified if streaming.
  • Connection lifetime understood.
  • Transaction/snapshot duration acceptable.
  • Downstream processing does not hold DB connection unnecessarily.
  • Cancellation strategy exists.
  • Timeout strategy exists.
  • Resume/retry strategy exists.
  • Output idempotency considered.
  • Export/report evidence captured if regulatory.
  • Metrics for rows/chunks/duration exist.
  • Direct HTTP streaming avoided or strictly bounded.
  • Test uses real database for large-result behavior.

69. Mini Lab

Design large export for:

Export all closed enforcement cases for a regulator for the previous quarter.

Requirements:

  • potentially 3 million rows;
  • must include exact filter criteria;
  • must be reproducible;
  • must not block API;
  • must record who requested export;
  • must produce CSV;
  • must allow retry after crash;
  • must avoid duplicate rows;
  • must not expose sensitive notes;
  • must produce row count and file hash.

Questions:

  1. Direct streaming or async job?
  2. Snapshot table or cutoff timestamp?
  3. Cursor key?
  4. Chunk size?
  5. Where is progress stored?
  6. How is output written idempotently?
  7. What DB pool is used?
  8. What query/index supports scan?
  9. What audit metadata is stored?
  10. What happens if job is cancelled?
  11. What happens if app crashes after writing a chunk?
  12. How do you verify row count and file hash?

70. Summary

Large result handling is about resource control, not only memory.

You must master:

  • ResultSet cursor API vs driver behavior;
  • fetch size;
  • forward-only/read-only result set;
  • long transaction risk;
  • callback vs stream vs chunking;
  • keyset cursor;
  • offset pagination limits;
  • export architecture;
  • snapshot/cutoff/high-water mark;
  • cancellation and timeout;
  • backpressure illusion;
  • connection pool impact;
  • row mapping cost;
  • large column risk;
  • idempotent output;
  • data changes during scan;
  • regulatory report evidence;
  • observability and incident playbooks.

This closes the JDBC primitive phase. Part berikutnya masuk Phase 3: Transaction Pattern dan Consistency Boundary, dimulai dari desain transaction boundary pada service/use case/command handler/async boundary.


71. References

Lesson Recap

You just completed lesson 16 in build core. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.

Continue The Track

Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.