JDBC Streaming Large Result
Learn Java Data Access Pattern In Action - Part 016
Streaming large result JDBC secara production-grade: ResultSet cursor, fetch size, driver behavior, transaction scope, memory safety, callback vs Stream, chunking, export, cancellation, timeout, backpressure illusion, dan operational playbook.
Part 016 — JDBC Streaming Large Result
Membaca banyak row bukan hanya masalah
while (rs.next()).Large result adalah masalah resource lifecycle:
- berapa banyak row yang ada di memory;
- berapa lama connection dipegang;
- apakah transaction/snapshot tetap terbuka;
- apakah database cursor benar-benar dipakai;
- apakah client/downstream lebih lambat dari database;
- apakah cancellation bekerja;
- apakah result bisa diulang setelah failure;
- apakah export menahan pool selama user download.
Streaming yang salah hanya memindahkan masalah dari heap ke connection pool dan database session.
Part ini membahas large result dengan JDBC secara production-grade.
1. Core Thesis
ResultSet terlihat seperti cursor row-by-row:
while (rs.next()) {
Row row = mapRow(rs);
handle(row);
}
Tetapi behavior fisik bisa berbeda:
driver may fetch all rows eagerly
driver may fetch rows in chunks
driver may require fetchSize
driver may require autoCommit=false
driver may use server-side cursor
driver may stream only in special mode
Jadi mental model yang benar:
ResultSet API is cursor-like.
Actual memory/network behavior is driver/database-specific.
Verify it.
2. The Problem With findAll
Bad:
public List<CaseRow> findAllOpenCases() {
List<CaseRow> rows = new ArrayList<>();
try (...) {
try (ResultSet rs = ps.executeQuery()) {
while (rs.next()) {
rows.add(mapRow(rs));
}
}
}
return rows;
}
If result has 5 million rows:
- heap explodes;
- GC pressure;
- long connection hold;
- transaction/snapshot long;
- request timeout;
- network large;
- serialization impossible;
- user cannot consume.
Rule:
No unbounded list query in production application path.
3. Large Result Use Cases
Large result appears in:
- export CSV/XLSX;
- regulatory reporting;
- data migration/backfill;
- ETL pipeline;
- repair job;
- audit archive;
- search reindex;
- cache rebuild;
- analytics extract;
- outbox publisher catch-up;
- reconciliation job.
Each use case needs different design.
| Use Case | Better Pattern |
|---|---|
| User dashboard | bounded pagination |
| Infinite scroll | keyset pagination |
| CSV export | async export job + chunking |
| Backfill | chunk reader/writer |
| ETL extract | cursor/chunk + checkpoint |
| Reindex | keyset scan + resume |
| Audit archive | partition/chunk |
| API response | strict max limit |
| Report snapshot | materialized/read model |
4. Streaming vs Chunking
Streaming
One query/result set stays open while processing rows.
open connection
execute query
while rows:
fetch/process
close result/statement/connection
Pros:
- simple sequential scan;
- low heap if driver streams;
- no repeated query overhead;
- stable snapshot possible.
Cons:
- connection held long;
- transaction/snapshot may be long;
- failure restarts from beginning unless checkpointed externally;
- downstream slowness holds DB resource;
- cancellation must work;
- driver-specific behavior.
Chunking
Repeated bounded queries.
read 500 rows after cursor
close connection
process/write
save cursor
repeat
Pros:
- bounded connection time;
- resumable;
- easier retry;
- easier throttling;
- lower lock/snapshot duration;
- easier operational control.
Cons:
- more queries;
- cursor correctness required;
- data can change between chunks;
- snapshot consistency harder.
Default recommendation:
Use chunking for jobs/backfills/exports unless you explicitly need one database cursor/snapshot.
5. ResultSet Lifecycle for Streaming
Callback pattern:
public void scanOpenCases(CaseRowHandler handler) {
String sql = """
select id, case_number, status, updated_at
from case_file
where status = ?
order by id
""";
try (
Connection connection = dataSource.getConnection();
PreparedStatement ps = connection.prepareStatement(
sql,
ResultSet.TYPE_FORWARD_ONLY,
ResultSet.CONCUR_READ_ONLY
)
) {
ps.setFetchSize(500);
ps.setString(1, "OPEN");
try (ResultSet rs = ps.executeQuery()) {
while (rs.next()) {
handler.handle(mapRow(rs));
}
}
} catch (SQLException e) {
throw translator.translate("CaseFile.scanOpenCases", e);
}
}
Handler runs while DB resources are open.
Therefore handler must be fast and bounded.
Bad handler:
row -> externalApi.call(row)
This holds connection while waiting on external network.
Better:
- write row to bounded local buffer;
- process in chunks;
- close DB resource before slow external call;
- use staging file/table.
6. Fetch Size
setFetchSize hints how many rows driver should fetch at a time.
ps.setFetchSize(500);
Meaning conceptually:
fetch 500 rows from DB
consume them
fetch next 500
...
But actual behavior depends driver/database.
Some drivers may:
- ignore fetch size;
- fetch all rows unless special mode;
- require auto-commit disabled;
- require forward-only result set;
- require connection property;
- treat special values specially.
Rule:
Do not claim streaming until you have measured memory and observed database/driver behavior.
7. Forward-Only Read-Only ResultSet
For large scans, request simple result set:
PreparedStatement ps = connection.prepareStatement(
sql,
ResultSet.TYPE_FORWARD_ONLY,
ResultSet.CONCUR_READ_ONLY
);
Avoid scrollable/updatable result sets for large data.
Scrollable result sets may require buffering more data.
Production default for large scan:
TYPE_FORWARD_ONLY
CONCUR_READ_ONLY
explicit fetch size
explicit order
bounded use case
8. Auto-Commit and Cursor Behavior
Some drivers/databases require transaction context for cursor-based fetching.
Pattern:
try (Connection connection = dataSource.getConnection()) {
boolean oldAutoCommit = connection.getAutoCommit();
try {
connection.setAutoCommit(false);
try (PreparedStatement ps = connection.prepareStatement(SQL)) {
ps.setFetchSize(500);
try (ResultSet rs = ps.executeQuery()) {
while (rs.next()) {
handle(mapRow(rs));
}
}
}
connection.commit();
} catch (Exception ex) {
connection.rollback();
throw ex;
} finally {
connection.setAutoCommit(oldAutoCommit);
}
}
Caution:
- long read transaction can hold snapshot;
- commit/rollback required;
- connection held until scan ends.
If you do not need one snapshot, chunking is often safer.
9. Long Read Transaction Risk
A long read transaction can:
- hold old snapshot;
- delay cleanup/vacuum in MVCC databases;
- consume database resources;
- keep connection active;
- conflict with DDL/migration;
- make failover/cancellation slower;
- cause read replica lag impact.
Streaming one million rows may avoid heap OOM but still create operational risk.
Trade-off:
Heap safety does not imply database safety.
10. Processing Speed Controls Resource Duration
Connection usage duration:
total time = query execution + row fetching + mapping + handler processing
If handler is slow, connection is held longer.
Example:
5 ms per row * 100,000 rows = 500 seconds
A single scan can hold connection for 8+ minutes.
Better:
- map rows quickly;
- write to local file quickly;
- chunk;
- avoid remote call per row;
- decouple DB read from slow downstream;
- use bounded queue carefully.
11. Backpressure Illusion
ResultSet loop does not create full reactive backpressure.
It gives pull-style reading from Java code, but:
- driver may prefetch;
- database may execute query fully before rows consumed;
- server may hold resources;
- network buffers exist;
- downstream slowness still holds DB connection.
Backpressure should be designed:
- chunk size;
- bounded queues;
- worker concurrency;
- rate limit;
- timeout;
- cancellation;
- progress checkpoint;
- retry.
Do not say "we stream, therefore it is safe".
12. Callback vs Stream Return
Callback
query.scan(filter, row -> writer.write(row));
Pros:
- DAO owns resource lifecycle;
- resources always closed;
- simple.
Cons:
- handler runs inside DB lifecycle;
- harder composition;
- handler must obey rules.
Java Stream
try (Stream<Row> rows = query.stream(filter)) {
rows.forEach(...);
}
Pros:
- composable;
- familiar API.
Cons:
- caller must close stream;
- easy to leak connection;
- exceptions inside stream can be awkward;
- resource lifecycle less visible;
- lazy operations can escape boundary.
For most internal data access, callback or chunking is safer than returning Stream.
13. Safe Stream Implementation Pattern
If you must return stream, attach close handler.
Conceptual example:
public Stream<CaseRow> streamOpenCases() {
Connection connection = dataSource.getConnection();
PreparedStatement ps = connection.prepareStatement(SQL);
ps.setFetchSize(500);
ps.setString(1, "OPEN");
ResultSet rs = ps.executeQuery();
Iterator<CaseRow> iterator = new ResultSetIterator<>(rs, this::mapRow);
Spliterator<CaseRow> spliterator =
Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);
return StreamSupport.stream(spliterator, false)
.onClose(() -> closeAll(rs, ps, connection));
}
Problems:
- checked exceptions must be wrapped;
- if caller forgets try-with-resources, leak;
- if stream short-circuits, close required;
- if parallel stream used, disaster;
- transaction cleanup must be handled.
If you expose stream, document:
Caller must close stream.
Stream must not be parallelized.
Rows must be consumed within method scope.
14. Prefer Chunk Reader for Jobs
public List<CaseExportRow> readAfter(
CaseExportCursor cursor,
int limit
) {
String sql = """
select id, case_number, status, updated_at
from case_file
where (updated_at, id) > (?, ?)
order by updated_at asc, id asc
limit ?
""";
...
}
Job:
CaseExportCursor cursor = cursorStore.load(jobId);
while (true) {
List<CaseExportRow> rows = reader.readAfter(cursor, 1000);
if (rows.isEmpty()) {
break;
}
csvWriter.writeRows(rows);
cursor = CaseExportCursor.fromLast(rows);
cursorStore.save(jobId, cursor);
}
Benefits:
- each DB read bounded;
- resource closed before next chunk;
- resumable;
- progress visible;
- no long transaction;
- easier cancellation.
15. Choosing Cursor Key
A good cursor key must be:
- deterministic;
- unique or tie-broken;
- indexed;
- monotonic for scan purpose;
- stable during scan if possible.
Common:
id
(created_at, id)
(updated_at, id)
(sequence_number)
If updated_at changes during scan, rows can move. For backfill, prefer immutable key like id or created sequence.
Keyset condition:
where id > ?
order by id
limit ?
Compound cursor:
where (updated_at > ?)
or (updated_at = ? and id > ?)
order by updated_at, id
limit ?
Database support for tuple comparison varies; explicit OR form is more portable.
16. Offset Pagination Is Bad for Deep Scans
select ...
from case_file
order by id
limit 1000 offset 5000000;
Problems:
- database still skips many rows;
- latency grows with offset;
- unstable under concurrent changes;
- not resumable cleanly.
For large scan, use keyset/cursor.
where id > ?
order by id
limit ?
17. Export Architecture
Synchronous export:
HTTP request -> query all rows -> write response
Problems:
- connection held while client downloads;
- client disconnect wastes work;
- request timeout;
- no resume;
- pool pressure;
- memory risk;
- hard audit.
Better:
Export job:
- reads chunk;
- writes to file/storage;
- saves progress;
- handles cancellation;
- records audit;
- exposes status.
18. CSV Export Chunk Pattern
public void exportCases(ExportJob job, Writer writer) {
CaseExportCursor cursor = job.cursor();
while (!job.isCancellationRequested()) {
List<CaseExportRow> rows = reader.readAfter(cursor, 1000);
if (rows.isEmpty()) {
job.markCompleted();
return;
}
for (CaseExportRow row : rows) {
csv.write(writer, row);
}
writer.flush();
cursor = CaseExportCursor.fromLast(rows);
job.saveCursor(cursor);
}
job.markCancelled();
}
Important:
- DB resource closed before writing huge next chunk;
- writer flush can be slow but does not hold DB connection if rows already read;
- cursor saved after file write;
- if crash after file write before cursor save, duplicate rows may be written unless output is idempotent/replaceable.
Better for exact export:
- write to temp file;
- if job restarts, restart file from beginning; or
- checkpoint file offset and cursor together; or
- write chunk files and compose; or
- stage export snapshot.
19. Snapshot Consistency for Export
Question:
Should export reflect one consistent point in time?
If yes:
Options:
- Long read transaction snapshot.
- Create export snapshot table.
- Materialized view/versioned read model.
- Record cutoff time and query only rows <= cutoff.
- Use database-specific snapshot/export features.
- Pause writes? Usually bad.
Long transaction snapshot:
- simple conceptually;
- can be harmful for large export.
Snapshot table:
create table export_case_snapshot as
select ...
from case_file
where ...
Then export from snapshot in chunks.
Trade-off:
- storage cost;
- job cleanup;
- consistency clearer;
- DB load concentrated during snapshot creation.
20. Read Replica Consideration
Large reads can run on read replica if:
- eventual consistency acceptable;
- replica lag monitored;
- query load won't break replica;
- export/report can tolerate lag;
- transaction snapshot semantics understood.
Do not route command consistency reads to replica unless designed.
For export/reporting, replica/read model often better than OLTP primary.
21. Streaming with Transaction Boundary
If using cursor streaming:
connection.setAutoCommit(false);
connection.setReadOnly(true);
try {
scan(connection);
connection.commit();
} catch (Exception ex) {
connection.rollback();
throw ex;
}
Even read-only scan should end with commit/rollback to close transaction.
If handler fails midway, rollback/close.
Do not leave read transaction open.
22. Timeout for Large Result
Large scans need timeout strategy, but not always same as API query.
For batch/export:
- per-query timeout;
- per-chunk timeout;
- job deadline;
- cancellation flag;
- max runtime;
- lock timeout;
- statement timeout;
- read socket timeout if applicable.
If streaming one long query, query timeout may abort even though rows are being consumed depending driver/database behavior. Test.
Chunking makes timeout easier:
each chunk must complete under N seconds
23. Cancellation
For chunk job:
while (!cancellationRequested()) {
processChunk();
}
Cancellation boundary is chunk boundary.
For streaming query, cancellation can require:
Statement.cancel();- connection close;
- thread interruption may not be enough;
- database-side cancel support;
- timeout.
Chunking gives cleaner cancellation.
24. Result Mapping Cost
Mapping can dominate.
Example:
DB fetch row: cheap
Java JSON parse per row: expensive
If mapping heavy:
- connection stays open longer;
- DB cursor held longer;
- throughput drops.
Better:
- select simpler columns;
- avoid per-row object mapper if possible;
- move heavy transform after chunk read;
- batch transform outside DB lifecycle;
- precompute read model;
- profile allocation.
25. Avoid One-to-Many Streaming Explosion
Streaming joined one-to-many can create huge duplicate parent data.
select c.*, a.*
from case_file c
join action a on a.case_id = c.id
If each case has many actions, rows multiply.
For export/report, consider:
- separate export sections;
- aggregate child data in SQL if appropriate;
- two-step chunk by parent IDs;
- denormalized read model;
- avoid exporting nested object graph as flat repeated rows unless intended.
26. Streaming and Sorting
Large result with order by may require database sort.
If sort not supported by index:
- DB may sort huge dataset;
- temp files;
- slow first row;
- high memory/disk;
- timeout.
For streaming scans, order by indexed cursor key.
where id > ?
order by id
limit ?
For full export sorted by non-indexed display field, consider materialized export/read model.
27. Streaming and Index-Only Reads
If exporting only selected columns, index-only scan may help if database supports and visibility conditions allow.
Select minimal columns:
select id, case_number, status, updated_at
Avoid:
select *
Large result magnifies every extra column.
28. Streaming and Large Columns
Avoid selecting BLOB/CLOB/large JSON in large scan unless necessary.
If needed:
- stream large object separately;
- store object externally;
- batch small metadata query first;
- process one object at a time;
- set size limits.
Large JSON per row can dominate network and memory.
29. Streaming and Network
Large result transfers data over network.
Consider:
- row width;
- compression;
- database/app network bandwidth;
- cross-AZ/region cost;
- fetch round trips;
- packet size;
- TLS overhead;
- slow consumer.
Fetch size too small can cause many round trips. Too large can increase memory.
Measure.
30. Streaming and Connection Pool
A streaming scan can hold one connection for minutes. If many exports run:
10 exports * 1 connection each = 10 pool slots held long
Controls:
- limit concurrent exports;
- use separate export pool;
- use read replica;
- use chunking;
- queue jobs;
- enforce max runtime;
- monitor connection usage duration.
31. Streaming and Isolation
Isolation affects what rows are seen.
At read committed with chunking:
- each chunk may see newer data;
- row changes can move between chunks;
- inserts can appear later;
- updates to cursor column can skip/duplicate.
At repeatable read/serializable long transaction:
- snapshot stable;
- longer resource impact;
- possible serialization issues depending database.
For backfill, use immutable cursor and idempotent writes. Do not rely on perfect snapshot unless required.
32. Backfill Scan Pattern
Backfill table by primary key:
select id, old_value
from case_file
where id > ?
and new_value is null
order by id
limit ?
Then update:
update case_file
set new_value = ?
where id = ?
and new_value is null;
If row already updated, count 0 acceptable.
Cursor advances by max ID read, not updated count.
Idempotent because update predicate prevents duplicate effect.
33. Reindex Scan Pattern
Search reindex:
read cases by id chunk
send documents to search index
save cursor
Problem: external search index side effect.
Design:
- outbox/indexing table better;
- or save cursor only after successful index batch;
- idempotent document ID in search index;
- retry same chunk safe;
- handle deletes/tombstones;
- full rebuild can use index alias swap.
Do not hold DB connection while sending each document individually to search service.
34. Outbox Publisher Scan
Outbox publisher reads unpublished events.
select id, event_type, payload
from outbox_event
where published_at is null
order by created_at asc, id asc
limit ?
Then publish and mark.
Concurrency pattern may need claiming:
update outbox_event
set claimed_by = ?,
claimed_at = ?
where id in (...)
and published_at is null
and claimed_at is null
Or database-specific skip locked.
Large outbox catch-up should be chunked. Do not stream all unpublished events in one transaction.
35. SKIP LOCKED and Work Queues
Some databases support selecting rows while skipping locked rows.
Concept:
select id
from outbox_event
where published_at is null
order by created_at
limit ?
for update skip locked
Use with care:
- database-specific;
- can improve worker concurrency;
- ordering may become approximate under contention;
- starvation possible;
- transaction must be short.
Good for job queues/outbox claiming, not general query pagination.
36. Server-Side Cursor Caveats
Server-side cursor may:
- hold database resources;
- require transaction;
- be closed on commit;
- not survive connection close;
- consume memory on server;
- interact with fetch size;
- have different behavior for scrollable cursors.
Application must close ResultSet/statement/connection.
Cursor is not free.
37. Fetch Size Tuning
Trade-off:
| Fetch Size | Effect |
|---|---|
| too small | many round trips |
| too large | memory spike |
| moderate | balanced |
Tune based on:
- row size;
- network latency;
- memory;
- processing speed;
- database driver behavior.
Example starting points:
100, 500, 1000
Benchmark actual.
Do not copy values blindly.
38. Large Result Query Plan
Before streaming, inspect plan.
Questions:
- does query use index for ordering?
- does it scan table intentionally?
- does it sort huge dataset?
- does it join huge tables?
- does it filter by tenant/partition?
- does it select large columns?
- are statistics accurate?
- can it be partitioned by date/id?
- can read model help?
Streaming a bad query still hurts database.
39. Partitioned Scan
For very large tables:
scan by date partition
scan by tenant
scan by ID range
scan by hash bucket
Example ID range:
where id >= ?
and id < ?
order by id
Benefits:
- parallelizable;
- resumable;
- bounded;
- easier progress estimate.
Risks:
- uneven distribution;
- hot partitions;
- duplicates if ranges overlap;
- missing rows if ranges wrong.
40. Parallel Scan
Parallelizing scans can improve throughput but increases DB load.
Control:
- fixed worker count;
- separate ranges;
- no overlap;
- rate limit;
- DB metrics feedback;
- stop on error spike.
Do not parallelize by default.
If one worker uses 20% DB capacity,
five workers may not use 100%;
they may use 300% through contention.
41. Streaming API Response
Direct streaming endpoint:
@GetMapping("/export")
public void export(HttpServletResponse response) {
query.scanRows(row -> writeCsv(response.getWriter(), row));
}
Risks:
- DB connection held while client receives;
- slow client controls DB resource duration;
- client disconnect creates partial response;
- retry starts over;
- no durable audit/status;
- hard timeout.
Prefer async export job.
If direct streaming is unavoidable:
- require small bounded result;
- set timeout;
- limit concurrency;
- use read-only replica;
- handle client abort;
- do not hold transaction if possible;
- monitor connection usage.
42. Streaming to File Safely
Better direct job:
Path tempFile = Files.createTempFile("case-export-", ".csv");
try (BufferedWriter writer = Files.newBufferedWriter(tempFile)) {
CaseExportCursor cursor = CaseExportCursor.start();
while (true) {
List<CaseExportRow> rows = reader.readAfter(cursor, 1000);
if (rows.isEmpty()) {
break;
}
csv.writeRows(writer, rows);
cursor = CaseExportCursor.fromLast(rows);
}
}
storage.upload(tempFile);
DB connection is held only per chunk, not during entire file upload.
43. Streaming and Checkpoint Consistency
If output is append-only file and cursor saved after each chunk, crash cases matter.
Scenario:
write chunk to file
process crashes before cursor save
restart reads same chunk
file gets duplicate rows
Solutions:
- write chunk files named by cursor range, idempotently;
- rebuild entire export from scratch on restart;
- save cursor and file offset atomically if storage supports;
- stage rows in export table;
- use deterministic output and replace final artifact only after success.
44. Large Result and select *
Never select * for large result.
Cost multiplies:
extra columns * millions rows = network + memory + CPU + security risk
Use explicit projection.
45. Large Result and Compression
Compression can help file output, not necessarily DB transfer unless driver/protocol supports.
CSV gzip output:
try (
OutputStream file = Files.newOutputStream(path);
GZIPOutputStream gzip = new GZIPOutputStream(file);
OutputStreamWriter writer = new OutputStreamWriter(gzip, StandardCharsets.UTF_8)
) {
...
}
But compression CPU happens while job runs. If compression occurs inside handler while DB connection open, connection held longer. Chunking decouples this.
46. Large Result and Object Allocation
Mapping millions of rows creates allocation pressure.
Optimizations after measuring:
- use records/simple DTO;
- avoid unnecessary string transformations;
- avoid per-row formatter creation;
- reuse date/time formatters;
- avoid converting to domain object if export only needs strings;
- write directly to output from row values if safe;
- avoid intermediate lists for streaming writer.
But do not prematurely write unreadable code.
47. Direct Row Writer Pattern
For export, sometimes map less:
public void writeCsvRows(Connection connection, Writer writer)
throws SQLException, IOException {
try (PreparedStatement ps = connection.prepareStatement(SQL)) {
ps.setFetchSize(500);
try (ResultSet rs = ps.executeQuery()) {
while (rs.next()) {
csv.writeField(writer, rs.getString("case_number"));
csv.writeField(writer, rs.getString("status"));
csv.endRecord(writer);
}
}
}
}
This is efficient but couples SQL to file format. Use only in dedicated export infrastructure, not general repository.
48. Large Result and Error Recovery
If streaming query fails at row 800,000:
What now?
Options:
- restart from beginning;
- resume by cursor if you tracked last row;
- discard partial output;
- repair output;
- mark job failed and require rerun.
Chunking makes recovery clearer because last committed cursor is known.
If streaming, you can still track last processed key, but consistency is more complex.
49. Large Result and Idempotent Output
For re-runnable jobs:
- output record key deterministic;
- destination upsert by key;
- file chunks replaceable;
- cursor replay safe;
- downstream side effects idempotent.
Example search indexing:
document id = case:{caseId}
index operation = upsert
retry same row safe
Example CSV file:
retry append is not safe unless file chunking/replacement designed
50. Large Result and Data Changes During Scan
If scanning live OLTP table:
Possible anomalies:
- row inserted after cursor passed may be missed;
- row updated to match filter may appear later;
- row updated out of filter may disappear;
- cursor column update may duplicate/skip;
- deleted row disappears.
Mitigations:
- immutable cursor;
- cutoff timestamp;
- snapshot table;
- repeatable read transaction if acceptable;
- idempotent periodic reconciliation;
- scan by primary key and predicate update idempotently;
- domain event/outbox instead of table scan.
51. Cutoff Timestamp Pattern
For export/backfill:
cutoff = now()
scan rows where created_at <= cutoff
SQL:
where created_at <= ?
and id > ?
order by id
limit ?
This avoids chasing rows inserted after job start.
But if rows update after cutoff, snapshot still not perfect unless versioned/snapshot semantics exist.
52. High-Water Mark Pattern
For append-only/event tables:
max_id_at_start = select max(id)
scan id > last_id and id <= max_id_at_start
Works if ID is monotonic and append-only.
Good for:
- event archive;
- outbox catch-up snapshot;
- immutable audit export.
Not good if rows can be updated/deleted and state at time matters.
53. Large Result and Regulatory Evidence
For regulatory reports, "which data did this report include?" matters.
Store report metadata:
create table report_run (
id uuid primary key,
report_type text not null,
requested_by uuid not null,
cutoff_at timestamptz,
filter_payload jsonb not null,
status text not null,
created_at timestamptz not null,
completed_at timestamptz
);
For stronger evidence, store:
- snapshot table ID;
- source schema version;
- query version;
- generated file hash;
- row count;
- generation logs;
- policy version.
Large result architecture is not only technical; it can be evidentiary.
54. Large Result and Testing
Test:
- query returns expected ordered chunks;
- cursor resumes correctly;
- empty result;
- exactly chunk size;
- more than chunk size;
- row inserted during scan behavior if relevant;
- duplicate cursor tie-breaker;
- cancellation after chunk;
- failure after writing chunk before cursor save;
- large-ish dataset memory smoke;
- fetch size behavior if using streaming;
- stream close releases connection;
- slow handler does not hold connection unexpectedly if chunked.
Use real database for fetch/cursor behavior.
55. Memory Smoke Test
Create enough rows to reveal unbounded list.
@Test
void exportReadsInChunks() {
insertCases(10_000);
exportJob.run();
assertThat(metrics.maxRowsLoadedInMemory()).isLessThanOrEqualTo(1000);
}
You may not literally track memory in unit test, but can instrument reader calls and chunk sizes.
For true memory behavior, use performance/integration test.
56. Connection Usage Test
If custom streaming API returns Stream, test close.
@Test
void streamCloseReleasesConnection() {
Stream<CaseRow> stream = query.streamOpenCases();
stream.limit(10).toList();
stream.close();
assertThat(pool.activeConnections()).isZero();
}
In practice, pool metrics in tests may be awkward, but lifecycle tests are valuable if exposing streams.
57. Observability for Large Result
Metrics:
large_query.rows_read{query="CaseExport.readChunk"}
large_query.chunk.duration{query="CaseExport.readChunk"}
large_query.chunk.size{query="CaseExport.readChunk"}
large_query.cursor.lag{job="case-export"}
large_query.connection.usage.duration
large_query.fetch.error.count
export.bytes_written
export.rows_written
export.job.duration
export.job.failure.count
Logs per chunk:
{
"job": "case-export",
"jobId": "...",
"cursor": "...",
"rows": 1000,
"durationMs": 230
}
Avoid logging row payload.
58. Incident Playbook: Export Kills API
Symptoms:
- API latency spikes;
- DB connections active high;
- export jobs running;
- slow clients downloading;
- DB CPU or IO high;
- pool acquisition timeout.
Immediate actions:
- pause export workers;
- limit export concurrency;
- move export to read replica if available;
- reduce chunk size;
- kill runaway queries carefully;
- shed load;
- disable direct HTTP streaming.
Long-term fixes:
- async export architecture;
- separate pool;
- chunking;
- read model;
- throttling;
- query/index tuning;
- report snapshot table.
59. Incident Playbook: Streaming Query OOM
Symptoms:
- heap rises during export;
- GC pressure;
- OOM;
- large result query in logs.
Causes:
- driver fetched all rows;
- code accumulated list;
- fetch size ignored;
- row object too heavy;
- JSON/BLOB selected;
- writer buffers too much.
Fix:
- verify driver streaming config;
- use chunking;
- select fewer columns;
- avoid unbounded list;
- stream file writer carefully;
- profile allocation.
60. Incident Playbook: Long Snapshot Blocks Cleanup
Symptoms:
- long read transaction;
- vacuum/cleanup lag in MVCC database;
- replica lag;
- old snapshots;
- export/report running.
Fix:
- shorten transaction;
- chunk without long transaction;
- snapshot table;
- run on replica;
- limit report duration;
- cancel runaway job.
61. Decision Matrix
| Requirement | Prefer |
|---|---|
| Small API list | bounded pagination |
| Deep scrolling | keyset pagination |
| Backfill mutable table | chunk by immutable key |
| Full export with evidence | async job + snapshot/read model |
| Search reindex | chunk + idempotent upsert |
| One consistent DB snapshot | cursor/transaction or snapshot table |
| Minimal DB resource duration | chunking |
| Highest sequential throughput | streaming cursor, carefully |
| Slow downstream | chunk and decouple |
| Failure resume | chunk + cursor |
| Direct client download | avoid unless bounded |
62. Anti-Pattern: Streaming to Slow Client
ResultSet -> HTTP response -> user's slow network
Connection held hostage by client.
Fix: async export.
63. Anti-Pattern: Returning Stream Without Close Contract
Stream<Row> rows = dao.streamAll();
rows.limit(10).toList(); // stream not closed
Connection leak.
Fix:
try (Stream<Row> rows = dao.streamAll()) {
rows.limit(10).toList();
}
Better: avoid returning stream unless necessary.
64. Anti-Pattern: Fetch Size Without Verification
ps.setFetchSize(1000);
// assume memory safe
Maybe driver ignores it.
Fix:
- integration/performance test;
- driver docs;
- observe memory/network;
- chunk if uncertain.
65. Anti-Pattern: Offset for Millions
limit 1000 offset 9000000
Fix: keyset cursor.
66. Anti-Pattern: Large Scan With Non-Indexed Sort
order by officer_display_name
On huge table this may sort massive result.
Fix:
- index if valid;
- read model;
- precomputed export;
- accept unsorted by display;
- sort outside DB only if bounded.
67. Anti-Pattern: External Call Per Row While ResultSet Open
while (rs.next()) {
externalApi.send(mapRow(rs));
}
Fix:
- read chunk;
- close DB resource;
- send batch to external;
- checkpoint after successful send;
- use outbox.
68. Safe Large Result Checklist
- Result is bounded or intentionally chunked/streamed.
-
select *avoided. - Deterministic order exists.
- Cursor key is indexed and stable.
- Fetch size behavior verified if streaming.
- Connection lifetime understood.
- Transaction/snapshot duration acceptable.
- Downstream processing does not hold DB connection unnecessarily.
- Cancellation strategy exists.
- Timeout strategy exists.
- Resume/retry strategy exists.
- Output idempotency considered.
- Export/report evidence captured if regulatory.
- Metrics for rows/chunks/duration exist.
- Direct HTTP streaming avoided or strictly bounded.
- Test uses real database for large-result behavior.
69. Mini Lab
Design large export for:
Export all closed enforcement cases for a regulator for the previous quarter.
Requirements:
- potentially 3 million rows;
- must include exact filter criteria;
- must be reproducible;
- must not block API;
- must record who requested export;
- must produce CSV;
- must allow retry after crash;
- must avoid duplicate rows;
- must not expose sensitive notes;
- must produce row count and file hash.
Questions:
- Direct streaming or async job?
- Snapshot table or cutoff timestamp?
- Cursor key?
- Chunk size?
- Where is progress stored?
- How is output written idempotently?
- What DB pool is used?
- What query/index supports scan?
- What audit metadata is stored?
- What happens if job is cancelled?
- What happens if app crashes after writing a chunk?
- How do you verify row count and file hash?
70. Summary
Large result handling is about resource control, not only memory.
You must master:
ResultSetcursor API vs driver behavior;- fetch size;
- forward-only/read-only result set;
- long transaction risk;
- callback vs stream vs chunking;
- keyset cursor;
- offset pagination limits;
- export architecture;
- snapshot/cutoff/high-water mark;
- cancellation and timeout;
- backpressure illusion;
- connection pool impact;
- row mapping cost;
- large column risk;
- idempotent output;
- data changes during scan;
- regulatory report evidence;
- observability and incident playbooks.
This closes the JDBC primitive phase. Part berikutnya masuk Phase 3: Transaction Pattern dan Consistency Boundary, dimulai dari desain transaction boundary pada service/use case/command handler/async boundary.
71. References
- Oracle Java SE
ResultSet: https://docs.oracle.com/en/java/javase/21/docs/api/java.sql/java/sql/ResultSet.html - Oracle Java SE
Statement#setFetchSize: https://docs.oracle.com/en/java/javase/21/docs/api/java.sql/java/sql/Statement.html#setFetchSize(int) - Oracle Java SE
Statement#setQueryTimeout: https://docs.oracle.com/en/java/javase/21/docs/api/java.sql/java/sql/Statement.html#setQueryTimeout(int) - Oracle Java SE
Connection: https://docs.oracle.com/en/java/javase/21/docs/api/java.sql/java/sql/Connection.html - Oracle Java SE
Statement#cancel: https://docs.oracle.com/en/java/javase/21/docs/api/java.sql/java/sql/Statement.html#cancel() - PostgreSQL JDBC Documentation: https://jdbc.postgresql.org/documentation/
- PostgreSQL Cursors: https://www.postgresql.org/docs/current/plpgsql-cursors.html
- PostgreSQL Transaction Isolation: https://www.postgresql.org/docs/current/transaction-iso.html
You just completed lesson 16 in build core. Use the series map if you want to review the broader track, or continue directly into the next lesson while the context is still warm.
Keep the momentum while the lesson is still fresh. Move backward for review or continue forward into the next concept.