Zero-Downtime Data Access Change
Learn Java Data Access Pattern In Action - Part 057
Zero-downtime data access change for Java in production: deploy order, rolling deployment, old/new app compatibility, migration lock, feature flags, query switch, read model switch, rollback, observability, and failure playbook.
Part 057 — Zero-Downtime Data Access Change
A zero-downtime database/data access change is not just about a migration script that “does not error”.
Zero-downtime means the change stays safe while:
- several application versions run side by side;
- the migration is in progress;
- traffic keeps coming in;
- async workers keep running;
- the read model is still lagging;
- an application rollback may be needed;
- feature flags are being enabled gradually;
- query plans change;
- an index has not finished building;
- the backfill is not yet complete.
In production, a data access change is choreography, not a single commit.
This part covers deploy order, rolling deployment, old/new app compatibility, migration lock, failure rollback, and the zero-downtime runbook.
1. Core Thesis
A zero-downtime data access change requires a combination of:
Backward-compatible schema
Forward-compatible app
Feature-flagged behavior
Observable migration/backfill
Safe deploy order
Reversible read/write switches
Delayed cleanup
One key principle:
Do not make an irreversible contract break while old code can still run.
2. What Counts as Data Access Change?
Data access changes include:
- schema migration;
- new column/table/index/constraint;
- query shape change;
- ORM entity mapping change;
- jOOQ generated schema update;
- MyBatis XML change;
- JDBC SQL change;
- repository contract change;
- read model schema;
- projector update;
- outbox/inbox table change;
- report query;
- transaction isolation/locking change;
- cache/read replica routing;
- batch/backfill job;
- stored procedure/function/view change.
Any of these can cause downtime if the deploy order is wrong.
3. Downtime Failure Modes
Example failures:
New app deployed before column exists.
Old app runs after column dropped.
Backfill locks table.
New query uses missing index and saturates DB.
View replaced incompatibly while old app uses old shape.
Read model v2 not fully populated but readers switched.
JPA entity not tolerant of nullable transition column.
MyBatis XML references renamed column.
jOOQ generated code compiled against schema not in prod.
A zero-downtime plan must prevent all of these.
4. Rolling Deployment Reality
Rolling deploy means:
t0: old app only
t1: old app + new app mixed
t2: new app only
During t1, both versions must work.
Therefore, migration state must support both.
If old app cannot operate on new schema, deploy is not zero-downtime.
If new app cannot operate with partially migrated data, deploy is not zero-downtime.
5. Compatibility Matrix
For every risky change, create matrix:
| App | Schema/Data State | Must Work? |
|---|---|---|
| old app | old schema | yes |
| old app | expanded schema | yes |
| new app | expanded schema | yes |
| new app | partial backfill | yes |
| old app | contracted schema | no; contract only after old app is gone |
| new app | contracted schema | yes |
This matrix is stronger than “tested migration”.
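The matrix can also be kept as data, so that a pipeline step can ask whether a schema state is safe for the app versions currently running. The following is a minimal sketch, not a definitive implementation; the class, enum, and method names are assumptions made for illustration. Combinations that the matrix does not list raise an error instead of being guessed.

```java
import java.util.List;

final class CompatibilityMatrix {
    enum App { OLD, NEW }
    enum State { OLD_SCHEMA, EXPANDED, PARTIAL_BACKFILL, CONTRACTED }

    record Cell(App app, State state, boolean mustWork) {}

    // Mirrors the rows of the matrix in this section.
    static final List<Cell> CELLS = List.of(
            new Cell(App.OLD, State.OLD_SCHEMA, true),
            new Cell(App.OLD, State.EXPANDED, true),
            new Cell(App.NEW, State.EXPANDED, true),
            new Cell(App.NEW, State.PARTIAL_BACKFILL, true),
            new Cell(App.OLD, State.CONTRACTED, false),
            new Cell(App.NEW, State.CONTRACTED, true));

    /** Looks up one cell; an unlisted combination is an error, not an implicit "yes". */
    static boolean mustWork(App app, State state) {
        return CELLS.stream()
                .filter(c -> c.app() == app && c.state() == state)
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException(
                        "No matrix entry for " + app + " on " + state))
                .mustWork();
    }

    /** A state is safe to enter only if every running app version must work in it. */
    static boolean safeToEnter(State target, List<App> runningVersions) {
        return runningVersions.stream().allMatch(app -> mustWork(app, target));
    }
}
```

With old and new instances both running, `safeToEnter(State.CONTRACTED, ...)` returns false, which is the "only after old gone" rule expressed as a check.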
6. Safe Deploy Order: Schema Expand First
Standard order:
1. Apply backward-compatible schema expand.
2. Deploy app version that is compatible with old/new data.
3. Run backfill/repair if needed.
4. Switch read/write behavior with feature flag.
5. Enforce constraints after data valid.
6. Later contract old schema.
Do not deploy code that requires schema before schema exists.
Do not drop schema while old code may use it.
7. Example: Add Required Column
Goal:
Add case_file.priority NOT NULL
Zero-downtime order:
Release A:
add nullable priority column
Release B:
app writes priority for new/updated rows
app reads priority with fallback default
Backfill:
fill priority for old rows in chunks
Release C:
enforce NOT NULL/check
enable priority filter/query
Release D:
cleanup fallback code later
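Adding the NOT NULL constraint directly on a large existing table is unsafe: existing rows have no value yet, and depending on the database and version the DDL can lock or rewrite the table.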
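The Release B behavior (write the value always, read with a fallback) can be sketched in plain Java as below. This is an illustrative sketch only: the default value `NORMAL`, the class name, and the counter are assumptions, not part of the original design. The counter exists because a fallback-read count that trends to zero is a useful signal that the backfill has caught up.

```java
import java.util.concurrent.atomic.AtomicLong;

final class PriorityFallback {
    // Hypothetical default applied while old rows are not yet backfilled.
    static final String DEFAULT_PRIORITY = "NORMAL";

    // Counts reads that hit a NULL column; should approach zero as backfill completes.
    static final AtomicLong FALLBACK_READS = new AtomicLong();

    /** Read path: a NULL column value falls back to the default instead of failing. */
    static String readPriority(String columnValue) {
        if (columnValue != null) {
            return columnValue;
        }
        FALLBACK_READS.incrementAndGet();
        return DEFAULT_PRIORITY;
    }

    /** Write path: new and updated rows never persist NULL or blank. */
    static String priorityToWrite(String requested) {
        return (requested == null || requested.isBlank()) ? DEFAULT_PRIORITY : requested;
    }
}
```

Once NOT NULL is enforced in Release C, the fallback branch becomes dead code and can be removed in Release D.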
8. Example: Rename Column
Goal:
status -> case_status
Zero-downtime order:
1. add case_status nullable
2. deploy app dual write status + case_status
3. backfill case_status from status
4. read with fallback
5. switch reads to case_status
6. stop writing status after rollback window
7. drop status later
Rename is implemented as add-copy-switch-drop.
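The dual-write, backfill, and read-with-fallback steps can be sketched against an in-memory row, modeled here as a `Map` from column name to value. This is a simplified illustration under that assumption; a real implementation would issue SQL, and the class and method names are hypothetical.

```java
import java.util.Map;

final class RenameTransition {
    /** Step 2: every write sets both the old and the new column. */
    static void writeStatus(Map<String, String> row, String value) {
        row.put("status", value);
        row.put("case_status", value);
    }

    /** Step 3: backfill copies the old value only where the new column is still empty. */
    static boolean backfill(Map<String, String> row) {
        if (row.get("case_status") != null || row.get("status") == null) {
            return false; // already migrated, or nothing to copy
        }
        row.put("case_status", row.get("status"));
        return true;
    }

    /** Step 4: prefer the new column, fall back to the old one for rows not yet backfilled. */
    static String readStatus(Map<String, String> row) {
        String current = row.get("case_status");
        return current != null ? current : row.get("status");
    }
}
```

Because the backfill skips rows that already have `case_status`, it can be re-run safely and never overwrites a value produced by a live dual write.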
9. Example: Query Change Needs New Index
Goal:
dashboard query sorts by priority desc, updated_at desc
Order:
1. Create new index first.
2. Verify index valid/used.
3. Deploy query behind feature flag.
4. Canary new query.
5. Ramp traffic.
6. Keep old query fallback.
7. Remove old query/index later if safe.
New query before index can cause DB incident.
10. Example: Read Model v2
Goal:
case_dashboard_read_model_v2 replaces v1
Order:
1. create v2 table
2. projector writes v1 + v2
3. backfill/rebuild v2
4. compare v1/v2
5. read path flag uses v2 for canary
6. ramp v2
7. stop v1 writes
8. drop v1 later
Read model migration is data migration plus traffic switch.
11. Migration Before App vs App Before Migration
Most additive schema changes:
migration first
But sometimes app must be made tolerant before new data appears.
Example with a new enum code:
1. deploy app that tolerates unknown/new code
2. migration/feature starts writing new code
Compatibility direction depends on which side breaks first.
Rule:
Deploy tolerance before producing new shape/value.
Deploy dependency after schema exists.
12. Forward Compatibility
Forward-compatible app can tolerate future data/schema state.
Examples:
- unknown enum code maps to UNKNOWN or a controlled error;
- nullable new column fallback;
- extra JSON field ignored;
- event consumer ignores unknown field;
- query does not select *;
- API mapping ignores extra DB columns.
Forward compatibility is crucial for rolling and async systems.
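As one example of the first item, an enum mapper can tolerate codes it has never seen. The sketch below assumes a hypothetical `CaseStatus` enum with an explicit `UNKNOWN` member; the concrete status names are invented for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

final class CaseStatusCodes {
    enum CaseStatus { OPEN, IN_REVIEW, CLOSED, UNKNOWN }

    // Lookup table built once from the enum constants this app version knows.
    private static final Map<String, CaseStatus> BY_CODE =
            Arrays.stream(CaseStatus.values())
                    .collect(Collectors.toMap(CaseStatus::name, s -> s));

    /** Maps a stored code to the enum; codes written by a newer app become UNKNOWN. */
    static CaseStatus fromCode(String code) {
        if (code == null) {
            return CaseStatus.UNKNOWN;
        }
        return BY_CODE.getOrDefault(code, CaseStatus.UNKNOWN);
    }
}
```

Mapping to `UNKNOWN` keeps reads working; whether a write path may then persist such a row is a separate decision, and a controlled error is often the safer choice there.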
13. Backward Compatibility
Backward-compatible schema supports old app.
Examples:
- old column remains;
- old view still exists;
- old function signature remains;
- old enum/code still allowed;
- old index not dropped;
- old table still writable.
Backward compatibility is what allows rollback to old app.
14. Feature Flags for Data Access
Feature flags can control:
- read from new column;
- write new column;
- dual write enabled;
- use new query;
- use read model v2;
- enable new constraint-dependent behavior;
- route to read replica;
- enable new batch job.
Flags should be temporary and documented.
15. Feature Flag Ordering
Define allowed order.
Example:
dual_write_new_column = true
read_new_column = false
write_old_column = true
Valid transitions:
enable dual_write -> complete backfill -> enable read_new -> disable old_write -> cleanup
Invalid:
read_new=true before backfill
write_old=false while rollback to old app possible
Guard invalid config combinations.
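Such a guard could look like the following sketch. The flag names follow the example in this section; the `backfillComplete` and `rollbackWindowOpen` inputs, and the extra rule that reading the new column requires dual write, are assumptions derived from the transition order rather than rules stated explicitly in the text.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnMigrationFlags {
    record Flags(boolean dualWriteNewColumn, boolean readNewColumn, boolean writeOldColumn) {}

    /** Returns the list of violated rules; an empty list means the combination is allowed. */
    static List<String> violations(Flags flags, boolean backfillComplete, boolean rollbackWindowOpen) {
        List<String> problems = new ArrayList<>();
        if (flags.readNewColumn() && !backfillComplete) {
            problems.add("read_new_column requires a completed backfill");
        }
        if (flags.readNewColumn() && !flags.dualWriteNewColumn()) {
            // Assumption: derived from "enable dual_write" coming before "enable read_new".
            problems.add("read_new_column requires dual_write_new_column");
        }
        if (!flags.writeOldColumn() && rollbackWindowOpen) {
            problems.add("write_old_column must stay on while rollback to the old app is possible");
        }
        return problems;
    }
}
```

Running this check when a flag change is submitted, rather than when it takes effect, rejects an invalid combination before any traffic sees it.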
16. Canary Strategy
Canary data access change:
- small percentage traffic;
- internal users;
- single tenant;
- single app instance;
- one worker partition;
- read-only path first.
Monitor:
- error rate;
- query latency;
- row count;
- fallback reads;
- mismatch count;
- DB CPU/IO;
- lock waits;
- connection pool wait;
- cache/read model lag.
17. Rollback Types
Rollback can mean:
App rollback
Deploy previous app version.
Requires schema still backward-compatible.
Feature flag rollback
Switch read/query path back.
Requires old path still maintained.
Migration rollback
Undo schema change. Often unsafe.
Forward fix
Deploy new migration/code to fix issue.
Usually safest after schema/data changed.
Zero-downtime design should prefer feature flag/app rollback early, forward fix later.
18. Rollback Window
After schema expand, rolling back to the old app is usually safe.
After stop-old-write or contract, rolling back to the old app may be unsafe.
Define rollback window:
For 2 weeks after read switch, keep old column updated.
After 2 weeks, stop old writes.
After 4 weeks, drop old column.
The window depends on risk and deployment speed.
19. Contract Cleanup Timing
Contract cleanup is not urgent unless cost or security requires it.
Before cleanup:
- old app gone;
- jobs/scripts updated;
- old query metrics zero;
- reports updated;
- feature flags removed;
- support runbooks updated;
- backup exists.
Cleanup too early causes hidden outages.
20. Migration Lock and Deployment
Migration tool lock prevents concurrent migration executions.
But the app may still start while the migration runs, depending on the pipeline.
Rules:
- app should not start requiring migration until migration complete;
- migration should be separate job for risky changes;
- app startup migration okay only for small safe changes;
- ensure only one migration runner;
- handle failed lock.
Migration lock is not table lock. DDL can still lock business tables.
21. DDL Lock Awareness
DDL may lock tables.
Zero-downtime requires:
- understand lock level;
- set lock timeout;
- use online/concurrent options;
- avoid long transaction around DDL;
- run during low traffic if needed;
- monitor lock waits.
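Example: on PostgreSQL, CREATE INDEX CONCURRENTLY builds the index without blocking writes, but it cannot run inside a transaction block, and a failed build leaves an invalid index that must be dropped or rebuilt.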
22. App Startup Migration Risk
If app runs migration on startup:
new pods start -> migration runs -> migration slow/fails -> pods not ready
Risks:
- multiple pods waiting;
- deploy stuck;
- partial migration;
- app unavailable during migration;
- app user needs DDL permission.
For production zero-downtime, dedicated migration job is often safer.
23. Dedicated Migration Job
Recommended for serious systems:
CI/CD:
run migration job
validate migration
deploy app
Benefits:
- one runner;
- controlled permissions;
- clear logs;
- approval gate;
- migration completion before app rollout;
- easier rollback decision.
24. Data Backfill Is Not Startup Migration
Backfill of a large table should be an external job.
Why:
- needs chunking;
- may take hours/days;
- needs pause/resume;
- needs metrics;
- may need throttling;
- may need kill switch.
Schema migration adds column; backfill job moves data; later migration enforces constraint.
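The shape of such a job can be sketched with an in-memory table standing in for the database. Everything here is an illustrative assumption: the `NavigableMap` of id to priority plays the role of the table, `NORMAL` is a hypothetical default, and the throttle is a caller-supplied hook. What the sketch aims to show is chunking, a resume cursor, idempotent updates, and a kill switch checked between chunks.

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.function.BooleanSupplier;
import java.util.stream.Collectors;

final class ChunkedBackfill {
    /** Result of one run: where to resume from and how many rows were filled. */
    record Progress(long cursor, int updated, boolean finished) {}

    static Progress run(NavigableMap<Long, String> priorityById, long resumeAfterId,
                        int chunkSize, BooleanSupplier killSwitch, Runnable throttle) {
        long cursor = resumeAfterId;
        int updated = 0;
        while (!killSwitch.getAsBoolean()) {
            // Keyset pagination: the next chunk is the ids strictly after the cursor.
            List<Long> chunk = priorityById.tailMap(cursor, false).keySet().stream()
                    .limit(chunkSize)
                    .collect(Collectors.toList());
            if (chunk.isEmpty()) {
                return new Progress(cursor, updated, true);
            }
            for (Long id : chunk) {
                // Idempotent: only rows still missing a value are touched.
                if (priorityById.get(id) == null) {
                    priorityById.put(id, "NORMAL");
                    updated++;
                }
                cursor = id;
            }
            throttle.run(); // e.g. sleep, or wait until DB health metrics recover
        }
        return new Progress(cursor, updated, false); // stopped early; resume from cursor
    }
}
```

Persisting `Progress.cursor()` after each run is what makes pause and resume possible; in a database-backed version each chunk would typically be its own short transaction.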
25. Query Plan Switch
When changing query:
- deploy code with old and new query;
- keep old query default;
- add/validate index;
- enable new query for canary;
- compare results/latency;
- ramp;
- remove old later.
This is expand-contract for query behavior.
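A query switch of this kind might be sketched as below. The percentage-based routing keyed on a tenant id, the class name, and the fall-back-on-exception behavior are assumptions for illustration; the old and new queries are passed in as plain functions so the sketch stays independent of any data access library.

```java
import java.util.function.Function;

final class QuerySwitch<Q, R> {
    private final Function<Q, R> oldQuery;
    private final Function<Q, R> newQuery;
    private volatile int newQueryPercent = 0; // default off: old query only

    QuerySwitch(Function<Q, R> oldQuery, Function<Q, R> newQuery) {
        this.oldQuery = oldQuery;
        this.newQuery = newQuery;
    }

    /** Ramp control: 0 disables the new query, 100 sends all tenants to it. */
    void setNewQueryPercent(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        this.newQueryPercent = percent;
    }

    /** Stable bucketing: the same tenant always lands in the same bucket 0..99. */
    static int bucket(String tenantId) {
        return Math.floorMod(tenantId.hashCode(), 100);
    }

    R search(String tenantId, Q query) {
        if (bucket(tenantId) < newQueryPercent) {
            try {
                return newQuery.apply(query);
            } catch (RuntimeException e) {
                // The old query is kept as fallback while the new path is being ramped.
                return oldQuery.apply(query);
            }
        }
        return oldQuery.apply(query);
    }
}
```

Bucketing by tenant rather than by request keeps each tenant on one path, which makes latency and result comparisons easier to attribute.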
26. Result Parity Check
For query switch, compare old/new result for sample.
Example:
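if (shadowCompareEnabled) {
    // The old query stays authoritative: it alone produces the response.
    OldResult oldResult = oldQuery.search(q);
    // The new query runs in shadow; its result is only recorded for comparison.
    NewResult newResult = newQuery.search(q);
    parityRecorder.record(q, oldResult, newResult);
    return oldResult;
}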
Shadow compare should be bounded and sampled to avoid doubling DB load.
27. Shadow Query Caution
Running old + new query doubles DB work.
Use:
- sample rate;
- low traffic tenants;
- async comparison if possible;
- read replica if safe;
- strict timeout;
- kill switch.
Do not shadow high-cost queries globally.
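A bounded shadow comparison could be sketched as follows. The every-Nth sampling, the counters, and the class name are illustrative assumptions; the sketch also assumes both queries return comparable results of the same type. A failing shadow query is counted and swallowed so that it can never fail the real request.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

final class ShadowComparer<Q, R> {
    final AtomicLong compared = new AtomicLong();
    final AtomicLong mismatches = new AtomicLong();
    volatile boolean killSwitch = false;

    private final Function<Q, R> oldQuery;
    private final Function<Q, R> newQuery;
    private final int sampleEveryN; // e.g. 100 shadows roughly 1% of calls
    private final AtomicLong calls = new AtomicLong();

    ShadowComparer(Function<Q, R> oldQuery, Function<Q, R> newQuery, int sampleEveryN) {
        if (sampleEveryN < 1) {
            throw new IllegalArgumentException("sampleEveryN must be at least 1");
        }
        this.oldQuery = oldQuery;
        this.newQuery = newQuery;
        this.sampleEveryN = sampleEveryN;
    }

    R search(Q q) {
        R oldResult = oldQuery.apply(q); // always authoritative
        if (!killSwitch && calls.incrementAndGet() % sampleEveryN == 0) {
            try {
                R newResult = newQuery.apply(q);
                compared.incrementAndGet();
                if (!Objects.equals(oldResult, newResult)) {
                    mismatches.incrementAndGet();
                }
            } catch (RuntimeException e) {
                mismatches.incrementAndGet(); // a failing shadow query counts as a mismatch
            }
        }
        return oldResult;
    }
}
```

This version still runs the shadow query on the request thread; moving it to a bounded executor with a strict timeout would cover the async and timeout items in the list.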
28. Data Access Change and Caches
If query/cache key changes:
- cache can contain old shape;
- stale cache can hide migration issue;
- key version may be needed;
- invalidate old cache on switch;
- include source version in cached values.
For API response cache:
case-detail:v2:{tenant}:{caseId}
Keep v1/v2 separate during rollout.
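A small key builder makes the version an explicit argument, so old and new app versions never read each other's cached shape. This is a minimal sketch; the class and method names are hypothetical, and only the key format comes from the example in this section.

```java
final class CaseDetailCacheKey {
    /** Builds keys such as case-detail:v2:{tenant}:{caseId}. */
    static String key(int shapeVersion, String tenant, String caseId) {
        if (shapeVersion < 1) {
            throw new IllegalArgumentException("shapeVersion must be positive");
        }
        return "case-detail:v" + shapeVersion + ":" + tenant + ":" + caseId;
    }
}
```

Because v1 entries are simply never read by the new path, they can be left to expire instead of being invalidated in one burst at switch time.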
29. Read Replica Routing Change
Changing reads from primary to replica is data access change.
Need:
- stale tolerance;
- read-your-writes behavior;
- routing flag;
- fallback to primary;
- replica lag metrics;
- command path still primary;
- cache interaction.
Canary and monitor lag.
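The routing decision itself can be kept as a small pure function, which makes it easy to test separately from the connection plumbing. The sketch below is an assumption-laden illustration: the parameter names are hypothetical, and the measured replica lag is taken as an input rather than queried.

```java
import java.time.Duration;

final class ReadRouter {
    enum Target { PRIMARY, REPLICA }

    /**
     * Chooses where a statement should run.
     * Commands, read-your-writes reads, and a disabled flag always go to the primary;
     * other reads use the replica only while its lag is within the stale tolerance.
     */
    static Target route(boolean replicaReadsEnabled, boolean isCommand,
                        boolean needsReadYourWrites, Duration replicaLag,
                        Duration staleTolerance) {
        if (isCommand || !replicaReadsEnabled || needsReadYourWrites) {
            return Target.PRIMARY;
        }
        return replicaLag.compareTo(staleTolerance) <= 0 ? Target.REPLICA : Target.PRIMARY;
    }
}
```

Falling back to the primary when lag exceeds the tolerance protects correctness but shifts load back to the primary, so lag spikes should also be visible as a stop condition.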
30. ORM Mapping Change
JPA entity change can break during transition.
Example:
@Column(nullable = false)
private String priority;
while the DB column is still nullable or the backfill is incomplete.
During expand phase, entity should tolerate null/fallback or use custom mapper.
Do not let object model get ahead of migration phase.
31. jOOQ Codegen Change
jOOQ generated code reflects target schema.
If the production schema is not yet expanded, a new app that uses the generated column fails at runtime.
Therefore:
run expand migration before deploying code using generated field
For contract cleanup:
drop column only after code no longer references generated field
Compile-time safety does not solve runtime deploy ordering.
32. MyBatis/JDBC SQL Change
String SQL must be tested against migration state.
If SQL references new column, migration must exist.
If a migration drops an old column, every query that references it must already be gone.
Integration tests against a real DB should apply the same migrations.
33. Multi-Service Shared Schema
Zero-downtime is harder if multiple services share a DB.
Need:
- schema owner;
- consumer inventory;
- backward-compatible contract;
- migration communication;
- deprecation window;
- access logs/metrics;
- contract tests if possible.
Better architecture: service owns DB and exposes API/events. But real systems sometimes have legacy shared schema.
34. Async Workers During Migration
Workers may still run old code while the web app already runs new code.
Inventory:
- outbox publisher;
- inbox consumers;
- scheduled jobs;
- backfill jobs;
- report jobs;
- cache warmers;
- search indexers.
All need compatibility.
Pause low-priority workers before risky migration if needed.
35. Message/Event Schema During DB Change
If DB schema change affects event payload:
- event versioning;
- dual publish old/new fields;
- consumers tolerate both;
- outbox table stores payload version;
- old unprocessed events still processable.
Do not change event payload and DB schema incompatibly in same irreversible step.
36. Failure Mode: New App Before Migration
Symptom:
column does not exist
Prevention:
- migration stage before app;
- startup schema validation;
- deployment dependency;
- feature flag default off until migration complete.
37. Failure Mode: Old App After Contract
Symptom:
old app crashes on missing column
Prevention:
- wait until old app gone;
- query metrics/code search;
- contract later;
- rollback window expired.
38. Failure Mode: Partial Backfill Read
Symptom:
new read returns null/incorrect values
Prevention:
- fallback read;
- feature flag switch only after backfill;
- not-null precondition;
- parity check.
39. Failure Mode: Query Plan Incident
Symptom:
new query saturates DB
Prevention:
- index first;
- explain plan;
- canary;
- query timeout;
- feature rollback;
- read model if needed.
40. Failure Mode: Backfill Overload
Symptom:
OLTP latency spikes during backfill
Prevention:
- throttling;
- chunk size;
- off-peak;
- kill switch;
- DB health adaptive pause;
- separate pool/user.
41. Failure Mode: Migration Lock Stuck
Symptom:
future migrations blocked
Prevention/runbook:
- only one migration runner;
- monitor migration duration;
- inspect lock table;
- release only after verifying no active migration;
- document repair process.
42. Deployment Runbook Template
# Zero-Downtime Data Access Change Runbook
Change:
Owner:
Risk:
Affected tables:
Affected app versions:
Feature flags:
Migration files:
Backfill job:
Indexes:
Compatibility matrix:
Deploy order:
Canary plan:
Metrics:
Rollback plan:
Stop conditions:
Cleanup plan:
Use for risky changes.
43. Stop Conditions
Before rollout, define stop conditions:
- DB CPU > threshold;
- pool wait > threshold;
- query p95 > threshold;
- mismatch count > 0;
- fallback read not decreasing;
- error rate > threshold;
- lock wait spike;
- replication lag > threshold;
- backfill error rate > threshold.
If a stop condition is hit, pause the rollout, reroute traffic, or roll back the flag.
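Threshold-style stop conditions can be evaluated mechanically before each ramp step. The following is a sketch under stated assumptions: metric names and limits are invented, values are supplied by the caller, and a missing metric is deliberately treated as a breach. Trend conditions such as "fallback read not decreasing" need history and are outside this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class StopConditions {
    /** A metric must stay at or below max for the rollout to continue. */
    record Limit(String metric, double max) {}

    /** Returns the names of all breached limits, in the order the limits were given. */
    static List<String> breached(List<Limit> limits, Map<String, Double> current) {
        List<String> hit = new ArrayList<>();
        for (Limit limit : limits) {
            Double value = current.get(limit.metric());
            // No data is treated as a breach: ramping blind is not a safe default.
            if (value == null || value > limit.max()) {
                hit.add(limit.metric());
            }
        }
        return hit;
    }
}
```

A non-empty result would block the next ramp step and trigger the pause or flag rollback described above.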
44. Observability Dashboard
Dashboard should show:
- migration applied version;
- feature flag state;
- old/new query latency;
- mismatch/parity;
- backfill progress;
- fallback read count;
- DB pool metrics;
- slow query count;
- lock waits;
- error rate;
- deployment version mix.
Zero-downtime rollout without observability is gambling.
45. Deployment Version Mix
During rolling deploy, know:
old version instances count
new version instances count
worker version mix
Do not contract until the old count is zero.
For Kubernetes-like systems, watch rollout status and old pods/jobs.
46. Readiness Checks
The new app can fail its readiness check if the required schema is missing.
Example:
check column/table exists
But be careful: readiness check should not hammer DB or require expensive validation.
Schema validation at startup can catch migration ordering issues.
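One way to keep the readiness check cheap is to stop probing once the required schema has been seen. The sketch below assumes a caller-supplied `schemaProbe` (hypothetical; for example a single metadata lookup for the new column) and caches only the positive result.

```java
import java.util.function.BooleanSupplier;

final class SchemaReadiness {
    private final BooleanSupplier schemaProbe; // hypothetical cheap check against the DB
    private volatile boolean verified = false;

    SchemaReadiness(BooleanSupplier schemaProbe) {
        this.schemaProbe = schemaProbe;
    }

    /** Called by the readiness endpoint; probes the DB only until the schema is confirmed. */
    boolean isReady() {
        if (verified) {
            return true; // schema already confirmed, no further DB round trips
        }
        boolean ok = schemaProbe.getAsBoolean();
        if (ok) {
            verified = true;
        }
        return ok;
    }
}
```

While the schema is missing, every readiness poll still reaches the database, so the probe should stay a single cheap lookup and the poll interval should not be aggressive.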
47. Feature Flag Default
For new data access path:
default off
Deploy code safely, then enable after migration/index/backfill.
This decouples deployment from activation.
48. Dark Launch
Dark launch new path without serving response.
Example:
- execute new query for 1% traffic;
- compare result;
- discard new result;
- monitor latency.
Use carefully because of the extra DB load.
49. Blue/Green Caveat
A blue/green deploy still shares database state.
Even if traffic switches atomically, rollback to blue requires schema backward compatibility.
Database is not blue/green unless you have sophisticated replication/cutover.
Do not assume blue/green removes expand-contract need.
50. Rollback After Data Writes
If new app wrote new values, old app may not understand.
Example: a new enum code has been inserted.
After rollback, the old app may fail to parse that code.
Prevent:
- deploy old app tolerant first;
- feature flag new value production;
- avoid writing incompatible data until rollback window gone.
51. Zero-Downtime Checklist
- Compatibility matrix completed.
- Expand migration additive.
- App tolerant of partial data.
- Feature flag default safe.
- Index exists before query switch.
- Backfill chunked/resumable.
- Parity check defined.
- Old/new queries can coexist.
- Rollback by phase defined.
- Old app still works after expand.
- Contract delayed until old app/jobs gone.
- Migration lock/runbook ready.
- Metrics/dashboard ready.
- Stop conditions defined.
- Cleanup scheduled.
52. Anti-Pattern: Deploy Code and Destructive Migration Together
Breaks rollback and rolling deploy.
53. Anti-Pattern: Feature Flag Only Around UI
Data access path also needs flag/control.
54. Anti-Pattern: Backfill Before Dual Write
Without dual write in place, live writes keep creating rows with the new value missing, during and after the backfill.
55. Anti-Pattern: Contract Cleanup Forgotten Forever
Temporary dual path becomes permanent complexity.
Create cleanup ticket/date.
56. Anti-Pattern: Shadow Query Without Load Budget
Can double DB load.
57. Mini Lab
Plan zero-downtime change:
Move dashboard from live joins to case_dashboard_read_model.
Current app queries case_file + officer + assignment count live.
New app should read read model.
Read model initially empty.
Traffic high.
Rollback must be possible.
Tasks:
- Create migration plan.
- Projector dual-write/rebuild plan.
- Backfill/rebuild strategy.
- Parity comparison.
- Feature flag for read switch.
- Index plan.
- Canary plan.
- Rollback plan.
- Stop conditions.
- Contract cleanup.
58. Summary
Zero-downtime data access change is choreography.
You must master:
- rolling deployment reality;
- compatibility matrix;
- schema expand first;
- deploy order;
- forward/backward compatibility;
- feature flags;
- canary;
- rollback types/window;
- delayed contract cleanup;
- migration lock/DDL locks;
- dedicated migration job;
- query plan switch;
- shadow comparison;
- cache/versioned key;
- read replica routing;
- ORM/jOOQ/MyBatis timing;
- shared schema/async workers;
- failure modes;
- runbook;
- observability;
- stop conditions;
- blue/green caveat.
The next part covers Data Backfill and Repair Job: chunking, resume cursor, idempotent update, throttling, audit trail, progress metrics, error isolation, and safe repair in production.
59. References
- Flyway Documentation: https://documentation.red-gate.com/fd
- Liquibase Documentation: https://docs.liquibase.com/
- PostgreSQL CREATE INDEX: https://www.postgresql.org/docs/current/sql-createindex.html
- PostgreSQL Explicit Locking: https://www.postgresql.org/docs/current/explicit-locking.html
- PostgreSQL ALTER TABLE: https://www.postgresql.org/docs/current/sql-altertable.html