
Zero-Downtime Data Access Change

Learn Java Data Access Pattern In Action - Part 057

Zero-downtime data access change for Java in production: deploy order, rolling deployment, old/new app compatibility, migration lock, feature flags, query switch, read model switch, rollback, observability, and failure playbook.


Part 057 — Zero-Downtime Data Access Change

A zero-downtime database/data access change is not just a matter of a migration script that “does not error”.

Zero-downtime means the change stays safe while:

  • several application versions run side by side;
  • the migration is in progress;
  • traffic keeps coming in;
  • async workers keep running;
  • the read model is still lagging;
  • an application rollback may be needed;
  • feature flags are enabled gradually;
  • query plans change;
  • an index has not finished building;
  • the backfill is not yet complete.

In production, a data access change is choreography, not a single commit.

This part covers deploy order, rolling deployment, old/new app compatibility, migration lock, failure rollback, and the zero-downtime runbook.


1. Core Thesis

A zero-downtime data access change requires a combination of:

Backward-compatible schema
Forward-compatible app
Feature-flagged behavior
Observable migration/backfill
Safe deploy order
Reversible read/write switches
Delayed cleanup

One key principle:

Do not make an irreversible contract break while old code can still run.

2. What Counts as Data Access Change?

Data access changes include:

  • schema migration;
  • new column/table/index/constraint;
  • query shape change;
  • ORM entity mapping change;
  • jOOQ generated schema update;
  • MyBatis XML change;
  • JDBC SQL change;
  • repository contract change;
  • read model schema;
  • projector update;
  • outbox/inbox table change;
  • report query;
  • transaction isolation/locking change;
  • cache/read replica routing;
  • batch/backfill job;
  • stored procedure/function/view change.

Any of these can cause downtime if the deploy order is wrong.


3. Downtime Failure Modes

Example failures:

New app deployed before column exists.
Old app runs after column dropped.
Backfill locks table.
New query uses missing index and saturates DB.
View replaced incompatibly while old app uses old shape.
Read model v2 not fully populated but readers switched.
JPA entity not tolerant of nullable transition column.
MyBatis XML references renamed column.
jOOQ generated code compiled against schema not in prod.

A zero-downtime plan must prevent all of these.


4. Rolling Deployment Reality

Rolling deploy means:

t0: old app only
t1: old app + new app mixed
t2: new app only

During t1, both versions must work.

Therefore, migration state must support both.

If old app cannot operate on new schema, deploy is not zero-downtime.

If new app cannot operate with partially migrated data, deploy is not zero-downtime.


5. Compatibility Matrix

For every risky change, create a matrix:

App      | Schema/Data State | Must Work?
old app  | old schema        | yes
old app  | expanded schema   | yes
new app  | expanded schema   | yes
new app  | partial backfill  | yes
old app  | contracted schema | no, only after old app is gone
new app  | contracted schema | yes

This matrix is stronger than “tested migration”.
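One way to keep the matrix from going stale is to encode it as data that tests and rollout gates can query. The following is a minimal sketch under that assumption; the class and enum names are hypothetical, and only the rows listed in the table are modeled.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical encoding of the compatibility matrix; only rows from the table are modeled. */
final class CompatibilityMatrix {
    enum App { OLD, NEW }
    enum SchemaState { OLD, EXPANDED, PARTIAL_BACKFILL, CONTRACTED }

    private static final Map<App, Set<SchemaState>> MUST_WORK = new EnumMap<>(App.class);
    static {
        // The old app must survive the expand phase but is not expected to run after contract.
        MUST_WORK.put(App.OLD, EnumSet.of(SchemaState.OLD, SchemaState.EXPANDED));
        MUST_WORK.put(App.NEW, EnumSet.of(
                SchemaState.EXPANDED, SchemaState.PARTIAL_BACKFILL, SchemaState.CONTRACTED));
    }

    /** True if the table marks this app/state combination as "must work". */
    static boolean mustWork(App app, SchemaState state) {
        return MUST_WORK.get(app).contains(state);
    }

    /** Contract is allowed only if every version that may still run supports the contracted schema. */
    static boolean contractAllowed(Set<App> versionsThatMayRun) {
        for (App app : versionsThatMayRun) {
            if (!mustWork(app, SchemaState.CONTRACTED)) {
                return false;
            }
        }
        return true;
    }
}
```

A rollout gate could call `contractAllowed` with the set of versions still deployed and refuse the contract migration while the old app is in that set.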


6. Safe Deploy Order: Schema Expand First

Standard order:

1. Apply backward-compatible schema expand.
2. Deploy app version that is compatible with old/new data.
3. Run backfill/repair if needed.
4. Switch read/write behavior with feature flag.
5. Enforce constraints after data valid.
6. Later contract old schema.

Do not deploy code that requires schema before schema exists.

Do not drop schema while old code may use it.


7. Example: Add Required Column

Goal:

Add case_file.priority NOT NULL

Zero-downtime order:

Release A:
  add nullable priority column

Release B:
  app writes priority for new/updated rows
  app reads priority with fallback default

Backfill:
  fill priority for old rows in chunks

Release C:
  enforce NOT NULL/check
  enable priority filter/query

Release D:
  cleanup fallback code later

Adding the column as NOT NULL directly on a large existing table is unsafe: existing rows have no value yet, and old app versions do not write one.
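The Release B behavior can be sketched as a small mapping helper. This is an illustration only: the class name and the `NORMAL` default are assumptions, not values taken from the actual schema.

```java
import java.util.Locale;

/** Sketch of the Release B read/write behavior for case_file.priority. */
final class PriorityMapper {
    /** Assumed default for rows the backfill has not reached yet. */
    static final String DEFAULT_PRIORITY = "NORMAL";

    /** Read path: tolerate NULL or blank values while the column is still nullable. */
    static String readPriority(String dbValue) {
        if (dbValue == null || dbValue.trim().isEmpty()) {
            return DEFAULT_PRIORITY;
        }
        return dbValue.trim().toUpperCase(Locale.ROOT);
    }

    /** Write path: never persist NULL, so new and updated rows do not need the backfill. */
    static String writePriority(String requested) {
        return readPriority(requested);
    }
}
```

Once Release C enforces NOT NULL, the null branch becomes dead code and can be removed in Release D.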


8. Example: Rename Column

Goal:

status -> case_status

Zero-downtime order:

1. add case_status nullable
2. deploy app that dual-writes status and case_status
3. backfill case_status from status
4. read with fallback
5. switch reads to case_status
6. stop writing status after rollback window
7. drop status later

Rename is implemented as add-copy-switch-drop.
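Steps 2, 4 and 5 can be sketched with a row modeled as a column-to-value map. The class name and the `readNewColumn` flag are hypothetical; a real implementation would sit in a repository or mapper and issue SQL instead of touching a map.

```java
import java.util.Map;

/** Sketch of dual write plus fallback read during the status -> case_status rename. */
final class CaseStatusAccess {
    private final boolean readNewColumn; // hypothetical flag: set to true at step 5

    CaseStatusAccess(boolean readNewColumn) {
        this.readNewColumn = readNewColumn;
    }

    /** Dual write: both columns stay in sync so old and new app versions can read the row. */
    void writeStatus(Map<String, String> row, String value) {
        row.put("status", value);
        row.put("case_status", value);
    }

    /** Prefer the new column when enabled, falling back for rows not yet backfilled. */
    String readStatus(Map<String, String> row) {
        String newValue = row.get("case_status");
        if (readNewColumn && newValue != null) {
            return newValue;
        }
        return row.get("status");
    }
}
```

Because the old column is still written, turning the flag off again is a safe rollback at any point before step 6.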


9. Example: Query Change Needs New Index

Goal:

dashboard query sorts by priority desc, updated_at desc

Order:

1. Create new index first.
2. Verify index valid/used.
3. Deploy query behind feature flag.
4. Canary new query.
5. Ramp traffic.
6. Keep old query fallback.
7. Remove old query/index later if safe.

New query before index can cause DB incident.


10. Example: Read Model v2

Goal:

case_dashboard_read_model_v2 replaces v1

Order:

1. create v2 table
2. projector writes v1 + v2
3. backfill/rebuild v2
4. compare v1/v2
5. read path flag uses v2 for canary
6. ramp v2
7. stop v1 writes
8. drop v1 later

Read model migration is data migration plus traffic switch.


11. Migration Before App vs App Before Migration

Most additive schema changes:

migration first

But sometimes app must be made tolerant before new data appears.

Example with a new enum code:

1. deploy app that tolerates unknown/new code
2. migration/feature starts writing new code

Compatibility direction depends on which side breaks first.

Rule:

Deploy tolerance before producing new shape/value.
Deploy dependency after schema exists.

12. Forward Compatibility

Forward-compatible app can tolerate future data/schema state.

Examples:

  • unknown enum code maps to UNKNOWN or controlled error;
  • nullable new column fallback;
  • extra JSON field ignored;
  • event consumer ignores unknown field;
  • query does not select *;
  • API mapping ignores extra DB columns.

Forward compatibility is crucial for rolling and async systems.
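As an illustration of the first bullet, a tolerant code mapping might look like the sketch below. The enum constants are hypothetical. The point is the contrast with a plain `Enum.valueOf` call, which throws `IllegalArgumentException` for a name it does not know.

```java
/** Sketch of a forward-compatible code mapping; the constants are hypothetical. */
enum CaseStatus {
    OPEN, IN_REVIEW, CLOSED, UNKNOWN;

    /** Maps a stored code to a constant, returning UNKNOWN instead of throwing on unseen codes. */
    static CaseStatus fromCode(String code) {
        if (code == null) {
            return UNKNOWN;
        }
        for (CaseStatus status : values()) {
            if (status.name().equals(code)) {
                return status;
            }
        }
        return UNKNOWN;
    }
}
```

An app version with this mapping can be deployed before any writer starts producing a new code, and it keeps working if it is rolled back to after the new code has been written.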


13. Backward Compatibility

Backward-compatible schema supports old app.

Examples:

  • old column remains;
  • old view still exists;
  • old function signature remains;
  • old enum/code still allowed;
  • old index not dropped;
  • old table still writable.

Backward compatibility is what allows rollback to old app.


14. Feature Flags for Data Access

Feature flags can control:

  • read from new column;
  • write new column;
  • dual write enabled;
  • use new query;
  • use read model v2;
  • enable new constraint-dependent behavior;
  • route to read replica;
  • enable new batch job.

Flags should be temporary and documented.


15. Feature Flag Ordering

Define allowed order.

Example:

dual_write_new_column = true
read_new_column = false
write_old_column = true

Valid transitions:

enable dual_write -> complete backfill -> enable read_new -> disable old_write -> cleanup

Invalid:

read_new=true before backfill
write_old=false while rollback to old app possible

Guard invalid config combinations.
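A guard for the two invalid combinations listed above could be sketched as follows. The class name is hypothetical, and `backfillComplete` and `oldAppMayRun` are assumed to come from the backfill job and the deployment system respectively.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a validity check for the three column-migration flags. */
final class ColumnMigrationFlags {
    final boolean dualWriteNewColumn;
    final boolean readNewColumn;
    final boolean writeOldColumn;

    ColumnMigrationFlags(boolean dualWriteNewColumn, boolean readNewColumn, boolean writeOldColumn) {
        this.dualWriteNewColumn = dualWriteNewColumn;
        this.readNewColumn = readNewColumn;
        this.writeOldColumn = writeOldColumn;
    }

    /** Returns a description of every violated rule; an empty list means the combination is allowed. */
    List<String> violations(boolean backfillComplete, boolean oldAppMayRun) {
        List<String> problems = new ArrayList<>();
        if (readNewColumn && !backfillComplete) {
            problems.add("read_new_column=true before backfill is complete");
        }
        if (!writeOldColumn && oldAppMayRun) {
            problems.add("write_old_column=false while rollback to old app is possible");
        }
        return problems;
    }
}
```

Running this check when flags are loaded or changed turns an unsafe transition into a rejected config change instead of a production incident.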


16. Canary Strategy

Canary data access change:

  • small percentage traffic;
  • internal users;
  • single tenant;
  • single app instance;
  • one worker partition;
  • read-only path first.

Monitor:

  • error rate;
  • query latency;
  • row count;
  • fallback reads;
  • mismatch count;
  • DB CPU/IO;
  • lock waits;
  • connection pool wait;
  • cache/read model lag.

17. Rollback Types

Rollback can mean:

App rollback

Deploy previous app version.

Requires schema still backward-compatible.

Feature flag rollback

Switch read/query path back.

Requires old path still maintained.

Migration rollback

Undo schema change. Often unsafe.

Forward fix

Deploy new migration/code to fix issue.

Usually safest after schema/data changed.

Zero-downtime design should prefer feature flag/app rollback early, forward fix later.


18. Rollback Window

After schema expand, rolling back to the old app is usually safe.

After stopping old writes or contracting the schema, rolling back to the old app may be unsafe.

Define rollback window:

For 2 weeks after read switch, keep old column updated.
After 2 weeks, stop old writes.
After 4 weeks, drop old column.

The window depends on risk and deployment speed.


19. Contract Cleanup Timing

Contract cleanup is not urgent unless cost or security requires it.

Before cleanup:

  • old app gone;
  • jobs/scripts updated;
  • old query metrics zero;
  • reports updated;
  • feature flags removed;
  • support runbooks updated;
  • backup exists.

Cleanup too early causes hidden outages.


20. Migration Lock and Deployment

Migration tool lock prevents concurrent migration executions.

But depending on the pipeline, the app may start while the migration is still running.

Rules:

  • the app should not start depending on a migration until that migration is complete;
  • the migration should be a separate job for risky changes;
  • migration at app startup is acceptable only for small, safe changes;
  • ensure there is only one migration runner;
  • handle a failed lock acquisition.

Migration lock is not table lock. DDL can still lock business tables.


21. DDL Lock Awareness

DDL may lock tables.

Zero-downtime requires:

  • understand lock level;
  • set lock timeout;
  • use online/concurrent options;
  • avoid long transaction around DDL;
  • run during low traffic if needed;
  • monitor lock waits.

Example: on PostgreSQL, CREATE INDEX CONCURRENTLY avoids blocking writes, but it cannot run inside a transaction block, and a failed build leaves an invalid index that must be dropped before retrying.


22. App Startup Migration Risk

If app runs migration on startup:

new pods start -> migration runs -> migration slow/fails -> pods not ready

Risks:

  • multiple pods waiting;
  • deploy stuck;
  • partial migration;
  • app unavailable during migration;
  • app user needs DDL permission.

For production zero-downtime, dedicated migration job is often safer.


23. Dedicated Migration Job

Recommended for serious systems:

CI/CD:
  run migration job
  validate migration
  deploy app

Benefits:

  • one runner;
  • controlled permissions;
  • clear logs;
  • approval gate;
  • migration completion before app rollout;
  • easier rollback decision.

24. Data Backfill Is Not Startup Migration

Backfill of large table should be external job.

Why:

  • needs chunking;
  • may take hours/days;
  • needs pause/resume;
  • needs metrics;
  • may need throttling;
  • may need kill switch.

Schema migration adds column; backfill job moves data; later migration enforces constraint.
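The chunking, pause/resume, throttling and kill-switch requirements can be sketched as a loop over a keyset cursor. `ChunkStore` is a hypothetical abstraction standing in for the real SQL. The sketch assumes ids are ascending and that `fillPriority` is idempotent.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Sketch of a chunked, resumable backfill loop. */
final class PriorityBackfill {
    /** Hypothetical abstraction over the SQL used by the backfill. */
    interface ChunkStore {
        /** Returns up to limit ids greater than afterId that still need a value, in ascending order. */
        List<Long> nextIds(long afterId, int limit);

        /** Fills the missing value for the given ids; must be safe to repeat. */
        void fillPriority(List<Long> ids);
    }

    /** Processes chunks until no work remains or the kill switch is on; returns the resume cursor. */
    static long run(ChunkStore store, long resumeAfterId, int chunkSize,
                    BooleanSupplier killSwitch, long pauseMillis) {
        long cursor = resumeAfterId;
        while (!killSwitch.getAsBoolean()) {
            List<Long> ids = store.nextIds(cursor, chunkSize);
            if (ids.isEmpty()) {
                break; // nothing left to backfill
            }
            store.fillPriority(ids);
            cursor = ids.get(ids.size() - 1); // persist this value to resume after a restart
            try {
                Thread.sleep(pauseMillis); // crude throttle between chunks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return cursor;
            }
        }
        return cursor;
    }
}
```

The returned cursor is what a job would persist between runs. A fixed sleep is the simplest throttle; pausing based on database health metrics is a natural refinement.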


25. Query Plan Switch

When changing query:

  1. deploy code with old and new query;
  2. keep old query default;
  3. add/validate index;
  4. enable new query for canary;
  5. compare results/latency;
  6. ramp;
  7. remove old later.

This is expand-contract for query behavior.


26. Result Parity Check

For a query switch, compare old and new results for a sample of requests.

Example:

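if (shadowCompareEnabled) {
    // Run both paths for the same input q.
    OldResult oldResult = oldQuery.search(q);
    NewResult newResult = newQuery.search(q);

    // Record the comparison for later analysis; the old result is still what gets served.
    parityRecorder.record(q, oldResult, newResult);
    return oldResult;
}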

Shadow compare should be bounded and sampled to avoid doubling DB load.


27. Shadow Query Caution

Running old + new query doubles DB work.

Use:

  • sample rate;
  • low traffic tenants;
  • async comparison if possible;
  • read replica if safe;
  • strict timeout;
  • kill switch.

Do not shadow high-cost queries globally.
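A bounded shadow comparison with a sample rate and kill switch might be sketched like this. All names are hypothetical. The sketch runs the new query inline, so the strict timeout and async comparison from the list above are deliberately left out and would need to be added for real traffic.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;
import java.util.function.Function;

/** Sketch of a sampled shadow comparison; the old query always serves the response. */
final class ShadowCompare<Q, R> {
    private final Function<Q, R> oldQuery;
    private final Function<Q, R> newQuery;
    private final BooleanSupplier sampler;    // returns true for the sampled share of calls
    private final BooleanSupplier killSwitch; // returns true to disable shadowing immediately
    final AtomicLong compared = new AtomicLong();
    final AtomicLong mismatches = new AtomicLong();
    final AtomicLong shadowErrors = new AtomicLong();

    ShadowCompare(Function<Q, R> oldQuery, Function<Q, R> newQuery,
                  BooleanSupplier sampler, BooleanSupplier killSwitch) {
        this.oldQuery = oldQuery;
        this.newQuery = newQuery;
        this.sampler = sampler;
        this.killSwitch = killSwitch;
    }

    R search(Q query) {
        R oldResult = oldQuery.apply(query);
        if (!killSwitch.getAsBoolean() && sampler.getAsBoolean()) {
            try {
                R newResult = newQuery.apply(query);
                compared.incrementAndGet();
                if (!Objects.equals(oldResult, newResult)) {
                    mismatches.incrementAndGet();
                }
            } catch (RuntimeException e) {
                // A failing shadow query must never break the user-facing response.
                shadowErrors.incrementAndGet();
            }
        }
        return oldResult;
    }
}
```

A sampler such as `() -> ThreadLocalRandom.current().nextInt(100) == 0` would shadow roughly 1% of calls, which keeps the extra database load within a budget that can be stated up front.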


28. Data Access Change and Caches

If query/cache key changes:

  • cache can contain old shape;
  • stale cache can hide migration issue;
  • key version may be needed;
  • invalidate old cache on switch;
  • include source version in cached values.

For API response cache:

case-detail:v2:{tenant}:{caseId}

Keep v1/v2 separate during rollout.
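A key builder that carries the shape version keeps the two generations from colliding. This is a minimal sketch; the class name is hypothetical and the key layout simply follows the example above.

```java
/** Sketch of a versioned cache key builder for the case-detail response cache. */
final class CaseDetailCacheKey {
    /** Builds a key such as case-detail:v2:{tenant}:{caseId}; shapeVersion identifies the cached shape. */
    static String of(int shapeVersion, String tenant, String caseId) {
        return "case-detail:v" + shapeVersion + ":" + tenant + ":" + caseId;
    }
}
```

During rollout, the old read path keeps using version 1 keys and the flagged new path uses version 2 keys, so a rollback never reads entries written in the new shape.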


29. Read Replica Routing Change

Changing reads from primary to replica is data access change.

Need:

  • stale tolerance;
  • read-your-writes behavior;
  • routing flag;
  • fallback to primary;
  • replica lag metrics;
  • command path still primary;
  • cache interaction.

Canary and monitor lag.


30. ORM Mapping Change

JPA entity change can break during transition.

Example:

@Column(nullable = false)
private String priority;

while the DB column is still nullable or the backfill is incomplete.

During expand phase, entity should tolerate null/fallback or use custom mapper.

Do not let object model get ahead of migration phase.


31. jOOQ Codegen Change

jOOQ generated code reflects target schema.

If the production schema is not yet expanded, a new app that uses the generated column fails at runtime.

Therefore:

run expand migration before deploying code using generated field

For contract cleanup:

drop column only after code no longer references generated field

Compile-time safety does not solve runtime deploy ordering.


32. MyBatis/JDBC SQL Change

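String SQL must be tested against the migration state it will actually run on.

If SQL references a new column, the migration that adds it must already be applied.

If a migration drops an old column, every query that references it must already be gone.

Integration tests against a real DB should apply the same migrations as production.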


33. Multi-Service Shared Schema

Zero-downtime is harder if multiple services share a DB.

Need:

  • schema owner;
  • consumer inventory;
  • backward-compatible contract;
  • migration communication;
  • deprecation window;
  • access logs/metrics;
  • contract tests if possible.

Better architecture: service owns DB and exposes API/events. But real systems sometimes have legacy shared schema.


34. Async Workers During Migration

Workers may still run old code while the web app already runs new code.

Inventory:

  • outbox publisher;
  • inbox consumers;
  • scheduled jobs;
  • backfill jobs;
  • report jobs;
  • cache warmers;
  • search indexers.

All need compatibility.

Pause low-priority workers before risky migration if needed.


35. Message/Event Schema During DB Change

If DB schema change affects event payload:

  • event versioning;
  • dual publish old/new fields;
  • consumers tolerate both;
  • outbox table stores payload version;
  • old unprocessed events still processable.

Do not change event payload and DB schema incompatibly in same irreversible step.


36. Failure Mode: New App Before Migration

Symptom:

column does not exist

Prevention:

  • migration stage before app;
  • startup schema validation;
  • deployment dependency;
  • feature flag default off until migration complete.

37. Failure Mode: Old App After Contract

Symptom:

old app crashes on missing column

Prevention:

  • wait until old app gone;
  • query metrics/code search;
  • contract later;
  • rollback window expired.

38. Failure Mode: Partial Backfill Read

Symptom:

new read returns null/incorrect values

Prevention:

  • fallback read;
  • feature flag switch only after backfill;
  • not-null precondition;
  • parity check.

39. Failure Mode: Query Plan Incident

Symptom:

new query saturates DB

Prevention:

  • index first;
  • explain plan;
  • canary;
  • query timeout;
  • feature rollback;
  • read model if needed.

40. Failure Mode: Backfill Overload

Symptom:

OLTP latency spikes during backfill

Prevention:

  • throttling;
  • chunk size;
  • off-peak;
  • kill switch;
  • DB health adaptive pause;
  • separate pool/user.

41. Failure Mode: Migration Lock Stuck

Symptom:

future migrations blocked

Prevention/runbook:

  • only one migration runner;
  • monitor migration duration;
  • inspect lock table;
  • release only after verifying no active migration;
  • document repair process.

42. Deployment Runbook Template

# Zero-Downtime Data Access Change Runbook

Change:
Owner:
Risk:
Affected tables:
Affected app versions:
Feature flags:
Migration files:
Backfill job:
Indexes:
Compatibility matrix:
Deploy order:
Canary plan:
Metrics:
Rollback plan:
Stop conditions:
Cleanup plan:

Use for risky changes.


43. Stop Conditions

Before rollout, define stop conditions:

  • DB CPU > threshold;
  • pool wait > threshold;
  • query p95 > threshold;
  • mismatch count > 0;
  • fallback read not decreasing;
  • error rate > threshold;
  • lock wait spike;
  • replication lag > threshold;
  • backfill error rate > threshold.

If a stop condition is hit, pause the rollout, reroute traffic, or roll back the flag.
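Threshold-style stop conditions can be evaluated mechanically. The sketch below assumes metrics arrive as a name-to-value map; the class and metric names are hypothetical, and trend-based conditions such as "fallback read not decreasing" are out of scope for it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a threshold-based stop-condition check. */
final class StopConditions {
    private final Map<String, Double> maxAllowed = new LinkedHashMap<>();

    /** Registers the highest acceptable value for a metric. */
    StopConditions limit(String metric, double max) {
        maxAllowed.put(metric, max);
        return this;
    }

    /** Returns the metrics that breach their limit; a missing metric counts as a breach. */
    List<String> breached(Map<String, Double> current) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> entry : maxAllowed.entrySet()) {
            Double value = current.get(entry.getKey());
            if (value == null || value > entry.getValue()) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Treating a missing metric as a breach is a deliberate choice here: a rollout that cannot see its own metrics should stop rather than continue blind.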


44. Observability Dashboard

Dashboard should show:

  • migration applied version;
  • feature flag state;
  • old/new query latency;
  • mismatch/parity;
  • backfill progress;
  • fallback read count;
  • DB pool metrics;
  • slow query count;
  • lock waits;
  • error rate;
  • deployment version mix.

Zero-downtime rollout without observability is gambling.


45. Deployment Version Mix

During rolling deploy, know:

old version instances count
new version instances count
worker version mix

Do not contract until the old count is zero.

For Kubernetes-like systems, watch rollout status and old pods/jobs.


46. Readiness Checks

The new app can fail its readiness check if a required schema element is missing.

Example:

check column/table exists

But be careful: readiness check should not hammer DB or require expensive validation.

Schema validation at startup can catch migration ordering issues.


47. Feature Flag Default

For new data access path:

default off

Deploy code safely, then enable after migration/index/backfill.

This decouples deployment from activation.


48. Dark Launch

A dark launch runs the new path without serving its response.

Example:

  • execute new query for 1% traffic;
  • compare result;
  • discard new result;
  • monitor latency.

Use carefully because of the extra DB load.


49. Blue/Green Caveat

A blue/green deploy still shares database state.

Even if traffic switches atomically, rollback to blue requires schema backward compatibility.

Database is not blue/green unless you have sophisticated replication/cutover.

Do not assume blue/green removes expand-contract need.


50. Rollback After Data Writes

If the new app has written new values, the old app may not understand them.

Example: a new enum code has been inserted.

After a rollback, the old app may fail when parsing that code.

Prevent:

  • deploy a tolerant version of the old app first;
  • put production of the new value behind a feature flag;
  • avoid writing incompatible data until the rollback window is over.

51. Zero-Downtime Checklist

  • Compatibility matrix completed.
  • Expand migration additive.
  • App tolerant of partial data.
  • Feature flag default safe.
  • Index exists before query switch.
  • Backfill chunked/resumable.
  • Parity check defined.
  • Old/new queries can coexist.
  • Rollback by phase defined.
  • Old app still works after expand.
  • Contract delayed until old app/jobs gone.
  • Migration lock/runbook ready.
  • Metrics/dashboard ready.
  • Stop conditions defined.
  • Cleanup scheduled.

52. Anti-Pattern: Deploy Code and Destructive Migration Together

Breaks rollback and rolling deploy.


53. Anti-Pattern: Feature Flag Only Around UI

Data access path also needs flag/control.


54. Anti-Pattern: Backfill Before Dual Write

Live writes can create new rows with the new column missing during or after the backfill.


55. Anti-Pattern: Contract Cleanup Forgotten Forever

Temporary dual path becomes permanent complexity.

Create cleanup ticket/date.


56. Anti-Pattern: Shadow Query Without Load Budget

Can double DB load.


57. Mini Lab

Plan zero-downtime change:

Move dashboard from live joins to case_dashboard_read_model.
Current app queries case_file + officer + assignment count live.
New app should read read model.
Read model initially empty.
Traffic high.
Rollback must be possible.

Tasks:

  1. Create migration plan.
  2. Projector dual-write/rebuild plan.
  3. Backfill/rebuild strategy.
  4. Parity comparison.
  5. Feature flag for read switch.
  6. Index plan.
  7. Canary plan.
  8. Rollback plan.
  9. Stop conditions.
  10. Contract cleanup.

58. Summary

Zero-downtime data access change is choreography.

You must master:

  • rolling deployment reality;
  • compatibility matrix;
  • schema expand first;
  • deploy order;
  • forward/backward compatibility;
  • feature flags;
  • canary;
  • rollback types/window;
  • delayed contract cleanup;
  • migration lock/DDL locks;
  • dedicated migration job;
  • query plan switch;
  • shadow comparison;
  • cache/versioned key;
  • read replica routing;
  • ORM/jOOQ/MyBatis timing;
  • shared schema/async workers;
  • failure modes;
  • runbook;
  • observability;
  • stop conditions;
  • blue/green caveat.

The next part covers Data Backfill and Repair Job: chunking, resume cursor, idempotent update, throttling, audit trail, progress metrics, error isolation, and safe repair in production.

