Production Checklists
Learn Java Microservices File Handling, State, Configuration and Secret Management - Part 067
Production checklists for Java microservices file handling, state, configuration, and secret management: design review, security review, operational readiness, launch readiness, incident readiness, and audit readiness.
Part 067 — Production Checklists
A checklist is not bureaucracy when it encodes failures the organization has already paid for.
This part turns the whole series into practical review checklists.
A good checklist is not a replacement for engineering judgment. It is a guardrail against predictable omissions. In complex systems, the most expensive incidents often come from boring missed details:
- upload size limit missing at one layer;
- ConfigMap edited directly in production;
- secret rotated but connection pool still using old credential;
- file metadata says accepted but object is missing;
- audit event emitted only after a non-transactional side effect;
- presigned URL logged;
- retention policy encoded in storage lifecycle rule without domain legal hold;
- cache used for authorization without stale-read policy;
- startup config validation missing;
- readiness probe says healthy while required secret expired.
Part 067 is intentionally operational. Use it as:
- architecture review checklist;
- pull request review guide;
- production readiness review;
- security review;
- SRE handoff checklist;
- audit readiness checklist;
- pre-launch gate;
- post-incident improvement map.
1. How to Use These Checklists
Do not treat the checklist as a form to fill in blindly.
Use it in layers:
1. Design review: are the boundaries and invariants correct?
2. Implementation review: does code preserve them?
3. Security review: can an attacker bypass them?
4. Operational review: can the team observe and recover?
5. Compliance review: can the system prove what happened?
6. Launch review: are risks accepted or resolved?
Each item should end in one of four states:
| State | Meaning |
|---|---|
| Pass | implemented and verified |
| Fail | must be fixed before launch |
| Accepted Risk | explicitly approved with owner and expiry |
| Not Applicable | justified, not ignored |
For serious systems, every Accepted Risk must have:
- risk owner;
- reason;
- mitigation;
- expiry/review date;
- rollback/incident plan.
2. System Boundary Checklist
Before reviewing details, establish the system shape.
[ ] Service responsibility is clearly stated.
[ ] Bounded context is explicit.
[ ] File payload owner is known.
[ ] Metadata owner is known.
[ ] State source of truth is known.
[ ] Configuration owner is known.
[ ] Secret owner and security owner are known.
[ ] Audit owner is known.
[ ] Storage platform owner is known.
[ ] Operational owner/on-call is known.
[ ] Compliance/legal owner is known where applicable.
Boundary questions:
- What does this service own?
- What does it only reference?
- What can it mutate?
- What can it never mutate directly?
- Which external dependency failures can corrupt correctness?
- Which failures only affect availability/performance?
Architecture smell:
The answer to “who owns this?” is “the platform” for everything.
The platform may own infrastructure custody, but the domain service still owns semantic correctness.
3. File Handling Design Checklist
3.1 Identity and Naming
[ ] File has stable server-generated domain identity.
[ ] Client filename is not used as physical path or object key.
[ ] Object key does not contain PII or sensitive domain text.
[ ] Bucket/key/version is not exposed as public domain API.
[ ] Original filename is treated as untrusted display metadata.
[ ] Download filename is sanitized against header/path injection.
Bad sign:
storageKey = caseId + "/" + originalFilename
Better:
storageKey = evidence/yyyy/mm/dd/{fileId}/payload
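A minimal sketch of the "Better" layout above, assuming the file ID is a server-generated UUID and the date segment is the receipt date. `StorageKeys` and `evidenceKey` are hypothetical names chosen for illustration.

```java
import java.time.LocalDate;
import java.util.Locale;
import java.util.UUID;

/** Hypothetical helper: builds object keys from server-side identity only. */
final class StorageKeys {
    private StorageKeys() {}

    /** Returns evidence/yyyy/mm/dd/{fileId}/payload; no client filename, case ID, or PII. */
    static String evidenceKey(UUID fileId, LocalDate receivedOn) {
        // Locale.ROOT keeps the digits ASCII regardless of the JVM default locale.
        return String.format(Locale.ROOT, "evidence/%04d/%02d/%02d/%s/payload",
                receivedOn.getYear(), receivedOn.getMonthValue(),
                receivedOn.getDayOfMonth(), fileId);
    }
}
```

The original filename would live only in metadata as untrusted display text, so renaming or sanitizing it never touches the physical key.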
3.2 Upload Flow
[ ] Authentication happens before upload session creation.
[ ] Authorization checks actor can attach/upload file.
[ ] Upload session has TTL.
[ ] Upload size limit is enforced at every layer: ingress, proxy, application, and storage policy.
[ ] Direct upload/presigned URL TTL is short and auditable.
[ ] Multipart upload has abort/cleanup policy.
[ ] Proxy upload streams the payload instead of loading the whole file into memory.
[ ] Temp file location has quota/size limit.
[ ] Temp file cleanup exists for success, failure, and crash recovery.
[ ] Upload completion verifies object existence and size.
[ ] Upload completion does not mark file trusted/accepted.
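The streaming and size-limit items can be combined in one copy loop. The sketch below assumes a proxy-style upload where the application relays the request body to storage or a temp file; `BoundedCopy` is a hypothetical helper, and the 8 KiB buffer is an arbitrary choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical helper: streams an upload while enforcing a hard byte limit. */
final class BoundedCopy {
    private BoundedCopy() {}

    /** Copies in to out in fixed-size chunks; fails once more than maxBytes have been read. */
    static long copy(InputStream in, OutputStream out, long maxBytes) throws IOException {
        byte[] buffer = new byte[8192]; // memory use stays constant regardless of file size
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            total += read;
            if (total > maxBytes) {
                // Caller is expected to delete the partial temp file or abort the multipart upload.
                throw new IOException("upload exceeds limit of " + maxBytes + " bytes");
            }
            out.write(buffer, 0, read);
        }
        return total;
    }
}
```

This limit does not replace the ingress or storage-policy limits; it only ensures the application layer fails on its own if an outer layer is misconfigured.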
3.3 Validation and Inspection
[ ] Client Content-Type is not trusted.
[ ] Extension allowlist exists where applicable.
[ ] Magic bytes/content detection exists where applicable.
[ ] Structural validation exists for high-risk formats.
[ ] Malware scan or equivalent policy exists.
[ ] Raw upload enters quarantine before acceptance.
[ ] Scanner timeout does not accidentally accept file.
[ ] Duplicate or out-of-order scanner results are handled idempotently.
[ ] Zip/archive extraction limits exist if archive processing is supported.
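As an illustration of magic-byte detection, the sketch below recognizes only two formats (PDF and PNG) from the first bytes of the payload and treats everything else as unknown. `ContentSniffer` is a hypothetical name; a production service would more likely rely on a maintained detection library plus structural validation.

```java
import java.util.Arrays;

/** Hypothetical sniffer: decides content type from leading bytes, never from the client header. */
final class ContentSniffer {
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    private ContentSniffer() {}

    /** Returns the detected type, or application/octet-stream when nothing matches. */
    static String detect(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        return "application/octet-stream"; // unknown: callers should reject, not guess
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A matching signature only proves how the file starts, which is why the checklist still asks for structural validation and scanning before acceptance.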
3.4 Lifecycle
[ ] File lifecycle state machine is explicit.
[ ] Allowed transitions are encoded in domain/service layer.
[ ] Invalid transitions fail closed.
[ ] Terminal states are terminal.
[ ] Accepted file requires checksum and accepted scan/content decision.
[ ] Rejected file cannot become accepted without new validation path.
[ ] Physical delete is separate from deletion request.
[ ] Legal hold/retention checked before physical delete.
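One way to make the lifecycle explicit is an enum that owns its allowed transitions, with a single function that fails closed. The state names, `FileStatus`, and `FileLifecycle` below are assumptions for illustration; the series does not prescribe this exact state set.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical lifecycle states; a real service will have its own set. */
enum FileStatus {
    UPLOADING, QUARANTINED, ACCEPTED, REJECTED, DELETE_REQUESTED, DELETED;

    /** Allowed next states. DELETED is terminal and allows nothing. */
    Set<FileStatus> allowedNext() {
        switch (this) {
            case UPLOADING:        return EnumSet.of(QUARANTINED, REJECTED);
            case QUARANTINED:      return EnumSet.of(ACCEPTED, REJECTED);
            case ACCEPTED:         return EnumSet.of(DELETE_REQUESTED);
            case REJECTED:         return EnumSet.of(DELETE_REQUESTED);
            case DELETE_REQUESTED: return EnumSet.of(DELETED);
            default:               return EnumSet.noneOf(FileStatus.class);
        }
    }
}

final class FileLifecycle {
    private FileLifecycle() {}

    /** Returns the new status, or throws if the transition or its preconditions fail. */
    static FileStatus transition(FileStatus from, FileStatus to,
                                 String sha256, boolean scanClean, boolean legalHold) {
        if (!from.allowedNext().contains(to)) {
            throw new IllegalStateException("transition not allowed: " + from + " -> " + to);
        }
        if (to == FileStatus.ACCEPTED && (sha256 == null || !scanClean)) {
            throw new IllegalStateException("ACCEPTED requires checksum and clean scan");
        }
        if (to == FileStatus.DELETED && legalHold) {
            throw new IllegalStateException("physical delete blocked by legal hold");
        }
        return to;
    }
}
```

Because REJECTED has no edge to ACCEPTED, a rejected file can only be accepted by re-entering as a new upload and passing validation again.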
3.5 Download
[ ] Metadata access and payload access are separate policies.
[ ] Download grant requires fresh authorization decision.
[ ] Download is allowed only in lifecycle states that permit it.
[ ] Presigned URL is not logged.
[ ] Presigned URL TTL is bounded by policy.
[ ] Sensitive download response uses safe cache headers.
[ ] Content-Disposition and X-Content-Type-Options are set where applicable.
[ ] Download grant/deny is audited.
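The filename sanitization item can be sketched with a conservative allowlist. `DownloadHeaders` is a hypothetical helper; the allowed character set and the 100-character cap are assumptions, and services that must preserve non-ASCII names would need an encoded `filename*` parameter instead.

```java
/** Hypothetical helper: turns an untrusted original filename into a safe download name. */
final class DownloadHeaders {
    private DownloadHeaders() {}

    static String safeFilename(String original) {
        if (original == null) return "download";
        // Keep only the last path segment so "../" and "C:\..." prefixes disappear.
        String name = original.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1);
        // Replace anything outside the allowlist; this removes CR/LF, quotes, and semicolons.
        name = name.replaceAll("[^A-Za-z0-9._-]", "_");
        // Drop leading dots so the result is never "." or ".." or a hidden file.
        name = name.replaceAll("^\\.+", "");
        if (name.isEmpty()) return "download";
        return name.length() > 100 ? name.substring(0, 100) : name;
    }

    /** Builds a Content-Disposition value that always forces a download. */
    static String contentDisposition(String original) {
        return "attachment; filename=\"" + safeFilename(original) + "\"";
    }
}
```

Pairing this header with `X-Content-Type-Options: nosniff` keeps the browser from reinterpreting the payload as something executable.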
4. Metadata and Database Checklist
[ ] Metadata table has stable primary key.
[ ] Metadata includes owner/domain reference.
[ ] Metadata includes lifecycle status.
[ ] Metadata includes object storage reference.
[ ] Metadata includes checksum after verification.
[ ] Metadata includes size and detected content type.
[ ] Metadata includes scan decision and policy version.
[ ] Metadata includes retention/legal hold fields where applicable.
[ ] Optimistic locking/versioning exists for lifecycle transition.
[ ] Database constraints enforce basic invariants.
[ ] State transition update is atomic.
[ ] Metadata mutation path is not exposed to unrelated services via shared DB.
Suggested constraints:
CHECK (status IN (...))
CHECK (status <> 'ACCEPTED' OR sha256 IS NOT NULL)
CHECK (status <> 'ACCEPTED' OR scan_decision = 'CLEAN')
Review question:
Can a direct DB update make an unscanned file appear accepted?
If yes, add constraints, access control, audit triggers, or remove direct access.
5. Object Storage Checklist
[ ] Bucket public access is blocked.
[ ] Service IAM is least-privilege by action and prefix.
[ ] Separate prefixes for temporary/quarantine/accepted/archive where useful.
[ ] Object keys are generated server-side.
[ ] Encryption at rest configured.
[ ] KMS key policy reviewed if SSE-KMS/envelope encryption used.
[ ] Versioning enabled where integrity/history requires.
[ ] Object lock/legal hold enabled where required.
[ ] Incomplete multipart upload cleanup configured.
[ ] Temporary prefix lifecycle cleanup configured.
[ ] Object delete requires domain decision, not direct user input.
[ ] Object storage access/data events enabled where forensic need exists.
[ ] Object metadata/tags do not leak sensitive information.
Storage review question:
If object exists without metadata, who cleans it?
If metadata points to missing object, who detects it?
If the answer is “nobody”, the design is incomplete.
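A reconciliation job answers both questions by comparing the two sides. The sketch below assumes the key sets have already been loaded (for example from a metadata query and a storage listing); `StorageReconciler` and `Report` are hypothetical names, and a real job would page through both sources rather than hold everything in memory.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical reconciler: finds divergence between metadata rows and stored objects. */
final class StorageReconciler {
    private StorageReconciler() {}

    static final class Report {
        final Set<String> orphanObjects;  // object exists, no metadata row: cleanup candidate
        final Set<String> missingObjects; // metadata row exists, object missing: integrity alert

        Report(Set<String> orphanObjects, Set<String> missingObjects) {
            this.orphanObjects = orphanObjects;
            this.missingObjects = missingObjects;
        }
    }

    static Report reconcile(Set<String> metadataKeys, Set<String> objectKeys) {
        Set<String> orphans = new TreeSet<>(objectKeys);
        orphans.removeAll(metadataKeys);
        Set<String> missing = new TreeSet<>(metadataKeys);
        missing.removeAll(objectKeys);
        return new Report(orphans, missing);
    }
}
```

The sizes of the two sets map directly onto the orphan and metadata-payload mismatch metrics listed later in the observability checklist.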
6. State Management Checklist
6.1 State Taxonomy
[ ] Durable state identified.
[ ] Ephemeral state identified.
[ ] Derived state identified.
[ ] Cache state identified.
[ ] Workflow state identified.
[ ] Session/user journey state identified.
[ ] Operational checkpoint/lock state identified.
6.2 Source of Truth
[ ] Each business fact has one authoritative owner.
[ ] No service writes another service’s owned state through shared DB.
[ ] Derived state can be rebuilt.
[ ] Cache is not treated as authoritative unless explicitly designed.
[ ] Local disk/heap is not the only copy of correctness-critical state.
6.3 Consistency and Idempotency
[ ] Commands that can be retried have idempotency key.
[ ] Event consumers are idempotent.
[ ] Duplicate event behavior is tested.
[ ] Out-of-order event behavior is tested.
[ ] Optimistic locking or equivalent conflict control exists.
[ ] Replay/rebuild process has policy/schema versioning.
[ ] Reconciliation exists for known divergence states.
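The duplicate-event item can be illustrated with a consumer that records processed event IDs. This is an in-memory sketch only: `IdempotentConsumer` is a hypothetical class, and a durable version would store the ID in the same database transaction as the state change, typically behind a unique constraint.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical consumer wrapper: applies each event ID at most once. */
final class IdempotentConsumer {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    /** Runs the effect for a new event ID; returns false and does nothing for a duplicate. */
    boolean handle(String eventId, Runnable effect) {
        if (!processed.add(eventId)) {
            return false; // duplicate delivery: already applied
        }
        try {
            effect.run();
            return true;
        } catch (RuntimeException e) {
            processed.remove(eventId); // allow a retry after a failed attempt
            throw e;
        }
    }
}
```

A test for duplicate behavior would deliver the same event twice and assert on the resulting state, not only on the return value.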
6.4 Cache
[ ] Cache purpose is documented.
[ ] Cache TTL is defined.
[ ] Cache invalidation strategy exists.
[ ] Cache stale-read risk is classified.
[ ] Security-sensitive cache has stricter TTL/recheck behavior.
[ ] Cache metrics exist.
[ ] Cache failure mode is defined.
7. Configuration Checklist
7.1 Classification
[ ] Config contains no secret values; secrets are classified and delivered separately.
[ ] Feature flags are separated from config.
[ ] Startup-bound config identified.
[ ] Runtime-reloadable config identified.
[ ] Reload-dangerous config identified.
[ ] High-risk config identified.
[ ] Tenant-specific config bounded by global policy.
7.2 Schema and Validation
[ ] Config bound to typed @ConfigurationProperties or equivalent.
[ ] Required config fields validated.
[ ] Range/enum/duration validation exists.
[ ] Cross-field validation exists.
[ ] Safe defaults are used.
[ ] Production invariants are checked at startup.
[ ] Unknown config keys are reviewed or rejected where possible.
[ ] Config schema versioning exists for platform configs.
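The validation items can be sketched without any framework as a typed object that refuses to exist in an invalid state. `UploadConfig` and its three fields are hypothetical; with Spring Boot the same rules would more commonly sit on a `@ConfigurationProperties` class with bean validation, which this sketch does not use so that it stays self-contained.

```java
import java.time.Duration;

/** Hypothetical typed config: construction fails fast on invalid or inconsistent values. */
final class UploadConfig {
    final long maxUploadBytes;
    final Duration uploadSessionTtl;
    final Duration presignedUrlTtl;

    UploadConfig(long maxUploadBytes, Duration uploadSessionTtl, Duration presignedUrlTtl) {
        // Required and range checks.
        if (maxUploadBytes <= 0) {
            throw new IllegalArgumentException("maxUploadBytes must be positive");
        }
        requirePositive(uploadSessionTtl, "uploadSessionTtl");
        requirePositive(presignedUrlTtl, "presignedUrlTtl");
        // Cross-field check: a presigned URL should not outlive the session that issued it.
        if (presignedUrlTtl.compareTo(uploadSessionTtl) > 0) {
            throw new IllegalArgumentException("presignedUrlTtl must not exceed uploadSessionTtl");
        }
        this.maxUploadBytes = maxUploadBytes;
        this.uploadSessionTtl = uploadSessionTtl;
        this.presignedUrlTtl = presignedUrlTtl;
    }

    private static void requirePositive(Duration value, String name) {
        if (value == null || value.isZero() || value.isNegative()) {
            throw new IllegalArgumentException(name + " is required and must be positive");
        }
    }
}
```

Building this object during startup turns a bad value into a failed deployment instead of a runtime surprise.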
7.3 Delivery and Provenance
[ ] Config source is known.
[ ] Config precedence is documented.
[ ] Config version is visible at runtime.
[ ] Config change goes through review/promotion.
[ ] Config drift detection exists.
[ ] Manual emergency override is audited and backfilled.
[ ] Config rollback path exists.
[ ] Config/app compatibility is tested.
7.4 Runtime Reload
[ ] Only reload-safe config can reload at runtime.
[ ] Candidate config validated before activation.
[ ] Reload applies atomically.
[ ] Failed reload keeps previous config active.
[ ] Reload success/failure metrics exist.
[ ] Reload event is audited/operationally logged.
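Validate-then-swap is one way to satisfy the atomicity and keep-previous items together. `ReloadableConfig` below is a hypothetical holder, and the validator is whatever predicate the service defines; metrics and audit logging are left out for brevity.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

/** Hypothetical holder: readers always see a complete, validated config snapshot. */
final class ReloadableConfig<T> {
    private final AtomicReference<T> active;
    private final Predicate<T> validator;

    ReloadableConfig(T initial, Predicate<T> validator) {
        if (initial == null || !validator.test(initial)) {
            throw new IllegalArgumentException("initial config is invalid");
        }
        this.active = new AtomicReference<>(initial);
        this.validator = validator;
    }

    T current() {
        return active.get();
    }

    /** Activates the candidate only if it validates; otherwise the previous config stays active. */
    boolean reload(T candidate) {
        if (candidate == null || !validator.test(candidate)) {
            return false; // caller should count this as a reload failure
        }
        active.set(candidate); // single reference swap: no half-applied state
        return true;
    }
}
```

The swap is atomic only if the config object itself is immutable; mutating fields of the active instance would reintroduce partial updates.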
8. Secret Management Checklist
8.1 Inventory and Ownership
[ ] Secret inventory entry exists.
[ ] Secret owner assigned.
[ ] Security owner assigned.
[ ] Consumer list known.
[ ] Secret purpose and capability documented.
[ ] Secret environment scope clear.
[ ] Secret is not shared across unrelated services.
8.2 Storage and Delivery
[ ] Secret stored in secret manager or encrypted GitOps mechanism.
[ ] Secret not stored in ConfigMap.
[ ] Secret not stored plaintext in Git.
[ ] Kubernetes Secret RBAC is least privilege.
[ ] Workload identity used instead of static cloud key where possible.
[ ] Mounted/env delivery method is intentional.
[ ] Runtime reload or rollout strategy defined.
8.3 Consumption
[ ] Secret value is not logged.
[ ] Secret value is not exposed in metrics/traces/errors.
[ ] Actuator/config endpoints do not reveal secret.
[ ] Connection/client using secret can refresh or restart safely.
[ ] Readiness validates required credential where appropriate.
[ ] Secret TTL/lease respected if dynamic.
8.4 Rotation
[ ] Rotation strategy defined.
[ ] Overlap window exists where possible.
[ ] Rollback possible before old revoke.
[ ] Old credential usage observable.
[ ] New credential validated before rollout.
[ ] Old credential revoked only after adoption proof.
[ ] Rotation event audited.
[ ] Emergency rotation runbook exists.
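The "revoke only after adoption proof" item can be expressed as a small gate. The sketch assumes the platform can report when the old credential was last used; `RotationGate`, its parameters, and the quiet-period idea are illustrative assumptions rather than a prescribed procedure.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical gate: decides whether the old credential may be revoked. */
final class RotationGate {
    private RotationGate() {}

    static boolean safeToRevokeOld(boolean newCredentialValidated, Instant oldLastUsedAt,
                                   Instant now, Duration quietPeriod) {
        if (!newCredentialValidated) {
            return false; // never revoke before the new credential is proven to work
        }
        if (oldLastUsedAt == null) {
            return false; // no usage data means no adoption proof
        }
        // Old credential must have been unused for at least the quiet period.
        return Duration.between(oldLastUsedAt, now).compareTo(quietPeriod) >= 0;
    }
}
```

Until the gate returns true, rollback remains possible because the old credential is still valid.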
9. Access Control Checklist
[ ] Authentication boundary clear.
[ ] Service-to-service identity clear.
[ ] User authorization and service authorization are separate.
[ ] Metadata read and payload read are separate permissions.
[ ] File delete requires stronger permission than file read.
[ ] Legal hold/retention override requires special authority.
[ ] Admin/break-glass path audited.
[ ] Kubernetes RBAC least privilege.
[ ] Cloud IAM least privilege.
[ ] KMS decrypt permission scoped.
[ ] Secret manager access scoped.
[ ] Object storage access scoped by bucket/prefix/action.
[ ] Negative authorization tests exist.
Review question:
What can this service account do if the app is compromised?
If the answer is “read all secrets and all buckets”, stop.
10. Encryption Checklist
10.1 In Transit
[ ] External traffic uses TLS.
[ ] Internal service traffic encryption requirement decided.
[ ] mTLS used where platform requires identity at transport layer.
[ ] TLS certificate lifecycle/rotation defined.
[ ] Java truststore/keystore management understood.
[ ] Certificate reload or rollout strategy exists.
10.2 At Rest
[ ] Object storage encryption configured.
[ ] Database encryption/storage encryption understood.
[ ] Secret manager encryption/KMS boundary understood.
[ ] KMS key ownership defined.
[ ] KMS access audited.
[ ] Envelope encryption used if app-level encryption required.
[ ] Key rotation impact understood.
[ ] Decryption failure mode defined.
Important:
Encryption does not replace authorization, audit, retention, or access control.
11. Observability Checklist
11.1 Metrics
[ ] Upload started/completed/failed metrics.
[ ] File lifecycle transition metrics.
[ ] Scan queue/pending age metrics.
[ ] Metadata-payload mismatch metrics.
[ ] Object storage latency/error/retry metrics.
[ ] Config validation/reload/drift metrics.
[ ] Secret TTL/refresh/rotation metrics.
[ ] Audit outbox/backlog metrics.
[ ] Reconciliation mismatch metrics.
[ ] Cost/egress/orphan metrics where relevant.
11.2 Logs
[ ] Structured logs.
[ ] Correlation ID present.
[ ] Sensitive data redacted.
[ ] Presigned URLs not logged.
[ ] Secret values not logged.
[ ] Request/response body logging disabled by default.
[ ] Security-relevant denials are not sampled away.
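Where a URL must appear in a log line at all, stripping the query string removes the signature that makes a presigned URL usable. `LogRedactor` is a hypothetical helper; it assumes the signature travels in query parameters, and it does not hide the path, so it only helps if object keys are already free of sensitive text.

```java
/** Hypothetical helper: removes query strings (and fragments) from URLs before logging. */
final class LogRedactor {
    private LogRedactor() {}

    static String redactUrl(String url) {
        if (url == null) return null;
        int query = url.indexOf('?');
        if (query >= 0) {
            return url.substring(0, query) + "?<redacted>";
        }
        int fragment = url.indexOf('#');
        return fragment >= 0 ? url.substring(0, fragment) : url;
    }
}
```

The same idea applies to trace attributes, where route templates are preferred over raw URLs.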
11.3 Traces
[ ] Key spans around DB/storage/scanner/audit.
[ ] Async correlation propagated.
[ ] No auth headers/request bodies/presigned URLs in trace attributes.
[ ] Route templates used instead of raw URLs with query strings.
11.4 Alerts
[ ] Accepted file without checksum alert.
[ ] Metadata-payload mismatch alert.
[ ] Scan backlog alert.
[ ] Audit outbox backlog alert.
[ ] Secret expiry/refresh failure alert.
[ ] Config validation/drift alert.
[ ] Old credential usage after rotation alert.
[ ] Object storage access denied/throttling alert.
[ ] Runbooks linked to alerts.
12. Audit and Forensics Checklist
[ ] Material file lifecycle events audited.
[ ] Material config changes audited.
[ ] Secret access/rotation lifecycle audited.
[ ] High-risk deny/block events audited.
[ ] Audit event includes actor/resource/decision/reason/policy/timestamp.
[ ] Audit event schema versioned.
[ ] Audit event does not contain raw secret/payload/presigned URL.
[ ] Audit write is durable or represented in transactional outbox.
[ ] Audit publisher idempotent.
[ ] Audit backlog monitored.
[ ] Audit store access controlled and audited.
[ ] Forensic correlation ID propagated.
[ ] Kubernetes/storage/KMS audit sources enabled where needed.
[ ] Forensic drill performed.
For every high-risk operation, ask:
Can we prove who did it, why it was allowed or denied, and what changed?
13. Compliance Checklist
[ ] Data classification documented.
[ ] Retention policy owner defined.
[ ] Retention policy versioned.
[ ] Legal hold supported if required.
[ ] Delete checks retention/legal hold.
[ ] Physical delete audited.
[ ] Evidence integrity/tamper-evidence defined.
[ ] Chain of custody exists for evidence-like files.
[ ] Access review process defined.
[ ] Control catalog exists for regulated controls.
[ ] Reconciliation reports retained.
[ ] Manual exception process audited.
Review question:
If an auditor asks why FILE-123 was deleted, can we answer with evidence?
14. Failure and Chaos Testing Checklist
[ ] Storage timeout tested.
[ ] Storage 403/503 tested.
[ ] DB commit failure after object write tested.
[ ] Duplicate event tested.
[ ] Out-of-order event tested.
[ ] Worker crash tested.
[ ] Scanner timeout tested.
[ ] Config invalid startup tested.
[ ] Runtime config reload failure tested.
[ ] Secret manager outage tested.
[ ] Secret rotation bad credential tested.
[ ] Pod restart/eviction tested.
[ ] Reconciliation tested.
[ ] Alerts tested.
[ ] Game day performed for critical workflow.
Each failure test must verify the invariant, not only the HTTP status.
15. Operational Readiness Checklist
[ ] On-call owner assigned.
[ ] Service dashboard exists.
[ ] Runbooks exist for top failure modes.
[ ] SLOs defined.
[ ] Error budget policy defined if applicable.
[ ] Deployment rollback tested.
[ ] Config rollback tested.
[ ] Secret rotation rollback tested.
[ ] Data repair/reconciliation process documented.
[ ] DLQ replay process documented.
[ ] Incident severity mapping defined.
[ ] Support escalation path defined.
[ ] Backup/restore requirements defined.
Minimum runbooks:
- upload failures spike;
- scan backlog;
- metadata-payload mismatch;
- object storage access denied;
- config drift;
- secret refresh failure;
- secret rotation failure;
- audit outbox backlog;
- retention delete blocked unexpectedly;
- DLQ growth.
16. Launch Readiness Gate
Before production launch:
[ ] Architecture reviewed.
[ ] Threat model reviewed.
[ ] Security controls verified.
[ ] Config schema and prod validation implemented.
[ ] Secret inventory and rotation runbook complete.
[ ] Audit event coverage reviewed.
[ ] Observability dashboard and alerts live.
[ ] Failure tests completed.
[ ] Reconciliation jobs enabled.
[ ] Data retention/legal hold requirements signed off.
[ ] Operational runbooks approved.
[ ] Rollback plan tested.
[ ] Known risks documented with owners.
Launch decision should be explicit:
- GO
- NO-GO
- GO with accepted risks
17. Pull Request Review Checklist
For code changes touching file/state/config/secret:
[ ] Does this change affect a lifecycle invariant?
[ ] Does it introduce new config?
[ ] Is new config typed and validated?
[ ] Does it introduce or consume a secret?
[ ] Does it log or trace sensitive data?
[ ] Does it change authorization behavior?
[ ] Does it need an audit event?
[ ] Is retry/idempotency handled?
[ ] Are failure modes tested?
[ ] Are metrics/alerts updated?
[ ] Does documentation/runbook need update?
18. ADR Checklist
For significant decisions:
[ ] Context explains problem.
[ ] Decision is explicit.
[ ] Alternatives considered.
[ ] Consequences stated.
[ ] Security impact stated.
[ ] Operational impact stated.
[ ] Compliance impact stated if relevant.
[ ] Failure modes considered.
[ ] Rollback/migration plan included.
[ ] Owner and date included.
ADR topics worth writing:
- file identity model;
- storage provider and key layout;
- upload strategy;
- scanner/quarantine model;
- config delivery model;
- secret delivery model;
- secret rotation strategy;
- audit event model;
- retention/legal hold model;
- reconciliation strategy.
19. Executive Risk Summary Template
For leadership or launch approval:
# Production Risk Summary
## Service
- Name:
- Owner:
- Launch date:
## Critical Capabilities
- ...
## Top Risks
| Risk | Impact | Mitigation | Owner | Status |
|---|---|---|---|---|
## Accepted Risks
| Risk | Reason | Expiry | Owner |
|---|---|---|---|
## Readiness Evidence
- Architecture review:
- Security review:
- Failure tests:
- Observability:
- Runbooks:
- Compliance review:
## Decision
- GO / NO-GO / GO WITH RISKS
20. Key Takeaways
- Production readiness is an evidence problem, not a confidence problem.
- Checklists encode failure memory and prevent predictable omissions.
- File review must cover identity, lifecycle, validation, storage, download, and retention.
- State review must classify source of truth, idempotency, replay, and cache risk.
- Config review must cover schema, provenance, policy, reload, drift, and rollback.
- Secret review must cover inventory, delivery, consumption, rotation, and leakage prevention.
- Observability review must verify invariant-based metrics and actionable alerts.
- Audit review must prove material decisions are durable, redacted, and reconstructable.
- Failure testing must validate invariants, not just dependency error handling.
- Accepted risk must have owner, expiry, and mitigation.
Next, we sharpen intuition by studying anti-patterns and production war stories.
References
- Spring Boot Externalized Configuration: https://docs.spring.io/spring-boot/reference/features/external-config.html
- Kubernetes ConfigMaps: https://kubernetes.io/docs/concepts/configuration/configmap/
- Kubernetes Secrets: https://kubernetes.io/docs/concepts/configuration/secret/
- Kubernetes Probes: https://kubernetes.io/docs/concepts/workloads/pods/probes/
- OWASP File Upload Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/File_Upload_Cheat_Sheet.html
- OWASP Logging Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html
- OWASP Secrets Management Cheat Sheet: https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html