Production Readiness Review Template for Microservice Communication
Learn Java Microservices Communication - Part 095
Final production readiness review template for Java microservices communication: HTTP, gRPC, Kafka, service mesh, gateway, egress, security, resilience, observability, testing, capacity, rollout, runbooks, ownership, and executive readiness scoring.
A production readiness review is not a ceremony.
It is a risk-reduction mechanism.
For communication-heavy microservices, the goal is to prove:
the service can communicate safely, reliably, observably, securely, and operably under expected and abnormal conditions
This template consolidates the whole series into a review artifact.
Use it before:
- launching a new service,
- exposing a public API,
- adding a major service dependency,
- introducing event-driven workflows,
- changing schema/event contracts,
- adding service mesh policy,
- enabling multi-region failover,
- integrating external providers,
- migrating from sync to async,
- performing high-risk refactors.
A top-tier engineer does not only ask:
Does it work?
They ask:
How does it fail, how do we know, who owns it, and how do we recover?
1. Review Output
The review should produce one of these outcomes:
| Outcome | Meaning |
|---|---|
| Ready | safe to launch |
| Ready with conditions | launch allowed after explicit mitigations |
| Trial/canary only | limited blast radius allowed |
| Not ready | risks too high |
| Blocked | missing fundamental requirement |
Example:
readinessDecision:
  status: ready-with-conditions
  conditions:
    - DLQ alert must be connected to on-call before 100% traffic.
    - Consumer lag dashboard must include oldest event age.
  approvedBy:
    - case-platform
    - platform-sre
    - security
  reviewDate: 2026-07-05
  followUpDate: 2026-07-19
The outcome must be explicit.
2. Readiness Scorecard
Score each dimension from 0 to 5, and weight the dimensions that matter most for this service.
scorecard:
  contract: 4/5
  resilience: 4/5
  security: 5/5
  observability: 3/5
  capacity: 4/5
  testing: 4/5
  operations: 3/5
  ownership: 5/5
  overall: ready-with-conditions
Scoring guidance:
| Score | Meaning |
|---|---|
| 0 | absent |
| 1 | informal/manual |
| 2 | implemented partially |
| 3 | implemented but not fully tested |
| 4 | tested and observable |
| 5 | automated, governed, and drilled |
A perfect score is not always required.
But critical gaps must be visible.
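One way to keep critical gaps visible is to report the lowest-scoring dimensions alongside the average, so that a strong overall number cannot hide a weak area. The sketch below is illustrative only: the class name `ReadinessScorecard`, the per-dimension weights, and the floor value are assumptions, not part of the template.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical scorecard helper; the weights and the floor are illustrative assumptions. */
final class ReadinessScorecard {
    private final Map<String, Integer> scores = new LinkedHashMap<>();
    private final Map<String, Double> weights = new LinkedHashMap<>();

    /** Records a 0..5 score and a positive weight for one dimension. */
    ReadinessScorecard add(String dimension, int score, double weight) {
        if (score < 0 || score > 5 || weight <= 0) {
            throw new IllegalArgumentException("score must be 0..5 and weight must be positive");
        }
        scores.put(dimension, score);
        weights.put(dimension, weight);
        return this;
    }

    /** Weighted mean on the same 0..5 scale. */
    double weightedAverage() {
        double total = 0;
        double weightSum = 0;
        for (Map.Entry<String, Integer> e : scores.entrySet()) {
            double w = weights.get(e.getKey());
            total += e.getValue() * w;
            weightSum += w;
        }
        return weightSum == 0 ? 0 : total / weightSum;
    }

    /** Dimensions scoring below the floor, listed in insertion order. */
    List<String> gapsBelow(int floor) {
        return scores.entrySet().stream()
                .filter(e -> e.getValue() < floor)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```

With the example scores above and equal weights, the average is 4.0, yet `gapsBelow(4)` still surfaces observability and operations as the dimensions that need conditions.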
3. Service Summary
service:
  name: case-service
  owner: case-platform
  runtime: Java 21
  framework: Spring Boot
  deployment: Kubernetes
  namespace: case
  serviceAccount: case-service
  criticality: tier-1
  dataClassification: internal-confidential
  userFacing: true
  publicRoutes:
    - /cases/**
  internalRoutes:
    - /internal/cases/**
Questions:
- Who owns the service?
- Is ownership current?
- What business capability does it support?
- What is the criticality tier?
- What data does it handle?
- Is it user-facing?
- Is it externally exposed?
- Does it process regulated/sensitive data?
No readiness without ownership.
4. Communication Inventory
List all communication surfaces.
communication:
  inboundHttp:
    - route: POST /cases/{caseId}/escalations
      exposure: public
      viaGateway: true
      authRequired: true
  inboundGrpc:
    - service: CaseQueryService
      exposure: internal
  outboundHttp:
    - dependency: customer-service
      operation: GetCustomerProfile
    - dependency: payment-provider
      external: true
  outboundGrpc:
    - dependency: risk-service
      operation: ScoreCaseRisk
  producesEvents:
    - topic: case-events
      eventTypes:
        - CaseCreated.v1
        - CaseEscalated.v1
  consumesEvents:
    - topic: customer-events
      groupId: case-service-customer-cache
  platform:
    gateway: public-api-gateway
    mesh: enabled
    egressGateway: required-for-payment-provider
If a dependency is not in inventory, it is not reviewable.
5. Boundary Classification
Classify each communication boundary:
| Boundary | Type | Risk |
|---|---|---|
| browser -> gateway | external/public | high |
| gateway -> case-service | internal edge-to-service | high |
| case-service -> customer-service | internal sync | medium |
| case-service -> payment-provider | external sync | critical |
| case-service -> Kafka case-events | async publish | high |
| search-indexer -> case-events | async consume | high |
| case-service -> database | local persistence | high |
Boundary classification determines review depth.
External, public, cross-region, payment, regulated, and async workflow boundaries deserve deeper review.
6. API Contract Readiness
For HTTP APIs:
- OpenAPI exists.
- Operation IDs stable.
- Request/response schemas valid.
- Error schema standardized.
- Status codes documented.
- Idempotency documented for commands.
- Pagination/filtering documented.
- Auth requirements documented.
- Rate limits documented.
- Examples present.
- Backward compatibility reviewed.
- Deprecation policy defined.
Review artifact:
httpApiContract:
  openApiPath: api/openapi/case-service.yaml
  compatibleWithPrevious: true
  errorModel: problem-json
  idempotencyKeyRequiredFor:
    - POST /cases/{caseId}/escalations
  contractTests: passing
If an API has no contract, generated clients and consumers depend on behavior by rumor.
7. gRPC Contract Readiness
For gRPC:
- .proto files versioned.
- Package and service names stable.
- Field numbers not reused.
- Removed fields reserved.
- Enum evolution policy defined.
- Deadlines documented.
- Status code mapping documented.
- Metadata headers documented.
- Streaming behavior documented.
- Backward compatibility checked.
- Generated code tested.
- Reflection/health exposure controlled.
Review artifact:
grpcContract:
  protoPath: proto/case/query/v1/case_query.proto
  compatibilityCheck: passing
  deadlineRequired: true
  statusMappingDoc: docs/grpc-status.md
gRPC compatibility is strict because clients often compile against generated contracts.
8. Event Contract Readiness
For events:
- AsyncAPI exists.
- Topic/channel documented.
- Event type documented.
- Payload schema registered.
- Key policy defined.
- Required headers documented.
- Event ID stable.
- Correlation/causation propagated.
- Compatibility mode enforced.
- Fixtures present.
- Producer contract tests pass.
- Consumer fixture tests pass.
- Known consumers documented.
- Classification/retention documented.
Review artifact:
eventContract:
  asyncApiPath: asyncapi/case-events.yaml
  topic: case-events
  key: caseId
  compatibility: full-transitive
  fixtures: contracts/events/case-events/
  producerContractTests: passing
  consumerContracts:
    - search-indexer
    - notification-service
A topic is a production API.
Treat it accordingly.
9. Consistency Model Readiness
State exactly what is committed before response.
Example:
operation: POST /cases/{caseId}/escalations
consistency:
  response: 202 Accepted
  committedBeforeResponse:
    - case escalation state
    - outbox event row
  eventuallyConsistent:
    - Kafka publication
    - notification sent
    - search projection updated
  freshnessSlo:
    searchProjectionP99Seconds: 30
  readYourWrites:
    direct case GET: yes
    search API: eventually consistent
Review questions:
- What does the client know after the response?
- What may happen later?
- What can fail after the response?
- How is a pending or failed state made visible?
- Which reads can be stale?
- Is a stale read acceptable?
Ambiguous consistency creates user-facing bugs.
10. Idempotency Readiness
For commands:
idempotency:
  required: true
  keyHeader: Idempotency-Key
  scope: tenant + operation + key
  retention: 24h
  duplicateSamePayload: return original result
  duplicateDifferentPayload: 409 conflict
  propagatedTo:
    - commandId
    - outboxEventId
    - providerIdempotencyKey
Review questions:
- Is duplicate client request safe?
- Is retry after timeout safe?
- Is provider retry safe?
- Is idempotency response stable?
- Is request hash checked?
- Is key retained long enough?
- Is key included in logs safely?
If an operation is not idempotent, automatic retries must be disabled.
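The duplicate-handling rules in the example (same key and same payload returns the original result, same key and different payload is a conflict) can be sketched as follows. This is a minimal in-memory illustration: `IdempotencyStore`, `ConflictException`, and the string-based request hash are hypothetical names, and a real service would typically back this with a database table and enforce the 24h retention, which the sketch omits.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical in-memory idempotency store; retention and persistence are left out. */
final class IdempotencyStore {
    record Entry(String requestHash, String result) {}

    /** Raised when a key is reused with a different payload (maps to 409 in the example). */
    static final class ConflictException extends RuntimeException {
        ConflictException(String message) { super(message); }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    /** Runs the command at most once per tenant + operation + key. */
    String execute(String tenant, String operation, String key,
                   String requestHash, Supplier<String> command) {
        String scopedKey = tenant + "|" + operation + "|" + key;
        // computeIfAbsent runs the command only for the first caller of this key.
        Entry entry = entries.computeIfAbsent(
                scopedKey, k -> new Entry(requestHash, command.get()));
        if (!entry.requestHash().equals(requestHash)) {
            throw new ConflictException("idempotency key reused with a different payload");
        }
        return entry.result(); // duplicates receive the original result
    }
}
```

Running the command inside `computeIfAbsent` keeps the sketch short, but it holds a map lock while the command executes; a production implementation would more likely reserve the key first and store the result afterwards.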
11. Timeout Readiness
For each sync dependency:
dependency: customer-service
operation: GetCustomerProfile
timeout:
  connectMs: 100
  responseMs: 300
  totalBudgetMs: 400
  callerDeadlineMs: 700
  cancellationPropagated: true
Review questions:
- Does every call have a timeout?
- Are timeouts nested, so that each inner timeout fits within its caller's budget?
- Does timeout fit SLO?
- Does server stop work on cancellation?
- Are DB/external timeouts within app budget?
- Is gateway timeout aligned?
- Is mesh timeout aligned?
No production dependency should rely on infinite default timeouts.
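Nested timeouts can be enforced by capping each call's configured timeout with whatever remains of the caller's deadline. The sketch below assumes a hypothetical `Deadline` helper; the clock is passed in as a nanosecond value (for example from `System.nanoTime()`) so the logic stays easy to test.

```java
import java.time.Duration;

/** Hypothetical deadline helper: caps per-call timeouts by the caller's remaining budget. */
final class Deadline {
    private final long expiresAtNanos;

    private Deadline(long expiresAtNanos) {
        this.expiresAtNanos = expiresAtNanos;
    }

    /** Starts a deadline of the given budget at the given clock reading. */
    static Deadline after(Duration budget, long nowNanos) {
        return new Deadline(nowNanos + budget.toNanos());
    }

    /** Time left before the deadline, never negative. */
    Duration remaining(long nowNanos) {
        long left = expiresAtNanos - nowNanos;
        return left <= 0 ? Duration.ZERO : Duration.ofNanos(left);
    }

    /** Effective timeout is min(configured, remaining); zero means the call should not start. */
    Duration timeoutFor(Duration configured, long nowNanos) {
        Duration left = remaining(nowNanos);
        return configured.compareTo(left) <= 0 ? configured : left;
    }
}
```

With the example values, a 400 ms dependency budget under a 700 ms caller deadline stays at 400 ms at the start of the request, but shrinks to 200 ms if 500 ms have already elapsed. The result can be passed to `HttpRequest.Builder.timeout(Duration)` from `java.net.http`; that method rejects non-positive durations, so a zero result should fail fast instead of starting the call.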
12. Retry Readiness
retry:
  operation: GetCustomerProfile
  owner: client-library
  maxTotalAttemptsAcrossLayers: 2
  retryable:
    - connect-timeout
    - connection-reset
    - 503
  nonRetryable:
    - 400
    - 401
    - 403
    - 409
  backoff: exponential-jitter
  metrics: enabled
Review questions:
- Which layer owns retry?
- Are gateway/mesh retries disabled or coordinated?
- Are unsafe methods protected?
- Is retry budget bounded?
- Is backoff/jitter used?
- Are attempts observable?
- Does retry respect deadline?
Retry is not allowed to be accidental.
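A deliberate retry policy makes each decision explicit: the attempt count, the failure class, whether the operation is idempotent, and the remaining deadline. The sketch below is a minimal illustration of that idea; `RetryPolicy`, the base delay of 50 ms, and the 500 ms cap are assumptions, while the attempt limit and the retryable status come from the example config. Connection-level failures would need a similar classification, which is omitted here.

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical retry policy; delays are illustrative assumptions. */
final class RetryPolicy {
    static final int MAX_TOTAL_ATTEMPTS = 2;               // maxTotalAttemptsAcrossLayers
    static final Set<Integer> RETRYABLE_STATUS = Set.of(503);
    static final long BASE_DELAY_MS = 50;                   // assumption
    static final long MAX_DELAY_MS = 500;                   // assumption

    /** attempt is 1-based: the number of attempts already made. */
    static boolean shouldRetry(int attempt, int status, boolean idempotent, long remainingBudgetMs) {
        return idempotent
                && attempt < MAX_TOTAL_ATTEMPTS
                && RETRYABLE_STATUS.contains(status)
                && remainingBudgetMs > 0;
    }

    /** Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^(attempt-1))]. */
    static long backoffMs(int attempt) {
        int exponent = Math.max(0, Math.min(attempt - 1, 20));
        long ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << exponent);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

A caller would also cap the chosen delay by the remaining deadline and record each attempt as a metric, so that retry amplification stays observable.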
13. Circuit Breaker and Bulkhead Readiness
resilience:
  circuitBreaker:
    enabled: true
    dependency: payment-provider
    failureRateThreshold: 50
    minimumCalls: 50
    openDurationSeconds: 30
  bulkhead:
    maxConcurrentCalls: 50
    queueSize: 0
  fallback:
    behavior: mark-payment-pending
Review questions:
- What dependency can fail slowly?
- Is the failure isolated?
- Is thread/connection pool isolated?
- Is fallback domain-correct?
- Is circuit state observable?
- Is half-open behavior safe?
Bulkheads prevent one dependency from exhausting the whole service.
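A semaphore is enough to illustrate the bulkhead settings in the example: a fixed number of concurrent calls and a queue size of zero, meaning excess calls are rejected immediately and routed to the fallback. This is a plain-Java sketch under those assumptions, with a hypothetical `Bulkhead` class; a library such as Resilience4j would normally provide this in production code.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical semaphore bulkhead: bounded concurrency, no waiting queue. */
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the dependency call if a permit is free; otherwise returns the fallback at once. */
    <T> T call(Supplier<T> dependencyCall, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // queueSize 0: reject instead of waiting
        }
        try {
            return dependencyCall.get();
        } finally {
            permits.release(); // always return the permit, even on exceptions
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

For the payment-provider example, the fallback would return a domain-correct result such as a pending payment state rather than a generic error.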
14. Outbox Readiness
outbox:
  required: true
  table: outbox_message
  sameTransactionAsBusinessState: true
  relay:
    replicas: 2
    maxBatchSize: 100
    publishAckRequired: true
  monitoring:
    pendingCount: true
    oldestPendingAge: true
    publishFailureRate: true
  cleanup:
    publishedRetentionDays: 7
Review questions:
- Are business state and outbox row committed atomically?
- Is event ID stable across retry?
- Can duplicate publish happen safely?
- Is relay observable?
- Is pending age alerted?
- Is cleanup safe?
Critical domain events should not be published only by best-effort direct send.
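The relay side of the outbox can be sketched without a database: a message leaves the pending set only after the broker acknowledges it, and a retry reuses the same event ID. Everything here is a simplified stand-in; `OutboxRelay`, `OutboxMessage`, and `Publisher` are hypothetical names, and the in-memory queue replaces the `outbox_message` table that a real service would write in the same transaction as its business state.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical in-memory outbox relay; a real outbox lives in the service database. */
final class OutboxRelay {
    record OutboxMessage(String eventId, String payload) {}

    /** Returns true only when the broker has acknowledged the message. */
    interface Publisher {
        boolean publish(OutboxMessage message);
    }

    private final Deque<OutboxMessage> pending = new ArrayDeque<>();

    /** In a real service this insert shares a transaction with the business state change. */
    void enqueue(OutboxMessage message) {
        pending.addLast(message);
    }

    /** Publishes up to maxBatchSize messages in order; stops at the first unacknowledged one. */
    int relayOnce(Publisher publisher, int maxBatchSize) {
        int published = 0;
        while (published < maxBatchSize && !pending.isEmpty()) {
            OutboxMessage next = pending.peekFirst();
            if (!publisher.publish(next)) {
                break; // keep the row; the next run retries with the same eventId
            }
            pending.removeFirst();
            published++;
        }
        return published;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Because a crash between the broker ack and the removal would cause a second publish, consumers still need to treat the event ID as a deduplication key.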
15. Consumer Readiness
consumer:
  topic: case-events
  groupId: search-indexer
  autoCommit: false
  ackAfterDurableEffect: true
  idempotency: processed-message-table
  duplicateBehavior: skip
  retryPolicy: bounded
  dlq: case-events.search-indexer.dlq
  replaySafe: true
Review questions:
- Is auto-commit disabled for critical consumers?
- Is the ack sent only after the durable effect?
- Are duplicates safe?
- Is ordering scope understood?
- Is retry bounded?
- Is DLQ owned?
- Is lag/freshness monitored?
- Is replay tested?
At-least-once delivery means duplicate handling is mandatory.
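Duplicate handling with a processed-message table can be sketched as a check on the event ID before the effect is applied. The example below uses in-memory collections as stand-ins: `IdempotentProjectionConsumer` and `CaseEvent` are hypothetical names, the `Set` represents the processed-message table, and the `Map` represents the search projection. In a real consumer both writes would share one database transaction, and the offset would be committed only after `handle` returns.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical idempotent consumer; collections stand in for durable tables. */
final class IdempotentProjectionConsumer {
    record CaseEvent(String eventId, String caseId, String status) {}

    private final Set<String> processedEventIds = new HashSet<>();  // processed-message table
    private final Map<String, String> projection = new HashMap<>(); // search projection

    /** Returns true if the event was applied, false if it was a duplicate and skipped. */
    boolean handle(CaseEvent event) {
        if (processedEventIds.contains(event.eventId())) {
            return false; // duplicateBehavior: skip
        }
        projection.put(event.caseId(), event.status());
        processedEventIds.add(event.eventId());
        return true;
    }

    String statusOf(String caseId) {
        return projection.get(caseId);
    }
}
```

Redelivering the same event after a rebalance or a replay then leaves the projection unchanged.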
16. DLQ Readiness
dlq:
  topic: case-events.search-indexer.dlq
  owner: search-platform
  alertOnFirstMessage: true
  retention: 14d
  preservesOriginalMetadata: true
  replayTool: available
  replayApprovalRequired: true
  dashboard: search-dlq-dashboard
Review questions:
- Who owns DLQ?
- Is alert configured?
- Is reason classified?
- Is replay possible?
- Is replay audited?
- Is DLQ access restricted?
- Is retention enough?
A DLQ without an owner and a replay plan is an unresolved failure.
17. Replay Readiness
replay:
  supported: true
  historicalFixturesTested: true
  sideEffectsSuppressed: true
  maxReplayRate: 1000/s
  pauseWhenLiveLagAboveSeconds: 45
  auditRequired: true
  approvalRequiredForSensitiveTopics: true
Review questions:
- Can old events still process?
- Are old schemas supported?
- Are side effects suppressed?
- Is replay throttled?
- Is live traffic protected?
- Is replay audited?
- Is data privacy respected?
Replay is a production change.
Not a casual command.
18. Gateway Readiness
gateway:
  route: /cases/**
  owner: case-platform
  authRequired: true
  tls: enabled
  timeoutMs: 1500
  retries:
    enabled: true
    methods:
      - GET
      - HEAD
  rateLimit:
    by: clientId
    default: 1000/min
  bodyLimitBytes: 1048576
  identityHeaders:
    stripUntrusted: true
    setTrusted: true
  routeTests: passing
Review questions:
- Is public route authenticated?
- Are identity headers protected?
- Are request size limits configured?
- Are retries safe?
- Are timeouts aligned?
- Are rate limits defined?
- Are route tests passing?
- Is CORS configured correctly for browser-facing routes?
Gateway is part of the API.
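The `stripUntrusted` and `setTrusted` settings describe a two-step rule: drop any identity header the client supplied, then set those headers from the verified token. The sketch below shows that rule in isolation; `IdentityHeaderFilter` and the header names `x-user-id` and `x-tenant-id` are assumptions chosen for illustration, not names defined by the template or by any particular gateway.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical identity-header filter; header names are illustrative assumptions. */
final class IdentityHeaderFilter {
    static final Set<String> IDENTITY_HEADERS = Set.of("x-user-id", "x-tenant-id");

    /** Returns headers safe to forward: client-supplied identity removed, verified identity set. */
    static Map<String, String> sanitize(Map<String, String> inbound,
                                        String verifiedUserId, String verifiedTenantId) {
        // HTTP header names are case-insensitive, so the result map is too.
        Map<String, String> outbound = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        inbound.forEach((name, value) -> {
            if (!IDENTITY_HEADERS.contains(name.toLowerCase(Locale.ROOT))) {
                outbound.put(name, value); // keep non-identity headers as they are
            }
        });
        outbound.put("x-user-id", verifiedUserId);     // set only from the verified token
        outbound.put("x-tenant-id", verifiedTenantId);
        return outbound;
    }
}
```

A route test can then assert that a spoofed identity header never reaches the service unchanged.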
19. Service Mesh Readiness
mesh:
  enabled: true
  mtls: strict
  identity:
    serviceAccount: case-service
  authorization:
    defaultDeny: true
    allowedCallers:
      - api-gateway.edge
      - order-service.order
  trafficPolicy:
    timeoutMs: 1000
    retries:
      safeMethodsOnly: true
  observability:
    mtlsMetrics: true
    authzDenyLogs: true
Review questions:
- Is service account unique?
- Is mTLS strict or migration mode?
- Are authz rules least-privilege?
- Are wildcard allows prohibited?
- Are retries coordinated?
- Are proxy resources sized?
- Are mesh tests passing?
Mesh policy is production code.
20. Egress Readiness
egress:
  dependency: payment-provider
  host: api.payment.example.com
  viaEgressGateway: true
  auth: mtls + oauth-client-credentials
  credentialSource: secret-manager
  timeoutMs: 1000
  retryRequiresIdempotency: true
  circuitBreaker: enabled
  rateLimit: 300/s
  sourceIpAllowlisted: true
  syntheticProbe: enabled
Review questions:
- Is external host declared?
- Is egress allowed explicitly?
- Are credentials managed safely?
- Is provider timeout bounded?
- Are retries idempotent?
- Is circuit breaker configured?
- Is provider quota known?
- Is failure drill performed?
External dependencies are ownership and reliability boundaries.
21. Security and Privacy Readiness
security:
  transportEncryption: true
  mTLSInternal: true
  publicAuth: oidc-jwt
  domainAuthorization: application
  topicAclsLeastPrivilege: true
  egressDefaultDeny: true
  secretsInEventsForbidden: true
  payloadLoggingDisabled: true
  piiClassification: internal-confidential
  replayAuditRequired: true
Review questions:
- Is data encrypted in transit?
- Is service identity unique?
- Is public route authenticated?
- Is domain authorization enforced in the application?
- Are topics/ACLs least-privilege?
- Is PII minimized?
- Are logs redacted?
- Are DLQs protected?
- Are replay/offset reset audited?
Security must cover sync, async, platform, and logs.
22. Observability Readiness
Required signals:
observability:
  http:
    inboundRate: true
    inboundLatency: true
    errorRateByOperation: true
  dependencies:
    latencyByDependency: true
    timeoutType: true
    retryAttempts: true
  async:
    outboxAge: true
    consumerLagSeconds: true
    dlqCount: true
    retryRate: true
    projectionFreshness: true
  platform:
    gatewayRouteMetrics: true
    meshSourceDestinationMetrics: true
    authzDenies: true
    dnsErrors: true
  tracing:
    traceContextPropagated: true
    correlationIdPropagated: true
  logs:
    structured: true
    redacted: true
Review questions:
- Can we identify where a 503 came from?
- Can we see retry amplification?
- Can we see freshness?
- Can we see DLQ/outbox?
- Can we see authz denies?
- Are logs safe?
- Are dashboards linked?
If operators cannot see it, they cannot own it.
23. SLO Readiness
slos:
  availability:
    CreateEscalation: 99.9%
  latency:
    GetCaseP99Ms: 500
  freshness:
    SearchProjectionP99Seconds: 30
  eventPublication:
    CaseEscalatedOutboxToKafkaP99Seconds: 10
  workflow:
    NotificationCompletedP99Minutes: 5
Review questions:
- What user outcome is measured?
- Are SLOs realistic?
- Are metrics available?
- Are alerts tied to SLO?
- Is error budget owner defined?
- Are async freshness/completion SLOs included?
Async systems need freshness and completion SLOs.
24. Capacity Readiness
capacity:
  peakQps: 2000
  peakEventRate: 5000/s
  recordSizeP95Bytes: 12000
  partitions: 48
  consumerThroughput: 7000/s
  replayAllowance: 1000/s
  downstreamDbCapacity: 8000 writes/s
  failoverCapacity: 70%
  loadTest:
    peak: passed
    replayWithLiveTraffic: passed
    retryStorm: passed
Review questions:
- Is capacity measured?
- Is peak load tested?
- Is replay included?
- Is retry amplification included?
- Is hot partition tested?
- Is failover capacity known?
- Is downstream capacity sufficient?
Capacity is end-to-end pipeline capacity.
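The example numbers can be checked with simple arithmetic: the slowest stage in the pipeline must absorb live peak traffic plus the replay allowance. The sketch below encodes that check in a hypothetical `CapacityCheck.Plan` record. It assumes one downstream write per event and uses the P95 record size as a rough upper estimate for bandwidth; both are simplifications, not statements from the template.

```java
/** Hypothetical capacity arithmetic; assumes one downstream write per consumed event. */
final class CapacityCheck {
    record Plan(long peakEventsPerSec, long recordSizeP95Bytes, int partitions,
                long consumerEventsPerSec, long replayEventsPerSec,
                long downstreamWritesPerSec) {

        /** Approximate broker ingress at peak, in megabytes per second. */
        double ingressMegabytesPerSec() {
            return peakEventsPerSec * recordSizeP95Bytes / 1_000_000.0;
        }

        /** Average per-partition rate if keys spread evenly (hot partitions will be higher). */
        double eventsPerPartitionPerSec() {
            return (double) peakEventsPerSec / partitions;
        }

        /** Consumer throughput left over after live peak traffic. */
        long consumerHeadroom() {
            return consumerEventsPerSec - peakEventsPerSec;
        }

        /** True when live peak plus replay fits both the consumer and the downstream store. */
        boolean replayFitsAlongsideLive() {
            long combined = peakEventsPerSec + replayEventsPerSec;
            return combined <= consumerEventsPerSec && combined <= downstreamWritesPerSec;
        }
    }
}
```

For the example values (5000 events/s, 12000 bytes, 48 partitions, 7000/s consumer, 1000/s replay, 8000 writes/s), ingress is about 60 MB/s, the even-spread partition rate is roughly 104 events/s, consumer headroom is 2000 events/s, and replay alongside live traffic fits because 6000 is below both 7000 and 8000.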
25. Testing Readiness
testing:
  unitPolicyTests: passing
  openApiContractTests: passing
  grpcContractTests: passing
  asyncApiContractTests: passing
  producerContractTests: passing
  consumerFixtureTests: passing
  schemaCompatibility: passing
  integrationTests: passing
  gatewayRouteTests: passing
  meshAuthzTests: passing
  egressFailureTests: passing
  replayTests: passing
  loadTests: passing
  chaosDrills:
    requiredBeforeGA: true
Review questions:
- Are contracts tested?
- Are negative tests included?
- Are duplicates tested?
- Are platform routes tested?
- Is observability tested?
- Are failure paths tested?
- Are load/failure tests realistic?
Testing should match risk.
26. Rollout Readiness
rollout:
  strategy: canary
  initialTrafficPercent: 5
  promotionCriteria:
    errorRateRegression: <0.2%
    p99LatencyRegression: <10%
    dlqCount: 0
    criticalAlerts: 0
  rollback:
    routeRollback: true
    producerFlag: true
    consumerPause: true
    schemaBackwardCompatible: true
Review questions:
- Is rollout gradual?
- Are metrics versioned?
- Is rollback safe?
- Are data/schema/event side effects reversible?
- Are consumers ready?
- Are clients compatible?
Traffic rollback is not data rollback.
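Promotion criteria are easiest to enforce when they are evaluated by code rather than by eye. The sketch below turns the example thresholds into a single gate; `CanaryGate` and `Snapshot` are hypothetical names, and the regression values are assumed to be percentages measured against the stable version.

```java
/** Hypothetical canary promotion gate using the example thresholds. */
final class CanaryGate {
    /** Canary metrics compared with the stable baseline. */
    record Snapshot(double errorRateRegressionPercent,
                    double p99LatencyRegressionPercent,
                    long dlqCount,
                    int criticalAlerts) {}

    /** Promote only when every criterion holds; any single failure blocks promotion. */
    static boolean canPromote(Snapshot s) {
        return s.errorRateRegressionPercent() < 0.2
                && s.p99LatencyRegressionPercent() < 10.0
                && s.dlqCount() == 0
                && s.criticalAlerts() == 0;
    }
}
```

The gate covers traffic promotion only. Events already published and schema changes already applied by the canary still need their own rollback plan.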
27. Operational Readiness
operations:
  runbooks:
    - http-503
    - timeout
    - dlq-spike
    - consumer-lag
    - outbox-backlog
    - bad-canary
    - egress-provider-down
  onCall:
    primary: case-platform
    secondary: platform-sre
  dashboards:
    - case-api
    - case-events
    - case-projection
    - case-egress
  incidentSeverity:
    tier: 1
Review questions:
- Who is on-call?
- Are dashboards linked?
- Are runbooks tested?
- Are alerts actionable?
- Are escalation paths defined?
- Is maintenance window known?
- Is support aware of user-facing degradation?
Production readiness is operational readiness.
28. Governance Readiness
governance:
  adr: ADR-042
  openApi: linked
  asyncApi: linked
  policyAsCode: passing
  ownerLabels: complete
  driftDetection: enabled
  exceptions:
    active: 1
    expiringWithin30Days: 1
  reviewDate: 2026-10-01
Review questions:
- Is ADR accepted?
- Are policy checks passing?
- Are exceptions approved and expiring?
- Is catalog updated?
- Is drift detection enabled?
- Is next review scheduled?
Governance keeps readiness from decaying.
29. Readiness Risk Register
risks:
  - id: RISK-001
    description: Notification provider outage delays user communication.
    impact: medium
    likelihood: high
    mitigation:
      - async notification workflow
      - circuit breaker
      - pending status
      - DLQ alert
    owner: notification-team
    status: mitigated
  - id: RISK-002
    description: Search projection stale during consumer lag.
    impact: medium
    likelihood: medium
    mitigation:
      - freshness metric
      - stale marker
      - lag alert
    owner: search-platform
    status: accepted
Risks should not be hidden.
They should be owned.
30. Executive Summary Template
## Production Readiness Summary
Service: case-service
Criticality: Tier 1
Decision: Ready with conditions
Key strengths:
- API and event contracts are versioned and tested.
- Critical events use transactional outbox.
- Consumers are idempotent and DLQ-owned.
- Gateway and mesh policies are tested.
- Dashboards and runbooks exist.
Conditions:
- Projection freshness alert must be connected to on-call before 100% traffic.
- DLQ replay drill must be completed before GA.
Accepted risks:
- Search projection can be stale up to 60s during replay.
- External notification provider outage results in pending notification state.
Next review: 2026-10-01
Leadership needs a concise readiness summary.
Engineering needs a detailed checklist.
Provide both.
31. Red Flags That Should Block Launch
Block launch if:
- no owner,
- public route without auth,
- critical command without idempotency,
- critical sync dependency without timeout,
- event producer without schema/contract,
- critical event without outbox or accepted risk,
- consumer not idempotent,
- DLQ unowned,
- no observability for critical path,
- no rollback for high-risk change,
- secrets in payload/logs,
- unknown data classification,
- no runbook for tier-1 capability.
Some gaps are not conditions.
They are blockers.
32. Conditions That May Be Acceptable
May launch with conditions if:
- dashboard needs minor label fix,
- non-critical route lacks synthetic probe,
- low-risk exception has expiry,
- canary limited to internal users,
- capacity test passed at 80% of target peak load while expected launch traffic is 30% of that peak,
- DLQ replay tool exists but drill scheduled before full GA.
Conditions must be:
- specific,
- owned,
- dated,
- tracked,
- limited in blast radius.
33. Review Meeting Format
Agenda:
- service and communication inventory,
- contracts,
- consistency/idempotency,
- resilience/timeouts/retries,
- async correctness,
- gateway/mesh/platform,
- security/privacy,
- observability/SLO,
- capacity/testing,
- rollout/rollback,
- runbooks/ownership,
- risks/blockers,
- decision.
Keep the review evidence-based.
Avoid opinions without artifacts.
34. Reviewer Roles
Suggested reviewers:
| Role | Focus |
|---|---|
| service owner | business semantics |
| platform engineer | gateway/mesh/Kubernetes |
| SRE | observability/SLO/runbooks |
| security engineer | auth/mTLS/ACL/privacy |
| data engineer | schema/events/projections |
| QA/test engineer | test coverage |
| architect/principal | trade-offs/risk |
| product owner | user-facing semantics |
Not every review needs all roles.
Critical flows do.
35. Final PRR Checklist
Before marking ready:
- Communication inventory complete.
- API contracts versioned.
- Event contracts versioned.
- Consistency model documented.
- Idempotency defined.
- Timeouts nested.
- Retry owner defined.
- Outbox used where required.
- Consumers idempotent.
- DLQs owned.
- Replay safe or restricted.
- Gateway routes tested.
- Mesh/security policies tested.
- Egress governed.
- Observability dashboards ready.
- SLOs defined.
- Capacity tested.
- Failure drills completed or scheduled.
- Runbooks linked.
- Owners assigned.
- Rollout and rollback ready.
- Risks accepted or mitigated.
- ADR linked.
This checklist is the practical culmination of the series.
36. The Real Lesson
Production readiness is not a feeling.
It is evidence that the system can communicate correctly under real conditions.
For Java microservices communication, readiness requires:
contracts
+ consistency semantics
+ idempotency
+ timeout/retry ownership
+ async correctness
+ gateway/mesh policy
+ security/privacy
+ observability
+ capacity
+ tests
+ runbooks
+ ownership
A service can pass functional tests and still not be production-ready.
A top-tier engineer knows the difference.