Microservices Design Checklist

Learn Java Microservices Design and Architect - Part 099

A production-grade checklist for reviewing Java microservices across boundaries, data ownership, reliability, observability, security, deployment, governance, and evolution.

Part 099 — Microservices Design Checklist

A checklist is not architecture.

A checklist is a way to prevent predictable mistakes when your brain is busy thinking about the interesting part.

A senior engineer does not use a checklist because they cannot think. They use it because production systems fail in boring, repeatable ways:

  • the service boundary was actually a database table boundary;
  • the API looked clean but encoded another team's workflow assumption;
  • the retry policy amplified failure;
  • the health check returned green while the service was not ready;
  • the event payload leaked sensitive data;
  • the system had dashboards but no useful symptom-based alert;
  • the migration had no rollback criteria;
  • the service had an owner in a document but nobody owned it at 03:00.

This part compresses the whole series into a reviewable engineering checklist. Use it before building a new service, before splitting a monolith, before approving a boundary ADR, before onboarding a service into production, and after incidents.

The rule is simple:

A microservice is not ready because it compiles, starts, and exposes endpoints. It is ready when its boundary, data, failure behavior, telemetry, security, deployment, ownership, and evolution path are explicit.


1. The review model

A useful microservice review has three layers beyond code review: architecture, production readiness, and runtime fitness.

The mistake is to review only code.

Code review answers:

Is this implementation locally acceptable?

Architecture review answers:

Does this service reduce system complexity or merely move complexity across the network?

Production readiness review answers:

Can we operate this service safely when dependencies fail, traffic spikes, credentials rotate, data drifts, and humans are under pressure?

Runtime fitness review answers:

Are the assumptions still true after the system has been running for months?


2. Checklist severity levels

Not every failed checklist item blocks release. Use severity.

| Severity | Meaning | Example | Action |
| --- | --- | --- | --- |
| BLOCKER | Unsafe to release | No owner, no rollback, writes to another service database | Do not approve |
| HIGH | Likely production or governance risk | No idempotency for retryable command | Fix before general availability |
| MEDIUM | Risk accepted with explicit mitigation | Missing non-critical dashboard | Create follow-up with owner/date |
| LOW | Improvement item | Naming inconsistency in internal metric | Backlog |
| ACCEPTED | Risk consciously accepted | Temporary bridge during migration | Record expiry and owner |

A checklist without severity becomes bureaucracy.

A checklist with no owner becomes decoration.
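
The severity rules can be made executable as a release gate. The following is a minimal sketch, assuming a hypothetical `Finding` record with `owner` and `dueDate` fields; none of these names come from a real tool.

```java
import java.util.List;

// Severity levels from the table above.
enum Severity { BLOCKER, HIGH, MEDIUM, LOW, ACCEPTED }

// One failed checklist item. The owner and dueDate fields are assumptions of this sketch.
record Finding(String item, Severity severity, String owner, String dueDate) {}

final class ReleaseGate {
    private ReleaseGate() {}

    // A BLOCKER always stops the release; a HIGH finding stops general availability only.
    static boolean blocksRelease(List<Finding> findings, boolean generalAvailability) {
        for (Finding f : findings) {
            if (f.severity() == Severity.BLOCKER) return true;
            if (f.severity() == Severity.HIGH && generalAvailability) return true;
        }
        return false;
    }

    // MEDIUM and ACCEPTED findings are only valid with an owner and a date (follow-up or expiry).
    static boolean hasRequiredFollowUp(Finding f) {
        if (f.severity() != Severity.MEDIUM && f.severity() != Severity.ACCEPTED) {
            return true;
        }
        boolean hasOwner = f.owner() != null && !f.owner().isBlank();
        boolean hasDate = f.dueDate() != null && !f.dueDate().isBlank();
        return hasOwner && hasDate;
    }
}
```

A pipeline could call `blocksRelease` on the review output and fail the stage, while `hasRequiredFollowUp` catches accepted risks that were recorded without an owner or expiry.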


3. Service existence checklist

Before designing a microservice, challenge the premise.

| Question | Good signal | Bad signal |
| --- | --- | --- |
| What business capability does it own? | Clear capability and lifecycle | “It owns customer table operations” |
| Can it be deployed independently? | Contract-compatible releases | Must deploy with three other services |
| Does it have a stable owner? | One team owns roadmap + operations | “Shared by platform and product” |
| Does it own data authority? | Single writer / source of truth defined | Reads/writes same DB as others |
| Is the split driven by real force? | Different scaling, volatility, team, policy, lifecycle | “Microservices are our standard” |
| Would modular monolith be enough? | Explicit trade-off documented | Not considered |
| What complexity does it remove? | Reduces cognitive load, release coupling, or data coupling | Adds network hops without autonomy |

Decision rule

Create a microservice when at least one of these forces is strong:

  1. Ownership force: different team must evolve the capability independently.
  2. Volatility force: part of the domain changes at a different pace.
  3. Consistency force: invariant boundary is clearly local.
  4. Scaling force: workload profile is materially different.
  5. Compliance force: data/policy/audit boundary needs isolation.
  6. Runtime force: failure isolation or deployment independence matters.

Do not create a microservice merely because a noun exists in the domain.


4. Boundary checklist

Boundary design is the first real architecture decision.

| Check | Question | Evidence |
| --- | --- | --- |
| Capability ownership | What business capability is owned? | Capability map, service charter |
| Language boundary | What terms have local meaning? | Glossary, bounded context notes |
| Invariant boundary | Which rules must be transactionally true? | Aggregate/invariant list |
| Data authority | What records can only this service change? | Ownership matrix |
| Lifecycle ownership | What lifecycle does this service control? | State machine |
| Policy ownership | Which decisions are made here? | Decision table/policy map |
| External dependencies | What does it depend on to complete work? | Dependency graph |
| Consumer obligations | What must consumers know? | API/event contract |
| Rejected boundaries | What alternatives were rejected? | ADR |

Boundary smells

  • Service named after a table: case-service, party-service, document-service with CRUD-only behavior.
  • Service has no verbs of its own.
  • Service cannot answer “what decision do you own?”
  • Service requires synchronous calls to enforce its core invariant.
  • Two services update the same business fact.
  • Every feature requires changes in multiple services.
  • Boundary matches team org chart accidentally, not domain capability.

Boundary review card

service: enforcement-decision-service
capability: "Evaluate regulatory case evidence and issue defensible enforcement decision"
owner: enforcement-platform-team
dataAuthority:
  owns:
    - decision
    - decision_rationale
    - decision_condition
  references:
    - case_id
    - allegation_id
    - evidence_snapshot_id
transactionalInvariants:
  - "A decision cannot be issued without approved evidence snapshot"
  - "A decision version is immutable after publication"
externalDependencies:
  requiredForCommand:
    - evidence-service
    - case-service
  optionalForRead:
    - party-profile-service
contractSurface:
  api:
    - POST /decisions/draft
    - POST /decisions/{id}/submit-review
    - POST /decisions/{id}/publish
  events:
    - DecisionDrafted
    - DecisionPublished
adr: ADR-042

5. API checklist

API review is not about whether endpoints are REST-shaped. It is about whether the contract is safe to evolve and safe to operate.

| Area | Questions |
| --- | --- |
| Intent | Does the endpoint express business intent or leak internal CRUD operations? |
| Compatibility | Can fields be added without breaking consumers? |
| Error semantics | Are validation, conflict, authorization, dependency failure, and retryable failure distinct? |
| Idempotency | Are retryable commands protected by idempotency key or natural idempotency? |
| Concurrency | Does the API support expected version, ETag, or conflict detection where needed? |
| Pagination | Are result limits, cursors, sort order, and stability defined? |
| Filtering | Are filters bounded and indexed? |
| Partial failure | Can optional fragments fail without failing the whole response? |
| Security | Is object-level authorization checked per resource/action? |
| Privacy | Does response shape minimize sensitive fields? |
| Observability | Are route, status, latency, error class, and correlation IDs emitted? |
| Lifecycle | Is deprecation/version policy clear? |

API smell examples

POST /cases/updateStatus

This is ambiguous. What status? Who is allowed? What state transition? What if status is already set?

Better:

POST /cases/{caseId}/submit-for-supervisor-review
Idempotency-Key: 01J2M8...
If-Match: "case-version-17"

The better API encodes:

  • actor intent;
  • target resource;
  • retry behavior;
  • concurrency expectation;
  • domain transition.
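
On the server side, the intent-based command above could be handled roughly as follows. This is a sketch under stated assumptions: `CaseRecord`, `CaseStatus`, and `Outcome` are hypothetical names, the state is held in memory, and `expectedVersion` stands in for the value carried by `If-Match`.

```java
// Hypothetical in-memory model of the command above; all names are illustrative.
enum CaseStatus { DRAFT, IN_SUPERVISOR_REVIEW, CLOSED }

enum Outcome { APPLIED, ALREADY_APPLIED, VERSION_CONFLICT, INVALID_TRANSITION }

final class CaseRecord {
    private CaseStatus status = CaseStatus.DRAFT;
    private long version;

    CaseRecord(long version) { this.version = version; }

    CaseStatus status() { return status; }
    long version() { return version; }

    // expectedVersion plays the role of the If-Match header.
    Outcome submitForSupervisorReview(long expectedVersion) {
        if (status == CaseStatus.IN_SUPERVISOR_REVIEW) {
            return Outcome.ALREADY_APPLIED;    // retry of a command that already succeeded
        }
        if (expectedVersion != version) {
            return Outcome.VERSION_CONFLICT;   // caller acted on stale state
        }
        if (status != CaseStatus.DRAFT) {
            return Outcome.INVALID_TRANSITION; // e.g. a CLOSED case cannot be submitted
        }
        status = CaseStatus.IN_SUPERVISOR_REVIEW;
        version++;
        return Outcome.APPLIED;
    }
}
```

Treating "already in the target state" as success is a design choice that makes retries harmless; a stricter service could compare the idempotency key before answering.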

Error shape checklist

Every public/internal API should distinguish:

| Error kind | Example | Retry? | HTTP/RPC mapping |
| --- | --- | --- | --- |
| Validation | Missing required field | No | 400 / INVALID_ARGUMENT |
| Authentication | Missing/invalid credential | No | 401 / UNAUTHENTICATED |
| Authorization | Actor cannot perform action | No | 403 / PERMISSION_DENIED |
| Not found | Resource absent or hidden | No/Maybe | 404 / NOT_FOUND |
| Conflict | Version mismatch / invalid transition | No until state changes | 409 / ABORTED |
| Rate limited | Too many requests | Yes with delay | 429 / RESOURCE_EXHAUSTED |
| Dependency unavailable | Required dependency down | Yes with budget | 503 / UNAVAILABLE |
| Unknown outcome | Timeout after a side effect may have occurred | Retry only if idempotent | 202/409/503 depending on design |
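
One way to keep these distinctions out of scattered `if` statements is a single enum that carries the mapping. The sketch below copies the HTTP codes from the table; the `ErrorKind` name and the conservative `shouldRetry` rule are assumptions, and the "unknown outcome" row is left out because its mapping depends on design.

```java
// Mirrors the table above. The retry flag is a simplification of the "Retry?" column.
enum ErrorKind {
    VALIDATION(400, false),
    AUTHENTICATION(401, false),
    AUTHORIZATION(403, false),
    NOT_FOUND(404, false),
    CONFLICT(409, false),
    RATE_LIMITED(429, true),
    DEPENDENCY_UNAVAILABLE(503, true);

    private final int httpStatus;
    private final boolean retryable;

    ErrorKind(int httpStatus, boolean retryable) {
        this.httpStatus = httpStatus;
        this.retryable = retryable;
    }

    int httpStatus() { return httpStatus; }

    // Conservative rule: retry only when the error class allows it,
    // the command is idempotent, and retry budget remains.
    boolean shouldRetry(boolean idempotent, int attemptsLeft) {
        return retryable && idempotent && attemptsLeft > 0;
    }
}
```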

6. Event contract checklist

Events are not just serialized objects. They are historical facts other services may depend on.

| Check | Question |
| --- | --- |
| Event meaning | Does the event name describe something that already happened? |
| Source authority | Is the publisher authoritative for the fact? |
| Event identity | Is eventId globally unique? |
| Aggregate identity | Is the affected business object identified? |
| Ordering | Is aggregate version/sequence present? |
| Causality | Are correlation/causation IDs present? |
| Schema evolution | Are additive changes safe? |
| Privacy | Are sensitive fields minimized or tokenized? |
| Replay | Can consumers handle replay safely? |
| Idempotency | Can consumers deduplicate by event ID/version? |
| Time semantics | Are occurredAt, publishedAt, and processing time distinct? |
| DLQ policy | Is poison-message handling defined? |

Event envelope baseline

{
  "eventId": "01J2MA3Y3BQ9S8V7T3EQK4P9NQ",
  "eventType": "DecisionPublished",
  "eventVersion": 1,
  "source": "enforcement-decision-service",
  "aggregateType": "Decision",
  "aggregateId": "dec_1039",
  "aggregateVersion": 8,
  "occurredAt": "2026-07-05T02:14:11Z",
  "publishedAt": "2026-07-05T02:14:12Z",
  "correlationId": "corr_44f",
  "causationId": "cmd_91c",
  "tenantId": "tenant_sg_regulator",
  "payload": {
    "caseId": "case_8831",
    "decisionId": "dec_1039",
    "decisionType": "ENFORCEMENT_ACTION_REQUIRED",
    "effectiveFrom": "2026-07-05"
  }
}
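
In Java, the envelope can be modeled as a record that rejects incomplete events at construction time. This is a sketch, assuming the field names from the JSON above; the validation rules themselves are illustrative, not a published schema.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Objects;

// Field names follow the JSON baseline above; the checks are assumptions of this sketch.
record EventEnvelope(
        String eventId, String eventType, int eventVersion, String source,
        String aggregateType, String aggregateId, long aggregateVersion,
        Instant occurredAt, Instant publishedAt,
        String correlationId, String causationId, String tenantId,
        Map<String, Object> payload) {

    EventEnvelope {
        Objects.requireNonNull(eventId, "eventId");
        Objects.requireNonNull(eventType, "eventType");
        Objects.requireNonNull(aggregateId, "aggregateId");
        Objects.requireNonNull(occurredAt, "occurredAt");
        Objects.requireNonNull(publishedAt, "publishedAt");
        Objects.requireNonNull(tenantId, "tenantId");
        if (eventVersion < 1) {
            throw new IllegalArgumentException("eventVersion must be >= 1");
        }
        if (publishedAt.isBefore(occurredAt)) {
            throw new IllegalArgumentException("publishedAt must not precede occurredAt");
        }
        payload = Map.copyOf(payload); // defensive, immutable copy
    }
}
```

Keeping `occurredAt` and `publishedAt` as separate `Instant` fields makes the time semantics check from the table enforceable in code.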

Event anti-patterns

  • CaseUpdated with huge mutable payload.
  • Event payload mirrors internal database row.
  • Event contains full PII because “consumer might need it”.
  • Event order matters globally but only partition order exists.
  • Consumer uses event as command without explicit ownership.
  • No event version.
  • No replay test.
  • No DLQ triage process.

7. Data ownership checklist

Data ownership is the backbone of microservices.

| Question | Expected answer |
| --- | --- |
| Who can create this fact? | One authoritative service |
| Who can update this fact? | One authoritative service or explicit workflow/policy owner |
| Who can read this fact? | Through API/event/read model, not direct database access |
| Who can delete/redact this fact? | Owner plus privacy workflow |
| Who can reconstruct history? | Owner/audit service with immutable evidence |
| Who owns derived copies? | Read-model owner with staleness contract |
| Who detects drift? | Projection/reporting owner with reconciliation loop |

Ownership matrix

| Data | Authority | Readers | Propagation | Staleness | Notes |
| --- | --- | --- | --- | --- | --- |
| Case lifecycle state | Case service | Workflow, Reporting | Event | Seconds | State transitions are audited |
| Evidence metadata | Evidence service | Decision, Reporting | Snapshot/API | Minutes | Blob access controlled separately |
| Decision rationale | Decision service | Case, Audit, Reporting | Event/API | Immediate for audit | Immutable after publication |
| Party profile | Party service | Case, Notification | Snapshot/API | Hours | PII-minimized copy only |
| SLA timer | Workflow service | Case, Ops | Event | Seconds | Operational state, not domain truth |

Hard blockers

Do not approve a service when:

  • it writes to another service's database;
  • it reads another service's private tables on the online request path;
  • it has no data owner for key business facts;
  • a reporting requirement forces cross-service SQL joins;
  • ownership is split by operation, such as “service A creates, service B updates, service C deletes”, without a workflow authority;
  • data privacy obligations cannot be assigned to a clear owner.

8. Transaction and consistency checklist

Distributed consistency must be designed at the business level.

| Check | Question |
| --- | --- |
| Local transaction | What changes happen atomically inside one service? |
| Business transaction | What process spans services/time/humans? |
| Consistency window | How stale can each read be? |
| User experience | What does user see during pending state? |
| Retry safety | Can commands/events be retried safely? |
| Compensation | What business correction is valid if later step fails? |
| Reconciliation | How is drift detected and repaired? |
| Unknown outcome | What happens if caller times out after side effect? |
| Auditability | Can we reconstruct the final state and path? |

State machine check

Every long-running process needs explicit states.

Ask:

  • Which service owns the state?
  • Which transitions are synchronous commands?
  • Which transitions are event-driven?
  • Which transitions need human approval?
  • Which transitions have timers?
  • Which transitions are irreversible?
  • Which transitions create audit evidence?
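
Explicit states are easiest to review when the allowed transitions live in one place. The sketch below assumes a hypothetical decision lifecycle; the state names and the transition table are illustrative, not taken from a real system.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical decision lifecycle with an explicit transition table.
enum DecisionState {
    DRAFT, IN_REVIEW, PUBLISHED, WITHDRAWN;

    private static final Map<DecisionState, Set<DecisionState>> ALLOWED = Map.of(
            DRAFT, EnumSet.of(IN_REVIEW, WITHDRAWN),
            IN_REVIEW, EnumSet.of(DRAFT, PUBLISHED, WITHDRAWN),
            PUBLISHED, EnumSet.noneOf(DecisionState.class),   // irreversible
            WITHDRAWN, EnumSet.noneOf(DecisionState.class));  // irreversible

    // True when the lifecycle permits moving from this state to the next one.
    boolean canMoveTo(DecisionState next) {
        return ALLOWED.get(this).contains(next);
    }

    // Terminal states have no outgoing transitions.
    boolean isTerminal() {
        return ALLOWED.get(this).isEmpty();
    }
}
```

A reviewer can read the table and answer the "irreversible" question directly; timers, human approvals, and audit evidence would hang off individual transitions in a fuller model.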

9. Idempotency checklist

Retries are normal. Duplicates are normal. Network ambiguity is normal.

| Operation | Required idempotency strategy |
| --- | --- |
| Create with client-generated ID | Natural idempotency by business key |
| Create with server-generated ID | Idempotency key + response replay |
| State transition | Expected version + transition guard |
| Event consumer | Inbox/dedupe table by event ID |
| External payment/notification | Provider idempotency key + local operation log |
| Workflow activity | Activity ID + command dedupe |
| Projection update | Ignore old aggregate version |
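
The last two consumer-side rows combine naturally: deduplicate by event ID, then ignore anything older than the version already applied. A minimal sketch, assuming in-memory maps as stand-ins for an inbox table and a read model (`DecisionProjection` is a hypothetical name):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory stand-in for an inbox table plus a read model. A real consumer would
// persist both in the same local transaction.
final class DecisionProjection {
    private final Set<String> processedEventIds = new HashSet<>();
    private final Map<String, Long> lastVersionByAggregate = new HashMap<>();
    private final Map<String, String> statusByAggregate = new HashMap<>();

    // Returns true only when the event actually changed the projection.
    boolean apply(String eventId, String aggregateId, long aggregateVersion, String status) {
        if (!processedEventIds.add(eventId)) {
            return false; // duplicate delivery of the same event
        }
        long last = lastVersionByAggregate.getOrDefault(aggregateId, 0L);
        if (aggregateVersion <= last) {
            return false; // stale or out-of-order event
        }
        lastVersionByAggregate.put(aggregateId, aggregateVersion);
        statusByAggregate.put(aggregateId, status);
        return true;
    }

    String statusOf(String aggregateId) {
        return statusByAggregate.get(aggregateId);
    }
}
```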

Idempotency record

create table idempotency_record (
  tenant_id varchar(80) not null,
  idempotency_key varchar(120) not null,
  request_hash varchar(128) not null,
  status varchar(30) not null,
  response_code int,
  response_body jsonb,
  created_at timestamptz not null,
  expires_at timestamptz not null,
  primary key (tenant_id, idempotency_key)
);

Review questions

  • What happens if the client retries after timeout?
  • What happens if two identical requests arrive concurrently?
  • What happens if same idempotency key is reused with different payload?
  • What happens if service crashes after DB commit but before response?
  • What happens if event is delivered twice?
  • What happens if message broker rebalances consumers during processing?
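
Several of these questions can be answered with a small store in front of the command handler. The sketch below is an in-memory approximation of the `idempotency_record` table, assuming a hypothetical `IdempotencyStore` class; expiry, persistence, and crash recovery are deliberately left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// In-memory approximation of the idempotency_record table; no expiry or persistence.
final class IdempotencyStore {
    record Response(int code, String body) {}

    private record Entry(String requestHash, Response response) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    // Runs the handler at most once per (tenant, key) and replays the stored response.
    Response execute(String tenantId, String idempotencyKey, String requestHash,
                     Supplier<Response> handler) {
        String compositeKey = tenantId + "|" + idempotencyKey;
        // computeIfAbsent is atomic per key, so two concurrent identical
        // requests cannot both run the handler.
        Entry entry = entries.computeIfAbsent(
                compositeKey, k -> new Entry(requestHash, handler.get()));
        if (!entry.requestHash().equals(requestHash)) {
            throw new IllegalStateException("idempotency key reused with a different payload");
        }
        return entry.response();
    }
}
```

A retry after timeout gets the stored response back, and a reused key with a different payload is rejected rather than silently replayed. The crash-after-commit case still needs the record to be written in the same database transaction as the business change.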

10. Reliability checklist

Reliability is designed before incidents.

| Area | Questions |
| --- | --- |
| Timeout | Does every remote call have a timeout smaller than caller budget? |
| Deadline | Is end-to-end deadline propagated? |
| Retry | Is retry limited by idempotency and budget? |
| Backoff | Is exponential backoff with jitter used for transient failures? |
| Circuit breaker | Does it protect overloaded/dead dependency? |
| Bulkhead | Are critical paths isolated from noisy paths? |
| Rate limit | Are per-tenant/per-client/system limits defined? |
| Load shedding | Can the service reject early under overload? |
| Backpressure | Are queues bounded and consumer lag monitored? |
| Fallback | Is fallback semantically safe? |
| Partial availability | Can non-critical features degrade? |
| Recovery | Is restart/reconnect/replay safe? |

Failure propagation review

For each edge, define:

edge: decision-service -> evidence-service
criticality: required_for_publish
callType: synchronous
p95BudgetMs: 350
hardTimeoutMs: 900
retry:
  enabled: true
  maxAttempts: 2
  condition: transient read failure only
fallback: fail closed; decision cannot be published without evidence snapshot
circuitBreaker: enabled
bulkhead: evidence-client-pool
observability:
  metric: dependency_call_duration_seconds
  span: EvidenceClient.fetchApprovedSnapshot
  logEvent: dependency_call_failed

Reliability blockers

  • No timeouts.
  • Infinite retries.
  • Retry configured at client, mesh, gateway, and SDK without total budget.
  • Health check restarts overloaded service repeatedly.
  • Queue is unbounded.
  • Thread pool is shared across critical and non-critical paths.
  • Fallback returns stale/unsafe decision data.
  • DLQ exists but nobody owns it.

11. Observability checklist

Observability is not “we have Prometheus and logs”.

| Signal | Required design |
| --- | --- |
| Logs | Structured, event-named, correlated, redacted |
| Metrics | RED/USE/business/SLO metrics with bounded cardinality |
| Traces | Cross-service trace context and useful span naming |
| Audit | Immutable business evidence, not debug logs |
| Health | Liveness/readiness/startup semantics separated |
| Alerts | Symptom-based, SLO-based, runbook-linked |
| Dashboards | User journey, dependency, saturation, queue, JVM |
| Runbooks | Diagnosis tree + safe mitigation commands |

Minimum service telemetry

logs:
  requiredFields:
    - timestamp
    - level
    - service
    - environment
    - tenantId
    - correlationId
    - traceId
    - actorType
    - eventName
    - outcome
metrics:
  http:
    - request_count
    - request_duration
    - error_count_by_error_class
  dependency:
    - dependency_duration
    - dependency_error_count
    - dependency_timeout_count
  runtime:
    - jvm_memory
    - gc_pause
    - thread_pool_active
    - db_pool_active
  business:
    - cases_submitted_total
    - decisions_published_total
    - evidence_review_sla_breaches_total
traces:
  propagation: W3C trace context
  sampling: tail-based for errors/high latency where possible
audit:
  separateFromDebugLogs: true
  immutable: true
  actorAttribution: required

Observability review questions

  • Can we answer “which users/tenants are impacted?”
  • Can we answer “which dependency started failing first?”
  • Can we answer “which deployment introduced the issue?”
  • Can we answer “which request/event caused this state transition?”
  • Can we answer “why did this decision happen?”
  • Can we answer “did we leak sensitive data into logs/traces?”
  • Can we debug projection lag without reading production tables manually?

12. Security checklist

Security in microservices is distributed policy enforcement.

| Area | Review questions |
| --- | --- |
| Workload identity | Does each service have stable runtime identity? |
| Service-to-service auth | Are service calls authenticated and authorized? |
| mTLS | Is transport identity/encryption enforced where required? |
| API authorization | Is object-level and action-level authorization enforced? |
| Tenant isolation | Is tenant context verified at every boundary? |
| Secret management | Are secrets externalized, rotated, and redacted? |
| Admin endpoints | Are actuator/admin/debug endpoints protected? |
| Input validation | Are DTOs validated at boundary? |
| Output minimization | Are responses least-data? |
| Dependency security | Are SBOM, vulnerability scanning, and patch policy in place? |
| Audit | Are security-relevant decisions logged safely? |

API security blockers

  • Authorization only checked at gateway, not at service boundary.
  • Actor can change object ID to access another user's resource.
  • Tenant ID accepted from request body without trusted context.
  • Internal API assumes network location equals trust.
  • Admin endpoints exposed to normal traffic path.
  • Secrets are present in environment dumps/logs/traces.
  • Error response leaks internal class/table/system names.
  • Event payload contains unnecessary sensitive data.
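
Inside the service, the first three blockers come down to one check that combines tenant, action, and object. The sketch below uses hypothetical `Actor` and `CaseResource` types and an assumed "assigned officer" ownership rule; in a real service the actor would come from a verified token, never from the request body.

```java
import java.util.Set;

// Hypothetical types for illustration. Actor must be built from trusted context.
record Actor(String actorId, String tenantId, Set<String> permissions) {}

record CaseResource(String caseId, String tenantId, String assignedOfficerId) {}

final class CaseAccessPolicy {
    private CaseAccessPolicy() {}

    static boolean canPerform(Actor actor, CaseResource resource, String action) {
        if (!actor.tenantId().equals(resource.tenantId())) {
            return false; // tenant isolation comes first
        }
        if (!actor.permissions().contains(action)) {
            return false; // action-level check
        }
        // Object-level check: the ownership rule here is an assumption of this sketch.
        return actor.actorId().equals(resource.assignedOfficerId());
    }
}
```

Changing the object ID in the URL does not help an attacker here, because the decision is made against the loaded resource rather than the request parameters.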

13. Privacy checklist

Privacy is not a frontend concern. It is a data-flow architecture concern.

| Check | Question |
| --- | --- |
| Classification | Are fields classified by sensitivity? |
| Purpose | Why does this service need the field? |
| Minimization | Can it receive token/reference/snapshot instead of raw value? |
| Retention | How long is the data kept? |
| Redaction | Are logs/traces/DLQ/search/read models redacted? |
| Deletion | How is deletion/anonymization propagated? |
| Access | Who can view sensitive fields? |
| Export | Can data subject/reporting exports be reconstructed? |
| Audit | Are accesses to sensitive data auditable? |

[Figure: sensitive data flow diagram]

Privacy blockers

  • Service receives full party profile but uses only display name.
  • PII copied into event payload “for convenience”.
  • DLQ stores raw payload indefinitely.
  • Trace attributes include email, phone, identity number, address, or free-text narrative.
  • Search index contains sensitive fields without access control.
  • Data deletion request cannot be traced through projections.
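
Redaction works best as a single function applied before attributes reach logs, traces, or a DLQ. A minimal sketch, assuming a hypothetical `AttributeRedactor` and a hard-coded set of sensitive attribute names; a real classification would come from the field catalog, not from code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class AttributeRedactor {
    private AttributeRedactor() {}

    // Hypothetical classification: attribute names treated as sensitive in this sketch.
    static final Set<String> SENSITIVE =
            Set.of("email", "phone", "identityNumber", "address", "narrative");

    // Returns a copy with sensitive values masked; the input map is not modified.
    static Map<String, String> redact(Map<String, String> attributes) {
        Map<String, String> safe = new LinkedHashMap<>();
        attributes.forEach((key, value) ->
                safe.put(key, SENSITIVE.contains(key) ? "[REDACTED]" : value));
        return safe;
    }
}
```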

14. Deployment checklist

A microservice must be safe to deploy independently.

| Area | Questions |
| --- | --- |
| Artifact | Is image immutable and promoted by digest? |
| Config | Are required config values validated at startup? |
| Migration | Are DB changes backward-compatible? |
| Readiness | Does service only receive traffic when ready? |
| Shutdown | Does service drain requests/consumers safely? |
| Rollout | Is strategy defined: rolling/canary/blue-green/shadow? |
| Rollback | Can previous version run with current schema/contracts? |
| Feature flag | Are flags owned, observable, and expiring? |
| Compatibility | Are provider/consumer contracts verified? |
| Evidence | Does deployment produce release evidence? |

Expand-contract checklist

Do not approve database changes that require lockstep deployment across services unless the risk is explicitly accepted and the release is controlled.


15. Runtime topology checklist

Logical architecture lies unless mapped to runtime.

| Question | Why it matters |
| --- | --- |
| Which namespace does it run in? | Isolation/governance |
| Which node pool? | Resource isolation/cost/noisy neighbor |
| Which region/zone? | Availability/data residency |
| Which gateway/ingress path? | Edge policy/security/routing |
| Which service mesh policy? | mTLS/retry/timeout/observability |
| Which DB/queue/cache? | Dependency blast radius |
| Which HPA signal? | Scaling correctness |
| Which pod disruption budget? | Maintenance availability |
| Which priority class? | Overload/emergency behavior |
| Which network policy? | Zero-trust enforcement |

Topology card

service: case-service
namespace: regulatory-core
regions:
  primary: ap-southeast-1
  standby: ap-southeast-2
workload:
  type: Deployment
  minReplicas: 4
  maxReplicas: 30
  hpaSignals:
    - http_server_active_requests
    - cpu_utilization
runtime:
  java: 21
  memoryLimit: 1024Mi
  heapMax: 512Mi
  gracefulShutdownSeconds: 45
network:
  ingress: internal-gateway
  mesh: enabled
  mtls: strict
dependencies:
  postgres: case-db
  broker: enforcement-events
  cache: case-cache
availability:
  slo: 99.9
  pdb: minAvailable 3

16. Capacity checklist

Capacity is not just replica count.

| Area | Questions |
| --- | --- |
| Workload | Is it CPU-bound, IO-bound, memory-bound, queue-bound, or DB-bound? |
| Throughput | What is safe requests/events per second per replica? |
| Concurrency | What concurrent in-flight operations can one replica handle? |
| Latency | What p95/p99 target must be protected? |
| DB pool | Does total pool size exceed DB capacity? |
| Threading | Are platform/virtual/reactive models chosen intentionally? |
| Queue | Are queue depth and oldest-age monitored? |
| Scaling signal | Does HPA scale on the bottleneck? |
| Load test | Was test representative of production traffic mix? |
| Cost | What is unit cost per successful business transaction? |

Capacity equation intuition

Use Little's Law as a sanity check:

concurrency ≈ throughput × latency

If a service handles 200 requests/second and average latency is 250 ms:

concurrency ≈ 200 × 0.25 = 50 in-flight requests

If average latency rises to 2 seconds during dependency slowness (Little's Law relates averages, not percentiles):

concurrency ≈ 200 × 2 = 400 in-flight requests

That increase must be absorbed by threads, connection pools, memory, queues, and downstream capacity. If not, a latency problem becomes an overload problem.
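
The same arithmetic fits in a few lines of Java, which makes it easy to drop into a capacity test. This is a sketch: `CapacityCheck` is a hypothetical helper, and the replica count in the usage note is an assumed value.

```java
final class CapacityCheck {
    private CapacityCheck() {}

    // Little's Law: average in-flight requests = arrival rate x average latency.
    static double inFlight(double requestsPerSecond, double avgLatencySeconds) {
        return requestsPerSecond * avgLatencySeconds;
    }

    // In-flight work each replica must absorb, rounded up.
    static int perReplica(double totalInFlight, int replicas) {
        return (int) Math.ceil(totalInFlight / replicas);
    }
}
```

For example, `inFlight(200, 0.25)` gives 50 and `inFlight(200, 2.0)` gives 400. Spread over an assumed 4 replicas, the slow case means 100 in-flight requests per replica, which is the number to compare against thread pool and connection pool sizes.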


17. Governance checklist

Governance should be executable where possible.

| Governance item | Manual evidence | Automated evidence |
| --- | --- | --- |
| Service owner | Service charter | Catalog owner field |
| Boundary | ADR | ArchUnit/Spring Modulith checks |
| API compatibility | Design review | Contract tests/schema diff |
| Security | Security review | SAST/SCA/container scan/policy check |
| Observability | Dashboard review | Required metric/log/span check |
| Reliability | PRR | Load/chaos/readiness tests |
| Deployment | Release checklist | Pipeline gates/progressive rollout |
| Cost | Cost review | Unit-cost dashboard |
| Lifecycle | Owner review | Catalog stale-service report |

Service catalog minimum

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: enforcement-decision-service
  description: Owns enforcement decision lifecycle and rationale
  tags:
    - java
    - microservice
    - regulatory
spec:
  type: service
  lifecycle: production
  owner: group:enforcement-platform-team
  system: regulatory-case-management
  providesApis:
    - enforcement-decision-api
  consumesApis:
    - evidence-api
    - case-api

18. Cost checklist

Microservices duplicate cost per service: compute, memory, network, observability, deployment pipelines, ownership, on-call, and cognitive load.

| Cost dimension | Review question |
| --- | --- |
| Compute | Does service require its own runtime envelope? |
| Memory | Is Java memory overhead justified by capability independence? |
| Network | Does it add significant fan-out/egress? |
| Storage | Are duplicated read models necessary? |
| Observability | Are high-cardinality logs/metrics controlled? |
| Platform | Does it require custom infrastructure? |
| Team | Is there owner capacity to run it? |
| Cognitive | Does it reduce or increase developer cognitive load? |
| Lifecycle | Will it be retired if it fails to justify cost? |

A service that is cheap to write but expensive to operate is not cheap.


19. Migration checklist

Migration is not complete when traffic is routed to the new service. It is complete when old paths are removed and ownership is clean.

| Check | Question |
| --- | --- |
| Seam | What seam enables safe extraction? |
| Routing | How is traffic split/cohorted? |
| Shadow | Can new behavior be compared before serving users? |
| Reconciliation | How are mismatches detected? |
| Cutover | What are the go/no-go thresholds? |
| Rollback | What state changes prevent rollback? |
| Data ownership | Has write authority moved? |
| Legacy consumers | Are hidden direct DB/API consumers detected? |
| Bridge expiry | When will migration bridge be removed? |
| Evidence | Is migration decision recorded? |

Cutover readiness card

migration: case-lifecycle-extraction
candidateService: case-service
legacySystem: legacy-case-monolith
shadowComparison:
  sampleRate: 25%
  mismatchThreshold: 0.1%
  criticalMismatchThreshold: 0
reconciliation:
  daily: true
  owner: migration-squad
cutoverGates:
  - no critical mismatch for 14 days
  - p95 latency under 300ms
  - rollback tested in staging
  - all known consumers routed via facade
rollback:
  possibleUntil: write-authority-cutover
cleanup:
  removeLegacyWritePathBy: 2026-09-30
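
The measurable gates on this card can be evaluated automatically from shadow-comparison statistics. The sketch below assumes hypothetical `ShadowStats` and `CutoverGate` types and covers only the numeric thresholds; the 14-day window, the rollback test, and consumer routing still need their own evidence.

```java
// Aggregated shadow-comparison results; field names are assumptions of this sketch.
record ShadowStats(long compared, long mismatches, long criticalMismatches, double p95LatencyMs) {}

final class CutoverGate {
    private CutoverGate() {}

    // Thresholds from the card above: at most 0.1% mismatches,
    // zero critical mismatches, p95 latency under 300 ms.
    static boolean ready(ShadowStats stats) {
        if (stats.compared() == 0) {
            return false; // no evidence means no cutover
        }
        double mismatchRate = (double) stats.mismatches() / stats.compared();
        return stats.criticalMismatches() == 0
                && mismatchRate <= 0.001
                && stats.p95LatencyMs() < 300.0;
    }
}
```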

20. Architecture risk register

Every significant service should have a risk register.

| Risk ID | Risk | Likelihood | Impact | Control | Residual risk | Owner |
| --- | --- | --- | --- | --- | --- | --- |
| R-001 | Decision event contains sensitive rationale text | Medium | High | Event payload minimization + audit API | Low | Decision team |
| R-002 | Projection lag causes stale supervisor dashboard | High | Medium | Watermark + stale banner + lag alert | Medium | Reporting team |
| R-003 | Evidence dependency outage blocks decision publish | Medium | High | Timeout + circuit breaker + fail-closed state | Medium | Decision team |
| R-004 | Retry storm during evidence-service degradation | Medium | High | Retry budget + jitter + bulkhead | Low | Platform team |
| R-005 | Workflow version change breaks in-flight cases | Medium | High | Workflow versioning + migration test | Low | Workflow team |
| R-006 | Temporary legacy bridge becomes permanent | High | Medium | Expiry + catalog lifecycle check | Medium | Migration owner |

Risk review rules

  • A risk without an owner is an unresolved decision.
  • A mitigation without telemetry is wishful thinking.
  • A high-impact risk without runbook is operational debt.
  • A temporary exception without expiry is permanent architecture.

21. Complete service review template

Use this when approving a new or extracted service.

# Service Review: <service-name>

## 1. Intent
- Capability owned:
- Business outcomes:
- Why service, not module:
- Rejected alternatives:

## 2. Boundary
- Bounded context:
- Data authority:
- Invariants:
- State machine:
- Context map:

## 3. Contracts
- APIs:
- Events:
- Workflow activities:
- Compatibility policy:
- Idempotency strategy:

## 4. Data
- Database/store:
- Ownership matrix:
- Read models:
- Consistency windows:
- Reconciliation plan:

## 5. Reliability
- Dependencies:
- Timeout/deadline policy:
- Retry budget:
- Circuit breaker/bulkhead:
- Load shedding/backpressure:
- Failure modes:

## 6. Observability
- Logs:
- Metrics:
- Traces:
- SLOs:
- Alerts:
- Runbooks:

## 7. Security and Privacy
- Workload identity:
- Authorization:
- Tenant isolation:
- Secrets:
- Sensitive data flow:
- Audit events:

## 8. Deployment and Runtime
- Runtime topology:
- Scaling profile:
- Resource envelope:
- Deployment strategy:
- Rollback strategy:
- DR posture:

## 9. Governance
- Owner:
- Service catalog entry:
- ADRs:
- Fitness functions:
- Lifecycle state:
- Risk register:

## 10. Decision
- Approved / Approved with conditions / Rejected
- Conditions:
- Review date:

22. The fastest checklist for senior review

When time is short, ask these 20 questions.

  1. What business capability does this service own?
  2. What business facts can only this service change?
  3. What invariant is protected locally?
  4. What state machine does it own?
  5. What are its synchronous dependencies on the write path?
  6. What happens if each dependency times out?
  7. Are commands idempotent?
  8. Are events replay-safe?
  9. Can consumers evolve independently?
  10. Can the service be deployed without lockstep release?
  11. Does it have readiness/liveness/startup semantics?
  12. Are logs/metrics/traces correlated and redacted?
  13. Is there a symptom-based alert with a runbook?
  14. Is object-level authorization enforced inside the service?
  15. Are tenant and privacy boundaries explicit?
  16. Is the DB private to the service?
  17. Is rollback or roll-forward realistic?
  18. Is the runtime topology known?
  19. Does one team own it in production?
  20. What would make us merge it back or retire it?

If you cannot answer these, the design is not mature yet.


23. Practical exercise

Take one service in your system and fill this scorecard.

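| Dimension | Score 1-5 | Evidence | Action |
| --- | --- | --- | --- |
| Boundary clarity | | | |
| Data ownership | | | |
| API compatibility | | | |
| Idempotency | | | |
| Reliability controls | | | |
| Observability | | | |
| Security | | | |
| Privacy | | | |
| Deployment safety | | | |
| Ownership | | | |
| Cost awareness | | | |
| Lifecycle governance | | | |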

Scoring rule:

  • 1: implicit, undocumented, untested.
  • 2: partially documented, manually verified.
  • 3: documented and used in reviews.
  • 4: automated guardrail exists.
  • 5: runtime telemetry validates the assumption continuously.

Your target is not a 5 in every dimension.

Your target is to know where you are consciously taking risk.


24. Key takeaways

  • A microservice review must cover boundary, data, failure, observability, security, runtime, ownership, and evolution.
  • A good checklist prevents predictable failures without replacing engineering judgment.
  • The most dangerous microservice risk is often not code quality; it is unclear ownership, hidden data coupling, unsafe retries, missing observability, or compatibility-breaking release coordination.
  • Every checklist item should produce evidence: ADR, service catalog entry, contract test, metric, alert, runbook, policy, or runtime signal.
  • Senior-level architecture is not “using advanced patterns”. It is knowing which risks must be explicit before production.
