Case Study - Architecture Review and Risk Register

Learn Java Microservices Design and Architect - Part 098

A production-grade case study on architecture review and the risk register for regulatory case-management microservices, covering the review pack, risk taxonomy, severity scoring, residual risk, remediation roadmap, production readiness gates, decision log, and fitness functions.

[2026-07-05] 18 min read, 3441 words

A good architecture review does not ask "is the diagram good?" It asks: what risk remains, who owns it, what evidence shows it is mitigated, and how will we know when that risk has changed?

This part closes the regulatory case-management case study before the final synthesis. We will build a review pack and a risk register for the system designed in Parts 091–097.

System context:

high auditability,
long-running workflows,
regulatory decisions,
sensitive data,
multi-service consistency,
operational SLOs,
migration/evolution requirements,
runtime topology on Kubernetes.

The goal is not a polished formal document, but a thinking tool and a governance tool that can be used to make production decisions.

1. What Architecture Review Is For

An architecture review has three goals:

Discover risk
- boundary risk,
- data risk,
- reliability risk,
- security/privacy risk,
- operability risk,
- cost risk,
- migration risk.
Create alignment
- what was chosen deliberately,
- which trade-offs were accepted,
- which alternatives were rejected,
- who owns each decision.
Create evidence
- ADR,
- service catalog,
- topology card,
- SLO,
- runbook,
- fitness function,
- risk register,
- remediation plan.

An architecture review is not permission granted by a "central architect". It is a mechanism for making risk visible before that risk surfaces as an incident, an audit finding, a cost explosion, or delivery paralysis.

2. Review Scope

For the case-management system, the review is divided into the following areas:

Review area	Core question
Domain boundary	Do service boundaries follow capabilities and invariants?
Collaboration	Can API, event, and workflow contracts evolve safely?
Data ownership	Who is the source of truth for each important piece of data?
Consistency	What are the consistency window and the recovery path?
Workflow	Are long-running processes durable, observable, and versioned?
Reliability	What are the failure modes and the degradation policy?
Security	Are trust boundaries, identity, and authorization clear?
Privacy	Are PII and sensitive data flows controlled?
Audit	Can decisions be reconstructed?
Runtime topology	Are deployment, scaling, and failure isolation clear?
Operability	Are alerts, runbooks, and SLOs ready?
Delivery	Is independent deployability realistic?
Cost	Are service sprawl and telemetry cost under control?

3. Architecture Review Pack

A review must be grounded in artifacts. Without artifacts, the discussion turns into opinion.

3.1 Required artifacts

Each artifact answers a different risk question.

Artifact	Proves
Capability map	services are not arbitrary CRUD splits
Boundary ADR	trade-off was intentional
Service catalog	ownership and metadata are discoverable
Context map	upstream/downstream relationships are known
Contracts	consumers and providers can evolve safely
Data ownership matrix	source of truth is explicit
Runtime topology	deployment and failure boundaries are visible
Failure model	partial failures have defined behavior
SLO/alerts	user-visible reliability is measurable
Runbooks	operators can respond consistently
Risk register	unresolved risk is owned and prioritized

4. Review Pack Folder Structure

A practical repository structure:

architecture/
  regulatory-case-management/
    00-overview.md
    01-capability-map.md
    02-context-map.md
    03-service-catalog.yaml
    04-data-ownership-matrix.md
    05-contract-index.md
    06-workflow-model.md
    07-runtime-topology.md
    08-failure-model.md
    09-observability-and-slo.md
    10-security-and-privacy.md
    11-auditability.md
    12-risk-register.yaml
    13-remediation-roadmap.md
    adr/
      ADR-001-service-boundary-case.md
      ADR-002-decision-service-boundary.md
      ADR-003-workflow-orchestration.md
      ADR-004-audit-event-model.md
      ADR-005-runtime-topology.md

The review pack should live close to the code or the service catalog, not as a document that nobody ever touches.

5. Risk Register Mental Model

A risk register is not a list of "bad things that might happen". It is a decision-making structure.

Minimum risk record:

id: RISK-CASE-001
title: Audit event loss during broker outage
area: auditability
severity: high
likelihood: medium
detectability: medium
status: open
owner: audit-platform-team
systemImpact: Regulatory decision may not be reconstructable if audit event is lost.
userImpact: Compliance/audit users may not be able to prove action sequence.
trigger:
  - broker unavailable
  - outbox publisher stuck
  - audit consumer lag exceeds threshold
currentControls:
  - transactional outbox
  - audit event id
  - audit lag metric
missingControls:
  - replay drill automation
  - audit event reconciliation dashboard
mitigationPlan:
  - implement audit event reconciliation job
  - add alert on oldest unaudited event age
  - run quarterly reconstructability drill
residualRisk: medium
dueDate: 2026-08-15
linkedArtifacts:
  - ADR-004-audit-event-model.md
  - runbooks/audit-lag.md
  - dashboards/audit-evidence.json

A risk is useful when it has:

owner,
impact,
trigger,
controls,
missing controls,
mitigation,
residual risk,
evidence link.
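
The field list above can be turned into a small completeness check. The following is a minimal sketch, assuming a hypothetical `RiskRecord` type whose field names loosely follow the YAML record; it is not a prescribed schema, and treating an empty `missingControls` list as valid is a choice of this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical validator for the fields a useful risk record needs. */
final class RiskRecordValidator {

    /** Illustrative shape; field names loosely follow the YAML record above. */
    record RiskRecord(String id, String owner, String impact,
                      List<String> triggers, List<String> currentControls,
                      List<String> missingControls, List<String> mitigationPlan,
                      String residualRisk, List<String> linkedArtifacts) {}

    /** Returns the names of required fields that are absent, in declaration order. */
    static List<String> missingFields(RiskRecord r) {
        List<String> missing = new ArrayList<>();
        if (isBlank(r.owner())) missing.add("owner");
        if (isBlank(r.impact())) missing.add("impact");
        if (isEmpty(r.triggers())) missing.add("triggers");
        if (isEmpty(r.currentControls())) missing.add("currentControls");
        // An empty list is allowed here: it states that no control gap is known.
        if (r.missingControls() == null) missing.add("missingControls");
        if (isEmpty(r.mitigationPlan())) missing.add("mitigationPlan");
        if (isBlank(r.residualRisk())) missing.add("residualRisk");
        if (isEmpty(r.linkedArtifacts())) missing.add("linkedArtifacts");
        return missing;
    }

    private static boolean isBlank(String s) { return s == null || s.isBlank(); }

    private static boolean isEmpty(List<String> l) { return l == null || l.isEmpty(); }
}
```

A check like this could run in CI against the parsed register so that a record without an owner or evidence link never gets merged.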

6. Risk Scoring

Avoid fake precision. Use simple scoring but force clarity.

6.1 Severity

Severity	Meaning
Critical	regulatory breach, irreversible data loss, major outage, unauthorized sensitive-data exposure
High	significant user/business impact, auditability compromised, major SLO breach
Medium	degraded operation, recoverable inconsistency, limited blast radius
Low	local inconvenience, minor maintainability/cost issue

6.2 Likelihood

Likelihood	Meaning
High	expected under normal traffic/change/failure patterns
Medium	plausible under known scenarios
Low	possible but needs multiple uncommon conditions

6.3 Detectability

Detectability	Meaning
Low	likely invisible until audit/incident/customer report
Medium	observable with manual investigation
High	automatically detected by alert/fitness/check

6.4 Priority

Use this rule:

priority = severity first, then likelihood, then detectability

A high-severity low-detectability risk should not be buried because probability is uncertain.
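
One way to make the ordering rule concrete is a comparator. This is a sketch under assumptions: the enum and record names are hypothetical, and the enum constants are declared in "most urgent first" order so that natural enum ordering expresses the rule (low detectability sorts before high, because a hard-to-detect risk deserves attention sooner).

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical ordering: severity first, then likelihood, then detectability. */
final class RiskPriority {

    // Constants are declared most-urgent-first, so ordinal order is priority order.
    enum Severity { CRITICAL, HIGH, MEDIUM, LOW }
    enum Likelihood { HIGH, MEDIUM, LOW }
    enum Detectability { LOW, MEDIUM, HIGH } // hardest to detect comes first

    record Risk(String id, Severity severity, Likelihood likelihood, Detectability detectability) {}

    static final Comparator<Risk> BY_PRIORITY =
        Comparator.comparing(Risk::severity)
                  .thenComparing(Risk::likelihood)
                  .thenComparing(Risk::detectability);

    /** Returns risk ids ordered from most to least urgent. */
    static List<String> prioritize(List<Risk> risks) {
        return risks.stream().sorted(BY_PRIORITY).map(Risk::id).toList();
    }
}
```

Because the comparison is lexicographic, no numeric score is computed, which keeps the ordering free of fake precision.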

7. Review Map for the Case Study

Review is not linear. Some risks cut across areas.

Example: “Decision finalization requires Policy Service.”

This touches:

collaboration,
reliability,
audit,
security,
runtime topology,
SLO,
workflow.

8. Example Risk Register

Below is a condensed but realistic risk register for the case-management architecture.

RISK-001 — Shared Concept Drift Between Case and Decision

id: RISK-001
title: Shared concept drift between Case Service and Decision Service
area: domain-boundary
severity: high
likelihood: medium
detectability: medium
status: open
owner: case-platform-architect
impact: Case status and decision status may diverge semantically, causing incorrect workflow transitions or audit confusion.
triggers:
  - new status added in Case Service
  - Decision Service interprets old status meaning
  - projection maps statuses inconsistently
currentControls:
  - context map
  - decision status enum owned by Decision Service
  - integration events include semantic version
missingControls:
  - automated semantic compatibility tests
  - published language glossary linting
mitigationPlan:
  - create status transition contract tests
  - add glossary ownership table
  - add projection mapping review gate
residualRisk: medium

Why this matters: in regulatory systems, status naming becomes evidence. A field called APPROVED, ACCEPTED, or CLOSED can have legal meaning.

RISK-002 — Workflow State and Domain State Divergence

id: RISK-002
title: Workflow process state diverges from domain source of truth
area: workflow-consistency
severity: high
likelihood: medium
detectability: medium
status: open
owner: workflow-team
impact: Workflow may advance or block a case based on stale or incorrect state.
triggers:
  - activity retry after domain state changed
  - manual correction in Case Service
  - workflow replay with old logic
currentControls:
  - workflow activity idempotency
  - expected version in domain commands
  - workflow audit event
missingControls:
  - reconciliation job between workflow state and case state
  - stuck workflow dashboard
mitigationPlan:
  - implement workflow-domain reconciliation
  - add workflow freshness metric
  - define manual remediation playbook
residualRisk: medium

The workflow engine is process memory. The domain service remains the authority on business state.
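
The "expected version in domain commands" control listed above can be illustrated with a tiny in-memory aggregate. This is a sketch only: `CaseState`, its status values, and the exception type are hypothetical, and a real service would enforce the version check in the database write rather than in memory.

```java
/** Minimal in-memory sketch of an expected-version check; names are hypothetical. */
final class CaseState {

    /** Thrown when a command was built against a version that is no longer current. */
    static final class StaleVersionException extends RuntimeException {
        StaleVersionException(long expected, long actual) {
            super("expected version " + expected + " but current version is " + actual);
        }
    }

    private String status = "OPEN";
    private long version = 0;

    synchronized String status() { return status; }

    synchronized long version() { return version; }

    /** Applies the transition only if the caller observed the current version. */
    synchronized void transition(String newStatus, long expectedVersion) {
        if (expectedVersion != version) {
            throw new StaleVersionException(expectedVersion, version);
        }
        status = newStatus;
        version++;
    }
}
```

A workflow activity that retries with a stale version is rejected instead of silently overwriting a manual correction, which turns a hidden divergence into an explicit, observable failure.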

RISK-003 — Audit Event Loss or Delay Beyond Evidence Window

id: RISK-003
title: Audit event loss or delay beyond evidence window
area: auditability
severity: critical
likelihood: medium
detectability: medium
status: open
owner: audit-team
impact: Regulatory actions may not be defensible if evidence chain cannot be reconstructed.
triggers:
  - outbox publisher failure
  - broker outage
  - audit consumer stuck
  - audit store outage
currentControls:
  - transactional outbox
  - audit event id
  - append-only audit store
  - audit consumer lag metric
missingControls:
  - end-to-end audit reconciliation
  - reconstructability drill automation
  - oldest unaudited event alert
mitigationPlan:
  - implement audit reconciliation job
  - add burn-rate alert for audit lag SLO
  - run quarterly reconstructability drill
residualRisk: medium

Audit is not “nice to have” for this domain. It is part of correctness.
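
The reconciliation job and the "oldest unaudited event" alert from the mitigation plan could start from logic like the following. It is a sketch under assumptions: `OutboxEvent` and `Report` are hypothetical types, and both inputs are assumed to be already loaded from the outbox table and the audit store.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

/** Hypothetical end-to-end audit reconciliation over already-loaded data. */
final class AuditReconciliation {

    record OutboxEvent(String eventId, Instant committedAt) {}

    record Report(List<String> unauditedEventIds, Duration oldestUnauditedAge) {}

    /** Finds committed events missing from the audit store, oldest first. */
    static Report reconcile(List<OutboxEvent> committed, Set<String> auditedIds, Instant now) {
        List<OutboxEvent> unaudited = committed.stream()
            .filter(e -> !auditedIds.contains(e.eventId()))
            .sorted(Comparator.comparing(OutboxEvent::committedAt))
            .toList();
        Duration oldestAge = unaudited.isEmpty()
            ? Duration.ZERO
            : Duration.between(unaudited.get(0).committedAt(), now);
        List<String> ids = unaudited.stream().map(OutboxEvent::eventId).toList();
        return new Report(ids, oldestAge);
    }
}
```

The `oldestUnauditedAge` value is the number an alert would compare against the evidence window.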

RISK-004 — BFF Fan-Out Causes Latency and Partial Failure

id: RISK-004
title: Case BFF fan-out causes latency amplification and partial failure
area: reliability
severity: medium
likelihood: high
detectability: high
status: accepted-with-control
owner: edge-team
impact: Case overview page becomes slow or partially unavailable when one fragment service is degraded.
triggers:
  - read model slow
  - Decision Service slow
  - Evidence metadata call times out
currentControls:
  - fragment timeout budget
  - partial response contract
  - read model aggregation
  - trace spans per fragment
missingControls:
  - client-visible freshness and partial-data marker on all pages
mitigationPlan:
  - add fragment completeness field to BFF response
  - add dashboard for fan-out latency
residualRisk: low

The correct fix is not always “avoid fan-out”. Sometimes the fix is explicit partial response semantics.
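
Explicit partial response semantics might look like the sketch below. The types and fragment names are hypothetical, and the approach (one shared time budget, with any fragment that fails or misses the budget reported as missing) is one possible reading of "fragment timeout budget", not the only one.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical BFF assembler that reports which fragments are missing. */
final class CaseOverviewAssembler {

    record BffResponse(Map<String, String> fragments, List<String> missingFragments, boolean complete) {}

    /** Waits for all fragments within one shared budget; never fails the whole page. */
    static BffResponse assemble(Map<String, CompletableFuture<String>> calls, Duration budget) {
        Map<String, String> data = new LinkedHashMap<>();
        List<String> missing = new ArrayList<>();
        long deadline = System.nanoTime() + budget.toNanos();
        for (Map.Entry<String, CompletableFuture<String>> call : calls.entrySet()) {
            long remaining = Math.max(0, deadline - System.nanoTime());
            try {
                data.put(call.getKey(), call.getValue().get(remaining, TimeUnit.NANOSECONDS));
            } catch (TimeoutException | ExecutionException e) {
                missing.add(call.getKey()); // slow or failed fragment becomes a visible gap
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                missing.add(call.getKey());
            }
        }
        return new BffResponse(data, missing, missing.isEmpty());
    }
}
```

The `complete` flag and `missingFragments` list correspond to the "fragment completeness field" in the mitigation plan: the client can render what arrived and mark the rest as unavailable.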

RISK-005 — Policy Service Outage Blocks Decision Commands

id: RISK-005
title: Policy Service outage blocks final decision commands
area: availability-security
severity: high
likelihood: medium
detectability: high
status: accepted-with-control
owner: policy-team
impact: Final decisions cannot be submitted during policy service outage.
triggers:
  - policy service unavailable
  - rule bundle load failure
  - policy cache corrupted
currentControls:
  - fail-closed for final decision
  - local cache for non-final advisory checks
  - policy health alert
  - decision command explicit failure reason
missingControls:
  - policy bundle rollback automation
mitigationPlan:
  - add policy bundle canary
  - add emergency read-only policy fallback for draft validation only
residualRisk: medium

In regulated decisioning, availability cannot silently override authorization/policy correctness.

RISK-006 — Projection Staleness Misleads Investigators

id: RISK-006
title: Projection staleness misleads investigators
area: data-consistency
severity: high
likelihood: medium
detectability: high
status: open
owner: read-model-team
impact: Users may act on stale dashboard information.
triggers:
  - projection consumer lag
  - read model DB slow
  - poison event blocks partition
currentControls:
  - projection watermark
  - oldest event age metric
  - idempotent projection updates
missingControls:
  - UI freshness indicator in all critical views
  - partition-specific lag alert
mitigationPlan:
  - add freshness watermark to BFF response
  - add poison-event quarantine path
  - add query-side stale-data warning
residualRisk: medium

A stale read model is acceptable only when staleness is understood and visible.
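
Making staleness visible can be as simple as attaching a freshness record to each read-model response. The sketch below is hypothetical: it assumes the projection can report its watermark and the timestamp of the oldest event it has not yet applied, and the staleness threshold is supplied by the caller.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

/** Hypothetical freshness marker attached to read-model responses. */
final class ProjectionFreshness {

    record Freshness(Instant watermark, Duration lag, boolean stale) {}

    /**
     * Lag is the age of the oldest event not yet applied. With no pending
     * events the projection is current, however old the watermark is.
     */
    static Freshness evaluate(Instant watermark, Optional<Instant> oldestPendingEvent,
                              Instant now, Duration maxStaleness) {
        Duration lag = oldestPendingEvent
            .map(pending -> Duration.between(pending, now))
            .orElse(Duration.ZERO);
        return new Freshness(watermark, lag, lag.compareTo(maxStaleness) > 0);
    }
}
```

Measuring lag from the oldest pending event, rather than from the watermark alone, avoids flagging a quiet system as stale.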

RISK-007 — Evidence Metadata and Evidence Content Boundary Leak

id: RISK-007
title: Evidence metadata and evidence content boundary leak
area: privacy-security
severity: critical
likelihood: low
detectability: medium
status: open
owner: evidence-team
impact: Sensitive evidence content may leak into logs, events, read models, or unauthorized services.
triggers:
  - developer includes object content in event payload
  - debug logging of request body
  - BFF exposes signed URL too broadly
currentControls:
  - metadata-only event rule
  - signed URL generation in Evidence Service
  - log redaction
missingControls:
  - automated schema lint for sensitive fields
  - DLP scanning in log pipeline
  - access review for evidence download
mitigationPlan:
  - implement sensitive-field schema classification
  - add contract test blocking evidence content in events
  - add evidence access audit dashboard
residualRisk: medium

Sensitive content should have fewer paths than metadata.
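
The "contract test blocking evidence content in events" could be built on a recursive payload scan like this one. It is a sketch: the forbidden field names are invented for illustration, and a real deny-list would be derived from the data classification rather than hard-coded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical lint that finds forbidden field names anywhere in an event payload. */
final class SensitiveFieldLint {

    // Illustrative deny-list; these names are assumptions, not a standard.
    static final Set<String> FORBIDDEN = Set.of("content", "contentBase64", "rawBody", "signedUrl");

    /** Returns dotted paths of forbidden fields found in the payload. */
    static List<String> violations(Map<String, ?> payload) {
        List<String> found = new ArrayList<>();
        collect("", payload, found);
        return found;
    }

    private static void collect(String path, Object node, List<String> found) {
        if (node instanceof Map<?, ?> map) {
            for (Map.Entry<?, ?> entry : map.entrySet()) {
                String key = String.valueOf(entry.getKey());
                String childPath = path.isEmpty() ? key : path + "." + key;
                if (FORBIDDEN.contains(key)) found.add(childPath);
                collect(childPath, entry.getValue(), found);
            }
        } else if (node instanceof Iterable<?> items) {
            for (Object item : items) collect(path + "[]", item, found);
        }
    }
}
```

Matching on field names catches accidental inclusion; it does not detect sensitive content hidden under an innocuous name, which is why the register also lists DLP scanning as a separate control.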

RISK-008 — Autoscaling Overloads Database

id: RISK-008
title: Autoscaling overloads database through unbounded connection pools
area: runtime-capacity
severity: high
likelihood: medium
detectability: high
status: open
owner: platform-team
impact: Increased replicas create connection storm and database saturation.
triggers:
  - HPA scales Case/Decision services under latency pressure
  - each pod opens max pool
  - database connection budget exceeded
currentControls:
  - max replicas defined
  - resource requests/limits
missingControls:
  - global DB connection budget check
  - pool-size policy-as-code
mitigationPlan:
  - add CI check for maxReplicas * poolSize budget
  - set per-service DB pool caps
  - add DB saturation alert linked to HPA events
residualRisk: medium

Autoscaling without dependency budget is a reliability risk.
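
The "maxReplicas * poolSize" CI check from the mitigation plan amounts to simple arithmetic. The sketch below uses hypothetical type names, and the numbers in the usage note are made up for illustration.

```java
import java.util.List;

/** Hypothetical worst-case connection budget check for services sharing one database. */
final class DbConnectionBudget {

    record ServiceProfile(String name, int maxReplicas, int poolSizePerPod) {
        /** Connections opened if every replica fills its pool. */
        int worstCaseConnections() { return maxReplicas * poolSizePerPod; }
    }

    static int worstCaseTotal(List<ServiceProfile> services) {
        return services.stream().mapToInt(ServiceProfile::worstCaseConnections).sum();
    }

    /** True when the worst case fits after reserving headroom for admin and migration tools. */
    static boolean withinBudget(List<ServiceProfile> services, int dbMaxConnections, int reserved) {
        return worstCaseTotal(services) <= dbMaxConnections - reserved;
    }
}
```

For example, three services with profiles 10 x 20, 6 x 20, and 4 x 10 need 360 connections in the worst case, which does not fit a database limit of 400 with 50 connections reserved.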

RISK-009 — Permanent Migration Bridge Becomes New Legacy

id: RISK-009
title: Temporary legacy bridge becomes permanent architecture
area: migration-governance
severity: medium
likelihood: high
detectability: medium
status: open
owner: modernization-lead
impact: Legacy coupling remains hidden, preventing true service autonomy.
triggers:
  - migration deadline pressure
  - bridge has no retirement date
  - old consumers still access legacy DB
currentControls:
  - bridge service owner
  - migration dashboard
missingControls:
  - bridge expiry policy
  - consumer inventory automation
mitigationPlan:
  - add bridge retirement ADR
  - add runtime detection for legacy access
  - review bridge status monthly
residualRisk: medium

Every bridge needs an exit plan.

RISK-010 — Architecture Knowledge Goes Stale

id: RISK-010
title: Architecture knowledge goes stale after launch
area: governance-operability
severity: medium
likelihood: high
detectability: medium
status: open
owner: architecture-governance
impact: Service catalog, topology, and runbooks no longer reflect production reality.
triggers:
  - new dependencies added without catalog update
  - runtime topology drift
  - dashboard/runbook links broken
currentControls:
  - service catalog
  - ADRs
missingControls:
  - runtime-vs-catalog reconciliation
  - CI gate for topology metadata
  - quarterly architecture review
mitigationPlan:
  - build catalog reconciliation job
  - add dependency telemetry comparison
  - run lightweight quarterly review
residualRisk: low

Architecture documentation is not a deliverable. It is a living control system.

9. Risk Register as YAML

A compact version can be checked into the repository.

risks:
  - id: RISK-003
    title: Audit event loss or delay beyond evidence window
    area: auditability
    severity: critical
    likelihood: medium
    detectability: medium
    status: open
    owner: audit-team
    residualRisk: medium
    dueDate: 2026-08-15
    linkedArtifacts:
      - ADR-004-audit-event-model.md
      - runbooks/audit-lag.md
      - dashboards/audit-evidence.json

  - id: RISK-008
    title: Autoscaling overloads database through unbounded connection pools
    area: runtime-capacity
    severity: high
    likelihood: medium
    detectability: high
    status: open
    owner: platform-team
    residualRisk: medium
    dueDate: 2026-08-01
    linkedArtifacts:
      - runtime-topology.md
      - service-catalog.yaml
      - policies/db-pool-budget.rego

Do not hide risks in slide decks. Store them where engineers see them.

10. Architecture Review Flow

A productive review should be structured.

Review outcomes:

Outcome	Meaning
Approved	risks are acceptable and controls are in place
Approved with conditions	can proceed, but required mitigations have due dates/owners
Rework required	key risk/decision missing or unsafe
Rejected	architecture violates non-negotiable constraint

Most real systems should be “approved with conditions”. Zero-risk architecture does not exist.

11. Review Questions by Area

11.1 Domain boundary

What capability does each service own?
What invariant is local to each service?
Which concepts have different meanings across contexts?
What would force this boundary to split or merge?
What business change is this boundary optimized for?

11.2 API/event/workflow contract

Which APIs are commands vs queries?
Which commands are idempotent?
Which events are domain events vs integration events?
What is the compatibility policy?
What is the deprecation window?
Which workflow activities call which service?

11.3 Data ownership

Who owns each source of truth?
Which service may write which data?
Is any cross-service database access still present?
Which read models duplicate data?
What is the staleness contract?
How is reconciliation performed?

11.4 Reliability

What happens when each dependency times out?
Are retries bounded and jittered?
Are commands safe to retry?
What is the degradation mode?
Can this service shed load?
What causes cascading failure?

11.5 Security/privacy

What is the workload identity per service?
How is service-to-service access authorized?
Which data is PII/sensitive?
Where does sensitive data flow?
What is redacted from logs/traces/events?
What is the secret rotation path?

11.6 Auditability

What business actions create audit events?
What is the evidence chain?
Can decisions be reconstructed?
Are corrections append-only?
What happens if audit ingestion lags?
Is audit store independent from debug logs?

11.7 Runtime topology

What namespace and node pool does each workload use?
What are min/max replicas?
Are critical pods spread across zones?
Does autoscaling respect downstream capacity?
Are probes correct?
Is shutdown graceful?

11.8 Operability

What SLO represents user pain?
Which alerts page humans?
Is each alert linked to runbook?
Are dashboards tied to service catalog?
Can on-call reconstruct an incident timeline?
Are known-bad states documented?

12. Architecture Decision Log

A risk register is not enough. Some risks are results of intentional decisions.

Decision log example:

ADR	Decision	Alternatives rejected	Main consequence
ADR-001	Separate Decision Service from Case Service	keep decision inside Case Service	clearer audit/policy boundary, more cross-service coordination
ADR-002	Use orchestration for enforcement lifecycle	pure choreography	better process visibility, workflow engine dependency
ADR-003	Use read model for case overview	BFF live fan-out only	staleness risk, better latency
ADR-004	Audit through transactional outbox + audit consumer	direct audit write in every command	async lag risk, better durability/replay
ADR-005	Fail-closed for final decision policy check	allow cached policy on outage	lower availability, stronger regulatory correctness

A decision without consequences is not an ADR. It is a preference.

13. Risk-to-Control Matrix

Map risks to controls.

Risk	Preventive control	Detective control	Corrective control
audit event loss	transactional outbox, append-only audit store	audit lag alert, reconciliation	replay outbox, rebuild audit view
stale projection	idempotent projection, checkpointing	watermark, oldest event age	replay projection, quarantine poison event
policy outage	policy canary, cache for advisory checks	policy health/SLO	rollback policy bundle, fail closed
DB overload	pool budget, HPA max cap	DB saturation alert	load shedding, reduce worker concurrency
sensitive data leak	data classification, schema lint	DLP/log scanning	revoke URL, purge logs if allowed, incident process
workflow stuck	timeout, activity idempotency	stuck workflow dashboard	manual remediation, workflow retry/patch
contract break	consumer-driven tests, compatibility rules	contract registry alert	rollback provider, restore compatible field

This matrix is powerful because it avoids one-dimensional mitigation.

A serious risk needs:

preventive control,
detective control,
corrective control.
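
The three-control requirement is easy to check mechanically once controls are tagged by type. A minimal sketch, assuming hypothetical `Control` and `ControlType` names:

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical check that a risk has preventive, detective, and corrective controls. */
final class ControlCoverage {

    enum ControlType { PREVENTIVE, DETECTIVE, CORRECTIVE }

    record Control(String name, ControlType type) {}

    /** Returns the control types that no listed control covers. */
    static Set<ControlType> missingTypes(List<Control> controls) {
        EnumSet<ControlType> missing = EnumSet.allOf(ControlType.class);
        for (Control control : controls) {
            missing.remove(control.type());
        }
        return missing;
    }
}
```

A high or critical risk with a non-empty result is a candidate for a review condition, since it relies on a single dimension of mitigation.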

14. Production Readiness Gates

For the case-management system, define readiness gates.

Gate 1 — Boundary readiness

Required:

capability map approved,
service catalog entries complete,
data ownership matrix complete,
context map complete,
boundary ADRs accepted.

Exit criteria:

no open critical boundary/data authority risk.

Gate 2 — Contract readiness

Required:

API contracts documented,
event contracts documented,
workflow activity contracts documented,
compatibility policy defined,
consumer impact reviewed.

Exit criteria:

no unknown command/event owner,
breaking changes have rollout plan.

Gate 3 — Reliability readiness

Required:

timeout/retry/circuit breaker policy,
failure model,
SLO/SLI,
dashboards,
alerts,
runbooks.

Exit criteria:

critical user journeys have SLO,
page alerts are actionable.

Gate 4 — Security/privacy/audit readiness

Required:

service identity,
access policy,
data classification,
redaction rule,
audit event model,
reconstructability test plan.

Exit criteria:

no unowned sensitive data flow,
audit chain can be demonstrated for one full decision lifecycle.

Gate 5 — Runtime readiness

Required:

deployment manifests,
topology cards,
resource requests/limits,
HPA/scaling policy,
probe config,
graceful shutdown behavior,
deployment strategy.

Exit criteria:

one-zone failure behavior known,
DB/broker connection budget reviewed.

15. Architecture Fitness Functions

Review should not remain manual. Convert repeatable rules into fitness functions.

15.1 Static code fitness

Examples:

domain package must not depend on infrastructure package,
controller must not call repository directly,
application service must not call external HTTP client inside DB transaction,
service module must not import another bounded context internals.

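ArchUnit-style rule (a fragment: it belongs inside a test class annotated with @AnalyzeClasses, and the package patterns are illustrative):

import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;
import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

// Domain code must not reach into infrastructure, adapter, or web packages.
@ArchTest
static final ArchRule domain_should_not_depend_on_infrastructure =
    noClasses()
        .that().resideInAPackage("..domain..")
        .should().dependOnClassesThat()
        .resideInAnyPackage("..infrastructure..", "..adapter..", "..web..");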

15.2 Contract fitness

no breaking OpenAPI change without version/deprecation plan,
event payload cannot remove required field,
sensitive fields cannot appear in public integration events,
consumer contract tests must pass before provider deployment.
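
The "event payload cannot remove required field" rule reduces to a set difference between schema versions. This sketch assumes the required field names have already been extracted from each schema version; the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical compatibility check between two versions of an event schema. */
final class EventSchemaCompatibility {

    /** Returns previously required fields that the new schema no longer provides. */
    static Set<String> removedRequiredFields(Set<String> oldRequired, Set<String> newFields) {
        Set<String> removed = new TreeSet<>(oldRequired); // sorted for stable reporting
        removed.removeAll(newFields);
        return removed;
    }

    /** A change is backward compatible for consumers when nothing required was removed. */
    static boolean isBackwardCompatible(Set<String> oldRequired, Set<String> newFields) {
        return removedRequiredFields(oldRequired, newFields).isEmpty();
    }
}
```

Adding fields passes this check while renaming a required field fails it, which is the intended behavior: a rename is a removal from the consumer's point of view.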

15.3 Runtime fitness

every service has owner label,
every deployment has resource requests/limits,
critical services have min replicas >= 2 or documented exception,
each service has service account,
DB pool budget does not exceed allowed total,
HPA max replicas has dependency budget.

15.4 Observability fitness

service emits RED metrics,
logs include correlation id,
trace propagation enabled,
page alerts link to runbook,
no high-cardinality unbounded label,
dashboards exist for critical SLOs.

15.5 Audit/privacy fitness

audit event schema has event id, actor, subject, action, time, causation, policy version where relevant,
sensitive fields are classified,
log schema excludes PII by default,
evidence object content is not emitted into events,
data retention policy exists.

16. Example Policy-as-Code Control

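Example Rego intent for Kubernetes manifest validation (written with the rego.v1 keywords so that it parses under OPA 1.x as well as recent 0.x releases):

package microservices.runtime

import rego.v1

# Every Deployment must carry an owner label.
deny contains msg if {
  input.kind == "Deployment"
  not input.metadata.labels["owner"]
  msg := sprintf("deployment %s must have owner label", [input.metadata.name])
}

# Every container must declare a CPU request.
deny contains msg if {
  input.kind == "Deployment"
  container := input.spec.template.spec.containers[_]
  not container.resources.requests.cpu
  msg := sprintf("container %s must define cpu request", [container.name])
}

# Critical Deployments need at least 2 replicas.
# Note: a manifest that omits replicas (Kubernetes defaults it to 1) is not caught by this rule.
deny contains msg if {
  input.kind == "Deployment"
  input.metadata.labels["criticality"] == "critical"
  input.spec.replicas < 2
  msg := sprintf("critical deployment %s must have at least 2 replicas", [input.metadata.name])
}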

This is the difference between governance as a meeting and governance as an executable guardrail.

17. Residual Risk and Sign-Off

Some risks will remain. Architecture maturity means being explicit about residual risk.

Example:

## Residual Risk Acceptance: RISK-005 Policy Service Outage Blocks Final Decision

We accept that final decision submission fails closed when Policy Service is unavailable.

Rationale:
- Regulatory correctness is more important than availability for final decision commands.
- Advisory checks may use cached policy but final decisions require current policy decision.
- User receives explicit retryable error with incident reference if outage is active.

Controls:
- Policy Service SLO: 99.95% availability.
- Decision command alert on policy dependency error budget burn.
- Policy bundle canary before rollout.
- Manual escalation runbook.

Accepted by:
- Head of Enforcement Operations
- Security/Compliance Owner
- Engineering Owner

Review date: 2026-10-01

A residual risk without business owner acceptance is not accepted. It is ignored.

18. Architecture Review Meeting Template

Keep review focused.

# Architecture Review: Regulatory Case Management v1

## 1. Goal
What production decision/change is being reviewed?

## 2. Scope
Services, workflows, data, users, environments.

## 3. Non-goals
What is explicitly not being solved now?

## 4. Key decisions
List ADRs.

## 5. Risk summary
Critical/high risks first.

## 6. Walkthrough
- user journey
- command path
- event path
- workflow path
- audit path
- failure path

## 7. Readiness gates
Pass/fail/conditional.

## 8. Open questions
Questions requiring owner/date.

## 9. Decision
Approved / approved with conditions / rework / rejected.

## 10. Follow-up
Owner, due date, evidence required.

Timebox detailed debates. If a topic needs deep design, create an ADR follow-up.

19. Review Walkthrough: Decision Submission

Architecture review should walk one critical user journey end to end.

Review questions:

What if user double-clicks submit?
What if Policy Service times out?
What if DB commit succeeds but response is lost?
What if outbox publisher is down?
What if audit consumer lags?
What if workflow consumes duplicate event?
What if event schema changes?
What if submitted decision contains sensitive field?

Each answer should map to a control or a risk.
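
For the double-click and lost-response questions, a common control is an idempotency key on the submit command. The sketch below is in-memory and hypothetical: a real Decision Service would persist the key together with the decision in the same transaction, which this example does not attempt.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical in-memory idempotency guard for decision submission. */
final class IdempotentSubmission {

    private final ConcurrentHashMap<String, String> resultsByKey = new ConcurrentHashMap<>();

    /**
     * Runs the command at most once per idempotency key.
     * A repeated call with the same key returns the stored result instead.
     */
    String submit(String idempotencyKey, Supplier<String> command) {
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> command.get());
    }
}
```

With this in place, a client that retries after a lost response receives the original decision id rather than creating a second decision.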

20. Review Walkthrough: Evidence Upload

Review questions:

Does metadata event include content? It should not.
Who can generate signed URL?
How long does signed URL live?
Is evidence access audited?
What happens if upload succeeds but callback fails?
What happens if metadata exists but object is missing?
Is reconciliation available?

21. Review Walkthrough: Projection Staleness

Review questions:

What is maximum allowed staleness for dashboard?
Does UI show freshness?
Can user perform critical action based on stale read model?
Does command handler validate against source-of-truth state?
Can projection rebuild without corrupting state?

22. Risk Burndown Roadmap

A review should produce a roadmap.

Timeframe	Focus	Example actions
Before production	critical controls	audit outbox, SLO alerts, service identity, DB pool budget, runbooks
First 30 days	operational feedback	tune alerts, run first reconstructability drill, review incident data
First 90 days	automation	policy-as-code checks, runtime-catalog reconciliation, contract gate
Quarterly	architecture drift	review dependency graph, risk register, cost profile, service maturity

Do not demand every control before first release. Demand the right controls for the risk level.

23. Architecture Review Outcome Example

# Review Outcome

System: Regulatory Case Management v1
Date: 2026-07-05
Outcome: Approved with conditions

## Conditions before production
1. Implement audit oldest-event-age alert.
2. Add DB pool budget validation for Case/Decision/Workflow services.
3. Add UI freshness watermark for case overview read model.
4. Complete Evidence Service sensitive-field contract test.
5. Link runbooks from all page alerts.

## Accepted residual risks
- Policy outage fails closed for final decision commands.
- Projection staleness up to 60 seconds accepted for dashboard if freshness is visible.
- Notification delivery may lag up to 15 minutes during provider outage.

## Rejected risks
- Direct read access from Reporting to Case DB is not approved.
- Gateway must not own regulatory decision logic.
- Evidence content must not appear in integration events.

This outcome is actionable. It states what can proceed and what cannot.

24. Architecture Review Anti-Patterns

24.1 Review as approval theater

The team presents slides. Reviewers nod. No risks are captured.

Fix:

require risk register,
require ADRs,
require controls/evidence.

24.2 Review too late

Architecture review happens after the system is implemented.

Fix:

review boundary and data ownership early,
review runtime readiness before production,
review drift after launch.

24.3 Checklist without judgment

Every item is checked, but key risk remains misunderstood.

Fix:

use scenario walkthroughs,
ask failure questions,
force business impact statement.

24.4 Security and privacy bolted on

Security review happens after API/event/data design.

Fix:

include security/privacy in boundary and contract review,
classify data early,
verify flows with diagrams.

24.5 Risk without owner

Risks are documented but nobody owns mitigation.

Fix:

no risk enters register without owner,
review open risk aging,
escalate overdue critical risks.

24.6 Architecture docs disconnected from runtime

Service catalog says one thing; production telemetry shows another.

Fix:

reconcile runtime call graph with declared dependencies,
fail build for missing metadata,
review drift quarterly.
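
Reconciling declared dependencies with the observed call graph is a comparison of two sets per service. A minimal sketch, assuming the declared set comes from the service catalog and the observed set from tracing or mesh telemetry; the type names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical drift report between catalog metadata and runtime telemetry. */
final class DependencyDrift {

    record Drift(Set<String> undeclared, Set<String> unobserved) {
        boolean isEmpty() { return undeclared.isEmpty() && unobserved.isEmpty(); }
    }

    /** Compares what the catalog declares with what production actually calls. */
    static Drift compare(Set<String> declared, Set<String> observed) {
        Set<String> undeclared = new TreeSet<>(observed);
        undeclared.removeAll(declared);   // called in production, missing from catalog
        Set<String> unobserved = new TreeSet<>(declared);
        unobserved.removeAll(observed);   // declared, but never seen in telemetry
        return new Drift(undeclared, unobserved);
    }
}
```

Undeclared dependencies are the urgent finding; unobserved ones may simply be rarely used paths, so they are better treated as a prompt for review than as a build failure.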

25. Architecture Review Checklist

Boundary

Capability map exists.
Context map exists.
Service boundary ADR exists.
Service ownership is clear.
No CRUD/entity-only decomposition without justification.

Contracts

Command/query APIs documented.
Event contracts documented.
Workflow activity contracts documented.
Compatibility policy exists.
Deprecation/rollout policy exists.

Data

Data ownership matrix exists.
No unauthorized cross-service DB access.
Read model staleness contract exists.
Reconciliation path exists.
Audit/event identity stable.

Reliability

Failure model exists.
Timeouts/deadlines defined.
Retry policy bounded.
Circuit breaker/bulkhead/rate limiter decisions documented.
Load shedding/degradation defined.

Security/privacy

Workload identity per service.
Service-to-service authorization defined.
Sensitive data classified.
Redaction rules defined.
Secret rotation path documented.

Audit

Audit event model exists.
Decision reconstructability demonstrated.
Audit store retention defined.
Correction model append-only.
Audit lag alert exists.

Runtime

Topology card per service.
Namespace/node pool strategy documented.
Replica/scaling profile defined.
DB/broker capacity budget reviewed.
Probe/shutdown behavior tested.

Operability

SLOs defined for critical journeys.
Alerts linked to runbooks.
Dashboards exist.
Incident response path known.
Production readiness gates passed or waived with owner.

Delivery

CI/CD pipeline gates defined.
Contract tests integrated.
Deployment strategy defined.
Rollback/roll-forward plan exists.
Feature/migration flags have owner and expiry.

26. Practical Exercise

Create a risk register for one service.

Choose one:

Decision Service,
Evidence Service,
Workflow Service,
Audit Service,
Projection Service.

Write at least 8 risks:

one boundary risk,
one data ownership risk,
one consistency risk,
one reliability risk,
one security/privacy risk,
one auditability risk,
one runtime/capacity risk,
one delivery/migration risk.

For each risk, define:

severity,
likelihood,
detectability,
owner,
current controls,
missing controls,
mitigation,
residual risk,
linked evidence.

Then run this test:

Could an engineer who joins next month understand why the risk exists and what must be done?

If not, rewrite the risk.

27. Key Takeaways

Architecture review is risk discovery, not diagram approval.
A risk register must include owner, impact, trigger, controls, mitigation, residual risk, and evidence.
Regulatory systems require auditability and reconstructability as first-class architecture qualities.
Manual review should gradually become executable fitness functions.
Residual risk must be explicitly accepted by the right owner.
Review must cover boundary, contract, data, consistency, workflow, reliability, security, privacy, audit, runtime, operability, delivery, and cost.
The best architecture review produces fewer surprises in production.
