Zero Trust Service-to-Service Architecture
Learn Java Microservices Design and Architect - Part 055
Zero-trust service-to-service architecture for Java microservices: workload identity, mTLS, inter-service authorization, network segmentation, service mesh boundaries, and operational failure modes.
Zero trust is not a slogan.
Zero trust is the architectural decision that network location must not be a source of trust.
In microservices, the old assumption usually looks like this:
If a request comes from the internal network, it is safe.
That assumption fails in modern systems.
Why?
Because services run across many pods, nodes, clusters, regions, pipelines, sidecars, jobs, workers, and environments. Deployments change quickly. Credentials can leak. Pods can be compromised. Internal APIs can be called by the wrong service. Debug endpoints can be left open. Network paths can change. If security depends only on "this is internal traffic", a single breach can turn into lateral movement.
Zero trust service-to-service means:
- every workload has a verifiable identity
- every request is evaluated based on identity, action, resource, context, and policy
- transport security is not considered sufficient for business authorization
- network segmentation reduces blast radius; it is not the only control
- observability and audit prove who called what, when, and for what purpose
This part covers zero trust for communication between services, not user login, an OAuth tutorial, or domain authorization in depth. User authentication and authorization have their own series. Here we focus on microservice architecture decisions.
1. Core Mental Model
Zero trust in microservices is a shift from network trust to workload identity plus policy trust.
The question changes.
Old model:
Is the caller inside the network?
Zero trust model:
Who is the caller, what is it allowed to do, on which resource, under which context, and can we prove it?
For Java microservices, this affects:
- HTTP clients
- gRPC clients
- messaging consumers
- batch jobs
- scheduled workers
- database access
- cache access
- admin endpoints
- service mesh configuration
- Kubernetes network policy
- audit logs
- incident response
2. What Zero Trust Is Not
Zero trust is not:
- “use JWT everywhere”
- “put everything behind API Gateway”
- “enable mTLS and call it done”
- “install service mesh and forget application security”
- “block all network access until developers suffer”
- “replace domain authorization with infrastructure authorization”
- “encrypt internal traffic only”
mTLS answers:
Can I cryptographically verify the peer identity and protect transport?
It does not answer:
Is this caller allowed to approve this enforcement decision?
Service mesh answers:
Can the platform enforce transport and routing policy?
It does not fully answer:
Does this request satisfy business-level policy?
Zero trust requires layered decisions.
3. Service-to-Service Threat Model
A useful architecture starts with threat paths.
In microservices, common service-to-service threats include:
| Threat | Example | Architectural Control |
|---|---|---|
| Service impersonation | Fake service calls decision-service | Workload identity, mTLS, SPIFFE/SVID |
| Lateral movement | Compromised pod calls internal admin APIs | Network policy, service authz, least privilege |
| Credential leakage | Static service token copied from config | Short-lived credentials, rotation, secret manager |
| Confused deputy | Service A tricks Service B to access resource | Caller context propagation, resource authorization |
| Over-permissive service account | One service account used by many workloads | Per-service identity, scoped RBAC |
| Missing object authorization | Caller can access another tenant/case | Object-level authorization in callee |
| Internal debug endpoint exposure | /actuator/env exposed internally | Endpoint hardening, network policy, auth |
| Replay/duplicate command | Retried privileged command executes twice | Idempotency, nonce, command identity |
| Message forgery | Unauthorized producer emits domain event | Producer identity, topic ACL, event signature/policy |
| Policy drift | Mesh allows calls app no longer expects | Policy-as-code, contract review, drift detection |
A strong design does not ask “which tool do we install?” first.
It asks:
Which path can an attacker or faulty workload use to cross a boundary it should not cross?
4. The Zero Trust Control Plane Model
A zero-trust model usually separates three concerns:
- Policy Enforcement Point (PEP): where access is blocked or allowed
- Policy Decision Point (PDP): where the decision is evaluated
- Policy Information Point (PIP): where attributes and context come from
In microservices, PEP can live in several places:
| PEP Location | Good For | Weakness |
|---|---|---|
| API Gateway | edge user traffic | insufficient for east-west traffic |
| Service mesh sidecar | mTLS, service identity, coarse routing authz | weak business context |
| Application middleware | route/action/resource policy | needs consistent implementation |
| Domain/application service | object and business invariant authorization | cannot replace transport security |
| Message broker ACL | producer/consumer/topic policy | weak payload-level authorization |
| Database permission | final data access guard | too low-level for business intent |
The mature architecture uses multiple PEPs, but each PEP owns the right level of decision.
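The split between enforcement, decision, and information can be sketched in a few lines of Java. This is a minimal illustration, not a reference design: every type name below (`AccessRequest`, `ResourceAttributes`, `TenantScopeDecisionPoint`, `EnforcementPoint`) is hypothetical, and the only rule evaluated is a same-tenant check.

```java
// All names here are hypothetical and exist only to illustrate the split.
record AccessRequest(String callerWorkloadId, String callerTenantId,
                     String operation, String resourceId) {}

enum AccessDecision { ALLOW, DENY }

// Policy Information Point: supplies attributes the decision needs.
interface ResourceAttributes {
    String tenantOf(String resourceId); // returns null when the resource is unknown
}

// Policy Decision Point: evaluates the request against attributes, no side effects.
final class TenantScopeDecisionPoint {
    private final ResourceAttributes attributes;

    TenantScopeDecisionPoint(ResourceAttributes attributes) {
        this.attributes = attributes;
    }

    AccessDecision decide(AccessRequest request) {
        String resourceTenant = attributes.tenantOf(request.resourceId());
        boolean sameTenant = resourceTenant != null
                && resourceTenant.equals(request.callerTenantId());
        // Unknown resource or tenant mismatch both end in DENY.
        return sameTenant ? AccessDecision.ALLOW : AccessDecision.DENY;
    }
}

// Policy Enforcement Point: sits in the request path and blocks on DENY.
final class EnforcementPoint {
    private final TenantScopeDecisionPoint decisionPoint;

    EnforcementPoint(TenantScopeDecisionPoint decisionPoint) {
        this.decisionPoint = decisionPoint;
    }

    void enforce(AccessRequest request) {
        if (decisionPoint.decide(request) != AccessDecision.ALLOW) {
            throw new SecurityException(
                    "denied: " + request.operation() + " on " + request.resourceId());
        }
    }
}
```

Keeping the decision point free of side effects makes it easy to unit test, while the enforcement point stays thin enough to live in middleware or an application service.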
5. Workload Identity
User identity is not enough.
A service call has at least two identities:
- end-user identity — the human or external client who initiated the action
- workload identity — the service/job/worker that is calling another service
For example:
User: investigator-123
Caller workload: case-api
Callee workload: decision-service
Action: proposeDecision
Resource: case:CASE-2026-001
Tenant: regulator-a
The callee needs to reason about both:
- Is case-api allowed to call decision-service.proposeDecision?
- Is investigator-123 allowed to propose a decision for CASE-2026-001?
- Is this request inside the correct tenant and jurisdiction?
- Is the case in a state that allows proposal?
- Is there a conflict-of-interest constraint?
Workload identity protects service-to-service trust.
User identity protects user action authorization.
Domain state protects business invariants.
6. SPIFFE-Style Identity
A workload identity should be:
- unique
- cryptographically verifiable
- short-lived
- automatically rotated
- bound to runtime attestation
- independent from IP address
- independent from mutable pod name
- usable across heterogeneous platforms
A SPIFFE-style identity looks like this:
spiffe://regulator.example/ns/enforcement/sa/case-api
spiffe://regulator.example/ns/enforcement/sa/decision-service
spiffe://regulator.example/ns/reporting/sa/audit-projector
The important part is not the string format itself.
The important part is that service identity becomes explicit and policy-addressable.
Bad identity model:
Caller is 10.8.14.91
Better identity model:
Caller is workload case-api in namespace enforcement, attested by cluster identity control plane.
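To make the identity policy-addressable in Java code, the string can be parsed into a typed value. The sketch below assumes the `/ns/<namespace>/sa/<service-account>` path layout used in the examples above; that layout is a deployment convention, not something every SPIFFE setup follows, and `WorkloadId` is a hypothetical type.

```java
import java.net.URI;

// Hypothetical value type. The /ns/<namespace>/sa/<service-account> layout is
// an assumption taken from the examples above.
record WorkloadId(String trustDomain, String namespace, String serviceAccount) {

    static WorkloadId parse(String spiffeId) {
        URI uri = URI.create(spiffeId); // throws IllegalArgumentException on bad syntax
        if (!"spiffe".equals(uri.getScheme()) || uri.getHost() == null) {
            throw new IllegalArgumentException("not a SPIFFE ID: " + spiffeId);
        }
        // "/ns/enforcement/sa/case-api" splits into ["", "ns", "enforcement", "sa", "case-api"].
        String[] parts = uri.getPath().split("/");
        if (parts.length != 5 || !"ns".equals(parts[1]) || !"sa".equals(parts[3])
                || parts[2].isEmpty() || parts[4].isEmpty()) {
            throw new IllegalArgumentException("unexpected path layout: " + spiffeId);
        }
        return new WorkloadId(uri.getHost(), parts[2], parts[4]);
    }
}
```

An IP address such as `10.8.14.91` is rejected by this parser, which is the point: policy code only ever sees an explicit workload identity.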
7. mTLS as Transport Identity
mTLS gives two properties:
- encrypted transport
- mutual peer authentication
In regular TLS, the client verifies the server.
In mTLS, both sides verify each other.
This is useful because service B can know:
The caller is not merely “some pod on the network”; it is a cryptographically authenticated workload.
But do not overstate mTLS.
mTLS does not automatically enforce:
- object authorization
- tenant isolation
- business policy
- endpoint-specific permissions
- rate limiting
- payload validation
- audit completeness
- data minimization
mTLS says “who is talking”.
Authorization says “what may they do”.
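When TLS terminates inside the Java process, the peer's workload identity is typically carried as a URI entry in the certificate's Subject Alternative Name extension. The sketch below is an assumption-laden illustration: `PeerIdentity` is a hypothetical helper, and it takes the collection shape returned by `java.security.cert.X509Certificate#getSubjectAlternativeNames()` so that it can be exercised without a real certificate. It treats anything other than exactly one SPIFFE URI as "no identity".

```java
import java.util.Collection;
import java.util.List;
import java.util.Optional;

final class PeerIdentity {
    // GeneralName type 6 is uniformResourceIdentifier (RFC 5280).
    private static final int SAN_TYPE_URI = 6;

    private PeerIdentity() {}

    // Each entry is a List whose first element is the Integer name type and
    // whose second element is the value, as returned by
    // X509Certificate.getSubjectAlternativeNames().
    static Optional<String> spiffeIdFrom(Collection<List<?>> subjectAltNames) {
        if (subjectAltNames == null) {
            return Optional.empty(); // certificate has no SAN extension
        }
        List<String> ids = subjectAltNames.stream()
                .filter(san -> san.size() >= 2
                        && Integer.valueOf(SAN_TYPE_URI).equals(san.get(0)))
                .map(san -> String.valueOf(san.get(1)))
                .filter(uri -> uri.startsWith("spiffe://"))
                .toList();
        // Assumption: zero or several SPIFFE URIs means the identity is unusable.
        return ids.size() == 1 ? Optional.of(ids.get(0)) : Optional.empty();
    }
}
```

The returned identity only says who is talking. The authorization layers described next still decide what that caller may do.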
8. Service-to-Service Authorization Layers
A strong architecture has layered authorization.
Example authorization matrix:
| Caller | Endpoint | Service Permission | Object Policy | Business Policy |
|---|---|---|---|---|
| case-api | POST /decision-proposals | allowed | user must be assigned investigator | case must be UNDER_REVIEW |
| audit-projector | GET /decisions/{id} | allowed read-only | system read by event reference | decision must not be sealed |
| notification-worker | POST /decisions/{id}/notify | denied | n/a | n/a |
| reporting-service | GET /case-summary/{id} | allowed | tenant/jurisdiction scoped | redacted view only |
This table is more valuable than a vague statement like:
Internal services authenticate each other.
9. Network Policy Is Blast-Radius Control
Kubernetes NetworkPolicy can limit which pods can communicate.
This is valuable, but it is not full authorization.
Network policy can express:
case-api pods may connect to decision-service pods on port 8080.
It usually cannot express:
case-api may call POST /decision-proposals only for cases assigned to the current investigator.
So network policy is a coarse boundary.
It reduces lateral movement.
It should be paired with application/service authorization.
Example conceptual policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: decision-service-ingress
  namespace: enforcement
spec:
  podSelector:
    matchLabels:
      app: decision-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: case-api
        - podSelector:
            matchLabels:
              app: workflow-worker
      ports:
        - protocol: TCP
          port: 8080
This is not the final policy you blindly paste.
It is the architecture idea:
default deny, then explicitly permit expected paths.
10. Default Deny Service Topology
A secure microservice topology starts from denied traffic.
Allowed paths are intentional.
Denied paths are not accidents.
Architecture review should ask:
- Which service can call which service?
- Which service can publish to which topic?
- Which service can consume from which topic?
- Which service can access which database?
- Which admin endpoint is reachable from where?
- Which emergency tool can bypass normal paths?
- Who owns the exception list?
- How do we detect drift?
11. API Gateway Is Not Enough
API Gateway is often necessary.
It is not sufficient.
A gateway is good for:
- external authentication
- coarse rate limiting
- request normalization
- routing
- edge logging
- WAF-style controls
- client-specific aggregation
- token validation
But many attacks and mistakes happen after the gateway:
- compromised internal workload calls service directly
- service has broader permission than expected
- worker consumes unauthorized messages
- admin endpoint reachable inside cluster
- sidecar policy allows unexpected path
- internal API lacks object-level authorization
- debug headers are trusted internally
If DecisionService trusts every internal caller because “gateway already checked auth”, then zero trust is missing.
Every sensitive callee must enforce its own security boundary.
12. Service Mesh: Useful, But Not Magic
A service mesh can help with:
- mTLS
- identity
- traffic policy
- service-to-service authorization
- retries/timeouts
- telemetry
- traffic splitting
- certificate rotation
But a mesh should not own domain authorization.
A mesh can express:
case-api may call decision-service on POST /decision-proposals
The application must still decide:
investigator-123 may propose decision for CASE-2026-001 only if assigned and case is under review
Dangerous mesh anti-pattern:
All internal service-to-service traffic allowed because we enabled mTLS.
Correct model:
mTLS gives identity. Policy gives least privilege. Application/domain code enforces business constraints.
13. Java Request Context Model
A Java service should make security context explicit.
Bad style:
public DecisionProposal propose(String caseId, ProposalRequest request) {
    // implicitly trust caller and user from somewhere
}
Better style:
public record CallerContext(
    String workloadId,
    String serviceName,
    String userId,
    String tenantId,
    Set<String> scopes,
    String correlationId
) {}

public record ProposeDecisionCommand(
    CaseId caseId,
    DecisionText decisionText,
    CallerContext caller,
    ExpectedVersion expectedVersion,
    IdempotencyKey idempotencyKey
) {}
Application service:
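public final class ProposeDecisionUseCase {
    private final CaseRepository cases;
    private final DecisionPolicy decisionPolicy;
    private final Outbox outbox;

    public ProposeDecisionUseCase(CaseRepository cases, DecisionPolicy decisionPolicy, Outbox outbox) {
        this.cases = cases;
        this.decisionPolicy = decisionPolicy;
        this.outbox = outbox;
    }

    public DecisionProposalId handle(ProposeDecisionCommand command) {
        // Load the resource first so authorization can see its current state.
        CaseRecord caseRecord = cases.get(command.caseId());

        // Authorize the caller against the resource and its business state.
        decisionPolicy.assertMayProposeDecision(
            command.caller(),
            caseRecord
        );

        // The aggregate enforces the invariant and the expected version.
        DecisionProposal proposal = caseRecord.proposeDecision(
            command.decisionText(),
            command.expectedVersion()
        );

        cases.save(caseRecord);
        // The outgoing event records who acted, for auditability.
        outbox.append(proposal.toIntegrationEvent(command.caller()));
        return proposal.id();
    }
}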
Notice the sequence:
- parse caller context
- load resource
- authorize against resource and business state
- execute invariant
- save state
- emit auditable event
Security is not a filter only.
Security is part of the use case.
14. Propagating User Context Without Trusting It Blindly
Service-to-service calls often propagate user context.
Example headers:
Authorization: Bearer <token>
X-Correlation-Id: 7d7a...
X-Request-Id: 31bc...
X-Tenant-Id: regulator-a
Risk:
- internal caller forges X-User-Id
- service trusts tenant header without validation
- token audience is wrong
- token is too broad
- token is forwarded to service that should not see it
- PII leaks into logs
Safer model:
| Context | Trust Rule |
|---|---|
| Workload identity | from mTLS/SPIFFE/mesh, not arbitrary header |
| User identity | from verified token or trusted token exchange |
| Tenant | derived from token/resource, not only header |
| Correlation ID | accepted but sanitized/generated if missing |
| Request ID | generated per hop |
| Authorization decision | evaluated locally by callee for protected resource |
Do not let convenience headers become security authorities.
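The tenant rule from the table can be sketched as a small resolver. It assumes the token has already been verified upstream (signature, issuer, audience, expiry) and that the tenant lives in a claim named `tenant_id`; both the claim name and the `TenantResolver` class are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

final class TenantResolver {
    private TenantResolver() {}

    // verifiedClaims: claims from a token that was already validated upstream.
    // tenantHeader: raw X-Tenant-Id value, treated as an untrusted hint only.
    static String resolveTenant(Map<String, String> verifiedClaims,
                                Optional<String> tenantHeader) {
        String tenantFromToken = verifiedClaims.get("tenant_id"); // hypothetical claim name
        if (tenantFromToken == null || tenantFromToken.isBlank()) {
            throw new SecurityException("token carries no tenant");
        }
        if (tenantHeader.isPresent() && !tenantHeader.get().equals(tenantFromToken)) {
            // A mismatch is worth auditing; it is never a reason to pick the header.
            throw new SecurityException("tenant header does not match token");
        }
        return tenantFromToken;
    }
}
```

The header can still be useful for routing or logging, but the value that reaches authorization code always comes from the verified token.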
15. Token Relay vs Token Exchange
There are two common models for carrying user identity:
Token Relay
Service A forwards the original user token to Service B.
Pros:
- simple
- user identity preserved
- downstream can evaluate user claims
Cons:
- token audience may be wrong
- downstream gets broader claims than needed
- token may leak
- hard to constrain delegation
- service may act as confused deputy
Token Exchange / Downscoped Token
Service A exchanges user token for a token scoped to Service B/action.
Pros:
- better least privilege
- clear audience
- narrower claims
- better audit
Cons:
- more infrastructure
- latency/availability dependency
- harder local development
- requires careful caching/expiry behavior
Decision rule:
For sensitive cross-service actions, prefer explicit delegation or token exchange over blind token relay.
16. Confused Deputy Problem
A confused deputy occurs when one component with authority is tricked into using that authority on behalf of an unauthorized caller.
Example:
User calls case-api:
GET /cases/CASE-999/decision-summary
case-api calls decision-service:
GET /decisions/by-case/CASE-999
Decision-service trusts case-api because it is internal.
If case-api fails to authorize CASE-999, and decision-service only checks workload identity, unauthorized data can leak.
Defense:
- case-api must authorize user-to-case access
- decision-service must authorize service-to-action access
- decision-service should verify resource scope if it exposes sensitive data
- audit must record user and workload
- tokens should be audience/scoped
17. Service Permission Model
Service permission should be explicit.
Not:
all services in namespace enforcement can call each other
Better:
servicePermissions:
  - caller: spiffe://regulator.example/ns/enforcement/sa/case-api
    callee: decision-service
    operations:
      - decision.proposal.create
      - decision.summary.read
    conditions:
      tenantMode: same-tenant
      environment: production
  - caller: spiffe://regulator.example/ns/enforcement/sa/workflow-worker
    callee: decision-service
    operations:
      - decision.proposal.timeout
      - decision.approval.escalate
This model can be enforced by:
- service mesh authorization policy
- gateway policy
- application middleware
- OPA/authorization service
- internal library
- contract tests
The exact tool is secondary.
The explicit permission model is primary.
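As one possible shape for the application-middleware option, the permission model can be loaded into a default-deny lookup. `ServicePermissions` is a hypothetical class, and the sketch covers only the caller and operation dimensions, not the conditions block.

```java
import java.util.Map;
import java.util.Set;

final class ServicePermissions {
    // caller workload ID -> operations it may invoke on this callee
    private final Map<String, Set<String>> allowedOperations;

    ServicePermissions(Map<String, Set<String>> allowedOperations) {
        this.allowedOperations = Map.copyOf(allowedOperations);
    }

    // Default deny: unknown callers and unlisted operations are both rejected.
    boolean isAllowed(String callerWorkloadId, String operation) {
        if (callerWorkloadId == null || operation == null) {
            return false;
        }
        return allowedOperations
                .getOrDefault(callerWorkloadId, Set.of())
                .contains(operation);
    }
}
```

Because absence means deny, adding a new caller requires an explicit change to the permission data, which is exactly the review point the model is meant to create.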
18. Endpoint Sensitivity Classification
Not every endpoint has the same risk.
Classify endpoints by sensitivity.
| Class | Example | Required Controls |
|---|---|---|
| Public read | public metadata | edge auth maybe none, rate limit |
| User read | case summary | user auth, object auth, tenant auth |
| Sensitive read | evidence metadata, decision rationale | strong object auth, audit, redaction |
| State change | propose decision | idempotency, auth, audit, invariant |
| Privileged state change | approve sanction | step-up, dual control, audit evidence |
| Admin operation | reindex, repair, replay | break-glass, approval, strong audit |
| Internal callback | workflow timeout | workload auth, replay protection |
This classification drives controls.
A generic “secured endpoint” label is not enough.
19. Message-Based Zero Trust
Zero trust also applies to messaging.
For event-driven systems, ask:
- Who may publish to this topic?
- Who may consume from this topic?
- Who owns the event schema?
- Is payload sensitive?
- Can a consumer trust the producer identity?
- How are poison/fake/replayed messages detected?
- How do we audit event handling?
- Can a compromised producer trigger unauthorized business action?
Example event envelope:
{
  "eventId": "evt-2026-000182",
  "eventType": "DecisionProposalCreated",
  "producer": "spiffe://regulator.example/ns/enforcement/sa/decision-service",
  "tenantId": "regulator-a",
  "subject": {
    "userId": "investigator-123",
    "workloadId": "case-api"
  },
  "resource": {
    "caseId": "CASE-2026-001",
    "proposalId": "DP-001"
  },
  "occurredAt": "2026-07-05T10:15:30Z",
  "schemaVersion": 1,
  "traceId": "7d7a..."
}
Consumers should not blindly trust event payload if the event triggers sensitive action.
They should verify:
- topic/source identity
- event type allowed from producer
- schema version
- tenant/resource consistency
- idempotency
- ordering/version
- authorization if the action crosses resource boundary
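A few of these checks can be sketched as a consumer-side gate. The sketch makes several assumptions: `EventGate` and this trimmed `EventEnvelope` are hypothetical, the `producer` value is taken to come from broker-authenticated identity rather than from the payload alone, and the in-memory set of seen IDs stands in for a durable idempotency store.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

record EventEnvelope(String eventId, String eventType, String producer,
                     String tenantId, int schemaVersion) {
    EventEnvelope {
        Objects.requireNonNull(eventId, "eventId");
        Objects.requireNonNull(eventType, "eventType");
        Objects.requireNonNull(producer, "producer");
        Objects.requireNonNull(tenantId, "tenantId");
    }
}

final class EventGate {
    private final Map<String, Set<String>> allowedProducersByType;
    private final Set<Integer> supportedSchemaVersions;
    // In-memory only; a real consumer would need a durable store.
    private final Set<String> seenEventIds = ConcurrentHashMap.newKeySet();

    EventGate(Map<String, Set<String>> allowedProducersByType,
              Set<Integer> supportedSchemaVersions) {
        this.allowedProducersByType = Map.copyOf(allowedProducersByType);
        this.supportedSchemaVersions = Set.copyOf(supportedSchemaVersions);
    }

    // Returns true only if the event passes every check and was not seen before.
    boolean accept(EventEnvelope event, String expectedTenantId) {
        Set<String> producers =
                allowedProducersByType.getOrDefault(event.eventType(), Set.of());
        if (!producers.contains(event.producer())) return false;            // wrong producer
        if (!supportedSchemaVersions.contains(event.schemaVersion())) return false;
        if (!event.tenantId().equals(expectedTenantId)) return false;       // tenant mismatch
        return seenEventIds.add(event.eventId());                           // false on replay
    }
}
```

Ordering and resource-level authorization are left out here; they depend on domain state that the gate does not have.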
20. Database and Secret Boundaries
Zero trust requires database access to be scoped.
Bad model:
All services use the same DB user.
Better model:
case-api uses case_service_rw against case database.
decision-service uses decision_service_rw against decision database.
reporting-service uses reporting_service_ro against reporting database.
Credential rules:
- no shared DB credentials across services
- no static secrets in repository
- no credentials in logs
- no broad admin credentials in app runtime
- short-lived credentials where possible
- rotation tested in staging and production
- emergency credentials audited
- least privilege per service
Even if application authorization has a bug, database isolation should reduce blast radius.
Defense-in-depth means one broken control should not give full system authority.
21. Admin Endpoint Security
Java microservices often expose operational endpoints:
- /actuator/health
- /actuator/metrics
- /actuator/prometheus
- /actuator/env
- /actuator/loggers
- /actuator/threaddump
- /actuator/heapdump
- custom repair endpoints
- replay endpoints
- migration endpoints
Zero trust rule:
Operational endpoints are production APIs with security impact.
Controls:
| Endpoint | Risk | Recommended Control |
|---|---|---|
| health/readiness | low to medium | separate public shallow vs private deep |
| metrics | medium | restrict network, avoid sensitive labels |
| env/config | high | disable or strongly restrict |
| heapdump | critical | disable in production unless controlled |
| loggers mutation | high | authenticated admin, audit |
| replay/repair | critical | break-glass, approval, idempotency, audit |
Do not expose internal operational endpoints just because “only internal users can reach them”.
22. Tenant-Aware Zero Trust
In multi-tenant systems, service identity is not enough.
Example:
reporting-service is allowed to call case-summary-service.
That does not mean:
reporting-service may read every tenant's case summary for every request.
Tenant-aware checks must happen at:
- API boundary
- application service
- query/read model
- cache key
- event envelope
- log attributes
- metric labels with cardinality discipline
- audit record
Bad cache key:
case-summary:CASE-001
Better cache key:
tenant:regulator-a:case-summary:CASE-001
Tenant bugs often bypass network and mTLS controls because the caller is legitimate but the resource scope is wrong.
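One way to keep the better cache key from eroding over time is to build keys through a single helper that refuses to produce a key without a tenant. `TenantCacheKeys` is a hypothetical helper; the colon check is an assumption that guards against one segment bleeding into another.

```java
import java.util.Objects;

final class TenantCacheKeys {
    private TenantCacheKeys() {}

    // Produces keys such as "tenant:regulator-a:case-summary:CASE-001".
    static String caseSummary(String tenantId, String caseId) {
        requireSegment(tenantId, "tenantId");
        requireSegment(caseId, "caseId");
        return "tenant:" + tenantId + ":case-summary:" + caseId;
    }

    private static void requireSegment(String value, String name) {
        Objects.requireNonNull(value, name);
        // A blank or colon-containing segment could collide with another tenant's key.
        if (value.isBlank() || value.contains(":")) {
            throw new IllegalArgumentException("invalid " + name + ": " + value);
        }
    }
}
```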
23. Authorization Must Be Close to the Resource
A common weak design:
Gateway authorizes request.
Downstream services trust gateway.
This works only for simple systems.
In microservices, resource semantics often live downstream.
The service owning the resource is usually the best place to enforce resource authorization.
Why?
Because it knows:
- current resource state
- ownership
- assignment
- tenant
- jurisdiction
- lifecycle state
- sealed/restricted flags
- escalation status
- conflict conditions
- privacy classification
public final class CaseAccessPolicy {
    public void assertCanView(CallerContext caller, CaseRecord caseRecord) {
        if (!caseRecord.tenantId().equals(caller.tenantId())) {
            throw AccessDenied.crossTenantAccess();
        }
        if (caseRecord.isSealed() && !caller.scopes().contains("case.sealed.read")) {
            throw AccessDenied.sealedCase();
        }
        if (!caseRecord.isAssignedTo(caller.userId()) && !caller.scopes().contains("case.supervisor.read")) {
            throw AccessDenied.notAssigned();
        }
    }
}
This policy cannot be fully encoded at the gateway without duplicating domain truth.
24. Policy-as-Code vs Policy-in-Code
There is no universal answer.
Policy-in-Code
Good for:
- domain invariants
- strongly typed domain state
- complex business rules
- code review with feature change
- low latency local decision
Bad for:
- cross-service platform policy
- organization-wide control
- dynamic access rules
- audit by security team
Policy-as-Code
Good for:
- service-to-service allowlist
- environment constraints
- tenant isolation rules
- central governance
- testable policy bundles
- OPA-style decisions
Bad for:
- rules requiring rich domain behavior
- over-centralized policy bottleneck
- difficult local debugging
- policy/data drift
Mature design uses both.
Example split:
| Rule | Location |
|---|---|
| case-api may call decision-service | mesh/OPA/platform policy |
| investigator may propose decision only for assigned case | domain/application policy |
| sealed case requires supervisor role | domain/application policy |
| admin endpoint reachable only from ops namespace | network/platform policy |
| no service may call another service's database | network/IAM policy |
25. Java Security Boundary Placement
A service should have multiple enforcement layers.
Controller layer:
- validate token/caller context
- reject missing identity
- map identity into typed context
- enforce coarse route permission
- never trust arbitrary identity headers
Application layer:
- load resource
- check object authorization
- enforce idempotency
- coordinate transaction
- emit audit event
Domain layer:
- enforce invariant
- reject invalid state transition
- preserve business consistency
Infrastructure:
- use scoped credentials
- apply timeout/retry limits
- log dependency caller identity
26. Avoiding Security Context Leakage
Security context is dangerous if it is global mutable state.
Common Java risk:
- ThreadLocal context leaks across reused threads
- async code loses context
- Reactor context not propagated
- scheduled job inherits stale context
- logging MDC not cleared
- test suite passes because context is accidentally shared
Safer discipline:
- pass CallerContext explicitly into use cases
- keep framework security context at edge
- clear MDC after request
- instrument async propagation deliberately
- create system context explicitly for jobs/workers
- test no-context and wrong-context cases
Bad:
String tenant = SecurityContext.getCurrentTenant();
Better:
public DecisionProposalId handle(ProposeDecisionCommand command) {
    CallerContext caller = command.caller();
    // explicit, testable, auditable
    // ... authorize, execute, and return the proposal ID
}
Framework context is fine at the boundary.
Domain/application code should not become untestable because authorization lives in static globals.
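The ThreadLocal risk is easy to reproduce with a single pooled thread. The sketch below uses a hypothetical `CURRENT_TENANT` thread-local as a stand-in for any ambient security context and shows what a second, unrelated task observes depending on whether the first task cleaned up.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ContextLeakDemo {
    // Hypothetical stand-in for a framework-style ambient context.
    static final ThreadLocal<String> CURRENT_TENANT = new ThreadLocal<>();

    private ContextLeakDemo() {}

    // Runs two tasks on one pooled thread and returns what the second one sees.
    static String tenantSeenBySecondTask(boolean clearAfterFirst) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable firstRequest = () -> {
                CURRENT_TENANT.set("regulator-a");
                try {
                    // ... handle the request for regulator-a ...
                } finally {
                    if (clearAfterFirst) {
                        CURRENT_TENANT.remove();
                    }
                }
            };
            // The second task never sets a tenant, so it should see none.
            Callable<String> secondRequest = CURRENT_TENANT::get;
            pool.submit(firstRequest).get();
            return pool.submit(secondRequest).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With `clearAfterFirst` set to `false`, the second task reads `regulator-a` even though it belongs to no request for that tenant. Passing `CallerContext` as a parameter avoids the problem entirely, and the `finally` block is the fallback wherever ambient context cannot be avoided.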
27. System Actors and Scheduled Jobs
Not every action comes from a human.
Examples:
- workflow timeout escalates overdue case
- nightly reconciliation job repairs read model
- audit projector consumes events
- retention job archives old data
- notification worker sends deadline reminder
These need explicit system actor identity.
public record SystemActor(
    String workloadId,
    String jobName,
    String reason,
    String runId
) {}
Do not use fake user IDs like:
userId = "system"
That collapses audit semantics.
Better audit:
{
  "actorType": "SYSTEM_WORKLOAD",
  "workloadId": "spiffe://regulator.example/ns/enforcement/sa/workflow-worker",
  "jobName": "case-escalation-timeout",
  "runId": "run-2026-07-05-01",
  "reason": "SLA_TIMEOUT"
}
System actors need least privilege too.
A workflow worker should not have broad admin authority just because it is “system”.
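One way to keep human and system actors from collapsing into the same audit shape is to model them as distinct types. The sketch below is illustrative only: `Actor`, `UserActor`, `SystemWorkloadActor`, and `AuditActors` are hypothetical names, and the field names mirror the audit example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

interface Actor {}

// A human acting through a calling workload.
record UserActor(String userId, String viaWorkloadId) implements Actor {}

// A job or worker acting on its own authority; there is no user to record.
record SystemWorkloadActor(String workloadId, String jobName,
                           String runId, String reason) implements Actor {}

final class AuditActors {
    private AuditActors() {}

    static Map<String, String> auditFields(Actor actor) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (actor instanceof UserActor user) {
            fields.put("actorType", "USER");
            fields.put("userId", user.userId());
            fields.put("workloadId", user.viaWorkloadId());
        } else if (actor instanceof SystemWorkloadActor system) {
            fields.put("actorType", "SYSTEM_WORKLOAD");
            fields.put("workloadId", system.workloadId());
            fields.put("jobName", system.jobName());
            fields.put("runId", system.runId());
            fields.put("reason", system.reason());
        } else {
            throw new IllegalArgumentException("unknown actor type");
        }
        return fields;
    }
}
```

A system actor never produces a `userId` field, so audit queries cannot mistake a scheduled job for a person.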
28. Break-Glass Access
Production systems need emergency paths.
But emergency access must be designed, not improvised.
Break-glass controls:
- explicit role
- time-bound access
- approval workflow
- reason required
- strong audit
- command log
- read-only by default
- dual control for destructive actions
- automatic revocation
- post-use review
Emergency tool anti-pattern:
kubectl exec into pod, run SQL manually, no audit record.
Better:
approved operational command -> audited service endpoint -> idempotent repair operation -> evidence record -> runbook-linked ticket
Zero trust does not mean nobody can fix production.
It means emergency authority is explicit, constrained, and reviewable.
29. Observability for Zero Trust
You cannot defend what you cannot see.
Security-relevant telemetry:
- caller workload identity
- callee service
- operation name
- resource type and sanitized ID
- tenant ID
- authorization decision
- deny reason category
- policy version
- token audience/client ID
- mTLS peer identity
- source namespace
- request correlation ID
- admin operation reason
- break-glass session ID
Example structured log:
{
  "event.name": "service_authorization_decision",
  "service.name": "decision-service",
  "operation": "decision.proposal.create",
  "caller.workload_id": "spiffe://regulator.example/ns/enforcement/sa/case-api",
  "subject.user_id": "investigator-123",
  "tenant.id": "regulator-a",
  "resource.type": "case",
  "resource.id_hash": "sha256:...",
  "decision": "DENY",
  "deny.reason": "CASE_NOT_ASSIGNED",
  "policy.version": "decision-policy-2026-07-01",
  "trace.id": "7d7a..."
}
Do not log sensitive payload merely for security.
Log decision evidence, not secrets.
30. Deny Decisions Are Signals
Authorization denies should not disappear.
But they should not always page humans either.
Classify deny signals:
| Deny Type | Example | Response |
|---|---|---|
| Expected user deny | user opens unassigned case | normal audit, no alert |
| Suspicious user deny burst | many object IDs attempted | security signal |
| Unexpected service deny | service tries unapproved endpoint | platform/team alert |
| Cross-tenant deny | tenant mismatch | security signal |
| Admin deny | failed break-glass/admin access | high-severity security signal |
| Policy drift deny | new deployment lacks permission | deployment/runbook signal |
A mature system uses denies to discover:
- attempted abuse
- broken clients
- missing policy rollout
- accidental dependency
- service topology drift
- tenant isolation bugs
31. Local Development Without Destroying Security
Developers often disable security locally.
That creates two risks:
- local code paths differ from production
- authorization bugs are discovered late
Better local model:
- use local fake identity provider
- use explicit test workload identity
- use fixed test tenants
- run policy engine locally if possible
- provide golden test tokens/certs
- never bypass resource authorization
- allow opt-in simplified transport only
- contract test service permissions
Local mode should simplify infrastructure, not erase security semantics.
Example local caller:
security:
  local-mode: true
  workload-id: spiffe://local/ns/dev/sa/case-api
  allowed-test-users:
    - investigator-123
    - supervisor-456
The app still sees a real CallerContext.
32. Test Strategy
Zero trust needs tests at multiple levels.
Unit Tests
- policy allows valid caller/action/resource
- policy denies wrong tenant
- policy denies wrong state
- policy denies missing assignment
- system actor has limited permissions
- audit event includes correct actor
Integration Tests
- service rejects missing token/cert
- service rejects wrong audience
- service rejects unauthorized workload
- controller maps identity correctly
- object authorization works with real repository
Contract Tests
- service permission matrix matches expected operations
- gateway/mesh policy permits documented paths
- undocumented service call fails
Runtime Tests
- denied call produces security telemetry
- credential rotation does not break service
- cert expiry is alerted before failure
- break-glass path records evidence
33. Failure Modes
Zero trust can fail in two directions.
Fail Open
The service allows traffic it should deny.
Examples:
- missing policy defaults to allow
- identity parser accepts unsigned token
- internal caller headers trusted
- network policy disabled by CNI mismatch
- service account shared across workloads
- wildcard mesh authorization
Fail open is dangerous.
Fail Closed
The service denies valid traffic.
Examples:
- certificate rotation failure
- policy bundle stale
- identity provider outage
- token clock skew
- mesh sidecar misconfigured
- namespace label changed
Fail closed protects security but can cause outage.
Architecture must decide fail behavior explicitly.
For high-risk action:
fail closed.
For low-risk read with cached data:
fail closed for sensitive data, maybe degrade for public data.
For emergency operation:
fail closed unless break-glass path is available and audited.
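Making the fail behavior explicit can be as small as one function that decides what happens when the policy check itself is unavailable. In this sketch `FailBehavior` and `Sensitivity` are hypothetical, and "degrade" is simplified to allowing public data through; a real system might serve cached data instead.

```java
import java.util.function.Supplier;

enum Sensitivity { PUBLIC, SENSITIVE }

final class FailBehavior {
    private FailBehavior() {}

    // policyCheck may throw when the policy engine or identity provider is down.
    static boolean isAllowed(Supplier<Boolean> policyCheck, Sensitivity sensitivity) {
        try {
            // A missing (null) result is treated the same as an explicit deny.
            return Boolean.TRUE.equals(policyCheck.get());
        } catch (RuntimeException unavailable) {
            // Explicit choice: only public data may degrade; everything else fails closed.
            return sensitivity == Sensitivity.PUBLIC;
        }
    }
}
```

The value of writing this down is that the outage behavior becomes a reviewed decision rather than whatever the exception handler happened to do.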
34. Rollout Strategy
Do not switch a large estate to strict zero trust overnight.
Pragmatic rollout:
- inventory service-to-service calls
- create service catalog identity per workload
- observe actual call graph
- classify sensitive endpoints
- implement mTLS/workload identity
- run policy in shadow/audit mode
- fix undocumented dependencies
- enforce default deny for high-risk services
- expand enforcement service-by-service
- add drift detection and ownership review
Shadow mode is important.
It reveals calls teams forgot to document.
But shadow mode must have an end date. Otherwise it becomes security theater.
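The difference between shadow and enforce mode can be sketched as a gate that always records a would-be deny but only blocks once enforcement is switched on. `ShadowAwareGate` and `EnforcementMode` are hypothetical names, and the in-memory list stands in for real deny telemetry.

```java
import java.util.ArrayList;
import java.util.List;

enum EnforcementMode { SHADOW, ENFORCE }

final class ShadowAwareGate {
    private final EnforcementMode mode;
    // Stand-in for deny telemetry; a real gate would emit structured logs or metrics.
    private final List<String> recordedDenies = new ArrayList<>();

    ShadowAwareGate(EnforcementMode mode) {
        this.mode = mode;
    }

    // policyAllows is the outcome of whatever policy evaluation runs upstream.
    boolean admit(String caller, String operation, boolean policyAllows) {
        if (policyAllows) {
            return true;
        }
        recordedDenies.add(caller + " -> " + operation);
        // Shadow mode lets the call through but keeps the evidence.
        return mode == EnforcementMode.SHADOW;
    }

    List<String> recordedDenies() {
        return List.copyOf(recordedDenies);
    }
}
```

The recorded denies are the backlog to clear before flipping a service to `ENFORCE`: each entry is either a missing permission or a dependency nobody documented.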
35. Architecture Decision Record Template
# ADR: Service-to-Service Zero Trust Model for Decision Service
## Context
Decision Service owns decision proposals, approval workflow state, and decision rationale.
It exposes sensitive endpoints used by case-api, workflow-worker, audit-projector, and reporting-service.
## Decision
We will enforce workload identity via mTLS/SPIFFE-style identities.
Decision Service will default-deny all service callers and allow only documented operations.
Object-level and business authorization remain inside Decision Service application/domain layer.
## Allowed Callers
- case-api: create proposal, read proposal summary
- workflow-worker: timeout proposal, escalate overdue decision
- audit-projector: consume emitted events only, no HTTP write endpoint
- reporting-service: read redacted decision summary
## Denied Paths
- notification-worker cannot call Decision Service write endpoints
- reporting-service cannot access Decision DB
- any service outside enforcement/reporting namespaces denied by default
## Consequences
- service catalog must maintain permission matrix
- local development needs test workload identity
- policy drift becomes deployment risk
- deny telemetry becomes required signal
## Fitness Functions
- undocumented service call fails in staging
- all sensitive endpoints log authorization decision
- mesh/network policy matches service catalog
- credential/cert rotation tested monthly
36. Review Checklist
Before approving a Java microservice for zero-trust production:
- Does the service have a stable workload identity?
- Are service-to-service callers explicitly allowlisted?
- Is mTLS enabled or equivalent peer authentication enforced?
- Are internal headers treated as untrusted unless produced by trusted infrastructure?
- Does the service enforce object-level authorization close to resource ownership?
- Are tenant boundaries checked at API, query, cache, event, and audit layers?
- Are admin/actuator endpoints restricted?
- Are database credentials scoped per service?
- Are messaging producers/consumers authorized?
- Are system actors explicit and auditable?
- Is break-glass access time-bound and logged?
- Are deny decisions observable?
- Is policy drift detectable?
- Does local development preserve security semantics?
- Is fail-open vs fail-closed behavior explicit?
37. Final Mental Model
Zero trust for microservices is not “trust nobody” in a vague sense.
It is this:
Every service call must carry verifiable identity, pass explicit policy, preserve resource-level authorization, and leave evidence.
The strongest systems do not rely on one giant security layer.
They combine:
- workload identity
- mTLS
- network segmentation
- service permission
- object authorization
- domain invariants
- secret isolation
- admin endpoint control
- telemetry
- auditability
A service is not safe because it runs inside the cluster.
A service is safe when every meaningful boundary is explicit, enforced, observable, and reviewable.
38. Practical Exercise
Design zero-trust controls for this scenario:
- case-api submits a decision proposal to decision-service.
- workflow-worker later escalates overdue proposals.
- reporting-service reads redacted decision summaries.
- notification-worker sends email reminders but should never mutate decision state.
Create:
- workload identity list
- service permission matrix
- network allowlist
- sensitive endpoint classification
- object authorization rule
- system actor model
- deny telemetry schema
- break-glass rule
- local development identity setup
- rollout plan from shadow mode to enforcement
Then answer:
If notification-worker is compromised, which controls prevent it from approving a decision?
That answer reveals whether zero trust is only a diagram, or a real architecture.