Authentication Operational Runbook

Learn Java Authentication Pattern - Part 038

Operational runbook for Java authentication systems: incident classification, key compromise, token leak, session purge, refresh-token reuse, password hash breach, API key leak, IdP outage, account takeover campaign, tenant confusion, evidence handling, containment, recovery, and post-incident hardening.

Part 038 — Authentication Operational Runbook

Goal of this part: build an operational runbook for authentication incidents. The focus is not general incident response theory, but the concrete steps to take when a Java auth system faces key compromise, token leak, refresh-token reuse, session compromise, password hash breach, API key leak, IdP outage, an account takeover campaign, or tenant confusion.

An authentication system is not finished when the login feature works.

The more important production questions:

What happens when a signing key leaks?
What happens when access tokens appear in logs?
What happens when Redis session store is compromised?
What happens when refresh-token reuse is detected?
What happens when IdP JWKS endpoint is down?
What happens when one tenant's token is accepted by another tenant route?
What happens when password hashes are exfiltrated?
What happens when credential stuffing traffic rises 100x?

If the answer is “we will discuss it when it happens”, the system is not yet production-grade.

A runbook is system design written in the form of actions.


1. Mental model: incident is a state transition

An authentication incident should be treated as a state machine.

Invariants:

Do not destroy evidence during containment.
Do not preserve availability by accepting invalid authentication.
Do not rotate one secret while leaving derived/session state valid without analysis.
Do not communicate certainty beyond evidence.

Authentication incident response is not just “rotate the secret”. Often you must rotate the secret, revoke dependent tokens, invalidate sessions, patch validation logic, preserve the audit trail, and notify affected parties.
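The state-machine view can be sketched in Java. The state names and the allowed transitions below are assumptions chosen for illustration, not a standard lifecycle. The point is only that an incident moves through explicit steps and that skipping one is rejected.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical incident lifecycle; state names are illustrative assumptions.
enum IncidentState {
    DETECTED, TRIAGED, CONTAINED, RECOVERED, REVIEWED;

    // Allowed next states for each state.
    Set<IncidentState> next() {
        switch (this) {
            case DETECTED:  return EnumSet.of(TRIAGED);
            case TRIAGED:   return EnumSet.of(CONTAINED);
            case CONTAINED: return EnumSet.of(RECOVERED, TRIAGED); // re-triage if containment was incomplete
            case RECOVERED: return EnumSet.of(REVIEWED);
            default:        return EnumSet.noneOf(IncidentState.class);
        }
    }

    boolean canMoveTo(IncidentState target) {
        return next().contains(target);
    }
}

final class Incident {
    private IncidentState state = IncidentState.DETECTED;

    IncidentState state() { return state; }

    // Rejects transitions that are not part of the scripted lifecycle.
    void moveTo(IncidentState target) {
        if (!state.canMoveTo(target)) {
            throw new IllegalStateException(state + " -> " + target + " is not allowed");
        }
        state = target;
    }
}
```

In this sketch an incident cannot jump from containment straight to review, which mirrors the idea that recovery must be verified before the incident is closed.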


2. Severity classification

Use severity based on blast radius and authentication assurance impact.

| Severity | Condition | Example |
|----------|-----------|---------|
| SEV-1 | Active or likely compromise of authentication trust root | Signing private key leaked, verifier accepts invalid tokens, cross-tenant token acceptance |
| SEV-2 | Credential/session/token compromise affecting multiple users/clients | Token logs exposed, refresh-token reuse campaign, API key leak at major client |
| SEV-3 | Attack attempt with controls holding | Credential stuffing spike, enumeration spike, unknown kid storm failing closed |
| SEV-4 | Localized auth degradation | One IdP integration outage, MFA provider intermittent failure |

Severity is not determined by user count alone. A single cross-tenant auth bypass can be SEV-1 even with few observed requests.


3. First 15 minutes checklist

The first minutes should be boring and scripted.

[ ] Declare incident channel and commander.
[ ] Freeze risky deploys except approved emergency changes.
[ ] Preserve relevant logs, metrics, traces, audit events.
[ ] Identify suspected mechanism: password/session/JWT/OIDC/API key/mTLS/MFA/tenant.
[ ] Determine if invalid authentication is being accepted.
[ ] Determine if active abuse is ongoing.
[ ] Decide immediate containment: block, revoke, rotate, disable, degrade.
[ ] Record every action with timestamp and operator.

Do not start with a database migration or key rotation before you understand the dependency graph.

If a trust root is compromised, credentials derived from it may also need revocation.
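One way to make the dependency graph concrete is to encode it and walk it. The sketch below is a minimal illustration: the credential names and the edges between them are assumptions, and a real system would derive them from its own architecture.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical dependency graph: each entry maps a credential type to what is derived from it.
final class CredentialDependencies {
    static final Map<String, List<String>> DERIVED = Map.of(
        "signing-key", List.of("access-token", "id-token"),
        "refresh-token", List.of("access-token"),
        "session", List.of("refresh-token", "downstream-token"),
        "password", List.of("session", "remember-me-token")
    );

    // Breadth-first walk: everything reachable from the compromised root needs review.
    static Set<String> blastRadius(String compromisedRoot) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(compromisedRoot);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String child : DERIVED.getOrDefault(current, List.of())) {
                if (seen.add(child)) {
                    queue.add(child);
                }
            }
        }
        return seen;
    }
}
```

With these assumed edges, `CredentialDependencies.blastRadius("password")` returns sessions, remember-me tokens, refresh tokens, downstream tokens, and access tokens, which is the list to review before deciding what to revoke.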


4. Evidence handling

Authentication evidence is sensitive.

Collect:

authentication audit events
security logs
resource server rejection logs
IdP admin events
JWKS/key rotation events
session creation/revocation events
refresh token rotation events
API key usage events
rate limiter metrics
network/proxy access logs
application deploy timeline
configuration changes

Avoid collecting raw secrets:

passwords
access tokens
refresh tokens
session ids
API key secrets
authorization codes
MFA codes
recovery codes

If secrets already exist in logs, treat log storage as contaminated.
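Evidence often needs to reference a token or session without containing it. A minimal sketch, assuming a truncated SHA-256 fingerprint is acceptable for correlation: the class name and the `sha256:` prefix are hypothetical. This is only reasonable for high-entropy values such as tokens and session ids; low-entropy secrets such as passwords or MFA codes can be brute-forced from an unsalted hash and should not be fingerprinted this way.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class EvidenceFingerprint {
    // Returns a short, non-reversible identifier for a high-entropy secret so
    // evidence can correlate events without storing the raw value.
    static String of(String secret) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(secret.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("sha256:");
            for (int i = 0; i < 8; i++) { // first 8 bytes = 16 hex characters
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

The same input always yields the same fingerprint, so two log lines that mention the same token can be linked without either line holding the token itself.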

Evidence record template:

Incident ID:
Time window:
Detected by:
Mechanism affected:
Tenants affected:
Users/clients affected:
Indicators:
Controls observed:
Immediate containment:
Open questions:
Evidence locations:
Access restrictions:

5. Runbook: JWT signing key compromise

5.1 Symptoms

private key exposed in repository, logs, build artifact, or secret manager incident
unexpected tokens signed with valid key
resource servers accepting tokens not issued by trusted IdP flow
unknown token issuance pattern

5.2 Immediate containment

[ ] Confirm key id (`kid`) and issuer.
[ ] Stop new token issuance with compromised key.
[ ] Generate new signing key pair in secure environment.
[ ] Publish new JWKS key if not already available.
[ ] Change active signing key.
[ ] Decide whether to remove compromised public key immediately or after emergency window.
[ ] Force resource servers to refresh JWKS cache.
[ ] Revoke/expire affected access tokens if possible.
[ ] Revoke refresh token families that could mint new access tokens.
[ ] Increase monitoring for old `kid` usage.

Critical decision:

Remove old key immediately => invalidates all tokens signed by it, may cause outage.
Keep old key briefly => attacker may continue using forged tokens.

For confirmed private key compromise, prefer security. Availability impact is acceptable compared to accepting forged tokens.

5.3 Java/Spring resource server considerations

Resource servers usually cache JWKS.

Containment requires:

cache eviction endpoint or restart
short emergency JWKS cache TTL
alert for old kid
issuer/audience validation verification

Pseudo-control:

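public interface JwksCacheControl {
    // Drop every cached key for an issuer, forcing a fresh JWKS fetch.
    void evictIssuer(String issuer);
    // Drop a single cached key, for example a compromised kid.
    void evictKey(String issuer, String kid);
}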

Do not implement “accept token if JWKS unavailable” as emergency fallback.
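A minimal in-memory sketch of such a control, assuming keys are held as opaque strings. `InMemoryJwksCache`, `put`, and `find` are hypothetical names, not part of Spring Security or any library. The lookup returns an empty result for an unknown issuer or `kid`, so a verifier built on it has nothing to validate against and must reject.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Same shape as the pseudo-control above, repeated so the sketch compiles on its own.
interface JwksCacheControl {
    void evictIssuer(String issuer);
    void evictKey(String issuer, String kid);
}

// In-memory stand-in for a resource server key cache; keys are opaque strings here.
final class InMemoryJwksCache implements JwksCacheControl {
    private final Map<String, Map<String, String>> keysByIssuer = new ConcurrentHashMap<>();

    void put(String issuer, String kid, String publicKeyPem) {
        keysByIssuer.computeIfAbsent(issuer, i -> new ConcurrentHashMap<>()).put(kid, publicKeyPem);
    }

    // Fail closed: an unknown issuer or kid yields no key, so verification must reject.
    Optional<String> find(String issuer, String kid) {
        return Optional.ofNullable(keysByIssuer.getOrDefault(issuer, Map.of()).get(kid));
    }

    @Override
    public void evictIssuer(String issuer) {
        keysByIssuer.remove(issuer);
    }

    @Override
    public void evictKey(String issuer, String kid) {
        Map<String, String> keys = keysByIssuer.get(issuer);
        if (keys != null) {
            keys.remove(kid);
        }
    }
}
```

In a multi-pod deployment each pod holds its own cache, so an eviction has to reach every instance before the compromised `kid` is truly rejected.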

5.4 Validation checklist after rotation

[ ] New tokens use new `kid`.
[ ] Old compromised `kid` is rejected or accepted only under explicit temporary window.
[ ] Resource servers refreshed key cache.
[ ] Tokens with wrong issuer rejected.
[ ] Tokens with wrong audience rejected.
[ ] Unsigned/none-alg tokens rejected.
[ ] Audit detects any old `kid` usage.

5.5 Post-incident hardening

[ ] Move signing to HSM/KMS if appropriate.
[ ] Reduce access to private key material.
[ ] Separate signing key by environment/tenant if needed.
[ ] Add secret scanning for key format.
[ ] Test emergency key rotation quarterly.
[ ] Document max JWKS cache staleness.

6. Runbook: access token leak

6.1 Symptoms

access tokens found in logs, analytics, browser crash reports, support tickets, URL query strings, referrer headers, third-party tooling

6.2 Triage questions

Are tokens JWT or opaque?
Are they expired?
What issuer/audience/client/tenant?
Were refresh tokens also leaked?
Was leakage continuous or one-time?
Who had access to the logs/tool?
Can leaked tokens be replayed from attacker environment?

6.3 Containment

For short-lived JWT access tokens:

[ ] Stop further leakage immediately.
[ ] Reduce token lifetime if systemic.
[ ] Revoke related refresh tokens if leak includes refresh token or long-lived path.
[ ] Add detection for leaked token `jti` if present.
[ ] Rotate session/token if user session path contaminated.

For opaque tokens:

[ ] Revoke token through token store/introspection source.
[ ] Invalidate cache entries in resource servers.
[ ] Search for related tokens from same session/client.

If tokens were logged, the log system becomes part of the blast radius.

[ ] Restrict log access.
[ ] Create sanitized copy if investigation needs broad sharing.
[ ] Delete/expire contaminated logs according to legal/security policy.
[ ] Fix logger/redaction at source.

6.4 Prevention controls

Authorization header redaction
query parameter denylist
structured logging allowlist
HTTP client interceptor redaction
proxy/access-log redaction
support bundle scrubber
browser URL fragment/callback hygiene
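Two of these controls, Authorization header redaction and a query parameter denylist, can be sketched with plain regular expressions. The parameter names in the denylist and the `[REDACTED]` marker are assumptions for illustration; a real deployment would prefer a structured logging allowlist over pattern matching on free text.

```java
import java.util.regex.Pattern;

final class LogRedactor {
    // Matches "Bearer <token>" in header-like text, case-insensitively.
    private static final Pattern BEARER =
        Pattern.compile("(?i)(bearer\\s+)[A-Za-z0-9._~+/=-]+");

    // Hypothetical denylist of sensitive query parameters.
    private static final Pattern QUERY_SECRET =
        Pattern.compile("(?i)([?&](?:access_token|refresh_token|id_token|code)=)[^&\\s]+");

    // Keeps the scheme or parameter name and replaces only the secret value.
    static String redact(String line) {
        String out = BEARER.matcher(line).replaceAll("$1[REDACTED]");
        return QUERY_SECRET.matcher(out).replaceAll("$1[REDACTED]");
    }
}
```

For example, `LogRedactor.redact("GET /cb?code=xyz&state=1")` keeps the `state` parameter and replaces only the authorization code value.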

7. Runbook: refresh token reuse detected

Refresh token rotation detects replay when an already-used refresh token appears again.

RFC 9700 describes refresh token rotation as issuing a new refresh token on each refresh and invalidating the previous one while retaining relationship information to detect compromise.

7.1 Signal

refresh_token_reuse_detected{client_id, tenant_id, token_family_id}

7.2 Immediate response

[ ] Mark token family compromised.
[ ] Revoke all descendants in family.
[ ] Revoke active access tokens associated with family if possible.
[ ] Terminate related sessions.
[ ] Require user/client re-authentication.
[ ] Notify user/client if policy requires.
[ ] Record source IP/device/client metadata for investigation.

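State machine: a token family starts active, moves to compromised when reuse is detected, and ends revoked once all descendants are revoked and re-authentication is required.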

7.3 Common bug

A race condition between concurrent refreshes can look like reuse:

Two legitimate refresh requests arrive concurrently.
Both see the old token as active.
Both issue a new child.

A correct implementation needs atomic rotation.

UPDATE refresh_token
SET status = 'ROTATED', rotated_at = now()
WHERE id = :token_id
  AND status = 'ACTIVE';

Only one request can move a given token from ACTIVE to ROTATED.

If affected rows = 0, the token was not active when the update ran. Investigate whether that is reuse, expiry, or a concurrent rotation that already won the race.
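The same compare-and-set idea can be shown without a database. This is an in-memory stand-in, not the author's implementation: `RefreshTokenStore`, `TokenStatus`, and `RotationResult` are hypothetical names, and `ConcurrentMap.replace(key, expected, new)` plays the role of the conditional `UPDATE`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

enum TokenStatus { ACTIVE, ROTATED, REVOKED }

enum RotationResult { ROTATED_NOW, NOT_ACTIVE, UNKNOWN_TOKEN }

// In-memory stand-in for the refresh_token table.
final class RefreshTokenStore {
    private final ConcurrentMap<String, TokenStatus> statusById = new ConcurrentHashMap<>();

    void issue(String tokenId) {
        statusById.put(tokenId, TokenStatus.ACTIVE);
    }

    // Mirrors "UPDATE ... WHERE id = ? AND status = 'ACTIVE'":
    // the conditional replace succeeds for exactly one caller.
    RotationResult rotate(String tokenId) {
        if (statusById.replace(tokenId, TokenStatus.ACTIVE, TokenStatus.ROTATED)) {
            return RotationResult.ROTATED_NOW;
        }
        // Equivalent of "affected rows = 0": the caller must decide whether this is reuse.
        return statusById.containsKey(tokenId)
            ? RotationResult.NOT_ACTIVE
            : RotationResult.UNKNOWN_TOKEN;
    }
}
```

Only the caller that receives `ROTATED_NOW` should issue a child token. What to do with `NOT_ACTIVE` (treat as reuse immediately, or allow a short grace period for concurrent legitimate requests) is a policy decision this sketch leaves open.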


8. Runbook: session compromise or session store compromise

8.1 Individual session compromise

Symptoms:

user reports suspicious activity
impossible travel
same session id used from different networks/devices
session id leaked in logs

Containment:

[ ] Revoke specific session id.
[ ] Rotate user's active sessions if necessary.
[ ] Revoke remember-me tokens.
[ ] Require re-authentication / MFA step-up.
[ ] Review account changes after suspected compromise time.

8.2 Session store compromise

If Redis/database session store is exposed:

[ ] Isolate session store network access.
[ ] Rotate session encryption/signing keys if used.
[ ] Revoke all sessions in affected environment.
[ ] Force global re-login.
[ ] Review whether session payload contained secrets.
[ ] Rotate downstream tokens stored in session.
[ ] Patch network/IAM/security group/config.

Global session purge should be tested before incident.

Pseudo-interface:

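public interface SessionIncidentService {
    // Every call takes a reason so the revocation can be audited.
    // The int return is assumed here to be the number of sessions revoked.
    int revokeSession(String sessionId, String reason);
    int revokeAllForAccount(String accountId, String reason);
    int revokeAllForTenant(String tenantId, String reason);
    int revokeAll(String environment, String reason);
}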

8.3 Session purge failure modes

session index stale
Redis scan too slow
application local session cache not evicted
sticky sessions retain old context
logout event not propagated

Validation:

[ ] Old session cookie rejected.
[ ] Existing API request cannot continue with old SecurityContext after boundary.
[ ] All pods observe revocation.
[ ] Metrics show session recreation after login only.

9. Runbook: password hash database breach

A password hash breach is not the same as a plaintext password breach, but it is serious.

9.1 Triage

Which tables/columns were accessed?
Were password hashes exfiltrated or only read?
Which algorithm/parameters?
Were salts included?
Was pepper used?
Was pepper exposed?
Were reset tokens/recovery codes/MFA secrets also exposed?
What time window?

9.2 Containment

[ ] Stop data exfiltration path.
[ ] Rotate database credentials and affected app secrets.
[ ] If pepper exposed, rotate pepper strategy carefully.
[ ] Invalidate password reset tokens/recovery codes if exposed.
[ ] Increase credential stuffing monitoring.
[ ] Force password reset for affected users if risk/policy requires.
[ ] Block known compromised passwords during reset.
[ ] Rehash credentials with upgraded parameters on reset/login.

If a pepper is used and exposed, rotating it may require a password reset or a multi-pepper migration, depending on the design.

9.3 User protection

Controls:

password reset campaign
MFA enrollment/step-up
breached password screening
suspicious login detection
session revocation after password reset
notification with concrete guidance

9.4 Post-incident hardening

[ ] Upgrade weak hashes.
[ ] Remove legacy hash acceptance after migration window.
[ ] Review DB access control.
[ ] Add anomaly detection for credential table reads.
[ ] Add canary credentials / honey hashes if appropriate.
[ ] Verify backup/security snapshot exposure.

10. Runbook: API key leak

API keys are common in machine-to-machine integrations.

10.1 Triage

Which API key prefix?
Which client/tenant/environment?
What scopes?
When was it last used?
Was it used from unusual IP/location?
Was it production or sandbox?
Was it logged in client code/repository/support ticket?

10.2 Containment

[ ] Disable leaked key.
[ ] Create replacement key with least privilege.
[ ] Notify client owner.
[ ] Monitor old key usage attempts.
[ ] Review actions performed by leaked key.
[ ] Rotate related webhook/HMAC secrets if the same client's secret hygiene failed.

If client cannot rotate immediately:

temporary allowlist IP
scope reduction
short emergency overlap window
heightened monitoring
explicit expiry

Do not silently extend leaked key lifetime without risk acceptance.

10.3 Prevention

key prefix for lookup and identification
hashed secret storage
last-used throttled audit
secret scanning pattern
client self-service rotation
dual-key overlap rotation
scope minimization

11. Runbook: HMAC/webhook secret compromise

HMAC shared secret compromise lets an attacker sign requests.

Containment:

[ ] Identify key id/client id.
[ ] Disable compromised secret or mark as retiring.
[ ] Issue new secret.
[ ] Support dual-signature overlap if required.
[ ] Reduce replay window if under attack.
[ ] Clear nonce cache if semantics require.
[ ] Monitor old key id usage.
[ ] Review signed requests during compromise window.

Validation:

[ ] Old secret rejected after cutoff.
[ ] New secret accepted.
[ ] Requests without timestamp rejected.
[ ] Replayed signatures rejected.
[ ] Canonicalization tests still pass.
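The validation items on timestamp, replay, and signature can be exercised against a small verifier. This is a sketch under stated assumptions: the canonical string `<epochSeconds>.<nonce>.<body>`, the class name, and the in-memory nonce set are all hypothetical, and a real nonce cache would expire entries once they fall outside the replay window.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.time.Duration;
import java.time.Instant;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class WebhookVerifier {
    private final byte[] secret;
    private final Duration replayWindow;
    private final Set<String> seenNonces = ConcurrentHashMap.newKeySet(); // unbounded in this sketch

    WebhookVerifier(byte[] secret, Duration replayWindow) {
        this.secret = secret.clone();
        this.replayWindow = replayWindow;
    }

    // Assumed canonical form for illustration: "<epochSeconds>.<nonce>.<body>".
    byte[] sign(long epochSeconds, String nonce, String body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            String canonical = epochSeconds + "." + nonce + "." + body;
            return mac.doFinal(canonical.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    boolean verify(long epochSeconds, String nonce, String body, byte[] signature, Instant now) {
        long skew = Math.abs(now.getEpochSecond() - epochSeconds);
        if (skew > replayWindow.getSeconds()) {
            return false; // stale or future-dated request
        }
        if (!MessageDigest.isEqual(sign(epochSeconds, nonce, body), signature)) {
            return false; // wrong secret or tampered content
        }
        return seenNonces.add(nonce); // false when the nonce was already used
    }
}
```

The nonce is recorded only after the signature checks out, so an attacker without the secret cannot burn nonces that a legitimate client is about to use.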

12. Runbook: IdP outage

IdP outage can affect:

new login
token refresh
JWKS fetch
introspection
UserInfo
federated logout
admin API automation

Do not treat them all the same.

12.1 Decision table

| Function | If IdP unavailable | Safe behavior |
|----------|--------------------|---------------|
| Existing session | Can continue until local expiry | Continue if session already established |
| JWT validation with cached key | Can continue for known keys | Continue until max stale key window |
| Unknown kid | Cannot validate | Fail closed |
| Opaque token introspection | Cannot verify active status | Usually fail closed |
| New federated login | Cannot complete | Show login unavailable |
| Token refresh | Cannot complete | Require retry; do not mint locally |
| Admin provisioning | Cannot complete | Queue only if idempotent and safe |
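The decision table can be encoded so that the outage behavior is explicit rather than scattered across handlers. The enum and class names below are hypothetical, and the sketch reads "usually fail closed" for introspection as always fail closed.

```java
enum AuthFunction {
    EXISTING_SESSION, JWT_WITH_CACHED_KEY, JWT_WITH_UNKNOWN_KID,
    OPAQUE_INTROSPECTION, FEDERATED_LOGIN, TOKEN_REFRESH, ADMIN_PROVISIONING
}

enum OutageBehavior {
    CONTINUE, FAIL_CLOSED, SHOW_UNAVAILABLE, RETRY_LATER, QUEUE_IF_IDEMPOTENT
}

final class IdpOutagePolicy {
    // keyWithinStaleWindow only matters for JWT_WITH_CACHED_KEY.
    static OutageBehavior decide(AuthFunction function, boolean keyWithinStaleWindow) {
        switch (function) {
            case EXISTING_SESSION:
                return OutageBehavior.CONTINUE;
            case JWT_WITH_CACHED_KEY:
                return keyWithinStaleWindow ? OutageBehavior.CONTINUE : OutageBehavior.FAIL_CLOSED;
            case FEDERATED_LOGIN:
                return OutageBehavior.SHOW_UNAVAILABLE;
            case TOKEN_REFRESH:
                return OutageBehavior.RETRY_LATER;
            case ADMIN_PROVISIONING:
                return OutageBehavior.QUEUE_IF_IDEMPOTENT;
            default:
                // Unknown kid, introspection, and anything unlisted: fail closed.
                return OutageBehavior.FAIL_CLOSED;
        }
    }
}
```

Making fail closed the default branch means a newly added function cannot silently inherit a permissive behavior during an outage.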

12.2 User-facing behavior

Do not show stack trace.
Do not reveal IdP internals.
Do not claim credentials are wrong.
Use clear temporary-unavailable message for login path.
Existing authenticated users may continue if policy allows.

12.3 Internal controls

IdP health dashboard
JWKS cache status
token endpoint error rate
introspection error rate
login callback failure rate
fallback admin/break-glass path

Break-glass access must be stronger, not weaker.

hardware MFA
small allowlisted group
separate audit
time-limited activation
post-use review

13. Runbook: account takeover campaign

Signals:

credential stuffing spike
many failed logins across many accounts
successful logins from unusual networks
MFA push fatigue pattern
password reset spike
risk score spike
known breached credential list usage

Containment:

[ ] Tighten rate limits for affected routes.
[ ] Enable step-up for suspicious logins.
[ ] Block obvious abusive networks/ASNs if reliable.
[ ] Require password reset for confirmed compromised accounts.
[ ] Revoke sessions for confirmed compromised accounts.
[ ] Disable risky recovery paths temporarily if abused.
[ ] Increase audit retention for incident window.

Do not globally lock accounts too aggressively. Account lockout can become attacker-driven denial-of-service.

Better:

progressive throttling
risk-based step-up
known breached password checks
user notification for confirmed suspicious successful login
support workflow for recovery
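Progressive throttling, the first item above, can be as simple as a capped exponential delay per account or source. The base delay, the ceiling, and the number of free attempts below are assumed values for illustration, not recommendations.

```java
import java.time.Duration;

final class ProgressiveThrottle {
    private static final Duration BASE = Duration.ofSeconds(1);  // assumed base delay
    private static final Duration MAX = Duration.ofMinutes(15);  // assumed ceiling
    private static final int FREE_ATTEMPTS = 3;                  // assumed grace before throttling

    // The delay doubles with each failure beyond the free attempts and is capped,
    // so the account is slowed down but never locked permanently.
    static Duration delayAfter(int consecutiveFailures) {
        if (consecutiveFailures <= FREE_ATTEMPTS) {
            return Duration.ZERO;
        }
        int exponent = Math.min(consecutiveFailures - FREE_ATTEMPTS - 1, 20); // bound the shift
        Duration delay = BASE.multipliedBy(1L << exponent);
        return delay.compareTo(MAX) > 0 ? MAX : delay;
    }
}
```

Because the delay has a ceiling, an attacker who keeps failing against a victim's account slows that account's logins but cannot lock the owner out indefinitely.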

14. Runbook: account enumeration attack

Signals:

high volume of login/recovery attempts
identifier spray
response timing anomaly
registration lookup spike
forgot-password event spike

Containment:

[ ] Verify generic response behavior.
[ ] Check response timing distribution for known vs unknown accounts.
[ ] Tighten pre-auth rate limits.
[ ] Add CAPTCHA/challenge only where appropriate.
[ ] Monitor email/SMS sending volume.
[ ] Prevent notification flooding.

Validation:

Known and unknown account responses have equivalent semantics.
Recovery does not reveal account existence.
Registration does not reveal existing account unless intentionally designed and protected.

15. Runbook: MFA provider outage or MFA abuse

15.1 MFA provider outage

Questions:

Which factor is affected? TOTP? SMS? Email? Push? WebAuthn?
Are backup codes available?
Are high-risk actions blocked?
Can existing sessions continue?

Safe behavior:

Do not disable MFA globally as first action.
Prefer alternate enrolled factors.
Allow backup codes if designed.
For low-risk sessions, defer step-up if policy allows.
For high-risk/admin actions, fail closed.

15.2 MFA fatigue/push abuse

Containment:

[ ] Rate limit MFA challenges.
[ ] Require number matching or phishing-resistant factor if available.
[ ] Temporarily disable push for targeted accounts.
[ ] Notify affected users.
[ ] Revoke sessions if compromise confirmed.

16. Runbook: tenant confusion or cross-tenant auth bypass

This is one of the most severe auth incidents.

Symptoms:

token from tenant A accepted on tenant B endpoint
issuer/realm routing mismatch
email domain maps to wrong tenant
session tenant switched without revalidation
admin from one tenant sees another tenant resource

Immediate containment:

[ ] Disable affected route/client/tenant routing path if needed.
[ ] Add emergency check: token tenant must equal route/resource tenant.
[ ] Revoke sessions/tokens created through broken flow.
[ ] Identify all cross-tenant access events.
[ ] Preserve audit logs and data access logs.
[ ] Notify legal/compliance/security leadership.
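The emergency check in the list above can be a very small guard that runs after token validation and before any resource access. `TenantGuard` and `TenantBindingException` are hypothetical names; where the two tenant values come from depends on the application.

```java
final class TenantBindingException extends RuntimeException {
    TenantBindingException(String message) {
        super(message);
    }
}

final class TenantGuard {
    // tokenTenant must come from the validated token or principal, never from a
    // request parameter; routeTenant comes from the resolved route or resource.
    static void requireSameTenant(String tokenTenant, String routeTenant) {
        if (tokenTenant == null || routeTenant == null || !tokenTenant.equals(routeTenant)) {
            // Fail closed: a missing tenant on either side is treated as a mismatch.
            throw new TenantBindingException("token tenant does not match resource tenant");
        }
    }
}
```

Catching `TenantBindingException` in one place also gives a single point to emit the cross-tenant rejection audit event and increment a metric.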

Root cause classes:

issuer not validated per tenant
audience too broad
tenant taken from request parameter instead of authenticated principal
account linked by email alone
shared realm without tenant-bound membership check
resource server tries multiple issuers until one validates

Validation tests after fix:

[ ] Tenant A token rejected on Tenant B route.
[ ] Tenant B session cannot switch to Tenant A by parameter/header.
[ ] OIDC issuer + subject mapping is tenant-safe.
[ ] API key tenant binding enforced.
[ ] Audit event contains authenticated tenant and resource tenant.

17. Runbook: authentication logic regression in deployment

Symptoms:

sudden login failure spike after deploy
all JWTs rejected due to audience config
all sessions invalidated unexpectedly
CSRF token mismatch spike
OIDC callback broken due to redirect URI change

Containment:

[ ] Compare deploy/config timeline.
[ ] Roll back if security invariant remains safe.
[ ] If new version accepted invalid auth, revoke sessions/tokens created during window.
[ ] Run auth regression suite before redeploy.
[ ] Check feature flags and environment-specific config.

Important distinction:

Fail-closed regression: users cannot login, but invalid auth not accepted.
Fail-open regression: invalid/unauthorized auth accepted.

Fail-open requires deeper containment.


18. Operational commands and scripts

Production systems should expose safe internal operations.

Examples:

revoke session by id
revoke all sessions by account
revoke all sessions by tenant
revoke refresh token family
disable API key
rotate signing key
evict JWKS cache
force reauthentication for account/tenant
increase auth risk policy level
put IdP integration in maintenance mode

Guardrails:

admin authentication with step-up
authorization for operation type
dry-run support
idempotency key
audit event for every operation
bounded batch size
progress reporting
rollback where possible

Example operation record:

{
  "operationId": "authop_20260703_001",
  "operation": "REVOKE_ACCOUNT_SESSIONS",
  "actor": "security-ops-user-id",
  "targetAccountId": "acc_123",
  "reason": "suspected_account_takeover",
  "dryRun": false,
  "startedAt": "2026-07-03T10:15:00Z",
  "completedAt": "2026-07-03T10:15:02Z",
  "affectedCount": 4
}
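Two of the guardrails, dry-run support and an idempotency key, can be sketched around a session revocation. The class below is an in-memory illustration with hypothetical names; audit logging, authorization, and batching are left out to keep it short.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class RevokeAccountSessionsOp {
    private final Map<String, Set<String>> sessionsByAccount = new ConcurrentHashMap<>();
    private final Map<String, Integer> resultByIdempotencyKey = new ConcurrentHashMap<>();

    void addSession(String accountId, String sessionId) {
        sessionsByAccount.computeIfAbsent(accountId, a -> ConcurrentHashMap.newKeySet()).add(sessionId);
    }

    // Returns the affected count. A dry run only counts what would be revoked.
    // A repeated idempotency key returns the first result without revoking again.
    int run(String idempotencyKey, String accountId, boolean dryRun) {
        if (dryRun) {
            return sessionsByAccount.getOrDefault(accountId, Set.of()).size();
        }
        return resultByIdempotencyKey.computeIfAbsent(idempotencyKey, key -> {
            Set<String> removed = sessionsByAccount.remove(accountId);
            return removed == null ? 0 : removed.size();
        });
    }
}
```

A retry of the same operation during an incident, for example after a timeout, then reports the original `affectedCount` instead of revoking sessions created after the first run.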

19. Audit event taxonomy for operations

Minimum operational audit events:

SIGNING_KEY_CREATED
SIGNING_KEY_ACTIVATED
SIGNING_KEY_RETIRED
JWKS_CACHE_EVICTED
SESSION_REVOKED
ACCOUNT_SESSIONS_REVOKED
TENANT_SESSIONS_REVOKED
REFRESH_TOKEN_FAMILY_REVOKED
API_KEY_DISABLED
MFA_FACTOR_RESET
PASSWORD_RESET_FORCED
BREAK_GLASS_ACCESS_USED
AUTH_POLICY_CHANGED
IDP_CONFIGURATION_CHANGED
TENANT_ROUTING_CHANGED

Every event should include:

actor
target
reason
correlation id
request id
before/after state if safe
timestamp
source system
approval reference if required

Never include raw secrets.


20. Communication model

Authentication incidents often require clear internal communication.

Internal update structure:

Status:
Impact:
Mechanism affected:
Known affected tenants/users/clients:
Containment completed:
Current risk:
Next action:
Open questions:
ETA policy:

Avoid vague statements:

Bad: We had an auth issue.
Better: We detected refresh-token reuse for client X and revoked affected token families. No evidence currently indicates signing key compromise.

External communication depends on legal/compliance policy. Technical team should provide precise facts and uncertainty boundaries.


21. Post-incident review

A post-incident review should produce engineering changes, not only a timeline.

Questions:

Which invariant failed or was at risk?
Which detection fired?
Which detection was missing?
Which containment step was manual?
Which operation lacked tooling?
Which logs were insufficient?
Which secret/token appeared somewhere it should not?
Which test would have caught this before production?
Which architecture decision increased blast radius?

Output:

new regression tests
new alert
new runbook step
automated operation
auth model change
secret handling improvement
tenant isolation improvement
policy/ADR update

22. Runbook drill program

Runbooks rot unless tested.

Drills:

quarterly signing key rotation drill
monthly session revoke-all dry run in staging
refresh-token reuse simulation
JWT unknown kid storm simulation
IdP outage game day
API key leak tabletop
password hash breach tabletop
tenant confusion regression drill
MFA provider outage drill

For each drill, measure:

time to detect
time to triage
time to contain
tooling gaps
operator confusion
missing permissions
log quality
customer impact estimate

23. Production readiness checklist

Trust root and key operations

[ ] Signing keys have owner, lifecycle, rotation cadence.
[ ] Emergency key rotation tested.
[ ] JWKS cache eviction supported.
[ ] Resource servers fail closed on invalid/unknown keys.
[ ] Private key access audited.

Token/session operations

[ ] Access token leak response documented.
[ ] Refresh token family revoke implemented.
[ ] Session revoke by account/tenant/global implemented.
[ ] Revocation propagates to all pods.
[ ] Existing sessions can be forced to reauthenticate.

Credential operations

[ ] Password hash breach runbook exists.
[ ] Legacy hash migration plan exists.
[ ] Forced password reset process exists.
[ ] Recovery token invalidation exists.
[ ] MFA reset operation is audited and protected.

Client/API operations

[ ] API key disable/rotate workflow exists.
[ ] HMAC secret rotation supports overlap/cutoff.
[ ] Client owner mapping exists.
[ ] Scope reduction can be applied quickly.

Detection and evidence

[ ] Auth audit events are structured.
[ ] Logs redact secrets.
[ ] Token/session/API key IDs are represented by safe hashes/prefixes.
[ ] Incident queries are prewritten.
[ ] Evidence retention policy is known.

24. Exercises

Exercise 1 — Signing key compromise tabletop

Assume:

JWT signing private key for production issuer leaked in CI logs.
Tokens are valid for 15 minutes.
JWKS cache TTL in resource servers is 1 hour.

Design:

containment steps
resource server cache refresh strategy
old key retirement decision
token/session impact
communications
post-incident hardening

Exercise 2 — Refresh-token reuse drill

Simulate reuse of a rotated refresh token.

Verify:

token family revoked
active access tokens invalidated if supported
user forced to reauthenticate
audit event emitted
false-positive race condition avoided

Exercise 3 — Session store compromise

Assume Redis session store exposed for 20 minutes.

Define:

blast radius
global session purge procedure
secret rotation
user impact
validation queries
post-incident controls

Exercise 4 — Tenant confusion regression

Create a test where token from tenant A is replayed against tenant B route.

Expected:

request rejected
cross-tenant rejection audit event created
no resource access
metric increments

25. Key takeaways

Authentication operations must be designed before an incident.

Core rules:

Runbook is part of architecture.
Key compromise requires dependent-token analysis.
Token leak requires log/tool blast-radius analysis.
Refresh-token reuse should revoke the whole token family.
Session compromise requires tested purge operations.
Password hash breach response depends on algorithm, parameter, pepper, and exposed recovery data.
IdP outage must not become reason to skip validation.
Tenant confusion is usually SEV-1 even with small observed impact.
Every emergency operation must be audited.
Every incident should produce tests and tooling improvements.

Production-grade authentication is not the absence of incidents.

It is the ability to contain incidents without guessing.


References

  • NIST SP 800-63B-4 — Digital Identity Guidelines: Authentication and Authenticator Management.
  • OWASP Authentication Cheat Sheet.
  • OWASP Password Storage Cheat Sheet.
  • OWASP Logging Cheat Sheet.
  • RFC 7009 — OAuth 2.0 Token Revocation.
  • RFC 7662 — OAuth 2.0 Token Introspection.
  • RFC 8725 — JSON Web Token Best Current Practices.
  • RFC 9700 — Best Current Practice for OAuth 2.0 Security.
  • Spring Security Reference — OAuth2 Resource Server, JWT validation, session management, password storage.
  • Keycloak Server Administration and Securing Applications documentation.