
Learn Aws Part 030 Multi Region Architecture Dr Data Replication And Traffic Control


title: Learn AWS Engineering Mastery - Part 030
description: Multi-region architecture on AWS through disaster recovery strategy, RTO/RPO contracts, Route 53 and ARC traffic control, data replication, failover, failback, consistency, and operational drills.
series: learn-aws
seriesTitle: Learn AWS Engineering Mastery
order: 30
partTitle: Multi-Region Architecture: DR, Data Replication, and Traffic Control
tags: aws, cloud, architecture, multi-region, disaster-recovery, reliability, data-replication
date: 2026-07-01

Learn AWS Engineering Mastery - Part 030

Multi-Region Architecture: DR, Data Replication, and Traffic Control

Multi-region architecture is one of the most misunderstood areas of AWS system design. It is often presented as the ultimate maturity level, but in practice it is a trade-off. Multi-region can improve survivability against regional impairment, reduce latency for global users, or satisfy data-residency requirements. It can also multiply cost, operational complexity, consistency problems, deployment risk, observability burden, security surface, and incident response difficulty.

A senior AWS engineer does not ask, “Can we run this in multiple Regions?” The better question is:

What failure or business requirement justifies multi-region complexity, and what exact RTO/RPO/consistency contract must the design satisfy?

This part teaches how to answer that question.

1. Target Skill

After this part, you should be able to:

distinguish high availability, disaster recovery, regional redundancy, and active-active architecture;
choose between backup/restore, pilot light, warm standby, active/passive, and active/active;
define RTO and RPO as testable contracts;
design multi-region traffic control using Route 53, health checks, failover routing, latency routing, weighted routing, and Amazon Application Recovery Controller where appropriate;
reason about Aurora Global Database, DynamoDB global tables, S3 replication, Kinesis/MSK replication alternatives, ECR replication, secrets replication, and KMS key strategy;
avoid split brain, data loss, stale reads, duplicate processing, inconsistent authorization, and broken failback;
design failover and failback runbooks;
evaluate when multi-region is overengineering.

2. Kaufman Skill Decomposition

Multi-region architecture decomposes into the following sub-skills:

Kaufman-style learning here means you practice failover reasoning repeatedly until you can predict failure behavior before seeing the outage.

3. Core Mental Model

A multi-region system has four planes:

A design is not multi-region-ready until all four planes are ready.

Common incomplete designs:

traffic can fail over, but data cannot;
data is replicated, but application secrets are missing;
application is deployed, but IAM roles differ;
DNS fails over, but clients cache old endpoints;
secondary Region exists, but capacity is too small;
failover works, but failback is undefined;
dashboards show primary Region only;
runbook requires console access that is unavailable during incident;
data model cannot tolerate concurrent writes in multiple Regions.

4. Terms You Must Not Confuse

4.1 High Availability

High availability usually means the workload continues despite component or AZ-level failures. On AWS, this often means Multi-AZ within one Region.

4.2 Disaster Recovery

Disaster recovery means the workload can be recovered after a larger failure such as regional impairment, severe data corruption, major security event, or accidental deletion.

4.3 Multi-Region

Multi-region means workload capabilities exist in more than one AWS Region. It does not automatically mean active-active or zero data loss.

4.4 Active/Passive

One Region serves production traffic; another is ready to take over at some level of preparedness.

4.5 Active/Active

More than one Region serves production traffic at the same time.

Active-active is not “just Route 53 latency routing.” True active-active requires data, conflict, identity, idempotency, operations, and consistency design.

4.6 RTO

Recovery Time Objective: how long the business can tolerate the system being unavailable after a disaster.

4.7 RPO

Recovery Point Objective: how much data loss the business can tolerate, measured as time or committed work.
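
One way to make RTO and RPO testable is to write them down as a small contract and compare every drill against it. The sketch below is illustrative only: the `RecoveryContract` and `DrillResult` names and the sample numbers are assumptions, not part of any AWS API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryContract:
    """Agreed objectives for one workload (hypothetical structure)."""
    rto: timedelta  # maximum tolerated downtime
    rpo: timedelta  # maximum tolerated data-loss window


@dataclass(frozen=True)
class DrillResult:
    """What a failover drill actually measured."""
    downtime: timedelta
    data_loss_window: timedelta


def evaluate(contract: RecoveryContract, drill: DrillResult) -> list[str]:
    """Return contract violations; an empty list means the drill passed."""
    violations = []
    if drill.downtime > contract.rto:
        violations.append(f"RTO exceeded: {drill.downtime} > {contract.rto}")
    if drill.data_loss_window > contract.rpo:
        violations.append(f"RPO exceeded: {drill.data_loss_window} > {contract.rpo}")
    return violations


contract = RecoveryContract(rto=timedelta(minutes=30), rpo=timedelta(minutes=1))
drill = DrillResult(downtime=timedelta(minutes=22), data_loss_window=timedelta(seconds=95))
print(evaluate(contract, drill))
```

In this sample the drill meets the RTO but misses the RPO, which is the kind of result that should go back to the stakeholders who accepted the contract.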

4.8 Failover

Moving service from impaired primary capability to recovery capability.

4.9 Failback

Returning service to the original Region or steady-state topology after recovery.

Failback is often more dangerous than failover because data divergence may have accumulated.

5. Do Not Start with Active-Active

Active-active is expensive and difficult. Many workloads do not need it.

Start with requirements:

Requirement	Likely Pattern
Can tolerate hours of downtime and data restored from backup	Backup/restore
Need infrastructure pre-created, but data restore/startup can take time	Pilot light
Need fast recovery with scaled-down live stack	Warm standby
Need low RTO with one active serving Region	Active/passive hot standby
Need global low latency and Region survivability	Active-active or cell-based multi-region
Need local data residency	Regional isolation or partitioned multi-region
Need zero/near-zero data loss	Stronger replication/consistency design, often more expensive

The strongest design principle:

Choose the simplest DR pattern that satisfies RTO, RPO, compliance, and user-experience requirements.

6. Disaster Recovery Strategy Matrix

Strategy	Approximate Cost	Complexity	RTO	RPO	Typical Use
Backup/Restore	Low	Low-Medium	Hours+	Last backup/PITR	Internal systems, lower criticality
Pilot Light	Low-Medium	Medium	Tens of minutes to hours	Replication-dependent	Core data protected, app rebuilt/scaled during event
Warm Standby	Medium	Medium-High	Minutes to tens of minutes	Replication-dependent	Critical systems needing faster recovery
Active/Passive Hot Standby	High	High	Minutes	Low, replication-dependent	High criticality with one serving Region
Active/Active	Very High	Very High	Seconds-minutes	Depends on data model	Global systems, low latency, regional survivability

Do not treat this as exact SLA math. It is a design framing tool. Actual RTO/RPO must be tested.
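
The "simplest pattern that satisfies the requirement" rule can be sketched as a lookup over strategies ordered from cheapest to most complex. The RTO/RPO figures in `MEASURED` are placeholder assumptions standing in for numbers a team would measure in its own drills; they are not guarantees of any strategy.

```python
from datetime import timedelta

# Ordered from simplest/cheapest to most complex. The values are
# placeholder assumptions, to be replaced by measured drill results.
MEASURED = [
    ("backup/restore", timedelta(hours=8), timedelta(hours=1)),
    ("pilot light", timedelta(hours=1), timedelta(minutes=5)),
    ("warm standby", timedelta(minutes=20), timedelta(minutes=1)),
    ("active/passive hot standby", timedelta(minutes=5), timedelta(minutes=1)),
    ("active/active", timedelta(minutes=1), timedelta(seconds=5)),
]


def simplest_strategy(required_rto: timedelta, required_rpo: timedelta) -> str:
    """Return the first (simplest) strategy whose measured RTO and RPO both fit."""
    for name, rto, rpo in MEASURED:
        if rto <= required_rto and rpo <= required_rpo:
            return name
    raise ValueError("no listed strategy meets the requirement; revisit the contract")


print(simplest_strategy(timedelta(minutes=30), timedelta(minutes=1)))
```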

7. Multi-Region Reference Architectures

7.1 Backup/Restore

Characteristics:

low steady-state cost;
slower recovery;
requires restore automation;
good for lower criticality;
RTO depends on restore time;
RPO depends on backup frequency/PITR;
must test restore regularly.

Failure mode:

Backups exist, but nobody has restored them at production scale.

7.2 Pilot Light

Characteristics:

critical data path kept ready;
app infrastructure partially prepared;
lower cost than warm standby;
recovery depends on scaling and deployment speed;
requires tested automation.

Failure mode:

The pilot light exists, but dependencies such as secrets, DNS, IAM, and observability are incomplete.

7.3 Warm Standby

Characteristics:

secondary Region runs scaled-down live environment;
faster recovery than pilot light;
requires scaling up before or during failover;
good for critical workloads with acceptable cost.

Failure mode:

Secondary Region is alive but cannot absorb full traffic.

7.4 Active/Passive Hot Standby

Characteristics:

secondary Region is near full capacity;
failover is mostly a traffic and data-role decision;
higher cost;
simpler consistency than active-active;
still needs failback strategy.

Failure mode:

An automated health check triggers failover during a partial dependency issue and turns a contained fault into an unnecessary regional event.

7.5 Active/Active

Characteristics:

multiple Regions serve traffic simultaneously;
low user latency possible;
can reduce regional blast radius;
complex consistency/conflict handling;
expensive observability and operations;
requires strong idempotency and regional isolation design.

Failure mode:

Both Regions accept conflicting writes, and the business process has no deterministic conflict resolution.

8. Traffic Control

Traffic control is how you move users between regional capabilities.

8.1 Route 53 Health Checks and Failover Routing

Route 53 health checks can monitor endpoints, other health checks, or CloudWatch alarms. Failover routing can route to a primary resource when healthy and secondary resource when the primary is unhealthy.

Use cases:

active/passive DNS failover;
regional endpoint health;
static website failover;
API failover when clients use DNS correctly.

Risks:

DNS TTL means failover is not instantaneous;
client resolvers may cache longer than expected;
health checks can be too shallow;
false positive failover can worsen incident;
DNS failover does not solve data promotion.
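
To see why DNS failover is not instantaneous, it helps to add up detection time and cache expiry. The function below is a rough estimate only: it ignores resolvers that hold answers longer than the TTL and any delay in the health status reaching DNS. The sample values (a 30-second check interval, three consecutive failures, a 60-second TTL) are assumptions for illustration.

```python
def estimated_dns_failover_seconds(
    check_interval_s: int, failure_threshold: int, record_ttl_s: int
) -> int:
    """Estimate the time to detect a failure plus the time for cached answers to expire."""
    detection_s = check_interval_s * failure_threshold
    return detection_s + record_ttl_s


print(estimated_dns_failover_seconds(check_interval_s=30, failure_threshold=3, record_ttl_s=60))
```

Even with these fairly aggressive settings the estimate is measured in minutes, which is worth comparing against the agreed RTO before relying on DNS alone.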

8.2 Weighted Routing

Weighted routing can gradually shift traffic between Regions.

Use cases:

controlled migration;
blue/green region rollout;
canarying secondary Region;
partial traffic steering.

Risks:

client stickiness may not match weights exactly;
stateful sessions need design;
data writes can diverge if both Regions accept traffic without plan.
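
With weighted records, each record's expected share of DNS answers is its weight divided by the sum of all weights in the group. A staged shift can therefore be planned as a sequence of weight pairs. This is a planning sketch only; the function names are hypothetical, and rejecting an all-zero group is a choice made here, not a description of Route 53 behavior.

```python
def traffic_share(weights: dict[str, int]) -> dict[str, float]:
    """Expected fraction of DNS answers per record: weight / sum of weights."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("this sketch requires at least one non-zero weight")
    return {region: weight / total for region, weight in weights.items()}


def shift_plan(src: str, dst: str, steps: list[int]) -> list[dict[str, int]]:
    """Weights for each stage, where each step is the percentage moved to dst."""
    return [{src: 100 - pct, dst: pct} for pct in steps]


for stage in shift_plan("ap-southeast-1", "ap-southeast-3", [5, 25, 50, 100]):
    print(stage, traffic_share(stage))
```

Observed traffic will only approximate these shares, because resolvers cache answers and a single resolver can front many users.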

8.3 Latency-Based Routing

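Latency routing answers DNS queries with the Region that AWS measures as having the lowest network latency for the querying resolver's network (or the client subnet, when the resolver passes it along). That is not always the Region geographically closest to the user.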

Use cases:

global active-active read-heavy services;
regional user experience optimization;
multi-region frontend/API.

Risks:

nearest Region may not contain correct data;
session/data affinity must be designed;
compliance may restrict routing;
latency routing is not a data consistency strategy.

8.4 Geolocation and Geoproximity Routing

Use when routing must follow geography, regulation, or business partitioning.

Examples:

EU users to eu-central-1;
Indonesian users to ap-southeast-3 or ap-southeast-1, depending on the requirement;
tenant-specific regional residency;
government workloads isolated by jurisdiction.

Risk:

Geographic routing solves policy placement, not failover correctness.

8.5 Amazon Application Recovery Controller

Amazon Application Recovery Controller provides capabilities such as routing controls for multi-region recovery and zonal shift/zonal autoshift for supported resources. ARC routing controls are designed for highly reliable failover control, while zonal shift lets you move traffic away from an impaired AZ for supported resources.

Use cases:

high-confidence manual or automated regional failover;
safety rules before traffic shift;
zonal evacuation for supported resources;
operational drills.

Important principle:

For critical failover, prefer explicit recovery control with safety checks over shallow automatic DNS health logic.
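
The idea of a safety check in front of a traffic switch can be modeled in a few lines. The class below is an in-memory illustration, not the ARC API: `RoutingPanel` and its `min_on` rule are hypothetical names, and the rule shown ("never leave fewer than one Region serving traffic") is one example of a safety rule.

```python
class RoutingPanel:
    """In-memory model of per-Region on/off routing switches (not the ARC API)."""

    def __init__(self, states: dict[str, bool], min_on: int = 1) -> None:
        self.states = dict(states)
        self.min_on = min_on  # safety rule: at least this many Regions stay on

    def set_state(self, region: str, on: bool) -> None:
        """Apply a change only if the resulting panel still satisfies the safety rule."""
        proposed = {**self.states, region: on}
        if sum(proposed.values()) < self.min_on:
            raise RuntimeError(
                f"refused: fewer than {self.min_on} Region(s) would serve traffic"
            )
        self.states = proposed


panel = RoutingPanel({"primary": True, "secondary": False})
try:
    panel.set_state("primary", False)  # would leave nothing serving traffic
except RuntimeError as err:
    print(err)
panel.set_state("secondary", True)  # bring the recovery Region in first
panel.set_state("primary", False)   # now the primary can be drained
print(panel.states)
```

The ordering matters: the rule forces operators to turn the recovery Region on before draining the primary, which is exactly the mistake a stressed operator is likely to make.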

9. Data Replication

Traffic failover is easy compared to data failover. Data determines correctness.

9.1 Data Replication Questions

For every data store, answer:

Is it system of record or projection?
Is replication synchronous or asynchronous?
What is expected replication lag?
What is the RPO under normal conditions?
What is the RPO under regional failure?
Can the secondary become writable?
How is writer promotion done?
Can both Regions accept writes?
What happens to conflicting writes?
What is the failback process?
How is data corruption handled?
How is deletion replicated?
Are backups isolated from replication mistakes?

9.2 Aurora Global Database

Aurora Global Database has one primary Region where writes occur and one or more read-only secondary Regions. It is designed for globally distributed applications with low-latency reads in secondary Regions and disaster recovery capability.

Design implications:

writes go to the primary Region;
secondary Regions can support local reads;
failover/switchover must promote a secondary;
application connection endpoints must handle role change;
RPO/RTO depend on replication and failover execution;
failback requires careful reconciliation and planned switchover.

Good use cases:

mostly single-writer systems;
global read latency improvement;
regional DR for relational workloads;
systems where write locality can be centralized.

Bad fit:

true multi-writer relational workload without conflict model;
low-latency writes required in every Region;
systems where ORM/connection behavior cannot tolerate failover.

9.3 DynamoDB Global Tables

DynamoDB global tables provide multi-Region, multi-active replication. Applications can write to replicas in different Regions, and DynamoDB replicates changes.

Design implications:

excellent for region-local reads/writes when access patterns fit DynamoDB;
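conflict behavior must be understood (in the default eventually consistent mode, concurrent writes to the same item are reconciled as last writer wins);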
idempotency is mandatory;
item ownership and update semantics should avoid conflicts;
global secondary indexes and capacity must be region-aware;
streams/side effects must avoid duplicate processing across Regions.

Good use cases:

tenant/session/profile/config data with clear item ownership;
globally distributed metadata;
event idempotency tables;
user-facing low-latency reads/writes where conflict can be controlled.

Bad fit:

complex relational transactions;
heavy cross-item consistency requirements;
workloads that cannot tolerate eventual replication behavior.
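
Why item ownership matters becomes clearer with a toy model of last-writer-wins reconciliation. The sketch below is a simplified, item-level model written for illustration; it does not reproduce the service's internal reconciliation, and the `Write` type and timestamps are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    timestamp: float  # time of the write as seen by the writing Region
    item: dict


def last_writer_wins(a: Write, b: Write) -> Write:
    """Keep the whole item from the later write; the earlier one is discarded."""
    return a if a.timestamp >= b.timestamp else b


base = {"case_id": "C-1", "status": "UnderReview", "assignee": "alice"}
# Two Regions read the same item and each change a different attribute.
write_a = Write("ap-southeast-1", 100.0, {**base, "status": "Escalated"})
write_b = Write("ap-southeast-3", 100.2, {**base, "assignee": "bob"})

winner = last_writer_wins(write_a, write_b)
print(winner.item)  # the status change from ap-southeast-1 is gone
```

Neither writer did anything wrong, yet one business update disappears. Giving each item a single owning Region, or splitting independently updated attributes into separate items, avoids the race rather than resolving it.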

9.4 S3 Replication

S3 replication can replicate objects across buckets in the same or different Regions, subject to configuration requirements such as versioning.

Design implications:

good for object DR and regional data distribution;
replication is asynchronous;
delete marker behavior must be explicitly understood;
metadata/tag/encryption behavior must be configured correctly;
replication failures need monitoring;
replication is not a backup substitute for corruption/deletion unless combined with versioning/Object Lock/backup strategy.

Good use cases:

document/archive replication;
static assets;
data lake distribution;
cross-account evidence copy;
recovery Region object availability.

Bad fit:

synchronous object consistency across Regions;
replacing backup retention controls;
high-frequency overwrite workloads without versioning discipline.

9.5 RDS Cross-Region Read Replicas

For some engines and configurations, RDS supports cross-region read replicas. They can improve read locality or support DR patterns.

Design implications:

asynchronous replication;
replica lag matters;
promotion changes topology;
application endpoints and secrets must be ready;
failback is non-trivial.

9.6 Streams and Event Replication

For Kinesis, MSK, and event systems, multi-region design is not only “copy messages.” You must decide:

event identity;
ordering scope;
duplicate handling;
replay boundary;
consumer idempotency;
regional ownership;
poison message handling;
schema compatibility;
replication lag;
data retention.

Common pattern:

But the critical detail is duplicate-safe semantics, not the pipe.
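
Duplicate-safe semantics can be illustrated with a consumer that remembers which event IDs it has already applied. This is a minimal in-memory sketch: `DedupConsumer` and the event shape are hypothetical, and a real consumer would need the "seen" record to be durable and committed together with the side effect.

```python
class DedupConsumer:
    """Applies each event at most once, keyed by a producer-assigned event ID."""

    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.total = 0

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event["event_id"] in self.seen:
            return False
        self.total += event["amount"]
        self.seen.add(event["event_id"])
        return True


events = [
    {"event_id": "e-1", "amount": 10},
    {"event_id": "e-2", "amount": 5},
]
consumer = DedupConsumer()
for event in events + events:  # the second pass simulates a replay after failover
    consumer.handle(event)
print(consumer.total)
```

However the events are copied between Regions, the replay leaves the total unchanged, which is the property the replication pipe itself cannot provide.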

10. Application State and Session Design

Multi-region systems fail when application state is region-bound accidentally.

10.1 Avoid Local Sticky State

Avoid storing critical session/process state only in:

EC2 instance memory;
pod local filesystem;
single-AZ cache;
local worker queue;
primary-region-only database;
unreplicated S3 bucket;
region-specific secret without equivalent in secondary.

10.2 Session Patterns

Pattern	Multi-Region Impact
Stateless JWT	Easier routing, but revocation/claims freshness must be handled
Server-side session in regional cache	Requires regional affinity or replicated/session fallback
DB-backed session	Simpler consistency, can add DB load
Sticky DNS/load balancer	Fragile during failover
Tenant-region affinity	Good for residency and data locality

10.3 Idempotency

Idempotency is mandatory for multi-region systems because failover creates ambiguity:

Did the request reach primary before failure?
Did the event replicate?
Did the client retry to secondary?
Did the workflow partially complete?

Use:

idempotency keys;
request IDs;
deterministic business keys;
conditional writes;
deduplication tables;
outbox/inbox pattern;
exactly-once illusion through at-least-once plus idempotent effects.
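
A request-level idempotency key can be sketched as a table that maps the key to the first result, so a retry after failover returns the original outcome instead of creating a second record. The `IdempotentApi` class and its in-memory dictionary are assumptions for illustration; in a multi-region design the key table itself has to be available in whichever Region receives the retry.

```python
class IdempotentApi:
    """Creates a case once per idempotency key and replays the stored result on retry."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}
        self.cases_created = 0

    def create_case(self, idempotency_key: str, payload: dict) -> dict:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # retry: no new side effect
        self.cases_created += 1
        result = {"case_id": f"case-{self.cases_created}", **payload}
        self._results[idempotency_key] = result
        return result


api = IdempotentApi()
first = api.create_case("req-7f3a", {"subject": "late filing"})
retry = api.create_case("req-7f3a", {"subject": "late filing"})  # client retried after a timeout
print(first == retry, api.cases_created)
```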

11. Consistency and Conflict Handling

Multi-region data usually means some form of eventual consistency unless using specialized strongly consistent systems with trade-offs.

11.1 Conflict Types

Conflict	Example
Concurrent update	Same case edited in two Regions
Duplicate create	Same business request retried after failover
Delete/update race	Record deleted in one Region and updated in another
Workflow transition conflict	Enforcement case moves to different states in different Regions
Counter conflict	Metrics/counts incremented in multiple Regions
Authorization conflict	User permission revoked in one Region but accepted in another

11.2 Conflict Strategies

Strategy	Use When	Risk
Single writer	Stronger correctness needed	Higher write latency for remote users
Regional ownership	Tenant/entity belongs to one Region at a time	Rebalancing complexity
Last writer wins	Low-value overwrite data	Can lose business updates
Version/check-and-set	Conflicts should be rejected	Requires user/workflow resolution
CRDT/counter design	Specific commutative updates	Complexity and limited fit
Workflow compensation	Business process can reverse/adjust	Requires auditability
Manual reconciliation	Rare high-value conflicts	Operational burden
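
The version/check-and-set row can be made concrete with a small optimistic-concurrency store: a write succeeds only if the record is still at the version the writer read. `VersionedStore` and `VersionConflict` are hypothetical names, and the in-memory dictionary stands in for a database that supports conditional writes.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    def __init__(self) -> None:
        self._rows: dict[str, tuple[int, dict]] = {}

    def get(self, key: str) -> tuple[int, dict]:
        """Return (version, value); version 0 means the key does not exist yet."""
        return self._rows.get(key, (0, {}))

    def put(self, key: str, value: dict, expected_version: int) -> int:
        """Write only if the stored version still equals expected_version."""
        current, _ = self.get(key)
        if current != expected_version:
            raise VersionConflict(f"{key}: expected v{expected_version}, found v{current}")
        self._rows[key] = (current + 1, value)
        return current + 1


store = VersionedStore()
store.put("case-1", {"status": "UnderReview"}, expected_version=0)
version, _ = store.get("case-1")  # both writers read version 1
store.put("case-1", {"status": "Escalated"}, expected_version=version)
try:
    store.put("case-1", {"status": "Closed"}, expected_version=version)
except VersionConflict as err:
    print("rejected:", err)
```

The second writer is rejected instead of silently overwriting, which pushes the conflict to a user or workflow that can resolve it.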

11.3 Regulatory Workflow Example

For enforcement lifecycle systems, active-active writes are dangerous if case state transitions are not conflict-safe.

Example invalid state race:

Better design:

single writer per case;
case-region ownership;
command idempotency;
versioned state transitions;
append-only audit log;
async replicated read model;
explicit transfer-of-ownership workflow.
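
Several items in this list can be combined in one small model: a case owned by a single Region, transitions validated against an allowed set, a version check on every command, and an append-only audit trail. This is a sketch under assumptions: the `Case` class is hypothetical, and the Escalated -> Closed edge is added for illustration.

```python
# Transitions out of UnderReview follow the lesson's example; the
# Escalated -> Closed edge is an assumption added for illustration.
ALLOWED = {"UnderReview": {"Escalated", "Closed"}, "Escalated": {"Closed"}}


class Case:
    def __init__(self, case_id: str, owner_region: str) -> None:
        self.case_id = case_id
        self.owner_region = owner_region
        self.state = "UnderReview"
        self.version = 1
        self.audit: list[tuple] = []  # append-only record of applied transitions

    def transition(self, region: str, new_state: str, expected_version: int) -> None:
        if region != self.owner_region:
            raise PermissionError(f"{region} does not own {self.case_id}")
        if expected_version != self.version:
            raise ValueError("stale version; reload the case and retry")
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"{self.state} -> {new_state} is not allowed")
        self.audit.append((self.version, self.state, new_state, region))
        self.state = new_state
        self.version += 1


case = Case("C-1", owner_region="ap-southeast-1")
case.transition("ap-southeast-1", "Escalated", expected_version=1)
for region, target in [("ap-southeast-3", "Closed"), ("ap-southeast-1", "UnderReview")]:
    try:
        case.transition(region, target, expected_version=case.version)
    except (PermissionError, ValueError) as err:
        print("rejected:", err)
```

The non-owning Region cannot close the case, and the owner cannot move it backwards, so a failover retry can at worst be rejected rather than produce an invalid state.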

12. Identity, Secrets, and Key Management

Multi-region is not only compute and data.

12.1 IAM

IAM is global for many identity constructs, but deployed workload roles, resource policies, permission boundaries, SCPs, and account structure must be consistent.

Checklist:

same account/OU guardrails;
same workload roles;
same trust policies;
same permission boundaries;
same break-glass path;
same CI/CD deployment role model;
same audit logging baseline;
cross-region assumptions documented.

12.2 Secrets

Secrets must exist in the recovery Region before failover.

Watch for:

Secrets Manager replication strategy;
region-specific ARNs;
app config references;
rotation Lambda functions;
DB credentials after promotion;
third-party API credentials with regional allowlists;
emergency manual rotation.
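
A simple parity check catches the most common secrets gap before a failover does. The sketch below compares secret names per Region; the inventories are hypothetical literals, and a real check would list secrets in each Region and might also compare versions and rotation state.

```python
def missing_in_recovery(
    inventory: dict[str, set[str]], primary: str, recovery: str
) -> list[str]:
    """Names present in the primary Region but absent from the recovery Region."""
    return sorted(inventory[primary] - inventory[recovery])


# Hypothetical inventories keyed by Region.
inventory = {
    "ap-southeast-1": {"case-api/db", "case-api/partner-api-key", "case-api/jwt-signing"},
    "ap-southeast-3": {"case-api/db", "case-api/jwt-signing"},
}
gaps = missing_in_recovery(inventory, "ap-southeast-1", "ap-southeast-3")
print(gaps)
```

A non-empty result is a reason to block a traffic shift, not just a line in a report.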

12.3 KMS

KMS design matters because encrypted data cannot be recovered if keys and policies are not available.

Questions:

Are keys regional or multi-region keys?
Are key policies equivalent?
Can recovery roles decrypt replicated data?
Are grants recreated in secondary?
Are backup vault keys available?
Is key deletion protected?
Does data residency restrict key placement?

13. Deployment and Configuration

A common multi-region failure is configuration drift.

13.1 Deployment Rules

Use the same artifact across Regions.
Promote configuration explicitly.
Separate global config from regional config.
Validate regional dependencies before traffic shift.
Use health checks that verify real dependency readiness.
Keep database migrations backward-compatible.
Roll out primary and secondary with controlled order.
Test rollback in both Regions.
Ensure feature flags are region-aware.
Store deployment evidence.

13.2 Regional Configuration Example

service: case-api
artifactVersion: 2026.07.01-1842
regions:
  ap-southeast-1:
    active: true
    databaseRole: primary
    trafficWeight: 100
  ap-southeast-3:
    active: false
    databaseRole: standby
    trafficWeight: 0

Avoid hidden Region-specific values buried inside application code.
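
Keeping regional values in explicit configuration also makes them checkable. The sketch below mirrors the example configuration as a Python dictionary and applies three invariants. The invariants themselves are assumptions about how such a file might be interpreted (exactly one primary database, weights summing to 100, no traffic to inactive Regions), not rules taken from any AWS service.

```python
CONFIG = {
    "service": "case-api",
    "artifactVersion": "2026.07.01-1842",
    "regions": {
        "ap-southeast-1": {"active": True, "databaseRole": "primary", "trafficWeight": 100},
        "ap-southeast-3": {"active": False, "databaseRole": "standby", "trafficWeight": 0},
    },
}


def validate(config: dict) -> list[str]:
    """Return a list of invariant violations; an empty list means the config is consistent."""
    errors = []
    regions = config["regions"]
    primaries = [name for name, r in regions.items() if r["databaseRole"] == "primary"]
    if len(primaries) != 1:
        errors.append(f"expected exactly one primary database, found {len(primaries)}")
    if sum(r["trafficWeight"] for r in regions.values()) != 100:
        errors.append("traffic weights must sum to 100")
    for name, r in regions.items():
        if not r["active"] and r["trafficWeight"] > 0:
            errors.append(f"{name} is inactive but has traffic weight {r['trafficWeight']}")
    return errors


print(validate(CONFIG))
```

Running a check like this in the deployment pipeline turns configuration drift into a failed build instead of a failed failover.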

14. Observability for Multi-Region

A multi-region dashboard must compare Regions.

14.1 Required Dimensions

Region;
AZ;
account;
cell;
tenant;
workload;
dependency;
traffic source;
data replication lag;
failover state;
deployment version;
feature flag state.

14.2 Critical Metrics

Area	Metrics
Traffic	request count per Region, error rate, latency, health check status
Data	replication lag, conflict count, write error, restore status
Compute	capacity, saturation, warm standby size, scaling lag
Queue/Stream	backlog per Region, iterator age, duplicate count
DNS/Routing	active routing state, TTL, health check result
Security	GuardDuty/Security Hub findings per Region, CloudTrail delivery
Cost	steady-state multi-region cost, data transfer, standby waste
DR	last failover drill, RTO achieved, RPO observed

14.3 Alarm Design

Avoid single-region blind spots.

Example alarms:

primary Region 5xx > threshold;
secondary Region readiness check fails;
replication lag > RPO threshold;
standby capacity < failover requirement;
Route 53 health check unhealthy;
ARC routing control changed;
global table replication errors;
S3 replication failure;
Aurora global replication lag high;
failover runbook step timeout;
cross-region data transfer anomaly.
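
Two of these alarms, replication lag against the RPO and standby capacity against the failover requirement, reduce to simple comparisons. The function below is a sketch of that logic with hypothetical parameter names; in practice the inputs would come from replication and capacity metrics.

```python
def readiness_alarms(
    replication_lag_s: float,
    rpo_s: float,
    standby_capacity: int,
    required_capacity: int,
) -> list[str]:
    """Return the readiness alarms that should currently be firing."""
    alarms = []
    if replication_lag_s > rpo_s:
        alarms.append(f"replication lag {replication_lag_s}s exceeds RPO {rpo_s}s")
    if standby_capacity < required_capacity:
        alarms.append(
            f"standby capacity {standby_capacity} is below failover requirement {required_capacity}"
        )
    return alarms


print(readiness_alarms(replication_lag_s=95, rpo_s=60, standby_capacity=4, required_capacity=12))
```

Both alarms describe the secondary Region while the primary is healthy, which is exactly when nobody is looking at it.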

15. Failover Runbook

A failover runbook must be executable under stress.

15.1 Pre-Failover Checks

Confirm incident scope: AZ, service, Region, app-only, data corruption, security event.
Confirm failover is safer than staying.
Confirm secondary Region health.
Confirm secondary capacity.
Confirm replication lag and possible RPO impact.
Freeze risky deployments.
Stop conflicting writers if needed.
Notify stakeholders.
Assign incident commander and scribe.

15.2 Failover Steps

Example active/passive relational workload:

1. Declare failover decision.
2. Disable writes in primary application if reachable.
3. Check replication lag.
4. Promote secondary database or execute managed failover/switchover.
5. Validate database writer endpoint.
6. Scale secondary application to required capacity.
7. Validate secrets/config/IAM access.
8. Run synthetic checks against secondary.
9. Shift traffic using Route 53/ARC/traffic layer.
10. Monitor error rate, latency, write success, replication status.
11. Communicate user impact and current operating mode.
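
The ordering and abort behavior of a runbook like this can be sketched as a list of named steps that run in sequence and stop at the first failure. The runner below is a hypothetical illustration: each step is a callable returning True or False, and the lambdas stand in for real checks and actions.

```python
import time
from typing import Callable

Step = tuple[str, Callable[[], bool]]


def run_runbook(steps: list[Step]) -> dict:
    """Run steps in order; stop at the first one that reports failure."""
    started = time.monotonic()
    completed: list[str] = []
    for name, action in steps:
        if not action():
            return {"status": "aborted", "failed_step": name, "completed": completed,
                    "elapsed_s": time.monotonic() - started}
        completed.append(name)
    return {"status": "completed", "failed_step": None, "completed": completed,
            "elapsed_s": time.monotonic() - started}


steps = [
    ("check replication lag", lambda: True),
    ("promote secondary database", lambda: True),
    ("run synthetic checks", lambda: False),  # simulated failure
    ("shift traffic", lambda: True),
]
result = run_runbook(steps)
print(result["status"], result["failed_step"], result["completed"])
```

Because the synthetic check fails, the traffic shift never runs, and the elapsed time gives a first measurement toward the observed RTO.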

15.3 Post-Failover Checks

Error rate stable.
Latency acceptable.
Data writes succeeding.
No unexpected writes to old primary.
Queue backlog controlled.
Security monitoring active.
Audit logging active.
Cost increase is expected and acknowledged (cost anomaly alerts may fire).
Business-critical workflows validated.

16. Failback Runbook

Failback is not rollback.

16.1 Failback Risks

old primary has stale data;
both Regions accepted writes;
replication direction changed;
DNS/client caches still point to secondary;
queues contain old messages;
scheduled jobs run twice;
audit logs split;
third-party integrations changed callback endpoint;
secrets rotated in only one Region;
operators assume original topology is still valid.

16.2 Failback Steps

1. Confirm original Region is healthy.
2. Decide whether to fail back or remain in new primary.
3. Reconcile data divergence.
4. Re-establish replication direction.
5. Validate app stack in original Region.
6. Run synthetic and business checks.
7. Shift small traffic percentage if possible.
8. Monitor error, latency, data correctness.
9. Complete traffic shift.
10. Keep enhanced monitoring during stabilization.
11. Document observed RTO/RPO and lessons.

A mature organization may choose not to fail back immediately. Sometimes the safer move is to operate from the recovery Region until a planned migration window.
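
One concrete part of reconciling data divergence is finding writes that the old primary acknowledged but never replicated before failover. The sketch below assumes each Region keeps an ordered log of applied write IDs; that log, and the function name, are assumptions for illustration.

```python
def stranded_writes(old_primary_log: list[str], new_primary_log: list[str]) -> list[str]:
    """Write IDs applied in the old primary that the new primary never received."""
    replicated = set(new_primary_log)
    return [write_id for write_id in old_primary_log if write_id not in replicated]


old_primary_log = ["w-101", "w-102", "w-103", "w-104"]
new_primary_log = ["w-101", "w-102", "w-201", "w-202"]  # w-2xx were written after failover
print(stranded_writes(old_primary_log, new_primary_log))
```

Each stranded write needs a decision (replay, discard, or manual review) before replication is re-established in the new direction.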

17. Cost Model

Multi-region cost includes more than duplicate compute.

17.2 Cost Questions

Is the workload valuable enough for this RTO/RPO?
What is the cost of downtime per hour?
What is the cost of data loss per minute?
What is the cost of inconsistent decisions?
Can we use warm standby instead of active-active?
Can we partition by tenant/cell instead of global active-active?
Can we reduce RTO through automation without full duplicate capacity?
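
These questions can be turned into rough arithmetic: compare the expected annual cost of outages with and without a standby against what the standby costs to run. All figures below are hypothetical placeholders, and the model ignores data-loss cost and reputational impact.

```python
def expected_annual_outage_cost(
    outages_per_year: float, hours_per_outage: float, cost_per_hour: float
) -> float:
    """Expected yearly loss from regional outages under a given recovery time."""
    return outages_per_year * hours_per_outage * cost_per_hour


# Hypothetical inputs: one regional event every four years, 50,000 per hour of downtime.
without_standby = expected_annual_outage_cost(0.25, 8.0, 50_000)  # restore from backup
with_standby = expected_annual_outage_cost(0.25, 0.5, 50_000)     # warm standby failover
avoided_loss = without_standby - with_standby
standby_cost_per_year = 60_000

print(avoided_loss, avoided_loss > standby_cost_per_year)
```

With these inputs the standby pays for itself; halve the outage frequency or the hourly cost and it no longer does, which is why the inputs deserve more scrutiny than the formula.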

18. Security and Compliance

Multi-region expands security scope.

18.1 Security Checklist

18.2 Compliance Trap

A system may be technically recoverable but non-compliant after failover.

Example:

Primary Region stores regulated records in approved jurisdiction.
Failover Region is technically ready but not approved for that data class.

That is not a valid DR design.
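
A compliance gate of this kind can be expressed as a lookup from data class to approved Regions, checked before any failover target is accepted. The mapping below is entirely hypothetical; real approvals come from the organization's compliance function.

```python
# Hypothetical approvals: which Regions may hold each data class.
APPROVED_REGIONS = {
    "regulated-records": {"ap-southeast-3"},
    "public-assets": {"ap-southeast-1", "ap-southeast-3"},
}


def blocked_data_classes(data_classes: list[str], target_region: str) -> list[str]:
    """Data classes that are not approved for the proposed recovery Region."""
    return [
        data_class
        for data_class in data_classes
        if target_region not in APPROVED_REGIONS.get(data_class, set())
    ]


print(blocked_data_classes(["regulated-records", "public-assets"], "ap-southeast-1"))
```

An unknown data class is treated as blocked, so a new data type cannot slip into a recovery Region by default.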

19. Multi-Region Anti-Patterns

“Active-active” with no conflict strategy.
DNS failover without data promotion plan.
Replication without restore/failover drills.
Secondary Region without enough capacity.
Backups copied but never restored.
Hard-coded regional ARNs in application config.
Secrets only in primary Region.
Dashboards only in primary Region.
CI/CD deploys primary but not recovery Region.
Health checks that only test /health and not dependencies.
Automatic failover on shallow signal.
No failback plan.
Using active-active to hide poor single-region reliability.
Cross-region synchronous calls in normal request path.
Replicating corrupted data without immutable backup.
Running scheduled jobs in both Regions without leader election or idempotency.
No RTO/RPO measurement during drills.
Ignoring client DNS caching behavior.
No regional cost allocation.
No regulatory approval for recovery Region.

20. Decision Matrix: Should This Be Multi-Region?

Question	If Yes	If No
Is regional outage survivability required?	Consider DR/multi-region	Multi-AZ may be enough
Is RTO less than one hour?	Warm standby/hot standby likely	Backup/restore may fit
Is RPO near zero?	Need strong replication/data strategy	Periodic backup may fit
Are users global and latency-sensitive?	Consider regional serving/edge	Single Region + CloudFront may fit
Is data residency regional?	Partition by jurisdiction	Simpler global data placement possible
Can writes be single-region?	Active/passive simpler	Active-active conflict design needed
Can business tolerate conflict resolution?	Multi-writer possible	Avoid active-active writes
Is team operationally mature?	Consider advanced design	Start simpler and drill
Is cost justified?	Proceed with evidence	Do not overbuild

21. Reference Blueprint: Regulated Case Platform DR

For a regulated enforcement/case-management platform, a sane default is often:

primary Region active;
secondary Region warm standby;
single writer for case state;
Aurora Global Database or cross-region DR strategy for relational case store;
S3 replication for documents/evidence with versioning/Object Lock where required;
DynamoDB global table only for suitable metadata/idempotency/session/config patterns;
EventBridge/SQS workflows designed for idempotency and replay;
Route 53/ARC traffic control;
immutable audit logs replicated to separate account/Region;
failover runbook requiring incident commander approval;
failback treated as planned recovery project, not automatic reversal.

Why this is safer than naive active-active:

case state transitions remain single-writer;
auditability is preserved;
recovery is fast enough for many regulated workloads;
conflict model is simpler;
cost is lower than full active-active;
failover can be drilled.

22. Production Readiness Checklist

22.1 Requirement

22.2 Topology

DR strategy selected.
Primary/secondary roles documented.
Regional account structure defined.
Network baseline deployed.
Compute baseline deployed.
Ingress/egress baseline deployed.
Capacity target for failover defined.

22.3 Data

22.4 Traffic

22.5 Operations

Failover runbook tested.
Failback runbook drafted and tested where possible.
Incident roles defined.
Dashboards include both Regions.
Alarms include replication/readiness.
Break-glass tested.
Game day completed.

22.6 Security and Compliance

CloudTrail/Config/Security Hub coverage in all Regions.
KMS and secrets ready in recovery Region.
Data movement approved.
Audit evidence replicated.
Access controls consistent.
Recovery Region included in compliance scope.

23. Game Day Scenarios

Run these before claiming multi-region readiness.

Scenario 1: Primary App Region Unavailable

block primary application endpoint;
verify health checks;
promote/activate secondary;
shift traffic;
measure RTO;
estimate RPO;
validate critical workflows.

Scenario 2: Database Primary Unavailable

simulate database failover;
validate app reconnection;
verify no duplicate writes;
confirm read/write endpoints;
measure recovery.

Scenario 3: Replication Lag Exceeds RPO

delay replication;
trigger alarm;
decide whether failover is allowed;
document decision policy.

Scenario 4: Corrupted Data Replicated

introduce controlled bad data in non-prod;
observe replication;
restore from backup/versioned data;
validate audit trail.

Scenario 5: Secondary Region Missing Secret

remove the secondary-Region copy of a secret in non-prod;
test synthetic readiness;
ensure failover blocked before user traffic.

Scenario 6: Duplicate Event Replay

replay event stream after failover;
confirm idempotent consumers;
inspect side effects.

24. Deliberate Practice

Exercise 1: Pick a DR Strategy

Given:

System: regulatory case management API
Users: national agency staff
Data: sensitive enforcement records
Downtime tolerance: 30 minutes
Data loss tolerance: <= 1 minute
Budget: medium-high
Writes: mostly during office hours
Read traffic: high during reporting deadlines

Deliver:

DR strategy;
primary/secondary topology;
data replication choice;
traffic control choice;
failover steps;
failback risks;
why not active-active.

Exercise 2: Conflict Model

Design a conflict model for:

Case status transition: UnderReview -> Escalated / Closed

Assume failover occurs while users retry requests.

Deliver:

idempotency key design;
version check;
state transition validation;
audit log entry;
retry behavior;
operator reconciliation path.

Exercise 3: Multi-Region Readiness Review

For an existing workload, create a table:

Component	Primary Region	Secondary Region	Replication	RTO Risk	RPO Risk
API
DB
S3
Queue
Secrets
KMS
Observability

Then identify the first component that blocks real failover.

Exercise 4: Failover Runbook Test

Write a failover runbook with:

decision criteria;
required approver;
exact commands or console actions;
validation queries;
traffic shift method;
rollback/abort criteria;
communication template;
RTO/RPO measurement fields.

Run it in non-prod.

25. Self-Correction Questions

Ask these in every multi-region design review:

What exact failure are we designing for?
What is the agreed RTO?
What is the agreed RPO?
Who accepts data loss if RPO is exceeded?
Is the secondary Region already deployed?
Is secondary capacity sufficient?
What data is replicated?
What data is not replicated?
Is replication asynchronous?
How do we detect replication lag?
Can both Regions accept writes?
If yes, what is the conflict strategy?
If no, how is single-writer enforced?
How is traffic shifted?
What prevents false failover?
How do clients behave during DNS change?
Are secrets and KMS keys ready?
Are IAM and policies equivalent?
Does compliance allow this recovery Region?
Has failover been tested under load?
Has failback been tested or at least designed?
What is the cost of steady-state readiness?

If the team cannot answer these, the architecture is not multi-region-ready.

26. Key Takeaways

Multi-region architecture is a business-risk decision, not a badge of maturity.
RTO and RPO must be explicit, testable, and accepted by stakeholders.
Traffic failover without data readiness is not disaster recovery.
Active-active is usually the hardest and most expensive option.
Data consistency and conflict handling determine correctness.
DNS, health checks, ARC, and routing controls solve traffic movement, not application semantics.
Failover and failback are separate operational procedures.
Replication is not a substitute for immutable backup.
Recovery Region must include security, observability, IAM, secrets, keys, and compliance scope.
A simpler warm-standby design with tested runbooks often beats an untested active-active design.

27. References

AWS Well-Architected Reliability Pillar - Disaster Recovery: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_planning_for_recovery_disaster_recovery.html
Amazon Route 53 Health Checks: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover.html
Amazon Route 53 Failover Routing: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-failover.html
Amazon Application Recovery Controller: https://docs.aws.amazon.com/r53recovery/latest/dg/what-is-route53-recovery.html
ARC Zonal Shift: https://docs.aws.amazon.com/r53recovery/latest/dg/arc-zonal-shift.html
Aurora Global Database: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database.html
Aurora Global Database Disaster Recovery: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-global-database-disaster-recovery.html
DynamoDB Global Tables: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GlobalTables.html
Amazon S3 Replication: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html
AWS Backup: https://docs.aws.amazon.com/aws-backup/latest/devguide/whatisbackup.html
