Chapter 17: Disaster Recovery & High Availability

Question 1: Knowledge

What is the correct definition of RTO and RPO in disaster recovery planning?

A. RTO (Recovery Point Objective) is the maximum acceptable downtime; RPO (Recovery Time Objective) is the maximum acceptable data loss.
B. RTO (Recovery Time Objective) is the maximum acceptable time to restore service after a disaster; RPO (Recovery Point Objective) is the maximum acceptable amount of data loss measured in time. ✓ Correct
C. RTO and RPO are both measures of system downtime tolerance expressed in minutes.
D. RPO defines how fast backups are made; RTO defines how fast they are restored.

Explanation

RTO = Recovery Time Objective: how long the business can tolerate being down (e.g., "we must be back online within 4 hours"). RPO = Recovery Point Objective: how much data loss, measured in time, is acceptable (e.g., "we can lose at most 1 hour of data"). The lower the RTO and RPO targets, the more expensive and always-ready the DR solution has to be.
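As a rough illustration of how the two objectives are measured, the sketch below checks a single incident against both targets. The function name and the timestamps are hypothetical, chosen only for this example.

```python
from datetime import datetime, timedelta

def meets_objectives(last_backup, outage_start, service_restored, rto, rpo):
    """Return (rto_met, rpo_met) for a single incident."""
    downtime = service_restored - outage_start      # compared against the RTO
    data_loss_window = outage_start - last_backup   # compared against the RPO
    return downtime <= rto, data_loss_window <= rpo

# Hypothetical incident: last good backup at 09:30, outage at 10:00,
# service restored at 13:00.
last_backup = datetime(2024, 1, 1, 9, 30)
outage_start = datetime(2024, 1, 1, 10, 0)
service_restored = datetime(2024, 1, 1, 13, 0)

result = meets_objectives(
    last_backup, outage_start, service_restored,
    rto=timedelta(hours=4), rpo=timedelta(hours=1),
)
print(result)  # (True, True): 3 hours of downtime, 30 minutes of lost data
```

RTO looks forward from the moment of failure (time until service is back), while RPO looks backward from it (age of the last recoverable data).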

Question 2: Scenario

A company has an RTO of 4 hours and RPO of 1 hour. They want the MOST cost-effective DR strategy. Which approach BEST meets these requirements without over-engineering?

A. Multi-Site Active-Active: full production environment running in two regions simultaneously (RTO ≈ 0). Massively over-engineered and expensive for a 4-hour RTO.
B. Backup and Restore: regular snapshots to S3. Restoring and validating from backups typically takes many hours to days, so it is unlikely to meet the 4-hour RTO.
C. Pilot Light: core systems (database) are running at minimum capacity in the DR region; application servers are launched only during failover. RTO of hours, cost-effective. ✓ Correct
D. Warm Standby: scaled-down but fully functional environment running continuously. More expensive than needed for a 4-hour RTO.

Explanation

DR strategies, ordered from cheapest and slowest to most expensive and fastest: Backup & Restore (RTO: hours to days) → Pilot Light (RTO: hours) → Warm Standby (RTO: minutes) → Multi-Site Active-Active (RTO: seconds). Pilot Light keeps the "core flame" lit (e.g., DB replication) and scales up the rest on failover. It meets the 4-hour RTO at lower cost than Warm Standby, and because the database replicates continuously, its RPO of minutes also fits comfortably within the 1-hour target.

Strategy	RTO	RPO	Cost
Backup & Restore	Hours–Days	Hours	$
Pilot Light	Hours	Minutes	$$
Warm Standby	Minutes	Seconds	$$$
Multi-Site Active-Active	Seconds	Near Zero	$$$$
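The selection logic behind this question can be sketched as "pick the cheapest strategy whose RTO and RPO both fit the targets". The numeric figures below are illustrative assumptions that stand in for the qualitative ranges in the table; they are not AWS-published values, and the names are hypothetical.

```python
from typing import NamedTuple, Optional

class Strategy(NamedTuple):
    name: str
    rto_minutes: float   # assumed worst-case recovery time
    rpo_minutes: float   # assumed worst-case data loss window
    cost_rank: int       # 1 = cheapest ($), 4 = most expensive ($$$$)

# Illustrative numbers only, picked to match "Hours-Days", "Hours",
# "Minutes" and "Seconds" in the table above.
STRATEGIES = [
    Strategy("Backup & Restore", rto_minutes=1440, rpo_minutes=240, cost_rank=1),
    Strategy("Pilot Light", rto_minutes=180, rpo_minutes=15, cost_rank=2),
    Strategy("Warm Standby", rto_minutes=30, rpo_minutes=1, cost_rank=3),
    Strategy("Multi-Site Active-Active", rto_minutes=1, rpo_minutes=0, cost_rank=4),
]

def cheapest_strategy(rto_target: float, rpo_target: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy that satisfies both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= rto_target and s.rpo_minutes <= rpo_target
    ]
    return min(candidates, key=lambda s: s.cost_rank, default=None)

# The scenario in Question 2: RTO of 4 hours, RPO of 1 hour.
choice = cheapest_strategy(rto_target=240, rpo_target=60)
print(choice.name)  # Pilot Light
```

Under these assumed figures, Warm Standby and Multi-Site Active-Active also satisfy the targets, but they rank higher on cost, which is why they count as over-engineering here.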

Question 3: Scenario

A company uses Route 53 Failover routing with a primary endpoint in us-east-1. What MUST be configured for Route 53 to automatically switch DNS to the secondary endpoint when the primary fails?

A. An SQS queue to receive health failure events from the primary endpoint.
B. A Route 53 Health Check associated with the primary DNS record: Route 53 monitors the endpoint and triggers failover when it becomes unhealthy. ✓ Correct
C. A CloudWatch Alarm that manually updates the Route 53 DNS record when triggered.
D. An Auto Scaling policy that creates new Route 53 records during failure events.

Explanation

Route 53 Health Checks actively monitor endpoints (HTTP/HTTPS/TCP) at configurable intervals (10 or 30 seconds). If the health check fails a threshold number of consecutive checks, Route 53 marks the record as unhealthy and routes traffic to the healthy failover record. Health checks can also monitor CloudWatch alarms for application-level health signals.
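The consecutive-failure threshold described above can be modelled in a few lines. This is a deliberately simplified sketch: it treats the health check as a single stream of pass/fail results and ignores details of the real service such as multiple checker locations and DNS TTLs. The function name is hypothetical, and the threshold of 3 is an assumption for the example.

```python
def route_targets(check_results, failure_threshold=3):
    """Return which failover record would answer after each health check.

    The endpoint changes state only after `failure_threshold` consecutive
    results that disagree with its current state.
    """
    targets = []
    healthy = True   # start by assuming the primary is healthy
    streak = 0       # consecutive results contradicting the current state
    for passed in check_results:
        if passed == healthy:
            streak = 0
        else:
            streak += 1
            if streak >= failure_threshold:
                healthy = passed
                streak = 0
        targets.append("PRIMARY" if healthy else "SECONDARY")
    return targets

# One isolated failure is tolerated; three in a row trigger failover.
results = [True, False, True, False, False, False, True]
print(route_targets(results))
# ['PRIMARY', 'PRIMARY', 'PRIMARY', 'PRIMARY', 'PRIMARY', 'SECONDARY', 'SECONDARY']
```

A single passing check after failover does not switch traffic back in this model, because the same threshold applies in the recovery direction.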

Question 4: Scenario

A company wants centralised backup management for RDS, DynamoDB, EFS, EBS volumes, and EC2 instances across 20 AWS accounts in their Organisation. Which service simplifies this?

A. Manual RDS snapshots and EBS snapshots scripted per account: not centralised.
B. Amazon S3 Lifecycle Policies: manages object storage tiers, not multi-service backup.
C. AWS Backup: centralised backup service supporting multiple AWS services with cross-account and cross-region backup policies. ✓ Correct
D. AWS DataSync: a data transfer service, not a backup management service.

Explanation

AWS Backup provides a centralised place to configure and audit backup policies across AWS services (RDS, Aurora, DynamoDB, EFS, EBS, FSx, Storage Gateway, EC2) and AWS accounts. You create Backup Plans with schedules and retention rules, and assign resources. Cross-account backup and cross-region copy are supported for DR compliance.
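To make "Backup Plans with schedules and retention rules" more concrete, the sketch below builds a plan definition as a plain dictionary. The helper name, plan name, vault names, and account ID are hypothetical. The key names follow what is understood to be the request shape of the AWS Backup CreateBackupPlan API (for example via boto3's `create_backup_plan(BackupPlan=...)`), and should be checked against the current API reference before use. No AWS call is made here.

```python
def build_backup_plan(plan_name, vault_name, schedule_cron, retention_days,
                      dr_vault_arn=None):
    """Build a backup plan definition with one rule (hypothetical helper)."""
    rule = {
        "RuleName": f"{plan_name}-rule",
        "TargetBackupVaultName": vault_name,
        "ScheduleExpression": schedule_cron,                # when backups run
        "Lifecycle": {"DeleteAfterDays": retention_days},   # retention rule
    }
    if dr_vault_arn:
        # Optional copy of each recovery point to a vault in another
        # region or account, for DR purposes.
        rule["CopyActions"] = [{
            "DestinationBackupVaultArn": dr_vault_arn,
            "Lifecycle": {"DeleteAfterDays": retention_days},
        }]
    return {"BackupPlanName": plan_name, "Rules": [rule]}

# Hypothetical daily plan with 35-day retention and a cross-region copy.
plan = build_backup_plan(
    plan_name="daily-35d",
    vault_name="primary-vault",
    schedule_cron="cron(0 5 ? * * *)",   # every day at 05:00 UTC
    retention_days=35,
    dr_vault_arn="arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",
)
print(plan["Rules"][0]["ScheduleExpression"])
```

Resources are attached to a plan separately (by tag or ARN), so one plan definition like this can cover RDS, DynamoDB, EFS, EBS, and EC2 resources at once.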

Question 5: Scenario

During a Pilot Light DR failover, a company's DR team needs to promote the standby RDS Read Replica in the DR region to a primary database. After promotion, what else must be done to restore service?

A. Nothing: DNS and application connection strings are automatically updated by AWS during promotion.
B. Restart the Read Replica replication process: promotion is temporary.
C. Scale up application servers in the DR region, update Route 53 DNS or application connection strings to point to the new primary endpoint, and verify application health. ✓ Correct
D. The Pilot Light application servers automatically start and discover the new primary.

Explanation

Pilot Light failover is not fully automatic. After promoting the Read Replica: (1) Scale up the EC2/ECS application tier in the DR region, which was stopped or running at minimal capacity. (2) Point connection strings or Route 53 records at the new RDS endpoint. (3) Verify that all application components are healthy. This manual orchestration is why Pilot Light has a longer RTO than Warm Standby.
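The failover sequence can be sketched as a small runbook function. Every step is passed in as a callable, so the ordering logic runs here without touching AWS. In a real setup the steps might wrap calls such as boto3's `promote_read_replica`, `set_desired_capacity`, and `change_resource_record_sets`, but that wiring is an assumption left out of this sketch. The function name, stub functions, and endpoint below are hypothetical.

```python
from typing import Callable, List

def run_pilot_light_failover(
    promote_replica: Callable[[], str],
    scale_app_tier: Callable[[], None],
    repoint_traffic: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> List[str]:
    """Run the failover steps in order and return a log of what was done."""
    log = []
    new_endpoint = promote_replica()      # replica becomes a standalone primary
    log.append(f"promoted replica: {new_endpoint}")
    scale_app_tier()                      # step 1: bring up the application tier
    log.append("scaled up application tier")
    repoint_traffic(new_endpoint)         # step 2: DNS or connection strings
    log.append(f"repointed traffic to {new_endpoint}")
    if not is_healthy():                  # step 3: verify before declaring success
        raise RuntimeError("health verification failed after failover")
    log.append("verified application health")
    return log

# Stub implementations standing in for the real operations.
def promote_stub() -> str:
    return "dr-db.example.internal"       # hypothetical endpoint name

def scale_stub() -> None:
    pass

def repoint_stub(endpoint: str) -> None:
    pass

def health_stub() -> bool:
    return True

for line in run_pilot_light_failover(promote_stub, scale_stub, repoint_stub, health_stub):
    print(line)
```

Scripting the sequence like this does not make Pilot Light automatic, but it shortens the manual orchestration that dominates its RTO.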
