Maelstrom: mitigating datacenter-level disasters by draining interdependent traffic safely and efficiently

4+ year of experience dealing with datacenter outages at Facebook
Maelstrom = system FB use in prod to mitigate and recover from DC-level disasters
Idea: drain traffic away from the failed DC, move it to another DC
4+ years, 100+ disasters => 1 disaster per 2 weeks!! (DC-wide incident)
3 valuable contributions:
- The description of the Maelstrom
- An explanation of drain tests as they are practiced at FB
- Hard-won wisdom and expectation settings for rolling out similar capabilities in your own organization
Traffic types:
Maelstrom is used to execute recovery runbooks:
- to drain traffic out of the datacenter
- to restore traffic back
Runbooks are made of tasks. Dependencies between tasks.
Every service maintains its own service-specific runbook
If an entire datacenter is down - datacenter evacuation runbook is used. Such a runbook aggregates a collection of service-specific runbooks.
Collection of task templates
Task has health metrics, pre-conditions, post-conditions and dependencies
Library for common operations (LB to alter traffic, migrate shards, etc)
Maelstrom Runtime: Scheduler (correct order of tasks) + Executor (execute tasks + validate results)
Maelstrom compares the duration of each task to 75th percentile (stuck? -> alert)
Feedback loop to pace the speed of traffic drains based on monitoring
Drain Testing
Runbook should be changed if the system is changed -> stale
Drain test - is to verify and build trust in the runbook

Gliush Notebook