CloudFlare outage (Jul 2nd)
https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/
- Transparency - is the best
- TLDR
- Deployed a new rule that caused CPU to become exhausted (https://blog.cloudflare.com/cloudflare-outage/)
- -> 502 status (by front line servers that had CPU)
- Bad regular expression that created excessive backtracking
(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
- Sequence of events
- 13:42 engineer deployed a minor change to the XSS detection rule
- 13:45 PagerDuty page with error in WAF Manager (synthetic test)
- 13:45 Many other tests failed
- 13:45 P0 incident, SRE teams started working, all other teams helped
- 14:00 WAF was identified as a component causing problems (“attack variant” dismissed)
- 14:02 proposed a ‘global kill’ = mechanism to disable single component worldwide
- => Access service will be down, can’t authenticate to control panel
- => Security feature will disable credentials if anyone won’t use it frequently
- => Can’t access Jira or build system directly. Need to use workaround that is used very seldom
- 14:07 executed ‘global kill’
- 14:09 CPU is back to the expected levels
- 14:52 were 100% sure they found and fixed a problem and restored WAF
- Releases:
- WAF is released globally, unlike all other changes. 476 changes per 60 days (one every 3 hour)
- Usual release (may take hours - days):
- DOG (install for cloudflare internal usage)
- PIG (small subset of non-paying customers)
- Canary releases
- Live
- WAF is to protect again some attacks. Need to react quickly
- CI test must pass before WAF deploy
- At 13:31 - the PR is merged
- At 13:37 - TeamCity turn on the green light (all tests passed)
- Tests: unit tests (positive and negative), but not for the CPU usage.
- Tests run time didn’t increase
- TeamCity wrote changes in QuickSilver KV (p99: 2.29sec to write change to ~200cities world wide)
- Reasons of the problem:
- Regular expression that could easily backtrack enormously
- Protection was removed weeks ago (to reduce CPU usage, lol)
- Tests didn’t check CPU usage
- No staged rollout
- Rollback plan required WAF build twice, take too long
- The first alert took too long to fire
- Dogfooting -> the SRE had the same problems accessing internal resources
- Simplified regex:
.*.*=.*
x=x
requires 23 steps to match
x=xx
requires 33 steps to match
x=xxx
requires 45 steps to match
- not linear, exponent
- https://www.regular-expressions.info/catastrophic.html