Testing in Production, the safe way
- It is not easy, it is not risk-free.
- Often “staging” = “miniature replica of production”
- Need to keep stage “in-sync” with production, which is almost impossible:
- Impossible to copy production data (GDPR, etc)
- The size of staging cluster
- Different configuration of LB, DB, Queues, etc (# of connections opened, # of replicas live, # of Kafka partitions, etc)
- Different monitoring configuration (because traffic is not similar to production)
- Some testing better to perform on prod (long-living tests, memory leaks, unusual CPU usage patterns)
- Testing in prod requires:
- Huge automation
- Understanding of best practices
- Prepared design of the system
- Best effort simulation of the production is - production
-
Production consists of 3 phases:
-
Phase 1 - Deploy
- Process of installing new version to production servers
- Deploy does not expose users to the new version
-
Phase 2 - Release
- Process of moving production traffic to the new version
- Up to several days
- In case of any problem - rollback to the previous version
- Best, when automated
-
Phase 3 - Post-Release
- Some degree of degradation all the time (something is broken)
-
Testing in Production in the Deploy Phase.
- Integration testing: just use prod env, do not try to copy it
- Deploy services (without releasing, e.g., with new labels), but use in Testing
- Service Mesh helps
- New DB row for testing, do not touch production data
- Shadowing (= Dark Traffic Testing = Mirroring)
- Production traffic is captured and replayed against newly deployed version
- Tap Compare
- Send traffic to the new version and to previous version => And compare the results
- Load Testing
- Config Tests
- Canary release to test configuration.
- No other test could find problems
-
Testing in Production in the Release Phase.
- Canarying
- Monitoring
- Exception Tracking
- Traffic Shaping (0% -> 100% and back in case of problem)
-
Testing in Production in the Post-Release Phase.
- Feature Flagging or Dark Launch
- A/B Testing
- Logs/Events, Metrics and Tracing
- Profiling
- Teeing
Testing in Production: the hard parts
Testing in Production: the hard parts