Overload Control for Scaling WeChat Microservices
https://www.cs.columbia.edu/~ruigu/papers/socc18-final100.pdf
- 1.INTRODUCTION
- Overload control aims to mitigate problems during service overload
- For simple services, a gateway monitors the load status and rejects
client requests (load shedding)
- For complex SOA systems it is more difficult:
- all services should be monitored
- independent load shedding is hard to implement
- with a large number of services, there is a high probability that a user request
fails because some service along the way rejects it
- introduce overload control scheme DAGOR:
- each client request is assigned a business priority and a user priority
- each microservice maintains its own priority thresholds
- during overload, the threshold is increased (to shed half of the load), and the new
threshold value is passed to the direct “upstream microservices”
- these “upstream microservices” then stop sending requests with low priorities
- 2.WECHAT SERVICE ARCHITECTURE
- DAG of services (directed acyclic graph)
- leap services = number of outbound edges > 0
- basic services = number of outbound edges = 0
- entry leap services = leap services with number of inbound edges = 0
- total number of requests to the entry services: ~10^10 to 10^11 per day!!
- 3k services
- 20k machines
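
The three service categories above follow directly from edge counts in the call graph. A minimal sketch, assuming the DAG is given as a dict from caller to callees; the function name and the service names in the example are made up for illustration:

```python
from collections import defaultdict


def classify_services(calls):
    """Classify services as 'basic', 'leap' or 'entry'.

    calls maps a service name to the list of services it invokes
    (hypothetical input format, not taken from the paper).
    """
    services = set(calls)
    in_degree = defaultdict(int)
    for callees in calls.values():
        services.update(callees)
        for callee in set(callees):
            in_degree[callee] += 1

    kinds = {}
    for service in services:
        if not calls.get(service):        # no outbound edges
            kinds[service] = "basic"
        elif in_degree[service] == 0:     # outbound edges, no inbound edges
            kinds[service] = "entry"
        else:                             # outbound and inbound edges
            kinds[service] = "leap"
    return kinds


example = {
    "Login": ["Account"],
    "Messaging": ["Contacts", "Account"],
    "Contacts": ["Account"],
    "Account": [],
}
kinds = classify_services(example)
```

Here `Login` and `Messaging` come out as entry services, `Contacts` as a plain leap service, and `Account` as a basic service.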
- 3.OVERLOAD
- Scenario1 (simple overload): ServiceA->ServiceM (overloaded)
- Scenario2 (subsequent overload): ServiceA->ServiceM + ServiceA->ServiceM (A calls the overloaded M twice)
- Scenario3 (subsequent overload): ServiceA->ServiceM + ServiceA->ServiceN (both M and N are overloaded)
- For subsequent overload, if each call to an overloaded service succeeds with p=0.5, a request to ServiceA succeeds with p = 0.5 * 0.5 = 0.25
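
The arithmetic above generalizes to any number of calls. A small sketch, assuming each overloaded service sheds requests independently of the others (the function name is made up for illustration):

```python
import math


def success_probability(admit_probs):
    """Probability that every downstream call of one request is admitted.

    admit_probs holds one admission probability per downstream call.
    Assumes the shedding decisions are independent of each other.
    """
    return math.prod(admit_probs)


two_calls = success_probability([0.5, 0.5])
four_calls = success_probability([0.5] * 4)
```

With two calls at p=0.5 this gives the 0.25 from the notes; with four such calls it already drops to 0.0625, which is why uncoordinated shedding hurts requests that fan out to many services.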
- 4.DAGOR OVERLOAD CONTROL
- Service agnostic. Can be run anywhere
- Runs on each individual machine
- !! load status = queuing time (time difference between the request arrival and processing being started)
- Response time is a poor indicator for leap services (it also includes the time spent in downstream services)
- CPU utilization is a poor indicator: high CPU usage doesn’t necessarily mean the service is overloaded.
- load is calculated every 2k requests or every 1s (whichever comes first)
- 20ms in the queue = the service is overloaded (default timeout for each service is 500ms)
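
The detection rule above can be sketched as a small window-based monitor. This is only an illustration: the class name is made up, timestamps are passed in explicitly (in seconds) to keep it deterministic, and it assumes the 20ms threshold is compared against the average queuing time of the window.

```python
class QueueTimeMonitor:
    """Flags overload from the average queuing time of the current window."""

    def __init__(self, threshold_s=0.020, window_requests=2000, window_s=1.0):
        self.threshold_s = threshold_s
        self.window_requests = window_requests
        self.window_s = window_s
        self.overloaded = False
        self._reset()

    def _reset(self):
        self._window_start = None
        self._count = 0
        self._total_queue_time = 0.0

    def record(self, arrived_at, started_at):
        """Record one request; return True if this request closed a window."""
        if self._window_start is None:
            self._window_start = started_at
        self._count += 1
        self._total_queue_time += started_at - arrived_at

        window_full = self._count >= self.window_requests
        window_expired = started_at - self._window_start >= self.window_s
        if not (window_full or window_expired):
            return False

        # Window closed: compare the average queuing time with the threshold.
        average = self._total_queue_time / self._count
        self.overloaded = average > self.threshold_s
        self._reset()
        return True


monitor = QueueTimeMonitor(window_requests=3)  # tiny window for the example
for arrived in (0.0, 0.1, 0.2):
    monitor.record(arrived, arrived + 0.030)   # each request waits 30 ms
```

After the third request the window closes with an average of about 30ms, so `monitor.overloaded` is `True`.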
- Business Oriented Admission Control
- Business priorities (Login > Pay > Message > ImageSent > …)
- Small table replicated on all the machines
- Any request is assigned a business priority. Sub requests copy the same business priority
- An API to change it was provided at first, but it was deprecated some time later
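
A minimal sketch of the table lookup and the priority inheritance described above. The numeric values, the "smaller number = higher priority" convention, and the lowest-priority default for unknown actions are assumptions made for this illustration; only the ordering Login > Pay > Message > ImageSent comes from the notes.

```python
from dataclasses import dataclass

# Small table that would be replicated on every machine (values are assumed).
BUSINESS_PRIORITY = {"Login": 1, "Pay": 2, "Message": 3, "ImageSent": 4}
LOWEST_PRIORITY = 99  # assumed default for actions missing from the table


@dataclass(frozen=True)
class Request:
    action: str
    business_priority: int


def new_entry_request(action):
    """Assign the business priority when a request enters the system."""
    return Request(action, BUSINESS_PRIORITY.get(action, LOWEST_PRIORITY))


def sub_request(parent, action):
    """A sub-request copies the business priority of its parent."""
    return Request(action, parent.business_priority)


pay = new_entry_request("Pay")
lookup = sub_request(pay, "Message")  # still carries the priority of "Pay"
```

Because `lookup` was triggered by a Pay request, it keeps business priority 2 even though a standalone Message request would get 3.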
- User Oriented Admission Control
- Problem: a lot of requests share the same business priority.
Shedding a whole business priority makes the load too low, admitting it again makes the load too high. It becomes flappy.
- Generate a hash for each user, and shed load inside the same business priority
by user hash (hash priority)
- Regenerate the hash for each user every hour, for fairness
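
The hourly per-user hash can be sketched as below. The choice of SHA-256, the key format, and the function name are assumptions for illustration; the 128 levels match the number of user levels mentioned in the Adaptive Admission Control notes.

```python
import hashlib

USER_LEVELS = 128


def user_priority(user_id, epoch_seconds):
    """Map a user to one of USER_LEVELS priority levels.

    The current hour is mixed into the hash input, so the same user lands
    on a (most likely) different level each hour.
    """
    hour = int(epoch_seconds // 3600)
    digest = hashlib.sha256(f"{user_id}:{hour}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % USER_LEVELS


level_now = user_priority("user-42", 1_700_000_000)
level_same_hour = user_priority("user-42", 1_700_000_100)
```

Both calls fall into the same hour, so they return the same level: a user is treated consistently within an hour, and no user is stuck with a low priority forever.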
- Session Oriented Admission Control
- User login session hash - for the same purpose as user oriented admission control.
- Users realized that logging in again helps them (but it doesn’t help the system).
- => Is not used
- Adaptive Admission Control
- Each business level has 128 user levels -> ~10k total levels (B,U)
- Searching the levels in O(N) steps won’t work, and even O(logN) won’t work => keep a histogram of all (B,U) requests
- Adjusting the load
- during overload, reduce the number of handled requests by A = 5%
- if ok -> increase the number of handled requests by B = 1%
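
A sketch of how the histogram makes the adjustment cheap: instead of probing one level per window, the next threshold is read off the per-level request counts. The function names, the representation of a level as a `(B, U)` tuple, and the convention that smaller tuples mean higher priority are assumptions for this illustration, not the paper's exact algorithm.

```python
ALPHA = 0.05  # cut admitted requests by 5% when overloaded
BETA = 0.01   # grow admitted requests by 1% otherwise


def next_threshold(histogram, admitted, overloaded, levels):
    """Pick the lowest-priority (B, U) level that should still be admitted.

    histogram: dict (B, U) -> requests seen at that level in the last window
    admitted:  number of requests admitted in the last window
    levels:    all (B, U) levels, ordered from highest to lowest priority
    Returns None if not even the top level fits into the target.
    """
    factor = (1 - ALPHA) if overloaded else (1 + BETA)
    target = admitted * factor
    running = 0
    threshold = None
    for level in levels:
        running += histogram.get(level, 0)
        if running > target:
            break
        threshold = level
    return threshold


def admits(level, threshold):
    """A request is admitted if its level is at or above the threshold."""
    return threshold is not None and level <= threshold


levels = [(b, u) for b in range(1, 3) for u in range(4)]  # 8 toy levels
histogram = {level: 100 for level in levels}              # 100 requests each
tight = next_threshold(histogram, admitted=800, overloaded=True, levels=levels)
```

In this toy setup the target is about 760 requests, so the threshold moves to `(2, 2)` and the lowest level `(2, 3)` is shed.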
- 5.EVALUATION
- Comparison of different load indicators
- Service admission control
- Stress test
- IMHO
- It’s the first detailed overload handling system that has worked in production for several years.
- There are several interesting design solutions that are not intuitive
- It will not be widely used for now: there are few companies that might need it,
and all of them already have similar systems