Introducing Thanos: Prometheus at scale

https://improbable.io/games/blog/thanos-prometheus-at-scale

Prometheus: simple and reliable
Problem: doesn’t scale to petabytes of historical data
Default solution: Hierarchical Federation
- Leaf Prometheus servers, scraped by one meta-Prometheus server
- Problems:
  - Configuration
  - Adds one more single point of failure
  - Complex rules to expose only certain data on the federated endpoint
  - Not all data is available from a single query API
Another solution: HA pairs of Prometheus servers
- Each server collects data independently -> problem with deduplication
Prometheus 2.0: total number of time series doesn’t impact server performance
Downsampling: reduce sampling rate (to see the big picture)
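A minimal sketch of the downsampling idea, not the actual Thanos implementation: raw samples are grouped into fixed time windows and each window keeps only a few aggregates. The names `Aggregate` and `downsample`, the choice of aggregates, and the 300-second window are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Aggregate:
    start: int        # window start timestamp, in seconds
    count: int        # number of raw samples in the window
    total: float      # sum of raw values (total / count gives the average)
    minimum: float
    maximum: float


def downsample(samples, window):
    """Group (timestamp, value) samples into fixed windows of `window` seconds."""
    buckets = {}
    for ts, value in samples:
        start = ts - ts % window  # align the sample to its window start
        agg = buckets.get(start)
        if agg is None:
            buckets[start] = Aggregate(start, 1, value, value, value)
        else:
            agg.count += 1
            agg.total += value
            agg.minimum = min(agg.minimum, value)
            agg.maximum = max(agg.maximum, value)
    return [buckets[start] for start in sorted(buckets)]


# Five raw samples collapse into two 5-minute windows.
raw = [(0, 1.0), (15, 3.0), (30, 2.0), (300, 5.0), (315, 7.0)]
coarse = downsample(raw, 300)
```

Keeping count, sum, min, and max per window means averages and extremes over long ranges can still be computed without reading every raw sample.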
Thanos:
- Prometheus Sidecar to store and query data
- Querier: requests data from all the sidecars, then runs the PromQL query against the data (deduplicating data from HA pairs)
- Object storage to store historic data in cloud storage. Align data (how?)
- Immutable data: blocks never change once written, so they can always be uploaded to storage as-is
- Store component: joins the gossip cluster and is treated like a Sidecar by the Queriers; used to cache and serve data from object storage
- Files are large, slow to download => Store Gateway caches index parts of files, smart query planner => minimize requests to storage (get part of a file). 4-6 orders of magnitude fewer requests than a naive implementation.
- => response times are hard to distinguish from queries against a local SSD
- Compactor: applies the Prometheus local compaction procedure to blocks in the Object Storage; also creates the downsampled data.
- Ruler: evaluates rules and alerts against Thanos Queriers. The results are then backed up to the Object Storage.
- How to migrate from Prometheus (sequence of steps)
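
The Querier's deduplication of HA pairs mentioned above can be sketched roughly as follows. This is a simplified illustration, not Thanos's actual algorithm: it assumes each replica adds a label identifying itself (called `replica` here, a hypothetical name), drops that label, and merges series whose remaining labels are identical. Real replicas scrape at slightly different timestamps, which this sketch ignores.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only in the replica label.

    series_list: list of (labels_dict, [(timestamp, value), ...]) tuples.
    """
    merged = {}
    for labels, samples in series_list:
        # Series identity is the label set without the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        by_ts = merged.setdefault(key, {})
        for ts, value in samples:
            # The first replica seen wins for a given timestamp.
            by_ts.setdefault(ts, value)
    return [(dict(key), sorted(by_ts.items())) for key, by_ts in merged.items()]


# Two replicas of an HA pair scraped the same target; each has a gap.
a = ({"__name__": "up", "job": "api", "replica": "a"}, [(0, 1.0), (30, 1.0)])
b = ({"__name__": "up", "job": "api", "replica": "b"}, [(0, 1.0), (60, 1.0)])
result = deduplicate([a, b])
```

Here the two replica series collapse into one series without the `replica` label, and a gap in one replica is filled by samples from the other.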


Notes for episode-0208