Comparison of the Open Source OLAP Systems for Big Data: ClickHouse, Druid and Pinot
https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
- Similarities:
- Store data and do query processing on the same node.
- Have their own storage format with indexes, tightly integrated with their query processing engines.
- Have data distributed quite statically between the nodes. Flip side: they don’t support queries that require moving large amounts of data between nodes, e.g. joins between two large tables.
- No point updates and deletes -> more efficient columnar compression and more aggressive indexes (see the sketch after this list).
- Druid and Pinot support Lambda-style streaming and batch ingestion. ClickHouse supports batch inserts directly.
- Immature (a lot of bugs).
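A toy sketch (mine, not from the article) of why append-only, immutable data compresses so well: with no point updates or deletes, a column inside a part/segment can stay sorted, so run-length encoding collapses it into a handful of runs. The column name and values below are hypothetical.

```python
from itertools import groupby

def run_length_encode(column):
    """Encode a list of values as (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# A sorted, immutable dimension column (e.g. country codes) collapses to a few
# runs; random in-place updates would fragment these runs and also invalidate
# indexes built over the sealed part/segment.
country = ["DE"] * 1000 + ["FR"] * 800 + ["US"] * 1200
encoded = run_length_encode(country)
print(encoded)                          # [('DE', 1000), ('FR', 800), ('US', 1200)]
print(len(country), "values ->", len(encoded), "runs")
```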
- Performance Comparison and Choice of the System
- Usually everybody ingests part of their real data, runs some benchmarks, and chooses based on that. Incorrect!
- You should instead compare how quickly your organization could evolve each system toward being more optimal for your use case.
- Druid’s dev process is similar to the Apache model: several large contributor companies. (Less risky?)
- ClickHouse is developed by Yandex
- Pinot - by LinkedIn
- Druid commits to supporting a “Developer API” that allows contributing custom column types, aggregations, “deep storage” options, etc. and maintaining them separately from the core codebase. (It still breaks often.)
- According to GitHub, Pinot has the most people working on it: roughly 10 person-years invested over the last year, vs. 6 for ClickHouse and 7 for Druid.
- Differences between ClickHouse and Druid/Pinot
- Druid + Pinot: all data in each “table” is partitioned by time into intervals; each such “part” is then “sealed” individually into a self-contained “segment”. Each segment includes table metadata, compressed columnar data, and indexes (see the sketch after this list).
- Segments are persisted in “deep storage” (e.g. HDFS). Nodes are not responsible for the durability of segments -> they can be replaced.
- Segments could be loaded on any node.
- Some “master” server is responsible for assigning segments to nodes and moving them between nodes when needed.
- Metadata is stored elsewhere (ZK in Druid, “Helix” framework in Pinot)
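A minimal sketch of the Druid/Pinot-style model described above, assuming hypothetical names and a simplified least-loaded placement policy (the real systems’ assignment logic is more involved): rows are bucketed into time-interval segments, segments live in deep storage, and a “master” decides which query node should load each segment.

```python
from datetime import datetime, timezone

SEGMENT_INTERVAL_SECONDS = 3600  # one segment per hour, purely for illustration

def segment_id(table, event_ts):
    """Map an event timestamp to the time-interval segment it belongs to."""
    bucket = int(event_ts.timestamp()) // SEGMENT_INTERVAL_SECONDS
    return f"{table}_{bucket}"

def assign_segments(segment_ids, nodes):
    """Master-style assignment: each segment goes to the least-loaded node."""
    load = {node: 0 for node in nodes}
    assignment = {}
    for seg in segment_ids:
        node = min(load, key=load.get)
        assignment[seg] = node   # the node then pulls the segment from deep storage
        load[node] += 1
    return assignment

events = [datetime(2018, 2, 1, h, 30, tzinfo=timezone.utc) for h in range(6)]
segments = sorted({segment_id("clicks", ts) for ts in events})
print(assign_segments(segments, ["node-a", "node-b", "node-c"]))
```

Because nodes hold no unique state, losing one only means the master reassigns its segments elsewhere.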
- Data management in ClickHouse
- Partitioned (sharded) tables. Each node holding part of the table has the full table metadata plus info about the other nodes’ shard weights, e.g. 40% of new data goes to node A, 30% to B, 30% to C (uneven weights are used to fill a newly added node with data; sketch below).
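A toy sketch (not ClickHouse code; the 40/30/30 weights are taken from the note above) of the weighted-sharding idea: each insert is routed to a shard with probability proportional to its weight, so a freshly added node can be given a larger weight until it catches up.

```python
import random

SHARD_WEIGHTS = {"A": 40, "B": 30, "C": 30}   # hypothetical per-node weights

def pick_shard(weights):
    """Choose a shard with probability proportional to its weight."""
    shards, w = zip(*weights.items())
    return random.choices(shards, weights=w, k=1)[0]

counts = {shard: 0 for shard in SHARD_WEIGHTS}
for _ in range(10_000):
    counts[pick_shard(SHARD_WEIGHTS)] += 1
print(counts)   # roughly 40% / 30% / 30% of inserted rows per shard
```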
- Data management: comparison
- ClickHouse’s model is simpler: no deep storage, no master node. This becomes a problem when all nodes are full of data and the cluster has to be rebalanced.