Comparing Three Real-Time OLAP Databases: Apache Pinot, Apache Druid, and ClickHouse

Written By

Chinmay Soman

, Neha Pawar

Published

April 18, 2023

Reading Time

Video

The authors of this blog gave a presentation about it at Real-Time Analytics Summit 2023.

Introduction

With Real-Time Analytics gaining popularity, we are often asked, “How do you compare Apache Pinot™ vs. Apache® Druid vs. ClickHouse®?” In this article, we attempt to provide a fair comparison of the three technologies, along with areas of strength and opportunities for improvement for each.

Recently, we’ve seen a shift in the analytical landscape from traditional OnLine Analytical Processing (OLAP), which was batch-oriented in nature, to real-time OLAP. Instead of generating reports or dashboards which are hours delayed, we now want to analyze data as its being generated – within milliseconds.

Consequently, real-time OLAP databases are getting increasingly popular in the last few years and have seen mainstream adoption in several big companies such as LinkedIn, Uber, Netflix, Cisco, and Bloomberg. As organizations continue to move from batch analytics to real-time analytics, choosing the right OLAP database is an increasingly critical decision. The choice depends on several factors, such as use case, ingestion constraints, query performance, and production readiness, to name a few.

In this article, we present an in-depth comparison of three open-source real-time OLAP databases: Apache Pinot, Apache Druid, and ClickHouse, with the above dimensions in mind. In addition, we will also look at some salient features in-depth to uncover how things work under the hood across different systems.

Comparison Structure

Here’s a simplified view of the internal layers of a typical real-time OLAP database. In the following few sections, we will go through each layer and compare the capabilities across the three systems mentioned above. The versions used for each project are Apache Pinot 0.12.0, released on Jan 19, 2023; Apache Druid 25.0.0, released on Jan 4, 2023; and ClickHouse 23.1, released on Jan 26, 2023. Since then, Clickhouse released 22.2 and 23.3; these releases have not yet been incorporated into our analysis.

Fig: Layers of a typical real-time OLAP database

Note: There are some cases where certain capabilities are supported in the respective commercial Database-as-a-Service (DBaaS) product offerings. We will make explicit note of these in the comparison table wherever necessary.

Summary of Comparison: Apache Pinot vs. Apache Druid vs. ClickHouse

Based on our observations, as described below in great detail, here’s a quick summary of the strengths and weaknesses of each system.

Apache Pinot

Highlights

Apache Pinot has been designed for scale and performance with best-in-class indexing strategies and smart data layout techniques used to boost query performance, amongst other things. It has great support for ingesting data from real-time and batch data sources and is the only system that can support upserts in real-time. Overall, it’s easy to operate with rich cluster management tools and the only system to support true multi-tenancy. Lastly, the flexible cloud deployment model and best-in-class cloud tiered storage make it a compelling choice for companies of all sizes to build real-time analytical applications.

Needs Improvements

Apache Pinot needs to improve on its data type support, outbound connectors for ease of BI integration, SQL based pre-processing, and some security improvements.

Apache Druid

Highlights

Apache Druid is also designed to handle large-scale data and does a great job of ingesting from real-time and batch data sources. It has great support for data pre-processing during ingestion and also has recently launched a flexible SQL based ingestion approach.

Needs Improvements

Apache Druid currently does not support real-time upserts, which is needed in many critical use cases. It does not support any way to push/write data in real time. It has limited indexing capabilities and no current support for any cloud based tiered storage.

ClickHouse

Highlights

Of all the three systems, ClickHouse is the easiest to get started for new developers with the least number of components. Compared to the other two – it has a better real-time push model.although it also has some limitations). It has great data type support and very rich support for outbound connectors.

Needs Improvements

ClickHouse has poor support for pulling data from real-time sources.see below why this is important). Like Druid, indexing support is also quite limited, which hinders query performance, especially when dealing with a large amount of data. Although single node deployment is easy, running ClickHouse in distributed mode is quite complex. It has some operational limitations in terms of difficulty to scale-up/out nodes, lack of multi-tenancy, and management console. In general, scalability is one of the key challenges with ClickHouse.

We now begin to go deep into each of the layers in detail.

High-Level Architecture: Apache Pinot, Apache Druid, and ClickHouse

To set some context for the rest of the blog, let’s take a high-level look at each system’s architecture.

All of the real-time OLAP databases considered here are column-oriented. Columnar data formats are good for efficiently scanning only selected columns of data, minimizing storage footprint, and scanning through highly redundant data entries (also called “low-cardinality columns”).

They are also optimized to run in a cluster of servers for high availability and horizontal scalability (scale out). However, the ease of scaling and management and the way queries are handled differs greatly across the different implementations, as we will see in the sections to come.

Each of these three databases has the concept of a table, a logical abstraction representing a collection of related data composed of columns and rows. Each table’s data is broken down into small chunks, which are packed in a columnar fashion. These chunks are called segments in Apache Pinot and Apache Druid, and parts in ClickHouse.

Apache Pinot

Fig: Apache Pinot architecture

Data layer : This layer contains Servers , which store a subset of segments locally and process them at query time. Servers either pull data from streaming sources directly to build segments, or Minions can build and push segments from batch data sources. Segments are also stored onto a deep store —a cloud object store such as Amazon S3, GCS, ADLS—for permanent backups and recovery.
Query layer : This layer contains Brokers , who receive queries from clients and perform a scatter gather to and from Servers.
Control layer : This layer contains the Controllers , which manage all the components of the cluster. Apache Zookeeper is used as a persistent metadata store.

Read more details in the Pinot architecture docs.

Apache Druid

Fig: Apache Druid Architecture

Data layer : This layer consists of Middle Managers, which execute ingestion workloads, and Historicals, which store all queryable data. Segments are stored in a deep store for permanent backup and recovery.
Query layer : This layer contains the Brokers, which handle queries from external clients, and the optional Routers, which route requests to Brokers, Coordinators, and Overlords.
Control layer : This layer contains the Coordinator which manages data availability on the cluster, and the Overlords, which control the assignment of data ingestion workloads. Zookeeper is used for internal service discovery, coordination, and leader election. A system like MySql or Postgres is used for segment and task metadata storage.

Read more details in the Druid design docs.

ClickHouse

Fig: ClickHouse architecture

Data layer : This layer contains the ClickHouse Servers , which host the data shards and manage ingestion. ClickHouse has no concept of a deep store for permanent storage.
Query layer : There is no architectural separation between data and query layer. A query can be directly made to any ClickHouse server to query data on that server. Of course, data is usually distributed, so a Distributed Table would be created on some dedicated nodes to act as query nodes.
Control layer : ClickHouse Keeper / Apache Zookeeper is used for metadata. No other centralized cluster management takes place.

Read more details in the ClickHouse “scaling out” docs.

Overall, ClickHouse has the fewest moving parts, followed by Pinot, then Druid. ClickHouse achieves this by combining many functionalities onto the same node or skipping some altogether.

While this makes getting started with ClickHouse relatively easier, several challenges arise when you hit a certain scale that comes with any practical setup.

Scalability : Separation of concerns is always a better design for scalability – e.g. In Pinot if the rate of events goes up one can simply scale the real-time servers minimizing the impact on query latency if queries per second increase one can simply scale the brokers.
Operational complexity : Users have to build tooling and automation on their own for every skipped functionality – e.g. distributed query routing, or cluster management operations such as replication, ingestion, and node maintenance in ClickHouse have to be set up manually, which is automatic and out-of-the-box in Pinot and Druid, making them more operator-friendly for managing production workloads.

Data Sources

This section gives a very high-level view of native ingestion support for various kinds of data sources. The corresponding in-depth details are captured in the ingestion section below.

Note : Native here means no external component is needed for that capability.

Real-Time Data Sources

Native, out-of-the-box support for real-time data sources:

Real-time data source	Apache Pinot	Apache Druid	ClickHouse
Apache Kafka	Supported	Supported	Limited (see Ingestion section below)
Amazon Kinesis	Supported	Supported	No
Google PubSub	Supported *StarTree Cloud	No	No
Apache Pulsar	Supported	Not native. Can be done via KoP	Not native. Can be done via KoP
Change Data Capture (CDC) Ability to ingest CDC stream consistently	Supported	Not supported (lack of upserts)	Limited (due to limited upsert support)

Table: Real-Time Data Source: a comparison of Apache Pinot, Apache Druid, and Clickhouse

Batch Data Sources

Native, out-of-the-box support for batch sources:

Batch data source	Apache Pinot	Apache Druid	ClickHouse
S3	Yes	Yes	Limited Support (see below)
GCS	Yes	Yes	Needs custom data pipeline
ADLS	Yes	Yes	Needs custom data pipeline
HDFS	Yes	Yes	Needs custom data pipeline
Databricks DeltaLake	Yes *StarTree Cloud	Needs custom data pipeline	Needs custom data pipeline
Apache Iceberg	Needs custom data pipeline	Needs custom data pipeline	Needs custom data pipeline
Apache Hudi	Needs custom data pipeline	Needs custom data pipeline	Needs custom data pipeline

Table: Batch Data Sources: a comparison of Apache Pinot, Apache Druid, and ClickHouse

Note: Although ClickHouse mentions the above systems in the integrations section, that’s primarily for query processing and not for data ingestion.

SQL Data Sources

Native, out-of-the-box support for SQL sources:

SQL data source	Apache Pinot	Apache Druid	ClickHouse
Snowflake	Yes *StarTree Cloud	Needs custom data pipeline	Needs custom data pipeline
BigQuery	Yes *StarTree Cloud	Needs custom data pipeline	Needs custom data pipeline
MySQL	No (can be done via CDC)	Yes	Needs custom data pipeline
Postgres	No (can be done via CDC)	Yes	Needs custom data pipeline

Table: SQL Data Sources, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Note: Although ClickHouse mentions the above systems in the integrations section, that’s primarily for query processing and not for data ingestion.

Ingestion

Before a database can run analytical queries, the data must get into the system. This is the process of ingestion . This layer deals with all the capabilities required for ingesting data into the OLAP system from various kinds of data sources, as mentioned above. As we go through the details of each data source, here are the key things to keep in mind:

Append only vs upserts : Append only ingestion is the minimum requirement from an OLAP database. But supporting upserts provides data and query consistency.
Pull vs push : Can the OLAP database support a pull / push based ingestion or both. Each one has its pros and cons.
At least once vs exactly once : At-least-once semantics is the minimum requirement to ensure there’s no data loss. Having the ability to do exactly once data ingestion is great for certain business critical applications.

Real-Time Ingestion

Ingesting data from real-time data sources like Apache Kafka is integral to real-time analytics. In this section, we will look at the different ways of ingesting real-time data and compare the nuances of each system.

Pull vs Push Based Real-Time Ingestion

There are two ways of ingesting real-time data:

Pull based: Actively poll/fetch data from the real-time data source
Push based: Push messages directly into the database

A native pull-based ingestion model is generally considered to be better than a push-based model for real-time data ingestion. Here are the reasons why:

Fewer moving parts : A native pull-based approach works great with systems like Apache Kafka and Amazon Kinesis and obviates the need for an intermediate connector (e.g., the Kafka Connect framework). This improves overall freshness and error probability by reducing the number of hops in the pipeline.
More scalable : Based on empirical data, the pull-based approach supports higher throughput than a corresponding push-based approach.
Ease of replication : The pull-based model makes it very easy to replicate data since all replicas can fetch independently from the data source. In addition, strict partition ordering within the data source also alleviates the need for strong coordination among replicas to keep them in sync. For instance, replicas can all agree on the same Kafka offset to ensure data synchronization. A push-based model has to deal with the overhead of how to keep the replicas in sync.
Ease of exactly once semantics : A pull-based model makes it easier to guarantee exactly once semantics with systems like Kafka. Arguably, a push based model has to do more work to ensure correctness in this case.

Need for Push Based Ingestion

There are certain cases where push based ingestion is preferred. For instance, in some cases, the real-time data might be generated directly by an application/service without any Kafka in the picture. In other cases, even though data might exist in a Kafka cluster, pull based ingestion may not be possible due to network connectivity issues (eg: Kafka in another VPC without any inbound access). For all such cases, we need the ability to push events directly to the database.

Next, we’ll go through each such mechanism in more detail.

Pull Based Ingestion

Capability	Apache Pinot	Apache Druid	ClickHouse
Real-time pull model Ability to fetch data from real-time sources	Native support which provides high ingestion performance and low ingestion latency	Native support which provides high ingestion performance and low ingestion latency	ClickHouse does not have a true pull based model but you can get similar behavior using a complex setup. For example, in the case of Apache Kafka users need to configure Kafka table engine + Materialized view + MergeTree. Ultimately, data is still being pushed into the MergeTree family table. Real-time data goes through multiple hops before it can be queried by the user. (see Integrating Kafka with ClickHouse) In general, this approach has a bunch of limitations: Higher ingestion latency due to large batches of data Chance of duplication *Cloud Readiness For more details of these limitations, please read the Apache Kafka Integration section below
Out-of-order handling Ability to handle late arriving events in real-time	Supported	Supported. Out of order events will end up generating smaller segments	Supported
Real-time ingestion scalability Ability to handle very high ingestion rate (messages/second)	High: Can sustain high incoming message rate	High: Can sustain high incoming message rate	Moderate-Low because of the complex setup mentioned above
Exactly once ingestion guarantees Ability to persist messages from the data source in an exactly once manner in the database	Supported	Supported	Not supported in pull based approach
Stream partition changes Automatically handle partition changes at the source	Supported	Supported	Supported

Table: Pull-Based Ingestion, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Apache Kafka Integration: Apache Pinot & ClickHouse comparison

It can be said that the real-time analytics journey began with Apache Kafka. The popularity and widespread adoption of Kafka in nearly every major company worldwide has made it one of the most important sources of real-time data. Apache Pinot was designed to integrate well with Apache Kafka and has excellent support for handling a high ingestion rate, out of order events, and exactly once semantics.

As noted in the table above, Druid and Pinot are very similar in terms of Kafka integration, whereas ClickHouse took a very different approach. This section will focus on these differences between Pinot and ClickHouse.

Low-Level vs Consumer groups

Internally, ClickHouse uses Consumer Groups within the Kafka Table Engine to fetch data from Kafka. This has many issues, including message duplication, scalability challenges, fault tolerance, and so on, as described in this talk about Pinot’s journey away from the Consumer group approach. Conversely, Pinot evolved to use a low-level Kafka consumer approach that addresses all these issues.

Ingestion Latency

Ingestion latency is defined as the amount of time between an event being produced at the source and the event being queryable in the database. Kafka Table Engine in ClickHouse reads messages in large batches. Default is mentioned as 64K, and higher batch sizes are recommended.500K to 1 Million entries per batch are not uncommon). Needless to say, this will result in higher ingestion latency. Pinot, on the other hand, is able to stream data in much smaller batches – almost 3 orders of magnitude lower thus leading to much lower ingestion latency without adding any additional overhead.

Data Duplication

The recommended pull ingestion mechanism in ClickHouse is not only complex, but it can also result in data duplication. The recommended way to resolve those is to use ReplacingMergeTree, which has its own issues as discussed in the upsert section. On the other hand, since Pinot generates segments in lockstep with Kafka offsets – it’s guaranteed to not generate any duplicates (unless duplicates are present in the Kafka stream itself, which can also be handled by Pinot – see Upsert section below).

Scalability

In ClickHouse, the number of servers is limited by the number of Kafka partitions and, therefore, difficult to scale. In the case of Pinot, segments can be automatically moved to different hosts.non-consuming), which makes it very easy to scale and not limited by the number of Kafka partitions.

Data Replication

As mentioned previously, a pull based approach makes data replication in Pinot much simpler by relying on Kafka offsets for coordinating across replicas. On the other hand, since ClickHouse does not have a pure pull model, it has to rely on a much more complex replication system. For each batch of data inserted into the ReplicatedMergeTree family table, metadata is written to Zookeeper / ClickHouse Keeper. It’s important to note that this happens in the fast path of data ingestion, and having a strict dependency on Zookeeper / ClickHouse Keeper is quite a bit of an overhead.

Cloud Readiness

It appears that Kafka Table Engine is not supported in ClickHouse Cloud and the recommendation is to use alternate approaches (eg: push based techniques). Conversely, Pinot has productionized Kafka integration over the last several years and is fully supported in StarTree Cloud.

Push Based Ingestion

As mentioned before, push based ingestion is often needed to get around Kafka dependency or network connectivity issues. Let’s take a look at how it works in the three systems:

Capability	Apache Pinot	Apache Druid	ClickHouse
Real-time push model Ability to insert records in real-time	Experimental feature – *supported in StarTree Cloud In addition, Apache Pinot also supports writing batches of data via Segment Writer API	Supported on older versions with Tranquility. not guaranteed to work with newer versions. Apache Druid does have the ability to do a batch insert using the SQL based ingestion approach	Supported. However, the recommendation is to use bigger batch sizes during data ingestion. As mentioned previously, this comes at a cost of higher ingestion latency
Out-of-order handling Ability to handle late arriving events in real-time	Supported	N/A	Supported
Real-time ingestion scalability Ability to handle very high ingestion rate (messages/second)	Limited as compared to the pull model	N/A	Limited as compared to the pull model
Exactly once ingestion guarantees Ability to persist messages from the data source in an exactly once manner in the database	Currently not supported	N/A	Supported via Kafka Connect Sink for ClickHouse, which is in beta as of this writing. Relies on a complex design that uses a hash of insert data blocks which is stored in ClickHouse Keeper (like Zookeeper) for de-duplication.

Table: Push-Based Ingestion, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Real-Time Upserts

In the case of real-time event streaming data sources such as Apache Kafka, duplicate or upsert events may exist in the stream. Several mission-critical use cases require the database to handle such events during ingestion to ensure accurate query results. Uber’s financial dashboards, Citi’s risk monitoring, and many others rely on upserts to ensure business continuity. This section compares how the different systems handle such events.

Capability	Apache Pinot	Apache Druid	ClickHouse
Full row upsert Ability to upsert an existing row with a new row having the same primary key.	Supported	Not supported during ingestion	Limited support during ingestion: The recommended way is to use ReplacingMergeTree which handles upsert events asynchronously. This means queries results may not be accurate. From this link it appears that this asynchronous process typically happens every 9-10 mins or not at all (in case of not enough data) – thus leading to severe query inconsistency issues. It is possible to perform mutations synchronously via this flag: mutations_sync to 2. However this is not recommended to be used since updating is a very expensive step in ClickHouse.
Partial row upsert Ability to upsert certain columns in schema with new values for the same primary key	Supported	Not supported during ingestion	Supported with same caveats as above
Real-time Deduplication Ability to de-duplicate incoming rows based on a primary key	Supported	Not supported during ingestion	Supported with same caveats as above
Scalability Ability to handle large number of primary keys	Supported *StarTree Cloud	N/A	Limited due to the above caveats

Table: Real-Time Upserts, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Note that both Apache Druid and ClickHouse claim the ability to do upserts. However, it requires complex workarounds such as batch insertions, query time aggregations, or a complex data model that is difficult to generalize.

Upserts in Pinot

Pinot has great support for handling upsert events in the incoming real-time stream for event-streaming technologies like Kafka. This is done without sacrificing ingestion and query performance and maintains real-time data freshness — which is not true in the case of the other systems.

How does it work?

Users can enable upserts in a real-time Pinot table by specifying a primary key (see the Stream Ingestion with Upsert docs.. Each server in turn maintains a map of the primary key to the location of the latest version of the record for that key. When an existing record is upserted by a new one, the server uses this map to mark the old record as obsolete using a per segment bitmap. All such obsolete records are then filtered out during query execution. This design ensures least overhead during ingestion and query time and works at scale.

Fig: Upserts in Apache Pinot

Scalable Upserts with StarTree Cloud

One caveat of upserts in Pinot was the memory overhead experienced by Pinot servers when a large number of primary keys are involved. StarTree Cloud removes this limitation by eliminating the in-memory requirement of managing upsert metadata. StarTree Cloud workspaces can easily handle billions of primary keys per server without sacrificing performance.

Upsert Improvements for Pinot (Coming Soon)

You can read more about additional improvements coming to Apache Pinot in this related blog.

Some of the highlights related to upserts include:

Time To Live ( TTL) : for the primary keys — helps in reducing the footprint of the upsert metadata and thus memory overhead on the server.
Compaction : ability to delete obsolete records and improve storage efficiency
Ease of bootstrap : Bootstrapping historical data in an upsert table is currently challenging. Some users rely on a Flink Pinot connector to push historical segments in the real-time upsert enabled table. We will be adding native support for this shortly.

Batch Data Ingestion

A native, out-of-the-box support for importing data from batch data sources (e.g., S3 or GCS) is crucial for developer productivity. Otherwise, users must maintain custom data pipelines, which becomes a big overhead to maintain and scale up.

Capability	Apache Pinot	Apache Druid	ClickHouse
Pull based Batch Ingestion: Native Support for batch ingestion by pulling data from remote data sources	Yes. Scalable and performant. Uses a background task framework to ingest data from batch data source into Pinot. Most of the work is done outside of Pinot, with minimal impact on query performance.	Yes. Scalable and performant. Uses a background task framework to ingest data from batch data source into Druid. Most of the work is done outside of Druid, with minimal impact on query performance.	Limited. The suggested approach is to use `INSERT INTO with SELECT * FROM` batch location. This is considered to be less scalable in comparison. In addition, this consumes resources on the live server that’s also serving queries, which can affect query performance. See Integrating S3 with ClickHouse for an example.
Batch ingestion input format: Different supported data formats for batch ingestion	JSON CSV/TSV ORC Parquet Avro Protobuf Thrift See Input Formats in the Pinot docs for details.	JSON CSV/TSV Kafka (load Kafka header, key, value) ORC Parquet Avro Stream Avro OCF Protobuf See Data Formats in the Druid docs for details.	JSON CSV/TSV ORC Parquet Avro Protobuf Arrow CapnProto and many more! See Formats for Input and Output Data in the ClickHouse docs for details.
Push based ingestion: Native support for pushing large batches of data from remote data sources	Supported (Spark and Flink connectors available)	Supported (no built-in connector)	Supported (no built-in connector)
At least once semantics	Supported	Supported	Supported
Exactly once semantics	Supported at segment level	Supported at segment level	No native support. Needs custom pipeline

Table: Batch-Based Ingestion, a comparison of Apache Pinot, Apache Druid, and ClickHouse

SQL Data Ingestion

SQL data ingestion can be done in 2 ways – using CDC (Change Data Capture) in real-time or in a batch manner. This section covers the batch oriented ingestion from SQL data sources.

Capability	Apache Pinot	Apache Druid	ClickHouse
Native support for one time ingestion (bootstrap) from SQL sources	Supported *StarTree Cloud	Supported	Needs custom data pipeline
Native support for incremental ingestion from SQL sources	Supported *StarTree Cloud	Supported	Needs custom data pipeline

Table: SQL Data Ingestion, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Batch and SQL ingestion support in StarTree

StarTree provides a native, out-of-the-box support for ingesting the batch and SQL data sources mentioned above. Read No-code Batch Ingestion for Apache Pinot in StarTree Cloud for more details. This is done by a background job that runs alongside the primary cluster.

Key benefits:

Easy to use: There is no need for additional ETL tools for most common scenarios.
Efficient: Segments are generated by the background job outside the primary cluster and pushed to the server serving live queries, with negligible impact
Robust: Ingestion tasks are checkpointed and retried upon failures
Scalable & cost efficient: auto-scaling is enabled in StarTree cloud for elastic scale-up/down of the ingestion tasks

Ingestion time pre-processing

Often, data needs to be pre-processed before it can be ingested and queried in Pinot. Examples include flattening a complex JSON structure, rolling up data, or type/value transformation. Let’s look at some of the popular pre-processing capabilities requested by users.

Capability	Apache Pinot	Apache Druid	ClickHouse
Ingestion time transforms Basic filtering, type/value transformation applied as data is being ingested	Rich filter and transformation function support. Filter functions: We can use either Groovy expressions or built-in scalar functions for filtering (see Ingestion-Level Transformation in the Pinot docs). Rich transformation function support including Groovy expressions as well as built-in functions for string, math, JSON, datetime computation and so on. A full list of Pinot functions is available in the docs.	Rich filter and transformation function support. Various types of built-in Filter functions including JavaScript, regex, logical, extraction, column types and so on (see Query Filters in the Druid docs). Transformation functions including string, math, time, array, JSON, lambda and other operators.	Limited support: Data can be transformed via the insert into select clause from another table. However, there doesn’t seem to be a way to apply filtering or aggregation functions during ingestion itself.
Flattening Flatten a nested schema often needed for query optimization	Built-in flattening support for nested JSON records. In addition, users can also use the json_extract transformation function. In general, a JSON column and an optional JSON index is recommended to accelerate query performance on JSON/nested fields.	Convenient way of flattening nested records using FlattenSpec.	Limited support for Nested column (not for Json): via the flatten_nested spec
Rollups during Real time ingestion	Supported. Variety of ingestion aggregation functions such as SUM, MIN, MAX, DISTINCTCOUNTHLL. Currently, specifying granularity/interval is not supported, but can be achieved with derived columns.	Supported. Variety of aggregation functions such as SUM, MIN, MAX, Approximate (Sketches). In addition, it’s possible to define custom granularity, interval (see Ingestion Data Rollup in the docs).	Limited support: Can use SummingMergeTree or AggregationMergeTree for rolling up records based on a specified primary key in the background (not during ingestion).
Rollups during batch/offline ingestion	Supported.SUM, MIN, MAX, DistinctCountHLL, Theta sketches)	Supported	Limited: Same as above (use the right MergeTreeFamily)
SQL based ingestion transform Flexible way of inserting data into a table based on the output of a SQL query	Not Supported	Supported with the new Druid multi-stage query engine.	Supported with insert into select clause.
Denormalization support Denormalize data from multiple sources during ingestion	Not Supported	Basic support using Druid MSQ: Insert with Join: limited to dimension tables	Limited to existing tables in ClickHouse using insert into select clause
Ability to partition data during ingestion	Supported for batch sources. In case of real-time sources, partitioning scheme is tied to the source (Eg: Kafka/Kinesis)	Supported for batch sources. In case of real-time sources, partitioning scheme is tied to the source (Eg: Kafka/Kinesis)	Supported for MergeTree family tables. As data is inserted, it can be partitioned as configured by the PARTITION BY spec
Ability to sort data during ingestion	Supported for batch sources on any dimension (limited to 1 dimension)	Druid is limited to sorting by timestamp as the primary sorted dimension	Supported for MergeTree family tables using the primary key

Table: Ingestion Time Pre-Processing, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Table Management

Table data often needs to undergo some change after ingestion to keep up with changes in the data or the workload. This section compares how the three systems fare in this area.

Capability	Apache Pinot	Apache Druid	ClickHouse
Backfills Ability to replace the data in the database for a certain time range or condition, using data from another batch or streaming source (e.g. replace data for the 3am hour two days ago, using files in an S3 bucket). Useful when data needs to be corrected due to upstream changes or issues.	Backfill is easily possible if the table is time partitioned, which is often true for offline tables. Realtime tables are not naturally time partitioned, but tools are available to time-partition and then execute a backfill.	Could not find documentation or recipes, but seems possible by design.	Support is available for making small scale additions, updates and deletes to columns, but no support for large backfills for time ranges to data.
Dynamic re-indexing Ability to dynamically add remove indexing for columns of the table, without needing a full bootstrap or downtime. This is extremely useful when iterating between optimizations during initial setup.	Can easily add or remove indexes with no downtime by making config changes followed by a reload operation.	Can be achieved via reindexing, which refers to a reingestion task that can apply any changes to existing data. However, this is modeled as batch ingestion with existing data as source, effectively making it not-in-place.	Skipping indexes can be added / removed dynamically.
Dynamic data layout change Support for making changes to the data layout such as partitioning, sorting, bucketing, rollups, resizing dynamically without having to rebootstrap and without downtime. Picking the right layout based on the data and workload aids better compression, optimized routing, segment pruning and scanning, which helps improve performance.	Native support for roll ups and merge via Minions which handles partitioning, bucketing, trimming. Partitioning: Yes Sorting: Yes.StarTree Cloud Bucketing: Yes Rollups: Yes Trimming: Yes	Can be achieved through overwrite, which is essentially re-ingestion using the existing data as source. Partitioning: Yes Sorting: Custom sorting itself is not supported, so this is NA Bucketing: Yes Rollups: Yes Trimming: Yes	While some operations like altering columns / deleting columns can be done via mutations, the ones involving entire data layout changes can only be done by dropping and recreating the data. It can be achieved by creating a materialized view or projection but this solution increases storage footprint, defeating the purpose of some operations like rollups and trimming. Partitioning: No Sorting: No Bucketing: No Rollups: No Trimming: No

Table: Table Management, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Schema Management

This section compares the three systems on schema characteristics, such as native support for data of varied types and being able to evolve the schema.

Capability	Apache Pinot	Apache Druid	ClickHouse
Data types Extensivity of data types supported as compared to SQL databases. Native data types are easier to use and in their absence, columns need to be modeled as strings which can be inefficient for compression and access.	Basic ones such as int, long, float, double, char, string, boolean, bytes, in addition to some special ones like big decimal, geo types, JSON. No array support like SQL, multi-value columns semantics are like column explosion.	Limited to the basic types broadly in the categories of int, long, float, double, char, string, boolean. No array support like SQL, multi-value columns semantics are like column explosion.	A very rich set of data types supported natively in comparison to Druid and Pinot. Along with the basic int, long, float, double, string, boolean, many specialized ones such as UUID, Enum, Array, Map, tuple, geo types, IP. expression, set.
Nested Columns Support for ingestion and fast query of unstructured / semi-structured nested columns. Such support helps eliminate the need for data pre-processing workflows to extract data from nested structures.	Strong support for nested columns ingestion, with strong indexing support during query time.	Supported	Good support for ingesting and storing JSON either as structured or semi-structured objects using string/JsonObject
Schema Evolution Ability to make schema changes dynamically, in place. This is needed as data keeps evolving, and the system needs to react fast so as to not miss these changes.	Schema changes possible and reloads functionality to make changes take effect	Supported for new data, but for existing data, documentation suggests to go with reindexing which is called out as time consuming since it is a full overwrite of data	Possible with mutations

Table: Schema Management, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Storage

As mentioned earlier, each of these databases has a columnar storage format.

At a columnar level, there are differences in compression & encoding schemes which further affect the storage footprint and access latencies.
At a segment level, data layout configuration options help increase pruning.
At a server level, smart storage mapping strategies help drastically reduce query-fanout, further reducing latency and boosting throughput.
At a cluster level, the availability of storage options (multi-volumes, cloud object storage) and the ability to create storage tiers for hot / cold data to control cost-latency trade off, greatly improves cost to serve.

This section looks at the above aspects for the three systems.

Capability	Apache Pinot	Apache Druid	ClickHouse
Encoding Types of dictionary encoding available. These impact column size which affect access latency and cost to serve.	Dictionary encoding done by default. Ability to skip dictionary encoding based on configurable compression ratio. Option to create a variable length dictionary for string / bytes columns for better compression. Ability to load dictionary on-Heap for faster access	Dictionary encoding. No ways to smartly determine when to skip this encoding, or create any advanced / nuanced dictionary encodings.	Dictionary encoding done via LowCardinality data type. No advanced thresholds available, other than cardinality based disabling of LowCardinality.
Compression Standard / specialized techniques available to compress columns. These impact column size, which affect cost to serve.	snappy, lz4, zstd	lz4, lzf, zstd	z4, lzf, zstd Plus some specialized codecs like delta, double delta, gorilla, fpc, t64.
Data layout: Partitioning Control the data layout in terms of custom partitioning. Partitioning data helps with pruning servers / segments reducing query fanout.	Can choose custom keys for partitioning	Data always partitioned by time column first, and a secondary partitioning can be done using a custom key.	Can choose a custom key for partitioning
Data layout: Sorting Control the data layout in terms of custom sorting. Sorting data helps improve data locality and enable smarter encoding (like RLE) reducing work done in applying filters during scans.	Ability to pick any column as sort key	Always sorts by primary time column. No option to make a column of your choice as primary sort column (can set a secondary sort column).	Ability to pick any column as sort key
Data layout: Strategies to reduce query fan out Ability to use smart instance assignment and segment assignment. These strategies help reduce query fanout which improves latency, throughput and operability.	Highly mature and advanced strategies for segment assignment which help achieve better p99 latency and throughput, plus instance assignment strategies available, to improve ease of operations. More details in the Data Layout: Strategies to reduce query fanout section below.	Only strategy available for segment to instance assignment is the balanced strategy which distributes segments evenly on historical nodes.	Documentation doesn’t indicate the existence of any advanced segment to instance assignment strategies.
Tiered Storage using local storage / compute nodes separation Ability to split hot / cold data onto multiple local (or directly-attached) volumes or onto different compute node tenants. For example, some local storage volumes might be fast NVMe SSDs, and others slower HDDs.	Two types of tiering available: Multi-volume tiering using different types of local volumes Compute node separation using multiple tenants	No tiering using multi-volumes on same tenant Supports compute node separation using service tiers.	Tiered storage support using multi-volumes No compute node separation due to the absence of tenant / tagging concepts.
Tiered Storage using Cloud Object Storage Ability to split hot / cold data between local and cloud object storage. While reducing cost to serve is one aspect of this feature, the design choices need to be made carefully as we still want the database to be fast.	Tiered storage using cloud object storage available in StarTree Cloud. Cloud object store as a tier supported in StarTree Cloud. Fully on par with local mode, no restrictions on data altering. All indexes available Several optimization techniques like prefetching, pipelined fetch and execution, columnar / block fetch for performance Caching available in addition, for further speedup, but no dependence on it Limitations: Needs Pinot segment format *StarTree Cloud	No support for tiered storage using Cloud Object Storage	Tiered storage support using cloud object storage: Supports open file formats like gzip, parquet, csv Can prefetch granules async Limitations: No data altering support. Documentation mentions no indexing support, but this article says index and mark files can be cached locally. No columnar fetch, always reading granule which contains all columns. Relies on caching (lazy loading) to make cold tier queries fast.

Table: Storage, a comparison of Apache Pinot, Apache Druid and, ClickHouse

Data Layout: Strategies to Reduce Query Fanout

This blog post talks about the lifecycle of a query in Apache Pinot and the optimizations available at each point – from broker, to servers, to segments, to columns – to reduce the amount of work done. This section focuses on query fanout – reducing the number of nodes a query is scattered to. Reducing the fanout helps eliminate long tail latency (your query will be as slow as the slowest server) and, in turn, helps improve throughput.

The strategies available in Pinot to map the data segments onto servers and enhance pruning at broker level are:

Balanced: This strategy simply distributes segments evenly and is also the only strategy adopted by Druid and ClickHouse.
Replica-group: This strategy creates groups of servers such that an entire replica of the dataset is constrained to each group, thus reducing server fanout and being able to scale horizontally for higher throughput.
Partition-based replica-group: This strategy further creates partition based groups to reduce segment fanout within a replica group.

The strategies available in Pinot to map the tables onto servers and improve ease of operation are:

Creating pools for a no-downtime rolling restart of large shared clusters
Fault-domain aware instance assignment

Tiered Storage using Cloud Object Storage

Each of the three systems started with having tightly coupled storage and compute, wherein they used compute nodes that included local or directly attached high performance storage, such as NVMe SSDs. This is one of the reasons that make them suitable for low latency.

However, local high-performance storage is expensive compared to decoupled systems that primarily use cloud object storage. Cloud storage is cheaper but slower and, therefore, usually unsuitable as the sole storage layer for interactive real-time analytics. Simply choosing one of these two systems for all analytics needs does not make sense.

Real-time (or recent) data tends to be more valuable and is queried more frequently. On the other hand, historical data is queried less frequently. Latency is still important, but with this data being large and ever growing, cost becomes a dominating factor. Data teams often choose a hybrid approach where they add both systems and migrate the data from one to another depending on cost / latency expectations. Therefore, many users need to consider how to span their data, choosing what they wish to store in high performance local storage and what is placed on slower but more cost-effective cloud storage.

Fig: Tiered Storage for Apache Pinot in StarTree Cloud

Tiered storage for Apache Pinot in StarTree Cloud is designed to address this problem. With the option to use cloud object storage such as S3 / GCS directly in the query path, Pinot can now act as a hybrid system — it can use completely decoupled or tightly-coupled storage or have a mix of both. It lets you flexibly configure data to be on any tier and query it all seamlessly like any other table.

Most importantly, several investments have been made to reduce the latency issues that come with using decoupled storage. For instance, the design consciously stays away from approaches such as lazy loading — where entire segments would be downloaded during query time — or using local caches. Typical OLAP workloads consist of arbitrary slice-and-dice over varying time ranges, and caching-based approaches cannot offer predictable p99 latency due to the frequent download and churn of segments.

Instead, the tiered storage approach in Pinot for reading from cloud object storage uses strategies such as selective columnar / block reads, smart prefetching, pipelining fetch and compute. These optimizations, along with the ability to use the army of native indexes available in Pinot, help Pinot achieve the best of both worlds — reduced cost to serve while keeping the speed of Pinot.

Indexing Strategies

Indexing is a fundamental function of a real-time OLAP database which ensures queries can be executed efficiently and perform minimum data scans. In this section, we will compare the indexing strategies available across the three systems and also dive into why this is important.

Type of Index	Functionality	Apache Pinot	Apache Druid	ClickHouse
Inverted Index	Fast Filter	Supported	Supported	Experimental
Sorted Index	Personalized Analytics	Supported	No	Supported
Range Index	Range Queries	Supported	No	No
JSON Index	Analyze Semi-structured data	Supported	No	No
Geo Index	Location queries	Supported	No	No
StarTree Index	Fast Aggregation	Supported	No	No
Bloom Filter	Approximate Filter	Supported	Supported	Supported
Text Index	Log Analytics	Supported	No	Limited: via bloom filters
TimeStamp Index	Time based rollups	Supported	No	No
Sparse Index	Approximate Filter	*StarTree Cloud (Alpha)	No	Limited: only on the primary sorted column

Table: Indexing Strategies, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Why do Indexes Matter in an OLAP Database?

Indexes help minimize the data scans needed for processing a given OLAP query. This is important to ensure query performance is optimal and predictable regarding high throughput and latency, even as the underlying data keeps growing organically.

Fig: Impact of indexes on query latency with increasing data size

Apache Pinot natively supports a wide range of indexing strategies, making it the ideal OLAP platform for myriad use cases ( as discussed here.. In addition, it has a pluggable architecture, making it extremely easy to add new indexes as and when needed. For instance:

Star-Tree Index, Pinot’s way of building intelligent materialized views is crucial for running user-facing analytical applications. Specifically, it allows users to pre-aggregate data in a tunable fashion, leaving the source data intact. This enables highly concurrent, low latency key-value style queries. Any changes in the pre-aggregation function or corresponding dimensions can be handled dynamically without having the need to re-ingest all the data.
JSON index and text index are great at processing semi-structured data in sub-seconds (eg: analyzing system and application logs at scale)
Geo index accelerates geospatial queries , which are required by many logistics and delivery companies around the globe, as well as various other real-world location-based services.

Here’s a quick view of the power of indexes – comparing query latencies with and without the corresponding index.

Fig: How Apache Pinot’s indexes bring down the latency compared to a raw scan

As depicted above, indexes help reduce the query latency by many orders of magnitude. In addition, it reduces the amount of work and resources needed per server, thus increasing the system’s overall throughput.

Query

This section compares the three systems in all aspects related to querying the database. It discusses commonly used query constructs in OLAP and advanced multi-stage query features such as joins and window functions. It finishes with query latency optimizations and features such as caching, routing, and pruning strategies.

Capability	Apache Pinot	Apache Druid	ClickHouse
Filter	Yes	Yes	Yes
Group by	Yes	Yes	Yes
Order by	Yes	Yes	Yes
Sketches Useful for fast and approximate counting of distinct values in large sets	Yes	Yes	Yes
Multi-stage query engine Required for executing more complex SQL queries such as JOIN, OVER window, which can cause bottlenecks if executed in a single-stage scatter-gather fashion.	Has a multi-stage query engine that is fully distributed, designed for interactive queries	The multi-stage query engine executes SQL statements as batch tasks in the indexing service, and appears to be used more for ingestion. This is not suitable for interactive low-latency OLAP queries.	Has a multi-stage query engine.
Joins A useful query construct for combining data from multiple tables in the scope of a single query. In the absence of joins, expensive data pre-processing and denormalization needs to be done.	Supported, with limitations. See below section for more details.	Limited to only joining regular Druid tables with lookup and inline data sources.in other words, no multi-stage joins)	Supported, with limitations. See below section for more details.
Window functions A useful query construct for applying aggregations over a window (set of rows).	Work in progress. Support added for WINDOW OVER construct, with some limitations.	Work in progress.	Standard window function syntax supported along with many basic aggregations
Pluggable User Defined Functions (UDFs) Ability to easily write custom UDFs and plug them into the system.	Support for using Groovy, or defining Scalar functions via annotations which can be plugged in as a custom jar.	Support for writing JavaScript functions. No ability to write a function and plug it in via custom jar.	Can define UDFs using CREATE FUNCTION. But uses lambda expressions which aren’t as powerful and can be inadequate at times. No ability to write a function and plug it in via a custom jar.
Query caching Support for caching results of a query. Useful when there would be repeated executions of the same expensive query.	No query caching mechanisms exist.	Two types of query result caches available – a per-segment cache on server and a whole-query cache on broker. Mechanisms in place to evict stale cache entries as soon as data changes, and also an explicit config to use cache.	Recently introduced a feature for query caching on servers. Entries are not kept in sync with underlying data changes, TTL is the only mechanism for evicting stale entries, and no user overrides available. Plus it’s been called out as experimental and has the possibility of returning wrong results.
Pruning strategies during query execution Ability to prune out servers / segments without having to process all the segments. Two stages where this can be done: Reduce server fanout Reduce segment fanout	Reduce server fanout: Partition based: Prune out server groups based on partition info Replica group based: Route to only 1 replica group for any query Reduce segment fanout: Min / max value: Prune out a segment based on filter clause and min / max values stored in segment metadata Partition info: Prune out a segment based on partition info stored in segment metadata Bloom filters: Prune out segments based on bloom filters created for columns and stored in segments.	Reduce server fanout: None. The segment pruning techniques are directly applied on brokers, but no attempts to reduce the set of servers fanned out to. Reduce segment fanout: Time based as segments are always time partitioned first Partition based pruning	Reduce server fanout: None. Routing itself is a manual process – due to the absence of a dedicated Query Layer – involving setting up a Distributed Table using cluster.xml, and no smart controls exist to further optimize this. Reduce segment fanout: Block level pruning achieved using sparse primary indexes and skip indexes.with techniques like min / max, set, bloom filters). Partition based pruning

Table: Querying, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Joins

A Join query fundamentally works on two parts – a left table and a right table. There are three ways to execute a join.

Broadcast join : All data post filtering from the right table (R), is shipped to all nodes (N) of the left table (effective data moved R * N). This works well for cases when the filtered data being shipped is small and the number of nodes is also small. But query latency increases proportionately as the size of R increases or the number of nodes N increases. This is the only algorithm that ClickHouse applies for joins.
Shuffle distributed hash join : After applying filters, data from both sides (L and R) is hashed on the join key and shipped to an independent shuffle stage (effective data moved L + R). This scales well even if the size of data on either side increases. Plus, the performance is unaffected by the number of nodes on either side. Pinot is the only system that can execute joins using this algorithm .
Lookup join : Right table must be pre-loaded (replicated) on all nodes of left table. This can only work for very small right hand datasets. This is the only kind of join that Druid can execute . This is available in Pinot as well, for small dimension tables.

In a separate blog post, we will follow up with more details about the nuances of join implementation for all systems, so stay tuned.

Materialized Views

Materialized views (MVs) create a virtual table only using a subset of data from a base table, such as a select number of columns pertinent to a specific query. Users commonly use this to optimize query performance for well-known query patterns. Here’s how these different systems have implemented materialized views:

Capability	Apache Pinot	Apache Druid	ClickHouse
Flexibility	High – native support via star-tree index	Low. Requires a Hadoop cluster	High – native support via CREATE VIEW
Ingestion overhead	Low – Index can be created at any time and segment reload will automatically create the index for the entire source table	High. Requires a Hadoop cluster	High for Materialized Views Any aggregation in the view query is only applied on new data POPULATE can backfill existing data but is not recommended since any data being ingested at the time of view creation will be lost. Special consideration needs to be given to the `materialized_views_ignore_errors` flag. Otherwise any errors in any of the views will block ingestion ClickHouse also has the concept of Projection which seems to be built automatically for the entire table
Storage overhead	Medium – Allows tuning space vs query latency overhead. In other words, not all rows of the view are fully materialized, thus allowing for space efficiency, while still maintaining low query latency	High – materializes all the rows as a traditional view	High – materializes all the rows as a traditional view
Query overhead	Low – User does not need to know about the existence of the index. The same query on the source table will automatically pick the right star-tree index under the hood	Moderate – queries need to target a specific view	Moderate – queries need to target a specific view. ClickHouse has a separate concept of Projection which is transparently applied without the user having to know about it. In general though, Projections are considered more heavyweight as compared to a Materialized View.

Table: Materialized Views, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Star-Tree Index: Pinot’s intelligent materialized view

The star-tree index provides an intelligent way to build materialized views within Pinot. Traditional MVs work by fully materializing the computation for each source record that matches the specified predicates. Although useful, this can result in non-trivial storage overhead. On the other hand, the star-tree index allows us to partially materialize the computations and provide the ability to tune the space-time tradeoff by providing a configurable threshold between pre-aggregation and data scans.

Fig: star-tree index lets you choose your space-time tradeoff between pre-aggregating everything (a materialized view) or doing everything on-the-fly

The star-tree index allows us to pre-compute values of an aggregation function across the dimensions specified in a given materialized view. A configurable parameter called maxLeafRecords determines the maximum rows that need to be scanned to compute an aggregation. In other words, if maxLeafRecords is set to 1 then we get full materialization – same as a traditional MV. Setting this to a higher value results in significant space savings with a small query overhead.

Key benefits:

User controllable: Tune space vs. time overhead
Flexible: create any number of indexes. The right index is chosen based on the query structure.
Transparent: Unlike traditional MVs, users don’t need to know about the existence of a star-tree index. The same query will be accelerated with a star-tree index in place.
Dynamic: Very easy to generate a new index at any point of time.

Learn more about the star-tree index here.

Query Performance

Performance benchmarking is highly use-case, query pattern, and dataset dependent. We recommend you evaluate query performance on your datasets across these systems. Here are some benchmarking numbers provided by neutral third parties:

Fig: Query performance benchmark graphs from the various sources as listed below

Cisco did an evaluation comparing ClickHouse and Pinot. View the setup details and results in Webex: Real-Time Observability & Analytics with Apache Pinot. A quote from the presentation – “ Pinot is 4x faster than ClickHouse (in most categories) and is able to maintain p99 with increased load.”
This engineering blog post titled Real-time Security Insights: Apache Pinot at Confluera talks about their benchmarking done on Druid and Pinot. The results showed queries on Pinot were 2x – 4x faster than on Druid and could sustain a much higher throughput.
This blog post titled Pinot in YouGov discusses a comparison they did with several technologies, and Pinot was 5x – 7x faster than Druid and ClickHouse .
This session How Zomato uses Apache Pinot talks about how Pinot could handle 2x the query throughput compared to Druid using the same cluster size.

Cluster Management

This section compares the systems on operability, which comprises several aspects such as, ease of cluster operations (scaling, maintenance, disaster recovery), visibility into cluster status, flexible cluster setup options to enhance manageability and help maintain SLA guarantees, and so on.

While all three databases are designed for high availability (HA), and each seeks to avoid single points of failure (SPOFs), each handles cluster component failures differently and thus provides differing overall reliability and resiliency.

Capability	Apache Pinot	Apache Druid	Clickhouse
Management console A UI for managing cluster components and performing regular operations related to the lifecycle of tables, schemas, segments. Having such a console greatly improves ease of operability and developer productivity.	A functional console available, covering the management and lifecycle of all physical and logical components. Also includes a query console. For advanced users, a UI for access to all endpoints and Zookeeper viewer. Advanced consoles available in StarTree Cloud for ingestion (Data Manager) and easy data layout and cluster management (Pinot Control Panel) *StarTree Cloud	A comprehensive console for loading data, managing data sources and tasks, and viewing server status and segment information. The console can also be used to run SQL and native Druid queries.	No management console in ClickHouse. SQL console is available only in *ClickHouse Cloud.
Multi-tenancy Ability to create groupings amongst components of different use-cases in the same cluster. This helps provide physical as well as logical isolation and eases operations.	Tenant is a first class concept in Pinot, that can be used to isolate brokers and/or servers of a table within a multi-tenant cluster. Tenants in Pinot not only help with isolation of resources and workload, but also bring benefits to operations such as independent scaling, rollouts etc.	The concept of tenant is used to define a “user,” and multi-tenancy refers to multiple users / teams querying the same data source. No references found regarding techniques like node pools / tagging etc, for achieving isolation of tables within a cluster. Instead, suggestions include creating separate partitions for every tenant, creating separate data sources for every tenant or group similar tenants and creating a shared datasource.	No concept of tenants. The recommendation is to keep the number of tables in the cluster small.
Scale up / scale out Ease of making capacity changes to the cluster hardware (vertical as well as horizontal scaling up / down) and data redistribution. This is commonly required in order to adapt to changing workloads on a cluster.	Very easy to make capacity changes and rebalance the cluster. StarTree Cloud has a Pinot Operator, which facilitates more automated scaling up / down of nodes, including rebalancing without down-time. It includes heterogeneous hardware support, which is often needed for multi-tenant clusters or tiered storage use cases. *StarTree Cloud	While no documentation was found, these links indicate it’s possible to scale out the nodes and automatically balance the segments.	Document indicates this is not easily possible. One needs to manually create the table on new nodes, copy over data, and is operationally heavy.
Node maintenance Ease of operations when having to replace or take down a node from the cluster for maintenance, which is a common operation in large scale distributed clusters.	Easy to take down a node and have another one join in its place by using the same node identifier.	There’s a maintenance mode flag that can be used to mark a node as decommissioned, so new data stops getting assigned to it, and balancer starts moving existing data to a new node.	Involves manual steps like copying over data from 1 replica to another, which means you cannot do this unless replica > 1.
Disaster Recovery Permanent backup and recovery mechanisms, to recover the data in the event of node / disk failures and cluster outages.	All segments are backed up to deep store by default, and data can be restored completely in the event of cluster failures / data loss. This copy is also used to load data onto newly added nodes, nodes which lose their data.	All segments are backed up to deep store by default, and data can be restored completely in the event of cluster failures / data loss. This copy is also used to load data onto newly added nodes, nodes which lose their data.	Backups need to be set up manually. There’s no concept of a deep store, where a permanent copy is backed up by default. Recovering after node failures is manual and requires copying data from another replica, which indicates we cannot recover if we lose all replicas.
High Availability Ability to keep service available in the event of partial component failures. Avoid any “Single Point of Failure (SPOF).”	Each component and data shard can be scaled and replicated. Data can be replicated easily. Can operate queries, ingestion, and background tasks in the face of a failure in any of the components. Pinot can also tolerate failures in deep store availability, with peer downloading capabilities	Each component and data shard can be scaled and replicated. Data can be replicated easily. Can operate queries, ingestion, and background tasks in the face of a failure in any of the components. However, does not have a way to provide high availability if the deep store goes down.	It is possible to run ClickHouse without Zookeeper involved. However, in this mode, you cannot have replication, and hence cannot be highly available. Additionally, absence of a deep store and the fact that ClickHouse servers are the sole component responsible for all activities in the system (no separation of concerns), makes it much harder to provide high availability.
System tables System tables capture information about the cluster components, query / ingestion workload, etc, which can be extremely useful in monitoring system health and debugging without requiring a separate management port or interface.	No concept of system metadata tables. Virtual metadata columns – docId, hostId, segmentName – built by default and available during query time.	Two system tables available – Information Schema (tables and column types) and Sys Schema (servers, segments, tasks).	A rich set of system tables for server states, internal processes and environments

Table: Cluster Management, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Security

Here, we compare the key security requirements needed for an enterprise-grade real-time OLAP database.

Capability	Apache Pinot	Apache Druid	ClickHouse
OIDC based Authentication	Supported *StarTree Cloud	Supported	No
Kerberos based Authentication	No	Supported	Supported but not available in Clickhouse Cloud
Authorization: (ABAC) Attributed Based Access Control	Supported *StarTree Cloud	No	No
Authorization: (RBAC) Role Based Access Control	No	Supported	Supported
Data encryption (“Data-at-Rest”)	Storage level encryption	Storage level encryption	Storage level encryption + Built-in encryption support
Network encryption (“Data-in-Motion”)	SSL/TLS	SSL/TLS	SSL/TLS

Table: Security, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Cloud Deployment Model

Finally, we look at how easy it is to deploy these three systems as a fully-managed cloud solution, requiring a commercial Database-as-a-Service (DBaaS) provider. Users can, of course, self-deploy any open-source databases to the public cloud as a self-managed solution.

Capability	Apache Pinot (StarTree Cloud)	Apache Druid (Imply)	ClickHouse (ClickHouse Cloud)
SaaS Cluster provisioned in vendor’s cloud account. User has to ship data to this cluster	Supported *StarTree Cloud	Supported *Imply	Supported *ClickHouse Cloud
Bring Your Own Cloud (BYOC) Cluster provisioned in customer’s cloud account. Customer retains full control of data.	Supported. No VPC peering needed between control plane and data plane. *StarTree Cloud	Supported but with limitations. Needs VPC peering between control plane and data plane. This is not only a point of friction for the users, but also problematic in terms of having to deal with CIDR range overlap across multiple deployments. Resolving CIDR overlap needs a complex setup as documented here. *Imply	Not Supported *ClickHouse Cloud

Table: Cloud Deployment Model, a comparison of StarTree Cloud, Imply, and ClickHouse Cloud

StarTree’s Bring Your Own Cloud (BYOC) deployment model

StarTree provides a fully managed BYOC offering in which the data never leaves the customer’s cloud account. This is tremendously beneficial for the customer who retains full control over the data and access to that data without having to deal with the complexities of running a production grade Pinot cluster.

Here are the key benefits of this model:

Data security/governance: Data never leaves the customer’s cloud perimeter. Customers also retain full control over data access.
SLA : StarTree customers get the same SLA as the corresponding SaaS model
Fully managed : StarTree is responsible for handling deployment, upgrades, routine maintenance, backups / recoveries, and capacity planning on behalf of the customers.
Principle of least privilege : StarTree Cloud runs with the least privilege model for operating and managing BYOC clusters with the exception of the one time deployment.
No VPC peering needed : StarTree’s control plane does not need any VPC peering with the remote data plane. This makes it frictionless to deploy and scalable across multiple deployments
Cost Effective : Customers can fine-tune the cluster configuration for cost optimization regarding instance type, storage type, and avail of their cloud discounts as applicable.

Outbound Connectors

The ability to connect these databases with Business Intelligence (BI) or visualization tools is crucial for doing ad-hoc analysis on real-time data. Here is how the three systems compare in terms of outbound connectors:

Connector	Apache Pinot	Apache Druid	ClickHouse
JDBC	Limited Support	Yes	Yes
PrestoDB	Yes	Yes	Yes
Trino	Yes	Yes	Yes
Tableau	Yes	Yes	Yes
Looker	Via PrestoDB/Trino	Yes	Yes
PowerBI	Via PrestoDB/Trino	Via PrestoDB/Trino	Yes (ODBC driver)
Grafana	No	Yes	Yes
Superset	Yes	Yes	Yes

Table: Outbound Connectors, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Feedback

We hope you enjoyed reading this comparison! We’ve tried our best to ensure that all the content is accurate and to the best of our knowledge. However, it’s possible that we missed something or overlooked some details. Plus, of course, this is a point-in-time analysis. Future updates to any of these databases could require revisions to the information provided above. Please contact us if you have any feedback, and we’d be happy to incorporate it!

Intrigued by Apache Pinot? Try it out!

Perhaps the analysis you’ve read above has gotten you curious about Apache Pinot. If so, you are free to download open-source Apache Pinot. Or, if you want a fully-managed experience, you can sign up to try StarTree Cloud, built on Apache Pinot, free for 30 days.

Acknowledgements:

Thanks to Rachel Pedreschi for the very thorough review and insights on Apache Druid, Xiaobing Li for reviewing and helping us dig deep into ClickHouse, and Shounak Kulkarni for validating some of our findings and assumptions on real cluster setups for all three systems!

Many thanks to Madison Sorenson for making the beautiful illustrations. And to all our internal reviewers and editors – Peter Corless, Tim Berglund, Kishore Gopalakrishna, Kulbir Nijjer, and Allison Murph. for the valuable feedback and direction!

Changelog

4/18/2023 – Original publish date

Apache Pinot Industry Insights Technology

Contents