
Comparing Three Real-Time OLAP Databases: Apache Pinot, Apache Druid, and ClickHouse


StarTree
April 18, 2023

Video

The authors of this blog gave a presentation about it at Real-Time Analytics Summit 2023.

Introduction

With Real-Time Analytics gaining popularity, we are often asked, “How do you compare Apache Pinot™ vs. Apache® Druid vs. ClickHouse®?” In this article, we attempt to provide a fair comparison of the three technologies, along with areas of strength and opportunities for improvement for each.

Recently, we’ve seen a shift in the analytical landscape from traditional OnLine Analytical Processing (OLAP), which was batch-oriented in nature, to real-time OLAP. Instead of generating reports or dashboards that are hours delayed, we now want to analyze data as it’s being generated – within milliseconds.

Consequently, real-time OLAP databases have become increasingly popular over the last few years and have seen mainstream adoption at several big companies such as LinkedIn, Uber, Netflix, Cisco, and Bloomberg. As organizations continue to move from batch analytics to real-time analytics, choosing the right OLAP database is an increasingly critical decision. The choice depends on several factors, such as use case, ingestion constraints, query performance, and production readiness, to name a few.

In this article, we present an in-depth comparison of three open-source real-time OLAP databases: Apache Pinot, Apache Druid, and ClickHouse, with the above dimensions in mind. We will also look at some salient features in depth to uncover how things work under the hood across the different systems.


Comparison Structure

Here’s a simplified view of the internal layers of a typical real-time OLAP database. In the following few sections, we will go through each layer and compare the capabilities across the three systems mentioned above. The versions used for each project are Apache Pinot 0.12.0, released on Jan 19, 2023; Apache Druid 25.0.0, released on Jan 4, 2023; and ClickHouse 23.1, released on Jan 26, 2023. Since then, ClickHouse has released 23.2 and 23.3; these releases have not yet been incorporated into our analysis.


Fig: Layers of a typical real-time OLAP database

Note: There are some cases where certain capabilities are supported in the respective commercial Database-as-a-Service (DBaaS) product offerings. We will make explicit note of these in the comparison table wherever necessary.

Summary of Comparison: Apache Pinot vs. Apache Druid vs. ClickHouse

Based on our observations, as described below in great detail, here’s a quick summary of the strengths and weaknesses of each system.

Apache Pinot

Highlights
Apache Pinot has been designed for scale and performance, with best-in-class indexing strategies and smart data layout techniques used to boost query performance, amongst other things. It has great support for ingesting data from real-time and batch data sources and is the only system that can support upserts in real time. Overall, it’s easy to operate, with rich cluster management tools, and it is the only system to support true multi-tenancy. Lastly, the flexible cloud deployment model and best-in-class cloud tiered storage make it a compelling choice for companies of all sizes to build real-time analytical applications.

Needs Improvements
Apache Pinot needs to improve its data type support, outbound connectors for easier BI integration, SQL-based pre-processing, and some aspects of security.

Apache Druid

Highlights
Apache Druid is also designed to handle large-scale data and does a great job of ingesting from real-time and batch data sources. It has great support for data pre-processing during ingestion and recently launched a flexible SQL-based ingestion approach.

Needs Improvements
Apache Druid currently does not support real-time upserts, which are needed in many critical use cases. It does not support any way to push/write data in real time. It has limited indexing capabilities and no current support for cloud-based tiered storage.

ClickHouse

Highlights
Of the three systems, ClickHouse is the easiest for new developers to get started with, as it has the fewest components. Compared to the other two, it has a better real-time push model (although it also has some limitations). It has great data type support and very rich support for outbound connectors.

Needs Improvements
ClickHouse has poor support for pulling data from real-time sources (see below why this is important). Like Druid, its indexing support is quite limited, which hinders query performance, especially when dealing with large amounts of data. Although single-node deployment is easy, running ClickHouse in distributed mode is quite complex. It has some operational limitations in terms of difficulty scaling up/out nodes, lack of multi-tenancy, and no management console. In general, scalability is one of the key challenges with ClickHouse.

We now dive into each of the layers in detail.

High-Level Architecture: Apache Pinot, Apache Druid, and ClickHouse

To set some context for the rest of the blog, let’s take a high-level look at each system’s architecture.

All of the real-time OLAP databases considered here are column-oriented. Columnar data formats are good for efficiently scanning only selected columns of data, minimizing storage footprint, and scanning through highly redundant data entries (also called “low-cardinality columns”).

They are also optimized to run in a cluster of servers for high availability and horizontal scalability (scale out). However, the ease of scaling and management and the way queries are handled differ greatly across the different implementations, as we will see in the sections to come.

Each of these three databases has the concept of a table, a logical abstraction representing a collection of related data composed of columns and rows. Each table’s data is broken down into small chunks, which are packed in a columnar fashion. These chunks are called segments in Apache Pinot and Apache Druid, and parts in ClickHouse.

Apache Pinot


Fig: Apache Pinot architecture

  • Data layer: This layer contains Servers, which store a subset of segments locally and process them at query time. Servers either pull data from streaming sources directly to build segments, or Minions can build and push segments from batch data sources. Segments are also stored in a deep store (a cloud object store such as Amazon S3, GCS, or ADLS) for permanent backups and recovery.
  • Query layer: This layer contains Brokers, which receive queries from clients and perform a scatter-gather to and from Servers.
  • Control layer: This layer contains the Controllers, which manage all the components of the cluster. Apache ZooKeeper is used as a persistent metadata store.

Read more details in the Pinot architecture docs.

Apache Druid


Fig: Apache Druid Architecture

  • Data layer: This layer consists of Middle Managers, which execute ingestion workloads, and Historicals, which store all queryable data. Segments are stored in a deep store for permanent backup and recovery.
  • Query layer: This layer contains the Brokers, which handle queries from external clients, and the optional Routers, which route requests to Brokers, Coordinators, and Overlords.
  • Control layer: This layer contains the Coordinator, which manages data availability on the cluster, and the Overlords, which control the assignment of data ingestion workloads. ZooKeeper is used for internal service discovery, coordination, and leader election. A system like MySQL or Postgres is used for segment and task metadata storage.

Read more details in the Druid design docs.

ClickHouse


Fig: ClickHouse architecture

  • Data layer: This layer contains the ClickHouse Servers, which host the data shards and manage ingestion. ClickHouse has no concept of a deep store for permanent storage.
  • Query layer: There is no architectural separation between the data and query layers. A query can be made directly to any ClickHouse server to query data on that server. Of course, data is usually distributed, so a Distributed Table would be created on some dedicated nodes to act as query nodes.
  • Control layer: ClickHouse Keeper / Apache ZooKeeper is used for metadata. No other centralized cluster management takes place.

Read more details in the ClickHouse “scaling out” docs.

Overall, ClickHouse has the fewest moving parts, followed by Pinot, then Druid. ClickHouse achieves this by combining many functionalities onto the same node or skipping some altogether.

While this makes getting started with ClickHouse relatively easy, several challenges arise once you hit the scale that comes with any practical setup.

  • Scalability: Separation of concerns is always a better design for scalability. For example, in Pinot, if the rate of events goes up, one can simply scale the real-time servers; if queries per second increase, one can simply scale the brokers, minimizing the impact on query latency.
  • Operational complexity: Users have to build tooling and automation on their own for every skipped functionality. For example, distributed query routing and cluster management operations such as replication, ingestion, and node maintenance have to be set up manually in ClickHouse (see the sketch below), whereas they are automatic and out-of-the-box in Pinot and Druid, making those two more operator-friendly for managing production workloads.
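As a concrete illustration of this manual setup, here is a minimal sketch of how distributed query routing is wired up by hand in ClickHouse (the cluster, database, and table names are hypothetical):

```sql
-- Each shard hosts a local table that actually stores data.
CREATE TABLE events_local (ts DateTime, user_id String)
ENGINE = MergeTree
ORDER BY ts;

-- A Distributed table (typically on dedicated query nodes) fans queries
-- out to the shards; 'my_cluster' must be defined manually in the
-- remote_servers section of the server configuration.
CREATE TABLE events_all AS events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());
```

In Pinot and Druid, the equivalent routing is handled automatically by the Brokers.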

Data Sources

This section gives a very high-level view of native ingestion support for various kinds of data sources. The corresponding in-depth details are captured in the ingestion section below.

Note: Native here means no external component is needed for that capability.

Real-Time Data Sources

Native, out-of-the-box support for real-time data sources:

| Real-time data source | Apache Pinot | Apache Druid | ClickHouse |
| --- | --- | --- | --- |
| Apache Kafka | Supported | Supported | Limited (see Ingestion section below) |
| Amazon Kinesis | Supported | Supported | No |
| Google PubSub | Supported (*StarTree Cloud) | No | No |
| Apache Pulsar | Supported | Not native; can be done via KoP | Not native; can be done via KoP |
| Change Data Capture (CDC): ability to ingest a CDC stream consistently | Supported | Not supported (lack of upserts) | Limited (due to limited upsert support) |

Table: Real-Time Data Sources, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Batch Data Sources

Native, out-of-the-box support for batch sources:

| Batch data source | Apache Pinot | Apache Druid | ClickHouse |
| --- | --- | --- | --- |
| S3 | Yes | Yes | Limited support (see below) |
| GCS | Yes | Yes | Needs custom data pipeline |
| ADLS | Yes | Yes | Needs custom data pipeline |
| HDFS | Yes | Yes | Needs custom data pipeline |
| Databricks DeltaLake | Yes (*StarTree Cloud) | Needs custom data pipeline | Needs custom data pipeline |
| Apache Iceberg | Needs custom data pipeline | Needs custom data pipeline | Needs custom data pipeline |
| Apache Hudi | Needs custom data pipeline | Needs custom data pipeline | Needs custom data pipeline |

Table: Batch Data Sources, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Note: Although ClickHouse mentions the above systems in the integrations section, that’s primarily for query processing and not for data ingestion.

SQL Data Sources

Native, out-of-the-box support for SQL sources:

| SQL data source | Apache Pinot | Apache Druid | ClickHouse |
| --- | --- | --- | --- |
| Snowflake | Yes (*StarTree Cloud) | Needs custom data pipeline | Needs custom data pipeline |
| BigQuery | Yes (*StarTree Cloud) | Needs custom data pipeline | Needs custom data pipeline |
| MySQL | No (can be done via CDC) | Yes | Needs custom data pipeline |
| Postgres | No (can be done via CDC) | Yes | Needs custom data pipeline |

Table: SQL Data Sources, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Note: Although ClickHouse mentions the above systems in the integrations section, that’s primarily for query processing and not for data ingestion.

Ingestion

Before a database can run analytical queries, the data must get into the system. This is the process of ingestion. This layer deals with all the capabilities required for ingesting data into the OLAP system from the various kinds of data sources mentioned above. As we go through the details of each data source, here are the key things to keep in mind:

  • Append-only vs. upserts: Append-only ingestion is the minimum requirement for an OLAP database, but supporting upserts provides data and query consistency.
  • Pull vs. push: Can the OLAP database support pull-based ingestion, push-based ingestion, or both? Each has its pros and cons.
  • At-least-once vs. exactly-once: At-least-once semantics is the minimum requirement to ensure there’s no data loss. The ability to do exactly-once data ingestion is great for certain business-critical applications.

Real-Time Ingestion

Ingesting data from real-time data sources like Apache Kafka is integral to real-time analytics. In this section, we will look at the different ways of ingesting real-time data and compare the nuances of each system.

Pull vs Push Based Real-Time Ingestion

There are two ways of ingesting real-time data:

  • Pull-based: Actively poll/fetch data from the real-time data source
  • Push-based: Push messages directly into the database

A native pull-based ingestion model is generally considered to be better than a push-based model for real-time data ingestion. Here are the reasons why:

  • Fewer moving parts: A native pull-based approach works great with systems like Apache Kafka and Amazon Kinesis and obviates the need for an intermediate connector (e.g., the Kafka Connect framework). This improves overall freshness and reduces error probability by reducing the number of hops in the pipeline.
  • More scalable : Based on empirical data, the pull-based approach supports higher throughput than a corresponding push-based approach.
  • Ease of replication : The pull-based model makes it very easy to replicate data since all replicas can fetch independently from the data source. In addition, strict partition ordering within the data source also alleviates the need for strong coordination among replicas to keep them in sync. For instance, replicas can all agree on the same Kafka offset to ensure data synchronization. A push-based model has to deal with the overhead of how to keep the replicas in sync.
  • Ease of exactly-once semantics: A pull-based model makes it easier to guarantee exactly-once semantics with systems like Kafka. Arguably, a push-based model has to do more work to ensure correctness in this case.

Need for Push-Based Ingestion

There are certain cases where push-based ingestion is preferred. For instance, in some cases, the real-time data might be generated directly by an application/service without any Kafka in the picture. In other cases, even though data might exist in a Kafka cluster, pull-based ingestion may not be possible due to network connectivity issues (e.g., Kafka in another VPC without any inbound access). For all such cases, we need the ability to push events directly to the database.

Next, we’ll go through each such mechanism in more detail.

Pull-Based Ingestion

Real-time pull model: ability to fetch data from real-time sources

  • Apache Pinot: Native support, which provides high ingestion performance and low ingestion latency.
  • Apache Druid: Native support, which provides high ingestion performance and low ingestion latency.
  • ClickHouse: No true pull-based model, but you can get similar behavior using a complex setup. For example, in the case of Apache Kafka, users need to configure the Kafka table engine + materialized view + MergeTree. Ultimately, data is still being pushed into a MergeTree family table, and real-time data goes through multiple hops before it can be queried by the user (see Integrating Kafka with ClickHouse). In general, this approach has a number of limitations: higher ingestion latency due to large batches of data, a chance of duplication, and cloud readiness. For more details on these limitations, please read the Apache Kafka Integration section below.

Out-of-order handling: ability to handle late-arriving events in real time

  • Apache Pinot: Supported. Out-of-order events will end up generating smaller segments.
  • Apache Druid: Supported.
  • ClickHouse: Supported.

Real-time ingestion scalability: ability to handle a very high ingestion rate (messages/second)

  • Apache Pinot: High; can sustain a high incoming message rate.
  • Apache Druid: High; can sustain a high incoming message rate.
  • ClickHouse: Moderate to low, because of the complex setup mentioned above.

Exactly-once ingestion guarantees: ability to persist messages from the data source in an exactly-once manner in the database

  • Apache Pinot: Supported.
  • Apache Druid: Supported.
  • ClickHouse: Not supported in the pull-based approach.

Stream partition changes: automatically handle partition changes at the source

  • Apache Pinot: Supported.
  • Apache Druid: Supported.
  • ClickHouse: Supported.

Table: Pull-Based Ingestion, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Apache Kafka Integration: Apache Pinot & ClickHouse comparison

It can be said that the real-time analytics journey began with Apache Kafka. The popularity and widespread adoption of Kafka in nearly every major company worldwide has made it one of the most important sources of real-time data. Apache Pinot was designed to integrate well with Apache Kafka and has excellent support for handling a high ingestion rate, out-of-order events, and exactly-once semantics.

As noted in the table above, Druid and Pinot are very similar in terms of Kafka integration, whereas ClickHouse took a very different approach. This section will focus on these differences between Pinot and ClickHouse.

Low-Level Consumers vs. Consumer Groups

Internally, ClickHouse uses Consumer Groups within the Kafka Table Engine to fetch data from Kafka. This has many issues, including message duplication, scalability challenges, fault-tolerance problems, and so on, as described in this talk about Pinot’s journey away from the Consumer Group approach. Conversely, Pinot evolved to use a low-level Kafka consumer approach that addresses all these issues.
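To make ClickHouse’s multi-hop setup concrete, here is a minimal sketch of the Kafka table engine + materialized view + MergeTree arrangement described above (topic, table, and column names are hypothetical):

```sql
-- 1. The Kafka table engine acts as a consumer, not as queryable storage.
CREATE TABLE events_queue (ts DateTime, user_id String, value Float64)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'events',
         kafka_group_name  = 'clickhouse_consumer',
         kafka_format      = 'JSONEachRow';

-- 2. A MergeTree table holds the data that queries actually hit.
CREATE TABLE events (ts DateTime, user_id String, value Float64)
ENGINE = MergeTree
ORDER BY (user_id, ts);

-- 3. A materialized view moves batches from the queue into storage,
--    adding an extra hop between Kafka and the queryable table.
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT ts, user_id, value FROM events_queue;
```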

Ingestion Latency

Ingestion latency is defined as the amount of time between an event being produced at the source and the event being queryable in the database. The Kafka Table Engine in ClickHouse reads messages in large batches. The default is mentioned as 64K, and higher batch sizes are recommended (500K to 1 million entries per batch are not uncommon). Needless to say, this results in higher ingestion latency. Pinot, on the other hand, is able to stream data in much smaller batches – almost three orders of magnitude smaller – thus leading to much lower ingestion latency without adding any additional overhead.

Data Duplication

The recommended pull ingestion mechanism in ClickHouse is not only complex, but it can also result in data duplication. The recommended way to resolve these duplicates is to use ReplacingMergeTree, which has its own issues, as discussed in the upsert section. On the other hand, since Pinot generates segments in lockstep with Kafka offsets, it is guaranteed not to generate any duplicates (unless duplicates are present in the Kafka stream itself, which can also be handled by Pinot – see the Upsert section below).

Scalability

In ClickHouse, the number of servers is limited by the number of Kafka partitions and is therefore difficult to scale. In the case of Pinot, segments can be automatically moved to different (non-consuming) hosts, which makes it very easy to scale without being limited by the number of Kafka partitions.

Data Replication

As mentioned previously, a pull-based approach makes data replication in Pinot much simpler by relying on Kafka offsets for coordinating across replicas. On the other hand, since ClickHouse does not have a pure pull model, it has to rely on a much more complex replication system. For each batch of data inserted into a ReplicatedMergeTree family table, metadata is written to ZooKeeper / ClickHouse Keeper. It’s important to note that this happens in the fast path of data ingestion, and having a strict dependency on ZooKeeper / ClickHouse Keeper is a significant overhead.

Cloud Readiness

It appears that the Kafka Table Engine is not supported in ClickHouse Cloud, and the recommendation is to use alternate approaches (e.g., push-based techniques). Conversely, Pinot has productionized Kafka integration over the last several years, and it is fully supported in StarTree Cloud.

Push-Based Ingestion

As mentioned before, push-based ingestion is often needed to get around a Kafka dependency or network connectivity issues. Let’s take a look at how it works in the three systems:

Real-time push model: ability to insert records in real time

  • Apache Pinot: Experimental feature (*supported in StarTree Cloud). In addition, Apache Pinot also supports writing batches of data via the Segment Writer API.
  • Apache Druid: Supported on older versions with Tranquility; not guaranteed to work with newer versions. Apache Druid does have the ability to do a batch insert using the SQL-based ingestion approach (see the sketch after this table).
  • ClickHouse: Supported. However, the recommendation is to use bigger batch sizes during data ingestion. As mentioned previously, this comes at the cost of higher ingestion latency.

Out-of-order handling: ability to handle late-arriving events in real time

  • Apache Pinot: Supported.
  • Apache Druid: N/A
  • ClickHouse: Supported.

Real-time ingestion scalability: ability to handle a very high ingestion rate (messages/second)

  • Apache Pinot: Limited as compared to the pull model.
  • Apache Druid: N/A
  • ClickHouse: Limited as compared to the pull model.

Exactly-once ingestion guarantees: ability to persist messages from the data source in an exactly-once manner in the database

  • Apache Pinot: Currently not supported.
  • Apache Druid: N/A
  • ClickHouse: Supported via the Kafka Connect Sink for ClickHouse, which is in beta as of this writing. It relies on a complex design that uses a hash of inserted data blocks, stored in ClickHouse Keeper (like ZooKeeper), for de-duplication.

Table: Push-Based Ingestion, a comparison of Apache Pinot, Apache Druid, and ClickHouse
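For reference, Druid’s SQL-based batch insert mentioned above looks roughly like the following sketch, based on the SQL-based ingestion documentation for Druid 25.0 (the bucket, file, and column names are hypothetical):

```sql
-- Druid multi-stage query (MSQ) batch ingestion: EXTERN describes the
-- input source, input format, and row signature as JSON strings.
INSERT INTO pageviews
SELECT
  TIME_PARSE("timestamp") AS __time,
  page,
  country
FROM TABLE(
  EXTERN(
    '{"type": "s3", "uris": ["s3://example-bucket/pageviews.json"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "country", "type": "string"}]'
  )
)
PARTITIONED BY DAY
```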

Real-Time Upserts

In the case of real-time event streaming data sources such as Apache Kafka, duplicate or upsert events may exist in the stream. Several mission-critical use cases require the database to handle such events during ingestion to ensure accurate query results. Uber’s financial dashboards, Citi’s risk monitoring, and many others rely on upserts to ensure business continuity. This section compares how the different systems handle such events.

Full row upsert: ability to upsert an existing row with a new row having the same primary key

  • Apache Pinot: Supported.
  • Apache Druid: Not supported during ingestion.
  • ClickHouse: Limited support during ingestion. The recommended way is to use ReplacingMergeTree, which handles upsert events asynchronously, meaning query results may not be accurate. From this link, it appears that this asynchronous process typically happens every 9-10 minutes or not at all (in case of not enough data), leading to severe query inconsistency issues. It is possible to perform mutations synchronously by setting the mutations_sync flag to 2; however, this is not recommended, since updating is a very expensive operation in ClickHouse.

Partial row upsert: ability to upsert certain columns in the schema with new values for the same primary key

  • Apache Pinot: Supported.
  • Apache Druid: Not supported during ingestion.
  • ClickHouse: Supported, with the same caveats as above.

Real-time deduplication: ability to de-duplicate incoming rows based on a primary key

  • Apache Pinot: Supported.
  • Apache Druid: Not supported during ingestion.
  • ClickHouse: Supported, with the same caveats as above.

Scalability: ability to handle a large number of primary keys

  • Apache Pinot: Supported (*StarTree Cloud).
  • Apache Druid: N/A
  • ClickHouse: Limited, due to the above caveats.

Table: Real-Time Upserts, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Note that both Apache Druid and ClickHouse claim the ability to do upserts. However, this requires complex workarounds such as batch insertions, query-time aggregations, or a complex data model that is difficult to generalize.
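To illustrate the ClickHouse caveats above, here is a minimal ReplacingMergeTree sketch (table and column names are hypothetical):

```sql
-- Rows sharing the same sort key are deduplicated only when a background
-- merge happens, keeping the row with the greatest updated_at version.
CREATE TABLE orders (order_id UInt64, status String, updated_at DateTime)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY order_id;

-- Until a merge runs, both the old and the new version are visible:
SELECT count() FROM orders WHERE order_id = 42;        -- may return 2

-- FINAL forces deduplication at query time, at a significant cost:
SELECT count() FROM orders FINAL WHERE order_id = 42;  -- returns 1
```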

Upserts in Pinot

Pinot has great support for handling upsert events in the incoming real-time stream for event-streaming technologies like Kafka. This is done without sacrificing ingestion and query performance and maintains real-time data freshness — which is not true in the case of the other systems.

How does it work?

Users can enable upserts in a real-time Pinot table by specifying a primary key (see the Stream Ingestion with Upsert docs). Each server in turn maintains a map from the primary key to the location of the latest version of the record for that key. When an existing record is upserted by a new one, the server uses this map to mark the old record as obsolete using a per-segment bitmap. All such obsolete records are then filtered out during query execution. This design ensures minimal overhead during ingestion and query time and works at scale.


Fig: Upserts in Apache Pinot

Scalable Upserts with StarTree Cloud

One caveat of upserts in Pinot was the memory overhead experienced by Pinot servers when a large number of primary keys are involved. StarTree Cloud removes this limitation by eliminating the in-memory requirement of managing upsert metadata. StarTree Cloud workspaces can easily handle billions of primary keys per server without sacrificing performance.

Upsert Improvements for Pinot (Coming Soon)

You can read more about additional improvements coming to Apache Pinot in this related blog.

Some of the highlights related to upserts include:

  • Time To Live (TTL) for the primary keys: helps reduce the footprint of the upsert metadata and thus the memory overhead on the server.
  • Compaction: the ability to delete obsolete records and improve storage efficiency.
  • Ease of bootstrap: Bootstrapping historical data into an upsert table is currently challenging. Some users rely on a Flink Pinot connector to push historical segments into the real-time upsert-enabled table. We will be adding native support for this shortly.

Batch Data Ingestion

Native, out-of-the-box support for importing data from batch data sources (e.g., S3 or GCS) is crucial for developer productivity. Otherwise, users must maintain custom data pipelines, which become a big overhead to maintain and scale.

Pull-based batch ingestion: native support for batch ingestion by pulling data from remote data sources

  • Apache Pinot: Yes; scalable and performant. Uses a background task framework to ingest data from batch data sources into Pinot. Most of the work is done outside of Pinot, with minimal impact on query performance.
  • Apache Druid: Yes; scalable and performant. Uses a background task framework to ingest data from batch data sources into Druid. Most of the work is done outside of Druid, with minimal impact on query performance.
  • ClickHouse: Limited. The suggested approach is to use INSERT INTO with SELECT * FROM the batch location. This is considered less scalable in comparison. In addition, it consumes resources on the live server that’s also serving queries, which can affect query performance. See Integrating S3 with ClickHouse for an example (and the sketch after this table).

Batch ingestion input formats: supported data formats for batch ingestion

  • Apache Pinot: JSON, CSV/TSV, ORC, Parquet, Avro, Protobuf, Thrift. See Input Formats in the Pinot docs for details.
  • Apache Druid: JSON, CSV/TSV, Kafka (load Kafka header, key, value), ORC, Parquet, Avro Stream, Avro OCF, Protobuf. See Data Formats in the Druid docs for details.
  • ClickHouse: JSON, CSV/TSV, ORC, Parquet, Avro, Protobuf, Arrow, CapnProto, and many more! See Formats for Input and Output Data in the ClickHouse docs for details.

Push-based ingestion: native support for pushing large batches of data from remote data sources

  • Apache Pinot: Supported (Spark and Flink connectors available).
  • Apache Druid: Supported (no built-in connector).
  • ClickHouse: Supported (no built-in connector).

At-least-once semantics

  • Apache Pinot: Supported.
  • Apache Druid: Supported.
  • ClickHouse: Supported.

Exactly-once semantics

  • Apache Pinot: Supported at the segment level.
  • Apache Druid: Supported at the segment level.
  • ClickHouse: No native support; needs a custom pipeline.

Table: Batch-Based Ingestion, a comparison of Apache Pinot, Apache Druid, and ClickHouse
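For reference, ClickHouse’s suggested INSERT INTO ... SELECT approach for batch data looks roughly like this sketch using the s3 table function (the bucket and table names are hypothetical):

```sql
-- Reads Parquet files from S3 and writes them into a local table.
-- Both the read and the write run on the serving node itself.
INSERT INTO trips
SELECT *
FROM s3('https://example-bucket.s3.amazonaws.com/trips/*.parquet', 'Parquet');
```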

SQL Data Ingestion

SQL data ingestion can be done in two ways: using CDC (Change Data Capture) in real time, or in a batch manner. This section covers batch-oriented ingestion from SQL data sources.

| Capability | Apache Pinot | Apache Druid | ClickHouse |
| --- | --- | --- | --- |
| Native support for one-time ingestion (bootstrap) from SQL sources | Supported (*StarTree Cloud) | | Needs custom data pipeline |
| Native support for incremental ingestion from SQL sources | Supported (*StarTree Cloud) | | Needs custom data pipeline |

Table: SQL Data Ingestion, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Batch and SQL ingestion support in StarTree

StarTree provides native, out-of-the-box support for ingesting the batch and SQL data sources mentioned above. Read No-code Batch Ingestion for Apache Pinot in StarTree Cloud for more details. This is done by a background job that runs alongside the primary cluster.

Key benefits:

  • Easy to use: There is no need for additional ETL tools for most common scenarios.
  • Efficient: Segments are generated by the background job outside the primary cluster and pushed to the servers serving live queries, with negligible impact.
  • Robust: Ingestion tasks are checkpointed and retried upon failures.
  • Scalable and cost-efficient: Auto-scaling is enabled in StarTree Cloud for elastic scale-up/down of the ingestion tasks.

Ingestion-Time Pre-Processing

Often, data needs to be pre-processed before it can be ingested and queried. Examples include flattening a complex JSON structure, rolling up data, or type/value transformation. Let’s look at some of the popular pre-processing capabilities requested by users.

Ingestion-time transforms: basic filtering and type/value transformation applied as data is being ingested

  • Apache Pinot: Rich filter and transformation function support. Filter functions: either Groovy expressions or built-in scalar functions can be used for filtering (see Ingestion-Level Transformation in the Pinot docs). Rich transformation function support, including Groovy expressions as well as built-in functions for string, math, JSON, datetime computation, and so on. A full list of Pinot functions is available in the docs.
  • Apache Druid: Rich filter and transformation function support. Various types of built-in filter functions, including JavaScript, regex, logical, extraction, column types, and so on (see Query Filters in the Druid docs). Transformation functions include string, math, time, array, JSON, lambda, and other operators.
  • ClickHouse: Limited support. Data can be transformed via an INSERT INTO ... SELECT clause from another table. However, there doesn’t seem to be a way to apply filtering or aggregation functions during ingestion itself.

Flattening: flatten a nested schema, often needed for query optimization

  • Apache Pinot: Built-in flattening support for nested JSON records. In addition, users can also use the json_extract transformation function. In general, a JSON column and an optional JSON index are recommended to accelerate query performance on JSON/nested fields.
  • Apache Druid: Convenient way of flattening nested records using FlattenSpec.
  • ClickHouse: Limited support for Nested columns (not for JSON), via the flatten_nested setting.

Rollups during real-time ingestion

  • Apache Pinot: Supported, with a variety of ingestion aggregation functions such as SUM, MIN, MAX, DISTINCTCOUNTHLL. Currently, specifying granularity/interval is not supported, but this can be achieved with derived columns.
  • Apache Druid: Supported, with a variety of aggregation functions such as SUM, MIN, MAX, and approximate ones (sketches). In addition, it’s possible to define custom granularity and intervals (see Ingestion Data Rollup in the docs).
  • ClickHouse: Limited support. Can use SummingMergeTree or AggregatingMergeTree for rolling up records based on a specified primary key in the background (not during ingestion).

Rollups during batch/offline ingestion

  • Apache Pinot: Supported (SUM, MIN, MAX, DistinctCountHLL, theta sketches).
  • Apache Druid: Supported.
  • ClickHouse: Limited; same as above (use the right MergeTree family).

SQL-based ingestion transform: a flexible way of inserting data into a table based on the output of a SQL query

  • Apache Pinot: Not supported.
  • Apache Druid: Supported with the new Druid multi-stage query engine.
  • ClickHouse: Supported with an INSERT INTO ... SELECT clause.

Denormalization support: denormalize data from multiple sources during ingestion

  • Apache Pinot: Not supported.
  • Apache Druid: Basic support using Druid MSQ (insert with join), limited to dimension tables.
  • ClickHouse: Limited to existing tables in ClickHouse, using an INSERT INTO ... SELECT clause.

Ability to partition data during ingestion

  • Apache Pinot: Supported for batch sources. In the case of real-time sources, the partitioning scheme is tied to the source (e.g., Kafka/Kinesis).
  • Apache Druid: Supported for batch sources. In the case of real-time sources, the partitioning scheme is tied to the source (e.g., Kafka/Kinesis).
  • ClickHouse: Supported for MergeTree family tables. As data is inserted, it can be partitioned as configured by the PARTITION BY spec (see the sketch after this table).

Ability to sort data during ingestion

  • Apache Pinot: Supported for batch sources on any dimension (limited to one dimension).
  • Apache Druid: Limited to sorting by timestamp as the primary sorted dimension.
  • ClickHouse: Supported for MergeTree family tables using the primary key.

Table: Ingestion-Time Pre-Processing, a comparison of Apache Pinot, Apache Druid, and ClickHouse
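Taking the ClickHouse rows above together, here is a minimal sketch of a MergeTree table that combines background rollup, ingestion-time partitioning, and sort-key ordering (the table and column names are hypothetical):

```sql
CREATE TABLE pageview_rollup
(
    event_date Date,
    page String,
    views UInt64
)
ENGINE = SummingMergeTree(views)    -- rows with the same sort key are summed,
                                    -- but only during background merges
PARTITION BY toYYYYMM(event_date)   -- partitioning applied as data is inserted
ORDER BY (event_date, page);        -- sort key, also the sparse primary index
```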

Table Management

Table data often needs to undergo some change after ingestion to keep up with changes in the data or the workload. This section compares how the three systems fare in this area.

Backfills: ability to replace the data in the database for a certain time range or condition, using data from another batch or streaming source (e.g., replace data for the 3am hour two days ago using files in an S3 bucket). Useful when data needs to be corrected due to upstream changes or issues.

  • Apache Pinot: Backfill is easily possible if the table is time-partitioned, which is often true for offline tables. Real-time tables are not naturally time-partitioned, but tools are available to time-partition and then execute a backfill.
  • Apache Druid: Could not find documentation or recipes, but seems possible by design.
  • ClickHouse: Support is available for making small-scale additions, updates, and deletes to columns, but there is no support for large backfills over time ranges.

Dynamic re-indexing: ability to dynamically add or remove indexing for columns of the table, without needing a full bootstrap or downtime. This is extremely useful when iterating between optimizations during initial setup.

  • Apache Pinot: Can easily add or remove indexes with no downtime by making config changes followed by a reload operation.
  • Apache Druid: Can be achieved via reindexing, which refers to a re-ingestion task that can apply any changes to existing data. However, this is modeled as batch ingestion with the existing data as the source, effectively making it not in-place.
  • ClickHouse: Skipping indexes can be added/removed dynamically.

Dynamic data layout change: support for making changes to the data layout, such as partitioning, sorting, bucketing, rollups, and resizing, dynamically, without having to re-bootstrap and without downtime. Picking the right layout based on the data and workload aids better compression, optimized routing, and segment pruning and scanning, which helps improve performance.

  • Apache Pinot: Native support for rollups and merges via Minions, which handle partitioning, bucketing, and trimming. Partitioning: Yes. Sorting: Yes (*StarTree Cloud). Bucketing: Yes. Rollups: Yes. Trimming: Yes.
  • Apache Druid: Can be achieved through overwrite, which is essentially re-ingestion using the existing data as the source. Partitioning: Yes. Sorting: Custom sorting itself is not supported, so this is N/A. Bucketing: Yes. Rollups: Yes. Trimming: Yes.
  • ClickHouse: While some operations like altering or deleting columns can be done via mutations, the ones involving entire data layout changes can only be done by dropping and recreating the data. It can be achieved by creating a materialized view or projection, but this solution increases the storage footprint, defeating the purpose of some operations like rollups and trimming. Partitioning: No. Sorting: No. Bucketing: No. Rollups: No. Trimming: No.

Table: Table Management, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Schema Management

This section compares the three systems on schema characteristics, such as native support for data of varied types and being able to evolve the schema.

Data types: breadth of data types supported as compared to SQL databases. Native data types are easier to use; in their absence, columns need to be modeled as strings, which can be inefficient for compression and access.

  • Apache Pinot: Basic ones such as int, long, float, double, char, string, boolean, and bytes, in addition to some special ones like big decimal, geo types, and JSON. No SQL-like array support; multi-value column semantics are akin to column explosion.
  • Apache Druid: Limited to the basic types, broadly in the categories of int, long, float, double, char, string, boolean. No SQL-like array support; multi-value column semantics are akin to column explosion.
  • ClickHouse: A very rich set of data types supported natively in comparison to Druid and Pinot. Along with the basic int, long, float, double, string, and boolean, there are many specialized ones such as UUID, Enum, Array, Map, tuple, geo types, IP, expression, and set (see the sketch after this table).

Nested columns: support for ingestion and fast querying of unstructured/semi-structured nested columns. Such support helps eliminate the need for data pre-processing workflows to extract data from nested structures.

  • Apache Pinot: Strong support for nested column ingestion, with strong indexing support at query time.
  • ClickHouse: Good support for ingesting and storing JSON, either as structured or semi-structured objects, using string/JsonObject.

Schema evolution: ability to make schema changes dynamically, in place. This is needed as data keeps evolving, and the system needs to react fast so as to not miss these changes.

  • Apache Druid: Supported for new data; for existing data, the documentation suggests reindexing, which is called out as time-consuming since it is a full overwrite of the data.
  • ClickHouse: Possible with mutations.

Table: Schema Management, a comparison of Apache Pinot, Apache Druid, and ClickHouse
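As an illustration of ClickHouse’s richer type system, here is a small sketch of a table using several of the specialized types named above (the table and column names are hypothetical):

```sql
CREATE TABLE sessions
(
    id        UUID,
    device    Enum8('mobile' = 1, 'desktop' = 2),
    tags      Array(String),          -- true SQL-style arrays
    attrs     Map(String, String),    -- key-value attributes
    client_ip IPv4
)
ENGINE = MergeTree
ORDER BY id;
```

In Pinot or Druid, a column like tags would typically be modeled as a multi-value string column, and attrs would need to be flattened or stored as a JSON string.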

Storage

As mentioned earlier, each of these databases has a columnar storage format.

  1. At the column level, there are differences in compression and encoding schemes, which affect the storage footprint and access latencies.
  2. At the segment level, data layout configuration options help increase pruning.
  3. At the server level, smart storage mapping strategies help drastically reduce query fanout, further reducing latency and boosting throughput.
  4. At the cluster level, the availability of storage options (multi-volume, cloud object storage) and the ability to create storage tiers for hot/cold data to control the cost-latency trade-off greatly improve cost to serve.

This section looks at the above aspects for the three systems.

Encoding: types of dictionary encoding available. These impact column size, which affects access latency and cost to serve.

  • Apache Pinot: Dictionary encoding done by default. Ability to skip dictionary encoding based on a configurable compression ratio. Option to create a variable-length dictionary for string/bytes columns for better compression. Ability to load the dictionary on-heap for faster access.
  • Apache Druid: Dictionary encoding. No way to smartly determine when to skip this encoding, or to create any advanced/nuanced dictionary encodings.
  • ClickHouse: Dictionary encoding done via the LowCardinality data type. No advanced thresholds available, other than cardinality-based disabling of LowCardinality.

Compression: standard/specialized techniques available to compress columns. These impact column size, which affects cost to serve.

  • Apache Pinot: snappy, lz4, zstd.
  • Apache Druid: lz4, lzf, zstd.
  • ClickHouse: lz4, zstd, plus some specialized codecs like delta, double delta, gorilla, fpc, t64.

Data layout, partitioning: control the data layout in terms of custom partitioning. Partitioning data helps with pruning servers/segments, reducing query fanout.

  • Apache Pinot: Can choose custom keys for partitioning.
  • Apache Druid: Data is always partitioned by the time column first, and secondary partitioning can be done using a custom key.
  • ClickHouse: Can choose a custom key for partitioning.

Data layout, sorting: control the data layout in terms of custom sorting. Sorting data helps improve data locality and enables smarter encoding (like RLE), reducing the work done in applying filters during scans.

  • Apache Pinot: Ability to pick any column as the sort key.
  • Apache Druid: Always sorts by the primary time column. No option to make a column of your choice the primary sort column (a secondary sort column can be set).
  • ClickHouse: Ability to pick any column as the sort key.

Data layout, strategies to reduce query fanout: ability to use smart instance assignment and segment assignment. These strategies help reduce query fanout, which improves latency, throughput, and operability.

  • Apache Pinot: Highly mature and advanced strategies for segment assignment, which help achieve better p99 latency and throughput, plus instance assignment strategies to improve ease of operations. More details in the Data Layout: Strategies to Reduce Query Fanout section below.
  • Apache Druid: The only strategy available for segment-to-instance assignment is the balanced strategy, which distributes segments evenly on Historical nodes.
  • ClickHouse: None; shard placement and query routing are manual (see the Query section below).

Tiered storage using local storage / compute node separation: ability to split hot/cold data onto multiple local (or directly-attached) volumes or onto different compute node tenants. For example, some local storage volumes might be fast NVMe SSDs, and others slower HDDs.

  • Apache Pinot: Two types of tiering available: (1) multi-volume tiering using different types of local volumes, and (2) compute node separation using multiple tenants.
  • Apache Druid: No tiering using multi-volumes on the same tenant. Supports compute node separation using service tiers.
  • ClickHouse: Tiered storage support using multi-volumes. No compute node separation, due to the absence of tenant/tagging concepts.

Tiered storage using cloud object storage: ability to split hot/cold data between local and cloud object storage. While reducing cost to serve is one aspect of this feature, the design choices need to be made carefully, as we still want the database to be fast.

  • Apache Pinot: Tiered storage using cloud object storage is available in StarTree Cloud: (1) a cloud object store as a tier is supported in StarTree Cloud; (2) fully on par with local mode, with no restrictions on data altering; (3) all indexes available; (4) several optimization techniques like prefetching, pipelined fetch and execution, and columnar/block fetch for performance; (5) caching available in addition, for further speedup, but with no dependence on it. Limitation: needs the Pinot segment format. (*StarTree Cloud)
  • Apache Druid: No support for tiered storage using cloud object storage.
  • ClickHouse: Tiered storage support using cloud object storage: (1) supports open file formats like gzip, parquet, csv; (2) can prefetch granules asynchronously. Limitations: (1) no data-altering support; (2) the documentation mentions no indexing support, but this article says index and mark files can be cached locally; (3) no columnar fetch; always reads the granule, which contains all columns; (4) relies on caching (lazy loading) to make cold-tier queries fast.

Table: Storage, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Data Layout: Strategies to Reduce Query Fanout

This blog post talks about the lifecycle of a query in Apache Pinot and the optimizations available at each point – from broker, to servers, to segments, to columns – to reduce the amount of work done. This section focuses on query fanout – reducing the number of nodes a query is scattered to. Reducing the fanout helps eliminate long tail latency (your query will be as slow as the slowest server) and, in turn, helps improve throughput.

The strategies available in Pinot to map the data segments onto servers and enhance pruning at broker level are:

  1. Balanced: This strategy simply distributes segments evenly and is also the only strategy adopted by Druid and ClickHouse.
  2. Replica-group: This strategy creates groups of servers such that an entire replica of the dataset is constrained to each group, thus reducing server fanout and being able to scale horizontally for higher throughput.
  3. Partition-based replica-group: This strategy further creates partition based groups to reduce segment fanout within a replica group.

The strategies available in Pinot to map the tables onto servers and improve ease of operation are:

  1. Creating pools for a no-downtime rolling restart of large shared clusters
  2. Fault-domain aware instance assignment

Tiered Storage using Cloud Object Storage

Each of the three systems started with tightly coupled storage and compute, wherein they used compute nodes that included local or directly attached high-performance storage, such as NVMe SSDs. This is one of the reasons that makes them suitable for low latency.

However, local high-performance storage is expensive compared to decoupled systems that primarily use cloud object storage. Cloud storage is cheaper but slower and, therefore, usually unsuitable as the sole storage layer for interactive real-time analytics. Simply choosing one of these two approaches for all analytics needs does not make sense.

Real-time (or recent) data tends to be more valuable and is queried more frequently. On the other hand, historical data is queried less frequently. Latency is still important, but with this data being large and ever growing, cost becomes a dominating factor. Data teams often choose a hybrid approach where they add both systems and migrate the data from one to another depending on cost / latency expectations. Therefore, many users need to consider how to span their data, choosing what they wish to store in high performance local storage and what is placed on slower but more cost-effective cloud storage.


Fig: Tiered Storage for Apache Pinot in StarTree Cloud

Tiered storage for Apache Pinot in StarTree Cloud is designed to address this problem. With the option to use cloud object storage such as S3 / GCS directly in the query path, Pinot can now act as a hybrid system — it can use completely decoupled or tightly-coupled storage or have a mix of both. It lets you flexibly configure data to be on any tier and query it all seamlessly like any other table.

Most importantly, several investments have been made to reduce the latency issues that come with using decoupled storage. For instance, the design consciously stays away from approaches such as lazy loading — where entire segments would be downloaded during query time — or using local caches. Typical OLAP workloads consist of arbitrary slice-and-dice over varying time ranges, and caching-based approaches cannot offer predictable p99 latency due to the frequent download and churn of segments.

Instead, the tiered storage approach in Pinot for reading from cloud object storage uses strategies such as selective columnar / block reads, smart prefetching, pipelining fetch and compute. These optimizations, along with the ability to use the army of native indexes available in Pinot, help Pinot achieve the best of both worlds — reduced cost to serve while keeping the speed of Pinot.

Indexing Strategies

Indexing is a fundamental function of a real-time OLAP database, ensuring queries can be executed efficiently with minimal data scans. In this section, we will compare the indexing strategies available across the three systems and also dive into why this is important.

| Type of Index | Functionality | Apache Pinot | Apache Druid | ClickHouse |
| --- | --- | --- | --- | --- |
| Inverted Index | Fast Filter | Supported | Supported | |
| Sorted Index | Personalized Analytics | Supported | No | Supported |
| Range Index | Range Queries | Supported | No | No |
| JSON Index | Analyze Semi-structured Data | Supported | No | No |
| Geo Index | Location Queries | Supported | No | No |
| StarTree Index | Fast Aggregation | Supported | No | No |
| Bloom Filter | Approximate Filter | Supported | Supported | Supported |
| Text Index | Log Analytics | Supported | No | Limited: via bloom filters |
| TimeStamp Index | Time-Based Rollups | Supported | No | No |
| Sparse Index | Approximate Filter | *StarTree Cloud (Alpha) | No | Limited: only on the primary sorted column |

Table: Indexing Strategies, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Why do Indexes Matter in an OLAP Database?

Indexes help minimize the data scans needed for processing a given OLAP query. This is important to ensure query performance is optimal and predictable in terms of throughput and latency, even as the underlying data keeps growing organically.


Fig: Impact of indexes on query latency with increasing data size

Apache Pinot natively supports a wide range of indexing strategies, making it the ideal OLAP platform for myriad use cases (as discussed here). In addition, it has a pluggable architecture, making it extremely easy to add new indexes as and when needed. For instance:

  • Star-Tree Index, Pinot’s way of building intelligent materialized views, is crucial for running user-facing analytical applications. Specifically, it allows users to pre-aggregate data in a tunable fashion, leaving the source data intact. This enables highly concurrent, low-latency key-value style queries. Any changes in the pre-aggregation function or corresponding dimensions can be handled dynamically without needing to re-ingest all the data.
  • JSON index and text index are great at processing semi-structured data in sub-seconds (e.g., analyzing system and application logs at scale; see the query sketch after this list).
  • Geo index accelerates geospatial queries, which are required by many logistics and delivery companies around the globe, as well as various other real-world location-based services.
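As an example of the JSON index in action, a Pinot filter on a nested JSON field can use the JSON_MATCH predicate, which a JSON index serves without scanning raw documents (a sketch; the table and column names are hypothetical):

```sql
-- Count log records whose nested error code is 500.
SELECT COUNT(*)
FROM app_logs
WHERE JSON_MATCH(payload, '"$.error.code"=''500''')
```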

Here’s a quick view of the power of indexes – comparing query latencies with and without the corresponding index.


Fig: How Apache Pinot’s indexes bring down the latency compared to a raw scan

As depicted above, indexes help reduce query latency by many orders of magnitude. In addition, they reduce the amount of work and resources needed per server, thus increasing the system’s overall throughput.

Query

This section compares the three systems in all aspects related to querying the database. It discusses commonly used query constructs in OLAP and advanced multi-stage query features such as joins and window functions. It finishes with query latency optimizations and features such as caching, routing, and pruning strategies.

| Capability | Apache Pinot | Apache Druid | ClickHouse |
| --- | --- | --- | --- |
| Filter | Yes | Yes | Yes |
| Group by | Yes | Yes | Yes |
| Order by | Yes | Yes | Yes |
| Sketches (useful for fast, approximate counting of distinct values in large sets) | Yes | Yes | Yes |

Multi-stage query engine: required for executing more complex SQL queries such as JOIN and OVER window, which can cause bottlenecks if executed in a single-stage scatter-gather fashion.

  • Apache Pinot: Has a multi-stage query engine that is fully distributed and designed for interactive queries.
  • Apache Druid: The multi-stage query engine executes SQL statements as batch tasks in the indexing service and appears to be used more for ingestion. This is not suitable for interactive low-latency OLAP queries.
  • ClickHouse: Has a multi-stage query engine.

Joins: a useful query construct for combining data from multiple tables in the scope of a single query. In the absence of joins, expensive data pre-processing and denormalization need to be done.

  • Apache Pinot: Supported, with limitations. See the Joins section below for more details.
  • Apache Druid: Limited to joining regular Druid tables with lookup and inline data sources (in other words, no multi-stage joins).
  • ClickHouse: Supported, with limitations. See the Joins section below for more details.

Window functions: a useful query construct for applying aggregations over a window (set of rows).

  • Apache Pinot: Work in progress. Support added for the WINDOW OVER construct, with some limitations.
  • ClickHouse: Standard window function syntax supported, along with many basic aggregations.

Pluggable user-defined functions (UDFs): ability to easily write custom UDFs and plug them into the system.

  • Apache Pinot: Support for using Groovy, or for defining scalar functions via annotations, which can be plugged in as a custom jar.
  • Apache Druid: Support for writing JavaScript functions. No ability to write a function and plug it in via a custom jar.
  • ClickHouse: Can define UDFs using CREATE FUNCTION, but these use lambda expressions, which aren’t as powerful and can be inadequate at times (see the sketch after this table). No ability to write a function and plug it in via a custom jar.

Query caching: support for caching the results of a query. Useful when there are repeated executions of the same expensive query.

  • Apache Pinot: No query caching mechanisms exist.
  • Apache Druid: Two types of query result caches are available: a per-segment cache on the server and a whole-query cache on the broker. Mechanisms are in place to evict stale cache entries as soon as data changes, along with an explicit config to use the cache.
  • ClickHouse: Recently introduced a feature for query caching on servers. Entries are not kept in sync with underlying data changes, TTL is the only mechanism for evicting stale entries, and no user overrides are available. Plus, it has been called out as experimental and has the possibility of returning wrong results.

Pruning strategies during query execution: ability to prune out servers/segments without having to process all the segments. There are two stages where this can be done: (1) reduce server fanout, and (2) reduce segment fanout.

  • Apache Pinot: Reduce server fanout: (1) partition based – prune out server groups based on partition info; (2) replica-group based – route to only one replica group for any query. Reduce segment fanout: (1) min/max value – prune out a segment based on the filter clause and the min/max values stored in segment metadata; (2) partition info – prune out a segment based on partition info stored in segment metadata; (3) bloom filters – prune out segments based on bloom filters created for columns and stored in segments.
  • Apache Druid: Reduce server fanout: none. The segment pruning techniques are applied directly on brokers, but there are no attempts to reduce the set of servers fanned out to. Reduce segment fanout: (1) time based, as segments are always time-partitioned first; (2) partition-based pruning.
  • ClickHouse: Reduce server fanout: none. Routing itself is a manual process – due to the absence of a dedicated query layer – involving setting up a Distributed Table using cluster.xml, and no smart controls exist to further optimize this. Reduce segment fanout: (1) block-level pruning achieved using sparse primary indexes and skip indexes (with techniques like min/max, set, and bloom filters); (2) partition-based pruning.

Table: Querying, a comparison of Apache Pinot, Apache Druid, and ClickHouse
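To illustrate the ClickHouse UDF point above, here is a minimal sketch of a lambda-style function (the function name is hypothetical):

```sql
-- Lambda UDFs are fine for simple expressions, but they cannot contain
-- loops, state, or external library calls, which is why they can be
-- inadequate for more complex logic.
CREATE FUNCTION clamp01 AS (x) -> greatest(0, least(1, x));

SELECT clamp01(1.7);  -- returns 1
```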

Joins

ClickHouse, Pinot, and Druid Joins Execution

A Join query fundamentally works on two parts – a left table and a right table. There are three ways to execute a join.

  • Broadcast join: All data post-filtering from the right table (R) is shipped to all nodes (N) of the left table (effective data moved: R * N). This works well when the filtered data being shipped is small and the number of nodes is also small, but query latency increases proportionately as the size of R or the number of nodes N increases. This is the only algorithm that ClickHouse applies for joins.
  • Shuffle distributed hash join: After applying filters, data from both sides (L and R) is hashed on the join key and shipped to an independent shuffle stage (effective data moved: L + R). This scales well even if the size of data on either side increases, and performance is unaffected by the number of nodes on either side. Pinot is the only system that can execute joins using this algorithm (see the sketch after this list).
  • Lookup join: The right table must be pre-loaded (replicated) on all nodes of the left table. This can only work for very small right-hand datasets. This is the only kind of join that Druid can execute. It is available in Pinot as well, for small dimension tables.
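For intuition, here is the kind of fact-to-fact join that requires the shuffle approach (the tables and columns are hypothetical):

```sql
-- Neither side is a small dimension table, so a broadcast or lookup join
-- would be impractical; both sides are hashed on order_id and shuffled.
SELECT o.customer_id, SUM(p.amount) AS total_paid
FROM orders o
JOIN payments p ON o.order_id = p.order_id
WHERE o.order_date >= '2023-01-01'
GROUP BY o.customer_id
```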

In a separate blog post, we will follow up with more details about the nuances of join implementation for all systems, so stay tuned.

Materialized Views

Materialized views (MVs) create a derived, precomputed table using only a subset of data from a base table, such as a select number of columns pertinent to a specific query. Users commonly use this to optimize query performance for well-known query patterns. Here’s how the different systems have implemented materialized views:

Capability Apache Pinot Apache Druid ClickHouse

Flexibility

High – native support via star-tree index

Low. Requires a Hadoop cluster

High – native support via CREATE VIEW

Ingestion overhead

Low – Index can be created at any time and segment reload will automatically create the index for the entire source table

High – requires a Hadoop cluster

High for Materialized Views:

  • Any aggregation in the view query is applied only to newly ingested data
  • POPULATE can backfill existing data, but is not recommended since any data being ingested at the time of view creation will be lost
  • Special consideration needs to be given to the materialized_views_ignore_errors flag; otherwise, an error in any one of the views will block ingestion

ClickHouse also has the concept of a Projection, which appears to be built automatically for the entire table.

Storage overhead

Medium – allows tuning space vs. query latency overhead. In other words, not all rows of the view are fully materialized, which allows for space efficiency while still maintaining low query latency

High – materializes all the rows as a traditional view

High – materializes all the rows as a traditional view

Query overhead

Low – the user does not need to know about the existence of the index. The same query on the source table will automatically pick the right star-tree index under the hood

Moderate – queries need to target a specific view

Moderate – queries need to target a specific view.

ClickHouse has a separate concept of Projections, which are applied transparently without the user having to know about them. In general, though, Projections are considered more heavyweight than a Materialized View.

Table: Materialized Views, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Star-Tree Index: Pinot’s intelligent materialized view

The star-tree index provides an intelligent way to build materialized views within Pinot. Traditional MVs work by fully materializing the computation for each source record that matches the specified predicates. Although useful, this can result in non-trivial storage overhead. The star-tree index, on the other hand, allows us to partially materialize the computations and tune the space-time tradeoff via a configurable threshold between pre-aggregation and data scans.

Star-Tree Index Space-Time Tradeoff Graph

Fig: star-tree index lets you choose your space-time tradeoff between pre-aggregating everything (a materialized view) or doing everything on-the-fly

The star-tree index allows us to pre-compute the values of an aggregation function across the dimensions specified in a given materialized view. A configurable parameter called maxLeafRecords determines the maximum number of rows that need to be scanned to compute an aggregation. In other words, if maxLeafRecords is set to 1, we get full materialization – the same as a traditional MV. Setting it to a higher value yields significant space savings at the cost of a small query overhead.
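To illustrate the maxLeafRecords tradeoff, here is a deliberately simplified, hypothetical one-dimensional sketch in Python – not Pinot's actual star-tree implementation. Groups larger than the threshold are pre-aggregated at build time (costing storage), while smaller groups remain as raw rows and are aggregated within a bounded scan at query time:

```python
# A toy sketch of the space-time tradeoff behind maxLeafRecords: large groups
# are materialized up front, small groups are aggregated on the fly with a
# scan bounded by the threshold. Names here are illustrative only.
MAX_LEAF_RECORDS = 3

def build_index(rows_by_dim):
    index = {}
    for dim_value, rows in rows_by_dim.items():
        if len(rows) > MAX_LEAF_RECORDS:
            index[dim_value] = ("pre_agg", sum(rows))  # materialized (storage cost)
        else:
            index[dim_value] = ("raw", rows)           # scanned on demand (query cost)
    return index

def query_sum(index, dim_value):
    kind, payload = index[dim_value]
    # Either a stored aggregate, or a scan bounded by MAX_LEAF_RECORDS rows.
    return payload if kind == "pre_agg" else sum(payload)

index = build_index({"US": [1, 2, 3, 4, 5], "NZ": [7, 9]})
print(query_sum(index, "US"))  # 15 -- served from the pre-aggregated value
print(query_sum(index, "NZ"))  # 16 -- computed by scanning 2 raw rows
```

With MAX_LEAF_RECORDS set to 1, every multi-row group would be pre-aggregated – full materialization – while a larger threshold trades storage for bounded query-time scans.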

Key benefits:

  • User controllable: Tune space vs. time overhead
  • Flexible: create any number of indexes. The right index is chosen based on the query structure.
  • Transparent: Unlike traditional MVs, users don’t need to know about the existence of a star-tree index. The same query will be accelerated with a star-tree index in place.
  • Dynamic: Very easy to generate a new index at any point of time.

Learn more about the star-tree index here.

Query Performance

Performance benchmarking is highly dependent on use case, query pattern, and dataset. We recommend you evaluate query performance on your own datasets across these systems. Here are some benchmarking numbers provided by neutral third parties:

Query Performance Benchmark Graphs

Fig: Query performance benchmark graphs from the various sources as listed below

  • Cisco did an evaluation comparing ClickHouse and Pinot. View the setup details and results in Webex: Real-Time Observability & Analytics with Apache Pinot. A quote from the presentation: “Pinot is 4x faster than ClickHouse (in most categories) and is able to maintain p99 with increased load.”
  • This engineering blog post titled Real-time Security Insights: Apache Pinot at Confluera talks about their benchmarking done on Druid and Pinot. The results showed queries on Pinot were 2x – 4x faster than on Druid and could sustain a much higher throughput.
  • This blog post titled Pinot in YouGov discusses a comparison they did with several technologies; Pinot was 5x – 7x faster than Druid and ClickHouse.
  • This session How Zomato uses Apache Pinot talks about how Pinot could handle 2x the query throughput compared to Druid using the same cluster size.

Cluster Management

This section compares the systems on operability, which comprises several aspects, such as ease of cluster operations (scaling, maintenance, disaster recovery), visibility into cluster status, flexible cluster setup options that enhance manageability and help maintain SLA guarantees, and so on.

While all three databases are designed for high availability (HA), and each seeks to avoid single points of failure (SPOFs), each handles cluster component failures differently and thus provides differing overall reliability and resiliency.

Capability Apache Pinot Apache Druid ClickHouse

Management console

A UI for managing cluster components and performing regular operations related to the lifecycle of tables, schemas, segments.
Having such a console greatly improves ease of operability and developer productivity.

A functional console is available, covering the management and lifecycle of all physical and logical components. It also includes a query console and, for advanced users, a UI with access to all endpoints and a Zookeeper viewer.

Advanced consoles available in StarTree Cloud for ingestion (Data Manager) and easy data layout and cluster management (Pinot Control Panel)
*StarTree Cloud

A comprehensive console for loading data, managing data sources and tasks, and viewing server status and segment information. The console can also be used to run SQL and native Druid queries.

No management console in ClickHouse.

SQL console is available only in *ClickHouse Cloud.

Multi-tenancy

Ability to create groupings amongst components of different use-cases in the same cluster.
This helps provide physical as well as logical isolation and eases operations.

Tenant is a first-class concept in Pinot that can be used to isolate the brokers and/or servers of a table within a multi-tenant cluster.
Tenants in Pinot not only help with isolation of resources and workload, but also bring operational benefits such as independent scaling, rollouts, etc.

The concept of a tenant is used to define a “user,” and multi-tenancy refers to multiple users / teams querying the same data source. No references were found regarding techniques like node pools / tagging for achieving isolation of tables within a cluster. Instead, suggestions include creating separate partitions for every tenant, creating separate data sources for every tenant, or grouping similar tenants and creating a shared datasource.

No concept of tenants.
The recommendation is to keep the number of tables in the cluster small.

Scale up / scale out

Ease of making capacity changes to the cluster hardware (vertical as well as horizontal scaling up / down) and data redistribution. This is commonly required in order to adapt to changing workloads on a cluster.

Very easy to make capacity changes and rebalance the cluster.

StarTree Cloud has a Pinot Operator, which facilitates more automated scaling up / down of nodes, including rebalancing without down-time. It includes heterogeneous hardware support, which is often needed for multi-tenant clusters or tiered storage use cases.
*StarTree Cloud

While no documentation was found, these links indicate it’s possible to scale out the nodes and automatically balance the segments.

Documentation indicates this is not easily possible: one needs to manually create the table on new nodes and copy over data, which is operationally heavy.

Node maintenance

Ease of operations when having to replace or take down a node from the cluster for maintenance, which is a common operation in large scale distributed clusters.

Easy to take down a node and have another one join in its place by using the same node identifier.

There’s a maintenance mode flag that can be used to mark a node as decommissioned, so that new data stops being assigned to it and the balancer starts moving existing data to other nodes.

Involves manual steps like copying over data from one replica to another, which means this cannot be done unless the replication factor is greater than 1.

Disaster Recovery

Permanent backup and recovery mechanisms, to recover the data in the event of node / disk failures and cluster outages.

All segments are backed up to deep store by default, and data can be restored completely in the event of cluster failures / data loss.
This copy is also used to load data onto newly added nodes or nodes that lose their data.

All segments are backed up to deep store by default, and data can be restored completely in the event of cluster failures / data loss.
This copy is also used to load data onto newly added nodes or nodes that lose their data.

Backups need to be set up manually. There’s no concept of a deep store, where a permanent copy is backed up by default.
Recovering after node failures is manual and requires copying data from another replica, which means recovery is impossible if all replicas are lost.

High Availability

Ability to keep service available in the event of partial component failures. Avoid any “Single Point of Failure (SPOF).”

Each component and data shard can be scaled and replicated. Data can be replicated easily. Can operate queries, ingestion, and background tasks in the face of a failure in any of the components.
Pinot can also tolerate failures in deep store availability, thanks to its peer-download capabilities.

Each component and data shard can be scaled and replicated. Data can be replicated easily. Can operate queries, ingestion, and background tasks in the face of a failure in any of the components.
However, does not have a way to provide high availability if the deep store goes down.

It is possible to run ClickHouse without Zookeeper. However, in this mode you cannot have replication, and hence cannot be highly available. Additionally, the absence of a deep store, and the fact that ClickHouse servers are the sole component responsible for all activities in the system (no separation of concerns), make it much harder to provide high availability.

System tables

System tables capture information about the cluster components, query / ingestion workload, etc, which can be extremely useful in monitoring system health and debugging without requiring a separate management port or interface.

No concept of system metadata tables. Virtual metadata columns – docId, hostId, segmentName – are built by default and available at query time.

Two system tables are available: Information Schema (tables and column types) and Sys Schema (servers, segments, tasks).

A rich set of system tables for server states, internal processes, and environments (see the sketch after the table below).

Table: Cluster Management, a comparison of Apache Pinot, Apache Druid, and ClickHouse
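Since system tables are often the primary operational window into ClickHouse, here is a minimal sketch assuming the open-source clickhouse-driver Python client and a locally running server; the query is one illustrative example, not an exhaustive list:

```python
# A minimal sketch, assuming the open-source clickhouse-driver client, of how
# ClickHouse's system tables can be queried for operational insight.
from clickhouse_driver import Client

client = Client(host="localhost")  # assumes a locally running ClickHouse server

# Disk usage per table, computed from the system.parts table.
rows = client.execute(
    "SELECT table, sum(bytes_on_disk) AS bytes "
    "FROM system.parts WHERE active GROUP BY table"
)
for table, size in rows:
    print(table, size)
```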

Security

Here, we compare the key security requirements needed for an enterprise-grade real-time OLAP database.

Capability Apache Pinot Apache Druid ClickHouse

OIDC based Authentication

Supported
*StarTree Cloud

Supported

No

Kerberos based Authentication

No

Supported

Supported, but not available in ClickHouse Cloud

Authorization: Attribute-Based Access Control (ABAC)

Supported
*StarTree Cloud

No

No

Authorization: Role-Based Access Control (RBAC)

No

Supported

Supported

Data encryption (“Data-at-Rest”)

Storage level encryption

Storage level encryption

Storage level encryption + Built-in encryption support

Network encryption (“Data-in-Motion”)

SSL/TLS

SSL/TLS

SSL/TLS

Table: Security, a comparison of Apache Pinot, Apache Druid, and ClickHouse

Cloud Deployment Model

Finally, we look at how easy it is to deploy these three systems as a fully managed cloud solution, which requires a commercial Database-as-a-Service (DBaaS) provider. Users can, of course, self-deploy any of these open-source databases to the public cloud as a self-managed solution.

Capability Apache Pinot (StarTree Cloud) Apache Druid (Imply) ClickHouse (ClickHouse Cloud)

SaaS

Cluster provisioned in the vendor’s cloud account. The user has to ship data to this cluster.

Supported
*StarTree Cloud

Supported

*Imply

Supported
*ClickHouse Cloud

Bring Your Own Cloud (BYOC)

Cluster provisioned in the customer’s cloud account. The customer retains full control of the data.

Supported. No VPC peering needed between control plane and data plane.
*StarTree Cloud

Supported, but with limitations: VPC peering is needed between the control plane and data plane. This is not only a point of friction for users, but also problematic in terms of having to deal with CIDR range overlap across multiple deployments. Resolving CIDR overlap requires a complex setup, as documented here.

*Imply

Not Supported
*ClickHouse Cloud

Table: Cloud Deployment Model, a comparison of StarTree Cloud, Imply, and ClickHouse Cloud

StarTree’s Bring Your Own Cloud (BYOC) deployment model

StarTree provides a fully managed BYOC offering in which the data never leaves the customer’s cloud account. This is tremendously beneficial for customers, who retain full control over the data and access to that data, without having to deal with the complexities of running a production-grade Pinot cluster.

Here are the key benefits of this model:

  • Data security / governance: Data never leaves the customer’s cloud perimeter. Customers also retain full control over data access.
  • SLA: StarTree customers get the same SLA as the corresponding SaaS model.
  • Fully managed: StarTree is responsible for handling deployment, upgrades, routine maintenance, backups / recoveries, and capacity planning on behalf of the customer.
  • Principle of least privilege: StarTree Cloud runs with a least-privilege model for operating and managing BYOC clusters, with the exception of the one-time deployment.
  • No VPC peering needed: StarTree’s control plane does not need any VPC peering with the remote data plane, which makes it frictionless to deploy and scalable across multiple deployments.
  • Cost effective: Customers can fine-tune the cluster configuration for cost optimization regarding instance type and storage type, and take advantage of their cloud discounts as applicable.

Outbound Connectors

The ability to connect these databases with Business Intelligence (BI) or visualization tools is crucial for doing ad-hoc analysis on real-time data. Here is how the three systems compare in terms of outbound connectors (see also the sketch after the table below):

Connector Apache Pinot Apache Druid ClickHouse

JDBC

Limited Support

Yes

Yes

PrestoDB

Yes

Yes

Yes

Trino

Yes

Yes

Yes

Tableau

Yes

Yes

Yes

Looker

Via PrestoDB/Trino

Yes

Yes

PowerBI

Via PrestoDB/Trino

Via PrestoDB/Trino

Yes (ODBC driver)

Grafana

No

Yes

Yes

Superset

Yes

Yes

Yes

Table: Outbound Connectors, a comparison of Apache Pinot, Apache Druid, and ClickHouse
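Beyond the BI connectors above, all three systems can also be queried programmatically. As one illustration, here is a minimal sketch using the community pinotdb DB-API client for Python; the host, port, and table name are placeholders for a hypothetical local quickstart cluster:

```python
# A minimal sketch using the community `pinotdb` DB-API client to query Pinot
# programmatically. Host, port, and table name are placeholder values.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM myTable")  # myTable is a placeholder
for row in cursor:
    print(row)
```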

Feedback

We hope you enjoyed reading this comparison! We’ve tried our best to ensure that all the content is accurate to the best of our knowledge. However, it’s possible that we missed something or overlooked some details. And, of course, this is a point-in-time analysis: future updates to any of these databases could require revisions to the information provided above. Please contact us if you have any feedback, and we’d be happy to incorporate it!

Intrigued by Apache Pinot? Try it out!

Perhaps the analysis you’ve read above has gotten you curious about Apache Pinot. If so, you are free to download open-source Apache Pinot. Or, if you want a fully-managed experience, you can sign up to try StarTree Cloud, built on Apache Pinot, free for 30 days.

Acknowledgements:

Thanks to Rachel Pedreschi for the very thorough review and insights on Apache Druid, Xiaobing Li for reviewing and helping us dig deep into ClickHouse, and Shounak Kulkarni for validating some of our findings and assumptions on real cluster setups for all three systems!

Many thanks to Madison Sorenson for making the beautiful illustrations, and to all our internal reviewers and editors – Peter Corless, Tim Berglund, Kishore Gopalakrishna, Kulbir Nijjer, and Allison Murph – for the valuable feedback and direction!

Changelog

  1. 4/18/2023 – Original publish date
