Apache Kafka and Apache Pinot – Built together to work together.

Both Apache Kafka and Apache Pinot have a similar origin story: birthed within the halls of LinkedIn to solve the challenge of streaming data with real-time analytics. Both projects were fostered through the open source community of the Apache Software Foundation, and both are now supported by companies dedicated to their technical evolution and user success. Discover how these projects evolved to complement each other and how best to leverage their respective capabilities.

Apache Kafka

In 2008, LinkedIn hit a wall. They couldn’t move data from their transactional systems to their analytics system fast enough. So a project began within LinkedIn to aggregate logs and distribute data across disparate internal systems.

In June 2011, “Project Kafka” was introduced and open sourced under the Apache 2.0 license. (Read the original whitepaper presented at NetDB ‘11.) By the following month, it had entered incubation at the Apache Software Foundation (ASF) and was thereafter known as Apache Kafka. It graduated from incubation in October 2012.

In 2014, Confluent was founded by Kafka’s original development team to both shepherd the open source Apache Kafka community and to provide advanced capabilities through its commercial offering.

Apache Kafka’s adoption across data-intensive industries was broad and swift, even before it hit its 1.0 release milestone in 2017. As it was adopted, a new term was needed to describe this type of service within modern data architectures: event streaming, a category now broadening into what is increasingly called “data streaming.” Today, over 80% of Fortune 100 companies use Apache Kafka.

How does Apache Kafka work?

Apache Kafka runs on a distributed cluster of servers that can efficiently ingest, store, and pass along data structured as events — messages sent at a specific time. Related messages are grouped into topics, into which external systems can contribute updates as producers (data sources), and from which other systems can subscribe as consumers (data sinks).
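To make the producer and consumer roles concrete, here is a minimal sketch using the confluent-kafka Python client. The broker address, the “page-views” topic, and the event payload are illustrative assumptions rather than part of any particular deployment.

```python
import json
from confluent_kafka import Producer, Consumer

# Producer (data source): publish an event to a topic.
producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker address
event = {"user_id": 42, "page": "/pricing", "ts": 1700000000000}
producer.produce("page-views", key=str(event["user_id"]), value=json.dumps(event))
producer.flush()  # block until the broker acknowledges the event

# Consumer (data sink): subscribe to the same topic and read events back.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-readers",        # consumer group used for offset tracking
    "auto.offset.reset": "earliest",   # start from the beginning if no offset exists
})
consumer.subscribe(["page-views"])
msg = consumer.poll(timeout=5.0)       # returns None if nothing arrives in time
if msg is not None and msg.error() is None:
    print(json.loads(msg.value()))
consumer.close()
```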

Events can be stored within Kafka topics for a time based on the required data retention policy — for example, in case of transient outages of systems, so that when they come back online, they can consume all of the events they missed while offline.
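Retention is configured per topic. The sketch below creates a topic that keeps events for seven days via the retention.ms setting, using the confluent-kafka AdminClient; the topic name, partition count, and replication factor are placeholders for the example.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed broker address

# Keep events for 7 days so consumers that were offline can catch up.
topic = NewTopic(
    "page-views",
    num_partitions=3,
    replication_factor=1,  # use 3 or more on production clusters
    config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
)

# create_topics() is asynchronous and returns one future per topic.
for name, future in admin.create_topics([topic]).items():
    future.result()  # raises if creation failed (e.g., the topic already exists)
    print(f"created {name}")
```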

Apache Pinot

In 2012, LinkedIn hit another wall. This time the challenge was building a purpose-designed real-time analytics database capable of keeping up with the volume and velocity of data that Apache Kafka-enabled systems could produce. They also wanted to open analytics up to the tens of millions of users of their social media platform designed for professionals. Other in-house systems simply did not fit the bill. By 2013 they had something working internally.

By 2014, they had publicly introduced Pinot, a distributed database designed for real-time analytics. It supports complex, high-dimensionality queries at high concurrency, measured in queries per second (QPS), with latencies measured in milliseconds. It uses a unique kind of index, the star-tree index, to avoid the costly overhead of the traditional materialized views (MVs) and cubing used in other Online Analytical Processing (OLAP) databases.

By 2018, the project was submitted to the Apache Software Foundation, where it entered incubation status. Apache Pinot became an ASF top-level project in 2021.

In 2019 the original engineering team from LinkedIn organized a new company, StarTree (named after the index type), to support the open source Apache Pinot community and to bring to market StarTree Cloud, its fully-managed Database-as-a-Service (DBaaS).

Apache Pinot’s adoption has likewise been broad and swift. Uber added H3 geospatial indexing, allowing the database to provide analytics for location-based services. Other major brands adopted Apache Pinot, such as DoorDash, Stripe, Citibank, Target, Walmart, Slack, Nvidia, and Cisco WebEx.

Just as Apache Kafka created a new domain of real-time data management, so did Apache Pinot. The term “OLAP” has been around since the 1990s, and low-latency “real-time analytics” has existed since the late 2000s, but Apache Pinot also enabled user-facing analytics: real-time queries served at high concurrency, up to hundreds of thousands of QPS.

How does Apache Pinot work?

Apache Pinot runs on a distributed cluster of services that can efficiently ingest, store, index, and query data structured as tables. Data can be ingested from real-time event streaming services such as Apache Kafka, as well as from batch data sources such as Change Data Capture (CDC) updates from Online Transaction Processing (OLTP) databases or static files stored in cloud object stores.
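On the query side, clients send SQL to a Pinot broker. As a rough illustration, the sketch below uses the pinotdb Python client; the host, port, table name, and columns are assumptions made for the example.

```python
from pinotdb import connect

# Connect to a Pinot broker's SQL endpoint (host and port are assumptions).
conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cursor = conn.cursor()

# A typical analytical aggregation over a hypothetical page_views table.
cursor.execute(
    """
    SELECT page, COUNT(*) AS views
    FROM page_views
    WHERE ts >= 1700000000000
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
    """
)
for row in cursor:
    print(row)
```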

Once data is within Apache Pinot, it can be indexed in many ways, including with the highly efficient star-tree index. Data within Apache Pinot is stored by column, which is why it is sometimes referred to as a column-store database. Column stores are more efficient for analytics because a query reads only the columns relevant to it, and the repetitive values found row after row within a column can be compressed efficiently.
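For a sense of what enabling a star-tree index involves, here is a hedged sketch of the relevant fragment of a Pinot table config, written as a Python dict for readability (Pinot itself expects the equivalent JSON). The column names country, browser, and views are made up for the example, and exact options can vary by Pinot version.

```python
# Fragment of a Pinot table config enabling a star-tree index.
star_tree_fragment = {
    "tableIndexConfig": {
        "starTreeIndexConfigs": [
            {
                # Dimensions to pre-aggregate across, in split order.
                "dimensionsSplitOrder": ["country", "browser"],
                "skipStarNodeCreationForDimensions": [],
                # Pre-computed aggregations as function__column pairs.
                "functionColumnPairs": ["SUM__views", "COUNT__views"],
                # Stop splitting once a node covers this many records.
                "maxLeafRecords": 10000,
            }
        ]
    }
}
```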

How do Apache Kafka and Apache Pinot work together?

Apache Pinot includes built-in Kafka connectivity for ingesting data in real time directly from Kafka topics. You can read the documentation for how to set this up. Once configured, real-time events sent into Apache Kafka topics flow directly into Apache Pinot tables.
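In practice, the connection is defined in the streamConfigs section of a real-time table’s configuration. The sketch below shows a typical fragment, again written as a Python dict standing in for the JSON that Pinot expects; the topic name, broker address, and flush threshold are assumptions, and the exact keys can differ between Pinot versions, so treat the documentation as authoritative.

```python
# streamConfigs fragment of a Pinot REALTIME table config for Kafka ingestion.
stream_configs = {
    "streamType": "kafka",
    "stream.kafka.topic.name": "page-views",       # assumed Kafka topic
    "stream.kafka.broker.list": "localhost:9092",  # assumed broker address
    "stream.kafka.consumer.type": "lowlevel",
    "stream.kafka.consumer.factory.class.name":
        "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
    "stream.kafka.decoder.class.name":
        "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
    "realtime.segment.flush.threshold.rows": "1000000",
}
```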

Confluent Cloud and StarTree Cloud

For users who want to run their data infrastructure within fully managed, cloud-native software-as-a-service (SaaS) offerings, Confluent Cloud is powered by Apache Kafka, and StarTree Cloud is powered by Apache Pinot. The two platforms have a direct integration and have been tested and certified to work together.

Case study: Dialpad

Dialpad, for example, powers its customer intelligence platform with StarTree Cloud and Confluent Cloud. Read their Kafka-Pinot case study to see how this combination has enabled real-time insights for call center managers and also drastically reduced data ingestion latencies.

Explore Kafka with Apache Pinot on StarTree Cloud

It’s simple to create a connection to stream real-time events into Pinot from Kafka. The quickest way to explore this connection is with StarTree Cloud. Request a trial to get set up with a managed serverless account on our public SaaS solution, or book a demo if you have questions and would like a quick tour of how StarTree Cloud can work for you.
