Integrating Data Streaming with Real-Time, User-Facing Analytics
Apache Kafka is the world's most popular open-source real-time data streaming technology. Apache Pinot™ is the most powerful open-source real-time, user-facing analytics platform. The two systems are often deployed side by side at large enterprises as foundational components of modern data architectures.
Data streaming is the technology by which data changes can be sent and received in real time between disparate data systems: Online Transactional Processing (OLTP) databases (both SQL and NoSQL), stream processing systems, Artificial Intelligence and Machine Learning (AI/ML) systems, and data warehouses, data lakes, or data lakehouses.
Real-time analytics is a computer science discipline wherein massive amounts of data generated in a relatively short time need to be ingested, stored, and indexed, followed by additional processes that can search, filter, aggregate, and process that stored data against specific queries to produce results. Real-time analytics includes both the process (including related computing systems and methods) and the resulting products (data sets, query results, downstream data flows, and other outputs).
Real-time analytics is a domain of Online Analytical Processing (OLAP). While traditional OLAP databases use batch processing, real-time analytics can keep up with ingesting data from data streaming systems that can transmit millions of events per second.
Apache Kafka runs on a distributed cluster of servers that can efficiently ingest, store, and pass along data structured as events — messages sent at a specific time. Related messages are grouped into topics, into which external systems can contribute updates as producers (data sources), and from which other systems can subscribe as consumers (data sinks).
Events can be stored within Kafka topics for a time based on the required data retention policy — for example, in case of transient outages of systems, so that when they come back online, they can consume all of the events they missed while offline.
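The interplay of topics, offsets, and retention can be sketched with a small in-memory model. This is a toy illustration only, not Kafka's client API: the `ToyTopic` class, its method names, and the 60-second retention window are all hypothetical and chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    offset: int       # position of the event in the topic's log
    timestamp: float  # seconds; supplied by the producer in this toy model
    value: str


@dataclass
class ToyTopic:
    """Append-only log with time-based retention (a toy model, not Kafka's API)."""
    retention_seconds: float
    events: list = field(default_factory=list)
    next_offset: int = 0

    def produce(self, value, timestamp):
        # Producers append events; each event gets the next offset.
        self.events.append(Event(self.next_offset, timestamp, value))
        self.next_offset += 1

    def expire(self, now):
        # Drop events older than the retention window.
        cutoff = now - self.retention_seconds
        self.events = [e for e in self.events if e.timestamp >= cutoff]

    def consume(self, from_offset):
        # Consumers read everything at or after the offset they ask for.
        return [e for e in self.events if e.offset >= from_offset]


topic = ToyTopic(retention_seconds=60)
topic.produce("order-1", timestamp=0)
topic.produce("order-2", timestamp=10)

# The consumer reads both events and remembers where to resume.
consumer_offset = 0
batch = topic.consume(consumer_offset)
consumer_offset = batch[-1].offset + 1

# The consumer goes offline while producers keep writing.
topic.produce("order-3", timestamp=30)
topic.produce("order-4", timestamp=50)
topic.expire(now=65)  # "order-1" is now older than 60 seconds and is dropped

# Back online: the consumer resumes from its saved offset and catches up.
missed = [e.value for e in topic.consume(consumer_offset)]
print(missed)  # ['order-3', 'order-4']
```

The consumer tracks only its own offset, so several consumers can read the same topic independently, and an outage shorter than the retention window loses nothing.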
Apache Pinot runs on a distributed cluster of services that can efficiently ingest, store, index, and query data structured as tables. Data can be ingested from real-time event streaming services such as Apache Kafka, as well as from Change Data Capture (CDC) updates from OLTP databases and from batch sources such as static files stored in cloud object stores.
Once data is within Apache Pinot, it can be indexed in many ways, including with the highly efficient star-tree index. Apache Pinot stores data by column, a design often referred to as a column store. Column stores are more efficient for analytics because a query does not need to read columns that are irrelevant to it, and repetitive values that recur row after row within a column can be compressed efficiently.
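A minimal sketch of why a columnar layout helps, using made-up sample data. Run-length encoding is used here only as a simple stand-in for compression of repetitive values; it is not a claim about which encodings Pinot actually applies.

```python
from itertools import groupby

# Hypothetical sample rows, as a row-oriented store would hold them.
rows = [
    {"country": "US", "device": "ios", "clicks": 3},
    {"country": "US", "device": "ios", "clicks": 5},
    {"country": "US", "device": "android", "clicks": 2},
    {"country": "DE", "device": "android", "clicks": 7},
]

# Columnar layout: one list per column instead of one dict per row.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# An aggregation over "clicks" touches only that one column.
total_clicks = sum(columns["clicks"])

# Repeated values sitting next to each other compress well.
encoded_country = run_length_encode(columns["country"])

print(total_clicks)     # 17
print(encoded_country)  # [('US', 3), ('DE', 1)]
```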
Apache Pinot includes an Apache Kafka sink connector for real-time data ingestion directly from Kafka topics. You can read the documentation for how to set this up. Once configured, real-time events sent into Apache Kafka topics will go directly into Apache Pinot tables.
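As a rough sketch of what such a setup might involve, the snippet below builds the stream-related part of a real-time table configuration as a Python dictionary and prints it as JSON. Treat every detail as an assumption to verify against the Pinot documentation for your version: the property keys and plugin class names change between releases, and the table name, topic name, broker address, and `ts` time column are hypothetical.

```python
import json


def build_realtime_table_config(table_name, topic, brokers):
    """Return a hypothetical Pinot real-time table config reading from Kafka.

    Keys and class names are assumptions; check them against the
    documentation for the Pinot version you run.
    """
    return {
        "tableName": table_name,
        "tableType": "REALTIME",
        "segmentsConfig": {
            "timeColumnName": "ts",  # hypothetical time column
            "schemaName": table_name,
            "replication": "1",
        },
        "tenants": {},
        "tableIndexConfig": {
            "loadMode": "MMAP",
            "streamConfigs": {
                "streamType": "kafka",
                "stream.kafka.topic.name": topic,
                "stream.kafka.broker.list": brokers,
                "stream.kafka.consumer.type": "lowlevel",
                # Plugin class names vary by Pinot and Kafka version.
                "stream.kafka.consumer.factory.class.name":
                    "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
                "stream.kafka.decoder.class.name":
                    "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
                "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
            },
        },
        "metadata": {},
    }


config = build_realtime_table_config("orders", "orders-topic", "localhost:9092")
print(json.dumps(config, indent=2))
```

The resulting JSON would then be submitted to the Pinot controller together with a matching schema; the exact upload step depends on your deployment.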
Both Apache Kafka and Apache Pinot have a similar origin story: birthed within the halls of LinkedIn, fostered through the open source community of the Apache Software Foundation, and now both supported by companies dedicated to their technical evolution and user success.
In 2008, LinkedIn hit a wall. They couldn’t move data from their transactional systems to their analytics system fast enough. So a project began within LinkedIn to aggregate logs and distribute data across disparate internal systems.
In June 2011, “Project Kafka” was introduced and open sourced under the Apache 2.0 license. (Read the original whitepaper presented at NetDB ’11.) By the following month, it had entered incubation at the Apache Software Foundation (ASF) and was thereafter known as Apache Kafka. It graduated from incubator status in October 2012.
In 2014, Confluent was founded by Kafka’s original development team to both shepherd the open source Apache Kafka community and to provide advanced capabilities through its commercial offering.
Apache Kafka’s adoption across data-intensive industries was broad and swift, even before it hit its 1.0 release milestone in 2017. As it was adopted, a new term was needed to describe this type of service within modern data architectures: event streaming, a term that is now shifting toward the newer “data streaming” technology category. Today, over 80% of Fortune 100 companies use Apache Kafka.
In 2012, LinkedIn hit another wall. This time the obstacle was the lack of a purpose-built real-time analytics database capable of keeping up with the volume and velocity of data that Apache Kafka-enabled systems could produce. They also wanted to open their analytics to the tens of millions of users of their professional social media platform. Other in-house systems simply did not fit the bill. By 2013 they had something working internally.
By 2014, they had publicly introduced Pinot, a distributed database designed for real-time analytics. It supports complex, high-dimensional queries at high concurrency, measured in queries per second (QPS), with latencies measured in milliseconds. It uses a distinctive kind of indexing, the star-tree index, to avoid the costly overhead of traditional materialized views (MVs) and cubing used in other Online Analytical Processing (OLAP) databases.
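The general idea behind a star-tree can be illustrated with a simplified toy: at each level the tree splits on one dimension and also keeps a pre-aggregated "star" branch that skips that dimension, and splitting stops once a node holds few enough records to scan directly. This sketch is not Pinot's implementation; the function names, the `max_leaf_records` parameter, and the sample data are assumptions for illustration.

```python
from collections import defaultdict

STAR = "*"  # key of the pre-aggregated "any value" branch


def collapse(records, dims):
    """Merge records that agree on `dims`, summing their metric."""
    merged = defaultdict(int)
    for dim_values, metric in records:
        merged[tuple(dim_values[d] for d in dims)] += metric
    return [(dict(zip(dims, key)), total) for key, total in merged.items()]


def build(records, dims, max_leaf_records):
    """Build a toy star-tree; each record is (dimension_dict, metric)."""
    if not dims or len(records) <= max_leaf_records:
        return {"records": records}  # small enough to scan at query time
    dim, rest = dims[0], dims[1:]
    groups = defaultdict(list)
    for record in records:
        groups[record[0][dim]].append(record)
    children = {v: build(g, rest, max_leaf_records) for v, g in groups.items()}
    # Star branch: drop `dim` and pre-aggregate over all of its values.
    children[STAR] = build(collapse(records, rest), rest, max_leaf_records)
    return {"dim": dim, "children": children}


def query_sum(node, filters):
    """Sum the metric over records matching equality filters {dim: value}."""
    if "children" not in node:
        # Dimensions missing from a record were already resolved on the path.
        return sum(
            metric for dim_values, metric in node["records"]
            if all(dim_values[k] == v for k, v in filters.items() if k in dim_values)
        )
    child = node["children"].get(filters.get(node["dim"], STAR))
    return query_sum(child, filters) if child else 0


records = [
    ({"country": "US", "device": "ios"}, 3),
    ({"country": "US", "device": "android"}, 2),
    ({"country": "DE", "device": "ios"}, 5),
    ({"country": "DE", "device": "android"}, 7),
    ({"country": "US", "device": "ios"}, 4),
]
tree = build(records, ["country", "device"], max_leaf_records=1)

print(query_sum(tree, {}))                  # 21
print(query_sum(tree, {"device": "ios"}))   # 12
print(query_sum(tree, {"country": "US"}))   # 9
```

A query that leaves a dimension unfiltered follows the star branch and reads an aggregate computed at build time, so the work per query is bounded by the leaf size rather than by the number of raw rows.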
In 2019 the original engineering team from LinkedIn organized a new company, StarTree (named after the index type), to support the open source Apache Pinot community and to bring to market StarTree Cloud, its fully-managed Database-as-a-Service (DBaaS).
Apache Pinot’s adoption has likewise been broad and swift. Uber added H3 geospatial indexing, allowing the database to provide analytics for location-based services. Other major brands adopted Apache Pinot, such as DoorDash, Stripe, Citibank, Target, Walmart, Slack, Nvidia, and Cisco WebEx.
Just as Apache Kafka created a new domain of real-time data management, so did Apache Pinot. The term “OLAP” has been around since the 1990s, and low-latency “real-time analytics” has existed since the late 2000s. What Apache Pinot added was user-facing analytics: running real-time queries at high concurrency, on the scale of hundreds of thousands of QPS.
For users who want to run their data infrastructure within fully managed cloud-native software-as-a-service (SaaS) offerings, Confluent Cloud is powered by Apache Kafka, and StarTree Cloud is powered by Apache Pinot. The two systems have a direct integration and have been tested and certified to work together.
Dialpad, for example, is powering its customer intelligence platform with StarTree Cloud and Confluent Cloud. Read their Case Study to find out how this powerful duo has not only enabled real-time insights for call center managers but also drastically reduced data ingestion latencies.
If you want to get started with real-time analytics and are already a Confluent Cloud user, sign up for a free trial of StarTree Cloud.