
Integrating Delta Lake and Apache Pinot with StarTree Cloud


Vibhuti Bhushan
Released on June 20, 2023

An End-to-End Solution for Real-Time Analytics


In today’s fast-paced, data-driven world, organizations face the challenge of extracting valuable insights from their data in real time. While Delta Lake has gained recognition as a robust platform for data processing and ad-hoc analytics, Apache Pinot excels at delivering real-time analytics with low latency at high queries-per-second (QPS) rates. By combining these two technologies, businesses can build an end-to-end solution that covers all of their analytics needs.

Let’s explore the synergistic relationship between Apache Pinot and Delta Lake, uncovering their individual strengths, the seamless integration that enables real-time analytics while maintaining data consistency, and the wide range of use cases where this integration shines.

What is Delta Lake?

Delta Lake is an open-source storage layer that enhances the reliability, scalability, and performance of data lakes. Technically, it is classified as a “data lakehouse” system, combining the data management and analytical capabilities of a data warehouse with the low-cost, flexible storage of a data lake, and integrating both through a shared metadata, caching, and indexing layer. Delta Lake offers key features such as ACID transactions, schema enforcement, and data versioning. These capabilities empower data engineers and data scientists to build robust data pipelines, ensure data integrity, and enable efficient data processing and analytics.

Delta Lake introduced the concept of a medallion architecture with bronze, silver, and gold tables as a way to organize and manage data within the data lake. Each table serves a specific purpose and plays a crucial role in the data processing and analytics pipeline. In this blog, we will learn how Apache Pinot integrates with the medallion architecture to provide an end-to-end solution for real-time analytics.

A Brief Overview of Medallion Architecture

Data in a Delta Lake lakehouse is organized into three tiers named after sports medals:

  • Bronze Layer: for raw ingestion of data from various sources (e.g., SQL, NoSQL, event streaming, batch data files). This layer provides end-to-end audit capabilities and acts as the source for backfilling, without the need to go back to the original source systems.
  • Silver Layer: for validated, deduplicated data (filtered, cleaned, and augmented; e.g., in third normal form). This layer is designed for data discovery, ad-hoc reporting, and AI-related applications.
  • Gold Layer: for project-specific analytics using curated business tables. This layer contains fully transformed, (usually) denormalized and aggregated data for business use cases and an optimal user experience.

You can read more about the Medallion architecture here.
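
To make the three tiers concrete, here is a minimal sketch in plain Python. It is not Delta Lake or Spark code; the records, field names, and the `to_silver` and `to_gold` helpers are hypothetical and exist only to illustrate how data is refined from one tier to the next.

```python
from collections import defaultdict

# Bronze: raw events exactly as ingested (duplicates and bad rows included).
# All records and field names here are hypothetical.
bronze = [
    {"txn_id": "t1", "customer": "alice", "amount": "120.50"},
    {"txn_id": "t1", "customer": "alice", "amount": "120.50"},  # duplicate
    {"txn_id": "t2", "customer": "bob", "amount": "not-a-number"},  # invalid
    {"txn_id": "t3", "customer": "alice", "amount": "30.00"},
    {"txn_id": "t4", "customer": "bob", "amount": "75.25"},
]


def to_silver(rows):
    """Validate, type, and deduplicate bronze rows by txn_id."""
    seen, silver_rows = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        if row["txn_id"] in seen:
            continue  # drop duplicates
        seen.add(row["txn_id"])
        silver_rows.append(
            {"txn_id": row["txn_id"], "customer": row["customer"], "amount": amount}
        )
    return silver_rows


def to_gold(rows):
    """Aggregate silver rows into a per-customer business table."""
    totals = defaultdict(lambda: {"txn_count": 0, "total_amount": 0.0})
    for row in rows:
        agg = totals[row["customer"]]
        agg["txn_count"] += 1
        agg["total_amount"] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In a real lakehouse each tier would be a Delta table and each step a pipeline job, but the shape of the work is the same: bronze keeps everything, silver keeps only clean rows, and gold keeps only the aggregates a specific use case needs.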

Delta Lake diagram of three-tiered method of data, bronze, silver, and gold

User-Facing Analytics and Its Challenges

While Delta Lake excels at data processing and ad-hoc analytics, it is not the ideal choice for user-facing analytics use cases. User-facing analytics demand near-instantaneous data availability (data “freshness”) and extremely low query latencies, which can be challenging to achieve with batch-oriented systems like Delta Lake. User-facing analytics must also sustain a high rate of queries per second (QPS) while maintaining these extremely low latencies.

For example, a financial services company collects transactional, behavioral, and customer data into Delta Lake and then follows the medallion architecture to create Silver and Gold tables. Even with the Gold tables, the company makes this data available to only a select few analysts because of responsiveness concerns and the cost of allowing more users to query it. The cost of making this data available to all of its customers is prohibitive.

Delta Lake’s underlying architecture is optimized for batch processing and is not designed to meet strict latency requirements, even at low to moderate QPS. Organizations that want to drive user engagement with insights therefore need to augment their data architecture with a technology designed for extremely low latency at very high throughput.

What is Apache Pinot?

Apache Pinot is an open-source, real-time, distributed OLAP database designed to provide low-latency, high-throughput querying on large-scale datasets. It is widely used for real-time analytics and fast data exploration, where low latency at high queries-per-second (QPS) rates is critical. By leveraging columnar storage, pluggable indexing, and distributed processing techniques, Pinot enables businesses to create user-facing analytics applications that serve tens of thousands of users with extremely low-latency queries.

How Does Delta Lake Compare to Apache Pinot?

Comparison of Analytics KPIs between data lakehouse (Delta Lake) and OLAP data store (Apache Pinot)

In the world of Big Data, there are the “Three V’s” of how data is classified — in terms of volume, variety, and velocity. While lakehouse architectures like Delta Lake are designed for handling a high volume and wide variety of data, high velocity use cases — low-latency, high QPS, real-time analytics based on the freshest of data — are better served from a database like Apache Pinot.

Users should not think in terms of an “either-or” false choice. In many cases, it is prudent and necessary for organizations to run both types of systems — just as a highway allows trucks to deliver massive cargo loads in one lane while allowing fast passenger vehicles to zoom by in the next lane over.

Integration between Delta Lake and Apache Pinot

An out-of-the-box integration is available to keep the gold table in Delta Lake in sync with a table in Apache Pinot to provide an end-to-end solution for all analytics needs. Any change in the Delta Lake table is atomically synced with Apache Pinot, ensuring that real-time analytics use cases can be supported seamlessly.

This integration provides the best of both worlds – the reliability, performance, and ease of use of Delta Lake for data processing and ad-hoc analytics, combined with the low-latency, high-throughput query capabilities of Apache Pinot for real-time analytics.
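
As a conceptual illustration of version-based syncing, the sketch below simulates a source table whose every commit produces a new version, and a replica that applies each unseen commit all-or-nothing. This is a toy model under stated assumptions, not the DeltaConnector’s actual implementation; `VersionedTable`, `Replica`, and the operation names are hypothetical.

```python
import copy


class VersionedTable:
    """Toy stand-in for a Delta table: every commit produces a new version."""

    def __init__(self):
        self.commits = []  # commits[i] holds the changes made in version i

    def commit(self, changes):
        self.commits.append(changes)
        return len(self.commits) - 1  # version number of this commit


class Replica:
    """Toy stand-in for the serving table, keyed by primary key."""

    def __init__(self):
        self.rows = {}
        self.synced_version = -1  # nothing synced yet

    def sync(self, source):
        """Apply each unseen commit all-or-nothing, in version order."""
        for version in range(self.synced_version + 1, len(source.commits)):
            staged = copy.deepcopy(self.rows)  # stage changes off to the side
            for op, key, value in source.commits[version]:
                if op == "upsert":
                    staged[key] = value
                elif op == "delete":
                    staged.pop(key, None)
                else:
                    raise ValueError(f"unknown operation: {op}")
            # Swap in the staged state only after the whole commit applied.
            self.rows, self.synced_version = staged, version


gold_table = VersionedTable()
gold_table.commit([("upsert", "alice", 150.5), ("upsert", "bob", 75.25)])
gold_table.commit([("upsert", "alice", 200.0), ("delete", "bob", None)])

replica = Replica()
replica.sync(gold_table)
print(replica.synced_version, replica.rows)
```

Tracking the last synced version is what lets the replica pick up only new commits on each run, and staging a commit before swapping it in means readers never see a half-applied change.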

To learn more about the integration between Apache Pinot and Delta Lake, please read Delta Lake Managed Ingestion with StarTree’s DeltaConnector for Apache Pinot and also check out the documentation here.

Fully-Managed Service Integration

StarTree Cloud, powered by Apache Pinot, is a fully-managed database-as-a-service (DBaaS) designed for real-time OLAP use cases. It provides a no-code web-based data ingestion experience through StarTree Data Manager, allowing users to synchronize data between their Delta Lake and their managed Apache Pinot service with a few clicks.

StarTree Cloud makes for a perfect complement for users of the Databricks Lakehouse Platform, built on Delta Lake, to provide a fully-managed end-to-end experience.

Learn more about StarTree Data Manager.

Reference Architectures

StarTree is designed to help organizations democratize analytics not only internally but also for their customers. As discussed earlier, infrastructure costs become prohibitive as the data is opened up to more users. StarTree Cloud, powered by Apache Pinot, is designed for exactly this use case. Organizations can leverage the power of Apache Pinot and Delta Lake using one of the following two reference architectures:

Reference Architecture 1: Synchronize Silver Table to Pinot and Eliminate Need for Gold table using Star-Tree Index

Synchronize a Silver table in Delta Lake with a table in Apache Pinot in real time and leverage the capabilities of Apache Pinot for user-facing analytics. This solution also helps contain storage sprawl by removing the need to create another copy of the data in Delta Lake in the form of a Gold table. In the medallion architecture, Gold tables are usually precalculated aggregates that improve query latency by avoiding aggregation at query time. These tables are created specifically for applications that require low-latency access, and pre-aggregation ensures there is less data to scan.

Reference architecture diagram - synchronize Silver table to Pinot and eliminate the need for Gold table using Star-Tree Index

Apache Pinot can handle very high QPS and still serve queries with millisecond latency without needing pre-aggregated tables. In addition, Apache Pinot has the star-tree index, which is designed to further optimize query performance for multi-dimensional range and aggregation queries. By using Apache Pinot, users can avoid creating aggregates and storing them in Delta Lake as Gold tables for higher performance.
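
For a sense of what enabling a star-tree index involves, here is a hedged sketch that builds the relevant fragment of a Pinot table config in Python and prints it as JSON. The config keys follow Pinot’s `starTreeIndexConfigs` format; the column names (`country`, `merchant_category`, `customer_id`, `amount`) and the `maxLeafRecords` value are assumptions chosen for illustration, not recommendations.

```python
import json

# Hypothetical Silver-table columns; replace them with your own schema.
star_tree_index = {
    # Dimensions to split on, typically ordered from most to least filtered.
    "dimensionsSplitOrder": ["country", "merchant_category", "customer_id"],
    # Dimensions for which no star node (the "any value" branch) is created.
    "skipStarNodeCreationForDimensions": [],
    # Aggregations to precompute, written as FUNCTION__column.
    "functionColumnPairs": ["COUNT__*", "SUM__amount"],
    # Upper bound on records scanned at a leaf; an illustrative value.
    "maxLeafRecords": 10000,
}

config = {"tableIndexConfig": {"starTreeIndexConfigs": [star_tree_index]}}
print(json.dumps(config, indent=2))
```

With a config along these lines, a query that filters or groups by the listed dimensions and asks for `COUNT(*)` or `SUM(amount)` can be answered from the precomputed tree instead of scanning raw rows, which is the role a Gold table would otherwise play.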

To learn why the star-tree index is so fast, please read Star-Tree Indexes in Apache Pinot – Part 1: Understanding the Impact on Query Performance

Reference Architecture 2: Pinot as Real-Time Cache for Delta Lake

In this architecture, Pinot acts as a real-time cache or replica of the Gold table. As the tables are updated in Delta Lake, the changes are continuously synced to Apache Pinot so that it stays up to date. StarTree Cloud provides a comprehensive, fully automated backfilling solution to update the tables in Apache Pinot if there are any changes in the data pipeline.

Reference architecture of Apache Pinot as a real-time cache for Delta Lake

Users can now rely on tables in Pinot to serve user-facing analytics applications and other applications that require low latency at very high concurrency. AI/ML workloads continue to run on the Silver table, because the Gold table does not contain all of the raw data such workloads need. This architecture also allows users to plug Apache Pinot into their existing data pipeline without any disruption and to enable applications that require extremely low latency at very high throughput.
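
As a rough sketch of how an application might query the synced table, the example below builds a request for the Pinot broker’s SQL endpoint (`POST /query/sql`) using only the Python standard library. The broker URL, table name, and column names are hypothetical, and a managed deployment such as StarTree Cloud would also need authentication headers, which are omitted here.

```python
import json
import urllib.request


def build_query_request(broker_url, sql):
    """Build a POST request for the Pinot broker's /query/sql endpoint."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/query/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(broker_url, sql, timeout=5.0):
    """Send the query and return the rows from the broker's resultTable."""
    request = build_query_request(broker_url, sql)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("resultTable", {}).get("rows", [])


# Hypothetical table and column names, for illustration only.
SQL = (
    "SELECT merchant_category, SUM(amount) AS total_spend "
    "FROM transactions WHERE customer_id = 'c-42' "
    "GROUP BY merchant_category ORDER BY total_spend DESC LIMIT 5"
)

request = build_query_request("http://localhost:8099", SQL)
print(request.full_url)
```

Calling `run_query("http://localhost:8099", SQL)` against a reachable broker would return the result rows. Keeping request construction separate from sending makes the query logic easy to check without a live cluster.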

Summary

Combining Apache Pinot and Delta Lake creates a powerful and comprehensive analytics solution that covers both batch processing and real-time analytics use cases. While Delta Lake excels at data processing and ad-hoc analytics, Apache Pinot provides the low-latency, high-QPS capabilities required for real-time analytics. The integration between these platforms keeps data consistent and available, allowing businesses to serve real-time analytics while leveraging the strengths of both technologies. Whether it’s IoT, recommendation engines, or other real-time analytics use cases, the combination of Apache Pinot and Delta Lake helps businesses unlock valuable insights from their data in real time so they can make data-driven decisions and gain a competitive edge in today’s dynamic market.

Next Steps

If what you’ve read above makes you eager to integrate StarTree Cloud with your own Delta Lake implementation, now’s your chance! Sign up for a 30-day free trial. Also, make sure to bring your questions to our Slack community.
