Rich Indexing Drives Apache Pinot’s Unprecedented Speed

Written by StarTree | June 01, 2022 | 4 minutes read

Apache Pinot calls on a powerful combination of indexes to drive real-time analytics use cases at scale with low query latency, high throughput, and unparalleled efficiency.

The emergence of real-time analytics has redefined speed in the context of data processing. Where fast batch reporting once measured time in seconds and minutes, real-time analytics use cases operate on a drastically shorter time frame — with query latencies required in milliseconds — and with vastly greater throughput. This speed is mandatory, as breaking query latency SLAs leads to unwanted consequences: lower user engagement, poor session time, and inevitable churn.

Speed is Non-Negotiable, and Rich Indexing Makes it Happen


Capable of powering hundreds of thousands of queries per second with query latencies in the tens of milliseconds, Apache Pinot is a distributed analytics datastore purpose-built to deliver real-time speed. To accomplish this feat across a diversity of use cases, Pinot leverages a wide variety of powerful indexes, any of which can be added to a table configuration on the fly. To get a better sense of how Pinot’s stable of indexes powers query efficiency, let’s examine each of them in turn and discuss their impact on various real-time analytics use cases.


1. Inverted Index


Apache Pinot’s inverted index maps each column value to the rows in which it appears, which means that when an indexed column value appears in a query predicate, the inverted index can immediately identify the matching rows. That results in dramatically faster query performance because far less of the table has to be scanned. This ability to find matching rows directly, regardless of how the data is ordered, makes inverted indexes very effective for use cases like user-facing analytics, business metrics, root cause analysis, and dashboarding.
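To make the value-to-rows mapping concrete, here is a minimal Python sketch. It is an illustration only, not Pinot's implementation: the column data, the function names, and the use of plain lists of row ids (where a real engine would use a compressed structure) are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(column):
    """Map each distinct value to the ascending list of row ids where it occurs."""
    index = defaultdict(list)
    for row_id, value in enumerate(column):
        index[value].append(row_id)
    return dict(index)


def rows_where_equals(index, value):
    """Answer `WHERE col = value` with one dictionary lookup instead of a scan."""
    return index.get(value, [])


# Hypothetical column of six rows.
country = ["US", "DE", "US", "IN", "DE", "US"]
country_index = build_inverted_index(country)

print(rows_where_equals(country_index, "US"))  # [0, 2, 5]
```

The predicate is resolved from the index alone; the column itself is never scanned at query time.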


2. Sorted Index


Pinot can apply a sorted index to one column within a table and then capture the start and end location pointers for each column value. Because all rows for a given value sit next to each other, a lookup touches only that slice of the table, which, as with the inverted index, slashes the amount of data scanned and thus query processing times. It’s particularly useful for user-facing analytics and personalization use cases, which tend to include a primary value (a member ID, company ID, or job ID) in their query predicates.
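The start and end pointers can be sketched in a few lines of Python. This is a simplified illustration under the assumption that the column is already sorted; the `member_id` data and the function names are hypothetical and do not reflect Pinot's internal layout.

```python
def build_sorted_index(sorted_column):
    """Return {value: (start_row, end_row)} for a column that is already sorted."""
    ranges = {}
    for row_id, value in enumerate(sorted_column):
        if value in ranges:
            start, _ = ranges[value]
            ranges[value] = (start, row_id)  # extend the run to this row
        else:
            ranges[value] = (row_id, row_id)  # first row of a new run
    return ranges


def row_range(index, value):
    """Inclusive (start, end) rows for a value, or None if it is absent."""
    return index.get(value)


# Hypothetical sorted column: all rows for one member are contiguous.
member_id = [101, 101, 101, 205, 205, 317]
sorted_index = build_sorted_index(member_id)

print(row_range(sorted_index, 205))  # (3, 4)
```

Only two integers are stored per distinct value, which is why this kind of index is so compact compared with a list of row ids per value.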


3. Range Index


A variant of Pinot’s inverted index, the range index accelerates queries that filter on range predicates (for example, values greater than, less than, or between given bounds), particularly in columns with a large number of unique values. This is especially useful for anomaly detection, root cause analysis, and visualization dashboards.
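One way to picture a range index is to group values into buckets and keep a list of row ids per bucket, so that a range predicate only has to inspect individual values in the two boundary buckets. The Python sketch below assumes fixed-width buckets and integer values; the bucketing scheme, the `latency_ms` data, and the function names are illustrative assumptions, not a description of Pinot's actual on-disk format.

```python
from collections import defaultdict


def build_range_index(column, bucket_width):
    """Group row ids by value bucket: bucket b holds values in [b*w, (b+1)*w)."""
    buckets = defaultdict(list)
    for row_id, value in enumerate(column):
        buckets[value // bucket_width].append(row_id)
    return dict(buckets)


def rows_in_range(column, buckets, bucket_width, low, high):
    """Row ids with low <= value <= high."""
    first, last = low // bucket_width, high // bucket_width
    result = []
    for b in range(first, last + 1):
        for row_id in buckets.get(b, []):
            if b in (first, last):
                # Boundary buckets are only partly covered: check the value.
                if low <= column[row_id] <= high:
                    result.append(row_id)
            else:
                # Interior buckets are fully covered: take every row as-is.
                result.append(row_id)
    return sorted(result)


# Hypothetical high-cardinality column.
latency_ms = [12, 250, 47, 980, 105, 33, 410]
latency_index = build_range_index(latency_ms, bucket_width=100)

print(rows_in_range(latency_ms, latency_index, 100, 40, 300))  # [1, 2, 4]
```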


4. JSON Index


Apache Pinot supports storage of unstructured JSON data and provides a JSON index to dramatically accelerate value lookup and filtering of that data. In essence, with Pinot, users save cycles by eliminating the need for ingestion-time or query-time transformations of JSON data because they already have an index of all JSON fields. If you’re working with JSON data, this can be a game-changer.
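The idea of indexing every JSON field can be sketched by flattening each document into (path, value) pairs and keeping an inverted index over those pairs. The path notation, the sample documents, and the function names in this Python sketch are assumptions chosen for illustration; it is not Pinot's implementation.

```python
from collections import defaultdict


def flatten(doc, prefix="$"):
    """Yield (path, value) pairs for every leaf in nested dicts and lists."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, f"{prefix}.{key}")
    elif isinstance(doc, list):
        for item in doc:
            yield from flatten(item, f"{prefix}[*]")
    else:
        yield prefix, doc


def build_json_index(docs):
    """Map each (path, value) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for path, value in flatten(doc):
            index[(path, value)].add(doc_id)
    return index


def docs_matching(index, path, value):
    """Document ids where the given path holds the given value."""
    return sorted(index.get((path, value), set()))


# Hypothetical JSON column with nested objects and arrays.
events = [
    {"user": {"name": "ana", "tags": ["admin", "beta"]}},
    {"user": {"name": "bo", "tags": ["beta"]}},
    {"user": {"name": "cy", "tags": []}},
]
json_index = build_json_index(events)

print(docs_matching(json_index, "$.user.tags[*]", "beta"))  # [0, 1]
```

Because every nested field is indexed up front, a filter on a deeply nested value needs no parsing of the raw JSON at query time.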


5. Text Index


Another method of quickly searching unstructured data, the Pinot text index enables regex-based text searches or fuzzy text searches. The result is fast query processing for a variety of text search categories, including term or phrase search, prefix query search, regular expression query, and so on. Both text and JSON indexes are crucial to maintain query speed in use cases like user-facing analytics, personalization, log analytics, text search, and ad hoc analytics that deal with unpredictable data structures.
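As a rough illustration of how a text index supports term and prefix search, the Python sketch below tokenizes each document and keeps an inverted index from token to document ids. The tokenizer, the sample log lines, and the function names are assumptions for this example; production text indexes handle phrases, regular expressions, and fuzzy matching with far more sophisticated structures.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_text_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def term_search(index, term):
    """Documents containing the exact term (case-insensitive)."""
    return sorted(index.get(term.lower(), set()))


def prefix_search(index, prefix):
    """Documents containing any token that starts with the prefix."""
    prefix = prefix.lower()
    hits = set()
    for token, doc_ids in index.items():
        if token.startswith(prefix):
            hits |= doc_ids
    return sorted(hits)


# Hypothetical log lines.
logs = [
    "Connection timeout on node-7",
    "Disk usage warning",
    "Connection reset by peer",
    "Timeout while connecting to broker",
]
text_index = build_text_index(logs)

print(term_search(text_index, "timeout"))    # [0, 3]
print(prefix_search(text_index, "connect"))  # [0, 2, 3]
```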


6. GeoSpatial Index


To accelerate the many use cases that include geo-spatial queries, Pinot offers a geo-spatial index to filter records with specified locations based on latitude and longitude values. This means queries that filter geo-locations within a radius of a specific point now run as much as 20x faster. Moreover, Pinot’s geo-spatial index enables applications to render complex geo-spatial visualizations, such as scatter plots and world maps, with low latency and without additional application side data pre-processing or post-processing overhead.
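A radius filter of this kind can be sketched with a coarse spatial grid: points are bucketed into cells, a query inspects only the cells near the center point, and an exact distance check is applied to the few candidates that remain. Pinot's own geospatial index is built on the H3 hexagonal grid; the Python sketch below swaps in a simpler square latitude/longitude grid, and its cell size, sample coordinates, and function names are assumptions for illustration. The bounding-box math is approximate and intended only for small radii away from the poles and the antimeridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0
KM_PER_DEGREE_LAT = 111.0  # rough figure, slightly low so the box errs large


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cell_of(lat, lon, cell_deg):
    """Square grid cell containing the point."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def build_geo_index(points, cell_deg):
    """Map each grid cell to the row ids of the points inside it."""
    index = defaultdict(list)
    for row_id, (lat, lon) in enumerate(points):
        index[cell_of(lat, lon, cell_deg)].append(row_id)
    return index


def rows_within(points, index, cell_deg, lat, lon, radius_km):
    """Row ids of points within radius_km of (lat, lon)."""
    dlat = radius_km / KM_PER_DEGREE_LAT
    dlon = radius_km / (KM_PER_DEGREE_LAT * max(math.cos(math.radians(lat)), 1e-6))
    lo = cell_of(lat - dlat, lon - dlon, cell_deg)
    hi = cell_of(lat + dlat, lon + dlon, cell_deg)
    hits = []
    for ci in range(lo[0], hi[0] + 1):
        for cj in range(lo[1], hi[1] + 1):
            for row_id in index.get((ci, cj), []):
                plat, plon = points[row_id]
                # Exact check only for candidates in nearby cells.
                if haversine_km(lat, lon, plat, plon) <= radius_km:
                    hits.append(row_id)
    return sorted(hits)


# Hypothetical points: San Francisco, Oakland, San Jose, Los Angeles.
places = [
    (37.7749, -122.4194),
    (37.8044, -122.2712),
    (37.3382, -121.8863),
    (34.0522, -118.2437),
]
geo_index = build_geo_index(places, cell_deg=0.1)

print(rows_within(places, geo_index, 0.1, 37.7749, -122.4194, 20.0))  # [0, 1]
```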


7. Star-Tree Index


All of the above indexes assist in speeding query performance, but they can’t deliver a hard upper bound on query latencies for those use cases that need one, like anomaly detection, root cause analysis, and user-facing analytics. Apache Pinot provides that capability via its star-tree index. Utilizing intelligent, configurable filtering and pre-aggregation, the star-tree index drives low query latency, high throughput, and efficient storage consumption. Because the precise balance between storage space and query latency can be set per use case, the star-tree index enables users to enforce a hard upper bound on query latency.
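The pre-aggregation idea can be sketched in Python by storing a running total for every combination of dimension values, where any dimension may be collapsed to a "star" wildcard. This is a deliberately simplified stand-in: it materializes every combination in a flat dictionary rather than building a tree, and it omits the configurable trade-off between storage and latency described above. The dimension names, the sample rows, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product

STAR = "*"  # wildcard meaning "all values of this dimension"


def build_preaggregates(rows, dimensions, metric):
    """Pre-compute SUM(metric) for every value-or-star combination of dimensions."""
    totals = defaultdict(int)
    for row in rows:
        values = [row[d] for d in dimensions]
        # Each dimension either keeps its value or collapses to STAR.
        for mask in product((False, True), repeat=len(dimensions)):
            key = tuple(STAR if starred else v for v, starred in zip(values, mask))
            totals[key] += row[metric]
    return dict(totals)


def query_sum(totals, dimensions, filters):
    """Answer SUM(metric) with equality filters via a single lookup."""
    key = tuple(filters.get(d, STAR) for d in dimensions)
    return totals.get(key, 0)


# Hypothetical fact rows.
DIMENSIONS = ["country", "browser"]
impressions = [
    {"country": "US", "browser": "chrome", "impressions": 10},
    {"country": "US", "browser": "safari", "impressions": 5},
    {"country": "DE", "browser": "chrome", "impressions": 7},
    {"country": "US", "browser": "chrome", "impressions": 3},
]
preagg = build_preaggregates(impressions, DIMENSIONS, "impressions")

print(query_sum(preagg, DIMENSIONS, {"country": "US"}))  # 18
```

In this sketch the cost of a query no longer depends on the number of raw rows, which is the intuition behind a bounded query latency; the price is the extra storage for the pre-computed totals.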


There’s More Index Goodness to Come


Indexes are a critical element in the blazing-fast speed and high throughput that make Apache Pinot ideal for so many real-time analytics use cases at scale, and users can look forward to additional indexes that further expand Pinot’s capabilities in the future. Want to learn more? Talk to us about leveraging real-time analytics in your organization.
