Sub-Second Analytics on Delta Lake – Without the Workarounds

StarTree Delta Tables Now Available in Private Preview

Delta Lake is the format of choice for teams running Databricks at scale. But one class of analytic workloads has never quite been at home in the data lakehouse: the ones with SLAs.

When an on-call engineer is waiting five seconds for an observability dashboard to load during an outage, every second is lost signal. When a merchant opens a fintech app expecting to see their data instantly and waits two seconds, that delay doesn’t read as a technical limitation. It reads as a poor product. When an analyst clicks a filter and waits eight seconds, they stop exploring. These aren’t performance complaints. They are operational failures, and they happen because the query engines sitting on top of Delta were not designed for sub-second, high-concurrency analytical workloads.

The consequence is that these workloads have historically lived outside the lakehouse entirely. Teams duplicate data into purpose-built infrastructure and manage two pipelines where one should suffice.

Earlier this year, we announced general availability of StarTree Iceberg Tables, bringing index-driven, sub-second query performance directly to Apache Iceberg without data movement or materialization. Today, we’re continuing that commitment to open table formats with the private preview of StarTree Delta Tables, including full integration with Unity Catalog. It brings the same engineering innovation to Delta: precise, page-level Parquet fetching driven by a rich index layer, so SLA-driven workloads can finally run where their data already lives.

What’s New

StarTree Delta Tables work the same way as our Iceberg support: instead of relying on scan-heavy query execution, StarTree applies Apache Pinot’s mature indexing layer directly on Delta Lake data, resolving queries to the exact records and locations needed.

Unity Catalog integration means StarTree can discover and govern Delta tables through the same catalog your Databricks teams already use. No duplication of metadata, no manual table registration, no separate governance layer.

Like other query engines, StarTree applies partitioning and pruning to narrow down the data that needs to be read. But that is where the similarity ends.

Most engines stop there, falling back to scanning entire Parquet files or column chunks to retrieve the actual results. StarTree goes further by fetching at the page level within a Parquet file, accessing only the precise bytes needed to answer the query. This precision is made possible by a rich indexing layer applied directly on Delta Lake data:

  • Inverted indexes for exact-match and multi-value filtering
  • Range indexes for numerical and time-based filters
  • Sorted indexes for ordered access patterns
  • Bloom filters for rapid existence checks
  • JSON indexes for semi-structured and nested data
  • Full-text (Lucene) indexes for keyword and phrase search
  • Vector indexes for similarity search and AI-powered retrieval
  • Geospatial indexes (H3) for location-based queries

These indexes resolve queries to specific pages before a single byte of Parquet is read, which is what makes sub-second latency achievable at scale, while simultaneously reducing costs by 40% or more.
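
To make the idea concrete, here is a minimal Python sketch of how an inverted index can turn a filter into a short list of pages to fetch. It is a toy model, not StarTree’s or Apache Pinot’s implementation: the page size, the `build_inverted_index` and `pages_for_rows` helpers, and the sample column are all assumptions made for illustration.

```python
from collections import defaultdict

ROWS_PER_PAGE = 4  # assumed page size, kept tiny for illustration


def build_inverted_index(values):
    """Map each distinct value to the sorted list of row ids that hold it."""
    index = defaultdict(list)
    for row_id, value in enumerate(values):
        index[value].append(row_id)
    return index


def pages_for_rows(row_ids, rows_per_page=ROWS_PER_PAGE):
    """Translate matching row ids into the sorted set of pages containing them."""
    return sorted({row_id // rows_per_page for row_id in row_ids})


# Hypothetical column: a merchant id for 16 rows, stored as 4 pages of 4 rows.
merchant_column = ["m1", "m2", "m1", "m3",
                   "m2", "m2", "m3", "m3",
                   "m3", "m2", "m3", "m2",
                   "m1", "m3", "m2", "m3"]

index = build_inverted_index(merchant_column)
matching_rows = index["m1"]                     # resolved from the index alone
pages_to_fetch = pages_for_rows(matching_rows)  # only these pages need I/O
total_pages = len(merchant_column) // ROWS_PER_PAGE

print(f"rows {matching_rows} -> fetch pages {pages_to_fetch} of {total_pages}")
```

In this toy layout the filter `merchant = 'm1'` touches two of four pages, and that decision is made before any column data is read. A real engine would combine several index types and map pages to byte ranges, which this sketch leaves out.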

In practice, this means Delta Lake data can be queried directly without materialization, without duplication, and without full reprocessing, while still meeting the demands of SLA-driven workloads requiring sub-second latency, high concurrency, and predictable cost.

A Complement to Databricks, Not a Replacement

Databricks and its ecosystem cover a remarkable amount of ground. Spark and Photon handle large-scale transformation, ML training, ad hoc exploration, and batch processing. Neon brings serverless Postgres to the lakehouse for operational and transactional workloads. For most of what a modern data platform needs to do, Databricks has an answer.

The gap is SLA-driven analytical queries: high-concurrency, high-cardinality workloads that must return in under a second, across unpredictable access patterns, with no tolerance for the variance that comes from scan-based execution. Databricks was not designed with those constraints in mind, and that is not a criticism. It is simply a different design point.

StarTree fills that gap. Rather than replacing Databricks workflows, StarTree sits alongside them, enabling the SLA-driven analytical workloads that currently live outside the lakehouse to move back in. The same Delta tables that Databricks reads for transformation and exploration become the source for sub-second analytical queries in StarTree, governed through the same Unity Catalog, without any data duplication.

From Scan-Heavy to Precise Fetch

Most Delta-backed query engines remain fundamentally scan-heavy, even when optimized.

Techniques like partition pruning, metadata filtering, and bloom filters reduce how much data is scanned, but they don’t eliminate the underlying problem. Queries still read more data files than are necessary to answer the question. As datasets grow, the consequences compound:

  • I/O scales with table size
  • Query cost scales with frequency and concurrency
  • Latency becomes sensitive to how much data must be read on each execution

Materialized views can mask this cost, but they introduce their own tradeoffs: data staleness between refreshes, storage overhead from duplicated results, and operational complexity managing refresh pipelines.

StarTree removes the problem rather than working around it. Execution is index-driven at its core:

  • Only the relevant pages within a Parquet file are accessed directly
  • I/O stays low as datasets grow
  • Performance holds across repeated queries and new access patterns alike

For many teams, this is where 40 to 50%+ cost reduction becomes achievable, not through optimization, but by eliminating unnecessary data access entirely.

This is made possible by applying Apache Pinot’s rich indexing ecosystem to Delta Lake data, so that StarTree fetches only the Parquet pages a query needs rather than whole files.
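
The scaling argument can be illustrated with a back-of-the-envelope model. The Python sketch below compares bytes read when every file that survives pruning is scanned in full against bytes read when only matching pages are fetched. Every number in it (file size, page size, pruning rate, number of matching pages) is an assumption chosen for illustration, not a measurement, and it deliberately holds the number of matching pages constant as the table grows.

```python
def scan_bytes(num_files, file_bytes, pruned_percent):
    """Bytes read when every file that survives pruning is scanned in full."""
    surviving_files = num_files * (100 - pruned_percent) // 100
    return surviving_files * file_bytes


def page_fetch_bytes(matching_pages, page_bytes):
    """Bytes read when only the pages holding matching rows are fetched."""
    return matching_pages * page_bytes


FILE_BYTES = 128 * 1024 * 1024  # assumed 128 MiB Parquet files
PAGE_BYTES = 1024 * 1024        # assumed 1 MiB pages
PRUNED_PERCENT = 90             # assume pruning eliminates 90% of files
MATCHING_PAGES = 50             # assume the answer lives in 50 pages

for num_files in (1_000, 10_000, 100_000):
    scanned = scan_bytes(num_files, FILE_BYTES, PRUNED_PERCENT)
    fetched = page_fetch_bytes(MATCHING_PAGES, PAGE_BYTES)
    print(f"{num_files:>7} files: scan {scanned / 2**30:8.1f} GiB, "
          f"page fetch {fetched / 2**20:5.1f} MiB")
```

Under these assumptions, scan I/O grows tenfold each time the table grows tenfold, even with aggressive pruning, while page-level I/O tracks the size of the answer. If the number of matching rows also grows with the table, page-level I/O grows too, but in proportion to the result rather than to the table.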

Unity Catalog Integration

Unity Catalog sits at the center of the Databricks ecosystem for data governance, discovery, and access control. StarTree’s Unity Catalog integration means:

  • Table discovery happens automatically through the catalog, so there is no manual registration or schema duplication
  • Governance and permissions are inherited from Unity Catalog, keeping access control consistent across your Databricks and StarTree workloads
  • Data lineage and metadata remain unified in one place

For organizations running Databricks as their primary data platform, StarTree extends the value of that investment by enabling SLA-driven workloads to run directly on existing Delta tables, without disrupting the pipelines or governance structures already in place.

What This Enables in Practice

A new class of SLA-driven workloads is now possible directly on Delta Lake. These aren’t new use cases. What changes is that they no longer require separate infrastructure or duplicated data to meet their latency requirements.

Customer-facing analytics embeds dashboards and insights directly into products, such as fintech merchant analytics or SaaS usage reporting. These workloads don’t just require fast queries; they require fast queries under load. We’re talking 40K, 80K, even 100K+ QPS with sub-second P99 latencies on full analytical queries, not preprocessed lookups or cached results. Every extra second at that scale is a visible quality signal to every user who hits it. StarTree enables these workloads to run directly on Delta without a separate serving tier.

Observability and time-series analytics powers the dashboards engineers rely on during incidents. High-cardinality filtering across metrics, logs, and events must return in milliseconds, and the queries are rarely simple. Correlating a spike in error rates with a deployment window requires ASOF joins across time series. Detecting anomalies across a rolling baseline requires windowing. StarTree supports both natively, along with PromQL compatibility and Grafana integration, so teams can bring their existing observability tooling directly to Delta data without re-platforming. Slow queries during an outage don’t just frustrate engineers; they slow down recovery. StarTree keeps latency stable under pressure, with the indexing depth to match.
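
For readers unfamiliar with the term, an ASOF join matches each row in one time series with the most recent row at or before it in another. The Python sketch below shows the semantics on hypothetical error-rate samples and deployment events. It is a generic illustration using the standard library, not StarTree’s query syntax or execution strategy, and the `asof_join` helper and sample data are made up for this example.

```python
from bisect import bisect_right


def asof_join(left, right):
    """Attach to each (ts, value) in left the latest right value with ts <= left ts.

    Both inputs must be sorted by timestamp. Left rows with no earlier
    right row are paired with None.
    """
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, value in left:
        pos = bisect_right(right_ts, ts)  # count of right rows at or before ts
        match = right[pos - 1][1] if pos > 0 else None
        joined.append((ts, value, match))
    return joined


# Hypothetical data, timestamps in seconds.
error_rates = [(100, 0.01), (160, 0.02), (220, 0.35), (280, 0.40)]
deployments = [(150, "v41"), (210, "v42")]

for row in asof_join(error_rates, deployments):
    print(row)
```

Here the jump in error rate at timestamp 220 lines up with deployment `v42`, which is the kind of correlation an on-call engineer is looking for during an incident.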

Real-time and historical analytics covers workloads that depend on combining fresh signals with historical context. For most use cases, data that is 10 minutes old is fresh enough, and Delta handles that well. But some workloads can’t wait.

For those cases, StarTree can ingest directly into Pinot tables, making data available for query within seconds of arrival, and federate seamlessly into Delta for deeper historical context. A single query spans both: the last few seconds of live data in Pinot and months or years of history in Delta, joined and returned in under a second. No separate streaming pipeline, no application-layer stitching, no inconsistencies at the seam.
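
One way to picture a query that spans both stores without double counting is a time boundary: the historical store answers for everything before the boundary and the real-time store answers for everything at or after it. Apache Pinot uses a similar idea for its hybrid tables. The Python sketch below is only a conceptual model under that assumption; how StarTree federates Pinot and Delta internally is not described here, and the `hybrid_count` helper and sample timestamps are hypothetical.

```python
def hybrid_count(realtime_rows, historical_rows, boundary_ts, start_ts, end_ts):
    """Count events in [start_ts, end_ts) across two stores without double counting.

    The historical store answers for ts < boundary_ts and the real-time
    store answers for ts >= boundary_ts, so overlapping rows are counted once.
    """
    historical = sum(1 for ts in historical_rows
                     if start_ts <= ts < min(end_ts, boundary_ts))
    realtime = sum(1 for ts in realtime_rows
                   if max(start_ts, boundary_ts) <= ts < end_ts)
    return historical + realtime


# Hypothetical event timestamps. Both stores hold events 300 and 310,
# as could happen while recent data is being committed to the lake.
historical_rows = [100, 200, 300, 310]
realtime_rows = [300, 310, 320, 330]
boundary = 320  # assumed: everything before 320 is served from the lake

total = hybrid_count(realtime_rows, historical_rows, boundary, 150, 400)
print(total)
```

The overlapping events are counted once because each side only answers for its own half of the timeline, which is what removes the need for stitching in the application layer.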

Log and event analytics is often routed away from Delta due to missing indexing capabilities, creating duplication and pipeline complexity. StarTree brings full-text and filter-based indexing directly to the source data.

Vector similarity search for AI-powered retrieval enables semantic search, recommendations, and RAG pipelines to run directly on Delta Lake, making StarTree the first vector database on Delta. Teams can serve AI-powered use cases without replicating data into a separate vector store.

Read more: 5 SLA-driven analytics use-cases possible on the data lake

Benchmarked

We evaluated Delta-backed queries under real SLA-driven workload conditions using a ~1 TB dataset (~12B rows) on a 4-node cluster (r6gd, 16 cores, 128GB RAM each).

The workload reflects real usage patterns, including:

  • High-selectivity filter queries (e.g., count(*) with dimension and time filters)
  • Aggregations with filters
  • Group-by queries on primitive and map-type columns

Query selectivity ranged from under 10% to ~70%, covering both narrow lookups and broader scans. Across these patterns, the system sustained 500+ QPS with sub-second latency directly on Delta Tables.
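
A quick calculation helps put these figures in perspective. The Python sketch below derives the average stored bytes per row and the average CPU budget per query from the numbers above. It assumes a decimal terabyte, treats 500 QPS as the sustained rate, and assumes all 64 cores are fully busy, so the results are rough upper bounds rather than measured values.

```python
dataset_bytes = 1e12   # ~1 TB, assuming a decimal terabyte
rows = 12e9            # ~12B rows
nodes = 4
cores_per_node = 16
qps = 500              # reported lower bound on sustained throughput

bytes_per_row = dataset_bytes / rows       # average stored bytes per row
total_cores = nodes * cores_per_node       # cores available in the cluster
qps_per_core = qps / total_cores           # queries each core must complete per second
core_ms_per_query = 1000 / qps_per_core    # average core time available per query

print(f"{bytes_per_row:.1f} bytes/row, {total_cores} cores, "
      f"{qps_per_core:.2f} QPS/core, {core_ms_per_query:.0f} core-ms per query")
```

Under these assumptions each query has, on average, at most about 128 milliseconds of core time across the cluster. A budget that small is only workable if each query touches a small part of the 12 billion rows, which is consistent with the page-level access described in this post.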

The driver of performance was reduced data access. Queries operated on specific pages within Parquet files rather than full file scans. Fewer bytes read per query translated directly into lower I/O and stable latency under concurrency, across repeated queries and mixed workloads, without requiring data duplication or materialization.

Performance remains stable without driving up cost. In scan-based systems, performance improvements often come with increased resource usage. Here, performance and cost move in the same direction, because both are driven by how much data is actually accessed per query.

Explore the benchmark

Open Table Formats, One Query Engine for SLA-Driven Workloads

With support for both Apache Iceberg and Delta Lake now available, StarTree provides a unified engine for SLA-driven analytics across the two most widely adopted open table formats. Whether your data lives in Iceberg, Delta, or both, you can now bring consistent, sub-second query performance to your lakehouse without separate infrastructure, duplicated pipelines, or format-specific tradeoffs.

Delta Lake modernizes how data is stored. Unity Catalog governs it. StarTree delivers the SLAs.

Get Started Today

Sign up for a free trial and we’ll get you up and running with StarTree Delta Tables in your own environment, with your own data.
