StarTree 0.15: StarTree Iceberg Tables Now Generally Available


Iceberg has quickly become the standard for open, flexible data storage. But when it comes to SLA-driven analytics—where queries must return in milliseconds, not minutes—those expectations still aren’t met.

The problem isn’t Iceberg. It’s the query engine on top of it.

After a period of private and public preview working alongside some of the largest, most demanding data platforms in production, StarTree Iceberg Tables is now generally available as part of the 0.15 release.

This release introduces a fundamentally different model. Today’s Iceberg query engines are scan-heavy. Even with optimizations, they read far more data than necessary, leading to unpredictable latency, rising costs, and complex workarounds like materialized views and duplicated pipelines.

StarTree changes this by bringing an index-driven, precise fetch execution model directly to Iceberg, enabling consistent, sub-second performance without data movement or materialization.

What’s new in 0.15

With StarTree 0.15, Iceberg tables can now be queried directly through StarTree’s real-time serving layer. Instead of treating Parquet files as units to scan, StarTree applies Pinot’s mature indexing layer directly on Iceberg data, resolving queries to the exact records and locations needed. These indexes identify the precise document IDs and map queries down to specific Parquet pages, not entire files or column chunks.

This enables a fundamentally different execution model:

  • Queries are backed by indexes. The query engine knows exactly which parts of which files contain the necessary data.
  • Precise fetch extracts data at page-level granularity, significantly reducing data transfer.
  • Partitioning, pruning, intelligent prefetching, and caching further reduce remote I/O.
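The execution model above can be sketched in a few lines. This is purely illustrative, not StarTree's implementation: a hypothetical inverted index maps a column value to row IDs, and row IDs map down to the Parquet pages that contain them, so only those pages are fetched.

```python
# Illustrative sketch (not StarTree's implementation): an inverted index maps
# column values to row IDs, and row IDs map to Parquet page ranges, so the
# engine fetches only the pages it needs instead of scanning whole files.

PAGE_SIZE = 100  # hypothetical: 1,000 rows split into pages of 100 rows

# Toy dataset: 20 matching rows clustered in one region of the table.
rows = [{"row_id": i, "country": "US" if 300 <= i < 320 else "DE"}
        for i in range(1000)]

# Build an inverted index: value -> set of row IDs.
inverted = {}
for r in rows:
    inverted.setdefault(r["country"], set()).add(r["row_id"])

def pages_for(row_ids):
    """Map matching row IDs down to the pages that contain them."""
    return sorted({rid // PAGE_SIZE for rid in row_ids})

matches = inverted["US"]       # index lookup, no scan
touched = pages_for(matches)   # precise fetch: only these pages are read

print(f"matching rows: {len(matches)}")
print(f"pages read: {len(touched)} of {len(rows) // PAGE_SIZE}")
```

Because the index resolves the query to row IDs first, the number of pages read tracks where the results live, not how large the table is.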

In practice, this means Iceberg performance is no longer bounded by how much data must be scanned. Instead, queries operate on a minimal working set, where I/O, latency, and cost scale with the result, not the dataset.

The outcome is that fresh data in Iceberg can be queried directly—without materialization, without duplication, and without full reprocessing—while still meeting the demands of SLA-driven workloads requiring low latency, high concurrency, and predictable cost.

Release details:
https://docs.startree.ai/reference/startree-release-notes/0.15.0

The shift: from scan-heavy or materialized engines to a precise fetch engine

Most Iceberg-based query engines are fundamentally scan-heavy, even when optimized.

They reduce how much data is scanned through techniques like partition pruning, metadata filtering, and bloom filters. These improvements matter, but they don’t go far enough. Queries still read far more data files than are actually needed to answer the question.

As datasets grow, scanning becomes expensive:

  • I/O scales with table size
  • Query cost scales with frequency and concurrency
  • Latency becomes sensitive to how much data must be read each time

Materialized views are a separate strategy with a different tradeoff. They attempt to offset scan costs by storing precomputed results. When this works, queries are fast because they avoid re-reading data entirely.

Materializing data is powerful, but it comes with real tradeoffs. The three biggest downsides tend to be:

  • Data staleness: Materialized data can become outdated between refreshes, forcing a tradeoff between freshness and compute cost.
  • Storage overhead: It duplicates data by storing derived results, increasing storage usage and costs at scale.
  • Operational complexity: Requires managing refresh pipelines, handling failures, and keeping data consistent across dependencies.

With StarTree, the model shifts away from both.

Execution is index-driven, which changes how data is accessed at its core.

Queries do not rely on broad scans, and they do not depend on materialization to mask scan costs. Instead:

  • Only the relevant pages within a Parquet file are accessed directly, so queries read orders of magnitude less data
  • I/O remains low even as datasets grow
  • Performance holds across repeated and new query patterns alike

The result is a system where both performance and cost remain stable as usage increases, because unnecessary data access has been removed, not optimized around. For many teams, this is where 40–50%+ cost reduction becomes achievable by eliminating unnecessary data scans entirely.

Under the hood, this shift is powered by bringing Apache Pinot’s rich indexing ecosystem directly to Iceberg—something that hasn’t previously been available in the data lakehouse. StarTree Iceberg Tables introduce a wide variety of index types on open table formats, enabling highly selective, index-driven query execution without requiring data movement or duplication. This includes support for inverted indexes, range indexes, sorted indexes, bloom filters, JSON indexes for semi-structured data, full-text (Lucene) indexes, vector indexes for similarity search, and geospatial indexes such as H3. 
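For orientation, Apache Pinot configures indexes per column in its table config. The sketch below uses real Pinot `tableIndexConfig` keys, but the column names are hypothetical and the exact configuration surface for StarTree Iceberg Tables may differ; consult the release docs for the supported syntax.

```json
{
  "tableIndexConfig": {
    "invertedIndexColumns": ["country", "deviceType"],
    "rangeIndexColumns": ["eventTime"],
    "sortedColumn": ["tenantId"],
    "bloomFilterColumns": ["userId"],
    "jsonIndexColumns": ["payload"]
  }
}
```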

Taken together, these indexes allow queries to operate with precision across a wide range of access patterns—filtering, search, aggregation, geospatial analysis, and vector similarity—without relying on scan-heavy execution or materialized views. This breadth of indexing is what enables Iceberg to move beyond a storage layer into a high-performance serving layer for real-time and interactive workloads.

“From the beginning, Apache Pinot was designed to deliver sub-second query performance at scale, powering high-concurrency workloads at companies like LinkedIn, Uber, and Stripe. The core innovations around indexing and selective data access translate naturally to Iceberg, enabling something fundamentally new: page-level reads where queries operate on precisely the data they need, not entire files.” – Uday Vallamsetty, SVP of Engineering at StarTree

“With open table formats like Iceberg, the friction of storing data in Pinot’s native format is removed, making it easy to run these capabilities directly on your data lake and deploy a cost-efficient serving layer without duplication.” – Chad Meley, SVP of DevRel at StarTree

What this enables in practice

A new class of workloads is emerging on Iceberg—SLA-driven analytics, where queries are tied directly to user experience, revenue, and operational decisions.

These use cases are familiar. What’s changing is the design pattern behind them. Historically, they were solved using point solutions outside the data lakehouse—separate serving systems, duplicated pipelines, and specialized infrastructure. Now, they are becoming first-class workloads on the data lakehouse itself, with the requirement to deliver consistent, sub-second performance under concurrency directly on Iceberg.

Six use cases show up repeatedly:

Customer-facing analytics
Applications now embed dashboards and real-time insights directly into the product—such as fintech merchant analytics or SaaS usage reporting. These workloads demand predictable latency under user load. Historically, this required duplicating data into serving systems. With StarTree, they run directly on Iceberg without that overhead.

Real-time + historical analytics
Use cases like fraud detection or operational monitoring depend on combining live signals with historical context. Traditional architectures stitch together streaming and batch systems with inconsistent results. StarTree enables a single query path across both, without latency tradeoffs.

Interactive analytics
Exploratory workflows—funnels, KPIs, segmentation—depend on fast, repeated queries. Performance must hold across unpredictable patterns and concurrent users. StarTree preserves sub-second interactivity directly on Iceberg, without pre-aggregation or over-provisioning.

Log and event analytics
Teams store logs in Iceberg but query them elsewhere due to missing indexing capabilities. This creates duplication and pipeline complexity. StarTree brings indexing to Iceberg, enabling fast filtering and search directly on the source data.

Observability and time-series analytics
Metrics, logs, and events are often split across systems, increasing cost and fragmentation. StarTree enables high-cardinality, time-series queries directly on Iceberg, allowing teams to unify observability workloads without sacrificing performance.

Vector similarity search (AI-powered retrieval)
Applications are increasingly embedding semantic search, recommendations, and AI-driven retrieval directly into user experiences—such as finding similar products, matching support tickets, or powering RAG pipelines. These workloads require fast nearest-neighbor search over high-dimensional embeddings with low latency under concurrency. Historically, this meant introducing a separate vector database alongside the data lake, creating duplication and operational overhead. With StarTree, vector similarity search runs directly on Apache Iceberg, making it the first vector database on Iceberg, so teams can serve AI-powered use cases without moving or replicating their data.

More on use cases:
https://startree.ai/resources/5-sla-driven-analytics-use-cases-now-possible-with-iceberg/

Benchmarked

We evaluated Iceberg-backed queries under serving conditions using a ~1 TB dataset (~12B rows) on a 4-node cluster (r6gd, 16 cores, 128GB RAM each).

The workload was designed to reflect real usage patterns, including:

  • High-selectivity filter queries (e.g., count(*) with dimension and time filters)
  • Aggregations with filters
  • Group-by queries on primitive and map-type columns

Query selectivity ranged from <10% to ~70%, covering both narrow lookups and broader scans. Across these patterns, the system sustained 500+ QPS with sub-second latency directly on Iceberg tables.
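The three workload shapes in the benchmark are ordinary SQL. To make them concrete, here they are run against a toy SQLite table; the SQL patterns mirror the list above, but the table and column names are hypothetical, not the benchmark's.

```python
# Benchmark query shapes illustrated on a toy SQLite table (hypothetical
# schema): a filtered count, a filtered aggregation, and a group-by.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i, "US" if i % 3 == 0 else "DE", i * 0.5) for i in range(1000)],
)

# High-selectivity filter: count(*) with dimension and time filters.
n, = conn.execute(
    "SELECT COUNT(*) FROM events WHERE country = 'US' AND ts < 300"
).fetchone()

# Aggregation with a filter.
total, = conn.execute(
    "SELECT SUM(amount) FROM events WHERE country = 'DE'"
).fetchone()

# Group-by on a dimension column.
groups = conn.execute(
    "SELECT country, COUNT(*) FROM events GROUP BY country ORDER BY country"
).fetchall()

print(n, total, groups)
```

In StarTree, each of these shapes resolves through indexes (time and dimension filters through range and inverted indexes, group-bys through dictionary-encoded columns), which is what allows the sustained 500+ QPS figure above.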

The primary driver of performance was reduced data access:

  • Queries operated on specific pages within Parquet files, not full file scans
  • Fewer bytes read per query translated directly into lower I/O and stable latency under concurrency

This held across repeated queries and mixed workloads, without requiring data duplication or materialization. Explore the benchmark here:

Performance remains stable without driving up cost. In scan-based systems, performance improvements often come with increased resource usage. Here, performance and cost move in the same direction—because both are driven by how much data is actually accessed per query.

Iceberg modernizes how data is stored, and StarTree delivers the SLAs. 

Get Started Today

It’s easy to get your hands on v0.15 with StarTree Iceberg Tables. Sign up for a free trial and we’ll get you up and running with your own environment to test features and capabilities on your own data.
