Why StarTree Is the Best Rockset Alternative for Real-Time Analytics
Let me begin by congratulating the Rockset team on their recent acquisition by OpenAI. It is a testament to the great product you folks have built. Unfortunately, it also means that current Rockset customers need to find a suitable replacement.
If we look at the overall real-time analytics landscape, there are 3 technologies (and their related fully-managed versions) that stand out as possible alternatives:
- Apache Pinot (StarTree)
- Apache Druid (Imply)
- ClickHouse (ClickHouse Cloud or Tinybird)
There are other similar technologies, but we’ll focus on these 3 since they’re the most popular and widely used in the market today. In this blog, I’ll go over why we think StarTree and Apache Pinot are the best replacement for Rockset in terms of features and use cases.
StarTree Cloud, powered by Apache Pinot
Apache Pinot is an open source, distributed columnar database that forms the backbone of mission-critical analytical systems at companies like Uber, DoorDash, Stripe, LinkedIn, Cisco, and others. Pinot can ingest data from streaming sources (Kafka, Kinesis, Pub/Sub), batch sources (S3, GCS, ADLS), and SQL sources such as Snowflake and BigQuery. It was designed to serve high query throughput (tens of thousands of queries per second) on terabytes of data while keeping query latencies in the milliseconds.
StarTree Cloud provides a fully managed, cheaper, faster offering of Apache Pinot across all 3 major cloud providers. Its rich ingestion experience, ability to handle massive QPS and data scale, and battle-tested production record make it a natural fit as a Rockset alternative. We’ve also added vector index support to Apache Pinot for similarity search queries on embeddings stored in the database. Let’s take a deeper look at some key features that make it the top alternative.
Top 3 reasons for choosing Apache Pinot and StarTree as a Rockset replacement
1. Support for real-time upserts
Of the alternatives listed above, Apache Pinot is the only system that supports true real-time upserts, a key requirement when evaluating a Rockset replacement. Rockset lets users upsert records in real time for a given primary key. This is required to keep query results accurate when records are updated or deleted, and it is also needed for analyzing Change Data Capture (CDC) streams coming from DynamoDB, MongoDB, MySQL, or Postgres.
In our blog Real-Time Upserts in Apache Pinot and StarTree Cloud, we’ve described in detail how this works in Apache Pinot and how StarTree Cloud increases the scalability and robustness of this feature. Apache Druid/Imply does not support real-time upserts as of today, and ClickHouse’s support is severely limited (it relies on asynchronous compaction, which is not real-time).
Figure 1: Real-time Upsert support in Apache Pinot, Apache Druid and ClickHouse
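To make the upsert semantics described above concrete, here is a minimal Python sketch of what a query should "see" when a CDC-style stream contains several events for the same primary key. It is a toy model, not Pinot's implementation: the function name, the `order_id`/`ts` fields, and the tie-breaking rule (a later arrival wins when comparison values are equal) are assumptions made for illustration.

```python
from typing import Any, Dict, List


def apply_upserts(
    events: List[Dict[str, Any]], primary_key: str, comparison_column: str
) -> Dict[Any, Dict[str, Any]]:
    """Return the record visible to queries for each primary key.

    A record replaces the current one only if its comparison value is
    greater than or equal to the current value, so late, out-of-order
    events never overwrite newer state. (Tie rule is an assumption.)
    """
    latest: Dict[Any, Dict[str, Any]] = {}
    for event in events:
        key = event[primary_key]
        current = latest.get(key)
        if current is None or event[comparison_column] >= current[comparison_column]:
            latest[key] = event
    return latest


# Hypothetical order events; "ts" plays the role of the comparison column.
events = [
    {"order_id": 1, "status": "CREATED", "ts": 100},
    {"order_id": 2, "status": "CREATED", "ts": 101},
    {"order_id": 1, "status": "DELIVERED", "ts": 105},
    {"order_id": 1, "status": "PICKED_UP", "ts": 103},  # arrives late, out of order
]

visible = apply_upserts(events, "order_id", "ts")
for key in sorted(visible):
    print(key, visible[key]["status"])
```

Without upsert support, a count of orders over this stream would return four rows instead of two, which is why the feature matters for accurate results on mutable data.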
2. Support for handling unstructured or semi-structured data
Another key feature of Rockset that customers love is the ability to analyze free-form JSON or text documents. Apache Pinot has great support for this as well. Users can ingest nested JSON columns as-is, without needing to preprocess or transform them in any way. We can then configure a JSON index on the column and query it using the JSON_MATCH predicate, which lets us filter relevant records within milliseconds. Here’s a sample query that leverages the JSON index for a regex-based filter:
SELECT ... FROM mytable
WHERE JSON_MATCH(person, 'REGEXP_LIKE("$.addresses[*].street", ''.*st.*'')')
JSON index support in Apache Pinot enables extremely fast query processing on highly nested JSON columns, as shown in this benchmark comparing latency observed with and without the JSON index:
Figure 2: Benchmark comparing query latency with and without the JSON index
To learn more, please refer to this great video tutorial about the JSON index.
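As a rough illustration of the two pieces involved in the JSON index discussion above, the Python sketch below builds (1) a table-config fragment that enables a JSON index on the `person` column and (2) the JSON_MATCH predicate from the sample query, with the inner single quotes doubled as SQL string literals require. The `jsonIndexColumns` key follows the Apache Pinot table-config format as I understand it; the helper name `json_match` is hypothetical, and the exact config layout may differ across Pinot versions, so treat this as a sketch rather than a reference.

```python
import json


def json_match(column: str, filter_expr: str) -> str:
    """Wrap a JSON filter expression in a JSON_MATCH predicate.

    Single quotes inside the filter are doubled so the expression can be
    embedded in a single-quoted SQL string literal.
    """
    escaped = filter_expr.replace("'", "''")
    return f"JSON_MATCH({column}, '{escaped}')"


# Fragment of a table config enabling a JSON index (assumed layout).
json_index_fragment = {
    "tableIndexConfig": {
        "jsonIndexColumns": ["person"],
    }
}

# The filter from the sample query, written with ordinary single quotes.
inner = "REGEXP_LIKE(\"$.addresses[*].street\", '.*st.*')"
predicate = json_match("person", inner)

print(json.dumps(json_index_fragment, indent=2))
print("SELECT ... FROM mytable WHERE " + predicate)
```

Generating the predicate through a helper avoids the most common mistake with JSON_MATCH, which is forgetting to double the quotes around string literals nested inside the filter expression.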
In addition, Pinot can run powerful text search queries on arbitrary text data. This is done by configuring a text index on the corresponding column to accelerate query processing. For instance, here’s a sample query that leverages Pinot’s text index to select the records matching the given regex:
SELECT SKILLS_COL FROM MyTable
WHERE text_match(SKILLS_COL, '/.*Exception/')
Like the JSON index, the text index provides an order-of-magnitude improvement in query performance, as shown in the following benchmark.
Figure 3: Benchmark showing query latency with UDF and with Text Index
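For readers wondering what "configuring a text index" looks like, here is a small Python sketch that produces a field-config entry for the `SKILLS_COL` column used in the query above. The `fieldConfigList`, `encodingType`, and `indexTypes` keys reflect the Apache Pinot table-config format as I understand it; newer Pinot releases may prefer a different layout, and the helper name `text_index_field_config` is hypothetical.

```python
import json
from typing import Any, Dict


def text_index_field_config(column: str) -> Dict[str, Any]:
    """Build a field-config entry that enables a text index on a column.

    The column is stored with RAW (no-dictionary) encoding here, which is
    an assumption of this sketch.
    """
    return {
        "name": column,
        "encodingType": "RAW",
        "indexTypes": ["TEXT"],
    }


# Fragment to merge into the table config (assumed layout).
text_index_fragment = {
    "fieldConfigList": [text_index_field_config("SKILLS_COL")],
}

print(json.dumps(text_index_fragment, indent=2))
```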
3. Intelligent materialized views using the Star-Tree Index
The star-tree index, unique to Apache Pinot, lets users build an intelligent materialized view that pre-computes selected aggregations across a wide range of dimension filters. This provides many orders of magnitude improvement in query latency as well as overall query throughput, as shown below:
Figure 4: Benchmark showing QPS and query latency with and without the star-tree index
However, unlike traditional materialized views, the star-tree index provides an innovative way to control the degree of materialization done, thus allowing the user to trade off query latency versus storage overhead. It’s also dynamic and can be easily modified without having to re-ingest the data. Finally, users can configure multiple different star-tree indexes for the same table and the right index will be picked up for a given user query. This is a unique feature that does not exist in other databases (including Rockset).
The star-tree index is used in many latency-sensitive, mission-critical applications, such as the metrics store built by Stripe and the Restaurant Manager application from Uber. To learn more about the star-tree index, check out this blog.
Feature comparison of StarTree Cloud vs. Rockset
Compared to the alternatives, StarTree Cloud comes closest to feature parity with Rockset, especially when it comes to supporting real-time upserts and analyzing unstructured data. The rich ingestion capabilities of Pinot ensure a seamless migration path for existing Rockset users. In addition, Pinot and StarTree provide innovations like the star-tree index and cloud tiered storage that stand out as differentiators.
Here’s a comprehensive comparison of features across StarTree Cloud and Rockset:
Feature | Rockset | StarTree Cloud |
Deployment | ||
SaaS – Serverless | ||
SaaS – Dedicated | ||
BYOC | ||
On Premises | Open source only | |
Scaling | T-shirt sizes | Increase resources of individual service that needs more |
Ingestion | ||
Real-Time Ingestion | ||
Bulk File Ingestion | (via stream) | (independent of stream, async) |
Upserts | ||
Partial Upserts | Yes, but no access to current values | |
Deduplication | ||
Insert-Time Aggregations | ||
Time Partitioning | ||
Column Value Partitioning | Fixed count, PK only | Configurable, any column, multiple |
Background Re-Partitioning | ||
Write API | (beta) | |
Materialized Views | (as rollups) | (as StarTree indices) |
Backups | ||
Queries | ||
Postgres Compatible Query Language | Due 2024 | |
Wire-Compatible Driver | Due 2024 | |
Joins | ||
Collocated Joins | ||
Nested JSON Support | (different due to fixed schema) | |
Pagination (Beyond Limit & Offset) | (beta) | |
UDFs | (JavaScript) | (Groovy) |
Query Lambdas | ||
RBAC | ||
Table Aliases | ||
Query Scaling | (via multiple VIs) | (via k8s & replica groups) |
Indices | ||
Configurable Per Column | ||
Columnar | ||
Inverted | ||
Time Column Partitioned | ||
Range, Sorted | ||
GeoSpatial | (S2) | (H3) |
BloomFilters | N/A | |
Text | ||
JSON | (effectively) | |
Vector | ||
Extras | ||
Anomaly Detection | | (ThirdEye) |
Conclusion
Migrating a database is never easy, especially with many choices to consider. We strongly feel StarTree Cloud comes closest in terms of feature parity with Rockset, especially with features like real-time upserts, JSON and Text Index support, and the star-tree index.
If you are a current Rockset user — or even if you found this article compelling regardless of your current database of choice — the good news is that there’s now a StarTree Cloud free tier for you to get started prototyping in preparation for your migration. Also please make sure you contact us for a demo. We’d love to listen to your needs, understand your use case, and answer any questions you may have on whether StarTree Cloud is the best database for your real-time analytics needs.