
Scaling Upserts Efficiently: How StarTree Delivers Real-Time Performance for Modern Analytics


Chad Meley
SVP, Marketing
Released on: October 24, 2024
Read time: 6 min

The data landscape has shifted significantly with the rise of streaming data and real-time analytics, and upserts play a critical role in this shift. Before diving into how StarTree, based on Apache Pinot, has redefined upserts, it’s essential to understand the general benefits of upserts in databases, why they’re challenging in analytic environments, and why traditional upserts are expensive and hard to scale. Finally, we’ll explore how StarTree addresses these issues and unlocks the potential of real-time analytics.

The general benefits of upserts

In any database, an “upsert” is a combination of “update” and “insert.” Without upserts, you’d first have to check if a record exists, then insert it if it doesn’t. If it does exist, you’d need to delete the old record and insert the new one. The upsert operation allows a system to insert a new record if it doesn’t exist or update it if it does—effectively combining two common database operations into one. Upserts are commonly found in many databases and offer several distinct advantages:

  • Simplified architecture: Instead of running separate insert and update operations, applications issue a single upsert, simplifying both application logic and the surrounding data pipeline.
  • Insight accuracy: Upserts ensure that the latest data is always available for querying. By continuously updating records in real-time, upserts guarantee that insights are based on current and accurate information, reducing errors that arise from outdated data.
  • Improved data freshness: Upserts allow new and updated data to be processed immediately. This ensures that systems have up-to-the-second data.
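
To make the benefit concrete, here is a minimal sketch in Python using SQLite (assuming a SQLite version, 3.24 or later, that supports the ON CONFLICT clause); the orders table and its columns are invented for illustration:

```python
import sqlite3

# Throwaway table to contrast manual check-then-write with a single upsert.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")

def manual_write(order_id: str, status: str) -> None:
    """Without upserts: check for the row, delete it if present, then insert."""
    exists = conn.execute(
        "SELECT 1 FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    if exists:
        conn.execute("DELETE FROM orders WHERE order_id = ?", (order_id,))
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, status))

def upsert_write(order_id: str, status: str) -> None:
    """With upserts: one statement covers both the insert and the update case."""
    conn.execute(
        "INSERT INTO orders VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET status = excluded.status",
        (order_id, status),
    )

upsert_write("o-1", "PLACED")   # first write inserts the row
upsert_write("o-1", "SHIPPED")  # second write updates it in place
print(conn.execute("SELECT * FROM orders").fetchall())  # [('o-1', 'SHIPPED')]
```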

The challenge of upserts in analytic environments

While upserts are relatively straightforward in transactional databases, they become much more complex in the world of analytics. Some of the core challenges include:

  • Disruption of bulk loading: Analytic databases are often optimized for large, sequential data loads, but upserts introduce smaller, more frequent updates. This breaks the efficiency of the bulk ingestion design pattern.
  • Mismatch between columnar storage and row-level updates: Analytic databases, such as those used for big data processing, are usually columnar stores—designed to read large chunks of data from a single column at once. However, upserts involve updating specific rows, which complicates this structure, leading to performance degradation and fragmentation (see the sketch below).
  • Reindexing overhead: Frequent updates, as required by upserts, often necessitate reindexing the data. This can degrade query performance as the system spends additional resources reorganizing and optimizing the data for future queries.

These challenges make upserts a tricky feature to implement efficiently in analytic databases.
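
As a rough illustration of the columnar mismatch, the toy "segment" below stores each column as its own array (invented for this post, not any engine's real layout); even a one-row update has to reach across every column, and a production store would also have to rewrite compressed blocks and indexes:

```python
# A toy columnar "segment": each column is stored contiguously as its own array.
segment = {
    "order_id": ["o-1", "o-2", "o-3"],
    "status":   ["PLACED", "PLACED", "SHIPPED"],
    "amount":   [120, 80, 45],
}

def update_row(seg: dict, row: int, new_values: dict) -> None:
    # A row-level update touches every affected column array. In a real
    # columnar store those arrays are compressed, sorted, and indexed, so a
    # single-row change can force rewriting and reindexing far more data
    # than the one record being changed.
    for column, value in new_values.items():
        seg[column][row] = value

update_row(segment, 0, {"status": "SHIPPED", "amount": 125})
print(segment["status"])  # ['SHIPPED', 'PLACED', 'SHIPPED']
```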

Traditional upserts for analytics

Vendors that offer upserts in analytics platforms often resort to inefficient workarounds. The two common approaches are reconciling updates either before ingestion or at query time, and both introduce significant overhead.

Pre-ingestion reconciliation attempts to determine if a record is an insert or an update before data enters the database. While this method may seem proactive, it significantly increases the complexity of the data pipeline. Each data stream must be checked for changes and modifications, which forces the system to maintain large volumes of temporary data in memory. This process not only consumes valuable memory but also slows down the ingestion speed, as every update or insert has to go through this costly reconciliation step before it can be processed. As data volumes scale, memory overhead balloons, and ingestion performance plummets.
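
A rough sketch of what this implies (the class and field names below are hypothetical, not any vendor's actual pipeline): every incoming record is checked against a keyed buffer that must stay in memory and grows with the number of distinct keys.

```python
from typing import Iterable

class PreIngestReconciler:
    """Hypothetical pre-ingestion reconciliation step run before the database."""

    def __init__(self) -> None:
        # Grows with the key space -- this is the memory overhead that
        # balloons as data volumes scale.
        self.latest_by_key: dict[str, dict] = {}

    def reconcile(self, records: Iterable[dict]) -> list[dict]:
        classified = []
        for record in records:
            key = record["order_id"]
            # Classify every record as an insert or an update before it can
            # be handed to the database, adding latency to ingestion.
            record["_op"] = "insert" if key not in self.latest_by_key else "update"
            self.latest_by_key[key] = record
            classified.append(record)
        return classified
```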


Conversely, reconciling updates at query time pushes the responsibility for handling updates and inserts to the query engine. This approach defers the work until the user requests the data, reducing the upfront ingestion burden. However, the trade-off is substantial. While reconciling at query time works when queries per second (QPS) are low, it becomes cost-prohibitive at scale.

Each query must sift through potentially conflicting data, adding significant processing time and requiring extra compute resources. This is how vendors like Imply and ClickHouse operate, and while it works for lower QPS scenarios, it becomes extremely costly and inefficient as query volume grows. The delays introduced can severely impact real-time analytics, leading to degraded user experiences and higher operational costs.
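
In simplified terms (a sketch of the general pattern, not Imply's or ClickHouse's actual engines), query-time reconciliation means every read has to collapse all stored versions of a key down to the latest one:

```python
from operator import itemgetter

# All versions of every record are kept; nothing is reconciled at write time.
rows = [
    {"order_id": "o-1", "status": "PLACED",  "event_ts": 100},
    {"order_id": "o-2", "status": "PLACED",  "event_ts": 110},
    {"order_id": "o-1", "status": "SHIPPED", "event_ts": 130},
]

def query_latest(all_rows: list[dict]) -> list[dict]:
    latest: dict[str, dict] = {}
    # This scan-and-compare work is repeated on every query that touches the
    # table, which is why the cost grows with QPS rather than with data size.
    for row in sorted(all_rows, key=itemgetter("event_ts")):
        latest[row["order_id"]] = row
    return list(latest.values())

print(query_latest(rows))  # o-1 resolves to its latest version: SHIPPED
```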

How traditional upserts work in analytics platforms

StarTree’s approach to upserts

StarTree has revolutionized how upserts work by handling reconciliation during the ingestion process, significantly improving scalability and reducing memory overhead. Instead of relying on complex pre-ingestion checks or costly query-time reconciliation, StarTree appends all new records as they arrive. This approach leverages metadata to determine whether each incoming record is a new insert, an update, or even a deletion of an existing record. If the system identifies it as an update, the older record is marked as obsolete, effectively streamlining the update process. If it is a deletion, then all matching records are marked as obsolete and invalid for querying.

How upserts work in StarTree
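
The core idea can be sketched as follows (a simplified illustration of the append-plus-metadata pattern described above, not StarTree's actual code; the class and field names are invented): records are always appended, and a small piece of metadata per primary key decides which appended row is visible to queries.

```python
class UpsertSegment:
    """Toy append-only segment with upsert metadata (illustrative only)."""

    def __init__(self) -> None:
        self.rows: list[dict] = []             # append-only storage
        self.location_of: dict[str, int] = {}  # primary key -> latest row index
        self.valid: list[bool] = []            # which appended rows are queryable

    def ingest(self, record: dict) -> None:
        key = record["order_id"]
        if record.get("_deleted"):
            # A deletion marks the existing record for this key as obsolete
            # and invalid for querying; nothing new becomes visible.
            old = self.location_of.pop(key, None)
            if old is not None:
                self.valid[old] = False
            return
        # Always append; the metadata lookup decides insert vs. update.
        self.rows.append(record)
        self.valid.append(True)
        new_index = len(self.rows) - 1
        old = self.location_of.get(key)
        if old is not None:
            self.valid[old] = False  # the older version is now obsolete
        self.location_of[key] = new_index

    def query(self) -> list[dict]:
        return [row for row, ok in zip(self.rows, self.valid) if ok]
```

Because the write path is just an append plus a constant-time metadata lookup, neither ingestion nor queries have to scan older versions of a record.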

By appending new records and only marking older ones as obsolete, rather than deleting them immediately, StarTree’s innovative approach avoids the need for expensive and problematic workarounds. StarTree stores the metadata from older updates in a specialized external storage system while keeping recent updates in main memory for quick access. This balanced approach reduces memory usage and allows StarTree to handle billions of updates per server without slowing down. By managing memory more effectively, StarTree ensures smooth performance at scale, making it both cost-efficient and highly scalable. This architecture makes StarTree ideal for businesses that need high QPS on the freshest possible data, minimizing resource consumption and enabling cost-effective scaling.
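
One way to picture that split (assumed behavior for illustration, not StarTree's implementation; shelve here is just a stand-in for the specialized external store) is a two-tier key index: recent keys live in memory, older keys are spilled to slower storage and only consulted on a miss.

```python
import shelve

class TieredKeyIndex:
    """Toy two-tier index: hot keys in memory, older keys spilled to disk."""

    def __init__(self, path: str, hot_capacity: int = 1_000_000) -> None:
        self.hot: dict[str, int] = {}   # recent keys, fast lookups
        self.cold = shelve.open(path)   # stand-in for external metadata storage
        self.hot_capacity = hot_capacity

    def get(self, key: str):
        if key in self.hot:
            return self.hot[key]
        return self.cold.get(key)       # slower path, taken only for old keys

    def put(self, key: str, location: int) -> None:
        self.hot[key] = location
        if len(self.hot) > self.hot_capacity:
            # Spill roughly the oldest half of the hot tier so memory use
            # stays bounded no matter how many keys have ever been seen.
            for k in list(self.hot)[: self.hot_capacity // 2]:
                self.cold[k] = self.hot.pop(k)

    def close(self) -> None:
        self.cold.close()
```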

Real-world benefits of real-time upserts from StarTree customers

StarTree’s robust upsert capabilities have proven successful at massive scale. For example, Razorpay handles over 1 billion transactions annually with sub-second query latencies on StarTree Cloud. This performance is coupled with significant cost savings, with Razorpay and Amberdata reporting up to 66% reductions in infrastructure costs.

Conclusion

Upserts are a critical capability for real-time analytics, ensuring that data remains fresh and accurate. However, most analytics platforms struggle to implement them efficiently, often facing performance bottlenecks and high resource consumption. StarTree has redefined how upserts are handled in analytic environments, delivering a solution that is not only efficient but also scalable and cost-effective. By solving the key challenges of real-time data updates, StarTree enables businesses to harness the full potential of their data, ensuring that their analytics platforms are ready for the demands of the on-demand economy.

Ready to deploy real-time analytics?

Start for free or book a demo with our team.