Despite the increasing popularity of real-time analytics, also known as real-time online analytical processing (real-time OLAP, occasionally abbreviated as RTOLAP), few understand when to use it in their data stack. Some analytical queries can make do with traditional OLAP, while others truly need to run in a real-time OLAP database.
Real-time OLAP does all of the things OLAP does, but in real time. By this, we mean response times of seconds or even milliseconds, as fast as a web page or mobile app might update. In contrast, traditional OLAP might take minutes or hours to return complex query results. Real-time OLAP aims to provide timely insights as soon as data is generated or ingested, handle streaming data sources, and continuously update analytical models to reflect the freshest data.
The question that many people ask comes down to this: When do I need to use a real-time OLAP data store? Not asking the question, or not knowing when to switch to a real-time OLAP data store, can put organizations at risk, costing the business time to market, cost savings, and opportunities to generate revenue. This is because the data store you’re using may never achieve the SLAs required for the use case, even as the cost of the infrastructure needed to serve your users balloons beyond your budget.
Examples of online analytical processing (OLAP), a type of data store used specifically for complex data analysis, include Snowflake, Oracle OLAP, Databricks, Google Cloud BigQuery, and Amazon Redshift. These data stores are optimized to scan large numbers of records efficiently and compute aggregations over them using, for example, columnar storage formats and indexes. OLAP enables you to process historical data and train machine learning models. Its use cases typically involve long-running queries that finish in minutes or hours and feed next-day reports or occasionally refreshed dashboards.
Real-time OLAP data stores, on the other hand, are designed to serve multi-dimensional data in real time at lower latencies measured in seconds or milliseconds. They also support significantly more end users than traditional OLAP, reflected by high rates of queries per second (QPS) measured in the thousands to hundreds of thousands.
To determine whether you need conventional OLAP or real-time OLAP, start by assessing your current SLAs for your analytics. How long can a query take to execute? How many concurrent queries does it need to handle? Next, consider the following metrics and their implications.
If you need query results in seconds to milliseconds (that is, low latency), you need real-time OLAP. While you can throw nodes at a cluster to support increasing concurrency or storage, achieving low latency requires re-architecting from the ground up. Some systems just don’t have it in them.
In one example comparing P99 latencies from a data warehouse with those of a real-time analytics database (Apache Pinot), queries against the data warehouse took 1 to 5 minutes to return results. The same queries run against the real-time OLAP database took 2 seconds or less, and as little as 15 milliseconds. P99 latencies were 37× to 19,000× lower with real-time OLAP.
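To make the comparison concrete, here is a minimal Python sketch of how a P99 latency and a speedup factor can be computed from raw query timings. The nearest-rank method, the helper names `percentile` and `speedup`, and the sample latencies are all illustrative assumptions; they are not the measurements or tooling behind the figures above.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

# Hypothetical latencies in seconds, invented for illustration only.
warehouse_latencies = [62.0, 75.0, 90.0, 120.0, 300.0]
realtime_latencies = [0.015, 0.020, 0.050, 0.400, 2.0]

warehouse_p99 = percentile(warehouse_latencies, 99)
realtime_p99 = percentile(realtime_latencies, 99)
print(f"P99 speedup: {speedup(warehouse_p99, realtime_p99):.0f}x")
```

Comparing P99 rather than average latency keeps the focus on the slowest queries your users actually experience, which is usually what an SLA is written against.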
External customers expect and demand much more from user-facing applications than internal users do. Internal users may put up with slow updates to a dashboard if they’re fast enough to support the relevant decisions, but customers using a web or mobile app are not at all forgiving of slow response times.
If the number of end users you need to serve increases to a point where the database cannot keep up, switch to real-time OLAP.
Let’s call this metric concurrency. Many end users will result in many concurrent queries hitting the database. If you use a data warehouse for your OLAP queries, it might not be capable of supporting all of these users. To be specific, if you see your QPS needs in the range of thousands to a hundred thousand or more, you will need a real-time, user-facing analytics database to handle that workload. However, even workloads measured in hundreds of QPS may perform poorly on systems that were never designed for real-time analytics in the first place.
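A back-of-the-envelope estimate can show which QPS range your workload falls into. The sketch below is a rough model under stated assumptions: the function names, the idea that each page view fires a fixed number of queries, and all of the input numbers are hypothetical. The second function applies Little's Law (queries in flight equal arrival rate times average latency).

```python
def required_qps(active_users, queries_per_view, seconds_between_views):
    """Steady-state queries per second if every active user loads a view
    that fires `queries_per_view` queries once per `seconds_between_views`."""
    if seconds_between_views <= 0:
        raise ValueError("seconds_between_views must be positive")
    return active_users * queries_per_view / seconds_between_views

def in_flight_queries(qps, avg_latency_seconds):
    """Little's Law: average number of queries executing at the same time."""
    return qps * avg_latency_seconds

# Hypothetical workload: 50,000 active users, 4 queries per page view,
# one page view every 10 seconds per user.
qps = required_qps(active_users=50_000, queries_per_view=4, seconds_between_views=10)
print(f"Required QPS: {qps:,.0f}")
print(f"In flight at 50 ms latency: {in_flight_queries(qps, 0.050):,.0f}")
print(f"In flight at 60 s latency:  {in_flight_queries(qps, 60.0):,.0f}")
```

The last two lines illustrate why latency and concurrency are linked: at the same QPS, slower queries mean far more work running simultaneously.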
If you encounter an ever-increasing demand for data freshness, you probably want to switch to real-time OLAP. Freshness measures the time from when data is produced to the time it is available for querying.
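Freshness as defined here is simply the difference between two timestamps. A minimal sketch, assuming you can record when an event was produced and when it first became visible to queries (the function names, timestamps, and target values are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def freshness_seconds(produced_at, queryable_at):
    """Seconds between an event being produced and becoming queryable."""
    if queryable_at < produced_at:
        raise ValueError("event cannot be queryable before it is produced")
    return (queryable_at - produced_at).total_seconds()

def meets_freshness_target(produced_at, queryable_at, target_seconds):
    """True if the observed freshness is within the target."""
    return freshness_seconds(produced_at, queryable_at) <= target_seconds

# Hypothetical event: an order placed at noon UTC that shows up
# in query results three seconds later.
order_placed = datetime(2024, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
visible_in_queries = order_placed + timedelta(seconds=3)

print(freshness_seconds(order_placed, visible_in_queries))
print(meets_freshness_target(order_placed, visible_in_queries, target_seconds=5))
```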
Consider this use case for a business in the food service industry: To manage today’s lunch hour, you need real-time data on when high traffic is occurring and what menu items are being ordered to know how your team should spend their time. While having historical data from prior lunch hours is nice, as are projections for how today’s lunch hour traffic should look, those are not the same as having actual data on what is happening during this lunch hour.
A data-driven business requires not only access to data but also a way to get that data to decision makers in a timely manner. Today, even next-day delivery of data is too slow to keep pace with customer satisfaction, company health, and competitors.
Serving data from a real-time stream of events emitted by your operational systems will empower your decision makers to provide better customer experiences faster — and enable you to serve your customers with features powered by that same data. Real-time OLAP data stores can tap into real-time streams and bring fresh insights at the speed of thought.
Once real-time analytics start feeding your decision makers and customers with instant insights, they will inevitably want to compare those analytics to historical data so that they can learn from the past and compare current performance against a reference point.
To continue with the lunch hour example above, it may indeed be valuable to also know how past lunch hours behaved in aggregate, to establish trends and expectations. While those aggregates are no substitute for actual data from today’s lunch hour, they provide useful behavioral patterns and baselines. Is today’s lunch hour busier than expected, or lighter? Is a certain menu item more in demand than usual?
Typically, getting real-time data with historical context proves technically difficult because real-time and historical data tend to reside on different systems (real-time platforms and data warehouses, respectively). A real-time OLAP database simplifies this by connecting to both real-time streaming systems and data warehouses without compromising query latency or concurrency.
Some try to solve for the requirements above by scaling their existing data warehouse, or worse, scaling out their existing OLTP database like PostgreSQL. Because traditional OLAP systems were never designed for low-latency queries, they will quickly run into the limits described by Gunther’s Universal Scalability Law, where beyond a certain point adding nodes to the system actually degrades performance. If you find your infrastructure (and bill) ballooning in size and becoming increasingly complex to handle, then it’s probably time to make the switch to real-time OLAP.
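The Universal Scalability Law models relative throughput at N nodes as C(N) = N / (1 + α(N − 1) + βN(N − 1)), where α captures contention and β captures coherency (crosstalk) costs. The sketch below evaluates that formula; the α and β values are hypothetical and chosen only to show the shape of the curve, since real coefficients have to be fitted from your own load-test measurements.

```python
import math

def usl_throughput(nodes, alpha, beta, single_node_throughput=1.0):
    """Throughput predicted by the Universal Scalability Law."""
    denominator = 1 + alpha * (nodes - 1) + beta * nodes * (nodes - 1)
    return single_node_throughput * nodes / denominator

def peak_nodes(alpha, beta):
    """Node count at which USL throughput peaks: sqrt((1 - alpha) / beta)."""
    if beta <= 0:
        raise ValueError("beta must be positive for throughput to have a peak")
    return math.sqrt((1 - alpha) / beta)

# Hypothetical coefficients, for illustration only.
ALPHA, BETA = 0.05, 0.001

for n in (1, 8, 16, 31, 64, 128):
    print(f"{n:>4} nodes -> {usl_throughput(n, ALPHA, BETA):.2f}x single-node throughput")
print(f"Throughput peaks near {peak_nodes(ALPHA, BETA):.0f} nodes")
```

With these example coefficients, throughput rises, flattens, and then falls as nodes are added, which is the behavior to watch for when a cluster keeps growing but query performance does not.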
When your user-facing analytics start to suffer from the issues stated above, you’ll need a solution that addresses all of these challenges and complements your existing data platform. Apache Pinot gives you real-time and historical views of data while providing concurrency, data freshness, and low query latency without invasive disruption. Pinot comes with a rich set of indexes that enable low-latency queries, a columnar data storage format, and fast ingestion from data streaming platforms. These features enable LinkedIn, a well-known Pinot user, to serve millions of users with their analytical metrics such as number of profile views and up-to-the-second statistics on who has been viewing your profile.
When you are faced with the daunting prospect of real-time requirements, the benefits of making the change from traditional OLAP to real-time OLAP easily outweigh the costs of the transition. Just like with any infrastructure choice, the urgency and timing of that switch involve thoughtful consideration across a number of metrics, which this post should help you navigate as you manage your evaluation process. We have highlighted above some of the bigger factors to think about when considering whether you need real-time OLAP, but here’s a more comprehensive list:
Dissatisfied users because of long query times
Serving real-time and historical analytics
Increasing demand for data freshness
Expanding clusters that serve your analytics just for compute
Increasing data size to petabytes
UPSERT requirements that provide more accurate analytics
Full SQL semantics needed to answer complex queries
Increasing use cases for analytical data
Ready for real-time OLAP? Try StarTree Cloud for free for the easiest way to get started with Apache Pinot.