A Journey into Apache Pinot™ and Real-Time Analytics
Some History
If we’re going to talk about real-time analytics, it’s probably good to have a bit of historical perspective. One of the good things about being older is that, well, you know a lot of history. So let’s go back to 1990. (Insert weird time-travel music here.)
My first job in this industry—if I can call it that—was to write a production forecasting program for a sports apparel company. I’d never written a computer program before, but I figured, “how hard can that really be?” If you ever say that to yourself, RUN.
A relative at the time was running this small sports apparel company (that supplied all the competition gear to the US National Gymnastics team), and her processes were overly manual. Once a year, she and her top leaders would lock themselves away at her house for a week with pencils and adding machines (no, really, pencils and adding machines!) and go over all the sales figures from the previous year, forecast sales for the coming year for each product line, and from that would forecast how much fabric of each color/type to order throughout the year.
I pointed out that a single small mistake on Monday could have a catastrophic effect that lasted all year. She acknowledged that this had actually happened before. I believe my words were, “this is what computers are for!” She replied: “so build it for me!” I was about to learn a lot.
She bought me a fancy new Mac LC and I got started. What I ended up writing was essentially a complete ERP system, in HyperCard, on a Mac. That was my “Hello, world.” It allowed her to update forecasting numbers for each product she produced at any time and it would update the amount of fabric, production times, etc. And she could run numbers for any product in her catalog and add new products at any time.
Furthermore, she could run weekly reports to see how her sales were going and make weekly adjustments. This was a game-changer for her and her company. In 1990, this was also state-of-the-art analytics and forecasting and it transformed her business from a niche player to a major manufacturer of athletic and activewear.
Analytics changed her business. And let’s be clear: those were rudimentary analytics at best. The world has evolved considerably since then and what we need and expect from our analytics has shifted dramatically.
Moving Analytics Forward
In the decades since that time, analytics has advanced to the point that 1990s me wouldn’t even recognize it. It has not only moved forward, but also down the org chart.
While that old analytics app was built for the CEO (and a few of her VPs), analytics has moved down the org chart to include almost everyone in an organization. We all make “data-driven decisions” now. In DevOps, we look at system performance and application usage metrics. In IoT, we look at environmental data to inform our decisions. In e-commerce, we look at sales and inventory data to get insights into our business.
In addition to moving down the org chart, analytics has also become more and more immediate. The numbers for the last quarter still matter for spotting trends, but the movement of key indicators over the past few hours has become more important. Last month was ages ago. Yesterday was a long time ago. What’s happening now is what matters to folks on the front lines.
New Constituents
While the consumers of metrics and analytics have moved down the org chart and their needs have become more immediate, there has also been a push to make those analytics available directly to application users and customers.
We now interact with businesses, as customers, digitally through apps and customer portals. Obviously, tools like the one I talked about earlier are entirely inadequate for such uses. In fact, most of the tools we have built over the last decade or so are not up to the task of providing these kinds of analytics.
We are living in a real-time world, but we have been trying to get by with batch-process tools.
Real-Time Analytics has entered the chat
In this new real-time world we are no longer as interested in what the state of the world was last month, last week, yesterday, or even earlier today. We need to know what the state is now. Right now.
Let’s look at something like Uber Eats. When I open the app and am trying to make a decision of what, and where, to order, the wait times from yesterday aren’t going to be any good to me in informing my decision. Even the state of the queue this morning isn’t of any real value to me. I want to know what the wait times are at the three closest burger joints right now so that I can be confident that if I place an order it will be delivered in a timely fashion.
That is an analytics query made by me, an end user, not by a VP at Uber Eats (who will later want to see reports of average wait times, etc.). That’s user-facing analytics. And end users are not a patient lot, by and large. If you’ve gone to a website to buy something and it takes 15 seconds to load, chances are you left 10 seconds into it. Response times are critical—and even five seconds for a query is maybe 100x too long.
This shift to user-facing analytics, with the response times and data freshness it demands, has entirely changed the game for analytics.
It’s a new multi-dimensional ballgame
In the before-times, when I was using my data to generate reports, or even some dashboards, it wasn’t a big deal if a query took a while to run. Building a report? It’s fine if my query takes tens of seconds or even minutes. It’s just a report. And the chances are that I’m the only one—or one of only a few people—making the query to build that report.
In the real-time times, it’s decidedly not okay if a query takes tens of seconds. It’s not even okay if it takes single-digit seconds. Remember, users aren’t patient, and a sluggish app doesn’t capture the transaction, or worse, doesn’t get used anymore. So I have to respond to queries in the tens to hundreds of milliseconds now. Responses have to feel like they are instant, and anything over 300ms no longer feels that way to users.
But now we’re adding another dimension to this query-response problem. It’s not just one, ten, or even 100 users making the query. We are now scaling these queries to tens or hundreds of thousands of queries being run simultaneously. In essence, we have increased the requirements geometrically from tens of queries per second with a response time of seconds to tens of thousands of queries per second with response times in the tens of milliseconds.
The chance that your existing analytics platform will be able to make this adjustment is vanishingly small.
And if this isn’t discouraging enough, let’s add on one more requirement: data freshness. You will no longer be able to rely on data from yesterday, or even a few hours ago, to respond to these queries. No, our results need to reflect real-world events that have happened within the past few seconds. Real-Time means Right. Now. Queries right now. Data right now. Lots of it. Instantly.
So to recap the new requirements: I need data that is only seconds old, and I need to be able to make tens or hundreds of thousands of queries per second, and I need those queries to return in under 300ms. I need high freshness, high concurrency, and low latency. All at once. We’ve now changed the requirements by multiple orders of magnitude in multiple dimensions all at once.
It’s a continuum
I don’t want to suggest that the older data analysis methods are no longer relevant. That’s not the case. The space is now large enough to accommodate a variety of tools that exist on a continuum of trade-offs in storage and compute architectures. We have great tools like BigQuery, Snowflake, Presto, and many more that are still extremely useful in day-to-day analytics applications because of the different performance characteristics and cost curves they offer. Yes, I know those examples are all very different tools, but they all have their applications in data analytics, especially internal-facing analytics like dashboarding and reporting.
But if you try to apply any of those existing tools to real-time, user-facing analytics, you’re going to face the same issues that led to the creation of Pinot in the first place: data freshness, ingestion capacity, and query latency.
Apache Pinot has entered the chat
A few years ago, some motivated engineers saw this change coming and decided to build a database specifically for the demands of real-time, user-facing analytics. They built it to address operational requirements at LinkedIn, where they worked at the time.
They saw the need to fundamentally change how data was ingested into the system, how the database itself was structured, and how queries were executed, all while maintaining the same SQL-based interface to the database—something we already know how to use. Those SQL queries can now deliver sub-second results on extremely fresh data at scale.
Analytics databases don’t so much receive writes as they do ingest data from event sources in the transactional data infrastructure. If you’re delivering real-time analytics, you’ll most likely have multiple sources of data to ingest, ranging from platforms like Kafka to a more traditional data lake or even good old relational databases. Just understand that the amount and velocity of this incoming data is likely to be very high.
At its core, Pinot is just a database—and ‘just’ is doing a lot of work in that sentence. It’s a tabular database with a lot of innovative ways of optimizing and storing data, but at its core it presents a familiar interface in which data is in tables, and the tables have columns with all the usual data types, and you query them with SQL.
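To make that concrete, here’s a minimal sketch of what an application-side query might look like using the community pinotdb Python client. The connection details, table, and column names (delivery_events, restaurantId, waitTimeSeconds, eventTimeMillis) are hypothetical placeholders for illustration, not anything Pinot ships with.

    # A sketch of a user-facing query against a Pinot broker via the pinotdb client.
    # Table and column names are hypothetical; point host/port at your own broker.
    import time

    from pinotdb import connect

    conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
    curs = conn.cursor()

    # "Which restaurants have the shortest waits right now?"
    # Only consider events from the last five minutes.
    cutoff_millis = int(time.time() * 1000) - 5 * 60 * 1000
    curs.execute(f"""
        SELECT restaurantId, AVG(waitTimeSeconds) AS avg_wait
        FROM delivery_events
        WHERE eventTimeMillis > {cutoff_millis}
        GROUP BY restaurantId
        ORDER BY avg_wait ASC
        LIMIT 3
    """)

    for restaurant_id, avg_wait in curs:
        print(restaurant_id, avg_wait)

The point is simply that the read path looks like any other SQL database; the freshness, concurrency, and latency characteristics are where Pinot differs.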
The real innovation is how data is indexed, which governs the performance and flexibility of the read path. A large and growing collection of sophisticated index types allows for efficient reads in real time. There’s a text index based on Apache Lucene. A geospatial index drives applications like Uber Eats and other location-based services. There’s a nested JSON index that allows you to index on arbitrarily-nested JSON objects. And then there’s the Star-Tree index, which is something like a pivot table in a spreadsheet, only available in real time over an entire table.
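To give a sense of how these indexes show up in practice, here’s a rough sketch of the index-related portion of a Pinot table config, written as a Python dict mirroring the JSON. The column names are made up, and the exact keys (especially in the Star-Tree section) can vary between Pinot versions, so treat it as illustrative rather than copy-paste ready.

    # Illustrative fragment of a Pinot table config, as a Python dict mirroring the JSON.
    # Column names are hypothetical; exact keys may differ across Pinot versions.
    index_config_fragment = {
        "tableIndexConfig": {
            # Fast exact-match filtering on dimension columns.
            "invertedIndexColumns": ["restaurantId", "city"],
            # Query into arbitrarily nested JSON payloads.
            "jsonIndexColumns": ["orderPayload"],
            # Star-Tree: pre-aggregated rollups, like a pivot table maintained in real time.
            "starTreeIndexConfigs": [
                {
                    "dimensionsSplitOrder": ["city", "cuisine"],
                    "functionColumnPairs": ["SUM__waitTimeSeconds", "COUNT__*"],
                    "maxLeafRecords": 10000,
                }
            ],
        }
    }

Text and geospatial indexes are declared in a similarly declarative way; the broader point is that you pick indexes per column to match the queries you expect, rather than hand-building materialized views.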
All of this allows you to ingest multi-dimensional data at high velocity and index on the order of a million events per second. And we’ve seen this in the real world, so we’re not just talking about theoretical maximums here: LinkedIn, using Apache Pinot, ingests over 1 million events per second and handles over 200,000 queries per second in production use.
Beyond dashboards
With all this data it’s tempting to think of all the cool dashboards you could build to see what’s happening, operationally, across your infrastructure. And yes, you can do that, but remember that dashboards are mostly for internal consumption. I mean, are your end-users interested in the ingestion rates or average JVM garbage collection time? No, they are interested in what the wait time is for that hamburger and how accurate that wait time is. Dashboards are not the endgame here.
You can now make these queries available to thousands, even hundreds of thousands, of concurrent users, giving you the ability to present actionable insights to the world in ways you couldn’t before. Yes, you can still create internal-facing dashboards, but you can also produce customer-facing features driven by analytics queries. Uber Eats provides real-time analytics directly to its end users, allowing customers to make better-informed purchasing decisions. With the same data, it provides operational dashboards to participating restaurants so that those restaurants can better plan for, and respond to, rushes and high-traffic times.
As you make this transition to real-time analytics, you have a lot of choices to make. What data can and should you make available to your customers? How will you make this data available in ways they can actually use? Where does it currently live, and what will be the costs involved in making it accessible? In making this transition, you can start to operationalize a lot of those costs by moving them to an infrastructure layer, which will also better position you to continue leveraging the power of your data for both external- and internal-facing consumers.
Do you have Kafka in your infrastructure? By directly ingesting data from Kafka, you can essentially query the topics in those streams of data via Pinot without building a bunch of bespoke Kafka Streams applications. You can separate the operational concerns—the infrastructure—from the application concerns, which is how it should be.
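As a sketch of what that wiring looks like, a REALTIME table’s config simply points Pinot at the topic. The table name, topic, and broker address below are placeholders, and the exact property keys and plugin class names depend on your Pinot and Kafka versions.

    # Illustrative fragment of a REALTIME table config that consumes a Kafka topic.
    # All names, addresses, and class strings here are examples; check the Pinot docs
    # for the exact values that match your Pinot and Kafka versions.
    realtime_table_sketch = {
        "tableName": "delivery_events",
        "tableType": "REALTIME",
        "tableIndexConfig": {
            "streamConfigs": {
                "streamType": "kafka",
                "stream.kafka.topic.name": "delivery-events",
                "stream.kafka.broker.list": "kafka-broker:9092",
                "stream.kafka.consumer.type": "lowlevel",
                "stream.kafka.consumer.factory.class.name":
                    "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
                "stream.kafka.decoder.class.name":
                    "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
            }
        },
    }

Once a table like this is created, events arriving on the topic typically become queryable with the same SQL shown earlier within seconds, with no bespoke stream-processing application in between.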
A new game entirely
I know earlier I used the phrase “it’s a new ballgame,” but really—it’s an entirely new game. We’ve figured out how to unlock the value that exists in your data by expanding beyond dashboards and reports. You can now use that data to deliver new services to your customers.
While analytics has existed for a long time, this transformation from dashboard-focused, internal analytics to user-facing analytics is a fundamental shift in how businesses unlock the power of their data. It doesn’t matter if you’re a social media company, a food delivery service, or any other kind of user-facing business: you can extract new value from your existing data, add new data streams at will to increase its dimensionality, and then deliver new, innovative services to your users.
You can now empower your users to make better decisions, and react to those decisions yourself, in real time. It’s a new game altogether, one that you and your users can play together to make things better. That’s what real-time, user-facing analytics is, and it’s what Pinot is here to help you unlock.
You can enter the chat
If you’d like to learn more about Real-Time Analytics, please register for the Real-Time Analytics Summit in San Francisco, April 25-26.