CASE STUDY

Token-Level Observability for LLMs: How Together AI Does It

As Together AI’s popularity soared and token volumes surged into the billions per hour, they needed an observability platform that could keep up and allow deep inspection of usage. They chose StarTree Cloud and, in just 30 days, launched new dashboards that delivered sub-second query latency, high-cardinality slicing and dicing, and near-instant visibility into usage.

Sub-second query latency across billions of token events
High-cardinality slicing and dicing (by API key, model, user, region, etc.)
10-second freshness windows enabling near-instant visibility into LLM usage

As large language models (LLMs) become core infrastructure, a new engineering challenge is taking shape: capturing how these models are used—and how they behave—as it happens. It’s no longer sufficient to count API calls or log failures after the fact. Teams need real-time answers to questions like: Who ran this prompt? On which model? How many tokens did it use? What did it cost? Why did it fail?

This is the heart of LLM Observability—the ability to track usage and performance with high dimensionality and low latency. It enables providers to build smarter infrastructure, power accurate billing, detect abuse, and debug faster. For customers, it means transparency, control, and confidence.
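
To make those dimensions concrete, here is a minimal sketch, in Python, of the kind of per-request record such a pipeline might capture. The field names are illustrative assumptions, not Together AI’s actual schema.

```python
from dataclasses import dataclass

# Hypothetical per-request usage event. Every field name here is illustrative,
# not Together AI's actual schema.
@dataclass
class TokenUsageEvent:
    timestamp_ms: int        # when the request completed
    api_key: str             # who ran the prompt
    model: str               # which model served it
    region: str              # where it was served
    prompt_tokens: int       # tokens consumed by the prompt
    completion_tokens: int   # tokens generated by the model
    latency_ms: int          # end-to-end latency
    cost_usd: float          # what it cost
    status: str              # "ok" or an error code, for the "why did it fail?" question
```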

Together AI, a frontier AI cloud platform for building and deploying the next generation of AI solutions, faced this challenge at full scale. As one of the fastest-growing platforms for open-weight model inference and fine-tuning, Together AI supports over 200 models like DeepSeek, Kimi K2, and OpenAI’s open models. They operate a global network of high-performance data centers and serve AI-native customers like ElevenLabs and Hedra—delivering low-latency, cost-efficient compute with full control over model choice.

That focus on performance extended naturally to their data systems—but the original stack wasn’t designed for observability. With limited streaming, orchestration, and interactive analytics, it served the team well at first, until rapid growth exposed its limits.

The Token Fog Problem: When Usage Outpaces Insight

As Together AI’s platform scaled and token volumes surged into the billions per hour, their engineering team began to feel the strain of an analytics stack that wasn’t optimized for observability. What had started as a backend monitoring layer quickly became a critical requirement across the business.

  • Customers wanted real-time dashboards with prompt-level breakdowns to track usage and manage spend.
  • Engineers needed live visibility to debug latency spikes, trace errors, and optimize GPU allocation.
  • Finance teams required precise, token-level attribution to model costs accurately and support usage-based billing.

All of these needs pointed to the same fundamental gap: a lack of real-time, high-cardinality insight that could be queried interactively—not hours later via batch jobs. Traditional data warehouses were too slow. Metrics tools couldn’t support the dimensionality. Ad hoc pipelines broke under pressure. After nine months of hitting walls, a proof of concept with StarTree, powered by Apache Pinot™, delivered results in just one week.

When evaluating any solution for LLM observability, two dimensions matter most: freshness and granularity. Freshness determines how current your insights are—can you monitor usage from the past 10 seconds, or are you stuck looking at yesterday’s data? Granularity reflects how deeply you can inspect that usage—can you trace behavior down to individual prompts, models, and API keys, or are you limited to coarse, high-level summaries?

Most tools fall short on both counts. OpenAI’s billing dashboard, for instance, updates only periodically: even under ideal conditions, the fastest it gets is around 5 to 10 minutes, and even then only for coarse-grained data sets. By contrast, Together AI now delivers 10-second freshness with fine-grained visibility. But even that’s just the beginning.

Their longer-term vision pushes beyond token-level metrics into the internals of the model itself—down to what they have coined Tensor Compute Units (TCUs), the fine-grained tensor-level operations that make up an LLM’s inference process. TCUs represent the most granular layer of compute that still carries high-level semantic information, and tracking them enables Together AI to analyze exactly how much infrastructure load each prompt generates, which tensors are computed, and why certain interactions trigger higher latency or cost.

The Solution: StarTree, Powered by Apache Pinot™

Together AI adopted StarTree Cloud to power its real-time analytics layer. StarTree ingests event streams from Kafka and stores them in Apache Pinot’s columnar OLAP engine. Within 30 days, the team launched real-time dashboards for both internal use and customer access.
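
As a rough illustration of that ingestion path, the sketch below shows what a Pinot real-time table definition consuming a Kafka topic of token events could look like. The table, topic, column, and broker names are assumptions made for this example, not Together AI’s production configuration.

```python
# Simplified sketch of a Pinot REALTIME table config fed from Kafka.
# All names (table, topic, columns, broker address) are illustrative assumptions.
token_events_table = {
    "tableName": "token_events",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "schemaName": "token_events",
        "timeColumnName": "timestamp_ms",
        "replication": "1",
    },
    "tenants": {},
    "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "token-usage-events",
            "stream.kafka.broker.list": "kafka:9092",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        },
    },
    "metadata": {},
}
# In a real deployment, this definition would be submitted to the Pinot
# controller (for example, via its /tables REST endpoint).
```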

Key features of the solution include:

  • Sub-second query latency across billions of token events
  • High-cardinality slicing and dicing (by API key, model, user, region, etc.)
  • 10-second freshness windows, enabling near-instant visibility into LLM usage
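
As a hedged sketch of what an interactive query against such a table might look like, the snippet below groups the last five minutes of token usage by model and API key through the Pinot broker’s SQL endpoint. The broker URL, table name, and column names are assumptions for illustration.

```python
import time

import requests  # third-party HTTP client, assumed to be installed

# Assumed broker address and schema; adjust for a real deployment.
PINOT_BROKER_SQL = "http://localhost:8099/query/sql"

cutoff_ms = int((time.time() - 300) * 1000)  # events from the last five minutes

sql = f"""
    SELECT model, api_key,
           SUM(prompt_tokens + completion_tokens) AS total_tokens,
           AVG(latency_ms) AS avg_latency_ms
    FROM token_events
    WHERE timestamp_ms > {cutoff_ms}
    GROUP BY model, api_key
    ORDER BY total_tokens DESC
    LIMIT 20
"""

resp = requests.post(PINOT_BROKER_SQL, json={"sql": sql}, timeout=10)
resp.raise_for_status()
for row in resp.json()["resultTable"]["rows"]:
    print(row)  # [model, api_key, total_tokens, avg_latency_ms]
```

Because the table is fed directly from the event stream, a query like this reflects requests made only seconds earlier rather than hours-old batch aggregates.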


Making LLMs Observable: Inside StarTree’s Real-Time Telemetry Engine

StarTree, built on Apache Pinot™, offers a uniquely powerful foundation for LLM observability, where speed, cardinality, and concurrency collide.

“Star-tree index made a huge difference. We went from full table scans to hitting pre-aggregated metrics by cluster, model, or user with extremely low data and query latency. We went from 10 seconds latency to 7 milliseconds!”

Tanmay Movva and Pablo Ferrari
Engineers at Together AI

  • Aggregations at Ingest: The Star-tree index pre-aggregates metrics like token usage, latency, and error counts by API key, prompt template, user, and model—right at ingestion. This enables sub-second queries on billions of rows and several terabytes of data without costly on-the-fly computation, making real-time billing and usage dashboards not just possible, but fast (a configuration sketch follows this list).
  • Advanced Indexing, Including Text: StarTree supports rich indexing strategies, including inverted indexes, bloom filters, and text indexing. This allows teams to query not just structured fields like region or user ID, but also search prompt templates and detect patterns in free-form input—critical for debugging and abuse detection in LLM applications.
  • Scalable Upserts: LLM logs are noisy and fast-moving. StarTree’s upsert capabilities ensure the latest state—like token counts, errors, or model output summaries—is always queryable without duplication or stale data, even across high-ingest environments with streaming updates.
  • Precise Fetching for Tiered Storage: When dealing with cold storage like S3, StarTree’s Precise Fetching only retrieves the exact bytes needed—specific columns, blocks, or indexes—avoiding full segment loads. This reduces storage costs by over 50% while preserving interactive performance, a critical balance for extreme volumes of LLM data exhaust.
  • Concurrent Execution Pipeline: Unlike systems that process queries serially, StarTree pipelines data fetch and query execution in parallel. This maximizes throughput and ensures real-time dashboards stay fast even during ingestion bursts or when serving many users at once.
  • Pinned Metadata for Fast Filtering: Key index structures—like bloom filters, dictionaries, and Star-Tree nodes—are kept in local memory. That means queries can rapidly skip irrelevant segments and home in on just the data that matters, whether you’re looking up errors by prompt ID or token usage by model region.
  • PromQL + Grafana Support: StarTree offers built-in support for PromQL and seamless integration with Grafana, enabling engineering teams to reuse familiar tools and query languages for building real-time LLM observability dashboards—without re-architecting their monitoring stack.
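
To make the first two items above more concrete, here is a hedged sketch of what a star-tree index definition and text-search queries could look like for a token-events table like the one sketched earlier. The column names and aggregation pairs are illustrative, not Together AI’s actual configuration.

```python
# Illustrative fragment of a Pinot tableIndexConfig: a star-tree index that
# pre-aggregates token counts along the dimensions dashboards group by.
# Column names are assumptions, not Together AI's actual schema.
star_tree_index_configs = [
    {
        "dimensionsSplitOrder": ["model", "api_key", "region"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": [
            "SUM__prompt_tokens",
            "SUM__completion_tokens",
            "COUNT__*",
        ],
        "maxLeafRecords": 10000,
    }
]

# With a text index on a prompt_template column, free-form search and regex
# filters become ordinary WHERE clauses (illustrative queries):
text_search_sql = """
    SELECT api_key, COUNT(*) AS matches
    FROM token_events
    WHERE TEXT_MATCH(prompt_template, '"rate limit"')
    GROUP BY api_key
"""

regex_sql = """
    SELECT model, COUNT(*) AS errors
    FROM token_events
    WHERE regexp_like(error_message, 'timeout|OOM')
    GROUP BY model
"""
```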

“Being able to index and query by text and regex directly was a surprise win.”

Tanmay Movva and Pablo Ferrari
Engineers at Together AI

Conclusion

LLM observability isn’t just about tracking usage—it’s about transforming infrastructure data into a product experience. For too long, usage dashboards have functioned like static receipts: delayed, high-level, and disconnected from real-time decisions. Together AI flipped that script. With StarTree, they made observability part of the product itself—dynamic, interactive, and actionable.

This shift redefines what customers expect and what providers can deliver. Because when your infrastructure becomes someone else’s user experience, telemetry can’t be an afterthought. It has to be fast, fine-grained, and built for the way LLMs actually work.
