Clickstream and Funnel Analysis with Apache Pinot and StarTree

Part 2 - Technical Deep Dive with Examples

In this second part of our blog series on clickstream funnel analysis, we dive deeper into Pinot’s funnel analysis functions, demonstrate practical implementation with real-world examples, and explore how to leverage these capabilities for actionable business insights.

Written By

Vivek Sinha

Product Manager

Published

November 12, 2025

Reading Time

Clickstream Funnel Analysis with Apache Pinot and StarTree — Part 1

Clickstream analytics involves analyzing the sequence of actions users take as they navigate through websites or applications. Funnel analysis, a critical subset of clickstream analytics, tracks users through predefined sequences of events to measure conversion rates and identify points of drop-off.

Common funnel analysis use cases include:

E-commerce purchase funnels (view → cart → checkout → purchase)
User onboarding flows (signup → verification → profile completion)
Content engagement paths (landing → view → share → subscribe)
Marketing campaign effectiveness (impression → click → conversion)

Challenges of Funnel Analysis in Traditional Systems

Funnel analysis, which involves tracking how many users move from one event to the next, sounds simple, but at scale, it quickly becomes expensive. Traditional setups rely on ETL pipelines, such as Apache Spark or Apache Flink, to pre-aggregate event logs into summary tables, or on SQL warehouses like Snowflake and BigQuery, which utilize complex joins and window functions. Both approaches struggle because ETL jobs are slow and rigid, while warehouse queries become costly and sluggish on billions of rows. More importantly, neither can handle real-time use cases such as identifying funnel drop-offs or anomalies in the last few minutes, where insights must reflect the latest user activity almost instantly.

Data freshness is not just a technical concern; it has a direct business impact. For fast-moving organizations in e-commerce, fintech, or SaaS, even a few hours of analytical delay can mean missed opportunities, lost conversions, or poor customer experiences. A funnel drop-off detected at the end of the day is too late to fix. Teams need to see what is happening right now so they can act immediately. Apache Pinot was designed for this kind of responsiveness. It keeps data fresh within minutes while still supporting complex analytical queries without waiting for heavy ETL pipelines or pre-aggregations.

Performance is equally important. Traditional systems were not built to handle hundreds of concurrent queries on terabytes of data with low latency. They require expensive computing and caching layers to achieve interactive performance. Apache Pinot’s distributed, columnar architecture makes this possible by optimizing how data is indexed, stored, and retrieved. As a result, teams can explore funnels, understand drop-offs, and analyze user journeys across billions of events in real time, with queries that return in milliseconds instead of minutes.

Funnel Analysis with StarTree

Many of the world’s largest digital businesses, including a major e-commerce leader, use StarTree to power funnel analytics at massive scale. In one such deployment, the hot tier stores roughly a petabyte of data over a two-month retention window, ingesting close to one million events per second with an ingestion latency of under five seconds. The system handles hundreds of interactive queries per second across nearly one million segments, representing over a trillion records. Despite this scale, StarTree consistently delivers sub-second median query latency, with p95 queries completing in under four seconds, enabling teams to perform rich, real-time behavioral and identity-based funnel analyses with ease.

Apache Pinots offer built-in funnel analytical functions that make this even simpler. Functions like FunnelMaxStep, FunnelMatchStep, FunnelCompleteCount, and FUNNEL_COUNT let you compute funnel metrics directly at query time without pre-aggregating data. These functions allow you to express how far each user progressed in a funnel, which steps they completed, or how many times they reached a full conversion, all while supporting flexible modes for ordering, deduplication, and overlapping sequences.

Below, I’ll walk through each function, its syntax, semantics, examples, internal behavior notes, and caveats.

FunnelMaxStep (How far did each user get?)

Purpose:

For a given user (or correlation key) and time window, compute the maximum funnel step they achieved.

Syntax

FunnelMaxStep(
  timestampExpr,
  windowSize,            -- in milliseconds
  numberSteps,            -- e.g. 3
  stepExpr1, stepExpr2, … -- Boolean predicates for each step
  [ mode1, mode2, … ]     -- optional mode strings
)Code language: SQL (Structured Query Language) (sql)

timestampExpr is an expression (often the event timestamp column) that orders events.
windowSize bounds how far apart in time the first and last matching event can be.
numberSteps must match the count of stepExprs you provide.
modes are optional strings that can modify matching behavior.

Semantics & Modes

Apache Pinot evaluates events per correlation key (e.g., user_id). It sorts their events by timestampExpr and then attempts to match the sequence of stepExprs in order, subject to time constraints and mode rules. It returns an integer in [0 .. numberSteps] representing the furthest (highest numbered) step satisfied.

Supported modes:

STRICT_DEDUPLICATION: If the same step’s predicate would fire multiple times before advancing, it prevents reusing the same predicate for multiple progressions.
STRICT_ORDER: Enforce that no intervening non-step events appear; if an event occurs that doesn’t match the next expected step, the funnel matching stops.
STRICT_INCREASE: Enforce strictly increasing timestamps (i.e. no two events can “tie” or go backward); ensures temporal monotonicity.
KEEP_ALL: Include all events (matching or not) in consideration; helps with debugging or context retention.

Example

Suppose we have a clickstreamFunnel events table:

event_name	ts	user_id
view_product	1718112402	U1
add_to_cart	1718112403	U1
purchased	1718112404	U1
view_product	1718112405	U1
add_to_cart	1718112406	U1
purchased	1718112407	U1
view_product	1718112405	U2
add_to_cart	1718112406	U2
purchased	1718112407	U2
view_product	1718112404	U3
add_to_cart	1718112405	U3
cart_viewed	1718112406	U3
purchased	1718112407	U3
view_product	1717939609	U4
add_to_cart	1718112405	U4
purchased	1718112405	U4

We want to track funnel steps: 1 = view_product, 2 = add_to_cart, 3 = purchased, within a 2-day window (i.e. 48 hours = 172800000 ms).

SELECT
  user_id,
  FunnelMaxStep(
    ts,
    172800000,
    4,
    event_name = 'view_product',
    event_name = 'add_to_cart',
    event_name = 'cart_viewed',
    event_name = 'purchased' 
    -- Optional modes can follow, e.g. 'STRICT_INCREASE', 'STRICT_DEDUPLICATION'
  ) AS max_step
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_id;Code language: SQL (Structured Query Language) (sql)

Result:

user_id	max_step
U1	2
U2	2
U3	4
U4	2

Here, U3 progressed all the way to step purchased; however, U1, U2, and U4 stopped at add_to_cart or missed an event cart_viewed.

If we had STRICT_DEDUPLICATION, and if U1 had repeated view_product events before adding to cart, the duplicates would not allow “re-consuming” the same predicate.

Under the hood, Pinot’s funnel engine is heavily optimized to execute complex sequence matching at scale. It processes events in a left-to-right order through the sorted event stream for each user, applying time window and mode constraints to determine valid progressions.

During query execution, Pinot performs early filtering to discard events that do not satisfy any funnel step predicates, reducing the data scanned per user. Event types are dictionary-encoded and internally mapped to integer IDs to speed up predicate evaluation and comparison. The engine also prunes users whose earliest qualifying event falls outside the query window, preventing unnecessary traversal through irrelevant data. Since funnel computation happens in a distributed manner across multiple segments, each server produces a partial progress state per user, which the broker later merges intelligently. This segment-level parallelization, combined with encoded lookups and early pruning, allows Pinot to evaluate multi-step funnels on billions of events with sub-second latency.

FunnelMatchStep (Which steps did the user match?)

Purpose:

Instead of returning the max step as a single integer, FunnelMatchStep returns an array (vector) of bits showing exactly which steps matched in order (1 or 0).

Syntax

FunnelMatchStep(
  timestampExpr,
  windowSize,
  numberSteps,
  stepExpr1, stepExpr2, …,
  [ mode1, mode2, … ]
)Code language: SQL (Structured Query Language) (sql)

It uses the same parameters as FunnelMaxStep, but the return type is an array of length numberSteps, e.g., [1,1,0].

Semantics

If numberSteps = 3 and the user progressed to step 2 (i.e., matched predicates 1 and 2), the result would be [1,1,0].
If they didn’t match any, you’d get [0,0,0].
FunnelMatchStep does the same internal matching logic as FunnelMaxStep (respecting modes), but gives you the “pattern” of success/failure across all steps.

Example (same schema as above):

SELECT
  user_id,
  FunnelMatchStep(
    ts,
    172800000,
    3,
    event_name = 'view_product',
    event_name = 'add_to_cart',
    event_name = 'cart_viewed'
  ) AS match_array
FROM clickstreamFunnel
GROUP BY user_id;Code language: SQL (Structured Query Language) (sql)

Result:

user_id	match_array
U1	[1,1,0]
U2	[1,1,0]
U3	[1,1,1]
U4	[1,1,0]

Here, U1, U2, and U4 did not perform the cart_viewed step; only U3 did, hence the result.

FunnelMatchStep is useful when you want to compute per-step metrics or combine it with array functions (e.g., sum the bits across users to get counts per step).

FunnelCompleteCount (How far did each user get?)

Purpose:

Count how many times the full funnel (all steps) is completed per user (or correlation key), within the time window. Useful when the funnel can repeat (e.g., multiple conversion cycles).

Syntax

FunnelCompleteCount(
  timestampExpr,
  windowSize,
  numberSteps,
  stepExpr1, stepExpr2, …,
  [ mode1, mode2, … ]
)Code language: SQL (Structured Query Language) (sql)

Semantics

It will scan the events for a correlation key, match sequences of all numberSteps predicates in order, and count how many non-overlapping completions occur (subject to window constraints and modes). Return is an integer (0, 1, 2, …).

Example: In a billing or usage app, a user may go through the funnel multiple times (e.g., view_product → add_to_cart → purchased). Suppose we want to see how many times in the last 30 days a user went through steps A → B → C.

SELECT user_id,
  funnelCompleteCount(
    ts,
    '1000000',
    3,
    event_name = 'view_product',
    event_name = 'add_to_cart',
    event_name = 'purchased'
  ) as rounds
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_idCode language: SQL (Structured Query Language) (sql)

Result:

user_id	completions
U1	2
U2	1
U3	1
U4	1

Notice, U1 has gone through the entire 3 steps mentioned in the above query twice, and all other users have done it once.

Let’s take another example to show how to filter out-of-order events using modes

SELECT user_id,
  funnelCompleteCount(
    ts,
    '1000000',
    3,
    event_name = 'view_product',
    event_name = 'add_to_cart',
    event_name = 'purchased'
  ) as rounds
FROM clickstreamFunnel
GROUP BY user_id
ORDER BY user_idCode language: SQL (Structured Query Language) (sql)

Result:

user_id	completions
U1	2
U2	1
U3	0
U4	0

In the above example, U3 has done a step, cart_viewed, which is not part of the funnel, whereas U4 has made a step, purchased, with event time overlap with two events, and they are tied. To filter this out, we need to use the modes STRICT_ORDER and KEEP_ALL. U3 and U4 are filtered out, and hence no completions for the order of the events.

FUNNEL_COUNT (Aggregation-style counts across the cohort)

While the above UDFs operate per correlation key (and are often used in GROUP BY), Pinot also supports a more declarative aggregation-style function: FUNNEL_COUNT. FUNNEL_COUNT aggregates distinct correlated counts at each step in one go, returning an array like [count_step1, count_step2, …, count_stepN]. It uses CORRELATE_BY() for the user key and optional SETTINGS() to pick distinct-count strategies (bitmaps, sketches). This function returns an array of counts: for each step, the number of distinct users who reached that step (i.e., intersection logic between steps).

Syntax

FUNNEL_COUNT(
  STEPS ( predicate1, predicate2, … ),
  CORRELATE_BY ( correlation_column ),
  SETTINGS ( setting1, setting2, … )
)Code language: SQL (Structured Query Language) (sql)

Example:

SELECT
  FUNNEL_COUNT(
    STEPS(
event_name = 'view_product',
     event_name = 'add_to_cart',
	event_name = 'cart_viewed',
     event_name = 'purchased'
    ),
    CORRELATE_BY(user_id),
    SETTINGS('bitmap')   -- or a sketch setting, per docs
  ) AS funnel_counts
FROM clickstreamFunnel
WHERE event_name IN ('view_product','add_to_cart','cart_viewed', 'purchased');
Code language: PHP (php)

Result:

funnel_counts

4, 4, 1, 1

The output gives a view into how many users have completed the event types mentioned in the query, following the order. All 4 users performed view_product and add_to_cart, but only U3 performed all 4 steps.

This function is particularly useful when the questions you want to answer are:

How many users entered the top of the funnel?
How many of these users proceeded to the second step?
How many users reached the bottom of the funnel after completing all steps?

Optimizations for Funnel Analysis in Pinot

Running funnel queries at scale is not just about the functions themselves; it’s about shaping the data and the cluster so that queries remain fast, reliable, and cost-efficient. Over time, several optimizations have proven critical when teams deploy funnel analysis in production Pinot clusters:

Partitioning by Time (event_ts): Ensures queries that filter on time windows (e.g., last 7 days) scan only the relevant partitions, reducing unnecessary reads.
Partitioning by User ID: Keeps all events for a user co-located on the same server/segment, simplifying funnel progression tracking and reducing shuffle during query execution.
Segment Sizing: Tuning Pinot segment sizes and merging smaller ones improves scan efficiency and reduces query fan-out across too many small segments.
Out-of-order and late-arriving events are automatically handled as long as each record includes a valid event timestamp.

Execution flow within Pinot

When a query with funnel UDFs is issued in Pinot, execution happens in three stages. At the segment level, each Pinot server processes a subset of events and runs the funnel function for every correlation key in that segment, producing intermediate states such as “user U progressed to step 2 within a partial window.” These intermediate results may include partial bitmaps or counts. In the broker or merge phase, the broker collects results from all segments and merges them per correlation key, resolving partial progress when a user’s steps span multiple segments. The merge logic must carefully respect window and mode semantics to avoid stitching across windows incorrectly. Finally, in the aggregation phase, the system produces the query output: for functions like FunnelMaxStep or FunnelMatchStep, the merged result per user is returned, while for aggregation functions such as FUNNEL_COUNT, distinct counts are reduced across correlation keys and returned as arrays.

Conclusion

StarTree’s funnel analysis capabilities provide powerful tools for understanding user behavior and optimizing conversion funnels. The combination of FunnelMaxStep, FunnelMatchStep, and FunnelCompleteCount functions enables comprehensive analysis of user journeys, from simple conversion tracking to complex multi-path funnel analysis.

Key benefits of using StarTree for clickstream funnel analysis include:

Real-time Analytics: Ultra-low latency queries enable real-time funnel monitoring
Scalability: Handles massive clickstream datasets with horizontal scaling
Flexibility: Supports complex funnel definitions and custom time windows
Integration: Seamlessly integrates with existing BI and visualization tools like Superset or Tableau.
Cost-effectiveness: Open-source foundation reduces vendor lock-in

By leveraging these capabilities, organizations can gain deep insights into user behavior, identify optimization opportunities, and make data-driven decisions to improve their digital experiences and business outcomes.

The disaggregated analytics approach with StarTree Cloud offers a compelling alternative to traditional monolithic analytics platforms, providing greater flexibility, better performance, and reduced costs while maintaining the analytical depth required for sophisticated funnel analysis.

Contents

Confluent White Paper

Data Streaming Report

The Confluent | StarTree 2024 Data Streaming Report taps into the collective wisdom of 4,110 IT leaders to reveal how real-time data streaming is transforming businesses.

Download your free copy

Edit Promo