Analyze JSON Data Using SQL in Apache Pinot

Modern applications generate vast amounts of semi-structured data, with JSON emerging as one of the most popular formats. From user activity logs and IoT sensor readings to e-commerce transactions and API responses, JSON provides the flexibility to represent complex, nested data.
However, querying JSON efficiently—especially at scale—has traditionally been a challenge.
This is where Apache Pinot, a real-time distributed OLAP datastore, excels. Pinot combines subsecond query performance with support for semi-structured data like JSON, enabling fast, interactive analytics at scale. In this article, we’ll explore how to ingest, store, and query JSON data using SQL in Apache Pinot, allowing you to unlock insights from complex datasets with ease.
Why JSON Support Matters in Apache Pinot
JSON’s nested structures provide flexibility but can introduce performance bottlenecks when querying. Traditional OLAP systems often require data flattening or complex ETL pipelines to process JSON efficiently. This slows down analytics and increases storage costs.
Apache Pinot eliminates these challenges by:
- Natively supporting JSON data types for ingestion and querying.
- Providing built-in JSON functions for parsing and manipulating nested data.
- Leveraging optimized indexes to ensure subsecond query performance, even at petabyte scale.
Ingesting JSON Data into Apache Pinot
Pinot supports various ingestion methods: batch (e.g., from S3 or HDFS) and real-time (e.g., from Kafka). JSON data can be ingested directly without complex transformations, or shaped at ingestion time using built-in JSON functions.
Example Schema for JSON Ingestion
Let’s assume we have a Kafka topic with JSON data representing e-commerce transactions:
{
  "transaction_id": "txn_001",
  "user_id": "user_123",
  "timestamp": "2024-02-20T12:34:56Z",
  "items": [
    {"item_id": "item_001", "category": "electronics", "price": 299.99},
    {"item_id": "item_002", "category": "accessories", "price": 29.99}
  ],
  "shipping_address": {
    "city": "San Francisco",
    "state": "CA",
    "zip": "94107"
  },
  "payment_method": "credit_card"
}
This can be defined in a Pinot schema as follows:
{
  "dateTimeFieldSpecs": [
    {
      "notNull": false,
      "granularity": "MILLISECONDS|1",
      "dataType": "STRING",
      "name": "timestamp",
      "format": "SIMPLE_DATE_FORMAT|yyyy-MM-dd'T'HH:mm:ss'Z'",
      "fieldType": "DATE_TIME"
    }
  ],
  "dimensionFieldSpecs": [
    {
      "notNull": false,
      "dataType": "STRING",
      "name": "transaction_id",
      "fieldType": "DIMENSION"
    },
    {
      "notNull": false,
      "dataType": "STRING",
      "name": "user_id",
      "fieldType": "DIMENSION"
    },
    {
      "notNull": false,
      "dataType": "JSON",
      "name": "shipping_address",
      "fieldType": "DIMENSION"
    },
    {
      "notNull": false,
      "dataType": "JSON",
      "name": "items",
      "fieldType": "DIMENSION"
    },
    {
      "notNull": false,
      "dataType": "STRING",
      "name": "payment_method",
      "fieldType": "DIMENSION"
    }
  ],
  "schemaName": "ecommerce_transactions",
  "enableColumnBasedNullHandling": false
}
In this schema:
- The items and shipping_address fields are stored as JSON data types.
- Pinot will handle these fields natively, allowing you to run SQL queries directly on nested fields.
Here’s the Pinot table definition:
{
  "tableName": "ecommerce_transactions",
  "tableType": "OFFLINE",
  "tenants": {},
  "tableIndexConfig": {
    "invertedIndexColumns": [
      "transaction_id",
      "user_id",
      "payment_method"
    ],
    "loadMode": "MMAP",
    "jsonIndexColumns": [
      "items",
      "shipping_address"
    ]
  },
  "metadata": {},
  "ingestionConfig": {
    "segmentTimeValueCheck": false,
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "consistentDataPush": false
    },
    "continueOnError": true,
    "rowTimeValueCheck": true,
    "transformConfigs": [
      {
        "columnName": "shipping_address",
        "transformFunction": "json_format(shipping_address_json_str)"
      },
      {
        "columnName": "items",
        "transformFunction": "json_format(items_json_str)"
      }
    ]
  },
  "isDimTable": false
}
In this table:
- The items and shipping_address fields are stored as JSON data types.
- The ingestion transforms items and shipping_address using the json_format function.
- Pinot will handle these fields natively, allowing you to run SQL queries directly on nested fields.
Querying JSON Data with SQL in Apache Pinot
Apache Pinot provides native JSON functions that allow querying nested JSON fields without flattening the data.
Accessing Nested Fields
Extract values from nested JSON objects using the JSON_EXTRACT_SCALAR function:
SELECT
  transaction_id,
  JSON_EXTRACT_SCALAR(shipping_address, '$.city', 'STRING') AS city,
  JSON_EXTRACT_SCALAR(shipping_address, '$.state', 'STRING') AS state
FROM ecommerce_transactions
WHERE JSON_EXTRACT_SCALAR(shipping_address, '$.state', 'STRING') = 'CA'
LIMIT 10;
What’s Happening Here?
- JSON_EXTRACT_SCALAR(json_column, json_path, TYPE) retrieves scalar values from the JSON object.
- The $.city and $.state paths point to the nested fields.
- The function returns a STRING.
- This query returns all transactions where the shipping address state is California.
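To build intuition for what JSON_EXTRACT_SCALAR does per row, here is a minimal Python sketch of the semantics for simple '$.field' paths (Pinot supports a much richer JsonPath syntax than this):

```python
import json

def json_extract_scalar(doc, path, out_type):
    # Illustrative stand-in for JSON_EXTRACT_SCALAR: walk a simple
    # '$.a.b' path into the document and cast the result to out_type.
    value = json.loads(doc) if isinstance(doc, str) else doc
    for key in path.lstrip("$.").split("."):
        value = value[key]
    return {"STRING": str, "INT": int, "DOUBLE": float}[out_type](value)

address = '{"city": "San Francisco", "state": "CA", "zip": "94107"}'
state = json_extract_scalar(address, "$.state", "STRING")
```

Pinot evaluates the equivalent of this per row, but against the JSON index rather than by reparsing each document.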
Querying Arrays in JSON
JSON arrays can be queried using JSON_MATCH for pattern matching:
SELECT
  transaction_id,
  JSON_EXTRACT_SCALAR(items, '$[0].category', 'STRING') AS first_item_category
FROM ecommerce_transactions
WHERE JSON_MATCH(items, '"$[*].category" = ''electronics''');
Key Insights:
- JSON_MATCH checks if any item in the items array has the category “electronics.”
- Subsecond performance is maintained even when scanning large arrays.
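The key point is the "any element matches" semantics of the '$[*]' wildcard. A small Python sketch of that behavior (an illustration of the predicate, not Pinot's evaluator):

```python
import json

def matches_any_category(items_json, category):
    # Illustrative equivalent of JSON_MATCH(items, '"$[*].category" = ...'):
    # true if ANY element of the items array has the given category.
    items = json.loads(items_json) if isinstance(items_json, str) else items_json
    return any(item.get("category") == category for item in items)

items = [
    {"item_id": "item_001", "category": "electronics", "price": 299.99},
    {"item_id": "item_002", "category": "accessories", "price": 29.99},
]
```

A transaction with any electronics item matches, even if other items in the same array do not.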
Aggregations on JSON Fields
Perform aggregations using extracted values from JSON:
SELECT
  JSON_EXTRACT_SCALAR(shipping_address, '$.state', 'STRING') AS state,
  COUNT(*) AS total_transactions,
  SUM(JSON_EXTRACT_SCALAR(items, '$[0].price', 'DOUBLE')) AS revenue
FROM ecommerce_transactions
GROUP BY JSON_EXTRACT_SCALAR(shipping_address, '$.state', 'STRING');
What’s Happening:
- We group transactions by state and compute total revenue from the first item’s price.
- Pinot efficiently handles the type casting and grouping operations.
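Conceptually, the GROUP BY above behaves like the following Python sketch (with made-up sample rows of extracted state and first-item price, purely for illustration):

```python
from collections import defaultdict

# Hypothetical rows after extraction: (state, first item price).
rows = [
    ("CA", 299.99),
    ("CA", 49.99),
    ("NY", 19.99),
]

# Equivalent of GROUP BY state with COUNT(*) and SUM(price):
totals = defaultdict(lambda: {"total_transactions": 0, "revenue": 0.0})
for state, price in rows:
    totals[state]["total_transactions"] += 1
    totals[state]["revenue"] += price
```

Pinot performs this aggregation in a distributed, columnar fashion, but the per-group arithmetic is the same.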
Filtering Nested Fields
Query specific nested fields with conditions:
SELECT
  transaction_id,
  JSON_EXTRACT_SCALAR(items, '$[0].price', 'DOUBLE') AS first_item_price
FROM ecommerce_transactions
WHERE JSON_MATCH(items, '"$[0].price" > 100');
Why This Matters:
- Easily filter transactions based on nested price fields, maintaining low-latency performance.
- Supports dynamic exploration without the need for preprocessing.
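The numeric filter on a nested field behaves like this Python sketch (illustrative sample transactions; not Pinot internals):

```python
import json

# Hypothetical rows, with items stored as JSON strings as in the table.
transactions = [
    {"transaction_id": "txn_001",
     "items": '[{"item_id": "item_001", "price": 299.99}]'},
    {"transaction_id": "txn_002",
     "items": '[{"item_id": "item_003", "price": 29.99}]'},
]

# Equivalent of WHERE JSON_MATCH(items, '"$[0].price" > 100'):
# keep transactions whose first item costs more than 100.
expensive = [
    t["transaction_id"]
    for t in transactions
    if json.loads(t["items"])[0]["price"] > 100
]
```

With a JSON index, Pinot answers this predicate without reparsing each document as this sketch does.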
Optimizing JSON Queries in Apache Pinot
To ensure subsecond query performance when working with JSON data, Apache Pinot supports:
JSON Indexes
- Pinot provides specialized JSON indexes for fast lookups on nested fields.
- These indexes significantly reduce the time required for complex JSON queries.
- Example: Creating a JSON index during table configuration:
{
  "tableIndexConfig": {
    "jsonIndexColumns": [
      "items",
      "shipping_address"
    ]
  }
}
(This is the relevant excerpt of the full table configuration shown earlier.)
NOTE: In our test, this query took 23 milliseconds without the JSON index and only 8 milliseconds with it.
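To see why the index helps, here is a conceptual Python sketch of the idea behind a JSON index: nested documents are flattened into path=value keys, each mapped to the set of row ids containing them (a posting list). This illustrates the concept only; Pinot's on-disk format is considerably more sophisticated.

```python
from collections import defaultdict

def flatten(doc, prefix="$"):
    # Flatten a nested document into "path=value" keys.
    if isinstance(doc, dict):
        for k, v in doc.items():
            yield from flatten(v, f"{prefix}.{k}")
    elif isinstance(doc, list):
        for v in doc:
            yield from flatten(v, f"{prefix}[*]")
    else:
        yield f"{prefix}={doc}"

# Hypothetical shipping_address documents, one per row.
docs = [
    {"city": "San Francisco", "state": "CA"},
    {"city": "New York", "state": "NY"},
    {"city": "San Diego", "state": "CA"},
]

index = defaultdict(set)
for row_id, doc in enumerate(docs):
    for key in flatten(doc):
        index[key].add(row_id)

# A filter like shipping_address.state = 'CA' becomes a direct lookup
# instead of a scan over every document:
ca_rows = index["$.state=CA"]
```

The filter resolves to a set lookup over row ids, which is why indexed JSON predicates stay fast as data volume grows.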
Columnar Storage for JSON
- Pinot’s columnar storage structure ensures that only relevant JSON fields are scanned during query execution.
- This reduces IO operations and boosts performance for slice-and-dice analytics.
Why Use Apache Pinot for JSON Analytics?
| Feature | Apache Pinot for JSON | Traditional Systems |
| --- | --- | --- |
| Query Latency | Subsecond, even on nested JSON fields | Minutes (due to heavy joins) |
| Data Freshness | Real-time ingestion with immediate querying | Delayed (batch processing) |
| Scalability | Petabyte-scale with high concurrency | Limited concurrency |
| Indexing Support | Native JSON indexes for fast lookups | Requires flattening or ETL |
| Query Complexity | Direct SQL queries on JSON without preprocessing | Complex transformations required |
Real-World Use Cases
- Personalized Recommendations: Query user preference JSON data in real time to deliver personalized recommendations on streaming or e-commerce platforms.
- Fraud Detection: Analyze nested JSON transaction data for suspicious patterns with subsecond latency, enabling instant anomaly detection.
- Operational Dashboards: Power dashboards that visualize real-time IoT sensor data or log streams stored in JSON, allowing for rapid troubleshooting and monitoring.
Conclusion
Apache Pinot makes querying JSON data at scale fast, flexible, and efficient. With native JSON support, optimized indexing, and subsecond query performance, Pinot eliminates the need for complex data transformations. Whether you’re analyzing user events, e-commerce transactions, or telemetry data, Pinot allows you to unlock insights in real time using familiar SQL syntax.
By combining real-time ingestion, petabyte-scale performance, and high concurrency, Apache Pinot redefines what’s possible when working with semi-structured data like JSON.