
Analyze JSON Data Using SQL in Apache Pinot


Barkha Herman
Developer Advocate at StarTree.ai
Published September 21, 2024 · 10 min read

Modern applications generate vast amounts of semi-structured data, with JSON emerging as one of the most popular formats. From user activity logs and IoT sensor readings to e-commerce transactions and API responses, JSON provides the flexibility to represent complex, nested data.

However, querying JSON efficiently—especially at scale—has traditionally been a challenge.

This is where Apache Pinot, a real-time distributed OLAP datastore, excels. Pinot combines subsecond query performance with support for semi-structured data like JSON, enabling fast, interactive analytics at scale. In this article, we’ll explore how to ingest, store, and query JSON data using SQL in Apache Pinot, allowing you to unlock insights from complex datasets with ease.

Why JSON Support Matters in Apache Pinot

JSON’s nested structures provide flexibility but can introduce performance bottlenecks when querying. Traditional OLAP systems often require data flattening or complex ETL pipelines to process JSON efficiently. This slows down analytics and increases storage costs.

Apache Pinot eliminates these challenges by:

  • Natively supporting JSON data types for ingestion and querying.
  • Providing built-in JSON functions for parsing and manipulating nested data.
  • Leveraging optimized indexes to ensure subsecond query performance, even at petabyte scale.

Ingesting JSON Data into Apache Pinot

Pinot supports both batch ingestion (e.g., from S3 or HDFS) and real-time ingestion (e.g., from Kafka). JSON data can be ingested directly without complex transformations, or shaped at ingestion time using Pinot's built-in JSON functions such as json_format.

Example Schema for JSON Ingestion

Let’s assume we have a Kafka topic with JSON data representing e-commerce transactions:

{
  "transaction_id": "txn_001",
  "user_id": "user_123",
  "timestamp": "2024-02-20T12:34:56Z",
  "items": [
    {"item_id": "item_001", "category": "electronics", "price": 299.99},
    {"item_id": "item_002", "category": "accessories", "price": 29.99}
  ],
  "shipping_address": {
    "city": "San Francisco",
    "state": "CA",
    "zip": "94107"
  },
  "payment_method": "credit_card"
}
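Before wiring this topic into Pinot, it can help to sanity-check the payload shape locally. The sketch below is plain Python (no Kafka required); it serializes and re-parses one transaction the way a producer and Pinot's JSON decoder would:

```python
import json

# Sample transaction mirroring the record above
transaction = {
    "transaction_id": "txn_001",
    "user_id": "user_123",
    "timestamp": "2024-02-20T12:34:56Z",
    "items": [
        {"item_id": "item_001", "category": "electronics", "price": 299.99},
        {"item_id": "item_002", "category": "accessories", "price": 29.99},
    ],
    "shipping_address": {"city": "San Francisco", "state": "CA", "zip": "94107"},
    "payment_method": "credit_card",
}

# What a Kafka producer would put on the wire
payload = json.dumps(transaction).encode("utf-8")

# What Pinot's JSON decoder sees on the other side
decoded = json.loads(payload)
print(decoded["shipping_address"]["city"])  # San Francisco
```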

This can be defined in a Pinot schema as follows:

{
  "dateTimeFieldSpecs": [
    {
      "notNull": false,
      "granularity": "MILLISECONDS|1",
      "dataType": "STRING",
      "name": "timestamp",
      "format": "SIMPLE_DATE_FORMAT|yyyy-MM-dd'T'HH:mm:ssZ",
      "fieldType": "DATE_TIME"
    }
  ],
  "dimensionFieldSpecs": [
    {
      "notNull": false,
      "dataType": "STRING",
      "name": "transaction_id",
      "fieldType": "DIMENSION"
    },
    {
      "notNull": false,
      "dataType": "STRING",
      "name": "user_id",
      "fieldType": "DIMENSION"
    },
    {
      "notNull": false,
      "dataType": "JSON",
      "name": "shipping_address",
      "fieldType": "DIMENSION"
    },
    {
      "notNull": false,
      "dataType": "JSON",
      "name": "items",
      "fieldType": "DIMENSION"
    },
    {
      "notNull": false,
      "dataType": "STRING",
      "name": "payment_method",
      "fieldType": "DIMENSION"
    }
  ],
  "schemaName": "ecommerce_transactions",
  "enableColumnBasedNullHandling": false
}

In this schema:

  • The items and shipping_address fields are stored as JSON data types.
  • Pinot will handle these fields natively, allowing you to run SQL queries directly on nested fields.

Here’s the Pinot table definition:

{
  "tableName": "ecommerce_transactions",
  "tableType": "OFFLINE",
  "isDimTable": false,
  "tenants": {},
  "tableIndexConfig": {
    "invertedIndexColumns": [
      "transaction_id",
      "user_id",
      "payment_method"
    ],
    "jsonIndexColumns": [
      "items",
      "shipping_address"
    ],
    "loadMode": "MMAP"
  },
  "ingestionConfig": {
    "segmentTimeValueCheck": false,
    "batchIngestionConfig": {
      "segmentIngestionType": "APPEND",
      "consistentDataPush": false
    },
    "continueOnError": true,
    "rowTimeValueCheck": true,
    "transformConfigs": [
      {
        "columnName": "shipping_address",
        "transformFunction": "json_format(shipping_address_json_str)"
      },
      {
        "columnName": "items",
        "transformFunction": "json_format(items_json_str)"
      }
    ]
  },
  "metadata": {}
}

In this table:

  • The items and shipping_address fields are stored as JSON data types.
  • The ingestion transforms items and shipping_address using the json_format function.
  • Pinot will handle these fields natively, allowing you to run SQL queries directly on nested fields.

Querying JSON Data with SQL in Apache Pinot

Apache Pinot provides native JSON functions that allow querying nested JSON fields without flattening the data.
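If you want to run these queries programmatically rather than from the Pinot query console, the pinotdb DB-API driver is one option. Below is a minimal sketch; the broker host and port are assumptions for a local quickstart cluster, and the table/query match the examples in this article:

```python
# Minimal sketch of running a Pinot SQL query from Python.
# Assumes a local Pinot broker; install the client with: pip install pinotdb

SQL = """
SELECT transaction_id,
       JSON_EXTRACT_SCALAR(shipping_address, '$.city', 'STRING') AS city
FROM ecommerce_transactions
LIMIT 5
"""


def run_query(sql: str):
    # Imported here so the sketch can be read without pinotdb installed
    from pinotdb import connect

    # Host/port are assumptions for a local quickstart broker
    conn = connect(host="localhost", port=8000, path="/query/sql", scheme="http")
    curs = conn.cursor()
    curs.execute(sql)
    return curs.fetchall()


# rows = run_query(SQL)  # uncomment against a running cluster
```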

1. Accessing Nested Fields

Extract values from nested JSON objects using the JSON_EXTRACT_SCALAR function:

SELECT
 transaction_id,
 JSON_EXTRACT_SCALAR(shipping_address, '$.city', 'STRING') AS city,
 JSON_EXTRACT_SCALAR(shipping_address, '$.state', 'STRING') AS state
FROM
 ecommerce_transactions
WHERE
 JSON_EXTRACT_SCALAR(shipping_address, '$.state', 'STRING') = 'CA'
LIMIT 10;

What’s Happening Here?

  • JSON_EXTRACT_SCALAR(json_column, json_path, TYPE) retrieves scalar values from the JSON object.
  • The $.city and $.state paths point to the nested fields.
  • The function returns a STRING.
  • This query returns all transactions where the shipping address state is California.
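To see what those JSON paths resolve to, here is the same extraction done in plain Python on the sample shipping address (a local illustration only; Pinot evaluates the path inside the JSON column):

```python
import json

shipping_address = json.loads(
    '{"city": "San Francisco", "state": "CA", "zip": "94107"}'
)

# Equivalent of JSON_EXTRACT_SCALAR(shipping_address, '$.city', 'STRING')
city = shipping_address["city"]
state = shipping_address["state"]
print(city, state)  # San Francisco CA
```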

Querying Arrays in JSON

JSON arrays can be queried using JSON_MATCH for pattern matching:

SELECT
 transaction_id,
 JSON_EXTRACT_SCALAR(items, '$[0].category', 'STRING') AS first_item_category
FROM
 ecommerce_transactions
WHERE
 JSON_MATCH(items, '"$[*].category" = ''electronics''');

Key Insights:

  • JSON_MATCH checks if any item in the items array has the category “electronics.”
  • Subsecond performance is maintained even when scanning large arrays.
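Conceptually, JSON_MATCH with the '$[*].category' path behaves like an "any element matches" test, while '$[0]' looks only at the first element. In plain Python over the sample items array:

```python
items = [
    {"item_id": "item_001", "category": "electronics", "price": 299.99},
    {"item_id": "item_002", "category": "accessories", "price": 29.99},
]

# '"$[*].category" = ''electronics''' ~ any item whose category matches
matches = any(item["category"] == "electronics" for item in items)

# '$[0].category' ~ look only at the first element
first_item_category = items[0]["category"]
print(matches, first_item_category)  # True electronics
```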

Aggregations on JSON Fields

Perform aggregations using extracted values from JSON:

SELECT
JSON_EXTRACT_SCALAR(shipping_address, '$.state', 'STRING') AS state,
COUNT(*) AS total_transactions,
SUM(JSON_EXTRACT_SCALAR(items, '$[0].price', 'DOUBLE')) AS revenue
FROM
ecommerce_transactions
GROUP BY
JSON_EXTRACT_SCALAR(shipping_address, '$.state', 'STRING');

What’s Happening:

  • We group transactions by state and compute total revenue from the first item’s price.
  • Pinot efficiently handles the type casting and grouping operations.
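The same group-by logic can be emulated in a few lines of Python over extracted values; the sample rows below are made up for illustration:

```python
from collections import defaultdict

# Sample rows as Pinot would see them after JSON extraction
# (states and prices are made-up illustrative values)
rows = [
    {"state": "CA", "first_item_price": 299.99},
    {"state": "CA", "first_item_price": 49.99},
    {"state": "NY", "first_item_price": 120.00},
]

# Equivalent of GROUP BY state with COUNT(*) and SUM(first item price)
totals = defaultdict(lambda: {"total_transactions": 0, "revenue": 0.0})
for row in rows:
    totals[row["state"]]["total_transactions"] += 1
    totals[row["state"]]["revenue"] += row["first_item_price"]

print(dict(totals))
```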

Filtering Nested Fields

Query specific nested fields with conditions:

SELECT
 transaction_id,
 JSON_EXTRACT_SCALAR(items, '$[0].price', 'DOUBLE') AS first_item_price
FROM
 ecommerce_transactions
WHERE
 JSON_MATCH(items, '"$[0].price" > 100');

Why This Matters:

  • Easily filter transactions based on nested price fields, maintaining low-latency performance.
  • Supports dynamic exploration without the need for preprocessing.
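The numeric filter behaves like a predicate on the first array element. In plain Python (the second transaction below is made up for illustration):

```python
transactions = [
    {"transaction_id": "txn_001",
     "items": [{"item_id": "item_001", "price": 299.99}]},
    # Made-up second record to show a non-matching row
    {"transaction_id": "txn_002",
     "items": [{"item_id": "item_003", "price": 15.00}]},
]

# Equivalent of WHERE JSON_MATCH(items, '"$[0].price" > 100')
expensive = [
    t["transaction_id"] for t in transactions if t["items"][0]["price"] > 100
]
print(expensive)  # ['txn_001']
```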

Optimizing JSON Queries in Apache Pinot

To ensure subsecond query performance when working with JSON data, Apache Pinot supports:

JSON Indexes

  • Pinot provides specialized JSON indexes for fast lookups on nested fields.
  • These indexes significantly reduce the time required for complex JSON queries.
  • Example: the relevant portion of the table configuration shown earlier, which enables JSON indexes on both JSON columns:

{
  "tableIndexConfig": {
    "jsonIndexColumns": [
      "items",
      "shipping_address"
    ],
    "loadMode": "MMAP"
  }
}

NOTE: In our testing, the example query took 23 milliseconds without the JSON index and only 8 milliseconds with it.

Columnar Storage for JSON

  • Pinot’s columnar storage structure ensures that only relevant JSON fields are scanned during query execution.
  • This reduces IO operations and boosts performance for slice-and-dice analytics.

Why Use Apache Pinot for JSON Analytics?

| Feature | Apache Pinot for JSON | Traditional Systems |
|---|---|---|
| Query Latency | Subsecond, even on nested JSON fields | Minutes (due to heavy joins) |
| Data Freshness | Real-time ingestion with immediate querying | Delayed (batch processing) |
| Scalability | Petabyte-scale with high concurrency | Limited concurrency |
| Indexing Support | Native JSON indexes for fast lookups | Requires flattening or ETL |
| Query Complexity | Direct SQL queries on JSON without preprocessing | Complex transformations required |

Real-World Use Cases

  1. Personalized Recommendations:
    Query user preference JSON data in real time to deliver personalized recommendations on streaming or e-commerce platforms.
  2. Fraud Detection:
    Analyze nested JSON transaction data for suspicious patterns with subsecond latency, enabling instant anomaly detection.
  3. Operational Dashboards:
    Power dashboards that visualize real-time IoT sensor data or log streams stored in JSON, allowing for rapid troubleshooting and monitoring.

Conclusion

Apache Pinot makes querying JSON data at scale fast, flexible, and efficient. With native JSON support, optimized indexing, and subsecond query performance, Pinot eliminates the need for complex data transformations. Whether you’re analyzing user events, e-commerce transactions, or telemetry data, Pinot allows you to unlock insights in real time using familiar SQL syntax.

By combining real-time ingestion, petabyte-scale performance, and high concurrency, Apache Pinot redefines what’s possible when working with semi-structured data like JSON.


Ready to deploy real-time analytics?

We’re here to help!