In this recipe we’ll learn how to query a table to find out which records are in a particular segment.
Prerequisites
To follow the code examples in this guide, you must install Docker(opens in a new tab) locally and download recipes .
Navigate to recipe
If you haven’t already, download recipes .
In terminal, go to the recipe by running the following command:
cd pinot-recipes/recipes/query -by -segment
Launch Pinot Cluster
You can spin up a Pinot Cluster by running the following command:
This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, Kafka, and Zookeeper. You can find the docker-compose.yml(opens in a new tab) file on GitHub.
Data generator
This recipe contains a data generator that creates events with a timestamp, count, and UUID. You can generate data by running the following command:
python datagen.py 2 >/dev/null | head -n1 | jq
{
"ts" : 1680171022742 ,
"uuid" : "5328f9b8-83ff-4e4c-8ea1-d09524866841" ,
"count" : 260
}
Kafka ingestion
We’re going to ingest this data into an Apache Kafka topic using the kcat(opens in a new tab) command line tool. We’ll also use jq
to structure the data in the key:payload
structure that Kafka expects:
python datagen.py --sleep 0.0001 2 >/dev/null |
jq -cr --arg sep ø '[.uuid, tostring] | join($sep)' |
kcat -P -b localhost:9092 -t events -Kø
We can check that Kafka has some data by running the following command:
docker exec -it kafka-querysegment kafka-run -class .sh \
kafka.tools.GetOffsetShell \
We’ll see something like the following:
Pinot Schema and Table
Now let’s create a Pinot Schema and Table.
First, the schema:
{
"schemaName" : "events" ,
"dimensionFieldSpecs" : [{"name" : "uuid" , "dataType" : "STRING" }],
"metricFieldSpecs" : [{"name" : "count" , "dataType" : "INT" }],
"dateTimeFieldSpecs" : [
{
"name" : "ts" ,
"dataType" : "TIMESTAMP" ,
"format" : "1:MILLISECONDS:EPOCH" ,
"granularity" : "1:MILLISECONDS"
}
]
}
Now for the table config:
{
"tableName" : "events" ,
"tableType" : "REALTIME" ,
"segmentsConfig" : {
"timeColumnName" : "ts" ,
"schemaName" : "events" ,
"replication" : "1" ,
"replicasPerPartition" : "1"
} ,
"tableIndexConfig" : {
"loadMode" : "MMAP" ,
"streamConfigs" : {
"realtime.segment.flush.threshold.rows" : "100000" ,
"streamType" : "kafka" ,
"stream.kafka.topic.name" : "events" ,
"stream.kafka.broker.list" : "kafka-querysegment:9093" ,
"stream.kafka.consumer.type" : "lowlevel" ,
"stream.kafka.consumer.prop.auto.offset.reset" : "smallest" ,
"stream.kafka.consumer.factory.class.name" : "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory" ,
"stream.kafka.decoder.class.name" : "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder" ,
"realtime.segment.flush.threshold.time" : "1h"
}
} ,
"ingestionConfig" : {
} ,
"tenants" : { } ,
"metadata" : { }
}
This highlighted section indicates that we’re going to create new segments after every 100,000 rows.
We’ll create the table by running the following:
docker run \
--network querysegment \
-v $PWD /config:/config \
apachepinot/pinot:1.0.0 AddTable \
-schemaFile /config/schema.json \
-tableConfigFile /config/table.json \
-controllerHost "pinot-controller-querysegment" \
-exec
Querying by segment
Once that’s been created, we can head over to the Pinot UI(opens in a new tab) and run some queries.
Pinot has several built-in virtual columns inside every schema that can be used for debugging purposes:
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">$hostName</td>
</tr>
</tbody>
</table>
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">$segmentName</td>
</tr>
</tbody>
</table>
Column Name Column Type Data Type Description
STRING
STRING
$docId
Dimension
INT
The one that’s useful for us is $segmentName
, which we can use like this to count the number of records in each segment:
select $segmentName, count (*)
from events
group by $segmentName
limit 10
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">events__0__142__20230330T1004Z</td>
</tr>
</tbody>
</table>
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">events__0__27__20230330T1003Z</td>
</tr>
</tbody>
</table>
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">events__0__123__20230330T1004Z</td>
</tr>
</tbody>
</table>
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">events__0__15__20230330T1003Z</td>
</tr>
</tbody>
</table>
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">events__0__185__20230330T1004Z</td>
</tr>
</tbody>
</table>
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">events__0__96__20230330T1003Z</td>
</tr>
</tbody>
</table>
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">events__0__169__20230330T1004Z</td>
</tr>
</tbody>
</table>
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">events__0__160__20230330T1004Z</td>
</tr>
</tbody>
</table>
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">events__0__77__20230330T1003Z</td>
</tr>
</tbody>
</table>
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">events__0__71__20230330T1003Z</td>
</tr>
</tbody>
</table>
$segmentName count(*)
100000
100000
100000
100000
100000
100000
100000
100000
100000
100000
Query Results
We can then pick one of those segments and see what records are stored in that segment:
select *
from events
WHERE $segmentName = 'events__0__142__20230330T1004Z'
limit 10
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">298</td>
</tr>
</tbody>
</table>
<table class="nx-block nx-overflow-x-scroll nextra-scrollbar nx-mt-6 nx-p-0 first:nx-mt-0">
<tbody>
<tr class="nx-m-0 nx-border-t nx-border-gray-300 nx-p-0 dark:nx-border-gray-600 even:nx-bg-gray-100 even:dark:nx-bg-gray-600/20">
<td class="nx-m-0 nx-border nx-border-gray-300 nx-px-4 nx-py-2 dark:nx-border-gray-600">40</td>
</tr>
</tbody>
</table>
<p> </p>
count ts uuid 666
128
659
436
408
457
197
362