Chaining Transformation Functions
In this recipe we’ll learn how to chain transformation functions during the data ingestion process.
To follow the code examples in this guide, you must install Docker locally and download recipes.
Navigate to recipe
- If you haven’t already, download recipes.
- In a terminal, go to the recipe directory and start the Pinot cluster by running the following commands:
cd pinot-recipes/recipes/chaining-transformation-functions
docker-compose up
The docker-compose up command runs a single instance of the Pinot Controller, Pinot Server, Pinot Broker, and ZooKeeper. You can find the docker-compose.yml file on GitHub.
We’re going to import the following JSON file:
{"payload": {"userId": "3287651__David Smith"}}
{"payload": {"userId": "4987622__Jenny Jones"}}
{"payload": {"userId": "1965900__Stephen Davis"}}
The userId column will store the userId value from the JSON document. The name and id columns will store values extracted from the userId value.
We’ll also have a table config that includes the following transform functions:
"transformFunction":"jsonPathString(payload, '$.userId')"
"transformFunction":"Groovy({Long.valueOf(userId.split('__')[0])}, userId)"
"transformFunction":"Groovy({userId.split('__')[1]}, userId)"
In this config we define transform configs (ingestionConfig.transformConfigs) that do the following:
- Extract the userId value from payload using the jsonPathString function.
- Split the resulting string on __ and extract the id and name using Groovy transformation functions.
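The chain above can be sketched outside Pinot to check what each column should contain. This is a minimal Python stand-in, not Pinot code: the function name apply_transforms is hypothetical, and the Groovy and jsonPathString calls are only imitated with plain dictionary and string operations.

```python
import json

# The three sample documents from the recipe's input file.
RAW_LINES = [
    '{"payload": {"userId": "3287651__David Smith"}}',
    '{"payload": {"userId": "4987622__Jenny Jones"}}',
    '{"payload": {"userId": "1965900__Stephen Davis"}}',
]


def apply_transforms(document):
    """Hypothetical stand-in for the three chained transform functions."""
    # Step 1: imitates jsonPathString(payload, '$.userId').
    user_id = str(document["payload"]["userId"])
    # Step 2: imitates Groovy({Long.valueOf(userId.split('__')[0])}, userId).
    id_value = int(user_id.split("__")[0])
    # Step 3: imitates Groovy({userId.split('__')[1]}, userId).
    name = user_id.split("__")[1]
    return {"userId": user_id, "id": id_value, "name": name}


rows = [apply_transforms(json.loads(line)) for line in RAW_LINES]
for row in rows:
    print(row)
```

Steps 2 and 3 both read the userId value produced by step 1, which is what makes this a chain rather than three independent transformations.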
You can create the table and schema by running the following command:
docker run \
--network json \
-v $PWD/config:/config \
apachepinot/pinot:1.0.0 AddTable \
-schemaFile /config/schema.json \
-tableConfigFile /config/table.json \
-controllerHost "pinot-controller-json" \
-exec
You should see a message similar to the following if everything is working correctly:
2022/02/28 10:08:40.333 INFO [AddTableCommand] [main] Executing command: AddTable -tableConfigFile /config/table.json -schemaFile /config/schema.json -controllerProtocol http -controllerHost -controllerPort 9000 -user null -password [hidden] -exec
2022/02/28 10:08:40.747 INFO [AddTableCommand] [main] {"status":"Table people_OFFLINE succesfully added"}
Ingestion Job
Now we’re going to import the JSON file into Pinot. We’ll do this with the following ingestion spec:
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/data'
includeFileNamePattern: 'glob:**/import.json'
outputDirURI: '/opt/pinot/data/people/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'json'
  className: 'org.apache.pinot.plugin.inputformat.json.JSONRecordReader'
tableSpec:
  tableName: 'people'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller-json:9000'
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
The import job will map fields in each JSON document to a corresponding column in the people schema. If one of the fields doesn’t exist in the schema it will be skipped.
In this case our JSON documents only have one top level field, payload, which doesn’t have a corresponding column in the schema. Instead, transformation functions extract the payload.userId field and then store parts of it in different columns.
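To see why the transformation functions are needed here, the mapping rule can be sketched in a few lines of Python. This is a simplified illustration, not Pinot's implementation: build_row is a hypothetical helper, the column list is assumed from the columns named in this recipe, and unmatched columns are simply omitted rather than handled the way Pinot handles missing values.

```python
import json

# Assumption: the people schema has these three columns,
# matching the columns described in this recipe.
SCHEMA_COLUMNS = ("userId", "id", "name")


def build_row(document, derived_values):
    """Keep only values whose key matches a schema column (hypothetical helper)."""
    candidates = {**document, **derived_values}
    return {col: candidates[col] for col in SCHEMA_COLUMNS if col in candidates}


document = json.loads('{"payload": {"userId": "4987622__Jenny Jones"}}')

# Without transform functions, the only top level field (payload) matches no column.
row_without_transforms = build_row(document, {})

# With values derived from payload.userId, every column is filled
# and payload itself is still skipped.
user_id = document["payload"]["userId"]
derived = {
    "userId": user_id,
    "id": int(user_id.split("__")[0]),
    "name": user_id.split("__")[1],
}
row_with_transforms = build_row(document, derived)

print(row_without_transforms)
print(row_with_transforms)
```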
You can run the import with the following command:
docker run \
--network json \
-v $PWD/config:/config \
-v $PWD/data:/data \
apachepinot/pinot:1.0.0 LaunchDataIngestionJob \
-jobSpecFile /config/job-spec.yml
Once that’s completed, navigate to localhost:9000/#/query and click on the people table or copy/paste the following query:
select *
from people
limit 10
You will see the following output: