Release version 0.8.0: February 2024
Apache Pinot updates since last StarTree release
For details on Pinot changes, see Releases.
- Permit defining NULL handling at the table level or at each individual column level. Link
- Add lastUsed option in the resumeConsumption endpoint in the broker API to improve UX. Link
- Improve ingestion validation with TimeValidationTransformer to mark a record as invalid if the primary time column is out of range (1971 inclusive to 2071 exclusive). Link
- Update the tables API endpoint to list only the dimension tables by specifying dimension as the type, for example: curl -X 'GET' 'http://localhost:9000/tables?type=dimension' -H 'accept: application/json'
- Enhance consuming segment handling to avoid an under-counting error with upsert tables. Link
- Improve the logic for taking snapshots by making them more atomic and in an order that permits correct table preloading. Link
- Add the following new metrics:
  - pinot_server_tableRebalanceInProgress_Value{table=${tableName},tableType=${tableType}} indicates whether a table is being rebalanced: 1 when rebalancing is in progress and 0 when it is not. Link
  - pinot_server_tableDisabled_Value{table=${tableName},tableType=${tableType}} indicates whether a table is disabled: 1 when the table is disabled and 0 when it is not. Link
  - pinot_server_tableConsumptionPaused_Value{table=${tableName},tableType=${tableType}} indicates whether table consumption is paused: 1 when consumption is paused and 0 when it is not. Link
- Add a set of catch-all regexes for JMX -> Prometheus Exporter for when a regex used does not match a metric. Link
- Add compression configuration for aggregation in a star-tree index. Link
- Add a new flag to indicate whether the query result is partial or full. Link
- Add DATETIMECONVERTWINDOWHOP transformation function. Link
- Enable tracking of out-of-order events in an upsert-enabled table using a new configuration, outOfOrderRecordColumn. Link
- Enable support for leveraging a star-tree index in conjunction with filtered aggregations, including filtered group-by aggregations. Link
- Add a new MV dictionary-encoded forward index format that only stores the unique MV entries, reducing storage footprint for indexes. Link
- Introduce low disk mode to table rebalance, which is set to false by default. When set to true, the server will first offload segments before loading the new segments during rebalance. Link
- Introduce a new configuration, controller.realtime.segment.deepStoreUploadRetry.parallelism (the default setting is 1), to increase the size of the thread pool used for retrying segment uploads. The upload retry is also now an asynchronous operation. Link
- Enable SegmentGenerationAndPushTask to push segment(s) to a realtime table, supporting bootstrapping an upsert-enabled table. Link
- Add the ability to specify a custom Lucene analyzer used by text index for indexing and search on an individual column basis. Link
- Add murmur3 support as a partition function.
- Enhance the DistinctCountThetaSketch aggregation function by adding new parameters to give the end user more control over how sketches are aggregated at query time. Link
- Add a new upsert configuration, deletedKeysTTL, which when set removes deleted keys and marks the validDocID as invalid after the deletedKeysTTL threshold period, improving memory utilization. Link
- Add support for vector index using Hierarchical Navigable Small World (HNSW). Link
- Add the ability to initialize broker tags from configuration and automatically update the broker resource when a broker joins the cluster for the first time. Link
- Enable partition level force-commit functionality, expanding the endpoint to accept a comma-separated list of partitions or consuming segment names. Link
- The following updates are specific to the multi-stage query engine:
- Optimize partition-based query performance when using the multi-stage query engine. The engine is now able to determine table partitioning and apply the best data shuffle mechanism automatically. Link
- Enable the multi-stage query engine to run multiple operator chains, provided there is no requirement for distributed data shuffling. Link
- Enable multiple SEMI-JOINs in the multi-stage query engine to use index lookup within the same node for a left-table scan. Link
- Add support in the multi-stage query engine for early termination and direct return of errors, warnings, or stats. Link
- Bug fix: Segments created in realtime tables are guided by the parameter realtime.segment.flush.threshold.segment.size if it is set. Link
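The time-range rule described for TimeValidationTransformer (1971 inclusive to 2071 exclusive) can be illustrated with a small check. This is only a sketch of the stated range, not Pinot's implementation: the function name is hypothetical, and treating the bounds as UTC calendar-year boundaries expressed in epoch milliseconds is an assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MILLI = timedelta(milliseconds=1)

# Assumption: the "1971 inclusive to 2071 exclusive" range from the notes is
# interpreted as UTC year boundaries, measured in milliseconds since the epoch.
VALID_MIN_MILLIS = (datetime(1971, 1, 1, tzinfo=timezone.utc) - EPOCH) // ONE_MILLI
VALID_MAX_MILLIS = (datetime(2071, 1, 1, tzinfo=timezone.utc) - EPOCH) // ONE_MILLI


def is_valid_time_millis(epoch_millis: int) -> bool:
    """Hypothetical helper: True if the value lies in [1971-01-01, 2071-01-01) UTC."""
    return VALID_MIN_MILLIS <= epoch_millis < VALID_MAX_MILLIS
```

A common way to fall outside this range is to pass epoch seconds where milliseconds are expected, which places a present-day timestamp in early 1970; under the assumptions above, `is_valid_time_millis` would reject such a value.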
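For scripted use, the dimension-table listing shown in the curl example can be expressed as a request object. This is a minimal sketch: the host and port are taken from that example, the function name is hypothetical, and the shape of the response body is not assumed here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_dimension_tables_request(base_url: str = "http://localhost:9000") -> Request:
    """Hypothetical helper: build a GET request for /tables?type=dimension."""
    query = urlencode({"type": "dimension"})
    return Request(
        f"{base_url}/tables?{query}",
        headers={"accept": "application/json"},
        method="GET",
    )
```

The request can then be sent with `urllib.request.urlopen(build_dimension_tables_request())` against a running cluster.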
StarTree Cloud
StarTree Extensions for Apache Pinot
- Enable bootstrapping of upsert-enabled tables by supporting batch ingestion using fileingestiontask into a realtime table.
- StarTree Upsert is on by default for all StarTree deployments, providing enhanced scalability and stability when using upsert.
- Improved server restart times when using StarTree upserts
- Ability to take snapshot for improved recoverability of upsert tables
- Provide visibility into the health of various components (Server, Broker, Controller, Tables, etc.) using the Cluster Health Dashboard in Pinot Control Panel. The dashboard is updated every 20 minutes, and an update can be triggered on demand by using the /periodictask/run API call.
- Ability to gate access to Pinot tables using a new Role Based Access Control (RBAC) system. Roles can be assigned to individual users, IDP groups, or Pinot API tokens. Access can be controlled at table-level granularity, along with the ability to allow or deny specific APIs on Pinot clusters and tables. (Alpha release)
Data Manager
- Add Custom Connector option that lets you create a dataset using a JSON connection configuration to a Google Cloud Storage (GCS) data source.
- Enhance interface to select a directory or multiple directories in an AWS S3 bucket.
- Add SSL certificate support for Kafka. Now, you can enter details to connect with Kafka under SSL Authentication Type in Kafka Source.
- Enhance Delta Lake connector to support IAM role access in AWS.
ThirdEye
- Link anomaly alerts to PagerDuty for instant notification and efficient incident management. Link
- Enable customizable bounds for precise anomaly detection, enhancing decision-making accuracy. Link
- Protect sensitive information with automated data masking during the automated anomaly detection alert creation process. Link
- UI/UX improvements
- View related tasks for each alert, including success, failure, and access to logs for troubleshooting.
- View a list of notifications sent per subscription group for better insight into alert distribution.
- Easily identify which subscription groups are receiving specific alerts.