Release version 0.8.0: February 2024

Apache Pinot updates since last StarTree release

For details on Pinot changes, see Releases.

  • Permit defining NULL handling at the table level or at each individual column level. Link
  • Add lastUsed option in resumeConsumption endpoint in the controller API to improve UX. Link
  • Improve ingestion validation with TimeValidationTransformer to mark a record as invalid if the primary time column is out of range (1971 inclusive to 2071 exclusive). Link
  • Update the tables API endpoint to list only the dimension tables by specifying dimension as the type, for example:
    curl -X 'GET' \
      'http://localhost:9000/tables?type=dimension' \
      -H 'accept: application/json'

    Link

  • Enhance consuming segment handling to avoid an under-counting error with upsert tables. Link
  • Improve the logic for taking snapshots by making them more atomic and in an order that permits correct table preloading. Link
  • Add the following new metrics:
    • pinot_server_tableRebalanceInProgress_Value{table=${tableName},tableType=${tableType}} indicates whether a table is being rebalanced. A value of 1 indicates that rebalancing is in progress; 0 indicates that it is not. Link
    • pinot_server_tableDisabled_Value{table=${tableName},tableType=${tableType}} indicates whether a table is disabled. A value of 1 indicates that the table is disabled; 0 indicates that it is not. Link
    • pinot_server_tableConsumptionPaused_Value{table=${tableName},tableType=${tableType}} indicates whether table consumption is paused. A value of 1 indicates that consumption is paused; 0 indicates that it is not. Link
  • Add a set of catch-all regexes to the JMX-to-Prometheus exporter so that metrics not matched by any other regex are still exported. Link
  • Add compression configuration for aggregation in a star-tree index. Link
  • Add a new flag to indicate whether the query result is partial or full. Link
  • Add DATETIMECONVERTWINDOWHOP transformation function. Link
  • Enable tracking of out-of-order events in an upsert-enabled table using a new configuration, outOfOrderRecordColumn. Link
  • Enable support for leveraging a star-tree index in conjunction with filtered aggregations, including filtered group-by aggregations. Link
  • Add a new MV dictionary-encoded forward index format that only stores the unique MV entries, reducing storage footprint for indexes. Link
  • Introduce low disk mode to table rebalance, which is set to false by default. When set to true, the server will first offload segments before loading the new segments during rebalance. Link
  • Introduce a new configuration, controller.realtime.segment.deepStoreUploadRetry.parallelism (default: 1), to increase the size of the thread pool used for retrying segment uploads. The upload retry is now also an asynchronous operation. Link
  • Enable SegmentGenerationAndPushTask to push segment(s) to a realtime table, supporting bootstrapping an upsert-enabled table. Link
  • Add the ability to specify a custom Lucene analyzer used by text index for indexing and search on an individual column basis. Link
  • Add support for murmur3 as a partition function.
  • Enhance DistinctCountThetaSketch aggregation function by adding new parameters to give the end-user more control over how sketches are aggregated at query time. Link
  • Add a new upsert configuration, deletedKeysTTL. When set, deleted keys are removed and their validDocID is marked as invalid after the deletedKeysTTL threshold period, improving memory utilization. Link
  • Add support for vector index using Hierarchical Navigable Small World (HNSW). Link
  • Add the ability to initialize broker tags from configuration and automatically update the broker resource when a broker joins the cluster for the first time. Link
  • Enable partition level force-commit functionality, expanding the endpoint to accept a comma-separated list of partitions or consuming segment names. Link
  • The following updates are specific to the multi-stage query engine:
    • Optimize partition-based query performance when using the multi-stage query engine. The engine is now able to determine table partitioning and apply the best data shuffle mechanism automatically. Link
    • Enable the multi-stage query engine to run multiple operator chains, provided there is no requirement for distributed data shuffling. Link
    • Enable multiple SEMI-JOINs in the multi-stage query engine to use index lookup within the same node for a left-table scan. Link
    • Add support in the multi-stage query engine for early termination and for returning errors, warnings, or stats directly. Link
  • Bug fix: Segments created in realtime tables are guided by the parameter realtime.segment.flush.threshold.segment.size if it is set. Link
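
The time-range check described above for TimeValidationTransformer can be illustrated with a minimal sketch. This is not Pinot's implementation: the function names are hypothetical, and the sketch assumes the time column holds epoch milliseconds and that the year bounds fall on the UTC start of each year, which the release note does not state.

```python
from datetime import datetime, timezone

# Assumed bounds: the release note gives only the years, so the UTC
# start-of-year instants used here are an assumption of this sketch.
MIN_VALID_MS = int(datetime(1971, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
MAX_VALID_MS = int(datetime(2071, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)


def is_time_value_valid(epoch_millis: int) -> bool:
    """Return True when the value lies in [1971, 2071): lower bound inclusive."""
    return MIN_VALID_MS <= epoch_millis < MAX_VALID_MS


def split_records(records, time_column):
    """Partition records into (valid, invalid) by their primary time column."""
    valid, invalid = [], []
    for record in records:
        value = record.get(time_column)
        # Missing or non-integer time values are treated as invalid here.
        ok = isinstance(value, int) and is_time_value_valid(value)
        (valid if ok else invalid).append(record)
    return valid, invalid
```

A record whose time value is 0 (the Unix epoch, in 1970) would land in the invalid list, while a value from February 2024 would be kept.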

StarTree Cloud

StarTree Extensions for Apache Pinot

  • Enable bootstrapping of upsert-enabled tables by supporting batch ingestion using fileingestiontask into a realtime table.
  • StarTree Upsert is now on by default for all StarTree deployments, providing enhanced scalability and stability when using upsert.
    • Improved server restart times when using StarTree upserts
    • Ability to take snapshot for improved recoverability of upsert tables
  • Provide visibility into the health of various components (Server, Broker, Controller, Tables, etc.) using the Cluster Health Dashboard in Pinot Control Panel. The dashboard is updated every 20 minutes and can be triggered on-demand by using the /periodictask/run API call.
  • Ability to gate access to Pinot tables using a new Role Based Access Control (RBAC) system. Roles can be assigned to individual users, IDP groups, or Pinot API tokens. Access can be controlled at a table-level granularity along with the ability to allow/deny specific APIs on Pinot clusters and tables. (Alpha release)

Data Manager

  • Add Custom Connector option that lets you create a dataset using a JSON connection configuration to a Google Cloud Storage (GCS) data source.
  • Enhance interface to select a directory or multiple directories in an AWS S3 bucket.
  • Add SSL certificate support for Kafka. Now, you can enter details to connect with Kafka under SSL Authentication Type in Kafka Source.
  • Enhance Delta Lake connector to support IAM role access in AWS.

ThirdEye

  • Link anomaly alerts to PagerDuty for instant notification and efficient incident management. Link
  • Enable customizable bounds for precise anomaly detection, enhancing decision-making accuracy. Link
  • Protect sensitive information with automated data masking during the automated anomaly detection alert creation process. Link
  • UI/UX improvements
    • View related tasks for each alert, including success, failure, and access to logs for troubleshooting.
    • View a list of notifications sent per subscription group for better insight into alert distribution.
    • Easily identify which subscription groups are receiving specific alerts.