Set up tiered storage with cloud object store

This page shows you how to enable tiered storage for a table.

Set the segment age boundary

Set a segment age boundary to split data across your local and remote tiers.

For example, if you have 30 days of data and set the segment age boundary at 7 days, segments less than 7 days old stay on the local tier, and segments older than 7 days are moved to the remote tier.

Set up a tenant

Do one of the following:

  • (Recommended) Set up a server as a dedicated tenant
  • (Default) Use the same tenant as your table (that is, the server already hosting your table). We do not recommend this option for production environments.

(Recommended) Set up a dedicated tenant server

Create a new tenant to serve as a compute-only node. This tenant is dedicated to serving the segments on the remote tier backend. Because the data is stored remotely, these servers do not require as much local storage as your table's default local tenant.

To add one or more dedicated tenants, do the following:

  1. Add servers. You'll need at least as many servers as the replication setting in your Pinot table configuration.
  2. Tag these servers as RemoteTenant_OFFLINE (or any tag you prefer) using the /instances/{instanceName}/updateTags API in the Swagger UI at http://localhost:9000/help#/Schema/ on the host where Pinot is running, as shown in the example below. This link is only accessible while Pinot is running.
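
For example, a minimal sketch of tagging one server with the updateTags API, assuming a hypothetical instance name Server_pinot-server-1.example.com_8098 and a controller at localhost:9000:

curl -X PUT "http://localhost:9000/instances/Server_pinot-server-1.example.com_8098/updateTags?tags=RemoteTenant_OFFLINE" -H "accept: application/json"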

(Default) Use same tenant as your table

We don’t recommend this option for production environments.

This option is selected by default. StarTree uses the tenant that you set in tableConfig->tenants->server in your Pinot table configuration. No additional action is required.

Processing segments from the cloud object store causes higher than usual memory and CPU utilization on the nodes and can affect local data query performance. If your local data queries serve latency-sensitive, user-facing traffic, we recommend using a separate server as your tenant.

Add cluster configs and restart servers

To enable tiered storage, add the following configs to your cluster configuration (for example, using the cluster config APIs). After adding your configs, restart the servers for the changes to take effect.

{
  "pinot.server.instance.tier.enabled": "true",
  "pinot.server.instance.segment.cache.directory": "/home/pinot/data/tieredStorage/segmentCache/",
  "pinot.server.instance.buffer.reference.manager.ondemand.total.size": "24G",
  "pinot.server.instance.buffer.reference.manager.ondemand.reserved.size": "4G"
}
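
As a sketch, assuming the Pinot controller is reachable at localhost:9000, the configs above can be applied with the cluster config API:

curl -X POST "http://localhost:9000/cluster/configs" -H "Content-Type: application/json" -d '{"pinot.server.instance.tier.enabled": "true", "pinot.server.instance.segment.cache.directory": "/home/pinot/data/tieredStorage/segmentCache/", "pinot.server.instance.buffer.reference.manager.ondemand.total.size": "24G", "pinot.server.instance.buffer.reference.manager.ondemand.reserved.size": "4G"}'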

Review the related configuration options here:

  • pinot.server.instance.tier.enabled – Configuration to enable tiered storage for the cluster. Default value: false.

⚠️

The pinot.server.instance.segment.cache.directory config enables caching so that servers restart more quickly; it does not affect query execution speed.

The two ondemand configs are shown here because of their importance to query performance. More information about these configs and others can be found in the sections below. You may want to set some of them before restarting the servers.

Restarting servers takes much longer than before (depending on the number of segments on the remote tier backend), because StarTree needs to read the segment metadata from the remote tier backend when loading segments on the server. To enable configurations that improve this, see the Configs for ease of operations section below.

Update the Pinot table configuration

  1. If not already set, set timeColumn in the Pinot table configuration.
  2. Add tierConfigs to your Pinot table configuration (example below). The segmentAge is calculated using the primary timeColumnName set in the Pinot table configuration.
💡

Text-index is under development and not functional.

After adding this configuration, a controller-side periodic task, SegmentRelocator, checks for and migrates segments to the proper tier, for example when segments cross the segmentAge boundary. See the SegmentRelocator configuration documentation for more information.

   "tierConfigs": [
      {
        "name": "remoteTier",
        "segmentSelectorType": "time",
        "segmentAge": "7d",
        "storageType": "pinot_server",
        "serverTag": "RemoteTenant_OFFLINE",
        "tierBackend": "s3",
        "tierBackendProperties": {
             "region": "<region>",
             "bucket": "<bucket>",
             ...
        }
      }
    ],

Review the related configuration options here:

  • name – Set this to anything you want; it is used to refer to this tier uniquely.

⚠️

This segmentAge setting is independent of the table's retention. Segments are removed from Pinot once the table's retention is reached, regardless of which tier the segments are on.
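
If you don't want to wait for the next periodic run of the SegmentRelocator, you can trigger it manually through the controller's periodic task API. A minimal sketch, assuming a controller at localhost:9000 and a hypothetical table named myTable:

curl -X GET "http://localhost:9000/periodictask/run?taskname=SegmentRelocator&tableName=myTable_OFFLINE" -H "accept: application/json"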

Tiered storage with Google Cloud Storage (GCS)

Tiered storage with Google Cloud Storage (GCS) is supported using the Cloud Storage XML API. This API lets you access Pinot segments stored in a GCS bucket using an S3 client. To use a GCS bucket as the tier backend, complete the following steps:

  1. To use GCS's interoperable APIs with an S3 client, you must generate HMAC keys for the GCS bucket. Each HMAC key consists of an access ID and a secret key and can be associated with any service account.
$ gsutil hmac create <SERVICE_ACCOUNT_EMAIL>
AccessId: <HMAC access id>
SecretKey: <HMAC secret key>
  2. For testing purposes, you can store the HMAC key pair in tierBackendProperties, as shown in the following example:
"tierBackendProperties": {
  "bucket": "<gcs bucket name>",
  "endpoint": "https://storage.googleapis.com",
  "region": "auto",
  "accesskey": "<HMAC access id>",
  "secretkey": "<HMAC secret key>",
  ...
}

To avoid exposing the HMAC key pair in the table configuration, we recommend storing the key pair as secrets in GCP Secret Manager and configuring the table to fetch them securely when creating the client. To do this, complete the following steps:

  1. Enable the GCP Secret Manager API, and then add the key pair as secrets (one for the access ID and the other for the secret key); see the gcloud example after these steps.

  2. Ensure that the service account used to create the cluster has access to read the added secrets.

  3. Set the following properties in tierBackendProperties:

"tierBackendProperties": {
  "bucket": "<gcs bucket name>",
  "endpoint": "https://storage.googleapis.com",
  "accesskey": "<secret namae for HMAC AccessId>",
  "secretkey": "<secret namae for HMAC SccessKey>",
  "region": "auto",
  "keyType": "SECRET",
  "secretmanagertype": "GCS",
  "gcpprojectid": "<projectId>",
  "gcpkeypath": "<path to key.json on the server>",
  ...
 }
The gcpprojectid and gcpkeypath properties are required to access GCP Secret Manager. Set gcpkeypath to the absolute path where the GCP key is mounted on the server (typically /home/pinot/gcp/credentials.json).
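
For step 1, here is a minimal sketch of adding the two secrets with the gcloud CLI; the secret names hmac-access-id and hmac-secret-key are hypothetical placeholders:

$ echo -n "<HMAC access id>" | gcloud secrets create hmac-access-id --data-file=-
$ echo -n "<HMAC secret key>" | gcloud secrets create hmac-secret-key --data-file=-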

Upload new segments and migrate existing segments

When uploading new segments to the table, the segments are put on the remote tier if their data is older than the specified segmentAge. Existing segments are moved to the remote tier periodically by the SegmentRelocator task, as configured.

Validate and query

For segments that move to the remote tenant, verify the following:

  1. From the Swagger UI at http://localhost:9000/help#/Schema/ on the host where Pinot is running, use the segments/{tableName}/tiers API to check the segments' current tier, as shown in the example below.
  2. Segments are not present on the pinot-server's local disk (in dataDir and indexDir).
  3. Segments are present in the remote tier backend in uncompressed form.
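
For example, assuming a controller at localhost:9000 and a hypothetical offline table named myTable:

curl -X GET "http://localhost:9000/segments/myTable/tiers?type=OFFLINE" -H "accept: application/json"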

When querying, you may need to increase the timeout of your query.
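
For example, a minimal sketch using Pinot's timeoutMs query option (the table name is hypothetical):

SET timeoutMs = 60000;
SELECT COUNT(*) FROM myTable WHERE ...;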

We have adopted multiversion concurrency control (MVCC) to make remote segment management transparent to ongoing queries. MVCC can lead to multiple copies of segment data with different index or encoding types kept in the remote tier backend. Automatic periodic cleanup of unused segment directories can be enabled using the configs listed in the Stale segment directory cleanup section.

With the optimizations discussed below, query latency can be reduced as needed.

Fine-tuning for query speed

You can use the following tuning adjustments to speed up queries.

Preloaded index

Pin some index types of some columns on the local disk to avoid frequent remote reads.

For example, pin bloom filters on local disk so that segments are pruned first to reduce the number of segments processed, minimizing remote reads. Bloom filters tend to be compact, not affecting local disk usage much. All index types can be pinned, including the tree nodes inside the star-tree index. Decide what to pin based on the need for cost savings and reduced query latencies.

Indexes to preload can be specified globally at the cluster level and overridden at the table level, by setting a table-level override either via cluster configuration or table configuration.

  1. To enable buffer preloading, use the cluster config APIs to add the following options to your cluster configuration:
{
   "pinot.server.instance.buffer.reference.manager.preload.enable": "true",
   "pinot.server.instance.buffer.reference.manager.preload.total.size": "100G",
   "pinot.server.instance.buffer.reference.manager.preload.dir": "/home/pinot/data/tieredStorage/preload/",
   "pinot.server.instance.buffer.reference.manager.preload.load.existing.buffers": "true",
   "pinot.server.instance.buffer.reference.manager.preload.index.keys.override": "<tableNameWithType1>,col1.bloom_filter,col2.sparse;<tableNameWithType2>,startree_index_0.tree"
}

The following buffer preload options are configured by default:

  • pinot.server.instance.buffer.reference.manager.preload.enable – Whether to use the preload feature. Default value: false.

  2. Once enabled, set global buffer preloading using the following cluster configuration:
{
   "pinot.server.instance.buffer.reference.manager.preload.index.keys": "*.bloom_filter,col2.sparse,startree_index_0.tree"
}

Follow the <columnName>.<indexType> convention to specify the index type to preload. Use * as a wildcard for the column name to preload all index buffers of a given index type.

  3. To override the global cluster-level configuration, specify the configuration option at the table level:
  • To specify the override in the cluster configuration, set the pinot.server.instance.buffer.reference.manager.preload.index.keys.override cluster configuration option as follows:
{
   "pinot.server.instance.buffer.reference.manager.preload.index.keys.override": "<tableNameWithType1>,*.bloom_filter,col2.sparse;<tableNameWithType2>,startree_index_0.tree"
}

Make sure to use tableNameWithType like foo_OFFLINE.

  • To specify the override in the table configuration, set the preload.index.keys.override property inside tierBackendProperties as follows:
   "tierConfigs": [
      {
        "name": "remoteTier",
         ...
        "tierBackendProperties": {
             "region": "<region>",
             "bucket": "<bucket>",
             "preload.index.keys.override": "*.bloom_filter,col2.sparse,col3.inverted_index"
             ...
        }
      }
    ],
  4. Restart affected servers to apply cluster configuration changes.

Automatic caching

Caching is different from preloading an index. What to cache and evict from local disk is decided automatically, based on the segment data commonly accessed by ongoing queries. If queries find cached index data, remote reads are avoided.

Automatic caching is specified as part of the cluster configuration like this:

{
   "pinot.server.instance.buffer.reference.manager.mmap.enable": "true",
   "pinot.server.instance.buffer.reference.manager.mmap.total.size": "100G",
   "pinot.server.instance.buffer.reference.manager.mmap.dir": "/home/pinot/data/tieredStorage/mmap/",
   "pinot.server.instance.buffer.reference.manager.mmap.load.existing.buffers": "true"
}

Review the related configuration options here:

  • pinot.server.instance.buffer.reference.manager.mmap.enable – Whether to use the automatic caching (mmap) feature. Default value: false.

Because these are cluster configs, any changes will require you to restart the servers.

On demand access

If index data isn’t found in local cache, StarTree accesses the data from the remote tier backend on demand during query execution.

⚠️

Using the cache alone does not ensure stable and predictable query performance, particularly when the amount of data in the remote tier backend is larger than the local disk by multiple orders of magnitude. However, the size of each Pinot segment is limited, and at any time only a limited number of segments are processed, which lets you configure the memory space for on demand access accordingly.

The size of the ondemand space is an important configuration for query performance. If it's too small, it may become a bottleneck for prefetching segment data effectively or for fully parallelizing remote reads. The reserved size ensures that query execution can always proceed, so it should be larger than the maximum segment size in the table. If query execution tends to wait for ondemand space, increase the reserved size. You've seen these two configs above in Add cluster configs and restart servers.

{
  "pinot.server.instance.buffer.reference.manager.ondemand.total.size": "24G",
  "pinot.server.instance.buffer.reference.manager.ondemand.reserved.size": "4G"
}

Review the related configuration options here:

  • pinot.server.instance.buffer.reference.manager.ondemand.total.size – Max memory for remote data access. Ensure this is smaller than the direct memory available on the server; otherwise you'll get OutOfDirectMemory exceptions. Default value: 24G.

Because these are cluster configs, any changes will require you to restart the servers.

Access methods

There are two methods for on demand data access: prefetch and read ahead.

Prefetching happens asynchronously before segment processing, but it fetches the index data as a whole.

Reading ahead happens synchronously, on demand, during segment processing, which adds I/O waiting time directly to query latency, but it fetches only the index data requested by the query plus a small number of bytes ahead, reading just a few small chunks of the index data.

By default, all remote data is obtained via prefetching. But depending on the query, such as which indexes are accessed, you can enable readAhead to speed up the query. The access methods can be turned on and off as query options, making it easy to experiment with both, as in this example:

SET "readAhead.enable" = true;
SET "prefetch.column.index.list.processing.phase"='col1.dictionary,col2.dictionary';
SET "readAhead.column.index.list.processing.phase"='col3.dictionary.initialBytes=1K;maxBytes=1M;numTriesBeforeDouble=10,teamID.dictionary.downloadAll=true';
SELECT ....

Even when readAhead is enabled, not all index data is accessed by reading ahead. Under the hood, a fetch planner decides which access method is used to read which column index.

For example, if a column is dictionary encoded and present in the query projection clause, the column's dictionary is prefetched, because when evaluating the projection clause, the dictionary is probed (via a binary search) for every single matching document. Prefetching the whole dictionary allows those relatively random reads from the binary search to be handled more efficiently.

By default, the fetch planner behaves as follows, but remember that this can always be overridden by the query options shown above.

  1. For columns in non-predicate clauses (such as projection, groupBy, and orderBy): if they are dictionary encoded, their dictionaries are prefetched; but in any case, their forward index data is accessed via the readAhead method, because only part of the data is needed to evaluate the clauses.
  2. For columns in predicate clauses (such as a where clause): if they have no index that can be used to evaluate the predicates, their forward index data is prefetched as a whole, because the whole forward index is scanned to evaluate the predicates anyway; otherwise, their index data (like an inverted index, range index, or JSON index) is accessed via the readAhead method, because only part of it is needed to evaluate the predicates.

Review the related configuration options here:

  • readAhead.enable – Whether to enable the read ahead access method. Default value: false (that is, all data is fetched by prefetching).

Changes to these configs do not require you to restart servers, because they are set as query options.

StarTree index

Enable the star-tree index with the enable.startree.index configuration in tierBackendProperties (shown below). Changes here require you to reload the table or restart the servers. This is disabled by default.

    "tierConfigs": [
      {
        "name": "remoteTier",
         ...
        "tierBackend": "s3",
        "tierBackendProperties": {
             "region": "<region>",
             "bucket": "<bucket>",
             "enable.startree.index": "true"
             ...
        }
      }
    ],

Each star-tree index consists of two major parts: tree nodes and pre-aggregated values.

The tree nodes hold a small amount of metadata about where to find the pre-aggregated values. If a query fits a star-tree index, the Pinot server traverses the tree nodes to identify the pre-aggregated values and merges them.

Because the total size of the tree nodes tends to be much smaller than that of all pre-aggregated values, and traversing tree nodes incurs relatively random reads, we suggest preloading the tree nodes on local disk and leaving the pre-aggregated values in the remote store for the best performance and cost balance when using the star-tree index. If not preloaded, the tree nodes are accessed via the readAhead method during queries.

Here is an example of preloading the tree nodes of a star-tree index. The numeric suffix corresponds to the order of the star-tree index configurations set in the table configuration.

{
   "pinot.server.instance.buffer.reference.manager.preload.index.keys.override": "<tableNameWithType1>,startree_index_0.tree,startree_index_1.tree"
}

Tier overwrites

The performance and cost requirements can be very different for data on the local and remote tiers, requiring different indexes across tiers.

For example, you might add a bloom filter, inverted index, and star-tree index for segments on the local tier to lower query latency, while adding only bloom filters on the remote tier to save cost.

To configure indexes according to tiers, see Overwrite index configs at tier level.
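
As an illustrative sketch only (the exact schema is described in that guide), a per-column override in fieldConfigList might look like the following, assuming a tier named remoteTier and a hypothetical column col1 that keeps an inverted index on the local tier but not on the remote tier:

    "fieldConfigList": [
      {
        "name": "col1",
        "indexes": { "inverted": { "enabled": "true" } },
        "tierOverwrites": {
          "remoteTier": {
            "indexes": {}
          }
        }
      }
    ],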

S3 client configs

The S3 client can also be tuned, for example, to tolerate longer request latency. However, the default S3 configurations should work in most cases.

If needed, the following configs customize how S3 clients fetch data from remote segments. All of these configs go inside tierBackendProperties. Most of them require a table reload or server restart to take effect.

Pinot supports both sync and async S3 clients.

By default, the sync S3 client is used, and a thread pool is created to fetch data via the sync S3 clients in parallel without blocking the query processing threads. The configs for the sync S3 client are prefixed with s3client.http.

The async S3 client can reduce the size of the thread pool considerably. It uses async I/O to fetch data in parallel in a non-blocking manner and uses the thread pool mentioned above only to process the I/O completion callbacks. The configs for the async S3 client are prefixed with s3client.asynchttp. Configs prefixed with s3client.general apply to both kinds of clients.
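
For illustration, these properties sit alongside the other tier backend settings; the keys below use placeholder names (<property>), not actual config keys:

    "tierBackendProperties": {
      "region": "<region>",
      "bucket": "<bucket>",
      "s3client.general.<property>": "<value>",
      "s3client.http.<property>": "<value>",
      "s3client.asynchttp.<property>": "<value>",
      ...
    }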

Review the related configuration options here:

  • Timeout for the end-to-end API call, which may involve a few HTTP request retries. Default value: no limit.
  • The max number of connections to pool for reuse. Default value: 50.
  • The max number of concurrent requests allowed. We use HTTP/1.1, so this controls max connections. Default value: 50.

Configs for ease of operations

You can use the following adjustments to simplify operations.

Server restart optimizations

When tiered storage is enabled, server restarts can take much longer than otherwise, primarily because servers must make multiple object store calls when loading a segment. The high restart time can be problematic, especially during upgrades and server outages.

To alleviate this, enable segment header caching by setting the pinot.server.instance.segment.cache.directory cluster configuration to a path. This path must be different from the dataDir. Setting the property causes segment metadata files, such as creation.meta, metadata.properties, and index_map, along with certain byte slices (specifically, the header bytes of each of the index buffers) of the columns.psf file, to be cached on disk.

These cached files are used during server startup and eliminate the expensive object store calls, which helps in reducing the overall server startup time.

⚠️

This feature only helps with server restarts, not with adding a new server, since no cached files are present when a server comes up for the first time.

Segment reloads using minions

Segment reloads can be an expensive and time-consuming operation with tiered storage enabled. This is due to the multiple remote segment fetches and uploads involved when reloading a tiered segment, which can increase the load on the Pinot servers hosting the tiered segments during reloads. A recommended solution is to leverage the Pinot minion task framework for segment reloads. To enable this, add SegmentReloadTask to the table configuration, like this:

    "task": {
      "taskTypeConfigsMap": {
        "SegmentReloadTask": {}
      }

Additionally, you can set pinot.server.instance.segment.segment.loader.fallback.allowed to false in the cluster configs when using SegmentReloadTask, as shown below. This ensures that all segment reload operations for tables where SegmentReloadTask is enabled are always handled by the minions and never by the servers, which further guarantees that servers don't get overloaded with segment reload operations.
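
For example, the corresponding cluster config entry looks like this:

{
  "pinot.server.instance.segment.segment.loader.fallback.allowed": "false"
}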

Once the configs are added, initiate a segment reload operation for the table using the POST /tasks/schedule API call.

curl -X POST "http://<CONTROLLER_URL>/tasks/schedule?taskType=SegmentReloadTask&tableName=<TABLE_NAME>" -H "accept: application/json"

Stale segment directory cleanup

With the implementation of segment-directory-level MVCC, a new segment directory is created in the object store whenever a segment is reloaded to apply changes in the schema or table configs. By default, these directories are not automatically removed, which can result in significant object storage space being used for outdated segment directories that correspond to previous schema or table configs.

To address this issue, periodic cleanup of these stale segment directories can be enabled by adding the following cluster configs. Please note that the controller needs to be restarted after adding the configs:

{
  "controller.tieredStorage.segment.cleanup.frequencyPeriod": "1h"
}

The following additional settings can also be used to further refine the behavior of deleting stale segment directories.

  • controller.tieredStorage.segment.cleanup.frequencyPeriod – The frequency at which the controller should check for stale segment directories. Default value: -1 (disabled by default).

Increasing controller.tieredStorage.segment.cleanup.tombstoneTtlMillis allows more time to reuse old segment directories if the table config is reverted. However, it prolongs how long stale directories persist in the object store.

The controller.tieredStorage.segment.cleanup.tombstoneToDeletionTtlMillis config specifies the wait time before deleting a tombstoned directory. This prevents the wrongful deletion of a segment directory if any server starts using it concurrently during the tombstoning process.
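
For example, the two TTLs can be set alongside the cleanup frequency in the cluster configs; the TTL values below are placeholders, not recommended defaults:

{
  "controller.tieredStorage.segment.cleanup.frequencyPeriod": "1h",
  "controller.tieredStorage.segment.cleanup.tombstoneTtlMillis": "<tombstone TTL in millis>",
  "controller.tieredStorage.segment.cleanup.tombstoneToDeletionTtlMillis": "<wait before deletion in millis>"
}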

Monitoring metrics

Below are metrics we emit to help understand how tiered storage works. The metrics mainly cover how:

  • Segments get uploaded to the tier backend, such as data volume and operation duration, rates or failures
  • Queries fetch data from the remote segments, such as data volume and query duration breakdown over query operators, query rates or failures
  • Segment data is kept on servers temporarily, such as data volume from segments temporarily held in memory or on disk

Review the available metrics here:

  • How many segments are uploaded to the S3 tier backend when reloading a segment.

⚠️

In addition to the metrics emitted by the tiered storage modules, CPU utilization and network bandwidth utilization are important to help you fine-tune the system.