StarTree Glossary

Aggregation

In databases, an aggregation is a function that combines multiple rows into a single summary value. For example, COUNT returns how many rows match certain criteria, whereas DISTINCTCOUNT returns how many distinct values appear in the result set. SUM adds together all values from a column that match a query’s criteria. MIN and MAX return the lowest and highest values, while AVG calculates the arithmetic mean of the selected rows. Apache Pinot supports dozens of different aggregation functions. Read Documentation
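
Several of these aggregations can be combined in a single query. The following sketch assumes a hypothetical orders table; the table and column names are illustrative only.

    -- Summarize all orders placed since the start of 2024
    -- (hypothetical schema; DISTINCTCOUNT is a Pinot aggregation function).
    SELECT COUNT(*)              AS total_orders,      -- rows matching the filter
           DISTINCTCOUNT(userID) AS unique_customers,  -- distinct values in a column
           SUM(amount)           AS revenue,
           MIN(amount)           AS smallest_order,
           MAX(amount)           AS largest_order,
           AVG(amount)           AS average_order
    FROM orders
    WHERE orderDate >= '2024-01-01'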

Analytics

The field of computational investigation and interpretation of data, with the goals of broadening and deepening understanding of the domain under analysis and enabling effective communication and decision-making based on that understanding. The two main types of data analytics are batch analytics and real-time analytics.

Apache Helix

An open source dynamic cluster management framework that automates the orchestration of resources within a distributed cluster of systems. By managing resource allocations within a cluster, Helix aids in maintaining optimal system performance and reliability and ensures orderly and efficient operations amidst changing system states. In the case of Apache Pinot, Helix manages topology changes for both brokers and servers, and optimizes query loads across the cluster.

Apache Pinot

Apache Pinot is an open source distributed database designed for real-time, user-facing analytics. Apache Pinot is classified as an Online Analytical Processing (OLAP) database. It is capable of low-latency query execution even at extremely high throughput. Apache Pinot can ingest directly from event streaming sources like Apache Kafka and make events available for querying immediately. It can also ingest data from online transactional processing (OLTP) databases using Change Data Capture (CDC), or from batch data sources such as data warehouses or cloud object stores. Queries are made using a subset of SQL. Read More

Apache ZooKeeper

Similar to Apache Helix, Apache ZooKeeper is an open source project that helps coordinate distributed systems. Its name derives from its task of managing the connections between the many servers in a distributed system (the “zoo”). ZooKeeper ensures distributed clusters can manage and synchronize their states effectively. It maintains configuration information, provides distributed synchronization, and facilitates group services. By doing so, ZooKeeper helps ensure a system’s consistency, reliability, and the orderly execution of processes in complex distributed environments.

Batch Data

A data analytics method where data is collected over a period of time and then processed all at once: for example, daily, weekly, monthly, or annual accumulations of data. It is contrasted with streaming data, which is processed continuously and immediately as it's generated. Grouping and handling data in 'batches' saves resources and manages the processing workload effectively; for example, batch jobs may be throttled to a steady rate of processing, or paused and resumed to prevent overrunning system resources. Batching is often used when immediate processing isn't crucial and when handling large volumes of data efficiently is a priority. Some architectures also use “microbatching,” where data is processed in small, discrete quantities, usually less than one minute of accumulated data.

Bloom Filter

A Bloom filter is not an index type but a probabilistic data structure. Because it mathematically cannot produce false negatives, it can definitively determine where certain data is not located, which minimizes the amount of data that needs to be scanned to locate an element. In Apache Pinot, Bloom filters can be used to determine which segments do not contain specific data, so those segments can be skipped entirely.
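
As a sketch of the kind of query that benefits (the schema is hypothetical): a point lookup on a high-cardinality column lets each segment's Bloom filter answer "definitely absent" and be skipped without a scan.

    -- Point lookup on a high-cardinality column.
    -- Segments whose Bloom filter reports the value as definitely absent
    -- are pruned rather than scanned.
    SELECT *
    FROM orders
    WHERE orderID = 'a1b2-c3d4'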

Cluster

A group of computing resources dedicated to a single logical system. An Apache Pinot cluster consists of varying numbers and types of nodes, including servers, brokers, controllers, and minions.

Controller

A node in a cluster set aside for cluster metadata, configuration, and orchestration tasks, as opposed to storing data or serving queries. In Apache Pinot, controller nodes are used by Apache Helix and Apache ZooKeeper.

Data Lake

A flexible storage system that accommodates diverse data types in their native form without requiring any type of cleanup. Unlike a traditional data warehouse, data lakes primarily store raw, unprocessed data and typically offer larger storage capacity. Data lakes are a cost-effective, modern solution that handles large volumes of data for machine learning, big data processing, and real-time analytics. Popular data lake vendors include Amazon (using various services, such as S3, EMR, Glue, and Lake Formation), Databricks, Snowflake, Google Cloud (BigQuery, Google Cloud Storage), and Microsoft (Azure Data Lake Storage).

Data Lakehouse

A data lakehouse combines the extensive storage of a data lake with the structured processing of a data warehouse. It stores diverse data types in their native form, like a data lake, while providing structured, organized data management akin to a data warehouse. This hybrid solution is ideal for handling large volumes of both raw and processed data, facilitating advanced analytics and machine learning with cost-effective, modern capabilities. Popular commercial data lakehouses are provided by Databricks and Amazon Redshift. Popular open source data lakehouse technologies include Delta Lake, Apache Hudi, and Apache Iceberg.

Data Streaming

Data streaming, also known as event streaming, is the process of continuously collecting and delivering data from various data sources. This approach allows data to be processed and analyzed immediately as it becomes available, making it ideal for real-time and event-driven use cases. Popular data streaming technologies include Apache Kafka, Redpanda, and Apache Pulsar.

Data Warehouse

A structured repository designed to store data in a highly organized and consistent manner. Unlike a data lake, a data warehouse requires incoming data to be cleaned, transformed, and organized before it is stored. This is typically done via an ETL process, which extracts raw data from various sources and cleans it up for storage in the data warehouse, preparing the data for analysis. Since the data is organized, more complex queries can be run at low latencies while maintaining data integrity and consistency. Common data warehouse vendors include Amazon Redshift, Google BigQuery, Snowflake, and others.

Database

A computer application designed to organize, persistently store, and query collections of data. Hundreds of different databases exist, of various types: SQL and NoSQL, Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP), and so on. For example, Apache Pinot is a distributed OLAP-oriented SQL database designed for real-time analytics.

Extract, Load, Transform (ELT)

A modern variation of the ETL process, in which data is moved from one or more source systems to a single destination. In comparison to ETL, the data is loaded directly into its final destination after extraction from the sources. The data is then transformed at the destination itself, usually a data warehouse. This change in the order of steps eliminates the need for temporary storage (as is required in the ETL process) and decouples transformation from extraction, so that they can be run independently.
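
Because transformation happens at the destination, it is often written in the warehouse's own SQL. A minimal sketch, assuming hypothetical raw_events and clean_events tables:

    -- Extract and Load have already copied source data as-is into raw_events.
    -- Transform: cleanse and reshape inside the destination itself.
    INSERT INTO clean_events (event_id, user_name, event_time)
    SELECT DISTINCT id,
           LOWER(user_name),          -- normalize formatting
           CAST(ts AS TIMESTAMP)      -- enforce a consistent type
    FROM raw_events
    WHERE id IS NOT NULL;             -- drop malformed rows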

Extract, Transform, Load (ETL)

A process that moves data from one or more source systems to a single destination. This is done by extracting data from the sources and placing it in temporary storage, where it is transformed to adhere to standard formatting and quality. As a final step, data is loaded into the destination, typically a data warehouse, for analysis. The temporary storage is cleared after each “run,” which means all data has to be transformed at once.

Forward Index

A database index type that stores a pointer to an object along with the values for that object. This index type is useful when you want to look up data by the object itself. For example, imagine using phone numbers as an index: you could look up what information is associated with a given phone number, such as a business name, an individual point of contact, and/or the address of record for that number.

Inverted Index

A database index type that organizes data by value, mapping each value to the objects or records that contain it. For example, imagine looking up a person’s name to find out what phone numbers are associated with it. This could return a home phone number, an office phone number, and/or a mobile phone number.
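
The two lookup directions from this entry and the previous one can be sketched as queries against a hypothetical directory table (all names are illustrative); the difference lies in which column the index organizes.

    -- Forward-style lookup: given the key, fetch that record's values.
    SELECT business_name, contact, address
    FROM directory
    WHERE phone = '555-0104';

    -- Inverted-style lookup: given a value, find the records containing it.
    -- An inverted index maps each name directly to its matching records.
    SELECT home_phone, office_phone, mobile_phone
    FROM directory
    WHERE name = 'Alice Smith';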

JavaScript Object Notation (JSON)

JSON (pronounced like the name “Jason”) is an open standard for data interchange, producing text-based data objects readable by humans and parseable by computer applications alike. While its origin is associated with JavaScript, JSON data objects are language-independent. JSON encodes information as sets of attribute-value pairs, which can be nested.
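
A small illustrative example of a JSON object, with one level of nesting:

    {
      "name": "Alice Smith",
      "active": true,
      "phones": {
        "home": "555-0104",
        "mobile": "555-0199"
      }
    }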

Key-Value Pair

A key-value pair (KVP) is a set of data with two parts: a key and a value. The key is a unique identifier used to access or retrieve the associated data value, while the value is the data related to the key. Key-value pairs make up the data structure of a key-value store (also known as a key-value database), which provides an efficient way to store and retrieve data. KVPs are versatile and can be found in various programming languages and data storage formats.
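
In SQL terms, a key-value store can be sketched as a two-column table in which the key column is unique and every lookup goes through it (the table and key format here are illustrative):

    -- A minimal key-value table: one unique key, one value.
    CREATE TABLE kv_store (
      k VARCHAR PRIMARY KEY,
      v VARCHAR
    );

    -- Retrieve the value associated with a key.
    SELECT v FROM kv_store WHERE k = 'user:42:name';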

MapReduce

MapReduce is a programming model designed to process large datasets in a distributed computing environment. MapReduce consists of two main phases: the map phase transforms input data into a set of key-value pairs, and the reduce phase groups those key-value pairs and processes the data to generate the final output. It was originally developed at Google and is commonly associated with the Apache Hadoop ecosystem.
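
The classic word-count example can be sketched in SQL, since a GROUP BY aggregation computes the same result a word-count MapReduce job would (the words table is hypothetical):

    -- Map phase (conceptually): emit the pair (word, 1) for every word seen.
    -- Reduce phase (conceptually): group the pairs by key and sum the counts.
    SELECT word, COUNT(*) AS occurrences
    FROM words
    GROUP BY word;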

Node

A node is a set of computing resources within a cluster, such as compute, memory, storage, and networking, assigned to a specific task. In the context of Apache Pinot, a node can be assigned to one of various roles, including servers, brokers, controllers and minions.

Online Analytical Processing (OLAP)

One of the major classifications of databases, OLAP-oriented systems are focused on read-heavy workloads, scanning large ranges or full tables of data, and often producing results as aggregations. To make them more efficient for these read-oriented workloads, OLAP databases are most often column stores. OLAP databases are often contrasted with OLTP databases.

Online Transaction Processing (OLTP)

One of the major classifications of databases, OLTP-oriented systems are often focused on write-heavy or mixed workloads. OLTP databases are most often row stores, doing basic CRUD operations on a row-by-row basis. OLTP databases are often contrasted with OLAP databases.
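
The contrast shows up in the shape of typical statements (the schemas are hypothetical): an OLAP query scans and aggregates many rows, while an OLTP statement touches individual rows.

    -- Typical OLAP query: scan a wide range of rows, return an aggregation.
    SELECT region, SUM(amount) AS revenue
    FROM orders
    WHERE orderDate >= '2024-01-01'
    GROUP BY region;

    -- Typical OLTP statement: read or modify a single row by its key.
    UPDATE orders SET status = 'shipped' WHERE orderID = 10042;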

Queries per Second (QPS)

The rate at which queries are made to a database and results returned, measured over the span of one second.

Query

A request for information from a database, formed using the syntax and grammar defined by a query language, such as Structured Query Language (SQL). Simple SQL queries often use the SELECT statement. More complex SQL queries can use JOIN clauses to bring together data from two or more tables. Note that a query may return no matching data, one or more rows of data, or results based on aggregations.
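
For example (both tables are hypothetical), a simple SELECT retrieves matching rows, while a JOIN combines data from two tables:

    -- Simple query: rows from one table that match a condition.
    SELECT name, city FROM customers WHERE city = 'Denver';

    -- Join query: combine data from two tables on a shared key.
    SELECT c.name, o.amount
    FROM customers AS c
    JOIN orders AS o ON o.customerID = c.customerID;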

Real-Time Analytics

Data analysis performed within seconds to subseconds (low query latency) on the most current, up-to-date data (data freshness). It requires the capacity to perform efficient large-range or full-table scans, as well as aggregation of data results. Read More

Real-Time Data

Data that is generated and made available to users within seconds to milliseconds of an event. In the context of Apache Pinot, real-time data is stored in real-time tables, in contrast with batch or historical data, which can be hours, days, weeks, months, or years old and is stored in so-called “offline tables.”

Segment

Within Apache Pinot, tables are divided into horizontal shards called segments: time-based collections of related data, similar to the concept of partitions in other relational databases. Segments allow tables to be organized into smaller subsets of data that can be efficiently distributed and stored across multiple nodes for the highest performance and reliability. Segments are created automatically during data ingestion. Read Documentation

Star-Tree Index

A database index type unique to Apache Pinot that uses a tree structure built over multiple columns, with pre-computed aggregations, to significantly improve query performance and support high concurrency. The star-tree index acts as a kind of materialized view, but because aggregations are pre-computed at configurable levels of granularity, it is more space- and compute-efficient than a traditional materialized view.
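
As a sketch of the query shape a star-tree index serves well (the events table and its columns are illustrative): a filtered aggregation over a few dimension columns can be answered from pre-computed partial aggregates instead of raw rows.

    -- With a star-tree index over (country, browser) pre-aggregating SUM(clicks),
    -- this query reads pre-computed aggregates rather than scanning raw rows.
    SELECT country, SUM(clicks) AS total_clicks
    FROM events
    WHERE browser = 'Chrome'
    GROUP BY country;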

Streaming Data

Unlike batch data, which is analyzed in groups, streaming data is handled and processed immediately as it's being generated. Trading off some efficiency, this approach allows for real-time or near-real-time insights and actions.

Structured Query Language (SQL)

A domain-specific language used in managing relational databases. It enables data retrieval, manipulation, and management through various commands categorized into Data Query Language (DQL), Data Definition Language (DDL), Data Manipulation Language (DML), and Data Control Language (DCL). Developed in the 1970s, SQL has been universally adopted by well over 100 database providers. The structured approach of SQL has significantly impacted modern data management practices, evolving over time to support advanced data technologies. Apache Pinot is an example of a SQL database. In contrast, a class of databases that specifically do not use SQL are known collectively as NoSQL databases.
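
One short illustrative statement from each category (the people table and the analyst role are hypothetical):

    -- DDL: define structure.
    CREATE TABLE people (id INT, name VARCHAR(100));
    -- DML: change data.
    INSERT INTO people (id, name) VALUES (1, 'Alice');
    -- DQL: retrieve data.
    SELECT name FROM people WHERE id = 1;
    -- DCL: control access.
    GRANT SELECT ON people TO analyst;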

Tiered Storage

A data management strategy that uses a mixture of cloud and local storage solutions to optimize query performance while maintaining low costs. This is achieved by storing the most recent data, which is queried most often and has the greatest business value, on local, high-performance storage. Older data that is queried less often is stored cost-effectively in the cloud.

Time Series Data

A fundamental concept of data analytics where each piece of information is stamped with the time it was collected. Making time one of the given variables of a dataset (lining data up in order) creates continuity that allows for a detailed understanding of underlying patterns and context. This context is essential to the informed decision-making that real-time analytics aims to enable.
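
A minimal sketch (the cpu_metrics table is hypothetical): ordering timestamped measurements by time turns isolated readings into a sequence whose trend can be inspected.

    -- Retrieve an hour of measurements in time order to expose the trend.
    SELECT ts, cpu_percent
    FROM cpu_metrics
    WHERE ts BETWEEN '2024-05-01 12:00:00' AND '2024-05-01 13:00:00'
    ORDER BY ts;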

Transactional Data

Similar to time series data, transactional data also tracks activity over time. However, it not only looks at how one metric changes over time, but captures a wealth of information crucial for analysis and decision-making. A typical transactional dataset records what occurred, who or what was involved, and when it happened. Transactional data can also be analyzed in a time-sequenced manner, unveiling trends and insights over time.

User-Facing Analytics / Customer-Facing Analytics

User-facing analytics provide insights to end-users of systems, as opposed to internal organizational stakeholders. Because of this, user-facing analytics systems need to support far larger numbers of concurrent queries and provide answers through real-time dashboards, as well as desktop and mobile applications.