StarTree Glossary

Aggregation

In databases, an aggregation is a type of function that summarizes related query results. For example, COUNT returns how many rows match certain criteria, whereas DISTINCTCOUNT returns how many distinct values appear in the result set. SUM adds together all values from a column that match a query’s criteria. MIN and MAX return the lowest and highest values, while AVG calculates the arithmetic mean of the selected rows. Apache Pinot supports dozens of different aggregation functions.
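
For illustration, here is a minimal Python sketch of what each of these functions computes, using an in-memory list of rows; the table contents are made up for the example:

```python
# Illustrative rows, standing in for a small table of orders.
rows = [
    {"customer": "alice", "amount": 30.0},
    {"customer": "bob",   "amount": 45.5},
    {"customer": "alice", "amount": 12.5},
]

count = len(rows)                                # COUNT(*)              -> 3
distinct = len({r["customer"] for r in rows})    # DISTINCTCOUNT(customer) -> 2
total = sum(r["amount"] for r in rows)           # SUM(amount)           -> 88.0
lowest = min(r["amount"] for r in rows)          # MIN(amount)           -> 12.5
highest = max(r["amount"] for r in rows)         # MAX(amount)           -> 45.5
average = total / count                          # AVG(amount)           -> ~29.33
```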

Analytics

The field of computational investigation and interpretation of data, with the goal of broadening and deepening understanding of the domain under analysis and enabling effective communication and decision-making based on that understanding. The two main types of data analytics are batch analytics and real-time analytics.

Apache Helix

An open source dynamic cluster management framework that automates the orchestration of resources within a distributed cluster of systems. By managing resource allocations within a cluster, Helix aids in maintaining optimal system performance and reliability and ensures orderly and efficient operations amidst changing system states. In the case of Apache Pinot, Helix manages topology changes for both brokers and servers, and optimizes query loads across the cluster.

Apache Pinot

Apache Pinot is an open source distributed database designed for real-time, user-facing analytics. Apache Pinot is classified as an Online Analytical Processing (OLAP) database. It is capable of low-latency query execution even at extremely high throughput. Apache Pinot can ingest directly from event streaming sources like Apache Kafka and make events available for querying immediately. It can also ingest data from online transactional processing (OLTP) databases using Change Data Capture (CDC), or from batch data sources such as data warehouses or cloud object stores. Queries are made using a subset of SQL.

Apache ZooKeeper

Similar to Apache Helix, Apache ZooKeeper is an open source project that helps with the coordination of distributed systems. Its name derives from its task of managing the connections between the many servers in a distributed system (the “zoo”). ZooKeeper ensures distributed clusters can manage and synchronize their states effectively. It maintains configuration information, provides distributed synchronization, and facilitates group services. By doing so, ZooKeeper helps ensure a system’s consistency, reliability, and the orderly execution of processes in complex distributed environments.

Batch Data

A data analytics method where data is collected over a period of time and then processed all at once, for example in daily, weekly, monthly, or annual accumulations. It is contrasted with streaming data, which is processed continuously and immediately as it's generated. Grouping and handling data in 'batches' saves resources and manages the processing workload effectively; for example, batch jobs may be throttled to a steady rate of processing, or paused and resumed to prevent overrunning system resources. Batching is often used when immediate processing isn't crucial and when handling large volumes of data efficiently is a priority. Some architectures also use “microbatching,” where data is processed in small, discrete quantities, usually less than one minute of accumulated data.
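
As a rough sketch of the microbatching idea, the following Python snippet accumulates records and processes them together once a size or time threshold is reached; the thresholds and the process_batch stand-in are made up for the example:

```python
import time

BATCH_SIZE = 1000                # illustrative size threshold
FLUSH_INTERVAL_SECONDS = 60      # roughly "one minute of accumulated data"

def process_batch(batch):
    # Stand-in for the real batch job (write to storage, run a report, ...).
    print(f"processing {len(batch)} records at once")

batch = []
last_flush = time.monotonic()

def on_record(record):
    # Called once per incoming record; flushes when either threshold is hit.
    global last_flush
    batch.append(record)
    if len(batch) >= BATCH_SIZE or time.monotonic() - last_flush >= FLUSH_INTERVAL_SECONDS:
        process_batch(list(batch))
        batch.clear()
        last_flush = time.monotonic()
```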

Data Lake

A flexible storage system that accommodates diverse data types in their native form without requiring any type of cleanup. Unlike a traditional data warehouse, data lakes primarily store raw, unprocessed data and typically offer larger storage capacity. Data lakes are a cost-effective, modern solution that handles large volumes of data for machine learning, big data processing, and real-time analytics. Popular data lake vendors include Amazon (using various services, such as S3, EMR, Glue, and Lake Formation), Databricks, Snowflake, Google Cloud (BigQuery, Google Cloud Storage), and Microsoft (Azure Data Lake Storage).

Data Streaming

Data streaming, also known as event streaming, is the process of continuously collecting data from various sources and delivering it onward as it is generated. This approach allows data to be processed and analyzed immediately as it becomes available, making it ideal for real-time and event-driven use cases. Popular data streaming technologies include Apache Kafka, Redpanda, and Apache Pulsar.
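
As a minimal sketch of consuming such a stream, the following uses the kafka-python client; the broker address and topic name are assumptions made for the example:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Assumed for illustration: a local broker and a topic named "events".
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Each message is handled as soon as it arrives, rather than being
# accumulated into a batch for later processing.
for message in consumer:
    print(f"offset={message.offset} event={message.value}")
```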

Data Warehouse

A structured repository that is designed to store data in a highly organized and consistent manner. Unlike a data lake, a data warehouse requires incoming data to be cleaned, transformed, and organized before it's stored. This is typically done via an ETL process, which extracts raw data from various sources and cleans it up for storage in the data warehouse. This process prepares the data for analysis. Since the data is organized, more complex queries can be run at low latencies while maintaining data integrity and consistency. Common data warehouse offerings include Amazon Redshift, Google BigQuery, and Snowflake.

Extract, Load, Transform (ELT)

This is a modern variation of the ETL process, in which data is moved from one or more source systems to a single destination. In comparison to ETL, the data is loaded directly into its final destination after extraction from the sources. The data is then transformed at the destination itself, usually a data warehouse. This change in the order of steps eliminates the need for temporary storage (as is required in the ETL process) and allows for the decoupling of transformation and extraction, so that they can be run independently.

Extract, Transform, Load (ETL)

A process that moves data from one or more source systems to a single destination. This is done by extracting data from the sources and putting it in a temporary storage area, where it is transformed to adhere to standard formatting and quality. As a final step, data is loaded into the destination, which is typically a data warehouse, for analysis. The temporary storage is cleared after each “run”, which means all data has to be transformed at once.
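
A toy Python sketch contrasting the two orderings, ETL versus the ELT variation described above; all function names and data are made up for the example:

```python
raw = [{"name": " Alice ", "spend": "30"}, {"name": "BOB", "spend": "45"}]

def transform(records):
    # Standardize formatting: normalize names, cast spend to a number.
    return [{"name": r["name"].strip().title(), "spend": float(r["spend"])}
            for r in records]

# ETL: extract -> transform in temporary staging -> load the cleaned result.
staging = list(raw)                 # temporary storage, cleared per run
warehouse = transform(staging)

# ELT: extract -> load the raw data as-is -> transform at the destination,
# decoupled from extraction and runnable independently.
warehouse_raw = list(raw)
warehouse_clean = transform(warehouse_raw)
```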

Key-Value Pair

A key-value pair (KVP) is a set of data with two parts: a key and a value. The key is a unique identifier used to access or retrieve the associated data value, while the value is the data related to the key. Key-value pairs make up the data structure of a key-value store (also known as a key-value database), which provides an efficient way to store and retrieve data. KVPs are versatile, and can be found in various programming languages and data storage formats.
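
A quick illustration using Python's built-in dict, which is itself an in-memory key-value store; the key naming scheme shown is just one common convention:

```python
# Each entry is a key-value pair: a unique key mapped to its value.
store = {
    "user:42:name": "Alice",
    "user:42:plan": "pro",
}

# Reads and writes go directly through the key.
print(store["user:42:name"])                      # -> "Alice"
store["user:42:last_login"] = "2024-01-01T00:00:00Z"
```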

MapReduce

MapReduce is a programming model designed to process large datasets in a distributed computing environment. MapReduce consists of two main phases: the map phase transforms input data into a set of key-value pairs, and the reduce phase groups those key-value pairs and processes the data to generate the final output. It was originally developed at Google and is commonly associated with the Apache Hadoop ecosystem.
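
A classic word-count sketch of the model in plain, single-process Python; a real framework would distribute these phases across many nodes:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit one (key, value) pair per word.
    return [(word, 1) for word in document.split()]

def reduce_phase(grouped):
    # Reduce: combine all values for each key into the final output.
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["the quick brown fox", "the lazy dog"]

# Shuffle: group the emitted pairs by key before reducing.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

print(reduce_phase(grouped))   # {'the': 2, 'quick': 1, 'brown': 1, ...}
```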

Online Analytical Processing (OLAP)

One of the major classifications of databases, OLAP-oriented systems are focused on read-heavy workloads, scanning large ranges or full tables of data, and often producing results as aggregations. To make them more efficient for these read-oriented workloads, OLAP databases are most often column stores. OLAP databases are often contrasted with OLTP databases.

Online Transaction Processing (OLTP)

One of the major classifications of databases, OLTP-oriented systems are often focused on write-heavy or mixed workloads. OLTP databases are most often row stores, doing basic CRUD operations on a row-by-row basis. OLTP databases are often contrasted with OLAP databases.
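
A small Python sketch of the row-store versus column-store distinction behind these two classifications; the table and values are made up for the example:

```python
# Row store (typical OLTP): each row is kept together, which suits
# row-by-row CRUD operations.
row_store = [
    {"id": 1, "country": "US", "amount": 30.0},
    {"id": 2, "country": "DE", "amount": 45.5},
    {"id": 3, "country": "US", "amount": 12.5},
]

# Column store (typical OLAP): each column is kept together, which suits
# scans and aggregations that touch only a few columns.
column_store = {
    "id":      [1, 2, 3],
    "country": ["US", "DE", "US"],
    "amount":  [30.0, 45.5, 12.5],
}

# An aggregation reads one contiguous column here ...
total = sum(column_store["amount"])
# ... while the row layout would touch every field of every row.
```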

Queries per Second (QPS)

The rate at which queries are made to a database and results are returned, measured over the span of one second.

Query

A request for information from a database, formed using the syntax and grammar defined by a query language, such as Structured Query Language (SQL). Simple SQL queries often use the SELECT statement. More complex SQL queries can use JOIN statements to bring together data from two or more tables. Note that a query may return no matching data, or one or more rows worth of data, or it may return results based on aggregations.
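
For illustration, here are a simple SELECT and a JOIN run through Python's built-in sqlite3 module; the schema and data are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 30.0), (11, 1, 12.5), (12, 2, 45.5);
""")

# A simple SELECT: may match zero, one, or many rows.
print(conn.execute("SELECT name FROM customers WHERE id = 1").fetchall())

# A JOIN plus an aggregation, combining data from two tables.
print(conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchall())
```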

Real-Time Analytics

Data analysis performed at seconds-to-subsecond query latencies on the most current, up-to-date data (data freshness). Requires the capacity to perform efficient large range or full table scans, as well as aggregation of data results.

Real-Time Data

Data that is generated and made available to users within seconds to milliseconds of an event. In the context of Apache Pinot, real-time data is stored in real-time tables, and contrasted with batch or historical data that can be hours, days, weeks, months, or years old, stored in so-called “offline tables.”

Segment

Within Apache Pinot, tables are divided into horizontal shards called segments: time-based collections of related data, similar to the concept of partitions in other relational databases. Segments allow tables to be organized into smaller subsets of data that can be efficiently distributed and stored across multiple nodes for high performance and reliability. Segments are automatically created during the process of data ingestion.
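
As a conceptual sketch of that time-based grouping, the following routes rows into per-day buckets; real segment creation in Pinot happens automatically at ingestion and involves far more than this:

```python
from datetime import datetime, timezone

segments = {}  # one bucket of related rows per day, standing in for segments

def assign(row):
    day = datetime.fromtimestamp(row["ts"], tz=timezone.utc).date().isoformat()
    segments.setdefault(day, []).append(row)

assign({"ts": 1_700_000_000, "user": "alice"})
assign({"ts": 1_700_086_400, "user": "bob"})
print(list(segments))   # two keys: one per day of data
```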

Star-Tree Index

A database index type unique to Apache Pinot that uses a tree structure built over multiple columns, with pre-computed aggregations, to significantly improve query performance and support high concurrency. The star-tree index acts as a kind of materialized view that is more efficient than traditional pre-aggregation approaches.
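
A conceptual Python sketch of the pre-aggregation idea: aggregates are precomputed for combinations of dimension values, so a qualifying query becomes a lookup instead of a scan. The real structure is a tree with configurable dimensions, not a flat dict; this only illustrates the trade-off:

```python
from itertools import combinations

rows = [
    {"country": "US", "browser": "chrome", "impressions": 10},
    {"country": "US", "browser": "safari", "impressions": 5},
    {"country": "DE", "browser": "chrome", "impressions": 7},
]

dims = ("country", "browser")
pre_agg = {}
for row in rows:
    # "*" plays the role of the star node: "any value" for that dimension.
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            key = tuple(row[d] if d in subset else "*" for d in dims)
            pre_agg[key] = pre_agg.get(key, 0) + row["impressions"]

# SELECT SUM(impressions) ... WHERE country = 'US' becomes a lookup:
print(pre_agg[("US", "*")])   # -> 15, with no scan of the raw rows
```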

Streaming Data

Unlike batch data, which is analyzed in groups, streaming data is handled and processed immediately as it's being generated. Trading some processing efficiency for immediacy, this approach allows for real-time or near real-time insights and actions.

Structured Query Language (SQL)

A domain-specific language used in managing relational databases. It enables data retrieval, manipulation, and management through various commands categorized into Data Query Language (DQL), Data Definition Language (DDL), Data Manipulation Language (DML), and Data Control Language (DCL). Developed in the 1970s, SQL has been adopted by well over 100 database providers. The structured approach of SQL has significantly impacted modern data management practices, evolving over time to support advanced data technologies. Apache Pinot is an example of a SQL database. In contrast, a class of databases that specifically do not use SQL are known collectively as NoSQL databases.

Tiered Storage

A data management strategy that uses a mixture of cloud and local storage solutions to optimize query performance while maintaining low costs. This is achieved by storing the most recent data, which is queried most often and has the greatest business value, in local, high-performance storage. Older data that is queried less often is stored cost-effectively in the cloud.
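
A minimal sketch of such a tiering policy, routing data by age; the 30-day cutoff and tier names are made up for the example:

```python
import time

LOCAL_TIER_MAX_AGE_DAYS = 30  # illustrative cutoff

def choose_tier(created_at: float) -> str:
    # Recent, frequently queried data stays local; older data moves to
    # cheaper cloud object storage.
    age_days = (time.time() - created_at) / 86_400
    return "local-ssd" if age_days <= LOCAL_TIER_MAX_AGE_DAYS else "cloud-object-store"

print(choose_tier(time.time()))                  # -> "local-ssd"
print(choose_tier(time.time() - 90 * 86_400))    # -> "cloud-object-store"
```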

Time Series Data

A fundamental concept of data analytics, where each piece of information is stamped with the time it was collected. Making “time” one of the given variables of a dataset, lining the data up in order, creates continuity that allows for a detailed understanding of underlying patterns and context. This context is essential for the informed decision-making that real-time analytics aims to support.

Transactional Data

Similar to time series data, transactional data also tracks activity over time. However, it captures not just how one metric changes over time, but a wealth of information crucial for analysis and decision-making. A typical transactional dataset records what occurred, who or what was involved, and when it happened. Transactional data can also be analyzed in time sequence, unveiling trends and insights over time.

User-Facing Analytics / Customer-Facing Analytics

User-facing analytics provide insights to end-users of systems, as opposed to internal organizational stakeholders. Because of this, user-facing analytics systems need to support far larger numbers of concurrent queries and provide answers through real-time dashboards, as well as desktop and mobile applications.