Data Streaming

What is data streaming?

Data streaming, also known as event streaming, is the process of continuously collecting and sending data from various data producers (sources) to different consumers (sinks). This approach allows for data to be immediately processed and analyzed as it becomes available, making it ideal for real-time and event-driven use cases.

Whereas data in databases is referred to as a “record,” in data streaming contexts the same information would be referred to as an “event.” The timing and ordering of events are a very important part of data streaming.

What are the most popular data streaming technologies?

Popular open source data streaming technologies include Apache Kafka, Apache Pulsar, and Redpanda. The latter two offer compatibility with Apache Kafka: Redpanda implements the Kafka API natively, while Pulsar provides it through a compatibility layer. There are also proprietary cloud vendor solutions like Amazon Kinesis and Google Cloud Dataflow.

What are key characteristics of data streaming?

The key characteristics of data streaming include:

  • Real-time or near-real-time: Data streaming processes data as it is generated, minimizing the delay between data creation and its availability for analysis.

  • Event-driven: Data streaming often relies on an event-driven architecture, where data is produced and consumed in response to specific events or triggers.

  • Continuous flow: Data is sent and received in a continuous, unbroken stream rather than being divided into batches. Data is segregated into data flows called “topics.”

  • Scalability: Data streaming systems are designed to handle large volumes of data — up to millions of events per second — and can add processing and storage resources to accommodate increased data loads and longer data retention windows.

  • Processing in transit: Data streaming architectures typically include components for stream processing as data moves through the system. This can include filtering, aggregation, transformation, enrichment, and analysis of data in real time. Popular stream processing technologies include Kafka Streams, Apache Pulsar Functions, Apache Flink, and Apache Beam.
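To make "processing in transit" concrete, here is a minimal sketch in plain Python of filtering, enrichment, and aggregation applied to events as they pass through a pipeline. It does not use any real stream processing framework; the function names, the event fields (`user`, `amount`), and the region lookup table are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator


def filter_events(events: Iterable[dict], min_amount: float) -> Iterator[dict]:
    """Filtering: pass along only events at or above a threshold."""
    for event in events:
        if event["amount"] >= min_amount:
            yield event


def enrich_events(events: Iterable[dict], regions: Dict[str, str]) -> Iterator[dict]:
    """Enrichment: attach a region looked up from the (hypothetical) user table."""
    for event in events:
        yield {**event, "region": regions.get(event["user"], "unknown")}


def total_by_region(events: Iterable[dict]) -> Dict[str, float]:
    """Aggregation: sum purchase amounts per region."""
    totals: Dict[str, float] = defaultdict(float)
    for event in events:
        totals[event["region"]] += event["amount"]
    return dict(totals)


# Hypothetical purchase events arriving from a producer.
purchases = [
    {"user": "alice", "amount": 30.0},
    {"user": "bob", "amount": 5.0},
    {"user": "carol", "amount": 12.5},
    {"user": "alice", "amount": 20.0},
]
user_regions = {"alice": "EU", "carol": "US"}

# Generators are chained, so each event flows through every stage one at a time.
pipeline = enrich_events(filter_events(purchases, min_amount=10.0), user_regions)
totals = total_by_region(pipeline)
print(totals)  # {'EU': 50.0, 'US': 12.5}
```

Because the stages are generators, no stage waits for the full data set before passing events on, which loosely mirrors how a stream processor handles an unbounded flow.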

How does data streaming work?

Data streaming ingests discrete data from various producers and packages each piece as an event (a relatively small, time-stamped unit of data) that is then appended to one or more topics. Consumers subscribe to topics by joining consumer groups. Each consumer group receives every event sent to the topics it subscribes to, with the work of reading those events shared among the group's members.
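The producer, topic, and consumer group relationship can be sketched with a small in-memory model. This is an illustrative toy, not the API of Kafka, Pulsar, or any other platform: the `Event` and `Topic` classes, their method names, and the group names are assumptions, and the model tracks one read position per group rather than dividing work among group members.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    """A small, time-stamped package of data."""
    payload: dict
    timestamp: float = field(default_factory=time.time)


class Topic:
    """Toy topic: an append-only log plus one read position per consumer group."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[Event] = []
        self.group_offsets: Dict[str, int] = {}

    def publish(self, payload: dict) -> int:
        """Append an event and return its position (offset) in the log."""
        self.log.append(Event(payload))
        return len(self.log) - 1

    def subscribe(self, group: str) -> None:
        """Register a consumer group, starting at the beginning of the log."""
        self.group_offsets.setdefault(group, 0)

    def poll(self, group: str) -> List[Event]:
        """Return events the group has not yet read and advance its offset."""
        start = self.group_offsets[group]
        batch = self.log[start:]
        self.group_offsets[group] = len(self.log)
        return batch


orders = Topic("orders")
orders.subscribe("billing")
orders.subscribe("analytics")

orders.publish({"order_id": 1, "total": 19.99})
orders.publish({"order_id": 2, "total": 5.00})

billing_batch = orders.poll("billing")
analytics_batch = orders.poll("analytics")
print(len(billing_batch), len(analytics_batch))  # 2 2
print(orders.poll("billing"))  # [] because billing has already read everything
```

Note that reading does not remove events from the log: each group simply keeps its own position, which is why both groups see both orders.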

Data streaming can achieve this at massive scale and speed, transporting and distributing millions of events per second. Because of this, data streaming is not designed to be directly consumed by a human user — data would scroll by too quickly for the human eye to comprehend. Data streaming was designed for computer-to-computer communications. Streaming applications and real-time databases consume the events, processing the data and providing only the most important results for human attention.

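Events themselves can be as small as one kilobyte (KB) of data or less, and typically range from 1KB to 20KB. Event streaming systems also cap the maximum message size; Apache Kafka, for example, limits messages to roughly 1 megabyte (MB) by default, although this limit is configurable.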
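As a rough illustration of these size figures, a producer might measure an event's encoded size before sending it. The sketch below assumes JSON encoding and a flat 1,000,000-byte ceiling; both are assumptions for the example, as real platforms use their own serialization formats and configurable limits, and the helper names are hypothetical.

```python
import json

# Assumed ceiling of about 1 MB; real systems make this configurable.
MAX_EVENT_BYTES = 1_000_000


def encoded_size(event: dict) -> int:
    """Size in bytes of the event when serialized as compact UTF-8 JSON."""
    return len(json.dumps(event, separators=(",", ":")).encode("utf-8"))


def fits(event: dict, limit: int = MAX_EVENT_BYTES) -> bool:
    """True if the serialized event is within the assumed size limit."""
    return encoded_size(event) <= limit


small_event = {"sensor": "t-17", "reading": 21.5}
large_event = {"blob": "x" * 2_000_000}  # deliberately oversized payload

print(encoded_size(small_event), fits(small_event))
print(encoded_size(large_event), fits(large_event))
```

An oversized payload such as `large_event` would usually be split up or stored elsewhere, with only a reference to it sent through the stream.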

Unlike older message queuing systems that delete data from buffers as soon as it is passed along, event streaming systems persist data so that consumers can go back over past events in the topic. For example, imagine a new consumer joins a consumer group and wants to see the past three days’ worth of activity. Alternatively, imagine a consumer was taken out of service for a few hours, and when it rejoins the network it wants to catch up on events it missed during its downtime window.
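Both scenarios rely on the same idea: the log keeps its events, and each consumer remembers how far it has read. The following sketch models that with a plain Python list and an integer offset. The `EventLog` and `Consumer` classes are hypothetical and position-based; a real system would also let a consumer seek by timestamp (for "the past three days") rather than only by position.

```python
from typing import List


class EventLog:
    """Toy persistent log: events stay in place after they are read."""

    def __init__(self) -> None:
        self._events: List[str] = []

    def append(self, event: str) -> int:
        """Store an event and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int) -> List[str]:
        """Return every event at or after the given offset."""
        return self._events[offset:]


class Consumer:
    """Tracks its own offset so it can resume where it left off."""

    def __init__(self, log: EventLog, start_offset: int = 0) -> None:
        self.log = log
        self.offset = start_offset

    def catch_up(self) -> List[str]:
        """Read everything published since the last call."""
        missed = self.log.read_from(self.offset)
        self.offset += len(missed)
        return missed


log = EventLog()
log.append("login")
log.append("add-to-cart")

existing = Consumer(log)
first_read = existing.catch_up()       # reads the two events so far

# The existing consumer is "offline" while two more events arrive.
log.append("checkout")
log.append("logout")
after_downtime = existing.catch_up()   # only the events it missed

# A brand-new consumer starts from the beginning and sees the full history.
newcomer = Consumer(log, start_offset=0)
full_history = newcomer.catch_up()

print(first_read)       # ['login', 'add-to-cart']
print(after_downtime)   # ['checkout', 'logout']
print(full_history)     # ['login', 'add-to-cart', 'checkout', 'logout']
```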

While theoretically all historic data events can be kept in a data streaming system, retention policies usually set a maximum time of retention known as a “Time-to-Live” (TTL) to avoid overrunning system storage constraints and budgets.
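A time-based retention policy can be sketched as a periodic sweep that drops events older than the TTL. The example below is an assumption-laden toy: the `RetainedLog` class is hypothetical, timestamps are passed in explicitly as seconds so the behavior is reproducible, and real platforms typically expire whole storage segments rather than individual events.

```python
from collections import deque
from typing import Deque, List, Tuple

THREE_DAYS = 3 * 24 * 60 * 60  # TTL in seconds (259,200)


class RetainedLog:
    """Toy log that discards events older than a time-to-live."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        # (timestamp, payload) pairs, oldest first.
        self._events: Deque[Tuple[float, str]] = deque()

    def append(self, payload: str, timestamp: float) -> None:
        """Store an event; callers are assumed to append in time order."""
        self._events.append((timestamp, payload))

    def expire(self, now: float) -> int:
        """Drop events whose age exceeds the TTL; return how many were dropped."""
        removed = 0
        while self._events and now - self._events[0][0] > self.ttl_seconds:
            self._events.popleft()
            removed += 1
        return removed

    def payloads(self) -> List[str]:
        """Payloads of the events still retained, oldest first."""
        return [payload for _, payload in self._events]


log = RetainedLog(ttl_seconds=THREE_DAYS)
log.append("a", timestamp=0)
log.append("b", timestamp=100_000)
log.append("c", timestamp=250_000)

removed = log.expire(now=300_000)  # "a" is now older than three days
print(removed, log.payloads())     # 1 ['b', 'c']
```

A longer TTL lets consumers replay further back, at the cost of more storage, which is the trade-off retention policies are meant to manage.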

Data Streaming Use Cases

With data streaming, organizations can use real-time data to make better decisions, improve operations, enhance customer experiences, and stay competitive in today's fast-paced business environment.

Some of the most common use cases for data streaming include:

  • Real-time analytics

  • Customer 360º

  • Fraud detection

  • Stock trading

  • IT monitoring & observability

  • Cybersecurity

  • Social media

  • Telecom

  • Product catalogs & inventory management

  • IoT and sensor data processing

Data Streaming for Real-Time Analytics

StarTree Cloud, powered by Apache Pinot, is a fully managed, user-facing, real-time analytics Database-as-a-Service (DBaaS) designed for OLAP at massive speed and scale. Founded by the original creators of Pinot, StarTree Cloud integrates seamlessly with transactional databases and event streaming platforms, ingesting data at millions of events per second and indexing it for lightning-fast query responses.

StarTree Cloud can handle high-throughput, low-latency analytical queries, such as fast aggregations, making it ideal for a variety of use cases across social media and collaboration platforms, delivery and ridesharing services, retail and telecommunications companies, financial services, and more.

Both Apache Kafka and Apache Pinot were invented at LinkedIn, and Pinot was designed from the ground up specifically to handle real-time data ingestion as a Kafka consumer. Consequently, the two technologies are often deployed and used in tandem. For example, Apache Pinot provides support for real-time upserts, both full and partial, which are common in event streaming use cases. StarTree Cloud makes ingestion of Kafka events even easier through StarTree Data Manager, a no-code data ingestion tool.

StarTree also offers StarTree ThirdEye, an anomaly detection system that looks for unusual events that occur in your real-time streaming data, so that users can do immediate root-cause analysis and take action the moment issues are detected.

Additional Resources