Lecture 3

⏳ Duration: 1.5h

🎯 Lecture Goal

Introducing students to the fundamentals of real-time analytics.

Data Streams: Understanding the Flow of Real-Time Information

You might associate streaming with platforms that deliver video content online, such as Netflix or YT. When you’re watching your favorite TV show, for example, the streaming service continuously sends you chunks of video, allowing you to watch without waiting for the entire video to download.

This concept of continuous, real-time delivery applies to streaming data. However, the “chunks” of data don’t have to be video. Instead, they could represent a wide range of information, such as continuous measurements from sensors in factories, power plants, or even monitoring systems in healthcare.

The key idea here is that streaming data is generated continuously and needs to be processed in real time. Unlike traditional batch processing (where large datasets are collected and analyzed later), stream processing works with live data, making it possible to detect issues, identify patterns, and make decisions in the moment.

For instance, you can’t afford to wait for a production line to stop in order to analyze its data. If a sensor detects an anomaly or failure, it needs to be logged immediately so that corrective action can be taken without delay. This is the essence of stream processing—the continuous analysis of data as it flows.

Stream processing refers to the continuous processing and analysis of large sets of data in motion.

This can be compared to other aspects of Big Data. Batch processing is the opposite of stream processing – you first collect large amounts of data, and only then analyze them. Of course, you could download the entire video before watching it, but would that really make sense?

There are situations where a batch approach is sufficient, but you can already see that stream processing can offer additional business value that is difficult to achieve with traditional batch processing.

Interesting Information

Real-Time Data Analytics vs. Event Stream Processing

While both real-time data analytics and event stream processing involve the analysis of data with minimal delay, they are not the same thing.

It’s easy to confuse real-time analytics with stream analytics (or event stream processing). While stream analytics technologies enable real-time analysis, they are not the same thing!

Stream Analytics (or event stream processing) involves analyzing a continuous stream of data, usually tied to a dedicated architecture. It focuses on events that occur over time—things like sensor readings, clicks on a website, or changes in stock prices.

Typically, real-time analytics systems are divided into hard and soft real-time systems: - Hard real-time systems (e.g., flight control systems) require strict adherence to deadlines – any delay could have catastrophic consequences. - Soft real-time systems (e.g., weather stations) can tolerate some delays, although in extreme cases, it could lead to data loss.

Moreover, while stream analytics assumes the presence of a dedicated streaming architecture, real-time analytics is not tied to any specific architecture.

The key point is that real-time analytics means processing data within the time period that a business defines as “real-time” – this can range from a few milliseconds to a few seconds, depending on business needs.

Data Sources for Streaming:

The variety of data sources that can feed into streaming systems is vast and spans several industries. Examples include:

  • IoT devices, equipment sensors,
  • clickstreams,
  • geolocation tracking,
  • user interactions (e.g., user actions on your website),
  • social media channels,
  • stock market data,
  • app activity.

This real-time data is used by businesses to detect patterns, create visualizations, trigger alerts, and take immediate action.

Business Justification: The Need for Streaming Data

The sheer volume and speed of modern data are unprecedented. In 2025, IoT devices are expected to generate a staggering 79.4 zettabytes (ZB) of data. Similarly, Twitter processes over 500 million tweets every day.

The speed at which this data is generated means traditional batch processing is no longer enough. Stream analytics is necessary for businesses to unlock insights from this massive flow of real-time information.

For example, IoT sensors can detect anomalies in equipment performance, allowing maintenance teams to take proactive action. In stock trading, real-time regression models can help traders make split-second decisions about buying and selling based on the latest market data.

Example Business Applications

  1. IoT Sensor Data and Anomaly Detection - Sensors in factories or on machinery can generate continuous data streams. Stream analytics can detect anomalies in real-time, allowing immediate maintenance or corrective actions to prevent costly downtime or equipment failure.
  2. Stock Trading, Real-time regression models analyze price movements and help traders identify trends. Timely processing of streaming data can trigger trades in response to market changes—potentially within milliseconds.
  3. Clickstream Analytics for Websites. Real-time analytics of website visitors’ actions can personalize the user experience, adjust content dynamically, or trigger targeted ads based on live user behavior.

8 Best Examples of Real-Time Data Analytics
Business Applications

Definitions

Learn more about streaming data

Definition 1Event: Any observable occurrence at a specific point in time. In data streams, events are immutable records that represent a specific action.

Definition 2 – In the context of data, an event is encoded as JSON, XML, CSV, or in a binary format.

Definition 3Continuous Event Stream: An ongoing sequence of events that happen over time, such as logs from servers, sensor readings, or user activity.

Definition 4Data Stream: A continuous flow of data from dynamic sources like sensor readings or log entries. These streams can be generated in real-time or derived from static sources like databases or files.

A business is an organization that generates and responds to a continuous stream of events.

Stream Analytics

Stream Analytics (also called Event Stream Processing) refers to processing large amounts of data as they are generated, in real-time.

Regardless of the technology used, all data exists as a continuous stream of events, including: - User actions on websites,
- System logs,
- Sensor measurements.

Time in Real-Time Data Analysis

In batch processing, we analyze historical data, and the timing of the process is unrelated to when the events actually occurred.

In stream processing, there are two key concepts of time: 1. Event Time – the actual moment the event occurred. 2. Processing Time – the moment the system processes the event.

Ideal Data Processing

In an ideal scenario, processing happens immediately after the event occurs:

Ideal Processing

Real Data Processing

In reality, data processing always involves some delay, visible as points below the ideal processing line (below the diagonal in the chart):

Real Processing

In stream processing applications, the difference between event time and processing time is crucial. Common causes of delays include:

  • Data transmission over the network,
  • Lack of communication between the device and the network.

An example of this is tracking the location of a car via a GPS application – passing through a tunnel might cause temporary data loss.

Handling Delays in Stream Processing

Delays in event processing can be managed in two main ways: 1. Monitoring the number of missed events and triggering an alarm if the number of rejected events exceeds a threshold. 2. Applying correction using watermarking, which is an additional mechanism that accounts for delayed events.

The process of real-time event processing can be represented as a step function:

Event Processing

Not all events contribute to the analysis – some may be discarded due to excessive delays.

By using watermarking, additional time is allowed for the appearance of delayed events. This process includes all events above the dashed line. However, there might still be cases where some points are skipped.

Watermarking Event Processing

The situations illustrated in the charts explicitly indicate why the concept of time is a critical factor and needs to be clearly defined at the business requirements level. Assigning timestamps to data (events) is a complex task.

Time Windows in Stream Analytics

In stream processing, time windows allow grouping data into time-limited segments, enabling event analysis within specific time intervals. Depending on the use case, various types of windows are applied, tailored to the characteristics of the data and analytical requirements.


1. Tumbling Window

A tumbling window is a fixed-length window that does not overlap – each event belongs to only one window.

Characteristics:
- Fixed window length
- No overlap between windows
- Ideal for dividing data into equal time segments

📌 Example: Analyzing the number of orders in an online store every 5 minutes.

Tumbling Window


2. Sliding Window

A sliding window includes all events occurring within a specific time interval, where the window slides continuously.

Characteristics:
- Each event can belong to multiple windows
- The window shifts by a specified interval
- Useful for detecting trends and anomalies

📌 Example: Tracking the average temperature over the last 10 minutes, updated every 2 minutes.

Sliding Window


3. Hopping Window

A hopping window is similar to a tumbling window, but it allows overlapping windows, meaning one event can belong to multiple windows. It is used to smooth data.

Characteristics:
- Fixed window length
- Overlapping windows
- Useful for noise reduction in data

📌 Example: Analyzing the number of website visitors every 10 minutes, but updated every 5 minutes to better capture trends.

Hopping Window


4. Session Window

A session window groups events based on activity periods and closes after a specified period of inactivity.

Characteristics:
- Dynamic window length
- Defined by user activity
- Used in user session analysis

📌 Example: Analyzing user sessions on a website – the session lasts as long as the user is active, but ends after 15 minutes of inactivity.


Summary

Different types of time windows are applied depending on the data’s characteristics and the analysis objectives. Choosing the right window impacts the accuracy of results and the efficiency of the analytical system.

Window Type Characteristics Use Cases
Tumbling Fixed length, no overlap Periodic reports
Sliding Fixed length, overlapping windows Trend detection, anomaly detection
Hopping Fixed length, partial overlap Data smoothing
Session Dynamic length, activity-dependent User session analysis

Each window type has its unique use cases and helps with better interpretation of streaming data. The choice of method depends on business needs and the nature of the analyzed data.

In stream data analysis, interpreting time is a complex issue due to: 1. Different systems having different clocks, leading to inconsistencies, 2. Data arriving with delays, requiring watermarking and time window techniques, 3. Different approaches to event time vs. processing time impacting result accuracy.