Lecture 3: Stream Processing — Time and Windows

Duration: 1.5h

Goal: Understand the fundamentals of stream processing — what a data stream is, how to interpret time in streaming systems, and how time windows work.


Data Streams

When you watch a series on a streaming platform, it delivers video fragments continuously. You don’t wait for the entire file to download — data flows.

The same concept applies to business data. Every transaction, page click, sensor reading, or login event is an event that appears in a continuous, infinite stream.

An event is an immutable record describing something that happened at a specific point in time. It can be encoded as JSON, CSV, or a binary format. Once generated, an event doesn’t change — it can be read but not modified.

A data stream is an infinite sequence of events ordered in time.

Code
import json
from datetime import datetime, timedelta
import random

# Simulated e-commerce event stream; timestamps are drawn at random,
# so print order (arrival order) differs from event-time order
random.seed(42)
base_time = datetime(2026, 3, 9, 10, 0, 0)

for i in range(8):
    event = {
        "event_id": f"E{1000+i}",
        "timestamp": (base_time + timedelta(seconds=random.randint(1, 120))).isoformat(),
        "type": random.choice(["page_view", "add_to_cart", "purchase"]),
        "user_id": random.choice(["U01", "U02", "U03"]),
        "value": round(random.uniform(10, 500), 2) if random.random() > 0.5 else None
    }
    print(json.dumps(event))
{"event_id": "E1000", "timestamp": "2026-03-09T10:01:22", "type": "page_view", "user_id": "U01", "value": 130.0}
{"event_id": "E1001", "timestamp": "2026-03-09T10:00:18", "type": "purchase", "user_id": "U01", "value": 447.17}
{"event_id": "E1002", "timestamp": "2026-03-09T10:00:12", "type": "purchase", "user_id": "U02", "value": null}
{"event_id": "E1003", "timestamp": "2026-03-09T10:00:12", "type": "page_view", "user_id": "U01", "value": 23.0}
{"event_id": "E1004", "timestamp": "2026-03-09T10:00:26", "type": "purchase", "user_id": "U03", "value": 215.56}
{"event_id": "E1005", "timestamp": "2026-03-09T10:00:58", "type": "purchase", "user_id": "U02", "value": 13.18}
{"event_id": "E1006", "timestamp": "2026-03-09T10:01:44", "type": "page_view", "user_id": "U03", "value": null}
{"event_id": "E1007", "timestamp": "2026-03-09T10:00:36", "type": "page_view", "user_id": "U01", "value": 174.93}

Systems that handle data streams: transactional systems, IoT monitoring, website analytics, online ads, social media, logging systems.

The key perspective shift from the previous lecture: in batch processing the data source is a file — written once, read many times. In stream processing the source is a producer that generates events continuously, and those events can be processed by multiple consumers simultaneously.
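The producer/consumer relationship can be sketched with in-memory queues standing in for a real broker (a minimal illustration; all names are made up). Each consumer gets its own queue, so every consumer sees every event, independently of the others:

Code
```python
import queue
import threading

# One producer fans events out to several consumer queues
# (a stand-in for a real broker; a real stream has no end).
consumer_queues = [queue.Queue(), queue.Queue()]

def produce(events):
    for e in events:
        for q in consumer_queues:   # fan-out: every consumer sees every event
            q.put(e)
    for q in consumer_queues:
        q.put(None)                 # sentinel marking end of stream (demo only)

def consume(name, q, results):
    count = 0
    while (e := q.get()) is not None:
        count += 1                  # a real consumer would process the event here
    results[name] = count

events = [{"event_id": i} for i in range(5)]
results = {}
threads = [threading.Thread(target=consume, args=(f"C{i}", q, results))
           for i, q in enumerate(consumer_queues)]
for t in threads:
    t.start()
produce(events)
for t in threads:
    t.join()
print(results)  # each consumer processed all 5 events independently
```

The sentinel is only needed because the demo stream is finite; a real producer never signals an end.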

Stream Processing vs Real-Time Analytics

These two concepts are often confused, but they mean different things.

Stream processing is an architecture — we process data in motion as it arrives, instead of collecting it into files.

Real-time analytics is a business requirement — data must be processed fast enough to meet the company’s expectations. What “fast enough” means depends on context:

  • Hard real-time systems (e.g., flight control systems): any delay can have catastrophic consequences. Time guarantees are required.
  • Soft real-time systems (e.g., weather monitoring, dynamic pricing): some delay is tolerated, but shorter is better for business value.

Stream processing enables real-time analytics, but doesn’t guarantee it on its own — everything depends on architecture, infrastructure, and optimization.

Time in Streaming Systems

In batch processing, time isn’t critical — we analyze historical data, and the moment we run a report has no connection to when the analyzed events occurred.

In stream processing, time is critical, and we distinguish two concepts:

Event Time

The moment the event actually occurred. Assigned by the data source (e.g., sensor, application). This is the time we care about analytically.

Processing Time

The moment the system processes the event. Always later than event time — because data must travel through the network, be received by the broker, and processed by the consumer.

Code
import matplotlib.pyplot as plt
import numpy as np

# Illustration: event time vs processing time
np.random.seed(42)
n = 15
event_times = np.sort(np.random.uniform(0, 10, n))
delays = np.random.exponential(0.5, n)
processing_times = event_times + delays

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot([0, 12], [0, 12], 'k--', alpha=0.3, label='Ideal processing (delay = 0)')
ax.scatter(event_times, processing_times, c='steelblue', s=60, zorder=5)

for i in range(n):
    ax.plot([event_times[i], event_times[i]], [event_times[i], processing_times[i]],
            'r-', alpha=0.3, linewidth=1)

ax.set_xlabel('Event time')
ax.set_ylabel('Processing time')
ax.set_title('Latency: the gap between event time and processing time')
ax.legend()
ax.set_xlim(0, 12)
ax.set_ylim(0, 12)
plt.tight_layout()
plt.show()

In an ideal world every event would be processed instantly and all points would lie on the diagonal. In reality there is always some delay, so the points sit above the diagonal (processing time exceeds event time). Typical causes: network transmission, temporary connectivity loss (e.g., driving through a tunnel), system overload.

Watermarking — Handling Late Events

What to do with events that arrive late? Two strategies:

  1. Discard — ignore events that arrive after the window closes. Monitor the number of skipped events and alert when there are too many.

  2. Watermarking — define additional wait time for late events. A watermark is a marker saying: “all events up to this point should have arrived by now.” Events arriving before the watermark are included; after it — discarded.

Code
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np

np.random.seed(42)
n = 20
event_times = np.sort(np.random.uniform(0, 10, n))
delays = np.random.exponential(0.8, n)
processing_times = event_times + delays
watermark_delay = 1.5

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot([0, 12], [0, 12], 'k--', alpha=0.3, label='Ideal processing')
ax.plot([0, 12], [watermark_delay, 12 + watermark_delay], 'b--', alpha=0.4, label=f'Watermark (delay {watermark_delay}s)')

for i in range(n):
    # an event is included iff it arrived before the watermark line
    on_time = processing_times[i] <= event_times[i] + watermark_delay
    color = 'green' if on_time else 'red'
    ax.scatter(event_times[i], processing_times[i], c=color, s=50, zorder=5)

ax.set_xlabel('Event time')
ax.set_ylabel('Processing time')
ax.set_title('Watermarking — handling late events')
green_patch = mpatches.Patch(color='green', label='Included')
red_patch = mpatches.Patch(color='red', label='Discarded (too late)')
ax.legend(handles=[green_patch, red_patch, ax.lines[0], ax.lines[1]])
ax.set_xlim(0, 12)
ax.set_ylim(0, 14)
plt.tight_layout()
plt.show()

Watermarking is a trade-off: longer wait time = more events included, but greater delay in producing results.
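The include-or-discard decision can be sketched for a single 1-minute window (the timestamps and the 10-second watermark delay are made up for the example):

Code
```python
from datetime import datetime, timedelta

window_start = datetime(2026, 3, 9, 10, 0, 0)
window_end = window_start + timedelta(minutes=1)
watermark_delay = timedelta(seconds=10)   # extra wait for stragglers

# (event_time, processing_time) pairs -- illustrative values
events = [
    (window_start + timedelta(seconds=20), window_end + timedelta(seconds=5)),    # late, within watermark
    (window_start + timedelta(seconds=40), window_start + timedelta(seconds=41)), # on time
    (window_start + timedelta(seconds=55), window_end + timedelta(seconds=30)),   # too late
]

included, discarded = [], []
for event_time, processing_time in events:
    in_window = window_start <= event_time < window_end
    before_watermark = processing_time <= window_end + watermark_delay
    (included if in_window and before_watermark else discarded).append(event_time)

print(f"included: {len(included)}, discarded: {len(discarded)}")
```

The window effectively closes at 10:01:10 instead of 10:01:00, which is exactly the trade-off: 10 extra seconds of waiting in exchange for catching slightly late events.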

Time Windows

In stream processing we can’t analyze “all data” because the stream is infinite. Instead, we group events into time windows — bounded segments on which we perform computations.

Tumbling Window

Fixed length, windows don’t overlap. Each event belongs to exactly one window.

Use case: periodic reports — e.g., transaction count every 5 minutes.

Code
from datetime import datetime, timedelta
from collections import defaultdict

# Simulated events
events = []
t = datetime(2026, 3, 9, 10, 0, 0)
for i in range(20):
    t += timedelta(seconds=15 + (i * 7) % 30)
    events.append({"time": t, "amount": 50 + i * 13})

# Tumbling window: 1-minute windows
windows = defaultdict(list)
for e in events:
    key = e["time"].replace(second=0)
    windows[key].append(e["amount"])

print("Tumbling Window (1 min):")
for w, amounts in sorted(windows.items()):
    print(f"  {w.strftime('%H:%M')} -> {len(amounts)} events, total: {sum(amounts)} PLN")
Tumbling Window (1 min):
  10:00 -> 2 events, total: 113 PLN
  10:01 -> 2 events, total: 165 PLN
  10:02 -> 2 events, total: 217 PLN
  10:03 -> 2 events, total: 269 PLN
  10:04 -> 2 events, total: 321 PLN
  10:05 -> 2 events, total: 373 PLN
  10:06 -> 2 events, total: 425 PLN
  10:07 -> 2 events, total: 477 PLN
  10:08 -> 2 events, total: 529 PLN
  10:09 -> 2 events, total: 581 PLN

Sliding Window

Fixed length, but slides continuously. Each event can belong to multiple windows.

Use case: moving averages — e.g., the average temperature over the last 10 minutes, recomputed with every new reading.
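A sliding window can be sketched as a buffer that admits each new reading and evicts everything older than the window length (the readings below are made up):

Code
```python
from collections import deque
from datetime import datetime, timedelta

window = timedelta(minutes=10)
buffer = deque()          # (time, temperature) readings inside the window
averages = []

def on_event(time, temp):
    """Recompute the 10-minute average each time a reading arrives."""
    buffer.append((time, temp))
    while buffer[0][0] <= time - window:
        buffer.popleft()  # evict readings older than the window
    return sum(v for _, v in buffer) / len(buffer)

t0 = datetime(2026, 3, 9, 10, 0, 0)
for i in range(6):        # a reading every 3 minutes: 20, 21, ..., 25 degrees
    time = t0 + timedelta(minutes=3 * i)
    averages.append(on_event(time, 20.0 + i))

print(averages)  # [20.0, 20.5, 21.0, 21.5, 22.5, 23.5]
```

Note that the window advances with every event, not on a fixed schedule — that is what distinguishes it from the hopping window below.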

Hopping Window

Similar to tumbling, but windows can overlap. Window length and slide interval are two independent parameters.

Use case: data smoothing — e.g., page traffic analysis every 10 minutes, updated every 5 minutes.
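With a 10-minute window hopping every 5 minutes, each event belongs to length/hop = 2 windows (fewer near the start of the stream). A minimal sketch of the window assignment, with illustrative event times:

Code
```python
from collections import defaultdict
from datetime import datetime, timedelta

length = timedelta(minutes=10)   # window length
hop = timedelta(minutes=5)       # slide interval
origin = datetime(2026, 3, 9, 10, 0, 0)

def windows_for(t):
    """All hopping windows [start, start + length) that contain event time t."""
    start = origin + ((t - origin) // hop) * hop   # latest window starting at or before t
    starts = []
    while start > t - length and start >= origin:
        starts.append(start)
        start -= hop
    return sorted(starts)

counts = defaultdict(int)
for minutes in (1, 4, 7, 12, 18):                  # illustrative event times
    for w in windows_for(origin + timedelta(minutes=minutes)):
        counts[w] += 1

for w in sorted(counts):
    print(w.strftime('%H:%M'), '->', counts[w], 'events')
```

Setting hop equal to length recovers the tumbling window; that is why hopping is often described as its generalization.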

Session Window

Dynamic length, defined by user activity. Closes after a specified period of inactivity (gap).

Use case: user session analysis — e.g., a shopping session lasts as long as the user clicks, and ends after 15 minutes of inactivity.

Code
from datetime import datetime, timedelta

# Simulated user session
clicks = [
    datetime(2026, 3, 9, 10, 0, 0),
    datetime(2026, 3, 9, 10, 0, 30),
    datetime(2026, 3, 9, 10, 1, 15),
    datetime(2026, 3, 9, 10, 2, 0),
    # gap > 5 minutes
    datetime(2026, 3, 9, 10, 8, 0),
    datetime(2026, 3, 9, 10, 8, 45),
    datetime(2026, 3, 9, 10, 9, 10),
]

gap = timedelta(minutes=5)
sessions = []
session_start = clicks[0]
prev = clicks[0]

for k in clicks[1:]:
    if k - prev > gap:
        sessions.append((session_start, prev, (prev - session_start).seconds))
        session_start = k
    prev = k
sessions.append((session_start, prev, (prev - session_start).seconds))

print("Session Windows (gap = 5 min):")
for i, (start, end, dur) in enumerate(sessions):
    print(f"  Session {i+1}: {start.strftime('%H:%M:%S')} -> {end.strftime('%H:%M:%S')} ({dur}s)")
Session Windows (gap = 5 min):
  Session 1: 10:00:00 -> 10:02:00 (120s)
  Session 2: 10:08:00 -> 10:09:10 (70s)

Window Comparison

Window type   Length    Overlap            Typical use
Tumbling      Fixed     No                 Periodic reports
Sliding       Fixed     Yes (continuous)   Moving averages, trends
Hopping       Fixed     Yes (stepped)      Data smoothing
Session       Dynamic   No                 User session analysis

Summary

In this lecture you learned the fundamentals of stream processing:

  • A data stream is an infinite sequence of events ordered in time.
  • Event time and processing time are two different things — managing that difference is a core challenge.
  • Watermarking handles late events.
  • Time windows group an infinite stream into finite segments for analysis.

In the next lecture we’ll explore Apache Kafka — a platform that implements these concepts in practice, along with microservice architecture and APIs.

Business Impact

Time windows enable automation of decision processes: dynamic pricing in response to demand (sliding window), detecting unusual customer behavior (session window), or generating operational alerts (tumbling window). With stream analytics, a company detects problems as they happen — not a week later.