Lecture 5 — Apache Spark and Structured Streaming

Real-Time Data Analytics

Apache Spark as a distributed processing engine, Structured Streaming, Spark–Kafka integration, complete real-time analytics pipeline.
Note Duration: 1.5h

Goal: Learn Apache Spark as an engine for distributed data processing. Introduction to Structured Streaming and Spark–Kafka integration. Overview of the complete real-time analytics pipeline.


1 Why Spark?

In previous lectures we learned about Kafka — a system that transports data streams. But Kafka itself doesn’t perform complex analytics. For that we need a processing engine — and that’s where Apache Spark comes in.

Spark is an engine for distributed data processing that supports both batch and streaming modes. It was created at UC Berkeley in 2009 as a response to the limitations of Hadoop MapReduce — mainly its slowness due to constant writes to disk.

Tip In-memory processing

Data is kept in RAM rather than on disk, making Spark up to 100x faster than MapReduce.

Tip One engine, many modes

Batch, streaming, SQL, ML, graphs — all in one framework.

Tip Multi-language API

PySpark, Scala, Java, R.

Tip Lazy evaluation

Spark builds an execution plan and optimizes it before computing anything.

flowchart TB
    subgraph "Apache Spark"
        CORE["Spark Core\n(RDD, memory management)"]
        SQL["Spark SQL\n& DataFrames"]
        SS["Structured\nStreaming"]
        ML["MLlib\n(Machine Learning)"]
        GR["GraphX\n(Graphs)"]
    end
    SQL --> CORE
    SS --> CORE
    ML --> CORE
    GR --> CORE

    style CORE fill:#FF9800,color:#fff
    style SQL fill:#2196F3,color:#fff
    style SS fill:#F44336,color:#fff
    style ML fill:#4CAF50,color:#fff
    style GR fill:#9C27B0,color:#fff

Apache Spark ecosystem


2 PySpark — basics

2.1 SparkSession — the entry point

Every PySpark program starts by creating a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyFirstApplication") \
    .master("local[*]") \
    .getOrCreate()

local[*] means: run Spark locally, using all available CPU cores. In production, a cluster address is provided instead.

2.2 DataFrame API

A Spark DataFrame is the equivalent of a SQL table or a Pandas DataFrame — but distributed across multiple machines.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, sum, count

spark = SparkSession.builder.appName("Demo").master("local[*]").getOrCreate()

# Creating a DataFrame
data = [
    ("Warsaw", "Electronics", 4299.00),
    ("Krakow", "Clothing", 189.99),
    ("Warsaw", "Food", 87.50),
    ("Gdansk", "Electronics", 2599.00),
    ("Krakow", "Electronics", 1299.00),
    ("Warsaw", "Clothing", 349.99),
    ("Gdansk", "Food", 156.00),
]

df = spark.createDataFrame(data, ["city", "category", "amount"])
df.show()

2.3 Transformations and actions

Spark distinguishes two types of operations:

Transformations define what you want to do but don't execute any computation. They are lazy.

Examples: filter, groupBy, select, join.

# Transformations (lazy — nothing is computed yet)
result = df \
    .filter(col("amount") > 100) \
    .groupBy("city") \
    .agg(
        sum("amount").alias("total"),
        count("amount").alias("count"),
        avg("amount").alias("average")
    )

Actions trigger actual computation.

Examples: show, count, collect, write.

# Action (here Spark actually performs the computation)
result.show()

spark.stop()

You can also use plain SQL:

df.createOrReplaceTempView("sales")

spark.sql("""
    SELECT city,
           SUM(amount) as total,
           COUNT(*) as count
    FROM sales
    WHERE amount > 100
    GROUP BY city
    ORDER BY total DESC
""").show()

3 Structured Streaming — streams as tables

ImportantKey concept

Structured Streaming treats a data stream as a table to which new rows are continuously appended. This lets you write streaming code almost identically to batch — using the same DataFrame API.

flowchart TB
    subgraph "Input stream"
        T1["Batch t1"] --> TAB["Unbounded\ninput table\n(new rows\nappended continuously)"]
        T2["Batch t2"] --> TAB
        T3["Batch t3"] --> TAB
        T4["Batch t4..."] --> TAB
    end
    TAB -->|"Query"| RES["Result table\n(updated\nwith each trigger)"]
    RES --> OUT["Output:\nconsole / Kafka / files / database"]

    style TAB fill:#2196F3,color:#fff
    style RES fill:#4CAF50,color:#fff
    style OUT fill:#FF9800,color:#fff

Structured Streaming: stream as an unbounded table

3.1 Output modes

Note Append

Only new rows are added to the output. Default mode.

Warning Complete

The entire result table is written in full on every trigger (needed e.g. for aggregations over the whole stream).

Tip Update

Only changed rows are written.
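The difference between the modes can be illustrated with a plain-Python sketch (no Spark required): the "result table" holds running counts per key, and the output mode decides which rows each trigger sends to the sink. This is a simplified model of the semantics, not Spark's internals.

```python
# Simplified model of output modes: the result table is a dict of running
# counts per key; each trigger decides which rows reach the sink.

def process_trigger(result_table: dict, batch: list, mode: str) -> dict:
    """Apply one micro-batch of keys; return the rows emitted to the sink."""
    changed = set()
    for key in batch:
        result_table[key] = result_table.get(key, 0) + 1
        changed.add(key)
    if mode == "complete":   # the entire result table, every trigger
        return dict(result_table)
    if mode == "update":     # only rows whose value changed in this trigger
        return {k: result_table[k] for k in changed}
    raise ValueError(mode)

# "append" is not modeled here: with aggregations it emits a row only once
# it can no longer change, e.g. when the watermark closes its time window.

table = {}
process_trigger(table, ["Warsaw", "Krakow"], "update")  # both rows emitted
process_trigger(table, ["Warsaw"], "update")            # only Warsaw emitted
```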

3.2 Example: streaming from CSV files

The simplest example — Spark monitors a directory and processes new CSV files as they appear:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, sum
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("StreamingDemo").master("local[*]").getOrCreate()

# Data schema
schema = StructType() \
    .add("transaction_id", StringType()) \
    .add("amount", DoubleType()) \
    .add("store", StringType()) \
    .add("timestamp", TimestampType())

# Reading stream from CSV directory
streaming_df = spark.readStream \
    .schema(schema) \
    .option("header", True) \
    .csv("/data/incoming/")

# Aggregation in 5-minute windows
result = streaming_df \
    .groupBy(
        window(col("timestamp"), "5 minutes"),
        col("store")
    ) \
    .agg(sum("amount").alias("total"))

# Write result to console
query = result.writeStream \
    .outputMode("complete") \
    .format("console") \
    .trigger(processingTime="10 seconds") \
    .start()

# query.awaitTermination()
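Conceptually, window(col("timestamp"), "5 minutes") assigns each event to a tumbling window by flooring its timestamp to a 5-minute boundary. A plain-Python sketch of that assignment (valid for window lengths that divide an hour):

```python
from datetime import datetime, timedelta

def tumbling_window(ts: datetime, minutes: int = 5) -> tuple:
    """Return (start, end) of the tumbling window containing ts."""
    start = ts - timedelta(minutes=ts.minute % minutes,
                           seconds=ts.second,
                           microseconds=ts.microsecond)
    return start, start + timedelta(minutes=minutes)

# An event at 12:07:42 falls into the [12:05, 12:10) window:
start, end = tumbling_window(datetime(2024, 5, 1, 12, 7, 42))
```

All events whose timestamps floor to the same boundary land in the same group, which is why the aggregation above can sum amounts per (window, store) pair.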

4 Spark + Kafka integration

ImportantThe heart of the course

Kafka transports data, Spark processes it. This is the fundamental pair in real-time analytics architecture.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, avg, count
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder \
    .appName("KafkaStreaming") \
    .master("local[*]") \
    .getOrCreate()

# Read stream from Kafka
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "transactions") \
    .option("startingOffsets", "latest") \
    .load()

# Kafka returns data as bytes — needs to be parsed
schema = StructType() \
    .add("id", StringType()) \
    .add("amount", DoubleType()) \
    .add("store", StringType()) \
    .add("time", TimestampType())

parsed_df = kafka_df \
    .selectExpr("CAST(value AS STRING) as json") \
    .select(from_json(col("json"), schema).alias("data")) \
    .select("data.*")
# Aggregation: average amount and transaction count
# in 5-minute windows per store
result = parsed_df \
    .withWatermark("time", "2 minutes") \
    .groupBy(
        window(col("time"), "5 minutes"),
        col("store")
    ) \
    .agg(
        avg("amount").alias("average_amount"),
        count("*").alias("transaction_count")
    )

# Write to console (in labs: to database or dashboard)
query = result.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", False) \
    .trigger(processingTime="10 seconds") \
    .start()

# query.awaitTermination()
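One practical detail not shown above: a production streaming query should set a checkpoint location, so Spark can persist Kafka offsets and aggregation state and resume after a restart. A sketch of the fragment (the path is an example; in practice use durable storage such as HDFS or S3):

# Fragment of the writeStream above, with a checkpoint directory added.
# The path is an example, not a recommendation.
query = result.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("checkpointLocation", "/tmp/checkpoints/transactions") \
    .start()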

Note .withWatermark("time", "2 minutes") — this is the watermarking mechanism we discussed in Lecture 2. Spark tolerates event delays up to 2 minutes and automatically discards those that arrive later.
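The rule behind the watermark can be sketched in a few lines of plain Python: the watermark is the maximum event time seen so far minus the allowed delay, and events older than the watermark are dropped. (A simplified illustration, not Spark's internals.)

```python
from datetime import datetime, timedelta

class Watermark:
    """Simplified watermark: max event time seen minus the allowed delay."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time = datetime.min

    def accept(self, event_time: datetime) -> bool:
        """Return True if the event is on time, False if it is too late."""
        self.max_event_time = max(self.max_event_time, event_time)
        return event_time >= self.max_event_time - self.delay

wm = Watermark(timedelta(minutes=2))
wm.accept(datetime(2024, 5, 1, 12, 10))  # on time, advances the watermark
wm.accept(datetime(2024, 5, 1, 12, 9))   # 1 minute late, still accepted
wm.accept(datetime(2024, 5, 1, 12, 7))   # 3 minutes late, dropped
```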


5 Complete pipeline — course summary

Putting together everything we’ve learned in the lectures:

flowchart LR
    subgraph "Data sources"
        WEB["Web\napplication"]
        PAY["Payment\nsystem"]
        IOT["IoT\nsensors"]
    end
    subgraph "Transport"
        K["Apache Kafka\n(L4)"]
    end
    subgraph "Processing"
        SS["Spark Streaming\n(L5)"]
        ML["FastAPI + ML\n(L3)"]
    end
    subgraph "Results"
        DASH["Dashboard\n(near RT)"]
        ALERT["Alerts\n(real-time)"]
        REP["Reports\n(batch)"]
    end

    WEB --> K
    PAY --> K
    IOT --> K
    K --> SS --> DASH
    K --> ML --> ALERT
    SS --> REP

    style K fill:#FF9800,color:#fff
    style SS fill:#2196F3,color:#fff
    style ML fill:#4CAF50,color:#fff
    style DASH fill:#E3F2FD,stroke:#2196F3
    style ALERT fill:#FFEBEE,stroke:#F44336
    style REP fill:#F3E5F5,stroke:#9C27B0

Complete real-time analytics pipeline — course summary

Course map
| Component | Role | Lecture |
|---|---|---|
| Data types, OLTP/OLAP, Data Lake | Context and history | L1 |
| Lambda/Kappa, time windows | Architecture | L2 |
| ML batch vs online, SGD, anomalies | Models and algorithms | L3 |
| Apache Kafka | Data transport | L4 |
| Apache Spark + Structured Streaming | Processing | L5 |

6 Algorithm complexity — a practical note

CautionWhen designing a real-time pipeline, consider the computational dimension
  • Large data, simple computation — e.g., log filtering, aggregations. Spark handles this well.
  • Small data, heavy computation — e.g., training a deep learning model. Better offline on GPU.
  • Large data, heavy computation — e.g., real-time video analysis. Requires specialized architecture (GPU cluster, edge computing).

For real-time systems the key is that the processing time of one event must be shorter than the interval between events. Otherwise the queue grows indefinitely.
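This condition is just utilization below 1: if events arrive every T_arrival seconds and each takes T_service seconds to process, the backlog grows by roughly T_service - T_arrival per event whenever T_service > T_arrival. A quick check in plain Python (a single-worker simulation, for illustration):

```python
def backlog_after(n_events: int, arrival_interval: float, service_time: float) -> float:
    """Seconds of unprocessed work queued after n events (single worker)."""
    backlog = 0.0
    for _ in range(n_events):
        # Between arrivals the worker burns down the queue, then the new
        # event adds its own service time.
        backlog = max(0.0, backlog - arrival_interval) + service_time
    return backlog

# Service faster than arrivals: the backlog stays bounded.
backlog_after(10_000, arrival_interval=0.10, service_time=0.08)
# Service slower than arrivals: the backlog grows without bound.
backlog_after(10_000, arrival_interval=0.10, service_time=0.12)
```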


7 What’s next — the labs

In the labs you’ll build a complete system step by step:

Lab plan
| Lab | Topic |
|---|---|
| 1–2 | Environment (Git, Docker, Python), exploratory data analysis |
| 3–4 | ML model (scikit-learn), deployment preparation |
| 5–6 | Kafka in Docker — producer and consumer in Python |
| 7–8 | PySpark + Structured Streaming + Kafka |
| 9–10 | Complete pipeline + group project |

8 Summary

Apache Spark is the engine that turns raw data streams into business value. Combined with Kafka it forms a powerful real-time analytics platform. Structured Streaming simplifies streaming programming — you write code as for batch, and Spark takes care of the rest.

Important This was the last lecture

In the labs you’ll translate this knowledge into practice — from the first docker compose up to a complete real-time data processing system.

Tip Before Lab 1

Make sure you have Docker and Git installed (see the Tools page). Clone the course repository and run docker compose up — if it works, you’re ready.