Lecture 5 — Apache Spark and Structured Streaming

Real-Time Data Analytics

Apache Spark as a distributed processing engine, Structured Streaming, Spark–Kafka integration, complete real-time analytics pipeline.
Note Duration: 1.5h

Goal: Learn Apache Spark as an engine for distributed data processing. Introduction to Structured Streaming and Spark–Kafka integration. Overview of the complete real-time analytics pipeline.


1 Why Spark?

In previous lectures we learned about Kafka — a system that transports data streams. But Kafka itself doesn’t perform complex analytics. For that we need a processing engine — and that’s where Apache Spark comes in.

Spark is an engine for distributed data processing that supports both batch and streaming modes. It was created at UC Berkeley in 2009 as a response to the limitations of Hadoop MapReduce — mainly its slowness due to constant writes to disk.

Tip In-memory processing

Data is kept in RAM rather than on disk, making Spark up to 100x faster than MapReduce.

Tip One engine, many modes

Batch, streaming, SQL, ML, graphs — all in one framework.

Tip Multi-language API

PySpark, Scala, Java, R.

Tip Lazy evaluation

Spark builds an execution plan and optimizes it before computing anything.

flowchart TB
    subgraph "Apache Spark"
        CORE["Spark Core\n(RDD, memory management)"]
        SQL["Spark SQL\n& DataFrames"]
        SS["Structured\nStreaming"]
        ML["MLlib\n(Machine Learning)"]
        GR["GraphX\n(Graphs)"]
    end
    SQL --> CORE
    SS --> CORE
    ML --> CORE
    GR --> CORE

    style CORE fill:#FF9800,color:#fff
    style SQL fill:#2196F3,color:#fff
    style SS fill:#F44336,color:#fff
    style ML fill:#4CAF50,color:#fff
    style GR fill:#9C27B0,color:#fff

Apache Spark ecosystem


2 PySpark — basics

2.1 SparkSession — the entry point

Every PySpark program starts by creating a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyFirstApplication") \
    .master("local[*]") \
    .getOrCreate()

local[*] means: run Spark locally, using all available CPU cores. In production, a cluster address is provided instead.

2.2 DataFrame API

A Spark DataFrame is the equivalent of a SQL table or a Pandas DataFrame — but distributed across multiple machines.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, sum, count

spark = SparkSession.builder.appName("Demo").master("local[*]").getOrCreate()

# Creating a DataFrame
data = [
    ("Warsaw", "Electronics", 4299.00),
    ("Krakow", "Clothing", 189.99),
    ("Warsaw", "Food", 87.50),
    ("Gdansk", "Electronics", 2599.00),
    ("Krakow", "Electronics", 1299.00),
    ("Warsaw", "Clothing", 349.99),
    ("Gdansk", "Food", 156.00),
]

df = spark.createDataFrame(data, ["city", "category", "amount"])
df.show()

2.3 Transformations and actions

Spark distinguishes two types of operations:

Transformations define what you want to do but don't execute any computation. They are lazy.

Examples: filter, groupBy, select, join.

# Transformations (lazy — nothing is computed yet)
result = df \
    .filter(col("amount") > 100) \
    .groupBy("city") \
    .agg(
        sum("amount").alias("total"),
        count("amount").alias("count"),
        avg("amount").alias("average")
    )

Actions trigger actual computation.

Examples: show, count, collect, write.

# Action (here Spark actually performs the computation)
result.show()

spark.stop()

You can also use plain SQL:

df.createOrReplaceTempView("sales")

spark.sql("""
    SELECT city,
           SUM(amount) as total,
           COUNT(*) as count
    FROM sales
    WHERE amount > 100
    GROUP BY city
    ORDER BY total DESC
""").show()

3 Structured Streaming — streams as tables

ImportantKey concept

Structured Streaming treats a data stream as a table to which new rows are continuously appended. This lets you write streaming code almost identically to batch — using the same DataFrame API.

flowchart TB
    subgraph "Input stream"
        T1["Batch t1"] --> TAB["Unbounded\ninput table\n(new rows\nappended continuously)"]
        T2["Batch t2"] --> TAB
        T3["Batch t3"] --> TAB
        T4["Batch t4..."] --> TAB
    end
    TAB -->|"Query"| RES["Result table\n(updated\nwith each trigger)"]
    RES --> OUT["Output:\nconsole / Kafka / files / database"]

    style TAB fill:#2196F3,color:#fff
    style RES fill:#4CAF50,color:#fff
    style OUT fill:#FF9800,color:#fff

Structured Streaming: stream as an unbounded table

3.1 Output modes

Note Append

Only new rows are added to the output. Default mode.

Warning Complete

The entire result table is written in full on every trigger (needed e.g. for aggregations over the whole stream).

Tip Update

Only changed rows are written.
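The difference between the modes can be illustrated with a plain-Python sketch (no Spark required): the "result table" holds running counts per key, and the output mode decides which rows each trigger sends to the sink. This is a simplified model of the semantics, not Spark's internals.

```python
# Simplified model of output modes: the result table is a dict of running
# counts per key; each trigger decides which rows reach the sink.

def process_trigger(result_table: dict, batch: list, mode: str) -> dict:
    """Apply one micro-batch of keys; return the rows emitted to the sink."""
    changed = set()
    for key in batch:
        result_table[key] = result_table.get(key, 0) + 1
        changed.add(key)
    if mode == "complete":   # the entire result table, every trigger
        return dict(result_table)
    if mode == "update":     # only rows whose value changed in this trigger
        return {k: result_table[k] for k in changed}
    raise ValueError(mode)

# "append" is not modeled here: with aggregations it emits a row only once
# it can no longer change, e.g. when the watermark closes its time window.

table = {}
process_trigger(table, ["Warsaw", "Krakow"], "update")  # both rows emitted
process_trigger(table, ["Warsaw"], "update")            # only Warsaw emitted
```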

3.2 Example: streaming from CSV files

The simplest example — Spark monitors a directory and processes new CSV files as they appear:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, sum
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("StreamingDemo").master("local[*]").getOrCreate()

# Data schema
schema = StructType() \
    .add("transaction_id", StringType()) \
    .add("amount", DoubleType()) \
    .add("store", StringType()) \
    .add("timestamp", TimestampType())

# Reading stream from CSV directory
streaming_df = spark.readStream \
    .schema(schema) \
    .option("header", True) \
    .csv("/data/incoming/")

# Aggregation in 5-minute windows
result = streaming_df \
    .groupBy(
        window(col("timestamp"), "5 minutes"),
        col("store")
    ) \
    .agg(sum("amount").alias("total"))

# Write result to console
query = result.writeStream \
    .outputMode("complete") \
    .format("console") \
    .trigger(processingTime="10 seconds") \
    .start()

# query.awaitTermination()
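Conceptually, window(col("timestamp"), "5 minutes") assigns each event to a tumbling window by flooring its timestamp to a 5-minute boundary. A plain-Python sketch of that assignment (valid for window lengths that divide an hour):

```python
from datetime import datetime, timedelta

def tumbling_window(ts: datetime, minutes: int = 5) -> tuple:
    """Return (start, end) of the tumbling window containing ts."""
    start = ts - timedelta(minutes=ts.minute % minutes,
                           seconds=ts.second,
                           microseconds=ts.microsecond)
    return start, start + timedelta(minutes=minutes)

# An event at 12:07:42 falls into the [12:05, 12:10) window:
start, end = tumbling_window(datetime(2024, 5, 1, 12, 7, 42))
```

All events whose timestamps floor to the same boundary land in the same group, which is why the aggregation above can sum amounts per (window, store) pair.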

4 Spark + Kafka integration

ImportantThe heart of the course

Kafka transports data, Spark processes it. This is the fundamental pair in real-time analytics architecture.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, avg, count
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder \
    .appName("KafkaStreaming") \
    .master("local[*]") \
    .getOrCreate()

# Read stream from Kafka
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "transactions") \
    .option("startingOffsets", "latest") \
    .load()

# Kafka returns data as bytes — needs to be parsed
schema = StructType() \
    .add("id", StringType()) \
    .add("amount", DoubleType()) \
    .add("store", StringType()) \
    .add("time", TimestampType())

parsed_df = kafka_df \
    .selectExpr("CAST(value AS STRING) as json") \
    .select(from_json(col("json"), schema).alias("data")) \
    .select("data.*")
# Aggregation: average amount and transaction count
# in 5-minute windows per store
result = parsed_df \
    .withWatermark("time", "2 minutes") \
    .groupBy(
        window(col("time"), "5 minutes"),
        col("store")
    ) \
    .agg(
        avg("amount").alias("average_amount"),
        count("*").alias("transaction_count")
    )

# Write to console (in labs: to database or dashboard)
query = result.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", False) \
    .trigger(processingTime="10 seconds") \
    .start()

# query.awaitTermination()
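One practical detail not shown above: a production streaming query should set a checkpoint location, so Spark can persist Kafka offsets and aggregation state and resume after a restart. A sketch of the fragment (the path is an example; in practice use durable storage such as HDFS or S3):

# Fragment of the writeStream above, with a checkpoint directory added.
# The path is an example, not a recommendation.
query = result.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("checkpointLocation", "/tmp/checkpoints/transactions") \
    .start()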

Note .withWatermark("time", "2 minutes") — this is the watermarking mechanism we discussed in Lecture 2. Spark tolerates event delays up to 2 minutes and automatically discards those that arrive later.
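The rule behind the watermark can be sketched in a few lines of plain Python: the watermark is the maximum event time seen so far minus the allowed delay, and events older than the watermark are dropped. (A simplified illustration, not Spark's internals.)

```python
from datetime import datetime, timedelta

class Watermark:
    """Simplified watermark: max event time seen minus the allowed delay."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time = datetime.min

    def accept(self, event_time: datetime) -> bool:
        """Return True if the event is on time, False if it is too late."""
        self.max_event_time = max(self.max_event_time, event_time)
        return event_time >= self.max_event_time - self.delay

wm = Watermark(timedelta(minutes=2))
wm.accept(datetime(2024, 5, 1, 12, 10))  # on time, advances the watermark
wm.accept(datetime(2024, 5, 1, 12, 9))   # 1 minute late, still accepted
wm.accept(datetime(2024, 5, 1, 12, 7))   # 3 minutes late, dropped
```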


5 Complete pipeline — course summary

Putting together everything we’ve learned in the lectures:

flowchart LR
    subgraph "Data sources"
        WEB["Web\napplication"]
        PAY["Payment\nsystem"]
        IOT["IoT\nsensors"]
    end
    subgraph "Transport"
        K["Apache Kafka\n(L4)"]
    end
    subgraph "Processing"
        SS["Spark Streaming\n(L5)"]
        ML["FastAPI + ML\n(L3)"]
    end
    subgraph "Results"
        DASH["Dashboard\n(near RT)"]
        ALERT["Alerts\n(real-time)"]
        REP["Reports\n(batch)"]
    end

    WEB --> K
    PAY --> K
    IOT --> K
    K --> SS --> DASH
    K --> ML --> ALERT
    SS --> REP

    style K fill:#FF9800,color:#fff
    style SS fill:#2196F3,color:#fff
    style ML fill:#4CAF50,color:#fff
    style DASH fill:#E3F2FD,stroke:#2196F3
    style ALERT fill:#FFEBEE,stroke:#F44336
    style REP fill:#F3E5F5,stroke:#9C27B0

Complete real-time analytics pipeline — course summary

Course map
| Component | Role | Lecture |
|---|---|---|
| Data types, OLTP/OLAP, Data Lake | Context and history | L1 |
| Lambda/Kappa, time windows | Architecture | L2 |
| ML batch vs online, SGD, anomalies | Models and algorithms | L3 |
| Apache Kafka | Data transport | L4 |
| Apache Spark + Structured Streaming | Processing | L5 |

6 Algorithm complexity — a practical note

CautionWhen designing a real-time pipeline, consider the computational dimension
  • Large data, simple computation — e.g., log filtering, aggregations. Spark handles this well.
  • Small data, heavy computation — e.g., training a deep learning model. Better offline on GPU.
  • Large data, heavy computation — e.g., real-time video analysis. Requires specialized architecture (GPU cluster, edge computing).

For real-time systems the key is that the processing time of one event must be shorter than the interval between events. Otherwise the queue grows indefinitely.
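This condition is just utilization below 1: if events arrive every T_arrival seconds and each takes T_service seconds to process, the backlog grows by roughly T_service - T_arrival per event whenever T_service > T_arrival. A quick check in plain Python (a single-worker simulation, for illustration):

```python
def backlog_after(n_events: int, arrival_interval: float, service_time: float) -> float:
    """Seconds of unprocessed work queued after n events (single worker)."""
    backlog = 0.0
    for _ in range(n_events):
        # Between arrivals the worker burns down the queue, then the new
        # event adds its own service time.
        backlog = max(0.0, backlog - arrival_interval) + service_time
    return backlog

# Service faster than arrivals: the backlog stays bounded.
backlog_after(10_000, arrival_interval=0.10, service_time=0.08)
# Service slower than arrivals: the backlog grows without bound.
backlog_after(10_000, arrival_interval=0.10, service_time=0.12)
```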


7 What’s next — the labs

In the labs you’ll build a complete system step by step:

Lab plan
| Lab | Topic |
|---|---|
| 1–2 | Environment (Git, Docker, Python), exploratory data analysis |
| 3–4 | ML model (scikit-learn), deployment preparation |
| 5–6 | Kafka in Docker — producer and consumer in Python |
| 7–8 | PySpark + Structured Streaming + Kafka |
| 9–10 | Complete pipeline + group project |

8 Summary

Apache Spark is the engine that turns raw data streams into business value. Combined with Kafka it forms a powerful real-time analytics platform. Structured Streaming simplifies streaming programming — you write code as for batch, and Spark takes care of the rest.

Important This was the last lecture

In the labs you’ll translate this knowledge into practice — from the first docker compose up to a complete real-time data processing system.

Tip Before Lab 1

Make sure you have Docker and Git installed (see the Tools page). Clone the course repository and run docker compose up — if it works, you’re ready.