flowchart TB
subgraph "Apache Spark"
CORE["Spark Core\n(RDD, memory management)"]
SQL["Spark SQL\n& DataFrames"]
SS["Structured\nStreaming"]
ML["MLlib\n(Machine Learning)"]
GR["GraphX\n(Graphs)"]
end
SQL --> CORE
SS --> CORE
ML --> CORE
GR --> CORE
style CORE fill:#FF9800,color:#fff
style SQL fill:#2196F3,color:#fff
style SS fill:#F44336,color:#fff
style ML fill:#4CAF50,color:#fff
style GR fill:#9C27B0,color:#fff
Lecture 5 — Apache Spark and Structured Streaming
Real-Time Data Analytics
1 Why Spark?
In previous lectures we learned about Kafka — a system that transports data streams. But Kafka itself doesn’t perform complex analytics. For that we need a processing engine — and that’s where Apache Spark comes in.
Spark is an engine for distributed data processing that supports both batch and streaming modes. It was created at UC Berkeley in 2009 as a response to the limitations of Hadoop MapReduce — mainly its slowness due to constant writes to disk.
- In-memory computing — data kept in RAM, not on disk. Up to 100x faster than MapReduce.
- Unified engine — batch, streaming, SQL, ML, graphs, all in one framework.
- Multiple languages — PySpark (Python), Scala, Java, R.
- Lazy evaluation — Spark builds an execution plan and optimizes it before computing anything.
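The lazy-evaluation idea can be mimicked in plain Python with generators — a loose analogy (not Spark itself), just to show that a lazy pipeline does no work until a terminal step:

```python
# A loose analogy for Spark's lazy evaluation using Python generators.
# Nothing runs when the "transformations" are defined — only when the
# "action" (list) finally pulls data through the pipeline.

data = range(1, 8)

# "Transformations" — build the pipeline, compute nothing yet
filtered = (x for x in data if x > 2)   # like .filter(...)
doubled = (x * 2 for x in filtered)     # like a .select(...) expression

# "Action" — only now does the data actually flow through the pipeline
result = list(doubled)
print(result)  # [6, 8, 10, 12, 14]
```

Spark goes further than this analogy: before executing, it also rewrites the whole plan (predicate pushdown, column pruning) via the Catalyst optimizer.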
2 PySpark — basics
2.1 SparkSession — the entry point
Every PySpark program starts by creating a SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("MyFirstApplication") \
.master("local[*]") \
    .getOrCreate()

local[*] means: run Spark locally, using all available CPU cores. In production, a cluster address is provided instead.
2.2 DataFrame API
A Spark DataFrame is the equivalent of a SQL table or a Pandas DataFrame — but distributed across multiple machines.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, sum, count
spark = SparkSession.builder.appName("Demo").master("local[*]").getOrCreate()
# Creating a DataFrame
data = [
("Warsaw", "Electronics", 4299.00),
("Krakow", "Clothing", 189.99),
("Warsaw", "Food", 87.50),
("Gdansk", "Electronics", 2599.00),
("Krakow", "Electronics", 1299.00),
("Warsaw", "Clothing", 349.99),
("Gdansk", "Food", 156.00),
]
df = spark.createDataFrame(data, ["city", "category", "amount"])
df.show()

2.3 Transformations and actions
Spark distinguishes two types of operations:
Transformations define what you want to do, but don’t execute any computation. They are lazy.
Examples: filter, groupBy, select, join.
# Transformations (lazy — nothing is computed yet)
result = df \
.filter(col("amount") > 100) \
.groupBy("city") \
.agg(
sum("amount").alias("total"),
count("amount").alias("count"),
avg("amount").alias("average")
)

Actions trigger the actual computation.
Examples: show, count, collect, write.
# Action (here Spark actually performs the computation)
result.show()
spark.stop()

You can also use plain SQL:
df.createOrReplaceTempView("sales")
spark.sql("""
SELECT city,
SUM(amount) as total,
COUNT(*) as count
FROM sales
WHERE amount > 100
GROUP BY city
ORDER BY total DESC
""").show()3 Structured Streaming — streams as tables
Structured Streaming treats a data stream as a table to which new rows are continuously appended. This lets you write streaming code almost identically to batch — using the same DataFrame API.
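The mental model can be sketched in a few lines of plain Python (a simplification, not Spark’s implementation): each micro-batch appends rows to an ever-growing input table, and the query is re-evaluated against it on every trigger:

```python
# Simplified sketch of the Structured Streaming model (not real Spark code):
# micro-batches append rows to an unbounded input table, and the query
# is re-run against the whole table on every trigger.

from collections import defaultdict

input_table = []  # the "unbounded input table"

def run_query(table):
    """The 'query': total amount per store over the whole input table."""
    totals = defaultdict(float)
    for store, amount in table:
        totals[store] += amount
    return dict(totals)

micro_batches = [
    [("Warsaw", 100.0), ("Krakow", 50.0)],  # batch t1
    [("Warsaw", 25.0)],                     # batch t2
    [("Gdansk", 10.0), ("Krakow", 5.0)],    # batch t3
]

for batch in micro_batches:
    input_table.extend(batch)              # new rows appended continuously
    result_table = run_query(input_table)  # result updated with each trigger
    print(result_table)
```

Real Spark does not literally recompute over all history — it keeps incremental state per key — but the observable result is as if it did.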
flowchart TB
subgraph "Input stream"
T1["Batch t1"] --> TAB["Unbounded\ninput table\n(new rows\nappended continuously)"]
T2["Batch t2"] --> TAB
T3["Batch t3"] --> TAB
T4["Batch t4..."] --> TAB
end
TAB -->|"Query"| RES["Result table\n(updated\nwith each trigger)"]
RES --> OUT["Output:\nconsole / Kafka / files / database"]
style TAB fill:#2196F3,color:#fff
style RES fill:#4CAF50,color:#fff
style OUT fill:#FF9800,color:#fff
3.1 Output modes
Append — only new rows are added to the output. Default mode.
Complete — the entire result table is written out on every trigger (typical after aggregation).
Update — only rows that changed since the last trigger are written.
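The difference between the modes can be illustrated by diffing two consecutive result tables — a toy sketch, not Spark internals (in real Spark, append mode on an aggregation additionally requires a watermark so a row is emitted only once its window is finalized):

```python
# Toy illustration of the append / complete / update output modes:
# given the previous and current result tables, what gets written out?

previous = {"Warsaw": 100.0, "Krakow": 50.0}
current = {"Warsaw": 125.0, "Krakow": 50.0, "Gdansk": 10.0}

# complete: the entire current result table is written every trigger
complete_out = current

# append: only rows that did not exist before
append_out = {k: v for k, v in current.items() if k not in previous}

# update: new rows plus rows whose value changed
update_out = {k: v for k, v in current.items() if previous.get(k) != v}

print(complete_out)  # all three stores
print(append_out)    # {'Gdansk': 10.0}
print(update_out)    # {'Warsaw': 125.0, 'Gdansk': 10.0}
```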
3.2 Example: streaming from CSV files
The simplest example — Spark monitors a directory and processes new CSV files as they appear:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, sum
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
spark = SparkSession.builder.appName("StreamingDemo").master("local[*]").getOrCreate()
# Data schema
schema = StructType() \
.add("transaction_id", StringType()) \
.add("amount", DoubleType()) \
.add("store", StringType()) \
.add("timestamp", TimestampType())
# Reading stream from CSV directory
streaming_df = spark.readStream \
.schema(schema) \
.option("header", True) \
.csv("/data/incoming/")
# Aggregation in 5-minute windows
result = streaming_df \
.groupBy(
window(col("timestamp"), "5 minutes"),
col("store")
) \
.agg(sum("amount").alias("total"))
# Write result to console
query = result.writeStream \
.outputMode("complete") \
.format("console") \
.trigger(processingTime="10 seconds") \
.start()
# query.awaitTermination()

4 Spark + Kafka integration
Kafka transports data, Spark processes it. This is the fundamental pair in real-time analytics architecture.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, avg, count
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
spark = SparkSession.builder \
.appName("KafkaStreaming") \
.master("local[*]") \
.getOrCreate()
# Read stream from Kafka (requires the spark-sql-kafka connector package on the classpath)
kafka_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "transactions") \
.option("startingOffsets", "latest") \
.load()
# Kafka returns data as bytes — needs to be parsed
schema = StructType() \
.add("id", StringType()) \
.add("amount", DoubleType()) \
.add("store", StringType()) \
.add("time", TimestampType())
parsed_df = kafka_df \
.selectExpr("CAST(value AS STRING) as json") \
.select(from_json(col("json"), schema).alias("data")) \
.select("data.*")# Aggregation: average amount and transaction count
# in 5-minute windows per store
result = parsed_df \
.withWatermark("time", "2 minutes") \
.groupBy(
window(col("time"), "5 minutes"),
col("store")
) \
.agg(
avg("amount").alias("average_amount"),
count("*").alias("transaction_count")
)
# Write to console (in labs: to database or dashboard)
query = result.writeStream \
.outputMode("update") \
.format("console") \
.option("truncate", False) \
.trigger(processingTime="10 seconds") \
.start()
# query.awaitTermination()

Note .withWatermark("time", "2 minutes") — this is the watermarking mechanism we discussed in Lecture 2. Spark tolerates events arriving up to 2 minutes late and automatically discards those that arrive later than that.
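The two mechanisms used above — tumbling windows and the watermark — can be sketched in plain Python. This is an illustration of the logic only (Spark’s actual semantics close whole windows rather than filtering single events); timestamps are in seconds:

```python
# Sketch of 5-minute tumbling windows with a 2-minute watermark
# (illustrative logic only — Spark handles all of this internally).

WINDOW = 5 * 60  # 5-minute tumbling windows
DELAY = 2 * 60   # tolerate events up to 2 minutes late

def window_start(ts):
    """Assign an event timestamp to the start of its tumbling window."""
    return (ts // WINDOW) * WINDOW

# (event_time, amount); the last event is very late
events = [(600, 100.0), (650, 50.0), (910, 25.0), (300, 10.0)]

max_event_time = 0
accepted = []
for ts, amount in events:
    max_event_time = max(max_event_time, ts)
    watermark = max_event_time - DELAY  # after event 910: 910 - 120 = 790
    if ts >= watermark:
        accepted.append((window_start(ts), amount))
    # else: event is older than the watermark — dropped

print(accepted)  # [(600, 100.0), (600, 50.0), (900, 25.0)] — ts=300 dropped
```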
5 Complete pipeline — course summary
Putting together everything we’ve learned in the lectures:
flowchart LR
subgraph "Data sources"
WEB["Web\napplication"]
PAY["Payment\nsystem"]
IOT["IoT\nsensors"]
end
subgraph "Transport"
K["Apache Kafka\n(L4)"]
end
subgraph "Processing"
SS["Spark Streaming\n(L5)"]
ML["FastAPI + ML\n(L3)"]
end
subgraph "Results"
DASH["Dashboard\n(near RT)"]
ALERT["Alerts\n(real-time)"]
REP["Reports\n(batch)"]
end
WEB --> K
PAY --> K
IOT --> K
K --> SS --> DASH
K --> ML --> ALERT
SS --> REP
style K fill:#FF9800,color:#fff
style SS fill:#2196F3,color:#fff
style ML fill:#4CAF50,color:#fff
style DASH fill:#E3F2FD,stroke:#2196F3
style ALERT fill:#FFEBEE,stroke:#F44336
style REP fill:#F3E5F5,stroke:#9C27B0
| Component | Role | Lecture |
|---|---|---|
| Data types, OLTP/OLAP, Data Lake | Context and history | L1 |
| Lambda/Kappa, time windows | Architecture | L2 |
| ML batch vs online, SGD, anomalies | Models and algorithms | L3 |
| Apache Kafka | Data transport | L4 |
| Apache Spark + Structured Streaming | Processing | L5 |
6 Algorithm complexity — a practical note
- Large data, simple computation — e.g., log filtering, aggregations. Spark handles this well.
- Small data, heavy computation — e.g., training a deep learning model. Better offline on GPU.
- Large data, heavy computation — e.g., real-time video analysis. Requires specialized architecture (GPU cluster, edge computing).
For real-time systems the key constraint is that the average processing time per event must be shorter than the average interval between arriving events. Otherwise the queue grows without bound.
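A quick back-of-the-envelope check makes the condition concrete (the numbers here are hypothetical, chosen for illustration):

```python
# Back-of-the-envelope stability check for a streaming system
# (hypothetical numbers for illustration).

events_per_second = 2_000        # incoming event rate
processing_ms_per_event = 0.4    # measured time to process one event

interval_ms = 1000 / events_per_second              # 0.5 ms between events
utilization = processing_ms_per_event / interval_ms  # must stay below 1.0

print(f"interval between events: {interval_ms} ms")
print(f"utilization: {utilization:.0%}")  # 80% — stable, with some headroom
assert utilization < 1.0  # otherwise the queue grows without bound
```

At 80% utilization the system keeps up; raise the rate to 2 500 events/s with the same per-event cost and utilization hits 100% — the tipping point where latency starts growing indefinitely.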
7 What’s next — the labs
In the labs you’ll build a complete system step by step:
| Lab | Topic |
|---|---|
| 1–2 | Environment (Git, Docker, Python), exploratory data analysis |
| 3–4 | ML model (scikit-learn), deployment preparation |
| 5–6 | Kafka in Docker — producer and consumer in Python |
| 7–8 | PySpark + Structured Streaming + Kafka |
| 9–10 | Complete pipeline + group project |
8 Summary
Apache Spark is the engine that turns raw data streams into business value. Combined with Kafka it forms a powerful real-time analytics platform. Structured Streaming simplifies streaming programming — you write code as for batch, and Spark takes care of the rest.
In the labs you’ll translate this knowledge into practice — from the first docker compose up to a complete real-time data processing system.
Make sure you have Docker and Git installed (see the Tools page). Clone the course repository and run docker compose up — if it works, you’re ready.