flowchart LR
PA["Producer A"] --> T["Topic:\norders"]
PB["Producer B"] --> T
PC["Producer C"] --> T
T --> CX["Consumer X\n(Group 1)"]
T --> CY["Consumer Y\n(Group 1)"]
T --> CZ["Consumer Z\n(Group 2)"]
style T fill:#FF9800,color:#fff
style PA fill:#2196F3,color:#fff
style PB fill:#2196F3,color:#fff
style PC fill:#2196F3,color:#fff
style CX fill:#4CAF50,color:#fff
style CY fill:#4CAF50,color:#fff
style CZ fill:#9C27B0,color:#fff
Lecture 4 — Apache Kafka
Real-Time Data Analytics
1 From monolith to microservices — context
Before we get to Kafka, a brief context. A traditional application is a monolith — one large program that does everything: handles users, processes data, generates reports. Simple to start with, but hard to scale and modify.
Microservices is an approach in which the application is divided into small, independent services — each responsible for one task. The “payments” service doesn’t know the details of the “recommendations” service. They communicate through APIs or message queues.
When you have 20 microservices that need to exchange data, you end up with a point-to-point mesh of connections: with n services that is up to n(n-1)/2 links, so 190 for n = 20. Every new service requires integration with many others. This doesn’t scale.
Solution: a central message broker — one place through which all data flows. And that’s where Apache Kafka comes in.
2 What is Apache Kafka?
Kafka is a distributed streaming platform created at LinkedIn (2011), now developed as an Apache project. It is not “just another queue system” — it’s something more.
- Runs on multiple servers (brokers) as a cluster.
- Messages are written to disk and don’t disappear after being read.
- Handles millions of messages per second.
- Scales by adding brokers and partitions as data grows.
In a classical queue system (e.g., RabbitMQ) a message is deleted after being processed. Kafka doesn’t do that — messages are stored on disk for a configurable period (default 7 days). This means:
- multiple consumers can read the same data,
- a consumer can go back to earlier messages (replay),
- a consumer failure does not cause data loss.
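This retention model can be illustrated with a toy in-memory log — a sketch of the idea only, not the Kafka API. The names (`ToyPartition`, `read_from`) are invented for illustration:

```python
# Toy in-memory model of a Kafka partition (NOT the Kafka API):
# an append-only list in which every message keeps its offset,
# and each reader tracks its own position independently.

class ToyPartition:
    def __init__(self):
        self.log = []  # messages stay here; reading never deletes them

    def append(self, msg):
        self.log.append(msg)
        return len(self.log) - 1  # offset of the new message

    def read_from(self, offset):
        return self.log[offset:]

log = ToyPartition()
for order in ["order-1", "order-2", "order-3"]:
    log.append(order)

# Two independent consumers, each with its own offset
offset_a, offset_b = 0, 2
assert log.read_from(offset_a) == ["order-1", "order-2", "order-3"]
assert log.read_from(offset_b) == ["order-3"]

# "Replay": a consumer simply resets its offset to 0 and reads again
offset_a = 0
print(log.read_from(offset_a))  # all three messages, nothing was lost
```

Because reading never removes anything, any number of consumers can read the same data, and rewinding an offset is all it takes to replay.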
3 Kafka architecture
3.1 Publish/Subscribe pattern
Kafka implements the pub/sub (publish/subscribe) pattern. The sender (producer) doesn’t send messages directly to the receiver — it publishes to a topic. Receivers (consumers) subscribe to the topics they’re interested in.
3.2 Topics and partitions
A topic is a logical data channel — like a folder into which messages of a certain type arrive. E.g., transactions, orders, server-logs, sensor-readings.
Each topic is divided into one or more partitions. A partition is an ordered, immutable sequence of messages — each message receives a unique number (offset).
Partitions serve two purposes:
- Scalability — data from one topic can be distributed across multiple brokers.
- Parallelism — multiple consumers can read different partitions simultaneously.
A broker is a single Kafka server. A Kafka cluster consists of multiple brokers. Each broker stores a subset of partitions.
To ensure reliability, Kafka replicates partitions across multiple brokers. A replication factor (e.g., 3) means each partition has 3 copies on different servers. If one broker goes down — data is not lost.
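The idea of spreading copies across brokers can be sketched in a few lines. This is a deliberately simplified placement rule, not Kafka's actual assignment algorithm (which also balances leaders and can be rack-aware); `assign_replicas` is an invented illustration:

```python
# Simplified sketch: each partition gets `replication_factor` copies
# on distinct brokers, shifted so the load spreads across the cluster.

def assign_replicas(num_partitions, num_brokers, replication_factor):
    assert replication_factor <= num_brokers
    assignment = {}
    for p in range(num_partitions):
        # partition p's replicas land on brokers p, p+1, ... (mod cluster size)
        assignment[p] = [(p + i) % num_brokers for i in range(replication_factor)]
    return assignment

# 3 partitions, 3 brokers, replication factor 3
print(assign_replicas(3, 3, 3))
# Every partition lives on 3 different brokers, so losing one broker
# still leaves 2 copies of each partition.
```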
flowchart LR
subgraph "Topic: transactions"
direction TB
P0["Partition 0\noffset: 0,1,2,3..."]
P1["Partition 1\noffset: 0,1,2,3..."]
P2["Partition 2\noffset: 0,1,2,3..."]
end
P0 -.-> B1["Broker 1"]
P1 -.-> B2["Broker 2"]
P2 -.-> B3["Broker 3"]
style P0 fill:#E3F2FD,stroke:#2196F3
style P1 fill:#E3F2FD,stroke:#2196F3
style P2 fill:#E3F2FD,stroke:#2196F3
style B1 fill:#FFF3E0,stroke:#FF9800
style B2 fill:#FFF3E0,stroke:#FF9800
style B3 fill:#FFF3E0,stroke:#FF9800
4 Producers
A producer is an application that sends messages to Kafka. The producer specifies a topic, and Kafka assigns the message to a partition:
- by default: round-robin (sequentially to each partition),
- with a key: Kafka computes the key’s hash and selects the partition based on it — messages with the same key always go to the same partition.
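The key-to-partition mapping boils down to "hash of key modulo partition count". Kafka's default partitioner uses murmur2; the sketch below substitutes CRC32 from the standard library just to show the mechanism:

```python
import zlib

# Sketch of key-based partitioning: hash(key) % num_partitions.
# (Kafka's default partitioner uses murmur2, not CRC32.)
def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

# The same key always maps to the same partition, so all messages
# for one customer keep their relative order...
assert partition_for(b'C-1001', 3) == partition_for(b'C-1001', 3)

# ...while different keys spread across partitions.
for key in [b'C-1001', b'C-1002', b'C-1003']:
    print(key.decode(), '-> partition', partition_for(key, 3))
```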
from kafka import KafkaProducer
import json
from datetime import datetime

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Sending a transaction to the "transactions" topic
transaction = {
    'id': 'TX0042',
    'amount': 1250.00,
    'customer': 'C-1001',
    'time': datetime.now().isoformat(),
    'store': 'Warsaw'
}

producer.send('transactions', value=transaction)
producer.flush()
print(f"Sent: {transaction}")

If we want all transactions from the same customer to go to the same partition (e.g., for session analysis):
# Key = customer ID → same customer always in the same partition
producer.send(
    'transactions',
    key=b'C-1001',  # partitioning key
    value=transaction
)

Use a key when you need ordering guarantees for a given entity (e.g., all transactions for customer C-1001 in order) or when downstream processing requires grouping by key.
5 Consumers
A consumer is an application that reads messages from a topic. The consumer tracks its offset — it knows which message it last processed. After a failure it resumes from that point.
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

for message in consumer:
    tx = message.value
    if tx['amount'] > 1000:
        print(f"Large transaction: {tx['id']} = {tx['amount']} PLN ({tx['store']})")

5.1 Consumer groups
When a single consumer can’t keep up with processing — you add more. Consumers in the same group divide the partitions among themselves.
- Kafka automatically assigns partitions to consumers in a group.
- Each partition is read by exactly one consumer in the group.
- If a consumer fails — its partitions are taken over by another.
- Ideal: number of consumers in the group = number of partitions.
- If there are more consumers than partitions — the excess will be idle.
Topic: transactions (3 partitions)
Consumer group: "fraud-detection"
Partition 0 ──→ Consumer A
Partition 1 ──→ Consumer B
Partition 2 ──→ Consumer C
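The assignment rule shown above can be sketched as a toy model — this is the idea only, not Kafka's actual rebalancing protocol, and `assign` is an invented helper:

```python
# Toy model of partition assignment within one consumer group:
# each partition goes to exactly one consumer; extra consumers sit idle.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # deal partitions out round-robin across the group's consumers
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2]

print(assign(partitions, ["A", "B", "C"]))       # one partition each
print(assign(partitions, ["A", "B"]))            # A gets 2, B gets 1
print(assign(partitions, ["A", "B", "C", "D"]))  # D is idle
```

With 3 partitions, a fourth consumer in the group gets nothing to read — which is why the ideal group size equals the partition count.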
Different consumer groups can read the same data independently. E.g., group fraud-detection analyzes transactions for fraud, while group reporting builds reports — both read the transactions topic but have separate offsets.
6 ML microservice — FastAPI
In practice, a Kafka consumer often passes data to an ML model served by an API. FastAPI is a lightweight Python framework ideal for this purpose.
from fastapi import FastAPI
import numpy as np
import pickle

app = FastAPI()

# Load trained model
# with open("model.pkl", "rb") as f:
#     model = pickle.load(f)

@app.get("/predict/")
def predict_price(area: float, bedrooms: int, age: int):
    """Real estate price prediction."""
    features = np.array([[area, bedrooms, age]])
    # price = model.predict(features)[0]
    price = area * 8500 + bedrooms * 50000 - age * 2000  # simplified model
    return {"estimated_price": round(price, 2)}

# Start: uvicorn main:app --reload
# Query: http://localhost:8000/predict/?area=75&bedrooms=3&age=10

Request:
- URL: http://localhost:8000/predict/?area=75&bedrooms=3&age=10
- HTTP method: GET or POST
- Headers: Content-Type, Authorization
- Body: data in JSON format

Response:
- HTTP status: 200 OK, 400 Bad Request, 500 Internal Server Error
- Body: result in JSON format
{"estimated_price": 767500.0}

7 Kafka + ML — the complete picture
Combining elements from recent lectures, here is a typical real-time analytics system:
flowchart LR
subgraph Sources
WEB["Web application"]
IOT["IoT sensors"]
PAY["Payment system"]
end
subgraph Kafka
T["Topic:\nevents"]
end
subgraph Processing
SP["Spark Streaming"]
API["FastAPI + ML"]
DB["Write to database"]
end
subgraph Results
DASH["Dashboard"]
ALERT["Alerts"]
REP["Reports"]
end
WEB --> T
IOT --> T
PAY --> T
T --> SP --> DASH
T --> API --> ALERT
T --> DB --> REP
style T fill:#FF9800,color:#fff
style SP fill:#2196F3,color:#fff
style API fill:#4CAF50,color:#fff
style DB fill:#9C27B0,color:#fff
Producers (applications, sensors, systems) send events to Kafka. Consumers (Spark, FastAPI, reporting systems) read them and process them — each in their own way, independently of the others.
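This fan-out can be sketched as a toy in pure Python — each "group" applies its own logic to the full stream, just as separate consumer groups read the same topic with separate offsets. The function names (`fraud_check`, `reporting`) are invented for illustration:

```python
# Toy fan-out: one event stream, several independent "consumer groups",
# each applying its own logic (a sketch of the architecture, not Kafka code).

def fraud_check(event):
    # alerting group: flag suspiciously large transactions
    return "ALERT" if event["amount"] > 1000 else "ok"

def reporting(event):
    # reporting group: format the same events for a report
    return f"{event['store']}: {event['amount']} PLN"

events = [
    {"id": "TX1", "amount": 1500, "store": "Warsaw"},
    {"id": "TX2", "amount": 200, "store": "Krakow"},
]

# Each group consumes the full stream independently of the others.
alerts = [fraud_check(e) for e in events]
report = [reporting(e) for e in events]
print(alerts)  # ['ALERT', 'ok']
print(report)
```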
8 Summary
Kafka is the central element of streaming architecture. Its strength lies in data durability (messages don’t disappear), scalability (partitions, replication) and flexibility (many independent consumers). In the labs you’ll run a Kafka cluster in Docker and write producers and consumers in Python.
Next lecture: Apache Spark and Structured Streaming — processing streaming data at scale, integration with Kafka.
Exercise: You’re designing a competitor price monitoring system for a retail chain. What Kafka topics would you create? How many partitions? What consumer groups?