Lecture 4 — Apache Kafka

Real-Time Data Analytics

Apache Kafka architecture — brokers, topics, partitions, producers, consumers, consumer groups and ML integration.
Note Duration: 1.5h

Goal: Learn Apache Kafka’s architecture — brokers, topics, partitions, producers, consumers and consumer groups. Understand why Kafka is the backbone of modern real-time systems.


1 From monolith to microservices — context

Before we get to Kafka, a brief context. A traditional application is a monolith — one large program that does everything: handles users, processes data, generates reports. Simple to start with, but hard to scale and modify.

Microservices is an approach in which the application is divided into small, independent services — each responsible for one task. The “payments” service doesn’t know the details of the “recommendations” service. They communicate through APIs or message queues.

Caution The N×N problem

When you have 20 microservices that need to exchange data, you create a point-to-point mesh of connections. Every new service requires integration with many others. This doesn’t scale.

Solution: a central message broker — one place through which all data flows. And that’s where Apache Kafka comes in.


2 What is Apache Kafka?

Kafka is a distributed streaming platform created at LinkedIn (2011), now developed as an Apache project. It is not “just another queue system” — it’s something more.

Tip Distributed

Runs on multiple servers (brokers) as a cluster.

Tip Durable

Messages are written to disk and don't disappear after being read.

Tip Fast

Handles millions of messages per second.

Tip Scalable

Add brokers and partitions as data grows.

Important Kafka is not a queue!

In a classical queue system (e.g., RabbitMQ) a message is deleted after being processed. Kafka doesn’t do that — messages are stored on disk for a configurable period (default 7 days). This means:

  • multiple consumers can read the same data,
  • a consumer can go back to earlier messages (replay),
  • a consumer failure does not cause data loss.
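The replay property can be illustrated with a toy, pure-Python model of a partition (a conceptual sketch, not the Kafka API): an append-only list in which the index plays the role of the offset, and each reader keeps its own position.

```python
# Toy model of a Kafka partition: an append-only log where the
# list index plays the role of the offset. Not the Kafka API,
# just an illustration of retention and replay.

class PartitionLog:
    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)   # offset = index in the list
        return len(self._messages) - 1   # offset of the new message

    def read_from(self, offset):
        return self._messages[offset:]   # reading does NOT delete anything

log = PartitionLog()
for tx in ["TX1", "TX2", "TX3"]:
    log.append(tx)

# Two independent readers, each with its own offset:
fraud_offset, report_offset = 0, 2
print(log.read_from(fraud_offset))   # ['TX1', 'TX2', 'TX3'] (full replay)
print(log.read_from(report_offset))  # ['TX3'] (only the newest)
```

Because reading never removes data, any number of readers can process the same log, and any of them can rewind its offset to replay history.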

3 Kafka architecture

3.1 Publish/Subscribe pattern

Kafka implements the pub/sub (publish/subscribe) pattern. The sender (producer) doesn’t send messages directly to the receiver — it publishes to a topic. Receivers (consumers) subscribe to the topics they’re interested in.

flowchart LR
    PA["Producer A"] --> T["Topic:\norders"]
    PB["Producer B"] --> T
    PC["Producer C"] --> T
    T --> CX["Consumer X\n(Group 1)"]
    T --> CY["Consumer Y\n(Group 1)"]
    T --> CZ["Consumer Z\n(Group 2)"]

    style T fill:#FF9800,color:#fff
    style PA fill:#2196F3,color:#fff
    style PB fill:#2196F3,color:#fff
    style PC fill:#2196F3,color:#fff
    style CX fill:#4CAF50,color:#fff
    style CY fill:#4CAF50,color:#fff
    style CZ fill:#9C27B0,color:#fff

Publish/Subscribe pattern in Kafka

3.2 Topics and partitions

A topic is a logical data channel — like a folder into which messages of a certain type arrive. E.g., transactions, orders, server-logs, sensor-readings.

Each topic is divided into one or more partitions. A partition is an ordered, immutable sequence of messages — each message receives a unique number (offset).

Partitions serve two purposes:

  • Scalability — data from one topic can be distributed across multiple brokers.
  • Parallelism — multiple consumers can read different partitions simultaneously.

A broker is a single Kafka server. A Kafka cluster consists of multiple brokers. Each broker stores a subset of partitions.

To ensure reliability, Kafka replicates partitions across multiple brokers. A replication factor (e.g., 3) means each partition has 3 copies on different servers. If one broker goes down — data is not lost.

flowchart LR
    subgraph "Topic: transactions"
        direction TB
        P0["Partition 0\noffset: 0,1,2,3..."]
        P1["Partition 1\noffset: 0,1,2,3..."]
        P2["Partition 2\noffset: 0,1,2,3..."]
    end
    P0 -.-> B1["Broker 1"]
    P1 -.-> B2["Broker 2"]
    P2 -.-> B3["Broker 3"]

    style P0 fill:#E3F2FD,stroke:#2196F3
    style P1 fill:#E3F2FD,stroke:#2196F3
    style P2 fill:#E3F2FD,stroke:#2196F3
    style B1 fill:#FFF3E0,stroke:#FF9800
    style B2 fill:#FFF3E0,stroke:#FF9800
    style B3 fill:#FFF3E0,stroke:#FF9800

Topic with partitions distributed across brokers
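Topics with a chosen partition count and replication factor are usually created up front. A sketch using kafka-python's admin client (a running broker on localhost:9092 is assumed, as in the other examples; the settings are illustrative):

```python
# Illustrative settings; adjust to your cluster
TOPIC_NAME = "transactions"
NUM_PARTITIONS = 3        # up to 3 consumers per group can work in parallel
REPLICATION_FACTOR = 3    # 3 copies, requires at least 3 brokers

def create_topic():
    # kafka-python admin API; requires a running broker
    from kafka.admin import KafkaAdminClient, NewTopic
    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([NewTopic(
        name=TOPIC_NAME,
        num_partitions=NUM_PARTITIONS,
        replication_factor=REPLICATION_FACTOR,
    )])
    admin.close()

if __name__ == "__main__":
    create_topic()
```

Note that the replication factor cannot exceed the number of brokers: on a single-broker development cluster use replication_factor=1.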


4 Producers

A producer is an application that sends messages to Kafka. The producer specifies a topic, and Kafka assigns the message to a partition:

  • by default: round-robin (sequentially to each partition),
  • with a key: Kafka computes the key’s hash and selects the partition based on it — messages with the same key always go to the same partition.

from kafka import KafkaProducer
import json
from datetime import datetime

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Sending a transaction to the "transactions" topic
transaction = {
    'id': 'TX0042',
    'amount': 1250.00,
    'customer': 'C-1001',
    'time': datetime.now().isoformat(),
    'store': 'Warsaw'
}

producer.send('transactions', value=transaction)
producer.flush()
print(f"Sent: {transaction}")

If we want all transactions from the same customer to go to the same partition (e.g., for session analysis):

# Key = customer ID → same customer always in the same partition
producer.send(
    'transactions',
    key=b'C-1001',       # partitioning key
    value=transaction
)

Tip When to use a key?

Use a key when you need ordering guarantees for a given entity (e.g., all transactions for customer C-1001 in order) or when downstream processing requires grouping by key.
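The key-to-partition mapping can be sketched as hash-then-modulo. Kafka's default partitioner actually uses the murmur2 hash; crc32 below is a simplified stand-in used only to show the idea:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Simplified stand-in for Kafka's default (murmur2) partitioner:
    # hash the key, then take it modulo the partition count.
    return zlib.crc32(key) % num_partitions

# The same key maps to the same partition, every time:
p1 = partition_for(b"C-1001", 3)
p2 = partition_for(b"C-1001", 3)
assert p1 == p2
print(f"C-1001 -> partition {p1}")
```

A consequence worth remembering: adding partitions changes the modulo, and with it the key-to-partition mapping, so per-key ordering holds only while the partition count stays fixed.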


5 Consumers

A consumer is an application that reads messages from a topic. The consumer tracks its offset — it knows which message it last processed. After a failure it resumes from that point.

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

for message in consumer:
    tx = message.value
    if tx['amount'] > 1000:
        print(f"Large transaction: {tx['id']} = {tx['amount']} PLN ({tx['store']})")
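By default kafka-python commits offsets automatically in the background. When a message must never count as "done" before it is actually processed, a common pattern is to commit manually after handling it. A sketch (the group name is illustrative and a running broker is assumed):

```python
def process(tx):
    # Placeholder business logic
    print(f"Processed {tx['id']}")

def consume_with_manual_commit():
    # Requires a running broker; kafka-python API
    from kafka import KafkaConsumer
    import json

    consumer = KafkaConsumer(
        'transactions',
        bootstrap_servers='localhost:9092',
        group_id='tx-processor',      # illustrative group name
        enable_auto_commit=False,     # we commit offsets ourselves
        value_deserializer=lambda x: json.loads(x.decode('utf-8')),
    )
    for message in consumer:
        process(message.value)        # handle the message first...
        consumer.commit()             # ...then record the offset

if __name__ == "__main__":
    consume_with_manual_commit()
```

If the consumer crashes between processing and commit, the message is re-delivered after restart, which gives at-least-once rather than at-most-once processing.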

5.1 Consumer groups

When a single consumer can’t keep up with processing — you add more. Consumers in the same group divide the partitions among themselves.

Note Consumer group rules
  • Kafka automatically assigns partitions to consumers in a group.
  • Each partition is read by exactly one consumer in the group.
  • If a consumer fails — its partitions are taken over by another.
  • Ideal: number of consumers in the group = number of partitions.
  • If there are more consumers than partitions — the excess will be idle.

  Topic: transactions (3 partitions)
  Consumer group: "fraud-detection"

  Partition 0 ──→ Consumer A
  Partition 1 ──→ Consumer B
  Partition 2 ──→ Consumer C
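The assignment rules above can be sketched as a round-robin distribution (a conceptual model: real Kafka uses pluggable assignor strategies such as range or round-robin, plus rebalancing on failure):

```python
def assign(partitions, consumers):
    # Conceptual round-robin assignment: each partition goes to
    # exactly one consumer; extra consumers receive nothing.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign([0, 1, 2], ["A", "B", "C"]))
# 3 partitions, 3 consumers: each consumer gets exactly one partition

print(assign([0, 1, 2], ["A", "B", "C", "D"]))
# 4th consumer "D" gets nothing: more consumers than partitions = idle consumers
```

Running it with two consumers shows the other boundary case: one consumer handles two partitions while the other handles one.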

Tip Multiple groups = independent reads

Different consumer groups can read the same data independently. E.g., group fraud-detection analyzes transactions for fraud, while group reporting builds reports — both read the transactions topic but have separate offsets.
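In kafka-python the group is selected with the group_id parameter: consumers that share a group_id split the partitions between them, while consumers with different group_id values read the topic independently. A sketch (group names are illustrative; a running broker is assumed):

```python
def make_consumer(group):
    # Requires a running broker; kafka-python API
    from kafka import KafkaConsumer
    return KafkaConsumer(
        'transactions',
        bootstrap_servers='localhost:9092',
        group_id=group,   # consumers sharing this value split the partitions
    )

# Illustrative group names; each group keeps its own offsets
GROUPS = ["fraud-detection", "reporting"]

if __name__ == "__main__":
    consumers = [make_consumer(g) for g in GROUPS]
```

Both groups see every message in transactions, but a message consumed by fraud-detection is still waiting, unread, for reporting.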


6 ML microservice — FastAPI

In practice, a Kafka consumer often passes data to an ML model served by an API. FastAPI is a lightweight Python framework ideal for this purpose.

from fastapi import FastAPI
import numpy as np
import pickle

app = FastAPI()

# Load trained model
# with open("model.pkl", "rb") as f:
#     model = pickle.load(f)

@app.get("/predict/")
def predict_price(area: float, bedrooms: int, age: int):
    """Real estate price prediction."""
    features = np.array([[area, bedrooms, age]])
    # price = model.predict(features)[0]
    price = area * 8500 + bedrooms * 50000 - age * 2000  # simplified model
    return {"estimated_price": round(price, 2)}

# Start: uvicorn main:app --reload
# Query: http://localhost:8000/predict/?area=75&bedrooms=3&age=10

Request:

  • URL: http://localhost:8000/predict/?area=75&bedrooms=3&age=10
  • HTTP method: GET (as in this example) or POST
  • Headers: e.g., Content-Type, Authorization
  • Body: data in JSON format (used with POST; a GET request passes parameters in the URL)

Response:

  • HTTP status: 200 OK, 400 Bad Request, 500 Internal Server Error
  • Body: result in JSON format

{"estimated_price": 767500.0}
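The endpoint can be queried from any HTTP client — including a Kafka consumer that enriches events with predictions. A sketch using the requests library (the service is assumed to be running on localhost:8000); the simplified pricing formula is repeated as a pure function so the expected value can be checked without a server:

```python
def estimate(area: float, bedrooms: int, age: int) -> float:
    # Same simplified formula as in the service
    return area * 8500 + bedrooms * 50000 - age * 2000

def query_service(area, bedrooms, age):
    # Requires the uvicorn server from the previous listing to be running
    import requests
    resp = requests.get(
        "http://localhost:8000/predict/",
        params={"area": area, "bedrooms": bedrooms, "age": age},
    )
    resp.raise_for_status()
    return resp.json()["estimated_price"]

print(estimate(75.0, 3, 10))  # 767500.0

if __name__ == "__main__":
    print(query_service(75, 3, 10))
```

Working through the formula by hand: 75 × 8500 = 637,500, plus 3 × 50,000 = 150,000, minus 10 × 2,000 = 20,000, giving 767,500.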

7 Kafka + ML — the complete picture

Combining elements from recent lectures, here is a typical real-time analytics system:

flowchart LR
    subgraph Sources
        WEB["Web application"]
        IOT["IoT sensors"]
        PAY["Payment system"]
    end
    subgraph Kafka
        T["Topic:\nevents"]
    end
    subgraph Processing
        SP["Spark Streaming"]
        API["FastAPI + ML"]
        DB["Write to database"]
    end
    subgraph Results
        DASH["Dashboard"]
        ALERT["Alerts"]
        REP["Reports"]
    end

    WEB --> T
    IOT --> T
    PAY --> T
    T --> SP --> DASH
    T --> API --> ALERT
    T --> DB --> REP

    style T fill:#FF9800,color:#fff
    style SP fill:#2196F3,color:#fff
    style API fill:#4CAF50,color:#fff
    style DB fill:#9C27B0,color:#fff

Typical real-time analytics system with Kafka

Producers (applications, sensors, systems) send events to Kafka. Consumers (Spark, FastAPI, reporting systems) read them and process them — each in their own way, independently of the others.


8 Summary

Kafka is the central element of streaming architecture. Its strength lies in data durability (messages don’t disappear), scalability (partitions, replication) and flexibility (many independent consumers). In the labs you’ll run a Kafka cluster in Docker and write producers and consumers in Python.

Note Next lecture

Apache Spark and Structured Streaming — processing streaming data at scale, integration with Kafka.

Tip Food for thought

You’re designing a competitor price monitoring system for a retail chain. What Kafka topics would you create? How many partitions? What consumer groups?