flowchart LR
PA["Producer A"] --> T["Topic:\norders"]
PB["Producer B"] --> T
PC["Producer C"] --> T
T --> CX["Consumer X\n(Group 1)"]
T --> CY["Consumer Y\n(Group 1)"]
T --> CZ["Consumer Z\n(Group 2)"]
style T fill:#FF9800,color:#fff
style PA fill:#2196F3,color:#fff
style PB fill:#2196F3,color:#fff
style PC fill:#2196F3,color:#fff
style CX fill:#4CAF50,color:#fff
style CY fill:#4CAF50,color:#fff
style CZ fill:#9C27B0,color:#fff
Lecture 4 — Apache Kafka
Real-Time Data Analytics
1 From monolith to microservices — context
Before we get to Kafka, a brief context. A traditional application is a monolith — one large program that does everything: handles users, processes data, generates reports. Simple to start with, but hard to scale and modify.
Microservices is an approach in which the application is divided into small, independent services — each responsible for one task. The “payments” service doesn’t know the details of the “recommendations” service. They communicate through APIs or message queues.
When you have 20 microservices that need to exchange data, you end up with a point-to-point mesh of connections: with n services that is up to n(n-1)/2 links, so 190 for n = 20. Every new service requires integration with many others. This doesn’t scale.
Solution: a central message broker — one place through which all data flows. And that’s where Apache Kafka comes in.
2 What is Apache Kafka?
Kafka is a distributed streaming platform created at LinkedIn (2011), now developed as an Apache project. It is not “just another queue system” — it’s something more.
- Runs on multiple servers (brokers) as a cluster.
- Messages are written to disk and don’t disappear after being read.
- Handles millions of messages per second.
- Scales by adding brokers and partitions as data grows.
In a classical queue system (e.g., RabbitMQ) a message is deleted after being processed. Kafka doesn’t do that — messages are stored on disk for a configurable period (default 7 days). This means:
- multiple consumers can read the same data,
- a consumer can go back to earlier messages (replay),
- a consumer failure does not cause data loss.
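This retention model can be illustrated with a toy in-memory log — a sketch of the idea only, not the Kafka API. The names (`ToyPartition`, `read_from`) are invented for illustration:

```python
# Toy in-memory model of a Kafka partition (NOT the Kafka API):
# an append-only list in which every message keeps its offset,
# and each reader tracks its own position independently.

class ToyPartition:
    def __init__(self):
        self.log = []  # messages stay here; reading never deletes them

    def append(self, msg):
        self.log.append(msg)
        return len(self.log) - 1  # offset of the new message

    def read_from(self, offset):
        return self.log[offset:]

log = ToyPartition()
for order in ["order-1", "order-2", "order-3"]:
    log.append(order)

# Two independent consumers, each with its own offset
offset_a, offset_b = 0, 2
assert log.read_from(offset_a) == ["order-1", "order-2", "order-3"]
assert log.read_from(offset_b) == ["order-3"]

# "Replay": a consumer simply resets its offset to 0 and reads again
offset_a = 0
print(log.read_from(offset_a))  # all three messages, nothing was lost
```

Because reading never removes anything, any number of consumers can read the same data, and rewinding an offset is all it takes to replay.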
3 Kafka architecture
3.1 Publish/Subscribe pattern
Kafka implements the pub/sub (publish/subscribe) pattern. The sender (producer) doesn’t send messages directly to the receiver — it publishes to a topic. Receivers (consumers) subscribe to the topics they’re interested in.
3.2 Topics and partitions
A topic is a logical data channel — like a folder into which messages of a certain type arrive. E.g., transactions, orders, server-logs, sensor-readings.
Each topic is divided into one or more partitions. A partition is an ordered, immutable sequence of messages — each message receives a unique number (offset).
Partitions serve two purposes:
- Scalability — data from one topic can be distributed across multiple brokers.
- Parallelism — multiple consumers can read different partitions simultaneously.
A broker is a single Kafka server. A Kafka cluster consists of multiple brokers. Each broker stores a subset of partitions.
To ensure reliability, Kafka replicates partitions across multiple brokers. A replication factor (e.g., 3) means each partition has 3 copies on different servers. If one broker goes down — data is not lost.
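The idea of spreading copies across brokers can be sketched in a few lines. This is a deliberately simplified placement rule, not Kafka's actual assignment algorithm (which also balances leaders and can be rack-aware); `assign_replicas` is an invented illustration:

```python
# Simplified sketch: each partition gets `replication_factor` copies
# on distinct brokers, shifted so the load spreads across the cluster.

def assign_replicas(num_partitions, num_brokers, replication_factor):
    assert replication_factor <= num_brokers
    assignment = {}
    for p in range(num_partitions):
        # partition p's replicas land on brokers p, p+1, ... (mod cluster size)
        assignment[p] = [(p + i) % num_brokers for i in range(replication_factor)]
    return assignment

# 3 partitions, 3 brokers, replication factor 3
print(assign_replicas(3, 3, 3))
# Every partition lives on 3 different brokers, so losing one broker
# still leaves 2 copies of each partition.
```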
flowchart LR
subgraph "Topic: transactions"
direction TB
P0["Partition 0\noffset: 0,1,2,3..."]
P1["Partition 1\noffset: 0,1,2,3..."]
P2["Partition 2\noffset: 0,1,2,3..."]
end
P0 -.-> B1["Broker 1"]
P1 -.-> B2["Broker 2"]
P2 -.-> B3["Broker 3"]
style P0 fill:#E3F2FD,stroke:#2196F3
style P1 fill:#E3F2FD,stroke:#2196F3
style P2 fill:#E3F2FD,stroke:#2196F3
style B1 fill:#FFF3E0,stroke:#FF9800
style B2 fill:#FFF3E0,stroke:#FF9800
style B3 fill:#FFF3E0,stroke:#FF9800
4 Producers
A producer is an application that sends messages to Kafka. The producer specifies a topic, and Kafka assigns the message to a partition:
- by default: round-robin (sequentially to each partition),
- with a key: Kafka computes the key’s hash and selects the partition based on it — messages with the same key always go to the same partition.
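The key-to-partition mapping boils down to "hash of key modulo partition count". Kafka's default partitioner uses murmur2; the sketch below substitutes CRC32 from the standard library just to show the mechanism:

```python
import zlib

# Sketch of key-based partitioning: hash(key) % num_partitions.
# (Kafka's default partitioner uses murmur2, not CRC32.)
def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

# The same key always maps to the same partition, so all messages
# for one customer keep their relative order...
assert partition_for(b'C-1001', 3) == partition_for(b'C-1001', 3)

# ...while different keys spread across partitions.
for key in [b'C-1001', b'C-1002', b'C-1003']:
    print(key.decode(), '-> partition', partition_for(key, 3))
```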
from kafka import KafkaProducer
import json
from datetime import datetime

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Sending a transaction to the "transactions" topic
transaction = {
    'id': 'TX0042',
    'amount': 1250.00,
    'customer': 'C-1001',
    'time': datetime.now().isoformat(),
    'store': 'Warsaw'
}

producer.send('transactions', value=transaction)
producer.flush()
print(f"Sent: {transaction}")

If we want all transactions from the same customer to go to the same partition (e.g., for session analysis):
# Key = customer ID → same customer always in the same partition
producer.send(
    'transactions',
    key=b'C-1001',  # partitioning key
    value=transaction
)

Use a key when you need ordering guarantees for a given entity (e.g., all transactions for customer C-1001 in order) or when downstream processing requires grouping by key.
5 Consumers
A consumer is an application that reads messages from a topic. The consumer tracks its offset — it knows which message it last processed. After a failure it resumes from that point.
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

for message in consumer:
    tx = message.value
    if tx['amount'] > 1000:
        print(f"Large transaction: {tx['id']} = {tx['amount']} PLN ({tx['store']})")

5.1 Consumer groups
When a single consumer can’t keep up with processing — you add more. Consumers in the same group divide the partitions among themselves.
- Kafka automatically assigns partitions to consumers in a group.
- Each partition is read by exactly one consumer in the group.
- If a consumer fails — its partitions are taken over by another.
- Ideal: number of consumers in the group = number of partitions.
- If there are more consumers than partitions — the excess will be idle.
Topic: transactions (3 partitions)
Consumer group: "fraud-detection"
Partition 0 ──→ Consumer A
Partition 1 ──→ Consumer B
Partition 2 ──→ Consumer C
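The assignment rule shown above can be sketched as a toy model — this is the idea only, not Kafka's actual rebalancing protocol, and `assign` is an invented helper:

```python
# Toy model of partition assignment within one consumer group:
# each partition goes to exactly one consumer; extra consumers sit idle.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # deal partitions out round-robin across the group's consumers
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2]

print(assign(partitions, ["A", "B", "C"]))       # one partition each
print(assign(partitions, ["A", "B"]))            # A gets 2, B gets 1
print(assign(partitions, ["A", "B", "C", "D"]))  # D is idle
```

With 3 partitions, a fourth consumer in the group gets nothing to read — which is why the ideal group size equals the partition count.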
Different consumer groups can read the same data independently. E.g., group fraud-detection analyzes transactions for fraud, while group reporting builds reports — both read the transactions topic but have separate offsets.
6 ML microservice — FastAPI
In practice, a Kafka consumer often passes data to an ML model served by an API. FastAPI is a lightweight Python framework ideal for this purpose.
from fastapi import FastAPI
import numpy as np
import pickle

app = FastAPI()

# Load trained model
# with open("model.pkl", "rb") as f:
#     model = pickle.load(f)

@app.get("/predict/")
def predict_price(area: float, bedrooms: int, age: int):
    """Real estate price prediction."""
    features = np.array([[area, bedrooms, age]])
    # price = model.predict(features)[0]
    price = area * 8500 + bedrooms * 50000 - age * 2000  # simplified model
    return {"estimated_price": round(price, 2)}

# Start: uvicorn main:app --reload
# Query: http://localhost:8000/predict/?area=75&bedrooms=3&age=10

Request:
- URL: http://localhost:8000/predict/?area=75&bedrooms=3&age=10
- HTTP method: GET or POST
- Headers: Content-Type, Authorization
- Body: data in JSON format

Response:
- HTTP status: 200 OK, 400 Bad Request, 500 Internal Server Error
- Body: result in JSON format
{"estimated_price": 767500.0}

7 Kafka + ML — the complete picture
Combining elements from recent lectures, here is a typical real-time analytics system:
flowchart LR
subgraph Sources
WEB["Web application"]
IOT["IoT sensors"]
PAY["Payment system"]
end
subgraph Kafka
T["Topic:\nevents"]
end
subgraph Processing
SP["Spark Streaming"]
API["FastAPI + ML"]
DB["Write to database"]
end
subgraph Results
DASH["Dashboard"]
ALERT["Alerts"]
REP["Reports"]
end
WEB --> T
IOT --> T
PAY --> T
T --> SP --> DASH
T --> API --> ALERT
T --> DB --> REP
style T fill:#FF9800,color:#fff
style SP fill:#2196F3,color:#fff
style API fill:#4CAF50,color:#fff
style DB fill:#9C27B0,color:#fff
Producers (applications, sensors, systems) send events to Kafka. Consumers (Spark, FastAPI, reporting systems) read them and process them — each in their own way, independently of the others.
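This fan-out can be sketched as a toy in pure Python — each "group" applies its own logic to the full stream, just as separate consumer groups read the same topic with separate offsets. The function names (`fraud_check`, `reporting`) are invented for illustration:

```python
# Toy fan-out: one event stream, several independent "consumer groups",
# each applying its own logic (a sketch of the architecture, not Kafka code).

def fraud_check(event):
    # alerting group: flag suspiciously large transactions
    return "ALERT" if event["amount"] > 1000 else "ok"

def reporting(event):
    # reporting group: format the same events for a report
    return f"{event['store']}: {event['amount']} PLN"

events = [
    {"id": "TX1", "amount": 1500, "store": "Warsaw"},
    {"id": "TX2", "amount": 200, "store": "Krakow"},
]

# Each group consumes the full stream independently of the others.
alerts = [fraud_check(e) for e in events]
report = [reporting(e) for e in events]
print(alerts)  # ['ALERT', 'ok']
print(report)
```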
8 Summary
Kafka is the central element of streaming architecture. Its strength lies in data durability (messages don’t disappear), scalability (partitions, replication) and flexibility (many independent consumers). In the labs you’ll run a Kafka cluster in Docker and write producers and consumers in Python.
Next lecture: Apache Spark and Structured Streaming — processing streaming data at scale, integration with Kafka.
Exercise: You’re designing a competitor price monitoring system for a retail chain. What Kafka topics would you create? How many partitions? What consumer groups?