Tools

Python

In your terminal, type:

python3 --version

Make sure your version is 3.10 or higher. If you start the interactive Python shell by running python3 with no arguments, exit it with the exit() function:

Python 3.13.2 (main, Feb  4 2025, 14:51:09) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()

Creating a Python Virtual Environment

python3 -m venv <env_name>

# Linux / macOS
source <env_name>/bin/activate

# Windows
# <env_name>\Scripts\activate

(env_name)$

Install core libraries and JupyterLab:

pip install --no-cache-dir --upgrade pip setuptools
pip install jupyterlab numpy pandas matplotlib scipy

# if you have a requirements.txt
pip install -r requirements.txt

# start JupyterLab
jupyter lab

Open your browser at: http://localhost:8888

After restarting your computer, navigate to the folder where you created the environment, activate it, and start JupyterLab:

source <env_name>/bin/activate
jupyter lab

Python with JupyterLab – Docker Version

Clone the course infrastructure repository and start the full environment with Docker:

git clone -b 2026Redis https://github.com/sebkaz/jupyterlab-project.git
cd jupyterlab-project
docker compose up -d

Open your browser at: http://localhost:8888

To stop the environment:

docker compose down

Learn Python

For a Python basics refresher, the course by Tomas Beuzen is recommended.

Create an account on Kaggle, go to the Courses tab, and complete the Python module. It covers:

  • expressions and variables
  • functions
  • conditionals and program flow
  • lists
  • loops
  • strings and dictionaries
  • importing and using external libraries
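Several of these topics can be seen together in one short snippet (the names and data below are illustrative, not taken from the Kaggle course):

```python
def describe(person):
    """Return a short description built from a dictionary."""
    return f"{person['name']} is {person['age']} years old"

people = [
    {"name": "Anna", "age": 28},
    {"name": "Jan", "age": 35},
]

# loop over a list, with a conditional inside
for person in people:
    if person["age"] >= 30:
        print(describe(person) + " (30+)")
    else:
        print(describe(person))

# importing and using a library (here, from the standard library)
import math
print(math.sqrt(16))  # 4.0
```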

Git

When working on a project (alone or in a team), you often need to track what changes were made, when, and by whom. Git – a version control system – is the standard tool for this.

You can download and install Git on any operating system. However, most projects use a hosting service like GitHub, which lets you use Git directly from your browser.

GitHub’s free tier supports both public and private repositories.

git --version

GitHub Structure

At the top level, there are individual accounts (e.g., github.com/sebkaz) or organizations. Users can create public or private repositories.

A single file should not exceed 100 MB.

A repo (short for repository) is created with the Create a new repository button. Each repo should have a unique name.

Branches

The default branch of a repository is named main (previously master).

Essential Commands

Clone a repository:

git clone https://github.com/<username>/<repo>.git

You can also download a repository from GitHub as a ZIP file.

Create a local repository:

# create a new directory
mkdir datamining
cd datamining
# initialize a repository
git init
# add a file
echo "Info" >> README.md

Connect a local repository to GitHub:

git remote add origin https://github.com/<username>/<repo>.git

The three-step workflow:

# check status
git status
# 1. stage all changes
git add .
# 2. commit with a message
git commit -m "description of changes"
# 3. push to remote
git push origin main

Recommended: Git crash course on YouTube.

Docker

Download Docker from the official website.

If installed correctly, run the following commands:

  1. Check the installed version:
docker --version
docker compose version
  2. Run a test image:
docker run hello-world
  3. List downloaded images:
docker images
  4. List running containers:
docker ps
docker ps -a
  5. Stop a running container:
docker stop <CONTAINER_ID>
  6. Remove a container:
docker rm -f <CONTAINER_ID>

Docker Compose

Docker Compose lets you define and run multi-container environments with a single docker-compose.yml file. In this course, we use it to set up environments with Kafka, Spark, and other services.

Check that Docker Compose is installed:

docker compose version

Example docker-compose.yml:

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.6.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

Start and stop:

# start all services in the background
docker compose up -d
# check running containers
docker compose ps
# stop and remove containers
docker compose down

Apache Kafka

Apache Kafka is a distributed platform for processing data streams. It enables publishing, subscribing to, and processing streams of records in real time.

In this course, we run Kafka using Docker Compose (see above).

Key concepts:

  • Producer – sends messages (events) to Kafka.
  • Consumer – reads messages from Kafka.
  • Topic – a category to which messages are sent.
  • Partition – a subdivision of a topic enabling parallel processing.
  • Broker – a Kafka server that stores data.
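How a message's key decides which partition it lands in can be illustrated with a toy partitioner. This is a simplified sketch for intuition only: real Kafka uses a murmur2 hash of the key, while here md5 stands in to keep the demo deterministic.

```python
import hashlib

def assign_partition(key: str, num_partitions: int) -> int:
    """Toy stand-in for Kafka's default partitioner: hash the key,
    then take the result modulo the partition count.
    (Kafka itself uses murmur2; md5 here is just for a stable demo.)"""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# messages with the same key always map to the same partition,
# which is what preserves per-key ordering in Kafka
for key in ["user-1", "user-2", "user-1"]:
    print(key, "->", assign_partition(key, 3))
```

The takeaway: because the partition is a pure function of the key, all events for a given key are processed in order by a single consumer of that partition.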

Quick Kafka test with Docker:

# create a topic
docker compose exec kafka kafka-topics --create \
  --topic test --bootstrap-server localhost:9092 \
  --partitions 1 --replication-factor 1

# send a message (producer)
docker compose exec kafka kafka-console-producer \
  --topic test --bootstrap-server localhost:9092

# read messages (consumer) -- in a new terminal
docker compose exec kafka kafka-console-consumer \
  --topic test --bootstrap-server localhost:9092 --from-beginning

Python library: pip install confluent-kafka

More information: Apache Kafka documentation

Apache Spark

Apache Spark is a distributed data processing engine supporting both batch and streaming modes. In this course, we use PySpark – the Python interface to Spark.

Install PySpark:

pip install pyspark

Quick test:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Test") \
    .master("local[*]") \
    .getOrCreate()

df = spark.createDataFrame(
    [(1, "Anna", 28), (2, "Jan", 35), (3, "Ewa", 22)],
    ["id", "name", "age"]
)

df.show()
spark.stop()

In later labs, we will use Structured Streaming – Spark’s module for processing streaming data, including integration with Apache Kafka.

More information: Apache Spark documentation