Tools
Python
In your terminal, type:
python3 --version
Make sure your version is 3.10 or higher. To exit the Python shell, use the exit() function:
Python 3.13.2 (main, Feb 4 2025, 14:51:09) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
Creating a Python Virtual Environment
python3 -m venv <env_name>
# Linux / macOS
source <env_name>/bin/activate
# Windows
<env_name>\Scripts\activate
(env_name)$
Install core libraries and JupyterLab:
pip install --no-cache-dir --upgrade pip setuptools
pip install jupyterlab numpy pandas matplotlib scipy
# if you have a requirements.txt
pip install -r requirements.txt
# start JupyterLab
jupyter lab
Open your browser at: localhost:8888
After restarting your computer, navigate to the folder where you created the environment, activate it, and start JupyterLab:
source <env_name>/bin/activate
jupyter lab
Python with JupyterLab – Docker Version
Clone the course infrastructure repository and start the full environment with Docker:
git clone -b 2026Redis https://github.com/sebkaz/jupyterlab-project.git
cd jupyterlab-project
docker compose up -d
Open your browser at: localhost:8888
To stop the environment:
docker compose down
Learn Python
A recommended Python basics course: Tomas Beuzen.
Create an account on Kaggle, go to the Courses tab, and complete the Python module. It covers:
- expressions and variables
- functions
- conditionals and program flow
- lists
- loops
- strings and dictionaries
- importing and using external libraries
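The topics above can be sketched in a few lines of Python. This is a minimal illustration (all names here are invented for the example), combining a function, a conditional, a loop, a list, and a dictionary:

```python
# A tiny sketch touching the course topics: functions, conditionals,
# loops, lists, and dictionaries.
def describe(n: int) -> str:
    # conditionals and program flow
    if n % 2 == 0:
        return "even"
    return "odd"

numbers = [1, 2, 3, 4]                         # a list
labels = {n: describe(n) for n in numbers}     # a dict built in a loop
print(labels)  # {1: 'odd', 2: 'even', 3: 'odd', 4: 'even'}
```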
Git
When working on a project (alone or in a team), you often need to track what changes were made, when, and by whom. Git – a version control system – is the standard tool for this.
You can download and install Git on any operating system. However, most projects use a hosting service like GitHub, which lets you use Git directly from your browser.
GitHub’s free tier supports both public and private repositories.
git --version
GitHub Structure
At the top level, there are individual accounts (e.g., github.com/sebkaz) or organizations. Users can create public or private repositories.
A single file should not exceed 100 MB.
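Before committing large datasets, you can check for files above that limit from Python. A quick sketch using only the standard library (the function name is invented for the example):

```python
# Sketch: list files above GitHub's 100 MB single-file limit in a directory.
import pathlib

LIMIT = 100 * 1024 * 1024  # 100 MB in bytes

def oversized_files(root: str):
    # walk the tree recursively and keep files whose size exceeds the limit
    return [p for p in pathlib.Path(root).rglob("*")
            if p.is_file() and p.stat().st_size > LIMIT]

print(oversized_files("."))  # empty list if everything fits within the limit
```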
A repo (short for repository) is created with the Create a new repository button. Each repo should have a unique name.
Branches
The default branch of a repository is named main (previously master).
Essential Commands
Clone a repository:
git clone https://github.com/<username>/<repo>.git
You can also download a repository from GitHub as a ZIP file.
Create a local repository:
# create a new directory
mkdir datamining
cd datamining
# initialize a repository
git init
# add a file
echo "Info" >> README.mdConnect a local repository to GitHub:
git remote add origin https://github.com/<username>/<repo>.git
The three-step workflow:
# check status
git status
# 1. stage all changes
git add .
# 2. commit with a message
git commit -m "description of changes"
# 3. push to remote
git push origin main
Recommended: Git crash course on YouTube.
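The same three steps can also be driven from Python with the standard-library subprocess module. A hypothetical sketch, assuming git is on your PATH; it runs in a throwaway temporary repository and stops short of pushing:

```python
# Sketch: init, stage, and commit via subprocess in a temporary repository.
import pathlib
import subprocess
import tempfile

repo = pathlib.Path(tempfile.mkdtemp())

def git(*args):
    # run a git command inside the temporary repository
    return subprocess.run(["git", "-C", str(repo), *args],
                          check=True, capture_output=True, text=True)

git("init")
git("config", "user.email", "student@example.com")  # identity for the demo
git("config", "user.name", "Student")
(repo / "README.md").write_text("Info\n")
git("add", ".")                                # 1. stage all changes
git("commit", "-m", "description of changes")  # 2. commit with a message
print(git("log", "--oneline").stdout)
# 3. `git push origin main` would follow once a remote is configured
```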
Docker
Download Docker from the official website.
To verify the installation, run the following commands:
- Check the installed version:
docker --version
docker compose version
- Run a test image:
docker run hello-world
- List downloaded images:
docker images
- List running containers:
docker ps
docker ps -a
- Stop a running container:
docker stop <CONTAINER_ID>
- Remove a container:
docker rm -f <CONTAINER_ID>
Docker Compose
Docker Compose lets you define and run multi-container environments with a single docker-compose.yml file. In this course, we use it to set up environments with Kafka, Spark, and other services.
Check that Docker Compose is installed:
docker compose version
Example docker-compose.yml:
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
Start and stop:
# start all services in the background
docker compose up -d
# check running containers
docker compose ps
# stop and remove containers
docker compose down
Apache Kafka
Apache Kafka is a distributed platform for stream data processing. It enables publishing, subscribing to, and processing streams of records in real time.
In this course, we run Kafka using Docker Compose (see above).
Key concepts:
- Producer – sends messages (events) to Kafka.
- Consumer – reads messages from Kafka.
- Topic – a category to which messages are sent.
- Partition – a subdivision of a topic enabling parallel processing.
- Broker – a Kafka server that stores data.
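The link between keys and partitions can be sketched in a few lines. Real Kafka uses a murmur2 hash of the message key; the crc32 used here is an assumption for illustration only, but it shows the same idea: equal keys always map to the same partition, which preserves per-key ordering.

```python
# Simplified sketch of how a keyed message is assigned to a topic partition.
import zlib

def choose_partition(key: str, num_partitions: int) -> int:
    # hash the key and map it onto one of the available partitions
    return zlib.crc32(key.encode("utf-8")) % num_partitions

for key in ["user-1", "user-2", "user-1"]:
    print(key, "-> partition", choose_partition(key, 3))
```

Note that "user-1" is printed twice with the same partition number: a producer sending with that key always hits the same partition.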
Quick Kafka test with Docker:
# create a topic
docker compose exec kafka kafka-topics --create \
--topic test --bootstrap-server localhost:9092 \
--partitions 1 --replication-factor 1
# send a message (producer)
docker compose exec kafka kafka-console-producer \
--topic test --bootstrap-server localhost:9092
# read messages (consumer) -- in a new terminal
docker compose exec kafka kafka-console-consumer \
--topic test --bootstrap-server localhost:9092 --from-beginning
Python library: pip install confluent-kafka
More information: Apache Kafka documentation
Apache Spark
Apache Spark is a distributed data processing engine supporting both batch and streaming modes. In this course, we use PySpark – the Python interface to Spark.
Install PySpark:
pip install pyspark
Quick test:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Test") \
    .master("local[*]") \
    .getOrCreate()

df = spark.createDataFrame(
    [(1, "Anna", 28), (2, "Jan", 35), (3, "Ewa", 22)],
    ["id", "name", "age"]
)
df.show()
spark.stop()
In later labs, we will use Structured Streaming – Spark’s module for processing streaming data, including integration with Apache Kafka.
More information: Apache Spark documentation