Lecture 2: Evolution of Data and Processing Models
Duration: 1.5h
Goal: Understand how data types and processing models evolved — from simple SQL tables, through data warehouses, to Data Lakes and Big Data.
Evolution of Data
Data hasn’t always looked the same. The way we store and process it has changed with technology and business needs.
Structured Data — SQL Tables
For decades, business data was stored in relational databases — tables with columns following a strict schema. This remains the foundation of most transactional systems.
```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT,
        city TEXT,
        revenue REAL
    )
""")
cursor.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", [
    (1, 'Anna', 'Warsaw', 125000),
    (2, 'Piotr', 'Krakow', 89000),
    (3, 'Kasia', 'Gdansk', 210000),
])
cursor.execute("SELECT city, SUM(revenue) FROM customers GROUP BY city")
for row in cursor.fetchall():
    print(f"{row[0]}: {row[1]:,.0f} PLN")
conn.close()
```
Semi-Structured Data — JSON
With the rise of APIs and web applications, data emerged that has some structure but isn’t strictly tabular. JSON became the lingua franca of data exchange between services.
JSON is flexible — you can add new fields without changing the schema. That’s why it dominates in NoSQL systems (MongoDB), REST APIs, and data streams (Kafka).
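A minimal sketch of that flexibility: the two records below (invented for illustration) share only some fields, yet both parse without any schema change.

```python
import json

# Two records from a hypothetical API; the second adds a field
# without any schema migration.
raw = [
    '{"id": 1, "name": "Anna", "city": "Warsaw"}',
    '{"id": 2, "name": "Piotr", "city": "Krakow", "loyalty_points": 120}',
]

records = [json.loads(line) for line in raw]
for r in records:
    # Missing fields are handled at read time, not at write time
    print(r["name"], r.get("loyalty_points", "n/a"))
```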
Unstructured Data
Today, most generated data is unstructured: text (emails, reviews, posts), images, audio, video. These require specialized tools — NLP models (BERT, GPT), convolutional networks (CNN), or multimodal models.
We won’t cover these in detail in this course, but it’s important to know that streaming systems (Kafka, Spark) can process all these data types.
Streaming Data
This is the newest “type” of data — or rather, a way of thinking about it. Every transaction, page click, sensor reading is an event appearing in a continuous stream.
The key shift in perspective: data is always generated as a stream. We decide whether to process it on the fly (streaming) or collect and analyze it later (batch).
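This shift can be sketched in plain Python, with a generator standing in for an event source (no streaming framework involved): the same stream of events can be consumed one by one or collected and processed as a batch.

```python
import time

def event_stream():
    """Stand-in for an event source (clicks, sensor readings, ...)."""
    for value in [5, 3, 8, 2, 7]:
        yield {"ts": time.time(), "value": value}

# Streaming view: process each event as it arrives
running_total = 0
for event in event_stream():
    running_total += event["value"]

# Batch view: collect first, analyze later
batch = list(event_stream())
batch_total = sum(e["value"] for e in batch)

print(running_total, batch_total)  # same data, two processing models
```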
OLTP — Transaction Processing
The OLTP model (Online Transaction Processing) covers classical transaction handling: recording an order, updating an account balance, registering a customer. OLTP systems (ERP, CRM, banking) are optimized for fast read/write operations on individual records.
Limitation: OLTP is not designed for complex analytics. A query like “what was the sales breakdown by region and product category over the last 12 months?” will overwhelm a transactional system.
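To make the contrast concrete, here is an OLTP-style operation sketched with sqlite3: a short transaction that updates a single account balance (the table and values are invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
cur.execute("INSERT INTO accounts VALUES (1, 1000.0)")

# OLTP workload: a short transaction touching one record.
# The connection as context manager commits on success, rolls back on error.
with conn:
    cur.execute("UPDATE accounts SET balance = balance - 250 WHERE id = 1")

cur.execute("SELECT balance FROM accounts WHERE id = 1")
balance = cur.fetchone()[0]
print(balance)
```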
OLAP — Analytical Processing
The analytical problem was solved by creating data warehouses and the OLAP model (Online Analytical Processing).
Data from multiple OLTP systems is collected via an ETL process (Extract–Transform–Load) and loaded into a single analytical database, optimized for aggregation and reporting.
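A toy ETL pass, sketched with pandas (the two "source systems" are just in-memory DataFrames standing in for OLTP databases):

```python
import pandas as pd

# Extract: pull records from two hypothetical OLTP sources
crm = pd.DataFrame({"customer_id": [1, 2], "region": ["Warsaw", "Krakow"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100.0, 50.0, 80.0]})

# Transform: join and aggregate into an analysis-friendly shape
merged = orders.merge(crm, on="customer_id")
by_region = merged.groupby("region", as_index=False)["amount"].sum()

# Load: in a real pipeline this result would be written to the warehouse;
# here we just print it
print(by_region)
```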
Typical analytical questions:
What is the sales breakdown by quarter, region, and category?
Which products generate losses?
What does the year-over-year sales trend look like?
A data warehouse operates in batch mode — data is loaded periodically (e.g., daily), and reports are generated on demand.
```python
import pandas as pd
import numpy as np

# Simulated warehouse data
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=n, freq='D'),
    'region': np.random.choice(['Warsaw', 'Krakow', 'Gdansk', 'Wroclaw'], n),
    'category': np.random.choice(['Electronics', 'Appliances', 'Clothing'], n),
    'sales': np.random.uniform(1000, 50000, n).round(2)
})

# Typical OLAP query — pivot by region and category
pivot = df.pivot_table(values='sales', index='region', columns='category', aggfunc='sum')
print(pivot.round(0))
```
A data warehouse requires structured data and a strict schema. What if we want to store raw logs, images, IoT data, or JSON files as well?
The answer is the Data Lake — a data store that accepts data in any format (schema-on-read instead of schema-on-write). Data is stored raw and structured only at the point of analysis.
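Schema-on-read can be sketched in a few lines: raw, heterogeneous JSON records land in the "lake" exactly as they arrived, and a schema is imposed only at analysis time (a file-free sketch; the records are invented).

```python
import json

# Raw landing zone: heterogeneous records stored as-is
raw_lake = [
    '{"sensor": "t1", "temp": 21.5}',
    '{"sensor": "t2", "temp": 19.0, "humidity": 40}',
    '{"event": "door_open", "ts": 1700000000}',  # a different shape entirely
]

# Schema-on-read: decide what counts as a "temperature reading"
# only at the point of analysis
records = [json.loads(line) for line in raw_lake]
temps = [r["temp"] for r in records if "temp" in r]
print(sum(temps) / len(temps))
```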
| Feature | Data Warehouse | Data Lake |
| --- | --- | --- |
| Schema | Defined at load (ETL) | Defined at read |
| Data types | Structured | Any (raw) |
| Users | Business analysts | Data Scientists, engineers |
| Storage cost | High | Low |
| Technologies | SQL Server, Oracle, Teradata | Hadoop HDFS, S3, Azure Data Lake |
Big Data
The term Big Data describes situations where data exceeds the processing capabilities of traditional tools. It’s characterized by the 5 V’s:
Volume — terabytes and petabytes of data,
Velocity — the rate at which new data is generated,
Variety — different formats and sources,
Veracity — data quality and completeness,
Value — the actual business value hidden in the data.
Important distinction: a data warehouse is not a Big Data system. A warehouse stores structured data and serves reporting. Big Data is an ecosystem of tools for processing data of any structure and scale.
MapReduce — Historical Context
When Google needed to process petabytes of web index data, traditional SQL databases weren’t enough. In 2004, they published a paper describing MapReduce — a computational model where data is split into fragments processed in parallel across many machines.
Two steps:
Map — split data into fragments and process them independently.
Reduce — aggregate partial results into the final answer.
```python
from collections import Counter

# Map: each text fragment processed independently
def map_function(text):
    return Counter(text.split())

# Reduce: aggregate partial results
def reduce_function(counters):
    total = Counter()
    for c in counters:
        total.update(c)
    return total

texts = [
    "real time data analytics",
    "stream processing and batch data processing",
    "data analysis and machine learning"
]
results = [map_function(t) for t in texts]
print(dict(reduce_function(results)))
```
MapReduce (and Hadoop) revolutionized Big Data processing, but it was slow and cumbersome. That’s why Apache Spark was created — a successor that’s faster (in-memory processing) and simpler to use. Spark supports both batch and streaming modes — and it’s what we’ll use in the labs.
The evolution hasn’t stopped: since 2020, the landscape has added the Data Lakehouse, real-time ML, and serverless analytics.
Summary
In this lecture we traced the evolution from simple SQL tables to complex Big Data ecosystems. Key takeaways:
Data evolved from structured (SQL) through semi-structured (JSON) to streaming (events).
Processing models evolved from OLTP through OLAP to Data Lakes.
MapReduce opened the Big Data era but was replaced by the faster Spark.
In the next lecture we’ll focus on stream processing — the concept of time, latency, and time windows.
Business Impact
Data variety allows companies to build a complete customer picture (Customer 360) — combining transactional data (SQL) with social media opinions (JSON/NLP). Moving from warehouses to Data Lakes reduces storage costs and opens the door to experimenting with new data sources — and that’s the foundation of a data-driven culture.