Lecture 2: Evolution of Data and Processing Models

Duration: 1.5h

Goal: Understand how data types and processing models evolved — from simple SQL tables, through data warehouses, to Data Lakes and Big Data.


Evolution of Data

Data hasn’t always looked the same. The way we store and process it has changed with technology and business needs.

Structured Data — SQL Tables

For decades, business data was stored in relational databases — tables with columns following a strict schema. This remains the foundation of most transactional systems.

Code
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT,
        city TEXT,
        revenue REAL
    )
""")
cursor.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", [
    (1, 'Anna', 'Warsaw', 125000),
    (2, 'Piotr', 'Krakow', 89000),
    (3, 'Kasia', 'Gdansk', 210000),
])
cursor.execute("SELECT city, SUM(revenue) FROM customers GROUP BY city")
for row in cursor.fetchall():
    print(f"{row[0]}: {row[1]:,.0f} PLN")
conn.close()
Gdansk: 210,000 PLN
Krakow: 89,000 PLN
Warsaw: 125,000 PLN

Advantages: consistency, normalization, ACID. Drawbacks: rigid schema, difficulty handling unstructured data.

Semi-Structured Data — JSON, XML

With the rise of APIs and web applications, data emerged that has some structure but isn’t strictly tabular. JSON became the lingua franca of data exchange between services.

Code
import json

transaction = {
    "id": "TX4521",
    "amount": 1250.00,
    "currency": "PLN",
    "customer": {
        "name": "Anna",
        "segment": "premium"
    },
    "products": ["laptop", "mouse"]
}

print(json.dumps(transaction, indent=2, ensure_ascii=False))
{
  "id": "TX4521",
  "amount": 1250.0,
  "currency": "PLN",
  "customer": {
    "name": "Anna",
    "segment": "premium"
  },
  "products": [
    "laptop",
    "mouse"
  ]
}

JSON is flexible — you can add new fields without changing the schema. That’s why it dominates in NoSQL systems (MongoDB), REST APIs, and data streams (Kafka).
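That flexibility cuts both ways: consumers must cope with fields that may be absent from older records. A minimal sketch of schema drift (the event fields here are illustrative, not from any real API):

```python
import json

# Two events from the same stream -- the second adds a field the first lacked
raw = [
    '{"id": "TX1", "amount": 100}',
    '{"id": "TX2", "amount": 250, "discount": 25}',
]

totals = {}
for line in raw:
    event = json.loads(line)
    # .get() with a default tolerates the missing field -- no schema change needed
    totals[event["id"]] = event["amount"] - event.get("discount", 0)

print(totals)  # {'TX1': 100, 'TX2': 225}
```

No migration, no ALTER TABLE — old and new records coexist in the same stream.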

Unstructured Data

Today, most generated data is unstructured: text (emails, reviews, posts), images, audio, video. These require specialized tools — NLP models (BERT, GPT), convolutional networks (CNN), or multimodal models.

We won’t cover these in detail in this course, but it’s important to know that streaming systems (Kafka, Spark) can process all these data types.

Streaming Data

This is the newest “type” of data — or rather, a way of thinking about it. Every transaction, page click, sensor reading is an event appearing in a continuous stream.

The key shift in perspective: data is always generated as a stream. We decide whether to process it on the fly (streaming) or collect and analyze it later (batch).

Code
import time
from datetime import datetime

# Simulated event stream
events = [
    {"type": "purchase", "amount": 150, "store": "Warsaw"},
    {"type": "return", "amount": 50, "store": "Krakow"},
    {"type": "purchase", "amount": 2200, "store": "Warsaw"},
]

for e in events:
    e["time"] = datetime.now().strftime("%H:%M:%S")
    print(f"[{e['time']}] {e['type'].upper()}: {e['amount']} PLN ({e['store']})")
    time.sleep(0.5)
[13:18:32] PURCHASE: 150 PLN (Warsaw)
[13:18:33] RETURN: 50 PLN (Krakow)
[13:18:33] PURCHASE: 2200 PLN (Warsaw)
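The stream/batch duality can be made concrete: the same events yield the same answer whether we keep a running total as they arrive or sum once at the end (amounts are illustrative):

```python
events = [
    {"type": "purchase", "amount": 150},
    {"type": "return", "amount": 50},
    {"type": "purchase", "amount": 2200},
]

def signed(e):
    # A return subtracts from revenue
    return e["amount"] if e["type"] == "purchase" else -e["amount"]

# Streaming view: update state as each event arrives
running_total = 0
for e in events:
    running_total += signed(e)
    print(f"after {e['type']}: {running_total} PLN")

# Batch view: collect everything first, aggregate once
batch_total = sum(signed(e) for e in events)
print(f"batch: {batch_total} PLN")  # batch: 2300 PLN
```

The choice between the two views is about latency and cost, not about the result.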

Data Processing Models

OLTP — Transaction Processing

The OLTP model (Online Transaction Processing) is classical transaction handling: recording an order, updating an account balance, registering a customer. OLTP systems (ERP, CRM, banking) are optimized for fast read/write operations on individual records.

Features:

  • CRUD operations (Create, Read, Update, Delete),
  • ACID guarantees (Atomicity, Consistency, Isolation, Durability),
  • normalized data schemas,
  • support for many concurrent users.

Limitation: OLTP is not designed for complex analytics. A query like “what was the sales breakdown by region and product category over the last 12 months?” will overwhelm a transactional system.
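The ACID guarantees listed above can be observed directly with sqlite3's transaction handling. A minimal sketch of atomicity (the account data is made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 1000), (2, 500)])
conn.commit()

try:
    # The connection as a context manager commits on success, rolls back on error
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 300 WHERE id = 1")
        raise RuntimeError("simulated crash mid-transfer")
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, both balances are unchanged
balances = conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(1000.0,), (500.0,)]
conn.close()
```

Either both legs of the transfer happen or neither does — exactly the guarantee an analytical scan doesn't need but a banking system can't live without.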

OLAP — Analytical Processing

The analytical problem was solved by creating data warehouses and the OLAP model (Online Analytical Processing).

Data from multiple OLTP systems is collected via an ETL process (Extract–Transform–Load) and loaded into a single analytical database, optimized for aggregation and reporting.

Typical analytical questions:

  • What is the sales breakdown by quarter, region, and category?
  • Which products generate losses?
  • What does the year-over-year sales trend look like?

A data warehouse operates in batch mode — data is loaded periodically (e.g., daily), and reports are generated on demand.

Code
import pandas as pd
import numpy as np

# Simulated warehouse data
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=n, freq='D'),
    'region': np.random.choice(['Warsaw', 'Krakow', 'Gdansk', 'Wroclaw'], n),
    'category': np.random.choice(['Electronics', 'Appliances', 'Clothing'], n),
    'sales': np.random.uniform(1000, 50000, n).round(2)
})

# Typical OLAP query — pivot by region and category
pivot = df.pivot_table(values='sales', index='region', columns='category', aggfunc='sum')
print(pivot.round(0))
category  Appliances  Clothing  Electronics
region                                     
Gdansk      996275.0  997330.0    1228807.0
Krakow      969828.0  781050.0     814134.0
Warsaw      919540.0  822135.0    1032894.0
Wroclaw    1053424.0  987009.0    1583392.0

From Warehouses to Data Lakes

A data warehouse requires structured data and a strict schema. What if we want to store raw logs, images, IoT data, or JSON files as well?

The answer is the Data Lake — a data store that accepts data in any format (schema-on-read instead of schema-on-write). Data is stored raw and structured only at the point of analysis.

Feature       | Data Warehouse               | Data Lake
Schema        | Defined at load (ETL)        | Defined at read
Data types    | Structured                   | Any (raw)
Users         | Business analysts            | Data scientists, engineers
Storage cost  | High                         | Low
Technologies  | SQL Server, Oracle, Teradata | Hadoop HDFS, S3, Azure Data Lake
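Schema-on-read in miniature: raw JSON lines land in the lake untouched, and tabular structure is imposed only at analysis time. A sketch with pandas (the file contents are simulated as a list of strings):

```python
import json
import pandas as pd

# Landing zone: heterogeneous records stored exactly as they arrived
raw_lines = [
    '{"sensor": "t1", "temp": 21.5}',
    '{"sensor": "t2", "temp": 19.0, "humidity": 55}',
    '{"sensor": "t1", "temp": 22.1}',
]

# Schema-on-read: a table appears only now; missing fields become NaN
df = pd.DataFrame([json.loads(line) for line in raw_lines])
mean_temp = df.groupby("sensor")["temp"].mean()
print(mean_temp)
```

Nothing was validated or transformed at write time — if tomorrow's records add a `pressure` field, the lake accepts them without any change upstream.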

Big Data

The term Big Data describes situations where data exceeds the processing capabilities of traditional tools. It’s characterized by the 5 V’s:

  • Volume — terabytes and petabytes of data,
  • Velocity — the rate at which new data is generated,
  • Variety — different formats and sources,
  • Veracity — data quality and completeness,
  • Value — the actual business value hidden in the data.

Important distinction: a data warehouse is not a Big Data system. A warehouse stores structured data and serves reporting. Big Data is an ecosystem of tools for processing data of any structure and scale.

MapReduce — Historical Context

When Google needed to process petabytes of web index data, traditional SQL databases weren’t enough. In 2004, they published a paper describing MapReduce — a computational model where data is split into fragments processed in parallel across many machines.

Two steps:

  1. Map — split data into fragments and process them independently.
  2. Reduce — aggregate partial results into the final answer.

Code
from collections import Counter

# Map: each text fragment processed independently
def map_function(text):
    return Counter(text.split())

# Reduce: aggregate results
def reduce_function(counters):
    total = Counter()
    for c in counters:
        total.update(c)
    return total

texts = [
    "real time data analytics",
    "stream processing and batch data processing",
    "data analysis and machine learning"
]

results = [map_function(t) for t in texts]
print(dict(reduce_function(results)))
{'real': 1, 'time': 1, 'data': 3, 'analytics': 1, 'stream': 1, 'processing': 2, 'and': 2, 'batch': 1, 'analysis': 1, 'machine': 1, 'learning': 1}

MapReduce (and Hadoop) revolutionized Big Data processing, but it was slow and cumbersome. That’s why Apache Spark was created — a successor that’s faster (in-memory processing) and simpler to use. Spark supports both batch and streaming modes — and it’s what we’ll use in the labs.

Data Technology Timeline

  • 1960s–70s: relational databases, OLTP
  • 1980s–90s: data warehouses, OLAP, data mining
  • 2004: Google publishes MapReduce paper
  • 2005: Hadoop — open-source MapReduce implementation
  • 2011: Apache Kafka — distributed streaming platform (open-sourced by LinkedIn)
  • 2014: Apache Spark — fast in-memory processing
  • 2020+: Data Lakehouse, real-time ML, serverless analytics

Summary

In this lecture we traced the evolution from simple SQL tables to complex Big Data ecosystems. Key takeaways:

  • Data evolved from structured (SQL) through semi-structured (JSON) to streaming (events).
  • Processing models evolved from OLTP through OLAP, and storage from rigid warehouses to flexible Data Lakes.
  • MapReduce opened the Big Data era but was replaced by the faster Spark.
  • In the next lecture we’ll focus on stream processing — the concept of time, latency, and time windows.

Business Impact

Data variety allows companies to build a complete customer picture (Customer 360) — combining transactional data (SQL) with social media opinions (JSON/NLP). Moving from warehouses to Data Lakes reduces storage costs and opens the door to experimenting with new data sources — and that’s the foundation of a data-driven culture.