Lecture 2

⏳ Duration: 1.5h
🎯 Lecture Objective

Understanding how data has evolved in different industries and the tools used for its analysis today.


In this lecture, we will present the evolution of data analysis, showing how technologies and approaches to data processing have changed over the years.
We will start with classical tabular structures, move through more advanced graph and text models, and finish with modern approaches to stream processing.

Evolution of Data

1. Tabular Data (SQL Tables)

Initially, data was stored in tables, where each table contained information organized into columns and rows (e.g., SQL databases).
Such models are well suited to structured data.

📌 Features:

✅ Data divided into columns with a fixed structure.
✅ CRUD operations (Create, Read, Update, Delete) can be applied.
✅ Strict consistency and normalization rules.

📌 Examples:

➡️ Banking systems, e-commerce, ERP, CRM systems.

🖥️ Example Python Code (SQLite):

import sqlite3

# Create an in-memory database and a simple users table
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# Classic CRUD operations: insert a row and read it back
cursor.execute("INSERT INTO users (name, age) VALUES ('Alice', 30)")
cursor.execute("SELECT * FROM users")
print(cursor.fetchall())
conn.close()

2. Graph Data

As business needs grew, graph data models emerged, in which relationships between objects are represented as nodes and edges.

📌 Features:

✅ Data describing relationships and connections.
✅ Flexible structure (graphs instead of tables).
✅ Allows analysis of connections (e.g., PageRank algorithms, centrality).

📌 Examples:

➡️ Social networks (Facebook, LinkedIn), search engines (Google), recommendation systems (Netflix, Amazon).

🖥️ Example Python Code (Karate Club Graph, NetworkX):

import networkx as nx
import matplotlib.pyplot as plt

# Zachary's Karate Club: a classic small social network (34 members)
G = nx.karate_club_graph()
nx.draw(G, with_labels=True)
plt.show()
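
The features above mention analysis of connections, such as PageRank and centrality. Below is a minimal sketch of how these measures can be computed on the same karate club graph with NetworkX; the printed node IDs are only illustrative.

🖥️ Example Python Code (PageRank and Centrality, NetworkX):

import networkx as nx

G = nx.karate_club_graph()

# PageRank: node importance derived from the link structure
pagerank = nx.pagerank(G)

# Degree centrality: how well connected each node is
centrality = nx.degree_centrality(G)

# The five most "important" members according to PageRank
top_nodes = sorted(pagerank, key=pagerank.get, reverse=True)[:5]
print(top_nodes)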

3. Semi-structured Data (JSON, XML, YAML)

Such data are not fully structured like tables in SQL databases, but they still follow some schema.

📌 Features:

✅ Hierarchical structure (e.g., key-value pairs, nested objects).
✅ No strict schema (new fields can be added at any time).
✅ Popular in NoSQL systems and APIs.

📌 Examples:

➡️ Documents in MongoDB, configuration files, REST APIs, log files.

🖥️ Example Python Code (JSON):

import json

# Serialize a Python dict to a JSON string and parse it back
data = {'name': 'Alice', 'age': 30, 'city': 'New York'}
json_str = json.dumps(data)
print(json.loads(json_str))
# Output: {'name': 'Alice', 'age': 30, 'city': 'New York'}

4. Text Data (NLP)

Text has become a key source of information, especially in sentiment analysis, chatbots, and search engines.

📌 Features:

✅ Unstructured data requiring transformation.
✅ Use of embeddings (e.g., Word2Vec, BERT, GPT).
✅ Widely used in sentiment analysis and chatbots.

📌 Examples:

➡️ Social media, emails, chatbots, machine translation.

🖥️ Example Python Code (Text Embeddings, Ollama):

import ollama

# Example sentence
sentence = "Artificial intelligence is changing the world."

# Turn the sentence into a numerical vector (embedding) using a local Ollama model
response = ollama.embeddings(model='llama3.2', prompt=sentence)
embedding = response['embedding']
print(embedding[:4])
# Output: [-2.021953582763672, 1.5604140758514404, -0.5358548164367676, -1.3182345628738403]
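
Embeddings become useful when we compare them. The following is a minimal sketch, assuming (as above) a local Ollama server with the llama3.2 model available, that measures how close two sentences are via cosine similarity; the example sentences are purely illustrative.

🖥️ Example Python Code (Comparing Embeddings):

import ollama
import numpy as np

def embed(text):
    # Same assumption as above: a local Ollama server with the 'llama3.2' model
    return np.array(ollama.embeddings(model='llama3.2', prompt=text)['embedding'])

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

positive = embed("I love this product, it works great.")
negative = embed("This product is terrible and broke after a day.")
positive2 = embed("I am very happy with this purchase.")

# Sentences with similar meaning usually end up closer in the embedding space
print(cosine_similarity(positive, positive2))
print(cosine_similarity(positive, negative))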

5. Multimedia Data (Images, Sound, Video)

Modern data analysis systems also work with images, sound, and video.

📌 Features:

✅ Require significant computational power (AI, deep learning).
✅ Processed by CNN models (images) and RNN/Transformers (sound).

📌 Examples:

➡️ Face recognition, speech analysis, biometrics, video content analysis.

🖥️ Example Python Code (Image, OpenCV):

import cv2

# Load an image from disk and display it in a window
image = cv2.imread('cloud.jpeg')
cv2.imshow('Cloud', image)
cv2.waitKey(0)
cv2.destroyAllWindows()

6. Stream Data

Streaming data analysis is now developing rapidly: data is analyzed as it arrives, in real time.

📌 Features:

✅ Real-time processing.
✅ Technologies such as Apache Kafka, Flink, Spark Streaming.

📌 Examples:

➡️ Bank transactions (fraud detection), social media analysis, IoT.

🖥️ Example Python Code (Streaming Bank Transactions):

import time

# A tiny batch of transactions processed one by one, simulating a stream
transactions = [{'id': 1, 'amount': 100}, {'id': 2, 'amount': 200}]
for transaction in transactions:
    print(f"Processing transaction: {transaction}")
    time.sleep(1)
# Output:
# Processing transaction: {'id': 1, 'amount': 100}
# Processing transaction: {'id': 2, 'amount': 200}
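
The list above is finite, whereas a real stream is unbounded. Below is a minimal sketch in pure Python (no Kafka or Flink; the threshold and amounts are made up) of a producer generating transactions indefinitely and a consumer flagging suspiciously large ones, in the spirit of fraud detection.

🖥️ Example Python Code (Simulated Fraud Detection on a Stream):

import random
import time

def transaction_stream():
    """An unbounded, simulated stream of bank transactions (the producer)."""
    tid = 0
    while True:
        tid += 1
        yield {'id': tid, 'amount': round(random.uniform(10, 5000), 2)}

THRESHOLD = 4000  # hypothetical limit above which a transaction is flagged

for transaction in transaction_stream():
    if transaction['amount'] > THRESHOLD:
        print(f"FRAUD ALERT: {transaction}")
    else:
        print(f"OK: {transaction}")
    time.sleep(0.5)
    if transaction['id'] >= 10:  # stop the demo after 10 events
        break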

7. Sensor Data and IoT

Data from sensors and IoT devices is the next step in this evolution.

📌 Features:

✅ Often comes from billions of devices (big data).
✅ Often requires analysis close to the device (edge computing).

📌 Examples:

➡️ Smart homes, wearables, autonomous cars, industrial systems.

🖥️ Example Python Code (Sensor, Temperature):

import random

# Simulated temperature sensor reading
def get_temperature():
    return round(random.uniform(20.0, 25.0), 2)

print(f"Current temperature: {get_temperature()}°C")
# Output: Current temperature: 20.31°C
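
To illustrate the edge computing idea mentioned above, here is a minimal sketch (all values simulated) in which the device aggregates many raw readings locally and sends only a compact summary upstream:

🖥️ Example Python Code (Edge Aggregation of Sensor Readings):

import random

def get_temperature():
    # Simulated reading, as in the example above
    return round(random.uniform(20.0, 25.0), 2)

# The edge device collects, e.g., one reading per second for a minute...
readings = [get_temperature() for _ in range(60)]

# ...and sends only a summary to the central system instead of 60 raw values
summary = {
    'min': min(readings),
    'max': max(readings),
    'avg': round(sum(readings) / len(readings), 2),
}
print(f"Sending summary: {summary}")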

The Real Process of Data Generation

Data is generated in an unlimited manner: it appears as a result of continuous system operations.
Today, you have already generated a lot of data on your phone (even during this lecture!).
Won't you generate more data during the next session, or tomorrow?

Data is always generated in the form of a stream.

📌 Systems handling data streams:

  • Data warehouses
  • Device monitoring systems (IoT)
  • Transactional systems
  • Web analytics systems
  • Online advertising
  • Social media
  • Logging systems

A company is an organization that generates and responds to a continuous stream of data.

In batch processing, the source (and also the result) of data processing is a file.

It is written once and can be referenced multiple times (multiple processes or tasks can operate on it).

The file name serves as an identifier for the set of records.

In the case of a stream, an event is generated only once by a so-called producer (also referred to as a sender or provider).
The generated event can be processed by multiple so-called consumers (receivers).
Streaming events are grouped into so-called topics.
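
Below is a minimal in-memory sketch of this producer / topic / consumer idea. Real systems such as Apache Kafka follow the same principle, but in a distributed and persistent way; all names here are illustrative.

🖥️ Example Python Code (Producer, Topic, Consumer):

from collections import defaultdict

# Topics hold the events published by producers
topics = defaultdict(list)

def produce(topic, event):
    """The producer generates an event exactly once and publishes it to a topic."""
    topics[topic].append(event)

def consume(topic, consumer_name):
    """Any number of consumers can independently process the same events."""
    for event in topics[topic]:
        print(f"{consumer_name} processed {event} from topic '{topic}'")

produce('transactions', {'id': 1, 'amount': 100})
produce('transactions', {'id': 2, 'amount': 250})

consume('transactions', 'fraud-detector')
consume('transactions', 'reporting-service')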

Big Data

When should you make a business decision?

Batch Processing

Hadoop MapReduce – Scaling Computation on Big Data

When we talk about scalable data processing, the first association might be Google.
But what actually enables us to search for information in a fraction of a second while processing petabytes of data?

👉 Did you know that the name "Google" comes from the word "Googol," which represents the number 10¹⁰⁰?
That's more than the number of atoms in the known universe! 🌌

🔥 Challenge: Can you write out the number Googol by the end of this lecture?


πŸ” Why Are SQL and Traditional Algorithms Insufficient?

Traditional SQL databases and single-threaded algorithms fail when data scales beyond a single computer.
This is where MapReduce comes in: a revolutionary computational model developed by Google.

πŸ› οΈ Google’s Solutions for Big Data:

βœ… Google File System (GFS) – a distributed file system.
βœ… Bigtable – a system for storing massive amounts of structured data.
βœ… MapReduce – an algorithm for distributing workloads across multiple machines.

πŸ–ΌοΈ Graphical Representation of MapReduce

1. Mapping Splits Tasks (Map)

Each input is divided into smaller parts and processed in parallel.

🌍 Imagine you have a phone book and want to find all people with the last name "Nowak".
➡️ Divide the book into sections and give each person one section to analyze.

2. Reducing Gathers the Results (Reduce)

All partial results are combined into one final answer.

🔄 All students report their findings, and one student collects and summarizes them.

MapReduce in action – distributing and aggregating data


💡 Classic Example: Counting Words in a Text

Let's assume we have millions of books and we want to count how many times each word appears.

🖥️ MapReduce Code in Python (Using Multiprocessing)

from multiprocessing import Pool
from collections import Counter

# Map function (splitting text into words)
def map_function(text):
    words = text.split()
    return Counter(words)

# Reduce function (summing up results)
def reduce_function(counters):
    total_count = Counter()
    for counter in counters:
        total_count.update(counter)
    return total_count

texts = [
    "big data is amazing",
    "data science and big data",
    "big data is everywhere"
]

if __name__ == '__main__':
    with Pool() as pool:
        mapped_results = pool.map(map_function, texts)
    
    final_result = reduce_function(mapped_results)
    print(final_result)

# Counter({'data': 4, 'big': 3, 'is': 2, 'amazing': 1, 'science': 1, 'and': 1, 'everywhere': 1})

🔹 What's Happening Here?

✅ Each text fragment is processed independently (map).
✅ The results are collected and summed (reduce).
✅ Outcome: we can process terabytes of text in parallel!

Batch Processing


🎨 Visualization – Comparison of the Classic Approach and MapReduce

📊 Old Approach – a single computer processes everything sequentially.
📊 New Approach (MapReduce) – each machine processes a fragment, and the results are aggregated.

MapReduce Processing


🚀 Challenge for You!

🔹 Find and run your own MapReduce algorithm in any programming language!
🔹 Can you implement your own MapReduce for a different task? (e.g., log analysis, counting website clicks)


Big Data

Big Data systems can serve as a source for data warehouses (e.g., Data Lake, Enterprise Data Hub).

However, Data Warehouses are not Big Data systems!

1. Data Warehouses

  • Store highly structured data
  • Focused on analytics and reporting processes
  • 100% accuracy

2. Big Data

  • Can handle data of any structure
  • Used for various data-driven purposes (analytics, data science, etc.)
  • Less than 100% accuracy

"Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it."
– Dan Ariely, Professor of Psychology and Behavioral Economics, Duke University

One, Two, … Five Vs

  1. Volume – The size of data produced worldwide is growing at an exponential rate.
  2. Velocity – The speed at which data is generated, transmitted, and processed.
  3. Variety – Traditional data is alphanumeric, consisting of letters and numbers. Today, we also deal with images, audio, video files, and IoT data streams.
  4. Veracity – Are the data complete and accurate? Do they objectively reflect reality? Are they a reliable basis for decision-making?
  5. Value – The actual worth of the data. Ultimately, it's about costs and benefits.

"The purpose of computing is insight, not numbers." – R.W. Hamming, 1962.


Data Processing Models

Data has always been processed in business.
Over the past decades, the amount of processed data has been steadily increasing, affecting the way data is prepared and handled.

A Bit of History

  • 1960s: Data collections, databases
  • 1970s: Relational data models and their implementation in OLTP systems
  • 1975: First personal computers
  • 1980s: Advanced data models (extended-relational, object-oriented, application-oriented, etc.)
  • 1983: The beginning of the Internet
  • 1990s: Data mining, data warehouses, OLAP systems
  • Later: NoSQL, Hadoop, Spark, data lakes
  • 2002: AWS, 2005: Hadoop, Cloud computing

Most data is stored in databases or data warehouses.
Typically, data access is performed through applications by executing queries.

The method of utilizing and accessing a database is called the data processing model.
The two most commonly used implementations are:

Traditional Model

The traditional model refers to online transaction processing (OLTP),
which excels at handling real-time tasks such as customer service, order management, and sales processing.
It is commonly used in Enterprise Resource Planning (ERP) systems, Customer Relationship Management (CRM) software, and web-based applications.

This model provides efficient solutions for:

  • Secure and efficient data storage
  • Transactional data recovery after failures
  • Optimized data access
  • Concurrency management
  • Event processing → read → write
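
A minimal sketch of the transactional behavior described above, using SQLite (the table and amounts are made up): either both writes of a money transfer are committed, or the whole operation is rolled back to a consistent state after a failure.

🖥️ Example Python Code (OLTP Transaction, SQLite):

import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
cur.executemany("INSERT INTO accounts (balance) VALUES (?)", [(100.0,), (50.0,)])
conn.commit()

# Transfer 30 from account 1 to account 2: both updates succeed or neither does
try:
    cur.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    cur.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    conn.commit()      # make the transaction durable
except sqlite3.Error:
    conn.rollback()    # on failure, return to the last consistent state

print(cur.execute("SELECT * FROM accounts").fetchall())
conn.close()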

However, what happens when we need to deal with:

  • Aggregation of data from multiple systems (e.g., multiple stores)
  • Reporting and data summarization
  • Optimization of complex queries
  • Business decision support

Research on these topics led to the formulation of a new data processing model and a new type of database – Data Warehouses.

OLAP Model

Online Analytical Processing (OLAP)

OLAP supports data analysis and provides tools for multidimensional analysis
based on dimensions such as time, location, and product.

The process of extracting data from various systems into a single database is known as Extract-Transform-Load (ETL),
which involves normalization, encoding, and schema transformation.
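
The following is a minimal, invented sketch of the ETL idea: extract records from two store systems with different schemas, transform them into a common form, and load them into one table (pandas is used for the target; all field names are illustrative):

🖥️ Example Python Code (Mini ETL):

import pandas as pd

# Extract: raw records from two source systems with different schemas
store_a = [{'product': 'milk', 'price_pln': 4.5, 'qty': 2}]
store_b = [{'item': 'MILK', 'unit_price': 4.2, 'quantity': 3}]

# Transform: normalize names, encoding, and schema to a common structure
def transform_a(rec):
    return {'product': rec['product'].lower(), 'price': rec['price_pln'], 'quantity': rec['qty']}

def transform_b(rec):
    return {'product': rec['item'].lower(), 'price': rec['unit_price'], 'quantity': rec['quantity']}

rows = [transform_a(r) for r in store_a] + [transform_b(r) for r in store_b]

# Load: insert the unified records into the warehouse table
warehouse = pd.DataFrame(rows)
print(warehouse)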

Analyzing data in a data warehouse mainly involves calculating aggregates (summaries) across different dimensions.
This process is entirely user-driven.

Example

Imagine we have access to a data warehouse storing sales information from a supermarket.
How can we analyze queries such as:

  1. What are the total sales of products in subsequent quarters, months, and weeks?
  2. What is the sales breakdown by product categories?
  3. What is the sales breakdown by supermarket branches?

Answers to these questions help identify bottlenecks in product sales, plan inventory levels, and compare sales across different product groups and supermarket branches.
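
A minimal sketch of such dimensional aggregation using pandas (the fact table below is invented for illustration; in a real warehouse these summaries would be computed by the OLAP engine over far larger data):

🖥️ Example Python Code (Aggregation over Dimensions, pandas):

import pandas as pd

# A tiny, invented sales fact table with three dimensions: time, product category, branch
sales = pd.DataFrame({
    'quarter':  ['Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
    'category': ['food', 'toys', 'food', 'food', 'toys'],
    'branch':   ['A', 'B', 'A', 'B', 'A'],
    'amount':   [120.0, 80.0, 200.0, 150.0, 60.0],
})

# 1. Total sales per quarter
print(sales.groupby('quarter')['amount'].sum())

# 2. Sales broken down by product category
print(sales.groupby('category')['amount'].sum())

# 3. Sales broken down by supermarket branch
print(sales.groupby('branch')['amount'].sum())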

In a Data Warehouse, two types of queries are most commonly executed (both in batch mode):

  1. Periodic queries that generate business statistics, such as reporting queries.
  2. Ad-hoc queries that support critical business decisions.