Code
import networkx as nx
G = nx.karate_club_graph()
nx.draw(G, with_labels=True)
⏳ Duration: 1.5h
🎯 Lecture Objective
Understanding how data has evolved in different industries and the tools used for its analysis today.
In this lecture, we will present the evolution of data analysis, showing how technologies and approaches to data processing have changed over the years.
We will start with classical tabular structures, move through more advanced graph and text models, and finish with modern approaches to stream processing.
Initially, data was stored in tables, where each table contained organized information in columns and rows (e.g., SQL databases).
Such models were perfect for structured data.
✅ Data divided into columns with a fixed structure.
✅ CRUD operations (Create, Read, Update, Delete) can be applied.
✅ Strict consistency and normalization rules.
➡️ Banking systems, e-commerce, ERP, CRM systems.
import sqlite3

# Create an in-memory database with a fixed, column-based schema.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# Basic CRUD: insert a row, then read it back.
cursor.execute("INSERT INTO users (name, age) VALUES ('Alice', 30)")
cursor.execute("SELECT * FROM users")
print(cursor.fetchall())
conn.close()
As business needs grew, graph data emerged, where relationships between objects are represented as nodes and edges.
✅ Data describing relationships and connections.
✅ Flexible structure (graphs instead of tables).
✅ Allows analysis of connections (e.g., PageRank algorithms, centrality).
➡️ Social networks (Facebook, LinkedIn), search engines (Google), recommendation systems (Netflix, Amazon).
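As a hedged illustration of this kind of connection analysis, the karate club graph from the earlier snippet can be ranked with PageRank and degree centrality (both are built into networkx; the top-5 cut-off below is arbitrary):

import networkx as nx

# Reuse the karate club graph and measure node importance.
G = nx.karate_club_graph()
pagerank = nx.pagerank(G)
centrality = nx.degree_centrality(G)

# Print the five most "central" members of the network.
top = sorted(pagerank, key=pagerank.get, reverse=True)[:5]
print([(node, round(pagerank[node], 3)) for node in top])
print([(node, round(centrality[node], 3)) for node in top])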
Semi-structured data is not fully organized like data in SQL databases, but it still follows some schema.
✅ Hierarchical structure (e.g., key-value pairs, nested objects).
✅ No strict schema (possibility to add new fields).
✅ Popular in NoSQL systems and APIs.
➡️ Documents in MongoDB, configuration files, REST APIs, log files.
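To make this concrete, here is a minimal sketch of a semi-structured document using Python's built-in json module (the field names and values are made up):

import json

# A semi-structured record: nested fields, and new keys can be added at any time.
user = {
    "id": 1,
    "name": "Alice",
    "address": {"city": "Warsaw", "zip": "00-001"},
    "tags": ["admin", "beta-tester"]
}

doc = json.dumps(user)                      # serialize, as stored in MongoDB or sent via a REST API
print(json.loads(doc)["address"]["city"])   # navigate the nested structure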
Text has become a key source of information, especially in sentiment analysis, chatbots, and search engines.
✅ Unstructured data requiring transformation.
✅ Use of embeddings (e.g., Word2Vec, BERT, GPT).
✅ Widely used in sentiment analysis and chatbots.
➡️ Social media, emails, chatbots, machine translation.
Example: a piece of text represented as a numeric embedding vector:
[-2.021953582763672, 1.5604140758514404, -0.5358548164367676, -1.3182345628738403]
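Modern embeddings come from neural models such as Word2Vec, BERT, or GPT; as a lightweight, hedged stand-in, the sketch below turns short texts into vectors with scikit-learn's TF-IDF (the sample sentences are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "I love this product",
    "this product is terrible",
    "great service and great product"
]

# TF-IDF is a simple, non-neural way to map unstructured text to numeric vectors.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))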
Modern data analysis systems also use images and sound.
✅ Require significant computational power (AI, deep learning).
✅ Processed by CNN models (images) and RNN/Transformers (sound).
➡️ Face recognition, speech analysis, biometrics, video content analysis.
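As a rough sketch (not a trained model), a convolutional network for small images might be assembled in PyTorch like this; the layer sizes are illustrative assumptions:

import torch
import torch.nn as nn

# Tiny CNN for 3x32x32 images, e.g. a 10-class classification task.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn local image features
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample to 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10)                  # map features to class scores
)

x = torch.randn(1, 3, 32, 32)  # one random "image"
print(model(x).shape)          # torch.Size([1, 10])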
Currently, streaming data analysis is rapidly evolving, where data is analyzed as it flows in real-time.
✅ Real-time processing.
✅ Technologies such as Apache Kafka, Flink, Spark Streaming.
➡️ Bank transactions (fraud detection), social media analysis, IoT.
Data from sensors and IoT devices is the next step in evolution.
✅ Often comes from billions of devices (big data).
✅ Often requires analysis at the edge (edge computing).
➡️ Smart homes, wearables, autonomous cars, industrial systems.
🖥️ Example Python Code (Sensor - Temperature):
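A minimal simulation sketch (the sensor range and the alert threshold are illustrative assumptions):

import itertools
import random
import time

# Simulated IoT temperature sensor: an unbounded stream of readings, one per second.
def temperature_sensor():
    while True:
        yield round(random.uniform(18.0, 26.0), 2)
        time.sleep(1)

# Edge-style processing: react to each reading as it arrives (here limited to 5 readings).
for reading in itertools.islice(temperature_sensor(), 5):
    print(f"Temperature: {reading} °C")
    if reading > 24.0:
        print("Alert: threshold exceeded")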
Data is generated in an unlimited manner: it appears as a result of continuous system operations.
Today, you have generated a lot of data on your phone (even during this lecture!).
Won't you generate more data during your next session, or tomorrow?
Data is always generated in the form of a stream.
📌 Systems handling data streams:
A company is an organization that generates and responds to a continuous stream of data.
In batch processing, the source (and also the result) of data processing is a file.
It is written once and can be referenced multiple times (multiple processes or tasks can operate on it).
The file name serves as an identifier for the set of records.
In the case of a stream, an event is generated only once by a so-called producer (also referred to as a sender or provider).
The generated event can be processed by multiple so-called consumers (receivers).
Streaming events are grouped into so-called topics.
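A hedged sketch of this producer/consumer/topic pattern with the kafka-python client, assuming a Kafka broker is running locally on port 9092 and a topic named 'transactions' exists:

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: generates each event exactly once and publishes it to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("transactions", {"user": "alice", "amount": 120.50})
producer.flush()

# Consumer: one of possibly many receivers subscribed to the same topic.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break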
When should you make a business decision?
When we talk about scalable data processing, the first association might be Google.
But what actually enables us to search for information in a fraction of a second while processing petabytes of data?
📌 Did you know that the name "Google" comes from the word "Googol," which represents the number 10¹⁰⁰?
That's more than the number of atoms in the known universe!
Traditional SQL databases and single-threaded algorithms fail when data scales beyond a single computer.
This is where MapReduce comes in: a revolutionary computational model developed by Google.
✅ Google File System (GFS) – a distributed file system.
✅ Bigtable – a system for storing massive amounts of structured data.
✅ MapReduce – a programming model for distributing workloads across multiple machines.
Each input is divided into smaller parts and processed in parallel.
📌 Imagine you have a phone book and want to find all people with the last name "Nowak".
➡️ Divide the book into sections and give each person one section to analyze.
All partial results are combined into one final answer.
📌 All students report their findings, and one student collects and summarizes the results.
Let's assume we have millions of books and we want to count how many times each word appears.
from multiprocessing import Pool
from collections import Counter

# Map function (splitting text into words)
def map_function(text):
    words = text.split()
    return Counter(words)

# Reduce function (summing up results)
def reduce_function(counters):
    total_count = Counter()
    for counter in counters:
        total_count.update(counter)
    return total_count

texts = [
    "big data is amazing",
    "data science and big data",
    "big data is everywhere"
]

if __name__ == '__main__':
    with Pool() as pool:
        mapped_results = pool.map(map_function, texts)
    final_result = reduce_function(mapped_results)
    print(final_result)
    # Counter({'data': 4, 'big': 3, 'is': 2, 'amazing': 1, 'science': 1, 'and': 1, 'everywhere': 1})
✅ Each text fragment is processed independently (map).
✅ The results are collected and summed (reduce).
✅ Outcome: we can process terabytes of text in parallel!
📌 Old Approach – a single computer processes everything sequentially.
📌 New Approach (MapReduce) – each machine processes a fragment, and the results are aggregated.
🔹 Find and run your own MapReduce algorithm in any programming language!
🔹 Can you implement your own MapReduce for a different task? (e.g., log analysis, counting website clicks)
Big Data systems can serve as a source for data warehouses (e.g., Data Lake, Enterprise Data Hub).
However, Data Warehouses are not Big Data systems!
"Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it."
– Dan Ariely, Professor of Psychology and Behavioral Economics, Duke University
"The purpose of computing is insight, not numbers." – R.W. Hamming, 1962.
Data has always been processed in business.
Over the past decades, the amount of processed data has been steadily increasing, affecting the way data is prepared and handled.
Most data is stored in databases or data warehouses.
Typically, data access is performed through applications by executing queries.
The method of utilizing and accessing a database is called the data processing model.
The two most commonly used implementations are:
The traditional model refers to online transaction processing (OLTP),
which excels at handling real-time tasks such as customer service, order management, and sales processing.
It is commonly used in Enterprise Resource Planning (ERP) systems, Customer Relationship Management (CRM) software, and web-based applications.
This model provides efficient solutions for:
However, what happens when we need to deal with:
Research on these topics led to the formulation of a new data processing model and a new type of database: Data Warehouses.
Online Analytical Processing (OLAP)
OLAP supports data analysis and provides tools for multidimensional analysis
based on dimensions such as time, location, and product.
The process of extracting data from various systems into a single database is known as Extract-Transform-Load (ETL),
which involves normalization, encoding, and schema transformation.
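A small hedged sketch of one ETL step in Python (pandas + sqlite3); the source records, column names, and normalization rules are made up for illustration:

import sqlite3
import pandas as pd

# Extract: raw records arriving from a source system with inconsistent formats.
raw = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": ["120,50", "89,99", "45,00"],
    "country": ["pl", "PL", "Pl"],
})

# Transform: normalize number formats and encodings.
raw["amount"] = raw["amount"].str.replace(",", ".").astype(float)
raw["country"] = raw["country"].str.upper()

# Load: write the cleaned records into the warehouse (SQLite stands in for it here).
conn = sqlite3.connect(":memory:")
raw.to_sql("sales", conn, index=False)
print(pd.read_sql("SELECT * FROM sales", conn))
conn.close()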
Analyzing data in a data warehouse mainly involves calculating aggregates (summaries) across different dimensions.
This process is entirely user-driven.
Imagine we have access to a data warehouse storing sales information from a supermarket.
How can we analyze queries such as:
Answers to these questions help identify bottlenecks in product sales, plan inventory levels, and compare sales across different product groups and supermarket branches.
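For instance, such aggregate queries can be sketched with sqlite3 (the sales schema and values below are purely illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (product TEXT, branch TEXT, month TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("milk", "Warsaw", "2024-01", 1200.0),
        ("milk", "Krakow", "2024-01", 950.0),
        ("bread", "Warsaw", "2024-02", 700.0),
        ("milk", "Warsaw", "2024-02", 1100.0),
    ],
)

# Aggregate (sum) sales along the product and branch dimensions.
cur.execute("SELECT product, branch, SUM(amount) FROM sales GROUP BY product, branch")
print(cur.fetchall())
conn.close()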
In a Data Warehouse, two types of queries are most commonly executed (both in batch mode):