Lecture 1: Introduction to Real-Time Data Analytics

Duration: 1.5h

Goal: Understand what real-time data analytics is, the differences between data processing modes, and where businesses apply these approaches.


What Is Real-Time Data Analytics?

Real-Time Data Analytics is the process of analyzing data as soon as it is generated, rather than first accumulating it in files or tables and processing it later.

Key characteristics:

  • Low latency — data is analyzed within milliseconds or seconds of being generated.
  • Continuity — processing runs non-stop as new data arrives.
  • Reactivity — the system makes decisions or triggers alerts in real time.
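The three characteristics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` class, the `handle` and `run` functions, and the threshold of 100 are hypothetical names and values chosen for the example, and a fake clock replaces real wall-clock time so the result is reproducible.

```python
from dataclasses import dataclass


@dataclass
class Event:
    created_at: float  # time the event was generated, in seconds
    value: float       # measured value (hypothetical payload)


def handle(event, now, threshold=100.0):
    """Process a single event at time `now`."""
    latency = now - event.created_at   # low latency: measured per event
    alert = event.value > threshold    # reactivity: decide immediately
    return {"latency_s": latency, "alert": alert}


def run(stream, clock):
    """Continuity: keep processing for as long as the stream yields events."""
    for event in stream:
        yield handle(event, clock())


# Demo with a fake clock so the result is reproducible
events = [Event(created_at=0.0, value=50.0), Event(created_at=1.0, value=150.0)]
arrival_times = iter([0.5, 1.25])
results = list(run(events, lambda: next(arrival_times)))
for r in results:
    print(r)
```

In a real system `clock` would be something like `time.time` and `stream` would be an unbounded source such as a message queue, so the loop in `run` would never finish on its own.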

Consider the contrast: an accountant who compiles a monthly sales report works in batch mode, whereas a bank's anti-fraud system that blocks a suspicious transaction within a fraction of a second works in real time.

Three Data Processing Modes

In practice, there are three approaches to processing information. Each has different use cases, costs, and trade-offs.

Batch Processing

Data is collected and processed at predefined intervals (hourly, daily, etc.).

Typical use cases:

  • end-of-day financial reports,
  • training machine learning models on historical data,
  • sales trend analysis.

Technologies: Apache Spark (batch mode), SQL, pandas.

import pandas as pd
import numpy as np

# Simulated transaction data
np.random.seed(42)
df = pd.DataFrame({
    'date': pd.date_range("2025-01-01", periods=1000, freq='h'),
    'amount': np.random.uniform(10, 5000, 1000).round(2),
    'store': np.random.choice(['Warsaw', 'Krakow', 'Gdansk', 'Wroclaw'], 1000)
})

# Typical batch analysis — monthly report
report = df.groupby([df['date'].dt.to_period('M'), 'store'])['amount'].sum().unstack()
print(report.round(0))

Output:
store      Gdansk    Krakow    Warsaw   Wroclaw
date                                           
2025-01  471110.0  478535.0  466587.0  424542.0
2025-02  148016.0  151922.0  175745.0  139924.0

We collect data, store it, then analyze it. Results come with a delay — but for many use cases that’s perfectly fine.
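For contrast, the same kind of per-store total can also be maintained incrementally, one transaction at a time. The sketch below uses only the standard library; `batch_totals` and `RunningTotals` are hypothetical names, and the three sample transactions are made up for illustration. It shows that both styles reach the same final numbers, while the incremental version has a usable result after every event.

```python
from collections import defaultdict


def batch_totals(transactions):
    """Batch style: all data is available first, then summed in one pass."""
    totals = defaultdict(float)
    for t in transactions:
        totals[t["store"]] += t["amount"]
    return dict(totals)


class RunningTotals:
    """Incremental style: state is updated each time one transaction arrives."""

    def __init__(self):
        self.totals = defaultdict(float)

    def update(self, transaction):
        self.totals[transaction["store"]] += transaction["amount"]
        return dict(self.totals)  # snapshot available immediately


sample = [
    {"store": "Warsaw", "amount": 100.0},
    {"store": "Krakow", "amount": 250.0},
    {"store": "Warsaw", "amount": 40.0},
]

running = RunningTotals()
for t in sample:
    snapshot = running.update(t)  # an up-to-date result after every event

print(snapshot)
print(batch_totals(sample))
```

The trade-off is that the incremental version must keep state alive between events, which is part of what makes streaming systems more complex to operate than a scheduled batch job.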

Near Real-Time Analytics

Data is processed with a small delay, from a few seconds to a few minutes. This mode is a compromise between cost and speed: faster than batch, cheaper than true real-time.
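One common way to obtain this kind of small, bounded delay is micro-batching: events are grouped into short, fixed, non-overlapping time windows (often called tumbling windows) and each window is processed as a unit. The function below is an offline sketch of the windowing logic only; `tumbling_windows`, the 10-second window, and the sample events are assumptions made for the example.

```python
from collections import defaultdict


def tumbling_windows(events, window_seconds):
    """Group (timestamp, amount) pairs into fixed, non-overlapping windows.

    Returns a dict mapping each window start time to the total amount
    of the events that fall into that window.
    """
    totals = defaultdict(float)
    for timestamp, amount in events:
        # Round the timestamp down to the start of its window
        window_start = int(timestamp // window_seconds) * window_seconds
        totals[window_start] += amount
    return dict(sorted(totals.items()))


# Timestamps are seconds since the start of the simulation
events = [(0, 10.0), (3, 5.0), (12, 7.0), (19, 1.0), (20, 2.0)]
windows = tumbling_windows(events, window_seconds=10)
print(windows)
```

In a live system the total for a window is only known once the window closes, so the window length sets an upper bound on the added delay: shorter windows mean fresher results but more frequent processing.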

Typical use cases:

  • monitoring bank transactions (analysis within seconds),
  • dynamically adjusting online ads,
  • server log analysis and anomaly detection.

Technologies: Apache Kafka + Spark Streaming, Elasticsearch.

# Requires the kafka-python package and a Kafka broker listening on localhost:9092
from kafka import KafkaProducer
import json, random, time
from datetime import datetime

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Send 10 simulated transactions, one per second
for _ in range(10):
    transaction = {
        'id': f'TX{random.randint(1000,9999)}',
        'amount': round(random.uniform(10, 10000), 2),
        'time': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'store': random.choice(['Warsaw', 'Krakow', 'Gdansk']),
    }
    producer.send('transactions', value=transaction)
    print(f"Sent: {transaction}")
    time.sleep(1)

producer.flush()
producer.close()

# Run the consumer as a separate program (for example, in a second terminal)
from kafka import KafkaConsumer
import json

# Consumer reacts to large transactions
consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',  # start from the oldest message if no offset is committed
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

# Iterating blocks and waits for new messages until the program is interrupted
for message in consumer:
    t = message.value
    if t['amount'] > 8000:
        print(f"ALERT: Large transaction {t['id']}: {t['amount']} PLN in {t['store']}")

These two programs illustrate the essence of stream processing: a producer generates data, a consumer reacts to it in near real-time.
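If no Kafka broker is available, the same producer/consumer pattern can be tried in a single process. The sketch below swaps Kafka for the standard library's `queue.Queue` and a thread; it is an illustration of the pattern, not a replacement for a message broker. The names `SENTINEL`, `run_demo`, and the sample transactions are hypothetical.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker, not a Kafka concept


def producer(q, transactions):
    """Put each transaction on the queue, then signal the end of the stream."""
    for t in transactions:
        q.put(t)
    q.put(SENTINEL)


def consumer(q, threshold, alerts):
    """Take transactions off the queue as they arrive and record large ones."""
    while True:
        t = q.get()  # blocks until an item is available
        if t is SENTINEL:
            break
        if t["amount"] > threshold:
            alerts.append(t["id"])


def run_demo(transactions, threshold=8000):
    q = queue.Queue()
    alerts = []
    worker = threading.Thread(target=consumer, args=(q, threshold, alerts))
    worker.start()
    producer(q, transactions)
    worker.join()
    return alerts


sample = [
    {"id": "TX1", "amount": 9500.0},
    {"id": "TX2", "amount": 120.0},
    {"id": "TX3", "amount": 8000.01},
]
alerts = run_demo(sample)
print(alerts)
```

Unlike Kafka, an in-process queue does not persist messages or allow several independent consumers to read the same stream, which is why a broker is used once producers and consumers run as separate services.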

Real-Time Analytics

Analysis and reaction are immediate: within milliseconds or, in extreme cases such as HFT, microseconds. This mode requires dedicated infrastructure and is the most expensive, but for some use cases there is no alternative.

Typical use cases:

  • High-Frequency Trading (HFT) — investment decisions in microseconds,
  • autonomous vehicles — real-time camera image analysis,
  • DDoS attack detection in computer networks.

Technologies: Apache Flink, Apache Storm, dedicated FPGA systems.
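To make the per-event style concrete, the DDoS use case can be sketched as a sliding-window request counter: each incoming request updates a small piece of state and a decision is made immediately. This is a simplified illustration, not how Flink or Storm implement windows; `RateDetector` and its limits are assumptions, and a single traffic source is assumed (a per-address version would keep one detector per source).

```python
from collections import deque


class RateDetector:
    """Flags a source that sends more than max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.timestamps = deque()  # request times still inside the window

    def observe(self, timestamp):
        """Record one request; return True if the rate limit is exceeded."""
        self.timestamps.append(timestamp)
        # Drop requests that have fallen out of the sliding window
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_requests


detector = RateDetector(max_requests=3, window_seconds=1.0)
request_times = [0.0, 0.1, 0.2, 0.3, 2.0]
flags = [detector.observe(t) for t in request_times]
print(flags)
```

Each call does a small, bounded amount of work and keeps only the timestamps inside the window, which is the property that lets this kind of check run on every event.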

Comparison

Feature             | Batch            | Near Real-Time         | Real-Time
Latency             | Minutes to hours | Seconds to minutes     | Milliseconds
Infrastructure cost | Low              | Medium                 | High
Complexity          | Low              | Medium                 | High
Example             | Monthly report   | Transaction monitoring | HFT, autonomous vehicles

Key principle: real-time is not always necessary. In many cases near real-time is sufficient and significantly cheaper. Understanding business requirements before choosing an approach is essential.

Business Applications

Below are a few domains where real-time data analytics delivers concrete business value.

Finance and banking: Anti-fraud systems analyze every card transaction in a fraction of a second and block suspicious operations before money leaves the account. HFT systems make thousands of investment decisions per second.

E-commerce: Dynamic pricing (e.g., airlines, Uber) changes in response to current demand. Recommendation engines adapt offers to user behavior during their session.

Telecommunications and IoT: Smart energy meters transmit consumption data in real time, enabling grid optimization. Infrastructure monitoring systems detect failures before users notice them.

Healthcare: Medical devices monitor patient vital signs and alert staff to threats. Epidemiological systems track disease spread.
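The vital-sign monitoring mentioned above can be sketched as online anomaly detection: a running mean and standard deviation are updated with each new reading (Welford's algorithm), and a reading far from the running mean is flagged. The class and function names, the z-score threshold of 3, the minimum of 5 samples, and the heart-rate values are all assumptions for illustration, not clinical guidance.

```python
import math


class OnlineStats:
    """Welford's algorithm: running mean and variance without storing history."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        """Sample standard deviation; 0.0 until at least two readings exist."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def is_anomaly(stats, x, z_threshold=3.0, min_samples=5):
    """Flag x if it lies more than z_threshold standard deviations from the mean."""
    if stats.n < min_samples or stats.std() == 0.0:
        return False
    return abs(x - stats.mean) / stats.std() > z_threshold


stats = OnlineStats()
heart_rates = [72, 75, 71, 74, 73, 72, 140]  # made-up readings
flags = []
for reading in heart_rates:
    flags.append(is_anomaly(stats, reading))  # decide first, using past data only
    stats.update(reading)                     # then fold the reading into the state

print(flags)
```

Because only three numbers are stored per patient, the check costs the same for the millionth reading as for the first, which suits continuous monitoring.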

Challenges

Implementing real-time systems involves specific technical and organizational problems:

Challenge              | Description                                   | Typical solution
Scalability            | Data volume grows and the system must keep up | Kafka, Kubernetes, cloud
Latency                | Every millisecond can matter                  | Edge computing, network optimization
Data quality           | Streaming data can be incomplete or erroneous | In-flight validation, data cleansing
Integration complexity | Many systems must work together               | APIs, microservices, Docker
Security               | Data in motion needs protection               | TLS encryption, authorization
Cost                   | Real-time requires powerful infrastructure    | Serverless, autoscaling
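As an illustration of the data-quality row, in-flight validation can be sketched as a check applied to every record before it reaches the analysis step, with rejected records set aside rather than silently dropped. The field names follow the transaction format used earlier in this lecture; `validate`, `route`, and the specific rules are assumptions for the example.

```python
def validate(transaction):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    for field in ("id", "amount", "store"):
        if field not in transaction:
            problems.append(f"missing field: {field}")
    if "amount" in transaction:
        amount = transaction["amount"]
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not is_number or amount <= 0:
            problems.append("amount must be a positive number")
    return problems


def route(stream):
    """Split a stream into valid records and rejected ones with their problems."""
    valid, rejected = [], []
    for record in stream:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected


records = [
    {"id": "TX1", "amount": 120.5, "store": "Warsaw"},
    {"id": "TX2", "amount": -5, "store": "Krakow"},
    {"id": "TX3", "store": "Gdansk"},
    {"id": "TX4", "amount": "12", "store": "Gdansk"},
]
valid, rejected = route(records)
print(len(valid), "valid;", len(rejected), "rejected")
```

Keeping the rejected records together with the reason for rejection makes it possible to inspect and replay them later instead of losing them.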

Summary

In this lecture you learned about three data processing modes and their applications. Key takeaways:

  • Most data is generated continuously, event by event; batch processing is one way of analyzing it later, after it has accumulated.
  • The choice of processing mode depends on business requirements, not technology.
  • In upcoming lectures we’ll explore technologies (Kafka, Spark) and architectures (Lambda, Kappa) that enable stream processing.

Business Impact

Shifting from batch to near real-time shortens the time from an event occurring to a decision being made (Time-to-Insight), typically from hours to seconds or minutes. For a manager, this means the ability to react quickly to competitor actions or sudden demand changes, which can improve financial liquidity and the fit between the offer and customer needs.
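A rough back-of-the-envelope calculation shows the size of this effect. The sketch assumes that events arrive uniformly over the batch interval and that the job starts exactly when the interval ends; under those assumptions an event waits half an interval on average, plus the processing time. The function name and the example durations are hypothetical.

```python
def batch_time_to_insight(interval_s, processing_s):
    """Average and worst-case delay (in seconds) from event to result.

    Assumes events arrive uniformly over the interval and the job
    runs once, at the end of each interval.
    """
    average = interval_s / 2 + processing_s
    worst = interval_s + processing_s
    return average, worst


# Daily batch job that takes 30 minutes to run
daily_avg, daily_worst = batch_time_to_insight(24 * 3600, 30 * 60)

# Micro-batches of 10 seconds that take 1 second to process
micro_avg, micro_worst = batch_time_to_insight(10, 1)

print(f"Daily batch: average {daily_avg / 3600} h, worst case {daily_worst / 3600} h")
print(f"Micro-batch: average {micro_avg} s, worst case {micro_worst} s")
```

Under these assumptions the daily job delivers an insight after 12.5 hours on average, while the 10-second micro-batch delivers it after about 6 seconds.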