Lecture 3 — Machine Learning: batch vs online

Real-Time Data Analytics

Batch vs incremental learning, SGD, concept drift, anomaly detection and model explainability.
Note Duration: 1.5h

Goal: Understand the differences between batch (offline) and incremental (online) learning, the SGD algorithm, the concept drift problem, anomaly detection, and model explainability.


1 Two machine learning modes

In previous lectures we discussed batch and stream processing. The same distinction applies to machine learning.

In batch (offline) learning the model is trained on the entire dataset of historical data. Once trained, it is deployed to production, where it makes predictions on new data. When new data arrives, the model is retrained from scratch.

Show code
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

np.random.seed(42)

# Simulation: transaction classification (0 = legitimate, 1 = suspicious)
n = 1000
X = np.column_stack([
    np.random.uniform(10, 5000, n),       # amount
    np.random.uniform(0, 23, n),           # hour
    np.random.randint(1, 50, n)            # transactions per month
])
y = ((X[:, 0] > 3000) & (X[:, 1] > 22) | (X[:, 0] > 4000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Batch learning: train once on the entire dataset
model = LogisticRegression()
model.fit(X_train, y_train)
print(f"Batch accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
Batch accuracy: 0.940
Caution Batch learning limitations
  • Retraining on large datasets is expensive
  • The model doesn’t learn from new data between retraining runs
  • If patterns change (e.g., a new type of fraud), the model will be outdated

In online (incremental) learning the model learns continuously: each new observation (or small mini-batch) updates the model parameters. There is no need to retrain from scratch.

Show code
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_scaled, y, test_size=0.2)

# Online learning: model learns on successive mini-batches
model_online = SGDClassifier(loss='log_loss', random_state=42)

batch_size = 50
accuracies = []

for i in range(0, len(X_train_s), batch_size):
    X_batch = X_train_s[i:i+batch_size]
    y_batch = y_train_s[i:i+batch_size]

    model_online.partial_fit(X_batch, y_batch, classes=[0, 1])

    acc = accuracy_score(y_test_s, model_online.predict(X_test_s))
    accuracies.append(acc)

print(f"Online accuracy after {len(accuracies)} mini-batches: {accuracies[-1]:.3f}")
print(f"Accuracy progression: {[f'{a:.2f}' for a in accuracies[:5]]} ... {[f'{a:.2f}' for a in accuracies[-3:]]}")
Online accuracy after 16 mini-batches: 0.920
Accuracy progression: ['0.85', '0.92', '0.92', '0.96', '0.94'] ... ['0.88', '0.92', '0.92']

1.1 Comparison

Batch vs Online learning
Feature Batch (offline) Online (incremental)
Training data Entire dataset at once In portions (mini-batch)
Model update Retrain from scratch Incremental
Retraining cost High Low
Adaptation to changes Slow Fast
Stability High Risk of “forgetting”
Typical algorithms RandomForest, XGBoost SGD, Perceptron, online k-means
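The table mentions online k-means as a typical incremental algorithm. As a quick illustration, scikit-learn's MiniBatchKMeans supports the same partial_fit pattern used throughout this lecture; a minimal sketch on invented two-blob data:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Invented stream: two well-separated blobs, arriving shuffled
stream = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(200, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(200, 2)),
])
rng.shuffle(stream)

km = MiniBatchKMeans(n_clusters=2, random_state=0)

# Feed the stream in mini-batches of 40 observations
for i in range(0, len(stream), 40):
    km.partial_fit(stream[i:i+40])

print(km.cluster_centers_.round(1))  # centers near (0, 0) and (10, 10)
```

The model never sees the whole stream at once, yet the centers converge to the two blobs.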

2 Stochastic Gradient Descent (SGD)

SGD is the foundation of online learning. In classical gradient descent we compute the gradient on the entire dataset. In SGD — on a single observation (or small batch).

Imagine searching for the lowest point in the mountains in fog:

  • Gradient Descent — you compute the terrain slope from a map of the entire mountains.
  • SGD — you only look at your feet and take a step in the direction that looks steepest downhill.

Each SGD step is less precise, but you take many more of them and much faster.

2.1 Mathematically

Gradient Descent (batch): \[\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(\theta_t)\]

Stochastic Gradient Descent: \[\theta_{t+1} = \theta_t - \eta \cdot \nabla L_i(\theta_t)\]

where \(\eta\) is the learning rate, and \(i\) is a randomly selected observation.

Mini-batch SGD — a compromise: compute the gradient on a small sample (e.g., 32–256 observations): \[\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla L_i(\theta_t)\]
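To make the formulas concrete, here is a tiny numeric sketch for squared loss \(L_i(\theta) = (\theta x_i - y_i)^2\) on invented data: the batch update averages the per-observation gradients, while the SGD update uses just one of them.

```python
import numpy as np

# Invented toy data: y = 2x exactly, one parameter theta (slope only)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
theta, eta = 0.0, 0.1

# Gradient of a single-observation loss L_i(theta) = (theta*x_i - y_i)^2
def grad_i(theta, i):
    return 2 * (theta * x[i] - y[i]) * x[i]

# Batch GD step: average the gradient over all N observations
grad_full = np.mean([grad_i(theta, i) for i in range(len(x))])
theta_batch = theta - eta * grad_full

# SGD step: gradient of one observation only (here: the first)
theta_sgd = theta - eta * grad_i(theta, 0)

print(theta_batch, theta_sgd)  # theta_batch ≈ 1.867, theta_sgd = 0.4
```

One SGD step is cheaper (one gradient instead of N) but cruder; averaged over many random draws of \(i\), the stochastic gradient equals the batch gradient.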

Show code
import matplotlib.pyplot as plt

# Visualization: SGD vs Batch GD on a simple regression problem
np.random.seed(42)
X_reg = np.random.uniform(0, 10, 100)
y_reg = 2.5 * X_reg + 3 + np.random.normal(0, 2, 100)

# Batch GD
theta_batch = [0.0, 0.0]  # [slope, intercept]
lr = 0.001
batch_path = [tuple(theta_batch)]

for _ in range(50):
    pred = theta_batch[0] * X_reg + theta_batch[1]
    error = pred - y_reg
    theta_batch[0] -= lr * (2/len(X_reg)) * np.dot(error, X_reg)
    theta_batch[1] -= lr * (2/len(X_reg)) * np.sum(error)
    batch_path.append(tuple(theta_batch))

# SGD
theta_sgd = [0.0, 0.0]
sgd_path = [tuple(theta_sgd)]

for _ in range(50):
    i = np.random.randint(len(X_reg))
    pred_i = theta_sgd[0] * X_reg[i] + theta_sgd[1]
    error_i = pred_i - y_reg[i]
    theta_sgd[0] -= lr * 2 * error_i * X_reg[i]
    theta_sgd[1] -= lr * 2 * error_i
    sgd_path.append(tuple(theta_sgd))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

batch_slopes = [p[0] for p in batch_path]
sgd_slopes = [p[0] for p in sgd_path]

ax1.plot(batch_slopes, label='Batch GD', linewidth=2)
ax1.plot(sgd_slopes, label='SGD', linewidth=1, alpha=0.7)
ax1.axhline(y=2.5, color='red', linestyle='--', label='True value')
ax1.set_xlabel('Iteration')
ax1.set_ylabel('Slope estimate')
ax1.set_title('Parameter convergence')
ax1.legend()

ax2.scatter(X_reg, y_reg, alpha=0.5, s=15)
ax2.plot([0, 10], [theta_batch[1], theta_batch[0]*10 + theta_batch[1]], label='Batch GD', linewidth=2)
ax2.plot([0, 10], [theta_sgd[1], theta_sgd[0]*10 + theta_sgd[1]], label='SGD', linewidth=2)
ax2.set_title('Fit result')
ax2.legend()

plt.tight_layout()
plt.show()
Figure 1: SGD vs Batch GD — convergence and fit result

SGD is “noisy” — but that’s precisely its strength in online learning: each new observation immediately influences the model.


3 Concept drift — when the world changes

Warning Key challenge

The data distribution changes over time (concept drift). A model trained on last year’s data may be useless today.

3.1 Types of drift

Important Sudden

E.g., a pandemic changes shopping patterns overnight.

Note Gradual

E.g., customer preferences change slowly over months.

Tip Recurring

E.g., seasonal sales patterns.
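Drift can also be detected automatically: a common heuristic compares the model's recent error rate against its long-run error rate and raises an alarm when the gap grows too large (streaming libraries such as River ship more principled detectors, e.g. ADWIN). A minimal sketch of the windowed idea, with an invented window size and threshold:

```python
import random
from collections import deque

def detect_drift(errors, window=50, threshold=0.2):
    """Return the first index where the error rate over the last
    `window` observations exceeds the overall error rate by `threshold`."""
    recent = deque(maxlen=window)
    total_errors = 0
    for i, err in enumerate(errors):
        recent.append(err)
        total_errors += err
        if i + 1 >= 2 * window:  # wait until both estimates are stable
            overall_rate = total_errors / (i + 1)
            recent_rate = sum(recent) / len(recent)
            if recent_rate - overall_rate > threshold:
                return i  # drift suspected here
    return None

# Invented error stream: ~5% errors, jumping to ~60% after observation 500
random.seed(0)
errors = [random.random() < 0.05 for _ in range(500)] + \
         [random.random() < 0.60 for _ in range(200)]
print(detect_drift(errors))
```

On this stream the alarm fires shortly after observation 500, once enough post-drift errors have entered the window.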

Show code
# Simulation of concept drift: sudden change in pattern
np.random.seed(42)

# Phase 1: normal transactions (amount < 1000 = OK)
X_phase1 = np.random.uniform(10, 2000, 500).reshape(-1, 1)
y_phase1 = (X_phase1.ravel() > 1000).astype(int)

# Phase 2: after change — suspicion threshold drops to 500
X_phase2 = np.random.uniform(10, 2000, 500).reshape(-1, 1)
y_phase2 = (X_phase2.ravel() > 500).astype(int)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Batch model — trained on phase 1
model_batch = LogisticRegression()
model_batch.fit(scaler.fit_transform(X_phase1), y_phase1)
acc_batch_p2 = accuracy_score(y_phase2, model_batch.predict(scaler.transform(X_phase2)))

# Online model — adapts
model_sgd = SGDClassifier(loss='log_loss')
model_sgd.fit(scaler.transform(X_phase1), y_phase1)

# Online training on phase 2
for i in range(0, len(X_phase2), 10):
    X_b = scaler.transform(X_phase2[i:i+10])
    y_b = y_phase2[i:i+10]
    model_sgd.partial_fit(X_b, y_b)

acc_online_p2 = accuracy_score(y_phase2, model_sgd.predict(scaler.transform(X_phase2)))

print(f"After concept drift:")
print(f"  Batch model accuracy:  {acc_batch_p2:.3f}")
print(f"  Online model accuracy: {acc_online_p2:.3f}")
After concept drift:
  Batch model accuracy:  0.760
  Online model accuracy: 0.972
Show code
# Rolling accuracy simulation
from collections import deque

np.random.seed(42)
X_all = np.vstack([X_phase1, X_phase2])
y_all = np.concatenate([y_phase1, y_phase2])
X_all_s = scaler.fit_transform(X_all)

online = SGDClassifier(loss='log_loss')
batch = LogisticRegression()
batch.fit(X_all_s[:500], y_all[:500])

window = deque(maxlen=50)
online_acc, batch_acc = [], []

for i in range(len(X_all)):
    x_i = X_all_s[i:i+1]
    y_i = y_all[i:i+1]

    if i == 0:
        online.partial_fit(x_i, y_i, classes=[0, 1])
    else:
        pred_o = online.predict(x_i)[0]
        pred_b = batch.predict(x_i)[0]
        window.append((pred_o == y_i[0], pred_b == y_i[0]))
        online.partial_fit(x_i, y_i)

    if len(window) > 10:
        online_acc.append(np.mean([w[0] for w in window]))
        batch_acc.append(np.mean([w[1] for w in window]))

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(online_acc, label='Online (SGD)', color='#4CAF50', linewidth=2)
ax.plot(batch_acc, label='Batch (Logistic Regression)', color='#F44336', linewidth=2)
ax.axvline(x=490, color='gray', linestyle='--', alpha=0.5, label='Concept drift')
ax.fill_betweenx([0, 1], 0, 490, alpha=0.05, color='blue')
ax.fill_betweenx([0, 1], 490, len(online_acc), alpha=0.05, color='red')
ax.text(250, 0.55, 'Phase 1\n(old pattern)', ha='center', fontsize=10, alpha=0.5)
ax.text(700, 0.55, 'Phase 2\n(new pattern)', ha='center', fontsize=10, alpha=0.5)
ax.set_xlabel('Observation')
ax.set_ylabel('Rolling accuracy (50 obs.)')
ax.set_title('Concept drift: batch model stagnates, online adapts')
ax.set_ylim(0.5, 1.05)
ax.legend()
plt.tight_layout()
plt.show()
Figure 2: Concept drift: batch vs online model — accuracy over time

4 Anomaly detection

Anomaly detection is one of the most important applications of real-time data analytics. An anomaly (outlier) is an observation significantly distant from the rest of the data.

For a single variable we can flag anomalies using the interquartile range (IQR):

\[x_{\text{out}} < Q_1 - 1.5 \times IQR \quad \text{or} \quad x_{\text{out}} > Q_3 + 1.5 \times IQR\]

Show code
salaries = [40, 42, 45, 47, 50, 55, 60, 70, 90, 150]

Q1 = np.percentile(salaries, 25)
Q3 = np.percentile(salaries, 75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = [x for x in salaries if x < lower_bound or x > upper_bound]
print(f"Q1={Q1}, Q3={Q3}, IQR={IQR}")
print(f"Bounds: [{lower_bound:.1f}, {upper_bound:.1f}]")
print(f"Anomalies: {outliers}")

fig, ax = plt.subplots(figsize=(8, 2.5))
bp = ax.boxplot(salaries, vert=False, patch_artist=True,
                boxprops=dict(facecolor='#E3F2FD', edgecolor='#2196F3'),
                medianprops=dict(color='#F44336', linewidth=2),
                flierprops=dict(marker='o', markerfacecolor='#F44336', markersize=10))
ax.set_xlabel('Salary (thousands PLN)')
ax.set_title('Box plot — anomaly detection using IQR')
ax.axvspan(lower_bound, upper_bound, alpha=0.1, color='green', label='Normal range')
plt.tight_layout()
plt.show()
Q1=45.5, Q3=67.5, IQR=22.0
Bounds: [12.5, 100.5]
Anomalies: [150]

Isolation Forest is a tree-based algorithm (Liu et al., 2008). The key intuition: anomalies are easier to isolate, since random splits separate them from the rest of the data in fewer steps.

Show code
from sklearn.ensemble import IsolationForest

# Simulation: banking transactions (amount, weekly frequency)
data = np.array([
    [100, 5], [120, 6], [130, 5], [110, 4], [125, 5],
    [115, 5], [140, 7], [135, 6], [145, 5], [105, 4],
    [5000, 1],  # anomaly: large amount, rare
    [50, 30],   # anomaly: small amount, very frequent
])

clf = IsolationForest(contamination=0.15, random_state=42)
predictions = clf.fit_predict(data)

df = pd.DataFrame(data, columns=["Amount", "Transactions/week"])
df["Status"] = ["Anomaly" if p == -1 else "OK" for p in predictions]
print(df.to_string(index=False))

colors = ['#F44336' if s == 'Anomaly' else '#2196F3' for s in df['Status']]
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df['Amount'], df['Transactions/week'], c=colors, s=80, edgecolors='white', zorder=5)
for _, row in df[df['Status'] == 'Anomaly'].iterrows():
    ax.annotate('ANOMALY', (row['Amount'], row['Transactions/week']),
                textcoords="offset points", xytext=(10, 5), fontsize=9, color='#F44336', fontweight='bold')
ax.set_xlabel('Transaction amount')
ax.set_ylabel('Transactions / week')
ax.set_title('Isolation Forest — anomaly detection')
import matplotlib.patches as mpatches
ax.legend(handles=[mpatches.Patch(color='#2196F3', label='OK'),
                    mpatches.Patch(color='#F44336', label='Anomaly')])
plt.tight_layout()
plt.show()
 Amount  Transactions/week  Status
    100                  5      OK
    120                  6      OK
    130                  5      OK
    110                  4      OK
    125                  5      OK
    115                  5      OK
    140                  7      OK
    135                  6      OK
    145                  5      OK
    105                  4      OK
   5000                  1 Anomaly
     50                 30 Anomaly

In a real-time context we can’t analyze the entire dataset — we must detect anomalies on the fly, within a time window. In the labs we’ll build such a system with Kafka and Spark.
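As a preview of that lab, a rolling z-score over a sliding window is one simple way to flag anomalies on the fly: keep only the last W values and flag a new value lying more than k standard deviations from the window mean (the window size and k used here are invented defaults):

```python
import math
from collections import deque

class RollingZScoreDetector:
    """Flag values more than k std devs from the mean of the last W values."""
    def __init__(self, window=100, k=3.0):
        self.values = deque(maxlen=window)
        self.k = k

    def update(self, x):
        is_anomaly = False
        if len(self.values) >= 10:  # wait for a minimally filled window
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var)
            if std > 0 and abs(x - mean) > self.k * std:
                is_anomaly = True
        self.values.append(x)  # the window slides forward regardless
        return is_anomaly

det = RollingZScoreDetector(window=50, k=3.0)
stream = [100 + (i % 7) for i in range(200)]  # regular traffic: 100-106
stream[120] = 500                             # injected spike

flagged = [i for i, x in enumerate(stream) if det.update(x)]
print(flagged)  # → [120]
```

Note the trade-off: once the spike enters the window it inflates the standard deviation, temporarily making the detector less sensitive — exactly the kind of subtlety the windowed Kafka/Spark pipeline in the labs has to handle.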


5 Model explainability

Important Regulations require transparency

In regulated industries (banking, insurance, medicine) it’s not enough to say “the model rejected the loan application”. You need to explain why. Regulations such as the AI Act and GDPR require transparency in algorithmic decisions.

5.1 LIME (Local Interpretable Model-Agnostic Explanations)

LIME explains individual predictions of any model. It works by introducing small changes to the input data and observing how the result changes. Based on this, it builds a simple, interpretable local model (e.g., linear regression) that approximates the behavior of the original model in the neighborhood of the given observation.
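That mechanism can be sketched by hand — this is a simplified imitation of the idea, not the actual lime library: perturb the observation, query the black-box model, weight the perturbed points by proximity, and fit a weighted linear surrogate whose coefficients serve as the local explanation. The `explain_locally` helper and its parameters are invented for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(iris.data, iris.target)

def explain_locally(model, x, X_train, n_samples=500, kernel_width=1.0, seed=0):
    """LIME-style sketch: fit a proximity-weighted linear surrogate around x."""
    rng = np.random.default_rng(seed)
    # 1) Perturb x with noise scaled to each feature's spread
    noise = rng.normal(0, X_train.std(axis=0), size=(n_samples, x.shape[0]))
    Z = x + noise
    # 2) Query the black box: probability of the class it predicts for x
    target_class = model.predict(x.reshape(1, -1))[0]
    p = model.predict_proba(Z)[:, target_class]
    # 3) Weight perturbations by proximity to x (Gaussian kernel)
    dist = np.linalg.norm((Z - x) / X_train.std(axis=0), axis=1)
    w = np.exp(-(dist ** 2) / (kernel_width ** 2))
    # 4) Fit the interpretable local surrogate; its coefficients explain x
    surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
    return surrogate.coef_

x = iris.data[100]  # a virginica observation
coefs = explain_locally(rf, x, iris.data)
for name, c in zip(iris.feature_names, coefs):
    print(f"{name:20s} {c:+.3f}")
```

For this observation the petal features get the largest coefficients: locally, they are what drives the forest's prediction.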

Show code
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)

# Feature importance — simpler alternative to LIME
importances = pd.Series(rf.feature_importances_, index=iris.feature_names)
print("Feature importance (Random Forest):")
print(importances.sort_values(ascending=False).round(3))
print(f"\nIn the labs we'll install the LIME library and explore local explanations.")
Feature importance (Random Forest):
petal length (cm)    0.440
petal width (cm)     0.422
sepal length (cm)    0.108
sepal width (cm)     0.030
dtype: float64

In the labs we'll install the LIME library and explore local explanations.

6 Summary

Batch and incremental learning are two complementary approaches. In practice they are often combined: a base model trained in batch mode is gradually updated in online mode. SGD enables learning on a data stream, but requires attention to concept drift and stability.

Anomaly detection and model explainability are key applications of ML in real-time analytics — in the labs we’ll translate them into practical systems with Kafka and Spark.

Note Next lecture

Apache Kafka — architecture, producers, consumers, topics, partitions.

Tip Food for thought

Your bank trains a fraud detection model once a month (batch). What business risk does this approach carry? What would change if the model learned online?