Lecture 5: Machine Learning in Real Time

Duration: 1.5h

Goal: Understand the differences between offline and online learning, anomaly detection in real time, and algorithm explainability — from a business perspective.


Offline Learning vs Online Learning

In previous lectures we learned the difference between batch and stream processing. The same dichotomy applies to machine learning.

Offline Learning (Batch Learning)

The classical approach: collect historical data, train a model, deploy it — and the model doesn’t change until you retrain it on new data.

Code
from sklearn.linear_model import LinearRegression
import numpy as np

# Historical data — area (sqm) vs price (thousands PLN)
X = np.array([[30], [45], [55], [70], [85], [100], [120]])
y = np.array([250, 370, 430, 560, 680, 790, 950])

model = LinearRegression()
model.fit(X, y)

# Model is "frozen" — it doesn't learn from new data
new = np.array([[65], [90]])
print("Predictions:", model.predict(new).round(0))
Predictions: [520. 715.]

Problem: the world changes. Property prices rise, customer preferences evolve, new fraud patterns emerge. A model trained on last year’s data may be outdated.

Brute-force solution: retrain the model periodically (e.g., weekly) on new data. This works, but it’s expensive and delayed.
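The periodic-retrain pattern can be sketched in a few lines — here a full refit simulates three "weeks" of drifting prices (the data reuses the housing example above; the 2% weekly drift is illustrative):

```python
from sklearn.linear_model import LinearRegression
import numpy as np

def retrain(window_X, window_y):
    """Full refit on the most recent data — pays the whole training cost again."""
    model = LinearRegression()
    model.fit(window_X, window_y)
    return model

# Same housing data as above; prices drift upward ~2% per "week"
X = np.array([[30], [45], [55], [70], [85], [100], [120]], dtype=float)
base_y = np.array([250, 370, 430, 560, 680, 790, 950], dtype=float)

preds = []
for week in range(3):
    y = base_y * 1.02 ** week          # simulated market drift
    model = retrain(X, y)
    preds.append(model.predict([[75.0]])[0])
    print(f"week {week}: prediction for 75sqm = {preds[-1]:.0f}k PLN")
```

Between retrains the model is blind to whatever has changed — that is the "delayed" part of the trade-off.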

Online Learning (Incremental Learning)

An alternative approach: the model learns continuously, updating its parameters with each new observation (or small batch).

Key algorithm: Stochastic Gradient Descent (SGD) — instead of computing the gradient on the entire dataset, we update weights after each observation.
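For squared-error loss, SGD applies one small correction per observation \((x_i, y_i)\):

\[w \leftarrow w - \eta \, \nabla_w L(w; x_i, y_i)\]

where \(\eta\) is the learning rate (the eta0 parameter in the code below) and \(L\) is the loss on that single observation.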

Code
from sklearn.linear_model import SGDRegressor
import numpy as np

# Data arrives as a stream — one observation at a time
np.random.seed(42)
# Note: max_iter and warm_start only affect fit(); partial_fit() always performs one update
model = SGDRegressor(learning_rate='constant', eta0=0.0001)

# Simulated data stream
print("Online learning — model learns with each new observation:")
for i in range(50):
    X_new = np.array([[30 + i * 2]])
    y_new = np.array([250 + i * 15 + np.random.normal(0, 20)])
    model.partial_fit(X_new, y_new)

    if i % 10 == 0:
        pred = model.predict(np.array([[75]]))[0]
        print(f"  After {i+1} observations -> prediction for 75sqm: {pred:.0f}k PLN")
Online learning — model learns with each new observation:
  After 1 observations -> prediction for 75sqm: 59k PLN
  After 11 observations -> prediction for 75sqm: 509k PLN
  After 21 observations -> prediction for 75sqm: 592k PLN
  After 31 observations -> prediction for 75sqm: 590k PLN
  After 41 observations -> prediction for 75sqm: 593k PLN

The partial_fit() method is the heart of online learning — the model updates its parameters without needing to store the entire dataset.

When to Use Which?

| Feature | Offline learning | Online learning |
|---|---|---|
| Data | Historical, collected | Streaming, incremental |
| Model update | Periodic (retrain) | Continuous (partial_fit) |
| Compute cost | High, one-time | Low, per observation |
| Reaction to change | Delayed | Immediate |
| Risk | Stale model | Instability, concept drift |
| Use case | Reports, predictive models | Fraud detection, live recommendations |

In practice, a hybrid approach is often used: a base model trained offline, updated online with current data.
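A minimal sketch of that hybrid pattern, reusing SGDRegressor for both phases (the synthetic data and the drift magnitudes are illustrative):

```python
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np

np.random.seed(42)

# Phase 1 — offline: fit a base model on a batch of historical data
X_hist = np.random.uniform(30, 120, size=(500, 1))
y_hist = 8 * X_hist.ravel() + 10 + np.random.normal(0, 20, 500)

scaler = StandardScaler().fit(X_hist)   # feature scaling keeps SGD stable
model = SGDRegressor(random_state=42)
model.fit(scaler.transform(X_hist), y_hist)

# Phase 2 — online: keep updating the same model as observations stream in
for _ in range(200):
    x_new = np.random.uniform(30, 120, size=(1, 1))
    y_new = 8.5 * x_new.ravel() + 10 + np.random.normal(0, 20, 1)  # market has shifted
    model.partial_fit(scaler.transform(x_new), y_new)

pred75 = model.predict(scaler.transform([[75.0]]))[0]
print(f"Prediction for 75sqm: {pred75:.0f}k PLN")
```

The base model carries the bulk of the signal; partial_fit nudges it toward current conditions without a full retrain.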

Anomaly Detection

Anomaly detection is one of the most important applications of real-time analytics. An anomaly (outlier) is an observation that significantly deviates from the rest — a suspicious transaction, unusual sensor reading, unexpected traffic spike.

Statistical Method — IQR

The simplest method: an observation is an anomaly if it falls far outside the interquartile range of the data. Formally, a value is an outlier when:

\[x_{out} < Q_1 - 1.5 \times IQR \quad \text{or} \quad x_{out} > Q_3 + 1.5 \times IQR\]

where \(IQR = Q_3 - Q_1\) (interquartile range).

Code
import numpy as np
import matplotlib.pyplot as plt

# Credit card transactions (amounts in PLN)
transactions = [85, 92, 78, 110, 95, 88, 102, 91, 87, 99,
                105, 82, 97, 4500, 89, 94, 101, 86, 93, 8200]

Q1 = np.percentile(transactions, 25)
Q3 = np.percentile(transactions, 75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

anomalies = [x for x in transactions if x < lower_bound or x > upper_bound]
print(f"Q1={Q1}, Q3={Q3}, IQR={IQR}")
print(f"Bounds: [{lower_bound:.0f}, {upper_bound:.0f}]")
print(f"Anomalies: {anomalies}")

fig, ax = plt.subplots(figsize=(8, 2))
ax.boxplot(transactions, vert=False)
ax.set_xlabel('Transaction amount (PLN)')
ax.set_title('Anomaly detection with IQR')
plt.tight_layout()
plt.show()
Q1=87.75, Q3=101.25, IQR=13.5
Bounds: [68, 122]
Anomalies: [4500, 8200]

The IQR method is simple but works only for a single variable. For multidimensional data we need something more sophisticated.

Isolation Forest

Isolation Forest is a tree-based algorithm designed specifically for anomaly detection (Liu, Ting, Zhou, 2008). It works on a simple intuition: anomalies are easier to isolate than normal observations.

The algorithm randomly selects a feature and a split point. Outliers — being far from the rest — get isolated after fewer splits (closer to the tree root). Normal observations require many splits.

Code
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Transaction data: amount, weekly frequency, hours since last transaction
np.random.seed(42)
normal = np.column_stack([
    np.random.normal(200, 50, 50),
    np.random.normal(10, 3, 50),
    np.random.normal(24, 8, 50),
])

anomalies = np.array([
    [5000, 1, 720],
    [4200, 2, 480],
    [50, 45, 0.5],
])

data = np.vstack([normal, anomalies])

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(data)

df = pd.DataFrame(data, columns=["amount", "weekly_freq", "hours_since_last"])
df["anomaly"] = ["YES" if l == -1 else "no" for l in labels]

print("Detected anomalies:")
print(df[df["anomaly"] == "YES"].to_string(index=False))
Detected anomalies:
 amount  weekly_freq  hours_since_last anomaly
 5000.0          1.0             720.0     YES
 4200.0          2.0             480.0     YES
   50.0         45.0               0.5     YES

Isolation Forest advantages: fast on large datasets, handles multidimensional data, doesn’t require labeled data (unsupervised method).
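Beyond the -1/1 labels, the model also exposes a continuous anomaly score through decision_function (negative means flagged), which is useful for ranking alerts rather than just counting them. A small single-feature sketch:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

np.random.seed(42)
normal = np.random.normal(200, 50, size=(100, 1))   # typical transaction amounts
data = np.vstack([normal, [[5000.0]]])              # one extreme outlier

model = IsolationForest(contamination=0.01, random_state=42).fit(data)
scores = model.decision_function(data)              # below 0 = flagged as anomaly

print(f"typical amount score : {scores[0]:.3f}")
print(f"5000 PLN outlier score: {scores[-1]:.3f}")
```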

Anomaly Detection in a Stream

In a real-time context, anomaly detection works in a loop:

  1. A Kafka consumer receives a new transaction.
  2. A model (e.g., Isolation Forest, trained offline) evaluates whether it’s an anomaly.
  3. If so — generates an alert.
  4. Optionally: the model updates itself on new data (online learning).
Code
from kafka import KafkaConsumer
from sklearn.ensemble import IsolationForest
import json, numpy as np

# Assume model is pre-trained
model = IsolationForest(contamination=0.05)
# model.fit(historical_data)

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

for msg in consumer:
    t = msg.value
    features = np.array([[t["amount"], t["weekly_freq"], t["hours_since_last"]]])
    pred = model.predict(features)[0]

    if pred == -1:
        print(f"ALERT: Suspicious transaction {t['id']}: {t['amount']} PLN")
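IsolationForest has no partial_fit, so the optional step 4 is usually implemented as a periodic refit on a sliding window of recent observations. A minimal self-contained sketch (the window size, refit interval, and simulated stream are illustrative):

```python
from collections import deque
from sklearn.ensemble import IsolationForest
import numpy as np

np.random.seed(42)

WINDOW = 500       # keep only the most recent observations
REFIT_EVERY = 100  # refit the forest after every 100 new ones

window = deque(maxlen=WINDOW)
model = None
seen = 0

def on_transaction(features):
    """Score one observation; periodically refit on the sliding window."""
    global model, seen
    window.append(features)
    seen += 1
    if seen < REFIT_EVERY:
        return 1                      # warm-up: too little data to judge
    if model is None or seen % REFIT_EVERY == 0:
        model = IsolationForest(contamination=0.05, random_state=42)
        model.fit(np.array(window))   # forget old behaviour, learn the recent one
    return model.predict([features])[0]  # -1 = anomaly

# Simulated stream: normal amounts with an occasional 5000 PLN spike
alerts = 0
for i in range(300):
    amount = 5000.0 if i % 150 == 149 else float(np.random.normal(200, 50))
    if on_transaction([amount]) == -1:
        alerts += 1
print("alerts raised:", alerts)
```

The window bounds memory and lets the detector track drifting "normal" behaviour at the cost of periodic retraining.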

Algorithm Explainability

An ML model may be highly accurate, but if we can’t explain why it made a particular decision, its usefulness in many industries is limited.

This is especially important in regulated sectors:

  • Banking — a bank must explain to a customer why their loan was denied.
  • Healthcare — a diagnostic algorithm must indicate what its recommendation is based on.
  • Insurance — a claim denial decision must be justified.
  • Regulation — the EU AI Act requires transparency for high-risk AI systems.

LIME — Local Interpretable Model-Agnostic Explanations

LIME (Ribeiro et al., 2016) explains individual predictions of any ML model. It works as follows:

  1. Takes a specific observation we want explained.
  2. Generates perturbations — slightly modified versions of that observation.
  3. Checks how the model reacts to those changes.
  4. Builds a simple, interpretable local model (e.g., linear regression) that approximates the complex model’s behavior in the neighborhood of that observation.
Code
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Train model on Iris data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Pick one observation
i = 7
obs = X_test[i]
pred = model.predict([obs])[0]
proba = model.predict_proba([obs])[0]

print(f"Observation: {obs}")
print(f"Prediction: {iris.target_names[pred]} (probabilities: {proba.round(3)})")

# Feature importance for this model
importances = model.feature_importances_
print(f"\nFeature importance (global):")
for name, imp in sorted(zip(iris.feature_names, importances), key=lambda x: -x[1]):
    print(f"  {name}: {imp:.3f}")
Observation: [6.9 3.1 5.1 2.3]
Prediction: virginica (probabilities: [0.   0.08 0.92])

Feature importance (global):
  petal length (cm): 0.440
  petal width (cm): 0.422
  sepal length (cm): 0.108
  sepal width (cm): 0.030
Code
# To use LIME (requires: pip install lime):
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    discretize_continuous=True
)

exp = explainer.explain_instance(X_test[i], model.predict_proba)
exp.show_in_notebook()

LIME will show which features (e.g., petal length > 4.5, petal width > 1.6) influenced the classification of a given flower as Virginica. The same approach works for any model — from logistic regression to deep neural networks.

Course Summary — What’s Next?

Over 5 lectures we followed this path:

  1. L1: What is real-time analytics and when is it needed (batch vs NRT vs RT).
  2. L2: Evolution of data and processing models (OLTP -> OLAP -> Data Lake -> Big Data).
  3. L3: Stream processing — time, watermarking, time windows.
  4. L4: Apache Kafka and microservices — real-time system architecture.
  5. L5: Machine learning in streaming context — online learning, anomalies, explainability.

In the labs, you’ll translate this knowledge into practice: set up an environment in Docker, write Kafka producers and consumers, build a Spark Streaming + Kafka pipeline, deploy an ML model as a FastAPI microservice, and build a complete real-time anomaly detection system.

Business Impact

Anomaly detection is the foundation of modern risk management — from detecting financial fraud to predicting machine failures (predictive maintenance). Algorithm explainability is a necessity in regulated sectors — understanding “why the system said NO” protects a company from reputational and legal damage and ensures compliance with regulations (AI Act, GDPR). Companies that combine real-time analytics with interpretable ML gain a competitive advantage not only in speed, but also in customer trust.