Goal: Understand the differences between offline and online learning, anomaly detection in real time, and algorithm explainability — from a business perspective.
Offline Learning vs Online Learning
In previous lectures we learned the difference between batch and stream processing. The same dichotomy applies to machine learning.
Offline Learning (Batch Learning)
The classical approach: collect historical data, train a model, deploy it — and the model doesn’t change until you retrain it on new data.
Code
```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Historical data — area (sqm) vs price (thousands PLN)
X = np.array([[30], [45], [55], [70], [85], [100], [120]])
y = np.array([250, 370, 430, 560, 680, 790, 950])

model = LinearRegression()
model.fit(X, y)

# Model is "frozen" — it doesn't learn from new data
new = np.array([[65], [90]])
print("Predictions:", model.predict(new).round(0))
```
Predictions: [520. 715.]
Problem: the world changes. Property prices rise, customer preferences evolve, new fraud patterns emerge. A model trained on last year’s data may be outdated.
Brute-force solution: retrain the model periodically (e.g., weekly) on new data. This works, but it’s expensive and delayed.
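The periodic-retrain pattern can be sketched as a sliding-window loop. This is an illustration only: the data is synthetic, and the weekly price drift and window size are made-up numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(1)
window = []  # keep only the most recent observations

model = None
for week in range(4):
    # A week's worth of new (area, price) data; price per sqm drifts upward
    X_week = np.random.uniform(30, 120, size=(20, 1))
    y_week = (8 + week) * X_week[:, 0] + np.random.normal(0, 20, 20)

    window.extend(zip(X_week[:, 0], y_week))
    window = window[-60:]  # sliding window: last ~3 weeks of data

    # Full retrain on the window — simple, but repeated every period
    X = np.array([[x] for x, _ in window])
    y = np.array([p for _, p in window])
    model = LinearRegression().fit(X, y)
    print(f"week {week}: estimated price/sqm = {model.coef_[0]:.1f}")
```

The fitted slope tracks the drift, but only as fast as the retraining schedule allows, which is exactly the delay mentioned above.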
Online Learning (Incremental Learning)
An alternative approach: the model learns continuously, updating its parameters with each new observation (or small batch).
Key algorithm: Stochastic Gradient Descent (SGD) — instead of computing the gradient on the entire dataset, we update weights after each observation.
Code
```python
from sklearn.linear_model import SGDRegressor
import numpy as np

# Data arrives as a stream — one observation at a time
np.random.seed(42)
model = SGDRegressor(max_iter=1, warm_start=True,
                     learning_rate='constant', eta0=0.0001)

# Simulated data stream
print("Online learning — model learns with each new observation:")
for i in range(50):
    X_new = np.array([[30 + i * 2]])
    y_new = np.array([250 + i * 15 + np.random.normal(0, 20)])
    model.partial_fit(X_new, y_new)
    if i % 10 == 0:
        pred = model.predict(np.array([[75]]))[0]
        print(f"  After {i+1} observations -> prediction for 75sqm: {pred:.0f}k PLN")
```
Online learning — model learns with each new observation:
After 1 observations -> prediction for 75sqm: 59k PLN
After 11 observations -> prediction for 75sqm: 509k PLN
After 21 observations -> prediction for 75sqm: 592k PLN
After 31 observations -> prediction for 75sqm: 590k PLN
After 41 observations -> prediction for 75sqm: 593k PLN
The partial_fit() method is the heart of online learning — the model updates its parameters without needing to store the entire dataset.
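partial_fit() also accepts mini-batches, which is how chunked streams are usually consumed. One scikit-learn detail worth knowing: for classifiers, the full set of classes must be declared on the first call, since later chunks may not contain every class. A minimal sketch with synthetic chunks:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

np.random.seed(0)
model = SGDClassifier(random_state=0)

# The stream arrives in chunks of 50; classes must be declared up front
classes = np.array([0, 1])
for chunk in range(20):
    X = np.random.normal(0, 1, size=(50, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # simple linear boundary
    model.partial_fit(X, y, classes=classes)

# Evaluate on fresh data — no chunk was ever stored
X_test = np.random.normal(0, 1, size=(500, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
print("accuracy:", (model.predict(X_test) == y_test).mean())
```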
When to Use Which?
| Feature | Offline learning | Online learning |
| --- | --- | --- |
| Data | Historical, collected | Streaming, incremental |
| Model update | Periodic (retrain) | Continuous (partial_fit) |
| Compute cost | High, one-time | Low, per observation |
| Reaction to change | Delayed | Immediate |
| Risk | Stale model | Instability, concept drift |
| Use case | Reports, predictive models | Fraud detection, live recommendations |
In practice, a hybrid approach is often used: a base model trained offline, updated online with current data.
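A minimal sketch of that hybrid pattern, with synthetic data (the price formula and stream sizes are made up): fit a base SGDRegressor offline on a historical batch, then keep calling partial_fit on the same model as new observations arrive.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

np.random.seed(0)

# Offline phase: train a base model on a historical batch
X_hist = np.random.uniform(30, 120, size=(500, 1))
y_hist = 8 * X_hist[:, 0] + 10 + np.random.normal(0, 20, 500)

scaler = StandardScaler().fit(X_hist)  # feature scaling matters a lot for SGD
model = SGDRegressor(random_state=0)
model.fit(scaler.transform(X_hist), y_hist)

# Online phase: the same model keeps learning from the stream
for _ in range(200):
    X_new = np.random.uniform(30, 120, size=(1, 1))
    y_new = 8 * X_new[:, 0] + 10 + np.random.normal(0, 20, 1)
    model.partial_fit(scaler.transform(X_new), y_new)

# The model now reflects both historical and fresh data
print("prediction for 75 sqm:", model.predict(scaler.transform([[75]])).round(0))
```

The offline fit gives the model a sensible starting point, so the online updates only need to track drift rather than learn from scratch.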
Anomaly Detection
Anomaly detection is one of the most important applications of real-time analytics. An anomaly (outlier) is an observation that significantly deviates from the rest — a suspicious transaction, unusual sensor reading, unexpected traffic spike.
Statistical Method — IQR
The simplest method: flag an observation as an anomaly if it lies far outside the interquartile range. Formally, a value x is an outlier when x < Q1 - 1.5·IQR or x > Q3 + 1.5·IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1.
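The IQR rule fits in a few lines of NumPy. A minimal sketch on synthetic transaction amounts with two injected outliers:

```python
import numpy as np

np.random.seed(42)
# 100 typical amounts around 200 PLN, plus two injected outliers
amounts = np.append(np.random.normal(200, 50, 100), [950, 1200])

q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = amounts[(amounts < lower) | (amounts > upper)]
print(f"Bounds: [{lower:.0f}, {upper:.0f}]")
print("Outliers:", outliers.round(0))
```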
The IQR method is simple but works only for a single variable. For multidimensional data we need something more sophisticated.
Isolation Forest
Isolation Forest is a tree-based algorithm designed specifically for anomaly detection (Liu, Ting, Zhou, 2008). It works on a simple intuition: anomalies are easier to isolate than normal observations.
The algorithm randomly selects a feature and a split point. Outliers — being far from the rest — get isolated after fewer splits (closer to the tree root). Normal observations require many splits.
Code
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Transaction data: amount, weekly frequency, hours since last transaction
np.random.seed(42)
normal = np.column_stack([
    np.random.normal(200, 50, 50),
    np.random.normal(10, 3, 50),
    np.random.normal(24, 8, 50),
])
anomalies = np.array([
    [5000, 1, 720],
    [4200, 2, 480],
    [50, 45, 0.5],
])
data = np.vstack([normal, anomalies])

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(data)

df = pd.DataFrame(data, columns=["amount", "weekly_freq", "hours_since_last"])
df["anomaly"] = ["YES" if l == -1 else "no" for l in labels]
print("Detected anomalies:")
print(df[df["anomaly"] == "YES"].to_string(index=False))
```
Isolation Forest advantages: fast on large datasets, handles multidimensional data, doesn’t require labeled data (unsupervised method).
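The path-length intuition can also be inspected directly: score_samples returns an anomaly score per observation (lower means more anomalous, i.e. a shorter isolation path). A sketch with synthetic amounts and one injected outlier:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

np.random.seed(0)
normal = np.random.normal(200, 50, size=(200, 1))  # typical transaction amounts
data = np.vstack([normal, [[5000]]])               # one obvious outlier

model = IsolationForest(random_state=0).fit(data)
scores = model.score_samples(data)  # lower score = isolated after fewer splits

print("median score (normal points):", round(float(np.median(scores[:-1])), 3))
print("score of the outlier:       ", round(float(scores[-1]), 3))
```

This is useful in practice when you want a ranked list of suspicious cases rather than a hard yes/no label.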
Anomaly Detection in a Stream
In a real-time context, anomaly detection works in a loop:
A Kafka consumer receives a new transaction.
A model (e.g., Isolation Forest, trained offline) evaluates whether it’s an anomaly.
If so — generates an alert.
Optionally: the model updates itself on new data (online learning).
```python
from kafka import KafkaConsumer
from sklearn.ensemble import IsolationForest
import json
import numpy as np

# Assume the model is pre-trained
model = IsolationForest(contamination=0.05)
# model.fit(historical_data)

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda x: json.loads(x.decode("utf-8")),
)

for msg in consumer:
    t = msg.value
    features = np.array([[t["amount"], t["weekly_freq"], t["hours_since_last"]]])
    pred = model.predict(features)[0]
    if pred == -1:
        print(f"ALERT: Suspicious transaction {t['id']}: {t['amount']} PLN")
```
Algorithm Explainability
An ML model may be highly accurate, but if we can’t explain why it made a particular decision, its usefulness in many industries is limited.
This is especially important in regulated sectors:
Banking — a bank must explain to a customer why their loan was denied.
Healthcare — a diagnostic algorithm must indicate what its recommendation is based on.
Insurance — a claim denial decision must be justified.
Regulation — the EU AI Act requires transparency for high-risk AI systems.
LIME — Local Interpretable Model-Agnostic Explanations
LIME (Ribeiro et al., 2016) explains individual predictions of any ML model. It works as follows:
Takes a specific observation we want explained.
Generates perturbations — slightly modified versions of that observation.
Checks how the model reacts to those changes.
Builds a simple, interpretable local model (e.g., linear regression) that approximates the complex model’s behavior in the neighborhood of that observation.
Code
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Train a model on the Iris data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Pick one observation
i = 7
obs = X_test[i]
pred = model.predict([obs])[0]
proba = model.predict_proba([obs])[0]
print(f"Observation: {obs}")
print(f"Prediction: {iris.target_names[pred]} (probabilities: {proba.round(3)})")

# Global feature importance for this model
importances = model.feature_importances_
print("\nFeature importance (global):")
for name, imp in sorted(zip(iris.feature_names, importances), key=lambda x: -x[1]):
    print(f"  {name}: {imp:.3f}")
```
```python
# To use LIME (requires: pip install lime):
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    discretize_continuous=True,
)
exp = explainer.explain_instance(X_test[i], model.predict_proba)
exp.show_in_notebook()
```
LIME will show which features (e.g., petal length > 4.5, petal width > 1.6) influenced the classification of a given flower as Virginica. The same approach works for any model — from logistic regression to deep neural networks.
Course Summary — What’s Next?
Over 5 lectures we followed this path:
L1: What is real-time analytics and when is it needed (batch vs NRT vs RT).
L2: Evolution of data and processing models (OLTP -> OLAP -> Data Lake -> Big Data).
L3: Stream processing — time, watermarking, time windows.
L4: Apache Kafka and microservices — real-time system architecture.
In the labs, you’ll translate this knowledge into practice: set up an environment in Docker, write Kafka producers and consumers, build a Spark Streaming + Kafka pipeline, deploy an ML model as a FastAPI microservice, and build a complete real-time anomaly detection system.
Business Impact
Anomaly detection is the foundation of modern risk management — from detecting financial fraud to predicting machine failures (predictive maintenance). Algorithm explainability is a necessity in regulated sectors — understanding “why the system said NO” protects a company from reputational and legal damage and ensures compliance with regulations (AI Act, GDPR). Companies that combine real-time analytics with interpretable ML gain a competitive advantage not only in speed, but also in customer trust.