Lecture 5

Algorithms

Before designing a solution to a business problem, it is essential to assess the complexity of the problem itself.

Algorithm Classification

  1. Algorithms Processing Large Amounts of Data

Processing vast datasets requires an appropriate approach to organizing and analyzing them. When the amount of data exceeds the available memory of a computing unit, iterative processing methods are often used.

🔹 Example: Recommendation Systems in E-commerce (e.g., Amazon, Netflix)

  • Analyzes large datasets about users, their purchase history, and viewed content.
  • Processes data iteratively (e.g., stream processing in Apache Spark).
  • Uses collaborative filtering or graph-based algorithms to predict user preferences (a minimal sketch follows below).
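
A minimal sketch of the item-based collaborative filtering idea, assuming a tiny in-memory user-item rating matrix (production systems work with distributed data, e.g., in Spark MLlib):

Code
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items); 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Cosine similarity between item columns
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / (np.outer(norms, norms) + 1e-9)

# Predict user 0's score for item 2 as a similarity-weighted average of their own ratings
user = ratings[0]
rated = user > 0
prediction = item_sim[2, rated] @ user[rated] / (item_sim[2, rated].sum() + 1e-9)
print(f"Predicted rating of user 0 for item 2: {prediction:.2f}")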

🔹 Other Applications:

  • Real-time server log analysis (e.g., DDoS attack detection).
  • IoT network monitoring (e.g., sensor data analysis in smart cities).

  2. Algorithms Performing Intensive Computations

These require significant computing power but typically do not operate on large datasets. An example is an algorithm searching for large prime numbers. Parallel computation techniques are often used to optimize performance.

🔹 Example: Cryptography and Finding Large Prime Numbers (e.g., RSA)

  • Generates large prime numbers essential for RSA encryption.
  • Requires intensive computations but does not operate on vast datasets.
  • Often employs parallel methods, such as the Miller-Rabin probabilistic algorithm for primality testing (see the sketch after this list).
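
As an illustration of the probabilistic approach mentioned in the list above, here is a minimal sketch of the Miller-Rabin test (a teaching version; real RSA key generation relies on vetted cryptographic libraries and larger candidates):

Code
import random

def is_probably_prime(n, rounds=20):
    """Miller-Rabin probabilistic primality test (teaching version)."""
    if n < 2:
        return False
    # Quick checks against a few small primes
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2^r with d odd
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    # Each round with a random base either proves n composite or raises confidence
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True

# Example: search for a 256-bit probable prime
candidate = random.getrandbits(256) | 1  # force the candidate to be odd
while not is_probably_prime(candidate):
    candidate += 2
print(candidate)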

🔹 Other Applications:

  • Physical simulations (e.g., weather forecasting, climate models).
  • Optimization algorithms (e.g., solving the traveling salesman problem).

  3. Algorithms Processing Large Data and Performing Intensive Computations

These combine the requirements of the previous types, demanding both high computational power and handling of large datasets. An example is sentiment analysis in live video streams.

🔹 Example: Sentiment Analysis in Live Video Streams (e.g., YouTube, Twitch)

  • Analyzes both text (chat) and video/audio in real time.
  • Requires both significant computational resources (NLP and CV processing) and large-scale data handling.
  • Uses Transformer models (e.g., BERT) for text analysis and CNN/RNN architectures for image and audio processing (a small text-analysis sketch follows below).
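
A small sketch of the text-analysis part, assuming the Hugging Face transformers library is installed (its default English sentiment model is downloaded on first use); the video and audio parts would additionally require CV/speech models and streaming infrastructure:

Code
from transformers import pipeline

# Load a default sentiment-analysis model (BERT-family); downloaded on first use
sentiment = pipeline("sentiment-analysis")

# Classify a few chat messages from a live stream
chat_messages = [
    "This stream is amazing, best gameplay I've seen all week!",
    "Terrible audio quality, I can barely hear anything.",
]
for msg, result in zip(chat_messages, sentiment(chat_messages)):
    print(f"{result['label']:>8}  {result['score']:.2f}  {msg}")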

🔹 Other Applications:

  • Autonomous vehicles (real-time image analysis and decision-making).
  • Anomaly detection in massive financial datasets (e.g., fraud detection in banking).

Data Dimension

To determine the dimensionality of a problem’s data, it is not enough to consider just the amount of storage required. Three main aspects are crucial:

  1. Input Size – Expected volume of data to be processed.
  2. Growth Rate – The rate at which new data is generated during algorithm execution.
  3. Structural Diversity – The types of data that the algorithm must handle.

Computational Dimension

This dimension concerns processing resources and computational power. For example, deep learning (DL) algorithms demand substantial compute, so a parallelized architecture based on GPUs or TPUs is needed to speed up the computations significantly (a small sketch follows below).
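
A minimal sketch, assuming PyTorch is installed, of how a computation is placed on a GPU when one is available (the same code falls back to the CPU otherwise):

Code
import torch

# Use a CUDA GPU if available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A large matrix multiplication benefits strongly from a parallel accelerator
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b
print(f"Matrix product computed on: {device}")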

Algorithm Explainability

Modeling is often used in critical situations, such as software for administering medications. In such cases, explaining the reasons behind each algorithmic decision is crucial to ensure that the outcomes are error-free and unbiased.

The ability of an algorithm to indicate the mechanisms generating its results is called explainability. Ethical analysis is a standard part of the algorithm validation process.

Achieving high explainability is particularly challenging for machine learning (ML) and deep learning (DL) algorithms. For instance, banks using algorithms for credit decision-making must ensure transparency and justify their decisions.

One method for improving algorithm explainability is LIME (Local Interpretable Model-Agnostic Explanations), published in 2016. This method introduces small changes to input data and analyzes their impact on the output, allowing the identification of local decision-making rules within the model.

Code
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# Load the Iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create a LIME explainer object for interpreting the model
explainer = LimeTabularExplainer(
    X_train,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    discretize_continuous=True,
)

# Explain a random example from the test set
i = np.random.randint(0, len(X_test))  # index of a random test example
exp = explainer.explain_instance(X_test[i], model.predict_proba)

# Display the explanation
exp.show_in_notebook()

How Does This Code Work?

  1. Loading Data and Training the Model
  • Uses the Iris dataset, containing 150 examples of flowers from three species:
  • Setosa
  • Versicolor
  • Virginica
  • The RandomForestClassifier model is trained on this data.
  2. Creating an Interpretable Model Using LIME
  • LIME generates local explanations, interpreting the model for individual predictions.
  • A random test example is selected.
  3. Exploring the Outcome for a Single Example
  • LIME slightly modifies input values and observes how the prediction changes.
  • It creates a “local” linear model that shows which features had the most influence on the decision.

Let’s assume we pick a sample flower and the model classifies it as Virginica.

Interpretation of Results:

  1. Key Features Affecting Model Decision:
  • Petal length: The most significant factor (e.g., a longer petal suggests Virginica).
  • Petal width: Also a crucial factor (e.g., above a certain threshold indicates Virginica).
  • Sepal length: A less significant but still relevant factor.
  • Sepal width: Usually the least important feature.
  2. Visualization of Results:
  • LIME generates a bar chart showing the impact of each feature on the classification (see the snippet after this list for a text-only view of the same weights).
  • The chart highlights which features increased or decreased the probability of a specific classification.
  3. What Does This Mean?
  • If the model predicts Virginica with high probability, key features (e.g., long petals) strongly indicate this species.
  • If the features had mixed influences, it suggests the model had difficulty classifying the instance (e.g., petal width was ambiguous).
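
To read the same information programmatically (outside a notebook), a small addition to the LIME code above can print the per-feature weights; this assumes the explanation object's as_list method, which returns (feature rule, weight) pairs:

Code
# Print the per-feature contributions for the explained prediction
for feature_rule, weight in exp.as_list():
    direction = "supports" if weight > 0 else "opposes"
    print(f"{feature_rule:40s} {weight:+.3f}  ({direction} the predicted class)")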

Anomaly Detection

An outlier is an observation (a row in a dataset) that is significantly different from the other elements in the sample. This means that the relationship between independent and dependent variables for this observation may differ from other cases.

For single variables, outliers can be identified using a box plot, which is based on quartiles:

  • The first quartile (Q1) and the third quartile (Q3) define the edges of the box,
  • The second quartile (the median) is marked inside the box.

Outliers satisfy the condition:

\[ x_{out} < Q_1 - 1.5 \cdot IQR \quad \text{or} \quad x_{out} > Q_3 + 1.5 \cdot IQR \]

where \( IQR = Q_3 - Q_1 \).

An example of an outlier could be a Formula 1 car – in terms of speed, it is an anomaly among regular cars.

Use of Anomaly Detection

Anomaly detection has a wide range of applications, such as:

  • Finance – detecting fraudulent transactions in banking data analysis,
  • Cybersecurity – identifying intruders in a network based on user behavior,
  • Medicine – monitoring health parameters and detecting abnormalities,
  • Industry – detecting faulty components through image analysis.

Anomaly Detection Methods

1. Supervised Methods

Used when labeled data is available (e.g., fraud cases in transactions):

  • Neural networks,
  • K-Nearest Neighbors algorithm (KNN),
  • Bayesian networks.

2. Unsupervised Methods

Assumes that most data is correct and that anomalies make up a small percentage of cases:

  • K-Means clustering (a minimal sketch follows below),
  • Autoencoders in neural networks,
  • Statistical tests.
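
A minimal sketch of the clustering idea, assuming scikit-learn: points that lie far from their nearest cluster center are flagged as anomalies (the 98th-percentile threshold is chosen purely for illustration):

Code
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Mostly "normal" 2-D points plus one obvious outlier
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Distance of each point to its assigned cluster center
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the most distant points (top 2%) as anomalies
threshold = np.quantile(distances, 0.98)
print("Anomalous points:\n", X[distances > threshold])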

Classical Method – Detection Based on Probability

To determine whether a given observation is an anomaly, we can use its probability \(p(x)\):

  • If \(p(x) < \epsilon\), we consider the value an outlier.
  • In practice, we assume the data follows a normal distribution \(N(\mu, \sigma)\).
  • We estimate the parameters \(\mu\) (mean) and \(\sigma^2\) (variance) from a sample.
  • Then, for each value, we calculate the probability density of its occurrence and compare it with \(\epsilon\) (see the sketch below).
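
A minimal sketch of this classical approach on illustrative one-dimensional data, assuming scipy is available (the threshold \(\epsilon\) is chosen purely for demonstration):

Code
import numpy as np
from scipy.stats import norm

# Illustrative sample: server response times in milliseconds, with one abnormal value
data = np.array([200, 210, 195, 205, 215, 198, 202, 208, 600], dtype=float)

# Estimate the parameters of the assumed normal distribution N(mu, sigma)
mu, sigma = data.mean(), data.std()

# Probability density of each observation under the fitted distribution
p = norm.pdf(data, loc=mu, scale=sigma)

# Flag observations whose density falls below epsilon
epsilon = 0.001
print("Anomalies:", data[p < epsilon])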

Example: Salary Analysis in a Company

We detect whether there are individuals in the company whose salaries significantly deviate from the average.

Code
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

salaries = [40, 42, 45, 47, 50, 55, 60, 70, 90, 150]  # 150 is an outlier

Q1 = np.percentile(salaries, 25)
Q3 = np.percentile(salaries, 75)
IQR = Q3 - Q1

outlier_threshold_low = Q1 - 1.5 * IQR
outlier_threshold_high = Q3 + 1.5 * IQR

outliers = [x for x in salaries if x < outlier_threshold_low or x > outlier_threshold_high]

print(f"Outliers: {outliers}")

sns.boxplot(salaries)
plt.title("Box plot")
plt.show()
Outliers: [150]

Result: The box plot shows that the salary of 150 is an anomaly.

Anomaly Detection Using Isolation Forest

Isolation Forest is an algorithm based on decision trees, proposed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008. It identifies anomalies by isolating outliers during the data partitioning process:

  • it randomly selects a feature and a split value,
  • outliers are isolated more quickly (closer to the root of the tree),
  • the result is aggregated across multiple trees.

Its advantages include low computational requirements and effectiveness in analyzing high-dimensional data.

Anomaly Detection Methods in sklearn

Example: Detecting Credit Card Fraud

A bank analyzes credit card transactions and detects those that may be unauthorized.

Code
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Sample transaction data (amount, number of transactions per week)
data = np.array([
    [100, 5], [120, 6], [130, 5], [110, 4], [5000, 1],  # Outlier (large amount, rare pattern)
    [125, 5], [115, 5], [140, 7], [135, 6], [145, 5]
])

# Isolation Forest model
clf = IsolationForest(contamination=0.1)  # treat 10% of transactions as anomalies
clf.fit(data)

# Prediction (1 = normal transaction, -1 = fraud)
predictions = clf.predict(data)
df = pd.DataFrame(data, columns=["Amount", "Transactions"])
df["Anomaly"] = predictions

print(df)
   Amount  Transactions  Anomaly
0     100             5        1
1     120             6        1
2     130             5        1
3     110             4        1
4    5000             1       -1
5     125             5        1
6     115             5        1
7     140             7        1
8     135             6        1
9     145             5        1

Result: The 5000 PLN transaction is detected as an anomaly.