Before designing a solution to a business problem, it is essential to consider the complexity of the issue.
Algorithm Classification
Algorithms Processing Large Amounts of Data
Processing vast datasets requires an appropriate approach to organizing and analyzing them. When the amount of data exceeds the available memory of a computing unit, iterative processing methods are often used.
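For instance, when the data does not fit into memory, it can be read and aggregated iteratively in fixed-size chunks. Below is a minimal sketch of this idea using pandas; the file name `events.csv` and the `user_id`/`clicks` columns are hypothetical:

```python
import pandas as pd

# Hypothetical log file that is too large to load into memory at once.
CSV_PATH = "events.csv"

total_rows = 0
clicks_per_user = {}

# Read the file iteratively in chunks of 100,000 rows instead of loading it whole.
for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
    total_rows += len(chunk)
    # Aggregate partial results; only the running totals are kept in memory.
    counts = chunk.groupby("user_id")["clicks"].sum()
    for user_id, clicks in counts.items():
        clicks_per_user[user_id] = clicks_per_user.get(user_id, 0) + clicks

print(f"Processed {total_rows} rows for {len(clicks_per_user)} users")
```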
🔹 Example: Recommendation Systems in E-commerce (e.g., Amazon, Netflix)
Analyzes large datasets about users, their purchase history, and viewed content.
Processes data iteratively (e.g., stream processing in Apache Spark).
Uses collaborative filtering or graph-based algorithms to predict user preferences.
🔹 Other Applications:
Real-time server log analysis (e.g., DDoS attack detection).
IoT network monitoring (e.g., sensor data analysis in smart cities).
Algorithms Performing Intensive Computations
These require significant computing power but typically do not operate on large datasets. An example is an algorithm searching for large prime numbers. Parallel computation techniques are often used to optimize performance.
🔹 Example: Cryptography and Finding Large Prime Numbers (e.g., RSA)
Generates large prime numbers essential for RSA encryption.
Requires intensive computations but does not operate on vast datasets.
Often employs parallelizable methods; a standard building block is the probabilistic Miller-Rabin primality test (sketched below), whose independent rounds can run concurrently.
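As a sketch of the kind of computation involved, here is a minimal, textbook implementation of the Miller-Rabin test. The number of rounds `k` is an assumed parameter; because the test is probabilistic, a composite number can pass with a very small probability.

```python
import random

def is_probably_prime(n: int, k: int = 20) -> bool:
    """Miller-Rabin probabilistic primality test with k random rounds."""
    if n < 2:
        return False
    # Quick trial division by a few small primes.
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 as 2^r * d with d odd.
    r, d = 0, n - 1
    while d % 2 == 0:
        r += 1
        d //= 2
    for _ in range(k):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)            # modular exponentiation
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False            # a is a witness that n is composite
    return True

print(is_probably_prime(2**61 - 1))  # a known Mersenne prime -> True
```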
🔹 Other Applications:
Optimization algorithms (e.g., solving the traveling salesman problem).
Algorithms Processing Large Data and Performing Intensive Computations
These combine the requirements of the previous types, demanding both high computational power and handling of large datasets. An example is sentiment analysis in live video streams.
🔹 Example: Sentiment Analysis in Live Video Streams (e.g., YouTube, Twitch)
Analyzes both text (chat) and video/audio in real time.
Requires both significant computational resources (NLP and CV processing) and large-scale data handling.
Uses Transformer models (e.g., BERT) for text analysis and CNN/RNN for image and audio processing.
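As an illustration of the text-analysis part described above, the sketch below runs the Hugging Face `transformers` sentiment-analysis pipeline on a few made-up chat messages; a production system would add audio/video models and streaming infrastructure on top of this.

```python
from transformers import pipeline

# Default sentiment-analysis pipeline (downloads a pretrained model on first use).
classifier = pipeline("sentiment-analysis")

# A few hypothetical chat messages from a live stream.
chat_messages = [
    "This stream is amazing, best gameplay ever!",
    "Terrible audio quality, I can't hear anything.",
    "ok",
]

# Classify all messages in one batch and print label, confidence, and text.
for message, result in zip(chat_messages, classifier(chat_messages)):
    print(f"{result['label']:8s} ({result['score']:.2f})  {message}")
```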
🔹 Other Applications:
Autonomous vehicles (real-time image analysis and decision-making).
Anomaly detection in massive financial datasets (e.g., fraud detection in banking).
Data Dimension
To determine the dimensionality of a problem’s data, it is not enough to consider just the amount of storage required. Three main aspects are crucial:
Input Size – Expected volume of data to be processed.
Growth Rate – The rate at which new data is generated during algorithm execution.
Structural Diversity – The types of data that the algorithm must handle.
Computational Dimension
This concerns processing resources and computational power. For example, deep learning (DL) algorithms require substantial computational power, so a parallelized architecture based on GPUs or TPUs is typically needed, which speeds up computations considerably.
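As a minimal PyTorch sketch of this idea, the snippet below runs a toy forward pass on a GPU when one is available and falls back to the CPU otherwise (TPU back ends such as XLA require a separate setup):

```python
import torch

# Pick a GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A toy model and a batch of random inputs placed on the chosen device.
model = torch.nn.Linear(1024, 10).to(device)
batch = torch.randn(256, 1024, device=device)

# The forward pass runs in parallel on the GPU if one was selected.
output = model(batch)
print(output.shape, "computed on", device)
```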
Algorithm Explainability
In many cases, modeling is used in critical situations, such as in software for administering medications. In such cases, explaining the reasons behind each algorithmic decision is crucial to ensure that the outcomes are error-free and unbiased.
The ability of an algorithm to indicate the mechanisms generating its results is called explainability. Ethical analysis is a standard part of the algorithm validation process.
Achieving high explainability is particularly challenging for machine learning (ML) and deep learning (DL) algorithms. For instance, banks using algorithms for credit decision-making must ensure transparency and justify their decisions.
One method for improving algorithm explainability is LIME (Local Interpretable Model-Agnostic Explanations), published in 2016. This method introduces small changes to input data and analyzes their impact on the output, allowing the identification of local decision-making rules within the model.
Code
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris

# Load the Iris data
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Create a LIME explainer for interpreting the model
explainer = LimeTabularExplainer(X_train,
                                 feature_names=iris.feature_names,
                                 class_names=iris.target_names,
                                 discretize_continuous=True)

# Explain a randomly chosen example from the test set
i = np.random.randint(0, len(X_test))  # pick a random example
exp = explainer.explain_instance(X_test[i], model.predict_proba)

# Display the explanation (intended for a Jupyter notebook)
exp.show_in_notebook()
```
How Does This Code Work?
Loading Data and Training the Model
Uses the Iris dataset, containing 150 examples of flowers from three species:
Setosa
Versicolor
Virginica
The RandomForestClassifier model is trained on this data.
Creating an Interpretable Model Using LIME
LIME generates local explanations, interpreting the model for individual predictions.
A random test example is selected.
Exploring the Outcome for a Single Example
LIME slightly modifies input values and observes how the prediction changes.
It creates a “local” linear model that shows which features had the most influence on the decision.
Let’s assume our model selects a sample flower and classifies it as Virginica.
Interpretation of Results:
Key Features Affecting Model Decision:
Petal length: The most significant factor (e.g., a longer petal suggests Virginica).
Petal width: Also a crucial factor (e.g., above a certain threshold indicates Virginica).
Sepal length: A less significant but still relevant factor.
Sepal width: Usually the least important feature.
Visualization of Results:
LIME generates a bar chart showing the impact of each feature on classification.
The chart highlights which features increased or decreased the probability of a specific classification.
What Does This Mean?
If the model predicts Virginica with high probability, key features (e.g., long petals) strongly indicate this species.
If the features had mixed influences, it suggests the model had difficulty classifying the instance (e.g., petal width was ambiguous).
Anomaly Detection
An outlier is an observation (a row in a dataset) that is significantly different from the other elements in the sample. This means that the relationship between independent and dependent variables for this observation may differ from other cases.
For single variables, outliers can be identified using a box plot, which is based on quartiles:
The first quartile (Q1) and third quartile (Q3) define the edges of the box,
The second quartile (median) is marked inside the box,
Observations lying more than 1.5 × IQR (the interquartile range, Q3 − Q1) beyond the box edges are typically treated as outliers.
An illustrative outlier is a Formula 1 car: in terms of speed, it is an anomaly among regular cars.
Use of Anomaly Detection
Anomaly detection has a wide range of applications, such as:
Finance – detecting fraudulent transactions in banking data analysis,
Cybersecurity – identifying intruders in a network based on user behavior,
Medicine – monitoring health parameters and detecting abnormalities,
Industry – detecting faulty components through image analysis.
Anomaly Detection Methods
1. Supervised Methods (supervised learning)
Used when labeled data is available (e.g., known fraud cases in transactions). Typical techniques include:
- Neural networks,
- K-Nearest Neighbors (KNN),
- Bayesian networks.
2. Unsupervised Methods (unsupervised learning)
These assume that most of the data is correct and that anomalies make up a small percentage of cases. Typical techniques include:
- K-Means clustering,
- Autoencoders in neural networks,
- Statistical tests.
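As a minimal sketch of the unsupervised approach, the snippet below flags points that lie unusually far from their nearest K-Means cluster centre; the synthetic data and the mean-plus-three-standard-deviations distance threshold are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic data: two dense clusters of "normal" points plus two obvious outliers.
normal = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
outliers = np.array([[4.0, 15.0], [-6.0, 9.0]])
X = np.vstack([normal, outliers])

# Fit K-Means and compute each point's distance to its nearest cluster centre.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag points whose distance exceeds mean + 3 standard deviations (assumed rule of thumb).
threshold = distances.mean() + 3 * distances.std()
print("Anomalies:\n", X[distances > threshold])
```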
Classical Method – Detection Based on Probability
To determine whether a given observation is an anomaly, we can use its probability density \(p(x)\):
- If \(p(x) < \epsilon\), we treat the value as an outlier.
- In practice, we assume the data follow a normal distribution \(N(\mu, \sigma^2)\).
- We estimate the parameters \(\mu\) (mean) and \(\sigma^2\) (variance) from a sample.
- Then, for each value, we calculate its probability density under the fitted model and compare it with \(\epsilon\).
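A minimal sketch of this classical method, assuming a univariate normal model, made-up values, and an arbitrarily chosen threshold \(\epsilon\):

```python
import numpy as np
from scipy.stats import norm

# Made-up sample with one clearly unusual value at the end.
values = np.array([50, 52, 49, 51, 53, 48, 50, 52, 47, 95], dtype=float)

# Estimate the parameters of the assumed normal distribution N(mu, sigma^2).
mu, sigma = values.mean(), values.std()

# Probability density of each observation under the fitted model.
p = norm.pdf(values, loc=mu, scale=sigma)

# Values with density below epsilon (an assumed threshold) are flagged as outliers.
epsilon = 0.005
print("Outliers:", values[p < epsilon])
```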
Example: Salary Analysis in a Company
We detect whether there are individuals in the company whose salaries significantly deviate from the average.
Code
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

salaries = [40, 42, 45, 47, 50, 55, 60, 70, 90, 150]  # 150 is the outlier

# Interquartile-range (IQR) based thresholds
Q1 = np.percentile(salaries, 25)
Q3 = np.percentile(salaries, 75)
IQR = Q3 - Q1
outlier_threshold_low = Q1 - 1.5 * IQR
outlier_threshold_high = Q3 + 1.5 * IQR

outliers = [x for x in salaries if x < outlier_threshold_low or x > outlier_threshold_high]
print(f"Outliers: {outliers}")

sns.boxplot(salaries)
plt.title("Box plot")
plt.show()
```
Outliers: [150]
Result: The box plot confirms that the salary of 150k is an outlier.
Isolation Forest – Anomaly Detection Using Isolation Forest
Isolation Forest is an algorithm based on decision trees, proposed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008. It identifies anomalies by isolating outliers during the data partitioning process:
- A feature and a split value are selected at random,
- Outliers are isolated more quickly (closer to the root of the tree),
- The result is aggregated over multiple trees.
Its advantages include low computational requirements and effectiveness in analyzing high-dimensional data.
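A minimal sketch using scikit-learn's `IsolationForest` on the salary data from the earlier example; the `contamination` level (the assumed share of anomalies) is set arbitrarily to 10%.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same salary data as in the box-plot example; 150 is the suspected outlier.
salaries = np.array([40, 42, 45, 47, 50, 55, 60, 70, 90, 150]).reshape(-1, 1)

# contamination=0.1 assumes roughly 10% of the observations are anomalous.
model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
labels = model.fit_predict(salaries)  # -1 = anomaly, 1 = normal

print("Anomalies:", salaries[labels == -1].ravel())
```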