Introduction to Anomaly Detection with Python
Last Updated: 05 Jul, 2024
Anomaly detection is the process of identifying data points that deviate significantly from the expected pattern or behavior within a dataset.
The article aims to provide a comprehensive understanding of anomaly detection, including its definition, types, and techniques, and to demonstrate how to implement anomaly detection in Python using the PyOD library.
What is Anomaly Detection?
Anomaly detection, also called outlier detection, is the process of finding patterns in a dataset that deviate significantly from the expected or 'normal' behavior. The difference between 'normal' and 'abnormal' varies depending on the context. The basic approach is to define a boundary around the 'normal' data points that separates them from the outliers. However, several factors make this challenging in practice, and a single fixed boundary rarely works well for large datasets or across multiple datasets.
Techniques in Python
In Python, many approaches can be used to detect these anomalies, such as using ML models, algorithms, or Python libraries, packages, or toolkits. For example:
- Anomaly Detection Toolkit (ADTK): A Python package for unsupervised or rule-based time series anomaly detection.
- PyOD: A popular Python library for anomaly detection.
Types of Anomalies
- Point Anomalies: Individual data points that deviate significantly from the rest of the data. For example, detecting credit card fraud based on an unusually high 'amount spent'.
- Contextual Anomalies: Depend on the surrounding context. In time-series data, what’s normal at one time might be abnormal at another.
- Collective Anomalies: A set of data points together indicate an anomaly. For example, unexpected data transfer activities might indicate a potential cyberattack.
Note: Anomaly detection is related to noise removal and novelty detection, but it is not the same as either.
What are Outliers?
In anomaly detection, outliers are data points that deviate significantly from the rest of the data. They are usually considered the 'odd ones out' that don't conform to the expected patterns or behaviors, falling far outside the typical range of values for a particular feature or set of features.
Types of Outliers
- Univariate Outliers: Occur within a single variable. For example, a highly unusual purchase amount.
- Multivariate Outliers: Identified by considering multiple variables together. For example, a customer whose age, location, and purchase behavior each look typical on their own but form an unusual combination.
Techniques and Approaches to Detect Anomalies
1. Univariate Outlier Detection
- Z-score: Measures how many standard deviations a point is from the mean. Points exceeding a threshold (e.g., 3 standard deviations) are flagged as outliers.
- Interquartile Range (IQR): Uses quartiles to define a range. Points outside the range are considered outliers.
- Modified Z-scores: Uses the median and Median Absolute Deviation (MAD) instead of mean and standard deviation, more robust for skewed data.
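The three univariate rules above can be sketched with NumPy alone. The purchase amounts below are made up for illustration, and the cutoffs (3 for the z-score, 1.5 × IQR, 3.5 for the modified z-score with the 0.6745 scaling constant) are common conventions rather than fixed requirements.

```python
import numpy as np

# Hypothetical purchase amounts: 20 typical values plus one extreme value
amounts = np.array([21, 25, 19, 23, 22, 24, 20, 26, 18, 22] * 2 + [250], dtype=float)

# Z-score: distance from the mean in units of standard deviation
z = (amounts - amounts.mean()) / amounts.std()
z_outliers = np.abs(z) > 3

# IQR rule: flag points more than 1.5 * IQR beyond the first or third quartile
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
iqr_outliers = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

# Modified z-score: median and MAD replace the mean and standard deviation
median = np.median(amounts)
mad = np.median(np.abs(amounts - median))
modified_z = 0.6745 * (amounts - median) / mad
mad_outliers = np.abs(modified_z) > 3.5

print(amounts[z_outliers], amounts[iqr_outliers], amounts[mad_outliers])
```

Note that the plain z-score is sensitive to sample size: the extreme value inflates the mean and standard deviation it is measured against, so in very small samples it may never cross the threshold. The median-based variant does not have this weakness.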
2. Multivariate Outlier Detection
- Isolation Forest: Isolates points by random partitioning; anomalies need fewer splits to isolate than normal points.
- Local Outlier Factor (LOF): Identifies outliers based on local density deviation from neighbors.
- Clustering Techniques (K-means, Hierarchical): Detect points far from established clusters or in small clusters.
- Angle-based Outlier Detection (ABOD): Analyzes angles between data points in high dimensions.
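To make the idea of "local density deviation" concrete, here is a simplified NumPy sketch of the Local Outlier Factor. The function name `lof_scores`, the choice of `k=5`, and the synthetic data are assumptions made for this illustration, and ties between equally distant neighbors are ignored. For real work, library implementations such as PyOD's `LOF` or scikit-learn's `LocalOutlierFactor` are the better choice.

```python
import numpy as np

def lof_scores(X, k=5):
    """Simplified Local Outlier Factor; values well above 1 suggest outliers."""
    # Pairwise Euclidean distances; a point is never its own neighbor
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)

    # Indices of and distances to the k nearest neighbors of every point
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dist, nn_idx, axis=1)
    k_dist = nn_dist[:, -1]  # distance to the k-th nearest neighbor

    # Reachability distance from p to neighbor o: max(k_dist(o), d(p, o))
    reach = np.maximum(nn_dist, k_dist[nn_idx])
    # Local reachability density: inverse of the mean reachability distance
    lrd = 1.0 / reach.mean(axis=1)
    # LOF: average density of the neighbors relative to the point's own density
    return lrd[nn_idx].mean(axis=1) / lrd

rng = np.random.default_rng(0)
# Hypothetical data: one Gaussian cluster plus a single distant point (last row)
X = np.vstack([rng.normal(size=(50, 2)), [[15.0, 15.0]]])
scores = lof_scores(X, k=5)
most_anomalous = int(np.argmax(scores))
print(most_anomalous, scores[most_anomalous])
```

Points inside the cluster get scores close to 1 because their density matches that of their neighbors, while the distant point sits in a much sparser region than its neighbors and receives a far larger score.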
3. Machine Learning Based Approach
- Density-Based Anomaly Detection:
  - K-Nearest Neighbors (k-NN): Scores each point by its distance to its nearest neighbors; points far from their neighbors are flagged.
  - Local Outlier Factor (LOF): Scores each point by comparing its local density with the density around its neighbors.
- Clustering-Based Anomaly Detection:
  - K-means Algorithm: Groups similar data points into clusters. Data points far from every cluster center are flagged as anomalies.
- Support Vector Machine-Based Anomaly Detection:
  - One-Class SVM: Learns a boundary around normal data points and identifies points falling outside it as anomalies.
- Moving Average Using Discrete Linear Convolution:
  - Smooths the data so that points deviating strongly from the trend stand out as anomalies.
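The moving-average approach can be sketched with `np.convolve`, which performs the discrete linear convolution. The series, the window size of 5, and the cutoff of 3 residual standard deviations are all assumptions chosen for this illustration.

```python
import numpy as np

# Hypothetical series: a smooth sine wave with one injected spike at index 60
t = np.arange(120)
series = np.sin(2 * np.pi * t / 40)
series[60] += 3.0

# A uniform kernel turns the convolution into a simple moving average
window = 5
kernel = np.ones(window) / window
# mode="same" keeps the output aligned with the input (edges are zero-padded)
trend = np.convolve(series, kernel, mode="same")

# Points that sit far from the smoothed trend are flagged as anomalies
residual = series - trend
threshold = 3 * residual.std()
anomalies = np.flatnonzero(np.abs(residual) > threshold)
print(anomalies)
```

Because the spike also pulls the moving average up at neighboring positions, the window size and threshold need to be tuned together so that only the spike itself is flagged.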
4. Gaussian Distribution
- Assumes data follows a bell-shaped normal distribution curve. Fits a Gaussian distribution model to the data and identifies points with very low probability as anomalies.
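A minimal univariate sketch of this idea follows. The synthetic readings, the two injected extreme values, and the density cutoff `epsilon` are assumptions for illustration; in practice the cutoff is usually tuned on validation data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 500 "normal" readings plus two injected extreme values
data = np.concatenate([rng.normal(loc=100.0, scale=10.0, size=500), [160.0, 35.0]])

# Fit the Gaussian by estimating its mean and standard deviation
mu, sigma = data.mean(), data.std()

# Probability density of each point under the fitted Gaussian
density = np.exp(-0.5 * ((data - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Flag points whose density falls below the chosen cutoff
epsilon = 1e-5
anomalies = data[density < epsilon]
print(anomalies)
```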
5. Autoencoders (in Neural Networks)
- Encode and reconstruct data points. The network is trained to minimize reconstruction error on normal data, so points with a high reconstruction error are flagged as anomalies. Implemented with libraries like Keras and PyTorch.
Steps for Anomaly Detection Using PyOD
Step 1: Install Required Libraries
First, install the pyod library along with the other libraries needed for data handling and visualization.
!pip install pyod
Step 2: Import Required Libraries
Import the necessary libraries: pandas for data handling, numpy for numerical computations, and matplotlib and seaborn for data visualization. Additionally, import the relevant functions and classes from pyod.
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic: render plots inline (omit when running as a plain script)
%matplotlib inline
# PyOD: synthetic data generator and the PCA-based detector
from pyod.utils.data import generate_data
from pyod.models.pca import PCA
Step 3: Generate Data
Create a dataset using the generate_data function from pyod. This function generates a synthetic training dataset in which outliers and inliers are labeled.
Python
# Start by creating a dataset using generate_data() from pyod
X_train, y_train = generate_data(train_only=True)
Step 4: Create DataFrame
Convert the generated data into a Pandas DataFrame for easier handling and visualization. Add a column for the labels.
Python
# Create dataframe from Pandas using the generated data
df_train = pd.DataFrame(X_train)
df_train['y'] = y_train
# Display first few rows
df_train.head()
Step 5: Visualize Data
Visualize the generated dataset using Seaborn’s scatter plot. The color of each point represents its label (outlier or not).
Python
sns.scatterplot(x=0, y=1, hue='y', data=df_train, palette="hls", legend="full")
plt.title('Ground Truth')
Output: a scatter plot titled "Ground Truth", with each point colored by its label y.
Step 6: Create PCA Model
Initialize a PCA model from pyod. PCA (Principal Component Analysis) is used here for anomaly detection: points that fit poorly with the principal components learned from the data receive higher outlier scores.
Python
# Create PCA model
clf = PCA()
Step 7: Train PCA Model
Train the PCA model on the generated data using the fit method.
Python
# Trains PCA model
clf.fit(X_train)
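To build intuition for why a PCA-based detector can expose outliers, here is a NumPy-only sketch of one common PCA-style score: project the centered data onto the principal axes and sum the squared projections, each weighted by the inverse of that component's variance. This is an illustrative assumption and not necessarily the exact scoring used inside PyOD's PCA class; the synthetic data is also made up.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical correlated 2-D inliers plus one point that breaks the correlation
x = rng.normal(size=200)
inliers = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])
X = np.vstack([inliers, [[2.0, -4.0]]])

# Principal axes and their variances from the covariance eigendecomposition
X_centered = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Project onto each principal axis, then weight squared projections by 1 / variance
projections = X_centered @ eigvecs
scores = np.sum(projections ** 2 / eigvals, axis=1)

most_anomalous = int(np.argmax(scores))
print(most_anomalous, scores[most_anomalous])
```

The injected point is not extreme in either feature alone, but it lies far from the data along the low-variance principal axis, which is what drives its score up.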
Step 8: Store Predictions
Store the predictions and scores computed on the training data. The labels_ attribute contains the predicted labels (0 for inliers, 1 for outliers), and the decision_scores_ attribute contains the anomaly scores, where higher values indicate more abnormal points.
Python
# Store predictions for inlier and outlier in array as 0s and 1s
y_train_pred = clf.labels_
y_train_scores = clf.decision_scores_
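PyOD documents labels_ as the result of applying a threshold to decision_scores_, with the threshold determined by the detector's contamination parameter (the assumed proportion of outliers). The sketch below mimics that relationship with NumPy on made-up scores; the score values and the contamination of 0.2 are assumptions for illustration, not PyOD output.

```python
import numpy as np

# Hypothetical anomaly scores (higher means more abnormal), not real PyOD output
scores = np.array([0.8, 1.1, 0.9, 1.0, 7.5, 1.2, 0.7, 1.3, 0.95, 6.8])

contamination = 0.2  # assumed share of outliers in the data
# Scores above the (1 - contamination) percentile are labeled as outliers
threshold = np.percentile(scores, 100 * (1 - contamination))
labels = (scores > threshold).astype(int)
print(threshold, labels)
```

Choosing a different contamination value moves the threshold and therefore changes how many points end up labeled as outliers, even though the underlying scores stay the same.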
Step 9: Visualize Anomaly Scores
Visualize the anomaly scores using Seaborn’s scatter plot. The color of each point represents its anomaly score.
Python
ax = sns.scatterplot(x=0, y=1, hue=y_train_scores, data=df_train, palette="RdBu_r")
# Seaborn already builds a legend for the continuous scores; relabeling it with
# raw score values would mismatch colors and labels, so only the title is set
ax.get_legend().set_title("Anomaly Scores")
plt.title('Anomaly Scores by PCA')
Output: a scatter plot titled "Anomaly Scores by PCA", with each point colored by its anomaly score.
Interpretation of the Output Graph with Anomaly Scores:
- Dense Clusters: The dense cluster of points in the center with lower anomaly scores (dark blue) indicates normal data points that follow the expected pattern.
- Scattered Points: The scattered points with higher anomaly scores (red) indicate potential anomalies or outliers that deviate significantly from the normal pattern.
- Anomaly Detection: This visualization helps in identifying which data points are considered anomalies by the PCA model. Points far from the dense cluster and with higher anomaly scores are flagged as anomalies.
By using this plot, you can visually inspect and analyze the anomalies detected by the PCA model, aiding in understanding and validating the results of your anomaly detection process.
Conclusion
Anomaly detection, also called outlier detection, is the process of finding data points or patterns in a dataset that deviate significantly from the expected or 'normal' behavior. This article covered the types of anomalies (point, contextual, collective), the types of outliers, common detection techniques, and an implementation of anomaly detection in Python using PyOD's PCA model.