
Introduction to Anomaly Detection with Python

Last Updated : 05 Jul, 2024

Anomaly detection is the process of identifying data points that deviate significantly from the expected pattern or behavior within a dataset.

The article aims to provide a comprehensive understanding of anomaly detection, including its definition, types, and techniques, and to demonstrate how to implement anomaly detection in Python using the PyOD library.

What is Anomaly Detection?

Anomaly detection, also called outlier detection, is the process of finding data points or patterns in a dataset that deviate significantly from the expected or 'normal' behavior. What counts as 'normal' or 'abnormal' depends on the context. The basic approach is to define a boundary around the 'normal' data points that separates them from the outliers. In practice this is challenging: the boundary is rarely sharp, and a boundary defined by hand does not scale well to large datasets or to many datasets at once.

Techniques in Python

In Python, anomalies can be detected with statistical rules, machine learning models, or dedicated libraries and toolkits. For example:

  • Anomaly Detection Toolkit (ADTK): A Python package for unsupervised or rule-based time series anomaly detection.
  • PyOD: A popular Python library for anomaly detection.

Types of Anomalies

  1. Point Anomalies: Individual data points that deviate significantly from the rest of the data. For example, detecting credit card fraud based on an unusually high 'amount spent'.
  2. Contextual Anomalies: Data points that are anomalous only in a particular context. In time-series data, what's normal at one time might be abnormal at another.
  3. Collective Anomalies: A set of data points that together indicate an anomaly, even if no single point looks unusual on its own. For example, unexpected data transfer activities might indicate a potential cyberattack.

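Note: Anomaly detection is related to noise removal and novelty detection but is not the same as either. Noise removal discards unwanted observations so that they do not distort an analysis, while novelty detection checks whether new observations differ from training data that is assumed to be free of outliers.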

What are Outliers?

In anomaly detection, outliers are data points that deviate significantly from the rest of the data. They are usually considered the 'odd ones out' that don't conform to the expected patterns or behaviors, falling far outside the typical range of values for a particular feature or set of features.

Types of Outliers

  1. Univariate Outliers: Occur within a single variable. For example, a highly unusual purchase amount.
  2. Multivariate Outliers: Identified by considering multiple variables together. For example, a customer whose age, location, and purchase amount are each typical on their own but form an unusual combination.

Techniques and Approaches to Detect Anomalies

1. Univariate Outlier Detection

  1. Z-score: Measures how many standard deviations a point is from the mean. Points exceeding a threshold (e.g., 3 standard deviations) are flagged as outliers.
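  2. Interquartile Range (IQR): Uses the quartiles Q1 and Q3 to define a range, where IQR = Q3 - Q1. Points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are commonly considered outliers.
  3. Modified Z-scores: Use the median and Median Absolute Deviation (MAD) instead of the mean and standard deviation, which makes them more robust for skewed data and less sensitive to the outliers themselves.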
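
The three univariate rules above can be sketched with NumPy alone. This is a minimal illustration, not a reference implementation: the purchase amounts are made up, the function names are hypothetical, and the cut-offs (3 standard deviations, 1.5 × IQR, and 3.5 with the 0.6745 scaling constant for the modified Z-score) are common conventions rather than fixed rules.

```python
import numpy as np

# Hypothetical purchase amounts; the last value (250) is the planted outlier.
amounts = np.array(
    [12, 15, 14, 13, 16, 15, 14, 13, 15, 14, 16, 13, 15, 14, 12, 250],
    dtype=float,
)

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def modified_zscore_outliers(x, threshold=3.5):
    """Flag points using the median and MAD (assumes the MAD is non-zero)."""
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    m = 0.6745 * (x - median) / mad
    return np.abs(m) > threshold

for name, rule in [
    ("z-score", zscore_outliers),
    ("IQR", iqr_outliers),
    ("modified z-score", modified_zscore_outliers),
]:
    print(name, "->", amounts[rule(amounts)])
```

Note that the plain Z-score uses the mean and standard deviation, which the outlier itself inflates; on very small samples this can hide the outlier, which is one reason to prefer the median-based variant.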

2. Multivariate Outlier Detection

  1. Isolation Forest: Partitions the data with random splits; anomalies are isolated in fewer splits than normal points.
  2. Local Outlier Factor (LOF): Identifies outliers by comparing the local density around a point with the density around its neighbors.
  3. Clustering Techniques (K-means, Hierarchical): Detect points far from established clusters or in small clusters.
  4. Angle-based Outlier Detection (ABOD): Scores a point by the variance of the angles it forms with pairs of other points, which remains informative in high dimensions.
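
As a rough illustration of the first two methods, the sketch below uses scikit-learn's IsolationForest and LocalOutlierFactor (scikit-learn is an assumption here; the walkthrough later in this article uses PyOD instead). The data is synthetic, and the three planted outliers and the contamination value of 0.02 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: 200 points around the origin plus 3 planted outliers.
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])
X = np.vstack([inliers, planted])

# contamination is the expected share of outliers (an assumption of this sketch).
iso = IsolationForest(contamination=0.02, random_state=0)
iso_labels = iso.fit_predict(X)  # scikit-learn convention: -1 = outlier, 1 = inlier

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
lof_labels = lof.fit_predict(X)  # same -1 / 1 convention

print("Isolation Forest flagged rows:", np.flatnonzero(iso_labels == -1))
print("LOF flagged rows:", np.flatnonzero(lof_labels == -1))
```

scikit-learn labels outliers as -1 and inliers as 1, whereas PyOD uses 1 for outliers and 0 for inliers, so the two conventions should not be mixed without converting.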

3. Machine Learning Based Approach

  1. Density-Based Anomaly Detection:
    • K-Nearest Neighbors (k-NN): Scores each point by its distance to its k nearest neighbors; points that lie far from their neighbors are flagged as anomalies.
    • Local Outlier Factor (LOF): Scores data points based on neighbors' density compared to their own.
  2. Clustering-Based Anomaly Detection:
    • K-means Algorithm: Common technique to group similar data points into clusters. Data points far from any cluster are flagged as anomalies.
  3. Support Vector Machine-Based Anomaly Detection:
    • One-Class SVM: Learns a boundary around normal data points, identifies anomalies as points falling outside the boundary.
  4. Moving Average Using Discrete Linear Convolution:
    • Smooths a series with a sliding-window average; points that deviate strongly from the smoothed trend are flagged as anomalies.
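
The moving-average approach in the last item can be sketched with np.convolve. This is a minimal example under stated assumptions: the signal is a synthetic sine wave with one injected spike, the function names are hypothetical, and the window size of 5 and the 3-sigma cut-off are arbitrary choices.

```python
import numpy as np

def moving_average(y, window):
    """Smooth `y` with a uniform kernel via discrete linear convolution."""
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="same")

def moving_average_anomalies(y, window=5, n_sigmas=3.0):
    """Return indices whose residual from the moving average exceeds n_sigmas * std."""
    residual = y - moving_average(y, window)
    # Ignore the edges: mode="same" pads with zeros there, which biases the average.
    half = window // 2
    core = residual[half:len(y) - half]
    threshold = n_sigmas * core.std()
    return np.flatnonzero(np.abs(core) > threshold) + half

# Synthetic series: a slow sine wave with a spike injected at index 40.
t = np.arange(100)
signal = 10 + np.sin(2 * np.pi * t / 25)
signal[40] += 8.0

print(moving_average_anomalies(signal))
```

Because the spike also pulls up the average of its neighbours, a very small window or a low threshold can flag the points next to the spike as well.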

4. Gaussian Distribution

  • Assumes data follows a bell-shaped normal distribution curve. Fits a Gaussian distribution model to the data and identifies points with very low probability as anomalies.
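
A minimal sketch of this idea with NumPy: estimate the mean and covariance from the data, evaluate the log-density of a multivariate normal at every point, and flag the points with the lowest density. The synthetic data, the planted anomaly, the function name, and the 1% cut-off are all assumptions made for illustration.

```python
import numpy as np

def gaussian_log_density(X, mean, cov):
    """Log of the multivariate normal density at each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance of every row from the mean.
    mahal_sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    _, log_det = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + log_det + mahal_sq)

# Synthetic data: 500 normal points plus one planted anomaly at index 500.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(500, 2))
X = np.vstack([X, [[12.0, 25.0]]])

# Fit the Gaussian model: sample mean and sample covariance.
mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
log_p = gaussian_log_density(X, mean, cov)

# Flag the lowest 1% of densities; the 1% cut-off is an arbitrary choice.
threshold = np.percentile(log_p, 1)
anomalies = np.flatnonzero(log_p < threshold)
print("Flagged rows:", anomalies)
```

A percentile cut-off always flags a fixed share of the data, so it will also mark some ordinary points in the tails; a fixed density threshold is the alternative when the expected share of anomalies is unknown.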

5. Autoencoders (in Neural Networks)

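  • Learn to compress (encode) and reconstruct data points by minimizing the reconstruction error on mostly normal data. Points that are reconstructed poorly, that is, with a high reconstruction error, are flagged as anomalies. Implemented with libraries like Keras and PyTorch.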

Steps for Anomaly Detection Using PyOD

Step 1: Install Required Libraries

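First, install the pyod library. The walkthrough also uses pandas, numpy, matplotlib, and seaborn; install them the same way if they are missing. The leading ! runs the command from inside a notebook cell; drop it when running in a terminal.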

!pip install pyod

Step 2: Import Required Libraries

Import the necessary libraries such as pandas for data handling, numpy for numerical computations, and matplotlib and seaborn for data visualization. Additionally, import relevant functions and classes from pyod.

Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Using PyOD
from pyod.utils.data import generate_data, get_outliers_inliers
from pyod.models.pca import PCA
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize


Step 3: Generate Data

Create a dataset using the generate_data function from pyod. It returns a synthetic training set with ground-truth labels (0 for inliers, 1 for outliers). With the default arguments this is 1,000 two-dimensional points, about 10% of which are outliers.

Python
# Start by creating a dataset using generate_data() from pyod
X_train, y_train = generate_data(train_only=True)


Step 4: Create DataFrame

Convert the generated data into a Pandas DataFrame for easier handling and visualization. Add a column for the labels.

Python
# Create dataframe from Pandas using the generated data
df_train = pd.DataFrame(X_train)
df_train['y'] = y_train

# Display first few rows
df_train.head()

Step 5: Visualize Data

Visualize the generated dataset using Seaborn’s scatter plot. The color of each point represents its label (outlier or not).

Python
sns.scatterplot(x=0, y=1, hue='y', data=df_train, palette="hls", legend="full")
plt.title('Ground Truth')

Output:

[Figure: scatter plot of the two generated features titled 'Ground Truth', with points colored by their label y]


Step 6: Create PCA Model

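Initialize a PCA model from pyod. PCA (Principal Component Analysis) finds the directions along which the data varies most; points that sit far from the structure captured by those principal components receive high anomaly scores. With no arguments, the detector uses PyOD's default contamination of 0.1, meaning it expects about 10% of the training points to be outliers.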

Python
# Create PCA model
clf = PCA()
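
To make the idea behind PCA-based detection concrete, here is a generic sketch that scores points by their reconstruction error after projecting onto the top principal components. It is an illustration of one common PCA scoring scheme written with NumPy only; it is not a copy of how pyod.models.pca.PCA computes its scores, and the function name and synthetic data are assumptions.

```python
import numpy as np

def pca_reconstruction_error(X, n_components):
    """Squared distance between each row and its projection onto the top principal axes."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    reconstructed = Xc @ components.T @ components
    return np.sum((Xc - reconstructed) ** 2, axis=1)

# Synthetic data: 300 points lying close to a 2-D plane inside 3-D space.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))
X = np.column_stack([latent[:, 0], latent[:, 1], latent[:, 0] + latent[:, 1]])
X += rng.normal(scale=0.05, size=X.shape)

# Planted point far off the plane, stored at index 300.
X = np.vstack([X, [[0.0, 0.0, 4.0]]])

errors = pca_reconstruction_error(X, n_components=2)
print("Row with the largest reconstruction error:", int(np.argmax(errors)))
```

The planted point is unremarkable in each coordinate taken alone; it stands out only because it breaks the relationship between the features that the principal components capture.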


Step 7: Train PCA Model

Train the PCA model on the generated data with the fit method. The detector is unsupervised, so the labels in y_train are not passed to it.

Python
# Trains PCA model
clf.fit(X_train)


Step 8: Store Predictions

Read the results from the fitted model. The labels_ attribute contains the predicted binary labels for the training data (0 for inliers, 1 for outliers), and the decision_scores_ attribute contains the raw anomaly scores, where a higher score means a more anomalous point.

Python
# Binary predictions for the training data: 0 = inlier, 1 = outlier
y_train_pred = clf.labels_
# Raw anomaly scores for the training data: higher = more anomalous
y_train_scores = clf.decision_scores_


Step 9: Visualize Anomaly Scores

Visualize the anomaly scores using Seaborn’s scatter plot. The color of each point represents its anomaly score.

Python
# Color each point by its anomaly score (red = higher score)
ax = sns.scatterplot(x=0, y=1, hue=y_train_scores, data=df_train, palette="RdBu_r")

# Keep the legend entries seaborn generates for the score range and only add a title.
# Relabeling them with every unique score would mismatch labels and colors.
ax.get_legend().set_title("Anomaly Scores")
plt.title('Anomaly Scores by PCA')

Output:

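[Figure: scatter plot of the two generated features titled 'Anomaly Scores by PCA', with points colored by anomaly score]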

Interpretation of the Output Graph with Anomaly Scores:

  • Dense Clusters: The dense cluster of points in the center with lower anomaly scores (dark blue) indicates normal data points that follow the expected pattern.
  • Scattered Points: The scattered points with higher anomaly scores (red) indicate potential anomalies or outliers that deviate significantly from the normal pattern.
  • Anomaly Detection: This visualization helps in identifying which data points are considered anomalies by the PCA model. Points far from the dense cluster and with higher anomaly scores are flagged as anomalies.

By using this plot, you can visually inspect and analyze the anomalies detected by the PCA model, aiding in understanding and validating the results of your anomaly detection process.

Conclusion

Anomaly detection, also called outlier detection, is the process of finding data points that deviate significantly from the expected or 'normal' behavior. This article covered the types of anomalies (point, contextual, collective) and outliers, surveyed common detection techniques, and walked through an implementation in Python using PyOD's PCA detector.

