Hands-On Unsupervised Learning with Python: Implement machine learning and deep learning models using Scikit-Learn, TensorFlow, and more

Giuseppe Bonaccorso
3.7 (3 Ratings)
eBook | Feb 2019 | 386 pages | 1st Edition
eBook: NZ$53.99 (NZ$60.99)
Paperback: NZ$75.99
Subscription: Free Trial, renews at $19.99 p/m

What do you get with eBook?

  • Instant access to your Digital eBook purchase
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want
  • AI Assistant (beta) to help accelerate your learning

Hands-On Unsupervised Learning with Python

Clustering Fundamentals

In this chapter, we are going to introduce the fundamental concepts of cluster analysis, focusing on the main principles shared by many algorithms and on the most important techniques that can be employed to evaluate the performance of a method.

In particular, we are going to discuss:

  • An introduction to clustering and distance functions
  • K-means and K-means++
  • Evaluation metrics
  • K-Nearest Neighbors (KNN)
  • Vector Quantization (VQ)

Technical requirements

The code presented in this chapter requires:

The dataset can be obtained through UCI. The CSV file can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data and doesn't need any preprocessing except for the addition of the column names that will occur during the loading stage.
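As noted above, the column names must be supplied while loading the file. The following sketch shows the pattern with pandas on a tiny in-memory excerpt; the two records and the short column list are illustrative stand-ins for the real 32-column `wdbc.data` file:

```python
import io

import pandas as pd

# Hypothetical two-record excerpt standing in for the downloaded wdbc.data file,
# which ships without a header row.
csv_data = io.StringIO(
    "842302,M,17.99,10.38\n"
    "842517,M,20.57,17.77\n"
)

# Illustrative subset of the 32 column names (id, diagnosis, then the features)
columns = ["id", "diagnosis", "radius_mean", "texture_mean"]

# header=None because the file has no header; names= adds the columns at load time
df = pd.read_csv(csv_data, header=None, names=columns)
print(df["diagnosis"].value_counts())
```

For the real file, the same `pd.read_csv(...)` call works with the download URL (or a local path) in place of the `StringIO` buffer.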

The examples are available on the GitHub repository:

https://github.com/PacktPublishing/HandsOn-Unsupervised-Learning-with-Python/Chapter02.

Introduction to clustering

As we explained in Chapter 1, Getting Started with Unsupervised Learning, the main goal of a cluster analysis is to group the elements of a dataset according to a similarity measure or a proximity criterion. In the first part of this chapter, we are going to focus on the former approach, while in the second part and in the next chapter, we will analyze more generic methods that exploit other geometric features of the dataset.

Let's take a data generating process pdata(x) and draw N samples from it:

X = {x1, x2, ..., xN}, with xi ∼ pdata(x)

It's possible to assume that the probability space of pdata(x) is partitionable into (potentially infinite) configurations containing K (for K = 1, 2, ...) regions, so that pdata(x; k) represents the probability of a sample belonging to a cluster k. In this way, we are stating that every possible clustering structure already exists when pdata(x...
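Since the bullet list above mentions distance functions, a minimal NumPy sketch of the Minkowski family (which contains the Manhattan and Euclidean distances as the p=1 and p=2 special cases) can make the idea concrete; the points chosen here are illustrative:

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski(a, b, 1))   # p=1: Manhattan distance, 3 + 4 = 7.0
print(minkowski(a, b, 2))   # p=2: Euclidean distance, sqrt(9 + 16) = 5.0

# As p grows, the value approaches the Chebyshev distance max|a_i - b_i| = 4
print(minkowski(a, b, 10))
```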

K-means

K-means is the simplest implementation of the principle of maximum separation and maximum internal cohesion. Let's suppose we have a dataset X ∈ ℜM×N (that is, M N-dimensional samples) that we want to split into K clusters and a set of K centroids corresponding to the means of the samples assigned to each cluster Kj:

The set M and the centroids have an additional index (as a superscript) indicating the iterative step. Starting from an initial guess M(0), K-means tries to minimize an objective function called inertia (that is, the total average intra-cluster distance between samples assigned to a cluster Kj and its centroid μj):

S(t) = Σj Σxi∈Kj ||xi − μj(t)||²

It's easy to understand that S(t) cannot be considered as an absolute measure because its value is highly influenced by the variance of the samples. However, S(t+1) < S(t) implies that the centroids are moving...
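To see S(t+1) ≤ S(t) in action, here is a minimal NumPy implementation of the two alternating K-means steps (assignment and centroid update) on synthetic data; it is a sketch under simple assumptions, not the book's own listing:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs (M=200 samples, N=2 features)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)),
               rng.normal(4.0, 0.5, (100, 2))])

K = 2
centroids = X[rng.choice(len(X), K, replace=False)]  # initial guess M(0)

inertias = []
for t in range(10):
    # Assignment step: each sample goes to its nearest centroid
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Inertia S(t): sum of squared distances to the assigned centroids
    inertias.append((d.min(axis=1) ** 2).sum())
    # Update step: move each centroid to the mean of its cluster
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(K)])

print(inertias)  # non-increasing: S(t+1) <= S(t) at every step
```

In practice one would use `sklearn.cluster.KMeans`, which implements the same Lloyd iteration with the K-means++ seeding discussed earlier.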

Analysis of the Breast Cancer Wisconsin dataset

In this chapter, we are using the well-known Breast Cancer Wisconsin dataset to perform a cluster analysis. Originally, the dataset was proposed in order to train classifiers; however, it can be very helpful for a non-trivial cluster analysis. It contains 569 records made up of 32 attributes (including the diagnosis and an identification number). All the attributes are strictly related to biological and morphological properties of the tumors, but our goal is to validate generic hypotheses considering the ground truth (benign or malignant) and the statistical properties of the dataset. Before moving on, it's important to clarify some points. The dataset is high-dimensional and the clusters are non-convex (so we cannot expect a perfect segmentation). Moreover, our goal is not to use a clustering algorithm to obtain the results of...

Evaluation metrics

In this section, we are going to analyze some common methods that can be employed to evaluate the performance of a clustering algorithm and also to help find the optimal number of clusters.

Minimizing the inertia

One of the biggest drawbacks of K-means and similar algorithms is the explicit request for the number of clusters. Sometimes this piece of information is imposed by external constraints (for example, in the example of breast cancer, there are only two possible diagnoses), but in many cases (when an exploratory analysis is needed), the data scientist has to check different configurations and evaluate them. The simplest way to evaluate K-means performance and choose an appropriate number of clusters...
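A common way to "check different configurations" is the elbow heuristic: run K-means for several values of K and compare the final inertia, looking for the point where the drop flattens. The following self-contained sketch reuses a minimal Lloyd's loop rather than any particular library; the data is synthetic:

```python
import numpy as np

def kmeans_inertia(X, K, n_iter=20, seed=0):
    # Minimal Lloyd's-algorithm sketch returning the final inertia for a given K
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Keep the old centroid if a cluster happens to become empty
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(K)])
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

rng = np.random.default_rng(1)
# Three synthetic blobs, so the "elbow" should appear around K=3
X = np.vstack([rng.normal(c, 0.4, (80, 2)) for c in (0.0, 3.0, 6.0)])

# Inertia always decreases as K grows; the useful signal is where it flattens
for K in range(1, 7):
    print(K, round(kmeans_inertia(X, K), 1))
```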

K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a method belonging to a category called instance-based learning. In this case, there's no parametrized model, but rather a rearrangement of the samples in order to speed up specific queries. In the simplest case (also known as brute-force search), let's say we have a dataset X containing M samples xi ∈ ℜN. Given a distance function d(xi, xj), it's possible to define the radius neighborhood of a test sample xi as:

The set ν(xi) is a ball centered on xi, including all the samples whose distance is less than or equal to R. Alternatively, it's possible to compute only the top k nearest neighbors, which are the k samples closest to xi (in general, this set is a subset of ν(xi), but the opposite condition can also happen when k is very large). The procedure is straightforward but, unfortunately...
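Both query types described above can be sketched with a brute-force search in NumPy; the data here is hypothetical, and a real application would typically delegate to an indexed structure (such as a KD-tree or ball tree) instead:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (500, 3))   # M=500 samples in R^3
query = np.zeros(3)                  # test sample x_i

# Brute-force search: compute all M distances to the query
d = np.linalg.norm(X - query, axis=1)

# Radius neighborhood nu(x_i): all samples with distance <= R
R = 0.5
radius_idx = np.where(d <= R)[0]

# Top-k nearest neighbors: the k samples closest to the query
k = 5
knn_idx = np.argsort(d)[:k]

print(len(radius_idx), "samples inside the ball of radius", R)
print("5-NN distances:", np.round(np.sort(d)[:k], 3))
```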

Vector Quantization

Vector Quantization (VQ) is a method that exploits unsupervised learning in order to perform a lossy compression of a sample xi ∈ ℜN (for simplicity, we are supposing the multi-dimensional samples are flattened) or an entire dataset X. The main idea is to find a codebook Q with a number of entries C << N and associate each element with an entry qi ∈ Q. In the case of a single sample, each entry will represent one or more groups of features (for example, it can be the mean); therefore, the process can be described as a transformation T whose general representation is:

The codebook is defined as Q = (q1, q2, ..., qC). Hence, given a synthetic dataset made up of a group of feature aggregates (for example, a group of two consecutive elements), VQ associates a single codebook entry:

As the input sample is represented using a combination...
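A minimal sketch of the scheme just described, assuming groups of two consecutive elements and a K-means-built codebook (the signal and the codebook size are illustrative, not taken from the book's listings):

```python
import numpy as np

rng = np.random.default_rng(0)
# A flattened sample: a noisy 1-D signal with N=256 values
signal = np.sin(np.linspace(0, 8 * np.pi, 256)) + rng.normal(0, 0.05, 256)

# Feature aggregates: groups of two consecutive elements -> 128 vectors in R^2
pairs = signal.reshape(-1, 2)

# Build a codebook Q with C << N entries using a minimal K-means loop
C = 8
codebook = pairs[rng.choice(len(pairs), C, replace=False)]
for _ in range(20):
    d = np.linalg.norm(pairs[:, None, :] - codebook[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    codebook = np.array([pairs[labels == j].mean(axis=0) if np.any(labels == j)
                         else codebook[j] for j in range(C)])

# Lossy compression: each pair is replaced by its nearest codebook entry
d = np.linalg.norm(pairs[:, None, :] - codebook[None, :, :], axis=2)
labels = d.argmin(axis=1)
reconstructed = codebook[labels].reshape(-1)

mse = np.mean((signal - reconstructed) ** 2)
print("codebook entries:", C, "reconstruction MSE:", round(float(mse), 4))
```

Storing only the codebook plus one index per pair is what makes the representation a compression: 256 floats become 8 codebook vectors and 128 small integers.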

Summary

In this chapter, we explained the fundamental concepts of cluster analysis, starting from the concept of similarity and how to measure it. We discussed the K-means algorithm and its optimized variant called K-means++, and we analyzed the Breast Cancer Wisconsin dataset. Then we discussed the most important evaluation metrics (with or without knowledge of the ground truth) and learned which factors can influence performance. The next two topics were KNN, a very famous algorithm that can be employed to find the most similar samples given a query vector, and VQ, a technique that exploits clustering algorithms in order to find a lossy representation of a sample (for example, an image) or a dataset.

In the next chapter, we are going to introduce some of the most important advanced clustering algorithms, showing how they can easily solve non-convex problems.

...

Questions

  1. If two samples have a Minkowski distance (p=5) equal to 10, what can you say about their Manhattan distance?
  2. The main factor that negatively impacts on the convergence speed of K-means is the dimensionality of the dataset. Is this correct?
  3. One of the most important factors that can positively impact on the performance of K-means is the convexity of the clusters. Is this correct?
  4. The homogeneity score of a clustering application is equal to 0.99. What does it mean?
  5. What is the meaning of an adjusted Rand score equal to -0.5?
  6. Considering the previous question, can a different number of clusters yield a better score?
  7. An application based on KNN requires on average 100 5-NN base queries per minute. Every minute, 2 50-NN queries are executed (each of them requires 4 seconds with a leaf size=25) and, immediately after them, a 2-second blocking task is performed. Assuming...

Further reading

  • On the Surprising Behavior of Distance Metrics in High Dimensional Space, Aggarwal C. C., Hinneburg A., Keim D. A., ICDT, 2001
  • K-means++: The Advantages of Careful Seeding, Arthur D., Vassilvitskii S., Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007
  • Visualizing Data using t-SNE, van der Maaten L., Hinton G., Journal of Machine Learning Research 9, 2008
  • Robust Linear Programming Discrimination of Two Linearly Inseparable Sets, Bennett K. P., Mangasarian O. L., Optimization Methods and Software 1, 1992
  • Breast cancer diagnosis and prognosis via linear programming, Mangasarian O. L., Street W.N, Wolberg W. H., Operations Research, 43(4), pages 570-577, July-August 1995
  • V-Measure: A conditional entropy-based external cluster evaluation measure, Rosenberg A., Hirschberg J., Proceedings of the 2007 Joint Conference on Empirical Methods...

Key benefits

  • Explore unsupervised learning with clustering, autoencoders, restricted Boltzmann machines, and more
  • Build your own neural network models using modern Python libraries
  • Practical examples show you how to implement different machine learning and deep learning techniques

Description

Unsupervised learning is about making use of raw, untagged data and applying learning algorithms to it to help a machine predict its outcome. With this book, you will explore the concept of unsupervised learning to cluster large sets of data and analyze them repeatedly until the desired outcome is found using Python. This book starts with the key differences between supervised, unsupervised, and semi-supervised learning. You will be introduced to the best-used libraries and frameworks from the Python ecosystem and address unsupervised learning in both the machine learning and deep learning domains. You will explore various algorithms and techniques that are used to implement unsupervised learning in real-world use cases. You will learn a variety of unsupervised learning approaches, including randomized optimization, clustering, feature selection and transformation, and information theory. You will get hands-on experience with how neural networks can be employed in unsupervised scenarios. You will also explore the steps involved in building and training a GAN in order to process images. By the end of this book, you will have learned the art of unsupervised learning for different real-world challenges.

Who is this book for?

This book is intended for statisticians, data scientists, machine learning developers, and deep learning practitioners who want to build smart applications by implementing the key building blocks of unsupervised learning, and master all the new techniques and algorithms offered in machine learning and deep learning, using real-world examples. Some prior knowledge of machine learning concepts and statistics is desirable.

What you will learn

  • Use cluster algorithms to identify and optimize natural groups of data
  • Explore advanced non-linear and hierarchical clustering in action
  • Apply soft label assignments for fuzzy c-means and Gaussian mixture models
  • Detect anomalies through density estimation
  • Perform principal component analysis using neural network models
  • Create unsupervised models using GANs

Product Details

Publication date: Feb 28, 2019
Length: 386 pages
Edition: 1st
Language: English
ISBN-13: 9781789349276




Table of Contents

11 Chapters
  1. Getting Started with Unsupervised Learning
  2. Clustering Fundamentals
  3. Advanced Clustering
  4. Hierarchical Clustering in Action
  5. Soft Clustering and Gaussian Mixture Models
  6. Anomaly Detection
  7. Dimensionality Reduction and Component Analysis
  8. Unsupervised Neural Network Models
  9. Generative Adversarial Networks and SOMs
  10. Assessments
  11. Other Books You May Enjoy

Customer reviews

Rating distribution
3.7 (3 Ratings)
5 star 66.7%
4 star 0%
3 star 0%
2 star 0%
1 star 33.3%

Ellery Lin, May 24, 2019 (5 stars)
I appreciate the introduction of Shape - to cluster time-series very much.
Amazon Verified review

Diana, Sep 08, 2023 (5 stars)
I like that have the theory and examples in how to program it in Python. This book who's a little bit about mathematics equations. So if you are interested in the demonstration of mathematics equations then you need other book. This is practical book in Python and I love it.
Amazon Verified review

Danielle W, Jun 08, 2020 (1 star)
This is a paperback re-print made in black and white. Nowhere in the description or preview does it show that it's not in color. For most of the figures having color is pretty important to follow along. Disappointing.
Amazon Verified review

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title.
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal).
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book, go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats does Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower in price than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.