Clustering Text Documents using K-Means in Scikit Learn
Clustering text documents is a common problem in Natural Language Processing (NLP) where similar documents are grouped based on their content. K-Means clustering is a popular clustering technique used for this purpose. In this article we'll learn how to perform text document clustering using the K-Means algorithm in Scikit-Learn.
Implementation using Python
In this project we work with a dataset of news headlines collected for sarcasm detection and use clustering to group them. Sarcasm can make sentences sound opposite to their true meaning, which can confuse systems that analyze sentiment.
Step 1: Import Necessary Libraries
We need a few Python libraries for this task: numpy, pandas, requests, matplotlib and scikit-learn.
Python
import numpy as np
import pandas as pd
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
Step 2: Load the Dataset
Now let's load the dataset of sarcasm headlines. We download the dataset with requests.get(url), and the .json() method parses the raw response into Python objects. Then we create a pandas DataFrame df to make the data easier to work with.
Python
url = "https://raw.githubusercontent.com/PawanKrGunjan/Natural-Language-Processing/main/Sarcasm%20Detection/sarcasm.json"
response = requests.get(url)
data = response.json()
df = pd.DataFrame(data)
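Before vectorizing, it is worth taking a quick look at what was loaded. A small optional check:
Python
# Quick sanity check of the loaded data
print(df.shape)                # (number of headlines, number of columns)
print(df.columns.tolist())     # column names in this copy of the dataset
print(df['headline'].head(3))  # a few example headlines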
Step 3: Convert Text to Numeric Representation using TF-IDF
We need to convert the text data into a format that the K-Means algorithm can understand (numbers). We use TF-IDF for this.
- TfidfVectorizer converts text into a numeric format.
- stop_words='english' removes common words like "the", "and" that don't add much meaning.
- fit_transform(sentence) creates a TF-IDF matrix where each row represents a document and each column represents a word, with the values measuring how important that word is in the document.
Python
sentence = df['headline']
vectorizer = TfidfVectorizer(stop_words='english')
vectorized_documents = vectorizer.fit_transform(sentence)
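To get a feel for what TF-IDF produced, you can inspect the matrix shape and a few of the learned vocabulary terms. An optional check:
Python
# Sparse matrix of shape (number of headlines, vocabulary size)
print(vectorized_documents.shape)

# A few of the words the vectorizer kept as features
print(vectorizer.get_feature_names_out()[:10])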
Step 4: Reduce Dimensionality using PCA
Since TF-IDF produces a high-dimensional matrix we reduce its dimensions to make it easier to visualize.
- TF-IDF output is high-dimensional and difficult to visualize.
- PCA(n_components=2) reduces it to 2 dimensions so we can plot it.
Python
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(vectorized_documents.toarray())
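Two components keep only a small part of the information in the TF-IDF matrix, so the 2-D plot is a rough visual aid rather than a faithful picture of the full feature space. You can check how much variance the two components actually explain:
Python
# Fraction of the total variance captured by each of the 2 components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())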
Step 5: Applying K-Means Clustering
We will now apply the K-Means algorithm to group the headlines into two clusters, which we expect to roughly correspond to sarcastic and not-sarcastic headlines.
- KMeans(n_clusters=2): We choose 2 clusters since the dataset has headlines labeled as either sarcastic or not sarcastic.
- n_init=5: Runs K-Means 5 times with different random initializations and keeps the best result.
- max_iter=500: Allows up to 500 iterations per run for the centroids to converge.
- random_state=42: Ensures that results are reproducible.
Python
num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, n_init=5, max_iter=500, random_state=42)
kmeans.fit(vectorized_documents)
Output:
KMeans Clustering
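To see what each cluster is built around, you can look at the highest-weighted TF-IDF terms in each cluster centre. This is an optional inspection step:
Python
# Print the top TF-IDF terms for each cluster centre
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(kmeans.cluster_centers_):
    top_indices = center.argsort()[::-1][:10]   # indices of the 10 largest weights
    print(f"Cluster {i}:", ", ".join(terms[top_indices]))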
Step 6: Storing Clustering Results
After clustering we store the results in a DataFrame for easy viewing.
- kmeans.labels_ contains the cluster label for each headline (0 or 1).
- We print 5 random samples of the results to check the clustering.
Python
results = pd.DataFrame()
results['document'] = sentence
results['cluster'] = kmeans.labels_
print(results.sample(5))
Output:
Clustering results
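It is also useful to check how many headlines ended up in each cluster; a very uneven split can be a hint that K-Means keyed on something other than the two expected categories. A quick check:
Python
# Number of headlines assigned to each cluster
print(results['cluster'].value_counts())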
Step 7: Visualizing Clusters
Finally we visualize the clustered headlines in a scatter plot.
- We use plt.scatter to plot the data points.
- Each cluster is shown in a different color: red for not sarcastic and green for sarcastic.
- The scatter plot shows how K-Means has grouped the headlines.
Python
colors = ['red', 'green']
cluster_labels = ['Not Sarcastic', 'Sarcastic']
for i in range(num_clusters):
    plt.scatter(reduced_data[kmeans.labels_ == i, 0],
                reduced_data[kmeans.labels_ == i, 1],
                s=10, color=colors[i],
                label=cluster_labels[i])
plt.legend()
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-Means Clustering of Sarcasm Headlines')
plt.show()
Output:
Text clustering using KMeans
The scatter plot shows the K-Means clustering results for sarcasm detection in headlines. Red points represent not-sarcastic headlines while green points indicate sarcastic headlines. The clustering reveals distinct patterns, showing that TF-IDF features combined with K-Means can effectively separate text categories. This showcases the potential of clustering for text analysis using scikit-learn.
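The scatter plot only shows geometry in the reduced 2-D space. Since this dataset also ships with ground-truth labels, a simple way to judge the clusters is to cross-tabulate them against the label column, here assumed to be named is_sarcastic; K-Means never saw these labels during training. A minimal sketch under that assumption:
Python
# Compare unsupervised cluster ids with the ground-truth sarcasm labels.
# Cluster numbering is arbitrary, so look for one dominant label per cluster
# rather than an exact match; 'is_sarcastic' is assumed to be present.
comparison = pd.crosstab(kmeans.labels_, df['is_sarcastic'],
                         rownames=['cluster'], colnames=['is_sarcastic'])
print(comparison)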