K-Means Clustering using PySpark Python
Last Updated: 28 Apr, 2025
In this tutorial, we will cover K-Means clustering using PySpark. K-means is a clustering algorithm that groups data points into K distinct clusters based on their similarity. It is an unsupervised learning technique widely used in data mining, machine learning, and pattern recognition. The algorithm works by iteratively assigning each data point to the cluster whose centroid is nearest and then recomputing the centroid of each cluster. The process continues until the centroids converge or a maximum number of iterations is reached. K-means is simple, efficient, and effective at finding compact, well-separated clusters in many datasets, which makes it a popular choice for a wide range of applications.
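To make the assign-and-recompute loop described above concrete, here is a minimal pure-NumPy sketch of the algorithm (the input array X, the value of k, and the number of iterations are illustrative placeholders, not the seeds data used later; empty clusters are not handled):
Python3
import numpy as np

def kmeans_sketch(X, k=3, n_iters=20, seed=0):
    # X is an (n_samples, n_features) NumPy array of data points
    rng = np.random.default_rng(seed)
    # Initialise the centroids by picking k distinct random points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids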
So, a typical clustering problem looks like this:
- Cluster Similar Documents
- Cluster Customers based on Features
- Identify similar physical groups
- Market Segmentation
We'll be working with a real data set about seeds, from the UCI repository: https://archive.ics.uci.edu/ml/datasets/seeds.
Task: We have seven geometrical parameters of wheat kernels and we have to group them into three different varieties of wheat: Kama, Rosa, and Canadian.
Step 1: Starting the PySpark server
Python3
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cluster').getOrCreate()
print('Spark Version: {}'.format(spark.version))
Output:
Spark Version: 3.3.1
Step 2: Load the dataset
Python3
# Load the data
dataset = spark.read.csv("seeds_dataset.csv", header=True, inferSchema=True)

# Show the first few rows of the data
dataset.show(5)
Output:
+-----+---------+-----------+-----------------+---------------+---------------------+-----------------------+
| Area|Perimeter|Compactness|Length_of_ kernel|Width_of_kernel|Asymmetry_coefficient|Length_of_kernel_groove|
+-----+---------+-----------+-----------------+---------------+---------------------+-----------------------+
|15.26| 14.84| 0.871| 5.763| 3.312| 2.221| 5.22|
|14.88| 14.57| 0.8811| 5.554| 3.333| 1.018| 4.956|
|14.29| 14.09| 0.905| 5.291| 3.337| 2.699| 4.825|
|13.84| 13.94| 0.8955| 5.324| 3.379| 2.259| 4.805|
|16.14| 14.99| 0.9034| 5.658| 3.562| 1.355| 5.175|
+-----+---------+-----------+-----------------+---------------+---------------------+-----------------------+
only showing top 5 rows
Print schema
Python3
#Print schema
dataset.printSchema()
Output:
root
|-- Area: double (nullable = true)
|-- Perimeter: double (nullable = true)
|-- Compactness: double (nullable = true)
|-- Length_of_ kernel: double (nullable = true)
|-- Width_of_kernel: double (nullable = true)
|-- Asymmetry_coefficient: double (nullable = true)
|-- Length_of_kernel_groove: double (nullable = true)
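Before assembling the features, it can help to run a quick, optional sanity check on the loaded DataFrame, for example printing the row count and summary statistics:
Python3
# Optional sanity check: row count and basic summary statistics
print('Number of rows:', dataset.count())
dataset.describe().show()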
Step 3: Combine the feature columns into a single vector column named "features" using VectorAssembler
Python3
from pyspark.ml.feature import VectorAssembler

vec_assembler = VectorAssembler(inputCols=dataset.columns,
                                outputCol='features')
final_data = vec_assembler.transform(dataset)
final_data.select('features').show(5)
Output:
+--------------------+
| features|
+--------------------+
|[15.26,14.84,0.87...|
|[14.88,14.57,0.88...|
|[14.29,14.09,0.90...|
|[13.84,13.94,0.89...|
|[16.14,14.99,0.90...|
+--------------------+
only showing top 5 rows
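In this dataset every column is a numeric feature, so `dataset.columns` can be passed to the assembler directly. If your CSV also contained a non-feature column (for example a hypothetical 'variety' label), you would exclude it first, roughly like this:
Python3
# Assemble only the numeric feature columns, skipping a hypothetical label column
feature_cols = [c for c in dataset.columns if c != 'variety']
vec_assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')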
Step 4: Scaling the data
It is good practice to scale the features so that columns measured on larger numeric ranges do not dominate the distance calculations that K-means relies on.
Python3
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features",
                        outputCol="scaledFeatures",
                        withStd=True,
                        withMean=False)

# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(final_data)

# Scale each feature to have unit standard deviation
final_data = scalerModel.transform(final_data)
final_data.select('scaledFeatures').show(5)
Output:
+--------------------+
| scaledFeatures|
+--------------------+
|[5.24452795332028...|
|[5.11393027165175...|
|[4.91116018695588...|
|[4.75650503761158...|
|[5.54696468981581...|
+--------------------+
only showing top 5 rows
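The assembler and the scaler can also be chained into a single Pipeline, which keeps the preprocessing reproducible if new data arrives. A minimal sketch, reusing the stages defined above:
Python3
from pyspark.ml import Pipeline

# Chain the vector assembler and the standard scaler into one pipeline
preprocessing = Pipeline(stages=[vec_assembler, scaler])
prep_model = preprocessing.fit(dataset)
final_data = prep_model.transform(dataset)
final_data.select('scaledFeatures').show(5)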
Step 5: Find the number of clusters using Silhouette Score
Python3
# Import the model and the evaluator
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

silhouette_score = []
evaluator = ClusteringEvaluator(predictionCol='prediction',
                                featuresCol='scaledFeatures',
                                metricName='silhouette',
                                distanceMeasure='squaredEuclidean')

for i in range(2, 10):
    kmeans = KMeans(featuresCol='scaledFeatures', k=i)
    model = kmeans.fit(final_data)
    predictions = model.transform(final_data)
    score = evaluator.evaluate(predictions)
    silhouette_score.append(score)
    print('Silhouette Score for k =', i, 'is', score)
Output:
Silhouette Score for k = 2 is 0.6650046039315017
Silhouette Score for k = 3 is 0.5928460025426588
Silhouette Score for k = 4 is 0.44804230341047074
Silhouette Score for k = 5 is 0.47760014315974747
Silhouette Score for k = 6 is 0.42900353119793194
Silhouette Score for k = 7 is 0.4419918246535933
Silhouette Score for k = 8 is 0.395868387829853
Silhouette Score for k = 9 is 0.40541652397305605
Plot the Silhouette Score graph
Python3
#Visualizing the silhouette scores in a plot
import matplotlib.pyplot as plt
plt.plot(range(2,10),silhouette_score)
plt.xlabel('k')
plt.ylabel('silhouette score')
plt.title('Silhouette Score')
plt.show()
Output:
[Plot: silhouette score vs. number of clusters k]
There is no single definitive rule for choosing the value of K. Here we move forward with k = 3, where a local maximum of the silhouette score occurs and which matches the three known wheat varieties in the dataset.
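As a complementary check before committing to a value of k, you can also look at the within-cluster sum of squared distances used by the classic elbow method. In recent PySpark versions this is exposed as the fitted model's training cost; a short sketch, assuming model.summary.trainingCost is available in your Spark version:
Python3
# Within-cluster sum of squared distances for each k (elbow method)
for i in range(2, 10):
    model = KMeans(featuresCol='scaledFeatures', k=i, seed=1).fit(final_data)
    print('k =', i, 'WSSSE =', model.summary.trainingCost)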
Step 6: Train the Model
Python3
# Trains a k-means model.
kmeans = KMeans(featuresCol='scaledFeatures',k=3)
model = kmeans.fit(final_data)
predictions = model.transform(final_data)
Print cluster centers
Python3
# Print the cluster centers
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
Output:
Cluster Centers:
[ 4.96198582 10.97871333 37.30930808 12.44647267 8.62880781 1.80062386
10.41913733]
[ 6.35645488 12.40730852 37.41990178 13.93860446 9.7892399 2.41585309
12.29286107]
[ 4.07497225 10.14410142 35.89816849 11.80812742 7.54416916 3.15411286
10.38031464]
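Because the model was trained on the scaled features, these centers are expressed in scaled units. Since the scaler used withStd=True and withMean=False, you can map them back to the original feature units by multiplying by the per-feature standard deviations stored on the fitted scaler (a small sketch):
Python3
import numpy as np

# Per-feature standard deviations learned by the StandardScaler
std = np.array(scalerModel.std.toArray())

# Undo the scaling: x_scaled = x / std, so x = x_scaled * std
print("Cluster centers in original feature units:")
for center in model.clusterCenters():
    print(center * std)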
Show the resulting cluster assignments:
Python3
predictions.select('prediction').show(5)
Output:
+----------+
|prediction|
+----------+
| 0|
| 0|
| 0|
| 0|
| 0|
+----------+
only showing top 5 rows
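To see how many kernels were assigned to each of the three groups, aggregate the prediction column:
Python3
# Count how many data points were assigned to each cluster
predictions.groupBy('prediction').count().orderBy('prediction').show()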
Step 7: End the Spark session
Python3
#End Session
spark.stop()