The document discusses clustering methods like k-means and k-medoids, which are used in unsupervised learning to group similar data points. It explains the process and algorithms for clustering and classification, including supervised, unsupervised, and semi-supervised learning, detailing their applications and advantages. Additionally, it covers Bayesian classification and discriminant functions, emphasizing their role in machine learning.
New Approach for K-mean and K-medoids Algorithm, by Editor IJCATR
The document discusses a novel approach to k-means and k-medoids clustering algorithms that improves the selection of initial centroids, leading to more stable and accurate clusters while reducing computational time. It highlights the shortcomings of traditional algorithms that randomly select centroids and explains how the proposed method systematically calculates initial centroids to enhance clustering performance. The findings emphasize the importance of effective centroid selection in optimizing clustering outcomes and efficiency in handling complex datasets.
Welcome to International Journal of Engineering Research and Development (IJERD), by IJERD Editor
This document presents a novel clustering technique that combines the traditional k-means algorithm with a Double Link Cluster Tree (DLCT) to enhance efficiency in clustering unlabelled datasets without needing prior initialization of the number of clusters. The proposed method aims to optimize cluster output by automatically determining cluster numbers based on intra and inter cluster distances, making it suitable for various data types, including documents and images. The results indicate improved clustering capabilities, particularly in unpredictable datasets, while addressing inherent limitations of existing clustering methods.
The document discusses clustering as a form of unsupervised learning in machine learning, contrasting it with supervised learning that relies on labeled data. It outlines various clustering methods such as k-means, hierarchical, and DBSCAN, along with concepts of hard and soft clustering, distance metrics, and the importance of choosing the right number of clusters. Additionally, examples of real-world applications like customer segmentation and the pros and cons of k-means are provided to illustrate its utility.
The document discusses various clustering algorithms and concepts:
1) K-means clustering groups data by minimizing distances between points and cluster centers, but it is sensitive to initialization and may find local optima.
2) K-medians clustering is similar but uses point medians instead of means as cluster representatives.
3) K-center clustering aims to minimize maximum distances between points and clusters, and can be approximated with a farthest-first traversal algorithm.
The document discusses the K-means clustering algorithm. It begins by explaining that K-means is an unsupervised learning algorithm that partitions observations into K clusters by minimizing the within-cluster sum of squares. It then provides details on how K-means works, including initializing cluster centers, assigning observations to the nearest center, recalculating centers, and repeating until convergence. The document also discusses evaluating the number of clusters K, dealing with issues like local optima and sensitivity to initialization, and techniques for improving K-means such as K-means++ initialization and feature scaling.
The document describes a k-means clustering algorithm for outlier detection in data mining. It introduces k-means clustering and its steps. A leader-follower technique is used to determine the optimal number of clusters k. The algorithm is implemented on a sample dataset to cluster data points and identify outlier clusters based on having significantly fewer points than other clusters. The results show the data points clustered into three groups, with one cluster identified as an outlier based on its smaller size.
Machine Learning, K-means Algorithm Implementation with R, by IRJET Journal
This document discusses the implementation of the K-means clustering algorithm using R programming. It begins with an introduction to machine learning and the different types of machine learning algorithms. It then focuses on the K-means algorithm, describing the steps of the algorithm and how it is used for cluster analysis in unsupervised learning. The document then demonstrates implementing K-means clustering in R by generating sample data, initializing random centroids, calculating distances between data points and centroids, assigning data points to clusters based on closest centroid, recalculating centroids, and plotting the results. It concludes that K-means clustering is useful for gaining insights into dataset structure and was successfully implemented in R.
Log Analytics in Datacenter with Apache Spark and Machine Learning, by Piotr Tylenda
The document outlines a presentation on log analytics in data centers using Apache Spark and machine learning, specifically addressing workload log management and clustering techniques. It discusses various key components in the data pipeline, clustering algorithms like k-means, and the integration of machine learning for log data analysis. Additionally, it highlights lessons learned from the implementation and potential use cases for clustering and data exploration.
Log Analytics in Datacenter with Apache Spark and Machine Learning, by Agnieszka Potulska
This document discusses using Apache Spark and machine learning for log analytics in a data center. It covers collecting workload logs in Kafka and analyzing them using Spark Streaming, ELK stack, and Spark machine learning. Key techniques discussed include TF-IDF, word2vec, k-means clustering algorithm, and visualizing clustered logs. The document provides an example PySpark pipeline for preprocessing logs, creating word embeddings, and running k-means clustering on the data.
Data science involves using scientific methods to extract knowledge from structured and unstructured data. Machine learning is a type of data science that uses examples to help computers learn without being explicitly programmed. It detects patterns in data and adjusts programs accordingly. Machine learning algorithms include supervised learning techniques like decision trees and random forests as well as unsupervised learning techniques like clustering. Hierarchical and k-means clustering are commonly used clustering algorithms. Hierarchical clustering groups objects into clusters based on their distances while k-means clustering assigns objects to k number of clusters based on their attributes.
This paper addresses the issue of the random selection of initial centroids in the k-means clustering algorithm, which affects the stability and accuracy of clustering results. It proposes a method for premeditated initialization of centroids, leading to consistent clustering outcomes and improved performance across various datasets such as iris, abalone, and wine. Experimental results demonstrate that the proposed approach outperforms the traditional k-means method in terms of clustering accuracy and stability.
The document discusses machine learning classification concepts, focusing on the iris dataset used for training and testing classifiers like k-nearest neighbours and decision trees. It covers key topics such as data partitioning, evaluation by accuracy score, and defining metrics to measure distance for k-NN. Additionally, it explains the process of building a decision tree and calculating gini impurity for multiclass classification tasks.
The document discusses data clustering, a method of grouping objects based on similarity, and outlines various clustering algorithms such as K-means and fuzzy C-means. It also addresses feature selection methods, advantages and disadvantages of these algorithms, and clustering validation techniques like the Dunn and Davies-Bouldin indices. Applications of clustering include customer segmentation, data summarization, and social network analysis.
This document discusses two types of clustering algorithms: partitional and hierarchical clustering. It provides details on K-means, a popular partitional clustering algorithm, including the pseudocode and an example. It also discusses hierarchical clustering, including different cluster distance measures, the agglomerative algorithm, and provides an example of applying the agglomerative approach. Evaluation of K-means performance using sum of squared errors is also covered.
Novel algorithms for Knowledge discovery from neural networks in Classificat..., by Dr.(Mrs).Gethsiyal Augasta
The document describes a new discretization algorithm called DRDS (Discretization based on Range Coefficient of Dispersion and Skewness) for neural networks classifiers. DRDS is a supervised, incremental and bottom-up discretization method that automates the discretization process by introducing the number of intervals and stopping criterion. It has two phases: Phase I generates an Initial Discretization Scheme (IDS) by searching globally, and Phase II refines the intervals by merging them up to a stopping criterion without affecting quality. The algorithm uses range coefficient of dispersion and data skewness to select the best interval length and number of intervals for discretization. Experimental results show DRDS effectively discretizes data for neural network classification.
1. The document describes the implementation of a K-means clustering algorithm from scratch in Python. It includes data normalization, K-means++ initialization, and evaluation using the Silhouette method.
2. Various techniques are tested to improve the algorithm, including normalization to handle differently scaled features, and K-means++ initialization to avoid poor initial centroid locations.
3. The algorithm outputs the centroid locations, a plot of Silhouette scores against K values, and a 3D plot visualizing the clustered data points and centroids.
Anomaly Detection in Temporal data Using Kmeans Clustering with C5.0, by theijes
This paper presents an algorithm for detecting anomalies in temporal data using a combination of K-means clustering and the C5.0 decision tree algorithm. The K-means algorithm is first applied to partition the dataset into clusters, after which the C5.0 decision tree is used for classification of instances as normal or anomalous. The proposed method demonstrates effective classification accuracy on the tested dataset.
The document explains k-means clustering as an unsupervised iterative technique that partitions data into k distinct clusters based on similarity. It outlines the algorithm's steps, advantages, and disadvantages, with a practical example and calculations. Additionally, it briefly describes the k-nearest neighbor (k-NN) algorithm, a supervised learning method used for classification and regression, emphasizing its operation and characteristics.
An improvement in k mean clustering algorithm using better time and accuracy, by ijpla
This document summarizes a research paper that proposes an improved K-means clustering algorithm to enhance accuracy and reduce computation time. The standard K-means algorithm randomly selects initial cluster centroids, affecting results. The proposed algorithm systematically determines initial centroids based on data point distances. It assigns data to the closest initial centroid to generate initial clusters. Iteratively, it calculates new centroids and reassigns data only if distances decrease, reducing unnecessary computations. Experiments on various datasets show the proposed algorithm achieves higher accuracy faster than standard K-means.
Optimising Data Using K-Means Clustering Algorithm, by IJERA Editor
The paper discusses the k-means clustering algorithm, a widely used method for partitioning data into a predefined number of clusters, k. It identifies limitations of the standard algorithm, particularly related to the selection of initial centroids and computational complexity, proposing an optimized version that systematically determines centroids and enhances assignment efficiency. The enhanced method aims to improve clustering accuracy while ensuring the entire process operates in O(n^2) time.
This document discusses machine learning algorithms in R. It provides an overview of machine learning, data science, and the 5 V's of big data. It then discusses two main machine learning algorithms - clustering and classification. For clustering, it covers k-means clustering, providing examples of how to implement k-means clustering in R. For classification, it discusses decision trees, K-nearest neighbors (KNN), and provides an example of KNN classification in R. It also provides a brief overview of regression analysis, including examples of simple and multiple linear regression in R.
The International Journal of Engineering and Science (The IJES), by theijes
This document summarizes a research paper that proposes a novel approach to improving the k-means clustering algorithm. The standard k-means algorithm is computationally expensive and produces results that depend heavily on the initial centroid selection. The proposed approach determines initial centroids systematically and uses a heuristic to efficiently assign data points to clusters. It improves both the accuracy and efficiency of k-means clustering by ensuring the entire process takes O(n^2) time without sacrificing cluster quality.
Clustering is an unsupervised learning process that partitions data into similar subsets, with methods including partitioning and hierarchical approaches. K-means clustering is a centroid-based algorithm that groups data points into k clusters by minimizing the distance within clusters, using distance metrics like Euclidean and Manhattan distances. Key steps in this process involve selecting initial centroids, assigning data points to clusters based on distance, and recalculating centroids until convergence.
Machine Learning with Python- Machine Learning Algorithms- K-Means Clustering Algo.pdf
1. Machine Learning with Python
Machine Learning Algorithms - K-Means Clustering
Prof. Shibdas Dutta,
Associate Professor,
DCG Data-Core Systems India Pvt Ltd,
Kolkata
2. Machine Learning Algorithms – Clustering Algo – K-Means Clustering
Introduction - K-Means Clustering
[Figure: the same data points before K-Means (ungrouped) and after K-Means (grouped into clusters) by the clustering system]
3. In general, clustering is defined as the grouping of data points such that the data points in a group are similar or related to one another and different from the data points in other groups. The goal of clustering is to determine the intrinsic grouping in a set of unlabelled data.
K-means is an unsupervised partitional clustering algorithm that groups data into k clusters by determining centroids, using the Euclidean or Manhattan method for distance calculation. It assigns each object to the cluster whose centroid is at minimum distance.
Euclidean distance formula: d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
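For orientation, here is a minimal sketch (not part of the deck) of running k-means with scikit-learn's KMeans, the same API the deck uses in its later accuracy example, applied to the small point set of the worked example that follows:

import numpy as np
from sklearn.cluster import KMeans

# the seven 2-D points used in the worked example below
points = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7],
                   [3.5, 5], [4.5, 5], [3.5, 4.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)  # final centroids
print(kmeans.labels_)           # cluster index assigned to each point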
4. ALGORITHM
1. First, initialize the number of clusters, K (the elbow method is generally used to select the number of clusters).
2. Randomly select k data points as centroids. A centroid is the imaginary or real location representing the center of a cluster.
3. Assign each data item to its closest centroid, and update each centroid's coordinates as the average of the coordinates of the items assigned to it so far.
4. Repeat the process for a number of iterations, until successive iterations cluster the data items into the same groups.
5. HOW IT WORKS?
In the beginning, the algorithm chooses k centroids in the dataset randomly after shuffling the data. Then it calculates the distance of each point to each centroid using the Euclidean distance method. Each centroid represents a cluster, and each point is assigned to the cluster of its closest centroid. At the end of the first iteration, the centroid values are recalculated, usually as the arithmetic mean of all points in the cluster. In every iteration, new centroid values are calculated until successive iterations produce the same centroid values.
Let's kick off k-means clustering from scratch with a simple example. Suppose we have the data points (1,1), (1.5,2), (3,4), (5,7), (3.5,5), (4.5,5) and (3.5,4.5), and suppose k = 2, i.e. the dataset should be grouped into two clusters. Here we use the Euclidean distance method.
6. Step 1: It is already defined that k = 2 for this problem.
Step 2: Since k = 2, we randomly select two centroids, c1 = (1,1) and c2 = (5,7).
Step 3: Now we calculate the distance of each point to each centroid using the Euclidean distance method (Pythagoras' theorem):
ITERATION 01 (c1 = (1,1), c2 = (5,7))
Point      D1 (to c1)   D2 (to c2)   Remarks
(1,1)      0            7.21         D1<D2 : (1,1) belongs to c1
(1.5,2)    1.12         6.10         D1<D2 : (1.5,2) belongs to c1
(3,4)      3.61         3.61         D1=D2 : tie, (3,4) is assigned to c1
(5,7)      7.21         0            D1>D2 : (5,7) belongs to c2
(3.5,5)    4.72         2.50         D1>D2 : (3.5,5) belongs to c2
(4.5,5)    5.32         2.06         D1>D2 : (4.5,5) belongs to c2
(3.5,4.5)  4.30         2.91         D1>D2 : (3.5,4.5) belongs to c2
7. Note: D1 and D2 are the Euclidean distances from each data point to centroids c1 and c2 respectively.
Cluster c1 now contains (1,1), (1.5,2) and (3,4), whereas cluster c2 contains (5,7), (3.5,5), (4.5,5) and (3.5,4.5). Each new centroid is the arithmetic mean of all the data items in its cluster:
C1(new) = ((1+1.5+3)/3, (1+2+4)/3) = (1.83, 2.33)
C2(new) = ((5+3.5+4.5+3.5)/4, (7+5+5+4.5)/4) = (4.125, 5.375)
8. ITERATION 02 (c1 = (1.83,2.33), c2 = (4.125,5.375))
Point      D1 (to c1)   D2 (to c2)   Remarks
(1,1)      1.56         5.37         (1,1) belongs to c1
(1.5,2)    0.46         4.27         (1.5,2) belongs to c1
(3,4)      2.03         1.77         (3,4) belongs to c2
(5,7)      5.64         1.84         (5,7) belongs to c2
(3.5,5)    3.14         0.72         (3.5,5) belongs to c2
(4.5,5)    3.77         0.53         (4.5,5) belongs to c2
(3.5,4.5)  2.73         1.07         (3.5,4.5) belongs to c2
Cluster c1 now contains (1,1) and (1.5,2), whereas cluster c2 contains (3,4), (5,7), (3.5,5), (4.5,5) and (3.5,4.5). Again, each new centroid is the arithmetic mean of all the data items in its cluster:
C1(new) = ((1+1.5)/2, (1+2)/2) = (1.25, 1.5)
C2(new) = ((3+5+3.5+4.5+3.5)/5, (4+7+5+5+4.5)/5) = (3.9, 5.1)
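To double-check the arithmetic above, here is a minimal NumPy sketch (not from the deck) that runs plain k-means on the same seven points, starting from the same initial centroids c1 = (1,1) and c2 = (5,7):

import numpy as np

points = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7],
                   [3.5, 5], [4.5, 5], [3.5, 4.5]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])  # initial c1 and c2

for iteration in range(10):
    # distances of every point to every centroid, shape (n_points, k)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)  # a tie goes to the lower index, i.e. c1
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centroids, centroids):  # centroids stopped moving
        break
    centroids = new_centroids

print(centroids)  # approximately [[1.25 1.5], [3.9 5.1]], matching the slides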
12. K-Means Clustering Code
So far, we have covered the introduction to the K-means algorithm, the mathematics behind K-means clustering, and how the Euclidean distance method is used to group the data items into K clusters.
Here we implement K-means clustering from scratch in Python.
But how do we choose the number of clusters? In this example we assign the number of clusters ourselves; later we will discuss various ways of finding the best number of clusters.
import pandas as pd
import numpy as np
import random as rd
import matplotlib.pyplot as plt
import math

class K_Means:
    def __init__(self, k=2, tolerance=0.001, max_iter=500):
        self.k = k
        self.max_iterations = max_iter
        self.tolerance = tolerance
13. We have defined a K_Means class whose __init__ uses default values of k = 2, error tolerance 0.001, and a maximum of 500 iterations.
Before diving into the code, let's recall two mathematical terms involved in K-means clustering: centroids and Euclidean distance. On a quick note, the centroid of a set of data points is their average or mean, and the Euclidean distance is the distance between two points in the coordinate plane, calculated using Pythagoras' theorem.
    def euclidean_distance(self, point1, point2):
        # sqrt((x1-x2)^2 + (y1-y2)^2 + ...), computed with NumPy
        return np.linalg.norm(point1 - point2, axis=0)

We find the Euclidean distance from each point to all the centroids. For efficiency it is better to use the NumPy function np.linalg.norm(point1 - point2, axis=0) than to hand-code the square root of the sum of squared differences.
    def fit(self, data):
        self.centroids = {}

        # the first k points of the dataset serve as the initial centroids
        for i in range(self.k):
            self.centroids[i] = data[i]
14. ASSIGNING CENTROIDS
There are various methods of assigning the k initial centroids; random selection is the most common, but let's go with the most basic approach here and assign the first k points from the dataset as the initial centroids (a k-means++-style alternative is sketched after this slide).
        for i in range(self.max_iterations):
            self.classes = {}
            for j in range(self.k):
                self.classes[j] = []

            # assign every point to the cluster of its nearest centroid
            for point in data:
                distances = []
                for index in self.centroids:
                    distances.append(self.euclidean_distance(point, self.centroids[index]))
                cluster_index = distances.index(min(distances))
                self.classes[cluster_index].append(point)
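The deck seeds the first k data points as initial centroids; the papers summarized at the top of this page stress that initialization strongly affects the result, and k-means++ is the standard remedy. Below is a sketch of a k-means++-style seeding helper (a hypothetical standalone function, not part of the deck's class), which picks each new centroid with probability proportional to the squared distance from the nearest centroid chosen so far:

import numpy as np

def kmeans_pp_init(data, k, rng=np.random.default_rng(0)):
    # hypothetical helper: k-means++-style seeding
    centroids = [data[rng.integers(len(data))]]  # first centroid: a random point
    for _ in range(k - 1):
        # squared distance of each point to its nearest chosen centroid
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centroids], axis=0)
        # sample the next centroid proportionally to those squared distances
        centroids.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(centroids)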
15. Till now, we have defined the K_Means class, initialized some default parameters, defined the Euclidean distance function, and assigned the initial k centroids. Now, in order to know which cluster each data item belongs to, we calculate the Euclidean distance from every data item to each centroid; a data item belongs to the cluster of its closest centroid.
            previous = dict(self.centroids)

            # recalculate each centroid as the mean of the points assigned to it
            for cluster_index in self.classes:
                self.centroids[cluster_index] = np.average(self.classes[cluster_index], axis=0)

            # converged when no centroid moved by more than the tolerance (in percent);
            # taking absolute values keeps positive and negative shifts from cancelling out
            isOptimal = True
            for centroid in self.centroids:
                original_centroid = previous[centroid]
                curr = self.centroids[centroid]
                if np.sum(np.abs((curr - original_centroid) / original_centroid * 100.0)) > self.tolerance:
                    isOptimal = False

            if isOptimal:
                break

At the end of each iteration, the centroid values are recalculated, usually as the arithmetic mean of all points in the cluster. New centroid values are computed in every iteration until successive iterations produce (nearly) the same centroid values.
16. CLUSTERING WITH DEMO DATA
We've now completed the K-means scratch code of this machine learning tutorial series. Now, let's test our code by clustering randomly generated data:

# generate dummy cluster datasets
# Set three centers; the model should predict similar results
center_1 = np.array([1, 1])
center_2 = np.array([5, 5])
center_3 = np.array([8, 1])

# Generate random data and center it on the three centers
cluster_1 = np.random.randn(100, 2) + center_1
cluster_2 = np.random.randn(100, 2) + center_2
cluster_3 = np.random.randn(100, 2) + center_3
data = np.concatenate((cluster_1, cluster_2, cluster_3), axis=0)

Here we have created three groups of two-dimensional data with different centers, and we set the value of k to 3. Now, let's fit the model we created:
17. k_means = K_Means(3)
k_means.fit(data)

# Plotting starts here
colors = 10 * ["r", "g", "c", "b", "k"]

# mark each centroid with an "x"
for centroid in k_means.centroids:
    plt.scatter(k_means.centroids[centroid][0], k_means.centroids[centroid][1], s=130, marker="x")

# draw every point in its cluster's colour
for cluster_index in k_means.classes:
    color = colors[cluster_index]
    for features in k_means.classes[cluster_index]:
        plt.scatter(features[0], features[1], color=color, s=30)

plt.show()

[Figure: K-Means clustering result, with the three clusters in different colours and centroids marked with an "x"]
18. CHOOSING THE VALUE OF K
While working with k-means from scratch, one thing we must keep in mind is the number of clusters k. We should make sure that we choose the optimum number of clusters for the given dataset. But this raises a question: how do we choose the optimum value of k? We use the elbow method, which is the standard way of analyzing the optimum value of k.
The elbow method is based on the principle that the sum of squared distances of every data point from its corresponding cluster centroid should be as small as possible.
STEPS FOR CHOOSING THE BEST K VALUE
1. Run the k-means clustering model for various values of k.
2. For each value of k, calculate the sum of squared distances of every data point from its corresponding cluster centroid, called the WCSS (Within-Cluster Sum of Squares).
3. Plot the value of the WCSS against the various values of k.
4. To select the value of k, choose the value at the bend (knee) of the plot, i.e. where the WCSS stops decreasing rapidly. A sketch of this procedure follows the figure below.
19. [Figure: elbow method plot of WCSS versus k; the bend (elbow) marks the best k]
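As a concrete companion to the steps above, here is a minimal sketch (not from the deck) of the elbow procedure using scikit-learn's KMeans, whose inertia_ attribute is exactly the WCSS; the dummy data mirrors the demo clusters generated earlier:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# three blobs, as in the demo-data slide
data = np.concatenate([np.random.randn(100, 2) + c
                       for c in ([1, 1], [5, 5], [8, 1])])

k_values = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
        for k in k_values]  # inertia_ is the within-cluster sum of squares

plt.plot(k_values, wcss, marker="o")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.show()  # choose k at the bend (elbow) of the curve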
20. import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans

# Load the data (data.csv is assumed to have feature columns plus a 'label' column)
df = pd.read_csv('data.csv')
X = df.drop('label', axis=1)
y = df['label']

# Create the KMeans model
kmeans = KMeans(n_clusters=3)

# Fit the model to the data
kmeans.fit(X)

# Predict the cluster index for each row
y_pred = kmeans.predict(X)

# Calculate the accuracy
accuracy = accuracy_score(y, y_pred)

# Print the accuracy
print(accuracy)

Find out the accuracy score in the K-Means clustering algo
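One caveat: k-means cluster indices are arbitrary (cluster 0 need not correspond to label 0), so the raw accuracy above can look poor even for a perfect clustering. Here is a sketch (not from the deck; it assumes y is label-encoded as integers 0..k-1) of aligning clusters to labels with the Hungarian method before scoring, reusing y and y_pred from the slide above:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix, accuracy_score

# contingency table: rows are true labels, columns are cluster indices
cm = confusion_matrix(y, y_pred)

# find the cluster-to-label matching that maximizes agreement
label_ind, cluster_ind = linear_sum_assignment(-cm)
mapping = {cluster: label for label, cluster in zip(label_ind, cluster_ind)}

y_aligned = np.array([mapping[c] for c in y_pred])
print(accuracy_score(y, y_aligned))  # accuracy after aligning cluster indices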
21. import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Create the data
X = np.random.randn(100, 2)
y = np.random.randint(0, 2, size=100)

# Fit the KMeans model
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

# Predict the cluster index for each point
y_pred = kmeans.predict(X)

# Create the confusion matrix
cm = confusion_matrix(y, y_pred)

Plotting of the confusion matrix in the K-Means clustering algo
22. # Plot the confusion matrix
import itertools

plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix')
plt.colorbar()

# KMeans has no classes_ attribute, so label the ticks with the cluster indices
tick_marks = np.arange(cm.shape[0])
plt.xticks(tick_marks, tick_marks, rotation=45)
plt.yticks(tick_marks, tick_marks)

# Write each count into its cell, in a colour that contrasts with the background
fmt = 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], fmt),
             horizontalalignment="center",
             color="white" if cm[i, j] > thresh else "black")

plt.tight_layout()
plt.ylabel('True/Actual label')
plt.xlabel('Predicted label')
plt.show()
23. PROS OF K-MEANS
1. Relatively simple to learn and understand, as the algorithm depends solely on the Euclidean method of distance calculation.
2. K-means works by minimizing the sum of squared distances, hence it guarantees convergence (to a local optimum).
3. The computational cost is O(K*n*d) per iteration, hence k-means is fast and efficient.
CONS OF K-MEANS
1. Difficulty in choosing the optimum number of clusters K.
2. K-means has problems when clusters are of different sizes, densities, or non-globular shapes.
3. K-means has problems when the data contains outliers.
4. As the number of dimensions increases, it becomes harder for the algorithm to converge, due to the curse of dimensionality.
5. If clusters overlap, k-means has no intrinsic measure of uncertainty.
24. Applications of the K-Means Clustering Algorithm
The main goals of cluster analysis are:
To get a meaningful intuition about the structure of the data we are working with.
Cluster-then-predict, where different models are built for different subgroups.
K-means clustering performs well enough to fulfil the above goals. It can be used in the following applications:
Market segmentation
Document clustering
Image segmentation
Image compression
Customer segmentation
Analysing trends in dynamic data