International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 1764
TEXT DOCUMENT CLUSTERING USING K-MEANS ALGORITHM
Dr. A. Sudha Ramkumar1, R.Nethravathy2
1Assistant Professor, Sri Kanyaka Parameswari Arts and Science College for Women, Chennai, India
2 PG Student, Sri Kanyaka Parameswari Arts and Science College for Women, Chennai, India
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract - Text document clustering is the process of
grouping related documents into clusters, so that a large
volume of text documents can be stored and analyzed.
Several algorithms exist for clustering large collections of
text documents. In this paper, the K-Means clustering
algorithm is used to cluster the text documents. A document
term matrix is constructed using the documents and all the
unique words of the documents. This matrix is highly sparse,
which introduces complexity in the clustering process.
Dimension reduction techniques can be used to reduce the
dimension of the document term matrix, which in turn reduces
the complexity of the clustering algorithm. In this paper, text
documents are clustered using three dimension reduction
(DR) techniques and the results are compared with the plain
K-Means clustering algorithm. The BBCSports dataset has
been used for the experiment. The experimental results show
that K-Means clustering with dimension reduction
outperforms K-Means clustering alone.
Key Words: Document Clustering, K-Means, Dimension
Reduction, Confusion Matrix, Preprocessing
1. INTRODUCTION
1.1 TEXT DOCUMENT CLUSTERING
Text document clustering is the process of grouping
similar documents into clusters. Text clustering is
accomplished by representing the documents as a set of
features, indexed and associated with numerical weights.
The goal is to cluster the given text documents based on
similarity measures with reasonable accuracy. During text
clustering, the documents need to be preprocessed before
the data is analyzed, and the dimensions of the vectors that
represent the documents need to be reduced.
Text document clustering is generally considered to be a
centralized process. Text document clustering may be used
for different tasks, such as grouping and analyzing similar
documents and discovering meaningful implicit subjects
across all documents. In traditional clustering methods, the
document term matrix is constructed and used with
similarity measures. Since each document contains different
terms, this document term matrix is very high dimensional
and sparse in nature. Because of this high dimensionality,
the clustering process can yield irrelevant results.
Text document clustering is used for partitioning a
collection of text documents into similar clusters based on
the distance or similarity measure. Document clustering
groups similar documents to form a coherent cluster.
However, the definition of a pair of documents being
similar or different is not always clear and normally varies
with the actual problem setting. In document clustering,
similarity is typically computed using associations and
commonalities among features, where features are
typically words and phrases. A variety of similarity or
distance measures have been proposed and widely
applied, such as cosine similarity, Jaccard coefficient,
Euclidean distance and Pearson Correlation Coefficient.
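The four measures named above can be illustrated on two small term-frequency vectors. The following is a minimal sketch, not code from the paper: the function names and the example vectors are assumptions, and the Jaccard coefficient is written in its extended (Tanimoto) form for weighted vectors, which is one common choice for text data.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two term vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard_coefficient(a, b):
    # Extended (Tanimoto) Jaccard for weighted vectors (assumed variant).
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

def euclidean_distance(a, b):
    # Straight-line distance; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_correlation(a, b):
    # Linear correlation between the two vectors' term weights.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0

# Hypothetical term-frequency vectors for two documents.
doc1 = [2, 1, 0, 1]
doc2 = [1, 1, 1, 0]
print(cosine_similarity(doc1, doc2))
print(jaccard_coefficient(doc1, doc2))
print(euclidean_distance(doc1, doc2))
print(pearson_correlation(doc1, doc2))
```

Note that the first, second and fourth functions are similarities (larger means closer), while Euclidean distance is a dissimilarity (smaller means closer).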
The aim of text document clustering is to group the
documents so that they have high intra-cluster similarity
and low inter-cluster similarity. High intra-cluster
similarity means that the documents within a cluster are
closely related to each other, whereas low inter-cluster
similarity means that documents in different clusters
differ from each other.
Clustering is the most common form of unsupervised
learning and is a major tool in a number of applications in
many fields of business and science. According to
Pankaj Jajoo [1], clustering is used for the following:
• Finding Similar Documents: This feature is often used
when the user has spotted one “good” document in a
search result and wants more-like-this. The interesting
property here is that clustering is able to discover
documents that are conceptually alike in contrast to
search-based approaches that are only able to discover
whether the documents share many of the same words.
• Organizing Large Document Collections: Document
retrieval focuses on finding documents relevant to a
particular query, but it fails to solve the problem of making
sense of a large number of uncategorized documents.
• Duplicate Content Detection: In many applications,
there is a need to find duplicates in a large number of
documents. Clustering is employed for plagiarism
detection, grouping of related news stories and to reorder
search results rankings (to assure higher diversity among
the topmost documents). Note that in such applications the
description of clusters is rarely needed.
• Recommendation System: In this application, articles are
recommended to a user based on the articles the user has
already read. Clustering the articles makes this possible in
real time and improves the quality of the results.
• Search Optimization: Clustering helps a lot in improving
the quality and efficiency of search engines, as the user
query can first be compared to the clusters instead of
directly to the documents, and the search results can also
be arranged easily.
2. RELATED WORK
Anna Huang [2], 2008, compared and analyzed the
effectiveness of similarity measures in partitional
clustering for text document datasets using the standard
K-Means algorithm. The experiments use the K-Means
algorithm with five similarity measures that have been
most commonly used in text clustering.
Charu C. Aggarwal and ChengXiang Zhai [9], 2012, provide
a detailed survey of the problem of text clustering. They
describe the key challenges of clustering as it applies to
the text domain and the key methods used for text
clustering.
Bin Tang et al. [8], 2005, compared four dimension
reduction techniques for the text document clustering
problem, using five benchmark data sets.
A. Anil Kumar and S. Chandrasekhar [4], 2008, describe in
detail the concept of preprocessing and compare
dimension reduction techniques.
Rakesh Chandra Balabantaray et al. [6], 2013, describe the
complete process of clustering and provide a comparison
of the K-Means and K-Medoids algorithms.
Twinkle Savdas and Jasmine Jha [7], 2015, present a
system to categorize text documents and form clusters
from electronic data.
Pankaj Jajoo [1], 2008, presents two approaches. The first
approach is an improvement of the graph partitioning
techniques used for document clustering. In the second
approach, the words are clustered first and then the word
clusters are used to cluster the documents. This reduces
the noise in the data and thus improves the quality of the
clusters.
D. Sailaja et al. [3], 2015, provide an overview of
pre-processing text clustering methods and introduce an
effective digital text analysis strategy using an e-mail
dataset.
Shouvik Sachdeva and Bhupendra Kastore [5], 2014, used
the “bag of words” model to represent each document and
compared the representations using various similarity
measures.
3. METHODOLOGY
Text document clustering using the K-Means clustering
algorithm follows the methodology below. The methodology
contains six phases: data collection, preprocessing,
document term matrix construction, K-Means clustering,
DR techniques and cluster evaluation.
Figure 1: Methodology of K-Means with DR technique
Document collection: The BBCSports dataset is downloaded
from the BBC website. BBCSports consists of 737 text
documents in five classes: athletics, cricket, football,
rugby and tennis. There are 101 text documents in the
athletics class, 124 in the cricket class, 265 in the
football class, 147 in the rugby class and 100 in the tennis
class. These 737 text documents are used for K-Means
clustering as well as for K-Means with dimension reduction
techniques.
Preprocessing is used for extracting information from
unstructured data. A dataset consists of a massive volume
of text documents collected from heterogeneous sources.
For efficient preprocessing of text documents, the
following techniques are used: tokenization, stopword
removal, and stemming.
Tokenization is the first step of the analysis. Its main use
is identifying the meaningful keywords.
Stopword removal reduces the text data and improves
system performance. Stopwords are words like “also”,
“and”, “or”, “can”, “this”, which occur frequently but carry
little meaning. Stemming is the process of reducing derived
words to their base or root word. For example, “jumping”,
“jumped” and “jumps” are reduced to their common root
“jump”.
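The three preprocessing steps can be sketched as a small pipeline. This is an illustrative sketch only: the stopword list is a tiny hypothetical sample, and `naive_stem` is a toy suffix stripper written for this example, not the Porter stemmer or whichever stemmer the authors used.

```python
import re

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"also", "and", "or", "can", "this", "the", "a", "is", "of"}

def tokenize(text):
    # Lowercase the text and split it into alphabetic tokens.
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    # Drop frequent words that carry little meaning.
    return [t for t in tokens if t not in STOPWORDS]

def naive_stem(token):
    # Toy stemmer: strip one common suffix if at least 3 letters remain.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Tokenization, then stopword removal, then stemming.
    return [naive_stem(t) for t in remove_stopwords(tokenize(text))]

print(preprocess("Jumping, jumped and jumps"))
```

A production pipeline would normally replace `naive_stem` with an established stemming algorithm, since simple suffix stripping mishandles many words.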
A document term matrix (or term document matrix) is a
mathematical matrix that describes the frequency of terms
that occur in a collection of documents. In a document
term matrix, rows correspond to documents in the
collection and columns correspond to terms.
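As an illustration of this layout, the sketch below builds a small document term matrix from already tokenized documents. The function name and the three toy documents are assumptions made for the example.

```python
def build_document_term_matrix(tokenized_docs):
    # Columns: every unique term across the collection, in sorted order.
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    column = {term: j for j, term in enumerate(vocabulary)}
    # Rows: one per document, holding raw term frequencies.
    matrix = [[0] * len(vocabulary) for _ in tokenized_docs]
    for i, doc in enumerate(tokenized_docs):
        for term in doc:
            matrix[i][column[term]] += 1
    return vocabulary, matrix

# Hypothetical tokenized documents.
docs = [["goal", "match", "goal"], ["wicket", "match"], ["serve", "ace", "serve"]]
vocab, dtm = build_document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)

# Fraction of zero entries, showing how sparse even a tiny matrix is.
zeros = sum(value == 0 for row in dtm for value in row)
print(zeros / (len(dtm) * len(vocab)))
```

Even in this toy case most entries are zero, which is the sparsity that motivates dimension reduction on real collections.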
K-Means clustering algorithm: The K-Means algorithm
takes the input parameter k and partitions a set of n
objects into k clusters so that the resulting intra-cluster
similarity is high whereas the inter-cluster similarity is
low. Cluster similarity is measured with regard to the mean
value of the objects in the cluster, which can be viewed as
the cluster's center of gravity. The syntactic similarity of
the terms is calculated using Euclidean distance.
Steps of K-Means clustering
1. Select k observations as initial cluster centroids
(seeds).
2. Assign each observation to the cluster that has the
closest centroid (for example, in the Euclidean
sense).
3. When all the observations have been assigned,
recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the cluster centroids no
longer change.
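The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the WEKA implementation used in the experiments: the function names, the `max_iter` and `seed` parameters, and the handling of empty clusters (the old centroid is kept) are assumptions.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k observations as the initial centroids (seeds).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each observation to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        assignments = new_assignments
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignments, centroids

# Hypothetical 2-D points forming two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignments, centroids = k_means(points, k=2)
print(assignments)
print(centroids)
```

For documents, each point would be a row of the document term matrix. The result depends on the initial seeds, so real runs typically try several seeds.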
A confusion matrix is a table that is often used to describe
the performance of a clustering model on a set of text
documents for which the true values are known. It allows
the performance of an algorithm to be visualized. In
unsupervised learning it is called a matching matrix, and it
is shown in Figure 2. In the confusion matrix, the diagonal
elements are the true positives, that is, the documents that
are relevant to that particular class, whereas the number of
documents retrieved is the sum of the True Positives (TP)
and False Positives (FP).
Figure 2: Confusion matrix
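To make the TP, FP, FN and TN counts concrete, the sketch below builds a confusion matrix and derives the four counts for one class. It assumes each cluster has already been mapped to a class label (for example by majority class), so the second list holds predicted class names. The labels and function names are hypothetical.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    # Rows are true classes, columns are predicted classes.
    labels = sorted(set(true_labels) | set(predicted_labels))
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

def per_class_counts(matrix, i):
    # TP is the diagonal cell; FN is the rest of row i; FP is the rest of column i.
    total = sum(sum(row) for row in matrix)
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp
    fp = sum(row[i] for row in matrix) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

# Hypothetical true classes and cluster-derived predictions.
true = ["cricket", "cricket", "cricket", "tennis", "tennis", "rugby"]
pred = ["cricket", "cricket", "tennis", "tennis", "tennis", "rugby"]
labels, matrix = confusion_matrix(true, pred)
print(labels)
for row in matrix:
    print(row)
print(per_class_counts(matrix, labels.index("tennis")))
```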
DR techniques: Because a large number of attributes is
selected from the dataset, the document term matrix is a
high-dimensional sparse matrix. An attribute selection
dimension reduction method is used to reduce the
dimension of the matrix. The selected attributes are
analyzed and reduced by a filter-based attribute selection
method. The InfoGain feature selection method selects
features from the original set of attributes based on a
ranking of the attributes. This reduced feature set is
applied to the K-Means clustering method. The results of
K-Means as well as K-Means with the attribute selection
DR method are validated using the evaluation metrics.
Infogain (IG) DR technique: Infogain selects many items
in pure feature sets of text documents. Infogain (IG) is an
effective feature selection method and is widely used for
text documents. Infogain does not consider the relation
between a certain feature word and a certain class, but
treats all classes in the training set as a whole. The
importance of a certain word is measured by calculating
the amount of information that the word provides about
the classes.
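As an illustration, information gain for a single term can be computed as the entropy of the class labels minus the entropy that remains after splitting the documents by whether the term is present. The sketch below uses this standard definition. The function names and the four-document example are assumptions, and WEKA's InfoGain evaluator may differ in details such as how numeric attributes are discretized.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(term_present, labels):
    # Split the documents by term presence and measure the entropy reduction.
    with_term = [y for present, y in zip(term_present, labels) if present]
    without_term = [y for present, y in zip(term_present, labels) if not present]
    n = len(labels)
    remaining = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - remaining

# Hypothetical collection: two cricket and two tennis documents.
labels = ["cricket", "cricket", "tennis", "tennis"]
wicket_present = [True, True, False, False]  # appears only in cricket documents
match_present = [True, False, True, False]   # appears equally in both classes
print(information_gain(wicket_present, labels))
print(information_gain(match_present, labels))
```

A term that separates the classes perfectly receives the highest score, while a term spread evenly across classes scores zero. Ranking terms by this score and keeping the top ones gives the reduced feature set.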
CfsSubset (CSS) DR technique: CfsSubset prefers subsets
of attributes that correlate highly with the class value and
have low correlation with each other. It evaluates the
worth of an attribute subset by considering the individual
predictive ability of each attribute along with the degree of
redundancy between them. Attribute subsets that are
highly correlated with the class while having low
intercorrelation are preferred.
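The trade-off described above is usually expressed by the merit heuristic from the correlation-based feature selection literature. The paper does not state the formula, so it is given here only as background, with symbols chosen for this illustration: $S$ is a subset of $k$ attributes, $\overline{r_{cf}}$ is the mean attribute-class correlation and $\overline{r_{ff}}$ is the mean attribute-attribute correlation.

```latex
\[
\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}
\]
```

The numerator grows when the attributes predict the class well, and the denominator grows when the attributes are redundant with one another, so a high merit corresponds to the preferred subsets described in the text.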
The search method is the structured way in which the
search space of possible attribute subsets is navigated
based on the subset evaluation. Baseline methods include
Random Search and Exhaustive Search, although graph
search algorithms such as Best First Search are popular.
The search methods used are:
• BestFirst: Uses a best-first search strategy to
navigate attribute subsets.
• GreedyStepwise: Uses a forward (additive) or
backward (subtractive) step-wise strategy to navigate
attribute subsets.
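A forward (additive) greedy step-wise search can be sketched as below. This is an illustration of the general strategy only, not WEKA's GreedyStepwise code. The `evaluate` callback stands in for a subset evaluator such as CfsSubset, and `toy_score` with its relevance values is a made-up scoring function.

```python
def greedy_forward_selection(attributes, evaluate):
    # Start from the empty subset and add one attribute at a time.
    selected = []
    best_score = evaluate(selected)
    while True:
        best_candidate = None
        for attr in attributes:
            if attr in selected:
                continue
            score = evaluate(selected + [attr])
            if score > best_score:
                best_score, best_candidate = score, attr
        if best_candidate is None:
            break  # no single addition improves the score
        selected.append(best_candidate)
    return selected, best_score

# Hypothetical relevance values; each attribute also costs a fixed penalty.
relevance = {"goal": 0.9, "wicket": 0.8, "match": 0.1}

def toy_score(subset):
    return sum(relevance[a] for a in subset) - 0.3 * len(subset)

selected, score = greedy_forward_selection(list(relevance), toy_score)
print(selected, score)
```

The backward (subtractive) variant starts from the full attribute set and removes one attribute per step instead. A best-first search additionally keeps a list of previously evaluated subsets so that it can backtrack.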
Cluster quality evaluation
K-Means clustering is applied on the dataset and the
classes-to-clusters evaluation method of the WEKA tool is
used. It generates an output in the form of a confusion
matrix (R. E. Banchs [10]).
4. EVALUATION METRICS
The evaluation metrics used in this paper are precision,
recall, f-measure and accuracy.
PRECISION
This measure is the number of correct text documents out
of the total number of text documents retrieved by the
system.
$$\text{Precision} = \frac{\text{Number of relevant documents retrieved (TP)}}{\text{Number of documents retrieved (TP + FP)}} \qquad (1)$$
RECALL
This measure is the number of correct text documents
retrieved by the system out of the number of all relevant
text documents.
$$\text{Recall} = \frac{\text{Number of relevant documents retrieved (TP)}}{\text{Number of relevant documents (TP + FN)}} \qquad (2)$$
ACCURACY
The accuracy of a measurement is how close a result comes
to the true value. Systematic error, or inaccuracy, is
quantified by the average difference (bias) between a set of
measurements obtained with the test method and a
reference value or values obtained with a reference
method.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (3)$$
F-MEASURE
This measure is a combination of the precision and recall
measures used in machine learning.
$$\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (4)$$
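Equations (1) to (4) translate directly into code. The sketch below is illustrative: the function names are assumptions and the TP, FP, FN and TN counts are invented, not taken from the experiments.

```python
def precision(tp, fp):
    # Equation (1): relevant retrieved / all retrieved.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Equation (2): relevant retrieved / all relevant.
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy(tp, fp, tn, fn):
    # Equation (3): correct decisions / all decisions.
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0

def f_measure(p, r):
    # Equation (4): harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical counts for one class.
tp, fp, fn, tn = 90, 10, 20, 80
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, accuracy(tp, fp, tn, fn), f_measure(p, r))
```

Because the F-measure is a harmonic mean, it stays close to the lower of the two values when precision and recall differ.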
RESULTS AND DISCUSSION
This paper compares K-Means clustering with K-Means
clustering using DR techniques on the BBCSports dataset,
which has five classes: athletics, cricket, football, rugby
and tennis.
From the following table, it is clear that K-Means with the
infogain (IG) DR technique is more effective than K-Means
clustering without dimension reduction techniques.
K-Means with the infogain DR technique has 97.8%
precision, 96.4% recall, 96.7% accuracy and 97% F-measure.
Table 1: Comparison of K-Means and K-Means with DR
techniques (values in %)
Method                     Precision  Recall  Accuracy  F-Measure
K-Means                    89.6       85.8    86.2      87.6
K-Means (CSS bestfirst)    95         94.2    94        94.5
K-Means (CSS greedy)       95.6       96.4    95.5      95.9
K-Means (IG)               97.8       96.4    96.7      97
Figure 3 shows the effectiveness of K-Means with the
infogain DR technique over K-Means clustering without
the DR technique.
Figure 3: Comparison of K-Means and K-Means with DR
techniques
5. CONCLUSION
The main aim of text document clustering is to group
similar documents into a cluster. This paper compares
K-Means clustering with K-Means clustering using DR
techniques. K-Means clustering with DR techniques
improves the clustering quality significantly when
compared to K-Means clustering alone. The experimental
results of K-Means and K-Means with DR techniques have
been discussed in this paper using evaluation measures
such as precision, recall, accuracy and F-measure.
REFERENCES
1. Pankaj Jajoo, Document Clustering, Indian
Institute of Technology Kharagpur, 2008.
2. Anna Huang, Similarity Measures for Text
Document Clustering, Department of Computer
Science, The University of Waikato, Hamilton,
New Zealand, 2008.
3. D. Sailaja et al., An Overview of Pre-Processing
Text Clustering Methods, International Journal
of Computer Science and Information
Technologies (IJCSIT), 2015.
4. A. Anil Kumar and S. Chandrasekhar, Text Data
Preprocessing and Dimensionality Reduction
for Document Clustering, Dept. of CSE, Sri
Sivani College of Engineering, India, 2012.
5. Shouvik Sachdeva and Bhupendra Kastore,
Document Clustering: Similarity Measures,
Indian Institute of Technology, Kanpur, 2014.
6. Rakesh Chandra Balabantaray et al., Document
Clustering using K-Means and K-Medoids,
2013.
7. Twinkle Savdas and Jasmine Jha, Document
Cluster Mining On Text Document, 2015.
8. Bin Tang et al., Comparing dimension
reduction technique for document clustering,
2005.
9. Charu C. Aggarwal and ChengXiang Zhai, A
Survey Of Text Clustering Algorithms, 2012.
10. R. E. Banchs, Text Mining with MATLAB,
Springer, 2012.
Ad

Recommended

PDF
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
cscpconf
 
PDF
Feature selection, optimization and clustering strategies of text documents
IJECEIAES
 
PDF
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTION
ijistjournal
 
PDF
Semantic Based Model for Text Document Clustering with Idioms
Waqas Tariq
 
PDF
A Survey on Sentiment Categorization of Movie Reviews
Editor IJMTER
 
PDF
Modeling Text Independent Speaker Identification with Vector Quantization
TELKOMNIKA JOURNAL
 
PPTX
Programmer information needs after memory failure
Bhagyashree Deokar
 
PPTX
Sources of errors in distributed development projects implications for colla...
Bhagyashree Deokar
 
PDF
Hybrid Classifier for Sentiment Analysis using Effective Pipelining
IRJET Journal
 
PDF
Sentiment Analysis and Classification of Tweets using Data Mining
IRJET Journal
 
PDF
H04564550
IOSR-JEN
 
PDF
8 efficient multi-document summary generation using neural network
INFOGAIN PUBLICATION
 
PDF
Performance Evaluation of Query Processing Techniques in Information Retrieval
idescitation
 
PDF
On the benefit of logic-based machine learning to learn pairwise comparisons
journalBEEI
 
PDF
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET Journal
 
PDF
EXPERT OPINION AND COHERENCE BASED TOPIC MODELING
ijnlc
 
PDF
Summarization using ntc approach based on keyword extraction for discussion f...
eSAT Publishing House
 
PDF
IRJET- Review on Information Retrieval for Desktop Search Engine
IRJET Journal
 
PDF
IRJET- Missing Value Evaluation in SQL Queries: A Survey
IRJET Journal
 
PDF
Framework for opinion as a service on review data of customer using semantics...
IJECEIAES
 
PDF
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm
ijnlc
 
PDF
Profile Analysis of Users in Data Analytics Domain
Drjabez
 
PDF
Neural Network Based Context Sensitive Sentiment Analysis
Editor IJCATR
 
PDF
03 cs3024 pankaj_jajoo
Meetika Gupta
 
PDF
A Review on Text Mining in Data Mining
ijsc
 
PDF
K0936266
IOSR Journals
 
PDF
Not Good Enough but Try Again! Mitigating the Impact of Rejections on New Con...
Aleksi Aaltonen
 
PDF
Review of Various Text Categorization Methods
iosrjce
 
PDF
50120130406022
IAEME Publication
 
PDF
Clustering Algorithm with a Novel Similarity Measure
IOSR Journals
 

More Related Content

What's hot (20)

PDF
Hybrid Classifier for Sentiment Analysis using Effective Pipelining
IRJET Journal
 
PDF
Sentiment Analysis and Classification of Tweets using Data Mining
IRJET Journal
 
PDF
H04564550
IOSR-JEN
 
PDF
8 efficient multi-document summary generation using neural network
INFOGAIN PUBLICATION
 
PDF
Performance Evaluation of Query Processing Techniques in Information Retrieval
idescitation
 
PDF
On the benefit of logic-based machine learning to learn pairwise comparisons
journalBEEI
 
PDF
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET Journal
 
PDF
EXPERT OPINION AND COHERENCE BASED TOPIC MODELING
ijnlc
 
PDF
Summarization using ntc approach based on keyword extraction for discussion f...
eSAT Publishing House
 
PDF
IRJET- Review on Information Retrieval for Desktop Search Engine
IRJET Journal
 
PDF
IRJET- Missing Value Evaluation in SQL Queries: A Survey
IRJET Journal
 
PDF
Framework for opinion as a service on review data of customer using semantics...
IJECEIAES
 
PDF
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm
ijnlc
 
PDF
Profile Analysis of Users in Data Analytics Domain
Drjabez
 
PDF
Neural Network Based Context Sensitive Sentiment Analysis
Editor IJCATR
 
PDF
03 cs3024 pankaj_jajoo
Meetika Gupta
 
PDF
A Review on Text Mining in Data Mining
ijsc
 
PDF
K0936266
IOSR Journals
 
PDF
Not Good Enough but Try Again! Mitigating the Impact of Rejections on New Con...
Aleksi Aaltonen
 
PDF
Review of Various Text Categorization Methods
iosrjce
 
Hybrid Classifier for Sentiment Analysis using Effective Pipelining
IRJET Journal
 
Sentiment Analysis and Classification of Tweets using Data Mining
IRJET Journal
 
H04564550
IOSR-JEN
 
8 efficient multi-document summary generation using neural network
INFOGAIN PUBLICATION
 
Performance Evaluation of Query Processing Techniques in Information Retrieval
idescitation
 
On the benefit of logic-based machine learning to learn pairwise comparisons
journalBEEI
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET Journal
 
EXPERT OPINION AND COHERENCE BASED TOPIC MODELING
ijnlc
 
Summarization using ntc approach based on keyword extraction for discussion f...
eSAT Publishing House
 
IRJET- Review on Information Retrieval for Desktop Search Engine
IRJET Journal
 
IRJET- Missing Value Evaluation in SQL Queries: A Survey
IRJET Journal
 
Framework for opinion as a service on review data of customer using semantics...
IJECEIAES
 
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm
ijnlc
 
Profile Analysis of Users in Data Analytics Domain
Drjabez
 
Neural Network Based Context Sensitive Sentiment Analysis
Editor IJCATR
 
03 cs3024 pankaj_jajoo
Meetika Gupta
 
A Review on Text Mining in Data Mining
ijsc
 
K0936266
IOSR Journals
 
Not Good Enough but Try Again! Mitigating the Impact of Rejections on New Con...
Aleksi Aaltonen
 
Review of Various Text Categorization Methods
iosrjce
 

Similar to IRJET- Text Document Clustering using K-Means Algorithm (20)

PDF
50120130406022
IAEME Publication
 
PDF
Clustering Algorithm with a Novel Similarity Measure
IOSR Journals
 
PDF
IRJET- A Survey of Text Document Clustering by using Clustering Techniques
IRJET Journal
 
PDF
Survey on Text Mining Based on Social Media Comments as Big Data Analysis Usi...
IJMREMJournal
 
PDF
Volume 2-issue-6-1969-1973
Editor IJARCET
 
PDF
Volume 2-issue-6-1969-1973
Editor IJARCET
 
PDF
A Competent and Empirical Model of Distributed Clustering
IRJET Journal
 
PDF
Reviews on swarm intelligence algorithms for text document clustering
IRJET Journal
 
PPTX
Text clustering
KU Leuven
 
DOC
TEXT CLUSTERING.doc
naveenchaurasia
 
PDF
Hierarchal clustering and similarity measures along
eSAT Publishing House
 
PDF
Hierarchal clustering and similarity measures along with multi representation
eSAT Journals
 
PDF
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
IJMER
 
PDF
Seeds Affinity Propagation Based on Text Clustering
IJRES Journal
 
PDF
Bl24409420
IJERA Editor
 
PDF
600 608
Editor IJARCET
 
PDF
Improved text clustering with
IJDKP
 
PPTX
Natural Language Processing
Nimrita Koul
 
PPTX
Document clustering and classification
Mahmoud Alfarra
 
PDF
IRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET Journal
 
50120130406022
IAEME Publication
 
Clustering Algorithm with a Novel Similarity Measure
IOSR Journals
 
IRJET- A Survey of Text Document Clustering by using Clustering Techniques
IRJET Journal
 
Survey on Text Mining Based on Social Media Comments as Big Data Analysis Usi...
IJMREMJournal
 
Volume 2-issue-6-1969-1973
Editor IJARCET
 
Volume 2-issue-6-1969-1973
Editor IJARCET
 
A Competent and Empirical Model of Distributed Clustering
IRJET Journal
 
Reviews on swarm intelligence algorithms for text document clustering
IRJET Journal
 
Text clustering
KU Leuven
 
TEXT CLUSTERING.doc
naveenchaurasia
 
Hierarchal clustering and similarity measures along
eSAT Publishing House
 
Hierarchal clustering and similarity measures along with multi representation
eSAT Journals
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
IJMER
 
Seeds Affinity Propagation Based on Text Clustering
IJRES Journal
 
Bl24409420
IJERA Editor
 
Improved text clustering with
IJDKP
 
Natural Language Processing
Nimrita Koul
 
Document clustering and classification
Mahmoud Alfarra
 
IRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET Journal
 
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
PDF
Kiona – A Smart Society Automation Project
IRJET Journal
 
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
PDF
Breast Cancer Detection using Computer Vision
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
Kiona – A Smart Society Automation Project
IRJET Journal
 
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
Breast Cancer Detection using Computer Vision
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Ad

Recently uploaded (20)

PDF
Rapid Prototyping for XR: Lecture 4 - High Level Prototyping.
Mark Billinghurst
 
PDF
Rapid Prototyping for XR: Lecture 6 - AI for Prototyping and Research Directi...
Mark Billinghurst
 
PPTX
retina_biometrics ruet rajshahi bangdesh.pptx
MdRakibulIslam697135
 
PPTX
Tesla-Stock-Analysis-and-Forecast.pptx (1).pptx
moonsony54
 
PPTX
Bitumen Emulsion by Dr Sangita Ex CRRI Delhi
grilcodes
 
PPTX
CST413 KTU S7 CSE Machine Learning Clustering K Means Hierarchical Agglomerat...
resming1
 
PPTX
Industrial internet of things IOT Week-3.pptx
KNaveenKumarECE
 
PDF
Structured Programming with C++ :: Kjell Backman
Shabista Imam
 
PDF
May 2025: Top 10 Read Articles in Data Mining & Knowledge Management Process
IJDKP
 
PDF
Generative AI & Scientific Research : Catalyst for Innovation, Ethics & Impact
AlqualsaDIResearchGr
 
PPT
20CE404-Soil Mechanics - Slide Share PPT
saravananr808639
 
PDF
Rapid Prototyping for XR: Lecture 5 - Cross Platform Development
IRJET- Text Document Clustering using K-Means Algorithm

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 1764

TEXT DOCUMENT CLUSTERING USING K-MEANS ALGORITHM
Dr. A. Sudha Ramkumar1, R. Nethravathy2
1 Assistant Professor, Sri Kanyaka Parameswari Arts and Science College for Women, Chennai, India
2 PG Student, Sri Kanyaka Parameswari Arts and Science College for Women, Chennai, India

Abstract - Text document clustering is the process of gathering related documents into clusters. Clusters make it easier to store and analyze large volumes of text documents. Several algorithms exist for clustering large sets of information from text documents. In this paper, the K-Means clustering algorithm is used to cluster text documents. A document term matrix is constructed from the documents and all the unique words in them. This matrix is highly sparse, which adds complexity to the clustering process. Dimension reduction techniques can be used to reduce the dimension of the document term matrix, which in turn reduces the complexity of the clustering algorithm. In this paper, text documents are clustered using K-Means with three dimension reduction (DR) techniques, and the results are compared with plain K-Means clustering. The BBCSports dataset is used for the experiments. The experimental results show that K-Means clustering with dimension reduction outperforms K-Means clustering alone.

Key Words: Document Clustering, K-Means, Dimension Reduction, Confusion Matrix, Preprocessing

1. INTRODUCTION
1.1 TEXT DOCUMENT CLUSTERING
Text document clustering is the process of grouping a similar set of documents into clusters.
Text clustering is accomplished by representing each document as a set of indexed features with associated numerical weights. The goal is to cluster the given text documents by similarity with reasonable accuracy. The documents need to be preprocessed before the data is analyzed, and the dimension of the vectors that represent the documents needs to be reduced. Text document clustering is generally considered to be a centralized process. It may be used for different tasks, such as grouping and analyzing similar documents and discovering meaningful implicit subjects across all documents. In traditional clustering methods, a document term matrix is constructed and used with similarity measures. Since each document contains different terms, this document term matrix is very high dimensional and sparse. Because of this high dimensionality, the clustering process may yield irrelevant results.

Text document clustering partitions a collection of text documents into clusters based on a distance or similarity measure. Document clustering groups similar documents to form a coherent cluster. However, what makes a pair of documents similar or different is not always clear and normally varies with the actual problem setting. In document clustering, similarity is typically computed using associations and commonalities among features, where features are typically words and phrases. A variety of similarity and distance measures have been proposed and widely applied, such as cosine similarity, the Jaccard coefficient, Euclidean distance and the Pearson correlation coefficient. Whichever measure is used, the aim of text document clustering is to group the documents.
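The four measures named above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: documents are given as equal-length term-count vectors (or, for the Jaccard coefficient, as collections of terms), the function names are hypothetical, and the set-based Jaccard variant shown here is not necessarily the formulation used in the cited work.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two term-count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard_coefficient(terms_a, terms_b):
    """Set-based Jaccard: shared terms divided by all distinct terms."""
    set_a, set_b = set(terms_a), set(terms_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

def pearson_correlation(a, b):
    """Pearson correlation coefficient between two term-count vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for constant vectors; 0.0 is a convention
    return cov / math.sqrt(var_a * var_b)
```

Cosine similarity, the Jaccard coefficient and the Pearson coefficient grow as documents become more alike, while Euclidean distance shrinks, so a clustering routine has to treat the two kinds of measure in opposite directions.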
Documents should have high intra-cluster similarity and low inter-cluster similarity. Intra-cluster similarity describes how closely related the documents within a cluster are, whereas inter-cluster similarity describes how alike the documents in different clusters are; documents in different clusters should differ from each other. Clustering is the most common form of unsupervised learning and is a major tool in a number of applications in many fields of business and science. According to Pankaj Jajoo [1], clustering is used for the following:
• Finding Similar Documents: This feature is often used when the user has spotted one “good” document in a search result and wants more like it. The interesting property here is that clustering is able to discover documents that are conceptually alike, in contrast to search-based approaches that are only able to discover whether the documents share many of the same words.
• Organizing Large Document Collections: Document retrieval focuses on finding documents relevant to a particular query, but it fails to solve the problem of making sense of a large number of uncategorized documents.
• Duplicate Content Detection: In many applications, there is a need to find duplicates in a large number of documents. Clustering is employed for plagiarism detection, grouping of related news stories and reordering search result rankings (to ensure higher diversity among the topmost documents). Note that in such applications the description of clusters is rarely needed.
• Recommendation System: In this application, articles are recommended to a user based on the articles the user has already read. Clustering the articles makes this possible in real time and improves quality.
• Search Optimization: Clustering helps a lot in improving the quality and efficiency of search engines, as the user query can first be compared to the clusters instead of directly to the documents, and the search results can also be arranged easily.

2. RELATED WORK
Anna Huang [2] (2008) compared and analyzed the effectiveness of similarity measures in partitional clustering of text document datasets using the standard K-Means algorithm. The experiments use the K-Means algorithm with five similarity measures that are most commonly used in text clustering.
Charu C. Aggarwal and ChengXiang Zhai [9] (2012) provide a detailed survey of the problem of text clustering. They describe the key challenges of clustering as it applies to the text domain and the key methods used for text clustering.
Bin Tang et al. [8] (2005) compare four dimension reduction techniques for the text document clustering problem, using five benchmark data sets.
A. Anil Kumar and S. Chandrasekhar [4] (2008) describe the concept of preprocessing in detail and compare dimension reduction techniques.
Rakesh Chandra Balabantaray et al. [6] (2013) describe the complete process of clustering and compare the K-Means and K-Medoids algorithms.
Twinkle Savdas and Jasmin Jha [7] (2015) present a system that categorizes text documents and forms clusters from electronic data.
Pankaj Jajoo [1] (2008) presents two approaches. The first improves the graph partitioning techniques used for document clustering. The second clusters the words first and then uses the word clusters to cluster the documents. This reduces the noise in the data and thus improves the quality of the clusters.
D. Sailaja et al. [3] (2015) give an overview of pre-processing methods for text clustering and introduce an effective digital text analysis strategy using an e-mail dataset.
Shouvik Sachdeva and Bhupendra Kastore [5] (2014) use the “bag of words” model to represent each document and compare the representations using various similarity measures.

3. METHODOLOGY
Text document clustering with the K-Means algorithm follows a methodology of six phases: data collection, preprocessing, document term matrix construction, K-Means clustering, DR techniques and cluster evaluation.

Figure 1: Methodology of K-Means with DR technique

Document collection: The BBCSports dataset is downloaded from the BBC website. It consists of 737 text documents in five classes: athletics, cricket, football, rugby and tennis. There are 101 text documents in the athletics class, 124 in the cricket class, 265 in the football class, 147 in the rugby class and 100 in the tennis class. These 737 text documents are used for K-Means clustering as well as for K-Means with dimension reduction techniques.

Preprocessing is used for extracting information from unstructured data. A dataset consists of a massive volume of text documents collected from heterogeneous sources. For efficient preprocessing of text documents, three techniques are used: tokenization, stopword removal and stemming. Tokenization is the first step of the analysis.
The main use of tokenization is identifying the meaningful keywords. Stopword removal reduces the text data and improves system performance. Stopwords are words like “also”, “and”, “or”, “can” and “this”, which occur frequently but carry little meaning. Stemming is the process of reducing derived words to their base or root word. For example, “jumping”, “jumped” and “jumps” are reduced to their common root “jump”.

A document term matrix, or term document matrix, is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document term matrix, rows correspond to documents in the collection and columns correspond to terms.

K-Means clustering algorithm: The K-Means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high whereas the inter-cluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in the cluster, which can be viewed as the cluster's center of gravity. The syntactical similarity of the terms is calculated using Euclidean distance.

Steps of K-Means clustering
1. Select k observations as initial cluster centroids (seeds).
2. Assign each observation to the cluster that has the closest centroid (for example, in the Euclidean sense).
3. When all the observations have been assigned, recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the cluster centroids no longer change.

A confusion matrix is a table that is often used to describe the performance of a clustering model on a set of text documents for which the true values are known. It allows the performance of an algorithm to be visualized; in unsupervised learning it is called a matching matrix. It is shown in Figure 2. In the confusion matrix, the diagonal elements are the true positives (TP), that is, the documents relevant to that particular class, whereas the number of documents retrieved for a class is the sum of the true positives (TP) and false positives (FP).

Figure 2: Confusion matrix

DR techniques: Because a large number of attributes are extracted from the dataset, the document term matrix is a high-dimensional sparse matrix. An attribute selection dimension reduction method is used to reduce the dimension of the matrix.
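As a rough illustration of the preprocessing, document term matrix and K-Means steps described above, the following plain-Python sketch runs the pipeline on four toy sentences. It is not the setup used in the experiments: the stopword list is a tiny sample, `naive_stem` is a hypothetical stand-in for a real stemmer, the seeds are passed in explicitly, and the loop stops when assignments stop changing, which implies the centroids have stopped changing as well.

```python
import math
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"also", "and", "or", "can", "this", "the", "a", "by"}

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stopwords, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

def document_term_matrix(docs):
    """Rows are documents, columns are terms, cells are term counts."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in tokenized]
    for i, doc in enumerate(tokenized):
        for t in doc:
            matrix[i][index[t]] += 1
    return matrix, vocab

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(rows, seed_indices, max_iter=100):
    # Step 1: take the chosen observations as initial centroids.
    centroids = [[float(v) for v in rows[i]] for i in seed_indices]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to its closest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: euclidean(r, centroids[c]))
            for r in rows
        ]
        # Step 4: stop once assignments (and hence centroids) are stable.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

docs = [
    "The bowler bowled the batsman",
    "Batsman and bowler scored runs",
    "The striker scored a goal",
    "Goal by the striker",
]
matrix, vocab = document_term_matrix(docs)
labels, centroids = k_means(matrix, seed_indices=[0, 2])
print(vocab)
print(labels)
```

With these seeds the two cricket-like sentences end up in one cluster and the two football-like sentences in the other. On a real collection the vocabulary has thousands of columns, which is the sparsity problem that the DR techniques address.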
The selected attributes are analyzed and reduced by a filter-based attribute selection method. The InfoGain feature selection method selects features from the original set of attributes based on a ranking of the attributes. This reduced feature set is applied to the K-Means clustering method. The results of K-Means as well as K-Means with the attribute selection DR method are validated using the evaluation metrics.

InfoGain (IG) DR technique: InfoGain selects items from the feature sets of the text documents. It is an effective feature selection method and is widely used for text documents. InfoGain does not consider the relation between a particular feature word and a particular class; it treats all classes in the training set as a whole. The importance of a word is measured by calculating the amount of information it carries about the classes.

CfsSubset (CSS) DR technique: CfsSubset favors subsets of attributes that correlate highly with the class value and have low correlation with each other. It evaluates the worth of an attribute subset by considering the individual predictive ability of each attribute along with the degree of redundancy between them. Attribute subsets that are highly correlated with the class while having low intercorrelation are preferred. The search method is the structured way in which the search space of possible attribute subsets is navigated based on the subset evaluation. Baseline methods include Random Search and Exhaustive Search, although graph search algorithms such as Best First Search are popular. The search methods used are:
• BestFirst: Uses a best-first search strategy to navigate attribute subsets.
• GreedyStepWise: Uses a forward (additive) or backward (subtractive) step-wise strategy to navigate attribute subsets.

Cluster quality evaluation: K-Means clustering is applied to the dataset and the classes-to-clusters evaluation method of the WEKA tool is used. It generates an output in the form of a confusion matrix, R. E. Banchs (2018).

4. EVALUATION METRICS
The evaluation metrics used in this paper are precision, recall, F-measure and accuracy.

PRECISION
Precision is the number of correct text documents out of the total number of text documents returned by the system.

\[ \text{Precision} = \frac{\text{Number of relevant documents retrieved (TP)}}{\text{Number of documents retrieved (TP + FP)}} \quad (1) \]
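As a small worked example of equation (1), precision can be read off a confusion matrix one class at a time: the diagonal entry is TP and the column sum is TP + FP. The sketch below assumes rows are true classes and columns are the classes assigned to clusters; the 3 x 3 matrix is made-up toy data, not a result from the experiments, and `per_class_precision` is a hypothetical helper name.

```python
def per_class_precision(confusion):
    """confusion[i][j] = documents of true class i assigned to class j."""
    n = len(confusion)
    precisions = []
    for j in range(n):
        tp = confusion[j][j]                                 # diagonal entry
        retrieved = sum(confusion[i][j] for i in range(n))   # TP + FP
        precisions.append(tp / retrieved if retrieved else 0.0)
    return precisions

# Hypothetical toy confusion matrix for three classes.
toy = [
    [8, 2, 0],
    [1, 9, 0],
    [1, 0, 9],
]
precisions = per_class_precision(toy)
print(precisions)
```

For the first class the column sum is 8 + 1 + 1 = 10, so its precision is 8 / 10 = 0.8.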
RECALL
Recall is the number of correct text documents returned by the system out of the number of all possible relevant text documents.

\[ \text{Recall} = \frac{\text{Number of relevant documents retrieved (TP)}}{\text{Number of relevant documents (TP + FN)}} \quad (2) \]

ACCURACY
The accuracy of a measurement is how close a result comes to the true value. Systematic error, or inaccuracy, is quantified by the average difference (bias) between a set of measurements obtained with the test method and a reference value or values obtained with a reference method.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \quad (3) \]

F-MEASURE
This measure is a combination of the precision and recall measures used in machine learning.

\[ \text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4) \]

RESULTS AND DISCUSSION
This paper compares K-Means clustering and K-Means clustering with DR techniques on the BBCSports dataset, which has five classes: athletics, cricket, football, rugby and tennis. Table 1 shows that K-Means with the InfoGain (IG) DR technique is more effective than K-Means clustering without dimension reduction. K-Means with the InfoGain DR technique achieves 97.8% precision, 96.4% recall, 96.7% accuracy and 97% F-measure.

Table 1: Comparison of K-Means and K-Means with DR techniques (all values in %)

Method                     Precision   Recall   Accuracy   F-Measure
K-Means                    89.6        85.8     86.2       87.6
K-Means (CSS bestfirst)    95          94.2     94         94.5
K-Means (CSS greedy)       95.6        96.4     95.5       95.9
K-Means (IG)               97.8        96.4     96.7       97

Figure 3 shows the effectiveness of K-Means with the InfoGain DR technique over K-Means clustering without a DR technique.

Figure 3: Comparison of K-Means and K-Means with DR techniques

5. CONCLUSION
The main aim of text document clustering is to group similar documents into clusters. This paper compares K-Means clustering with K-Means clustering using DR techniques. K-Means clustering with DR techniques improves the clustering quality compared to plain K-Means clustering: accuracy rises from 86.2% to between 94% and 96.7%, depending on the DR technique. The experimental results of K-Means and K-Means with DR techniques have been discussed in terms of precision, recall, accuracy and F-measure.

REFERENCES
1. Pankaj Jajoo, Document Clustering, Indian Institute of Technology Kharagpur, 2008.
2. Anna Huang, Similarity Measures for Text Document Clustering, Department of Computer Science, The University of Waikato, Hamilton, New Zealand, 2008.
3. D. Sailaja et al., An Overview of Pre-Processing Text Clustering Methods, International Journal of Computer Science and Information Technologies (IJCSIT), 2015.
4. Anil Kumar and S. Chandrasekar, Text Data Preprocessing and Dimensionality Reduction for Document Clustering, Dept. of CSE, Sri Sivani College of Engineering, India, 2012.
5. Shouvik Sachdeva and Bhupendra Kastore, Document Clustering: Similarity Measures, Indian Institute of Technology, Kanpur, 2014.
6. Rakesh Chandra Balabantaray et al., Document Clustering using K-Means and K-Medoids, 2013.
7. Twinkle Savdas and Jasmine Jha, Document Cluster Mining on Text Document, 2015.
8. Bin Tang et al., Comparing Dimension Reduction Techniques for Document Clustering, 2005.
9. Charu C. Aggarwal and ChengXiang Zhai, A Survey of Text Clustering Algorithms, 2012.
10. R. E. Banchs, Text Mining with MATLAB, Springer, 2012.