International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 12 | Dec 2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 235
Review of Existing Methods in K-means Clustering Algorithm
Sonu Pandey1, Lokendra kumar Tiwari2
1M.Tech Scholar, Department of Computer Science & Engineering, KNIT Sultanpur
2Assistant Professor, Ewing Christian College Allahabad
-------------------------------------------------------------------------***------------------------------------------------------------------------
Abstract - K-means is one of the most popular and important
algorithms for data clustering. It tries to group data of similar
types together from a large data set using a brute-force strategy of
repeated calculations. With the advancement of technology, data in
many domains is generated at ever higher rates and reaches sizes
greater than petabytes. A significant amount of this information
consists of unstructured, semi-structured, or structured documents
spread across the network (e.g., images, audio, video, spreadsheets,
PDFs) that contain answers that can help us create new products,
refine existing products, and improve customer relations. This gave
rise to large data sets and to the challenges of Big Data, which
generally suffers from the 3V (Volume, Velocity, and Variety)
problems. Hadoop is an open-source framework designed to overcome the
3V challenges. Using Hadoop with K-Means results in faster processing
of large and complex data sets. However, the traditional K-Means
algorithm must be given arbitrary preliminary centroids, and how
quickly it converges depends heavily on that set of preliminary
centroids. In this paper we propose a method that takes a set of
preliminary centroids calculated over Hadoop and then runs the
K-Means algorithm. The convergence criterion is reached earlier in
most cases, which improves the efficiency and accuracy of the
algorithm.
Key Words: Data Mining, K-Means clustering, arbitrary
preliminary centroids, improved preliminary centroids,
Hadoop, MapReduce.
1. INTRODUCTION
With the development and improvement of data mining technology, data
clustering algorithms are gradually being applied in a number of
fields. The definition of clustering in the academic community can be
generalized along three lines. First, the similarity of data objects:
data objects within the same cluster are highly similar, whereas data
objects in different clusters are highly dissimilar. Second, the
distance between data objects: treating the entire data set as a
collection of test data objects, the distance between any pair of
data objects within the same cluster should not be greater than the
distance between data objects in different clusters. Third, the
density of data objects: treating the entire data set as a collection
of data objects in a multi-dimensional space, a cluster is a region
that contains a relatively high density of data objects, cut off from
other clusters by regions that contain a relatively low density of
data objects. The clusters thus form relatively separated regions of
the space.
K-means algorithms [3, 4, 11] have been widely used to produce such
clusters. The running time of the traditional K-Means clustering
algorithm [4] depends heavily on the data set: if the data set is
very large, the algorithm takes more time to reach convergence.
Moreover, in most cases the results depend on the choice of the
arbitrary preliminary centroids. Quite a few attempts have been made
by researchers [14] to improve the overall result of K-Means
clustering [11, 13].
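The dependence on preliminary centroids can be illustrated with a minimal sketch of the standard K-means iteration. This is a plain single-machine version written for illustration, not the method proposed in this paper; the function names, the toy points, and the two starting sets are assumptions.

```python
import math

def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, max_iter=100):
    """Plain K-means. Returns (final centroids, number of iterations run)."""
    centroids = [tuple(c) for c in centroids]
    for iteration in range(1, max_iter + 1):
        # Assignment step: group the points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:  # converged: no centroid moved
            return centroids, iteration
        centroids = new_centroids
    return centroids, max_iter

# Toy data (assumed): two well-separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
good_start = [(1.0, 1.0), (8.0, 8.0)]  # one preliminary centroid per group
poor_start = [(1.0, 1.0), (1.0, 2.0)]  # both preliminary centroids in one group

good_centroids, good_iterations = kmeans(points, good_start)
poor_centroids, poor_iterations = kmeans(points, poor_start)
print(good_iterations, poor_iterations)
```

On this toy data both starting sets reach the same final centroids, but the poorly placed start needs one more iteration (3 instead of 2). On larger data sets a poor start may also end in a different, worse clustering.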
In this paper we propose a technique to improve accuracy and
efficiency by producing preliminary centroids for k-means clustering
over Apache™ Hadoop [6], harnessing the power of parallel computing
together with the clustering technique.
1.1 HADOOP-COMPUTATION AND STORAGE
SOLUTION
Dealing with “Big Data” requires inexpensive, reliable storage and a
new tool for analyzing structured, unstructured, and semi-structured
data. Apache Hadoop addresses both of these problems. Because Hadoop
is built on the MapReduce concept, it distributes and parallelizes
data processing across many nodes in a compute cluster, speeding up
large computations and hiding I/O latency through increased
concurrency. It is well suited to large-scale data processing such as
searching and indexing massive data sets.
1.2 HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
HDFS was originally built as infrastructure for the Apache Nutch web
search engine project and is now an Apache Hadoop subproject. HDFS
has a master/slave architecture and is suitable for applications that
have large data sets. HDFS maintains the metadata on a dedicated
server called the NameNode, and the application data are kept on
separate nodes called DataNodes. These server nodes are fully
connected and communicate using TCP-based protocols.
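The split between metadata and application data can be pictured with a toy in-memory model. This is only a sketch under stated assumptions: the class and function names, the 4-byte block size, and the round-robin placement are hypothetical, replication is left out, and none of this is the HDFS API.

```python
class ToyNameNode:
    """Keeps metadata only: which blocks make up a file and where each one lives."""
    def __init__(self):
        self.block_map = {}  # path -> list of (block id, DataNode name)

class ToyDataNode:
    """Keeps application data only: the contents of the blocks it stores."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

def write_file(namenode, datanodes, path, data, block_size=4):
    """Split data into blocks, store them on DataNodes, record locations on the NameNode."""
    entries = []
    for number, offset in enumerate(range(0, len(data), block_size)):
        node = datanodes[number % len(datanodes)]  # hypothetical round-robin placement
        block_id = f"{path}#{number}"
        node.blocks[block_id] = data[offset:offset + block_size]
        entries.append((block_id, node.name))
    namenode.block_map[path] = entries

def read_file(namenode, datanodes, path):
    """Look up block locations on the NameNode, then fetch each block from its DataNode."""
    by_name = {node.name: node for node in datanodes}
    return b"".join(by_name[name].blocks[block_id]
                    for block_id, name in namenode.block_map[path])

namenode = ToyNameNode()
datanodes = [ToyDataNode("dn1"), ToyDataNode("dn2")]
payload = b"1.0,2.0\n9.0,8.0\n"
write_file(namenode, datanodes, "/points.txt", payload)
print(namenode.block_map["/points.txt"])
print(read_file(namenode, datanodes, "/points.txt") == payload)
```

The NameNode object never holds file contents, which mirrors the division of responsibility described above.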
2. K-MEANS CLUSTERING OVER HADOOP
The input provided to K-Means over Hadoop [10] is given as
<key, value> pairs, where the key is the centroid and the value is a
serialized data object that needs to be clustered. These keys and
values are maintained in HDFS in separate files. The centroid file
contains the preliminary centers, either entered by the user or
selected arbitrarily from the data objects to be clustered. These
centers form the ‘key’ of each <key, value> pair during the Mapper
phase.
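A minimal sketch of how such pairs might be formed is given here. The comma-separated line format, the file contents, and the function names are assumptions made for illustration; they are not taken from [10].

```python
import math

def parse_vector(line):
    """Parse one comma-separated line, e.g. '1.0,2.0', into a tuple of floats."""
    return tuple(float(x) for x in line.strip().split(","))

def kmeans_map(point, centroids):
    """Emit one <key, value> pair: key is the nearest centroid, value is the point."""
    key = min(centroids, key=lambda c: math.dist(point, c))
    return key, point

# Stand-ins for the two HDFS files (hypothetical contents and format).
centroid_file = ["1.0,1.0", "8.0,8.0"]
data_file = ["1.0,2.0", "9.0,8.0", "2.0,1.0"]

centroids = [parse_vector(line) for line in centroid_file]
pairs = [kmeans_map(parse_vector(line), centroids) for line in data_file]
for key, value in pairs:
    print(key, value)
```

Because every pair carries its centroid as the key, all objects assigned to the same centroid can later be brought together by grouping on that key.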
MapReduce operates as follows:
i. Input: An application built on the Hadoop MapReduce
framework supplies a pair of Map and Reduce functions
that implement the appropriate interface or abstract
class, and it also specifies the input and output
locations and other operating parameters.
ii. Map: The framework treats the application's input as
a set of key-value pairs <key, value>. In the Map stage,
the framework calls the user-defined Map function to
process each <key, value> pair, generating a new batch
of intermediate <key, value> pairs.
iii. Shuffle: To ensure that the Map output arriving at
each Reduce is sorted, in the Shuffle stage the
framework uses HTTP to fetch the relevant <key, value>
pairs from the Map outputs for each Reduce, and the
MapReduce framework groups the input of the Reduce
phase by key.
iv. Reduce: This phase traverses the intermediate data
for each unique key and executes the user-defined
Reduce function. The input is <key, {a list of
values}>, and the output is a new set of <key, value>
pairs.
v. Output: This stage writes the results of the Reduce
to the specified output directory.
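The Map, Shuffle, and Reduce stages listed above can be sketched as a small in-memory simulation. This is an illustrative single-process model, not the Hadoop API: the function names and the 1-D centroids are assumptions, input keys are omitted for brevity, and the HTTP transfer between nodes is replaced by a dictionary.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle, and reduce over in-memory records and return the output pairs."""
    # Map: each input record may produce any number of intermediate pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))
    # Shuffle: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: call the user function once per unique key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

CENTROIDS = [2.0, 10.0]  # hypothetical 1-D preliminary centroids

def assign(point):
    """Map: emit <index of nearest centroid, point>."""
    index = min(range(len(CENTROIDS)), key=lambda i: abs(point - CENTROIDS[i]))
    yield index, point

def recompute(index, points):
    """Reduce: emit <index, mean of the points assigned to that centroid>."""
    yield index, sum(points) / len(points)

result = run_mapreduce([1.0, 2.0, 3.0, 9.0, 11.0, 13.0], assign, recompute)
print(result)
```

One call to `run_mapreduce` corresponds to one K-means iteration. A driver would repeat the job with the new centroids until they stop changing.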
3. REVIEW OF DIFFERENT DATA MINING
TECHNIQUES
An attempt has been made to study and critically examine the
available findings of previous research, and to review and summarize
the salient features relevant to the present work so that they can
serve as background literature for this paper.
i. Pham et al. (2004) proposed new measures to assist in
selecting the number of clusters, and concluded with an
analysis of the results of using the proposed measure
to determine the number of clusters for the k-Means
algorithm on different data sets.
ii. Fahim et al. (2006) presented a simple and
efficient clustering algorithm based on the k-means
algorithm, which they call the enhanced k-means
algorithm. It is easy to implement, requiring only a
simple data structure to keep some information from
each iteration for use in the next iteration.
Experimental results demonstrated that the scheme can
improve the computational speed of the k-Means
algorithm in terms of both the total number of
distance calculations and the overall computation time.
iii. Deelers et al. (2007) gave an algorithm to calculate
preliminary cluster centers for k-Means clustering.
Data in a cell is partitioned using a cutting plane
that divides the cell into two smaller cells. The plane
is perpendicular to the data axis with the highest
variance and is intended to reduce the sum-squared
errors of the two cells as much as possible, while at
the same time keeping the two cells as far apart as
possible. Cells are partitioned one at a time until the
number of cells equals the predefined number of
clusters K. The experimental results show that the
proposed algorithm is efficient and leads to better
clustering results than random initialization. The
research also indicated that the proposed algorithm
would improve the chances of every cluster containing
some data.
iv. Sebastian et al. (2009) noted that several methods
have been proposed in the literature for improving the
performance of the k-Means clustering algorithm. Their
paper proposes a method for making the algorithm more
effective and efficient, so as to get better clustering
with reduced complexity.
v. Chen et al. (2009) offered a new clustering method
based on k-Means that avoids the randomness of the
initial centers. The work focuses on the k-Means
algorithm and improves it with respect to its
dependence on the initial value of k. First, the
initial number of clusters is set to N. Second, the
categories are combined by applying a sub-merger
strategy. The algorithm does not require the user to
give the number of clusters in advance. Experiments on
artificial datasets showed considerable improvements in
clustering accuracy in comparison with random k-Means.
vi. Pakhira et al. (2009) presented a modified version
of the k-means algorithm that efficiently eliminates
the empty cluster problem. They showed that the
modified algorithm is semantically equivalent to the
traditional k-Means and that the modification
introduces no performance degradation. Results of
simulation experiments on several datasets support
the claim.
vii. Gupta et al. (2010) proposed an algorithm to
automatically determine the number of clusters in a
given input data set, under a mixture of Gaussians
assumption. The algorithm extends the
Expectation-Maximization clustering approach by
starting with a single-cluster assumption for the
data and recursively splitting one of the clusters in
order to find a tighter fit. An information criterion
is used to choose between the current and the previous
model after each split. The approach builds upon prior
work on both the k-Means and Expectation-Maximization
algorithms. The algorithm is extended using a cluster
splitting approach based on Principal Direction
Divisive Partitioning, which improves accuracy and
efficiency.
viii. Yedla et al. (2010) proposed a new technique for
finding improved preliminary centroids and an
efficient way of assigning the data points to
appropriate clusters with reduced time complexity.
According to the evaluated results, the modified
algorithm is more accurate and less time-consuming
than the original k-means clustering algorithm.
ix. Ren et al. (2012) analyzed Hadoop workloads from
three different clusters from an application-level
perspective, with two goals: (a) explore new issues
in application patterns and user behavior and (b)
understand key performance challenges related to
I/O and load balancing. They used execution logs from
three Hadoop clusters used for academic research
(OPENCLOUD, M45, and WEB MINING) and studied the job
performance, configurations, and user history files
of these clusters. These new Hadoop cluster traces
contain richer information than previous studies
because they record application patterns and user
behaviors, which are critical for understanding the
requirements and performance of big-data systems.
Easing the use of Hadoop and improving system designs
in response to changing use cases are crucial
directions for future research.
x. Dittrich et al. (2012) state the need of many
organizations, companies, and researchers to deal
efficiently with big data volumes, in areas that
include web analytics applications, scientific
applications, and social networks. A popular data
processing engine for big data is Hadoop MapReduce.
Earlier versions of Hadoop MapReduce suffered from
various performance problems. Various strategies can
be used with Hadoop MapReduce jobs to boost
performance by orders of magnitude. Dittrich briefly
familiarizes the audience with Hadoop MapReduce,
motivates its use for big data processing, and focuses
on different data management techniques, going from
job optimization to physical data organization such as
data layouts and indexes. Throughout, similarities and
differences between Hadoop MapReduce and parallel
DBMSs are discussed.
xi. Jain et al. (2012) proposed a new hybrid
algorithm based on the k-Means and k-Harmonic Means
approaches. Its performance is compared with that of
the traditional K-means and K-harmonic means
algorithms. The results obtained from the proposed
hybrid algorithm are considerably better than those of
the traditional K-means and K-harmonic means
algorithms.
xii. Kane et al. (2012) proposed a new, efficient
approach to determine the number of clusters
based on the volume of a cluster by comparing it
with a fixed threshold.
xiii. Zhang and Fang (2013) introduce the idea of the
k-means clustering algorithm, analyze the advantages
and disadvantages of the traditional k-means
clustering algorithm, and elaborate a method of
improving the k-means clustering algorithm based on
improving the initial focal point and determining the
K value. Experimental results show that the improved
clustering algorithm is more stable in the clustering
process. At the same time, the improved clustering
algorithm reduces or even avoids the impact of noise
data in the dataset, ensuring that the final
clustering result is more accurate and effective.
xiv. Kodinariya and Makwana (2013) explored six
different approaches to determine the right number
of clusters in a dataset. Various methods have been
proposed to estimate the number of clusters, such as
statistical indices, variance-based methods,
information-theoretic methods, and goodness-of-fit
methods.
xv. Anchalia et al. (2013) discussed the
implementation of the K-Means clustering algorithm
over a distributed environment using Apache™ Hadoop.
They designed Mapper and Reducer routines for
processing the dataset, and HDFS was used to store the
dataset before and after processing. The Mapper takes
its input as a <key, value> pair, where the key is the
center of a cluster and the value is a serializable
representation of the dataset. The initial set of
centers is stored on HDFS before the Map function is
called and serves as the key of the <key, value> pair.
The Mapper is designed so that it computes the
distance between the vector value and each of the
cluster centers in the cluster set, while keeping
track of the cluster to which the given vector is
closest.
xvi. Revathi and Nalini (2013) presented a
comparative study of clustering algorithms across
two different data items. The results of the various
clustering algorithms are compared based on the time
taken to form the estimated clusters. Based on the
experimental results, it can be concluded that the
time taken to form the clusters increases as the
number of clusters increases. The farthest-first
clustering algorithm takes very little time to cluster
the data items, whereas simple k-Means takes the
longest time to perform clustering.
xvii. Shah Neepa (2014) discussed the importance of
document clustering, which arises from the massive
volumes of textual documents being created. With the
continuing development of information technology, data
sets in many domains are growing beyond peta-scale,
making it difficult to run document clustering
algorithms at a central site and increasing the
computational requirements. Parallel computing
concepts were introduced to extend document
clustering, which later led to distributed document
clustering. Distributed document clustering using
Hadoop and MapReduce is proposed. K-means was first
tested on a single node, and the mapper and reducer
functions were then modified to run over a cluster of
three machines. Datasets consisting of 20,000
documents (20-newsgroups) and 21578 documents were
tested. Results showed that the required time was
reduced after the addition of more nodes.
xviii. Duggal et al. (2015) observe that we live in an
on-demand, on-command digital universe, with data
proliferating from institutions, individuals, and
machines at a very high rate. This data has become
known as "Big Data" due to its sheer Volume, Variety,
Velocity, and Veracity. The data may be unstructured,
structured, or semi-structured, and it is
heterogeneous in nature. The scale and heterogeneity
of the data, together with the rapid rate at which it
is generated, make it difficult for the present
computing infrastructure to manage Big Data.
Conventional data management, warehousing, and
analysis systems fall short of tools to analyze this
data. The authors suggest various methods for
addressing these problems through the MapReduce
framework over the Hadoop Distributed File System
(HDFS). MapReduce is an important method that makes
use of file indexing with mapping, sorting, shuffling,
and finally reducing.
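The partitioning idea summarized in item iii (Deelers et al.) can be sketched as follows. This is a simplified illustration rather than the published algorithm: it chooses the cut that minimizes the combined sum-squared error only, and it assumes that the cell with the largest error is split next and that the mean of each final cell is used as a preliminary centroid. All function names are hypothetical.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def sse(cell):
    """Sum of squared distances from each point in the cell to the cell mean."""
    dims = range(len(cell[0]))
    means = [sum(p[d] for p in cell) / len(cell) for d in dims]
    return sum((p[d] - means[d]) ** 2 for p in cell for d in dims)

def split_cell(cell):
    """Split a cell with a plane perpendicular to its highest-variance axis."""
    dims = range(len(cell[0]))
    axis = max(dims, key=lambda d: variance([p[d] for p in cell]))
    ordered = sorted(cell, key=lambda p: p[axis])
    # Try every cut position and keep the one with the lowest combined SSE.
    cut = min(range(1, len(ordered)), key=lambda i: sse(ordered[:i]) + sse(ordered[i:]))
    return ordered[:cut], ordered[cut:]

def initial_centroids(points, k):
    """Partition points into at most k cells and return the mean of each cell."""
    cells = [list(points)]
    while len(cells) < k:
        splittable = [c for c in cells if len(c) > 1]
        if not splittable:
            break  # fewer distinct cells are possible than k
        worst = max(splittable, key=sse)  # assumption: split the worst cell next
        cells.remove(worst)
        cells.extend(split_cell(worst))
    return [tuple(sum(x) / len(c) for x in zip(*c)) for c in cells]

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
seeds = initial_centroids(points, 2)
print(seeds)
```

On this toy data the two preliminary centroids fall at the centers of the two groups, so a subsequent K-means run would start close to its converged solution.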
4. CONCLUSION
This paper presents a new and easy technique to generate a
preliminary set of centroids for improving the efficiency and
accuracy of K-Means on one-dimensional data sets. Because K-Means is
highly sensitive to the set of preliminary centroids, the proposed
method reduces the number of iterations needed to reach convergence.
The overall execution time for a K-Means clustering job to finish is
also reduced. This makes the technique useful for large data sets
that would generally require a large amount of time to reach
convergence, as has been observed with arbitrarily selected initial
centroids. The results may vary for different data sets.
REFERENCES
[1] Abhijit Kane (2012). Determining the number of
Clusters for a K-Means Clustering Algorithm, Indian
Journal of Computer Science and Engineering (IJCSE),
Vol. 3 No.5 Oct-Nov 2012.
[2] Meenakshi, Poonam Yadav (2016). A survey paper on K-
means clustering using hadoop, IJRAET V-4 I-2.
[3] Chunfei Zhang and Zhiyi Fang (2013). An Improved K-
means Clustering Algorithm, Journal of Information &
Computational Science 10: 1 (2013) 193–199.
[4] J. B. MacQueen (1967). Some Methods for classification
and Analysis of Multivariate Observations, Proceedings
of 5-th Berkeley Symposium on Mathematical Statistics
and Probability, Berkeley, University of California
Press, 1:281-297.
[5] Jens Dittrich and Jorge Arnulfo Quiané-Ruiz (2012).
Efficient Big Data Processing in Hadoop MapReduce,
Very Large Data Bases, Vol. 5, No. 12, 2012.
[6] K. A. Abdul Nazeer and M. P. Sebastian (2009).
Improving the Accuracy and Efficiency of the k-means
Clustering Algorithm, Proceedings of the World
Congress on Engineering 2009, Vol I WCE 2009, July 1 -
3, 2009, London, U.K.
[7] Yashika Verma, Sumit Kumari (2013). Study and
analysis on Document Clustering Based on MapReduce
in Hadoop using K-mean Algorithm, International
Journal of Science and Research (IJSR) ISSN (Online):
2319-7064 Index Copernicus Value (2013): 6.14 |
Impact Factor (2013): 4.438.
[8] Kohei Arai and Ali Ridho Barakbah (2007). Hierarchical
K-means: an algorithm for centroids initialization for
K-means, Rep. Fac. Sci. Engrg., Saga Univ. 36-1 (2007),
25-31.
[9] Likas, N. Vlassis and J.J. Verbeek (2003). The Global k-
means Clustering algorithm, Pattern Recognition,
Volume 36, Issue 2, 2003, pp. 451-461.
[10] P Anchalia, Koudinya, Srinath (2013). MapReduce
Design of K-Means Clustering Algorithm.
[11] Fahim A. M., Salem A. M., F.A. Torkey and M.A. Ramadan
(2006). An Efficient enhanced k-means clustering
algorithm, Journal of Zhejiang University, 10(7):
1626-1633.
[12] Fang Yuan, Zeng-Hui Meng, Hong-Xia Zhang, Chun-Ru
Dong (2004). A New Algorithm To Get The Initial
Centroids, Proceedings of the Third International
Conference on Machine Learning and Cybernetics,
Shanghai, 26-29 August 2004.
[13] S. Deelers and S. Auwatanamongkol (2007). Enhancing
K-Means Algorithm with Initial Cluster Centers Derived
from Data Partitioning along the Data Axis with the
Highest Variance, World Academy of Science,
Engineering and Technology Vol:1 2007-11-27.
[14] Revathi and Dr. T. Nalini (2013). Performance
Comparison of Various Clustering Algorithm, IJARCSSE,
Volume 3, Issue 2, February 2013.
IJECEIAES
 
Time and Reliability Optimization Bat Algorithm for Scheduling Workflow in Cloud
Time and Reliability Optimization Bat Algorithm for Scheduling Workflow in Cloud
IRJET Journal
 
Experimental study of Data clustering using k- Means and modified algorithms
Experimental study of Data clustering using k- Means and modified algorithms
IJDKP
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
Mahantesh Angadi
 

IRJET- Review of Existing Methods in K-Means Clustering Algorithm

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 05 Issue: 12 | Dec 2018 www.irjet.net p-ISSN: 2395-0072
© 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 235

Review of Existing Methods in K-means Clustering Algorithm

Sonu Pandey1, Lokendra Kumar Tiwari2
1M.Tech Scholar, Department of Computer Science & Engineering, KNIT Sultanpur
2Assistant Professor, Ewing Christian College Allahabad

Abstract - The K-means algorithm is one of the most popular and important algorithms for data clustering. It attempts to group similar data from a large data set into clusters through a brute-force strategy of repeated calculations. With advances in technology, data in many domains is generated at ever higher rates, reaching sizes beyond petabytes. A significant amount of this information is held in unstructured, semi-structured or structured documents spread across the network (e.g., images, audio, video, spreadsheets, PDFs) that contain answers that can help us create new products, refine existing products and improve customer relations. This gave rise to large data sets and to the challenges of Big Data, which generally suffers from the 3V (Volume, Velocity and Variety) problems. Hadoop is an open source framework designed to overcome the 3V challenges. Using Hadoop with K-Means results in faster processing of large and complex data sets. However, the traditional K-Means algorithm must be given arbitrary preliminary centroids, and its convergence depends heavily on that set of preliminary centroids.
In this paper we propose a method that computes a set of preliminary centroids over Hadoop and then runs the K-Means algorithm from them. The convergence criterion is reached earlier in most cases, which improves the efficiency and accuracy of the algorithm.

Key Words: Data Mining, K-Means clustering, arbitrary preliminary centroids, improved preliminary centroids, Hadoop, MapReduce.

1. INTRODUCTION

With the development and improvement of data mining technology, data clustering algorithms are gradually being applied in a number of fields. The definition of clustering in the academic community can be generalized as follows.

First, the similarity of data objects: data objects within the same cluster are highly similar, whereas data objects in different clusters are highly dissimilar.

Second, the distance of data objects: taking the entire data set as a collection of data objects, the distance between any pair of data objects within the same cluster should not be greater than the distance between data objects in different clusters.

Third, the density of data objects: taking the entire data set as a collection of data objects in a multi-dimensional space, a cluster is a region that contains a relatively high number of data objects and is cut off from other clusters by regions that contain a relatively low number of data objects. The clusters thus form relatively separated regions of the space.

K-means algorithms [3, 4, 11] have been used to produce such clusters. The running time of the traditional K-Means clustering algorithm [4] depends largely on the data set: the larger the data set, the longer the algorithm takes to converge. Moreover, in most cases the result depends on the choice of the arbitrary preliminary centroids. Researchers have made quite a few attempts [14] to improve the overall result of K-Means clustering [11, 13].
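To make the dependence on preliminary centroids concrete, the following is a minimal sketch of plain K-Means on one-dimensional data. It is not the method proposed in this paper; the function name `kmeans_1d`, the stopping rule (centroids unchanged between passes) and the toy data are assumptions chosen only for illustration.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain Lloyd-style K-Means on 1-D data (illustrative sketch only).

    Returns the final centroids and the number of passes that were run.
    """
    centroids = list(centroids)
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            # No centroid moved, so the assignment is stable.
            return centroids, iteration
        centroids = updated
    return centroids, max_iter


# Hypothetical toy data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
good_start = kmeans_1d(points, [2.0, 11.0])  # centroids already well placed
poor_start = kmeans_1d(points, [1.0, 2.0])   # both centroids in one group
```

On this toy data both starts end at the same centroids, but the well-placed start stops after a single pass while the poorly placed start needs three, which is the effect the text describes.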
In this paper we propose a technique to improve accuracy and efficiency by producing preliminary centroids for k-means clustering over Apache™ Hadoop [6], harnessing the power of parallel computing together with the clustering technique.

1.1 HADOOP-COMPUTATION AND STORAGE SOLUTION

Dealing with "Big Data" requires inexpensive, reliable storage and a new tool for analyzing structured, unstructured and semi-structured data. Apache Hadoop addresses both of these problems. Because Hadoop is built on the MapReduce concept, it distributes and parallelizes data processing across many nodes in a compute cluster, speeding up large computations and hiding I/O latency through increased concurrency. It is well suited to large-scale data processing such as searching and indexing massive data sets.

1.2 HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is now an Apache Hadoop subproject. HDFS has a master/slave architecture and is suitable for applications that have large data sets. HDFS maintains the metadata on a dedicated server called the NameNode, and the application data are kept on separate nodes called DataNodes. These server nodes are fully connected and communicate using a TCP-based protocol.

2. K-MEANS CLUSTERING OVER HADOOP

The input to K-Means over Hadoop [10] is given as <key, value> pairs, where the key is the centroid and the value is the serialized data nodes (objects) that need to be clustered. These keys and values are maintained in HDFS in separate files. The centroid file contains the preliminary centers, either entered by the user or selected arbitrarily from the data nodes (objects) to be clustered. These centers form the key of the <key, value> pair during the Mapper phase.
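The <key, value> layout described above can be sketched in plain Python, without Hadoop, as one K-Means pass split into map, shuffle and reduce steps. This is only an in-memory illustration under stated assumptions: the function names (`kmeans_map`, `shuffle`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical, squared Euclidean distance is assumed, and the HDFS files and serialization mentioned in the text are replaced by ordinary lists of tuples.

```python
from collections import defaultdict


def kmeans_map(point, centroids):
    """Map step: emit <nearest centroid, point> for one data object."""
    nearest = min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)),
    )
    return nearest, point


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def kmeans_reduce(centroid, members):
    """Reduce step: emit <old centroid, component-wise mean of members>."""
    count = len(members)
    return centroid, tuple(sum(dim) / count for dim in zip(*members))


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce pass and return the updated centroids."""
    grouped = shuffle(kmeans_map(p, centroids) for p in points)
    # A centroid that attracted no points is carried over unchanged.
    return [
        kmeans_reduce(c, grouped[c])[1] if c in grouped else c
        for c in centroids
    ]


# Hypothetical toy input: centroids must be tuples so they can act as keys.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
updated = kmeans_iteration(points, centroids)
```

In a real Hadoop job the updated centroids would be written back to the centroid file and the pass repeated until they stop changing; here that outer loop is left to the caller.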
The operation mechanism of MapReduce is as follows:

i. Input: A MapReduce application based on Hadoop requires a pair of Map and Reduce functions that implement the appropriate interface or abstract class, and must also specify the input and output locations and other operating parameters.
ii. Map: The framework presents the application's input as a set of key-value pairs <key, value>. In the Map stage, the framework calls the user-defined Map function to process each <key, value> pair, generating a new batch of intermediate <key, value> pairs.
iii. Shuffle: To ensure that the Map outputs fed to Reduce are sorted, in the Shuffle stage the framework uses HTTP to fetch the relevant <key, value> pairs of the Map outputs for each Reduce; the MapReduce framework groups the input of the Reduce phase by key.
iv. Reduce: This phase traverses the intermediate data for each unique key and executes the user-defined Reduce function. The input is <key, {a list of values}> and the output is a new set of <key, value> pairs.
v. Output: This stage writes the results of Reduce to the specified output directory.

3. REVIEW OF DIFFERENT DATA MINING TECHNIQUES

This section studies and critically examines the findings of previous research and summarizes the salient features relevant to the present work, to serve as background literature for this paper.
i. Pham et al. (2004) discussed the factors that affect the selection of the number of clusters, proposed a new measure to assist that selection, and concluded with an analysis of the results of using the proposed measure to determine the number of clusters for the k-Means algorithm on different data sets.

ii. Fahim et al. (2006) presented a simple and efficient clustering algorithm based on k-means, which they call the enhanced k-means algorithm. It is simple to implement, requiring only a simple data structure that keeps some information from each iteration for use in the next. Experimental results demonstrated that the scheme can improve the computational speed of the k-Means algorithm, in terms of both the total number of distance calculations and the overall computation time.

iii. Deelers et al. (2007) gave an algorithm to calculate preliminary cluster centers for k-Means clustering. Data in a cell is partitioned using a cutting plane that divides the cell into two smaller cells. The plane is perpendicular to the data axis with the highest variance and is chosen to reduce the sum of squared errors of the two cells as much as possible while keeping the two cells as far apart as possible. Cells are partitioned one at a time until the number of cells equals the predefined number of clusters K. The experimental results show that the proposed algorithm is effective and converges to better clustering results than random initialization. The research also indicated that the proposed algorithm improves the chances of every cluster containing some data.

iv. Sebastian et al. (2009) noted that several methods have been proposed in the literature for improving the performance of the k-Means clustering algorithm. Their paper presents a method for making the algorithm more effective and efficient, so as to get better clustering with reduced complexity.

v. Chen et al. (2009) offered a new clustering method based on k-Means that avoids the randomness of the initial centers. The work focuses on the k-Means algorithm and improves on its dependence on the initially selected value of k. First, the initial number of clusters is set to N. Second, the categories are merged by applying a sub-merger strategy. The algorithm does not require the user to specify the number of clusters in advance. Experiments on artificial data sets showed considerable improvements in clustering accuracy compared with random k-Means.

vi. Pakhira et al. (2009) presented a modified version of the k-means algorithm that efficiently eliminates the empty-cluster problem. They showed that the modified algorithm is semantically equivalent to traditional k-Means and that the modification introduces no performance penalty. Results of simulation experiments on several data sets support the claim.

vii. Gupta et al. (2010) proposed an algorithm to automatically determine the number of clusters in a given input data set under a mixture-of-Gaussians assumption. The algorithm extends the Expectation-Maximization clustering approach by starting with a single-cluster assumption for the data and recursively splitting one of the clusters to find a tighter fit. An information criterion parameter is used to choose between the current and previous model after each split. The approach builds on prior work on both the k-Means and Expectation-Maximization algorithms. The algorithm is extended with a cluster-splitting approach based on Principal Direction Divisive Partitioning, which improves accuracy and efficiency.

viii. Yedla et al. (2010) presented a new technique for finding improved preliminary centroids and an efficient way of assigning the data points to the appropriate clusters with reduced time complexity. According to the evaluated results, the modified algorithm is more accurate and less time-consuming than the original k-means clustering algorithm.

ix. Ren et al. (2012) studied Hadoop workloads from three different clusters from an application-level perspective, with two goals: (a) to explore new issues in application patterns and user behavior and (b) to understand key performance challenges related to input/output and load balancing. They examined execution logs from three Hadoop clusters used for academic research (OPENCLOUD, M45 and WEB MINING), studying job performance, configurations and user history files. These new Hadoop cluster traces contain richer information than previous studies because they record application patterns and user behaviors, which are critical for understanding the requirements and performance of big-data systems. Easing the use of Hadoop and improving system designs to cope with changing use cases are identified as crucial directions for future research.

x. Dittrich et al. (2012) state the need of many organizations, companies and researchers to deal efficiently with big data volumes, including web analytics applications, scientific applications and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from various performance problems. Various strategies can be used with Hadoop MapReduce jobs to boost performance by orders of magnitude.
Jens Dittrich briefly familiarizes the audience with Hadoop MapReduce, motivates its use for big data processing, and focuses on different data management techniques, ranging from job optimization to physical data organization such as data layouts and indexes. Throughout, similarities and differences between Hadoop MapReduce and parallel DBMSs are discussed.

xi. Jain et al. (2012) proposed a new hybrid algorithm based on the k-Means and k-Harmonic Means approaches. Its performance is compared with the conventional K-Means and K-Harmonic Means algorithms. The results obtained from the proposed hybrid algorithm are considerably better than those of the traditional K-Means and K-Harmonic Means algorithms.

xii. Kane (2012) proposed a new, efficient approach to determining the number of clusters, based on comparing the volume of a cluster with a fixed threshold.

xiii. Zhang and Fang (2013) introduce the idea behind the k-means clustering algorithm, analyze the advantages and disadvantages of the traditional algorithm, and elaborate a method of improving it by improving the selection of the initial focal points and by determining the value of K. Experimental results show that the improved clustering algorithm is more stable in the clustering process. At the same time, the improved algorithm reduces or even avoids the impact of noise data in the data set, ensuring that the final clustering result is more accurate and effective.

xiv. Kodinariya and Makwana (2013) explored six different approaches to determining the right number of clusters in a data set. Various methods are available to estimate the number of clusters, such as statistical indices, variance-based methods, information-theoretic methods and goodness-of-fit methods.

xv. Anchalia et al. (2013) discussed the implementation of the K-Means clustering algorithm in a distributed environment using Apache™ Hadoop. They design the Mapper and Reducer routines for processing the data set, and HDFS is used to store the data set before and after processing. The Mapper takes its input as a <key, value> pair, where the key serves as the center of the cluster and the value is the serializable implementation of the data set. The initial set of centers is stored on HDFS before the Map function is called and serves as the key of the <key, value> pair. The Mapper is designed so that it computes the distance between the vector value and each of the cluster centers in the cluster set, while keeping track of the cluster to which the given vector is closest.

xvi. Revathi and Nalini (2013) presented a comparative study of clustering algorithms across two different data sets. The results of the various clustering algorithms are compared based on the time taken to form the estimated clusters. The experimental results indicate that the time taken to form the clusters increases as the number of clusters increases. The farthest-first clustering algorithm takes very little time to cluster the data items, whereas simple k-Means takes the longest.

xvii. Shah Neepa (2014) discussed the importance of document clustering, which arises from the massive volumes of textual documents being created. With the continued development of information technology, data sets in many domains are growing beyond peta-scale, making it difficult to run document clustering algorithms at a central site and increasing the computational requirements. Parallel computing concepts were introduced to scale document clustering, which later led to distributed document clustering. Distributed document clustering using Hadoop and MapReduce is proposed. K-means was first tested on a single node; the mapper and reducer functions were then modified to run over a cluster of three machines.
Dataset consisting of 20,000 documents (20-newsgroups) and 21578 documents were
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 05 Issue: 12 | Dec 2018 www.irjet.net p-ISSN: 2395-0072 © 2018, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 238 tested. Results showed that the required time has been reduced that after addition of more nodes. xviii. Duggal et al. (2015) reviews that we live in on- demand, on-command Digital universe with data prolifering by Institutions, Individuals and Machines at a very high rate. This data has been known as "Big Data" due to its sheer Volume, Variety, Velocity and Veracity. The majority of this data are unstructured, structured or semi structured and it is assorted in nature. The degree and the heterogeneity of data are generated with the rapid rate, makes it difficult for the present computing infrastructure to supervise Big Data. Conventional data supervision, warehousing and analysis systems fall short of tools to analyze this data. The authors suggest various methods for catering to the problems in hand through Map Reduce framework over Hadoop Distributed File System (HDFS). Map Reduce is an important method which makes use of file indexing with mapping, sorting, shuffling and finally reducing. 3. CONCLUSION This paper presents a new and easy technique to generate preliminary set of centroids for Improvingthe efficiencyand accuracy of one dimensional data set. The proposed method fairly reduces the no. of iterationstoreachconvergenceasK- Means is highly sensitive to set of preliminarycentroids.The overall execution time for K-Means Clustering job to finish has also reduced. Making the technique handy for large sets of data that would generally require large amount of time to reach convergence as it has been observed in case of arbitrarily selected initial centroids. The result may vary for different data sets. REFERENCES [1] Abhijit Kane (2012). 
Determining the number of Clusters for a K-Means Clustering Algorithm, Indian Journal of Computer Science and Engineering (IJCSE), Vol. 3 No. 5, Oct-Nov 2012.
[2] Meenakshi, Poonam Yadav (2016). A survey paper on K-means clustering using hadoop, IJRAET V-4 I-2.
[3] Chunfei Zhang and Zhiyi Fang (2013). An Improved K-means Clustering Algorithm, Journal of Information & Computational Science 10: 1 (2013) 193–199.
[4] J. B. MacQueen (1967). Some Methods for classification and Analysis of Multivariate Observations, Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297.
[5] Jens Dittrich and Jorge Arnulfo Quiané-Ruiz (2012). Efficient Big Data Processing in Hadoop MapReduce, Very Large Data Bases, Vol. 5, No. 12, 2012.
[6] K. A. Abdul Nazeer and M. P. Sebastian (2009). Improving the Accuracy and Efficiency of the k-means Clustering Algorithm, Proceedings of the World Congress on Engineering 2009, Vol I, WCE 2009, July 1 - 3, 2009, London, U.K.
[7] Yashika Verma, Sumit Kumari (2013). Study and analysis on Document Clustering Based on MapReduce in Hadoop using K-mean Algorithm, International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, Index Copernicus Value (2013): 6.14 | Impact Factor (2013): 4.438.
[8] Kohei Arai and Ali Ridho Barakbah (2007). Hierarchical K-means: an algorithm for centroids initialization for K-means, Rep. Fac. Sci. Engrg., Saga Univ. 36-1 (2007), 25-31.
[9] Likas, N. Vlassis and J.J. Verbeek (2003). The Global k-means Clustering algorithm, Pattern Recognition, Volume 36, Issue 2, 2003, pp. 451-461.
[10] P Anchalia, Koudinya, Srinath (2013). MapReduce Design of K-Means Clustering Algorithm.
[11] Fahim A. M., Salem A. M., F.A. Torkey and M.A. Ramadan (2006). An Efficient enhanced k-means clustering algorithm, Journal of Zhejiang University, 10(7): 6261633.
[12] Fang Yuan, Zeng-Hui Meng, Hong-Xia Zhang, Chun-Ru Dong (2004). A New Algorithm To Get The Initial Centroids, Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August 2004.
[13] S. Deelers and S. Auwatanamongkol (2007). Enhancing K-Means Algorithm with Initial Cluster Centers Derived from Data Partitioning along the Data Axis with the Highest Variance, World Academy of Science, Engineering and Technology, Vol:1, 2007-11-27.
[14] Revathi and Dr. T. Nalini (2013). Performance Comparison of Various Clustering Algorithm, IJARCSSE, Volume 3, Issue 2, February 2013.