International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 7542
A Review on K-means++ Clustering Algorithm and Cloud Computing
with Map Reduce
1Sweekruth S Badiger, 2Sushmitha N
1PG Student, Dept. of Information Science and Engg., R. V. College of Engineering, Bangalore, Karnataka India
2Assistant Professor, Dept. of Information Science and Engg., R. V. College of Engineering, Bangalore,
Karnataka, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Cloud computing hosts servers in a distributed network in
order to deliver access to a wide range of customers. Its core feature
is flexibility: users can scale the hardware up or down as needed. A
main driver for the development of cloud computing is the substantial
increase in the volume of Big Data that needs to be examined.
Researchers use multiple algorithms to gain useful knowledge from huge
data sets. Hadoop provides an implementation of the MapReduce
programming model to carry out operations in parallel on huge data
sets. Only specific unsupervised learning algorithms have been executed
successfully with the MapReduce technique and deployed on high volumes
of data. The combination of cloud computing with parallel processing in
MapReduce stands out as a powerful approach for future technology
enhancements. This survey provides a brief overview of cloud computing
and of one of the most popular clustering algorithms, K-Means++.
Key Words: Hadoop, MapReduce, Cloud computing, K-Means++.
1. INTRODUCTION
Cloud computing is the on-demand availability of hardware resources
for computing or storage. The term ‘Cloud’ basically means available
to all through the internet. It reduces the work needed for users to
set up or manage hardware. The cloud provider takes ownership of
providing and maintaining the hardware infrastructure. The best-known
cloud service providers are Amazon, Google and Microsoft.
The world is producing about 2.5 quintillion bytes of data per day at
the current pace due to the internet, and this is only going to
increase with the arrival of the Internet of Things. Data is the new
‘oil’ of our generation, and mining it to discover knowledge is
critical to many businesses. Processing such a large amount of data
poses unique challenges to data analysts. Conventional sequential
programming methods are less efficient at this scale, and hence
parallel processing has gained importance. The cloud provides the ideal
infrastructure, and the Hadoop MapReduce programming paradigm can make
the most of it.
Many sequential algorithms have now been converted to the Hadoop
MapReduce programming paradigm. It should be noted that some algorithms
cannot be parallelized and hence are of limited use for Big Data today,
where parallelization is essential. Hence some of the conventional
algorithms for clustering or classification are now implemented using
the MapReduce programming paradigm. This technology can provide a great
platform for advancing new emerging technologies and is also very
beneficial to business enterprises.
2. BACKGROUND
A. Cloud computing
Cloud computing has evolved since the inception of the internet.
Historically, processing and storage were expensive, but due to
scientific and technological advances in the hardware manufacturing
industry, hardware has become much cheaper and smaller than before.
Taking advantage of this, industry began providing hardware
infrastructure as a service to users, and this is how the cloud
computing paradigm was born.
Figure. 1. Cloud computing service types [1]
Figure. 2. Cloud computing deployment models [1]
As shown in Fig. 1, cloud providers can provide services in three
delivery types: Software as a Service (SaaS), Platform as a Service
(PaaS) and Infrastructure as a Service (IaaS). Cloud deployment models
come in the form of private, hybrid, public and community clouds, as
shown in Fig. 2. Public clouds have many customers compared to other
forms of cloud. Private clouds are usually not cost-efficient.
B. Hadoop
Hadoop is considered to be one of the best tools to handle big data.
It has two major components: HDFS and MapReduce. HDFS stores files in
blocks of 64 MB (the default block size in Hadoop 1.x). It can handle
files of varying size, from about 10 MB up to gigabytes or terabytes.
Hadoop can run on a single-node or a multi-node cluster. Every Hadoop
cluster can have five running daemon processes. HDFS is managed by the
NameNode, Secondary NameNode and DataNode, while the daemon processes
that manage the MapReduce programming paradigm on top of HDFS are the
JobTracker and TaskTracker.
Figure. 3. Data flow in HDFS [2]
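As a small worked illustration of the block storage described above, the following sketch computes how many blocks a file would occupy, assuming the 64 MB block size stated in the text. The function name `block_layout` is hypothetical and is not part of any Hadoop API; the sketch only does the arithmetic and ignores replication.

```python
import math

BLOCK_SIZE_MB = 64  # block size stated in the text; configurable in a real cluster


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return (number_of_blocks, size_of_last_block_mb) for one file.

    Hypothetical helper for illustration only, not a Hadoop call.
    """
    if file_size_mb <= 0:
        return 0, 0
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block holds only the remaining data, not a full block.
    last_block_mb = file_size_mb - (n_blocks - 1) * block_size_mb
    return n_blocks, last_block_mb


if __name__ == "__main__":
    for size in (10, 128, 200):
        print(size, "MB ->", block_layout(size))
```

Under these assumptions a 200 MB file would be split into three full blocks plus one 8 MB block, and each block can then be read by a separate mapper.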
C. Map reduce paradigm
The MapReduce programming paradigm of Hadoop is a model for processing
huge amounts of data. The map phase maps the input data into
<key, value> pairs [3]. The reduce phase combines the data based on
common keys and performs a reduce operation defined by the user.
Parallelization comes from the many mappers created to read the data
concurrently rather than sequentially. Because of this, throughput is
high [4].
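The map and reduce phases described above can be imitated in a single Python process. The sketch below is not Hadoop code: `run_job`, `word_mapper` and `sum_reducer` are hypothetical names chosen for illustration, and the word-count job is an assumed example, not one taken from the survey. It only shows how <key, value> pairs flow from the mappers, through grouping by key, to the user-defined reduce operation.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the user-defined reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_mapper(line):
    # Emit <word, 1> for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    # Combine all values that share the same key.
    return sum(values)


if __name__ == "__main__":
    lines = ["big data needs parallel processing",
             "parallel processing of big data"]
    print(run_job(lines, word_mapper, sum_reducer))
```

In a real cluster each call to the mapper would run on a different node against its own input split, which is where the parallelism comes from; here the calls simply run one after another.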
3. K-Means++ CLUSTERING ALGORITHM
The K-Means++ algorithm is an improved version of K-Means, the most
widely used clustering algorithm. K-Means initializes the centroids
randomly, which can sometimes produce less accurate clusters. K-Means++
improves on K-Means by replacing this initialization step. The
K-Means++ algorithm takes as input k, the number of clusters to be
generated, and a set of n objects [5].
The K-Means++ clustering algorithm works as follows.
1. Select the initial centroid uniformly at random from the data points.
2. For each instance X, compute D(X), the distance between X and the
nearest centroid that has already been chosen.
3. Choose the next centroid using a weighted probability distribution
in which X is chosen with probability proportional to D(X)².
4. Repeat steps 2 and 3 until K centroids have been chosen.
5. Assign each record or instance to the cluster centre at the least
distance.
6. Calculate the mean value of all points in each cluster.
7. Replace each cluster centroid with this new mean value.
8. Repeat steps 5 to 7 until there are no more changes to the
centroids.
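The steps above can be sketched in plain Python as follows. This is a minimal sequential sketch under stated assumptions, not the authors' implementation: points are tuples of numbers, distance is Euclidean, and the names `init_centroids` and `kmeans_pp`, the `seed` parameter and the `max_iter` cap are all introduced here for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Return (index, squared distance) of the centroid closest to point."""
    best = min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))
    return best, squared_distance(point, centroids[best])


def init_centroids(points, k, rng):
    centroids = [rng.choice(points)]                          # step 1
    while len(centroids) < k:                                 # step 4
        weights = [nearest(p, centroids)[1] for p in points]  # step 2: D(X)^2
        if sum(weights) == 0:
            # Every point coincides with a chosen centroid; fall back to uniform.
            centroids.append(rng.choice(points))
        else:
            # Step 3: sample with probability proportional to D(X)^2.
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans_pp(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = init_centroids(points, k, rng)
    for _ in range(max_iter):                                 # step 8
        clusters = [[] for _ in range(k)]
        for p in points:                                      # step 5
            clusters[nearest(p, centroids)[0]].append(p)
        new_centroids = [                                     # steps 6 and 7
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centroids[i]                      # keep an empty cluster's centre
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:                        # no more changes
            break
        centroids = new_centroids
    labels = [nearest(p, centroids)[0] for p in points]
    return centroids, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 12), (11, 10)]
    print(kmeans_pp(data, k=2))
```

Because an already chosen point has D(X) = 0, it receives zero weight in step 3, so the initial centroids tend to be spread out rather than clustered together as can happen with purely random initialization.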
This algorithm can be implemented in the MapReduce pattern as follows.
• Map function: HDFS stores the input data as a sequence file of
<key, value> pairs [6]. Every <key, value> pair represents a record.
The map stage splits the data across all mappers [7].
• Reduce function: After mapping, reducers are used for computing
Step 2 of the algorithm. The reducer also combines the intermediate
data of the same mapper [8]. The intermediate data can be put in HDFS
or stored locally [9]. New centres are generated, which can be used for
further iterations.
Figure. 4. K-Means++ Map Reduce [2]
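To make the division of work between mappers and reducers more concrete, the sketch below simulates one centroid-update iteration in a single Python process. It follows a common formulation of MapReduce K-Means, which may differ in detail from the scheme described above: each mapper assigns its points to the nearest current centre, partial sums are combined per mapper, and the reducer averages them into new centres. All function names are hypothetical, and no Hadoop API is used.

```python
from collections import defaultdict


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(point, centres):
    """Map: emit <index of nearest centre, (point, 1)>."""
    key = min(range(len(centres)),
              key=lambda i: squared_distance(point, centres[i]))
    return key, (point, 1)


def kmeans_combine(values):
    """Add up (vector, count) pairs into one (vector sum, total count) pair."""
    total, count = None, 0
    for vec, n in values:
        total = list(vec) if total is None else [t + v for t, v in zip(total, vec)]
        count += n
    return tuple(total), count


def kmeans_reduce(values):
    """Reduce: turn the partial sums for one centre into its new mean."""
    total, count = kmeans_combine(values)
    return tuple(t / count for t in total)


def kmeans_iteration(splits, centres):
    """Run one iteration; each element of splits stands for one mapper's input."""
    partials = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for point in split:
            key, value = kmeans_map(point, centres)
            local[key].append(value)
        for key, values in local.items():
            # One partial sum per mapper and centre keeps intermediate data small.
            partials[key].append(kmeans_combine(values))
    new_centres = list(centres)  # centres with no assigned points stay unchanged
    for key, values in partials.items():
        new_centres[key] = kmeans_reduce(values)
    return new_centres


if __name__ == "__main__":
    splits = [[(0, 0), (10, 10)], [(0, 2), (10, 12)]]
    print(kmeans_iteration(splits, [(0, 0), (10, 10)]))
```

A driver program would call such an iteration repeatedly, passing the new centres to the next job, until the centres stop changing.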
4. CONCLUSIONS AND FUTURE WORK
The Big Data generated today demands more computing and storage
resources as well as better ways to process the data. Conventional
methods were failing to cope with such challenges, because of which
cloud computing and Hadoop MapReduce gained popularity [10].
The K-Means++ algorithm discussed in this survey is efficient for
large data sets, but it is sensitive to outliers. Future work could
combine an outlier detection and removal algorithm with K-Means++ to
give better and more accurate results.
REFERENCES
[1] Adem Tepe, Güray Yilmaz, “A Survey on Cloud Computing Technology
and Its Application to Satellite Ground Systems”, International
Conference on Recent Advances in Space Technologies (RAST), 2013.
[2] Rajashree Shettar, Bhimasen V. Purohit, “A Review on Clustering
Algorithms Applicable for Map Reduce”, International Conference on
Computational Systems for Health & Sustainability, pp. 176-178, 2015.
[3] K. Singh and R. Kaur, “Hadoop: Addressing challenges of Big Data”,
2014 IEEE International Advance Computing Conference (IACC), Gurgaon,
2014, pp. 686-689.
[4] D. Borthakur, “The Hadoop Distributed File System: Architecture
and design”, 2007.
[5] Weizhong Zhao, Huifang Ma, Qing He, “Parallel K-Means clustering
Based on MapReduce”, Springer-Verlag Berlin, Heidelberg, 2009.
[6] Sangeeta Ahuja, M. Ester, H. P. Kriegel, J. Sander, X. Xu, “A
Density based algorithm for discovering clusters in large spatial
database with noise”, Second International Conference on Knowledge
Discovery and Data Mining, 1996.
[7] B. Dai and I. Lin, “Efficient Map/Reduce-Based DBSCAN Algorithm
with Optimized Data Partition”, 2012 IEEE Fifth International
Conference on Cloud Computing, Honolulu, HI, 2012, pp. 59-66.
[8] V. Gaede, O. Günther, “Multidimensional access methods”, ACM
Comput. Surv., Vol. 30, No. 2, pp. 170-231, 1998.
[9] Varad Meru, “Data clustering: Using MapReduce”, Software
Developers Journal, 2013.
[10] Das A.S, Datar M, Garg, and Rajaram S, “Google news
personalization: scalable online collaborative filtering”,
pp. 271-280, 2007.
Shabista Imam
 
International Journal of Advanced Information Technology (IJAIT)
International Journal of Advanced Information Technology (IJAIT)
ijait
 
FUNDAMENTALS OF COMPUTER ORGANIZATION AND ARCHITECTURE
FUNDAMENTALS OF COMPUTER ORGANIZATION AND ARCHITECTURE
Shabista Imam
 
retina_biometrics ruet rajshahi bangdesh.pptx
retina_biometrics ruet rajshahi bangdesh.pptx
MdRakibulIslam697135
 
Tally.ERP 9 at a Glance.book - Tally Solutions .pdf
Tally.ERP 9 at a Glance.book - Tally Solutions .pdf
Shabista Imam
 
Data Structures Module 3 Binary Trees Binary Search Trees Tree Traversals AVL...
Data Structures Module 3 Binary Trees Binary Search Trees Tree Traversals AVL...
resming1
 
May 2025: Top 10 Read Articles in Data Mining & Knowledge Management Process
May 2025: Top 10 Read Articles in Data Mining & Knowledge Management Process
IJDKP
 

IRJET- A Review on K-Means++ Clustering Algorithm and Cloud Computing with Map Reduce

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 05 | May 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 7542
A Review on K-means++ Clustering Algorithm and Cloud Computing with Map Reduce
1Sweekruth S Badiger, 2Sushmitha N
1PG Student, Dept. of Information Science and Engg., R. V. College of Engineering, Bangalore, Karnataka, India
2Assistant Professor, Dept. of Information Science and Engg., R. V. College of Engineering, Bangalore, Karnataka, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Cloud computing hosts servers in a distributed network with the aim of delivering access to a wide range of customers. Its core feature is flexibility, which lets users scale the hardware up or down. The main driver for the development of cloud computing is the substantial increase in the volume of Big Data that needs to be examined. Researchers use multiple algorithms to gain useful knowledge from huge data sets. Hadoop provides an implementation of the MapReduce programming model to carry out operations in parallel on huge data sets. Only specific unsupervised learning algorithms have been executed successfully with the MapReduce technique and deployed on high-volume data sets. The combination of cloud computing with parallel processing in MapReduce stands out as a powerful approach for future technology enhancements. This survey provides a brief overview of cloud computing and of the popular clustering algorithm k-means++.
Key Words: Hadoop, MapReduce, Cloud computing, K-Means++.
1. INTRODUCTION
Cloud computing is the on-demand availability of hardware resources for computing or storage. The term ‘Cloud’ basically means available to all through the internet. It relieves the user of setting up or managing the hardware: the cloud provider takes ownership of providing and maintaining the hardware infrastructure. The best-known cloud service providers are Amazon, Google and Microsoft.
The world currently produces about 2.5 quintillion bytes of data per day because of the internet, and this will only increase with the arrival of the Internet of Things. Data is the new ‘oil’ of our generation, and mining it to discover knowledge is critical to many businesses. Processing such a large amount of data poses unique challenges for data analysts. Conventional sequential programs are less efficient at this scale, and hence parallel processing has gained importance. The cloud provides the ideal infrastructure, and the Hadoop MapReduce programming paradigm can make the most of it. Many sequential algorithms have now been converted to the Hadoop MapReduce programming paradigm. It should be noted that some algorithms cannot be parallelized and are therefore of limited use for Big Data, where parallelization is essential. Hence some of the conventional algorithms for clustering or classification are now implemented using the MapReduce programming paradigm. This technology can provide a great platform for advancing new emerging technologies and is also very beneficial to business enterprises.
2. BACKGROUND
A. Cloud computing
Cloud computing has evolved since the inception of the internet. Historically, processing and storage were expensive, but thanks to scientific and technological advances in hardware manufacturing, hardware has become much cheaper and smaller than before. Taking advantage of this, industry began offering hardware infrastructure as a service to users, and this is how the cloud computing paradigm was born.
Figure. 1. Cloud computing service types [1]
  • 2. Figure. 2. Cloud computing deployment models [1]
As shown in Fig. 1, cloud providers can provide services in three delivery types: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). Cloud deployments come in private, hybrid, public and community forms, as shown in Figure 2. Public clouds have many customers compared with the other forms of cloud. Private clouds are usually not cost efficient.
B. Hadoop
Hadoop is considered to be one of the best tools for handling big data. It has two major components: HDFS and MapReduce. HDFS stores files in blocks of 64 MB (the default in early Hadoop releases; later releases default to 128 MB). It can handle files of varying size, from 10 MB up to gigabytes and terabytes. Hadoop can run as a single-node or multi-node cluster. Every Hadoop cluster can have five running processes: HDFS consists of the Data node, Name node and Secondary Name node, and the daemon processes that manage the MapReduce programming paradigm on top of HDFS are the Job Tracker and Task Tracker.
Figure. 3. Data flow in HDFS [2]
C. Map reduce paradigm
The MapReduce programming paradigm of Hadoop is a model for processing huge amounts of data. The map phase maps the input data into <key, value> pairs [3]. The reduce phase combines the data that share a common key and performs the reduce operation defined by the user. Parallelization comes from the many mappers that read the data concurrently rather than sequentially, which gives high throughput [4].
3. K-Means++ CLUSTERING ALGORITHM
The K-Means++ algorithm is an improved version of K-Means, the most widely used clustering algorithm. K-Means initializes the centroids randomly, which can sometimes produce less accurate clusters. K-Means++ improves on K-Means by changing this initialization step. The K-Means++ algorithm takes as input k, the number of clusters to generate, and a set of n objects [5]. The K-Means++ clustering algorithm works as follows.
1. Select the initial centroid uniformly at random from the instances.
2. For each instance X, compute D(X), the distance between X and the nearest centroid that has already been chosen.
3. Choose the next centroid using a weighted probability distribution in which each instance X is chosen with probability proportional to D(X)^2.
4. Repeat steps 2 and 3 until k centroids have been chosen.
5. Assign each record or instance to the cluster centre at the least distance.
6. Calculate the mean value of all points in each cluster.
7. Replace each cluster centroid with this new mean value.
8. Repeat from step 5 until there are no more changes to the centroids.
This algorithm can be implemented in the map-reduce pattern as follows.
• Map function: HDFS stores the input data as a sequence file of <key, value> pairs [6]. Every <key, value> pair represents a record. The map function splits the data across all mappers [7].
• Reduce function: After mapping, reducers are used for computing Step 2 of the algorithm. A reducer also combines the intermediate data of the same mapper [8]. The intermediate data can be put in HDFS or stored locally [9]. New centres are generated, which can be used for further iterations.
Figure. 4. K-Means++ Map Reduce [2]
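The eight steps and the map/reduce split above can be sketched in plain Python. This is a minimal single-process sketch under stated assumptions, not the authors' Hadoop implementation: the function names (`kmeanspp_seed`, `map_assign`, `reduce_mean`, `kmeans_iteration`, `kmeanspp`) are hypothetical, points are assumed to be tuples of numbers containing at least k distinct values, and the shuffle between map and reduce is simulated with a dictionary. It uses the common decomposition in which the mapper assigns each record to its nearest centroid and the reducer averages the records that share a key; the survey describes its own split only briefly, so this is one plausible reading. In step 3, each instance X is drawn with probability D(X)^2 divided by the sum of D(X')^2 over all instances.

```python
import random
from collections import defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng):
    """Steps 1-4: choose k initial centroids with D(X)^2 weighting.

    Assumes `points` holds at least k distinct values; otherwise all
    weights can become zero and rng.choices raises ValueError.
    """
    centroids = [rng.choice(points)]              # step 1: uniform pick
    while len(centroids) < k:                     # step 4: until k chosen
        # step 2: squared distance to the nearest centroid chosen so far
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        # step 3: draw the next centroid with probability ~ D(X)^2
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def map_assign(point, centroids):
    """Map: emit <index of nearest centroid, point> for one record (step 5)."""
    key = min(range(len(centroids)),
              key=lambda i: sq_dist(point, centroids[i]))
    return key, point


def reduce_mean(points_in_cluster):
    """Reduce: average all points that share a key (steps 6 and 7)."""
    n = len(points_in_cluster)
    return tuple(sum(coord) / n for coord in zip(*points_in_cluster))


def kmeans_iteration(points, centroids):
    """One map -> shuffle -> reduce round, simulated in a single process."""
    groups = defaultdict(list)
    for key, value in (map_assign(p, centroids) for p in points):
        groups[key].append(value)                 # shuffle: group by key
    # A centroid that attracted no points keeps its old position.
    return [reduce_mean(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


def kmeanspp(points, k, seed=0, max_iter=100):
    """Seed with K-Means++, then iterate until the centroids stop changing."""
    rng = random.Random(seed)
    centroids = kmeanspp_seed(points, k, rng)
    for _ in range(max_iter):                     # step 8
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Calling `kmeanspp(points, k=2)` on two well-separated groups of points should return one centroid near the middle of each group. In a real Hadoop job each round of `kmeans_iteration` would be a separate MapReduce job, with the current centroids distributed to every mapper before the round starts.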
  • 3. 4. CONCLUSIONS AND FUTURE WORK
The Big Data generated today demands more computing and storage resources as well as better ways to process the data. Conventional methods were failing to cope with such challenges, which is why cloud computing and Hadoop MapReduce gained popularity [10]. The K-Means++ algorithm discussed in this survey is efficient for large data sets but is sensitive to outliers. Future work could combine an outlier detection and removal algorithm with K-Means++ to give better and more accurate results.
REFERENCES
[1] Adem Tepe, Güray Yilmaz, “A Survey on Cloud Computing Technology and Its Application to Satellite Ground Systems”, International Conference on Recent Advances in Space Technologies (RAST), 2013.
[2] Rajashree Shettar, Bhimasen. V. Purohit, “A Review on Clustering Algorithms Applicable for Map Reduce”, International Conference on Computational Systems for Health & Sustainability, pp. 176-178, 2015.
[3] K. Singh and R. Kaur, “Hadoop: Addressing challenges of Big Data”, 2014 IEEE International Advance Computing Conference (IACC), Gurgaon, 2014, pp. 686-689.
[4] Borthakur. D, “The Hadoop Distributed File System: Architecture and design”, 2007.
[5] Weizhong Zhao, Huifang Ma, Qing He, “Parallel K-Means clustering Based on MapReduce”, Springer-Verlag Berlin, Heidelberg, 2009.
[6] Sangeeta Ahuja, M. Ester, H. P. Kriegel, J. Sander, X. Xu, “A Density based algorithm for discovering clusters in large spatial database with noise”, Second International Conference on Knowledge Discovery and Data Mining, 1996.
[7] B. Dai and I. Lin, “Efficient Map/Reduce-Based DBSCAN Algorithm with Optimized Data Partition”, 2012 IEEE Fifth International Conference on Cloud Computing, Honolulu, HI, 2012, pp. 59-66.
[8] V. Gaede, O. Günther, “Multidimensional access methods”, ACM Comput. Surv., Vol. 30, No. 2, pp. 170-231, 1998.
[9] Varad Meru, “Data clustering: Using MapReduce”, Software Developers Journal, 2013.
[10] Das A.S, Datar M, Garg, and Rajaram S, “Google news personalization: scalable online collaborative filtering”, pp. 271-280, 2007.