ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Vol. 2, Issue 1, pp: (153-161), Month: April 2015 – September 2015, Available at: www.paperpublications.org
Big Data on Implementation of Many to Many Clustering

Ravi. R (PG Scholar) and Michael. G (Assistant Professor)
Department of CSE, Bharath University, Selaiyur, Chennai - 600 073, India
Abstract: With the development of information technology, the scale of data is increasing quickly. This massive data poses a great challenge for data processing and classification. Several algorithms have been proposed to cluster data efficiently; one among them is the random forest algorithm, which is used for feature subset selection. Feature selection involves identifying a subset of the most useful features that produces results comparable to the original entire set of features, and it is achieved by classifying the given data. Efficiency is measured by the time required to find a subset of features, while effectiveness relates to the quality of that subset. The existing system uses a fast clustering-based feature selection algorithm, which is proven to be powerful, but when the size of the dataset increases rapidly the algorithm becomes less efficient because clustering the datasets takes considerably more time. Hence a new method is proposed in this project to cluster the data efficiently and persist it on the back-end database so as to reduce the time. This is achieved by a scalable random forest algorithm, implemented using MapReduce programming (an implementation of Big Data) to cluster the data efficiently. It works in two phases: the first phase gathers the datasets and persists them on the datastore, and the second phase performs the clustering and classification of the data. The process is implemented on Google App Engine together with the Hadoop platform, a widely used open-source implementation of Google's distributed file system and MapReduce framework for scalable distributed or cloud computing. The MapReduce programming model provides an efficient framework for processing large datasets in a highly parallel manner, and it has become the most popular parallel model for data processing on cloud computing platforms. Redesigning traditional machine learning algorithms within the MapReduce programming framework is therefore necessary for dealing with massive datasets.
Keywords: Data mining, Hadoop, Map Reduce, Clustering Tree.
1. INTRODUCTION
Data linkage is one of the important tasks in data mining. It is of three kinds: one-to-one, one-to-many, and many-to-many. One-to-one data linkage associates one entity with a matching entity in another data set. One-to-many data linkage associates an entity with a group of matching entities from another data set. Many-to-many data linkage associates a group of entities from one data set with a group of matching entities from another data set. In this paper, many-to-many data linkage is implemented with the MapReduce framework.
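To make the three cardinalities concrete, here is a toy illustration in Python (the datasets and entity names are invented for illustration, not taken from the paper):

# Toy illustration of the three linkage cardinalities between two datasets.
one_to_one = {"a1": "b1", "a2": "b2"}                 # entity <-> entity
one_to_many = {"a1": ["b1", "b3"], "a2": ["b2"]}      # entity <-> group
many_to_many = [                                      # group <-> group
    ({"a1", "a2"}, {"b1", "b3"}),
    ({"a3"}, {"b2", "b4"}),
]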
Massive data poses a great challenge for data processing and classification. Several algorithms have been proposed to cluster data efficiently; one among them is the random forest algorithm, which is used for feature subset selection. Feature selection involves identifying a subset of the most useful features that produces results comparable to the original entire set of features, and it is achieved by classifying the given data. Efficiency is measured by the time required to find a subset of features, while effectiveness relates to the quality of that subset. The existing system uses a fast clustering-based feature selection algorithm, which is proven to
be powerful, but when the size of the dataset increases rapidly the algorithm becomes less efficient because clustering the datasets takes considerably more time.
Hence a new method is proposed in this project to cluster the data efficiently and persist it on the back-end database so as to reduce the time. This is achieved by a scalable random forest algorithm, implemented using MapReduce programming (an implementation of Big Data) to cluster the data efficiently. It works in two phases: the first phase gathers the datasets and persists them on the datastore, and the second phase performs the clustering and classification of the data. The process is implemented on Google App Engine together with the Hadoop platform, a widely used open-source implementation of Google's distributed file system and MapReduce framework for scalable distributed or cloud computing.
Hadoop MapReduce is a software framework for distributed processing of large data sets. It is a sub-project of the Apache
Hadoop project. The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks. The
primary objective of MapReduce is to split the input data set into independent chunks that are processed in a completely
parallel manner.
The Hadoop MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Both the input and the output of the job are stored in a file system. The proposed method uses MapReduce to link the entities with a One-Class Clustering Tree. A clustering tree is a tree that contains a cluster in each of its leaves instead of a single classification; each cluster is generalized by a set of rules stored in the appropriate leaf. The one-class clustering tree is built with the help of MapReduce in parallel: each linkage task is assigned to a mapper, the master node assigns work to the slaves, the slaves complete the work and return the results to the master node, and the reducer combines the intermediate results from the mappers into a complete clustering tree.
As the clustering tree is built in a parallel and distributed environment, it reduces the execution time; a minimal sketch of this mapper/reducer structure follows the list below. The objectives of the paper are threefold:
1. To implement one-to-many and many-to-many data linkage
2. To make data linkage for larger datasets more efficient
3. To execute data linkage in a parallel and distributed environment using MapReduce.
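The following is a minimal, runnable sketch of the master/worker structure described above, with assumed names: the real OCCT grows leaves holding rule sets, whereas here a "tree" is simply modeled as a dict from leaf labels to rules.

def mapper(task_id, record_pairs):
    # One linkage task per mapper: build a partial tree holding toy rules.
    yield ("tree", {"leaf-%d" % task_id: {"match(%s,%s)" % p for p in record_pairs}})

def reducer(key, partial_trees):
    # Combine the mappers' intermediate trees into one clustering tree.
    full = {}
    for t in partial_trees:
        full.update(t)
    return full

tasks = [[("a1", "b1")], [("a2", "b3")]]        # two toy linkage tasks
parts = [t for i, task in enumerate(tasks) for _, t in mapper(i, task)]
print(reducer("tree", parts))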
2. RELATED WORK
2.1 Map Reduce Overview:
At the beginning of the twenty-first century, many computations were specialized to process large amounts of raw data such as crawled documents, web request logs, and so on. These computations were applied by the authors and many others at Google to compute different types of derived data, such as inverted indices, summaries of the number of pages per host, and various representations of the graph structure of web documents. The input data was usually so large that it had to be distributed among hundreds or thousands of machines to finish the processing in a reasonable time. Although the computations themselves were straightforward, the questions of how to parallelize the computation, distribute the data, and deal with the expected failures made the originally simple computation very complex, especially given the large amount of code needed to handle these issues. In response to this complexity, a new model was designed to allow expression of the simple computation while hiding the complex details of data distribution, parallelization, and fault tolerance in a library. This model is inspired by the map and reduce primitives of Lisp, one of the oldest high-level programming languages. The use of this functional abstraction with user-specified map and reduce operations enables automatic parallelization and distribution of huge computations, and achieves high performance on large clusters of PCs.
2.2 Map Reduce Programming Framework:
MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. MapReduce is a programming model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
"Map" step: The master node takes the input, partitions it upinto smaller sub-problems, and distributes them to
workernodes. A worker node may do this again in turn, leading to amulti-level tree structure. The worker node processes
thesmaller problem, and passes the answer back to its masternode. Map takes one pair of data with a type in one
datadomain, and returns a list of pairs in a different domain:
Map (k1, v1) → list (K2, v2)
"Reduce" step: The master node then collects the answers toall the sub-problems and combines them in some way to
formthe output – the answer to the problem it was originally tryingto solve.The Reduce function is then applied in parallel
to each group,which in turn produces a collection of values in the samedomain:
Reduce (K2, list (v2)) → list (v3)
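These signatures can be rendered as Python type hints; the bodies below are illustrative only, using the word-count logic developed further in the next subsection as the user-supplied part:

from typing import Iterable, List, Tuple

def map_fn(k1: str, v1: str) -> List[Tuple[str, int]]:
    # map(k1, v1) -> list(k2, v2): emit one intermediate pair per word.
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2: str, values: Iterable[int]) -> List[int]:
    # reduce(k2, list(v2)) -> list(v3): merge all values for one key.
    return [sum(values)]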
2.3 Map Reduce Programming Model:
MapReduce is a programming model and an associated implementation for processing many terabytes of data on thousands of machines. A computation takes a set of input key/value pairs and produces a set of output key/value pairs, so the user of the MapReduce library expresses the computation as two functions: Map and Reduce.
The user takes the data from data sources (such as lines of files or rows of a database) and feeds them to the map function as an input key/value pair (e.g., filename, line), which then generates a set of intermediate key/value pairs, as shown in the MapReduce model figure below. The library then groups together the intermediate values associated with the same intermediate key and passes them to the Reduce function, which accepts an intermediate key I and a set of values for that key and merges these values to form a possibly smaller set of values. In practice, usually only zero or one final value is produced per key.
Figure: The MapReduce model.
An important thing to notice is that all the map() functions work in parallel to create different intermediate values from different input data sets, and the same holds for the reduce() functions, which run in parallel, each working on a different output key.
Here is a simple example that explains the functionality of the MapReduce model. In this problem, the number of occurrences of each word in a large collection of documents is counted. The map function emits each word together with an associated count of occurrences (just 1 in the simple pseudo-code below), while the reduce function sums all counts emitted for a given word. In addition, the user writes code to fill in a MapReduce specification object with the names of the input and output files, and then invokes the MapReduce function, passing it the specification object.
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
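For concreteness, the same pseudo-code can be rendered as a pair of runnable Hadoop Streaming scripts in Python; the file names and the invocation below are typical usage, not something prescribed by the paper.

# mapper.py: reads document lines from stdin, emits "word<TAB>1" pairs.
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

# reducer.py: Hadoop sorts map output by key, so all counts for a word
# arrive contiguously and can be summed in a single pass.
import sys
current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(value)
if current is not None:
    print("%s\t%d" % (current, count))

A typical invocation (paths vary by installation) would be: hadoop jar hadoop-streaming.jar -input docs/ -output counts/ -mapper mapper.py -reducer reducer.py.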
As noted above, when the MapReduce process is used the data must be distributed among the different nodes, so the master program divides up tasks depending on the location of the data: it tries to run map() tasks on the same node as the physical file data, or at least on the same rack.
The MapReduce module also deals with expected failures: the master detects worker failures and re-executes their completed and in-progress map() tasks, and re-executes in-progress reduce() tasks. In addition, the master notices particular input key/value pairs that cause crashes in map() and skips those values on re-execution.
2.4 MapReduce Algorithms:
There are many problems that can be mapped to a MapReduce program, such as sorting, searching, indexing, and classification. These programs must fit the features of the MapReduce algorithm: for any MapReduce algorithm, processing data must go through a map phase and a reduce phase, with the output of the map phase serving as the input to the reduce phase. In this project we use a classification algorithm, namely the scalable random forest algorithm.
2.4.1 Existing Random Forest Algorithm:
Random forest is an ensemble classifier that consists of many decision trees; the method used to build the random forest is bagging. Given the independent variable X, random forest classification is an ensemble classification model that combines K decision tree classifiers h1(X), h2(X), ..., hK(X). Each of the classifier decision trees votes for one of the classifications, and the winner is the final classification result. The main steps of random forest classification are: first, randomly select K samples from the original training dataset using the bagging method; then, use the K samples as training sets for growing K trees, giving K classification results; finally, the K classifiers vote to elect the optimal classification by majority.
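A compact sketch of these three steps, assuming NumPy and scikit-learn are available (the paper does not name a library) and that class labels are non-negative integers:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_predict(X_train, y_train, X_test, K=10, seed=0):
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(K):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample (bagging)
        tree = DecisionTreeClassifier(max_features="sqrt")      # random attribute subset per split
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.stack(votes)                                     # shape (K, n_test)
    # Majority vote across the K classifiers, per test record.
    return np.array([np.bincount(col).argmax() for col in votes.T])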
2.4.2 Scalable Random Forest Algorithm:
SMRF is a random forest classification algorithm based on MapReduce. The structure of the SMRF algorithm is shown in Fig. 1. The algorithm is divided into three stages: initializing, generating, and voting. In the initializing stage, the SMRF algorithm generates a file descriptor for the dataset that describes its attributes, in order to make good use of the computing resources of computer clusters or cloud computing environments. When the SMRF algorithm starts, it automatically negotiates the number of trees in the forest with the controller. The recommended values of K, which depend on the system performance, are stored in a hash list. The recommended values are based on Breiman's recommendation and expert experience: Breiman recommends picking a large number of trees and using the square root of the number of variables for the attribute subsets.
Fig. 1: Scalable Random Forest algorithm approach.
In the generating stage, the SMRF algorithm, based on the MapReduce programming model, divides the original dataset into many subsets randomly. SMRF performs the job using the Hadoop distributed file system (HDFS); meanwhile, the bootstrap samples for growing the random forest are generated from the training dataset by the bagging algorithm, and the attributes are likewise selected randomly by the bagging algorithm. In the MapReduce processing, the key of a decision tree is n, which identifies the n'th decision tree, and the value is the class of the root node of that tree. For classification it is necessary to describe the classification attributes of the training dataset; these attributes are leaf, real-valued, categorical, and so on. Each decision tree is split at the node which has the maximum information gain. By the end of the generating stage, all the decision trees will have been grown on the different distributed computing nodes. In the voting stage, each of the trees gives a classification and all the decision trees vote on the results; the forest decides the classification by majority vote. For prediction, each of the K trees gives a prediction value and the average of these values is returned. In the whole SMRF procedure, the map function randomly selects the training data and the attributes from the original dataset to build the decision trees, and the reduce function merges all the decision trees' votes and generates the final classification results; a hedged sketch of these two roles follows.
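The sketch below approximates the map and reduce roles just described; the names and the grow_tree helper passed in are assumptions for illustration, not SMRF's actual code.

import random
from collections import Counter

def smrf_map(n, dataset, grow_tree, seed=0):
    # Key n identifies the n'th tree; the value is the tree grown from a
    # bootstrap sample drawn from the original dataset.
    rng = random.Random(seed + n)
    sample = [rng.choice(dataset) for _ in dataset]
    yield (n, grow_tree(sample))

def smrf_reduce(record, tree_votes):
    # Merge the K trees' votes for one record: the majority class wins.
    return Counter(tree_votes).most_common(1)[0][0]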
2.5 Google App Engine:
Google App Engine is a service for hosting Web applications. It is often presented as an example of a platform-as-a-service (PaaS) cloud computing solution because Web application developers can use it to create their applications using a
set of Google tools. In contrast to other cloud services, such as the Amazon Elastic Compute Cloud (EC2), Google App
Engine requires that application developers use only a limited set of supported programming languages together with a
small set of APIs and frameworks that are mostly dedicated to Web applications. The applications are executed in a secure
hosting environment running on the computing infrastructure that Google provides and manages. The infrastructure offers
load balancing and auto scaling capabilities — that is, it automatically adjusts the number of application instances (virtual
machines) to the request rate.
2.6 Task Queue:
This work focuses on Task Queue, which is particularly interesting for computational applications. Task Queue lets you execute small jobs asynchronously in the background of HTTP request processing. By dispatching tasks to multiple App Servers according to the current load, Task Queue becomes a useful tool for providing scalability. However, using Task Queue and the Google App Engine execution environment presents several difficulties. Task execution must be idempotent, because Google guarantees only that each task will be executed at least once, which might lead, in exceptional circumstances, to some tasks re-executing. The execution model doesn't ensure any kind of task concurrency control, and Google doesn't provide APIs for inter-task communication. It is therefore impossible to specify when, and on which application server, a given task should be executed. The only way to provide data flow between tasks is to store and read data from the Bigtable database.
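A minimal sketch of enqueuing such a background job with the first-generation (Python) App Engine Task Queue API; the handler URL and parameter name are illustrative:

from google.appengine.api import taskqueue

def enqueue_chunk(chunk_key):
    # The worker behind /worker/cluster must be idempotent, since a task
    # is guaranteed to run at least once and may therefore run again.
    taskqueue.add(url='/worker/cluster', params={'chunk': chunk_key})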
2.7 Extension Possibilities:
Although Google App Engine mainly hosts Web applications that handle many short (interactive) requests, there are use cases in which longer background processing is required. In addition to the Task Queue API there are other options, including the Pipeline API and MapReduce. The Pipeline API allows developers to create workflow-based applications in which larger computations are split into tasks (pipelines) that are connected by data dependencies. The pipelines can represent not only processing tasks but also human interactions via Web requests, email, and so on. The MapReduce framework is implemented on top of App Engine using the Pipeline API. Users must define the map() function, which the framework iterates over entries in the input dataset; the entries might be datastore objects or lines in the blobstore, or be provided by a custom reader. The tasks in both the Pipeline and MapReduce APIs follow the same constraints and quotas as those in Task Queue, which means that these APIs can be used at no cost in the same way as a framework for parameter-study applications.
2.8 Data Storage Management:
Data is stored in CSV format. A comma-separated values (CSV) file stores tabular data (numbers and text) in plain-text form. Plain text means that the file is a sequence of characters, readable as textual or numerical material without much processing, as opposed to formatted text. A CSV file consists of any number of records, separated by line breaks of some kind; each record consists of fields, separated by some other character or string, most commonly a literal comma or tab. Usually, all records have an identical sequence of fields. CSV formats are not limited to a particular character set; they work just as well with Unicode as with ASCII.
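Such a file can be read with Python's standard csv module; the file name and column layout here are assumptions for illustration:

import csv

with open('ad_impressions.csv', newline='') as f:
    records = list(csv.reader(f))   # each record is a list of string fields
print(records[:3])                  # e.g. [['ad_id', 'clicks', 'cpc'], ...]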
3. PROPOSED ARCHITECTURE
The advertisement dataset is taken into consideration in order to effectively apply the mining and clustering techniques and classify the advertisements into different categories based on several factors such as click rate and cost of advertising.
The main steps of the algorithm are as follows: first, samples are selected from the original dataset; then, the samples serve as the training set for growing K trees, yielding the K classification results. The classification relies on a mathematical calculation that depends on the other parameters of the dataset.
The final classification of the datasets is based on metrics such as CPC, CPA, CTR, and cost of advertisement.
4. MODULES
4.1 Login – Authentication:
Authentication is used to make the application more secure and to allow only authorized persons to use the application. The Google Datastore is used to store the user information, and the same is consulted during the authentication process.
4.2 Split data into Subset:
This process is technically done using the MapReduce algorithm. The Mapper is used for splitting because the data set is huge. For splitting, MapReduce uses a concept called InputSplit; an InputSplit represents the data to be processed by an individual Mapper, as the toy illustration below shows.
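A conceptual stand-in for InputSplit (Hadoop actually derives splits from file block boundaries; this toy version just chunks a list of records):

def input_splits(records, split_size):
    # Each yielded chunk would be handed to one individual Mapper.
    for start in range(0, len(records), split_size):
        yield records[start:start + split_size]

print([len(s) for s in input_splits(list(range(10)), 4)])   # -> [4, 4, 2]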
4.3 Data Parallel Processing (Mapper):
In this module, we implement and configure the MapReduce job to divide the input data into different clusters, share it among multiple nodes, and process it in parallel.
4.4 Combine Intermediate Result (Reducer):
In this module, we combine the intermediate results from all the individual MapReduce jobs allocated in the previous module. After combining, the negotiation of the K value for feature subset selection is processed.
4.5 Collect AD Impressions Datasets:
 Cost-per-click is important because it is the number that determines the financial success of your paid search advertising campaign.
 CPA advertising tracks the person clicking on an ad and determines whether that person then also completes a desired transaction on the destination site.
4.6 Classification process (using Map Reduce):
• The advent of the revolution in technology and the internet has increased the marketing platforms available to advertising companies.
• The classification is done in order to categorize the advertisements based on different factors, including CPA, CPC, etc.
• There are various models for determining the cost of advertising: Cost per Impressions (CPM), Cost per Click (CPC), and Cost per Action (CPA).
• CPM denotes the cost per 1000 impressions (frequency of the ad display). If the advertisement is shown 2000 times, the cost equals twice the CPM price (see the worked check after this list).
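A worked check of the CPM arithmetic (the price used here is made up):

def cpm_cost(impressions, cpm_price):
    # cost = (impressions / 1000) x CPM price
    return impressions / 1000.0 * cpm_price

print(cpm_cost(2000, 1.50))   # 2000 impressions at 1.50 CPM -> 3.0, i.e. twice the CPM price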
5. CONCLUSION
The proposed methodology is designed on the MapReduce model and is intended to run mostly in computer clusters or cloud computing environments. Internally, the proposed mechanism uses the MapReduce framework with a shell script to fetch the number of computing nodes, and negotiates with the distributed computing environment to decide the parameter K. Technically, the K values are stored in a hash list in order to apply the many-to-many clustering in sequential order. The data clustering is tested with 29567 rows of sample data. It is experimented with in a distributed environment with varying numbers of computing nodes, and observations are made by giving various shard values and increasing the datasets. More decision trees bring better classification results. The proposed framework's parallel performance depends mainly on the MapReduce model, which has already been shown to have good scalability. From all the results, we evaluate the many-to-many clustering performance and show that it scales well.
REFERENCES
[1] Ma'ayan Dror, Asaf Shabtai, Lior Rokach, Yuval Elovici, "OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data Linkage", IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 3, March 2014.
[2] C. Li, Y. Zhang, and X. Li, "OcVFDT: One-Class Very Fast Decision Tree for One-Class Classification of Data Streams," Proc. Third Int'l Workshop on Knowledge Discovery from Sensor Data, pp. 79-86, 2009.
[3] Anne-Laure Boulesteix, Silke Janitza, Jochen Kruppa, Inke R. König, "Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics", pre-review version of a manuscript accepted for publication in WIREs Data Mining & Knowledge Discovery, July 25, 2012.
[4] Chen, W. Y., et al., "Parallel Spectral Clustering in Distributed Systems", IEEE Trans. Pattern Anal. Mach. Intell., pp. 568-586, 2011.
[5] Jiawei Han, Yanheng Liu, Xin Sun, "A Scalable Random Forest Algorithm Based on MapReduce", presented at the IEEE Summer Power Meeting, 2013.
[6] Aditya B. Patel, Manashvi Birla, Ushma Nair, "Addressing Big Data Problem Using Hadoop and Map Reduce", presented at the NIRMA University International Conference on Engineering, NUiCONE-2012, 06-08 December 2012.
[7] Jyoti Nandimath, Ankur Patil, Ekata Banerjee, Pratima Kakade, "Big Data Analysis Using Apache Hadoop", presented at IEEE IRI 2013, August 14-16, 2013, San Francisco, California, USA.
[8] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[9] Apache Software Foundation. Official Apache Hadoop website, http://hadoop.apache.org/, Aug. 2012.