International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 07 | July -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1807
Parallel kNN for Big Data using Adaptive Indexing
Tejal Katore1, Prof. Dr. Suhasini Itkar2
1Post Graduate Scholar, Dept. of Computer Engineering, P.E.S Modern College of Engineering, Pune, India
2Professor, Dept. of Computer Engineering, P.E.S Modern College of Engineering, Pune, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - k Nearest Neighbor (kNN) is frequently used in classification. The kNN algorithm determines the class membership of a given element, but it does not perform well in the context of large data, so multiple techniques were introduced to execute kNN in parallel and enhance its performance. Along with this, the MapReduce programming model, which is well suited to distributed approaches, was used. The reference algorithms that compute kNN on MapReduce are H-zkNNJ, H-BNLJ, and RankReduce. Data preprocessing, data partitioning, and computation are the three steps common to every kNN computation; the given solutions differ only in their partitioning technique. Adaptive Indexing is an indexing paradigm in which index creation and reorganization take place automatically and incrementally. It is used here along with the RankReduce algorithm and helps kNN execute more efficiently.
Key Words: Hadoop Block Nested Loop kNN (H-BNLJ),
Hadoop z value (H-zkNNJ), k Nearest Neighbor,
MapReduce, Performance Evaluation, RankReduce.
1. INTRODUCTION
k Nearest Neighbor is widely used as a classification or
clustering method in machine learning and data mining [1].
The k-Nearest Neighbor algorithm (kNN) [2] is considered
one of the ten most significant data mining algorithms. It is
a lazy learner that does not need an explicit training phase.
The method requires that all data instances be stored;
unseen cases are then classified by finding the class labels of
the k closest instances [3]. To determine how close two
instances are, several distance measures can be computed. This
operation has to be performed for every input example
against the whole training dataset.
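For instance, Euclidean and Manhattan distance are two common choices; the following brief Python sketch (standard definitions, not specific to this paper) shows both:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```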
Given R, a set of query points, and S, a set of reference
points, a k nearest neighbor join is an operation which, for each point in
R, discovers the k nearest neighbors in S. The data points are
divided into a training set and a testing set, the latter also called
unlabeled data. The aim is to find the class label for the new points. For
each unlabeled point, a kNN query on the training set is
performed to estimate its class membership, so the whole process
can be considered a kNN join of the testing set with the
training set. The basic idea for computing a kNN join is to
perform a pairwise distance computation between each element
in R and each element in S. The difficulties mainly lie in
two aspects: (1) data volume and (2) data
dimensionality. A lot of work has been dedicated to reducing
the in-memory computational complexity [1]. These works
mainly focus on two points: (1) using indexes to decrease the
number of distances that need to be calculated, although such indexes can
hardly be scaled to high-dimensional data; and (2) using projections
to reduce the dimensionality of the data, at the cost of making
accuracy maintenance another issue. Despite these efforts,
there are still significant limitations to processing kNN on a
centralized machine when the amount of data increases
[4],[10],[11].
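To illustrate the quadratic cost that motivates these efforts, here is a minimal Python sketch of a brute-force kNN join (function and variable names are ours, not from the paper):

```python
import heapq
import math

def knn_join(R, S, k):
    """Brute-force kNN join: for each point r in R, find the k
    points of S closest to r (Euclidean distance). Cost is O(|R|*|S|)."""
    result = {}
    for i, r in enumerate(R):
        # Distance from r to every reference point in S.
        dists = [(math.dist(r, s), j) for j, s in enumerate(S)]
        # Keep only the k smallest distances.
        result[i] = heapq.nsmallest(k, dists)
    return result

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 1.0), (2.0, 2.0), (6.0, 5.0), (9.0, 9.0)]
print(knn_join(R, S, k=2))
```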
Only distributed and parallel solutions have proved
powerful for large datasets. MapReduce is a flexible and
scalable parallel and distributed programming paradigm
specially designed for data-intensive processing of large
datasets. It consists of: (1) representing data as key-value
pairs, (2) defining a map function, and (3) defining a reduce
function. Here we introduce the reference
algorithms that compute kNN over MapReduce. These
algorithms are based on different methods but follow a
common workflow consisting of three ordered
steps: (1) data pre-processing, (2) data partitioning, and (3) kNN
computation.
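To make the key-value contract concrete, here is a hedged, single-machine Python emulation of the model (the real system runs on Hadoop; the toy grid key and all names are ours):

```python
from collections import defaultdict

def map_fn(point_id, point):
    """Map: emit a partition key for each point. Here the key is a
    toy 1-D grid cell; the real algorithms use distance- or
    size-based partitioning instead."""
    cell = int(point[0]) // 10
    yield cell, (point_id, point)

def reduce_fn(cell, values):
    """Reduce: all values sharing a key arrive together; a real
    reducer would compute local kNN here."""
    yield cell, [pid for pid, _ in values]

def run_mapreduce(records, mapper, reducer):
    # Shuffle phase: group intermediate pairs by key.
    groups = defaultdict(list)
    for key, rec in records:
        for k2, v in mapper(key, rec):
            groups[k2].append(v)
    return [out for k2, vs in groups.items() for out in reducer(k2, vs)]

data = [(0, (3.0,)), (1, (12.0,)), (2, (15.0,))]
print(run_mapreduce(data, map_fn, reduce_fn))  # [(0, [0]), (1, [1, 2])]
```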
2. LITERATURE REVIEW
kNN is based on a distance function that measures the
difference or similarity between two instances. The centralized
kNN approach was not able to cope with large inputs, so new
approaches for executing it in parallel were developed. Several
existing solutions perform the kNN operation in the context of
MapReduce. The H-BNLJ approach [1] consists of two phases.
The dataset is divided into blocks of a particular size, and the data is
partitioned such that an element in a partition of R has
its nearest neighbors in only one partition of S. Two
partitioning strategies are proposed that separate the datasets
into independent partitions while preserving locality
information. H-zkNNJ [1],[4], which uses a size-based
partitioning strategy, has a very good load balance,
with a very small deviation in the completion time of each
task. In H-zkNNJ, the z-value transformation leads to
information loss, and the recall of the algorithm is influenced by
the nature, the dimension, and the size of the input data.
More specifically, the algorithm becomes biased if the
distances between the initial data points are very scattered. For
the hashing-based approach, the key parameters are L, the number
of hash families, and M, the number of hash functions in each family.
Since they are dependent on the dataset, experiments are
needed to tune them precisely. The authors suggest this
can be achieved with a sample dataset and a theoretical
model. The first important metric to consider is the number
of candidates available in each bucket: with
poorly chosen parameter values, a bucket may hold fewer
than k elements, making it impossible to gather
enough candidates at the end of the computation. RankReduce is
the approach that uses Locality Sensitive Hashing [7],[8],[9].
RankReduce [1],[5], with the addition of a third job, can
achieve the best performance of all, provided that it is started
with the optimal parameters. The most important ones are
W, the size of each bucket, and L, the number of hash families.
Increasing the number of families L greatly improves both
precision and recall. However, increasing M, the number
of hash functions, decreases the number of collisions,
reducing execution time but also recall and precision.
Overall, finding the optimal parameters for the Locality
Sensitive Hashing part is complex and has to be done for
every dataset.
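A minimal sketch of the LSH bucketing idea and the roles of W, L, and M (random-projection hashing is our assumption; the paper does not specify the hash family):

```python
import random

class LSH:
    """L hash families, each concatenating M random projections;
    W is the bucket width, i.e. how coarsely distances are
    quantized. Similar points tend to collide in some family."""
    def __init__(self, dim, W=4.0, L=3, M=2, seed=0):
        rng = random.Random(seed)
        # Each hash: floor((a . x + b) / W) for random a, b.
        self.tables = []
        for _ in range(L):
            funcs = [([rng.gauss(0, 1) for _ in range(dim)],
                      rng.uniform(0, W)) for _ in range(M)]
            self.tables.append((funcs, {}))
        self.W = W

    def _key(self, funcs, x):
        return tuple(int((sum(a_i * x_i for a_i, x_i in zip(a, x)) + b)
                         // self.W) for a, b in funcs)

    def insert(self, x, label):
        for funcs, table in self.tables:
            table.setdefault(self._key(funcs, x), []).append(label)

    def candidates(self, x):
        # Union of colliding points over all L families; if parameters
        # are poorly chosen, this set can hold fewer than k candidates.
        out = set()
        for funcs, table in self.tables:
            out.update(table.get(self._key(funcs, x), []))
        return out

lsh = LSH(dim=2)
for i, p in enumerate([(1.0, 1.0), (1.2, 0.9), (9.0, 9.0)]):
    lsh.insert(p, i)
print(lsh.candidates((1.1, 1.0)))  # likely {0, 1}
```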
The adaptive indexing used here is based on iDistance [8],[4],
a technique that specifically addresses kNN queries in high-dimensional
space and has since proven to be one of the most efficient,
state-of-the-art high-dimensional indexing techniques
for exact kNN search. In recent years, iDistance has
been used in a number of applications. It maps the data to a set of
one-dimensional distance values, each related to one or more
data points of a partition, which are all indexed together in
a single standard B+-tree. The algorithm was motivated by
the ability to use arbitrary reference points to determine the
similarity and dissimilarity between any two data points in a
metric space, allowing single-dimensional ranking and
indexing of data points no matter what the dimensionality of
the original space [8].
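A hedged sketch of the iDistance key computation (the reference points and the constant c are illustrative choices of ours, not the paper's):

```python
import math

def idistance_key(point, ref_points, c=1000.0):
    """Map a multi-dimensional point to a 1-D key: find the nearest
    reference point i, then key = i * c + dist(point, ref_i).
    c must exceed any possible distance so partitions never overlap."""
    dists = [math.dist(point, r) for r in ref_points]
    i = min(range(len(ref_points)), key=lambda j: dists[j])
    return i * c + dists[i]

refs = [(0.0, 0.0), (10.0, 10.0)]
# Both keys can now be stored in a single one-dimensional B+-tree
# and range-scanned around a query's key to answer kNN searches.
print(idistance_key((1.0, 2.0), refs))   # partition 0
print(idistance_key((9.0, 8.0), refs))   # partition 1
```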
3. SYSTEM ARCHITECTURE
Processing Steps: The following scheme consists of three
basic steps:
1) Pre-processing
i. Remove column names
ii. Move to HDFS
iii. Feature Extraction
iv. Clean Data
v. Divide into training and testing set
2) Partitioning
3) kNN Computation
In the reference design, iDistance indexing was inserted into
the middle of the RankReduce pipeline. Taking that as a reference,
the implemented system also combines indexing with RankReduce,
but in a different order: the indexing is performed first, and
then RankReduce is executed.
Fig -1: Architecture Diagram
1. Pre-processing: The data is transformed from its original
form into a useful form; only the required data is
kept and the rest is removed. It consists of the following
steps. A) Remove the column names: the attribute (column)
names of the dataset are removed. B) Move to HDFS:
the dataset is moved onto the Hadoop Distributed File
System. C) Feature extraction: the selected
features are extracted from the given data. D) Clean data: after selecting
the features, the remaining data is discarded.
E) Divide into training and testing sets: the dataset is divided
into a training set and a testing set. The training set is
the labelled data, which contains class memberships; the
testing set is the unlabeled data to be
processed.
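A minimal sketch of these pre-processing steps (column indices and file names are illustrative, not from the paper):

```python
import csv
import random

def preprocess(in_path, keep_cols, test_ratio=0.3, seed=42):
    """Drop the header row, keep only selected feature columns,
    and split the rows into training and testing sets."""
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))[1:]          # remove column names
    cleaned = [[row[i] for i in keep_cols]      # feature extraction
               for row in rows if all(row[i] for i in keep_cols)]
    random.Random(seed).shuffle(cleaned)
    split = int(len(cleaned) * (1 - test_ratio))
    return cleaned[:split], cleaned[split:]     # train, test

# train, test = preprocess("airline.csv", keep_cols=[0, 1, 4, 8, 21])
# The resulting files would then be copied to HDFS, e.g.:
#   hdfs dfs -put train.csv /data/train.csv
```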
2. Partitioning: When processing data on MapReduce, the
dataset must be divided into independent pieces, called
partitions. Partitioning is the process of dividing the data
into blocks, regions, buckets, etc. The algorithms use
different partitioning strategies, which fall under two kinds,
as sketched below: (1) distance-based partitioning and
(2) size-based partitioning. In distance-based
partitioning the space is divided into disjoint cells, while in
size-based partitioning the space is divided into equal-size
partitions.
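A hedged Python sketch of both strategies (the pivot points, z-values, and function names are illustrative, not from the paper):

```python
import math

def distance_based_partition(points, pivots):
    """Distance-based: assign each point to the cell of its
    nearest pivot, giving disjoint Voronoi-like cells."""
    cells = {i: [] for i in range(len(pivots))}
    for p in points:
        i = min(range(len(pivots)), key=lambda j: math.dist(p, pivots[j]))
        cells[i].append(p)
    return cells

def size_based_partition(values, n_parts):
    """Size-based: sort one-dimensional values (e.g. z-values) and
    cut them into equal-size chunks, balancing reducer load."""
    s = sorted(values)
    step = -(-len(s) // n_parts)  # ceiling division
    return [s[i:i + step] for i in range(0, len(s), step)]

print(distance_based_partition([(1, 1), (8, 9)], [(0, 0), (10, 10)]))
print(size_based_partition([5, 1, 9, 3, 7, 2], n_parts=3))
```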
3. kNN Computation: The mappers divide the dataset into a
number of blocks, and their output is given to the reducers,
which perform the distance computation and then sort the
points according to their distances.
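A minimal sketch of this reduce step (our structure; it assumes the test and training points of one partition arrive at the same reducer):

```python
import heapq
import math

def reduce_partition(test_points, train_points, k):
    """Reducer for one partition: for each test point, compute its
    distance to every training point in the partition, then keep
    the k nearest. Emits (test_id, [(dist, label), ...])."""
    for tid, tp in test_points:
        dists = [(math.dist(tp, x), label) for x, label in train_points]
        yield tid, heapq.nsmallest(k, dists)

test = [("q1", (1.0, 1.0))]
train = [((0.0, 0.0), "A"), ((2.0, 2.0), "A"), ((9.0, 9.0), "B")]
print(list(reduce_partition(test, train, k=2)))
```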
A. Datasets
The dataset used is named "Airline On-Time Statistics
and Delay Causes". It consists of airline records that
include all details related to flights, packaged in yearly
chunks from 1987 to 2008. Of its 29 columns, 19 are
selected. The
column named "cancelled tickets" is used as the label column.
The dataset is 12 GB in size and consists of 52 billion
records [10].
B. Hardware
1) Memory: 8 GB
2) Processor: Intel(R) Pentium(R) CPU B950 @ 2.10 GHz
3) Hard disk: 64 GB
C. Software
1) Cloudera Hadoop framework
2) Operating system: CentOS
3) Eclipse 4.2.2 or above
4) vi editor
D. Performance Parameters
The performance of the system is measured using
different parameters, among which time is the most important
factor. The number of mappers and reducers is kept the same
for each algorithm, and the value of k is chosen arbitrarily.
The execution time of each algorithm is reported in minutes
(Table 2).
E. Results
Precision

Algorithms          k=10   k=15   k=20
Block               0.57   0.56   0.55
Z-value             0.7    0.31   0.2
LSH                 0.85   0.73   0.68
Adaptive Indexing   0.9    0.77   0.7

Table 1: Precision of each algorithm
A comparison between all four algorithms is made, and some
advantages and shortcomings of each are observed.
HBkNNJ is trivial to implement; it breaks easily but is adequate
for tiny datasets. H-BNLJ is easy to implement but has a very
large communication overhead; it performs well for small
and medium datasets. H-zkNNJ is fast and more precise,
but it requires large disk space and gets slower on high-dimensional
datasets. RankReduce performs best among those
algorithms: it is fast and can be used for high-dimensional
data. Adaptive Indexing yields the best results of all in terms
of precision. Precision values are measured for different
values of k.
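For reference, precision here can be read as the overlap between the returned neighbors and the exact k-nearest-neighbor set; a small sketch of that measure (our formulation, assuming exact results from a brute-force run are available):

```python
def knn_precision(approx, exact, k):
    """Fraction of the k returned neighbors that belong to the
    exact k-nearest-neighbor set, averaged over all query points."""
    total = 0.0
    for q in exact:
        total += len(set(approx[q]) & set(exact[q])) / k
    return total / len(exact)

exact  = {"q1": ["a", "b", "c"]}
approx = {"q1": ["a", "c", "x"]}
print(knn_precision(approx, exact, k=3))  # 0.666...
```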
Fig. 2. Performance of algorithm in seconds
Similarly, the execution time is reduced with Adaptive
Indexing: compared to all the remaining algorithms, it
takes less time for all values of k.
Time

Algorithms          k=10   k=15   k=20
Block               28.3   25.06  26.32
Z-value             30     30.1   32
LSH                 23     25.3   24
Adaptive Indexing   17.2   21     19.8

Table 2: Execution time of each algorithm in minutes
Fig. 3. Time required for execution in minutes
4. CONCLUSION
In this paper we implemented the existing solutions
for the kNN operation in the context of MapReduce [6]. All solutions
follow three main steps: pre-processing, partitioning,
and computation. Different reference algorithms are
implemented, and with respect to the time and complexity
required, Adaptive Indexing is found to be the most efficient. The
parallel implementation has helped to improve the efficiency
of kNN. Using indexing instead of hashing may reduce the time
required for iterations and has contributed to the
performance of the algorithm.
5. ACKNOWLEDGEMENT
Working on the topic "Parallel kNN for Big Data using
Adaptive Indexing" was a source of immense knowledge to
me. I would like to express my sincere gratitude towards
Prof. Suhasini Itkar for her guidance and valuable support
throughout the research work. I acknowledge, with a deep
sense of gratitude, the encouragement and inspiration
received from our staff members and friends. Last but not
least, I would like to thank my parents for their love and
support.
6. REFERENCES
[1] G. Song, J. Rochas, L. El Beze, and F. Huet,
"K Nearest Neighbor Joins for Big Data on MapReduce: a
Theoretical and Experimental Analysis," IEEE Transactions
on Knowledge and Data Engineering, 2016.
[2] T. M. Cover and P. E. Hart, "Nearest neighbor pattern
classification," IEEE Transactions on Information Theory,
vol. 13, no. 1, pp. 21-27, 1967.
[3] X. Wu and V. Kumar, Eds., "The Top Ten Algorithms in
Data Mining," Chapman & Hall/CRC Data Mining and
Knowledge Discovery, 2009.
[4] G. Song, J. Rochas, F. Huet, and F. Magoulès, "Solutions
for Processing K Nearest Neighbor Joins for Massive Data on
MapReduce," in 23rd Euromicro International Conference on
Parallel, Distributed and Network-based Processing, Turku,
Finland, Mar. 2015.
[5] W. Lu, Y. Shen, S. Chen, and B. C. Ooi, "Efficient processing
of k nearest neighbor joins using MapReduce," Proc. VLDB
Endowment, 2012.
[6] A. Stupar, S. Michel, and R. Schenkel, "RankReduce -
processing k-nearest neighbor queries on top of MapReduce,"
in LSDS-IR, 2010.
[7] C. Zhang, F. Li, and J. Jestes, "Efficient parallel kNN joins
for large data in MapReduce," in Extending Database
Technology (EDBT), 2012.
[8] C. Yu, R. Zhang, Y. Huang, and H. Xiong, "High-dimensional
kNN joins with incremental updates," GeoInformatica, 2010.
[9] B. Yao, F. Li, and P. Kumar, "K nearest neighbor queries
and kNN-joins in large relational databases (almost) for free,"
in Data Engineering (ICDE), 2010 IEEE 26th International
Conference on, March 2010, pp. 4-15.
[10] http://stat-computing.org/dataexpo/2009/the-data.html
