International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 03 | Mar -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1341
SVM CLASSIFIER ALGORITHM FOR DATA STREAM MINING USING HIVE
AND R
Mrs. Pranamita Nanda1, B. Sandhiya2, R. Sandhiya3, A. S. Vanaja4
1Assistant Professor, 2,3,4Students
Department of Computer Science and Engineering
Velammal Institute Of Technology, Ponneri, Tiruvallur.
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract: Analysing the large volumes of data generated in IT
deployments is one of the central challenges of big data. To
make this analysis more efficient, we use the Hive tool for
query processing and RStudio to produce a statistical report.
The processing load in data stream mining can be reduced by a
technique known as feature selection. However, when mining
high-dimensional data, the search space from which an optimal
feature subset is derived grows exponentially in size, leading
to an intractable demand in computation. To reduce the
complexity of using accelerated particle swarm optimization
(APSO), we manage the data with Hadoop technology, which
makes it easier to store and retrieve data in a big data
environment. The dataset is analysed and the statistical
report is produced using the SVM algorithm in R, a software
environment for statistical computing and graphics. The
report compares the accuracy of the linear and nonlinear SVM
models, and the one with the higher accuracy is considered
the more efficient. The final graph combines the linear and
nonlinear results with respect to cost and sigma, which are
user-defined values. PSO with the SVM algorithm increases the
performance of analysing the data.
INTRODUCTION:
Handling, storing and retrieving large volumes of data is
challenging. Data stream mining is the process of extracting
knowledge structures from continuous, rapid data records. A
data stream is an ordered sequence of instances that, in many
applications of data stream mining, can be read only once or
a small number of times using limited computing and storage
capabilities. We therefore use a data stream mining technique
for retrieval, and to make retrieval efficient we use the
Hadoop Hive tool for query processing, which takes less time
to process. The first step converts the unstructured data
into structured data by creating a schema. The Hadoop
environment provides a storage layer, the Hadoop Distributed
File System (HDFS), into which our database is imported with
a Hive query, either from the system we are working on or
from another device such as a server. In Hive's LOAD DATA
statement, the keywords LOCAL INPATH import a file from the
local file system, while INPATH alone refers to a file that
is already in HDFS. The data extracted from the database is
then divided into test data and trained data. The trained
data is existing data whose outcome is already known, and
the model built from it is applied to the test data for
analysis. Both the test data and the trained data are passed
to a classification algorithm, the Support Vector Machine
(SVM). Given a dataset consisting of a feature set and a
label set, a classifier builds a model to predict classes.
The measure used to evaluate this process is accuracy: the
predictions of the SVM classifier are evaluated and the
accuracy is reported, and the configuration with the best
accuracy is taken into consideration.
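The accuracy measure described above can be made concrete with a minimal sketch. It assumes that accuracy means the fraction of test instances whose predicted class matches the known class; the function name `accuracy` and the label values are illustrative and do not come from the paper.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one label is required")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical labels, for illustration only.
actual_labels = ["pos", "neg", "pos", "pos", "neg"]
predicted_labels = ["pos", "neg", "neg", "pos", "neg"]
score = accuracy(predicted_labels, actual_labels)  # 4 of 5 labels match
```

In the workflow of the paper the predicted labels would come from the SVM model rather than from a hand-written list.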
EXISTING SYSTEM:
The lightweight feature selection technique known as swarm
search is used for classifying the dataset. There are many
feature selection techniques, such as CCV and improved PSO.
The amount of data fed in is potentially infinite, and data
delivery is continuous, like a high-speed train of
information. Processing is hence expected to be real-time
and instantly responsive. Retrieving data from a large
volume of data and maintaining it is difficult, and the
accuracy obtained is somewhat low, which is overcome by
using a better classifier algorithm. The complication on top
of quantitatively computing the non-linear relations between
the feature values and the target classes is the temporal
nature of such data streams: one must process the data
stream long enough to model seasonal cycles or regular
patterns accurately, if they exist at all. There are no
straightforward relations that can easily map the attribute
data into a specific class without long-term observation.
This has a considerable impact on the design of the data
mining algorithm, which should be capable of just reading
and forgetting the data stream.
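The idea of "reading and forgetting" a stream can be illustrated with a one-pass summary. The sketch below uses Welford's running mean and variance, which is a standard technique chosen here for illustration and is not taken from the paper; the class name `RunningStats` and the sample readings are assumptions.

```python
class RunningStats:
    """One-pass mean and variance (Welford's method); no instance is stored."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Sample variance; reported as 0.0 for fewer than two observations.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for reading in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):  # made-up stream
    stats.update(reading)  # each value is read once and then discarded
```

Memory use stays constant however long the stream runs, which is the property the paragraph asks of a stream mining algorithm.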
LITERATURE SURVEY:
Although Big Data is surrounded by hype and raises many
technical challenges that confront both academic research
communities and commercial IT deployments, the root
sources of Big Data are founded on data streams and the
curse of dimensionality. It is generally known that data
sourced from data streams accumulate continuously, making
traditional batch-based model induction algorithms
infeasible for real-time data mining. To tackle this
problem, which stems mainly from the high dimensionality
and streaming format of data feeds in Big Data, a novel
lightweight feature selection is proposed. The feature
selection is designed particularly for mining streaming data
on the fly, using an accelerated particle swarm optimization
(APSO) type of swarm search that achieves enhanced
analytical accuracy within reasonable processing time. In
that paper, a collection of Big Data sets with an
exceptionally large degree of dimensionality is put to the
test with the new feature selection algorithm for
performance evaluation.[1]
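For readers unfamiliar with APSO, the following is a minimal sketch of the commonly cited accelerated PSO update, in which each particle is pulled toward the global best and perturbed by shrinking random noise. It minimises a toy continuous function and does not reproduce the swarm-search feature selection of [1]; the function names, parameter values and objective are all assumptions made for illustration.

```python
import random


def apso_minimize(objective, dim, n_particles=20, n_iter=50,
                  alpha0=0.5, beta=0.5, gamma=0.95, seed=0):
    """Accelerated PSO sketch: move toward the global best, plus noise."""
    rng = random.Random(seed)
    particles = [[rng.uniform(-5.0, 5.0) for _ in range(dim)]
                 for _ in range(n_particles)]
    best = min(particles, key=objective)[:]
    initial_best_value = objective(best)
    for t in range(n_iter):
        alpha = alpha0 * (gamma ** t)  # randomness shrinks over time
        for p in particles:
            for d in range(dim):
                # x <- (1 - beta) * x + beta * g_best + alpha * noise
                p[d] = ((1 - beta) * p[d] + beta * best[d]
                        + alpha * rng.gauss(0, 1))
        candidate = min(particles, key=objective)
        if objective(candidate) < objective(best):
            best = candidate[:]  # keep the best position seen so far
    return best, initial_best_value


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)


best_position, start_value = apso_minimize(sphere, dim=3)
```

In a feature selection setting the objective would instead score a candidate feature subset, for example by the accuracy of a classifier trained on it.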
Energy-saving research on virtualization in cloud computing
platforms shows that there are problems in the management
mode of existing virtualization platforms. This mode is
based on a single node that manages the whole platform and
is responsible for migrating and scheduling all of the
virtual machines. A double-layer management model of the
virtual machines is therefore proposed to solve the problems
of the single management node bottleneck and the scope of
the migration. At the same time, an improved PSO algorithm
is used to plan the virtual machine migration. While
meeting the service performance requirements, the plan
achieves energy savings by keeping the number of booted
servers to a minimum. Experiments show that the proposed
management mode not only solves the bottleneck problem of
the single management node but also reduces the migration
scope and the difficulty of the problem. The improved PSO
algorithm clearly raises the speed of the migration and the
overall energy efficiency of the scheme.[2]
The cloud storage problem is one of the interesting and
important topics in the fields of cloud computing and big
data. From the viewpoint of optimization, a discrete PSO
algorithm is used to handle the cloud storage problem of the
distributed data centers in China’s railway and to copy the
data between two data centers. To achieve good performance
with the smallest transmitting distance, the discrete PSO
algorithm essentially pairs data centers from the two data
center sets. Numerical results highlight that the discrete
PSO algorithm can provide a guideline for a suboptimal cloud
storage strategy for China’s railway when the number of
distributed data centers is equal to 15, 17 and 18.[3]
One of the challenges in inferring a classification model
with good prediction accuracy is to select the relevant
features that contribute to maximum predictive power. Many
feature selection techniques have been proposed and studied
in the past, but none so far has been claimed to be the
best. In that paper, a novel and efficient feature selection
method called Clustering Coefficients of Variation (CCV) is
proposed. CCV is based on a very simple variance-based
principle, which finds an optimal balance between
generalization and overfitting. Because of its simple
design, it is anticipated that CCV will be a useful
alternative pre-processing method for classification,
especially for datasets that are characterized by many
features.[4]
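The quantity on which CCV builds, the coefficient of variation of a feature, is simply its standard deviation divided by its mean. The sketch below computes that value per feature; how [4] then clusters these coefficients is not reproduced here, and the feature names and numbers are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(values) / abs(mean)


# Hypothetical feature columns; names and numbers are illustrative only.
features = {
    "age": [20, 40, 60, 80],
    "flag": [10, 10, 10, 10],
}
cv_by_feature = {name: coefficient_of_variation(column)
                 for name, column in features.items()}
```

A constant column such as `flag` has a coefficient of zero, which suggests why variance-based measures can help flag features that carry little information.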
In a series of recent papers, Prof. Olariu and his co-workers
have promoted the vision of vehicular clouds (VCs), a
nontrivial extension, along several dimensions, of
conventional cloud computing. The main contribution of this
work is to identify and analyze a number of security
challenges and potential privacy threats in VCs. Although
security issues have received attention in cloud computing
and vehicular networks, we identify security challenges that
are specific to VCs, e.g., challenges of authentication of high-
mobility vehicles, scalability and single interface, tangled
identities and locations, and the complexity of establishing
trust relationships among multiple players caused by
intermittent short-range communications. Additionally, we
provide a security scheme that addresses several of the
challenges discussed.[5]
PROPOSED SYSTEM
We propose an approach to data stream mining using Hadoop
Hive technology. To implement big data analytics in a highly
scalable manner, big data needs Hadoop for processing. The
main research challenge here is finding the most appropriate
model induction algorithm for mining data streams. As an
additional requirement, given the possibility of embedding
the data miner module in small devices, the memory
requirement should be as small as possible, for the obvious
reasons of saving energy and fitting into a tiny device. In
other words, the learned model, probably in the form of
generalized non-linear mappings from the values of the
features to the predicted target classes, must be compact
enough to execute in a small run-time memory. No room should
be wasted on storing features and relations that are not
significant and contribute little to the model accuracy. To
this end, omitting feature selection is out of the question,
given the number of original features extracted from the
data streams. Since traditional batch-based models are built
on a stationary dataset, a model update needs to repeat the
whole training process whenever new samples arrive, adding
them to incorporate the changing underlying patterns. In a
dynamic stream processing environment, however, the data
classification model would have to be updated frequently.
ARCHITECTURAL DIAGRAM AND EXPLANATION:
FIG: 2: PROPOSED ARCHITECTURE
The datasets relevant to the user's needs are retrieved from
the database using a Hive query. Hive is the data warehouse
used to analyse and retrieve the data. For this, the data
must first be uploaded continuously to the database. The
database may hold, for example, medical datasets, traffic
light datasets and weather forecasting datasets; from these
multiple datasets the required one is retrieved using the
Hive query. The datasets have multiple fields, such as age,
name and sex, and retrieval is based on these fields. The
retrieved dataset is analysed and divided into two segments,
the trained dataset and the test dataset. The trained
dataset is larger than the test dataset. The trained dataset
undergoes a filtering process, while the test dataset
undergoes classification, in which the data are sliced. Both
the sliced data and the trained data are then passed to the
SVM.
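The division into a larger trained dataset and a smaller test dataset can be sketched as a shuffled split. The 70/30 ratio, the fixed seed and the function name `split_train_test` are assumptions for illustration; the paper states only that the trained dataset is the larger of the two.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into two parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(10))  # stand-in for rows retrieved by a Hive query
train, test = split_train_test(records)
```

Shuffling before the cut avoids a split that merely reflects the order in which the rows were stored.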
The SVM algorithm is used for binary and multi-class
problems and for anomaly detection. A hyperplane separates
the classes, and the critical points closest to it are known
as support vectors. In the simplest case of two support
vectors, the separating boundary is the perpendicular
bisector of the line joining them. These data are entered
into R input frames. The R input frames are used to extract
the data using statistical computing and graphics, and to
provide a statistical report. The statistical report is
provided for the linear and nonlinear cases and gives the
accuracy of each. The linear and nonlinear accuracies are
then compared to see which is more efficient. A grid
analysis is then done, which combines both accuracies and
produces the graph. From it, positive and negative data are
identified: positive data is safe, whereas negative data is
unsafe. The approach increases efficiency and takes less
time for analysing and retrieving the data. It improves data
processing speed and is able to analyse a large volume of
data in a short time compared to other tools. It provides
large-scale integration of data.
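The perpendicular bisector description can be worked through for the simplest case of exactly one support vector per class. The sketch below derives the hyperplane w.x + b = 0 from two points; a trained SVM generally has more support vectors and a soft margin, so this is an illustration only, and the point coordinates are made up.

```python
def bisecting_hyperplane(positive_sv, negative_sv):
    """Return (w, b) so that w.x + b = 0 bisects the two points at right angles."""
    w = [p - n for p, n in zip(positive_sv, negative_sv)]
    midpoint = [(p + n) / 2 for p, n in zip(positive_sv, negative_sv)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def classify(w, b, point):
    """Label a point by the side of the hyperplane it falls on."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return "positive" if score >= 0 else "negative"


# Two made-up support vectors in two dimensions.
w, b = bisecting_hyperplane([3.0, 3.0], [1.0, 1.0])
```

Here the normal vector w points from the negative support vector to the positive one, and b places the boundary through their midpoint.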
MODULES:
• Create schema in data warehouse
• Importing the data to HDFS
• Extracting the data
• Performance evaluation
• Statistical report
MODULE DESCRIPTION:
A) CREATE SCHEMA IN DATA WAREHOUSE:
In the database the data are in an unstructured format,
which is not directly readable. The database is uploaded to
the system, and to process the unstructured data in Hive a
schema is created. The schema is built from the attributes,
which are treated as fields in Hive. These fields can be
used to divide the dataset into test data and trained data,
where the test data is unpredicted data and the trained data
is predicted data, that is, data whose outcome is already
known.
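What a schema contributes can be shown by applying named, typed fields to raw delimited text. The sketch below is plain Python rather than Hive; the field names follow the age, name and sex example used earlier in the paper, while the comma delimiter and the sample records are assumptions.

```python
SCHEMA = [("name", str), ("age", int), ("sex", str)]  # hypothetical fields


def apply_schema(line, schema=SCHEMA, delimiter=","):
    """Turn one raw delimited line into a dict of typed fields."""
    parts = line.strip().split(delimiter)
    if len(parts) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(parts)}")
    return {field: cast(raw.strip())
            for (field, cast), raw in zip(schema, parts)}


raw_lines = ["Asha, 34, F", "Ravi, 51, M"]  # made-up records
rows = [apply_schema(line) for line in raw_lines]
```

Once every record has the same named fields, it can be filtered and divided by field value, which is what the Hive schema makes possible at scale.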
B) IMPORTING THE DATA INTO HDFS:
The Hadoop Distributed File System (HDFS) is designed to
store very large datasets and to stream those datasets at
high bandwidth to user applications. The database is
converted from unstructured to structured format by creating
the schema, and is then loaded into HDFS. In Hive's LOAD
DATA statement, the keywords LOCAL INPATH are used if the
file is stored on the local file system, whereas INPATH
alone is used if the file is already stored in HDFS. The
keyword OVERWRITE is used to replace old data with new data.
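The loading step can be sketched by composing the statement text. The sketch follows the documented HiveQL form LOAD DATA [LOCAL] INPATH 'path' [OVERWRITE] INTO TABLE name, where LOCAL selects the local file system; the helper name, file path and table name are hypothetical, and the code only builds a string without contacting Hive.

```python
def load_data_statement(path, table, local=True, overwrite=False):
    """Compose a HiveQL LOAD DATA statement as text (nothing is executed)."""
    parts = ["LOAD DATA"]
    if local:
        parts.append("LOCAL")  # read from the local file system, not HDFS
    parts.append(f"INPATH '{path}'")  # no escaping; illustration only
    if overwrite:
        parts.append("OVERWRITE")  # replace the table's existing contents
    parts.append(f"INTO TABLE {table}")
    return " ".join(parts)


# The path and table name below are hypothetical.
statement = load_data_statement("/tmp/patients.csv", "patients",
                                overwrite=True)
```

Without OVERWRITE the new file is added to the data already in the table instead of replacing it.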
C) EXTRACTING THE DATA:
Hive queries provide data summarization, query and analysis.
Hive gives an SQL-like interface to query data stored in
various databases and file systems that integrate with
Hadoop. It provides the necessary SQL abstraction to
integrate HiveQL queries into the underlying Java API
without the need to implement queries in the low-level API.
Hive supports easy portability of SQL-based applications to
Hadoop. It returns the sliced data from the datasets that
are relevant to the user's query. Using Hive, the data are
retrieved faster and large volumes of data can be handled.
As the database is stored on the same system on which the
processing takes place, the system acts as both client and
server.
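The kind of field-based slicing described above can be sketched with an ordinary SQL query. In the sketch below Python's built-in sqlite3 module stands in for Hive purely so that the example is self-contained; the table, columns and rows are hypothetical, and HiveQL differs from SQLite in many details.

```python
import sqlite3

# sqlite3 stands in for Hive here; table and column names are hypothetical.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE patients (name TEXT, age INTEGER, sex TEXT)")
connection.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [("Asha", 34, "F"), ("Ravi", 51, "M"), ("Meena", 47, "F")],
)

# Retrieve only the slice of the dataset that matches the user's query.
sliced = connection.execute(
    "SELECT name, age FROM patients WHERE sex = ? AND age > ? ORDER BY age",
    ("F", 40),
).fetchall()
connection.close()
```

The result contains only the rows and columns the query asked for, which is the "sliced data" that is later passed on for classification.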
[Figure: blocks labelled Data Storage, Data Sets, Trained Data, Test Data, SVM, Statistical Report]
D) PERFORMANCE EVALUATION:
In this approach, the Support Vector Machine (SVM) algorithm
is used for analysing and retrieving data. It is a
supervised learning approach, based on machine learning (ML)
techniques, that in its basic form fits a linear decision
boundary. It reduces time complexity and code complexity.
RStudio works with many types of data and produces the
results efficiently. The SVM algorithm is used here in two
forms, linear and radial. Accuracy is the measure determined
for each: the linear kernel yields one accuracy and the
radial kernel yields another. Comparing the two, the one
with the higher accuracy is considered the more efficient.
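The difference between the linear and radial forms lies in the kernel function that compares two instances. The sketch below writes both out and then selects the kernel with the higher accuracy; it assumes the radial parameterisation exp(-sigma * ||x - y||^2), which is one common convention, and the accuracy figures are placeholders rather than results from the paper.

```python
import math


def linear_kernel(x, y):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def radial_kernel(x, y, sigma=0.5):
    """Radial basis kernel, assumed form exp(-sigma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)


def pick_best(accuracy_by_kernel):
    """Return the kernel name with the highest reported accuracy."""
    return max(accuracy_by_kernel, key=accuracy_by_kernel.get)


# Placeholder accuracies, not results from the paper.
best_kernel = pick_best({"linear": 0.81, "radial": 0.86})
```

The radial kernel equals 1 for identical points and decays with distance, which is what allows it to model non-linear class boundaries.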
E) STATISTICAL REPORT:
The statistical report is produced in RStudio according to
the user's needs, with the R programming language used for
analysing the data. RStudio provides a graphical
representation of the input data. The linear and radial
results are combined to produce a grid graph, which helps to
identify the highly positive and negative values.
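The abstract describes the final graph in terms of cost and sigma, so the grid can be read as an evaluation over pairs of those two values. The sketch below enumerates such a grid; `toy_score` is a placeholder for training a radial SVM and measuring its accuracy, and the candidate values are assumptions.

```python
from itertools import product


def grid_search(costs, sigmas, score):
    """Evaluate score(cost, sigma) on every grid point; return the best."""
    results = {(c, s): score(c, s) for c, s in product(costs, sigmas)}
    best_point = max(results, key=results.get)
    return best_point, results


def toy_score(cost, sigma):
    # Placeholder for "train a radial SVM and measure its accuracy";
    # it simply peaks at cost = 1.0 and sigma = 0.1.
    return 1.0 - abs(cost - 1.0) - abs(sigma - 0.1)


best_point, results = grid_search([0.25, 0.5, 1.0], [0.01, 0.1, 1.0],
                                  toy_score)
```

Plotting the scores in `results` against cost and sigma would give a grid graph of the kind the section describes.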
SCREENSHOTS:
A) CREATING SCHEMA IN DATA WAREHOUSE:
B) IMPORTING THE DATA INTO HDFS:
C) LINEAR KERNEL GRAPH:
E) RADIAL KERNEL GRAPH:
F) RADIAL GRID GRAPH:
CONCLUSION:
The Hive tool is used for storing and retrieving large
volumes of data at high speed. Compared to other data mining
and cloud methodologies, Hive can be used to process and
store the exact data in a large database. RStudio is used to
provide the statistical report by analysing the data in the
database according to the user's requirements. PSO with the
SVM algorithm improves throughput efficiency.
FUTURE ENHANCEMENT:
In this paper the analysis is performed using the Hive tool,
and the statistical report is produced using R. The
statistical report identifies positive and negative values
in the database. In future, these values can be used for
prediction, indicating what problems may arise on the basis
of the previously analysed data. New algorithms could also
be derived to improve the accuracy and to reduce the time
taken to retrieve data from the database.
REFERENCES:
[1] Simon Fong, Raymond Wong, V. Vasilakos, “Accelerated
PSO swarm search feature selection for data stream mining
big data”, IEEE Transaction on Data Engineering, Vol. 10,
No. 7, July 2016.
[2] Ge Rietai, Gao Jing, “Improved PSO algorithm for energy
saving research in the double layer management mode of the
cloud platform”, Cloud Computing and Big Data
Analysis (2016).
[3] Jun Liu, Tianyun Shi, Ping Li, “Optimal cloud storage
problem in the distributed cloud data centres by the discrete
PSO algorithm”, Institute of Computing Technologies,
China (2015).
[4] S. Fong, J. Liang, R. Wong, M. Ghanavati, “A novel feature
selection by clustering coefficients of variations”, 2014 Ninth
International Conference on Digital Information
Management (ICDIM), Sept. 29, 2014, pp. 205-213.
[5] Gongjun Yan, Ding Wen, Stephan Olariu, Michele C. Weigle,
“Security challenges in vehicular cloud computing”, IEEE
Transactions on Intelligent Transportation Systems, Vol. 14,
No. 1, March 2013.

More Related Content

What's hot (19)

PDF
A Survey on Batch Auditing Systems for Cloud Storage
IRJET Journal
 
PDF
A time efficient and accurate retrieval of range aggregate queries using fuzz...
IJECEIAES
 
PDF
data Fusion and log correlation
Mahdi Sayyad
 
PDF
Peer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
IRJET Journal
 
PDF
Qo s aware scientific application scheduling algorithm in cloud environment
Alexander Decker
 
PDF
Parallel and distributed system projects for java and dot net
redpel dot com
 
PDF
IRJET-Auditing and Resisting Key Exposure on Cloud Storage
IRJET Journal
 
PDF
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET Journal
 
PDF
FDMC: Framework for Decision Making in Cloud for EfficientResource Management
IJECEIAES
 
PDF
Target Response Electrical usage Profile Clustering using Big Data
IRJET Journal
 
PPTX
Journals analysis ppt
Muhammad Heikal
 
PDF
Differentiating Algorithms of Cloud Task Scheduling Based on various Parameters
iosrjce
 
PDF
Use of genetic algorithm for
ijitjournal
 
DOCX
High performance intrusion detection using modified k mean & naïve bayes
eSAT Journals
 
PPTX
A TALE of DATA PATTERN DISCOVERY IN PARALLEL
Jenny Liu
 
PDF
V3 i35
silverscouts
 
PDF
SECURE & EFFICIENT AUDIT SERVICE OUTSOURCING FOR DATA INTEGRITY IN CLOUDS
Gyan Prakash
 
PDF
Efficient Cost Minimization for Big Data Processing
IRJET Journal
 
PDF
Ay4201347349
IJERA Editor
 
A Survey on Batch Auditing Systems for Cloud Storage
IRJET Journal
 
A time efficient and accurate retrieval of range aggregate queries using fuzz...
IJECEIAES
 
data Fusion and log correlation
Mahdi Sayyad
 
Peer-to-Peer Data Sharing and Deduplication using Genetic Algorithm
IRJET Journal
 
Qo s aware scientific application scheduling algorithm in cloud environment
Alexander Decker
 
Parallel and distributed system projects for java and dot net
redpel dot com
 
IRJET-Auditing and Resisting Key Exposure on Cloud Storage
IRJET Journal
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET Journal
 
FDMC: Framework for Decision Making in Cloud for EfficientResource Management
IJECEIAES
 
Target Response Electrical usage Profile Clustering using Big Data
IRJET Journal
 
Journals analysis ppt
Muhammad Heikal
 
Differentiating Algorithms of Cloud Task Scheduling Based on various Parameters
iosrjce
 
Use of genetic algorithm for
ijitjournal
 
High performance intrusion detection using modified k mean & naïve bayes
eSAT Journals
 
A TALE of DATA PATTERN DISCOVERY IN PARALLEL
Jenny Liu
 
V3 i35
silverscouts
 
SECURE & EFFICIENT AUDIT SERVICE OUTSOURCING FOR DATA INTEGRITY IN CLOUDS
Gyan Prakash
 
Efficient Cost Minimization for Big Data Processing
IRJET Journal
 
Ay4201347349
IJERA Editor
 

Similar to Svm Classifier Algorithm for Data Stream Mining Using Hive and R (20)

PDF
IRJET-Scaling Distributed Associative Classifier using Big Data
IRJET Journal
 
PDF
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Editor IJMTER
 
PDF
A survey of modified support vector machine using particle of swarm optimizat...
Editor Jacotech
 
PDF
Computational Methods Of Feature Selection Huan Liu Hiroshi Motoda
rasabigley
 
PDF
Parametric comparison based on split criterion on classification algorithm
IAEME Publication
 
PDF
Computational Methods of Feature Selection 1st Edition Huan Liu (Editor)
doweyhostel
 
PDF
Computational Methods of Feature Selection 1st Edition Huan Liu (Editor)
obkagyabu
 
PDF
IRJET- AC Duct Monitoring and Cleaning Vehicle for Train Coaches
IRJET Journal
 
PDF
Computational Methods of Feature Selection 1st Edition Huan Liu (Editor)
lemanamiddag
 
PDF
Ijariie1184
IJARIIE JOURNAL
 
PDF
Ijariie1184
IJARIIE JOURNAL
 
PDF
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET Journal
 
PDF
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
IRJET Journal
 
PDF
IRJET- Customer Online Buying Prediction using Frequent Item Set Mining
IRJET Journal
 
PDF
Novel Class Detection Using RBF SVM Kernel from Feature Evolving Data Streams
irjes
 
PDF
Fault detection of imbalanced data using incremental clustering
IRJET Journal
 
PDF
Feature Subset Selection for High Dimensional Data using Clustering Techniques
IRJET Journal
 
DOCX
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
Nexgen Technology
 
PPT
Data mining technique for classification and feature evaluation using stream ...
ranjit banshpal
 
PDF
Analysis on different Data mining Techniques and algorithms used in IOT
IJERA Editor
 
IRJET-Scaling Distributed Associative Classifier using Big Data
IRJET Journal
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Editor IJMTER
 
A survey of modified support vector machine using particle of swarm optimizat...
Editor Jacotech
 
Computational Methods Of Feature Selection Huan Liu Hiroshi Motoda
rasabigley
 
Parametric comparison based on split criterion on classification algorithm
IAEME Publication
 
Computational Methods of Feature Selection 1st Edition Huan Liu (Editor)
doweyhostel
 
Computational Methods of Feature Selection 1st Edition Huan Liu (Editor)
obkagyabu
 
IRJET- AC Duct Monitoring and Cleaning Vehicle for Train Coaches
IRJET Journal
 
Computational Methods of Feature Selection 1st Edition Huan Liu (Editor)
lemanamiddag
 
Ijariie1184
IJARIIE JOURNAL
 
Ijariie1184
IJARIIE JOURNAL
 
IRJET- Improved Model for Big Data Analytics using Dynamic Multi-Swarm Op...
IRJET Journal
 
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak...
IRJET Journal
 
IRJET- Customer Online Buying Prediction using Frequent Item Set Mining
IRJET Journal
 
Novel Class Detection Using RBF SVM Kernel from Feature Evolving Data Streams
irjes
 
Fault detection of imbalanced data using incremental clustering
IRJET Journal
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
IRJET Journal
 
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
Nexgen Technology
 
Data mining technique for classification and feature evaluation using stream ...
ranjit banshpal
 
Analysis on different Data mining Techniques and algorithms used in IOT
IJERA Editor
 
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
PDF
Kiona – A Smart Society Automation Project
IRJET Journal
 
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
PDF
Breast Cancer Detection using Computer Vision
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
PDF
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
PDF
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
IRJET Journal
 
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
IRJET Journal
 
Kiona – A Smart Society Automation Project
IRJET Journal
 
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
IRJET Journal
 
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
IRJET Journal
 
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
IRJET Journal
 
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
IRJET Journal
 
BRAIN TUMOUR DETECTION AND CLASSIFICATION
IRJET Journal
 
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
IRJET Journal
 
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
IRJET Journal
 
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
IRJET Journal
 
Breast Cancer Detection using Computer Vision
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
IRJET Journal
 
Auto-Charging E-Vehicle with its battery Management.
IRJET Journal
 
Analysis of high energy charge particle in the Heliosphere
IRJET Journal
 
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
IRJET Journal
 
Ad

Recently uploaded (20)

PPTX
Comparison of Flexible and Rigid Pavements in Bangladesh
Arifur Rahman
 
PDF
13th International Conference on Artificial Intelligence, Soft Computing (AIS...
ijait
 
PDF
Decision support system in machine learning models for a face recognition-bas...
TELKOMNIKA JOURNAL
 
PDF
A Brief Introduction About Robert Paul Hardee
Robert Paul Hardee
 
PDF
Module - 5 Machine Learning-22ISE62.pdf
Dr. Shivashankar
 
PDF
Bayesian Learning - Naive Bayes Algorithm
Sharmila Chidaravalli
 
PDF
LLC CM NCP1399 SIMPLIS MODEL MANUAL.PDF
ssuser1be9ce
 
PDF
William Stallings - Foundations of Modern Networking_ SDN, NFV, QoE, IoT, and...
lavanya896395
 
PPTX
Precooling and Refrigerated storage.pptx
ThongamSunita
 
PPSX
OOPS Concepts in Python and Exception Handling
Dr. A. B. Shinde
 
PPTX
CM Function of the heart pp.pptxafsasdfddsf
drmaneharshalid
 
PPTX
Electrical_Safety_EMI_EMC_Presentation.pptx
drmaneharshalid
 
PPTX
template.pptxr4t5y67yrttttttttttttttttttttttttttttttttttt
SithamparanaathanPir
 
PPTX
Diabetes diabetes diabetes diabetes jsnsmxndm
130SaniyaAbduNasir
 
PDF
NFPA 10 - Estandar para extintores de incendios portatiles (ed.22 ENG).pdf
Oscar Orozco
 
PDF
Python Mini Project: Command-Line Quiz Game for School/College Students
MPREETHI7
 
PPTX
Stability of IBR Dominated Grids - IEEE PEDG 2025 - short.pptx
ssuser307730
 
PDF
Clustering Algorithms - Kmeans,Min ALgorithm
Sharmila Chidaravalli
 
DOCX
Engineering Geology Field Report to Malekhu .docx
justprashant567
 
PDF
Artificial Neural Network-Types,Perceptron,Problems
Sharmila Chidaravalli
 
Comparison of Flexible and Rigid Pavements in Bangladesh
Arifur Rahman
 
13th International Conference on Artificial Intelligence, Soft Computing (AIS...
ijait
 
Decision support system in machine learning models for a face recognition-bas...
TELKOMNIKA JOURNAL
 
A Brief Introduction About Robert Paul Hardee
Robert Paul Hardee
 
Module - 5 Machine Learning-22ISE62.pdf
Dr. Shivashankar
 
Bayesian Learning - Naive Bayes Algorithm
Sharmila Chidaravalli
 
LLC CM NCP1399 SIMPLIS MODEL MANUAL.PDF
ssuser1be9ce
 
William Stallings - Foundations of Modern Networking_ SDN, NFV, QoE, IoT, and...
lavanya896395
 
Precooling and Refrigerated storage.pptx
ThongamSunita
 
OOPS Concepts in Python and Exception Handling
Dr. A. B. Shinde
 
CM Function of the heart pp.pptxafsasdfddsf
drmaneharshalid
 
Electrical_Safety_EMI_EMC_Presentation.pptx
drmaneharshalid
 
template.pptxr4t5y67yrttttttttttttttttttttttttttttttttttt
SithamparanaathanPir
 
Diabetes diabetes diabetes diabetes jsnsmxndm
130SaniyaAbduNasir
 
NFPA 10 - Estandar para extintores de incendios portatiles (ed.22 ENG).pdf
Oscar Orozco
 
Python Mini Project: Command-Line Quiz Game for School/College Students
MPREETHI7
 
Stability of IBR Dominated Grids - IEEE PEDG 2025 - short.pptx
ssuser307730
 
Clustering Algorithms - Kmeans,Min ALgorithm
Sharmila Chidaravalli
 
Engineering Geology Field Report to Malekhu .docx
justprashant567
 
Artificial Neural Network-Types,Perceptron,Problems
Sharmila Chidaravalli
 

Svm Classifier Algorithm for Data Stream Mining Using Hive and R

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056 Volume: 04 Issue: 03 | Mar -2017 www.irjet.net p-ISSN: 2395-0072 © 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 1341 SVM CLASSIFIER ALGORITHM FOR DATA STREAM MINING USING HIVE AND R Mrs.Pranamita Nanda1,B.Sandhiya2,R.Sandhiya3,A.S.Vanaja4 1Assistant Professor,2,3,4Students Department of Computer Science and Engineering Velammal Institute Of Technology, Ponneri, Tiruvallur. ---------------------------------------------------------------------***--------------------------------------------------------------------- Abstract: Big data is a challengingfunctionalityforanalyzing the large volume of data in the IT deployment in a different dimension. To make that analysis process in more efficient manner we use Hive tool for query processing and providing statistical report using RStudio. The processing load in data stream mining has been reduced by the technique know as Feature Selection. However, whenitcomestominingoverhigh dimensional data the search space from which an optimal feature subset is derived growsexponentiallyinsize, leadingto an intractable demand in computation. To reduce the complexity of using accelerated particle swarm optimization.(APSO), we connect the data by using Hadoop technology. Hadoop technology is easier to store and retrieve the data in a big data environment. With the dataset the data’s are analysed and the statisticalreportisproduced using SVM algorithm in R software where R languageisused. ThisR- software environment is used toprovideastatisicalcomputing and graphics. This statistical report compares the accuracy between the linear and non linear grid where the higher accuracy dataset is efficient. The final graph provides combination of the linear and nonlinear with respect to cost and sigma which is the userdefined value. PSO with SVM algorithm increases the performance of analysing the data. 
INTRODUCTION: Handling a large volume of data, including its storage and retrieval, is a challenging task. Data stream mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that, in many applications of data stream mining, can be read only once or a small number of times using limited computing and storage capabilities. We therefore use a data stream mining technique for retrieving the data. To make retrieval efficient, we use the Hadoop Hive tool for query processing, which takes less time to process. Processing starts by converting the unstructured data into structured data by creating a schema. The Hadoop environment stores data in the Hadoop Distributed File System (HDFS), and our database is imported into HDFS using a Hive query, either from the system we are working on or from an external source such as a server. In Hive, LOAD DATA LOCAL INPATH imports a file from the local file system, while LOAD DATA INPATH without LOCAL takes a file that is already in HDFS. The data extracted from the database is then divided into test data and trained data. The trained data is existing data whose outcome has already been predicted, and the test data is analysed against it. Both the test data and the trained data are used by the classification algorithm known as the Support Vector Machine (SVM). For a dataset consisting of a feature set and a label set, a classifier builds a model to predict classes. The parameter used to assess this process is accuracy: the SVM classifier evaluates the predicted data and reports the accuracy, and the result with the best accuracy is taken into consideration.
EXISTING SYSTEM: The lightweight feature selection technique known as swarm search is used for classifying the dataset.
There are many feature selection techniques, such as CCV and Improved PSO. The amount of data fed in is potentially infinite, and the data delivery is continuous, like a high-speed train of information. The processing is hence expected to be real-time and instantly responsive. Retrieving data from a large volume and maintaining it is difficult, and the accuracy is somewhat low, which is overcome by using the best classifier algorithm. A complication, on top of quantitatively computing the nonlinear relations between the feature values and the target classes, is the temporal nature of such data streams: one must crunch on the data stream long enough to model seasonal cycles or regular patterns accurately, if they exist at all. There are no straightforward relations that can easily map the attribute data to a specific class without long-term observation. This considerably affects the design of the data mining algorithm, which should be capable of just reading and forgetting the data stream.
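The "read and forget" constraint described above can be illustrated with a one-pass summary: each instance is seen once, folded into a few running numbers, and then discarded. The sketch below uses Welford's online algorithm for the mean and variance. It illustrates the constraint only and is not the feature selection method of the existing system; the class name `RunningStats` and the sample readings are hypothetical.

```python
class RunningStats:
    """Running mean and variance of a stream, using constant memory."""

    def __init__(self):
        self.n = 0        # number of instances seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's update: fold in one value, then forget it.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; reported as 0.0 until two instances have been seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize(stream):
    """Consume the stream in a single pass without storing any instance."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats


# Hypothetical feed; an iterator can be read only once, like a data stream.
readings = iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
stats = summarize(readings)
print(stats.n, stats.mean, stats.variance())
```

The memory used does not grow with the length of the stream, which is the property the paragraph above asks of a stream mining algorithm.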
  • 2. LITERATURE SURVEY: Although Big Data is a hype that raises many technical challenges confronting both academic research communities and commercial IT deployment, the root sources of Big Data are founded on data streams and the curse of dimensionality. It is generally known that data sourced from data streams accumulate continuously, making traditional batch-based model induction algorithms infeasible for real-time data mining. To tackle this problem, which stems mainly from the high dimensionality and streaming format of data feeds in Big Data, a novel lightweight feature selection is proposed. The feature selection is designed particularly for mining streaming data on the fly, using an accelerated particle swarm optimization (APSO) type of swarm search that achieves enhanced analytical accuracy within reasonable processing time.
In that paper, a collection of Big Data with an exceptionally large degree of dimensionality is put to the test with the new feature selection algorithm for performance evaluation. [1]
Energy-saving research on the virtualization of cloud computing platforms shows that there are problems in the management mode of existing virtualization platforms. The existing model is based on a single node that manages the whole platform and is responsible for migrating as well as scheduling all of the virtual machines. A double management model of the virtual machines is therefore proposed to solve the single-management-node bottleneck and to limit the scope of migration. At the same time, the improved PSO algorithm is used to plan the virtual machine migration. On the premise of meeting the service performance, the plan saves energy by keeping the number of booted servers to a minimum. Experiments show that the proposed management mode not only solves the bottleneck problem of the single management node but also reduces the migration scope and the difficulty of the problem. The improved PSO algorithm clearly raises the speed of migration and the overall energy efficiency of the scheme. [2]
The cloud storage problem is one of the interesting and important topics in the fields of cloud computing and big data. From the viewpoint of optimization, a discrete PSO algorithm is used to handle the cloud storage problem of the distributed data centers in China's railway and to copy data between two data centers. To achieve good performance with the smallest transmission distance, the discrete PSO algorithm essentially pairs data centers between the two data center sets.
Numerical results highlight that the discrete PSO algorithm can provide a guideline for the suboptimal cloud storage strategy of China's railway when the number of distributed data centers is 15, 17 and 18. [3]
One of the challenges in inferring a classification model with good prediction accuracy is selecting the relevant features that contribute the most predictive power. Many feature selection techniques have been proposed and studied in the past, but none so far has been claimed to be the best. In that paper, a novel and efficient feature selection method called Clustering Coefficients of Variation (CCV) is proposed. CCV is based on a very simple variance-based principle that finds an optimal balance between generalization and overfitting. Because of its simplicity of design, CCV is anticipated to be a useful alternative pre-processing method for classification, especially for datasets characterized by many features. [4]
In a series of recent papers, Prof. Olariu and his co-workers have promoted the vision of vehicular clouds (VCs), a nontrivial extension, along several dimensions, of conventional cloud computing. The main contribution of that work is to identify and analyze a number of security challenges and potential privacy threats in VCs. Although security issues have received attention in cloud computing and vehicular networks, the authors identify security challenges that are specific to VCs, e.g., authentication of high-mobility vehicles, scalability and single interface, tangled identities and locations, and the complexity of establishing trust relationships among multiple players caused by intermittent short-range communications. Additionally, they provide a security scheme that addresses several of the challenges discussed. [5]
PROPOSED SYSTEM: We propose an approach called data stream mining using Hadoop-Hive technology. To implement big data analytics in a highly scalable manner, big data needs Hadoop for processing the data.
The main research challenge here is finding the most appropriate model induction algorithm for mining data streams. As an additional consideration, pertaining to the possibility of embedding the data miner module into small devices, the memory requirement ought to be as small as possible, for obvious reasons of energy saving and fitting into a tiny device. In other words, the learned model, probably in the form of generalized nonlinear mappings between the values of the features and the predicted target classes, must be compact enough to execute in a small run-time memory. No room is wasted on storing features and relations that are insignificant or contribute little to the model accuracy. To this end, doing without feature selection is out of the question, given the number of original features extracted from the data streams. Since such models are built on a stationary dataset, a model update needs to repeat the whole training process whenever new samples arrive, adding them to incorporate the changing underlying patterns. In a dynamic stream processing environment, however, the data classification model would have to be updated frequently.
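The contrast drawn above, between repeating the whole training process and updating a model as samples arrive, can be sketched with a linear SVM trained by one stochastic subgradient step per sample (a Pegasos-style update). This is a swapped-in illustration under stated assumptions, not the algorithm used in the paper: the class name `OnlineLinearSVM`, the regularization value `lam`, the absence of a bias term, and the toy stream are all hypothetical.

```python
class OnlineLinearSVM:
    """Linear SVM updated one sample at a time (Pegasos-style, no bias term)."""

    def __init__(self, n_features, lam=0.01):
        self.w = [0.0] * n_features  # weight vector: the only state kept
        self.lam = lam               # regularization strength (assumed value)
        self.t = 0                   # number of updates performed so far

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

    def partial_fit(self, x, y):
        """One subgradient step on the hinge loss for (x, y), y in {-1, +1}."""
        self.t += 1
        eta = 1.0 / (self.lam * self.t)    # decaying step size
        violated = y * self.score(x) < 1   # inside the margin or misclassified
        # Shrink the weights (regularization part of the subgradient).
        self.w = [(1 - eta * self.lam) * wi for wi in self.w]
        if violated:
            # Move the weights towards classifying (x, y) with a margin.
            self.w = [wi + eta * y * xi for wi, xi in zip(self.w, x)]


# Hypothetical stream of (features, label) pairs, processed in arrival order.
stream = [
    ((2.0, 1.0), 1), ((-1.0, -2.0), -1), ((1.0, 2.0), 1),
    ((-2.0, -1.0), -1), ((3.0, 3.0), 1), ((-3.0, -3.0), -1),
]
model = OnlineLinearSVM(n_features=2)
for x, y in stream:
    model.partial_fit(x, y)  # each sample is used once and not stored

predictions = [model.predict(x) for x, _ in stream]
print(predictions)
```

Only the weight vector and a step counter are kept between samples, so the model stays compact and never needs the earlier samples again.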
  • 3. ARCHITECTURAL DIAGRAM AND EXPLANATION:
FIG. 2: PROPOSED ARCHITECTURE
The datasets related to the user's needs are retrieved from the database using a Hive query. Hive is the data warehouse used to analyse and retrieve the data. For this, the data must first be uploaded continuously into the database. The database may hold, for example, medical datasets, traffic light datasets and weather forecasting datasets; from these, the required one is retrieved using the Hive query. The datasets have multiple fields, such as age, name and sex, and the retrieval of data is based on these fields. The retrieved dataset is analysed and divided into two segments, the trained dataset and the test dataset; the trained dataset is larger than the test dataset. The trained dataset undergoes a filtering process, whereas the test dataset undergoes classification, in which the data is sliced. Both the sliced data and the trained data enter the SVM. The SVM algorithm is used for binary and multi-class problems and for anomaly detection. A hyperplane divides the classes, and the critical points closest to it are known as support vectors. In the simplest case, with one support vector per class, the separating hyperplane is the perpendicular bisector of the line joining these two support vectors. The data is entered into the R input frames, which are used to extract the data using statistical computing and graphics and to provide the statistical report. The statistical report is provided for both the linear and the nonlinear case and gives the accuracy of each. The linear accuracy and the nonlinear accuracy are then compared to assess the efficiency. Finally, the grid analysis is done, which combines both accuracies and provides the graph.
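The geometric statement above, that the separation is the perpendicular bisector of the line joining two support vectors, can be checked numerically for the case of one support vector per class, together with the accuracy measure used to compare models. The sketch below is a hand-built two-point case, not a general SVM solver; the function names and the sample points are hypothetical.

```python
def bisector_classifier(sv_pos, sv_neg):
    """Decision function for the hyperplane that perpendicularly bisects the
    segment joining one positive and one negative support vector."""
    # Normal vector: points from the negative to the positive support vector.
    w = [p - n for p, n in zip(sv_pos, sv_neg)]
    # The hyperplane passes through the midpoint of the segment.
    mid = [(p + n) / 2 for p, n in zip(sv_pos, sv_neg)]

    def decision(x):
        # Signed value: positive on the sv_pos side, negative on the sv_neg side.
        return sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mid))

    return decision


def accuracy(decision, samples):
    """Fraction of (point, label) pairs classified correctly, labels in {-1, +1}."""
    correct = sum(1 for x, y in samples if (1 if decision(x) > 0 else -1) == y)
    return correct / len(samples)


decision = bisector_classifier(sv_pos=(3.0, 3.0), sv_neg=(1.0, 1.0))

# Hypothetical test set: two points on each side of the hyperplane.
test_set = [((4.0, 4.0), 1), ((3.0, 2.5), 1), ((0.0, 1.0), -1), ((2.5, 1.0), -1)]
print(accuracy(decision, test_set))
```

The two support vectors lie at equal and opposite signed distances from the hyperplane, which is what makes it the maximum-margin separator for this two-point case.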
With that, the positive and negative data are identified. Positive data is safe, whereas negative data is unsafe. The approach increases efficiency and takes less time to analyse and retrieve the data. It improves the data processing speed and can analyse a large volume of data in a short time compared with other tools. It provides large-scale integration of data.
MODULES:
    • Create schema in data warehouse
    • Importing the data to HDFS
    • Extracting the data
    • Performance evaluation
    • Statistical report
MODULE DESCRIPTION:
A) CREATE SCHEMA IN DATA WAREHOUSE: In the database, the data is in an unstructured format, which is unreadable. The database is uploaded to the system and, to process the unstructured data in Hive, a schema is created. The schema is built from the attributes, which are treated as fields in Hive. These fields can be used to divide the dataset into test data and trained data, where test data is unpredicted data and trained data is predicted data.
B) IMPORTING THE DATA INTO HDFS: The Hadoop Distributed File System is designed to store very large datasets and to stream those datasets at high bandwidth to user applications. The database is converted from unstructured to structured format by creating the schema, and it is then loaded into HDFS. If the data file is on the local file system (for example, the desktop), LOAD DATA LOCAL INPATH is used; if the file is already in HDFS, LOAD DATA INPATH is used without LOCAL. The keyword OVERWRITE is used to replace old data with new data.
C) EXTRACTING THE DATA: Hive queries are used for data summarization, query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Hive provides the necessary SQL abstraction to integrate HiveQL queries into the underlying Java API without the need to implement queries in the low-level API. Hive supports easy portability of SQL-based applications to Hadoop.
It provides the sliced data from the datasets that is relevant to the user query. Using Hive, the data is retrieved faster, and it can handle a large volume of data. As the database is stored on the system and the processing also takes place on the same system, the system acts as both client and server.
[Fig. 2 block labels: Data storage; Data sets; Trained data; Test data; SVM; Statistical report]
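Modules A to C can be sketched as three HiveQL statements. They are held in Python strings here so that they could be handed to whatever Hive client is in use; no client call is shown. The table name `patients`, the `label` column, the file path and the filter are hypothetical, and the statements assume a comma-delimited text file.

```python
TABLE = "patients"                      # hypothetical table name
LOCAL_FILE = "/home/user/patients.csv"  # hypothetical local path

# A) Create a schema so Hive can read the delimited file as structured rows.
create_schema = f"""
CREATE TABLE IF NOT EXISTS {TABLE} (
    name  STRING,
    age   INT,
    sex   STRING,
    label INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
"""

# B) Import the file. LOCAL reads from the local file system; without LOCAL
#    the path is taken to be in HDFS. OVERWRITE replaces existing table data.
load_data = f"LOAD DATA LOCAL INPATH '{LOCAL_FILE}' OVERWRITE INTO TABLE {TABLE}"

# C) Extract only the slice of the dataset relevant to the user's query.
extract = f"SELECT age, sex, label FROM {TABLE} WHERE age > 40"

statements = [s.strip() for s in (create_schema, load_data, extract)]
for statement in statements:
    print(statement + ";")
```

Keeping the three steps as separate statements mirrors the module boundaries: the schema is created once, the load can be repeated as new data arrives, and the extraction query changes with each user request.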
  • 4. D) PERFORMANCE EVALUATION: In this approach, the Support Vector Machine (SVM) algorithm is used for analysing and retrieving the data. It is a supervised machine learning (ML) technique. It reduces the time complexity and the code complexity. RStudio adapts to any type of data and produces the result efficiently. Two SVM methods are used, linear and radial. Accuracy is the parameter determined using the SVM algorithm: the linear method yields one accuracy and the radial method yields another. Comparing the two, the method with the higher accuracy is considered efficient.
E) STATISTICAL REPORT: The statistical report is produced in RStudio according to the user's needs, with the R programming language used for analysing the data. RStudio provides a graphical representation of the input data. The linear and radial results are combined into a grid graph, which helps to identify the highly positive and negative values.
SCREENSHOTS: [Images not reproduced. Captions: A) Creating schema in data warehouse; B) Importing the data into HDFS; C) Linear kernel graph; E) Radial kernel graph; F) Radial grid graph.]
CONCLUSION: The Hive tool is an approach for storing and retrieving large volumes of data at higher speed. It can be used to process and store the exact data in a large database, compared with other data mining and cloud methodologies. RStudio is used to provide the statistical report by analysing the data in the database according to the user requirement. PSO with the SVM algorithm improves the throughput efficiency.
FUTURE ENHANCEMENT: In this paper, the analysis is performed using the Hive tool, and the statistical report is produced using the R software with the R language. The statistical report provides
  • 5. positive and negative values in the database. In future, these values can be used for prediction: the past analysed data would indicate what future problems may arise. New algorithms could be derived to increase the efficiency of the parameter, i.e., accuracy, and to reduce the time taken to retrieve data from the database.
REFERENCES:
[1] Simon Fong, Raymond Wong, V. Vasilakos, "Accelerated PSO swarm search feature selection for data stream mining big data", IEEE Transaction on Data Engineering, Vol. 10, No. 7, July 2016.
[2] Ge Rietai, Gao Jing, "Improved PSO algorithm for energy saving research in the double layer management mode of the cloud platform", Cloud Computing and Big Data Analysis (2016).
[3] Jun Liu, Tianyun Shi, Ping Li, "Optimal cloud storage problem in the distributed cloud data centres by the discrete PSO algorithm", Institute of Computing Technologies, China (2015).
[4] S. Fong, J. Liang, R. Wong, M. Ghanavati, "A novel feature selection by clustering coefficients of variations", 2014 Ninth International Conference on Digital Information Management (ICDIM), Sept. 29, 2014, pp. 205-213.
[5] Gong Jun Yan, Ding Wen, Stephan Olariu, Michael C. Weigle, "Security challenges in vehicular cloud computing", IEEE Transaction on Intelligent Transportation Systems, Vol. 14, No. 1, March 2013.