International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-
6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 2, March – April (2013), © IAEME
PARAMETRIC COMPARISON BASED ON SPLIT CRITERION ON
CLASSIFICATION ALGORITHM IN STREAM DATA MINING
Ms. Madhu S. Shukla*, Dr. K. H. Wandra**, Mr. Kirit R. Rathod***
*(PG-CE Student, Department of Computer Engineering),
(C.U.Shah College of Engineering and Technology, Gujarat, India)
** (Principal, Department of Computer Engineering),
(C.U.Shah College of Engineering and Technology, Gujarat, India)
*** (Assistant Professor, Department of Computer Engineering)
ABSTRACT
Stream data mining is an emerging research topic. Today, many applications generate massive amounts of stream data; examples include sensor networks, real-time surveillance systems, and telecommunication systems. Such data therefore requires intelligent processing that supports proper analysis and reuse of the data in other tasks. Mining stream data is concerned with extracting knowledge structures, represented as models and patterns, from non-stopping streams of information.
In stream data mining, the classification process is based on generating a decision tree, which makes the decision process easy. Given the characteristics of stream data, it is essential to handle large amounts of continuous, changing data accurately. Attribute selection at the non-leaf decision nodes thus becomes a critical analytic point of the classification process. Performance parameters such as speed of classification, accuracy, and CPU utilization time can be improved if the split criterion is implemented precisely. This paper presents an implementation of different attribute selection criteria and their comparison with an alternative method.
Keywords: Stream, Stream Data Mining, Performance Parameters, MOA (Massive Online Analysis), Split Criterion.
INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 2, March-April (2013), pp. 459-470
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com
1. INTRODUCTION
The characteristics of stream data also act as challenges. Because of its huge size, continuous nature, and the speed at which it changes, it requires a real-time response produced by analysing the data as it arrives. Since the data is huge, any algorithm that accesses it is restricted to a single scan of the data.
Data mining uses different algorithms for different mining tasks such as classification, clustering, and pattern recognition. Likewise, stream data mining uses different algorithms for its various mining tasks. Algorithms for classifying stream data include the Hoeffding Tree, VFDT (Very Fast Decision Tree), and CVFDT (Concept-adapting Very Fast Decision Tree). These classification algorithms base decision tree generation on the Hoeffding bound: the bound is used to gather just enough data that classification can be done accurately. CVFDT is additionally able to detect concept drift, which is another challenge in stream data mining. Because stream data is extremely large, a method is required for improving the split criterion at each node of the decision tree, so that tree generation is faster, accuracy is improved, and CPU utilization time is reduced. This paper examines two different split criteria for stream data classification, and improvement of the algorithm based on them is done as part of this research work.
As said earlier, stream data is huge, so in order to perform analysis we need to take a sample of the data so that it can be processed with ease. The sample should be chosen so that whatever data falls into it is worth analysing or processing, meaning maximum knowledge can be extracted from the sampled data. The sampling technique used in this paper is an adaptive sliding window within a Hoeffding-bound-based tree algorithm.
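The Hoeffding bound these algorithms rely on can be computed directly. The following is a minimal sketch (in Python; not code from the paper): for a random variable with range R observed over n independent samples, with probability 1 - delta the true mean lies within epsilon of the observed mean.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon: with probability 1 - delta, the true mean of a
    random variable with the given range lies within epsilon of the mean
    observed over n independent samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Example: for a binary class the range of information gain is 1.0.
# With delta = 1e-7, the bound shrinks as more examples are seen:
eps_200 = hoeffding_bound(1.0, 1e-7, 200)    # ~0.20
eps_5000 = hoeffding_bound(1.0, 1e-7, 5000)  # ~0.04
```

This is why the tree can commit to a split once enough examples have accumulated: when the observed gap between the best and second-best attribute exceeds epsilon, the choice is correct with probability 1 - delta.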
2. RELATED WORK
Implementing an algorithm for stream data classification demands better resource utilization as well as improved accuracy as the classification process goes on. Here we consider an improvement to an algorithm based on concept drift detection performed while classifying the data; drift detection is done using a windowing technique.
Sliding window: an advanced technique that performs detailed analysis over the most recent data items and works with summarized versions of older ones. The intuition behind the sliding window is that the user is more concerned with analysing the most recent data in the stream; thus detailed analysis is done over the most recent data items while older ones are kept only in summarized form. This idea has been adopted by many techniques in comprehensive data stream mining systems.
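As an illustration of the idea (a hypothetical minimal sketch, not the paper's implementation), a sliding window can keep the W most recent labelled examples together with per-class counts, forgetting the oldest example whenever the window overflows:

```python
from collections import deque

class SlidingWindow:
    """Minimal sliding-window sketch: keeps the `capacity` most recent
    labelled examples and per-class counts; the oldest example is forgotten
    on overflow, as in windowed stream statistics."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.window = deque()
        self.class_counts = {}

    def add(self, x, y):
        self.window.append((x, y))
        self.class_counts[y] = self.class_counts.get(y, 0) + 1
        if len(self.window) > self.capacity:   # overflow: forget the oldest
            _, old_y = self.window.popleft()
            self.class_counts[old_y] -= 1

w = SlidingWindow(capacity=3)
for i, label in enumerate(["a", "a", "b", "b"]):
    w.add(i, label)
# after 4 additions with capacity 3, the first "a" has been forgotten
```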
3. CLASSIFICATION PROCESS.
There are many data mining algorithms that exist in practice. Data mining algorithms
can be categorized in three types:
1. Classification
2. Clustering
3. Association
A standard classification system normally has three phases:
1. The training phase, during which the model is built using labeled data.
2. The testing phase, during which the model is tested by measuring its classification accuracy on withheld labeled data.
3. The deployment phase, during which the model is used to predict the class of unlabelled data.
The three phases are carried out in sequence; see Figure 3.1.
Fig 3.1: Phases of standard classification systems
3.1. STREAM DATA MINING
Ordinary classification is usually considered in three phases. In the first phase, a model is built from training data for which the property of interest (the class) is already known (labeled data). In the second phase, the model is used to predict the class of test data, for which the property of interest is known but which the model has not previously seen. In the third phase, the model is deployed and used to predict the property of interest for unlabelled data.
In stream classification, there is only a single stream of data, with labeled and unlabelled records occurring together in the stream. The training/testing and deployment phases therefore interleave. Stream classification of unlabelled records may be required from the beginning of the stream, after some sufficiently long initial sequence of labeled records, at specific moments in time, or for a specific block of records selected by an external analyst.
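This interleaving is what the interleaved test-then-train (prequential) evaluation scheme used by stream mining tools such as MOA captures: each labelled record first tests the current model and then updates it, in a single pass. A toy sketch (the MajorityClass learner is a hypothetical stand-in for a real incremental classifier):

```python
def prequential_accuracy(stream, model):
    """Interleaved test-then-train (prequential) evaluation: each labelled
    record is first used to test the current model, then to train it, so the
    testing and training phases interleave over one pass of the stream."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:   # test first ...
            correct += 1
        model.update(x, y)          # ... then train on the same record
        total += 1
    return correct / total if total else 0.0

class MajorityClass:
    """Toy incremental learner: always predicts the majority class seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

acc = prequential_accuracy([(0, "a"), (1, "a"), (2, "a"), (3, "b")], MajorityClass())
```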
4. ATTRIBUTE SELECTION CRITERION IN DECISION TREE:
Selecting an appropriate splitting criterion helps improve the main performance measurement dimensions. In data stream mining there are three such dimensions:
- Accuracy
- The amount of space (computer memory) necessary, i.e. model cost or RAM-hours
- The time required to learn from training examples and to predict (evaluation time)
These properties can be interdependent: adjusting the time and space used by an algorithm can influence accuracy. By storing more pre-computed information, such as lookup tables, an algorithm can run faster at the expense of space. An algorithm can also run faster by processing less information, either by stopping early or by storing less, and thus having less data to process. The more time an algorithm has, the more likely it is that accuracy can be increased.
There are two major types of attribute selection criterion: information gain and the Gini index; the latter is also known as the binary split criterion. During the late 1970s and 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 [1] (Iterative Dichotomiser). ID3 uses information gain for attribute selection. The information gain of an attribute A is given as Gain(A) = Info(D) - Info_A(D). We have developed a new algorithm to calculate information gain, which is methodologically promising. The algorithm is divided into two parts: the first part calculates Info(D) and the second part calculates Gain(A).
4.1. Information Gain Calculation: (information before split) - (information after split)
Entropy: a common way to measure impurity is entropy.
• Entropy = -Σᵢ pᵢ log₂ pᵢ, where pᵢ is the probability of class i, computed as the proportion of class i in the set.
• Entropy comes from information theory: the higher the entropy, the higher the information content.
• For a continuous attribute, each candidate split value is computed as the midpoint (aᵢ + aᵢ₊₁)/2 of adjacent sorted values.
Worked example (entire population: 30 instances, split into children of 17 and 13 instances):

Information Gain = entropy(parent) - [weighted average entropy(children)]

entropy(parent) = -(14/30) log₂(14/30) - (16/30) log₂(16/30) = 0.996
entropy(child, 17 instances) = -(13/17) log₂(13/17) - (4/17) log₂(4/17) = 0.787
entropy(child, 13 instances) = -(1/13) log₂(1/13) - (12/13) log₂(12/13) = 0.391

Weighted average entropy of children = (17/30)(0.787) + (13/30)(0.391) = 0.615
Information Gain = 0.996 - 0.615 = 0.38

Figure 4.1: Worked example of the information gain calculation
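The worked example above can be reproduced in a few lines (a sketch for checking the arithmetic, not the paper's two-part algorithm):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Parent: 30 instances (14 vs 16); children: 17 (13 vs 4) and 13 (1 vs 12).
parent = entropy([14, 16])                     # ~0.996
child1 = entropy([13, 4])                      # ~0.787
child2 = entropy([1, 12])                      # ~0.391
avg = (17 / 30) * child1 + (13 / 30) * child2  # ~0.615
gain = parent - avg                            # ~0.38
```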
4.2. Calculating the Gini Index
If a data set T contains examples from n classes, the Gini index gini(T) is defined as

gini(T) = 1 - Σⱼ₌₁ⁿ pⱼ²

where pⱼ is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed. After splitting T into two subsets T1 and T2 with sizes N1 and N2, the Gini index of the split data is defined as

gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

The attribute providing the smallest gini_split(T) is chosen to split the node.
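A corresponding sketch for the Gini formulas (hypothetical example code, reusing the 30-instance split from Section 4.1):

```python
def gini(counts):
    """Gini index of a class-count distribution: 1 - sum of squared frequencies."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted Gini index of a binary split into subsets with the given counts."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)

# Splitting 30 instances (14 vs 16) into subsets of 17 (13 vs 4) and 13 (1 vs 12):
before = gini([14, 16])            # ~0.498
after = gini_split([13, 4], [1, 12])  # ~0.265, lower is better
```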
5. METHODOLOGY AND PROPOSED ALGORITHM
CVFDT (Concept Adaptation Very fast Decision Tree) is an extended version of
VFDT which provides same speed and accuracy advantages but if any changes occur in
example generating process provide the ability to detect and respond. Various systems with
this CVFDT uses sliding window of various dataset to keep its model consistent. In Most of
systems, it needs to learn a new model from scratch after arrival of new data. Instead,
CVFDT continuous monitors the quality of new data and adjusts those that are no longer
correct. Whenever new data arrives, CVFDT incrementing counts for new data and
decrements counts for oldest data in the window. The concept is stationary than there is no
statically effect. If the concept is changing, however, some splits examples that will no longer
appear best because new data provides more gain than previous one. Whenever this thing
occurs, CVFDT create alternative sub-tree to find best attribute at root. Each time new best
tree replaces old sub tree and it is more accurate on new data.
5.1 CVFDT ALGORITHM (based on the Hoeffding tree)
1. Alternate trees for each node in HT start as empty.
2. Process examples from the stream indefinitely.
3. For each example (x, y):
4. Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through.
5. Add (x, y) to the sliding window of examples.
6. If the sliding window overflows, remove and forget the effect of the oldest example.
7. CVFDTGrow.
8. Check split validity if f examples have been seen since the last check of alternate trees.
9. Return HT.
Fig: 5.1 Flow of CVFDT algorithm
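The steps above can be sketched as a runnable skeleton. Note that StubTree is a hypothetical stand-in used only to make the loop executable; the real CVFDT maintains sufficient statistics, grows the tree, and checks alternate subtrees at every decision node.

```python
from collections import deque

class StubTree:
    """Hypothetical stand-in for the Hoeffding tree HT: it only tracks
    per-class counts so the loop below is runnable."""
    def __init__(self):
        self.counts = {}
        self.split_checks = 0

    def learn(self, x, y, weight=1):
        # A negative weight "forgets" an example, as CVFDT decrements counts.
        self.counts[y] = self.counts.get(y, 0) + weight

    def check_split_validity(self):
        self.split_checks += 1

def cvfdt_loop(stream, window_size, f):
    ht = StubTree()
    window = deque()
    for i, (x, y) in enumerate(stream, start=1):
        ht.learn(x, y)                    # steps 3-4: pass (x, y) down HT
        window.append((x, y))             # step 5: add to the sliding window
        if len(window) > window_size:     # step 6: forget the oldest example
            old_x, old_y = window.popleft()
            ht.learn(old_x, old_y, weight=-1)
        if i % f == 0:                    # step 8: periodic split-validity check
            ht.check_split_validity()
    return ht, window                     # step 9: return HT

# 2500 alternating-class examples, window of 1000, check every f = 200 examples:
ht, window = cvfdt_loop(((i, i % 2) for i in range(2500)), window_size=1000, f=200)
```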
6. EXPERIMENTAL ANALYSIS WITH OBSERVATION
Different types of dataset were taken and the algorithm of CVFDT was implemented
after Importing those data set to in MOA. Performance analysis of various split criterion used
in decision tree approach are also tested for improving the accuracy of the algorithm. Datasets
used here are in ARFF format. Some of the data are taken from Repository of California
University, some from projects of Spain which are working on Stream Data.
Data Sets taken were as follows:
1) Sensor
2) Sea
3) Random Tree generator.
The readings reported here are for the sensor data. It contains information (temperature, humidity, light, and sensor voltage) collected from 54 sensors deployed in the Intel Berkeley Research Lab. The whole stream contains consecutive readings recorded over a two-month period (one reading every 1-3 minutes). We used the sensor ID as the class label, so the learning task is to correctly identify the sensor (1 out of 54) purely from the sensor readings and the corresponding recording time. As the data stream flows over time, so do the concepts underlying the stream: for example, lighting during working hours is generally stronger than at night, and the temperature of specific sensors (e.g. in a conference room) may regularly rise during meetings.
Fig: 6.1 MIT Computer Science and Artificial Intelligence Lab data repository
As discussed above, an attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates given data. Two common methods used for it are:
1) The entropy-based method (i.e. information gain)
2) The Gini index
6.1 RANDOM TREE GENERATOR DATA SET RESULTS
Instances    Information Gain (Accuracy)    Gini Index (Accuracy)
100000       92.6                           81.7
200000       93                             83
300000       94.7                           80.1
400000       96.3                           82.2
500000       94.8                           80.9
600000       96.9                           81.9
700000       96.9                           82.6
800000       96.7                           82.1
900000       98.7                           84
1000000      97.4                           77.9
Table-I: Comparison of accuracy for the random tree generator
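Averaging the two accuracy columns of Table-I summarises the gap between the criteria on this dataset (a simple check over the table's values, not a computation from the paper):

```python
# Accuracy columns from Table-I (random tree generator, 10 checkpoints):
info_gain = [92.6, 93, 94.7, 96.3, 94.8, 96.9, 96.9, 96.7, 98.7, 97.4]
gini_index = [81.7, 83, 80.1, 82.2, 80.9, 81.9, 82.6, 82.1, 84, 77.9]

mean_ig = sum(info_gain) / len(info_gain)      # 95.8
mean_gini = sum(gini_index) / len(gini_index)  # 81.64
gap = mean_ig - mean_gini                      # over 14 percentage points
```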
6.2 SEA DATA SET RESULTS
Instances    Information Gain (Accuracy)    Gini Index (Accuracy)
100000       89.8                           89.3
200000       92.1                           91.6
300000       89.6                           89.3
400000       89.1                           88.9
500000       88.5                           88.5
600000       88.8                           88.1
700000       90.6                           90.6
800000       89.5                           89.3
900000       89.1                           89
1000000      89.9                           89.9
Table-II: Comparison of accuracy for the SEA data
6.3 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (CPU
UTILIZATION)
Learning evaluation    Evaluation time (CPU       Evaluation time (CPU
instances              seconds), Info Gain        seconds), Gini Index
100000                 6.676843                   8.704856
200000                 13.46289                   18.67332
300000                 20.23333                   29.40619
400000                 26.97257                   39.87386
500000                 33.68062                   49.63952
600000                 40.40426                   59.06198
700000                 47.0499                    67.70443
800000                 53.74234                   78.0941
900000                 59.93558                   88.14057
1000000                66.79963                   98.48343
1100000                73.27367                   107.1727
1200000                79.27971                   116.9851
1300000                85.53535                   127.016
1400000                91.99379                   136.6257
1500000                98.40543                   145.2993
1600000                104.3803                   152.9278
1700000                110.3083                   160.0102
1800000                116.4859                   168.1223
1900000                121.9928                   174.8459
Table-III: Comparison of CPU utilization time for the SENSOR data
6.4 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (ACCURACY)
Learning evaluation    Classifications correct    Classifications correct
instances              (percent), Info Gain       (percent), Gini Index
100000                 96.3                       98.4
200000                 68.3                       69.7
300000                 18                         64.4
400000                 43.2                       67.4
500000                 62.8                       72.9
600000                 92                         71
700000                 97.9                       72.5
800000                 97.4                       73.9
900000                 96.8                       73.7
1000000                80.6                       68.5
1100000                53.6                       71.2
1200000                71                         90.3
1300000                84.1                       73.1
1400000                78.5                       83.9
1500000                96.3                       84.9
1600000                50.9                       84.9
1700000                24                         79
1800000                74.3                       87.6
1900000                98                         97.8
Table-IV: Comparison of accuracy for the SENSOR data
6.5 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (TREE SIZE)
Learning evaluation    Tree size (nodes),         Tree size (nodes),
instances              Info Gain                  Gini Index
100000                 14                         126
200000                 30                         270
300000                 44                         396
400000                 60                         530
500000                 76                         666
600000                 88                         800
700000                 102                        938
800000                 122                        1076
900000                 136                        1214
1000000                150                        1346
1100000                172                        1466
1200000                196                        1602
1300000                216                        1742
1400000                226                        1868
1500000                240                        1998
1600000                262                        2122
1700000                282                        2238
1800000                292                        2352
1900000                312                        2474
Table-V: Comparison of tree size for the SENSOR data
6.6 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (LEAVES)
Learning evaluation    Tree size (leaves),        Tree size (leaves),
instances              Info Gain                  Gini Index
100000                 7                          63
200000                 15                         135
300000                 22                         198
400000                 30                         265
500000                 38                         333
600000                 44                         400
700000                 51                         469
800000                 61                         538
900000                 68                         607
1000000                75                         673
1100000                86                         733
1200000                98                         801
1300000                108                        871
1400000                113                        934
1500000                120                        999
1600000                131                        1061
1700000                141                        1119
1800000                146                        1176
1900000                156                        1237
Table-VI: Comparison of leaves for the SENSOR data
6.7 COMPARISON OF ALL PERFORMANCE DIMENSIONS TOGETHER FOR SENSOR DATA
Fig 6.2: Comparison of Performance for Sensor Data for every dimension together
7. CONCLUSION
In this paper, we discussed about theoretical aspects and practical results of Stream
Data Mining Classification algorithms with different split criterion. The comparison based on
different dataset shows the result analysis. Hoeffding trees with windowing technique spend
least amount of time for learning and results in higher accuracy than Gini Index. Memory
utilization, Accuracy and CPU Utilization which are crucial factor in Stream Data are
practically discussed here in this paper with observation. Classification generates decision
tree and tree generated with Split Criterion as Information gain shows that size of tree is also
decreased as shown in table along with dramatic change in accuracy and CPU Utilization.
REFERENCES
[1] Elena Ikonomovska, Suzana Loskovska and Dejan Gjorgjevik, "A Survey of Stream Data Mining", Eighth National Conference with International Participation (ETAI 2007).
[2] S. Muthukrishnan, "Data Streams: Algorithms and Applications", Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003.
[3] Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy, "Mining Data Streams: A Review", Centre for Distributed Systems and Software Engineering, Monash University, 900 Dandenong Rd, Caulfield East, VIC 3145, Australia.
[4] P. Domingos and G. Hulten, "A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering", Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, Morgan Kaufmann, 2001.
[5] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull, K. Sarkar, M. Klein, M. Vasa and D. Handy, "VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring", Proceedings of the SIAM International Conference on Data Mining, 2004.
[6] Albert Bifet and Ricard Gavaldà, "Adaptive Parameter-free Learning from Evolving Data Streams", Universitat Politècnica de Catalunya, Barcelona, Spain.
[7] Dariusz Brzezinski, "Mining Data Streams with Concept Drift", Master's thesis, Poznan University of Technology.
[8] R. Manickam, D. Boominath and V. Bhuvaneswari, "An Analysis of Data Mining: Past, Present and Future", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 1-9, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[9] M. Karthikeyan, M. Suriya Kumar and S. Karthikeyan, "A Literature Review on the Data Mining and Information Security", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141-146, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
More Related Content

PDF
Fp3111131118
PDF
Analysis on different Data mining Techniques and algorithms used in IOT
PDF
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
PDF
Data characterization towards modeling frequent pattern mining algorithms
PDF
IRJET- Expert Independent Bayesian Data Fusion and Decision Making Model for ...
PDF
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...
PDF
PDF
IRJET- Missing Data Imputation by Evidence Chain
Fp3111131118
Analysis on different Data mining Techniques and algorithms used in IOT
APPLICATION OF DYNAMIC CLUSTERING ALGORITHM IN MEDICAL SURVEILLANCE
Data characterization towards modeling frequent pattern mining algorithms
IRJET- Expert Independent Bayesian Data Fusion and Decision Making Model for ...
EVALUATION OF A NEW INCREMENTAL CLASSIFICATION TREE ALGORITHM FOR MINING HIGH...
IRJET- Missing Data Imputation by Evidence Chain

What's hot (18)

PDF
Comparative analysis of various data stream mining procedures and various dim...
PDF
A1802050102
PDF
A Firefly based improved clustering algorithm
PDF
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
PDF
Data mining techniques application for prediction in OLAP cube
PDF
Data performance characterization of frequent pattern mining algorithms
PDF
Ay4201347349
PDF
Predicting performance of classification algorithms
PDF
PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMS
PDF
Review: Data Driven Traffic Flow Forecasting using MapReduce in Distributed M...
PDF
Effective data mining for proper
PDF
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
PDF
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
PDF
IRJET- Analyze Weather Condition using Machine Learning Algorithms
PDF
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
PDF
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
PDF
Data Imputation by Soft Computing
PDF
GCUBE INDEXING
Comparative analysis of various data stream mining procedures and various dim...
A1802050102
A Firefly based improved clustering algorithm
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
Data mining techniques application for prediction in OLAP cube
Data performance characterization of frequent pattern mining algorithms
Ay4201347349
Predicting performance of classification algorithms
PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMS
Review: Data Driven Traffic Flow Forecasting using MapReduce in Distributed M...
Effective data mining for proper
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING
IRJET- Analyze Weather Condition using Machine Learning Algorithms
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Data Imputation by Soft Computing
GCUBE INDEXING
Ad

Viewers also liked (6)

PPTX
Null hypothesis for One way RM ANOVA
PPTX
What is a one-way repeated measures ANOVA?
PPTX
Null hypothesis for a one-way anova
DOCX
One way repeated measure anova
PPTX
Reporting a one way repeated measures anova
PDF
MICROCONTROLLER BASED SOLAR POWER INVERTER
Null hypothesis for One way RM ANOVA
What is a one-way repeated measures ANOVA?
Null hypothesis for a one-way anova
One way repeated measure anova
Reporting a one way repeated measures anova
MICROCONTROLLER BASED SOLAR POWER INVERTER
Ad

Similar to Parametric comparison based on split criterion on classification algorithm (20)

PDF
ME Synopsis
PDF
Predicting students' performance using id3 and c4.5 classification algorithms
PDF
Fn3110961103
PPTX
Data Mining: Mining stream time series and sequence data
PPTX
Data Mining: Mining stream time series and sequence data
PDF
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
PDF
An efficient feature selection in
PDF
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
PDF
Chapter 4.pdf
PDF
An Improved Differential Evolution Algorithm for Data Stream Clustering
PPT
Data mining technique for classification and feature evaluation using stream ...
PPT
2.2 decision tree
PDF
Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
PDF
Supervised Learning Decision Trees Review of Entropy
PDF
Supervised Learning Decision Trees Machine Learning
PDF
Z36149154
PPTX
Data classification
PPT
Classfication Basic.ppt
PDF
DATA MINING METHODOLOGIES TO STUDY STUDENT'S ACADEMIC PERFORMANCE USING THE...
PDF
IRJET- A Data Stream Mining Technique Dynamically Updating a Model with Dynam...
ME Synopsis
Predicting students' performance using id3 and c4.5 classification algorithms
Fn3110961103
Data Mining: Mining stream time series and sequence data
Data Mining: Mining stream time series and sequence data
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
An efficient feature selection in
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Chapter 4.pdf
An Improved Differential Evolution Algorithm for Data Stream Clustering
Data mining technique for classification and feature evaluation using stream ...
2.2 decision tree
Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
Supervised Learning Decision Trees Review of Entropy
Supervised Learning Decision Trees Machine Learning
Z36149154
Data classification
Classfication Basic.ppt
DATA MINING METHODOLOGIES TO STUDY STUDENT'S ACADEMIC PERFORMANCE USING THE...
IRJET- A Data Stream Mining Technique Dynamically Updating a Model with Dynam...

More from IAEME Publication (20)

PDF
IAEME_Publication_Call_for_Paper_September_2022.pdf
PDF
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
PDF
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
PDF
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
PDF
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
PDF
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
PDF
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
PDF
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
PDF
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
PDF
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
PDF
GANDHI ON NON-VIOLENT POLICE
PDF
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
PDF
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
PDF
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
PDF
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
PDF
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
PDF
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
PDF
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
PDF
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
PDF
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
IAEME_Publication_Call_for_Paper_September_2022.pdf
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
GANDHI ON NON-VIOLENT POLICE
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT

Recently uploaded (20)

PDF
Empathic Computing: Creating Shared Understanding
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
Electronic commerce courselecture one. Pdf
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPTX
Cloud computing and distributed systems.
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Sensors and Actuators in IoT Systems using pdf
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
madgavkar20181017ppt McKinsey Presentation.pdf
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Machine learning based COVID-19 study performance prediction
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
cuic standard and advanced reporting.pdf
PDF
Advanced IT Governance
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
Empathic Computing: Creating Shared Understanding
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
CIFDAQ's Market Insight: SEC Turns Pro Crypto
The Rise and Fall of 3GPP – Time for a Sabbatical?
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Electronic commerce courselecture one. Pdf
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Cloud computing and distributed systems.
Advanced methodologies resolving dimensionality complications for autism neur...
Sensors and Actuators in IoT Systems using pdf
Dropbox Q2 2025 Financial Results & Investor Presentation
Understanding_Digital_Forensics_Presentation.pptx
madgavkar20181017ppt McKinsey Presentation.pdf
Chapter 3 Spatial Domain Image Processing.pdf
Machine learning based COVID-19 study performance prediction
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
cuic standard and advanced reporting.pdf
Advanced IT Governance
Diabetes mellitus diagnosis method based random forest with bat algorithm

Parametric comparison based on split criterion on classification algorithm

  • 1. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976- 6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 2, March – April (2013), © IAEME 459 PARAMETRIC COMPARISON BASED ON SPLIT CRITERION ON CLASSIFICATION ALGORITHM IN STREAM DATA MINING Ms. Madhu S. Shukla*, Dr.K.H.Wandra**, Mr. Kirit R. Rathod*** *(PG-CE Student, Department of Computer Engineering), (C.U.Shah College of Engineering and Technology, Gujarat, India) ** (Principal, Department of Computer Engineering), (C.U.Shah College of Engineering and Technology, Gujarat, India) *** (Assistant Professor, Department of Computer Engineering) ABSTRACT Stream Data Mining is a new emerging topic in the field of research. Today, there are number of application that generate Massive amount of stream data. Examples of such kind of systems are Sensor networks, Real time surveillance systems, telecommunication systems. Hence there is requirement of intelligent processing of such type of data that would help in proper analysis and use of this data in other task even. Mining stream data is concerned with extracting knowledge structures represented in models and patterns in non stopping streams of information. Classification process based on generating decision tree in stream data mining that makes decision process easy. As per the characteristic of stream data, it becomes essential to handle large amount of continuous and changing data with accuracy. In classification process attribute selection at the non leaf decision node thus become a critical analytic point. Various performance parameter’s like Speed of Classification, Accuracy, and CPU Utilization time can be improved if split criterion is implemented precisely. This paper presents implementation of different attribute selection criteria and their comparison with alternative method. Keywords: Stream, Stream Data Mining, Performance Parameter processing, MOA (Massive Online Analysis), Split Criterion. 
International Journal of Computer Engineering & Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March-April 2013, pp. 459-470, © IAEME (www.iaeme.com/ijcet.asp). Journal Impact Factor (2013): 6.1302, calculated by GISI (www.jifactor.com).
1. INTRODUCTION

The characteristics of stream data also act as its challenges. Because of its huge size, its continuous nature, and the speed at which it changes, it demands real-time responses produced by analyzing this type of data. Since the data is huge, any algorithm that accesses it is restricted to a single scan.

Data mining uses different kinds of algorithms for different mining tasks, such as classification, clustering, and pattern recognition, and stream data mining does the same. Algorithms for the classification of stream data include the Hoeffding Tree, VFDT (Very Fast Decision Tree), and CVFDT (Concept-adapting Very Fast Decision Tree). These classification algorithms are based on the Hoeffding bound for decision tree generation: the bound is used to gather the optimum amount of data so that classification can be done accurately. CVFDT is additionally able to detect concept drift, which is another challenge in stream data mining.

As the size of stream data is extremely large, a method is required for improving the split criterion at the nodes of the decision tree, so that tree generation becomes faster, accuracy is improved, and CPU utilization time is reduced. Two different types of split criterion are evaluated for stream data classification in this paper, and an improvement to the algorithm based on them is made as part of the research work.

As said earlier, stream data is huge in size, so in order to perform analysis we need to take a sample of the data so that it can be processed with ease.
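The Hoeffding bound mentioned above states that, after n independent observations of a variable with range R, the true mean lies within ε of the observed mean with probability 1 − δ, where ε = sqrt(R² · ln(1/δ) / (2n)). A minimal sketch (the parameter values below are illustrative, not taken from the paper):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range is within epsilon of the mean observed
    over n independent samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Example: information gain over two classes has range log2(2) = 1.
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
```

A leaf is split once the observed gain difference between the two best attributes exceeds ε; since ε shrinks as n grows, the tree waits for exactly as many examples as the desired confidence requires.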
The samples should be chosen so that whatever data falls within them is worth analyzing or processing, meaning that maximum knowledge is extracted from the sampled data. The sampling technique used in this paper is an adaptive sliding window in a Hoeffding-bound-based tree algorithm.

2. RELATED WORK

Implementing an algorithm for stream data classification demands improvement in resource utilization as well as in accuracy as the classification process goes on. Here we look at an improvement to an algorithm based on concept drift detection performed while classifying the data. Drift detection here is done using a windowing technique.

Sliding window: This technique performs detailed analysis over the most recent data items and keeps only summarized versions of the older ones. The inspiration behind the sliding window is that the user is more concerned with analyzing the most recent data streams. This idea has been adopted by many techniques in comprehensive data stream mining systems.

3. CLASSIFICATION PROCESS

Many data mining algorithms exist in practice. They can be categorized into three types:
1. Classification
2. Clustering
3. Association
A standard classification system normally has three phases:
1. The training phase, during which the model is built using labeled data.
2. The testing phase, during which the model is tested by measuring its classification accuracy on withheld labeled data.
3. The deployment phase, during which the model is used to predict the class of unlabelled data.

The three phases are carried out in sequence; see Figure 3.1.

Fig 3.1: Phases of standard classification systems

3.1. STREAM DATA MINING

Ordinary classification is usually considered in three phases. In the first phase, a model is built using data called the training data, for which the property of interest (the class) is already known (labeled data). In the second phase, the model is used to predict the class of test data, for which the property of interest is known but which the model has not previously seen. In the third phase, the model is deployed and used to predict the property of interest for unlabelled data.

In stream classification, there is only a single stream of data, in which labeled and unlabelled records occur together; the training/test and deployment phases therefore interleave. Stream classification of unlabelled records may be required from the beginning of the stream, after some sufficiently long initial sequence of labeled records, at specific moments in time, or for a specific block of records selected by an external analyst.

4. ATTRIBUTE SELECTION CRITERION IN DECISION TREES

Selecting an appropriate splitting criterion helps improve the performance measurement dimensions.
Data stream mining has three main performance measurement dimensions:
- Accuracy
- The amount of space (computer memory) necessary, i.e. model cost or RAM-hours
- The time required to learn from training examples and to predict, i.e. evaluation time

These properties may be interdependent: adjusting the time and space used by an algorithm can influence accuracy. By storing more pre-computed information, such as lookup tables, an algorithm can run faster at the expense of space. An algorithm can also run faster by processing less information, either by stopping early or by storing less and thus having less data to process. Conversely, the more time an algorithm has, the more likely it is that accuracy can be increased.
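Accuracy on a stream is commonly measured in the interleaved train/test fashion described in Section 3.1: each labeled record first tests the current model and only then trains it (test-then-train, or prequential, evaluation). A minimal illustration with a trivial majority-class learner standing in for the real classifier (this is a sketch, not the paper's CVFDT setup):

```python
from collections import Counter

def prequential_accuracy(stream):
    """Test-then-train over a finite list of (features, label) records:
    predict each record's class before learning from it, so every labeled
    record serves as test data exactly once."""
    counts = Counter()  # the "model": class frequencies seen so far
    correct = 0
    for _, label in stream:
        if counts:  # first record has no model to test
            prediction = counts.most_common(1)[0][0]
            if prediction == label:
                correct += 1
        counts[label] += 1  # train on the record after testing
    return correct / max(len(stream) - 1, 1)
```

Replacing the majority-class model with an actual tree learner gives the accuracy curves reported in the tables below.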
There are two major types of attribute selection criterion: information gain and the Gini index; the latter is also known as the binary split criterion. During the late 1970s and 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) [1]. ID3 uses information gain for attribute selection, where the gain of attribute A is Gain(A) = Info(D) - Info_A(D). We have developed a new algorithm to calculate information gain; methodologically this algorithm is promising. It is divided into two parts: the first part calculates Info(D) and the second part calculates Gain(A).

4.1. Information Gain Calculation

Information gain = (information before the split) - (information after the split).

A common way to measure impurity is entropy:

    Entropy = -Σ_i p_i log2 p_i

where p_i is the probability of class i, computed as the proportion of class i in the set. Entropy comes from information theory: the higher the entropy, the more the information content. For a continuous attribute, candidate split points are taken midway between adjacent sorted values, (a_i + a_{i+1}) / 2.

Figure 4.1 works the calculation for an entire population of 30 instances (14 of one class, 16 of the other) split into a child of 17 instances (13 and 4) and a child of 13 instances (1 and 12):

    entropy(parent)  = -(14/30) log2(14/30) - (16/30) log2(16/30) = 0.996
    entropy(child 1) = -(13/17) log2(13/17) - (4/17) log2(4/17)   = 0.787
    entropy(child 2) = -(1/13) log2(1/13) - (12/13) log2(12/13)   = 0.391
    (weighted) average entropy of children = (17/30)(0.787) + (13/30)(0.391) = 0.615
    Information gain = entropy(parent) - average entropy(children) = 0.996 - 0.615 = 0.38

Figure 4.1: Worked example of an information gain calculation
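The calculation in Figure 4.1 can be reproduced directly; a short sketch (class distributions passed as count lists):

```python
import math

def entropy(class_counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_counts_list)
    return entropy(parent_counts) - weighted

# The Figure 4.1 example: 30 instances split into (13, 4) and (1, 12).
gain = information_gain([14, 16], [[13, 4], [1, 12]])
```

Running it reproduces the figure's values: parent entropy ≈ 0.996, children ≈ 0.787 and 0.391, gain ≈ 0.38.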
4.2. Calculating the Gini Index

If a data set T contains examples from n classes, the Gini index gini(T) is defined as

    gini(T) = 1 - Σ_{j=1}^{n} p_j²

where p_j is the relative frequency of class j in T. gini(T) is minimized when the classes in T are skewed. After splitting T into two subsets T1 and T2 with sizes N1 and N2 (N = N1 + N2), the Gini index of the split data is defined as

    gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

The attribute providing the smallest gini_split(T) is chosen to split the node.
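The same example node from Figure 4.1, scored with the Gini criterion instead; note that the split is judged by the weighted child impurity alone, and smaller is better:

```python
def gini(class_counts):
    """gini(T) = 1 - sum_j p_j^2 over the class proportions p_j."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(child_counts_list):
    """Size-weighted Gini index of the partitions produced by a split;
    the candidate split with the smallest value is chosen."""
    n = sum(sum(child) for child in child_counts_list)
    return sum(sum(child) / n * gini(child) for child in child_counts_list)

parent_impurity = gini([14, 16])          # about 0.498
split_impurity = gini_split([[13, 4], [1, 12]])  # about 0.265
```

A pure node scores 0 and a balanced two-class node scores 0.5, so the drop from roughly 0.498 to 0.265 confirms this is a useful split under the Gini criterion as well.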
5. METHODOLOGY AND PROPOSED ALGORITHM

CVFDT (Concept-adapting Very Fast Decision Tree) is an extended version of VFDT that provides the same speed and accuracy advantages but adds the ability to detect and respond to changes in the example-generating process. CVFDT maintains a sliding window over the data to keep its model consistent. Most systems need to learn a new model from scratch after new data arrives; CVFDT instead continuously monitors the quality of its earlier decisions and adjusts those that are no longer correct. Whenever new data arrives, CVFDT increments the counts for the new data and decrements the counts for the oldest data in the window. If the concept is stationary, this has no statistical effect. If the concept is changing, however, some splits that previously appeared best will no longer appear best, because the new data gives more gain to other attributes. Whenever this occurs, CVFDT starts growing an alternate subtree with the new best attribute at its root. The old subtree is replaced by the alternate subtree once the latter becomes more accurate on new data.

5.1 CVFDT ALGORITHM (based on the Hoeffding tree)

1. Alternate trees for each node in HT start as empty.
2. Process examples from the stream indefinitely.
3. For each example (x, y):
4.   Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through.
5.   Add (x, y) to the sliding window of examples.
6.   If the sliding window overflows, remove the oldest example and forget its effect.
7.   CVFDTGrow.
8.   Check split validity if f examples have been seen since the last check of alternate trees.
9. Return HT.

Fig 5.1: Flow of the CVFDT algorithm
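Steps 5-6 above (add the new example, forget the oldest one when the window overflows) can be sketched as follows. The real CVFDT keeps per-node attribute-value-class counts; this illustration keeps only plain class counts, but the bookkeeping pattern is the same: statistics always reflect exactly the current window.

```python
from collections import Counter, deque

class WindowedCounts:
    """Sliding-window bookkeeping in the style of CVFDT: each arriving
    example increments the counts, and each evicted example decrements
    them, so the counts track the window without retraining from scratch."""

    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.counts = Counter()

    def observe(self, label):
        self.window.append(label)
        self.counts[label] += 1
        if len(self.window) > self.window_size:
            oldest = self.window.popleft()   # step 6: forget the oldest example
            self.counts[oldest] -= 1
```

Under a stationary concept, incrementing and decrementing cancel out statistically; under drift, the counts shift toward the new concept and eventually make a different split attribute look best, triggering an alternate subtree.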
6. EXPERIMENTAL ANALYSIS WITH OBSERVATIONS

Different types of datasets were taken, and the CVFDT algorithm was run after importing those datasets into MOA. The various split criteria used in the decision tree approach were also tested with a view to improving the accuracy of the algorithm. The datasets used here are in ARFF format. Some of the data were taken from the repository of the University of California, and some from projects in Spain working on stream data. The datasets were: 1) Sensor, 2) SEA, 3) Random tree generator.

The readings reported here are for the Sensor data. It contains information (temperature, humidity, light, and sensor voltage) collected from 54 sensors deployed in the Intel Berkeley Research Lab. The whole stream contains consecutive readings recorded over a two-month period (one reading every 1-3 minutes). We used the sensor ID as the class label, so the learning task is to correctly identify the sensor (1 out of 54) purely from the sensor readings and the corresponding recording time. As the data stream flows over time, so do the concepts underlying the stream: for example, the lighting during working hours is generally stronger than at night, and the temperature of specific sensors (e.g. in a conference room) may regularly rise during meetings.

Fig 6.1: MIT Computer Science and Artificial Intelligence Lab data repository

As discussed above, an attribute selection measure is a heuristic for selecting the splitting criterion that "best" separates the given data. The two common methods compared are: 1) the entropy-based method (information gain), and 2) the Gini index.

6.1 RANDOM TREE GENERATOR DATASET RESULTS
Instances    Information Gain (accuracy, %)    Gini Index (accuracy, %)
100000       92.6                              81.7
200000       93                                83
300000       94.7                              80.1
400000       96.3                              82.2
500000       94.8                              80.9
600000       96.9                              81.9
700000       96.9                              82.6
800000       96.7                              82.1
900000       98.7                              84
1000000      97.4                              77.9

Table-I: Comparison of accuracy for the random tree generator data

6.2 SEA DATASET RESULTS

Instances    Information Gain (accuracy, %)    Gini Index (accuracy, %)
100000       89.8                              89.3
200000       92.1                              91.6
300000       89.6                              89.3
400000       89.1                              88.9
500000       88.5                              88.5
600000       88.8                              88.1
700000       90.6                              90.6
800000       89.5                              89.3
900000       89.1                              89
1000000      89.9                              89.9

Table-II: Comparison of accuracy for the SEA data
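As a quick summary of Table-I, averaging its ten accuracy readings quantifies the gap between the two criteria on the random tree generator data (values transcribed from the table):

```python
# Accuracy figures transcribed from Table-I (random tree generator).
info_gain = [92.6, 93.0, 94.7, 96.3, 94.8, 96.9, 96.9, 96.7, 98.7, 97.4]
gini_index = [81.7, 83.0, 80.1, 82.2, 80.9, 81.9, 82.6, 82.1, 84.0, 77.9]

mean_ig = sum(info_gain) / len(info_gain)      # about 95.8
mean_gini = sum(gini_index) / len(gini_index)  # about 81.6
gap = mean_ig - mean_gini                      # about 14 percentage points
```

On this dataset, information gain averages roughly 14 percentage points higher accuracy than the Gini index; Table-II shows the two criteria nearly tied on the SEA data, so the advantage is dataset-dependent.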
6.3 PERFORMANCE ANALYSIS BASED ON THE SENSOR DATASET (CPU UTILIZATION)

Learning evaluation instances    Evaluation time (CPU s), Info Gain    Evaluation time (CPU s), Gini Index
100000                           6.676843                              8.704856
200000                           13.46289                              18.67332
300000                           20.23333                              29.40619
400000                           26.97257                              39.87386
500000                           33.68062                              49.63952
600000                           40.40426                              59.06198
700000                           47.0499                               67.70443
800000                           53.74234                              78.0941
900000                           59.93558                              88.14057
1000000                          66.79963                              98.48343
1100000                          73.27367                              107.1727
1200000                          79.27971                              116.9851
1300000                          85.53535                              127.016
1400000                          91.99379                              136.6257
1500000                          98.40543                              145.2993
1600000                          104.3803                              152.9278
1700000                          110.3083                              160.0102
1800000                          116.4859                              168.1223
1900000                          121.9928                              174.8459

Table-III: Comparison of CPU utilization time for the Sensor data
6.4 PERFORMANCE ANALYSIS BASED ON THE SENSOR DATASET (ACCURACY)

Learning evaluation instances    Classifications correct (%), Info Gain    Classifications correct (%), Gini Index
100000                           96.3                                      98.4
200000                           68.3                                      69.7
300000                           18                                        64.4
400000                           43.2                                      67.4
500000                           62.8                                      72.9
600000                           92                                        71
700000                           97.9                                      72.5
800000                           97.4                                      73.9
900000                           96.8                                      73.7
1000000                          80.6                                      68.5
1100000                          53.6                                      71.2
1200000                          71                                        90.3
1300000                          84.1                                      73.1
1400000                          78.5                                      83.9
1500000                          96.3                                      84.9
1600000                          50.9                                      84.9
1700000                          24                                        79
1800000                          74.3                                      87.6
1900000                          98                                        97.8

Table-IV: Comparison of accuracy for the Sensor data
6.5 PERFORMANCE ANALYSIS BASED ON THE SENSOR DATASET (TREE SIZE)

Learning evaluation instances    Tree size (nodes), Info Gain    Tree size (nodes), Gini Index
100000                           14                              126
200000                           30                              270
300000                           44                              396
400000                           60                              530
500000                           76                              666
600000                           88                              800
700000                           102                             938
800000                           122                             1076
900000                           136                             1214
1000000                          150                             1346
1100000                          172                             1466
1200000                          196                             1602
1300000                          216                             1742
1400000                          226                             1868
1500000                          240                             1998
1600000                          262                             2122
1700000                          282                             2238
1800000                          292                             2352
1900000                          312                             2474

Table-V: Comparison of tree size for the Sensor data
6.6 PERFORMANCE ANALYSIS BASED ON THE SENSOR DATASET (LEAVES)

Learning evaluation instances    Tree size (leaves), Info Gain    Tree size (leaves), Gini Index
100000                           7                                63
200000                           15                               135
300000                           22                               198
400000                           30                               265
500000                           38                               333
600000                           44                               400
700000                           51                               469
800000                           61                               538
900000                           68                               607
1000000                          75                               673
1100000                          86                               733
1200000                          98                               801
1300000                          108                              871
1400000                          113                              934
1500000                          120                              999
1600000                          131                              1061
1700000                          141                              1119
1800000                          146                              1176
1900000                          156                              1237

Table-VI: Comparison of leaves for the Sensor data

6.7 COMPARISON OF ALL PERFORMANCE DIMENSIONS TOGETHER FOR THE SENSOR DATA

Fig 6.2: Comparison of performance for the Sensor data across all dimensions together
7. CONCLUSION

In this paper we discussed theoretical aspects and practical results of stream data mining classification algorithms with different split criteria. The comparison across different datasets shows that Hoeffding trees with the windowing technique spend the least time learning and achieve higher accuracy with information gain than with the Gini index. Memory utilization, accuracy, and CPU utilization time, which are crucial factors for stream data, are discussed here with observations. The decision tree generated with information gain as the split criterion is also smaller, as shown in the tables, along with a marked improvement in accuracy and CPU utilization time.

REFERENCES

[1] Elena Ikonomovska, Suzana Loskovska and Dejan Gjorgjevik, "A Survey of Stream Data Mining", Eighth National Conference with International Participation (ETAI 2007).
[2] S. Muthukrishnan, "Data Streams: Algorithms and Applications", Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003.
[3] Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy, "Mining Data Streams: A Review", Centre for Distributed Systems and Software Engineering, Monash University, 900 Dandenong Rd, Caulfield East, VIC 3145, Australia.
[4] P. Domingos and G. Hulten, "A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering", Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, Morgan Kaufmann, 2001.
[5] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull, K. Sarkar, M. Klein, M. Vasa and D. Handy, "VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring", Proceedings of the SIAM International Conference on Data Mining, 2004.
[6] Albert Bifet and Ricard Gavaldà, "Adaptive Parameter-free Learning from Evolving Data Streams", Universitat Politècnica de Catalunya, Barcelona, Spain.
[7] Dariusz Brzezinski, "Mining Data Streams with Concept Drift", Master's thesis, Poznan University of Technology.
[8] R. Manickam, D. Boominath and V. Bhuvaneswari, "An Analysis of Data Mining: Past, Present and Future", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 1-9, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[9] M. Karthikeyan, M. Suriya Kumar and S. Karthikeyan, "A Literature Review on the Data Mining and Information Security", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141-146, ISSN Print: 0976-6367, ISSN Online: 0976-6375.