Evaluating classification algorithms applied to data streams
Author: Ing. Esteban D. Donato
Advisor: Dr. Fazel Famili
Co-Advisor: Dra. Ana S. Haedo
Dec-2009
Maestría en Explotación de Datos y Descubrimiento del Conocimiento
Introduction
Most companies and organizations collect and maintain gigantic databases that grow by millions of records per day.
Current algorithms for mining complex models from data cannot mine even a fraction of these data in useful time.
Concept drift: occurs when the underlying data distribution changes over time.
Objective
To perform a benchmarking analysis of several known algorithms applied to data streams.
The algorithms chosen for this study are UFFT, CVFDT and VFDTc.
The analysis focuses on aspects that every algorithm applied to data streams has to deal with.
Related work
A data stream is a sequence of data items x_1, …, x_i, …, x_n, read one at a time in increasing order of the indices.
Off-line learning: assumes that the dataset resides in a static database and has been generated from a static distribution. It also assumes that all the data is available before training and that all the examples fit into memory.
Incremental learning: the items are time-ordered and the distribution that generates them varies over time. Systems evolve and change the concept definition as new observations are processed.
Related work (Cont.): Data Stream Mining
A subarea of incremental learning: data accumulates faster than it can be mined.
Requirements for a data stream mining algorithm:
It must require a small constant time per record.
It must use only a fixed amount of main memory.
It must be able to build a model using at most one scan of the data.
It must make a usable model available at any point in time.
Ideally, it should produce a model equivalent to the one that would be obtained by the corresponding ordinary database mining algorithm.
The model should be up to date at any time.
Types of algorithms: rule sets, induction trees and ensemble methods.
Related work (Cont.): Very Fast Decision Tree (VFDT)
Requires each example to be read only once and a small constant time to process it.
Building process: given a stream of examples, the first ones are used to choose the root, and the following examples are passed down to the corresponding leaves.
The Hoeffding bound is used to decide how many examples are needed at each node.
The Hoeffding bound: with probability 1 - φ, the true mean of a random variable with range R is at least r - ε, where ε = sqrt(R² ln(1/φ) / (2n)) and r is the mean observed over n examples.
Let ΔG = G(Xa) - G(Xb) >= 0 be the observed difference in the evaluation measure between the two best attributes. If ΔG > ε, then the true difference is at least ΔG - ε > 0 with probability 1 - φ, so Xa can be chosen to split the node (a sketch follows below).
Other features: pre-pruning, different evaluation measures, ties, memory management, poor attributes, initialization, rescans.
Drawback: it does not detect concept drift.
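To make the bound concrete, here is a minimal sketch (in Python, not the original implementation) of how a Hoeffding-tree leaf could decide whether it has seen enough examples to split. The function names, the confidence and tie-threshold values, and the use of R = log2(number of classes) as the range of the information gain are illustrative assumptions.

```python
import math

def hoeffding_bound(value_range, confidence, n):
    """Epsilon such that, with probability 1 - confidence, the observed mean of
    n independent observations is within epsilon of the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / confidence) / (2.0 * n))

def should_split(gain_best, gain_second_best, n_examples, num_classes,
                 confidence=1e-6, tie_threshold=0.05):
    """Decide whether the best attribute can be chosen at a leaf.

    gain_best / gain_second_best: information gain of the two best split
    candidates, estimated from the n_examples seen so far at this leaf.
    """
    R = math.log2(num_classes)   # information gain is bounded by log2(#classes)
    epsilon = hoeffding_bound(R, confidence, n_examples)
    delta_gain = gain_best - gain_second_best
    # Split when the best attribute beats the runner-up with high probability,
    # or when the two are so close that waiting longer is pointless (the "ties"
    # mechanism mentioned above).
    return delta_gain > epsilon or epsilon < tie_threshold

# Example: after 2000 examples, gains 0.25 vs 0.17 on a 2-class problem.
print(should_split(0.25, 0.17, 2000, 2))   # True, since 0.08 > epsilon ~ 0.059
```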
Related work (Cont.): Concept Drift
A change in the target concept that depends on some hidden context, not given explicitly in the form of predictive features.
Examples: weather prediction, customers' buying preferences, etc.
A concept drift handling system should be able to:
Quickly adapt to concept drift.
Be robust to noise and distinguish noise from concept drift.
Recognize and treat recurring contexts.
Types: sudden, gradual, frequent and virtual concept drift.
Conclusion of literature review
A data stream is a sequence of time-ordered items arriving faster than they can be mined.
Changes in the underlying data distribution may occur, requiring the algorithms to detect and adapt to these changes.
The main challenge in incremental learning is how to detect and adapt to concept drift.
To deal with data arriving fast, the algorithms must require a small constant processing time per record.
One of the first algorithms developed was VFDT, which uses the Hoeffding bound.
A difficult problem in concept drift is distinguishing between a true concept drift and noise.
Algorithm: VFDTc (Very Fast Decision Tree for Continuous attributes)
Extension of VFDT in three directions: continuous data, functional leaves, and concept drift.
For a continuous attribute, the split test is a condition of the form attr_i <= cut_point; information gain is used to choose the cut point.
Functional tree leaves: an innovative aspect of this algorithm is its ability to use naive Bayes classifiers at the tree leaves (see the sketch below). A leaf must see n_min examples before computing the evaluation function.
Concept drift handling is based on the assumption that, whatever the cause of the drift, the decision surface moves. Two detection methods are supported: Drift Detection based on Error Estimates (EE/EBP) and Drift Detection based on the Affinity Coefficient (AC).
Reacting to drift: the method pushes all the information of the descendant leaves up to the node; this acts as a forgetting mechanism.
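The functional-leaf idea can be illustrated with a short sketch: instead of predicting the majority class, a leaf reuses the statistics it already stores to make a naive Bayes prediction. This is only an illustration under simplifying assumptions: it uses nominal attributes and Laplace smoothing, whereas VFDTc actually maintains per-attribute structures for continuous values; the class and method names are made up.

```python
from collections import defaultdict

class FunctionalLeaf:
    """Leaf that stores sufficient statistics and predicts with naive Bayes."""

    def __init__(self):
        self.class_counts = defaultdict(int)   # class -> count
        self.attr_counts = defaultdict(int)    # (attr index, attr value, class) -> count

    def update(self, x, y):
        """x: sequence of nominal attribute values, y: class label."""
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(i, v, y)] += 1

    def predict(self, x):
        total = sum(self.class_counts.values())
        if total == 0:
            return None
        best_class, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            # P(c) * prod_i P(x_i | c), with simple Laplace smoothing.
            score = nc / total
            for i, v in enumerate(x):
                score *= (self.attr_counts[(i, v, c)] + 1) / (nc + 2)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

leaf = FunctionalLeaf()
for x, y in [(["a", "x"], 0), (["a", "y"], 0), (["b", "y"], 1)]:
    leaf.update(x, y)
print(leaf.predict(["a", "y"]))   # 0
```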
Algorithm: UFFT (Ultra Fast Forest Tree)
Generates a forest of binary trees and processes each example in constant time.
Uses analytical techniques to choose the splitting criteria and information gain to estimate the merit of each possible split test.
Maintains a short-term memory for initializing the leaves.
A leaf node is expanded when the information gain is positive and there is statistical support.
Uses functional leaves.
Concept drift detection: the error rate of a naive Bayes classifier is tracked at each node; the error follows a binomial distribution, and two confidence interval levels are used, warning and drift (a sketch of this scheme follows below).
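Below is a minimal sketch of the error-rate-based detection just described. It tracks the cumulative error rate p and its standard deviation s under the binomial assumption, and raises a warning at roughly the 2-sigma level and a drift signal at the 3-sigma level; these thresholds, the 30-example warm-up and all names are assumptions made for the sketch, not necessarily UFFT's precise settings.

```python
import math, random

class ErrorRateDriftDetector:
    """Track a node's error rate; flag 'warning' / 'drift' when it degrades."""

    MIN_EXAMPLES = 30   # warm-up before any signal (an assumption of this sketch)

    def __init__(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")   # lowest error rate seen so far
        self.s_min = float("inf")   # its standard deviation

    def add(self, correct):
        """Feed one prediction outcome; return the detector state."""
        self.n += 1
        if not correct:
            self.errors += 1
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)   # std. dev. of a binomial rate
        if self.n < self.MIN_EXAMPLES or self.errors == 0:
            return "in-control"
        if p + s < self.p_min + self.s_min:   # error improved: new reference point
            self.p_min, self.s_min = p, s
            return "in-control"
        if p + s >= self.p_min + 3 * self.s_min:
            return "drift"      # e.g. rebuild or forget the affected subtree
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"    # e.g. start buffering examples in short-term memory
        return "in-control"

# Demo: ~10% error for 500 examples, then the error level jumps to ~35%.
random.seed(7)
det = ErrorRateDriftDetector()
outcomes = [random.random() > 0.10 for _ in range(500)] + \
           [random.random() > 0.35 for _ in range(300)]
for t, ok in enumerate(outcomes):
    state = det.add(ok)
    if state != "in-control":
        print(t, state)   # warnings and then drift should appear after the change at t = 500
```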
Algorithm: CVFDT (Concept-adapting Very Fast Decision Tree)
Extension of VFDT with support for concept drift.
Works by keeping its model consistent with a sliding window of examples; as the window moves, only the node statistics are updated (see the sketch below).
Uses information gain for selecting the best attribute.
When a different attribute becomes the best at a node, it grows an alternative subtree with the new best attribute at its root.
Periodically scans HT and all alternate trees, looking for internal nodes whose alternate subtrees are performing better than the current nodes, and replaces them.
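A rough sketch of the window bookkeeping described above, for the sufficient statistics of a single node; the class and field names are hypothetical. CVFDT additionally grows alternate subtrees when a different attribute becomes the best and periodically swaps them in, which this fragment does not show.

```python
from collections import deque, defaultdict

class WindowedNodeStats:
    """Sufficient statistics of one node, kept consistent with a sliding window."""

    def __init__(self, window_size=100_000):
        self.window = deque()
        self.window_size = window_size
        self.counts = defaultdict(int)   # (attr index, attr value, class) -> count

    def add(self, x, y):
        # Increment counts for the arriving example.
        for i, v in enumerate(x):
            self.counts[(i, v, y)] += 1
        self.window.append((x, y))
        # Decrement counts for the example that falls out of the window, so the
        # statistics (and hence the information-gain estimates) always reflect
        # only the most recent window_size examples.
        if len(self.window) > self.window_size:
            old_x, old_y = self.window.popleft()
            for i, v in enumerate(old_x):
                self.counts[(i, v, old_y)] -= 1

stats = WindowedNodeStats(window_size=3)
for example in [(("a", "x"), 1), (("b", "y"), 0), (("a", "y"), 1), (("b", "x"), 0)]:
    stats.add(*example)   # after the 4th example, the 1st one is forgotten
```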
Performance measures
Capacity to detect and respond to concept drift
Capacity to detect and respond to virtual concept drift
Capacity to detect and respond to recurring concept drift
Capacity to adapt to sudden concept drift
Capacity to adapt to gradual concept drift
Capacity to adapt to frequent concept drift
Accuracy of the classification task
Capacity to deal with outliers
Capacity to deal with noisy data
Speed (time taken to process an item in the stream)
Data sets generated
Data sets are based on a moving hyperplane in the d-dimensional space [0, 1]^d (a hyperplane is the set of points x satisfying sum_i w_i * x_i = w_0; examples on each side of it belong to different classes). A generator sketch follows below.
Generated with the MOA (Massive Online Analysis) tool: http://sourceforge.net/projects/moa-datastream/ (released under the GNU license; free and open source).
Existing configurable attributes: instanceRandomSeed, numClasses, numAtts, numDriftAtts, magChange, noisePercentage, sigmaPercentage.
New configurable attributes: driftFreq, driftTran, outlierPercentage, distributionPercentage.
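For reference, here is a small Python approximation of such a moving-hyperplane stream. It is a hand-written sketch, not MOA's generator: the parameter names mirror some of the configurable attributes listed above (number of attributes, drifting attributes, magnitude of change, noise percentage), but the exact semantics, and the handling of outliers and drift frequency, are not reproduced.

```python
import random

def hyperplane_stream(num_examples, num_atts=10, num_drift_atts=2,
                      mag_change=0.001, noise_pct=0.0, seed=1):
    """Yield (x, label) pairs from a slowly moving hyperplane in [0, 1]^d.

    An example is labeled 1 when sum(w_i * x_i) >= w_0, with w_0 chosen so the
    two classes stay roughly balanced. Drift is simulated by adding mag_change
    to the first num_drift_atts weights after every example; noise flips the
    label with probability noise_pct.
    """
    rng = random.Random(seed)
    w = [rng.random() for _ in range(num_atts)]
    for _ in range(num_examples):
        x = [rng.random() for _ in range(num_atts)]
        w0 = 0.5 * sum(w)                    # expected value of sum(w_i * x_i)
        label = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= w0 else 0
        if rng.random() < noise_pct:         # noisy point: flipped label
            label = 1 - label
        yield x, label
        for i in range(num_drift_atts):      # gradual concept drift
            w[i] += mag_change

# Example: 1000 examples with 10% noise and mild drift on 2 of 10 attributes.
data = list(hyperplane_stream(1000, noise_pct=0.10))
```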
Data sets generated
Dataset with no concept drift, outliers or noise
Dataset with 10% noisy data
Dataset with 1% outliers
Dataset with 3 concept drifts
Results Capacity to detect and respond to concept drift
Results Capacity to detect and respond to virtual concept drift
Results Capacity to detect and respond to recurring concept drift
Results Capacity to adapt to sudden concept drift
Results Capacity to adapt to gradual concept drift
Results Capacity to adapt to frequent concept drift
Results: Accuracy of the classification task
Measures derived from the confusion matrix for VFDTc (CA), VFDTc (EBP), UFFT and CVFDT.

Confusion matrices (rows: actual class; columns: predicted class):

VFDTc (CA)         Predicted Class 1   Predicted Class 2
Actual Class 1     44.5% (887)         5.5% (109)
Actual Class 2     5% (101)            45% (903)

VFDTc (EBP)        Predicted Class 1   Predicted Class 2
Actual Class 1     39% (777)           11% (219)
Actual Class 2     9% (173)            41% (831)

UFFT               Predicted Class 1   Predicted Class 2
Actual Class 1     46% (928)           3.5% (68)
Actual Class 2     2.5% (48)           48% (956)

CVFDT              Predicted Class 1   Predicted Class 2
Actual Class 1     34.5% (685)         15.5% (311)
Actual Class 2     15.5% (312)         34.5% (692)

Derived measures (recomputed from the counts in the snippet that follows):

             Accuracy (AC)  True positive (TP)  False positive (FP)  True negative (TN)  False negative (FN)  Precision (P)
VFDTc (CA)   0.89           0.89                0.10                 0.90                0.11                 0.90
VFDTc (EBP)  0.80           0.78                0.17                 0.83                0.22                 0.82
UFFT         0.94           0.93                0.05                 0.95                0.07                 0.95
CVFDT        0.69           0.69                0.31                 0.69                0.31                 0.69
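As a sanity check, the derived measures follow directly from the counts. The snippet below recomputes them for the UFFT confusion matrix, under the assumption (made here for illustration) that Class 1 is the positive class; the results match the table values up to rounding.

```python
def confusion_measures(tp, fn, fp, tn):
    """Rates derived from a 2x2 confusion matrix (positive class = Class 1)."""
    return {
        "accuracy":  (tp + tn) / (tp + fn + fp + tn),
        "tp_rate":   tp / (tp + fn),    # sensitivity / recall
        "fp_rate":   fp / (fp + tn),
        "tn_rate":   tn / (fp + tn),    # specificity
        "fn_rate":   fn / (tp + fn),
        "precision": tp / (tp + fp),
    }

# UFFT counts from the table above: TP=928, FN=68, FP=48, TN=956.
print(confusion_measures(tp=928, fn=68, fp=48, tn=956))
# -> accuracy 0.94, TP 0.93, FP 0.05, TN 0.95, FN 0.07, precision 0.95 (rounded)
```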
Results Dealing with outliers
Results Dealing with noisy data
Results Speed (time taken to process an item in the stream)
Conclusions & future work
Data can be generated very fast, which gives us a new and challenging scenario for developing data mining algorithms: we have to develop them keeping in mind that the training phase may never end.
Changes in the data distribution are another challenging scenario that data stream mining has to deal with.
VFDT was one of the first data stream mining algorithms developed; it implemented the Hoeffding bound.
We generated different datasets using the moving hyperplane algorithm.
UFFT is recommended for short-term predictions; CVFDT for long-term solutions.
No impact was observed for virtual concept drift or recurring concept drift.
Conclusions & future work
VFDTc (CA) is not suitable for gradual or sudden concept drift.
Neither VFDTc (CA) nor UFFT is suitable for frequent concept drift.
VFDTc (EBP) and CVFDT are recommended for data streams with outliers.
CVFDT is recommended for data streams with noisy points.
CVFDT and UFFT are the fastest algorithms.
Future work:
Clustering algorithms applied to data streams.
Classification algorithms applied to data streams of unstructured data (text, images, etc.).
Questions?
E-mail: [email_address]
Twitter: @eddonato
