Analysis of Datasets
[ Using Machine Learning Algorithms ]
Rafsanjani Muhammod
Undergrad Student, Department of Computer Science & Engineering
United International University, Bangladesh
April 16, 2017
Outline
Data Analysis
Data Visualization
Data Preprocessing
Working Datasets
Datasets Description
Results
Comparison between classifiers
References
Name of Datasets
KDD Cup 1999 Datasets
NSL-KDD Datasets
Hypothyroid Diseases Dataset
Chronic Kidney Diseases Datasets
Leaf Dataset
Bird's-eye view : Datasets Description
Name of Dataset | Instances | Features | Classes | Task
KDD Cup ’99 | 4,898,431 | 41 | 23 | Classification
NSL-KDD | 125,973 | 41 | 23 | Classification
Hypothyroid | 3,772 | 29 | 4 | Classification
Chronic Kidney | 400 | 24 | 2 | Classification
Leaf | 340 | 16 | 36 | Classification
Dataset # 1 : KDD Cup 1999 Datasets I
Problem Description :
This is a “Computer Networks Intrusion Detection”
problem.
Brief Description :
Data Set Characteristics : Multivariate
Attribute Characteristics : Categorical, Integer
Associated Tasks : Classification
Area : Computer Networking
Donor (with Date) : Unknown (January 1, 1999)
Instances : 4,898,431
Features : 41 ( & class values : 23)
Missing Values : None
Dataset # 1 : KDD Cup 1999 Datasets II
There are 4 types of attacks :
1 Denial of Service Attack (DoS)
2 User to Root Attack (U2R)
3 Remote to Local Attack (R2L)
4 Probing Attack
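The four categories group the individual attack labels in the data. A minimal sketch of that grouping is given here. The label spellings are assumed to match the commonly distributed KDD Cup ’99 training file, and the names `ATTACK_CATEGORY` and `attack_category` are hypothetical, so the mapping should be checked against the actual copy of the data in use.

```python
# Assumed mapping from raw KDD Cup '99 training labels to the four attack
# categories listed above. "normal" traffic is handled separately.
ATTACK_CATEGORY = {
    "back": "DoS", "land": "DoS", "neptune": "DoS",
    "pod": "DoS", "smurf": "DoS", "teardrop": "DoS",
    "buffer_overflow": "U2R", "loadmodule": "U2R",
    "perl": "U2R", "rootkit": "U2R",
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L",
    "multihop": "R2L", "phf": "R2L", "spy": "R2L",
    "warezclient": "R2L", "warezmaster": "R2L",
    "ipsweep": "Probe", "nmap": "Probe",
    "portsweep": "Probe", "satan": "Probe",
}


def attack_category(label):
    """Map a raw label such as 'smurf.' to DoS, U2R, R2L, Probe or normal."""
    name = label.strip().rstrip(".").lower()
    if name == "normal":
        return "normal"
    # Labels missing from the assumed mapping are reported, not guessed.
    return ATTACK_CATEGORY.get(name, "unknown")


print(attack_category("smurf."))
print(attack_category("normal."))
```

Under this assumed mapping, the 22 attack labels plus "normal" give the 23 class values listed in the dataset description.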
Dataset # 1 : KDD Cup 1999 Datasets III
Snapshot of Dataset # 1 :
This dataset has some problems :
Redundancy :
K. Leung et al.[5] observed that :
Around 78% of the “train data” is duplicated &
Around 75% of the “test data” is duplicated.
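A redundancy figure of this kind can be estimated by counting identical records. The sketch below is one way to do it, not the method used in the cited work; the function name `duplicate_rate` and the toy rows are hypothetical and stand in for real dataset records.

```python
from collections import Counter


def duplicate_rate(records):
    """Fraction of records that repeat an identical record seen elsewhere."""
    if not records:
        return 0.0
    counts = Counter(tuple(record) for record in records)
    unique = len(counts)
    return (len(records) - unique) / len(records)


# Toy rows (protocol, service, src_bytes, label), invented for illustration.
sample = [
    ("tcp", "http", 181, "normal"),
    ("tcp", "http", 181, "normal"),
    ("icmp", "ecr_i", 1032, "smurf"),
    ("icmp", "ecr_i", 1032, "smurf"),
    ("icmp", "ecr_i", 1032, "smurf"),
]
print(duplicate_rate(sample))
```

Here two distinct rows account for five records, so three of the five are counted as duplicates.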
Dataset # 1 : KDD Cup 1999 Datasets IV
Data Partitioning :
Portnoy et al.[3] divided this large dataset into 10 subsets.
Dataset # 1 : KDD Cup 1999 Datasets V
Key Issues :
M. Tavallaee et al.[4] observed :
1 Data redundancy
2 Unrealistically high accuracy rates
3 Highly imbalanced classes
Costly Cross-Validation :
Portnoy et al.[3] also observed that the distribution of this
dataset is very uneven, which makes cross-validation
difficult.
To avoid these problems, [4] proposed a
modified version of the KDD Cup’99 dataset,
known as the NSL-KDD dataset.
Dataset # 2 : NSL-KDD Datasets I
The “NSL-KDD” dataset is a subset of the “KDD
Cup’99” dataset.
Problem Description :
This is also a “Computer Networks Intrusion
Detection” problem.
Dataset # 2 : NSL-KDD Datasets II
Brief Description :
Data Set Characteristics : Multivariate
Attribute Characteristics : Categorical, Integer
Associated Tasks : Classification
Area : Computer Networking
Donor (with Date) : [4] (2009)
Instances : 125,973
Features : 41 ( & class values : 23)
Missing Values : None
Dataset # 2 : NSL-KDD Datasets III
Experiment Results (70% train & 30% test) :
Classifiers vs Accuracy Graph (in R)
Plotting with R : http://goo.gl/OpkU7e
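A 70% / 30% hold-out evaluation can be sketched as follows. The experiments in these slides were run through the WEKA API, so this Python version is only an illustration of the idea; `holdout_split`, `accuracy`, and the fixed seed are hypothetical choices.

```python
import random


def holdout_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and cut them into a training and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


train, test = holdout_split(range(100))
print(len(train), len(test))
```

A classifier is then fitted on `train` only, and `accuracy` is computed on its predictions for `test`.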
Dataset # 2 : NSL-KDD Datasets IV
Detailed Results (70% train & 30% test) [using WEKA API] :
Dataset # 2 : NSL-KDD Datasets V
Experiment Results (10-Fold Cross-Validation) :
Classifiers vs Accuracy Graph (in R)
Plotting with R : https://goo.gl/4MVxwl
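The fold construction behind 10-fold cross-validation can be sketched as below. Because the class distribution is uneven, this version spreads each class across the folds (a stratified split). It is an assumption that this matches the WEKA setup used for the reported results, and the function names are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign each index to one of k folds, spreading each class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    # Deal indices out class by class, round-robin over the folds.
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validation_splits(labels, k=10):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = stratified_folds(labels, k)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# Toy labels with a 70/30 class imbalance, invented for illustration.
labels = ["normal"] * 70 + ["attack"] * 30
for train, test in cross_validation_splits(labels):
    print(len(train), len(test))
```

Each record is used for testing exactly once, and the reported accuracy is the average over the 10 held-out folds.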
Dataset # 2 : NSL-KDD Datasets VI
Detailed Results (10-Fold Cross-Validation) [using WEKA API] :
Dataset # 2 : NSL-KDD Datasets VII
Now the question : which classifier is better & why ?
OneR uses the feature :
“A5 src byte”, which alone classifies with 90% accuracy.
Important feature :
“A5 src byte” is the most important feature.
Why “Naive Bayes” is poor :
Because :
1 The features are not independent of each other, which violates the Naive Bayes assumption.
2 The “Naive Bayes classifier” is a simple model : it is robust to
overfitting, but it cannot capture the dependencies between features.
Conclusion :
I’m confident that decision tree (DT) related classifiers achieve more than 90% accuracy.
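The OneR idea mentioned above (pick the single feature whose one-level rule makes the fewest training errors) can be sketched as follows. This is a simplified version for categorical values; a numeric feature such as src byte would first have to be binned. The function name `one_r` and the toy rows are hypothetical and say nothing about the real data.

```python
from collections import Counter, defaultdict


def one_r(rows, labels):
    """Return (best_feature_index, rule, training_accuracy) for OneR."""
    best = None
    for feature in range(len(rows[0])):
        # Count the labels seen for every value of this feature.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[feature]][label] += 1
        # The rule predicts the majority label for each feature value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c.most_common(1)[0][1] for c in counts.values())
        score = correct / len(rows)
        if best is None or score > best[2]:
            best = (feature, rule, score)
    return best


# Toy rows (protocol, binned src byte), invented for illustration only.
rows = [("tcp", "low"), ("tcp", "high"), ("udp", "low"),
        ("udp", "high"), ("tcp", "low"), ("udp", "high")]
labels = ["attack", "normal", "attack", "normal", "attack", "normal"]
feature, rule, score = one_r(rows, labels)
print(feature, rule, score)
```

On the toy rows, the second column separates the two labels perfectly, so OneR selects it over the first column.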
Dataset # 2 : NSL-KDD Datasets VIII
ROC Curve (with Python) :
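The original plotting script is not part of this transcript. As a minimal sketch of what an ROC curve computes, the code below derives the (false positive rate, true positive rate) points and the area under the curve for a binary problem, assuming labels of 1 for attack and 0 for normal and a classifier score per record. The function names and the toy scores are hypothetical.

```python
def roc_points(y_true, scores):
    """Return ROC points (FPR, TPR), one per distinct score threshold.

    Assumes y_true holds 0/1 labels and contains both classes.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    previous = None
    for score, label in ranked:
        # Emit a point only when the threshold moves to a new score.
        if previous is not None and score != previous:
            points.append((fp / negatives, tp / positives))
        if label == 1:
            tp += 1
        else:
            fp += 1
        previous = score
    points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy labels and scores, invented for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
print(points)
print(auc(points))
```

For a multi-class dataset such as NSL-KDD, one curve of this kind is usually drawn per class (one class against the rest).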
Dataset # 2 : NSL-KDD Datasets IX
Data Visualization : [1][2] [ Using Orange3 ]
Analysis of Confusion Matrix :
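The confusion matrix itself is not reproduced in this transcript. As a sketch of how such a matrix is read, the code below counts (actual, predicted) pairs and derives per-class precision and recall. The function names and the toy predictions are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(y_true, y_pred))


def precision_recall(matrix, label):
    """Precision and recall of one class, read off the confusion matrix."""
    tp = matrix[(label, label)]
    predicted = sum(n for (actual, pred), n in matrix.items() if pred == label)
    actual_total = sum(n for (actual, pred), n in matrix.items() if actual == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual_total if actual_total else 0.0
    return precision, recall


# Toy labels, invented for illustration.
y_true = ["normal", "normal", "DoS", "DoS", "DoS", "Probe"]
y_pred = ["normal", "DoS", "DoS", "DoS", "normal", "Probe"]
matrix = confusion_matrix(y_true, y_pred)
for label in ("normal", "DoS", "Probe"):
    print(label, precision_recall(matrix, label))
```

On an imbalanced dataset, per-class recall is more informative than overall accuracy, because a rare class can be missed entirely while accuracy stays high.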
References
[1] D. M. Farid et al. “Hybrid decision tree and Naive Bayes classifiers for multi-class classification tasks”. (2014).
[2] J. Han et al. “Data Mining : Concepts and Techniques”. (2012).
[3] L. Portnoy et al. “Intrusion detection with unlabeled data using clustering”.
[4] M. Tavallaee et al. “A Detailed Analysis of the KDD Cup ’99 Data Set”. (2009).
[5] K. Leung and C. Leckie. “Unsupervised Anomaly Detection in Network Intrusion Detection Using Clusters”.
Questions & Answers
Ask me.
Thank you !
My LaTeX Template : https://goo.gl/tzFlD1
