http://www.iaeme.com/IJMET/index.asp 1392 editor@iaeme.com
International Journal of Mechanical Engineering and Technology (IJMET)
Volume 10, Issue 01, January 2019, pp. 1392-1398, Article ID: IJMET_10_01_141
Available online at http://www.iaeme.com/ijmet/issues.asp?JType=IJMET&VType=10&IType=01
ISSN Print: 0976-6340 and ISSN Online: 0976-6359
© IAEME Publication Scopus Indexed
NORMALIZED GEOMETRIC INDEX: A SCALE
FOR CLASSIFIER SELECTION
Krishna Sriharsha Gundu
Student in VIT Vellore
Sundar S
Professor in VIT Vellore
ABSTRACT
For years, the Machine Learning community has focused on developing efficient
algorithms that can produce very accurate classifiers. However, it is often much easier
to find several good classifier-dataset combinations than a single classifier that performs
well on different datasets. The advantages of using classifier-dataset combinations
instead of a single classifier are twofold: they lower computational complexity by
using simpler models, and they can improve classification accuracy and performance.
Most data mining applications are based on pattern matching algorithms, so improving
classification performance has a positive impact on the quality of the overall
data mining task. Since combination strategies have proved very useful in improving
performance, these techniques have become important in applications such as
cancer detection, speech technology and natural language processing. The aim of this
paper is to propose a metric, the Normalized Geometric Index (NGI), based on the
latent properties of datasets, for improving the accuracy of data mining tasks.
Key words: Machine Learning, Classification, Classifier Selection, Data Mining, Non
Linear Regression, Normalized Geometric Index (NGI)
Cite this Article: Krishna Sriharsha Gundu and Sundar S, Normalized Geometric Index: a
Scale for Classifier Selection, International Journal of Mechanical Engineering and
Technology, 10(01), 2019, pp.1392–1398
http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=10&Type=01
1. INTRODUCTION
Classification is an important data mining task: it classifies the given features and learns the
hidden knowledge in a dataset. The input to a typical classification system is a set of features from
a dataset with an associated class. A feature is represented by a set of measurements that contain
relevant information about the structure of the object we wish to classify. Hence, in the context of
classification, the combination of classifier and dataset is an important measure for understanding the
performance of a classifier system. This method of understanding the performance is termed
"overproduce and choose" [1]. In this method, a large number of datasets of different geometries
are given as inputs to different classifiers. Flach [2] has discussed generic
approaches to assessing the influence of dataset parameters on accuracy, but did not elaborate any
specific direction for solving the problem. In this paper, a numeric index is developed (in Section 3)
and used to predict which classifier gives the best accuracy for the geometry of the dataset
to be classified.
1.1. Dataset and Classifier
Classification consists of predicting a certain outcome based on historic input data. The
prediction is carried out by processing the data with an algorithm applied to a training
dataset. The algorithm tries to discover the relationships between the features that will aid in
predicting the output according to the perceived pattern. The classification algorithm analyzes
the input and predicts the output. The prediction accuracy is the figure of merit of the classification
algorithm. For example, in a software defect dataset, as shown in Table 1, the training set
would have relevant information on the "bug" label, collected historically. The prediction data,
as shown in Table 2, is used by the algorithm to predict the bugs in the module.
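The train-then-predict loop described above can be sketched with a deliberately tiny classifier. Since Tables 1 and 2 are not reproduced here, the feature names (lines of code, cyclomatic complexity) and every record below are hypothetical, and a 1-nearest-neighbour rule stands in for the classification algorithm:

```python
# Illustrative train-then-predict sketch; records and feature names are
# hypothetical stand-ins for the paper's Tables 1 and 2.
from math import dist

training = [                      # (features, label) rows of a defect dataset
    ((120, 4), "clean"),
    ((950, 22), "bug"),
    ((300, 7), "clean"),
    ((1100, 30), "bug"),
]

def predict(features):
    """Label of the training record closest to the query point."""
    return min(training, key=lambda row: dist(row[0], features))[1]

print(predict((1000, 25)))  # nearest record is (950, 22) -> "bug"
print(predict((200, 5)))    # nearest record is (120, 4) -> "clean"
```

The figure of merit would then be the fraction of such predictions that match the known labels on held-out data.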
1.2. Influence of Dataset-Classifier combination
In general, classification is about taking a decision on given data. Michie et al. define
classification as the "construction of a procedure that is applied to a series of objects, where each
object is assigned to a label" [3, 4, 5]. In this paper, classification refers to supervised learning
based classification, where the classifier is trained on historic data with associated classes.
In today's machine learning world, the challenge is to improve the performance of the learning
system and to apply the right classification algorithm to a particular dataset [6, 7]. Since dataset
volumes and features increase over time, selecting a suitable classifier is a challenge, and a poor
selection of classifier results in poor accuracy. Several studies have touched on this problem sparsely,
but it is still a challenge [8, 9]. Despite the explosive growth of data volumes and the availability
of many machine learning algorithms, there is no study on the selection of classifiers, or
guidelines for selecting a classifier for a given type of dataset [10]. Datasets themselves offer
little clue for selecting a relevant classifier algorithm.
Each classifier interprets and processes the data differently. For example, the k-Nearest
Neighbour (kNN) classifier computes the distance from the test point to all the training points and
assigns the class of the nearest one. In contrast, the Support Vector Machine (SVM)
classifier draws hyperplanes such that all the points on one side of a hyperplane belong to a class.
This difference in the underlying algorithm causes a difference in classification accuracy. In
other words, not all classifiers are suitable for all geometries of datasets; i.e., the importance
of features, observations and classes differs between algorithms. We therefore have to
choose the classifier based on the dataset at hand.
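The geometric contrast between kNN and a single hyperplane can be made concrete with a pure-Python sketch on an XOR-shaped toy dataset (illustrative data, not from the paper): a nearest-neighbour rule adapts to local structure, while no single linear boundary can separate all four points.

```python
# 1-NN versus one hyperplane on XOR-labelled points (illustrative only).
from math import dist

points = [((0, 0), 0), ((1, 1), 0), ((0, 1), 1), ((1, 0), 1)]  # XOR labels

def knn_predict(x):
    """1-NN: class of the closest labelled point."""
    return min(points, key=lambda p: dist(p[0], x))[1]

def linear_predict(x, w=(1.0, 1.0), b=-1.0):
    """Single hyperplane w.x + b >= 0 -> class 1; any one line misses a point."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else 0

knn_acc = sum(knn_predict(p) == y for p, y in points) / 4
lin_acc = sum(linear_predict(p) == y for p, y in points) / 4
print(knn_acc, lin_acc)  # 1.0 0.75
```

This is exactly the sense in which a dataset's geometry favours one classifier over another.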
2. EXISTING WORK
T. van Gemert [3] conducted a study on the relationship between classification algorithms and
datasets, with prime consideration given to execution time. The author discussed the influence of
dataset characteristics on classification algorithm performance, but did not address the real issue
of how dataset parameters influence classification accuracy, nor the effect of dataset
characteristics (e.g. the number of features or classes) on the performance of the classifier.
In an attempt to produce optimal accuracy for all datasets, a classifier ensemble can be
generated from other classifiers, classifier ensembles and combination functions [11]. Although
this approach may yield optimal accuracy for all kinds of datasets, there are many possible
combinations of classifiers and combination functions, so the time required to build a custom
classifier is very high. A final selection then has to be made from the list of classifiers built for
the dataset [12]. The method is worth implementing only if there is a drastic improvement in
accuracy: the time taken to understand and build the optimum classifier renders the classification
task useless in time-sensitive scenarios, or when only minor accuracy improvements are at stake.
A true classifier ensemble can be built provided that classifiers with diverse strong and weak
points are combined. Research on measuring such diversity is not yet conclusive [13]. This implies
that a classifier ensemble solution may not be found or, if found, could sit at a local minimum of
the error, requiring a restart of the search process with a different classifier and its
combinations. The time taken to gather enough data to make a decision is considered critical in a
classification task [13].
Since truly complementary and diverse classifiers do not exist, the fusion of many decisions
into a single output label is challenging. Although some frameworks such as weighted voting
have been developed [14], they are not foolproof, thereby requiring another classifier just to map
the pool of outputs into a single output based on feature relevance and confidence intervals. The
computational complexity of such a system is exponentially high, as the selection of a classifier
ensemble uses another classifier ensemble.
The diversity of the classifier pool is ensured by manipulating the classifier inputs and outputs
[15]. This manipulation of the dataset could lose crucial information about the dataset.
The main drawback of a single classifier system is the requirement of prior knowledge to
choose the best classifier [16]. This paper proposes a solution that standardizes this knowledge
through a single parameter.
3. PROPOSED METRIC
In this paper we propose a novel metric, the Normalized Geometric Index (NGI), for selecting a
classifier based on dataset parameters, targeting optimal classification accuracy and execution time.
Since the metric is a numeric value, it can be used directly to decide on a single classifier for
optimal performance. This eliminates the need for one or more learning algorithms to fuse
classifiers and choose the right fusion function, thereby saving time and computational complexity.
The parameter is developed keeping four kinds of datasets in mind (refer to Section 4.2).
The rules behind the metric are:
1. The accuracy of classification of a dataset improves with an increase in the number
of observations, as long as enough care is taken to avoid overfitting.
2. The accuracy of classification decreases if there are more classes for the
same number of observations.
3. The accuracy of classification decreases if there are more features for the
same number of observations.
Every observation is considered new information provided to the classifier, detailing the
behaviour of the dataset. Thereby more observations (for the same number of classes and
features) imply better classification.
Every new feature is considered a new dimension in which to visualize the classes. If there are more
features (for the same number of observations and classes), the information provided by the
observations will not be sufficient for the classifier to perform the classification.
The more output classes there are, the more information is required to classify the test
data into the different classes. Combining the above points, we arrive at the NGI metric.
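The paper defines NGI by an equation that is not reproduced in this text. As an illustration only, the function below is one hypothetical index consistent with the three rules above: it rises with observations and falls with features and classes. Its functional form and the constant 100.0 are assumptions, not the published definition.

```python
# Hypothetical index consistent with the three rules; NOT the paper's formula.
def ngi_sketch(n_observations, n_features, n_classes):
    """Rises with observations, falls with features and classes; stays in (0, 1)."""
    return n_observations / (n_observations + 100.0 * n_features * n_classes)

print(ngi_sketch(2000, 10, 2) > ngi_sketch(500, 10, 2))   # more observations help
print(ngi_sketch(1000, 50, 2) < ngi_sketch(1000, 10, 2))  # more features hurt
print(ngi_sketch(1000, 10, 5) < ngi_sketch(1000, 10, 2))  # more classes hurt
```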
The classifiers and datasets considered in the experiments are:
1. Gaussian Naive Bayes (GNB)
2. Support Vector Machine (SVM)
3. Random Forest (RF)
4. k Nearest Neighbor (KNN)
5. Multi-Layer Perceptron (MLP)
6. Multinomial Naive Bayes (MNB)
7. Quadratic Discriminant Analysis (QDA)
4. EXPERIMENTAL SETUP
4.1. Technical Details of the Experiment
The parameters of the classifiers are set as follows:
• The kNN classifier has its k value set to 3.
• The MLP classifier's activation function is set to tanh().
• The MLP classifier uses the adam solver in Python.
• The MLP classifier's tolerance is set to 10^-5.
• The Random Forest takes its decision from an ensemble of 100 trees.
• The SVM uses a sigmoid kernel for classification.
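The settings above can be expressed as scikit-learn constructors. This is a sketch under the assumption that the experiments used scikit-learn (suggested by the "adam solver in python" remark); parameters not listed in the paper are left at library defaults.

```python
# Assumed scikit-learn configuration; unlisted parameters are library defaults.
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

classifiers = {
    "GNB": GaussianNB(),
    "SVM": SVC(kernel="sigmoid"),
    "RF": RandomForestClassifier(n_estimators=100),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "MLP": MLPClassifier(activation="tanh", solver="adam", tol=1e-5),
    "MNB": MultinomialNB(),
    "QDA": QuadraticDiscriminantAnalysis(),
}
```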
4.2. Experiment
The experiment is set up so that datasets of all types are executed on all the classifiers, and the
resulting accuracies are tabulated. A non-linear regression line is fitted to the accuracies
as a function of the NGI metric for each classifier.
An average of all the accuracies is taken for each dataset, and another non-linear
regression line is fitted to these averages as a function of the NGI metric. This line acts as a threshold
for selecting a classifier.
When the two lines are plotted, the region where an individual classifier outperforms the
average classifier performance is that classifier's region of strength. It must be noted that
the R-squared value (describing the goodness of fit of the line) will be low, since the accuracy is not
fully explained by these parameters alone. Hence, the line that best describes the accuracy in terms
of NGI is selected for determining the region of performance.
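The fitting procedure can be sketched in pure Python. A quadratic is one simple choice of non-linear model (the paper does not state which form it fits), and the accuracy values below are invented placeholders, not measured results.

```python
# Fit accuracy-vs-NGI curves for one classifier and for the average, then
# read off the region of strength. Data values are illustrative only.

def polyfit2(xs, ys):
    """Least-squares quadratic fit c0 + c1*x + c2*x^2 via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]          # power sums
    a = [[s[i + j] for j in range(3)] for i in range(3)]     # 3x3 system matrix
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    for col in range(3):                                     # Gaussian elimination
        piv = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 3):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coef = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):                                      # back-substitution
        coef[r] = (b[r] - sum(a[r][c] * coef[c] for c in range(r + 1, 3))) / a[r][r]
    return coef

def evaluate(coef, x):
    return coef[0] + coef[1] * x + coef[2] * x * x

ngi = [0.01, 0.02, 0.04, 0.08, 0.16]
acc_knn = [0.60, 0.66, 0.74, 0.78, 0.80]   # hypothetical kNN accuracies
acc_avg = [0.62, 0.65, 0.70, 0.74, 0.77]   # hypothetical per-dataset averages

f_knn, f_avg = polyfit2(ngi, acc_knn), polyfit2(ngi, acc_avg)
# Region of strength: NGI values where kNN's fitted curve beats the average.
strength = [x for x in ngi if evaluate(f_knn, x) > evaluate(f_avg, x)]
print(strength)
```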
Figure. 1. Experimental Setup
5. RESULTS
After conducting the experiments described in the previous section, the results below are obtained:
a table of classifier accuracy for each NGI value.
As seen in the NGI column, the datasets correspond to different dataset geometries. From the
table, it can be inferred that NGI values around 0.041 give overall higher classification accuracy;
this NGI value corresponds to datasets having fewer classes, more observations and fewer features.
The lowest overall classification accuracy occurs for a high number of classes, a high number of
features and a low number of observations. During classification, it is also observed that the
Multinomial Naive Bayes classifier does not work with all raw datasets: all its inputs must be
non-negative, thereby needing pre-processing. The following setup is used for studying the
relationship between classifier performance and the NGI metric. For each classifier, the
experiments are conducted using the different datasets mentioned in the previous section. The
response function of NGI is calculated as
The above accuracy corresponds to the average classification accuracy. It serves as the
baseline for determining whether a classifier is better or worse at a particular NGI value. After
calculating the NGI value from equation (2) above, the accuracy values of the individual
classifiers are approximated using the following equations, which are non-linear
regression models of accuracy in terms of the NGI metric. For the accuracy of a specific
classifier, the corresponding column from the table is chosen as y and the related NGI values
as x; using non-linear regression, a function is created for each classifier. These functions are as follows.
These equations approximately describe the behaviour of the classifiers for different NGI values.
Graphical examination of the curves shows the optimal classifier for each NGI value.
All of the approximations have an R-squared value greater than 0.85.
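Once per-classifier response functions are fitted, selection reduces to evaluating each at the dataset's NGI value and taking the argmax. The quadratic coefficients (c0, c1, c2) below are hypothetical stand-ins for the paper's fitted equations, chosen only so the example loosely echoes the reported behaviour.

```python
# Classifier selection from (hypothetical) fitted response functions.
responses = {
    "KNN": (0.70, 0.30, -0.20),
    "MLP": (0.30, 1.20, -0.60),
    "RF":  (0.72, 0.15, -0.10),
}

def select_classifier(ngi):
    """Name of the classifier whose response function predicts the best accuracy."""
    def predicted(coef):
        c0, c1, c2 = coef
        return c0 + c1 * ngi + c2 * ngi * ngi
    return max(responses, key=lambda name: predicted(responses[name]))

print(select_classifier(0.05))  # low NGI: RF's flat curve wins here
print(select_classifier(0.90))  # high NGI: MLP overtakes
```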
6. CONCLUSION AND FUTURE WORK
From the above results it is clear that the Normalized Geometric Index (NGI) is helpful
in determining the classifier-dataset relation for improved accuracy. The performance of QDA is
inferior compared to the other classifiers, so it is not suggested as a prime choice
for the given dataset properties. The KNN classifier performs consistently well compared
to the remaining classifiers, while compromising somewhat on accuracy. The accuracy of the
MLP classifier is very high provided the NGI is greater than 0.787. The Random Forest (RF)
classifier performs uniformly well across all NGI values, with no threshold value; its
performance is mostly consistent.
We have conducted experiments on a few datasets; the experiments can be repeated
using sparse and scientific datasets to study the behaviour of the NGI metric on a wider variety
of datasets.
REFERENCES
[1] Amanda JC Sharkey, Noel E Sharkey, Uwe Gerecke, and Gopinath Odayammadath
Chandroth. The “test and select” approach to ensemble combination. In International
Workshop on Multiple Classifier Systems, pages 30–44. Springer, 2000.
[2] Peter Flach. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, 2012.
[3] T van Gemert. On the influence of dataset characteristics on classifier performance. B.S.
thesis, 2017.
[4] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical
Classification. Ellis Horwood, 1994.
[5] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and
Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural
Information Processing Systems, pages 2962–2970, 2015.
[6] Evan R Sparks, Ameet Talwalkar, Daniel Haas, Michael J Franklin, Michael I Jordan, and
Tim Kraska. Automating model search for large scale machine learning. In Proceedings of
the Sixth ACM Symposium on Cloud Computing, pages 368–380. ACM, 2015.
[7] Gang Luo. A review of automatic selection methods for machine learning algorithms and
hyper-parameter values. Network Modeling Analysis in Health Informatics and
Bioinformatics, 5(1):18, 2016.
[8] Alexandros Kalousis and Theoharis Theoharis. Noemon: Design, implementation and
performance results of an intelligent assistant for classifier selection. Intelligent Data
Analysis, 3(5):319–337, 1999.
[9] Joao Gama and Pavel Brazdil. Characterization of classification algorithms. In Portuguese
Conference on Artificial Intelligence, pages 189–200. Springer, 1995.
[10] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta. Metalearning:
Applications to Data Mining. Springer Science & Business Media, 2008.
[11] Josef Kittler. Multiple Classifier Systems: First International Workshop, MCS 2000, Cagliari,
Italy, June 21–23, 2000, Proceedings, volume 1857. Springer Science & Business Media, 2000.
[12] Fabio Roli, Giorgio Giacinto, and Gianni Vernazza. Methods for designing multiple classifier
systems. In International Workshop on Multiple Classifier Systems, pages 78–87. Springer,
2001.
[13] Michał Woźniak, Manuel Graña, and Emilio Corchado. A survey of multiple classifier
systems as hybrid systems. Information Fusion, 16:3–17, 2014.
[14] Šarūnas Raudys. Trainable fusion rules. II. Small sample-size effects. Neural Networks,
19(10):1517–1527, 2006.
[15] Ludmila I Kuncheva. Combining pattern classifiers: methods and algorithms. John Wiley &
Sons, 2004.
[16] Josef Kittler. A framework for classifier fusion: Is it still needed? In Joint IAPR International
Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and
Syntactic Pattern Recognition (SSPR), pages 45–56. Springer, 2000.

More Related Content

PDF
The pertinent single-attribute-based classifier for small datasets classific...
PDF
G046024851
PDF
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
PDF
Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...
PDF
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
PDF
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...
PDF
Issues in Query Processing and Optimization
The pertinent single-attribute-based classifier for small datasets classific...
G046024851
ATTRIBUTE REDUCTION-BASED ENSEMBLE RULE CLASSIFIERS METHOD FOR DATASET CLASSI...
Multi-Dimensional Features Reduction of Consistency Subset Evaluator on Unsup...
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
New Feature Selection Model Based Ensemble Rule Classifiers Method for Datase...
Issues in Query Processing and Optimization

What's hot (17)

PDF
M43016571
PDF
DATA MINING ATTRIBUTE SELECTION APPROACH FOR DROUGHT MODELLING: A CASE STUDY ...
PDF
Using particle swarm optimization to solve test functions problems
PDF
An unsupervised feature selection algorithm with feature ranking for maximizi...
PDF
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
PDF
An effective adaptive approach for joining data in data
PDF
T180203125133
PDF
Network Based Intrusion Detection System using Filter Based Feature Selection...
PDF
A survey of modified support vector machine using particle of swarm optimizat...
PDF
DATA PARTITIONING FOR ENSEMBLE MODEL BUILDING
PDF
A h k clustering algorithm for high dimensional data using ensemble learning
PDF
A Survey on Constellation Based Attribute Selection Method for High Dimension...
PDF
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
PDF
Parallel Evolutionary Algorithms for Feature Selection in High Dimensional Da...
PDF
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
PDF
mlsys_portrait
PDF
A novel hybrid feature selection approach
M43016571
DATA MINING ATTRIBUTE SELECTION APPROACH FOR DROUGHT MODELLING: A CASE STUDY ...
Using particle swarm optimization to solve test functions problems
An unsupervised feature selection algorithm with feature ranking for maximizi...
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
An effective adaptive approach for joining data in data
T180203125133
Network Based Intrusion Detection System using Filter Based Feature Selection...
A survey of modified support vector machine using particle of swarm optimizat...
DATA PARTITIONING FOR ENSEMBLE MODEL BUILDING
A h k clustering algorithm for high dimensional data using ensemble learning
A Survey on Constellation Based Attribute Selection Method for High Dimension...
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Parallel Evolutionary Algorithms for Feature Selection in High Dimensional Da...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
mlsys_portrait
A novel hybrid feature selection approach
Ad

Similar to Ijmet 10 01_141 (20)

PDF
Predicting performance of classification algorithms
PDF
PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMS
PDF
Assessment of Cluster Tree Analysis based on Data Linkages
PDF
Lx3520322036
PDF
Threshold benchmarking for feature ranking techniques
PDF
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
PDF
Review of Existing Methods in K-means Clustering Algorithm
PDF
The International Journal of Engineering and Science (The IJES)
PDF
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERS
PDF
A new model for iris data set classification based on linear support vector m...
PDF
IRJET- Optimal Number of Cluster Identification using Robust K-Means for ...
PDF
84cc04ff77007e457df6aa2b814d2346bf1b
PDF
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
PDF
Novel Ensemble Tree for Fast Prediction on Data Streams
PDF
Performance Evaluation: A Comparative Study of Various Classifiers
PDF
Proposing an Appropriate Pattern for Car Detection by Using Intelligent Algor...
PDF
IRJET- Comparison of Classification Algorithms using Machine Learning
PDF
Configuration Navigation Analysis Model for Regression Test Case Prioritization
PDF
A Study of Efficiency Improvements Technique for K-Means Algorithm
PDF
Decision tree clustering a columnstores tuple reconstruction
Predicting performance of classification algorithms
PREDICTING PERFORMANCE OF CLASSIFICATION ALGORITHMS
Assessment of Cluster Tree Analysis based on Data Linkages
Lx3520322036
Threshold benchmarking for feature ranking techniques
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
Review of Existing Methods in K-means Clustering Algorithm
The International Journal of Engineering and Science (The IJES)
N ETWORK F AULT D IAGNOSIS U SING D ATA M INING C LASSIFIERS
A new model for iris data set classification based on linear support vector m...
IRJET- Optimal Number of Cluster Identification using Robust K-Means for ...
84cc04ff77007e457df6aa2b814d2346bf1b
DECISION TREE CLUSTERING: A COLUMNSTORES TUPLE RECONSTRUCTION
Novel Ensemble Tree for Fast Prediction on Data Streams
Performance Evaluation: A Comparative Study of Various Classifiers
Proposing an Appropriate Pattern for Car Detection by Using Intelligent Algor...
IRJET- Comparison of Classification Algorithms using Machine Learning
Configuration Navigation Analysis Model for Regression Test Case Prioritization
A Study of Efficiency Improvements Technique for K-Means Algorithm
Decision tree clustering a columnstores tuple reconstruction
Ad

More from IAEME Publication (20)

PDF
IAEME_Publication_Call_for_Paper_September_2022.pdf
PDF
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
PDF
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
PDF
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
PDF
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
PDF
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
PDF
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
PDF
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
PDF
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
PDF
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
PDF
GANDHI ON NON-VIOLENT POLICE
PDF
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
PDF
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
PDF
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
PDF
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
PDF
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
PDF
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
PDF
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
PDF
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
PDF
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT
IAEME_Publication_Call_for_Paper_September_2022.pdf
MODELING AND ANALYSIS OF SURFACE ROUGHNESS AND WHITE LATER THICKNESS IN WIRE-...
A STUDY ON THE REASONS FOR TRANSGENDER TO BECOME ENTREPRENEURS
BROAD UNEXPOSED SKILLS OF TRANSGENDER ENTREPRENEURS
DETERMINANTS AFFECTING THE USER'S INTENTION TO USE MOBILE BANKING APPLICATIONS
ANALYSE THE USER PREDILECTION ON GPAY AND PHONEPE FOR DIGITAL TRANSACTIONS
VOICE BASED ATM FOR VISUALLY IMPAIRED USING ARDUINO
IMPACT OF EMOTIONAL INTELLIGENCE ON HUMAN RESOURCE MANAGEMENT PRACTICES AMONG...
VISUALISING AGING PARENTS & THEIR CLOSE CARERS LIFE JOURNEY IN AGING ECONOMY
A STUDY ON THE IMPACT OF ORGANIZATIONAL CULTURE ON THE EFFECTIVENESS OF PERFO...
GANDHI ON NON-VIOLENT POLICE
A STUDY ON TALENT MANAGEMENT AND ITS IMPACT ON EMPLOYEE RETENTION IN SELECTED...
ATTRITION IN THE IT INDUSTRY DURING COVID-19 PANDEMIC: LINKING EMOTIONAL INTE...
INFLUENCE OF TALENT MANAGEMENT PRACTICES ON ORGANIZATIONAL PERFORMANCE A STUD...
A STUDY OF VARIOUS TYPES OF LOANS OF SELECTED PUBLIC AND PRIVATE SECTOR BANKS...
EXPERIMENTAL STUDY OF MECHANICAL AND TRIBOLOGICAL RELATION OF NYLON/BaSO4 POL...
ROLE OF SOCIAL ENTREPRENEURSHIP IN RURAL DEVELOPMENT OF INDIA - PROBLEMS AND ...
OPTIMAL RECONFIGURATION OF POWER DISTRIBUTION RADIAL NETWORK USING HYBRID MET...
APPLICATION OF FRUGAL APPROACH FOR PRODUCTIVITY IMPROVEMENT - A CASE STUDY OF...
A MULTIPLE – CHANNEL QUEUING MODELS ON FUZZY ENVIRONMENT

Recently uploaded (20)

PDF
Exploratory_Data_Analysis_Fundamentals.pdf
PDF
Analyzing Impact of Pakistan Economic Corridor on Import and Export in Pakist...
PDF
737-MAX_SRG.pdf student reference guides
PDF
Visual Aids for Exploratory Data Analysis.pdf
PDF
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
PPT
Total quality management ppt for engineering students
PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PPTX
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
PDF
COURSE DESCRIPTOR OF SURVEYING R24 SYLLABUS
PDF
PPT on Performance Review to get promotions
PDF
III.4.1.2_The_Space_Environment.p pdffdf
PPTX
Safety Seminar civil to be ensured for safe working.
PPTX
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Exploratory_Data_Analysis_Fundamentals.pdf
Analyzing Impact of Pakistan Economic Corridor on Import and Export in Pakist...
737-MAX_SRG.pdf student reference guides
Visual Aids for Exploratory Data Analysis.pdf
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
Total quality management ppt for engineering students
Automation-in-Manufacturing-Chapter-Introduction.pdf
6ME3A-Unit-II-Sensors and Actuators_Handouts.pptx
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
R24 SURVEYING LAB MANUAL for civil enggi
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
COURSE DESCRIPTOR OF SURVEYING R24 SYLLABUS
PPT on Performance Review to get promotions
III.4.1.2_The_Space_Environment.p pdffdf
Safety Seminar civil to be ensured for safe working.
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...

Ijmet 10 01_141

  • 1. http://www.iaeme.com/IJMET/index.asp 1392 [email protected] International Journal of Mechanical Engineering and Technology (IJMET) Volume 10, Issue 01, January 2019, pp. 1392-1398, Article ID: IJMET_10_01_141 Available online at http://www.iaeme.com/ijmet/issues.asp?JType=IJMET&VType=10&IType=01 ISSN Print: 0976-6340 and ISSN Online: 0976-6359 © IAEME Publication Scopus Indexed NORMALIZED GEOMETRIC INDEX: A SCALE FOR CLASSIFIER SELECTION Krishna Sriharsha Gundu Student in VIT Vellore Sundar S Professor in VIT Vellore ABSTRACT For years, the Machine Learning community has focused on developing efficient algorithms that can produce very accurate classifiers. However, it is often much easier to find several good classifiers based on dataset combination, instead of single classifier applied on deferent datasets. The advantages of using classifier dataset combinations instead of a single one are twofold: it helps lowering the computational complexity by using simpler models, and it can improve the classification accuracy and performance. Most Data mining applications are based on pattern matching algorithms, thus improving the performance of the classification has a positive impact on the quality of the overall data mining task. Since combination strategies proved very useful in improving the performance, these techniques have become very important in applications such as Cancer detection, Speech Technology and Natural Language Processing .The aim of this paper is basically to propose proprietary metric, Normalized Geometric Index (NGI) based on the latent properties of datasets for improving the accuracy of data mining tasks. 
Key words: Machine Learning, Classification, Classifier Selection, Data Mining, Non Linear Regression, Normalized Geometric Index (NGI) Cite this Article: Krishna Sriharsha Gundu and Sundar S, Normalized Geometric Index: a Scale for Classifier Selection, International Journal of Mechanical Engineering and Technology, 10(01), 2019, pp.1392–1398 http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=10&Type=01 1. INTRODUCTION Classification is an important data mining task, to classify the given features and to learn the hidden knowledge in the dataset .Input to a typical classification system is a set of features from dataset with an associated class .A feature is represented by a set of measurements that contain relevant information about the structure of the object we wish to classify .Hence in the context of classification the combination of classifier and dataset is important measure to understand the performance of classifier system .This method of understanding the performance is termed as "Overproduce and choose"[1]. In this method, a large number of datasets of different geometries are given as inputs to different classifiers. Flash.P [2] has discussed more details about generic
approaches to assess the influence of dataset parameters on accuracy, but did not elaborate any specific direction for solving the problem. In this paper, a numeric index is developed (in Section 3) and used to predict which classifier gives the best accuracy for the geometry of the dataset to be classified.

1.1. Dataset and Classifier
Classification consists of predicting a certain outcome based on input historic data. The prediction is carried out by processing the data with an algorithm applied to a training dataset. The algorithm tries to discover the relationships between the features that will aid in predicting the output according to the perceived pattern. The classification algorithm analyzes the input and predicts the output; prediction accuracy is the figure of merit of the classification algorithm. For example, in a software defect dataset, as shown in Table 1, the training set would have relevant information on the "bug" attribute, collected historically. The prediction data, as shown in Table 2, is used by the algorithm to predict the bugs in a module.

1.2. Influence of the Dataset-Classifier Combination
In general, classification is about taking a decision on given data. Michie et al. define classification as the "construction of a procedure that is applied to a series of objects, where each object is assigned to a label" [3, 4, 5]. In this paper, classification refers to supervised learning, where the classifier is trained on historic data with associated classes. In today's machine learning world, the challenge is to improve the performance of the learning system and to apply a suitable classification algorithm to a particular dataset [6, 7]. Since dataset volumes and feature counts increase over time, selecting a suitable classifier is a challenge.
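The train/predict workflow described in Section 1.1 can be sketched with a small example. The defect features and values below are made up for illustration and are not taken from the paper's Table 1 or Table 2; scikit-learn is an assumed library choice.

```python
# A minimal sketch of the train/predict workflow: a classifier is fit on
# historic defect data, then used to predict the "bug" label for new modules.
from sklearn.neighbors import KNeighborsClassifier

# Illustrative training data: [lines_of_code, cyclomatic_complexity, num_commits]
X_train = [[120, 4, 3], [950, 22, 15], [200, 6, 2],
           [1400, 31, 20], [80, 2, 1], [1100, 25, 12]]
y_train = [0, 1, 0, 1, 0, 1]  # 0 = no bug, 1 = bug (historically observed)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# New modules whose defect status we want to predict
X_new = [[100, 3, 2], [1200, 28, 18]]
predictions = clf.predict(X_new)
print(predictions)  # small, simple module vs. large, complex module
```

The prediction accuracy of such a model on held-out data would be its figure of merit, as described above.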
Poor classifier selection results in poor accuracy. A few studies have touched on this problem, but it remains a challenge [8, 9]. Despite the explosive growth of data volumes and the availability of several machine learning algorithms, there is no study on the selection of classifiers, nor guidelines for selecting a classifier for a given type of dataset [10]. Datasets themselves offer little clue for selecting a relevant classifier algorithm. Each classifier interprets and processes the data differently. For example, the k-Nearest Neighbour (kNN) classifier computes the distance from the test point to all the training points and assigns the class that is nearest to it. In contrast, the Support Vector Machine (SVM) classifier draws hyper-planes such that all the points on one side of a hyper-plane belong to one class. This difference in algorithms leads to differences in classification accuracy. In other words, not all classifiers are suitable for all geometries of datasets; that is, the importance
of features, observations and classes differs between algorithms. We therefore have to choose the classifier based on the dataset at hand.

2. EXISTING WORK
Van Gemert [3] conducted a study on the relationship between classification algorithm and dataset, with prime consideration given to execution time. The author discussed the influence of dataset characteristics on classification algorithm performance, but did not address the real issue of the influence of dataset parameters on classification accuracy, nor the effect of dataset characteristics (e.g., number of features or classes) on classifier performance.

In an attempt to produce optimal accuracy for all datasets, a classifier ensemble can be generated from other classifiers or classifier ensembles and combination functions [11]. Although this approach may yield optimal accuracy for all kinds of datasets, there are many combinations of classifiers and combination functions, and the time required to build such a custom classifier is very high. A final selection then has to be made from the list of classifiers built for the dataset [12]. The approach is worth implementing only if there is a drastic improvement in accuracy; the time taken to understand and build the optimum classifier renders it useless in time-sensitive classification, or where only minor accuracy improvements are at stake.

A true classifier ensemble can be built only if classifiers with diverse strong and weak points are combined, and the research on measuring this diversity is not concluded [13]. This implies that a classifier-ensemble solution may not be found or, if found, could be a local minimum of the error, requiring a restart of the search process with a different classifier and its combinations.
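The sensitivity of accuracy to the classifier-dataset pairing, which motivates all of the work surveyed above, can be demonstrated with a small sketch. The XOR-patterned data below is a textbook illustration, not an example from the paper: kNN handles it easily, while a linear-kernel SVM cannot draw a single separating hyper-plane.

```python
# Sketch: the same dataset can favour one classifier's geometry over another's.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# XOR-like data: the class depends on the *combination* of the two features.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [0.1, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.9]]
y = [0, 1, 1, 0, 0, 1, 1, 0]

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
svm = SVC(kernel="linear").fit(X, y)

knn_acc = knn.score(X, y)  # local neighbourhoods separate the classes
svm_acc = svm.score(X, y)  # no single hyper-plane separates XOR data
print(knn_acc, svm_acc)
```

This is exactly the kind of mismatch that a numeric selection index aims to predict in advance, without trialling every classifier.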
The time taken to gather enough data to make a decision is considered critical in a classification task [13]. Since truly complementary and diverse classifiers do not exist, the fusion of many decisions into a single output label is challenging. Although some frameworks such as weighted voting have been developed [14], they are not foolproof, thereby requiring yet another classifier just to map the pool of outputs onto a single output based on feature relevance and confidence intervals. The computational complexity of such a system is exponentially high, as the selection of a classifier ensemble uses another classifier ensemble. The diversity of the classifier pool is ensured by manipulating the classifier inputs and outputs [15], but this manipulation of the dataset could lose crucial information about it. Finally, the main drawback of a single-classifier system is the requirement of prior knowledge to choose the best classifier [16]. This paper proposes a solution that standardizes this knowledge through a parameter.

3. PROPOSED METRIC
In this paper we propose a novel metric, the Normalized Geometric Index (NGI), for selecting a classifier based on dataset parameters, for optimal classification accuracy and execution time. Since the metric is a numeric value, it can directly be used for choosing a single classifier for optimal performance. This eliminates the need for one or more learning algorithms to fuse classifiers and choose the right fusion function, thereby saving time and computational complexity. The parameter is developed keeping four kinds of datasets in mind (refer to Section 4.2). The rules behind the metric are:
1. The accuracy of classification of a dataset improves with the number of observations, as long as enough care is taken to avoid overfitting.
2. The accuracy of classification decreases if there are more classes for the same number of observations.
3. The accuracy of classification decreases if there are more features for the same number of observations.

Every observation is considered new information provided to the classifier detailing the behaviour of the dataset; more observations (for the same number of classes and features) therefore imply better classification. Every new feature is considered a new dimension in which to visualize the classes; if there are more features (for the same number of observations and classes), the information provided by the observations will not be sufficient for the classifier to perform the classification. The more output classes there are, the more information is required to classify the test data into the different classes. By combining the above points, we obtain the following metric.

The classifiers considered for the experiments are:
1. Gaussian Naive Bayes (GNB)
2. Support Vector Machine (SVM)
3. Random Forest (RF)
4. k Nearest Neighbor (KNN)
5. Multi-Layer Perceptron (MLP)
6. Multinomial Naive Bayes (MNB)
7. Quadratic Discriminant Analysis (QDA)

4. EXPERIMENTAL SETUP
4.1. Technical Details of the Experiment
The parameters of the classifiers are set as follows:
• The kNN classifier has its k value set to 3.
• The MLP classifier uses the tanh() activation function.
• The MLP classifier uses the adam solver in Python.
• The MLP classifier's tolerance is set to 10⁻⁵.
• The Random Forest takes its decision from an ensemble of 100 trees.
• The SVM uses a sigmoid kernel for classification.

4.2. Experiment
The experiment is set up such that datasets of all types are executed on all the classifiers. The resulting accuracies are tabulated.
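The classifier settings listed in Section 4.1 can be expressed in code as sketched below. The paper mentions Python and the adam solver but not a specific package, so scikit-learn is an assumption here.

```python
# The seven classifiers with the hyper-parameters stated in Section 4.1;
# parameters not mentioned in the paper are left at library defaults.
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

classifiers = {
    "GNB": GaussianNB(),
    "SVM": SVC(kernel="sigmoid"),                     # sigmoid kernel
    "RF": RandomForestClassifier(n_estimators=100),   # ensemble of 100 trees
    "KNN": KNeighborsClassifier(n_neighbors=3),       # k = 3
    "MLP": MLPClassifier(activation="tanh", solver="adam", tol=1e-5),
    "MNB": MultinomialNB(),                           # needs non-negative inputs
    "QDA": QuadraticDiscriminantAnalysis(),
}
```

Each dataset would then be fit and scored with every entry in this dictionary to produce the accuracy table discussed in Section 5.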
A non-linear regression curve is drawn for the accuracies of each classifier as a function of the NGI metric. An average of all the accuracies is then taken for each dataset, and another non-linear regression curve is drawn for these average accuracies as a function of NGI. This average curve acts as the threshold for selecting a classifier: when the two curves are plotted together, the region where an individual classifier outperforms the average classifier performance is that classifier's region of strength. It must be noted that the R-squared value (describing the goodness of fit of the curve) will be low, since the accuracy is not
fully explained by these parameters alone. Hence, the curve that best describes the accuracy in terms of NGI is selected for determining the region of performance.

Figure 1. Experimental Setup

5. RESULTS
After conducting the experiments described in the previous section, the results are obtained as tabulated below (classifier accuracy for each NGI value). As seen in the NGI column, the datasets correspond to different geometries. From the table, it can be inferred that NGI values around 0.041 have the highest overall classification accuracy; this NGI value corresponds to datasets having fewer classes, more observations and fewer features. The lowest overall classification accuracy occurs for a high number of classes, a high number of features and a low number of observations. During classification, it is also observed that the Multinomial Naive Bayes classifier will not work with all raw datasets: all its inputs should be non-negative, so pre-processing is needed.

The following setup is used for studying the relationship between classifier performance and the NGI metric. For each classifier, the experiments are conducted using the different datasets mentioned in the previous section. The response function of NGI is calculated as the average classification accuracy, which serves as the baseline for determining whether a classifier is better or worse at a particular NGI value. After calculating the NGI value from equation (2), the accuracy values of the individual classifiers are approximated by the following equations, which are non-linear regression models of accuracy in terms of the NGI metric.
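The fitting step can be sketched as follows, with two loudly-flagged assumptions: the NGI formula below (observations relative to features times classes, following the three rules of Section 3) and the quadratic regression form are illustrative guesses, since the paper's equations are published as images and not reproduced in this text. The geometry/accuracy pairs are made up, not the paper's data.

```python
# Sketch of fitting a non-linear regression of accuracy against NGI.
import numpy as np

def ngi(n_observations, n_features, n_classes):
    # Hypothetical index: grows with observations, shrinks with features
    # and classes, per the three rules in Section 3. Scaling is arbitrary.
    return n_observations / (n_features * n_classes * 10000.0)

# Illustrative (observations, features, classes) geometries and mean accuracies.
geometries = [(5000, 4, 2), (2000, 10, 5), (800, 30, 8), (10000, 6, 3)]
accuracies = [0.91, 0.78, 0.62, 0.88]

x = np.array([ngi(*g) for g in geometries])
y = np.array(accuracies)

# Quadratic least-squares fit: accuracy ~ a*NGI^2 + b*NGI + c
coeffs = np.polyfit(x, y, deg=2)
predict = np.poly1d(coeffs)
print(predict(ngi(5000, 4, 2)))
```

An individual classifier's fitted curve would then be compared against the curve fitted to the average accuracies to locate its region of strength.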
For the accuracy of a specific classifier, the corresponding column of the table is chosen as y and the related NGI values are chosen as x. Using non-linear regression, a function is created for each classifier. These functions are as follows.
These equations approximately describe the behaviour of the classifiers for different NGI values; graphical examination of them shows the optimal classifier for a given NGI value. All the approximations have an R-squared value greater than 0.85.

6. CONCLUSION AND FUTURE WORK
From the above results it is clear that the Normalized Geometric Index (NGI) is helpful in determining the classifier-dataset relation for improved accuracy. The QDA performance is inferior to that of the other classifiers, so it is not suggested as a prime choice for the given dataset properties. The kNN classifier has performed consistently compared with the remaining classifiers, while compromising somewhat on accuracy. The accuracy of the MLP classifier, however, is very high provided the NGI is greater than 0.787. The Random Forest (RF) classifier performs uniformly well across all NGI values, with no threshold value; its performance is mostly consistent. We have conducted experiments on a few datasets; the experiments can be repeated using sparse and scientific datasets to study the impact of the NGI metric on a wider variety of datasets.

REFERENCES
[1] Amanda J. C. Sharkey, Noel E. Sharkey, Uwe Gerecke, and Gopinath Odayammadath Chandroth. The "test and select" approach to ensemble combination. In International Workshop on Multiple Classifier Systems, pages 30-44. Springer, 2000.
[2] Peter Flach. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, 2012.
[3] T. van Gemert. On the influence of dataset characteristics on classifier performance. B.S. thesis, 2017.
[4] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
[5] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter.
Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015.
[6] Evan R. Sparks, Ameet Talwalkar, Daniel Haas, Michael J. Franklin, Michael I. Jordan, and Tim Kraska. Automating model search for large scale machine learning. In Proceedings of the Sixth ACM Symposium on Cloud Computing, pages 368-380. ACM, 2015.
[7] Gang Luo. A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Network Modeling Analysis in Health Informatics and Bioinformatics, 5(1):18, 2016.
[8] Alexandros Kalousis and Theoharis Theoharis. Noemon: Design, implementation and performance results of an intelligent assistant for classifier selection. Intelligent Data Analysis, 3(5):319-337, 1999.
[9] Joao Gama and Pavel Brazdil. Characterization of classification algorithms. In Portuguese Conference on Artificial Intelligence, pages 189-200. Springer, 1995.
[10] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta. Metalearning: Applications to Data Mining. Springer Science & Business Media, 2008.
[11] Josef Kittler. Multiple Classifier Systems: First International Workshop, MCS 2000, Cagliari, Italy, June 21-23, 2000, Proceedings, volume 1857. Springer Science & Business Media, 2000.
[12] Fabio Roli, Giorgio Giacinto, and Gianni Vernazza. Methods for designing multiple classifier systems. In International Workshop on Multiple Classifier Systems, pages 78-87. Springer, 2001.
[13] Michał Woźniak, Manuel Graña, and Emilio Corchado. A survey of multiple classifier systems as hybrid systems. Information Fusion, 16:3-17, 2014.
[14] Šarūnas Raudys. Trainable fusion rules. II. Small sample-size effects. Neural Networks, 19(10):1517-1527, 2006.
[15] Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, 2004.
[16] Josef Kittler. A framework for classifier fusion: Is it still needed?
In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 45–56. Springer, 2000.