SlideShare a Scribd company logo
International Journal of Trend in Scientific Research and Development (IJTSRD)
Volume: 3 | Issue: 2 | Jan-Feb 2019 Available Online: www.ijtsrd.com e-ISSN: 2456 - 6470
@ IJTSRD | Unique Reference Paper ID – IJTSRD21574 | Volume – 3 | Issue – 2 | Jan-Feb 2019 Page: 974
Analysis of Imbalanced Classification Algorithms:
A Perspective View
Priyanka Singh1, Prof. Avinash Sharma2
1PG Scholar, 2Assistant Professor
1,2Department of CSE, MITS, Bhopal, Madhya Pradesh, India
ABSTRACT
Classification of data has become an important research area. The process of classifying documentsintopredefined categories
Unbalanced data set, a problem often found in real world application, can cause seriously negative effect on classification
performance of machine learning algorithms. There have been many attemptsatdealingwith classificationofunbalanceddata
sets. In this paper we present a brief review of existing solutions to the class-imbalance problemproposed bothatthedata and
algorithmic levels. Even though a common practice to handle the problem of imbalanced data is to rebalance them artificially
by oversampling and/or under-sampling, some researchers proved that modified support vector machine, rough set based
minority class oriented rule learning methods, cost sensitive classifier perform goodonimbalanceddataset. Weobserved that
current research in imbalance data problem is moving to hybrid algorithms.
Keywords: cost-sensitive learning, imbalanced data set, modified SVM, oversampling, undersampling
I. INTRODUCTION
A data set is called imbalanced if it contains many more
samples from one class than from the rest of theclasses.Data
sets are unbalanced when atleast one class is representedby
only a small number of trainingexamples(calledtheminority
class) while other classes make up the majority. In this
scenario, classifiers can have good accuracy on the majority
class but very poor accuracy on the minority class(es) due to
the influence that the larger majority class has on traditional
training criteria. Most original classification algorithms
pursue to minimize the error rate: the percentage of the
incorrect prediction ofclasslabels.Theyignorethedifference
between types of misclassification errors. In particular, they
implicitly assume that all misclassification errors cost
equally.
In many real-world applications, this assumption is not true.
Thedifferencesbetweendifferentmisclassificationerrorscan
be quite large. For example, in medical diagnosis of a certain
cancer, if the cancer is regarded as the positive class, and
non-cancer (healthy) as negative, then missing a cancer (the
patientis actually positive but is classified asnegative;thusit
is also called ―false negative‖) is much more serious (thus
expensive) than the false-positive error. The patient could
lose his/her life because of the delay in the correct diagnosis
and treatment. Similarly, if carryingabombispositive,thenit
is much more expensive to miss a terrorist who carries a
bomb to a flight than searching an innocent person.
The unbalanced data set problem appearsinmanyrealworld
applications like text categorization, fault detection, fraud
detection, oil-spills detection in satellite images, toxicology,
cultural modeling, medical diagnosis.[1] Many research
papers on imbalanced data sets have commonly agreed that
because of this unequal classdistribution,theperformanceof
the existing classifiers tends to be biased towards the
majority class. The reasons for poor performance of the
existing classificationalgorithmsonimbalanceddatasetsare:
1. They are accuracy driven i.e.,theirgoalistominimizethe
overall error to whichtheminorityclasscontributesvery
little.
2. They assume that there is equal distribution of data for
all the classes.
3. They also assume that the errors coming from different
classes have the same cost[2].
With unbalanced data sets, data mining learning algorithms
produce degenerated models that do not take into account
the minority class as most data mining algorithms assume
balanced data set.
A number of solutions to the class-imbalance problem were
previously proposed both at the data and algorithmic levels
[3]. At the data level, these solutions include many different
forms of re-sampling such as random oversampling with
replacement, randomundersampling,directedoversampling
(in which no new examples are created, but the choice of
samples to replace is informed ratherthanrandom),directed
undersampling (where, again, the choice of examples to
eliminate is informed), oversampling with informed
generation of new samples, and combinations of the above
techniques. At the algorithmic level, solutions include
adjusting the costs of the various classes so as to counter the
class imbalance, adjusting the probabilistic estimate at the
tree leaf (when working with decision trees), adjusting the
decision threshold, and recognition-based(i.e.,learningfrom
one class) rather than discrimination-based (two class)
learning. The most common techniques to deal with
unbalanced data include resizing training datasets, cost-
sensitive classifier, and snowball method. Recently, several
methods have been proposed with good performance on
unbalanced data. These approachesincludemodifiedSVMs,k
nearest neighbor (kNN), neural networks, genetic
programming, rough set based algorithms, probabilistic
decision tree and learning methods. The next sections focus
on some of the method in detail.
II. SAMPLING METHODS
An easy Datalevel methodsfor balancing theclassesconsists
of re-sampling the original data set, either by over- sampling
the minority class or by under-sampling the majority class,
International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470
@ IJTSRD | Unique Reference Paper ID – IJTSRD21574 | Volume – 3 | Issue – 2 | Jan-Feb 2019 Page: 975
until the classes are approximatelyequallyrepresented.Both
strategies can be applied in any learning system, since they
act as a preprocessingphase, allowingthe learningsystemto
receive the training instances as if they belonged to a well-
balanced data set. Thus, any bias of the system towards the
majority class due tothedifferentproportionofexamplesper
class would be expected to be suppressed.
Hulse et al. [4] suggest that the utility of the re-sampling
methods depends on a number of factors, including the ratio
between positive and negative examples, other
characteristics of data, and the nature of the classifier.
However, re-sampling methods have shown important
drawbacks. Under-samplingmaythrowoutpotentiallyuseful
data, while over-sampling artificially increases thesizeofthe
data set and consequently, worsens the computational
burden of the learning algorithm.
A. Oversampling
The simplestmethod toincrease thesizeoftheminorityclass
corresponds to random over-sampling, that is, a non-
heuristicmethod thatbalances theclassdistributionthrough
the random replication of positive examples. Nevertheless,
sincethis methodreplicatesexistingexamplesintheminority
class, over fitting is more likely to occur.
Chawla proposed Synthetic Minority Over-sampling
Technique (SMOTE) [5] an over-samplingapproachinwhich
the minority class is over-sampled by creating synthetically
examples rather than by over-sampling with replacement.
The minority class is over-sampled by taking each minority
class sample and introducing synthetic examples along the
line segments joining any/all of the k minority class nearest
neighbors. Depending upon the amount of over-sampling
required, neighbors from the k nearest neighbors are
randomlychosen.FromtheoriginalSMOTEalgorithm,several
modifications have been proposed in the literature. While
SMOTE approach does not handle data sets with all nominal
features, it was generalized to handle mixed datasets of
continuousandnominalfeatures.ChawlaproposeSMOTE-NC
(Synthetic Minority Over-sampling Technique Nominal
Continuous) and SMOTE-N (Synthetic Minority Over-
sampling Technique Nominal), the SMOTE can also be
extended for nominal features.
Andrew Estabrooks et al. proposed a multiple re- sampling
method which selected the most appropriate re-sampling
rate adaptively [6]. Taeho Jo et al. put forward a cluster-
based over-samplingmethodwhichdealtwithbetween-class
imbalance and within-class imbalance simultaneously [7].
Hongyu Guo et al. found out hard examples of the majority
and minority classes dur-ing the process of boosting, then
generated new synthetic examples from hard examples and
add them to the data sets [8].Based on SMOTE method, Hui
Han and Wen-Yuan Wang [9] presented two new minority
over-samplingmethods,borderline-SMOTE1andborderline-
SMOTE2, in which only the minority examples near the
borderline are over- sampled. These approaches achieve
better TP rate and F- value than SMOTE and random over-
sampling methods.
B. Undersampling
Under-samplingisanefficientmethodforclassing-imbalance
learning. This method uses a subset of the majority class to
train the classifier. Since many majority class examples are
ignored, the training set becomes more balanced and the
training process becomes faster. The most common
preprocessing technique israndommajorityunder-sampling
(RUS), IN RUS, Instances of the majority class are randomly
discarded from the dataset.
However, the main drawback of under-sampling is that
potentially useful information contained in these ignored
examples is neglected. There many ways attempts to
improve upon the performance of random sampling, such as
Tomek links, Condensed Nearest Neighbor Rule and One-
sided selection etc. one-sided selection (OSS) is proposed
by Rule Kubat and Matwin attempts to intelligently under-
sample the majority class by removing majority class
examples that are considered either redundant or noisy.‘
Over-sampling is a method for improve minority class
recognition, randomly duplicate the minority data not only
without increase any category of a small number of new
information, but also will lead to over-fitting.
For some problems like fraud detection which is highly
overlapped unbalanced data classification problem, where
non-fraud samples heavily outnumber the fraud samples,T.
Maruthi Padmaja[10]proposedhybridsamplingtechnique, a
combination of SMOTE to over-sample the minority data
(fraud samples) and random under- sampling to under-
sample the majoritydata(non-fraudsamples)ifweeliminate
extreme outliers from the minority samples for highly
skewed imbalanced data sets like fraud detection
classification accuracy can be improved.
Sampling methods consider the class skew and properties of
the dataset as a whole. However, machine learning and data
mining often face nontrivial datasets, which often exhibit
characteristics and properties at a local, rather than global
level. It is noted that a classifier improved through global
sampling levels may be insensitive to the peculiarities of
different components or modalities in the data, resulting in a
suboptimal performance. David A. Cieslak, Nitesh V.
Chawla[11] has suggested that for improving classifier
performance sampling can be treated locally, instead of
applying uniform levels of sampling globally. They proposed
a framework which first identifiesmeaningfulregionsofdata
and then proceeds to find optimal sampling levels within
each.
There are known disadvantages associated with the use of
sampling to implement cost-sensitive learning. The
disadvantage with undersampling is that it discards
potentially useful data. The main disadvantage with
oversampling, from our perspective, is that by making exact
copies of existing examples, it makes over fitting likely. In
fact, with oversampling it is quite common for a learner to
generate a classification rule to cover a single, replicated,
example. A second disadvantage of oversampling is that it
increases the number of training examples, thus increasing
the learning time.
Given the disadvantages with sampling, still sampling is a
popular way to deal with imbalanced data rather than a cost-
sensitive learning algorithm. There are several reasons for
this. The most obvious reason is there are not cost- sensitive
implementations of all learning algorithms and therefore a
wrapper-based approach using sampling is the only option.
While this is certainly less true today than in the past, many
learning algorithms (e.g., C4.5) still do not directly handle
costs in the learning process. A second reason for using
International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470
@ IJTSRD | Unique Reference Paper ID – IJTSRD21574 | Volume – 3 | Issue – 2 | Jan-Feb 2019 Page: 976
sampling is thatmany highly skewed data sets are enormous
and the size of the training set must be reduced in order for
learning to be feasible.
In this case, undersampling seems to be a reasonable, and
valid, strategy. if one needs to discard some training data, it
still might be beneficial to discard some of the majority class
examples in order to reduce the training set size to the
required size, and then alsoemploy a cost- sensitive learning
algorithm, so that the amount of discarded training data is
minimized. A final reason that may have contributed to the
use of sampling rather than a cost-sensitive learning
algorithm is that misclassification costs are often unknown.
However, this is not a valid reason for using sampling over a
cost-sensitive learning algorithm, since the analogous issue
arises with sampling—what should the class distribution of
the final training data be? If this cost information is not
known, a measure such as the area under the ROC curve
could be used to measure classifier performance and both
approaches could then empiricallydeterminethepropercost
ratio/class distribution [12].
III. COST-SENSITIVE LEARNING
At the algorithmic level, solutions include adjusting the costs
of the various classes so as to counter the class imbalance,
adjusting the probabilistic estimate at the tree leaf (when
working with decision trees), adjusting the decision
threshold, and recognition-based (i.e., learning from one
class) rather than discrimination-based (two class) learning.
Cost-Sensitive Learning is a type of learning in data mining
that takes the misclassification costs (and possibly other
types of cost) into consideration. There are many ways to
implement cost sensitive learning, in [13], it is categorized
into three, the first class of techniquesapplymisclassification
costs to the data set as a form of data space weighting, the
second class applies cost-minimizing techniques to the
combination schemesofensemblemethods,andthelastclass
of techniques incorporates cost sensitive features directly
into classification paradigms to essentially fit the cost
sensitive framework into these classifiers.
Incorporating costintodecision tree classification algorithm
which is one of the most widely used and simple classifier.
Cost can be incorporated into it in various ways. First way is
cost can be applied to adjust the decision threshold, second
way is cost can be used in splitting attribute selection during
decision treeconstruction and theother way is costsensitive
pruningschemes can be applied to the tree. Ref.[14]propose
a method for building and testing decision trees that
minimizes total sum of the misclassification and test costs.
The algorithm used by them chooses an splitting attribute
that minimizes the total cost, the sum of the test cost and the
misclassification cost rather than choosing an attribute that
minimizes the entropy. Information gain, Gini measures are
considered to be skew sensitive [15]. In Ref. [16] a new
decision tree algorithm called Class Confidence Proportion
Decision Tree (CCPDT) is proposed which is robust and
insensitive to size of classes and generates rules which are
statistically significant. Ref. [17] analytically and empirically
demonstrates the strong skew insensitivity of Hellinger
Distance and its advantages over popularalternativemetrics.
They arrived at a conclusion that for imbalanced data it is
sufficient to use Hellinger trees with bagging without any
sampling methods. Ref. [18] uses different operators of
Genetic algorithms for oversampling to enlarge the ratio of
positive samples and then apply clustering to the
oversampled training data set as adata clearning method for
both classes, removing the redundant or noisysamples.They
used AUC as evaluation metricand foundthattheiralgorithm
performed better.
Nguyen ha vo, Yonggwan won[19] extended Regularized
Least Square(RLS) algorithm that penalizes errors of
different samples with different weights and some rules of
thumb to determine those weights. The significantly better
classification accuracy of weighted RLS classifiers showed
that it is promising substitution of other previous cost-
sensitiveclassification methodsfor unbalanceddataset.This
approach is equivalent to up- sampling or down-sampling
depending on the cost we choose. For example, doubling the
cost-sensitivity of one class is said to be equivalent to
doubling the number of samples in that class.
Ref[20] proposed a novel approach reducing each within
group error, BABoostthat is a variant of AdaBoost. Adaboost
algorithm gives equal weight to each misclassified example.
But the misclassification error of each class is not same.
Generally, the misclassificationerroroftheminorityclasswill
larger than themajority‘s. SoAdaboostalgorithm will lead to
higher bias and smaller margin when encountering skew
distribution. BABoost algorithm in each round of boosting
assigns more weights to the misclassified examples,
especially those in the minority class.
Yanmin Sun a and Mohamed S. Kamel[21] explored three
cost-sensitive boosting algorithms, which are developed by
introducing cost items into the learning framework of
AdaBoost. These boosting algorithms are also studied with
respect to their weighting strategies towards different types
of samples, and their effectiveness in identifying rare cases
through experiments on several real worldmedicaldatasets,
where the class imbalance problem prevails.
IV. SVM AND IMBALANCED DATASETS
The success of SVM is very limited when it is applied to the
problem of learning from imbalanced datasets in which
negative instances heavily outnumber the positiveinstances.
Even though undersamplingthe majorityclass doesimprove
SVM performance, there is an inherent loss of valuable
information in this process. Rehan Akbani[22]combined
sampling and cost sensitive learning for improving
performance of SVM. Their algorithm is based on a variantof
the SMOTE algorithm by Chawla et al, combined with
Veropoulos et al‘s different error costs algorithm.
TAO Xiao-yan[23] presented A modified proximal support
vector machine (MPSVM) which assigns different penalty
coefficients to the positive and negative samplesrespectively
by adding a new diagonal matrix in the primal optimization
problem. And further the decision function is obtained. The
real-coded immune clone algorithm (RICA) is employed to
select the global optimal parameters to get the high
generalization performance.
M. Muntean 1 and H. Vălean[24] provided the Enhancer, a
viable algorithm for improving the SVM classification of
unbalanced datasets. They improve the Cost-sensitive
classification for Support Vector Machines, by multiplying in
the training step the instances of the underrepresented
classes.
International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470
@ IJTSRD | Unique Reference Paper ID – IJTSRD21574 | Volume – 3 | Issue – 2 | Jan-Feb 2019 Page: 977
Yuchun Tang and nitesh chawla[25] also implemented and
rigorouslyevaluated four SVM modelingtechniquesSVM can
be effective if incorporate different ―rebalance‖heuristics in
SVM modeling, including cost-sensitive learning, and over
and under sampling.
Geneticprogramming(GP)canevolvebiasedclassifierswhen
data sets are unbalanced. The cost sensitive learning uses
cost adjustment within the learningalgorithmtofactorinthe
uneven distribution of class examples in the original
(unmodified) unbalanced data set, during the training
process. In GP, cost adjustment can be enforced by adapting
the fitness function. Here, solutions with good classification
accuracy on both classes are rewarded with better fitness,
while those that are biased toward one class only are
penalized with poor fitness.
Common techniques include using fixed misclassification
costs for minority and majority class examples [26], [27], or
improved performance criteria such as the area under the
receiver operating characteristic (ROC) curve (AUC) [28], in
the fitness function. While these techniques have
substantially improved minority class performances in
evolved classifiers, they can incur both a tradeoff in majority
class accuracy and, thus, a loss in overall classificationability,
and long training times duetothe computationaloverheadin
evaluating these improved fitness measures. In addition,
these approaches can be problem specific, i.e., fitness
functions are handcrafted for a particular problem domain
only.
V. HYBRID ALGORITHMS
The EasyEnsemble classifierisanunder-samplingalgorithm,
which independently samples several subsets from negative
examples and one classifier is built for each subset. All
generated classifiers arethencombinedfor thefinaldecision
by using Adaboost. In imbalanced problems, some features
are redundant and even irrelevant; these features will hurt
thegeneralizationperformanceoflearningmachines.Feature
selection, a process of choosing a subset of features from the
original ones, isfrequentlyusedasapreprocessingtechnique
in analysis of data. It has been proved effective in reducing
dimensionality, improving mining efficiency, increasing
mining accuracy and enhancing result comprehensibility.
Ref[29] combined the feature selection method with Easy
Ensemble in order to improve the accuracy.
In ref[30] a hybrid algorithm based on random over-
sampling, decision tree (DT), particle swarm optimization
(PSO) and feature selection is proposed to classify
unbalanced data. The proposed algorithm has the ability to
select beneficial feature subsets, automatically adjust values
of parameter and obtain the bestclassification accuracy. The
zoo dataset is used to test the performance. From simulation
results, the classification accuracy ofthisproposedalgorithm
outperforms other existing methods
Decision trees,supplementedwithsamplingtechniques,have
proven to be an effectiveway to address the imbalanceddata
problem. Despite their effectiveness, however, sampling
methods add complexity and the need for parameter
selection. To bypass these difficulties a new decision tree
technique called Hellinger Distance Decision Trees (HDDT)
which uses Hellinger distance as the splitting criterion is
suggested in ref[17]. Theytook advantageofthestrongskew
insensitivity of Hellinger distance and its advantages over
popular alternatives such as entropy (gain ratio). For
imbalanced data it is sufficient to use Hellinger trees with
bagging without any sampling methods.
VI. CONCLUSION
This paper provides an overview of the classification of
imbalanced data sets. At data level, sampling is the most
common approach to deal with imbalanced data. Over-
sampling clearly appears as better than under-sampling for
local classifiers, whereas some under-sampling strategies
outperform over-sampling when employing classifiers with
global learning. Researchers proved that Hybrid sampling
techniques can perform better than just oversampling or
under sampling. At the algorithmic level, solutions include
adjusting the costs of the various classes so as to counter the
class imbalance, adjusting the probabilistic estimate at the
tree leaf (when working with decision trees), adjusting the
decision threshold, and recognition-based(i.e.,learningfrom
one class) rather than discrimination-based (two class)
learning. Solutions based on modified support vector
machine, rough set based minority class oriented rule
learning methods, cost sensitive classifier are also proposed
to deal with unbalanced data. There areof coursemanyother
worthwhile research possibilities that are not included here.
DevelopingClassifierswhicharerobustandskew-insensitive
or hybrid algorithms can be point of interest for the future
research in imbalanced dataset.
REFERENCE
[1] Miho Ohsaki, Peng Wang, Kenji Matsuda, Shigeru
Katagiri, Hideyuki Watanabe, and Anca Ralescu,
“Confusion-matrix-based Kernel Logistic Regression
for Imbalanced Data Classification”, IEEE Transactions
on Knowledge and Data Engineering, 2017.
[2] Alberto Fernández, Sara del Río, Nitesh V. Chawla,
Francisco Herrera, “An insight into imbalanced Big
Data classification: outcomes and challenges”,Springer
journal of bigdata, 2017.
[3] Vaibhav P. Vasani1, Rajendra D. Gawali, “Classification
and performance evaluation using data mining
algorithms”, International Journal of Innovative
Research in Science, Engineering and Technology,
2014.
[4] Kaile Su, Huijing Huang, Xindong Wu, Shichao Zhang,
“Rough Sets for FeatureSelectionand Classification:An
Overview with Applications”, International Journal of
Recent Technology and Engineering (IJRTE) ISSN:
2277-3878, Volume-3, Issue-5, November 2014.
[5] Senzhang Wang, Zhoujun Li, Wenhan Chao and
Qinghua Cao, “Applying Adaptive Over-sampling
Technique Based on Data Density and Cost-Sensitive
SVM to Imbalanced Learning”,IEEE World Congresson
Computational Intelligence June, 2012.
[6] Mikel Galar, Alberto Fernandez, Edurne Barrenechea,
Humberto Bustince and Francisco Herrera, “A Review
on Ensembles for the Class Imbalance Problem:
Bagging, Boosting, and Hybrid-Based Approaches”,
IEEE Transactions on Systems, Man and Cybernetics—
Part C: Applications and Reviews, Vol. 42, No. 4, July
2012.
[7] Nada M. A. Al Salami, “Mining High Speed Data
Streams”. UbiCC Journal, 2011.
International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470
@ IJTSRD | Unique Reference Paper ID – IJTSRD21574 | Volume – 3 | Issue – 2 | Jan-Feb 2019 Page: 978
[8] Dian Palupi Rini, Siti Mariyam Shamsuddin and Siti
Sophiyati, “Particle Swarm Optimization: Technique,
System and Challenges”, International Journal of
Computer Applications (0975 – 8887) Volume 14–
No.1, January 2011.
[9] Amit Saxena, Leeladhar Kumar Gavel, Madan Madhaw
Shrivas, “Online Streaming Feature Selection”, 27th
International Conference on Machine Learning, 2010.
[10] Yuchun Tang, Member, Yan-Qing Zhang, Nitesh V.
Chawla and Sven Krasser, “SVMs Modeling for Highly
Imbalanced Classification”, IEEE Transaction on
Systems, Man and Cybernetics,Vol.39, NO.1,Feb2009.
[11] Haibo He and Edwardo A. Garcia, “Learning from
Imbalanced Data”, IEEE Transactions on Knowledge
and Data Engineering, September 2009.
[12] Thair Nu Phyu, “Survey of Classification Techniques in
Data Mining”, International Multi Conference of
Engineers and Computer Scientists, IMECS 2009,
March, 2009.
[13] Haibo He, Yang Bai, Edwardo A. Garcia and Shutao Li,
“ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning”, IEEE Transaction of Data
Mining, 2009.
[14] Swagatam Das, Ajith Abraham and Amit Konar,
“Particle Swarm Optimization and Differential
Evolution Algorithms: TechnicalAnalysis, Applications
and Hybridization Perspectives”, Springer journal on
knowledge engineering, 2008.
[15] “A logical framework for identifyingqualityknowledge
from different data sources”, International Conference
on Decision Support Systems, 2006.
[16] “Database classification for multi-database mining”,
International Conferenceon DecisionSupportSystems,
2005.
[17] Volker Roth, “Probabilistic Discriminative Kernel
Classifiers for Multi-class Problems”, Springer-Verlag
journal, 2001.
[18] R. Chen, K. Sivakumar and H. Kargupta “Collective
Mining of Bayesian Networks from Distributed
Heterogeneous Data”, Kluwer Academic Publishers,
2001.
[19] Shigeru Katagiri, Biing-Hwang Juang and Chin-HuiLee,
“Pattern Recognition Using a Family of Design
Algorithms Based Upon the Generalized Probabilistic
Descent Method”, IEEE Journal of Data Minig, 1998.
[20] I. Katakis, G. Tsoumakas, and I. Vlahavas. Tracking
recurring contexts using ensemble classifiers: an
application to email filtering. Knowledge and
Information Systems, Pp 371–391, 2010.
[21] J. Kolter and M. Maloof. Using additive expert
ensembles to cope with concept drift. In Proc. ICML,Pp
449–456, 2005.
[22] D. D. Lewis, Y. Yang, T. Rose, and F. Li. Rcv1: A new
benchmark collection for text categorization research.
Journal of Machine Learning Research, Pp 361–397,
2004.
[23] X. Li, P. S. Yu, B. Liu, and S.-K. Ng. Positive unlabeled
learning for data stream classification. In Proc.SDM,Pp
257–268, 2009.
[24] M. M. Masud, Q. Chen, J. Gao, L. Khan, J. Han, and B. M.
Thuraisingham. Classificationand novel classdetection
of data streams in a dynamic feature space. In Proc.
ECML PKDD, volume II, Pp 337–352, 2010.
[25] P. Zhang, X. Zhu, J. Tan, and L. Guo, “Classifier and
Cluster Ensembles for Mining Concept Drifting Data
Streams,” Proc. 10th Int’l Conf. Data Mining, 2010.
[26] X. Zhu, P. Zhang, X. Lin, and Y. Shi, “Active Learning
from Stream Data Using Optimal Weight Classifier
Ensemble,” IEEE Trans. Systems,Man, CyberneticsPart
B, vol. 40, no. 6, Pp 1607- 1621, Dec. 2010.
[27] Q. Zhang, J. Liu, and W. Wang, “Incremental Subspace
Clustering over Multiple Data Streams,” Proc. Seventh
Int’l Conf. Data Mining, 2007.
[28] Q. Zhang, J. Liu, and W. Wang, “Approximate Clustering
on Distributed Data Streams,” Proc. 24th Int’l Conf.
Data Eng., 2008.
[29] C. C. Aggarwal. On classification and segmentation of
massive audio data streams. Knowl. and Info. Sys., Pp
137–156, July 2009.
[30] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A
framework for on-demand classification of evolving
data streams. IEEE Trans. Knowl. Data Eng, Pp 577–
589, 2006.
[31] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R.
Gavald. New ensemble methods for evolving data
streams. In Proc. SIGKDD, Pp 139–148, 2009.
[32] S. Chen, H. Wang, S. Zhou, and P. Yu. Stop chasing
trends: Discovering highorder modelsin evolvingdata.
In Proc. ICDE, Pp 923–932, 2008.
[33] P. Zhang, X. Zhu, and L. Guo. Mining data streams with
labeled and unlabeled training examples. In Proc.
ICDM, Pp 627–636, 2009.
[34] O. R. Terrades, E. Valveny, and S. Tabbone, “Optimal
classifier fusion in a non-Bayesian probabilistic
framework,” IEEE Trans.Pattern Anal.Mach.Intell., vol.
31, no. 9, Pp 1630–1644, Sep. 2009.

More Related Content

What's hot (20)

PPT
Sampling and Sample Size
Dr. Keerti Jain
 
PPTX
Sample and sample size
Dr. Roshni Maurya
 
PPTX
Population & sample lecture 04
DrZahid Khan
 
PPTX
Advanced Biostatistics and Data Analysis abdul ghafoor sajjad
HeadDPT
 
PPTX
Biostatistics
Hanimarcelo slideshare
 
PPTX
How to determine sample size
saifur rahman
 
PPT
eMba i qt unit-5_sampling
Rai University
 
PPT
statistics in pharmaceutical sciences
Techmasi
 
PPTX
Sample size calculation
Santam Chakraborty
 
PPTX
Sample Size Determination
Tina Sepehrifar
 
PPT
Sampling and its variability
DrBhushan Kamble
 
PDF
Meta analisis in health policy
rsd kol abundjani
 
PPTX
Introduction to Biostatistics and types of sampling methods
Dr. Sunita Ojha
 
PPTX
Basics of biostatistic
NeurologyKota
 
PPTX
Complex sampling design & analysis
International Islamic University Malaysia
 
PPT
Mangasini ppt lect_sample size determination
Mangasini Katundu
 
PDF
IRJET- Extending Association Rule Summarization Techniques to Assess Risk of ...
IRJET Journal
 
PDF
A Framework for Statistical Simulation of Physiological Responses (SSPR).
Waqas Tariq
 
Sampling and Sample Size
Dr. Keerti Jain
 
Sample and sample size
Dr. Roshni Maurya
 
Population & sample lecture 04
DrZahid Khan
 
Advanced Biostatistics and Data Analysis abdul ghafoor sajjad
HeadDPT
 
Biostatistics
Hanimarcelo slideshare
 
How to determine sample size
saifur rahman
 
eMba i qt unit-5_sampling
Rai University
 
statistics in pharmaceutical sciences
Techmasi
 
Sample size calculation
Santam Chakraborty
 
Sample Size Determination
Tina Sepehrifar
 
Sampling and its variability
DrBhushan Kamble
 
Meta analisis in health policy
rsd kol abundjani
 
Introduction to Biostatistics and types of sampling methods
Dr. Sunita Ojha
 
Basics of biostatistic
NeurologyKota
 
Complex sampling design & analysis
International Islamic University Malaysia
 
Mangasini ppt lect_sample size determination
Mangasini Katundu
 
IRJET- Extending Association Rule Summarization Techniques to Assess Risk of ...
IRJET Journal
 
A Framework for Statistical Simulation of Physiological Responses (SSPR).
Waqas Tariq
 

Similar to Analysis of Imbalanced Classification Algorithms A Perspective View (20)

PDF
Multi-Cluster Based Approach for skewed Data in Data Mining
IOSR Journals
 
PDF
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
ijtsrd
 
PDF
6145-Article Text-9370-1-10-20200513.pdf
chalachew5
 
PDF
An overview on data mining designed for imbalanced datasets
eSAT Publishing House
 
PDF
An overview on data mining designed for imbalanced datasets
eSAT Journals
 
PDF
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
cscpconf
 
PPTX
Feature selection with imbalanced data in agriculture
Aboul Ella Hassanien
 
PDF
Under-sampling technique for imbalanced data using minimum sum of euclidean d...
IAESIJAI
 
PDF
Oversampling technique in student performance classification from engineering...
IJECEIAES
 
PDF
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...
IJMIT JOURNAL
 
PDF
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...
IJMIT JOURNAL
 
PDF
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
IRJET Journal
 
PDF
A SURVEY OF METHODS FOR HANDLING DISK DATA IMBALANCE
IJCI JOURNAL
 
PPTX
COMP_GroupA2.pptx
shivrajdeshmukh22
 
PDF
An advance extended binomial GLMBoost ensemble method with synthetic minorit...
IJECEIAES
 
PDF
Class imbalance problem1
chs71
 
PPTX
Borderline Smote
Trector Rancor
 
PDF
TRENDS IN FINANCIAL RISK MANAGEMENT SYSTEMS IN 2020
IJMIT JOURNAL
 
Multi-Cluster Based Approach for skewed Data in Data Mining
IOSR Journals
 
Adaptive Classification of Imbalanced Data using ANN with Particle of Swarm O...
ijtsrd
 
6145-Article Text-9370-1-10-20200513.pdf
chalachew5
 
An overview on data mining designed for imbalanced datasets
eSAT Publishing House
 
An overview on data mining designed for imbalanced datasets
eSAT Journals
 
PROBABILITY BASED CLUSTER EXPANSION OVERSAMPLING TECHNIQUE FOR IMBALANCED DATA
cscpconf
 
Feature selection with imbalanced data in agriculture
Aboul Ella Hassanien
 
Under-sampling technique for imbalanced data using minimum sum of euclidean d...
IAESIJAI
 
Oversampling technique in student performance classification from engineering...
IJECEIAES
 
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...
IJMIT JOURNAL
 
PREDICTING CLASS-IMBALANCED BUSINESS RISK USING RESAMPLING, REGULARIZATION, A...
IJMIT JOURNAL
 
IRJET- Evaluation of Classification Algorithms with Solutions to Class Imbala...
IRJET Journal
 
A SURVEY OF METHODS FOR HANDLING DISK DATA IMBALANCE
IJCI JOURNAL
 
COMP_GroupA2.pptx
shivrajdeshmukh22
 
An advance extended binomial GLMBoost ensemble method with synthetic minorit...
IJECEIAES
 
Class imbalance problem1
chs71
 
Borderline Smote
Trector Rancor
 
TRENDS IN FINANCIAL RISK MANAGEMENT SYSTEMS IN 2020
IJMIT JOURNAL
 
Ad

More from ijtsrd (20)

PDF
A Study of School Dropout in Rural Districts of Darjeeling and Its Causes
ijtsrd
 
PDF
Pre extension Demonstration and Evaluation of Soybean Technologies in Fedis D...
ijtsrd
 
PDF
Pre extension Demonstration and Evaluation of Potato Technologies in Selected...
ijtsrd
 
PDF
Pre extension Demonstration and Evaluation of Animal Drawn Potato Digger in S...
ijtsrd
 
PDF
Pre extension Demonstration and Evaluation of Drought Tolerant and Early Matu...
ijtsrd
 
PDF
Pre extension Demonstration and Evaluation of Double Cropping Practice Legume...
ijtsrd
 
PDF
Pre extension Demonstration and Evaluation of Common Bean Technology in Low L...
ijtsrd
 
PDF
Enhancing Image Quality in Compression and Fading Channels A Wavelet Based Ap...
ijtsrd
 
PDF
Manpower Training and Employee Performance in Mellienium Ltdawka, Anambra State
ijtsrd
 
PDF
A Statistical Analysis on the Growth Rate of Selected Sectors of Nigerian Eco...
ijtsrd
 
PDF
Automatic Accident Detection and Emergency Alert System using IoT
ijtsrd
 
PDF
Corporate Social Responsibility Dimensions and Corporate Image of Selected Up...
ijtsrd
 
PDF
The Role of Media in Tribal Health and Educational Progress of Odisha
ijtsrd
 
PDF
Advancements and Future Trends in Advanced Quantum Algorithms A Prompt Scienc...
ijtsrd
 
PDF
A Study on Seismic Analysis of High Rise Building with Mass Irregularities, T...
ijtsrd
 
PDF
Descriptive Study to Assess the Knowledge of B.Sc. Interns Regarding Biomedic...
ijtsrd
 
PDF
Performance of Grid Connected Solar PV Power Plant at Clear Sky Day
ijtsrd
 
PDF
Vitiligo Treated Homoeopathically A Case Report
ijtsrd
 
PDF
Vitiligo Treated Homoeopathically A Case Report
ijtsrd
 
PDF
Uterine Fibroids Homoeopathic Perspectives
ijtsrd
 
A Study of School Dropout in Rural Districts of Darjeeling and Its Causes
ijtsrd
 
Pre extension Demonstration and Evaluation of Soybean Technologies in Fedis D...
ijtsrd
 
Pre extension Demonstration and Evaluation of Potato Technologies in Selected...
ijtsrd
 
Pre extension Demonstration and Evaluation of Animal Drawn Potato Digger in S...
ijtsrd
 
Pre extension Demonstration and Evaluation of Drought Tolerant and Early Matu...
ijtsrd
 
Pre extension Demonstration and Evaluation of Double Cropping Practice Legume...
ijtsrd
 
Pre extension Demonstration and Evaluation of Common Bean Technology in Low L...
ijtsrd
 
Enhancing Image Quality in Compression and Fading Channels A Wavelet Based Ap...
ijtsrd
 
Manpower Training and Employee Performance in Mellienium Ltdawka, Anambra State
ijtsrd
 
A Statistical Analysis on the Growth Rate of Selected Sectors of Nigerian Eco...
ijtsrd
 
Automatic Accident Detection and Emergency Alert System using IoT
ijtsrd
 
Corporate Social Responsibility Dimensions and Corporate Image of Selected Up...
ijtsrd
 
The Role of Media in Tribal Health and Educational Progress of Odisha
ijtsrd
 
Advancements and Future Trends in Advanced Quantum Algorithms A Prompt Scienc...
ijtsrd
 
A Study on Seismic Analysis of High Rise Building with Mass Irregularities, T...
ijtsrd
 
Descriptive Study to Assess the Knowledge of B.Sc. Interns Regarding Biomedic...
ijtsrd
 
Performance of Grid Connected Solar PV Power Plant at Clear Sky Day
ijtsrd
 
Vitiligo Treated Homoeopathically A Case Report
ijtsrd
 
Vitiligo Treated Homoeopathically A Case Report
ijtsrd
 
Uterine Fibroids Homoeopathic Perspectives
ijtsrd
 
Ad

Recently uploaded (20)

PPTX
Natural Language processing using nltk.pptx
Ramakrishna Reddy Bijjam
 
PDF
DIGESTION OF CARBOHYDRATES ,PROTEINS AND LIPIDS
raviralanaresh2
 
PDF
Learning Styles Inventory for Senior High School Students
Thelma Villaflores
 
PPTX
Aerobic and Anaerobic respiration and CPR.pptx
Olivier Rochester
 
PPTX
How to Setup Automatic Reordering Rule in Odoo 18 Inventory
Celine George
 
PDF
TLE 8 QUARTER 1 MODULE WEEK 1 MATATAG CURRICULUM
denniseraya1997
 
DOCX
Lesson 1 - Nature and Inquiry of Research
marvinnbustamante1
 
PPTX
Lesson 1 Cell (Structures, Functions, and Theory).pptx
marvinnbustamante1
 
PPTX
Parsing HTML read and write operations and OS Module.pptx
Ramakrishna Reddy Bijjam
 
PDF
Supply Chain Security A Comprehensive Approach 1st Edition Arthur G. Arway
rxgnika452
 
PDF
Indian National movement PPT by Simanchala Sarab, Covering The INC(Formation,...
Simanchala Sarab, BABed(ITEP Secondary stage) in History student at GNDU Amritsar
 
PDF
Lesson 1 : Science and the Art of Geography Ecosystem
marvinnbustamante1
 
PDF
Andreas Schleicher_Teaching Compass_Education 2040.pdf
EduSkills OECD
 
PDF
Cooperative wireless communications 1st Edition Yan Zhang
jsphyftmkb123
 
PDF
Free eBook ~100 Common English Proverbs (ebook) pdf.pdf
OH TEIK BIN
 
PPTX
Practice Gardens and Polytechnic Education: Utilizing Nature in 1950s’ Hu...
Lajos Somogyvári
 
PPTX
week 1-2.pptx yueojerjdeiwmwjsweuwikwswiewjrwiwkw
rebznelz
 
PPTX
SYMPATHOMIMETICS[ADRENERGIC AGONISTS] pptx
saip95568
 
PPTX
How to Manage Wins & Losses in Odoo 18 CRM
Celine George
 
PDF
Lesson 1 - Nature of Inquiry and Research.pdf
marvinnbustamante1
 
Natural Language processing using nltk.pptx
Ramakrishna Reddy Bijjam
 
DIGESTION OF CARBOHYDRATES ,PROTEINS AND LIPIDS
raviralanaresh2
 
Learning Styles Inventory for Senior High School Students
Thelma Villaflores
 
Aerobic and Anaerobic respiration and CPR.pptx
Olivier Rochester
 
How to Setup Automatic Reordering Rule in Odoo 18 Inventory
Celine George
 
TLE 8 QUARTER 1 MODULE WEEK 1 MATATAG CURRICULUM
denniseraya1997
 
Lesson 1 - Nature and Inquiry of Research
marvinnbustamante1
 
Lesson 1 Cell (Structures, Functions, and Theory).pptx
marvinnbustamante1
 
Parsing HTML read and write operations and OS Module.pptx
Ramakrishna Reddy Bijjam
 
Supply Chain Security A Comprehensive Approach 1st Edition Arthur G. Arway
rxgnika452
 
Indian National movement PPT by Simanchala Sarab, Covering The INC(Formation,...
Simanchala Sarab, BABed(ITEP Secondary stage) in History student at GNDU Amritsar
 
Lesson 1 : Science and the Art of Geography Ecosystem
marvinnbustamante1
 
Andreas Schleicher_Teaching Compass_Education 2040.pdf
EduSkills OECD
 
Cooperative wireless communications 1st Edition Yan Zhang
jsphyftmkb123
 
Free eBook ~100 Common English Proverbs (ebook) pdf.pdf
OH TEIK BIN
 
Practice Gardens and Polytechnic Education: Utilizing Nature in 1950s’ Hu...
Lajos Somogyvári
 
week 1-2.pptx yueojerjdeiwmwjsweuwikwswiewjrwiwkw
rebznelz
 
SYMPATHOMIMETICS[ADRENERGIC AGONISTS] pptx
saip95568
 
How to Manage Wins & Losses in Odoo 18 CRM
Celine George
 
Lesson 1 - Nature of Inquiry and Research.pdf
marvinnbustamante1
 

Analysis of Imbalanced Classification Algorithms A Perspective View

  • 1. International Journal of Trend in Scientific Research and Development (IJTSRD) Volume: 3 | Issue: 2 | Jan-Feb 2019 Available Online: www.ijtsrd.com e-ISSN: 2456 - 6470 @ IJTSRD | Unique Reference Paper ID – IJTSRD21574 | Volume – 3 | Issue – 2 | Jan-Feb 2019 Page: 974 Analysis of Imbalanced Classification Algorithms: A Perspective View Priyanka Singh1, Prof. Avinash Sharma2 1PG Scholar, 2Assistant Professor 1,2Department of CSE, MITS, Bhopal, Madhya Pradesh, India ABSTRACT Classification of data has become an important research area. The process of classifying documentsintopredefined categories Unbalanced data set, a problem often found in real world application, can cause seriously negative effect on classification performance of machine learning algorithms. There have been many attemptsatdealingwith classificationofunbalanceddata sets. In this paper we present a brief review of existing solutions to the class-imbalance problemproposed bothatthedata and algorithmic levels. Even though a common practice to handle the problem of imbalanced data is to rebalance them artificially by oversampling and/or under-sampling, some researchers proved that modified support vector machine, rough set based minority class oriented rule learning methods, cost sensitive classifier perform goodonimbalanceddataset. Weobserved that current research in imbalance data problem is moving to hybrid algorithms. Keywords: cost-sensitive learning, imbalanced data set, modified SVM, oversampling, undersampling I. INTRODUCTION A data set is called imbalanced if it contains many more samples from one class than from the rest of theclasses.Data sets are unbalanced when atleast one class is representedby only a small number of trainingexamples(calledtheminority class) while other classes make up the majority. In this scenario, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence that the larger majority class has on traditional training criteria. Most original classification algorithms pursue to minimize the error rate: the percentage of the incorrect prediction ofclasslabels.Theyignorethedifference between types of misclassification errors. In particular, they implicitly assume that all misclassification errors cost equally. In many real-world applications, this assumption is not true. Thedifferencesbetweendifferentmisclassificationerrorscan be quite large. For example, in medical diagnosis of a certain cancer, if the cancer is regarded as the positive class, and non-cancer (healthy) as negative, then missing a cancer (the patientis actually positive but is classified asnegative;thusit is also called ―false negative‖) is much more serious (thus expensive) than the false-positive error. The patient could lose his/her life because of the delay in the correct diagnosis and treatment. Similarly, if carryingabombispositive,thenit is much more expensive to miss a terrorist who carries a bomb to a flight than searching an innocent person. The unbalanced data set problem appearsinmanyrealworld applications like text categorization, fault detection, fraud detection, oil-spills detection in satellite images, toxicology, cultural modeling, medical diagnosis.[1] Many research papers on imbalanced data sets have commonly agreed that because of this unequal classdistribution,theperformanceof the existing classifiers tends to be biased towards the majority class. The reasons for poor performance of the existing classificationalgorithmsonimbalanceddatasetsare: 1. They are accuracy driven i.e.,theirgoalistominimizethe overall error to whichtheminorityclasscontributesvery little. 2. They assume that there is equal distribution of data for all the classes. 3. They also assume that the errors coming from different classes have the same cost[2]. With unbalanced data sets, data mining learning algorithms produce degenerated models that do not take into account the minority class as most data mining algorithms assume balanced data set. A number of solutions to the class-imbalance problem were previously proposed both at the data and algorithmic levels [3]. At the data level, these solutions include many different forms of re-sampling such as random oversampling with replacement, randomundersampling,directedoversampling (in which no new examples are created, but the choice of samples to replace is informed ratherthanrandom),directed undersampling (where, again, the choice of examples to eliminate is informed), oversampling with informed generation of new samples, and combinations of the above techniques. At the algorithmic level, solutions include adjusting the costs of the various classes so as to counter the class imbalance, adjusting the probabilistic estimate at the tree leaf (when working with decision trees), adjusting the decision threshold, and recognition-based(i.e.,learningfrom one class) rather than discrimination-based (two class) learning. The most common techniques to deal with unbalanced data include resizing training datasets, cost- sensitive classifier, and snowball method. Recently, several methods have been proposed with good performance on unbalanced data. These approachesincludemodifiedSVMs,k nearest neighbor (kNN), neural networks, genetic programming, rough set based algorithms, probabilistic decision tree and learning methods. The next sections focus on some of the method in detail. II. SAMPLING METHODS An easy Datalevel methodsfor balancing theclassesconsists of re-sampling the original data set, either by over- sampling the minority class or by under-sampling the majority class,
  • 2. International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470 @ IJTSRD | Unique Reference Paper ID – IJTSRD21574 | Volume – 3 | Issue – 2 | Jan-Feb 2019 Page: 975 until the classes are approximatelyequallyrepresented.Both strategies can be applied in any learning system, since they act as a preprocessingphase, allowingthe learningsystemto receive the training instances as if they belonged to a well- balanced data set. Thus, any bias of the system towards the majority class due tothedifferentproportionofexamplesper class would be expected to be suppressed. Hulse et al. [4] suggest that the utility of the re-sampling methods depends on a number of factors, including the ratio between positive and negative examples, other characteristics of data, and the nature of the classifier. However, re-sampling methods have shown important drawbacks. Under-samplingmaythrowoutpotentiallyuseful data, while over-sampling artificially increases thesizeofthe data set and consequently, worsens the computational burden of the learning algorithm. A. Oversampling The simplestmethod toincrease thesizeoftheminorityclass corresponds to random over-sampling, that is, a non- heuristicmethod thatbalances theclassdistributionthrough the random replication of positive examples. Nevertheless, sincethis methodreplicatesexistingexamplesintheminority class, over fitting is more likely to occur. Chawla proposed Synthetic Minority Over-sampling Technique (SMOTE) [5] an over-samplingapproachinwhich the minority class is over-sampled by creating synthetically examples rather than by over-sampling with replacement. The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomlychosen.FromtheoriginalSMOTEalgorithm,several modifications have been proposed in the literature. While SMOTE approach does not handle data sets with all nominal features, it was generalized to handle mixed datasets of continuousandnominalfeatures.ChawlaproposeSMOTE-NC (Synthetic Minority Over-sampling Technique Nominal Continuous) and SMOTE-N (Synthetic Minority Over- sampling Technique Nominal), the SMOTE can also be extended for nominal features. Andrew Estabrooks et al. proposed a multiple re- sampling method which selected the most appropriate re-sampling rate adaptively [6]. Taeho Jo et al. put forward a cluster- based over-samplingmethodwhichdealtwithbetween-class imbalance and within-class imbalance simultaneously [7]. Hongyu Guo et al. found out hard examples of the majority and minority classes dur-ing the process of boosting, then generated new synthetic examples from hard examples and add them to the data sets [8].Based on SMOTE method, Hui Han and Wen-Yuan Wang [9] presented two new minority over-samplingmethods,borderline-SMOTE1andborderline- SMOTE2, in which only the minority examples near the borderline are over- sampled. These approaches achieve better TP rate and F- value than SMOTE and random over- sampling methods. B. Undersampling Under-samplingisanefficientmethodforclassing-imbalance learning. This method uses a subset of the majority class to train the classifier. Since many majority class examples are ignored, the training set becomes more balanced and the training process becomes faster. The most common preprocessing technique israndommajorityunder-sampling (RUS), IN RUS, Instances of the majority class are randomly discarded from the dataset. However, the main drawback of under-sampling is that potentially useful information contained in these ignored examples is neglected. There many ways attempts to improve upon the performance of random sampling, such as Tomek links, Condensed Nearest Neighbor Rule and One- sided selection etc. one-sided selection (OSS) is proposed by Rule Kubat and Matwin attempts to intelligently under- sample the majority class by removing majority class examples that are considered either redundant or noisy.‘ Over-sampling is a method for improve minority class recognition, randomly duplicate the minority data not only without increase any category of a small number of new information, but also will lead to over-fitting. For some problems like fraud detection which is highly overlapped unbalanced data classification problem, where non-fraud samples heavily outnumber the fraud samples,T. Maruthi Padmaja[10]proposedhybridsamplingtechnique, a combination of SMOTE to over-sample the minority data (fraud samples) and random under- sampling to under- sample the majoritydata(non-fraudsamples)ifweeliminate extreme outliers from the minority samples for highly skewed imbalanced data sets like fraud detection classification accuracy can be improved. Sampling methods consider the class skew and properties of the dataset as a whole. However, machine learning and data mining often face nontrivial datasets, which often exhibit characteristics and properties at a local, rather than global level. It is noted that a classifier improved through global sampling levels may be insensitive to the peculiarities of different components or modalities in the data, resulting in a suboptimal performance. David A. Cieslak, Nitesh V. Chawla[11] has suggested that for improving classifier performance sampling can be treated locally, instead of applying uniform levels of sampling globally. They proposed a framework which first identifiesmeaningfulregionsofdata and then proceeds to find optimal sampling levels within each. There are known disadvantages associated with the use of sampling to implement cost-sensitive learning. The disadvantage with undersampling is that it discards potentially useful data. The main disadvantage with oversampling, from our perspective, is that by making exact copies of existing examples, it makes over fitting likely. In fact, with oversampling it is quite common for a learner to generate a classification rule to cover a single, replicated, example. A second disadvantage of oversampling is that it increases the number of training examples, thus increasing the learning time. Given the disadvantages with sampling, still sampling is a popular way to deal with imbalanced data rather than a cost- sensitive learning algorithm. There are several reasons for this. The most obvious reason is there are not cost- sensitive implementations of all learning algorithms and therefore a wrapper-based approach using sampling is the only option. While this is certainly less true today than in the past, many learning algorithms (e.g., C4.5) still do not directly handle costs in the learning process. A second reason for using
  • 3. International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470 @ IJTSRD | Unique Reference Paper ID – IJTSRD21574 | Volume – 3 | Issue – 2 | Jan-Feb 2019 Page: 976 sampling is thatmany highly skewed data sets are enormous and the size of the training set must be reduced in order for learning to be feasible. In this case, undersampling seems to be a reasonable, and valid, strategy. if one needs to discard some training data, it still might be beneficial to discard some of the majority class examples in order to reduce the training set size to the required size, and then alsoemploy a cost- sensitive learning algorithm, so that the amount of discarded training data is minimized. A final reason that may have contributed to the use of sampling rather than a cost-sensitive learning algorithm is that misclassification costs are often unknown. However, this is not a valid reason for using sampling over a cost-sensitive learning algorithm, since the analogous issue arises with sampling—what should the class distribution of the final training data be? If this cost information is not known, a measure such as the area under the ROC curve could be used to measure classifier performance and both approaches could then empiricallydeterminethepropercost ratio/class distribution [12]. III. COST-SENSITIVE LEARNING At the algorithmic level, solutions include adjusting the costs of the various classes so as to counter the class imbalance, adjusting the probabilistic estimate at the tree leaf (when working with decision trees), adjusting the decision threshold, and recognition-based (i.e., learning from one class) rather than discrimination-based (two class) learning. Cost-Sensitive Learning is a type of learning in data mining that takes the misclassification costs (and possibly other types of cost) into consideration. There are many ways to implement cost sensitive learning, in [13], it is categorized into three, the first class of techniquesapplymisclassification costs to the data set as a form of data space weighting, the second class applies cost-minimizing techniques to the combination schemesofensemblemethods,andthelastclass of techniques incorporates cost sensitive features directly into classification paradigms to essentially fit the cost sensitive framework into these classifiers. Incorporating costintodecision tree classification algorithm which is one of the most widely used and simple classifier. Cost can be incorporated into it in various ways. First way is cost can be applied to adjust the decision threshold, second way is cost can be used in splitting attribute selection during decision treeconstruction and theother way is costsensitive pruningschemes can be applied to the tree. Ref.[14]propose a method for building and testing decision trees that minimizes total sum of the misclassification and test costs. The algorithm used by them chooses an splitting attribute that minimizes the total cost, the sum of the test cost and the misclassification cost rather than choosing an attribute that minimizes the entropy. Information gain, Gini measures are considered to be skew sensitive [15]. In Ref. [16] a new decision tree algorithm called Class Confidence Proportion Decision Tree (CCPDT) is proposed which is robust and insensitive to size of classes and generates rules which are statistically significant. Ref. [17] analytically and empirically demonstrates the strong skew insensitivity of Hellinger Distance and its advantages over popularalternativemetrics. They arrived at a conclusion that for imbalanced data it is sufficient to use Hellinger trees with bagging without any sampling methods. Ref. [18] uses different operators of Genetic algorithms for oversampling to enlarge the ratio of positive samples and then apply clustering to the oversampled training data set as adata clearning method for both classes, removing the redundant or noisysamples.They used AUC as evaluation metricand foundthattheiralgorithm performed better. Nguyen ha vo, Yonggwan won[19] extended Regularized Least Square(RLS) algorithm that penalizes errors of different samples with different weights and some rules of thumb to determine those weights. The significantly better classification accuracy of weighted RLS classifiers showed that it is promising substitution of other previous cost- sensitiveclassification methodsfor unbalanceddataset.This approach is equivalent to up- sampling or down-sampling depending on the cost we choose. For example, doubling the cost-sensitivity of one class is said to be equivalent to doubling the number of samples in that class. Ref[20] proposed a novel approach reducing each within group error, BABoostthat is a variant of AdaBoost. Adaboost algorithm gives equal weight to each misclassified example. But the misclassification error of each class is not same. Generally, the misclassificationerroroftheminorityclasswill larger than themajority‘s. SoAdaboostalgorithm will lead to higher bias and smaller margin when encountering skew distribution. BABoost algorithm in each round of boosting assigns more weights to the misclassified examples, especially those in the minority class. Yanmin Sun a and Mohamed S. Kamel[21] explored three cost-sensitive boosting algorithms, which are developed by introducing cost items into the learning framework of AdaBoost. These boosting algorithms are also studied with respect to their weighting strategies towards different types of samples, and their effectiveness in identifying rare cases through experiments on several real worldmedicaldatasets, where the class imbalance problem prevails. IV. SVM AND IMBALANCED DATASETS The success of SVM is very limited when it is applied to the problem of learning from imbalanced datasets in which negative instances heavily outnumber the positiveinstances. Even though undersamplingthe majorityclass doesimprove SVM performance, there is an inherent loss of valuable information in this process. Rehan Akbani[22]combined sampling and cost sensitive learning for improving performance of SVM. Their algorithm is based on a variantof the SMOTE algorithm by Chawla et al, combined with Veropoulos et al‘s different error costs algorithm. TAO Xiao-yan[23] presented A modified proximal support vector machine (MPSVM) which assigns different penalty coefficients to the positive and negative samplesrespectively by adding a new diagonal matrix in the primal optimization problem. And further the decision function is obtained. The real-coded immune clone algorithm (RICA) is employed to select the global optimal parameters to get the high generalization performance. M. Muntean 1 and H. Vălean[24] provided the Enhancer, a viable algorithm for improving the SVM classification of unbalanced datasets. They improve the Cost-sensitive classification for Support Vector Machines, by multiplying in the training step the instances of the underrepresented classes.
  • 4. International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470 @ IJTSRD | Unique Reference Paper ID – IJTSRD21574 | Volume – 3 | Issue – 2 | Jan-Feb 2019 Page: 977 Yuchun Tang and nitesh chawla[25] also implemented and rigorouslyevaluated four SVM modelingtechniquesSVM can be effective if incorporate different ―rebalance‖heuristics in SVM modeling, including cost-sensitive learning, and over and under sampling. Geneticprogramming(GP)canevolvebiasedclassifierswhen data sets are unbalanced. The cost sensitive learning uses cost adjustment within the learningalgorithmtofactorinthe uneven distribution of class examples in the original (unmodified) unbalanced data set, during the training process. In GP, cost adjustment can be enforced by adapting the fitness function. Here, solutions with good classification accuracy on both classes are rewarded with better fitness, while those that are biased toward one class only are penalized with poor fitness. Common techniques include using fixed misclassification costs for minority and majority class examples [26], [27], or improved performance criteria such as the area under the receiver operating characteristic (ROC) curve (AUC) [28], in the fitness function. While these techniques have substantially improved minority class performances in evolved classifiers, they can incur both a tradeoff in majority class accuracy and, thus, a loss in overall classificationability, and long training times duetothe computationaloverheadin evaluating these improved fitness measures. In addition, these approaches can be problem specific, i.e., fitness functions are handcrafted for a particular problem domain only. V. HYBRID ALGORITHMS The EasyEnsemble classifierisanunder-samplingalgorithm, which independently samples several subsets from negative examples and one classifier is built for each subset. All generated classifiers arethencombinedfor thefinaldecision by using Adaboost. In imbalanced problems, some features are redundant and even irrelevant; these features will hurt thegeneralizationperformanceoflearningmachines.Feature selection, a process of choosing a subset of features from the original ones, isfrequentlyusedasapreprocessingtechnique in analysis of data. It has been proved effective in reducing dimensionality, improving mining efficiency, increasing mining accuracy and enhancing result comprehensibility. Ref[29] combined the feature selection method with Easy Ensemble in order to improve the accuracy. In ref[30] a hybrid algorithm based on random over- sampling, decision tree (DT), particle swarm optimization (PSO) and feature selection is proposed to classify unbalanced data. The proposed algorithm has the ability to select beneficial feature subsets, automatically adjust values of parameter and obtain the bestclassification accuracy. The zoo dataset is used to test the performance. From simulation results, the classification accuracy ofthisproposedalgorithm outperforms other existing methods Decision trees,supplementedwithsamplingtechniques,have proven to be an effectiveway to address the imbalanceddata problem. Despite their effectiveness, however, sampling methods add complexity and the need for parameter selection. To bypass these difficulties a new decision tree technique called Hellinger Distance Decision Trees (HDDT) which uses Hellinger distance as the splitting criterion is suggested in ref[17]. Theytook advantageofthestrongskew insensitivity of Hellinger distance and its advantages over popular alternatives such as entropy (gain ratio). For imbalanced data it is sufficient to use Hellinger trees with bagging without any sampling methods. VI. CONCLUSION This paper provides an overview of the classification of imbalanced data sets. At data level, sampling is the most common approach to deal with imbalanced data. Over- sampling clearly appears as better than under-sampling for local classifiers, whereas some under-sampling strategies outperform over-sampling when employing classifiers with global learning. Researchers proved that Hybrid sampling techniques can perform better than just oversampling or under sampling. At the algorithmic level, solutions include adjusting the costs of the various classes so as to counter the class imbalance, adjusting the probabilistic estimate at the tree leaf (when working with decision trees), adjusting the decision threshold, and recognition-based(i.e.,learningfrom one class) rather than discrimination-based (two class) learning. Solutions based on modified support vector machine, rough set based minority class oriented rule learning methods, cost sensitive classifier are also proposed to deal with unbalanced data. There areof coursemanyother worthwhile research possibilities that are not included here. DevelopingClassifierswhicharerobustandskew-insensitive or hybrid algorithms can be point of interest for the future research in imbalanced dataset. REFERENCE [1] Miho Ohsaki, Peng Wang, Kenji Matsuda, Shigeru Katagiri, Hideyuki Watanabe, and Anca Ralescu, “Confusion-matrix-based Kernel Logistic Regression for Imbalanced Data Classification”, IEEE Transactions on Knowledge and Data Engineering, 2017. [2] Alberto Fernández, Sara del Río, Nitesh V. Chawla, Francisco Herrera, “An insight into imbalanced Big Data classification: outcomes and challenges”,Springer journal of bigdata, 2017. [3] Vaibhav P. Vasani1, Rajendra D. Gawali, “Classification and performance evaluation using data mining algorithms”, International Journal of Innovative Research in Science, Engineering and Technology, 2014. [4] Kaile Su, Huijing Huang, Xindong Wu, Shichao Zhang, “Rough Sets for FeatureSelectionand Classification:An Overview with Applications”, International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-3, Issue-5, November 2014. [5] Senzhang Wang, Zhoujun Li, Wenhan Chao and Qinghua Cao, “Applying Adaptive Over-sampling Technique Based on Data Density and Cost-Sensitive SVM to Imbalanced Learning”,IEEE World Congresson Computational Intelligence June, 2012. [6] Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince and Francisco Herrera, “A Review on Ensembles for the Class Imbalance Problem: Bagging, Boosting, and Hybrid-Based Approaches”, IEEE Transactions on Systems, Man and Cybernetics— Part C: Applications and Reviews, Vol. 42, No. 4, July 2012. [7] Nada M. A. Al Salami, “Mining High Speed Data Streams”. UbiCC Journal, 2011.
  • 5. International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470 @ IJTSRD | Unique Reference Paper ID – IJTSRD21574 | Volume – 3 | Issue – 2 | Jan-Feb 2019 Page: 978 [8] Dian Palupi Rini, Siti Mariyam Shamsuddin and Siti Sophiyati, “Particle Swarm Optimization: Technique, System and Challenges”, International Journal of Computer Applications (0975 – 8887) Volume 14– No.1, January 2011. [9] Amit Saxena, Leeladhar Kumar Gavel, Madan Madhaw Shrivas, “Online Streaming Feature Selection”, 27th International Conference on Machine Learning, 2010. [10] Yuchun Tang, Member, Yan-Qing Zhang, Nitesh V. Chawla and Sven Krasser, “SVMs Modeling for Highly Imbalanced Classification”, IEEE Transaction on Systems, Man and Cybernetics,Vol.39, NO.1,Feb2009. [11] Haibo He and Edwardo A. Garcia, “Learning from Imbalanced Data”, IEEE Transactions on Knowledge and Data Engineering, September 2009. [12] Thair Nu Phyu, “Survey of Classification Techniques in Data Mining”, International Multi Conference of Engineers and Computer Scientists, IMECS 2009, March, 2009. [13] Haibo He, Yang Bai, Edwardo A. Garcia and Shutao Li, “ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning”, IEEE Transaction of Data Mining, 2009. [14] Swagatam Das, Ajith Abraham and Amit Konar, “Particle Swarm Optimization and Differential Evolution Algorithms: TechnicalAnalysis, Applications and Hybridization Perspectives”, Springer journal on knowledge engineering, 2008. [15] “A logical framework for identifyingqualityknowledge from different data sources”, International Conference on Decision Support Systems, 2006. [16] “Database classification for multi-database mining”, International Conferenceon DecisionSupportSystems, 2005. [17] Volker Roth, “Probabilistic Discriminative Kernel Classifiers for Multi-class Problems”, Springer-Verlag journal, 2001. [18] R. Chen, K. Sivakumar and H. Kargupta “Collective Mining of Bayesian Networks from Distributed Heterogeneous Data”, Kluwer Academic Publishers, 2001. [19] Shigeru Katagiri, Biing-Hwang Juang and Chin-HuiLee, “Pattern Recognition Using a Family of Design Algorithms Based Upon the Generalized Probabilistic Descent Method”, IEEE Journal of Data Minig, 1998. [20] I. Katakis, G. Tsoumakas, and I. Vlahavas. Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowledge and Information Systems, Pp 371–391, 2010. [21] J. Kolter and M. Maloof. Using additive expert ensembles to cope with concept drift. In Proc. ICML,Pp 449–456, 2005. [22] D. D. Lewis, Y. Yang, T. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, Pp 361–397, 2004. [23] X. Li, P. S. Yu, B. Liu, and S.-K. Ng. Positive unlabeled learning for data stream classification. In Proc.SDM,Pp 257–268, 2009. [24] M. M. Masud, Q. Chen, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. Classificationand novel classdetection of data streams in a dynamic feature space. In Proc. ECML PKDD, volume II, Pp 337–352, 2010. [25] P. Zhang, X. Zhu, J. Tan, and L. Guo, “Classifier and Cluster Ensembles for Mining Concept Drifting Data Streams,” Proc. 10th Int’l Conf. Data Mining, 2010. [26] X. Zhu, P. Zhang, X. Lin, and Y. Shi, “Active Learning from Stream Data Using Optimal Weight Classifier Ensemble,” IEEE Trans. Systems,Man, CyberneticsPart B, vol. 40, no. 6, Pp 1607- 1621, Dec. 2010. [27] Q. Zhang, J. Liu, and W. Wang, “Incremental Subspace Clustering over Multiple Data Streams,” Proc. Seventh Int’l Conf. Data Mining, 2007. [28] Q. Zhang, J. Liu, and W. Wang, “Approximate Clustering on Distributed Data Streams,” Proc. 24th Int’l Conf. Data Eng., 2008. [29] C. C. Aggarwal. On classification and segmentation of massive audio data streams. Knowl. and Info. Sys., Pp 137–156, July 2009. [30] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Trans. Knowl. Data Eng, Pp 577– 589, 2006. [31] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavald. New ensemble methods for evolving data streams. In Proc. SIGKDD, Pp 139–148, 2009. [32] S. Chen, H. Wang, S. Zhou, and P. Yu. Stop chasing trends: Discovering highorder modelsin evolvingdata. In Proc. ICDE, Pp 923–932, 2008. [33] P. Zhang, X. Zhu, and L. Guo. Mining data streams with labeled and unlabeled training examples. In Proc. ICDM, Pp 627–636, 2009. [34] O. R. Terrades, E. Valveny, and S. Tabbone, “Optimal classifier fusion in a non-Bayesian probabilistic framework,” IEEE Trans.Pattern Anal.Mach.Intell., vol. 31, no. 9, Pp 1630–1644, Sep. 2009.