Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
Vol. 5, No. 3, September 2017, pp. 275~284
ISSN: 2089-3272, DOI: 10.11591/ijeei.v5i3.299
Received March 25, 2017; Revised July 10, 2017; Accepted August 11, 2017
An Heterogeneous Population-Based Genetic
Algorithm for Data Clustering
Amina Bedboudi*1, Cherif Bouras2, Mohamed Tahar Kimour3
1,3 Department of Computer Science, University of Badji Mokhtar - Annaba, Algeria
2 Department of Mathematics, University of Badji Mokhtar - Annaba, Algeria
e-mail: bedboudi.amina@hotmail.fr
Abstract
As a primary data mining method for knowledge discovery, clustering is a technique for
partitioning a dataset into groups of similar objects. K-means, the most popular method for data
clustering, suffers from the drawback of requiring the number of clusters and their initial centers,
which must be provided by the user. In the literature, several methods have been proposed in the
form of k-means variants, genetic algorithms, or combinations of the two for calculating the
number of clusters and finding proper cluster centers. However, none of these solutions has
provided satisfactory results, and determining the number of clusters and the initial centers
remains the main challenge in clustering processes. In this paper we present an approach that
automatically generates these parameters to achieve optimal clusters, using a modified genetic
algorithm operating on varied individual structures and using a new crossover operator.
Experimental results show that our modified genetic algorithm is a more efficient alternative to
the existing approaches.
Keywords: data mining, clustering, K-means, genetic algorithms, crossover
1. Introduction
As one of the most important tasks of spatial data mining, cluster analysis has been
widely used in several domains, such as biology, system engineering and social sciences, in
order to identify natural groups in large amounts of data [1-3]. A clustering algorithm assigns a
large number of data points to a smaller number of groups (or clusters) such that data points in
the same group share the same properties (similar) while, in different groups, they are dissimilar
[4]. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics,
and numerical analysis [5]. Data mining adds to clustering the complications of very large
datasets with very many attributes of different types. This imposes unique computational
requirements on relevant clustering algorithms.
A variety of algorithms that meet these requirements have recently emerged and have been
successfully applied to real-life data mining problems. In particular, popular methods such as K-
means, genetic algorithms and their variants are the most notable ones for data clustering [1-9],
[14-21]. Their advantages lie in their simplicity, efficiency and ease of convergence; however, most of
them suffer from the drawbacks of requiring the number of clusters and their initial centers, as
well as from poor clustering quality [9].
In this paper we present an approach to automatically generate such parameters, while
improving the speed and accuracy of the algorithm. It is based on an appropriate data structure
and process that handles heterogeneous populations in a modified genetic algorithm that
operates on varied individual structures. Generally, using parameters such as the crossover and
mutation probabilities that adapt to the evolution of the algorithm is a good choice, since higher
diversity in the population can be achieved, preventing the algorithm from falling into local minima.
Moreover, to accelerate the genetic algorithm process and increase the individual
diversity of the initial population, we generate that initial population in two manners: the first
subpopulation is obtained in a deterministic way and the second one in a random way.
Increasing the population diversity allows the algorithm to achieve better clustering quality.
The rest of the article is structured as follows: Section 2 gives an analysis of two data
clustering techniques, k-means and genetic algorithms. Section 3 details our data clustering
approach by describing the proposed structure of chromosomes and the genetic operators. Section
4 presents related work and a comparison with our proposed approach. Section 5
describes the experimental results and discusses the strengths of our approach. Finally, we
draw conclusions and outline future work in Section 6.
2. Clustering
Clustering can be thought of as data partitioning or segmenting into homogeneous
groups. The clustering is usually accomplished by determining the similarity among the data on
predefined attributes. The most similar data are grouped into clusters [2-5], [9].
There is a close relationship between clustering techniques and many other disciplines.
Clustering has always been used in statistics and science [7]. Typical applications include
speech and character recognition. Viewed as a density estimation problem, clustering has also
been addressed with machine learning algorithms applied to image segmentation and computer
vision [10-13]. In the literature, various methods have been used to handle the clustering
problem. K-means [4-5], [17] is considered one of the major algorithms widely used in
clustering [1]. Genetic algorithms [6] are also used in clustering, either separately or combined
with the k-means algorithm [9].
2.1. K-means
K-means first randomly selects k objects, each of which initially represents a cluster
center. Each remaining object is then assigned to the nearest cluster according to its distance
from each cluster center. Afterward, each cluster center is replaced by the average value of its
respective cluster. This process is repeated until the k centers no longer change. The K-means
algorithm attempts to provide a partition of the given data such that the centroids of the resulting
classes minimize the following objective function [8-9], [13], [20]:
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - c_j \|^2
where x_i is a data point, C_j is the j-th cluster and c_j is its center.
The typical algorithm is as follows [1]:
1. Select k objects as initial centers;
2. Assign each data object to the closest center;
3. Recalculate the center of each cluster;
4. Repeat steps 2 and 3 until the centers no longer change.
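The four steps above can be sketched in Python as follows (an illustrative implementation on points represented as tuples, not the authors' code):

```python
import random

def kmeans(points, k, max_iter=100):
    """Minimal K-means sketch; `points` is a list of equal-length tuples."""
    centers = random.sample(points, k)  # step 1: pick k objects as initial centers
    clusters = []
    for _ in range(max_iter):
        # step 2: assign each data object to the closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # step 3: recalculate the center of each cluster as its mean
        new_centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:  # step 4: stop when the centers no longer change
            break
        centers = new_centers
    return centers, clusters
```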
Due to its practical properties, the K-means algorithm is considered the most popular
technique for clustering information. Its main advantages are: i) it is easy to implement; ii) the
comparison is conducted only between the observations and the centers of the classes; and iii) it
detects and isolates outliers.
However, the algorithm exhibits some drawbacks. In particular, it can be really
slow, since in each step the distance between each point and each cluster center has to be
calculated, which can be really expensive for a large dataset. Moreover, it is
sensitive to the initial cluster centers, which are provided by the user, as well as to the number k
of clusters.
2.2. Genetic algorithms
Genetic algorithms [6] are robust, stochastic optimization algorithms used to solve a
wide variety of problems. They were developed with the goal of better understanding natural
processes such as adaptation, and they belong to a class of search techniques that mimic the
principles of natural selection to develop solutions to large optimization problems. A genetic
algorithm finds the optimal value of a particular objective function depending on the problem to
be solved. The standard approach to an optimization problem begins by designing an objective
function that models the problem's objectives while incorporating any constraints.
A genetic algorithm consists of a chromosomal representation, an initial population, fitness
evaluation, selection, and reproduction (i.e., crossover and mutation). A GA operates by
maintaining and manipulating a population of potential solutions called chromosomes. Each
chromosome has an associated fitness value, which is a quantitative measure of the goodness of
the solution encoded in it.
In a genetic algorithm process, we first determine an initial solution, composed of a set
of chromosomes (the initial population), and iteratively apply the reproduction operators (selection,
crossover and mutation) until a certain quality parameter or a predefined number of
iterations is reached. A fitness function guides the stochastic selection of chromosomes, which are
then used to generate new candidate solutions through crossover and mutation.
The basic operations are therefore: i) crossover, which generates new chromosomes by
combining sections of two or more selected parents; and ii) mutation, which acts by randomly
selecting genes which are then altered, thereby preventing suboptimal solutions from persisting
and increasing diversity in the population.
There are three main types of selection methods: fitness-proportionate selection, ranking
selection, and tournament selection.
In tournament selection [14], individuals are selected randomly from the population and compete
on the basis of the fitness function. Tournaments are often held between pairs of individuals, although larger
tournaments can be used. A simple outline of a genetic algorithm is shown in Figure 1.
Figure 1. Outline of the genetic algorithm.
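Tournament selection as described above can be sketched as follows (an illustrative Python fragment; the function name and the `size` parameter for the tournament size are our choices):

```python
import random

def tournament_select(population, fitness, size=2):
    """Tournament selection: draw `size` individuals at random and return the fittest."""
    contestants = random.sample(population, size)
    return max(contestants, key=fitness)
```

With `size` equal to the population size the fittest individual always wins; smaller tournaments keep the selection pressure lower.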
Genetic algorithms are characterized by attributes such as the objective function, encoding
of the input data, crossover, mutation, and population size.
(1) Objective function. It is used to assign each individual in the population a fitness
value; an individual with a higher fitness value represents a better solution to the problem
than an individual with a lower fitness value;
(2) Encoding. Genetic algorithms operate on an encoding of the problem's input data
(which represents the independent variables of the objective function);
(3) Elitism. This is a way to ensure that the fittest chromosomes are not lost but are
copied into the new population. Elitism has been found to be very important to the
performance of genetic algorithms;
(4) Crossover. A procedure in which a highly fit chromosome is given an
opportunity to reproduce by exchanging pieces of its genetic information with other
highly fit chromosomes;
(5) Mutation. This is often applied after crossover by randomly altering some genes of
individual parents;
(6) Population size. The number of individuals in a population. The larger the
population size, the better the chance that an optimal solution will be found.
Genetic algorithms iterate a fixed number of times. Since the objective function's upper bound
(the maximum fitness value possible for an individual) may not be known or cannot be reached,
we must limit the number of generations in order to guarantee the termination of the search
process. This may result in a suboptimal solution. Moreover, combined approaches of genetic
algorithms with other clustering techniques still do not satisfy some user requirements,
especially regarding accuracy and execution time.
3. The Proposed Approach
The aim of our approach is to provide an efficient clustering process leading to enhanced
cluster quality and execution time, using a genetic algorithm based on varied structures
with a new crossover operator. The input of this modified genetic algorithm is a
dataset. The genetic process automatically generates the number of clusters together with their
initial centroids.
3.1. Chromosome representation
Chromosome representation, i.e., the encoding of the problem, is one of the most important
steps in using a genetic algorithm to solve a problem. To encode both the number of clusters and
their centers, we propose a chromosome structure that applies a real-coded genetic algorithm to
the clustering problem, where crossover and mutation operators are applied directly to real
parameter values.
The use of real parameter values in the GA representation has a number of advantages
over binary coding. The efficiency of the GA is increased, as there is no need to convert the
solution variables to binary; less memory is required; there is no loss of precision from
discretization to binary or other values; and there is greater freedom to use different genetic
operators. In this work, the existing population is processed in the form of several
subpopulations of different sizes. In other words, the number of genes differs between chromosomes:
a chromosome in a subpopulation may contain from 2 up to Kmax genes, where
Kmax is the maximum number of clusters. These genes hold the corresponding information about the
cluster centers. Here, each gene in a chromosome represents a cluster center, and the number K of
genes in a chromosome stands for the number of clusters. The value of K is assumed to lie in the
interval [2, Kmax]. Thus, each solution i is a string of cluster centers cij,
j = 1,…,K. Solution i of the population is then represented as a vector as follows:
ci1 … cij … ciKmax
Figure 2. An example of a chromosome as defined by our approach.
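Such a real-coded, variable-length chromosome can be modeled directly as a sorted list of centers; the following Python fragment is an illustrative sketch (the function name and the choice of sampling centers from the data points themselves are assumptions, not taken from the paper):

```python
import random

def random_chromosome(points, k_max):
    """A chromosome is a sorted list of K cluster centers, 2 <= K <= k_max.
    Centers are drawn from the data points (an assumption for this sketch;
    any real values would do in a real-coded GA)."""
    k = random.randint(2, k_max)
    return sorted(random.sample(points, k))
```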
3.2. Fitness function
The fitness function f is a measure of profit that we maximize during optimization. It is an
objective function that is used to summarize how close a given solution is to achieving the aims,
and it has an important effect on the success of a genetic algorithm. Fitness is proportional to the
utility or ability of the individual that the chromosome represents. The measure of fitness helps in
evolving good solutions and implementing natural selection. In this work, the fitness of a
chromosome is computed using the Mean Square Error (MSE) [18], which measures the
clustering error.
MSE = \frac{1}{n} \sum_{i=1}^{n} \min_{j} \| x_i - c_j \|^2 \qquad (1)
Fitness = 1 / MSE
After the definition of the fitness function, the different parameters of the genetic
algorithm operators are fixed.
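A minimal sketch, assuming MSE is the mean squared distance of each data point to its nearest cluster center and fitness is its reciprocal:

```python
def mse(points, centers):
    """Mean squared distance from each data point to its nearest cluster center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total / len(points)

def fitness(points, centers):
    """A lower MSE means a better clustering, so fitness is taken as 1/MSE."""
    return 1.0 / mse(points, centers)
```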
3.3. Generation of the Initial Population
To accelerate the genetic algorithm process and increase the individual diversity of the
initial population, we generate that initial population in two phases: a deterministic phase
followed by a random one, so that 20% of the initial population comes from the first phase and
80% from the second. In the deterministic phase, from the input dataset we produce a sorted
dataset. The sorted dataset is then divided into k equal segments, and the mean, modal, and
min-max values of each segment (see Algorithm 1) are determined and taken as initial cluster
centers. In doing so, we increase the diversity of individuals, leading to enhanced genetic
reproduction.
Algorithm 1: generate20%population
 Sort the dataset;
 For (k = 2; k <= Kmax - 1; k++) {
- Divide the sorted data into k equal segments {Sk};
- Modal step: take the modal value of each segment k as a center Ck;
- Means step: take the mean value of each segment k as a center Ck;
- Min-Max step:
For (i = 0; i < k; i++) {
Calculate c = (max - min)/2;
Consider c as the i-th center;
}
}
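A sketch of Algorithm 1 for one-dimensional data follows; approximating the "modal" value by the median and taking the min-max value as the segment midpoint (max + min) / 2 are our assumptions for continuous data, not the paper's definitions:

```python
import statistics

def deterministic_chromosomes(points, k_max):
    """Sketch of Algorithm 1 on 1-D data: sort the dataset, split it into k
    equal segments for each k, and take one candidate center per segment.
    Assumptions: 'modal' value approximated by the median (a mode is
    ill-defined for continuous data), min-max value taken as the segment
    midpoint; remainder points beyond k equal segments are ignored."""
    data = sorted(points)
    chromosomes = []
    for k in range(2, k_max + 1):
        size = len(data) // k
        segments = [data[i * size:(i + 1) * size] for i in range(k)]
        chromosomes.append([statistics.median(s) for s in segments])   # modal step
        chromosomes.append([statistics.mean(s) for s in segments])     # means step
        chromosomes.append([(max(s) + min(s)) / 2 for s in segments])  # min-max step
    return chromosomes
```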
3.4. Reproduction
Genetic algorithms aid the search for the best solution among a number of possible
solutions through reproduction. Reproduction can be implemented in an algorithmic form, based
on reproduction operators suited to the encoding scheme. The objective of the reproduction
operators is to ensure diversity in the population, such that the fittest solutions can be derived
through the evolutionary process. This evolution process is composed of crossover and
mutation operations.
During this process, invalid chromosomes may be produced. In the proposed
chromosome representation, repetition of genes produces invalid chromosomes, because a data
point cannot be the center of more than one cluster. To be able to detect the production of
invalid chromosomes, we sort the chromosome genes in ascending order. In doing so, we aim
to achieve compact chromosomes, fast detection of invalid chromosomes, and faster crossover and
mutation operations. Figure 3 illustrates our crossover process using a mask.
Figure 3. The crossover process using a mask
3.4.1. Crossover
It is a reproduction process in which children chromosomes are generated according to
the fitness function values of their parents. We define a crossover operator in such a way that it
can accept parent chromosomes with different numbers of genes. Crossover occurs with probability
crossp. For each pair of selected parents, crossover is applied, producing two offspring. To this
end, a binary mask is created whose length is equal to that of the shorter parent. In this mask, the digit
"1" occurs with a certain probability p.
C_j^{(1)} = C_j^{(1)} + \alpha C_j^{(2)} \qquad (2)
C_j^{(2)} = C_j^{(2)} + (1 - \alpha) C_j^{(1)} \qquad (3)
Figure 4 depicts the crossover formulas, with α a uniformly distributed random
number such that α ∈ [-1, 1]. Offspring O1 is created by applying formula (2) and offspring O2 is created
by applying formula (3). These formulas are applied to the genes of the parents at the
positions where the mask holds the digit "1". Also, we check the validity of
the resulting chromosomes to avoid redundant centers within one chromosome. In case of
redundancy, we replace the redundant gene with a non-redundant one extracted from a parent.
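The masked crossover on parents of different lengths can be sketched as below; the validity repair is simplified here to dropping duplicate genes rather than substituting a parent gene, as the paper prescribes:

```python
import random

def crossover(p1, p2, p=0.5):
    """Masked arithmetic crossover on two parent chromosomes (lists of 1-D
    centers), possibly of different lengths. The binary mask is as long as
    the shorter parent; where the mask holds 1, genes are recombined with a
    uniform alpha in [-1, 1] as in formulas (2) and (3)."""
    n = min(len(p1), len(p2))
    mask = [1 if random.random() < p else 0 for _ in range(n)]
    o1, o2 = list(p1), list(p2)
    for j in range(n):
        if mask[j]:
            alpha = random.uniform(-1.0, 1.0)
            o1[j] = p1[j] + alpha * p2[j]
            o2[j] = p2[j] + (1.0 - alpha) * p1[j]
    # validity repair, simplified: duplicate centers are dropped and the
    # chromosome re-sorted (the paper instead substitutes a parent gene)
    return sorted(set(o1)), sorted(set(o2))
```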
Algorithm 2: Clustering with GA
Input: data points, Kmax
Output: clusters
Procedure:
(1) Use Algorithm 1 to generate 20% of the initial population;
(2) Randomly generate the remaining 80% of the initial population;
(3) Calculate the fitness values of the chromosomes in the population;
(4) Select the two fittest chromosomes (P1, P2) and randomly choose a target one P3 from
the population;
(5) Apply the crossover and mutation operators on (P1, P2) to generate two offspring (O1,
O2);
(6) Calculate the fitness values of the offspring;
(7) Replace P1 and P2 with O1 and O2 if they are not better than these offspring;
(8) Repeat from step (3) while there is a significant change between generations and the
maximum number of iterations has not been reached;
(9) The final clusters are derived from the fittest chromosome using the one-step K-means
algorithm [21].
Figure 5. The main steps of the proposed genetic algorithm for clustering.
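A compact, self-contained sketch of this loop on one-dimensional data follows. Several simplifications are ours: the whole initial population is random (the deterministic 20% is omitted), mutation and the random target P3 are left out, fitness is the reciprocal of MSE, and a fitter offspring replaces the weakest individual rather than its parent:

```python
import random

def ga_cluster(points, k_max, pop_size=20, iters=50):
    """Simplified sketch of the clustering GA loop on 1-D data."""
    def fit(ch):
        m = sum(min((p - c) ** 2 for c in ch) for p in points) / len(points)
        return 1.0 / (m + 1e-12)                     # guard against MSE == 0
    pop = [sorted(random.sample(points, random.randint(2, k_max)))
           for _ in range(pop_size)]
    for _ in range(iters):
        pop.sort(key=fit, reverse=True)
        p1, p2 = pop[0], pop[1]                      # the two fittest parents
        n = min(len(p1), len(p2))
        a = random.uniform(-1.0, 1.0)
        o1 = sorted(set([p1[j] + a * p2[j] for j in range(n)] + list(p1[n:])))
        o2 = sorted(set([p2[j] + (1.0 - a) * p1[j] for j in range(n)] + list(p2[n:])))
        for o in (o1, o2):
            if 2 <= len(o) <= k_max and fit(o) > fit(pop[-1]):
                pop[-1] = o                          # replace the weakest
                pop.sort(key=fit, reverse=True)
    return max(pop, key=fit)
```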
3.4.2. Mutation
Mutation is intended to prevent all solutions in the population from falling into a local
optimum of the solved problem, or to enhance the obtained chromosome. Mutation takes place
with a lower probability than crossover. We define two types of mutation to take place
after the crossover. Topological mutation adds or deletes genes from an
offspring. Gene mutation modifies a selected gene of an offspring: it is applied by randomly
selecting a gene from an offspring and replacing it with a data point randomly extracted from the
dataset. Afterward, a validity check is performed to avoid redundant centers and to keep the
cluster number in [2, Kmax].
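Both mutation types can be sketched as follows (the probabilities and the repair loop are illustrative choices, not values from the paper):

```python
import random

def mutate(chromosome, points, k_max, p_topo=0.1, p_gene=0.1):
    """Sketch of the two mutation types: topological mutation adds or deletes
    a gene, keeping 2 <= K <= k_max; gene mutation replaces a randomly
    chosen gene with a random data point. A final validity check removes
    duplicate centers."""
    genes = list(chromosome)
    if random.random() < p_topo:
        if len(genes) < k_max and random.random() < 0.5:
            genes.append(random.choice(points))      # add a gene
        elif len(genes) > 2:
            genes.pop(random.randrange(len(genes)))  # delete a gene
    if random.random() < p_gene:
        genes[random.randrange(len(genes))] = random.choice(points)
    genes = sorted(set(genes))                       # drop redundant centers
    while len(genes) < 2:                            # repair if too few remain
        genes = sorted(set(genes + [random.choice(points)]))
    return genes
```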
Our overall genetic process organizes these operations as depicted in Figure 5. To find the closest
cluster to every point, we apply a k-means operator (KMO), which performs one step of the classical
k-means algorithm. In other words, we assign data points to their clusters for each new
chromosome using the KMO operator, in order to compute its fitness.
4. Related works
In the k-means algorithm, the most important challenge is determining the
number of clusters and their initial centers. Randomly determining initial centers does not
guarantee good clustering quality, since each run of the K-means algorithm on
the same dataset may result in different clusters. Besides, the process may converge to
suboptimal partitions.
In [3], the author proposed to sort the dataset according to the distances of the data points
from each other, and then divide the sorted data into k equal segments. He takes the first data item of
each segment as the corresponding cluster center. In [12], the authors proposed a solution based
on a statistical test, with the assumption that a subset of the data follows a Gaussian distribution.
The proposed process begins with a small number k of clusters, and then increments k at each
step based on an appropriate condition.
In [17-21], genetic algorithm-based solutions are proposed. They maintain a population
of coded solutions and apply a one-step k-means operator to calculate the fitness of a
chromosome. In [13], the authors proposed to use a genetic algorithm to determine the initial
values of the clusters. They use the k-means algorithm to evaluate the fitness of each candidate center.
At the final iteration, the fittest chromosome provides the initial cluster centers for k-means. As the
k-means algorithm must be executed many times, the approach is usually computationally
expensive.
These above-mentioned methods share the common drawback of searching for initial
centers while fixing a given number of clusters. Their aim was to find results in fewer
steps, but because of intensively calculating neighborhood distances between data points and
rearranging them, they would be very slow in partitioning large datasets. When searching for the
initial centroids, they evaluate the concerned criteria separately for different values of k, leading
them to execute the K-means algorithm for different numbers of clusters. Such a process is usually very
time-consuming.
5. Experimental results
To evaluate our approach and test its efficiency, we compare it with k-means and with the
closest work to ours, [9], using two datasets selected from the UCI machine learning
repository [22], namely the Iris and Lymphoma datasets. It is worth noting that the first difference between
our approach and the two above-mentioned methods is that we automatically determine the
number k of clusters and their initial centroids. In the following, we show that our approach
outperforms at least these two methods on two important parameters, namely the
average error and the average execution time. The experiments have been run using the parameters of
Table 1.
Table 1. Parameters of the genetic algorithm
Parameter                          Value
Maximum number of clusters, Kmax   15
Population size, P                 80
Crossover probability              0.6
Mutation probability               0.3
Maximum number of iterations       100
In the Iris dataset, each data point has four feature values, which represent the length
and the width of the sepal and the petal, in centimeters. It has three classes with 50 samples per class.
The value of k is therefore chosen to be three (clusters).
In the Lymphoma dataset, we find 62 samples consisting of 4026 genes spanning three
classes: 42 Diffuse Large B-Cell Lymphoma samples, nine Follicular Lymphoma
samples, and 11 B-cell Chronic Lymphocytic Leukemia samples.
This dataset is to be partitioned into three clusters. Each dataset was used with each
method 10 times, and we then determined the average time and error, as reported in Table
2, which shows the efficiency of the proposed method over K-means and the work
of [9], which is closest to ours.
Table 2. Evaluation results using UCI data sets
Dataset    Parameter   k-means   [9]     Proposed approach
Iris       Avg Error   36.53     29.59   19.59
           Avg Time    18.34     10.56   9.34
Lymphoma   Avg Error   44.12     38.49   22.15
           Avg Time    16.15     10.25   8.28
The experiments measured the average error and average execution time of the three
methods. For each dataset, the average error and average time are listed to show the trade-off
between them. From Table 2 it can be observed that our method outperforms k-means and its
variants, especially the work in [9], by improving both the average error and the average
execution time.
Because our proposed population model consists of chromosomes with varied
structures and uses improved genetic operators for reproduction, the offspring inherit
appropriate information from their parents, leading to an enhanced solution for the
problem at hand. Moreover, the deterministic generation of an important portion of the initial
population brings high diversity to the genetic reproduction process, leading to enhanced
quality of the results.
6. Conclusion
In this paper we have introduced a variable-structure genetic algorithm applied
to the data clustering problem without requiring the number of clusters or their initial centers. The
cluster centroids are calculated by a genetic algorithm with a heterogeneous population, leading to
better results than using random numbers. Moreover, our approach accelerates the
search process by reducing the average execution time while obtaining a better quality of data
partitions. As future work, we plan to investigate improved fitness functions and other
reproduction operators to achieve better performance.
References
[1] Mecca GS, Raunich A, Pappalardo. New Algorithm For Clustering Search Results. Data And
Knowledge Engineering. 2007; 62: 504-522.
[2] H He, Y Tan. A Two-Stage Genetic Algorithm for Automatic Clustering. Neuro-Computing. 2012; 81:
49-59.
[3] C Raposo, C Antunes, J Barreto. Automatic Clustering Using A Genetic Algorithm With New Solution
Encoding And Operators. ICCSA (2). Lecture Notes In Computer Science, Springer. 2004; 8580: 92-
103.
[4] K Jain, MN Murty, PJ Flynn. Data Clustering: A Review, ACM Comput. Surv. 1999; 31(3): 264–323.
[5] S García, J Luengo, F Herrera. Data Preprocessing in Data Mining. Cham: Springer. 2015: 327.
[6] SN Sivanandam, SN Deepa. Introduction To Genetic Algorithms. Berlin Heidelberg New York:
Springer. 2008.
[7] K Fukunaga. Introduction To Statistical Pattern Recognition. New York Academic. 1990.
[8] KB Sawant. Efficient Determination of Clusters in K-Mean Algorithm Using Neighborhood Distance.
The International Journal of Emerging Engineering Research and Technology. 2015; 3(1): 22-27.
[9] V Chittu, N Sumathi. A Modified Genetic Algorithm Initializing K-Means Clustering. Global Journal of
Computer Science And Technology. 2011; 11(2).
[10] K Fukunaga. Introduction To Statistical Pattern Recognition. New York: Academic. 1990.
[11] DW Scott. Multivariate Density Estimation. Wiley, New York. 1992.
[12] G Hamerly, C Elkan. Learning the k in k-means. Advances in Neural Information Processing
Systems. 2004; 17: 281–288.
[13] G Karegowda, VT Shama, MA Jayaram, AS Manjunath. Improving Performance of K-Means
Clustering by Initializing Cluster Centers Using Genetic Algorithm and Entropy Based Fuzzy
Clustering for Categorization of Diabetic Patients. In Proceedings of International Conference on
Advances in Computing. MSRIT, Bangalore: Springer India. 2013; 899-904.
[14] M Ettaouil, E Abdelatif, F Harchli. Improving The Performance of K-Means Algorithm Using An
Automatic Choice of Suitable Code Vectors And Optimal Number of Clusters. Journal of Theoretical
and Applied Information Technology. 2013; 56.
[15] Yugal Kumar, G Sahoo. A New Initialization Method To Originate Initial Cluster Centers For K-Means
Algorithm. International Journal of Advanced Science And Technology. 2014; 62: 43-54.
[16] SK Nandha Kumar, P Renuga. Reactive Power Planning Using Real GA Comparison With
Evolutionary Programming. Intl Journal of Recent Trends In Engineering. 2009; 1(3).
[17] K Krishna, MN Murty. Genetic K-means algorithm. IEEE Transactions on Systems, Man, and
Cybernetics. 1999; 29(3): 433-439.
[18] G Delavar, GH Mohebpour. A New Genetic Center Based Data Clustering Algorithm Based on K-
Means. International Journal of Mechatronics, Electrical And Computer Technology. 2014; 4(13):
1820-1839.
[19] K Valarmathi, D Devaraj, TK Radhakrishnan. Real-Coded Genetic Algorithm For System
Identification and Controller Tuning. Applied Mathematical Modelling. 2009; 33: 3392–3401.
[20] DE Goldberg. Genetic Algorithms In Search Optimization And Machine Learning, Addison Wesley.
1989.
[21] M Najmah, M Ghasemzadeh, Ch Meinel. A Variant of Genetic Algorithms For Non-Homogeneous
Population. Intl Conf. on Applied Mathematics, Computational Science and Systems Engineering.
Rome, Italy. 2016.
[22] A Asuncion, DJ Newman. UCI Machine Learning Repository. Irvine, CA: Dept. of Information and
Computer Science, University of California. http://www.ics.uci.edu/~mlearn/MLRepository.html. 2016.
A comparative study of clustering and biclustering of microarray data
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...
An Analysis On Clustering Algorithms In Data Mining
Enhanced Genetic Algorithm with K-Means for the Clustering Problem
Introduction to Multi-Objective Clustering Ensemble
Web Based Fuzzy Clustering Analysis
Gene Selection for Sample Classification in Microarray: Clustering Based Method
Ensemble based Distributed K-Modes Clustering
Mine Blood Donors Information through Improved K-Means Clustering
Comparison Between Clustering Algorithms for Microarray Data Analysis
Clustering in Aggregated User Profiles across Multiple Social Networks
Ad

More from ijeei-iaes (20)

PDF
Development of a Wireless Sensors Network for Greenhouse Monitoring and Control
PDF
Analysis of Genetic Algorithm for Effective power Delivery and with Best Upsurge
PDF
Design for Postplacement Mousing based on GSM in Long-Distance
PDF
Investigation of TTMC-SVPWM Strategies for Diode Clamped and Cascaded H-bridg...
PDF
Optimal Power Flow with Reactive Power Compensation for Cost And Loss Minimiz...
PDF
Mitigation of Power Quality Problems Using Custom Power Devices: A Review
PDF
Comparison of Dynamic Stability Response of A SMIB with PI and Fuzzy Controll...
PDF
Embellished Particle Swarm Optimization Algorithm for Solving Reactive Power ...
PDF
Intelligent Management on the Home Consumers with Zero Energy Consumption
PDF
Analysing Transportation Data with Open Source Big Data Analytic Tools
PDF
A Pattern Classification Based approach for Blur Classification
PDF
Computing Some Degree-Based Topological Indices of Graphene
PDF
A Lyapunov Based Approach to Enchance Wind Turbine Stability
PDF
Fuzzy Control of a Large Crane Structure
PDF
Site Diversity Technique Application on Rain Attenuation for Lagos
PDF
Impact of Next Generation Cognitive Radio Network on the Wireless Green Eco s...
PDF
Music Recommendation System with User-based and Item-based Collaborative Filt...
PDF
A Real-Time Implementation of Moving Object Action Recognition System Based o...
PDF
Wireless Sensor Network for Radiation Detection
PDF
Charge Sharing Suppression in Single Photon Processing Pixel Array
Development of a Wireless Sensors Network for Greenhouse Monitoring and Control
Analysis of Genetic Algorithm for Effective power Delivery and with Best Upsurge
Design for Postplacement Mousing based on GSM in Long-Distance
Investigation of TTMC-SVPWM Strategies for Diode Clamped and Cascaded H-bridg...
Optimal Power Flow with Reactive Power Compensation for Cost And Loss Minimiz...
Mitigation of Power Quality Problems Using Custom Power Devices: A Review
Comparison of Dynamic Stability Response of A SMIB with PI and Fuzzy Controll...
Embellished Particle Swarm Optimization Algorithm for Solving Reactive Power ...
Intelligent Management on the Home Consumers with Zero Energy Consumption
Analysing Transportation Data with Open Source Big Data Analytic Tools
A Pattern Classification Based approach for Blur Classification
Computing Some Degree-Based Topological Indices of Graphene
A Lyapunov Based Approach to Enchance Wind Turbine Stability
Fuzzy Control of a Large Crane Structure
Site Diversity Technique Application on Rain Attenuation for Lagos
Impact of Next Generation Cognitive Radio Network on the Wireless Green Eco s...
Music Recommendation System with User-based and Item-based Collaborative Filt...
A Real-Time Implementation of Moving Object Action Recognition System Based o...
Wireless Sensor Network for Radiation Detection
Charge Sharing Suppression in Single Photon Processing Pixel Array

Recently uploaded (20)

PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
DOCX
573137875-Attendance-Management-System-original
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPT
Project quality management in manufacturing
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPTX
Safety Seminar civil to be ensured for safe working.
PPTX
UNIT 4 Total Quality Management .pptx
PPT
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
PDF
composite construction of structures.pdf
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
Sustainable Sites - Green Building Construction
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PDF
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
CYBER-CRIMES AND SECURITY A guide to understanding
573137875-Attendance-Management-System-original
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Project quality management in manufacturing
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Safety Seminar civil to be ensured for safe working.
UNIT 4 Total Quality Management .pptx
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
composite construction of structures.pdf
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Sustainable Sites - Green Building Construction
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Embodied AI: Ushering in the Next Era of Intelligent Systems
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...

A clustering algorithm assigns a large number of data points to a smaller number of groups (or clusters) such that data points in the same group share similar properties, while data points in different groups are dissimilar [4]. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis [5]. Data mining adds to clustering the complications of very large datasets with very many attributes of different types, which imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and have been successfully applied to real-life data mining problems. In particular, popular methods such as k-means, genetic algorithms, and their variants are the most notable ones for data clustering [1-9], [14-21]. Their advantages lie in their simplicity, efficiency, and ease of convergence; however, most of them suffer from the drawbacks of requiring the number of clusters and their initial centers, as well as from poor clustering quality [9].

In this paper we present an approach to automatically generate these parameters while improving the speed and accuracy of the algorithm. It is based on an appropriate data structure and process that handles heterogeneous populations in a modified genetic algorithm operating on varied individual structures. Generally, using parameters such as crossover and mutation probabilities that adapt to the evolution of the algorithm is a good choice, since higher diversity in the population can be achieved, preventing the algorithm from falling into local minima. Moreover, to accelerate the genetic algorithm process and increase the individual diversity of the initial population, we generate that initial population in two manners: the first subpopulation is obtained in a deterministic way and the second one in a random way. Increasing the population diversity allows the algorithm to achieve better quality.
The rest of the article is structured as follows. Section 2 analyzes two data clustering techniques: k-means and genetic algorithms. Section 3 details our data clustering approach by describing the proposed chromosome structure and genetic operators. Section 4 presents related works and a comparison with our proposed approach. Section 5 describes the experimental results and discusses the strengths of our approach. Finally, we draw conclusions and outline future work in Section 6.

2. Clustering
Clustering can be thought of as partitioning or segmenting data into homogeneous groups. Clustering is usually accomplished by determining the similarity among the data on predefined attributes; the most similar data are grouped into clusters [2-5], [9]. There is a close relationship between clustering techniques and many other disciplines. Clustering has always been used in statistics and science [7]. Typical applications include speech and character recognition. Viewed as a density estimation problem, machine learning clustering algorithms have been applied to image segmentation and computer vision [10-13]. In the literature, various methods have been used to handle the clustering problem. K-means [4-5], [17] is considered one of the major algorithms widely used in clustering [1]. Genetic algorithms [6] are also used in clustering, either separately or combined with the k-means algorithm [9].

2.1. K-means
K-means first randomly selects k objects; each object initially represents a cluster center. Each remaining object is assigned to the nearest cluster according to its distance from each cluster center. Afterward, each cluster center is replaced by the average value of the respective cluster. This process is repeated until the k centers no longer change. The k-means algorithm attempts to provide a partition of the given data such that the centroids of the resulting classes minimize the following objective function [8-9], [13], [20]:

J = ∑_{j=1}^{K} ∑_{x_i ∈ C_j} ‖x_i − c_j‖²

where c_j is the center of cluster C_j. The typical algorithm is as follows [1]:
1. Select k objects as initial centers;
2. Assign each data object to the closest center;
3. Recalculate the center of each cluster;
4. Repeat steps 2 and 3 until the centers do not change.

Due to its practical properties, the k-means algorithm is considered the most popular technique for clustering information. Its main advantages are: i) it is easy to implement, ii) the comparison is conducted only between the observations and the centers of the classes, and iii) it detects and isolates outliers. However, the algorithm exhibits some drawbacks, notably that it can be slow, since at each step the distance between each point and each cluster center must be calculated, which can be expensive for a large dataset. Moreover, it is sensitive to the initial cluster centers provided by the user, as well as to the number of clusters k.
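The four steps above can be sketched in a few lines of Python. This is a minimal illustration under our own naming (the function `kmeans` and the toy data are not from the paper):

```python
import random

def kmeans(points, k, max_iter=100):
    """Plain k-means: pick k initial centers at random, then alternate
    assignment and center-update steps until the centers stabilize."""
    centers = random.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # centers unchanged: converged
            break
        centers = new_centers
    return centers, clusters
```

Note that the random initialization in the first line is exactly the sensitivity the paper criticizes: different runs may converge to different partitions.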
2.2. Genetic algorithms
Genetic algorithms [6] are robust, stochastic optimization algorithms used to solve a wide variety of problems. They were developed with the goal of better understanding natural processes such as adaptation, and they belong to a class of search techniques that mimic the principles of natural selection to develop solutions to large optimization problems. A genetic algorithm finds the optimal value of a particular objective function, depending on the problem to be solved. The standard approach to an optimization problem begins by designing an objective function that models the problem's objectives while incorporating any constraints. A genetic algorithm consists of: chromosomal representation, initial population, fitness evaluation, selection, and reproduction (i.e., crossover and mutation). A GA operates by maintaining and manipulating a population of potential solutions called chromosomes. Each chromosome has an associated fitness value, which is a qualitative measure of the goodness of the solution encoded in it. In a genetic algorithm process, we first determine an initial solution, composed of a set of chromosomes (the initial population), and iteratively apply reproduction operators (selection, crossover, and mutation) until achieving a certain quality parameter or a predefined number of iterations. A fitness function guides the stochastic selection of chromosomes, which are then used to generate new candidate solutions through crossover and mutation. The basic operations are therefore: i) crossover, which generates new chromosomes by combining sections of two or more selected parents, and ii) mutation, which acts by randomly selecting genes that are then altered, thereby preventing suboptimal solutions from persisting and increasing diversity in the population.
There are three main types of selection methods: fitness proportionate selection, the ranking method, and tournament selection. In tournament selection [14], individuals are selected randomly from the population, based on the fitness function. Tournaments are often held between pairs of individuals, although larger tournaments can be used. A simple outline of a genetic algorithm is shown in Figure 1.

Figure 1. Outline of the genetic algorithm.
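Tournament selection can be sketched as follows (an illustrative helper under our own naming, not the paper's code):

```python
import random

def tournament_select(population, fitness, size=2):
    """Tournament selection: draw `size` individuals uniformly at random
    and return the one with the highest fitness. size=2 gives the common
    pairwise tournament; larger sizes increase selection pressure."""
    contenders = random.sample(population, size)
    return max(contenders, key=fitness)
```

With `size` equal to the whole population, the tournament degenerates into picking the single fittest individual, which illustrates how the tournament size tunes selection pressure.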
Genetic algorithms are characterized by attributes such as the objective function, the encoding of the input data, crossover, mutation, and population size.
(1) Objective function. It is used to assign each individual in the population a fitness value; an individual with a higher fitness represents a better solution to the problem than an individual with a lower fitness value.
(2) Encoding. Genetic algorithms operate on an encoding of the problem's input data (which represent independent variables for the objective function).
(3) Elitism. This is a way to ensure that highly fit chromosomes are not lost but are copied to the new population. Elitism has been found to be very important to the performance of genetic algorithms.
(4) Crossover. A procedure in which a highly fit chromosome is given an opportunity to reproduce by exchanging pieces of its genetic information with other highly fit chromosomes.
(5) Mutation. This is often applied after crossover by randomly altering some genes of individual parents.
(6) Population size. The number of individuals in a population. The larger the population size, the better the chance that an optimal solution will be found.

Genetic algorithms iterate a fixed number of times. Since the function's upper bound (the maximum fitness value possible for an individual) may not be known or cannot be reached, we must limit the number of generations in order to guarantee the termination of the search process. This may result in a suboptimal solution. Moreover, combined approaches of genetic algorithms with other clustering techniques still do not satisfy some user requirements, especially regarding accuracy and execution time.

3. The Proposed Approach
The aim of our approach is to provide an efficient clustering process leading to enhanced cluster quality and execution time, using a varied structure-based genetic algorithm with a new crossover operator. The input of this modified genetic algorithm process is a dataset. The genetic process automatically generates the number of clusters together with their initial centroids.

3.1. Chromosome representation
Chromosome representation is a problem encoding and one of the most important steps in using a genetic algorithm to solve a problem. To encode both the number of clusters and their centers, we propose a chromosome structure that applies a real-coded genetic algorithm to the clustering problem, where crossover and mutation operators are applied directly to real parameter values. The use of real parameter values in the GA representation has a number of advantages over binary coding: the efficiency of the GA is increased as there is no need to convert the solution variables to the binary type, less memory is required, there is no loss of precision by discretization to binary or other values, and there is greater freedom to use different genetic operators. In this work, the population is processed in the form of several subpopulations of different sizes. In other words, the number of genes varies across chromosomes: a chromosome in a subpopulation may contain from 2 genes up to Kmax genes, where Kmax is the maximum number of clusters. These genes hold the corresponding information about the cluster centers. Here, each gene in a chromosome represents a cluster center, and the number K of genes in a chromosome stands for the number of clusters. The value of K is assumed to lie in the interval [2, Kmax]. Thus, each solution i is a string represented by cluster centers cij, j = 2,…,Kmax. Solution i of the population is then represented via a vector as follows:
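The variable-length encoding described above can be sketched in Python. This is an illustrative reading, not the paper's code: a chromosome is a list of 2 to Kmax genes, each gene a distinct data point serving as a candidate center.

```python
import random

def random_chromosome(points, k_max):
    """A chromosome is a variable-length list of cluster centers:
    K genes (2 <= K <= k_max), each gene a distinct data point
    used as a candidate center. This models the random 80% of
    the initial population."""
    k = random.randint(2, k_max)
    return random.sample(points, k)  # distinct genes: no repeated centers
```

Using `random.sample` (rather than repeated `random.choice`) guarantees the validity constraint stated later: a data point cannot be the center of more than one cluster within the same chromosome.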
ci1 … cij … ciKmax

Figure 2. An example of a chromosome as defined by our approach.

3.2. Fitness function
The fitness function f is a measure of the profit we seek during optimization. It is an objective function used to summarize how close a given solution is to achieving the aims, and it has an important effect on the success of a genetic algorithm. Fitness is proportional to the utility or ability of the individual that the chromosome represents. The measure of fitness helps in evolving good solutions and implementing natural selection. In this work, the fitness of a chromosome is computed using the Mean Square Error (MSE) [18], which measures the clustering error:

MSE = (1/n) ∑_{j=1}^{K} ∑_{x_i ∈ C_j} ‖x_i − c_j‖²   (1)

Fitness = 1 / MSE

After the definition of the fitness function, the different parameters of the genetic algorithm operators are fixed.

3.3. Generation of the Initial Population
To accelerate the genetic algorithm process and increase the individual diversity of the initial population, we generate that initial population in two phases. The first phase consists of a deterministic production, and the second of a random generation, such that 20% of the initial population comes from the first phase and 80% from the second. In the first phase, we deterministically generate the first subpopulation, which constitutes 20% of the entire population. From the input dataset, we produce a sorted dataset. Then, the sorted dataset is divided into k equal segments, and the means, modal, and min-max values (see Algorithm 1) of each segment are determined and taken as initial cluster centers. In doing so, we increase the diversity of individuals, leading to enhanced genetic reproduction.
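The fitness computation can be sketched as follows. The per-point error and the inversion Fitness = 1/MSE follow the usual convention for minimization problems; the helper names are ours, and the sketch assumes each point contributes its squared distance to the nearest candidate center:

```python
def mse(points, centers):
    """Mean squared error of a chromosome: each data point contributes
    the squared distance to its nearest candidate center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total / len(points)

def fitness(points, centers):
    # Fitness is inversely proportional to the clustering error,
    # so chromosomes with tighter clusters get larger fitness values.
    return 1.0 / mse(points, centers)
```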
Algorithm 1: generate20%population
  Sort the dataset;
  For (k = 2; k <= Kmax; k++) {
    Divide the sorted data into k equal segments {Sk};
    Modal step: take the modal value of each segment as a center Ck;
    Means step: take the mean value of each segment as a center Ck;
    Min-Max step: For (i = 0; i < k; i++) {
      Calculate c = (max − min)/2 for segment i;
      Consider c as the i-th center;
    }
  }

3.4. Reproduction
Genetic algorithms search for the best solution among a number of possible solutions through reproduction. Reproduction can be implemented in algorithmic form, based on reproduction operators suited to the encoding scheme. The objective of the reproduction operators is to ensure diversity in the population, such that the fittest solutions can be derived through the evolutionary process. This evolution process is composed of crossover and mutation operations.
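One possible reading of the deterministic seeding step can be sketched in Python. This illustrates the "means step" variant only, under our own naming and the assumption that leftover points from uneven splits are simply dropped:

```python
def deterministic_chromosomes(points, k_max):
    """Sketch of the deterministic 20% seeding: sort the data, split it
    into k equal segments for each k in [2, k_max], and use the segment
    means as candidate cluster centers (the 'means step' of Algorithm 1)."""
    data = sorted(points)
    chromosomes = []
    for k in range(2, k_max + 1):
        size = len(data) // k
        segments = [data[i * size:(i + 1) * size] for i in range(k)]
        centers = [
            tuple(sum(v) / len(seg) for v in zip(*seg))
            for seg in segments if seg
        ]
        chromosomes.append(centers)
    return chromosomes
```

The modal and min-max steps would follow the same segment loop with a different per-segment statistic, yielding further chromosomes per value of k and hence a more diverse seed subpopulation.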
During this process, invalid chromosomes may be produced. In the proposed chromosome representation, repetition of genes produces invalid chromosomes, because a data point cannot be the center of more than one cluster. To be able to detect the production of invalid chromosomes, we sort the chromosome genes in ascending order. In doing so, we achieve small length, fast detection of invalid chromosomes, and faster crossover and mutation operations. Figure 3 illustrates our crossover process using a mask.

Figure 3. The crossover process using a mask.

3.4.1. Crossover
Crossover is a reproduction process in which children chromosomes are generated according to the fitness function values of their parents. We define a crossover operator in such a way that it can accept parent chromosomes with different numbers of genes. It occurs with probability crossp. For each two selected parents, crossover is applied, producing two offspring. To this end, a binary mask is created whose length is equal to that of the shorter parent. In this mask, digit "1" occurs with a certain probability p.

C(1)_j = C(1)_j + α C(2)_j   (2)
C(2)_j = C(2)_j + (1 − α) C(1)_j   (3)

Figure 4 depicts the crossover formulas, with α a uniformly distributed random number such that α ∈ [−1, 1]. Offspring O1 is created by applying formula (2) and offspring O2 by applying formula (3). These formulas are applied to the genes of the parents at the positions where the mask holds a digit "1". We also check the validity of the resulting chromosomes to avoid redundant centers in one chromosome. In case of redundancy, we replace the redundant gene with a non-redundant one extracted from a parent.
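The masked arithmetic crossover can be sketched as follows. This is an illustrative interpretation of formulas (2)-(3) under our own naming; genes are tuples of coordinates, and only the positions marked "1" in the mask are blended:

```python
import random

def masked_crossover(p1, p2, alpha=None, p_one=0.5):
    """Masked arithmetic crossover for parents of different lengths:
    a binary mask as long as the shorter parent marks the gene positions
    to recombine; a marked gene pair (g1, g2) becomes
    g1 + alpha*g2 and g2 + (1-alpha)*g1, with alpha in [-1, 1]."""
    if alpha is None:
        alpha = random.uniform(-1.0, 1.0)
    n = min(len(p1), len(p2))                      # mask length = shorter parent
    mask = [1 if random.random() < p_one else 0 for _ in range(n)]
    o1, o2 = list(p1), list(p2)
    for j in range(n):
        if mask[j]:
            g1, g2 = p1[j], p2[j]
            o1[j] = tuple(a + alpha * b for a, b in zip(g1, g2))
            o2[j] = tuple(b + (1 - alpha) * a for a, b in zip(g1, g2))
    return o1, o2
```

A validity check (as described above) would follow, replacing any duplicated center in an offspring with a non-redundant gene taken from a parent.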
Algorithm 2: Clustering with GA
Input: data points, Kmax
Output: clusters
Procedure:
(1) Use Algorithm 1 to generate 20% of the initial population;
(2) Randomly generate the remaining 80% of the initial population;
(3) Calculate the fitness values of the chromosomes in the population;
(4) Select the two fittest chromosomes (P1, P2) and randomly choose a target one, P3, from the population;
(5) Apply the crossover and mutation operators to (P1, P2) to generate two offspring (O1, O2);
(6) Calculate the fitness values of the offspring;
(7) Replace P1 and P2 with O1 or O2 if they are not better than these offspring;
(8) Repeat from step (3) while there is a significant change between generations and the maximum number of iterations has not been reached;
(9) Final clusters are derived from the fittest chromosome using the one-step k-means algorithm [21].

Figure 5. The main steps of the proposed genetic algorithm for clustering.

3.4.2. Mutation
Mutation is intended to prevent all solutions in the population from falling into a local optimum of the solved problem, or to enhance the obtained chromosome. Mutation takes place with a lower probability than crossover. We define two types of mutation to take place after crossover. Topological mutation adds or deletes genes from an offspring. Gene mutation modifies a selected gene from an offspring: it is applied by randomly selecting a gene and replacing it with a data point randomly extracted from the dataset. Afterward, a validity check is performed to avoid redundant centers and to keep the cluster number within [2, Kmax]. Our overall genetic process organizes these operations as depicted in Figure 5. To find the closest cluster to every point, we apply a k-means operator (KMO), which is one step of the classical k-means algorithm.
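The two mutation types can be sketched as follows (illustrative helpers under our own naming; the validity constraints are folded directly into the sampling):

```python
import random

def gene_mutation(chromosome, points):
    """Gene mutation: replace one randomly chosen gene with a data point
    drawn from the dataset, avoiding a duplicated center."""
    child = list(chromosome)
    j = random.randrange(len(child))
    candidates = [p for p in points if p not in child]
    if candidates:
        child[j] = random.choice(candidates)
    return child

def topological_mutation(chromosome, points, k_max):
    """Topological mutation: add or delete a gene, keeping the number
    of centers within [2, k_max]."""
    child = list(chromosome)
    grow = random.random() < 0.5
    if grow and len(child) < k_max:
        extra = [p for p in points if p not in child]
        if extra:
            child.append(random.choice(extra))
    elif len(child) > 2:
        child.pop(random.randrange(len(child)))
    return child
```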
In other words, for each new chromosome we assign data points to their clusters using the KMO operator, in order to compute its fitness.

4. Related works
In the k-means algorithm, the most important challenge is determining the number of clusters and their initial centers. Randomly choosing initial centers does not guarantee good clustering quality, because each run of the k-means algorithm on the same dataset may produce different clusters. Besides, the process may converge to suboptimal partitions. In [3], the author proposed to sort the dataset according to the distances of the items from each other, and then divide the sorted data into k equal segments, taking the first data item of each segment as the corresponding cluster center. In [12], the authors proposed a solution based on a statistical test, with the assumption that a subset of the data follows a Gaussian distribution. The proposed process begins with a small number k of clusters, and then increments k at each step based on an appropriate condition. In [17-21], genetic algorithm-based solutions are proposed. They maintain a population of coded solutions and apply a one-step k-means operator to calculate the fitness of a chromosome. In [13], the authors proposed to use a genetic algorithm to determine the initial values of the clusters. They use the k-means algorithm to evaluate the fitness of each candidate center; at the final iteration, the fittest chromosome provides the initial cluster centers for k-means. As the k-means algorithm must be executed many times, this approach is usually computationally expensive. These methods share the common drawback of searching for initial centers while fixing a given number of clusters. Their aim was to obtain results in fewer steps, but because of intensively calculating neighbourhood distances between data points and rearranging them, they would be very slow in partitioning large datasets. When searching for the initial centroids, they evaluate the concerned criteria separately for different values of k, which leads to executing the k-means algorithm for different numbers of clusters. Such a process is usually very time-consuming.

5. Experimental results
To evaluate our approach and test its efficiency, we compare it with k-means and with the closest work to ours [9], using two datasets selected from the UCI repository of machine learning databases: the Iris and Lymphoma datasets. It is worth noting that the first difference between our approach and the two above-mentioned methods is that we automatically determine the number k of clusters and their initial centroids. In the following, we show that our approach outperforms at least these two methods on two important parameters, namely the average error and the average execution time. The experiments have been carried out using the parameters of Table 1.

Table 1.
Parameter                             Value
Maximum number of clusters, Kmax      15
Population size, P                    80
Crossover probability                 0.6
Mutation probability                  0.3
Maximum number of iterations          100

In the Iris dataset, each data point has four feature values, which represent the length and the width of the sepal and petal, in centimeters. It has three classes with 50 samples per class. The value of k is therefore chosen to be three (clusters). The Lymphoma dataset contains 62 samples of 4026 genes spanning three classes: 42 Diffuse Large B-Cell Lymphoma samples, nine Follicular Lymphoma samples, and 11 B-cell Chronic Lymphocytic Leukemia samples. This dataset is also to be partitioned into three clusters. Each method was run on each dataset 10 times, and we then determined the average time and error, as reported in Table 2, which shows the efficiency of the proposed method over k-means and the work of [9], which is closest to ours.

Table 2. Evaluation results using UCI data sets
Dataset    Parameter   k-means   [11]    Proposed approach
Iris       Avg Error   36.53     29.59   19.59
           Avg Time    18.34     10.56    9.34
Lymphoma   Avg Error   44.12     38.49   22.15
           Avg Time    16.15     10.25    8.28

The experiments measured the average error and the average execution time for the three methods. For each dataset, average error and average time are listed to show the trade-off between them. From Table 2 it is observed that our method outperforms k-means and its variants, especially the work in [11], by reducing both the average error and the average execution time.
Because our proposed population model consists of chromosomes with varied structures and uses improved genetic operators for reproduction, the offspring inherit appropriate information from their parents, leading to an enhanced solution for the problem at hand. Moreover, the deterministic construction of an important portion of the initial population brings high diversity to the genetic reproduction process, leading to an enhanced quality of the result.

6. Conclusion
In this paper, we have introduced a variable-structure genetic algorithm applied to the data clustering problem without requiring the number of clusters or their initial centers. The cluster centroids are calculated by a genetic algorithm with a heterogeneous population, leading to better results than using random numbers. Moreover, our approach accelerates the search process by reducing the average execution time while obtaining data partitions of better quality. As future work, we plan to investigate improved fitness functions and other reproduction operators to achieve better performance.

References
[1] G Mecca, S Raunich, A Pappalardo. A New Algorithm for Clustering Search Results. Data and Knowledge Engineering. 2007; 62: 504-522.
[2] H He, Y Tan. A Two-Stage Genetic Algorithm for Automatic Clustering. Neurocomputing. 2012; 81: 49-59.
[3] C Raposo, C Antunes, J Barreto. Automatic Clustering Using a Genetic Algorithm with New Solution Encoding and Operators. ICCSA (2), Lecture Notes in Computer Science. Springer. 2014; 8580: 92-103.
[4] AK Jain, MN Murty, PJ Flynn. Data Clustering: A Review. ACM Computing Surveys. 1999; 31(3): 264-323.
[5] S García, J Luengo, F Herrera. Data Preprocessing in Data Mining. Cham: Springer. 2015: 327.
[6] SN Sivanandam, SN Deepa. Introduction to Genetic Algorithms. Berlin Heidelberg New York: Springer. 2008.
[7] K Fukunaga. Introduction to Statistical Pattern Recognition. New York: Academic Press. 1990.
[8] KB Sawant. Efficient Determination of Clusters in K-Means Algorithm Using Neighborhood Distance. The International Journal of Emerging Engineering Research and Technology. 2015; 3(1): 22-27.
[9] V Chittu, N Sumathi. A Modified Genetic Algorithm Initializing K-Means Clustering. Global Journal of Computer Science and Technology. 2011; 11(2).
[10] K Fukunaga. Introduction to Statistical Pattern Recognition. New York: Academic Press. 1990.
[11] DW Scott. Multivariate Density Estimation. New York: Wiley. 1992.
[12] G Hamerly, C Elkan. Learning the k in k-means. Advances in Neural Information Processing Systems. 2004; 17: 281-288.
[13] G Karegowda, VT Shama, MA Jayaram, AS Manjunath. Improving Performance of K-Means Clustering by Initializing Cluster Centers Using Genetic Algorithm and Entropy Based Fuzzy Clustering for Categorization of Diabetic Patients. In Proceedings of the International Conference on Advances in Computing. MSRIT, Bangalore: Springer India. 2013: 899-904.
[14] M Ettaouil, E Abdelatif, F Harchli. Improving the Performance of K-Means Algorithm Using an Automatic Choice of Suitable Code Vectors and Optimal Number of Clusters. Journal of Theoretical and Applied Information Technology. 2013; 56.
[15] Yugal Kumar, G Sahoo. A New Initialization Method to Originate Initial Cluster Centers for K-Means Algorithm. International Journal of Advanced Science and Technology. 2014; 62: 43-54.
[16] SK Nandha Kumar, P Renuga. Reactive Power Planning Using Real GA Comparison with Evolutionary Programming. International Journal of Recent Trends in Engineering. 2009; 1(3).
[17] K Krishna, MN Murty. Genetic K-Means Algorithm. IEEE Transactions on Systems, Man, and Cybernetics. 1999; 29(3): 433-439.
[18] G Delavar, GH Mohebpour. A New Genetic Center Based Data Clustering Algorithm Based on K-Means. International Journal of Mechatronics, Electrical and Computer Technology. 2014; 4(13): 1820-1839.
[19] K Valarmathi, D Devaraj, TK Radhakrishnan. Real-Coded Genetic Algorithm for System Identification and Controller Tuning. Applied Mathematical Modelling. 2009; 33: 3392-3401.
[20] DE Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley. 1989.
[21] M Najmah, M Ghasemzadeh, C Meinel. A Variant of Genetic Algorithms for Non-Homogeneous Population. International Conference on Applied Mathematics, Computational Science and Systems Engineering. Rome, Italy. 2016.
[22] A Asuncion, DJ Newman. UCI Machine Learning Repository. Irvine, CA: University of California, Department of Information and Computer Science. 2016. http://www.ics.uci.edu/~mlearn/MLRepository.html