International Journal of Electrical and Computer Engineering (IJECE)
Vol. 9, No. 4, August 2019, pp. 2659~2667
ISSN: 2088-8708, DOI: 10.11591/ijece.v9i4.pp2659-2667
Journal homepage: http://iaescore.com/journals/index.php/IJECE
An improved differential evolution algorithm for
data stream clustering
Bhaskar Adepu1, Jayadev Gyani2, G. Narsimha3
1 Department of Information Technology, Kakatiya Institute of Technology & Science, India
2 Department of CS, College of Computer & Information Sciences, Majmaah University, Saudi Arabia
3 Department of CSE, JNTUH College of Engineering, India
Article Info
Article history:
Received Jun 27, 2018
Revised Jan 9, 2019
Accepted Jan 11, 2019
ABSTRACT
Several algorithms have been proposed for clustering data streams. Most of these algorithms require
the number of clusters (K) to be fixed by the user based on the input data and kept constant throughout
the clustering process. Choosing K is therefore one of the difficulties of stream clustering. In this
paper, we propose an efficient approach for data stream clustering based on an Improved Differential
Evolution (IDE) algorithm. The IDE algorithm is a fast, robust and efficient global optimization
approach for automatic clustering. In our proposed approach, we additionally apply an entropy-based
method for detecting concept drift in the data stream and thereby updating the clustering procedure
online. We compare our proposed method with a Genetic Algorithm and find it to be the more effective
optimization algorithm. Averaged over three datasets, the proposed technique achieves an accuracy of
92.29%, a precision of 86.96%, a recall of 90.30% and an F-measure of 88.60%.
Keywords:
Concept drift
Datastream clustering
Differential evolution
Encoding scheme
Entropy theory
Copyright © 2019 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Bhaskar Adepu,
Kakatiya Institute of Technology & Science,
Affiliated to Kakatiya University,
Warangal, 506015-India.
Email: bhaskar_adepu@yahoo.com
1. INTRODUCTION
Data streams have become an important source of data, and in recent years many organizations have been
generating huge amounts of data. Application domains of data streams include network traffic
monitoring, Internet of Things (IoT) applications that regularly send sensor data, web page access
and web click information, weather forecasting, and the economic information produced by finance and
securities companies [1-5]. Conventional data mining methods mostly focus on mining static,
memory-resident data repositories. However, the emergence of data streams and technological
developments have changed the way people store, process and communicate data [6].
Data streams are temporally ordered, fast changing, massive and potentially infinite. It may not be
possible to store the entire data stream in memory [7]. Stream mining has to deal with rapid and
dynamic data under real-time processing constraints, with the aim of extracting useful and interesting
patterns. The biggest challenge is finding valuable information in a single scan of a massive data
stream [7, 8].
The algorithms and procedures proposed for mining data streams can be divided into two groups.
Algorithms in the first group can achieve the desired clustering results, but insufficient data
storage capacity forces them to process the data dynamically while extracting knowledge. Algorithms in
the second group operate on the streaming data directly and apply mining techniques to it [9]. Both
groups face difficulties in clustering data streams. Some of these difficulties are: the data can be
visited only once during stream processing, the performance of stream processing is crucial, and
detecting a change or concept drift over time, whether gradual or abrupt, in an evolving data stream
is difficult [10].
Multiple techniques have been proposed for data stream clustering. They include the K-means clustering
algorithm [11-13], the Fast Evolutionary Algorithm for Clustering (FEAC) [14], a Support Vector
Clustering (SVC) based algorithm [15, 16], the Multiclass Novelty Detection (MND) algorithm [17, 18],
the Fuzzy C-Means algorithm [1, 19], the BIRCH algorithm [20] and the K-Nearest Neighbor (KNN)
algorithm [8].
The K-means algorithm is popular for clustering data in many real-world problems because of its
simplicity and scalability. Its limitation is that the number of clusters (the k-value) has to be
specified as an input parameter [11-13]. The Fast Evolutionary Algorithm for Clustering (FEAC) [14]
has been shown to estimate the number of clusters (the k-value) automatically and efficiently from the
input data. The limitation of this algorithm is that it has not been applied to data stream analysis.
Support Vector Clustering (SVC) is an efficient and effective data stream clustering algorithm. Its
disadvantage is that it cannot find clusters of arbitrary shape, because most of these algorithms are
based on K-means [15, 16]. Multiclass Novelty Detection algorithms can be used in applications such as
intrusion, fault and fraud detection, spam filtering and text mining [17, 18]. BIRCH is a hierarchical
clustering algorithm that calculates the distances between a new data point and the known data points
and then compares these distances with a threshold to determine the category of the new data point.
The limitation of this algorithm is that it does not work effectively for data of arbitrary
shape [20]. KNN is a skewed approach and a widely used method for solving classification and pattern
recognition problems in machine learning. It is used to avoid high computational complexity, even
though it does not deliver satisfactory performance in many applications [19, 21-24].
In this paper, we propose an Improved Differential Evolution (IDE) algorithm for data stream
clustering. First, the nearest cluster center is determined for every incoming object of the input
data stream and the clusters are updated. When a concept drift occurs, the incoming objects are placed
in a buffer for a fixed time period. After that, IDE-based optimization is used to find the optimal K
value. If another concept drift occurs, these steps are repeated.
The rest of this paper is organized as follows. The proposed method is described in section 2.
An overview of the IDE Stream algorithm is given in section 2.2. The pseudo code of our proposed IDE
Stream algorithm is given in section 2.7. Results and the conclusion are presented in sections 3
and 4.
2. PROPOSED METHOD
Consider a data stream that consists of N objects, where each object is an l-dimensional feature
vector $x_i = [x_i^j]$ with $j = 1, \ldots, l$ and $1 \le i \le N$. Initially, the number of clusters
$k \in [2, k_{\max}]$ and the centroids are randomly selected from the given data stream. The average
distance between the initial objects and their closest cluster centers is evaluated and updated. The
cluster centers are then updated, and this procedure is repeated until a stopping criterion is met.
Meanwhile, if any concept drift occurs, it is detected using the entropy theory presented in
section 2.1.
Our proposed IDE algorithm does not require a streaming module that stores a summary of the data for
use by a clustering module or by a module that estimates the k-value for partitioning the data.
Instead, the updated data partition is maintained by a single component in online mode. Our approach
uses a number of different trial solutions, each with its own cluster centers and coordinates. The
best solution for the updated clustering is selected and the worst solutions are discarded according
to the IDE algorithm. In the data stream clustering process, only a single object arrives at a time.
First, IDE Stream estimates the number of clusters from an initial set of objects (of the initial
size) taken from the stream. Once these initial clusters have been built, they are maintained in an
online manner.
The procedure of the IDE Stream algorithm is shown in Figure 1. In the initial phase of data stream
processing, the number of clusters is randomly selected from the data stream. The average distance
between the initial objects and their closest cluster centers is evaluated and updated. To recognize
changes in the data partition, the clusters are monitored using entropy theory. When the entropy test
triggers an alarm, the clusters being updated no longer reflect the changes in the data stream, and
the objects are placed in the buffer for some time. After that, the IDE search is started for
optimization, using the encoding scheme of section 2.4.
2.1. Entropy theory for detecting concept drift
While the data is being partitioned, the data distribution may change over time in unforeseen ways;
this problem is known as concept drift. Concept drift is an ongoing change, whereas concept evolution
refers to the appearance or disappearance of clusters. To detect concept drift, the following
hypotheses are considered for the entropy theory. Suppose the data stream holds only a single concept.
In that case no concept drift occurs, because the data stream is stable. If the data stream has no
concept drift, the time series of the entropy measure is close to zero and remains stationary. If the
data stream has a concept drift, a model built on the earlier concept matches only that concept and
not the new one. This means that subsequent data cannot be judged accurately unless the model is
updated, which is why the detection of concept drift is critical. The entropy estimate is therefore
computed from the membership values and is used to identify concept drift.
In our method, Shannon's entropy is used for the entropy estimation. Here, X is a discrete random
variable with possible values $X_i^l = \{X_{i,1}^l, X_{i,2}^l, \ldots, X_{i,d}^l\}$ and probability
mass function P(X). The entropy E(X) is calculated for every random variable X using the following
equation.

$$E(X) = -\sum_{i=1}^{n} P(X_i) \log(P(X_i)) \qquad (1)$$
The non-uniformity of a given cluster is assessed with the entropy measure. Entropy theory is an
efficient technique for categorical data clustering. If the entropy values are high, the uncertainty
of the IDE partition is larger. The time series of the classification entropy is close to zero if the
data stream does not contain any concept drift; otherwise, its value becomes large.
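The entropy of equation (1) can be illustrated with a minimal Python sketch. The function name `shannon_entropy` and the use of the natural logarithm are assumptions made for illustration; the text does not state the base of the logarithm or how the membership values are turned into probabilities.

```python
import math

def shannon_entropy(probabilities):
    """E(X) = -sum_i P(X_i) * log(P(X_i)), using the natural log (an assumption)."""
    # Zero-probability terms are skipped because p * log(p) tends to 0 as p -> 0.
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

# Memberships concentrated on a single cluster: no uncertainty.
certain = shannon_entropy([1.0, 0.0, 0.0])
# Memberships spread evenly over three clusters: maximum uncertainty, log(3).
uncertain = shannon_entropy([1 / 3, 1 / 3, 1 / 3])
```

In this sketch a stable stream would keep producing values near `certain`, while a drifting stream would push the entropy toward `uncertain`, which matches the behaviour described above.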
2.2. Overview of IDE Stream algorithm
Figure 1 shows the procedure of the IDE Stream algorithm.
Figure 1. Process of IDE Stream algorithm
2.3. Buffer for IDE Stream
In the wake of recognizing the concept drift, the genuine clusters are updated, and these outdated
clusters are put away in buffer for some time to run the IDE algorithm. Here, the base size of the buffer is
10%×initial size. For the computational assets, just the base size of the buffer is accessible. To characterize
the warning and the alarm states, there are two threshold values, specifically w and a are utilized
 ISSN: 2088-8708
Int J Elec & Comp Eng, Vol. 9, No. 4, August 2019 : 2659 - 2667
2662
(i.e. wa   ). Right when the alarm threshold value a isn't as much as the estimation of the contrast
between the two variables, the stationary state is triggered. Starting now and into the foreseeable future, to
address the data partition and recognize the optimized clusters, the encoding scheme is executed to run the
IDE algorithm. For the efficient and robust optimization process, the IDE algorithm is an efficient and well
known algorithm which is a population based algorithm. The floating point representations are used in
this algorithm.
2.4. Encoding scheme
For clustering problems, the candidate solutions (individuals) are described by an encoding scheme;
the IDE Stream algorithm adopts the scheme of [14]. A data stream X is a massive sequence of examples
and is given as follows:

$$X_i^l(t) = \{x_{i,1}^l, x_{i,2}^l, \ldots, x_{i,d}^l(t)\}, \quad \text{i.e. } X = [x_i^j]_{j=1}^{l} \qquad (2)$$

where i indexes the incoming object and l is the number of attributes in each object.
The stream in the above equation is potentially unbounded ($N \to \infty$). Each example is described
by an n-dimensional attribute vector $X_i = [x_i^l]_{l=1}^{n}$. The input data points (D) are then
partitioned into k non-overlapping clusters $C = \{C_1, C_2, \ldots, C_k\}$ that satisfy the equation

$$C_i \neq \emptyset; \quad C_i \cap C_j = \emptyset, \; i, j = 1, 2, \ldots, k, \; i \neq j; \quad \bigcup_{i=1}^{k} C_i = D \qquad (3)$$
The above expression states that objects in the same cluster are similar to each other and objects in
different clusters are dissimilar. The similarity and dissimilarity between the objects of the data
set are found by evaluating the Euclidean distance between the points $X_i = [x_i^l]_{l=1}^{n}$. The
partition C is then encoded as an integer string of N positions, one per object. Each position of the
string corresponds to the object with the same index in the dataset. An example encoding of a dataset
is shown in Figure 2.
1 1 1 1 3 1 2 3 3 2 3
O1 O2 O3 O4 O5 O6 O7 O8 O9 O10 K
Figure 2. Encoding scheme
In the above example, the dataset consists of ten objects (i.e. $O_i$, $i \in \{1, \ldots, 10\}$) and
the partition has three clusters ($k = 3$). Here, the cluster label of the i-th object of the dataset
is stored in the i-th position and the k-value is stored in the last position.
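The encoding can be made concrete with a short Python sketch that decodes the individual of Figure 2. The function name `decode_genotype` and the list-based representation are illustrative assumptions, not part of the original implementation.

```python
from collections import defaultdict

def decode_genotype(genotype):
    """Split an integer-string individual into (clusters, k).

    All positions except the last hold the cluster label of objects
    O1..ON; the last position holds k, as in the Figure 2 example.
    """
    *labels, k = genotype
    clusters = defaultdict(list)
    for position, label in enumerate(labels, start=1):
        clusters[label].append(f"O{position}")
    return dict(clusters), k

# The individual shown in Figure 2: ten objects followed by k = 3.
figure2 = [1, 1, 1, 1, 3, 1, 2, 3, 3, 2, 3]
clusters, k = decode_genotype(figure2)
```

For this individual, cluster 1 holds O1, O2, O3, O4 and O6, cluster 2 holds O7 and O10, and cluster 3 holds O5, O8 and O9.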
Every cluster is described by a feature vector with four components: the number of objects N, the
linear sum of the data objects, called S1, their squared sum, called S2, and the time ‘t’ at which the
cluster received its most recent object. The centroid of the cluster is evaluated from the first three
components, and the remaining component is used to weight the importance of the cluster. At time ‘T’,
the cluster weight ‘W’ is evaluated as follows:

$$W = e^{-\frac{T - t}{v}} \qquad (4)$$

where W is the weight of the cluster and v is a user-defined parameter that controls the fading
factor. When the weight of a cluster is less than 0.1 (i.e. W<0.1), that cluster is removed from the
data partition.
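A minimal Python sketch of such a cluster summary is given below. It assumes that equation (4) reads W = exp(-(T - t) / v); the class name `ClusterFeature`, its method names and the sample values are hypothetical. In this sketch the centroid uses only N and S1, while S2 is maintained but not used.

```python
import math
from dataclasses import dataclass

@dataclass
class ClusterFeature:
    n: int      # number of objects absorbed so far (N)
    s1: list    # per-attribute linear sum of the objects (S1)
    s2: list    # per-attribute squared sum of the objects (S2)
    t: float    # time stamp of the most recently received object

    def add(self, x, now):
        """Absorb object x arriving at time `now`."""
        self.n += 1
        self.s1 = [a + b for a, b in zip(self.s1, x)]
        self.s2 = [a + b * b for a, b in zip(self.s2, x)]
        self.t = now

    def centroid(self):
        return [s / self.n for s in self.s1]

    def weight(self, now, v):
        """Fading weight at time `now`, assuming W = exp(-(T - t) / v)."""
        return math.exp(-(now - self.t) / v)

cf = ClusterFeature(n=0, s1=[0.0, 0.0], s2=[0.0, 0.0], t=0.0)
cf.add([1.0, 2.0], now=1.0)
cf.add([3.0, 4.0], now=2.0)
fresh = cf.weight(now=2.0, v=10.0)    # just updated, weight 1.0
stale = cf.weight(now=30.0, v=10.0)   # exp(-2.8), below the 0.1 removal threshold
```

With v = 10, a cluster that receives no object for 28 time units drops below the 0.1 threshold and would be removed from the partition.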
2.5. Validity indices of the clusters
A validity index is used to evaluate the number of clusters. The evolutionary algorithm optimizes the
k-value based on a fitness function, which is estimated by using the validity index. The well-known
validity index used in our method is the Simplified Silhouette (SS) index, which is also used
in [5, 14]. The compactness and the separation of the clusters are assessed by the silhouette width.
The given set of data points is written as follows,

$$X_i^l = \{x_i^1, x_i^2, x_i^3, \ldots, x_{i,j}^l\} \qquad (5)$$
where, in the above equation, $i = 1, 2, 3, \ldots, N$ ranges over the objects in the partition and
$j = 1, 2, 3, \ldots, K$ ranges over the clusters. Suppose the data point $X_i^l$ belongs to the
cluster $C_a$. The dissimilarity of $X_i^l$ to its own cluster $C_a$ is denoted by $a(X_i^l)$. For any
other cluster $C_b$, the average dissimilarity of $X_i^l$ to the objects of $C_b$ is denoted by
$d(X_i^l, C_b)$. The minimum dissimilarity over all clusters $C_b \neq C_a$ is selected and denoted by
$b(X_i^l)$. Both quantities are computed from the distances to the cluster centroids, where
$centroid_a$ is the centroid of $C_a$, as given below,

$$a(X_i^l) = dis\left(X_i^l, centroid_a\right) \qquad (6)$$

$$b(X_i^l) = \min_{b \neq a} dis\left(X_i^l, centroid_b\right) \qquad (7)$$
After computing the dissimilarity function, the silhouette estimate $S(X_i^l)$ is given as follows,

$$S(X_i^l) = \frac{b(X_i^l) - a(X_i^l)}{\max\{a(X_i^l), b(X_i^l)\}} \qquad (8)$$
The silhouette values lie in the interval [0, 1], and the closest cluster is obtained according to
equation (7). When the silhouette value is close to 1, $X_i^l$ is clustered correctly; otherwise, the
data point is likely to be wrongly clustered. The overall silhouette index of the partition
$C = \{C_1, C_2, \ldots, C_k\}$ is given as:

$$SS = \frac{1}{N} \sum_{i=1}^{N} S(X_i^l) \qquad (9)$$
The maximum value of SS in the above equation, $SS_{\max}$, is used as the fitness function, which
determines the quality of the data partition. The partition with the maximum SS value is considered
the best clustering, and the evolutionary search is then started.
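Equations (6) to (9) can be sketched in Python as follows. This is an illustrative reading of the simplified silhouette in which a is the distance to the object's own centroid and b is the distance to the nearest other centroid; the function names and the four sample points are hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def simplified_silhouette(points, labels, centroids):
    """Mean of S(x) = (b - a) / max(a, b) over all points.

    a: distance from x to the centroid of its own cluster.
    b: distance from x to the nearest centroid of any other cluster.
    `centroids` maps a cluster label to its centroid; at least two
    clusters are required.
    """
    total = 0.0
    for x, own in zip(points, labels):
        a = euclidean(x, centroids[own])
        b = min(euclidean(x, c) for lab, c in centroids.items() if lab != own)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = {1: (0.0, 1.0), 2: (10.0, 1.0)}
good = simplified_silhouette(points, [1, 1, 2, 2], centroids)  # nearest-centroid labels
bad = simplified_silhouette(points, [2, 2, 1, 1], centroids)   # deliberately swapped labels
```

When every object is assigned to its nearest centroid, a never exceeds b, so each S value stays in [0, 1]. The deliberately swapped labelling in `bad` shows that the same formula goes negative once objects sit in the wrong cluster.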
2.6. Innovative evolutionary search
In our method, the DE algorithm is modified by adjusting the population X, the mutation and the
crossover rate CR, because the best optimization results are obtained through the crossover and the
mutation rate.
In our IDE implementation, the candidate solutions
$\{x_{i,1}^l, x_{i,2}^l, \ldots, x_{i,d}^l(t)\}$ are first generated from the initial population X
using the encoding scheme of section 2.4. The silhouette index $SS_{\max}$ is then evaluated for each
individual to determine its fitness, as described in section 2.5. After that, three objects
$x_i^l(t)$, $x_j^l(t)$ and $x_p^l(t)$ are randomly chosen from the initially generated candidate
solutions. The difference between $x_i^l(t)$ and $x_j^l(t)$ is computed and scaled by a scalar S,
where $S \in [0, 1]$. The scaled difference of the two random vectors,
$S \cdot (X_{i,u}^l(t) - X_{j,u}^l(t))$, is added to the third vector $x_p^l(t)$ to generate the new
vector $O_i^l(t+1)$. This step is known as mutation and is given as follows:

$$O_{k,u}(t+1) = \begin{cases} X_{p,u}(t) + S \cdot \left(X_{i,u}^l(t) - X_{j,u}^l(t)\right), & \text{if } rand_u[0,1] \le CR \\ X_{k,u}(t), & \text{otherwise} \end{cases} \qquad (10)$$
The mutated vector is then mixed with another predetermined vector. This step is known as crossover
and is controlled by the crossover rate (CR), a scalar parameter of the algorithm between 0 and 1,
i.e. $CR \in [0, 1]$. Finally, the new candidate solution is assessed with $SS_{\max}$ and the initial
candidate solution $X_i^l(t)$ is replaced by the new candidate solution $X_i^l(t+1)$. If the newly
generated candidate solution yields a higher value of the objective function, it is kept as the better
solution; otherwise the original solution is retained in the population, as shown in the equation
below,

$$X_i^l(t+1) = \begin{cases} O_i(t+1), & \text{if } S(O_i(t+1)) > S(X_i^l(t)) \\ X_i(t), & \text{if } S(O_i(t+1)) \le S(X_i^l(t)) \end{cases} \qquad (11)$$
In the above equation, S(.) is the objective function. To improve our IDE algorithm, we have enhanced
the properties of DE in several ways. The weighted difference vector
$X_{i,u}^l(t) - X_{j,u}^l(t)$ is scaled by the scaling factor S, which lies in the range 0.5 to 1.
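One generation of the search described by equations (10) and (11) can be sketched in Python as below. Several details are assumptions made for illustration: the scaling factor is drawn uniformly from [0.5, 1] for every individual, the three random vectors are chosen to differ from the target (so the population needs at least four individuals), and a toy objective stands in for the SS fitness. All function names are hypothetical.

```python
import random

def mutate_and_cross(target, x_i, x_j, x_p, scale, cr, rng):
    """Equation (10): build the trial vector component by component."""
    trial = []
    for u in range(len(target)):
        if rng.random() <= cr:
            trial.append(x_p[u] + scale * (x_i[u] - x_j[u]))
        else:
            trial.append(target[u])
    return trial

def select(target, trial, objective):
    """Equation (11): keep the trial only if it improves the objective."""
    return trial if objective(trial) > objective(target) else target

def de_generation(population, objective, cr=0.9, rng=None):
    """Apply mutation, crossover and selection once to every individual."""
    if rng is None:
        rng = random.Random()
    next_population = []
    for k, target in enumerate(population):
        others = [v for idx, v in enumerate(population) if idx != k]
        x_i, x_j, x_p = rng.sample(others, 3)
        scale = rng.uniform(0.5, 1.0)  # scaling factor in [0.5, 1] (assumed random)
        trial = mutate_and_cross(target, x_i, x_j, x_p, scale, cr, rng)
        next_population.append(select(target, trial, objective))
    return next_population

def toy_objective(v):
    # Stand-in for the SS fitness; its maximum is at the origin.
    return -sum(c * c for c in v)

rng = random.Random(0)
population = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(8)]
best_before = max(map(toy_objective, population))
for _ in range(30):
    population = de_generation(population, toy_objective, rng=rng)
best_after = max(map(toy_objective, population))
```

Because selection is greedy, the fitness of each individual never decreases from one generation to the next, so `best_after` cannot be lower than `best_before`.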
2.7. Pseudo code of IDE Stream algorithm
Step 1 : Generate the candidate solutions from the initial population (X), i.e. the individual points
$X_1^l, X_2^l, \ldots, X_n^l$ are randomly generated from the data sets.
Step 2 : Calculate SSmax for each individual. [Use eq. 9]
Step 3 : Randomly choose three objects $X_i^l(t)$, $X_j^l(t)$ and $X_p^l(t)$ from the initially
generated candidate solutions.
Step 4 : Compute the difference between two of these objects, scale it by a factor in the range [0, 1]
and add the result to the third object to generate a new object. [Use eq. 10]
Step 5 : Perform crossover by mixing the mutated object from step 4 with the predefined object, using
a crossover rate in the range [0, 1].
Step 6 : Assess the newly generated candidate solution and output the best result. [Use eq. 11]
Repeat step 2 to step 6 until the stopping criterion is met, i.e. K<=SSmax.
3. RESULTS AND ANALYSIS
The performance of the IDE algorithm is assessed experimentally by comparing it with a recently
developed optimization algorithm, namely the genetic algorithm [25]. The primary objective of this
algorithm is to optimize the candidate solutions to produce the best result. The candidate solutions
are selected according to the fitness function, which assesses the quality of a candidate solution
with respect to the optimization problem. The main advantage of this algorithm is that it can handle
several candidate solutions simultaneously. However, in many practical applications, an approximate
determination of the data set is impossible. In our technique, any changes that occur during the
DE-based clustering are compared with the IDE algorithm using the same individual representation and
the silhouette index of the IDE.
3.1. Datasets description
We used three datasets: KDD CUP’99 [4, 5], forest cover type [5, 26] and electric power
consumption [26]. We compared the accuracy, precision, recall and F-measure on these datasets.
(i) KDD CUP’99 Dataset: KDD CUP’99 is the first real dataset. We use the 10% version of the detection
set, which is more challenging and more concentrated than the full version. The data set has 34
continuous attributes, 23 different classes and 494,020 connection records. Each object is either a
normal connection or an intrusion. The popular minimum-maximum procedure is used to normalize the
attribute values to [0, 1] before running the experiment.
(ii) Forest Cover Type Dataset: Forest cover type is the second real dataset, which contains 581,012
geospatial descriptions of 7 forest cover types. Each object is described by 54 attributes, of which
10 are quantitative and 44 are binary. The 10 quantitative attributes are widely used in experiments
reported in the literature. As with KDD CUP’99, the popular minimum-maximum procedure is used to
normalize the attribute values to [0, 1] before running the experiment. The data set is converted
into a data stream by taking the data in its input order.
(iii) Electric Power Consumption Dataset: The electricity dataset is the third real dataset, which
contains 2,075,259 instances and 9 attributes. It is a multivariate time series. The attribute values
are normalized to [0, 1]. This dataset gave better results for each activation function and
configuration. It performed better than the other datasets.
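The minimum-maximum normalization applied to all three datasets can be sketched in Python as follows. This is a batch version that assumes the whole dataset is available before it is replayed as a stream; the function name `min_max_normalize`, the handling of constant attributes and the sample rows are illustrative assumptions.

```python
def min_max_normalize(rows):
    """Rescale every attribute (column) of `rows` to the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # A constant attribute has no range, so it is mapped to 0.0 here.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized

sample = [[2.0, 10.0, 7.0], [4.0, 30.0, 7.0], [6.0, 20.0, 7.0]]
scaled = min_max_normalize(sample)
```

Each attribute is scaled independently, so attributes with large raw ranges do not dominate the Euclidean distances used during clustering.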
3.2. Evaluation measures
The accurate k-values are predicted by using a clustering-oriented approach. In our method, we use
four important evaluation metrics, namely precision, accuracy, recall and F-measure, for our
experiments. Cluster purity measures the percentage of correctly grouped data within a cluster. The
online component generates the micro clusters, and the quality of these clusters is evaluated by the
cluster purity. The accuracy evaluation detects the correctly assigned classes, and the cluster purity
is evaluated on the basis of this accuracy. For stream clustering, precision, recall, accuracy and
F-measure are used as evaluation measures. Precision is the ratio between the number of retrieved
relevant samples and the total number of grouped samples; it is used to determine how accurate the
results of the clustering method are. Recall is the ratio between the number of retrieved relevant
samples and the total number of relevant samples. Accuracy is the ratio between the number of
retrieved relevant samples and the total number of samples. The F-measure combines precision and
recall. The accuracy, precision, recall and F-measure are given as follows:
$$Accuracy = \frac{TP}{TP + FP + FN} \qquad (12)$$

$$Precision\,(P) = \frac{TP}{TP + FP} \qquad (13)$$

$$Recall\,(R) = \frac{TP}{TP + FN} \qquad (14)$$

$$F\text{-}measure = 2 \cdot \frac{P \cdot R}{P + R} \qquad (15)$$
In the above equations, TP is the number of true positive, TN of true negative, FP of false positive
and FN of false negative samples. Table 1 shows the evaluation measures computed on the three
datasets. The results show that our proposed IDE Stream algorithm performs better than the Genetic
Algorithm (GA). Figure 3 compares our proposed IDE Stream algorithm with the genetic algorithm on the
four evaluation measures and the three datasets.
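The four measures of equations (12) to (15) can be sketched in Python as below. Accuracy follows equation (12) as it appears in the text, which uses TP, FP and FN only. The function names and the confusion counts are hypothetical, while the precision and recall passed to `f_measure` at the end are the average values of the proposed IDE in Table 1.

```python
def f_measure(precision, recall):
    """Equation (15): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def evaluation_measures(tp, fp, fn):
    """Return (accuracy, precision, recall, F-measure) from confusion counts."""
    accuracy = tp / (tp + fp + fn)   # equation (12) as printed, without TN
    precision = tp / (tp + fp)       # equation (13)
    recall = tp / (tp + fn)          # equation (14)
    return accuracy, precision, recall, f_measure(precision, recall)

# Hypothetical counts, chosen only to exercise the formulas.
accuracy, precision, recall, f1 = evaluation_measures(tp=90, fp=10, fn=20)

# Average precision and recall of the proposed IDE from Table 1.
table_f = f_measure(0.8697, 0.9030)
```

Rounded to four decimals, `table_f` gives 0.8860, which agrees with the average F-measure reported for the proposed IDE in Table 1.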
Table 1. Evaluation measures on three datasets
Data Sets                      Precision             Recall                Accuracy              F_Measure
                               GA      Proposed IDE  GA      Proposed IDE  GA      Proposed IDE  GA      Proposed IDE
1. KDD CUP’99                  0.7815  0.8631        0.8134  0.9023        0.8634  0.9149        0.7972  0.8823
2. Forest Cover Type           0.8117  0.8657        0.8430  0.9056        0.8452  0.9314        0.8271  0.8852
3. Electric Power Consumption  0.8174  0.8802        0.8475  0.9012        0.8819  0.9225        0.8321  0.8906
Average value                  0.8035  0.8697        0.8346  0.9030        0.8635  0.9229        0.8188  0.8860
Figure 3. Comparison between the proposed IDE Stream algorithm and the genetic algorithm (GA)
4. CONCLUSION
In this paper, we presented an IDE Stream algorithm for automatic clustering of stream data.
The primary contribution of this paper is the automatic detection of the optimal number of clusters
for all data sets. Our test data sets are KDD CUP’99, forest cover type and electric power
consumption. In our experiments, the proposed IDE algorithm performed better than the Genetic
Algorithm, an existing well-known optimization algorithm. In our proposed approach, we also apply an
entropy-based technique for detecting concept drift and thereby updating the clustering process. Our
experimental results show that the proposed method is simple, practical and effective. The
performance of our method is estimated based on the accuracy, precision, recall and F-measure values.
In future work, we will study how to enhance the system to improve precision, and we will apply the
IDE algorithm to different types of attributes in the clustering domain.
REFERENCES
[1] B. Zhang, et al., “Data stream clustering based on Fuzzy C-Mean algorithm and entropy theory,” Signal
Processing, pp. 111-116, 2015.
[2] J. Barddal, et al., “SNCStream+: Extending a high quality true anytime data stream clustering algorithm,” Journal
of Information Systems, vol. 62, pp. 60-73, 2016.
[3] M. Hosseini, et al., “An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary
data streams,” Knowledge and Information Systems, vol/issue: 46(3), pp. 567-597, 2015.
[4] M. Khalilian, et al., “Data stream clustering by divide and conquer approach based on vector model,” Journal of
Big Data, vol/issue: 3(1), 2016.
[5] J. A. Silva, et al., “An evolutionary algorithm for clustering data streams with a variable number of clusters,”
Expert Systems with Applications, vol. 67, pp. 228-238, 2017.
[6] E. de Faria, et al., “Evaluation of Multiclass Novelty Detection Algorithms for Data Streams,” IEEE Transactions
on Knowledge and Data Engineering, vol/issue: 27(11), pp. 2961-2973, 2015.
[7] C. Wang, et al., “SVStream: A Support Vector-Based Algorithm for Clustering Data Streams,” IEEE Transactions
on Knowledge and Data Engineering, vol/issue: 25(6), pp. 1410-1424, 2013.
[8] D. Adeniyi, et al., “Automated web usage data mining and recommendation system using K-Nearest Neighbor
(KNN) classification method,” Applied Computing and Informatics, vol/issue: 12(1), pp. 90-108, 2016.
[9] J. Zhang, et al., “Distributed data stream clustering algorithm based on affinity propagation,” Journal of Computer
Applications, vol/issue: 33(9), pp. 2477-2481, 2013.
[10] K. Guo and Q. Zhang, “Fast Clustering-Based Anonymization Algorithm for Data Streams,” Journal of Software
Engineering, vol/issue: 24(8), pp. 1852-1867, 2014.
[11] R. Hyde, et al., “Fully online clustering of evolving data streams into arbitrarily shaped clusters,” Journal of
Information Sciences, vol. 382-383, pp. 96-114, 2017.
[12] N. Chaturvedi, et al., “An Improvement in K-mean Clustering Algorithm Using Better Time and Accuracy,”
International Journal of Programming Languages and Applications, vol/issue: 3(4), pp. 13-19, 2013.
[13] M. Naldi and R. Campello, “Evolutionary k-means for distributed data sets,” Neuro Computing, vol. 127,
pp. 30-42, 2014.
[14] V. S. Alves, et al., “Towards a Fast Evolutionary Algorithm for Clustering,” IEEE congress on evolutionary
computation, IEEE Press, pp. 1776-1783, 2006.
[15] Y. Ping, et al., “Fast and scalable support vector clustering for large-scale data analysis,” Knowledge and
Information Systems, vol/issue: 43(2), pp. 281-310, 2014.
[16] E. de Faria, et al., “MINAS: multiclass learning algorithm for novelty detection in data streams,” Data Mining and
Knowledge Discovery, vol/issue: 30(3), pp. 640-680, 2015.
[17] P. D. and A. Dixit, “Multi Novel Class Classification of Feature Evolving Data Streams with J48,” International
Journal of Computer Applications, vol/issue: 124(11), pp. 31-36, 2015.
[18] Y. Han, “Improved BIRCH Clustering Algorithm and Human Resource Management Efficiency:
An Organizational Learning Perspective,” International Journal of Security and Its Applications, vol/issue: 10(8),
pp. 385-394, 2016.
[19] Y. Liu, “Fuzzy-Clustering Web based on Mining,” Journal of Multimedia, vol/issue: 9(1), 2014.
[20] H. Lee, et al., “A MapReduce-based kNN Join Query Processing Algorithm for Analyzing Large-scale Data,”
Journal of KIISE, vol/issue: 42(4), pp. 504-511, 2015.
[21] Z. Miller, et al., “Twitter spammer detection using data stream clustering,” Information Sciences, vol. 260,
pp. 64-73, 2014.
[22] T. Velmurugan, “Performance based analysis between k-Means and Fuzzy C-Means clustering algorithms for
connection oriented telecommunication data,” Applied Soft Computing, vol. 19, pp. 134-146, 2014.
[23] R. Fok, et al., “Mining Evolving Data Streams with Particle Filters,” Computational Intelligence, vol/issue: 33(2),
pp. 147-180, 2015.
[24] V. Bhatnagar, et al., “Clustering data streams using grid-based synopsis,” Knowledge and Information Systems,
vol/issue: 41(1), pp. 127-152, 2013.
[25] X. Yuan, et al., “A Genetic Algorithm-Based, Dynamic Clustering Method towards Improved WSN Longevity,”
Journal of Network and Systems Management, vol/issue: 25(1), pp. 21-46, 2016.
[26] D. Marrón, et al., “Data stream classification using random feature functions and novel method combinations,”
Journal of Systems and Software, vol. 127, pp. 195-204, 2017.
BIOGRAPHIES OF AUTHORS
Bhaskar Adepu is an Associate Professor at Kakatiya Institute of Technology & Science, Warangal
and Affiliated to Kakatiya University, India. He is pursuing his Ph.D. from Jawaharlal Nehru
Technological University (JNTU), Hyderabad. He received M.Tech. (CSE) from JNTU-Hyderabad
in 2010. His research interests include Data Mining, Image Processing and Artificial Intelligence.
He delivered guest lectures in the field of data mining and artificial intelligence at various platforms.
He is a Member of IEEE and a member of ISTE.
Jayadev Gyani is an Assistant Professor at the College of Computer and Information Sciences,
Majmaah University, Al Majmaah 15341, Saudi Arabia. He received his Ph.D. from the University
of Hyderabad in 2009. He has published more than 80 refereed journal and conference articles in
the areas of software engineering, data mining, web information systems and digital image
processing. Dr. Gyani was program co-chair and vice chair for a few international conferences. He is
a Member of IEEE Computer Society and a Member of ACM.
Narsimha Gugulotu is a Professor and Head at JNTUH College of Engineering, Sultanpur,
Jawaharlal Nehru Technological University, Hyderabad, India. He received his Ph.D. from
Osmania University, Hyderabad in 2009. He has published more than 60 refereed journal and
conference articles in the areas of data mining, mobile computing, computer networks, cloud
computing and big data analytics. Dr. Narsimha handles various administrative and academic
responsibilities in various capacities. He is also a reviewer for a few international conferences. He is a
Member of IEEE Computer Society and a Member of ISTE.

An Improved Differential Evolution Algorithm for Data Stream Clustering

  • 1. International Journal of Electrical and Computer Engineering (IJECE) Vol. 9, No. 4, August 2019, pp. 2659~2667 ISSN: 2088-8708, DOI: 10.11591/ijece.v9i4.pp2659-2667  2659 Journal homepage: http://iaescore.com/journals/index.php/IJECE An improved differential evolution algorithm for data stream clustering Bhaskar Adepu1 , Jayadev Gyani2 , G. Narsimha3 1 Department of Information Technology, Kakatiya Institute of Technology & Science, India 2 Department of CS, College of Computer & Information Sciences, Majmaah University, Saudi Arabia 3 Department of CSE, JNTUH College of Engineering, India Article Info ABSTRACT Article history: Received Jun 27, 2018 Revised Jan 9, 2019 Accepted Jan 11, 2019 A Few algorithms were actualized by the analysts for performing clustering of data streams. Most of these algorithms require that the number of clusters (K) has to be fixed by the customer based on input data and it can be kept settled all through the clustering process. Stream clustering has faced few difficulties in picking up K. In this paper, we propose an efficient approach for data stream clustering by embracing an Improved Differential Evolution (IDE) algorithm. The IDE algorithm is one of the quick, powerful and productive global optimization approach for programmed clustering. In our proposed approach, we additionally apply an entropy based method for distinguishing the concept drift in the data stream and in this way updating the clustering procedure online. We demonstrated that our proposed method is contrasted with Genetic Algorithm and identified as proficient optimization algorithm. The performance of our proposed technique is assessed and cr eates the accuracy of 92.29%, the precision is 86.96%, recall is 90.30% and F-measure estimate is 88.60%. Keywords: Concept drift Datastream clustering Differential evolution Encoding scheme Entropy theory Copyright © 2019 Institute of Advanced Engineering and Science. All rights reserved. 
Corresponding Author: Bhaskar Adepu, Kakatiya Institute of Technology & Science, Affiliated to Kakatiya University, Warangal, 506015-India. Email: [email protected] 1. INTRODUCTION Nowadays, datastreams becomes an important source of data. In recent years, multiple organizations generate huge amounts of data. Data stream application domains includes information analysis in network data flow monitoring, Internet of Things (IoT) applications regularly sending sensors data, web page access and web click information, weather forecasting information and the economic information produced by finance and securities companies and so on [1-5]. Conventional data mining methods mostly focused on mining static and memory resident data repositories. However, with the emergence of data streams and technological developments changed the way people store, process and communicate the data [6]. Data streams are temporarily ordered, fast changing, infinite and massively potential. It may be not possible to store the entire data stream into memory [7]. Stream mining has to deal with rapid and dynamic data with real time processing and aiming at extracting useful and interesting patterns. The biggest challenge is finding valuable information in a single scanning of massive data streams [7, 8]. Various algorithms and procedures proposed for mining data streams can be grouped into two groups of techniques. One group ofalgorithms can achieve desired clustering results, but insufficiency of data storage capacity which leads us to process data dynamically in extracting knowledge. Another group of algorithms refers to streaming of data and applies mining techniques [9]. These two groups of techniques express some difficulties in clusterin data streams.Some of the difficulties includes: visiting of data once
  • 2.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 9, No. 4, August 2019 : 2659 - 2667 2660 during the processing of data stream, performance of processing stream is crucial and detecting a change or a concept drift during the time whether gradual or abrupt in the evolutionary data stream is difficult [10]. In the area of data stream clustering, multiple techniques have been proposed. They are K-means clustering algorithm [11-13], Fast Evolutionary Algorithm for Clustering (FEAC) [14], A Support Vector Clustering (SVC) based algorithm [15, 16], Multiclass Novelty Detection (MND) algorithm [17, 18], Fuzzy C-Mean algorithm [1, 19], BIRCH algorithm [20], KNN (K-Nearest Neighbor) algorithm [8]. For clustering data in many real world problems, K-means algorithm is popularly used because of its simplicity and scalabilityof that algorithm for most of the real world applications. K-means algorithm has a limitation that we have to specify the number of clusters i.e. k-value as an input parameter to the algorithm [11-13]. For automatically estimating the number of clusters (i.e.k-value) from the input data, there is an algorithm known as Fast Evolutionary Algorithm for Clustering (FEAC) [14] hasshown to be efficient. The limitation of this algorithm was that it is not applied for data stream analysis. Support Vector Clustering (SVC) algorithm is an efficient and effective data stream clustering algorithm. The disadvantage of this algorithm isthat it cannot find arbitary shape clusters because most of these algorithms are based on K-means algorithms [15, 16]. Multiclass Novelty Detection algorithm can be used in various applications like intrusion, fault and fraud detections, spam filters and in text mining [17, 18]. BIRCH is a hierarchical clustering algorithm which is based on calculating the distance between the new data point and the remaining known datapoints. After that, compare these distances with a threshold to determine the category of new data point. 
The limitation of this algorithm is that it does not function effectively for the data with arbitrary shape [20]. KNN algorithm is a skewed approach and is widely used method forsolving classification and pattern recognition problems in machine learning.KNN algorithm is used to avoid the high computational complexity even though we could not get a satisfactory performance in many applications [19, 21-24]. In this paper, we propose an Improved Differential Evolution algorithm (IDE) for the data stream clustering. At first, from the input data streams, the nearest cluster center is assessed for every incoming object and the clusters are updated. Around then any concept drift occurs, the approaching objects are put in the buffer till a settled time period. From that point onwards, use the IDE-based optimization for finding the optimal K value. On the off chance that any concept drift occurs, the underlying advances are rehashed. The rest of this paper is portrayed in the segment underneath. The proposed method is delineated in section 2. The overwiew of the IDE algorithm is explained in section 2.2. Our proposed IDE Stream algorithm is delineated in section 2.7. Results and the conclusion were depicted in sections 3 and 4. 2. PROPOSED METHOD Consider the data stream which consists of N number of objects and each object is an l-dimensional feature vector xi=[xi j ] where j=1 to l and 1<= i<=N. Initially the number of clusters  max,2 kk  and the centroids are randomly selected from the given datastream. The normal distance between closest cluster center and the underlying objects are evaluated and updated. The evaluated cluster centers are updated and this procedure is rehashed until the point when some ceasing rule is met. At the mean time, if any concept drift occurs, it utilizes the entropy theory which is presented in the section 2.1. 
Our proposed IDE algorithm doesn't require the streaming module to store the data for outline which utilizes the clustering module or module for estimation of k-value for apportioning the data. Here, the updated apportioning data are kept up by utilizing a single component; it is done in the online mode. Distinctive quantities of trail arrangements are utilized for our approach, which has diverse cluster centers and its coordinates. The best answer for the updated clustering data is chosen and the most noticeably awful arrangements are disposed of in the light of the IDE algorithm. In the data stream clustering process, just a single object is touching base at once. At first, the quantity of clusters is assessed by IDE Stream and the each evaluated cluster measure it ought to be equivalent to the underlying size objects from the stream. At that point, the assessed cluster is kept up in an online manner, subsequent to building up the underlying cluster. The procedure of IDE Stream algorithm has appeared in Figure 1. At the underlying phase of the data stream processing, the number of clusters is randomly selected from the information of data stream. The normal distance between closest cluster center and the underlying objects are evaluated and updated. With specific goal to recognize changes in the data partition, the clusters are administered by the entropy theory. At the point when the entropy test triggers an alarm, that is the point at which the real clusters being updated don't mirror the adjustments in the data streams and these clustered objects are put in the buffer for some time until any concept drift or change occurs. At that point, the IDE scan is begun for optimization and it is dealt in the encoding scheme.
  • 3. Int J Elec & Comp Eng ISSN: 2088-8708  An improved differential evolution algorithm for data stream clustering (Bhaskar Adepu) 2661 2.1. Entropy theory for detecting concept drift At the time of partitioning the data, the data distribution may change over time in unforeseen ways, this problem is known as concept drift. Theconcept drift is going on, while concept evolution appearance to vanishing of clusters. To recognize the concept drift, the accompanying speculations are considered for the entropy theory. Here, just a single idea is holed by the data stream. Because of this reason, the concept drift isn't happening in light of the fact that the data stream is steady. On the off chance that the data stream does not have any concept drift, the time series of the entropy strategy is about zero and stays stationary. In the event that the data stream has any concept drift, the entropy is can't expand on the last idea, it just coordinates with the underlying one. That implies, the precision judgment of the back to back data is can't control the framework, if the updating is not finished. Because of this reason, the discovery of concept drift is critical. Thus, the entropy estimation is established in the light of the membership esteems and it is critical for identifying the concept drift. In our method shannon’s entropy strategy is utilized for the entropy estimation. Here, the discrete random variables are considered as X and the conceivable esteems are considered as:  l diX l iX l iX l iX  ,,2,1,  and the probability mass function is P(X). The entropy E(X) is calculated for every random variable X using the following equation.    n i XiPXiPXE 1 ))(log()()( (1) The non-consistency of the given cluster is assessed in view of the entropy measure. For the all-out data clustering, the entropy theory is a proficient technique. On the off chance that the entropy esteems are high, the vulnerability of the IDE is bigger. 
The time series of the classification entropy points is almost zero if the data stream does not contain any concept drift, generally, the esteem will turn out to be extensive. 2.2. Overview of IDE Stream algorithm Figure 1 shown the procedure of IDE Stream algorithm. Figure 1. Process of IDE Stream algorithm 2.3. Buffer for IDE Stream In the wake of recognizing the concept drift, the genuine clusters are updated, and these outdated clusters are put away in buffer for some time to run the IDE algorithm. Here, the base size of the buffer is 10%×initial size. For the computational assets, just the base size of the buffer is accessible. To characterize the warning and the alarm states, there are two threshold values, specifically w and a are utilized
  • 4.  ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 9, No. 4, August 2019 : 2659 - 2667 2662 (i.e. wa   ). Right when the alarm threshold value a isn't as much as the estimation of the contrast between the two variables, the stationary state is triggered. Starting now and into the foreseeable future, to address the data partition and recognize the optimized clusters, the encoding scheme is executed to run the IDE algorithm. For the efficient and robust optimization process, the IDE algorithm is an efficient and well known algorithm which is a population based algorithm. The floating point representations are used in this algorithm. 2.4. Encoding scheme For the clustering problems, the candidate solutions (individuals) are described by the encoding scheme of [14] is adopted by the IDE-stream algorithm. A data stream X is a significant arrangement of illustrations and it is given as follow:  l j j ixXeit l dix l ix l ix l ixt l iX 1.,.)}(,,,2,1,{)(   (2) where i Incoming object l Number of attributes in each object. The above equation is potentially unbounded ).( N Each case is depicted by a n-dimensional attributes vector  n l l ixiX 1 . At that point, the input data points (D) are partitioned into k number of non- overlapping clusters },,2,1{ kCCCC  such that satisfying the equation DiC k i jikjijCiCiC    1 ;,,2,1,,;  (3) The above expression clarified, that the objects in the similar clusters are like each other, and the objects in the diverse clusters are disparate. In the data set, the closeness and disparity between the objects are found by evaluating the Euclidean distance between the points  n l l ixiX 1 . The above partitioned data C is changed into the integer string of N positions by using encoded process. Here, the position of every string and the numerical orders for the objects of the datasets are almost similar. The example encoding scheme of the dataset is shown in Figure 2. 
1 1 1 1 3 1 2 3 3 2 3 O1 O2 O3 O4 O5 O6 O7 O8 O9 O10 K Figure 2. Encoding scheme In the above example, the dataset consists of ten objects }10,,1{,.. iiOei and the encoding clusters are three )3( k . Here, the ith object of the dataset is stored in the ith position and the k-value is stored in the last position. Here, the feature vectors are used to describe the every cluster and the number of objects N, data object’s linear sum called S1 and their squared sum called S2 and time ‘t’ of the most recent objects that the cluster received are the four genuine quantities of measurement. The centroid of the cluster is evaluated by using the initial three components and the importance of the cluster is weighted by the rest of the component. At time ‘T’, the cluster weight ‘W’ is evaluated as follows: v tT eW   (4) where W is represented as the weight of the cluster, v is represented as the user defined parameter used to control the fading factor. When the weighting value of the cluster is less than 0.1 (i.e. W<0.1), that cluster is removed from the data partition.
  • 5. Int J Elec & Comp Eng ISSN: 2088-8708  An improved differential evolution algorithm for data stream clustering (Bhaskar Adepu) 2663 2.5. Validity indices of the clusters The validity index is utilized to evaluate the number of clusters. The evolutionary algorithm optimizes the k-value based on the fitness function and it is estimated by using the validity index. The well knownvalidity indexused in our method is a Simplified Silhouette (SS) index which is used in [5, 14]. The compactness and the partition of the clusters are assessed by utilizing the Silhouette width. The given arrangement of the data point is given as takes after, l jiiii l i xxxxX , 321 ,,,,  (5) where in the above equation, Ni 3,2,1 the number of objects in the partition and Kj 3,2,1 the number of cluster’s range variations.The above considered data points l iX comprises the cluster aC . In aC , the alternate objectives are distinguished in view of the l iX difference capacity and it is meant by )( l iXa . In cluster aC , the normal divergence capacity of every single other object is indicated as ),( bC l iXd . The minimum dissimilarity function of the data points is selected, if the cluster aCbC  , which is given beneath,  )1(,)(  jcentroid l iXdis l iXa (6)  )(,min)( ijcentroid l iXdis l iXb  (7) After computing the dissimilarity function, the silhouette estimation )( l iXS is given as follows,  )(),(max )()( )( l iXb l iXa l iXa l iXbl iXS   (8) The silhouette values are only ranges between the interval [0, 1], the closest cluster values are accessed according to the equation (11). At that point, when the silhouette value is nearer to 1, it represents l iX is clustered precisely, generally the data points are wrongly clustered. 
The overall silhouette index of the partitioning $C = \{C_1, C_2, \ldots, C_k\}$ is given as:

$$SS = \frac{1}{N} \sum_{i=1}^{N} S(X_i^l) \tag{9}$$

In the above equation, the maximum value of SS, $SS_{max}$, serves as the fitness function, which is used to determine the quality of the data partitioning. The clustering with the maximum SS value is considered the best clustering, and then the evolutionary search is started.

2.6. Innovative evolutionary search

In our method, the DE algorithm is modified by adjusting the population $X$, the mutation and the crossover (CR) settings, because the best optimization results are obtained through the crossover and the mutation rate. In our IDE implementation, the candidate solutions $\{x_{i,1}^l, x_{i,2}^l, \ldots, x_{i,d}^l(t)\}$ are first generated from the initial population $X$, using the encoding scheme described in section 2.4. Then the silhouette index $SS_{max}$ is evaluated for each individual to obtain its fitness, as described in section 2.5. After that, three objects $x_i^l(t)$, $x_j^l(t)$ and $x_p^l(t)$ are randomly selected from the initially generated candidate solutions. The difference between $x_i^l(t)$ and $x_j^l(t)$ is computed and scaled by the scalar $S$, where $S \in [0, 1]$. The scaled weighted difference of the two random vectors, written $O\left(X_{i,u}^l(t) - X_{j,u}^l(t)\right)$, is added to the third vector $x_p^l(t)$ to generate the new vector $O_i^l(t+1)$. This operation is known as mutation and is given as follows:

$$O_{k,u}(t+1) = \begin{cases} X_{p,u}(t) + O\left(X_{i,u}^l(t) - X_{j,u}^l(t)\right), & \text{if } rand_u[0,1] < CR \\ X_{k,u}(t), & \text{otherwise} \end{cases} \tag{10}$$

The mutated vector is then mixed with another predetermined vector. This operation is the crossover, and it is controlled by the crossover rate (CR), a scalar parameter of the algorithm with $CR \in [0, 1]$. Finally, the new candidate solution is assessed with $SS_{max}$, and the current candidate solution $X_i^l(t)$ is updated to $X_i^l(t+1)$. If the newly generated candidate solution yields a higher value of the objective function, it replaces the current solution; otherwise the current solution is retained in the population, as shown in the equation below:

$$X_i^l(t+1) = \begin{cases} O_i(t+1), & \text{if } S\left(O_i(t+1)\right) > S\left(X_i^l(t)\right) \\ X_i(t), & \text{if } S\left(O_i(t+1)\right) \le S\left(X_i^l(t)\right) \end{cases} \tag{11}$$

In the above equation, $S(\cdot)$ is the objective function. To improve our IDE algorithm, we have enhanced the properties of DE in different ways. To scale the weighted difference vector $\left(X_{i,u}^l(t) - X_{j,u}^l(t)\right)$, the scaling factor $S$ is used, and it ranges between 0.5 and 1.

2.7. Pseudo code of IDE Stream algorithm

Step 1 : Generate the candidate solutions from the initial population ($X$), i.e., the individual points $X_1^l, X_2^l, \ldots, X_n^l$ are randomly generated from the data sets.
Step 2 : Calculate $SS_{max}$ for each individual. [Use eq. 9]
Step 3 : Randomly choose three objects $X_i^l(t)$, $X_j^l(t)$, $X_p^l(t)$ from the initially generated candidate solutions.
Step 4 : Estimate the difference between any two objects, scale it in the range [0, 1], and add this value to the third object to generate a new object. [Use eq. 10]
Step 5 : Perform crossover by mixing the mutated object resulting from step 4 with a predefined object such that the crossover rate is in the range [0, 1].
Step 6 : Assess the newly generated candidate solution and output the best result. [Use eq. 11]
Repeat step 2 to step 6 until the stopping criterion is met, i.e., K <= $SS_{max}$.

3. RESULTS AND ANALYSIS

The performance of the IDE algorithm is experimentally assessed by comparing it with a recently developed optimization algorithm, namely the genetic algorithm [25]. The primary objective of this algorithm is to optimize the candidate solutions to produce the best outcome. The candidate solutions are selected according to the fitness function, and the quality of the candidate solutions is assessed with respect to the optimization problem. The main advantage of this algorithm is that it can handle several candidate solutions simultaneously. However, in many practical applications, even an approximate determination of the data set's structure is not possible. In our method, any changes that occur during DE-based clustering are compared with the IDE algorithm using the same objective representation and the silhouette index of the IDE.

3.1. Datasets description

We used three datasets: KDD CUP'99 [4],[5], forest cover type [5],[26], and electric power consumption [26]. We compared the accuracy, precision, recall and F-measure on these datasets.
(i) KDD CUP'99 Dataset: KDD CUP'99 is the first real dataset. We use the 10% version of the detection set, which is more challenging and more concentrated than the full version. This data set uses 34 continuous attributes and contains 23 different classes of data and 494,020 connection records. Each object is either a normal connection or an intrusion. The popular minimum-maximum procedure is used to normalize the attribute values to [0, 1] before running the experiment.

(ii) Forest Cover Type Dataset: Forest cover type is the second real dataset, which contains 581,012 geospatial descriptions of 7 forest cover types. Each object is described by 54 attributes, of which 10 are quantitative and 44 are binary. These 10 quantitative attributes are widely used in experiments reported in the literature. As with KDD CUP'99, the minimum-maximum procedure is used to normalize the attribute values to [0, 1] before running the experiment. The data set is converted to a data stream by feeding its records as input.

(iii) Electric Power Consumption Dataset: The electricity dataset is the third real dataset, which contains 2,075,259 instances and 9 different attributes. It is a multivariate time series. The attribute values are normalized to [0, 1]. This dataset gave better results for each activation function and configuration, and it performed better than the other datasets.

3.2. Evaluation measures

The accurate k-values are predicted using a clustering-oriented approach. In our experiments, we use four evaluation metrics, namely precision, accuracy, recall and F-measure. Cluster purity measures the percentage of correctly grouped data in a cluster group. The online component generates the micro-clusters, and the quality of these clusters is evaluated based on the cluster purity. The accuracy is used to detect the correctly assigned class, and the cluster purity is evaluated based on this accuracy.

For stream clustering, the precision, recall, accuracy and F-measure are used as evaluation measures. Precision is the ratio between the number of retrieved relevant samples and the total number of grouped samples, and it indicates how accurate the clustering results are. Recall is the ratio between the number of retrieved relevant samples and the total number of relevant samples. Accuracy is the ratio between the number of retrieved relevant samples and the total number of samples. The F-measure combines precision and recall. The accuracy, precision, recall and F-measure are given as follows:

$$Accuracy = \frac{TP}{TP + FP + FN} \tag{12}$$

$$Precision\ (P) = \frac{TP}{TP + FP} \tag{13}$$

$$Recall\ (R) = \frac{TP}{TP + FN} \tag{14}$$

$$F\text{-}measure = 2\left(\frac{P \cdot R}{P + R}\right) \tag{15}$$

In the above equations, TP is the number of true positive, TN the number of true negative, FP the number of false positive and FN the number of false negative samples.

Table 1 shows the evaluation measures computed on the three data sets. The proposed IDE Stream algorithm performs better than the Genetic Algorithm (GA) on every measure and every data set. Figure 3 compares the proposed IDE Stream algorithm with the genetic algorithm on the four evaluation measures for the three datasets.

Table 1. Evaluation measures on three datasets

| Data set | Precision (GA) | Precision (Proposed IDE) | Recall (GA) | Recall (Proposed IDE) | Accuracy (GA) | Accuracy (Proposed IDE) | F-measure (GA) | F-measure (Proposed IDE) |
|---|---|---|---|---|---|---|---|---|
| 1. KDD CUP'99 | 0.7815 | 0.8631 | 0.8134 | 0.9023 | 0.8634 | 0.9149 | 0.7972 | 0.8823 |
| 2. Forest Cover Type | 0.8117 | 0.8657 | 0.8430 | 0.9056 | 0.8452 | 0.9314 | 0.8271 | 0.8852 |
| 3. Electric Power Consumption | 0.8174 | 0.8802 | 0.8475 | 0.9012 | 0.8819 | 0.9225 | 0.8321 | 0.8906 |
| Average value | 0.8035 | 0.8697 | 0.8346 | 0.9030 | 0.8635 | 0.9229 | 0.8188 | 0.8860 |
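As a quick consistency check, the F-measure column for the proposed IDE in Table 1 can be recomputed from the precision and recall columns using (15). The sketch below is only illustrative: the function and variable names are hypothetical, and the precision, recall and F-measure values are copied from Table 1.

```python
def precision(tp, fp):
    """Eq. (13): fraction of grouped samples that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (14): fraction of relevant samples that were retrieved."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Eq. (15): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# (precision, recall, reported F-measure) for the proposed IDE, from Table 1.
table1_ide = {
    "KDD CUP'99": (0.8631, 0.9023, 0.8823),
    "Forest Cover Type": (0.8657, 0.9056, 0.8852),
    "Electric Power Consumption": (0.8802, 0.9012, 0.8906),
}

recomputed = {}
for name, (p, r, f_reported) in table1_ide.items():
    recomputed[name] = f_measure(p, r)
    print(f"{name}: recomputed={recomputed[name]:.4f}, reported={f_reported:.4f}")
```

For all three datasets the recomputed value agrees with the reported F-measure to four decimal places.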
Figure 3. Comparison of the proposed IDE Stream algorithm and the genetic algorithm (GA) on the four evaluation measures (legend: Proposed IDE Stream Algorithm, Genetic Algorithm (GA))

4. CONCLUSION

In this paper we presented an IDE Stream algorithm for automatic clustering of stream data. The primary contribution of this paper is the automatic detection of the optimal number of clusters for all data sets. Our test data are taken from the KDD CUP'99, forest cover type, and electric power consumption datasets. In our experiments, the proposed IDE algorithm outperformed the genetic algorithm on all four evaluation measures across the three datasets (Table 1). In our proposed approach, we also apply an entropy-based technique for detecting concept drift and thereby updating the clustering process. Our experimental results show that the proposed method is simple, practical and effective. The performance of our method is estimated based on the accuracy, precision, recall and F-measure values. In future work, we will study how to enhance the system to improve the precision, and we will extend the IDE algorithm to different types of attributes in the clustering domain.

REFERENCES

[1] B. Zhang, et al., “Data stream clustering based on Fuzzy C-Mean algorithm and entropy theory,” Signal Processing, pp. 111-116, 2015.
[2] J. Barddal, et al., “SNCStream+: Extending a high quality true anytime data stream clustering algorithm,” Journal of Information Systems, vol. 62, pp. 60-73, 2016.
[3] M. Hosseini, et al., “An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams,” Knowledge and Information Systems, vol/issue: 46(3), pp. 567-597, 2015.
[4] M. Khalilian, et al., “Data stream clustering by divide and conquer approach based on vector model,” Journal of Big Data, vol/issue: 3(1), 2016.
[5] J. A. Silva, et al., “An evolutionary algorithm for clustering data streams with a variable number of clusters,” Expert Systems with Applications, vol. 67, pp. 228-238, 2017.
[6] E. de Faria, et al., “Evaluation of Multiclass Novelty Detection Algorithms for Data Streams,” IEEE Transactions on Knowledge and Data Engineering, vol/issue: 27(11), pp. 2961-2973, 2015.
[7] C. Wang, et al., “SVStream: A Support Vector-Based Algorithm for Clustering Data Streams,” IEEE Transactions on Knowledge and Data Engineering, vol/issue: 25(6), pp. 1410-1424, 2013.
[8] D. Adeniyi, et al., “Automated web usage data mining and recommendation system using K-Nearest Neighbor (KNN) classification method,” Applied Computing and Informatics, vol/issue: 12(1), pp. 90-108, 2016.
[9] J. Zhang, et al., “Distributed data stream clustering algorithm based on affinity propagation,” Journal of Computer Applications, vol/issue: 33(9), pp. 2477-2481, 2013.
[10] K. Guo and Q. Zhang, “Fast Clustering-Based Anonymization Algorithm for Data Streams,” Journal of Software Engineering, vol/issue: 24(8), pp. 1852-1867, 2014.
[11] R. Hyde, et al., “Fully online clustering of evolving data streams into arbitrarily shaped clusters,” Journal of Information Sciences, vol. 382-383, pp. 96-114, 2017.
[12] N. Chaturvedi, et al., “An Improvement in K-mean Clustering Algorithm Using Better Time and Accuracy,” International Journal of Programming Languages and Applications, vol/issue: 3(4), pp. 13-19, 2013.
[13] M. Naldi and R. Campello, “Evolutionary k-means for distributed data sets,” Neuro Computing, vol. 127, pp. 30-42, 2014.
[14] V. S. Alves, et al., “Towards a Fast Evolutionary Algorithm for Clustering,” IEEE congress on evolutionary computation, IEEE Press, pp. 1776-1783, 2006.
[15] Y. Ping, et al., “Fast and scalable support vector clustering for large-scale data analysis,” Knowledge and Information Systems, vol/issue: 43(2), pp. 281-310, 2014.
[16] E. de Faria, et al., “MINAS: multiclass learning algorithm for novelty detection in data streams,” Data Mining and Knowledge Discovery, vol/issue: 30(3), pp. 640-680, 2015.
[17] P. D. and A. Dixit, “Multi Novel Class Classification of Feature Evolving Data Streams with J48,” International Journal of Computer Applications, vol/issue: 124(11), pp. 31-36, 2015.
[18] Y. Han, “Improved BIRCH Clustering Algorithm and Human Resource Management Efficiency: An Organizational Learning Perspective,” International Journal of Security and Its Applications, vol/issue: 10(8), pp. 385-394, 2016.
[19] Y. Liu, “Fuzzy-Clustering Web based on Mining,” Journal of Multimedia, vol/issue: 9(1), 2014.
[20] H. Lee, et al., “A MapReduce-based kNN Join Query Processing Algorithm for Analyzing Large-scale Data,” Journal of KIISE, vol/issue: 42(4), pp. 504-511, 2015.
[21] Z. Miller, et al., “Twitter spammer detection using data stream clustering,” Information Sciences, vol. 260, pp. 64-73, 2014.
[22] T. Velmurugan, “Performance based analysis between k-Means and Fuzzy C-Means clustering algorithms for connection oriented telecommunication data,” Applied Soft Computing, vol. 19, pp. 134-146, 2014.
[23] R. Fok, et al., “Mining Evolving Data Streams with Particle Filters,” Computational Intelligence, vol/issue: 33(2), pp. 147-180, 2015.
[24] V. Bhatnagar, et al., “Clustering data streams using grid-based synopsis,” Knowledge and Information Systems, vol/issue: 41(1), pp. 127-152, 2013.
[25] X. Yuan, et al., “A Genetic Algorithm-Based, Dynamic Clustering Method towards Improved WSN Longevity,” Journal of Network and Systems Management, vol/issue: 25(1), pp. 21-46, 2016.
[26] D. Marrón, et al., “Data stream classification using random feature functions and novel method combinations,” Journal of Systems and Software, vol. 127, pp. 195-204, 2017.

BIOGRAPHIES OF AUTHORS

Bhaskar Adepu is an Associate Professor at Kakatiya Institute of Technology & Science, Warangal, affiliated to Kakatiya University, India. He is pursuing his Ph.D. at Jawaharlal Nehru Technological University (JNTU), Hyderabad. He received his M.Tech. (CSE) from JNTU-Hyderabad in 2010. His research interests include Data Mining, Image Processing and Artificial Intelligence. He has delivered guest lectures in the fields of data mining and artificial intelligence at various platforms. He is a Member of IEEE and a member of ISTE.

Jayadev Gyani is an Assistant Professor at the College of Computer and Information Sciences, Majmaah University, Al Majmaah 15341, Saudi Arabia. He received his Ph.D. from the University of Hyderabad in 2009. He has published more than 80 refereed journal and conference articles in the areas of software engineering, data mining, web information systems and digital image processing.
Dr. Gyani was Program Co-chair and Vice Chair for a few international conferences. He is a Member of IEEE Computer Society and a Member of ACM.

Narsimha Gugulotu is a Professor and Head at JNTUH College of Engineering, Sultanpur, Jawaharlal Nehru Technological University, Hyderabad, India. He received his Ph.D. from Osmania University, Hyderabad in 2009. He has published more than 60 refereed journal and conference articles in the areas of Data Mining, Mobile Computing, Computer Networks, Cloud Computing and Big Data Analytics. Dr. Narsimha handles various administrative and academic responsibilities in various capacities. He is also a reviewer for a few international conferences. He is a Member of IEEE Computer Society and a Member of ISTE.