International Journal of Research in Computer Science
eISSN 2249-8265 Volume 2 Issue 1 (2011) pp. 39-43
© White Globe Publications
www.ijorcs.org


            QUALITY OF CLUSTER INDEX BASED ON
                 STUDY OF DECISION TREE
                       B. Rajasekhar¹, B. Sunil Kumar², Rajesh Vibhudi³, B. V. Rama Krishna⁴
                          ¹,² Assistant Professor, Jawaharlal Nehru Institute of Technology, Hyderabad
                                   ³ Sri Mittapalli College of Engineering, Guntur
                      ⁴ Associate Professor, St Mary’s College of Engineering & Technology, Hyderabad

   Abstract: Quality of clustering is an important issue in the application of clustering techniques. Most traditional cluster validity indices are geometry-based cluster quality measures. This work proposes a cluster validity index based on the decision-theoretic rough set model by considering various loss functions. Real-time retail data show the usefulness of the proposed validity index for the evaluation of rough and crisp clustering. The measure is shown to help determine the optimal number of clusters, as well as an important parameter called the threshold in rough clustering. Experiments with a promotional campaign for the retail data illustrate the ability of the proposed measure to incorporate financial considerations in evaluating the quality of a clustering scheme. This ability to deal with monetary values distinguishes the proposed decision-theoretic measure from other distance-based measures. The proposed validity index can also be used for evaluating other clustering algorithms such as fuzzy clustering.

Keywords: Clustering, Classification, Decision Tree, K-means

                      I.   INTRODUCTION

   Unsupervised clustering is a data mining technique that categorizes unlabeled objects into several clusters such that the objects belonging to the same cluster are more similar to one another than those belonging to different clusters. Conventional clustering assigns an object to exactly one cluster; a rough-set-based variation makes it possible to assign an object to more than one cluster [3]. Quality of clustering is an important issue in the application of clustering techniques to real-world data. A good measure of cluster quality will help in deciding the various parameters used in clustering algorithms. One such parameter, common to most clustering algorithms, is the number of clusters.

   Many different indices of cluster validity have been proposed. In general, indices of cluster validity fall into one of three categories. Some validity indices measure partition validity by evaluating the properties of the crisp structure imposed on the data by the clustering algorithm, such as the Dunn indices [7] and the Davies-Bouldin index [2]. These validity indices are based on a similarity measure of clusters whose bases are the dispersion measure of a cluster and the cluster dissimilarity measure. In the case of fuzzy clustering algorithms, some validity indices such as the partition coefficient [1] and classification entropy use only the information of fuzzy membership grades to evaluate clustering results. The third category consists of validity indices that make use of not only the fuzzy membership grades but also the structure of the data. All these validity indices are essentially based on the geometric characteristics of the clusters.

   In contrast, the decision-theoretic framework has been helpful in providing a better understanding of classification models [4]. The decision-theoretic rough set model considers various classes of loss functions; by adjusting the loss functions, it can also be extended to the multi-category problem. It is therefore possible to construct a cluster validity index by considering various loss functions based on decision theory. Such a measure has the added advantage of being applicable to rough-set-based clustering. This work describes how to develop a cluster validity index from the decision-theoretic rough set model. Based on decision theory, the proposed rough cluster validity index is taken as a function of the total risk of grouping objects using a clustering algorithm. Since crisp clustering [5] is a special case of rough clustering, the validity index is applicable to both rough clustering and crisp clustering. Experiments with synthetic and real-world data show the usefulness of the proposed validity index for the evaluation of rough clustering and crisp clustering.

                II.   CLUSTERING TECHNIQUE

   The clustering technique K-means [7] is a prototype-based, simple partitional clustering technique which attempts to find k non-overlapping clusters. These clusters are represented by their centroids (a cluster centroid is typically the mean of the points in the cluster). The clustering process of K-means is as follows. Firstly, k initial centroids are selected, where k is specified by the user and indicates the desired number of clusters.


   Secondly, every point in the data is assigned to the closest centroid, and each collection of points assigned to a centroid forms a cluster. The centroid of each cluster is then updated based on the points assigned to the cluster. This process is repeated until no point changes clusters.
A. Crisp Clustering Method: The objective of k-means is to assign n objects to k clusters. The process begins by randomly choosing k objects as the centroids of the k clusters. Each object is then assigned to one of the k clusters based on the minimum value of the distance $d(\vec{x}_j, \vec{c}_i)$ between the object vector $\vec{x}_j$ and the cluster centroid vector $\vec{c}_i$; the distance $d(\vec{x}_j, \vec{c}_i)$ can be the standard Euclidean distance.

After assignment of all the objects to the various clusters, the new centroid vectors of the clusters are calculated as

$$\vec{c}_i = \frac{\sum_{\vec{x}_j \in c_i} \vec{x}_j}{|c_i|}, \quad \text{where } 1 \le i \le k$$

Here $|c_i|$ is the cardinality of cluster $c_i$. The process stops when the centroids of the clusters stabilize, i.e., the centroid vectors from the previous iteration are identical to those generated in the current iteration.
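As an illustration of the procedure just described, the following is a minimal Python sketch of k-means; the function and variable names are ours rather than the paper's, and the random initialization mirrors the random choice of k objects as initial centroids.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array, k the desired number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly choose k objects as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each object to the closest centroid (standard Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the objects assigned to it.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop when the centroid vectors no longer change between iterations.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```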
B. Cluster Validity: A new validity index, conn_index, for prototype-based clustering of data sets is applicable to clusters with a wide variety of characteristics: clusters of different shapes, sizes, densities, and even overlaps. Conn_index is based on a weighted Delaunay triangulation called the "connectivity matrix".

   For crisp clustering, the Davies-Bouldin index and the generalized Dunn index are among the most commonly used indices; they depend on a separation measure between clusters and a distance-based measure of cluster compactness. When the clusters have a homogeneous density distribution, one effective approach to correctly evaluate the clustering of data sets is CDbw (composite density between and within clusters) [16]. CDbw finds prototypes for clusters instead of representing the clusters by their centroids, and calculates the validity measure based on inter- and intra-cluster densities and cluster separation.
C. Compactness of Clusters: Assume k clusters and N prototypes v in a data set, and let Ck and Cl be two different clusters, where 1 ≤ k, l ≤ K. The proposed CONN_Index is defined with the help of the Intra and Inter quantities, which measure compactness and separation, respectively. The compactness of Ck, Intra(Ck), is the ratio of the number of data vectors in Ck whose second BMU (best matching unit) is also in Ck to the number of data vectors in Ck. Intra(Ck) is defined by

$$\mathrm{Intra\_Conn}(C_k) = \frac{\sum_{i,j=1}^{N}\{CADJ(i,j) : v_i, v_j \in C_k\}}{\sum_{i,j=1}^{N}\{CADJ(i,j) : v_i \in C_k\}}$$

with Intra(Ck) ∈ [0, 1]. The greater the value of Intra, the greater the cluster compactness [1]. If the second BMUs of all data vectors in Ck are also in Ck, then Intra(Ck) = 1. The intra-cluster connectivity of all clusters (Intra) is the average compactness, given by

$$A = \frac{\sum_{k=1}^{K} \mathrm{Intra\_Conn}(C_k)}{K}$$
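A minimal sketch of how Intra_Conn could be computed, assuming a precomputed CADJ connectivity matrix between prototypes (e.g., counts of data vectors for which $v_i$ and $v_j$ are the first and second BMUs) and a prototype-to-cluster assignment; the function names are illustrative, not from the paper.

```python
import numpy as np

def intra_conn(cadj, labels, k):
    """Intra_Conn(C_k): fraction of C_k's connectivity weight that stays inside C_k.

    cadj   -- (N, N) connectivity (CADJ) matrix between prototypes
    labels -- length-N array mapping each prototype to its cluster
    k      -- cluster label to evaluate
    """
    in_k = labels == k
    numer = cadj[np.ix_(in_k, in_k)].sum()   # edges with both prototypes in C_k
    denom = cadj[in_k, :].sum()              # all edges starting from C_k
    return numer / denom if denom > 0 else 0.0

def average_intra_conn(cadj, labels):
    """Average compactness A over all clusters."""
    return float(np.mean([intra_conn(cadj, labels, k) for k in np.unique(labels)]))
```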
D. Cluster Quality: Several cluster validity indices have been proposed to evaluate the quality of clusters obtained by different clustering algorithms; an excellent summary of various validity measures is given in [10]. Below are two classical cluster validity indices and one used for fuzzy clusters.

1. Davies-Bouldin Index:
   This index [6] is a function of the ratio of the sum of within-cluster scatter to between-cluster separation. The scatter within the ith cluster, denoted by $S_{i,q}$, and the distance between clusters $\vec{c}_i$ and $\vec{c}_j$, denoted by $d_{ij,t}$, are defined as follows:

$$S_{i,q} = \left( \frac{1}{|c_i|} \sum_{\vec{x} \in c_i} \|\vec{x} - \vec{c}_i\|_2^q \right)^{1/q}$$

$$d_{ij,t} = \left\| \vec{c}_i - \vec{c}_j \right\|_t$$

where $\vec{c}_i$ is the center of the ith cluster and $|c_j|$ is the number of objects in $c_j$. The integers q and t can be selected independently such that q, t > 1. The Davies-Bouldin index for a clustering scheme (CS) is then defined as

$$DB(CS) = \frac{1}{k} \sum_{i=1}^{k} R_{i,qt}, \quad \text{where } R_{i,qt} = \max_{1 \le j \le k,\ j \ne i} \left\{ \frac{S_{i,q} + S_{j,q}}{d_{ij,t}} \right\}$$

The Davies-Bouldin index considers the average case of similarity between each cluster and the one that is most similar to it. A lower Davies-Bouldin index means a better clustering scheme.

2. Dunn Index:
   Dunn proposed another cluster validity index [7]. The index corresponding to a clustering scheme (CS) is defined by

$$D(CS) = \min_{1 \le i \le k} \left\{ \min_{1 \le j \le k,\ j \ne i} \left\{ \frac{\delta(\vec{c}_i, \vec{c}_j)}{\max_{1 \le q \le k} \Delta(\vec{c}_q)} \right\} \right\}$$

$$\delta(\vec{c}_i, \vec{c}_j) = \min_{1 \le i,j \le k,\ i \ne j} \left\| \vec{c}_i - \vec{c}_j \right\|, \qquad \Delta(\vec{c}_i) = \max_{\vec{x}_i, \vec{x}_t \in c_i} \left\| \vec{x}_i - \vec{x}_t \right\|$$

If a data set is well separated by a clustering scheme, the distance between the clusters, δ(ci, cj) (1 ≤ i, j ≤ k), is usually large, and the diameters of the clusters, Δ(ci) (1 ≤ i ≤ k), are expected to be small. Therefore, a large value of D(CS) corresponds to a good clustering scheme.
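To make the two indices concrete, here is a small Python sketch that computes both from a labeled data set, following the formulas above with q = t = 2 and the centroid-distance form of δ; it is our illustration, not code from the paper.

```python
import numpy as np

def davies_bouldin(X, labels, q=2, t=2):
    """Davies-Bouldin index DB(CS); lower values indicate a better scheme."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    # Within-cluster scatter S_{i,q} for each cluster.
    S = np.array([
        np.mean(np.linalg.norm(X[labels == k] - cents[i], axis=1) ** q) ** (1 / q)
        for i, k in enumerate(ks)
    ])
    R = []
    for i in range(len(ks)):
        # R_{i,qt} = max over j != i of (S_i + S_j) / d_{ij,t}.
        R.append(max(
            (S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j], ord=t)
            for j in range(len(ks)) if j != i
        ))
    return float(np.mean(R))

def dunn(X, labels):
    """Dunn index D(CS); higher values indicate a better scheme."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    # Cluster diameters: maximum pairwise distance within a cluster.
    diams = [
        np.linalg.norm(X[labels == k][:, None, :] - X[labels == k][None, :, :],
                       axis=2).max()
        for k in ks
    ]
    # Separations: distances between cluster centers.
    seps = [
        np.linalg.norm(cents[i] - cents[j])
        for i in range(len(ks)) for j in range(len(ks)) if i != j
    ]
    return float(min(seps) / max(diams))
```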



The main drawback of the Dunn index is that the calculation is computationally expensive and the index is sensitive to noise.
                  III.   DECISION TREE

A. Decision Tree

   A decision tree depicts rules for classifying data into groups. The first rule splits the entire data set into some number of pieces; another rule may then be applied to a piece, and different rules to different pieces, forming a second generation of pieces. The tree depicts the first split into pieces as branches emanating from a root, and subsequent splits as branches emanating from nodes on older branches. The leaves of the tree are the final groups, the unsplit nodes. For some perverse reason, trees are always drawn upside down, like an organizational chart. For a tree to be useful, the data in a leaf must be similar with respect to some target measure, so that the tree represents the segregation of a mixture of data into purified groups.

   Consider an example of data collected on people in a city park in the vicinity of a hotdog and ice cream stand. The owner of the concession stand wants to know what predisposes people to buy ice cream. Among all the people observed, forty percent buy ice cream. This is represented in the root node of the tree at the top of the diagram [9]. The first rule splits the data according to the weather: unless it is sunny and hot, only five percent buy ice cream. This is represented in the leaf on the left branch. On sunny and hot days, sixty percent buy ice cream. The tree represents this population as an internal node that is further split into two branches, one of which is split again.

                     40% buy ice cream (root)
       Sunny and hot? No  →  5% buy ice cream
       Sunny and hot? Yes → 60% buy ice cream
           Have extra money? Yes → 80% buy ice cream
           Have extra money? No  → 30% buy ice cream
               Crave ice cream? Yes → 70% buy ice cream
               Crave ice cream? No  → 10% buy ice cream

              Figure 1 Example of Decision Tree
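Read as rules, the tree in Figure 1 amounts to nested conditionals; the sketch below encodes the example's splits (the exact nesting of the third split is our reading of the figure, and the function is illustrative):

```python
def predicted_buy_rate(sunny_and_hot, has_extra_money, craves_ice_cream):
    """Return the fraction of people expected to buy ice cream under the example tree."""
    if not sunny_and_hot:
        return 0.05   # not sunny and hot: 5% buy
    if has_extra_money:
        return 0.80   # sunny and hot, with extra money: 80% buy
    # Sunny and hot but no extra money (30% overall): split again on craving.
    return 0.70 if craves_ice_cream else 0.10
```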
B. Yao’s Model for the Decision Tree:

   The model consists of two parties holding values x Є X and y Є Y, respectively, who can communicate with each other and would like to compute a function f: X×Y → {0, 1} at (x, y) with a minimal amount of interaction between them. Interaction is some measure of communication between the two parties, usually the total number of bits exchanged between them. The classification of objects according to approximation operators in rough set theory can be easily fitted into the Bayesian decision-theoretic framework. Let Ω = {A, Ac} denote the set of states indicating that an object is in A and not in A, respectively. Let A = {a1, a2, a3} be the set of actions, where a1, a2, a3 represent the three actions in classifying an object: deciding POS(A), deciding NEG(A), and deciding BND(A), respectively.
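The three actions map naturally onto a three-way decision rule. The sketch below is a minimal illustration; in the decision-theoretic rough set model the two thresholds would be derived from the loss functions, and the values used here are assumptions.

```python
def three_way_decision(p, alpha=0.7, beta=0.3):
    """Classify an object given probability p that it belongs to A.

    alpha and beta are illustrative thresholds; in the decision-theoretic
    rough set model they are derived from the loss functions.
    """
    if p >= alpha:
        return "POS"   # action a1: accept, the object is in A
    if p <= beta:
        return "NEG"   # action a2: reject, the object is not in A
    return "BND"       # action a3: defer, boundary region
```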
C. Implementation of the CRISP-DM:

   CRISP-DM is based on the process flow shown in Figure 2. The model proposes the following steps:

 1. Business Understanding – to understand the rules and business objectives of the company.
 2. Understanding Data – to collect and describe data.
 3. Data Preparation – to prepare data for import into the software.
 4. Modelling – to select the modelling technique to be used.
 5. Evaluation – to evaluate the process to see if the technique solves the problem of modelling and creation of rules.
 6. Deployment – to deploy the system and train its users.

              Figure 2 Example of Crisp data mining

                 IV.   PROPOSED SYSTEM

   We consider an unsupervised classification method where the only available data are unlabeled, so the number of clusters needs to be known. A cluster validity measure can provide some information about the appropriate number of clusters. Our solution makes it possible to construct a cluster validity measure by considering various loss functions based on decision theory.



       Dataset → Cluster Validity Measure → Decision Tree + Loss Function → Result

                Figure 3 The proposed system ([6])
                                                                 outlier detection such as algorithm designers have had
   We chose K-means clustering because 1) it is a data-driven method with relatively few assumptions on the distributions of the underlying data, and 2) the greedy search strategy of K-means guarantees at least a local minimum of the criterion function, thereby accelerating the convergence of clusters on large datasets.
A. Cluster Quality on Decision Theory:

   Unsupervised learning techniques are applied when the only available data are unlabeled, and the algorithms need to know the number of clusters. Cluster validity measures such as Davies-Bouldin can help us assess whether a clustering method accurately represents the structure of the data set, and there are several cluster indices to evaluate crisp and fuzzy clustering. The decision-theoretic framework has been helpful in providing a better understanding of the classification model. The decision-theoretic rough set model considers various classes of loss functions, and its extension to the multi-category case makes it possible to construct a cluster validity measure by considering various loss functions based on decision theory.

   Within a given set of objects there may be clusters such that objects in the same cluster are more similar than those in different clusters. Clustering is the task of finding the right groups or clusters for the given set of objects. Finding the right clusters requires an exponential number of comparisons and has been proved to be NP-hard. For defining the framework, we assume a partition of a set of objects X = {x1, …, xn} into clusters CS = {c1, …, ck}; the k-means algorithm approximates the actual clustering. It is possible that each object may not necessarily belong to only one cluster. However, corresponding to each cluster within the clustering scheme there will be a hypothetical core, whose centroid will be used as the cluster core. Let core(ci) be the core of the cluster ci, which is used to calculate the centroid of the cluster. Any x Є core(ci) cannot belong to other clusters. Therefore, core(ci) can be considered the best representation of ci to a certain extent.
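As a rough illustration of this idea, the sketch below scores a clustering by a total risk: each object's assignment incurs an expected loss, and the index is the sum of these losses over all objects. The loss values and the distance-based probability proxy are our illustrative assumptions, not the paper's actual loss functions.

```python
import numpy as np

def total_risk(X, labels, centroids, loss_correct=0.0, loss_wrong=1.0):
    """Illustrative decision-theoretic score: expected loss summed over all objects.

    The 'probability' that an object belongs to its assigned cluster is
    approximated here by the relative closeness of its own centroid versus
    the nearest other centroid (an assumed proxy, for illustration only).
    """
    risk = 0.0
    for x, k in zip(X, labels):
        d = np.linalg.norm(centroids - x, axis=1)
        d_own = d[k]
        d_other = np.min(np.delete(d, k))
        # Closer own centroid -> higher membership probability.
        p_own = d_other / (d_own + d_other)
        # Expected loss of keeping the object in its assigned cluster.
        risk += p_own * loss_correct + (1 - p_own) * loss_wrong
    return risk
```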
                                                                 known number of clusters and boundary region
B. Comparison of Clustering and Classification:

   Clustering works well for finding unlabeled clusters in small to large sets of data points. An advantage of the K-means algorithm is its favorable execution time, though the user has to know in advance how many clusters are to be searched for; K-means is data driven and is efficient for smaller data sets and for anomaly detection. Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster. Clustering requires the distance between every pair of objects only once and uses the distances at every stage of the iteration.

   Compared to clustering [8], classification algorithms perform efficiently for complex datasets and for noise and outlier detection; for example, algorithm designers have had much success with equal-width and equal-depth methods as approaches to building class descriptions. We chose decision tree learners, made popular by ID3, C4.5, and CART, because they are relatively fast and typically produce competitive classifiers. In fact, the decision tree generator C4.5, a successor to ID3, has become a standard for comparison in machine learning research, because it produces good classifiers quickly. For non-numeric datasets, the growth of the run time of ID3 (and C4.5) is linear in the number of examples.

   The practical run-time complexity of C4.5 has been determined empirically to be worse than O(e²) on some datasets. One possible explanation is based on the observation of Oates and Jensen (1998) that the size of C4.5 trees increases linearly with the number of examples. One of the factors in C4.5's run-time complexity corresponds to the tree depth, which cannot be larger than the number of attributes. Tree depth is related to tree size, and thereby to the number of examples. When compared with C4.5, the run-time complexity of CART is satisfactory.


                  V.   CONCLUSION

   We have presented a cluster quality index based on decision theory; the proposal uses a loss function to construct the quality index. The cluster quality is therefore evaluated by considering the total risk of classifying all the objects. Such a decision-theoretic representation of cluster quality may be more useful in business-oriented data mining than traditional geometry-based cluster quality measures. In addition to evaluating crisp clustering, the proposal is an evaluation measure for rough clustering. This is the first measure that takes into account the special features of rough clustering that allow an object to belong to more than one cluster. The measure is shown to be useful in determining important aspects of a clustering exercise, such as the appropriate number of clusters and the size of the boundary region. The application of the measure to synthetic data with a known number of clusters and boundary region lends credence to the proposal.

   A real advantage of the decision-theoretic cluster validity measure is its ability to include monetary considerations in evaluating a clustering scheme. Use of the measure to derive an appropriate clustering scheme for a promotional campaign in a retail store highlighted its unique ability to include cost and benefit considerations in commercial data mining. We can also extend it to evaluating other clustering algorithms such as fuzzy clustering. Such a cluster validity measure can be useful in further theoretical development in clustering.

                 VI.   REFERENCES



[1] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, 1981.
[2] D.L. Davies and D.W. Bouldin, "A Cluster Separation Measure," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224-227, Apr. 1979.
[3] J.C. Dunn, "Well Separated Clusters and Optimal Fuzzy Partitions," J. Cybernetics, vol. 4, pp. 95-104, 1974.
[4] S. Hirano and S. Tsumoto, "On Constructing Clusters from Non-Euclidean Dissimilarity Matrix by Using Rough Clustering," Proc. Japanese Soc. for Artificial Intelligence (JSAI) Workshops, pp. 5-16, 2005.
[5] T.B. Ho and N.B. Nguyen, "Nonhierarchical Document Clustering by a Tolerance Rough Set Model," Int'l J. Intelligent Systems, vol. 17, no. 2, pp. 199-212, 2002.
[6] P. Lingras, M. Chen, and D. Miao, "Rough Cluster Quality Index Based on Decision Theory," IEEE Trans. Knowledge and Data Engineering, vol. 21, no. 7, July 2009.
[7] W. Pedrycz and J. Waletzky, "Fuzzy Clustering with Partial Supervision," IEEE Trans. Systems, Man, and Cybernetics, vol. 27, no. 5, pp. 787-795, Sept. 1997.
[8] "Partition Algorithms – A Study and Emergence of Mining Projected Clusters in High-Dimensional Dataset," International Journal of Computer Science and Telecommunications, vol. 2, issue 4, July 2011.
[9] D.D. Jensen and P.R. Cohen, "Multiple Comparisons in Induction Algorithms," Machine Learning, 1999 (to appear). An excellent discussion of the bias inherent in selecting an input. See http://www.cs.umass.edu/~jensen/papers.




