Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series

SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.

PUBLISHED TITLES
UNDERSTANDING COMPLEX DATASETS: Data Mining with Matrix Decompositions
David Skillicorn
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series

Computational Methods of Feature Selection

Edited by
Huan Liu • Hiroshi Motoda
Preface
It has been ten years since we published our first two books on feature se-
lection in 1998. In the past decade, we witnessed a great expansion of feature
selection research in multiple dimensions. We experienced the fast data evolu-
tion in which extremely high-dimensional data, such as high-throughput data
of bioinformatics and Web/text data, became increasingly common. They
stretch the capabilities of conventional data processing techniques, pose new
challenges, and stimulate accelerated development of feature selection research
in two major ways. One trend is to improve and expand the existing tech-
niques to meet the new challenges. The other is to develop brand new algo-
rithms directly targeting the arising challenges. In this process, we observe
many feature-selection-centered activities, such as one well-received competi-
tion, two well-attended tutorials at top conferences, and two multi-disciplinary
workshops, as well as a special development section in a recent issue of IEEE
Intelligent Systems, to name a few.
This collection bridges the widening gap between existing texts and the
rapid developments in the field, by presenting recent research works from var-
ious disciplines. It features excellent survey work, practical guides, exciting
new directions, and comprehensive tutorials from leading experts. The book
also presents easy-to-understand illustrations, state-of-the-art methodologies,
and algorithms, along with real-world case studies ranging from text classi-
fication, to Web mining, to bioinformatics where high-dimensional data are
pervasive. Some vague ideas suggested in our earlier book have been de-
veloped into mature areas with solid achievements, along with progress that
could not have been imagined ten years ago. With the steady and speedy
development of feature selection research, we sincerely hope that this book
presents distinctive and representative achievements; serves as a convenient
point for graduate students, practitioners, and researchers to further the re-
search and application of feature selection; and sparks a new phase of feature
selection research. We are truly optimistic about the impact of feature selec-
tion on massive, high-dimensional data and processing in the near future, and
we have no doubt that in another ten years, when we look back, we will be
humbled by the newfound power of feature selection, and by its indelible con-
tributions to machine learning, data mining, and many real-world challenges.
Huan Liu and Hiroshi Motoda
Acknowledgments
The inception of this book project was during SDM 2006’s feature selec-
tion workshop. Randi Cohen, an editor of Chapman and Hall/CRC Press,
eloquently convinced one of us that it was time for a new book on feature
selection. Since then, she worked closely with us to make the process easier
and smoother and allowed us to stay focused. With Randi’s kind and expert
support, we were able to adhere to the planned schedule when facing unex-
pected difficulties. We truly appreciate her generous support throughout the
project.
This book is a natural extension of the two successful feature selection
workshops held at SDM 2005¹ and SDM 2006.² The success would not be
a reality without the leadership of two workshop co-organizers (Robert Stine
of Wharton School and Leonard Auslender of SAS); the meticulous work of
the proceedings chair (Lei Yu of Binghamton University); and the altruistic
efforts of PC members, authors, and contributors. We take this opportunity
to thank all who helped to advance the frontier of feature selection research.
The authors, contributors, and reviewers of this book played an instru-
mental role in this project. Given the limited space of this book, we could
not include all quality works. Reviewers’ detailed comments and constructive
suggestions significantly helped improve the book’s consistency in content,
format, comprehensibility, and presentation. We thank the authors, who patiently and promptly accommodated our (sometimes many) requests.
We would also like to express our deep gratitude for the gracious help we
received from our colleagues and students, including Zheng Zhao, Lei Tang,
Quan Nguyen, Payam Refaeilzadeh, and Shankara B. Subramanya of Arizona
State University; Kozo Ohara of Osaka University; and William Nace and
Kenneth Gorreta of AFOSR/AOARD, Air Force Research Laboratory.
Last but not least, we thank our families for their love and support. We
are grateful and happy that we can now spend more time with our families.
Huan Liu and Hiroshi Motoda
¹ The 2005 proceedings are at http://enpub.eas.asu.edu/workshop/.
² The 2006 proceedings are at http://enpub.eas.asu.edu/workshop/2006/.
Contributors

Jesús S. Aguilar-Ruiz
Pablo de Olavide University, Seville, Spain

Jennifer G. Dy
Northeastern University, Boston, Massachusetts

Constantin F. Aliferis
Vanderbilt University, Nashville, Tennessee

André Elisseeff
IBM Research, Zürich, Switzerland

Paolo Avesani
ITC-IRST, Trento, Italy

Susana Eyheramendy
Ludwig-Maximilians Universität München, Germany

Susan M. Bridges
Mississippi State University, Mississippi

George Forman
Hewlett-Packard Labs, Palo Alto, California

Alexander Borisov
Intel Corporation, Chandler, Arizona

Lise Getoor
University of Maryland, College Park, Maryland

Shane Burgess
Mississippi State University, Mississippi

Dimitrios Gunopulos
University of California, Riverside

Diana Chan
Mississippi State University, Mississippi

Isabelle Guyon
ClopiNet, Berkeley, California

Claudia Diamantini
Università Politecnica delle Marche, Ancona, Italy

Trevor Hastie
Stanford University, Stanford, California

Rezarta Islamaj Dogan
University of Maryland, College Park, Maryland, and National Center for Biotechnology Information, Bethesda, Maryland

Joshua Zhexue Huang
University of Hong Kong, Hong Kong, China

Carlotta Domeniconi
George Mason University, Fairfax, Virginia

Mohamed Kamel
University of Waterloo, Ontario, Canada

Igor Kononenko
University of Ljubljana, Ljubljana, Slovenia

Wei Tang
Florida Atlantic University, Boca Raton, Florida

David Madigan
Rutgers University, New Brunswick, New Jersey

Kari Torkkola
Motorola Labs, Tempe, Arizona

Masoud Makrehchi
University of Waterloo, Ontario, Canada

Eugene Tuv
Intel Corporation, Chandler, Arizona

Michael Ng
Hong Kong Baptist University, Hong Kong, China

Sriharsha Veeramachaneni
ITC-IRST, Trento, Italy

Emanuele Olivetti
ITC-IRST, Trento, Italy

W. John Wilbur
National Center for Biotechnology Information, Bethesda, Maryland

Domenico Potena
Università Politecnica delle Marche, Ancona, Italy

Jun Xu
Georgia Institute of Technology, Atlanta, Georgia

José C. Riquelme
University of Seville, Seville, Spain

Yunming Ye
Harbin Institute of Technology, Harbin, China

Roberto Ruiz
Pablo de Olavide University, Seville, Spain

Lei Yu
Binghamton University, Binghamton, New York

Marko Robnik Šikonja
University of Ljubljana, Ljubljana, Slovenia

Shi Zhong
Yahoo! Inc., Sunnyvale, California

David J. Stracuzzi
Arizona State University, Tempe, Arizona

Hui Zou
University of Minnesota, Minneapolis

Yijun Sun
University of Florida, Gainesville, Florida
Chapter 1
Less Is More
Huan Liu
Arizona State University
Hiroshi Motoda
AFOSR/AOARD, Air Force Research Laboratory
1.1 Background and Basics
1.2 Supervised, Unsupervised, and Semi-Supervised Feature Selection
1.3 Key Contributions and Organization of the Book
1.4 Looking Ahead
References
As our world expands at an unprecedented speed from the physical into the
virtual, we can conveniently collect more and more data in any way one can
imagine, for various reasons. Is it “The more, the merrier (better)”? The
answer is “Yes” and “No.” It is “Yes” because we can at least get what we
might need. It is also “No” because, when it comes to a point of too much,
the existence of inordinate data is tantamount to non-existence if there is no
means of effective data access. More can mean less. Without processing, the mere existence of data does not become a useful asset that can impact our business and many other matters. Since continued data accumulation
is inevitable, one way out is to devise data selection techniques to keep pace
with the rate of data collection. Furthermore, given the sheer volume of data,
data generated by computers or equivalent mechanisms must be processed
automatically, in order for us to tame the data monster and stay in control.
Recent years have seen extensive efforts in feature selection research. The
field of feature selection expands both in depth and in breadth, due to in-
creasing demands for dimensionality reduction. The evidence can be found
in many recent papers, workshops, and review articles. The research expands
from classic supervised feature selection to unsupervised and semi-supervised
feature selection, to selection of different feature types such as causal and
structural features, to different kinds of data like high-throughput, text, or
images, to feature selection evaluation, and to wide applications of feature
selection where data abound.
No book of this size could possibly document the extensive efforts in the
frontier of feature selection research. We thus try to sample the field in several
ways: asking established experts, calling for submissions, and looking at the
recent workshops and conferences, in order to understand the current devel-
opments. As this book aims to serve a wide audience from practitioners to
researchers, we first introduce the basic concepts and the essential problems
with feature selection; next illustrate feature selection research in parallel
to supervised, unsupervised, and semi-supervised learning; then present an
overview of feature selection activities included in this collection; and last
contemplate some issues about evolving feature selection. The book is orga-
nized in five parts: (I) Introduction and Background, (II) Extending Feature
Selection, (III) Weighting and Local Methods, (IV) Text Feature Selection,
and (V) Feature Selection in Bioinformatics. These five parts are relatively
independent and can be read in any order. For a newcomer to the field of fea-
ture selection, we recommend that you read Chapters 1, 2, 9, 13, and 17 first,
then decide on which chapters to read further according to your need and in-
terest. Rudimentary concepts and discussions of related issues such as feature
extraction and construction can also be found in two earlier books [10, 9].
Instance selection can be found in [11].
1.1 Background and Basics
One of the fundamental motivations for feature selection is the curse of
dimensionality [6]. Plainly speaking, two close data points in a 2-d space are
likely distant in a 100-d space (refer to Chapter 2 for an illustrative example).
For the case of classification, this makes it difficult to make a prediction of
unseen data points by a hypothesis constructed from a limited number of
training instances. The number of features is a key factor that determines the
size of the hypothesis space containing all hypotheses that can be learned from
data [13]. A hypothesis is a pattern or function that predicts classes based
on given data. The more features, the larger the hypothesis space. Worse
still, a linear increase in the number of features leads to an exponential increase in the size of the hypothesis space. For example, for N binary features and a binary class feature, the hypothesis space is as big as $2^{2^N}$. Therefore, feature
selection can efficiently reduce the hypothesis space by removing irrelevant
and redundant features. The smaller the hypothesis space, the easier it is
to find correct hypotheses. Given a fixed-size data sample that is part of the
underlying population, the reduction of dimensionality also lowers the number
of required training instances. For example, given M, when the number of
binary features N = 10 is reduced to N = 5, the ratio of $M/2^N$ increases
exponentially. In other words, it virtually increases the number of training
instances. This helps to better constrain the search of correct hypotheses.
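As a rough numerical illustration of these two effects (the numbers below are arbitrary and not taken from the text), the following Python sketch prints the size of the Boolean hypothesis space and the sample-to-space ratio M/2^N as N shrinks:

```python
# Illustrative only: how reducing N shrinks the hypothesis space and raises M/2^N.
import math

def hypothesis_space_digits(n_features: int) -> float:
    """log10 of 2^(2^n), the number of Boolean functions over n binary features."""
    return (2 ** n_features) * math.log10(2)

M = 100  # an arbitrary, fixed number of training instances
for N in (10, 5):
    print(f"N = {N:2d}: |hypothesis space| = 2^(2^{N}) "
          f"(about 10^{hypothesis_space_digits(N):.0f}), M/2^N = {M / 2**N:.3f}")
```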
Feature selection is essentially a task to remove irrelevant and/or redun-
dant features. Irrelevant features can be removed without affecting learning
performance [8]. Redundant features are a type of irrelevant feature [16]. The
distinction is that a redundant feature implies the co-presence of another fea-
ture; individually, each feature is relevant, but the removal of one of them will
not affect learning performance. The selection of features can be achieved
in two ways: One is to rank features according to some criterion and select
the top k features, and the other is to select a minimum subset of features
without learning performance deterioration. In other words, subset selection
algorithms can automatically determine the number of selected features, while
feature ranking algorithms need to rely on some given threshold to select fea-
tures. An example of feature ranking algorithms is detailed in Chapter 9. An
example of subset selection can be found in Chapter 17.
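To make the contrast concrete, here is a minimal sketch (assuming scikit-learn is available; the synthetic dataset and the choice of k are arbitrary) in which a ranking method keeps a user-specified top k while a subset-style method decides the number of features itself:

```python
# Contrast: ranking (caller supplies k) vs. subset search (algorithm picks the size).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)

# Feature ranking: score each feature independently, keep a user-chosen top k.
ranker = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print("ranking keeps:", sorted(ranker.get_support(indices=True)))

# Subset-style selection: recursive elimination with cross-validation decides
# how many features to keep; no k is required from the user.
subset = RFECV(LogisticRegression(max_iter=1000), cv=5).fit(X, y)
print("subset search keeps:", subset.n_features_, "features")
```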
Other important aspects of feature selection include models, search strate-
gies, feature quality measures, and evaluation [10]. The three typical models
are filter, wrapper, and embedded. An embedded model of feature selection
integrates the selection of features in model building. An example of such a
model is the decision tree induction algorithm, in which at each branching
node, a feature has to be selected. The research shows that even for such
a learning algorithm, feature selection can result in improved learning per-
formance. In a wrapper model, one employs a learning algorithm and uses
its performance to determine the quality of selected features. As shown in
Chapter 2, filter and wrapper models are not confined to supervised feature
selection, and can also apply to the study of unsupervised feature selection
algorithms.
Search strategies [1] have been investigated and various strategies proposed,
including forward, backward, floating, branch-and-bound, and randomized.
If one starts with an empty feature subset and adds relevant features into
the subset following a procedure, it is called forward selection; if one begins
with a full set of features and removes features procedurally, it is backward
selection. Given a large number of features, either strategy might be too costly
to work. Take the example of forward selection. Since k is usually unknown
a priori, one needs to try $\binom{N}{1} + \binom{N}{2} + \cdots + \binom{N}{k}$ times in order to figure out k out of N features for selection. Therefore, its time complexity is $O(2^N)$.
Hence, more efficient algorithms are developed. The widely used ones are
sequential strategies. A sequential forward selection (SFS) algorithm selects
one feature at a time until adding another feature does not improve the subset
quality with the condition that a selected feature remains selected. Similarly,
a sequential backward selection (SBS) algorithm eliminates one feature at a
time and once a feature is eliminated, it will never be considered again for
inclusion. Obviously, both search strategies are heuristic in nature and cannot
guarantee the optimality of the selected features. Among alternatives to these
strategies are randomized feature selection algorithms, which are discussed in
Chapter 3. A relevant issue regarding exhaustive and heuristic searches is
whether there is any reason to perform exhaustive searches if time complexity
were not a concern. Research shows that exhaustive search can lead to features that exacerbate data overfitting, while heuristic search is less prone to data overfitting in feature selection when facing small data samples.
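The sketch below shows one simple way sequential forward selection could be wrapped around a learner; the classifier, scoring scheme, and stopping rule are illustrative assumptions rather than a prescribed implementation:

```python
# Sequential forward selection (SFS) sketch: greedily add the feature that most
# improves a wrapper score; stop when no addition helps. Illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sfs(X, y, estimator, cv=5):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -float("inf")
    while remaining:
        # Score each candidate feature when added to the current subset.
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}
        f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
        if s_best <= best_score:      # adding another feature no longer helps
            break
        selected.append(f_best)       # once selected, a feature stays selected
        remaining.remove(f_best)
        best_score = s_best
    return selected, best_score

X, y = make_classification(n_samples=200, n_features=15, n_informative=4,
                           random_state=0)
chosen, score = sfs(X, y, KNeighborsClassifier())
print("selected features:", chosen, "cv accuracy:", round(score, 3))
```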
The small sample problem addresses a new type of “wide” data where the
number of features (N) is several orders of magnitude larger than the number of instances (M). High-throughput data produced in genomics and pro-
teomics and text data are typical examples. In connection to the curse of
dimensionality mentioned earlier, the wide data present challenges to the reli-
able estimation of the model’s performance (e.g., accuracy) and to model selection, and increase the risk of data overfitting. In [3], a pithy illustration of the small sample problem
is given with detailed examples.
The evaluation of feature selection often entails two tasks. One is to com-
pare two cases: before and after feature selection. The goal of this task is to
observe if feature selection achieves its intended objectives (recall that feature
selection is not confined to improving classification performance). The
aspects of evaluation can include the number of selected features, time, scala-
bility, and learning model’s performance. The second task is to compare two
feature selection algorithms to see if one is better than the other for a certain
task. A detailed empirical study is reported in [14]. As we know, there is
no universally superior feature selection algorithm, and different feature selection algo-
rithms have their special edges for various applications. Hence, it is wise to
find a suitable algorithm for a given application. An initial attempt to ad-
dress the problem of selecting feature selection algorithms is presented in [12],
aiming to mitigate the increasing complexity of finding a suitable algorithm
from many feature selection algorithms.
Another issue arising from feature selection evaluation is feature selection
bias. Using the same training data in both feature selection and classifica-
tion learning can result in this selection bias. According to statistical theory
based on regression research, this bias can exacerbate data over-fitting and
negatively affect classification performance. A recommended practice is to
use separate data for feature selection and for learning. In reality, however,
separate datasets are rarely used in the selection and learning steps. This is
because we want to use as much data as possible in both selection and learning.
Dividing the training data into two datasets runs against this intuition, as it reduces the data available for both tasks. Feature selection bias is studied in [15] to examine whether there is a discrepancy between the current practice and the
statistical theory. The findings are that the statistical theory is correct, but
feature selection bias has limited effect on feature selection for classification.
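One practical way to respect the statistical recommendation without discarding data is to re-run the selection step inside each cross-validation training fold. The sketch below (a hedged scikit-learn example, not the protocol used in [15]) contrasts selecting features once on all the data with selecting them per fold:

```python
# Avoiding feature selection bias: select features inside each CV training fold
# rather than once on the full dataset. Minimal illustration with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           random_state=0)

# Biased estimate: features are chosen once, using all of the data, including
# the instances later used as test folds inside cross-validation.
X_reduced = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(LinearSVC(max_iter=5000), X_reduced, y, cv=5).mean()

# Less biased estimate: selection is refit on each training fold via a pipeline.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("clf", LinearSVC(max_iter=5000))])
unbiased = cross_val_score(pipe, X, y, cv=5).mean()

print(f"accuracy with selection outside CV: {biased:.3f}")
print(f"accuracy with selection inside CV:  {unbiased:.3f}")
```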
Recently, researchers have started paying attention to interacting features [7].
Feature interaction usually defies heuristic feature selection solutions that evaluate individual features for efficiency. This is because interacting fea-
tures exhibit properties that cannot be detected in individual features. One
simple example of interacting features is the XOR problem, in which both
features together determine the class and each individual feature does not tell
much at all. By combining careful selection of a feature quality measure and
design of a special data structure, one can heuristically handle some feature
interaction as shown in [17]. The randomized algorithms detailed in Chapter 3
may provide an alternative. An overview of various additional issues related
to improving classification performance can be found in [5]. Since there are
many facets of feature selection research, we choose a theme that runs in par-
allel with supervised, unsupervised, and semi-supervised learning below, and
discuss and illustrate the underlying concepts of disparate feature selection
types, their connections, and how they can benefit from one another.
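To make the XOR point concrete, the following small sketch (a synthetic two-feature example, not taken from [7] or [17]) shows that each feature alone carries essentially no information about the class while the pair determines it completely:

```python
# XOR interaction: individually each feature looks irrelevant; together they
# determine the class exactly. Synthetic illustration.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
f1 = rng.integers(0, 2, size=10000)
f2 = rng.integers(0, 2, size=10000)
label = f1 ^ f2                      # class is the XOR of the two features

print("I(f1; class)       =", round(mutual_info_score(f1, label), 4))          # ~0
print("I(f2; class)       =", round(mutual_info_score(f2, label), 4))          # ~0
print("I((f1,f2); class)  =", round(mutual_info_score(f1 + 2 * f2, label), 4)) # ~log 2
```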
1.2 Supervised, Unsupervised, and Semi-Supervised Feature Selection
In one of the early surveys [2], all algorithms are supervised in the sense
that data have class labels (denoted as Xl). Supervised feature selection al-
gorithms rely on measures that take into account the class information. A
well-known measure is information gain, which is widely used in both feature
selection and decision tree induction. Assuming there are two features F1 and
F2, we can calculate feature Fi’s information gain as E0 − Ei, where E is
entropy. E0 is the entropy before the data split using feature Fi, and can be
calculated as $E_0 = -\sum_{c=1}^{C} p_c \log p_c$, where $p_c$ is the estimated probability of class $c$. $E_i$ is the entropy after the data split using $F_i$. A better
feature can result in larger information gain. Clearly, class information plays
a critical role here. Another example is the algorithm ReliefF, which also uses
the class information to determine an instance’s “near-hit” (a neighboring in-
stance having the same class) and “near-miss” (a neighboring instance having
a different class). More details about ReliefF can be found in Chapter 9. In
essence, supervised feature selection algorithms try to find features that help
separate data of different classes; we call this class-based separation. If a
feature has no effect on class-based separation, it can be removed. A good
feature should, therefore, help enhance class-based separation.
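A small sketch of the information gain computation described above, using binary features and classes purely for illustration:

```python
# Information gain of a feature: entropy before the split minus the weighted
# entropy after splitting on the feature's values. Illustrative sketch.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    e0 = entropy(labels)                       # entropy before the split
    ei = sum((feature == v).mean() * entropy(labels[feature == v])
             for v in np.unique(feature))      # weighted entropy after the split
    return e0 - ei

# Tiny example: F1 separates the classes well, F2 does not.
y  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
F1 = np.array([0, 0, 0, 1, 1, 1, 1, 1])
F2 = np.array([0, 1, 0, 1, 0, 1, 0, 1])
print("gain(F1) =", round(information_gain(F1, y), 3))  # high
print("gain(F2) =", round(information_gain(F2, y), 3))  # zero
```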
In the late 90’s, research on unsupervised feature selection intensified in
order to deal with data without class labels (denoted as Xu). It is closely
related to unsupervised learning [4]. One example of unsupervised learning is
clustering, where similar instances are grouped together and dissimilar ones
are separated. Similarity can be defined by the distance between two
instances. Conceptually, the two instances are similar if the distance between
the two is small; otherwise, they are dissimilar. When all instances are connected pairwise, breaking the connections between those instances that are far apart will form clusters. Hence, clustering can be thought of as achieving
locality-based separation. One widely used clustering algorithm is k-means.
It is an iterative algorithm that categorizes instances into k clusters. Given
predetermined k centers (or centroids), it works as follows: (1) Instances are
assigned to their closest centroid, (2) the centroids are recalculated using
the instances in each cluster, and (3) the first two steps are repeated until the
centroids do not change. Obviously, the key concept is distance calculation,
which is sensitive to dimensionality, as we discussed earlier about the curse of
dimensionality. Basically, if there are many irrelevant or redundant features,
clustering will be different from that with only relevant features. One toy
example can be found in Figure 1.1 in which two well-formed clusters in a 1-d
space (x) become two different clusters (denoted with different shapes, circles
vs. diamonds) in a 2-d space after introducing an irrelevant feature y. Unsu-
pervised feature selection is more difficult to deal with than supervised feature
selection. However, it is also a very useful tool, as the majority of data are
unlabeled. A comprehensive introduction and review of unsupervised feature
selection is presented in Chapter 2.
FIGURE 1.1: An illustrative example: left - two well-formed clusters; middle -
after an irrelevant feature is added; right - after applying 2-means clustering.
When a small number of instances are labeled but the majority are not,
semi-supervised feature selection is designed to take advantage of both the
large number of unlabeled instances and the labeling information as in semi-
supervised learning. Intuitively, the additional labeling information should
help constrain the search space of unsupervised feature selection. In other
words, semi-supervised feature selection attempts to align locality-based separation and class-based separation. Since there are a large number of unla-
beled data and a small number of labeled instances, it is reasonable to use
unlabeled data to form some potential clusters and then employ labeled data
to find those clusters that can achieve both locality-based and class-based sep-
arations. For the two possible clustering results in Figure 1.1, if we are given
one correctly labeled instance each for the clusters of circles and diamonds,
the correct clustering result (the middle figure) will be chosen. The idea of
semi-supervised feature selection can be illustrated as in Figure 1.2 showing
how the properties of Xl and Xu complement each other and work together to
find relevant features. Two feature vectors (corresponding to two features, f and f′) can generate respective cluster indicators representing different clustering results: The left one can satisfy both constraints of Xl and Xu, but the right one can only satisfy Xu. For semi-supervised feature selection, we want to select f over f′. In other words, there are two equally good ways to cluster
the data as shown in the figure, but only one way can also attain class-based
separation.

FIGURE 1.2: The basic idea for comparing the fitness of cluster indicators according to both Xl (labeled data) and Xu (unlabeled data) for semi-supervised feature selection: (a) the cluster structure corresponding to cluster indicator g; (b) the cluster structure corresponding to cluster indicator g′. “−” and “+” correspond to instances of negative and positive classes, and “M” to unlabeled instances.

A semi-supervised feature selection algorithm sSelect is proposed
in [18]; sSelect effectively uses both data properties when locality-based separation and class-based separation do not generate conflicts. We expect to
witness a surge of study on semi-supervised feature selection. The reason is
two-fold: It is often affordable to carefully label a small number of instances,
and it also provides a natural way for human experts to inject their knowledge
into the feature selection process in the form of labeled instances.
Above, we presented and illustrated the development of feature selection
in parallel to supervised, unsupervised, and semi-supervised learning to meet
the increasing demands of labeled, unlabeled, and partially labeled data. It
is just one perspective of feature selection that encompasses many aspects.
However, from this perspective, it can be clearly seen that as data evolve,
feature selection research adapts and develops into new areas in various forms
for emerging real-world applications. In the following, we present an overview
of the research activities included in this book.
1.3 Key Contributions and Organization of the Book
The ensuing chapters showcase some current research issues of feature se-
lection. They are categorically grouped into five parts, each containing four
chapters. The first chapter in Part I is this introduction. The other three
discuss issues such as unsupervised feature selection, randomized feature se-
lection, and causal feature selection. Part II reports some recent results of em-
powering feature selection, including active feature selection, decision-border
estimate, use of ensembles with independent probes, and incremental fea-
ture selection. Part III deals with weighting and local methods such as an
overview of the ReliefF family, feature selection in k-means clustering, local
feature relevance, and a new interpretation of Relief. Part IV is about text
feature selection, presenting an overview of feature selection for text classifi-
cation, a new feature selection score, constraint-guided feature selection, and
aggressive feature selection. Part V is on Feature Selection in Bioinformat-
ics, discussing redundancy-based feature selection, feature construction and
selection, ensemble-based robust feature selection, and penalty-based feature
selection. A summary of each chapter is given next.
1.3.1 Part I - Introduction and Background
Chapter 2 is an overview of unsupervised feature selection, finding the
smallest feature subset that best uncovers interesting, natural clusters for the
chosen criterion. The existence of irrelevant features can misguide clustering
results. Both filter and wrapper approaches can apply as in a supervised
setting. Feature selection can either be global or local, and the features to
be selected can vary from cluster to cluster. Disparate feature subspaces can
have different underlying numbers of natural clusters. Therefore, care must
be taken when comparing two clusters with different sets of features.
Chapter 3 is also an overview, this time of randomization techniques for feature
selection. Randomization can lead to an efficient algorithm when the benefits
of good choices outweigh the costs of bad choices. There are two broad classes
of algorithms: Las Vegas algorithms, which guarantee a correct answer but
may require a long time to execute with small probability, and Monte Carlo
algorithms, which may output an incorrect answer with small probability but
always complete execution quickly. The randomized complexity classes define
the probabilistic guarantees that an algorithm must meet. The major sources
of randomization are the input features and/or the training examples. The
chapter introduces examples of several randomization algorithms.
Chapter 4 addresses the notion of causality and reviews techniques for
learning causal relationships from data in applications to feature selection.
Causal Bayesian networks provide a convenient framework for reasoning about
causality and an algorithm is presented that can extract causality from data
by finding the Markov blanket. Direct causes (parents), direct effects (chil-
dren), and other direct causes of the direct effects (spouses) are all members
of the Markov blanket. Only direct causes are strongly causally relevant. The
knowledge of causal relationships can benefit feature selection, e.g., explain-
ing relevance in terms of causal mechanisms, distinguishing between actual
features and experimental artifacts, predicting the consequences of actions,
and making predictions in a non-stationary environment.
1.3.2 Part II - Extending Feature Selection
Chapter 5 poses an interesting problem of active feature sampling in do-
mains where the feature values are expensive to measure. The selection of
features is based on the maximum benefit. A benefit function minimizes the
mean-squared error in a feature relevance estimate. It is shown that the
minimum mean-squared error criterion is equivalent to the maximum average
change criterion. The results obtained by using a mixture model for the joint
class-feature distribution show the advantage of the active sampling policy
over the random sampling in reducing the number of feature samples. The
approach is computationally expensive. Considering only a random subset of
the missing entries at each sampling step is a promising solution.
Chapter 6 discusses feature extraction (as opposed to feature selection)
based on the properties of the decision border. It is intuitive that the direction
normal to the decision boundary represents an informative direction for class
discriminability and its effectiveness is proportional to the area of decision bor-
der that has the same normal vector. Based on this, a labeled vector quantizer
that can efficiently be trained by the Bayes risk weighted vector quantization
(BVQ) algorithm was devised to extract the best linear approximation to the
decision border. The BVQ produces a decision boundary feature matrix, and
the eigenvectors of this matrix are exploited to transform the original feature
space into a new feature space with reduced dimensionality. It is shown that
this approach is comparable to the SVM-based decision boundary approach
and better than the MLP (Multi Layer Perceptron)-based approach, but with
a lower computational cost.
Chapter 7 proposes to compare feature relevance against the relevance of
its randomly permuted version (or probes) for classification/regression tasks
using random forests. The key is to use the same distribution in generating
a probe. Feature relevance is estimated by averaging the relevance obtained
from each tree in the ensemble. The method iterates over the remaining fea-
tures by removing the identified important features using the residuals as new
target variables. It offers autonomous feature selection taking into account
non-linearity, mixed-type data, and missing data in regressions and classifica-
tions. It shows excellent performance and low computational complexity, and
is able to address massive amounts of data.
Chapter 8 introduces an incremental feature selection algorithm for high-
dimensional data. The key idea is to decompose the whole process into feature
ranking and selection. The method first ranks features and then resolves the
redundancy by an incremental subset search using the ranking. The incre-
mental subset search does not retract what it has selected, but it can decide
not to add the next candidate feature, i.e., skip it and try the next according
to the rank. Thus, the average number of features used to construct a learner
during the search is kept small, which makes the wrapper approach feasible
for high-dimensional data.
1.3.3 Part III - Weighting and Local Methods
Chapter 9 is a comprehensive description of the Relief family algorithms.
Relief exploits the context of other features through distance measures and can
detect highly conditionally-dependent features. The chapter explains the idea,
advantages, and applications of Relief and introduces two extensions: ReliefF
and RReliefF. ReliefF is for classification and can deal with incomplete data
with multi-class problems. RReliefF is its extension designed for regression.
The variety of the Relief family shows the general applicability of the basic
idea of Relief as a non-myopic feature quality measure.
Chapter 10 discusses how to automatically determine the important fea-
tures in the k-means clustering process. The weight of a feature is determined
by the sum of the within-cluster dispersions of the feature, which measures
its importance in clustering. A new step to calculate the feature weights is
added in the iterative process in order not to seriously affect the scalability.
The weight can be defined either globally (same weights for all clusters) or
locally (different weights for different clusters). The latter, called subspace
k-means clustering, has applications in text clustering, bioinformatics, and
customer behavior analysis.
Chapter 11 is in line with Chapter 5, but focuses on local feature relevance
and weighting. Each feature’s ability for class probability prediction at each
point in the feature space is formulated in a way similar to the weighted χ-
square measure, from which the relevance weight is derived. The weight has
a large value for a direction along which the class probability is not locally
constant. To gain efficiency, a decision boundary is first obtained by an SVM,
and its normal vector nearest to the point in query is used to estimate the
weights reflected in the distance measure for a k-nearest neighbor classifier.
Chapter 12 gives further insights into Relief (refer to Chapter 9). The
working of Relief is proven to be equivalent to solving an online convex opti-
mization problem with a margin-based objective function that is defined based
on a nearest neighbor classifier. Relief usually performs (1) better than other
filter methods due to the local performance feedback of a nonlinear classifier
when searching for useful features, and (2) better than wrapper methods due
to the existence of efficient algorithms for a convex optimization problem. The
weights can be iteratively updated by an EM-like algorithm, which guaran-
tees the uniqueness of the optimal weights and the convergence. The method
was further extended to its online version, which is quite effective when it is
difficult to use all the data in a batch mode.
1.3.4 Part IV - Text Classification and Clustering
Chapter 13 is a comprehensive presentation of feature selection for text
classification, including feature generation, representation, and selection, with
illustrative examples, from a pragmatic view point. A variety of feature gen-
erating schemes is reviewed, including word merging, word phrases, character
N-grams, and multi-fields. The generated features are ranked by scoring each
feature independently. Examples of scoring measures are information gain,
χ-square, and bi-normal separation. A case study shows considerable im-
provement of F-measure by feature selection. It also shows that adding two
word phrases as new features generally gives good performance gain over the
features comprising only selected words.
Chapter 14 introduces a new feature selection score, which is defined as the
posterior probability of inclusion of a given feature over all possible models,
where each model corresponds to a different set of features that includes the
given feature. The score assumes a probability distribution on the words of
the documents. Bernoulli and Poisson distributions are assumed respectively
when only the presence or absence of a word matters and when the number
of occurrences of a word matters. The score computation is inexpensive,
and the value that the score assigns to each word has an appealing Bayesian
interpretation when the predictive model corresponds to a naive Bayes model.
This score is compared with five other well-known scores.
Chapter 15 focuses on dimensionality reduction for semi-supervised clus-
tering where some weak supervision is available in terms of pairwise instance
constraints (must-link and cannot-link). Two methods are proposed by lever-
aging pairwise instance constraints: pairwise constraints-guided feature pro-
jection and pairwise constraints-guided co-clustering. The former is used to
project data into a lower dimensional space such that the sum-squared dis-
tance between must-link instances is minimized and the sum-squared dis-
tance between cannot-link instances is maximized. This reduces to an elegant
eigenvalue decomposition problem. The latter is to use feature clustering
benefitting from pairwise constraints via a constrained co-clustering mecha-
nism. Feature clustering and data clustering are mutually reinforced in the
co-clustering process.
Chapter 16 proposes aggressive feature selection, removing more than
95% features (terms) for text data. Feature ranking is effective to remove
irrelevant features, but cannot handle feature redundancy. Experiments show
that feature redundancy can be as destructive as noise. A new multi-stage
approach for text feature selection is proposed: (1) pre-processing to remove
stop words, infrequent words, noise, and errors; (2) ranking features to iden-
tify the most informative terms; and (3) removing redundant and correlated
terms. In addition, term redundancy is modeled by a term-redundancy tree
for visualization purposes.
1.3.5 Part V - Feature Selection in Bioinformatics
Chapter 17 introduces the challenges of microarray data analysis and
presents a redundancy-based feature selection algorithm. For high-throughput
data like microarrays, redundancy among genes becomes a critical issue. Con-
ventional feature ranking algorithms cannot effectively handle feature redun-
dancy. It is known that if there is a Markov blanket for a feature, the feature
can be safely eliminated. Finding a Markov blanket is computationally heavy.
The solution proposed is to use an approximate Markov blanket, in which it is
assumed that the Markov blanket always consists of one feature. The features
are first ranked, and then each feature is checked in sequence if it has any ap-
proximate Markov blanket in the current set. This way it can efficiently find
all predominant features and eliminate the rest. Biologists would welcome
an efficient filter algorithm for handling feature redundancy. Redundancy-based fea-
ture selection makes it possible for a biologist to specify what genes are to be
included before feature selection.
Chapter 18 presents a scalable method for automatic feature generation
on biological sequence data. The algorithm uses sequence components and do-
main knowledge to construct features, explores the space of possible features,
and identifies the most useful ones. As sequence data have both compositional
and positional properties, feature types are defined to capture these proper-
ties, and for each feature type, features are constructed incrementally from
the simplest ones. During the construction, the importance of each feature is
evaluated by a measure that best fits each type, and low-ranked features
are eliminated. At the final stage, selected features are further pruned by an
embedded method based on recursive feature elimination. The method was
applied to the problem of splice-site prediction, and it successfully identified
the most useful set of features of each type. The method can be applied
to complex feature types and sequence prediction tasks such as translation
start-site prediction and protein sequence classification.
Chapter 19 proposes an ensemble-based method to find robust features
for biomarker research. Ensembles are obtained by choosing different alterna-
tives at each stage of data mining: three normalization methods, two binning
methods, eight feature selection methods (including different combination of
search methods), and four classification methods. A total of 192 different clas-
sifiers are obtained, and features are selected by favoring frequently appearing
features that are members of small feature sets of accurate classifiers. The
method is successfully applied to a publicly available Ovarian Cancer Dataset,
in which case the original attribute is the m/z (mass/charge) value from the mass
spectrometer and the value of the feature is its intensity.
Chapter 20 presents a penalty-based feature selection method, elastic net,
for genomic data, which is a generalization of lasso (a penalized least squares
method with L1 penalty for regression). Elastic net has a nice property that
irrelevant features receive parameter estimates equal to 0, leading to sparse, easy-to-interpret models as in the lasso; in addition, strongly cor-
related relevant features are all selected whereas in lasso only one of them
is selected. Thus, it is a more appropriate tool for feature selection with
high-dimensional data than lasso. Details are given on how elastic net can be
applied to regression, classification, and sparse eigen-gene analysis by simul-
taneously building a model and selecting relevant and redundant features.
1.4 Looking Ahead
Feature selection research has found applications in many fields where large
(either row-wise or column-wise) volumes of data present challenges to effec-
tive data analysis and processing. As data evolve, new challenges arise and
the expectations of feature selection are also elevated, due to its own suc-
cess. In addition to high-throughput data, the pervasive use of Internet and
Web technologies has been bringing about a great number of new services and
applications, ranging from recent Web 2.0 applications to traditional Web ser-
vices where multi-media data are ubiquitous and abundant. Feature selection
is widely applied to find topical terms, establish group profiles, assist in cat-
egorization, simplify descriptions, facilitate personalization and visualization,
among many others.
The frontier of feature selection research is expanding incessantly in an-
swering the emerging challenges posed by the ever-growing amounts of data,
multiple sources of heterogeneous data, data streams, and disparate data-
intensive applications. On one hand, we naturally anticipate more research
on semi-supervised feature selection, unifying supervised and unsupervised
feature selection [19], and integrating feature selection with feature extrac-
tion. On the other hand, we expect new feature selection methods designed
for various types of features like causal, complementary, relational, struc-
tural, and sequential features, and intensified research efforts on large-scale,
distributed, and real-time feature selection. As the field develops, we are op-
timistic and confident that feature selection research will continue its unique
and significant role in taming the data monster and helping turn data into
nuggets.
References
[1] A. Blum and P. Langley. Selection of relevant features and examples in
machine learning. Artificial Intelligence, 97:245–271, 1997.
[2] M. Dash and H. Liu. Feature selection for classification. Intel-
ligent Data Analysis: An International Journal, 1(3):131–156, 1997.
[3] E. Dougherty. Feature-selection overfitting with small-sample classi-
fier design. IEEE Intelligent Systems, 20(6):64–66, November/December
2005.
[4] J. Dy and C. Brodley. Feature selection for unsupervised learning. Jour-
nal of Machine Learning Research, 5:845–889, 2004.
[5] I. Guyon and A. Elisseeff. An introduction to variable and feature se-
lection. Journal of Machine Learning Research (JMLR), 3:1157–1182,
2003.
[6] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical
Learning. Springer, 2001.
[7] A. Jakulin and I. Bratko. Testing the significance of attribute interac-
tions. In ICML ’04: Twenty-First International Conference on Machine
Learning. ACM Press, 2004.
[8] G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In W. Cohen and H. Hirsh, editors, Machine Learning: Pro-
ceedings of the Eleventh International Conference, pages 121–129, New
Brunswick, NJ: Rutgers University, 1994.
[9] H. Liu and H. Motoda, editors. Feature Extraction, Construction and
Selection: A Data Mining Perspective. Boston: Kluwer Academic Pub-
lishers, 1998. 2nd Printing, 2001.
[10] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and
Data Mining. Boston: Kluwer Academic Publishers, 1998.
[11] H. Liu and H. Motoda, editors. Instance Selection and Construction for
Data Mining. Boston: Kluwer Academic Publishers, 2001.
[12] H. Liu and L. Yu. Toward integrating feature selection algorithms for
classification and clustering. IEEE Trans. on Knowledge and Data En-
gineering, 17(3):1–12, 2005.
[13] T. Mitchell. Machine Learning. New York: McGraw-Hill, 1997.
[14] P. Refaeilzadeh, L. Tang, and H. Liu. On comparison of feature selection
algorithms. In AAAI 2007 Workshop on Evaluation Methods for Machine
Learning II, Vancouver, British Columbia, Canada, July 2007.
[15] S. Singhi and H. Liu. Feature subset selection bias for classification
learning. In International Conference on Machine Learning, 2006.
[16] L. Yu and H. Liu. Efficient feature selection via analysis of rele-
vance and redundancy. Journal of Machine Learning Research (JMLR),
5(Oct):1205–1224, 2004.
[17] Z. Zhao and H. Liu. Searching for interacting features. In Proceedings of
IJCAI - International Joint Conference on AI, January 2007.
[18] Z. Zhao and H. Liu. Semi-supervised feature selection via spectral anal-
ysis. In Proceedings of SIAM International Conference on Data Mining
(SDM-07), 2007.
[19] Z. Zhao and H. Liu. Spectral feature selection for supervised and unsu-
pervised learning. In Proceedings of International Conference on Machine
Learning, 2007.
Chapter 2
Unsupervised Feature Selection
Jennifer G. Dy
Northeastern University
2.1 Introduction
2.2 Clustering
2.3 Feature Selection
2.4 Feature Selection for Unlabeled Data
2.5 Local Approaches
2.6 Summary
Acknowledgment
References
2.1 Introduction
Many existing databases are unlabeled, because large amounts of data make
it difficult for humans to manually label the categories of each instance. More-
over, human labeling is expensive and subjective. Hence, unsupervised learn-
ing is needed. Besides being unlabeled, several applications are characterized
by high-dimensional data (e.g., text, images, gene). However, not all of the
features domain experts utilize to represent these data are important for the
learning task. We have seen the need for feature selection in the supervised
learning case. This is also true in the unsupervised case. Unsupervised means
there is no teacher, in the form of class labels. One type of unsupervised learn-
ing problem is clustering. The goal of clustering is to group “similar” objects
together. “Similarity” is typically defined in terms of a metric or a probabil-
ity density model, which are both dependent on the features representing the
data.
In the supervised paradigm, feature selection algorithms maximize some
function of prediction accuracy. Since class labels are available in supervised
learning, it is natural to keep only the features that are related to or lead
to these classes. But in unsupervised learning, we are not given class labels.
Which features should we keep? Why not use all the information that we
have? The problem is that not all the features are important. Some of the
features may be redundant and some may be irrelevant. Furthermore, the ex-
istence of several irrelevant features can misguide clustering results. Reducing
the number of features also facilitates comprehensibility and ameliorates the
problem that some unsupervised learning algorithms break down with high-
dimensional data. In addition, for some applications, the goal is not just
clustering, but also to find the important features themselves.
A reason why some clustering algorithms break down in high dimensions is
due to the curse of dimensionality [3]. As the number of dimensions increases,
a fixed data sample becomes exponentially sparse. Additional dimensions increase the volume exponentially and spread the data such that the data points all look equally far apart. Figure 2.1 (a) shows a plot of data generated from
a uniform distribution between 0 and 2 with 25 instances in one dimension.
Figure 2.1 (b) shows a plot of the same data in two dimensions, and Figure
2.1 (c) displays the data in three dimensions. Observe that the data become
more and more sparse in higher dimensions. There are 12 samples that fall
inside the unit-sized box in Figure 2.1 (a), seven samples in (b) and two in
(c). The sampling density is proportional to $M^{1/N}$, where M is the number of samples and N is the dimension. For this example, a sampling density of 25 in one dimension would require $25^3$ = 15,625 samples in three dimensions to achieve a similar sample density.
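Under the stated proportionality, a quick back-of-the-envelope computation of the samples needed to match a one-dimensional density in higher dimensions might look like this (the dimensions chosen are arbitrary):

```python
# Samples required in N dimensions to match the density of m1 samples in 1-d,
# using sampling density proportional to M^(1/N). Quick numeric illustration.
m1 = 25  # 25 samples in one dimension, as in Figure 2.1
for n in (1, 2, 3, 10):
    print(f"N={n:2d}: need {m1 ** n:>15,d} samples for a comparable density")
```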
FIGURE 2.1: Illustration for the curse of dimensionality. These are plots of a
25-sample dataset generated from a uniform distribution between 0 and 2. (a) Plot in
one dimension, (b) plot in two dimensions, and (c) plot in three dimensions. The
boxes in the figures show unit-sized bins in the corresponding dimensions. Note that
data are more sparse with respect to the unit-sized volume in higher dimensions.
There are 12 samples in the unit-sized box in (a), 7 samples in (b), and 2 samples
in (c).
As noted earlier, supervised learning has class labels to guide the feature
search. In unsupervised learning, these labels are missing, and in fact its goal
is to find these grouping labels (also known as cluster assignments). Finding
these cluster labels is dependent on the features describing the data, thus
making feature selection for unsupervised learning difficult.
Dy and Brodley [14] define the goal of feature selection for unsupervised
learning as:
to find the smallest feature subset that best uncovers “interesting
natural” groupings (clusters) from data according to the chosen
criterion.
Without any labeled information, in unsupervised learning, we need to make
some assumptions. We need to define what “interesting” and “natural” mean
in the form of criterion or objective functions. We will see examples of these
criterion functions later in this chapter.
Before we proceed with how to do feature selection on unsupervised data,
it is important to know the basics of clustering algorithms. Section 2.2 briefly
describes clustering algorithms. In Section 2.3 we review the basic components
of feature selection algorithms. Then, we present the methods for unsuper-
vised feature selection in Sections 2.4 and 2.5, and finally provide a summary
in Section 2.6.
2.2 Clustering
The goal of clustering is to group similar objects together. There are two
types of clustering approaches: partitional and hierarchical. Partitional clus-
tering provides one level of clustering. Hierarchical clustering, on the other
hand, provides multiple levels (hierarchy) of clustering solutions. Hierarchical
approaches can proceed bottom-up (agglomerative) or top-down (divisive).
Bottom-up approaches typically start with each instance as its own cluster and then,
at each level, merge the clusters that are most similar to each other. Top-
down approaches divide the data into k clusters at each level. There are
several methods for performing clustering. A survey of these algorithms can
be found in [29, 39, 18].
In this section we briefly present two popular partitional clustering algo-
rithms: k-means and finite mixture model clustering. As mentioned earlier,
similarity is typically defined by a metric or a probability distribution. K-
means is an approach that uses a metric, and finite mixture models define
similarity by a probability density.
Let us denote our dataset as X = {x1, x2, . . . , xM }. X consists of M data
instances xk, k = 1, 2, . . ., M, and each xk represents a single N-dimensional
instance.
2.2.1 The K-Means Algorithm
The goal of k-means is to partition X into K clusters {C1, . . . , CK }. The
most widely used criterion function for the k-means algorithm is the sum-
squared-error (SSE) criterion. SSE is defined as

    SSE = \sum_{j=1}^{K} \sum_{x_k \in C_j} \| x_k - \mu_j \|^2        (2.1)
where μj denotes the mean (centroid) of those instances in cluster Cj.
K-means is an iterative algorithm that locally minimizes the SSE criterion.
It assumes each cluster has a hyper-spherical structure. “K-means” denotes
the process of assigning each data point, xk, to the cluster with the nearest
mean. The k-means algorithm starts with K initial centroids, then it assigns
each remaining point to the nearest centroid, updates the cluster centroids,
and repeats the process until the K centroids do not change (convergence).
There are two versions of k-means: One version originates from Forgy [17] and
the other version from MacQueen [36]. The difference between the two is when
to update the cluster centroids. In Forgy’s k-means [17], cluster centroids are
re-computed after all the data points have been assigned to their nearest
centroids. In MacQueen's k-means [36], the cluster centroids are re-computed
after each data assignment. Since k-means is a greedy algorithm, it is only
guaranteed to find a local minimum, the solution of which is dependent on
the initial assignments. To avoid a poor local optimum, one typically applies random
restarts and picks the clustering solution with the lowest SSE. One can refer
to [47, 4] for other ways to deal with the initialization problem.
Standard k-means utilizes Euclidean distance to measure dissimilarity be-
tween the data points. Note that one can easily create various variants of
k-means by modifying this distance metric (e.g., other Lp norm distances)
to ones more appropriate for the data. For example, on text data, a more
suitable metric is cosine similarity. One can also replace the SSE objective
function with other criterion measures to create other clustering algorithms.
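As a minimal sketch of the procedure described above (assuming NumPy and a Forgy-style update; the function and variable names are illustrative, not taken from the chapter), k-means can be written as:

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal Forgy-style k-means: assign all points, then update centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with K randomly chosen data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer change
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # criterion (2.1)
    return labels, centroids, sse

In practice one would run such a sketch from several random initializations and keep the solution with the lowest SSE; swapping the Euclidean distance or the SSE objective yields the variants mentioned above.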
2.2.2 Finite Mixture Clustering
A finite mixture model assumes that data are generated from a mixture
of K component density functions, in which p(xk|θj) represents the density
function of component j, for all j's, where θj is the parameter (to be estimated)
for cluster j. The probability density of data xk is expressed by

    p(x_k) = \sum_{j=1}^{K} \alpha_j \, p(x_k \mid \theta_j)        (2.2)

where the αj's are the mixing proportions of the components (subject to αj ≥ 0
and \sum_{j=1}^{K} \alpha_j = 1). The log-likelihood of the M observed data points is then
given by

    L = \sum_{k=1}^{M} \ln \left\{ \sum_{j=1}^{K} \alpha_j \, p(x_k \mid \theta_j) \right\}        (2.3)
It is difficult to optimize (2.3) directly; therefore, we apply the Expectation-
Maximization (EM) [10] algorithm to find a (local) maximum likelihood or
maximum a posteriori (MAP) estimate of the parameters for the given data
set. EM is a general approach for estimating the maximum likelihood or
MAP estimate for missing data problems. In the clustering context, the
missing or hidden variables are the class labels. The EM algorithm iterates
between an Expectation-step (E-step), which computes the expected com-
plete data log-likelihood given the observed data and the model parameters,
and a Maximization-step (M-step), which estimates the model parameters
by maximizing the expected complete data log-likelihood from the E-step,
until convergence. In clustering, the E-step is similar to estimating the clus-
ter membership and the M-step estimates the cluster model parameters. The
clustering solution that we obtain in a mixture model is what we call a “soft”-
clustering solution because we obtain an estimated cluster membership (i.e.,
each data point belongs to all clusters with some probability weight of be-
longing to each cluster). In contrast, k-means provides a “hard”-clustering
solution (i.e., each data point belongs to only a single cluster).
Analogous to metric-based clustering, where one can develop different algo-
rithms by utilizing other similarity metrics, one can design different probability-
based mixture model clustering algorithms by choosing an appropriate density
model for the application domain. A Gaussian distribution is typically uti-
lized for continuous features and multinomials for discrete features. For a
more thorough description of clustering using finite mixture models, see [39];
a review is also provided in [18].
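As a minimal sketch of EM for a Gaussian mixture (a common choice for continuous features), assuming NumPy and SciPy, with simplified initialization and no convergence test, and with illustrative names not taken from the chapter:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture: soft clustering via responsibilities."""
    M, N = X.shape
    rng = np.random.default_rng(seed)
    # Initialization: random responsibilities, normalized over components.
    resp = rng.random((M, K))
    resp /= resp.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: mixing proportions, means, covariances from the responsibilities.
        Nk = resp.sum(axis=0)                      # effective cluster sizes
        alphas = Nk / M                            # mixing proportions
        means = (resp.T @ X) / Nk[:, None]
        covs = []
        for j in range(K):
            d = X - means[j]
            covs.append((resp[:, j, None] * d).T @ d / Nk[j] + 1e-6 * np.eye(N))
        # E-step: responsibilities proportional to alpha_j * p(x_k | theta_j).
        dens = np.column_stack([
            alphas[j] * multivariate_normal.pdf(X, means[j], covs[j])
            for j in range(K)
        ])
        resp = dens / dens.sum(axis=1, keepdims=True)
    log_likelihood = np.log(dens.sum(axis=1)).sum()  # criterion (2.3)
    return resp, alphas, means, covs, log_likelihood

The returned responsibilities are the "soft" cluster memberships discussed above; taking their arg-max per row recovers a hard assignment comparable to the k-means output.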
2.3 Feature Selection
Feature selection algorithms have two main components: (1) feature search
and (2) feature subset evaluation.
2.3.1 Feature Search
Feature search strategies have been widely studied for classification. Gen-
erally speaking, search strategies used for supervised classification can also
be used for clustering algorithms. We repeat and summarize them here for
completeness. An exhaustive search would definitely find the optimal solution;
however, a search over the 2^N possible feature subsets (where N is the number of
features) is computationally impractical. More realistic search strategies have
been studied. Narendra and Fukunaga [40] introduced the branch and bound
algorithm, which finds the optimal feature subset if the criterion function used
is monotonic. However, although the branch and bound algorithm makes
problems more tractable than an exhaustive search, it becomes impractical
for feature selection problems involving more than 30 features [43]. Sequential
search methods generally use greedy techniques and hence do not guarantee
global optimality of the selected subsets, only local optimality. Examples of
sequential searches include sequential forward selection, sequential backward
elimination, and bidirectional selection [32, 33]. Sequential forward/backward
search methods generally result in an O(N^2) worst-case search. Marill and
Green [38] introduced the sequential backward selection (SBS) [43] method,
which starts with all the features and sequentially eliminates one feature at a
time (eliminating the feature that contributes least to the criterion function).
Whitney [50] introduced sequential forward selection (SFS) [43], which starts
with the empty set and sequentially adds one feature at a time. A problem
with these hill-climbing search techniques is that when a feature is deleted in
SBS, it cannot be re-selected, while a feature added in SFS cannot be deleted
once selected. To prevent this effect, the Plus-l-Minus-r (l-r) search method
was developed by Stearns [45]. In this method, the values of l and r
are pre-specified and fixed. Pudil et al. [43] introduced an adaptive version
that allows l and r values to “float.” They call these methods floating search
methods: sequential forward floating selection (SFFS) and sequential back-
ward floating selection (SBFS), named according to the dominant search
direction (i.e., forward or backward). Random search methods such
as genetic algorithms and random mutation hill climbing add some random-
ness in the search procedure to help to escape from a local optimum. In some
cases when the dimensionality is very high, one can only afford an individual
search. Individual search methods evaluate each feature individually accord-
ing to a criterion or a condition [24]. They then select the features that either
satisfy the condition or are top-ranked.
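As a minimal sketch of the sequential forward selection described above, with a generic, higher-is-better criterion function (the names are illustrative and not tied to any particular criterion from this chapter):

def sequential_forward_selection(features, criterion, max_features=None):
    """Greedy SFS: start empty, repeatedly add the feature that most improves
    the criterion (higher is better); stop when no addition helps."""
    selected, remaining = [], list(features)
    best_score = float("-inf")
    limit = max_features or len(features)
    while remaining and len(selected) < limit:
        # Evaluate every one-feature extension of the current subset.
        scored = [(criterion(selected + [f]), f) for f in remaining]
        score, best_f = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break  # no single feature improves the criterion further
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected, best_score

Sequential backward elimination is the mirror image: start from the full feature set and greedily remove the feature whose removal hurts the criterion least.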
2.3.2 Feature Evaluation
Not all the features are important. Some of the features may be irrelevant
and some of the features may be redundant. Each feature or feature subset
needs to be evaluated for importance according to a criterion. Different criteria
may select different features. It is actually deciding the evaluation criteria that
makes feature selection in clustering difficult. In classification, it is natural
to keep the features that are related to the labeled classes. However, in
clustering, these class labels are not available. Which features should we keep?
More specifically, how do we decide which features are relevant/irrelevant, and
which are redundant?
Figure 2.2 gives a simple example of an irrelevant feature for clustering.
Suppose data have features F1 and F2 only. Feature F2 does not contribute
to cluster discrimination, thus, we consider feature F2 to be irrelevant. We
want to remove irrelevant features because they may mislead the clustering
algorithm (especially when there are more irrelevant features than relevant
ones). Figure 2.3 provides an example showing feature redundancy. Observe
FIGURE 2.2: In this example, feature F2 is irrelevant because it does not con-
tribute to cluster discrimination.
FIGURE 2.3: In this example, features F1 and F2 have redundant information,
because feature F1 provides the same information as feature F2 with regard to
discriminating the two clusters.
that both features F1 and F2 lead to the same clustering results. Therefore,
we consider features F1 and F2 to be redundant.
2.4 Feature Selection for Unlabeled Data
There are several feature selection methods for clustering. Similar to super-
vised learning, these feature selection methods can be categorized as either
filter or wrapper approaches [33] based on whether the evaluation methods
depend on the learning algorithms.
As Figure 2.4 shows, the wrapper approach wraps the feature search around
the learning algorithms that will ultimately be applied, and utilizes the learned
results to select the features. On the other hand, as shown in Figure 2.5, the
filter approach utilizes the data alone to decide which features should be kept,
[Figure 2.4 diagram omitted; its components are All Features, Search, Feature Subset, Clustering Algorithm, Clusters, Feature Evaluation Criterion, Criterion Value, and Selected Features.]
FIGURE 2.4: Wrapper approach for feature selection for clustering.
[Figure 2.5 diagram omitted; its components are All Features, Search, Feature Subset, Feature Evaluation Criterion, Criterion Value, and Selected Features.]
FIGURE 2.5: Filter approach for feature selection for clustering.
without running the learning algorithm. A wrapper approach usually leads
to better performance than a filter approach for a particular
learning algorithm. However, wrapper methods are more computationally
expensive since one needs to run the learning algorithm for every candidate
feature subset.
In this section, we present the different methods categorized into filter and
wrapper approaches.
2.4.1 Filter Methods
Filter methods use some intrinsic property of the data to select features
without utilizing the clustering algorithm that will ultimately be applied. The
basic components in filter methods are the feature search method and the fea-
ture selection criterion. Filter methods have the challenge of defining feature
relevance (interestingness) and/or redundancy without applying clustering on
the data.
Talavera [48] developed a filter version of his wrapper approach that selects
features based on feature dependence. He claims that irrelevant features are
features that do not depend on the other features. Manoranjan et al. [37]
introduced a filter approach that selects features based on the entropy of dis-
tances between data points. They observed that when the data are clustered,
the distance entropy in that subspace should be low. He, Cai, and Niyogi [26]
select features based on the Laplacian score that evaluates features based on
their locality preserving power. The Laplacian score is based on the premise
that two data points that are close together probably belong to the same
cluster.
These three filter approaches try to remove features that are not relevant.
Another way to reduce the dimensionality is to remove redundancy. A filter
approach primarily for reducing redundancy is simply to cluster the features.
Note that even though we apply clustering, we consider this as a filter method
because we cluster on the feature space as opposed to the data sample space.
One can cluster the features using a k-means clustering [36, 17] type of algo-
rithm with feature correlation as the similarity metric. Instead of a cluster
mean, one can represent each cluster by the feature that is most highly correlated
with the other features in its cluster, as sketched below.
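The sketch below illustrates this redundancy-removal idea; for simplicity it substitutes average-linkage hierarchical clustering of the features (on a 1 − |correlation| distance) for the k-means-type algorithm mentioned above, and all names are illustrative.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def select_representative_features(X, n_groups):
    """Group redundant features by correlation and keep one feature per group.

    X is an (instances x features) array; n_groups is the desired number of
    feature clusters (and hence of selected features).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-feature |correlation|
    dist = 1.0 - corr                             # similar features -> small distance
    np.fill_diagonal(dist, 0.0)
    # Agglomerative clustering of the features on the correlation distance.
    Z = linkage(squareform(dist, checks=False), method="average")
    groups = fcluster(Z, t=n_groups, criterion="maxclust")
    selected = []
    for g in np.unique(groups):
        members = np.where(groups == g)[0]
        # Representative: the member most correlated with the rest of its group.
        avg_corr = corr[np.ix_(members, members)].mean(axis=1)
        selected.append(members[avg_corr.argmax()])
    return sorted(selected)

Because the clustering is performed over features rather than instances, this remains a filter method in the sense described above.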
Popular techniques for dimensionality reduction without labels are prin-
cipal components analysis (PCA) [30], factor analysis, and projection pur-
suit [20, 27]. These early works in data reduction for unsupervised data can
be thought of as filter methods, because they select the features prior to ap-
plying clustering. But rather than selecting a subset of the features, they
involve some type of feature transformation. PCA and factor analysis aim to
reduce the dimension such that the representation is as faithful as possible to
the original data. As such, these techniques aim at reducing dimensionality
by removing redundancy. Projection pursuit, on the other hand, aims at find-
ing “interesting” projections (defined as the directions that are farthest from
Gaussian distributions and close to uniform). In this case, projection pur-
suit addresses relevance. Another method is independent component analysis
(ICA) [28]. ICA tries to find a transformation such that the transformed vari-
ables are statistically independent. Although the goals of ICA and projection
pursuit are different, the formulation in ICA ends up being similar to that of
projection pursuit (i.e., they both search for directions that are farthest from
the Gaussian density). These techniques are filter methods; however, they
apply transformations to the original feature space. We are interested in sub-
sets of the original features, because we want to retain the original meaning of
the features. Moreover, transformations would still require the user to collect
all the features to obtain the reduced set, which is sometimes not desired.
2.4.2 Wrapper Methods
Wrapper methods apply the clustering algorithm to evaluate the features.
They incorporate the clustering algorithm inside the feature search and selec-
tion. Wrapper approaches consist of: (1) a search component, (2) a clustering
algorithm, and (3) a feature evaluation criterion. See Figure 2.4.
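A minimal sketch of such a wrapper loop is given below, assuming scikit-learn's k-means as the clustering algorithm and the silhouette score as a stand-in evaluation criterion (neither choice is prescribed by the chapter, and the names are illustrative). Note that it fixes the number of clusters k, which is precisely the first issue discussed next.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def wrapper_forward_selection(X, k, max_features=None):
    """Hypothetical wrapper: greedy forward search, clustering each candidate
    feature subspace with k-means and scoring it with the silhouette criterion."""
    n_features = X.shape[1]
    limit = max_features or n_features
    selected, best_score = [], float("-inf")
    remaining = list(range(n_features))
    while remaining and len(selected) < limit:
        candidates = []
        for f in remaining:
            subspace = X[:, selected + [f]]
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(subspace)
            candidates.append((silhouette_score(subspace, labels), f))
        score, best_f = max(candidates, key=lambda t: t[0])
        if score <= best_score:
            break  # no candidate feature improves the criterion
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected, best_score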
One can build a feature selection wrapper approach for clustering by simply
picking a favorite search method (any method presented in Section 2.3.1), and
apply a clustering algorithm and a feature evaluation criterion. However, there
are issues that one must take into account in creating such an algorithm. In
[14], Dy and Brodley investigated the issues involved in creating a general
wrapper method where any feature selection, clustering, and selection criteria
can be applied. The first issue they observed is that it is not a good idea
to use the same number of clusters throughout the feature search because
different feature subspaces have different underlying numbers of “natural”
clusters. Thus, the clustering algorithm should also incorporate finding the
number of clusters in feature search. The second issue they discovered is that
various selection criteria are biased with respect to dimensionality. They then
introduced a cross-projection normalization scheme that can be utilized by
any criterion function.
Feature subspaces have different underlying numbers of clusters.
When we are searching for the best feature subset, we run into a new problem:
The value of the number of clusters depends on the feature subset. Figure
2.6 illustrates this point. In two dimensions {F1, F2} there are three clusters,
whereas in one dimension (the projection of the data only on F1) there are
only two clusters. It is not a good idea to use a fixed number of clusters in
feature search, because different feature subsets require different numbers of
clusters. Moreover, using a fixed number of clusters for all feature sets does not
model the data in the respective subspace correctly. In [14], they addressed
finding the number of clusters by applying a Bayesian information criterion
penalty [44].
FIGURE 2.6: The number of cluster components varies with dimension.
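As a rough illustration of re-estimating the number of clusters in each candidate subspace, the sketch below selects K by the Bayesian information criterion using scikit-learn's Gaussian mixture implementation; it illustrates the idea rather than reproducing the exact criterion used in [14], and the names are illustrative.

from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X_subspace, k_range=range(1, 11)):
    """Fit a Gaussian mixture for each candidate K in the given feature
    subspace and return the K with the lowest BIC (lower is better)."""
    bics = []
    for k in k_range:
        gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X_subspace)
        bics.append((gmm.bic(X_subspace), k))
    best_bic, best_k = min(bics, key=lambda t: t[0])
    return best_k, best_bic

Running such a selection inside the feature search lets each candidate subspace be modeled with its own number of clusters, as in Figure 2.6.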
Feature evaluation criterion should not be biased with respect to
dimensionality. In a wrapper approach, one searches in feature space, ap-
plies clustering in each candidate feature subspace, Si, and then evaluates the
results (clustering in space Si) with other cluster solutions in other subspaces,
Sj, j ≠ i, based on an evaluation criterion. This can be problematic, especially
when Si and Sj have different dimensionalities. Dy and Brodley [14] examined
two feature selection criteria: maximum likelihood and scatter separability.
They have shown that the scatter separability criterion prefers higher dimen-
sionality. In other words, the criterion value monotonically increases as features are added.