Computational Methods
of
Feature Selection
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
UNDERSTANDING COMPLEX DATASETS: Data Mining with Matrix
Decompositions
David Skillicorn
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
PUBLISHED TITLES
SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A
AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge
discovery, while summarizing the computational tools and techniques useful in data analysis.This
series encourages the integration of mathematical, statistical, and computational methods and
techniques through the publication of a broad range of textbooks, reference works, and hand-
books. The inclusion of concrete examples and applications is highly encouraged. The scope of the
series includes, but is not limited to, titles in the areas of data mining and knowledge discovery
methods and applications, modeling, algorithms, theory and foundations, data and knowledge
visualization, data mining systems and tools, and privacy and security issues.
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
Computational Methods
of
Feature Selection
Edited by
Huan Liu • Hiroshi Motoda
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2007 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20140114
International Standard Book Number-13: 978-1-58488-879-6 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Preface
It has been ten years since we published our first two books on feature se-
lection in 1998. In the past decade, we witnessed a great expansion of feature
selection research in multiple dimensions. We experienced the fast data evolu-
tion in which extremely high-dimensional data, such as high-throughput data
of bioinformatics and Web/text data, became increasingly common. They
stretch the capabilities of conventional data processing techniques, pose new
challenges, and stimulate accelerated development of feature selection research
in two major ways. One trend is to improve and expand the existing tech-
niques to meet the new challenges. The other is to develop brand new algo-
rithms directly targeting the arising challenges. In this process, we observe
many feature-selection-centered activities, such as one well-received competi-
tion, two well-attended tutorials at top conferences, and two multi-disciplinary
workshops, as well as a special development section in a recent issue of IEEE
Intelligent Systems, to name a few.
This collection bridges the widening gap between existing texts and the
rapid developments in the field, by presenting recent research works from var-
ious disciplines. It features excellent survey work, practical guides, exciting
new directions, and comprehensive tutorials from leading experts. The book
also presents easy-to-understand illustrations, state-of-the-art methodologies,
and algorithms, along with real-world case studies ranging from text classi-
fication, to Web mining, to bioinformatics where high-dimensional data are
pervasive. Some vague ideas suggested in our earlier book have been de-
veloped into mature areas with solid achievements, along with progress that
could not have been imagined ten years ago. With the steady and speedy
development of feature selection research, we sincerely hope that this book
presents distinctive and representative achievements; serves as a convenient
point for graduate students, practitioners, and researchers to further the re-
search and application of feature selection; and sparks a new phase of feature
selection research. We are truly optimistic about the impact of feature selec-
tion on massive, high-dimensional data and processing in the near future, and
we have no doubt that in another ten years, when we look back, we will be
humbled by the newfound power of feature selection, and by its indelible con-
tributions to machine learning, data mining, and many real-world challenges.
Huan Liu and Hiroshi Motoda
Acknowledgments
The inception of this book project was during SDM 2006’s feature selec-
tion workshop. Randi Cohen, an editor of Chapman and Hall/CRC Press,
eloquently convinced one of us that it was time for a new book on feature
selection. Since then, she closely worked with us to make the process easier
and smoother and allowed us to stay focused. With Randi’s kind and expert
support, we were able to adhere to the planned schedule when facing unex-
pected difficulties. We truly appreciate her generous support throughout the
project.
This book is a natural extension of the two successful feature selection
workshops held at SDM 2005¹ and SDM 2006.²
The success would not be
a reality without the leadership of two workshop co-organizers (Robert Stine
of Wharton School and Leonard Auslender of SAS); the meticulous work of
the proceedings chair (Lei Yu of Binghamton University); and the altruistic
efforts of PC members, authors, and contributors. We take this opportunity
to thank all who helped to advance the frontier of feature selection research.
The authors, contributors, and reviewers of this book played an instru-
mental role in this project. Given the limited space of this book, we could
not include all quality works. Reviewers’ detailed comments and constructive
suggestions significantly helped improve the book’s consistency in content,
format, comprehensibility, and presentation. We thank the authors who pa-
tiently and promptly accommodated our (sometimes many) requests.
We would also like to express our deep gratitude for the gracious help we
received from our colleagues and students, including Zheng Zhao, Lei Tang,
Quan Nguyen, Payam Refaeilzadeh, and Shankara B. Subramanya of Arizona
State University; Kozo Ohara of Osaka University; and William Nace and
Kenneth Gorreta of AFOSR/AOARD, Air Force Research Laboratory.
Last but not least, we thank our families for their love and support. We
are grateful and happy that we can now spend more time with our families.
Huan Liu and Hiroshi Motoda
¹The 2005 proceedings are at http://enpub.eas.asu.edu/workshop/.
²The 2006 proceedings are at http://enpub.eas.asu.edu/workshop/2006/.
Contributors
Jesús S. Aguilar-Ruiz
Pablo de Olavide University,
Seville, Spain
Jennifer G. Dy
Northeastern University, Boston,
Massachusetts
Constantin F. Aliferis
Vanderbilt University, Nashville,
Tennessee
André Elisseeff
IBM Research, Zürich, Switzer-
land
Paolo Avesani
ITC-IRST, Trento, Italy
Susana Eyheramendy
Ludwig-Maximilians Universität
München, Germany
Susan M. Bridges
Mississippi State University,
Mississippi
George Forman
Hewlett-Packard Labs, Palo
Alto, California
Alexander Borisov
Intel Corporation, Chandler,
Arizona
Lise Getoor
University of Maryland, College
Park, Maryland
Shane Burgess
Mississippi State University,
Mississippi
Dimitrios Gunopulos
University of California, River-
side
Diana Chan
Mississippi State University,
Mississippi
Isabelle Guyon
ClopiNet, Berkeley, California
Claudia Diamantini
Universitá Politecnica delle
Marche, Ancona, Italy
Trevor Hastie
Stanford University, Stanford,
California
Rezarta Islamaj Dogan
University of Maryland, College
Park, Maryland and National
Center for Biotechnology Infor-
mation, Bethesda, Maryland
Joshua Zhexue Huang
University of Hong Kong, Hong
Kong, China
Carlotta Domeniconi
George Mason University, Fair-
fax, Virginia
Mohamed Kamel
University of Waterloo, Ontario,
Canada
Igor Kononenko
University of Ljubljana, Ljubl-
jana, Slovenia
Wei Tang
Florida Atlantic University,
Boca Raton, Florida
David Madigan
Rutgers University, New Bruns-
wick, New Jersey
Kari Torkkola
Motorola Labs, Tempe, Arizona
Masoud Makrehchi
University of Waterloo, Ontario,
Canada
Eugene Tuv
Intel Corporation, Chandler,
Arizona
Michael Ng
Hong Kong Baptist University,
Hong Kong, China
Sriharsha Veeramachaneni
ITC-IRST, Trento, Italy
Emanuele Olivetti
ITC-IRST, Trento, Italy
W. John Wilbur
National Center for Biotech-
nology Information, Bethesda,
Maryland
Domenico Potena
Universitá Politecnica delle
Marche, Ancona, Italy
Jun Xu
Georgia Institute of Technology,
Atlanta, Georgia
José C. Riquelme
University of Seville, Seville,
Spain
Yunming Ye
Harbin Institute of Technology,
Harbin, China
Roberto Ruiz
Pablo de Olavide University,
Seville, Spain
Lei Yu
Binghamton University, Bing-
hamton, New York
Marko Robnik Šikonja
University of Ljubljana, Ljubl-
jana, Slovenia
Shi Zhong
Yahoo! Inc., Sunnyvale, Califor-
nia
David J. Stracuzzi
Arizona State University,
Tempe, Arizona
Hui Zou
University of Minnesota, Min-
neapolis
Yijun Sun
University of Florida, Gaines-
ville, Florida
Contents
I Introduction and Background 1
1 Less Is More 3
Huan Liu and Hiroshi Motoda
1.1 Background and Basics . . . . . . . . . . . . . . . . . . . . . 4
1.2 Supervised, Unsupervised, and Semi-Supervised Feature Selec-
tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Key Contributions and Organization of the Book . . . . . . . 10
1.3.1 Part I - Introduction and Background . . . . . . . . . 10
1.3.2 Part II - Extending Feature Selection . . . . . . . . . 11
1.3.3 Part III - Weighting and Local Methods . . . . . . . . 12
1.3.4 Part IV - Text Classification and Clustering . . . . . . 13
1.3.5 Part V - Feature Selection in Bioinformatics . . . . . . 14
1.4 Looking Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Unsupervised Feature Selection 19
Jennifer G. Dy
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 The K-Means Algorithm . . . . . . . . . . . . . . . . 21
2.2.2 Finite Mixture Clustering . . . . . . . . . . . . . . . . 22
2.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.1 Feature Search . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Feature Evaluation . . . . . . . . . . . . . . . . . . . . 24
2.4 Feature Selection for Unlabeled Data . . . . . . . . . . . . . 25
2.4.1 Filter Methods . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2 Wrapper Methods . . . . . . . . . . . . . . . . . . . . 27
2.5 Local Approaches . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.1 Subspace Clustering . . . . . . . . . . . . . . . . . . . 32
2.5.2 Co-Clustering/Bi-Clustering . . . . . . . . . . . . . . . 33
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Randomized Feature Selection 41
David J. Stracuzzi
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Types of Randomizations . . . . . . . . . . . . . . . . . . . . 42
3.3 Randomized Complexity Classes . . . . . . . . . . . . . . . . 43
3.4 Applying Randomization to Feature Selection . . . . . . . . 45
3.5 The Role of Heuristics . . . . . . . . . . . . . . . . . . . . . . 46
3.6 Examples of Randomized Selection Algorithms . . . . . . . . 47
3.6.1 A Simple Las Vegas Approach . . . . . . . . . . . . . 47
3.6.2 Two Simple Monte Carlo Approaches . . . . . . . . . 49
3.6.3 Random Mutation Hill Climbing . . . . . . . . . . . . 51
3.6.4 Simulated Annealing . . . . . . . . . . . . . . . . . . . 52
3.6.5 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . 54
3.6.6 Randomized Variable Elimination . . . . . . . . . . . 56
3.7 Issues in Randomization . . . . . . . . . . . . . . . . . . . . 58
3.7.1 Pseudorandom Number Generators . . . . . . . . . . . 58
3.7.2 Sampling from Specialized Data Structures . . . . . . 59
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4 Causal Feature Selection 63
Isabelle Guyon, Constantin Aliferis, and André Elisseeff
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Classical “Non-Causal” Feature Selection . . . . . . . . . . . 65
4.3 The Concept of Causality . . . . . . . . . . . . . . . . . . . . 68
4.3.1 Probabilistic Causality . . . . . . . . . . . . . . . . . . 69
4.3.2 Causal Bayesian Networks . . . . . . . . . . . . . . . . 70
4.4 Feature Relevance in Bayesian Networks . . . . . . . . . . . 71
4.4.1 Markov Blanket . . . . . . . . . . . . . . . . . . . . . 72
4.4.2 Characterizing Features Selected via Classical Methods 73
4.5 Causal Discovery Algorithms . . . . . . . . . . . . . . . . . . 77
4.5.1 A Prototypical Causal Discovery Algorithm . . . . . . 78
4.5.2 Markov Blanket Induction Algorithms . . . . . . . . . 79
4.6 Examples of Applications . . . . . . . . . . . . . . . . . . . . 80
4.7 Summary, Conclusions, and Open Problems . . . . . . . . . 82
II Extending Feature Selection 87
5 Active Learning of Feature Relevance 89
Emanuele Olivetti, Sriharsha Veeramachaneni, and Paolo Avesani
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Active Sampling for Feature Relevance Estimation . . . . . . 92
5.3 Derivation of the Sampling Benefit Function . . . . . . . . . 93
5.4 Implementation of the Active Sampling Algorithm . . . . . . 95
5.4.1 Data Generation Model: Class-Conditional Mixture of
Product Distributions . . . . . . . . . . . . . . . . . . 95
5.4.2 Calculation of Feature Relevances . . . . . . . . . . . 96
5.4.3 Calculation of Conditional Probabilities . . . . . . . . 97
5.4.4 Parameter Estimation . . . . . . . . . . . . . . . . . . 97
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.5.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . 99
5.5.2 UCI Datasets . . . . . . . . . . . . . . . . . . . . . . . 100
5.5.3 Computational Complexity Issues . . . . . . . . . . . 102
5.6 Conclusions and Future Work . . . . . . . . . . . . . . . . . 102
6 A Study of Feature Extraction Techniques Based on Decision
Border Estimate 109
Claudia Diamantini and Domenico Potena
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.1.1 Background on Statistical Pattern Classification . . . 111
6.2 Feature Extraction Based on Decision Boundary . . . . . . . 112
6.2.1 MLP-Based Decision Boundary Feature Extraction . . 113
6.2.2 SVM Decision Boundary Analysis . . . . . . . . . . . 114
6.3 Generalities About Labeled Vector Quantizers . . . . . . . . 115
6.4 Feature Extraction Based on Vector Quantizers . . . . . . . 116
6.4.1 Weighting of Normal Vectors . . . . . . . . . . . . . . 119
6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.5.1 Experiment with Synthetic Data . . . . . . . . . . . . 122
6.5.2 Experiment with Real Data . . . . . . . . . . . . . . . 124
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7 Ensemble-Based Variable Selection Using Independent Probes
131
Eugene Tuv, Alexander Borisov, and Kari Torkkola
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.2 Tree Ensemble Methods in Feature Ranking . . . . . . . . . 132
7.3 The Algorithm: Ensemble-Based Ranking Against Indepen-
dent Probes . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.4.1 Benchmark Methods . . . . . . . . . . . . . . . . . . . 138
7.4.2 Data and Experiments . . . . . . . . . . . . . . . . . . 139
7.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8 Efficient Incremental-Ranked Feature Selection in Massive
Data 147
Roberto Ruiz, Jesús S. Aguilar-Ruiz, and José C. Riquelme
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.3 Preliminary Concepts . . . . . . . . . . . . . . . . . . . . . . 150
8.3.1 Relevance . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.3.2 Redundancy . . . . . . . . . . . . . . . . . . . . . . . . 151
8.4 Incremental Performance over Ranking . . . . . . . . . . . . 152
8.4.1 Incremental Ranked Usefulness . . . . . . . . . . . . . 153
8.4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 155
8.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 156
8.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
III Weighting and Local Methods 167
9 Non-Myopic Feature Quality Evaluation with (R)ReliefF 169
Igor Kononenko and Marko Robnik Šikonja
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.2 From Impurity to Relief . . . . . . . . . . . . . . . . . . . . . 170
9.2.1 Impurity Measures in Classification . . . . . . . . . . . 171
9.2.2 Relief for Classification . . . . . . . . . . . . . . . . . 172
9.3 ReliefF for Classification and RReliefF for Regression . . . . 175
9.4 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.4.1 ReliefF for Inductive Logic Programming . . . . . . . 178
9.4.2 Cost-Sensitive ReliefF . . . . . . . . . . . . . . . . . . 180
9.4.3 Evaluation of Ordered Features at Value Level . . . . 181
9.5 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.5.1 Difference of Probabilities . . . . . . . . . . . . . . . . 182
9.5.2 Portion of the Explained Concept . . . . . . . . . . . 183
9.6 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . 184
9.6.1 Time Complexity . . . . . . . . . . . . . . . . . . . . . 184
9.6.2 Active Sampling . . . . . . . . . . . . . . . . . . . . . 184
9.6.3 Parallelization . . . . . . . . . . . . . . . . . . . . . . 185
9.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
9.7.1 Feature Subset Selection . . . . . . . . . . . . . . . . . 185
9.7.2 Feature Ranking . . . . . . . . . . . . . . . . . . . . . 186
9.7.3 Feature Weighing . . . . . . . . . . . . . . . . . . . . . 186
9.7.4 Building Tree-Based Models . . . . . . . . . . . . . . . 187
9.7.5 Feature Discretization . . . . . . . . . . . . . . . . . . 187
9.7.6 Association Rules and Genetic Algorithms . . . . . . . 187
9.7.7 Constructive Induction . . . . . . . . . . . . . . . . . . 188
9.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
10 Weighting Method for Feature Selection in K-Means 193
Joshua Zhexue Huang, Jun Xu, Michael Ng, and Yunming Ye
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
10.2 Feature Weighting in k-Means . . . . . . . . . . . . . . . . . 194
10.3 W-k-Means Clustering Algorithm . . . . . . . . . . . . . . . 197
10.4 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . 198
10.5 Subspace Clustering with k-Means . . . . . . . . . . . . . . . 200
10.6 Text Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 201
10.6.1 Text Data and Subspace Clustering . . . . . . . . . . 202
10.6.2 Selection of Key Words . . . . . . . . . . . . . . . . . 203
10.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 204
10.8 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
11 Local Feature Selection for Classification 211
Carlotta Domeniconi and Dimitrios Gunopulos
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
11.2 The Curse of Dimensionality . . . . . . . . . . . . . . . . . . 213
11.3 Adaptive Metric Techniques . . . . . . . . . . . . . . . . . . 214
11.3.1 Flexible Metric Nearest Neighbor Classification . . . . 215
11.3.2 Discriminant Adaptive Nearest Neighbor Classification 216
11.3.3 Adaptive Metric Nearest Neighbor Algorithm . . . . . 217
11.4 Large Margin Nearest Neighbor Classifiers . . . . . . . . . . 222
11.4.1 Support Vector Machines . . . . . . . . . . . . . . . . 223
11.4.2 Feature Weighting . . . . . . . . . . . . . . . . . . . . 224
11.4.3 Large Margin Nearest Neighbor Classification . . . . . 225
11.4.4 Weighting Features Increases the Margin . . . . . . . 227
11.5 Experimental Comparisons . . . . . . . . . . . . . . . . . . . 228
11.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
12 Feature Weighting through Local Learning 233
Yijun Sun
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
12.2 Mathematical Interpretation of Relief . . . . . . . . . . . . . 235
12.3 Iterative Relief Algorithm . . . . . . . . . . . . . . . . . . . . 236
12.3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 236
12.3.2 Convergence Analysis . . . . . . . . . . . . . . . . . . 238
12.4 Extension to Multiclass Problems . . . . . . . . . . . . . . . 240
12.5 Online Learning . . . . . . . . . . . . . . . . . . . . . . . . . 240
12.6 Computational Complexity . . . . . . . . . . . . . . . . . . . 242
12.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
12.7.1 Experimental Setup . . . . . . . . . . . . . . . . . . . 242
12.7.2 Experiments on UCI Datasets . . . . . . . . . . . . . . 244
12.7.3 Choice of Kernel Width . . . . . . . . . . . . . . . . . 248
12.7.4 Online Learning . . . . . . . . . . . . . . . . . . . . . 248
12.7.5 Experiments on Microarray Data . . . . . . . . . . . . 249
12.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
IV Text Classification and Clustering 255
13 Feature Selection for Text Classification 257
George Forman
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
13.1.1 Feature Selection Phyla . . . . . . . . . . . . . . . . . 259
13.1.2 Characteristic Difficulties of Text Classification Tasks 260
13.2 Text Feature Generators . . . . . . . . . . . . . . . . . . . . 261
13.2.1 Word Merging . . . . . . . . . . . . . . . . . . . . . . 261
13.2.2 Word Phrases . . . . . . . . . . . . . . . . . . . . . . . 262
13.2.3 Character N-grams . . . . . . . . . . . . . . . . . . . . 263
13.2.4 Multi-Field Records . . . . . . . . . . . . . . . . . . . 264
13.2.5 Other Properties . . . . . . . . . . . . . . . . . . . . . 264
13.2.6 Feature Values . . . . . . . . . . . . . . . . . . . . . . 265
13.3 Feature Filtering for Classification . . . . . . . . . . . . . . . 265
13.3.1 Binary Classification . . . . . . . . . . . . . . . . . . . 266
13.3.2 Multi-Class Classification . . . . . . . . . . . . . . . . 269
13.3.3 Hierarchical Classification . . . . . . . . . . . . . . . . 270
13.4 Practical and Scalable Computation . . . . . . . . . . . . . . 271
13.5 A Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . 272
13.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . 274
14 A Bayesian Feature Selection Score Based on Naïve Bayes
Models 277
Susana Eyheramendy and David Madigan
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
14.2 Feature Selection Scores . . . . . . . . . . . . . . . . . . . . . 279
14.2.1 Posterior Inclusion Probability (PIP) . . . . . . . . . . 280
14.2.2 Posterior Inclusion Probability (PIP) under a Bernoulli
distribution . . . . . . . . . . . . . . . . . . . . . . . . 281
14.2.3 Posterior Inclusion Probability (PIPp) under Poisson
distributions . . . . . . . . . . . . . . . . . . . . . . . 283
14.2.4 Information Gain (IG) . . . . . . . . . . . . . . . . . . 284
14.2.5 Bi-Normal Separation (BNS) . . . . . . . . . . . . . . 285
14.2.6 Chi-Square . . . . . . . . . . . . . . . . . . . . . . . . 285
14.2.7 Odds Ratio . . . . . . . . . . . . . . . . . . . . . . . . 286
14.2.8 Word Frequency . . . . . . . . . . . . . . . . . . . . . 286
14.3 Classification Algorithms . . . . . . . . . . . . . . . . . . . . 286
14.4 Experimental Settings and Results . . . . . . . . . . . . . . . 287
14.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 287
14.4.2 Experimental Results . . . . . . . . . . . . . . . . . . 288
14.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
15 Pairwise Constraints-Guided Dimensionality Reduction 295
Wei Tang and Shi Zhong
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
15.2 Pairwise Constraints-Guided Feature Projection . . . . . . . 297
15.2.1 Feature Projection . . . . . . . . . . . . . . . . . . . . 298
15.2.2 Projection-Based Semi-supervised Clustering . . . . . 300
15.3 Pairwise Constraints-Guided Co-clustering . . . . . . . . . . 301
15.4 Experimental Studies . . . . . . . . . . . . . . . . . . . . . . 302
15.4.1 Experimental Study – I . . . . . . . . . . . . . . . . . 302
15.4.2 Experimental Study – II . . . . . . . . . . . . . . . . . 306
15.4.3 Experimental Study – III . . . . . . . . . . . . . . . . 309
15.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . 310
16 Aggressive Feature Selection by Feature Ranking 313
Masoud Makrehchi and Mohamed S. Kamel
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
16.2 Feature Selection by Feature Ranking . . . . . . . . . . . . . 314
16.2.1 Multivariate Characteristic of Text Classifiers . . . . . 316
16.2.2 Term Redundancy . . . . . . . . . . . . . . . . . . . . 316
16.3 Proposed Approach to Reducing Term Redundancy . . . . . 320
16.3.1 Stemming, Stopwords, and Low-DF Terms Elimination 320
16.3.2 Feature Ranking . . . . . . . . . . . . . . . . . . . . . 320
16.3.3 Redundancy Reduction . . . . . . . . . . . . . . . . . 322
16.3.4 Redundancy Removal Algorithm . . . . . . . . . . . . 325
16.3.5 Term Redundancy Tree . . . . . . . . . . . . . . . . . 326
16.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 326
16.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
V Feature Selection in Bioinformatics 335
17 Feature Selection for Genomic Data Analysis 337
Lei Yu
17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
17.1.1 Microarray Data and Challenges . . . . . . . . . . . . 337
17.1.2 Feature Selection for Microarray Data . . . . . . . . . 338
17.2 Redundancy-Based Feature Selection . . . . . . . . . . . . . 340
17.2.1 Feature Relevance and Redundancy . . . . . . . . . . 340
17.2.2 An Efficient Framework for Redundancy Analysis . . . 343
17.2.3 RBF Algorithm . . . . . . . . . . . . . . . . . . . . . . 345
17.3 Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . 347
17.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 347
17.3.2 Experimental Settings . . . . . . . . . . . . . . . . . . 349
17.3.3 Results and Discussion . . . . . . . . . . . . . . . . . . 349
17.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
18 A Feature Generation Algorithm with Applications to Bio-
logical Sequence Classification 355
Rezarta Islamaj Dogan, Lise Getoor, and W. John Wilbur
18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
18.2 Splice-Site Prediction . . . . . . . . . . . . . . . . . . . . . . 356
18.2.1 The Splice-Site Prediction Problem . . . . . . . . . . . 356
18.2.2 Current Approaches . . . . . . . . . . . . . . . . . . . 357
18.2.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . 359
18.3 Feature Generation Algorithm . . . . . . . . . . . . . . . . . 359
18.3.1 Feature Type Analysis . . . . . . . . . . . . . . . . . . 360
18.3.2 Feature Selection . . . . . . . . . . . . . . . . . . . . 362
18.3.3 Feature Generation Algorithm (FGA) . . . . . . . . . 364
18.4 Experiments and Discussion . . . . . . . . . . . . . . . . . . 366
18.4.1 Data Description . . . . . . . . . . . . . . . . . . . . 366
18.4.2 Feature Generation . . . . . . . . . . . . . . . . . . . . 367
18.4.3 Prediction Results for Individual Feature Types . . . . 369
18.4.4 Splice-Site Prediction with FGA Features . . . . . . . 370
18.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
19 An Ensemble Method for Identifying Robust Features for
Biomarker Discovery 377
Diana Chan, Susan M. Bridges, and Shane C. Burgess
19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
19.2 Biomarker Discovery from Proteome Profiles . . . . . . . . . 378
19.3 Challenges of Biomarker Identification . . . . . . . . . . . . . 380
19.4 Ensemble Method for Feature Selection . . . . . . . . . . . . 381
19.5 Feature Selection Ensemble . . . . . . . . . . . . . . . . . . . 383
19.6 Results and Discussion . . . . . . . . . . . . . . . . . . . . . 384
19.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
20 Model Building and Feature Selection with Genomic Data 393
Hui Zou and Trevor Hastie
20.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
20.2 Ridge Regression, Lasso, and Bridge . . . . . . . . . . . . . . 394
20.3 Drawbacks of the Lasso . . . . . . . . . . . . . . . . . . . . . 396
20.4 The Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . . 397
20.4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . 397
20.4.2 A Stylized Example . . . . . . . . . . . . . . . . . . . 399
20.4.3 Computation and Tuning . . . . . . . . . . . . . . . . 400
20.4.4 Analyzing the Cardiomypathy Data . . . . . . . . . . 402
20.5 The Elastic-Net Penalized SVM . . . . . . . . . . . . . . . . 404
20.5.1 Support Vector Machines . . . . . . . . . . . . . . . . 404
20.5.2 A New SVM Classifier . . . . . . . . . . . . . . . . . . 405
20.6 Sparse Eigen-Genes . . . . . . . . . . . . . . . . . . . . . . . 407
20.6.1 PCA and Eigen-Genes . . . . . . . . . . . . . . . . . . 408
20.6.2 Sparse Principal Component Analysis . . . . . . . . . 408
20.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
Index 413
Part I
Introduction and
Background
Chapter 1
Less Is More
Huan Liu
Arizona State University
Hiroshi Motoda
AFOSR/AOARD, Air Force Research Laboratory
1.1 Background and Basics .................................................. 4
1.2 Supervised, Unsupervised, and Semi-Supervised Feature Selection ..... 7
1.3 Key Contributions and Organization of the Book ...................... 10
1.4 Looking Ahead ........................................................... 15
References ............................................................... 16
As our world expands at an unprecedented speed from the physical into the
virtual, we can conveniently collect more and more data in any way one can
imagine for various reasons. Is it “The more, the merrier (better)”? The
answer is “Yes” and “No.” It is “Yes” because we can at least get what we
might need. It is also “No” because, when it comes to a point of too much,
the existence of inordinate data is tantamount to non-existence if there is no
means of effective data access. More can mean less. Without the processing
of data, its mere existence would not become a useful asset that can impact
our business and many other matters. Since continued data accumulation
is inevitable, one way out is to devise data selection techniques to keep pace
with the rate of data collection. Furthermore, given the sheer volume of data,
data generated by computers or equivalent mechanisms must be processed
automatically, in order for us to tame the data monster and stay in control.
Recent years have seen extensive efforts in feature selection research. The
field of feature selection expands both in depth and in breadth, due to in-
creasing demands for dimensionality reduction. The evidence can be found
in many recent papers, workshops, and review articles. The research expands
from classic supervised feature selection to unsupervised and semi-supervised
feature selection, to selection of different feature types such as causal and
structural features, to different kinds of data like high-throughput, text, or
images, to feature selection evaluation, and to wide applications of feature
selection where data abound.
No book of this size could possibly document the extensive efforts in the
frontier of feature selection research. We thus try to sample the field in several
ways: asking established experts, calling for submissions, and looking at the
recent workshops and conferences, in order to understand the current devel-
opments. As this book aims to serve a wide audience from practitioners to
researchers, we first introduce the basic concepts and the essential problems
with feature selection; next illustrate feature selection research in parallel
to supervised, unsupervised, and semi-supervised learning; then present an
overview of feature selection activities included in this collection; and last
contemplate some issues about evolving feature selection. The book is orga-
nized in five parts: (I) Introduction and Background, (II) Extending Feature
Selection, (III) Weighting and Local Methods, (IV) Text Feature Selection,
and (V) Feature Selection in Bioinformatics. These five parts are relatively
independent and can be read in any order. For a newcomer to the field of fea-
ture selection, we recommend that you read Chapters 1, 2, 9, 13, and 17 first,
then decide on which chapters to read further according to your need and in-
terest. Rudimentary concepts and discussions of related issues such as feature
extraction and construction can also be found in two earlier books [10, 9].
Instance selection can be found in [11].
1.1 Background and Basics
One of the fundamental motivations for feature selection is the curse of
dimensionality [6]. Plainly speaking, two close data points in a 2-d space are
likely distant in a 100-d space (refer to Chapter 2 for an illustrative example).
For the case of classification, this makes it difficult to make a prediction of
unseen data points by a hypothesis constructed from a limited number of
training instances. The number of features is a key factor that determines the
size of the hypothesis space containing all hypotheses that can be learned from
data [13]. A hypothesis is a pattern or function that predicts classes based
on given data. The more features, the larger the hypothesis space. Worse
still, the linear increase of the number of features leads to the exponential
increase of the hypothesis space. For example, for N binary features and a
binary class feature, the hypothesis space is as big as $2^{2^N}$. Therefore, feature
selection can efficiently reduce the hypothesis space by removing irrelevant
and redundant features. The smaller the hypothesis space, the easier it is
to find correct hypotheses. Given a fixed-size data sample that is part of the
underlying population, the reduction of dimensionality also lowers the number
of required training instances. For example, given M, when the number of
binary features N = 10 is reduced to N = 5, the ratio of $M/2^N$ increases
exponentially. In other words, it virtually increases the number of training
instances. This helps to better constrain the search of correct hypotheses.
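To make the arithmetic concrete, the following minimal Python sketch computes both quantities; the sample size M = 100 is an assumed value for illustration only, not one taken from the text.

```python
# Sketch of the arithmetic above: size of the hypothesis space over N binary
# features, and the instances-to-cells ratio M / 2^N before and after reduction.

def hypothesis_space_size(n_features: int) -> int:
    """Number of distinct Boolean functions over n binary features: 2^(2^n)."""
    return 2 ** (2 ** n_features)

def coverage_ratio(m_instances: int, n_features: int) -> float:
    """Ratio of training instances to the 2^n cells of the binary feature space."""
    return m_instances / (2 ** n_features)

if __name__ == "__main__":
    M = 100  # assumed sample size, for illustration only
    for n in (10, 5):
        digits = len(str(hypothesis_space_size(n)))
        print(f"N={n:2d}: |hypothesis space| = 2^(2^{n}) ({digits} decimal digits), "
              f"M/2^N = {coverage_ratio(M, n):.4f}")
    # Reducing N from 10 to 5 multiplies M/2^N by 2^5 = 32: the same sample
    # covers the smaller feature space far more densely.
```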
Feature selection is essentially a task to remove irrelevant and/or redun-
dant features. Irrelevant features can be removed without affecting learning
performance [8]. Redundant features are a type of irrelevant feature [16]. The
distinction is that a redundant feature implies the co-presence of another fea-
ture; individually, each feature is relevant, but the removal of one of them will
not affect learning performance. The selection of features can be achieved
in two ways: One is to rank features according to some criterion and select
the top k features, and the other is to select a minimum subset of features
without learning performance deterioration. In other words, subset selection
algorithms can automatically determine the number of selected features, while
feature ranking algorithms need to rely on some given threshold to select fea-
tures. An example of feature ranking algorithms is detailed in Chapter 9. An
example of subset selection can be found in Chapter 17.
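As a minimal illustration of the first (ranking) style, the sketch below scores each feature independently with a caller-supplied criterion and keeps the top k; both the criterion and the threshold k are assumptions supplied by the user rather than anything prescribed in this chapter.

```python
import numpy as np

def rank_and_select(X: np.ndarray, score_feature, k: int) -> list[int]:
    """Feature ranking: score each feature column independently with the given
    criterion and return the indices of the k best-scoring features."""
    scores = np.array([score_feature(X[:, j]) for j in range(X.shape[1])])
    ranking = np.argsort(scores)[::-1]   # best-scoring features first
    return ranking[:k].tolist()          # the threshold k must be chosen by the user
```

A subset selection algorithm, by contrast, returns a feature set whose size is determined by the algorithm itself, as in the sequential search sketched later in this section.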
Other important aspects of feature selection include models, search strate-
gies, feature quality measures, and evaluation [10]. The three typical models
are filter, wrapper, and embedded. An embedded model of feature selection
integrates the selection of features in model building. An example of such a
model is the decision tree induction algorithm, in which at each branching
node, a feature has to be selected. The research shows that even for such
a learning algorithm, feature selection can result in improved learning per-
formance. In a wrapper model, one employs a learning algorithm and uses
its performance to determine the quality of selected features. As shown in
Chapter 2, filter and wrapper models are not confined to supervised feature
selection, and can also apply to the study of unsupervised feature selection
algorithms.
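As a hedged illustration of the wrapper model, the helper below builds a subset-quality function from a learning algorithm's cross-validated accuracy; scikit-learn and the k-nearest-neighbor classifier are used here only as convenient, assumed stand-ins for any learner.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def make_wrapper_quality(X: np.ndarray, y: np.ndarray):
    """Wrapper model: the quality of a candidate feature subset is the
    cross-validated accuracy of a learner trained on just those features."""
    def quality(subset: list[int]) -> float:
        if not subset:
            return float("-inf")
        return cross_val_score(KNeighborsClassifier(), X[:, subset], y, cv=5).mean()
    return quality
```

A filter model would replace the learner-based score with a data-intrinsic measure (e.g., a correlation or information criterion), while an embedded model folds the choice of features into model building itself, as in decision tree induction.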
Search strategies [1] are investigated and various strategies are proposed
including forward, backward, floating, branch-and-bound, and randomized.
If one starts with an empty feature subset and adds relevant features into
the subset following a procedure, it is called forward selection; if one begins
with a full set of features and removes features procedurally, it is backward
selection. Given a large number of features, either strategy might be too costly
to work. Take the example of forward selection. Since k is usually unknown
a priori, one needs to try $\binom{N}{1} + \binom{N}{2} + \cdots + \binom{N}{k}$ times in order to figure out
k out of N features for selection. Therefore, its time complexity is $O(2^N)$.
Hence, more efficient algorithms are developed. The widely used ones are
sequential strategies. A sequential forward selection (SFS) algorithm selects
one feature at a time until adding another feature does not improve the subset
quality with the condition that a selected feature remains selected. Similarly,
a sequential backward selection (SBS) algorithm eliminates one feature at a
time and once a feature is eliminated, it will never be considered again for
inclusion. Obviously, both search strategies are heuristic in nature and cannot
guarantee the optimality of the selected features. Among alternatives to these
strategies are randomized feature selection algorithms, which are discussed in
Chapter 3. A relevant issue regarding exhaustive and heuristic searches is
whether there is any reason to perform exhaustive searches if time complexity
were not a concern. Research shows that exhaustive search can lead to
features that exacerbate data overfitting, while heuristic search is less prone
to data overfitting in feature selection when facing small data samples.
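The sequential strategies above can be sketched as follows; the quality function is assumed to be supplied by the caller (for example, the wrapper-style cross-validated accuracy sketched earlier), and this greedy loop is only an illustrative rendering of SFS, not code from the book.

```python
def sequential_forward_selection(n_features: int, quality) -> list[int]:
    """Greedy SFS: add one feature at a time; once selected, a feature stays
    selected; stop when no single addition improves the subset quality."""
    selected: list[int] = []
    best_quality = float("-inf")
    while True:
        remaining = [f for f in range(n_features) if f not in selected]
        if not remaining:
            break
        # Evaluate adding each remaining feature to the current subset.
        q, f = max((quality(selected + [f]), f) for f in remaining)
        if q <= best_quality:
            break  # no addition improves the subset: stop
        selected.append(f)
        best_quality = q
    return selected
```

Sequential backward selection is the mirror image: start from the full feature set and greedily eliminate the feature whose removal hurts the subset quality the least, never reconsidering an eliminated feature.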
The small sample problem addresses a new type of “wide” data where the
number of features (N) is several degrees of magnitude more than the num-
ber of instances (M). High-throughput data produced in genomics and pro-
teomics and text data are typical examples. In connection to the curse of
dimensionality mentioned earlier, the wide data present challenges to the reli-
able estimation of the model’s performance (e.g., accuracy), model selection,
and data overfitting. In [3], a pithy illustration of the small sample problem
is given with detailed examples.
The evaluation of feature selection often entails two tasks. One is to com-
pare two cases: before and after feature selection. The goal of this task is to
observe if feature selection achieves its intended objectives (recall that feature
selection does not confine it to improving classification performance). The
aspects of evaluation can include the number of selected features, time, scala-
bility, and learning model’s performance. The second task is to compare two
feature selection algorithms to see if one is better than the other for a certain
task. A detailed empirical study is reported in [14]. As we know, there is
no universally superior feature selection, and different feature selection algo-
rithms have their special edges for various applications. Hence, it is wise to
find a suitable algorithm for a given application. An initial attempt to ad-
dress the problem of selecting feature selection algorithms is presented in [12],
aiming to mitigate the increasing complexity of finding a suitable algorithm
from many feature selection algorithms.
Another issue arising from feature selection evaluation is feature selection
bias. Using the same training data in both feature selection and classifica-
tion learning can result in this selection bias. According to statistical theory
based on regression research, this bias can exacerbate data over-fitting and
negatively affect classification performance. A recommended practice is to
use separate data for feature selection and for learning. In reality, however,
separate datasets are rarely used in the selection and learning steps. This is
because we want to use as much data as possible in both selection and learning.
It is against this intuition to divide the training data into two datasets, leading
to reduced data for both tasks. Feature selection bias is studied in [15]
to seek answers as to whether there is a discrepancy between the current practice and the
statistical theory. The findings are that the statistical theory is correct, but
feature selection bias has limited effect on feature selection for classification.
Recently researchers started paying attention to interacting features [7].
Feature interaction usually defies those heuristic solutions to feature selection
that evaluate individual features for efficiency. This is because interacting fea-
tures exhibit properties that cannot be detected in individual features. One
simple example of interacting features is the XOR problem, in which both
features together determine the class and each individual feature does not tell
much at all. By combining careful selection of a feature quality measure and
design of a special data structure, one can heuristically handle some feature
interaction as shown in [17]. The randomized algorithms detailed in Chapter 3
may provide an alternative. An overview of various additional issues related
to improving classification performance can be found in [5]. Since there are
many facets of feature selection research, we choose a theme that runs in par-
allel with supervised, unsupervised, and semi-supervised learning below, and
discuss and illustrate the underlying concepts of disparate feature selection
types, their connections, and how they can benefit from one another.
1.2 Supervised, Unsupervised, and Semi-Supervised Fea-
ture Selection
In one of the early surveys [2], all algorithms are supervised in the sense
that data have class labels (denoted as Xl). Supervised feature selection al-
gorithms rely on measures that take into account the class information. A
well-known measure is information gain, which is widely used in both feature
selection and decision tree induction. Assuming there are two features F1 and
F2, we can calculate feature Fi’s information gain as E0 − Ei, where E is
entropy. E0 is the entropy before the data split using feature Fi, and can be
calculated as $E_0 = -\sum_{c} p_c \log p_c$, where $p_c$ is the estimated probability of class
c and c = 1, 2, ..., C. Ei is the entropy after the data split using Fi. A better
feature can result in larger information gain. Clearly, class information plays
a critical role here. Another example is the algorithm ReliefF, which also uses
the class information to determine an instance’s “near-hit” (a neighboring in-
stance having the same class) and “near-miss” (a neighboring instance having
a different class). More details about ReliefF can be found in Chapter 9. In
essence, supervised feature selection algorithms try to find features that help
separate data of different classes and we name it class-based separation. If a
feature has no effect on class-based separation, it can be removed. A good
feature should, therefore, help enhance class-based separation.
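A small worked example of information gain, using the entropy formula above; the four-instance dataset is assumed purely for illustration.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """E = -sum_c p_c log2 p_c, with p_c estimated from the label counts."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(feature_values, labels) -> float:
    """E0 - Ei: entropy before the split minus the weighted entropy after
    splitting the data on the feature's values."""
    n = len(labels)
    e0 = entropy(labels)
    ei = 0.0
    for v in set(feature_values):
        subset = [c for x, c in zip(feature_values, labels) if x == v]
        ei += (len(subset) / n) * entropy(subset)
    return e0 - ei

# Toy data, assumed for illustration: F1 matches the class exactly, F2 is noise.
classes = ["+", "+", "-", "-"]
print(information_gain([1, 1, 0, 0], classes))  # F1: gain = 1.0 bit
print(information_gain([1, 0, 1, 0], classes))  # F2: gain = 0.0 bits
```

Under this measure F1 would be ranked far above F2, which is exactly the class-based separation a supervised selector looks for.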
In the late 90’s, research on unsupervised feature selection intensified in
order to deal with data without class labels (denoted as Xu). It is closely
related to unsupervised learning [4]. One example of unsupervised learning is
clustering, where similar instances are grouped together and dissimilar ones
are separated apart. Similarity can be defined by the distance between two
instances. Conceptually, the two instances are similar if the distance between
the two is small, otherwise they are dissimilar. When all instances are con-
nected pair-wisely, breaking the connections between those instances that are
far apart will form clusters. Hence, clustering can be thought as achieving
locality-based separation. One widely used clustering algorithm is k-means.
It is an iterative algorithm that categorizes instances into k clusters. Given
predetermined k centers (or centroids), it works as follows: (1) Instances are
categorized to their closest centroid, (2) the centroids are recalculated using
the instances in each cluster, and (3) the first two steps are repeated until the
centroids do not change. Obviously, the key concept is distance calculation,
which is sensitive to dimensionality, as we discussed earlier about the curse of
dimensionality. Basically, if there are many irrelevant or redundant features,
clustering will be different from that with only relevant features. One toy
example can be found in Figure 1.1 in which two well-formed clusters in a 1-d
space (x) become two different clusters (denoted with different shapes, circles
vs. diamonds) in a 2-d space after introducing an irrelevant feature y. Unsu-
pervised feature selection is more difficult to deal with than supervised feature
selection. However, it also is a very useful tool as the majority of data are
unlabeled. A comprehensive introduction and review of unsupervised feature
selection is presented in Chapter 2.
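The three-step procedure just described can be rendered as a minimal NumPy sketch; random initialization of the k centroids is an implementation choice assumed here, not part of the description above.

```python
import numpy as np

def k_means(X: np.ndarray, k: int, max_iter: int = 100, seed: int = 0) -> np.ndarray:
    """Return a cluster assignment for each instance: (1) assign instances to the
    closest centroid, (2) recompute centroids, (3) repeat until they stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (1) categorize instances to their closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # (2) recalculate each centroid from the instances in its cluster
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        # (3) stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign
```

Because step (1) depends on distances computed over all features, irrelevant or redundant features distort the assignments, which is precisely the sensitivity to dimensionality illustrated in Figure 1.1.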
FIGURE 1.1: An illustrative example: left - two well-formed clusters; middle -
after an irrelevant feature is added; right - after applying 2-means clustering.
When a small number of instances are labeled but the majority are not,
semi-supervised feature selection is designed to take advantage of both the
large number of unlabeled instances and the labeling information as in semi-
supervised learning. Intuitively, the additional labeling information should
help constrain the search space of unsupervised feature selection. In other
words, semi-supervised feature selection attempts to align locality-based separation
and class-based separation. Since there are a large number of unlabeled
data and a small number of labeled instances, it is reasonable to use
unlabeled data to form some potential clusters and then employ labeled data
to find those clusters that can achieve both locality-based and class-based sep-
arations. For the two possible clustering results in Figure 1.1, if we are given
one correctly labeled instance each for the clusters of circles and diamonds,
the correct clustering result (the middle figure) will be chosen. The idea of
semi-supervised feature selection can be illustrated as in Figure 1.2 showing
how the properties of Xl and Xu complement each other and work together to
find relevant features. Two feature vectors (corresponding to two features, f
and f′) can generate respective cluster indicators representing different clus-
tering results: The left one can satisfy both constraints of Xl and Xu, but the
right one can only satisfy Xu. For semi-supervised feature selection, we want
to select f over f′. In other words, there are two equally good ways to cluster
the data as shown in the figure, but only one way can also attain class-based
FIGURE 1.2: The basic idea for comparing the fitness of cluster indicators according
to both Xl (labeled data) and Xu (unlabeled data) for semi-supervised feature
selection. "-" and "+" correspond to instances of negative and positive classes, and
"M" to unlabeled instances.
separation. A semi-supervised feature selection algorithm sSelect is proposed
in [18]; sSelect is effective in using both data properties when locality-based
separation and class-based separation do not generate conflicts. We expect to
witness a surge of study on semi-supervised feature selection. The reason is
two-fold: It is often affordable to carefully label a small number of instances,
and it also provides a natural way for human experts to inject their knowledge
into the feature selection process in the form of labeled instances.
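To make the idea tangible, here is a simplified, hypothetical sketch (not the sSelect algorithm of [18]): each candidate feature is scored by clustering the unlabeled data along that feature alone and then checking how well the resulting clusters agree with the few available labels. The scoring scheme and the one-dimensional k-means initialization are assumptions made only for this illustration.

```python
import numpy as np
from collections import Counter

def semi_supervised_feature_score(x: np.ndarray, labeled_idx: np.ndarray,
                                  labeled_y: np.ndarray, k: int = 2,
                                  n_iter: int = 50) -> float:
    """Cluster the single feature x into k groups (locality-based separation),
    then score the feature by the majority-vote accuracy of those clusters on
    the labeled instances (class-based separation)."""
    # one-dimensional k-means with quantile initialization
    centroids = np.quantile(x, np.linspace(0.1, 0.9, k))
    for _ in range(n_iter):
        assign = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        centroids = np.array([x[assign == j].mean() if np.any(assign == j)
                              else centroids[j] for j in range(k)])
    # agreement with the labels: majority class per cluster, on labeled data only
    correct = 0
    for j in range(k):
        ys = labeled_y[assign[labeled_idx] == j]
        if len(ys):
            majority = Counter(ys.tolist()).most_common(1)[0][0]
            correct += int((ys == majority).sum())
    return correct / len(labeled_idx)
```

An irrelevant feature such as y in Figure 1.1 would tend to score poorly under such a criterion, since no clustering along it alone matches the labeled instances of the two classes.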
Above, we presented and illustrated the development of feature selection
in parallel to supervised, unsupervised, and semi-supervised learning to meet
the increasing demands of labeled, unlabeled, and partially labeled data. It
is just one perspective of feature selection that encompasses many aspects.
However, from this perspective, it can be clearly seen that as data evolve,
feature selection research adapts and develops into new areas in various forms
for emerging real-world applications. In the following, we present an overview
of the research activities included in this book.
1.3 Key Contributions and Organization of the Book
The ensuing chapters showcase some current research issues of feature se-
lection. They are categorically grouped into five parts, each containing four
chapters. The first chapter in Part I is this introduction. The other three
discuss issues such as unsupervised feature selection, randomized feature se-
lection, and causal feature selection. Part II reports some recent results of em-
powering feature selection, including active feature selection, decision-border
estimate, use of ensembles with independent probes, and incremental fea-
ture selection. Part III deals with weighting and local methods such as an
overview of the ReliefF family, feature selection in k-means clustering, local
feature relevance, and a new interpretation of Relief. Part IV is about text
feature selection, presenting an overview of feature selection for text classifi-
cation, a new feature selection score, constraint-guided feature selection, and
aggressive feature selection. Part V is on Feature Selection in Bioinformat-
ics, discussing redundancy-based feature selection, feature construction and
selection, ensemble-based robust feature selection, and penalty-based feature
selection. A summary of each chapter is given next.
1.3.1 Part I - Introduction and Background
Chapter 2 is an overview of unsupervised feature selection, finding the
smallest feature subset that best uncovers interesting, natural clusters for the
chosen criterion. The existence of irrelevant features can misguide clustering
results. Both filter and wrapper approaches can apply as in a supervised
setting. Feature selection can either be global or local, and the features to
be selected can vary from cluster to cluster. Disparate feature subspaces can
have different underlying numbers of natural clusters. Therefore, care must
be taken when comparing two clusters with different sets of features.
Chapter 3 is also an overview, this one of randomization techniques for feature
selection. Randomization can lead to an efficient algorithm when the benefits
of good choices outweigh the costs of bad choices. There are two broad classes
of algorithms: Las Vegas algorithms, which guarantee a correct answer but
may require a long time to execute with small probability, and Monte Carlo
algorithms, which may output an incorrect answer with small probability but
always complete execution quickly. The randomized complexity classes define
the probabilistic guarantees that an algorithm must meet. The major sources
of randomization are the input features and/or the training examples. The
chapter introduces examples of several randomization algorithms.
Chapter 4 addresses the notion of causality and reviews techniques for
learning causal relationships from data in applications to feature selection.
Causal Bayesian networks provide a convenient framework for reasoning about
causality and an algorithm is presented that can extract causality from data
by finding the Markov blanket. Direct causes (parents), direct effects (chil-
dren), and other direct causes of the direct effects (spouses) are all members
of the Markov blanket. Only direct causes are strongly causally relevant. The
knowledge of causal relationships can benefit feature selection, e.g., explain-
ing relevance in terms of causal mechanisms, distinguishing between actual
features and experimental artifacts, predicting the consequences of actions,
and making predictions in a non-stationary environment.
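To make the membership definition concrete, the sketch below computes a Markov blanket from a known DAG structure; it only illustrates the definition (parents, children, spouses), not the discovery algorithm presented in the chapter.

def markov_blanket(target, parents):
    """Return the Markov blanket of `target` in a causal DAG.

    `parents` maps each node to the set of its direct parents. The blanket
    consists of the target's parents (direct causes), children (direct
    effects), and spouses (other direct causes of the direct effects).
    """
    children = {n for n, ps in parents.items() if target in ps}
    spouses = {p for c in children for p in parents[c]} - {target}
    return set(parents.get(target, set())) | children | spouses

# Example DAG: A -> T, T -> C, B -> C  (B is a spouse of T).
dag = {"T": {"A"}, "C": {"T", "B"}, "A": set(), "B": set()}
print(markov_blanket("T", dag))  # prints {'A', 'C', 'B'} (set order may vary)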
1.3.2 Part II - Extending Feature Selection
Chapter 5 poses an interesting problem of active feature sampling in do-
mains where the feature values are expensive to measure. The selection of
features is based on the maximum benefit. A benefit function minimizes the
mean-squared error in a feature relevance estimate. It is shown that the
minimum mean-squared error criterion is equivalent to the maximum average
change criterion. The results obtained by using a mixture model for the joint
class-feature distribution show the advantage of the active sampling policy
over the random sampling in reducing the number of feature samples. The
approach is computationally expensive. Considering only a random subset of
the missing entries at each sampling step is a promising solution.
Chapter 6 discusses feature extraction (as opposed to feature selection)
based on the properties of the decision border. It is intuitive that the direction
normal to the decision boundary represents an informative direction for class
discriminability and its effectiveness is proportional to the area of decision bor-
der that has the same normal vector. Based on this, a labeled vector quantizer
that can efficiently be trained by the Bayes risk weighted vector quantization
(BVQ) algorithm was devised to extract the best linear approximation to the
decision border. The BVQ produces a decision boundary feature matrix, and
the eigenvectors of this matrix are exploited to transform the original feature
space into a new feature space with reduced dimensionality. It is shown that
this approach is comparable to the SVM-based decision boundary approach
and better than the MLP (Multi Layer Perceptron)-based approach, but with
a lower computational cost.
Chapter 7 proposes to compare each feature's relevance against the relevance
of its randomly permuted version (a probe) for classification/regression tasks
using random forests. The key is to generate each probe from the same distribution
as the original feature. Feature relevance is estimated by averaging the relevance
obtained from each tree in the ensemble. The method then iterates over the remaining
features, removing the identified important features and using the residuals as new
target variables. It offers autonomous feature selection taking into account
non-linearity, mixed-type data, and missing data in regressions and classifica-
tions. It shows excellent performance and low computational complexity, and
is able to address massive amounts of data.
Chapter 8 introduces an incremental feature selection algorithm for high-
dimensional data. The key idea is to decompose the whole process into feature
ranking and selection. The method first ranks features and then resolves the
redundancy by an incremental subset search using the ranking. The incre-
mental subset search does not retract what it has selected, but it can decide
not to add the next candidate feature, i.e., skip it and try the next according
to the rank. Thus, the average number of features used to construct a learner
during the search is kept small, which makes the wrapper approach feasible
for high-dimensional data.
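A hedged sketch of this rank-then-incrementally-select idea (not the chapter's exact algorithm) is shown below: the search walks down a precomputed ranking, keeps a candidate only if the wrapper score improves, and never retracts an earlier choice; evaluate is a hypothetical scoring function wrapping a learner.

def incremental_selection(ranked_features, evaluate):
    """Forward selection along a fixed ranking that never retracts a choice.

    ranked_features: features sorted from most to least relevant.
    evaluate: returns a score (e.g., cross-validated accuracy) for a subset;
              it is a hypothetical stand-in for the wrapped learner.
    """
    selected = [ranked_features[0]]
    best_score = evaluate(selected)
    for candidate in ranked_features[1:]:
        score = evaluate(selected + [candidate])
        if score > best_score:      # keep the candidate only if it helps
            selected.append(candidate)
            best_score = score
        # otherwise skip it and try the next feature in the ranking
    return selected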
1.3.3 Part III - Weighting and Local Methods
Chapter 9 is a comprehensive description of the Relief family algorithms.
Relief exploits the context of other features through distance measures and can
detect highly conditionally-dependent features. The chapter explains the idea,
advantages, and applications of Relief and introduces two extensions: ReliefF
and RReliefF. ReliefF is for classification and can deal with incomplete data
with multi-class problems. RReliefF is its extension designed for regression.
The variety of the Relief family shows the general applicability of the basic
idea of Relief as a non-myopic feature quality measure.
Chapter 10 discusses how to automatically determine the important fea-
tures in the k-means clustering process. The weight of a feature is determined
by the sum of the within-cluster dispersions of the feature, which measures
its importance in clustering. A new step to calculate the feature weights is
added to the iterative process in a way that does not seriously affect scalability.
The weight can be defined either globally (same weights for all clusters) or
locally (different weights for different clusters). The latter, called subspace
k-means clustering, has applications in text clustering, bioinformatics, and
customer behavior analysis.
Chapter 11 is in line with Chapter 5, but focuses on local feature relevance
and weighting. Each feature’s ability for class probability prediction at each
point in the feature space is formulated in a way similar to the weighted χ-
square measure, from which the relevance weight is derived. The weight has
a large value for a direction along which the class probability is not locally
constant. To gain efficiency, a decision boundary is first obtained by an SVM,
and its normal vector nearest to the point in query is used to estimate the
weights reflected in the distance measure for a k-nearest neighbor classifier.
Chapter 12 gives further insights into Relief (refer to Chapter 9). The
working of Relief is proven to be equivalent to solving an online convex opti-
mization problem with a margin-based objective function that is defined based
on a nearest neighbor classifier. Relief usually performs (1) better than other
filter methods due to the local performance feedback of a nonlinear classifier
when searching for useful features, and (2) better than wrapper methods due
to the existence of efficient algorithms for a convex optimization problem. The
weights can be iteratively updated by an EM-like algorithm, which guaran-
tees the uniqueness of the optimal weights and the convergence. The method
was further extended to its online version, which is quite effective when it is
difficult to use all the data in a batch mode.
1.3.4 Part IV - Text Classification and Clustering
Chapter 13 is a comprehensive presentation of feature selection for text
classification, including feature generation, representation, and selection, with
illustrative examples, from a pragmatic view point. A variety of feature gen-
erating schemes is reviewed, including word merging, word phrases, character
N-grams, and multi-fields. The generated features are ranked by scoring each
feature independently. Examples of scoring measures are information gain,
χ-square, and bi-normal separation. A case study shows considerable im-
provement of F-measure by feature selection. It also shows that adding two-word
phrases as new features generally gives a good performance gain over the
features comprising only selected words.
Chapter 14 introduces a new feature selection score, which is defined as the
posterior probability of inclusion of a given feature over all possible models,
where each model corresponds to a different set of features that includes the
given feature. The score assumes a probability distribution on the words of
the documents. Bernoulli and Poisson distributions are assumed respectively
when only the presence or absence of a word matters and when the number
of occurrences of a word matters. The score computation is inexpensive,
and the value that the score assigns to each word has an appealing Bayesian
interpretation when the predictive model corresponds to a naive Bayes model.
This score is compared with five other well-known scores.
Chapter 15 focuses on dimensionality reduction for semi-supervised clus-
tering where some weak supervision is available in terms of pairwise instance
constraints (must-link and cannot-link). Two methods are proposed by lever-
aging pairwise instance constraints: pairwise constraints-guided feature pro-
jection and pairwise constraints-guided co-clustering. The former is used to
project data into a lower dimensional space such that the sum-squared dis-
tance between must-link instances is minimized and the sum-squared dis-
tance between cannot-link instances is maximized. This reduces to an elegant
eigenvalue decomposition problem. The latter is to use feature clustering
benefitting from pairwise constraints via a constrained co-clustering mecha-
nism. Feature clustering and data clustering are mutually reinforced in the
co-clustering process.
Chapter 16 proposes aggressive feature selection, removing more than
95% features (terms) for text data. Feature ranking is effective to remove
irrelevant features, but cannot handle feature redundancy. Experiments show
that feature redundancy can be as destructive as noise. A new multi-stage
approach for text feature selection is proposed: (1) pre-processing to remove
stop words, infrequent words, noise, and errors; (2) ranking features to iden-
tify the most informative terms; and (3) removing redundant and correlated
terms. In addition, term redundancy is modeled by a term-redundancy tree
for visualization purposes.
1.3.5 Part V - Feature Selection in Bioinformatics
Chapter 17 introduces the challenges of microarray data analysis and
presents a redundancy-based feature selection algorithm. For high-throughput
data like microarrays, redundancy among genes becomes a critical issue. Con-
ventional feature ranking algorithms cannot effectively handle feature redun-
dancy. It is known that if there is a Markov blanket for a feature, the feature
can be safely eliminated. Finding a Markov blanket is computationally heavy.
The solution proposed is to use an approximate Markov blanket, in which it is
assumed that the Markov blanket always consists of one feature. The features
are first ranked, and then each feature is checked in sequence to see whether it
has an approximate Markov blanket in the current set. This way it can efficiently
find all predominant features and eliminate the rest. Biologists would welcome
an efficient filter algorithm for handling feature redundancy. Redundancy-based fea-
ture selection makes it possible for a biologist to specify what genes are to be
included before feature selection.
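As a rough sketch in the spirit of this ranking-plus-approximate-Markov-blanket filtering (not the chapter's exact algorithm), the code below removes a feature whenever some higher-ranked retained feature is more strongly associated with it than it is with the target; the relevance and redundancy measures are caller-supplied stand-ins.

import numpy as np

def redundancy_filter(X, y, relevance, redundancy):
    """Sketch of ranking plus approximate-Markov-blanket elimination.

    relevance(f, y): score of a feature column against the target.
    redundancy(f, g): association between two feature columns.
    A feature is dropped if some higher-ranked retained feature is more
    strongly associated with it than it is with the target (an approximate
    Markov blanket of size one). Both measures are hypothetical stand-ins.
    """
    order = sorted(range(X.shape[1]), key=lambda j: relevance(X[:, j], y), reverse=True)
    kept = []
    for j in order:
        rel_j = relevance(X[:, j], y)
        blanketed = any(redundancy(X[:, k], X[:, j]) >= rel_j for k in kept)
        if not blanketed:
            kept.append(j)
    return kept

# Example stand-in measure (absolute Pearson correlation), usable for both
# relevance and redundancy: redundancy_filter(X, y, corr, corr)
corr = lambda a, b: abs(np.corrcoef(a, b)[0, 1])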
Chapter 18 presents a scalable method for automatic feature generation
on biological sequence data. The algorithm uses sequence components and do-
main knowledge to construct features, explores the space of possible features,
and identifies the most useful ones. As sequence data have both compositional
and positional properties, feature types are defined to capture these proper-
ties, and for each feature type, features are constructed incrementally from
the simplest ones. During the construction, the importance of each feature is
evaluated by a measure that best fits to each type, and low ranked features
are eliminated. At the final stage, selected features are further pruned by an
embedded method based on recursive feature elimination. The method was
applied to the problem of splice-site prediction, and it successfully identified
the most useful set of features of each type. The method can be applied
to complex feature types and sequence prediction tasks such as translation
start-site prediction and protein sequence classification.
Chapter 19 proposes an ensemble-based method to find robust features
for biomarker research. Ensembles are obtained by choosing different alterna-
tives at each stage of data mining: three normalization methods, two binning
methods, eight feature selection methods (including different combination of
search methods), and four classification methods. A total of 192 different clas-
sifiers are obtained, and features are selected by favoring frequently appearing
features that are members of small feature sets of accurate classifiers. The
method is successfully applied to a publicly available Ovarian Cancer Dataset,
in which the original attributes are the m/z (mass/charge) values from the mass
spectrometer and the feature values are the corresponding intensities.
Chapter 20 presents a penalty-based feature selection method, elastic net,
for genomic data, which is a generalization of lasso (a penalized least squares
method with an L1 penalty for regression). Elastic net has the nice property that
irrelevant features receive parameter estimates of 0, leading to sparse,
easy-to-interpret models as in lasso; in addition, strongly correlated relevant
features are all selected, whereas lasso selects only one of them.
Thus, it is a more appropriate tool for feature selection with
high-dimensional data than lasso. Details are given on how elastic net can be
applied to regression, classification, and sparse eigen-gene analysis by simul-
taneously building a model and selecting relevant and redundant features.
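For readers who want to experiment, scikit-learn ships an elastic net implementation; the snippet below is only a generic usage sketch on synthetic data (the regularization values are arbitrary and this is not the chapter's own procedure), treating features with nonzero coefficients as selected.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                 # synthetic high-dimensional data
y = X[:, 0] + X[:, 1] + rng.normal(size=100)   # only two truly relevant features

# l1_ratio balances the L1 (sparsity) and L2 (grouping) penalties.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
selected = np.flatnonzero(model.coef_)          # features with nonzero estimates
print(selected)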
1.4 Looking Ahead
Feature selection research has found applications in many fields where large
(either row-wise or column-wise) volumes of data present challenges to effec-
tive data analysis and processing. As data evolve, new challenges arise and
the expectations of feature selection are also elevated, due to its own suc-
cess. In addition to high-throughput data, the pervasive use of Internet and
Web technologies has been bringing about a great number of new services and
applications, ranging from recent Web 2.0 applications to traditional Web ser-
vices where multi-media data are ubiquitous and abundant. Feature selection
is widely applied to find topical terms, establish group profiles, assist in cat-
egorization, simplify descriptions, facilitate personalization and visualization,
among many others.
The frontier of feature selection research is expanding incessantly in an-
swering the emerging challenges posed by the ever-growing amounts of data,
multiple sources of heterogeneous data, data streams, and disparate data-
intensive applications. On one hand, we naturally anticipate more research
on semi-supervised feature selection, unifying supervised and unsupervised
feature selection [19], and integrating feature selection with feature extrac-
tion. On the other hand, we expect new feature selection methods designed
for various types of features like causal, complementary, relational, struc-
tural, and sequential features, and intensified research efforts on large-scale,
distributed, and real-time feature selection. As the field develops, we are op-
timistic and confident that feature selection research will continue its unique
and significant role in taming the data monster and helping turn data into
nuggets.
References
[1] A. Blum and P. Langley. Selection of relevant features and examples in
machine learning. Artificial Intelligence, 97:245–271, 1997.
[2] M. Dash and H. Liu. Feature selection methods for classifications. Intel-
ligent Data Analysis: An International Journal, 1(3):131–156, 1997.
[3] E. Dougherty. Feature-selection overfitting with small-sample classi-
fier design. IEEE Intelligent Systems, 20(6):64–66, November/December
2005.
[4] J. Dy and C. Brodley. Feature selection for unsupervised learning. Jour-
nal of Machine Learning Research, 5:845–889, 2004.
[5] I. Guyon and A. Elisseeff. An introduction to variable and feature se-
lection. Journal of Machine Learning Research (JMLR), 3:1157–1182,
2003.
[6] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical
Learning. Springer, 2001.
[7] A. Jakulin and I. Bratko. Testing the significance of attribute interac-
tions. In ICML ’04: Twenty-First International Conference on Machine
Learning. ACM Press, 2004.
[8] G. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset se-
lection problem. In W. Cohen and H. Hirsh, editors, Machine Learning: Pro-
ceedings of the Eleventh International Conference, pages 121–129, New
Brunswick, NJ: Rutgers University, 1994.
[9] H. Liu and H. Motoda, editors. Feature Extraction, Construction and
Selection: A Data Mining Perspective. Boston: Kluwer Academic Pub-
lishers, 1998. 2nd Printing, 2001.
[10] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and
Data Mining. Boston: Kluwer Academic Publishers, 1998.
[11] H. Liu and H. Motoda, editors. Instance Selection and Construction for
Data Mining. Boston: Kluwer Academic Publishers, 2001.
[12] H. Liu and L. Yu. Toward integrating feature selection algorithms for
classification and clustering. IEEE Trans. on Knowledge and Data En-
gineering, 17(3):1–12, 2005.
[13] T. Mitchell. Machine Learning. New York: McGraw-Hill, 1997.
[14] P. Refaeilzadeh, L. Tang, and H. Liu. On comparison of feature selection
algorithms. In AAAI 2007 Workshop on Evaluation Methods for Machine
Learning II, Vancouver, British Columbia, Canada, July 2007.
[15] S. Singhi and H. Liu. Feature subset selection bias for classification
learning. In International Conference on Machine Learning, 2006.
[16] L. Yu and H. Liu. Efficient feature selection via analysis of rele-
vance and redundancy. Journal of Machine Learning Research (JMLR),
5(Oct):1205–1224, 2004.
[17] Z. Zhao and H. Liu. Searching for interacting features. In Proceedings of
IJCAI - International Joint Conference on AI, January 2007.
[18] Z. Zhao and H. Liu. Semi-supervised feature selection via spectral anal-
ysis. In Proceedings of SIAM International Conference on Data Mining
(SDM-07), 2007.
[19] Z. Zhao and H. Liu. Spectral feature selection for supervised and unsu-
pervised learning. In Proceedings of International Conference on Machine
Learning, 2007.
Chapter 2
Unsupervised Feature Selection
Jennifer G. Dy
Northeastern University
2.1 Introduction
2.2 Clustering
2.3 Feature Selection
2.4 Feature Selection for Unlabeled Data
2.5 Local Approaches
2.6 Summary
Acknowledgment
References
2.1 Introduction
Many existing databases are unlabeled, because large amounts of data make
it difficult for humans to manually label the categories of each instance. More-
over, human labeling is expensive and subjective. Hence, unsupervised learn-
ing is needed. Besides being unlabeled, several applications are characterized
by high-dimensional data (e.g., text, images, gene). However, not all of the
features domain experts utilize to represent these data are important for the
learning task. We have seen the need for feature selection in the supervised
learning case. This is also true in the unsupervised case. Unsupervised means
there is no teacher, in the form of class labels. One type of unsupervised learn-
ing problem is clustering. The goal of clustering is to group “similar” objects
together. “Similarity” is typically defined in terms of a metric or a probabil-
ity density model, which are both dependent on the features representing the
data.
In the supervised paradigm, feature selection algorithms maximize some
function of prediction accuracy. Since class labels are available in supervised
learning, it is natural to keep only the features that are related to or lead
to these classes. But in unsupervised learning, we are not given class labels.
Which features should we keep? Why not use all the information that we
have? The problem is that not all the features are important. Some of the
features may be redundant and some may be irrelevant. Furthermore, the ex-
istence of several irrelevant features can misguide clustering results. Reducing
the number of features also facilitates comprehensibility and ameliorates the
problem that some unsupervised learning algorithms break down with high-
dimensional data. In addition, for some applications, the goal is not just
clustering, but also to find the important features themselves.
A reason why some clustering algorithms break down in high dimensions is
due to the curse of dimensionality [3]. As the number of dimensions increases,
a fixed data sample becomes exponentially sparse. Additional dimensions increase
the volume exponentially and spread the data out such that the data points look
nearly equally far apart. Figure 2.1 (a) shows a plot of data generated from
a uniform distribution between 0 and 2 with 25 instances in one dimension.
Figure 2.1 (b) shows a plot of the same data in two dimensions, and Figure
2.1 (c) displays the data in three dimensions. Observe that the data become
more and more sparse in higher dimensions. There are 12 samples that fall
inside the unit-sized box in Figure 2.1 (a), seven samples in (b) and two in
(c). The sampling density is proportional to M1/N
, where M is the number
of samples and N is the dimension. For this example, a sampling density of
25 in one dimension would require 253
= 125 samples in three dimensions to
achieve a similar sample density.
FIGURE 2.1: Illustration for the curse of dimensionality. These are plots of a
25-sample data generated from a uniform distribution between 0 and 2. (a) Plot in
one dimension, (b) plot in two dimensions, and (c) plot in three dimensions. The
boxes in the figures show unit-sized bins in the corresponding dimensions. Note that
data are more sparse with respect to the unit-sized volume in higher dimensions.
There are 12 samples in the unit-sized box in (a), 7 samples in (b), and 2 samples
in (c).
As noted earlier, supervised learning has class labels to guide the feature
search. In unsupervised learning, these labels are missing, and in fact its goal
is to find these grouping labels (also known as cluster assignments). Finding
these cluster labels is dependent on the features describing the data, thus
making feature selection for unsupervised learning difficult.
Dy and Brodley [14] define the goal of feature selection for unsupervised
learning as:
to find the smallest feature subset that best uncovers “interesting
natural” groupings (clusters) from data according to the chosen
criterion.
Without any labeled information, in unsupervised learning, we need to make
some assumptions. We need to define what “interesting” and “natural” mean
in the form of criterion or objective functions. We will see examples of these
criterion functions later in this chapter.
Before we proceed with how to do feature selection on unsupervised data,
it is important to know the basics of clustering algorithms. Section 2.2 briefly
describes clustering algorithms. In Section 2.3 we review the basic components
of feature selection algorithms. Then, we present the methods for unsuper-
vised feature selection in Sections 2.4 and 2.5, and finally provide a summary
in Section 2.6.
2.2 Clustering
The goal of clustering is to group similar objects together. There are two
types of clustering approaches: partitional and hierarchical. Partitional clus-
tering provides one level of clustering. Hierarchical clustering, on the other
hand, provides multiple levels (hierarchy) of clustering solutions. Hierarchical
approaches can proceed bottom-up (agglomerative) or top-down (divisive).
Bottom-up approaches typically start with all instances as clusters and then,
at each level, merge clusters that are most similar with each other. Top-
down approaches divide the data into k clusters at each level. There are
several methods for performing clustering. A survey of these algorithms can
be found in [29, 39, 18].
In this section we briefly present two popular partitional clustering algo-
rithms: k-means and finite mixture model clustering. As mentioned earlier,
similarity is typically defined by a metric or a probability distribution. K-
means is an approach that uses a metric, and finite mixture models define
similarity by a probability density.
Let us denote our dataset as X = {x1, x2, . . . , xM }. X consists of M data
instances xk, k = 1, 2, . . ., M, and each xk represents a single N-dimensional
instance.
2.2.1 The K-Means Algorithm
The goal of k-means is to partition X into K clusters {C1, . . . , CK }. The
most widely used criterion function for the k-means algorithm is the sum-
squared-error (SSE) criterion. SSE is defined as

SSE = \sum_{j=1}^{K} \sum_{x_k \in C_j} \| x_k - \mu_j \|^2     (2.1)
where μj denotes the mean (centroid) of those instances in cluster Cj.
K-means is an iterative algorithm that locally minimizes the SSE criterion.
It assumes each cluster has a hyper-spherical structure. “K-means” denotes
the process of assigning each data point, xk, to the cluster with the nearest
mean. The k-means algorithm starts with initial K centroids, then it assigns
each remaining point to the nearest centroid, updates the cluster centroids,
and repeats the process until the K centroids do not change (convergence).
There are two versions of k-means: One version originates from Forgy [17] and
the other version from Macqueen [36]. The difference between the two is when
to update the cluster centroids. In Forgy’s k-means [17], cluster centroids are
re-computed after all the data points have been assigned to their nearest
centroids. In Macqueen’s k-means [36], the cluster centroids are re-computed
after each data assignment. Since k-means is a greedy algorithm, it is only
guaranteed to find a local minimum, the solution of which is dependent on
the initial assignments. To avoid local optimum, one typically applies random
restarts and picks the clustering solution with the best SSE. One can refer
to [47, 4] for other ways to deal with the initialization problem.
Standard k-means utilizes Euclidean distance to measure dissimilarity be-
tween the data points. Note that one can easily create various variants of
k-means by modifying this distance metric (e.g., other Lp norm distances)
to ones more appropriate for the data. For example, on text data, a more
suitable metric is the cosine similarity. One can also modify the objective
function, instead of SSE, to other criterion measures to create other cluster-
ing algorithms.
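A minimal sketch of Forgy-style k-means with Euclidean distance is given below (random initialization, no restarts; intended only to illustrate the loop described above, not as a production implementation).

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Forgy-style k-means: assign all points, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # initial centroids
    for _ in range(n_iter):
        # Assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids after all assignments; stop at convergence.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()   # criterion (2.1)
    return labels, centroids, sse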
2.2.2 Finite Mixture Clustering
A finite mixture model assumes that data are generated from a mixture
of K component density functions, in which p(xk|θj) represents the density
function of component j for all j's, where θj is the parameter (to be estimated)
for cluster j. The probability density of data xk is expressed by

p(x_k) = \sum_{j=1}^{K} \alpha_j \, p(x_k \mid \theta_j)     (2.2)

where the αj's are the mixing proportions of the components (subject to αj ≥ 0
and \sum_{j=1}^{K} \alpha_j = 1). The log-likelihood of the M observed data points
is then given by

L = \sum_{k=1}^{M} \ln \Big\{ \sum_{j=1}^{K} \alpha_j \, p(x_k \mid \theta_j) \Big\}     (2.3)
It is difficult to directly optimize (2.3), therefore we apply the Expectation-
Maximization (EM) [10] algorithm to find a (local) maximum likelihood or
maximum a posteriori (MAP) estimate of the parameters for the given data
set. EM is a general approach for estimating the maximum likelihood or
MAP estimate for missing data problems. In the clustering context, the
missing or hidden variables are the class labels. The EM algorithm iterates
between an Expectation-step (E-step), which computes the expected com-
plete data log-likelihood given the observed data and the model parameters,
and a Maximization-step (M-step), which estimates the model parameters
by maximizing the expected complete data log-likelihood from the E-step,
until convergence. In clustering, the E-step is similar to estimating the clus-
ter membership and the M-step estimates the cluster model parameters. The
clustering solution that we obtain in a mixture model is what we call a “soft”-
clustering solution because we obtain an estimated cluster membership (i.e.,
each data point belongs to all clusters with some probability weight of be-
longing to each cluster). In contrast, k-means provides a “hard”-clustering
solution (i.e., each data point belongs to only a single cluster).
Analogous to metric-based clustering, where one can develop different algo-
rithms by utilizing other similarity metrics, one can design different probability-
based mixture model clustering algorithms by choosing an appropriate density
model for the application domain. A Gaussian distribution is typically uti-
lized for continuous features and multinomials for discrete features. For a
more thorough description of clustering using finite mixture models, see [39]
and a review is provided in [18].
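For concreteness, a minimal sketch of one E-step/M-step iteration for a Gaussian mixture is given below (full covariances, with a small ridge added for numerical stability; this is a generic illustration, not an implementation from the literature cited above).

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, means, covs, weights):
    """One EM iteration for a Gaussian mixture (soft clustering).

    E-step: responsibilities resp[k, j] = posterior of cluster j for point k.
    M-step: re-estimate mixing weights, means, and covariances.
    """
    K = len(weights)
    # E-step
    dens = np.column_stack([
        weights[j] * multivariate_normal.pdf(X, means[j], covs[j]) for j in range(K)
    ])
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step
    Nj = resp.sum(axis=0)
    weights = Nj / len(X)
    means = (resp.T @ X) / Nj[:, None]
    covs = [
        ((resp[:, j, None] * (X - means[j])).T @ (X - means[j])) / Nj[j]
        + 1e-6 * np.eye(X.shape[1])                # small ridge for stability
        for j in range(K)
    ]
    loglik = np.log(dens.sum(axis=1)).sum()        # log-likelihood, as in (2.3)
    return means, covs, weights, resp, loglik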
2.3 Feature Selection
Feature selection algorithms have two main components: (1) feature search
and (2) feature subset evaluation.
2.3.1 Feature Search
Feature search strategies have been widely studied for classification. Gen-
erally speaking, search strategies used for supervised classification can also
be used for clustering algorithms. We repeat and summarize them here for
completeness. An exhaustive search would definitely find the optimal solution;
however, a search over the 2^N
possible feature subsets (where N is the number of
features) is computationally impractical. More realistic search strategies have
been studied. Narendra and Fukunaga [40] introduced the branch and bound
algorithm, which finds the optimal feature subset if the criterion function used
is monotonic. However, although the branch and bound algorithm makes
problems more tractable than an exhaustive search, it becomes impractical
for feature selection problems involving more than 30 features [43]. Sequential
search methods generally use greedy techniques and hence do not guarantee
global optimality of the selected subsets, only local optimality. Examples of
sequential searches include sequential forward selection, sequential backward
elimination, and bidirectional selection [32, 33]. Sequential forward/backward
search methods generally result in an O(N^2) worst-case search. Marill and
Green [38] introduced the sequential backward selection (SBS) [43] method,
which starts with all the features and sequentially eliminates one feature at a
time (eliminating the feature that contributes least to the criterion function).
Whitney [50] introduced sequential forward selection (SFS) [43], which starts
with the empty set and sequentially adds one feature at a time. A problem
with these hill-climbing search techniques is that when a feature is deleted in
SBS, it cannot be re-selected, while a feature added in SFS cannot be deleted
once selected. To prevent this effect, the Plus-l-Minus-r (l-r) search method
was developed by Stearns [45]. In this method, at each step the values of l and r
are pre-specified and fixed. Pudil et al. [43] introduced an adaptive version
that allows l and r values to “float.” They call these methods floating search
methods: sequential forward floating selection (SFFS) and sequential back-
ward floating selection (SBFS) based on the dominant search method (i.e.,
either in the forward or backward direction). Random search methods such
as genetic algorithms and random mutation hill climbing add some random-
ness in the search procedure to help to escape from a local optimum. In some
cases when the dimensionality is very high, one can only afford an individual
search. Individual search methods evaluate each feature individually accord-
ing to a criterion or a condition [24]. They then select the features that either
satisfy the condition or are top-ranked.
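To make the search strategies concrete, a minimal sequential forward selection (SFS) loop is sketched below; criterion stands for any caller-supplied subset evaluation function and is a hypothetical placeholder.

def sequential_forward_selection(all_features, criterion, max_features=None):
    """Greedy SFS: start empty, add the single best feature at each step."""
    selected, remaining = [], list(all_features)
    max_features = max_features or len(all_features)
    while remaining and len(selected) < max_features:
        # Pick the candidate whose addition maximizes the criterion.
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected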
2.3.2 Feature Evaluation
Not all the features are important. Some of the features may be irrelevant
and some of the features may be redundant. Each feature or feature subset
needs to be evaluated based on importance by a criterion. Different criteria
may select different features. It is actually deciding the evaluation criteria that
makes feature selection in clustering difficult. In classification, it is natural
to keep the features that are related to the labeled classes. However, in
clustering, these class labels are not available. Which features should we keep?
More specifically, how do we decide which features are relevant/irrelevant, and
which are redundant?
Figure 2.2 gives a simple example of an irrelevant feature for clustering.
Suppose data have features F1 and F2 only. Feature F2 does not contribute
to cluster discrimination, thus, we consider feature F2 to be irrelevant. We
want to remove irrelevant features because they may mislead the clustering
algorithm (especially when there are more irrelevant features than relevant
ones). Figure 2.3 provides an example showing feature redundancy. Observe
that both features F1 and F2 lead to the same clustering results. Therefore,
we consider features F1 and F2 to be redundant.
FIGURE 2.2: In this example, feature F2 is irrelevant because it does not
contribute to cluster discrimination.
FIGURE 2.3: In this example, features F1 and F2 have redundant information,
because feature F1 provides the same information as feature F2 with regard to
discriminating the two clusters.
2.4 Feature Selection for Unlabeled Data
There are several feature selection methods for clustering. Similar to super-
vised learning, these feature selection methods can be categorized as either
filter or wrapper approaches [33] based on whether the evaluation methods
depend on the learning algorithms.
As Figure 2.4 shows, the wrapper approach wraps the feature search around
the learning algorithms that will ultimately be applied, and utilizes the learned
results to select the features. On the other hand, as shown in Figure 2.5, the
filter approach utilizes the data alone to decide which features should be kept,
without running the learning algorithm.
FIGURE 2.4: Wrapper approach for feature selection for clustering.
FIGURE 2.5: Filter approach for feature selection for clustering.
Usually, a wrapper approach may
lead to better performance compared to a filter approach for a particular
learning algorithm. However, wrapper methods are more computationally
expensive since one needs to run the learning algorithm for every candidate
feature subset.
In this section, we present the different methods categorized into filter and
wrapper approaches.
2.4.1 Filter Methods
Filter methods use some intrinsic property of the data to select features
without utilizing the clustering algorithm that will ultimately be applied. The
basic components in filter methods are the feature search method and the fea-
ture selection criterion. Filter methods have the challenge of defining feature
relevance (interestingness) and/or redundancy without applying clustering on
the data.
Talavera [48] developed a filter version of his wrapper approach that selects
features based on feature dependence. He claims that irrelevant features are
features that do not depend on the other features. Manoranjan et al. [37]
introduced a filter approach that selects features based on the entropy of dis-
tances between data points. They observed that when the data are clustered,
the distance entropy at that subspace should be low. He, Cai, and Niyogi [26]
select features based on the Laplacian score that evaluates features based on
their locality preserving power. The Laplacian score is based on the premise
that two data points that are close together probably belong to the same
cluster.
These three filter approaches try to remove features that are not relevant.
Another way to reduce the dimensionality is to remove redundancy. A filter
approach primarily for reducing redundancy is simply to cluster the features.
Note that even though we apply clustering, we consider this as a filter method
because we cluster on the feature space as opposed to the data sample space.
One can cluster the features using a k-means clustering [36, 17] type of algo-
rithm with feature correlation as the similarity metric. Instead of a cluster
mean, represent each cluster by the feature that has the highest correlation
among features within the cluster it belongs to.
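A hedged sketch of this feature-grouping idea follows; it substitutes average-linkage hierarchical clustering on a correlation-based distance for the k-means-style procedure described above (a simplification), and keeps the most central feature of each group.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def representative_features(X, n_groups):
    """Group correlated features and keep one representative per group.

    Uses average-linkage hierarchical clustering on 1 - |correlation| as a
    distance. The representative of a group is the feature with the highest
    mean absolute correlation to the other features in its group.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dist = squareform(1.0 - corr, checks=False)
    groups = fcluster(linkage(dist, method="average"), n_groups, criterion="maxclust")
    kept = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        centrality = corr[np.ix_(members, members)].mean(axis=1)
        kept.append(members[centrality.argmax()])
    return sorted(kept)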
Popular techniques for dimensionality reduction without labels are prin-
cipal components analysis (PCA) [30], factor analysis, and projection pur-
suit [20, 27]. These early works in data reduction for unsupervised data can
be thought of as filter methods, because they select the features prior to ap-
plying clustering. But rather than selecting a subset of the features, they
involve some type of feature transformation. PCA and factor analysis aim to
reduce the dimension such that the representation is as faithful as possible to
the original data. As such, these techniques aim at reducing dimensionality
by removing redundancy. Projection pursuit, on the other hand, aims at find-
ing “interesting” projections (defined as the directions that are farthest from
Gaussian distributions and close to uniform). In this case, projection pur-
suit addresses relevance. Another method is independent component analysis
(ICA) [28]. ICA tries to find a transformation such that the transformed vari-
ables are statistically independent. Although the goals of ICA and projection
pursuit are different, the formulation in ICA ends up being similar to that of
projection pursuit (i.e., they both search for directions that are farthest from
the Gaussian density). These techniques are filter methods, however, they
apply transformations on the original feature space. We are interested in sub-
sets of the original features, because we want to retain the original meaning of
the features. Moreover, transformations would still require the user to collect
all the features to obtain the reduced set, which is sometimes not desired.
2.4.2 Wrapper Methods
Wrapper methods apply the clustering algorithm to evaluate the features.
They incorporate the clustering algorithm inside the feature search and selec-
tion. Wrapper approaches consist of: (1) a search component, (2) a clustering
algorithm, and (3) a feature evaluation criterion. See Figure 2.4.
One can build a feature selection wrapper approach for clustering by simply
picking a favorite search method (any method presented in Section 2.3.1), and
apply a clustering algorithm and a feature evaluation criterion. However, there
are issues that one must take into account in creating such an algorithm. In
[14], Dy and Brodley investigated the issues involved in creating a general
wrapper method where any feature selection, clustering, and selection criteria
can be applied. The first issue they observed is that it is not a good idea
to use the same number of clusters throughout the feature search because
different feature subspaces have different underlying numbers of “natural”
clusters. Thus, the clustering algorithm should also incorporate finding the
number of clusters in feature search. The second issue they discovered is that
various selection criteria are biased with respect to dimensionality. They then
introduced a cross-projection normalization scheme that can be utilized by
any criterion function.
Feature subspaces have different underlying numbers of clusters.
When we are searching for the best feature subset, we run into a new problem:
The value of the number of clusters depends on the feature subset. Figure
2.6 illustrates this point. In two dimensions {F1, F2} there are three clusters,
whereas in one dimension (the projection of the data only on F1) there are
only two clusters. It is not a good idea to use a fixed number of clusters in
feature search, because different feature subsets require different numbers of
clusters. And, using a fixed number of clusters for all feature sets does not
model the data in the respective subspace correctly. In [14], they addressed
finding the number of clusters by applying a Bayesian information criterion
penalty [44].
FIGURE 2.6: The number of cluster components varies with dimension.
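To make the penalized criterion concrete, here is a minimal, hedged sketch of a BIC-style score (generic form, not necessarily the exact penalty used in [14]); fit_mixture in the usage comment is a hypothetical helper that fits a K-component mixture.

import numpy as np

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: lower is better."""
    return -2.0 * log_likelihood + n_params * np.log(n_samples)

# Hypothetical usage: pick K minimizing BIC over candidate cluster counts,
# where fit_mixture(X, K) returns (log_likelihood, n_params) for that model.
# best_K = min(candidate_Ks, key=lambda K: bic(*fit_mixture(X, K), len(X)))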
Feature evaluation criterion should not be biased with respect to
dimensionality. In a wrapper approach, one searches in feature space, ap-
plies clustering in each candidate feature subspace, Si, and then evaluates the
results (clustering in space Si) with other cluster solutions in other subspaces,
Sj, j ≠ i, based on an evaluation criterion. This can be problematic especially
when Si and Sj have different dimensionalities. Dy and Brodley [14] examined
two feature selection criteria: maximum likelihood and scatter separability.
They have shown that the scatter separability criterion prefers higher dimen-
sionality. In other words, the criterion value monotonically increases as features are added.
editions, all of which are confirmed as not protected by copyright in
the U.S. unless a copyright notice is included. Thus, we do not
necessarily keep eBooks in compliance with any particular paper
edition.
Most people start at our website which has the main PG search
facility: www.gutenberg.org.
This website includes information about Project Gutenberg™,
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how
to subscribe to our email newsletter to hear about new eBooks.
Welcome to our website – the ideal destination for book lovers and
knowledge seekers. With a mission to inspire endlessly, we offer a
vast collection of books, ranging from classic literary works to
specialized publications, self-development books, and children's
literature. Each book is a new journey of discovery, expanding
knowledge and enriching the soul of the reader.
Our website is not just a platform for buying books, but a bridge
connecting readers to the timeless values of culture and wisdom. With
an elegant, user-friendly interface and an intelligent search system,
we are committed to providing a quick and convenient shopping
experience. Additionally, our special promotions and home delivery
services ensure that you save time and fully enjoy the joy of reading.
Let us accompany you on the journey of exploring knowledge and
personal growth!
ebookultra.com
Computational Methods of Feature Selection 1st Edition Huan Liu (Editor)
  • 8. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series UNDERSTANDING COMPLEX DATASETS: Data Mining with Matrix Decompositions David Skillicorn COMPUTATIONAL METHODS OF FEATURE SELECTION Huan Liu and Hiroshi Motoda PUBLISHED TITLES SERIES EDITOR Vipin Kumar University of Minnesota Department of Computer Science and Engineering Minneapolis, Minnesota, U.S.A AIMS AND SCOPE This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis.This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and hand- books. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.
  • 9. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series Computational Methods of Feature Selection Edited by Huan Liu and Hiroshi Motoda
  • 10. CRC Press Taylor Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2007 by Taylor Francis Group, LLC CRC Press is an imprint of Taylor Francis Group, an Informa business No claim to original U.S. Government works Version Date: 20140114 International Standard Book Number-13: 978-1-58488-879-6 (eBook - PDF) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit- ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright. com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Visit the Taylor Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
  • 11. Preface It has been ten years since we published our first two books on feature se- lection in 1998. In the past decade, we witnessed a great expansion of feature selection research in multiple dimensions. We experienced the fast data evolu- tion in which extremely high-dimensional data, such as high-throughput data of bioinformatics and Web/text data, became increasingly common. They stretch the capabilities of conventional data processing techniques, pose new challenges, and stimulate accelerated development of feature selection research in two major ways. One trend is to improve and expand the existing tech- niques to meet the new challenges. The other is to develop brand new algo- rithms directly targeting the arising challenges. In this process, we observe many feature-selection-centered activities, such as one well-received competi- tion, two well-attended tutorials at top conferences, and two multi-disciplinary workshops, as well as a special development section in a recent issue of IEEE Intelligent Systems, to name a few. This collection bridges the widening gap between existing texts and the rapid developments in the field, by presenting recent research works from var- ious disciplines. It features excellent survey work, practical guides, exciting new directions, and comprehensive tutorials from leading experts. The book also presents easy-to-understand illustrations, state-of-the-art methodologies, and algorithms, along with real-world case studies ranging from text classi- fication, to Web mining, to bioinformatics where high-dimensional data are pervasive. Some vague ideas suggested in our earlier book have been de- veloped into mature areas with solid achievements, along with progress that could not have been imagined ten years ago. With the steady and speedy development of feature selection research, we sincerely hope that this book presents distinctive and representative achievements; serves as a convenient point for graduate students, practitioners, and researchers to further the re- search and application of feature selection; and sparks a new phase of feature selection research. We are truly optimistic about the impact of feature selec- tion on massive, high-dimensional data and processing in the near future, and we have no doubt that in another ten years, when we look back, we will be humbled by the newfound power of feature selection, and by its indelible con- tributions to machine learning, data mining, and many real-world challenges. Huan Liu and Hiroshi Motoda
  • 12. Acknowledgments The inception of this book project was during SDM 2006’s feature selec- tion workshop. Randi Cohen, an editor of Chapman and Hall/CRC Press, eloquently convinced one of us that it was a time for a new book on feature selection. Since then, she closely worked with us to make the process easier and smoother and allowed us to stay focused. With Randi’s kind and expert support, we were able to adhere to the planned schedule when facing unex- pected difficulties. We truly appreciate her generous support throughout the project. This book is a natural extension of the two successful feature selection workshops held at SDM 20051 and SDM 2006.2 The success would not be a reality without the leadership of two workshop co-organizers (Robert Stine of Wharton School and Leonard Auslender of SAS); the meticulous work of the proceedings chair (Lei Yu of Binghamton University); and the altruistic efforts of PC members, authors, and contributors. We take this opportunity to thank all who helped to advance the frontier of feature selection research. The authors, contributors, and reviewers of this book played an instru- mental role in this project. Given the limited space of this book, we could not include all quality works. Reviewers’ detailed comments and constructive suggestions significantly helped improve the book’s consistency in content, format, comprehensibility, and presentation. We thank the authors who pa- tiently and timely accommodated our (sometimes many) requests. We would also like to express our deep gratitude for the gracious help we received from our colleagues and students, including Zheng Zhao, Lei Tang, Quan Nguyen, Payam Refaeilzadeh, and Shankara B. Subramanya of Arizona State University; Kozo Ohara of Osaka University; and William Nace and Kenneth Gorreta of AFOSR/AOARD, Air Force Research Laboratory. Last but not least, we thank our families for their love and support. We are grateful and happy that we can now spend more time with our families. Huan Liu and Hiroshi Motoda 1The 2005 proceedings are at http://enpub.eas.asu.edu/workshop/. 2The 2006 proceedings are at http://enpub.eas.asu.edu/workshop/2006/.
  • 13. Contributors Jesús S. Aguilar-Ruiz Pablo de Olavide University, Seville, Spain Jennifer G. Dy Northeastern University, Boston, Massachusetts Constantin F. Aliferis Vanderbilt University, Nashville, Tennessee André Elisseeff IBM Research, Zürich, Switzer- land Paolo Avesani ITC-IRST, Trento, Italy Susana Eyheramendy Ludwig-Maximilians Universität München, Germany Susan M. Bridges Mississippi State University, Mississippi George Forman Hewlett-Packard Labs, Palo Alto, California Alexander Borisov Intel Corporation, Chandler, Arizona Lise Getoor University of Maryland, College Park, Maryland Shane Burgess Mississippi State University, Mississippi Dimitrios Gunopulos University of California, River- side Diana Chan Mississippi State University, Mississippi Isabelle Guyon ClopiNet, Berkeley, California Claudia Diamantini Universitá Politecnica delle Marche, Ancona, Italy Trevor Hastie Stanford University, Stanford, California Rezarta Islamaj Dogan University of Maryland, College Park, Maryland and National Center for Biotechnology Infor- mation, Bethesda, Maryland Joshua Zhexue Huang University of Hong Kong, Hong Kong, China Carlotta Domeniconi George Mason University, Fair- fax, Virginia Mohamed Kamel University of Waterloo, Ontario, Canada
  • 14. Igor Kononenko University of Ljubljana, Ljubl- jana, Slovenia Wei Tang Florida Atlantic University, Boca Raton, Florida David Madigan Rutgers University, New Bruns- wick, New Jersey Kari Torkkola Motorola Labs, Tempe, Arizona Masoud Makrehchi University of Waterloo, Ontario, Canada Eugene Tuv Intel Corporation, Chandler, Arizona Michael Ng Hong Kong Baptist University, Hong Kong, China Sriharsha Veeramachaneni ITC-IRST, Trento, Italy Emanuele Olivetti ITC-IRST, Trento, Italy W. John Wilbur National Center for Biotech- nology Information, Bethesda, Maryland Domenico Potena Universitá Politecnica delle Marche, Ancona, Italy Jun Xu Georgia Institute of Technology, Atlanta, Georgia José C. Riquelme University of Seville, Seville, Spain Yunming Ye Harbin Institute of Technology, Harbin, China Roberto Ruiz Pablo de Olavide University, Seville, Spain Lei Yu Binghamton University, Bing- hamton, New York Marko Robnik Šikonja University of Ljubljana, Ljubl- jana, Slovenia Shi Zhong Yahoo! Inc., Sunnyvale, Califor- nia David J. Stracuzzi Arizona State University, Tempe, Arizona Hui Zou University of Minnesota, Min- neapolis Yijun Sun University of Florida, Gaines- ville, Florida
  • 15. Contents I Introduction and Background 1 1 Less Is More 3 Huan Liu and Hiroshi Motoda 1.1 Background and Basics . . . . . . . . . . . . . . . . . . . . . 4 1.2 Supervised, Unsupervised, and Semi-Supervised Feature Selec- tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Key Contributions and Organization of the Book . . . . . . . 10 1.3.1 Part I - Introduction and Background . . . . . . . . . 10 1.3.2 Part II - Extending Feature Selection . . . . . . . . . 11 1.3.3 Part III - Weighting and Local Methods . . . . . . . . 12 1.3.4 Part IV - Text Classification and Clustering . . . . . . 13 1.3.5 Part V - Feature Selection in Bioinformatics . . . . . . 14 1.4 Looking Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2 Unsupervised Feature Selection 19 Jennifer G. Dy 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2.1 The K-Means Algorithm . . . . . . . . . . . . . . . . 21 2.2.2 Finite Mixture Clustering . . . . . . . . . . . . . . . . 22 2.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3.1 Feature Search . . . . . . . . . . . . . . . . . . . . . . 23 2.3.2 Feature Evaluation . . . . . . . . . . . . . . . . . . . . 24 2.4 Feature Selection for Unlabeled Data . . . . . . . . . . . . . 25 2.4.1 Filter Methods . . . . . . . . . . . . . . . . . . . . . . 26 2.4.2 Wrapper Methods . . . . . . . . . . . . . . . . . . . . 27 2.5 Local Approaches . . . . . . . . . . . . . . . . . . . . . . . . 32 2.5.1 Subspace Clustering . . . . . . . . . . . . . . . . . . . 32 2.5.2 Co-Clustering/Bi-Clustering . . . . . . . . . . . . . . . 33 2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3 Randomized Feature Selection 41 David J. Stracuzzi 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2 Types of Randomizations . . . . . . . . . . . . . . . . . . . . 42 3.3 Randomized Complexity Classes . . . . . . . . . . . . . . . . 43
  • 16. 3.4 Applying Randomization to Feature Selection . . . . . . . . 45 3.5 The Role of Heuristics . . . . . . . . . . . . . . . . . . . . . . 46 3.6 Examples of Randomized Selection Algorithms . . . . . . . . 47 3.6.1 A Simple Las Vegas Approach . . . . . . . . . . . . . 47 3.6.2 Two Simple Monte Carlo Approaches . . . . . . . . . 49 3.6.3 Random Mutation Hill Climbing . . . . . . . . . . . . 51 3.6.4 Simulated Annealing . . . . . . . . . . . . . . . . . . . 52 3.6.5 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . 54 3.6.6 Randomized Variable Elimination . . . . . . . . . . . 56 3.7 Issues in Randomization . . . . . . . . . . . . . . . . . . . . 58 3.7.1 Pseudorandom Number Generators . . . . . . . . . . . 58 3.7.2 Sampling from Specialized Data Structures . . . . . . 59 3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4 Causal Feature Selection 63 Isabelle Guyon, Constantin Aliferis, and André Elisseeff 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2 Classical “Non-Causal” Feature Selection . . . . . . . . . . . 65 4.3 The Concept of Causality . . . . . . . . . . . . . . . . . . . . 68 4.3.1 Probabilistic Causality . . . . . . . . . . . . . . . . . . 69 4.3.2 Causal Bayesian Networks . . . . . . . . . . . . . . . . 70 4.4 Feature Relevance in Bayesian Networks . . . . . . . . . . . 71 4.4.1 Markov Blanket . . . . . . . . . . . . . . . . . . . . . 72 4.4.2 Characterizing Features Selected via Classical Methods 73 4.5 Causal Discovery Algorithms . . . . . . . . . . . . . . . . . . 77 4.5.1 A Prototypical Causal Discovery Algorithm . . . . . . 78 4.5.2 Markov Blanket Induction Algorithms . . . . . . . . . 79 4.6 Examples of Applications . . . . . . . . . . . . . . . . . . . . 80 4.7 Summary, Conclusions, and Open Problems . . . . . . . . . 82 II Extending Feature Selection 87 5 Active Learning of Feature Relevance 89 Emanuele Olivetti, Sriharsha Veeramachaneni, and Paolo Avesani 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.2 Active Sampling for Feature Relevance Estimation . . . . . . 92 5.3 Derivation of the Sampling Benefit Function . . . . . . . . . 93 5.4 Implementation of the Active Sampling Algorithm . . . . . . 95 5.4.1 Data Generation Model: Class-Conditional Mixture of Product Distributions . . . . . . . . . . . . . . . . . . 95 5.4.2 Calculation of Feature Relevances . . . . . . . . . . . 96 5.4.3 Calculation of Conditional Probabilities . . . . . . . . 97 5.4.4 Parameter Estimation . . . . . . . . . . . . . . . . . . 97 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.5.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . 99
  • 17. 5.5.2 UCI Datasets . . . . . . . . . . . . . . . . . . . . . . . 100 5.5.3 Computational Complexity Issues . . . . . . . . . . . 102 5.6 Conclusions and Future Work . . . . . . . . . . . . . . . . . 102 6 A Study of Feature Extraction Techniques Based on Decision Border Estimate 109 Claudia Diamantini and Domenico Potena 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.1.1 Background on Statistical Pattern Classification . . . 111 6.2 Feature Extraction Based on Decision Boundary . . . . . . . 112 6.2.1 MLP-Based Decision Boundary Feature Extraction . . 113 6.2.2 SVM Decision Boundary Analysis . . . . . . . . . . . 114 6.3 Generalities About Labeled Vector Quantizers . . . . . . . . 115 6.4 Feature Extraction Based on Vector Quantizers . . . . . . . 116 6.4.1 Weighting of Normal Vectors . . . . . . . . . . . . . . 119 6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 6.5.1 Experiment with Synthetic Data . . . . . . . . . . . . 122 6.5.2 Experiment with Real Data . . . . . . . . . . . . . . . 124 6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 7 Ensemble-Based Variable Selection Using Independent Probes 131 Eugene Tuv, Alexander Borisov, and Kari Torkkola 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 7.2 Tree Ensemble Methods in Feature Ranking . . . . . . . . . 132 7.3 The Algorithm: Ensemble-Based Ranking Against Indepen- dent Probes . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 7.4.1 Benchmark Methods . . . . . . . . . . . . . . . . . . . 138 7.4.2 Data and Experiments . . . . . . . . . . . . . . . . . . 139 7.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 8 Efficient Incremental-Ranked Feature Selection in Massive Data 147 Roberto Ruiz, Jesús S. Aguilar-Ruiz, and José C. Riquelme 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 8.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 148 8.3 Preliminary Concepts . . . . . . . . . . . . . . . . . . . . . . 150 8.3.1 Relevance . . . . . . . . . . . . . . . . . . . . . . . . . 150 8.3.2 Redundancy . . . . . . . . . . . . . . . . . . . . . . . . 151 8.4 Incremental Performance over Ranking . . . . . . . . . . . . 152 8.4.1 Incremental Ranked Usefulness . . . . . . . . . . . . . 153 8.4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 155 8.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 156 8.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
  • 18. III Weighting and Local Methods 167 9 Non-Myopic Feature Quality Evaluation with (R)ReliefF 169 Igor Kononenko and Marko Robnik Šikonja 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 9.2 From Impurity to Relief . . . . . . . . . . . . . . . . . . . . . 170 9.2.1 Impurity Measures in Classification . . . . . . . . . . . 171 9.2.2 Relief for Classification . . . . . . . . . . . . . . . . . 172 9.3 ReliefF for Classification and RReliefF for Regression . . . . 175 9.4 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 9.4.1 ReliefF for Inductive Logic Programming . . . . . . . 178 9.4.2 Cost-Sensitive ReliefF . . . . . . . . . . . . . . . . . . 180 9.4.3 Evaluation of Ordered Features at Value Level . . . . 181 9.5 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . 182 9.5.1 Difference of Probabilities . . . . . . . . . . . . . . . . 182 9.5.2 Portion of the Explained Concept . . . . . . . . . . . 183 9.6 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . 184 9.6.1 Time Complexity . . . . . . . . . . . . . . . . . . . . . 184 9.6.2 Active Sampling . . . . . . . . . . . . . . . . . . . . . 184 9.6.3 Parallelization . . . . . . . . . . . . . . . . . . . . . . 185 9.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 9.7.1 Feature Subset Selection . . . . . . . . . . . . . . . . . 185 9.7.2 Feature Ranking . . . . . . . . . . . . . . . . . . . . . 186 9.7.3 Feature Weighing . . . . . . . . . . . . . . . . . . . . . 186 9.7.4 Building Tree-Based Models . . . . . . . . . . . . . . . 187 9.7.5 Feature Discretization . . . . . . . . . . . . . . . . . . 187 9.7.6 Association Rules and Genetic Algorithms . . . . . . . 187 9.7.7 Constructive Induction . . . . . . . . . . . . . . . . . . 188 9.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 10 Weighting Method for Feature Selection in K-Means 193 Joshua Zhexue Huang, Jun Xu, Michael Ng, and Yunming Ye 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 10.2 Feature Weighting in k-Means . . . . . . . . . . . . . . . . . 194 10.3 W-k-Means Clustering Algorithm . . . . . . . . . . . . . . . 197 10.4 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . 198 10.5 Subspace Clustering with k-Means . . . . . . . . . . . . . . . 200 10.6 Text Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 201 10.6.1 Text Data and Subspace Clustering . . . . . . . . . . 202 10.6.2 Selection of Key Words . . . . . . . . . . . . . . . . . 203 10.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 204 10.8 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
  • 19. 11 Local Feature Selection for Classification 211 Carlotta Domeniconi and Dimitrios Gunopulos 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 11.2 The Curse of Dimensionality . . . . . . . . . . . . . . . . . . 213 11.3 Adaptive Metric Techniques . . . . . . . . . . . . . . . . . . 214 11.3.1 Flexible Metric Nearest Neighbor Classification . . . . 215 11.3.2 Discriminant Adaptive Nearest Neighbor Classification 216 11.3.3 Adaptive Metric Nearest Neighbor Algorithm . . . . . 217 11.4 Large Margin Nearest Neighbor Classifiers . . . . . . . . . . 222 11.4.1 Support Vector Machines . . . . . . . . . . . . . . . . 223 11.4.2 Feature Weighting . . . . . . . . . . . . . . . . . . . . 224 11.4.3 Large Margin Nearest Neighbor Classification . . . . . 225 11.4.4 Weighting Features Increases the Margin . . . . . . . 227 11.5 Experimental Comparisons . . . . . . . . . . . . . . . . . . . 228 11.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 12 Feature Weighting through Local Learning 233 Yijun Sun 12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 12.2 Mathematical Interpretation of Relief . . . . . . . . . . . . . 235 12.3 Iterative Relief Algorithm . . . . . . . . . . . . . . . . . . . . 236 12.3.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 236 12.3.2 Convergence Analysis . . . . . . . . . . . . . . . . . . 238 12.4 Extension to Multiclass Problems . . . . . . . . . . . . . . . 240 12.5 Online Learning . . . . . . . . . . . . . . . . . . . . . . . . . 240 12.6 Computational Complexity . . . . . . . . . . . . . . . . . . . 242 12.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 242 12.7.1 Experimental Setup . . . . . . . . . . . . . . . . . . . 242 12.7.2 Experiments on UCI Datasets . . . . . . . . . . . . . . 244 12.7.3 Choice of Kernel Width . . . . . . . . . . . . . . . . . 248 12.7.4 Online Learning . . . . . . . . . . . . . . . . . . . . . 248 12.7.5 Experiments on Microarray Data . . . . . . . . . . . . 249 12.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251 IV Text Classification and Clustering 255 13 Feature Selection for Text Classification 257 George Forman 13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 257 13.1.1 Feature Selection Phyla . . . . . . . . . . . . . . . . . 259 13.1.2 Characteristic Difficulties of Text Classification Tasks 260 13.2 Text Feature Generators . . . . . . . . . . . . . . . . . . . . 261 13.2.1 Word Merging . . . . . . . . . . . . . . . . . . . . . . 261 13.2.2 Word Phrases . . . . . . . . . . . . . . . . . . . . . . . 262 13.2.3 Character N-grams . . . . . . . . . . . . . . . . . . . . 263
  • 20. 13.2.4 Multi-Field Records . . . . . . . . . . . . . . . . . . . 264 13.2.5 Other Properties . . . . . . . . . . . . . . . . . . . . . 264 13.2.6 Feature Values . . . . . . . . . . . . . . . . . . . . . . 265 13.3 Feature Filtering for Classification . . . . . . . . . . . . . . . 265 13.3.1 Binary Classification . . . . . . . . . . . . . . . . . . . 266 13.3.2 Multi-Class Classification . . . . . . . . . . . . . . . . 269 13.3.3 Hierarchical Classification . . . . . . . . . . . . . . . . 270 13.4 Practical and Scalable Computation . . . . . . . . . . . . . . 271 13.5 A Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . 272 13.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . 274 14 A Bayesian Feature Selection Score Based on Naı̈ve Bayes Models 277 Susana Eyheramendy and David Madigan 14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 277 14.2 Feature Selection Scores . . . . . . . . . . . . . . . . . . . . . 279 14.2.1 Posterior Inclusion Probability (PIP) . . . . . . . . . . 280 14.2.2 Posterior Inclusion Probability (PIP) under a Bernoulli distribution . . . . . . . . . . . . . . . . . . . . . . . . 281 14.2.3 Posterior Inclusion Probability (PIPp) under Poisson distributions . . . . . . . . . . . . . . . . . . . . . . . 283 14.2.4 Information Gain (IG) . . . . . . . . . . . . . . . . . . 284 14.2.5 Bi-Normal Separation (BNS) . . . . . . . . . . . . . . 285 14.2.6 Chi-Square . . . . . . . . . . . . . . . . . . . . . . . . 285 14.2.7 Odds Ratio . . . . . . . . . . . . . . . . . . . . . . . . 286 14.2.8 Word Frequency . . . . . . . . . . . . . . . . . . . . . 286 14.3 Classification Algorithms . . . . . . . . . . . . . . . . . . . . 286 14.4 Experimental Settings and Results . . . . . . . . . . . . . . . 287 14.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 287 14.4.2 Experimental Results . . . . . . . . . . . . . . . . . . 288 14.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290 15 Pairwise Constraints-Guided Dimensionality Reduction 295 Wei Tang and Shi Zhong 15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 295 15.2 Pairwise Constraints-Guided Feature Projection . . . . . . . 297 15.2.1 Feature Projection . . . . . . . . . . . . . . . . . . . . 298 15.2.2 Projection-Based Semi-supervised Clustering . . . . . 300 15.3 Pairwise Constraints-Guided Co-clustering . . . . . . . . . . 301 15.4 Experimental Studies . . . . . . . . . . . . . . . . . . . . . . 302 15.4.1 Experimental Study – I . . . . . . . . . . . . . . . . . 302 15.4.2 Experimental Study – II . . . . . . . . . . . . . . . . . 306 15.4.3 Experimental Study – III . . . . . . . . . . . . . . . . 309 15.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . 310
  • 21. 16 Aggressive Feature Selection by Feature Ranking 313 Masoud Makrehchi and Mohamed S. Kamel 16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 313 16.2 Feature Selection by Feature Ranking . . . . . . . . . . . . . 314 16.2.1 Multivariate Characteristic of Text Classifiers . . . . . 316 16.2.2 Term Redundancy . . . . . . . . . . . . . . . . . . . . 316 16.3 Proposed Approach to Reducing Term Redundancy . . . . . 320 16.3.1 Stemming, Stopwords, and Low-DF Terms Elimination 320 16.3.2 Feature Ranking . . . . . . . . . . . . . . . . . . . . . 320 16.3.3 Redundancy Reduction . . . . . . . . . . . . . . . . . 322 16.3.4 Redundancy Removal Algorithm . . . . . . . . . . . . 325 16.3.5 Term Redundancy Tree . . . . . . . . . . . . . . . . . 326 16.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 326 16.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330 V Feature Selection in Bioinformatics 335 17 Feature Selection for Genomic Data Analysis 337 Lei Yu 17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 337 17.1.1 Microarray Data and Challenges . . . . . . . . . . . . 337 17.1.2 Feature Selection for Microarray Data . . . . . . . . . 338 17.2 Redundancy-Based Feature Selection . . . . . . . . . . . . . 340 17.2.1 Feature Relevance and Redundancy . . . . . . . . . . 340 17.2.2 An Efficient Framework for Redundancy Analysis . . . 343 17.2.3 RBF Algorithm . . . . . . . . . . . . . . . . . . . . . . 345 17.3 Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . 347 17.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 347 17.3.2 Experimental Settings . . . . . . . . . . . . . . . . . . 349 17.3.3 Results and Discussion . . . . . . . . . . . . . . . . . . 349 17.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351 18 A Feature Generation Algorithm with Applications to Bio- logical Sequence Classification 355 Rezarta Islamaj Dogan, Lise Getoor, and W. John Wilbur 18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 355 18.2 Splice-Site Prediction . . . . . . . . . . . . . . . . . . . . . . 356 18.2.1 The Splice-Site Prediction Problem . . . . . . . . . . . 356 18.2.2 Current Approaches . . . . . . . . . . . . . . . . . . . 357 18.2.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . 359 18.3 Feature Generation Algorithm . . . . . . . . . . . . . . . . . 359 18.3.1 Feature Type Analysis . . . . . . . . . . . . . . . . . . 360 18.3.2 Feature Selection . . . . . . . . . . . . . . . . . . . . 362 18.3.3 Feature Generation Algorithm (FGA) . . . . . . . . . 364 18.4 Experiments and Discussion . . . . . . . . . . . . . . . . . . 366
  • 22. 18.4.1 Data Description . . . . . . . . . . . . . . . . . . . . 366 18.4.2 Feature Generation . . . . . . . . . . . . . . . . . . . . 367 18.4.3 Prediction Results for Individual Feature Types . . . . 369 18.4.4 Splice-Site Prediction with FGA Features . . . . . . . 370 18.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 372 19 An Ensemble Method for Identifying Robust Features for Biomarker Discovery 377 Diana Chan, Susan M. Bridges, and Shane C. Burgess 19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 377 19.2 Biomarker Discovery from Proteome Profiles . . . . . . . . . 378 19.3 Challenges of Biomarker Identification . . . . . . . . . . . . . 380 19.4 Ensemble Method for Feature Selection . . . . . . . . . . . . 381 19.5 Feature Selection Ensemble . . . . . . . . . . . . . . . . . . . 383 19.6 Results and Discussion . . . . . . . . . . . . . . . . . . . . . 384 19.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389 20 Model Building and Feature Selection with Genomic Data 393 Hui Zou and Trevor Hastie 20.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 393 20.2 Ridge Regression, Lasso, and Bridge . . . . . . . . . . . . . . 394 20.3 Drawbacks of the Lasso . . . . . . . . . . . . . . . . . . . . . 396 20.4 The Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . . 397 20.4.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . 397 20.4.2 A Stylized Example . . . . . . . . . . . . . . . . . . . 399 20.4.3 Computation and Tuning . . . . . . . . . . . . . . . . 400 20.4.4 Analyzing the Cardiomypathy Data . . . . . . . . . . 402 20.5 The Elastic-Net Penalized SVM . . . . . . . . . . . . . . . . 404 20.5.1 Support Vector Machines . . . . . . . . . . . . . . . . 404 20.5.2 A New SVM Classifier . . . . . . . . . . . . . . . . . . 405 20.6 Sparse Eigen-Genes . . . . . . . . . . . . . . . . . . . . . . . 407 20.6.1 PCA and Eigen-Genes . . . . . . . . . . . . . . . . . . 408 20.6.2 Sparse Principal Component Analysis . . . . . . . . . 408 20.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409 Index 413
  • 25. Chapter 1
Less Is More
Huan Liu, Arizona State University
Hiroshi Motoda, AFOSR/AOARD, Air Force Research Laboratory

1.1 Background and Basics ........ 4
1.2 Supervised, Unsupervised, and Semi-Supervised Feature Selection ........ 7
1.3 Key Contributions and Organization of the Book ........ 10
1.4 Looking Ahead ........ 15
References ........ 16

As our world expands at an unprecedented speed from the physical into the virtual, we can conveniently collect more and more data in any way one can imagine, for various reasons. Is it "The more, the merrier (better)"? The answer is "Yes" and "No." It is "Yes" because we can at least get what we might need. It is also "No" because, when it comes to a point of too much, the existence of inordinate data is tantamount to non-existence if there is no means of effective data access. More can mean less. Without the processing of data, its mere existence would not become a useful asset that can impact our business and many other matters. Since continued data accumulation is inevitable, one way out is to devise data selection techniques that keep pace with the rate of data collection. Furthermore, given the sheer volume of data, data generated by computers or equivalent mechanisms must be processed automatically, in order for us to tame the data monster and stay in control.

Recent years have seen extensive efforts in feature selection research. The field of feature selection expands both in depth and in breadth, due to increasing demands for dimensionality reduction. The evidence can be found in many recent papers, workshops, and review articles. The research expands from classic supervised feature selection to unsupervised and semi-supervised feature selection, to selection of different feature types such as causal and structural features, to different kinds of data like high-throughput, text, or images, to feature selection evaluation, and to wide applications of feature selection where data abound.

No book of this size could possibly document the extensive efforts at the frontier of feature selection research. We thus try to sample the field in several ways: asking established experts, calling for submissions, and looking at recent workshops and conferences, in order to understand the current developments. As this book aims to serve a wide audience from practitioners to researchers, we first introduce the basic concepts and the essential problems of feature selection; next illustrate feature selection research in parallel to supervised, unsupervised, and semi-supervised learning; then present an overview of the feature selection activities included in this collection; and last contemplate some issues about evolving feature selection. The book is organized in five parts: (I) Introduction and Background, (II) Extending Feature Selection, (III) Weighting and Local Methods, (IV) Text Feature Selection, and (V) Feature Selection in Bioinformatics. These five parts are relatively independent and can be read in any order. For a newcomer to the field of feature selection, we recommend that you read Chapters 1, 2, 9, 13, and 17 first, then decide which chapters to read further according to your needs and interests. Rudimentary concepts and discussions of related issues such as feature extraction and construction can also be found in two earlier books [10, 9]. Instance selection can be found in [11].
  • 26. 1.1 Background and Basics

One of the fundamental motivations for feature selection is the curse of dimensionality [6]. Plainly speaking, two close data points in a 2-d space are likely distant in a 100-d space (refer to Chapter 2 for an illustrative example). For the case of classification, this makes it difficult to make a prediction for unseen data points using a hypothesis constructed from a limited number of training instances. The number of features is a key factor that determines the size of the hypothesis space containing all hypotheses that can be learned from data [13]. A hypothesis is a pattern or function that predicts classes based on given data. The more features, the larger the hypothesis space. Worse still, a linear increase in the number of features leads to an exponential increase of the hypothesis space. For example, for N binary features and a binary class feature, the hypothesis space is as big as 2^(2^N). Therefore, feature selection can efficiently reduce the hypothesis space by removing irrelevant and redundant features. The smaller the hypothesis space, the easier it is to find correct hypotheses. Given a fixed-size data sample that is part of the underlying population, the reduction of dimensionality also lowers the number of required training instances. For example, given M instances, when the number of binary features is reduced from N = 10 to N = 5, the ratio M/2^N increases exponentially. In other words, it virtually increases the number of training instances. This helps to better constrain the search for correct hypotheses.

Feature selection is essentially a task to remove irrelevant and/or redundant features. Irrelevant features can be removed without affecting learning performance [8]. Redundant features are a type of irrelevant feature [16]. The distinction is that a redundant feature implies the co-presence of another feature; individually, each feature is relevant, but the removal of one of them will not affect learning performance. The selection of features can be achieved in two ways: one is to rank features according to some criterion and select the top k features, and the other is to select a minimum subset of features without learning performance deterioration. In other words, subset selection algorithms can automatically determine the number of selected features, while feature ranking algorithms need to rely on some given threshold to select features. An example of a feature ranking algorithm is detailed in Chapter 9. An example of subset selection can be found in Chapter 17.
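To make the growth concrete, a short back-of-the-envelope computation (an illustrative sketch added here, not taken from the chapter) counts the distinct binary-class hypotheses over N binary features:

# With N binary features there are 2^N possible instances, and each instance can be
# labeled in 2 ways, so there are 2^(2^N) distinct binary-class hypotheses.
# Printing digit counts is enough to show the explosion.
for n in (2, 3, 5, 10):
    size = 2 ** (2 ** n)
    print(f"N={n:2d}: 2^(2^{n}) is a {len(str(size))}-digit number")

Even at N = 10 the hypothesis space already has a 309-digit number of members, which is why shrinking N matters so much for learning from limited samples.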
  • 27. Other important aspects of feature selection include models, search strategies, feature quality measures, and evaluation [10]. The three typical models are filter, wrapper, and embedded. An embedded model of feature selection integrates the selection of features into model building. An example of such a model is the decision tree induction algorithm, in which a feature has to be selected at each branching node. Research shows that even for such a learning algorithm, feature selection can result in improved learning performance. In a wrapper model, one employs a learning algorithm and uses its performance to determine the quality of selected features. As shown in Chapter 2, filter and wrapper models are not confined to supervised feature selection, and can also apply to the study of unsupervised feature selection algorithms.

Search strategies [1] are investigated and various strategies have been proposed, including forward, backward, floating, branch-and-bound, and randomized. If one starts with an empty feature subset and adds relevant features into the subset following a procedure, it is called forward selection; if one begins with a full set of features and removes features procedurally, it is backward selection. Given a large number of features, either strategy might be too costly to work. Take the example of forward selection. Since k is usually unknown a priori, one needs to try C(N, 1) + C(N, 2) + ... + C(N, k) candidate subsets (where C(N, i) denotes "N choose i") in order to figure out k out of N features for selection. Therefore, its time complexity is O(2^N). Hence, more efficient algorithms have been developed. The widely used ones are sequential strategies. A sequential forward selection (SFS) algorithm selects one feature at a time until adding another feature does not improve the subset quality, with the condition that a selected feature remains selected. Similarly, a sequential backward selection (SBS) algorithm eliminates one feature at a time, and once a feature is eliminated, it will never be considered again for inclusion. Obviously, both search strategies are heuristic in nature and cannot guarantee the optimality of the selected features. Among alternatives to these strategies are randomized feature selection algorithms, which are discussed in Chapter 3. A relevant issue regarding exhaustive and heuristic searches is whether there is any reason to perform an exhaustive search if time complexity were not a concern. Research shows that exhaustive search can lead to features that exacerbate data overfitting, while heuristic search is less prone to data overfitting in feature selection when facing small data samples.
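As a concrete illustration of the sequential strategy just described, the following is a minimal sequential forward selection (SFS) sketch in Python (added for illustration; the function and variable names are our own, and score stands in for any subset-quality measure the reader chooses, e.g., cross-validated accuracy of a learner in a wrapper model):

# Minimal sequential forward selection (SFS) sketch.
# score(subset) is any subset-quality measure supplied by the caller.
def sfs(all_features, score):
    selected, remaining = [], list(all_features)
    best = float("-inf")
    while remaining:
        candidate, candidate_score = None, best
        for f in remaining:                      # try adding each remaining feature
            s = score(selected + [f])
            if s > candidate_score:
                candidate, candidate_score = f, s
        if candidate is None:                    # no addition improves subset quality: stop
            break
        selected.append(candidate)               # once selected, a feature stays selected
        remaining.remove(candidate)
        best = candidate_score
    return selected

# Toy scoring function: reward overlap with {"f1", "f3"}, lightly penalize subset size.
toy_score = lambda subset: len(set(subset) & {"f1", "f3"}) - 0.01 * len(subset)
print(sfs(["f1", "f2", "f3", "f4"], toy_score))  # ['f1', 'f3']

A sequential backward selection (SBS) sketch would be symmetric: start from the full set and repeatedly drop the feature whose removal hurts the score least, stopping when any further removal degrades the subset quality.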
  • 28. The small sample problem addresses a new type of "wide" data where the number of features (N) is several orders of magnitude larger than the number of instances (M). High-throughput data produced in genomics and proteomics and text data are typical examples. In connection with the curse of dimensionality mentioned earlier, wide data present challenges to the reliable estimation of a model's performance (e.g., accuracy), to model selection, and to data overfitting. In [3], a pithy illustration of the small sample problem is given with detailed examples.

The evaluation of feature selection often entails two tasks. One is to compare two cases: before and after feature selection. The goal of this task is to observe whether feature selection achieves its intended objectives (recall that feature selection is not confined to improving classification performance). The aspects of evaluation can include the number of selected features, time, scalability, and the learning model's performance. The second task is to compare two feature selection algorithms to see if one is better than the other for a certain task. A detailed empirical study is reported in [14]. As we know, there is no universally superior feature selection algorithm, and different feature selection algorithms have their special edges for various applications. Hence, it is wise to find a suitable algorithm for a given application. An initial attempt to address the problem of selecting feature selection algorithms is presented in [12], aiming to mitigate the increasing complexity of finding a suitable algorithm from among many feature selection algorithms.

Another issue arising from feature selection evaluation is feature selection bias. Using the same training data in both feature selection and classification learning can result in this selection bias. According to statistical theory based on regression research, this bias can exacerbate data overfitting and negatively affect classification performance. A recommended practice is to use separate data for feature selection and for learning. In reality, however, separate datasets are rarely used in the selection and learning steps. This is because we want to use as much data as possible in both selection and learning. It is against this intuition to divide the training data into two datasets, leading to reduced data for both tasks. Feature selection bias is studied in [15] to seek answers as to whether there is a discrepancy between the current practice and the statistical theory. The findings are that the statistical theory is correct, but feature selection bias has limited effect on feature selection for classification.

Recently, researchers have started paying attention to interacting features [7]. Feature interaction usually defies those heuristic solutions to feature selection that evaluate individual features for efficiency. This is because interacting features exhibit properties that cannot be detected in individual features. One simple example of interacting features is the XOR problem, in which both features together determine the class and each individual feature does not tell much at all. By combining careful selection of a feature quality measure and the design of a special data structure, one can heuristically handle some feature interaction, as shown in [17]. The randomized algorithms detailed in Chapter 3 may provide an alternative. An overview of various additional issues related to improving classification performance can be found in [5]. Since there are many facets of feature selection research, we choose a theme that runs in parallel with supervised, unsupervised, and semi-supervised learning below, and discuss and illustrate the underlying concepts of disparate feature selection types, their connections, and how they can benefit from one another.
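To see why the XOR case defeats per-feature evaluation, here is a tiny sketch (added for illustration; the four-row dataset is made up): looked at one at a time, each feature co-occurs with both classes equally often, yet the two features together determine the class exactly.

# XOR: class = F1 xor F2. Any method that scores features one at a time
# sees no class information in either feature and would discard both.
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # (F1, F2, class)

for idx, name in ((0, "F1"), (1, "F2")):
    for v in (0, 1):
        labels = [c for *feats, c in data if feats[idx] == v]
        print(f"{name}={v}: classes observed {labels}")   # always one 0 and one 1
print(all(c == (f1 ^ f2) for f1, f2, c in data))          # True: jointly they fix the class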
• 29. Less Is More 7 may provide an alternative. An overview of various additional issues related to improving classification performance can be found in [5]. Since there are many facets of feature selection research, we choose below a theme that runs in parallel with supervised, unsupervised, and semi-supervised learning, and we discuss and illustrate the underlying concepts of the disparate feature selection types, their connections, and how they can benefit from one another.

1.2 Supervised, Unsupervised, and Semi-Supervised Feature Selection

In one of the early surveys [2], all algorithms are supervised in the sense that the data have class labels (denoted as Xl). Supervised feature selection algorithms rely on measures that take the class information into account. A well-known measure is information gain, which is widely used in both feature selection and decision tree induction. Assuming there are two features F1 and F2, we can calculate feature Fi's information gain as E0 − Ei, where E denotes entropy. E0 is the entropy before the data are split using feature Fi and can be calculated as

E0 = −Σ_c p_c log p_c,

where p_c is the estimated probability of class c and c = 1, 2, ..., C. Ei is the entropy after the data are split using Fi. A better feature results in a larger information gain. Clearly, class information plays a critical role here. Another example is the algorithm ReliefF, which also uses the class information to determine an instance's “near-hit” (a neighboring instance having the same class) and “near-miss” (a neighboring instance having a different class). More details about ReliefF can be found in Chapter 9. In essence, supervised feature selection algorithms try to find features that help separate data of different classes, which we call class-based separation. If a feature has no effect on class-based separation, it can be removed. A good feature should, therefore, help enhance class-based separation.
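As a small illustration of the information gain measure just defined, the sketch below computes E0 − Ei for one discrete-valued feature (a continuous feature would first be discretized). It is a generic NumPy implementation written for this discussion, not code from the book; inputs are assumed to be NumPy arrays.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy  -sum_c p_c log p_c  (base 2) of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature_values, labels):
    """E0 (entropy before the split) minus Ei (weighted entropy after splitting
    the data on the values of one discrete feature)."""
    e_before = entropy(labels)
    e_after = 0.0
    for v in np.unique(feature_values):
        mask = feature_values == v
        e_after += mask.mean() * entropy(labels[mask])
    return e_before - e_after

# For the XOR problem mentioned in the previous section, each individual feature has zero gain:
f1 = np.array([0, 0, 1, 1]); f2 = np.array([0, 1, 0, 1]); y = f1 ^ f2
print(information_gain(f1, y), information_gain(f2, y))   # 0.0 0.0
```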
• 30. 8 Computational Methods of Feature Selection In the late 1990s, research on unsupervised feature selection intensified in order to deal with data without class labels (denoted as Xu). It is closely related to unsupervised learning [4]. One example of unsupervised learning is clustering, where similar instances are grouped together and dissimilar ones are separated. Similarity can be defined by the distance between two instances: conceptually, two instances are similar if the distance between them is small; otherwise they are dissimilar. When all instances are connected pair-wise, breaking the connections between those instances that are far apart will form clusters. Hence, clustering can be thought of as achieving locality-based separation. One widely used clustering algorithm is k-means. It is an iterative algorithm that categorizes instances into k clusters. Given k predetermined centers (or centroids), it works as follows: (1) instances are assigned to their closest centroid, (2) the centroids are recalculated from the instances in each cluster, and (3) the first two steps are repeated until the centroids do not change. Obviously, the key concept is distance calculation, which is sensitive to dimensionality, as discussed earlier in connection with the curse of dimensionality. Basically, if there are many irrelevant or redundant features, the clustering will differ from the clustering obtained with only the relevant features. A toy example can be found in Figure 1.1, in which two well-formed clusters in a 1-d space (x) become two different clusters (denoted with different shapes, circles vs. diamonds) in a 2-d space after an irrelevant feature y is introduced. Unsupervised feature selection is more difficult to deal with than supervised feature selection. However, it is also a very useful tool, as the majority of data are unlabeled. A comprehensive introduction and review of unsupervised feature selection is presented in Chapter 2.

FIGURE 1.1: An illustrative example: left, two well-formed clusters; middle, after an irrelevant feature is added; right, after applying 2-means clustering.

When a small number of instances are labeled but the majority are not, semi-supervised feature selection is designed to take advantage of both the large number of unlabeled instances and the labeling information, as in semi-supervised learning. Intuitively, the additional labeling information should help constrain the search space of unsupervised feature selection. In other words, semi-supervised feature selection attempts to align locality-based separation and class-based separation.
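The effect pictured in Figure 1.1 is easy to reproduce numerically. The sketch below is illustrative only: the data are synthetic, scikit-learn's KMeans is assumed to be available as the clusterer, and the exact counts depend on the random draw.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-formed clusters along the relevant feature x ...
x = np.concatenate([rng.normal(1.0, 0.2, 50), rng.normal(2.0, 0.2, 50)])
# ... plus an irrelevant feature y whose spread dominates the distance computation.
y = rng.uniform(0.0, 10.0, 100)

labels_x  = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x.reshape(-1, 1))
labels_xy = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.column_stack([x, y]))

# Cluster membership of the first 50 points (true group 1) and the last 50 (true group 2):
for name, lab in [("x only ", labels_x), ("x and y", labels_xy)]:
    print(name, np.bincount(lab[:50], minlength=2), np.bincount(lab[50:], minlength=2))
# With x alone, the two true groups are recovered; after adding y, 2-means
# typically splits the data along y instead, mixing the two true groups.
```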
• 31. Less Is More 9 Since there are a large number of unlabeled instances and a small number of labeled ones, it is reasonable to use the unlabeled data to form some potential clusters and then employ the labeled data to find those clusters that can achieve both locality-based and class-based separation. For the two possible clustering results in Figure 1.1, if we are given one correctly labeled instance each for the clusters of circles and diamonds, the correct clustering result (the middle figure) will be chosen. The idea of semi-supervised feature selection can be illustrated as in Figure 1.2, which shows how the properties of Xl and Xu complement each other and work together to find relevant features. Two feature vectors (corresponding to two features, f and f′) generate respective cluster indicators, g and g′, representing different clustering results: the left one can satisfy both the constraints of Xl and Xu, but the right one can only satisfy Xu. For semi-supervised feature selection, we want to select f over f′. In other words, there are two equally good ways to cluster the data, as shown in the figure, but only one of them can also attain class-based separation.

FIGURE 1.2: The basic idea for comparing the fitness of cluster indicators according to both Xl (labeled data) and Xu (unlabeled data) for semi-supervised feature selection. Panel (a) shows the cluster structure (C1, C2) corresponding to cluster indicator g, obtained from feature vector f; panel (b) shows the cluster structure (C1′, C2′) corresponding to cluster indicator g′, obtained from feature vector f′. “−” and “+” correspond to instances of the negative and positive classes, and “M” to unlabeled instances.

A semi-supervised feature selection algorithm, sSelect, is proposed in [18]; sSelect is effective at using both data properties when locality-based
  • 32. 10 Computational Methods of Feature Selection separation and class-based separation do not generate conflicts. We expect to witness a surge of study on semi-supervised feature selection. The reason is two-fold: It is often affordable to carefully label a small number of instances, and it also provides a natural way for human experts to inject their knowledge into the feature selection process in the form of labeled instances. Above, we presented and illustrated the development of feature selection in parallel to supervised, unsupervised, and semi-supervised learning to meet the increasing demands of labeled, unlabeled, and partially labeled data. It is just one perspective of feature selection that encompasses many aspects. However, from this perspective, it can be clearly seen that as data evolve, feature selection research adapts and develops into new areas in various forms for emerging real-world applications. In the following, we present an overview of the research activities included in this book. 1.3 Key Contributions and Organization of the Book The ensuing chapters showcase some current research issues of feature se- lection. They are categorically grouped into five parts, each containing four chapters. The first chapter in Part I is this introduction. The other three discuss issues such as unsupervised feature selection, randomized feature se- lection, and causal feature selection. Part II reports some recent results of em- powering feature selection, including active feature selection, decision-border estimate, use of ensembles with independent probes, and incremental fea- ture selection. Part III deals with weighting and local methods such as an overview of the ReliefF family, feature selection in k-means clustering, local feature relevance, and a new interpretation of Relief. Part IV is about text feature selection, presenting an overview of feature selection for text classifi- cation, a new feature selection score, constraint-guided feature selection, and aggressive feature selection. Part V is on Feature Selection in Bioinformat- ics, discussing redundancy-based feature selection, feature construction and selection, ensemble-based robust feature selection, and penalty-based feature selection. A summary of each chapter is given next. 1.3.1 Part I - Introduction and Background Chapter 2 is an overview of unsupervised feature selection, finding the smallest feature subset that best uncovers interesting, natural clusters for the chosen criterion. The existence of irrelevant features can misguide clustering results. Both filter and wrapper approaches can apply as in a supervised setting. Feature selection can either be global or local, and the features to be selected can vary from cluster to cluster. Disparate feature subspaces can
  • 33. Less Is More 11 have different underlying numbers of natural clusters. Therefore, care must be taken when comparing two clusters with different sets of features. Chapter 3 is also an overview about randomization techniques for feature selection. Randomization can lead to an efficient algorithm when the benefits of good choices outweigh the costs of bad choices. There are two broad classes of algorithms: Las Vegas algorithms, which guarantee a correct answer but may require a long time to execute with small probability, and Monte Carlo algorithms, which may output an incorrect answer with small probability but always complete execution quickly. The randomized complexity classes define the probabilistic guarantees that an algorithm must meet. The major sources of randomization are the input features and/or the training examples. The chapter introduces examples of several randomization algorithms. Chapter 4 addresses the notion of causality and reviews techniques for learning causal relationships from data in applications to feature selection. Causal Bayesian networks provide a convenient framework for reasoning about causality and an algorithm is presented that can extract causality from data by finding the Markov blanket. Direct causes (parents), direct effects (chil- dren), and other direct causes of the direct effects (spouses) are all members of the Markov blanket. Only direct causes are strongly causally relevant. The knowledge of causal relationships can benefit feature selection, e.g., explain- ing relevance in terms of causal mechanisms, distinguishing between actual features and experimental artifacts, predicting the consequences of actions, and making predictions in a non-stationary environment. 1.3.2 Part II - Extending Feature Selection Chapter 5 poses an interesting problem of active feature sampling in do- mains where the feature values are expensive to measure. The selection of features is based on the maximum benefit. A benefit function minimizes the mean-squared error in a feature relevance estimate. It is shown that the minimum mean-squared error criterion is equivalent to the maximum average change criterion. The results obtained by using a mixture model for the joint class-feature distribution show the advantage of the active sampling policy over the random sampling in reducing the number of feature samples. The approach is computationally expensive. Considering only a random subset of the missing entries at each sampling step is a promising solution. Chapter 6 discusses feature extraction (as opposed to feature selection) based on the properties of the decision border. It is intuitive that the direction normal to the decision boundary represents an informative direction for class discriminability and its effectiveness is proportional to the area of decision bor- der that has the same normal vector. Based on this, a labeled vector quantizer that can efficiently be trained by the Bayes risk weighted vector quantization (BVQ) algorithm was devised to extract the best linear approximation to the decision border. The BVQ produces a decision boundary feature matrix, and the eigenvectors of this matrix are exploited to transform the original feature
  • 34. 12 Computational Methods of Feature Selection space into a new feature space with reduced dimensionality. It is shown that this approach is comparable to the SVM-based decision boundary approach and better than the MLP (Multi Layer Perceptron)-based approach, but with a lower computational cost. Chapter 7 proposes to compare feature relevance against the relevance of its randomly permuted version (or probes) for classification/regression tasks using random forests. The key is to use the same distribution in generating a probe. Feature relevance is estimated by averaging the relevance obtained from each tree in the ensemble. The method iterates over the remaining fea- tures by removing the identified important features using the residuals as new target variables. It offers autonomous feature selection taking into account non-linearity, mixed-type data, and missing data in regressions and classifica- tions. It shows excellent performance and low computational complexity, and is able to address massive amounts of data. Chapter 8 introduces an incremental feature selection algorithm for high- dimensional data. The key idea is to decompose the whole process into feature ranking and selection. The method first ranks features and then resolves the redundancy by an incremental subset search using the ranking. The incre- mental subset search does not retract what it has selected, but it can decide not to add the next candidate feature, i.e., skip it and try the next according to the rank. Thus, the average number of features used to construct a learner during the search is kept small, which makes the wrapper approach feasible for high-dimensional data. 1.3.3 Part III - Weighting and Local Methods Chapter 9 is a comprehensive description of the Relief family algorithms. Relief exploits the context of other features through distance measures and can detect highly conditionally-dependent features. The chapter explains the idea, advantages, and applications of Relief and introduces two extensions: ReliefF and RReliefF. ReliefF is for classification and can deal with incomplete data with multi-class problems. RReliefF is its extension designed for regression. The variety of the Relief family shows the general applicability of the basic idea of Relief as a non-myopic feature quality measure. Chapter 10 discusses how to automatically determine the important fea- tures in the k-means clustering process. The weight of a feature is determined by the sum of the within-cluster dispersions of the feature, which measures its importance in clustering. A new step to calculate the feature weights is added in the iterative process in order not to seriously affect the scalability. The weight can be defined either globally (same weights for all clusters) or locally (different weights for different clusters). The latter, called subspace k-means clustering, has applications in text clustering, bioinformatics, and customer behavior analysis. Chapter 11 is in line with Chapter 5, but focuses on local feature relevance and weighting. Each feature’s ability for class probability prediction at each
  • 35. Less Is More 13 point in the feature space is formulated in a way similar to the weighted χ- square measure, from which the relevance weight is derived. The weight has a large value for a direction along which the class probability is not locally constant. To gain efficiency, a decision boundary is first obtained by an SVM, and its normal vector nearest to the point in query is used to estimate the weights reflected in the distance measure for a k-nearest neighbor classifier. Chapter 12 gives further insights into Relief (refer to Chapter 9). The working of Relief is proven to be equivalent to solving an online convex opti- mization problem with a margin-based objective function that is defined based on a nearest neighbor classifier. Relief usually performs (1) better than other filter methods due to the local performance feedback of a nonlinear classifier when searching for useful features, and (2) better than wrapper methods due to the existence of efficient algorithms for a convex optimization problem. The weights can be iteratively updated by an EM-like algorithm, which guaran- tees the uniqueness of the optimal weights and the convergence. The method was further extended to its online version, which is quite effective when it is difficult to use all the data in a batch mode. 1.3.4 Part IV - Text Classification and Clustering Chapter 13 is a comprehensive presentation of feature selection for text classification, including feature generation, representation, and selection, with illustrative examples, from a pragmatic view point. A variety of feature gen- erating schemes is reviewed, including word merging, word phrases, character N-grams, and multi-fields. The generated features are ranked by scoring each feature independently. Examples of scoring measures are information gain, χ-square, and bi-normal separation. A case study shows considerable im- provement of F-measure by feature selection. It also shows that adding two word phrases as new features generally gives good performance gain over the features comprising only selected words. Chapter 14 introduces a new feature selection score, which is defined as the posterior probability of inclusion of a given feature over all possible models, where each model corresponds to a different set of features that includes the given feature. The score assumes a probability distribution on the words of the documents. Bernoulli and Poisson distributions are assumed respectively when only the presence or absence of a word matters and when the number of occurrences of a word matters. The score computation is inexpensive, and the value that the score assigns to each word has an appealing Bayesian interpretation when the predictive model corresponds to a naive Bayes model. This score is compared with five other well-known scores. Chapter 15 focuses on dimensionality reduction for semi-supervised clus- tering where some weak supervision is available in terms of pairwise instance constraints (must-link and cannot-link). Two methods are proposed by lever- aging pairwise instance constraints: pairwise constraints-guided feature pro- jection and pairwise constraints-guided co-clustering. The former is used to
  • 36. 14 Computational Methods of Feature Selection project data into a lower dimensional space such that the sum-squared dis- tance between must-link instances is minimized and the sum-squared dis- tance between cannot-link instances is maximized. This reduces to an elegant eigenvalue decomposition problem. The latter is to use feature clustering benefitting from pairwise constraints via a constrained co-clustering mecha- nism. Feature clustering and data clustering are mutually reinforced in the co-clustering process. Chapter 16 proposes aggressive feature selection, removing more than 95% features (terms) for text data. Feature ranking is effective to remove irrelevant features, but cannot handle feature redundancy. Experiments show that feature redundancy can be as destructive as noise. A new multi-stage approach for text feature selection is proposed: (1) pre-processing to remove stop words, infrequent words, noise, and errors; (2) ranking features to iden- tify the most informative terms; and (3) removing redundant and correlated terms. In addition, term redundancy is modeled by a term-redundancy tree for visualization purposes. 1.3.5 Part V - Feature Selection in Bioinformatics Chapter 17 introduces the challenges of microarray data analysis and presents a redundancy-based feature selection algorithm. For high-throughput data like microarrays, redundancy among genes becomes a critical issue. Con- ventional feature ranking algorithms cannot effectively handle feature redun- dancy. It is known that if there is a Markov blanket for a feature, the feature can be safely eliminated. Finding a Markov blanket is computationally heavy. The solution proposed is to use an approximate Markov blanket, in which it is assumed that the Markov blanket always consists of one feature. The features are first ranked, and then each feature is checked in sequence if it has any ap- proximate Markov blanket in the current set. This way it can efficiently find all predominant features and eliminate the rest. Biologists would welcome an efficient filter algorithm to feature redundancy. Redundancy-based fea- ture selection makes it possible for a biologist to specify what genes are to be included before feature selection. Chapter 18 presents a scalable method for automatic feature generation on biological sequence data. The algorithm uses sequence components and do- main knowledge to construct features, explores the space of possible features, and identifies the most useful ones. As sequence data have both compositional and positional properties, feature types are defined to capture these proper- ties, and for each feature type, features are constructed incrementally from the simplest ones. During the construction, the importance of each feature is evaluated by a measure that best fits to each type, and low ranked features are eliminated. At the final stage, selected features are further pruned by an embedded method based on recursive feature elimination. The method was applied to the problem of splice-site prediction, and it successfully identified the most useful set of features of each type. The method can be applied
  • 37. Less Is More 15 to complex feature types and sequence prediction tasks such as translation start-site prediction and protein sequence classification. Chapter 19 proposes an ensemble-based method to find robust features for biomarker research. Ensembles are obtained by choosing different alterna- tives at each stage of data mining: three normalization methods, two binning methods, eight feature selection methods (including different combination of search methods), and four classification methods. A total of 192 different clas- sifiers are obtained, and features are selected by favoring frequently appearing features that are members of small feature sets of accurate classifiers. The method is successfully applied to a publicly available Ovarian Cancer Dataset, in which case the original attribute is the m/z (mass/charge) value of mass spectrometer and the value of the feature is its intensity. Chapter 20 presents a penalty-based feature selection method, elastic net, for genomic data, which is a generalization of lasso (a penalized least squares method with L1 penalty for regression). Elastic net has a nice property that irrelevant features receive their parameter estimates equal to 0, leading to sparse and easy to interpret models like lasso, and, in addition, strongly cor- related relevant features are all selected whereas in lasso only one of them is selected. Thus, it is a more appropriate tool for feature selection with high-dimensional data than lasso. Details are given on how elastic net can be applied to regression, classification, and sparse eigen-gene analysis by simul- taneously building a model and selecting relevant and redundant features. 1.4 Looking Ahead Feature selection research has found applications in many fields where large (either row-wise or column-wise) volumes of data present challenges to effec- tive data analysis and processing. As data evolve, new challenges arise and the expectations of feature selection are also elevated, due to its own suc- cess. In addition to high-throughput data, the pervasive use of Internet and Web technologies has been bringing about a great number of new services and applications, ranging from recent Web 2.0 applications to traditional Web ser- vices where multi-media data are ubiquitous and abundant. Feature selection is widely applied to find topical terms, establish group profiles, assist in cat- egorization, simplify descriptions, facilitate personalization and visualization, among many others. The frontier of feature selection research is expanding incessantly in an- swering the emerging challenges posed by the ever-growing amounts of data, multiple sources of heterogeneous data, data streams, and disparate data- intensive applications. On one hand, we naturally anticipate more research on semi-supervised feature selection, unifying supervised and unsupervised
  • 38. 16 Computational Methods of Feature Selection feature selection [19], and integrating feature selection with feature extrac- tion. On the other hand, we expect new feature selection methods designed for various types of features like causal, complementary, relational, struc- tural, and sequential features, and intensified research efforts on large-scale, distributed, and real-time feature selection. As the field develops, we are op- timistic and confident that feature selection research will continue its unique and significant role in taming the data monster and helping turning data into nuggets. References [1] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97:245–271, 1997. [2] M. Dash and H. Liu. Feature selection methods for classifications. Intel- ligent Data Analysis: An International Journal, 1(3):131–156, 1997. [3] E. Dougherty. Feature-selection overfitting with small-sample classi- fier design. IEEE Intelligent Systems, 20(6):64–66, November/December 2005. [4] J. Dy and C. Brodley. Feature selection for unsupervised learning. Jour- nal of Machine Learning Research, 5:845–889, 2004. [5] I. Guyon and A. Elisseeff. An introduction to variable and feature se- lection. Journal of Machine Learning Research (JMLR), 3:1157–1182, 2003. [6] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001. [7] A. Jakulin and I. Bratko. Testing the significance of attribute interac- tions. In ICML ’04: Twenty-First International Conference on Machine Learning. ACM Press, 2004. [8] G. John, R. Kohavi, and K. Pfleger. Irrelevant feature and the subset se- lection problem. In W. Cohen and H. H., editors, Machine Learning: Pro- ceedings of the Eleventh International Conference, pages 121–129, New Brunswick, NJ: Rutgers University, 1994. [9] H. Liu and H. Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective. Boston: Kluwer Academic Pub- lishers, 1998. 2nd Printing, 2001. [10] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery Data Mining. Boston: Kluwer Academic Publishers, 1998.
  • 39. Less Is More 17 [11] H. Liu and H. Motoda, editors. Instance Selection and Construction for Data Mining. Boston: Kluwer Academic Publishers, 2001. [12] H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. on Knowledge and Data En- gineering, 17(3):1–12, 2005. [13] T. Mitchell. Machine Learning. New York: McGraw-Hill, 1997. [14] P. Refaeilzadeh, L. Tang, and H. Liu. On comparison of feature selection algorithms. In AAAI 2007 Workshop on Evaluation Methods for Machine Learning II, Vancouver, British Columbia, Canada, July 2007. [15] S. Singhi and H. Liu. Feature subset selection bias for classification learning. In International Conference on Machine Learning, 2006. [16] L. Yu and H. Liu. Efficient feature selection via analysis of rele- vance and redundancy. Journal of Machine Learning Research (JMLR), 5(Oct):1205–1224, 2004. [17] Z. Zhao and H. Liu. Searching for interacting features. In Proceedings of IJCAI - International Joint Conference on AI, January 2007. [18] Z. Zhao and H. Liu. Semi-supervised feature selection via spectral anal- ysis. In Proceedings of SIAM International Conference on Data Mining (SDM-07), 2007. [19] Z. Zhao and H. Liu. Spectral feature selection for supervised and unsu- pervised learning. In Proceedings of International Conference on Machine Learning, 2007.
  • 41. Chapter 2 Unsupervised Feature Selection Jennifer G. Dy Northeastern University 2.1 Introduction ............................................................. 19 2.2 Clustering ................................................................ 21 2.3 Feature Selection ........................................................ 23 2.4 Feature Selection for Unlabeled Data ................................... 25 2.5 Local Approaches ........................................................ 32 2.6 Summary ................................................................ 34 Acknowledgment ......................................................... 35 References ............................................................... 35 2.1 Introduction Many existing databases are unlabeled, because large amounts of data make it difficult for humans to manually label the categories of each instance. More- over, human labeling is expensive and subjective. Hence, unsupervised learn- ing is needed. Besides being unlabeled, several applications are characterized by high-dimensional data (e.g., text, images, gene). However, not all of the features domain experts utilize to represent these data are important for the learning task. We have seen the need for feature selection in the supervised learning case. This is also true in the unsupervised case. Unsupervised means there is no teacher, in the form of class labels. One type of unsupervised learn- ing problem is clustering. The goal of clustering is to group “similar” objects together. “Similarity” is typically defined in terms of a metric or a probabil- ity density model, which are both dependent on the features representing the data. In the supervised paradigm, feature selection algorithms maximize some function of prediction accuracy. Since class labels are available in supervised learning, it is natural to keep only the features that are related to or lead to these classes. But in unsupervised learning, we are not given class labels. Which features should we keep? Why not use all the information that we have? The problem is that not all the features are important. Some of the features may be redundant and some may be irrelevant. Furthermore, the ex- istence of several irrelevant features can misguide clustering results. Reducing 19
• 42. 20 Computational Methods of Feature Selection the number of features also facilitates comprehensibility and ameliorates the problem that some unsupervised learning algorithms break down with high-dimensional data. In addition, for some applications the goal is not just clustering, but also finding the important features themselves.

A reason why some clustering algorithms break down in high dimensions is the curse of dimensionality [3]. As the number of dimensions increases, a fixed data sample becomes exponentially sparse. Additional dimensions increase the volume exponentially and spread the data out such that the data points all look equally far apart. Figure 2.1 (a) shows a plot of data generated from a uniform distribution between 0 and 2 with 25 instances in one dimension. Figure 2.1 (b) shows a plot of the same data in two dimensions, and Figure 2.1 (c) displays the data in three dimensions. Observe that the data become more and more sparse in higher dimensions. There are 12 samples that fall inside the unit-sized box in Figure 2.1 (a), seven samples in (b), and two in (c). The sampling density is proportional to M^(1/N), where M is the number of samples and N is the dimension. For this example, a sampling density of 25 in one dimension would require 25^3 = 15,625 samples in three dimensions to achieve a similar sample density.

FIGURE 2.1: Illustration of the curse of dimensionality. These are plots of a 25-sample dataset generated from a uniform distribution between 0 and 2: (a) in one dimension, (b) in two dimensions, and (c) in three dimensions. The boxes in the figures show unit-sized bins in the corresponding dimensions. Note that the data are more sparse with respect to the unit-sized volume in higher dimensions: there are 12 samples in the unit-sized box in (a), 7 samples in (b), and 2 samples in (c).
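A quick numerical check of the sampling-density argument above (a sketch; the particular counts vary with the random seed):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 25                                    # 25 instances, as in Figure 2.1
for N in (1, 2, 3):
    X = rng.uniform(0.0, 2.0, size=(M, N))           # uniform sample on [0, 2]^N
    inside = int(np.all(X < 1.0, axis=1).sum())      # samples in a unit-sized bin
    print(f"N={N}: {inside} of {M} samples fall in the unit box; "
          f"matching the 1-d density would need {M ** N} samples")
```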
• 43. Unsupervised Feature Selection 21 As noted earlier, supervised learning has class labels to guide the feature search. In unsupervised learning, these labels are missing, and in fact its goal is to find these grouping labels (also known as cluster assignments). Finding these cluster labels depends on the features describing the data, which makes feature selection for unsupervised learning difficult. Dy and Brodley [14] define the goal of feature selection for unsupervised learning as:

to find the smallest feature subset that best uncovers “interesting natural” groupings (clusters) from data according to the chosen criterion.

Without any labeled information, in unsupervised learning we need to make some assumptions. We need to define what “interesting” and “natural” mean in the form of criterion or objective functions. We will see examples of such criterion functions later in this chapter. Before we proceed with how to do feature selection on unsupervised data, it is important to know the basics of clustering algorithms. Section 2.2 briefly describes clustering algorithms. In Section 2.3 we review the basic components of feature selection algorithms. We then present the methods for unsupervised feature selection in Sections 2.4 and 2.5, and finally provide a summary in Section 2.6.

2.2 Clustering

The goal of clustering is to group similar objects together. There are two types of clustering approaches: partitional and hierarchical. Partitional clustering provides one level of clustering. Hierarchical clustering, on the other hand, provides multiple levels (a hierarchy) of clustering solutions. Hierarchical approaches can proceed bottom-up (agglomerative) or top-down (divisive). Bottom-up approaches typically start with all instances as clusters and then, at each level, merge the clusters that are most similar to each other. Top-down approaches divide the data into k clusters at each level. There are several methods for performing clustering; a survey of these algorithms can be found in [29, 39, 18]. In this section we briefly present two popular partitional clustering algorithms: k-means and finite mixture model clustering. As mentioned earlier, similarity is typically defined by a metric or by a probability distribution. K-means is an approach that uses a metric, while finite mixture models define similarity by a probability density. Let us denote our dataset as X = {x_1, x_2, ..., x_M}. X consists of M data instances x_k, k = 1, 2, ..., M, and each x_k represents a single N-dimensional instance.

2.2.1 The K-Means Algorithm

The goal of k-means is to partition X into K clusters {C_1, ..., C_K}. The most widely used criterion function for the k-means algorithm is the sum-squared-error (SSE) criterion.
• 44. 22 Computational Methods of Feature Selection SSE is defined as

SSE = Σ_{j=1}^{K} Σ_{x_k ∈ C_j} ||x_k − μ_j||²    (2.1)

where μ_j denotes the mean (centroid) of the instances in cluster C_j. K-means is an iterative algorithm that locally minimizes the SSE criterion. It assumes each cluster has a hyper-spherical structure. “K-means” refers to the process of assigning each data point, x_k, to the cluster with the nearest mean. The k-means algorithm starts with K initial centroids, then assigns each remaining point to the nearest centroid, updates the cluster centroids, and repeats the process until the K centroids do not change (convergence). There are two versions of k-means: one originates from Forgy [17] and the other from Macqueen [36]. The difference between the two is when the cluster centroids are updated. In Forgy's k-means [17], cluster centroids are re-computed after all the data points have been assigned to their nearest centroids. In Macqueen's k-means [36], the cluster centroids are re-computed after each data assignment. Since k-means is a greedy algorithm, it is only guaranteed to find a local minimum, and the solution depends on the initial assignments. To avoid a poor local optimum, one typically applies random restarts and picks the clustering solution with the best SSE. One can refer to [47, 4] for other ways to deal with the initialization problem. Standard k-means uses the Euclidean distance to measure dissimilarity between data points. Note that one can easily create variants of k-means by modifying this distance metric (e.g., using other Lp-norm distances) to one more appropriate for the data; for example, on text data a more suitable metric is the cosine similarity. One can also replace the SSE objective function with other criterion measures to create other clustering algorithms.
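A minimal NumPy sketch of Forgy-style k-means, written to mirror the SSE criterion in (2.1); it is illustrative only and omits the random restarts recommended above.

```python
import numpy as np

def kmeans_forgy(X, K, n_iter=100, seed=0):
    """Forgy-style k-means on an (M, N) float array X; returns labels,
    centroids, and the SSE of criterion (2.1)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: each instance goes to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                       # converged: assignments no longer change
        labels = new_labels
        # Update step: recompute each centroid after all assignments are made.
        for j in range(K):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
            else:                       # re-seed an empty cluster at a random instance
                centroids[j] = X[rng.integers(len(X))]
    sse = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, sse
```

Macqueen's variant would differ only in recomputing the affected centroid immediately after each individual assignment rather than after a full pass over the data.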
• 45. Unsupervised Feature Selection 23

2.2.2 Finite Mixture Clustering

A finite mixture model assumes that the data are generated from a mixture of K component density functions, in which p(x_k | θ_j) represents the density function of component j and θ_j is the parameter (to be estimated) for cluster j. The probability density of data point x_k is expressed by

p(x_k) = Σ_{j=1}^{K} α_j p(x_k | θ_j)    (2.2)

where the α_j are the mixing proportions of the components (subject to α_j ≥ 0 and Σ_{j=1}^{K} α_j = 1). The log-likelihood of the M observed data points is then given by

L = Σ_{k=1}^{M} ln { Σ_{j=1}^{K} α_j p(x_k | θ_j) }    (2.3)

It is difficult to optimize (2.3) directly; therefore we apply the Expectation-Maximization (EM) [10] algorithm to find a (local) maximum likelihood or maximum a posteriori (MAP) estimate of the parameters for the given data set. EM is a general approach for estimating the maximum likelihood or MAP estimate for missing data problems; in the clustering context, the missing or hidden variables are the class labels. The EM algorithm iterates between an Expectation step (E-step), which computes the expected complete-data log-likelihood given the observed data and the model parameters, and a Maximization step (M-step), which estimates the model parameters by maximizing the expected complete-data log-likelihood from the E-step, until convergence. In clustering, the E-step amounts to estimating the cluster memberships, and the M-step estimates the cluster model parameters. The clustering solution that we obtain from a mixture model is what we call a “soft” clustering solution, because we obtain estimated cluster memberships (i.e., each data point belongs to all clusters with some probability weight of belonging to each cluster). In contrast, k-means provides a “hard” clustering solution (i.e., each data point belongs to only a single cluster). Analogous to metric-based clustering, where one can develop different algorithms by using other similarity metrics, one can design different probability-based mixture model clustering algorithms by choosing a density model appropriate for the application domain. A Gaussian distribution is typically used for continuous features and multinomials for discrete features. For a more thorough description of clustering using finite mixture models, see [39]; a review is provided in [18].
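The EM recipe above translates almost line for line into code. The sketch below fits a mixture of Gaussians with full covariance matrices (an assumed modeling choice for continuous features); it runs a fixed number of iterations instead of testing convergence and relies on SciPy for the Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=50, seed=0, reg=1e-6):
    """EM for a mixture of K Gaussians on an (M, N) float array X.
    Returns mixing proportions, means, covariances, and soft memberships."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    alpha = np.full(K, 1.0 / K)                         # mixing proportions
    mu = X[rng.choice(M, size=K, replace=False)]        # means start at data points
    cov = np.array([np.atleast_2d(np.cov(X.T)) + reg * np.eye(N) for _ in range(K)])
    resp = np.zeros((M, K))
    for _ in range(n_iter):
        # E-step: responsibilities (estimated cluster memberships)
        #   r_kj = alpha_j p(x_k | theta_j) / sum_j' alpha_j' p(x_k | theta_j')
        for j in range(K):
            resp[:, j] = alpha[j] * multivariate_normal(mu[j], cov[j]).pdf(X)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the model parameters from the soft memberships.
        Nk = resp.sum(axis=0)
        alpha = Nk / M
        mu = (resp.T @ X) / Nk[:, None]
        for j in range(K):
            d = X - mu[j]
            cov[j] = (resp[:, j, None] * d).T @ d / Nk[j] + reg * np.eye(N)
    return alpha, mu, cov, resp
```

The returned `resp` matrix is the “soft” clustering referred to above; `resp.argmax(axis=1)` hardens it into k-means-style labels.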
• 46. 24 Computational Methods of Feature Selection

2.3 Feature Selection

Feature selection algorithms have two main components: (1) feature search and (2) feature subset evaluation.

2.3.1 Feature Search

Feature search strategies have been widely studied for classification. Generally speaking, the search strategies used for supervised classification can also be used with clustering algorithms. We repeat and summarize them here for completeness. An exhaustive search would definitely find the optimal solution; however, a search over the 2^N possible feature subsets (where N is the number of features) is computationally impractical. More realistic search strategies have therefore been studied. Narendra and Fukunaga [40] introduced the branch and bound algorithm, which finds the optimal feature subset if the criterion function used is monotonic. However, although branch and bound makes problems more tractable than an exhaustive search, it becomes impractical for feature selection problems involving more than 30 features [43]. Sequential search methods generally use greedy techniques and hence do not guarantee global optimality of the selected subsets, only local optimality. Examples of sequential searches include sequential forward selection, sequential backward elimination, and bidirectional selection [32, 33]. Sequential forward/backward search methods generally result in an O(N^2) worst-case search. Marill and Green [38] introduced the sequential backward selection (SBS) [43] method, which starts with all the features and sequentially eliminates one feature at a time (eliminating the feature that contributes least to the criterion function). Whitney [50] introduced sequential forward selection (SFS) [43], which starts with the empty set and sequentially adds one feature at a time. A problem with these hill-climbing search techniques is that when a feature is deleted in SBS, it cannot be re-selected, while a feature added in SFS cannot be deleted once selected. To counter this effect, the Plus-l-Minus-r (l-r) search method was developed by Stearns [45], in which the values of l and r are pre-specified and fixed at each step. Pudil et al. [43] introduced an adaptive version that allows the l and r values to “float”; they call these methods floating search methods: sequential forward floating selection (SFFS) and sequential backward floating selection (SBFS), named after the dominant search direction (i.e., forward or backward). Random search methods, such as genetic algorithms and random mutation hill climbing, add some randomness to the search procedure to help escape from local optima. In some cases, when the dimensionality is very high, one can only afford an individual search. Individual search methods evaluate each feature individually according to a criterion or a condition [24]; they then select the features that either satisfy the condition or are top-ranked.

2.3.2 Feature Evaluation

Not all of the features are important: some may be irrelevant and some may be redundant. Each feature or feature subset needs to be evaluated by a criterion that measures its importance, and different criteria may select different features. It is deciding on the evaluation criteria that actually makes feature selection in clustering difficult. In classification, it is natural to keep the features that are related to the labeled classes. However, in clustering, these class labels are not available. Which features should we keep? More specifically, how do we decide which features are relevant or irrelevant, and which are redundant? Figure 2.2 gives a simple example of an irrelevant feature for clustering. Suppose the data have features F1 and F2 only. Feature F2 does not contribute to cluster discrimination; thus, we consider feature F2 to be irrelevant. We want to remove irrelevant features because they may mislead the clustering algorithm (especially when there are more irrelevant features than relevant ones). Figure 2.3 provides an example showing feature redundancy.
• 47. Unsupervised Feature Selection 25

FIGURE 2.2: In this example, feature F2 is irrelevant because it does not contribute to cluster discrimination.

FIGURE 2.3: In this example, features F1 and F2 have redundant information, because feature F1 provides the same information as feature F2 with regard to discriminating the two clusters.

Observe that both features F1 and F2 lead to the same clustering results. Therefore, we consider features F1 and F2 to be redundant.

2.4 Feature Selection for Unlabeled Data

There are several feature selection methods for clustering. As in supervised learning, these feature selection methods can be categorized as either filter or wrapper approaches [33], based on whether the evaluation method depends on the learning algorithm. As Figure 2.4 shows, the wrapper approach wraps the feature search around the learning algorithm that will ultimately be applied and uses the learned results to select the features. On the other hand, as shown in Figure 2.5, the filter approach uses the data alone to decide which features should be kept, without running the learning algorithm.
• 48. 26 Computational Methods of Feature Selection

FIGURE 2.4: Wrapper approach for feature selection for clustering (the search proposes a feature subset, the clustering algorithm produces clusters from it, and the feature evaluation criterion scores the result to guide the search toward the selected features).

FIGURE 2.5: Filter approach for feature selection for clustering (the feature evaluation criterion scores candidate feature subsets directly from the data, without running the clustering algorithm).

Usually, a wrapper approach leads to better performance than a filter approach for a particular learning algorithm. However, wrapper methods are more computationally expensive, since one needs to run the learning algorithm for every candidate feature subset. In this section, we present the different methods categorized into filter and wrapper approaches.

2.4.1 Filter Methods

Filter methods use some intrinsic property of the data to select features without utilizing the clustering algorithm that will ultimately be applied. The basic components of filter methods are the feature search method and the feature selection criterion. Filter methods face the challenge of defining feature relevance (interestingness) and/or redundancy without applying clustering to the data. Talavera [48] developed a filter version of his wrapper approach that selects features based on feature dependence; he claims that irrelevant features are features that do not depend on the other features. Manoranjan et al. [37] introduced a filter approach that selects features based on the entropy of distances between data points; they observed that when the data are clustered, the distance entropy in that subspace should be low. He, Cai, and Niyogi [26] select features based on the Laplacian score, which evaluates features by their locality-preserving power. The Laplacian score is based on the premise that two data points that are close together probably belong to the same cluster. These three filter approaches try to remove features that are not relevant.
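To give a flavor of such entropy-based filter criteria, here is one simple way to score a candidate subspace by the entropy of pairwise similarities. It follows the intuition stated above (low entropy when the subspace exhibits clear cluster structure) but is a simplified stand-in written for illustration, not the exact formulation of the cited papers.

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_entropy(X_subspace):
    """Entropy of pairwise similarities in a candidate feature subspace (rows are
    instances, columns the candidate features). Lower values suggest clearer
    cluster structure, so a filter search would prefer subspaces (or rank
    features) by minimizing this score."""
    d = pdist(X_subspace)                           # all pairwise Euclidean distances
    alpha = -np.log(0.5) / (d.mean() + 1e-12)       # similarity 0.5 at the mean distance
    s = np.clip(np.exp(-alpha * d), 1e-12, 1 - 1e-12)
    return float(-np.sum(s * np.log(s) + (1 - s) * np.log(1 - s)))
```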
  • 49. Unsupervised Feature Selection 27 Another way to reduce the dimensionality is to remove redundancy. A filter approach primarily for reducing redundancy is simply to cluster the features. Note that even though we apply clustering, we consider this as a filter method because we cluster on the feature space as opposed to the data sample space. One can cluster the features using a k-means clustering [36, 17] type of algo- rithm with feature correlation as the similarity metric. Instead of a cluster mean, represent each cluster by the feature that has the highest correlation among features within the cluster it belongs to. Popular techniques for dimensionality reduction without labels are prin- cipal components analysis (PCA) [30], factor analysis, and projection pur- suit [20, 27]. These early works in data reduction for unsupervised data can be thought of as filter methods, because they select the features prior to ap- plying clustering. But rather than selecting a subset of the features, they involve some type of feature transformation. PCA and factor analysis aim to reduce the dimension such that the representation is as faithful as possible to the original data. As such, these techniques aim at reducing dimensionality by removing redundancy. Projection pursuit, on the other hand, aims at find- ing “interesting” projections (defined as the directions that are farthest from Gaussian distributions and close to uniform). In this case, projection pur- suit addresses relevance. Another method is independent component analysis (ICA) [28]. ICA tries to find a transformation such that the transformed vari- ables are statistically independent. Although the goals of ICA and projection pursuit are different, the formulation in ICA ends up being similar to that of projection pursuit (i.e., they both search for directions that are farthest from the Gaussian density). These techniques are filter methods, however, they apply transformations on the original feature space. We are interested in sub- sets of the original features, because we want to retain the original meaning of the features. Moreover, transformations would still require the user to collect all the features to obtain the reduced set, which is sometimes not desired. 2.4.2 Wrapper Methods Wrapper methods apply the clustering algorithm to evaluate the features. They incorporate the clustering algorithm inside the feature search and selec- tion. Wrapper approaches consist of: (1) a search component, (2) a clustering algorithm, and (3) a feature evaluation criterion. See Figure 2.4. One can build a feature selection wrapper approach for clustering by simply picking a favorite search method (any method presented in Section 2.3.1), and apply a clustering algorithm and a feature evaluation criterion. However, there are issues that one must take into account in creating such an algorithm. In [14], Dy and Brodley investigated the issues involved in creating a general wrapper method where any feature selection, clustering, and selection criteria can be applied. The first issue they observed is that it is not a good idea to use the same number of clusters throughout the feature search because different feature subspaces have different underlying numbers of “natural”
• 50. 28 Computational Methods of Feature Selection clusters. Thus, the clustering algorithm should also incorporate finding the number of clusters during the feature search. The second issue they discovered is that various selection criteria are biased with respect to dimensionality. They then introduced a cross-projection normalization scheme that can be utilized by any criterion function.

Feature subspaces have different underlying numbers of clusters. When we search for the best feature subset, we run into a new problem: the appropriate number of clusters depends on the feature subset. Figure 2.6 illustrates this point. In two dimensions {F1, F2} there are three clusters, whereas in one dimension (the projection of the data onto F1 only) there are only two clusters. It is not a good idea to use a fixed number of clusters during the feature search, because different feature subsets require different numbers of clusters, and using a fixed number of clusters for all feature sets does not model the data in the respective subspaces correctly. In [14], they addressed finding the number of clusters by applying a Bayesian information criterion penalty [44].

FIGURE 2.6: The number of cluster components varies with dimension.

The feature evaluation criterion should not be biased with respect to dimensionality. In a wrapper approach, one searches in feature space, applies clustering in each candidate feature subspace, Si, and then evaluates the result (the clustering in space Si) against the cluster solutions found in other subspaces, Sj, j ≠ i, based on an evaluation criterion. This can be problematic, especially when Si and Sj have different dimensionalities. Dy and Brodley [14] examined two feature selection criteria: maximum likelihood and scatter separability. They showed that the scatter separability criterion prefers higher dimensionality; in other words, the criterion value monotonically increases as features are added.
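One practical way to follow the first recommendation above, re-estimating the number of clusters inside every candidate subspace with a BIC penalty, is sketched below. It assumes scikit-learn's GaussianMixture as the clusterer and is an illustrative recipe, not the authors' exact procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X_subspace, k_max=10, seed=0):
    """Fit mixtures with k = 1..k_max in the given feature subspace and keep the k
    with the lowest Bayesian information criterion, so the number of clusters is
    re-estimated for every subspace visited during the feature search."""
    bics = [GaussianMixture(n_components=k, random_state=seed).fit(X_subspace).bic(X_subspace)
            for k in range(1, k_max + 1)]
    return int(np.argmin(bics)) + 1, bics
```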
  • 60. about donations to the Project Gutenberg Literary Archive Foundation.” • You provide a full refund of any money paid by a user who notifies you in writing (or by e-mail) within 30 days of receipt that s/he does not agree to the terms of the full Project Gutenberg™ License. You must require such a user to return or destroy all copies of the works possessed in a physical medium and discontinue all use of and all access to other copies of Project Gutenberg™ works. • You provide, in accordance with paragraph 1.F.3, a full refund of any money paid for a work or a replacement copy, if a defect in the electronic work is discovered and reported to you within 90 days of receipt of the work. • You comply with all other terms of this agreement for free distribution of Project Gutenberg™ works. 1.E.9. If you wish to charge a fee or distribute a Project Gutenberg™ electronic work or group of works on different terms than are set forth in this agreement, you must obtain permission in writing from the Project Gutenberg Literary Archive Foundation, the manager of the Project Gutenberg™ trademark. Contact the Foundation as set forth in Section 3 below. 1.F. 1.F.1. Project Gutenberg volunteers and employees expend considerable effort to identify, do copyright research on, transcribe and proofread works not protected by U.S. copyright law in creating the Project Gutenberg™ collection. Despite these efforts, Project Gutenberg™ electronic works, and the medium on which they may be stored, may contain “Defects,” such as, but not limited to, incomplete, inaccurate or corrupt data, transcription errors, a copyright or other intellectual property infringement, a defective or
  • 61. damaged disk or other medium, a computer virus, or computer codes that damage or cannot be read by your equipment. 1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except for the “Right of Replacement or Refund” described in paragraph 1.F.3, the Project Gutenberg Literary Archive Foundation, the owner of the Project Gutenberg™ trademark, and any other party distributing a Project Gutenberg™ electronic work under this agreement, disclaim all liability to you for damages, costs and expenses, including legal fees. YOU AGREE THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE, STRICT LIABILITY, BREACH OF WARRANTY OR BREACH OF CONTRACT EXCEPT THOSE PROVIDED IN PARAGRAPH 1.F.3. YOU AGREE THAT THE FOUNDATION, THE TRADEMARK OWNER, AND ANY DISTRIBUTOR UNDER THIS AGREEMENT WILL NOT BE LIABLE TO YOU FOR ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE OR INCIDENTAL DAMAGES EVEN IF YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH DAMAGE. 1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you discover a defect in this electronic work within 90 days of receiving it, you can receive a refund of the money (if any) you paid for it by sending a written explanation to the person you received the work from. If you received the work on a physical medium, you must return the medium with your written explanation. The person or entity that provided you with the defective work may elect to provide a replacement copy in lieu of a refund. If you received the work electronically, the person or entity providing it to you may choose to give you a second opportunity to receive the work electronically in lieu of a refund. If the second copy is also defective, you may demand a refund in writing without further opportunities to fix the problem. 1.F.4. Except for the limited right of replacement or refund set forth in paragraph 1.F.3, this work is provided to you ‘AS-IS’, WITH NO OTHER WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED,
  • 62. INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PURPOSE. 1.F.5. Some states do not allow disclaimers of certain implied warranties or the exclusion or limitation of certain types of damages. If any disclaimer or limitation set forth in this agreement violates the law of the state applicable to this agreement, the agreement shall be interpreted to make the maximum disclaimer or limitation permitted by the applicable state law. The invalidity or unenforceability of any provision of this agreement shall not void the remaining provisions. 1.F.6. INDEMNITY - You agree to indemnify and hold the Foundation, the trademark owner, any agent or employee of the Foundation, anyone providing copies of Project Gutenberg™ electronic works in accordance with this agreement, and any volunteers associated with the production, promotion and distribution of Project Gutenberg™ electronic works, harmless from all liability, costs and expenses, including legal fees, that arise directly or indirectly from any of the following which you do or cause to occur: (a) distribution of this or any Project Gutenberg™ work, (b) alteration, modification, or additions or deletions to any Project Gutenberg™ work, and (c) any Defect you cause. Section 2. Information about the Mission of Project Gutenberg™ Project Gutenberg™ is synonymous with the free distribution of electronic works in formats readable by the widest variety of computers including obsolete, old, middle-aged and new computers. It exists because of the efforts of hundreds of volunteers and donations from people in all walks of life. Volunteers and financial support to provide volunteers with the assistance they need are critical to reaching Project Gutenberg™’s goals and ensuring that the Project Gutenberg™ collection will
  • 63. remain freely available for generations to come. In 2001, the Project Gutenberg Literary Archive Foundation was created to provide a secure and permanent future for Project Gutenberg™ and future generations. To learn more about the Project Gutenberg Literary Archive Foundation and how your efforts and donations can help, see Sections 3 and 4 and the Foundation information page at www.gutenberg.org. Section 3. Information about the Project Gutenberg Literary Archive Foundation The Project Gutenberg Literary Archive Foundation is a non-profit 501(c)(3) educational corporation organized under the laws of the state of Mississippi and granted tax exempt status by the Internal Revenue Service. The Foundation’s EIN or federal tax identification number is 64-6221541. Contributions to the Project Gutenberg Literary Archive Foundation are tax deductible to the full extent permitted by U.S. federal laws and your state’s laws. The Foundation’s business office is located at 809 North 1500 West, Salt Lake City, UT 84116, (801) 596-1887. Email contact links and up to date contact information can be found at the Foundation’s website and official page at www.gutenberg.org/contact Section 4. Information about Donations to the Project Gutenberg Literary Archive Foundation Project Gutenberg™ depends upon and cannot survive without widespread public support and donations to carry out its mission of increasing the number of public domain and licensed works that can be freely distributed in machine-readable form accessible by the widest array of equipment including outdated equipment. Many
  • 64. small donations ($1 to $5,000) are particularly important to maintaining tax exempt status with the IRS. The Foundation is committed to complying with the laws regulating charities and charitable donations in all 50 states of the United States. Compliance requirements are not uniform and it takes a considerable effort, much paperwork and many fees to meet and keep up with these requirements. We do not solicit donations in locations where we have not received written confirmation of compliance. To SEND DONATIONS or determine the status of compliance for any particular state visit www.gutenberg.org/donate. While we cannot and do not solicit contributions from states where we have not met the solicitation requirements, we know of no prohibition against accepting unsolicited donations from donors in such states who approach us with offers to donate. International donations are gratefully accepted, but we cannot make any statements concerning tax treatment of donations received from outside the United States. U.S. laws alone swamp our small staff. Please check the Project Gutenberg web pages for current donation methods and addresses. Donations are accepted in a number of other ways including checks, online payments and credit card donations. To donate, please visit: www.gutenberg.org/donate. Section 5. General Information About Project Gutenberg™ electronic works Professor Michael S. Hart was the originator of the Project Gutenberg™ concept of a library of electronic works that could be freely shared with anyone. For forty years, he produced and distributed Project Gutenberg™ eBooks with only a loose network of volunteer support.
  • 65. Project Gutenberg™ eBooks are often created from several printed editions, all of which are confirmed as not protected by copyright in the U.S. unless a copyright notice is included. Thus, we do not necessarily keep eBooks in compliance with any particular paper edition. Most people start at our website which has the main PG search facility: www.gutenberg.org. This website includes information about Project Gutenberg™, including how to make donations to the Project Gutenberg Literary Archive Foundation, how to help produce our new eBooks, and how to subscribe to our email newsletter to hear about new eBooks.
  • 66. Welcome to our website – the ideal destination for book lovers and knowledge seekers. With a mission to inspire endlessly, we offer a vast collection of books, ranging from classic literary works to specialized publications, self-development titles, and children's literature. Each book is a new journey of discovery, expanding knowledge and enriching the soul of the reader. Our website is not just a platform for buying books but a bridge connecting readers to the timeless values of culture and wisdom. With an elegant, user-friendly interface and an intelligent search system, we are committed to providing a quick and convenient shopping experience. Our special promotions and home delivery service also help you save time and fully enjoy the joy of reading. Let us accompany you on the journey of exploring knowledge and personal growth! ebookultra.com