A fast clustering based feature subset selection algorithm for high-dimensional data

Jul 11, 2013

The document presents a fast clustering-based feature subset selection algorithm designed for high-dimensional data, focusing on both efficiency and effectiveness of feature selection. The algorithm clusters features using graph-theoretic methods and selects the most representative features from each cluster, ensuring independence among them. Experimental results demonstrate that the fast algorithm yields smaller feature subsets while enhancing the performance of various classifiers across multiple data sets.

A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM FOR
HIGH-DIMENSIONAL DATA
ABSTRACT:
Feature selection involves identifying a subset of the most useful features that produces
results compatible with those of the original, entire set of features. A feature selection algorithm may be
evaluated from both the efficiency and effectiveness points of view. While the efficiency
concerns the time required to find a subset of features, the effectiveness is related to the quality
of the subset of features. Based on these criteria, a fast clustering-based feature selection
algorithm (FAST) is proposed and experimentally evaluated in this paper.
The FAST algorithm works in two steps.
In the first step, features are divided into clusters by using graph-theoretic clustering methods.
In the second step, the most representative feature that is strongly related to target classes is
selected from each cluster to form a subset of features.
Because features in different clusters are relatively independent, the clustering-based strategy of FAST
has a high probability of producing a subset of useful and independent features. To ensure the
efficiency of FAST, we adopt the efficient minimum-spanning tree (MST) clustering method.
The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical
study. Extensive experiments are carried out to compare FAST with several representative feature
selection algorithms. The results, on 35 publicly available real-world high-dimensional image,
microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of
features but also improves the performance of the four types of classifiers.
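
The two steps described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. The abstract does not state how feature relatedness is measured, how MST edges are weighted, or which edges are cut to form clusters, so the sketch makes the following assumptions: relatedness is measured by symmetric uncertainty on discrete features, the MST is built over the distance 1 - SU, and an MST edge is cut when the two features are less related to each other than each is to the target. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(x, y) = 2 * I(x; y) / (H(x) + H(y)); assumed relatedness measure."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def minimum_spanning_tree(nodes, weight):
    """Prim's algorithm on the complete graph; returns a list of (u, v) edges."""
    in_tree, edges = [nodes[0]], []
    while len(in_tree) < len(nodes):
        u, v = min(
            ((a, b) for a in in_tree for b in nodes if b not in in_tree),
            key=lambda edge: weight(*edge),
        )
        in_tree.append(v)
        edges.append((u, v))
    return edges


def select_features(columns, target):
    """columns: dict of feature name -> list of discrete values."""
    names = list(columns)
    if not names:
        return []
    relevance = {f: symmetric_uncertainty(columns[f], target) for f in names}

    def pair_su(a, b):
        return symmetric_uncertainty(columns[a], columns[b])

    # Step 1: cluster the features. Build an MST over the distance 1 - SU,
    # then cut every edge whose endpoints are less related to each other
    # than each is to the target (assumed cutting rule).
    tree = minimum_spanning_tree(names, lambda a, b: 1.0 - pair_su(a, b))
    kept = [
        (u, v) for u, v in tree
        if not (pair_su(u, v) < relevance[u] and pair_su(u, v) < relevance[v])
    ]

    # The connected components of the remaining forest are the clusters.
    parent = {f: f for f in names}

    def find(f):
        while parent[f] != f:
            f = parent[f]
        return f

    for u, v in kept:
        parent[find(u)] = find(v)
    clusters = {}
    for f in names:
        clusters.setdefault(find(f), []).append(f)

    # Step 2: keep the feature most related to the target in each cluster.
    return sorted(max(c, key=relevance.get) for c in clusters.values())


# Toy data: "a_copy" duplicates "a"; "a" and "b" jointly determine the class.
columns = {
    "a":      [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "b":      [0, 0, 1, 1, 0, 0, 1, 1],
}
target = [0, 0, 1, 1, 2, 2, 3, 3]
print(select_features(columns, target))  # ['a', 'b']
```

In this toy run the duplicated feature ends up in the same cluster as its original, so only one of the two survives, while the independent feature "b" forms its own cluster and is kept.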
ECWAY TECHNOLOGIES
IEEE PROJECTS & SOFTWARE DEVELOPMENTS
