Introduction to Feature Selection

15 Jun 2023

Ever since I started making summaries as a child, I have wondered how to select the most important information from a textbook. When I first began, I used to underline almost every part of the text.

As I grew older, I noticed that this was not a good strategy. Even though I did not exactly struggle on exams, I was not making good use of my study time. As a result, I decided to underline less and start writing summaries. Nevertheless, I did not always manage to underline the information that ended up on the exam. Sometimes there was a small detail on a page that I considered unimportant (the teacher, as it turned out, did not). As time passed, I learned that selecting the most important information from a book is an art, and not as easy as it may seem.

In Data Science, we do something very similar when we try to identify the most important features of a dataset. The data may have many features, just as our textbook had many details, but that by no means implies that we need all of them to predict, i.e., to pass the exam in our book example.

There are many different ways to perform Feature Selection.

In this essay, we will cover two of the most important ways to classify them: by the objective the selection pursues and by its relation to the model's learning process.

Feature Selection algorithms classified by their objectives

When we classify Feature Selection algorithms by the objective they pursue, we differentiate between the following strategies:

  1. Idealized: The main objective of this strategy is to find the smallest subset of features that is able to describe the target output. To make it clearer, it would be as if we wanted to write the shortest summary possible that still allows us to achieve the best results.
  2. Target Feature Count: The main aim of this strategy is to find the subset of features for which a certain metric is optimal over all other subsets. In our previous example, it would be as if we wanted to write the summary that yields the best result, no matter its length. Metrics commonly used in Data Science are the Akaike Information Criterion, Conditional Entropy or the Mean Squared Error, all of which one aims to minimize (a brute-force sketch of this idea appears right after this list).
  3. Approximate Original Dataset Prediction: The main purpose of this strategy is to select a feature subset that adds no information beyond what the full dataset already provides, i.e., predictions made with the subset should approximate those made with the original dataset. Coming back to our book example, it would be as if we wanted to write a summary that was as faithful as possible to the book while removing unnecessary information.
  4. Ranked algorithms: The main objective of these selectors is to rank all features and define an ad-hoc cutoff. This cutoff may be described as a desired number of features (static) or as a statistical likelihood of relevance (automatic). It would be as if we wanted to write a summary containing only the chapters most likely to appear on the exam, while removing the less important ones.
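To make the first two strategies concrete, here is a minimal sketch, assuming scikit-learn is available: it brute-forces every feature subset of a small dataset, scores each one with cross-validated Mean Squared Error, and keeps the best, preferring the smallest subset when scores tie. The dataset and the linear estimator are arbitrary illustration choices, not a prescribed setup.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
n_features = X.shape[1]

best_subset, best_mse = None, np.inf
# Iterate over subsets from smallest to largest, so a strict improvement
# requirement keeps the smaller subset whenever two scores tie.
for size in range(1, n_features + 1):
    for subset in combinations(range(n_features), size):
        mse = -cross_val_score(
            LinearRegression(), X[:, list(subset)], y,
            scoring="neg_mean_squared_error", cv=5,
        ).mean()
        if mse < best_mse:
            best_subset, best_mse = subset, mse

print(best_subset, round(best_mse, 2))
```

Exhaustive search is only feasible for a handful of features; the families described below exist precisely because this approach does not scale.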

Relation of Feature Selection to the predictive model

On the other hand, we may want to know how these selectors relate to our classifiers/regressors. To do that, we differentiate between the following approaches.

Filter algorithms

The main idea behind this type of algorithm is to determine the relevance of feature subsets by examining the inherent statistics and characteristics of the data under consideration. Its main advantage is computation time, since it is independent of whatever predictive model we aim to use. Nevertheless, because the predictor's results are never taken into account, we may end up with a subset that is suboptimal with respect to other strategies.

This approach is the most similar to the one we humans use in practice. We examine our textbook and try to determine which parts are the most important, e.g., by taking into account how many times the teacher mentioned a specific part in class. If a part was flagged as important in class, we give it a higher score than the rest. We still do not know whether it will actually appear on the exam; that is why we rely only on the inherent properties of our notes.
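As a minimal sketch of this idea, assuming scikit-learn is available, the snippet below scores each feature purely from the data (its mutual information with the target) and keeps the ten highest-scoring ones, without ever training a predictive model. The dataset and the value of k are arbitrary illustration choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Score every feature from the data alone and keep the top 10; no
# classifier or regressor is involved at any point.
selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)

print("scores      :", selector.scores_.round(3))
print("kept columns:", selector.get_support(indices=True))
```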

There are many different ways to build filter algorithms. Among the most widely used are:

  1. MRMR: An approach based on maximizing the relevance of each feature while minimizing its redundancy with the features already selected. It quantifies both using Mutual Information (between each feature and the target for relevance, and among features for redundancy), the F-statistic, or both. There are many variants, such as MID (Mutual Information Difference) or FCQ (F-test Correlation Quotient); a hand-rolled MID-style sketch follows this list.
  2. RelieF: An approach based on computing distances between instances (typically the Manhattan distance, though any distance may be used). For each instance it identifies the nearest neighbour of the same class and the nearest neighbour of a different class, and uses them to update each feature's weight towards or away from the optimal subset. There are many variants, such as Relieved-F or MultiSURF. The latter incorporates a threshold, computed as the mean pairwise distance between the target instance and the rest of the data, and excludes the instances that fall near that threshold (the dead-band zone).
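Below is a minimal sketch of the MID flavour of MRMR, assuming scikit-learn's mutual information estimators; the greedy loop, the dataset and k=5 are illustration assumptions, not a reference implementation. For the RelieF family, ready-made implementations exist, for example in the scikit-rebate package.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_mid(X, y, k):
    """Greedy MID-style MRMR: pick the feature maximizing relevance minus
    mean redundancy with the already selected features."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]          # start with the most relevant
    remaining = [j for j in range(X.shape[1]) if j != selected[0]]
    while len(selected) < k:
        scores = []
        for j in remaining:
            # Redundancy: average MI between candidate j and selected features.
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                for s in selected
            ])
            scores.append(relevance[j] - redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = load_breast_cancer(return_X_y=True)
print(mrmr_mid(X, y, k=5))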

Wrapper algorithms

The main idea behind this type of algorithm is to train a predictive model on candidate feature subsets. To assess predictive performance, each subset is evaluated on a hold-out set (or via cross-validation) and compared against the alternatives, including the full feature set. The closest analogue, following our example, would be to write a summary based on a collection of exams from previous years. Its main advantage is that it searches for the best feature subset according to a given figure of merit; nonetheless, it is computationally expensive.

The main strategies used to create wrappers are:

  1. Forward-selection: Its main aim is to add features, one at a time and in a greedy fashion, to an initially empty subset. It is mainly used when the initial dataset contains many potentially explanatory variables; nevertheless, it tends to overestimate the amount of variance explained by each feature inside the subset.
  2. Backward-selection: Its main objective is to remove features from the full feature set until an optimal subset is reached. It is primarily used when dealing with collinear features; nevertheless, it tends to underperform on datasets with a high ratio of features to rows. Both directions are sketched right after this list.
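As a minimal sketch, assuming scikit-learn (0.24 or later) is available, both strategies can be run with SequentialFeatureSelector; the estimator, the dataset and the number of features to keep are arbitrary illustration choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Forward selection: start from the empty subset and greedily add features.
forward = SequentialFeatureSelector(
    model, n_features_to_select=5, direction="forward", cv=5
).fit(X, y)

# Backward selection: start from the full set and greedily remove features.
backward = SequentialFeatureSelector(
    model, n_features_to_select=5, direction="backward", cv=5
).fit(X, y)

print("forward :", forward.get_support(indices=True))
print("backward:", backward.get_support(indices=True))
```

Note how much slower the backward run is when the feature set is large: it must repeatedly refit the model on nearly the full set before it can discard anything, which is the computational cost mentioned above.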

Embedded algorithms

The main idea behind embedded algorithms is to select features while the modeling algorithm itself is being trained. It would be as if we were sitting the exam and writing the summary at the same time. Since the modeling algorithm and the feature selector work together, they tend to be less computationally expensive than wrapper approaches.
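A classic example of this family is L1 (Lasso) regularization, which drives the coefficients of unhelpful features to exactly zero during training. Below is a minimal sketch, assuming scikit-learn; the dataset is an arbitrary illustration choice.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Fitting the model *is* the selection step: L1 regularization zeroes out
# the coefficients of features that do not help the prediction.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)

coefs = lasso.named_steps["lassocv"].coef_
print("kept features   :", np.flatnonzero(coefs))
print("dropped features:", np.flatnonzero(coefs == 0))
```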

Final remarks

We have covered the main subtleties of feature selection, why it is important and the main approaches that data scientists use to deal with their daily problems. In the future, we will cover more advanced topics and explain different ways to implement the aforementioned strategies on toy datasets, step by step. Never forget that, despite what many people may think, there is no absolute answer to almost any question, even if that question has been formulated properly. The most important task that we, as humans, are in charge of is to find the best question and, afterwards, aspire to provide the best answer. We will never know whether we have fallen into a local optimum (mathematical pun intended), but the only way to reach the global answer is by trying. Let's keep trying to push towards finding them!