Ever since I started making summaries as a child, I have wondered how to select the most important information in a textbook. When I first began, I used to underline almost every feature of the text.
As I grew older, I noticed that this was not a good strategy. Even though I did not exactly struggle on exams, I was not making proper use of my study time. As a result, I decided to underline less and start writing summaries. Nevertheless, I did not always manage to underline the information that ended up on the exam. Sometimes there was a small detail on a page that I considered unimportant (the teacher, as it turned out, did not). As time passed, I learned that selecting the most important information from a book was an art, and not as easy as it may seem.
In Data Science, we do something very similar when we try to select the most important features of a dataset. The data may have many features, just as our textbook did, but that by no means implies that we need to use all of them to predict, i.e., to pass the exam in our textbook example.
In this essay, we will cover two of the most important ways to categorize feature selection methods: by the objective they pursue, and by how they relate to the model's learning process.
When we classify feature selection methods by the objective we pursue, we differentiate between the following strategies:
On the other hand, we may want to know how these selectors relate to our classifiers and regressors. To do that, we differentiate between the following approaches.
The main idea behind this type of algorithm is to determine the relevance of feature subsets by examining the inherent statistics and characteristics of the data at hand. Its main advantage is computational cost, since it is independent of whatever predictive model we aim to use. Nevertheless, because it does not take the predictor's results into account, we may obtain a suboptimal subset compared to other strategies.
This approach is the most similar to the one we humans use in real life. We examine our textbook and try to determine which parts are the most important, e.g., taking into account how many times the teacher mentioned a specific part in class. If a part was flagged as important in class, we may give it a higher score than the rest. We still do not know whether it will end up on the exam, though; that is why we are only taking the inherent properties of our notes into account.
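To make this concrete, here is a minimal sketch of a filter selector. I am assuming scikit-learn and its breast cancer toy dataset, and the choice of keeping five features is purely illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Load a standard toy dataset (purely illustrative).
X, y = load_breast_cancer(return_X_y=True)

# Score each feature with a statistic computed from the data alone
# (mutual information with the target), independently of any model.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("Original features:", X.shape[1])
print("Selected features:", X_selected.shape[1])
```

Note that no predictor is trained anywhere in this snippet; the scoring relies only on the data itself, which is exactly what makes filters cheap.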
There are many different ways to create filter algorithms. Among the most widely used are:
The main idea behind this type of algorithm is to train a predictive model on candidate feature subsets. To assess predictive performance, each subset is evaluated against the full feature set or on a hold-out set. The closest analogy, following our example, would be writing a summary based on a collection of exams from previous years. Its main advantage is that it lets us search for the best feature subset according to a given figure of merit; nonetheless, it is computationally expensive.
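As a small sketch of the wrapper idea, here is recursive feature elimination (one of several possible wrapper strategies) from scikit-learn; the dataset, the logistic regression model, and the target of five features are assumptions made only for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # scaling helps the model converge

# RFE repeatedly fits the model, ranks features by its coefficients,
# and drops the weakest ones until the requested number remains.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=5)
rfe.fit(X_scaled, y)

print("Selected feature mask:", rfe.support_)
print("Feature ranking (1 = selected):", rfe.ranking_)
```

Here the model is retrained many times, once per elimination step, which is where the extra computational cost of wrappers comes from.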
The main strategies used to create wrappers are:
The main idea behind embedded algorithms is to select features while the modeling algorithm itself is running. It would be as if we were taking the exam and writing the summary at the same time. Since the modeling algorithm and the feature selector work simultaneously, embedded methods tend to be less computationally expensive than wrapper approaches.
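As a sketch of the embedded idea, here is an L1-penalized logistic regression wrapped in scikit-learn's SelectFromModel; the dataset and the regularization strength C are illustrative assumptions, and other embedded techniques (e.g., tree-based importances) would work just as well:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty drives some coefficients exactly to zero while the model
# trains, so the selection happens as part of the learning process itself.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model)
X_selected = selector.fit_transform(X_scaled, y)

print("Features kept:", X_selected.shape[1], "out of", X.shape[1])
```

A single training run both fits the model and decides which features survive, which is why embedded methods sit between filters and wrappers in cost.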
We have covered the main subtleties of feature selection: why it is important and the main approaches data scientists use to deal with their daily problems. In the future, we will cover more advanced topics and explain different ways to implement the aforementioned strategies on toy datasets, step by step. Never forget that, despite what many people may think, there is no absolute answer to almost any question, even if that question has been properly formulated. The most important task that we, as humans, are in charge of is to find the best question and, afterwards, aspire to provide the best answer. We will never know whether we have fallen into a local optimum (mathematical pun intended), but the only way to reach the global answer is by trying. Let's keep pushing towards finding it!