Overview
There are many kinds of statistics described in the inferential statistics section. These tests use statistical methods using filter methods to discover the relationship between an independent feature and the dependent variable. Based on the results, it is determined if the feature is to be kept or eliminated.
Process of Filter Methods
To ensure that you are not confused, it is important to keep in mind that this process of selecting features is completely free of machine learning algorithms, and the individual statistics scores of features are used to cut down on the feature’s number.
Filter Methods is a specific stage of pre-processing that is used before modeling is finished (as another method of selecting features like Embedded and Lasso are not data preparation since features are reduced in the modeling process, but they fall under the data preparation section since feature reduction is a part of the data preparation and in the end, assist in reducing the dimensions of the data).
Filter Methods can be useful because they are easy to comprehend and provide insight into the data. Still, they cannot be considered to be the most effective method to improve the quality of features to achieve more generalization (prediction and classification). The process of dimension reduction using the Filter Method is simple; we select our features and perform various filtering methods for every feature for the type of feature (character/numerical) and the relation with the dependent variable (linear/non-linear). After performing the filtering methods, the significant features are selected to form a subset of features; used in the model. After assessing the model’s performance, tweaking can be done to get the best performance by selecting the best subset of features.
Types of Filter Methods
Pearson’s Correlation
Pearson’s Correlation was explained on the website of the Correlation Coefficients. Person’s Correlation is a technique to diminish the numerical properties as it can assess the correlation among the data set’s independent numerical variables and those of dependent variables. That is numerical (Y-Variable or Outcome Variable ). If the coefficient of correlation is sufficiently high, this feature will be chosen (close to the -1 mark or), and If it is not high (close to zero), the feature could be eliminated. We can determine whether we should retain a specific characteristic or drop it by determining the linear dependence between variables X and Y. However, Pearson’s correlation has one major disadvantage: It is not able to eliminate features that have a non-linear relationship to the dependent variable. If your dependent variable happens to be located at zero, then the correlation will turn out to be zero, giving false results; Pearson’s correlation won’t aid in cutting down on the number of features present in these scenarios. In these scenarios, a different type of correlation like Mutual Information and Maximal Information Coefficient (MIC) is recommended to be considered, which employs a more advanced filtering method; however, one must keep in mind that, if the relationship is linear, Pearson’s Correlation is superior over other correlations because it is speedy (which is crucial in the case of large datasets). The range of the correlation coefficient is between 1 and -1 (which other correlations do not include), and the negative range gives us additional information regarding whether the relationship is inverted or not.
Information on mutuality and Maximal Information Coefficient (MIC)
As we have explained, there are limits to Person’s correlation. This issue can be solved by employing a more sophisticated type of correlation, which is Mutual Information Correlation which rigorously measures (in an indefinite quantity that is referred to in the form of “bits”) the amount of information the value of one variable reveals about the value of the other. The result of such a ratio is a range of the 0th to Nth value, and zero (unlike Pearson’s Correlation that has non-linear relationships) in reality means that no data is shared between the two variables, which means that two variables are indistinguishable.
The range of results causes a problem because the range isn’t fixed. The results are not adjusted to be normal (unlike Pearson’s Pearson’s Correlation, where the coefficient’s range is fixed between -1 and 1. The values are normalized so that coefficients from different variables are measured), which makes it challenging to compare Mutual Information values. This is why a different type of correlation could be utilized that is known as the Maximal Information Coefficient; where the mutual information score is transformed into a metric that could be found within a range of 1 and 0, but, as previously discussed, it also gives more than half the information because negative relationships are not described however it is still a viable option for variables with the same non-linear relationship.
ANOVA
ANOVA was covered in the section ”F Tests,” where the usage of ANOVA to determine the relationship between variables is examined. ANOVA is a method to determine whether the numerical dependent variable and the categorical independent variables have any connection or not when the independent variable has greater than two categories (levels). Then it performs a statistical test to determine whether the mean of different groups is equivalent or not. However, it could be a problem in that the ANOVA could indicate variables with no relation; however, it is possible that one group (level) was associated with the dependent variable. However, because of the absence of any relationship of the other groups to this dependent variable, results from the ANOVA led us to eliminate the feature, leading to data loss. That is why, under these conditions, categorical variables can be encoded and then ‘decomposed in numerical terms so that the variables can be used to reduce characteristics of methods that require features to be numerical.
Chi-Square
The statistical tool has been covered within Chi-Square, and it is advised to read it before you do. Chi-square is a tool that can be utilized when the dependent, as well as the independent variable, are categorical. One typical use is when it comes to classification issues when the dependent variable includes categorical categories. If the dependent variables are categorical, then a chi-square test may be performed, and elements that don’t have a relationship to outgoing variables (Y Variableor Dependent variable) can be eliminated. If there are many strategic variables with multiple categories, it’s best to encode categorical variables and employ more advanced methods for feature reduction, requiring numerical attributes. Additionally, running chi-squares for several categorical variables could be lengthy and time-consuming. Other advanced, faster methods can be used to reduce dimensions.
LDA (Linear differential analysis)
LDA can decrease the number of features since it keeps the interclass separation present in the feature vector by identifying the linear blend of attributes that defines or distinguishes the two categories (or different levels) of categorical variables. PCA (a method for removing features) is a method to reduce the number of variables; however, it will give us the list of features after computing their eigenvalue, which will tell us the features that have the greatest impact on our data. But, if we already have prior data or have a sense that the data points in the data set belong to different classes and would like to keep these classes in mind while reducing the number of characteristics, LDA can be used because it can reveal the characteristics or combinations of features that affect the separation of classes.
Using Models
A machine learning model could be constructed to identify the most important variables. For instance, we could develop a linear model that uses a standard regression coefficient to make predictions (similar to Pearson’s correlation) and utilize all of our features. From the results, we will have details about the effects of the features on the dependent variable and can pick the most important characteristics. Similar techniques are possible for a classification problem by applying logistic regression. If the variables are not linear, tree models based on tree structure could be employed. But, this approach is not suggested since it could easily result in overfitting problems, for such as the five most significant variables are linked with the dependent variables; however, they contain the same information leading to the model being overfitted.
Filter Methods are among the most straightforward methods to reduce the number of features present in the data. The major disadvantage to these techniques is that these are the univariate process that makes the process of reduction of features extremely slow. Furthermore, although they aid in providing the necessary variables for our model, by checking whether they have a relationship to the dependent variable, they are not able to eliminate redundant data (i.e., they cannot determine if the important variables contain identical information, or do not, i.e., check for multicollinearity). Therefore, we must select the appropriate features to reduce multicollinearity, thereby providing an accurate model. Greatest precision without creating overfit. To pick the most convenient feature in a subset of highly correlated features, we employ other methods of selecting features, like Wrapper strategies (where Stepwise Regression and other methods are employed) and Embedded strategies (where Linear Models and Regularization techniques are employed).