What Is Stacking In Machine Learning

Overview

Stacking, also known as Super Learning, is an ensemble method that combines a variety of predictive modeling algorithms into a single, stronger model. It can draw on everything discussed so far on this website, including cross-validation, bootstrapping, feature selection, regularization, and different hyper-parameter settings.

STACKING 

Stacking, also known as Super Learning, is an ensemble method. It can make use of:

  • Many modeling and regression algorithms 
  • Bagging and Boosting 
  • Cross-validation and bootstrapping (to train the algorithms on different training datasets and control overfitting) 
  • Feature Selection Techniques 
  • Regularization Techniques 
  • Various hyper-parameter settings 

to develop a single model that represents the culmination of everything discussed on this website.

Stacking combines a variety of predictive modeling algorithms into one model.

Level-1 Stacking 

There are various types of stacking, such as blending. In simple terms, stacking is the process of training several base learners on the training data. Unlike the base learners in Bagging or Boosting, these learners may use different modeling algorithms (Bagging and Boosting could employ different algorithms simultaneously, but this is usually not done). The outputs of these base learners are then fed into a single algorithm that learns how to combine them; in other words, a meta-learner is trained on the results of the base learners. By learning when the base learners were right and when they were wrong, the meta-learner can produce its own predictions. There are two kinds of meta-learners: meta-classifiers and meta-regressors. Logistic regression is generally used as a meta-classifier, while linear regression is typically used as a meta-regressor.

For further explanation, let’s consider an example of one-level stacking for a regression problem. We have a dataset of 1,000 records, which we split into a training set (60%) and a test set (40%). At level 0 (known as the model library), we build four models: Random Forest, Linear Regression, Support Vector Machines, and Gradient Boosting (with one-level decision trees as its base models). Each model produces forecasts, but all of them contain some error. At level 1, we introduce a meta-learner (say, Linear Regression) to optimize the combination of the base-model predictions into new predictions. This is done by estimating a weight for every base model while minimizing the squared error. The purpose of the meta-learner is to determine the best way to integrate the results of the base learners. The model produced by the meta-learner is then applied to the test data to generate the final predictions, and its accuracy can be evaluated.
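As a rough sketch of this one-level setup, the snippet below uses scikit-learn’s StackingRegressor. The synthetic dataset (make_regression), the exact 60/40 split, and the specific hyper-parameter values are illustrative assumptions rather than anything prescribed above.

# Minimal sketch of one-level stacking: four level-0 base models whose
# cross-validated predictions are combined by a linear-regression meta-learner.
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the 1,000-record dataset, split 60% train / 40% test.
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Level 0: the model library.
base_models = [
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("lr", LinearRegression()),
    ("svr", SVR()),
    ("gbt", GradientBoostingRegressor(max_depth=1, random_state=0)),  # one-level trees
]

# Level 1: linear regression learns how to weight the base-model predictions.
stack = StackingRegressor(estimators=base_models, final_estimator=LinearRegression(), cv=10)
stack.fit(X_train, y_train)
print("R^2 on the held-out test set:", stack.score(X_test, y_test))

After fitting, stack.final_estimator_.coef_ exposes the weight assigned to each base model, which corresponds to the least-squares weighting described above.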

Methods to Create More Complex Stacking Models 

The model we have discussed is a very basic form of stacking, but it can be made far more complex by adding and combining models that differ in their algorithms, hyper-parameters, feature sets, and so on. Let’s look at the main ways to do this.

Adding Levels 

Suppose we design a two-level ensemble model for a classification task. We first build a model library using 8 modeling algorithms: Logistic Regression, Decision Tree, Random Forest, Bagging with Decision Trees, Naive Bayes, Gradient Boosted Trees, Artificial Neural Networks (ANN), and K-Nearest Neighbours (KNN). We then move to stage 1, where the predictions from these eight models are used to build two new models; let’s choose Gradient Boosted Trees and Logistic Regression. In stage 2 of the ensemble, the stage-1 predictions from both models are used as inputs to build yet another model. For instance, a logistic regression model at this stage produces the ensemble’s final predictions.
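One way to sketch this two-level arrangement is to nest scikit-learn StackingClassifier objects, as below. The synthetic dataset and several concrete choices (MLPClassifier standing in for the ANN, the cv values, the estimator settings) are assumptions made purely for illustration.

# Sketch of a two-level classification ensemble: eight level-0 models, two
# stage-1 models, and a logistic-regression meta-learner at stage 2.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Level 0: the model library of eight algorithms.
level0 = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier()),
    ("rf", RandomForestClassifier(n_estimators=100)),
    ("bag", BaggingClassifier(n_estimators=50)),  # bagging with decision trees (the default base estimator)
    ("nb", GaussianNB()),
    ("gbt", GradientBoostingClassifier()),
    ("ann", MLPClassifier(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
]

# Stage 1 feeds into stage 2: two models trained on the level-0 predictions,
# combined by a final logistic-regression meta-learner.
stage1_and_2 = StackingClassifier(
    estimators=[("gbt", GradientBoostingClassifier()),
                ("logreg", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Nesting: level-0 predictions feed stage 1, whose outputs feed the final meta-learner.
two_level = StackingClassifier(estimators=level0, final_estimator=stage1_and_2, cv=5)
two_level.fit(X_train, y_train)
print("Test accuracy:", two_level.score(X_test, y_test))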

Adding Hyper-parameters 

Numerous hyper-parameters can be adjusted in the various models: the depth of the trees in decision trees, the value of k in k-nearest neighbours, the number of bags created during bagging, the number of trees built in a random forest, the number of base models constructed during boosting, and so on. That is just the tip of the iceberg; each algorithm typically has at least 6-12 tunable parameters. Suppose we vary the hyper-parameters so that, starting from the same 8 algorithms, we end up with 32 distinct models forming the model library at level 0. We can then build eight models at stage 1 by tweaking their hyper-parameters, use two models at stage 2, and add one more level to produce the final model.
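The loop below is a rough sketch of how hyper-parameter variation multiplies the model library; the particular grids (tree depths, values of k, numbers of trees and boosting stages) are hypothetical examples, not recommended settings.

# Varying hyper-parameters turns a handful of algorithms into a large level-0 library.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

model_library = []
for depth in (2, 4, 8, None):                      # depth of the decision trees
    model_library.append((f"tree_d{depth}", DecisionTreeClassifier(max_depth=depth)))
for k in (3, 5, 11, 21):                           # k in k-nearest neighbours
    model_library.append((f"knn_k{k}", KNeighborsClassifier(n_neighbors=k)))
for n in (100, 300, 500, 1000):                    # number of trees in the random forest
    model_library.append((f"rf_n{n}", RandomForestClassifier(n_estimators=n)))
for n in (50, 100, 200, 400):                      # number of boosting stages
    model_library.append((f"gbt_n{n}", GradientBoostingClassifier(n_estimators=n)))

print(len(model_library), "level-0 models in the library")  # 16 here; add algorithms and settings to reach 32 or more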

Feature Selection 

Each model is built on a set of features, and the models can be further diversified by applying different feature extraction and selection methods. By combining different hyper-parameters and feature sets, we can multiply the 32 models into 64 models for the model library. We might then employ 15 models in the stage-1 ensemble, 2 models in the stage-2 ensemble, and one additional model as the final meta-learner at stage 3 to generate the predictions.
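The snippet below sketches one way to diversify the library by feature selection: the same algorithm wrapped in pipelines that each keep a different number of features counts as a distinct level-0 model. The feature counts and the choice of SelectKBest are illustrative assumptions.

# Feature-selection variants of a single algorithm, to be added to the model library.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

feature_variants = []
for k in (5, 10, 20):  # number of features retained by each variant
    model = make_pipeline(SelectKBest(score_func=f_classif, k=k),
                          LogisticRegression(max_iter=1000))
    feature_variants.append((f"logreg_top{k}", model))
# Combine these with the hyper-parameter variants above to grow the library further.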

Training Models on Different Training Datasets 

To make the process more robust and less error-prone, we must ensure that the models are not vulnerable to overfitting. To do this, we can use cross-validation to train each of these models. If we employ k-fold cross-validation at level 0 with k = 10, we obtain predictions from each fold of the cross-validated algorithms. With 64 algorithms, 10 folds, and 10,000 data points to predict, we will have 10 × 10,000 × 64 predictions at level 0. Similar to the out-of-bag error, we compute the sum of squared errors between the out-of-fold predictions and the actual values, which gives us the cross-validation error of each model. This method is employed at the various stages of stacking.
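A minimal sketch of this idea with scikit-learn, using a synthetic dataset and a single illustrative model: every training point is predicted only by the fold model that did not see it, and the error between those out-of-fold predictions and the actual values is the cross-validation error.

# Out-of-fold predictions from 10-fold cross-validation and the resulting CV error.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

model = GradientBoostingRegressor()
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Each point's prediction comes from the model trained without that point,
# so these predictions are safe to pass to the meta-learner without leakage.
oof_preds = cross_val_predict(model, X, y, cv=cv)
print("Cross-validation (out-of-fold) MSE:", mean_squared_error(y, oof_preds))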

Alternatively, we can use bootstrap sampling and use the out-of-bag predictions to measure the difference between predicted and actual values. These steps are crucial because they reduce the risk of overfitting the ensemble model and help prevent data leakage. For instance, if we use all of the training data to train 4 models (e.g., Linear Regression, Support Vector Machines, Gradient Boosting, and Random Forest) and then feed the predictions from those four models back to the meta-learner (e.g., Linear Regression), the model produced by the meta-learner may overfit because the target variable has been used twice.
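For the bootstrap alternative, a small sketch (again with synthetic data and an illustrative model) is to train with bagging and read the out-of-bag predictions, which are produced only by the bootstrap models that did not see each sample:

# Out-of-bag predictions from bootstrap sampling, usable as leakage-free meta-learner inputs.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

bagger = BaggingRegressor(n_estimators=100, oob_score=True, random_state=0)
bagger.fit(X, y)

oob_preds = bagger.oob_prediction_  # prediction for each point from models that never saw it
print("Out-of-bag MSE:", mean_squared_error(y, oob_preds))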

Regularization 

In addition to cross-validation, we can build models using regularization techniques such as L1 and L2 penalties. These also help keep the model from overfitting.
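As a brief sketch, the meta-learner itself can be regularized: swapping a plain linear-regression meta-learner for Ridge (L2) or Lasso (L1) penalizes large combination weights. The base models and penalty strengths below are illustrative assumptions.

# Regularized meta-learners for a stacked regressor.
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Lasso, Ridge
from sklearn.svm import SVR

base_models = [("rf", RandomForestRegressor(n_estimators=200)),
               ("svr", SVR()),
               ("gbt", GradientBoostingRegressor())]

# L2-regularized meta-learner shrinks all base-model weights toward zero.
l2_stack = StackingRegressor(estimators=base_models, final_estimator=Ridge(alpha=1.0), cv=10)
# L1-regularized meta-learner can drop weak base models entirely.
l1_stack = StackingRegressor(estimators=base_models, final_estimator=Lasso(alpha=0.01), cv=10)
# Either stack is fitted exactly like the earlier examples.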

Weights 

Different models can be given different weights. It is possible to give them all equal weights, but one model may be fundamentally better than another. The models’ weights can be learned using neural networks, forward selection, selection with replacement, and so on.
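One simple way to sketch weight estimation is non-negative least squares on the out-of-fold predictions, rather than giving every model equal weight; the arrays below are placeholders standing in for real out-of-fold predictions and targets.

# Learning non-negative, normalized weights for four base models.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
oof_preds = rng.random((1000, 4))   # placeholder: out-of-fold predictions, one column per base model
y = rng.random(1000)                # placeholder: actual target values

weights, _ = nnls(oof_preds, y)     # non-negative least-squares weights
weights /= weights.sum()            # normalize so the weights sum to 1
print("Learned model weights:", weights)
final_prediction = oof_preds @ weights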

We can make stacking ensembles as complex as we wish and craft our own stacked ensemble that gives the best result. Stacked techniques are often employed in data science competitions to make highly accurate predictions. For example, first place in the Otto Group Product Classification Challenge was taken by a stacked ensemble of more than 30 models whose outputs were then used as features for three meta-classifiers: XGBoost, a Neural Network, and AdaBoost (engineered features and cross-validation were also employed). The model that won the 2015 KDD Cup was based on a three-stage stacked ensemble with 64 models in the model library, built from variations of Neural Networks, Factorization Machines, K-Nearest Neighbours, Logistic Regression, Random Forest, Gradient Boosting, and other machine learning algorithms (different feature sets were used to make the models more diverse). In short, stacking can produce extremely accurate predictions by combining diverse models built with different architectures, hyper-parameter settings, and training methods.
