Overview
Feature scaling is one of the most important steps of feature engineering and of data pre-processing in general. To see why, consider a data set with two variables: income and average call duration. Income is expressed in dollars while call duration is in minutes, so the two features live on very different numeric ranges. To compare or combine them meaningfully, they need to be brought onto comparable scales, and that is exactly what feature scaling does. Whenever some attributes have a much larger magnitude than others because they are measured in different units, feature scaling transforms all such attributes into comparable, equivalent ranges.
Use of Feature Scaling
Before examining the different ways features can be scaled, it is important to understand why feature scaling matters and what happens when features are not scaled. Many machine learning algorithms require feature scaling because it prevents the model from giving disproportionately high weight to attributes simply because they take larger numeric values than other attributes.
Classification models such as K-Nearest Neighbors (KNN) need scaled features because they use Euclidean distance to measure how far apart two samples are. When one feature is expressed in units that produce much larger numeric values than the others, it dominates the distance calculation and can lead to inaccurate results. In other words, the measured space changes: the Euclidean distance between two samples is different before and after the transformation.
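As a quick illustration, the sketch below (assuming NumPy and scikit-learn are available, with made-up income and call-duration values) compares the Euclidean distance between two samples before and after standardization; on the raw features the distance is driven almost entirely by income.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two illustrative customers: [income in dollars, call duration in minutes]
X = np.array([[52000.0, 6.0],
              [48000.0, 28.0]])

# Raw Euclidean distance is dominated almost entirely by the income column
raw_dist = np.linalg.norm(X[0] - X[1])

# After standardization both features contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(f"distance on raw features:    {raw_dist:.2f}")    # ~4000, income dominates
print(f"distance on scaled features: {scaled_dist:.2f}")  # both features matter
```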
Gradient descent, an optimization algorithm commonly used to train Logistic Regression, SVMs, Neural Networks, and similar models, is another example: when features are on different scales, some weights are updated much faster than others. Feature scaling helps gradient descent converge more quickly because it puts all variables on an equal footing. In linear regression, for instance (y = mx + c), normalizing the feature that multiplies m keeps the gradient with respect to m on a reasonable scale and speeds up convergence.
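The following is a minimal sketch, using synthetic income-like data and a hand-written gradient descent loop, of how the same learning rate that diverges on a raw large-scale feature converges once the feature is standardized; the data, learning rate, and step count are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: x is income-like (large scale), y depends linearly on x plus noise
x = rng.uniform(20_000, 80_000, size=200)
y = 0.0003 * x + 5 + rng.normal(0, 1, size=200)

def gradient_descent(x, y, lr, steps=500):
    """Plain batch gradient descent for the model y ~ m*x + c."""
    m, c = 0.0, 0.0
    for _ in range(steps):
        err = (m * x + c) - y
        m -= lr * (err * x).mean()
        c -= lr * err.mean()
    return m, c

# On the raw feature this learning rate blows up, because the gradient for m scales with x**2
m_raw, c_raw = gradient_descent(x, y, lr=0.1)

# On the standardized feature the very same learning rate converges quickly
x_std = (x - x.mean()) / x.std()
m_std, c_std = gradient_descent(x_std, y, lr=0.1)

print("raw feature:    m =", m_raw, "c =", c_raw)  # typically overflows to inf/nan (diverged)
print("scaled feature: m =", m_std, "c =", c_std)  # converges (coefficients in standardized units)
```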
Feature extraction techniques such as Principal Component Analysis (PCA), Kernel PCA, and Linear Discriminant Analysis (LDA) find directions that maximize variance. By scaling the features, we make sure we are not favoring variables merely because they have large numeric ranges. For example, suppose one feature (say, working hours per day) has far less variance than another (say, monthly income) simply because of their units (hours versus dollars). The direction of maximum variance then lies close to the income axis, and that is the direction PCA will pick. If the features are not scaled, an increase of one dollar in income is treated as far more significant than an increase of one working hour. In general, tree-based methods are the only class of algorithms that is essentially scale-invariant; most other algorithms benefit from scaling: neural networks tend to converge faster, K-Means usually produces better clusters, and feature extraction methods give better results when the features have been scaled during pre-processing.
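Below is a small, hedged example (synthetic income and working-hours data, scikit-learn's PCA and StandardScaler) showing how the first principal component is dominated by the large-scale feature unless the data is standardized first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Illustrative features: monthly income (dollars) and working hours per day
income = rng.normal(4000, 800, size=300)
hours = rng.normal(8, 1.5, size=300)
X = np.column_stack([income, hours])

# Without scaling, the first component aligns almost entirely with income,
# simply because income has a much larger numeric variance.
pca_raw = PCA(n_components=2).fit(X)
print("explained variance ratio (raw):   ", pca_raw.explained_variance_ratio_)

# After standardization both features can contribute to the principal directions.
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)
print("explained variance ratio (scaled):", pca_scaled.explained_variance_ratio_)
```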
Techniques of Feature Scaling
There are several ways in which features can be scaled, each with its own strengths and use cases. The main scaling methods are described below.
Min-Max Scaling (Rescaling or Normalization)
The most straightforward scaling method, Min-Max Scaling, rescales values into the range 0 to 1 (or -1 to 1). The formula to scale values into the range 0 to 1 is:

x' = (x − min(x)) / (max(x) − min(x))
The formula to scale values into the range -1 to 1 (making zero the center of the range) is:

x' = 2 · (x − min(x)) / (max(x) − min(x)) − 1
Compared with Standardization (described in the next section), Min-Max scaling produces smaller standard deviations in the output, which can help suppress the effect of outliers.
This technique is used in algorithms such as k-nearest neighbors, where distances or regression coefficients need to be computed; in general, normalization is applied whenever the model's behavior depends on the magnitude of the values. In most of these scenarios, standardization (covered in the next section) is the more commonly used method, but there are cases where min-max scaling is preferable, for example in image-processing models. In classification models where images are classified according to pixel intensities, which range from 0 to 255 for each RGB channel in color images, rescaling keeps the values within that fixed range. Several neural network architectures also expect feature values in the range 0 to 1, and Min-Max scaling is useful for that purpose as well.
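As a sketch of how this looks in practice (using scikit-learn's MinMaxScaler on a few illustrative pixel intensities), both the 0-to-1 and the -1-to-1 target ranges can be requested via the feature_range parameter:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative pixel intensities in the 0-255 range
pixels = np.array([[0.0], [64.0], [128.0], [255.0]])

# Default feature_range is (0, 1); (-1, 1) is also possible
scaler_01 = MinMaxScaler(feature_range=(0, 1))
scaler_11 = MinMaxScaler(feature_range=(-1, 1))

print(scaler_01.fit_transform(pixels).ravel())  # [0.    0.251 0.502 1.   ]
print(scaler_11.fit_transform(pixels).ravel())  # [-1.   -0.498  0.004  1.  ]
```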
Z score Normalization (Standardization)
Z-score normalization rescales a feature so that it has a mean of zero and a standard deviation of one. The z-score is a concept from inferential statistics; after this transformation the feature(s) have the properties of a standard normal distribution (zero mean, unit variance).
The formula for calculating the z-score is:

z = (x − μ) / σ

where μ is the mean of the feature and σ is its standard deviation.
Standardization is among the most widely used rescaling methods. It is applied in models built on machine learning algorithms such as logistic regression and neural networks. K-means and other clustering algorithms also benefit from standardization, particularly when Euclidean distances have to be computed, and models that depend on the spread of the features, such as Gaussian-based methods, likewise require standardized inputs. Many feature extraction techniques expect scaled features as well; standardization is the more frequently used option here, whereas Min-Max scaling yields smaller standard deviations.
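A minimal example of z-score standardization, assuming scikit-learn's StandardScaler and a small illustrative income column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative incomes in dollars
income = np.array([[30_000.0], [45_000.0], [60_000.0], [75_000.0], [90_000.0]])

scaler = StandardScaler()
z = scaler.fit_transform(income)

print("mean:", z.mean())        # ~0.0
print("std:", z.std())          # ~1.0
print("z-scores:", z.ravel())   # [-1.414 -0.707  0.     0.707  1.414]
```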
In particular, when applying feature extraction techniques such as Principal Component Analysis, we must focus on the components that maximize the variance. Feature scaling is one of the most essential parts of data processing and, where needed, should be carried out before applying any machine learning algorithm. The two main methods, Standardization and Normalization, each have distinct advantages and disadvantages, and either can be applied depending on the algorithm used in the model.
Normalization vs. Standardization
Standardization and normalization are two data pre-processing techniques used to transform and rescale data so that it is better suited to machine learning algorithms.
Normalization rescales data values to a range between 0 and 1. It is useful when data values are on different scales and you want to compare them on the same scale. Normalization is performed by subtracting the minimum value from every data point and then dividing the result by the range (the difference between the maximum and minimum values).
Standardization, in contrast, transforms the data so that it has zero mean and unit variance. It is useful when the features have different means and standard deviations and we need to compare them on an equal footing. Standardization is performed by subtracting the mean from every data point and then dividing by the standard deviation.
In general, normalization works better when the data distribution is not Gaussian or when the range of the data is known, whereas standardization is preferred when the distribution is Gaussian or the range is unknown. Ultimately, the choice between standardization and normalization depends on the requirements of the machine learning algorithm and the characteristics of the dataset.
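The two transformations can be compared side by side with plain NumPy; the values below are purely illustrative:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])  # illustrative values with one large point

# Normalization (min-max): rescales to the [0, 1] range
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit variance
x_std = (x - x.mean()) / x.std()

print("normalized:  ", np.round(x_norm, 3))  # values between 0 and 1
print("standardized:", np.round(x_std, 3))   # values centered on 0 with std 1
```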
Which models require feature scaling?
Feature scaling is usually required for models that rely on distance-based measures or optimization algorithms. Some of the models that typically require feature scaling are:
Gradient descent based algorithms
Gradient descent is an iterative optimization algorithm commonly employed in machine learning and deep learning to find optimal solutions to a given problem. It aims to minimize the discrepancy between predicted and actual values by tweaking the model parameters iteratively until it converges to an optimum.
Gradient-descent-based algorithms use the gradient of the cost function with respect to the model parameters to update those parameters on every iteration. The cost function measures the difference between predicted and actual values, and its gradient determines the direction of the step toward the function's minimum.
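A bare-bones sketch of that update rule (theta ← theta − learning_rate · gradient), applied here to a simple one-dimensional quadratic cost whose minimum is known; the cost function and settings are illustrative only:

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Generic gradient descent: theta <- theta - lr * grad(theta) on every iteration."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Illustrative cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3);
# the minimum is at theta = 3.
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0])
print(theta_min)  # close to [3.]
```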
Distance-based algorithms
Distance-based algorithms are machine learning algorithms that use a notion of distance to measure similarity or dissimilarity between data points. They rely on the assumption that data points close to one another in feature space are likely to belong to the same class or share similar properties, while points far apart are likely to represent different things.
Distance-based algorithms are widely used in machine learning for clustering, classification, and anomaly detection. They are simple to implement and easy to interpret, but they can be sensitive to the choice of distance metric and to data dimensionality, especially on large datasets.
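As a hedged illustration of this sensitivity to scale (using scikit-learn's wine dataset, whose features span very different numeric ranges), a KNN classifier typically scores noticeably better when a StandardScaler is placed in front of it:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The wine dataset mixes features on very different scales (e.g. proline vs. hue)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn_raw = KNeighborsClassifier().fit(X_train, y_train)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)

print("accuracy without scaling:", knn_raw.score(X_test, y_test))
print("accuracy with scaling:   ", knn_scaled.score(X_test, y_test))
```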
Tree-based algorithms
Tree-based algorithms are a subclass of machine learning algorithms that use decision trees for modeling and prediction. A decision tree is a graphical representation of a series of decisions or rules leading to an outcome; each node represents one such rule, while the edges represent its possible outcomes.
Tree-based algorithms are widely used in machine learning for classification, regression, and prediction tasks. They conveniently handle both numerical and categorical data and, because their splits depend only on the ordering of feature values, they do not require feature scaling; however, high-dimensional datasets containing many irrelevant features can still be challenging for them.
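To illustrate this scale-invariance (again on scikit-learn's wine dataset, chosen purely for demonstration), a decision tree fitted on raw features and on standardized features gives essentially the same results:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the same tree on raw and on standardized features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

scaler = StandardScaler().fit(X_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_train), y_train)

# The split thresholds change units, but the learned structure and the scores
# coincide (up to possible tie-breaking differences), because splits depend only
# on the ordering of values within each feature.
print("accuracy on raw features:   ", tree_raw.score(X_test, y_test))
print("accuracy on scaled features:", tree_scaled.score(scaler.transform(X_test), y_test))
```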
For the distance-based and gradient-descent-based models above, features must be scaled because these models are highly sensitive to the scale of their input features. If features on very different scales are fed in together, the model may give more weight to the features with larger values and less to those with smaller ones, degrading its accuracy and leading to incorrect predictions. By scaling features to similar ranges, we ensure that each feature receives comparable weight during the modeling process.