DATA EXPLORATION AND PREPARATION – THEORY

It is necessary to ensure that the data is suitable for the modeling algorithms we plan to use, because they operate differently and frequently give incorrect results when the data isn’t clean. In practice, most of the time spent on a data study goes into exploring and preparing the data.
This section discusses various techniques that improve the quality of data and help us build a more reliable and stable model.

A typical sequence of steps, from the moment we acquire the data until we begin building models, is as follows:

Exploration of Data:

Once we have our hands on the data set, we begin exploring it. This involves learning the number and kinds of features present in the data, analyzing some descriptive statistics, and so on.
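
As a rough illustration of this first look, the sketch below reports the kind of each feature and a few descriptive statistics. The toy rows and the `describe` helper are hypothetical and use only the Python standard library; in practice a data-analysis library would usually do this work.

```python
import statistics

# Hypothetical toy dataset: each row maps a feature name to a value.
rows = [
    {"age": 25, "city": "Pune", "spend": 120.0},
    {"age": 32, "city": "Delhi", "spend": 80.5},
    {"age": 47, "city": "Pune", "spend": 150.0},
    {"age": 51, "city": "Mumbai", "spend": 95.5},
]

def describe(rows):
    """Return the kind of each feature plus basic descriptive statistics."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric feature: report range and central tendency.
            summary[name] = {
                "kind": "numeric",
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
                "median": statistics.median(values),
            }
        else:
            # Anything else is treated as categorical in this sketch.
            summary[name] = {
                "kind": "categorical",
                "distinct": len(set(values)),
                "most_common": statistics.mode(values),
            }
    return summary

summary = describe(rows)
for name, stats in summary.items():
    print(name, stats)
```

Separating numeric from categorical features early is useful because the later preparation steps treat the two kinds differently.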

Consolidation of Data:

There are times when the data we need for a model is not available in the form we need. It is often not in one location and is scattered across different sources. For instance, suppose we want to know the sales figures of a store. The data is in spreadsheets, each limited to about a million rows and a few thousand columns. In the modern age, where data is generated quickly, the figures may well be split across several spreadsheets. This necessitates the consolidation of data.
Additionally, there are times when different information about the same subject is spread across different datasets, which then have to be merged. For instance, suppose we want to study the behavior of our customers, but we have two databases: one with the customers’ demographic details and another with the details of their transactions. This is where the various methods for merging data come into the picture.
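
The two situations above, stacking rows from several sheets and merging tables that describe the same customers, can be sketched as follows. The tables, the `customer_id` key, and the `inner_join` helper are all made up for illustration, and an inner join is only one of several possible merge methods.

```python
# Hypothetical sales rows split across two spreadsheets with the same columns.
sheet_a = [{"customer_id": 1, "amount": 250.0}, {"customer_id": 2, "amount": 90.0}]
sheet_b = [{"customer_id": 1, "amount": 40.0}, {"customer_id": 3, "amount": 15.0}]

# Appending: stack the rows of both sheets into one table.
transactions = sheet_a + sheet_b

# Hypothetical demographics table, keyed by customer_id.
demographics = {
    1: {"age": 34, "city": "Pune"},
    2: {"age": 51, "city": "Delhi"},
}

def inner_join(transactions, demographics):
    """Keep only transactions whose customer has a demographics record."""
    merged = []
    for row in transactions:
        demo = demographics.get(row["customer_id"])
        if demo is not None:
            # Combine the transaction columns with the demographic columns.
            merged.append({**row, **demo})
    return merged

merged = inner_join(transactions, demographics)
for row in merged:
    print(row)
```

Note that customer 3 disappears from the result because no demographic record exists for that customer. Whether such rows should be dropped or kept with empty fields depends on the analysis.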

Missing Value and Outlier Treatment:

Once the data set has the format and size we want, we treat it for outliers and missing values. Outliers are often the reason algorithms fail or perform poorly, and for some algorithms, dealing with outliers is crucial. Because outliers are not always errors and may be genuine parts of the original dataset, it is essential to take them seriously: throwing them out without proper analysis may do more harm than good. In addition, missing values make it difficult to use the full dataset for modeling. The methods available for dealing with missing values range from simple to quite advanced.
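
As a minimal sketch of the simple end of that range, the example below fills a missing value with the median and then flags (rather than deletes) outliers using the common 1.5 × IQR rule. The data is hypothetical, and both techniques are chosen here only for illustration; they are not the only options.

```python
import statistics

# Hypothetical feature with one missing value (None) and one extreme value.
values = [12.0, 15.0, None, 14.0, 13.0, 95.0, 16.0, 11.0]

# Median imputation: replace missing entries with the median of observed ones.
observed = [v for v in values if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in values]

# IQR rule: flag points more than 1.5 * IQR beyond the first or third quartile.
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in imputed if v < lower or v > upper]

print("imputed:", imputed)
print("flagged as outliers:", outliers)
```

The flagged values are only candidates for further analysis. Whether 95.0 is a data-entry error or a genuine observation cannot be decided from the numbers alone.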

Feature Engineering:

Feature engineering is a general term for the different operations performed on the dataset’s features. Distance-based algorithms like KNN require the data to be scaled for the result to be meaningful. Likewise, algorithms such as Linear Regression (using the OLS method) can give artificially high results if the features employed are not all relevant; to address this, we use different feature reduction methods. Additionally, certain features might not be meaningful or appropriate for an algorithm as they stand, and may need to be split or combined to make them more effective. For this, we employ various feature construction methods. In short, various changes need to be made to the dataset’s features to make them compatible with the algorithms.
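
To see why scaling matters for a distance-based algorithm such as KNN, the sketch below compares Euclidean distances before and after min-max scaling. The customer data and the `min_max_scale` helper are hypothetical, and min-max scaling is just one of several scaling methods.

```python
import math

# Hypothetical customers: (age in years, income in currency units).
points = [(25, 30_000), (60, 31_000), (27, 36_000), (40, 90_000)]

def min_max_scale(points):
    """Rescale every feature (column) to the range [0, 1].

    Assumes each feature takes at least two distinct values.
    """
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

scaled = min_max_scale(points)

# Distances from the first customer to the second and third customers.
raw_dist = [math.dist(points[0], p) for p in points[1:3]]
scaled_dist = [math.dist(scaled[0], p) for p in scaled[1:3]]

print("raw distances:   ", raw_dist)
print("scaled distances:", scaled_dist)
```

On the raw data, income dominates the distance, so the 60-year-old with a similar income appears to be the nearest neighbor of the 25-year-old. After scaling, both features contribute comparably and the 27-year-old becomes the nearest neighbor.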

It is necessary to remember that the steps mentioned above may be carried out in a different order, applied differently, or supplemented with additional steps. In this article, we will apply the methods mentioned above for data exploration and preparation.

MISCELLANEOUS METHODS

FEATURE ENGINEERING