Overview
In the previous post, we examined the consolidation of datasets. Once the data is consolidated, it is crucial to explore it. Various descriptive and inferential statistics, along with different visualization methods, can be used to study the data.
Data exploration can be classified into three kinds:
- Univariate Analysis
- Bivariate Analysis
- Multivariate Analysis
What is Univariate Analysis?
In univariate analysis, every variable is examined individually; multiple variables are never examined together. This is the most basic form of analysis.
Univariate analysis is performed on two types of variables:
- Categorical
- Numerical
Categorical Variables
Categorical variables are studied with measures of frequency: a frequency table records the number of times each category of the variable occurs. Pie charts and bar charts can then be built from these tables. For instance, suppose a dataset has a variable Continent whose values fall into three categories. We can count how often each category occurs.
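As a minimal sketch of such a frequency table, the snippet below counts the categories of a Continent variable using only the Python standard library. The sample values are hypothetical and chosen purely for illustration.

```python
from collections import Counter

# Hypothetical sample values for a categorical variable named Continent.
continent = ["Asia", "Europe", "Asia", "Africa", "Europe", "Asia"]

# Frequency table: number of times each category occurs.
freq = Counter(continent)

# Relative frequency: share of all observations in each category.
total = len(continent)
rel_freq = {category: count / total for category, count in freq.items()}

# Print the table, most frequent category first.
for category, count in freq.most_common():
    print(f"{category:<8}{count:>3}{rel_freq[category]:>8.1%}")
```

The counts in `freq` are what a bar chart would plot, and the shares in `rel_freq` are what a pie chart would show.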
Numerical Variables
Descriptive statistics such as measures of frequency (count), central tendency (mean, median, mode), variability (minimum, maximum, range, quantiles, variance, standard deviation), and shape (skewness and kurtosis) can be used to study a numerical variable. The most common visualizations are histograms and box plots.
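The measures listed above can be computed with Python's built-in `statistics` module, as in the sketch below. The sample values are hypothetical, and skewness and kurtosis are computed here with the simple moment-based (population) formulas; statistical packages may apply different bias corrections.

```python
import statistics as st

# Hypothetical sample of a numerical variable (for example, ages).
values = [23, 25, 25, 29, 31, 34, 38, 41, 47, 62]

# Frequency
n = len(values)

# Central tendency
mean = st.mean(values)
median = st.median(values)
mode = st.mode(values)

# Variability
minimum, maximum = min(values), max(values)
value_range = maximum - minimum
q1, q2, q3 = st.quantiles(values, n=4)  # quartiles
variance = st.variance(values)          # sample variance (n - 1 denominator)
std_dev = st.stdev(values)              # sample standard deviation

# Shape: moment-based skewness and excess kurtosis (population formulas).
pop_sd = st.pstdev(values)
skewness = sum((x - mean) ** 3 for x in values) / (n * pop_sd ** 3)
kurtosis = sum((x - mean) ** 4 for x in values) / (n * pop_sd ** 4) - 3

print(f"n={n} mean={mean} median={median} mode={mode}")
print(f"min={minimum} max={maximum} range={value_range} quartiles={q1}, {q2}, {q3}")
print(f"variance={variance:.2f} std={std_dev:.2f} skew={skewness:.2f} kurt={kurtosis:.2f}")
```

A positive skewness, as in this sample, indicates a longer right tail, which would also be visible in a histogram or box plot.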
What is Bivariate Analysis?
Bivariate analysis examines two variables together to explore the relationship or association between them. Different inferential statistics may be used to conduct it.
Bivariate analysis comes in the following kinds:
- Bivariate analysis of two numerical variables (Numerical-Numerical)
- Bivariate analysis of two categorical variables (Categorical-Categorical)
- Bivariate analysis of a numerical and a categorical variable (Numerical-Categorical)
Numerical-Numerical
The correlation coefficient from inferential statistics is used to study the relationship between two numerical variables. Visualization techniques such as a scatterplot can be used.
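To make the correlation coefficient concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. The post does not say which coefficient is meant, so Pearson's is an assumption, and the variable names and values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: two numerical variables measured on the same six cases.
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 74]

r = pearson_r(hours_studied, exam_score)
print(f"Pearson r = {r:.3f}")
```

Values of `r` near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship; a scatterplot of the two variables shows the same pattern visually.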
Categorical-Categorical
Inferential statistics such as the chi-square test are used to investigate the association between two categorical variables. Visualization techniques such as stacked column charts can be used.
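The chi-square test of independence compares the observed counts in a contingency table with the counts expected if the two variables were unrelated. The sketch below computes the test statistic by hand for a hypothetical 2x2 table, without a continuity correction. It stops at the statistic and degrees of freedom, since a p-value requires the chi-square distribution, which the Python standard library does not provide.

```python
# Hypothetical contingency table of observed counts:
# rows are the categories of one variable, columns the categories of the other.
observed = [
    [30, 10],
    [20, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi-square = {chi_square:.3f}, degrees of freedom = {dof}")
```

The statistic is then compared with a critical value for the given degrees of freedom; for one degree of freedom at the 5% level, that value is about 3.84.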
Numerical-Categorical
Inferential statistics such as the t-test, z-test, and ANOVA can be used here. They help us analyze the dataset by comparing a numerical variable across the categories of a categorical variable. Visualization techniques such as a combination chart or a line chart with error bars suit this analysis.
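As one illustration of a numerical-categorical comparison, the sketch below computes a two-sample t statistic for a numerical variable split by a two-level categorical variable. It uses Welch's version of the t-test, which does not assume equal variances; that choice, the group names, and the values are all assumptions made for this example.

```python
import statistics as st
from math import sqrt

# Hypothetical numerical variable split by a categorical variable with two levels.
group_a = [48, 52, 55, 60, 61]
group_b = [58, 63, 66, 70, 73]

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    # Squared standard error of each group's mean.
    va = st.variance(a) / len(a)
    vb = st.variance(b) / len(b)
    t = (st.mean(a) - st.mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.1f}")
```

Turning the statistic into a p-value requires the t distribution, which is not in the standard library; a statistics package would normally handle that step. With more than two categories, ANOVA plays the same role.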
What is Multivariate Analysis?
This type of analysis examines more than two variables at once. For example, analyzing four variables simultaneously increases the dimensionality, and it is very difficult for our minds to comprehend the relationships between four variables (four dimensions) in a single graph. Multivariate analysis is therefore used (generally with specialized statistical software) to analyze more complicated datasets that cannot be examined with univariate or bivariate analysis.
Types of multivariate analysis include cluster analysis, factor analysis, multiple regression analysis, principal component analysis, etc.
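The techniques named above usually rely on dedicated libraries. As a much simpler first look at several variables at once, the sketch below builds a pairwise correlation matrix. This is not one of the listed techniques, only a common starting point before them, and the variable names and values are hypothetical.

```python
from itertools import product
from math import sqrt

# Hypothetical dataset with three numerical variables measured on five cases.
data = {
    "height": [150, 160, 165, 172, 180],
    "weight": [52, 58, 63, 70, 79],
    "age": [34, 29, 41, 25, 38],
}

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Correlation for every ordered pair of variables, including each with itself.
corr = {(a, b): pearson_r(data[a], data[b]) for a, b in product(data, repeat=2)}

# Print the matrix row by row.
names = list(data)
print(" " * 8 + "".join(f"{name:>8}" for name in names))
for a in names:
    print(f"{a:<8}" + "".join(f"{corr[(a, b)]:>8.2f}" for b in names))
```

Each cell summarizes one pair of variables, so the matrix still reads as a set of bivariate relationships; methods such as principal component analysis go further by combining all the variables together.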
The many statistics covered in the Basic Statistics section can be used to analyze a wide range of data. The most popular methods for exploring data are univariate and bivariate analysis. Once the data has been explored and better understood, we can proceed with the remaining data preparation and modeling steps.