BASIC STATISTICS
This section covers basic statistics, namely Descriptive and Inferential Statistics, which are the basic elements needed to understand how data analysis is carried out.
Let’s go back to the beginning and first understand what data means.
In the simplest terms, data is information that has been measured. When it is stored and processed in computers, this “computer data” is what is used for Data Analysis. At the most basic level, it consists of the binary digits 0 and 1. Today, however, this data takes many forms, such as text documents, images, videos, and software.
With the rapid advancement of computers’ processing and storage capabilities, the volume of data generated is massive, and there is frequently a need to make sense of the chaos. That is the point at which Data Analysis kicks in.
Before determining the kinds of analysis that can be conducted on the data, it is crucial to be aware of the kinds of data available. In general, there are two types of data:
Qualitative (Categorical)
Qualitative Data, sometimes referred to as Categorical Data, is generally non-numeric. It is made up of words or labels and cannot be quantified. Examples of qualitative data include gender, location, color, and shape.
Qualitative Data comprises three types: Binary, Nominal, and Ordinal.
Binary Data
Binary Data is a type of data that has only two distinct categories. A good example could be the result of the toss of a coin, which could be either heads or tails.
Nominal Data
It is a type of Qualitative Data with no fixed number of categories (unlike Binary Data, which has exactly two); however, it contains at least two different categories. The categories are mutually exclusive, and no category is superior to another, so these categories are discrete. An example of Nominal Data is colors, where no color is superior to any other. Keep in mind that Nominal Data may be represented by numbers: the number 1 could be assigned to Red, the number 2 to Blue, and so on. However, these numbers are simply labels and carry no value. Number 1 is not superior to number 2, and the difference or “distance” between the numbers has no meaning.
Ordinal Data
It is data whose categories can be placed in an ordered, logical sequence. The values carry no weight of their own beyond their rank. Examples include a ranking of the top 5 poorest countries, or clothing sizes (Small, Medium, Large, etc.). The distance between the values is not known.
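The difference between Nominal and Ordinal Data can be sketched in a few lines of Python. This is a minimal illustration only; the names `color_codes`, `size_order`, `size_rank`, and `shirts` are hypothetical and chosen for this example.

```python
# Nominal: the numeric codes are only labels, so ordering or
# arithmetic on them carries no meaning.
color_codes = {"Red": 1, "Blue": 2, "Green": 3}  # hypothetical labels

# Ordinal: the categories have a logical order, encoded here as ranks.
size_order = ["Small", "Medium", "Large"]
size_rank = {size: rank for rank, size in enumerate(size_order)}

shirts = ["Large", "Small", "Medium", "Small"]  # hypothetical observations

# Sorting by rank is valid for ordinal data.
sorted_shirts = sorted(shirts, key=size_rank.get)
print(sorted_shirts)  # ['Small', 'Small', 'Medium', 'Large']

# Comparing ranks is meaningful, but the gap between two ranks is not
# a measured "distance" between the sizes.
print(size_rank["Small"] < size_rank["Large"])  # True
```

Sorting shirts by size makes sense, whereas sorting colors by their codes would only reflect how the labels happened to be assigned.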
Quantitative (Numerical)
As the name suggests, Quantitative Data is numerical and can be quantified. The values carry weight and contain information about their magnitude. Quantitative Data can be further subdivided into two categories: Interval and Ratio.
Interval Data
It is like Ordinal Data, with the main distinction being that the intervals between values are equal. One example is temperature in degrees Celsius: the difference between two values can be quantified precisely, but zero Celsius is an arbitrary point on the scale and does not mean an absence of temperature.
Ratio Data
It is like Interval Data but also has an absolute zero, so ratios between values are meaningful. One example is the height of a person in inches, where zero means no height at all and 60 inches is twice 30 inches.
Quantitative Data can also be divided into two other categories, namely Continuous and Discrete.
Continuous Data
It is the type of data in which values can be divided into fractions and can take any value within a range, such as height, temperature, and so on.
Discrete Data
It is data whose values cannot be broken into fractions but are counted in fixed whole numbers, such as the number of pupils in a classroom.
To fully comprehend the types of analysis that can be performed on different data types, it is necessary to know what we mean by Statistics, Population, and Sample.
The first thought when you hear the word “population” is typically the number of people in a nation, while a sample refers to a small portion of that population chosen to represent the whole. If you are familiar with this definition of Population and Sample, then you are not too far from the meaning of the terms as used in Statistics.
Population refers to the entirety of a specific group. It can be defined as the set of all people or items that can be the focus of a statistical study. It is important to remember that a population does not have to be massive; it could be as small as two members if they make up the entire group being studied. For instance, if we measure the length of every 1969 Chevrolet AstroVette, the population comprises only three cars, since only three were ever built. The population also does not include any other Chevrolet, only this particular model.
The population is the entirety, and a sample is simply a subset of this population. Different methods of analysis can be conducted on sample data, and the results and inferences drawn from a sample are referred to as statistics. For instance, if we compute the mean, the value calculated from the whole population is a parameter, whereas the mean calculated from a sample is a statistic.
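The distinction between a parameter and a statistic can be sketched as follows. This is a minimal illustration using made-up data: the population of 1,000 heights, the sample size of 50, and all variable names are assumptions made for this example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of all 1,000 members of a group.
population = [random.gauss(170, 10) for _ in range(1000)]

# Parameter: computed from the entire population.
population_mean = statistics.mean(population)

# Statistic: computed from a sample drawn from that population.
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"Population mean (parameter): {population_mean:.2f}")
print(f"Sample mean (statistic):     {sample_mean:.2f}")
```

The two means will usually be close but not identical, which is why a statistic is treated as an estimate of the parameter.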
In most instances, it is difficult to collect data from every member of the population, so a sample is used instead. Different methods of selecting a suitable sample are employed, such as:
Simple Random Sampling:
The word “random” here means unbiased: every member of the population has the same chance of being selected for the sample. This method is frequently employed in customer satisfaction studies.
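A simple random sample can be drawn with Python's standard library. This is a minimal sketch; the list of 500 customer IDs and the sample size of 20 are hypothetical.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 500 customer IDs.
customers = [f"customer_{i}" for i in range(1, 501)]

# random.sample draws without replacement, and every customer is
# equally likely to be selected.
survey_sample = random.sample(customers, k=20)

print(len(survey_sample))  # 20
```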
Representative (Stratified) Sampling:
It is also random, but it preserves the patterns and proportions found in the actual population so that the sample reflects the characteristics of the larger group. One example is creating a representative sample of the people of Mumbai in which 100 individuals are selected so that 55 are male and 45 are female, with the individuals within each of the male and female groups chosen at random. In this way, the genders are represented in the same proportions as in the population (as per Mumbai City District 2011 Census data).
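The 55/45 example can be sketched as proportional stratified sampling. This is an illustration under stated assumptions: the sampling frame of 10,000 people, the helper `stratified_sample`, and the proportional allocation by rounding are all hypothetical choices made for this example.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame of (person_id, gender): 55% male, 45% female.
frame = [(i, "male") for i in range(5500)] + \
        [(i, "female") for i in range(5500, 10000)]

def stratified_sample(frame, sample_size):
    """Draw a random sample whose strata match the frame's proportions."""
    # Group the frame into strata by gender.
    strata = {}
    for person in frame:
        strata.setdefault(person[1], []).append(person)

    sample = []
    for members in strata.values():
        # Allocate places in proportion to the stratum's share.
        # Note: with awkward proportions, rounding may make the total
        # differ slightly from sample_size.
        quota = round(sample_size * len(members) / len(frame))
        sample.extend(random.sample(members, quota))
    return sample

sample = stratified_sample(frame, 100)
counts = {g: sum(1 for _, gender in sample if gender == g)
          for g in ("male", "female")}
print(counts)  # {'male': 55, 'female': 45}
```

Within each stratum the selection is still random; only the number of places per stratum is fixed in advance.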
Convenience Sampling:
The sample is chosen on the basis of accessibility and people’s willingness to participate. This type of sampling is something we see daily, with company representatives handing out pamphlets and survey forms to fill in. It is important to understand that Convenience Sampling is not in itself a faulty or incorrect method of collecting samples; it is acceptable as long as the sample accurately represents the population of interest.
Cluster Sampling:
This method of sampling is typically used in marketing studies or exit polls. The population is divided into subgroups (clusters) that are similar to one another, even though there is variation within each of them, and only some of the clusters are sampled. A good example of this technique is predicting Delhi’s election results by dividing Delhi into six zones and further dividing each zone into three localities. Then, from each locality, we randomly sample from any two blocks.
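The Delhi example can be sketched as follows, reading the last step as drawing two blocks at random from each locality. The number of blocks per locality (10) and all names are hypothetical.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical layout: 6 zones, 3 localities per zone,
# 10 blocks per locality.
city = {
    f"zone_{z}": {
        f"locality_{z}_{loc}": [
            f"block_{z}_{loc}_{b}" for b in range(1, 11)
        ]
        for loc in range(1, 4)
    }
    for z in range(1, 7)
}

# From every locality, randomly pick two blocks (the clusters) to survey.
selected_blocks = [
    block
    for localities in city.values()
    for blocks in localities.values()
    for block in random.sample(blocks, 2)
]

print(len(selected_blocks))  # 36
```

Only the selected blocks are surveyed, which keeps fieldwork manageable compared with sampling individuals across the whole city.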
With a good grasp of the concepts above, we can dive into the realm of statistics, where we will discuss the specific basics required to conduct any sophisticated data analysis.
THEORY
The theory section covers various machine learning methods that can be employed to create data models. Some of these algorithms work in a Supervised Learning setup, while others operate in an Unsupervised Learning setup. Several algorithms are also used to build Time Series models that help forecast values over a particular period of time.
APPLICATION
Understanding algorithms is the foundation for understanding the behavior of various models. However, it is knowledge of programming languages that allows one to build models from data. This section describes how different models are constructed in Python and R using a variety of machine learning algorithms.