Math for Data Science — Vol 2
Statistics and Probability.
Math can be compared to an octopus: it has tentacles that can touch practically every aspect of our lives. Some subjects receive only a light touch, while others are wrapped in a vice-like grip. Data science is one of the latter.
There are many connections between math and data science, but statistics sits at the heart of the construction. Many statistical models (logistic and linear regression, for example) are machine learning methods in their own right, so in a sense the two fields are intertwined into one entity.
Calculating medians, modes, and standard deviations. Evaluating A/B tests and causal inference tasks. Confidence intervals. All of these and other modern statistical methods are must-learn material for any data scientist.
To ease your learning path, I am continuing my series of posts in which I try to translate the complex world of the math behind data science into simple words. By the way, the first part was Vol 1 — Functions, Multivariable Calculus & Graphs.
Statistics and Probability in a Nutshell
In simple words, statistics draws conclusions from data. For example, suppose we have 1,000 patient records from a hospital. By analyzing this data, we could identify the most dangerous symptoms, and the task of statistics is to provide the formulas that capture those connections and correlations.
Almost all algorithms in Data Science are pure statistics translated into code.
It is not necessary to have perfect knowledge of every algorithm or to memorize how to calculate every formula; calculators and software can handle that. What matters more is a general understanding. For that, I think there is no better book than Naked Statistics by Charles Wheelan. It is a fairly simple book, with explanations so clear that even your grandma would understand them.
Okay, but what about probability, is it the same stuff as statistics?
Well, yes and no. They are related because they deal with analyzing the relative frequency of events. BUT probability is more about predicting the future, and statistics is about the past.
Some data scientists require more knowledge of probability than others, but it is still an important topic for everyone. Especially terms such as mathematical expectation, conditional probability, the total probability formula, Bayes' theorem, the Central Limit Theorem, etc.
What tasks will require Statistics & Probability?
As a data scientist, you should use statistics and probability to answer various questions related to past and future events. These answers are important to build and develop a product strategy. For instance, it is necessary to highlight the most important details in data, the most common and expected outcome, and how to distinguish noise from valid data.
Top 9 Most Important Concepts Explained
Once again, statistics & probability cover lots of things, and sure, one post is not enough to touch everything, but to give you a more precise picture, here are some of the most commonly used terms.
#1 Mean, Median & Mode
These three are all measures of central tendency. Their goal is to help us understand the typical value behind different data sets, more specifically:
- Mean — the average value, calculated by summing all the numbers and dividing by the count of values;
- Median — the middle value, found by sorting the numbers in increasing order and taking the centermost one;
- Mode — the most frequent value, found by counting how many times each number appears in the data.
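A quick illustration of all three measures, using Python's built-in statistics module and a toy data set:

```python
from statistics import mean, median, mode

data = [2, 3, 3, 5, 7, 10]
print(mean(data))    # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
print(median(data))  # even-length list, so the average of 3 and 5: 4
print(mode(data))    # 3 appears most often
```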
#2 Central Limit Theorem
One of the most used theorems in statistics is the Central Limit Theorem (CLT). Its essence is quite simple: given a large enough sample size, the distribution of sample means approaches a normal distribution, regardless of the shape of the underlying population.
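A small simulation makes this concrete. Here is a sketch that draws many samples from a heavily skewed exponential population and averages each one; the sample size and the number of samples are arbitrary choices for illustration:

```python
import random

random.seed(0)  # reproducible draws

# Draw 2,000 samples of size 50 from an exponential population
# (mean 1, heavily skewed), and record each sample's mean.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(50)) / 50
    for _ in range(2000)
]

# The sample means cluster symmetrically around the population mean of 1,
# even though the population itself is far from normal.
grand_mean = sum(sample_means) / len(sample_means)
print(round(grand_mean, 2))
```

Plotting `sample_means` as a histogram would show the familiar bell shape predicted by the CLT.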
#3 Basic probability rules
Here is some basic stuff to keep in mind:
- Every probability lies between 0 and 1. If A is an event, then 0 ≤ P(A) ≤ 1;
- The probabilities of all outcomes sum to 1. If all the outcomes in the sample space are denoted by Ai, then ∑P(Ai) = 1;
- Impossible events have a probability of zero. If event A is impossible, then P(A) = 0;
- Certain events have a probability of one. If event A is certain to occur, then P(A) = 1.
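All four rules can be checked directly for a fair six-sided die; a small sketch using exact fractions:

```python
from fractions import Fraction

outcomes = range(1, 7)                      # sample space of one die roll
p = {o: Fraction(1, 6) for o in outcomes}   # each outcome equally likely

# Rule 1: every probability lies between 0 and 1.
assert all(0 <= pr <= 1 for pr in p.values())

# Rule 2: the probabilities of all outcomes sum to 1.
assert sum(p.values()) == 1

# Rule 3: an impossible event (rolling a 7) has probability 0.
assert sum(pr for o, pr in p.items() if o == 7) == 0

# Rule 4: a certain event (rolling anything from 1 to 6) has probability 1.
assert sum(pr for o, pr in p.items() if o <= 6) == 1

print("all four rules hold")
```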
#4 P-value
The probability value, aka p-value, is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis (the idea that the effect being tested does not exist) is true.
Here's how it looks in practice. Suppose a random variable A takes an observed value X. The p-value of X is the probability that A takes this value or any value that is equally or less likely to be observed. If the p-value falls below a chosen significance level (say, 0.05), we conclude that a result this extreme would occur by chance less than 5% of the time under the null hypothesis.
P-value is used in assessing how incompatible the data is with the constructed statistical model, but contextual factors such as study design, measurement quality, external evidence of the phenomenon under study, and the validity of the assumptions underlying the data analysis must also be considered.
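As a worked example (with made-up numbers), here is a two-sided p-value for a coin-fairness test, computed by hand from the binomial distribution:

```python
from math import comb

# Null hypothesis: the coin is fair, P(heads) = 0.5.
# Observation: 8 heads in 10 flips.
n, k = 10, 8

def p_exact(i):
    """P(exactly i heads in n flips of a fair coin)."""
    return comb(n, i) / 2 ** n

# Two-sided p-value: total probability of every outcome that is
# at least as unlikely as the one we observed.
p_value = sum(p_exact(i) for i in range(n + 1) if p_exact(i) <= p_exact(k))
print(round(p_value, 4))  # 0.1094, so at a 0.05 level we cannot reject fairness
```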
#5 Dispersion and Standard deviation (SD)
In statistics, dispersion means the extent to which numerical data is likely to vary about an average value. In other words, dispersion helps to understand the distribution of the data.
It is calculated by summing the squared differences between each value and the mean, and then dividing that sum by the number of samples.
(Figure: samples from two populations with the same mean but different dispersion. The red population has a mean of 100 and a dispersion of 100 (SD = 10); the blue population has a mean of 100 and a dispersion of 2,500 (SD = 50).)
Standard deviation (SD) is a measure of how scattered the values are. More precisely, it is the square root of the dispersion. Mean, median, mode, variance, and standard deviation are basic statistics that are used to describe variables in the early stages of working with data.
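The calculation described above, as a short sketch with a toy data set:

```python
from math import sqrt

data = [96, 98, 100, 102, 104]
n = len(data)
mean = sum(data) / n  # 100.0

# Dispersion (variance): average squared deviation from the mean.
dispersion = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: the square root of the dispersion,
# expressed in the same units as the data itself.
sd = sqrt(dispersion)
print(dispersion, round(sd, 3))  # 8.0 2.828
```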
#6 Covariance and Correlation
These are two similar terms yet not equal. Covariance is a measure of correlation, and correlation refers to the scaled form of covariance. More specifically, covariance compares two variables in terms of deviations from their mean (or expected) value.
Correlation is covariance normalized by the standard deviation of each variable. This normalization cancels the units, so the correlation always lies between -1 and 1: a value near 1 indicates a strong positive relationship, and a value near -1 a strong negative one.
If we are comparing relationships between three or more variables, it is preferable to use correlation, since differing ranges of values or units of measurement can otherwise lead to false conclusions.
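A minimal sketch of both calculations, using two toy variables that are perfectly linearly related:

```python
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y is exactly 2 * x

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Covariance: average product of the deviations from each mean.
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

# Correlation: covariance scaled by both standard deviations,
# which cancels the units.
sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
sy = sqrt(sum((b - my) ** 2 for b in y) / n)
corr = cov / (sx * sy)
print(cov, corr)  # 4.0 1.0: a perfect positive linear relationship
```

Note that the covariance (4.0) depends on the units of x and y, while the correlation is a pure number.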
#7 Linear and logistic regression
When it comes to statistics in data science, linear regression is one of the most fundamental things. In simple words, linear regression is needed to model the relationship between a dependent variable and one or more independent variables.
The main task is to find the best-fit line representing two or more variables. As a rule, the best-fit line is found by minimizing the squared distances between the points and the line, which is known as minimizing the sum of squared residuals.
Logistic regression is closely related to linear regression. The difference is that we use it to model the probability of a discrete set of outcomes (usually two), by passing the linear combination of inputs through a sigmoid function.
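For simple linear regression with a single predictor, the least-squares line has a closed-form solution. Here is a sketch using toy data that lies exactly on y = 2x + 1:

```python
# Fit y = slope * x + intercept by minimizing the sum of squared residuals.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]  # exactly 2 * x + 1

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Closed-form least-squares estimates for simple linear regression.
slope = (
    sum((a - mx) * (b - my) for a, b in zip(x, y))
    / sum((a - mx) ** 2 for a in x)
)
intercept = my - slope * mx
print(slope, intercept)  # 2.0 1.0
```

With noisy real-world data the recovered slope and intercept would only approximate the underlying relationship, but the formula is the same.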
#8 Conditional probability
Conditional probability is the probability that an event will occur given that another event has already occurred; like any probability, it takes a value between 0 and 1, inclusive. The probability of event A is denoted P(A) and is calculated as the number of desired outcomes divided by the number of all outcomes. For example, when you roll a die, the chance of getting a number less than four is 1/2. But if we know the roll is odd, then in two of the three possible cases (1, 3, or 5) it will be less than four, so the conditional probability rises to 2/3.
As you can see, the probability of event A changes once we condition on another event that has already occurred and is related to A.
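The die example above, computed directly from the definition P(A | B) = P(A and B) / P(B):

```python
from fractions import Fraction

outcomes = range(1, 7)                     # one roll of a fair die
A = [o for o in outcomes if o < 4]         # event A: the roll is less than 4
B = [o for o in outcomes if o % 2 == 1]    # event B: the roll is odd

# Unconditional probability of A: 3 outcomes out of 6.
p_A = Fraction(len(A), len(outcomes))
print(p_A)  # 1/2

# Conditional probability P(A | B): restrict the sample space to B.
A_and_B = [o for o in A if o in B]         # rolls 1 and 3
p_A_given_B = Fraction(len(A_and_B), len(B))
print(p_A_given_B)  # 2/3
```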
#9 Bayes’ theorem
Bayes' theorem is a statement about conditional probability. Essentially, it gives the probability that one event (A) occurs given that another event (B) has already occurred, in terms of the reverse conditional probability: P(A|B) = P(B|A) · P(A) / P(B).
It underpins one of the most popular machine learning algorithms: the naive Bayes classifier builds directly on these two concepts. Also, if you are interested in the field of online machine learning, you will most likely use Bayesian methods.
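A classic worked example, with hypothetical numbers chosen purely for illustration: a diagnostic test for a rare condition.

```python
# Hypothetical setup: 1% of patients have the condition (the prior),
# the test detects it 90% of the time, and it gives a false positive
# for 5% of healthy patients.
p_sick = 0.01
p_pos_given_sick = 0.90
p_pos_given_healthy = 0.05

# Total probability of a positive test (law of total probability).
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)

# Bayes' theorem: P(sick | positive) = P(positive | sick) * P(sick) / P(positive).
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(round(p_sick_given_pos, 3))  # 0.154: most positives are still false alarms
```

Because the condition is rare, even a fairly accurate test yields mostly false positives, which is exactly the kind of counterintuitive result Bayes' theorem makes precise.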
Bottom line
Where to learn all this stuff? Here are some of my favorite sources:
- Naked Statistics by Charles Wheelan
- Releasing the death-grip of null hypothesis statistical testing (p < .05)
- Foundations of Data Analysis — Part 1: Statistics Using R by the University of Texas at Austin (edX)
- Foundations of Data Analysis — Part 2: Inferential Statistics by the University of Texas at Austin (edX)
- Statistics with R specialization — Coursera, Duke University
- Statistics and Probability in Data Science using Python — edX, Univ of California San Diego
- Business Statistics and Analysis Specialization — Coursera, Rice University
I hope you enjoyed this post, if so feel free to follow me or read my other articles on Linkedin.
If you want to discover more math stuff behind data science, stay tuned! The next part is coming soon.
Good Luck!