Math for Data Science — Vol 2

Statistics and Probability.

Oleksii Kharkovyna
DataDrivenInvestor

--


Math can be compared to an octopus: it has tentacles that can touch practically every aspect of our life. Some subjects receive only a slight touch, while others are wrapped like a clam in the tentacles’ vice-like grip. Data science is of the second kind.

There are lots of connections between math and data science, but statistics is at the heart of this construction. Many statistical models, such as logistic or linear regression, are machine learning methods in themselves, so in some sense the two fields are intertwined into one entity.

Evaluating means, medians, and modes. Measuring standard deviation. Assessing A/B tests and causal inference tasks. Building confidence intervals. All these and other modern statistical methods are must-learn material for any data scientist.

To ease your learning path, I am continuing my series of posts where I try to translate the complex world of the math behind data science into simple words. By the way, the first part was Vol 1 — Functions, Multivariable Calculus & Graphs.

Statistics and Probability in a Nutshell


In simple words, statistics is an inference made on the basis of data. For example, say we have 1,000 patient records in a hospital. By analyzing this data, we can identify the most dangerous symptoms. The task is to find a formula that captures all those connections and correlations.

Almost all algorithms in Data Science are pure statistics translated into code.

It is not necessary to have perfect knowledge of every algorithm or to memorize every formula; for the actual calculations, you can use simple calculators and the like. What is more important is a general understanding. For that, I think there is no better book than Naked Statistics by Charles Wheelan. It is a fairly simple book, with explanations so clear that even your grandma would understand them.

Okay, but what about probability: is it the same stuff as statistics?

Well, yes and no. They are related because both deal with analyzing the relative frequency of events. BUT probability is more about predicting the future, while statistics is about the past.

Some data scientists require more knowledge of probability, some less. Nevertheless, it is still an important topic, especially terms such as mathematical expectation, conditional probability, the total probability formula, Bayes’ theorem, and the Central Limit Theorem.

What tasks will require Statistics & Probability?

As a data scientist, you will use statistics and probability to answer various questions about past and future events. These answers are important for building and developing a product strategy: for instance, how to highlight the most important details in the data, identify the most common and expected outcomes, and distinguish noise from valid data.

Top 9 Most Important Concepts Explained


Once again, statistics & probability cover lots of things, and sure, this post is not enough to touch on everything, but to give you a more precise picture, here are some of the most commonly used concepts.

#1 Mean, Median & Mode

These three are all measures of central tendency. Their goal is to help us understand the typical value behind a data set; more specifically (see the sketch after this list):

  • Mean — the average value, calculated by summing all the numbers and dividing by how many there are;
  • Median — the middle value, found by sorting the numbers in increasing order and taking the centermost one;
  • Mode — the most frequent value, found by counting how many times each number appears in the data.
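
To make these concrete, here is a minimal Python sketch using the built-in statistics module on a made-up list of numbers:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5.0
median = statistics.median(data)  # even count, so average the two middle values: (3 + 5) / 2 = 4.0
mode = statistics.mode(data)      # 3 appears most often

print(mean, median, mode)  # 5.0 4.0 3
```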

#2 Central Limit Theorem

One of the most used theorems in statistics is the Central Limit Theorem (CLT). Its essence is quite simple: given a large enough sample size, the means of repeated samples are approximately normally distributed, regardless of the shape of the underlying distribution.
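
One way to convince yourself is a quick simulation. A minimal sketch, assuming NumPy is available: we draw many samples from a skewed (exponential) distribution and look at how their means behave:

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw 10,000 samples of size 100 each from an exponential
# distribution (heavily skewed, definitely not normal).
sample_means = rng.exponential(scale=1.0, size=(10_000, 100)).mean(axis=1)

# The CLT predicts the sample means are approximately normal with
# mean ~ 1.0 (the population mean) and std ~ 1.0 / sqrt(100) = 0.1.
print(sample_means.mean())  # close to 1.0
print(sample_means.std())   # close to 0.1
```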

#3 Basic probability rules

Here is some basic stuff to keep in mind (each rule is sanity-checked in the sketch after this list):

  • Every probability lies between 0 and 1. If A is an event, then 0 ≤ P(A) ≤ 1;
  • The probabilities of all outcomes sum to 1. If the outcomes in the sample space are denoted by Ai, then ∑P(Ai) = 1;
  • Impossible events have a probability of zero. If event A is impossible, then P(A)=0;
  • Certain events have probability one. If event A is certain to occur, then P(A)=1.
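
A minimal sketch of these rules, using a fair die as the sample space (the numbers are just for illustration):

```python
from fractions import Fraction

# Sample space for one roll of a fair die; each outcome has probability 1/6.
outcomes = {k: Fraction(1, 6) for k in range(1, 7)}

# Rules 1 and 4: every probability lies in [0, 1].
assert all(0 <= p <= 1 for p in outcomes.values())

# Rule 2: the probabilities of all outcomes sum to 1.
assert sum(outcomes.values()) == 1

# Rule 3: an impossible event (rolling a 7) has probability 0.
p_seven = sum(p for k, p in outcomes.items() if k == 7)
assert p_seven == 0
```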

#4 P-value

The probability value, aka p-value, is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis (the idea that the effect being tested does not exist) is true.

Here’s how it looks in practice. Let’s say we have a random variable A and an observed value X. The p-value of X is the probability that A takes this value or any other value that is equally or less likely to be observed. In practice, if the p-value is, say, 0.07, we say that the probability of the result occurring by chance alone is 7%; we reject the null hypothesis when the p-value falls below a chosen significance level α, commonly 0.05.


P-value is used in assessing how incompatible the data is with the constructed statistical model, but contextual factors such as study design, measurement quality, external evidence of the phenomenon under study, and the validity of the assumptions underlying the data analysis must also be considered.
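
As an illustration of how a p-value shows up in practice, here is a minimal sketch with SciPy's one-sample t-test on made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical measurements; the null hypothesis is that the true mean is 5.0.
sample = rng.normal(loc=5.3, scale=1.0, size=30)

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# With a significance level of 0.05, we reject the null hypothesis
# only if p_value < 0.05; otherwise the data is compatible with it.
```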

#5 Dispersion and Standard deviation (SD)

In statistics, dispersion (also known as variance) measures the extent to which numerical data varies around its average value. In other words, dispersion helps us understand the spread of the data.

It is calculated by summing the squared differences between each value and the mean, then dividing that sum by the number of samples.


An example of samples from two populations with the same mean but different dispersion: the red population has a mean of 100 and a variance of 100 (SD = 10), while the blue population has a mean of 100 and a variance of 2500 (SD = 50).

Standard deviation (SD) is a measure of how scattered the values are; more precisely, it is the square root of the variance. Mean, median, mode, variance, and standard deviation are the basic statistics used to describe variables in the early stages of working with data.
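
A minimal NumPy sketch of both calculations, following the population formula described above (dividing by n rather than n - 1):

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0])

mean = data.mean()
variance = ((data - mean) ** 2).mean()  # sum of squared deviations / n
sd = np.sqrt(variance)

print(variance, sd)
# np.var(data) and np.std(data) give the same results by default (ddof=0);
# pass ddof=1 for the sample versions that divide by n - 1.
```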

#6 Covariance and Correlation

These are two similar terms, yet they are not equal. Covariance measures how two variables vary together, comparing them in terms of deviations from their mean (or expected) values, while correlation is the scaled form of covariance.

An example of visualizing the covariance of two variables.

Correlation is the normalization of covariance by the standard deviation of each variable. This normalization cancels the units, so the correlation value always lies between -1 and 1: its absolute value is at most 1, and a negative value indicates a negative relationship between the two variables. When comparing relationships among three or more variables, it is preferable to use correlation, as differing ranges of values or units of measurement can otherwise lead to false conclusions.
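
A minimal sketch with NumPy on made-up data, showing that correlation is just covariance normalized by the two standard deviations:

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)  # y moves with x, plus noise

cov = np.cov(x, y)[0, 1]        # in units of x times units of y
corr = np.corrcoef(x, y)[0, 1]  # unitless, always in [-1, 1]

# Correlation is covariance divided by both standard deviations:
assert np.isclose(corr, cov / (x.std(ddof=1) * y.std(ddof=1)))

print(cov, corr)
```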

#7 Linear and logistic regression

When it comes to statistics in data science, linear regression is one of the most fundamental things. In simple words, linear regression is needed to model the relationship between a dependent variable and one or more independent variables.

The main task is to find the best-fit line representing two or more variables. As a rule, the best-fit line is found by minimizing the squared distances between the points and the line, which is known as minimizing the sum of squared residuals.

Logistic regression is practically identical to linear regression. The only difference is that we use it to model the likelihood of a discrete number of outcomes (usually two).

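A hedged sketch of both models using scikit-learn (one common choice of library); the data and parameters here are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(100, 1))

# Linear regression: fit a continuous target, y = 3x + 1 plus noise.
y_cont = 3 * X.ravel() + 1 + rng.normal(scale=2.0, size=100)
lin = LinearRegression().fit(X, y_cont)
print(lin.coef_, lin.intercept_)  # roughly [3.0] and 1.0

# Logistic regression: fit a binary target (1 when x > 5, roughly).
y_bin = (X.ravel() + rng.normal(scale=1.0, size=100) > 5).astype(int)
log = LogisticRegression().fit(X, y_bin)
print(log.predict_proba([[7.0]]))  # probability of each class at x = 7
```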

#8 Conditional probability

Conditional probability is the probability that an event will occur given that another event has already occurred; like any probability, it takes a value between 0 and 1, inclusive. The probability of event A is denoted P(A) and is calculated as the number of favorable outcomes divided by the number of all possible outcomes. For example, when you roll a die, the chance of getting a number less than four is 1/2 (three outcomes out of six). But if we already know the roll is odd, the possible outcomes are 1, 3, and 5, and in two out of those three cases the number is less than four, so the conditional probability becomes 2/3.


As you can see, knowing that a related event has already occurred changes the probability of event A.
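
The die example can be verified by direct enumeration using the standard formula P(A | B) = P(A and B) / P(B). A minimal Python sketch:

```python
from fractions import Fraction

outcomes = range(1, 7)  # one roll of a fair die, all outcomes equally likely

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    hits = sum(1 for k in outcomes if event(k))
    return Fraction(hits, 6)

p_b = prob(lambda k: k % 2 == 1)                  # P(odd) = 3/6 = 1/2
p_a_and_b = prob(lambda k: k < 4 and k % 2 == 1)  # P(<4 and odd) = 2/6
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)  # 2/3
```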

#9 Bayes’ theorem

Bayes’ theorem is a statement about conditional probability. Essentially, it lets us compute the probability that one event (B) will occur given that another event (A) has already occurred, from the reverse probability of A given B.


The theorem underpins one of the most popular machine learning algorithms: the naive Bayes classifier is built directly on these concepts. Also, if you are interested in the field of online machine learning, you will most likely use Bayesian methods.
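
A minimal worked example in Python, with made-up numbers for a disease-screening test; note how the denominator uses the total probability formula mentioned earlier:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
# Illustrative numbers (made up): a disease affects 1% of a population,
# and a test detects it 95% of the time with a 5% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Total probability of a positive test result:
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Probability of actually having the disease given a positive test:
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~ 0.161, far lower than intuition suggests
```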

Bottom line


Where to learn all this stuff? Here are some of my favorite sources:

I hope you enjoyed this post. If so, feel free to follow me or read my other articles on LinkedIn.


If you want to discover more math stuff behind data science, stay tuned! The next part is coming soon.

Good Luck!

