Significance of I.I.D in Machine Learning

The assumption of I.I.D is central to almost all machine learning algorithms and is an explicit assumption in most statistical inference

Sundaresh Chandran
DataDrivenInvestor



Let’s try to understand what I.I.D is and why it is so important in machine learning & statistics.

Independent and identically distributed (I.I.D) data is, well, data whose samples are both independent of each other and drawn from the same distribution. Let’s try to break this down further.

What makes variables Independent?

By independent, we mean the samples taken from the individual random variables are independent of each other: knowing the value of one sample tells us nothing about the value of any other. In other words, samples drawn from the random variables contain no internal dependency amongst themselves.

Let’s look at simple examples of dependent and independent distributions:

Independent Event

  • Imagine a coin toss. If you get heads on the first trial, the probability of getting heads or tails on the next trial does not change (it is still a 50–50 probability). Each coin toss is independent of every other toss. Another point to note is that it doesn’t matter whether you toss a fair coin or an unfair coin: each sample is still independent of the other samples. Similarly, if you roll a die, the outcomes and the samples of these outcomes are independent of each other.

If we wanted to combine the coin toss & die roll into a single sample, say (H, 2) from the coin toss & die roll respectively, the two outcomes still remain independent of each other.

In this case: P(H, 2) = P(H) × P(2) = 1/2 × 1/6 = 1/12, because the probabilities of independent events multiply.
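As a quick check, here is a minimal simulation sketch (the code and variable names are my own, purely illustrative) that estimates the joint frequency of (H, 2) and compares it with the product of the marginal frequencies:

```python
import random

trials = 1_000_000
heads = twos = both = 0

for _ in range(trials):
    coin = random.choice(["H", "T"])  # fair coin toss
    die = random.randint(1, 6)        # fair die roll
    heads += coin == "H"
    twos += die == 2
    both += coin == "H" and die == 2

p_h, p_2, p_joint = heads / trials, twos / trials, both / trials
print(f"P(H) * P(2) = {p_h * p_2:.4f}")  # ≈ 1/12 ≈ 0.0833
print(f"P(H, 2)     = {p_joint:.4f}")    # matches if the events are independent
```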

Dependent Event

  • A game of Snakes and Ladders, where moves are determined by dice rolls, is an example of a dependent event. This particular game is also a first-order Markov chain, where the only thing that matters is the current state of the board: the next state is determined by the current state and the next roll of the dice. Any Markov sequence can be considered a non-independent (or dependent) distribution, and we can clearly see the underlying dependence of a state or sample on its previous state (a small simulation follows below).
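To make the dependence concrete, here is a minimal sketch of a Snakes-and-Ladders-style walk (the board size and the omission of snakes and ladders are simplifications of mine): each position depends on the previous one, so the sequence of positions is not independent.

```python
import random

def play(board_size: int = 100) -> list[int]:
    """A bare-bones Snakes-and-Ladders walk: no snakes or ladders,
    just dice rolls. Returns the sequence of board positions."""
    positions = [0]
    while positions[-1] < board_size:
        roll = random.randint(1, 6)             # each roll is independent...
        positions.append(positions[-1] + roll)  # ...but each position depends
                                                # on the previous state
    return positions

print(play())  # e.g. [0, 4, 9, 10, 16, ...] -- every value builds on the last
```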

What makes a distribution identical?

There are different ways to understand identical distribution. Let’s look at a few of them:

Mathematically:

  • The samples are identically distributed if we sample them from the same underlying mathematical function, in the same way
  • All the items in the sample are taken from the same probability distribution

In general terms:

  • A distribution is identical if the samples come from the same random variable
  • It can also be put as: the underlying mechanism that generates the data must be the same for all the samples considered

An example of breaking this assumption would be sample selection bias, where you have more training data from one subgroup or stratum of your population yet want to generalise to the entire population.

Note: Identically distributed doesn’t mean that the outcomes of the involved random variables need to have the same or similar probabilities. Samples from an unfair coin are still identically distributed, since every sample comes from the same (biased) distribution.
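A minimal sketch of the difference (the drift scenario is an illustrative assumption of mine): draws from one fixed biased coin are identically distributed, while draws whose bias changes over time are not.

```python
import random

# Identically distributed: every sample comes from the same biased coin.
biased = [random.random() < 0.7 for _ in range(10)]  # P(heads) = 0.7 throughout

# NOT identically distributed: the underlying distribution drifts per draw.
drifting = [random.random() < 0.7 - 0.05 * i for i in range(10)]

print(biased)    # i.i.d. draws from a single distribution
print(drifting)  # each draw comes from a different distribution
```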

Now that we have a good idea of what I.I.D is, let’s try to understand what makes it so critical in machine learning.

Importance of I.I.D in machine learning

  • Let’s take the example of supervised learning. Here, we split our dataset into a training set & a test set, train our model on the training set, & evaluate its performance on the test set (a quick split sketch follows this list).
    An inbuilt assumption while splitting the data into train-validation-test sets is the assumption of I.I.D. If the distributions of the training and test sets are different, or if there are in-built sampling dependencies, the algorithm won’t be able to generalise once it is deployed/live.
    Another point to note is that the data distribution is also assumed not to change post deployment. If it changes (called dataset shift, caused by a non-stationary environment), we might have to retrain the model or use active learning/online learning techniques to keep our models up to date.
  • The fundamental principle that governs this idea is called Empirical Risk Minimisation (ERM), which is central to many machine learning and data mining algorithms. ERM deserves a separate in-depth article of its own, but in brief it conveys that it is impossible to compute the true risk associated with a hypothesis h mapping feature vectors X to labels Y, since we do not know the true distribution of the complete data the algorithm will work on. Hence, we compute the empirical risk by averaging the loss function over the training data and focus on choosing the hypothesis that minimises this empirical risk (a small ERM sketch also follows this list).
  • The I.I.D assumption is also central to the law of large numbers, which states that the observed average of a large sample will be close to the true population average, and that it gets closer to the true population average as the sample size increases.
  • The I.I.D assumption is also core to one of the most widely used theorems in data science, the central limit theorem (CLT), which is at the heart of hypothesis testing. The CLT states that if we take sufficiently large random samples from a population, the sample means will be approximately normally distributed. As you can notice, the random samples taken cannot be dependent, and the distribution of the random variables cannot change, say, over time (the last sketch after this list simulates both the law of large numbers and the CLT).
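To ground the first bullet, here is a minimal sketch using scikit-learn’s train_test_split (the toy data is my own, not from the article). Shuffling before splitting leans directly on the I.I.D assumption: it only yields comparable train and test distributions when the samples are independent draws from the same distribution.

```python
from sklearn.model_selection import train_test_split

X = list(range(100))       # toy features
y = [x % 2 for x in X]     # toy labels

# shuffle=True implicitly assumes the samples are i.i.d.; for dependent
# samples (e.g. a time series) a shuffled split would leak information.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
print(len(X_train), len(X_test))  # 80 20
```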
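For the ERM bullet, a minimal sketch (the squared-error loss, the toy data, and the tiny hypothesis class are illustrative assumptions of mine): since the true risk over the unknown data distribution cannot be computed, we average the loss over the training samples and pick the hypothesis with the smallest empirical risk.

```python
# Empirical Risk Minimisation over a toy hypothesis class h(x) = w * x.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (x, y) training pairs

def empirical_risk(w: float) -> float:
    """Average squared-error loss of h(x) = w * x on the training data."""
    return sum((w * x - y) ** 2 for x, y in train) / len(train)

# Choose the hypothesis (weight) that minimises the empirical risk.
candidates = [0.5, 1.0, 1.5, 2.0, 2.5]
best = min(candidates, key=empirical_risk)
print(best, empirical_risk(best))
```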
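Finally, a minimal simulation of the law of large numbers and the CLT using i.i.d. die rolls (the sample sizes and counts are arbitrary choices of mine): the sample mean converges to the true mean of 3.5, and the distribution of sample means is approximately normal.

```python
import random
import statistics

# Law of large numbers: the average of n i.i.d. die rolls approaches 3.5.
for n in (10, 1_000, 100_000):
    avg = sum(random.randint(1, 6) for _ in range(n)) / n
    print(f"n = {n:>6}: sample mean = {avg:.3f}")

# Central limit theorem: means of many i.i.d. samples of size 50 cluster
# normally around 3.5 with standard deviation sqrt(35/12) / sqrt(50) ≈ 0.24.
means = [statistics.mean(random.randint(1, 6) for _ in range(50))
         for _ in range(10_000)]
print(statistics.mean(means), statistics.stdev(means))
```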

So, in a way, the I.I.D assumption helps simplify the training of machine learning algorithms by assuming that the data distribution won’t change over time or space and that samples won’t depend on each other in any way. This ultimately lets us restrict training to a subset of the population and then deploy our model to predict on datasets that arrive further down the line.

In the next few articles, we will dive deeper into scenarios where this assumption is broken and how we can make our models more resilient to different kinds of dataset shift.

A few links you can use to dive deeper into this topic:

Data Science Lab

Empirical risk minimization

Independent and identically distributed random variables

Happy to hear your feedback. You can reach me via LinkedIn.

Like my article? Buy me a coffee
