A Deep Dive into Artificial Intelligence

Karan Mario Jude Pinto · Published in DataDrivenInvestor · Nov 11, 2018 · 31 min read


An exploration into the world of Artificial Intelligence (AI) & Machine Learning (ML).

What are Deep Dives (DD) and Why?

DDs are an effort to satisfy a desire that has been tingling in the back of my mind: to simplify and silo the swarm of ideas in my thought pool. These DDs are intended to help organise my mental semantic trees. The long-run aim is to positively influence my habitual behaviour, give me more seamless availability heuristics, and make it easier to build memory palaces.

Articles in the Deep Dive Series:

Deep Dive : The ‘Digital Macrocosm’

Deep Dive : The First Principles of the Economy

Deep Dive : Meeting Mindfulness with Meditation

Why read this Deep Dive & What could you possibly learn?

Almost every move we make, everything we touch and everything we use creates data. ‘Data’ today occupies the position that ‘coal’ occupied yesterday: like coal, if data is harnessed properly it can yield insights that can be used in many ways, from improving the performance of products to tailoring the experience of services, from identifying problems in systems predictively or prescriptively to influencing your behaviour through targeted marketing. On its own, though, data is useless. You need something that can help you make that data valuable. That something is Machine Learning (ML). ML is used in web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications.

One of the reasons I drafted this post is that many people know the ‘Machine Learning’ buzzword and use it widely in conversation without fully grasping the fundamentals of what it actually is!

“Through 2023, one-third of all highly skilled work done by doctors, lawyers, traders, and professors will be replaced by smart machines or by less skilled (non-specialist) humans assisted by cognitive computing technology.” — Gartner

“By 2030, 90% of jobs as we know them today will be replaced by smart machines.” — Gartner

AI & ML are the core drivers behind the two quotes above. It also makes me cringe when people use AI or ML without knowing much about them, just to associate themselves with something ‘innovative’ or ‘disruptive’! As someone who wants to be an ‘expert generalist’, fascinated by learning things from scratch and making sense of everything, I hope this post helps you get to grips with the fundamentals. What follows is an attempt to understand and then explicitly explain; it is not purely distilled theory but also my own interpretation of ML. Please comment if there are any discrepancies or inconsistencies in my understanding!

Deep Dive Structure

This is a long read! I suggest saving this article and jumping in and out!

  • Data → Where it all begins
  • AI & ML → What are they & Why are they important?
  • Types of ML
  • Deep Machine Learning
  • Basic Key ML Terminology
  • ML Algorithm Fundamentals → Representation, Optimisation and Evaluation
  • Supervised Learning Algorithms → Linear Regression, Decision Trees, K Nearest Neighbours
  • Important ML Concepts → Generalisation, Over-fitting, Regularisation & Cross Validation
  • Important ML Lessons & Principles

DATA

A data set is a collection of data, and a data point is any measurement that serves as an input or an output. Due to the large Volume, diverse Variety and high Velocity of data, an effective and efficient ‘something’ is needed to gain insight from it. That something is AI & ML. Before we proceed, I would like you to understand that there are two types of data:

  • Structured Data → This is highly organised information that you can enter into a relational database (rows and columns). This data is easy to digest, shape, search through and analyse. A spreadsheet is an example.
  • Unstructured Data → This is data that is hard to organise and analyse as it cannot easily be understood. About 80 to 90% of an enterprise’s data is unstructured. Emails, web pages, videos, audio and social media messages are examples. The amount of unstructured data around us is proliferating!

AI & ML can help us derive insight from not only structured data but also unstructured data! This is incredibly valuable!

Artificial Intelligence (AI) is the theory and development of computer systems such that they are able to perform tasks that would normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.

Artificial Intelligence consists of multiple technologies that enable computers to:

  • Sense and Perceive: recognize objects in images and words in sounds.
  • Comprehend, Know: represent knowledge, understand relationships and causality.
  • Reason: draw conclusions from facts and rules of inference and plan actions.
  • Learn: acquire new knowledge from examples or experience.
  • Communicate: understand, generate, and analyze natural language.
  • Plan and Act: control robots, drive cars or fly drones.

Artificial Intelligence is nothing new; the field was born in the years 1940–1956. Initial research focused on “the science aiming to create intelligent machines that are as capable as humans.” Recently, the focus has shifted to “intelligent systems that can support humans in their activities.” In other words, making humans super, not making superhumans!

AI in Daily Lives:

  • Facebook: predicting possible friends.
  • Netflix: predicting possible movies you might like to watch.
  • Google/Bing search: retrieving a subset of relevant documents from a large unstructured dataset, for example, providing the five most relevant documents in a search on the internet.
  • Amazon: providing you with a selection of items inspired by your browsing history, items recommended for you based on previous orders, items inspired by your wish list, and items other people have bought that you may like.
  • GPS systems in motor vehicles: cutting through the complexity of millions of routes to provide you with the best one to take.

Difference between ML and Artificial Intelligence (AI):

  • AI usually concentrates on programming computers to make decisions (based on ML models and sets of logical rules)
  • ML focuses more on making predictions about the future.
  • They are highly interconnected fields, and, for most non-technical purposes, they are the same.

Why Machine Learning (ML)?

Machine Learning (ML) is used to solve problems for which it is really hard to write an explicit computer program.

Recognising handwritten digits

For example, say you want to program a computer to recognize hand-written digits, you would need to devise a set of rules to distinguish each individual digit. Zeros, for instance, are basically one closed loop. But what if the person didn’t perfectly close the loop? What if the right top of the loop closes below where the left top of the loop starts? In this case, we have difficulty differentiating zeroes from sixes. We could possibly establish some sort of cutoff, but how would you decide the value/range of the cutoff in the first place? As you can see, it quickly becomes quite complicated to compile a list of heuristics (i.e., rules and guesses) that accurately classifies handwritten digits.

Now, many more classes of problems fall into this category like:

  • Recognising Objects
  • Understanding Concepts

So instead of trying to write a program, we try to develop an algorithm that a computer can use to look at hundreds or thousands of examples (and the correct answers), and then the computer uses that experience to solve the same problem in new situations. Essentially, our goal is to teach the computer to solve by example, very similar to how we might teach a young child to distinguish a cat from a dog.

What is Machine Learning?

ML is a field of study which harnesses principles of computer science and statistics to create statistical models.

These models are generally used to do two things:

  • Prediction: make predictions about the future based on data about the past
  • Inference: discover patterns in data

What’s a statistical model?

Teaching a computer to make predictions involves feeding data into machine learning models, which are representations of how the world supposedly works. If I tell a statistical model that the world works a certain way (say, for example, that taller people make more money than shorter people), then this model can then tell me who it thinks will make more money, between Jeff, who is 5’2”, and Rose, who is 5’9”.

What does a model actually look like though? Surely the concept of a model makes sense in the abstract, but knowing this is just half the battle. You should also know how it’s represented inside of a computer, or what it would look like if you wrote it down on paper.

A model is just a mathematical function, which, as you probably already know, is a relationship between a set of inputs and a set of outputs. Here’s an example:

f(x) = x²

This is a function that takes as input a number and returns that number squared. So, f(1) = 1, f(2) = 4, f(3) = 9.

Let’s briefly return to the example of the model that predicts income from height. Let’s assume that a given human’s annual income is, on average, equal to their height (in inches) times £1,000. So, if you’re 60 inches tall (5 feet), then I’ll guess that you probably make £60,000 a year. If you’re a foot taller, I think you’ll make £72,000 a year.

This model can be represented mathematically as follows:

Income = Height × £1,000

In other words, income is a function of height.

ML refers to a set of techniques for estimating functions (like the one involving income) based on data sets (pairs of heights and their associated incomes). These functions, which are called models, can then be used for predictions of future data.
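
To make this concrete, here is a minimal sketch of “estimating a function from data” in Python; the heights, incomes and the use of numpy’s polyfit are illustrative assumptions of mine, not part of the original example.

```python
import numpy as np

# A few hypothetical (height in inches, annual income in £) observations.
heights = np.array([60, 63, 66, 69, 72])
incomes = np.array([58000, 64000, 65500, 70000, 73000])

# Estimate the model Income = w1 * Height + w0 from the data
# (a degree-1 least-squares fit).
w1, w0 = np.polyfit(heights, incomes, deg=1)

def predict_income(height_inches):
    """The learned model: a plain mathematical function."""
    return w1 * height_inches + w0

print(predict_income(66))  # prediction for an unseen 66-inch (5'6") person
```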

The predictions made by predictive models are intended to aid human decision making in complex domains where only historical observations are available and controlled experiments are not possible.

What are Algorithms?

These functions or models are estimated using algorithms. In this context, an algorithm is a predefined set of steps that takes as input a bunch of data and then transforms it through mathematical operations. You can think of an algorithm like a recipe — first do this, then do that, then do this. Done.

ML of all types uses models and algorithms as its building blocks to make predictions and inferences about the world.

What exactly is being learnt?

To explain what is being learnt in machine learning, let’s start with an example application: spam classification. One approach to writing a computer program that separates spam emails from non-spam emails is to split each email into individual words and maintain a list of words that appear more frequently in spam. For example, some such words might be ‘loan’, ‘$’, ‘credit’, ‘discount’, ‘offer’, ‘password’, ‘Viagra’, and so on. Then, if an email contains a substantial number of these words, it is classified as spam.

Although the strategy above might give fairly good results (say detect spam with an accuracy of 80%), the accuracy depends in large part on the list of words we maintain, and on the precise threshold we choose to classify an email as spam.

In machine learning, the strategy is to learn the list of words and the threshold from examples. In fact, in addition to which words are bad words, we could also learn how bad each word is. (This example is quite realistic, and is how many spam classification algorithms work.)

So in this case, the thing being learnt is a notion of how bad each word is. Note that this is not the only way to frame the problem; we framed it this way because we noticed that spam emails often contain specific words, and we then came up with a strategy that treats every possible word as a suspect. This strategy might give inaccurate results for other tasks, or be too inefficient.
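
To make “learning how bad each word is” concrete, here is a minimal sketch assuming a tiny hand-made data set and a simple frequency-ratio score; real spam filters use more careful statistics, but the shape of the idea is the same.

```python
from collections import Counter

# Tiny illustrative training set: (email text, is_spam).
emails = [
    ("cheap loan offer click now", 1),
    ("discount credit card offer", 1),
    ("meeting notes attached see agenda", 0),
    ("lunch tomorrow let me know", 0),
]

spam_counts, ham_counts = Counter(), Counter()
for text, is_spam_label in emails:
    for word in text.lower().split():
        (spam_counts if is_spam_label else ham_counts)[word] += 1

def badness(word):
    # "How bad is this word?" -- learnt from the data, not hand-written rules.
    return (spam_counts[word] + 1) / (ham_counts[word] + 1)

def looks_like_spam(text, threshold=1.5):
    words = text.lower().split()
    score = sum(badness(w) for w in words) / len(words)
    return score > threshold

print(looks_like_spam("exclusive loan offer"))    # True: 'loan' and 'offer' score as bad words
print(looks_like_spam("see you at the meeting"))  # False
```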

Desirable properties of machine learning

You might notice that using machine learning to learn how bad each word is has several desirable properties compared with maintaining the list manually.

It reduces the amount of manual work involved in creating the list.

Think about how long this list could get if you try to do this manually. Also, if you’re trying to maintain the list manually, how would you deal with hundreds of languages across the world? This task can easily become infeasible without machine learning.

The same strategy works for other similar tasks.

Say we wanted to classify whether a movie review speaks positively or negatively about a movie. If we were creating word lists manually, we would have to build a new list for this task. But if we learn the list, the same algorithm works, provided we already have some data (say, ratings and reviews left by users on IMDb).

It updates automatically.

Let’s say tomorrow the spammers become more advanced and start typing the word ‘password’ as ‘passw0rd’. Or they might try to sell you insurance, something we haven’t yet encountered. We can simply set the machine learning algorithm to be trained daily, and it will use the new data available and keep adapting over time to changing behaviour.

What are the different types of ML?

The algorithm learns from the data set to produce a model. Based on this learning process, ML algorithms are classified into the following types:

Supervised Learning

Here, the algorithm learns from a training data set of labelled data. Labelled data means that every example in the training data set is tagged with the answer the algorithm should come up with on its own. So, a labelled data set of flower images would tell the model which photos were of roses, daisies and daffodils. When shown a new image, the model compares it to the training examples to predict the correct label.

In supervised learning, we feed the algorithm both features and labels. Consider the problem of classifying network packets as malicious or non-malicious. Here the features could be attributes of the packet such as source IP, destination IP, port, protocol, payload length, flags, etc., and the label could be 0 or 1 depending on whether the packet is malicious or not.

So you have a set of inputs or features (X), such as an image, and the model predicts a target output variable y (e.g. a caption for the image): y = f(X).

Features are independent variables and Targets are dependent variables

2 Types of Supervised Learning Problems:

Classification — when the output variable is a category.

  • Spam filtering: Is an email spam or not?
  • Image classification: Given an image, output which objects are present in the image (dog, cat, computer, building, so on)

Let us look at a classic image classification problem → classifying the Iris flower into its species. The data set contains three classes of Iris flower (Iris setosa, Iris versicolor and Iris virginica), each described by features such as sepal and petal dimensions (length and width). The final aim is to build a model which learns from these features and predicts the type of Iris flower when the features are provided as input.

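
As an illustrative sketch of this workflow in Python (the choice of scikit-learn and of logistic regression as the classifier is mine, not something the Iris problem prescribes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Features: sepal/petal length and width; labels: setosa, versicolor, virginica.
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)  # the classifier we will learn
model.fit(X, y)                            # learn from the labelled examples

# Predict the species for a flower from its four measurements.
print(model.predict([[5.0, 3.4, 1.5, 0.2]]))  # -> [0], i.e. Iris setosa
```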

Regression — when the output variable is a real value.

  • Given information about a house, predict its price
  • Netflix: Given a user and a movie, predict the rating the user is going to give to the movie

Unsupervised Learning

You have input data X, but no output variable Y. The goal is to model the distribution of the data and identify patterns in it.

2 Types of Problems:

Clustering — To discover inherent groupings in the target data

  • Given a list of customers and information about them, discover groups of similar users. This knowledge can then be used for targeted marketing.
  • Anomaly detection: Given measurements from sensors in a manufacturing facility, identify anomalies, i.e. that something is wrong

Association — To discover rules that describe a portion of the data

  • Discover patterns in data such as whenever it rains, people tend to stay indoors. When it is hot, people buy more ice-cream.
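
To make the clustering example above concrete, here is a minimal k-means sketch; the customer features and the choice of k-means via scikit-learn are illustrative assumptions on my part.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by (age, yearly spend in £).
customers = np.array([
    [22,  300], [25,  350], [27,  400],    # younger, lower spend
    [45, 2500], [48, 2700], [52, 3000],    # older, higher spend
])

# No labels are given; k-means discovers the groupings by itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the "typical" customer in each group
```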

Reinforcement Learning

The input is not given to us, but depends on the actions we take. The robot (agent) gets rewarded for taking certain actions in its environment.

Examples:

  • Robotics: A robot is in a maze, and it needs to find a way out.
  • Training an AI for a complex game such as Civilization or Dota

For example, let’s say our goal is to make a tube train get to Caledonian Road. We input this desired outcome into the function, which can move the train north or south. The function evaluates the current state (the station the train is currently at), compares it with the desired outcome, and takes an action (go north or go south). Say the current state is Covent Garden and the function moves the train south to Leicester Square. The function is rewarded if the action moves it towards the goal and punished if it does not. Since this action moved the train south to Leicester Square, away from the desired outcome (Caledonian Road), it is punished, so the next action will be to move back towards Covent Garden, and so on.
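
Here is a minimal sketch of that reward-driven loop, using a toy Q-learning agent on an imaginary five-station line; the stations, rewards and hyper-parameters are all illustrative assumptions, not a model of the actual tube.

```python
import random

# A toy one-dimensional "tube line": stations 0..4, goal = station 4
# (think of station 4 as Caledonian Road). Actions: 0 = go south, 1 = go north.
N_STATIONS, GOAL = 5, 4
Q = [[0.0, 0.0] for _ in range(N_STATIONS)]    # Q[state][action]: learned value estimates
alpha, gamma, epsilon = 0.5, 0.9, 0.1          # learning rate, discount, exploration rate

for episode in range(500):
    state = 0                                   # start at the southern end of the line
    while state != GOAL:
        # Explore occasionally, otherwise take the best-known action.
        if random.random() < epsilon:
            action = random.randint(0, 1)
        else:
            action = max((0, 1), key=lambda a: Q[state][a])
        next_state = min(N_STATIONS - 1, max(0, state + (1 if action == 1 else -1)))
        reward = 1.0 if next_state == GOAL else -0.1   # reward reaching the goal, punish wandering
        # Q-learning update: nudge the estimate towards reward + discounted future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

# The learned policy: at every station before the goal, the best action is 1 ("go north").
print([max((0, 1), key=lambda a: Q[s][a]) for s in range(GOAL)])
```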

Other Types of Learnings

  • Transfer Learning → knowledge learnt for one task (say, the alphabet) on one model is reused to help another model learn a related task (say, sentences)
  • Semi-supervised Learning → some data is labelled and some is not.

Deep Learning

Deep Machine Learning is a machine learning technique that allows machines to access cognitive domains previously only accessible to humans, such as image recognition, text understanding and audio recognition. The learner algorithms are inspired by the structure and function of the human brain’s biological neural networks.

The main difference between deep learning and machine learning is that deep learning models have a notion of multiple layers, or multiple levels of hierarchy, which opens up the possibility of learning models for more complicated tasks.

Deep learning architectures are designed with multiple layers with the intuition that the lower to higher layers will automatically learn to model lower to higher level of abstractions.

Standard Neural Network

A machine whose learner algorithm mimics a biological neuron. Its output is a function of the following parameters:

  • Inputs
  • Weights
  • Bias terms
  • Non-linearity parameter

The machine tunes the weights in the learner algorithm in response to the changing input data to stimulate learning.
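
A minimal sketch of such a neuron in Python; the specific inputs, weights, bias and the choice of a sigmoid non-linearity are illustrative assumptions.

```python
import numpy as np

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias, passed through a non-linearity (sigmoid).
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])    # inputs
w = np.array([0.8,  0.1, -0.4])   # weights: these are what get tuned during learning
b = 0.2                           # bias term
print(neuron(x, w, b))            # the neuron's output, a value between 0 and 1
```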

Deep Neural Networks

Data is passed through a hierarchy of hidden interconnected processing layers to output appropriate representations from raw input data. Increasing the number of hidden layers increases the depth of the machine. The greater the autonomous tuning, the more hidden the layers become to humans.
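
Continuing the neuron sketch above, stacking several such layers gives a (very) simplified deep forward pass; the layer sizes and random weights below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]        # input -> two hidden layers -> output

# One weight matrix and one bias vector per layer (randomly initialised here,
# i.e. before any training has happened).
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    # Each layer transforms the previous layer's output into a new representation.
    for W, b in zip(weights, biases):
        x = np.maximum(0.0, W @ x + b)   # ReLU non-linearity
    return x

print(forward(rng.standard_normal(4)))   # activations of the final layer
```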

  • Deep Learning expands machine learning through the ability to discover intermediate representations in the data, which allow more complex problems to be solved with higher accuracy, fewer observations and less manual labour.
  • Best used when there is a large data set of convoluted data or input sources (e.g. images and videos); a high number of raw inputs, each with little or no meaning on its own, that send a ‘weak labelling signal’; and a low level of human analytical insight.
  • Allows us to use previously intractable data types for ML, such as images, speech and video.
  • It is the technique that transformed Google Home and Amazon’s Alexa from comical to capable.
  • One of the big challenges with traditional machine learning models is a process called feature extraction. Specifically, the programmer needs to tell the computer what kinds of things it should be looking for that will be informative in making a decision. Feeding the algorithm raw data rarely ever works, so feature extraction is a critical part of the traditional machine learning workflow. This places a huge burden on the programmer, and the algorithm’s effectiveness relies heavily on how insightful the programmer is. For complex problems such as object recognition or handwriting recognition, this is a huge challenge.
  • Deep learning, with the ability to learn multiple layers of representation, is one of the few methods that has enabled us to circumvent feature extraction. We can think of the lower layers as performing automatic feature extraction, requiring little guidance from the programmer.
  • Applications: computer vision, language translation, image captioning, audio transcription, molecular biology (predicting protein interactions).

Basic Key Terminology

Training sample: A training sample is a data point x in an available training set that we use for tackling a predictive modelling task. For example, if we are interested in classifying emails, one email in our data set would be one training sample. Sometimes, people also use the synonymous terms training instance or training example.

Target function: In predictive modelling, we are typically interested in modelling a particular process; we want to learn or approximate a particular function that, for example, lets us distinguish spam from non-spam email. The target function f(x) = y is the true function f that we want to model. It helps us achieve the end goal.

Hypothesis: A hypothesis is a certain function that we believe (or hope) is similar to the true function, the target function that we want to model. In context of email spam classification, it would be the rule we came up with that allows us to separate spam from non-spam emails. It takes us closer to achieving our end goal.

Hypothesis Function (best guesses) ~ True or Target Function

Model: In the machine learning field, the terms hypothesis and model are often used interchangeably. In other sciences, they can have different meanings: the hypothesis would be the “educated guess” by the scientist, and the model would be the manifestation of this guess that can be used to test the hypothesis.

Model = Hypothesis

Learning algorithm or Learner: Again, our goal is to find or approximate the target function, and the learning algorithm is a set of instructions that tries to model the target function using our training data set. A learning algorithm comes with a hypothesis space, the set of possible hypotheses it can come up with in order to model the unknown target function by formulating the final hypothesis. It helps us define our hypothesis space, and arrive at our goal of the true or target function.

Classifier: A classifier is a special case of a hypothesis (nowadays, often learned by a machine learning algorithm). A classifier is a hypothesis or discrete-valued function that is used to assign (categorical) class labels to particular data points. In the email classification example, this classifier could be a hypothesis for labelling emails as spam or non-spam. However, a hypothesis is not necessarily synonymous with a classifier. In a different application, our hypothesis could be a function for mapping study time and educational backgrounds of students to their future SAT scores. So a classifier is a function that assigns a class label to a data point.

A classifier is a system that inputs a vector of discrete or continuous feature values and outputs a discrete value, a class.

An input variable is also called a feature.

Dimensions in a data set are called features, predictors, or variables. Adding another dimension allows for more nuance.

Weight (like a slope) — the strength of a connection: if I increase the input, how much influence does it have on the output? Weights near zero mean changing this input will not change the output. Many algorithms will automatically set such weights to zero in order to simplify the network.

Bias (like an intercept) — a measure of how far off our predictions are from the real values. Parametric algorithms generally have a high bias, making them fast to learn and easier to understand, but less flexible. In turn, they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm’s bias.

  • Low Bias: suggests fewer assumptions about the form of the target function. The predicted values are closer to the real values.
  • High Bias: suggests more assumptions about the form of the target function. The predicted values are further from the real values.

Machine Learning Algorithms Fundamentals

So now that you understand certain important basic terminology that was essential before we proceeded, let us delve into the fundamentals of ML algorithms, stringing together what we have learnt.

A learner algorithm or learner inputs a training set of examples of the observed input and the corresponding output, and outputs a final hypothesis (classifier or regressor).

The idea is that a classifier or regressor is a program built by a learner.

The test of the learner is whether this classifier or regressor that it has outputted based on the training data, produces the correct output for future examples.

An illuminating intuition comes from one of the many definitions of machine learning: ‘programming with data’. So you have data (a training set), and from that data, using a computer program, you build another program (for example a decision tree produced by a supervised learning algorithm). The program which builds the decision tree from the data is the learner. The decision tree is the classifier, because a classifier is a program which is able to predict: it takes only the input data and, for each instance, produces the output.

An alternative way to understand this is that a learner takes the input x1,x2,..,xp,y and produces a classifier. A classifier takes as input x1′,x2′,..,xp′ and produces y′.

In research papers the distinction between learners and classifiers is hard to find. It seems that the researchers are interested only in how to describe the model. When they come to describe how to build that model, then they talk about a learner, and when they talk about how to predict with that model, then they talk about a classifier.

The function of fitting a model is the function of a learner, while the function of predicting values is a function of a classifier.

Note that a regressor is the same as a classifier except for the nature of the output: it produces a continuous (real) value instead of a category.

So suppose you have an application that you think machine learning might be good for. The first problem facing you is the bewildering variety of learning algorithms available. Which one should you use? There are literally thousands available, and hundreds more are published each year. The key to not getting lost in this huge space is to realise that all machine learning algorithms have three components:

Learning = Representation + Evaluation + Optimisation

  • Representation → what the model looks like
  • Evaluation → how do we differentiate good models from bad ones
  • Optimisation → what is our process for finding the good models among all the possible models

Representation

Representation is important because a classifier/regressor function must be represented in some formal language that the computer can handle. Choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn.

  • This set of allowed models is called the hypothesis space, and it is how you look at your data.
  • It contains all learnable classifiers/functions. If a function is not in the hypothesis space, it cannot be learnt.
  • For a learner, choosing the classifiers/functions that it can possibly learn defines the representation the solution will take.
  • Example: Sometimes you may want to think of your data in terms of individuals (like in k-nearest neighbours) or like in a graph (like in Bayesian networks).

Key considerations: Is the scenario you are trying to capture well represented by the model function? Is it overly restrictive? Is it overly flexible? For example, if the data has a quadratic trend, but we are trying to fit a linear function to it, we are being overly restrictive.

Evaluation

An evaluation function (also called objective function or scoring function) is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimise.

An evaluation or objective function (or cost function) is needed to evaluate your ML model by distinguishing good classifiers from bad ones.

  • In essence, it is how you judge or prefer one model over another. I think of it like the utility function in economics.
  • For supervised learning purposes, it helps you evaluate, or put a score on, how well your learner is doing so that it can improve.
  • Example: in least-squares linear regression, the evaluation function is the mean-squared-error cost function.
  • Other examples include accuracy, squared error, likelihood and information gain.
  • Key considerations: does your cost function capture the relative importance of different kinds of mistakes? For example, is being off by 0.3 for one data point and 0.1 for another better or worse than being off by 0.2 for both? Is a false positive as bad as a false negative?
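
As a small sketch of the mean-squared-error idea mentioned above (the numbers are made up):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between targets and predictions.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

# Two candidate models scored on the same data: the lower score is the better model.
print(mean_squared_error([3.0, 5.0, 7.0], [2.8, 5.1, 7.4]))   # small error -> good fit
print(mean_squared_error([3.0, 5.0, 7.0], [1.0, 9.0, 4.0]))   # large error -> poor fit
```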

Optimisation

Optimisation is the process, or algorithm, for finding the best model in the hypothesis space: the method by which we search among candidate classifiers for a good one.

It is using the evaluation function, to find the learner with the best score from this evaluation function using a choice of optimisation technique. Examples are greedy search and gradient descent.

Finally, we need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimisation technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimisers, which are later replaced by custom-designed ones.

  • Internally, this is how the algorithm traverses the hypothesis space, guided by the evaluation function, to arrive at a final classifier capable of making accurate predictions.

Given a representation of a machine learning model with parameters (weights and biases), and an evaluation or cost function to judge how good a particular model is, our learning problem reduces to finding the set of weights that minimises the cost function.

For example, given the linear regression model (Representation) and the cost function (Evaluation), we can use Gradient Descent (Optimisation) to find a good set of values for the weight vector.

  • Gradient Descent is an iterative method and is one of the most popular and widely used optimisation algorithms
  • Batch gradient descent computes the gradient of the cost function with respect to the parameters over the entire training data before each update. Since we need the gradients for the whole data set to perform one parameter update, batch gradient descent can be very slow. In mini-batch gradient descent, we calculate the gradient for each small mini-batch of training data: we first divide the training data into small batches (say M samples per batch) and perform one update per mini-batch. M is usually in the range 30–500, depending on the problem.
  • Stochastic gradient descent (SGD) computes the gradient for each training example i.e. a single training data point is used for each update.
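
A minimal batch gradient descent sketch for a linear model y = w * x + b with the mean-squared-error cost; the toy data, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

# Toy data roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.0, 4.9, 7.2, 9.1])

w, b, lr = 0.0, 0.0, 0.05           # initial parameters and learning rate

for step in range(2000):
    y_pred = w * x + b
    error = y_pred - y
    # Gradients of the mean-squared-error cost with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Move the parameters a small step against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # should end up close to 2 and 1
```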

Key considerations:

  • How efficient is the optimisation technique in practice?
  • Does it always find the optimal solution? Is it possible for it to output sub-optimal solutions? If yes, how often does it happen in practice?

Algorithms for Supervised Learning

Below are 3 Supervised learning algorithms to better explain the concept of Represent, Evaluate and Optimise. This may be a bit more technical, so feel free to skip this.

Note that algorithmic performance depends upon the type of data, where deciding factors would include total number of dimensions in input data, whether the data is text or numerical or a time series, whether or not the data is sparse, size of the data set, and so on.

Linear Regression Model → For simple regression problems

  • In general, a line can be represented by the linear equation y = m * X + b, where y is the dependent variable, X is the independent variable, m is the slope and b is the intercept.
  • In machine learning, we rewrite this equation as y(x) = w0 + w1 * x, where the w’s are the parameters of the model, x is the input, and y is the target variable.

Simple linear regression

  • Establish a relationship between target variable (y) and input variables (x) by fitting a line, known as the regression line.

Multiple linear regression

  • Regression on data sets which have multiple input variables/features is known as multiple linear regression.

Cost Function

  • To determine how well a particular line fits our data set. Or, given two lines, to determine which one is better.
  • It measures the difference between the true values (from the data set) and the estimated values (the predictions).
  • A cost function measures, for a given set of values of the w’s, how close the predicted y’s are to the corresponding true y’s. That is, how well does a particular set of weights predict the target value (y)?
  • For linear regression, we use the mean-squared-error cost function: the average, over the data points (xi, yi), of the squared error between the predicted value y(xi) and the true target value yi.

Residuals

  • The cost function defines a cost based on the distance between true target and predicted target (the distance between the sample points and the regression line), also known as the residual.
  • If a particular line is far from all the points, the residuals will be higher, and so will the cost function. If a line is close to the points, the residuals will be small, and hence the cost function will be small too and this is desirable.

FYI — In Linear Regression, How do you find the line?

Start by drawing a random line. Let’s say the error is the distance of the line from the data points. Re-draw the line iteratively to decrease this error; this process is called gradient descent. As we do not want negative distances, we square the distances before summing them, which is why this error-reduction method is called least squares.

K-Nearest Neighbours (KNN)

It is a supervised learning algorithm which can be used for both classification and regression.

  • No training phase is required, as the data set itself is the model

Classification: To classify a new data point

  • Find the k points in the training data closest to it, where ‘closest’ is defined by a suitable distance metric such as Euclidean, Manhattan or Minkowski distance.
  • Make a prediction based on whichever class is most common among these k points (i.e. we simulate a vote).

Regression: When target variable is a real value

  • We take the average of the K nearest neighbours.

Tuning the hyper-parameter K

  • In machine learning, a hyper-parameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.
  • A small value of k means that noise will have a higher influence on the result, while a large value makes the algorithm computationally expensive. Usually, we perform cross-validation to find the best k value (or to choose the value of k that best suits our accuracy/speed trade-off). If you don’t want to try multiple values of k, a rule of thumb is to set k equal to the square root of the total number of data points.
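
A minimal KNN classification sketch in plain numpy, following the steps above; the toy points and the choice of k = 3 are mine.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Compute the distance from the new point to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Take the labels of the k closest training points.
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # 3. "Vote": return the most common label among the neighbours.
    return np.bincount(nearest_labels).argmax()

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # class 0 cluster
                    [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])  # class 1 cluster
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.0])))  # -> 1
```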

Decision Trees

  • A supervised learning algorithm that can be used for classification as well as regression problems
  • Decision trees look at one variable at a time and are a reasonably accessible (though rudimentary) machine learning method.
  • A decision tree resembles a flow-chart, and is easy to interpret.
  • The decision tree algorithm works by recursively splitting the data based on the value of a feature. After each split, each portion of the data becomes more and more homogeneous, until the samples in a node are mostly of the same class.
  • A decision tree uses if-then statements to define patterns in data. In ML, these statements are called forks, and they split the data into two branches based on some value.

How do you construct a Decision Tree?

Let’s assume that we have a metric that defines how impure a data set is. The following is pseudo-code for constructing the decision tree:

  • Start with all the data in one node
  • Split the data set into two parts, A and B, based on the feature that results in the largest purity gain (or impurity reduction).
  • Repeat this splitting process on each child node until we get nodes which are pure, i.e. they contain samples of a single class, or some other stopping criterion is met.
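
In practice you would rarely write this recursion yourself; here is a hedged sketch using scikit-learn’s DecisionTreeClassifier, which implements this kind of impurity-based splitting, on the Iris data from earlier (the depth limit is an illustrative stopping criterion):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth limits how many times we keep splitting (a simple stopping criterion).
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Print the learnt if-then "forks": each line splits on a feature value.
print(export_text(tree, feature_names=iris.feature_names))
```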

Principle : Parametric and non-parametric models

  • In a parametric model, we continuously update a fixed number of parameters to learn a function which can classify a new data point without requiring the training data. For example: logistic regression.
  • In a non-parametric model, the number of parameters grows with the size of training data. For example — K Nearest Neighbours.

Important Concepts

Generalisation

The ability of a model to perform well on unseen data is called generalisation, and it is a desirable characteristic we want in a model.

  • Generalisation counts and is the fundamental goal
  • Generalise beyond the training data set!

Over-fitting

  • When a model performs well on training data (the data on which the algorithm was trained) but does not perform well on test data (new or unseen data), we say that it has over-fit the training data or that the model is over-fitting. This happens because the model learns the noise present in the training data as if it were a reliable pattern.
  • Over-fitting is learning the random fluctuations in the training data. It can be recognised when accuracy is high on the training data and low on the test data set.
  • Conversely, when a model performs poorly on the training data (i.e. it fails to capture the patterns present in it) as well as on unseen data, it is said to be under-fitting.

The chances of over-fitting are increased by:

  • A smaller data set — As it is much tougher to separate reliable patterns from noise
  • Increasing the complexity of a model

Strategies to solve the over-fitting ML problem:

Regularisation

Regularisation artificially discourages a complex explanation of how the world works (model).

Even if a more complex model fits what has been observed better, it is unlikely to generalise as well to the future. Complexity is penalised because, even though a complex model might fit the data perfectly, it may not generalise well to new cases; a simpler model that reasonably fits the training data will often be more correct for cases not in the sample data.

Types of Regularisation

  • L1 regularisation, or Lasso regression, where the penalty term is λ ||w||₁, i.e. proportional to the sum of the absolute values of the weights
  • L2 regularisation, or Ridge regression, where the penalty term is λ ||w||₂², i.e. proportional to the sum of the squared weights
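
A short sketch of both penalties using scikit-learn; the synthetic data and the alpha values (alpha plays the role of λ) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
# Only the first two features actually matter; the other eight are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(50)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: pushes irrelevant weights towards exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all weights towards zero

print(np.round(lasso.coef_, 2))      # mostly zeros apart from the first two weights
print(np.round(ridge.coef_, 2))      # small but generally non-zero weights everywhere
```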

Cross Validation

Cross Validation is a method for finding the best hyper-parameters of a model. (A hyper-parameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.)

Holdout Method for Cross Validation

  • For example, in gradient descent, we need to choose a stopping criteria. The simplest stopping criteria is to check whether our accuracy is improving on the training data set. However, this is prone to over-fitting since the model might be capturing noise present in the training data as reliable patterns.
  • We can overcome this problem by not using the entire training data while training a model. Instead we will hold out some data (a validation data set) and we’ll train only on remaining data. For example, we can split our training data set into 70/30 and use 70% data for training and 30% data for validation.
  • In the above example of gradient descent, now we train our algorithm on the training data, but check whether or not our model is getting better on the validation data set. This is known as the holdout method and it is one of the simplest cross validation methods.
  • We can also use the validation data for other types of experimentation. Such as if we want to run multiple experiments where we choose different features to use to train our machine learning model.
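
The 70/30 holdout split described above, sketched with scikit-learn; the Iris data is just a convenient stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data purely for validation; train only on the remaining 70%.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_val, y_val))   # accuracy on data the model never saw during training
```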

K-fold Cross Validation

  • In K-fold cross validation, the data set is divided into k separate parts. We repeat training process k times. Each time, one part is used as validation data, and the rest is used for training a model. Then we average the error to evaluate a model. Note that k-fold cross validation increases the computational requirements for training our model by a factor of k.
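
And a sketch of 5-fold cross-validation, again with scikit-learn and the Iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Train 5 times, each time validating on a different fifth of the data.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged to evaluate the model
```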

Advantages:

  • It is more robust to over-fitting than the holdout method when performing a large number of experiments.
  • It is better to use when the data set is small, because we can use a much smaller validation split (say 10% instead of 30%): across the k folds, every data point eventually gets a turn in the validation split.
  • Leave-one-out cross validation is a special instance of k-fold cross validation in which k is equal to the number of data points in the dataset. Each time, we hold out a single data point and train a model on rest of the data. We use the single data point to test our model. Then we calculate the average error to evaluate a model.

Important ML Lessons & Principles

  • Simplicity does not imply accuracy → Complexity can be thought of as the size of the hypothesis space for each classifier, that is, the space of possible classifiers the algorithm could generate. A larger hypothesis space that is sampled less thoroughly can produce a classifier that is less likely to have over-fit the training data.
  • Focus on whether the target function can be learned, not just on whether it can be represented.
  • Correlation does not imply causation → Classifiers can only learn correlations. They are statistical in nature.
  • Keep training data separate from evaluation/ test data
  • You have to use training error as a proxy for testing errors
  • Do not contaminate the training process with test data
  • Use cross validation on the training set and create a hold out set for final validation
  • Data alone is not enough, regardless of how much you have
  • Every learner must embody some knowledge or assumptions beyond the data it’s given in order to generalise beyond it.
  • Assumptions create Biases
  • Limit dependencies and complexity

ML cannot get something from nothing, it gets more from less.

Induction → a small amount of data is turned into a large amount of knowledge (induction > deduction). For example:

  • If we have a lot of knowledge about what makes examples similar in our domain, we could use instance-based methods.
  • If we have knowledge about probabilistic dependencies we could use graphical models.
  • If we have knowledge about what kinds of preconditions are required by each class we could use rule sets.

Break down generalisation errors:

  • BIAS — the tendency of the learner to consistently learn the same wrong thing (in the classic dartboard picture, high bias means the darts land far from the centre). A linear learner has high bias because it is limited to separating classes with a hyperplane.
  • VARIANCE — the tendency to learn random things irrespective of the real signal (high variance means the darts are widely scattered). Decision trees have high variance because they are highly influenced by the specifics of the training data.
  • Sometimes strong false assumptions (read: high bias) can be better than weak true ones, which explains why naive Bayes, with its strong independence assumptions, can do better than a powerful decision-tree learner like C4.5 that needs more data to avoid overfitting.
  • Noise is not the only reason for over-fitting but it can aggravate the problem

Intuition Fails In High Dimensions — The curse of dimensionality is a problem in ML

  • Generalising correctly becomes exponentially harder as the dimensionality (number of features) becomes large.

ML algorithms depend on similarity-based reasoning which breaks down in high dimensions as a fixed-size training set covers only a small fraction of the large input space.

  • Moreover, our intuitions from three-dimensional space often do not apply to higher-dimensional spaces: in high dimensions, all examples look alike. So the curse of dimensionality may outweigh the benefits of having more features.
  • Though, in most cases, learners benefit from the blessing of non-uniformity as data points are concentrated in lower-dimensional manifolds. This refers to the fact that observations from real world domains are often not distributed uniformly but grouped or clustered in useful/meaningful ways.

Do not trust theoretical guarantees as a criterion for practical decision-making; instead, use them for understanding and for driving algorithm design.

Feature engineering is the key and makes the big difference

  • ML is an iterative process of running the learner, analyzing the results and then modifying the learner.
  • For the learning algorithms, raw data does not contain enough structure. Features must be constructed from the available data to make the structure more evident to the algorithm.
  • Learning is easy when all of the features correlate with the class, but more often the class is a complex function of the features.

More DATA → Better Results

  • ML has 3 constraints → Time, Money and Training Data.

Most classifiers achieve the same results at scale

  • Learners group nearby examples into the same class.
  • Use simpler algorithms with fewer parameters and terms
  • Use many models and then ensemble them

Three most popular ensemble methods:

  • Bagging: generate different samples of the training data, prepare a learner on each and combine the predictions using voting.
  • Boosting: weight training instances by their difficulty during training to put special focus on those difficult to classify instances.
  • Stacking: use a higher-level classifier to learn how to best combine the predictions of other classifiers.
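
Hedged examples of all three with scikit-learn (a random forest is bagging over decision trees, with extra feature randomness); the Iris data and mostly-default settings are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "bagging (random forest)": RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                    ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000)),
}

# Compare the ensembles with 5-fold cross-validated accuracy.
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```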

Thanks for reading! I hope you learnt something! If you liked this post, you will enjoy my other deep dives! You can also connect with me on LinkedIn here.

Articles in the Deep Dive Series:

Deep Dive : The ‘Digital Macrocosm’

Deep Dive : The First Principles of the Economy

Deep Dive : Meeting Mindfulness with Meditation
