Machine Learning Basics: Part 2

Abhishek Diwate
DataDrivenInvestor

--

In my last article, Machine learning basics part-1, we covered some ML fundamentals and began discussing algorithms in supervised learning. Please check it out before moving on to this article. Here we continue with a few more algorithms.

Random Forest:
This approach first takes a random sample of the data with replacement and picks a random subset of features to grow each decision tree. Instead of training each tree on all of the training examples, we use only a random subset.

Random forests have low bias (just like individual decision trees), and by adding more trees we reduce variance, and thus overfitting. Random forests are an ensemble method; the main principle behind ensemble methods is to build a strong learner by combining several weak learners.

There are two techniques:
1. Bagging
2. Boosting

Bagging is used when our goal is to reduce the variance of a decision tree. We take samples with replacement, so we end up with an ensemble of different models. Averaging the predictions from several decision trees is more robust and gives better performance than a single decision tree.

Salient Features of bagging :

  • Each model is built independently
  • Unweighted voting
  • Bagging improves the over-fitting problem compared to a single model
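
As a rough sketch of bagging in practice (assuming scikit-learn is available; the synthetic dataset and parameter values below are illustrative, not from this article):

```python
# Bagging sketch: many decision trees, each trained on a bootstrap sample,
# combined by unweighted voting. Dataset and parameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

single_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# bootstrap=True means each tree sees a random sample drawn with replacement.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                 bootstrap=True, random_state=42).fit(X_train, y_train)

print("Single tree accuracy :", single_tree.score(X_test, y_test))
print("Bagged trees accuracy:", bagged_trees.score(X_test, y_test))
```

On noisy data the averaged ensemble typically scores a little higher than the lone tree, reflecting the variance reduction described above.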

Random Forest is an extension of bagging. It takes one extra step: in addition to taking a random subset of the data, it also takes a random selection of features rather than using all features to grow each tree. When you have many such random trees, it is called a Random Forest.
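
A minimal Random Forest sketch along the same lines (again assuming scikit-learn and a synthetic dataset):

```python
# Random Forest sketch: bagging plus a random subset of features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features="sqrt" is the extra step over plain bagging: each split only
# considers a random subset of the available features.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0).fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
```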

Boosting is another ensemble technique for creating a collection of predictors. In this technique, learners are trained sequentially: early learners fit simple models to the data, and later learners analyse the remaining errors. In other words, we fit consecutive trees (on random samples), and at every step the goal is to reduce the net error from the prior trees.

Salient Features of Boosting :

  • Boosts single trees into a strong learning algorithm
  • Adaptively changes the distribution of the training data
  • Random sampling with replacement over weighted data
  • Weighted average of the estimates
  • Boosting reduces bias compared to a single model
  • Algorithms such as AdaBoost and Gradient Boosting determine the weights to use in the next training stage/classification
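
A minimal boosting sketch, assuming scikit-learn's AdaBoostClassifier (which by default boosts shallow decision trees); the dataset is synthetic and purely illustrative:

```python
# AdaBoost sketch: weak learners are fitted sequentially; after each round the
# weights of misclassified samples are increased so the next learner focuses on
# them, and the final prediction is a weighted vote of all the weak learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

boosted = AdaBoostClassifier(n_estimators=100, learning_rate=0.5,
                             random_state=1).fit(X_train, y_train)

print("Test accuracy:", boosted.score(X_test, y_test))
```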

Supervised Learning (Categorical):

Support Vector Machine (SVM):

“A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labelled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side.”

SVM illustration

To find the margin, take the point x1 on the hyperplane wᵀx + b = 1 that is closest to the origin: minimize x1ᵀx1 such that wᵀx1 + b = 1. The Lagrangian is formulated as L(x1, λ) = x1ᵀx1 − λ(wᵀx1 + b − 1); setting its gradient to zero gives x1 = (λ/2)w.

The minimum distance is then given by ||x1|| = |1 − b| / ||w||, and the width between the two margin hyperplanes wᵀx + b = ±1 works out to 2 / ||w||; maximizing this margin is exactly what the SVM does.

Making it a bit more complex:

In this case, we can't draw any linear optimal hyperplane that separates the two classes. So we use non-linear transformations, known as kernels. If we draw a circular boundary, it clearly separates the two classes.
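
A minimal sketch of this idea, assuming scikit-learn; the concentric-circles data is a standard toy case where no separating line exists but an RBF kernel works:

```python
# Kernel SVM sketch: two classes arranged in concentric circles are not linearly
# separable, but a non-linear (RBF) kernel separates them easily.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

print("Linear kernel accuracy:", linear_svm.score(X_test, y_test))  # poor: no separating line exists
print("RBF kernel accuracy   :", rbf_svm.score(X_test, y_test))     # near-perfect on this toy data
```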

Advantages of SVMs:

  • High accuracy, nice theoretical guarantees regarding overfitting
  • Especially popular in text classification problems where very high-dimensional spaces are the norm.

K-Nearest Neighbours (KNN):
KNN is one of the simplest classification algorithms and one of the most widely used learning algorithms. For a new data point, it looks at the k closest neighbours of that point and assigns the label that is most frequent among them.

Example: The new data point is the green point. When k = 3 the assigned label would be red. When k = 5 the assigned label would be blue.

KNN algorithm

Known class membership of the labelled samples, where u_ij denotes the degree to which sample x_j belongs to class i

KNN requires a value k, which stands for the number of neighbours to take into consideration.
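
A minimal sketch mirroring the k = 3 versus k = 5 example above (assuming scikit-learn; the coordinates are made up purely for illustration):

```python
# KNN sketch: the predicted label for the "green" query point changes with k.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.1, 0.9],               # two "red" points close to the query
              [2.0, 2.0], [2.1, 1.9], [1.9, 2.1]])  # three "blue" points a little farther away
y = np.array(["red", "red", "blue", "blue", "blue"])

green_point = [[1.4, 1.4]]  # the new data point to classify

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"k={k}: predicted label = {knn.predict(green_point)[0]}")

# k=3 -> red  (two of the three nearest neighbours are red)
# k=5 -> blue (all five points vote, and blue holds the majority)
```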

Salient features of KNN:

  • No assumptions about data — useful, for example, for nonlinear data
  • Simple algorithm — to explain and understand/interpret
  • High accuracy (relatively) — it is pretty high but not competitive with better supervised learning models
  • Versatile — useful for classification or regression

Naive Bayes:
Naive Bayes is a probability-based classifier. It is based on Bayes' theorem with independence assumptions between predictors.
Bayes' theorem is stated mathematically as: P(c|x) = P(x|c) · P(c) / P(x), where c is the class and x is the predictor.

It assumes that the effect of the value of a predictor (x) on a given class c is
independent of the values of other predictors. This assumption is called class conditional independence.

Example:
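
As an illustration, a minimal Gaussian Naive Bayes sketch, assuming scikit-learn and using the Iris dataset purely for demonstration:

```python
# Gaussian Naive Bayes sketch: for each class the model stores a prior P(c) and,
# per feature, a Gaussian estimate of P(x_i | c); prediction picks the class
# maximising P(c) times the product over i of P(x_i | c).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)

print("Class priors P(c):", nb.class_prior_)
print("Test accuracy    :", nb.score(X_test, y_test))
```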

Advantages of Naive Bayes:

  • Training is simple and fast: it essentially amounts to counting (or estimating simple per-class statistics).
  • A Naive Bayes classifier converges more quickly than discriminative models like logistic regression, so you need less training data.
  • The flip side is that it cannot learn interactions between features.

Evaluation Measures for Classification:
Precision:

  • Out of all items retrieved, how many are relevant
  • It is based on true positives and false positives: Precision = TP / (TP + FP)
  • It shows how many of the selected items are actually relevant.
  • True positives (TP) are the positive guesses that were actually correct, while false positives (FP) are the positive guesses that were incorrect.

Recall:

  • How many relevant items were retrieved out of all relevant items
  • It is based on true positives and false negatives: Recall = TP / (TP + FN)
  • It shows how much of the relevant information is selected.
  • The true positives (TP) are again the positive guesses that were actually correct, while the false negatives (FN) are negative guesses that should have been positive.

Confusion Matrix:

Confusion Matrix (with the example counts used below, n = 165):

                 Predicted: No    Predicted: Yes
  Actual: No         TN = 50           FP = 10
  Actual: Yes        FN = 5            TP = 100

  • Accuracy: Overall, how often is the classifier correct?
    (TP + TN) / total = (100 + 50) / 165 = 0.91
  • Misclassification Rate (also known as Error Rate): Overall, how often is it wrong?
    (FP + FN) / total = (10 + 5) / 165 = 0.09, equivalent to 1 minus Accuracy
  • Precision: When it predicts yes, how often is it correct?
    TP / predicted yes = 100 / 110 = 0.91
  • Sensitivity or Recall: When it's actually yes, how often does it predict yes?
    TP / actual yes = 100 / 105 = 0.95
  • False Positive Rate: When it's actually no, how often does it predict yes?
    FP / actual no = 10 / 60 = 0.17
  • Specificity: When it's actually no, how often does it predict no?
    TN / actual no = 50 / 60 = 0.83, equivalent to 1 minus False Positive Rate
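
A small sketch that recomputes these numbers from the example counts above (plain Python, no libraries needed):

```python
# Recomputing the confusion-matrix metrics from the example counts above.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN              # 165

accuracy = (TP + TN) / total           # 0.91
error_rate = (FP + FN) / total         # 0.09  (1 - accuracy)
precision = TP / (TP + FP)             # 0.91
recall = TP / (TP + FN)                # 0.95  (sensitivity)
false_positive_rate = FP / (FP + TN)   # 0.17
specificity = TN / (TN + FP)           # 0.83  (1 - false positive rate)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, specificity={specificity:.2f}")
```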

When to use which Algorithm type?

  • Factors to consider: accuracy, training time, linearity, number of parameters, and number of features
  • Two-class classification:
    Decision tree, logistic regression
  • Multi-class classification:
    Decision tree, multiclass logistic regression, neural network
  • SVMs may work well with a large number of features

Go through this link for cheat sheets if you are having difficulty choosing an algorithm.

So far we have seen various regression and classification techniques and discussed several algorithms. In the next article, we will look into various clustering algorithms. In upcoming articles, we will also see how to write code for these algorithms using Python. Stay tuned for more such articles.

Please clap and share if you found this article worthwhile.

Happy learning!
