Intro to Machine Learning

Teaching machines to learn!

Mayur Jain
DataDrivenInvestor

--

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

Other definitions

Machine Learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on providing confidence intervals around these functions.

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” (Tom M. Mitchell)

At the heart of Machine Learning

Data is at the heart of machine learning, and quality, clean data is like heaven on earth. But never expect your data to be as clean as what you get in a Kaggle competition. It is widely believed that the larger the corpus of data, the better the generalization of the learning model. Less data results in poor generalization, causing the model to perform poorly on unseen data.

Model is a function learnt from data. Complex data, such as high-dimensional data like images and audio, requires a complex function.

For instance, an appraisal in the workplace is a complex function because various factors play a key role: your effort in the team, the quality of your deliverables, years of experience, a cordial relationship with managers, etc. One cannot simply use a linear function like increasing salary with increasing years of experience. Hence, the model learns these factors and their inter-relations and extracts the hidden patterns from the data; these patterns are generalized into a function.

Learning: a model is said to learn from data. Learning here refers to identifying or recognizing patterns in the data. A model is performing well if it predicts accurately on unseen data. Unseen data is technically called test data, while the data the model is trained on is called train data. The terms train and test data are explained in Machine Learning blog post II.

Technical Jargon in Machine Learning

Data is the set of observations available about a system to which we’ll apply machine learning algorithms. In the majority of cases we perform a series of steps on our data to make it ready for machine learning, like cleaning URLs out of strings or removing outliers, and structuring it in a form that is consumable by an ML algorithm.

For instance, consider a system of patient records in a hospital, where the hospital keeps track of each patient’s health habits, existing health conditions, whether disease XYZ is present or not, etc. These characteristics all come under the common term Data.

Consider we have 1000 patient records with the target as disease XYZ, taking the value 0 or 1: 0 represents no disease and 1 represents disease present. To perform machine learning, we split the dataset into multiple sets: a train set, a validation set and a test set.
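The split described above can be sketched with scikit-learn. The feature matrix and labels below are synthetic stand-ins for the hospital records, not real patient data:

```python
# A minimal sketch: split 1000 patient records into train, validation,
# and test sets. X and y are randomly generated placeholders for the
# hospital data described above.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))        # 5 hypothetical health features
y = rng.integers(0, 2, size=1000)     # 1 = disease XYZ present, 0 = absent

# First carve out 20% as the final test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# ... then split the remainder into 75% train / 25% validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The exact ratios (60/20/20 here) are a common convention, not a rule; larger datasets often reserve smaller fractions for validation and test.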

Predictors are the set of independent variables that help describe a system. From the above patient records, if we want to predict whether a patient has disease XYZ, our predictor variables will include health habits (like smoking, drinking, etc.) and existing health conditions like blood pressure, diabetes, etc. A predictor is also referred to as an independent variable.

Figure 2: Machine Learning Flow

Label is the dependent variable; it tells us about the system and what it is meant for. In the patient record system, our label could be disease XYZ, meaning we can predict whether a patient has disease XYZ or not. Labels are also called targets. They are dependent because they depend on the predictors.

Model is a machine learning algorithm trained on data to fulfill our task. There are thousands of ML algorithms readily available, but it is an enormous challenge to identify the one ML algorithm best suited to the task.

In the above example, we predict whether a patient has the disease or not, which comes under a class of problems called binary classification, where the label variable takes Yes or No values.

Parameters are the knobs of a machine learning algorithm, and each algorithm has its own, like the depth of the tree in the Decision Tree algorithm or the penalty in Logistic Regression.

The knobs that are set manually by the user performing machine learning are called Hyperparameters. The other set of parameters is learnt by the machine learning algorithm itself, like the coefficients of the X’s in Linear Regression or the weights in Deep Learning algorithms.
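The distinction between hyperparameters and learned parameters can be sketched with scikit-learn's Logistic Regression on made-up data:

```python
# Sketch: hyperparameters are set by hand before training;
# parameters are learned from the data. Data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# C (the inverse regularization strength, i.e. the "penalty" knob)
# is a hyperparameter: we choose it before fitting.
model = LogisticRegression(C=1.0)
model.fit(X, y)

# The coefficients and intercept are parameters: the algorithm
# learns them during fit().
print(model.coef_, model.intercept_)
```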

For training, we use the train set. For identifying the hyperparameters and parameters of the model, we use the validation set. To check the final performance of the model, we use the test set.

Training is the process of learning the parameters of the machine learning algorithm. During this process we keep track of the training error; as training continues, the training error reduces.

Training is done on train set.

Evaluating is the process of choosing hyperparameters: one must experiment with different possible hyperparameter values for the learning algorithm. Evaluation can take place alongside training. During evaluation, we keep track of the validation error, which tells us the performance of each hyperparameter setting we try.

Evaluation is done on the validation set of a dataset.
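The evaluation loop described above can be sketched as: try several values of a hyperparameter, measure the validation error for each, and keep the best. The hyperparameter tuned here is the tree depth mentioned earlier; the data is synthetic:

```python
# Sketch of hyperparameter tuning on a validation set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

best_depth, best_err = None, 1.0
for depth in [1, 2, 3, 5, 8]:
    # max_depth is the hyperparameter being tuned.
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    val_err = 1.0 - tree.score(X_val, y_val)  # validation error
    if val_err < best_err:
        best_depth, best_err = depth, val_err

print("best max_depth:", best_depth)
```

In practice this loop is often automated with tools like `GridSearchCV`, which also cross-validate instead of using a single fixed validation split.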

Testing is the process of confirming whether the model generalizes to unseen data or not. If the model performs poorly, we need to go back to training with a different model, different parameters or hyperparameters, more data, etc.

Testing is done on the test set of a dataset.

Note: Sometimes the training error will reduce but the validation error may fail to reduce. This happens because of a bias issue in the model. More details in the next blog post.

Types of Machine Learning

Supervised Learning

In supervised learning, the algorithm learns from labeled data. It identifies/recognizes patterns in the labeled data and applies those patterns to unlabeled data.

For example: data on house prices based on various factors like area, rooms, lawn and other details, where we predict the value of the house. So our label is House Price.

A dataset has features (X’s) and a target (y). The target can be a discrete value or a continuous value, based on which we call the problem Classification or Regression respectively.

A classification problem can be binary, multi-class or multi-label: the target takes either 0/1 values or multiple values like Dog, Person, Cat. It answers yes-or-no questions and has categorical outcomes.

A regression problem is about predicting a numerical value, for instance the price of a house or the stock value of a product. It answers “how much” questions and has numeric outcomes.
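The house-price example above can be sketched as a regression: fit a linear model to area and room count and predict a price. The data and coefficients below are invented for illustration:

```python
# Sketch of regression: predict a continuous house price from
# area and number of rooms. All numbers here are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
area = rng.uniform(50, 200, size=100)       # square metres
rooms = rng.integers(1, 6, size=100)
# True (invented) relationship plus noise.
price = 1000 * area + 5000 * rooms + rng.normal(0, 2000, size=100)

X = np.column_stack([area, rooms])
model = LinearRegression().fit(X, price)

# Predict the price of a hypothetical 120 m², 3-room house.
print(model.predict([[120, 3]]))
```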

Each dataset has features and a target variable. The features help in generating parameters that affect the target variable either directly or indirectly. This relationship between X and y is built by an ML algorithm.

Unsupervised Learning

In unsupervised learning, the algorithm learns from unlabeled data.

For example: unstructured text documents, where we cluster texts or paragraphs based on word associations in the sentences using clustering algorithms, to group documents into particular topics.

A dataset with no labels is trained by unsupervised learning, i.e. the patterns in such a dataset are commonly learned using clustering techniques like KMeans, Hierarchical Clustering, etc.
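Clustering can be sketched with KMeans on two well-separated synthetic blobs, which stand in for, say, two document topics:

```python
# Sketch of unsupervised learning: KMeans groups unlabeled points
# into clusters without ever seeing a target variable.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])   # no labels anywhere

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Each blob should land in its own cluster.
print(labels[:5], labels[-5:])
```

Note that the number of clusters (`n_clusters=2`) is itself a hyperparameter that must be chosen by the user.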

Reinforcement Learning

The algorithm learns by performing actions and receiving rewards (positive or negative) for those actions. Here the algorithm interacts with an environment, so there is a feedback loop between the learning system and its experience. For instance: self-driving cars and game-playing agents.
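The action-reward-update loop can be sketched with a tiny two-armed bandit, a simplified reinforcement-learning setting. The reward probabilities and the epsilon-greedy strategy below are invented for illustration:

```python
# Sketch of the reinforcement-learning loop: an epsilon-greedy agent
# pulls one of two slot-machine arms, receives a reward, and updates
# its value estimates. Reward probabilities are made up.
import numpy as np

rng = np.random.default_rng(4)
true_reward_prob = [0.3, 0.8]   # arm 1 is actually better
values = [0.0, 0.0]             # the agent's running estimates
counts = [0, 0]
epsilon = 0.1                   # fraction of steps spent exploring

for step in range(2000):
    # Explore with probability epsilon, otherwise exploit the best arm.
    if rng.random() < epsilon:
        arm = int(rng.integers(0, 2))
    else:
        arm = int(np.argmax(values))
    reward = 1.0 if rng.random() < true_reward_prob[arm] else 0.0
    counts[arm] += 1
    # Incremental running-mean update of the pulled arm's value.
    values[arm] += (reward - values[arm]) / counts[arm]

print("estimated values:", values)  # should approach [0.3, 0.8]
```

Real reinforcement learning adds states and long-term returns on top of this loop, but the feedback cycle of act, observe reward, update is the same.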

Thanks for reading. If you liked reading my articles, checkout Math + Computing = AI for more articles.

Connect with me on LinkedIn and Twitter
