Your End-to-End Guide to Solving Machine Learning Problems — A Structured Workflow

Joy Ugoyah
Published in DataDrivenInvestor · 8 min read · Sep 9, 2020

When working on any machine learning problem, it is important that the ML engineer, and indeed the entire data science team, is conscious of the structure of the workflow and the sequence of steps to be taken, so that they can produce a scalable ML solution across different application areas and environments. In this article, you will see why following a standard, structured ML workflow matters, the steps involved, and how it can be applied to a real-life project. There is always a method to everything, right? ;-)

Photo by Olav Ahrens Røtne on Unsplash

Standard ML Workflow: the What and Why

Practitioners and experts with years of industry experience strongly recommend following a standard workflow when working on an ML project from start to finish. It allows you as an ML engineer to make decisions with their effect on future steps in mind, to measure the solution's performance, and to easily go back to a phase to optimize the performance of the entire solution. Here's a checklist to follow, excerpted from Aurélien Géron's popular and very resourceful ML textbook, Hands-On Machine Learning with Scikit-Learn and TensorFlow.

Source: PHCschoolofAI, based on Aurelien Geron’s book

Phases of ML Workflow

From the picture above, which I advise you to save a copy of for future reference, you can see how the standard ML workflow is summarized into 8 main steps, each briefly yet clearly explained. Now let us look closely at a fictitious business problem and see how this workflow can be applied to it.

Problem Description

The sales representative of a construction-aggregate company had been struggling to meet his sales targets, which he believed was due to client complaints about inconsistencies in product quality. From a meeting with the plant supervisor, he learned that the production plant had been upgraded in the last quarter and that raw materials normally go through testing before production. The plant supervisor could not see any obvious reason for the inconsistencies, but to the sales representative, the timing of the upgrade coincided with when the complaints began. The meeting did not yield a solution, so after doing more research and speaking with his network, he found out that the machines kept data records, and he asked the manager of the analytics department for a way forward. The analytics manager told him data was available that could help detect the inconsistency in concrete compressive strength through a robust model. The sales rep was most interested in finding out which factors influence concrete compressive strength the most, so that he could present his findings strategically to management.

Your role in this as an ML Engineer is to build an end-to-end ML project that solves this problem.

Frame the Problem

The first step of the ML workflow is to properly understand the project we are going into. Why do we even need to predict concrete compressive strength? What is our objective? In this case, the aim is to resolve the problem of product-quality inconsistency in order to improve sales. Being able to predict the concrete compressive strength and determine the factors influencing it will help us achieve this aim, so yes, this problem can be solved using ML. There are many more relevant questions that need to be asked in this phase; you can find some of those important questions here.

Get Relevant Data

We move to the next phase of our project: data gathering. Here you will need to determine what type of data you need and how much of it, find data sources, check any legal obligations attached to the data, obtain the data, and determine its size and type. This phase is best carried out with the assistance of the project's domain expert, and it is also a good point to work on data fairness. Note that automating this phase makes adding more data in the future more seamless. Remember to set aside your test set! For the problem being addressed here, our data source can be found here; you can deepen your understanding of the relevant data features by researching them on the internet.
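Setting the test set aside can be sketched in a few lines. The real data lives at the source linked above; the DataFrame below is a synthetic stand-in with made-up column names, used only to keep the example self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete data set (column names are illustrative)
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "cement": rng.uniform(100, 550, n),
    "water": rng.uniform(120, 250, n),
    "age": rng.integers(1, 365, n),
})
df["strength"] = (0.05 * df["cement"] - 0.1 * df["water"]
                  + 0.02 * df["age"] + rng.normal(0, 2, n))

# Set the test set aside immediately, with a fixed seed so the split is
# reproducible, and do not look at it again until final evaluation
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_set), len(test_set))
```

With `test_size=0.2`, 20% of the rows are reserved for the final evaluation and everything that follows (exploration, preprocessing, model selection) uses only `train_set`.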

Exploratory Data Analysis

Drawing insights from exploring the data, especially with the help of a domain expert, can go a long way towards saving you the stress of taking the wrong approach or using the wrong attributes when building your model. Study the attributes and characteristics of your data, visualize and understand the correlations, and then document everything you learn from the exploration. Start by looking at the first 5 instances of the data set we are working with.

A correlation heatmap gives a quick visual summary of how the attributes relate to the target, and there are many more things you can do during data exploration.
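These two exploration steps, inspecting the first rows and computing correlations, can be sketched as follows. The data here is again a small synthetic stand-in, not the real concrete data set:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same flavour as the concrete data
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "cement": rng.uniform(100, 550, n),
    "water": rng.uniform(120, 250, n),
})
df["strength"] = 0.05 * df["cement"] - 0.1 * df["water"] + rng.normal(0, 2, n)

print(df.head())  # first 5 instances

# Pairwise Pearson correlations; the 'strength' column shows how each
# attribute relates to the target
corr = df.corr()
print(corr["strength"].sort_values(ascending=False))
```

For the visual version, `seaborn.heatmap(corr, annot=True)` renders the same matrix as the heatmap shown in the article's GitHub repo.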

Prepare the Data

Preparing and transforming the data helps the model you are about to build capture the underlying patterns better.

Automate this process as much as possible so that you can easily prepare the test set, fresh data sets, and new instances as they arrive, and so that you can apply the same preprocessing to other projects you work on. For our example, I tried different methods of scaling and transformation and found that the model performed better without transformation. Here's a link to the GitHub repo, so you can see practically how the methodology in this article is applied.
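One common way to automate preprocessing is a scikit-learn `Pipeline`: the same fitted object then transforms the training data, the held-out test set, and fresh instances consistently. A minimal sketch (the imputation and scaling steps are illustrative choices, not what the example project settled on):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Reusable preprocessing: fill missing values with the median, then
# standardize each feature to zero mean and unit variance
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Tiny illustrative training matrix with one missing value
X_train = np.array([[540.0, 162.0],
                    [332.5, np.nan],
                    [198.6, 192.0]])
X_prepared = preprocess.fit_transform(X_train)
print(X_prepared.shape)
```

At serving time you call `preprocess.transform(...)` (not `fit_transform`) on new data, so the statistics learned from the training set are reused rather than recomputed.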

Iterate over Different Algorithms

Explore different algorithms that you think might be suitable for your data set. If you read the descriptions of the data set, you will see that the attributes have a non-linear relationship (something that can also be deduced in the preprocessing stage), so in picking algorithms to work with, I gravitated towards non-linear models. Automate this process as much as possible to make your model training faster. After evaluating the models' performance, shortlist the top performers for further optimization. Pay attention to the variables that are significant for each model and the types of errors each one makes, so that you know the best candidates in case you decide to build an ensemble model. You can always go back to any of the earlier phases to improve preprocessing or better understand the attribute relationships.
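Comparing candidate algorithms can be automated with cross-validation. The sketch below scores a linear baseline against two non-linear models on synthetic non-linear data (the candidate list here is an assumption; the article's repo shows which models were actually tried):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with a deliberately non-linear target
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (300, 3))
y = np.sin(X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=300)

candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation; scikit-learn returns negated MSE, so flip
    # the sign and take the square root to get RMSE
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error")
    scores[name] = np.sqrt(mse).mean()
    print(f"{name}: RMSE {scores[name]:.2f}")
```

On data like this, the non-linear models should beat the linear baseline by a wide margin, which is exactly the kind of signal used to shortlist candidates.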

Fine-tune your Models

Fine-tune your model for better performance: tune hyperparameters, create an ensemble model, and so on. In this phase, keep in mind the trade-off between model complexity and interpretability. In the example project, after fine-tuning, I settled for the Decision Tree Regressor because it did not overfit and it was easy to illustrate the feature importances for the sales rep.
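Hyperparameter tuning is commonly done with a grid search over cross-validated scores. A sketch for a Decision Tree Regressor, with an illustrative parameter grid (the grid values are assumptions, not the ones used in the example project):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression data
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, (300, 3))
y = X[:, 0] ** 2 + 0.5 * rng.normal(size=300)

# Candidate hyperparameter values; limiting depth and leaf size is the
# usual lever against a decision tree overfitting
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```

`search.best_estimator_` is the tree refit on all the data with the winning parameters, ready for the final test-set evaluation.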

Present Your Solution

Presenting your model in a way that is interpretable and understandable by a non-ML or data science professional matters in every setting. Say I use a PowerPoint presentation of charts and graphs to show the sales representative which attributes affect the concrete strength more than others; I would then have made it possible for him to communicate the findings to management and effect the required changes. A chart of the feature importances for predicting the concrete compressive strength, as stated in the problem overview, makes the 3 most influential features immediately visible.

Always show them the big picture of your solution, including what worked and what didn't.
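Extracting a feature-importance ranking from a fitted tree takes only a few lines. The data and feature names below are synthetic stand-ins (in this toy example, "age" is constructed to dominate; the real project's top features are shown in its repo):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data where the third feature has by far the largest effect
rng = np.random.default_rng(3)
features = ["cement", "water", "age"]
X = rng.uniform(0, 1, (500, 3))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 3.0 * X[:, 2]

tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X, y)

# feature_importances_ sums to 1; higher means the feature drove more splits
ranking = sorted(zip(features, tree.feature_importances_),
                 key=lambda p: p[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.2f}")
```

Fed into a simple bar chart, this ranking is exactly the kind of artifact a sales rep can take to management without needing any ML background.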

Launch and Monitor

Finally, you have to get your ML solution into production, and it doesn't stop there. ML systems "rot" as data evolves, so you also need to monitor your model regularly and retrain it when needed to maintain good performance. When deciding how to take your model to production, consider the business cost and the type of user. Since our example is a fictitious problem it is not being put into production, but suppose you had to take it there: what strategy or platform would you use? Share your ideas in the comments so we can see how they could be applied.
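Whatever platform you choose, the first practical step is usually persisting the trained model so a serving process can load the exact same estimator. A minimal sketch using `joblib` (the file name is illustrative):

```python
import os
import tempfile
import joblib  # ships alongside scikit-learn
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Train a small model on synthetic data
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, (100, 2))
y = X.sum(axis=1)
model = DecisionTreeRegressor(random_state=0).fit(X, y)

# Persist the fitted estimator to disk, then reload it as a serving
# process would, and confirm predictions are identical
path = os.path.join(tempfile.gettempdir(), "concrete_model.joblib")
joblib.dump(model, path)
restored = joblib.load(path)
same = bool(np.allclose(model.predict(X), restored.predict(X)))
print(same)
```

Monitoring then amounts to logging the live predictions and input distributions, and retraining (and re-dumping) the model when performance drifts.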

Conclusion

You've seen the importance of applying a standard machine learning workflow to any project you are working on, and how it can make your work run smoothly and bring out the best solution. I want to stress this point: perform a needs analysis and ask the relevant questions before jumping into solving any problem with ML, so that you don't put in all that work only to realize in the end that it doesn't solve any problem.

Further Reading:

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

https://blog.phcschoolofai.org/getting-started-with-machine-learning-no-practical-machine-learning-part-3-ck9mf61vj002qihs1iwkzaedn

Here’s a Github link to the example project I worked on
