
Guidelines For Data Science Teams

Burak Özen · Published in DataDrivenInvestor · Feb 4, 2021

How to Build, Structure and Manage a Data Science Team Successfully

Introduction

Companies increasingly recognise the importance of data in their internal business decision-making processes and are investing more in their data teams. Lately there has been a contest, especially among big tech companies, to expand their data teams to serve these needs.

In my career, I have had opportunities to work for some of those big companies and have taken on different roles in their data teams. Thus far, I have also gone through numerous data science interviews with companies of various sizes. From these experiences, I would not be wrong in saying that data science is still not as mature as other technical disciplines such as Software Engineering or Data Engineering. I have also recognised the bitter truth that there is no well-defined and commonly accepted set of standards in the current data science world. That is the main reason I decided to write this article: to start a discussion on the topic through a bit of brainstorming.

Needless to say, it would not be realistic to expect all companies to follow the exact same set of standards. However, although data science is a relatively new field, I believe we still need to find systematic ways of approaching its common problems. In this article, I would like to share my perspective on a generic guideline that could be followed by any data science team.

Table of Contents:

I. Optimal Data Science Team Structure: List of Role Definitions

II. How to Hire a Data Scientist: Data Science Interview Process

III. How to Design an Effective On-Boarding Process For New Data Scientist Hires

IV. How to Do Technical Due-Diligence to Assess a Data Science Project

V. Challenges and Best Practices In a Data Science Product Delivery Process

VI. Data Science Implementation Manifesto: How to design different sub-components of an implementation pipeline

Let's discuss each topic in a separate section in the rest of the article.

I. Optimal Data Science Team Structure: List of Role Definitions

Roles in an Ideal Mid-Sized Data Science Team (Image by Author)

The size of a data science team varies from one company to another, depending largely on the size of the company. Here, I will define a list of the different roles in an ideal medium-sized data science team.

Junior Data Scientist(s):

Arguably, one of the most important career decisions for a junior data scientist is choosing the first company they will work for. This decision is likely to shape their entire future in the field. Therefore, to those who are at this stage of their careers, I strongly recommend selecting this first company wisely, picking one where they can develop skills in an area they would like to focus on (e.g. deep learning, or data science applications in marketing, finance or health).

Now, let’s talk about common qualities of this group of data scientists and what to expect from them while forming a successful data science team.

These data scientists are relatively inexperienced compared to the rest of the team and are expected to report to the Lead Data Scientist at all times, so there will inevitably be a steep learning curve for them. In line with the team vision, they should be trained and directly supervised by the lead data scientist. For instance, if you see the future of your data science team in deep learning applications, then you should look for a junior who has a little prior experience in deep learning and wants to follow a career path in it.

You cannot expect a junior data scientist to know how to code in various programming languages or to be able to apply many machine learning algorithms. Therefore, you are better off being task or project oriented in your quest for new juniors to add to your team. Let me give another example to make my point clear. Assume you have decided to explore and invest more in fraud detection modelling in the near future. Although your team may use several programming languages, there is usually one language that is friendliest to your production environment. You should then look for a junior data scientist who has some prior experience in that particular programming language and in fraud-like projects such as churn prediction.

Senior Data Scientist(s):

The main difference between juniors and this group is that we can expect senior data scientists to be capable of leading a project from beginning to end. They should be able to take charge of a project without a long learning curve or direct supervision.

Another responsibility of this role is to help the junior data scientists in the team. For instance, since seniors are expected to be more aware of the infrastructure and data sources, when a junior asks for help, they should point the junior data scientist to the right data sources or to the right people who can help them find what they are looking for.

They are also expected to pair with junior data scientists to review their code. This has a two-sided positive effect on the team: the junior learns a great deal during those reviews, and having the code double-checked by a senior makes the models more robust.

Although they are assigned to a specific task or project by the lead data scientist most of the time, we can also expect a senior data scientist to come up with new ideas to inspire the team and the business. Naturally, it takes some seniority in this field to spot and formalise those data-related business problems.

Lastly, this group of data scientists is also expected to report to the team lead at all times.

Lead Data Scientist (Team Lead)

A Lead Data Scientist must be a team member with comprehensive technical data science knowledge. We may not expect junior or senior data scientists to know how to code in different languages or to use different frameworks, but a Lead Data Scientist must be experienced in all the programming languages and tools used by the team. Owing to this strong experience, the role is also responsible for the technical training of the junior data scientists in the team.

All projects must be run under the technical supervision of the Lead Data Scientist. In GitHub jargon, you can think of this person as the final step where a branch is merged into master. If the team uses GitHub as its version-control system, the team lead must be in charge of managing the code repository and orchestrating its use within the team.

This role must ensure that the technical documentation of each project is written clearly by the data scientist who owns it. In other words, the lead data scientist is responsible for organising technical documentation within the team. These technical reports should be collected in a single place, such as a technical team blog.

Lastly, a lead data scientist should stay aligned with the project manager at all times and is expected to report to the Head of Data Science.

Project Manager

The project manager is the only team member who is not required to have a technical background. However, we still expect the project manager to know the technical jargon of the data science world, because this role acts as the communication hub of the team, gluing all team members together so that they work in sync.

A project manager should be in charge of improving internal team communication and making sure that every data scientist stays sufficiently up to date on other team members' projects as well. This can be achieved by arranging team and 1-to-1 alignment meetings. For each meeting, this role is expected to set the goal in advance, plan possible follow-ups afterwards and collect meeting notes to share with all attendees later.

If the team uses a project management tool such as Jira, the project manager must ensure that all data scientists use it properly and regularly. Each task on the board must be double-checked by this person to see whether it meets the team's task standards and provides a sufficient level of detail about the task.

In addition to all of the above, this role is required to be a communication bridge between the client (business) and the team. One of its most crucial responsibilities is to keep the two sides, client and vendor, aligned throughout a project.

Data Engineer

Most of the time, data science teams need help from other engineering teams to make their products comply with the current production environment. It is a privilege and a huge advantage if a data science team manages to have its own API and production flow.

Depending on an engineering team all the time makes things complicated and inefficient for a data science team, for several reasons. First, every team in a company has its own priorities, so your task may wait a long time before being picked up by the engineering team. Depending on the engineering team's capacity, it might even take several months before you see your models in action. Meanwhile, your customer might lose interest in your product or even forget about it. Secondly, once your model is out of your hands and control, it is hard for a data scientist to change even a single model parameter without contacting the engineering team all over again.

Therefore, a data science team needs a dedicated internal data engineer, both to avoid the problems above and to create a much more dynamic working environment. The data engineer will help set up the team's own API to serve the results of ML models in production. This role is also in charge of maintaining the data science production pipeline, an area lately known as MLOps.

In general, since a data engineer knows more about software design principles than a data scientist, implementations done by a data scientist should be revisited and configured by this engineer in collaboration with the project-owner data scientist. In this way, the data scientist learns more about stable and production-friendly implementation design without losing all control over their own code.

The data engineer should report to both the head of data science and the head of data engineering. This can be arranged so that the data engineer keeps a close bond with the data engineering team for technical issues while still officially being part of the data science team.

Product Analyst

The Product Analyst is the team member in charge of designing experiments and running A/B tests to evaluate the performance of models in real use cases.

This role requires strong statistical knowledge, experiment design expertise and great data visualisation skills. Ultimately, it is important to present the final results to business people in an easy-to-understand and polished way.

Front-End Developer

Many data science applications and models need a dashboard to present all their results in one place in a more appealing way. I believe this is the most effective approach for selling the team's products to the rest of the company. Assume that your team has been working on several predictive models which produce many user data points such as churn risk score, likelihood to purchase, fraud risk score, predicted age, predicted gender, customer lifecycle stage, recommended products and so on. Essentially, we could draw a predictive 360-degree user profile from the outcomes of many different machine learning models. However, it is not pretty to present those results in slides all the time. Therefore, a front-end developer should create a Data Science Dashboard for the team, working closely with the Product Analyst on the data visualisation and plotting parts.

It is not always just about a dashboard. Sometimes a simple UI is needed to show the performance of your models neatly. For instance, it is not trivial to show that a newly developed recommender performs better than a baseline recommender. The front-end developer can also create a good-looking UI application to compare the quality of recommendations coming from those two recommenders.

Head of Data Science

You could think of the Head of Data Science as the product manager of the team. However, not just any product manager on the market will be eligible for this role; assuming otherwise is a common mistake that many businesses make in this field. Knowing only data science terminology and some high-level data problems is not sufficient to fill this crucial position. This person must have been a Senior or Lead Data Scientist before, so that they have a strong prior technical data science background. Perhaps a better title for this role would be 'Technical Product Manager'.

Everyone keeps talking about product knowledge as a quality of good data scientists. In general, I agree that it is a huge plus for a data scientist to develop their product side as well. However, it is a required quality specifically for the Head of Data Science position. This person must also have a thorough product vision and business understanding.

The most important responsibility of this role is to be a bridge between the Business/Product and Data Science worlds. Therefore, the person must be capable of matching a business problem with an existing data solution and vice versa. In this way, the team gains a two-way relationship with the business: rather than always waiting for a product request to come from the business side, the team can also tell the business what is feasible with current data tools and algorithms. This makes the team more visible in the company. Thus, changing the mindset of C-level and business people towards Data Science can be counted as another responsibility of this role.

The Head of Data Science is also in charge of drawing up the long-term and short-term road maps of the team, making quarterly plans and maintaining a list of project ideas ordered by priority. Each idea should be detailed in a one-pager project description covering the problem to be solved, the business success metric, possible high-level technical approaches and some concrete final use cases of the model.

Lastly, this role should run a team blog and write articles about internal team projects. The Head of Data Science is also expected to attend conferences as a speaker to share team achievements with the external data science community and, in general, to market the team.

II. How to Hire a Data Scientist: Data Science Interview Process

Different companies follow different strategies for their data scientist hiring processes. During my career thus far, I have seen many interview processes at a variety of companies. In this section, I would like to describe my perspective on how interview stages should be designed to hire a data scientist who is a good fit for your team.

The data science interview process should be built around a series of case studies. With these cases, you can easily probe the following qualities of a data scientist:

  • How to conduct communication with the business and ask relevant questions to gather sufficient information about the problem → Product Vision
  • How to anticipate and avoid pitfalls of the project during an initial discovery on the data → How to approach a data problem
  • How to design implementation stages, make critical modelling decisions and justify them → Applying Machine Learning
  • How to present the results and sell the final product to the stakeholders → Visualisation and Presentation skills
3 Interview Stages in a Data Scientist Hiring Process (Image by Author)

CASE STUDIES

A total of three case studies will do what we need to qualify a data scientist for a role in the team:

Case I: Ask the candidate to present one of their previous projects:

  • Expect a comprehensive understanding of what was done in that project, including all technical and non-technical steps.

Case II: Ask the candidate to come up with an end-to-end solution for a problem your team has been struggling with lately:

  • Provide a high-level definition of a business problem you have, without giving all of its detailed specs. Expect the candidate to ask relevant questions to make the picture clearer before proposing a solution.
  • Designing each step of a data science project pipeline is arguably the most crucial quality to test in this case. It will give you clues about whether the candidate takes final stages such as productionising the model into consideration or focuses only on discovery work in a single notebook.
  • You might not require the candidate to know all the details of how to run online testing such as an A/B test, but at least expect them to acknowledge this step in an end-to-end solution.
  • Ask the candidate to present the final end-to-end solution to you as if you were one of the business stakeholders of the project.

Case III: Ask the candidate to come up with five possible data-related problems your business might be dealing with:

  • It is a good way to measure the candidate's interest in and knowledge of your business.
  • It also helps you assess how strong the candidate's product vision is.
  • Ask for high-level solutions for each problem.

III. How to Design an Effective On-Boarding Process For New Data Scientist Hires

Most companies either skip or neglect the on-boarding process for their new data scientist hires. Only a few companies have such a process, and even then the on-boarding training sessions are generally cursory and disorganised. Personally, I find these training sessions vital for helping new data scientists see the big picture more clearly and more quickly. Therefore, companies should not mind spending even a whole month to train new hires and get them familiar with the business and the product itself. This is crucial especially for a Data Science team, for the following reasons:

  • A data scientist must know the company's vision and mission to be able to connect the dots between business problems and the data. You should always keep the product side of a data scientist in mind and make sure to feed them the necessary, detailed business information.
  • A data scientist must know the business strategy and near-future plans for the growth and expansion of the business, in order to decide which field to focus on and prioritise their projects accordingly.
  • A data scientist must be a good user of the product in the first place. You cannot always expect a new hire to already be a loyal user of your product, but you can make them one by teaching them every single feature of your product or platform. Data scientists must understand what users feel and need while using the product. Having this real user experience is especially important for a data scientist to think like a user and gain a thorough understanding of the product flow. This should happen before you let a data scientist begin their data journey.
  • A data scientist must be aware of all revenue sources of the business. Basically, the answer to this question should be clear to a new hire: through which channels is the business making money?
  • A data scientist must spend some time with other teams in the company, such as engineering, marketing or analytics, to gain a comprehensive view of the internal business flow. For example, what tools does the marketing team use to target users? What business metrics are they trying to improve? (e.g. number of conversions, number of active users, total time spent, etc.)

For a new hire to complete the list above and make the on-boarding process effective, a sequence of alignment meetings with different teams in the company needs to be organised in advance. The purpose of each alignment must be set out clearly in an on-boarding training document that is handed over to the trainee. It is best to start with the business and product alignments first, followed by the other teams, including marketing, engineering and so on.

Since I currently work at an online marketplace business, I will give a high-level description of how we could design a product training session to introduce our marketplace platform to a new hire. Needless to say, each business needs to plan this training in detail with respect to its own products.


USER EXPERIENCE TRAINING: DEEP UNDERSTANDING OF USERS IN A MARKETPLACE

We have two main user groups on our marketplace platform: Buyers and Sellers. If you are on our platform to buy an item, you are a Buyer; if you are visiting our platform to post a listing and sell an item, you are a Seller. In this product session, you should ask your new hire to act first like a Buyer and then like a Seller on the platform.

Case I: You are a Buyer

Create an account → Browse different categories → Browse different listings → Favourite a listing → Go to your ‘My Page’ to list your favourited listings → Type something in the search bar → Browse the search results → Refine those results using different combinations of search filters → Save the ‘refined search’ for later use → Land on a specific listing page → Bid on it → Send a message to its seller

Case II: You are a Seller

Create an account → Pick an item to sell on the platform → Search in the list of categories to find the best-fit category for your item → Fill in relevant attributes of the item you are selling such as size, colour, price etc. → Post your listing and check if you got any validation email → After your item is listed on the platform, go to ‘My Ads’ page to view your item. → Try to collect some performance numbers for your ads such as how many people viewed or favourited your item so far → Get a message from a buyer → Have a decent and long conversation with a buyer → Remove your listing from the platform

IV. How to Do Technical Due-Diligence to Assess a Data Science Project

Let's do an exercise to learn more about how to question a data science project. Below, you will see a high-level overview of a new search ranking pipeline. In this exercise, we will think about which relevant and important questions can be asked about this pipeline to spot possible risks in it, and how to improve it with suggestions. I will group my questions by context and list them under different titles: Business, Data Collection, Feature Extraction, Train & Evaluate and Production questions.

High Level Search Ranking Pipeline (Image by Author)

Business Questions

  1. What business problem is this system supposed to solve?
  2. What is the formal definition of a “successful search” for the business? (e.g. relevance vs. semantic similarity, or diversity in the top search results)
  3. How does the current baseline search system work?
  4. What are the main pain points of the baseline?
  5. Are any pre-defined business rules expected to be applied to the new search ranking system? (RISK!: The business's definite dos and don'ts need to be taken into consideration.)
  6. Which business metric will we be aiming to improve with this new search?

Data Collection

  1. What kind of input datasets is this pipeline connected to? (e.g. impressions, clicks, search results, etc.)
  2. Where does the input dataset reside? Is there any partitioning on this dataset?
  3. How big is the input data? (RISK!: We might need to aggregate the data gradually by creating a series of data transformation jobs.)
  4. Is this pipeline for single-platform users (e.g. web only) or multi-platform users (e.g. web and mobile)? (RISK!: The search characteristics of web users are likely different from those of mobile users and need to be handled differently.)
  5. Is this analysis only for logged-in users or for all users, including anonymous ones? (SUGGESTION: Consider richer user features for logged-in users.)
  6. Is the input data collected by a real-time streaming job or a batch job? (SUGGESTION: Determine the frequency of the downstream pipeline and jobs accordingly.)
  7. Has any prior data discovery been done to extract insights from the data? (RISK!: Perhaps 80% of your search results data comes from only 5% of the users and the remaining 20% from the other 95%. → Biased search ranking; see the sketch after this list.)
  8. Are there any prior data cleaning jobs before training data generation? (RISK!: Exclude all unintentionally conducted searches and click-baits from the analysis, or drop already removed/deleted items from the analysis.)
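
To make point 7 concrete, here is a minimal pandas sketch of such a concentration check, assuming a hypothetical search-events table with a user_id column (all names and paths are illustrative):

```python
import pandas as pd

# Hypothetical input dataset of raw search events.
searches = pd.read_parquet("search_events.parquet")

# Count searches per user, from heaviest to lightest users.
per_user = searches.groupby("user_id").size().sort_values(ascending=False)

# Share of all search traffic generated by the top 5% of users.
top_n = max(1, int(len(per_user) * 0.05))
share = per_user.iloc[:top_n].sum() / per_user.sum()
print(f"Top 5% of users generate {share:.1%} of all searches")

# A very high share is a warning sign: a ranking model trained on this data
# may be biased towards the behaviour of a small group of power users.
```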

Feature Extraction / Training Data

  1. What kinds of features are generated from the data records, and how many are there in total?
  2. What was the labelling strategy for the relevance score? How is the training data labelled? (RISK!: Label leakage → features for each search result must be computed only from data available before the date on which that particular search was conducted; see the sketch after this list.)
  3. How was the final feature set used in production decided upon?
  4. New Approach to Consider: Add personalisation features if not already considered → More Personalised Search Ranking
  5. New Approach to Consider: Use an embedding strategy to generate features instead of explicit hand-tuned features.
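
The point-in-time rule behind the label-leakage risk in point 2 can be illustrated with a minimal pandas sketch; the table and column names (searches with a search_ts timestamp, clicks with a click_ts timestamp) are hypothetical:

```python
import pandas as pd

searches = pd.read_parquet("labelled_searches.parquet")  # item_id, search_ts, label
clicks = pd.read_parquet("item_clicks.parquet")          # item_id, click_ts

# Join every labelled search with the click history of its item...
joined = searches.merge(clicks, on="item_id", how="left")

# ...but keep only clicks that happened strictly BEFORE the search was conducted,
# so the feature never uses information from the future.
joined = joined[joined["click_ts"] < joined["search_ts"]]

# Historical click count per (item, search) pair becomes a leakage-safe feature.
clicks_before = (
    joined.groupby(["item_id", "search_ts"])
    .size()
    .rename("clicks_before_search")
    .reset_index()
)
training_data = (
    searches.merge(clicks_before, on=["item_id", "search_ts"], how="left")
    .fillna({"clicks_before_search": 0})
)
```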

Train & Evaluate / Models

  1. How is the modelling part formulated? (e.g. binary, multi-class, regression, etc.)
  2. Which ML algorithm has been used, and why was it selected?
  3. What loss function is being optimised?
  4. How is the model validation conducted? (e.g. cross validation or separate train/valid/test sets) → RISK!: For problems like search ranking, it is important to split your data into train/valid/test sets with respect to their timestamps, in this time order: Train → Valid → Test (a minimal sketch of such a time-based split follows this list).
  5. How do you decide on the best model? Over which algorithm parameters is the grid-search hyper-parameter tuning done?
  6. How do you decide which features are the most important during the modelling phase? (SUGGESTION: Partial dependence plots increase the explainability of the models.)
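
As a minimal sketch of the time-ordered split mentioned in point 4, assuming a hypothetical training table with a search_ts timestamp column:

```python
import pandas as pd

data = pd.read_parquet("search_training_data.parquet").sort_values("search_ts")

# Oldest 70% of searches for training, next 15% for validation and the most
# recent 15% for testing, so evaluation never relies on "future" data.
n = len(data)
train = data.iloc[: int(n * 0.70)]
valid = data.iloc[int(n * 0.70) : int(n * 0.85)]
test = data.iloc[int(n * 0.85) :]

assert train["search_ts"].max() <= valid["search_ts"].min()
assert valid["search_ts"].max() <= test["search_ts"].min()
```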

Production Questions

  1. How many components/jobs are running in production?
  2. How frequently does the training job run?
  3. Is the training job a single monolithic job, such as a whole Jupyter notebook, or a series of smaller jobs? (RISK!: A single job is always more error prone and harder to track in production.)
  4. How is the cold-start problem handled? Is there a special flow for freshly added records?
  5. Where do you store the training data and the models running in production?
  6. What is the Search API's endpoint?
  7. How do you run A/B testing for this product? Which metric is used to assess performance?
  8. What are you monitoring across this whole pipeline? (RISK!: It is good practice to monitor all datasets, including the input and all intermediate datasets, to spot anomalies in the data.)

V. Challenges and Best Practices In a Data Science Product Delivery Process

It is fair to say that it usually takes too long for a data science project to go from an idea to a running product. There are several problems along a data science product pipeline that cause data science teams to work inefficiently.

The major difference between Data Science and Software Engineering is its experimental nature. Therefore, regular software engineering standards might not always work for data science teams; it is a different discipline and needs to be treated delicately. Putting a machine learning model into production also has its own unique challenges, different from those of other products. Thus, while discussing common challenges and best practices in Data Science, I will break the topic down into two main parts so that it is easier to follow. I will call part I Experimental (Development) and part II Production (Deployment).

Part I: Experimental (Development)

In this part, we will talk about problems in the model development phase, focusing specifically on best practices in data science project management.

The Agile methodology can be used as a base for the development cycle of a Data Science project. However, it needs to be fine-tuned to better fit Data Science's own nature. Now, let's discuss which Agile principles work well in this relatively new field and which ones need to change.

  • Agile is at its finest when applied to fast-paced development cycles in which project requirements are likely to change a lot throughout the cycle. This is partly true for Data Science as well: there is always a possibility of a requirement change, and it may happen even at the end of the project if expectations are not set clearly at the very beginning. However, data science project requirements are relatively stable compared to most other software projects. That is one point where we need to change an Agile principle for Data Science: as opposed to the Agile principle of minimising up-front planning and design, detailed documentation and pre-planning are vital for a data science project. At the beginning of a project, we need to invest enough time to make sure that the data scientists and business stakeholders are all on the same page, because designing an ML model is strongly based on the initial feedback and product details coming from the business side. In other words, even the model development and technical decisions will be affected by these early alignment sessions between the business and the data science team. That is the beauty, but also one of the biggest challenges, of the Data Science world. Let me give an example to make it easier to digest. Assume we would like to create a model for a fraud prediction problem. Information such as the clear definition of fraud, or what could possibly signal that a user is fraudulent on the platform, should come from the business side. This is a discussion in which the business is also expected to be involved, so as to pave the way for the data scientists.
  • Micro-management is arguably the worst thing that can happen to a data science team. It is not easy to break a data science project into sub-tasks as small as those in software development; it is simply not possible because of data science's unique experimental development nature.
  • There is no need to push hard for daily stand-ups in data science teams. Otherwise, you may hear a data scientist giving the exact same update throughout the week, such as “I am still working on optimising my model parameters.” Instead, it is better practice for a data science team to have perhaps only two quick stand-ups per week.
  • Given the aforementioned experimental nature of data science, task planning and the Scrum Poker principle should also be applied differently in a data science team. Rather than estimating the effort required for each small task, as in software development, time periods should be set in advance for each project stage: how much time should a team member invest in a particular project step? For project step A, the project-owner data scientist will spend a maximum of X days and then move on to step B using the best possible outcome achieved within the time frame set for step A. Here is a fun fact about Data Science: a data scientist could spend a whole year just thinking about new features to add to a model or tuning the parameters of a machine learning algorithm to end up with the best one. This sums up the “experimental” side of data science and explains why we call it “science”.
  • Backlogs can also be designed differently in a data science project. A data scientist should create multiple back-up plans for each step of the project as backlog tasks. These backlog items can be prioritised to define the order in which other machine learning algorithms or features will be tried out if further improvement of the model is needed. For instance, if the current set of features does not give the desired result, the next thing to do is to make use of another data source to enlarge the feature set. These backlog tasks must be project- or owner-specific: as opposed to the situation in software engineering, a waiting backlog task cannot be assigned to just any data scientist in the team, because most of the time a project is led by a single data scientist. This is another major difference worth mentioning between Software and Data Science development cycles.
  • The Agile principle of favouring face-to-face meetings and keeping stakeholders in the loop during development fits Data Science well too. However, “fewer meetings, more documentation” is the best strategy for increasing productivity in the team. Two stand-ups and one internal team meeting per week will suffice. Apart from that, the head of data science and the project manager should have periodic face-to-face alignment meetings with the business throughout a project, for two main reasons: first, to update the business on the current status of the project; second, and more importantly, to make sure the business is doing what it is expected to do before the product release, such as coming up with more concrete use cases for the final product. For instance, if the team is developing a predictive model to target users in a more personalised way via email marketing, then the business should design and finalise the email templates that will be used once the model is ready. Therefore, it is crucial to keep Agile's face-to-face meeting policy for Data Science as well.
  • For a data science team, sprints tend to take longer than in software engineering teams. It is more efficient to plan sprints of up to one month rather than one or two weeks. Of course, it is always a good idea to have retrospective meetings after each sprint to see whether the team is on track or behind the project schedule and deadlines.

Part II: Production (Deployment)

In the second part of the overall data science project pipeline, we need to deploy the trained model coming from the previous experimental (development) phase. Currently, this is arguably the biggest challenge for Data Science teams in general, which is why people have started discussing different tools for achieving a smooth product deployment flow. We already had a solution and a pre-defined set of tools for the same purpose in Software Engineering: DevOps, which enables software engineers to have continuous development, testing and integration along the pipeline. Since data science has its own needs and unique nature, it requires a different treatment than DevOps, and this was named “MLOps”.

Let's start by listing the biggest challenges a data science team faces during product deployment. There is a data science layer, which contains all the tools, frameworks, IDEs and programming languages used by the team. There is also a production environment layer, where the data lake resides and the applications run. The disconnect between these two layers creates many problems for teams, and this gap between data scientists and the production environment needs to be filled to increase the efficiency of a data science team. That is where MLOps comes into play.

MLOps emerged as a remedy for the following problems:

  • After handing a newly trained model to a data engineer to put into the production environment, data scientists no longer have strong control over their own models. Even a simple change to the model tends to take too long. Data scientists should be able to see the effect of their latest change on the model as soon as possible. That is what we call “Continuous Development/Experimentation”.
  • Every team needs to use a code repository such as GitHub to store all project implementations. The code must follow certain standards and comply with a team implementation manifesto pre-defined by the head of data science and the lead data scientist. We will discuss this in detail in the next section.
  • Model versioning plays a crucial part here. It is valuable to see the model history of a project and to track the experiments that were run to end up with the best final model. For instance, a data scientist is likely to try out different sets of features, different machine learning models and different parameter combinations for each of those models. This is a huge number of trials and needs versioning logic behind it. In the end, production-candidate models are registered in a central “model repo”.
  • Most of the time, data science projects share common characteristics and are inspired by each other. They may even make use of the same set of features but evolve into different solutions by mapping features to different targets. For instance, a churn model and a fraud model have a lot in common by their very nature: they might use similar user features to predict whether a user will churn or is a fraud threat. Here, the importance of having a central “feature repo” comes into play. A team should have a feature store in which intermediate datasets reside and high-level data points are stored. It will greatly improve collaboration within the team.
  • A “Continuous Training Flow” is one of the main parts of a data science project flow. It means that the jobs that train models need to be executed automatically, repeatedly and continuously. This can be done either with periodic scheduled training jobs running daily or weekly, or with jobs triggered by other factors, such as every time new training data becomes available. This flow outputs newly trained models and saves them in the central model repository where all the models running in production are stored.
  • Using a trained model to generate predictions in the production environment requires a “Continuous Prediction Flow”. These predictions should be generated and served through scheduled prediction jobs. This flow outputs new predictions by feeding the models with the freshest, most up-to-date data and saves the results in a data store that serves production.
  • Personally, I find MLOps' monitoring facility the most important of all. After putting a model into production, it is vital to monitor the data and the model at all times in order to catch problems and solve them quickly. There can be data skew in raw or intermediate datasets along the pipeline because of a problem in data collection or in a job producing those intermediate datasets. Statistics such as the mean, max, min or percentile distribution of a column should be consistent in an error-free production environment. If there is a sudden spike or change in a particular data point, it will have an impact on all the following components and cause problems in the final serving result table. Therefore, it is essential to see, catch and solve those problems right away. This is where “data monitoring” takes place. For instance, if your project pipeline starts with clickstream hit-level data and there is a problem in the streaming data collection job, it will likely affect the entire data pipeline downstream. Most of the time, those streaming data collection jobs are managed by other teams, but a data scientist can at least monitor the data statistics and compare them with previously generated ones to decide whether something has gone wrong in the data (a minimal sketch of such a check follows this list). The same can be done for intermediate datasets: monitor them and see whether something is going wrong in the job that produced them. For instance, if you have a job that converts hit-level data into aggregated user-level data, it is good practice to run sanity checks on the user-level data as well. Another example is converting this user-level data into high-level training features; this feature set needs to be monitored continuously so the data scientist knows whether the training data generation job is working properly.
  • Monitoring is also needed to track model performance and detect sudden changes in the model metrics. This is called “model monitoring”, and it gives you clues about the sanity of the model and the modelling job.
  • Likewise, we need “prediction monitoring” to detect any sudden skew in the distribution of the final predictions and prevent problems in the prediction job.
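
As a minimal sketch of the kind of data monitoring check described above, assuming hypothetical parquet snapshots of the same intermediate dataset from the current and the previous run (column names and the 20% threshold are illustrative):

```python
import pandas as pd

previous = pd.read_parquet("user_features_previous.parquet")
current = pd.read_parquet("user_features_current.parquet")

def summarise(df: pd.DataFrame) -> pd.DataFrame:
    # Basic statistics (mean, min, max, percentiles) for every numeric column.
    return df.describe(percentiles=[0.25, 0.5, 0.75, 0.99]).T

prev_stats, curr_stats = summarise(previous), summarise(current)

# Flag any column whose mean shifted by more than 20% between the two runs.
shift = (curr_stats["mean"] - prev_stats["mean"]).abs() / prev_stats["mean"].abs()
suspicious = shift[shift > 0.20]

if not suspicious.empty:
    print("Possible data skew detected in columns:")
    print(suspicious.sort_values(ascending=False))

# In a real pipeline this comparison would be written to a log file so that a
# dashboard can visualise the drift over time, as described in the next section.
```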

VI. Data Science Implementation Manifesto: How to design different sub-components of an implementation pipeline

For most data scientists, it is common practice to write all the code in a single notebook and work in that notebook until the end. I do not think this is a production-friendly or efficient way of implementing an ML modelling project. Sticking with your notebook is still useful only while you are exploring the data and trying to produce initial results from a proof-of-concept (POC) work. However, you will have to use a more organised, step-by-step methodology in your implementation before moving on to production integration. Each member of a Data Science team should be asked to follow pre-set coding guidelines on how to design the implementation stages of an internal team project.

You have probably seen and read enough about the common major stages of a generic data science project: Data Cleaning → Feature Engineering → Modelling → Generate Predictions. These are the consecutive steps that must be followed by a data scientist while working on a case. More importantly, these steps must be designed as separate components. In this way, we ease the challenges of the subsequent product deployment step and increase our chances of ending up with an error-free production pipeline. In the end, a data scientist should always keep in mind that these steps will run in production as sequential jobs. Therefore, the implementation phase and coding design patterns play a critical role here.

This is the point where you can make use of object-oriented programming principles in your implementation pipeline. Here is the list of reasons why it is important to have sequential components or jobs in your pipeline (a minimal sketch of such a component-based design follows the list):

  • It diminishes the complexity of the whole pipeline. Most of the time, if you have truly big data, running everything in a single job is simply not feasible, and such a setup is also impossible to scale up. Therefore, you should always look for opportunities to divide the flow into a sufficient number of smaller jobs.
  • It makes the pipeline more manageable, so that any change to a specific component takes effect faster. Thanks to the step-by-step implementation pipeline, you have the big picture of all the sub-components and know how a single update in a specific step affects the following steps. This enables you to automate the update process as a chain of events instead of running everything from scratch again.
  • It enables us to store intermediate datasets. Each job outputs a dataset that is passed as input to the next stage in the pipeline. If we save those datasets somewhere, collaboration in the team improves. For instance, if you have a sub-component whose only job is to convert hit-level data into aggregated user-level data, some other team member could make use of that user-level data for their own project.
  • It makes it easier to catch and correct a production error. You can easily spot the exact problematic component or job and then deal directly with that specific part of the pipeline.
  • It also makes monitoring easier and more manageable. You can run a health check on your intermediate datasets and models by comparing the previously generated one with the newly generated one. Those monitoring jobs should save their comparison log files somewhere in your system. Then you can even create your own dashboard simply by visualising those log files in some nice plots.
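
Here is a minimal sketch of this component-based design: each stage is a small class with a single run method that reads its input dataset, writes its output dataset and returns the output path, so the stages can be scheduled as independent sequential jobs in production. All paths, column names and the toy churn label are hypothetical.

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression


class DataCleaning:
    def run(self, raw_path: str, out_path: str = "clean_events.parquet") -> str:
        df = pd.read_parquet(raw_path)
        df = df.dropna(subset=["user_id", "item_id"])  # drop obviously broken rows
        df.to_parquet(out_path)                        # intermediate dataset is stored
        return out_path


class FeatureEngineering:
    def run(self, clean_path: str, out_path: str = "user_features.parquet") -> str:
        df = pd.read_parquet(clean_path)
        features = (
            df.groupby("user_id").agg(n_events=("item_id", "count")).reset_index()
        )
        features.to_parquet(out_path)                  # reusable by other projects
        return out_path


class Training:
    def run(self, features_path: str, out_path: str = "model.pkl") -> str:
        features = pd.read_parquet(features_path)
        labels = pd.read_parquet("user_labels.parquet")  # hypothetical churn labels
        data = features.merge(labels, on="user_id")
        model = LogisticRegression().fit(data[["n_events"]], data["churned"])
        joblib.dump(model, out_path)                   # registered in the model store
        return out_path


# In production, a scheduler would run these as sequential jobs:
clean_path = DataCleaning().run("raw_events.parquet")
features_path = FeatureEngineering().run(clean_path)
model_path = Training().run(features_path)
```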

The diagram below gives a high-level flow of the common components in a data science production pipeline.

High-level Flow of Common Components in a Data Science Production Pipeline (Image by Author)

You may have already noticed a strong similarity between this section and the previous one, in which I talked about MLOps. The reason is that the MLOps strategy depends entirely on this separation-of-components principle during the implementation phase. Ultimately, MLOps tools are there to give you this step-by-step setup ready to use. Therefore, it is strongly recommended either to use an external MLOps tool or to design your implementation steps in the way described in this section.
