Data-Science is for Failures That Don’t Matter — Dr. Mark Horton

Kenneth Shakesby
Published in DataDrivenInvestor
Nov 11, 2018


Doesn’t that seem strange? As far as “big data” goes, the failure and planned maintenance records in an ERP are about as big as they come. Every year we add tens or hundreds of thousands of them. The result is gigabytes of data waiting to be coaxed into giving up their secrets: the key to improved profitability, safety, environmental integrity and a better working life.

But the reality is very different. Let’s start with a case study; by the end you will see why “digital transformation” can become an expensive adventure that fails to deliver solutions for enterprises in the Asset Management world of engineering maintenance. More and more companies are selling the promise of data-science-backed solutions, but is it really that clear-cut?

Optimum Maintenance from Failure History

You want to set up a planned maintenance policy for the Product Dryers used in a Chemical Plant. You expect to put some sort of preventive task in place, so how often should the bearing be replaced? Here is the plan of campaign:

  1. Download the failure history — Easy. A quick query for all the dryer bearings produces a total of 20 failure records. A little arithmetic gives you each bearing’s age at failure.
  2. Draw a survival chart — Each circle on the chart below represents a failure, and its age at failure is plotted along the x-axis. The y-axis is the proportion of bearings surviving to any given age. The shortest bearing life is about 13000 hours, the longest just over 60000 hours, and half the bearings have lasted about 35000 hours.
  3. Fit a Weibull curve — The orange curve represents a best-fit Weibull survival curve through the data. (Fig 1; a code sketch of steps 1–3 follows the figure caption.)
  4. Work out the optimum planned maintenance interval — Knowing the cost of planned maintenance and the cost of an unplanned breakdown, together with the Weibull curve and some maths, you can work out the optimum maintenance interval. In the graph below you can see how the hourly costs of unexpected failures and planned maintenance change as the bearing’s replacement interval is varied. The minimum total cost is the best compromise between the risk of unplanned failure and the cost of planned maintenance. (A code sketch of this cost model follows Fig 2.)
Fig 1
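Steps 1–3 can be reproduced with a few lines of Python. This is only a minimal sketch: the twenty failure ages below are invented values chosen to match the rough shape described above (shortest about 13000 hours, median about 35000 hours, longest just over 60000 hours), and scipy’s weibull_min stands in for whatever fitting tool you prefer.

```python
# Sketch of steps 1-3: survival-chart plotting positions and a Weibull fit.
# The ages are illustrative stand-ins for the real ERP extract.
import numpy as np
from scipy.stats import weibull_min

ages = np.array([13_100, 17_400, 21_800, 24_500, 26_900, 28_300, 30_200,
                 31_700, 33_400, 34_800, 36_100, 37_900, 39_600, 41_800,
                 44_200, 47_000, 50_300, 54_100, 57_600, 60_400])  # hours

ages.sort()
n = len(ages)
# Empirical survival: proportion of bearings still running at each failure age
# (median plotting positions, i.e. the circles on the chart).
empirical_survival = 1 - (np.arange(1, n + 1) - 0.5) / n

# Best-fit Weibull survival curve through the data (location fixed at zero).
shape, _, scale = weibull_min.fit(ages, floc=0)
fitted_survival = weibull_min.sf(ages, shape, scale=scale)

print(f"Weibull shape ~ {shape:.2f}, scale ~ {scale:.0f} hours")
for age, emp, fit in zip(ages, empirical_survival, fitted_survival):
    print(f"{age:>7} h   survival: empirical {emp:.2f}   fitted {fit:.2f}")
```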

Changing the bearing at about 14000 hours incurs the lowest overall cost of about $3 per hour. Maintenance history and costs in, optimum maintenance out: that’s a perfect result. (Fig 2)

Fig 2
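The optimisation in step 4 is the classic age-replacement cost model: if the bearing is replaced at age T or at failure, whichever comes first, the expected cost per operating hour is the expected cost of one cycle (planned cost times the probability of surviving to T, plus failure cost times the probability of failing before T) divided by the expected running time per cycle (the integral of the survival curve from 0 to T). Here is a minimal sketch; the Weibull parameters and the planned-replacement cost are illustrative assumptions, and only the $250,000 unplanned-failure cost comes from later in this article.

```python
# Sketch of step 4: expected cost per hour versus planned replacement interval.
import numpy as np
from scipy.stats import weibull_min
from scipy.integrate import quad

shape, scale = 2.2, 40_000   # illustrative Weibull parameters (hours)
cost_planned = 5_000         # assumed cost of a planned replacement ($)
cost_failure = 250_000       # cost of an unplanned in-service failure ($)

def cost_per_hour(T):
    """Age-replacement model: replace at age T or on failure, whichever is first."""
    R = lambda t: weibull_min.sf(t, shape, scale=scale)   # survival function
    expected_cycle_cost = cost_planned * R(T) + cost_failure * (1 - R(T))
    expected_cycle_hours, _ = quad(R, 0, T)               # mean running time per cycle
    return expected_cycle_cost / expected_cycle_hours

intervals = np.arange(5_000, 60_001, 1_000)
costs = np.array([cost_per_hour(T) for T in intervals])
best = intervals[costs.argmin()]
print(f"Lowest total cost ~ ${costs.min():.2f}/hour at a {best}-hour replacement interval")
```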

Today’s Job — Review the Planned Maintenance Interval for the Dryer Bearings

  1. Download failure history — Log in to your ERP and download the bearing’s failure records.
  2. Draw a survival chart — There’s one record, and here it is…in all its majestic glory! (Fig 3)
Fig 3

This is why you can’t use your historical failure data

It’s because you don’t have any!

When the dryers were first installed, planned bearing replacement was scheduled every 15000 hours. Why? Because an in-service failure matters: it costs around $250,000 every time it happens. The original maintenance schedule was intended to prevent in-service failures by replacing the bearing before it failed, assuming that it would last for at least 15000 hours.

As it turns out, that assumption was wrong: one bearing did fail early. But the scheduled replacement means that the other 19 failure records you saw in your dream, those to the right of the red line, never happened.

There is a big, blank space where they aren’t. Every event recorded in the maintenance database applies only to equipment that is less than 15000 hours old. You wanted to use the recorded history to set an optimum maintenance policy, but one record is all that is available. You have no idea what would have happened if the equipment had been allowed to fail in that region. None at all.

There is no hope of predicting the bearing’s reliability at 20000 hours, 40000 hours or any other time from the evidence of a single failure. The survival curve could slope gracefully down because of random failure. It could fall off a cliff edge, showing sudden wear out. There could be a plateau; the single recorded failure might have been a manufacturing defect, the result of mis-operation or just very bad luck.
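To see how wide open the possibilities are, here is a small sketch with three hypothetical Weibull survival curves, one for each of the shapes just described (random failure, gradual wear-out, cliff-edge wear-out). Each is calibrated so that about one bearing in twenty fails before 15000 hours, so all three are equally consistent with the single recorded early failure.

```python
# Three hypothetical failure patterns, all consistent with roughly one failure
# in twenty before the 15,000-hour replacement ceiling.
import numpy as np
from scipy.stats import weibull_min

shapes = {"random failure": 1.0, "gradual wear-out": 2.0, "cliff-edge wear-out": 6.0}

for name, shape in shapes.items():
    # Calibrate the scale so the survival probability at 15,000 hours is 0.95.
    scale = 15_000 / (-np.log(0.95)) ** (1 / shape)
    survival = {t: weibull_min.sf(t, shape, scale=scale)
                for t in (15_000, 20_000, 40_000, 60_000)}
    print(f"{name:>20}: " + "  ".join(f"{t} h: {s:.2f}" for t, s in survival.items()))
```

All three curves agree with everything observed below 15000 hours, yet they disagree completely about survival at 40000 or 60000 hours, which is precisely why the single data point cannot choose between them.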

Without knowing the physics of failure development, no amount of analysis of the one data point below 15000 hours can tell us which of these it is. Even if we can find a physical failure model, the chances of achieving any level of certainty seem minimal. This is the paradox of failure recording. There are plenty of records that don’t matter: those with trivial consequences, such as dead indicator lights and seals whose leakage caused only a slight inconvenience.

Unless something is very wrong, the database doesn’t contain many records of turbines failing catastrophically in service, pressure vessels exploding or critical pipelines corroding right through. If a failure matters, the chances are that a pro-active task has already been put in place to prevent the failure. If the failure is prevented, there’s no failure history and no data for analysing reliability trends.

As a result you probably don’t have historical data that could be used to optimise age-based replacement intervals for failures that really matter. Just to be clear, a failure “really matters” if one or more of these conditions applies to it:

  • It would be expensive to fix
  • It would lead to significant and costly downtime
  • The failure could lead to a safety or environmental incident

Resnikoff Already Knew

None of this is new.

If there were a prize for honesty in reliability analysis, it would have to be awarded to Howard L Resnikoff. His extended paper Mathematical Aspects of Reliability-Centered Maintenance published in 1978 is a sort of companion piece to Stanley Nowlan and Howard Heap’s legendary Reliability-Centered Maintenance. Resnikoff says this in his introduction before he even begins the main sections:

“One of the most important contributions of the Reliability-Centred Maintenance Programme is its explicit recognition that certain types of information heretofore actively sought as a product of maintenance activities are, in principle, as well as in practice, unobtainable.”

After six chapters covering the statistics of survival distributions, hazard rates, inference, Bayes’ theorem and system reliability modelling, you might expect him to conclude by emphasising how important thorough statistical analysis is to RCM decisions. Not H L Resnikoff. Instead, this is what he says about the availability of data in the real world:

“The more effective the [existing maintenance] program is, the fewer critical failures will occur, and correspondingly less information about operational failures will be available to the maintenance policy designer. That the optimal policy must be designed in the absence of critical failure information, utilizing only the results of component tests and prior experience with related but different complex systems, is an apparently paradoxical situation.

Moreover, the applicability of statistical theories of reliability to the very small populations of large-scale complex systems typically encountered in practice is questionable and calls for some discussion. Each of these distinct viewpoints leads to the conclusion that maintenance policy design is necessarily conducted with extremely limited information of dubious reproducibility, and we must consider why it is nevertheless possible, and how it can be done.”

In other words: “You may think you have usable information, but you don’t have it and you probably can’t get it”.

Conclusions

Resnikoff’s Conundrum is highly relevant to machine learning (ML).

ML looks for correlations in data and expresses them as functions: given the inputs we present, the system learns to map them to a known output (pass/fail, failing/OK, a probability of failure), whatever that output might be. Because the training data for such an algorithm must contain both the inputs and the known outputs, it follows that to predict failures we need both the input data and the failure records. Resnikoff tells us we will have plenty of experience of failures that don’t matter, but precious few records of failures that do.
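As a concrete, and entirely hypothetical, illustration of what that scarcity does to a supervised learner, imagine an asset with a hundred thousand logged records and exactly one critical failure among them:

```python
# Hypothetical illustration: a supervised learner with almost no failure labels.
import numpy as np

n_records = 100_000                    # imaginary logged records for the asset
labels = np.zeros(n_records, dtype=int)
labels[0] = 1                          # the single critical failure on record

# With data this one-sided, the "model" any learner tends towards is simply
# "never predict a failure", and it scores almost perfect accuracy doing so.
predictions = np.zeros(n_records, dtype=int)

accuracy = (predictions == labels).mean()
caught = int((predictions[labels == 1] == 1).sum())
print(f"Accuracy: {accuracy:.3%}   Critical failures predicted: {caught} of 1")
```

A model that is 99.999% accurate yet never anticipates the one failure that matters is exactly the outcome Resnikoff’s paradox predicts.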

So our training data set will be either empty or too sparse to be useful unless we can aggregate data from thousands of users, and even then it would take time (months, years or more) to gather the data needed. Digital transformation takes time, lots of time. However, we have developed a solution that starts to work from day one on your physical assets.

Reach out so we can discuss how to maximise your asset uptime, whilst improving compliance with fully managed costs and risks.

kenneth.shakesby@relmar.co.uk

contact@relmar.co.uk
