‘Making’ and ‘checking’ reliability are very, very different

Sep 6, 2021

At the 2019 Annual Reliability and Maintainability Symposium, I was involved in a discussion with US Department of Defense (DoD) reliability engineering teams and industry representatives. On the agenda was a review of an emerging document called the ‘Reliability and Maintainability Engineering Management Body of Knowledge.’ Let’s call this the ‘DoD RAM BOK’ for short.

I only had access to this document’s quick reference guide, which suggests that the DoD RAM BOK describes the reliability engineering activities that need to happen from ‘concept’ through to ‘operations’ for a typical military capability being introduced into service.

A list of activities is not a strategy

So what was in this quick reference guide?

The guide described 58 separate ‘reliability engineering’ activities. I decided to categorize them based on the keywords that best described each. And here are the results:

23 x ‘review’ activities

14 x ‘analyze or evaluate’ activities

13 x ‘prepare or update documents’ activities

4 x ‘verify, validate or test’ activities

2 x ‘planning’ activities

1 x ‘status reporting’ activity

1 x ‘trade-off’ activity

This is all kinds of wrong – for the same reason you can’t drive a car using the rear vision mirror. The overwhelming sentiment of these activities is to observe what has already been done. And this approach hasn’t worked yet.

It is already too late by the time you observe a reliability problem

There was clearly no ‘contingency’ built into the baseline schedule in the quick reference guide to allow for rectification of any issues identified in the 72% of activities (42 of 58) that focused on review/analyze/evaluate/verify/validate/test/status reporting.
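For readers who want to check the arithmetic, here is a minimal sketch that reproduces the 72% figure from the category counts listed earlier. The variable names are just for illustration, and the grouping of ‘observing’ categories follows the review/analyze/evaluate/verify/validate/test/status reporting list in the paragraph above.

```python
# Activity counts per keyword category, as tallied from the quick reference guide.
activity_counts = {
    "review": 23,
    "analyze or evaluate": 14,
    "prepare or update documents": 13,
    "verify, validate or test": 4,
    "planning": 2,
    "status reporting": 1,
    "trade-off": 1,
}

# Categories that observe what has already been done
# (review/analyze/evaluate/verify/validate/test/status reporting).
observing = {
    "review",
    "analyze or evaluate",
    "verify, validate or test",
    "status reporting",
}

total = sum(activity_counts.values())
observing_total = sum(n for name, n in activity_counts.items() if name in observing)
observing_share = observing_total / total

print(f"{observing_total} of {total} activities ({observing_share:.0%}) observe work already done")
# prints: 42 of 58 activities (72%) observe work already done
```

By the same count, only 2 of the 58 activities are about planning.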

So what happens if there is a problem?

Nothing really.

No production or manufacturing program can survive systemic ‘do-overs’ of design effort. An issue identified during production that requires even a moderate amount of redesign is quite rightly referred to as a ‘crisis.’ We can only sustain so many crises before something else needs to give.

And this is often reliability.

Problems will often get passed on to the next team or department or phase before they are finally gifted to the user or customer.

Why can’t I just ‘check’ that reliability has happened?

Because it just doesn’t work.

Firstly, a ‘checker’ or ‘auditor’ is not a ‘designer’ or ‘engineer.’ Even if their business card has the word ‘engineer’ on it.

Organizations that do reliability well focus on the vital few weak points of their emerging system. As soon as they remove a problem, they update their ‘vital few list’ and keep going. They quickly end up with a good understanding of how their product, system or service will fail, and can create (for example) targeted Accelerated Life Tests (ALT) that can quantify the reliability of the few remaining weak points. So they not only design a reliable system, but they are confident of its reliability before it is ‘checked.’
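The ‘vital few list’ loop described above can be sketched in a few lines. This is only an illustration under assumptions: the `WeakPoint` class, the `risk` score, the `vital_few` function and the example failure modes are all hypothetical, and a real team would rank weak points using its own analysis (for example, FMEA outputs or test data).

```python
from dataclasses import dataclass


@dataclass
class WeakPoint:
    name: str
    risk: float  # hypothetical relative risk score; higher means more important
    resolved: bool = False


def vital_few(weak_points, size=3):
    """Return the `size` highest-risk weak points that are still open."""
    open_points = [wp for wp in weak_points if not wp.resolved]
    return sorted(open_points, key=lambda wp: wp.risk, reverse=True)[:size]


# Made-up example data, for illustration only.
weak_points = [
    WeakPoint("connector corrosion", 0.9),
    WeakPoint("solder joint fatigue", 0.7),
    WeakPoint("seal wear", 0.5),
    WeakPoint("firmware watchdog reset", 0.3),
]

# Current vital few: the two riskiest open weak points.
print([wp.name for wp in vital_few(weak_points, size=2)])

# Once a problem is removed, the list is updated and the next weak point moves up.
weak_points[0].resolved = True
print([wp.name for wp in vital_few(weak_points, size=2)])
```

The point of the sketch is that the list is always short and always current: removing one problem immediately promotes the next most important one.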

But a ‘checker’ or ‘auditor’ doesn’t know the product, system or service. They weren’t part of the team that designed it. They never challenged their own creation to identify weak points.

So what do they do? They are forced to check for everything in use conditions. Which is very slow. And this sometimes morphs into making sure that certain components and subsystems comply with a myriad of standards or guidebooks.

And so the program quickly devolves into a marathon of compliance activities. Reviews. Evaluations. Et cetera. The young engineers designing the new product, system or service are all focused on testing to pass and not testing to learn. We try and hide failures when we want to pass. We try and encourage failures when we want to learn.

So how do I stop ‘checking’ and start ‘making’ reliability?

It’s not as bad as you think.

Firstly, stop focusing on ‘reviewing’ and start focusing on ‘investing.’ If you (for example) want your team to build reliability into the first iteration of design (which is a really good idea) then get them trained on things like reliability allocation and Failure Mode and Effect Analysis (FMEA). And yes … this includes your suppliers and vendors. How much more likely are you to get them to do an FMEA you like if you train them to do it versus just mandating a standard in a contract?

Secondly, get selfish and focus on ‘results’ now. And I don’t mean the results you naturally think of when it comes to reliability. Reliability will generate value for your user or customer. But reliability built in at the start of the production life-cycle means you avoid all those production crises I talked about above. And that means you start seeing the payoff really soon. It is not unusual for production time-frames to be halved, and budgets to be reduced by three quarters when reliability is baked into the design from the start.

And thirdly, start doing it. What do I mean by that? If you are a leader and you don’t take reliability seriously, then you never let it get onto your agenda. Sure, you say ‘reliability is my number one priority’ at the monthly meeting. But if you talk about budget, schedule and compliance every other time then your team will get the point. And that point is that you like the idea of reliability, but you really want the perception of project status or achievement.

But I can’t stop checking? … surely?

That’s right! So don’t worry, because investing in reliability makes your ‘checking’ really easy. By understanding what the vital few weak points of your product, system or service are from the start, you will be able to come up with corrective actions. And these corrective actions then become your ‘checking’ strategy. So as you go through the project, your ‘reviews’ focus on how the proposed corrective actions are being implemented.

The beauty of this approach is that you are checking for things that you have determined are important for your product, system or service. And not everything that everyone else thinks might be important because they read it in a textbook or a standard. That is when bureaucracy rules and good engineering suffers.

What about you? Do you ‘make’ reliability or simply ‘check’ for it? Do you have any related experiences? We would love to hear from you.

ACUITAS
