One of the nice things about professional social media is the ability to see why good RCA consultants make a great living. Pictures without context are posted and immediately a group of folks jumps in with absolute clarity and announces exactly what the problem is. Mostly the focus is on the broken part itself, with absolutely no mention or thought of operating context, conditions, events, or systems. Basically, if a fuse blows it must have been a defective fuse; let’s ignore the winding short.
I’ve always supported a systems approach with operating context. This is a far cry from trying to say the root cause of failure is a ‘bad bearing,’ when it could be caused by a VFD or even just end-of-life, or any of hundreds of other potential causes. A proper investigation must be performed with good data and the operating context, not a slap-dash glance and conclusion. Rarely are things as obvious as they might appear.
This means stepping back and looking across the system. What other events, decisions, components, or other causes of degradation could cause the symptoms? It is incumbent on the reliability engineer, investigator, or manager to identify whether the defect is acting as a fuse, is the cause itself, or points to some other condition that may be addressed to keep the fault from happening again. Instead, in most cases problems are immediately treated as a matter of repair or replacement. Sometimes the conditions are internally understood as ‘normal’ or ‘routine’ when they are nothing of the sort. Preconceived notions of conditions based upon training or experience are extremely common and a real challenge to avoid. This even extends to communications, where interpretations are based upon personal experience and goals rather than an understanding of the context of the other individual(s) involved.
How often does this occur? Virtually every one of us has at least one tale of a repetitive fault or immediate failure after a repair or replacement. Unfortunately, this condition is often blamed on ‘infant mortality,’ which should be rare, but hearing people discuss it you’d think that repair industries and OEMs cannot perform good service or build a good product. Basically, when I hear ‘infant mortality,’ I immediately think ‘something was missed.’
The bathtub curve and the concepts surrounding infant mortality are a constant in textbooks, papers, and lectures across any industry related to industrial engineering, the parent discipline of modern reliability engineering. The problem is that this particular curve is virtually non-existent in reality. However, if repeated enough, the false becomes truth, such as the concept that the origin of ESA/MCSA was to detect rotor faults in electric machines (it was actually to evaluate driven equipment). As human beings we tend to place human characteristics on objects.
One of the earliest texts associated with the ‘bathtub curve’ is E. Halley’s “An estimate of the degrees of the mortality of mankind, drawn from curious tables of the births and funerals at the city of Breslau; with an attempt to ascertain the price of annuities upon lives,” (Philosophical Transactions Royal Society of London, Vol. 17, pp. 596-610, 1693). No, that’s not a typo: 1693, in a paper about human life curves covering mortality at birth, death as you age, and wear-out (old age). In military (NAVSEA/NAVAIR) RCM training based on Nowlan & Heap, and others, this concept is well understood, because the discussion covers how rare the curve actually is. Another good treatise on this topic is by Georgia-Ann Klutke and M. A. Wortman, professors of industrial engineering at Texas A&M, and Peter Kiessler, Professor of Mathematics at Clemson University, “A Critical Look at the Bathtub Curve,” (IEEE Transactions on Reliability, Vol. 52, No. 1, March 2003), in which they explore the actual curves observed in product life, including electronics.
This agrees with actual data, even for electronics and dielectric (electrical insulation) materials. Defects from OEMs and repair will appear as a hump at the beginning of life, followed by a flat line until a second hump occurs in the population for a variety of reasons, from wear-out to improper application, conditions, or even designed end of life. Once this concept is understood, acceptance testing to detect significant defects, followed by calculation of Time to Failure Estimation (TTFE) or Remaining Useful Life (RUL), becomes realistic with an acceptable degree of accuracy.
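As a rough illustration of that two-hump pattern, the sketch below models a population as a mixture of two Weibull distributions: a small fraction of units with an OEM/repair defect that fail early, and a sound majority that fail at wear-out. Every number in it (the 3% defect fraction, the shape and scale values, the ages printed) is an assumption chosen only to make the shape visible, not a value taken from field data or from the references above, and the function names are hypothetical. The last function is only a simple stand-in for the idea behind RUL: the chance that a unit which has already survived to a given age survives a further period.

```python
import math

# Illustrative assumptions only: none of these values come from field data.
DEFECT_FRACTION = 0.03                  # share of units carrying a defect
DEFECT_SHAPE, DEFECT_SCALE = 2.0, 0.5   # Weibull beta, eta (years), defective units
WEAR_SHAPE, WEAR_SCALE = 5.0, 20.0      # Weibull beta, eta (years), sound units


def weibull_survival(t, shape, scale):
    """Probability that a unit is still running at age t (t >= 0)."""
    return math.exp(-((t / scale) ** shape))


def weibull_density(t, shape, scale):
    """Failure density at age t; shape > 1 is assumed so that t = 0 is safe."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_survival(t, shape, scale)


def population_survival(t):
    """Survival of the mixed population (defective plus sound units)."""
    return (DEFECT_FRACTION * weibull_survival(t, DEFECT_SHAPE, DEFECT_SCALE)
            + (1.0 - DEFECT_FRACTION) * weibull_survival(t, WEAR_SHAPE, WEAR_SCALE))


def population_density(t):
    """Failure density of the mixed population; this is where the two humps show."""
    return (DEFECT_FRACTION * weibull_density(t, DEFECT_SHAPE, DEFECT_SCALE)
            + (1.0 - DEFECT_FRACTION) * weibull_density(t, WEAR_SHAPE, WEAR_SCALE))


def hazard(t):
    """Failure rate among units that have survived to age t."""
    return population_density(t) / population_survival(t)


def conditional_survival(age, horizon):
    """Chance that a unit alive at `age` survives a further `horizon` years."""
    return population_survival(age + horizon) / population_survival(age)


# Print the density and hazard at a few ages to show hump, flat region, hump.
for age in (0.5, 5.0, 10.0, 19.0, 25.0):
    print(f"age {age:5.1f} y  density {population_density(age):.5f}  "
          f"hazard {hazard(age):.5f}")
```

With these assumed parameters, the early hump comes only from the defective fraction. That is the argument for acceptance testing: remove those units and what remains is the flat region and the later wear-out hump.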
Often the driving force behind early failure is the application, operating context, or environment, and whether the system or device is resistant to those conditions. Applying proper RCFA when unexpected or frequent failures occur will help identify those conditions, whether they result from OEM/repair defects, human interaction, or some other driving condition that causes the defect (fuse) to trigger.
Just a few things to consider.
Howard W Penrose, Ph.D., CMRP
President, MotorDoc LLC