There's a move that every operations executive makes at least once. Unplanned downtime is climbing. Breakdowns are getting more frequent. Something clearly isn't working. So the directive comes down: increase PM frequency. Add inspections. Expand the predictive monitoring program. If we're missing failures, we must not be looking hard enough.
It's logical. It's intuitive. And in a low-maturity plant, it's almost completely ineffective.
This is one of the hardest things to explain in reliability, because it goes against every managerial instinct. More inspections should catch more problems. More PMs should prevent more failures. The math seems obvious. But the math is wrong — not because PM doesn't work in principle, but because PM only works when the system it's operating in has the capacity to act on what it finds.
In most plants, the system is already drowning. Adding more inspections just adds more water.
The Crisis Mode Trap
To understand why more PM fails, you need to understand what's actually happening in a plant that's stuck in reactive mode.
In a typical low-maturity facility, 60 to 80 percent of maintenance labor goes to unplanned work — emergency repairs, breakdown response, firefighting. That's not a staffing failure. It's a systemic condition. When the majority of assets in a facility are operating in advanced stages of degradation, the sheer volume of things going wrong simultaneously overwhelms every structured process the organization tries to run.
Technicians spend their days chasing emergencies. Planners can't plan because the schedule keeps getting blown up by breakdowns. Supervisors manage chaos instead of work quality. The backlog grows, but nobody has time to work it. And overtime becomes so normalized that people forget it was supposed to be exceptional.
This is crisis mode. It feeds itself. Reactive repairs get rushed because there's always another emergency waiting. Rushed repairs fail sooner because quality was sacrificed for speed. Premature failures generate more emergencies. The cycle tightens.
Now, into this environment, someone adds more PM tasks.
What Actually Happens When You Add PMs to a Broken System
In theory, a new PM inspection should discover developing problems early enough to schedule a fix. In a well-functioning maintenance system, that's exactly what happens. The PM catches a worn belt, a leaking seal, a bearing with elevated vibration — and the finding enters a planning queue where it gets properly scoped, parts get ordered, and the repair gets scheduled during a production window.
In a plant already running in crisis mode, here's what actually happens.
The PM gets executed — maybe. PM completion rates might look acceptable on the dashboard, but the actual quality of those inspections is degraded. Technicians are rushing through PM routes because they're being pulled to emergencies. Checkboxes get checked. Measurements get eyeballed instead of taken. "Looks fine" gets written on inspection forms for equipment that hasn't been closely examined.
When a PM does catch something real — and it will, because in a crisis-mode plant, nearly everything is degrading — the finding enters a backlog that nobody has capacity to work. The planned repair gets deferred once, then twice, then it falls off the priority list entirely because three new emergencies have jumped ahead of it. The equipment continues to degrade. Eventually it fails anyway, and the PM finding becomes an entry in a database that proves someone saw the problem and nobody fixed it.
This is the frustrating paradox: the PM program generates data that confirms the plant has problems, but the system lacks the capacity to convert that data into prevention. The inspections worked. The system didn't.
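The capacity squeeze described above can be sketched with rough arithmetic. The Python below is an illustrative toy model, not plant data. The crew hours, the 70 percent reactive share (picked from the 60 to 80 percent range mentioned earlier), the findings-per-PM-hour rate, and the hours per repair are all assumed numbers, and the function name is hypothetical.

```python
def weekly_backlog(weeks, crew_hours, reactive_share, pm_hours,
                   findings_per_pm_hour, hours_per_repair):
    """Toy model: open PM findings (counted in repairs) at the end of each week."""
    backlog = 0.0
    history = []
    for _ in range(weeks):
        # Emergencies take their share of the crew first (assumed fixed here).
        reactive_hours = crew_hours * reactive_share
        # Only what is left after emergencies and PM routes can fund planned repairs.
        repair_hours = max(0.0, crew_hours - reactive_hours - pm_hours)
        # Inspections generate findings in proportion to the hours spent inspecting.
        backlog += pm_hours * findings_per_pm_hour
        # Close as many findings as the remaining hours allow.
        backlog -= min(backlog, repair_hours / hours_per_repair)
        history.append(round(backlog, 1))
    return history


# All inputs below are assumptions chosen for illustration.
common = dict(weeks=8, crew_hours=400, reactive_share=0.70,
              findings_per_pm_hour=0.25, hours_per_repair=6)
current_pm = weekly_backlog(pm_hours=60, **common)
more_pm = weekly_backlog(pm_hours=100, **common)

print("backlog after 8 weeks, current PM load:", current_pm[-1])
print("backlog after 8 weeks, added PM load:  ", more_pm[-1])
```

Under these assumed inputs, the added inspections both produce more findings and eat into the hours available to fix them, so the backlog grows several times faster. The model holds the reactive share constant; in a real plant, deferred findings would eventually fail and push that share higher.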
The Analogy That Sticks
There's a phrase in the reliability world that captures this perfectly: adding more PMs in crisis mode is like brushing your teeth during a heart attack. The dental hygiene isn't wrong in principle. It's just catastrophically mismatched to the actual problem.
When dozens of assets are simultaneously operating in advanced degradation states (the right-hand side of the P-F curve, which traces an asset from the first detectable potential failure, P, to functional failure, F), the plant is experiencing a systemic condition that PM was never designed to address. PM assumes a baseline of stability. It assumes most equipment is in reasonable condition and the goal is to catch the occasional exception before it becomes a failure. In a stable plant, that assumption holds.
In a crisis-mode plant, the exception IS the baseline. Everything is degrading. Everything is urgent. And PM just becomes another process competing for the same overwhelmed resources.
The "More Predictive Monitoring" Variation
The same logic applies to the next move executives often make: investing in predictive monitoring technology. Vibration sensors. Thermal cameras. Oil analysis programs. Ultrasonic detection. The thinking is that better technology will catch problems that human inspections miss.
The technology works. That's not the issue. The issue is the same capacity problem. In a low-maturity plant, predictive routes consistently generate urgent findings — because the assets are in advanced degradation. Every vibration route comes back with multiple red flags. Every oil sample shows contamination or degradation. Every thermal scan identifies hot spots.
These are all real problems. But they're all problems that require technician capacity, parts, planning, and scheduled downtime to address. The same resources that are already consumed by emergencies.
The result is a predictive program that generates a growing queue of known problems that nobody can fix in time. It's worse than having no data at all, because now the organization knows about the failures that are coming and still can't prevent them. That's demoralizing for every level of the maintenance organization.
So What Actually Works?
If more PM and more monitoring don't stabilize a plant in crisis, what does? The answer is counterintuitive, and it's grounded in a simple observation about capacity.
The bottleneck in a reactive plant isn't detection frequency. It's detection timing.
Problems aren't being missed because inspections are too infrequent. They're being found too late — when degradation has already advanced to the point where the only response is urgent, resource-intensive, and disruptive. By the time a monthly PM route catches a bearing defect, the bearing has been degrading for weeks. The finding arrives on the right side of the P-F curve, where every repair is a mini-emergency.
The leverage point is shifting detection earlier. Way earlier. Into the first 10-20% of the degradation curve, where intervention is cheap, plannable, and non-disruptive.
And the only workforce positioned to detect signals that early is the one that's already standing next to the equipment on every shift: operators.
This is the fundamental insight that changes the math. You don't need more inspections. You need earlier detection by the people who are already there, already looking, already listening — but not yet trained to understand what they're observing.
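The timing argument above can be made concrete with a small calculation. This is a minimal sketch that assumes "percent of the degradation curve" means the fraction of the P-F interval already elapsed, measured in time. The 60-day interval and the 14-day planning lead time are invented for illustration, and both function names are hypothetical.

```python
def lead_time_days(pf_interval_days, detected_at_fraction):
    """Days remaining before functional failure at the moment of detection."""
    if not 0.0 <= detected_at_fraction <= 1.0:
        raise ValueError("detected_at_fraction must be between 0 and 1")
    return pf_interval_days * (1.0 - detected_at_fraction)


def can_be_planned(pf_interval_days, detected_at_fraction, planning_lead_days):
    """True if there is enough time left to scope, order parts, and schedule."""
    remaining = lead_time_days(pf_interval_days, detected_at_fraction)
    return remaining >= planning_lead_days


# Assumed example: 60-day P-F interval, 14 days needed to plan a repair.
for label, fraction in [("operator catch at 15%", 0.15),
                        ("late PM catch at 90%", 0.90)]:
    days = lead_time_days(60, fraction)
    verdict = "plannable" if can_be_planned(60, fraction, 14) else "urgent"
    print(f"{label}: about {days:.0f} days left, {verdict}")
```

With these assumed numbers, the early catch leaves roughly 51 days to plan the repair, while the late catch leaves about 6, which is less than the planning lead time. The inspection frequency is the same in both cases; only the point of detection differs.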
How Early Detection Breaks the Cycle
When operators are trained to recognize the early signals of degradation — and given a system to report those signals — the findings enter the maintenance workflow at a fundamentally different point on the P-F curve.
Instead of discovering a bearing with advanced surface damage during a monthly PM, the system catches a bearing with a faint new vibration three weeks earlier. That early finding doesn't require urgent response. It enters the planning queue calmly. Parts get ordered at normal cost. The repair gets scheduled during a convenient production window. The technician performs the work with adequate time, proper tools, and no pressure.
That one early detection doesn't just prevent one emergency. It frees up the technician capacity that would have been consumed by the emergency repair. That freed capacity can now address something else in the backlog. Each early catch creates capacity that enables the next planned repair, which prevents the next emergency, which frees more capacity.
This is the positive cycle — the inverse of the crisis spiral. And it starts with moving detection earlier, not doing more detection.
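The capacity effect described above can be approximated with simple arithmetic. The sketch below is a deliberately static model with made-up inputs (crew hours, defects per week, 4 hours per planned repair, 16 hours per emergency repair, a 120-job backlog). It does not capture the compounding feedback, only how the share of defects caught early changes the hours left over for backlog work. The function name is hypothetical.

```python
import math


def weeks_to_clear_backlog(crew_hours, defects_per_week, early_share,
                           planned_h, emergency_h, backlog_jobs):
    """Weeks needed to work off the backlog, or None if it never shrinks."""
    # Defects caught early become planned repairs; the rest become emergencies.
    demand = defects_per_week * (early_share * planned_h
                                 + (1 - early_share) * emergency_h)
    spare_hours = crew_hours - demand
    if spare_hours <= 0:
        return None  # every hour is consumed by incoming work
    # Backlog jobs are assumed to be worked as planned repairs.
    return math.ceil(backlog_jobs * planned_h / spare_hours)


# Assumed plant: 400 crew hours/week, 30 new defects/week, 120 jobs in backlog.
for share in (0.0, 0.25, 0.5):
    weeks = weeks_to_clear_backlog(400, 30, share, planned_h=4,
                                   emergency_h=16, backlog_jobs=120)
    print(f"early-detection share {share:.0%}: weeks to clear = {weeks}")
```

In this toy case, with no early detection the incoming work alone exceeds crew capacity and the backlog never shrinks. Catching a quarter of defects early opens up enough hours to clear it in 48 weeks, and catching half brings that down to 5.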
The Sequence That Actually Stabilizes
For a plant stuck in reactive mode, the stabilization sequence looks like this:
First, train operators to detect and report early degradation signals. This is the highest-leverage, lowest-cost intervention available. It doesn't require new technology, new headcount, or new systems. It requires training the existing workforce to do something they're already positioned to do.
Second, build simple, reliable systems for converting operator observations into prioritized work. This means a reporting process, a triage routine (usually a daily leadership walk at the equipment), and a priority framework that distinguishes between "needs attention this week" and "needs attention today."
Third, protect capacity for planned work. As early detection begins generating planned findings instead of emergency discoveries, the maintenance organization must discipline itself to execute planned work on schedule — not defer it every time a new emergency arrives. This is the hardest part, and it requires leadership commitment.
Fourth, and only fourth, optimize PM and PdM (predictive maintenance) programs. Once the system is stabilized, meaning most assets are operating on the left-hand side of the P-F curve and most work is planned, PM frequency, PdM technology, and inspection programs deliver their full intended value. The inspections find early-stage issues instead of late-stage damage. The monitoring catches the exceptions instead of cataloguing a flood of urgent findings.
PM works beautifully in a stable plant. It fails in a crisis-mode plant. The gap between those two realities is where early detection lives.
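The priority framework in the second step can be as simple as a two-tier rule. The sketch below is one hypothetical way to express it. The fields on `Observation` and the rule inside `triage` are assumptions made for illustration, not a prescribed standard; only the two tiers ("today" and "this week") come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """An operator report. All fields are illustrative assumptions."""
    asset: str
    note: str
    safety_risk: bool = False
    worsening: bool = False
    production_critical: bool = False


def triage(obs):
    """Assumed rule: safety issues, or worsening signals on critical assets, go first."""
    if obs.safety_risk or (obs.worsening and obs.production_critical):
        return "today"
    return "this week"


def daily_walk_list(observations):
    """Order reports for the daily walk: 'today' items first, original order otherwise."""
    rank = {"today": 0, "this week": 1}
    return sorted(observations, key=lambda o: rank[triage(o)])


reports = [
    Observation("Conveyor 3", "faint new hum at drive end"),
    Observation("Pump 12", "seal weep getting heavier",
                worsening=True, production_critical=True),
    Observation("Mixer 2", "guard bracket loose", safety_risk=True),
]

for obs in daily_walk_list(reports):
    print(f"{triage(obs):>9}: {obs.asset} ({obs.note})")
```

A rule this small is easy to apply at the equipment during the walk. A real plant would tune the criteria to its own asset criticality rankings.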
The Real First Move
If your plant is running in reactive mode — high unplanned percentage, normalized overtime, backlog growing faster than you can work it — resist the urge to throw more inspections at the problem. That's treating the symptom.
Instead, ask a different question: how early are we finding out about degradation? And who is in the best position to detect it earlier?
The answer to the second question is standing on the floor right now, running the equipment. They just need the training to understand what they're already observing.
Start there. The rest follows.