The manufacturer produced precision components for the aerospace and automotive industries. Their production lines ran 24 hours a day, 6 days a week, with planned maintenance windows of 8 hours every 4 weeks per line. The cost of an unplanned failure, factoring in lost production, emergency maintenance labor, expedited parts costs, customer penalty clauses, and downstream scheduling impacts, averaged $380,000 per hour of unplanned downtime.
In the 18 months prior to our engagement, the 14 production lines had experienced a combined total of 847 hours of unplanned downtime, an average of approximately 4 unplanned failure events per month across the network. At $380,000 per hour, the total direct cost of this downtime over the 18-month period was $321.9M. The indirect cost in customer relationship damage, contract penalties, and premium shipping to meet delivery commitments added an estimated $60M to $80M annually.
The manufacturer had IoT sensors already installed on most of their critical equipment, a legacy of a $24M digital transformation program completed in 2022. Those sensors were generating data. That data was being stored. But no analytical system was processing it in any meaningful way. The operations team received daily CSV files of sensor readings that no one had time to analyze. They were sitting on a goldmine of predictive signal and using none of it.
The previous predictive maintenance attempt had failed for a specific reason. Eighteen months before our engagement, the manufacturer had contracted with an IoT analytics vendor to build predictive maintenance alerts on top of the sensor data. The system was deployed and generated 340 alerts in its first month of operation. Plant maintenance teams investigated 340 potential failures and found actual degradation issues in 38 of them, a false positive rate of 89%. Within 6 weeks, maintenance teams had stopped responding to alerts. The vendor's system ran silently in the background for 11 months before being decommissioned.
The fundamental challenge in industrial predictive maintenance is alert precision. A maintenance team that receives 340 alerts and finds genuine problems in 38 of them will eventually stop responding to alerts. This is not a failure of discipline or attention. It is a rational response to a signal with no predictive value.
The previous system had used a threshold-based anomaly detection approach: any sensor reading outside a defined statistical range triggered an alert. This approach generates many alerts because sensor readings frequently deviate from statistical norms for reasons that are not related to impending failure (temperature changes, production rate changes, different material batches, seasonal variation). The system had been calibrated for sensitivity, not precision, and the result was an unusable alert volume.
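The over-alerting failure mode is easy to reproduce. The sketch below, a simplified stand-in for the vendor's actual system, flags any reading outside a fixed statistical band; a benign transient spike alerts exactly as a genuine drift would, which is why calibrating for sensitivity floods the maintenance team:

```python
import statistics

def threshold_alerts(readings, k=3.0):
    """Flag any reading more than k standard deviations from the mean.

    This mirrors the point-in-time approach described above: every
    excursion triggers an alert, whether or not it reflects degradation.
    """
    mean = statistics.fmean(readings)
    stdev = statistics.pstdev(readings)
    return [i for i, r in enumerate(readings)
            if abs(r - mean) > k * stdev]

# A transient production-rate spike, not a failure, still alerts:
baseline = [50.0] * 50
spike = baseline[:25] + [80.0] + baseline[25:]
print(threshold_alerts(spike))  # flags index 25, the one-sample spike
```

Nothing in the reading itself distinguishes the spike from a precursor signal; only the temporal context can.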
For the new system, we established a non-negotiable precision requirement of 85% before any live alerts would be enabled. An 85% precision rate means that 85% of alerts represent genuine degradation events. This is a substantially higher bar than most commercial predictive maintenance systems achieve in production environments. It required a fundamentally different modeling approach.
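The gap between the prior system and the new requirement is worth making concrete. Using the numbers from the failed deployment:

```python
def precision(confirmed, total_alerts):
    """Alert precision: the share of alerts that reflect genuine degradation."""
    return confirmed / total_alerts

# The prior vendor system: 38 genuine issues out of 340 alerts.
old = precision(38, 340)
target = 0.85  # the non-negotiable bar before live alerts are enabled

print(round(old, 2))      # 0.11 -- roughly 1 in 9 alerts was real
print(round(1 - old, 2))  # 0.89 -- the 89% false positive rate cited above
```

Moving from 11% to 85% precision is not a tuning exercise; it is the order-of-magnitude improvement that forced the different modeling approach described below.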
The second challenge was the heterogeneity of the equipment. The 14 production lines included 7 different equipment types from 5 different OEMs, spanning equipment ages from 3 years to 22 years. Older equipment had different sensor configurations, different baseline operating profiles, and different failure mode signatures than newer equipment. A single generalized model could not capture this heterogeneity without significant precision loss.
Before building any models, we spent 3 weeks conducting structured failure mode and effects analysis (FMEA) with the manufacturer's senior maintenance engineers. This involved cataloging the specific failure modes for each equipment type, identifying the physical precursors to each failure mode that would be detectable in sensor data, and defining the expected lead time between precursor detection and actual failure.
This analysis produced a failure mode taxonomy covering 47 distinct failure types across the 7 equipment categories. For each failure type, we documented the specific sensor signatures that precede failure onset, the typical detection window before failure (ranging from 2 days to 21 days depending on failure type), and the minimum precision threshold required for that failure type to be actionable given maintenance scheduling constraints.
This work was the most important thing we did. It is also the work that most predictive maintenance programs skip. Programs that go straight to model training without failure mode engineering produce models that detect anomalies but cannot distinguish between anomalies that matter and anomalies that do not.
We built 7 separate predictive models, one for each equipment type. Each model was a long short-term memory (LSTM) recurrent neural network trained on multivariate time series data from all sensors on that equipment type. The LSTM architecture was chosen specifically because equipment degradation is a temporal process: the pattern of change over time is more informative than any point-in-time reading. A bearing that has been running hot for 12 hours with increasing vibration is in a different failure state than one that spiked hot briefly and returned to baseline.
For each equipment type, the model was trained on historical sensor data from the 2022 to 2025 period, with failure events labeled from maintenance records. A key data engineering challenge was that maintenance records were inconsistent: some failures had precise timestamps, others had only the shift during which the failure was discovered. We developed an anomaly-back-labeling algorithm that identified the earliest sensor signature consistent with each labeled failure event, extending the labeled training window from the point of failure back to the earliest detectable precursor signal.
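The core of the back-labeling idea can be sketched in a few lines. This is a minimal illustration, not the production algorithm; the anomaly test here is a simple baseline-deviation check, whereas the real system matched against failure-mode signatures:

```python
def back_label(readings, failure_idx, baseline, tolerance):
    """Walk backward from a labeled failure event to the earliest reading
    still consistent with the precursor signature, and return the index
    where the labeled degradation window should begin.

    A reading is treated as anomalous when it deviates from baseline by
    more than `tolerance`; the window extends back through consecutive
    anomalous readings.
    """
    start = failure_idx
    while start > 0 and abs(readings[start - 1] - baseline) > tolerance:
        start -= 1
    return start

# Degradation signature begins at index 3; failure recorded at index 7.
series = [50, 50, 50, 55, 58, 61, 65, 70]
print(back_label(series, failure_idx=7, baseline=50, tolerance=2))  # 3
```

Extending labels back to the earliest precursor is what gives the LSTM positive training examples covering the full degradation window rather than only the final hours before failure.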
The single most important design decision for achieving 85% precision was a multi-stage alert confirmation architecture. Rather than generating an immediate alert when the model detected a degradation signal, the system required the signal to persist above the detection threshold for a minimum confirmation window before generating an alert. The confirmation window varied by failure type: slow-developing bearing degradation required 6 hours of sustained signal before alerting; electrical fault precursors required only 30 minutes because of the faster failure progression.
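The confirmation logic itself is simple; the difficulty is choosing per-failure-mode windows. A minimal sketch, with illustrative thresholds rather than the production values:

```python
def confirmed_alert(scores, threshold, window):
    """Fire an alert only after the model's degradation score has stayed
    at or above `threshold` for `window` consecutive readings (the
    confirmation window). Returns the firing index, or None.
    """
    run = 0
    for i, s in enumerate(scores):
        run = run + 1 if s >= threshold else 0
        if run >= window:
            return i
    return None

# A brief 2-sample spike never alerts; a sustained 6-sample signal does.
spike     = [0.1, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]
sustained = [0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1]
print(confirmed_alert(spike, threshold=0.8, window=6))      # None
print(confirmed_alert(sustained, threshold=0.8, window=6))  # 6
```

In the deployed system, `window` would map to 6 hours of readings for slow bearing degradation but only 30 minutes for electrical fault precursors.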
This confirmation approach sacrificed some detection sensitivity (a very fast-developing failure might not trigger an alert before occurring) in exchange for substantially higher precision. For the failure types where fast development was a concern, we supplemented the predictive model with a separate real-time anomaly detection layer that triggered an immediate alert for sensor readings above a severe threshold, regardless of the persistence requirement.
The new predictive alerts were only valuable if maintenance teams responded to them. We had seen the previous program fail on exactly this point. Our integration approach was to embed alerts directly into the maintenance management software (IBM Maximo) that maintenance planners were already using daily, rather than routing alerts through a separate dashboard or email system. Each alert pre-populated a work order in Maximo with the predicted failure type, estimated remaining useful life, recommended maintenance action, and the specific sensor readings driving the alert. Maintenance planners could approve and schedule the work order with two clicks.
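The shape of a pre-populated work order might look like the following. This is purely illustrative: the field names are assumptions for exposition, not the actual IBM Maximo object model or REST API:

```python
def build_work_order(alert):
    """Map a predictive alert onto the fields a planner needs to approve
    and schedule the work. Field names here are hypothetical, not the
    real Maximo schema.
    """
    return {
        "description": f"Predicted {alert['failure_type']}",
        "asset_id": alert["asset_id"],
        "predicted_failure_type": alert["failure_type"],
        "remaining_useful_life_days": alert["rul_days"],
        "recommended_action": alert["action"],
        "supporting_readings": alert["readings"],  # sensor values driving the alert
        "status": "WAITING_APPROVAL",  # planner approves and schedules in two clicks
    }

wo = build_work_order({
    "failure_type": "spindle bearing wear",
    "asset_id": "LINE07-MILL02",
    "rul_days": 9,
    "action": "Replace spindle bearing at next planned maintenance window",
    "readings": {"vibration_rms": 4.2, "bearing_temp_c": 91.5},
})
```

The design point is that the alert arrives as a nearly complete work order inside the tool planners already live in, so acting on it costs less effort than ignoring it.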
We also established a feedback mechanism: when a maintenance technician closed a work order after inspecting the equipment, they recorded whether they found evidence of the predicted degradation. This outcome data fed back into the model retraining pipeline, continuously improving precision as the system accumulated real-world validation data.
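The live precision metric that this feedback loop produces is just the confirmed fraction of closed alert work orders. A minimal sketch:

```python
def live_precision(outcomes):
    """Fraction of closed alert work orders where the technician confirmed
    the predicted degradation at inspection. `outcomes` is the list of
    booleans recorded at work-order close-out.
    """
    return sum(outcomes) / len(outcomes) if outcomes else None

# e.g. 10 alerts investigated, 9 confirmed at inspection -> 90% live precision
closeouts = [True] * 9 + [False]
print(live_precision(closeouts))  # 0.9
```

Because each close-out also yields a labeled example (alert plus confirmed or rejected outcome), the same records double as retraining data.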
FMEA sessions with senior maintenance engineers. 47 failure modes documented across 7 equipment types. Sensor coverage audit: 4,200 sensors validated, 340 replaced or repositioned to improve signal quality. Historical maintenance records cleaned and failure events labeled.
Real-time sensor data pipeline built on Azure IoT Hub with 1-second resolution for critical sensors, 10-second for secondary sensors. Anomaly-back-labeling algorithm deployed to extend training labels. Feature engineering for LSTM training: rolling statistics, spectral features, cross-sensor correlation features.
7 equipment-type-specific LSTM models trained. Multi-stage confirmation architecture implemented. Alert precision tested against 18-month historical failure record. Achieved 87% precision in backtesting. IBM Maximo integration built and tested with maintenance planning team.
Live deployment on 2 highest-downtime production lines. Daily alert review sessions with maintenance planners. Outcome feedback loop activated. Precision measured in live operation: 91% in first 2 weeks (exceeded 85% target). Maintenance team confidence rebuilt from zero baseline.
Remaining 12 production lines deployed in 3 cohorts over 2 weeks. Pilot outcome data presented to maintenance teams at each subsequent cohort briefing. Full monitoring dashboard activated. Maintenance planning cycle adjusted to incorporate predictive work orders.
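The rolling-statistics features mentioned in the data-engineering phase are the simplest of the LSTM inputs. A toy sketch of the idea, with an illustrative window length:

```python
import statistics

def rolling_features(readings, window):
    """Compute per-step rolling mean and standard deviation over a sliding
    window. Spectral and cross-sensor correlation features would be built
    similarly, one feature vector per window.
    """
    feats = []
    for i in range(window, len(readings) + 1):
        w = readings[i - window:i]
        feats.append((statistics.fmean(w), statistics.pstdev(w)))
    return feats

# A drifting series: rising rolling mean and stdev expose the temporal
# trend that a point-in-time threshold cannot see.
series = [50, 50, 52, 55, 60, 66]
feats = rolling_features(series, window=3)
```

Feeding trends rather than raw points is what lets the model distinguish a bearing running hot for 12 hours from one that spiked briefly and returned to baseline.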
"After the previous system, my maintenance team would not look at another AI alert. The credibility problem was severe. The AI Advisory Practice team understood this before we even started technical discussions. Their insistence on the 2-line pilot before any full deployment, and on demonstrating 91% precision before enabling live alerts, was the right call. By the time we deployed to the remaining 12 lines, my team was asking when the rollout was going to happen, not resisting it."
Most industrial manufacturers with IoT sensor infrastructure are generating predictive signal data and doing nothing with it. The difference between a manufacturer spending $300M per year on reactive maintenance and one spending $150M is often a well-designed predictive model, not better equipment. Tell us about your situation and we will tell you what the gap is costing you.
Tell us about your maintenance and IoT program and we will follow up within 1 business day.