Production-ready ML pipeline for predicting failures in metropolitan train air compressor systems using advanced time-domain and frequency-domain feature engineering with a Random Forest classifier.
- Business Case
- Project Architecture
- Feature Engineering
- Dataset
- Quick Start
- Configuration
- Results
- Technical Details
Unplanned downtime in metropolitan rail systems costs thousands of dollars per hour and disrupts passenger services. Air compressor failures in trains are a leading cause of such downtime.
This pipeline implements a Predictive Maintenance (PdM) system that:
- Continuously monitors 7 critical sensor signals (pressure, temperature, current, etc.)
- Extracts 183 engineered features per sample (12 time-domain + 14 frequency-domain per sensor, plus the LPS indicator)
- Trains a Random Forest classifier to detect early fault signatures
- Achieves high precision & recall on historically documented fault events
Impact: Early fault detection enables scheduled maintenance before catastrophic failure, reducing downtime by up to 70% and maintenance costs by up to 25%.
```
Train-Predictive-maintenance-using-AI/
│
├── config/
│   └── config.yaml           # Central configuration (all hyperparameters)
│
├── data/
│   ├── raw/                  # Raw dataset storage
│   └── processed/            # Feature-engineered outputs
│
├── models/                   # Saved model artifacts (.joblib)
├── figures/                  # Confusion matrix plots
├── notebooks/                # Original Jupyter notebook (reference)
│
├── src/
│   ├── data/
│   │   └── make_dataset.py   # Data ingestion, parsing, slicing
│   ├── features/
│   │   ├── time_domain.py    # 12 rolling-window statistical features
│   │   ├── freq_domain.py    # 14 FFT spectral features (zero-div fixed)
│   │   └── build_features.py # Feature engineering orchestrator
│   ├── models/
│   │   ├── train_model.py    # StandardScaler + RandomForest training
│   │   └── predict_model.py  # Model evaluation & metrics logging
│   └── visualization/
│       └── visualize.py      # Confusion matrix heatmaps
│
├── run_pipeline.py           # End-to-end orchestrator (single command)
├── requirements.txt          # Pinned dependencies
└── README.md
```
```mermaid
flowchart LR
    A[Raw CSV<br/>1.5M rows] --> B[make_dataset.py<br/>Parse & Slice]
    B --> C[build_features.py<br/>Feature Engineering]
    C --> D[train_model.py<br/>Scale & Train RF]
    D --> E[predict_model.py<br/>Evaluate 3 Test Sets]
    E --> F[visualize.py<br/>Confusion Matrices]
    G[config.yaml] -.-> B & C & D & E
```
Rolling-window statistics computed over a configurable window (default: 200 samples):
| # | Feature | Formula Description |
|---|---|---|
| 1 | Mean | Rolling arithmetic mean |
| 2 | Std | Rolling standard deviation |
| 3 | Variance | Rolling variance |
| 4 | RMS | Root mean square: √(mean(x²)) |
| 5 | Peak Value | Rolling maximum |
| 6 | Skewness | Third standardized moment |
| 7 | Kurtosis | Fourth standardized moment |
| 8 | Crest Factor | Peak / RMS |
| 9 | Margin Factor | Peak / Variance |
| 10 | Impulse Factor | Peak / Mean(\|x\|) |
| 11 | A-Factor | Peak / (Std × Variance) |
| 12 | B-Factor | (Kurtosis × Crest Factor) / Std |
Spectral features extracted using FFT on sliding windows:
| # | Feature | Description |
|---|---|---|
| 1 | Spectral Mean | Mean of FFT magnitude spectrum |
| 2 | Spectral Variance | Variance of the spectrum |
| 3 | 3rd Moment | Spectral skewness analog |
| 4 | 4th Moment | Spectral kurtosis analog |
| 5 | Grand Frequency | Spectral centroid (weighted mean frequency) |
| 6 | Spectral Std | Frequency spread (bandwidth) |
| 7 | C-Factor | RMS frequency |
| 8 | D-Factor | √(Σf⁴·y / Σf²·y) |
| 9 | E-Factor | Σf²·y / √(Σy · Σf⁴·y) |
| 10 | G-Factor | Frequency std / Grand frequency |
| 11 | Freq 3rd Moment | Frequency-weighted skewness |
| 12 | Freq 4th Moment | Frequency-weighted kurtosis |
| 13 | H-Factor | Square-root-based shape factor |
| 14 | J-Factor | Shape factor variant |
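A sketch of two of these spectral features, the centroid ("grand frequency") and the bandwidth, computed from an FFT magnitude spectrum (function name and the `fs` parameter are assumptions; ~10 s sampling implies roughly 0.1 Hz):

```python
import numpy as np

def spectral_features(frame: np.ndarray, fs: float = 0.1) -> dict:
    """Illustrative spectral centroid and bandwidth for one window."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    eps = 1e-12  # avoid divide-by-zero on flat/all-zero spectra
    total = spectrum.sum() + eps
    centroid = (freqs * spectrum).sum() / total        # "grand frequency"
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * spectrum).sum() / total)
    return {
        "spectral_mean": spectrum.mean(),
        "spectral_variance": spectrum.var(),
        "grand_frequency": centroid,
        "spectral_std": bandwidth,
    }

# A 0.02 Hz sine sampled at 0.1 Hz should yield a centroid near 0.02 Hz
f = spectral_features(np.sin(2 * np.pi * 0.02 * 10 * np.arange(200)))
```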
**Critical Fix:** The original code raised `RuntimeWarning: divide by zero` during FFT calculations whenever the spectrum was flat or all-zero. This has been resolved using epsilon-guarded denominators and an `np.nan_to_num()` cleanup.
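The guard pattern can be illustrated as follows (a sketch of the idea, not the exact project code):

```python
import numpy as np

def safe_ratio(numerator: np.ndarray, denominator: np.ndarray,
               eps: float = 1e-12) -> np.ndarray:
    """Divide with an epsilon-guarded denominator, then scrub residual NaN/inf."""
    out = numerator / (denominator + eps)
    return np.nan_to_num(out, nan=0.0, posinf=0.0, neginf=0.0)

# An all-zero spectrum now yields zeros instead of a RuntimeWarning
safe_ratio(np.zeros(3), np.zeros(3))
```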
| Component | Features/Sensor | Sensors | Total |
|---|---|---|---|
| Time-domain | 12 | 7 | 84 |
| Frequency-domain | 14 | 7 | 98 |
| LPS indicator | — | — | 1 |
| **Total** | | | **183** |
Source: MetroPT3 (Air Compressor) Dataset — Real-world sensor data from a metropolitan train's air production unit.
| Property | Value |
|---|---|
| Rows | 1,516,948 |
| Columns | 15 sensor signals |
| Time Range | 2020-02-01 to 2020-09-01 |
| Sampling | ~10 second intervals |
| File | dataset.train (1.5 GB) |
Sensor Signals: TP2, TP3, H1, DV_pressure, Reservoirs, Oil_temperature, Motor_current, COMP, DV_eletric, Towers, MPG, LPS, Pressure_switch, Oil_level, Caudal_impulses.
| Dataset | Slice Range | Fault Window | Purpose |
|---|---|---|---|
| Training | rows 878,462–912,357 | Jun 5–7, 2020 | Model training |
| Test 1 | rows 555,000–580,000 | Apr 18, 2020 | Validation |
| Test 2 | rows 830,000–850,000 | May 29–30, 2020 | Validation |
| Test 3 | rows 1,164,000–1,176,000 | Jul 15, 2020 | Validation |
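The slicing and labeling step can be sketched like this (the `timestamp` column name is an assumption for illustration; the actual logic lives in `make_dataset.py`):

```python
import pandas as pd

def make_slice(df: pd.DataFrame, start: int, end: int,
               fault_start: str, fault_end: str) -> pd.DataFrame:
    """Slice a row range and label samples inside the documented fault window."""
    sliced = df.iloc[start:end].copy()
    in_fault = sliced["timestamp"].between(fault_start, fault_end)
    sliced["label"] = in_fault.astype(int)  # 1 = inside fault window
    return sliced

# e.g. the training slice from the table above:
# train = make_slice(raw, 878_462, 912_357,
#                    "2020-06-05 10:00:00", "2020-06-07 14:30:00")
```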
```bash
git clone <repository-url>
cd Train-Predictive-maintenance-using-AI

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate    # Linux/Mac
venv\Scripts\activate       # Windows

# Install dependencies
pip install -r requirements.txt
```

Ensure `dataset.train` is in the project root directory.
```bash
python run_pipeline.py
```

This single command will:
- ✅ Load and parse the 1.5M-row CSV
- ✅ Slice into training + 3 test datasets
- ✅ Extract 183 features per sample (time-domain + FFT)
- ✅ Train StandardScaler → RandomForest pipeline
- ✅ Evaluate on all 3 test sets with Precision/Recall/F1
- ✅ Save model to `models/` and plots to `figures/`
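The scale-and-train stage can be sketched as follows (hyperparameters mirror the config defaults; `X_train` / `y_train` are placeholders, not real pipeline variables):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
# pipeline.fit(X_train, y_train)
# joblib.dump(pipeline, "models/random_forest_model.joblib")
```

Bundling the scaler into the pipeline ensures the same transformation is applied at training and prediction time, so the serialized `.joblib` artifact is self-contained.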
```
STAGE 1: Data Loading & Slicing
Training dataset: 33,895 samples (fault rate: X.XX%)
Test set metrotest1: 25,000 samples
...
STAGE 4: Model Training
StandardScaler → RandomForestClassifier(n_estimators=100)
Model saved → models/random_forest_model.joblib
STAGE 5: Model Evaluation
═══════════════════════════════════════════
            EVALUATION SUMMARY
Test Set      Precision    Recall    F1
metrotest1    0.XXXX       0.XXXX    0.XXXX
metrotest2    0.XXXX       0.XXXX    0.XXXX
metrotest3    0.XXXX       0.XXXX    0.XXXX
═══════════════════════════════════════════
```
All pipeline parameters are centralized in config/config.yaml:
```yaml
# Data slicing
data:
  training:
    slice_start: 878462
    slice_end: 912357
    fault_start: "2020-06-05 10:00:00"
    fault_end: "2020-06-07 14:30:00"

# Feature engineering
features:
  time_domain:
    window_size: 200
  freq_domain:
    frame_size: 200
    hop_length: 1

# Model hyperparameters
model:
  n_estimators: 100
  random_state: 42
```

To experiment, simply modify `config.yaml`; no source code changes are required.
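Reading the shared config is a one-liner with `pyyaml` (shown here with an inline string so the sketch is self-contained; the pipeline stages would load `config/config.yaml` from disk instead):

```python
import yaml

cfg_text = """
features:
  time_domain:
    window_size: 200
model:
  n_estimators: 100
  random_state: 42
"""
cfg = yaml.safe_load(cfg_text)

window = cfg["features"]["time_domain"]["window_size"]
n_trees = cfg["model"]["n_estimators"]
```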
The model is evaluated on 3 historically documented fault events using weighted Precision, Recall, and F1 Score. Confusion matrix heatmaps are saved to the figures/ directory.
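The reported metrics correspond to scikit-learn's weighted averaging, sketched below with placeholder labels (`y_true` / `y_pred` are illustrative arrays, not real pipeline output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Weighted averaging accounts for the class imbalance typical of fault data
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
cm = confusion_matrix(y_true, y_pred)  # basis for the saved heatmaps
```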
| Package | Purpose |
|---|---|
| `pandas` | Data manipulation & datetime handling |
| `numpy` | Numerical computation |
| `scikit-learn` | ML pipeline, Random Forest, metrics |
| `scipy` | Signal processing (FFT) |
| `matplotlib` | Plotting backend |
| `seaborn` | Statistical visualization |
| `pyyaml` | Configuration management |
| `joblib` | Model serialization |
- Separation of Concerns: Each module has a single responsibility
- Configuration-Driven: All hyperparameters live in `config.yaml`
- Robust Error Handling: Epsilon-guarded divisions in FFT calculations
- Structured Logging: Python's `logging` module replaces `print` statements
- PEP 8 Compliant: Clean, readable, documented Python code
- Reproducible: The `random_state` parameter ensures deterministic results
This project is for educational and research purposes.
Built with ❤️ for Predictive Maintenance and Industrial AI.