Production-ready ML pipeline for predicting failures in metropolitan train air compressor systems using advanced time-domain and frequency-domain feature engineering with a Random Forest classifier.
- Business Case
- Project Architecture
- Feature Engineering
- Dataset
- Quick Start
- Configuration
- Results
- Technical Details
Unplanned downtime in metropolitan rail systems costs thousands of dollars per hour and disrupts passenger services. Air compressor failures in trains are a leading cause of such downtime.
This pipeline implements a Predictive Maintenance (PdM) system that:
- Continuously monitors 7 critical sensor signals (pressure, temperature, current, etc.)
- Extracts 183 engineered features per sample (12 time-domain + 14 frequency-domain per sensor, plus the LPS indicator)
- Trains a Random Forest classifier to detect early fault signatures
- Achieves high precision & recall on historically documented fault events
Impact: Early fault detection enables scheduled maintenance before catastrophic failure, reducing downtime by up to 70% and maintenance costs by up to 25%.
```
Train-Predictive-maintenance-using-AI/
│
├── config/
│   └── config.yaml           # Central configuration (all hyperparameters)
│
├── data/
│   ├── raw/                  # Raw dataset storage
│   └── processed/            # Feature-engineered outputs
│
├── models/                   # Saved model artifacts (.joblib)
├── figures/                  # Confusion matrix plots
├── notebooks/                # Original Jupyter notebook (reference)
│
├── src/
│   ├── data/
│   │   └── make_dataset.py   # Data ingestion, parsing, slicing
│   ├── features/
│   │   ├── time_domain.py    # 12 rolling-window statistical features
│   │   ├── freq_domain.py    # 14 FFT spectral features (zero-div fixed)
│   │   └── build_features.py # Feature engineering orchestrator
│   ├── models/
│   │   ├── train_model.py    # StandardScaler + RandomForest training
│   │   └── predict_model.py  # Model evaluation & metrics logging
│   └── visualization/
│       └── visualize.py      # Confusion matrix heatmaps
│
├── run_pipeline.py           # End-to-end orchestrator (single command)
├── requirements.txt          # Pinned dependencies
└── README.md
```
```mermaid
flowchart LR
    A[Raw CSV<br/>1.5M rows] --> B[make_dataset.py<br/>Parse & Slice]
    B --> C[build_features.py<br/>Feature Engineering]
    C --> D[train_model.py<br/>Scale & Train RF]
    D --> E[predict_model.py<br/>Evaluate 3 Test Sets]
    E --> F[visualize.py<br/>Confusion Matrices]
    G[config.yaml] -.-> B & C & D & E
```
Rolling-window statistics computed over a configurable window (default: 200 samples):
| # | Feature | Formula Description |
|---|---|---|
| 1 | Mean | Rolling arithmetic mean |
| 2 | Std | Rolling standard deviation |
| 3 | Variance | Rolling variance |
| 4 | RMS | Root mean square: √(mean(x²)) |
| 5 | Peak Value | Rolling maximum |
| 6 | Skewness | Third standardized moment |
| 7 | Kurtosis | Fourth standardized moment |
| 8 | Crest Factor | Peak / RMS |
| 9 | Margin Factor | Peak / Variance |
| 10 | Impulse Factor | Peak / Mean(\|x\|) |
| 11 | A-Factor | Peak / (Std × Variance) |
| 12 | B-Factor | (Kurtosis × Crest Factor) / Std |
Spectral features extracted using FFT on sliding windows:
| # | Feature | Description |
|---|---|---|
| 1 | Spectral Mean | Mean of FFT magnitude spectrum |
| 2 | Spectral Variance | Variance of the spectrum |
| 3 | 3rd Moment | Spectral skewness analog |
| 4 | 4th Moment | Spectral kurtosis analog |
| 5 | Grand Frequency | Spectral centroid (weighted mean frequency) |
| 6 | Spectral Std | Frequency spread (bandwidth) |
| 7 | C-Factor | RMS frequency |
| 8 | D-Factor | √(Σf⁴·y / Σf²·y) |
| 9 | E-Factor | Σf²·y / √(Σy · Σf⁴·y) |
| 10 | G-Factor | Frequency std / Grand frequency |
| 11 | Freq 3rd Moment | Frequency-weighted skewness |
| 12 | Freq 4th Moment | Frequency-weighted kurtosis |
| 13 | H-Factor | Square-root-based shape factor |
| 14 | J-Factor | Shape factor variant |
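A sketch of two of these spectral features, the centroid ("grand frequency") and the bandwidth, computed from an FFT magnitude spectrum (function name and the `fs` parameter are assumptions; ~10 s sampling implies roughly 0.1 Hz):

```python
import numpy as np

def spectral_features(frame: np.ndarray, fs: float = 0.1) -> dict:
    """Illustrative spectral centroid and bandwidth for one window."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    eps = 1e-12  # avoid divide-by-zero on flat/all-zero spectra
    total = spectrum.sum() + eps
    centroid = (freqs * spectrum).sum() / total        # "grand frequency"
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * spectrum).sum() / total)
    return {
        "spectral_mean": spectrum.mean(),
        "spectral_variance": spectrum.var(),
        "grand_frequency": centroid,
        "spectral_std": bandwidth,
    }

# A 0.02 Hz sine sampled at 0.1 Hz should yield a centroid near 0.02 Hz
f = spectral_features(np.sin(2 * np.pi * 0.02 * 10 * np.arange(200)))
```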
**Critical Fix:** The original code raised `RuntimeWarning: divide by zero` during FFT calculations whenever the spectrum was flat or all-zero. This has been resolved using epsilon-guarded denominators and an `np.nan_to_num()` cleanup.
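The guard pattern can be illustrated as follows (a sketch of the idea, not the exact project code):

```python
import numpy as np

def safe_ratio(numerator: np.ndarray, denominator: np.ndarray,
               eps: float = 1e-12) -> np.ndarray:
    """Divide with an epsilon-guarded denominator, then scrub residual NaN/inf."""
    out = numerator / (denominator + eps)
    return np.nan_to_num(out, nan=0.0, posinf=0.0, neginf=0.0)

# An all-zero spectrum now yields zeros instead of a RuntimeWarning
safe_ratio(np.zeros(3), np.zeros(3))
```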
| Component | Features/Sensor | Sensors | Total |
|---|---|---|---|
| Time-domain | 12 | 7 | 84 |
| Frequency-domain | 14 | 7 | 98 |
| LPS indicator | — | — | 1 |
| **Total** | | | **183** |
Source: MetroPT3 (Air Compressor) Dataset — Real-world sensor data from a metropolitan train's air production unit.
| Property | Value |
|---|---|
| Rows | 1,516,948 |
| Columns | 15 sensor signals |
| Time Range | 2020-02-01 to 2020-09-01 |
| Sampling | ~10 second intervals |
| File | dataset.train (1.5 GB) |
Sensor Signals: TP2, TP3, H1, DV_pressure, Reservoirs, Oil_temperature, Motor_current, COMP, DV_eletric, Towers, MPG, LPS, Pressure_switch, Oil_level, Caudal_impulses.
| Dataset | Slice Range | Fault Window | Purpose |
|---|---|---|---|
| Training | rows 878,462–912,357 | Jun 5–7, 2020 | Model training |
| Test 1 | rows 555,000–580,000 | Apr 18, 2020 | Validation |
| Test 2 | rows 830,000–850,000 | May 29–30, 2020 | Validation |
| Test 3 | rows 1,164,000–1,176,000 | Jul 15, 2020 | Validation |
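The slicing and labeling step can be sketched like this (the `timestamp` column name is an assumption for illustration; the actual logic lives in `make_dataset.py`):

```python
import pandas as pd

def make_slice(df: pd.DataFrame, start: int, end: int,
               fault_start: str, fault_end: str) -> pd.DataFrame:
    """Slice a row range and label samples inside the documented fault window."""
    sliced = df.iloc[start:end].copy()
    in_fault = sliced["timestamp"].between(fault_start, fault_end)
    sliced["label"] = in_fault.astype(int)  # 1 = inside fault window
    return sliced

# e.g. the training slice from the table above:
# train = make_slice(raw, 878_462, 912_357,
#                    "2020-06-05 10:00:00", "2020-06-07 14:30:00")
```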
```bash
git clone <repository-url>
cd Train-Predictive-maintenance-using-AI

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate    # Linux/Mac
venv\Scripts\activate       # Windows

# Install dependencies
pip install -r requirements.txt
```

Ensure `dataset.train` is in the project root directory.
```bash
python run_pipeline.py
```

This single command will:
- ✅ Load and parse the 1.5M-row CSV
- ✅ Slice into training + 3 test datasets
- ✅ Extract 183 features per sample (time-domain + FFT)
- ✅ Train StandardScaler → RandomForest pipeline
- ✅ Evaluate on all 3 test sets with Precision/Recall/F1
- ✅ Save model to `models/` and plots to `figures/`
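The scale-and-train stage can be sketched as follows (hyperparameters mirror the config defaults; `X_train` / `y_train` are placeholders, not real pipeline variables):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
# pipeline.fit(X_train, y_train)
# joblib.dump(pipeline, "models/random_forest_model.joblib")
```

Bundling the scaler into the pipeline ensures the same transformation is applied at training and prediction time, so the serialized `.joblib` artifact is self-contained.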
```
STAGE 1: Data Loading & Slicing
Training dataset: 33,895 samples (fault rate: X.XX%)
Test set metrotest1: 25,000 samples
...
STAGE 4: Model Training
StandardScaler → RandomForestClassifier(n_estimators=100)
Model saved → models/random_forest_model.joblib
STAGE 5: Model Evaluation
═══════════════════════════════════════════
            EVALUATION SUMMARY
Test Set      Precision    Recall    F1
metrotest1    0.XXXX       0.XXXX    0.XXXX
metrotest2    0.XXXX       0.XXXX    0.XXXX
metrotest3    0.XXXX       0.XXXX    0.XXXX
═══════════════════════════════════════════
```
All pipeline parameters are centralized in config/config.yaml:
```yaml
# Data slicing
data:
  training:
    slice_start: 878462
    slice_end: 912357
    fault_start: "2020-06-05 10:00:00"
    fault_end: "2020-06-07 14:30:00"

# Feature engineering
features:
  time_domain:
    window_size: 200
  freq_domain:
    frame_size: 200
    hop_length: 1

# Model hyperparameters
model:
  n_estimators: 100
  random_state: 42
```

To experiment, simply modify `config.yaml`; no source code changes are required.
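Reading the shared config is a one-liner with `pyyaml` (shown here with an inline string so the sketch is self-contained; the pipeline stages would load `config/config.yaml` from disk instead):

```python
import yaml

cfg_text = """
features:
  time_domain:
    window_size: 200
model:
  n_estimators: 100
  random_state: 42
"""
cfg = yaml.safe_load(cfg_text)

window = cfg["features"]["time_domain"]["window_size"]
n_trees = cfg["model"]["n_estimators"]
```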
The model is evaluated on 3 historically documented fault events using weighted Precision, Recall, and F1 Score. Confusion matrix heatmaps are saved to the figures/ directory.
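The reported metrics correspond to scikit-learn's weighted averaging, sketched below with placeholder labels (`y_true` / `y_pred` are illustrative arrays, not real pipeline output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Weighted averaging accounts for the class imbalance typical of fault data
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
cm = confusion_matrix(y_true, y_pred)  # basis for the saved heatmaps
```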
| Package | Purpose |
|---|---|
| `pandas` | Data manipulation & datetime handling |
| `numpy` | Numerical computation |
| `scikit-learn` | ML pipeline, Random Forest, metrics |
| `scipy` | Signal processing (FFT) |
| `matplotlib` | Plotting backend |
| `seaborn` | Statistical visualization |
| `pyyaml` | Configuration management |
| `joblib` | Model serialization |
- Separation of Concerns: Each module has a single responsibility
- Configuration-Driven: All hyperparameters live in `config.yaml`
- Robust Error Handling: Epsilon-guarded divisions in FFT calculations
- Structured Logging: Python's `logging` module replaces `print` statements
- PEP 8 Compliant: Clean, readable, documented Python code
- Reproducible: The `random_state` parameter ensures deterministic results
This project is for educational and research purposes.
Built with ❤️ for Predictive Maintenance and Industrial AI.