- Course: AH2179
- Module: 3
- Project: Classification Models for Travel Mode Choice Prediction
This project implements and compares multiple classification models to predict transportation mode choices (air, bus, rail, car) using various machine learning techniques, with a focus on handling class imbalance.
Class distribution (3080 samples total):
- Rail: 41.8% (1288 samples)
- Car: 31.8% (978 samples)
- Air: 22.9% (705 samples)
- Bus: 3.5% (109 samples)
Preprocessing:
- Features: All columns except 'choice' (the target)
- Scaling: StandardScaler
- Encoding: One-hot encoding of categorical features
- Data Split: 75% training, 25% testing
- Cross-Validation: Stratified K-fold (see the sketch after this list)
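A minimal sketch of this preprocessing setup, assuming the data sits in a pandas DataFrame loaded from a hypothetical `mode_choice.csv` with the target in a 'choice' column; the file name and the 5-fold setting are assumptions, everything else follows the bullets above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold

# Hypothetical file name; the dataset holds one row per trip with a 'choice' target
df = pd.read_csv("mode_choice.csv")

X = df.drop(columns=["choice"])
y = df["choice"]

# Scale numeric columns, one-hot encode categorical ones
numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# 75/25 split, stratified so the class shares above are preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Stratified k-fold for cross-validation downstream (5 folds is an assumption)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```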
Models evaluated:
- Logistic Regression (Best Performer)
- XGBoost Classifier
- Random Forests
- Decision Trees
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Nearest Centroid
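A sketch of screening this model roster against the stratified folds defined above. It uses default hyperparameters and continues from the preprocessing sketch (`preprocess`, `cv`, `X_train`, `y_train`), so it is illustrative rather than a record of the project's exact runs.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# XGBoost expects integer class labels, so encode the mode names once
y_train_enc = LabelEncoder().fit_transform(y_train)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "XGBoost": XGBClassifier(eval_metric="mlogloss"),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Nearest Centroid": NearestCentroid(),
}

for name, clf in candidates.items():
    pipe = make_pipeline(preprocess, clf)  # reuse the ColumnTransformer from above
    scores = cross_val_score(pipe, X_train, y_train_enc, cv=cv, scoring="accuracy")
    print(f"{name:20s} {scores.mean():.3f} +/- {scores.std():.3f}")
```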
Logistic Regression parameters:
- C: 0.1
- penalty: 'l2'
- solver: 'newton-cg'
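A hedged sketch of how such values can be found with a grid search over a scikit-learn pipeline. The candidate grid below is an assumption, and `preprocess`, `cv`, `X_train`, `y_train` come from the preprocessing sketch above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Illustrative candidate values; the project's actual grid is not recorded here
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0],
    "logisticregression__penalty": ["l2"],
    "logisticregression__solver": ["newton-cg", "lbfgs", "saga"],
}

pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)  # the project reports C=0.1, penalty='l2', solver='newton-cg'
```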
Logistic Regression test-set performance:
- Accuracy: 57%
- Per-class precision: [0.546, 0.0, 0.471, 0.663]
- Per-class recall: [0.524, 0.0, 0.533, 0.669]
- The 0.0 entries correspond to the minority 'bus' class, which this model never predicted correctly (see the class-imbalance findings below)
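Per-class precision and recall of this kind are what scikit-learn's classification_report produces; a short evaluation sketch, continuing from the grid-search sketch (`search`) above.

```python
from sklearn.metrics import accuracy_score, classification_report

best_model = search.best_estimator_  # tuned pipeline from the grid-search sketch
y_pred = best_model.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
# Per-class precision/recall; a 0.0 row flags a class the model never predicts correctly
print(classification_report(y_test, y_pred, zero_division=0))
```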
Class Imbalance Impact
- Models showed bias towards majority classes (rail, car)
- KNN and Nearest Centroid were the only models to detect the minority 'bus' class, despite their lower overall accuracy
- Random Forests and Decision Trees showed high sensitivity to imbalanced data
Model-Specific Observations
- Logistic Regression:
  - L2 regularization prevented overfitting
  - The newton-cg solver yielded the best classification accuracy
- XGBoost: Performed similarly to Logistic Regression
- KNN & Nearest Centroid: Better at minority class detection
- SVM: Struggled with imbalanced classes
Data Preprocessing Insights
- Critical importance of feature encoding methods
- Impact of training/testing split ratios
- Value of feature standardization for model performance
Model Selection Considerations
- LazyPredict library's utility for initial model screening (see the sketch after this list)
- Importance of the class_weight parameter when classes are imbalanced
- Trade-offs between accuracy and minority class detection
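A minimal sketch of both points, assuming the lazypredict package is installed (its call signature may vary slightly by version). LazyClassifier is used only for a first leaderboard; class_weight='balanced' reweights classes inversely to their frequencies.

```python
from lazypredict.Supervised import LazyClassifier
from sklearn.linear_model import LogisticRegression

# Quick screening of many classifiers at once; returns a leaderboard DataFrame
lazy = LazyClassifier(verbose=0, ignore_warnings=True)
models, predictions = lazy.fit(X_train, X_test, y_train, y_test)
print(models.head(10))

# Re-weighting classes inversely to their frequency to counter the imbalance
weighted_lr = LogisticRegression(
    C=0.1, penalty="l2", solver="newton-cg", class_weight="balanced", max_iter=1000
)
```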
Optimization Techniques
- Grid search effectiveness for parameter tuning
- Cross-validation strategies for imbalanced data (sketched after this list)
- Regularization impact on model robustness
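One possible cross-validation strategy for the imbalanced setting is to score the stratified folds with macro-averaged F1 and balanced accuracy rather than plain accuracy, so the rare 'bus' class is not drowned out. The metric choice here is an assumption, not something recorded in the project; `preprocess`, `cv`, `X_train`, `y_train` come from the sketches above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    preprocess,
    LogisticRegression(C=0.1, penalty="l2", solver="newton-cg", max_iter=1000),
)

# Macro F1 weights every class equally; balanced accuracy averages per-class recall
f1_macro = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1_macro")
balanced = cross_val_score(model, X_train, y_train, cv=cv, scoring="balanced_accuracy")
print(f"macro F1: {f1_macro.mean():.3f}, balanced accuracy: {balanced.mean():.3f}")
```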
Bus Delay Classification
- Handling imbalanced on-time vs delayed data
- Implementing resampling techniques (SMOTE, oversampling)
- Ensemble methods for improved performance
Traffic Pattern Classification
- Multi-class prediction for congestion levels
- Real-time pattern recognition
- Seasonal variation analysis
Infrastructure Usage Prediction
- Peak vs off-peak utilization
- Facility type selection
- Maintenance scheduling
Handling Imbalanced Data
- SMOTE (Synthetic Minority Over-sampling Technique); see the sketch after this list
- Adaptive synthetic sampling (ADASYN)
- Hybrid sampling approaches
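A minimal resampling sketch, assuming the imbalanced-learn package (not among the libraries listed below) is available; only the encoded training split is resampled so the test set stays untouched. `preprocess`, `X_train`, `y_train` come from the preprocessing sketch above.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE, ADASYN

# SMOTE operates on numeric features, so resample after the encoding step.
# For mixed numeric/categorical data, SMOTENC is the variant designed for raw columns.
X_train_enc = preprocess.fit_transform(X_train)

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train_enc, y_train)
print(Counter(y_train), "->", Counter(y_res))  # minority classes brought up to parity

# ADASYN is the adaptive variant that focuses on harder-to-learn minority samples
X_ada, y_ada = ADASYN(random_state=42).fit_resample(X_train_enc, y_train)
```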
Ensemble Methods
- Voting Classifiers
- Stacking multiple models
- Boosting algorithms
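A sketch of voting and stacking ensembles assembled from models already used in the project; the particular base-estimator mix and settings are assumptions (boosting itself is covered by the XGBoost model above).

```python
from sklearn.ensemble import VotingClassifier, StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

base_estimators = [
    ("lr", LogisticRegression(C=0.1, penalty="l2", solver="newton-cg", max_iter=1000)),
    ("rf", RandomForestClassifier(class_weight="balanced")),
    ("knn", KNeighborsClassifier()),  # helped with the 'bus' class in this project
]

# Soft voting averages predicted class probabilities across the base models
voting = make_pipeline(preprocess, VotingClassifier(base_estimators, voting="soft"))

# Stacking feeds the base models' predictions into a final logistic regression
stacking = make_pipeline(
    preprocess,
    StackingClassifier(base_estimators, final_estimator=LogisticRegression(max_iter=1000)),
)

voting.fit(X_train, y_train)
print("voting accuracy:", voting.score(X_test, y_test))
```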
Advanced Feature Engineering
- Temporal feature extraction
- Spatial correlation analysis
- Domain-specific feature creation
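A brief temporal and domain-specific feature-engineering sketch with pandas; the 'departure_time' and 'cost_*' columns are hypothetical, since the dataset's actual feature names are not shown here.

```python
import pandas as pd

# 'departure_time' is a hypothetical timestamp column used for illustration
df["departure_time"] = pd.to_datetime(df["departure_time"])

df["hour"] = df["departure_time"].dt.hour
df["day_of_week"] = df["departure_time"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["is_peak"] = (df["hour"].between(7, 9) | df["hour"].between(16, 18)).astype(int)

# Domain-specific example: cheapest available mode cost per trip
# (assumes hypothetical per-mode cost columns such as 'cost_air', 'cost_rail', ...)
cost_cols = [c for c in df.columns if c.startswith("cost_")]
if cost_cols:
    df["cheapest_cost"] = df[cost_cols].min(axis=1)
```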
Libraries used:
- Scikit-learn
- XGBoost
- Pandas
- NumPy
- LazyPredict
- Matplotlib
- Seaborn
Note: This project was completed as part of the AH2179 course, Module 3. 🎓