A comprehensive machine learning pipeline for predicting passenger survival on the Titanic using advanced feature engineering, family survival rules, and ensemble methods.
Kaggle Leaderboard Position: 490 / 15,435 (Top 3.17%)
Best Score: 0.80382 (80.382% accuracy)
This project implements a complete end-to-end machine learning solution for the classic Kaggle Titanic competition. It predicts which passengers survived the Titanic shipwreck based on features like age, sex, passenger class, family relationships, and more.
- 🎯 Family Survival Rules: Implements domain-specific rules based on family survival patterns for high-confidence predictions
- 🧠 Advanced Feature Engineering: 40+ engineered features including title-based age imputation, cabin deck extraction, and interaction features
- 🚀 Ensemble Methods: Combines XGBoost, LightGBM, RandomForest, and SVM with performance-weighted voting
- 📊 Smart Preprocessing: Title-based missing value imputation, feature interactions (AgeClass, AgeFare), and intelligent binning
- 🔍 Multiple Approaches: From simple rule-based models to complex ensemble methods with automatic model selection
- ⚡ Production Ready: Robust error handling, comprehensive logging, and validation
- 🧪 Extensive Testing: 187 unit, integration, and end-to-end tests with 100% pass rate
- Kaggle Public Score: 0.80382 (80.382% accuracy)
- Leaderboard Position: 490 out of 15,435 submissions
- Percentile: Top 3.17% of all submissions
| Approach | Score | Key Innovation |
|---|---|---|
| Conservative Rules | 0.80382 | 18 family rules + minimal adjustments |
| Rule-based MVP | 0.80143 | Family survival rules + gender baseline |
| Enhanced ML | 0.77033 | Advanced feature engineering + ensemble |
| Basic ML | 0.76794 | Standard preprocessing + RF/SVM |
- XGBoost: 87.32% CV accuracy (best individual model)
- SVM: 86.87% CV accuracy
- LightGBM: 86.76% CV accuracy
- RandomForest: 86.76% CV accuracy
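For illustration, here is a minimal sketch of the performance-weighted voting the ensemble uses, with each model weighted by its cross-validation accuracy. RandomForest and LogisticRegression stand in for the full XGBoost/LightGBM/SVM stack so the example runs with scikit-learn alone; it is a sketch of the idea, not the project's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data standing in for the preprocessed Titanic features
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
    "lr": LogisticRegression(max_iter=1000),
}

# Weight each model by its mean cross-validation accuracy
weights = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}

for m in models.values():
    m.fit(X, y)

# Performance-weighted average of predicted survival probabilities
proba = sum(weights[name] * m.predict_proba(X)[:, 1] for name, m in models.items())
proba /= sum(weights.values())
predictions = (proba >= 0.5).astype(int)
```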
```text
├── main.py                          # Main CLI application
├── src/                             # Source code modules
│   ├── data/                        # Data handling
│   │   ├── loader.py                # Data loading with validation
│   │   ├── explorer.py              # Comprehensive data analysis
│   │   └── preprocessor.py          # Feature engineering & preprocessing
│   ├── models/                      # Machine learning models
│   │   ├── trainer.py               # Model training & hyperparameter tuning
│   │   ├── evaluator.py             # Performance evaluation & metrics
│   │   └── predictor.py             # Prediction generation & submission
│   └── utils/                       # Utilities
│       ├── config.py                # Configuration management
│       └── visualization.py         # Plotting and visualization tools
├── tests/                           # Comprehensive test suite
│   ├── test_data_*.py               # Data module tests
│   ├── test_model_*.py              # Model module tests
│   ├── test_integration.py          # Integration tests
│   └── test_end_to_end_validation.py  # End-to-end validation
├── titanic/                         # Data directory
│   ├── train.csv                    # Training dataset
│   └── test.csv                     # Test dataset
└── outputs/                         # Generated outputs
    ├── models/                      # Saved models
    ├── predictions/                 # Prediction files
    └── visualizations/              # Generated plots
```
```bash
pip install numpy pandas scikit-learn matplotlib seaborn pytest joblib xgboost lightgbm
```
- Clone or download the project
- Ensure you have the Titanic dataset files in the `titanic/` directory: `titanic/train.csv` and `titanic/test.csv`
If you're new to data science, here's how this project works step-by-step:
```bash
# Look at the raw data first
python -c "import pandas as pd; print(pd.read_csv('titanic/train.csv').head())"
```
The Titanic dataset contains information about passengers such as age, sex, ticket class, and family size. Our goal is to predict who survived based on these features.
```bash
# Run our top-performing approach
python main_conservative.py
```
This creates `conservative_submission.csv` with 80.38% accuracy, our best result!
The magic happens in two parts:
- Smart Rules: We found 18 passengers where family survival patterns give us high confidence
- Machine Learning: For everyone else, we use XGBoost to make predictions
Raw Data → Clean Data → Engineer Features → Train Model → Make Predictions
1. Data Cleaning (`src/data/preprocessor.py`)
- Fill missing ages using passenger titles (Master = child, Mr = adult man, etc.)
- Handle missing cabin and fare information
- Convert text data to numbers that computers can understand
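A minimal sketch of the title-based age imputation described above; the real logic lives in `src/data/preprocessor.py`, and column names follow the Kaggle CSV:

```python
import pandas as pd

df = pd.read_csv("titanic/train.csv")

# Pull the title out of names like "Braund, Mr. Owen Harris" -> "Mr"
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Fill each missing age with the median age of passengers sharing that title
df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("median"))
```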
2. Feature Engineering (Making Data Smarter)
- Extract titles from names: "Smith, Mr. John" → "Mr"
- Calculate family size: SibSp + Parch + 1
- Create interaction features: Age × Class, Age × Fare
- Extract deck information from cabin numbers
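A simplified pandas sketch of these features (title extraction was sketched above; column names follow the Kaggle CSV):

```python
import pandas as pd

df = pd.read_csv("titanic/train.csv")

# Family size: siblings/spouses + parents/children + the passenger themself
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Interaction features
df["AgeClass"] = df["Age"] * df["Pclass"]
df["AgeFare"] = df["Age"] * df["Fare"]

# Deck letter from the cabin string, "U" (unknown) when the cabin is missing
df["Deck"] = df["Cabin"].str[0].fillna("U")
```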
3. The Breakthrough: Family Survival Rules
```python
# Rule 1: If a Master's family all survived → he survives
# Rule 2: If a female's family all died → she dies
# These 18 high-confidence predictions boost our score significantly!
```
4. Machine Learning for the Rest
- Use XGBoost (a powerful gradient-boosting algorithm) for the remaining 400 passengers
- XGBoost learns patterns like "1st-class females usually survive"
- Combine rule predictions with ML predictions (see the sketch below)
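Schematically, combining the two looks like this. It is a sketch rather than the project's exact code: `rule_predictions` is assumed to map the 18 high-confidence PassengerIds to 0/1, and `model` is any fitted classifier.

```python
import pandas as pd

def combine_predictions(test_df, features, model, rule_predictions):
    """Apply the high-confidence family rules, fall back to the model elsewhere."""
    # Model predictions for every test passenger, indexed by PassengerId
    preds = pd.Series(model.predict(features), index=test_df["PassengerId"].values)

    # Rules win wherever one fires: {PassengerId: 0 or 1}
    for passenger_id, outcome in rule_predictions.items():
        preds.loc[passenger_id] = outcome

    return pd.DataFrame({"PassengerId": preds.index, "Survived": preds.values})
```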
Want to see the best model in action? Run this single command:
```bash
python main_conservative.py
```
This will create `conservative_submission.csv` with our top 3% performance (80.38% accuracy)!
Core Components:
- `src/data/loader.py` - Loads and validates the CSV files
- `src/data/preprocessor.py` - Cleans data and creates features
- `src/models/trainer.py` - Trains multiple ML algorithms
- `src/models/evaluator.py` - Tests model performance
- `src/models/predictor.py` - Makes final predictions
Advanced Components:
- `src/data/enhanced_preprocessor.py` - 40+ advanced features
- `main_mvp_rules.py` - Family survival rules implementation
- `compare_models.py` - Compares different approaches
- Domain Knowledge Beats Complex ML: Simple family rules outperformed sophisticated ensembles
- Feature Engineering Matters: Going from 12 to 40+ features improved accuracy significantly
- Start Simple, Then Optimize: The basic gender rule (females live, males die) gives you an 82.3% baseline (see the sketch after this list)
- Validation is Critical: Cross-validation scores helped select the best approach
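For reference, that gender baseline takes only a few lines (a minimal sketch that writes a Kaggle-ready file):

```python
import pandas as pd

test = pd.read_csv("titanic/test.csv")

# Baseline rule: every female survives, every male dies
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),
})
submission.to_csv("gender_baseline_submission.csv", index=False)
```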
```bash
# Run the entire ML pipeline (original approach)
python main.py pipeline --verbose

# This will:
# 1. Explore and analyze the data
# 2. Train multiple models with hyperparameter tuning
# 3. Select the best performing model
# 4. Generate predictions for the test set
# 5. Create submission file at outputs/predictions/submission.csv
```
Data Exploration
```bash
python main.py explore --verbose
```
- Generates comprehensive data analysis
- Creates visualizations of survival patterns
- Analyzes missing values and feature distributions
Model Training
```bash
python main.py train --verbose
```
- Trains Random Forest, Logistic Regression, and SVM models
- Performs hyperparameter tuning with grid search
- Selects best model based on cross-validation
Generate Predictions
```bash
python main.py predict --verbose
```
- Loads the best trained model
- Generates predictions for test dataset
- Creates Kaggle-ready submission file
Model Evaluation
```bash
python main.py evaluate --verbose
```
- Evaluates model performance with detailed metrics
- Creates confusion matrix and ROC curve visualizations
- Analyzes feature importance
```text
python main.py [mode] [options]

Modes:
  explore   - Data exploration and analysis
  train     - Model training and selection
  predict   - Generate predictions
  evaluate  - Model evaluation
  pipeline  - Complete end-to-end pipeline

Options:
  --verbose, -v         Enable detailed logging
  --data-dir DIR        Data directory (default: titanic)
  --output-dir DIR      Output directory (default: outputs)
  --model-path PATH     Specific model file to use
  --random-seed SEED    Random seed for reproducibility
  --min-accuracy FLOAT  Minimum accuracy threshold
```
Run the comprehensive test suite:
```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test categories
python -m pytest tests/test_data_*.py -v                 # Data processing tests
python -m pytest tests/test_model_*.py -v                # Model tests
python -m pytest tests/test_integration.py -v            # Integration tests
python -m pytest tests/test_end_to_end_validation.py -v  # End-to-end tests
```
The test suite includes:
- 187 total tests with 100% pass rate
- Unit tests for all components
- Integration tests for the complete pipeline
- End-to-end validation with real data
- Performance benchmark validation
- Submission format compliance testing
The breakthrough to the top 3% came from implementing domain-specific family survival rules:
Rule 1: Masters whose entire family (excluding adult males) survived → Predict LIVE
- Applied to 8 passengers: Ryerson, Wells, Touma, Drew, Spedden, Aks, Abbott, Peter
Rule 2: Females whose entire family (excluding adult males) died → Predict DIE
- Applied to 10 passengers: Ilmakangas, Johnston, Cacic, Lefebre, Goodwin, Sage, Oreskovic, Rosblom, Riihivouri
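A simplified reconstruction of how such family patterns can be mined from the training data, assuming surnames identify families; this is a sketch of the idea, not the project's exact rule selection:

```python
import pandas as pd

train = pd.read_csv("titanic/train.csv")

# Treat the surname as the family identifier
train["Surname"] = train["Name"].str.split(",").str[0]
train["Title"] = train["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Look only at women and boys ("Master"), i.e. exclude adult males
women_and_boys = train[(train["Sex"] == "female") | (train["Title"] == "Master")]

# Survival rate and group size per family among women and boys
family_survival = women_and_boys.groupby("Surname")["Survived"].mean()
family_size = women_and_boys.groupby("Surname")["Survived"].size()

# Candidate rule families: more than one member and a unanimous outcome
all_lived = family_survival[(family_survival == 1.0) & (family_size > 1)].index
all_died = family_survival[(family_survival == 0.0) & (family_size > 1)].index
```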
- Smart Age Imputation: Title-based median ages (Master: 4.57, Miss: 21.68, Mrs: 35.86, Mr: 32.32)
- Cabin Intelligence: Deck extraction, cabin counts, availability indicators
- Ticket Analysis: Prefix patterns and numeric extraction
- Family Features: Size categories, survival rates by surname, alone indicators
- Interaction Features: AgeClass, AgeFare, Fare×Class, Title-based ratios
- Advanced Transformations: Log(Fare), Age², √Fare, intelligent binning (see the sketch after this list)
- Ensemble Strategy: Performance-weighted combination of top 4 models
- Feature Selection: Quality over quantity - 40 carefully engineered features
- Preprocessing Pipeline: Standardized scaling, missing value strategies, categorical encoding
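A sketch of the advanced transformations and binning listed above (interaction features were sketched earlier; `log1p` and the specific bin edges are assumptions for the example):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("titanic/train.csv")

# Log(Fare), Age², √Fare
df["LogFare"] = np.log1p(df["Fare"])  # log1p tolerates the zero fares in the data
df["AgeSquared"] = df["Age"] ** 2
df["SqrtFare"] = np.sqrt(df["Fare"])

# Intelligent binning: quantile-based fare bins, hand-picked age bins
df["FareBin"] = pd.qcut(df["Fare"], q=4, labels=False, duplicates="drop")
df["AgeBin"] = pd.cut(df["Age"], bins=[0, 12, 18, 35, 60, 100], labels=False)
```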
- Random Forest: Ensemble method with feature importance analysis
- Logistic Regression: Linear model with regularization
- Support Vector Machine: Non-linear classification with RBF kernel
- Grid search with cross-validation
- Optimizes for accuracy while preventing overfitting
- Automatic selection of best performing model
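A minimal sketch of that tuning step; the parameter grid below is illustrative only, since the project's real grids live in `src/utils/config.py`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the preprocessed Titanic features
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

# Illustrative grid; the project's grids are defined in src/utils/config.py
param_grid = {"n_estimators": [100, 300], "max_depth": [4, 8, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```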
- Accuracy, Precision, Recall, F1-Score
- ROC AUC and confusion matrix analysis
- Feature importance ranking
- Cross-validation scores
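Roughly, these metrics map onto scikit-learn calls like so (a sketch; `y_true`, `y_pred`, and `y_proba` are assumed to come from a held-out split):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_proba):
    """Compute the evaluation metrics reported by the pipeline."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_proba),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
```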
After running the pipeline, you'll find:
- Models: `outputs/models/best_*.joblib` - Trained model files
- Predictions: `outputs/predictions/submission.csv` - Kaggle submission file
- Visualizations: `outputs/visualizations/` - Generated plots and charts
- Logs: `titanic_predictor.log` - Detailed execution logs
The generated submission file is fully compliant with Kaggle competition requirements:
- Exactly 418 predictions (matching test set size)
- Proper CSV format with PassengerId and Survived columns
- Binary predictions (0 or 1)
- No missing values or duplicates
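Those compliance checks can be written as a handful of assertions (a sketch; the test suite performs the authoritative validation):

```python
import pandas as pd

sub = pd.read_csv("outputs/predictions/submission.csv")

assert len(sub) == 418, "must match the test set size"
assert list(sub.columns) == ["PassengerId", "Survived"]
assert sub["Survived"].isin([0, 1]).all(), "binary predictions only"
assert not sub.isnull().any().any(), "no missing values"
assert sub["PassengerId"].is_unique, "no duplicate passengers"
```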
Key configuration options in `src/utils/config.py`:
- Model parameters and hyperparameter grids
- File paths and directory structure
- Logging configuration
- Performance thresholds
This project follows best practices for maintainable ML code:
- Modular design with clear separation of concerns
- Comprehensive testing and validation
- Detailed logging and error handling
- Type hints and documentation
- Configuration management
This project is open source and available under the MIT License.
- Kaggle for providing the Titanic dataset
- The scikit-learn team for excellent ML tools
- The Python data science community for inspiration and best practices
Ready to predict Titanic survival? Run `python main.py pipeline --verbose` to get started!