# GoPredict

A comprehensive machine learning pipeline for predicting trip durations using multiple regression models, feature engineering, and hyperparameter optimization.
Medium post: https://medium.com/@hphadtare02/how-machine-learning-predicts-trip-duration-just-like-uber-zomato-91f7db6e9ce9
## Project Structure

```
GoPredict/
├── main.py                  # Main runner script
├── start_api.py             # API server startup script
├── test_api.py              # API testing script
├── config.py                # Project configuration
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── CONTRIBUTING.md          # Development and integration guide
├── CODE_OF_CONDUCT.md       # Code of conduct and security
│
├── api/                     # FastAPI backend
│   └── main.py              # FastAPI application
│
├── frontend/                # React frontend
│   └── src/
│       └── lib/
│           └── api.ts       # API client library
│
├── data/                    # Data directory
│   ├── raw/                 # Raw data files
│   │   ├── train.csv        # Training data
│   │   └── test.csv         # Test data
│   ├── processed/           # Processed data files
│   │   ├── feature_engineered_train.csv
│   │   ├── feature_engineered_test.csv
│   │   └── gmapsdata/       # Google Maps data
│   └── external/            # External data sources
│       └── precipitation.csv  # Weather data
│
├── src/                     # Source code
│   ├── model/               # Model-related modules
│   │   ├── models.py        # All ML models and pipeline
│   │   ├── evaluation.py    # Model evaluation functions
│   │   └── save_models.py   # Model persistence
│   ├── features/            # Feature engineering modules
│   │   ├── distance.py      # Distance calculations
│   │   ├── geolocation.py   # Geographic features
│   │   ├── gmaps.py         # Google Maps integration
│   │   ├── precipitation.py # Weather features
│   │   ├── time.py          # Time-based features
│   │   └── weather_api.py   # Weather API integration
│   ├── feature_pipe.py      # Feature engineering pipeline
│   ├── data_preprocessing.py  # Data preprocessing
│   └── complete_pipeline.py # Complete ML pipeline
│
├── notebooks/               # Jupyter notebooks
│   ├── 01_EDA.ipynb         # Exploratory Data Analysis
│   ├── 02_Feature_Engineering.ipynb  # Feature engineering
│   ├── 03_Model_Training.ipynb       # Model training
│   ├── figures/             # Generated plots
│   └── gmaps/               # Interactive maps
│
├── saved_models/            # Trained models (auto-created)
├── output/                  # Predictions and submissions (auto-created)
└── logs/                    # Log files (auto-created)
```
## Installation

```bash
# Clone the repository
git clone <your-repo-url>
cd GoPredict

# Install dependencies
pip install -r requirements.txt

# Create necessary directories
mkdir -p logs output saved_models
```

## Quick Start

### Backend (API)

Start the FastAPI server to connect your frontend with ML models:
```bash
# Start the API server
python start_api.py

# Test the API
python test_api.py
```
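`test_api.py` exercises the running API; as a separate illustration, a minimal health check in the same spirit can be written with only the standard library (the helper names here are hypothetical, and the sketch assumes the server from `start_api.py` is listening on `localhost:8000`):

```python
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Build the URL of the API's /health endpoint."""
    return base_url.rstrip("/") + "/health"


def check_health(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the API answers on /health, False otherwise."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("API reachable:", check_health())
```

`check_health()` returns `False` instead of raising when the server is down, so it is safe to call before firing real requests.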
View the interactive API documentation at http://localhost:8000/docs.

### Frontend

```bash
# Install frontend dependencies
cd frontend
npm install

# Start the development server
npm run dev
```

## API Reference

The GoPredict API provides REST endpoints for machine-learning-based trip duration prediction, built with FastAPI.
### Starting the server

```bash
# Start the API server
python start_api.py

# Or with custom options
python start_api.py --host 0.0.0.0 --port 8000 --reload
```

- Interactive Documentation: http://localhost:8000/docs
- Alternative Documentation: http://localhost:8000/redoc
- Health Check: http://localhost:8000/health
### GET /weather

Get weather data for a specific location and time.
Parameters:
- `latitude` (float): Latitude coordinate
- `longitude` (float): Longitude coordinate
- `timestamp` (str): ISO-format timestamp (e.g. "2016-01-01T17:00:00")
Example:
```bash
curl "http://localhost:8000/weather?latitude=40.767937&longitude=-73.982155&timestamp=2016-01-01T17:00:00"
```

Response:

```json
{
  "success": true,
  "data": {
    "temp": 5.0,
    "humidity": 53.0,
    "pressure": 1013.25
  },
  "location": { "latitude": 40.767937, "longitude": -73.982155 },
  "timestamp": "2016-01-01T17:00:00"
}
```

### POST /distance

Calculate Manhattan and/or Euclidean distances.
Parameters:
- `start_lat` (float): Starting latitude
- `start_lng` (float): Starting longitude
- `end_lat` (float): Ending latitude
- `end_lng` (float): Ending longitude
- `method` (str): "manhattan", "euclidean", or "both" (default: "both")
Example:
```bash
curl -X POST "http://localhost:8000/distance" \
  -H "Content-Type: application/json" \
  -d '{
    "start_lat": 40.767937,
    "start_lng": -73.982155,
    "end_lat": 40.748817,
    "end_lng": -73.985428,
    "method": "both"
  }'
```

### POST /time-features

Extract time-based features from a datetime.
Parameters:
- `datetime_str` (str): ISO-format datetime string
Example:
```bash
curl -X POST "http://localhost:8000/time-features" \
  -H "Content-Type: application/json" \
  -d '{"datetime_str": "2016-01-01T17:00:00"}'
```

### POST /predict

Predict trip duration using the trained ML models.
Parameters (JSON Body):
```json
{
  "from": {
    "lat": 40.767937,
    "lon": -73.982155
  },
  "to": {
    "lat": 40.748817,
    "lon": -73.985428
  },
  "startTime": "2016-01-01T17:00:00",
  "city": "new_york",
  "model_name": "XGBoost"
}
```

Response:

```json
{
  "minutes": 5.2,
  "confidence": 0.75,
  "model_version": "XGBoost",
  "distance_km": 2.1,
  "city": "new_york"
}
```

### Model management

- GET /models - List available trained models
- GET /models/{model_name} - Get information about a specific model
- POST /models/train - Train models in the background
Example:
```bash
# List models
curl "http://localhost:8000/models"

# Train models
curl -X POST "http://localhost:8000/models/train" \
  -H "Content-Type: application/json" \
  -d '{"models_to_run": ["XGBoost", "Random Forest"]}'
```

### Health and status

- GET /health - Health check endpoint
- GET /status - Detailed API status
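The curl examples above translate directly to Python's standard library. A minimal client sketch (the helper names are invented for illustration, and nothing is sent until the server from `start_api.py` is running):

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"


def weather_url(latitude: float, longitude: float, timestamp: str) -> str:
    """Build the query URL for GET /weather."""
    query = urllib.parse.urlencode(
        {"latitude": latitude, "longitude": longitude, "timestamp": timestamp}
    )
    return f"{BASE_URL}/weather?{query}"


def json_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a POST request with a JSON body (e.g. for /distance or /predict)."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    print(weather_url(40.767937, -73.982155, "2016-01-01T17:00:00"))
    req = json_request(
        "/predict",
        {
            "from": {"lat": 40.767937, "lon": -73.982155},
            "to": {"lat": 40.748817, "lon": -73.985428},
            "startTime": "2016-01-01T17:00:00",
            "city": "new_york",
            "model_name": "XGBoost",
        },
    )
    # urllib.request.urlopen(req) would send it once the server is up
```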
## Frontend Integration

The frontend uses the API client in `frontend/src/lib/api.ts`:
```typescript
import { predictTravelTime } from "@/lib/api";

// Example usage
const prediction = await predictTravelTime({
  from: { lat: 40.767937, lon: -73.982155 },
  to: { lat: 40.748817, lon: -73.985428 },
  startTime: "2016-01-01T17:00:00",
  city: "new_york",
});
```

## Usage

### Run the full pipeline

```bash
python main.py
```

Runs the complete end-to-end pipeline:
1. Data preprocessing - loads and cleans the raw data
2. Feature engineering - adds distance, time, cluster, and weather features
3. Model training - trains all specified models
4. Model evaluation - compares model performance
5. Prediction generation - creates submission files
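The distance and time features in step 2 can be sketched with the standard library alone. This is an illustration under assumptions (a haversine-based approximation), not necessarily what `src/features/distance.py` or `src/features/time.py` actually implement:

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def manhattan_km(lat1, lng1, lat2, lng2):
    """Manhattan distance: the north-south leg plus the east-west leg."""
    return (haversine_km(lat1, lng1, lat2, lng1)
            + haversine_km(lat2, lng1, lat2, lng2))


def time_features(datetime_str):
    """Extract simple time-based features from an ISO datetime string."""
    dt = datetime.fromisoformat(datetime_str)
    return {
        "hour": dt.hour,
        "weekday": dt.weekday(),  # Monday = 0
        "month": dt.month,
        "is_weekend": int(dt.weekday() >= 5),
    }


print(time_features("2016-01-01T17:00:00"))  # 2016-01-01 was a Friday
```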
### Command-line options

```bash
python main.py --models XGB,RF
```

Train only specific models.
```bash
python main.py --tune-xgb
```

Enable XGBoost hyperparameter tuning.
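A hypothetical sketch of how these flags might be parsed (`main.py`'s real argument handling may differ, including the default model list assumed here):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the flags shown above."""
    parser = argparse.ArgumentParser(description="Run the GoPredict pipeline")
    parser.add_argument(
        "--models",
        default="LINREG,RIDGE,LASSO,SVR,XGB,RF,NN",
        help="Comma-separated model codes, e.g. XGB,RF",
    )
    parser.add_argument(
        "--tune-xgb",
        action="store_true",
        help="Enable XGBoost hyperparameter tuning",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    models_to_run = [code.strip() for code in args.models.split(",")]
    print(models_to_run, args.tune_xgb)
```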
## Outputs

- `output/[model_name]/test_prediction_YYYYMMDD_HHMMSS.csv` - ready-to-submit prediction files with timestamps
- `saved_models/[model_name]_YYYYMMDD_HHMMSS.pkl` - trained models with metadata
- `logs/main.log` - complete pipeline execution log with detailed progress tracking and metrics
- `output/prediction_comparison_YYYYMMDD_HHMMSS.png` - model comparison plots
- Feature importance plots
## Configuration

Edit `config.py` to customize:
- Model parameters
- Data paths
- Output directories
- Hyperparameter tuning ranges
- Logging settings
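For illustration, a `config.py` along these lines might centralise those settings (every name and value below is hypothetical; check the actual file):

```python
from pathlib import Path

# Hypothetical sketch of the kinds of settings config.py centralises.
DATA_DIR = Path("data")

DATA_PATHS = {
    "raw_train": DATA_DIR / "raw" / "train.csv",
    "processed_train": DATA_DIR / "processed" / "feature_engineered_train.csv",
}

# Directories created automatically on startup
OUTPUT_DIRS = ["saved_models", "output", "logs"]

# Per-model hyperparameters
MODEL_PARAMS = {
    "XGB": {"n_estimators": 300, "max_depth": 8, "learning_rate": 0.1},
    "RF": {"n_estimators": 200},
}

# Search space for --tune-xgb
XGB_TUNING_GRID = {
    "max_depth": [6, 8, 10],
    "learning_rate": [0.05, 0.1, 0.2],
}
```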
## Programmatic Usage

### Complete pipeline

```python
from src.model.models import run_complete_pipeline
import pandas as pd

# Load the feature-engineered data
train_df = pd.read_csv('data/processed/feature_engineered_train.csv')
test_df = pd.read_csv('data/processed/feature_engineered_test.csv')

# Run the complete pipeline
results = run_complete_pipeline(
    train_df=train_df,
    test_df=test_df,
    models_to_run=['LINREG', 'RIDGE', 'XGB'],
    tune_xgb=True,
    create_submission=True
)
```

### Individual steps

```python
from src.model.models import run_regression_models, predict_duration, to_submission

# Train models
models = run_regression_models(train_df, ['XGB', 'RF'])

# Make predictions
predictions = predict_duration(models['XGBoost'], test_df)

# Create a submission file
submission = to_submission(predictions, test_df)
submission.to_csv('my_submission.csv', index=False)
```

## Testing

### API

```bash
# Run comprehensive API tests
python test_api.py
```

### Frontend

```bash
cd frontend
npm run test
npm run test:coverage
```

## Available Models

- LINREG - Linear Regression
- RIDGE - Ridge Regression
- LASSO - Lasso Regression
- SVR - Support Vector Regression
- XGB - XGBoost
- RF - Random Forest
- NN - Neural Network
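These short codes correspond to the full model names used elsewhere (e.g. `models['XGBoost']` in the programmatic examples); a hypothetical lookup table:

```python
# Hypothetical mapping from CLI codes to the model names used in results dicts.
MODEL_NAMES = {
    "LINREG": "Linear Regression",
    "RIDGE": "Ridge Regression",
    "LASSO": "Lasso Regression",
    "SVR": "Support Vector Regression",
    "XGB": "XGBoost",
    "RF": "Random Forest",
    "NN": "Neural Network",
}


def resolve_models(codes: str) -> list[str]:
    """Turn a --models value like 'XGB,RF' into full model names."""
    return [MODEL_NAMES[code.strip()] for code in codes.split(",")]


print(resolve_models("XGB,RF"))  # ['XGBoost', 'Random Forest']
```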
See CONTRIBUTING.md for development guidelines and frontend integration details.
See CODE_OF_CONDUCT.md for our community guidelines and security policies.
This project is licensed under the MIT License - see the LICENSE file for details.