Lithin-7/end2end-salary-ml


Capstone Salary Prediction – End-to-End ML Pipeline

This repository provides a complete machine learning pipeline to predict standardized employee salaries (in USD) across different job roles, countries, experience levels, and related attributes. The solution features robust data cleaning, exploratory data analysis (EDA), model building with MLflow experiment tracking, data drift monitoring using Evidently AI, and a production-ready Flask web API for real-time and batch salary predictions.




🧰 Tech Stack

This project was built using the following tools and technologies:

  • Programming Language: Python
  • Web Framework: Flask
  • Data Handling: Pandas, NumPy
  • Modeling & ML: scikit-learn, XGBoost
  • MLOps & Tracking: MLflow
  • Drift Monitoring: Evidently AI
  • EDA & Profiling: ydata-profiling
  • Database (Optional): PostgreSQL
  • Deployment: Flask API for real-time & batch predictions

🖼️ Screenshots

1. Flask Web App Interface

  • Upload CSV files for batch salary predictions.
  • Fill in job details manually for single prediction.
  • Real-time standardized salary predictions in USD.
  • Intuitive and interactive user interface.

Flask Web App Screenshot

2. MLflow Experiment Tracking

  • Tracks all training experiments and versions.
  • Visualizes metrics such as RMSE, MAE, and R².
  • Stores hyperparameters, artifacts, and source code versions.

MLflow Experiment Screenshot
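
For reference, the metrics tracked here (plus MAPE, which the evaluation step also reports) can be written out in a few lines. This is a minimal standard-library sketch of the usual definitions, not the project's own evaluation code; the function name `regression_metrics` is an assumption made for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R² and MAPE for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Root mean squared error and mean absolute error
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n

    # R² compares the residual error with the variance around the mean
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")

    # MAPE in percent; it assumes no true value is zero (salaries are positive)
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n * 100.0

    return {"rmse": rmse, "mae": mae, "r2": r2, "mape": mape}
```

The returned dictionary has the flat name-to-number shape that a metrics logger typically expects.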

3. MLflow Model Registry

  • Model promotion lifecycle: None → Staging → Production.
  • Fully versioned with lineage from source runs.

MLflow Model Registry Screenshot

4. EDA Report

  • Automatically generated EDA report using ydata-profiling.
  • Provides summary statistics, distributions, correlations, and missing value analysis.

➡️ View EDA Report (PDF)


🔄 Project Flow

  1. Data Ingestion – From CSV or PostgreSQL (configurable)
  2. Data Cleaning & Preprocessing – Normalization, typo correction, winsorization
  3. EDA – Automated profiling using ydata-profiling
  4. Feature Engineering – Transformations for regression
  5. Model Training – Decision Tree, Random Forest, XGBoost with GridSearchCV
  6. MLflow Tracking – Metrics, artifacts, and model registry
  7. Evaluation – RMSE, MAE, R², MAPE
  8. Drift Monitoring – Evidently AI integration
  9. Deployment – Flask web app for prediction (manual & batch)
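
As an illustration of the winsorization mentioned in step 2, the sketch below caps a numeric column at chosen quantiles with pandas. The helper name `winsorize`, the sample values, and the quantile limits are assumptions for the example; this README does not state which columns or thresholds the pipeline uses.

```python
import pandas as pd


def winsorize(series: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values outside the [lower, upper] quantile range to the boundary values."""
    low, high = series.quantile([lower, upper])
    return series.clip(lower=low, upper=high)


# Hypothetical salary column with one extreme value
salaries = pd.Series([40_000, 52_000, 61_000, 75_000, 90_000, 2_500_000])

# Cap everything above the 80th percentile; the outlier is pulled down to 90,000
capped = winsorize(salaries, lower=0.0, upper=0.8)
```

Unlike dropping outliers, clipping keeps every row, which matters when the dataset is small.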

✅ Features

  • End-to-end ML pipeline with MLOps
  • Currency standardization & typo correction
  • Automated EDA profiling
  • Grid search with modular pipelines
  • MLflow for versioning and registry
  • Evidently AI for drift detection
  • Web UI for predictions (Flask)
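
A grid search over a modular pipeline might look roughly like the following scikit-learn sketch. The synthetic data, column names, and parameter grid are all made up for illustration and are not taken from the project; only Random Forest is shown, although the project also trains Decision Tree and XGBoost models.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny synthetic dataset; the column names are hypothetical
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "experience_level": rng.choice(["junior", "mid", "senior"], size=n),
    "years_experience": rng.integers(0, 20, size=n),
})
y = 50_000 + 4_000 * X["years_experience"] + rng.normal(0, 5_000, size=n)

# Preprocessing and model live in one pipeline, so each CV fold is encoded
# using only its own training split
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["experience_level"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(random_state=0)),
])

# Step-prefixed keys route each parameter to the right pipeline stage
param_grid = {
    "model__n_estimators": [20, 50],
    "model__max_depth": [3, None],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # the scorer is negated, so flip the sign
```

Swapping the `"model"` step for another regressor leaves the preprocessing untouched, which is the main benefit of keeping the stages modular.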

🗂️ Project Structure

Capstone_instilit/
│
├── auto_eda_project/
│   ├── Data/
│   ├── Screenshots/
│   │   ├── Flask.png
│   │   ├── mlflow_exp.png
│   │   └── mlflow_model.png
│   ├── db/
│   ├── data_ingestion/
│   ├── preprocessing/
│   ├── model/
│   ├── mlflow/
│   ├── evidently_ai/
│   └── save_model/
│
├── EDA INSIGHTS/
│   └── EDA REPORT 1.pdf
│
├── main.py
├── requirements.txt
└── README.md

⚙️ Installation

1. Clone the Repository

git clone https://github.com/Lithin-7/end2end-salary-ml.git
cd end2end-salary-ml

2. Create & Activate Virtual Environment

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

🚀 Usage

🔧 Run the Full Pipeline

python main.py

  • Edit main.py to switch between PostgreSQL and CSV input, change file paths, or enable or disable individual pipeline stages.

📊 Generate EDA Report

import pandas as pd

from auto_eda_project.auto_eda_runner import run_autoeda

# Load the salary dataset and write the profiling report to eda_output
df = pd.read_csv('auto_eda_project/Data/Software_Salaries.csv')
run_autoeda(df, output_path='eda_output')

📉 Run Drift Monitoring

python auto_eda_project/evidently_ai/evidently_drift.py

📈 Launch MLflow UI

mlflow ui

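Visit http://localhost:5000 in your browser to browse experiment runs and the drift reports logged to them. The Flask app described below also listens on port 5000, so to run both at once start the MLflow UI on a different port, for example with mlflow ui --port 5001.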


🌐 Model Deployment (Flask App)

🧪 Launch the Web Interface

python auto_eda_project/flask_app.py

  • Open: http://localhost:5000
  • Upload a CSV for bulk predictions
  • Use the manual form for a single prediction
  • Get instant USD salary estimates

🔁 MLOps and Drift Monitoring

MLflow for:

  • Experiment tracking
  • Model registry
  • Artifact versioning
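
As a rough sketch of what experiment tracking involves, the helper below logs one training run with MLflow's fluent API. The function names `flatten_params` and `log_training_run` are hypothetical and not part of this repository; the MLflow calls themselves (`set_experiment`, `start_run`, `log_params`, `log_metrics`, `log_artifact`) are standard.

```python
def flatten_params(params: dict, prefix: str = "") -> dict:
    """Flatten nested hyperparameter dicts into 'a.b' keys that are easy to log."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


def log_training_run(experiment: str, params: dict, metrics: dict, artifact_path=None) -> str:
    """Log one training run to MLflow and return its run ID."""
    import mlflow  # imported here so flatten_params works without MLflow installed

    mlflow.set_experiment(experiment)
    with mlflow.start_run() as run:
        mlflow.log_params(flatten_params(params))
        mlflow.log_metrics(metrics)
        if artifact_path is not None:
            # Any local file, such as a saved report, can be attached to the run
            mlflow.log_artifact(artifact_path)
        return run.info.run_id
```

Calling `log_training_run` with an experiment name, a parameter dictionary, and a metrics dictionary creates one run that then appears in the MLflow UI.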

Evidently AI for:

  • Input and target drift detection
  • Drift reports logged to MLflow
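
To give a feel for what a drift check measures, the sketch below computes the Population Stability Index (PSI), one common drift statistic, between a reference sample and a current sample. It is a standard-library illustration of the idea only: it does not use Evidently's API, and this README does not say which statistical tests the project's drift reports rely on.

```python
import math


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two numeric samples, using equal-width bins from the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all reference values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Replace empty bins with a tiny proportion so the logarithm is defined
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

Identical samples give a PSI of zero, and the value grows as the current distribution moves away from the reference.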

🤝 Contributing

Feel free to fork this repo and open pull requests. All contributions are welcome.


📄 License

This project is licensed under the MIT License.

