This repository provides a complete machine learning pipeline to predict standardized employee salaries (in USD) across different job roles, countries, experience levels, and related attributes. The solution features robust data cleaning, exploratory data analysis (EDA), model building with MLflow experiment tracking, data drift monitoring using Evidently AI, and a production-ready Flask web API for real-time and batch salary predictions.
- Screenshots
- Project Flow
- Features
- Project Structure
- Installation
- Usage
- Model Deployment (Flask App)
- MLOps and Drift Monitoring
- Contributing
- License
This project was built using the following tools and technologies:
- Programming Language: Python
- Web Framework: Flask
- Data Handling: Pandas, NumPy
- Modeling & ML: scikit-learn, XGBoost
- MLOps & Tracking: MLflow
- Drift Monitoring: Evidently AI
- EDA & Profiling: ydata-profiling
- Database (Optional): PostgreSQL
- Deployment: Flask API for real-time & batch predictions
- Upload CSV files for batch salary predictions.
- Fill in job details manually for single prediction.
- Real-time standardized salary predictions in USD.
- Intuitive and interactive user interface.
- Tracks all training experiments and versions.
- Visualizes metrics such as RMSE, MAE, and R².
- Stores hyperparameters, artifacts, and source code versions.
- Model promotion lifecycle: None → Staging → Production.
- Fully versioned with lineage from source runs.
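To make the tracked metrics concrete, here is a minimal sketch of how RMSE, MAE, R² and MAPE (the last one appears in the pipeline's evaluation step) could be computed before being logged. The function name `regression_metrics` and the sample numbers are illustrative assumptions, not code or results from this repository.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R² and MAPE for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R² is undefined when the target is constant
        "r2": 1.0 - ss_res / ss_tot if ss_tot else float("nan"),
        # MAPE as a fraction; assumes no true value is zero
        "mape": sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Made-up salaries in USD, purely for illustration
metrics = regression_metrics([100_000, 120_000, 80_000], [110_000, 115_000, 90_000])
print(metrics)
```

A dictionary like this maps naturally onto one logged metric per key in an experiment tracker.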
- Automatically generated EDA report using ydata-profiling.
- Provides summary statistics, distributions, correlations, and missing value analysis.
- Data Ingestion: From CSV or PostgreSQL (configurable)
- Data Cleaning & Preprocessing: Normalization, typo correction, winsorization
- EDA: Automated profiling using ydata-profiling
- Feature Engineering: Transformations for regression
- Model Training: Decision Tree, Random Forest, XGBoost with GridSearchCV
- MLflow Tracking: Metrics, artifacts, and model registry
- Evaluation: RMSE, MAE, R², MAPE
- Drift Monitoring: Evidently AI integration
- Deployment: Flask web app for prediction (manual & batch)
- End-to-end ML pipeline with MLOps
- Currency standardization & typo correction
- Automated EDA profiling
- Grid search with modular pipelines
- MLflow for versioning and registry
- Evidently AI for drift detection
- Web UI for predictions (Flask)
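As an illustration of "grid search with modular pipelines", the sketch below wraps preprocessing and a model in one scikit-learn `Pipeline` and tunes it with `GridSearchCV`. The synthetic data, column names, and parameter grid are assumptions made for the example and are not taken from this repository.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data; column names are illustrative only
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "experience_level": rng.choice(["Junior", "Mid", "Senior"], size=n),
    "years_experience": rng.integers(0, 20, size=n),
})
y = 50_000 + 4_000 * X["years_experience"] + rng.normal(0, 5_000, size=n)

# One-hot encode the categorical column, pass the numeric column through
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["experience_level"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(random_state=0)),
])

# Step-prefixed names route each parameter to the "model" step
param_grid = {"model__n_estimators": [20, 50], "model__max_depth": [3, None]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_root_mean_squared_error")
search.fit(X, y)
best_params = search.best_params_
print(best_params)
```

Keeping preprocessing inside the pipeline means each cross-validation fold fits the encoder on its own training split, which avoids leaking information from the validation fold.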
Capstone_instilit/
│
├── auto_eda_project/
│   ├── Data/
│   ├── Screenshots/
│   │   ├── Flask.png
│   │   ├── mlflow_exp.png
│   │   └── mlflow_model.png
│   ├── db/
│   ├── data_ingestion/
│   ├── preprocessing/
│   ├── model/
│   ├── mlflow/
│   ├── evidently_ai/
│   └── save_model/
│
├── EDA INSIGHTS/
│   └── EDA REPORT 1.pdf
│
├── main.py
├── requirements.txt
└── README.md
git clone https://github.com/Lithin-7/end2end-salary-ml.git
cd end2end-salary-ml
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
python main.py
- Edit main.py to toggle between PostgreSQL and CSV input, change file paths, or enable/disable pipeline stages.
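The exact switches in main.py are not shown here, so the following is only a hypothetical sketch of how such toggles could be grouped. The class `PipelineConfig`, its field names, and the stage names are all assumptions; only the CSV path is taken from the EDA example in this README.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineConfig:
    """Hypothetical settings; the real main.py may name these differently."""

    data_source: str = "csv"  # "csv" or "postgres"
    csv_path: str = "auto_eda_project/Data/Software_Salaries.csv"
    stages: dict = field(default_factory=lambda: {
        "eda": True,
        "training": True,
        "drift": False,
    })

    def __post_init__(self):
        # Fail early on an unsupported source instead of midway through a run
        if self.data_source not in ("csv", "postgres"):
            raise ValueError(f"unknown data_source: {self.data_source!r}")

    def enabled_stages(self):
        """Return the names of the stages switched on, in declaration order."""
        return [name for name, on in self.stages.items() if on]


config = PipelineConfig(data_source="csv")
print(config.enabled_stages())
```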
from auto_eda_project.auto_eda_runner import run_autoeda
import pandas as pd

df = pd.read_csv('auto_eda_project/Data/Software_Salaries.csv')
run_autoeda(df, output_path='eda_output')
python auto_eda_project/evidently_ai/evidently_drift.py
mlflow ui
Visit: http://localhost:5000 in your browser to track experiments and drift.
python auto_eda_project/flask_app.py
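- Open: http://localhost:5000 (the MLflow UI defaults to the same port, so if it is still running, start one of the two on a different port, for example with mlflow ui --port 5001)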
- Upload a CSV for bulk predictions
- Use the manual form for a single prediction
- Get instant USD salary estimates
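The batch upload flow can be sketched as a single Flask route that reads the uploaded CSV, checks its columns, and appends a prediction column. This is not the repository's flask_app.py: the route name `/predict_batch`, the required columns, the output column name, and the `StubModel` placeholder are all assumptions for illustration.

```python
import io

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Illustrative column names, not the repository's actual schema
REQUIRED_COLUMNS = ["job_title", "experience_level"]


class StubModel:
    """Placeholder for the trained model; returns a fixed estimate per row."""

    def predict(self, frame: pd.DataFrame):
        return [100_000.0] * len(frame)


model = StubModel()


@app.route("/predict_batch", methods=["POST"])  # hypothetical route name
def predict_batch():
    upload = request.files.get("file")
    if upload is None:
        return jsonify(error="no CSV file uploaded"), 400
    frame = pd.read_csv(io.BytesIO(upload.read()))
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        return jsonify(error=f"missing columns: {missing}"), 400
    frame["predicted_salary_usd"] = model.predict(frame[REQUIRED_COLUMNS])
    return jsonify(predictions=frame.to_dict(orient="records"))


# Exercise the route with Flask's built-in test client (no server needed)
client = app.test_client()
csv_bytes = b"job_title,experience_level\nData Scientist,Senior\n"
response = client.post(
    "/predict_batch",
    data={"file": (io.BytesIO(csv_bytes), "jobs.csv")},
    content_type="multipart/form-data",
)
print(response.get_json())
```

Validating the columns before predicting lets the API return a clear 400 error for a malformed file rather than failing inside the model.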
- Experiment tracking
- Model registry
- Artifact versioning
- Input and target drift detection
- Drift reports logged to MLflow
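To give a feel for what input or target drift detection measures, the sketch below computes a population stability index (PSI) between a reference sample and a current sample with NumPy. It is a simple stand-in statistic, not Evidently AI's API or the method used in this repository; the function name, bin count, and synthetic salary data are assumptions.

```python
import numpy as np


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, binned on the reference quantiles."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges from the reference distribution; the outer bins are open-ended
    inner_edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference, side="right"), minlength=bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current, side="right"), minlength=bins)
    # A small floor avoids division by zero and log(0) for empty bins
    ref_frac = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_frac = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Synthetic salaries: the "current" sample is shifted upward by 30,000 USD
rng = np.random.default_rng(0)
reference_salaries = rng.normal(100_000, 20_000, size=1_000)
current_salaries = rng.normal(130_000, 20_000, size=1_000)

psi_same = population_stability_index(reference_salaries, reference_salaries)
psi_shifted = population_stability_index(reference_salaries, current_salaries)
print(psi_same, psi_shifted)
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, though a suitable threshold depends on the data and should be chosen per feature.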
Feel free to fork this repo and open pull requests. All contributions are welcome.
This project is licensed under the MIT License.


