Small end-to-end project that processes bug/issue data, builds features, trains an XGBoost multiclass classifier, and exposes a FastAPI prediction endpoint.
Key modules and entrypoints
- Data processing flow: `src.process.process_data` (implementation: `src/process.py`)
- Feature engineering flow: `src.feature_engineer.feature_data` (implementation: `src/feature_engineer.py`)
- Training: `src.train_model.train_model` (implementation: `src/train_model.py`)
- API (FastAPI): `app/main.py`
- Pipeline steps: `DataSelection`, `CustomFeature`, `OneHot`, `FeatureVariance`, `PCA_Components`, `XGBoosting`, `ReverseEncoder` (see `app/main.py`)
- Test client: `app/serve.py`
What it does
- Cleans and normalises raw dataset columns, groups rare products/components into generic buckets.
- Builds aggregated features (counts, uniques) and persists them for serving.
- Encodes categorical variables via one-hot, applies variance threshold and PCA for dimensionality reduction.
- Trains an XGBoost multiclass model via hyperparameter search (Hyperopt); saves the final model and artifacts to `models/`.
- Exposes a prediction endpoint (`/predict`) through a sklearn-style pipeline wrapped in FastAPI.
Algorithms & libraries
- Data orchestration: Prefect flows (`src/process.py`, `src/feature_engineer.py`)
- Feature engineering: OneHotEncoder, VarianceThreshold, PCA, SMOTE (imbalanced sampling); see `src.feature_engineer`
- Model: XGBoost classifier (`xgboost.XGBClassifier`) with hyperparameter optimization via Hyperopt; see `src.train_model.get_objective` and `src.train_model.optimize`
- Serving: FastAPI + sklearn Pipeline wrappers; see `app/main.py`
- Misc: fuzzy matching via `fuzzywuzzy` in assignee cleaning (`src/process.py`)
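A Hyperopt objective factory like `src.train_model.get_objective` typically closes over the data and returns a function that maps a sampled parameter dict to a `{"loss": ..., "status": ...}` result for `fmin` to minimize. A hedged sketch of that pattern (the parameter names and scoring metric are assumptions, and `GradientBoostingClassifier` stands in for `xgboost.XGBClassifier` to avoid an xgboost dependency):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def get_objective(X, y):
    """Return an objective(params) -> dict usable with hyperopt.fmin."""
    def objective(params):
        model = GradientBoostingClassifier(
            n_estimators=25,
            max_depth=int(params["max_depth"]),
            learning_rate=params["learning_rate"],
        )
        score = cross_val_score(model, X, y, cv=3, scoring="f1_macro").mean()
        # hyperopt minimizes, so negate the cross-validated score
        return {"loss": -score, "status": "ok"}
    return objective

# Evaluate the objective once on synthetic multiclass data.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = rng.integers(0, 3, size=120)
result = get_objective(X, y)({"max_depth": 3, "learning_rate": 0.1})
```

With hyperopt installed, the same `objective` would be passed to `hyperopt.fmin` along with a search space built from `hp.quniform`/`hp.loguniform` expressions.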
Quick start (development)
- Install dependencies and hooks:
  - Use Poetry (the project is configured for it): `make install` (calls `poetry install`); see Makefile
- Persist Prefect flow outputs (optional): `export PREFECT__FLOWS__CHECKPOINTING=true`
- Run the processing + feature flows: `python src/feature_main.py` (entry: `src.feature_main.main`). This runs both processing (`src.process.process_data`) and feature engineering (`src.feature_engineer.feature_data`).
- Train the model: `python src/train_model.py` (entry: `src.train_model.train_model`). Artifacts are saved into `models/` (label encoder, one-hot encoder, PCA, XGBoost model, feature summaries).
- Run tests: `make test` (runs `pytest --no-header -v`); tests live in `tests/test_process.py`
- Run the API (locally or in Docker):
  - Local (requires the saved `models/`): run the FastAPI app defined in `app/main.py` with Uvicorn.
  - Docker:
    - Build: `docker build -t testliodocker ./` (Dockerfile: Dockerfile)
    - Run: `docker run -d --name testliodocker -p 80:80 testliodocker`
    - Test the endpoint: see the example client `app/serve.py`
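As a companion to the `app/serve.py` test client (whose exact request schema is not reproduced here), a stdlib-only call to the `/predict` endpoint might look like this; the payload field names are placeholders:

```python
import json
import urllib.request

# Hypothetical payload — field names are assumptions; see app/serve.py for the real schema.
payload = {"product": "mobile-app", "component": "login", "priority": "high"}

def predict(payload, url="http://localhost:80/predict"):
    """POST the payload as JSON to the API and return the decoded response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# predict(payload)  # requires the API (e.g. the Docker container) to be running
```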
Files of interest
- Processing & features: `src/process.py`, `src/feature_engineer.py`
- Training: `src/train_model.py`
- Flow launcher: `src/feature_main.py`
- API: `app/main.py`, `app/serve.py`
- Tests: `tests/test_process.py`
- Notebook (exploratory analysis): `notebooks/Exploration.ipynb`
- Configs: `config/main.yaml`
- Docker: `Dockerfile`
- Makefile: `Makefile`
Notes & tips
- Paths for saved artifacts are controlled by the Hydra config (`config/main.yaml`); inspect and adjust it before running the flows.
- The Prefect flows persist intermediate CSVs under `data/processed/` and `data/features_data/` when checkpointing is enabled.
- The FastAPI app expects model artifacts under `/models` inside the container; the Dockerfile copies `./models` into the image.
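The artifact round-trip that training and serving rely on can be illustrated with `joblib` (bundled with scikit-learn). The file names below are placeholders; the real ones come from `config/main.yaml`:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.preprocessing import LabelEncoder

def save_artifacts(artifacts, model_dir):
    """Dump each named object to <model_dir>/<name>.joblib."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    for name, obj in artifacts.items():
        joblib.dump(obj, model_dir / f"{name}.joblib")

def load_artifacts(names, model_dir):
    """Load the named artifacts back into a dict, as the API would at startup."""
    model_dir = Path(model_dir)
    return {name: joblib.load(model_dir / f"{name}.joblib") for name in names}

# Round-trip a label encoder the way the training step might persist it.
le = LabelEncoder().fit(["bug", "feature", "task"])
with tempfile.TemporaryDirectory() as d:
    save_artifacts({"label_encoder": le}, d)
    loaded = load_artifacts(["label_encoder"], d)
restored = loaded["label_encoder"].classes_.tolist()
print(restored)  # ['bug', 'feature', 'task']
```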