Small end-to-end project that processes bug/issue data, builds features, trains an XGBoost multiclass classifier and exposes a FastAPI prediction endpoint.

dtheod/multiclass-classification

Multiclass Classification Pipeline

Key modules and entrypoints

What it does

  • Cleans and normalises raw dataset columns and groups rare products/components into generic buckets.
  • Builds aggregated features (counts, uniques) and persists them for serving.
  • Encodes categorical variables with one-hot encoding, then applies a variance threshold and PCA for dimensionality reduction.
  • Trains an XGBoost multiclass model via hyperparameter search (Hyperopt), then saves the final model and artifacts to models/.
  • Exposes a prediction endpoint (/predict) through a sklearn-style pipeline wrapped in FastAPI.
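
The first two steps above (bucketing rare categories and building count/unique aggregates) can be illustrated with a small standard-library sketch. This is not the project's code: the column names, the `min_count` threshold, the `other_product` label, and the helper names `bucket_rare` and `aggregate` are all hypothetical, chosen only to show the idea.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; the real dataset's columns may differ.
ISSUES = [
    {"product": "Firefox", "component": "UI", "assignee": "alice"},
    {"product": "Firefox", "component": "Network", "assignee": "bob"},
    {"product": "Firefox", "component": "UI", "assignee": "alice"},
    {"product": "Core", "component": "UI", "assignee": "carol"},
    {"product": "Core", "component": "Network", "assignee": "alice"},
    {"product": "Camino", "component": "Tabs", "assignee": "dave"},
]


def bucket_rare(rows, column, min_count, other_label):
    """Replace values of `column` seen fewer than `min_count` times with `other_label`."""
    counts = Counter(row[column] for row in rows)
    return [
        {**row, column: row[column] if counts[row[column]] >= min_count else other_label}
        for row in rows
    ]


def aggregate(rows, key, unique_of):
    """Per value of `key`, count rows and count distinct values of `unique_of`."""
    totals = Counter()
    uniques = defaultdict(set)
    for row in rows:
        totals[row[key]] += 1
        uniques[row[key]].add(row[unique_of])
    return {k: {"count": totals[k], "n_unique": len(uniques[k])} for k in totals}


# Products with fewer than two issues fall into a generic bucket.
bucketed = bucket_rare(ISSUES, "product", min_count=2, other_label="other_product")
# Aggregates like these are what would be persisted and looked up at serving time.
product_features = aggregate(bucketed, key="product", unique_of="assignee")
```

Persisting the aggregate table matters because the serving path has to reproduce the same feature values for a single incoming issue without access to the full training data.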

Algorithms & libraries

  • Data orchestration: Prefect flows (src/process.py, src/feature_engineer.py)
  • Feature engineering: OneHotEncoder, VarianceThreshold, PCA, SMOTE (oversampling for class imbalance); see src.feature_engineer
  • Model: XGBoost classifier (xgboost.XGBClassifier) with hyperparameter optimization using Hyperopt — see src.train_model.get_objective and src.train_model.optimize
  • Serving: FastAPI + sklearn Pipeline wrappers — see app/main.py
  • Misc: fuzzy matching via fuzzywuzzy in assignee cleaning (src/process.py)
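
As a rough illustration of the feature-engineering chain listed above, the sketch below applies OneHotEncoder, VarianceThreshold and PCA in sequence with scikit-learn. The toy rows, the zero variance threshold and `n_components=2` are assumptions made for the example, not values taken from src.feature_engineer, and the SMOTE step is left out.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical rows: product, component, severity.
# The severity column is constant on purpose, so it carries no information.
rows = np.array(
    [
        ["Firefox", "UI", "normal"],
        ["Firefox", "Network", "normal"],
        ["Thunderbird", "UI", "normal"],
        ["Core", "Network", "normal"],
        ["Core", "UI", "normal"],
        ["Firefox", "UI", "normal"],
    ],
    dtype=object,
)

# One-hot encode; unseen categories at serving time become all-zero columns.
encoder = OneHotEncoder(handle_unknown="ignore")
X_onehot = encoder.fit_transform(rows).toarray()  # densify the sparse output

# Drop columns whose variance is zero (here: the constant severity column).
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X_onehot)

# Project the remaining columns onto two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_selected)

print(X_onehot.shape, X_selected.shape, X_reduced.shape)
```

The fitted `encoder`, `selector` and `pca` objects are the kind of artifacts that need to be saved after training, so that the serving path applies exactly the same transformations.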

Quick start (development)

  1. Install dependencies and hooks:
    • Use Poetry (project configured): make install (calls poetry install) — see Makefile
  2. Persist Prefect flow outputs (optional):
    • export PREFECT__FLOWS__CHECKPOINTING=true
  3. Run processing + feature flows:
    • python src/feature_main.py — entry: src.feature_main.main
      • This runs both processing (src.process.process_data) and feature engineering (src.feature_engineer.feature_data).
  4. Train the model:
    • python src/train_model.py — entry: src.train_model.train_model
    • Artifacts saved into models/ (label encoder, one-hot encoder, PCA, XGBoost model, feature summaries).
  5. Run tests:
  6. Run API (locally / in Docker):
    • Local (requires the saved artifacts in models/): run the FastAPI app defined in app/main.py with Uvicorn.
    • Docker:
      • Build: docker build -t testliodocker ./ (uses the Dockerfile in the repository root)
      • Run: docker run -d --name testliodocker -p 80:80 testliodocker
      • Test endpoint: see example client app/serve.py
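
For a quick manual check of the container started above, a client request could look roughly like the sketch below. It uses only the standard library and is not a copy of app/serve.py. The payload field names are hypothetical, and it assumes that /predict accepts a JSON body via POST; check the request model in app/main.py for the actual schema.

```python
import json
from urllib import request


def build_predict_request(payload, base_url="http://localhost:80"):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical field names; the real schema is defined by the FastAPI app.
example_payload = {"product": "Firefox", "component": "UI"}
req = build_predict_request(example_payload)

# To call a running container, uncomment the lines below:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```

Building the request separately from sending it makes it easy to inspect the URL and body before pointing the client at a live service.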

Files of interest

Notes & tips

  • Paths for saved artifacts are controlled by Hydra config (config/main.yaml) — inspect and adjust before running flows.
  • The Prefect flows persist intermediate CSVs under data/processed/ and data/features_data/ when checkpointing is enabled.
  • The FastAPI app expects model artifacts under /models inside the container — the Dockerfile copies ./models into the image.
