This project demonstrates how to fine-tune a single LLM (Microsoft Phi-3-mini) for multiple specialized tasks without duplicating the base model. By using dual LoRA adapters—one targeting attention layers for code generation, and another targeting MLP layers for docstring generation—the model can seamlessly switch contexts at inference time.
🚀 Try it live on Hugging Face Spaces: BiLoRA AI Assistant
- Dual LoRA Adapters: Task-specific adapters targeting different model layers.
- DVC Pipeline: Reproducible end-to-end pipeline from data download to adapter deployment.
- Experiment Tracking: Hyperparameters, training loss, git hash, and duration logged per run.
- Automated Benchmarking: Quality gate that blocks deployment if metrics regress.
- CI/CD: GitHub Actions smoke test + auto-deploy to HF Spaces.
- Human-in-the-Loop Feedback: User ratings collected in production, downloadable for retraining.
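The human-in-the-loop feature above boils down to filtering highly rated interactions into fresh training pairs. A minimal sketch, assuming a hypothetical record schema with `prompt`, `response`, and `rating` fields (the project's actual format may differ):

```python
def feedback_to_training_pairs(records, min_rating=4):
    """Keep only highly rated interactions as (prompt, completion) pairs.

    The field names and the rating threshold are assumptions for
    illustration, not the repository's actual schema.
    """
    pairs = []
    for rec in records:
        if rec.get("rating", 0) >= min_rating:
            pairs.append({"prompt": rec["prompt"], "completion": rec["response"]})
    return pairs

feedback = [
    {"prompt": "Write a function to reverse a list",
     "response": "def rev(xs): return xs[::-1]", "rating": 5},
    {"prompt": "Document this function",
     "response": "Low-quality docstring", "rating": 2},
]
print(feedback_to_training_pairs(feedback))
# Only the rating-5 record survives the filter
```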
- Model + Adapters: aniketp2009gmail/phi3-bilora-code-review
- User Feedback Dataset: aniketp2009gmail/bilora-user-feedback
Evaluated on 20 samples (10 code generation, 10 docstring generation), using a Groq-hosted LLM as judge.
| Metric | BiLoRA (ours) | Phi-3 Base | Groq LLaMA-3.3-70B |
|---|---|---|---|
| Code Gen Pass Rate | 94.2% | 70.0% | 100.0% |
| Code Gen Quality (1-5) | 3.7 | 3.6 | 4.4 |
| Docstring BLEU | 0.026 | 0.054 | 0.126 |
| Docstring Quality (1-5) | 2.5 | 4.0 | 4.2 |
| Avg Latency | 33,499ms | 24,561ms | 434ms |
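The 1-5 quality scores come from an LLM-as-judge pass: a Groq model is prompted to grade each output, and a numeric score is parsed from its reply. The mechanics can be sketched as follows; the prompt wording and score-parsing logic here are assumptions, not the project's actual implementation:

```python
import re

def build_judge_prompt(task, reference, candidate):
    """Assemble a 1-5 grading prompt for the judge model (wording hypothetical)."""
    return (
        f"Rate the following {task} output from 1 (poor) to 5 (excellent).\n"
        f"Reference:\n{reference}\n\nCandidate:\n{candidate}\n\n"
        "Reply with a single line: Score: <n>"
    )

def parse_judge_score(reply, default=1):
    """Extract the first 1-5 digit from the judge's reply; fall back to 1."""
    m = re.search(r"\b([1-5])\b", reply)
    return int(m.group(1)) if m else default

print(parse_judge_score("Score: 4. Solid but verbose."))  # → 4
```

A conservative fallback score matters here because judge replies are free-form text and occasionally ignore the requested format.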
```
data-download → data-process → data-split → train → benchmark → push-adapters
                                              │         │
                      training_metrics.json          results.json → baseline.json
                                                                   (auto-promoted)
```

```
git push → GitHub Actions smoke test → deploy to HF Spaces
                                                │
                                       user feedback → HF Dataset
                                                │
                                  fetch_feedback.py → training pairs
```
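The benchmark stage doubles as a quality gate: `scripts/compare_metrics.py` checks fresh `results.json` against `baseline.json`, blocks the push on regression, and promotes the new results otherwise. A minimal sketch of that logic, with metric names and tolerance assumed (the real script may differ):

```python
import json

# Assumed metric schema and tolerance; illustrative only.
HIGHER_IS_BETTER = ["code_gen_pass_rate", "docstring_bleu"]
TOLERANCE = 0.01  # allow small noise before calling it a regression

def check_regression(results, baseline):
    """Return the list of metrics that dropped beyond TOLERANCE."""
    regressed = []
    for metric in HIGHER_IS_BETTER:
        if results[metric] < baseline[metric] - TOLERANCE:
            regressed.append(metric)
    return regressed

def quality_gate(results_path, baseline_path):
    """Raise on regression; otherwise promote results to the new baseline."""
    with open(results_path) as f:
        results = json.load(f)
    with open(baseline_path) as f:
        baseline = json.load(f)
    regressed = check_regression(results, baseline)
    if regressed:
        raise SystemExit(f"Regression in: {regressed} — blocking push")
    with open(baseline_path, "w") as f:
        json.dump(results, f, indent=2)  # auto-promote the baseline
```

Promoting the baseline only after the gate passes is what keeps the regression check monotone: every accepted run raises (or holds) the bar for the next one.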
```bash
pip install -r requirements.txt

export GROQ_API_KEY=gsk_...   # For LLM-as-judge evaluation
export HF_TOKEN=hf_...        # For pushing adapters and feedback

dvc repro
```

This runs: download data → preprocess → split → train adapters → benchmark (quality gate) → push adapters to HF Hub.
```bash
dvc repro data-download
dvc repro data-process
dvc repro data-split
dvc repro train
dvc repro benchmark       # Fails if metrics regress — blocks push
dvc repro push-adapters   # Only runs after benchmark passes
```

To run an experiment, edit `params.yaml` (e.g. change `learning_rate` or `lora_r`), then:
```bash
dvc repro

# See what changed
dvc params diff
dvc metrics diff
```

Full evaluation (all models + Groq judge):
```bash
python benchmarking/evaluate.py --groq-api-key "$GROQ_API_KEY"

# Quick smoke test
python benchmarking/evaluate.py --max-samples 1 --only-bilora

# Verbose (see raw model outputs)
python benchmarking/evaluate.py --groq-api-key "$GROQ_API_KEY" --verbose
```

Commit and push — CI handles the rest:
```bash
git add .
git commit -m "Experiment: increased lora_r to 8"
git push
# GitHub Actions: smoke test → deploy to HF Spaces
```

Run the demo app locally:

```bash
streamlit run hf_space/app.py
```

Fetch collected user feedback:

```bash
python scripts/fetch_feedback.py
# Outputs: data/feedback/all_feedback.json + training_pairs.json
```

```
├── src/
│   ├── get_data.py           # Download MBPP + CodeXGLUE datasets
│   ├── preprocess_data.py    # Tokenize and format for training
│   ├── split_data.py         # Train/val split
│   ├── training.py           # Dual adapter LoRA training + experiment tracking
│   ├── push_adapters.py      # Upload adapters to HF Hub
│   └── common.py             # Config reader
├── benchmarking/
│   ├── evaluate.py           # Full benchmark suite (BiLoRA vs Base vs Groq)
│   ├── eval_dataset.json     # 20-sample test set
│   ├── results.json          # Latest evaluation results (DVC metric)
│   └── baseline.json         # Auto-updated regression baseline
├── scripts/
│   ├── compare_metrics.py    # Regression check + baseline promotion
│   └── fetch_feedback.py     # Download user feedback from HF Dataset
├── hf_space/
│   ├── app.py                # Streamlit app with adapter switching + feedback
│   ├── requirements.txt      # Space dependencies
│   └── README.md             # HF Spaces metadata
├── .github/workflows/
│   └── deploy.yml            # CI: smoke test → deploy to HF Spaces
├── params.yaml               # All hyperparameters (DVC-tracked)
├── dvc.yaml                  # Pipeline definition
└── requirements.txt          # Project dependencies
```
- Base Model: Microsoft Phi-3-mini-4k-instruct (3.8B params)
- Quantization: 4-bit NF4 (bitsandbytes) for training, float16 for inference
- Task 1 Adapter: Targets attention layers (`qkv_proj`, `o_proj`) — trained on MBPP
- Task 2 Adapter: Targets MLP layers (`gate_up_proj`, `down_proj`) — trained on CodeXGLUE
- LoRA Config: r=4, alpha=8, dropout=0.1
- Evaluation: Functional test cases + BLEU + Groq LLM-as-judge (1-5 quality scale)
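The r=4 setting keeps each adapter tiny: LoRA factors the weight delta of a projection as `B @ A` with rank 4, so only the two thin factor matrices are trained. A back-of-envelope count for one square projection (hidden size 3072 assumed for Phi-3-mini, per public model cards; treat as illustrative):

```python
# Rough LoRA parameter count for one d x d projection weight W.
d, r = 3072, 4  # assumed hidden size; rank from the LoRA config above

full_params = d * d            # updating W directly
lora_params = d * r + r * d    # delta_W = B @ A, with B (d x r) and A (r x d)

print(full_params, lora_params, full_params / lora_params)
# At r=4, LoRA trains roughly d / (2r) = 384x fewer parameters per layer
```

This is why two task-specific adapters can share one frozen base model cheaply: each adapter is a few megabytes against a 3.8B-parameter backbone.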