This repository contains the machine learning training pipeline for a sentiment classification model based on restaurant reviews. It is inspired by the GitHub repository: proksch/restaurant-sentiment.
The training pipeline performs the following steps:
- Downloads the dataset in `.tsv` format from the sentiment-analysis repo in `get_data.py`.
- Loads the labelled dataset in `.tsv` format containing restaurant reviews.
- Preprocesses the data using methods from the `lib-ml` package in `preprocess.py`.
- Trains an SVC classifier in `train.py` (sketched below).
- Saves the trained model locally to `model/model.pkl`.
- Evaluates the trained model in `eval.py`.
- Publishes a versioned model artifact to GitHub Releases.
- Tracks data, metrics, and models using DVC with a Google Cloud Storage remote.
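For orientation, here is a minimal sketch of the kind of logic `train.py` implements. The file paths, column names, and hyperparameters below are illustrative assumptions, not the repository's actual code:

```python
# Hypothetical sketch of the training step; paths and names are assumptions.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Assumed location and schema of the preprocessed dataset.
data = pd.read_csv("data/preprocessed.tsv", sep="\t")

# Bag-of-words features, mirroring the bow.pkl artifact mentioned below.
bow = CountVectorizer(max_features=1500)
X = bow.fit_transform(data["Review"])
y = data["Liked"]

# Train the SVC classifier (random_state is configurable via params.yaml).
model = SVC(kernel="linear", random_state=42)
model.fit(X, y)

# Persist both the vectorizer and the classifier.
Path("model").mkdir(exist_ok=True)
with open("model/bow.pkl", "wb") as f:
    pickle.dump(bow, f)
with open("model/model.pkl", "wb") as f:
    pickle.dump(model, f)
```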
The dataset is loaded into the `/data` folder in `.tsv` format and is downloaded (at run-time) from:

🔗 https://github.com/proksch/restaurant-sentiment
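A minimal sketch of such a download step; the raw-file URL and file names here are assumptions:

```python
# Hypothetical sketch of the download performed by get_data.py.
from pathlib import Path
from urllib.request import urlretrieve

# Assumed raw-file URL within the proksch/restaurant-sentiment repo.
URL = (
    "https://raw.githubusercontent.com/proksch/restaurant-sentiment/"
    "main/a1_RestaurantReviews_HistoricDump.tsv"
)

Path("data").mkdir(exist_ok=True)
urlretrieve(URL, "data/raw_reviews.tsv")
```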
Requirements:

- Python 3.12.3
- pip
- dvc[gs] (already included in `dev-requirements.txt`)
Setup Virtual Environment: Run the following commands from the project root:

```bash
python -m venv <venv_name>
source <venv_name>/bin/activate  # For Unix/macOS
# Or use <venv_name>\Scripts\activate on Windows
pip install .
# Or, for development:
pip install -r dev-requirements.txt
```

To deactivate after use:

```bash
deactivate
```

Run Training Pipeline Manually: Run the following commands from the project root:
```bash
python model-training/restaurant_sentiment/get_data.py
python model-training/restaurant_sentiment/preprocess.py
python model-training/restaurant_sentiment/train.py
# For evaluation:
python model-training/restaurant_sentiment/eval.py
```

To publish a trained model to GitHub Releases, create a version tag using semantic versioning:
```bash
git tag v<MAJOR>.<MINOR>.<PATCH>
git push origin v<MAJOR>.<MINOR>.<PATCH>
```

This will trigger the `release.yml` GitHub Actions workflow, which trains the model and uploads it as an artifact to GitHub Releases. Once published, model releases are publicly accessible. For example, the download link for version v0.1.0 would be:
https://github.com/remla25-team6/model-training/releases/download/v0.1.0/model-v0.1.0.pkl
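A consumer could then fetch and load the released model; a minimal sketch, assuming the artifact unpickles to a scikit-learn classifier:

```python
# Download a released model artifact and load it.
import pickle
from urllib.request import urlretrieve

VERSION = "v0.1.0"
URL = (
    "https://github.com/remla25-team6/model-training/releases/download/"
    f"{VERSION}/model-{VERSION}.pkl"
)

urlretrieve(URL, "model.pkl")
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
```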
- Follow the instructions above to create a virtual environment and install dependencies for local development from `dev-requirements.txt`.
- Add your Google Cloud Storage (GCS) key in the form `keyfile.json` to the root directory.
- Export your GCS key path from the root directory:

  ```bash
  export GOOGLE_APPLICATION_CREDENTIALS=keyfile.json
  ```

- Pull all data and models tracked by DVC:

  ```bash
  dvc pull
  ```

- Reproduce the pipeline:

  ```bash
  dvc repro
  ```

DVC will run only the necessary stages in order: preprocess → train → eval.
To roll back to a specific previous version:

```bash
# Go to a previous Git commit (where metrics/data were different)
git checkout <commit_hash>
# Restore the corresponding data and model artifacts
dvc checkout
```

You can find commit hashes with:

```bash
dvc exp show
```

You can create and track experiments using DVC's experiment tools:
```bash
# Run an experiment
dvc exp run

# Test a different parameter (e.g., random state)
# See params.yaml for all configurable parameters
dvc exp run --set-param train.random_state=20

# Show all experiments
dvc exp show
```

The `model/metrics.json` file includes:
- `test_accuracy`
- `test_precision`
- `test_recall`
- `test_f1_score`
- `test_cohens_kappa`
- `test_samples`
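To inspect these values locally, a minimal sketch assuming the keys listed above:

```python
# Print the evaluation metrics tracked by DVC.
import json

with open("model/metrics.json") as f:
    metrics = json.load(f)

for name, value in metrics.items():
    print(f"{name}: {value}")
```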
You can compare metrics from different experiments using DVC's experiment and metrics tools:

```bash
# Compare experiments
dvc exp show
# Compare specific experiments
dvc exp diff <exp1> <exp2>
# Compare metric differences
dvc metrics diff
```

To share data and metrics to the remote, after running or reproducing the pipeline:
```bash
dvc push
```

To share experiments:

```bash
dvc exp push origin <exp_id>
```

To apply the best experiment:

```bash
dvc exp apply <exp_id>
git commit -am "<commit_message>"
```

To securely access the DVC remote (stored in Google Cloud Storage), all users use the shared service account:

`dvc-access@remla25-team6.iam.gserviceaccount.com`

- Go to Google Cloud Console → IAM & Admin → Service Accounts; you should see the shared service account listed.
- Click on the shared service account.
- Create a new key through "Keys → Add Key → Create New Key → JSON". This will download a JSON keyfile.
- Copy the keyfile as `keyfile.json` to the project root directory.
- Set the environment variable in your terminal at the root directory:

  ```bash
  export GOOGLE_APPLICATION_CREDENTIALS=keyfile.json
  ```

- Now you can run DVC commands.
To add new users to the project:
- Go to Google Cloud Console → IAM & Admin.
- Click “Grant Access”.
- Enter the user’s email address.
- Assign a role such as:
  - `Storage Object Viewer` (read-only)
  - `Storage Object Admin` (read/write)
- Click Save.
From the project root, run:

```bash
# Pylint
pylint --rcfile=.pylintrc <target-directory>

# Flake8
flake8 .

# Bandit
bandit -r .

# Black formatter
black .
```

| Tool | Purpose | Config File |
|---|---|---|
| `pylint` | Detect code smells, logic issues | `.pylintrc` |
| `flake8` | Enforce style conventions | `.flake8` |
| `bandit` | Detect common security issues | `.bandit` |
| `black` | Auto-format code to consistent style | default |
To automatically calculate the ML Test Score from the paper "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" (Breck et al., 2017), new test cases that fall under one of the categories from the paper should follow the naming convention below:

`test_{category_name}_{case_number}_{your_arbitrary_test_case_name}`

The `category_name` may be one of `["data", "model", "infra", "monitor"]`. An example test case name is `test_model_6_model_quality_on_slices`, which corresponds to the test case from the paper: "Model 6: Model quality is sufficient on all important data slices".
To manually execute the tests, run the following from the project root:

- For the normal testing suite with coverage:

  ```bash
  pytest -v --cov=.
  ```

- For the ML Test Score:

  ```bash
  python scripts/calculate_ml_test_score.py
  ```

This project uses GitHub Actions for automated testing and versioned model artifact releases.
To publish an official release:

1. Ensure all changes are committed and pushed to any desired `release` branch.
2. Tag the commit with a version like `v0.1.0` and push:

   ```bash
   git tag v0.1.0
   git push origin v0.1.0
   ```

3. This triggers the `release.yml` workflow, which:
   - Pulls tracked model artifacts (e.g., `model.pkl`, `bow.pkl`, `metrics.json`) using DVC.
   - Runs `dvc repro` to re-execute the pipeline and ensure consistency with the committed DVC state.
   - Packages and publishes these files as part of a GitHub Release under the specified version tag.
   - Attaches relevant metadata and metrics for reproducibility.
To publish a pre-release:

- Push a commit to the `main` branch (i.e., merge a pull request to `main`).
- The `prerelease.yml` workflow automatically runs on every commit to `main`.
- It pulls the latest DVC-tracked artifacts and publishes a pre-release on GitHub with a timestamped version like `0.1.0-pre.20250625.184512`.
To ensure stability:

- Every commit and pull request triggers the `testing.yml` workflow.
- It creates a Python virtual environment, installs dependencies, pulls required DVC data, and runs the full `pytest` test suite.
- It updates the `pylint`, `coverage`, and `ml-test-score` badges in the README to reflect the latest repository state.
This document was refined using ChatGPT-4o.