This repository contains the machine learning training pipeline for a sentiment classification model based on restaurant reviews. It is inspired by the GitHub repository: proksch/restaurant-sentiment.
The training pipeline performs the following steps:
- Downloads the dataset in `.tsv` format from the sentiment-analysis repo in `get_data.py`.
- Loads the labelled dataset in `.tsv` format containing restaurant reviews.
- Preprocesses the data using methods from the `lib-ml` package in `preprocess.py`.
- Trains an SVC classifier in `train.py` (sketched below).
- Saves the trained model locally to `model/model.pkl`.
- Evaluates the trained model in `eval.py`.
- Publishes a versioned model artifact to GitHub Releases.
- Tracks data, metrics, and models using DVC with a Google Cloud Storage remote.
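For orientation, here is a minimal sketch of the kind of logic `train.py` implements. The file paths, column names, and hyperparameters below are illustrative assumptions, not the repository's actual code:

```python
# Hypothetical sketch of the training step; paths and names are assumptions.
import pickle
from pathlib import Path

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Assumed location and schema of the preprocessed dataset.
data = pd.read_csv("data/preprocessed.tsv", sep="\t")

# Bag-of-words features, mirroring the bow.pkl artifact mentioned below.
bow = CountVectorizer(max_features=1500)
X = bow.fit_transform(data["Review"])
y = data["Liked"]

# Train the SVC classifier (random_state is configurable via params.yaml).
model = SVC(kernel="linear", random_state=42)
model.fit(X, y)

# Persist both the vectorizer and the classifier.
Path("model").mkdir(exist_ok=True)
with open("model/bow.pkl", "wb") as f:
    pickle.dump(bow, f)
with open("model/model.pkl", "wb") as f:
    pickle.dump(model, f)
```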
The dataset is loaded into the `/data` folder in `.tsv` format and is downloaded (at run-time) from:

🔗 https://github.com/proksch/restaurant-sentiment
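A minimal sketch of such a download step; the raw-file URL and file names here are assumptions:

```python
# Hypothetical sketch of the download performed by get_data.py.
from pathlib import Path
from urllib.request import urlretrieve

# Assumed raw-file URL within the proksch/restaurant-sentiment repo.
URL = (
    "https://raw.githubusercontent.com/proksch/restaurant-sentiment/"
    "main/a1_RestaurantReviews_HistoricDump.tsv"
)

Path("data").mkdir(exist_ok=True)
urlretrieve(URL, "data/raw_reviews.tsv")
```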
Requirements:

- Python 3.12.3
- pip
- dvc[gs] (already included in `dev-requirements.txt`)
Setup Virtual Environment: Run the following commands from the project root:

```bash
python -m venv <venv_name>
source <venv_name>/bin/activate  # For Unix/macOS
# Or use <venv_name>\Scripts\activate on Windows
pip install .
# Or, for development:
pip install -r dev-requirements.txt
```

To deactivate after use:

```bash
deactivate
```

Run Training Pipeline Manually: Run the following commands from the project root:
```bash
python model-training/restaurant_sentiment/get_data.py
python model-training/restaurant_sentiment/preprocess.py
python model-training/restaurant_sentiment/train.py
# For evaluation:
python model-training/restaurant_sentiment/eval.py
```

To publish a trained model to GitHub Releases, create a version tag using semantic versioning:
```bash
git tag v<MAJOR>.<MINOR>.<PATCH>
git push origin v<MAJOR>.<MINOR>.<PATCH>
```

This will trigger the `release.yml` GitHub Actions workflow, which trains the model and uploads it as an artifact to GitHub Releases. Once published, model releases are publicly accessible. For example, the download link for version v0.1.0 would be:
https://github.com/remla25-team6/model-training/releases/download/v0.1.0/model-v0.1.0.pkl
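A consumer could then fetch and load the released model; a minimal sketch, assuming the artifact unpickles to a scikit-learn classifier:

```python
# Download a released model artifact and load it.
import pickle
from urllib.request import urlretrieve

VERSION = "v0.1.0"
URL = (
    "https://github.com/remla25-team6/model-training/releases/download/"
    f"{VERSION}/model-{VERSION}.pkl"
)

urlretrieve(URL, "model.pkl")
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
```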
- Follow the instructions above to create a virtual environment and install dependencies for local development from `dev-requirements.txt`.
- Add your Google Cloud Storage (GCS) key in the form `keyfile.json` to the root directory.
- Export your GCS key path from the root directory:

  ```bash
  export GOOGLE_APPLICATION_CREDENTIALS=keyfile.json
  ```

- Pull all data and models tracked by DVC:

  ```bash
  dvc pull
  ```

- Reproduce the pipeline:

  ```bash
  dvc repro
  ```

DVC will run only the necessary stages in order: preprocess → train → eval.
To roll back to a specific previous version:

```bash
# Go to a previous Git commit (where metrics/data were different)
git checkout <commit_hash>
# Restore the corresponding data and model artifacts
dvc checkout
```

You can find commit hashes with:

```bash
dvc exp show
```

You can create and track experiments using DVC's experiment tools:
```bash
# Run an experiment
dvc exp run

# Test a different parameter (e.g., random state)
# See params.yaml for all configurable parameters
dvc exp run --set-param train.random_state=20

# Show all experiments
dvc exp show
```

The `model/metrics.json` file includes:
- `test_accuracy`
- `test_precision`
- `test_recall`
- `test_f1_score`
- `test_cohens_kappa`
- `test_samples`
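To inspect these values locally, a minimal sketch assuming the keys listed above:

```python
# Print the evaluation metrics tracked by DVC.
import json

with open("model/metrics.json") as f:
    metrics = json.load(f)

for name, value in metrics.items():
    print(f"{name}: {value}")
```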
You can compare metrics from different experiments using DVC's experiment and metrics tools:

```bash
# Compare experiments
dvc exp show
# Compare specific experiments
dvc exp diff <exp1> <exp2>
# Compare metric differences
dvc metrics diff
```

To share data and metrics to the remote, after running or reproducing the pipeline:
```bash
dvc push
```

To share experiments:

```bash
dvc exp push origin <exp_id>
```

To apply the best experiment:

```bash
dvc exp apply <exp_id>
git commit -am "<commit_message>"
```

To securely access the DVC remote (stored in Google Cloud Storage), all users use the shared service account:

`dvc-access@remla25-team6.iam.gserviceaccount.com`

- Go to Google Cloud Console → IAM & Admin → Service Accounts; you should see the shared service account listed.
- Click on the shared service account.
- Create a new key through "Keys → Add Key → Create New Key → JSON". This will download a JSON keyfile.
- Copy the keyfile as `keyfile.json` to the project root directory.
- Set the environment variable in your terminal at the root directory:

  ```bash
  export GOOGLE_APPLICATION_CREDENTIALS=keyfile.json
  ```

- Now you can run DVC commands.
To add new users to the project:
- Go to Google Cloud Console → IAM & Admin.
- Click “Grant Access”.
- Enter the user’s email address.
- Assign a role such as:
  - `Storage Object Viewer` (read-only)
  - `Storage Object Admin` (read/write)
- Click Save.
From the project root, run:

```bash
# Pylint
pylint --rcfile=.pylintrc <target-directory>

# Flake8
flake8 .

# Bandit
bandit -r .

# Black formatter
black .
```

| Tool | Purpose | Config File |
|---|---|---|
| `pylint` | Detect code smells, logic issues | `.pylintrc` |
| `flake8` | Enforce style conventions | `.flake8` |
| `bandit` | Detect common security issues | `.bandit` |
| `black` | Auto-format code to consistent style | default |
To automatically calculate the ML Test Score from the paper "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" (Breck et al., 2017), new test cases that fall under one of the categories from the paper should follow the naming convention below:

`test_{category_name}_{case_number}_{your_arbitrary_test_case_name}`

The `category_name` may be one of `["data", "model", "infra", "monitor"]`. An example test case name is `test_model_6_model_quality_on_slices`, which corresponds to the test case from the paper: "Model 6: Model quality is sufficient on all important data slices".
To manually execute the tests, run the following from the project root:

- For the normal testing suite with coverage:

  ```bash
  pytest -v --cov=.
  ```

- For the ML Test Score:

  ```bash
  python scripts/calculate_ml_test_score.py
  ```

This project uses GitHub Actions for automated testing and versioned model artifact releases.
To publish an official release:

1. Ensure all changes are committed and pushed to any desired `release` branch.
2. Tag the commit with a version like `v0.1.0` and push:

   ```bash
   git tag v0.1.0
   git push origin v0.1.0
   ```

3. This triggers the `release.yml` workflow, which:
   - Pulls tracked model artifacts (e.g., `model.pkl`, `bow.pkl`, `metrics.json`) using DVC.
   - Runs `dvc repro` to re-execute the pipeline and ensure consistency with the committed DVC state.
   - Packages and publishes these files as part of a GitHub Release under the specified version tag.
   - Attaches relevant metadata and metrics for reproducibility.
To publish a pre-release:

- Push a commit to the `main` branch (i.e., merge a pull request to `main`).
- The `prerelease.yml` workflow automatically runs on every commit to `main`.
- It pulls the latest DVC-tracked artifacts and publishes a pre-release on GitHub with a timestamped version like `0.1.0-pre.20250625.184512`.
To ensure stability:

- Every commit and pull request triggers the `testing.yml` workflow.
- It creates a Python virtual environment, installs dependencies, pulls required DVC data, and runs the full `pytest` test suite.
- It updates the `pylint`, `coverage`, and `ml-test-score` badges in the README to reflect the latest repository state.
This document was refined using ChatGPT-4o.