ml_training

CI/CD for Machine Learning with GitHub Actions and CML

This repository is a hands-on sandbox to experiment with GitHub Actions and CML (Continuous Machine Learning) in a simple Machine Learning workflow.

The goal is to validate how typical ML steps (data preparation, training, evaluation, and reporting) can be automated and reviewed through a standard CI pipeline—especially via Pull Requests.


Why this repository exists

Machine Learning projects often struggle with reproducibility and reviewability. In software engineering, CI makes it easy to validate changes on every commit and PR. This repository explores how to bring that mindset to ML by:

  • running training/evaluation automatically in CI
  • generating machine-readable metrics
  • producing plots/artifacts
  • publishing a compact report to GitHub (PR comments / checks) using CML

Core concepts used

GitHub Actions

GitHub Actions is used as the CI runner/orchestrator. The workflow:

  • provisions a fresh environment
  • installs dependencies
  • runs training/evaluation
  • collects outputs (metrics/artifacts)
  • triggers CML reporting

CML (Continuous Machine Learning)

CML is an open-source CLI/tooling layer for ML-oriented CI/CD. It is typically used to:

  • run ML experiments in CI
  • log and publish metrics and plots
  • comment results on Pull Requests


Repository structure

High-level structure (summarized from the project tree):

.
├── .github/workflows
│   └── ml_training.yml          # GitHub Actions workflow for ML training + CML reporting
├── data
│   ├── raw/                     # Raw dataset(s)
│   ├── cleaned/                 # Cleaned/processed dataset(s)
│   └── README.md                # Data notes (source, transformations, conventions)
├── src
│   ├── notebooks/               # EDA + experiments (not executed in CI)
│   ├── scripts/
│   │   └── train.py             # Training script used by CI
│   └── README.md                # Source notes (how code is organized)
├── tests
│   └── README.md                # Test strategy placeholder
├── main.py                      # Optional entry point for local execution
├── pyproject.toml               # Dependencies and project metadata
├── uv.lock                      # Locked dependencies (uv)
└── README.md                    # This file

Notes:

  • A local .venv/ folder may exist in your working copy, but it should generally not be committed to Git.

What happens in CI

Workflow file: .github/workflows/ml_training.yml

Responsibilities for this workflow:

  1. Trigger on push and/or workflow_dispatch
  2. Checkout repository code
  3. Set up Python
  4. Install dependencies (from pyproject.toml / uv.lock)
  5. Run the training script (e.g., python src/scripts/train.py)
  6. Collect outputs:
    • metrics file(s) (e.g., JSON/TXT)
    • plots/images (e.g., PNG)
    • optionally model artifacts
  7. Use CML to publish a report into the PR (or as a workflow artifact)
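The steps above could be expressed in a workflow roughly like this sketch (illustrative only: the job name, Python version, action versions, and the metrics.json path are assumptions, not taken from the actual ml_training.yml):

```yaml
name: ml_training

on: [push, workflow_dispatch]

permissions:
  pull-requests: write   # required for CML to post PR comments

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: astral-sh/setup-uv@v5
      - uses: iterative/setup-cml@v2
      - name: Install dependencies
        run: uv sync
      - name: Train and evaluate
        run: uv run python src/scripts/train.py
      - name: Publish CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "## Model Metrics" > report.md
          cat metrics.json >> report.md
          cml comment create report.md
```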

Expected CI outputs

In a CI-driven workflow, you typically:

  • generate metrics files, plots, and optionally model artifacts during the workflow run
  • attach them to the run as workflow artifacts, and/or
  • include them in a CML report comment
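The machine-readable side of these outputs needs nothing beyond the standard library. A minimal sketch (the metric names and file path are illustrative, not taken from this repo's train.py):

```python
import json
from pathlib import Path

def write_metrics(metrics: dict, path: str = "metrics.json") -> None:
    """Serialize evaluation metrics so CI can attach or report them."""
    Path(path).write_text(json.dumps(metrics, indent=2))

# Hypothetical scores produced by an evaluation step
write_metrics({"accuracy": 0.91, "f1": 0.87})
```

Because the file is plain JSON, the same artifact feeds both the CML Markdown report and any later metric-regression check.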

Local development

Requirements

  • Python (use the version indicated in .python-version if present)
  • uv (recommended) or pip

If you use uv (recommended because this repo includes uv.lock):

Create environment and install dependencies (uv)

uv sync

If you prefer a traditional virtualenv approach:

python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .

Note: The exact install command depends on how your pyproject.toml is configured.


Running the training locally

The canonical CI script lives in:

  • src/scripts/train.py

Run it directly:

uv run python src/scripts/train.py

Data

This repo includes a small churn dataset flow:

  • data/raw/ contains the original dataset file(s)
  • data/cleaned/ contains a cleaned dataset used for training

Read dataset-specific notes here:

  • data/README.md

Rule of thumb used in this repo:

  • never overwrite raw datasets
  • always generate cleaned datasets deterministically from raw data
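That rule amounts to treating cleaning as a pure function from raw rows to cleaned rows, so re-running it on the same raw file always yields the same cleaned file. A sketch under that assumption (the drop-incomplete-rows rule and paths are illustrative, not this repo's actual cleaning logic):

```python
import csv
from pathlib import Path

def clean_rows(rows):
    """Keep only complete rows; never mutate the raw input."""
    return [row for row in rows if all(v.strip() for v in row.values())]

def build_cleaned_dataset(raw_path: str, cleaned_path: str) -> None:
    """Read raw CSV, drop incomplete rows, write the cleaned CSV."""
    with open(raw_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        cleaned = clean_rows(reader)
    Path(cleaned_path).parent.mkdir(parents=True, exist_ok=True)
    with open(cleaned_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(cleaned)
```

The raw file is opened read-only and the cleaned file is rebuilt from scratch on every run, which is what makes the output deterministic.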

Notebooks

Location: src/notebooks/

Typical roles:

  • eda.ipynb: exploratory analysis
  • create_cleaned_dataset.ipynb: generates cleaned dataset from raw data
  • train.ipynb: iterative training experiments

Notebooks are primarily for exploration and are not executed in CI by default. Production CI runs should rely on scripts (e.g., src/scripts/train.py) for deterministic behavior.


Tests

Location: tests/

At the moment, this directory is a placeholder to expand testing practices commonly used in ML CI/CD, such as:

  • unit tests for data cleaning logic
  • schema/quality checks (nulls, ranges, constraints)
  • deterministic training sanity checks
  • metric regression tests (e.g., “F1 must not drop below X”)

How CML reporting typically works

A common pattern in GitHub Actions + CML is:

  1. Install CML (or use an action such as iterative/setup-cml)
  2. Generate metrics and plots during workflow execution
  3. Build a short Markdown report during the run
  4. Post that report to the PR using cml comment create

Example conceptual snippet (illustrative only):

# Generate results
python src/scripts/train.py

# Create report content (first echo truncates any stale report.md)
echo "## Model Metrics" > report.md
cat metrics.json >> report.md
echo "## Plot" >> report.md
echo "![](./plot.png)" >> report.md

# Publish comment
cml comment create report.md


Common setup notes (GitHub Actions + CML)

Depending on how you configured your workflow, you may need:

  • GITHUB_TOKEN (usually available automatically in Actions)
  • permissions to post PR comments (GitHub Actions workflow permissions)
  • pull_request triggers if you want comments on PRs

If comments are not being posted:

  • confirm workflow permissions allow writing comments
  • confirm you’re running on PR events (not only on pushes to branches)
  • ensure the job has access to the PR context
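The permissions and trigger items on that checklist are small workflow-level settings; a sketch (illustrative, not this repo's actual configuration):

```yaml
on:
  pull_request:          # run on PR events so CML has a PR to comment on

permissions:
  contents: read
  pull-requests: write   # allow the job to post PR comments
```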

Recommended next steps / roadmap

If you want to evolve this sandbox into a more complete ML CI/CD template:

  • Add unit tests and run them in the workflow
  • Add a small metric regression gate (fail CI if metric drops beyond a threshold)
  • Persist model artifacts as workflow artifacts
  • Add data validation checks (schema, nulls, ranges)
  • Integrate DVC for dataset/model versioning (optional, but common with CML)
  • Add a Makefile or task runner to standardize local commands


License

This repository is provided as-is for learning and experimentation.
