In this lab, you will learn how to build, evaluate, and compare end-to-end Active Learning (AL) pipelines using AWS and MLflow. The lab focuses on developing and testing classification models that evolve through multiple rounds of data selection, training, and evaluation. You will perform hands-on activities that compare three training strategies (a randomly sampled baseline, uncertainty sampling, and diversity sampling) and measure their effectiveness through model evaluation on a fixed test set.
By the end of this lab, you will understand:
- How to implement Active Learning workflows.
- How to train and evaluate models with real datasets.
- How to track and compare model performance using MLflow.
- How to run ML experiments efficiently on AWS EC2 instances.
- How to visualize model results using the MLflow UI.
To follow along and get the most out of this lab, you should have:
- Basic understanding of machine learning and classification tasks.
- Basic understanding of Python programming.
- Experience working in AWS EC2 environments (not mandatory, but helpful).
Additional requirements:
- AWS account with permissions to launch EC2 instances (preferably with GPU access).
Here are useful links to learn more about the tools and concepts in this lab:
In this lab, we implement an Active Learning-based NLP classification workflow to optimize the labeling process. Traditional supervised learning requires large amounts of labeled data, which is costly and time-consuming to collect. Active Learning allows the model to intelligently select the most informative data points to label, thereby improving performance with fewer labeled examples.
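At a high level, the workflow alternates between training, querying the pool, and labeling the selected points. The sketch below is purely illustrative; the helpers `train_model`, `select_queries`, and `request_labels` are placeholder names, not functions from the lab code.

```python
import numpy as np

def active_learning_loop(X, y, labeled_idx, pool_idx, n_rounds, query_size,
                         train_model, select_queries, request_labels):
    """Schematic Active Learning loop: train, query the pool, label, repeat."""
    for _ in range(n_rounds):
        model = train_model(X[labeled_idx], y[labeled_idx])     # fit on currently labeled data
        picks = select_queries(model, X[pool_idx], query_size)  # positions within the pool
        chosen = pool_idx[picks]                                # global indices to be labeled
        y[chosen] = request_labels(chosen)                      # oracle / human annotation step
        labeled_idx = np.concatenate([labeled_idx, chosen])     # grow the labeled set
        pool_idx = np.delete(pool_idx, picks)                   # shrink the unlabeled pool
    return model
```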
We compare three model training strategies:
- Baseline: Trained on a fixed, randomly sampled dataset.
- Uncertainty Sampling: Selects data where model confidence is low.
- Diversity Sampling: Selects data that best represents the entire feature space.
Each strategy is evaluated on a fixed test set (1,000 samples) and shares a common unlabeled pool (45,000 samples).
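As a minimal sketch of how the two query strategies might be implemented, assuming a scikit-learn-style classifier with a `predict_proba` method and vectorized (e.g., TF-IDF) features; the lab's actual feature extraction and classifier may differ:

```python
import numpy as np
from sklearn.cluster import KMeans

def uncertainty_sampling(model, X_pool, k):
    """Least-confidence sampling: pick the k pool points with the lowest top-class probability."""
    proba = model.predict_proba(X_pool)        # shape (n_pool, n_classes)
    confidence = proba.max(axis=1)             # model's confidence in its predicted class
    return np.argsort(confidence)[:k]          # least confident points first

def diversity_sampling(X_pool, k, random_state=0):
    """Cluster the pool into k groups and take the point nearest each cluster center."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_pool)
    dists = km.transform(X_pool)               # distance of every point to every center
    return np.array([dists[:, c].argmin() for c in range(k)])
```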
You will log metrics and artifacts from each round of training using MLflow and visualize results via its web interface.
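Logging a round could look roughly like the following; the tracking URI, experiment name, and metric names here are illustrative placeholders, not values prescribed by the lab.

```python
import mlflow

# Point the client at the tracking server; the URI and experiment name are assumptions.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("active-learning-lab")

def log_round(strategy, round_idx, accuracy, artifact_path=None):
    """Log one Active Learning round as an MLflow run so strategies can be compared in the UI."""
    with mlflow.start_run(run_name=f"{strategy}-round-{round_idx}"):
        mlflow.log_param("strategy", strategy)
        mlflow.log_param("round", round_idx)
        mlflow.log_metric("test_accuracy", accuracy)
        if artifact_path is not None:
            mlflow.log_artifact(artifact_path)   # e.g. a saved confusion-matrix image
```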
You will also use an AWS EC2 GPU instance to train models and run the experiments.
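Once the instance is running, a quick sanity check that the GPU is visible from Python (assuming PyTorch is installed in the environment) might look like this:

```python
import torch

# Confirm that the EC2 GPU is visible to the training environment.
if torch.cuda.is_available():
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected; training will run on CPU.")
```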
Made by Vani Seth for Mizzou Cloud DevOps Portal - University of Missouri, Columbia