Welcome to my machine learning repository! Here you'll find a collection of notebooks that I've created while exploring the world of machine learning. I've used a variety of libraries, including PyTorch, transformers, and xformers, to build models and tackle tasks from scratch. Many of the notebooks are well commented in English, so feel free to learn along with me. Please note that there may be some mistakes or unfinished notebooks; issues and pull requests are welcome!
Warning: Some notebooks may be unfinished; please use them with caution.
To get started, you can either install the required environment with conda or build a Docker image.
First, clone the repository and navigate to the MachineLearning directory:
git clone https://github.com/JenkinsGage/MachineLearning.git
cd MachineLearning
To install the environment using conda, run the following commands:
conda env create --file environment.yml
conda activate ml-torch
Alternatively, you can build a Docker image and run a container:
docker build -t ml-torch-cuda .
docker run -dp 8888:8888 ml-torch-cuda
Once the container is up and running, a Jupyter Lab server will be available on port 8888.
This repository contains a variety of notebooks covering different areas of machine learning. Here's an overview of what you'll find; minimal code sketches illustrating several of the notebooks follow the list.
- Build a Translation Model Using the Transformer Module of PyTorch
In this notebook, I use PyTorch's transformer module to build a model that translates Chinese to English. The Chinese text is tokenized with jieba and the English text with torchtext's basic English tokenizer. The model is trained on the wmt19 dataset from Hugging Face.
- Build Tokenizer Using Tokenizers Library
Here I use the tokenizers library to build tokenizers for both English and Chinese. Both use the WordPiece model, so the same approach can be applied to other languages as well.
- Build a Translation Model Using the XFormers Library and Tokenizers
In this notebook, I use the xformers library to build the transformer model quickly. Its memory-efficient attention reduces memory usage, which means a more complex model can be trained within a limited memory budget. I also use the tokenizers trained in [Build Tokenizer Using Tokenizers Library], so please run that notebook first to get the tokenizers before training this model.
- Build a Translation Model With Pretrained BERT as Encoder
In this notebook, I replace the encoder from the previous notebook with a pretrained BERT model and freeze all of the encoder's parameters. I also apply learning rate warmup to make training more stable.
- Using Pretrained Model from Huggingface for Paraphrasing
In this notebook, I use a pretrained model from Hugging Face (humarin/chatgpt_paraphraser_on_T5_base) to paraphrase sentences.
- Paraphrase with Gradio WebUI App
Here I use Gradio to build a web-based user interface for interacting with the humarin/chatgpt_paraphraser_on_T5_base model.
...
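To give a feel for the tokenization used in the first translation notebook, here is a minimal sketch of tokenizing a Chinese/English sentence pair with jieba and torchtext's basic English tokenizer. The sample sentences are illustrative only, not taken from the notebook.

```python
import jieba
from torchtext.data.utils import get_tokenizer

# Chinese side: jieba segments a sentence into a list of word tokens.
zh_tokenize = jieba.lcut
# English side: torchtext's rule-based "basic_english" tokenizer lower-cases and splits.
en_tokenize = get_tokenizer("basic_english")

zh_tokens = zh_tokenize("我喜欢机器学习")            # e.g. ['我', '喜欢', ...]
en_tokens = en_tokenize("I like machine learning")  # ['i', 'like', 'machine', 'learning']
print(zh_tokens, en_tokens)
```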
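The tokenizer notebook trains WordPiece tokenizers with the Hugging Face tokenizers library. Below is a rough sketch of that workflow; the corpus file name, vocabulary size, and special tokens are assumptions for illustration, not the notebook's actual settings.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# A WordPiece model with an unknown-token placeholder.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Vocabulary size and special tokens here are illustrative assumptions.
trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["corpus_en.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("tokenizer_en.json")

print(tokenizer.encode("machine learning is fun").tokens)
```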
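The xformers notebook relies on memory-efficient attention to keep the transformer's memory footprint down. The sketch below shows only the core attention call; the tensor shapes are a toy example, and the actual notebook builds a full encoder-decoder model around it.

```python
import torch
import xformers.ops as xops

# Toy shapes: batch=2, sequence length=128, heads=8, head dim=64.
q = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16)

# Computes softmax(QK^T / sqrt(d)) V without materializing the full attention matrix,
# which is where the memory savings come from.
out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # (2, 128, 8, 64)
```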
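For the pretrained-BERT-encoder notebook, the two key ideas are freezing the encoder and warming up the learning rate. Here is a minimal sketch of both, assuming bert-base-uncased, AdamW, and transformers' linear warmup schedule; the checkpoint name, step counts, and the stand-in decoder are illustrative placeholders, not the notebook's actual configuration.

```python
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

# Pretrained BERT used as a frozen encoder (checkpoint name is an assumption).
encoder = AutoModel.from_pretrained("bert-base-uncased")
for param in encoder.parameters():
    param.requires_grad = False  # encoder weights are not updated

# 'decoder' stands in for the trainable translation decoder from the notebook.
decoder = torch.nn.TransformerDecoder(
    torch.nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=6,
)

optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)
# Linearly increase the LR over the first steps, then decay it - the warmup trick.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=100_000
)
```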
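Loading and prompting the paraphrasing model from the Hugging Face Hub looks roughly like the sketch below. The "paraphrase:" prefix and the generation settings are assumptions based on typical T5-style usage, not necessarily what the notebook uses.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "humarin/chatgpt_paraphraser_on_T5_base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Machine learning lets computers learn patterns from data."
# T5-style task prefix; whether the notebook uses exactly this prompt is an assumption.
inputs = tokenizer("paraphrase: " + text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=5,
    num_return_sequences=3,  # ask for three candidate paraphrases
    max_length=128,
)
for candidate in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(candidate)
```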
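Finally, the Gradio app wraps the same paraphraser in a small web UI. A minimal sketch follows, with a placeholder `paraphrase` function standing in for the model call; the layout and launch options in the actual GradioApp.py may differ.

```python
import gradio as gr

def paraphrase(text: str) -> str:
    # Placeholder: in the real app this calls the humarin/chatgpt_paraphraser_on_T5_base model.
    return text

demo = gr.Interface(
    fn=paraphrase,
    inputs=gr.Textbox(lines=3, label="Original sentence"),
    outputs=gr.Textbox(label="Paraphrase"),
    title="Paraphraser",
)
demo.launch()  # serves a local web UI; add server_name="0.0.0.0" to expose it from a container
```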
The repository is organized as follows:
├── MachineLearning
│   ├── Area(NLP, Machine Vision, ...)
│   │   ├── Task(Translation, Paraphrasing, ...)
│   │   │   ├── Model
│   │   │   │   ├── SavedModels
│   │   │   ├── Data
│   │   │   │   ├── Datasets
│   │   │   ├── Notebook1.ipynb
│   │   │   ├── Notebook2.ipynb
│   │   │   ├── ...
│   │   │   ├── GradioApp.py