Distributed ML Training with In-Network Aggregation

A distributed parameter-server (PS) training architecture accelerated by P4 programmable switches.

Dependency

PyTorch is required. Install the system packages first:

sudo apt install libjpeg-dev zlib1g-dev libssl-dev libffi-dev python-dev build-essential libxml2-dev libxslt1-dev

Python dependencies:

  pip3 install pulp numpy tensorboard

CPU-only PyTorch:

pip3 install torch==1.10.0+cpu torchvision==0.11.1+cpu torchaudio==0.10.0+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html

Usage

The config files are not tracked in the repository for security reasons. Create config/workers.json yourself before running distributed training:

[
    {
        "host_ip": "IP of worker 1",
        "ssh_port": "port for ssh",
        "ssh_usr" : "user account to ssh",
        "ssh_psw" : "password",
        "work_dir": "path of files"
    },
    {
        "host_ip": "IP of worker 2",
        "ssh_port": "port for ssh",
        "ssh_usr" : "user account to ssh",
        "ssh_psw" : "password",
        "work_dir": "path of files"
    }
]
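Because the file is written by hand, it can help to check it before launching anything. The following is a minimal sketch, not part of the repository: the helper names `validate_workers` and `load_workers` are hypothetical, and the only assumption is that every worker entry needs the five keys shown above.

```python
import json

# Keys taken from the example config above.
REQUIRED_KEYS = {"host_ip", "ssh_port", "ssh_usr", "ssh_psw", "work_dir"}


def validate_workers(entries):
    """Return a list of human-readable problems found in the parsed config."""
    if not isinstance(entries, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
    return problems


def load_workers(path="config/workers.json"):
    """Load the worker list, raising ValueError if any entry is incomplete."""
    with open(path) as f:
        # json.load is strict: a trailing comma raises json.JSONDecodeError.
        entries = json.load(f)
    problems = validate_workers(entries)
    if problems:
        raise ValueError("; ".join(problems))
    return entries
```

Calling `load_workers()` from the repository root would then fail early with a readable message instead of failing midway through an SSH launch.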

Run ./deploy.sh to sync the code to all machines. Make sure the <repo> directory has already been created on each of them.

# deploy.sh

scp -r current_path ssh_usr@machine_ip:dest_path
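The scp line above has to be repeated once per machine. As an illustration only, the sketch below derives one scp command per entry of config/workers.json. The function `build_scp_commands` is hypothetical and not part of the repository, and passing the port with `-P` is an assumption, since the original deploy.sh line does not show a port. The password is deliberately not used: scp will prompt for it unless key-based login is configured.

```python
import shlex


def build_scp_commands(workers, src_path="."):
    """Build one `scp -r` argument list per worker entry (hypothetical helper)."""
    commands = []
    for w in workers:
        dest = f'{w["ssh_usr"]}@{w["host_ip"]}:{w["work_dir"]}'
        # scp takes the remote port via -P (capital P), unlike ssh's -p.
        commands.append(["scp", "-r", "-P", str(w["ssh_port"]), src_path, dest])
    return commands


if __name__ == "__main__":
    # Placeholder values, not real hosts.
    example = [{"host_ip": "10.0.0.2", "ssh_port": "22", "ssh_usr": "user",
                "ssh_psw": "unused", "work_dir": "/home/user/repo"}]
    for cmd in build_scp_commands(example):
        print(shlex.join(cmd))  # print only; pass cmd to subprocess.run to execute
```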

Run ./test.sh $WORKER_NUM to start training. The script runs python3 launch.py --master True xxx to launch the PS, which in turn launches the workers over SSH according to the IP list in config/workers.json.

# test.sh

WORKER_NUM=$1

sudo python3 src/launch.py --master 1 --ip machine_ip --worker_num $WORKER_NUM --config_file config/workers.json --dataset CIFAR100 --model resnet50
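The description passes `--master True` while test.sh passes `--master 1`. How launch.py actually parses this flag is not shown here, so the following is only a sketch of one way to accept both spellings with argparse. `str2bool` and `build_parser` are hypothetical names, and the defaults are assumptions. An explicit converter is used because `type=bool` would treat any non-empty string, including "False" and "0", as true.

```python
import argparse


def str2bool(value):
    """Convert common true/false spellings to bool, rejecting anything else."""
    lowered = value.strip().lower()
    if lowered in {"1", "true", "yes"}:
        return True
    if lowered in {"0", "false", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Hypothetical parser covering two of the flags used in test.sh."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--master", type=str2bool, default=False)
    parser.add_argument("--worker_num", type=int, default=1)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```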

Fangjin98/distributed-training-INA
