DRR is a defragmentation scheduler for shared GPU clusters. It mitigates GPU fragmentation arising from GPU sharing, diverse jobs, and asynchronous lifecycles, improving resource utilization under dynamic scheduling.
- Python 3.10

pip install -r requirements.txt

For a 64-node cluster simulation:
python simulator.py --num-node 64 --interarrival-time 8 --scheduler DRR \
--init_dim 3584 --action_space 64 --lr_actor 0.04 --lr_critic 0.02 \
--use_imitation True --imitation_loss_weight 0.1 \
--use_dynamic_entropy True --beta0 0.04 \
--use_attn True \
--use_advantage_adjustment 0.6 \
--use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5
# Other baseline schedulers
python simulator.py --num-node 64 --interarrival-time 8 --scheduler ElasticFlow
python simulator.py --num-node 64 --interarrival-time 8 --scheduler "R&P"
python simulator.py --num-node 64 --interarrival-time 8 --scheduler FGD
python simulator.py --num-node 64 --interarrival-time 8 --scheduler Hops

For a 32-node cluster simulation:
python simulator.py --num-node 32 --interarrival-time 16 --scheduler DRR \
--init_dim 1792 --action_space 32 --lr_actor 0.03 --lr_critic 0.02 \
--use_imitation True --imitation_loss_weight 0.1 \
--use_dynamic_entropy True --beta0 0.01 \
--use_attn True \
--use_advantage_adjustment 0.4 \
--use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5

For a 128-node cluster simulation:
python simulator.py --num-node 128 --interarrival-time 3.8 --scheduler DRR \
--init_dim 7168 --action_space 128 --lr_actor 0.06 --lr_critic 0.04 \
--use_imitation True --imitation_loss_weight 0.1 \
--use_dynamic_entropy True --beta0 0.06 \
--use_attn True \
--use_advantage_adjustment 0.1 \
--use_rescheduling True --util_threshold 0.5 --re_time 3600 --re_num 5

Defrag
├── cluster.py # Cluster environment implementation
├── clusterdata # Cluster trace data and preprocessing scripts
│ ├── cluster-trace-gpu-v2020
│ │ └── trace.txt
│ ├── filtered_traces.csv
│ ├── mypreprocess.ipynb
│ ├── sampled_traces.csv
│ ├── share_0.2_traces.csv
│ └── share_0.6_traces.csv
├── imgs
│ └── overview.jpg
├── job.py # Job representation
├── policy # Scheduling policies
│ ├── __init__.py
│ ├── drr.py
│ ├── elasticflow.py
│ ├── fgd.py
│ ├── gpupacking.py
│ ├── hops.py
│ └── policy.py
├── README.md
├── requirements.txt
├── simulator.py # Main simulation script
└── utils.py # Utility functions
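
The three DRR commands above share most of their flags and differ only in a few per-cluster values. The sketch below rebuilds them from a small table, which may help when scripting runs across cluster sizes. It is not part of the repository: the helper names `build_drr_command` and `build_baseline_command` are hypothetical, and the rule that `--init_dim` is 56 times the node count is only inferred from the three examples (3584/64 = 1792/32 = 7168/128 = 56). Reusing the DRR interarrival time for baselines at 32 and 128 nodes is likewise an assumption, since the README lists baseline commands only for 64 nodes.

```python
import shlex

# Assumption inferred from the README commands: init_dim = 56 * num_nodes,
# and action_space equals the node count.
FEATURES_PER_NODE = 56

# Per-cluster-size values copied from the three DRR commands in this README.
DRR_SETTINGS = {
    32: {"interarrival": 16, "lr_actor": 0.03, "lr_critic": 0.02,
         "beta0": 0.01, "advantage": 0.4},
    64: {"interarrival": 8, "lr_actor": 0.04, "lr_critic": 0.02,
         "beta0": 0.04, "advantage": 0.6},
    128: {"interarrival": 3.8, "lr_actor": 0.06, "lr_critic": 0.04,
          "beta0": 0.06, "advantage": 0.1},
}


def build_drr_command(num_nodes):
    """Return the DRR simulator command as an argument list."""
    s = DRR_SETTINGS[num_nodes]
    return [
        "python", "simulator.py",
        "--num-node", str(num_nodes),
        "--interarrival-time", str(s["interarrival"]),
        "--scheduler", "DRR",
        "--init_dim", str(FEATURES_PER_NODE * num_nodes),
        "--action_space", str(num_nodes),
        "--lr_actor", str(s["lr_actor"]),
        "--lr_critic", str(s["lr_critic"]),
        "--use_imitation", "True", "--imitation_loss_weight", "0.1",
        "--use_dynamic_entropy", "True", "--beta0", str(s["beta0"]),
        "--use_attn", "True",
        "--use_advantage_adjustment", str(s["advantage"]),
        "--use_rescheduling", "True", "--util_threshold", "0.5",
        "--re_time", "3600", "--re_num", "5",
    ]


def build_baseline_command(num_nodes, scheduler):
    """Return a baseline command (ElasticFlow, R&P, FGD, or Hops)."""
    # Assumption: baselines use the same interarrival time as DRR at this size.
    interarrival = DRR_SETTINGS[num_nodes]["interarrival"]
    return [
        "python", "simulator.py",
        "--num-node", str(num_nodes),
        "--interarrival-time", str(interarrival),
        "--scheduler", scheduler,
    ]


if __name__ == "__main__":
    # shlex.join quotes arguments such as R&P so the line is safe to paste.
    print(shlex.join(build_drr_command(64)))
    for name in ["ElasticFlow", "R&P", "FGD", "Hops"]:
        print(shlex.join(build_baseline_command(64, name)))
```

The argument lists can be passed directly to `subprocess.run` from the repository root, which avoids shell quoting issues with the `R&P` scheduler name.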