At runtime, it seems I have to set per_device_batch_size * gradient_acc_steps equal to num_generation. Otherwise, the number of responses generated per sample (the printed Completions) ends up being per_device_batch_size * gradient_acc_steps rather than num_generation, and the total number of steps changes as well.
For example, running on 4 GPUs with per_device_batch_size=1, gradient_acc_steps=8, and num_generation=8, each sample produces 8 responses, and the total number of steps is Step = 20 * 3 * num_generation / (num_gpus * per_device_batch_size * gradient_acc_steps) = 480 / 32 = 15.
Keeping everything else constant, changing gradient_acc_steps to 4 results in 4 completions per sample being printed. This equals per_device_batch_size * gradient_acc_steps and no longer equals num_generation (8). However, the total steps change from 15 to 30. The total number of samples processed remains unchanged, and I'm unsure how this affects training.
Changing gradient_acc_steps to 16 should, by the same formula, give Step = 20 * 3 * num_generation / (num_gpus * per_device_batch_size * gradient_acc_steps) = 480 / 64 = 7.5 total steps. However, the actual run reports 6 total steps, while the number of completions is still per_device_batch_size * gradient_acc_steps, i.e. 16 per sample.
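To double-check my reading of the numbers, here is the step arithmetic I am assuming, written out as a small script (the 20 * 3 factor is the same one as in the step formula above, and the flooring at the end is only my guess at how the trainer rounds partial steps):

```python
import math

NUM_GPUS = 4
PER_DEVICE_BATCH_SIZE = 1
NUM_GENERATION = 8
DATASET_FACTOR = 20 * 3  # same 20 * 3 factor as in the step formula above

for grad_acc_steps in (4, 8, 16):
    # Completions printed per prompt seem to track the accumulated micro-batches,
    # not num_generation:
    completions_per_prompt = PER_DEVICE_BATCH_SIZE * grad_acc_steps
    # Total steps as I understand the formula:
    steps = (DATASET_FACTOR * NUM_GENERATION) / (
        NUM_GPUS * PER_DEVICE_BATCH_SIZE * grad_acc_steps
    )
    print(
        f"grad_acc_steps={grad_acc_steps:2d} -> "
        f"completions/prompt={completions_per_prompt:2d}, "
        f"steps={steps} (floor={math.floor(steps)})"
    )
```

For gradient_acc_steps=4 and 8 this matches what I observe (30 and 15 steps), but for 16 it predicts 7 (or 7.5 before rounding) while the trainer reports 6, which is part of what I am asking about.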
I looked for documentation, but it only says that (number_of_gpus * per_device_batch_size * gradient_acc_steps) must be divisible by num_generations.
Does anyone know why this happens? If I want each prompt to generate exactly 8 responses for training, do I have to set per_device_batch_size=1, gradient_acc_steps=8, and num_generation=8? Is that configuration guaranteed to be correct, and what are the drawbacks of other settings? Or does the completions count not matter, as long as (number_of_gpus * per_device_batch_size * gradient_acc_steps) is divisible by num_generation?
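For reference, this is a minimal sketch of the configuration I am trying to pin down, assuming the TRL GRPOTrainer API (as far as I know the actual parameter names there are per_device_train_batch_size, gradient_accumulation_steps, and num_generations; the model id, reward function, and toy dataset are just placeholders):

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompt-only dataset; in my real run this is my own data.
train_dataset = Dataset.from_dict({"prompt": ["Write a haiku about GPUs."] * 20})

def dummy_reward(completions, **kwargs):
    # Placeholder reward: longer completions score higher.
    return [float(len(c)) for c in completions]

config = GRPOConfig(
    output_dir="grpo-test",
    per_device_train_batch_size=1,   # per_device_batch_size in my question
    gradient_accumulation_steps=8,   # gradient_acc_steps
    num_generations=8,               # num_generation
    num_train_epochs=3,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # placeholder model id
    reward_funcs=dummy_reward,
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```

With 4 GPUs this gives number_of_gpus * per_device_train_batch_size * gradient_accumulation_steps = 32, which is divisible by num_generations = 8, so it satisfies the constraint I found; I just don't know whether satisfying that constraint alone is enough.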