Training problems #6
Thanks for your excellent work!
I'm running into some problems while trying to train:
GPUs: 2 x 80GB A800
train_qwen2p5_3b_stage1.sh:
#!/bin/bash
export PYTHONPATH=$PYTHONPATH:$(pwd)
NNODES=1
NODE_RANK=0
MASTER_ADDR=127.0.0.1 # or localhost
MASTER_PORT=12345
# MODIFY HERE: please prepare the env related variables
PR1_PATH="./"
CHECKPOINT_PATH="./outputs" # directory to save the checkpoint
RUN_NAME="qwen2p5_stage1" # describe what your experiment is about
# Default Setting
OUTPUT_DIR="${CHECKPOINT_PATH}/${RUN_NAME}" # path to save the output
SRC_PATH="${OUTPUT_DIR}/src" # path to backup the source code
export LOG_DIR="${OUTPUT_DIR}/logs" # path to save the log
export WANDB_PROJECT="LENS" # project name in wandb
export WANDB_TAGS="qwen2p5_stage1" # tags for the experiment in wandb
export WANDB_MODE=offline
if [ ! -d "${OUTPUT_DIR}"/src ]; then
mkdir -p ${OUTPUT_DIR}/src
fi
# backup the source code
cp -r ${PR1_PATH}/src ${SRC_PATH}
mkdir -p ${LOG_DIR}
# run the training
torchrun \
--nproc_per_node="2" \
--nnodes="${NNODES}" \
--node_rank="${NODE_RANK}" \
--master_addr="${MASTER_ADDR}" \
--master_port="${MASTER_PORT}" \
${PR1_PATH}/src/open_r1/grpo_vllm_sam_stage1.py \
--deepspeed ${PR1_PATH}/configs/zero3.json \
--output_dir "${OUTPUT_DIR}" \
--model_name_or_path ./pretrained/Qwen/Qwen2.5-VL-3B-Instruct \
--max_prompt_length 2048 \
--max_completion_length 768 \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 64 \
--num_generations 8 \
--logging_steps 1 \
--bf16 True \
--gradient_checkpointing true \
--attn_implementation flash_attention_2 \
--report_to wandb \
--max_pixels 1000000 \
--num_train_epochs 25 \
--run_name ${RUN_NAME} \
--save_steps 100 \
--reward_funcs "pr1_grounding" "pr1_grounding_format" \
--save_only_model true \
--system_prompt_template "default" \
--question_template "pr1_grounding" \
--train_sample_size 500000000000 \
--skip_special_tokens false \
--answer_template "default" \
--if_freeze_llm true \
--learning_rate 3e-5 \
--num_of_query 64 \
--warmup_steps 150 \
--lr_scheduler_type "cosine" \
--if_use_qwen_connector true \
--coord_norm_type "qwen2p5vl"
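For what it's worth, here is a quick sanity check of the batch settings above, assuming TRL-style GRPO semantics (where the per-device batch counts generated completions, not prompts — this is an assumption about how `grpo_vllm_sam_stage1.py` inherits from TRL):

```python
# Back-of-envelope check of the batch settings in the script above.
# Assumption: per-device batch counts completions, as in TRL's GRPOTrainer.
per_device_train_batch_size = 8
gradient_accumulation_steps = 64
num_gpus = 2
num_generations = 8

# Completions contributing to one optimizer step
global_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Distinct prompts per optimizer step
prompts_per_step = global_batch // num_generations

print(global_batch, prompts_per_step)  # 1024 128
```

Note that gradient accumulation does not reduce per-forward-pass memory: each pass still holds 8 sequences of up to 2048 prompt + 768 completion tokens plus vision tokens, so activation memory per device, not the global batch of 1024, is what drives the OOM.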
- Two 80GB cards still cannot train even with a smaller batch size. Does stage 1 require more GPU memory?
- Some of the terminal output concerns me:
You are using a model of type qwen2_5_vl to instantiate a model of type qwen2_vl. This is not supported for all configurations of models and can yield errors.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2_5_VisionTransformerPretrainedModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Are the model-loading issue and the flash_attn warnings normal?
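The last warning's own suggestion is to pass `torch_dtype` at load time so Flash-Attention 2 sees bf16 instead of the fp32 default. A minimal sketch of the loading kwargs (the model class name assumes transformers >= 4.49; the actual `from_pretrained` call is commented out since it needs the local checkpoint and a GPU):

```python
import torch

# Loading kwargs following the warning's suggestion: pass torch_dtype
# explicitly so Flash-Attention 2 initializes in bf16 rather than fp32.
load_kwargs = dict(
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Sketch (not executed here; requires the local checkpoint and a GPU):
# from transformers import Qwen2_5_VLForConditionalGeneration
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "./pretrained/Qwen/Qwen2.5-VL-3B-Instruct", **load_kwargs)

print(load_kwargs["torch_dtype"])  # torch.bfloat16
```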