Description
Why was there only one epoch run when training the translator? Is "max_epoch": 1 enough?
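For reference, the epoch count comes from the training config rather than the command line. Below is a minimal sketch of the relevant keys in pretrain_arxiv_stage2.yaml; the key names are copied from the "Running Parameters" block in the log, but the nesting under a top-level run section is an assumption (LAVIS-style configs do this), so check your copy of the file for the exact layout:

```yaml
run:
  task: arxiv_text_pretrain
  lr_sched: linear_warmup_cosine_lr
  init_lr: 1e-4
  min_lr: 1e-5
  warmup_lr: 1e-6
  warmup_steps: 5000
  max_epoch: 1        # raise this to train for more epochs
  batch_size_train: 1
  accum_grad_iters: 32
```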
(graphtranslator) user@k9:.../Translator/train $ python train.py --cfg-path ./pretrain_arxiv_stage2.yaml
Not using distributed mode
2024-09-09 20:10:07,222 [INFO]
===== Running Parameters =====
2024-09-09 20:10:07,222 [INFO] {
"accum_grad_iters": 32,
"amp": true,
"batch_size_eval": 64,
"batch_size_train": 1,
"device": "cuda:0",
"dist_url": "env://",
"distributed": false,
"evaluate": false,
"init_lr": 0.0001,
"log_freq": 50,
"lr_sched": "linear_warmup_cosine_lr",
"max_epoch": 1,
"min_lr": 1e-05,
"output_dir": "../model_output/pretrain_arxiv_stage2",
"resume_ckpt_path": null,
"seed": 42,
"task": "arxiv_text_pretrain",
"train_splits": [
"train"
],
"warmup_lr": 1e-06,
"warmup_steps": 5000,
"weight_decay": 0.05
}
2024-09-09 20:10:07,222 [INFO]
====== Dataset Attributes ======
2024-09-09 20:10:07,223 [INFO]
======== arxiv_caption =======
2024-09-09 20:10:07,223 [INFO] {
"arxiv_processor": {
"train": {
"max_length": 1024,
"name": "translator_arxiv_train",
"vocab_size": 100000
}
},
"datasets_dir": "../../data/arxiv/summary_embeddings.csv",
"text_processor": {
"train": {
"name": "translator_caption"
}
},
"type": "translator_train_stage2"
}
2024-09-09 20:10:07,223 [INFO]
====== Model Attributes ======
2024-09-09 20:10:07,223 [INFO] {
"arch": "translator_arxiv_chatglm",
"behavior_length": 768,
"behavior_precision": "fp16",
"bert_dir": "../models/bert-base-uncased",
"freeze_behavior": true,
"llm_dir": "../models/chatglm2-6b",
"load_finetuned": false,
"max_txt_len": 1024,
"model_type": "pretrain_arxiv",
"num_query_token": 32,
"pretrained": "../model_output/pretrain_arxiv_stage1/checkpoint_0.pth"
}
2024-09-09 20:10:07,223 [INFO] Building datasets...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:05<00:00, 1.26it/s]
2024-09-09 20:10:16,751 [INFO] load checkpoint from ../model_output/pretrain_arxiv_stage1/checkpoint_0.pth
2024-09-09 20:10:16,752 [INFO] Start training
2024-09-09 20:10:22,868 [INFO] number of trainable parameters: 182936320
2024-09-09 20:10:23,003 [INFO] Start training epoch 0, 100 iters per inner epoch.
Time 2024-09-09 20:10:23.003839 Train: data epoch: [0] [ 0/100] eta: 0:03:50 lr: 0.00000100 loss: 2.84765625 time: 2.3050 data: 0.0253 max mem: 18714
Time 2024-09-09 20:10:23.003839 Train: data epoch: [0] [ 50/100] eta: 0:00:15 lr: 0.00000199 loss: 3.83593750 time: 0.2861 data: 0.0001 max mem: 20609
Time 2024-09-09 20:10:23.003839 Train: data epoch: [0] [ 99/100] eta: 0:00:00 lr: 0.00000296 loss: 3.27148438 time: 0.3000 data: 0.0001 max mem: 22493
Time 2024-09-09 20:10:23.003839 Train: data epoch: [0] Total time: 0:00:29 (0.2988 s / it)
2024-09-09 20:10:52,881 [INFO] Averaged stats: lr: 0.00000198 loss: 3.53283203
2024-09-09 20:10:52,883 [INFO] No validation splits found.
2024-09-09 20:10:52,890 [INFO] Saving checkpoint at epoch 0 to ../model_output/pretrain_arxiv_stage2/checkpoint_0.pth.
2024-09-09 20:10:55,732 [INFO] No validation splits found.
2024-09-09 20:10:55,732 [INFO] Training time 0:00:38
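One thing the log makes concrete: with only 100 iterations per epoch but warmup_steps: 5000, a single epoch ends long before warmup does, so the learning rate never comes close to init_lr and the cosine part of the schedule is never reached. A small sketch, assuming the warmup is plain linear interpolation from warmup_lr to init_lr (an assumption, but it reproduces the lr values printed above exactly):

```python
# Values copied from the "Running Parameters" block in the log.
warmup_lr, init_lr, warmup_steps = 1e-6, 1e-4, 5000

def warmup(step: int) -> float:
    """Learning rate at a given step during linear warmup."""
    return warmup_lr + (init_lr - warmup_lr) * step / warmup_steps

for step in (0, 50, 99):
    print(f"step {step:3d}: lr = {warmup(step):.8f}")

# step   0: lr = 0.00000100  <- matches the logged lr at iter 0
# step  50: lr = 0.00000199  <- matches iter 50
# step  99: lr = 0.00000296  <- matches iter 99
```

So whether "max_epoch": 1 is enough is hard to judge from this run alone: the epoch performs only 100 iterations (at batch_size_train: 1 with accum_grad_iters: 32), and training stops while the schedule is still deep in its warmup phase.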