Only 1 epoch is conducted when training the translator #17

@zmce2018

Description

Why was there only 1 epoch run when training the translator? Is "max_epoch": 1 enough?

(graphtranslator) user@k9:.../Translator/train $ python train.py --cfg-path ./pretrain_arxiv_stage2.yaml
Not using distributed mode
2024-09-09 20:10:07,222 [INFO]
===== Running Parameters =====
2024-09-09 20:10:07,222 [INFO] {
    "accum_grad_iters": 32,
    "amp": true,
    "batch_size_eval": 64,
    "batch_size_train": 1,
    "device": "cuda:0",
    "dist_url": "env://",
    "distributed": false,
    "evaluate": false,
    "init_lr": 0.0001,
    "log_freq": 50,
    "lr_sched": "linear_warmup_cosine_lr",
    "max_epoch": 1,
    "min_lr": 1e-05,
    "output_dir": "../model_output/pretrain_arxiv_stage2",
    "resume_ckpt_path": null,
    "seed": 42,
    "task": "arxiv_text_pretrain",
    "train_splits": [
        "train"
    ],
    "warmup_lr": 1e-06,
    "warmup_steps": 5000,
    "weight_decay": 0.05
}
2024-09-09 20:10:07,222 [INFO]
====== Dataset Attributes ======
2024-09-09 20:10:07,223 [INFO]
======== arxiv_caption =======
2024-09-09 20:10:07,223 [INFO] {
    "arxiv_processor": {
        "train": {
            "max_length": 1024,
            "name": "translator_arxiv_train",
            "vocab_size": 100000
        }
    },
    "datasets_dir": "../../data/arxiv/summary_embeddings.csv",
    "text_processor": {
        "train": {
            "name": "translator_caption"
        }
    },
    "type": "translator_train_stage2"
}
2024-09-09 20:10:07,223 [INFO]
====== Model Attributes ======
2024-09-09 20:10:07,223 [INFO] {
    "arch": "translator_arxiv_chatglm",
    "behavior_length": 768,
    "behavior_precision": "fp16",
    "bert_dir": "../models/bert-base-uncased",
    "freeze_behavior": true,
    "llm_dir": "../models/chatglm2-6b",
    "load_finetuned": false,
    "max_txt_len": 1024,
    "model_type": "pretrain_arxiv",
    "num_query_token": 32,
    "pretrained": "../model_output/pretrain_arxiv_stage1/checkpoint_0.pth"
}
2024-09-09 20:10:07,223 [INFO] Building datasets...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:05<00:00, 1.26it/s]
2024-09-09 20:10:16,751 [INFO] load checkpoint from ../model_output/pretrain_arxiv_stage1/checkpoint_0.pth
2024-09-09 20:10:16,752 [INFO] Start training
2024-09-09 20:10:22,868 [INFO] number of trainable parameters: 182936320
2024-09-09 20:10:23,003 [INFO] Start training epoch 0, 100 iters per inner epoch.
Time 2024-09-09 20:10:23.003839 Train: data epoch: [0] [ 0/100] eta: 0:03:50 lr: 0.00000100 loss: 2.84765625 time: 2.3050 data: 0.0253 max mem: 18714
Time 2024-09-09 20:10:23.003839 Train: data epoch: [0] [ 50/100] eta: 0:00:15 lr: 0.00000199 loss: 3.83593750 time: 0.2861 data: 0.0001 max mem: 20609
Time 2024-09-09 20:10:23.003839 Train: data epoch: [0] [ 99/100] eta: 0:00:00 lr: 0.00000296 loss: 3.27148438 time: 0.3000 data: 0.0001 max mem: 22493
Time 2024-09-09 20:10:23.003839 Train: data epoch: [0] Total time: 0:00:29 (0.2988 s / it)
2024-09-09 20:10:52,881 [INFO] Averaged stats: lr: 0.00000198 loss: 3.53283203
2024-09-09 20:10:52,883 [INFO] No validation splits found.
2024-09-09 20:10:52,890 [INFO] Saving checkpoint at epoch 0 to ../model_output/pretrain_arxiv_stage2/checkpoint_0.pth.
2024-09-09 20:10:55,732 [INFO] No validation splits found.
2024-09-09 20:10:55,732 [INFO] Training time 0:00:38
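For what it's worth, the logged learning rates suggest the run never gets past warmup: with "warmup_steps": 5000 but only 100 iterations in total, the lr climbs from 1e-6 to about 3e-6 and never approaches "init_lr": 0.0001. Here is a quick sanity check, assuming the warmup is the plain linear ramp that the parameter names suggest (the actual scheduler internals are a guess on my part):

```python
# Rough check of the warmup schedule implied by the logged parameters.
# Assumption: linear ramp from warmup_lr to init_lr over warmup_steps,
# which is what the lr values printed in the log appear to follow.
warmup_lr, init_lr, warmup_steps = 1e-6, 1e-4, 5000

def warmup(step: int) -> float:
    """Learning rate at a given optimizer step during linear warmup."""
    return warmup_lr + (init_lr - warmup_lr) * min(step, warmup_steps) / warmup_steps

for step in (0, 50, 99, 5000):
    print(f"step {step:>4}: lr = {warmup(step):.8f}")
# step    0: lr = 0.00000100  <- matches the log at iter 0
# step   50: lr = 0.00000199  <- matches the log at iter 50
# step   99: lr = 0.00000296  <- matches the log at iter 99
# step 5000: lr = 0.00010000  <- never reached; the run stops after 100 iters
```

If more training is intended, I assume raising "max_epoch" in pretrain_arxiv_stage2.yaml (and perhaps lowering "warmup_steps") is the way to let the linear_warmup_cosine_lr schedule actually take effect, but please correct me if 1 epoch is the intended setting.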
