The training loss function decreases sharply #414

@czy1998916

Description

Hello, author. I have recently been using DOPE and ran into a problem. When I trained my model on the following dataset, the loss dropped sharply to near zero within the first epoch, which is clearly incorrect. The dataset is as follows:

29999.json
(attached image)

```
(NVISII) rx@rx:~/Deep_Object_Pose/train2$ python -m torch.distributed.run --nproc_per_node=1 train.py --local_rank=0 --data /home/rx/Deep_Object_Pose/data_generation/nvisii_data_gen/output/output_example
[INFO] 2025-07-17 09:48:44,033 run: Running torch.distributed.run with args: ['/home/rx/anaconda3/envs/NVISII/lib/python3.9/site-packages/torch/distributed/run.py', '--nproc_per_node=1', 'train.py', '--local_rank=0', '--data', '/home/rx/Deep_Object_Pose/data_generation/nvisii_data_gen/output/output_example']
[INFO] 2025-07-17 09:48:44,034 run: Using nproc_per_node=1.
[INFO] 2025-07-17 09:48:44,034 api: Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}

[INFO] 2025-07-17 09:48:44,035 local_elastic_agent: log directory set to: /tmp/torchelastic_l0snmhxn/none_q_airym8
[INFO] 2025-07-17 09:48:44,035 api: [default] starting workers for entrypoint: python
[INFO] 2025-07-17 09:48:44,035 api: [default] Rendezvous'ing worker group
[INFO] 2025-07-17 09:48:44,035 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
/home/rx/anaconda3/envs/NVISII/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
[INFO] 2025-07-17 09:48:44,036 api: [default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]

[INFO] 2025-07-17 09:48:44,036 api: [default] Starting worker group
[INFO] 2025-07-17 09:48:44,036 init: Setting worker0 reply file to: /tmp/torchelastic_l0snmhxn/none_q_airym8/attempt_0/0/error.json
start: 09:48:44.826437
load data: ['/home/rx/Deep_Object_Pose/data_generation/nvisii_data_gen/output/output_example']
load data:
training data: 938 batches
load models
ready to train!
NaN or Inf found in input tensor.
Train Epoch: 1 [0/30000 (0%)] Loss: 0.028377978131175
Train Epoch: 1 [3200/30000 (11%)] Loss: 0.000086085390649
Train Epoch: 1 [6400/30000 (21%)] Loss: 0.000004931926014
```
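
A quick way to check whether the generated annotations themselves contain NaN/Inf values (one possible cause of the "NaN or Inf found in input tensor." warning above) is a small scan like the one below. This is only a sketch, not part of DOPE: it assumes the usual nvisii_data_gen layout of one JSON file per frame containing an "objects" list whose entries carry a "projected_cuboid" array of [x, y] points.

```python
# Minimal sketch: scan generated annotation files for NaN/Inf keypoints.
# Assumes one <frame>.json per image with an "objects" list whose entries
# have a "projected_cuboid" array of [x, y] points; adjust DATA_DIR and the
# keys if your output layout differs.
import glob
import json
import math
import os

DATA_DIR = "/home/rx/Deep_Object_Pose/data_generation/nvisii_data_gen/output/output_example"

def has_bad_point(points):
    # True if any coordinate is NaN or +/-Inf.
    return any(not math.isfinite(float(v)) for p in points for v in p)

bad = []
for path in sorted(glob.glob(os.path.join(DATA_DIR, "**", "*.json"), recursive=True)):
    with open(path) as f:
        ann = json.load(f)  # Python's json parser accepts NaN/Infinity literals
    if not isinstance(ann, dict):
        continue  # skip any non-annotation JSON files
    for obj in ann.get("objects", []):
        if has_bad_point(obj.get("projected_cuboid", [])):
            bad.append(path)
            break

print(f"{len(bad)} annotation file(s) with NaN/Inf keypoints")
for p in bad[:20]:
    print("  ", p)
```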
