Description
The training loss function drops sharply
Dear author, hello. I have recently been using DOPE and ran into a problem. When I trained my model on the dataset below, the loss dropped sharply to near zero almost immediately, which is clearly incorrect. The training command and log are as follows:
```
(NVISII) rx@rx:~/Deep_Object_Pose/train2$ python -m torch.distributed.run --nproc_per_node=1 train.py --local_rank=0 --data /home/rx/Deep_Object_Pose/data_generation/nvisii_data_gen/output/output_example
[INFO] 2025-07-17 09:48:44,033 run: Running torch.distributed.run with args: ['/home/rx/anaconda3/envs/NVISII/lib/python3.9/site-packages/torch/distributed/run.py', '--nproc_per_node=1', 'train.py', '--local_rank=0', '--data', '/home/rx/Deep_Object_Pose/data_generation/nvisii_data_gen/output/output_example']
[INFO] 2025-07-17 09:48:44,034 run: Using nproc_per_node=1.
[INFO] 2025-07-17 09:48:44,034 api: Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}
[INFO] 2025-07-17 09:48:44,035 local_elastic_agent: log directory set to: /tmp/torchelastic_l0snmhxn/none_q_airym8
[INFO] 2025-07-17 09:48:44,035 api: [default] starting workers for entrypoint: python
[INFO] 2025-07-17 09:48:44,035 api: [default] Rendezvous'ing worker group
[INFO] 2025-07-17 09:48:44,035 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
/home/rx/anaconda3/envs/NVISII/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
[INFO] 2025-07-17 09:48:44,036 api: [default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]
[INFO] 2025-07-17 09:48:44,036 api: [default] Starting worker group
[INFO] 2025-07-17 09:48:44,036 init: Setting worker0 reply file to: /tmp/torchelastic_l0snmhxn/none_q_airym8/attempt_0/0/error.json
start: 09:48:44.826437
load data: ['/home/rx/Deep_Object_Pose/data_generation/nvisii_data_gen/output/output_example']
load data:
training data: 938 batches
load models
ready to train!
NaN or Inf found in input tensor.
Train Epoch: 1 [0/30000 (0%)]	Loss: 0.028377978131175
Train Epoch: 1 [3200/30000 (11%)]	Loss: 0.000086085390649
Train Epoch: 1 [6400/30000 (21%)]	Loss: 0.000004931926014
```
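The `NaN or Inf found in input tensor.` warning suggests some samples (images or target belief maps) contain invalid values; if most targets end up as all zeros, the loss can collapse toward zero without the network learning anything. A minimal sketch for scanning batches for non-finite values before training (assuming PyTorch; `find_invalid` is a hypothetical helper, not part of DOPE):

```python
import torch

def find_invalid(tensor_batch):
    """Return indices of samples in a batch that contain NaN or Inf values."""
    bad = []
    for i, t in enumerate(tensor_batch):
        # torch.isfinite is False for both NaN and +/-Inf
        if not torch.isfinite(t).all():
            bad.append(i)
    return bad

# Synthetic example: sample 2 deliberately contains a NaN
batch = torch.zeros(4, 3, 8, 8)
batch[2, 0, 0, 0] = float("nan")
print(find_invalid(batch))  # → [2]
```

Running a check like this over the dataloader output (images and belief/affinity targets) can confirm whether the warning comes from the generated data or from the training pipeline itself.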
