Description
The training loss function drops sharply
Dear author, hello. I have recently been using DOPE and ran into a problem. When I trained my model on the dataset below, the loss dropped sharply to near zero almost immediately, which is clearly incorrect. The training command and log are as follows:
```
(NVISII) rx@rx:~/Deep_Object_Pose/train2$ python -m torch.distributed.run --nproc_per_node=1 train.py --local_rank=0 --data /home/rx/Deep_Object_Pose/data_generation/nvisii_data_gen/output/output_example
[INFO] 2025-07-17 09:48:44,033 run: Running torch.distributed.run with args: ['/home/rx/anaconda3/envs/NVISII/lib/python3.9/site-packages/torch/distributed/run.py', '--nproc_per_node=1', 'train.py', '--local_rank=0', '--data', '/home/rx/Deep_Object_Pose/data_generation/nvisii_data_gen/output/output_example']
[INFO] 2025-07-17 09:48:44,034 run: Using nproc_per_node=1.
[INFO] 2025-07-17 09:48:44,034 api: Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 1
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}
[INFO] 2025-07-17 09:48:44,035 local_elastic_agent: log directory set to: /tmp/torchelastic_l0snmhxn/none_q_airym8
[INFO] 2025-07-17 09:48:44,035 api: [default] starting workers for entrypoint: python
[INFO] 2025-07-17 09:48:44,035 api: [default] Rendezvous'ing worker group
[INFO] 2025-07-17 09:48:44,035 static_tcp_rendezvous: Creating TCPStore as the c10d::Store implementation
/home/rx/anaconda3/envs/NVISII/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
[INFO] 2025-07-17 09:48:44,036 api: [default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[1]
  global_world_sizes=[1]
[INFO] 2025-07-17 09:48:44,036 api: [default] Starting worker group
[INFO] 2025-07-17 09:48:44,036 init: Setting worker0 reply file to: /tmp/torchelastic_l0snmhxn/none_q_airym8/attempt_0/0/error.json
start: 09:48:44.826437
load data: ['/home/rx/Deep_Object_Pose/data_generation/nvisii_data_gen/output/output_example']
load data:
training data: 938 batches
load models
ready to train!
NaN or Inf found in input tensor.
Train Epoch: 1 [0/30000 (0%)]	Loss: 0.028377978131175
Train Epoch: 1 [3200/30000 (11%)]	Loss: 0.000086085390649
Train Epoch: 1 [6400/30000 (21%)]	Loss: 0.000004931926014
```
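The `NaN or Inf found in input tensor.` warning suggests some samples (images or target belief maps) contain invalid values; if most targets end up as all zeros, the loss can collapse toward zero without the network learning anything. A minimal sketch for scanning batches for non-finite values before training (assuming PyTorch; `find_invalid` is a hypothetical helper, not part of DOPE):

```python
import torch

def find_invalid(tensor_batch):
    """Return indices of samples in a batch that contain NaN or Inf values."""
    bad = []
    for i, t in enumerate(tensor_batch):
        # torch.isfinite is False for both NaN and +/-Inf
        if not torch.isfinite(t).all():
            bad.append(i)
    return bad

# Synthetic example: sample 2 deliberately contains a NaN
batch = torch.zeros(4, 3, 8, 8)
batch[2, 0, 0, 0] = float("nan")
print(find_invalid(batch))  # → [2]
```

Running a check like this over the dataloader output (images and belief/affinity targets) can confirm whether the warning comes from the generated data or from the training pipeline itself.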
