Description
Thank you for providing such a well-organized and comprehensive Transformer tutorial.
As a beginner, I’ve learned a lot from this repository.
When I was building the positional encoding block, I mistakenly implemented it as:
pe[:, 0::2] = torch.sin(position / div_term)
pe[:, 1::2] = torch.cos(position / div_term)
That is, it multiplies the position by the denominator instead of dividing by it, as in the intended form:
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
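For concreteness, here is a minimal standalone sketch of the two variants side by side (the d_model and max_len values are just placeholders, and div_term follows the usual definition 1 / 10000^(2i / d_model), so dividing by it multiplies the position by that denominator):

import math
import torch

d_model, max_len = 512, 5000

# position: (max_len, 1); div_term: (d_model/2,), equal to 1 / 10000^(2i / d_model)
position = torch.arange(0, max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))

# Intended encoding: divides position by the denominator (by multiplying with div_term)
pe_intended = torch.zeros(max_len, d_model)
pe_intended[:, 0::2] = torch.sin(position * div_term)
pe_intended[:, 1::2] = torch.cos(position * div_term)

# My mistaken encoding: multiplies position by the denominator instead
pe_mistaken = torch.zeros(max_len, d_model)
pe_mistaken[:, 0::2] = torch.sin(position / div_term)
pe_mistaken[:, 1::2] = torch.cos(position / div_term)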
However, in the first example where the model is trained to repeat the input words as the output, this incorrect implementation seems to converge much faster and nearly reaches zero loss.
I’m a bit confused—is it possible that this incorrect implementation actually performs better than the intended version?
Mistaken position / div_term implementation outputs:
Epoch Step: 1 | Accumulation Step: 2 | Loss: 3.10 | Tokens / Sec: 1460.0 | Learning Rate: 5.5e-06
tensor([[0, 7, 7, 9, 7, 6, 8, 8, 8, 2]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 2.08 | Tokens / Sec: 1637.4 | Learning Rate: 6.1e-05
tensor([[0, 7, 2, 8, 5, 6, 8, 7, 3, 5]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 1.59 | Tokens / Sec: 1610.8 | Learning Rate: 1.2e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.50 | Tokens / Sec: 1661.7 | Learning Rate: 1.7e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.01 | Tokens / Sec: 1691.5 | Learning Rate: 2.3e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.00 | Tokens / Sec: 1654.0 | Learning Rate: 2.8e-04
...
Intended position * div_term implementation outputs:
Epoch Step: 1 | Accumulation Step: 2 | Loss: 3.07 | Tokens / Sec: 1499.9 | Learning Rate: 5.5e-06
tensor([[0, 3, 6, 2, 2, 6, 3, 3, 4, 2]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 2.07 | Tokens / Sec: 1679.4 | Learning Rate: 6.1e-05
tensor([[0, 3, 2, 6, 5, 4, 8, 7, 6, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 1.76 | Tokens / Sec: 1664.8 | Learning Rate: 1.2e-04
tensor([[0, 3, 2, 6, 5, 4, 7, 9, 8, 3]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 1.45 | Tokens / Sec: 1662.9 | Learning Rate: 1.7e-04
tensor([[0, 2, 3, 6, 5, 4, 7, 8, 9, 7]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.93 | Tokens / Sec: 1643.7 | Learning Rate: 2.3e-04
tensor([[0, 2, 3, 5, 4, 6, 5, 9, 7, 8]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.55 | Tokens / Sec: 1684.9 | Learning Rate: 2.8e-04
tensor([[0, 2, 3, 4, 5, 4, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.32 | Tokens / Sec: 1656.6 | Learning Rate: 3.4e-04
tensor([[0, 2, 3, 4, 5, 4, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.24 | Tokens / Sec: 1673.1 | Learning Rate: 3.9e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.13 | Tokens / Sec: 1646.1 | Learning Rate: 4.5e-04
tensor([[0, 2, 3, 4, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.17 | Tokens / Sec: 1684.4 | Learning Rate: 5.0e-04
tensor([[0, 2, 3, 4, 5, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.10 | Tokens / Sec: 1655.1 | Learning Rate: 5.6e-04
tensor([[0, 1, 2, 3, 4, 4, 5, 6, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.11 | Tokens / Sec: 1645.2 | Learning Rate: 6.1e-04
tensor([[0, 2, 3, 2, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.08 | Tokens / Sec: 1682.0 | Learning Rate: 6.7e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.12 | Tokens / Sec: 1666.5 | Learning Rate: 7.2e-04
tensor([[0, 2, 1, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.04 | Tokens / Sec: 1622.6 | Learning Rate: 7.8e-04
tensor([[0, 2, 3, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.17 | Tokens / Sec: 1672.7 | Learning Rate: 8.3e-04
...