Description
Thank you for providing such a well-organized and comprehensive Transformer tutorial.
As a beginner, I’ve learned a lot from this repository.
When I was building the positional encoding block, I mistakenly implemented it as:
pe[:, 0::2] = torch.sin(position / div_term)
pe[:, 1::2] = torch.cos(position / div_term)
That is, it multiplies the position by the denominator instead of dividing by it, as in the intended form:
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
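For concreteness, here is a minimal standalone sketch of the two variants side by side (the d_model and max_len values are just placeholders, and div_term follows the usual definition 1 / 10000^(2i / d_model), so dividing by it multiplies the position by that denominator):

import math
import torch

d_model, max_len = 512, 5000

# position: (max_len, 1); div_term: (d_model/2,), equal to 1 / 10000^(2i / d_model)
position = torch.arange(0, max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))

# Intended encoding: divides position by the denominator (by multiplying with div_term)
pe_intended = torch.zeros(max_len, d_model)
pe_intended[:, 0::2] = torch.sin(position * div_term)
pe_intended[:, 1::2] = torch.cos(position * div_term)

# My mistaken encoding: multiplies position by the denominator instead
pe_mistaken = torch.zeros(max_len, d_model)
pe_mistaken[:, 0::2] = torch.sin(position / div_term)
pe_mistaken[:, 1::2] = torch.cos(position / div_term)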
However, in the first example where the model is trained to repeat the input words as the output, this incorrect implementation seems to converge much faster and nearly reaches zero loss.
I’m a bit confused—is it possible that this incorrect implementation actually performs better than the intended version?
Mistaken position / div_term implementation outputs:
Epoch Step: 1 | Accumulation Step: 2 | Loss: 3.10 | Tokens / Sec: 1460.0 | Learning Rate: 5.5e-06
tensor([[0, 7, 7, 9, 7, 6, 8, 8, 8, 2]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 2.08 | Tokens / Sec: 1637.4 | Learning Rate: 6.1e-05
tensor([[0, 7, 2, 8, 5, 6, 8, 7, 3, 5]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 1.59 | Tokens / Sec: 1610.8 | Learning Rate: 1.2e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.50 | Tokens / Sec: 1661.7 | Learning Rate: 1.7e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.01 | Tokens / Sec: 1691.5 | Learning Rate: 2.3e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.00 | Tokens / Sec: 1654.0 | Learning Rate: 2.8e-04
...
Intended position * div_term implementation outputs:
Epoch Step: 1 | Accumulation Step: 2 | Loss: 3.07 | Tokens / Sec: 1499.9 | Learning Rate: 5.5e-06
tensor([[0, 3, 6, 2, 2, 6, 3, 3, 4, 2]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 2.07 | Tokens / Sec: 1679.4 | Learning Rate: 6.1e-05
tensor([[0, 3, 2, 6, 5, 4, 8, 7, 6, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 1.76 | Tokens / Sec: 1664.8 | Learning Rate: 1.2e-04
tensor([[0, 3, 2, 6, 5, 4, 7, 9, 8, 3]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 1.45 | Tokens / Sec: 1662.9 | Learning Rate: 1.7e-04
tensor([[0, 2, 3, 6, 5, 4, 7, 8, 9, 7]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.93 | Tokens / Sec: 1643.7 | Learning Rate: 2.3e-04
tensor([[0, 2, 3, 5, 4, 6, 5, 9, 7, 8]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.55 | Tokens / Sec: 1684.9 | Learning Rate: 2.8e-04
tensor([[0, 2, 3, 4, 5, 4, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.32 | Tokens / Sec: 1656.6 | Learning Rate: 3.4e-04
tensor([[0, 2, 3, 4, 5, 4, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.24 | Tokens / Sec: 1673.1 | Learning Rate: 3.9e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.13 | Tokens / Sec: 1646.1 | Learning Rate: 4.5e-04
tensor([[0, 2, 3, 4, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.17 | Tokens / Sec: 1684.4 | Learning Rate: 5.0e-04
tensor([[0, 2, 3, 4, 5, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.10 | Tokens / Sec: 1655.1 | Learning Rate: 5.6e-04
tensor([[0, 1, 2, 3, 4, 4, 5, 6, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.11 | Tokens / Sec: 1645.2 | Learning Rate: 6.1e-04
tensor([[0, 2, 3, 2, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.08 | Tokens / Sec: 1682.0 | Learning Rate: 6.7e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.12 | Tokens / Sec: 1666.5 | Learning Rate: 7.2e-04
tensor([[0, 2, 1, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.04 | Tokens / Sec: 1622.6 | Learning Rate: 7.8e-04
tensor([[0, 2, 3, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step: 1 | Accumulation Step: 2 | Loss: 0.17 | Tokens / Sec: 1672.7 | Learning Rate: 8.3e-04
...