
Multiplication positional encoding seems to work better than the original division one? #131

@Mightlaus

Thank you for providing such a well-organized and comprehensive Transformer tutorial.
As a beginner, I’ve learned a lot from this repository☺️!

When I was building the positional encoding block, I mistakenly implemented it as:

pe[:, 0::2] = torch.sin(position / div_term)
pe[:, 1::2] = torch.cos(position / div_term)

that is, effectively multiplying the position by the denominator 10000^(2i/d_model) (since div_term stores its reciprocal), instead of the intended division form:

pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
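
For reference, here is a minimal self-contained sketch of how these lines fit into the positional encoding block, assuming the tutorial's usual exp/log definition of div_term (d_model and max_len below are just illustrative values):

```python
import math
import torch

d_model, max_len = 512, 5000

position = torch.arange(0, max_len).unsqueeze(1)  # shape (max_len, 1)

# div_term[i] = 10000^(-2i / d_model), i.e. the *reciprocal* of the
# denominator in the paper's formula sin(pos / 10000^(2i / d_model))
div_term = torch.exp(
    torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
)

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)  # intended form
pe[:, 1::2] = torch.cos(position * div_term)
```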

However, in the first example, where the model is trained to copy the input symbols to the output, this incorrect implementation seems to converge much faster and nearly reaches zero loss.

I’m a bit confused—is it possible that this incorrect implementation actually performs better than the intended version?
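
To make the question concrete: since div_term[i] = 10000^(-2i/d_model), dividing by it instead of multiplying flips each dimension's angular frequency from 10000^(-2i/d_model) to 10000^(+2i/d_model). A small sketch of that difference, purely illustrative and reusing the div_term definition above (not meant to explain the convergence gap):

```python
import math
import torch

d_model = 512
div_term = torch.exp(
    torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
)

freq_intended = div_term        # position * div_term: frequencies 1.0 down to ~1e-4
freq_mistaken = 1.0 / div_term  # position / div_term: frequencies 1.0 up to ~1e4

print(freq_intended[0].item(), freq_intended[-1].item())  # 1.0  ~1.04e-4
print(freq_mistaken[0].item(), freq_mistaken[-1].item())  # 1.0  ~9.6e3
```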

position / div_term (mistaken) implementation outputs:
Epoch Step:      1 | Accumulation Step:   2 | Loss:   3.10 | Tokens / Sec:  1460.0 | Learning Rate: 5.5e-06
tensor([[0, 7, 7, 9, 7, 6, 8, 8, 8, 2]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   2.08 | Tokens / Sec:  1637.4 | Learning Rate: 6.1e-05
tensor([[0, 7, 2, 8, 5, 6, 8, 7, 3, 5]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.59 | Tokens / Sec:  1610.8 | Learning Rate: 1.2e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.50 | Tokens / Sec:  1661.7 | Learning Rate: 1.7e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.01 | Tokens / Sec:  1691.5 | Learning Rate: 2.3e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.00 | Tokens / Sec:  1654.0 | Learning Rate: 2.8e-04
...
position * div_term (intended) implementation outputs:
Epoch Step:      1 | Accumulation Step:   2 | Loss:   3.07 | Tokens / Sec:  1499.9 | Learning Rate: 5.5e-06
tensor([[0, 3, 6, 2, 2, 6, 3, 3, 4, 2]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   2.07 | Tokens / Sec:  1679.4 | Learning Rate: 6.1e-05
tensor([[0, 3, 2, 6, 5, 4, 8, 7, 6, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.76 | Tokens / Sec:  1664.8 | Learning Rate: 1.2e-04
tensor([[0, 3, 2, 6, 5, 4, 7, 9, 8, 3]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.45 | Tokens / Sec:  1662.9 | Learning Rate: 1.7e-04
tensor([[0, 2, 3, 6, 5, 4, 7, 8, 9, 7]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.93 | Tokens / Sec:  1643.7 | Learning Rate: 2.3e-04
tensor([[0, 2, 3, 5, 4, 6, 5, 9, 7, 8]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.55 | Tokens / Sec:  1684.9 | Learning Rate: 2.8e-04
tensor([[0, 2, 3, 4, 5, 4, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.32 | Tokens / Sec:  1656.6 | Learning Rate: 3.4e-04
tensor([[0, 2, 3, 4, 5, 4, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.24 | Tokens / Sec:  1673.1 | Learning Rate: 3.9e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.13 | Tokens / Sec:  1646.1 | Learning Rate: 4.5e-04
tensor([[0, 2, 3, 4, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.17 | Tokens / Sec:  1684.4 | Learning Rate: 5.0e-04
tensor([[0, 2, 3, 4, 5, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.10 | Tokens / Sec:  1655.1 | Learning Rate: 5.6e-04
tensor([[0, 1, 2, 3, 4, 4, 5, 6, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.11 | Tokens / Sec:  1645.2 | Learning Rate: 6.1e-04
tensor([[0, 2, 3, 2, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.08 | Tokens / Sec:  1682.0 | Learning Rate: 6.7e-04
tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.12 | Tokens / Sec:  1666.5 | Learning Rate: 7.2e-04
tensor([[0, 2, 1, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.04 | Tokens / Sec:  1622.6 | Learning Rate: 7.8e-04
tensor([[0, 2, 3, 3, 4, 5, 6, 7, 8, 9]])
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.17 | Tokens / Sec:  1672.7 | Learning Rate: 8.3e-04
...
