How was the depth of the 4 Global Attention layers determined? #48

@FallowInTheHill

Description

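Hello. Your work notes that Global Attention is under-activated compared with Window Attention, and that, based on experiments, the 12 Global Attention layers were therefore reduced to 4. How were the depths of these four layers chosen? Between applying Global Attention only in the shallow layers and applying it only in the deep layers, which do you consider more reasonable?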
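
To make the two placements in the question concrete, here is a minimal sketch of how they could be written as layer-index sets. The function name `global_layer_indices`, the `placement` argument, and the 0-based indexing are illustrative assumptions and are not taken from this repository; the defaults of 12 layers and 4 Global Attention layers follow the numbers mentioned above.

```python
from typing import List


def global_layer_indices(
    num_layers: int = 12,
    num_global: int = 4,
    placement: str = "shallow",
) -> List[int]:
    """Return the 0-based indices of layers that would use Global Attention.

    Hypothetical helper for illustration only; every other layer is
    assumed to use Window Attention.
    """
    if not 0 <= num_global <= num_layers:
        raise ValueError("num_global must be between 0 and num_layers")
    if placement == "shallow":
        # All Global Attention layers at the start of the network.
        return list(range(num_global))
    if placement == "deep":
        # All Global Attention layers at the end of the network.
        return list(range(num_layers - num_global, num_layers))
    raise ValueError(f"unknown placement: {placement!r}")


shallow = global_layer_indices(placement="shallow")  # [0, 1, 2, 3]
deep = global_layer_indices(placement="deep")        # [8, 9, 10, 11]
```

The question is which of these two index sets (or some other choice) matches the configuration used in the experiments.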
