
Gradient caching vs Model dropout #12

@harveyp123

Description

GC-DPR has two steps:

  1. The first step runs a full-batch forward pass without gradient, to compute the full-batch contrastive loss and the gradient of that loss with respect to each embedding.
  2. The second step runs mini-batch forward passes, assigns the cached embedding gradients, and then performs backward. The mini-batches loop over the full batch so that all gradients are computed and accumulated (a sketch follows this list).
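A minimal sketch of this two-step procedure, assuming a PyTorch encoder and an in-batch contrastive loss; names such as `encoder`, `contrastive_loss`, and `sub_batch_size` are illustrative and not the actual GC-DPR API:

```python
import torch

def gc_step(encoder, contrastive_loss, batch, sub_batch_size, optimizer):
    chunks = torch.split(batch, sub_batch_size)

    # Step 1: full-batch forward without gradient to obtain all embeddings.
    with torch.no_grad():
        reps = torch.cat([encoder(c) for c in chunks])

    # Compute the full-batch contrastive loss and cache the gradient of the
    # loss with respect to each embedding.
    reps = reps.detach().requires_grad_()
    loss = contrastive_loss(reps)
    loss.backward()
    rep_grads = torch.split(reps.grad, sub_batch_size)

    # Step 2: re-encode each mini-batch with gradient enabled and backpropagate
    # the cached embedding gradient through the encoder, accumulating parameter
    # gradients across mini-batches.
    for chunk, g in zip(chunks, rep_grads):
        sub_reps = encoder(chunk)
        sub_reps.backward(gradient=g)

    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```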

However, during this computation there may be one issue:

  1. The backbone model uses randomized dropout, which makes steps 1 and 2 inconsistent: the dropout masks drawn in step 1 differ from those drawn in step 2, so the embedding gradients from step 1 cannot be applied directly in step 2. Strictly, the gradients would have to be recomputed for every mini-batch. This can be fixed with a slightly more sophisticated procedure that keeps the two passes consistent (see the sketch below).
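One possible way to keep the two passes consistent, as suggested above, is to capture the RNG state before each sub-batch forward in step 1 and restore it before the matching forward in step 2, so the same dropout masks are replayed. This is a hedged sketch assuming a CPU-only PyTorch model; a CUDA model would additionally need `torch.cuda.get_rng_state()` / `torch.cuda.set_rng_state()`:

```python
import torch

def forward_no_grad_with_states(encoder, chunks):
    reps, rng_states = [], []
    with torch.no_grad():
        for c in chunks:
            rng_states.append(torch.get_rng_state())  # remember the dropout RNG state
            reps.append(encoder(c))
    return torch.cat(reps), rng_states

def backward_with_states(encoder, chunks, rep_grads, rng_states):
    for c, g, state in zip(chunks, rep_grads, rng_states):
        torch.set_rng_state(state)   # replay the same dropout masks as in step 1
        sub_reps = encoder(c)
        sub_reps.backward(gradient=g)
```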
