Thanks for this great work! In the paper you report accuracy for the configuration with both SA and AS blocks but without the co-attention fusion module, and I wonder how you obtained that result. Did you attach a fully connected (FC) layer directly after the attention modules? How can we replicate that result?
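For concreteness, here is a minimal sketch of what I imagine that ablation head might look like, assuming the SA and AS block outputs are pooled feature vectors that get concatenated and fed to a single FC layer (the class name, feature dimensions, and number of classes below are my own placeholders, not from the paper):

```python
import torch
import torch.nn as nn

class AblationHead(nn.Module):
    """Hypothetical ablation: bypass the co-attention fusion module and
    feed the concatenated SA/AS block outputs straight into an FC layer."""
    def __init__(self, sa_dim=512, as_dim=512, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(sa_dim + as_dim, num_classes)

    def forward(self, sa_feat, as_feat):
        # Concatenate the two blocks' pooled features along the feature axis
        fused = torch.cat([sa_feat, as_feat], dim=-1)
        return self.fc(fused)

head = AblationHead()
sa_feat = torch.randn(4, 512)   # dummy SA block output, batch of 4
as_feat = torch.randn(4, 512)   # dummy AS block output, batch of 4
logits = head(sa_feat, as_feat)
print(logits.shape)  # torch.Size([4, 10])
```

Is this roughly the setup you used, or did the ablation keep some other fusion step (e.g. elementwise sum or averaging) before the classifier?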