Attention is not Explanation
Citations
Cites background or methods from "Attention is not Explanation"
...This view is currently debated (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Brunner et al., 2020), and in a multi-layer model where attention is followed by non-linear transformations, the patterns in individual heads do not provide a full picture....
[...]
...However, there is ongoing debate on the merits of attention as a tool for interpreting deep learning models (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Brunner et al., 2020)....
[...]
References
Additional excerpts
...In practice we maximize a relaxed version of this objective via the Adam SGD optimizer (Kingma and Ba, 2014): f({α(i)}_{i=1}^k) + (λ/k) ∑_{i=1}^k max(0, TVD[ŷ(x, α(i)), ŷ(x, α̂)] − ε)....
[...]
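A minimal sketch of the relaxed objective in the excerpt above, assuming the model exposes its predictions ŷ(x, α) as probability vectors; the names predict_fn and f_divergence are placeholders for the paper's prediction function and its divergence score over candidate attention distributions, not its actual code:

```python
import torch

def tvd(p, q):
    # Total variation distance between two probability vectors.
    return 0.5 * torch.sum(torch.abs(p - q), dim=-1)

def relaxed_objective(alphas, alpha_hat, predict_fn, f_divergence, lam=1.0, eps=1e-2):
    # alphas:       list of k candidate attention distributions
    # alpha_hat:    the model's original attention distribution
    # predict_fn:   maps an attention distribution to output probabilities ŷ(x, α)
    # f_divergence: scores how far the candidates are from alpha_hat
    # lam, eps:     placeholder values standing in for λ and ε
    k = len(alphas)
    y_ref = predict_fn(alpha_hat)
    penalty = sum(torch.clamp(tvd(predict_fn(a), y_ref) - eps, min=0.0) for a in alphas)
    return f_divergence(alphas, alpha_hat) + (lam / k) * penalty
```

The candidate distributions would then be updated with torch.optim.Adam on this quantity (negated for a maximization), mirroring the excerpt's use of the Adam optimizer.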
"Attention is not Explanation" refers background or methods in this paper
...In this work we consider two common similarity functions: Additive φ(h, Q) = vᵀ tanh(W₁h + W₂Q) (Bahdanau et al., 2014) and Scaled Dot-Product φ(h, Q) = hQ/√m (Vaswani et al., 2017), where v, W₁, W₂ are model parameters....
[...]
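For concreteness, a rough sketch of the two similarity functions quoted above; the tensor shapes (n input units, hidden size m, projection size d) are assumptions, not taken from the cited paper's code:

```python
import torch

def additive_score(h, Q, v, W1, W2):
    # Additive similarity (Bahdanau et al., 2014): φ(h, Q) = vᵀ tanh(W₁h + W₂Q)
    # h: (n, m) hidden states, Q: (m,) query, W1/W2: (d, m), v: (d,)
    return torch.tanh(h @ W1.T + Q @ W2.T) @ v   # (n,) — one score per input unit

def scaled_dot_product_score(h, Q):
    # Scaled dot-product similarity (Vaswani et al., 2017): φ(h, Q) = hQ/√m
    m = h.shape[-1]
    return (h @ Q) / m ** 0.5                    # (n,)
```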
...Attention mechanisms (Bahdanau et al., 2014) induce conditional distributions over input units to compose a weighted context vector for downstream modules....
[...]
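Continuing the hypothetical sketch above, the conditional distribution over input units and the weighted context vector described in the excerpt would be composed along these lines:

```python
import torch

def attend(scores, h):
    # scores: (n,) similarity scores φ(hₜ, Q); h: (n, m) hidden states
    alpha = torch.softmax(scores, dim=-1)  # conditional distribution over input units
    context = alpha @ h                    # weighted context vector, shape (m,)
    return alpha, context
```

For example, alpha, context = attend(scaled_dot_product_score(h, Q), h) would yield the attention weights and the context vector passed to downstream modules.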