Neural Machine Translation by Jointly Learning to Align and Translate
Citations
Sequence to Sequence Learning with Neural Networks (Sutskever et al., 2014)
11,927 citations
Cites methods or results from "Neural Machine Translation by Jointly Learning to Align and Translate"
...We were initially convinced that the LSTM would fail on long sentences due to its limited memory, and other researchers reported poor performance on long sentences with a model similar to ours [5, 2, 26]....
[...]
...[2] also attempted direct translations with a neural network that used an attention mechanism to overcome the poor performance on long sentences experienced by Cho et al....
[...]
...This way of evaluating the BLEU score is consistent with [5] and [2], and reproduces the 33....
[...]
Attention Is All You Need (Vaswani et al., 2017)
6,974 citations
Cites background or methods from "Neural Machine Translation by Jointly Learning to Align and Translate"
...Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29]....
[...]
...This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [31, 2, 8]....
[...]
...The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention....
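Both scoring functions can be stated in a few lines. Below is a minimal NumPy sketch, assuming a single decoder state attending over a handful of encoder annotations; the names (query, keys, Wa, Ua, va) and toy dimensions are illustrative assumptions, not the exact parameterization of either paper.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Additive (Bahdanau-style): score(s, h_j) = v_a^T tanh(W_a s + U_a h_j)
    def additive_scores(query, keys, Wa, Ua, va):
        return np.array([va @ np.tanh(Wa @ query + Ua @ k) for k in keys])

    # Dot-product (multiplicative): score(s, h_j) = s^T h_j
    def dot_scores(query, keys):
        return keys @ query

    rng = np.random.default_rng(0)
    d = 8
    query = rng.normal(size=d)        # one decoder state
    keys = rng.normal(size=(5, d))    # five encoder annotations
    Wa, Ua = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    va = rng.normal(size=d)

    for name, scores in (("additive", additive_scores(query, keys, Wa, Ua, va)),
                         ("dot-product", dot_scores(query, keys))):
        alpha = softmax(scores)   # attention weights over source positions
        context = alpha @ keys    # context vector: weighted sum of annotations
        print(name, alpha.round(3), context.shape)

The dot product reduces scoring to a single matrix-vector product, which is why it is faster in practice, whereas additive attention runs a small feedforward network per source position.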
[...]
...Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 16]....
[...]
...Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]....
[...]
Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)
6,374 citations
Cites background, methods, or results from "Neural Machine Translation by Jointly Learning to Align and Translate"
...On the other hand, in (Bahdanau et al., 2015; Jean et al., 2015) and this work, s, in fact, implies a set of source hidden states which are consulted throughout the entire course of the translation process....
[...]
...Comparison to other work – Bahdanau et al. (2015) use context vectors, similar to our ct, in building subsequent hidden states, which can also achieve the “coverage” effect....
[...]
...…part of at and, for long sentences, we ignore words near the end. [...] …goes through a deep-output and a maxout layer before making predictions. Lastly, Bahdanau et al. (2015) only experimented with one alignment function, the concat product; whereas we show later that the other alternatives are…...
[...]
...The former approach resembles the model of (Bahdanau et al., 2015) but is simpler architecturally....
[...]
...Bahdanau et al. (2015), on the other hand, use the concatenation of the forward and backward source hidden states in the bi-directional encoder and target hidden states in their non-stacking uni-directional decoder....
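As a concrete reading of that sentence, here is a small NumPy sketch of a bi-directional encoder whose annotation at each position concatenates the forward and backward hidden states. It assumes a plain tanh recurrence in place of the paper's gated unit, and all names and sizes are illustrative.

    import numpy as np

    def rnn_pass(xs, W, U, b):
        # Plain tanh recurrence, standing in for the paper's gated unit.
        h = np.zeros(U.shape[0])
        states = []
        for x in xs:
            h = np.tanh(W @ x + U @ h + b)
            states.append(h)
        return states

    def bidirectional_annotations(xs, fwd, bwd):
        # The annotation for position j concatenates the forward state
        # (summarizing x_1..x_j) with the backward state (summarizing x_j..x_T).
        f = rnn_pass(xs, *fwd)
        b = rnn_pass(xs[::-1], *bwd)[::-1]
        return [np.concatenate(pair) for pair in zip(f, b)]

    rng = np.random.default_rng(1)
    d_in, d_h = 4, 6
    xs = [rng.normal(size=d_in) for _ in range(5)]
    params = lambda: (rng.normal(size=(d_h, d_in)),
                      rng.normal(size=(d_h, d_h)),
                      np.zeros(d_h))
    hs = bidirectional_annotations(xs, params(), params())
    print(len(hs), hs[0].shape)   # 5 annotations, each of size 2 * d_h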
[...]
References
Long Short-Term Memory (Hochreiter & Schmidhuber, 1997)
49,735 citations
"Neural Machine Translation by Joint..." refers methods in this paper
...This gated unit is similar to a long short-term memory (LSTM) unit proposed earlier by Hochreiter and Schmidhuber (1997), sharing with it the ability to better model and learn long-term dependencies....
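A minimal sketch of such a gated unit (the GRU of Cho et al.), assuming the common two-gate form; bias terms are omitted for brevity, and the convention for whether the update gate weights the candidate or the previous state varies across papers.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x, h, p):
        # The gates let the unit copy its state across many steps,
        # which is what eases learning long-term dependencies.
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)            # update gate
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)            # reset gate
        h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h))  # candidate state
        return (1.0 - z) * h + z * h_tilde                # interpolate old and new

    rng = np.random.default_rng(2)
    d_in, d_h = 3, 5
    p = {k: rng.normal(size=(d_h, d_in if k.startswith("W") else d_h))
         for k in ("Wz", "Uz", "Wr", "Ur", "W", "U")}
    h = np.zeros(d_h)
    for t in range(4):
        h = gru_step(rng.normal(size=d_in), h, p)
    print(h.round(3))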
[...]
5,798 citations
"Neural Machine Translation by Joint..." refers background or methods in this paper
...Since Bengio et al. (2003) introduced a neural probabilistic language model which uses a neural network to model the conditional probability of a word given a fixed number of the preceding words, neural networks have been widely used in machine translation. However, the role of neural networks has been largely limited to simply providing a single feature to an existing statistical machine translation system or to re-ranking a list of candidate translations provided by an existing system. For instance, Schwenk (2012) proposed using a feedforward neural network to compute the score of a pair of source and target phrases and to use the score as an additional feature in a phrase-based statistical machine translation system. More recently, Kalchbrenner and Blunsom (2013) and Devlin et al. (2014) reported the successful use of neural networks as a sub-component of an existing translation system....
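As a sketch of the Bengio et al. (2003) idea described above: the model conditions on a fixed number of preceding words by concatenating their embeddings, applying one nonlinear layer, and normalizing over the vocabulary. The sizes and names below are toy assumptions, not the paper's configuration.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def nplm_probs(context_ids, E, W, b, V, c):
        # P(w_t | w_{t-n+1}, ..., w_{t-1}): embed the fixed-size context,
        # apply one tanh layer, and normalize over the vocabulary.
        x = np.concatenate([E[i] for i in context_ids])
        h = np.tanh(W @ x + b)
        return softmax(V @ h + c)

    rng = np.random.default_rng(3)
    vocab, d_emb, d_h, n = 50, 8, 16, 3   # a trigram model: two context words
    E = rng.normal(size=(vocab, d_emb))   # word embedding table
    W = rng.normal(size=(d_h, (n - 1) * d_emb))
    b, c = np.zeros(d_h), np.zeros(vocab)
    V = rng.normal(size=(vocab, d_h))
    probs = nplm_probs([7, 42], E, W, b, V, c)
    print(probs.sum().round(6), probs.argmax())  # probabilities sum to 1.0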
[...]
...These paths allow gradients to flow backward easily without suffering too much from the vanishing effect (Hochreiter, 1991; Bengio et al., 1994; Pascanu et al., 2013a)....
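A quick numerical illustration of the vanishing effect mentioned here, under the simplifying assumption of a linear recurrence: when the recurrent Jacobian's largest singular value is below one, the backpropagated gradient shrinks exponentially with the number of steps.

    import numpy as np

    rng = np.random.default_rng(4)
    d, T = 10, 50
    U = rng.normal(size=(d, d))
    U *= 0.5 / np.linalg.norm(U, 2)   # force the largest singular value to 0.5
    g = rng.normal(size=d)            # gradient arriving at the last time step
    for t in range(1, T + 1):
        g = U.T @ g                   # one step of backpropagation through time
        if t % 10 == 0:
            print(f"step {t:2d}: |grad| = {np.linalg.norm(g):.2e}")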
[...]
ADADELTA: An Adaptive Learning Rate Method (Zeiler, 2012)
5,567 citations
"Neural Machine Translation by Joint..." refers methods in this paper
...We use a minibatch stochastic gradient descent (SGD) algorithm together with Adadelta (Zeiler, 2012) to train each model....
[...]
...Adadelta (Zeiler, 2012) was used to automatically adapt the learning rate of each parameter (ε = 10⁻⁶ and ρ = 0.95)....
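For reference, the Adadelta update quoted above can be sketched in a few lines; ρ = 0.95 and ε = 10⁻⁶ follow the excerpt, while the class layout and the quadratic toy objective are illustrative assumptions.

    import numpy as np

    class Adadelta:
        # Running averages of squared gradients and squared updates give a
        # per-parameter step size with no hand-set learning rate.
        def __init__(self, shape, rho=0.95, eps=1e-6):
            self.rho, self.eps = rho, eps
            self.Eg2 = np.zeros(shape)    # E[g^2]
            self.Edx2 = np.zeros(shape)   # E[dx^2]

        def step(self, grad):
            self.Eg2 = self.rho * self.Eg2 + (1 - self.rho) * grad ** 2
            dx = -np.sqrt(self.Edx2 + self.eps) / np.sqrt(self.Eg2 + self.eps) * grad
            self.Edx2 = self.rho * self.Edx2 + (1 - self.rho) * dx ** 2
            return dx

    # Toy use: descend on f(x) = ||x||^2, whose gradient is 2x.
    x = np.array([3.0, -2.0])
    opt = Adadelta(x.shape)
    for _ in range(200):
        x += opt.step(2 * x)
    print(x.round(4))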
[...]