Neural Machine Translation by Jointly Learning to Align and Translate

Home
/
Papers
/
Neural Machine Translation by Jointly Learning to Align and Translate

Proceedings Article•

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau¹, Kyunghyun Cho², Yoshua Bengio²•Institutions (2)

Jacobs University Bremen¹, Université de Montréal²

01 Jan 2015-

TL;DR: It is conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

read less

Abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

...read moreread less

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Reconstruction of sparse vectors in compressive sensing with multiple measurement vectors using bidirectional long short-term memory

[...]

Hamid Palangi¹, Rabab K. Ward¹, Li Deng²•Institutions (2)

University of British Columbia¹, Microsoft²

01 Dec 2016

TL;DR: A reconstruction algorithm which learns sparse structure inside each sparse vector and among sparse vectors based on a cross entropy cost function which outperforms the traditional greedy algorithm SOMP as well as a number of model based Bayesian methods.

...read moreread less

Abstract: In this paper we address the problem of compressive sensing with multiple measurement vectors. We propose a reconstruction algorithm which learns sparse structure inside each sparse vector and among sparse vectors. The learning is based on a cross entropy cost function. The model is the Bidirectional Long Short-Term Memory that is deep in time. All modifications are done at the decoder so that the encoder remains the general compressive sensing encoder, i.e., wide random matrix. Through numerical experiments on a real world dataset, we show that the proposed method outperforms the traditional greedy algorithm SOMP as well as a number of model based Bayesian methods including Multitask Compressive Sensing and Compressive Sensing with Temporally Correlated Sources. We emphasize that since the proposed method is a learning based method, its performance depends on the availability of training data. Nevertheless, in many applications huge dataset of offline training data is usually available.1

...read moreread less

5 citations

Cites methods from "Neural Machine Translation by Joint..."

...A good candidate for this sequence model is Long Short-Term Memory (LSTM) [18] given its recent success in difficult sequence modeling tasks [20, 21]....
[...]

Proceedings Article•DOI•

LIUM-CVC Submissions for WMT18 Multimodal Translation Task.

[...]

Ozan Caglayan, Adrien Bardet, Fethi Bougares, Loïc Barrault, Kai Wang, Marc Masana, Luis Herranz, Joost van de Weijer - Show less +4 more

01 Oct 2018

TL;DR: This paper describes the multimodal Neural Machine Translation systems developed by LIUM and CVC for WMT18 Shared Task on Multimodal Translation and proposes several modifications to the previous multimodals attention architecture in order to better integrate convolutional features and refine them using encoder-side information.

...read moreread less

Abstract: This paper describes the multimodal Neural Machine Translation systems developed by LIUM and CVC for WMT18 Shared Task on Multimodal Translation. This year we propose several modifications to our previous multimodal attention architecture in order to better integrate convolutional features and refine them using encoder-side information. Our final constrained submissions ranked first for English→French and second for English→German language pairs among the constrained submissions according to the automatic evaluation metric METEOR.

...read moreread less

5 citations

Cites methods from "Neural Machine Translation by Joint..."

...We use feed-forward attention (Bahdanau et al., 2014) which encapsulates a learnable layer....
[...]

Proceedings Article•DOI•

TF 2 AN: A Temporal-Frequency Fusion Attention Network for Spectrum Energy Level Prediction

[...]

Kehan Li¹, Zebo Liu¹, Shibo He¹, Jiming Chen¹•Institutions (1)

Zhejiang University¹

10 Jun 2019

TL;DR: A temporal-frequency fusion attention network to model the complex internal and external correlations of the radio spectrum for precise prediction is proposed and experimental results demonstrate that the method outperforms seven baseline methods in terms of prediction accuracy.

...read moreread less

Abstract: Modeling and predicting radio spectrum are significant for better understanding the behavior of spectrum, managing their usage as well as optimizing the performance of dynamic spectrum access. Most of the existing works concentrate on predicting the occupation status of the spectrum via threshold-based binary time series, ignoring abundant frequency correlations. In fact, precisely predicting the energy level of the radio spectrum can provide richer information for applications such as characterizing the spectrum trending for earlier anomaly detection and estimating the channel quality for efficient spectrum sharing. However, the precise prediction is challenging due to the interference from both intra-spectrum and external factors. In this paper, we propose a temporal-frequency fusion attention network to model the complex internal and external correlations for precise prediction. More specifically, our framework consists of three major components: 1) an image processing based robust signal detection algorithm to locate the signal as model input. 2) an attention-based Long Short-term Memory network to model the temporal-frequency correlation of the spectrum. 3) a generalized fusion module to take in the external factors from heterogeneous domains. Extensive experiments are conducted on real-world datasets collected by our spectrum monitoring station deployed in the city of Hangzhou, China, which shows that the proposed signal detection algorithm is robust for frequency bands with different signal to noise ratios. Furthermore, experimental results demonstrate that our method outperforms seven baseline methods in terms of prediction accuracy. The sensitivities of hyper-parameters are analyzed and the interpretability is also well discussed to prove the effectiveness of our method.

...read moreread less

5 citations

Cites background from "Neural Machine Translation by Joint..."

...Furthermore, the decoder component also improves the performance by predicting the future five time steps with one model, which is better than the ANN and LSTM model trained separately on five models for multi-step prediction....
[...]
...Recently, the LSTM network [21] has also been applied for predicting the occupancy states of the spectrum due to its advantage in modeling sequential data with a recurrent unit....
[...]
...(3) Thus the encoder can recurrently update the hidden variable given input series with ht = fe(ht−1, x̃t), where fe is the LSTM unit used in encoder....
[...]
...The main contributions are summarized as follows: • An RSD algorithm via image processing is proposed to locate the signal from the spectrogram considering temporal and frequency domain features, which shows robustness in accurately detecting signal from frequency bands with varied SNRs. • We develop the TF2AN for precise spectrum prediction, which models the complex temporal-frequency correlation of radio spectrum with an attention-based Long Short-term Memory (LSTM) network....
[...]
...Thus, the methods with encoder-decoder structure (i.e., Seq2seq, Attention LSTM, DA-RNN and TF2AN) usually achieve a better performance than other methods because of considering much longer temporal correlation....
[...]

Proceedings Article•

Open-ended Knowledge Tracing for Computer Science Education

[...]

Naiming Liu, Zichao Wang

TL;DR: This paper develops an initial solution to the OKT problem, a student knowledge-guided code generation approach that combines program synthesis methods using language models with student knowledge tracing methods and conducts a series of quantitative and qualitative experiments to validate OKT and demonstrate its promise in educational applications.

...read moreread less

Abstract: In educational applications, knowledge tracing refers to the problem of estimating students’ time-varying concept/skill mastery level from their past responses to questions and predicting their future performance.One key limitation of most existing knowledge tracing methods is that they treat student responses to questions as binary-valued, i.e., whether they are correct or incorrect. Response correctness analysis/prediction is straightforward, but it ignores important information regarding mastery, especially for open-ended questions.In contrast, exact student responses can provide much more information.In this paper, we conduct the first exploration int open-ended knowledge tracing (OKT) by studying the new task of predicting students’ exact open-ended responses to questions.Our work is grounded in the domain of computer science education with programming questions. We develop an initial solution to the OKT problem, a student knowledge-guided code generation approach, that combines program synthesis methods using language models with student knowledge tracing methods. We also conduct a series of quantitative and qualitative experiments on a real-world student code dataset to validate and demonstrate the promise of OKT.

...read moreread less

5 citations

Journal Article•DOI•

AGConv: Adaptive Graph Convolution on 3D Point Clouds

[...]

Mingqiang Wei, Zeyong Wei, Haosu Zhou, Fei Hu, Huajian Si, Zhilei Chen, Zhenliang Zhu, Jingbo Qiu, Xu-le Yan, Yan Guo, Jun Wang, Jing Qin - Show less +8 more

09 Jun 2022-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: AGConv improves the flexibility of point cloud convolutions, effectively and precisely capturing the diverse relations between points from different semantic parts, and implements the adaptiveness inside the convolution operation instead of simply assigning different weights to the neighboring points.

...read moreread less

Abstract: Convolution on 3D point clouds is widely researched yet far from perfect in geometric deep learning. The traditional wisdom of convolution characterises feature correspondences indistinguishably among 3D points, arising an intrinsic limitation of poor distinctive feature learning. In this article, we propose Adaptive Graph Convolution (AGConv) for wide applications of point cloud analysis. AGConv generates adaptive kernels for points according to their dynamically learned features. Compared with the solution of using fixed/isotropic kernels, AGConv improves the flexibility of point cloud convolutions, effectively and precisely capturing the diverse relations between points from different semantic parts. Unlike the popular attentional weight schemes, AGConv implements the adaptiveness inside the convolution operation instead of simply assigning different weights to the neighboring points. Extensive evaluations clearly show that our method outperforms state-of-the-arts of point cloud classification and segmentation on various benchmark datasets. Meanwhile, AGConv can flexibly serve more point cloud analysis approaches to boost their performance. To validate its flexibility and effectiveness, we explore AGConv-based paradigms of completion, denoising, upsampling, registration and circle extraction, which are comparable or even superior to their competitors.

...read moreread less

5 citations

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Long short-term memory

[...]

Sepp Hochreiter¹, Jürgen Schmidhuber²•Institutions (2)

Technische Universität München¹, Dalle Molle Institute for Artificial Intelligence Research²

01 Nov 1997-Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

...read moreread less

Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O. 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

...read moreread less

72,897 citations

"Neural Machine Translation by Joint..." refers methods in this paper

...This gated unit is similar to a long short-term memory (LSTM) unit proposed earlier by Hochreiter and Schmidhuber (1997), sharing with it the ability to better model and learn long-term dependencies....
[...]

Proceedings Article•DOI•

Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation

[...]

Kyunghyun Cho¹, Bart van Merriënboer², Caglar Gulcehre², Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio³, Yoshua Bengio⁴, Yoshua Bengio⁵ - Show less +5 more•Institutions (5)

Aalto University¹, Université de Montréal², École Polytechnique de Montréal³, Alcatel-Lucent⁴, AT&T⁵

01 Jan 2014

TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.

...read moreread less

Abstract: In this paper, we propose a novel neural network model called RNN Encoder‐ Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixedlength vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder‐Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

...read moreread less

19,998 citations

Journal Article•DOI•

Learning long-term dependencies with gradient descent is difficult

[...]

Yoshua Bengio¹, Patrice Y. Simard², Paolo Frasconi³•Institutions (3)

Université de Montréal¹, AT&T², University of Florence³

01 Mar 1994-IEEE Transactions on Neural Networks

TL;DR: This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.

...read moreread less

Abstract: Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered. >

...read moreread less

7,309 citations

"Neural Machine Translation by Joint..." refers background or methods in this paper

...Since Bengio et al. (2003) introduced a neural probabilistic language model which uses a neural network to model the conditional probability of a word given a fixed number of the preceding words, neural networks have widely been used in machine translation. However, the role of neural networks has been largely limited to simply providing a single feature to an existing statistical machine translation system or to re-rank a list of candidate translations provided by an existing system. For instance, Schwenk (2012) proposed using a feedforward neural network to compute the score of a pair of source and target phrases and to use the score as an additional feature in the phrase-based statistical machine translation system. More recently, Kalchbrenner and Blunsom (2013) and Devlin et al. (2014) reported the successful use of the neural networks as a sub-component of the existing translation system....
[...]
...These paths allow gradients to flow backward easily without suffering too much from the vanishing effect (Hochreiter, 1991; Bengio et al., 1994; Pascanu et al., 2013a)....
[...]
...Since Bengio et al. (2003) introduced a neural probabilistic language model which uses a neural network to model the conditional probability of a word given a fixed number of the preceding words, neural networks have widely been used in machine translation....
[...]
...Since Bengio et al. (2003) introduced a neural probabilistic language model which uses a neural network to model the conditional probability of a word given a fixed number of the preceding words, neural networks have widely been used in machine translation. However, the role of neural networks has been largely limited to simply providing a single feature to an existing statistical machine translation system or to re-rank a list of candidate translations provided by an existing system. For instance, Schwenk (2012) proposed using a feedforward neural network to compute the score of a pair of source and target phrases and to use the score as an additional feature in the phrase-based statistical machine translation system....
[...]

Journal Article•DOI•

Bidirectional recurrent neural networks

[...]

Mike Schuster, Kuldip K. Paliwal

01 Nov 1997-IEEE Transactions on Signal Processing

TL;DR: It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.

...read moreread less

Abstract: In the first part of this paper, a regular recurrent neural network (RNN) is extended to a bidirectional recurrent neural network (BRNN). The BRNN can be trained without the limitation of using input information just up to a preset future frame. This is accomplished by training it simultaneously in positive and negative time direction. Structure and training procedure of the proposed network are explained. In regression and classification experiments on artificial data, the proposed structure gives better results than other approaches. For real data, classification experiments for phonemes from the TIMIT database show the same tendency. In the second part of this paper, it is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution. For this part, experiments on real data are reported.

...read moreread less

7,290 citations

"Neural Machine Translation by Joint..." refers methods in this paper

...With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly....
[...]
...Hence, we propose to use a bidirectional RNN (BiRNN, Schuster and Paliwal, 1997), which has been successfully used recently in speech recognition (see, e.g., Graves et al., 2013)....
[...]

Journal Article•DOI•

A neural probabilistic language model

[...]

Yoshua Bengio¹, Réjean Ducharme¹, Pascal Vincent¹, Christian Janvin¹•Institutions (1)

Université de Montréal¹

01 Mar 2003-Journal of Machine Learning Research

TL;DR: The authors propose to learn a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences, which can be expressed in terms of these representations.

...read moreread less

Abstract: A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.

...read moreread less

6,832 citations