Proceedings ArticleDOI

Visualizing and Understanding Neural Machine Translation

01 Jul 2017-Vol. 1, pp 1150-1159
TL;DR: This work proposes to use layer-wise relevance propagation (LRP) to compute the contribution of each contextual word to arbitrary hidden states in the attention-based encoder-decoder framework and shows that visualization with LRP helps to interpret the internal workings of NMT and analyze translation errors.
Abstract: While neural machine translation (NMT) has made remarkable progress in recent years, it is hard to interpret its internal workings due to the continuous representations and non-linearity of neural networks. In this work, we propose to use layer-wise relevance propagation (LRP) to compute the contribution of each contextual word to arbitrary hidden states in the attention-based encoder-decoder framework. We show that visualization with LRP helps to interpret the internal workings of NMT and analyze translation errors.
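
As a rough illustration of how LRP attributes a hidden state back to input words, here is a sketch of the generic epsilon-rule for a single linear layer (not the authors' implementation; the name lrp_linear and all toy values are hypothetical). Applying this redistribution step layer by layer yields per-word contributions to any chosen hidden state.

```python
import numpy as np

def lrp_linear(x, W, b, relevance_out, eps=1e-6):
    """Epsilon-rule LRP: redistribute relevance from a linear layer's
    output neurons back to its input neurons.

    x:             input activations, shape (d_in,)
    W:             weights, shape (d_in, d_out)
    b:             bias, shape (d_out,)
    relevance_out: relevance of the output neurons, shape (d_out,)
    """
    z = x @ W + b                                  # forward pre-activations
    denom = z + eps * np.where(z >= 0, 1.0, -1.0)  # stabilized denominator
    # Input i gets a share of output j's relevance proportional to its
    # contribution x_i * W_ij to the pre-activation z_j.
    return (x[:, None] * W) @ (relevance_out / denom)

# Toy usage: attribute a 4-dim hidden state back to a 3-dim input.
rng = np.random.default_rng(0)
r_in = lrp_linear(rng.normal(size=3), rng.normal(size=(3, 4)),
                  rng.normal(size=4), relevance_out=np.ones(4))
print(r_in)  # per-input relevance scores
```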


Citations
Proceedings ArticleDOI
28 Jul 2019
TL;DR: It is found that the most important and confident heads play consistent and often linguistically interpretable roles, and that when heads are pruned using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, the specialized heads are the last to be pruned.
Abstract: Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads to the overall performance of the model and analyze the roles played by them in the encoder. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU.
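
To make the pruning machinery concrete, here is a minimal numpy sketch of the hard-concrete stochastic gate behind this kind of differentiable L0 relaxation (Louizos et al., 2018); the hyperparameter values are common defaults, and all names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def hard_concrete_gate(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1, rng=None):
    """Sample one gate per head in [0, 1]; exactly 0 or 1 with nonzero probability."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    s = 1 / (1 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    s_stretched = s * (zeta - gamma) + gamma   # stretch (0,1) to (gamma, zeta)
    return np.clip(s_stretched, 0.0, 1.0)      # clipping yields hard 0s and 1s

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Differentiable L0 surrogate: probability that each gate is non-zero."""
    return 1 / (1 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta))))

# One learnable log_alpha per head: each head's output is scaled by its
# sampled gate, and expected_l0(...).sum() is added to the training loss,
# driving unimportant heads to exactly zero.
log_alpha = np.zeros(8)  # e.g. 8 heads in one layer
print(hard_concrete_gate(log_alpha), expected_l0(log_alpha).sum())
```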

718 citations


Cites methods from "Visualizing and Understanding Neural Machine Translation"

  • ...We start by identifying the most important heads in each encoder layer using layer-wise relevance propagation (Ding et al., 2017)....

  • ...Layer-wise relevance propagation (LRP) (Ding et al., 2017) is a method for computing the relative contribution of neurons at one point in a network to neurons at another....

Journal ArticleDOI
TL;DR: Analysis methods in neural language processing are reviewed and categorized according to prominent research trends, existing limitations are highlighted, and potential directions for future work are identified.
Abstract: The field of natural language processing has seen impressive progress in recent years, with neural network models replacing many of the traditional systems. A plethora of new models have been proposed, many of which are thought to be opaque compared to their feature-rich counterparts. This has led researchers to analyze, interpret, and evaluate neural networks in novel and more fine-grained ways. In this survey paper, we review analysis methods in neural language processing, categorize them according to prominent research trends, highlight existing limitations, and point to potential directions for future work.

442 citations


Cites background from "Visualizing and Understanding Neural Machine Translation"

  • ...The important or salient features can then be visualized in selected examples (Li et al., 2016a; Aubakirova and Bansal, 2016; Sundararajan et al., 2017; Arras et al., 2017a,b; Ding et al., 2017; Murdoch et al., 2018; Mudrakarta et al., 2018; Montavon et al., 2018; Godin et al., 2018)....

Posted Content
TL;DR: In this article, the authors present a survey of state-of-the-art neural approaches to conversational AI, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies.
Abstract: The present paper surveys neural approaches to conversational AI that have been developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) chatbots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between them and traditional approaches, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies.

415 citations

Proceedings ArticleDOI
01 Jul 2020
TL;DR: This work proposes Evaluating Rationales And Simple English Reasoning (ERASER), a benchmark to advance research on interpretable models in NLP, together with several metrics that aim to capture how well the rationales provided by models align with human rationales and how faithful these rationales are.
Abstract: State-of-the-art models in NLP are now predominantly based on deep neural networks that are opaque in terms of how they come to make predictions. This limitation has increased interest in designing more interpretable deep models for NLP that reveal the 'reasoning' behind model outputs. But work in this direction has been conducted on different datasets and tasks with correspondingly unique aims and metrics; this makes it difficult to track progress. We propose the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark to advance research on interpretable models in NLP. This benchmark comprises multiple datasets and tasks for which human annotations of "rationales" (supporting evidence) have been collected. We propose several metrics that aim to capture how well the rationales provided by models align with human rationales, and also how faithful these rationales are (i.e., the degree to which provided rationales influenced the corresponding predictions). Our hope is that releasing this benchmark facilitates progress on designing more interpretable NLP systems. The benchmark, code, and documentation are available at https://www.eraserbenchmark.com/
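
As an illustration of the two metric families the abstract mentions — agreement with human rationales and faithfulness — the sketch below computes token-level F1 and a simple comprehensiveness-style score; rationale_f1, comprehensiveness, and the predict() stand-in are hypothetical simplifications of the benchmark's official metrics.

```python
def rationale_f1(predicted, gold):
    """Token-level F1 between predicted and human rationale token indices."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    precision = len(predicted & gold) / len(predicted)
    recall = len(predicted & gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def comprehensiveness(tokens, rationale_idx, predict):
    """Faithfulness proxy: how much the model's probability for its
    prediction drops when the rationale tokens are removed."""
    kept = [t for i, t in enumerate(tokens) if i not in set(rationale_idx)]
    return predict(tokens) - predict(kept)

print(rationale_f1({1, 2, 5}, {2, 5, 6}))  # 0.666...
```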

398 citations


Cites background from "Visualizing and Understanding Neural Machine Translation"

  • ...There has also been work on visual evaluations of saliency maps (Li et al., 2016; Ding et al., 2017; Sundararajan et al., 2017)....

Journal ArticleDOI
TL;DR: In this paper, the authors provide a timely overview of explainable AI, with a focus on 'post-hoc' explanations, explain its theoretical foundations, and put interpretability algorithms to a test both from a theory and comparative evaluation perspective using extensive simulations.
Abstract: With the broader and highly successful usage of machine learning in industry and the sciences, there has been a growing demand for Explainable AI. Interpretability and explanation methods for gaining a better understanding of the problem-solving abilities and strategies of nonlinear machine learning, in particular deep neural networks, are therefore receiving increased attention. In this work we aim to (1) provide a timely overview of this active emerging field, with a focus on 'post-hoc' explanations, and explain its theoretical foundations, (2) put interpretability algorithms to a test both from a theory and a comparative evaluation perspective using extensive simulations, (3) outline best-practice aspects, i.e., how to best include interpretation methods into the standard usage of machine learning, and (4) demonstrate successful usage of explainable AI in a representative selection of application scenarios. Finally, we discuss challenges and possible future directions of this exciting foundational field of machine learning.
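
To make 'post-hoc' concrete: such methods explain a trained model without modifying it, as in the toy occlusion-style attribution below, which treats the model as a black box (the model and function names are illustrative only, not from the survey).

```python
import numpy as np

def occlusion_saliency(model, x, baseline=0.0):
    """Score each feature by how much the output drops when it is masked."""
    base_score = model(x)
    saliency = np.zeros(len(x))
    for i in range(len(x)):
        x_masked = x.copy()
        x_masked[i] = baseline         # occlude one feature at a time
        saliency[i] = base_score - model(x_masked)
    return saliency

# Toy black-box model: a fixed linear scorer; feature 2 should dominate.
w = np.array([0.1, -0.3, 2.0, 0.05])
model = lambda x: float(w @ x)
print(occlusion_saliency(model, np.ones(4)))  # [ 0.1  -0.3   2.    0.05]
```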

385 citations

References
Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on the ImageNet classification task, as discussed by the authors.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
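
The architecture described above can be sketched in a few lines of PyTorch; the layer sizes below follow the commonly cited AlexNet configuration and omit details of the original (the two-GPU split and local response normalization), so treat it as a structural sketch rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

# Five conv layers (some followed by max-pooling), three fully-connected
# layers, dropout in the classifier, and a final 1000-way output.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),   # logits for the 1000 ImageNet classes
)

print(alexnet(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])
```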

73,978 citations

Proceedings ArticleDOI
23 Jun 2014
TL;DR: R-CNN, as discussed by the authors, combines CNNs with bottom-up region proposals to localize and segment objects; when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.
Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
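
Schematically, the pipeline is: propose class-agnostic regions, extract CNN features from each warped region, then score every region per class. The runnable sketch below wires up that flow with stand-ins — propose_regions, cnn_features, and a toy weight matrix — in place of selective search, the pretrained CNN, and the per-class SVMs of the actual system.

```python
import numpy as np

def propose_regions(image, n=5):
    """Stand-in for selective search: n random 16x16 boxes (x0, y0, x1, y1)."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    xs, ys = rng.integers(0, w - 16, n), rng.integers(0, h - 16, n)
    return [(int(x), int(y), int(x) + 16, int(y) + 16) for x, y in zip(xs, ys)]

def cnn_features(crop):
    """Stand-in for a pretrained CNN applied to the warped region."""
    return np.array([crop.mean(), crop.std()])

W_cls = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # toy per-class weights

def rcnn_detect(image):
    detections = []
    for (x0, y0, x1, y1) in propose_regions(image):
        feat = cnn_features(image[y0:y1, x0:x1])
        scores = W_cls @ feat                    # one score per class
        detections.append(((x0, y0, x1, y1),
                           int(scores.argmax()), float(scores.max())))
    return detections  # the real system would apply per-class NMS here

print(rcnn_detect(np.random.rand(64, 64)))
```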

21,729 citations

Proceedings Article
01 Jan 2015
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
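
The (soft-)search described here is the additive attention mechanism: at each decoding step the decoder state is scored against every encoder state, and the resulting weights form a soft alignment over the source. A minimal numpy sketch of one step follows, with illustrative weight names (W_a, U_a, v_a).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """s_prev: previous decoder state (d,); H: encoder states (T, d).
    Returns the soft alignment over source positions and the context vector."""
    # e_t = v_a^T tanh(W_a s_prev + U_a h_t), for every source position t
    scores = np.tanh(s_prev @ W_a + H @ U_a) @ v_a
    alpha = softmax(scores)   # attention weights: a soft source alignment
    context = alpha @ H       # expectation over the encoder states
    return alpha, context

rng = np.random.default_rng(0)
d, T = 4, 6
alpha, context = additive_attention(
    rng.normal(size=d), rng.normal(size=(T, d)),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d))
print(alpha.round(3), context.shape)  # weights sum to 1; context is (d,)
```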

20,027 citations


"Visualizing and Understanding Neura..." refers background or methods in this paper

  • ...Figure 1: The attention-based encoder-decoder architecture for neural machine translation (Bahdanau et al., 2015)....

  • ...Though projecting word embedding space into two dimensions (Faruqui and Dyer, 2014) and the attention matrix (Bahdanau et al., 2015) shed partial light on how NMT works, how to interpret the entire network still remains a challenge....

  • ...Note that the attention mechanism (Bahdanau et al., 2015) is restricted to demonstrate the connection between words in source and target languages and unable to offer more insights in interpreting how target words are generated (see Section 4....

  • ...We use the open-source toolkit GROUNDHOG (Bahdanau et al., 2015), which implements the attention-based encoder-decoder framework....

  • ...In this work, we focus on the attention-based encoder-decoder framework (Bahdanau et al., 2015)....

Proceedings ArticleDOI
01 Jan 2014
TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
Abstract: In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNNs). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
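
Stripped to its essentials, the model compresses the source into one fixed-length vector and unrolls a decoder from it, as in the numpy sketch below; plain tanh RNN cells stand in for the gated units the paper actually uses, and all weight names are illustrative.

```python
import numpy as np

def rnn_step(h, x, W_h, W_x):
    return np.tanh(h @ W_h + x @ W_x)   # plain tanh cell (paper uses gated units)

def encode(source, W_h, W_x, d):
    h = np.zeros(d)
    for x in source:                    # read the whole source sequence
        h = rnn_step(h, x, W_h, W_x)
    return h                            # the fixed-length summary vector c

def decode(c, steps, W_h, W_x, W_out):
    h, y, outputs = c, np.zeros(W_x.shape[0]), []
    for _ in range(steps):              # every step conditions on c only via h
        h = rnn_step(h, y, W_h, W_x)
        y = h @ W_out
        outputs.append(y)
    return outputs

rng = np.random.default_rng(0)
d, d_emb = 8, 4
enc_Wh, enc_Wx = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d_emb, d))
c = encode([rng.normal(size=d_emb) for _ in range(5)], enc_Wh, enc_Wx, d)
outs = decode(c, 3, 0.1 * rng.normal(size=(d, d)),
              0.1 * rng.normal(size=(d_emb, d)), 0.1 * rng.normal(size=(d, d_emb)))
print(c.shape, len(outs))               # (8,) 3
```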

19,998 citations

Posted Content
TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

14,077 citations