Augmenting Transformers with KNN-Based Composite Memory for Dialog

doi:10.1162/TACL_A_00356

Home
/
Papers
/
Augmenting Transformers with KNN-Based Composite Memory for Dialog

Journal Article•DOI•

Augmenting Transformers with KNN-Based Composite Memory for Dialog

Angela Fan¹, Claire Gardent, Chloé Braud, Antoine Bordes¹•Institutions (1)

Facebook¹

04 Mar 2021-Transactions of the Association for Computational Linguistics (MIT Press - Journals)-Vol. 9, pp 82-99

TL;DR: This article propose augmenting GNNs with KNN-based information fetching (KIF) modules to access fixed external knowledge, such as Wikipedia, images, and human-written dialog utterances.

read less

Abstract: Various machine learning tasks can benefit from access to external information of different modalities, such as text and images. Recent work has focused on learning architectures with large memories capable of storing this knowledge. We propose augmenting generative Transformer neural networks with KNN-based Information Fetching (KIF) modules. Each KIF module learns a read operation to access fixed external knowledge. We apply these modules to generative dialog modeling, a challenging task where information must be flexibly retrieved and incorporated to maintain the topic and flow of conversation. We demonstrate the effectiveness of our approach by identifying relevant knowledge required for knowledgeable but engaging dialog from Wikipedia, images, and human-written dialog utterances, and show that leveraging this retrieved information improves model performance, measured by automatic and human evaluation.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Posted Content•

Retrieval Augmentation Reduces Hallucination in Conversation

[...]

Kurt Shuster¹, Spencer Poff¹, Moya Chen¹, Douwe Kiela¹, Jason Weston¹ - Show less +1 more•Institutions (1)

Facebook¹

15 Apr 2021-arXiv: Computation and Language

TL;DR: This paper explore the use of neural retrieval-in-the-loop architectures for knowledge-grounded dialogue, a task that is arguably more challenging as it requires querying based on complex multi-turn dialogue context and generating conversationally coherent responses.

...read moreread less

Abstract: Despite showing increasingly human-like conversational abilities, state-of-the-art dialogue models often suffer from factual incorrectness and hallucination of knowledge (Roller et al., 2020). In this work we explore the use of neural-retrieval-in-the-loop architectures - recently shown to be effective in open-domain QA (Lewis et al., 2020b; Izacard and Grave, 2020) - for knowledge-grounded dialogue, a task that is arguably more challenging as it requires querying based on complex multi-turn dialogue context and generating conversationally coherent responses. We study various types of architectures with multiple components - retrievers, rankers, and encoder-decoders - with the goal of maximizing knowledgeability while retaining conversational ability. We demonstrate that our best models obtain state-of-the-art performance on two knowledge-grounded conversational tasks. The models exhibit open-domain conversational capabilities, generalize effectively to scenarios not within the training data, and, as verified by human evaluations, substantially reduce the well-known problem of knowledge hallucination in state-of-the-art chatbots.

...read moreread less

35 citations

Journal Article•DOI•

Relational Memory-Augmented Language Models

[...]

Qi Liu, Dani Yogatama, Phil Blunsom

24 Jan 2022-Transactions of the Association for Computational Linguistics

TL;DR: A memory-augmented approach to condition an autoregressive language model on a knowledge graph that represents the graph as a collection of relation triples and retrieve relevant relations for a given context to improve text generation.

...read moreread less

Abstract: We present a memory-augmented approach to condition an autoregressive language model on a knowledge graph. We represent the graph as a collection of relation triples and retrieve relevant relations for a given context to improve text generation. Experiments on WikiText-103, WMT19, and enwik8 English datasets demonstrate that our approach produces a better language model in terms of perplexity and bits per character. We also show that relational memory improves coherence, is complementary to token-based memory, and enables causal interventions. Our model provides a simple yet effective way to combine an autoregressive language model and a knowledge graph for more coherent and logical generation.

...read moreread less

15 citations

Journal Article•DOI•

Unstructured Text Enhanced Open-Domain Dialogue System: A Systematic Survey

[...]

MaLongxuan¹, LiMingda¹, ZhangWei-Nan¹, LiJiapeng¹, LiuTing¹ - Show less +1 more•Institutions (1)

Harbin Institute of Technology¹

31 Jan 2022-ACM Transactions on Information Systems

TL;DR: Incorporating external knowledge into dialogue generation has been proven to benefit the performance of an open-domain Dialogue System (DS), such as generating informative or stylized responses.

...read moreread less

Abstract: Incorporating external knowledge into dialogue generation has been proven to benefit the performance of an open-domain Dialogue System (DS), such as generating informative or stylized responses, co...

...read moreread less

12 citations

Proceedings Article•

Retrieval Augmentation Reduces Hallucination in Conversation

[...]

Kurt Shuster¹, Spencer Poff¹, Moya Chen¹, Douwe Kiela¹, Jason Weston¹ - Show less +1 more•Institutions (1)

Facebook¹

15 Apr 2021

...read moreread less

5 citations

Posted Content•

End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering

[...]

Devendra Singh Sachan, Siva Reddy, William L. Hamilton, Chris Dyer, Dani Yogatama - Show less +1 more

09 Jun 2021-arXiv: Computation and Language

TL;DR: The authors proposed an end-to-end differentiable training method for retrieval-augmented open-domain question answering systems that combine information from multiple retrieved documents when generating answers, using an expectation-maximization algorithm.

...read moreread less

Abstract: We present an end-to-end differentiable training method for retrieval-augmented open-domain question answering systems that combine information from multiple retrieved documents when generating answers. We model retrieval decisions as latent variables over sets of relevant documents. Since marginalizing over sets of retrieved documents is computationally hard, we approximate this using an expectation-maximization algorithm. We iteratively estimate the value of our latent variable (the set of relevant documents for a given question) and then use this estimate to update the retriever and reader parameters. We hypothesize that such end-to-end training allows training signals to flow to the reader and then to the retriever better than staged-wise training. This results in a retriever that is able to select more relevant documents for a question and a reader that is trained on more accurate documents to generate an answer. Experiments on three benchmark datasets demonstrate that our proposed method outperforms all existing approaches of comparable size by 2-3% absolute exact match points, achieving new state-of-the-art results. Our results also demonstrate the feasibility of learning to retrieve to improve answer generation without explicit supervision of retrieval decisions.

...read moreread less

5 citations

References

PDF

Open Access

More filters

Proceedings Article•

Adam: A Method for Stochastic Optimization

[...]

Diederik P. Kingma¹, Jimmy Ba²•Institutions (2)

University of Amsterdam¹, University of Toronto²

01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

...read moreread less

Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

...read moreread less

111,197 citations

Proceedings Article•

Attention is All you Need

[...]

Ashish Vaswani¹, Noam Shazeer¹, Niki Parmar², Jakob Uszkoreit¹, Llion Jones¹, Aidan N. Gomez¹, Lukasz Kaiser¹, Illia Polosukhin¹ - Show less +4 more•Institutions (2)

Google¹, University of Southern California²

12 Jun 2017

TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.

...read moreread less

Abstract: The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attentionm echanisms. We propose a novel, simple network architecture based solely onan attention mechanism, dispensing with recurrence and convolutions entirely.Experiments on two machine translation tasks show these models to be superiorin quality while being more parallelizable and requiring significantly less timeto train. Our single model with 165 million parameters, achieves 27.5 BLEU onEnglish-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previoussingle state-of-the-art with model by 0.7 BLEU, achieving a BLEU score of 41.1.

...read moreread less

52,856 citations

Proceedings Article•DOI•

Aggregated Residual Transformations for Deep Neural Networks

[...]

Saining Xie¹, Ross Girshick², Piotr Dollár², Zhuowen Tu¹, Kaiming He² - Show less +1 more•Institutions (2)

University of California, San Diego¹, Facebook²

21 Jul 2017

TL;DR: ResNeXt as discussed by the authors is a simple, highly modularized network architecture for image classification, which is constructed by repeating a building block that aggregates a set of transformations with the same topology.

...read moreread less

Abstract: We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call cardinality (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online.

...read moreread less

7,183 citations

Proceedings Article•DOI•

Neural Machine Translation of Rare Words with Subword Units

[...]

Rico Sennrich, Barry Haddow, Alexandra Birch

12 Aug 2016

TL;DR: This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.

...read moreread less

Abstract: Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character ngram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English!German and English!Russian by up to 1.1 and 1.3 BLEU, respectively.

...read moreread less

6,898 citations

Proceedings Article•

Building end-to-end dialogue systems using generative hierarchical neural network models

[...]

Iulian Vlad Serban¹, Alessandro Sordoni¹, Yoshua Bengio¹, Aaron Courville¹, Joelle Pineau² - Show less +1 more•Institutions (2)

Université de Montréal¹, McGill University²

12 Feb 2016

TL;DR: The authors extend the hierarchical recurrent encoder-decoder neural network to the dialogue domain, and demonstrate that this model is competitive with state-of-the-art neural language models and backoff n-gram models.

...read moreread less

Abstract: We investigate the task of building open domain, conversational dialogue systems based on large dialogue corpora using generative models. Generative models produce system responses that are autonomously generated word-by-word, opening up the possibility for realistic, flexible interactions. In support of this goal, we extend the recently proposed hierarchical recurrent encoder-decoder neural network to the dialogue domain, and demonstrate that this model is competitive with state-of-the-art neural language models and backoff n-gram models. We investigate the limitations of this and similar approaches, and show how its performance can be improved by bootstrapping the learning from a larger question-answer pair corpus and from pretrained word embeddings.

...read moreread less

1,533 citations