Showing papers in "arXiv: Computation and Language in 2017"

Posted Content•

Attention Is All You Need

Ashish Vaswani¹, Noam Shazeer¹, Niki Parmar², Jakob Uszkoreit¹, Llion Jones¹, Aidan N. Gomez¹, Lukasz Kaiser¹, Illia Polosukhin¹•Institutions (2)

Google¹, University of Southern California²

12 Jun 2017-arXiv: Computation and Language

TL;DR: This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms that dispenses with recurrence and convolutions entirely, and shows that it generalizes to other tasks by applying it to English constituency parsing with both large and limited training data.

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
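
As a rough illustration of the kind of attention mechanism the abstract refers to, the sketch below implements scaled dot-product attention in plain Python. The abstract itself does not give the formula; the function names, the list-of-lists layout, and the toy inputs are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors.

    Weights are softmax(q . k / sqrt(d_k)) over all keys.
    All arguments are lists of equal-length float lists (hypothetical layout).
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs
```

With an all-zero query every key receives the same score, so the output is simply the mean of the value vectors.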

7,019 citations

Posted Content•

Convolutional Sequence to Sequence Learning

Jonas Gehring¹, Michael Auli¹, David Grangier¹, Denis Yarats¹, Yann N. Dauphin¹•Institutions (1)

Facebook¹

08 May 2017-arXiv: Computation and Language

TL;DR: The authors introduced an architecture based entirely on convolutional neural networks, where computations over all elements can be fully parallelized during training and optimization is easier since the number of nonlinearities is fixed and independent of the input length.

Abstract: The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. Compared to recurrent models, computations over all elements can be fully parallelized during training and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
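
The gated linear units mentioned above can be sketched as follows, assuming the usual formulation in which one linear projection is gated elementwise by the sigmoid of a second projection. The names `a` and `b` for the two projections are hypothetical, and the sketch omits the convolutions that would produce them.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def glu(a, b):
    """Gated linear unit: elementwise a * sigmoid(b).

    `a` carries the content and `b` decides how much of it passes through.
    The path through `a` is linear, which is the property usually credited
    with easing gradient propagation.
    """
    return [ai * sigmoid(bi) for ai, bi in zip(a, b)]
```

A gate input of zero lets exactly half of the content through, for example `glu([2.0], [0.0])` gives `[1.0]`.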

1,189 citations

Proceedings Article•DOI•

SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, Lucia Specia

31 Jul 2017-arXiv: Computation and Language

TL;DR: The STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017), providing insight into the limitations of existing models.

Abstract: Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).

1,124 citations

Posted Content•

A Deep Reinforced Model for Abstractive Summarization

Romain Paulus, Caiming Xiong, Richard Socher

11 May 2017-arXiv: Computation and Language

TL;DR: A neural network model with a novel intra-attention that attends over the input and continuously generated output separately, and a new training method that combines standard supervised word prediction and reinforcement learning (RL) that produces higher quality summaries.

Abstract: Attentional, RNN-based encoder-decoder models for abstractive summarization have achieved good performance on short input and output sequences. For longer documents and summaries, however, these models often include repetitive and incoherent phrases. We introduce a neural network model with a novel intra-attention that attends over the input and continuously generated output separately, and a new training method that combines standard supervised word prediction and reinforcement learning (RL). Models trained only with supervised learning often exhibit "exposure bias" - they assume ground truth is provided at each step during training. However, when standard word prediction is combined with the global sequence prediction training of RL, the resulting summaries become more readable. We evaluate this model on the CNN/Daily Mail and New York Times datasets. Our model obtains a 41.16 ROUGE-1 score on the CNN/Daily Mail dataset, an improvement over previous state-of-the-art models. Human evaluation also shows that our model produces higher quality summaries.

1,119 citations

Posted Content•

Reading Wikipedia to Answer Open-Domain Questions

Danqi Chen¹, Adam Fisch², Jason Weston², Antoine Bordes²•Institutions (2)

Stanford University¹, Facebook²

31 Mar 2017-arXiv: Computation and Language

TL;DR: This paper combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answer spans in Wikipedia paragraphs.

Abstract: This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
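
A search component of the kind described (bigram hashing plus TF-IDF matching) might look roughly like the sketch below. This is not the authors' retriever: the CRC32 hash, the bucket count, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen to keep the example short and deterministic.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 16  # hypothetical size of the hashed feature space


def hashed_ngrams(text):
    """Count unigrams and bigrams after hashing each one into a bucket."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams)


def build_index(docs):
    """Return sparse TF-IDF vectors (dicts) for the documents, plus the IDF table."""
    doc_counts = [hashed_ngrams(d) for d in docs]
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())
    n = len(docs)
    # Smoothed IDF; one common choice, assumed here.
    idf = {b: math.log((n + 1) / (df + 1)) + 1.0 for b, df in doc_freq.items()}
    vectors = [
        {b: math.log1p(tf) * idf[b] for b, tf in counts.items()}
        for counts in doc_counts
    ]
    return vectors, idf


def rank(query, vectors, idf):
    """Return document indices sorted by descending dot-product score."""
    q = {b: math.log1p(tf) * idf.get(b, 0.0) for b, tf in hashed_ngrams(query).items()}
    scores = [sum(w * vec.get(b, 0.0) for b, w in q.items()) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
```

Hashing bigrams into a fixed number of buckets keeps memory bounded while still rewarding documents that preserve local word order, at the cost of occasional collisions.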

1,100 citations

Posted Content•

Recent Trends in Deep Learning Based Natural Language Processing

Tom Young¹, Devamanyu Hazarika², Soujanya Poria³, Erik Cambria³•Institutions (3)

Beijing Institute of Technology¹, National University of Singapore², Nanyang Technological University³

09 Aug 2017-arXiv: Computation and Language

TL;DR: This paper reviews significant deep learning models and methods employed for numerous natural language processing (NLP) tasks and provides a walk-through of their evolution.

Abstract: Deep learning methods employ multiple processing layers to learn hierarchical representations of data and have produced state-of-the-art results in many domains. Recently, a variety of model designs and methods have blossomed in the context of natural language processing (NLP). In this paper, we review significant deep learning related models and methods that have been employed for numerous NLP tasks and provide a walk-through of their evolution. We also summarize, compare and contrast the various models and put forward a detailed understanding of the past, present and future of deep learning in NLP.

997 citations

Posted Content•

A simple neural network module for relational reasoning

Adam Santoro¹, David Raposo², David G. T. Barrett¹, Mateusz Malinowski³, Razvan Pascanu¹, Peter W. Battaglia¹, Timothy P. Lillicrap¹•Institutions (3)

Google¹, Cold Spring Harbor Laboratory², Max Planck Society³

05 Jun 2017-arXiv: Computation and Language

TL;DR: This work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.

Abstract: Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn. In this paper we describe how to use Relation Networks (RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning. We tested RN-augmented networks on three tasks: visual question answering using a challenging dataset called CLEVR, on which we achieve state-of-the-art, super-human performance; text-based question answering using the bAbI suite of tasks; and complex reasoning about dynamic physical systems. Then, using a curated dataset called Sort-of-CLEVR we show that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with RNs. Our work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.

943 citations

Posted Content•

Regularizing and Optimizing LSTM Language Models

Stephen Merity¹, Nitish Shirish Keskar¹, Richard Socher¹•Institutions (1)

Salesforce.com¹

07 Aug 2017-arXiv: Computation and Language

TL;DR: This paper proposes the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization and introduces NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user.

Abstract: Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Further, we introduce NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, we achieve state-of-the-art word level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2.
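
One plausible reading of the non-monotonic averaging trigger is sketched below: averaging starts at the first validation check whose perplexity is worse than the best value seen more than `patience` checks earlier. The abstract does not spell out the condition, so the function name, the `patience` parameter, and the exact comparison are assumptions.

```python
def averaging_trigger_step(val_perplexities, patience):
    """Return the index of the first validation check that triggers averaging.

    A check at index t triggers when t > patience and its perplexity is
    higher than the minimum over checks 0 .. t - patience - 1, i.e. the
    model has failed to beat an older best for several checks in a row.
    Returns None if the condition never fires.
    """
    for t, value in enumerate(val_perplexities):
        if t > patience and value > min(val_perplexities[: t - patience]):
            return t
    return None
```

Because the condition looks back over a window rather than reacting to a single bad check, a brief uptick in validation perplexity does not start averaging prematurely.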

899 citations

Posted Content•

Get To The Point: Summarization with Pointer-Generator Networks

Abigail See¹, Peter J. Liu², Christopher D. Manning¹•Institutions (2)

Stanford University¹, Google²

14 Apr 2017-arXiv: Computation and Language

TL;DR: This paper proposed a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator.

Abstract: Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points.
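
The hybrid of pointing and generating can be illustrated with a small sketch, assuming the final word distribution is a mixture of a vocabulary distribution and the attention weights over source tokens, controlled by a generation probability. The names `p_gen`, `vocab_dist`, and `attention` are hypothetical, and the toy numbers are invented for illustration.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generating from the vocabulary with copying from the source.

    p_gen         : probability of generating rather than copying (0..1)
    vocab_dist    : dict mapping vocabulary word -> probability
    attention     : attention weight for each source position (sums to 1)
    source_tokens : the source words, aligned with `attention`
    """
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for weight, token in zip(attention, source_tokens):
        # Copy mass is added per source position, so repeated source words
        # accumulate, and out-of-vocabulary words receive non-zero probability.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final
```

In this formulation a source word missing from the vocabulary can still be produced, which is the mechanism by which copying helps reproduce factual details such as rare names.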

881 citations

Posted Content•

Automated Hate Speech Detection and the Problem of Offensive Language

Thomas Davidson¹, Dana Warmsley¹, Michael W. Macy¹, Ingmar Weber²•Institutions (2)

Cornell University¹, Khalifa University²

11 Mar 2017-arXiv: Computation and Language

TL;DR: This article used a crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords and trained a multi-class classifier to distinguish hate speech from other offensive language, finding that racist and homophobic tweets are more likely to be classified as hate speech but that sexist tweets are generally classified as offensive.

Abstract: A key challenge for automatic hate-speech detection on social media is the separation of hate speech from other instances of offensive language. Lexical detection methods tend to have low precision because they classify all messages containing particular terms as hate speech and previous work using supervised learning has failed to distinguish between the two categories. We used a crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords. We use crowd-sourcing to label a sample of these tweets into three categories: those containing hate speech, only offensive language, and those with neither. We train a multi-class classifier to distinguish between these different categories. Close analysis of the predictions and the errors shows when we can reliably separate hate speech from other offensive language and when this differentiation is more difficult. We find that racist and homophobic tweets are more likely to be classified as hate speech but that sexist tweets are generally classified as offensive. Tweets without explicit hate keywords are also more difficult to classify.

871 citations

Posted Content•

Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning

Victor Zhong, Caiming Xiong, Richard Socher

31 Aug 2017-arXiv: Computation and Language

TL;DR: This work proposes Seq2SQL, a deep neural network for translating natural language questions to corresponding SQL queries, and releases WikiSQL, a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia that is an order of magnitude larger than comparable datasets.

Abstract: A significant amount of the world's knowledge is stored in relational databases. However, the ability for users to retrieve facts from a database is limited due to a lack of understanding of query languages such as SQL. We propose Seq2SQL, a deep neural network for translating natural language questions to corresponding SQL queries. Our model leverages the structure of SQL queries to significantly reduce the output space of generated queries. Moreover, we use rewards from in-the-loop query execution over the database to learn a policy to generate unordered parts of the query, which we show are less suitable for optimization via cross entropy loss. In addition, we will publish WikiSQL, a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia. This dataset is required to train our model and is an order of magnitude larger than comparable datasets. By applying policy-based reinforcement learning with a query execution environment to WikiSQL, our model Seq2SQL outperforms attentional sequence to sequence models, improving execution accuracy from 35.9% to 59.4% and logical form accuracy from 23.4% to 48.3%.

Posted Content•

Advances in Pre-Training Distributed Word Representations

Tomas Mikolov¹, Edouard Grave¹, Piotr Bojanowski¹, Christian Puhrsch², Armand Joulin¹•Institutions (2)

Facebook¹, Courant Institute of Mathematical Sciences²

26 Dec 2017-arXiv: Computation and Language

TL;DR: This paper shows how to train high-quality word vector representations by combining known tricks that are rarely used together, and releases pre-trained models that outperform the state of the art by a large margin on a number of tasks.

Abstract: Many Natural Language Processing applications nowadays rely on pre-trained word representations estimated from large text corpora such as news collections, Wikipedia and Web Crawl. In this paper, we show how to train high-quality word vector representations by using a combination of known tricks that are however rarely used together. The main result of our work is the new set of publicly available pre-trained models that outperform the current state of the art by a large margin on a number of tasks.

Posted Content•

A Structured Self-attentive Sentence Embedding

Zhouhan Lin¹, Minwei Feng², Cicero Nogueira dos Santos², Mo Yu², Bing Xiang², Bowen Zhou², Yoshua Bengio¹•Institutions (2)

Université de Montréal¹, IBM²

09 Mar 2017-arXiv: Computation and Language

TL;DR: This paper proposed a self-attention mechanism and a special regularization term for the model, which achieved a significant performance gain compared to other sentence embedding methods in all of the three tasks.

Abstract: This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of using a vector, we use a 2-D matrix to represent the embedding, with each row of the matrix attending on a different part of the sentence. We also propose a self-attention mechanism and a special regularization term for the model. As a side effect, the embedding comes with an easy way of visualizing what specific parts of the sentence are encoded into the embedding. We evaluate our model on 3 different tasks: author profiling, sentiment classification, and textual entailment. Results show that our model yields a significant performance gain compared to other sentence embedding methods in all of the 3 tasks.
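
To make the idea of rows "attending on a different part of the sentence" concrete, the sketch below computes a redundancy penalty on an attention matrix `A` (one row per attention hop, one column per token). It assumes the regularization term takes the form of the squared Frobenius norm of `A Aᵀ − I`; the abstract does not state the formula, so treat this as an illustrative assumption.

```python
def attention_penalty(A):
    """Squared Frobenius norm of (A @ A^T - I) for a list-of-lists matrix A.

    Off-diagonal entries of A @ A^T measure overlap between two attention
    rows, so the penalty grows when different rows focus on the same tokens.
    """
    rows = len(A)
    total = 0.0
    for i in range(rows):
        for j in range(rows):
            dot = sum(x * y for x, y in zip(A[i], A[j]))
            target = 1.0 if i == j else 0.0
            total += (dot - target) ** 2
    return total
```

Two rows that each focus on a different single token incur no penalty, while two rows that focus on the same token are penalized.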

Posted Content•

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Jonathan Shen, Ruoming Pang, Ron Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu

16 Dec 2017-arXiv: Computation and Language

TL;DR: Tacotron 2 uses a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms.

Abstract: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.

Posted Content•

Comparative Study of CNN and RNN for Natural Language Processing

Wenpeng Yin, Katharina Kann, Mo Yu, Hinrich Schütze

07 Feb 2017-arXiv: Computation and Language

TL;DR: This work is the first systematic comparison of CNN and RNN on a wide range of representative NLP tasks, aiming to give basic guidance for DNN selection.

Abstract: Deep neural networks (DNN) have revolutionized the field of natural language processing (NLP). Convolutional neural network (CNN) and recurrent neural network (RNN), the two main types of DNN architectures, are widely explored to handle various NLP tasks. CNN is supposed to be good at extracting position-invariant features and RNN at modeling units in sequence. The state of the art on many NLP tasks often switches due to the battle between CNNs and RNNs. This work is the first systematic comparison of CNN and RNN on a wide range of representative NLP tasks, aiming to give basic guidance for DNN selection.

Posted Content•

Adversarial Examples for Evaluating Reading Comprehension Systems

Robin Jia¹, Percy Liang¹•Institutions (1)

Stanford University¹

23 Jul 2017-arXiv: Computation and Language

TL;DR: This paper proposed an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD) to test whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans.

Abstract: Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of 75% F1 score to 36%; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to 7%. We hope our insights will motivate the development of new models that understand language more precisely.

Posted Content•

Adversarial Learning for Neural Dialogue Generation

Jiwei Li¹, Will S. Monroe¹, Tianlin Shi², Sébastien Jean³, Alan Ritter⁴, Dan Jurafsky¹•Institutions (4)

Stanford University¹, Tsinghua University², Courant Institute of Mathematical Sciences³, Ohio State University⁴

23 Jan 2017-arXiv: Computation and Language

TL;DR: This paper proposed using adversarial training for open-domain dialogue generation, where the generator is trained to generate sequences that are indistinguishable from human-generated dialogue utterances, and the outputs from the discriminator are used as rewards for the generator.

Abstract: In this paper, drawing intuition from the Turing test, we propose using adversarial training for open-domain dialogue generation: the system is trained to produce sequences that are indistinguishable from human-generated dialogue utterances. We cast the task as a reinforcement learning (RL) problem where we jointly train two systems, a generative model to produce response sequences, and a discriminator (analogous to the human evaluator in the Turing test) to distinguish between the human-generated dialogues and the machine-generated ones. The outputs from the discriminator are then used as rewards for the generative model, pushing the system to generate dialogues that mostly resemble human dialogues. In addition to adversarial training we describe a model for adversarial evaluation that uses success in fooling an adversary as a dialogue evaluation metric, while avoiding a number of potential pitfalls. Experimental results on several metrics, including adversarial evaluation, demonstrate that the adversarially-trained system generates higher-quality responses than previous baselines.

Posted Content•

Learned in Translation: Contextualized Word Vectors

Bryan McCann¹, James Bradbury¹, Caiming Xiong¹, Richard Socher¹•Institutions (1)

Salesforce.com¹

01 Aug 2017-arXiv: Computation and Language

TL;DR: The authors used a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation (MT) to contextualize word vectors and showed that adding these context vectors (CoVe) improved performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks.

Abstract: Computer vision has benefited from initializing multiple deep layers with weights pretrained on large supervised training sets like ImageNet. Natural language processing (NLP) typically sees initialization of only the lowest layer of deep models with pretrained word vectors. In this paper, we use a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation (MT) to contextualize word vectors. We show that adding these context vectors (CoVe) improves performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD). For fine-grained sentiment analysis and entailment, CoVe improves performance of our baseline models to the state of the art.

Posted Content•

Tacotron: Towards End-to-End Speech Synthesis

Yuxuan Wang¹, RJ Skerry-Ryan¹, Daisy Stanton¹, Yonghui Wu¹, Ron Weiss¹, Navdeep Jaitly², Zongheng Yang³, Ying Xiao⁴, Zhifeng Chen¹, Samy Bengio¹, Quoc V. Le¹, Yannis Agiomyrgiannakis¹, Robert A. J. Clark⁵, Rif A. Saurous¹•Institutions (5)

Google¹, University of Toronto², University of California, Berkeley³, Palantir Technologies⁴, University of Edinburgh⁵

29 Mar 2017-arXiv: Computation and Language

TL;DR: This paper presents Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.

Abstract: A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.

Posted Content•

Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling

Diego Marcheggiani¹, Ivan Titov•Institutions (1)

University of Amsterdam¹

14 Mar 2017-arXiv: Computation and Language

TL;DR: The authors proposed a graph convolutional network (GCN) to model syntactic dependency graphs for semantic role labeling (SRL) and achieved state-of-the-art performance on the standard benchmark (CoNLL-2009) both for Chinese and English.

Abstract: Semantic role labeling (SRL) is the task of identifying the predicate-argument structure of a sentence. It is typically regarded as an important step in the standard NLP pipeline. As the semantic representations are closely related to syntactic ones, we exploit syntactic information in our model. We propose a version of graph convolutional networks (GCNs), a recent class of neural networks operating on graphs, suited to model syntactic dependency graphs. GCNs over syntactic dependency trees are used as sentence encoders, producing latent feature representations of words in a sentence. We observe that GCN layers are complementary to LSTM ones: when we stack both GCN and LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009) both for Chinese and English.

Posted Content•

Word Translation Without Parallel Data

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou

11 Oct 2017-arXiv: Computation and Language

TL;DR: The authors align monolingual word embedding spaces in an unsupervised way, without using any character information, and show that their model even outperforms existing supervised methods on cross-lingual tasks for some language pairs.

Abstract: State-of-the-art methods for learning cross-lingual word embeddings have relied on bilingual dictionaries or parallel corpora. Recent studies showed that the need for parallel data supervision can be alleviated with character-level information. While these methods showed encouraging results, they are not on par with their supervised counterparts and are limited to pairs of languages sharing a common alphabet. In this work, we show that we can build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way. Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs. Our experiments demonstrate that our method works very well also for distant language pairs, like English-Russian or English-Chinese. We finally describe experiments on the English-Esperanto low-resource language pair, on which there only exists a limited amount of parallel data, to show the potential impact of our method in fully unsupervised machine translation. Our code, embeddings and dictionaries are publicly available.

Posted Content•

HotFlip: White-Box Adversarial Examples for Text Classification

Javid Ebrahimi¹, Anyi Rao², Daniel Lowd¹, Dejing Dou¹•Institutions (2)

University of Oregon¹, The Chinese University of Hong Kong²

19 Dec 2017-arXiv: Computation and Language

TL;DR: The authors propose an efficient method to generate white-box adversarial examples that trick a character-level neural classifier. The method relies on an atomic flip operation, which swaps one token for another based on the gradients of the one-hot input vectors.

Abstract: We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to efficiency of our method, we can perform adversarial training which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well.
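
The flip selection described in the abstract can be illustrated with a minimal sketch. It assumes that one backward pass has already produced the gradient of the loss with respect to every one-hot input entry, and it scores a flip by the first-order estimate of the loss change (gradient at the new character minus gradient at the current one). The function name `best_flip` and the toy numbers are hypothetical, and this is not the authors' implementation.

```python
from typing import List, Tuple


def best_flip(one_hot_grad: List[List[float]],
              text_ids: List[int]) -> Tuple[int, int, float]:
    """Pick the single character flip with the largest estimated loss increase.

    one_hot_grad[i][c] is assumed to hold d(loss)/d(one-hot entry for
    character c at position i), computed elsewhere by backpropagation.
    text_ids[i] is the index of the character currently at position i.
    Returns (position, new_character_id, estimated_gain).
    """
    best = (-1, -1, float("-inf"))
    for i, current in enumerate(text_ids):
        row = one_hot_grad[i]
        for c, grad_c in enumerate(row):
            if c == current:
                continue  # flipping a character to itself changes nothing
            # First-order estimate: turning entry c on and entry `current` off.
            gain = grad_c - row[current]
            if gain > best[2]:
                best = (i, c, gain)
    return best


# Toy example: a 2-character text over a 3-character alphabet.
toy_grad = [[0.0, 0.5, -0.25],
            [0.0, -0.5, 1.0]]
toy_text = [0, 1]
position, new_char, gain = best_flip(toy_grad, toy_text)
```

In this toy case the chosen flip replaces the character at position 1 with character 2. A practical attack would repeat the step a few times and recompute the gradients after each flip.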

Posted Content•

SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur Guney, Volkan Cirik, Kyunghyun Cho

18 Apr 2017-arXiv: Computation and Language

TL;DR: SearchQA, a large-scale question-answering dataset that augments existing question-answer pairs with retrieved text snippets, shows a meaningful gap between human and machine performance, which suggests that it could serve as a benchmark for question-answering.

Abstract: We publicly release a new large-scale dataset, called SearchQA, for machine comprehension, or question-answering. Unlike recently released datasets, such as DeepMind CNN/DailyMail and SQuAD, the proposed SearchQA was constructed to reflect a full pipeline of general question-answering. That is, we start not from an existing article and generate a question-answer pair, but start from an existing question-answer pair, crawled from J! Archive, and augment it with text snippets retrieved by Google. Following this approach, we built SearchQA, which consists of more than 140k question-answer pairs with each pair having 49.6 snippets on average. Each question-answer-context tuple of the SearchQA comes with additional meta-data such as the snippet's URL, which we believe will be valuable resources for future research. We conduct human evaluation as well as test two baseline methods, one simple word selection and the other deep learning based, on the SearchQA. We show that there is a meaningful gap between the human and machine performances. This suggests that the proposed dataset could well serve as a benchmark for question-answering.

Posted Content•

Deep Speaker: an End-to-End Neural Speaker Embedding System

Chao Li, Ma Xiaokong, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu

05 May 2017-arXiv: Computation and Language

TL;DR: Deep Speaker, a neural speaker embedding system trained with a cosine-similarity triplet loss, is reported to outperform a DNN-based i-vector baseline on three datasets, and the results suggest that adapting from a model trained with Mandarin can improve accuracy for English speaker recognition.

Abstract: We present Deep Speaker, a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings generated by Deep Speaker can be used for many tasks, including speaker identification, verification, and clustering. We experiment with ResCNN and GRU architectures to extract the acoustic features, then mean pool to produce utterance-level speaker embeddings, and train using triplet loss based on cosine similarity. Experiments on three distinct datasets suggest that Deep Speaker outperforms a DNN-based i-vector baseline. For example, Deep Speaker reduces the verification equal error rate by 50% (relatively) and improves the identification accuracy by 60% (relatively) on a text-independent dataset. We also present results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition.
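
The pooling and loss described in the abstract can be sketched as follows: frame-level features are mean pooled into one utterance-level vector, and a triplet loss compares cosine similarities for an anchor, a same-speaker positive, and a different-speaker negative. The margin value, the function names, and the toy vectors are assumptions for illustration. The ResCNN or GRU feature extractor is omitted.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level feature vectors into one utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.1) -> float:
    """Hinge on (anchor-negative similarity) - (anchor-positive similarity).

    The margin of 0.1 is an assumed value, not taken from the text above.
    """
    return max(0.0, cosine(anchor, negative) - cosine(anchor, positive) + margin)


# Toy usage: two frames pooled into an utterance embedding.
utterance = mean_pool([[1.0, 2.0], [3.0, 4.0]])
easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, so only triplets that violate the margin contribute to training.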

Posted Content•

"Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection

William Yang Wang¹•Institutions (1)

University of California, Santa Barbara¹

01 May 2017-arXiv: Computation and Language

TL;DR: Liar is a publicly available dataset for fake news detection, containing 12.8K manually labeled short statements in various contexts from this http URL, which provides a detailed analysis report and links to source documents for each case.

Abstract: Automatic fake news detection is a challenging problem in deception detection, and it has tremendous real-world political and social impacts. However, statistical approaches to combating fake news have been dramatically limited by the lack of labeled benchmark datasets. In this paper, we present liar: a new, publicly available dataset for fake news detection. We collected a decade-long, 12.8K manually labeled short statements in various contexts from this http URL, which provides a detailed analysis report and links to source documents for each case. This dataset can be used for fact-checking research as well. Notably, this new dataset is an order of magnitude larger than the previously largest public fake news datasets of similar type. Empirically, we investigate automatic fake news detection based on surface-level linguistic patterns. We have designed a novel, hybrid convolutional neural network to integrate meta-data with text. We show that this hybrid approach can improve a text-only deep learning model.

Posted Content•

A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques

Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, Krys J. Kochut

10 Jul 2017-arXiv: Computation and Language

TL;DR: This survey describes several of the most fundamental text mining tasks and techniques, including text pre-processing, classification and clustering, and briefly explains text mining in the biomedical and health care domains.

Abstract: The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques including text pre-processing, classification and clustering. Additionally, we briefly explain text mining in biomedical and health care domains.

Posted Content•

Non-Autoregressive Neural Machine Translation

Jiatao Gu¹, James Bradbury², Caiming Xiong², Victor O. K. Li¹, Richard Socher²•Institutions (2)

University of Hong Kong¹, Salesforce.com²

07 Nov 2017-arXiv: Computation and Language

TL;DR: The authors introduce a translation model that avoids the autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference. It is trained with knowledge distillation, input token fertilities as a latent variable, and policy gradient fine-tuning.

Abstract: Existing approaches to neural machine translation condition each output word on previously generated outputs. We introduce a model that avoids this autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference. Through knowledge distillation, the use of input token fertilities as a latent variable, and policy gradient fine-tuning, we achieve this at a cost of as little as 2.0 BLEU points relative to the autoregressive Transformer network used as a teacher. We demonstrate substantial cumulative improvements associated with each of the three aspects of our training strategy, and validate our approach on IWSLT 2016 English-German and two WMT language pairs. By sampling fertilities in parallel at inference time, our non-autoregressive model achieves near-state-of-the-art performance of 29.8 BLEU on WMT 2016 English-Romanian.
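
The abstract does not spell out how fertilities are used. One common reading is that each source token is copied as many times as its fertility, which fixes the output length up front and gives every output position an input without waiting for earlier outputs. The sketch below illustrates only that copying step under this assumption. The function name and the example tokens are hypothetical.

```python
from typing import List, Sequence


def copy_by_fertility(source_tokens: Sequence[str],
                      fertilities: Sequence[int]) -> List[str]:
    """Repeat each source token according to its (assumed given) fertility.

    A fertility of 0 drops the token, and a fertility of k copies it k times.
    The length of the result equals sum(fertilities).
    """
    if len(source_tokens) != len(fertilities):
        raise ValueError("need exactly one fertility per source token")
    decoder_inputs: List[str] = []
    for token, fertility in zip(source_tokens, fertilities):
        decoder_inputs.extend([token] * fertility)
    return decoder_inputs


# Toy usage: the second token is dropped and the first is duplicated.
inputs = copy_by_fertility(["we", "totally", "accept"], [2, 0, 1])
```

Because all positions are known once the fertilities are chosen, a decoder could process them in a single parallel pass rather than one token at a time.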

Posted Content•

RACE: Large-scale ReAding Comprehension Dataset From Examinations

Guokun Lai¹, Qizhe Xie², Hanxiao Liu¹, Yiming Yang¹, Eduard Hovy¹•Institutions (2)

Carnegie Mellon University¹, Shanghai Jiao Tong University²

15 Apr 2017-arXiv: Computation and Language

TL;DR: The proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of state-of-the-art models (43%) and the ceiling human performance (95%).

Abstract: We present RACE, a new dataset for benchmark evaluation of methods in the reading comprehension task. Collected from the English exams for middle and high school Chinese students aged between 12 and 18, RACE consists of nearly 28,000 passages and nearly 100,000 questions generated by human experts (English instructors), and covers a variety of topics which are carefully designed for evaluating the students' ability in understanding and reasoning. In particular, the proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of the state-of-the-art models (43%) and the ceiling human performance (95%). We hope this new dataset can serve as a valuable resource for research and evaluation in machine comprehension. The dataset is freely available at this http URL and the code is available at this https URL.

Posted Content•

Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders

Tiancheng Zhao¹, Ran Zhao¹, Maxine Eskenazi¹•Institutions (1)

Carnegie Mellon University¹

31 Mar 2017-arXiv: Computation and Language

TL;DR: This paper proposes a framework based on conditional variational autoencoders that captures discourse-level diversity in the encoder. It uses latent variables to learn a distribution over potential conversational intents and generates diverse responses using only greedy decoders.

Abstract: While recent neural encoder-decoder models have shown great promise in modeling open-domain conversations, they often generate dull and generic responses. Unlike past work that has focused on diversifying the output of the decoder at word-level to alleviate this problem, we present a novel framework based on conditional variational autoencoders that captures the discourse-level diversity in the encoder. Our model uses latent variables to learn a distribution over potential conversational intents and generates diverse responses using only greedy decoders. We have further developed a novel variant that is integrated with linguistic prior knowledge for better performance. Finally, the training procedure is improved by introducing a bag-of-word loss. Our proposed models have been validated to generate significantly more diverse responses than baseline approaches and exhibit competence in discourse-level decision-making.

Proceedings Article•DOI•

Deep Learning for Hate Speech Detection in Tweets

Pinkesh Badjatiya¹, Shashank Gupta¹, Manish Gupta², Vasudeva Varma¹•Institutions (2)

International Institute of Information Technology, Hyderabad¹, International Institute of Information Technology²

01 Jun 2017-arXiv: Computation and Language

TL;DR: The authors perform extensive experiments with multiple deep learning architectures that learn semantic word embeddings for hate speech detection on Twitter. On a benchmark dataset of 16K annotated tweets, these methods outperform state-of-the-art char/word n-gram methods by ~18 F1 points.

Abstract: Hate speech detection on Twitter is critical for applications like controversial event extraction, building AI chatterbots, content recommendation, and sentiment analysis. We define this task as being able to classify a tweet as racist, sexist or neither. The complexity of the natural language constructs makes this task very challenging. We perform extensive experiments with multiple deep learning architectures to learn semantic word embeddings to handle this complexity. Our experiments on a benchmark dataset of 16K annotated tweets show that such deep learning methods outperform state-of-the-art char/word n-gram methods by ~18 F1 points.
