Proceedings ArticleDOI

Recurrent neural networks for language understanding.

25 Aug 2013 - pp. 2524-2528
TL;DR: This paper modifies the RNN-LM architecture to perform language understanding and advances the state of the art on the widely used ATIS dataset.
Abstract: Recurrent Neural Network Language Models (RNN-LMs) have recently shown exceptional performance across a variety of applications. In this paper, we modify the architecture to perform Language Understanding, and advance the state-of-the-art for the widely used ATIS dataset. The core of our approach is to take words as input as in a standard RNN-LM, and then to predict slot labels rather than words on the output side. We present several variations that differ in the amount of word context that is used on the input side, and in the use of non-lexical features. Remarkably, our simplest model produces state-of-the-art results, and we advance state-of-the-art through the use of bag-of-words, word embedding, named-entity, syntactic, and word-class features. Analysis indicates that the superior performance is attributable to the task-specific word representations learned by the RNN.
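
The core modification lends itself to a compact sketch. Below is a minimal, hypothetical forward pass of an Elman-style RNN tagger in the spirit described above: words are consumed one at a time as in an RNN-LM, but the softmax at each step is taken over slot labels rather than over the vocabulary. All sizes, names, and the random initialization are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): vocabulary, embedding,
# hidden state, and number of slot labels.
V, E, H, L = 1000, 50, 100, 20

emb = rng.normal(0.0, 0.1, (V, E))    # word embeddings
W_xh = rng.normal(0.0, 0.1, (E, H))   # input -> hidden
W_hh = rng.normal(0.0, 0.1, (H, H))   # hidden -> hidden (recurrence)
W_hy = rng.normal(0.0, 0.1, (H, L))   # hidden -> slot-label scores

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tag(word_ids):
    """Return one slot-label distribution per input word."""
    h = np.zeros(H)
    outputs = []
    for w in word_ids:
        x = emb[w]                         # look up the word, as in an RNN-LM
        h = np.tanh(x @ W_xh + h @ W_hh)   # Elman recurrence over the word sequence
        outputs.append(softmax(h @ W_hy))  # predict a slot label, not the next word
    return outputs

# Example: a five-word utterance encoded as word ids.
dists = tag([12, 7, 256, 3, 981])
print([int(d.argmax()) for d in dists])    # predicted slot-label ids
```

Training such a tagger then amounts to backpropagation through time against the reference slot labels; the paper's variations add wider input context and non-lexical features on top of this skeleton.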

Citations
Book
Li Deng, Dong Yu
12 Jun 2014
TL;DR: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.
Abstract: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks. The application areas are chosen with the following three criteria in mind: (1) expertise or knowledge of the authors; (2) the application areas that have already been transformed by the successful use of deep learning technology, such as speech recognition and computer vision; and (3) the application areas that have the potential to be impacted significantly by deep learning and that have been experiencing research growth, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.

2,817 citations


Cites background from "Recurrent neural networks for language understanding"

  • ...[403] reported the success of RNNs in spoken language understanding....

  • ...to emotion recognition from speech in [207, 222], to spoken language understanding in [242, 366, 403], to speaker recognition in [351, 372], to language-type recognition in [112], to dialogue state tracking for spoken dialogue systems in [94, 152], to automatic voice activity detection in [442], to speech enhancement in [396], to voice conversion in [266], and to single-channel source separation in [132, 387]....

Journal ArticleDOI
TL;DR: Several important RNN architectures, including Elman, Jordan, and hybrid variants, are implemented with the publicly available Theano neural network toolkit and compared in experiments on the well-known airline travel information system (ATIS) benchmark.
Abstract: Semantic slot filling is one of the most challenging problems in spoken language understanding (SLU). In this paper, we propose to use recurrent neural networks (RNNs) for this task, and present several novel architectures designed to efficiently model past and future temporal dependencies. Specifically, we implemented and compared several important RNN architectures, including Elman, Jordan, and hybrid variants. To facilitate reproducibility, we implemented these networks with the publicly available Theano neural network toolkit and completed experiments on the well-known airline travel information system (ATIS) benchmark. In addition, we compared the approaches on two custom SLU data sets from the entertainment and movies domains. Our results show that the RNN-based models outperform the conditional random field (CRF) baseline by 2% in absolute error reduction on the ATIS benchmark. We improve the state-of-the-art by 0.5% in the Entertainment domain, and 6.7% for the movies domain.
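
For reference, the Elman and Jordan recurrences compared in that line of work are conventionally written as follows (the notation here is ours, not the paper's: \sigma is a sigmoid or tanh nonlinearity, x_t the word input, h_t the hidden state, and y_t the slot-label distribution):

\text{Elman:} \quad h_t = \sigma(W_x x_t + W_h h_{t-1}), \qquad y_t = \operatorname{softmax}(W_y h_t)
\text{Jordan:} \quad h_t = \sigma(W_x x_t + W_r y_{t-1}), \qquad y_t = \operatorname{softmax}(W_y h_t)

The two differ only in what is fed back: the Elman network recycles its own hidden state, while the Jordan network feeds back the previous output (label) distribution.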

562 citations


Cites background, methods, or results from "Recurrent neural networks for language understanding"

  • ...We used a context window of 3 for the bag-of-words feature [24]....

  • ...some preliminary SLU experiments [15], [24], [30], [56], in this...

  • ...Such word embeddings actually present interesting properties [23] and tend to cluster [20] when their semantics are similar. While [15], [24] suggest initializing the embedding vectors with unsupervised learned features and then fine-tuning them on the task of interest, we found that directly learning the embedding vectors initialized from random values led to the same performance on the ATIS dataset, when using the SENNA word embeddings (http://ml.... (see the initialization sketch after this list)

  • ...In our experiments, we preprocessed the data as in [24]....

  • ...6% reported in [24] with the 95% significance level based on the binomial test....

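A minimal sketch of the two embedding-initialization strategies contrasted in the excerpt above: random initialization versus loading pretrained vectors before task-specific fine-tuning. The file name, dimensions, and array layout are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 1000, 50

# Option 1: random initialization, learned entirely on the task.
emb = rng.uniform(-0.05, 0.05, (vocab_size, emb_dim))

# Option 2: start from pretrained vectors (e.g. SENNA-style embeddings saved
# beforehand as a (vocab_size, emb_dim) array) and fine-tune on the task.
# emb = np.load("pretrained_embeddings.npy")   # hypothetical file name

# Either matrix is subsequently updated by backpropagation together with the
# rest of the tagger's parameters.
print(emb.shape)
```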

Proceedings ArticleDOI
25 Aug 2013
TL;DR: The results show that on this task, both types of recurrent networks outperform the CRF baseline substantially, and a bi-directional Jordan-type network that takes into account both past and future dependencies among slots works best, outperforming a CRF-based baseline by 14% in relative error reduction.
Abstract: One of the key problems in spoken language understanding (SLU) is the task of slot filling. In light of the recent success of applying deep neural network technologies in domain detection and intent identification, we carried out an in-depth investigation on the use of recurrent neural networks for the more difficult task of slot filling involving sequence discrimination. In this work, we implemented and compared several important recurrent-neural-network architectures, including the Elman-type and Jordan-type recurrent networks and their variants. To make the results easy to reproduce and compare, we implemented these networks on the common Theano neural network toolkit, and evaluated them on the ATIS benchmark. We also compared our results to a conditional random fields (CRF) baseline. Our results show that on this task, both types of recurrent networks outperform the CRF baseline substantially, and a bi-directional Jordan-type network that takes into account both past and future dependencies among slots works best, outperforming a CRF-based baseline by 14% in relative error reduction.
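
The bi-directional variant referred to above can be sketched, in our notation rather than the paper's exact formulation, as one Jordan-style recurrence run forward and one run backward over the utterance, combined at each position:

h_t^{\rightarrow} = \sigma(W_x x_t + W_r\, y_{t-1}^{\rightarrow}), \qquad h_t^{\leftarrow} = \sigma(W'_x x_t + W'_r\, y_{t+1}^{\leftarrow})
y_t = \operatorname{softmax}\big(W_y\, [\, h_t^{\rightarrow};\; h_t^{\leftarrow} \,]\big)

so each slot decision can condition on information from both the past and the future of the sentence.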

503 citations


Cites methods from "Recurrent neural networks for language understanding"

  • ...Future work will explore more efficient training of RNNs and the choice of more comprehensive features [28] and using a different RNN training toolkit [14] incorporating more advanced features....

Journal ArticleDOI
TL;DR: The usefulness and effectiveness of GAN for classification of hyperspectral images (HSIs) are explored for the first time and the proposed models provide competitive results compared to the state-of-the-art methods.
Abstract: A generative adversarial network (GAN) usually contains a generative network and a discriminative network in competition with each other. The GAN has shown its capability in a variety of applications. In this paper, the usefulness and effectiveness of GAN for classification of hyperspectral images (HSIs) are explored for the first time. In the proposed GAN, a convolutional neural network (CNN) is designed to discriminate the inputs and another CNN is used to generate so-called fake inputs. The aforementioned CNNs are trained together: the generative CNN tries to generate fake inputs that are as real as possible, and the discriminative CNN tries to classify the real and fake inputs. This kind of adversarial training improves the generalization capability of the discriminative CNN, which is really important when the training samples are limited. Specifically, we propose two schemes: 1) a well-designed 1D-GAN as a spectral classifier and 2) a robust 3D-GAN as a spectral–spatial classifier. Furthermore, the generated adversarial samples are used with real training samples to fine-tune the discriminative CNN, which improves the final classification performance. The proposed classifiers are carried out on three widely used hyperspectral data sets: Salinas, Indian Pines, and Kennedy Space Center. The obtained results reveal that the proposed models provide competitive results compared to the state-of-the-art methods. In addition, the proposed GANs open new opportunities in the remote sensing community for the challenging task of HSI classification and also reveal the huge potential of GAN-based methods for the analysis of such complex and inherently nonlinear data.
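
The adversarial training described above follows the standard GAN recipe. The sketch below is a generic, simplified 1-D version, with fully connected layers instead of the paper's CNNs and synthetic vectors in place of hyperspectral pixels, just to make the alternating update concrete; layer sizes and hyperparameters are arbitrary assumptions.

```python
import torch
import torch.nn as nn

spectral_dim, noise_dim, batch = 200, 32, 64   # illustrative sizes only

# Generator: noise -> fake "spectrum"; discriminator: spectrum -> real/fake score.
G = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(), nn.Linear(128, spectral_dim))
D = nn.Sequential(nn.Linear(spectral_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(batch, spectral_dim)        # stand-in for real training spectra

for step in range(100):
    # 1) Discriminator step: separate real samples from generated ones.
    fake = G(torch.randn(batch, noise_dim)).detach()
    loss_d = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 2) Generator step: make the discriminator score generated samples as real.
    fake = G(torch.randn(batch, noise_dim))
    loss_g = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```

In the paper's setting the discriminative CNN doubles as the HSI classifier and is additionally fine-tuned on generated samples together with the real training samples.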

501 citations

Proceedings ArticleDOI
08 Sep 2016
TL;DR: Experimental results show the advantage of a holistic multi-domain, multi-task modeling approach, which estimates complete semantic frames for all user utterances addressed to a conversational system, over alternative methods based on single-domain/task deep learning.
Abstract: Sequence-to-sequence deep learning has recently emerged as a new paradigm in supervised learning for spoken language understanding. However, most of the previous studies explored this framework for building single domain models for each task, such as slot filling or domain classification, comparing deep learning based approaches with conventional ones like conditional random fields. This paper proposes a holistic multi-domain, multi-task (i.e. slot filling, domain and intent detection) modeling approach to estimate complete semantic frames for all user utterances addressed to a conversational system, demonstrating the distinctive power of deep learning methods, namely bi-directional recurrent neural network (RNN) with long short-term memory (LSTM) cells (RNN-LSTM) to handle such complexity. The contributions of the presented work are three-fold: (i) we propose an RNN-LSTM architecture for joint modeling of slot filling, intent determination, and domain classification; (ii) we build a joint multi-domain model enabling multi-task deep learning where the data from each domain reinforces each other; (iii) we investigate alternative architectures for modeling lexical context in spoken language understanding. In addition to the simplicity of the single model framework, experimental results show the power of such an approach on Microsoft Cortana real user data over alternative methods based on single domain/task deep learning.
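
A hypothetical sketch of such a joint architecture, not the authors' exact model: one shared bi-directional LSTM encoder feeds a per-token head for slot filling and an utterance-level head for the combined domain/intent decision. All sizes are illustrative.

```python
import torch
import torch.nn as nn

class JointSLU(nn.Module):
    """Shared bi-directional LSTM encoder with a slot-tagging head (per token)
    and a domain/intent head (per utterance). Sizes are illustrative."""
    def __init__(self, vocab=1000, emb=100, hid=128, n_slots=50, n_frames=30):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.enc = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.slot_head = nn.Linear(2 * hid, n_slots)    # slot label per word
        self.frame_head = nn.Linear(2 * hid, n_frames)  # joint domain/intent per utterance

    def forward(self, word_ids):
        states, _ = self.enc(self.emb(word_ids))        # (batch, time, 2 * hid)
        slot_scores = self.slot_head(states)            # sequence-tagging output
        frame_scores = self.frame_head(states[:, -1])   # utterance-level output
        return slot_scores, frame_scores

model = JointSLU()
slots, frame = model(torch.randint(0, 1000, (2, 7)))   # two utterances of seven word ids
print(slots.shape, frame.shape)                        # torch.Size([2, 7, 50]) torch.Size([2, 30])
```

The multi-task effect comes from training both heads on the shared encoder, so data from every domain and task updates the same recurrent parameters.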

464 citations


Additional excerpts

  • ...[23] and Mesnil et al....

References
Journal Article
TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.
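
As a usage illustration, independent of this paper's experiments, projecting a set of high-dimensional vectors with a standard t-SNE implementation looks like the following; the data is random placeholder data and the parameter values are common defaults.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.RandomState(0).randn(200, 50)   # 200 points in 50 dimensions (placeholder data)

# Map to 2-D; perplexity roughly controls the effective neighborhood size.
X_2d = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(X_2d.shape)   # (200, 2), ready to scatter-plot
```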

30,124 citations


"Recurrent neural networks for langu..." refers methods in this paper

  • ...To understand it better, we use t-SNE [39] to plot frequent words that are in the ten most common slots in two-dimensional space, as shown in color in Figure 5....

Proceedings Article
28 Jun 2001
TL;DR: This work presents iterative parameter estimation algorithms for conditional random fields and compares the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
Abstract: We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
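
For context, a linear-chain conditional random field of the kind introduced here defines the conditional probability of a label sequence y given an observation sequence x as (standard textbook notation, not a quotation from the paper):

p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t) \right)

where the f_k are feature functions over adjacent labels and the observations, the \lambda_k are their learned weights, and Z(x) normalizes by summing the exponential over all possible label sequences.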

13,190 citations


"Recurrent neural networks for langu..." refers methods in this paper

  • ...Perhaps the most obvious approach to this task is the use of Conditional Random Fields (CRFs) [26], in which an exponential model is used to compute the probability of a label sequence given the input word sequence....

Journal ArticleDOI
TL;DR: A proposal along these lines, first described by Jordan (1986), which involves the use of recurrent links in order to provide networks with a dynamic memory, is developed, and a method for representing lexical categories and the type/token distinction is suggested.

10,264 citations


"Recurrent neural networks for langu..." refers methods in this paper

  • ...To adapt the RNN-LM to the LU task, we use the classic Elman RNN architecture [31] adopted by Mikolov [1]....

Journal ArticleDOI
TL;DR: The authors propose to learn a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences, which can be expressed in terms of these representations.
Abstract: A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows the model to take advantage of longer contexts.
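
In its commonly cited form (our notation, not a quotation), the model scores the next word from the concatenated embeddings of the n-1 preceding words:

x = [\, C(w_{t-n+1});\, \ldots;\, C(w_{t-1}) \,], \qquad P(w_t \mid w_{t-n+1}, \ldots, w_{t-1}) = \operatorname{softmax}\!\left( b + W x + U \tanh(d + H x) \right)

where C is the shared embedding matrix; the embeddings and the parameters (W, U, H, b, d) are learned jointly by maximizing the likelihood of the training text.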

6,832 citations