Other affiliations: Vassar College, École Normale Supérieure, Technion – Israel Institute of Technology ...read more
Bio: James Glass is an academic researcher from Massachusetts Institute of Technology. The author has contributed to research in topics: Language model & Word error rate. The author has an hindex of 71, co-authored 495 publications receiving 19337 citations. Previous affiliations of James Glass include Vassar College & École Normale Supérieure.
Papers published on a yearly basis
TL;DR: The purpose of this paper is to describe the development effort of JUPITER in terms of the underlying human language technologies as well as other system-related issues such as utterance rejection and content harvesting.
Abstract: In early 1997, our group initiated a project to develop JUPITER, a conversational interface that allows users to obtain worldwide weather forecast information over the telephone using spoken dialogue. It has served as the primary research platform for our group on many issues related to human language technology, including telephone-based speech recognition, robust language understanding, language generation, dialogue modeling, and multilingual interfaces. Over a two year period since coming online in May 1997, JUPITER has received, via a toll-free number in North America, over 30000 calls (totaling over 180000 utterances), mostly from naive users. The purpose of this paper is to describe our development effort in terms of the underlying human language technologies as well as other system-related issues such as utterance rejection and content harvesting. We also present some evaluation results on the system and its components.
TL;DR: The experiences of researchers at MIT in the collection of two large speech databases, timit and voyager, are described, which have somewhat complementary objectives.
Abstract: Automatic speech recognition by computers can provide the most natural and efficient method of communication between humans and computers. While in recent years high performance speech recognition systems are beginning to emerge from research institutions, scientists unequivocally agree that the deployment of speech recognition systems into realistic operating environments will require many hours of speech data to help us model the inherent variability in the speech signal. This paper describes the experiences of researchers at MIT in the collection of two large speech databases which have somewhat complementary objectives. The timit database was designed to be task and speaker-independent, and is suitable for general acoustic-phonetic research. The voyager database, on the other hand, was intended for development and evaluation of a system which incorporates both speech and natural language processing. This database is particularly valuable as a source of spontaneous utterances elicited in a realistic goal-oriented environment.
TL;DR: Analysis methods in neural language processing are reviewed, categorize them according to prominent research trends, highlight existing limitations, and point to potential directions for future work.
Abstract: The field of natural language processing has seen impressive progress in recent years, with neural network models replacing many of the traditional systems. A plethora of new models have been proposed, many of which are thought to be opaque compared to their feature-rich counterparts. This has led researchers to analyze, interpret, and evaluate neural networks in novel and more fine-grained ways. In this survey paper, we review analysis methods in neural language processing, categorize them according to prominent research trends, highlight existing limitations, and point to potential directions for future work.
••15 Sep 2019
TL;DR: The authors proposed an unsupervised autoregressive neural model for learning generic speech representations, which is designed to preserve information for a wide range of downstream tasks, such as phone classification and speaker verification.
Abstract: This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations. In contrast to other speech representation learning methods that aim to remove noise or speaker variabilities, ours is designed to preserve information for a wide range of downstream tasks. In addition, the proposed model does not require any phonetic or word boundary labels, allowing the model to benefit from large quantities of unlabeled data. Speech representations learned by our model significantly improve performance on both phone classification and speaker verification over the surface features and other supervised and unsupervised approaches. Further analysis shows that different levels of speech information are captured by our model at different layers. In particular, the lower layers tend to be more discriminative for speakers, while the upper layers provide more phonetic content.
TL;DR: It is shown how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream by exploiting the structure of repeating patterns within the speech signal.
Abstract: We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.
TL;DR: There is, I think, something ethereal about i —the square root of minus one, which seems an odd beast at that time—an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
15 Feb 2018
TL;DR: This paper introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy).
Abstract: We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
12 Aug 2016
TL;DR: This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
Abstract: Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character ngram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English!German and English!Russian by up to 1.1 and 1.3 BLEU, respectively.