Conference
Human Language Technology
About: Human Language Technology is an academic conference. The conference publishes majorly in the area(s): Natural language & Spoken language. Over the lifetime, 479 publications have been published by the conference receiving 21140 citations.
Papers
More filters
••
23 Feb 1992TL;DR: This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus, a corpus containing significant quantities of both speech data and text data.
Abstract: The DARPA Spoken Language System (SLS) community has long taken a leadership position in designing, implementing, and globally distributing significant speech corpora widely used for advancing speech recognition research. The Wall Street Journal (WSJ) CSR Corpus described here is the newest addition to this valuable set of resources. In contrast to previous corpora, the WSJ corpus will provide DARPA its first general-purpose English, large vocabulary, natural language, high perplexity, corpus containing significant quantities of both speech data (400 hrs.) and text data (47M words), thereby providing a means to integrate speech recognition and natural language processing in application domains with high potential practical value. This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus.
1,100 citations
••
08 Mar 1994TL;DR: The Penn Treebank has recently implemented a new syntactic annotation scheme, designed to highlight aspects of predicate-argument structure as discussed by the authors, which incorporates a more consistent treatment of a wide range of grammatical phenomena, provides a set of coindexed null elements in what can be thought of as "underlying" position for phenomena such as wh-movement, passive, and the subjects of infinitival constructions, and allows for a clear, concise tagging system for some semantic roles.
Abstract: The Penn Treebank has recently implemented a new syntactic annotation scheme, designed to highlight aspects of predicate-argument structure. This paper discusses the implementation of crucial aspects of this new annotation scheme. It incorporates a more consistent treatment of a wide range of grammatical phenomena, provides a set of coindexed null elements in what can be thought of as "underlying" position for phenomena such as wh-movement, passive, and the subjects of infinitival constructions, provides some non-context free annotational mechanism to allow the structure of discontinuous constituents to be easily recovered, and allows for a clear, concise tagging system for some semantic roles.
877 citations
••
24 Jun 1990
TL;DR: This pilot marks the first full-scale attempt to collect a corpus to measure progress in Spoken Language Systems that include both a speech and natural language component and provides guidelines for future efforts.
Abstract: Speech research has made tremendous progress in the past using the following paradigm:• define the research problem,• collect a corpus to objectively measure progress, and• solve the research problem.Natural language research, on the other hand, has typically progressed without the benefit of any corpus of data with which to test research hypotheses. We describe the Air Travel Information System (ATIS) pilot corpus, a corpus designed to measure progress in Spoken Language Systems that include both a speech and natural language component. This pilot marks the first full-scale attempt to collect such a corpus and provides guidelines for future efforts.
842 citations
••
08 Mar 1994TL;DR: This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree, which is shown to lead to similar recognition performance to that obtained using an earlier data-driven approach but to have the additional advantage of providing a mapping for unseen triphones.
Abstract: The key problem to be faced when building a HMM-based continuous speech recogniser is maintaining the balance between model complexity and available training data. For large vocabulary systems requiring cross-word context dependent modelling, this is particularly acute since many such contexts will never occur in the training data. This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree. This tree-based clustering is shown to lead to similar recognition performance to that obtained using an earlier data-driven approach but to have the additional advantage of providing a mapping for unseen triphones. State-tying is also compared with traditional model-based tying and shown to be clearly superior. Experimental results are presented for both the Resource Management and Wall Street Journal tasks.
781 citations
••
21 Mar 1993TL;DR: A semantic concordance is a textual corpus and a lexicon so combined that every substantive word in the text is linked to its appropriate sense in the lexicon.
Abstract: A semantic concordance is a textual corpus and a lexicon so combined that every substantive word in the text is linked to its appropriate sense in the lexicon. Thus it can be viewed either as a corpus in which words have been tagged syntactically and semantically, or as a lexicon in which example sentences can be found for many definitions. A semantic concordance is being constructed to use in studies of sense resolution in context (semantic disambiguation). The Brown Corpus is the text and WordNet is the lexicon. Semantic tags (pointers to WordNet synsets) are inserted in the text manually using an interface, ConText, that was designed to facilitate the task. Another interface supports searches of the tagged text. Some practical uses for semantic concordances am proposed.
748 citations