Author

Jonathan G. Fiscus

Bio: Jonathan G. Fiscus is an academic researcher from the National Institute of Standards and Technology. The author has contributed to research in topics: TRECVID & Closed captioning. The author has an h-index of 32 and has co-authored 60 publications receiving 7,264 citations.


Papers
Proceedings ArticleDOI
14 Dec 1997
TL;DR: Describes the NIST Recognizer Output Voting Error Reduction (ROVER) system, which produces a composite automatic speech recognition (ASR) output when the outputs of multiple ASR systems are available; in many cases the composite output has a lower error rate than any of the individual systems.
Abstract: Describes a system developed at NIST to produce a composite automatic speech recognition (ASR) system output when the outputs of multiple ASR systems are available, and for which, in many cases, the composite ASR output has a lower error rate than any of the individual systems. The system implements a "voting" or rescoring process to reconcile differences in ASR system outputs. We refer to this system as the NIST Recognizer Output Voting Error Reduction (ROVER) system. As additional knowledge sources are added to an ASR system (e.g. acoustic and language models), error rates are typically decreased. This paper describes a post-recognition process which models the output generated by multiple ASR systems as independent knowledge sources that can be combined and used to generate an output with a reduced error rate. To accomplish this, the outputs of multiple ASR systems are combined into a single, minimal-cost word transition network (WTN) via iterative applications of dynamic programming (DP) alignments. The resulting network is searched by an automatic rescoring or "voting" process that selects the output sequence with the lowest score.
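
To make the voting stage concrete, here is a minimal sketch in Python. It assumes the system outputs have already been aligned into equal-length word slots (the paper builds that alignment iteratively with DP over a word transition network) and uses ROVER's simplest scheme, voting by frequency of occurrence; the function name and data layout are illustrative, not NIST's code.

```python
from collections import Counter

# Sketch of ROVER's "voting" stage (illustrative, not NIST's implementation).
# Each hypothesis is a list of word slots produced by a prior alignment step;
# None marks a deletion in that slot.

def rover_vote(aligned_hyps):
    """Pick, for each slot, the word most systems agree on."""
    composite = []
    for slot in zip(*aligned_hyps):
        best_word, _ = Counter(slot).most_common(1)[0]
        if best_word is not None:        # a majority "deletion" emits nothing
            composite.append(best_word)
    return composite

hyps = [
    ["the", "cat", "sat",  None],
    ["the", "hat", "sat", "down"],
    ["the", "cat", "sat", "down"],
]
print(rover_vote(hyps))   # ['the', 'cat', 'sat', 'down']
```

The real system also supports voting schemes that weight word confidence scores, which is where the "lowest score" search in the abstract comes in.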

1,188 citations

01 Jan 2013
TL;DR: The TREC Video Retrieval Evaluation (TRECVID) 2012 was a TREC-style video analysis and retrieval evaluation whose goal remains to promote progress in content-based exploitation of digital video via open, metrics-based evaluation.
Abstract: The TREC Video Retrieval Evaluation (TRECVID) 2012 was a TREC-style video analysis and retrieval evaluation, the goal of which remains to promote progress in content-based exploitation of digital video via open, metrics-based evaluation. Over the last ten years this effort has yielded a better understanding of how systems can effectively accomplish such processing and how one can reliably benchmark their performance. TRECVID is funded by NIST and other US government agencies. Many organizations and individuals worldwide contribute significant time and effort.

582 citations

01 Jan 2006
TL;DR: The paper describes the evaluation task posed to Spoken Term Detection systems, the evaluation methodologies, the Arabic, English and Mandarin evaluation corpora, and the results of the evaluation.
Abstract: This paper presents the pilot evaluation of Spoken Term Detection technologies, held during the latter part of 2006. Spoken Term Detection systems rapidly detect the presence of a term, which is a sequence of words consecutively spoken, in a large audio corpus of heterogeneous speech material. The paper describes the evaluation task posed to Spoken Term Detection systems, the evaluation methodologies, the Arabic, English and Mandarin evaluation corpora, and the results of the evaluation. Ten participants submitted systems for the evaluation.
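
As a toy illustration of the matching step only, the sketch below finds consecutive occurrences of a multi-word term in a time-aligned 1-best transcript. Real STD systems search lattices or phonetic indexes rather than 1-best text, and the evaluation scored system-reported detections; the data layout here is an assumption for illustration.

```python
# Toy sketch of spoken term detection over a time-aligned word transcript
# (illustrative only; real systems search lattices/indexes, not 1-best text).

def find_term(transcript, term):
    """transcript: list of (word, start_sec, end_sec); term: list of words."""
    hits = []
    n = len(term)
    for i in range(len(transcript) - n + 1):
        window = transcript[i:i + n]
        if [w for w, _, _ in window] == term:
            hits.append((window[0][1], window[-1][2]))  # (start, end) of hit
    return hits

trans = [("spoken", 0.0, 0.4), ("term", 0.4, 0.7), ("detection", 0.7, 1.3)]
print(find_term(trans, ["spoken", "term"]))   # [(0.0, 0.7)]
```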

252 citations


Cited by
Proceedings Article
01 Jan 2002
TL;DR: The functionality of the SRILM toolkit is summarized and its design and implementation is discussed, highlighting ease of rapid prototyping, reusability, and combinability of tools.
Abstract: SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimentation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes. The toolkit supports creation and evaluation of a variety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipulation of N-best lists and word lattices. This paper summarizes the functionality of the toolkit and discusses its design and implementation, highlighting ease of rapid prototyping, reusability, and combinability of tools.
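
SRILM itself is driven from the command line (its ngram-count and ngram tools train and score models). The Python below is only a conceptual sketch of the underlying N-gram computation, using add-one smoothing for brevity where SRILM offers far better discounting methods; it is not SRILM's API.

```python
from collections import Counter
import math

# Conceptual sketch of what an N-gram LM toolkit computes (not SRILM's API):
# train bigram counts, then score a sentence with add-one smoothing.

def train_bigrams(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    vocab = len(unigrams)
    toks = ["<s>"] + sentence.split() + ["</s>"]
    lp = 0.0
    for h, w in zip(toks, toks[1:]):
        lp += math.log((bigrams[(h, w)] + 1) / (unigrams[h] + vocab))
    return lp

uni, bi = train_bigrams(["the cat sat", "the cat ran"])
print(log_prob("the cat sat", uni, bi))
```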

4,904 citations

Journal ArticleDOI
TL;DR: This paper presents the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling; it observes that the studied hyperparameters are virtually independent and derives guidelines for their efficient adjustment.
Abstract: Several variants of the long short-term memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful functional ANalysis Of VAriance (fANOVA) framework. In total, we summarize the results of 5400 experimental runs (≈15 years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
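
To make the components the study ranks concrete, here is one step of a standard ("vanilla") LSTM cell in NumPy, with the forget gate and the output activation (tanh) that the paper found most critical marked in comments. Weight names and shapes are generic, not taken from the paper.

```python
import numpy as np

# One step of a standard LSTM cell. W_* / b_* are generic parameters,
# x is the input, (h, c) the previous hidden and cell state.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    z = np.concatenate([x, h])
    i = sigmoid(W["i"] @ z + b["i"])    # input gate
    f = sigmoid(W["f"] @ z + b["f"])    # forget gate (found most critical)
    o = sigmoid(W["o"] @ z + b["o"])    # output gate
    g = np.tanh(W["g"] @ z + b["g"])    # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)          # output activation (also critical)
    return h_new, c_new

n_in, n_hid = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_hid, n_in + n_hid)) for k in "ifog"}
b = {k: np.zeros(n_hid) for k in "ifog"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```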

4,746 citations

12 Sep 2016
TL;DR: Introduces WaveNet, a deep neural network for generating raw audio waveforms; the model can be efficiently trained on data with tens of thousands of samples per second of audio and can also be employed as a discriminative model, returning promising results for phoneme recognition.
Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
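
The architectural idea that makes this conditioning tractable is stacking dilated causal convolutions, so each output sample depends only on past samples while the receptive field grows exponentially with depth. Below is a toy NumPy sketch of one such layer; the shapes, filter size, and dilation schedule are illustrative assumptions, not DeepMind's implementation.

```python
import numpy as np

# Toy dilated *causal* convolution: left-padding by the dilation keeps the
# layer causal, so output[t] depends only on input[<= t].

def causal_dilated_conv(x, w, dilation):
    """x: 1-D signal; w: 2-tap filter (past tap, current tap)."""
    past = np.concatenate([np.zeros(dilation), x[:-dilation]])  # shift right
    return w[0] * past + w[1] * x

x = np.arange(8, dtype=float)
y = x
for d in (1, 2, 4):               # exponentially growing receptive field
    y = np.tanh(causal_dilated_conv(y, np.array([0.5, 0.5]), d))
print(y)
```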

3,248 citations

Proceedings Article
01 Jan 2005
TL;DR: In this article, a modified, full gradient version of the LSTM learning algorithm was used for framewise phoneme classification, using the TIMIT database, and the results support the view that contextual information is crucial to speech processing, and suggest that bidirectional networks outperform unidirectional ones.
Abstract: In this paper, we present bidirectional Long Short Term Memory (LSTM) networks, and a modified, full gradient version of the LSTM learning algorithm. We evaluate Bidirectional LSTM (BLSTM) and several other network architectures on the benchmark task of framewise phoneme classification, using the TIMIT database. Our main findings are that bidirectional networks outperform unidirectional ones, and Long Short Term Memory (LSTM) is much faster and also more accurate than both standard Recurrent Neural Nets (RNNs) and time-windowed Multilayer Perceptrons (MLPs). Our results support the view that contextual information is crucial to speech processing, and suggest that BLSTM is an effective architecture with which to exploit it.
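
Bidirectionality itself is simple: run one recurrent pass forward and one backward over the sequence, then combine the two hidden states at each frame. In the minimal sketch below a plain tanh-RNN step stands in for the LSTM cell to keep it short; names and shapes are illustrative.

```python
import numpy as np

# Minimal sketch of the bidirectional idea: forward and backward recurrent
# passes, hidden states concatenated per frame for framewise classification.

def rnn_pass(xs, W_x, W_h):
    h = np.zeros(W_h.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        out.append(h)
    return out

def bidirectional(xs, params_f, params_b):
    fwd = rnn_pass(xs, *params_f)
    bwd = rnn_pass(xs[::-1], *params_b)[::-1]   # backward pass, re-reversed
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(1)
xs = [rng.normal(size=5) for _ in range(4)]      # 4 frames, 5 features each
pf = (rng.normal(size=(3, 5)), rng.normal(size=(3, 3)))
pb = (rng.normal(size=(3, 5)), rng.normal(size=(3, 3)))
print(len(bidirectional(xs, pf, pb)[0]))         # 6 = 3 fwd + 3 bwd
```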

3,028 citations

Journal ArticleDOI
TL;DR: Journal version of the preceding framewise phoneme classification study: a modified, full gradient version of the LSTM learning algorithm was evaluated on the TIMIT database, with results supporting the view that contextual information is crucial to speech processing and that bidirectional networks outperform unidirectional ones.

2,200 citations