Author

William M. Fisher

Bio: William M. Fisher is an academic researcher from Texas Instruments. The author has contributed to research in topics: String (computer science) & Speech synthesis. The author has an h-index of 8 and has co-authored 15 publications receiving 3,746 citations.

Papers
Dataset
01 Jan 1993
TL;DR: The TIMIT corpus as mentioned in this paper contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences, including time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance.
Abstract: The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.
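The time-aligned transcriptions make the corpus easy to process with simple tooling. As an illustrative sketch (the standard three-column start-sample/end-sample/label layout of TIMIT's .PHN files is assumed; the function name is hypothetical), phonetic segments can be converted from sample indices to seconds using the 16 kHz sample rate:

```python
# Illustrative sketch: parse a TIMIT-style time-aligned phonetic
# transcription (.PHN). Each line holds a start sample, an end sample,
# and a phone label; dividing by the 16 kHz rate yields seconds.
SAMPLE_RATE = 16000  # TIMIT waveforms are 16-bit, 16 kHz

def parse_phn(lines):
    """Return a list of (start_sec, end_sec, phone) tuples."""
    segments = []
    for line in lines:
        start, end, phone = line.split()
        segments.append((int(start) / SAMPLE_RATE,
                         int(end) / SAMPLE_RATE,
                         phone))
    return segments

example = ["0 3050 h#", "3050 4559 sh", "4559 5723 ix"]
print(parse_phn(example)[1])  # segment boundaries of "sh", in seconds
```

The word-level .WRD files follow the same three-column layout, so the same parser applies.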

2,096 citations

Proceedings ArticleDOI
11 Apr 1988
TL;DR: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition.
Abstract: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program. The data is intended for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition. The data consists of read sentences appropriate to a naval resource management task built around existing interactive database and graphics programs. The 1000-word task vocabulary is intended to be logically complete and habitable. The database, which represents over 21,000 recorded utterances from 160 talkers with a variety of dialects, includes a partition of sentences and talkers for training and for testing purposes.

393 citations

Journal ArticleDOI
TL;DR: The DARPA Speech Data Base as discussed by the authors contains recordings of 630 speakers reading ten sentences each, drawn from a set of 450 sentences designed at MIT and 1,890 selected at TI from text sources.
Abstract: DARPA has sponsored the design and collection of a large speech data base. Six hundred and thirty speakers read ten sentences each. Two sentences were constant for all speakers; the remaining eight sentences were selected from a set of 450 designed at MIT and 1890 selected at TI from text sources. The set of sentences is phonetically rich, balanced, and deep. Although all recordings were made in Dallas, we sampled as many varieties of American English as possible. Selection of volunteer speakers was based on their childhood locality to give a balanced representation of geographical origins. The subject population is adult; 70% male; young (63% in their twenties); well educated (78% with bachelor's degree); and predominantly white (96%). Recordings were made in a noise‐reducing sound booth using a Sennheiser headset microphone and digitized at 20 kHz. A natural reading style was encouraged. The recordings are complete, and time‐registered phonetic transcriptions are being added to the 6300 speech files at MIT. A version of the complete data base (16‐kHz sample rate, with acoustic‐phonetic transcriptions—approximately 50 megabytes of data) will be made available to researchers through the National Bureau of Standards. [Work supported by DARPA.]

85 citations

PatentDOI
TL;DR: In this paper, a plurality of microphones are disposed on a body to detect the speech of a speaker and the signals from different microphones are compared to allow the discrimination of certain speech sounds.
Abstract: A plurality of microphones are disposed on a body to detect the speech of a speaker. First, second and third microphones may respectively detect the sounds emanating from the speaker's mouth, nose and throat and produce signals representing such sounds. A fourth microphone may detect the fricative and plosive sounds emanating from the speaker's mouth and produce signals representing such sounds. The signals from the different microphones are compared to allow the discrimination of certain speech sounds. For example, a high amplitude of the signal from the nose microphone relative to that from the mouth microphone indicates that a nasal sound such as m, n, or ng was spoken. Identifying signals are provided to the speech recognition system to aid in identifying the speech sounds at each instance. The identifying signals can also select a microphone whose signal can be passed on to the recognition system in its entirety. Signals may also be provided to identify that spoken words such as "paragraph" or "comma" are actually directions controlling the form, rather than the content, of the speech by the speaker. The selected signals, the identifying or classifying signals and the signals representing directions may be recovered by the system of this invention. The selected and identifying signals may be processed to detect syllables of speech and the syllables may be classified into phrases or sentences. The result may then be converted to a printed form representing the speech or utilized in the operation of another device.

66 citations


Cited by
Journal ArticleDOI
TL;DR: This paper presents the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling, and observes that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
Abstract: Several variants of the long short-term memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful functional ANalysis Of VAriance framework. In total, we summarize the results of 5400 experimental runs (approximately 15 years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
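For context, the "standard LSTM" baseline the study refers to can be sketched in a few lines. The following is a minimal single-unit (scalar) version, not the paper's implementation; the per-gate weight layout is a simplification introduced here for illustration. It shows the forget gate and the output tanh activation that the study identifies as the architecture's most critical components:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, w):
    """One step of a scalar standard LSTM cell.

    w maps each gate name to an (input weight, recurrent weight, bias)
    triple; this layout is illustrative, not from the paper.
    """
    i = sigmoid(w["i"][0] * x + w["i"][1] * h + w["i"][2])   # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h + w["f"][2])   # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h + w["o"][2])   # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h + w["g"][2])  # candidate
    c_new = f * c + i * g             # forget gate scales the old cell state
    h_new = o * math.tanh(c_new)      # output activation on the new cell state
    return h_new, c_new
```

With all weights and biases at zero, every sigmoid gate opens halfway, so the cell state simply decays by half each step; that dependence of memory retention on the forget gate is what the study's ablations probe.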

4,746 citations

Book
01 Jan 2000
TL;DR: This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora, to demonstrate how the same algorithm can be used for speech recognition and word-sense disambiguation.
Abstract: From the Publisher: This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora. Methodology boxes are included in each chapter, and each chapter is built around one or more worked examples that demonstrate its main idea. The book covers the fundamental algorithms of various fields, whether originally proposed for spoken or written language, demonstrating how the same algorithm can be used for both speech recognition and word-sense disambiguation. Emphasis is placed on web and other practical applications, and on scientific evaluation. It is also useful as a reference for professionals in any of the areas of speech and language processing.

3,794 citations

12 Sep 2016
TL;DR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

3,248 citations

Journal ArticleDOI
TL;DR: A framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMM) is presented, and Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications.
Abstract: In this paper, a framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMMs) is presented. Three key issues of MAP estimation, namely, the choice of prior distribution family, the specification of the parameters of prior densities, and the evaluation of the MAP estimates, are addressed. Using HMMs with Gaussian mixture state observation densities as an example, it is assumed that the prior densities for the HMM parameters can be adequately represented as a product of Dirichlet and normal-Wishart densities. The classical maximum likelihood estimation algorithms, namely, the forward-backward algorithm and the segmental k-means algorithm, are expanded, and MAP estimation formulas are developed. Prior density estimation issues are discussed for two classes of applications, parameter smoothing and model adaptation, and some experimental results are given illustrating the practical interest of this approach. Because of its adaptive nature, Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications.
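The flavor of MAP smoothing can be illustrated with the simplest conjugate case: estimating a Gaussian mean under a normal prior. This is only a toy analogue of the paper's Dirichlet and normal-Wishart formulation; the function name and its tau parameter are illustrative assumptions, not from the paper:

```python
def map_mean(samples, prior_mean, tau):
    """MAP estimate of a Gaussian mean under a conjugate normal prior.

    tau acts as the prior's equivalent sample count: with no data the
    estimate stays at the prior mean, and as data accumulates it moves
    toward the maximum likelihood (sample) mean. Illustrative only.
    """
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)

# Two observations with mean 3.0, prior mean 0.0 weighted as 2 samples:
print(map_mean([2.0, 4.0], prior_mean=0.0, tau=2))  # → 1.5
```

This interpolation between prior and data is exactly what makes MAP estimation attractive for the adaptation setting the abstract describes: with little adaptation data the model stays near its prior (the pretrained parameters), and with more data it converges to the new speaker's statistics.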

2,430 citations

Proceedings ArticleDOI
23 Mar 1992
TL;DR: SWITCHBOARD as mentioned in this paper is a large multispeaker corpus of conversational speech and text which should be of interest to researchers in speaker authentication and large vocabulary speech recognition.
Abstract: SWITCHBOARD is a large multispeaker corpus of conversational speech and text which should be of interest to researchers in speaker authentication and large vocabulary speech recognition. About 2,500 conversations by 500 speakers from around the US were collected automatically over T1 lines at Texas Instruments. Designed for training and testing of a variety of speech processing algorithms, especially in speaker verification, it has over an hour of speech from each of 50 speakers, and several minutes each from hundreds of others. A time-aligned, word-for-word transcription accompanies each recording.

2,102 citations