Author

William M. Fisher

Bio: William M. Fisher is an academic researcher from Texas Instruments. The author has contributed to research in topics: String (computer science) & Speech synthesis. The author has an h-index of 8 and has co-authored 15 publications receiving 3,746 citations.

Papers
Dataset
01 Jan 1993
TL;DR: The TIMIT corpus as mentioned in this paper contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences, including time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance.
Abstract: The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST). The TIMIT corpus transcriptions have been hand verified. Test and training subsets, balanced for phonetic and dialectal coverage, are specified. Tabular computer-searchable information is included as well as written documentation.
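The time-aligned transcriptions make the corpus easy to process with simple tooling. As an illustrative sketch (the standard three-column start-sample/end-sample/label layout of TIMIT's .PHN files is assumed; the function name is hypothetical), phonetic segments can be converted from sample indices to seconds using the 16 kHz sample rate:

```python
# Illustrative sketch: parse a TIMIT-style time-aligned phonetic
# transcription (.PHN). Each line holds a start sample, an end sample,
# and a phone label; dividing by the 16 kHz rate yields seconds.
SAMPLE_RATE = 16000  # TIMIT waveforms are 16-bit, 16 kHz

def parse_phn(lines):
    """Return a list of (start_sec, end_sec, phone) tuples."""
    segments = []
    for line in lines:
        start, end, phone = line.split()
        segments.append((int(start) / SAMPLE_RATE,
                         int(end) / SAMPLE_RATE,
                         phone))
    return segments

example = ["0 3050 h#", "3050 4559 sh", "4559 5723 ix"]
print(parse_phn(example)[1])  # segment boundaries of "sh", in seconds
```

The word-level .WRD files follow the same three-column layout, so the same parser applies.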

2,096 citations

Proceedings ArticleDOI
11 Apr 1988
TL;DR: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition.
Abstract: A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program. The data is intended for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition. The data consists of read sentences appropriate to a naval resource management task built around existing interactive database and graphics programs. The 1000-word task vocabulary is intended to be logically complete and habitable. The database, which represents over 21,000 recorded utterances from 160 talkers with a variety of dialects, includes a partition of sentences and talkers for training and for testing purposes.

393 citations

Journal ArticleDOI
TL;DR: The DARPA Speech Data Base as discussed by the authors contains recordings of 630 speakers reading ten sentences each, drawn from a set of 450 sentences designed at MIT and 1,890 selected at TI from text sources.
Abstract: DARPA has sponsored the design and collection of a large speech data base. Six hundred and thirty speakers read ten sentences each. Two sentences were constant for all speakers; the remaining eight sentences were selected from a set of 450 designed at MIT and 1890 selected at TI from text sources. The set of sentences is phonetically rich, balanced, and deep. Although all recordings were made in Dallas, we sampled as many varieties of American English as possible. Selection of volunteer speakers was based on their childhood locality to give a balanced representation of geographical origins. The subject population is adult; 70% male; young (63% in their twenties); well educated (78% with bachelor's degree); and predominantly white (96%). Recordings were made in a noise‐reducing sound booth using a Sennheiser headset microphone and digitized at 20 kHz. A natural reading style was encouraged. The recordings are complete, and time‐registered phonetic transcriptions are being added to the 6300 speech files at MIT. A version of the complete data base (16‐kHz sample rate, with acoustic‐phonetic transcriptions—approximately 50 megabytes of data) will be made available to researchers through the National Bureau of Standards. [Work supported by DARPA.]

85 citations

PatentDOI
TL;DR: In this paper, a plurality of microphones are disposed on a body to detect the speech of a speaker and the signals from different microphones are compared to allow the discrimination of certain speech sounds.
Abstract: A plurality of microphones are disposed on a body to detect the speech of a speaker. First, second and third microphones may respectively detect the sounds emanating from the speaker's mouth, nose and throat and produce signals representing such sounds. A fourth microphone may detect the fricative and plosive sounds emanating from the speaker's mouth and produce signals representing such sounds. The signals from the different microphones are compared to allow the discrimination of certain speech sounds. For example, a high amplitude of the signal from the nose microphone relative to that from the mouth microphone indicates that a nasal sound such as m, n, or ng was spoken. Identifying signals are provided to the speech recognition system to aid in identifying the speech sounds at each instance. The identifying signals can also select a microphone whose signal can be passed on to the recognition system in its entirety. Signals may also be provided to identify that spoken words such as "paragraph" or "comma" are actually directions controlling the form, rather than the content, of the speech by the speaker. The selected signals, the identifying or classifying signals and the signals representing directions may be recovered by the system of this invention. The selected and identifying signals may be processed to detect syllables of speech and the syllables may be classified into phrases or sentences. The result may then be converted to a printed form representing the speech or utilized in the operation of another device.

66 citations


Cited by
Journal ArticleDOI
TL;DR: This paper presents the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling, and observes that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
Abstract: Several variants of the long short-term memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful functional ANalysis Of VAriance framework. In total, we summarize the results of 5400 experimental runs (approximately 15 years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
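For context, the "standard LSTM" baseline the study refers to can be sketched in a few lines. The following is a minimal single-unit (scalar) version, not the paper's implementation; the per-gate weight layout is a simplification introduced here for illustration. It shows the forget gate and the output tanh activation that the study identifies as the architecture's most critical components:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, w):
    """One step of a scalar standard LSTM cell.

    w maps each gate name to an (input weight, recurrent weight, bias)
    triple; this layout is illustrative, not from the paper.
    """
    i = sigmoid(w["i"][0] * x + w["i"][1] * h + w["i"][2])   # input gate
    f = sigmoid(w["f"][0] * x + w["f"][1] * h + w["f"][2])   # forget gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h + w["o"][2])   # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h + w["g"][2])  # candidate
    c_new = f * c + i * g             # forget gate scales the old cell state
    h_new = o * math.tanh(c_new)      # output activation on the new cell state
    return h_new, c_new
```

With all weights and biases at zero, every sigmoid gate opens halfway, so the cell state simply decays by half each step; that dependence of memory retention on the forget gate is what the study's ablations probe.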

4,746 citations

Book
01 Jan 2000
TL;DR: This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora, to demonstrate how the same algorithm can be used for speech recognition and word-sense disambiguation.
Abstract: From the Publisher: This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora. Methodology boxes are included in each chapter, and each chapter is built around one or more worked examples that demonstrate its main idea. The book covers the fundamental algorithms of various fields, whether originally proposed for spoken or written language, demonstrating how the same algorithm can be used for both speech recognition and word-sense disambiguation. Emphasis is placed on web and other practical applications, and on scientific evaluation. It is also useful as a reference for professionals in any of the areas of speech and language processing.

3,794 citations

12 Sep 2016
TL;DR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

3,248 citations

Journal ArticleDOI
TL;DR: A framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMM) is presented, and Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications.
Abstract: In this paper, a framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMMs) is presented. Three key issues of MAP estimation, namely, the choice of prior distribution family, the specification of the parameters of prior densities, and the evaluation of the MAP estimates, are addressed. Using HMMs with Gaussian mixture state observation densities as an example, it is assumed that the prior densities for the HMM parameters can be adequately represented as a product of Dirichlet and normal-Wishart densities. The classical maximum likelihood estimation algorithms, namely, the forward-backward algorithm and the segmental k-means algorithm, are expanded, and MAP estimation formulas are developed. Prior density estimation issues are discussed for two classes of applications, parameter smoothing and model adaptation, and some experimental results are given illustrating the practical interest of this approach. Because of its adaptive nature, Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications.
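The flavor of MAP smoothing can be illustrated with the simplest conjugate case: estimating a Gaussian mean under a normal prior. This is only a toy analogue of the paper's Dirichlet and normal-Wishart formulation; the function name and its tau parameter are illustrative assumptions, not from the paper:

```python
def map_mean(samples, prior_mean, tau):
    """MAP estimate of a Gaussian mean under a conjugate normal prior.

    tau acts as the prior's equivalent sample count: with no data the
    estimate stays at the prior mean, and as data accumulates it moves
    toward the maximum likelihood (sample) mean. Illustrative only.
    """
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)

# Two observations with mean 3.0, prior mean 0.0 weighted as 2 samples:
print(map_mean([2.0, 4.0], prior_mean=0.0, tau=2))  # → 1.5
```

This interpolation between prior and data is exactly what makes MAP estimation attractive for the adaptation setting the abstract describes: with little adaptation data the model stays near its prior (the pretrained parameters), and with more data it converges to the new speaker's statistics.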

2,430 citations

Proceedings ArticleDOI
23 Mar 1992
TL;DR: SWITCHBOARD as mentioned in this paper is a large multispeaker corpus of conversational speech and text which should be of interest to researchers in speaker authentication and large vocabulary speech recognition.
Abstract: SWITCHBOARD is a large multispeaker corpus of conversational speech and text which should be of interest to researchers in speaker authentication and large vocabulary speech recognition. About 2,500 conversations by 500 speakers from around the US were collected automatically over T1 lines at Texas Instruments. Designed for training and testing of a variety of speech processing algorithms, especially in speaker verification, it has over an hour of speech from each of 50 speakers, and several minutes each from hundreds of others. A time-aligned, word-for-word transcription accompanies each recording.

2,102 citations