
Showing papers on "Spectrogram" published in 1987


Proceedings ArticleDOI
01 Apr 1987
TL;DR: It is shown that instants of glottal excitation can be derived from the model even with noisy speech, and the problem of interaction with harmonics of F0 can be solved.
Abstract: An auditory model with two-tone suppression has previously been shown to perform better in speech recognition experiments than a conventional filterbank representation, particularly with noisy or distorted speech. It was, however, known to have several defects including an uneven response across the spectrum and a tendency to detect harmonics of F0 rather than F1. We show that instants of glottal excitation can be derived from the model even with noisy speech. By using this information to carry out pitch-synchronous analysis in a slightly modified model, the problem of interaction with harmonics of F0 can be solved. An analysis of the behavior of the model leads to a specification of a class of processes showing two-tone suppression and hence to a redesigned model avoiding the known defects. The pitch-synchronous analysis is then no longer necessary, but the robust indication of excitation points may have other uses. Spectrograms from the old and new models illustrate the improvements obtained.
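
The paper derives excitation instants from its auditory model; the sketch below substitutes a generic envelope peak-picker in their place (all function names, filter bands, and thresholds here are assumptions, not the paper's model) purely to illustrate the pitch-synchronous analysis style it describes.

```python
# Hypothetical sketch: locate candidate glottal excitation instants by
# peak-picking a band-limited envelope, then run a pitch-synchronous DFT
# analysis anchored at those instants. NOT the paper's auditory model.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, find_peaks

def excitation_instants(x, fs, f0_max=400.0):
    # Emphasize the voicing band, then take the analytic envelope.
    b, a = butter(4, [60.0 / (fs / 2), 900.0 / (fs / 2)], btype="band")
    env = np.abs(hilbert(filtfilt(b, a, x)))
    # Peaks must clear a small threshold and sit at least one minimal
    # pitch period apart.
    peaks, _ = find_peaks(env, distance=int(fs / f0_max),
                          height=0.1 * env.max())
    return peaks

def pitch_synchronous_spectra(x, instants, n_fft=512):
    # One Hanning-windowed DFT per excitation-to-excitation cycle.
    frames = []
    for t0, t1 in zip(instants[:-1], instants[1:]):
        seg = x[t0:t1] * np.hanning(t1 - t0)
        frames.append(np.abs(np.fft.rfft(seg, n_fft)))
    return np.array(frames)   # rows: one spectrum per pitch period
```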

25 citations


Proceedings ArticleDOI
01 Apr 1987
TL;DR: Scale-space filtering, proposed by Witkin (ICASSP 84) for describing natural structure in one-dimensional signals, has been extended for application to segmentation and description of vector-valued functions of time, such as speech spectrograms.
Abstract: Scale-space filtering, proposed by Witkin (ICASSP 84) for describing natural structure in one-dimensional signals, has been extended for application to segmentation and description of vector-valued functions of time, such as speech spectrograms. By analyzing the rate of change of a vector trajectory at many different scales of time-smoothing, a tree of natural segments can be constructed. At various levels in the tree (i.e., at various scales), these segments are found to agree well with the kind of linguistically and perceptually important segments that spectrogram readers use to describe sound patterns of speech. Scale-space segmentations of cochleagrams (spectrograms based on a computational model of the peripheral auditory system) have been experimentally applied to word recognition. Recognition using fixed-scale segmentations with finite-state word models and a Viterbi search has led to speaker-independent digit recognition accuracies of greater than 97%, about the same as in tests with non-segmented cochleagrams. More complex recognition algorithms that use the segmentation tree are being developed, and scale-space experiments with connected digits and sentences are also underway.
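
A minimal fixed-scale version of the boundary-finding step can be sketched as follows, assuming a magnitude spectrogram as input; building and tracking the full segmentation tree across scales, as the paper does, is omitted.

```python
# Rough sketch (not the authors' implementation): smooth a spectrogram's
# column trajectory at several time scales and mark segment boundaries
# where the rate of change of the spectral vector peaks.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def scale_space_boundaries(spec, scales=(1, 2, 4, 8, 16)):
    """spec: (n_freq, n_frames) magnitude spectrogram.
    Returns {scale: boundary frame indices}."""
    out = {}
    for s in scales:
        smooth = gaussian_filter1d(spec, sigma=s, axis=1)
        # Rate of change of the vector trajectory along time.
        rate = np.linalg.norm(np.diff(smooth, axis=1), axis=0)
        peaks, _ = find_peaks(rate)
        out[s] = peaks
    return out
```

Coarser scales yield fewer, longer segments; a tree emerges by linking each coarse segment to the finer-scale segments it spans.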

21 citations


Proceedings ArticleDOI
06 Apr 1987
TL;DR: The speech signal is considered to be composed of elementary waveforms, wfs (windowed sinusoids), each defined by a small number of parameters; the signal is segmented at amplitude minima and each segment is modeled by a wf.
Abstract: We consider the speech signal to be composed of elementary waveforms (wfs: windowed sinusoids), each one defined by a small number of parameters. The typical duration of a wf is of the order of magnitude of a pitch period in the voiced segments, and a few milliseconds in the noise segments. No preliminary evaluation of voicing or pitch is required; this largely differentiates the approach from the classical pitch-synchronous analysis. The analysis process uses a filterbank, designed to introduce as few time distortions as possible. The signal at the output of each filter is segmented according to successive amplitude minima, and each segment is modeled by a wf. This decomposition can be validated by reconstructing the wfs from their parameters and summing them in order to recover a signal perceptually equivalent to the original.
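
The per-channel step is simple enough to sketch; filter order, the minimum-spacing parameter, and the frequency estimate below are my own choices, not the paper's.

```python
# Minimal sketch of the decomposition idea: split one filterbank channel
# at envelope minima and summarize every segment as one elementary
# waveform (onset, offset, peak amplitude, frequency, initial phase).
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, argrelmin

def channel_waveforms(x, fs, lo, hi):
    b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    y = filtfilt(b, a, x)
    analytic = hilbert(y)
    env = np.abs(analytic)
    cuts = argrelmin(env, order=16)[0]          # segment at amplitude minima
    params = []
    for t0, t1 in zip(np.r_[0, cuts], np.r_[cuts, len(y)]):
        if t1 - t0 < 8:
            continue
        seg = analytic[t0:t1]
        # Mean derivative of the unwrapped phase gives the frequency.
        freq = fs * np.mean(np.diff(np.unwrap(np.angle(seg)))) / (2 * np.pi)
        params.append((t0, t1, np.max(env[t0:t1]), freq, np.angle(seg[0])))
    return params   # each tuple defines one elementary waveform
```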

19 citations


01 May 1987
TL;DR: In this paper, the authors propose joint time-frequency energy representations for auditory signals and show that, with non-oriented kernels, ridge tops can be identified with zero-crossings of the inner product of the gradient vector with the direction of greatest downward curvature.
Abstract: This work addresses two related questions. The first question is what joint time-frequency energy representations are most appropriate for auditory signals, in particular, for speech signals in sonorant regions. Five properties are proposed for the representation: (1) shift-invariance, (2) positivity, (3) superposition, (4) locality, and (5) smoothness, and the subclass of quadratic transforms that best meets these criteria is derived. The second question addressed is how to obtain a rich, symbolic description of the phonetically relevant features from these time-frequency energy surfaces, the so-called schematic spectrogram. Time-frequency ridges, the 2-D analog of spectral peaks, are one feature that is proposed. If non-oriented kernels are used for the energy representation, then the ridge tops can be identified with zero-crossings of the inner product of the gradient vector with the direction of greatest downward curvature. If oriented kernels are used, the method can be generalized to give better orientation selectivity (e.g., intersecting ridges) at the cost of poorer time-frequency locality.
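
The non-oriented ridge test lends itself to a compact numerical sketch. Below, the Hessian's more negative eigenvalue supplies the direction of greatest downward curvature, and ridge points are flagged where the gradient's inner product with that direction crosses zero; the discretization via np.gradient is my own choice, not the thesis code.

```python
# Sketch of the non-oriented ridge test described above: a ridge point
# is where the gradient is orthogonal to the direction of greatest
# downward curvature (most negative Hessian eigenvalue).
import numpy as np

def ridge_mask(E):
    """E: time-frequency energy surface, shape (n_frames, n_freq)."""
    Et, Ef = np.gradient(E)                 # gradient of the energy surface
    Ett, Etf = np.gradient(Et)
    _,   Eff = np.gradient(Ef)
    # Smaller eigenvalue of the symmetric 2x2 Hessian at each point.
    tr, det = Ett + Eff, Ett * Eff - Etf ** 2
    lam = tr / 2 - np.sqrt(np.maximum((tr / 2) ** 2 - det, 0))
    vx, vy = Etf, lam - Ett                 # eigenvector for eigenvalue lam
    proj = Et * vx + Ef * vy                # gradient . curvature direction
    # Zero-crossings of proj (along frequency) where curvature is downward.
    zc = np.sign(proj[:, :-1]) != np.sign(proj[:, 1:])
    return np.pad(zc, ((0, 0), (0, 1))) & (lam < 0)
```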

11 citations


Journal ArticleDOI
TL;DR: It is shown that increased bandwidth due to rapid time variation can mask the expected instantaneous spectral representation; current spectral analyses are very likely to provide inconsistent information for accurate classification of rapidly time-varying events such as stops.
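
The bandwidth effect is easy to reproduce numerically. In this quick illustration (my example, not the paper's data), a steady tone and a rapidly swept tone are analyzed over the same 10 ms window; the sweep's energy spreads across roughly the whole swept band.

```python
# Compare the apparent bandwidth of a stationary tone and a fast chirp
# measured over one short analysis window.
import numpy as np
from scipy.signal import chirp

fs, dur = 16000, 0.010
t = np.arange(int(fs * dur)) / fs
tone  = np.sin(2 * np.pi * 1500 * t)
sweep = chirp(t, f0=1000, f1=2000, t1=dur)      # 1 kHz sweep in 10 ms
w = np.hanning(len(t))
for name, x in [("tone", tone), ("sweep", sweep)]:
    mag = np.abs(np.fft.rfft(x * w))
    f = np.fft.rfftfreq(len(t), 1 / fs)
    band = f[mag > mag.max() / 10]              # -20 dB bandwidth, roughly
    print(name, "apparent bandwidth ~ %.0f Hz" % (band.max() - band.min()))
```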

9 citations


Proceedings ArticleDOI
Michael Riley1
01 Apr 1987
TL;DR: Time-frequency ridges, the 2-D analog of spectral peaks, are proposed as features of the smoothed time-frequency energy surfaces, and can be found by examining the derivatives of those surfaces.
Abstract: This work addresses two related questions. The first is what joint time-frequency energy representations are most appropriate for speech signals, in particular, for the analysis of formant structure. Quasi-stationarity is not assumed, since it neglects dynamic regions. A set of desired properties is proposed, and a subclass of the quadratic transforms that best meets these criteria is derived, which consists of two-dimensionally smoothed Wigner distributions with gaussian kernels. The second question addressed is how to obtain suitable symbolic descriptions of the phonetically relevant features in these time-frequency surfaces. We propose time-frequency ridges in these surfaces, the 2-D analog of spectral peaks, which can be found by examining the derivatives of the time-frequency surface produced above.
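
A discrete pseudo-Wigner distribution with 2-D Gaussian smoothing can be sketched in a few lines; the lag limit, FFT grid, and smoothing widths below are my own choices, not those derived in the paper.

```python
# Hedged sketch of a smoothed Wigner distribution: build the
# instantaneous autocorrelation x[t+m] * conj(x[t-m]), DFT over the lag
# m (Hermitian in m, so the spectrum is real), then smooth with a 2-D
# Gaussian kernel.
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.signal import hilbert

def smoothed_wigner(x, half_lag=64, sigma=(2.0, 2.0)):
    z = hilbert(x)                            # analytic signal
    n = len(z)
    W = np.zeros((half_lag + 1, n))
    for t in range(n):
        m = np.arange(min(half_lag, t, n - 1 - t) + 1)
        r = z[t + m] * np.conj(z[t - m])      # instantaneous autocorrelation
        kernel = np.zeros(half_lag + 1, dtype=complex)
        kernel[:len(m)] = r
        # hfft transforms a Hermitian lag sequence to a real spectrum.
        W[:, t] = np.fft.hfft(kernel, 2 * half_lag)[:half_lag + 1]
    return gaussian_filter(W, sigma)          # Gaussian-smoothed Wigner
```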

7 citations


Journal ArticleDOI
TL;DR: Some processing of the incoming sound waveform takes place in the cochlea itself: the cochlear partition is far from uniform, its stiffness decreasing monotonically from stapes to helicotrema, so the system behaves as a mechanical mass-stiffness transmission line, and the information sent along the auditory nerve must be carried in the occurrences of the pulses and in their temporal patterns.
Abstract: C. Daniel Geisler, University of Wisconsin-Madison. IN THE MAMMALIAN INNER EAR (cochlea), sound pressure is transduced into neural signals that are sent into the brain. Unlike a pick-up microphone, however, the cochlea does not simply prepare an analog electrical version of the pressure waveform. Such analog versions are, in fact, formed as intermediate steps in the transduction process, but the final cochlear output is in the form of brief pulses, called action potentials, which exist on the fibers of the nerve that connects the cochlea and the central nervous system. The cochlea thus serves the functions both of a microphone and of a sort of A-to-D converter. Moreover, as we shall see, some processing of the incoming sound waveform takes place there as well.
The pulses sent from the cochlea, similar to those which occur all over the mammalian nervous system, are approximately 100 mV in amplitude and about 1 ms in duration. As the pulse shapes do not vary appreciably, they, by definition, carry little information. What information is sent along the auditory nerve must therefore be carried in the occurrences of the pulses and in their temporal patterns. So far as is known, these pulses occur independently on each of the approximately 30,000 individual nerve fibers that make up each normal human auditory nerve. Thus, if we were to label the occurrence of each pulse as a binary "one", the cochlea could be thought of as encoding the impinging acoustic signal into thousands of parallel asynchronous binary signals. It is the purpose of this paper to describe some of the characteristics of these cochlear output signals and to introduce some of the mechanisms that generate them.
…where the net impedance of the series circuit representing the cochlear partition is largely capacitive, an L-C transmission line is formed. Signals applied to one end of such a line travel toward the other end with a velocity dependent upon the parameter values. In an analogous way, vibrations introduced by the stapes (the middle-ear bone connected to the cochlea) at the input port of the cochlea produce waves of displacement that travel along the cochlear partition [1]. If the tubes and the cochlear partition had uniform properties, signals of different frequencies applied to the stapes would travel along the cochlear partition toward the helicotrema (the base of the "U") with a constant velocity. No frequency separation would result, for signals of different frequencies would be treated alike. However, the cochlear partition is far from uniform; its stiffness decreases monotonically by at least several orders of magnitude in going from the stapes end to the helicotrema [1]. Consider what this means when a sinusoid of, say, 2 kHz is introduced into the cochlea. At this frequency, the stiffness of the cochlear partition located near the stapes dominates the cochlear-partition impedance, and so the system behaves as a mechanical mass-stiffness transmission line, as described above. Thus the 2 kHz energy introduced by the stapes is carried away from the input. However, the further along the cochlea that the wave progresses, the smaller the stiffness of the partition becomes. Since the resonant frequency of a mass-stiffness circuit is (stiffness/mass)^(1/2), the lower the stiffness, the lower the resonant frequency of that section of the cochlear partition. Eventually, the traveling wave reaches an area where the resonant frequency is 2 kHz. When this happens, resonance occurs, producing the following sequela: the …
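
The place-frequency logic in the last paragraph is easy to make concrete. The toy calculation below (all numbers invented for illustration) applies f_res = (1/(2*pi))*(stiffness/mass)^(1/2) along a partition whose stiffness falls exponentially by four orders of magnitude.

```python
# Toy place-frequency map: exponentially decreasing stiffness along the
# cochlear partition maps each place to a different resonant frequency.
import numpy as np

length_mm = 35.0
x = np.linspace(0, length_mm, 8)                  # distance from stapes
stiffness = 1e10 * 10 ** (-4 * x / length_mm)     # drops 4 orders of magnitude
mass = 1.0                                        # arbitrary constant mass
f_res = np.sqrt(stiffness / mass) / (2 * np.pi)
for xi, fi in zip(x, f_res):
    print("x = %4.1f mm  ->  f_res ~ %7.0f Hz" % (xi, fi))
```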

7 citations


Proceedings ArticleDOI
01 Apr 1987
TL;DR: Experiments in applying the Short-Time Fourier Transform Magnitude function to the independent modification of formant and pitch information are discussed.
Abstract: A speech modification method presented by Griffin and Lim [1] involves modifying the Short-Time Fourier Transform Magnitude function of the original speech, and using it as a reference to iteratively approximate a modified waveform. This method has been successfully applied to time-scale modification and noise reduction of speech [1-3]. In this paper, we discuss our experiments in applying the algorithm to the independent modification of formant and pitch information.
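
The Griffin and Lim [1] iteration referenced above is compact enough to sketch. The version below uses scipy's stft/istft with a random initial phase; the parameter values and the frame-count guard are my own choices, not the authors'.

```python
# Minimal Griffin-Lim-style sketch: iteratively impose a target STFT
# magnitude while keeping the phase implied by the current signal
# estimate.
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(target_mag, fs, nperseg=256, n_iter=50):
    """target_mag: desired |STFT|, shape (n_freq, n_frames)."""
    rng = np.random.default_rng(0)
    X = target_mag * np.exp(1j * rng.uniform(0, 2 * np.pi, target_mag.shape))
    for _ in range(n_iter):
        _, x = istft(X, fs, nperseg=nperseg)        # current time-domain estimate
        _, _, X = stft(x, fs, nperseg=nperseg)      # its (inconsistent) STFT
        k = min(X.shape[1], target_mag.shape[1])    # guard against frame drift
        X = target_mag[:, :k] * np.exp(1j * np.angle(X[:, :k]))
    return istft(X, fs, nperseg=nperseg)[1]
```

Modifying the target magnitude before iterating (stretching it in time, or warping it in frequency) yields the time-scale and formant/pitch modifications the paper discusses.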

5 citations


Journal ArticleDOI
TL;DR: Comparison of two tokens of the same utterance is central to many automatic speech recognition systems; early results with dynamic frequency warping are promising: tokens warped in both time and frequency match better than tokens warped only in time.
Abstract: Comparison of two tokens of the same utterance is central to many automatic speech recognition systems. Matching is usually done in the frequency‐time domain; token matching is effectively spectrogram matching. Dynamic time warping (DTW) overcomes, to some extent, the temporal variability of speech tokens; spectrograms are time‐aligned by calculating similarity scores between segments of speech, now represented as “columns” of their spectrograms, and applying the mathematical technique called dynamic programming. DTW distorts the time scales of the spectrograms so that identical speech events in the two spectrograms now occur at identical times. Variability in frequency of these events is normally dealt with only by using robust distance measures. There is a better way. After time alignment, frequency variability can be dealt with specifically by doing a dynamic frequency warp (DFW), a process strictly analogous to the DTW. The “rows” of the speech spectrogram, which show the time behavior of the spectral components, play the same role in the DFW as the “columns” do in the DTW. Distances between the rows are calculated and passed to a dynamic program. The resulting DFW produces a distortion of the frequency scales such that identical speech events in the two tokens now occur both at the same time and at the same frequency. Experience with distance measures between rows is limited, but early results are promising: (1) Tokens that have been warped in both directions match better than tokens warped only in time. (2) Redoing the DTW after a DFW results in improved time alignment. (3) Tokens sound more like each other after DFW than before. (4) Pairs of speakers produce consistent frequency warps. Results will be demonstrated, and other applications suggested.
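
The symmetry between DTW and DFW means one dynamic-programming aligner serves both passes. The sketch below (my implementation, not the talk's) aligns spectrogram columns for the time warp, then rows of the time-aligned spectrograms for the frequency warp.

```python
# One DP aligner, used twice: over columns (DTW) and over rows (DFW).
import numpy as np

def dp_align(A, B):
    """A, B: sequences of vectors, shapes (n, d) and (m, d).
    Returns the minimum-cost warp path as (i, j) index pairs."""
    n, m = len(A), len(B)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1],
                                            D[i - 1, j - 1])
    path, (i, j) = [], (n, m)
    while (i, j) != (0, 0):                 # backtrack the cheapest path
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda ij: D[ij])
    return path[::-1]

# S1, S2: (n_freq, n_frames) spectrograms.
# Time warp: columns (one spectrum per frame) are the vectors.
#   time_path = dp_align(S1.T, S2.T)
# Frequency warp: after time alignment, rows (one trajectory per
# frequency bin) are aligned the same way.
#   freq_path = dp_align(S1_aligned, S2_aligned)
```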

4 citations


Proceedings ArticleDOI
10 Sep 1987
TL;DR: The methods developed for bioacoustic research yield new insights into the design of man-made imaging and pattern recognition systems and the range, cross-range ambiguity function can be used to improve imaging performance.
Abstract: Standard performance measures and statistical tests must be altered for research on animal sonar. The narrowband range-Doppler ambiguity function must be redefined to analyze wideband signals. A new range, cross-range ambiguity function is needed to represent angle estimation and spatial resolution properties of animal sonar systems. Echoes are transformed into time-frequency (spectrogram-like) representations by the peripheral auditory system. Detection, estimation, and pattern recognition capabilities of animals should thus be analyzed in terms of operations on spectrograms. The methods developed for bioacoustic research yield new insights into the design of man-made imaging and pattern recognition systems. The range, cross-range ambiguity function can be used to improve imaging performance. Important features for echo pattern recognition are illustrated by time-frequency plots showing (i) principal components for spectrograms and (ii) templates for optimum discrimination between data classes.
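
The wideband point can be illustrated with a small sketch: instead of correlating against frequency-shifted replicas (the narrowband ambiguity function), correlate against time-scaled, i.e. Doppler-stretched, replicas. The resampling-based scaling and normalization below are my own discretization, not the paper's definition.

```python
# Illustrative wideband ambiguity surface over (scale, delay).
import numpy as np
from scipy.signal import resample, correlate

def wideband_ambiguity(x, scales):
    rows = []
    for s in scales:
        # Time-scale the replica by factor s via resampling.
        replica = resample(x, max(2, int(round(len(x) / s))))
        c = correlate(x, replica, mode="full")
        rows.append(np.abs(c) * np.sqrt(s))
    # Pad rows to equal length so the result is a (scale, delay) array.
    L = max(len(r) for r in rows)
    return np.array([np.pad(r, (0, L - len(r))) for r in rows])

# Example: amb = wideband_ambiguity(echo, np.linspace(0.95, 1.05, 21))
```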

2 citations


01 Jan 1987
TL;DR: Preliminary experimental results show that phoneme identification using the visual acoustic-feature label is feasible for realizing the quantitative transformation rules between the acoustic feature measurements and phoneme candidates.
Abstract: In order to apply speech spectrogram reading heuristics to an automatic speech recognition system, a more accurate expression of the heuristics must be developed. In particular, the transformation between acoustic feature measurements and phoneme candidates must be developed in a quantitative manner. In this paper, a visual acoustic-feature label, and a phoneme identification approach using this label, are proposed. The visual acoustic-feature label, which is a polygon on a speech spectrogram, represents some aspects of an acoustic feature by its own geometric characteristics. Preliminary experimental results show that phoneme identification using the visual acoustic-feature label is feasible for realizing the quantitative transformation rules between the acoustic feature measurements and phoneme candidates.
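
A toy illustration of the label idea (the polygon geometry and measurements below are invented for the example): a visual acoustic-feature label as a polygon in (time, frequency) coordinates, with a point-in-polygon test deciding whether a measured feature supports a phoneme candidate.

```python
# Represent a spectrogram label as a polygon and test measurements
# against it.
import numpy as np
from matplotlib.path import Path

# Hypothetical label: a burst region between 20-40 ms and 2-4 kHz.
label = Path([(0.020, 2000), (0.040, 2000), (0.040, 4000), (0.020, 4000)])

measurements = np.array([(0.025, 3100),    # falls inside the label
                         (0.060, 3100)])   # falls outside
inside = label.contains_points(measurements)
print(inside)   # [ True False ]
```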