scispace - formally typeset
Topic

Speech coding

About: Speech coding is a research topic. Over its lifetime, 14,245 publications have been published within this topic, receiving 271,964 citations.


Papers
Book ChapterDOI
15 Apr 1996
TL;DR: Tests on small isolated-word vocabularies using a dynamic time warping based audio-visual recogniser demonstrate that real-time, contour-based lip tracking can be used to supplement acoustic-only speech recognisers, enabling robust recognition of speech in the presence of acoustic noise.
Abstract: Developments in dynamic contour tracking permit sparse representation of the outlines of moving contours. Given the increasing computing power of general-purpose workstations it is now possible to track human faces and parts of faces in real-time without special hardware. This paper describes a real-time lip tracker that uses a Kalman filter based dynamic contour to track the outline of the lips. Two alternative lip trackers, one that tracks lips from a profile view and the other from a frontal view, were developed to extract visual speech recognition features from the lip contour. In both cases, visual features have been incorporated into an acoustic automatic speech recogniser. Tests on small isolated-word vocabularies using a dynamic time warping based audio-visual recogniser demonstrate that real-time, contour-based lip tracking can be used to supplement acoustic-only speech recognisers, enabling robust recognition of speech in the presence of acoustic noise.
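The Kalman-filter tracking idea in the abstract above can be sketched in miniature. The snippet below is a hypothetical, minimal constant-velocity Kalman filter tracking a single lip-contour point in one dimension; the actual system tracks a full spline contour, and all matrices, noise values, and the `kalman_track` function are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kalman_track(measurements, dt=1.0, q=1e-3, r=0.25):
    """Constant-velocity Kalman filter over noisy 1-D position measurements."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition (position, velocity)
    H = np.array([[1.0, 0.0]])              # we observe position only
    Q = q * np.eye(2)                       # process noise covariance (assumed)
    R = np.array([[r]])                     # measurement noise covariance (assumed)
    x = np.array([measurements[0], 0.0])    # initial state: first measurement, zero velocity
    P = np.eye(2)                           # initial state covariance
    estimates = []
    for z in measurements:
        # predict step
        x = F @ x
        P = F @ P @ F.T + Q
        # update step with the new measurement
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
        x = x + K @ (np.array([z]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        estimates.append(x[0])
    return np.array(estimates)

# Noisy observations of a point moving at constant velocity
rng = np.random.default_rng(0)
true_pos = np.arange(50) * 0.5
noisy = true_pos + rng.normal(0, 0.5, size=50)
est = kalman_track(noisy)
```

Because the motion model matches the true dynamics, the filtered estimates should track the true trajectory more closely than the raw noisy measurements once the filter has converged.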

96 citations

PatentDOI
TL;DR: A computerized system time aligns frames of spoken training data against models of the speech sounds; automatically selects different sets of phonetic context classifications which divide the speech sound models into speech sound groups aligned against acoustically similar frames; creates model components from the frames aligned against speech sound groups with related classifications; and uses these model components to build a separate model for each related speech sound group.
Abstract: A computerized system time aligns frames of spoken training data against models of the speech sounds; automatically selects different sets of phonetic context classifications which divide the speech sound models into speech sound groups aligned against acoustically similar frames; creates model components from the frames aligned against speech sound groups with related classifications; and uses these model components to build a separate model for each related speech sound group. A decision tree classifies speech sounds into such groups, and related speech sound groups descend from common tree nodes. New speech samples time aligned against a given speech sound group's model update models of related speech sound groups, decreasing the training data required to adapt the system. The phonetic context classifications can be based on knowledge of which contextual features are associated with acoustic similarity. The computerized system samples speech sounds using a first, larger, parameter set; automatically selects combinations of phonetic context classifications which divide the speech sounds into groups whose frames are acoustically similar, such as by use of a decision tree; selects a second, smaller, set of parameters based on that set's ability to separate the frames aligned with each speech sound group, such as by use of linear discriminant analysis; and then uses these new parameters to represent frames and speech sound models. Then, using the new parameters, a decision tree classifier can be used to re-classify the speech sounds and to calculate new acoustic models for the resulting groups of speech sounds.
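The core tree-building step described above, choosing the phonetic-context question that splits a phone's training frames into the two acoustically most homogeneous groups, can be illustrated with a toy example. Everything below (the scalar frame feature, the candidate questions, the variance-based cost) is an invented sketch of that idea, not the patent's actual algorithm.

```python
import numpy as np

def within_group_cost(frames):
    """Summed squared deviation from the group mean (0 for an empty group)."""
    return float(np.sum((frames - frames.mean()) ** 2)) if len(frames) else 0.0

def best_question(samples, questions):
    """samples: list of (left_context, frame_feature) pairs.
    questions: dict mapping question name -> set of left contexts answering 'yes'.
    Returns the question giving the lowest summed within-group cost."""
    best, best_cost = None, np.inf
    for name, yes_set in questions.items():
        yes = np.array([f for c, f in samples if c in yes_set])
        no = np.array([f for c, f in samples if c not in yes_set])
        cost = within_group_cost(yes) + within_group_cost(no)
        if cost < best_cost:
            best, best_cost = name, cost
    return best, best_cost

# Toy frames of one phone: nasal left-contexts shift the feature, vowels do not
samples = [('m', 5.1), ('n', 5.0), ('m', 4.9), ('a', 1.0), ('i', 1.2), ('e', 0.9)]
questions = {
    'left_is_nasal': {'m', 'n'},
    'left_is_front_vowel': {'i', 'e'},
}
q, cost = best_question(samples, questions)
```

Splitting on the nasal-context question separates the two acoustic clusters cleanly, so it wins; applying this selection recursively at each node is what grows the decision tree.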

95 citations

Journal ArticleDOI
P. Mermelstein
TL;DR: A tutorial discussion is provided of the adaptive differential PCM (pulse-code modulation) coding method recommended by the group, which covers the subjective performance tests performed, mode initialization and mode switching, data-speed multiplexing, and communication between narrowband and wideband terminals.
Abstract: CCITT Study Group XVIII recognized the need for a new international coding standard on high-quality audio to allow interconnection of diverse switching, transmission, and terminal equipment and organized an expert group in 1983 to recommend an appropriate coding technique. A tutorial discussion is provided of the adaptive differential PCM (pulse-code modulation) coding method recommended by the group. The discussion covers the subjective performance tests performed, mode initialization and mode switching, data-speed multiplexing, and communication between narrowband and wideband terminals.
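The predict/quantize/adapt loop at the heart of ADPCM can be illustrated with an even simpler relative: 1-bit adaptive delta modulation with a CVSD-style step-size rule. This is a hedged sketch of the family of techniques, not the CCITT-recommended algorithm, and all constants (`step0`, growth and shrink factors, step limits) are invented for clarity.

```python
import numpy as np

def adm_codec(x, step0=0.1, grow=1.5, shrink=0.5, lo=0.01, hi=1.0):
    """Encode then decode with 1-bit adaptive delta modulation.
    Returns (bit stream, reconstructed signal)."""
    bits, recon = [], []
    pred, step, prev_bit = 0.0, step0, 0
    for s in x:
        bit = 1 if s >= pred else -1       # 1-bit "quantizer": sign of prediction error
        # CVSD-style adaptation: runs of identical bits mean slope overload,
        # so grow the step; alternating bits mean granular noise, so shrink it
        step = min(max(step * (grow if bit == prev_bit else shrink), lo), hi)
        pred += bit * step                 # the decoder tracks this same predictor
        bits.append(bit)
        recon.append(pred)
        prev_bit = bit
    return bits, np.array(recon)

t = np.linspace(0, 1, 400)
x = np.sin(2 * np.pi * 5 * t)              # smooth test tone standing in for speech
bits, y = adm_codec(x)
```

The single bit per sample is enough because the adaptive step size lets the coder speed up on steep signal segments and settle down on flat ones; full ADPCM refines the same idea with a multi-bit quantizer and a better predictor.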

95 citations

Proceedings ArticleDOI
26 Nov 1996
TL;DR: It is observed that wavelets concentrate speech energy into bands which differentiate between voiced and unvoiced speech, and it is shown that the Battle-Lemarie wavelet concentrates more than 97.5% of the signal energy into the approximation part of the coefficients.
Abstract: The trend towards real-time, low-bit-rate speech coders dictates current research efforts in speech compression. A method being evaluated uses wavelets for speech analysis and synthesis. Distinguishing between voiced and unvoiced speech, determining pitch, and methods for choosing optimum wavelets for speech compression are discussed. We observe that wavelets concentrate speech energy into bands which differentiate between voiced and unvoiced speech. Optimum wavelets are selected based on energy conservation properties in the approximation part of the wavelet coefficients. It is shown that the Battle-Lemarie wavelet concentrates more than 97.5% of the signal energy into the approximation part of the coefficients, followed closely by the Daubechies D20, D12, D10 or D8 wavelets. The Haar wavelets are the worst. Listening tests show that the Daubechies 10 preserves perceptual information better than other Daubechies wavelets and, indeed, a host of other orthogonal wavelets. Pitch periods and evolution can be identified from contour plots of coefficients obtained at several scales.
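The energy-concentration measure used to rank wavelets above can be sketched with a one-level Haar transform implemented directly in numpy (the paper compares Battle-Lemarie and Daubechies wavelets, which need a full wavelet toolbox; the `approx_energy_fraction` helper and the voiced/unvoiced stand-in signals are assumptions for illustration).

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: orthonormal low-pass / high-pass split."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                                 # pad odd-length input
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)    # approximation (low-pass)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)    # detail (high-pass)
    return approx, detail

def approx_energy_fraction(x):
    """Fraction of total signal energy captured by the approximation part."""
    a, d = haar_dwt(x)
    ea, ed = np.sum(a ** 2), np.sum(d ** 2)
    return ea / (ea + ed)

t = np.linspace(0, 1, 1024, endpoint=False)
voiced_like = np.sin(2 * np.pi * 8 * t)            # smooth, low-frequency stand-in
rng = np.random.default_rng(1)
unvoiced_like = rng.normal(size=1024)              # noise-like stand-in
frac_voiced = approx_energy_fraction(voiced_like)
frac_unvoiced = approx_energy_fraction(unvoiced_like)
```

Because the transform is orthonormal, the approximation and detail energies sum to the signal energy; a smooth voiced-like signal pushes nearly all of it into the approximation band, while noise-like unvoiced material splits roughly evenly, which is exactly the voiced/unvoiced contrast the abstract describes.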

95 citations

Patent
09 Jan 2004
TL;DR: In this article, an audio-visual speech activity recognition system (200b/c) of a video-enabled telecommunication device is described; it runs a real-time lip tracking application that can advantageously be used for a near-speaker detection algorithm in an environment where a speaker's voice is interfered with by statistically distributed background noise (n'(t)) including both environmental noise and surrounding persons' voices.
Abstract: The present invention generally relates to the field of noise reduction systems which are equipped with an audio-visual user interface, in particular to an audio-visual speech activity recognition system (200b/c) of a video-enabled telecommunication device which runs a real-time lip tracking application that can advantageously be used for a near-speaker detection algorithm in an environment where a speaker's voice is interfered with by statistically distributed background noise (n'(t)) including both environmental noise (n(t)) and surrounding persons' voices.
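The near-speaker idea in this patent, rejecting audio that is not accompanied by local lip motion, can be caricatured in a few lines. The frame features, thresholds, and the `near_speaker_frames` function below are all hypothetical illustrations of the fusion principle, not the patented algorithm.

```python
import numpy as np

def near_speaker_frames(audio_energy, lip_motion, e_thresh=0.5, m_thresh=0.2):
    """Declare speech activity only when audio energy AND tracked lip motion
    are both high, so background voices (energy without local lip motion)
    are rejected. Thresholds are invented for illustration."""
    audio_energy = np.asarray(audio_energy, dtype=float)
    lip_motion = np.asarray(lip_motion, dtype=float)
    return (audio_energy > e_thresh) & (lip_motion > m_thresh)

# Four frames: silence, near speaker talking, background voice only,
# lips moving silently (e.g. chewing)
energy = [0.1, 0.9, 0.8, 0.2]
motion = [0.0, 0.6, 0.0, 0.5]
active = near_speaker_frames(energy, motion)
```

Only the frame where both cues agree is flagged as near-speaker speech; either cue alone, loud background audio or silent lip motion, is rejected.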

95 citations


Network Information
Related Topics (5)
Signal processing: 73.4K papers, 983.5K citations (86% related)
Decoding methods: 65.7K papers, 900K citations (84% related)
Fading: 55.4K papers, 1M citations (80% related)
Feature vector: 48.8K papers, 954.4K citations (80% related)
Feature extraction: 111.8K papers, 2.1M citations (80% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    38
2022    84
2021    70
2020    62
2019    77
2018    108