
Showing papers on "Speech coding published in 1995"


Journal ArticleDOI
TL;DR: The survey indicates that the essential points in noisy speech recognition consist of incorporating time and frequency correlations, giving more importance to high SNR portions of speech in decision making, exploiting task-specific a priori knowledge both of speech and of noise, using class-dependent processing, and including auditory models in speech processing.

712 citations


Book
01 Nov 1995
TL;DR: An edited volume introducing speech coding and synthesis, with chapters on coding standards, analysis-by-synthesis and sinusoidal coding, waveform interpolation, low-delay and wideband coding, vector quantization, pitch tracking (RAPT), and text-to-speech synthesis.
Abstract: An introduction to speech coding, W.B. Kleijn and K.K. Paliwal; speech coding standards, R.V. Cox; linear-prediction based analysis-by-synthesis coding, P. Kroon and W.B. Kleijn; sinusoidal coding, R.J. McAulay and T.F. Quatieri; waveform interpolation for coding and synthesis, W.B. Kleijn and J. Haagen; low-delay coding of speech, J.-H. Chen; multimode and variable-rate coding of speech, A. Das et al; wideband speech coding, J.-P. Adoul and R. Lefebvre; vector quantization for speech transmission, P. Hedelin et al; theory for transmission of vector quantization data, P. Hedelin et al; waveform coding and auditory masking, R. Veldhuis and A. Kohlrausch; quantization of LPC parameters, K.K. Paliwal and W.B. Kleijn; evaluation of speech coders, P. Kroon; a robust algorithm for pitch tracking (RAPT), D. Talkin; time-domain and frequency-domain techniques for prosodic modification of speech, E. Moulines and W. Verhelst; nonlinear processing of speech, G. Kubin; an approach to text-to-speech synthesis, R. Sproat and J. Olive; the generation of prosodic structure and intonation in speech synthesis, J. Terken and R. Collier; computation of timing in text-to-speech synthesis, J.P.H. van Santen; objective optimization in algorithms for text-to-speech synthesis, Y. Sagisaka and N. Iwahashi; quality evaluation of synthesized speech, V.J. van Heuven and R. van Bezooijen.

621 citations


Patent
27 Mar 1995
TL;DR: A code frequency component in the encoded audio signal is detected based on an expected code amplitude or on a noise amplitude within a range of audio frequencies including the frequency of the code component as discussed by the authors.
Abstract: Apparatus and methods for including a code (68) having at least one code frequency component in an audio signal (60) are provided. The abilities of various frequency components in the audio signal to mask the code frequency component to human hearing are evaluated (64), and based on these evaluations an amplitude (76) is assigned to the code frequency component. Methods and apparatus for detecting a code in an encoded audio signal are also provided. A code frequency component in the encoded audio signal is detected based on an expected code amplitude or on a noise amplitude within a range of audio frequencies including the frequency of the code component.
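
A toy sketch of the embed/detect pair may help make the mechanism concrete; the band width, mask fraction, and detection threshold below are illustrative assumptions, not values from the patent, which evaluates masking far more carefully:

```python
import numpy as np

def embed_code_tone(audio, fs, code_freq, mask_fraction=0.01):
    # Crude proxy for the masking evaluation: measure spectral energy in
    # a band around code_freq and hide the tone at a small fraction of it.
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), 1.0 / fs)
    band = (freqs > code_freq - 200) & (freqs < code_freq + 200)
    band_rms = np.sqrt(np.mean(np.abs(spectrum[band]) ** 2))
    amplitude = mask_fraction * band_rms * 2.0 / len(audio)  # time-domain scale
    t = np.arange(len(audio)) / fs
    return audio + amplitude * np.sin(2.0 * np.pi * code_freq * t)

def detect_code_tone(audio, fs, code_freq, threshold=3.0):
    # Compare the code bin's magnitude against the noise amplitude of the
    # surrounding bins, echoing the detection claim.
    spectrum = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), 1.0 / fs)
    k = int(np.argmin(np.abs(freqs - code_freq)))
    noise = np.r_[spectrum[max(k - 20, 0):k - 2], spectrum[k + 3:k + 21]]
    return spectrum[k] > threshold * np.median(noise)
```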

554 citations


Journal ArticleDOI
TL;DR: A modification of the Viterbi decoding algorithm (VA) for binary trellises which uses a priori or a posteriori information about the source bit probability for better decoding in addition to soft inputs and channel state information is proposed.
Abstract: Source and channel coding have been treated separately in most cases. It can be observed that most source coding algorithms for voice, audio and images still have correlation in certain bits. Transmission errors in these bits usually account for the significant errors in the reconstructed source signal. This paper proposes a modification of the Viterbi decoding algorithm (VA) for binary trellises which uses a priori or a posteriori information about the source bit probability for better decoding, in addition to soft inputs and channel state information. Analytical upper bounds for the BER of convolutional codes under this modified VA (APRI-VA) are given. The algorithm is combined with the soft-output Viterbi algorithm (SOVA) and an estimator for the residual correlation of the source bits to achieve source-controlled channel decoding for framed source bits. The description is simplified by an algebra for the log-likelihood ratio L(u)=log(P(u=+1)/P(u=-1)), which allows a clear definition of the "soft" values of source, channel, and decoded bits as well as a simplified description of the traceback version of the SOVA. Applications are given for PCM transmission and the full-rate GSM speech codec. For a PCM-coded, oversampled, bandlimited Gaussian source transmitted over Gaussian and Rayleigh channels with convolutional codes, the decoding errors are reduced by a factor of 4 to 5 when the APRI-SOVA is used instead of the VA. A simple dynamic Markov correlation estimator is used. With these receiver-only modifications, the channel SNR in a bad mobile environment can be lowered by 2 to 4 dB while maintaining the same voice quality. Further applications are briefly discussed.
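
The log-likelihood algebra is compact enough to sketch. A minimal illustration assuming BPSK over an AWGN channel (function names are mine; in the full APRI-VA the a priori term attaches to the hypothesised information bit of each trellis branch):

```python
import math

def llr(p_plus):
    # Soft value L(u) = log(P(u=+1) / P(u=-1)).
    return math.log(p_plus / (1.0 - p_plus))

def branch_metric_increment(y, u, sigma2, L_apriori):
    # For BPSK over AWGN the channel LLR is 2*y/sigma^2; the APRI-VA adds
    # the a priori source-bit LLR before correlating with the hypothesised
    # bit u in {-1, +1}.
    L_channel = 2.0 * y / sigma2
    return 0.5 * u * (L_channel + L_apriori)
```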

476 citations


Book
01 Feb 1995
TL;DR: A detailed account of the most recently developed digital speech coders designed specifically for use in the evolving communications systems, including an in-depth examination of the important topic of code excited linear prediction (CELP).
Abstract: From the Publisher: A detailed account of the most recently developed digital speech coders designed specifically for use in the evolving communications systems. Discusses the variety of speech coders utilized in such new systems as the MBE-based INMARSAT-M. Includes an in-depth examination of the important topic of code excited linear prediction (CELP).

453 citations


Journal ArticleDOI
TL;DR: This paper presents the first complete description of the original postfiltering algorithm and the underlying ideas that motivated its development; the postfilter achieves noticeable noise reduction while introducing only minimal distortion in speech.
Abstract: An adaptive postfiltering algorithm for enhancing the perceptual quality of coded speech is presented. The postfilter consists of a long-term postfilter section in cascade with a short-term postfilter section and includes spectral tilt compensation and automatic gain control. The long-term section emphasizes pitch harmonics and attenuates the spectral valleys between pitch harmonics. The short-term section, on the other hand, emphasizes speech formants and attenuates the spectral valleys between formants. Both filter sections have poles and zeros. Unlike earlier postfilters that often introduced a substantial amount of muffling to the output speech, our postfilter significantly reduces this effect by minimizing the spectral tilt in its frequency response. As a result, this postfilter achieves noticeable noise reduction while introducing only minimal distortion in speech. The complexity of the postfilter is quite low. Variations of this postfilter are now being used in several national and international speech coding standards. This paper presents for the first time a complete description of our original postfiltering algorithm and the underlying ideas that motivated its development.
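
A rough sketch of the short-term section with tilt compensation and gain control is below; the long-term (pitch) section is omitted, and alpha, beta, mu, and the first-coefficient tilt estimate are commonly used choices rather than values taken from this paper:

```python
import numpy as np
from scipy.signal import lfilter

def short_term_postfilter(speech, lpc, alpha=0.8, beta=0.5, mu=0.5):
    # lpc holds a_1..a_p with the convention A(z) = 1 - sum_i a_i z^-i.
    lpc = np.asarray(lpc, dtype=float)
    p = len(lpc)
    num = np.r_[1.0, -lpc * beta ** np.arange(1, p + 1)]   # A(z/beta)
    den = np.r_[1.0, -lpc * alpha ** np.arange(1, p + 1)]  # A(z/alpha)
    y = lfilter(num, den, speech)                          # pole-zero section
    k1 = lpc[0]                      # crude spectral-tilt estimate (assumed)
    y = lfilter([1.0, -mu * k1], [1.0], y)                 # tilt compensation
    # Simple automatic gain control: restore the input RMS level.
    gain = np.sqrt(np.mean(np.square(speech)) / (np.mean(np.square(y)) + 1e-12))
    return gain * y
```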

278 citations


PatentDOI
Peter Kroon1
TL;DR: In this paper, a speech coding system employing an adaptive codebook model of periodicity is augmented with a pitch-predictive filter (PPF), which has a delay equal to the integer component of the pitch-period and a gain which is adaptive based on a measure of the periodicity of the speech signal.
Abstract: A speech coding system employing an adaptive codebook model of periodicity is augmented with a pitch-predictive filter (PPF). This PPF has a delay equal to the integer component of the pitch-period and a gain which is adaptive based on a measure of periodicity of the speech signal. In accordance with an embodiment of the present invention, speech processing systems which include a first portion comprising an adaptive codebook and corresponding adaptive codebook amplifier and a second portion comprising a fixed codebook coupled to a pitch filter, are adapted to delay the adaptive codebook gain; determine the pitch filter gain based on the delayed adaptive codebook gain; and amplify samples of a signal in the pitch filter based on said determined pitch filter gain. The adaptive codebook gain is delayed for one subframe. The pitch filter gain equals the delayed adaptive codebook gain, except when the adaptive codebook gain is either less than 0.2 or greater than 0.8, in which cases the pitch filter gain is set equal to 0.2 or 0.8, respectively.
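
The gain rule in the last two sentences reduces to a clamp, and the PPF to a one-tap filter; a sketch, where the FIR form of the filter is an assumption:

```python
import numpy as np

def pitch_filter_gain(delayed_adaptive_gain):
    # Gain rule from the claim: the one-subframe-delayed adaptive codebook
    # gain, clamped to the range [0.2, 0.8].
    return min(max(delayed_adaptive_gain, 0.2), 0.8)

def apply_ppf(fixed_codevector, pitch_lag, gain):
    # One-tap pitch-predictive filter over the fixed-codebook samples,
    # y[n] = x[n] + g * x[n - T] (the FIR form is an assumption here).
    assert 1 <= pitch_lag < len(fixed_codevector)
    x = np.asarray(fixed_codevector, dtype=float)
    y = x.copy()
    y[pitch_lag:] += gain * x[:-pitch_lag]
    return y
```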

271 citations


PatentDOI
TL;DR: In this paper, each audio signal is digitized and then transformed into a predefined visual image, which is displayed in a 3D space, and selected audio characteristics, such as frequency, amplitude, time and spatial placement, are correlated to selected visual characteristics of the visual image.
Abstract: A method and apparatus for mixing audio signals. Each audio signal is digitized and then transformed into a predefined visual image, which is displayed in a three-dimensional space. Selected audio characteristics of the audio signal, such as frequency, amplitude, time and spatial placement, are correlated to selected visual characteristics of the visual image, such as size, location, texture, density and color. Dynamic changes or adjustment to any one of these parameters causes a corresponding change in the correlated parameter.

218 citations


Journal ArticleDOI
TL;DR: A theoretical analysis of high-rate vector quantization (VQ) systems that use suboptimal, mismatched distortion measures is presented, and the application of the analysis to the problem of quantizing the linear predictive coding (LPC) parameters in speech coding systems is described.
Abstract: The paper presents a theoretical analysis of high-rate vector quantization (VQ) systems that use suboptimal, mismatched distortion measures, and describes the application of the analysis to the problem of quantizing the linear predictive coding (LPC) parameters in speech coding systems. First, it is shown that in many high-rate VQ systems the quantization distortion approaches a simple quadratically weighted error measure, where the weighting matrix is a "sensitivity matrix" that is an extension of the concept of the scalar sensitivity. The approximate performance of VQ systems that train and quantize using mismatched distortion measures is derived, and is used to construct better distortion measures. Second, these results are used to determine the performance of LPC vector quantizers, as measured by the log spectral distortion (LSD) measure, which have been trained using other error measures, such as mean-squared error (MSE) or weighted mean-squared error (WMSE) measures of LPC parameters, reflection coefficients and transforms thereof, and line spectral pair (LSP) frequencies. Computationally efficient algorithms for computing the sensitivity matrices of these parameters are described. In particular, it is shown that the sensitivity matrix for the LSP frequencies is diagonal, implying that a WMSE measure on the LSP frequencies converges to the LSD measure in high-rate VQ systems. Experimental results to support the theoretical performance estimates are provided.
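
The core approximation is a quadratically weighted error, which takes one line given a sensitivity matrix; a sketch of the general form and of the diagonal LSP special case (the NumPy framing and names are mine):

```python
import numpy as np

def weighted_distortion(x, x_hat, sensitivity):
    # High-rate approximation d(x, x_hat) ~ (x - x_hat)^T S (x - x_hat),
    # with S the sensitivity matrix of the target (LSD) measure.
    e = np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)
    return float(e @ np.asarray(sensitivity) @ e)

def lsp_wmse(lsp, lsp_hat, diag_weights):
    # For LSP frequencies S is diagonal, so the measure reduces to a
    # per-component weighted mean-squared error.
    e = np.asarray(lsp, dtype=float) - np.asarray(lsp_hat, dtype=float)
    return float(np.sum(np.asarray(diag_weights) * e * e))
```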

182 citations


Patent
27 Mar 1995
TL;DR: A code frequency component in the encoded audio signal is detected based on an expected code amplitude or on a noise amplitude within a range of audio frequencies including the frequency of the code component as mentioned in this paper.
Abstract: Apparatus and methods for including a code having at least one code frequency component in an audio signal are provided. The abilities of various frequency components in the audio signal to mask the code frequency component to human hearing are evaluated and based on these evaluations an amplitude is assigned to the code frequency component. Methods and apparatus for detecting a code in an encoded audio signal are also provided. A code frequency component in the encoded audio signal is detected based on an expected code amplitude or on a noise amplitude within a range of audio frequencies including the frequency of the code component.

179 citations


PatentDOI
TL;DR: A modular system and method is provided for encoding and decoding of speech signals using voicing probability determination and the use of the system in the generation of a variety of voice effects.
Abstract: A modular system and method is provided for encoding and decoding of speech signals using voicing probability determination. The continuous input speech is divided into time segments of a predetermined length. For each segment the encoder of the system computes the signal pitch and a parameter which is related to the relative content of voiced and unvoiced portions in the spectrum of the signal, which is expressed as a ratio Pv, defined as a voicing probability. The voiced portion of the signal spectrum, as determined by the parameter Pv, is encoded using a set of harmonically related amplitudes corresponding to the estimated pitch. The unvoiced portion of the signal is processed in a separate processing branch which uses a modified linear predictive coding algorithm. Parameters representing both the voiced and the unvoiced portions of a speech segment are combined in data packets for transmission. In the decoder, speech is synthesized from the transmitted parameters representing voiced and unvoiced portions of the speech in a reverse order. Boundary conditions between voiced and unvoiced segments are established to ensure amplitude and phase continuity for improved output speech quality. Perceptually smooth transition between frames is ensured by using an overlap and add method of synthesis. Also disclosed is the use of the system in the generation of a variety of voice effects.
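
A toy stand-in for the voicing probability Pv is the normalized autocorrelation at the pitch lag, sketched below; the patent's actual spectral computation of Pv is not reproduced:

```python
import numpy as np

def voicing_probability(frame, pitch_lag):
    # Normalized autocorrelation at the pitch lag, clipped to [0, 1];
    # a rough proxy for Pv, not the patent's spectral computation.
    x = np.asarray(frame, dtype=float) - np.mean(frame)
    a, b = x[pitch_lag:], x[:len(x) - pitch_lag]
    num = np.dot(a, b)
    den = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12
    return float(np.clip(num / den, 0.0, 1.0))
```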

PatentDOI
Chen Juin-Hwey1
TL;DR: In this paper, a speech coding system robust to frame erasure (or packet loss) is described, where vectors of an excitation signal are synthesized based on previously stored excitation signals generated during non-erased frames.
Abstract: A speech coding system robust to frame erasure (or packet loss) is described. Illustrative embodiments are directed to a modified version of CCITT standard G.728. In the event of frame erasure, vectors of an excitation signal are synthesized based on previously stored excitation signal vectors generated during non-erased frames. This synthesis differs for voiced and non-voiced speech. During erased frames, linear prediction filter coefficients are synthesized as a weighted extrapolation of a set of linear prediction filter coefficients determined during non-erased frames. The weighting factor is a number less than 1. This weighting accomplishes a bandwidth-expansion of peaks in the frequency response of a linear predictive filter. Computational complexity during erased frames is reduced through the elimination of certain computations needed during non-erased frames only. This reduction in computational complexity offsets additional computation required for excitation signal synthesis and linear prediction filter coefficient generation during erased frames.
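
The coefficient extrapolation amounts to scaling each a_i by w^i with w < 1, which widens the spectral peaks of 1/A(z); a sketch, where w = 0.97 is an illustrative value (the patent states only that the factor is less than 1):

```python
import numpy as np

def bandwidth_expand(lpc, w=0.97):
    # Erased-frame extrapolation: scale a_i by w**i with w < 1, which
    # bandwidth-expands the spectral peaks of the synthesis filter 1/A(z).
    lpc = np.asarray(lpc, dtype=float)
    return lpc * w ** np.arange(1, len(lpc) + 1)
```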

Proceedings ArticleDOI
09 May 1995
TL;DR: It is suggested that phone rate is a more meaningful measure of speech rate than the more common word rate, and it is found that when data sets are clustered according to the phone rate metric, recognition errors increase when the phone rate is more than 1 standard deviation greater than the mean.
Abstract: It is well known that a higher-than-normal speech rate will cause the rate of recognition errors in large vocabulary automatic speech recognition (ASR) systems to increase. In this paper we attempt to identify and correct for errors due to fast speech. We first suggest that phone rate is a more meaningful measure of speech rate than the more common word rate. We find that when data sets are clustered according to the phone rate metric, recognition errors increase when the phone rate is more than 1 standard deviation greater than the mean. We propose three methods to improve the recognition accuracy of fast speech, each addressing different aspects of performance degradation. The first method is an implementation of Baum-Welch codebook adaptation. The second method is based on the adaptation of HMM state-transition probabilities. In the third method, the pronunciation dictionaries are modified using rule-based techniques and compound words are added. We compare improvements in recognition accuracy for each method using data sets clustered according to the phone rate metric. Adaptation of the HMM state-transition probabilities to fast speech improves recognition of fast speech by a relative amount of 4 to 6 percent.
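
The clustering criterion is easy to express; a sketch of the phone-rate flagging step, assuming per-utterance phone counts and durations are available:

```python
import numpy as np

def flag_fast_utterances(phone_counts, durations_s):
    # Phone rate = phones per second; flag utterances more than one
    # standard deviation above the mean rate, the region where the paper
    # reports recognition errors increasing.
    rates = np.asarray(phone_counts, dtype=float) / np.asarray(durations_s)
    return rates > rates.mean() + rates.std()
```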

PatentDOI
TL;DR: A computerized system time aligns frames of spoken training data against models of the speech sounds; automatically selects different sets of phonetic context classifications which divide the speech sound models into speech sound groups aligned against acoustically similar frames; creates model components from the frames aligned againstspeech sound groups with related classifications; and uses these model components to build a separate model for each related speech sound group.
Abstract: A computerized system time aligns frames of spoken training data against models of the speech sounds; automatically selects different sets of phonetic context classifications which divide the speech sound models into speech sound groups aligned against acoustically similar frames; creates model components from the frames aligned against speech sound groups with related classifications; and uses these model components to build a separate model for each related speech sound group. A decision tree classifies speech sounds into such groups, and related speech sound groups descend from common tree nodes. New speech samples time aligned against a given speech sound group's model update models of related speech sound groups, decreasing the training data required to adapt the system. The phonetic context classifications can be based on knowledge of which contextual features are associated with acoustic similarity. The computerized system samples speech sounds using a first, larger, parameter set; automatically selects combinations of phonetic context classifications which divide the speech sounds into groups whose frames are acoustically similar, such as by use of a decision tree; selects a second, smaller, set of parameters based on that set's ability to separate the frames aligned with each speech sound group, such as by use of linear discriminant analysis; and then uses these new parameters to represent frames and speech sound models. Then, using the new parameters, a decision tree classifier can be used to re-classify the speech sounds and to calculate new acoustic models for the resulting groups of speech sounds.

Proceedings ArticleDOI
09 May 1995
TL;DR: A 2.4 kb/s coder uses waveform interpolation principles to represent the speech signal as an evolving characteristic waveform (CW); a significant increase in coding efficiency is obtained by coding the voiced and unvoiced components of the signal separately.
Abstract: For low-rate speech coding it is advantageous to represent the speech signal as an evolving characteristic waveform (CW). The CW evolves slowly when the speech signal is clearly voiced and rapidly when the speech signal is clearly unvoiced. The voiced (periodic) and unvoiced (nonperiodic) components of the speech signal can be separated by a simple nonadaptive filter in the CW domain. Because of perceptual effects, a significant increase in coding efficiency is obtained by coding these two components separately. A 2.4 kb/s coder using these principles was developed. In an independent evaluation, the performance of the 2.4 kb/s waveform interpolation (WI) coder was found to be at least equivalent to that of the 4.8 kb/s FS1016 standard in all of the tests performed.
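
A sketch of the CW-domain separation: stacking successive characteristic waveforms as rows and lowpass filtering along the evolution axis yields the slowly evolving (voiced-like) component, and the residue is the rapidly evolving one; the 3-tap filter is a stand-in for the paper's nonadaptive filter:

```python
import numpy as np
from scipy.signal import lfilter

def split_cw_evolution(cw_surface, b=(0.25, 0.5, 0.25)):
    # cw_surface: rows are successive characteristic waveforms, columns
    # are phase positions. Lowpass filtering down each column (along the
    # evolution axis) keeps the slowly evolving waveform (SEW); the
    # residue is the rapidly evolving waveform (REW).
    cw_surface = np.asarray(cw_surface, dtype=float)
    sew = lfilter(list(b), [1.0], cw_surface, axis=0)
    rew = cw_surface - sew
    return sew, rew
```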

Patent
Philippe Ferriere1
11 Oct 1995
TL;DR: In this article, an audio data transmission system uses computing units which are designed to select an appropriate combination of block size and input sampling rate to maximize the available bandwidth of the receiving modem.
Abstract: An audio data transmission system encodes audio files into individual audio data blocks which contain a variable number of bits of digital audio data sampled at a selectable sample rate. The number of bits of digital data and the input sampling rate are scalable to produce an encoded bit stream bit rate that is less than or equal to an effective operational bit rate of a recipient's modem. The audio data transmission system uses computing units which are designed to select an appropriate combination of block size and input sampling rate to make full use of the available bandwidth of the receiving modem. For example, if the modem connection speed for one modem is 14.4 kbps, a version of the audio data compressed at 13000 bits/s might be sent to the recipient; if the modem connection speed for another modem is 28.8 kbps, a version of the audio data compressed at 24255 bits/s might be sent to the recipient. The audio data blocks are then transmitted at the encoded bit stream bit rate to the intended recipient's modem. The audio data blocks are decoded at the recipient to reconstruct the audio file and immediately play the audio file as it is received. The audio data transmission system can be implemented in online service systems, ITV systems, computer data network systems, and communication systems.
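
The selection rule reduces to picking the largest pre-encoded rate that fits the connection; a sketch matching the 14.4 kbps -> 13000 bit/s and 28.8 kbps -> 24255 bit/s examples above:

```python
def pick_encoded_rate(modem_bps, available_rates_bps):
    # Choose the highest pre-encoded bit rate not exceeding the recipient
    # modem's effective rate.
    usable = [r for r in available_rates_bps if r <= modem_bps]
    if not usable:
        raise ValueError("no encoded version fits this connection")
    return max(usable)

# e.g. pick_encoded_rate(14400, [13000, 24255]) -> 13000
#      pick_encoded_rate(28800, [13000, 24255]) -> 24255
```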

Journal ArticleDOI
TL;DR: This work shows that significant improvements in performance are obtained as compared to an earlier system proposed by Jayant and Christensen (1981) for packetized speech systems, and that for a first-order Gauss-Markov source significant performance improvements can be obtained by using a second-order predictor instead of a first-order predictor.
Abstract: Speech quality in packetized speech systems can degrade substantially when packets are lost. We consider the problem of DPCM system design for packetized speech systems. The problem is formulated as a multiple description problem and the problem of optimal selection of the encoder and decoder filters is addressed. We show that significant improvements in performance are obtained as compared to an earlier system proposed by Jayant and Christensen (1981). Further, we show that for a first-order Gauss-Markov source significant performance improvements can be obtained by using a second-order predictor instead of a first-order predictor.
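
For concreteness, a minimal DPCM encoder with the second-order predictor the paper advocates; the coefficients and rounding quantizer are placeholders, and the paper's optimal filter design is not reproduced:

```python
import numpy as np

def dpcm_encode(x, a1, a2, quantize=np.round):
    # DPCM with a second-order predictor x_hat[n] = a1*y[n-1] + a2*y[n-2],
    # where y tracks the decoder's reconstruction.
    y1 = y2 = 0.0
    codes = np.empty(len(x))
    for n, sample in enumerate(x):
        pred = a1 * y1 + a2 * y2
        codes[n] = quantize(sample - pred)   # quantized prediction residual
        y2, y1 = y1, pred + codes[n]         # decoder-tracked state
    return codes
```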

Patent
02 Mar 1995
TL;DR: In this paper, a method for providing described television services includes the steps of generating description data corresponding to an audiovisual program, converting the description data to a speech signal corresponding to the description data, synchronizing the speech signal with the audiovisual program using a time code signal from the audiovisual program, and mixing the synchronized speech signal with the audio track of the audiovisual program to create a combined audio signal.
Abstract: An apparatus for providing described television services includes a receiver for receiving description data corresponding to an audiovisual program; a text-to-speech converter for converting the description data into a speech signal corresponding to the description data; a memory device for receiving and storing the speech signal and a corresponding time code from the audiovisual program; a mixing circuit for retrieving the speech signal from the memory device and mixing the retrieved speech signal with the audio track of the audiovisual program to produce a combined audio signal; and a transmitter for simultaneously providing the combined speech signal and the audiovisual program to a viewer. The apparatus provides the combined speech signal to the viewer via the SAP channel. The apparatus may also include a translator for translating the description data into a foreign language prior to converting the description data into the speech signal. A method for providing described television services includes the steps of generating description data corresponding to an audiovisual program; converting the description data to a speech signal corresponding to the description data; synchronizing the speech signal with the audiovisual program using a time code signal from the audiovisual program; mixing the synchronized speech signal with the audio track of the audiovisual program to create a combined audio signal; and simultaneously transmitting the combined audio signal and the audiovisual program to the viewer.

PatentDOI
Willem Bastiaan Kleijn1
TL;DR: In this article, a plurality of sets of indexed parameters are generated based on samples of the speech signal; each set corresponds to a waveform characterizing the speech signal at a discrete point in time.
Abstract: A method of coding a speech signal is described. In accordance with the method, a plurality of sets of indexed parameters are generated based on samples of the speech signal. Each set of indexed parameters corresponds to a waveform characterizing the speech signal at a discrete point in time. Parameters of the plurality of sets are grouped based on index value to form a first set of signals which represents the evolution of characterizing waveform shape; the signals of the first set are filtered to remove low frequency components and thereby produce a second set of signals which represents relatively high rates of evolution of characterizing waveform shape. The speech signal is then coded based on the second set of signals representing high rates of characterizing waveform shape evolution. Coding of the speech signal may further be based on a set of smoothed first signals.

Patent
13 Dec 1995
TL;DR: In this paper, the authors proposed a TDMA mobile-to-mobile (M2M) communication protocol where the two digital signal processors are virtually connected at the channel codecs.
Abstract: In a TDMA mobile-to-mobile connection, the end-to-end audio signal quality as well as system performance can be improved by providing digital signal processors the capability to automatically switch configuration such that each digital signal processor in a mobile-to-mobile communication connection can automatically identify a TDMA mobile-to-mobile connection and bypass the speech encoding and decoding processes within the digital signal processors. The two digital signal processors are virtually connected at the channel codecs.

Patent
Toshiyuki Morii1
27 Nov 1995
TL;DR: In this article, a speech is analyzed by a speech analyzing unit to obtain sample characteristic parameters, and a coding distortion is calculated from the sampled characteristic parameters in each of a plurality of coding modules.
Abstract: A sample speech is analyzed by a speech analyzing unit to obtain sample characteristic parameters, and a coding distortion is calculated from the sample characteristic parameters in each of a plurality of coding modules. The sample characteristic parameters and the coding distortions are statistically processed by a statistical processing unit to obtain a coding module selecting rule. Thereafter, when a speech is analyzed by the speech analyzing unit to obtain characteristic parameters, an appropriate coding module is selected by a coding module selecting unit from the coding modules according to the coding module selecting rule on condition that a coding distortion for the characteristic parameters is minimized in the appropriate coding module. Thereafter, the characteristic parameters of the speech are coded in the appropriate coding module, and a coded speech is obtained. When the coded speech is decoded, a reproduced speech is obtained. Accordingly, because an appropriate coding module can be easily selected from a plurality of coding modules according to the coding module selecting rule, any allophone occurring in a reproduced speech can be prevented at a low calculation volume.
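
Once the statistically derived rule is available as a distortion predictor, the selection step is a one-liner; a sketch, with predict_distortion standing in for the rule produced by the statistical processing unit:

```python
def select_coding_module(params, modules, predict_distortion):
    # Pick the module whose predicted coding distortion for these
    # characteristic parameters is minimal.
    return min(modules, key=lambda m: predict_distortion(m, params))
```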

Patent
Peter Kroon1
28 Feb 1995
TL;DR: In this article, a speech coding system robust to frame erasure (or packet loss) is described, where vectors of an excitation signal are synthesized based on previously stored excitation signals generated during non-erased frames.
Abstract: A speech coding system robust to frame erasure (or packet loss) is described. Illustrative embodiments are directed to a modified version of CCITT standard G.728. In the event of frame erasure, vectors of an excitation signal are synthesized based on previously stored excitation signal vectors generated during non-erased frames. This synthesis differs for voiced and non-voiced speech. During erased frames, linear prediction filter coefficients are synthesized as a weighted extrapolation of a set of linear prediction filter coefficients determined during non-erased frames. The weighting factor is a number less than 1. This weighting accomplishes a bandwidth-expansion of peaks in the frequency response of a linear predictive filter. Computational complexity during erased frames is reduced through the elimination of certain computations needed during non-erased frames only. This reduction in computational complexity offsets additional computation required for excitation signal synthesis and linear prediction filter coefficient generation during erased frames.

Journal ArticleDOI
S. Vernon1
01 Aug 1995
TL;DR: The design and implementation of AC-3 coders are described, focusing on issues relevant to minimum cost solutions, and an overview of encoding and decoding strategies is presented.
Abstract: AC-3 is the perceptual coding technology used for HDTV audio compression. This paper describes the design and implementation of AC-3 coders, focusing on issues relevant to minimum cost solutions. AC-3 coding technology has been adopted by the Advanced Television Systems Committee (ATSC) as the audio service standard for high definition television (HDTV) in the United States. The AC-3 audio data compression system is described, and an overview of encoding and decoding strategies is presented.

Journal ArticleDOI
01 Jun 1995
TL;DR: Basic approaches to speech, wideband speech, and audio bit rate compression in audiovisual communications are explained; the use of knowledge of auditory perception helps minimize the perception of coding artifacts and leads to efficient low bit rate coding algorithms which achieve substantially more compression than was thought possible only a few years ago.
Abstract: Current and future visual communications for applications such as broadcasting, videotelephony, video- and audiographic-conferencing, and interactive multimedia services assume a substantial audio component. Even text, graphics, fax, still images, email documents, etc. will gain from voice annotation and audio clips. A wide range of speech, wideband speech, and wideband audio coders is available for such applications. In the context of audiovisual communications, the quality of telephone-bandwidth speech is acceptable for some videotelephony and videoconferencing services. Higher bandwidths (wideband speech) may be necessary to improve the intelligibility and naturalness of speech. High quality audio coding including multichannel audio will be necessary in advanced digital TV and multimedia services. This paper explains basic approaches to speech, wideband speech, and audio bit rate compression in audiovisual communications. These signal classes differ in bandwidth, dynamic range, and in listener expectation of offered quality. It will become obvious that the use of our knowledge of auditory perception helps minimize the perception of coding artifacts and leads to efficient low bit rate coding algorithms which can achieve substantially more compression than was thought possible only a few years ago. The paper concentrates on worldwide source coding standards beneficial for consumers, service providers, and manufacturers.

Proceedings ArticleDOI
14 Sep 1995
TL;DR: The first phase of the development of high quality audio coding for widespread use in broadcasting, telecommunication, computer and consumer applications has been finished with ISO/IEC 11172-3, but the finalisation of MPEG-1 is not the end of standardisation of high quality audio coding systems.
Abstract: The first phase of the development of high quality audio coding for widespread use in broadcasting, telecommunication, computer and consumer applications has been finished with ISO/IEC 11172-3, but the finalisation of MPEG-1 is not the end of standardisation of high quality audio coding systems. The MPEG-2 Audio multichannel coding system, which ensures forward and backward compatibility with ISO/IEC 11172-3 encoded audio signals, is designed for universal applications with and without accompanying picture. Envisaged applications besides DAB are digital television systems, digital video tape recorders and interactive storage media. Configurability with respect to the sound channel allocation and to the bit-rate offers useful combinations of various levels of multi-channel stereo performance and various numbers of channels in the composite and independent coding modes.


Patent
04 Apr 1995
TL;DR: In this paper, the authors exploit the synergy between operations performed by a speech rate modification system and those operations performed in a speech coding system to provide a speech-rate modification system with reduced hardware requirements.
Abstract: Synergy between operations performed by a speech-rate modification system and those operations performed in a speech coding system is exploited to provide a speech-rate modification system with reduced hardware requirements. The speech rate of an input signal is modified based on a signal representing a predetermined change in speech rate. The modified speech-rate signal is then filtered to generate a speech signal having increased short-term correlation. Modification of the input speech signal may be performed by inserting in the input speech signal a previous sequence of samples corresponding substantially to a pitch cycle. Alternatively, the input speech signal may be modified by removing from the input speech signal a sequence of samples corresponding substantially to a pitch cycle.
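
The two modification operations in the closing sentences are simple splices; a sketch assuming the pitch period and splice point are already known (pitch tracking is not shown):

```python
import numpy as np

def slow_down(speech, pitch_period, at):
    # Lengthen the signal by re-inserting the pitch cycle that ends at
    # sample index 'at' (the patent's insertion variant).
    assert at >= pitch_period
    cycle = speech[at - pitch_period:at]
    return np.concatenate([speech[:at], cycle, speech[at:]])

def speed_up(speech, pitch_period, at):
    # Shorten the signal by removing one pitch cycle starting at 'at'.
    return np.concatenate([speech[:at], speech[at + pitch_period:]])
```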

Journal ArticleDOI
T. Chen1, H.P. Graf1, Kuansan Wang2
TL;DR: The marriage of speech analysis and image processing can solve problems related to lip synchronization and speech information is utilized to improve the quality of audio-visual communications such as videotelephony and videoconferencing.
Abstract: We utilize speech information to improve the quality of audio-visual communications such as videotelephony and videoconferencing. In particular, the marriage of speech analysis and image processing can solve problems related to lip synchronization. We present a technique called speech-assisted frame-rate conversion. Demonstration sequences are presented. Other applications, including speech-assisted video coding, are outlined. >