
Showing papers on "Speech coding published in 1997"


Patent
21 Jul 1997
TL;DR: In this paper, a system that performs analysis and comparison of audio data files based upon the content of the data files is presented; the analysis produces a set of numeric values (a feature vector) that can be used to classify and rank the similarity between individual audio files typically stored in a multimedia database or on the Web.
Abstract: A system that performs analysis and comparison of audio data files based upon the content of the data files is presented. The analysis of the audio data produces a set of numeric values (a feature vector) that can be used to classify and rank the similarity between individual audio files typically stored in a multimedia database or on the World Wide Web. The analysis also facilitates the description of user-defined classes of audio files, based on an analysis of a set of audio files that are members of a user-defined class. The system can find sounds within a longer sound, allowing an audio recording to be automatically segmented into a series of shorter audio segments.

726 citations
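The patent does not disclose a specific feature set, so the following is only a minimal sketch of the general idea: summarize each file as a fixed-length feature vector of frame-level statistics (here, hypothetical energy, zero-crossing-rate, and spectral-centroid features) and rank files by vector distance.

```python
# Minimal sketch of content-based audio similarity ranking (hypothetical
# features; the patent does not disclose a specific feature set).
import numpy as np

def feature_vector(x, sr, frame=1024, hop=512):
    """Summarize a mono waveform as a small numeric feature vector."""
    feats = []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame]
        spec = np.abs(np.fft.rfft(w * np.hanning(frame)))
        freqs = np.fft.rfftfreq(frame, 1.0 / sr)
        energy = np.sum(w ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(w)))) / 2.0
        centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)
        feats.append([energy, zcr, centroid])
    f = np.array(feats)
    # Mean and std over frames -> one fixed-length vector per file.
    return np.concatenate([f.mean(axis=0), f.std(axis=0)])

def rank_by_similarity(query, database):
    """Return database indices sorted by Euclidean distance to the query."""
    d = [np.linalg.norm(query - v) for v in database]
    return np.argsort(d)
```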


Journal Article
TL;DR: The ISO/IEC MPEG-2 advanced audio coding (AAC) system was designed to provide MPEG-2 with the best audio quality without any restrictions due to compatibility requirements.
Abstract: The ISO/IEC MPEG-2 advanced audio coding (AAC) system was designed to provide MPEG-2 with the best audio quality without any restrictions due to compatibility requirements. The main features of the AAC system (ISO/IEC 13818-7) are described. MPEG-2 AAC combines the coding efficiency of a high-resolution filter bank, prediction techniques, and Huffman coding with additional functionalities aimed to deliver very high audio quality at a variety of data rates.

585 citations


Patent
25 Feb 1997
TL;DR: In this paper, a communication system for simultaneously transmitting ancillary codes and audio signals via a conventional audio communications channel using perceptual coding techniques is described, where an encoder monitors an audio channel to detect "opportunities" to insert an ancillary code such that the inserted signals are masked by the audio signal.
Abstract: A communication system for simultaneously transmitting ancillary codes and audio signals via a conventional audio communications channel using perceptual coding techniques is disclosed. An encoder monitors an audio channel to detect 'opportunities' to insert an ancillary code such that the inserted signals are masked by the audio signal, as defined by the 'perceptual entropy envelope' of the audio signal. An ancillary code containing, for example, an ID or serial number, is encoded as one or more whitened spread spectrum signals and/or a narrowband FSK ancillary code and transmitted at a time, frequency and/or level such that the data signal is masked by the audio signal. A decoder at a receiving location recovers the encoded ID or serial number.

459 citations


Journal ArticleDOI
TL;DR: Annex B defines a low-bit-rate silence compression scheme designed and optimized to work in conjunction with both the full version of G.729 and its low-complexity Annex A, enabling bit-rate savings for coded speech at average rates as low as 4 kb/s during normal speech conversation while maintaining reproduction quality.
Abstract: This article describes Annex B to ITU-T Recommendation G.729. Annex B defines a low-bit-rate silence compression scheme designed and optimized to work in conjunction with both the full version of G.729 and its low-complexity Annex A. To achieve good quality low-bit-rate silence compression, a robust frame-based voice activity detector module is essential to detect inactive voice frames, also called silence or background noise frames. For these detected inactive voice frames, a discontinuous transmission module measures the changes over time of the inactive voice signal characteristics and decides whether a new silence information descriptor frame should be sent to maintain the reproduction quality of the background noise at the receiving end. If such a frame is needed, the spectrum and energy parameters describing the perceptual characteristics of the background noise are efficiently coded and transmitted using 15 b/frame. At the receiving end, the comfort noise generation module regenerates the output background noise using transmitted updates or previously available parameters. The synthesized background noise is obtained by linear predictive filtering of a locally generated pseudo-white excitation signal of a controlled level. This method of coding the background noise enables the achievement of bit-rate savings for coded speech at average rates as low as 4 kb/s during normal speech conversation while maintaining reproduction quality.

332 citations
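The comfort-noise generation step lends itself to a short sketch: the decoder filters a locally generated pseudo-white excitation, scaled to the transmitted energy, through an LP synthesis filter built from the transmitted spectral parameters. The function and scaling below are illustrative assumptions, not the normative G.729 Annex B procedure.

```python
# Sketch of the comfort-noise generation idea in G.729 Annex B: filter a
# locally generated pseudo-white excitation through an LP synthesis filter
# built from the transmitted spectral parameters. Values and scaling are
# illustrative, not the normative G.729B procedure.
import numpy as np
from scipy.signal import lfilter

def comfort_noise(lpc, target_energy, n=80, seed=0):
    """Synthesize one frame of background noise (n samples)."""
    rng = np.random.default_rng(seed)
    excitation = rng.standard_normal(n)               # pseudo-white excitation
    excitation *= np.sqrt(target_energy / np.sum(excitation**2))
    # LP synthesis filter 1 / A(z), with A(z) = 1 + a1 z^-1 + ... + ap z^-p
    return lfilter([1.0], np.concatenate(([1.0], lpc)), excitation)
```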


Patent
28 Jan 1997
TL;DR: The use of EM radiation in conjunction with simultaneously recorded speech information enables a complete mathematical coding of acoustic speech as discussed by the authors, which can be used for speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis and speech translation.
Abstract: The use of EM radiation in conjunction with simultaneously recorded speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector (12, 13) for each pitch period of voiced speech and the forming of feature vectors (12, 13) for each time frame of unvoiced speech, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function (7) in each time frame. The formation of feature vectors (12, 13) defining all acoustic speech units over well-defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching.

313 citations
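One step the abstract names, deconvolving the excitation from the acoustic output to obtain a per-frame transfer function, can be sketched as a regularized spectral division; the windowing and regularization constant below are assumptions, not the patent's procedure.

```python
# Sketch of the deconvolution step: with the excitation e[n] known (here,
# from the EM sensor) and the acoustic output s[n] recorded, the transfer
# function over a frame can be estimated by regularized spectral division.
import numpy as np

def transfer_function(speech_frame, excitation_frame, eps=1e-6):
    S = np.fft.rfft(speech_frame * np.hanning(len(speech_frame)))
    E = np.fft.rfft(excitation_frame * np.hanning(len(excitation_frame)))
    return S * np.conj(E) / (np.abs(E) ** 2 + eps)   # Wiener-style division
```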


Book
01 Dec 1997

277 citations


Journal ArticleDOI
TL;DR: SpeechSkimmer, as discussed by the authors, uses speech-processing techniques to allow a user to hear recorded sounds quickly, and at several levels of detail, and provides continuous real-time control of the speed and detail level of the audio presentation.
Abstract: Listening to a speech recording is much more difficult than visually scanning a document because of the transient and temporal nature of audio. Audio recordings capture the richness of speech, yet it is difficult to directly browse the stored information. This article describes techniques for structuring, filtering, and presenting recorded speech, allowing a user to navigate and interactively find information in the audio domain. This article describes the SpeechSkimmer system for interactively skimming speech recordings. SpeechSkimmer uses speech-processing techniques to allow a user to hear recorded sounds quickly, and at several levels of detail. User interaction, through a manual input device, provides continuous real-time control of the speed and detail level of the audio presentation. SpeechSkimmer reduces the time needed to listen by incorporating time-compressed speech, pause shortening, automatic emphasis detection, and nonspeech audio feedback. This article also presents a multilevel structural approach to auditory skimming and user interface techniques for interacting with recorded speech. An observational usability test of SpeechSkimmer is discussed, as well as a redesign and reimplementation of the user interface based on the results of this usability test.

253 citations
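Pause shortening, one of the SpeechSkimmer techniques, is easy to illustrate: detect low-energy frames and truncate long runs of them. The RMS threshold and maximum retained pause length below are illustrative assumptions.

```python
# Sketch of one SpeechSkimmer-style technique, pause shortening: detect
# low-energy frames and drop the excess beyond a maximum pause length.
# Thresholds are illustrative assumptions, not values from the paper.
import numpy as np

def shorten_pauses(x, sr, frame_ms=20, thresh=0.01, max_pause_ms=200):
    frame = int(sr * frame_ms / 1000)
    max_pause = int(max_pause_ms / frame_ms)   # max silent frames to keep
    out, silent_run = [], 0
    for start in range(0, len(x) - frame, frame):
        w = x[start:start + frame]
        if np.sqrt(np.mean(w ** 2)) < thresh:  # RMS below threshold: pause
            silent_run += 1
            if silent_run > max_pause:
                continue                        # drop excess pause frames
        else:
            silent_run = 0
        out.append(w)
    return np.concatenate(out)
```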


Proceedings ArticleDOI
21 Apr 1997
TL;DR: A simple new procedure called STRAIGHT (speech transformation and representation using adaptive interpolation of weighted spectrum) has been developed, which allows for over 600% manipulation of such speech parameters as pitch, vocal tract length, and speaking rate, without further degradation due to the parameter manipulation.
Abstract: A simple new procedure called STRAIGHT (speech transformation and representation using adaptive interpolation of weighted spectrum) has been developed. STRAIGHT uses pitch-adaptive spectral analysis combined with a surface reconstruction method in the time-frequency region, and an excitation source design based on phase manipulation. It preserves the bilinear surface in the time-frequency region and allows for over 600% manipulation of such speech parameters as pitch, vocal tract length, and speaking rate, without further degradation due to the parameter manipulation.

247 citations
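A heavily simplified sketch of what "pitch-adaptive spectral analysis" means at its core: the analysis window length is tied to the local pitch period, which reduces harmonic interference in the estimated envelope. STRAIGHT's actual smoothing and surface-reconstruction machinery is far richer than this.

```python
# Heavily simplified sketch of pitch-adaptive spectral analysis: the
# window length follows the local pitch period. This is only the seed
# idea, not STRAIGHT itself.
import numpy as np

def pitch_adaptive_spectrum(x, center, f0, sr, periods=2):
    win_len = int(periods * sr / f0)                  # ~2 pitch periods
    half = win_len // 2
    seg = x[center - half:center - half + win_len] * np.hanning(win_len)
    return np.abs(np.fft.rfft(seg))
```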


Proceedings ArticleDOI
01 Feb 1997
TL;DR: The theoretic framework and applications of automatic audio content analysis, including analysis of amplitude, frequency and pitch, and simulations of human audio perception, are described.
Abstract: This paper describes the theoretic framework and applications of automatic audio content analysis. Research in multimedia content analysis has so far concentrated on the video domain. We demonstrate the strength of automatic audio content analysis. We explain the algorithms we use, including analysis of amplitude, frequency and pitch, and simulations of human audio perception. These algorithms serve us as tools for further audio content analysis. We use these tools in applications like the segmentation of audio data streams into logical units for further processing, the analysis of music, as well as the recognition of sounds indicative of violence like shots, explosions and cries.

227 citations
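As a sketch of the amplitude-analysis tool applied to segmentation, one can place segment boundaries where short-time loudness changes abruptly; the frame size and change ratio below are assumptions, and the paper's frequency, pitch, and perceptual analyses are not shown.

```python
# Sketch of amplitude-based segmentation of an audio stream into logical
# units: mark boundaries where short-time RMS loudness jumps by a large
# ratio. Parameters are illustrative assumptions.
import numpy as np

def segment_boundaries(x, sr, frame_ms=50, ratio=4.0):
    frame = int(sr * frame_ms / 1000)
    rms = np.array([np.sqrt(np.mean(x[i:i + frame] ** 2))
                    for i in range(0, len(x) - frame, frame)])
    bounds = [i for i in range(1, len(rms))
              if rms[i] > ratio * (rms[i - 1] + 1e-9)
              or rms[i - 1] > ratio * (rms[i] + 1e-9)]
    return [b * frame / sr for b in bounds]   # boundary times in seconds
```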


Journal ArticleDOI
TL;DR: The proposed analysis-by-synthesis/overlap-add (ABS/OLA) system allows for both fixed and time-varying time-, frequency-, and pitch-scale modifications, and computational shortcuts using the FFT algorithm make its implementation feasible using currently available hardware.
Abstract: Sinusoidal modeling has been successfully applied to a broad range of speech processing problems, and offers advantages over linear predictive modeling and the short-time Fourier transform for speech analysis/synthesis and modification. This paper presents a novel speech analysis/synthesis system based on the combination of an overlap-add sinusoidal model with an analysis-by-synthesis technique to determine the model parameters. It describes this analysis procedure in detail, and introduces an equivalent frequency-domain algorithm that takes advantage of the computational efficiency of the fast Fourier transform (FFT). In addition, a refined overlap-add sinusoidal model capable of shape-invariant speech modification is derived, and a pitch-scale modification algorithm is defined that preserves speech bandwidth and eliminates noise migration effects. Analysis-by-synthesis achieves very high synthetic speech quality by accurately estimating the component frequencies, eliminating sidelobe interference effects, and effectively dealing with nonstationary speech events. The refined overlap-add synthesis model correlates well with analysis-by-synthesis, and modifies speech without objectionable artifacts by explicitly controlling shape invariance and phase coherence. The proposed analysis-by-synthesis/overlap-add (ABS/OLA) system allows for both fixed and time-varying time-, frequency-, and pitch-scale modifications, and computational shortcuts using the FFT algorithm make its implementation feasible using currently available hardware.

220 citations
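The overlap-add synthesis at the core of the ABS/OLA system can be sketched compactly: each frame is a sum of sinusoids, and adjacent frames are cross-faded by windowed overlap-add. The analysis-by-synthesis parameter estimation, which the paper spends most of its effort on, is omitted here.

```python
# Sketch of overlap-add sinusoidal synthesis: each frame is a sum of
# constant-frequency sinusoids; adjacent frames are cross-faded by OLA.
# With hop = frame_len // 2 and a Hann window, the windows sum to one.
import numpy as np

def ola_synthesis(frames, frame_len, hop):
    """frames: list of (amps, freqs_rad, phases) triples per frame."""
    n = hop * (len(frames) - 1) + frame_len
    y = np.zeros(n)
    win = np.hanning(frame_len)                # OLA cross-fade window
    t = np.arange(frame_len)
    for i, (amps, freqs, phases) in enumerate(frames):
        s = sum(a * np.cos(w * t + p) for a, w, p in zip(amps, freqs, phases))
        y[i * hop:i * hop + frame_len] += win * s
    return y
```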


Proceedings ArticleDOI
21 Apr 1997
TL;DR: This work proposes a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels.
Abstract: Understanding the human ability to reliably process and decode speech across a wide range of acoustic conditions and speaker characteristics is a fundamental challenge for current theories of speech perception. Conventional speech representations such as the sound spectrogram emphasize many spectro-temporal details that are not directly germane to the linguistic information encoded in the speech signal and which consequently do not display the perceptual stability characteristic of human listeners. We propose a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels. We describe the representation and illustrate its stability with color-mapped displays and with results from automatic speech recognition experiments.
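A minimal sketch of the modulation-spectrogram computation for a single critical-band-like channel, assuming a Butterworth band-pass and a Hilbert envelope (the paper does not mandate these choices): extract the channel's amplitude envelope and keep only the low-frequency portion of its modulation spectrum.

```python
# Sketch of the modulation-spectrogram idea for one channel: band-pass,
# take the amplitude envelope, and keep the low-frequency (roughly 0-8 Hz)
# part of the envelope's spectrum. Band edges and cutoffs are assumptions.
import numpy as np
from scipy.signal import butter, lfilter, hilbert

def modulation_spectrum(x, sr, band=(300.0, 600.0), mod_max_hz=8.0):
    # 1. Band-pass one critical-band-like channel.
    b, a = butter(4, [band[0] / (sr / 2), band[1] / (sr / 2)], btype='band')
    sub = lfilter(b, a, x)
    # 2. Amplitude envelope via the Hilbert transform.
    env = np.abs(hilbert(sub))
    # 3. Modulation spectrum = spectrum of the envelope; keep 0..mod_max_hz.
    spec = np.abs(np.fft.rfft(env - env.mean()))
    freqs = np.fft.rfftfreq(len(env), 1.0 / sr)
    return freqs[freqs <= mod_max_hz], spec[freqs <= mod_max_hz]
```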

Journal ArticleDOI
TL;DR: This paper describes in some detail the key technologies and main features of MPEG-1 and MPEG-2 audio coders, and presents the MPEG-4 standard and discusses some of the typical applications for MPEG audio compression.
Abstract: The Moving Pictures Expert Group (MPEG) within the International Organization for Standardization (ISO) has developed a series of audio-visual standards known as MPEG-1 and MPEG-2. These audio-coding standards are the first international standards in the field of high-quality digital audio compression. MPEG-1 covers coding of stereophonic audio signals at high sampling rates aiming at transparent quality, whereas MPEG-2 also offers stereophonic audio coding at lower sampling rates. In addition, MPEG-2 introduces multichannel coding with and without backwards compatibility to MPEG-1 to provide an improved acoustical image for audio-only applications and for enhanced television and video-conferencing systems. MPEG-2 audio coding without backwards compatibility, called MPEG-2 Advanced Audio Coding (AAC), offers the highest compression rates. Typical application areas for MPEG-based digital audio are in the fields of audio production, program distribution and exchange, digital sound broadcasting, digital storage, and various multimedia applications. We describe in some detail the key technologies and main features of MPEG-1 and MPEG-2 audio coders. We also present the MPEG-4 standard and discuss some of the typical applications for MPEG audio compression.

Patent
16 Dec 1997
TL;DR: A subband audio coder employs perfect/nonperfect reconstruction filters, predictive/non-predictive subband encoding, transient analysis, and psycho-acoustic/minimum mean square error (mmse) bit allocation over time, frequency and the multiple audio channels to encode/decode a data stream to generate high fidelity reconstructed audio as mentioned in this paper.
Abstract: A subband audio coder employs perfect/non-perfect reconstruction filters, predictive/non-predictive subband encoding, transient analysis, and psycho-acoustic/minimum mean-square-error (mmse) bit allocation over time, frequency and the multiple audio channels to encode/decode a data stream to generate high fidelity reconstructed audio. The audio coder windows the multi-channel audio signal such that the frame size, i.e. number of bytes, is constrained to lie in a desired range, and formats the encoded data so that the individual subframes can be played back as they are received, thereby reducing latency. Furthermore, the audio coder processes the baseband portion (0-24 kHz) of the audio bandwidth for sampling frequencies of 48 kHz and higher with the same encoding/decoding algorithm so that the audio coder architecture is future-compatible.
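The mmse bit-allocation component can be sketched with the textbook rule: give each subband bits in proportion to the log of its variance relative to the geometric mean, then round and fix up to meet the budget. This is the classical formula, not the patent's exact procedure.

```python
# Classical mmse bit allocation across subbands (textbook rule, used here
# only to illustrate the kind of allocation the patent mentions).
import numpy as np

def mmse_bit_allocation(subband_vars, total_bits):
    subband_vars = np.asarray(subband_vars, dtype=float)
    avg = total_bits / len(subband_vars)
    geo_mean = np.exp(np.mean(np.log(subband_vars)))
    bits = avg + 0.5 * np.log2(subband_vars / geo_mean)
    bits = np.clip(np.round(bits), 0, None)    # no negative allocations
    # Greedy fix-up so allocations sum exactly to the budget.
    while bits.sum() > total_bits:
        bits[np.argmax(bits)] -= 1
    while bits.sum() < total_bits:
        bits[np.argmax(subband_vars / 4.0**bits)] += 1
    return bits.astype(int)
```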

Patent
28 Jan 1997
TL;DR: In this article, the positions and velocities of the speech organs (2, 3, 4) as speech is articulated can be defined for each acoustic speech unit (20) by simultaneously recording EM wave reflections and acoustic speech information.
Abstract: By simultaneously recording EM wave reflections (21) and acoustic speech information (24), the positions and velocities of the speech organs (2, 3, 4) as speech is articulated can be defined for each acoustic speech unit (20). Well defined time frames and feature vectors (6, 7, 8, 9) describing the speech, to the degree required, can be formed. Such feature vectors (6, 7, 8, 9) can uniquely characterize the speech unit (20) being articulated each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit (20) recognition, and organ mechanical parameters can be determined.

Journal Article
TL;DR: Common features as well as differences between MPEG-1 and MPEG-2 audio, other audio coding systems currently in use, and the new work for MPEG-4 audio will be presented.
Abstract: Since 1988 MPEG has been working on the standardization of high-quality low-bit-rate audio coding. In 1992 and 1994 the MPEG-1 and MPEG-2 audio standards were completed. Current work in MPEG includes the MPEG-2 advanced audio coding (MPEG-2 AAC) of stereo or multichannel sound material and the audio part of MPEG-4. Common features as well as differences between MPEG-1 and MPEG-2 audio, other audio coding systems currently in use, and the new work for MPEG-2 AAC and MPEG-4 audio will be presented.

Journal ArticleDOI
TL;DR: The modulated lapped transform properties and how it can be used to generate a time-varying filterbank are described and examples of its implementation in two audio coding standards are presented.
Abstract: The modulated lapped transform (MLT) is used in both audio and video data compression schemes. This paper describes its properties and how it can be used to generate a time-varying filterbank. Examples of its implementation in two audio coding standards are presented.
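For reference, a direct (O(M^2)) sketch of the MLT, which is equivalent to the MDCT with a sine window: M coefficients are produced from 2M input samples, with 50% overlap between successive blocks. Fast implementations factor this through an FFT.

```python
# Direct sketch of the modulated lapped transform (MDCT with sine window).
import numpy as np

def mlt(block):
    """block: 2M samples (numpy array); returns M MLT coefficients."""
    M = len(block) // 2
    n = np.arange(2 * M)
    h = np.sin(np.pi / (2 * M) * (n + 0.5))          # sine window
    k = np.arange(M)[:, None]
    basis = np.cos(np.pi / M * (n + 0.5 + M / 2.0) * (k + 0.5))
    return basis @ (h * block)
```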

PatentDOI
TL;DR: In this paper, a speech/music discriminator employs data from multiple features of an audio signal as input to a classifier, and a preferred set of classifiers is based upon variations of a nearest-neighbor approach, including a K-d tree spatial partitioning technique.
Abstract: A speech/music discriminator employs data from multiple features of an audio signal (10) as input to a classifier (16). Some of the feature data is determined from individual frames of the audio signal, and other input data is based upon variations of a feature over several frames, to distinguish the changes in voiced and unvoiced components of speech from the more constant characteristics of music. Several different types of classifiers for labeling test points on the basis of the feature data are disclosed. A preferred set of classifiers is based upon variations of a nearest-neighbor approach, including a K-d tree spatial partitioning technique.
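The nearest-neighbor classification step can be sketched with a k-d tree, the spatial partitioning the patent prefers; the feature extraction (per-frame features plus their variation over several frames) is assumed already done, and the value of k is an assumption.

```python
# Sketch of k-d tree nearest-neighbor speech/music classification.
# Feature extraction is assumed done upstream.
import numpy as np
from scipy.spatial import cKDTree

class KNNDiscriminator:
    def __init__(self, train_features, train_labels, k=5):
        self.tree = cKDTree(np.asarray(train_features))
        self.labels = np.asarray(train_labels)   # 0 = speech, 1 = music
        self.k = k

    def classify(self, feature_vector):
        _, idx = self.tree.query(feature_vector, k=self.k)
        votes = self.labels[np.atleast_1d(idx)]
        return int(np.round(votes.mean()))        # majority vote
```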

Patent
24 Jan 1997
TL;DR: In this article, a coding process and a coder for inserting an inaudible data signal into an audio signal are presented, in which the audio signal is first converted into a spectrum range and the masking threshold of the audio signal is determined.
Abstract: In a coding process and a coder for inserting an inaudible data signal into an audio signal, the audio signal is first converted into a spectrum range and the masking threshold of the audio signal is determined. A pseudo-noise signal and a data signal are prepared and multiplied together to provide a frequency-spread data signal. The spread data signal is weighted by the masking threshold and then the audio signal and the weighted data signal are superimposed. In a process and a decoder for decoding an inaudible data signal inserted into an audio signal, the audio signal is first sampled and the sampled audio signal is then non-recursively filtered. Thereupon the filtered audio signal is compared with a threshold in order to recover the data signal.
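A sketch of the embedding step the patent describes, with the masking model assumed given: spread the data bits with a pseudo-noise sequence, weight the spread signal by the masking threshold, and superimpose it on the audio spectrum.

```python
# Sketch of masking-weighted spread-spectrum embedding. The masking model
# that produces mask_threshold is assumed given.
import numpy as np

def embed(audio_block, data_bits, mask_threshold, seed=0):
    """audio_block: N time samples; mask_threshold: N//2+1 spectral weights."""
    rng = np.random.default_rng(seed)
    n_bins = len(mask_threshold)
    pn = rng.choice([-1.0, 1.0], size=n_bins)          # pseudo-noise sequence
    chips = np.repeat(data_bits, int(np.ceil(n_bins / len(data_bits))))[:n_bins]
    spread = pn * (2.0 * chips - 1.0)                  # bits {0,1} -> {-1,+1}
    spectrum = np.fft.rfft(audio_block)
    spectrum += mask_threshold * spread                # kept below masking
    return np.fft.irfft(spectrum, n=len(audio_block))
```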

Patent
31 Jul 1997
TL;DR: In this paper, a multi-station audio distribution system with at least two listening stations, a data control mechanism, and a listening station interface mechanism disposed between the listening stations and the data control mechanism is presented.
Abstract: A multi-station audio distribution apparatus having at least two listening stations, a data control mechanism and a listening station interface mechanism disposed between the listening stations and the data control mechanism. Each of the listening stations has a user input in the form of a barcode scanner to enter an audio material selection and each has an audio output. The data control mechanism retrieves digitized audio material corresponding to each of the user's audio material selections. The listening station interface mechanism transfers the user's inputs from each of the listening stations to the data control mechanism, receives the digitized audio materials corresponding to each user's input from the data control mechanism, converts the digitized audio materials to analog audio signals and transfers the analog audio signals to each of the respective listening stations for the audio output.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: The results of this experiment confirm the general superiority of marginals-based schemes, demonstrate the viability of shared covariance statistics, and suggest several ways in which performance improvements on the larger task may be obtained.
Abstract: In noisy listening conditions, the information available on which to base speech recognition decisions is necessarily incomplete: some spectro-temporal regions are dominated by other sources. We report on the application of a variety of techniques for missing data in speech recognition. These techniques may be based on marginal distributions or on reconstruction of missing parts of the spectrum. Application of these ideas in the resource management task shows a performance which is robust to random removal of up to 80% of the frequency channels, but falls off rapidly with deletions which more realistically simulate masked speech. We report on a vowel classification experiment designed to isolate some of the RM problems for more detailed exploration. The results of this experiment confirm the general superiority of marginals-based schemes, demonstrate the viability of shared covariance statistics, and suggest several ways in which performance improvements on the larger task may be obtained.
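The marginals-based approach reduces, for a diagonal Gaussian, to evaluating the likelihood over the reliable feature dimensions only, as in this sketch.

```python
# Marginal log-likelihood of a partially observed frame under a diagonal
# Gaussian: masked dimensions are integrated out, so only the reliable
# dimensions contribute.
import numpy as np

def marginal_log_likelihood(x, reliable, mean, var):
    """x, mean, var: full feature vectors (numpy); reliable: boolean mask."""
    r = np.asarray(reliable)
    d = x[r] - mean[r]
    return -0.5 * np.sum(np.log(2 * np.pi * var[r]) + d**2 / var[r])
```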

Journal ArticleDOI
TL;DR: It is suggested that many of the advantages to be gained from interaction between speech production and speech recognition communities will develop from integrating production models with the probabilistic analysis-by-synthesis strategy currently used by the technology community.

Proceedings ArticleDOI
02 Jul 1997
TL;DR: Algorithms for perceptually transparent coding of CD-quality digital audio are reviewed, covering both research and standardization activities, including the ISO/MPEG family and the Dolby AC-3 algorithms.
Abstract: Considerable research has been devoted to the development of algorithms for perceptually transparent coding of high-fidelity (CD-quality) digital audio. As a result, many algorithms have been proposed and several have now become international and/or commercial product standards. This paper reviews algorithms for perceptually transparent coding of CD-quality digital audio, including both research and standardization activities. First, psychoacoustic principles are described with the MPEG psychoacoustic signal analysis model 1 discussed in some detail. Then, we review methodologies which achieve perceptually transparent coding of FM- and CD-quality audio signals, including algorithms which manipulate transform components and subband signal decompositions. The discussion concentrates on architectures and applications of those techniques which utilize psychoacoustic models to exploit efficiently masking characteristics of the human receiver. Several algorithms which have become international and/or commercial standards are also presented, including the ISO/MPEG family and the Dolby AC-3 algorithms. The paper concludes with a brief discussion of future research directions.

Patent
04 Apr 1997
TL;DR: In this article, a method and device for extrapolating past signal-history data for insertion into missing data segments in order to conceal digital speech frame errors is proposed, which is implemented with a device that utilizes a finite-impulse response (FIR) multi-layer feed-forward artificial neural network.
Abstract: A method and device for extrapolating past signal-history data for insertion into missing data segments in order to conceal digital speech frame errors. The extrapolation method uses past-signal history that is stored in a buffer. The method is implemented with a device that utilizes a finite-impulse response (FIR) multi-layer feed-forward artificial neural network that is trained by back-propagation for one-step extrapolation of speech compression algorithm (SCA) parameters. Once a speech connection has been established, the speech compression algorithm device begins sending encoded speech frames. As the speech frames are received, they are decoded and converted back into speech signal voltages. During the normal decoding process, pre-processing of the required SCA parameters will occur and the results stored in the past-history buffer. If a speech frame is detected to be lost or in error, then extrapolation modules are executed and replacement SCA parameters are generated and sent as the parameters required by the SCA. In this way, the information transfer to the SCA is transparent, and the SCA processing continues as usual. The listener will not normally notice that a speech frame has been lost because of the smooth transition between the last-received, lost, and next-received speech frames.
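The extrapolation step can be sketched as the forward pass of a small feed-forward network over the past-history buffer; the patent trains such a network offline by back-propagation, the weights below are placeholders, and the FIR tapped-delay structure is represented simply by flattening the buffer.

```python
# Sketch of one-step extrapolation of codec parameters from the past-
# history buffer (forward pass only; training omitted, weights assumed).
import numpy as np

def extrapolate(history, W1, b1, W2, b2):
    """history: the last few frames' SCA parameters, flattened."""
    h = np.tanh(W1 @ history + b1)       # hidden layer
    return W2 @ h + b2                   # predicted next-frame parameters

# On a detected frame erasure, the prediction replaces the lost frame's
# parameters and decoding proceeds as usual:
#   params = extrapolate(buffer.ravel(), W1, b1, W2, b2)
```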

Proceedings ArticleDOI
21 Apr 1997
TL;DR: The GSM enhanced full rate (EFR) speech codec that has been standardised for the GSM mobile communication system provides wireline quality not only for error-free conditions but also for the most typical error conditions.
Abstract: This paper describes the GSM enhanced full rate (EFR) speech codec that has been standardised for the GSM mobile communication system. The GSM EFR codec has been jointly developed by Nokia and University of Sherbrooke. It provides speech quality at least equivalent to that of a wireline telephony reference (32 kbit/s ADPCM). The EFR codec uses 12.2 kbit/s for speech coding and 10.6 kbit/s for error protection. Speech coding is based on the ACELP algorithm (algebraic code excited linear prediction). The codec provides substantial quality improvement compared to the existing GSM full rate and half rate codecs. The old GSM codecs lack wireline quality even in error-free channel conditions, while the EFR codec provides wireline quality not only for error-free conditions but also for the most typical error conditions. With the EFR codec, wireline quality is also sustained in the presence of background noise and in tandem connections (mobile to mobile calls).

Patent
26 Jun 1997
TL;DR: In this article, a method for speech coding using Code-Excited Linear Prediction (CELP) that produces toll-quality speech at data rates between 4 and 16 kbit/s is presented.
Abstract: The invention provides a method for speech coding using Code-Excited Linear Prediction (CELP) producing toll-quality speech at data rates between 4 and 16 kbit/s. The invention uses a series of baseline, implied, and adaptive codebooks, comprising pulse and random codebooks with associated gain vectors, to characterize the speech. Improved quantization and search techniques to achieve real-time operation, based on the codebooks and gains, are also provided.
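The codebook search that underlies CELP coders of this kind has a standard closed-form core, sketched below: filter each codevector through the synthesis filter, compute its optimal gain analytically, and keep the candidate maximizing the normalized correlation. The patent's layered baseline/implied/adaptive structure and quantization are omitted.

```python
# Standard CELP codebook search core: for each codevector, the optimal
# gain is <target, y> / <y, y>, and the best candidate maximizes the
# normalized correlation score.
import numpy as np
from scipy.signal import lfilter

def search_codebook(target, codebook, lpc):
    best, best_gain, best_score = None, 0.0, -np.inf
    for idx, c in enumerate(codebook):
        y = lfilter([1.0], np.concatenate(([1.0], lpc)), c)  # filtered c
        num, den = np.dot(target, y), np.dot(y, y) + 1e-12
        score = num * num / den            # maximize correlation^2 / energy
        if score > best_score:
            best, best_score, best_gain = idx, score, num / den
    return best, best_gain
```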

Proceedings ArticleDOI
21 Apr 1997
TL;DR: This paper focuses on the improvements in prosody and acoustic modeling, which are all derived through the use of probabilistic learning methods in the Whistler TTS engine.
Abstract: The Whistler text-to-speech engine was designed so that we can automatically construct the model parameters from training data. This paper focuses on the improvements in prosody and acoustic modeling, which are all derived through the use of probabilistic learning methods. Whistler can produce synthetic speech that sounds very natural and resembles the acoustic and prosodic characteristics of the original speaker. The underlying technologies used in Whistler can significantly facilitate the process of creating generic TTS systems for a new language, a new voice, or a new speech style. The Whistler TTS engine supports the Microsoft Speech API and requires less than 3 MB of working memory.


01 Jan 1997
TL;DR: A classifier that distinguishes speech from music and non-vocal sounds is presented, as well as experimental results showing how perfect classification accuracy may be achieved on a small corpus using substantially less than two seconds per test audio file.
Abstract: This paper presents recent results using statistics generated by an MMI-supervised vector quantizer as a measure of audio similarity. Such a measure has proved successful for talker identification, and the extension from speech to general audio, such as music, is straightforward. A classifier that distinguishes speech from music and non-vocal sounds is presented, as well as experimental results showing how perfect classification accuracy may be achieved on a small corpus using substantially less than two seconds per test audio file. The techniques presented here may be extended to other applications and domains, such as audio retrieval-by-similarity, musical genre classification, and automatic segmentation of continuous audio.
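The skeleton of such a similarity measure is easy to sketch: quantize each frame to its nearest codeword, accumulate a codeword-usage histogram per file, and compare histograms. The paper's codebook is MMI-supervised; here the codebook is simply assumed given, and cosine similarity is an illustrative choice.

```python
# Sketch of VQ-statistics audio similarity: codeword-usage histograms
# compared by cosine similarity. The codebook itself is assumed given.
import numpy as np

def usage_histogram(frames, codebook):
    """frames: (n, d) feature vectors; codebook: (k, d) codewords."""
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(codebook))
    return counts / counts.sum()

def similarity(hist_a, hist_b):
    """Cosine similarity between two codeword-usage histograms."""
    return hist_a @ hist_b / (np.linalg.norm(hist_a) * np.linalg.norm(hist_b) + 1e-12)
```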

Journal ArticleDOI
R.V. Cox
TL;DR: Three new speech coding recommendations from the ITU-T provide good coverage for a wide range of applications that have low bit rate requirements (i.e., from 5.3 to 8 kb/s).
Abstract: Many new speech coding standards have been created in the 10-year period 1987-1996. The author reviews the key attributes that determine what coder to select for different applications. The article then focuses on three new speech coding recommendations from the ITU-T, namely G.723.1, G.729, and Annex A of G.729. They provide good coverage for a wide range of applications that have low bit rate requirements (i.e., from 5.3 to 8 kb/s). In addition to bit rate, the article reviews their delay, complexity, and performance. Also reviewed are the history of these standards, and what considerations influenced the requirements each of these coders had to meet.
