
Showing papers on "Speech coding published in 1997"


Patent
21 Jul 1997
TL;DR: In this paper, a system that performs analysis and comparison of audio data files based upon the content of the data files is presented; the analysis produces a set of numeric values (a feature vector) that can be used to classify and rank the similarity between individual audio files typically stored in a multimedia database or on the Web.
Abstract: A system that performs analysis and comparison of audio data files based upon the content of the data files is presented. The analysis of the audio data produces a set of numeric values (a feature vector) that can be used to classify and rank the similarity between individual audio files typically stored in a multimedia database or on the World Wide Web. The analysis also facilitates the description of user-defined classes of audio files, based on an analysis of a set of audio files that are members of a user-defined class. The system can find sounds within a longer sound, allowing an audio recording to be automatically segmented into a series of shorter audio segments.

726 citations
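The patent does not disclose a specific feature set, so the following is only a minimal sketch of the general idea: summarize each file as a fixed-length feature vector of frame-level statistics (here, hypothetical energy, zero-crossing-rate, and spectral-centroid features) and rank files by vector distance.

```python
# Minimal sketch of content-based audio similarity ranking (hypothetical
# features; the patent does not disclose a specific feature set).
import numpy as np

def feature_vector(x, sr, frame=1024, hop=512):
    """Summarize a mono waveform as a small numeric feature vector."""
    feats = []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame]
        spec = np.abs(np.fft.rfft(w * np.hanning(frame)))
        freqs = np.fft.rfftfreq(frame, 1.0 / sr)
        energy = np.sum(w ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(w)))) / 2.0
        centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)
        feats.append([energy, zcr, centroid])
    f = np.array(feats)
    # Mean and std over frames -> one fixed-length vector per file.
    return np.concatenate([f.mean(axis=0), f.std(axis=0)])

def rank_by_similarity(query, database):
    """Return database indices sorted by Euclidean distance to the query."""
    d = [np.linalg.norm(query - v) for v in database]
    return np.argsort(d)
```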


Journal Article
TL;DR: The ISO/IEC MPEG-2 advanced audio coding (AAC) system was designed to provide MPEG-2 with the best audio quality without any restrictions due to compatibility requirements.
Abstract: The ISO/IEC MPEG-2 advanced audio coding (AAC) system was designed to provide MPEG-2 with the best audio quality without any restrictions due to compatibility requirements. The main features of the AAC system (ISO/IEC 13818-7) are described. MPEG-2 AAC combines the coding efficiency of a high-resolution filter bank, prediction techniques, and Huffman coding with additional functionalities aimed to deliver very high audio quality at a variety of data rates.

585 citations


Patent
25 Feb 1997
TL;DR: In this paper, a communication system for simultaneously transmitting ancillary codes and audio signals via a conventional audio communications channel using perceptual coding techniques is described, where an encoder monitors an audio channel to detect "opportunities" to insert an ancillary code such that the inserted signals are masked by the audio signal.
Abstract: A communication system for simultaneously transmitting ancillary codes and audio signals via a conventional audio communications channel using perceptual coding techniques is disclosed. An encoder monitors an audio channel to detect 'opportunities' to insert an ancillary code such that the inserted signals are masked by the audio signal, as defined by the 'perceptual entropy envelope' of the audio signal. An ancillary code containing, for example, an ID or serial number, is encoded as one or more whitened spread spectrum signals and/or a narrowband FSK ancillary code and transmitted at a time, frequency and/or level such that the data signal is masked by the audio signal. A decoder at a receiving location recovers the encoded ID or serial number.

459 citations


Journal ArticleDOI
TL;DR: Annex B defines a low-bit-rate silence compression scheme designed and optimized to work in conjunction with both the full version of G.729 and its low-complexity Annex A, enabling bit-rate savings for coded speech at average rates as low as 4 kb/s during normal speech conversation while maintaining reproduction quality.
Abstract: This article describes Annex B to ITU-T Recommendation G.729. Annex B defines a low-bit-rate silence compression scheme designed and optimized to work in conjunction with both the full version of G.729 and its low-complexity Annex A. To achieve good quality low-bit-rate silence compression, a robust frame-based voice activity detector module is essential to detect inactive voice frames, also called silence or background noise frames. For these detected inactive voice frames, a discontinuous transmission module measures the changes over time of the inactive voice signal characteristics and decides whether a new silence information descriptor frame should be sent to maintain the reproduction quality of the background noise at the receiving end. If such a frame is needed, the spectrum and energy parameters describing the perceptual characteristics of the background noise are efficiently coded and transmitted using 15 b/frame. At the receiving end, the comfort noise generation module regenerates the output background noise using transmitted updates or previously available parameters. The synthesized background noise is obtained by linear predictive filtering of a locally generated pseudo-white excitation signal of a controlled level. This method of coding the background noise enables the achievement of bit-rate savings for coded speech at average rates as low as 4 kb/s during normal speech conversation while maintaining reproduction quality.

332 citations
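The comfort-noise generation step lends itself to a short sketch: the decoder filters a locally generated pseudo-white excitation, scaled to the transmitted energy, through an LP synthesis filter built from the transmitted spectral parameters. The function and scaling below are illustrative assumptions, not the normative G.729 Annex B procedure.

```python
# Sketch of the comfort-noise generation idea in G.729 Annex B: filter a
# locally generated pseudo-white excitation through an LP synthesis filter
# built from the transmitted spectral parameters. Values and scaling are
# illustrative, not the normative G.729B procedure.
import numpy as np
from scipy.signal import lfilter

def comfort_noise(lpc, target_energy, n=80, seed=0):
    """Synthesize one frame of background noise (n samples)."""
    rng = np.random.default_rng(seed)
    excitation = rng.standard_normal(n)               # pseudo-white excitation
    excitation *= np.sqrt(target_energy / np.sum(excitation**2))
    # LP synthesis filter 1 / A(z), with A(z) = 1 + a1 z^-1 + ... + ap z^-p
    return lfilter([1.0], np.concatenate(([1.0], lpc)), excitation)
```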


Patent
28 Jan 1997
TL;DR: The use of EM radiation in conjunction with simultaneously recorded speech information enables a complete mathematical coding of acoustic speech as discussed by the authors, which can be used for speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis and speech translation.
Abstract: The use of EM radiation in conjunction with simultaneously recorded speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector (12, 13) for each pitch period of voiced speech and the forming of feature vectors (12, 13) for each time frame of unvoiced speech, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function (7) in each time frame. The formation of feature vectors (12, 13) defining all acoustic speech units over well-defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching.

313 citations
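One step the abstract names, deconvolving the excitation from the acoustic output to obtain a per-frame transfer function, can be sketched as a regularized spectral division; the windowing and regularization constant below are assumptions, not the patent's procedure.

```python
# Sketch of the deconvolution step: with the excitation e[n] known (here,
# from the EM sensor) and the acoustic output s[n] recorded, the transfer
# function over a frame can be estimated by regularized spectral division.
import numpy as np

def transfer_function(speech_frame, excitation_frame, eps=1e-6):
    S = np.fft.rfft(speech_frame * np.hanning(len(speech_frame)))
    E = np.fft.rfft(excitation_frame * np.hanning(len(excitation_frame)))
    return S * np.conj(E) / (np.abs(E) ** 2 + eps)   # Wiener-style division
```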


Book
01 Dec 1997

277 citations


Journal ArticleDOI
TL;DR: SpeechSkimmer, as discussed by the authors, uses speech-processing techniques to allow a user to hear recorded sounds quickly, and at several levels of detail, and provides continuous real-time control of the speed and detail level of the audio presentation.
Abstract: Listening to a speech recording is much more difficult than visually scanning a document because of the transient and temporal nature of audio. Audio recordings capture the richness of speech, yet it is difficult to directly browse the stored information. This article describes techniques for structuring, filtering, and presenting recorded speech, allowing a user to navigate and interactively find information in the audio domain. This article describes the SpeechSkimmer system for interactively skimming speech recordings. SpeechSkimmer uses speech-processing techniques to allow a user to hear recorded sounds quickly, and at several levels of detail. User interaction, through a manual input device, provides continuous real-time control of the speed and detail level of the audio presentation. SpeechSkimmer reduces the time needed to listen by incorporating time-compressed speech, pause shortening, automatic emphasis detection, and nonspeech audio feedback. This article also presents a multilevel structural approach to auditory skimming and user interface techniques for interacting with recorded speech. An observational usability test of SpeechSkimmer is discussed, as well as a redesign and reimplementation of the user interface based on the results of this usability test.

253 citations
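Pause shortening, one of the SpeechSkimmer techniques, is easy to illustrate: detect low-energy frames and truncate long runs of them. The RMS threshold and maximum retained pause length below are illustrative assumptions.

```python
# Sketch of one SpeechSkimmer-style technique, pause shortening: detect
# low-energy frames and drop the excess beyond a maximum pause length.
# Thresholds are illustrative assumptions, not values from the paper.
import numpy as np

def shorten_pauses(x, sr, frame_ms=20, thresh=0.01, max_pause_ms=200):
    frame = int(sr * frame_ms / 1000)
    max_pause = int(max_pause_ms / frame_ms)   # max silent frames to keep
    out, silent_run = [], 0
    for start in range(0, len(x) - frame, frame):
        w = x[start:start + frame]
        if np.sqrt(np.mean(w ** 2)) < thresh:  # RMS below threshold: pause
            silent_run += 1
            if silent_run > max_pause:
                continue                        # drop excess pause frames
        else:
            silent_run = 0
        out.append(w)
    return np.concatenate(out)
```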


Proceedings ArticleDOI
21 Apr 1997
TL;DR: A simple new procedure called STRAIGHT (speech transformation and representation using adaptive interpolation of weighted spectrum) has been developed, which allows for over 600% manipulation of such speech parameters as pitch, vocal tract length, and speaking rate, without further degradation due to the parameter manipulation.
Abstract: A simple new procedure called STRAIGHT (speech transformation and representation using adaptive interpolation of weighted spectrum) has been developed. STRAIGHT uses pitch-adaptive spectral analysis combined with a surface reconstruction method in the time-frequency region, and an excitation source design based on phase manipulation. It preserves the bilinear surface in the time-frequency region and allows for over 600% manipulation of such speech parameters as pitch, vocal tract length, and speaking rate, without further degradation due to the parameter manipulation.

247 citations
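A heavily simplified sketch of what "pitch-adaptive spectral analysis" means at its core: the analysis window length is tied to the local pitch period, which reduces harmonic interference in the estimated envelope. STRAIGHT's actual smoothing and surface-reconstruction machinery is far richer than this.

```python
# Heavily simplified sketch of pitch-adaptive spectral analysis: the
# window length follows the local pitch period. This is only the seed
# idea, not STRAIGHT itself.
import numpy as np

def pitch_adaptive_spectrum(x, center, f0, sr, periods=2):
    win_len = int(periods * sr / f0)                  # ~2 pitch periods
    half = win_len // 2
    seg = x[center - half:center - half + win_len] * np.hanning(win_len)
    return np.abs(np.fft.rfft(seg))
```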


Proceedings ArticleDOI
01 Feb 1997
TL;DR: The theoretic framework and applications of automatic audio content analysis, including analysis of amplitude, frequency and pitch, and simulations of human audio perception, are described.
Abstract: This paper describes the theoretic framework and applications of automatic audio content analysis. Research in multimedia content analysis has so far concentrated on the video domain. We demonstrate the strength of automatic audio content analysis. We explain the algorithms we use, including analysis of amplitude, frequency and pitch, and simulations of human audio perception. These algorithms serve us as tools for further audio content analysis. We use these tools in applications like the segmentation of audio data streams into logical units for further processing, the analysis of music, as well as the recognition of sounds indicative of violence like shots, explosions and cries.

227 citations
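As a sketch of the amplitude-analysis tool applied to segmentation, one can place segment boundaries where short-time loudness changes abruptly; the frame size and change ratio below are assumptions, and the paper's frequency, pitch, and perceptual analyses are not shown.

```python
# Sketch of amplitude-based segmentation of an audio stream into logical
# units: mark boundaries where short-time RMS loudness jumps by a large
# ratio. Parameters are illustrative assumptions.
import numpy as np

def segment_boundaries(x, sr, frame_ms=50, ratio=4.0):
    frame = int(sr * frame_ms / 1000)
    rms = np.array([np.sqrt(np.mean(x[i:i + frame] ** 2))
                    for i in range(0, len(x) - frame, frame)])
    bounds = [i for i in range(1, len(rms))
              if rms[i] > ratio * (rms[i - 1] + 1e-9)
              or rms[i - 1] > ratio * (rms[i] + 1e-9)]
    return [b * frame / sr for b in bounds]   # boundary times in seconds
```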


Journal ArticleDOI
TL;DR: The proposed analysis-by-synthesis/overlap-add (ABS/OLA) system allows for both fixed and time-varying time-, frequency-, and pitch-scale modifications, and computational shortcuts using the FFT algorithm make its implementation feasible using currently available hardware.
Abstract: Sinusoidal modeling has been successfully applied to a broad range of speech processing problems, and offers advantages over linear predictive modeling and the short-time Fourier transform for speech analysis/synthesis and modification. This paper presents a novel speech analysis/synthesis system based on the combination of an overlap-add sinusoidal model with an analysis-by-synthesis technique to determine the model parameters. It describes this analysis procedure in detail, and introduces an equivalent frequency-domain algorithm that takes advantage of the computational efficiency of the fast Fourier transform (FFT). In addition, a refined overlap-add sinusoidal model capable of shape-invariant speech modification is derived, and a pitch-scale modification algorithm is defined that preserves speech bandwidth and eliminates noise migration effects. Analysis-by-synthesis achieves very high synthetic speech quality by accurately estimating the component frequencies, eliminating sidelobe interference effects, and effectively dealing with nonstationary speech events. The refined overlap-add synthesis model correlates well with analysis-by-synthesis, and modifies speech without objectionable artifacts by explicitly controlling shape invariance and phase coherence. The proposed analysis-by-synthesis/overlap-add (ABS/OLA) system allows for both fixed and time-varying time-, frequency-, and pitch-scale modifications, and computational shortcuts using the FFT algorithm make its implementation feasible using currently available hardware.

220 citations
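The overlap-add synthesis at the core of the ABS/OLA system can be sketched compactly: each frame is a sum of sinusoids, and adjacent frames are cross-faded by windowed overlap-add. The analysis-by-synthesis parameter estimation, which the paper spends most of its effort on, is omitted here.

```python
# Sketch of overlap-add sinusoidal synthesis: each frame is a sum of
# constant-frequency sinusoids; adjacent frames are cross-faded by OLA.
# With hop = frame_len // 2 and a Hann window, the windows sum to one.
import numpy as np

def ola_synthesis(frames, frame_len, hop):
    """frames: list of (amps, freqs_rad, phases) triples per frame."""
    n = hop * (len(frames) - 1) + frame_len
    y = np.zeros(n)
    win = np.hanning(frame_len)                # OLA cross-fade window
    t = np.arange(frame_len)
    for i, (amps, freqs, phases) in enumerate(frames):
        s = sum(a * np.cos(w * t + p) for a, w, p in zip(amps, freqs, phases))
        y[i * hop:i * hop + frame_len] += win * s
    return y
```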


Proceedings ArticleDOI
21 Apr 1997
TL;DR: This work proposes a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels.
Abstract: Understanding the human ability to reliably process and decode speech across a wide range of acoustic conditions and speaker characteristics is a fundamental challenge for current theories of speech perception. Conventional speech representations such as the sound spectrogram emphasize many spectro-temporal details that are not directly germane to the linguistic information encoded in the speech signal and which consequently do not display the perceptual stability characteristic of human listeners. We propose a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels. We describe the representation and illustrate its stability with color-mapped displays and with results from automatic speech recognition experiments.
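A minimal sketch of the modulation-spectrogram computation for a single critical-band-like channel, assuming a Butterworth band-pass and a Hilbert envelope (the paper does not mandate these choices): extract the channel's amplitude envelope and keep only the low-frequency portion of its modulation spectrum.

```python
# Sketch of the modulation-spectrogram idea for one channel: band-pass,
# take the amplitude envelope, and keep the low-frequency (roughly 0-8 Hz)
# part of the envelope's spectrum. Band edges and cutoffs are assumptions.
import numpy as np
from scipy.signal import butter, lfilter, hilbert

def modulation_spectrum(x, sr, band=(300.0, 600.0), mod_max_hz=8.0):
    # 1. Band-pass one critical-band-like channel.
    b, a = butter(4, [band[0] / (sr / 2), band[1] / (sr / 2)], btype='band')
    sub = lfilter(b, a, x)
    # 2. Amplitude envelope via the Hilbert transform.
    env = np.abs(hilbert(sub))
    # 3. Modulation spectrum = spectrum of the envelope; keep 0..mod_max_hz.
    spec = np.abs(np.fft.rfft(env - env.mean()))
    freqs = np.fft.rfftfreq(len(env), 1.0 / sr)
    return freqs[freqs <= mod_max_hz], spec[freqs <= mod_max_hz]
```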

Journal ArticleDOI
TL;DR: This paper describes in some detail the key technologies and main features of MPEG-1 and MPEG-2 audio coders, and presents the MPEG-4 standard and discusses some of the typical applications for MPEG audio compression.
Abstract: The Moving Pictures Expert Group (MPEG) within the International Organization for Standardization (ISO) has developed a series of audio-visual standards known as MPEG-1 and MPEG-2. These audio-coding standards are the first international standards in the field of high-quality digital audio compression. MPEG-1 covers coding of stereophonic audio signals at high sampling rates aiming at transparent quality, whereas MPEG-2 also offers stereophonic audio coding at lower sampling rates. In addition, MPEG-2 introduces multichannel coding with and without backwards compatibility to MPEG-1 to provide an improved acoustical image for audio-only applications and for enhanced television and video-conferencing systems. MPEG-2 audio coding without backwards compatibility, called MPEG-2 Advanced Audio Coding (AAC), offers the highest compression rates. Typical application areas for MPEG-based digital audio are in the fields of audio production, program distribution and exchange, digital sound broadcasting, digital storage, and various multimedia applications. We describe in some detail the key technologies and main features of MPEG-1 and MPEG-2 audio coders. We also present the MPEG-4 standard and discuss some of the typical applications for MPEG audio compression.

Patent
16 Dec 1997
TL;DR: A subband audio coder employs perfect/nonperfect reconstruction filters, predictive/non-predictive subband encoding, transient analysis, and psycho-acoustic/minimum mean square error (mmse) bit allocation over time, frequency and the multiple audio channels to encode/decode a data stream to generate high fidelity reconstructed audio as mentioned in this paper.
Abstract: A subband audio coder employs perfect/non-perfect reconstruction filters, predictive/non-predictive subband encoding, transient analysis, and psycho-acoustic/minimum mean-square-error (mmse) bit allocation over time, frequency and the multiple audio channels to encode/decode a data stream to generate high fidelity reconstructed audio. The audio coder windows the multi-channel audio signal such that the frame size, i.e. number of bytes, is constrained to lie in a desired range, and formats the encoded data so that the individual subframes can be played back as they are received, thereby reducing latency. Furthermore, the audio coder processes the baseband portion (0-24 kHz) of the audio bandwidth for sampling frequencies of 48 kHz and higher with the same encoding/decoding algorithm so that the audio coder architecture is future-compatible.
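The mmse bit-allocation component can be sketched with the textbook rule: give each subband bits in proportion to the log of its variance relative to the geometric mean, then round and fix up to meet the budget. This is the classical formula, not the patent's exact procedure.

```python
# Classical mmse bit allocation across subbands (textbook rule, used here
# only to illustrate the kind of allocation the patent mentions).
import numpy as np

def mmse_bit_allocation(subband_vars, total_bits):
    subband_vars = np.asarray(subband_vars, dtype=float)
    avg = total_bits / len(subband_vars)
    geo_mean = np.exp(np.mean(np.log(subband_vars)))
    bits = avg + 0.5 * np.log2(subband_vars / geo_mean)
    bits = np.clip(np.round(bits), 0, None)    # no negative allocations
    # Greedy fix-up so allocations sum exactly to the budget.
    while bits.sum() > total_bits:
        bits[np.argmax(bits)] -= 1
    while bits.sum() < total_bits:
        bits[np.argmax(subband_vars / 4.0**bits)] += 1
    return bits.astype(int)
```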

Patent
28 Jan 1997
TL;DR: In this article, the positions and velocities of the speech organs (2, 3, 4) as speech is articulated can be defined for each acoustic speech unit (20) by simultaneously recording EM wave reflections and acoustic speech information.
Abstract: By simultaneously recording EM wave reflections (21) and acoustic speech information (24), the positions and velocities of the speech organs (2, 3, 4) as speech is articulated can be defined for each acoustic speech unit (20). Well defined time frames and feature vectors (6, 7, 8, 9) describing the speech, to the degree required, can be formed. Such feature vectors (6, 7, 8, 9) can uniquely characterize the speech unit (20) being articulated each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit (20) recognition, and organ mechanical parameters can be determined.

Journal Article
TL;DR: Common features as well as differences between MPEG-1 and MPEG-2 audio, other audio coding systems currently in use, and the new work for MPEG-4 audio will be presented.
Abstract: Since 1988 MPEG has been working on the standardization of high-quality low-bit-rate audio coding. In 1992 and 1994 the MPEG-1 and MPEG-2 audio standards were completed. Current work in MPEG includes the MPEG-2 advanced audio coding (MPEG-2 AAC) of stereo or multichannel sound material and the audio part of MPEG-4. Common features as well as differences between MPEG-1 and MPEG-2 audio, other audio coding systems currently in use, and the new work for MPEG-2 AAC and MPEG-4 audio will be presented.

Journal ArticleDOI
TL;DR: The modulated lapped transform properties and how it can be used to generate a time-varying filterbank are described and examples of its implementation in two audio coding standards are presented.
Abstract: The modulated lapped transform (MLT) is used in both audio and video data compression schemes. This paper describes its properties and how it can be used to generate a time-varying filterbank. Examples of its implementation in two audio coding standards are presented.
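For reference, a direct (O(M^2)) sketch of the MLT, which is equivalent to the MDCT with a sine window: M coefficients are produced from 2M input samples, with 50% overlap between successive blocks. Fast implementations factor this through an FFT.

```python
# Direct sketch of the modulated lapped transform (MDCT with sine window).
import numpy as np

def mlt(block):
    """block: 2M samples (numpy array); returns M MLT coefficients."""
    M = len(block) // 2
    n = np.arange(2 * M)
    h = np.sin(np.pi / (2 * M) * (n + 0.5))          # sine window
    k = np.arange(M)[:, None]
    basis = np.cos(np.pi / M * (n + 0.5 + M / 2.0) * (k + 0.5))
    return basis @ (h * block)
```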

PatentDOI
TL;DR: In this paper, a speech/music discriminator employs data from multiple features of an audio signal as input to a classifier, and a preferred set of classifiers is based upon variations of a nearest-neighbor approach, including a K-d tree spatial partitioning technique.
Abstract: A speech/music discriminator employs data from multiple features of an audio signal (10) as input to a classifier (16). Some of the feature data is determined from individual frames of the audio signal, and other input data is based upon variations of a feature over several frames, to distinguish the changes in voiced and unvoiced components of speech from the more constant characteristics of music. Several different types of classifiers for labeling test points on the basis of the feature data are disclosed. A preferred set of classifiers is based upon variations of a nearest-neighbor approach, including a K-d tree spatial partitioning technique.
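The nearest-neighbor classification step can be sketched with a k-d tree, the spatial partitioning the patent prefers; the feature extraction (per-frame features plus their variation over several frames) is assumed already done, and the value of k is an assumption.

```python
# Sketch of k-d tree nearest-neighbor speech/music classification.
# Feature extraction is assumed done upstream.
import numpy as np
from scipy.spatial import cKDTree

class KNNDiscriminator:
    def __init__(self, train_features, train_labels, k=5):
        self.tree = cKDTree(np.asarray(train_features))
        self.labels = np.asarray(train_labels)   # 0 = speech, 1 = music
        self.k = k

    def classify(self, feature_vector):
        _, idx = self.tree.query(feature_vector, k=self.k)
        votes = self.labels[np.atleast_1d(idx)]
        return int(np.round(votes.mean()))        # majority vote
```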

Patent
24 Jan 1997
TL;DR: In this article, a coding process and a coder for inserting an inaudible data signal into an audio signal are presented, in which the audio signal is first converted into a spectrum range and the masking threshold of the audio signal is determined.
Abstract: In a coding process and a coder for inserting an inaudible data signal into an audio signal, the audio signal is first converted into a spectrum range and the masking threshold of the audio signal is determined. A pseudo-noise signal and a data signal are prepared and multiplied together to provide a frequency-spread data signal. The spread data signal is weighted by the masking threshold and then the audio signal and the weighted data signal are superimposed. In a process and a decoder for decoding an inaudible data signal inserted into an audio signal, the audio signal is first sampled and the sampled audio signal is then non-recursively filtered. Thereupon the filtered audio signal is compared with a threshold in order to recover the data signal.
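A sketch of the embedding step the patent describes, with the masking model assumed given: spread the data bits with a pseudo-noise sequence, weight the spread signal by the masking threshold, and superimpose it on the audio spectrum.

```python
# Sketch of masking-weighted spread-spectrum embedding. The masking model
# that produces mask_threshold is assumed given.
import numpy as np

def embed(audio_block, data_bits, mask_threshold, seed=0):
    """audio_block: N time samples; mask_threshold: N//2+1 spectral weights."""
    rng = np.random.default_rng(seed)
    n_bins = len(mask_threshold)
    pn = rng.choice([-1.0, 1.0], size=n_bins)          # pseudo-noise sequence
    chips = np.repeat(data_bits, int(np.ceil(n_bins / len(data_bits))))[:n_bins]
    spread = pn * (2.0 * chips - 1.0)                  # bits {0,1} -> {-1,+1}
    spectrum = np.fft.rfft(audio_block)
    spectrum += mask_threshold * spread                # kept below masking
    return np.fft.irfft(spectrum, n=len(audio_block))
```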

Patent
31 Jul 1997
TL;DR: In this paper, a multi-station audio distribution system with at least two listening stations, a data control mechanism, and a listening station interface mechanism disposed between the listening stations and the data control mechanism is presented.
Abstract: A multi-station audio distribution apparatus having at least two listening stations, a data control mechanism and a listening station interface mechanism disposed between the listening stations and the data control mechanism. Each of the listening stations has a user input in the form of a barcode scanner to enter an audio material selection and each has an audio output. The data control mechanism retrieves digitized audio material corresponding to each of the user's audio material selections. The listening station interface mechanism transfers the user's inputs from each of the listening stations to the data control mechanism, receives the digitized audio materials corresponding to each user's input from the data control mechanism, converts the digitized audio materials to analog audio signals and transfers the analog audio signals to each of the respective listening stations for the audio output.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: The results of this experiment confirm the general superiority of marginals-based schemes, demonstrate the viability of shared covariance statistics, and suggest several ways in which performance improvements on the larger task may be obtained.
Abstract: In noisy listening conditions, the information available on which to base speech recognition decisions is necessarily incomplete: some spectro-temporal regions are dominated by other sources. We report on the application of a variety of techniques for missing data in speech recognition. These techniques may be based on marginal distributions or on reconstruction of missing parts of the spectrum. Application of these ideas in the resource management task shows a performance which is robust to random removal of up to 80% of the frequency channels, but falls off rapidly with deletions which more realistically simulate masked speech. We report on a vowel classification experiment designed to isolate some of the RM problems for more detailed exploration. The results of this experiment confirm the general superiority of marginals-based schemes, demonstrate the viability of shared covariance statistics, and suggest several ways in which performance improvements on the larger task may be obtained.
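The marginals-based approach reduces, for a diagonal Gaussian, to evaluating the likelihood over the reliable feature dimensions only, as in this sketch.

```python
# Marginal log-likelihood of a partially observed frame under a diagonal
# Gaussian: masked dimensions are integrated out, so only the reliable
# dimensions contribute.
import numpy as np

def marginal_log_likelihood(x, reliable, mean, var):
    """x, mean, var: full feature vectors (numpy); reliable: boolean mask."""
    r = np.asarray(reliable)
    d = x[r] - mean[r]
    return -0.5 * np.sum(np.log(2 * np.pi * var[r]) + d**2 / var[r])
```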

Journal ArticleDOI
TL;DR: It is suggested that many of the advantages to be gained from interaction between speech production and speech recognition communities will develop from integrating production models with the probabilistic analysis-by-synthesis strategy currently used by the technology community.

Proceedings ArticleDOI
02 Jul 1997
TL;DR: Algorithms for perceptually transparent coding of CD-quality digital audio are reviewed, covering both research and standardization activities, including the ISO/MPEG family and the Dolby AC-3 algorithms.
Abstract: Considerable research has been devoted to the development of algorithms for perceptually transparent coding of high-fidelity (CD-quality) digital audio. As a result, many algorithms have been proposed and several have now become international and/or commercial product standards. This paper reviews algorithms for perceptually transparent coding of CD-quality digital audio, including both research and standardization activities. First, psychoacoustic principles are described with the MPEG psychoacoustic signal analysis model 1 discussed in some detail. Then, we review methodologies which achieve perceptually transparent coding of FM- and CD-quality audio signals, including algorithms which manipulate transform components and subband signal decompositions. The discussion concentrates on architectures and applications of those techniques which utilize psychoacoustic models to exploit efficiently masking characteristics of the human receiver. Several algorithms which have become international and/or commercial standards are also presented, including the ISO/MPEG family and the Dolby AC-3 algorithms. The paper concludes with a brief discussion of future research directions.

Patent
04 Apr 1997
TL;DR: In this article, a method and device for extrapolating past signal-history data for insertion into missing data segments in order to conceal digital speech frame errors is proposed, which is implemented with a device that utilizes a finite-impulse response (FIR) multi-layer feed-forward artificial neural network.
Abstract: A method and device for extrapolating past signal-history data for insertion into missing data segments in order to conceal digital speech frame errors. The extrapolation method uses past-signal history that is stored in a buffer. The method is implemented with a device that utilizes a finite-impulse response (FIR) multi-layer feed-forward artificial neural network that is trained by back-propagation for one-step extrapolation of speech compression algorithm (SCA) parameters. Once a speech connection has been established, the speech compression algorithm device begins sending encoded speech frames. As the speech frames are received, they are decoded and converted back into speech signal voltages. During the normal decoding process, pre-processing of the required SCA parameters will occur and the results stored in the past-history buffer. If a speech frame is detected to be lost or in error, then extrapolation modules are executed and replacement SCA parameters are generated and sent as the parameters required by the SCA. In this way, the information transfer to the SCA is transparent, and the SCA processing continues as usual. The listener will not normally notice that a speech frame has been lost because of the smooth transition between the last-received, lost, and next-received speech frames.
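The extrapolation step can be sketched as the forward pass of a small feed-forward network over the past-history buffer; the patent trains such a network offline by back-propagation, the weights below are placeholders, and the FIR tapped-delay structure is represented simply by flattening the buffer.

```python
# Sketch of one-step extrapolation of codec parameters from the past-
# history buffer (forward pass only; training omitted, weights assumed).
import numpy as np

def extrapolate(history, W1, b1, W2, b2):
    """history: the last few frames' SCA parameters, flattened."""
    h = np.tanh(W1 @ history + b1)       # hidden layer
    return W2 @ h + b2                   # predicted next-frame parameters

# On a detected frame erasure, the prediction replaces the lost frame's
# parameters and decoding proceeds as usual:
#   params = extrapolate(buffer.ravel(), W1, b1, W2, b2)
```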

Proceedings ArticleDOI
21 Apr 1997
TL;DR: The GSM enhanced full rate (EFR) speech codec that has been standardised for the GSM mobile communication system provides wireline quality not only for error-free conditions but also for the most typical error conditions.
Abstract: This paper describes the GSM enhanced full rate (EFR) speech codec that has been standardised for the GSM mobile communication system. The GSM EFR codec has been jointly developed by Nokia and University of Sherbrooke. It provides speech quality at least equivalent to that of a wireline telephony reference (32 kbit/s ADPCM). The EFR codec uses 12.2 kbit/s for speech coding and 10.6 kbit/s for error protection. Speech coding is based on the ACELP algorithm (algebraic code excited linear prediction). The codec provides substantial quality improvement compared to the existing GSM full rate and half rate codecs. The old GSM codecs lack wireline quality even in error-free channel conditions, while the EFR codec provides wireline quality not only for error-free conditions but also for the most typical error conditions. With the EFR codec, wireline quality is also sustained in the presence of background noise and in tandem connections (mobile to mobile calls).

Patent
26 Jun 1997
TL;DR: In this article, a method for speech coding using Code-Excited Linear Prediction (CELP) that produces toll-quality speech at data rates between 4 and 16 kbit/s is presented.
Abstract: The invention provides a method for speech coding using Code-Excited Linear Prediction (CELP) producing toll-quality speech at data rates between 4 and 16 kbit/s. The invention uses a series of baseline, implied, and adaptive codebooks, comprising pulse and random codebooks with associated gain vectors, to characterize the speech. Improved quantization and search techniques to achieve real-time operation, based on the codebooks and gains, are also provided.
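The codebook search that underlies CELP coders of this kind has a standard closed-form core, sketched below: filter each codevector through the synthesis filter, compute its optimal gain analytically, and keep the candidate maximizing the normalized correlation. The patent's layered baseline/implied/adaptive structure and quantization are omitted.

```python
# Standard CELP codebook search core: for each codevector, the optimal
# gain is <target, y> / <y, y>, and the best candidate maximizes the
# normalized correlation score.
import numpy as np
from scipy.signal import lfilter

def search_codebook(target, codebook, lpc):
    best, best_gain, best_score = None, 0.0, -np.inf
    for idx, c in enumerate(codebook):
        y = lfilter([1.0], np.concatenate(([1.0], lpc)), c)  # filtered c
        num, den = np.dot(target, y), np.dot(y, y) + 1e-12
        score = num * num / den            # maximize correlation^2 / energy
        if score > best_score:
            best, best_score, best_gain = idx, score, num / den
    return best, best_gain
```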

Proceedings ArticleDOI
21 Apr 1997
TL;DR: This paper focuses on the improvements in prosody and acoustic modeling, which are all derived through the use of probabilistic learning methods in the Whistler TTS engine.
Abstract: The Whistler text-to-speech engine was designed so that we can automatically construct the model parameters from training data. This paper focuses on the improvements in prosody and acoustic modeling, which are all derived through the use of probabilistic learning methods. Whistler can produce synthetic speech that sounds very natural and resembles the acoustic and prosodic characteristics of the original speaker. The underlying technologies used in Whistler can significantly facilitate the process of creating generic TTS systems for a new language, a new voice, or a new speech style. The Whistler TTS engine supports the Microsoft Speech API and requires less than 3 MB of working memory.


01 Jan 1997
TL;DR: A classifier that distinguishes speech from music and non-vocal sounds is presented, as well as experimental results showing how perfect classification accuracy may be achieved on a small corpus using substantially less than two seconds per test audio file.
Abstract: This paper presents recent results using statistics generated by an MMI-supervised vector quantizer as a measure of audio similarity. Such a measure has proved successful for talker identification, and the extension from speech to general audio, such as music, is straightforward. A classifier that distinguishes speech from music and non-vocal sounds is presented, as well as experimental results showing how perfect classification accuracy may be achieved on a small corpus using substantially less than two seconds per test audio file. The techniques presented here may be extended to other applications and domains, such as audio retrieval-by-similarity, musical genre classification, and automatic segmentation of continuous audio.
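The skeleton of such a similarity measure is easy to sketch: quantize each frame to its nearest codeword, accumulate a codeword-usage histogram per file, and compare histograms. The paper's codebook is MMI-supervised; here the codebook is simply assumed given, and cosine similarity is an illustrative choice.

```python
# Sketch of VQ-statistics audio similarity: codeword-usage histograms
# compared by cosine similarity. The codebook itself is assumed given.
import numpy as np

def usage_histogram(frames, codebook):
    """frames: (n, d) feature vectors; codebook: (k, d) codewords."""
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(codebook))
    return counts / counts.sum()

def similarity(hist_a, hist_b):
    """Cosine similarity between two codeword-usage histograms."""
    return hist_a @ hist_b / (np.linalg.norm(hist_a) * np.linalg.norm(hist_b) + 1e-12)
```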

Journal ArticleDOI
R.V. Cox
TL;DR: Three new speech coding recommendations from the ITU-T provide good coverage for a wide range of applications that have low bit rate requirements (i.e., from 5.3 to 8 kb/s).
Abstract: Many new speech coding standards have been created in the 10-year period 1987-1996. The author reviews the key attributes that determine what coder to select for different applications. The article then focuses on three new speech coding recommendations from the ITU-T, namely G.723.1, G.729, and Annex A of G.729. They provide good coverage for a wide range of applications that have low bit rate requirements (i.e., from 5.3 to 8 kb/s). In addition to bit rate, the article reviews their delay, complexity, and performance. Also reviewed are the history of these standards, and what considerations influenced the requirements each of these coders had to meet.
