
Showing papers on "Speech coding published in 2002"


Book
01 Jan 2002
TL;DR: This book presents the methods underlying perceptual audio coding, including quantization, time-to-frequency mapping, psychoacoustic models, bit allocation, and quality measurement, and surveys the major audio coding standards: MPEG-1, MPEG-2, MPEG-2 AAC, Dolby AC-3, and MPEG-4 Audio.
Abstract: Foreword. Preface. I: Audio Coding Methods. 2. Quantization. 3. Representation of Audio Signals. 4. Time to Frequency Mapping Part I: The PQMF. 5. Time to Frequency Mapping Part II: The MDCT. 6. Introduction to Psychoacoustics. 7. Psychoacoustic Models for Audio Coding. 8. Bit Allocation Strategies. 9. Building a Perceptual Audio Decoder. 10. Quality Measurement of Perceptual Audio Codecs. II: Audio Coding Standards. 11. MPEG-1 Audio. 12. MPEG-2 Audio. 13. MPEG-2 AAC. 14. Dolby AC-3. 15. MPEG-4 Audio. Index.

367 citations



PatentDOI
TL;DR: In this article, the authors propose a speech recognition technique for video and audio signals that consists of processing a video signal associated with an arbitrary content video source, processing an audio signal associated with the video signal, and recognizing at least a portion of the processed audio signal using at least a portion of the processed video signal to generate an output signal representative of the audio signal.
Abstract: Techniques for providing speech recognition comprise the steps of processing a video signal associated with an arbitrary content video source, processing an audio signal associated with the video signal, and recognizing at least a portion of the processed audio signal, using at least a portion of the processed video signal, to generate an output signal representative of the audio signal.

302 citations


Proceedings ArticleDOI
Ara V. Nefian, Luhong Liang, Xiaobo Pi, Liu Xiaoxiang, Crusoe Mao, Kevin Murphy
13 May 2002
TL;DR: This paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM) to model the state asynchrony of the audio and visual observations sequences while still preserving their natural correlation over time.
Abstract: In recent years several speech recognition systems that use visual together with audio information showed significant increase in performance over the standard speech recognition systems. The use of visual features is justified by both the bimodality of the speech generation and by the need of features that are invariant to acoustic noise perturbation. The audio-visual speech recognition system presented in this paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM). The statistical properties of the coupled-HMM allow us to model the state asynchrony of the audio and visual observations sequences while still preserving their natural correlation over time. The experimental results show that the coupled HMM outperforms the multistream HMM in audio visual speech recognition.

252 citations


Patent
30 May 2002
TL;DR: In this paper, a voice operated portable information management system that is substantially language independent and capable of supporting a substantially unlimited vocabulary is presented, which includes a microphone, speaker, clock and GPS connected to a speech processing system.
Abstract: A voice operated portable information management system that is substantially language independent and capable of supporting a substantially unlimited vocabulary. The system (Fig. 1) includes a microphone, speaker, clock and GPS connected to a speech processing system. The speech processing system (Fig. 4): 1) generates and stores compressed speech data corresponding to a user's speech received through the microphone, 2) compares the stored speech data, 3) re-synthesizes the stored speech data for output as speech through the speaker, 4) provides an audible user interface including a speech assistant for providing instructions in the user's language, 5) stores user-specific compressed speech data, including commands, received in response to prompts from the speech assistant for purposes of adapting the system to the user's speech, 6) identifies memo management commands spoken by the user, and stores and organizes compressed speech data as a function of the identified commands, and 7) identifies memo retrieval commands spoken by the user, and retrieves and outputs the stored speech data as a function of the commands.

160 citations


PatentDOI
Philip R. Wiser, LeeAnn Heringer, Gerry Kearby, Leon Rishniw, Jason S. Brownell
TL;DR: In this paper, audio processing profiles are organized according to specific delivery bandwidths such that a sound engineer can quickly and efficiently encode audio signals for each of a number of distinct delivery media.
Abstract: Essentially all of the processing parameters which control processing of a source audio signal to produce an encoded audio signal are stored in an audio processing profile. Multiple audio processing profiles are stored in a processing profile database such that specific combinations of processing parameters can be retrieved and used at a later time. Audio processing profiles are organized according to specific delivery bandwidths such that a sound engineer can quickly and efficiently encode audio signals for each of a number of distinct delivery media. Synchronized A/B switching during playback of various encoded audio signals allows the sound engineer to detect nuances in the sound characteristics of the various encoded audio signals.

151 citations


Journal ArticleDOI
TL;DR: There was remarkably close agreement in the pattern of group mean scores for the three strategies for CNC words and CUNY sentences in noise between the present study and the Conversion study, and the same percentage of subjects preferred each strategy.
Abstract: ObjectiveThe objective of this study was to determine whether 1) the SPEAK, ACE or CIS speech coding strategy was associated with significantly better speech recognition for individual subjects implanted with the Nucleus CI24M internal device who used the SPrint™ speech processor, and 2) whether a s

143 citations


Patent
Donald T. Tang, Ligin Shen, Qin Shi, Wei Zhang
05 Apr 2002
TL;DR: In this paper, a method for generating personalized speech from text includes the steps of analyzing the input text to get standard parameters of the speech to be synthesized from a standard text-to-speech database.
Abstract: A method for generating personalized speech from text includes the steps of analyzing the input text to get standard parameters of the speech to be synthesized from a standard text-to-speech database; mapping the standard speech parameters to the personalized speech parameters via a personalization model obtained in a training process; and synthesizing speech of the input text based on the personalized speech parameters. The method can be used to simulate the speech of the target person so as to make the speech produced by a TTS system more attractive and personalized.

127 citations
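
As a rough illustration of the mapping step described above, the sketch below learns an affine map from standard to personalized parameter vectors by least squares and applies it at synthesis time; the three-dimensional parameter layout, the training data, and the function names are placeholders, not the patent's actual model or features.

```python
# Minimal sketch of the parameter-mapping step: learn a linear map from
# "standard" TTS parameters to "personalized" ones from paired training
# vectors, then apply it at synthesis time. The feature layout (pitch,
# duration, energy) is a placeholder, not the patent's format.
import numpy as np

def train_personalization_model(standard, personal):
    """Least-squares affine map: personal ~= standard @ W + b."""
    X = np.hstack([standard, np.ones((standard.shape[0], 1))])  # add bias column
    coeffs, *_ = np.linalg.lstsq(X, personal, rcond=None)
    return coeffs  # shape (d+1, d)

def personalize(standard_params, coeffs):
    x = np.append(standard_params, 1.0)
    return x @ coeffs

# Toy usage with random paired data standing in for aligned recordings.
rng = np.random.default_rng(0)
std = rng.normal(size=(200, 3))              # e.g. [pitch, duration, energy]
per = std * [1.1, 0.9, 1.0] + [12.0, 5.0, 0.0] + rng.normal(0, 0.1, (200, 3))
W = train_personalization_model(std, per)
print(personalize(std[0], W), per[0])
```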


PatentDOI
TL;DR: In this article, a transform coding method for music signals was proposed, which is suitable for use in a hybrid codec, whereby a common linear predictive (LP) synthesis filter was employed for both speech and music signals.
Abstract: The present invention provides a transform coding method efficient for music signals that is suitable for use in a hybrid codec, whereby a common Linear Predictive (LP) synthesis filter is employed for both speech and music signals. The LP synthesis filter switches between a speech excitation generator and a transform excitation generator, in accordance with the coding of a speech or music signal, respectively. For coding speech signals, the conventional CELP technique may be used, while a novel asymmetrical overlap-add transform technique is applied for coding music signals. In performing the common LP synthesis filtering, interpolation of the LP coefficients is conducted for signals in overlap-add operation regions. The invention enables smooth transitions when the decoder switches between speech and music decoding modes.

126 citations


Journal ArticleDOI
TL;DR: A subjective listening test of the combined pre-filter/lossless coder and a state-of-the-art perceptual audio coder (PAC) shows that the new method achieves a comparable compression ratio and audio quality with a lower delay.
Abstract: This paper proposes a versatile perceptual audio coding method that achieves high compression ratios and is capable of low encoding/decoding delay. It accommodates a variety of source signals (including both music and speech) with different sampling rates. It is based on separating irrelevance and redundancy reductions into independent functional units. This contrasts traditional audio coding where both are integrated within the same subband decomposition. The separation allows for the independent optimization of the irrelevance and redundancy reduction units. For both reductions, we rely on adaptive filtering and predictive coding as much as possible to minimize the delay. A psycho-acoustically controlled adaptive linear filter is used for the irrelevance reduction, and the redundancy reduction is carried out by a predictive lossless coding scheme, which is termed weighted cascaded least mean squared (WCLMS) method. Experiments are carried out on a database of moderate size which contains mono-signals of different sampling rates and varying nature (music, speech, or mixed). They show that the proposed WCLMS lossless coder outperforms other competing lossless coders in terms of compression ratios and delay, as applied to the pre-filtered signal. Moreover, a subjective listening test of the combined pre-filter/lossless coder and a state-of-the-art perceptual audio coder (PAC) shows that the new method achieves a comparable compression ratio and audio quality with a lower delay.

125 citations
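
The redundancy-reduction side of this scheme can be illustrated with a single adaptive predictor whose integer residual is what would be entropy-coded; the sketch below uses one NLMS predictor rather than the paper's weighted cascade (WCLMS), and the filter order and step size are arbitrary choices.

```python
# Minimal sketch of the redundancy-reduction idea: an adaptive (NLMS)
# predictor runs over integer samples and only the prediction residual is
# entropy-coded. This is a single predictor, not the weighted cascade of
# predictors (WCLMS) described in the paper.
import numpy as np

def nlms_residual(x, order=16, mu=0.5, eps=1e-6):
    w = np.zeros(order)
    residual = np.empty_like(x)
    for n in range(len(x)):
        past = x[max(0, n - order):n][::-1].astype(float)
        past = np.pad(past, (0, order - len(past)))
        pred = int(round(w @ past))
        residual[n] = x[n] - pred               # integer residual -> lossless
        err = x[n] - w @ past
        w += mu * err * past / (past @ past + eps)
    return residual

def nlms_reconstruct(residual, order=16, mu=0.5, eps=1e-6):
    w = np.zeros(order)
    x = np.empty_like(residual)
    for n in range(len(residual)):
        past = x[max(0, n - order):n][::-1].astype(float)
        past = np.pad(past, (0, order - len(past)))
        pred = int(round(w @ past))
        x[n] = residual[n] + pred               # decoder stays in sync with encoder
        err = x[n] - w @ past
        w += mu * err * past / (past @ past + eps)
    return x

samples = (1000 * np.sin(np.arange(2000) * 0.05)).astype(np.int32)
res = nlms_residual(samples)
assert np.array_equal(nlms_reconstruct(res), samples)   # perfect reconstruction
```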


Patent
25 Feb 2002
TL;DR: In this paper, a method for time aligning audio signals, wherein one signal has been derived from the other or both have been derived separately from another signal, comprises deriving reduced-information characterizations of the audio signals using auditory scene analysis.
Abstract: A method for time aligning audio signals, wherein one signal has been derived from the other or both have been derived from another signal, comprises deriving reduced-information characterizations of the audio signals using auditory scene analysis. The time offset of one characterization with respect to the other characterization is calculated and the temporal relationship of the audio signals with respect to each other is modified in response to the time offset such that the audio signals are coincident with each other. These principles may also be applied to a method for time aligning a video signal and an audio signal that will be subjected to differential time offsets.
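
A minimal sketch of the alignment idea, using a frame-energy envelope as a stand-in for the patent's reduced-information (auditory scene analysis) characterizations: the lag that maximizes the cross-correlation of the two characterizations gives the time offset used to re-align the signals.

```python
# Derive a reduced-information characterization of each signal (here a simple
# frame-energy envelope), estimate the lag that maximizes their
# cross-correlation, and report it as the offset in samples.
import numpy as np

def envelope(x, frame=256):
    n = len(x) // frame
    return np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2) for i in range(n)])

def estimate_offset(a, b, frame=256):
    ea, eb = envelope(a, frame), envelope(b, frame)
    ea, eb = ea - ea.mean(), eb - eb.mean()
    corr = np.correlate(ea, eb, mode="full")
    lag_frames = np.argmax(corr) - (len(eb) - 1)
    return lag_frames * frame            # lag in samples; negative when b is delayed

rng = np.random.default_rng(1)
x = rng.normal(size=48000)
delayed = np.concatenate([np.zeros(4096), x])[:48000]   # derived, delayed copy
print(estimate_offset(x, delayed))       # prints about -4096: 'delayed' trails x
```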

Patent
09 Oct 2002
TL;DR: In this article, the authors present a method and apparatus for identifying broadcast digital audio signals, in which the digital audio signal is provided to a processing structure configured to identify a program-identifying code in the received digital audio signal, identify a program-identifying code in a decompressed received digital audio signal, identify a feature signature in the received digital audio signal, and identify a feature signature in the decompressed received digital audio signal.
Abstract: Method and apparatus for identifying broadcast digital audio signals include structure and/or function whereby the digital audio signal is provided to processing structure which is configured to (i) identify a program-identifying code in the received digital audio signal, (ii) identify a program-identifying code in a decompressed received digital audio signal, (iii) identify a feature signature in the received digital audio signal, and (iv) identify a feature signature in the decompressed received digital audio signal. Preferably, such processing structure is disposed in a dwelling or a monitoring site in an audience measurement system, such as the Nielsen TV ratings system.

Journal ArticleDOI
TL;DR: A spectral-domain speech enhancement algorithm based on a mixture model for the short-time spectrum of the clean speech signal and on a maximum assumption in the production of the noisy speech spectrum, which shows improved performance compared to alternative speech enhancement algorithms.
Abstract: We present a spectral domain, speech enhancement algorithm. The new algorithm is based on a mixture model for the short time spectrum of the clean speech signal, and on a maximum assumption in the production of the noisy speech spectrum. In the past this model was used in the context of noise robust speech recognition. In this paper we show that this model is also effective for improving the quality of speech signals corrupted by additive noise. The computational requirements of the algorithm can be significantly reduced, essentially without paying performance penalties, by incorporating a dual codebook scheme with tied variances. Experiments, using recorded speech signals and actual noise sources, show that in spite of its low computational requirements, the algorithm shows improved performance compared to alternative speech enhancement algorithms.


Proceedings ArticleDOI
13 May 2002
TL;DR: The characteristics of voiced speech can be used to derive a coherently added signal from the linear prediction (LP) residuals of the degraded speech data from different microphones to enhance speech degraded by noise and reverberation.
Abstract: This paper proposes an approach for processing speech from multiple microphones to enhance speech degraded by noise and reverberation. The approach is based on exploiting the features of the excitation source in speech production. In particular, the characteristics of voiced speech can be used to derive a coherently added signal from the linear prediction (LP) residuals of the degraded speech data from different microphones. A weight function is derived from the coherently added signal. For coherent addition the time-delay between a pair of microphones is estimated using the knowledge of the source information present in the LP residual. The enhanced speech is generated by exciting the time varying all-pole filter with the weighted LP residual.
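
Two of the ingredients, computing the LP residual by inverse filtering and estimating the inter-microphone delay from the cross-correlation of the residuals, can be sketched as below; the coherent addition and weight-function stages are omitted, and the LP order is an arbitrary choice.

```python
# LP residual via the autocorrelation method (Levinson-Durbin) and inverse
# filtering, plus a residual-based delay estimate between two microphones.
import numpy as np
from scipy.signal import lfilter

def lp_coefficients(x, order=12):
    """Autocorrelation method via the Levinson-Durbin recursion."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1); a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / (err + 1e-12)
        a[1:i + 1] = np.append(a[1:i], 0.0) + k * np.append(a[1:i][::-1], 1.0)
        err *= (1 - k * k)
    return a                      # A(z) = 1 + a1 z^-1 + ...; residual = A(z) applied to x

def lp_residual(x, order=12):
    return lfilter(lp_coefficients(x, order), [1.0], x)

def delay_between(mic1, mic2, order=12):
    r1, r2 = lp_residual(mic1, order), lp_residual(mic2, order)
    corr = np.correlate(r1, r2, mode="full")
    return np.argmax(corr) - (len(r2) - 1)   # negative: mic2 lags mic1
```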

Proceedings ArticleDOI
07 Nov 2002
TL;DR: Four time-domain VAD algorithms are compared in terms of speech quality, compression level and computational complexity, and their relative merits and demerits are presented along with the subjective quality of speech after the pruning of silence periods.
Abstract: We discuss techniques for voice activity detection (VAD) for voice over Internet Protocol (VoIP). VAD aids in reducing the bandwidth requirement of a voice session, thereby using bandwidth efficiently. Such a scheme would be implemented in the application layer. Thus the VAD is independent of the lower layers in the network stack (see Flood, J.E., "Telecommunications Switching - Traffic and Networks", Prentice Hall India). We compare four time-domain VAD algorithms in terms of speech quality, compression level and computational complexity. A comparison of the relative merits and demerits along with the subjective quality of speech after the pruning of silence periods is presented for all the algorithms. A quantitative measurement of speech quality for different algorithms is also presented.
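
For illustration, a representative time-domain VAD of the kind compared here classifies each frame by short-time energy and zero-crossing rate against thresholds calibrated on an assumed leading noise-only segment; the thresholds and frame size below are illustrative, not taken from the paper.

```python
# Energy plus zero-crossing-rate VAD: one simple time-domain scheme.
import numpy as np

def vad(x, frame=160, noise_frames=10, energy_factor=3.0, zcr_max=0.25):
    frames = x[:len(x) // frame * frame].reshape(-1, frame)
    energy = np.mean(frames.astype(float) ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    noise_floor = np.mean(energy[:noise_frames])          # assume leading silence
    return (energy > energy_factor * noise_floor) & (zcr < zcr_max)

# Frames flagged False can be pruned or replaced by comfort noise before
# packetization, which is where the VoIP bandwidth saving comes from.
```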

Proceedings ArticleDOI
Luhong Liang1, Xiaoxing Liu1, Yibao Zhao1, Xiaobo Pi1, Ara V. Nefian1 
07 Nov 2002
TL;DR: The speaker-independent audio-visual continuous speech recognition system presented relies on a robust set of visual features obtained from the accurate detection and tracking of the mouth region, and integrates the visual and acoustic observation sequences using a coupled hidden Markov model (CHMM).
Abstract: The increase in the number of multimedia applications that require robust speech recognition systems determined a large interest in the study of audio-visual speech recognition (AVSR) systems. The use of visual features in AVSR is justified by both the audio and visual modality of the speech generation and the need for features that are invariant to acoustic noise perturbation. The speaker independent audio-visual continuous speech recognition system presented relies on a robust set of visual features obtained from the accurate detection and tracking of the mouth region. Further, the visual and acoustic observation sequences are integrated using a coupled hidden Markov model (CHMM). The statistical properties of the CHMM can model the audio and visual state asynchrony while preserving their natural correlation over time. The experimental results show that the current system tested on the XM2VTS database reduces by over 55% the error rate of the audio only speech recognition system at SNR of 0 dB.

Proceedings ArticleDOI
07 Aug 2002
TL;DR: This work carried out a fundamental investigation of the impact of packet loss and talkers on perceived speech quality to provide the basis for developing an artificial neural network (ANN) model to predict speech quality for VoIP.
Abstract: Perceived speech quality is the key metric for QoS in VoIP applications. Our primary aims are to carry out a fundamental investigation of the impact of packet loss and talkers on perceived speech quality using an objective method and, thus, to provide the basis for developing an artificial neural network (ANN) model to predict speech quality for VoIP. The impact on perceived speech quality of packet loss and of different talkers was investigated for three modern codecs (G.729, G.723.1 and AMR) using the new ITU PESQ algorithm. Results show that packet loss burstiness, loss locations/patterns and the gender of talkers have an impact. Packet size has, in general, no obvious influence on perceived speech quality for the same network conditions, but the deviation in speech quality depends on packet size and codec. Based on the investigation, we used talkspurt-based conditional and unconditional packet loss rates (which are perceptually more relevant than network packet loss rates), codec type and the gender of the talker (extracted from decoder) as inputs to an ANN model to predict speech quality directly from network parameters. Results show that high prediction accuracy was obtained from the ANN model (correlation coefficients for the test and validation datasets were 0.952 and 0.946 respectively). This work should help to develop efficient, nonintrusive QoS monitoring and control strategies for VoIP applications.
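
A minimal sketch of the prediction stage: a small MLP maps the paper's input features (talkspurt-based conditional and unconditional loss rates, codec type, talker gender) to a quality score. The synthetic training targets below merely stand in for the PESQ-derived scores used in the study.

```python
# Small neural-network regressor from network/codec/talker features to a
# quality score. The data is random placeholder data, not the paper's.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# columns: [cond_loss_rate, uncond_loss_rate, codec_id, gender]
X = np.column_stack([rng.uniform(0, 0.3, 500), rng.uniform(0, 0.2, 500),
                     rng.integers(0, 3, 500), rng.integers(0, 2, 500)])
y = 4.2 - 6.0 * X[:, 1] - 2.0 * X[:, 0] + rng.normal(0, 0.1, 500)  # fake MOS

model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X[:400], y[:400])
print("validation correlation:", np.corrcoef(model.predict(X[400:]), y[400:])[0, 1])
```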

Proceedings ArticleDOI
07 Nov 2002
TL;DR: Computationally efficient methods for detecting non-Gaussian impulsive noise in digital speech and audio signals are presented and can be applied in real time to a digital data stream.
Abstract: Computationally efficient methods for detecting non-Gaussian impulsive noise in digital speech and audio signals are presented. The aim of the detection is to find the errors without false detections in the case of, for example, percussive sounds in a music signal or stop consonants in a speech signal. Various methods for computing a detection signal and a threshold curve are studied and tested. The detection can be applied in real time to a digital data stream.
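
One plausible realization of the detection-signal-plus-threshold-curve scheme, not the paper's exact method: the detection signal is the magnitude of a second-difference (high-pass) filter output and the threshold curve is a scaled sliding median, so percussive passages raise the threshold locally instead of triggering false detections.

```python
# Detection signal = |second difference|; threshold curve = scaled sliding
# median of the detection signal. Filter choice and the factor are illustrative.
import numpy as np
from scipy.ndimage import median_filter

def detect_impulses(x, win=501, factor=6.0):
    detection = np.abs(np.diff(x, n=2, prepend=x[0], append=x[-1]))
    threshold = factor * median_filter(detection, size=win) + 1e-9
    return detection > threshold          # boolean mask of suspected clicks

fs = 8000
clean = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
noisy = clean.copy(); noisy[[1234, 5000, 7000]] += 2.0   # inject clicks
print(np.flatnonzero(detect_impulses(noisy))[:10])
```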


Proceedings ArticleDOI
Tian Wang, Kazuhito Koishida, V. Cuperman, A. Gersho, J.S. Collura
06 Oct 2002
TL;DR: Key algorithm features of the future NATO narrow band voice coder (NBVC) are presented, a 1.2/2.4 kbps speech coder with noise preprocessor based on the MELP analysis algorithm that achieves quality close to the existing federal standard 2.2 kbps.
Abstract: This paper presents key algorithm features of the future NATO narrow band voice coder (NBVC), a 1.2/2.4 kbps speech coder with noise preprocessor based on the MELP analysis algorithm. At 1.2 kbps, the MELP parameters for three consecutive frames are grouped into a superframe and jointly quantized to obtain high coding efficiency. The inter-frame redundancy is exploited with distinct quantization schemes for different unvoiced/voiced (U/V) frame combinations in the superframe. Novel techniques used at 1.2 kbps include pitch vector quantization using pitch differentials, joint quantization of pitch and U/V decisions and LSF quantization with a forward-backward interpolation method. A new harmonic synthesizer is introduced for both rates which improves the reproduction quality. Subjective test results indicate that the 1.2 kbps speech coder achieves quality close to the existing federal standard 2.4 kbps MELP coder.
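
The superframe idea can be sketched for the pitch parameter alone: three frames' pitch values are grouped, the first quantized absolutely on a log scale and the other two as differentials relative to it. The grids and bit allocations below are illustrative only, not the NBVC codebooks.

```python
# Toy superframe pitch quantizer: absolute index for frame 1, differential
# indices for frames 2 and 3. Grids and bit counts are placeholders.
import numpy as np

LOG_PITCH_GRID = np.linspace(np.log(20.0), np.log(160.0), 64)    # 6-bit absolute
DIFF_GRID = np.linspace(-0.3, 0.3, 16)                           # 4-bit differential

def quantize_pitch_superframe(pitches):           # pitches: 3 values in samples
    logp = np.log(np.asarray(pitches, dtype=float))
    i0 = int(np.argmin(np.abs(LOG_PITCH_GRID - logp[0])))
    base = LOG_PITCH_GRID[i0]
    d1 = int(np.argmin(np.abs(DIFF_GRID - (logp[1] - base))))
    d2 = int(np.argmin(np.abs(DIFF_GRID - (logp[2] - base))))
    return (i0, d1, d2)                           # 6 + 4 + 4 = 14 bits per superframe

def dequantize_pitch_superframe(idx):
    i0, d1, d2 = idx
    base = LOG_PITCH_GRID[i0]
    return np.exp([base, base + DIFF_GRID[d1], base + DIFF_GRID[d2]])

print(dequantize_pitch_superframe(quantize_pitch_superframe([52.0, 54.0, 57.0])))
```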

Patent
Jussi Virolainen1, Ari Lakaniemi1
14 Jun 2002
TL;DR: In this article, an error concealment method for multi-channel digital audio is proposed, where the first and second audio channels are correlated with each other in a manner so that a spatial sensation is typically perceived when listened to by a user.
Abstract: An error concealment method for multi-channel digital audio involves receiving an audio signal having audio data forming a first audio channel and a second audio channel included therein, wherein the first and second audio channels are correlated with each other in a manner so that a spatial sensation is typically perceived when listened to by a user. Erroneous first-channel data is detected in the first audio channel, and second-channel data is obtained from the second audio channel. The erroneous first-channel data of the first audio channel is corrected by using the second-channel data. Upon detection of the erroneous first-channel data, a spatially perceivable inter-channel relation between the first and second audio channels is determined, and the determined inter-channel relation is used when correcting the erroneous first-channel data of the first audio channel so as to preserve the spatial sensation perceived by the user.
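
A minimal sketch of the concealment step under the stated assumptions: a corrupted frame in one channel is replaced by the corresponding frame of the other channel, scaled so that the recently observed inter-channel level difference, and hence the spatial sensation, is roughly preserved. Frame length and the level estimator are illustrative.

```python
# Replace a bad frame of the left channel with the right channel's frame,
# scaled to the recent inter-channel level ratio.
import numpy as np

def conceal_frame(left, right, frame_start, frame_len, history=4):
    h0, h1 = max(0, frame_start - history * frame_len), frame_start
    level_l = np.sqrt(np.mean(left[h0:h1] ** 2)) + 1e-12     # level before the error
    level_r = np.sqrt(np.mean(right[h0:h1] ** 2)) + 1e-12
    gain = level_l / level_r                                  # preserves level difference
    sl = slice(frame_start, frame_start + frame_len)
    repaired = left.copy()
    repaired[sl] = gain * right[sl]
    return repaired

rng = np.random.default_rng(0)
right = rng.normal(size=8000)
left = 0.7 * right + rng.normal(0.0, 0.05, 8000)              # correlated stereo pair
repaired = conceal_frame(left, right, frame_start=4000, frame_len=160)
```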

Proceedings ArticleDOI
13 May 2002
TL;DR: This paper presents an integer approximation of this lapped transform, called IntMDCT, which is derived from theMDCT using the lifting scheme, and inherits most of the attractive properties of the MDCT, exhibiting a good spectral representation of the audio signal, critical sampling and overlapping of blocks.
Abstract: The Modified Discrete Cosine Transform (MDCT) is widely used in modern perceptual audio coding schemes. In this paper we present an integer approximation of this lapped transform, called IntMDCT, which is derived from the MDCT using the lifting scheme. This reversible integer transform inherits most of the attractive properties of the MDCT, exhibiting a good spectral representation of the audio signal, critical sampling and overlapping of blocks. This makes the IntMDCT well suited for both lossless audio coding as well as for combined perceptual and lossless audio coding. A scalable system is presented providing a lossless enhancement of perceptual audio coding schemes, such as MPEG-2 AAC.
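
The reversible building block behind this kind of lifting-based integer transform can be shown in isolation: a Givens rotation factored into three lifting steps stays exactly invertible even when each step's output is rounded to an integer. A full IntMDCT applies such rotations inside the MDCT structure; only the single reversible rotation is sketched here.

```python
# Integer-to-integer Givens rotation via three lifting steps with rounding.
import numpy as np

def lift_rotate(x1, x2, alpha):
    c = (np.cos(alpha) - 1.0) / np.sin(alpha)
    s = np.sin(alpha)
    x1 = x1 + int(round(c * x2))
    x2 = x2 + int(round(s * x1))
    x1 = x1 + int(round(c * x2))
    return x1, x2

def lift_rotate_inverse(x1, x2, alpha):
    c = (np.cos(alpha) - 1.0) / np.sin(alpha)
    s = np.sin(alpha)
    x1 = x1 - int(round(c * x2))      # undo the lifting steps in reverse order
    x2 = x2 - int(round(s * x1))
    x1 = x1 - int(round(c * x2))
    return x1, x2

a, b = 12345, -6789
y1, y2 = lift_rotate(a, b, alpha=0.7)
assert lift_rotate_inverse(y1, y2, alpha=0.7) == (a, b)   # exactly reversible
print((a, b), "->", (y1, y2))
```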

Proceedings Article
01 Jan 2002
TL;DR: This work investigates how the selection of the DCT coefficients influences the recognition scores in a hybrid ANN/HMM audio-visual speech recognition system on a continuous word recognition task with a vocabulary of 30 numbers.
Abstract: Encouraged by the good performance of the DCT in audiovisual speech recognition [1], we investigate how the selection of the DCT coefficients influences the recognition scores in a hybrid ANN/HMM audio-visual speech recognition system on a continuous word recognition task with a vocabulary of 30 numbers. Three sets of coefficients, based on the mean energy, the variance and the variance relative to the mean value, were chosen. The performance of these coefficients is evaluated in a video only and an audio-visual recognition scenario with varying Signal to Noise Ratios (SNR). The audio-visual tests are performed with 5 types of additional noise at 12 SNR values each. Furthermore the results of the DCT based recognition are compared to those obtained via chroma-keyed geometric lip features [2]. In order to achieve this comparison, a second audio-visual database without chroma-key has been recorded. This database has similar content but a different speaker.
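
A sketch of the variance-based selection criterion, assuming the mouth region has already been cropped to a fixed-size grayscale image per frame: the 2D DCT is taken and the coefficient positions with the highest variance across training frames are kept as the visual feature vector. Image size and the number of retained coefficients are placeholders.

```python
# 2D DCT per mouth image, then keep the highest-variance coefficient positions.
import numpy as np
from scipy.fft import dctn

def select_dct_features(rois, keep=32):
    """rois: array of shape (num_frames, H, W) with grayscale mouth regions."""
    coeffs = np.array([dctn(r, norm="ortho") for r in rois])
    flat = coeffs.reshape(len(rois), -1)
    order = np.argsort(flat.var(axis=0))[::-1][:keep]   # highest-variance positions
    return order, flat[:, order]                        # indices + per-frame features

rng = np.random.default_rng(0)
frames = rng.random((100, 32, 32))
idx, feats = select_dct_features(frames)
print(feats.shape)      # (100, 32): one 32-dimensional visual feature vector per frame
```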

Journal ArticleDOI
TL;DR: An audio-visual automatic continuous speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions, and a robust and automatic algorithm is described to extract FAPs from visual data, which does not require hand labeling or extensive training procedures.
Abstract: We describe an audio-visual automatic continuous speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system utilizes facial animation parameters (FAPs) supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust and automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. Principal component analysis (PCA) was performed on the FAPs in order to decrease the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large vocabulary (approximately 1000 words) speech recognition experiments. The experiments use clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only speech recognition WERs at various SNRs (0-30 dB) with additive white Gaussian noise, and by 19% relative to the audio-only speech recognition WER under clean audio conditions.
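
The dimensionality-reduction step can be sketched as plain PCA over the per-frame FAP vectors, with the projection weights onto the leading components used as visual features; the FAP dimensionality and number of components below are placeholders, not the paper's configuration.

```python
# PCA of FAP vectors via SVD; the projection weights are the visual features.
import numpy as np

def pca_project(faps, num_components=6):
    """faps: (num_frames, num_faps) matrix of extracted FAP values."""
    mean = faps.mean(axis=0)
    centered = faps - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:num_components]                 # principal directions
    return centered @ basis.T, mean, basis      # projection weights = visual features

rng = np.random.default_rng(0)
fap_tracks = rng.normal(size=(300, 10))         # e.g. 10 outer-lip FAPs per frame
features, mean, basis = pca_project(fap_tracks)
print(features.shape)                           # (300, 6)
```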

Journal Article
TL;DR: In this paper, an overview of various nonlinear processing techniques applied to speech signals is presented, including speech coding, speech synthesis, speech and speaker recognition, voice analysis and enhancement, and analyses and simulation of dysphonic voices.
Abstract: This article presents an overview of various nonlinear processing techniques applied to speech signals. Evidence relating to the existence of nonlinearities in speech is presented, and the main differences between linear and nonlinear analysis are summarized. A brief review is given of the important nonlinear speech processing techniques reported to date, and their applications to speech coding, speech synthesis, speech and speaker recognition, voice analysis and enhancement, and analyses and simulation of dysphonic voices.

Proceedings ArticleDOI
13 May 2002
TL;DR: Results from a subjective test suggest that BCC, combined with existing mono audio coders, offers better quality than conventional stereo and multi-channel perceptual transform audio coders for a wide range of bitrates.
Abstract: We present a novel concept for representing multi-channel audio signals: Binaural Cue Coding (BCC). BCC aims at separating the basic audio content and the information relevant for spatial perception. A multi-channel audio signal is represented as a mono signal and BCC parameters. We present two types of applications of BCC. Firstly, a number of separate sound source signals are reduced to a mono signal and BCC parameters. In this case, the decoder has control over the location of each source in auditory space. In other words, the decoder can render spatial images as if the separate source signals were given. Secondly, a multi-channel audio signal is reduced to a mono signal and BCC parameters. In this case the decoder generates a multi-channel signal with a spatial image similar to the spatial image of the input signal of the encoder. Results from a subjective test suggest that BCC, combined with existing mono audio coders, offers better quality than conventional stereo and multi-channel perceptual transform audio coders for a wide range of bitrates.
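
A minimal sketch of the BCC principle for one cue type, assuming STFT analysis and a uniform band layout (real BCC uses perceptually motivated bands and additional time/phase cues): the encoder keeps a mono downmix plus per-band inter-channel level differences, and the decoder redistributes the mono signal across the channels from those cues.

```python
# Mono downmix + per-band inter-channel level differences (ILDs) as the only
# transmitted spatial cue; the decoder re-applies them as per-band gains.
import numpy as np
from scipy.signal import stft, istft

def bcc_encode(left, right, fs=16000, nband=20):
    f, t, L = stft(left, fs, nperseg=512)
    _, _, R = stft(right, fs, nperseg=512)
    mono = (L + R) / 2
    bands = np.array_split(np.arange(len(f)), nband)          # fixed band layout
    ild = np.array([[10 * np.log10((np.abs(L[b, i]) ** 2).sum() + 1e-12)
                     - 10 * np.log10((np.abs(R[b, i]) ** 2).sum() + 1e-12)
                     for i in range(L.shape[1])] for b in bands])
    return mono, ild, bands

def bcc_decode(mono, ild, bands, fs=16000):
    L, R = np.zeros_like(mono), np.zeros_like(mono)
    for bi, b in enumerate(bands):
        gl = 10 ** (ild[bi] / 20)                # per-frame linear level ratio
        L[b, :] = mono[b, :] * 2 * gl / (1 + gl)
        R[b, :] = mono[b, :] * 2 / (1 + gl)
    return istft(L, fs, nperseg=512)[1], istft(R, fs, nperseg=512)[1]
```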

Journal ArticleDOI
TL;DR: Together, source coding, channel coding and the modified recognition engine are shown to provide good recognition accuracy over a wide range of communication channels with bit rates of 1.2 kbps or less.
Abstract: We present a framework for developing source coding, channel coding and decoding as well as erasure concealment techniques adapted for distributed (wireless or packet-based) speech recognition. It is shown that speech recognition as opposed to speech coding, is more sensitive to channel errors than channel erasures, and appropriate channel coding design criteria are determined. For channel decoding, we introduce a novel technique for combining at the receiver soft decision decoding with error detection. Frame erasure concealment techniques are used at the decoder to deal with unreliable frames. At the recognition stage, we present a technique to modify the recognition engine itself to take into account the time-varying reliability of the decoded feature after channel transmission. The resulting engine, referred to as weighted Viterbi recognition, further improves the recognition accuracy. Together, source coding, channel coding and the modified recognition engine are shown to provide good recognition accuracy over a wide range of communication channels with bit rates of 1.2 kbps or less.
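
The weighted Viterbi idea can be sketched by scaling each frame's observation log-likelihood with a reliability weight in the otherwise standard recursion, so unreliable frames contribute less to the path score; the toy HMM and the weights below are placeholders.

```python
# Viterbi decoding with per-frame reliability weights on the observation
# log-likelihoods.
import numpy as np

def weighted_viterbi(log_trans, log_obs, log_init, weights):
    """log_obs: (T, N) frame log-likelihoods; weights: (T,) reliabilities in [0, 1]."""
    T, N = log_obs.shape
    delta = log_init + weights[0] * log_obs[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans            # (from_state, to_state)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + weights[t] * log_obs[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: a frame with weight 0 is effectively ignored by the recognizer.
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_obs = np.log(np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]))
print(weighted_viterbi(log_trans, log_obs, np.log([0.5, 0.5]), np.array([1.0, 0.0, 1.0])))
```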

Journal ArticleDOI
TL;DR: A theoretical framework is presented showing that it is indeed possible to separate a source when some of its spectral characteristics are provided to the system and how, if a statistical model of the joint probability of visual and spectral audio input is learnt to quantify the audio-visual coherence, separation can be achieved by maximizing this probability.
Abstract: We present a new approach to the source separation problem in the case of multiple speech signals. The method is based on the use of automatic lipreading: the objective is to extract an acoustic speech signal from other acoustic signals by exploiting its coherence with the speaker's lip movements. We consider the case of an additive stationary mixture of decorrelated sources, with no further assumptions on independence or non-Gaussian character. Firstly, we present a theoretical framework showing that it is indeed possible to separate a source when some of its spectral characteristics are provided to the system. Then we address the case of audio-visual sources. We show how, if a statistical model of the joint probability of visual and spectral audio input is learnt to quantify the audio-visual coherence, separation can be achieved by maximizing this probability. Finally, we present a number of separation results on a corpus of vowel-plosive-vowel sequences uttered by a single speaker, embedded in a mixture of other voices. We show that separation can be quite good for mixtures of 2, 3, and 5 sources. These results, while very preliminary, are encouraging, and are discussed in respect to their potential complementarity with traditional pure audio separation or enhancement techniques.

Patent
25 Mar 2002
TL;DR: In this article, the authors proposed a data transmission method and a data receiving method which enable audio data to be multiplexed into video data and be transmitted using a DVI standard cable or the like satisfactorily with a simple configuration.
Abstract: This invention provides a data transmission method and a data receiving method which enable audio data to be multiplexed into video data and be transmitted using a DVI standard cable or the like satisfactorily with a simple configuration. From a data transmitting end, a superimposed video/audio data signal in which audio data are superimposed over a video blanking interval of video data in superimposition timing that is generated using a video blank sync signal and a pixel clock, are transmitted to a data receiving end through the DVI cable, together with the video blank sync signal and the pixel clock. On the data receiving end, a timing signal for extracting the audio data from the superimposed video/audio data signal is generated using the transmitted video blank sync signal and pixel clock, and the superimposed video/audio data signal is separated into video data and audio data using the generated timing signal, as well as the digital audio data are converted into an analog audio signal using an audio clock that is generated by dividing the frequency of the pixel clock.