
Showing papers on "Linear predictive coding" published in 2004


Journal ArticleDOI
TL;DR: A new VAD algorithm for improving speech detection robustness in noisy environments and the performance of speech recognition systems is presented; it formulates the speech/non-speech decision rule by comparing the long-term spectral envelope to the average noise spectrum, yielding a highly discriminating decision rule and minimizing the average number of decision errors.

412 citations
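
A minimal sketch of the decision rule described above, assuming the long-term spectral envelope is the per-bin maximum over the last few frames and that an average noise spectrum is already available; the window length and threshold are illustrative, not values from the paper.

```python
import numpy as np

def ltsd_vad(frames, noise_psd, window=6, threshold_db=6.0):
    """frames: (n_frames, n_bins) magnitude spectra of recent frames.
    noise_psd: average noise power spectrum (n_bins,).
    Returns True (speech) when the long-term spectral divergence
    of the envelope from the noise spectrum exceeds threshold_db."""
    # Long-term spectral envelope: per-bin maximum over the window.
    ltse = frames[-window:].max(axis=0)
    # Divergence of the envelope from the average noise spectrum, in dB.
    ltsd = 10.0 * np.log10(np.mean(ltse ** 2 / noise_psd) + 1e-12)
    return ltsd > threshold_db

# Toy usage: noise-only frames vs. a louder burst.
rng = np.random.default_rng(0)
noise = rng.rayleigh(1.0, (50, 129))
noise_psd = (noise ** 2).mean(axis=0)
print(ltsd_vad(noise, noise_psd))        # typically False (non-speech)
print(ltsd_vad(noise * 4.0, noise_psd))  # True (treated as speech)
```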


PatentDOI
TL;DR: In this article, a plurality of synthesis speech segments are generated by synthesizing training speech segments labeled with phonetic contexts and input speech segments while altering the pitch/duration of the input speech segments in accordance with the pitch/duration of the training speech segments.
Abstract: In a synthesis unit generator, a plurality of synthesis speech segments are generated by synthesizing training speech segments labeled with phonetic contexts and input speech segments while altering the pitch/duration of the input speech segments in accordance with the pitch/duration of the training speech segments. Typical speech segments are selected from the input speech segments on the basis of a distance between the synthesis speech segments and the training speech segments, and are stored in a storage. In addition, a plurality of phonetic context clusters corresponding to the synthesis units are generated on the basis of the distance, and are stored in a storage. A synthesis speech signal is generated by reading out, from the storage, those of the synthesis units, which correspond to the phonetic context clusters including phonetic contexts of input phonemes, and connecting the selected synthesis units in a speech synthesizer.

203 citations


Journal ArticleDOI
TL;DR: A new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis.
Abstract: Speech recognition performance degrades significantly in distant-talking environments, where the speech signals can be severely distorted by additive noise and reverberation. In such environments, the use of microphone arrays has been proposed as a means of improving the quality of captured speech signals. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing and then recognition. Array processing algorithms, designed for signal enhancement, are applied in order to reduce the distortion in the speech waveform prior to feature extraction and recognition. This approach assumes that improving the quality of the speech waveform will necessarily result in improved recognition performance and ignores the manner in which speech recognition systems operate. In this paper a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis. In this approach, called likelihood-maximizing beamforming, information from the speech recognition system itself is used to optimize a filter-and-sum beamformer. Speech recognition experiments performed in a real distant-talking environment confirm the efficacy of the proposed approach.

147 citations


Proceedings ArticleDOI
24 Oct 2004
TL;DR: A rate control scheme for H.264 is presented by introducing the concept of a basic unit and a linear prediction model that is used to solve the chicken-and-egg dilemma existing in the rate control of H.264.
Abstract: This paper presents a rate control scheme for H.264 by introducing the concept of a basic unit and a linear prediction model. The basic unit can be a macroblock (MB), a slice, or a frame, and can be used to trade off overall coding efficiency against bit fluctuation. The linear model is used to solve the chicken-and-egg dilemma existing in the rate control of H.264. Both constant bit rate (CBR) and variable bit rate (VBR) cases are studied. Our scheme has been adopted by H.264.

119 citations
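
The "chicken and egg" dilemma mentioned above is that the quantization parameter must be chosen before a basic unit is coded, yet the unit's complexity (its mean absolute difference, MAD) is only known after coding. The sketch below illustrates the linear prediction workaround under stated assumptions: the model predicts the current MAD from the co-located basic unit of the previous frame, and the two coefficients are refit by ordinary least squares over a sliding window. The class, window length, and initial coefficients are illustrative, not taken from the paper or reference software.

```python
# Hypothetical helper class; not code from the H.264 reference software.
class MadPredictor:
    def __init__(self, a1=1.0, a2=0.0, window=20):
        self.a1, self.a2, self.window = a1, a2, window
        self.history = []  # (previous MAD, actual MAD) pairs

    def predict(self, prev_mad):
        # Linear model: MAD_current ~= a1 * MAD_previous + a2
        return self.a1 * prev_mad + self.a2

    def update(self, prev_mad, actual_mad):
        # Refit a1, a2 by least squares once the true MAD is known.
        self.history = (self.history + [(prev_mad, actual_mad)])[-self.window:]
        xs = [x for x, _ in self.history]
        ys = [y for _, y in self.history]
        n, sx, sy = len(xs), sum(xs), sum(ys)
        sxx = sum(x * x for x in xs)
        sxy = sum(x * y for x, y in zip(xs, ys))
        denom = n * sxx - sx * sx
        if denom != 0:
            self.a1 = (n * sxy - sx * sy) / denom
            self.a2 = (sy - self.a1 * sx) / n

p = MadPredictor()
p.update(4.0, 4.4)
p.update(5.0, 5.6)
print(p.predict(4.5))  # MAD estimate available before coding the unit
```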


Patent
Hao Jiang, Hong-Jiang Zhang
TL;DR: In this paper, a portion of an audio signal is separated into multiple frames from which one or more different features are extracted, in combination with a set of rules, to classify the portion of the audio signal into one of multiple different classifications (for example, speech, non-speech, music, environment sound, silence).
Abstract: A portion of an audio signal is separated into multiple frames from which one or more different features are extracted. These different features are used, in combination with a set of rules, to classify the portion of the audio signal into one of multiple different classifications (for example, speech, non-speech, music, environment sound, silence, etc.). In one embodiment, these different features include one or more of line spectrum pairs (LSPs), a noise frame ratio, periodicity of particular bands, spectrum flux features, and energy distribution in one or more of the bands. The line spectrum pairs are also optionally used to segment the audio signal, identifying audio classification changes as well as speaker changes when the audio signal is speech.

102 citations


Journal ArticleDOI
TL;DR: A new information theoretic algorithm is proposed for signal enumeration in array processing, based on the predictive description length (PDL), defined as the length of a predictive code for the set of observations; the method can detect both coherent and noncoherent signals.
Abstract: In this paper, a new information theoretic algorithm is proposed for signal enumeration in array processing. The approach is based on predictive description length (PDL) that is defined as the length of a predictive code for the set of observations. We assume that several models, with each model representing a certain number of sources, will compete. The PDL criterion is computed for the candidate models and is minimized over all models to select the best model and to determine the number of signals. In the proposed method, the correlation matrix is decomposed into two orthogonal components in the signal and noise subspaces. The maximum likelihood (ML) estimates of the angles-of-arrival are used to find the projection of the sample correlation matrix onto the signal and noise subspaces. The summation of the ML estimates of these matrices is the ML estimate of the correlation matrix. This method can detect both coherent and noncoherent signals. The proposed method can be used online and can be applied to time-varying systems and target tracking.

102 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: The cosine transform coefficients of the approximated sub-band envelopes, computed recursively from the all-pole polynomials, are used as inputs to a TRAP-based speech recognition system and are shown to improve recognition accuracy.
Abstract: Autoregressive modeling is applied for approximating the temporal evolution of spectral density in critical-band-sized sub-bands of a segment of speech signal. The generalized autocorrelation linear predictive technique allows for a compromise between fitting the peaks and the troughs of the Hilbert envelope of the signal in the sub-band. The cosine transform coefficients of the approximated sub-band envelopes, computed recursively from the all-pole polynomials, are used as inputs to a TRAP-based speech recognition system and are shown to improve recognition accuracy.

61 citations
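
A rough sketch of the mechanism, assuming plain autocorrelation linear prediction rather than the paper's generalized autocorrelation compromise: an all-pole fit to a sub-band slice of the segment's cosine transform approximates the temporal (Hilbert) envelope of that sub-band, since the magnitude response of the fitted model traces the envelope over the duration of the segment. The bin range and model order are illustrative.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.signal import freqz

def levinson(r, order):
    """Levinson-Durbin: all-pole coefficients from autocorrelation r."""
    a = np.zeros(order + 1)
    a[0], e = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[1:i][::-1] @ a[1:i] + r[i]) / e
        a[1:i + 1] += k * a[i - 1::-1][:i]
        e *= 1.0 - k * k
    return a, e

def subband_temporal_envelope(x, lo, hi, order=20, npts=512):
    c = dct(x, type=2, norm='ortho')[lo:hi]      # sub-band, DCT domain
    r = np.correlate(c, c, 'full')[len(c) - 1:]  # autocorrelation
    a, e = levinson(r, order)
    # Squared magnitude response of 1/A(z) approximates the squared
    # Hilbert envelope of the sub-band signal across the segment.
    _, h = freqz([np.sqrt(e)], a, worN=npts)
    return np.abs(h) ** 2

rng = np.random.default_rng(1)
x = rng.standard_normal(4000) * np.hanning(4000)  # amplitude-modulated noise
print(subband_temporal_envelope(x, 100, 500)[:4])
```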


Proceedings ArticleDOI
17 May 2004
TL;DR: This paper compares entropy and Euclidean distance measures for VFR in ASR experiments using the Aurora2 and TI46 databases and finds that the entropy-based VFR outperforms both the earlier VFR approach and the fixed-rate system.
Abstract: Most speech processing algorithms analyze speech signals frame by frame with a fixed frame rate. Fixed-rate analysis is inconsistent with human speech perception and effectively assigns the same importance or 'weight' to all equi-duration frames. In Zhu et al. (2000), we proposed a variable frame rate (VFR) analysis technique that is based on a Euclidean distance measure. In this paper, we propose another approach for VFR based on the entropy of the signal. We compare entropy and Euclidean distance measures for VFR in ASR experiments using the Aurora2 and TI46 databases. Better performance is observed for the entropy-based VFR over our earlier VFR approach and over the fixed-rate system.

59 citations
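
One plausible reading of the entropy-driven analysis, sketched under stated assumptions (the spectral-entropy measure, the accumulation rule, and the threshold are all illustrative choices, not taken from the paper): a frame is emitted only when enough spectral entropy has accumulated since the last emitted frame, so the frame rate adapts to the information content of the signal.

```python
import numpy as np

def spectral_entropy(frame):
    # Entropy of the normalized power spectrum of one analysis frame.
    p = np.abs(np.fft.rfft(frame)) ** 2
    p = p / (p.sum() + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12))

def variable_frame_rate(x, frame_len=200, hop=80, budget=30.0):
    kept, acc = [], 0.0
    for start in range(0, len(x) - frame_len, hop):
        acc += spectral_entropy(x[start:start + frame_len])
        if acc >= budget:        # enough "information" accumulated
            kept.append(start)   # emit a frame at this position
            acc = 0.0
    return kept

rng = np.random.default_rng(0)
sig = np.concatenate([0.2 * rng.standard_normal(4000),
                      np.sin(0.3 * np.arange(4000))])
print(variable_frame_rate(sig)[:5])  # positions of the emitted frames
```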


Journal ArticleDOI
TL;DR: In this study, a system is developed that extracts linguistically relevant temporal information that can be used in the front end of an automatic speech recognition system and results in the same recognition accuracy achieved when the standard 39 cepstral-based parameters are used.
Abstract: Studies by Shannon et al. [Science, 270, 303-304 (1995)], Van Tasell et al. [J. Acoust. Soc. Am. 82, 1152-1161 (1987)], and others show that human listeners can understand important aspects of the speech signal when spectral shape has been significantly degraded. These experiments suggest that temporal information is particularly important in human speech perception when the speech signal is heavily degraded. In this study, a system is developed that extracts linguistically relevant temporal information that can be used in the front end of an automatic speech recognition system. The parameters targeted include energy onset and offsets (computed using an adaptive algorithm) and measures of periodic and aperiodic content; together these are used to find abrupt acoustic events which signify landmarks. Overall detection rates for strongly robust events, robust events, and weak events in a portion of the TIMIT test database are 98.9%, 94.7%, and 52.1%, respectively. Error rates increase by less than 5% when the speech signals are spectrally impoverished. Use of the four temporal parameters as the front end of a hidden Markov model (HMM)-based system for the automatic recognition of the manner classes "sonorant," "fricative," "stop," and "silence" results in the same recognition accuracy achieved when the standard 39 cepstral-based parameters are used, 70.1%. The combination of the temporal parameters and cepstral parameters results in an accuracy of 74.8%.

57 citations
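
A simplified sketch of the onset/offset component of such a front end; the paper's adaptive algorithm is replaced here by a fixed dB-per-frame rise threshold and band filtering is omitted, so the function and its parameters are illustrative only.

```python
import numpy as np

def energy_landmarks(x, fs=16000, frame_s=0.01, rise_db=9.0):
    """Mark abrupt rises/falls in smoothed frame energy (in seconds)."""
    n = int(fs * frame_s)
    frames = x[:len(x) // n * n].reshape(-1, n)
    e_db = 10 * np.log10((frames ** 2).sum(axis=1) + 1e-10)
    e_db = np.convolve(e_db, np.ones(3) / 3, mode='same')  # light smoothing
    d = np.diff(e_db)
    # Adjacent frames may both trigger; clustering is omitted here.
    onsets = np.where(d > rise_db)[0] * n / fs    # abrupt energy rise
    offsets = np.where(d < -rise_db)[0] * n / fs  # abrupt energy fall
    return onsets, offsets

fs = 16000
t = np.arange(fs) / fs
burst = np.where((t > 0.3) & (t < 0.6), np.sin(2 * np.pi * 150 * t), 0.0)
print(energy_landmarks(burst, fs))  # onset near 0.3 s, offset near 0.6 s
```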


Proceedings ArticleDOI
17 May 2004
TL;DR: An iterative tracking algorithm is described and evaluated that embeds both the prediction-residual training and the piecewise linearization design in an adaptive Kalman filtering framework and provides meaningful results even during consonantal closures when the supra-laryngeal source may cause no spectral prominences in speech acoustics.
Abstract: A novel approach is developed for efficient and accurate tracking of vocal tract resonances, which are natural frequencies of the resonator from larynx to lips, in fluent speech. The tracking algorithm is based on a version of the structured speech model consisting of continuous-valued hidden dynamics and a piecewise-linearized prediction function from resonance frequencies and bandwidths to LPC cepstra. We present details of the piecewise linearization design process and an adaptive training technique for the parameters that characterize the prediction residuals. An iterative tracking algorithm is described and evaluated that embeds both the prediction-residual training and the piecewise linearization design in an adaptive Kalman filtering framework. Experiments on tracking vocal tract resonances in Switchboard speech data demonstrate high accuracy in the results, as well as the effectiveness of residual training embedded in the algorithm. Our approach differs from traditional formant trackers in that it provides meaningful results even during consonantal closures when the supra-laryngeal source may cause no spectral prominences in speech acoustics.

57 citations
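
The prediction function at the core of the tracker maps resonance frequencies and bandwidths to LPC cepstra; for an all-pole model with one pole pair per resonance this map has the well-known closed form c_n = Σ_i (2/n) exp(-π n b_i / f_s) cos(2π n f_i / f_s), which the tracker linearizes piecewise inside the Kalman filter. Only the forward map is sketched below; the sampling rate and formant values are illustrative.

```python
import numpy as np

def resonances_to_cepstrum(freqs_hz, bws_hz, fs=8000, n_ceps=15):
    """LPC cepstra of an all-pole model with the given resonances."""
    n = np.arange(1, n_ceps + 1)[:, None]  # cepstral index
    f = np.asarray(freqs_hz, dtype=float)[None, :]
    b = np.asarray(bws_hz, dtype=float)[None, :]
    terms = (2.0 / n) * np.exp(-np.pi * n * b / fs) \
        * np.cos(2 * np.pi * n * f / fs)
    return terms.sum(axis=1)

# Three nominal formants of a neutral vowel, 80 Hz bandwidth each.
print(resonances_to_cepstrum([500, 1500, 2500], [80, 80, 80])[:5])
```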


Proceedings ArticleDOI
17 May 2004
TL;DR: Only non-temporal, frame-level features are used, so that the proposed system scales from isolated notes to solo instrumental phrases without the need for temporal segmentation of solo music.
Abstract: Speech and audio processing techniques are used along with statistical pattern recognition principles to solve the problem of music instrument recognition. Only non-temporal, frame-level features are used, so that the proposed system scales from isolated notes to solo instrumental phrases without the need for temporal segmentation of solo music. Based on their effectiveness in speech, line spectral frequencies (LSF) are proposed as features for music instrument recognition. The proposed system has also been evaluated using MFCC and LPCC features. Gaussian mixture models and K-nearest neighbour classifiers are used for classification. The experimental dataset included the UIowa MIS and the C Music Corporation RWC databases. Our best result at the instrument family level is about 95%, and about 90% at the instrument level, when classifying 14 instruments.
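
For readers unfamiliar with the proposed features, the sketch below shows one standard way to obtain LSFs from a frame (an LPC fit via Levinson-Durbin, then the angles of the unit-circle roots of the symmetric and antisymmetric polynomials); the GMM/k-NN classification stage is omitted, and the test frame and model order are illustrative.

```python
import numpy as np

def lpc(frame, order):
    """LPC polynomial A(z) via autocorrelation and Levinson-Durbin."""
    r = np.correlate(frame, frame, 'full')[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0], e = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[1:i][::-1] @ a[1:i] + r[i]) / e
        a[1:i + 1] += k * a[i - 1::-1][:i]
        e *= 1.0 - k * k
    return a

def lsf(a):
    """Line spectral frequencies: root angles of P(z) and Q(z)."""
    p = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = np.angle(np.concatenate([np.roots(p), np.roots(q)]))
    return np.sort(angles[(angles > 0) & (angles < np.pi)])

frame = np.sin(2 * np.pi * 0.07 * np.arange(400)) * np.hamming(400)
print(lsf(lpc(frame, 10)))  # ten increasing frequencies in (0, pi)
```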


Book
15 Oct 2004
TL;DR: The theory and methods for quality enhancement of clean speech signals and distorted speech signals such as those that have undergone a band limitation, for instance, in a telephone network are described.
Abstract: Bandwidth Extension of Speech Signals describes the theory and methods for quality enhancement of clean speech signals and distorted speech signals such as those that have undergone a band limitation, for instance, in a telephone network. Problems and the respective solutions are discussed for the different approaches. The different approaches are evaluated and a real-time implementation of the most promising approach is presented. The book includes topics related to speech coding, pattern/speech recognition, speech enhancement, statistics and digital signal processing in general.

Patent
23 Sep 2004
TL;DR: Pitch detection of speech signals finds numerous applications in karaoke, voice recognition, and scoring; while most existing techniques rely on time-domain methods, the invention utilizes frequency-domain methods.
Abstract: Pitch detection of speech signals finds numerous applications in karaoke, voice recognition, and scoring. While most of the existing techniques rely on time domain methods, the invention utilizes frequency domain methods. There is provided a method and system for determining the pitch of speech from a speech signal. The method includes the steps of: producing or obtaining the speech signal; classifying the speech signal into voiced, unvoiced or silence sections using speech signal energy levels; applying a Fourier transform to the speech signal and obtaining speech signal parameters; determining peaks of the Fourier-transformed speech signal; tracking the speech signal parameters of the determined peaks to select partials; and determining the pitch from the selected partials using a two-way mismatch error calculation.
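
A simplified sketch of the two-way mismatch step, in the spirit of Maher and Beauchamp's formulation: each pitch candidate is scored by how far its predicted harmonics fall from the measured peaks and how far the measured peaks fall from the predicted harmonics. The patent's peak tracking, partial selection, and error weighting are omitted; the harmonic count and candidate grid are assumptions.

```python
import numpy as np

def twm_pitch(peak_freqs, candidates, n_harm=8):
    """Pick the candidate F0 minimizing the combined mismatch error."""
    peaks = np.asarray(peak_freqs, dtype=float)
    best, best_err = None, np.inf
    for f0 in candidates:
        pred = f0 * np.arange(1, n_harm + 1)
        # Predicted-to-measured: each harmonic to its nearest peak.
        e_pm = np.mean([np.min(np.abs(peaks - p)) / p for p in pred])
        # Measured-to-predicted: each peak to its nearest harmonic.
        e_mp = np.mean([np.min(np.abs(pred - m)) / m for m in peaks])
        if e_pm + e_mp < best_err:
            best, best_err = f0, e_pm + e_mp
    return best

peaks = [220.0, 441.0, 659.0, 882.0]   # roughly harmonic partials
cands = np.arange(80.0, 400.0, 1.0)
print(twm_pitch(peaks, cands))         # ~220 Hz
```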

Patent
Marc Boillot, John G. Harris
31 Dec 2004
TL;DR: In this paper, a speech filter (108) is proposed to enhance the loudness of a speech signal by expanding the formant regions of the speech signal beyond a natural bandwidth of the formants.
Abstract: A speech filter (108) enhances the loudness of a speech signal by expanding the formant regions of the speech signal beyond a natural bandwidth of the formant regions. The energy level of the speech signal is maintained so that the filtered speech signal contains the same energy as the pre-filtered signal. By expanding the formant regions of the speech signal on a critical band scale corresponding to human hearing, the listener of the speech signal perceives it to be louder even though the signal contains the same energy.

Patent
26 Nov 2004
TL;DR: A speech synthesis system stores a group of speech units in a memory and selects speech units from the group based on prosodic information of the target speech; the selected units correspond to segments of the phoneme string of the target speech and minimize the distortion between the resulting synthetic speech and the target speech.
Abstract: A speech synthesis system stores a group of speech units in a memory and selects a plurality of speech units from the group based on prosodic information of the target speech. The selected speech units correspond to segments obtained by segmenting a phoneme string of the target speech, and are chosen to minimize the distortion of the synthetic speech generated from them relative to the target speech. For each segment, a new speech unit is generated by fusing the selected speech units, yielding a plurality of new speech units corresponding to the segments, and synthetic speech is generated by concatenating the new speech units.

Journal ArticleDOI
TL;DR: A method of detecting speech events in a multiple-sound-source condition using audio and video information is proposed and a maximum likelihood adaptive beamformer is employed as a preprocessor of the speech recognizer to separate the speech signal from environmental noise.
Abstract: A method of detecting speech events in a multiple-sound-source condition using audio and video information is proposed. For detecting speech events, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. From the inference results of the Bayesian network, information on the time and location of speech events can be obtained. The information on the detected speech events is then utilized in the robust speech interface. A maximum likelihood adaptive beamformer is employed as a preprocessor of the speech recognizer to separate the speech signal from environmental noise. The coefficients of the beamformer are kept updated based on the information of the speech events. The information on the speech events is also used by the speech recognizer for extracting the speech segment.

Proceedings ArticleDOI
Yuriy Reznik
17 May 2004
TL;DR: Two alternative schemes for encoding of the prediction residual adopted in the MPEG-4 ALS (audio lossless coding) standard for lossless audio coding are described and analytical and experimental analysis of their performance is provided.
Abstract: We describe two alternative schemes for encoding of the prediction residual adopted in the MPEG-4 ALS (audio lossless coding) standard for lossless audio coding. We explain choices of algorithms used in their design and provide both analytical and experimental analysis of their performance.
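
The simpler of the two schemes is Rice (Golomb power-of-two) coding of the residuals. The round trip below is a hedged sketch of that idea with a zigzag sign mapping; the standard's block partitioning and the more elaborate BGMC alternative are not shown.

```python
def zigzag(v):
    """Map signed residuals 0,-1,1,-2,2,... to 0,1,2,3,4,..."""
    return (v << 1) if v >= 0 else ((-v << 1) - 1)

def rice_encode(values, k):
    bits = []
    for v in values:
        u = zigzag(v)
        bits += [1] * (u >> k) + [0]  # unary quotient, 0-terminated
        bits += [(u >> i) & 1 for i in range(k - 1, -1, -1)]  # remainder
    return bits

def rice_decode(bits, count, k):
    out, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos]:
            q, pos = q + 1, pos + 1
        pos += 1  # skip the terminating 0
        r = 0
        for _ in range(k):
            r, pos = (r << 1) | bits[pos], pos + 1
        u = (q << k) | r
        out.append(u >> 1 if u % 2 == 0 else -((u + 1) >> 1))
    return out

res = [0, -1, 3, -4, 2, 0, 1]
enc = rice_encode(res, k=2)
assert rice_decode(enc, len(res), k=2) == res
print(len(enc), "bits for", len(res), "residuals")
```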

Proceedings ArticleDOI
27 Sep 2004
TL;DR: This paper describes a new technique, called the empirical mode decomposition (EMD), for adaptively representing nonstationary signals as sums of zero-mean AM-FM components that allows the analysis of frequency composition of one-dimensional signals.
Abstract: This paper describes a new technique, called the empirical mode decomposition (EMD), recently pioneered by N. E. Huang et al. for adaptively representing nonstationary signals as sums of zero-mean AM-FM components [N. E. Huang, et al., 1998]. The components, called intrinsic mode functions (IMFs), allow the analysis of the frequency composition of one-dimensional signals. Applied to a speech signal, the EMD allows us to study its different intrinsic oscillatory modes. In addition, LPC analysis of each mode provides an estimate of the formants. The method is first applied to a sum of pure tones, where the different modes recover all of the frequencies present in the signal.
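
A compact sifting sketch under simplifying assumptions (cubic-spline envelopes, a fixed number of sifting iterations, no boundary treatment); production implementations add stopping criteria and end-effect handling. On a two-tone test signal the first IMF should capture the faster oscillation and the next one the slower.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift(x, n_iter=10):
    """One IMF: repeatedly subtract the mean of the extrema envelopes."""
    h, t = x.copy(), np.arange(len(x))
    for _ in range(n_iter):
        mx = argrelextrema(h, np.greater)[0]
        mn = argrelextrema(h, np.less)[0]
        if len(mx) < 4 or len(mn) < 4:
            break
        upper = CubicSpline(mx, h[mx])(t)
        lower = CubicSpline(mn, h[mn])(t)
        h = h - (upper + lower) / 2.0
    return h

def emd(x, n_imfs=3):
    imfs, resid = [], x.astype(float)
    for _ in range(n_imfs):
        imf = sift(resid)
        imfs.append(imf)
        resid = resid - imf
    return imfs, resid

t = np.linspace(0, 1, 2000)
x = np.sin(2 * np.pi * 40 * t) + np.sin(2 * np.pi * 5 * t)
imfs, r = emd(x, n_imfs=2)
print([int(np.abs(np.fft.rfft(m)).argmax()) for m in imfs])  # ~[40, 5]
```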

Journal Article
TL;DR: An automated system from the training stage to the recognition stage without the need of manual cropping for speech signals is developed to evaluate the performance of the automatic speech recognition (ASR) system.
Abstract: This paper investigates the use of feed-forward multi-layer perceptrons trained by back-propagation in speech recognition. The paper also proposes an automatic technique for both training and recognition. The use of neural networks for speaker-independent isolated word recognition on small vocabularies is studied, and an automated system from the training stage to the recognition stage, without the need for manual cropping of speech signals, is developed to evaluate the performance of the automatic speech recognition (ASR) system. Linear predictive coding (LPC) is applied at an early stage to represent the speech signal in frames. Features from the selected frames are used to train multi-layer perceptrons (MLP) using back-propagation. The same routine is applied to the speech signal during the recognition stage, and unknown test patterns are classified to the nearest patterns. In short, the selected frames represent the local features of the speech signal, and all of them contribute to the global similarity for the whole speech signal. The analysis, design and development of the automated system are done in MATLAB, in which a speaker-independent isolated-digit recogniser is developed.

Proceedings ArticleDOI
01 Jan 2004
TL;DR: Experimental results show that the watermark is imperceptible and the algorithm is robust to many attacks, such as low pass filtering, resampling, MP3 compression and so on.
Abstract: A digital audio watermarking algorithm based on the discrete wavelet transform is presented. After pre-processing and spread-spectrum (SS) modulation, a visually significant binary image is embedded in the low-to-middle frequency coefficients of the audio in the wavelet domain. A watermark detection scheme using linear predictive coding is presented that does not require the original signal during watermark extraction. The BER is improved by 10%-15% in this algorithm compared with the algorithm in Wang and Chai (2003). Experimental results show that the watermark is imperceptible and that the algorithm is robust to many attacks, such as low-pass filtering, resampling, MP3 compression and so on.

Journal ArticleDOI
TL;DR: A speech enhancement algorithm which leads to significant quality and intelligibility improvements when used as a preprocessor to a low bit rate speech coder and special emphasis is placed on enhancing the performance of the preprocessor in nonstationary noise environments.
Abstract: We describe a speech enhancement algorithm which leads to significant quality and intelligibility improvements when used as a preprocessor to a low bit rate speech coder. This algorithm was developed in conjunction with the mixed excitation linear prediction (MELP) coder which, by itself, is highly susceptible to environmental noise. The paper presents novel as well as known speech and noise estimation techniques and combines them into a highly effective speech enhancement system. The algorithm is based on short-time spectral amplitude estimation, soft-decision gain modification, tracking of the a priori probability of speech absence, and minimum statistics noise power estimation. Special emphasis is placed on enhancing the performance of the preprocessor in nonstationary noise environments.

Patent
16 Feb 2004
TL;DR: In this paper, the clean speech value and the noise value are estimated from the noisy speech signal and then used to define a gain on a filter, with the numerator being guaranteed to be positive.
Abstract: A method and apparatus identify a clean speech signal from a noisy speech signal. To do this, a clean speech value and a noise value are estimated from the noisy speech signal. The clean speech value and the noise value are then used to define a gain on a filter. The noisy speech signal is applied to the filter to produce the clean speech signal. Under some embodiments, the noise value and the clean speech value are used in both the numerator and the denominator of the filter gain, with the numerator being guaranteed to be positive.
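
A minimal sketch of the gain construction described above, assuming oracle clean and noise power spectra stand in for the estimates (the patent's estimators and smoothing are out of scope here): both values appear in numerator and denominator, and the numerator is clamped so the gain stays positive.

```python
import numpy as np

def spectral_gain(clean_psd, noise_psd, floor=1e-4):
    # Numerator clamped positive, as in the claim; Wiener-like ratio.
    num = np.maximum(clean_psd, floor * noise_psd)
    return num / (clean_psd + noise_psd)

def apply_filter(noisy_frame, gain_per_bin):
    spec = np.fft.rfft(noisy_frame)
    return np.fft.irfft(gain_per_bin * spec, n=len(noisy_frame))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 0.05 * np.arange(256))
noisy = clean + 0.5 * rng.standard_normal(256)
g = spectral_gain(np.abs(np.fft.rfft(clean)) ** 2,
                  np.full(129, 64.0))  # flat PSD of the 0.5-sigma noise
# Residual error, typically well below the 0.25 noise variance.
print(np.mean((apply_filter(noisy, g) - clean) ** 2))
```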

Patent
04 Jun 2004
TL;DR: A speech correction apparatus includes a speaker for generating guidance speech, a microphone set at a hearing position, an acoustic-characteristic setting unit for separating ambient noise from the guidance speech at the hearing position, an operating unit, and a speech correcting filter for correcting the sound pressure level of the guidance speech generated by the speaker based on the average power of the guidance speech and the average power of the ambient noise.
Abstract: A speech correction apparatus includes a speaker for generating guidance speech; a microphone set at a hearing position; an acoustic-characteristic setting unit for separating ambient noise from the guidance speech at the hearing position; an operating unit; a speech correcting filter for correcting the sound pressure level of the guidance speech generated by the speaker based on the average power of the guidance speech and the average power of the ambient noise which are separated; a loudness-compensating-gain calculating unit; and a speech-head correcting unit for correcting the average power of the guidance speech corresponding to the speech head at the border between a silent state and a speech state of the guidance speech.

Proceedings Article
01 Sep 2004
TL;DR: The NR is part of the VMR-WB speech codec, recently selected as a new 3GPP2 standard for wideband speech applications in the cdma2000 3G wireless system.
Abstract: We present a new low-complexity noise reduction (NR) method based on spectral subtraction and overlap-add analysis/synthesis. A voicing-dependent cut-off frequency is introduced, dividing the speech spectrum into two parts. At the lower end, the NR gain function varies with frequency bins to minimize distortion at pitch harmonic frequencies while maximizing the suppression between them. At the higher end, the gain function is estimated per critical band, reducing energy variations. The gain function is further smoothed over time with a smoothing factor adaptive to the actual NR gain, to prevent distortion on voiced speech onsets. The NR is part of the VMR-WB speech codec, recently selected as a new 3GPP2 standard for wideband speech applications in the cdma2000 3G wireless system.

Proceedings ArticleDOI
17 May 2004
TL;DR: The paper describes the basic elements of the codec, points out envisaged applications, and gives an outline of the standardization process.
Abstract: Lossless coding is to become the latest extension of the MPEG-4 audio standard. In response to a call for proposals, many companies have submitted lossless audio codecs for evaluation. The codec of the Technical University of Berlin was chosen as reference model for MPEG-4 audio lossless coding (ALS), attaining working draft status in July 2003. The encoder is based on linear prediction, which enables high compression even with moderate complexity, while the corresponding decoder is straightforward. The paper describes the basic elements of the codec, points out envisaged applications, and gives an outline of the standardization process.

Proceedings ArticleDOI
17 May 2004
TL;DR: Experimental results show clear improvements over different VAD methods in speech/pause discrimination and speech recognition performance, and the proposed VAD reduces misclassification errors in highly noisy environments by using a noise reduction stage before the long-term spectral tracking.
Abstract: The paper focuses on an improved voice activity detection algorithm employing long-term signal processing and maximum spectral component tracking. The benefits of this approach were analyzed in a previous work (Ramirez, J. et al., Proc. EUROSPEECH 2003, p.3041-4, 2003), with clear improvements in speech/non-speech discriminability and speech recognition performance in noisy environments. Two refinements are now considered. The first, which improves the performance of the VAD in low-noise conditions, uses an adaptive-length frame window to track the long-term spectral components. The second reduces misclassification errors in highly noisy environments by adding a noise reduction stage before the long-term spectral tracking. Experimental results show clear improvements over different VAD methods in speech/pause discrimination and speech recognition performance. In particular, improvements in recognition rate were reported when the proposed VAD replaced the VADs of the ETSI advanced front-end (AFE) for distributed speech recognition (DSR).

Journal ArticleDOI
TL;DR: The subject of this work is the robust estimation of speech presence probability of every spectral component of a speech signal impinging on a linear microphone array based on the generalized likelihood ratio test applied to the multichannel framework and far-field, wideband sources.
Abstract: The subject of this work is the robust estimation of the speech presence probability of every spectral component of a speech signal impinging on a linear microphone array. The approach is based on the generalized likelihood ratio test (GLRT) applied to the multichannel framework and far-field, wideband sources. It is shown that under certain distributional assumptions the GLRT provides a framework for speech presence detection by exploiting both the spatial localization and the spectral content of the speech signal. The efficiency of the approach, and its superiority over a state-of-the-art one-channel speech presence estimation technique, is illustrated for additive white Gaussian noise in the acoustical field at low signal-to-noise ratio (SNR).

Proceedings ArticleDOI
17 May 2004
TL;DR: Two extension tools for enhancing the compression performance of prediction-based lossless audio coding are proposed: progressive-order prediction of the starting samples at random access points, where information about previous samples is not available, and interchannel joint coding.
Abstract: Two extension tools for enhancing the compression performance of prediction-based lossless audio coding are proposed. One is progressive-order prediction of the starting samples at the random access points, where the information of previous samples is not available. The first sample is coded as is, the second is predicted by first-order prediction, the third is predicted by second-order prediction, and so on. This can be efficiently carried out with PARCOR (PARtial autoCORrelation) coefficients. The second tool is interchannel joint coding. Both predictive coefficients and prediction error signals are efficiently coded by interchannel differential or three-tap adaptive prediction. These new prediction tools lead to a steady reduction in bit rate when random access is activated and the interchannel correlation is strong.
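
A hedged sketch of the progressive-order idea at a random access point: sample 0 is sent verbatim, sample 1 is predicted at order 1, sample 2 at order 2, and so on up to the full order. The per-order coefficient sets fall out of the Levinson recursion over the PARCOR coefficients, as the abstract notes; quantization and entropy coding are omitted and the PARCOR values below are illustrative.

```python
import numpy as np

def predictors_from_parcor(ks):
    """LPC coefficient arrays a_m for every order m = 1..len(ks)."""
    preds, a = [], np.array([1.0])
    for k in ks:
        a = np.concatenate([a, [0.0]])
        a[1:] += k * a[:-1][::-1]  # Levinson order-update step
        preds.append(a.copy())
    return preds

def progressive_residual(x, ks):
    preds = predictors_from_parcor(ks)
    res = [x[0]]                   # first sample coded as is
    for n in range(1, len(x)):
        m = min(n, len(ks))        # highest usable order at sample n
        a = preds[m - 1]
        pred = -np.dot(a[1:m + 1], x[n - 1::-1][:m])
        res.append(x[n] - pred)
    return np.array(res)

x = np.cumsum(np.random.default_rng(2).standard_normal(32))  # toy signal
r = progressive_residual(x, ks=[-0.9, 0.2])
print(np.var(x), np.var(r))  # residual variance should be much smaller
```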

Proceedings ArticleDOI
21 Nov 2004
TL;DR: This paper investigates the use of neural networks in recognizing 6 Malay vowels of Malay children in a speaker-independent manner using multi-layer perceptron with one hidden layer to recognize these vowels.
Abstract: Most of the speech recognitions are based on adult speech sounds. Less research is done in the recognition of children speech sounds. The speech of children is more dynamic and inconsistent if compared to adult's speech. This paper investigates the use of neural networks in recognizing 6 Malay vowels of Malay children in a speaker-independent manner. Multi-layer perceptron with one hidden layer was used to recognize these vowels. The multi-layer perceptron was trained and tested with speech samples of Malay children with their ages between seven and ten years old. A single frame of cepstral coefficients was extracted around the vowel onset point using linear predictive coding. The vowel length was examined from 5 ms to 70 ms. Experiments were conducted to determine the optimal vowel length as well as the number of cepstral coefficients.