
Showing papers on "Linear predictive coding published in 2007"


Journal ArticleDOI
TL;DR: The group delay function fails to capture the short-time spectral structure of speech owing to zeros that are close to the unit circle in the z-plane and to pitch periodicity effects; it is modified to overcome these effects, and the cepstral features extracted from it are called the modified group delay feature (MODGDF).
Abstract: Spectral representation of speech is complete when both the Fourier transform magnitude and phase spectra are specified. In conventional speech recognition systems, features are generally derived from the short-time magnitude spectrum. Although the importance of Fourier transform phase in speech perception has been realized, few attempts have been made to extract features from it. This is primarily because the resonances of the speech signal, which manifest as transitions in the phase spectrum, are completely masked by the wrapping of the phase spectrum. Hence, an alternative to processing the Fourier transform phase, for extracting speech features, is to process the group delay function, which can be directly computed from the speech signal. The group delay function has been used in earlier efforts to extract pitch and formant information from the speech signal. In all these efforts, no attempt was made to extract features from the speech signal and use them for speech recognition applications. This is primarily because the group delay function fails to capture the short-time spectral structure of speech owing to zeros that are close to the unit circle in the z-plane and also due to pitch periodicity effects. In this paper, the group delay function is modified to overcome these effects. Cepstral features are extracted from the modified group delay function and are called the modified group delay feature (MODGDF). The MODGDF is used for three speech recognition tasks, namely speaker, language, and continuous-speech recognition. Based on the results of feature and performance evaluation, the significance of the MODGDF as a new feature for speech recognition is discussed.
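As an illustration of the underlying computation, here is a minimal numpy sketch of a modified group delay function: the group delay is obtained directly from the signal via the standard FFT identity, and the denominator is replaced by a cepstrally smoothed spectrum to suppress the zeros near the unit circle. The lifter length and the exponents alpha and gamma are illustrative placeholders, not the paper's tuned settings.

```python
import numpy as np

def modified_group_delay(frame, nfft=512, lifter=8, alpha=0.4, gamma=0.9):
    """Sketch of a modified group delay function:
    tau(w) = (X_R*Y_R + X_I*Y_I) / |S(w)|^(2*gamma), with Y = FFT{n*x[n]}
    and S a cepstrally smoothed magnitude spectrum. alpha, gamma, and
    lifter are illustrative values, not the paper's settings."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, nfft)
    Y = np.fft.rfft(n * frame, nfft)

    # Cepstrally smoothed spectrum S(w): low-quefrency liftering of log|X|
    # tames the zeros near |z| = 1 and the pitch harmonics.
    log_mag = np.log(np.abs(X) + 1e-12)
    ceps = np.fft.irfft(log_mag)
    ceps[lifter:len(ceps) - lifter] = 0.0
    S = np.exp(np.fft.rfft(ceps).real)

    tau = (X.real * Y.real + X.imag * Y.imag) / S ** (2 * gamma)
    # Amplitude compression; a DCT of tau would yield MODGDF-style cepstra.
    return np.sign(tau) * np.abs(tau) ** alpha
```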

181 citations


Journal ArticleDOI
TL;DR: Experiments indicate that the scheme proposed in this paper performs significantly better than competing methods, in particular one that leads to the minimum mean squared error estimate of the clean speech signal.
Abstract: In this paper, we propose a Bayesian minimum mean squared error approach for the joint estimation of the short-term predictor parameters of speech and noise, from the noisy observation. We use trained codebooks of speech and noise linear predictive coefficients to model the a priori information required by the Bayesian scheme. In contrast to current Bayesian estimation approaches that consider the excitation variances as part of the a priori information, in the proposed method they are computed online for each short-time segment, based on the observation at hand. Consequently, the method performs well in nonstationary noise conditions. The resulting estimates of the speech and noise spectra can be used in a Wiener filter or any state-of-the-art speech enhancement system. We develop both memoryless (using information from the current frame alone) and memory-based (using information from the current and previous frames) estimators. Estimation of functions of the short-term predictor parameters is also addressed, in particular one that leads to the minimum mean squared error estimate of the clean speech signal. Experiments indicate that the scheme proposed in this paper performs significantly better than competing methods.
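The online computation of the excitation variances can be sketched as follows: for one pair of speech/noise codebook entries, the two variances are fitted so that the modeled spectrum matches the observed periodogram. This toy version uses plain least squares with clipping in place of the paper's actual estimation criterion; all names are illustrative.

```python
import numpy as np

def lpc_power_spectrum(a, nfft):
    """Power spectrum 1/|A(e^jw)|^2 of an LP polynomial a = [1, a1, ...]."""
    A = np.fft.rfft(a, nfft)
    return 1.0 / (np.abs(A) ** 2 + 1e-12)

def estimate_excitation_gains(periodogram, a_speech, a_noise, nfft=512):
    """For one speech/noise codebook pair, fit the excitation variances
    (g_s, g_n) so that g_s*H_s + g_n*H_n matches the observed periodogram.
    Least squares with clipping stands in for the paper's criterion."""
    Hs = lpc_power_spectrum(a_speech, nfft)
    Hn = lpc_power_spectrum(a_noise, nfft)
    B = np.stack([Hs, Hn], axis=1)
    g, *_ = np.linalg.lstsq(B, periodogram, rcond=None)
    return np.maximum(g, 0.0)  # variances must be nonnegative
```

In the full scheme, such a gain fit is carried out for every speech/noise codebook pair, and the per-pair spectral estimates are combined with Bayesian weights to form the MMSE estimate.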

143 citations


Proceedings ArticleDOI
15 Apr 2007
TL;DR: The AMI transcription system for speech in meetings developed in collaboration by five research groups includes generic techniques such as discriminative and speaker adaptive training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, maximum likelihood linear regression, and phone posterior based features, as well as techniques specifically designed for meeting data.
Abstract: This paper describes the AMI transcription system for speech in meetings developed in collaboration by five research groups. The system includes generic techniques such as discriminative and speaker adaptive training, vocal tract length normalisation, heteroscedastic linear discriminant analysis, maximum likelihood linear regression, and phone posterior based features, as well as techniques specifically designed for meeting data. These include segmentation and cross-talk suppression, beam-forming, domain adaptation, Web-data collection, and channel adaptive training. The system was improved by more than 20% relative in word error rate compared to our previous system and was used in the NIST RT'06 evaluations, where it was found to yield competitive performance.

137 citations


Proceedings ArticleDOI
01 Feb 2007
TL;DR: A novel approach for speech signal modeling using fractional calculus that has the merit of requiring a smaller number of model parameters, and is demonstrated to be superior to the LPC approach in capturing the details of the modeled signal.
Abstract: In this paper, we present a novel approach for speech signal modeling using fractional calculus. This approach is contrasted with the celebrated Linear Predictive Coding (LPC) approach which is based on integer order models. It is demonstrated via numerical simulations that by using a few integrals of fractional orders as basis functions, the speech signal can be modeled accurately. The new approach has the merit of requiring a smaller number of model parameters, and is demonstrated to be superior to the LPC approach in capturing the details of the modeled signal.

93 citations


Journal ArticleDOI
TL;DR: A speech-and-speaker (SAS) identification system based on spoken Arabic digit recognition is discussed, building a system to recognize both the uttered words and their speaker through an innovative graphical algorithm for feature extraction from the voice signal.
Abstract: This paper discusses a speech-and-speaker (SAS) identification system based on spoken Arabic digit recognition. The speech signals of the Arabic digits from zero to ten are processed graphically (the signal is treated as an object image for further processing). The identifying and classifying methods are performed with Burg's estimation model and the algorithm of Toeplitz matrix minimal eigenvalues as the main tools for signal-image description and feature extraction. At the stage of classification, both conventional and neural-network-based methods are used. The success rate of the speaker-identifying system obtained in the presented experiments for individually uttered words is excellent and has reached about 98.8% in some cases. The miss rate of about 1.2% was almost entirely due to false acceptance (13 miss cases in 1100 tested voices). These results have promisingly led to the design of a security system for SAS identification. The average overall success rate was 97.45% in recognizing one uttered word and identifying its speaker, and 92.5% in recognizing a three-digit password (three individual words), which is a high success rate for the compound case, since all three uttered words must be tested consecutively in addition to, and after, identifying their speaker; hence, the probability of making an error is inherently higher. The authors' major contribution to this task involves building a system to recognize both the uttered words and their speaker through an innovative graphical algorithm for feature extraction from the voice signal. This Toeplitz-based algorithm reduces the amount of computation from operations on an n×n matrix that contains n² different elements to operations on a matrix (of Toeplitz form) that contains only n elements that are different from each other.
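The computational saving comes from the Toeplitz structure: an n×n autocorrelation matrix is fully determined by its first row of n elements. A rough sketch of minimal-eigenvalue feature extraction along these lines (not the paper's exact algorithm) might look like this:

```python
import numpy as np
from scipy.linalg import toeplitz, eigvalsh

def toeplitz_min_eig_features(signal, max_order=12):
    """Minimal eigenvalues of autocorrelation Toeplitz matrices of
    increasing order, used as a compact feature vector. Illustrative
    sketch of the general idea only."""
    x = signal / (np.abs(signal).max() + 1e-12)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    r = r / r[0]                                  # normalized autocorrelation
    return np.array([eigvalsh(toeplitz(r[:k + 1]))[0]   # smallest eigenvalue
                     for k in range(1, max_order + 1)])
```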

77 citations


Book
24 Sep 2007
TL;DR: This book covers speech signals and waveform coding, predictive and analysis-by-synthesis (CELP) coding, speech spectral quantization, wideband and low-rate speech coding including wavelet-based pitch detection and zinc function excitation, and concludes with a comparison of speech transceivers and a chapter on voice over the Internet protocol.
Abstract: About the Authors. Other Wiley and IEEE Press Books on Related Topics. Preface and Motivation. Acknowledgements. 1 Speech Signals and Waveform Coding. 2 Predictive Coding. 3 Analysis-by-synthesis Principles. 4 Speech Spectral Quantization. 5 RPE Coding. 6 Forward-Adaptive CELP Coding. 7 Standard CELP Codecs. 8 Backward-Adaptive CELP Coding. 9 Wideband Speech Coding. 10 MPEG-4 Audio Compression and Transmission. 11 Overview of Low-rate Speech Coding. 12 Linear Predictive Vocoder. 13 Wavelets and Pitch Detection. 14 Zinc Function Excitation. 15 Mixed-Multiband Excitation. 16 Sinusoidal Transform Coding Below 4 kbps. 17 Conclusions on Low Rate Coding. 18 Comparison of Speech Transceivers. 19 Voice Over the Internet Protocol. A Constructing the Quadratic Spline Wavelets. B Zinc Function Excitation. C Probability Density Function for Amplitudes. Bibliography. Index. Author Index.

74 citations


Proceedings ArticleDOI
01 Dec 2007
TL;DR: In contrast to the common belief that "there is no data like more data", it is found possible to select a highly informative subset of data that produces recognition performance comparable to a system that makes use of a much larger amount of data.
Abstract: This paper presents a strategy for efficiently selecting informative data from large corpora of transcribed speech. We propose to choose data uniformly according to the distribution of some target speech unit (phoneme, word, character, etc). In our experiment, in contrast to the common belief that "there is no data like more data", we found it possible to select a highly informative subset of data that produces recognition performance comparable to a system that makes use of a much larger amount of data. At the same time, our selection process is efficient and fast.
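A greedy sketch of selecting data uniformly according to the distribution of a target unit, assuming each utterance comes with its phoneme (or word/character) transcription; the scoring rule and names here are illustrative, not the paper's procedure.

```python
from collections import Counter

def select_uniform(utterances, budget):
    """Greedily pick utterances that keep the selected subset's unit
    counts as flat (uniform) as possible. Each utterance is a list of
    target units, e.g. phonemes. Assumes budget <= len(utterances)."""
    counts, selected = Counter(), []
    for _ in range(budget):
        # Prefer the utterance whose units are currently rarest on average.
        best = min(utterances,
                   key=lambda u: sum(counts[p] for p in u) / max(len(u), 1))
        utterances = [u for u in utterances if u is not best]
        selected.append(best)
        counts.update(best)
    return selected
```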

66 citations


Journal ArticleDOI
TL;DR: It is suggested that amplification of transient information may improve the intelligibility of speech in noise and that this improvement is more effective in severe noise conditions.
Abstract: The role of transient speech components on speech intelligibility was investigated. Speech was decomposed into two components—quasi-steady-state (QSS) and transient—using a set of time-varying filters whose center frequencies and bandwidths were controlled to identify the strongest formant components in speech. The relative energy and intelligibility of the QSS and transient components were compared to original speech. Most of the speech energy was in the QSS component, but this component had low intelligibility. The transient component had much lower energy but was almost as intelligible as the original speech, suggesting that the transient component included speech elements important to speech perception. A modified version of speech was produced by amplifying the transient component and recombining it with the original speech. The intelligibility of the modified speech in background noise was compared to that of the original speech, using a psychoacoustic procedure based on the modified rhyme protocol. Word recognition rates for the modified speech were significantly higher at low signal-to-noise ratios (SNRs), with minimal effect on intelligibility at higher SNRs. These results suggest that amplification of transient information may improve the intelligibility of speech in noise and that this improvement is more effective in severe noise conditions.

65 citations


Proceedings ArticleDOI
15 Apr 2007
TL;DR: The latest wideband vocoder standard adopted by the cdma2000 standardization body, 3GPP2, is described and it is demonstrated that the EVRC-WB codec performs statistically significantly better than the adaptive multirate wideband.
Abstract: In this paper, the latest wideband vocoder standard adopted by the cdma2000 standardization body, 3GPP2, is described. Christened the enhanced variable rate codec-wideband (EVRC-WB), the proposed codec encodes wideband speech (16 kHz sampling frequency) at a maximum bit-rate of 8.55 kbit/s. EVRC-WB is based on a split-band coding paradigm in which two different coding models are used for the signal components in the low-frequency (LF) (0-4 kHz) and high-frequency (HF) (3.5-7 kHz) bands. The coding model used for the former is based on the EVRC-B narrowband (0-4 kHz) codec, modified to encode the LF band signal at a maximum bit-rate of 7.75 kbit/s. The HF band coding model is an LPC-based coding scheme where the excitation is derived from the coded LF band excitation using non-linear processing. Mean opinion scores from 3GPP2 characterization tests are provided to demonstrate that the EVRC-WB codec (8.55 kbit/s, max.) performs statistically significantly better than the adaptive multirate wideband codec (12.65 kbit/s, max.).
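The idea of deriving the high-band excitation from the coded low-band excitation by non-linear processing can be illustrated with a toy example: a memoryless nonlinearity such as full-wave rectification generates energy at harmonics above the original band. EVRC-WB's actual nonlinearity and spectral flattening are more elaborate than this sketch.

```python
import numpy as np

def toy_highband_excitation(lf_excitation):
    """Toy HF-excitation generator: full-wave rectification spreads the
    LF residual's harmonic structure upward in frequency; removing the
    introduced DC and renormalizing keeps the energy comparable. Not
    EVRC-WB's actual non-linear processing."""
    y = np.abs(lf_excitation)                 # nonlinearity creates harmonics
    y = y - np.mean(y)                        # remove the DC it introduces
    y *= np.linalg.norm(lf_excitation) / (np.linalg.norm(y) + 1e-12)
    return y
```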

62 citations


Journal ArticleDOI
TL;DR: The approach provides a solution to the traditional "hidden formant problem," and produces meaningful results even during consonantal closures when the supra-laryngeal source may cause no spectral prominences in speech acoustics.
Abstract: A novel Kalman filtering/smoothing algorithm is presented for efficient and accurate estimation of vocal tract resonances or formants, which are natural frequencies and bandwidths of the resonator from larynx to lips, in fluent speech. The algorithm uses a hidden dynamic model, with a state-space formulation, where the resonance frequency and bandwidth values are treated as continuous-valued hidden state variables. The observation equation of the model is constructed by an analytical predictive function from the resonance frequencies and bandwidths to LPC cepstra as the observation vectors. This nonlinear function is adaptively linearized, and a residual or bias term, which is adaptively trained, is added to the nonlinear function to represent the iteratively reduced piecewise linear approximation error. Details of the piecewise linearization design process are described. An iterative tracking algorithm is presented, which embeds both the adaptive residual training and piecewise linearization design in the Kalman filtering/smoothing framework. Experiments on estimating resonances in Switchboard speech data show accurate estimation results. In particular, the effectiveness of the adaptive residual training is demonstrated. Our approach provides a solution to the traditional "hidden formant problem," and produces meaningful results even during consonantal closures when the supra-laryngeal source may cause no spectral prominences in speech acoustics.
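The analytical predictive function from resonance frequencies and bandwidths to LPC cepstra has a well-known closed form for an all-pole model, which is what gets piecewise-linearized inside the Kalman framework. A sketch (the sampling rate and cepstral order are placeholders):

```python
import numpy as np

def formants_to_lpc_cepstrum(freqs_hz, bws_hz, fs=8000.0, n_ceps=15):
    """Closed-form LPC cepstrum of an all-pole filter whose pole pairs
    encode resonance frequencies f_i and bandwidths b_i:
        c[n] = sum_i (2/n) * exp(-pi*n*b_i/fs) * cos(2*pi*n*f_i/fs).
    This is the nonlinear observation function that the tracker
    linearizes piecewise inside the Kalman filter/smoother."""
    n = np.arange(1, n_ceps + 1)[:, None]          # cepstral index
    f = np.asarray(freqs_hz)[None, :]
    b = np.asarray(bws_hz)[None, :]
    terms = (2.0 / n) * np.exp(-np.pi * n * b / fs) * np.cos(2 * np.pi * n * f / fs)
    return terms.sum(axis=1)
```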

56 citations


Proceedings ArticleDOI
15 Apr 2007
TL;DR: This work proposes an improved low bit rate bandwidth extension algorithm along with a robust watermarking scheme for CELP-type speech codecs which is especially tailored to state-of-the-art narrowband speech communication networks such as GSM or UMTS.
Abstract: We consider the problem of transmitting a wideband speech signal with a cut-off frequency of fc = 7 kHz over a standardized narrowband (fc = 3.4 kHz) communication link in a backwards compatible manner. In a previous contribution we have shown that backwards compatibility can be achieved by using digital watermarking: we embedded compact side information about the missing high frequency band (3.4 - 7 kHz) into the narrowband speech signal. Here, we present a related system which is especially tailored to state-of-the-art narrowband speech communication networks such as GSM or UMTS. Therefore, we propose an improved low bit rate bandwidth extension algorithm along with a robust watermarking scheme for CELP-type speech codecs. The practical relevance of our system is shown by speech quality evaluations and by link-level simulations for the "enhanced full rate traffic channel" (TCH/EFS) of the GSM cellular communication system.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: Results show that the combination of mapping and frame selection provides the best results and underlines the interest of working on methods to convert the LP excitation.
Abstract: The subject of this paper is the conversion of a given speaker's voice (the source speaker) into another identified voice (the target one). We assume we have at our disposal a large amount of speech samples from the source and target voices, with at least a part of them being parallel. The proposed system is built on a mapping function between source and target spectral envelopes followed by a frame selection algorithm to produce the final spectral envelopes. Converted speech is produced by a basic LP analysis of the source and LP synthesis using the converted spectral envelopes. We compared three types of conversion: without mapping, with mapping and using the excitation of the source speaker, and finally with mapping and using the excitation of the target. Results show that the combination of mapping and frame selection provides the best results and underlines the interest of working on methods to convert the LP excitation.

Journal ArticleDOI
TL;DR: This work overviews some recently proposed discrete Fourier transform (DFT)- and discrete wavelet packet transform (DWPT)-based speech parameterization methods and compares their performance against traditional techniques, such as the Mel-frequency cepstral coefficients (MFCC) and perceptual linear predictive (PLP) cepstral coefficients, which presently dominate the speech recognition field.
Abstract: In the present work we overview some recently proposed discrete Fourier transform (DFT)- and discrete wavelet packet transform (DWPT)-based speech parameterization methods and evaluate their performance on the speech recognition task. Specifically, in order to assess the practical value of these less studied speech parameterization methods, we evaluate them in a common experimental setup and compare their performance against traditional techniques, such as the Mel-frequency cepstral coefficients (MFCC) and perceptual linear predictive (PLP) cepstral coefficients which presently dominate the speech recognition field. In particular, utilizing the well established TIMIT speech corpus and employing the Sphinx-III speech recognizer, we present comparative results of 8 different speech parameterization techniques.
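For reference, the MFCC baseline that such comparisons are made against can be computed in a few lines; a sketch using librosa with illustrative parameter choices (the file name is assumed):

```python
import librosa

# MFCC baseline: 13 cepstral coefficients from 25 ms frames with a 10 ms hop.
# Frame/hop sizes and the coefficient count are common choices, not the
# paper's exact configuration.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
```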

Proceedings ArticleDOI
15 Apr 2007
TL;DR: A source-filter decomposition of the speech signal by means of an ARX-LF model allows the glottal signal to be represented as the sum of an LF waveform and a residual signal, enabling high-quality speech modifications such as pitch, duration, or even voice quality transformation.
Abstract: In this paper a new method for speech synthesis is proposed. It relies on a source-filter decomposition of the speech signal by means of an ARX-LF model. This model allows the representation of the glottal signal as the sum of an LF waveform and a residual signal. The residual information is then analyzed by a harmonic-plus-noise model (HNM). This signal representation enables high-quality speech modification such as pitch, duration or even voice quality transformation. Experiments performed on a real speech database show the relevance of the proposed method as compared to other existing approaches.

Journal ArticleDOI
TL;DR: A content-dependent watermarking scheme suitable for codebook-excited linear prediction (CELP)-based speech codec that ensures the integrity of compressed speech data.
Abstract: As speech compression technologies have advanced, digital recording devices have become increasingly popular. However, data formats used in popular speech codecs are known a priori, such that compressed data can be modified easily via insertion, deletion, and replacement. This work proposes a content-dependent watermarking scheme suitable for codebook-excited linear prediction (CELP)-based speech codecs that ensures the integrity of compressed speech data. Speech data are initially partitioned into many groups, each of which includes multiple speech frames. The watermark embedded in each frame is then generated according to the line spectrum frequency (LSF) feature in the current frame, the pitch extracted from the succeeding frame, the watermark embedded in the preceding frame, and the group index, which is determined by the location of the current frame. Finally, some of the least significant bits (LSBs) of the indices indicating the excitation pulse positions or excitation vectors are substituted with the watermark. Conventional watermarking schemes can only detect whether compressed speech data are intact; they cannot determine where compressed speech data are altered by insertion, deletion, or replacement, whereas the proposed scheme can. Experiments established that the proposed scheme, applied to the G.723.1 6.3 kb/s speech codec, embeds 12 bits in each 189-bit compressed speech frame and decreases the perceptual evaluation of speech quality (PESQ) score by only 0.11. Additionally, its accuracy in detecting the locations of attacked frames is very high, with only two normal frames mistaken as attacked frames. Therefore, the proposed watermarking scheme effectively ensures the integrity of compressed speech data.
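The embedding step itself is plain LSB substitution on selected excitation indices; a schematic sketch in which the index layout and the watermark-generation details are simplified away:

```python
def embed_watermark(indices, wm_bits):
    """Substitute the LSB of selected excitation-position indices with
    watermark bits. Schematic only: the real scheme derives wm_bits from
    the current frame's LSF feature, the next frame's pitch, the previous
    frame's watermark, and the group index."""
    out = list(indices)
    for i, bit in enumerate(wm_bits[:len(out)]):
        out[i] = (out[i] & ~1) | (bit & 1)
    return out

def extract_watermark(indices, n_bits):
    """Read back the embedded bits from the index LSBs."""
    return [idx & 1 for idx in indices[:n_bits]]
```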

Journal ArticleDOI
Ki-Seung Lee
TL;DR: An evaluation by objective tests and informal listening tests clearly indicated the effectiveness of the proposed voice transformation method, and confirmed that the proposed method leads to smoothly evolving spectral contours over time, which were superior to conventional vector quantization (VQ)-based methods.
Abstract: A voice transformation method which changes the source speaker's utterances so as to sound similar to those of a target speaker is described. Speaker individuality transformation is achieved by altering the LPC cepstrum, average pitch period, and average speaking rate. The main objective of the work involves building a nonlinear relationship between the parameters for the acoustical features of two speakers, based on a probabilistic model. The conversion rules involve the probabilistic classification and a cross-correlation probability between the acoustic features of the two speakers. The parameters of the conversion rules are estimated by maximum likelihood estimation on the training data. To obtain transformed speech signals which are perceptually closer to the target speaker's voice, prosody modification is also involved. Prosody modification is achieved by scaling the excitation spectrum and time scale modification with appropriate modification factors. An evaluation by objective tests and informal listening tests clearly indicated the effectiveness of the proposed transformation method. We also confirmed that the proposed method leads to smoothly evolving spectral contours over time, which, from a perceptual standpoint, produced results that were superior to conventional vector quantization (VQ)-based methods.

Journal ArticleDOI
TL;DR: A method for compressing speech based on polynomial approximations of the trajectories in time of various speech features (i.e., spectrum, gain, and pitch), which can be integrated into frame-based speech coders, and can also be applied to features that can be represented as temporal series greater in duration than the frame interval.
Abstract: Methods for speech compression aim at reducing the transmission bit rate while preserving the quality and intelligibility of speech. These objectives are antipodal in nature, since higher compression presupposes preserving less information about the original speech signal. This paper presents a method for compressing speech based on polynomial approximations of the trajectories in time of various speech features (i.e., spectrum, gain, and pitch). The compression method can be integrated into frame-based speech coders, and can also be applied to features that can be represented as temporal series greater in duration than the frame interval. Theoretical issues and experimental results regarding this type of compression are addressed in this paper. Experimental implementation into a 2400 b/s standard speech coder is reported along with objective and subjective evaluations of operation in various noise environments. The new speech coder operates at a transmission rate of 1533 b/s and, for all noisy conditions tested, performs better than the 2400 b/s standard speech coder.
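The core mechanism can be sketched in a few lines: fit a low-order polynomial to a feature trajectory over a super-frame and transmit only the coefficients. The polynomial order and frame counts here are illustrative.

```python
import numpy as np

def compress_trajectory(values, order=3):
    """Fit one speech-feature trajectory (e.g., the gain or an LSF track
    over a super-frame of consecutive frames) with a low-order polynomial;
    only the order+1 coefficients need to be quantized and transmitted."""
    t = np.linspace(0.0, 1.0, len(values))
    return np.polyfit(t, values, order)

def decompress_trajectory(coeffs, n_frames):
    """Reconstruct the per-frame feature values from the coefficients."""
    t = np.linspace(0.0, 1.0, n_frames)
    return np.polyval(coeffs, t)
```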

Patent
Juin-Hwey Chen
17 Jan 2007
TL;DR: In this article, a wireless telephone including a receiver module, a channel decoder, a speech decoder and a speaker is presented, where each version of the voice signal comprises a plurality of speech frames.
Abstract: An embodiment of the present invention provides a wireless telephone including a receiver module, a channel decoder, a speech decoder, and a speaker. The receiver module receives a plurality of versions of a voice signal, wherein each version of the voice signal comprises a plurality of speech frames. The channel decoder is configured to decode a speech parameter associated with a first speech frame from a first version of the voice signal, wherein decoding the speech parameter includes selecting an optimal bit sequence from a plurality of candidate bit sequences and wherein the selection of the optimal bit sequence is based in part on a second speech frame from a second version of the voice signal. The speech decoder decodes at least one of the first or second versions of the voice signal based on the speech parameter to generate an output signal. The speaker receives the output signal and produces a sound pressure wave corresponding thereto.

Proceedings ArticleDOI
07 May 2007
TL;DR: It is shown that a predictive exploitation of the proposed layered configuration of vertices can improve the compression performance upon other state-of-the-art approaches by up to 16% in domains relevant for applications.
Abstract: We present a linear predictive compression approach for time-consistent 3D mesh sequences supporting and exploiting scalability. The algorithm decomposes each frame of a mesh sequence into layers employing patch-based mesh simplification techniques. This layered decomposition is consistent in time. Following the predictive coding paradigm, local temporal and spatial dependencies between layers and frames are exploited for compression. Prediction is performed vertex-wise from coarse to fine layers, exploiting the motion of already encoded 1-ring neighbor vertices for prediction of the current vertex location. It is shown that a predictive exploitation of the proposed layered configuration of vertices can improve the compression performance over other state-of-the-art approaches by up to 16% in domains relevant for applications.
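The vertex-wise prediction step can be illustrated as follows: a vertex's new position is predicted from its previous position plus the mean motion of its already-decoded 1-ring neighbors, so only a small residual needs encoding. This is a simplified stand-in for the paper's layered scheme.

```python
import numpy as np

def predict_vertex(prev_pos, cur_neighbor_pos, prev_neighbor_pos):
    """Predict a vertex's current 3D position from its previous position
    plus the mean displacement of its already-decoded 1-ring neighbors.
    The encoder transmits only the residual (actual - predicted).
    prev_pos: (3,); *_neighbor_pos: (k, 3) arrays."""
    motion = np.mean(cur_neighbor_pos - prev_neighbor_pos, axis=0)
    return prev_pos + motion
```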

Patent
03 Oct 2007
TL;DR: A speech processing apparatus sets associations between a speech recognition target vocabulary and shortcut data for transitioning among states, improving speech recognition accuracy for audio shortcuts while preserving their convenience.
Abstract: A speech processing apparatus comprises a setting unit that sets an association between a speech recognition target vocabulary and shortcut data for transitioning to a state, when a user makes a transition to that state among the plurality of states using an operation input unit; a speech input unit that inputs audio; a speech recognition unit that employs the speech recognition target vocabulary to recognize the audio that is input via the speech input unit; and a control unit that employs the shortcut data corresponding to the speech recognition target vocabulary recognized by the speech recognition unit to transition to the state. The aim is to improve speech recognition accuracy for audio shortcuts while preserving their convenience.

Journal ArticleDOI
TL;DR: This paper presents a noise-robust feature extraction algorithm, NRFE, that uses joint wavelet packet decomposition (WPD) and autoregressive (AR) modeling of the speech signal to improve noise robustness and performance.

Journal ArticleDOI
TL;DR: A switching coding scheme that will combine the advantages of both run-length and adaptive linear predictive coding, and uses a simple yet effective edge detector using only causal pixels for estimating the coding pixels in the proposed encoder.
Abstract: Many coding methods are more efficient with some images than others. In particular, run-length coding is very useful for coding areas of little changes. Adaptive predictive coding achieves high coding efficiency for fast changing areas like edges. In this paper, we propose a switching coding scheme that will combine the advantages of both run-length and adaptive linear predictive coding. For pixels in slowly varying areas, run-length coding is used; otherwise least-squares (LS)-adaptive predictive coding is used. Instead of performing LS adaptation in a pixel-by-pixel manner, we adapt the predictor coefficients only when an edge is detected so that the computational complexity can be significantly reduced. For this, we use a simple yet effective edge detector using only causal pixels. This way, the proposed system can look ahead to determine if the coding pixel is around an edge and initiate the LS adaptation in advance to prevent the occurrence of a large prediction error. With the proposed switching structure, very good prediction results can be obtained in both slowly varying areas and pixels around boundaries. Furthermore, only causal pixels are used for estimating the coding pixels in the proposed encoder; no additional side information needs to be transmitted. Extensive experiments as well as comparisons to existing state-of-the-art predictors and coders will be given to demonstrate its usefulness.
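A compact sketch of the switching idea: a causal-context edge test selects between a run-length-friendly copy predictor in flat areas and an LS-refitted linear predictor near edges. The thresholds, window size, and exact detector here are illustrative, not the paper's design.

```python
import numpy as np

def causal_context(img, r, c):
    """Causal neighbors available to both encoder and decoder: W, N, NW, NE."""
    return np.array([img[r, c - 1], img[r - 1, c],
                     img[r - 1, c - 1], img[r - 1, c + 1]], dtype=float)

def switching_predict(img, edge_thresh=16.0, win=64):
    """Run-length / LS-adaptive switching prediction sketch for images.
    Flat causal context: copy the west pixel (run-length friendly).
    Edge in the context: refit the linear predictor by least squares
    over recent causal samples, then predict from the context."""
    h, w = img.shape
    pred = img.astype(float).copy()
    coeffs = np.full(4, 0.25)                      # initial predictor
    ctx_hist, pix_hist = [], []
    for r in range(1, h):
        for c in range(1, w - 1):
            ctx = causal_context(img, r, c)
            if ctx.max() - ctx.min() < edge_thresh:
                pred[r, c] = img[r, c - 1]         # slowly varying area
            else:
                if len(pix_hist) >= 8:             # LS adaptation near an edge
                    A = np.array(ctx_hist[-win:])
                    y = np.array(pix_hist[-win:])
                    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
                pred[r, c] = float(ctx @ coeffs)
            ctx_hist.append(ctx)
            pix_hist.append(float(img[r, c]))
    return pred                                    # residual = img - pred
```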

Proceedings ArticleDOI
01 Dec 2007
TL;DR: An evaluation framework for VAD in such environments, called corpus and environment for noisy speech recognition 1 concatenated (CENSREC-1-C), is developed and it is shown that a small extension improves speech recognition.
Abstract: Voice activity detection (VAD) plays an important role in speech processing, including speech recognition, speech enhancement, and speech coding in noisy environments. We developed an evaluation framework for VAD in such environments, called corpus and environment for noisy speech recognition 1 concatenated (CENSREC-1-C). This framework consists of noisy continuous digit utterances and evaluation tools for VAD results. By adopting two evaluation measures, one for frame-level detection performance and the other for utterance-level detection performance, we provide the evaluation results of a power-based VAD method as a baseline. When using VAD in a speech recognizer, the detected speech segments are extended to avoid the loss of speech frames and the pause segments are then absorbed by a pause model. We investigate the balance of an explicit segmentation by VAD and an implicit segmentation by a pause model using an experimental simulation of segment extension and show that a small extension improves speech recognition.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: The results reveal that the proposed method is more reliable and less sensitive to mode of signal acquisition and unforeseen conditions.
Abstract: The goal of this work is to provide robust and accurate speech detection for automatic speech recognition (ASR) in meeting room settings. The solution is based on computing long-term modulation spectrum, and examining specific frequency range for dominant speech components to classify speech and non-speech signals for a given audio signal. Manually segmented speech segments, short-term energy, short-term energy and zero-crossing based segmentation techniques, and a recently proposed multi layer perceptron (MLP) classifier system are tested for comparison purposes. Speech recognition evaluations of the segmentation methods are performed on a standard database and tested in conditions where the signal-to-noise ratio (SNR) varies considerably, as in the cases of close-talking headset, lapel, distant microphone array output, and distant microphone. The results reveal that the proposed method is more reliable and less sensitive to mode of signal acquisition and unforeseen conditions.
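Speech energy modulations concentrate around the syllabic rate (roughly 2-8 Hz, peaking near 4 Hz), which is what a long-term modulation-spectrum detector exploits. A sketch with illustrative band edges and threshold:

```python
import numpy as np

def is_speech(x, fs, frame_len=0.025, hop=0.010, band=(2.0, 8.0), thresh=0.3):
    """Classify an audio chunk as speech/non-speech from its long-term
    modulation spectrum: the fraction of envelope-modulation energy in
    the syllabic-rate band. Band edges and threshold are illustrative."""
    nf, nh = int(frame_len * fs), int(hop * fs)
    env = np.array([np.sum(x[i:i + nf] ** 2)
                    for i in range(0, len(x) - nf, nh)])  # energy envelope
    env = env - env.mean()
    mod = np.abs(np.fft.rfft(env)) ** 2                   # modulation spectrum
    mod_freqs = np.fft.rfftfreq(len(env), d=hop)          # Hz, since hop is in s
    in_band = (mod_freqs >= band[0]) & (mod_freqs <= band[1])
    ratio = mod[in_band].sum() / (mod[mod_freqs > 0].sum() + 1e-12)
    return ratio > thresh
```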

Proceedings ArticleDOI
15 Apr 2007
TL;DR: A data hiding method based on dither quantization is used for speech bandwidth extension and results show that compared with using non-classified codebook, the propose scheme have a better bandwidth extension performance in terms of log spectral distortion (LSD).
Abstract: Speech bandwidth extension can be defined as the deliberate process of expanding the frequency range (bandwidth) for speech transmission. Its significant advancement in recent years has led to the technology being adopted commercially in several areas, including psychoacoustic bass enhancement of small loudspeakers and the high-frequency enhancement of perceptually coded audio. In this paper, a data hiding method based on dither quantization is used for speech bandwidth extension. More specifically, the out-of-band information is encoded and embedded into the narrowband speech without degrading the quality of the bandlimited signal. At the receiver, the out-of-band information is extracted from the hidden channel and combined with the bandlimited signal, providing a signal with a wider bandwidth. To encode the out-of-band speech more efficiently, acoustic-phonetic classification is employed to generate three linear prediction (LP) codebooks. The simulation results show that, compared with using a non-classified codebook, the proposed scheme has better bandwidth extension performance in terms of log spectral distortion (LSD).
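Dither quantization here refers to quantization index modulation (QIM): a bit is embedded by quantizing a host value with one of two dither offsets, and decoded by finding the nearer lattice. A minimal sketch:

```python
import numpy as np

def qim_embed(x, bit, delta=0.1):
    """Embed one bit in a host value by quantizing with a bit-dependent
    dither: the lattices for bit 0 and bit 1 are offset by delta/2."""
    d = 0.0 if bit == 0 else delta / 2.0
    return delta * np.round((x - d) / delta) + d

def qim_extract(y, delta=0.1):
    """Decode by choosing whichever dithered lattice is nearer."""
    e0 = np.abs(y - qim_embed(y, 0, delta))
    e1 = np.abs(y - qim_embed(y, 1, delta))
    return 0 if e0 <= e1 else 1
```

The step size delta trades embedding robustness against the distortion introduced in the narrowband host signal.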

Proceedings ArticleDOI
27 Aug 2007
TL;DR: A signal separation scheme that allows for a detailed analysis of unknown speech enhancement systems in a black box test scenario, yielding three separate signals (speech, residual noise, and residual echo) that can be measured or auditively assessed in shorter time.
Abstract: Quality assessment of speech enhancement systems is a non-trivial task, especially when (residual) noise and echo signal components occur. We present a signal separation scheme that allows for a detailed analysis of unknown speech enhancement systems in a black box test scenario. Our approach separates the speech, (residual) noise, and (residual) echo components of the speech enhancement system in the sending direction (uplink direction). This makes it possible to independently judge the speech degradation and the noise and echo attenuation/degradation. While state-of-the-art tests always try to judge the sending-direction signal mixture, our new scheme allows a more reliable analysis in shorter time. It will be very useful for testing hands-free devices in practice as well as for testing speech enhancement algorithms in research and development. Index Terms: objective signal quality assessment, non-blind signal separation, speech enhancement, hands-free.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: Speech intelligibility index (SII) obtained from the enhanced speech and the subjective test results indicate that the proposed system provides a significant gain in speech intelligibility in environments with moderate or heavy ambient noise with the introduction of minimal processing artifacts.
Abstract: Speech reproduction systems such as mobile handsets are often used in environments with moderate or high level of ambient noise where the intelligibility of the spoken words is degraded heavily. We propose a low-complexity system to increase the intelligibility of far-end clean speech signal to a listener who is located in such environment. An implementation of the proposed system on ARM9 RISC processor in a mobile handset is also reported here. Speech intelligibility index (SII) obtained from the enhanced speech and the subjective test results indicate that the proposed system provides a significant gain in speech intelligibility in environments with moderate or heavy ambient noise with the introduction of minimal processing artifacts.

Patent
16 Feb 2007
TL;DR: In this article, a method for varying speech speed is presented, which includes the following steps: receive an original speech signal, calculate a pitch period of the original signal, define search ranges according to pitch period, find a maximum within each of the search ranges, divide the signal into speech sections according to the maxima, and obtain a speed-varied speech signal by applying a speed varying algorithm to each speech section of the signal according to a speed changing command.
Abstract: A method for varying speech speed is provided. The method includes the following steps: receive an original speech signal; calculate a pitch period of the original speech signal; define search ranges according to the pitch period; find a maximum within each of the search ranges of the original speech signal; divide the original speech signal into speech sections according to the maxima; obtain a speed-varied speech signal by applying a speed-varying algorithm to each speech section of the original speech signal according to a speed-varying command; and finally, output the speed-varied speech signal.

Patent
14 Sep 2007
TL;DR: In this paper, a speech component signal is identified and modified by assuming that the speech source (e.g., the actor currently speaking) is in the center of a stereo sound image of the plural-channel audio signal.
Abstract: A plural-channel audio signal (e.g., a stereo audio) is processed to modify a gain (e.g., a volume or loudness) of a speech component signal (e.g., dialogue spoken by actors in a movie) relative to an ambient component signal (e.g., reflected or reverberated sound) or other component signals. In one aspect, the speech component signal is identified and modified. In one aspect, the speech component signal is identified by assuming that the speech source (e.g., the actor currently speaking) is in the center of a stereo sound image of the plural-channel audio signal and by considering the spectral content of the speech component signal.

Patent
15 Aug 2007
TL;DR: In this paper, a multi-stage speech recognition system is proposed, which consists of a first speech recognition unit performing initial speech recognition on a feature vector, which is extracted from an input speech signal, and generating a plurality of candidate words; and a second speech retrieval unit rescoring the candidate words using a temporal posterior feature vector extracted from the speech signal.
Abstract: Provided are a multi-stage speech recognition apparatus and method. The multi-stage speech recognition apparatus includes a first speech recognition unit performing initial speech recognition on a feature vector, which is extracted from an input speech signal, and generating a plurality of candidate words; and a second speech recognition unit rescoring the candidate words, which are provided by the first speech recognition unit, using a temporal posterior feature vector extracted from the speech signal.