The purpose of this paper is to develop a novel speech feature extraction framework that independently compensates the real and imaginary acoustic spectra of speech signals in the modulation domain, using histogram equalization (HEQ) and non-negative matrix factorization (NMF). By doing so, we can enhance not only the magnitude but also the phase components of the acoustic spectra, thereby creating noise-robust speech features. More specifically, the proposed framework makes three major contributions. First, via either the HEQ or the NMF operation, the long-term cross-frame correlation among the acoustic spectra at the same frequency can be captured to compensate for the spectral distortion caused by noise. Second, the noise effect can be handled at a high acoustic frequency resolution. Finally, the distortion residing in the acoustic spectra can be mitigated more extensively because the real and imaginary parts are processed independently. The evaluation experiments were carried out on the Aurora-2 and Aurora-4 benchmark tasks, and the results suggest that our proposed methods can achieve speech recognition performance competitive with or better than many widely used noise robustness methods, including the well-known advanced front-end (AFE) extraction scheme.

Robust speech recognition via enhancing the complex-valued acoustic spectrum in modulation domain

The masking release (i.e., better speech recognition in fluctuating compared to continuous noise backgrounds) observed for normal-hearing (NH) listeners is generally reduced or absent in hearing-impaired (HI) listeners. One explanation for this lies in the effects of reduced audibility: elevated thresholds may prevent HI listeners from taking advantage of signals available to NH listeners during the dips of temporally fluctuating noise where the interference is relatively weak. This hypothesis was addressed through the development of a signal-processing technique designed to increase the audibility of speech during dips in interrupted noise. This technique acts to (i) compare short-term and long-term estimates of energy, (ii) increase the level of short-term segments whose energy is below the average energy, and (iii) normalize the overall energy of the processed signal to be equivalent to that of the original long-term estimate. Evaluations of this energy-equalizing (EEQ) technique included consonant identification and sentence reception in backgrounds of continuous and regularly interrupted noise. For HI listeners, performance was generally similar for processed and unprocessed signals in continuous noise; however, superior performance for EEQ processing was observed in certain regularly interrupted noise backgrounds.
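The three EEQ steps listed above can be illustrated with a minimal sketch. The abstract does not give the frame length, the form of the gain, or any limit on it, so `frame_len`, `max_gain`, the non-overlapping frames, and the square-root gain rule below are all assumptions made only for illustration, not the authors' implementation.

```python
import math

def eeq(signal, frame_len=4, max_gain=10.0):
    """Illustrative energy equalization of a list of samples.

    Assumptions (not from the source): non-overlapping frames of
    `frame_len` samples, the long-term estimate is the mean energy of
    the whole signal, and low-energy frames are raised toward that
    mean with a gain capped at `max_gain`.
    """
    n = len(signal)
    if n == 0:
        return []
    long_energy = sum(x * x for x in signal) / n  # long-term estimate
    out = []
    for start in range(0, n, frame_len):
        frame = signal[start:start + frame_len]
        # (i) short-term energy, to be compared with the long-term one
        short_energy = sum(x * x for x in frame) / len(frame)
        gain = 1.0
        # (ii) raise the level of frames whose energy is below average
        if 0.0 < short_energy < long_energy:
            gain = min(math.sqrt(long_energy / short_energy), max_gain)
        out.extend(gain * x for x in frame)
    # (iii) rescale so the overall energy matches the original
    out_energy = sum(x * x for x in out) / n
    scale = math.sqrt(long_energy / out_energy) if out_energy > 0 else 1.0
    return [scale * x for x in out]
```

In this sketch a loud frame followed by a quiet one comes out with the quiet frame amplified and the loud frame slightly attenuated, while the mean energy of the whole signal is unchanged.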

Masking release for hearing-impaired listeners: The effect of increased audibility through reduction of amplitude variability

https://pureadmin.qub.ac.uk/ws/files/18161603/csl_manuscript.pdf

An iterative longest matching segment approach to speech enhancement with additive noise and channel distortion

Sub-band based histogram equalization in cepstral domain for speech recognition

Search engines originally provided placements next to the original search results for advertisers on specific keywords. Because users often search according to their interests or purchasing decisions, presenting suitable advertisements at the right time encourages them to click on search ads. With the rapid growth of advertising, a bidding mechanism has emerged in which advertisers bid on keywords for their ads. They must compose keywords carefully to increase the chance that their ads are clicked. Efficiently improving ad performance to earn more clicks remains a central task. In this paper, we focus on the smartphone domain and build a social intentional model with advertising-based features to forecast future trends in ads’ click-through rate (CTR). For the social intentional model, we analyze Chinese-language technology forum text to derive four social intentional factors: Hotness, Sentiment, Promotion, and Event. Our results indicate that knowing public opinions or events beforehand can effectively enhance click prediction. This will help advertisers adjust their keyword bids to improve ad performance via social intention.

Constructing Social Intentional Corpora to Predict Click-Through Rate for Search Advertising

This paper describes a novel modification of the Histogram Equalization approach to robust speech recognition. We propose separate equalization of the high-frequency and low-frequency bands. We study different combinations of sub-band equalization and obtain the best results when we perform a two-stage equalization. First, conventional Histogram Equalization (HEQ) is performed on the cepstral features, which does not completely equalize the high-frequency and low-frequency bands, even though the overall histogram equalization is good. In the second stage, equalization is done separately on the high-frequency and low-frequency components of the equalized cepstra. We refer to this approach as Sub-band Histogram Equalization (S-HEQ). The new set of features has better equalization of the sub-bands as well as of the overall cepstral histogram. Recognition results show relative improvements of 12% and 15% over conventional HEQ on the Aurora-2 and Aurora-4 databases, respectively.
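As a rough illustration of the first stage only, conventional HEQ can be sketched as a rank-based mapping of one cepstral coefficient's trajectory onto a reference distribution. The choice of a standard normal reference, the `(rank + 0.5) / n` CDF estimate, and the function name `heq` are assumptions for this sketch; the abstract does not specify them, nor how the cepstra are split into sub-bands for the second stage, so that stage is not shown.

```python
from statistics import NormalDist

def heq(values):
    """Rank-based histogram equalization of one feature trajectory.

    Assumption (not from the source): the reference distribution is a
    standard normal, and the empirical CDF of the value with 0-based
    rank r among n values is estimated as (r + 0.5) / n.
    """
    n = len(values)
    ref = NormalDist(0.0, 1.0)
    # Frame indices sorted by feature value; ties keep their time order.
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        cdf = (rank + 0.5) / n         # strictly inside (0, 1)
        out[idx] = ref.inv_cdf(cdf)    # matching reference quantile
    return out
```

Applied per coefficient over an utterance, this keeps the ordering of the values but forces their histogram toward the reference shape, which is the property the second, sub-band stage then tries to restore within each band.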

Sub-Band Level Histogram Equalization for Robust Speech Recognition.

In this paper, we describe a computationally efficient approach for combining speaker and noise normalization techniques. In particular, we combine the simple yet effective Histogram Equalization (HEQ) for noise compensation with vocal tract length normalization (VTLN) for speaker normalization. While it is intuitive to remove noise first and then perform VTLN, this is difficult because HEQ performs noise compensation in the cepstral domain, while VTLN involves warping in the spectral domain. We therefore investigate the use of the recently proposed T-VTLN approach to speaker normalization, in which matrix transformations are applied directly to cepstral features. We show that the speaker-specific warp factors estimated with this approach, even from noisy speech, closely match those estimated from clean speech. Further, using sub-band HEQ (S-HEQ) and T-VTLN, we obtain relative improvements in recognition accuracy over the baseline of 20% and 33.54% on the Aurora-2 and Aurora-4 tasks, respectively.

Efficient Speaker and Noise Normalization for Robust Speech Recognition.

Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent of the utterance in the form of embeddings or posteriors, and these are used as auxiliary inputs for RNN-T training and inference. Experimenting with a 50k-hour far-field English speech corpus, this study shows that running the system in non-streaming mode, where the intent representation is extracted from the entire utterance and then used to bias streaming RNN-T search from the start, provides a 5.56% relative word error rate reduction (WERR). On the other hand, a streaming system using per-frame intent posteriors as extra inputs for the RNN-T ASR system yields a 3.33% relative WERR. A further detailed analysis of the streaming system indicates that the proposed method yields especially large gains on media-playing related intents (e.g. 9.12% relative WERR on PlayMusicIntent).


Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

In this paper, we propose a method to compensate for noise and speaker variability directly in the log filter-bank (FB) domain, so that the resulting MFCC features are robust to noise and speaker variations. For noise compensation, we use the Vector Taylor Series (VTS) approach in the log FB domain, and speaker normalization is also done in the log FB domain using linear vocal tract length normalization (VTLN) matrices. For VTLN, the optimal warp factor is selected in the log FB domain using a canonical GMM, avoiding the two-pass approach needed by an HMM. Further, this can be implemented efficiently using sufficient statistics obtained from the GMM and the FB-VTLN matrices. The warp-factor selection using the GMM can also be done in the cepstral domain by applying DCT matrices, without the usual approximations associated with conventional linear VTLN. The elegance of the proposed approach is that, given the speech data, we directly obtain MFCC features that are robust to noise and speaker variations. The proposed approach shows a significant relative improvement of 31% over the baseline on the Aurora-4 task.

Noise and speaker compensation in the Log filter bank domain

An additional feature processing algorithm using Non-negative Matrix Factorization (NMF) is proposed for inclusion in the conventional extraction of Mel-frequency cepstral coefficients (MFCC) to achieve noise robustness in HMM-based speech recognition. The proposed approach reconstructs the log-Mel filterbank outputs of speech data from a set of building blocks that form the bases of a speech subspace. The bases are learned using standard NMF on the training data. A variation of the basis learning is also proposed, which uses histogram-equalized activation coefficients during training to achieve noise robustness. The proposed methods give up to 5.96% absolute improvement in recognition accuracy on the Aurora-2 task over a baseline with standard MFCCs, and up to 13.69% improvement when combined with other feature normalization techniques such as Histogram Equalization (HEQ) and Heteroscedastic Linear Discriminant Analysis (HLDA).
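The idea of learning bases with standard NMF and then reconstructing features from them can be sketched as follows. This is a hedged illustration, not the paper's procedure: it assumes the filterbank values have been made non-negative, uses the common multiplicative updates for the squared-error objective, and reconstructs new data by re-estimating activations with the bases held fixed. The function names (`learn_bases`, `project`) and all parameters are hypothetical, and the histogram-equalized variant is not shown.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frob_error(v, w, h):
    """Squared Frobenius distance between V and W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def update_h(v, w, h, eps=1e-9):
    """Multiplicative update of the activations H with W fixed."""
    wt = transpose(w)
    num = matmul(wt, v)
    den = matmul(matmul(wt, w), h)
    return [[h[k][j] * num[k][j] / (den[k][j] + eps)
             for j in range(len(h[0]))] for k in range(len(h))]

def update_w(v, w, h, eps=1e-9):
    """Multiplicative update of the bases W with H fixed."""
    ht = transpose(h)
    num = matmul(v, ht)
    den = matmul(w, matmul(h, ht))
    return [[w[i][k] * num[i][k] / (den[i][k] + eps)
             for k in range(len(w[0]))] for i in range(len(w))]

def learn_bases(v, rank, iters=200, seed=0):
    """Factor V (channels x frames, non-negative) as W H."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(rank)]
    for _ in range(iters):
        h = update_h(v, w, h)
        w = update_w(v, w, h)
    return w, h

def project(v, w, iters=200, seed=1):
    """Reconstruct V from fixed bases W (assumed reconstruction step)."""
    rng = random.Random(seed)
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(len(w[0]))]
    for _ in range(iters):
        h = update_h(v, w, h)
    return matmul(w, h), h
```

In use, `learn_bases` would be run once on training features and `project` on each test utterance, with the reconstruction `W H` replacing the original filterbank outputs before the remaining MFCC steps.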

Raghavendra Bilgi

Papers

Sub-Band Level Histogram Equalization for Robust Speech Recognition.

Efficient Speaker and Noise Normalization for Robust Speech Recognition.

Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

Noise and speaker compensation in the Log filter bank domain

Non-negative subspace projection during conventional MFCC feature extraction for noise robust speech recognition