Showing papers on "Speech coding published in 2012"


Journal ArticleDOI
Jani Lainema, Frank Bossen, Woo-Jin Han, Jung-Hye Min, Kemal Ugur
TL;DR: The design principles applied during the development of the new intra coding methods are discussed, the compression performance of the individual tools is analyzed, and the bitrate reduction provided by the HEVC intra coding over the H.264/advanced video coding reference is reported to be 22% on average and up to 36%.
Abstract: This paper provides an overview of the intra coding techniques in the High Efficiency Video Coding (HEVC) standard being developed by the Joint Collaborative Team on Video Coding (JCT-VC). The intra coding framework of HEVC follows that of traditional hybrid codecs and is built on spatial sample prediction followed by transform coding and postprocessing steps. Novel features contributing to the increased compression efficiency include a quadtree-based variable block size coding structure, block-size agnostic angular and planar prediction, adaptive pre- and postfiltering, and prediction direction-based transform coefficient scanning. This paper discusses the design principles applied during the development of the new intra coding methods and analyzes the compression performance of the individual tools. Computational complexity of the introduced intra prediction algorithms is analyzed both by deriving operational cycle counts and benchmarking an optimized implementation. Using objective metrics, the bitrate reduction provided by the HEVC intra coding over the H.264/advanced video coding reference is reported to be 22% on average and up to 36%. Significant subjective picture quality improvements are also reported when comparing the resulting pictures at fixed bitrate.
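
For readers unfamiliar with the angular modes mentioned above, the following is a minimal sketch of the core two-tap interpolation that HEVC's block-size agnostic angular prediction builds on. The function name, the flat reference row, and the angle value are illustrative choices, not the standard's reference code.

```python
import numpy as np

def angular_predict(ref_top, block_size, angle):
    """Sketch of two-tap angular intra prediction (vertical-ish modes).

    ref_top: reconstructed samples above the block; needs at least
             2 * block_size + 1 entries for the angles used here.
    angle:   per-row displacement in 1/32-sample units.
    """
    pred = np.zeros((block_size, block_size), dtype=np.int32)
    for y in range(block_size):
        disp = (y + 1) * angle            # projected displacement for this row
        idx, frac = disp >> 5, disp & 31  # integer / fractional sample offset
        for x in range(block_size):
            a = int(ref_top[x + idx])
            b = int(ref_top[x + idx + 1])
            # linear interpolation at 1/32-sample accuracy, with rounding
            pred[y, x] = (a * (32 - frac) + b * frac + 16) >> 5
    return pred

# A flat reference row reproduces itself, as a quick sanity check:
print(angular_predict(np.full(17, 128, dtype=np.int32), 8, 13))
```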

667 citations


Journal ArticleDOI
TL;DR: In this paper, five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers.
Abstract: The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires however that the precise locations of the glottal closure instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers. The five techniques compared are the Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) and the Yet Another GCI Algorithm (YAGA). The efficacy of these methods is first evaluated on clean speech, both in terms of reliability and accuracy. Their robustness to additive noise and to reverberation is also assessed. A further contribution of the paper is the evaluation of their performance on a concrete application of speech processing: the causal-anticausal decomposition of speech. It is shown that for clean speech, SEDREAMS and YAGA are the best performing techniques, both in terms of identification rate and accuracy. ZFR and SEDREAMS also show a superior robustness to additive noise and reverberation.
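
As a rough illustration of one ingredient of these detectors, the sketch below computes a SEDREAMS-style mean-based signal, assuming a known average pitch; the Blackman window and the 1.75-period span follow common descriptions of the method, but all names and defaults here are our own.

```python
import numpy as np

def mean_based_signal(speech, fs, mean_f0=120.0):
    """Smoothed 'mean-based signal' in the spirit of SEDREAMS.

    mean_f0 is an assumed average pitch; a real detector estimates it
    from the data. The Blackman window spans about 1.75 pitch periods.
    """
    period = fs / mean_f0
    half = int(round(0.875 * period))   # half of the ~1.75-period span
    win = np.blackman(2 * half + 1)
    win /= win.sum()
    # zero-phase smoothing: extrema of the result delimit the intervals
    # in which glottal closure instants are then searched for, typically
    # against peaks of the linear-prediction residual
    return np.convolve(speech, win, mode="same")
```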

241 citations


Patent
11 Dec 2012
TL;DR: In this article, power consumption for a computing device is managed by one or more keywords: if an audio input obtained by the device includes a keyword, a network interface module and/or an application processing module of the device may be activated.
Abstract: Power consumption for a computing device may be managed by one or more keywords. For example, if an audio input obtained by the computing device includes a keyword, a network interface module and/or an application processing module of the computing device may be activated. The audio input may then be transmitted via the network interface module to a remote computing device, such as a speech recognition server. Alternately, the computing device may be provided with a speech recognition engine configured to process the audio input for on-device speech recognition.

140 citations


Proceedings ArticleDOI
01 Dec 2012
TL;DR: A voice conversion technique for noisy environments, where parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal; its effectiveness is confirmed by comparison with a conventional Gaussian Mixture Model (GMM)-based method.
Abstract: This paper presents a voice conversion (VC) technique for noisy environments, where parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal. The parallel exemplars (dictionary) consist of the source exemplars and target exemplars, having the same texts uttered by the source and target speakers. The input source signal is decomposed into the source exemplars, noise exemplars obtained from the input signal, and their weights (activities). Then, by using the weights of the source exemplars, the converted signal is constructed from the target exemplars. We carried out speaker conversion tasks using clean speech data and noise-added speech data. The effectiveness of this method was confirmed by comparing its effectiveness with that of a conventional Gaussian Mixture Model (GMM)-based method.
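
The exemplar decomposition described above is typically computed with non-negative matrix factorization; the following is a minimal sketch under that assumption, with illustrative names, not the paper's exact implementation.

```python
import numpy as np

def exemplar_convert(X, A_src, A_noise, B_tgt, iters=200, eps=1e-10):
    """Explain source spectra X with a joint [source | noise] exemplar
    dictionary, then reuse only the source activations on the parallel
    target dictionary. All matrices hold non-negative magnitude spectra
    (columns are exemplars / frames); names are illustrative."""
    D = np.hstack([A_src, A_noise])               # joint dictionary
    H = np.random.rand(D.shape[1], X.shape[1])    # activation weights
    ones = np.ones_like(X)
    for _ in range(iters):                        # KL-NMF multiplicative updates
        H *= (D.T @ (X / (D @ H + eps))) / (D.T @ ones + eps)
    H_src = H[: A_src.shape[1]]                   # drop the noise activations
    return B_tgt @ H_src                          # converted (target) spectra
```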

138 citations


Patent
06 Dec 2012
TL;DR: In this paper, a signal analyzer for analyzing the audio signal is provided, which determines whether an audio portion is represented in the encoder output signal as a first encoded signal from the first coding branch or as a second encoded signal from the second coding branch.
Abstract: An audio encoder for encoding an audio signal has a first coding branch, the first coding branch comprising a first converter for converting a signal from a time domain into a frequency domain. Furthermore, the audio encoder has a second coding branch comprising a second time/frequency converter. Additionally, a signal analyzer for analyzing the audio signal is provided. The signal analyzer, on the one hand, determines whether an audio portion appears in the encoder output signal as a first encoded signal from the first coding branch or as a second encoded signal from the second coding branch. On the other hand, the signal analyzer determines a time/frequency resolution to be applied by the converters when generating the encoded signals. An output interface includes, in addition to the first encoded signal and the second encoded signal, resolution information identifying the resolution used by the first time/frequency converter and by the second time/frequency converter.

128 citations


Journal ArticleDOI
TL;DR: In this paper, a set of speech processing tools created by introducing sparsity constraints into the linear prediction framework is presented; these tools have been shown to be effective in several problems related to the modeling and coding of speech signals.
Abstract: The aim of this paper is to provide an overview of Sparse Linear Prediction, a set of speech processing tools created by introducing sparsity constraints into the linear prediction framework. These tools have been shown to be effective in several problems related to the modeling and coding of speech signals. For speech analysis, we provide predictors that are accurate in modeling the speech production process and overcome problems related to traditional linear prediction. In particular, the predictors obtained offer a more effective decoupling of the vocal tract transfer function and its underlying excitation, making it a very efficient method for the analysis of voiced speech. For speech coding, we provide predictors that shape the residual according to the characteristics of the sparse encoding techniques, resulting in more straightforward coding strategies. Furthermore, encouraged by the promising application of compressed sensing in signal compression, we investigate its formulation and application to sparse linear predictive coding. The proposed estimators are all solutions to convex optimization problems, which can be solved efficiently and reliably using, e.g., interior-point methods. Extensive experimental results are provided to support the effectiveness of the proposed methods, showing the improvements over traditional linear prediction in both speech analysis and coding.
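
As a hedged sketch of the kind of convex program involved, the snippet below fits a short-term predictor with a 1-norm on the residual and on the coefficients, using a generic convex solver rather than the interior-point implementations the paper mentions; the order and regularization weight are illustrative.

```python
import numpy as np
import cvxpy as cp

def sparse_lp(x, order=10, gamma=0.1):
    """Fit predictor coefficients a so that x[k] ~ sum_i a[i] * x[k-1-i],
    minimizing a 1-norm on the residual (sparse residual) plus a 1-norm
    on the coefficients (sparse predictor)."""
    n = len(x)
    # lagged-signal matrix: row k-order holds x[k-1], x[k-2], ..., x[k-order]
    X = np.column_stack([x[order - 1 - i : n - 1 - i] for i in range(order)])
    target = x[order:]
    a = cp.Variable(order)
    cost = cp.norm1(target - X @ a) + gamma * cp.norm1(a)
    cp.Problem(cp.Minimize(cost)).solve()
    return a.value
```

Replacing both 1-norms with squared 2-norms recovers classical autocorrelation-style linear prediction, which makes the contrast with the sparse formulation easy to test on the same frame.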

128 citations


Journal ArticleDOI
TL;DR: The method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.
Abstract: The enhancement of speech degraded by real-world interferers is a highly relevant and difficult task. Its importance arises from the multitude of practical applications, whereas the difficulty is due to the fact that interferers are often nonstationary and potentially similar to speech. The goal of monaural speech enhancement is to separate a single mixture into its underlying clean speech and interferer components. This under-determined problem is solved by incorporating prior knowledge in the form of learned speech and interferer dictionaries. The clean speech is recovered from the degraded speech by sparse coding of the mixture in a composite dictionary consisting of the concatenation of a speech and an interferer dictionary. Enhancement performance is measured using objective measures and is limited by two effects. A too sparse coding of the mixture causes the speech component to be explained with too few speech dictionary atoms, which induces an approximation error we denote source distortion. However, a too dense coding of the mixture results in source confusion, where parts of the speech component are explained by interferer dictionary atoms and vice-versa. Our method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.
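
A minimal sketch of the composite-dictionary idea, assuming magnitude-spectrum frames and pre-trained dictionaries; all names and the LASSO coder are our illustrative choices, not the paper's exact estimator.

```python
import numpy as np
from sklearn.linear_model import Lasso

def enhance(mixture, D_speech, D_interf, alpha=0.1):
    """Code each mixture frame (column of `mixture`) over the composite
    [speech | interferer] dictionary, then resynthesize from the speech
    part only. alpha steers the sparsity of the code: too large gives
    source distortion, too small gives source confusion."""
    D = np.hstack([D_speech, D_interf])
    coder = Lasso(alpha=alpha, positive=True, max_iter=5000)
    frames = []
    for frame in mixture.T:
        coder.fit(D, frame)                          # sparse code the mixture
        w_speech = coder.coef_[: D_speech.shape[1]]  # speech activations only
        frames.append(D_speech @ w_speech)
    return np.array(frames).T
```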

111 citations


Journal ArticleDOI
TL;DR: A new algorithm is proposed for steganography in low bit-rate VoIP audio streams by integrating information hiding into the process of speech encoding, thus maintaining synchronization between information hiding and speech encoding.
Abstract: Low bit-rate speech codecs have been widely used in audio communications like VoIP and mobile communications, so steganography in low bit-rate audio streams would have broad applications in practice. In this paper, the authors propose a new algorithm for steganography in low bit-rate VoIP audio streams by integrating information hiding into the process of speech encoding. The proposed algorithm performs data embedding while pitch period prediction is conducted during low bit-rate speech encoding, thus maintaining synchronization between information hiding and speech encoding. The steganography algorithm not only achieves high speech quality and resists steganalysis detection, but is also compatible with a standard low bit-rate speech codec, introducing no further delay through data embedding and extraction. Testing shows that, with the proposed algorithm, the data embedding rate of the secret message can attain 4 bits/frame (133.3 bits/second, i.e., 4 bits every 30 ms frame).

109 citations


Patent
Elizabeth V. Woodward, Shunguo Yan
26 Sep 2012
TL;DR: In this article, a speaker providing speech in the audio track of at least one segment is identified using information retrieved from a social network service source; a speech profile for the speaker and an acoustic profile for the segment are generated from that information, and an automatic speech recognition engine is dynamically configured for operation on the audio track corresponding to the speaker.
Abstract: Mechanisms for performing dynamic automatic speech recognition on a portion of multimedia content are provided. Multimedia content is segmented into homogeneous segments of content with regard to speakers and background sounds. For the at least one segment, a speaker providing speech in an audio track of the at least one segment is identified using information retrieved from a social network service source. A speech profile for the speaker is generated using information retrieved from the social network service source, an acoustic profile for the segment is generated based on the generated speech profile, and an automatic speech recognition engine is dynamically configured for operation on the at least one segment based on the acoustic profile. Automatic speech recognition operations are performed on the audio track of the at least one segment to generate a textual representation of speech content in the audio track corresponding to the speaker.

104 citations


Journal Article
TL;DR: All aspects of this standardization effort are outlined, starting with the history and motivation of the MPEG work item, describing all technical features of the final system, and further discussing listening test results and performance numbers which show the advantages of the new system over current state-of-the-art codecs.
Abstract: In early 2012 the ISO/IEC JTC1/SC29/WG11 (MPEG) finalized the new MPEG-D Unified Speech and Audio Coding standard. The new codec brings together the previously separated worlds of general audio coding and speech coding. It does so by integrating elements from audio coding and speech coding into a unified system. The present publication outlines all aspects of this standardization effort, starting with the history and motivation of the MPEG work item, describing all technical features of the final system, and further discussing listening test results and performance numbers which show the advantages of the new system over current state-of-the-art codecs.

88 citations


Patent
21 Sep 2012
TL;DR: In this article, the authors describe techniques for selecting audio from locations that are most likely to be sources of spoken commands or words, where directional audio signals are generated to emphasize sounds from different regions of an environment.
Abstract: Techniques are described for selecting audio from locations that are most likely to be sources of spoken commands or words. Directional audio signals are generated to emphasize sounds from different regions of an environment. The directional audio signals are processed by an automated speech recognizer to generate recognition confidence values corresponding to each of the different regions, and the region resulting in the highest recognition confidence value is selected as the region most likely to contain a user who is speaking commands.
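
A toy sketch of the selection logic, assuming a hypothetical `recognize` callable that returns an ASR confidence score for a beamformed region signal (both names are ours):

```python
def pick_region(directional_signals, recognize):
    """directional_signals maps a region id to its beamformed audio;
    the region whose audio yields the highest recognition confidence
    is selected as the likely location of the talker."""
    scores = {region: recognize(signal)
              for region, signal in directional_signals.items()}
    return max(scores, key=scores.get)
```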

Journal ArticleDOI
TL;DR: The proposed VOP detection method has shown significant improvement in the performance compared to the existing method under clean as well as coded cases and is analyzed in CV recognition by using VOP as an anchor point.
Abstract: In this paper, we propose a method for detecting the vowel onset points (VOPs) for low bit rate coded speech. A VOP is the instant at which the onset of a vowel takes place in the speech signal. VOPs play an important role in applications such as consonant-vowel (CV) unit recognition and speech rate modification. The proposed VOP detection method is based on the spectral energy present in the glottal closure region of the speech signal. The speech coders considered in this study are Global System for Mobile Communications (GSM) full rate, code-excited linear prediction (CELP), and mixed-excitation linear prediction (MELP). The TIMIT database and CV units collected from the broadcast news corpus are used for evaluation. Performance of the proposed method is compared with existing methods, which use the combination of evidence from the excitation source, spectral peaks energy, and modulation spectrum. The proposed VOP detection method has shown significant improvement in performance compared to the existing methods under clean as well as coded conditions. The effectiveness of the proposed VOP detection method is also analyzed for CV recognition by using the VOP as an anchor point.

Patent
24 Apr 2012
TL;DR: In this paper, a method of operating an audio system in an automobile includes identifying a user of the audio system and an audio recording playing on the audio systems is identified and stored in memory in association with the identified user and the identified audio recording.
Abstract: A method of operating an audio system in an automobile includes identifying a user of the audio system. An audio recording playing on the audio system is identified. An audio setting entered into the audio system by the identified user while the audio recording is being played by the audio system is sensed. The sensed audio setting is stored in memory in association with the identified user and the identified audio recording. The audio recording is retrieved from memory with the sensed audio setting being embedded in the retrieved audio recording as a watermark signal. The retrieved audio recording is played on the audio system with the embedded sensed audio setting being automatically implemented by the audio system during the playing.

Journal ArticleDOI
TL;DR: Experimental results prove the efficiency of the proposed hiding technique since the stego signals are perceptually indistinguishable from the equivalent cover signal, while being able to recover the secret speech message with slight degradation in the quality.
Abstract: A new method to secure speech communication using the discrete wavelet transform (DWT) and the fast Fourier transform is presented in this article. In the first phase of the hiding technique, we separate the speech high-frequency components from the low-frequency components using the DWT. In the second phase, we exploit the low-pass spectral properties of the speech spectrum to hide another secret speech signal in the low-amplitude high-frequency regions of the cover speech signal. The proposed method allows hiding a large amount of secret information while rendering steganalysis more complex. Experimental results prove the efficiency of the proposed hiding technique: the stego signals are perceptually indistinguishable from the equivalent cover signal, while the secret speech message can be recovered with only slight degradation in quality.
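
The article combines the DWT with FFT analysis; the sketch below keeps only the DWT band-substitution step, with an assumed wavelet and hiding gain, to illustrate where the secret signal is placed.

```python
import numpy as np
import pywt

def hide(cover, secret, wavelet="db8", alpha=0.05):
    """Substitute a scaled secret signal into the (low-amplitude)
    detail band of the cover speech. wavelet/alpha are assumptions."""
    cA, cD = pywt.dwt(cover, wavelet)          # approximation / detail split
    s = np.resize(secret, cD.shape)            # fit the secret into the band
    return pywt.idwt(cA, alpha * s, wavelet)   # detail band now carries it

def reveal(stego, wavelet="db8", alpha=0.05):
    _, cD = pywt.dwt(stego, wavelet)
    return cD / alpha                          # approximate recovery
```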

Patent
13 Sep 2012
TL;DR: In this article, a speech recognition system and a voice activity detection unit are coupled to the speech recognition, and the VADU is used to detect whether an audio signal is a voice signal and accordingly generate a voice-activity detection result to control whether the system should perform speech recognition upon the audio signal.
Abstract: A signal processing apparatus includes a speech recognition system and a voice activity detection unit. The voice activity detection unit is coupled to the speech recognition system, and arranged for detecting whether an audio signal is a voice signal and accordingly generating a voice activity detection result to the speech recognition system to control whether the speech recognition system should perform speech recognition upon the audio signal.
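
A toy sketch of the claimed arrangement, with a simple energy detector standing in for the voice activity detection unit and a hypothetical `asr` callable (both are our assumptions, not the patent's circuitry):

```python
import numpy as np

def vad_gated_recognition(frames, asr, energy_thresh=1e-4):
    """Invoke the (hypothetical) `asr` callable only on frames whose
    mean energy exceeds a threshold, leaving it idle on silence."""
    for frame in frames:
        if np.mean(np.asarray(frame, dtype=float) ** 2) > energy_thresh:
            yield asr(frame)    # voice detected: run recognition
        # otherwise the recognizer is not invoked for this frame
```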

Proceedings ArticleDOI
25 Mar 2012
TL;DR: This paper proposes an effective splicing detection method for audio recordings that detects abnormal differences in the local noise levels of an audio signal, and demonstrates the efficacy and robustness of the proposed method using both synthetic and realistic audio splicing forgeries.
Abstract: One common form of tampering in digital audio signals is known as splicing, where sections from one audio recording are inserted into another. In this paper, we propose an effective splicing detection method for audio recordings. Our method achieves this by detecting abnormal differences in the local noise levels in an audio signal. This estimation of local noise levels is based on an observed property of audio signals: they tend to have kurtosis close to a constant in the band-pass filtered domain. We demonstrate the efficacy and robustness of the proposed method using both synthetic and realistic audio splicing forgeries.
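
A rough sketch of the kurtosis cue under stated assumptions; the band, the window size, and the direct use of a kurtosis trace are simplifications of the paper's noise-level estimator.

```python
import numpy as np
from scipy.signal import butter, sosfilt
from scipy.stats import kurtosis

def local_kurtosis_trace(x, fs, band=(2000.0, 4000.0), win=4096):
    """Band-pass the audio and track kurtosis over half-overlapping
    windows; a spliced-in segment with a different local noise level
    shows up as an abnormal jump in this trace."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    y = sosfilt(sos, x)
    starts = range(0, len(y) - win, win // 2)
    return np.array([kurtosis(y[s : s + win]) for s in starts])
```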

Patent
28 Sep 2012
TL;DR: In this paper, a signal indicating the existence of a voice call is detected in a vehicle and it is determined whether a portable device inside the vehicle is outputting audio to a first audio device.
Abstract: Systems and methods are provided for playing media assets in a vehicle. A signal indicating the existence of a voice call is detected in a vehicle. It is determined whether a portable device inside the vehicle is outputting audio to a first audio device. If a portable device is outputting audio to a first audio device, the audio level of the output audio component is compared with an audio level threshold to determine whether the output of the audio component to the first audio device interferes with audio output of the call. If the output audio component is determined to interfere with the audio output of the call, the audio component output through the first audio device is repressed. If the output audio component is determined to not interfere with the audio output of the call, the audio component is output through the first audio device without interruption.

Patent
31 Jul 2012
TL;DR: In this article, the transmission of noise parameters for improving automatic speech recognition has been discussed, where a system includes one or more microphones, wherein each microphone is configured to produce an audio signal.
Abstract: Methods and systems for transmission of noise parameters for improving automatic speech recognition are disclosed. A system includes one or more microphones, wherein each microphone is configured to produce an audio signal. The system also includes a noise reduction module configured to generate a noise-reduced audio signal and a noise parameter. Furthermore, the system includes a transmitter configured to transmit, to a computing device, the noise-reduced audio signal and a noise parameter. The computing device may use the noise parameter in obtaining a model to use for performing automatic speech recognition.

Patent
14 Aug 2012
TL;DR: In this paper, an apparatus for generating an audio output signal having two or more audio output channels from an audio input signal having two or more audio input channels is presented. The apparatus comprises a provider (110) and a signal processor (120).
Abstract: An apparatus for generating an audio output signal having two or more audio output channels from an audio input signal having two or more audio input channels is provided. The apparatus comprises a provider (110) and a signal processor (120). The provider (110) is adapted to provide first covariance properties of the audio input signal. The signal processor (120) is adapted to generate the audio output signal by applying a mixing rule on at least two of the two or more audio input channels. The signal processor (120) is configured to determine the mixing rule based on the first covariance properties of the audio input signal and based on second covariance properties of the audio output signal, the second covariance properties being different from the first covariance properties.
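
One standard way to realize such a covariance-matching mixing rule is by whitening and re-coloring; the sketch below takes the free orthogonal factor as the identity, which is our simplification rather than the patent's exact rule.

```python
import numpy as np

def mixing_matrix(C_in, C_out, eps=1e-9):
    """Return M with M @ C_in @ M.T == C_out: whiten with C_in^{-1/2},
    re-color with C_out^{1/2}. Eigenvalues are floored at eps so the
    inverse square root stays well defined."""
    def sqrt_pair(C):
        w, V = np.linalg.eigh(C)
        w = np.maximum(w, eps)
        return (V * np.sqrt(w)) @ V.T, (V / np.sqrt(w)) @ V.T
    C_out_sqrt, _ = sqrt_pair(C_out)
    _, C_in_isqrt = sqrt_pair(C_in)
    return C_out_sqrt @ C_in_isqrt

# y = M @ x then has covariance M C_in M^T = C_out (up to the eps floor)
```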

Patent
01 May 2012
TL;DR: In this article, the authors present systems, methods, and apparatus for determining audio context between an audio source and an audio sink and selecting signal profiles based at least in part on that audio context.
Abstract: Described herein are systems, methods, and apparatus for determining audio context between an audio source and an audio sink and selecting signal profiles based at least in part on that audio context. The signal profiles may include noise cancellation which is configured to facilitate operation within the audio context. Audio context may include user-to-user and user-to-device communications.

Book
30 Oct 2012
TL;DR: This edition of Introduction to Data Compression provides an extensive introduction to the theory underlying today's compression techniques, with detailed instruction for their applications, using several examples to explain the concepts.
Abstract: Each edition of Introduction to Data Compression has widely been considered the best introduction and reference text on the art and science of data compression, and the fourth edition continues in this tradition. Data compression techniques and technology are ever-evolving, with new applications in image, speech, text, audio, and video. The fourth edition includes all the cutting-edge updates the reader will need during the work day and in class. Khalid Sayood provides an extensive introduction to the theory underlying today's compression techniques, with detailed instruction for their applications, using several examples to explain the concepts. Encompassing the entire field of data compression, Introduction to Data Compression covers lossless and lossy compression, Huffman coding, arithmetic coding, dictionary techniques, context-based compression, and scalar and vector quantization. Khalid Sayood provides a working knowledge of data compression, giving the reader the tools to develop a complete and concise compression package upon completion of his book. New content has been added, including a more detailed description of the JPEG 2000 standard and speech coding for internet applications. The book explains established and emerging standards in depth, including JPEG 2000, JPEG-LS, MPEG-2, H.264, JBIG 2, ADPCM, LPC, CELP, MELP, and iLBC. Source code is provided via a companion web site that gives readers the opportunity to build their own algorithms and to choose and implement techniques in their own applications.

Patent
21 Dec 2012
TL;DR: In this article, the extracted audio features are compared to stored audio templates that include ranges and/or values for certain features and are tagged for specific ranges or values, and the tags are used to determine the semantic audio data that includes genre, instrumentation, style, acoustical dynamics, and emotive descriptor for the audio signal.
Abstract: System, apparatus and method for determining semantic information from audio, where incoming audio is sampled and processed to extract audio features, including temporal, spectral, harmonic and rhythmic features. The extracted audio features are compared to stored audio templates that include ranges and/or values for certain features and are tagged for specific ranges and/or values. The semantic information may be associated with audio signature data. Extracted audio features that are most similar to one or more templates from the comparison are identified according to the tagged information. The tags are used to determine the semantic audio data, which includes genre, instrumentation, style, acoustical dynamics, and emotive descriptors for the audio signal.

Proceedings ArticleDOI
13 Nov 2012
TL;DR: A novel approach to automatic recognition of code-switching speech that uses parallel automatic speech recognizers for recognition followed by rescoring; a hybrid interpolation-and-merging acoustic model adaptation approach is also proposed, and the adapted models show a reduction in WER for code-switching speech recognition.
Abstract: In this paper, we propose a novel approach to automatic recognition of code-switching speech. The proposed method consists of two phases: automatic speech recognition and rescoring. The framework uses parallel automatic speech recognizers for speech recognition; the lattices produced are subsequently joined and rescored to estimate the most probable word sequence. Experiments show that the proposed approach achieves a reduction of more than 5% WER when tested on English/Malay code-switching speech. In addition, the framework has proven to be very robust. We also propose an acoustic model adaptation approach, a hybrid of interpolation and merging, to cross-adapt acoustic models of different languages for recognizing code-switching speech. The adapted acoustic models show a reduction in WER when used for code-switching speech recognition.

Journal ArticleDOI
TL;DR: An adaptive suboptimal pulse combination constrained (ASOPCC) method is presented to embed data in the compressed speech signal of the AMR-WB codec; it takes advantage of the "redundancy" created by the non-exhaustive search of the algebraic codebook to encode secret information.

Journal ArticleDOI
TL;DR: A single-channel speaker identification algorithm that provides an estimate of the signal-to-signal ratio (SSR) as a by-product is proposed, along with a sinusoidal model-based algorithm for speech separation.
Abstract: In this paper, we present a novel system for joint speaker identification and speech separation. For speaker identification, a single-channel speaker identification algorithm is proposed which provides an estimate of signal-to-signal ratio (SSR) as a by-product. For speech separation, we propose a sinusoidal model-based algorithm. The speech separation algorithm consists of a double-talk/single-talk detector followed by a minimum mean square error estimator of sinusoidal parameters for finding optimal codevectors from pre-trained speaker codebooks. In evaluating the proposed system, we start from a situation where we have prior information of codebook indices, speaker identities and SSR-level, and then, by relaxing these assumptions one by one, we demonstrate the efficiency of the proposed fully blind system. In contrast to previous studies that mostly focus on automatic speech recognition (ASR) accuracy, here we report the objective and subjective results as well. The results show that the proposed system performs as well as the best of the state-of-the-art in terms of perceived quality, while its performance in terms of speaker identification and automatic speech recognition is generally lower. It outperforms the state-of-the-art in terms of intelligibility, showing that the ASR results are not conclusive. The proposed method achieves, on average, 52.3% ASR accuracy, 41.2 points in MUSHRA and 85.9% speech intelligibility.

Journal ArticleDOI
Kyu Woong Hwang, Soo-Young Lee
TL;DR: A crowdsourcing framework that models the combination of scene, event, and phone context to overcome environmental audio recognition issues is proposed; audio scenes, events, and phone context are classified with 85.2%, 77.6%, and 88.9% accuracy, respectively.
Abstract: Environmental audio recognition through mobile devices is difficult because of background noise, unseen audio events, and changes in audio channel characteristics due to the phone's context, e.g., whether the phone is in the user's pocket or in his hand. We propose a crowdsourcing framework that models the combination of scene, event, and phone context to overcome these issues. The framework gathers audio data from many people and shares user-generated models through a cloud server to accurately classify unseen audio data. A Gaussian histogram is used to represent an audio clip with a small number of parameters, and a k-nearest neighbor classifier allows the easy incorporation of new training data into the system. Using the Kullback-Leibler divergence between two Gaussian histograms as the distance measure, we find that audio scenes, events, and phone context are classified with 85.2%, 77.6%, and 88.9% accuracy, respectively.
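
A minimal sketch of the classification step, modeling each Gaussian-histogram clip summary as a diagonal Gaussian with per-dimension means and variances (our assumption) and using a symmetrized KL distance with k-NN voting; all names are illustrative.

```python
import numpy as np
from collections import Counter

def kl_diag(m1, v1, m2, v2):
    """KL divergence between diagonal Gaussians N(m1, v1) and N(m2, v2)."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def knn_label(query, references, k=5):
    """`query` is a (mean, var) clip summary; `references` holds
    (mean, var, label) tuples shared through the cloud server.
    Labels are decided by k-NN voting under a symmetrized KL distance."""
    qm, qv = query
    by_dist = sorted(
        (kl_diag(qm, qv, m, v) + kl_diag(m, v, qm, qv), label)
        for m, v, label in references)
    return Counter(label for _, label in by_dist[:k]).most_common(1)[0][0]
```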

Patent
09 Mar 2012
TL;DR: In this paper, a system, method, and computer readable storage medium generates an audio fingerprint for an input audio clip that is robust to differences in key, instrumentation, and other performance variations.
Abstract: A system, method, and computer readable storage medium generate an audio fingerprint for an input audio clip that is robust to differences in key, instrumentation, and other performance variations. The audio fingerprint comprises a sequence of intervalgrams that represent a melody in an audio clip according to pitch intervals between different time points in the clip. The fingerprint for an input audio clip can be compared to a set of reference fingerprints in a reference database to determine a matching reference audio clip.

Journal ArticleDOI
TL;DR: A weighted least squares (WLS) training procedure is suggested that facilitates the possibility of imposing a compact semiparametric model on the SVM, which results in a dramatic complexity reduction that allows the proposed hybrid WLS-SVC/HMM system to perform real-time speech decoding on a connected-digit recognition task (SpeechDat Spanish database).
Abstract: In the last years, support vector machines (SVMs) have shown excellent performance in many applications, especially in the presence of noise. In particular, SVMs offer several advantages over artificial neural networks (ANNs) that have attracted the attention of the speech processing community. Nevertheless, their high computational requirements prevent them from being used in practice in automatic speech recognition (ASR), where ANNs have proven to be successful. The high complexity of SVMs in this context arises from the use of huge speech training databases with millions of samples and highly overlapped classes. This paper suggests the use of a weighted least squares (WLS) training procedure that facilitates the possibility of imposing a compact semiparametric model on the SVM, which results in a dramatic complexity reduction. Such a complexity reduction with respect to conventional SVMs, which is between two and three orders of magnitude, allows the proposed hybrid WLS-SVC/HMM system to perform real-time speech decoding on a connected-digit recognition task (SpeechDat Spanish database). The experimental evaluation of the proposed system shows encouraging performance levels in clean and noisy conditions, although further improvements are required to reach the maturity level of current context-dependent HMM-based recognizers.
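
For orientation, least-squares SVM variants replace the SVM's quadratic program with a single linear system; the sketch below shows the weighted function-estimation form of that system and omits the paper's semiparametric model compaction. The kernel Gram matrix K, weights, and gamma are assumed inputs.

```python
import numpy as np

def wls_svm_train(K, y, weights, gamma=1.0):
    """Solve the (weighted) least-squares SVM linear system
        [0   1^T             ] [b    ]   [0]
        [1   K + W^{-1}/gamma] [alpha] = [y]
    where W = diag(weights). One linear solve replaces the usual QP,
    which is where the complexity saving comes from."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.diag(1.0 / (gamma * np.asarray(weights, dtype=float)))
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.asarray(y, dtype=float))))
    b, alpha = sol[0], sol[1:]
    return b, alpha   # decision value: f(x) = sum_i alpha_i k(x, x_i) + b
```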

Journal ArticleDOI
TL;DR: The results suggest that there is potential for speech intelligibility improvement when an enhancement of the onsets of the speech envelope is included in the signal processing of auditory prostheses.
Abstract: Recent studies have shown that transient parts of a speech signal contribute most to speech intelligibility in normal-hearing listeners. In this study, the influence of enhancing the onsets of the envelope of the speech signal on speech intelligibility in noisy conditions using an eight channel cochlear implant vocoder simulation was investigated. The enhanced envelope (EE) strategy emphasizes the onsets of the speech envelope by deriving an additional peak signal at the onsets in each frequency band. A sentence recognition task in stationary speech shaped noise showed a significant speech reception threshold (SRT) improvement of 2.5 dB for the EE in comparison to the reference continuous interleaved sampling strategy and of 1.7 dB when an ideal Wiener filter was used for the onset extraction on the noisy signal. In a competitive talker condition, a significant SRT improvement of 2.6 dB was measured. A benefit was obtained in all experiments with the peak signal derived from the clean speech. Although the EE strategy is not effective in many real-life situations, the results suggest that there is potential for speech intelligibility improvement when an enhancement of the onsets of the speech envelope is included in the signal processing of auditory prostheses.
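
A toy sketch of the onset-emphasis idea for a single filter-bank channel; the gain and the half-wave-rectified envelope derivative are illustrative stand-ins for the EE strategy's per-band peak signal inside a CIS-style processing chain.

```python
import numpy as np
from scipy.signal import hilbert

def enhance_onsets(band_signal, gain=4.0):
    """Emphasize envelope onsets in one filter-bank channel: add a
    peak signal built from the rising part of the envelope derivative."""
    env = np.abs(hilbert(band_signal))                      # channel envelope
    onset = np.maximum(np.diff(env, prepend=env[0]), 0.0)   # rising edges only
    return env + gain * onset       # envelope with emphasized onsets
```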

Patent
30 Jul 2012
TL;DR: In this article, a method for providing audio streams to multiple listeners using a common audio source, such as a set of loudspeakers, and, optionally, a shared audio stream is described.
Abstract: The various embodiments relate generally to systems, devices, apparatuses, and methods for providing audio streams to multiple listeners, and more specifically, to a system, a device, and a method for providing independent listener-specific audio streams to multiple listeners using a common audio source, such as a set of loudspeakers, and, optionally, a shared audio stream. In some embodiments, a method includes identifying a first audio stream for reception at a first region to be canceled at a second region, and generating a cancellation signal that is projected in another audio stream destined for the second region. The cancellation signal and the first audio stream are combined at the second region. Further, a compensation signal to reduce the cancellation signal at the first region can be generated.