
Showing papers on "Speech coding" published in 2011


Patent
13 Jun 2011
TL;DR: In this paper, a user speech model associated with the user is accessed, and a determination is made as to whether background audio in the audio signal is below a defined threshold; in response to determining that it is, the model is adapted based on the audio signal and used for noise compensation.
Abstract: An audio signal generated by a device based on audio input from a user may be received. The audio signal may include at least a user audio portion that corresponds to one or more user utterances recorded by the device. A user speech model associated with the user may be accessed and a determination may be made as to whether background audio in the audio signal is below a defined threshold. In response to determining that the background audio in the audio signal is below the defined threshold, the accessed user speech model may be adapted based on the audio signal to generate an adapted user speech model that models speech characteristics of the user. Noise compensation may be performed on the received audio signal using the adapted user speech model to generate a filtered audio signal with reduced background audio compared to the received audio signal.
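The gating-then-adapting logic described above can be sketched in a few lines. This is a minimal illustration, assuming an energy-percentile background estimate and a running-mean "model"; the function names, threshold value, and adaptation rule are all hypothetical stand-ins for details the patent leaves unspecified.

```python
import numpy as np

def background_level_db(signal: np.ndarray, frame_len: int = 512) -> float:
    """Estimate background level as a low percentile of frame energies (dB).
    The percentile heuristic is an assumption, not the patent's method."""
    n = len(signal) // frame_len * frame_len
    energies = np.mean(signal[:n].reshape(-1, frame_len) ** 2, axis=1) + 1e-12
    return 10.0 * np.log10(np.percentile(energies, 10))

def maybe_adapt_model(model_mean: np.ndarray, features: np.ndarray,
                      signal: np.ndarray, threshold_db: float = -45.0,
                      alpha: float = 0.9) -> np.ndarray:
    """Adapt the user model only when background audio is below threshold."""
    if background_level_db(signal) < threshold_db:
        # Exponential update of feature means: a simplified stand-in for
        # adapting a full user speech model.
        model_mean = alpha * model_mean + (1.0 - alpha) * features.mean(axis=0)
    return model_mean
```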

276 citations



Journal ArticleDOI
TL;DR: This article presents a tutorial overview of models for estimating the quality experienced by users of speech transmission and communication services, serving as a guide to an appropriate usage of the multitude of current and emerging speech quality models.
Abstract: This article presents a tutorial overview of models for estimating the quality experienced by users of speech transmission and communication services. Such models can be classified as either parametric or signal based. Signal-based models use input speech signals measured at the electrical or acoustic interfaces of the transmission channel. Parametric models, on the other hand, depend on signal and system parameters estimated during network planning or at run time. This tutorial describes the underlying principles as well as advantages and limitations of existing models. It also presents new developments, thus serving as a guide to an appropriate usage of the multitude of current and emerging speech quality models.

135 citations


Journal ArticleDOI
TL;DR: It is revealed that, contrary to existing thought, the inactive frames of VoIP streams are more suitable for data embedding than the active frames of the streams; that is, steganography in the inactive audio frames attains a larger data embedding capacity than that in the active audio frames under the same imperceptibility.
Abstract: This paper describes a novel high-capacity steganography algorithm for embedding data in the inactive frames of low bit rate audio streams encoded by the G.723.1 source codec, which is used extensively in Voice over Internet Protocol (VoIP). This study reveals that, contrary to existing thought, the inactive frames of VoIP streams are more suitable for data embedding than the active frames of the streams; that is, steganography in the inactive audio frames attains a larger data embedding capacity than that in the active audio frames under the same imperceptibility. By analyzing the concealment of steganography in the inactive frames of low bit rate audio streams encoded by the G.723.1 codec at 6.3 kb/s, the authors propose a new algorithm for steganography in different speech parameters of the inactive frames. Performance evaluation shows that embedding data in various speech parameters leads to different levels of concealment. An improved voice activity detection algorithm is suggested for detecting inactive audio frames, taking packet loss into account. Experimental results show that our proposed steganography algorithm not only achieves perfect imperceptibility but also gains a high data embedding rate of up to 101 bits/frame, indicating that the data embedding capacity of the proposed algorithm is much larger than those of previously suggested algorithms.
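The embedding step itself reduces to bit substitution once the codec parameters of an inactive frame are exposed. The sketch below assumes those parameters are already unpacked as integer codewords (parsing the G.723.1 bitstream is out of scope here) and shows plain LSB substitution; which parameters can safely carry bits is exactly what the paper's concealment analysis determines.

```python
from typing import List

def embed_bits(params: List[int], bits: List[int]) -> List[int]:
    """Embed one payload bit into the LSB of each parameter codeword."""
    out = list(params)
    for i, bit in enumerate(bits[: len(out)]):
        out[i] = (out[i] & ~1) | (bit & 1)
    return out

def extract_bits(params: List[int], n_bits: int) -> List[int]:
    """Recover the payload from the parameter LSBs."""
    return [p & 1 for p in params[:n_bits]]

# Applied only to frames the (packet-loss-aware) VAD flags as inactive:
# if not frame_is_active: frame_params = embed_bits(frame_params, payload)
```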

127 citations


Journal ArticleDOI
TL;DR: The results of the numerical simulation support the effectiveness of the proposed approach for environmental audio classification with over 10% accuracy-rate improvement compared to the MFCC features.
Abstract: Audio feature extraction and classification are important tools for audio signal analysis in many applications, such as multimedia indexing and retrieval, and auditory scene analysis. However, due to the nonstationarities and discontinuities that exist in these signals, their quantification and classification remain a formidable challenge. In this paper, we develop a new approach for audio feature extraction to effectively quantify these nonstationarities in an attempt to achieve high classification accuracy for environmental audio signals. Our approach consists of three stages: first, we construct the time-frequency matrix (TFM) of audio signals using the matching-pursuit time-frequency distribution (MP-TFD) technique; we then apply the non-negative matrix factorization (NMF) technique to decompose the TFM into its significant components. Finally, we propose seven novel features from the spectral and temporal structures of the decomposed vectors such that they successfully represent the joint TF structure of the audio signal, and combine them with the Mel-frequency cepstral coefficient (MFCC) features. These features are examined using a database of 192 environmental audio signals, which includes 20 aircraft, 17 helicopter, 20 drum, 15 flute, 20 piano, 20 animal, 20 bird, and 20 insect sounds, and the speech of 20 males and 20 females. The results of the numerical simulation support the effectiveness of the proposed approach for environmental audio classification, with over 10% accuracy-rate improvement compared to the MFCC features.
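A rough outline of the three-stage pipeline follows, with two simplifications loudly flagged: an STFT magnitude stands in for the matching-pursuit TFM, and two illustrative descriptors per component stand in for the paper's seven features.

```python
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

def tf_nmf_features(signal, fs, n_components=4):
    # STFT magnitude as a stand-in for the paper's matching-pursuit TFM.
    _, _, Z = stft(signal, fs=fs, nperseg=512)
    tfm = np.abs(Z)
    # Decompose TFM ~= W @ H into spectral bases W and temporal activations H.
    model = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    W = model.fit_transform(tfm)
    H = model.components_
    feats = []
    for k in range(n_components):
        # Illustrative spectral/temporal descriptors per component.
        centroid = (np.arange(W.shape[0]) * W[:, k]).sum() / (W[:, k].sum() + 1e-12)
        spread = H[k].std()
        feats.extend([centroid, spread])
    return np.array(feats)  # would be concatenated with MFCCs in practice
```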

124 citations


Journal ArticleDOI
TL;DR: The different aspects of front-end analysis for speech recognition, including sound characteristics, feature extraction techniques, and spectral representations of the speech signal, are discussed.
Abstract: Automatic speech recognition (ASR) has made great strides with the development of digital signal processing hardware and software. But despite all these advances, machines cannot match the performance of their human counterparts in terms of accuracy and speed, especially in the case of speaker-independent speech recognition. So, today a significant portion of speech recognition research is focused on the speaker-independent speech recognition problem. Before recognition, speech processing has to be carried out to obtain feature vectors of the signal, so front-end analysis plays an important role, given its wide range of applications and the limitations of available speech recognition techniques. In this report we briefly discuss the different aspects of front-end analysis for speech recognition, including sound characteristics, feature extraction techniques, and spectral representations of the speech signal. We have also discussed the various advantages and disadvantages of each feature extraction technique, along with the suitability of each method to particular applications.
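A representative MFCC front end of the kind this survey discusses, sketched with librosa; the file name and the parameter choices (16 kHz, 13 coefficients plus delta features) are typical defaults, not prescriptions from the report.

```python
import numpy as np
import librosa

# Hypothetical input file; 16 kHz is a common ASR front-end sample rate.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs with first and second differences: a classic 39-dim ASR feature.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2])  # shape: (39, n_frames)
```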

109 citations


Patent
Wei-Han Liu, Hsiao-Yu Han
25 May 2011
TL;DR: In this paper, an audio signal processing apparatus comprising a plurality of individual audio interfaces, an audio signal processing unit, and an audio channel splitting unit determines a total number of audio channels corresponding to the individual interfaces and generates a first output audio signal with a first number of channels according to an input audio signal and the total number of channels.
Abstract: An audio signal processing apparatus and an audio signal processing method are provided. The audio signal processing apparatus comprises: a plurality of individual audio interfaces, an audio signal processing unit, and an audio channel splitting unit. The audio signal processing unit is utilized for determining a total number of audio channels corresponding to the individual audio interfaces and generating a first output audio signal with a first number of audio channels according to an input audio signal and the total number of audio channels when the audio signal processing apparatus is operated under a first operational mode. The audio channel splitting unit is coupled to the audio signal processing unit and the audio interfaces. When the audio signal processing apparatus is operated under the first operational mode, the audio channel splitting unit splits the first output audio signal with the first number of audio channels to the audio interfaces, respectively.

82 citations


Patent
Thomas M. Soemo, Leo Soong, Michael H. Kim, Chad R. Heinemann, Dax Hawkins
02 Sep 2011
TL;DR: A system for integrating local speech recognition with cloud-based speech recognition in order to provide an efficient natural user interface is described in this article, where a computing device determines a direction associated with a particular person within an environment and generates an audio recording associated with the direction.
Abstract: A system for integrating local speech recognition with cloud-based speech recognition in order to provide an efficient natural user interface is described. In some embodiments, a computing device determines a direction associated with a particular person within an environment and generates an audio recording associated with the direction. The computing device then performs local speech recognition on the audio recording in order to detect a first utterance spoken by the particular person and to detect one or more keywords within the first utterance. The first utterance may be detected by applying voice activity detection techniques to the audio recording. The first utterance and the one or more keywords are subsequently transferred to a server, which may identify speech sounds within the first utterance associated with the one or more keywords and adapt one or more speech recognition techniques based on the identified speech sounds.
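The local/cloud split can be pictured as a simple gate. Everything below is a hypothetical skeleton: the energy-based VAD, the keyword-spotter stub, and the server call are placeholders for components the patent does not specify.

```python
import numpy as np

def detect_utterance(x: np.ndarray, thresh: float = 1e-4):
    """Crude energy-based voice activity detection (placeholder)."""
    frames = x[: len(x) // 160 * 160].reshape(-1, 160)
    return x if (np.mean(frames ** 2, axis=1) > thresh).any() else None

def spot_keywords(x: np.ndarray) -> list:
    """Placeholder: a real system runs a small on-device recognizer here."""
    return ["hypothetical_keyword"]

def send_to_server(x: np.ndarray, keywords: list) -> dict:
    """Placeholder for the cloud round trip."""
    return {"keywords": keywords, "transcript": "..."}

def process_audio(x: np.ndarray):
    utterance = detect_utterance(x)
    if utterance is None:
        return None                    # nothing spoken: no network cost
    keywords = spot_keywords(utterance)
    if not keywords:
        return None                    # no trigger word: stay local
    return send_to_server(utterance, keywords)
```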

75 citations


Proceedings Article
01 Aug 2011
TL;DR: A novel dual-channel algorithm is proposed which estimates the coherent-to-diffuse energy ratio (CDR) of background noise in mixed noise fields based on an estimate of the noise field coherence from a noisy speech signal and a subsequent minima tracking in order to increase the estimation accuracy even in the presence of speech.
Abstract: A novel dual-channel algorithm is proposed which estimates the coherent-to-diffuse energy ratio (CDR) of background noise in mixed noise fields. The algorithm is based on an estimate of the noise field coherence from a noisy speech signal and a subsequent minima tracking in order to increase the estimation accuracy even in the presence of speech. The obtained CDR estimate can be used, e.g., for the acoustic environment classification in hearing aids or to control speech enhancement algorithms such as noise reduction or speech dereverberation. Besides, the approach can be used to calculate an estimate of the direct-to-reverberant energy ratio (DRR) blindly from reverberant speech signals.
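A simplified version of the estimation chain, assuming the coherent component has coherence close to one and the diffuse component follows the ideal sinc coherence model; this textbook-style estimator and its constants are illustrative, not the paper's exact algorithm.

```python
import numpy as np
from scipy.signal import csd, welch

def estimate_cdr(x1, x2, fs, mic_dist, c=343.0, nperseg=512):
    """Simplified per-frequency CDR estimate from two microphone signals."""
    f, P12 = csd(x1, x2, fs=fs, nperseg=nperseg)
    _, P11 = welch(x1, fs=fs, nperseg=nperseg)
    _, P22 = welch(x2, fs=fs, nperseg=nperseg)
    gamma = np.real(P12 / np.sqrt(P11 * P22 + 1e-12))   # measured coherence
    gamma_diff = np.sinc(2.0 * f * mic_dist / c)        # diffuse-field model
    # Mixture coherence gamma = (CDR + gamma_diff) / (CDR + 1), solved for CDR:
    cdr = (gamma_diff - gamma) / np.clip(gamma - 1.0, -1.0, -1e-3)
    return f, np.maximum(cdr, 0.0)

# Minima tracking over time frames (e.g., scipy.ndimage.minimum_filter1d on a
# sequence of short-time estimates) reduces the bias from speech presence.
```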

74 citations


Patent
Dai Yang1, Daniel J. Sinder1
01 Jun 2011
TL;DR: In this paper, an excitation signal for a first frequency band of the audio signal is used to calculate the excitation signals for a second frequency band that is separated from the first band.
Abstract: Methods of audio coding are described in which an excitation signal for a first frequency band of the audio signal is used to calculate an excitation signal for a second frequency band of the audio signal that is separated from the first frequency band.
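One classical instance of this idea is spectral translation, where modulating the lowband excitation by (-1)^n shifts its spectrum toward the high band; the patent's specific mapping between bands is not reproduced here.

```python
import numpy as np

def extend_excitation(exc_low: np.ndarray) -> np.ndarray:
    """Derive a highband excitation from a lowband one by modulating with
    (-1)^n, which translates the spectrum by half the sample rate. This is
    one classical technique; the patent's mapping may differ."""
    n = np.arange(len(exc_low))
    return exc_low * (-1.0) ** n
```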

73 citations


Book
05 Jan 2011
TL;DR: Based on the fundamentals of information and rate distortion theory, the most relevant techniques used in source coding algorithms are described: entropy coding, quantization as well as predictive and transform coding.
Abstract: Digital media technologies have become an integral part of the way we create, communicate, and consume information. At the core of these technologies are source coding methods that are described in this monograph. Based on the fundamentals of information and rate distortion theory, the most relevant techniques used in source coding algorithms are described: entropy coding, quantization, as well as predictive and transform coding. The emphasis is placed on algorithms that are also used in video coding, which will be explained in the other part of this two-part monograph.
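Two of the building blocks named above, uniform quantization and an entropy estimate of the resulting symbol stream, fit in a few lines and already exhibit the rate side of the rate-distortion trade-off (a coarser step yields fewer bits per sample):

```python
import numpy as np

def uniform_quantize(x: np.ndarray, step: float) -> np.ndarray:
    """Mid-tread uniform quantizer: index = round(x / step)."""
    return np.round(x / step).astype(int)

def entropy_bits(indices: np.ndarray) -> float:
    """First-order entropy (bits/symbol): the lower bound an ideal entropy
    coder approaches for memoryless symbols."""
    _, counts = np.unique(indices, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

x = np.random.laplace(scale=1.0, size=10000)  # Laplacian: a common signal model
for step in (0.1, 0.5, 1.0):
    print(f"step={step}: {entropy_bits(uniform_quantize(x, step)):.2f} bits/sample")
```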

Patent
11 Mar 2011
TL;DR: In this paper, user profile based audio adjustment techniques are used to render various audio and audio/video content having different audio output parameter values in accordance with a user profile that characterizes a user's desired value and/or range of one or more of the output parameter levels.
Abstract: Embodiments are directed toward user profile based audio adjustment techniques. The techniques are used to render various audio and/or audio/video content having different audio output parameter values in accordance with a user profile that characterizes a user's desired value and/or range of one or more of the output parameter levels.

Patent
08 Apr 2011
TL;DR: In this article, a mobile device is presented that is capable of automatically starting and ending the recording of an audio signal captured by at least one microphone and of adjusting a number of parameters related to audio logging based on the context information of the audio input signal.
Abstract: A mobile device that is capable of automatically starting and ending the recording of an audio signal captured by at least one microphone is presented. The mobile device is capable of adjusting a number of parameters related to audio logging based on the context information of the audio input signal.

Patent
30 Sep 2011
TL;DR: In this article, the authors proposed a method to enhance noisy speech recognition accuracy by receiving geotagged audio signals that correspond to environmental audio recorded by multiple mobile devices in multiple geographic locations, receiving an audio signal that corresponds to an utterance recorded by a particular mobile device, determining a particular geographic location associated with the particular mobile devices, selecting a subset of geotaggregated audio signals and weighting each geotagated audio signal of the subset based on whether the respective audio signal was manually uploaded or automatically updated.
Abstract: Enhancing noisy speech recognition accuracy by receiving geotagged audio signals that correspond to environmental audio recorded by multiple mobile devices in multiple geographic locations, receiving an audio signal that corresponds to an utterance recorded by a particular mobile device, determining a particular geographic location associated with the particular mobile device, selecting a subset of geotagged audio signals and weighting each geotagged audio signal of the subset based on whether the respective audio signal was manually uploaded or automatically updated, generating a noise model for the particular geographic location using the subset of weighted geotagged audio signals, where noise compensation is performed on the audio signal that corresponds to the utterance using the noise model that has been generated for the particular geographic location.

Journal ArticleDOI
TL;DR: Results indicated that for all subjects tested, speech intelligibility decreased exponentially with an increase in reverberation time, and the proposed channel-selection criterion reduces the temporal envelope smearing effects introduced by reverberation and also diminishes the self-masking effects responsible for flattened formants.
Abstract: Little is known about the extent to which reverberation affects speech intelligibility by cochlear implant (CI) listeners. Experiment 1 assessed CI users’ performance using Institute of Electrical and Electronics Engineers (IEEE) sentences corrupted with varying degrees of reverberation. Reverberation times of 0.30, 0.60, 0.80, and 1.0 s were used. Results indicated that for all subjects tested, speech intelligibility decreased exponentially with an increase in reverberation time. A decaying-exponential model provided an excellent fit to the data. Experiment 2 evaluated (offline) a speech coding strategy for reverberation suppression using a channel-selection criterion based on the signal-to-reverberant ratio (SRR) of individual frequency channels. The SRR reflects implicitly the ratio of the energies of the signal originating from the early (and direct) reflections and the signal originating from the late reflections. Channels with SRR larger than a preset threshold were selected, while channels with SRR smaller than the threshold were zeroed out. Results in a highly reverberant scenario indicated that the proposed strategy led to substantial gains (over 60 percentage points) in speech intelligibility over the subjects’ daily strategy. Further analysis indicated that the proposed channel-selection criterion reduces the temporal envelope smearing effects introduced by reverberation and also diminishes the self-masking effects responsible for flattened formants.
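The channel-selection rule itself is a one-line comparison once per-channel SRR values are available; in the paper's offline setting the early and late signal components are known, so the sketch below takes their envelope energies as inputs. The threshold default is illustrative, not the study's tuned value.

```python
import numpy as np

def select_channels(early_env: np.ndarray, late_env: np.ndarray,
                    srr_threshold_db: float = -5.0) -> np.ndarray:
    """Zero out frequency channels whose signal-to-reverberant ratio (SRR)
    falls below a threshold.

    early_env, late_env: per-channel, per-frame envelope energies of the
    early/direct and late-reflection components (available offline, as in
    the paper's evaluation). Returns a binary selection mask.
    """
    srr_db = 10.0 * np.log10((early_env + 1e-12) / (late_env + 1e-12))
    return (srr_db > srr_threshold_db).astype(float)

# retained = mask * mixture_env  -> only selected channels drive stimulation
```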


Patent
30 Mar 2011
TL;DR: In this article, a method for providing sound to at least one user, involves supplying audio signals from an audio signal source to a transmission unit; compressing the audio signals to generate compressed audio data; transmitting compressed audio audio data from the transmission unit to a receiver unit; and stimulating the hearing of the user(s) according to decompressed audio signals supplied from the receiver unit.
Abstract: Method for providing sound to at least one user, involving supplying audio signals from an audio signal source to a transmission unit; compressing the audio signals to generate compressed audio data; transmitting the compressed audio data from the transmission unit to at least one receiver unit; decompressing the compressed audio data to generate decompressed audio signals; and stimulating the hearing of the user(s) according to the decompressed audio signals supplied from the receiver unit. During certain time periods, transmission of compressed audio data is interrupted and, instead, at least one control data block is generated by the transmission unit in such a manner that audio data transmission is replaced by control data block transmission, thereby temporarily interrupting the flow of received compressed audio data. Each control data block includes a marker recognized by the at least one receiver unit as a control data block and a command for control of the receiver unit.

Patent
Craig L. Reding, Suzi Levas
30 Dec 2011
TL;DR: In this paper, a shared speech processing facility is used to support speech recognition for a wide variety of devices with limited capabilities including business computer systems, personal data assistants, etc., which are coupled to the speech processing facilities via a communications channel, e.g., the Internet.
Abstract: Techniques for generating, distributing, and using speech recognition models are described. A shared speech processing facility is used to support speech recognition for a wide variety of devices with limited capabilities including business computer systems, personal data assistants, etc., which are coupled to the speech processing facility via a communications channel, e.g., the Internet. Devices with audio capture capability record and transmit to the speech processing facility, via the Internet, digitized speech and receive speech processing services, e.g., speech recognition model generation and/or speech recognition services, in response. The Internet is used to return speech recognition models and/or information identifying recognized words or phrases. Thus, the speech processing facility can be used to provide speech recognition capabilities to devices without such capabilities and/or to augment a device's speech processing capability. Voice dialing, telephone control and/or other services are provided by the speech processing facility in response to speech recognition results.

Journal ArticleDOI
TL;DR: The second-order derivative-based audio steganalysis method gains a considerable advantage under all categories of signal complexity, especially for audio streams with high signal complexity, which are generally the most challenging for steganalysis, and thereby significantly improves the state of the art in audio steganalysis.
Abstract: This article presents a second-order derivative-based audio steganalysis. First, Mel-cepstrum coefficients and Markov transition features from the second-order derivative of the audio signal are extracted; a support vector machine is then applied to the features for discovering the existence of hidden data in digital audio streams. Also, the relation between audio signal complexity and steganography detection accuracy, an issue relevant to audio steganalysis performance evaluation that so far has not been explored, is analyzed experimentally. Results demonstrate that, in comparison with a recently proposed signal-stream-based Mel-cepstrum method, the second-order derivative-based audio steganalysis method gains a considerable advantage under all categories of signal complexity, especially for audio streams with high signal complexity, which are generally the most challenging for steganalysis, and thereby significantly improves the state of the art in audio steganalysis.
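The feature-extraction front end can be approximated as follows. The second-order derivative and the sign-transition Markov statistics follow the description above, while the Mel-cepstral statistics use librosa as a stand-in; the exact feature definitions in the article differ in detail.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def steganalysis_features(y: np.ndarray, sr: int) -> np.ndarray:
    d2 = np.diff(y.astype(float), n=2)         # second-order derivative
    # Mel-cepstral statistics of the derivative signal.
    mfcc = librosa.feature.mfcc(y=d2, sr=sr, n_mfcc=13)
    mel_feats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
    # Markov transition features over quantized derivative signs.
    s = np.sign(d2).astype(int) + 1            # states 0, 1, 2
    trans = np.zeros((3, 3))
    for a, b in zip(s[:-1], s[1:]):
        trans[a, b] += 1
    trans /= max(trans.sum(), 1.0)
    return np.concatenate([mel_feats, trans.ravel()])

# clf = SVC(kernel="rbf").fit(train_features, labels)  # cover vs. stego
```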

Patent
28 Jul 2011
TL;DR: In this article, the authors present a method for continuous monitoring of audio signals and identification of audio items within an audio signal, which utilizes predictive caching of fingerprints to improve the efficiency of audio identification.
Abstract: The present invention relates to the continuous monitoring of an audio signal and identification of audio items within an audio signal. The technology disclosed utilizes predictive caching of fingerprints to improve efficiency. Fingerprints are cached for tracking an audio signal with known alignment and for watching an audio signal without known alignment, based on already identified fingerprints extracted from the audio signal. Software running on a smart phone or other battery-powered device cooperates with software running on an audio identification server.

Patent
06 Sep 2011
TL;DR: In this paper, an audio signal of ambient audio is autonomously sampled in the vicinity of the mobile computer system to capture one or more audio samples of the audio signal, and the audio signature may be compared with multiple previously stored reference audio signatures.
Abstract: A computerized method for engaging a user of a mobile computer system. The mobile computer system may be connectible to a server over a wide area network. An audio signal of ambient audio is autonomously sampled in the vicinity of the mobile computer system to capture one or more audio samples of the audio signal. The multiple samples of the audio signal are autonomously sampled without requiring any interaction from the user, thus avoiding an input from the user to capture each of the samples. The audio sample may be processed to extract an audio signature of the audio sample. The audio signature may be compared with multiple previously stored reference audio signatures. Upon matching the audio signature with at least one reference audio signature, a matched reference audio signature may be produced.

Proceedings ArticleDOI
18 Sep 2011
TL;DR: Using compression properties of AO, this formulation extends the notion of Information Rate to individual sequences and allows an optimal estimation of the AO threshold parameter and shows that changes in IR correspond to significant musical structures such as sections in a sonata form.
Abstract: This paper presents a method for analysis of changes in information contents in music based on an audio representation called Audio Oracle (AO). Using compression properties of AO we estimate the amount of information that passes between the past and the present at every instance in a musical signal. This formulation extends the notion of Information Rate (IR) to individual sequences and allows an optimal estimation of the AO threshold parameter. We show that changes in IR correspond to significant musical structures such as sections in a sonata form. Relation to musical perception and applications for composition and improvisation are discussed in the paper.

Journal ArticleDOI
TL;DR: Experiments show that the proposed one-dimensional APD and RTPD features are able to achieve comparable accuracy with popular high-dimensional features in speech/music discrimination, and the SVM-BT approach demonstrates superior performance in multi-class audio classification.
Abstract: Audio classification is an essential task in multimedia content analysis, which is a prerequisite to a variety of tasks such as segmentation, indexing and retrieval. This paper describes our study on multi-class audio classification on broadcast news, a popular multimedia repository with rich audio types. Motivated by the tonal regulations of music, we propose two pitch-density-based features, namely average pitch-density (APD) and relative tonal power density (RTPD). We use an SVM binary tree (SVM-BT) to hierarchically classify an audio clip into five classes: pure speech, music, environment sound, speech with music and speech with environment sound. Since SVM is a binary classifier, we use the SVM-BT architecture to realize coarse-to-fine multi-class classification with high accuracy and efficiency. Experiments show that the proposed one-dimensional APD and RTPD features are able to achieve comparable accuracy with popular high-dimensional features in speech/music discrimination, and the SVM-BT approach demonstrates superior performance in multi-class audio classification. With the help of the pitch-density-based features, we can achieve a high average accuracy of 94.2% in the five-class audio classification task.
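Structurally, an SVM binary tree is just one binary SVM per node applied coarse-to-fine. The two-level sketch below uses scikit-learn; the split order and class layout are illustrative choices, and the APD/RTPD features would be computed upstream.

```python
import numpy as np
from sklearn.svm import SVC

class SVMBinaryTree:
    """Coarse-to-fine classification with one binary SVM per tree node.

    The split order here (speech-like vs. non-speech first) is an
    illustrative choice; the paper defines its own hierarchy over its
    five broadcast-news classes.
    """
    def __init__(self):
        self.root = SVC()    # speech-like vs. non-speech
        self.left = SVC()    # pure speech vs. speech with background
        self.right = SVC()   # music vs. environment sound

    def fit(self, X, y_root, mask_left, y_left, mask_right, y_right):
        self.root.fit(X, y_root)
        self.left.fit(X[mask_left], y_left)
        self.right.fit(X[mask_right], y_right)

    def predict(self, x):
        x = np.atleast_2d(x)
        if self.root.predict(x)[0] == 1:
            return self.left.predict(x)[0]
        return self.right.predict(x)[0]
```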

Patent
06 Oct 2011
TL;DR: In this article, a wireless multi-channel audio system including an audio source with a wireless transceiver configured to communicate according to a standard wireless protocol and an audio controller is collectively configured to establish wireless communications with multiple audio sinks via a corresponding wireless link.
Abstract: A wireless multi-channel audio system including an audio source with a wireless transceiver configured to communicate according to a standard wireless protocol and an audio controller, which are collectively configured to establish wireless communications with multiple audio sinks via a corresponding wireless link, to assign each audio sink a corresponding audio channel, to synchronize timing with each audio sink, and to transmit audio information for each audio channel to a corresponding audio sink via a corresponding wireless link. The audio source may inquire as to supported audio parameters, such as sample rate and audio codec, and select a commonly supported configuration. The audio source may separate audio information into queues for each audio channel for each audio sink. The audio source transmits frames with timestamps and a common start time for synchronization, and the audio sinks use this information to synchronize timing and remain virtually synchronized with each other.

Patent
30 Jun 2011
TL;DR: In this paper, a speech processing engine is provided that employs Kalman filtering with a particular speaker's glottal information to clean up an audio speech signal for more efficient automatic speech recognition.
Abstract: A speech processing engine is provided that, in some embodiments, employs Kalman filtering with a particular speaker's glottal information to clean up an audio speech signal for more efficient automatic speech recognition.
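For intuition, here is a minimal scalar Kalman filter under an AR(1) speech model; the patent's use of speaker-specific glottal information would refine the state model, which is simplified away here, and all constants are illustrative.

```python
import numpy as np

def kalman_denoise(y: np.ndarray, a: float = 0.95,
                   q: float = 1e-3, r: float = 1e-2) -> np.ndarray:
    """Scalar Kalman filter for x[n] = a*x[n-1] + w, observed as y[n] = x[n] + v.

    a (AR coefficient), q (process noise) and r (measurement noise) are
    illustrative constants, not values from the patent."""
    y = np.asarray(y, dtype=float)
    x_hat, p = 0.0, 1.0
    out = np.empty_like(y)
    for n, yn in enumerate(y):
        # Predict from the AR(1) state model.
        x_pred = a * x_hat
        p_pred = a * a * p + q
        # Update with the noisy observation.
        k = p_pred / (p_pred + r)          # Kalman gain
        x_hat = x_pred + k * (yn - x_pred)
        p = (1.0 - k) * p_pred
        out[n] = x_hat
    return out
```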

Patent
28 Mar 2011
TL;DR: In this paper, the authors present a method for generating domain-specific speech recognition models for a domain of interest by combining and tuning existing speech recognition model when a speech recognizer does not have access to a speech recognition system for that domain of the interest and when available domain specific data is below a minimum desired threshold.
Abstract: Disclosed herein are systems, methods, and non-transitory computer-readable storage media for generating domain-specific speech recognition models for a domain of interest by combining and tuning existing speech recognition models when a speech recognizer does not have access to a speech recognition model for that domain of interest and when available domain-specific data is below a minimum desired threshold to create a new domain-specific speech recognition model. A system configured to practice the method identifies a speech recognition domain and combines a set of speech recognition models, each speech recognition model of the set of speech recognition models being from a respective speech recognition domain. The system receives an amount of data specific to the speech recognition domain, wherein the amount of data is less than a minimum threshold to create a new domain-specific model, and tunes the combined speech recognition model for the speech recognition domain based on the data.

Journal ArticleDOI
TL;DR: The evaluation of broadcast news audio segmentation systems carried out in the context of the Albayzín-2010 evaluation campaign is presented, with the aim of gaining insight into the proposed solutions and identifying promising directions.
Abstract: Recently, audio segmentation has attracted research interest because of its usefulness in several applications like audio indexing and retrieval, subtitling, monitoring of acoustic scenes, etc. Moreover, a previous audio segmentation stage may be useful to improve the robustness of speech technologies like automatic speech recognition and speaker diarization. In this article, we present the evaluation of broadcast news audio segmentation systems carried out in the context of the Albayzín-2010 evaluation campaign. That evaluation consisted of segmenting audio from the 3/24 Catalan TV channel into five acoustic classes: music, speech, speech over music, speech over noise, and others. The evaluation results demonstrated the difficulty of this segmentation task. After presenting the database and metric, as well as the feature extraction methods and segmentation techniques used by the submitted systems, the experimental results are analyzed and compared, with the aim of gaining insight into the proposed solutions and identifying promising directions.

Patent
01 Mar 2011
TL;DR: In this paper, a method for encoding audio frames by producing a first frame of coded audio samples by coding a first audio frame in a sequence of frames, producing at least a portion of a second frame of audio samples, and producing parameters for generating audio gap filler samples.
Abstract: A method for encoding audio frames by producing a first frame of coded audio samples by coding a first audio frame in a sequence of frames, producing at least a portion of a second frame of coded audio samples by coding at least a portion of a second audio frame in the sequence of frames, and producing parameters for generating audio gap filler samples, wherein the parameters are representative of either a weighted segment of the first frame of coded audio samples or a weighted segment of the portion of the second frame of coded audio samples.

Journal ArticleDOI
TL;DR: This paper applies the CS methodology to sinusoidally modeled audio signals, and proposes encoding few randomly selected samples of the time-domain description of the sinusoidal component (per signal segment).
Abstract: Compressed sensing (CS) samples signals at a much lower rate than the Nyquist rate if they are sparse in some basis. In this paper, the CS methodology is applied to sinusoidally modeled audio signals. As this model is sparse by definition in the frequency domain (being equal to the sum of a small number of sinusoids), we investigate whether CS can be used to encode audio signals at low bitrates. In contrast to encoding the sinusoidal parameters (amplitude, frequency, phase) as current state-of-the-art methods do, we propose encoding few randomly selected samples of the time-domain description of the sinusoidal component (per signal segment). The potential of applying compressed sensing both to single-channel and multi-channel audio coding is examined. The listening test results are encouraging, indicating that the proposed approach can achieve comparable performance to that of state-of-the-art methods. Given that CS can lead to novel coding systems where the sampling and compression operations are combined into one low-complexity step, the proposed methodology can be considered as an important step towards applying the CS framework to audio coding applications.
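A toy end-to-end run of the proposed encoding is easy to set up: synthesize a few sinusoids, keep a random subset of time-domain samples, and reconstruct by sparse recovery over the DFT basis. Orthogonal matching pursuit is used here as a generic decoder, standing in for whatever reconstruction the paper employs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 256, 64, 3                       # frame length, measurements, sinusoids

# Sparse-in-frequency test signal: K real sinusoids at distinct bins.
t = np.arange(N)
x = sum(np.cos(2 * np.pi * k * t / N + rng.uniform(0, 2 * np.pi))
        for k in rng.choice(np.arange(5, 100), K, replace=False))

# Encoder: keep M randomly chosen time-domain samples (the paper's idea).
idx = np.sort(rng.choice(N, M, replace=False))
y = x[idx].astype(complex)

# Decoder: orthogonal matching pursuit over the unitary DFT dictionary.
F = np.exp(2j * np.pi * np.outer(t, t) / N) / np.sqrt(N)
A = F[idx, :]
residual, support = y.copy(), []
for _ in range(2 * K):                     # each real sinusoid -> 2 DFT atoms
    support.append(int(np.argmax(np.abs(A.conj().T @ residual))))
    coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    residual = y - A[:, support] @ coef
x_hat = np.real(F[:, support] @ coef)
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```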

Patent
Aaron M. Eppolito
23 Aug 2011
TL;DR: In this article, a method for dynamic range compression of audio content is presented; based on an analysis of the audio content, the method generates a setting for an audio compressor that compresses the dynamic range of the audio content.
Abstract: For a media clip that includes audio content, a novel method for performing dynamic range compression of the audio content is presented. The method performs an analysis of the audio content. Based on the analysis of the audio content, the method generates a setting for an audio compressor that compresses the dynamic range of the audio content. The generated setting includes a set of audio compression parameters that include a noise gating threshold parameter (“noise gate”), a dynamic range compression threshold parameter (“threshold”), and a dynamic range compression ratio parameter (“ratio”).
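The three parameters named in the abstract map directly onto a static gain curve: mute below the gate, unity in the middle, slope 1/ratio above the threshold. The defaults below are illustrative, not values the method would generate.

```python
import numpy as np

def compressor_gain_db(level_db: np.ndarray, noise_gate_db: float = -60.0,
                       threshold_db: float = -20.0, ratio: float = 4.0) -> np.ndarray:
    """Static compressor curve built from the abstract's three parameters.

    Below the noise gate the signal is muted; above the threshold the level
    is compressed by `ratio`; in between it passes unchanged."""
    out_db = np.where(
        level_db < noise_gate_db, -np.inf,                          # gated
        np.where(level_db > threshold_db,
                 threshold_db + (level_db - threshold_db) / ratio,  # compressed
                 level_db))                                         # unity
    return out_db - level_db   # gain to apply, in dB

levels = np.array([-70.0, -40.0, -10.0])
print(compressor_gain_db(levels))   # -> [-inf  0.  -7.5]
```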