Topic

Speech coding

About: Speech coding is a research topic. Over its lifetime, 14,245 publications have been published within this topic, receiving 271,964 citations.


Papers
Journal ArticleDOI
TL;DR: A new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis.
Abstract: Speech recognition performance degrades significantly in distant-talking environments, where the speech signals can be severely distorted by additive noise and reverberation. In such environments, the use of microphone arrays has been proposed as a means of improving the quality of captured speech signals. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing and then recognition. Array processing algorithms, designed for signal enhancement, are applied in order to reduce the distortion in the speech waveform prior to feature extraction and recognition. This approach assumes that improving the quality of the speech waveform will necessarily result in improved recognition performance, and it ignores the manner in which speech recognition systems operate. In this paper, a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis. In this approach, called likelihood-maximizing beamforming, information from the speech recognition system itself is used to optimize a filter-and-sum beamformer. Speech recognition experiments performed in a real distant-talking environment confirm the efficacy of the proposed approach.
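The filter-and-sum structure at the heart of this approach is easy to sketch. Below is a minimal NumPy illustration of the signal path only; the paper's actual contribution, optimizing the filter taps against the recognizer's likelihood rather than waveform fidelity, is not reproduced here, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def filter_and_sum(signals, filters):
    """Filter-and-sum beamformer: pass each microphone channel through
    its own FIR filter, then sum the filtered channels into one output.

    signals : (num_mics, num_samples) array of microphone signals
    filters : (num_mics, num_taps) array of per-channel FIR taps
    """
    return np.sum(
        [np.convolve(x, h, mode="same") for x, h in zip(signals, filters)],
        axis=0,
    )

# Hypothetical usage: two channels with 16-tap filters. In
# likelihood-maximizing beamforming, these taps would be tuned to maximize
# the recognizer's likelihood of the correct hypothesis, not waveform SNR.
rng = np.random.default_rng(0)
mics = rng.standard_normal((2, 16000))      # 1 s of 2-channel audio at 16 kHz
taps = np.zeros((2, 16)); taps[:, 0] = 0.5  # trivial pass-through filters
output = filter_and_sum(mics, taps)
```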

147 citations

Journal ArticleDOI
TL;DR: In this article, the authors proposed a new deep network for audio event recognition, called AENet, which uses a convolutional neural network (CNN) operating on a large temporal input.
Abstract: We propose a new deep network for audio event recognition, called AENet. In contrast to speech, sounds coming from audio events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of clear subword units that are present in speech. In order to incorporate this long-time frequency structure of audio events, we introduce a convolutional neural network (CNN) operating on a large temporal input. In contrast to previous works, this allows us to train an audio event detection system end to end. The combination of our network architecture and a novel data augmentation scheme outperforms previous methods for audio event detection by 16%. Furthermore, we perform transfer learning and show that our model learned generic audio features, similar to the way CNNs learn generic features on vision tasks. In video analysis, combining visual features with traditional audio features, such as mel frequency cepstral coefficients, typically leads to only marginal improvements. Instead, combining visual features with our AENet features, which can be computed efficiently on a GPU, leads to significant performance improvements on action recognition and video highlight detection. In video highlight detection, our audio features improve the performance by more than 8% over visual features alone.
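As a rough sketch of the core idea, a CNN operating directly on a large temporal input, the following PyTorch model classifies several seconds of raw audio end to end. The layer shapes and sizes are illustrative assumptions, not the published AENet architecture.

```python
import torch
import torch.nn as nn

class AudioEventCNN(nn.Module):
    """Toy 1-D CNN over a long raw-audio window; illustrative only."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),           # collapse the time axis
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):                      # x: (batch, 1, num_samples)
        h = self.features(x).squeeze(-1)       # (batch, 32) audio embedding
        return self.classifier(h)

model = AudioEventCNN()
logits = model(torch.randn(2, 1, 4 * 16000))   # four seconds at 16 kHz
```

The pooled 32-dimensional embedding plays the role of reusable "audio features" analogous to those the paper combines with visual features for video tasks.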

147 citations

Journal ArticleDOI
TL;DR: The modulated lapped transform properties and how it can be used to generate a time-varying filterbank are described and examples of its implementation in two audio coding standards are presented.
Abstract: The modulated lapped transform (MLT) is used in both audio and video data compression schemes. This paper describes its properties and how it can be used to generate a time-varying filterbank. Examples of its implementation in two audio coding standards are presented.
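One common formulation of the MLT is as the modified discrete cosine transform (MDCT) with a sine window. The direct O(N²) sketch below, written under that assumption, maps a block of 2N samples to N coefficients; production codecs use FFT-based fast versions over 50%-overlapped blocks.

```python
import numpy as np

def mlt(block):
    """Direct modulated lapped transform of one block of 2N samples,
    written as the MDCT with a sine window: 2N inputs -> N coefficients."""
    two_n = block.shape[0]
    n = two_n // 2
    window = np.sin(np.pi / two_n * (np.arange(two_n) + 0.5))
    ns = np.arange(two_n)[None, :]           # time index
    ks = np.arange(n)[:, None]               # coefficient index
    basis = np.cos(np.pi / n * (ns + 0.5 + n / 2) * (ks + 0.5))
    return basis @ (window * block)

coeffs = mlt(np.random.default_rng(1).standard_normal(512))  # 256 coefficients
```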

146 citations

PatentDOI
TL;DR: In this article, a system and method for identifying the phoneme sound types that are contained within an audio speech signal is disclosed, which includes a microphone (12) and associated conditioning circuitry (14, 15, 16, 17, 18).
Abstract: A system and method for identifying the phoneme sound types that are contained within an audio speech signal is disclosed. The system includes a microphone (12) and associated conditioning circuitry (14) for receiving an audio speech signal and converting it to a representative electrical signal. The electrical signal is then sampled and converted to a digital audio signal with an analog-to-digital converter (34). The digital audio signal is input to a programmable digital sound processor (18), which digitally processes the sound so as to extract various time-domain and frequency-domain sound characteristics. These characteristics are input to a programmable host sound processor (20), which compares the sound characteristics to standard sound data. Based on this comparison, the host sound processor (20) identifies the specific phoneme sounds that are contained within the audio speech signal. The programmable host sound processor (20) further includes linguistic processing program methods to convert the phoneme sounds into English words or other natural language words. These words are input to a host processor (22), which then utilizes the words as either data or commands.
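The patent's pipeline, extracting time-domain and frequency-domain characteristics per frame and matching them against stored reference data, can be caricatured in a few lines. The features and the nearest-template matcher below are simplistic stand-ins chosen for illustration; the patent does not specify this exact feature set.

```python
import numpy as np

def frame_features(frame, sample_rate=8000):
    """Toy time-domain (zero-crossing rate) and frequency-domain
    (spectral centroid) characteristics of one audio frame."""
    signs = np.signbit(frame).astype(np.int8)
    zcr = np.mean(np.abs(np.diff(signs)))
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
    return np.array([zcr, centroid])

def nearest_phoneme(features, templates):
    """Compare features against 'standard sound data': a dict mapping a
    phoneme label to a reference feature vector; return the closest label."""
    return min(templates, key=lambda p: np.linalg.norm(features - templates[p]))

# Hypothetical templates; a real system would derive these from labeled speech.
templates = {"s": np.array([0.6, 3000.0]), "a": np.array([0.1, 700.0])}
frame = np.random.default_rng(2).standard_normal(256)  # noise is fricative-like
print(nearest_phoneme(frame_features(frame), templates))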

146 citations

Proceedings ArticleDOI
03 Apr 1990
TL;DR: A pitch estimation criterion is derived that is inherently unambiguous, uses pitch-adaptive resolution, uses small-signal suppression to provide enhanced discrimination, and uses amplitude compression to eliminate the effects of pitch-formant interaction.
Abstract: A technique for estimating the pitch of a speech waveform is developed. It fits a harmonic set of sine waves to the input data using a mean-squared-error (MSE) criterion. By exploiting a sinusoidal model for the input speech waveform, a pitch estimation criterion is derived that is inherently unambiguous, uses pitch-adaptive resolution, uses small-signal suppression to provide enhanced discrimination, and uses amplitude compression to eliminate the effects of pitch-formant interaction. The normalized minimum mean squared error proves to be a powerful discriminant for estimating the likelihood that a given frame of speech is voiced.
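A stripped-down version of harmonic pitch matching conveys the flavor of the method: score each candidate fundamental frequency by how much spectral energy its harmonic comb explains, then pick the best. The sketch below omits the paper's refinements (pitch-adaptive resolution, small-signal suppression, amplitude compression) and uses a simple normalization as an assumed guard against octave errors.

```python
import numpy as np

def harmonic_pitch(frame, sample_rate=8000.0, f0_min=60.0, f0_max=400.0):
    """Crude harmonic-matching pitch estimate over a 1 Hz candidate grid."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    bin_hz = sample_rate / len(frame)
    best_f0, best_score = 0.0, -np.inf
    for f0 in np.arange(f0_min, f0_max, 1.0):
        harmonics = np.arange(f0, sample_rate / 2, f0)
        bins = np.round(harmonics / bin_hz).astype(int)
        # sqrt penalty keeps dense subharmonic combs from always winning
        score = spectrum[bins].sum() / np.sqrt(len(bins))
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0

# A synthetic 120 Hz harmonic tone should be recovered (approximately).
t = np.arange(2048) / 8000.0
tone = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 6))
print(harmonic_pitch(tone))  # ~120.0
```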

145 citations


Network Information

Related Topics (5)

Topic                 Papers    Citations   Related
Signal processing     73.4K     983.5K      86%
Decoding methods      65.7K     900K        84%
Fading                55.4K     1M          80%
Feature vector        48.8K     954.4K      80%
Feature extraction    111.8K    2.1M        80%
Performance Metrics

No. of papers in the topic in previous years

Year    Papers
2023    38
2022    84
2021    70
2020    62
2019    77
2018    108