Topic

Speech coding

About: Speech coding is a research topic. Over its lifetime, 14,245 publications have been published within this topic, receiving 271,964 citations.


Papers
Journal ArticleDOI
TL;DR: A new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis.
Abstract: Speech recognition performance degrades significantly in distant-talking environments, where the speech signals can be severely distorted by additive noise and reverberation. In such environments, the use of microphone arrays has been proposed as a means of improving the quality of captured speech signals. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing and then recognition. Array processing algorithms, designed for signal enhancement, are applied in order to reduce the distortion in the speech waveform prior to feature extraction and recognition. This approach assumes that improving the quality of the speech waveform will necessarily result in improved recognition performance, and it ignores the manner in which speech recognition systems operate. In this paper, a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis. In this approach, called likelihood-maximizing beamforming, information from the speech recognition system itself is used to optimize a filter-and-sum beamformer. Speech recognition experiments performed in a real distant-talking environment confirm the efficacy of the proposed approach.
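The filter-and-sum structure at the heart of this approach is easy to sketch. Below is a minimal NumPy illustration of the signal path only; the paper's actual contribution, optimizing the filter taps against the recognizer's likelihood rather than waveform fidelity, is not reproduced here, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def filter_and_sum(signals, filters):
    """Filter-and-sum beamformer: pass each microphone channel through
    its own FIR filter, then sum the filtered channels into one output.

    signals : (num_mics, num_samples) array of microphone signals
    filters : (num_mics, num_taps) array of per-channel FIR taps
    """
    return np.sum(
        [np.convolve(x, h, mode="same") for x, h in zip(signals, filters)],
        axis=0,
    )

# Hypothetical usage: two channels with 16-tap filters. In
# likelihood-maximizing beamforming, these taps would be tuned to maximize
# the recognizer's likelihood of the correct hypothesis, not waveform SNR.
rng = np.random.default_rng(0)
mics = rng.standard_normal((2, 16000))      # 1 s of 2-channel audio at 16 kHz
taps = np.zeros((2, 16)); taps[:, 0] = 0.5  # trivial pass-through filters
output = filter_and_sum(mics, taps)
```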

147 citations

Journal ArticleDOI
TL;DR: In this article, the authors proposed a new deep network for audio event recognition, called AENet, which uses a convolutional neural network (CNN) operating on a large temporal input.
Abstract: We propose a new deep network for audio event recognition, called AENet. In contrast to speech, sounds coming from audio events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of clear subword units that are present in speech. In order to incorporate this long-time frequency structure of audio events, we introduce a convolutional neural network (CNN) operating on a large temporal input. In contrast to previous works, this allows us to train an audio event detection system end to end. The combination of our network architecture and a novel data augmentation scheme outperforms previous methods for audio event detection by 16%. Furthermore, we perform transfer learning and show that our model learned generic audio features, similar to the way CNNs learn generic features on vision tasks. In video analysis, combining visual features with traditional audio features, such as mel frequency cepstral coefficients, typically leads to only marginal improvements. Instead, combining visual features with our AENet features, which can be computed efficiently on a GPU, leads to significant performance improvements on action recognition and video highlight detection. In video highlight detection, our audio features improve the performance by more than 8% over visual features alone.
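As a rough sketch of the core idea, a CNN operating directly on a large temporal input, the following PyTorch model classifies several seconds of raw audio end to end. The layer shapes and sizes are illustrative assumptions, not the published AENet architecture.

```python
import torch
import torch.nn as nn

class AudioEventCNN(nn.Module):
    """Toy 1-D CNN over a long raw-audio window; illustrative only."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),           # collapse the time axis
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):                      # x: (batch, 1, num_samples)
        h = self.features(x).squeeze(-1)       # (batch, 32) audio embedding
        return self.classifier(h)

model = AudioEventCNN()
logits = model(torch.randn(2, 1, 4 * 16000))   # four seconds at 16 kHz
```

The pooled 32-dimensional embedding plays the role of reusable "audio features" analogous to those the paper combines with visual features for video tasks.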

147 citations

Journal ArticleDOI
TL;DR: The modulated lapped transform properties and how it can be used to generate a time-varying filterbank are described and examples of its implementation in two audio coding standards are presented.
Abstract: The modulated lapped transform (MLT) is used in both audio and video data compression schemes. This paper describes its properties and how it can be used to generate a time-varying filterbank. Examples of its implementation in two audio coding standards are presented.
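One common formulation of the MLT is as the modified discrete cosine transform (MDCT) with a sine window. The direct O(N²) sketch below, written under that assumption, maps a block of 2N samples to N coefficients; production codecs use FFT-based fast versions over 50%-overlapped blocks.

```python
import numpy as np

def mlt(block):
    """Direct modulated lapped transform of one block of 2N samples,
    written as the MDCT with a sine window: 2N inputs -> N coefficients."""
    two_n = block.shape[0]
    n = two_n // 2
    window = np.sin(np.pi / two_n * (np.arange(two_n) + 0.5))
    ns = np.arange(two_n)[None, :]           # time index
    ks = np.arange(n)[:, None]               # coefficient index
    basis = np.cos(np.pi / n * (ns + 0.5 + n / 2) * (ks + 0.5))
    return basis @ (window * block)

coeffs = mlt(np.random.default_rng(1).standard_normal(512))  # 256 coefficients
```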

146 citations

PatentDOI
TL;DR: In this article, a system and method for identifying the phoneme sound types that are contained within an audio speech signal is disclosed, which includes a microphone (12) and associated conditioning circuitry (14, 15, 16, 17, 18).
Abstract: A system and method for identifying the phoneme sound types that are contained within an audio speech signal is disclosed. The system includes a microphone (12) and associated conditioning circuitry (14) for receiving an audio speech signal and converting it to a representative electrical signal. The electrical signal is then sampled and converted to a digital audio signal with an analog-to-digital converter (34). The digital audio signal is input to a programmable digital sound processor (18), which digitally processes the sound so as to extract various time-domain and frequency-domain sound characteristics. These characteristics are input to a programmable host sound processor (20), which compares the sound characteristics to standard sound data. Based on this comparison, the host sound processor (20) identifies the specific phoneme sounds that are contained within the audio speech signal. The programmable host sound processor (20) further includes linguistic processing program methods to convert the phoneme sounds into English words or other natural language words. These words are input to a host processor (22), which then utilizes the words as either data or commands.
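The patent's pipeline, extracting time-domain and frequency-domain characteristics per frame and matching them against stored reference data, can be caricatured in a few lines. The features and the nearest-template matcher below are simplistic stand-ins chosen for illustration; the patent does not specify this exact feature set.

```python
import numpy as np

def frame_features(frame, sample_rate=8000):
    """Toy time-domain (zero-crossing rate) and frequency-domain
    (spectral centroid) characteristics of one audio frame."""
    signs = np.signbit(frame).astype(np.int8)
    zcr = np.mean(np.abs(np.diff(signs)))
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
    return np.array([zcr, centroid])

def nearest_phoneme(features, templates):
    """Compare features against 'standard sound data': a dict mapping a
    phoneme label to a reference feature vector; return the closest label."""
    return min(templates, key=lambda p: np.linalg.norm(features - templates[p]))

# Hypothetical templates; a real system would derive these from labeled speech.
templates = {"s": np.array([0.6, 3000.0]), "a": np.array([0.1, 700.0])}
frame = np.random.default_rng(2).standard_normal(256)  # noise is fricative-like
print(nearest_phoneme(frame_features(frame), templates))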

146 citations

Proceedings ArticleDOI
03 Apr 1990
TL;DR: A pitch estimation criterion is derived that is inherently unambiguous, uses pitch-adaptive resolution, uses small-signal suppression to provide enhanced discrimination, and uses amplitude compression to eliminate the effects of pitch-formant interaction.
Abstract: A technique for estimating the pitch of a speech waveform is developed. It fits a harmonic set of sine waves to the input data using a mean-squared-error (MSE) criterion. By exploiting a sinusoidal model for the input speech waveform, a pitch estimation criterion is derived that is inherently unambiguous, uses pitch-adaptive resolution, uses small-signal suppression to provide enhanced discrimination, and uses amplitude compression to eliminate the effects of pitch-formant interaction. The normalized minimum mean squared error proves to be a powerful discriminant for estimating the likelihood that a given frame of speech is voiced.
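A stripped-down version of harmonic pitch matching conveys the flavor of the method: score each candidate fundamental frequency by how much spectral energy its harmonic comb explains, then pick the best. The sketch below omits the paper's refinements (pitch-adaptive resolution, small-signal suppression, amplitude compression) and uses a simple normalization as an assumed guard against octave errors.

```python
import numpy as np

def harmonic_pitch(frame, sample_rate=8000.0, f0_min=60.0, f0_max=400.0):
    """Crude harmonic-matching pitch estimate over a 1 Hz candidate grid."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    bin_hz = sample_rate / len(frame)
    best_f0, best_score = 0.0, -np.inf
    for f0 in np.arange(f0_min, f0_max, 1.0):
        harmonics = np.arange(f0, sample_rate / 2, f0)
        bins = np.round(harmonics / bin_hz).astype(int)
        # sqrt penalty keeps dense subharmonic combs from always winning
        score = spectrum[bins].sum() / np.sqrt(len(bins))
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0

# A synthetic 120 Hz harmonic tone should be recovered (approximately).
t = np.arange(2048) / 8000.0
tone = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 6))
print(harmonic_pitch(tone))  # ~120.0
```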

145 citations


Network Information

Related Topics (5)

Topic                 Papers    Citations   Related
Signal processing     73.4K     983.5K      86%
Decoding methods      65.7K     900K        84%
Fading                55.4K     1M          80%
Feature vector        48.8K     954.4K      80%
Feature extraction    111.8K    2.1M        80%
Performance Metrics

No. of papers in the topic in previous years

Year    Papers
2023    38
2022    84
2021    70
2020    62
2019    77
2018    108