
Speech coding

About: Speech coding is a research topic. Over its lifetime, 14,245 publications have been published within this topic, receiving 271,964 citations.


Papers
Journal Article
TL;DR: An efficient approach for unsupervised audio stream segmentation and clustering via the Bayesian Information Criterion (BIC) is proposed, which is particularly successful for short segment turns of less than 2 s in duration.
Abstract: In many speech and audio applications, it is first necessary to partition and classify acoustic events prior to voice coding for communication or speech recognition for spoken document retrieval. In this paper, we propose an efficient approach for unsupervised audio stream segmentation and clustering via the Bayesian Information Criterion (BIC). The proposed method extends an earlier formulation by Chen and Gopalakrishnan. In our formulation, Hotelling's T²-statistic is used to pre-select candidate segmentation boundaries, followed by BIC to perform the segmentation decision. The proposed algorithm also incorporates a variable-size increasing window scheme and a skip-frame test. Our experiments show that we can improve the final algorithm speed by a factor of 100 compared to Chen and Gopalakrishnan's, while achieving a 6.7% reduction in the acoustic boundary miss rate at the expense of a 5.7% increase in false alarm rate using DARPA Hub4 1997 evaluation data. The approach is particularly successful for short segment turns of less than 2 s in duration. The results suggest that the proposed algorithm is sufficiently effective and efficient for audio stream segmentation applications.

97 citations
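
The two-stage idea in the abstract above (a cheap T² test to pre-select a candidate boundary, then BIC to accept or reject it) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes one-dimensional features and a single Gaussian per segment, it uses a common textbook form of the ΔBIC score, and it omits the variable-size window and skip-frame test. The function names, the `margin` parameter, and the penalty weight `lam` are hypothetical.

```python
import math


def _mean_var(x):
    """Return the mean and (floored) maximum-likelihood variance of a list."""
    n = len(x)
    m = sum(x) / n
    v = sum((a - m) ** 2 for a in x) / n
    return m, max(v, 1e-12)  # floor avoids log(0) on constant segments


def t2_statistic(x, i):
    """Univariate Hotelling T^2 for a split at index i, using pooled variance."""
    left, right = x[:i], x[i:]
    n1, n2 = len(left), len(right)
    m1, v1 = _mean_var(left)
    m2, v2 = _mean_var(right)
    pooled = max((n1 * v1 + n2 * v2) / (n1 + n2 - 2), 1e-12)
    return (n1 * n2 / (n1 + n2)) * (m1 - m2) ** 2 / pooled


def delta_bic(x, i, lam=1.0):
    """Delta-BIC of 'two Gaussians split at i' versus 'one Gaussian'.

    A positive value favours placing a boundary at i.
    """
    n = len(x)
    _, v = _mean_var(x)
    _, v1 = _mean_var(x[:i])
    _, v2 = _mean_var(x[i:])
    gain = 0.5 * (n * math.log(v) - i * math.log(v1) - (n - i) * math.log(v2))
    # Assumption: d = 1, so the extra model has 2 parameters (mean, variance).
    penalty = lam * 0.5 * 2 * math.log(n)
    return gain - penalty


def find_boundary(x, margin=10, lam=1.0):
    """Pick the T^2 peak as the candidate, then confirm it with Delta-BIC."""
    candidates = range(margin, len(x) - margin)
    best = max(candidates, key=lambda i: t2_statistic(x, i))
    return best if delta_bic(x, best, lam) > 0 else None
```

The point of the split is cost: in the multivariate case the T² scan avoids evaluating two covariance determinants at every frame, and the more expensive BIC score is computed only at the pre-selected candidate.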

Proceedings Article
26 Apr 1985
TL;DR: It is demonstrated through analysis of both synthetic and natural speech that psychoacoustic concepts of spectral auditory integration in vowel perception, namely the F1, F2' concept of Carlson and Fant and the 3.5 Bark auditory integration concept of Chistovich, are well modeled by the PLP method.
Abstract: A novel speech analysis method which uses several established psychoacoustic concepts, the perceptually based linear predictive analysis (PLP), models the auditory spectrum by the spectrum of the low-order all-pole model. The auditory spectrum is derived from the speech waveform by critical-band filtering, equal-loudness curve pre-emphasis, and intensity-loudness root compression. We demonstrate through analysis of both synthetic and natural speech that psychoacoustic concepts of spectral auditory integration in vowel perception, namely the F1, F2' concept of Carlson and Fant and the 3.5 Bark auditory integration concept of Chistovich, are well modeled by the PLP method. A complete speech analysis-synthesis system based on the PLP method is also described in the paper.

97 citations
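
Three building blocks of the analysis chain described in the abstract above can be sketched in a few lines: warping frequency to the Bark scale, root compression of the spectrum, and fitting a low-order all-pole model. This is a rough sketch under stated assumptions, not the paper's procedure: the warping formula `6 * asinh(f / 600)` and the cube-root exponent are taken from later, commonly cited PLP descriptions, the critical-band filters and equal-loudness curve are omitted, and the all-pole fit uses the standard autocorrelation method (Levinson-Durbin recursion). All function names are hypothetical.

```python
import math


def hz_to_bark(f_hz):
    """Warp a frequency in Hz to the Bark scale (assumed form: 6 * asinh(f / 600))."""
    return 6.0 * math.asinh(f_hz / 600.0)


def cube_root_compress(power_spectrum):
    """Approximate the intensity-to-loudness law with a cube root (assumed exponent)."""
    return [p ** (1.0 / 3.0) for p in power_spectrum]


def levinson_durbin(r, order):
    """Fit an all-pole model to autocorrelation values r[0..order].

    Returns (a, err), where a[0] == 1 and the predictor is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order, and err is the
    final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err  # reflection coefficient for this stage
        new_a = a[:]
        for k in range(1, m):
            new_a[k] = a[k] + k_m * a[m - k]
        new_a[m] = k_m
        a = new_a
        err *= 1.0 - k_m * k_m
    return a, err
```

In a full pipeline the autocorrelation values passed to `levinson_durbin` would come from an inverse transform of the compressed auditory spectrum; here they are simply taken as input.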

Proceedings Article
08 Dec 2008
TL;DR: This is the first work to adopt graph theory to improve the codebook partition when using QIM in low bit-rate streaming media; the proposed algorithm guarantees that every codeword is in the opposite part to its nearest neighbor, and the distortion is limited by a bound.
Abstract: In this paper we introduce a novel codebook partition algorithm for quantization index modulation (QIM), which is applied to information hiding in instant low bit-rate speech streams. The QIM method divides the codebook into two parts, representing '0' and '1' respectively. Instead of randomly partitioning the codebook, the relationship between codewords is considered. The proposed algorithm, complementary neighbor vertices (CNV), guarantees that every codeword is in the opposite part to its nearest neighbor, and the distortion is limited by a bound. The feasibility of CNV is proved with graph theory. Moreover, in our work the secret message is embedded in the vector quantization indices of the LPC coefficients, with the benefit that the distortion due to QIM is lightened adaptively by the rest of the encoding procedure. Experiments on iLBC and G.723.1 verify the effectiveness of the proposed method. Both objective and subjective assessments show that the proposed method decreases the speech quality only slightly, to an indistinguishable degree. The hiding capacity is no less than 100 bps. To the best of our knowledge, this is the first work adopting graph theory to improve the codebook partition while using QIM in low bit-rate streaming media.

96 citations
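
The partition property stated in the abstract above (every codeword lies in the opposite part to its nearest neighbor) can be illustrated by two-coloring the nearest-neighbor graph of the codebook. The sketch below is one way to obtain such a partition and is not necessarily the paper's exact CNV procedure. It assumes distinct pairwise distances, in which case the nearest-neighbor graph contains no odd cycles and a breadth-first two-coloring always succeeds. The function names and the toy codebook in the usage note are hypothetical.

```python
from collections import deque


def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor(codebook, i):
    """Index of the codeword closest to codeword i (excluding i itself)."""
    return min((j for j in range(len(codebook)) if j != i),
               key=lambda j: sqdist(codebook[i], codebook[j]))


def cnv_partition(codebook):
    """Label each codeword 0 or 1 so that it differs from its nearest neighbor."""
    n = len(codebook)
    adj = [set() for _ in range(n)]
    for i in range(n):
        j = nearest_neighbor(codebook, i)
        adj[i].add(j)
        adj[j].add(i)
    label = [None] * n
    for start in range(n):
        if label[start] is not None:
            continue
        label[start] = 0
        queue = deque([start])
        while queue:  # breadth-first two-coloring of one connected component
            u = queue.popleft()
            for v in adj[u]:
                if label[v] is None:
                    label[v] = 1 - label[u]
                    queue.append(v)
                elif label[v] == label[u]:
                    raise ValueError("nearest-neighbor graph is not bipartite")
    return label


def embed(vector, bit, codebook, label):
    """QIM embedding: quantize to the nearest codeword in the part labeled `bit`."""
    candidates = [i for i in range(len(codebook)) if label[i] == bit]
    return min(candidates, key=lambda i: sqdist(vector, codebook[i]))


def extract(index, label):
    """QIM extraction: the hidden bit is the label of the received index."""
    return label[index]
```

For example, with a toy codebook such as `[(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.5)]`, calling `embed(v, bit, codebook, cnv_partition(codebook))` returns an index whose label equals `bit`. Because the nearest neighbor of any codeword carries the opposite label, forcing a vector into the "wrong" part costs at most a move to a nearby codeword, which is the intuition behind the bounded distortion.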

Proceedings Article
12 May 2019
TL;DR: This work demonstrates that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality.
Abstract: In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality. A prosody-transparent and speaker-independent model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits perceptual quality which is around halfway between the MELP codec at 2.4 kbps and AMR-WB codec at 23.05 kbps. In addition, when training on high-quality recorded speech with the test speaker included in the training set, a model coding speech at 1.6 kbps produces output of similar perceptual quality to that generated by AMR-WB at 23.05 kbps.

96 citations

Patent
04 Aug 2000
TL;DR: A scalable data structure for audio transmission includes core and augmentation layers, the former for carrying a first coding of an audio signal that places post-decode noise beneath a desired noise spectrum, the latter for carrying offset data regarding the desired noise spectrum and data about coding of the audio signal.
Abstract: Scalable coding of audio into a core layer in response to a desired noise spectrum established according to psychoacoustic principles supports coding augmentation data into augmentation layers in response to various criteria including offset of such desired noise spectrum. Compatible decoding provides a plurality of decoded resolutions from a single signal. Coding is preferably performed on subband signals generated according to spectral transform, quadrature mirror filtering, or other conventional processing of audio input. A scalable data structure for audio transmission includes core and augmentation layers, the former for carrying a first coding of an audio signal that places post-decode noise beneath a desired noise spectrum, the latter for carrying offset data regarding the desired noise spectrum and data about coding of the audio signal that places post-decode noise beneath the desired noise spectrum shifted by the offset data.

96 citations


Network Information
Related Topics (5)
Signal processing
73.4K papers, 983.5K citations
86% related
Decoding methods
65.7K papers, 900K citations
84% related
Fading
55.4K papers, 1M citations
80% related
Feature vector
48.8K papers, 954.4K citations
80% related
Feature extraction
111.8K papers, 2.1M citations
80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    38
2022    84
2021    70
2020    62
2019    77
2018    108