
Showing papers on "Speech coding published in 2010"


Proceedings Article
01 Sep 2010
TL;DR: This paper reports the recent exploration of the layer-by-layer learning strategy for training a multi-layer generative model of patches of speech spectrograms and shows that the learned binary codes produce a log-spectral distortion that is approximately 2 dB lower than a subband vector quantization technique over the entire frequency range of wide-band speech.
Abstract: This paper reports our recent exploration of the layer-by-layer learning strategy for training a multi-layer generative model of patches of speech spectrograms. The top layer of the generative model learns binary codes that can be used for efficient compression of speech and could also be used for scalable speech recognition or rapid speech content retrieval. Each layer of the generative model is fully connected to the layer below and the weights on these connections are pretrained efficiently by using the contrastive divergence approximation to the log likelihood gradient. After layer-by-layer pre-training we “unroll” the generative model to form a deep auto-encoder, whose parameters are then fine-tuned using back-propagation. To reconstruct the full-length speech spectrogram, individual spectrogram segments predicted by their respective binary codes are combined using an overlap-and-add method. Experimental results on speech spectrogram coding demonstrate that the binary codes produce a log-spectral distortion that is approximately 2 dB lower than a subband vector quantization technique over the entire frequency range of wide-band speech. Index Terms: deep learning, speech feature extraction, neural networks, auto-encoder, binary codes, Boltzmann machine

372 citations
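The overlap-and-add reconstruction step mentioned in the abstract can be sketched as follows; this is a minimal NumPy illustration of the generic technique (the uniform hop size and overlap-count normalization are illustrative assumptions, not details from the paper):

```python
import numpy as np

def overlap_add(segments, hop):
    """Reconstruct a full-length signal from overlapping fixed-size
    segments by summing them at hop-spaced offsets, then normalizing
    by the overlap count so constant inputs are reproduced exactly."""
    seg_len = len(segments[0])
    out_len = hop * (len(segments) - 1) + seg_len
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i, seg in enumerate(segments):
        out[i * hop : i * hop + seg_len] += seg
        norm[i * hop : i * hop + seg_len] += 1.0
    return out / np.maximum(norm, 1e-12)
```

With spectrogram segments decoded from their binary codes, the same summation smooths discontinuities at segment boundaries.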


Journal ArticleDOI
01 Jun 2010
TL;DR: An overview of a number of current and emerging applications of sparse representations in areas from audio coding, audio enhancement and music transcription to blind source separation solutions that can solve the cocktail party problem.
Abstract: Sparse representations have proved a powerful tool in the analysis and processing of audio signals and already lie at the heart of popular coding standards such as MP3 and Dolby AAC. In this paper we give an overview of a number of current and emerging applications of sparse representations in areas from audio coding, audio enhancement and music transcription to blind source separation solutions that can solve the “cocktail party problem.” In each case we will show how the prior assumption that the audio signals are approximately sparse in some time-frequency representation allows us to address the associated signal processing task.

239 citations


Journal ArticleDOI
TL;DR: This paper introduces a novel non-parametric, exemplar-based method for reconstructing clean speech from noisy observations, based on techniques from the field of Compressive Sensing; the method can impute missing features using larger time windows, such as entire words.
Abstract: An effective way to increase the noise robustness of automatic speech recognition is to label noisy speech features as either reliable or unreliable (missing), and to replace (impute) the missing ones by clean speech estimates. Conventional imputation techniques employ parametric models and impute the missing features on a frame-by-frame basis. At low signal-to-noise ratios (SNRs), these techniques fail, because too many time frames may contain few, if any, reliable features. In this paper, we introduce a novel non-parametric, exemplar-based method for reconstructing clean speech from noisy observations, based on techniques from the field of Compressive Sensing. The method, dubbed sparse imputation, can impute missing features using larger time windows such as entire words. Using an overcomplete dictionary of clean speech exemplars, the method finds the sparsest combination of exemplars that jointly approximate the reliable features of a noisy utterance. That linear combination of clean speech exemplars is used to replace the missing features. Recognition experiments on noisy isolated digits show that sparse imputation outperforms conventional imputation techniques at SNR = -5 dB when using an ideal 'oracle' mask. With error-prone estimated masks sparse imputation performs slightly worse than the best conventional technique.

113 citations
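The sparse-imputation idea, finding a sparse combination of clean exemplars that matches only the reliable features and using the full combination to fill in the missing ones, can be sketched with a toy orthogonal-matching-pursuit solver. The dictionary, mask, and sparsity level here are hypothetical stand-ins; the paper's actual solver and word-sized exemplar windows differ:

```python
import numpy as np

def sparse_impute(y, mask, D, n_nonzero=3):
    """Toy sketch of exemplar-based imputation: greedily select
    dictionary columns (exemplars) that best explain the reliable
    entries of y (mask == True), then use the resulting full-band
    combination to replace the missing entries."""
    Dr, yr = D[mask], y[mask]           # restrict to reliable features
    residual, support = yr.copy(), []
    for _ in range(n_nonzero):
        # pick the exemplar most correlated with the current residual
        k = int(np.argmax(np.abs(Dr.T @ residual)))
        if k not in support:
            support.append(k)
        coef, *_ = np.linalg.lstsq(Dr[:, support], yr, rcond=None)
        residual = yr - Dr[:, support] @ coef
    estimate = D[:, support] @ coef     # full-band reconstruction
    out = y.copy()
    out[~mask] = estimate[~mask]        # impute only the missing features
    return out
```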


Journal ArticleDOI
TL;DR: A TF-based audio coding scheme with a novel psychoacoustics model, music classification, audio classification of environmental sounds, audio fingerprinting, and audio watermarking will be presented to demonstrate the advantages of using time-frequency approaches in analyzing and extracting information from audio signals.
Abstract: Audio signals are information rich nonstationary signals that play an important role in our day-to-day communication, perception of environment, and entertainment. Due to their non-stationary nature, time- or frequency-only approaches are inadequate in analyzing these signals. A joint time-frequency (TF) approach would be a better choice to efficiently process these signals. In this digital era, compression, intelligent indexing for content-based retrieval, classification, and protection of digital audio content are a few of the areas that encapsulate a majority of the audio signal processing applications. In this paper, we present a comprehensive array of TF methodologies that successfully address applications in all of the above mentioned areas. A TF-based audio coding scheme with a novel psychoacoustics model, music classification, audio classification of environmental sounds, audio fingerprinting, and audio watermarking will be presented to demonstrate the advantages of using time-frequency approaches in analyzing and extracting information from audio signals.

101 citations



Journal ArticleDOI
TL;DR: A noise adaptive training (NAT) algorithm, applicable to all training data, that normalizes environmental distortion as part of model training; the resulting pseudo-clean model parameters are later used with vector Taylor series (VTS) model adaptation for decoding noisy utterances at test time.
Abstract: In traditional methods for noise robust automatic speech recognition, the acoustic models are typically trained using clean speech or using multi-condition data that is processed by the same feature enhancement algorithm expected to be used in decoding. In this paper, we propose a noise adaptive training (NAT) algorithm that can be applied to all training data and that normalizes the environmental distortion as part of the model training. In contrast to feature enhancement methods, NAT estimates the underlying “pseudo-clean” model parameters directly without relying on point estimates of the clean speech features as an intermediate step. The pseudo-clean model parameters learned with NAT are later used with vector Taylor series (VTS) model adaptation for decoding noisy utterances at test time. Experiments performed on the Aurora 2 and Aurora 3 tasks demonstrate that the proposed NAT method obtains relative improvements of 18.83% and 32.02%, respectively, over VTS model adaptation.

85 citations


Patent
16 Sep 2010
TL;DR: An audio system that can be used in multiple modes or use scenarios, while still providing a user with a desirable level of audio quality and comfort, has been proposed; the system includes a use-mode detection element that enables it to detect the mode of use and, in response, to be automatically configured for optimal performance in a specific use scenario.
Abstract: An audio system that may be used in multiple modes or use scenarios, while still providing a user with a desirable level of audio quality and comfort. The inventive system may include multiple components or elements, with the components or elements capable of being used in different configurations depending upon the mode of use. The different configurations provide an optimized user audio experience for multiple modes of use without requiring a user to carry multiple devices or sacrifice the audio quality or features desired for a particular situation. The inventive audio system includes a use mode detection element that enables the system to detect the mode of use, and in response, to be automatically configured for optimal performance for a specific use scenario. This may include, for example, the use of one or more audio processing elements that perform signal processing on the audio signals to implement a variety of desired functions (e.g., noise reduction, echo cancellation, etc.).

85 citations


Proceedings ArticleDOI
14 Mar 2010
TL;DR: A novel feature extraction technique for speech recognition based on the principles of sparse coding to express a spectro-temporal pattern of speech as a linear combination of an overcomplete set of basis functions such that the weights of the linear combination are sparse.
Abstract: This paper proposes a novel feature extraction technique for speech recognition based on the principles of sparse coding. The idea is to express a spectro-temporal pattern of speech as a linear combination of an overcomplete set of basis functions such that the weights of the linear combination are sparse. These weights (features) are subsequently used for acoustic modeling. We learn a set of overcomplete basis functions (dictionary) from the training set by adopting a previously proposed algorithm which iteratively minimizes the reconstruction error and maximizes the sparsity of weights. Furthermore, features are derived using the learned basis functions by applying the well established principles of compressive sensing. Phoneme recognition experiments show that the proposed features outperform the conventional features in both clean and noisy conditions.

83 citations
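Sparse feature extraction of the kind described, finding sparse weights of an overcomplete dictionary that minimize reconstruction error plus a sparsity penalty, can be illustrated with a basic iterative soft-thresholding (ISTA) loop. The penalty weight and iteration count are illustrative choices, not the paper's settings:

```python
import numpy as np

def ista_sparse_code(x, D, lam=0.1, n_iter=200):
    """Sketch of sparse coding: given a (possibly overcomplete)
    dictionary D, find sparse weights w minimizing
    0.5*||x - D w||^2 + lam*||w||_1 via iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant
    w = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = w + step * (D.T @ (x - D @ w))   # gradient step on the fit term
        # soft threshold shrinks small weights to exactly zero
        w = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)
    return w
```

The resulting weight vector w plays the role of the features passed to acoustic modeling; in the paper the dictionary itself is also learned, which this sketch omits.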


Journal ArticleDOI
TL;DR: Analysis and reconstruction methods designed for the prosthesis are presented, and their ability to obtain natural-sounding speech from the whisper-speech signal using an external analysis-by-synthesis processing framework is demonstrated.
Abstract: Whispered speech can be useful for quiet and private communication, and is the primary means of unaided spoken communication for many people experiencing voice-box deficiencies. Patients who have undergone partial or full laryngectomy are typically unable to speak anything more than hoarse whispers, without the aid of prostheses or specialized speaking techniques. Each of the current prostheses and rehabilitative methods for post-laryngectomized patients (primarily oesophageal speech, tracheo-esophageal puncture, and electrolarynx) has particular disadvantages, prompting new work on nonsurgical, noninvasive alternative solutions. One such solution, described in this paper, combines whisper signal analysis with direct formant insertion and speech modification located outside the vocal tract. This approach allows laryngectomy patients to regain their ability to speak with a more natural voice than alternative methods, by whispering into an external prosthesis, which then recreates and outputs natural-sounding speech. It relies on the observation that while the pitch-generation mechanism of laryngectomy patients is damaged or unusable, the remaining components of the speech production apparatus may be largely unaffected. This paper presents analysis and reconstruction methods designed for the prosthesis, and demonstrates their ability to obtain natural-sounding speech from the whisper-speech signal using an external analysis-by-synthesis processing framework.

83 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrate the potential of compressed sensing in speech coding techniques, offering high perceptual quality with a very sparse approximated prediction residual.
Abstract: Encouraged by the promising application of compressed sensing in signal compression, we investigate its formulation and application in the context of speech coding based on sparse linear prediction. In particular, a compressed sensing method can be devised to compute a sparse approximation of speech in the residual domain when sparse linear prediction is involved. We compare the method of computing a sparse prediction residual with the optimal technique based on an exhaustive search of the possible nonzero locations and the well known Multi-Pulse Excitation, the first encoding technique to introduce the sparsity concept in speech coding. Experimental results demonstrate the potential of compressed sensing in speech coding techniques, offering high perceptual quality with a very sparse approximated prediction residual.

79 citations
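The sparse-linear-prediction setting can be illustrated by computing a prediction residual, the signal that the compressed sensing method then approximates with few nonzero pulses. This plain least-squares LPC fit is a generic sketch, not the paper's sparse-LP formulation:

```python
import numpy as np

def lpc_residual(x, order=2):
    """Fit linear-prediction coefficients by least squares and return
    the residual e[n] = x[n] - sum_k a[k] * x[n-1-k]; for speech this
    residual is the part a sparse excitation model must represent."""
    rows = np.array([x[n - order:n][::-1] for n in range(order, len(x))])
    a, *_ = np.linalg.lstsq(rows, x[order:], rcond=None)
    residual = x[order:] - rows @ a
    return a, residual
```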


Journal ArticleDOI
TL;DR: This work proposes a codec that simultaneously addresses both high quality and low delay, with a delay of only 8.7 ms at 44.1 kHz, and uses gain-shape algebraic vector quantization in the frequency domain with time-domain pitch prediction.
Abstract: With increasing quality requirements for multimedia communications, audio codecs must maintain both high quality and low delay. Typically, audio codecs offer either low delay or high quality, but rarely both. We propose a codec that simultaneously addresses both these requirements, with a delay of only 8.7 ms at 44.1 kHz. It uses gain-shape algebraic vector quantization in the frequency domain with time-domain pitch prediction. We demonstrate that the proposed codec operating at 48 kb/s and 64 kb/s outperforms both G.722.1C and MP3 and has quality comparable to AAC-LD, despite having less than one fourth of the algorithmic delay of these codecs.
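Gain-shape quantization, splitting a vector into a scalar gain and a unit-norm shape that are quantized separately, can be sketched as follows. The toy codebooks are stand-ins for illustration only, not the codec's algebraic codebooks:

```python
import numpy as np

def gain_shape_quantize(x, shape_codebook, gain_levels):
    """Illustrative gain-shape quantization: encode a vector as a
    scalar gain times a unit-norm shape, quantizing each part
    separately. shape_codebook rows are assumed unit-norm;
    gain_levels is a small scalar codebook."""
    gain = np.linalg.norm(x)
    shape = x / gain if gain > 0 else shape_codebook[0]
    # nearest shape by inner product (equivalently, angular distance)
    si = int(np.argmax(shape_codebook @ shape))
    gi = int(np.argmin(np.abs(gain_levels - gain)))
    return gi, si, gain_levels[gi] * shape_codebook[si]
```

Quantizing energy (gain) apart from spectral detail (shape) is what lets such codecs preserve the per-band energy envelope even at low rates.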

Patent
19 Oct 2010
TL;DR: In this paper, a transform domain path (230, 240, 242, 250, 260) was proposed to obtain a time-domain representation of a portion of the audio content encoded in a transform-domain mode on the basis of a first set of spectral coefficients, a representation of an aliasing-cancellation stimulus signal and a plurality of linear-prediction-domain parameters.
Abstract: An audio signal decoder (200) for providing a decoded representation (212) of an audio content on the basis of an encoded representation (310) of the audio content comprises a transform domain path (230, 240, 242, 250, 260) configured to obtain a time-domain representation (212) of a portion of the audio content encoded in a transform-domain mode on the basis of a first set (220) of spectral coefficients, a representation (224) of an aliasing-cancellation stimulus signal and a plurality of linear-prediction-domain parameters (222). The transform domain path comprises a spectrum processor (230) configured to apply a spectrum shaping to the first set of spectral coefficients in dependence on at least a subset of the linear-prediction-domain parameters, to obtain a spectrally-shaped version (232) of the first set of spectral coefficients. The transform domain path comprises a first frequency-domain-to-time-domain converter (240) configured to obtain a time-domain representation of the audio content on the basis of the spectrally-shaped version of the first set of spectral coefficients. The transform domain path comprises an aliasing-cancellation stimulus filter configured to filter (250) the aliasing-cancellation stimulus signal (324) in dependence on at least a subset of the linear-prediction-domain parameters (222), to derive an aliasing-cancellation synthesis signal (252) from the aliasing-cancellation stimulus signal. The transform domain path also comprises a combiner (260) configured to combine the time-domain representation (242) of the audio content with the aliasing-cancellation synthesis signal (252), or a post-processed version thereof, to obtain an aliasing reduced time-domain signal.

Patent
06 Oct 2010
TL;DR: In this paper, a multi-mode audio signal decoder for providing a decoded representation of an audio content on the basis of an encoded representation of the audio content comprises a spectral value determinator configured to obtain sets of decoded spectral coefficients for a plurality of portions of audio content.
Abstract: A multi-mode audio signal decoder for providing a decoded representation of an audio content on the basis of an encoded representation of the audio content comprises a spectral value determinator configured to obtain sets of decoded spectral coefficients for a plurality of portions of the audio content. The audio signal decoder also comprises a spectrum processor configured to apply a spectral shaping to a set of spectral coefficients, or to a pre-processed version thereof, in dependence on a set of linear-prediction-domain parameters for a portion of the audio content encoded in a linear-prediction mode, and to apply a spectral shaping to a set of decoded spectral coefficients, or a pre-processed version thereof, in dependence on a set of scale factor parameters for a portion of the audio content encoded in a frequency-domain mode. The audio signal decoder comprises a frequency-domain-to-time-domain converter configured to obtain a time-domain representation of the audio content on the basis of a spectrally-shaped set of decoded spectral coefficients for a portion of the audio content encoded in the linear-prediction mode, and to obtain a time domain representation of the audio content on the basis of a spectrally shaped set of decoded spectral coefficients for a portion of the audio content encoded in the frequency domain mode. An audio signal encoder is also described.

Journal ArticleDOI
TL;DR: To detect double MP3 compression, this paper extracts the statistical features on the modified discrete cosine transform and applies a support vector machine to the extracted features for classification and shows that the designed method is highly effective for detecting faked MP3 files.
Abstract: MPEG-1 Audio Layer 3, more commonly referred to as MP3, is a popular audio format for consumer audio storage and a de facto standard of digital audio compression for the transfer and playback of music on digital audio players. MP3 audio forgery manipulations generally uncompress an MP3 file, tamper with the file in the temporal domain, and then compress the doctored audio file back into MP3 format. If the compression quality of the doctored MP3 file is different from the quality of the original MP3 file, the doctored MP3 file is said to have undergone double MP3 compression. Although double MP3 compression does not prove malicious tampering, it is evidence of manipulation and thus may warrant further forensic analysis since, e.g., faked MP3 files can be generated by using double MP3 compression at a higher bit-rate for the second compression to claim a higher quality of the audio files. To detect double MP3 compression, in this paper, we extract statistical features on the modified discrete cosine transform and apply a support vector machine to the extracted features for classification. Experimental results show that our method is highly effective for detecting faked MP3 files. Our study also indicates that the detection performance is closely related to the bit-rate of the first-time MP3 encoding and the bit-rate of the second-time MP3 encoding.

Proceedings ArticleDOI
14 Mar 2010
TL;DR: An information theoretic data compression framework with an objective to maximize the overall compression of the visual information gathered in a WMSN, and an entropy-based divergence measure (EDM) scheme is proposed to predict the compression efficiency of performing joint coding on the images collected by spatially correlated cameras.
Abstract: Data redundancy caused by correlation has motivated the application of collaborative multimedia in-network processing for data filtering and compression in wireless multimedia sensor networks (WMSNs). This paper proposes an information theoretic data compression framework with an objective to maximize the overall compression of the visual information gathered in a WMSN. To achieve this, an entropy-based divergence measure (EDM) scheme is proposed to predict the compression efficiency of performing joint coding on the images collected by spatially correlated cameras. The novelty of EDM relies on its independence of the specific image types and coding algorithms, thereby providing a generic mechanism for prior evaluation of compression under different coding solutions. Utilizing the predicted results from EDM, a distributed multi-cluster coding protocol (DMCP) is proposed to construct a compression-oriented coding hierarchy. The DMCP aims to partition the entire network into a set of coding clusters such that the global coding gain is maximized. Moreover, in order to enhance decoding reliability at the data sink, the DMCP also guarantees that each sensor camera is covered by at least two different coding clusters. Experiments with the H.264 standard show that the proposed EDM can effectively predict the joint coding efficiency from multiple sources. Further simulations demonstrate that the proposed compression framework can reduce the total coding rate by 10%-23% compared with the individual coding scheme, in which each camera sensor compresses its own image independently.

Journal ArticleDOI
TL;DR: By enhancing fine structure coding in the lower frequencies, as implemented in the FSP coding strategy, speech perception in noise can be enhanced.

Journal ArticleDOI
TL;DR: A novel reformulation of the constraint, which allows for an efficient solution by nonlinear optimization algorithms, is derived in this paper so that a practicable implementation of REMOS for logmelspec features becomes possible.
Abstract: The REMOS (REverberation MOdeling for Speech recognition) concept for reverberation-robust distant-talking speech recognition, introduced in “Distant-talking continuous speech recognition based on a novel reverberation model in the feature domain” (A. Sehr , in Proc. Interspeech, 2006, pp. 769-772) for melspectral features, is extended to logarithmic melspectral (logmelspec) features in this contribution. Thus, the favorable properties of REMOS, including its high flexibility with respect to changing reverberation conditions, become available in the more competitive logmelspec domain. Based on a combined acoustic model consisting of a hidden Markov model (HMM) network and a reverberation model (RM), REMOS determines clean-speech and reverberation estimates during recognition. Therefore, in each iteration of a modified Viterbi algorithm, an inner optimization operation maximizes the joint density of the current HMM output and the RM output subject to the constraint that their combination is equal to the current reverberant observation. Since the combination operation in the logmelspec domain is nonlinear, numerical methods appear necessary for solving the constrained inner optimization problem. A novel reformulation of the constraint, which allows for an efficient solution by nonlinear optimization algorithms, is derived in this paper so that a practicable implementation of REMOS for logmelspec features becomes possible. An in-depth analysis of this REMOS implementation investigates the statistical properties of its reverberation estimates and thus derives possibilities for further improving the performance of REMOS. Connected digit recognition experiments show that the proposed REMOS version in the logmelspec domain significantly outperforms the melspec version. 
While the proposed RMs with parameters estimated by straightforward training for a given room are robust to a mismatch of the speaker-microphone distance, their performance significantly decreases if they are used in a room with substantially different conditions. However, by training multi-style RMs with data from several rooms, good performance can be achieved across different rooms.

Proceedings ArticleDOI
14 Mar 2010
TL;DR: This work presents a monaural speech enhancement method based on sparse coding of noisy speech signals in a composite dictionary, consisting of the concatenation of a speech and interferer dictionary, both being possibly over-complete.
Abstract: The enhancement of speech degraded by non-stationary interferers is a highly relevant and difficult task of many signal processing applications. We present a monaural speech enhancement method based on sparse coding of noisy speech signals in a composite dictionary, consisting of the concatenation of a speech and interferer dictionary, both being possibly over-complete. The speech dictionary is learned off-line on a training corpus, while an environment specific interferer dictionary is learned on-line during speech pauses. Our approach optimizes the trade-off between source distortion and source confusion, and thus achieves significant improvements on objective quality measures like cepstral distance, in the speaker dependent and independent case, in several real-world environments and at low signal-to-noise ratios. Our enhancement method outperforms state-of-the-art methods like multi-band spectral subtraction and approaches based on vector quantization.

Journal ArticleDOI
TL;DR: Comparison with the GMM-based method for prosody conversion showed improved performance using the hierarchical prosodic structure and the proposed regression-based clustering method.
Abstract: This paper presents an approach to hierarchical prosody conversion for emotional speech synthesis. The pitch contour of the source speech is decomposed into a hierarchical prosodic structure consisting of sentence, prosodic word, and subsyllable levels. The pitch contour in the higher level is encoded by the discrete Legendre polynomial coefficients. The residual, the difference between the source pitch contour and the pitch contour decoded from the discrete Legendre polynomial coefficients, is then used for pitch modeling at the lower level. For prosody conversion, Gaussian mixture models (GMMs) are used for sentence- and prosodic word-level conversion. At the subsyllable level, the pitch feature vectors are clustered via a proposed regression-based clustering method to generate the prosody conversion functions for selection. Linguistic and symbolic prosody features of the source speech are adopted to select the most suitable function using the classification and regression tree for prosody conversion. Three small-sized emotional parallel speech databases with happy, angry, and sad emotions, respectively, were designed and collected for training and evaluation. Objective and subjective evaluations were conducted, and comparison with the GMM-based method showed that prosody conversion achieved improved performance using the hierarchical prosodic structure and the proposed regression-based clustering method.
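Encoding a pitch contour by discrete Legendre polynomial coefficients plus a residual can be sketched with NumPy's Legendre basis; the sampling grid and polynomial order are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def legendre_encode(contour, order=3):
    """Approximate a pitch contour with a few Legendre polynomial
    coefficients (coarse shape) and return the residual, which a
    lower prosodic level would then model."""
    n = len(contour)
    t = np.linspace(-1.0, 1.0, n)                        # contour support
    basis = np.polynomial.legendre.legvander(t, order)   # n x (order+1)
    coef, *_ = np.linalg.lstsq(basis, contour, rcond=None)
    approx = basis @ coef
    return coef, contour - approx   # coefficients and residual
```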

Patent
23 Jun 2010
TL;DR: An audio signal decoder for providing an upmix signal representation in dependence on a downmix signal representation and object-related parametric information comprises an object separator configured to decompose the downmix signal representation and an audio signal processor configured to receive and process the second audio information.
Abstract: An audio signal decoder for providing an upmix signal representation in dependence on a downmix signal representation and an object-related parametric information comprises an object separator configured to decompose the downmix signal representation, to provide a first audio information describing a first set of one or more audio objects of a first audio object type and a second audio information describing a second set of one or more audio objects of a second audio object type, in dependence on the downmix signal representation and using at least a part of the object-related parametric information. The audio signal decoder also comprises an audio signal processor configured to receive the second audio information and to process the second audio information in dependence on the object-related parametric information, to obtain a processed version of the second audio information. The audio signal decoder also comprises an audio signal combiner configured to combine the first audio information with the processed version of the second audio information, to obtain the upmix signal representation.

Patent
27 Jan 2010
TL;DR: In this paper, a signal combiner is configured to overlap-and-add time-domain representations of subsequent audio frames encoded in different domains, in order to smooth the transition between the time-domain representations of the subsequent frames.
Abstract: An audio decoder for providing a decoded representation of an audio content on the basis of an encoded representation of the audio content comprises a linear-prediction-domain decoder core configured to provide a time-domain representation of an audio frame on the basis of a set of linear-prediction domain parameters associated with the audio frame and a frequency-domain decoder core configured to provide a time-domain representation of an audio frame on the basis of a set of frequency-domain parameters, taking into account a transform window out of a set comprising a plurality of different transform windows. The audio decoder comprises a signal combiner configured to overlap-and-add time-domain representations of subsequent audio frames encoded in different domains, in order to smooth the transition between the time-domain representations of the subsequent frames. The set of transform windows comprises one or more windows specifically adapted for a transition between a frequency-domain core mode and a linear-prediction-domain core mode.

Proceedings ArticleDOI
19 Jul 2010
TL;DR: A two-step framework is proposed to estimate the background noise with minimal speech leakage signal and a correlation based similarity measure is applied to determine the integrity of speech signal, showing that it performs better than the existing speech enhancement algorithms with significant improvement in terms of SNR value.
Abstract: This paper presents a new audio forensics method based on background noise in audio signals. Traditional speech enhancement algorithms improve the quality of speech signals; however, existing methods leave traces of speech in the removed noise. Noise estimated using these methods contains traces of the speech signal, also known as the leakage signal. Although this speech leakage signal has a low SNR, it can easily be perceived by listening to the estimated noise signal, and it therefore cannot be used for audio forensics applications. For reliable audio authentication, a better noise estimation method is desirable. To achieve this goal, a two-step framework is proposed to estimate the background noise with minimal speech leakage. A correlation-based similarity measure is then applied to determine the integrity of the speech signal. The proposed method has been evaluated on different speech signals recorded in various environments. The results show that it performs better than existing speech enhancement algorithms, with significant improvement in terms of SNR value.
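A correlation-based similarity measure of the kind the abstract mentions can be as simple as the normalized cross-correlation at zero lag; this generic sketch is not the paper's exact measure:

```python
import numpy as np

def ncc_similarity(a, b):
    """Normalized cross-correlation at zero lag: 1.0 for signals that
    are identical up to positive scaling and offset, near 0 for
    unrelated signals, -1.0 for inverted ones."""
    a = a - np.mean(a)
    b = b - np.mean(b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```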

Proceedings ArticleDOI
TL;DR: This piece of work is the first to detect double compression of audio signals and uses support vector machine classifiers with feature vectors formed by the distributions of the first digits of the quantized MDCT (modified discrete cosine transform) coefficients.
Abstract: MP3 is the most popular audio format nowadays in our daily life; for example, music downloaded from the Internet and files saved in digital recorders are often in MP3 format. However, low-bitrate MP3s are often transcoded to high bitrate, since high-bitrate files are of high commercial value. Also, audio recordings in digital recorders can be doctored easily with pervasive audio editing software. This paper presents two methods for the detection of double MP3 compression. The methods are essential for detecting fake-quality MP3 files and for audio forensics. The proposed methods use support vector machine classifiers with feature vectors formed by the distributions of the first digits of the quantized MDCT (modified discrete cosine transform) coefficients. Extensive experiments demonstrate the effectiveness of the proposed methods. To the best of our knowledge, this piece of work is the first to detect double compression of audio signals.
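The first-digit feature these methods rely on can be sketched as an empirical leading-digit histogram over nonzero quantized MDCT coefficients (the Benford's-law statistic); the coefficient values used here are hypothetical:

```python
import numpy as np

def first_digit_hist(coeffs):
    """Empirical distribution of leading digits 1-9 of nonzero
    coefficients. Singly compressed audio tends to follow Benford's
    law; recompression perturbs this distribution, which a classifier
    can then pick up."""
    c = np.abs(np.asarray(coeffs, dtype=float))
    c = c[c > 0]
    # leading digit of x is floor(x / 10**floor(log10 x))
    digits = np.floor(c / 10.0 ** np.floor(np.log10(c))).astype(int)
    hist = np.bincount(digits, minlength=10)[1:10]
    return hist / hist.sum()
```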

Proceedings Article
01 Aug 2010
TL;DR: This contribution uses a theoretical analysis of the Speech Intelligibility Index (SII) to develop an algorithm which numerically maximizes the SII under the constraint of an unchanged average power of the audio signal.
Abstract: In speech communications, signal processing algorithms for near-end listening enhancement improve the intelligibility of clean (far-end) speech for the near-end listener, who perceives not only the far-end speech but also ambient background noise. A typical scenario is mobile telephony in acoustic background noise such as traffic or babble noise. In these situations, it is often not acceptable or possible to increase the audio power amplification. In this contribution we use a theoretical analysis of the Speech Intelligibility Index (SII) to develop an algorithm which numerically maximizes the SII under the constraint of an unchanged average power of the audio signal.
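The paper's actual optimizer is an SII-based numerical maximization; as a toy illustration of the core idea, the power-constrained reallocation of speech energy toward bands where it helps intelligibility most can be sketched as below (the scoring rule and names are assumptions for illustration, not the authors' algorithm):

```python
import numpy as np

def reallocate_band_gains(speech_power, importance, noise_power):
    """Toy near-end enhancement: shift energy toward frequency bands where the
    band-importance-weighted benefit over the noise is largest, while keeping
    the total speech power unchanged (the constraint in the abstract)."""
    score = importance / (noise_power + 1e-12)   # crude benefit per band
    weights = score / score.sum()
    total = speech_power.sum()
    new_power = weights * total                  # redistribute under fixed total power
    gains = np.sqrt(new_power / (speech_power + 1e-12))
    return gains, new_power
```

The key property, mirroring the paper's constraint, is that `new_power.sum()` equals the original total power, so loudness is redistributed rather than increased.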

Proceedings ArticleDOI
25 Oct 2010
TL;DR: This paper extracts statistical features from the modified discrete cosine transform coefficients, applies support vector machines and a dynamic evolving neural-fuzzy inference system to the extracted features for classification, and effectively and accurately detects double MP3 compression.
Abstract: MP3 is the most popular format for audio storage and a de facto standard of digital audio compression for transfer and playback. The flexibility of the compression ratio in MP3 coding enables users to choose their own configuration in the trade-off between file size and quality. Double MP3 compression often occurs in audio forgery, steganography, and quality faking, by transcoding an MP3 audio file to a different compression ratio. To detect double MP3 compression, in this paper we extract statistical features from the modified discrete cosine transform coefficients, and apply support vector machines and a dynamic evolving neural-fuzzy inference system to the extracted features for classification. Experimental results show that our method effectively and accurately detects double MP3 compression for both up-transcoded and down-transcoded MP3 files. Our study also indicates the potential for mining the audio processing history for forensic purposes.
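The classification stage pairs the extracted feature vectors with an SVM. As a minimal self-contained stand-in for the SVM training the paper relies on, a linear SVM trained by subgradient descent on the hinge loss can be sketched as follows (hyperparameters are illustrative assumptions):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimal linear SVM: subgradient descent on the regularized hinge loss.
    X is (n_samples, n_features); y holds labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (X[i] @ w + b) < 1:        # margin violated: hinge gradient
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                                # only the regularizer shrinks w
                w = (1 - lr * lam) * w
    return w, b

def predict(w, b, X):
    """Class labels in {-1, +1} from the sign of the decision function."""
    return np.sign(X @ w + b)
```

In the paper's setting, `X` would hold the per-file MDCT statistics and `y` would mark singly versus doubly compressed files.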

Patent
Tian Wang1, Hosam A. Khalil1, Kazuhito Koishida1, Wei-ge Chen1, Mu Han1 
22 Jan 2010
TL;DR: In this paper, various rate/quality control and loss resiliency strategies for audio codecs are described, including intra frame coding/decoding, adaptive multi-mode forward error correction (FEC), and rate and quality control techniques.
Abstract: Various strategies for rate/quality control and loss resiliency in an audio codec are described. The various strategies can be used in combination or independently. For example, a real-time speech codec uses intra frame coding/decoding, adaptive multi-mode forward error correction [“FEC”], and rate/quality control techniques. Intra frames help a decoder recover quickly from packet losses, while compression efficiency is still emphasized with predicted frames. Various strategies for inserting intra frames and signaling intra/predicted frames are described. With the adaptive multi-mode FEC, an encoder adaptively selects between multiple modes to efficiently and quickly provide a level of FEC that takes into account the bandwidth currently available for FEC. The FEC information itself may be predictively encoded and decoded relative to primary encoded information. Various rate/quality and FEC control strategies allow additional adaptation to available bandwidth and network conditions.
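The patent leaves the FEC scheme abstract; as one concrete illustration of how parity-based FEC lets a decoder survive a packet loss, a simple XOR parity over a group of equal-length packets can be sketched (this is a generic textbook scheme, not the patent's adaptive multi-mode method):

```python
def xor_parity(packets):
    """Build a parity packet as the byte-wise XOR of a group of packets."""
    parity = bytearray(len(packets[0]))
    for p in packets:
        for i, byte in enumerate(p):
            parity[i] ^= byte
    return bytes(parity)

def recover(received, parity):
    """Recover the single missing packet (the None entry) by XOR-ing the
    parity with every packet that did arrive."""
    missing = bytearray(parity)
    for p in received:
        if p is not None:
            for i, byte in enumerate(p):
                missing[i] ^= byte
    return bytes(missing)
```

The adaptivity described in the abstract would then amount to choosing how much such redundancy to send, given the bandwidth currently available for FEC.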

Patent
Pasi Ojala1
21 Dec 2010
TL;DR: In this article, an apparatus for efficiently browsing and selecting media content and reducing computational load of processing audio data may include a processor and a memory storing executable computer program code that causes the apparatus to at least perform operations including obtaining audio signals corresponding to items of media data.
Abstract: An apparatus for efficiently browsing and selecting media content and reducing computational load of processing audio data may include a processor and a memory storing executable computer program code that causes the apparatus to at least perform operations including obtaining audio signals corresponding to items of media data. The audio data associated with the audio signals is played simultaneously. The computer program code may cause the apparatus to determine whether audio signals correspond to multi channel audio signals when determining whether to generate simplified audio signals. The computer program code may also cause the apparatus to determine directions/locations to output the audio data associated with the audio signals. The determined directions/locations correspond to directions that media data is currently being moved or locations of the media data. Corresponding computer program products and methods are also provided.

Patent
Lin Zhibin1, Deng Zheng1, Hao Yuan1, Jing Lu1, Xiaojun Qiu1, Jiali Li1, Guoming Chen1, Ke Peng1, Liu Kaiwen1 
26 Oct 2010
TL;DR: In this paper, a hierarchical audio coding and decoding method and system is presented, which includes: dividing the frequency domain coefficients of an audio signal after MDCT into a plurality of coding sub-bands, quantizing and coding the amplitude envelope values of the coding sub-bands, allocating bits to each coding sub-band of the core layer, quantizing and coding the core layer frequency domain coefficients, and multiplexing the core layer coefficient coded bits and the extended layer coding signal coded bits.
Abstract: A hierarchical audio coding and decoding method and system are provided. The method includes: dividing the frequency domain coefficients of an audio signal after MDCT into a plurality of coding sub-bands, and quantizing and coding the amplitude envelope values of the coding sub-bands; allocating bits to each coding sub-band of the core layer, and quantizing and coding the core layer frequency domain coefficients to obtain coded bits of the core layer frequency domain coefficients; calculating the amplitude envelope value of each coding sub-band of the core layer residual signal; allocating bits to each coding sub-band of the extended layer, and quantizing and coding the extended layer coding signal to obtain coded bits of the extended layer coding signal; multiplexing and packing the amplitude envelope coded bits of each coding sub-band (composed of core layer and extended layer frequency domain coefficients), the core layer frequency domain coefficient coded bits, and the extended layer coding signal coded bits, then transmitting them to the decoding end.
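The envelope step of such a scheme, computing a per-sub-band amplitude envelope and quantizing it on a log scale, can be sketched as below (the log2 step size and function names are illustrative assumptions, not the patent's exact quantizer):

```python
import numpy as np

def subband_envelopes(mdct_coeffs, band_size):
    """RMS amplitude envelope of each coding sub-band; the coefficient
    array length must be a multiple of band_size."""
    bands = np.asarray(mdct_coeffs).reshape(-1, band_size)
    return np.sqrt(np.mean(bands ** 2, axis=1))

def quantize_envelope_log2(env, step=0.5):
    """Quantize envelopes on a log2 scale to integer indices, as hierarchical
    codecs typically code amplitude envelopes."""
    return np.round(np.log2(env + 1e-12) / step).astype(int)

def dequantize_envelope_log2(idx, step=0.5):
    """Reconstruct envelope values from the quantization indices."""
    return 2.0 ** (idx * step)
```

The dequantized envelopes then drive the per-sub-band bit allocation, and the residual after core-layer coding feeds the extended layer in the same fashion.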

Book ChapterDOI
01 Jan 2010
TL;DR: This chapter provides a discussion of sound and audio signals, and then explores how audio data is presented to the processor from a variety of audio converters.
Abstract: This chapter provides a discussion of sound and audio signals, and then explores how audio data is presented to the processor from a variety of audio converters. It also describes the formats in which audio data is stored and processed, and reviews the compromises associated with selecting data sizes. Audio and speech coding is used in digital audio broadcasting (DAB), VoIP phones, media players, military applications, cinema, home entertainment systems, and distance learning, among many other applications. Sound is a longitudinal displacement wave that propagates through a medium, such as air. Speech signals can be considered a subset of audio signals. Speech signals contain information about the time-varying characteristics of the excitation source and the vocal tract system. Speech signals are nonstationary, and at best they can be considered quasistationary over short time periods. Sound waves are defined in terms of amplitude and frequency attributes. Amplitude describes the sound pressure displacement above and below the equilibrium atmospheric level. In other words, the amplitude of a sound wave is a gauge of pressure change, measured in decibels (dB). The lowest sound amplitude that the human ear can perceive is called the “threshold of hearing,” denoted by 0 dB SPL.
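The dB SPL scale the chapter introduces is defined relative to the 20 µPa reference pressure at the threshold of hearing; the conversion can be sketched as:

```python
import math

P_REF = 20e-6  # reference sound pressure in pascals (20 uPa), the 0 dB SPL point

def pascals_to_dbspl(p):
    """Sound pressure in pascals -> level in dB SPL: 20 * log10(p / p_ref)."""
    return 20.0 * math.log10(p / P_REF)

def dbspl_to_pascals(db):
    """Level in dB SPL -> sound pressure in pascals."""
    return P_REF * 10.0 ** (db / 20.0)
```

For instance, a pressure of 1 Pa corresponds to roughly 94 dB SPL, a common calibration level for microphones.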

Patent
Elias Nemer1
22 Oct 2010
TL;DR: In this paper, the authors describe a system that uses audio spatialization to help at least one listener on one end of a communication session differentiate between multiple talkers on another end of the communication session.
Abstract: Systems and methods are described that utilize audio spatialization to help at least one listener on one end of a communication session differentiate between multiple talkers on another end of the communication session. In accordance with one embodiment, an audio teleconferencing system obtains speech signals originating from different talkers on one end of the communication session, identifies a particular talker in association with each speech signal, and generates mapping information sufficient to assign each speech signal associated with each identified talker to a corresponding audio spatial region. A telephony system communicatively connected to the audio teleconferencing system receives the speech signals and the mapping information, assigns each speech signal to a corresponding audio spatial region based on the mapping information, and plays back each speech signal in its assigned audio spatial region.
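The patent does not disclose its spatialization math; as a generic illustration of assigning each talker a spatial region in a stereo field, a constant-power pan law can be sketched (names and the even-spread layout are illustrative assumptions, not the patent's mapping):

```python
import math

def constant_power_pan(angle):
    """Constant-power stereo pan: angle in [-45, 45] degrees (-45 = full left)
    maps to (left, right) gains with L^2 + R^2 = 1."""
    theta = math.radians(angle + 45.0)   # shift to [0, 90] degrees
    return math.cos(theta), math.sin(theta)

def place_talkers(talker_ids, span=90.0):
    """Spread talkers evenly across the stereo field; returns per-talker
    (left, right) gain pairs keyed by talker id."""
    n = len(talker_ids)
    positions = [(-span / 2) + span * (i + 0.5) / n for i in range(n)]
    return {t: constant_power_pan(a) for t, a in zip(talker_ids, positions)}
```

Playing each talker's signal through its assigned gain pair places the voices at distinct apparent positions, which is the cue that helps the listener tell simultaneous talkers apart.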