
Showing papers on "Speech coding published in 2005"


Journal ArticleDOI
01 Nov 2005
TL;DR: Different audio fingerprinting techniques are reviewed, describing their functional blocks as parts of a common, unified framework.
Abstract: An audio fingerprint is a compact content-based signature that summarizes an audio recording. Audio fingerprinting technologies have attracted attention since they allow the identification of audio independently of its format and without the need for metadata or watermark embedding. Other uses of fingerprinting include: integrity verification, watermark support and content-based audio retrieval. The different approaches to fingerprinting have been described with different rationales and terminology: Pattern matching, Multimedia (Music) Information Retrieval or Cryptography (Robust Hashing). In this paper, we review the different techniques, describing their functional blocks as parts of a common, unified framework.
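
For orientation only, a minimal sketch of one common fingerprint front end that fits this framework (a Haitsma-Kalker-style binary fingerprint built from signs of band-energy differences) is given below; the frame size, hop, band count and band range are illustrative assumptions, not values from any particular system surveyed in the paper.

```python
# Minimal sketch of a binary audio fingerprint (Haitsma-Kalker style).
# Frame size, hop, band count and band range are illustrative assumptions.
import numpy as np

def fingerprint(x, sr, frame=2048, hop=1024, n_bands=33):
    n_frames = 1 + (len(x) - frame) // hop
    win = np.hanning(frame)
    # Log-spaced band edges between 300 Hz and 2 kHz (a common choice)
    edges = np.geomspace(300.0, 2000.0, n_bands + 1)
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    energies = np.zeros((n_frames, n_bands))
    for t in range(n_frames):
        spec = np.abs(np.fft.rfft(win * x[t * hop:t * hop + frame])) ** 2
        for b in range(n_bands):
            mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
            energies[t, b] = spec[mask].sum()
    # Sub-fingerprint bits: sign of the band-energy difference along time and frequency
    d = np.diff(energies, axis=0)       # time difference
    bits = np.diff(d, axis=1) > 0       # frequency difference -> 32 bits per frame
    return bits

# Matching between fingerprints can then be done by Hamming distance over bit blocks.
```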

390 citations


Book
01 Jan 2005
TL;DR: A comparison of MPEG-7 Audio Spectrum Projection vs. MFCC features, together with results for distinguishing between speech, music and environmental sound, shows that the former are superior to the latter for sound classification.
Abstract: List of Acronyms. List of Symbols.
1. Introduction. 1.1 Audio Content Description. 1.2 MPEG-7 Audio Content Description - An Overview. 1.2.1 MPEG-7 Low-Level Descriptors. 1.2.2 MPEG-7 Description Schemes. 1.2.3 MPEG-7 Description Definition Language (DDL). 1.2.4 BiM (Binary Format for MPEG-7). 1.3 Organization of the Book.
2. Low-Level Descriptors. 2.1 Introduction. 2.2 Basic Parameters and Notations. 2.2.1 Time Domain. 2.2.2 Frequency Domain. 2.3 Scalable Series. 2.3.1 Series of Scalars. 2.3.2 Series of Vectors. 2.3.3 Binary Series. 2.4 Basic Descriptors. 2.4.1 Audio Waveform. 2.4.2 Audio Power. 2.5 Basic Spectral Descriptors. 2.5.1 Audio Spectrum Envelope. 2.5.2 Audio Spectrum Centroid. 2.5.3 Audio Spectrum Spread. 2.5.4 Audio Spectrum Flatness. 2.6 Basic Signal Parameters. 2.6.1 Audio Harmonicity. 2.6.2 Audio Fundamental Frequency. 2.7 Timbral Descriptors. 2.7.1 Temporal Timbral: Requirements. 2.7.2 Log Attack Time. 2.7.3 Temporal Centroid. 2.7.4 Spectral Timbral: Requirements. 2.7.5 Harmonic Spectral Centroid. 2.7.6 Harmonic Spectral Deviation. 2.7.7 Harmonic Spectral Spread. 2.7.8 Harmonic Spectral Variation. 2.7.9 Spectral Centroid. 2.8 Spectral Basis Representations. 2.9 Silence Segment. 2.10 Beyond the Scope of MPEG-7. 2.10.1 Other Low-Level Descriptors. 2.10.2 Mel-Frequency Cepstrum Coefficients. References.
3. Sound Classification and Similarity. 3.1 Introduction. 3.2 Dimensionality Reduction. 3.2.1 Singular Value Decomposition (SVD). 3.2.2 Principal Component Analysis (PCA). 3.2.3 Independent Component Analysis (ICA). 3.2.4 Non-Negative Factorization (NMF). 3.3 Classification Methods. 3.3.1 Gaussian Mixture Model (GMM). 3.3.2 Hidden Markov Model (HMM). 3.3.3 Neural Network (NN). 3.3.4 Support Vector Machine (SVM). 3.4 MPEG-7 Sound Classification. 3.4.1 MPEG-7 Audio Spectrum Projection (ASP) Feature Extraction. 3.4.2 Training Hidden Markov Models (HMMs). 3.4.3 Classification of Sounds. 3.5 Comparison of MPEG-7 Audio Spectrum Projection vs. MFCC Features. 3.6 Indexing and Similarity. 3.6.1 Audio Retrieval Using Histogram Sum of Squared Differences. 3.7 Simulation Results and Discussion. 3.7.1 Plots of MPEG-7 Audio Descriptors. 3.7.2 Parameter Selection. 3.7.3 Results for Distinguishing Between Speech, Music and Environmental Sound. 3.7.4 Results of Sound Classification Using Three Audio Taxonomy Methods. 3.7.5 Results for Speaker Recognition. 3.7.6 Results of Musical Instrument Classification. 3.7.7 Audio Retrieval Results. 3.8 Conclusions. References.
4. Spoken Content. 4.1 Introduction. 4.2 Automatic Speech Recognition. 4.2.1 Basic Principles. 4.2.2 Types of Speech Recognition Systems. 4.2.3 Recognition Results. 4.3 MPEG-7 SpokenContent Description. 4.3.1 General Structure. 4.3.2 SpokenContentHeader. 4.3.3 SpokenContentLattice. 4.4 Application: Spoken Document Retrieval. 4.4.1 Basic Principles of IR and SDR. 4.4.2 Vector Space Models. 4.4.3 Word-Based SDR. 4.4.4 Sub-Word-Based Vector Space Models. 4.4.5 Sub-Word String Matching. 4.4.6 Combining Word and Sub-Word Indexing. 4.5 Conclusions. 4.5.1 MPEG-7 Interoperability. 4.5.2 MPEG-7 Flexibility. 4.5.3 Perspectives. References.
5. Music Description Tools. 5.1 Timbre. 5.1.1 Introduction. 5.1.2 InstrumentTimbre. 5.1.3 HarmonicInstrumentTimbre. 5.1.4 PercussiveInstrumentTimbre. 5.1.5 Distance Measures. 5.2 Melody. 5.2.1 Melody. 5.2.2 Meter. 5.2.3 Scale. 5.2.4 Key. 5.2.5 MelodyContour. 5.2.6 MelodySequence. 5.3 Tempo. 5.3.1 AudioTempo. 5.3.2 AudioBPM. 5.4 Application Example: Query-by-Humming. 5.4.1 Monophonic Melody Transcription. 5.4.2 Polyphonic Melody Transcription. 5.4.3 Comparison of Melody Contours. References.
6. Fingerprinting and Audio Signal Quality. 6.1 Introduction. 6.2 Audio Signature. 6.2.1 Generalities on Audio Fingerprinting. 6.2.2 Fingerprint Extraction. 6.2.3 Distance and Searching Methods. 6.2.4 MPEG-7-Standardized AudioSignature. 6.3 Audio Signal Quality. 6.3.1 AudioSignalQuality Description Scheme. 6.3.2 BroadcastReady. 6.3.3 IsOriginalMono. 6.3.4 BackgroundNoiseLevel. 6.3.5 CrossChannelCorrelation. 6.3.6 RelativeDelay. 6.3.7 Balance. 6.3.8 DcOffset. 6.3.9 Bandwidth. 6.3.10 TransmissionTechnology. 6.3.11 ErrorEvent and ErrorEventList. References.
7. Application. 7.1 Introduction. 7.2 Automatic Audio Segmentation. 7.2.1 Feature Extraction. 7.2.2 Segmentation. 7.2.3 Metric-Based Segmentation. 7.2.4 Model-Selection-Based Segmentation. 7.2.5 Hybrid Segmentation. 7.2.6 Hybrid Segmentation Using MPEG-7 ASP. 7.2.7 Segmentation Results. 7.3 Sound Indexing and Browsing of Home Video Using Spoken Annotations. 7.3.1 A Simple Experimental System. 7.3.2 Retrieval Results. 7.4 Highlights Extraction for Sport Programmes Using Audio Event Detection. 7.4.1 Goal Event Segment Selection. 7.4.2 System Results. 7.5 A Spoken Document Retrieval System for Digital Photo Albums. References.
Index.

256 citations


Journal ArticleDOI
TL;DR: Improvement by as much as 71 percentage points was observed for sentence recognition in the presence of a competing voice and the present result strongly suggests that frequency modulation be extracted and encoded to improve cochlear implant performance in realistic listening situations.
Abstract: Different from traditional Fourier analysis, a signal can be decomposed into amplitude and frequency modulation components. The speech processing strategy in most modern cochlear implants only extracts and encodes amplitude modulation in a limited number of frequency bands. While amplitude modulation encoding has allowed cochlear implant users to achieve good speech recognition in quiet, their performance in noise is severely compromised. Here, we propose a novel speech processing strategy that encodes both amplitude and frequency modulations in order to improve cochlear implant performance in noise. By removing the center frequency from the subband signals and additionally limiting the frequency modulation's range and rate, the present strategy transforms the fast-varying temporal fine structure into a slowly varying frequency modulation signal. As a first step, we evaluated the potential contribution of additional frequency modulation to speech recognition in noise via acoustic simulations of the cochlear implant. We found that while amplitude modulation from a limited number of spectral bands is sufficient to support speech recognition in quiet, frequency modulation is needed to support speech recognition in noise. In particular, improvement by as much as 71 percentage points was observed for sentence recognition in the presence of a competing voice. The present result strongly suggests that frequency modulation be extracted and encoded to improve cochlear implant performance in realistic listening situations. We have proposed several implementation methods to stimulate further investigation.
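
As a rough illustration of the kind of subband AM/FM decomposition the strategy relies on (not the authors' exact implementation), the sketch below band-passes one analysis band, takes the analytic signal via a Hilbert transform, and derives the envelope (AM) and the instantaneous-frequency deviation from the band centre (FM); the filter order, band edges and FM clipping range are assumptions.

```python
# Illustrative subband AM/FM decomposition (not the authors' exact strategy).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def am_fm_band(x, sr, f_lo, f_hi):
    # Band-pass one analysis band
    sos = butter(4, [f_lo, f_hi], btype="bandpass", fs=sr, output="sos")
    band = sosfiltfilt(sos, x)
    analytic = hilbert(band)
    am = np.abs(analytic)                           # slowly varying envelope
    phase = np.unwrap(np.angle(analytic))
    inst_freq = np.diff(phase) * sr / (2 * np.pi)   # instantaneous frequency (Hz)
    fc = 0.5 * (f_lo + f_hi)
    fm = inst_freq - fc                             # deviation from the band centre
    # Limit the FM range, in the spirit of the strategy, to keep it slowly varying
    fm = np.clip(fm, -400.0, 400.0)
    return am, fm
```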

233 citations


Patent
Mark F. Davis1
28 Feb 2005
TL;DR: In this patent, the author proposes improved downmixing of multiple audio channels to a monophonic audio signal, and improved decorrelation of multiple audio channels derived from a monophonic audio channel or from multiple audio channels together with related auxiliary information from which the multiple channels can be reconstructed.
Abstract: Multiple channels of audio are combined either to a monophonic composite signal or to multiple channels of audio along with related auxiliary information from which multiple channels of audio are reconstructed, including improved downmixing of multiple audio channels to a monophonic audio signal or to multiple audio channels and improved decorrelation of multiple audio channels derived from a monophonic audio channel or from multiple audio channels. Aspects of the disclosed invention are usable in audio encoders, decoders, encode/decode systems, downmixers, upmixers, and decorrelators.

221 citations


Proceedings Article
01 Jan 2005
TL;DR: An efficient method for audio matching which performs effectively for a wide range of classical music is described, and a new type of chroma-based audio feature is introduced that strongly correlates to the harmonic progression of the audio signal.
Abstract: In this paper, we describe an efficient method for audio matching which performs effectively for a wide range of classical music. The basic goal of audio matching can be described as follows: consider an audio database containing several CD recordings for one and the same piece of music interpreted by various musicians. Then, given a short query audio clip of one interpretation, the goal is to automatically retrieve the corresponding excerpts from the other interpretations. To solve this problem, we introduce a new type of chroma-based audio feature that strongly correlates to the harmonic progression of the audio signal. Our feature shows a high degree of robustness to variations in parameters such as dynamics, timbre, articulation, and local tempo deviations. As another contribution, we describe a robust matching procedure which can also handle global tempo variations. Finally, we give a detailed account of our experiments, which have been carried out on a database of more than 110 hours of audio comprising a wide range of classical music.
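
A bare-bones chroma computation in the spirit of this work is sketched below; the authors' actual feature additionally smooths and coarsely quantizes chroma statistics over longer windows, so the frame size, hop and tuning reference here are illustrative assumptions only.

```python
# Rough chroma-feature sketch (not the authors' exact feature).
import numpy as np

def chromagram(x, sr, frame=4096, hop=2048, tuning_a4=440.0):
    win = np.hanning(frame)
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    # Map each FFT bin to one of 12 pitch classes (ignore very low bins)
    valid = freqs > 27.5
    midi = 69 + 12 * np.log2(freqs[valid] / tuning_a4)
    pitch_class = np.mod(np.round(midi).astype(int), 12)
    n_frames = 1 + (len(x) - frame) // hop
    C = np.zeros((n_frames, 12))
    for t in range(n_frames):
        spec = np.abs(np.fft.rfft(win * x[t * hop:t * hop + frame])) ** 2
        np.add.at(C[t], pitch_class, spec[valid])    # accumulate energy per pitch class
    # Normalize each frame so matching is robust to dynamics
    C /= np.maximum(np.linalg.norm(C, axis=1, keepdims=True), 1e-9)
    return C
```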

221 citations


PatentDOI
TL;DR: In this patent, a method and system for determining the similarity between input speech data and sample speech data is provided, where the input speech frames and the sample speech frames are used to build a matching matrix, wherein the matching matrix comprises the distance values between each of the input speech frames and each of the sample speech frames.
Abstract: A method and system for determining the similarity between input speech data and sample speech data is provided. First, the input speech data is segmented into a plurality of input speech frames and the sample speech data is segmented into a plurality of sample speech frames. Then, the input speech frames and the sample speech frames are used to build a matching matrix, wherein the matching matrix comprises the distance values between each of the input speech frames and each of the sample speech frames. Next, the distance values are used to calculate a matching score. Finally, the similarity between the input speech data and the sample speech data is determined according to this matching score.
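
The abstract does not specify the frame features or the exact scoring rule; one common way to reduce such a frame-to-frame distance matrix to a matching score is dynamic time warping, sketched below with Euclidean frame distances as an assumed stand-in.

```python
# Sketch: build a frame-to-frame distance matrix and reduce it to a matching
# score via dynamic time warping. The frame features and scoring rule are
# assumptions; the patent abstract does not specify them.
import numpy as np

def matching_score(A, B):
    """A, B: (num_frames, feature_dim) arrays of speech-frame features."""
    # Distance matrix between every input frame and every sample frame
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    n, m = D.shape
    acc = np.full((n, m), np.inf)
    acc[0, 0] = D[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = D[i, j] + prev
    # Lower normalized cost corresponds to higher similarity
    return acc[-1, -1] / (n + m)
```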

204 citations


Proceedings ArticleDOI
21 Nov 2005
TL;DR: The proposed hybrid solution is capable of detecting new kinds of suspicious audio events that occur as outliers against a background of usual activity; it adaptively learns a Gaussian mixture model of the background sounds and updates the model incrementally as new audio data arrives.
Abstract: We proposed a time series analysis based approach for systematic choice of audio classes for detection of crimes in elevators in R. Radhakrishnan et al. (2005). Since all the different sounds in a surveillance environment cannot be anticipated, a surveillance system for event detection cannot completely rely on a supervised audio classification framework. In this paper, we propose a hybrid solution that consists of two parts: one that performs unsupervised audio analysis and another that performs analysis using an audio classification framework obtained from off-line analysis and training. The proposed system is capable of detecting new kinds of suspicious audio events that occur as outliers against a background of usual activity. It adaptively learns a Gaussian mixture model (GMM) to model the background sounds and updates the model incrementally as new audio data arrives. New types of suspicious events can be detected as deviants from this usual background model. The results on elevator audio data are promising.
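
A minimal sketch of the unsupervised half of such a system is shown below, assuming per-frame feature vectors (e.g. MFCCs); the component count, outlier threshold and periodic refitting (a crude stand-in for the paper's incremental update) are assumptions.

```python
# Sketch of GMM background modelling with outlier detection.
# Feature choice, component count and threshold are assumptions; the paper's
# incremental update is approximated here by periodic refitting.
import numpy as np
from sklearn.mixture import GaussianMixture

def detect_outliers(features, n_components=8, threshold=-60.0, refit_every=500):
    """features: (num_frames, feature_dim) array, e.g. MFCC frames."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    background = list(features[:1000])          # bootstrap on an initial buffer
    gmm.fit(np.array(background))
    outliers = []
    for i in range(1000, len(features)):
        frame = features[i]
        if gmm.score_samples(frame[None, :])[0] < threshold:
            outliers.append(i)                  # deviant from the usual background
        else:
            background.append(frame)
        if i % refit_every == 0:                # crude stand-in for incremental update
            gmm.fit(np.array(background[-5000:]))
    return outliers
```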

196 citations


PatentDOI
TL;DR: In this patent, a speech processing system including a speech recognition unit to receive input speech and a natural language processor is described. The system includes an adaptation processor that processes the feedback information to adapt the acoustic models so that the speech recognition unit produces the speech recognition result with higher precision than when the adaptation processor is not used.
Abstract: A speech processing system including a speech recognition unit to receive input speech, and a natural language processor. The speech recognition unit performs speech recognition on the input speech using acoustic models to produce a speech recognition result. The natural language processor performs natural language processing on the speech recognition result, and includes: a speech zone detector configured to detect correct zones from the speech recognition result; and a feedback unit to feed back information obtained as a result of the natural language processing performed on the speech recognition result to said speech recognition unit. The feedback information includes the detected correct zones. The speech recognition unit includes an adaptation processor to process the feedback information to adapt the acoustic models so that the speech recognition unit produces the speech recognition result with higher precision than when the adaptation processor is not used.

162 citations


Patent
11 Oct 2005
TL;DR: In this patent, a method and an apparatus for spectral envelope encoding are presented; the method is applicable to both natural audio coding and speech coding systems and is especially suited to coders using SBR [WO 98/57436] or other high-frequency reconstruction methods.
Abstract: The present invention provides a new method and an apparatus for spectral envelope encoding. The invention teaches how to perform and signal compactly a time/frequency mapping of the envelope representation, and further, encode the spectral envelope data efficiently using adaptive time/frequency directional coding. The method is applicable to both natural audio coding and speech coding systems and is especially suited for coders using SBR [WO 98/57436] or other high frequency reconstruction methods.

142 citations


Proceedings ArticleDOI
J. Makinen1, B. Bessette2, S. Bruhn, Pasi Ojala1, R. Salami, A. Taleb 
18 Mar 2005
TL;DR: The requirements imposed by mobile audio services are discussed and a technology overview of AMR-WB+ as a codec matching these requirements while providing outstanding audio quality is given.
Abstract: Highly efficient low-rate audio coding methods are required for new compelling and commercially interesting applications of streaming, messaging and broadcasting services using audio media in 3rd generation mobile communication systems. After an audio codec selection phase, 3GPP has standardized the extended AMR-WB (AMR-WB+) codec that provides a unique performance at very low bit rates from below 10 kbps up to 24 kbps. This paper discusses the requirements imposed by mobile audio services and gives a technology overview of AMR-WB+ as a codec matching these requirements while providing outstanding audio quality.

136 citations


PatentDOI
Nobuyuki Katae1
TL;DR: In this patent, a speech synthesizing system is described that produces speech of improved voice quality by selecting the combination of speech segments most suitable for a synthesis speech unit sequence, consisting of a speech segment storage section where speech segments are stored, a speech segment selection information storage section, a speech segment selecting section, and a waveform generating section for generating speech waveform data from the combination of speech segments selected by the speech segment selecting section.
Abstract: A speech synthesizing system that produces speech of improved voice quality by selecting the combination of speech segments most suitable for a synthesis speech unit sequence. The speech synthesizing system comprises: a speech segment storage section where speech segments are stored; a speech segment selection information storage section where speech segment selection information is stored, comprising combinations of speech segments (constituted of speech segments stored in the speech segment storage section) for an arbitrary speech unit sequence together with appropriateness information representing the appropriateness of the combinations; a speech segment selecting section for selecting the combination of speech segments most suitable for a synthesis parameter according to the stored speech segment selection information; and a waveform generating section for generating speech waveform data from the combination of speech segments selected by the speech segment selecting section.

Patent
10 Nov 2005
TL;DR: In this paper, a text-to-speech system uses differential voice coding (DVC) to compress a database of digitized speech waveform segments, and a seed waveform is used to precondition each waveform prior to encoding which, upon encoding, provides a seeded pre-conditioned encoded speech token.
Abstract: A text to speech system ( 100 ) uses differential voice coding ( 230, 416 ) to compress a database of digitized speech waveform segments ( 210 ). A seed waveform ( 535 ) is used to precondition each speech waveform prior to encoding which, upon encoding, provides a seeded preconditioned encoded speech token ( 550 ). The seed portion ( 541 ) may be removed and the preconditioned encoded speech token portion ( 542 ) may be stored in a database for text to speech synthesis. When speech is to be synthesized, upon requesting the appropriate speech waveform for the present sound to be produced, the seed portion is prepended to the preconditioned encoded speech token for differential decoding.

Journal ArticleDOI
Doh-Suk Kim1
TL;DR: The proposed auditory non-intrusive quality estimation (ANIQUE) model is based on the functional roles of human auditory systems and the characteristics of human articulation systems; experimental evaluations on 35 different tests demonstrate its effectiveness.
Abstract: In predicting subjective quality of speech signal degraded by telecommunication networks, conventional objective models require a reference source speech signal, which is applied as an input to the network, as well as the degraded speech. Non-intrusive estimation of speech quality is a challenging problem in that only the degraded speech signal is available. Non-intrusive estimation can be used in many real applications when the source speech signal is not available. In this paper, we propose a new approach for non-intrusive speech quality estimation utilizing the temporal envelope representation of speech. The proposed auditory non-intrusive quality estimation (ANIQUE) model is based on the functional roles of human auditory systems and the characteristics of human articulation systems. Experimental evaluations on 35 different tests demonstrated the effectiveness of the proposed model.

Journal ArticleDOI
TL;DR: The robustness and discriminability of the AM-FM features is investigated in combination with mel cepstrum coefficients (MFCCs), and it is shown that these hybrid features perform well in the presence of noise, both in terms of phoneme-discrimination and speech recognition performance.
Abstract: In this letter, a nonlinear AM-FM speech model is used to extract robust features for speech recognition. The proposed features measure the amount of amplitude and frequency modulation that exists in speech resonances and attempt to model aspects of the speech acoustic information that the commonly used linear source-filter model fails to capture. The robustness and discriminability of the AM-FM features is investigated in combination with mel cepstrum coefficients (MFCCs). It is shown that these hybrid features perform well in the presence of noise, both in terms of phoneme-discrimination (J-measure) and in terms of speech recognition performance in several different tasks. Average relative error rate reduction up to 11% for clean and 46% for mismatched noisy conditions is achieved when AM-FM features are combined with MFCCs.

Patent
11 Oct 2005
TL;DR: In this patent, a down-mixing rule that dynamically depends on parameters describing the interrelation between the audio channels is proposed to ensure that the energy within the down-mixed residual signal is as small as possible.
Abstract: An audio signal having at least two channels can be efficiently down-mixed into a downmix signal and a residual signal when the down-mixing rule used depends on a spatial parameter that is derived from the audio signal and that is post-processed by a limiter to apply a certain limit to the derived spatial parameter, with the aim of avoiding instabilities during the up-mixing or down-mixing process. By having a down-mixing rule that dynamically depends on parameters describing an interrelation between the audio channels, one can ensure that the energy within the down-mixed residual signal is as small as possible, which is advantageous from the viewpoint of coding efficiency. By post-processing the spatial parameter with a limiter prior to using it in the down-mixing, one can avoid instabilities in the down- or up-mixing, which otherwise could result in a disturbance of the spatial perception of the encoded or decoded audio signal.

Journal ArticleDOI
TL;DR: The source filter model of speech production is adopted as presented in X. Huang et al. (2001), wherein speech is divided into two broad classes: voiced and unvoiced.
Abstract: In this article, we concentrate on spectral estimation techniques that are useful in extracting the features to be used by automatic speech recognition (ASR) system. As an aid to understanding the spectral estimation process for speech signals, we adopt the source filter model of speech production as presented in X. Huang et al. (2001), wherein speech is divided into two broad classes: voiced and unvoiced. Voiced speech is quasi-periodic, consisting of a fundamental frequency corresponding to the pitch of a speaker, as well as its harmonics. Unvoiced speech is stochastic in nature and is best modeled as white noise convolved with an infinite impulse response filter.
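
As a toy illustration of this source-filter picture (not part of the article itself), the sketch below drives an arbitrary all-pole filter with an impulse train for voiced speech and with white noise for unvoiced speech; the filter coefficients and pitch are illustrative assumptions, not LPC values estimated from real speech.

```python
# Toy source-filter synthesis: impulse-train excitation for voiced speech,
# white-noise excitation for unvoiced speech, through the same all-pole filter.
# Filter coefficients and pitch are arbitrary illustrative values.
import numpy as np
from scipy.signal import lfilter

sr = 16000
a = [1.0, -1.3, 0.9, -0.2]      # all-pole (IIR) vocal-tract filter, illustrative
n = sr // 2                      # half a second of samples

# Voiced: quasi-periodic impulse train at a 120 Hz fundamental
voiced_src = np.zeros(n)
voiced_src[::sr // 120] = 1.0
voiced = lfilter([1.0], a, voiced_src)

# Unvoiced: stochastic white-noise excitation
unvoiced = lfilter([1.0], a, np.random.randn(n) * 0.1)
```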

Proceedings ArticleDOI
18 Mar 2005
TL;DR: This paper presents a hybrid audio coding algorithm integrating an LP-based coding technique and a more general transform coding technique, which has consistently high performance for both speech and music signals.
Abstract: This paper presents a hybrid audio coding algorithm integrating an LP-based coding technique and a more general transform coding technique. ACELP is used in LP-based coding mode, whereas algebraic TCX is used in transform coding mode. The algorithm extends previously published work on ACELP/TCX coding in several ways. The frame length is increased to 80 ms, adaptive multi-length sub-frames are used with overlapping windowing, an extended multi-rate algebraic VQ is applied to the TCX spectrum to avoid quantizer saturation, and noise shaping is improved. Results show that the proposed hybrid coder has consistently high performance for both speech and music signals.

Journal ArticleDOI
TL;DR: A new signal processing technique for cochlear implants using a psychoacoustic-masking model in order to determine the essential components of any given audio signal.
Abstract: We describe a new signal processing technique for cochlear implants using a psychoacoustic-masking model. The technique is based on the principle of a so-called "NofM" strategy. These strategies stimulate fewer channels (N) per cycle than active electrodes (NofM; N < M). In "NofM" strategies such as ACE or SPEAK, only the N channels with higher amplitudes are stimulated. The new strategy is based on the ACE strategy but uses a psychoacoustic-masking model in order to determine the essential components of any given audio signal. This new strategy was tested on device users in an acute study, with either 4 or 8 channels stimulated per cycle. For the first condition (4 channels), the mean improvement over the ACE strategy was 17%. For the second condition (8 channels), no significant difference was found between the two strategies.
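
For context, the baseline maximum-amplitude NofM selection that the proposed psychoacoustic-masking criterion replaces can be sketched as follows; the subband envelope extraction preceding it is omitted.

```python
# Baseline "NofM" channel selection: keep the N subband envelopes with the
# largest amplitude in each stimulation cycle. The paper's contribution
# replaces this maximum-amplitude rule with a psychoacoustic-masking criterion.
import numpy as np

def select_n_of_m(envelopes, n):
    """envelopes: (M,) subband envelope amplitudes for one stimulation cycle."""
    selected = np.argsort(envelopes)[-n:]      # indices of the N largest channels
    stim = np.zeros_like(envelopes)
    stim[selected] = envelopes[selected]       # the other channels are not stimulated
    return stim
```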

Patent
Zoran Fejzo1
21 Mar 2005
TL;DR: In this article, a lossless audio codec segments audio data within each frame to improve compression performance subject to a constraint that each segment must be fully decodable and less than a maximum size.
Abstract: A lossless audio codec segments audio data within each frame to improve compression performance subject to a constraint that each segment must be fully decodable and less than a maximum size. For each frame, the codec selects the segment duration and coding parameters, e.g., a particular entropy coder and its parameters for each segment, that minimizes the encoded payload for the entire frame subject to the constraints. Distinct sets of coding parameters may be selected for each channel or a global set of coding parameters may be selected for all channels. Compression performance may be further enhanced by forming M/2 decorrelation channels for M-channel audio. The triplet of channels (basis, correlated, decorrelated) provides two possible pair combinations, (basis, correlated) and (basis, decorrelated), that can be considered during the segmentation and entropy coding optimization to further improve compression performance.

Journal ArticleDOI
TL;DR: An efficient approach for unsupervised audio stream segmentation and clustering via the Bayesian Information Criterion (BIC) is proposed, which is particularly successful for short segment turns of less than 2 s in duration.
Abstract: In many speech and audio applications, it is first necessary to partition and classify acoustic events prior to voice coding for communication or speech recognition for spoken document retrieval. In this paper, we propose an efficient approach for unsupervised audio stream segmentation and clustering via the Bayesian Information Criterion (BIC). The proposed method extends an earlier formulation by Chen and Gopalakrishnan. In our formulation, Hotelling's T² statistic is used to pre-select candidate segmentation boundaries followed by BIC to perform the segmentation decision. The proposed algorithm also incorporates a variable-size increasing window scheme and a skip-frame test. Our experiments show that we can improve the final algorithm speed by a factor of 100 compared to that in Chen and Gopalakrishnan's while achieving a 6.7% reduction in the acoustic boundary miss rate at the expense of a 5.7% increase in false alarm rate using DARPA Hub4 1997 evaluation data. The approach is particularly successful for short segment turns of less than 2 s in duration. The results suggest that the proposed algorithm is sufficiently effective and efficient for audio stream segmentation applications.
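
The core delta-BIC test at a single candidate boundary can be sketched as below, omitting the paper's Hotelling T² pre-selection, growing window and skip-frame test; lam is the usual BIC penalty weight.

```python
# Minimal delta-BIC change-point test for one candidate boundary.
# Omits the paper's Hotelling T^2 pre-selection, variable-size window and
# skip-frame test; lam is the standard BIC penalty weight.
import numpy as np

def delta_bic(X, i, lam=1.0):
    """X: (n_frames, d) feature matrix; i: candidate boundary index."""
    n, d = X.shape
    def logdet_cov(Y):
        # Log-determinant of the maximum-likelihood covariance of Y
        return np.linalg.slogdet(np.cov(Y, rowvar=False, bias=True))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(X)
            - 0.5 * i * logdet_cov(X[:i])
            - 0.5 * (n - i) * logdet_cov(X[i:])
            - penalty)

# A boundary is accepted when delta_bic(X, i) > 0 at the maximizing index i.
```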

Proceedings ArticleDOI
27 Dec 2005
TL;DR: The potential of perceptive speech analysis and processing in combination with biologically plausible neural network processors is discussed and a preliminary test with recognition of French spoken digits from a small speech database is illustrated.
Abstract: Speech recognition is very difficult in the context of noisy and corrupted speech. Most conventional techniques need huge databases to estimate speech (or noise) density probabilities to perform recognition. We discuss the potential of perceptive speech analysis and processing in combination with biologically plausible neural network processors. We illustrate the potential of such non-linear processing of speech by means of a preliminary test with recognition of French spoken digits from a small speech database

Patent
23 Nov 2005
TL;DR: In this paper, a high-band speech encoding and decoding apparatus was proposed for wideband speech using a bandwidth extension function and a highband speech decoding method performed by the apparatuses.
Abstract: A high-band speech encoding apparatus and a high-band speech decoding apparatus that can reproduce high quality sound even at a low bitrate when encoding and decoding wideband speech using a bandwidth extension function, and a high-band speech encoding and decoding method performed by the apparatuses. The high-band speech encoding apparatus includes: a first encoding unit encoding a high-band speech signal based on a structure in which a harmonic structure and a stochastic structure are combined, if the high-band speech signal has a harmonic component; and a second encoding unit encoding a high-band speech signal based on a stochastic structure if the high-band speech signal has no harmonic components. The high-band speech decoding apparatus includes: a first decoding unit decoding a high-band speech signal based on a combination of a harmonic structure and a stochastic structure using received first decoding information; a second decoding unit decoding the high-band speech signal based on a stochastic structure using received second decoding information; and a switch outputting one of the decoded high-band speech signals received from the first and second decoding units according to received mode selection information.

Journal ArticleDOI
TL;DR: A new perceptual model is presented that predicts masked thresholds for sinusoidal distortions and leads to a reduction of more than 20% in terms of number of sinusoids needed to represent signals at a given quality level.
Abstract: Psychoacoustical models have been used extensively within audio coding applications over the past decades. Recently, parametric coding techniques have been applied to general audio and this has created the need for a psychoacoustical model that is specifically suited for sinusoidal modelling of audio signals. In this paper, we present a new perceptual model that predicts masked thresholds for sinusoidal distortions. The model relies on signal detection theory and incorporates more recent insights about spectral and temporal integration in auditory masking. As a consequence, the model is able to predict the distortion detectability. In fact, the distortion detectability defines a (perceptually relevant) norm on the underlying signal space which is beneficial for optimisation algorithms such as rate-distortion optimisation or linear predictive coding. We evaluate the merits of the model by combining it with a sinusoidal extraction method and compare the results with those obtained with the ISO MPEG-1 Layer I-II recommended model. Listening tests show a clear preference for the new model. More specifically, the model presented here leads to a reduction of more than 20% in terms of number of sinusoids needed to represent signals at a given quality level.

Journal ArticleDOI
TL;DR: In this article, the most prominent speech coding standards are presented and their properties, such as performance, complexity, and coding delay, analyzed, and specific networks and applications for each standard are included.
Abstract: Voice is the preferred method of human communication. Although there have been times when it seemed that the voice communications problem was solved, such as when the PSTN was our primary network or later when digital cellular networks reached maturity, such is not the case today. This paper addresses the challenges and opportunities starting from the basic issues in speech coder design, developing the important speech coding techniques and standards, discussing current and future applications, outlining techniques for evaluating speech coder performance, and identifying research directions. The most prominent speech coding standards are presented and their properties, such as performance, complexity, and coding delay, analyzed. Particular networks and applications for each standard are included. Further, reflecting upon the issues and developments highlighted in this paper, it becomes evident that there is a diverse set of challenges and opportunities for research and innovation in speech coding and voice communications.

Patent
01 Sep 2005
TL;DR: In this paper, the authors present a method and apparatus for obtaining complete speech signals for speech recognition applications using a Hidden Markov Model (HMM) and a sequence of frames.
Abstract: The present invention relates to a method and apparatus for obtaining complete speech signals for speech recognition applications. In one embodiment, the method continuously records an audio stream comprising a sequence of frames to a circular buffer. When a user command to commence or terminate speech recognition is received, the method obtains a number of frames of the audio stream occurring before or after the user command in order to identify an augmented audio signal for speech recognition processing. In further embodiments, the method analyzes the augmented audio signal in order to locate starting and ending speech endpoints that bound at least a portion of speech to be processed for recognition. At least one of the speech endpoints is located using a Hidden Markov Model.
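
A minimal sketch of the circular-buffer idea described here, assuming framed audio; the buffer depth is an illustrative choice, not a value from the patent.

```python
# Sketch of a circular frame buffer for pre-command audio, in the spirit of
# the patent's recording scheme; the buffer depth is an illustrative assumption.
from collections import deque
import numpy as np

class CircularFrameBuffer:
    def __init__(self, max_frames=100):          # e.g. roughly 1 s of 10 ms frames
        self.frames = deque(maxlen=max_frames)   # oldest frames are overwritten

    def push(self, frame):
        self.frames.append(frame)

    def augmented_signal(self):
        # Frames recorded before the user command, available for endpoint analysis
        return np.concatenate(list(self.frames)) if self.frames else np.array([])
```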

Journal ArticleDOI
TL;DR: The proposed method for time-delay estimation is found to perform better than the generalized cross-correlation (GCC) approach and a method for enhancement of speech is also proposed using the knowledge of the time- delay and the information of the excitation source.
Abstract: In this paper, we present a method of extracting the time-delay between speech signals collected at two microphone locations. Time-delay estimation from microphone outputs is the first step for many sound localization algorithms, and also for enhancement of speech. For time-delay estimation, speech signals are normally processed using short-time spectral information (either magnitude or phase or both). The spectral features are affected by degradations in speech caused by noise and reverberation. Features corresponding to the excitation source of the speech production mechanism are robust to such degradations. We show that these source features can be extracted reliably from the speech signal. The time-delay estimate can be obtained using the features extracted even from short segments (50-100 ms) of speech from a pair of microphones. The proposed method for time-delay estimation is found to perform better than the generalized cross-correlation (GCC) approach. A method for enhancement of speech is also proposed using the knowledge of the time-delay and the information of the excitation source.
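
For reference, one common weighting of the generalized cross-correlation baseline (GCC-PHAT) is sketched below as a stand-in for the comparison method; the abstract does not state which GCC weighting was used, and this is not the authors' excitation-source approach.

```python
# GCC-PHAT baseline for time-delay estimation between two microphone signals.
# This illustrates the generalized cross-correlation comparison method, not
# the authors' excitation-source approach; the PHAT weighting is an assumption.
import numpy as np

def gcc_phat_delay(x1, x2, sr, max_delay=0.001):
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.maximum(np.abs(cross), 1e-12)         # PHAT weighting
    cc = np.fft.irfft(cross, n)
    max_shift = int(max_delay * sr)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    # Estimated delay of x1 relative to x2, in seconds
    return (np.argmax(np.abs(cc)) - max_shift) / sr
```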

Patent
02 Mar 2005
TL;DR: In this article, a method for reducing noise disturbance associated with an audio signal received through a microphone is provided, which initiates with magnifying a noise disturbance relative to a remaining component of the audio signal.
Abstract: A method for reducing noise disturbance associated with an audio signal received through a microphone is provided. The method initiates with magnifying a noise disturbance of the audio signal relative to a remaining component of the audio signal. Then, a sampling rate of the audio signal is decreased. Next, an even order derivative is applied to the audio signal having the decreased sampling rate to define a detection signal. Then, the noise disturbance of the audio signal is adjusted according to a statistical average of the detection signal. A system capable of canceling disturbances associated with an audio signal, a video game controller, and an integrated circuit for reducing noise disturbances associated with an audio signal are included.

Patent
23 Feb 2005
TL;DR: In this paper, a frequency-based coding of channels can reduce the encoding and decoding processing loads and/or size of the encoded audio bitstream relative to parametric coding techniques that are applied to all input channels over the entire frequency range.
Abstract: For a multi-channel audio signal, parametric coding is applied to different subsets of audio input channels for different frequency regions. For example, for a 5.1 surround sound signal having five regular channels and one low-frequency (LFE) channel, binaural cue coding (BCC) can be applied to all six audio channels for sub-bands at or below a specified cut-off frequency, but to only five audio channels (excluding the LFE channel) for sub-bands above the cut-off frequency. Such frequency-based coding of channels can reduce the encoding and decoding processing loads and/or size of the encoded audio bitstream relative to parametric coding techniques that are applied to all input channels over the entire frequency range.

Proceedings ArticleDOI
21 Mar 2005
TL;DR: This work shows that hidden messages alter the underlying statistics of audio signals, builds a linear basis that captures statistical properties of audio signals, and extracts a low-dimensional statistical feature vector from this basis representation that is used by a non-linear support vector machine for classification.
Abstract: Digital audio provides a suitable cover for high-throughput steganography. At 16 bits per sample and sampled at a rate of 44,100 Hz, digital audio has the bit-rate to support large messages. In addition, audio is often transient and unpredictable, facilitating the hiding of messages. Using an approach similar to our universal image steganalysis, we show that hidden messages alter the underlying statistics of audio signals. Our statistical model begins by building a linear basis that captures certain statistical properties of audio signals. A low-dimensional statistical feature vector is extracted from this basis representation and used by a non-linear support vector machine for classification. We show the efficacy of this approach on LSB embedding and Hide4PGP. While no explicit assumptions about the content of the audio are made, our technique has been developed and tested on high-quality recorded speech.