
Showing papers on "Speech coding published in 1999"


Journal ArticleDOI
TL;DR: An effective hang-over scheme is proposed that accounts for previous observations through first-order Markov modeling of speech occurrences; the resulting VAD shows significantly better performance than the G.729B VAD in low signal-to-noise ratio (SNR) and vehicular noise environments.
Abstract: In this letter, we develop a robust voice activity detector (VAD) for application to variable-rate speech coding. The developed VAD employs the decision-directed parameter estimation method for the likelihood ratio test. In addition, we propose an effective hang-over scheme which considers previous observations through first-order Markov process modeling of speech occurrences. According to our simulation results, the proposed VAD shows significantly better performance than the G.729B VAD in low signal-to-noise ratio (SNR) and vehicular noise environments.
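As a rough illustration of this kind of statistical-model VAD, the sketch below combines a per-frame likelihood ratio test with decision-directed a priori SNR estimation and first-order Markov smoothing of the speech-presence probability as a hang-over. The Gaussian spectral model, transition probabilities, and threshold are illustrative assumptions rather than the letter's exact parameters.

```python
# Minimal sketch of a likelihood-ratio VAD with a first-order Markov hang-over.
# The model and all constants are illustrative, not the paper's exact values.
import numpy as np

def vad_frame(noisy_psd, noise_psd, prev_clean_psd, prev_speech_prob,
              alpha=0.98, a01=0.2, a10=0.1, threshold=0.5):
    """One frame of the VAD; per-bin power spectra as numpy arrays.

    Returns (speech_prob, is_speech, clean_psd_estimate)."""
    gamma = noisy_psd / noise_psd                          # a posteriori SNR
    # Decision-directed a priori SNR estimate.
    xi = alpha * prev_clean_psd / noise_psd + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)
    # Per-bin log likelihood ratio under complex-Gaussian speech/noise models.
    log_lr = gamma * xi / (1.0 + xi) - np.log1p(xi)
    lr = np.exp(np.mean(log_lr))                           # geometric mean over bins
    # Hang-over: first-order Markov prediction of speech presence, then Bayes update.
    # a01 = P(speech -> silence), a10 = P(silence -> speech); assumed values.
    predicted = prev_speech_prob * (1.0 - a01) + (1.0 - prev_speech_prob) * a10
    speech_prob = predicted * lr / (predicted * lr + (1.0 - predicted))
    clean_psd = (xi / (1.0 + xi)) ** 2 * noisy_psd         # fed back to the next frame
    return speech_prob, speech_prob > threshold, clean_psd
```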

1,341 citations


Reference BookDOI
TL;DR: Haar Wavelets; The Haar Transform; Conservation and Compaction of Energy; Removing Noise from Audio Signals; Multiresolution Analysis; Compression of Audio Signals; Notes and References; Daubechies Wavelets
Abstract: Haar Wavelets; The Haar Transform; Conservation and Compaction of Energy; Removing Noise from Audio Signals; Multiresolution Analysis; Compression of Audio Signals; Removing Noise from Audio Signals; Notes and References; Daubechies Wavelets; The Daub4 Wavelets; Conservation and Compaction of Energy; Other Daubechies Wavelets; Compression of Audio Signals; Quantization, Entropy, and Compression; Denoising Audio Signals; Two-Dimensional Wavelet Transforms; Compression of Images; Fingerprint Compression; Denoising Images; Some Topics in Image Processing; Notes and References; Frequency Analysis; Discrete Fourier Analysis; Correlation and Feature Detection; Object Detection in 2-D Images; Creating Scaling Signals and Wavelets; Notes and References; Beyond Wavelets; Wavelet Packet Transforms; Applications of Wavelet Packet Transforms; Continuous Wavelet Transforms; Gabor Wavelets and Speech Analysis; Notes and References; Appendix: Software for Wavelet Analysis
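For flavor, here is a minimal sketch of the material the opening chapters cover: a one-level Haar transform, its energy-conservation property, and simple threshold denoising of the fluctuation coefficients. The example signal and threshold are arbitrary illustrations, not taken from the book.

```python
# One-level orthonormal Haar transform with energy compaction and threshold denoising.
import numpy as np

def haar_level(x):
    """Split a signal into trend (average) and fluctuation (detail) coefficients."""
    x = np.asarray(x, dtype=float)
    trend = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    fluct = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return trend, fluct

def inverse_haar_level(trend, fluct):
    x = np.empty(2 * len(trend))
    x[0::2] = (trend + fluct) / np.sqrt(2.0)
    x[1::2] = (trend - fluct) / np.sqrt(2.0)
    return x

signal = np.sin(2 * np.pi * np.arange(64) / 16) + 0.1 * np.random.randn(64)
a, d = haar_level(signal)
# Conservation of energy: the transform is orthonormal.
assert np.isclose(np.sum(signal ** 2), np.sum(a ** 2) + np.sum(d ** 2))
# Denoising: zero out small fluctuation coefficients before inverting.
d_denoised = np.where(np.abs(d) > 0.2, d, 0.0)
restored = inverse_haar_level(a, d_denoised)
```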

677 citations


Journal ArticleDOI
TL;DR: This paper describes MARSYAS, a framework for experimenting, evaluating and integrating techniques for audio content analysis in restricted domains and a new method for temporal segmentation based on audio texture that is combined with audio analysis techniques and used for hierarchical browsing, classification and annotation of audio files.
Abstract: Existing audio tools handle the increasing amount of computer audio data inadequately. The typical tape-recorder paradigm for audio interfaces is inflexible and time consuming, especially for large data sets. On the other hand, completely automatic audio analysis and annotation is impossible using current techniques. Alternative solutions are semi-automatic user interfaces that let users interact with sound in flexible ways based on content. This approach offers significant advantages over manual browsing, annotation and retrieval. Furthermore, it can be implemented using existing techniques for audio content analysis in restricted domains. This paper describes MARSYAS, a framework for experimenting, evaluating and integrating such techniques. As a test for the architecture, some recently proposed techniques have been implemented and tested. In addition, a new method for temporal segmentation based on audio texture is described. This method is combined with audio analysis techniques and used for hierarchical browsing, classification and annotation of audio files.

444 citations


Book
01 Jan 1999
TL;DR: This Second Edition of Speech and Audio Signal Processing will update and revise the original book to augment it with new material describing both the enabling technologies of digital music distribution and a range of exciting new research areas in automatic music content processing that have emerged in the past five years, driven by the digital music revolution.
Abstract: When Speech and Audio Signal Processing was published in 1999, it stood out from its competition in its breadth of coverage and its accessible, intuition-based style. This book was aimed at individual students and engineers excited about the broad span of audio processing and curious to understand the available techniques. Since then, with the advent of the iPod in 2001, the field of digital audio and music has exploded, leading to a much greater interest in the technical aspects of audio processing. This Second Edition will update and revise the original book to augment it with new material describing both the enabling technologies of digital music distribution (most significantly the MP3) and a range of exciting new research areas in automatic music content processing (such as automatic transcription, music similarity, etc.) that have emerged in the past five years, driven by the digital music revolution. New chapter topics include: Psychoacoustic Audio Coding, describing MP3 and related audio coding schemes based on psychoacoustic masking of quantization noise; Music Transcription, including automatically deriving notes, beats, and chords from music signals; Music Information Retrieval, primarily focusing on audio-based genre classification, artist/style identification, and similarity estimation; and Audio Source Separation, including multi-microphone beamforming, blind source separation, and the perception-inspired techniques usually referred to as Computational Auditory Scene Analysis (CASA).

395 citations


PatentDOI
TL;DR: A high-quality speech synthesizer that, in various embodiments, concatenates speech waveforms referenced by a large speech database; speech quality is further improved by speech unit selection and concatenation smoothing.
Abstract: A high quality speech synthesizer in various embodiments concatenates speech waveforms referenced by a large speech database. Speech quality is further improved by speech unit selection and concatenation smoothing.

318 citations


Journal ArticleDOI
TL;DR: A multistage neural model is proposed for an auditory scene analysis task: segregating speech from interfering sound sources. Its core is a two-layer oscillator network that performs stream segregation on the basis of oscillatory correlation.
Abstract: A multistage neural model is proposed for an auditory scene analysis task: segregating speech from interfering sound sources. The core of the model is a two-layer oscillator network that performs stream segregation on the basis of oscillatory correlation. In the oscillatory correlation framework, a stream is represented by a population of synchronized relaxation oscillators, each of which corresponds to an auditory feature, and different streams are represented by desynchronized oscillator populations. Lateral connections between oscillators encode harmonicity, and proximity in frequency and time. Prior to the oscillator network are a model of the auditory periphery and a stage in which mid-level auditory representations are formed. The model has been systematically evaluated using a corpus of voiced speech mixed with interfering sounds, and produces improvements in terms of signal-to-noise ratio for every mixture. A number of issues including biological plausibility and real-time implementation are also discussed.
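A minimal numerical sketch of the oscillatory-correlation idea: two relaxation oscillators of the Terman-Wang form, excitatorily coupled and driven by the same input, tend to synchronize. The parameter values, coupling rule, and Euler integration step are illustrative assumptions, not the paper's network.

```python
# Two excitatorily coupled relaxation oscillators (Terman-Wang form); illustrative only.
import numpy as np

def simulate(steps=20000, dt=0.005, eps=0.02, beta=0.1, gamma=6.0, I=0.8, W=1.0):
    x = np.array([0.1, -0.5])   # excitatory variables of the two oscillators
    y = np.array([0.5, 0.6])    # recovery variables
    xs = np.zeros((steps, 2))
    for t in range(steps):
        # Coupling: each oscillator excites the other while the other is active (x > 0).
        s = W * (x[::-1] > 0).astype(float)
        dx = 3 * x - x**3 + 2 - y + I + s
        dy = eps * (gamma * (1 + np.tanh(x / beta)) - y)
        x, y = x + dt * dx, y + dt * dy
        xs[t] = x
    return xs   # with equal input, the two activity traces phase-lock over time

activity = simulate()
```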

313 citations


Proceedings ArticleDOI
15 Mar 1999
TL;DR: The estimated q's are used to control both the gain and the update of the estimated noise spectrum during speech presence in a modified MMSE log-spectral amplitude estimator, which resulted in higher scores than for the IS-127 standard enhancement algorithm, when pre-processing noisy speech for a coding application.
Abstract: Speech enhancement algorithms which are based on estimating the short-time spectral amplitude of the clean speech have better performance when a soft-decision gain modification, depending on the a priori probability of speech absence, is used. In reported works a fixed probability, q, is assumed. Since speech is non-stationary and may not be present in every frequency bin when voiced, we propose a method for estimating distinct values of q for different bins which are tracked in time. The estimation is based on a decision-theoretic approach for setting a threshold in each bin followed by short-time averaging. The estimated q's are used to control both the gain and the update of the estimated noise spectrum during speech presence in a modified MMSE log-spectral amplitude estimator. Subjective tests resulted in higher scores than for the IS-127 standard enhancement algorithm, when pre-processing noisy speech for a coding application.
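The sketch below illustrates the general mechanism: track a per-bin probability of speech absence q by short-time averaging of a thresholded a priori SNR, use 1-q as the prior in a soft-decision gain, and let the resulting speech-presence probability gate the noise-spectrum update. The threshold, smoothing constants, and the Wiener-style gain (standing in for the paper's modified MMSE log-spectral amplitude estimator) are simplifying assumptions.

```python
import numpy as np

def update_q(xi, q_prev, xi_threshold=0.5, beta=0.9):
    """Track per-bin speech-absence probability by short-time averaging of a
    thresholded a priori SNR (a simple stand-in for the paper's rule)."""
    absence = (xi < xi_threshold).astype(float)
    return beta * q_prev + (1.0 - beta) * absence

def soft_decision_gain(xi, gamma, q):
    """Wiener-style gain weighted by the per-bin speech-presence probability."""
    prior_ratio = (1.0 - q) / np.maximum(q, 1e-6)          # P(speech)/P(absence)
    lr = prior_ratio * np.exp(gamma * xi / (1.0 + xi)) / (1.0 + xi)
    p_speech = lr / (1.0 + lr)
    return p_speech * xi / (1.0 + xi), p_speech

def update_noise(noise_psd, noisy_psd, p_speech, alpha_n=0.95):
    """Update the noise spectrum mostly in bins where speech is likely absent."""
    alpha = alpha_n + (1.0 - alpha_n) * p_speech
    return alpha * noise_psd + (1.0 - alpha) * noisy_psd
```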

217 citations


Patent
TL;DR: In this article, a system and method for indexing an audio stream for subsequent information retrieval and for skimming, gisting, and summarizing the audio stream includes using special audio prefiltering such that only relevant speech segments that are generated by a speech recognition engine are indexed.
Abstract: A system and method for indexing an audio stream for subsequent information retrieval and for skimming, gisting, and summarizing the audio stream includes using special audio prefiltering such that only relevant speech segments that are generated by a speech recognition engine are indexed. Specific indexing features are disclosed that improve the precision and recall of an information retrieval system used after indexing for word spotting. The invention includes rendering the audio stream into intervals, with each interval including one or more segments. For each segment of an interval it is determined whether the segment exhibits one or more predetermined audio features such as a particular range of zero crossing rates, a particular range of energy, and a particular range of spectral energy concentration. The audio features are heuristically determined to represent respective audio events including silence, music, speech, and speech on music. Also, it is determined whether a group of intervals matches a heuristically predefined meta pattern such as continuous uninterrupted speech, concluding ideas, hesitations and emphasis in speech, and so on, and the audio stream is then indexed based on the interval classification and meta pattern matching, with only relevant features being indexed to improve subsequent precision of information retrieval. Also, alternatives for longer terms generated by the speech recognition engine are indexed along with respective weights, to improve subsequent recall.
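A small sketch of the kind of per-segment audio features named in the patent (energy, zero-crossing rate, and a spectral-concentration measure, here approximated by a spectral centroid); the frame length and the centroid as a stand-in for "spectral energy concentration" are assumptions.

```python
# Per-segment short-time features for silence/music/speech labeling; illustrative.
import numpy as np

def segment_features(x, sr, frame=0.02):
    n = max(1, int(sr * frame))
    feats = []
    for start in range(0, len(x) - n + 1, n):
        f = x[start:start + n]
        energy = float(np.mean(f ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.signbit(f).astype(int)))))
        spec = np.abs(np.fft.rfft(f)) ** 2
        freqs = np.fft.rfftfreq(n, 1.0 / sr)
        centroid = float(np.sum(freqs * spec) / (np.sum(spec) + 1e-12))
        feats.append((energy, zcr, centroid))
    return np.array(feats)   # range tests on these columns can label each segment
```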

211 citations


PatentDOI
Dimitri Kanevsky, Stephane H. Maes
TL;DR: A system and method for indexing segments of audio/multimedia files and data streams for storage in a database according to audio information such as speaker identity, the background environment and channel, and/or the transcription of the spoken utterances.
Abstract: A system and method for indexing segments of audio/multimedia files and data streams for storage in a database according to audio information such as speaker identity, the background environment and channel (music, street noise, car noise, telephone, studio noise, speech plus music, speech plus noise, speech over speech), and/or the transcription of the spoken utterances. The content or topic of the transcribed text can also be determined using natural language understanding to index based on the context of the transcription. A user can then retrieve desired segments of the audio file from the database by generating a query having one or more desired parameters based on the indexed information.

203 citations


Proceedings ArticleDOI
15 Mar 1999
TL;DR: It is shown that the proposed system has achieved an accuracy higher than 90% for coarse-level audio classification and the query-by-example audio retrieval is implemented where similar sounds can be found according to an input sample audio.
Abstract: A hierarchical system for audio classification and retrieval based on audio content analysis is presented in this paper. The system consists of three stages. The first stage is called the coarse-level audio classification and segmentation, where audio recordings are classified and segmented into speech, music, several types of environmental sounds, and silence, based on morphological and statistical analysis of temporal curves of short-time features of audio signals. In the second stage, environmental sounds are further classified into finer classes such as applause, rain, bird sound, etc. This fine-level classification is based on time-frequency analysis of audio signals and use of the hidden Markov model (HMM) for classification. In the third stage, the query-by-example audio retrieval is implemented where similar sounds can be found according to an input sample audio. It is shown that the proposed system has achieved an accuracy higher than 90% for coarse-level audio classification. Examples of audio fine classification and audio retrieval are also provided.
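The third-stage idea can be sketched simply: represent every clip by a fixed-length feature vector and rank database clips by their distance to the query. The feature extraction, normalization, and Euclidean metric below are generic assumptions, not the paper's exact retrieval scheme.

```python
# Query-by-example retrieval over per-clip feature vectors; illustrative.
import numpy as np

def query_by_example(query_vec, database_vecs, top_k=5):
    db = np.asarray(database_vecs, dtype=float)
    mu, sigma = db.mean(axis=0), db.std(axis=0) + 1e-12
    qz, dbz = (np.asarray(query_vec) - mu) / sigma, (db - mu) / sigma
    dists = np.linalg.norm(dbz - qz, axis=1)
    order = np.argsort(dists)[:top_k]
    return order, dists[order]      # indices and distances of the most similar clips
```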

170 citations


Patent
18 Jan 1999
TL;DR: In this paper, a method for signal controlled switching between audio coding schemes includes receiving input audio signals, classifying a first set of the input audio signals as speech or non-speech signals, coding the speech signals using a time domain coding scheme, and coding the non-speech signals using a transform coding scheme.
Abstract: A method for signal controlled switching between audio coding schemes includes receiving input audio signals, classifying a first set of the input audio signals as speech or non-speech signals, coding the speech signals using a time domain coding scheme, and coding the non-speech signals using a transform coding scheme. A multicode coder has an audio signal input and a switch for receiving the audio signal inputs, the switch having a time domain encoder, a transform encoder, and a signal classifier for classifying the audio signals generally as speech or non-speech, the signal classifier directing speech audio signals to the time domain encoder and non-speech audio signals to the transform encoder. A multicode decoder is also provided.

Patent
20 Apr 1999
TL;DR: In this paper, a distributed speech processing system for constructing speech recognition reference models that are to be used by a speech recognizer in a small hardware device, such as a personal digital assistant or cellular telephone is presented.
Abstract: A distributed speech processing system for constructing speech recognition reference models that are to be used by a speech recognizer in a small hardware device, such as a personal digital assistant or cellular telephone. The speech processing system includes a speech recognizer residing on a first computing device and a speech model server residing on a second computing device. The speech recognizer receives speech training data and processes it into an intermediate representation of the speech training data. The intermediate representation is then communicated to the speech model server. The speech model server generates a speech reference model by using the intermediate representation of the speech training data and then communicates the speech reference model back to the first computing device for storage in a lexicon associated with the speech recognizer.

Journal ArticleDOI
TL;DR: A new objective estimation approach that uses a simple but effective perceptual transformation and a distance measure that consists of a hierarchy of measuring normalizing blocks that reflects the magnitude of a perceived distance between two perceptually transformed signals.
Abstract: Perceived speech quality is most directly measured by subjective listening tests. These tests are often slow and expensive, and numerous attempts have been made to supplement them with objective estimators of perceived speech quality. These attempts have found limited success, primarily in analog and higher-rate, error-free digital environments where speech waveforms are preserved or nearly preserved. The objective estimation of the perceived quality of highly compressed digital speech, possibly with bit errors or frame erasures has remained an open question. We report our findings regarding two essential components of objective estimators of perceived speech quality: perceptual transformations and distance measures. A perceptual transformation modifies a representation of an audio signal in a way that is approximately equivalent to the human hearing process. A distance measure reflects the magnitude of a perceived distance between two perceptually transformed signals. We then describe a new objective estimation approach that uses a simple but effective perceptual transformation and a distance measure that consists of a hierarchy of measuring normalizing blocks. Each measuring normalizing block integrates two perceptually transformed signals over some time or frequency interval to determine the average difference across that interval. This difference is then normalized out of one signal, and is further processed to generate one or more measurements.
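A toy version of a single measuring normalizing block: integrate the two perceptually transformed signals over an interval, take the average difference as a measurement, and normalize that difference out of one signal before passing it down the hierarchy. Treating the perceptual domain as a plain log-spectrum and using a single frequency band are simplifications of the hierarchy described in the paper.

```python
# Toy measuring normalizing block (MNB) over one frequency band; illustrative.
import numpy as np

def measuring_normalizing_block(ref, test, band):
    """ref, test: perceptually transformed spectra (e.g. dB); band: slice of bins."""
    diff = float(np.mean(test[band] - ref[band]))   # average difference over the band
    adjusted = test.copy()
    adjusted[band] -= diff                          # normalize the difference out
    return diff, adjusted                           # measurement + adjusted signal

# A full hierarchy would apply such blocks over successively narrower time and
# frequency intervals and combine the measurements into a quality estimate.
```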

Patent
Juin-Hwey Chen1
30 Mar 1999
TL;DR: A scalable, low-complexity, and low-delay adaptive transform coding method for speech and general audio signals is presented, which is particularly suitable for Internet Protocol (IP)-based multimedia communications.
Abstract: High-quality, low-complexity and low-delay scalable and embedded system and method are disclosed for coding speech and general audio signals. The invention is particularly suitable in Internet Protocol (IP)-based multimedia communications. Adaptive transform coding, such as a Modified Discrete Cosine Transform, is used, with multiple small-size transforms in a given signal frame to reduce the coding delay and computational complexity. In a preferred embodiment, for a chosen sampling rate of the input signal, one or more output sampling rates may be decoded with varying degrees of complexity. Multiple sampling rates and bit rates are supported due to the scalable and embedded coding approach underlying the present invention. Further, a novel adaptive frame loss concealment approach is used to reduce the distortion caused by packet loss in communications using IP networks.
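The delay and complexity idea, several small MDCTs per frame rather than one long transform, can be sketched as follows; the 20 ms frame, the small block size, and the sine window are illustrative assumptions rather than the patent's parameters.

```python
# Covering one frame with several small, 50%-overlapped MDCT blocks; illustrative.
import numpy as np

def mdct(block):
    """MDCT of one 2N-sample block (sine analysis window) -> N coefficients."""
    twoN = len(block)
    N = twoN // 2
    n = np.arange(twoN)
    window = np.sin(np.pi / twoN * (n + 0.5))        # Princen-Bradley sine window
    x = block * window
    k = np.arange(N)[:, None]
    basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return basis @ x

def frame_to_small_mdcts(frame, block_size=80):
    """Transform a frame as a sequence of small 50%-overlapped blocks."""
    hop = block_size // 2
    blocks = [frame[i:i + block_size]
              for i in range(0, len(frame) - block_size + 1, hop)]
    return [mdct(b) for b in blocks]

frame = np.random.randn(320)                          # e.g. a 20 ms frame at 16 kHz
coeffs = frame_to_small_mdcts(frame, block_size=80)   # several 40-coefficient MDCTs
```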

Patent
James R. Lewis, Barbara Ballard
08 Mar 1999
TL;DR: In this paper, a method and system for responding to randomly occurring noise in a voice recognition application program is presented, where the system receives an audio signal representative of sound in an audio environment and processes the audio signal to identify certain non-speech sounds.
Abstract: A method and system for responding to randomly occurring noise in a voice recognition application program. The system receives an audio signal representative of sound in an audio environment and processes the audio signal to identify certain non-speech sounds. A pre-defined action is performed in response to the non-speech sound which has been identified. The pre-defined action is selected from the group consisting of disabling a microphone source of the audio signal, suspending further processing of the audio signal by the speech recognition system, executing a user-defined macro, and ignoring the sound. The system may perform additional steps including recording a sound which is to be identified as a non-speech sound and assigning one of the pre-defined actions to be performed in response when the non-speech sound has been identified.

Patent
07 Jan 1999
TL;DR: In this paper, an audio manager API (application program interface) is provided to enable applications running on the computer to control the various audio sources without knowing the hardware and implementation details of the underlying sound system.
Abstract: A vehicle computer system has an audio entertainment system implemented in a logic unit and audio digital signal processor (DSP) independent from the host CPU. The audio entertainment system employs a set of ping/pong buffers and direct memory access (DMA) circuits to transfer data between different audio devices. Audio data is exchanged using a mapping overlay technique, in which the DMA circuits for two audio devices read and write to the same memory buffer. The computer system provides an audio manager API (application program interface) to enable applications running on the computer to control the various audio sources without knowing the hardware and implementation details of the underlying sound system. Different audio devices and their drivers control different functionality of the audio system, such as equalization, volume controls and surround sound decoding. The audio manager API transfers calls made by the applications to the appropriate device driver. The computer system also supports a speech recognition system. Speech utterances are picked up by a microphone and sampled at an internal sampling rate. However, the speech recognition system employs a lower sampling rate. The computer system converts microphone data from the higher internal sampling rate to the desired sampling rate by piggybacking the microphone data on command/message streams to an SPI (serial peripheral interface) of the audio DSP. The DSP performs normal low-pass filtering and down sampling on the data stream and then uses the SPI to send out the microphone data at the lower sampling rate.

Journal ArticleDOI
TL;DR: This work follows a novel encoding paradigm, trying to maximize recognition performance instead of perceptual reproduction, and finds that by transmitting the cepstral coefficients the authors can achieve significantly higher recognition performance at a fraction of the bit rate required when encoding the speech signal directly.
Abstract: We examine alternative architectures for a client-server model of speech-enabled applications over the World Wide Web (WWW). We compare a server-only processing model where the client encodes and transmits the speech signal to the server, to a model where the recognition front end runs locally at the client and encodes and transmits the cepstral coefficients to the recognition server over the Internet. We follow a novel encoding paradigm, trying to maximize recognition performance instead of perceptual reproduction, and we find that by transmitting the cepstral coefficients we can achieve significantly higher recognition performance at a fraction of the bit rate required when encoding the speech signal directly. We find that the required bit rate to achieve the recognition performance of high-quality unquantized speech is just 2000 bits per second.
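The bit-rate argument is easy to make concrete. Assuming a 10 ms frame shift and a vector-quantized cepstral vector at roughly 20 bits per frame (the codebook split below is a hypothetical choice, not the paper's), the feature stream needs about the 2000 bit/s quoted above, far below typical waveform coding rates:

```python
# Back-of-the-envelope bit-rate comparison; frame rate and bit allocation assumed.
frames_per_second = 100            # 10 ms frame shift
bits_per_frame = 20                # e.g. two 10-bit (1024-entry) VQ codebooks, hypothetical
feature_bit_rate = frames_per_second * bits_per_frame
print(feature_bit_rate)            # 2000 bit/s, the figure quoted in the paper

speech_bit_rate = 8000 * 8         # 64 kbit/s G.711-style waveform coding
print(speech_bit_rate / feature_bit_rate)   # ~32x more bits to send the waveform
```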

Proceedings ArticleDOI
20 Jun 1999
TL;DR: A new technique for highband spectral envelope prediction, based upon codebook mapping with codebooks split by voicing with a suitable highband excitation synthesis scheme is proposed, which produces a significant quality improvement in speech that has been coded using narrowband standards.
Abstract: Telephone speech is typically bandlimited to 4 kHz, resulting in a 'muffled' quality. Coding speech with a bandwidth greater than 4 kHz reduces this distortion, but requires a higher bit rate to avoid other types of distortion. An alternative to coding wider bandwidth speech is to exploit correlations between the 0-4 kHz and 4-8 kHz speech bands to re-synthesize wideband speech from decoded narrowband speech. This paper proposes a new technique for highband spectral envelope prediction, based upon codebook mapping with codebooks split by voicing. An objective comparison with several existing methods reveals that this new technique produces the smallest highband spectral distortion. Combined with a suitable highband excitation synthesis scheme, this envelope prediction scheme produces a significant quality improvement in speech that has been coded using narrowband standards.
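A sketch of the mapping step: pick the voicing-specific codebook, find the nearest narrowband envelope entry, and return its paired highband envelope. The codebook sizes, envelope dimensions, and the random "trained" codebooks in the usage example are placeholders, since the actual codebooks would be trained jointly offline.

```python
# Highband envelope prediction by voicing-split codebook mapping; illustrative.
import numpy as np

def predict_highband_envelope(nb_envelope, voiced, codebooks):
    """codebooks[v] = (nb_entries, hb_entries): paired narrowband/highband shapes."""
    nb_cb, hb_cb = codebooks[1 if voiced else 0]
    idx = np.argmin(np.sum((nb_cb - nb_envelope) ** 2, axis=1))  # nearest entry
    return hb_cb[idx]                                            # mapped highband envelope

# Toy usage with random stand-in codebooks of 64 entries per voicing class.
rng = np.random.default_rng(0)
codebooks = {v: (rng.standard_normal((64, 10)), rng.standard_normal((64, 8)))
             for v in (0, 1)}
hb = predict_highband_envelope(rng.standard_normal(10), voiced=True, codebooks=codebooks)
```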

Journal ArticleDOI
TL;DR: It is concluded that amplification, and especially fast-acting compression amplification, can improve the ability to understand speech in background sounds with spectral and temporal dips, but it does not restore performance to normal.
Abstract: People with cochlear hearing loss have markedly higher speech-reception thresholds (SRTs) than normal for speech presented in background sounds with spectral and/or temporal dips. This article examines the extent to which SRTs can be improved by linear amplification with appropriate frequency-response shaping, and by fast-acting wide-dynamic-range compression amplification with one, two, four, or eight channels. Eighteen elderly subjects with moderate to severe hearing loss were tested. SRTs for sentences were measured for four background sounds, presented at a nominal level (prior to amplification) of 65 dB SPL: (1) A single female talker, digitally filtered so that the long-term average spectrum matched that of the target speech; (2) a noise with the same average spectrum as the target speech, but with the temporal envelope of the single talker; (3) a noise with the same overall spectral shape as the target speech, but filtered so as to have 4 equivalent-rectangular-bandwidth (ERB) wide spectral notches at several frequencies; (4) a noise with both spectral and temporal dips obtained by applying the temporal envelope of a single talker to speech-shaped noise [as in (2)] and then filtering that noise [as in (3)]. Mean SRTs were 5–6 dB lower (better) in all of the conditions with amplification than for unaided listening. SRTs were significantly lower for the systems with one-, four-, and eight-channel compression than for linear amplification, although the benefit, averaged across subjects, was typically only 0.5 to 0.9 dB. The lowest mean SRT (−9.9 dB, expressed as a speech-to-background ratio) was obtained for noise (4) and the system with eight-channel compression. This is about 6 dB worse than for elderly subjects with near-normal hearing, when tested without amplification. It is concluded that amplification, and especially fast-acting compression amplification, can improve the ability to understand speech in background sounds with spectral and temporal dips, but it does not restore performance to normal.

Patent
TL;DR: In this article, an audio bit stream including audio control bits and audio data bits is processed for transmission in a communication system and each of the n different classes of audio bits is then provided with a corresponding one of n different levels of error protection, where n is greater than or equal to two.
Abstract: An audio information bit stream including audio control bits and audio data bits is processed for transmission in a communication system. The audio data bits are first separated into n classes based on error sensitivity, that is, the impact of errors in particular audio data bits on perceived quality of an audio signal reconstructed from the transmission. Each of the n different classes of audio data bits is then provided with a corresponding one of n different levels of error protection, where n is greater than or equal to two. The invention thereby matches error protection for the audio data bits to source and channel error sensitivity. The audio control bits may be transmitted independently of the audio data bits, using an additional level of error protection higher than that used for any of the n classes of the audio data bits. Alternatively, the control bits may be combined with one of the n classes of audio data bits and provided with the highest of the n levels of error protection. Further protection may be provided for the control bits by repeating at least a portion of the control bits from a current packet of the audio information bit stream in a subsequent packet of the audio information bit stream. Moreover, the classification of audio data bits into n different classes can be implemented on a fixed packet-by-packet basis, or in a more flexible, adaptive implementation in which different multipacket error protection profiles are used for different multipacket segments of a source-coded audio signal.
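A schematic of the class-splitting idea, using simple bit repetition as a stand-in for real channel codes of different strengths; the class sizes and repetition factors are illustrative assumptions.

```python
# Unequal error protection by splitting data bits into sensitivity classes; illustrative.
def repeat_code(bits, r):
    """r-fold repetition per bit (stand-in for a channel code of a given strength)."""
    return [b for b in bits for _ in range(r)]

def protect_packet(control_bits, data_bits, class_sizes=(40, 80), repeats=(5, 3, 1)):
    """Split data bits into three sensitivity classes and protect each differently.

    class_sizes: counts of most- and mid-sensitive bits; the rest form the least-
    sensitive class. Control bits get stronger protection than any data class."""
    n1, n2 = class_sizes
    classes = [data_bits[:n1], data_bits[n1:n1 + n2], data_bits[n1 + n2:]]
    protected = repeat_code(control_bits, repeats[0] + 2)   # extra level for control bits
    for bits, r in zip(classes, repeats):
        protected.extend(repeat_code(bits, r))              # more repetition = more protection
    return protected

packet = protect_packet([1, 0, 1], [0, 1] * 100)   # 200 data bits, toy example
```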

Patent
24 Aug 1999
TL;DR: In this paper, a method of encoding an input speech signal using a multi-rate encoder having a plurality of encoding rates is disclosed, where a high-pass filter and then a perceptual weighting filter are applied to such signal to generate a first target signal.
Abstract: A method of encoding an input speech signal using a multi-rate encoder having a plurality of encoding rates is disclosed. A high-pass filter and then a perceptual weighting filter are applied to such signal to generate a first target signal. An adaptive codebook vector is identified from an adaptive codebook using the first target signal by filtering the vector to generate a filtered adaptive codebook vector. An adaptive codebook gain for the adaptive codebook vector is calculated and an error signal minimized. The adaptive codebook gain is adaptively reduced based on one encoding rate from the plurality of encoding rates to generate a reduced adaptive codebook gain. A second target signal based at least on the first target signal and the reduced adaptive codebook gain is generated. The input speech signal is converted into an encoded speech based on the second target signal.
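A sketch of the adaptive-codebook portion of the described method: compute the least-squares gain of the filtered adaptive-codebook vector against the first target, scale it by a rate-dependent reduction factor, and subtract to form the second target. The reduction factors per rate are assumed values for illustration only.

```python
# Adaptive-codebook gain with rate-dependent reduction and second-target formation.
import numpy as np

RATE_GAIN_FACTOR = {12.2: 1.0, 7.4: 0.95, 4.75: 0.85}   # assumed, per encoding rate (kbit/s)

def adaptive_codebook_update(target, filtered_acb_vector, rate_kbps):
    # Optimal (unreduced) adaptive codebook gain minimizing the error energy.
    gain = float(np.dot(target, filtered_acb_vector) /
                 (np.dot(filtered_acb_vector, filtered_acb_vector) + 1e-12))
    reduced_gain = RATE_GAIN_FACTOR[rate_kbps] * gain
    # Second target: what remains for the subsequent (fixed-codebook) search.
    target2 = target - reduced_gain * filtered_acb_vector
    return reduced_gain, target2
```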

Patent
10 Aug 1999
TL;DR: In this article, a speech or voice activity detector (VAD) is provided for detecting whether speech signals are present in individual time frames of an input signal, and a state machine is coupled to the VAD and having a plurality of states.
Abstract: A system and method for removing noise from a signal containing speech (or a related, information carrying signal) and noise. A speech or voice activity detector (VAD) is provided for detecting whether speech signals are present in individual time frames of an input signal. The VAD comprises a speech detector that receives as input the input signal and examines the input signal in order to generate a plurality of statistics that represent characteristics indicative of the presence or absence of speech in a time frame of the input signal, and generates an output based on the plurality of statistics representing a likelihood of speech presence in a current time frame; and a state machine coupled to the speech detector and having a plurality of states. The state machine receives as input the output of the speech detector and transitions between the plurality of states based on a state at a previous time frame and the output of the speech detector for the current time frame. The state machine generates as output a speech activity status signal based on the state of the state machine, which provides a measure of the likelihood of speech being present during the current time frame. The VAD may be used in a noise reduction system.
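A minimal sketch of the detector-plus-state-machine structure: a per-frame speech likelihood drives transitions among a few states, and the activity status signal is derived from the current state. The particular states, thresholds, and hangover length are assumptions, not the patent's.

```python
# Per-frame likelihood plus a small state machine producing an activity status; illustrative.
SILENCE, ONSET, SPEECH, HANGOVER = range(4)

def next_state(state, likelihood, hang_frames, on=0.7, off=0.3, max_hang=8):
    """Return (new_state, new_hang_counter) given the current frame's likelihood."""
    if state == SILENCE:
        return (ONSET, 0) if likelihood > on else (SILENCE, 0)
    if state == ONSET:
        return (SPEECH, 0) if likelihood > off else (SILENCE, 0)
    if state == SPEECH:
        return (SPEECH, 0) if likelihood > off else (HANGOVER, 0)
    # HANGOVER: keep reporting speech for a few frames to avoid clipping word tails.
    if likelihood > on:
        return SPEECH, 0
    return (HANGOVER, hang_frames + 1) if hang_frames < max_hang else (SILENCE, 0)

def speech_active(state):
    return state in (SPEECH, HANGOVER)   # the speech activity status signal
```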

Proceedings ArticleDOI
20 Jun 1999
TL;DR: Tests show that the wide-band speech reconstructed with the new method of regenerating the high frequencies based on vector quantization of the mel-frequency cepstral coefficients is significantly more pleasant to the human ear than the original narrowband speech.
Abstract: Telephone speech is usually limited to less than 4 kHz in bandwidth. This bandwidth limitation results in the typical sound of telephone speech. We present a new method of regenerating the high frequencies (4-8 kHz) based on vector quantization of the mel-frequency cepstral coefficients (MFCC). We also present two methods to avoid perceptually annoying overestimates of the signal power in the high-band. Listening tests show the benefits of the new procedures. Use of MFCC for vector quantization instead of traditionally used spectral representations improves the quality of the speech significantly. Tests also show that the wide-band speech reconstructed with the method is significantly more pleasant to the human ear than the original narrowband speech.

Journal ArticleDOI
TL;DR: A noisy-vowel corpus is used and four possible models for audiovisual speech recognition are proposed, leading to proposals for data representation, fusion architecture, and control of the fusion process through sensor reliability.
Abstract: Audiovisual speech recognition involves fusion of the audio and video sensors for phonetic identification. There are three basic ways to fuse data streams for taking a decision such as phoneme identification: data-to-decision, decision-to-decision, and data-to-data. This leads to four possible models for audiovisual speech recognition: direct identification in the first case, separate identification in the second, and two variants of the third (early integration) case, namely dominant recoding or motor recoding. However, no systematic comparison of these models is available in the literature. We propose an implementation of these four models, and submit them to a benchmark test. To this end, we use a noisy-vowel corpus tested on two recognition paradigms in which the systems are tested at noise levels higher than those used for learning. In one of these paradigms, the signal-to-noise ratio (SNR) value is provided to the recognition systems; in the other it is not. We also introduce a new criterion for evaluating performance, based on transmitted information on individual phonetic features. In light of the compared performances of the four models with the two recognition paradigms, we discuss the advantages and drawbacks of these models, leading to proposals for data representation, fusion architecture, and control of the fusion process through sensor reliability.

Journal ArticleDOI
TL;DR: This paper presents new wideband speech coding and integrated speech coding-enhancement systems based on frame-synchronized fast wavelet packet transform algorithms and formulates temporal and spectral psychoacoustic models of masking adapted to wavelet packet analysis.
Abstract: This paper presents new wideband speech coding and integrated speech coding-enhancement systems based on frame-synchronized fast wavelet packet transform algorithms. It also formulates temporal and spectral psychoacoustic models of masking adapted to wavelet packet analysis. The algorithm of the proposed FFT-like overlapped block orthogonal wavelet packet transform permits us to efficiently approximate the auditory critical band decomposition in the time and frequency domains. This allows us to make use of the temporal and spectral masking properties of the human auditory system to decrease the average bit rate of the encoder while perceptually hiding the quantization error. The same wavelet packet representation is used to merge speech enhancement and coding in the context of auditory modeling. The advantage of the method presented in this paper over previous approaches is that perceptual enhancement and coding, which is usually implemented as a cascade of two separate systems, are combined. This leads to a decreased computational load. Experiments show that the proposed wideband coding procedure by itself can achieve transparent coding of speech signals sampled at 16 kHz at an average bit rate of 39.4 kbit/s. The combined speech coding-enhancement procedure achieves higher bit rate values that depend on the residual noise characteristics at the output of the enhancement process.

Proceedings ArticleDOI
15 Mar 1999
TL;DR: The experimental results show that the line spectral frequencies (LSFs) are robust features in distinguishing the different classes of noises.
Abstract: Background environmental noises degrade the performance of speech-processing systems (e.g. speech coding, speech recognition). By modifying the processing according to the type of background noise, the performance can be enhanced. This requires noise classification. In this paper, four pattern-recognition frameworks have been used to design noise classification algorithms. Classification is done on a frame-by-frame basis (e.g. once every 20 ms). Five commonly encountered noises in mobile telephony (i.e. car, street, babble, factory, and bus) have been considered in our study. Our experimental results show that the line spectral frequencies (LSFs) are robust features in distinguishing the different classes of noises.
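For reference, LSFs can be obtained from LPC coefficients by finding the unit-circle roots of the symmetric and antisymmetric polynomials P(z) and Q(z); a sketch is below. The LPC analysis itself and the classifier that would consume the LSFs (e.g. a Gaussian or VQ classifier per noise type) are omitted, and the example coefficients are arbitrary.

```python
# Line spectral frequencies from LPC coefficients via roots of P(z) and Q(z).
import numpy as np

def lpc_to_lsf(a):
    """a: LPC polynomial coefficients [1, a1, ..., ap]. Returns p LSFs in (0, pi)."""
    a = np.asarray(a, dtype=float)
    p = len(a) - 1
    pad = np.concatenate([a, [0.0]])
    P = pad + pad[::-1]          # symmetric polynomial
    Q = pad - pad[::-1]          # antisymmetric polynomial
    angles = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        # Keep one angle per conjugate pair, excluding the trivial roots at 0 and pi.
        angles.extend(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])
    return np.sort(np.array(angles))[:p]   # p LSFs, interleaved from P and Q

lsfs = lpc_to_lsf([1.0, -1.2, 0.5])   # e.g. a stable 2nd-order LPC model
```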

Proceedings ArticleDOI
30 Oct 1999
TL;DR: A real-time audio segmentation and indexing scheme that can be applied to almost any content-based audio management system and achieves an accuracy rate of more than 90% for audio classification is presented.
Abstract: A real-time audio segmentation and indexing scheme is presented in this paper. Audio recordings are segmented and classified into basic audio types such as silence, speech, music, song, environmental sound, speech with the music background, environmental sound with the music background, etc. Simple audio features such as the energy function, the average zero-crossing rate, the fundamental frequency, and the spectral peak track are adopted in this system to ensure on-line processing. Morphological and statistical analysis for temporal curves of these features are performed to show differences among different types of audio. A heuristic rule-based procedure is then developed to segment and classify audio signals by using these features. The proposed approach is generic and model free. It can be applied to almost any content-based audio management system. It is shown that the proposed scheme achieves an accuracy rate of more than 90% for audio classification. Examples for segmentation and indexing of accompanying audio signals in movies and video programs are also provided.

Journal ArticleDOI
S. Ikeda, A. Sugiyama
TL;DR: Computer simulation results using speech and diesel engine noise recorded in a special-purpose vehicle show that the proposed adaptive noise canceller with low signal distortion reduces signal distortion in the output signal by up to 15 dB compared with a conventional ANC.
Abstract: This paper proposes an adaptive noise canceller (ANC) with low signal distortion for speech codecs. The proposed ANC has two adaptive filters: a main filter (MF) and a subfilter (SF). The signal-to-noise ratio (SNR) of input signals is estimated using the SF. To reduce signal distortion in the output signal of the ANC, a step size for coefficient update in the MF is controlled according to the estimated SNR. Computer simulation results using speech and diesel engine noise recorded in a special-purpose vehicle show that the proposed ANC reduces signal distortion in the output signal by up to 15 dB compared with a conventional ANC. Results of subjective listening tests show that the mean opinion scores (MOSs) for the proposed ANC with and without a speech codec are one point higher than the scores for the conventional ANC.
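A sketch of the core control idea, reduced to a single NLMS filter whose step size shrinks as the estimated SNR grows, so that adaptation (and hence distortion of the desired speech) is limited while speech is present. The step-size mapping and the use of an externally supplied SNR estimate in place of the paper's subfilter are simplifying assumptions.

```python
# One NLMS noise-cancellation update with an SNR-dependent step size; illustrative.
import numpy as np

def anc_step(weights, noise_ref_buf, primary_sample, snr_db, mu_max=0.5, eps=1e-8):
    """weights: filter taps; noise_ref_buf: reference-microphone tap vector."""
    noise_est = float(np.dot(weights, noise_ref_buf))
    output = primary_sample - noise_est          # enhanced output (speech + residual noise)
    # High estimated SNR -> speech dominates -> small step size to avoid distortion.
    mu = mu_max / (1.0 + 10.0 ** (snr_db / 10.0))
    norm = float(np.dot(noise_ref_buf, noise_ref_buf)) + eps
    weights = weights + mu * output * noise_ref_buf / norm
    return weights, output
```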

Proceedings ArticleDOI
20 Jun 1999
TL;DR: The adaptive multi-rate (AMR) speech coder currently under standardization for GSM systems as part of the AMR speech service is described; it provides seamless switching on 20 ms frame boundaries, and its quality when used on GSM channels is significantly higher than that of existing services.
Abstract: In this paper, we describe the adaptive multi-rate (AMR) speech coder currently under standardization for GSM systems as part of the AMR speech service. The coder is a multi-rate ACELP coder with 8 modes operating at bit-rates from 12.2 kbit/s down to 4.75 kbit/s. The coder modes are integrated in a common structure where the bit-rate scalability is realized mainly by altering the quantization schemes for the different parameters. The coder provides seamless switching on 20 ms frame boundaries. The quality when used on GSM channels is significantly higher than for existing services.

Patent
05 Oct 1999
TL;DR: In this paper, a start of an input speech signal is detected during presentation of an output audio signal and an input start time, relative to the output audio signals, is determined.
Abstract: A start of an input speech signal is detected during presentation of an output audio signal and an input start time, relative to the output audio signal, is determined. The input start time is then provided for use in responding to the input speech signal. In another embodiment, the output audio signal has a corresponding identification. When the input speech signal is detected during presentation of the output audio signal, the identification of the output audio signal is provided for use in responding to the input speech signal. Information signals comprising data and/or control signals are provided in response to at least the contextual information provided, i.e., the input start time and/or the identification of the output audio signal. In this manner, the present invention accurately establishes a context of an input speech signal relative to an output audio signal regardless of the delay characteristics of the underlying communication system.