
Showing papers on "Voice activity detection published in 1999"


Journal ArticleDOI
TL;DR: An effective hang-over scheme is proposed that accounts for previous observations by modeling speech occurrences as a first-order Markov process; the resulting VAD performs significantly better than the G.729B VAD in low signal-to-noise ratio (SNR) and vehicular noise environments.
Abstract: In this letter, we develop a robust voice activity detector (VAD) for application to variable-rate speech coding. The developed VAD employs the decision-directed parameter estimation method for the likelihood ratio test. In addition, we propose an effective hang-over scheme which takes previous observations into account through a first-order Markov process modeling of speech occurrences. According to our simulation results, the proposed VAD performs significantly better than the G.729B VAD in low signal-to-noise ratio (SNR) and vehicular noise environments.

1,341 citations
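The hang-over idea above lends itself to a small sketch. Assuming frame-wise speech/noise likelihood ratios are already available from the decision-directed estimator, a two-state first-order Markov prior (speech tends to persist) can be folded into the decision recursively; the self-transition probabilities and threshold below are illustrative placeholders, not the paper's values.

```python
import numpy as np

def markov_hangover(likelihood_ratios, a_ss=0.98, a_nn=0.98, threshold=1.0):
    """Hang-over smoothing of frame-wise likelihood ratios with a
    two-state first-order Markov model of speech occurrences: once in
    speech, the prior favours staying there, bridging weak word endings.
    All constants are assumed, illustrative values."""
    a_ns = 1.0 - a_nn                       # noise -> speech transition
    decisions = np.zeros(len(likelihood_ratios), dtype=bool)
    p_speech = 0.5                          # initial speech prior
    for t, lr in enumerate(likelihood_ratios):
        # one-step Markov prediction of the speech prior
        prior = a_ss * p_speech + a_ns * (1.0 - p_speech)
        # Bayes update with this frame's likelihood ratio
        odds = lr * prior / (1.0 - prior)
        p_speech = odds / (1.0 + odds)
        decisions[t] = odds > threshold
    return decisions
```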


Journal ArticleDOI
TL;DR: This paper addresses the problem of single channel speech enhancement at very low signal-to-noise ratios (SNRs) (<10 dB) with a new computationally efficient algorithm based on masking properties of the human auditory system, which yields improved results over classical subtractive-type algorithms.
Abstract: This paper addresses the problem of single channel speech enhancement at very low signal-to-noise ratios (SNRs) (<10 dB). The proposed approach is based on the introduction of an auditory model in a subtractive-type enhancement process. Single channel subtractive-type algorithms are characterized by a tradeoff between the amount of noise reduction, the speech distortion, and the level of musical residual noise, which can be modified by varying the subtraction parameters. Classical algorithms are usually limited to the use of fixed optimized parameters, which are difficult to choose for all speech and noise conditions. A new computationally efficient algorithm is developed based on masking properties of the human auditory system. It allows for an automatic adaptation in time and frequency of the parametric enhancement system, and finds the best tradeoff based on a criterion correlated with perception. This leads to a significant reduction of the unnatural structure of the residual noise. Objective and subjective evaluation of the proposed system is performed with several noise types from the Noisex-92 database, having different time-frequency distributions. The application of objective measures, the study of the speech spectrograms, as well as subjective listening tests, confirm that the enhanced speech is more pleasant to a human listener. Finally, the proposed enhancement algorithm is tested as a front-end processor for speech recognition in noise, yielding improved results over classical subtractive-type algorithms.

631 citations
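The tradeoff the abstract describes is easiest to see in code. The sketch below is a generic subtractive-type enhancer whose oversubtraction factor varies per time-frequency cell; for simplicity it is scheduled by the local a posteriori SNR rather than by the auditory masking model the paper actually uses, so it should be read as an illustrative stand-in.

```python
import numpy as np

def adaptive_spectral_subtraction(noisy_mag, noise_mag, floor=0.05):
    """Subtractive-type enhancement with time-frequency-adaptive
    oversubtraction: subtract harder where noise dominates, gently
    where speech dominates. Alpha scheduling and the floor value are
    assumed constants, not the paper's perceptual criterion.

    noisy_mag : (frames, bins) magnitude spectrogram of noisy speech
    noise_mag : (bins,) estimated noise magnitude spectrum
    """
    snr_db = 20.0 * np.log10(np.maximum(noisy_mag / noise_mag, 1e-10))
    alpha = np.clip(4.0 - 0.15 * snr_db, 1.0, 6.0)   # oversubtraction factor
    enhanced = noisy_mag - alpha * noise_mag
    # the spectral floor limits the "musical" residual noise
    return np.maximum(enhanced, floor * noisy_mag)
```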


PatentDOI
TL;DR: A high quality speech synthesizer in various embodiments concatenates speech waveforms referenced by a large speech database; speech quality is further improved by speech unit selection and concatenation smoothing.
Abstract: A high quality speech synthesizer in various embodiments concatenates speech waveforms referenced by a large speech database. Speech quality is further improved by speech unit selection and concatenation smoothing.

318 citations


Proceedings ArticleDOI
01 May 1999
TL;DR: Details are presented of the kinds of usability and system design problems likely in current systems, along with several common patterns of error correction that were found.
Abstract: A study was conducted to evaluate user performance and satisfaction in completion of a set of text creation tasks using three commercially available continuous speech recognition systems. The study also compared user performance on similar tasks using keyboard input. One part of the study (Initial Use) involved 24 users who enrolled, received training and carried out practice tasks, and then completed a set of transcription and composition tasks in a single session. In a parallel effort (Extended Use), four researchers used speech recognition to carry out real work tasks over 10 sessions with each of the three speech recognition software products. This paper presents results from the Initial Use phase of the study along with some preliminary results from the Extended Use phase. We present details of the kinds of usability and system design problems likely in current systems and several common patterns of error correction that we found.

304 citations


Patent
20 Apr 1999
TL;DR: In this paper, a distributed speech processing system is presented for constructing speech recognition reference models that are to be used by a speech recognizer in a small hardware device, such as a personal digital assistant or cellular telephone.
Abstract: A distributed speech processing system for constructing speech recognition reference models that are to be used by a speech recognizer in a small hardware device, such as a personal digital assistant or cellular telephone. The speech processing system includes a speech recognizer residing on a first computing device and a speech model server residing on a second computing device. The speech recognizer receives speech training data and processes it into an intermediate representation of the speech training data. The intermediate representation is then communicated to the speech model server. The speech model server generates a speech reference model by using the intermediate representation of the speech training data and then communicates the speech reference model back to the first computing device for storage in a lexicon associated with the speech recognizer.

156 citations


Patent
08 Jul 1999
TL;DR: In this article, a speech recognition system uses language-independent acoustic models, derived from speech data from multiple languages, to represent speech units that are concatenated into words; the input speech signal compared against these models may also be vector quantized according to a codebook derived from multilingual speech data.
Abstract: A speech recognition system uses language independent acoustic models derived from speech data from multiple languages to represent speech units which are concatenated into words. In addition, the input speech signal which is compared to the language independent acoustic models may be vector quantized according to a codebook which is derived from speech data from multiple languages.

130 citations


Journal ArticleDOI
TL;DR: A front end for automatic speech recognizers is proposed and evaluated, based on a quantitative model of the "effective" peripheral auditory processing that simulates both spectral and temporal properties of sound processing in the auditory system, as found in psychoacoustical and physiological experiments.
Abstract: A front end for automatic speech recognizers is proposed and evaluated which is based on a quantitative model of the “effective” peripheral auditory processing. The model simulates both spectral and temporal properties of sound processing in the auditory system which were found in psychoacoustical and physiological experiments. The robustness of the auditory-based representation of speech was evaluated in speaker-independent, isolated word recognition experiments in different types of additive noise. The results show a higher robustness of the auditory front end in noise, compared to common mel-scale cepstral feature extraction. In a second set of experiments, different processing stages of the auditory front end were modified to study their contribution to robust speech signal representation in detail. The adaptive compression stage which enhances temporal changes of the input signal appeared to be the most important processing stage towards robust speech representation in noise. Low-pass filtering of the fast fluctuating envelope in each frequency band further reduces the influence of noise in the auditory-based representation of speech.

130 citations
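As a rough illustration of the two stages the abstract singles out as most important, the sketch below applies a compressive nonlinearity and then low-pass filters the envelope in each filterbank band. The real model uses adaptation loops rather than the static log compression used here, and the cutoff and gain constants are assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

def band_envelope_features(band_signals, fs, cutoff_hz=8.0):
    """Simplified stand-in for the auditory front end's adaptive
    compression and envelope low-pass stages.

    band_signals : (n_bands, n_samples) outputs of an auditory filterbank
    fs           : sampling rate in Hz
    """
    envelopes = np.maximum(band_signals, 0.0)     # half-wave rectification
    compressed = np.log1p(100.0 * envelopes)      # static compression (stand-in
                                                  # for the adaptation loops)
    b, a = butter(2, cutoff_hz / (fs / 2.0))      # low-pass the fast envelope
    return lfilter(b, a, compressed, axis=1)      # fluctuations in each band
```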


Journal ArticleDOI
TL;DR: This work follows a novel encoding paradigm, maximizing recognition performance instead of perceptual reproduction, and finds that transmitting the cepstral coefficients achieves significantly higher recognition performance at a fraction of the bit rate required when encoding the speech signal directly.
Abstract: We examine alternative architectures for a client-server model of speech-enabled applications over the World Wide Web (WWW). We compare a server-only processing model where the client encodes and transmits the speech signal to the server, to a model where the recognition front end runs locally at the client and encodes and transmits the cepstral coefficients to the recognition server over the Internet. We follow a novel encoding paradigm, trying to maximize recognition performance instead of perceptual reproduction, and we find that by transmitting the cepstral coefficients we can achieve significantly higher recognition performance at a fraction of the bit rate required when encoding the speech signal directly. We find that the required bit rate to achieve the recognition performance of high-quality unquantized speech is just 2000 bits per second.

118 citations
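The 2000 bit/s figure is easy to sanity-check: at a typical front-end rate of 100 frames per second, that budget is 20 bits per frame, e.g. ten cepstral coefficients at 2 bits each. The toy uniform quantizer below makes the arithmetic concrete; the frame rate, coefficient count, value range, and bit allocation are all assumptions, and the paper's actual quantizers would have been trained rather than uniform.

```python
import numpy as np

FRAME_RATE_HZ = 100   # typical analysis frame rate (assumed)

def quantize_cepstra(cepstra, bits_per_coeff=2, lo=-10.0, hi=10.0):
    """Toy uniform quantizer for client-side cepstral transmission.
    With 10 coefficients at 2 bits each and 100 frames/s, a frame
    costs 20 bits and the stream costs 2000 bit/s."""
    levels = 2 ** bits_per_coeff
    step = (hi - lo) / levels
    indices = np.clip(((cepstra - lo) / step).astype(int), 0, levels - 1)
    bit_rate = cepstra.shape[1] * bits_per_coeff * FRAME_RATE_HZ
    return indices, bit_rate

# example: 1 second of 10-dimensional cepstra -> 2000 bit/s
codes, rate = quantize_cepstra(np.random.randn(100, 10))
assert rate == 2000
```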


Proceedings ArticleDOI
20 Jun 1999
TL;DR: A new technique for highband spectral envelope prediction is proposed, based upon codebook mapping with codebooks split by voicing; combined with a suitable highband excitation synthesis scheme, it produces a significant quality improvement in speech that has been coded using narrowband standards.
Abstract: Telephone speech is typically bandlimited to 4 kHz, resulting in a 'muffled' quality. Coding speech with a bandwidth greater than 4 kHz reduces this distortion, but requires a higher bit rate to avoid other types of distortion. An alternative to coding wider bandwidth speech is to exploit correlations between the 0-4 kHz and 4-8 kHz speech bands to re-synthesize wideband speech from decoded narrowband speech. This paper proposes a new technique for highband spectral envelope prediction, based upon codebook mapping with codebooks split by voicing. An objective comparison with several existing methods reveals that this new technique produces the smallest highband spectral distortion. Combined with a suitable highband excitation synthesis scheme, this envelope prediction scheme produces a significant quality improvement in speech that has been coded using narrowband standards.

116 citations
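A minimal sketch of the mapping step: two paired narrowband/highband codebooks, one per voicing class, with the highband envelope read out of whichever highband codeword is paired with the nearest narrowband match. Codebook training (e.g. on joint narrowband/highband data) is assumed to have happened offline, and the variable names are illustrative.

```python
import numpy as np

def predict_highband_envelope(nb_feature, voiced, nb_codebooks, hb_codebooks):
    """Codebook mapping with codebooks split by voicing.

    nb_feature   : narrowband spectral envelope feature vector
    voiced       : boolean voicing decision for the current frame
    nb_codebooks : dict with 'voiced'/'unvoiced' narrowband codebooks;
                   row k of nb_codebooks[key] is paired with row k of
                   hb_codebooks[key]
    """
    key = 'voiced' if voiced else 'unvoiced'
    nb_cb, hb_cb = nb_codebooks[key], hb_codebooks[key]
    distances = np.sum((nb_cb - nb_feature) ** 2, axis=1)
    return hb_cb[np.argmin(distances)]   # paired highband envelope
```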


Patent
24 Aug 1999
TL;DR: In this paper, a method of encoding an input speech signal using a multi-rate encoder having a plurality of encoding rates is disclosed, where a high-pass filter and then a perceptual weighting filter are applied to such signal to generate a first target signal.
Abstract: A method of encoding an input speech signal using a multi-rate encoder having a plurality of encoding rates is disclosed. A high-pass filter and then a perceptual weighting filter are applied to such signal to generate a first target signal. An adaptive codebook vector is identified from an adaptive codebook using the first target signal by filtering the vector to generate a filtered adaptive codebook vector. An adaptive codebook gain for the adaptive codebook vector is calculated and an error signal minimized. The adaptive codebook gain is adaptively reduced based on one encoding rate from the plurality of encoding rates to generate a reduced adaptive codebook gain. A second target signal based at least on the first target signal and the reduced adaptive codebook gain is generated. The input speech signal is converted into an encoded speech based on the second target signal.

111 citations


Patent
Jean-Claude Junqua
15 Dec 1999
TL;DR: In this article, a computer-implemented method and apparatus for processing a spoken request from a user to control an automobile device is described, where a speech recognizer recognizes a user's speech input and a speech understanding module determines semantic components of the speech input.
Abstract: A computer-implemented method and apparatus for processing a spoken request from a user to control an automobile device. A speech recognizer recognizes a user's speech input and a speech understanding module determines semantic components of the speech input. A dialogue manager determines insufficiency in the input speech, and also provides the user with information about a device in response to the input speech.

Patent
05 Oct 1999
TL;DR: In this article, an interrupt indicator is provided during a voice communication between a user of a subscriber unit and another person, and a portion of a speech recognition element is activated to begin processing voice-based commands.
Abstract: In a wireless communication system, local detection of an interrupt indicator during a voice communication between a user of a subscriber unit and another person is provided. Responsive to the interrupt indicator, a portion of a speech recognition element is activated to begin processing voice-based commands. The speech recognition element can be implemented at least in part within an infrastructure, such as in a client-server speech recognition arrangement. The interrupt indicator may be provided using an input device forming a part of the subscriber unit, or through the use of a local speech recognizer within the subscriber unit. By locally detecting interrupt indicators at the subscriber unit, the present invention more readily enables the use of electronic assistants and similar services in wireless communication environments.

Proceedings Article
01 Jan 1999
TL;DR: A method that detects filled pauses and word lengthening on the basis of small fundamental frequency transition and small spectral envelope deformation under the assumption that speakers do not change articulator parameters during filled pauses is proposed.
Abstract: This paper describes a method for automatically detecting filled (vocalized) pauses, which are one of the hesitation phenomena that current speech recognizers typically cannot handle. The detection of these pauses is important in spontaneous speech dialogue systems because they play valuable roles, such as helping a speaker keep a conversational turn, in oral communication. Although a few speech recognition systems have processed filled pauses within subword-based connected word recognition or word-spotting frameworks, they did not detect the pauses individually and consequently could not consider their roles. In this paper we propose a method that detects filled pauses and word lengthening on the basis of small fundamental frequency transition and small spectral envelope deformation under the assumption that speakers do not change articulator parameters during filled pauses. Experimental results for a Japanese spoken dialogue corpus show that our real-time filled-pause-detection system yielded a recall rate of 84.9% and a precision rate of 91.5%.
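The detection criterion lends itself to a compact sketch: flag stretches where both the frame-to-frame F0 transition and the spectral envelope deformation stay below thresholds for long enough. The thresholds, feature choices, and minimum duration below are placeholders, not the paper's tuned values.

```python
import numpy as np

def detect_filled_pauses(f0, envelopes, max_f0_step=0.5,
                         max_env_change=0.1, min_frames=30):
    """Flag filled-pause candidates: frames where F0 and the spectral
    envelope are both nearly constant, per the assumption that the
    articulators are held fixed during filled pauses.

    f0        : per-frame F0 (e.g. semitones), 0 for unvoiced frames
    envelopes : (n_frames, n_dims) spectral envelope features
    """
    f0_step = np.abs(np.diff(f0, prepend=f0[0]))
    env_change = np.linalg.norm(
        np.diff(envelopes, axis=0, prepend=envelopes[:1]), axis=1)
    flat = (f0_step < max_f0_step) & (env_change < max_env_change) & (f0 > 0)
    # keep only runs long enough to be a filled pause (e.g. >= 300 ms)
    pauses, start = [], None
    for t, ok in enumerate(flat):
        if ok and start is None:
            start = t
        elif not ok and start is not None:
            if t - start >= min_frames:
                pauses.append((start, t))
            start = None
    if start is not None and len(flat) - start >= min_frames:
        pauses.append((start, len(flat)))
    return pauses
```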

Patent
10 Aug 1999
TL;DR: A speech or voice activity detector (VAD) is provided for detecting whether speech signals are present in individual time frames of an input signal, and a state machine with a plurality of states is coupled to the VAD.
Abstract: A system and method for removing noise from a signal containing speech (or a related, information carrying signal) and noise. A speech or voice activity detector (VAD) is provided for detecting whether speech signals are present in individual time frames of an input signal. The VAD comprises a speech detector that receives as input the input signal and examines the input signal in order to generate a plurality of statistics that represent characteristics indicative of the presence or absence of speech in a time frame of the input signal, and generates an output based on the plurality of statistics representing a likelihood of speech presence in a current time frame; and a state machine coupled to the speech detector and having a plurality of states. The state machine receives as input the output of the speech detector and transitions between the plurality of states based on a state at a previous time frame and the output of the speech detector for the current time frame. The state machine generates as output a speech activity status signal based on the state of the state machine, which provides a measure of the likelihood of speech being present during the current time frame. The VAD may be used in a noise reduction system.
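A state machine of the kind described might look like the following sketch; the caller threads the (state, counter) pair through successive frames. The state names, onset-confirmation count, and hangover length are illustrative, not taken from the patent.

```python
from enum import Enum, auto

class VadState(Enum):
    SILENCE = auto()
    MAYBE_SPEECH = auto()   # onset confirmation
    SPEECH = auto()
    HANGOVER = auto()       # trailing frames kept after speech ends

def step(state, likely_speech, counter, onset_frames=3, hangover_frames=8):
    """One transition: the next state depends on the previous state and
    the detector's output for the current frame.
    Returns (new_state, new_counter, speech_active_flag)."""
    if state is VadState.SILENCE:
        return ((VadState.MAYBE_SPEECH, 1, False) if likely_speech
                else (VadState.SILENCE, 0, False))
    if state is VadState.MAYBE_SPEECH:
        if not likely_speech:
            return VadState.SILENCE, 0, False
        counter += 1
        if counter >= onset_frames:        # enough evidence: confirm onset
            return VadState.SPEECH, 0, True
        return VadState.MAYBE_SPEECH, counter, False
    if state is VadState.SPEECH:
        return ((VadState.SPEECH, 0, True) if likely_speech
                else (VadState.HANGOVER, 1, True))
    # HANGOVER: bridge short gaps, then fall back to silence
    if likely_speech:
        return VadState.SPEECH, 0, True
    counter += 1
    if counter >= hangover_frames:
        return VadState.SILENCE, 0, False
    return VadState.HANGOVER, counter, True
```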

Proceedings ArticleDOI
20 Jun 1999
TL;DR: Tests show that the wide-band speech reconstructed with the new method of regenerating the high frequencies based on vector quantization of the mel-frequency cepstral coefficients is significantly more pleasant to the human ear than the original narrowband speech.
Abstract: Telephone speech is usually limited to less than 4 kHz in bandwidth. This bandwidth limitation results in the typical sound of telephone speech. We present a new method of regenerating the high frequencies (4-8 kHz) based on vector quantization of the mel-frequency cepstral coefficients (MFCC). We also present two methods to avoid perceptually annoying overestimates of the signal power in the high-band. Listening tests show the benefits of the new procedures. Use of MFCC for vector quantization instead of traditionally used spectral representations improves the quality of the speech significantly. Tests also show that the wide-band speech reconstructed with the method is significantly more pleasant to the human ear than the original narrowband speech.

Patent
17 Aug 1999
TL;DR: An on-vehicle voice information service device is described that improves the quality of the voice information service, greatly reduces the memory capacity required to provide the service, and eliminates the time and labor needed to store voice information on a storage medium before a vehicle is shipped.
Abstract: PROBLEM TO BE SOLVED: To provide an on-vehicle voice information service device that improves the quality of the voice information service, greatly reduces the memory capacity required to provide the service, and eliminates the time and labor needed to store voice information on a storage medium before a vehicle is shipped. SOLUTION: Voice information is obtained from a voice database by connecting to the database through a network via an interface part 2; the retrieved voice information is downloaded by an information processing part 3 and stored in a storage part 4. A voice playback part 5 plays back the stored voice information to provide the voice information service. For example, when the service is provided during route guidance of a vehicle in Japan, voice information is downloaded from the voice database as needed and stored in the storage part 4, instead of storing voice information covering all areas of Japan in a ROM such as a CD.

Journal ArticleDOI
TL;DR: This paper presents new wideband speech coding and integrated speech coding-enhancement systems based on frame-synchronized fast wavelet packet transform algorithms and formulates temporal and spectral psychoacoustic models of masking adapted to wavelet packet analysis.
Abstract: This paper presents new wideband speech coding and integrated speech coding-enhancement systems based on frame-synchronized fast wavelet packet transform algorithms. It also formulates temporal and spectral psychoacoustic models of masking adapted to wavelet packet analysis. The algorithm of the proposed FFT-like overlapped block orthogonal wavelet packet transform permits us to efficiently approximate the auditory critical band decomposition in the time and frequency domains. This allows us to make use of the temporal and spectral masking properties of the human auditory system to decrease the average bit rate of the encoder while perceptually hiding the quantization error. The same wavelet packet representation is used to merge speech enhancement and coding in the context of auditory modeling. The advantage of the method presented in this paper over previous approaches is that perceptual enhancement and coding, which is usually implemented as a cascade of two separate systems, are combined. This leads to a decreased computational load. Experiments show that the proposed wideband coding procedure by itself can achieve transparent coding of speech signals sampled at 16 kHz at an average bit rate of 39.4 kbit/s. The combined speech coding-enhancement procedure achieves higher bit rate values that depend on the residual noise characteristics at the output of the enhancement process.

Patent
23 Jul 1999
TL;DR: In this paper, an improved noise reduction algorithm and a voice activity detector are presented for use in a voice communication system, which can be implemented integrally in an encoder or applied independently to speech coding application.
Abstract: An improved noise reduction algorithm is provided, as well as a voice activity detector, for use in a voice communication system. The voice activity detector allows for a reliable estimate of noise and enhancement of noise reduction. The noise reduction algorithm and voice activity detector can be implemented integrally in an encoder or applied independently to speech coding applications. The voice activity detector employs line spectral frequencies and enhanced input speech which has undergone noise reduction to generate a voice activity flag. The noise reduction algorithm employs a smooth gain function determined from a smoothed noise spectral estimate and smoothed input noisy speech spectra. The gain function is smoothed both across frequency and time in an adaptive manner based on the estimate of the signal-to-noise ratio. The gain function is used for spectral amplitude enhancement to obtain a reduced noise speech signal. Smoothing employs critical frequency bands corresponding to the human auditory system. Swirl reduction is performed to improve overall human perception of decoded speech.
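A stripped-down version of the gain-smoothing idea: compute a Wiener-style gain, smooth it across neighbouring frequency bins, then recursively smooth it over time with a coefficient that adapts to the SNR estimate. Critical-band grouping and swirl reduction are omitted, and all constants are assumed.

```python
import numpy as np

def smoothed_gain(noisy_psd, noise_psd, prev_gain, base=0.4):
    """Wiener-style gain smoothed across frequency and time, with the
    amount of time smoothing adapted to the estimated SNR (illustrative
    constants, not the patent's).

    noisy_psd, noise_psd : per-frame power spectra (1-D arrays)
    prev_gain            : gain vector from the previous frame
    """
    snr = np.maximum(noisy_psd / np.maximum(noise_psd, 1e-12) - 1.0, 0.0)
    gain = snr / (1.0 + snr)                          # Wiener gain
    # frequency smoothing: 3-tap moving average over neighbouring bins
    gain = np.convolve(gain, np.ones(3) / 3.0, mode='same')
    # time smoothing: heavier in low-SNR frames, lighter when speech is strong
    beta = base / (1.0 + np.mean(snr))
    return beta * prev_gain + (1.0 - beta) * gain
```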

Proceedings ArticleDOI
15 Mar 1999
TL;DR: This paper investigates the relative sensitivity of a Gaussian mixture model (GMM) based voice verification algorithm to computer voice-altered imposters.
Abstract: This paper investigates the relative sensitivity of a Gaussian mixture model (GMM) based voice verification algorithm to computer voice-altered imposters. First, a new trainable speech synthesis algorithm based on trajectory models of the speech line spectral frequency (LSF) parameters is presented in order to model the spectral characteristics of a target voice. A GMM based speaker verifier is then constructed for the 138 speaker YOHO database and shown to have an initial equal-error rate (EER) of 1.45% for the case of casual imposter attempts using a single combination-lock phrase test. Next, imposter voices are automatically altered using the synthesis algorithm to mimic the customer's voice. After voice transformation, the false acceptance rate is shown to increase from 1.45% to over 86% if the baseline EER threshold is left unmodified. Furthermore, at a customer false rejection rate of 25%, the false acceptance rate for the voice-altered imposter remains as high as 34.6%.
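The verification decision being attacked is, at its core, a thresholded log-likelihood ratio between a target-speaker GMM and a background model, which is why a fixed EER threshold becomes exploitable once an imposter can shift the ratio with voice transformation. A toy version follows; the use of scikit-learn, the component count, and the threshold are assumptions for illustration, not the paper's setup.

```python
from sklearn.mixture import GaussianMixture

def train_verifier(target_feats, background_feats, n_components=32):
    """Fit a target-speaker GMM and a background GMM on feature
    matrices of shape (n_frames, n_dims)."""
    target = GaussianMixture(n_components, covariance_type='diag').fit(target_feats)
    background = GaussianMixture(n_components, covariance_type='diag').fit(background_feats)
    return target, background

def accept(target, background, trial_feats, threshold=0.0):
    # average per-frame log-likelihood ratio over the trial utterance
    llr = target.score(trial_feats) - background.score(trial_feats)
    return llr > threshold   # raising the threshold trades false accepts
                             # for false rejects
```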

Proceedings ArticleDOI
15 Mar 1999
TL;DR: This paper describes the experiences with developing a real-time telephone-based speech recognizer as part of a conversational system in the weather information domain and describes the development of the recognizer vocabulary, pronunciations, language and acoustic models for this system.
Abstract: This paper describes our experiences with developing a real-time telephone-based speech recognizer as part of a conversational system in the weather information domain. This system has been used to collect spontaneous speech data which has proven to be extremely valuable for research in a number of different areas. After describing the corpus we have collected, we describe the development of the recognizer vocabulary, pronunciations, language and acoustic models for this system, the new weighted finite-state transducer-based lexical access component, and report on the current performance of the recognizer under several different conditions. We also analyze recognition latency to verify that the system performs in real-time.

01 Jan 1999
TL;DR: This dissertation examines how pronunciations vary in this speaking style, and how speaking rate and word predictability can be used to predict when greater pronunciation variation can be expected, and suggests that for spontaneous speech, it may be appropriate to build models for syllables and words that can dynamically change the pronunciation used in the speech recognizer based on the extended context.
Abstract: As of this writing, the automatic recognition of spontaneous speech by computer is fraught with errors; many systems transcribe one out of every three to five words incorrectly, whereas humans can transcribe spontaneous speech with one error in twenty words or better. This high error rate is due in part to the poor modeling of pronunciations within spontaneous speech. This dissertation examines how pronunciations vary in this speaking style, and how speaking rate and word predictability can be used to predict when greater pronunciation variation can be expected. It includes an investigation of the relationship between speaking rate, word predictability, pronunciations, and errors made by speech recognition systems. The results of these studies suggest that for spontaneous speech, it may be appropriate to build models for syllables and words that can dynamically change the pronunciations used in the speech recognizer based on the extended context (including surrounding words, phones, speaking rate, etc.). Implementation of new pronunciation models automatically derived from data within the ICSI speech recognition system has shown a 4–5% relative improvement on the Broadcast News recognition task. Roughly two thirds of these gains can be attributed to static baseform improvements; adding the ability to dynamically adjust pronunciations within the recognizer provides the other third of the improvement. The Broadcast News task also allows for comparison of performance on different styles of speech: the new pronunciation models do not help for pre-planned speech, but they provide a significant gain for spontaneous speech. Not only do the automatically learned pronunciation models capture some of the linguistic variation due to the speaking style, but they also represent variation in the acoustic model due to channel effects. The largest improvement was seen in the telephone speech condition, in which 12% of the errors produced by the baseline system were corrected.

Patent
Dong-Ho Cho, Sung Won Lee, Young-Ky Kim, Hyun Seok Lee, Sun-Mi Kim
26 Aug 1999
TL;DR: In this paper, a device and method for communicating voice data in packet form in a mobile communication system is presented, where a packet voice channel is assigned upon generation of voice data, and an active state (1015) is entered where packetized voice data is transmitted on the voice channel.
Abstract: A device and method for communicating voice data in packet form in a mobile communication system. In the packet voice data communication method, upon generation of voice data, a packet voice channel is assigned, and an active state (1015) is entered where packetized voice data is transmitted on the voice channel. If there is no voice data for a predetermined time period while in the active state (1015), the assigned voice channel is released, and an inactive state (1013) is entered where no voice data is transmitted. If the next voice data is generated while in the inactive state (1013), the voice channel active state is entered where a voice channel is assigned to transmit the next voice data, and the voice data is transmitted.

Proceedings Article
01 Jan 1999
TL;DR: This paper reports on data collected during a study of three commercially available ASR systems, showing how initial users of speech systems tend to fixate on a single strategy for error correction; this tendency, coupled with application assumptions about how error correction features will be used, makes for a very frustrating and unsatisfying user experience.
Abstract: Automatic Speech Recognition (ASR) systems have improved greatly over the last three decades. However, even with 98% reported accuracy, error correction still consumes a significant portion of user effort in text creation tasks. We report on data collected during a study of three commercially available ASR systems that shows how initial users of speech systems tend to fixate on a single strategy for error correction. This tendency, coupled with application assumptions about how error correction features will be used, combines to make a very frustrating and unsatisfying user experience. We observe two distinct error correction patterns: spiral depth (Oviatt & van Gent, 1996) and cascades. In contrast, users with more extensive experience learn to switch correction strategies more quickly.

Patent
12 Apr 1999
TL;DR: In this paper, a computer-implemented method and apparatus is provided for processing a spoken request from a user, where a speech recognizer converts the spoken request into a digital format.
Abstract: A computer-implemented method and apparatus is provided for processing a spoken request from a user. A speech recognizer converts the spoken request into a digital format. A frame data structure associates semantic components of the digitized spoken request with predetermined slots. The slots are indicative of data which are used to achieve a predetermined goal. A speech understanding module which is connected to the speech recognizer and to the frame data structure determines semantic components of the spoken request. The slots are populated based upon the determined semantic components. A dialog manager which is connected to the speech understanding module may determine at least one slot which is unpopulated based upon the determined semantic components, and in a preferred embodiment may provide confirmation of the populated slots. A computer-generated request is formulated in order for the user to provide data related to the unpopulated slot. The method and apparatus are well-suited (but not limited) to use in a hand-held speech translation device.
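A minimal sketch of the frame/slot bookkeeping the patent describes, with hypothetical slot names chosen to match the hand-held translation use case; the actual slots, goals, and prompt wording are not specified by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """Frame data structure associating semantic components with slots.
    Slot names here are hypothetical examples."""
    goal: str
    slots: dict = field(default_factory=dict)
    required: tuple = ('source_language', 'target_language', 'phrase')

    def populate(self, semantics: dict):
        # fill slots from the semantic components found by the
        # speech understanding module
        for name, value in semantics.items():
            if name in self.required:
                self.slots[name] = value

    def missing_slots(self):
        return [s for s in self.required if s not in self.slots]

def dialogue_manager_turn(frame: Frame) -> str:
    """Formulate a computer-generated request for the first unpopulated
    slot, or confirm the populated slots."""
    missing = frame.missing_slots()
    if missing:
        return f"Please provide the {missing[0].replace('_', ' ')}."
    return f"Confirming: {frame.slots}"
```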

Patent
06 May 1999
TL;DR: In a computer speech user interface, a method and computer apparatus automatically adjust the content of feedback in a responsive prompt based upon the predicted recognition accuracy of a speech recognizer.
Abstract: In a computer speech user interface, a method and computer apparatus for automatically adjusting the content of feedback in a responsive prompt based upon predicted recognition accuracy by a speech recognizer. The method includes the steps of receiving a user voice command from the speech recognizer; calculating present speech recognition accuracy based upon the received user voice command; predicting future recognition accuracy based upon the calculated present speech recognition accuracy; and, generating feedback in a responsive prompt responsive to the predicted recognition accuracy. For predicting future poor recognition accuracy based upon poor present recognition accuracy, the calculating step can include monitoring the received user voice command; detecting a reduced accuracy condition in the monitored user voice command; and, determining poor present recognition accuracy if the reduced accuracy condition is detected in the detecting step.
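The loop of the claim reduces to a few lines: accumulate recent recognition outcomes, use them to predict near-term accuracy, and choose terser or more detailed feedback accordingly. The moving-average predictor, window size, threshold, and prompt texts below are all illustrative assumptions.

```python
from collections import deque

class FeedbackController:
    """Sketch of the patent's loop: track recent recognition accuracy,
    predict near-future accuracy from it, and adjust prompt content."""
    def __init__(self, window=10):
        self.outcomes = deque(maxlen=window)   # 1 = recognized, 0 = misrecognized

    def record(self, recognized: bool):
        self.outcomes.append(1 if recognized else 0)

    def predicted_accuracy(self) -> float:
        # moving average of recent outcomes as the accuracy predictor
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def prompt(self, command: str) -> str:
        if self.predicted_accuracy() < 0.7:    # reduced-accuracy condition
            return f'I heard "{command}". Is that correct?'
        return "OK."                           # terse feedback when confident
```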

Patent
16 Apr 1999
TL;DR: In this article, a technique for recognizing telephone numbers and other information embedded in voice messages stored in a telephone voice messaging system is presented, where a voice recognition system is coupled to the telephone voice message system.
Abstract: A technique for recognizing telephone numbers and other information embedded in voice messages stored in a telephone voice messaging system. A voice recognition system is coupled to the telephone voice messaging system. A voice message stored in the voice messaging system is transferred to the voice recognition system. The voice recognition system segments the voice message and then searches the segments for a predetermined speech reference model (grammar) which is expected to contain information of importance to the recipient of the message. In a preferred embodiment, the predetermined grammar is a numeric grammar which specifies a sequence of numbers occurring in the voice message. In alternate embodiments, the grammar specifies a date, a time, an address, and so forth, and can specify more than one such type of information. The grammar can be modified or selected by the recipient of the voice message so that the voice recognition system searches for information of particular interest to the recipient. Once the predetermined grammar is identified, the voice recognition system outputs a portion of the stored voice message which includes the grammar. The output can be a display of the information contained in the grammar, such as a telephone number or an address. Alternately, the output can be an audible replay of the portion of the stored voice message which includes the grammar.
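As a rough illustration of searching recognized message segments against a numeric grammar, the sketch below stands in a regular expression over the transcripts for the speech reference model; the segment format and the phone-number pattern are assumptions, not the patent's grammar formalism.

```python
import re

# a numeric "grammar" as a regular expression over recognizer transcripts,
# standing in for the predetermined speech reference model
US_PHONE = re.compile(r'\b(\d{3})[ .-]?(\d{3})[ .-]?(\d{4})\b')

def find_phone_numbers(segments):
    """Search recognized segments of a stored voice message for telephone
    numbers and report which portion of the message contains them.

    segments : list of (start_sec, end_sec, transcript) triples produced
               by segmenting and recognizing the voice message (assumed).
    """
    hits = []
    for start, end, text in segments:
        match = US_PHONE.search(text)
        if match:
            # report the time span, so the caller can display the number
            # or replay that portion of the message
            hits.append((start, end, ''.join(match.groups())))
    return hits
```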

Patent
Jean-Claude Junqua, Yi Zhao
11 Mar 1999
TL;DR: In this article, the input signal is transformed into the frequency domain and then subdivided into bands corresponding to different frequency ranges, and adaptive thresholds are applied to the data from each frequency band separately.
Abstract: The input signal is transformed into the frequency domain and then subdivided into bands corresponding to different frequency ranges. Adaptive thresholds are applied to the data from each frequency band separately. Thus the short-term band-limited energies are tested for the presence or absence of a speech signal. The adaptive threshold values are independently updated for each of the signal paths, using a histogram data structure to accumulate long-term data representing the mean and variance of energy within the respective frequency band. Endpoint detection is performed by a state machine that transitions from the speech absent state to the speech present state, and vice versa, depending on the results of the threshold comparisons. A partial speech detection system handles cases in which the input signal is truncated.
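A condensed sketch of the per-band thresholding: each band keeps long-term noise statistics and floats its threshold above the noise mean, and a frame is called speech when enough bands vote. Running moments replace the patent's histogram data structure here, and all constants are illustrative.

```python
import numpy as np

class BandThreshold:
    """Adaptive threshold for one frequency band: track the long-term
    mean and variance of the band-limited energy and test new frames
    against mean + k * std (constants assumed)."""
    def __init__(self, k=3.0, alpha=0.01):
        self.mean, self.var, self.k, self.alpha = 0.0, 1.0, k, alpha

    def update(self, energy, speech_present):
        if not speech_present:            # adapt only on noise frames
            d = energy - self.mean
            self.mean += self.alpha * d
            self.var += self.alpha * (d * d - self.var)

    def exceeded(self, energy):
        return energy > self.mean + self.k * np.sqrt(self.var)

def speech_in_frame(band_energies, thresholds, min_bands=2):
    """Declare speech when enough bands exceed their own thresholds;
    a state machine (not shown) then decides the actual endpoints."""
    votes = sum(th.exceeded(e) for e, th in zip(band_energies, thresholds))
    return votes >= min_bands
```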

Patent
21 Dec 1999
TL;DR: In this paper, variable rate coding of a speech signal is proposed to achieve low average bit rates by only employing high fidelity modes (i.e., high bit rate, broadly applicable to different types of speech) during portions of the speech where this fidelity is required for acceptable output.
Abstract: A method and apparatus for the variable rate coding of a speech signal. An input speech signal is classified and an appropriate coding mode is selected based on this classification. For each classification, the coding mode that achieves the lowest bit rate with an acceptable quality of speech reproduction is selected. Low average bit rates are achieved by only employing high fidelity modes (i.e., high bit rate, broadly applicable to different types of speech) during portions of the speech where this fidelity is required for acceptable output. Lower bit rate modes are used during portions of speech where these modes produce acceptable output. The input speech signal is classified into active and inactive regions. Active regions are further classified into voiced, unvoiced, and transient regions. Various coding modes are applied to active speech, depending upon the required level of fidelity. Coding modes may be utilized according to the strengths and weaknesses of each particular mode. The apparatus dynamically switches between these modes as the properties of the speech signal vary with time. Where appropriate, regions of speech are modeled as pseudo-random noise, resulting in a significantly lower bit rate. This coding is used in a dynamic fashion whenever unvoiced speech or background noise is detected.
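The mode-selection logic can be sketched with a crude frame classifier driving a rate table. The classifier below (an energy gate plus zero-crossing rate, with an energy jump marking transients) and the mode names and rates are illustrative stand-ins for the real classification rules, not the patent's values.

```python
import numpy as np

# hypothetical mode table: classification -> (coder, bit rate in kbit/s)
MODES = {
    'inactive':  ('comfort noise', 0.8),
    'unvoiced':  ('pseudo-random excitation', 2.0),
    'transient': ('high-fidelity mode', 8.0),
    'voiced':    ('pitch-predictive mode', 4.0),
}

def classify_frame(frame, prev_energy, energy_gate=1e-4):
    """Classify one frame as inactive, unvoiced, transient, or voiced.
    Returns (label, frame_energy) so the caller can track energy."""
    energy = float(np.mean(frame ** 2))
    if energy < energy_gate:
        return 'inactive', energy
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    if zcr > 0.35:                       # noise-like waveform
        return 'unvoiced', energy
    if energy > 4.0 * prev_energy:       # sudden onset
        return 'transient', energy
    return 'voiced', energy

def average_bit_rate(frames):
    prev, rates = 1e-6, []
    for f in frames:
        label, prev = classify_frame(f, prev)
        rates.append(MODES[label][1])
    return sum(rates) / len(rates)   # low when much of the input is inactive
```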

Proceedings ArticleDOI
A.J. Accardi, R.V. Cox
15 Mar 1999
TL;DR: In this paper, a modified version of Ephraim and Van Trees's (see IEEE Trans. Speech and Audio Proc., vol.3, p.251-66, 1995) spectral domain constrained signal subspace estimator is used within a generalized MMSE-LSA framework, obtaining a system with greater flexibility and similar performance.
Abstract: Ephraim and Malah's (1984, 1985) MMSE-LSA speech enhancement algorithm, while robust and effective, is difficult to tune and adjust for the tradeoff between noise reduction and distortion. We suggest a means of generalizing this design, which allows for other estimators besides the MMSE-LSA to be used within the same supporting framework. When a modified version of Ephraim and Van Trees's (see IEEE Trans. Speech and Audio Proc., vol.3, p.251-66, 1995) spectral domain constrained signal subspace estimator is used in this manner, we obtain a system with greater flexibility and similar performance. We also explore the possibility of using different speech enhancement techniques as pre-processors for different parameter extraction modules of the IS-641 speech coder (a 7.4 kbit/s ACELP codec). We show that such a strategy can increase the quality of the coded speech and lead to a system that is more robust to differing noise types.

Patent
Mika T. Sorsa
15 Dec 1999
TL;DR: In this paper, a system and method for voice browsing IVR services using a mobile terminal is presented, where a voice application is accessible via a server connected to a network and a call connection is established between the mobile terminal and the server using a dual-mode connection.
Abstract: A system and method for voice browsing IVR services using a mobile terminal. A voice application is accessible via a server connected to a network. A call connection is established between the mobile terminal and the server using a dual-mode connection. The call connection includes a voice mode and a data mode for alternately transmitting voice and data via the network. The voice application sends a state-dependent grammar that defines the speech recognition results that the voice application is ready to accept as input or commands at its present state of execution. The voice application also sends to the mobile terminal state-dependent voice output such as audio prompts and instructions using the voice mode. The user responds orally to the voice output. The mobile terminal processes this voice input using speech recognition facilities. Valid input is extracted from the voice input based on the state-dependent grammar. The mobile terminal sends the valid input to the voice application using the data mode. The voice application updates its state of execution based on the valid input.