
Showing papers on "Voice activity detection published in 1996"


Journal ArticleDOI
TL;DR: Linear predictive (LP) analysis, the first step of feature extraction, is discussed, and various robust cepstral features derived from LP coefficients are described, including the affine transform, which is a feature transformation approach that integrates mismatch to simultaneously combat both channel and noise distortion.
Abstract: The future commercialization of speaker- and speech-recognition technology is impeded by the large degradation in system performance due to environmental differences between training and testing conditions. This is known as the "mismatched condition." Studies have shown [1] that most contemporary systems achieve good recognition performance if the conditions during training are similar to those during operation (matched conditions). Frequently, mismatched conditions are present in which the performance is dramatically degraded as compared to the ideal matched conditions. A common example of this mismatch is when training is done on clean speech and testing is performed on noise- or channel-corrupted speech. Robust speech techniques [2] attempt to maintain the performance of a speech processing system under such diverse conditions of operation. This article presents an overview of current speaker-recognition systems and the problems encountered in operation, and it focuses on the front-end feature extraction process of robust speech techniques as a method of improvement. Linear predictive (LP) analysis, the first step of feature extraction, is discussed, and various robust cepstral features derived from LP coefficients are described. Also described is the affine transform, which is a feature transformation approach that integrates mismatch to simultaneously combat both channel and noise distortion.
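The LP-to-cepstrum step the abstract refers to is the standard textbook recursion; a minimal sketch is below, assuming predictor-form coefficients a_k for H(z) = 1/(1 - sum_k a_k z^-k). The function name is mine, not the article's.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """LP coefficients a[0..p-1] (predictor form, H(z) = 1/(1 - sum a_k z^-k))
    to LP-derived cepstral coefficients via the standard recursion:
      c_n = a_n + sum_{k=1..n-1} (k/n) c_k a_{n-k}   (a_n = 0 for n > p).
    """
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```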

344 citations


Proceedings ArticleDOI
03 Oct 1996
TL;DR: The preliminary results presented in this paper show that such an approach, even using quite simple recombination strategies, can yield at least comparable performance on clean speech while providing better robustness in the case of noisy speech.
Abstract: In the framework of hidden Markov models (HMM) or hybrid HMM/artificial neural network (ANN) systems, we present a new approach towards automatic speech recognition (ASR). The general idea is to split the whole frequency band (represented in terms of critical bands) into a few sub-bands on which different recognizers are independently applied and then recombined at a certain speech unit level to yield global scores and a global recognition decision. The preliminary results presented in this paper show that such an approach, even using quite simple recombination strategies, can yield at least comparable performance on clean speech while providing better robustness in the case of noisy speech.
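As a rough illustration of the recombination idea (not the paper's exact strategy), one simple unit-level scheme is a weighted sum of per-band log-likelihoods; names and array shapes below are assumptions.

```python
import numpy as np

def recombine_subbands(band_log_likes, weights=None):
    """Weighted sum of per-sub-band log-likelihoods at the speech-unit level
    (equivalent to a product of band likelihoods when weights are equal).

    band_log_likes: array of shape (n_bands, n_units), one recognizer score
    per sub-band per candidate speech unit.
    """
    band_log_likes = np.asarray(band_log_likes, dtype=float)
    if weights is None:
        weights = np.full(band_log_likes.shape[0], 1.0 / band_log_likes.shape[0])
    global_scores = weights @ band_log_likes  # global score per candidate unit
    return int(np.argmax(global_scores)), global_scores
```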

312 citations


Journal ArticleDOI
TL;DR: It is suggested that recent studies based on a Source Generator Framework can provide a viable foundation on which to establish robust speech recognition techniques, and three novel approaches for signal enhancement and stress equalization are considered to address the issue of recognition under noisy, stressful conditions.

270 citations


PatentDOI
TL;DR: A real-time voice dialog system is described in which devices are controlled automatically by voice dialog, applying methods of voice input, voice signal processing and voice recognition, and syntactical-grammatical postediting, as well as dialog, executive sequencing, and interface control.
Abstract: The invention pertains to a voice dialog system wherein a process for automatic control of devices by voice dialog is used, applying methods of voice input, voice signal processing and voice recognition, syntactical-grammatical postediting as well as dialog, executive sequencing and interface control, and which is characterized in that syntax and command structures are set during real-time dialog operation; preprocessing, recognition and dialog control are designed for operation in a noise-encumbered environment; no user training is required for recognition of general commands; training of individual users is necessary for recognition of special commands; the input of commands is done in linked form, the number of words used to form a command for voice input being variable; real-time processing and execution of the voice dialog are established; and the voice input and output are done in hands-free mode.

263 citations



PatentDOI
TL;DR: A computer system for user speech actuation of access to stored information, the system including a central processing unit, a memory and a user input/output interface including a microphone for input of user speech utterances and audible sound signal processing circuitry, and a file system for accessing and storing information in the memory of the computer.
Abstract: A computer system for user speech actuation of access to stored information, the system including a central processing unit, a memory and a user input/output interface including a microphone for input of user speech utterances and audible sound signal processing circuitry, and a file system for accessing and storing information in the memory of the computer. A speech recognition processor operating on the computer system recognizes words based on the input speech utterances of the user in accordance with a set of language/acoustic model and speech recognition search parameters. Software running on the CPU scans a document accessed by a web browser to form a web triggered word set from a selected subset of information in the document. The language/acoustic model and speech recognition search parameters are modified dynamically using the web triggered word set, and used by the speech recognition processor for generating a word string for input to the browser to initiate a change in the information accessed.
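A minimal sketch of the word-set extraction step, assuming the document text is already available as a string; the tokenization rule and stopword filtering are illustrative, not the patent's.

```python
import re

def web_triggered_word_set(page_text, stopwords=frozenset()):
    """Scan the text of the document the browser has loaded and keep a subset
    of its words; the recognizer's language model and search parameters are
    then biased toward this on-page vocabulary."""
    words = set(re.findall(r"[a-z']+", page_text.lower()))
    return words - stopwords
```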

208 citations


Journal ArticleDOI
TL;DR: An algorithm for automatically detecting landmarks associated with segments having abrupt acoustics, which provides hypotheses about the underlying broad phonetic class at each landmark as a consequence of landmark detection.
Abstract: This thesis is a component of a proposed knowledge-based speech recognition system which uses landmarks to guide the search for distinctive features. In an utterance, landmarks identify localized regions where the acoustic manifestations of the linguistically-motivated distinctive features are most salient. This thesis describes an algorithm for automatically detecting landmarks associated with segments having abrupt acoustics. As a consequence of landmark detection, the algorithm also provides hypotheses about the underlying broad phonetic class at each landmark. The algorithm is hierarchically-structured, and is rooted in linguistic and speech production theory. It uses several factors to detect landmarks: energy abruptness in five frequency bands and at two levels of temporal resolution, segmental duration, specific broad phonetic class constraints, and articulatory constraints. Landmark detection experiments were performed on clean speech (including TIMIT), speech in noise, and telephone speech. On clean speech, the landmark detector performed relatively well, with a detection rate of about 90% if correct landmark type was required, and 94% if correct landmark type was not required. The insertion rate was 6%-9%. An analysis of the temporal precision of the landmark detector showed that a large majority of the landmarks were detected within 20 ms of the landmark transcription, and almost all were within 30 ms. For either speech in noise or telephone speech, performance understandably degraded due to the reduced information content in the speech signal. For each set of experiments, the landmark detection algorithm was manually customized to the database using knowledge about speech and the operating environment. One consequence of this knowledge-driven approach is that there is no degradation in performance between what is typically called the "training" data set and the test data set. This approach also allows careful evaluation and further improvements to be made in a methodical manner. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
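The energy-abruptness cue can be sketched as a rate-of-rise detector on a band-energy contour, run at two values of `step` for the coarse and fine temporal resolutions the thesis describes; the thresholding rule and parameter names below are assumptions, not the thesis's actual algorithm.

```python
import numpy as np

def abrupt_onsets(band_energy_db, step, threshold_db):
    """Flag frames where a band-energy contour (in dB) rises by more than
    threshold_db over `step` frames; negative peaks of the same difference
    signal would suggest offset landmarks."""
    band_energy_db = np.asarray(band_energy_db, dtype=float)
    rise = band_energy_db[step:] - band_energy_db[:-step]
    return np.flatnonzero(rise > threshold_db) + step
```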

204 citations


Proceedings ArticleDOI
Jay G. Wilpon1, C.N. Jacobsen1
07 May 1996
TL;DR: A thorough investigation into the performance of current automatic speech recognition technology with children and the elderly, using a connected digit recognizer and a major telephone speech database, finds that sufficient, representative training data yields good recognition only for speakers aged roughly 15 to 70; outside this range, error rates increase dramatically.
Abstract: Although children and the elderly have obvious needs for voice operated interfaces, hardly anything is known about the performance of current automatic speech recognition technology with these people. In this paper we report the results of a thorough investigation into this field using a connected digit recognizer and a major telephone speech database. One would generally assume that the recognition of speech from these people would only be a matter of having enough, sufficiently representative training data. This turns out to be true only as long as the speakers belong to the age range of 15 to approximately 70. Outside this range the error rates increase dramatically, even with balanced amounts of training data. For males, the lower limit is very sharp and can be attributed to the change of pitch frequency during puberty. For females, the lower limit is gradual and caused only by the slowly changing dimensions of the vocal tract. For both genders, the upper limit is very gradual and can possibly be attributed to changes in the glottis area and the internal control loops of the human articulatory system. The paper presents some supporting evidence for the above assertions and gives results for various attempts to improve the performance. Recognition of speech from children and the elderly will require much more research if we are to fully understand the effects of these age groups' characteristics on current and future speech recognition systems.

204 citations


Patent
Yasunaga Miyazawa1, Mitsuhiro Inazumi1, Hiroshi Hasegawa1, Isao Edatsune1, Osamu Urano1 
TL;DR: In this article, a speaker specific and non-speaker specific method and apparatus is provided for enabling speech-based remote control and for recognizing the speech of an unspecified speaker at extremely high recognition rates regardless of the speaker's age, sex, or individual speech mannerisms.
Abstract: Bifurcated speaker specific and non-speaker specific method and apparatus is provided for enabling speech-based remote control and for recognizing the speech of an unspecified speaker at extremely high recognition rates regardless of the speaker's age, sex, or individual speech mannerisms. A device main unit is provided with a speech recognition processor for recognizing speech and taking an appropriate action, and with a user terminal containing specific speaker capture and/or preprocessing capabilities. The user terminal exchanges data with the speech recognition processor using radio transmission. The user terminal may be provided with a conversion rule generator that compares the speech of a user with previously compiled standard speech feature data and, based on this comparison result, generates a conversion rule for converting the speaker's speech feature parameters to corresponding standard speaker's feature information. The speech recognition processor, in turn, may reference the conversion rule developed in the user terminal and perform speech recognition based on the input speech feature parameters that have been converted above.
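The patent leaves the form of the conversion rule open; one plausible hedged reading is an affine feature map fitted by least squares over frame-aligned data, sketched below. The affine form is my assumption, not the patent's.

```python
import numpy as np

def fit_conversion_rule(user_feats, standard_feats):
    """Fit a rule converting a user's speech feature vectors toward the
    standard speaker's feature data. Both inputs: shape (n_frames, n_dims),
    frame-aligned."""
    X = np.hstack([user_feats, np.ones((len(user_feats), 1))])  # add bias term
    W, *_ = np.linalg.lstsq(X, standard_feats, rcond=None)
    return W  # apply with: np.hstack([feats, ones]) @ W
```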

178 citations


Patent
Caroline G. Henton1
TL;DR: In this article, the authors proposed a formalized aliasing approach to improve the quality of electronic speech synthesis using pre-recorded segments of speech to fill in for other missing segments.
Abstract: The present invention improves upon electronic speech synthesis using pre-recorded segments of speech to fill in for other missing segments of speech. The formalized aliasing approach of the present invention overcomes the ad hoc aliasing approach of the prior art which oftentimes generated less than satisfactory speech synthesis sound output. By formalizing the relationship between missing speech sound samples and available speech sound samples, the present invention provides a structured approach to aliasing which results in improved synthetic speech sound quality. Further, the formalized aliasing approach of the present invention can be used to lessen storage requirements for speech sound samples by only storing as many sound samples as memory capacity can support.

177 citations


Proceedings ArticleDOI
03 Oct 1996
TL;DR: A new highly parallel approach to automatic recognition of speech, inspired by early Fetcher's research on articulation index, and based on independent probability estimates in several sub-bands of the available speech spectrum, is presented.
Abstract: A new highly parallel approach to automatic recognition of speech, inspired by early Fetcher's research on articulation index, and based on independent probability estimates in several sub-bands of the available speech spectrum, is presented. The approach is especially suitable for situations when part of the spectrum of speech is computed. In such cases, it can yield an order-of-magnitude improvement in the error rate over a conventional full-band recognizer.

PatentDOI
Yasunaga Miyazawa1, Mitsuhiro Inazumi1, Hiroshi Hasegawa1, Isao Edatsune1, Osamu Urano1 
TL;DR: Techniques for implementing adaptable voice activation operations for interactive speech recognition devices and instruments include tailoring the volume level of the synthesized voice response according to the perceived volume level as detected by the input sound signal power detector.
Abstract: Techniques for implementing adaptable voice activation operations for interactive speech recognition devices and instruments. Specifically, such speech recognition devices and instruments include an input sound signal power or volume detector in communication with a central CPU for bringing the CPU out of an initial sleep state upon detection of perceived voice that exceeds a predetermined threshold volume level and is continuously perceived for at least a certain period of time. If both these conditions are satisfied, the CPU is transitioned into an active mode so that the perceived voice can be analyzed against a set of registered key words to determine if a "power on" command or similar instruction has been received. If so, the CPU maintains an active state and normal speech recognition processing ensues until a "power off" command is received. However, if the perceived and analyzed voice cannot be recognized, it is deemed to be background noise and the minimum threshold is selectively updated to accommodate the volume level of the perceived but unrecognized voice. Other aspects include tailoring the volume level of the synthesized voice response according to the perceived volume level as detected by the input sound signal power detector, as well as modifying audible response volume in accordance with updated volume threshold levels.
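A compact sketch of the wake-up logic as described (power threshold plus minimum duration, with threshold adaptation when recognition fails); the specific averaging update rule and all parameter names are assumptions.

```python
def wake_on_voice(frame_powers, threshold, min_frames, spot_keyword):
    """Sleep/wake logic: wake the CPU only when frame power exceeds
    `threshold` for `min_frames` consecutive frames, then check for a
    registered key word; if none is recognized, treat the audio as
    background noise and raise the threshold toward its level."""
    run = 0
    for power in frame_powers:
        run = run + 1 if power > threshold else 0
        if run >= min_frames:
            if spot_keyword():            # e.g. "power on" recognized
                return True, threshold    # stay active for normal recognition
            threshold = 0.5 * (threshold + power)  # adapt toward noise level
            run = 0
    return False, threshold
```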

Patent
29 Feb 1996
TL;DR: In this paper, a call-placement system for telephone services in response to speech is described, which allows a customer to place a call by speaking a person's name which serves as a destination identifier without having to speak an additional command or steering word.
Abstract: Methods and apparatus for activating telephone services in response to speech are described. A directory including names is maintained for each customer. A speaker dependent speech template and a telephone number for each name, is maintained as part of each customer's directory. Speaker independent speech templates are used for recognizing commands. The present invention has the advantage of permitting a customer to place a call by speaking a person's name which serves as a destination identifier without having to speak an additional command or steering word to place the call. This is achieved by treating the receipt of a spoken name in the absence of a command as an implicit command to place a call. Explicit speaker independent commands are used to invoke features or services other than call placement. Speaker independent and speaker dependent speech recognition are performed on a customer's speech in parallel. An arbiter is used to decide which function or service should be performed when an apparent conflict arises as a result of both the speaker dependent and speaker independent speech recognition step outputs. Stochastic grammars, word spotting and/or out-of-vocabulary rejection are used as part of the speech recognition process to provide a user friendly interface which permits the use of spontaneous speech. Voice verification is performed on a selective basis where security is of concern.
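A hedged sketch of the arbiter logic the patent describes, treating a recognized name with no command as an implicit call; the (word, score) result format and the score-based tie-break are assumptions.

```python
def arbitrate(sd_name, si_command):
    """Arbiter for the parallel recognizers: an explicit speaker-independent
    command wins; a speaker-dependent name with no stronger command is
    treated as an implicit 'call' instruction. Each argument is either None
    or a (word, score) tuple."""
    if si_command and (not sd_name or si_command[1] >= sd_name[1]):
        return ("command", si_command[0])
    if sd_name:
        return ("call", sd_name[0])   # spoken name alone places the call
    return ("reject", None)           # out-of-vocabulary input
```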

Proceedings ArticleDOI
01 Jan 1996
TL;DR: The results of a study examining the effects speech coders have on speech recognition are presented, including the effects on speech recognition performance of tandeming each of the speech coders.
Abstract: Speech coders with bitrates as low as 2.4 kbits/s are now being developed for speech transmission in the telecommunications industry. For speech coders to work at this reduced bitrate, some speech information has to be removed, and it is only natural to expect that the performance of speech recognition systems will deteriorate when coded speech is applied as input to a recognition system. The results of a study to examine the effects speech coders have on speech recognition are presented. Six different speech coders ranging from 4.8 kbits/s to 40 kbits/s are used with two different speech recognition systems: (1) isolated word recognition and (2) phoneme recognition from continuous speech. The effects on speech recognition performance of tandeming each of the speech coders are also presented.

PatentDOI
TL;DR: In this paper, the authors proposed a voice activity detection device in which an input speech signal (x(n)) is divided in subsignals representing specific frequency bands and noise (N(s)) is estimated in the subsignal.
Abstract: The invention concerns a voice activity detection device in which an input speech signal (x(n)) is divided into subsignals (S(s)) representing specific frequency bands and noise (N(s)) is estimated in the subsignals. On the basis of the estimated noise in the subsignals, subdecision signals (SNR(s)) are generated, and a voice activity decision (Vind) for the input speech signal is formed on the basis of the subdecision signals. Spectrum components of the input speech signal and a noise estimate are calculated and compared. More specifically, a signal-to-noise ratio is calculated for each subsignal, and each signal-to-noise ratio represents a subdecision signal (SNR(s)). From the signal-to-noise ratios a value proportional to their sum is calculated and compared with a threshold value, and a voice activity decision signal (Vind) for the input speech signal is formed on the basis of the comparison.
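The decision logic maps directly onto a few lines of code; this sketch assumes an FFT power spectrum with per-band bin ranges, and the dB scaling is an implementation choice rather than the patent's specification.

```python
import numpy as np

def vad_decision(power_spectrum, band_edges, noise_power, threshold):
    """Sub-band VAD in the spirit of the patent: per-band SNRs SNR(s) act as
    the sub-decision signals, and the voice activity decision Vind compares
    a value proportional to their sum with a threshold.

    band_edges: FFT-bin boundaries of the frequency bands.
    noise_power: running per-band noise power estimates N(s)."""
    n_bands = len(band_edges) - 1
    snr = np.empty(n_bands)
    for s in range(n_bands):
        band = power_spectrum[band_edges[s]:band_edges[s + 1]]
        snr[s] = 10.0 * np.log10(band.sum() / noise_power[s] + 1e-12)
    vind = snr.sum() > threshold
    return vind, snr
```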

Book ChapterDOI
15 Apr 1996
TL;DR: Tests on small isolated-word vocabularies using a dynamic time warping based audio-visual recogniser demonstrate that real-time, contour-based lip tracking can be used to supplement acoustic-only speech recognisers enabling robust recognition of speech in the presence of acoustic noise.
Abstract: Developments in dynamic contour tracking permit sparse representation of the outlines of moving contours. Given the increasing computing power of general-purpose workstations it is now possible to track human faces and parts of faces in real-time without special hardware. This paper describes a real-time lip tracker that uses a Kalman filter based dynamic contour to track the outline of the lips. Two alternative lip trackers, one that tracks lips from a profile view and the other from a frontal view, were developed to extract visual speech recognition features from the lip contour. In both cases, visual features have been incorporated into an acoustic automatic speech recogniser. Tests on small isolated-word vocabularies using a dynamic time warping based audio-visual recogniser demonstrate that real-time, contour-based lip tracking can be used to supplement acoustic-only speech recognisers enabling robust recognition of speech in the presence of acoustic noise.
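The tracker's estimator is a standard Kalman filter; a generic predict/update step is sketched below, where x would stack the lip-contour control-point states and z the measured image positions. The matrix choices are the caller's, not the paper's.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a Kalman filter: F/H are the state and
    measurement models, Q/R the process and measurement noise covariances."""
    x = F @ x                        # predict state
    P = F @ P @ F.T + Q              # predict covariance
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + K @ (z - H @ x)          # update with measurement residual
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```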

PatentDOI
TL;DR: In this paper, a lexical network is structured to include Phonetic Constraint Nodes which organize the interword phonetic information in the network, and Word Class Nodes that organize the syntactic/semantic information in network.
Abstract: The vocabulary of a large-vocabulary speech recognition system is structured to effectuate the rapid and efficient addition of words to a lexical network storing the vocabulary. The lexical network is structured to include Phonetic Constraint Nodes, which organize the inter-word phonetic information in the network, and Word Class Nodes, which organize the syntactic/semantic information in the network. Network fragments, corresponding to phoneme pronunciations and labeled to specify permitted interconnections, are precompiled to facilitate the rapid generation of pronunciations for new words and thereby enable the rapid addition of words to the vocabulary even during speech recognition. Different language models and different vocabularies for different portions of a discourse are invoked depending in part on the discourse history.

Proceedings ArticleDOI
07 May 1996
TL;DR: A measure for evaluating the effectiveness of a post-classifier which estimates confidence-measures is defined, and the development of aPost- classifier for words decoded from the SWITCHBOARD database is described, which uses statistics derived from a Viterbi decoder.
Abstract: There is increasing interest in systems which attempt to automate a task or a transaction using speech input and output. To function effectively with imperfect speech recognition, such systems require an estimate of which words in the output from the recogniser are likely to be correct and which can probably be disregarded as incorrect, i.e. a confidence-measure for each decoded word. We define a measure for evaluating the effectiveness of a post-classifier which estimates confidence-measures, and describe the development of a post-classifier for words decoded from the SWITCHBOARD database, which uses statistics derived from a Viterbi decoder. Without any grouping of the decoded word-classes, the post-classifier increased the probability of deciding whether a decoded word was correct or incorrect by 32%. When grouping was used, longer words showed an improvement of 65%.
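One plausible form for such a post-classifier (not necessarily the paper's) is a logistic model over per-word decoder statistics:

```python
import numpy as np

def word_confidence(decoder_stats, w, b):
    """Map per-word statistics from the Viterbi decoder (e.g. normalized
    acoustic score, word duration, language-model score) to a confidence in
    [0, 1]; the feature list and the logistic form are illustrative."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, decoder_stats) + b)))
```

A decoded word whose confidence falls below a task-dependent threshold would then be disregarded as probably incorrect.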

Patent
30 May 1996
TL;DR: In this article, a visual feedback aid, for a computer system performing speech recognition functions, provides indications on a display monitor of the system representing the current state of operation of a system microphone, the current mode of operation in respect to speech, and a string of text representing the system's recognition (correct or incorrect) of commands instantly spoken into the microphone.
Abstract: A visual feedback aid, for a computer system performing speech recognition functions, provides indications on a display monitor of the system representing the current state of operation of a system microphone, the current mode of operation of the system in respect to speech, and a string of text representing the system's recognition (correct or incorrect) of commands instantly spoken into the microphone. The indications preferably are located in a reserved area of a display window associated with a currently active application involving speech recognition. The reserved area preferably would be a prominent one, such as the application title bar.

PatentDOI
Chi Wong1
TL;DR: In this article, a method of reducing the perplexity of a speech recognition vocabulary and dynamically selecting speech recognition acoustic model sets used in a simulated telephone operator apparatus is presented, where the directory of users of the telephone network is subdivided into subsets wherein each subset contains the names of users within a certain location or exchange.
Abstract: A method of reducing the perplexity of a speech recognition vocabulary and dynamically selecting speech recognition acoustic model sets used in a simulated telephone operator apparatus. The directory of users of the telephone network is subdivided into subsets wherein each subset contains the names of users within a certain location or exchange. A speech recognition vocabulary database is compiled for each subset and the appropriate database is loaded into the speech recognition apparatus in response to a requested call to the location covered by the subset. Furthermore, a site-specific acoustic model set is dynamically loaded according to the location of a calling party. An apparatus for carrying out the method is also discussed.

PatentDOI
TL;DR: In this article, a barge-in detector for use in connection with a speech recognition system forms a prompt replica for detecting the presence or absence of user input to the system, which is indicative of the prompt energy applied to an input of the system.
Abstract: A barge-in detector for use in connection with a speech recognition system forms a prompt replica for use in detecting the presence or absence of user input to the system. The replica is indicative of the prompt energy applied to an input of the system. The detector detects the application of user input to the system, even if concurrent with a prompt, and enables the system to quickly respond to the user input.
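A minimal sketch of the detection test, assuming per-frame energy estimates for the microphone signal and the prompt replica; the 6 dB margin is an invented default, not a figure from the patent.

```python
import math

def barge_in(mic_power, replica_power, margin_db=6.0):
    """Flag user input when microphone energy exceeds what the prompt
    replica alone explains by some margin, so speech is detected even
    while the prompt is still playing."""
    excess = 10.0 * math.log10((mic_power + 1e-12) / (replica_power + 1e-12))
    return excess > margin_db
```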

PatentDOI
Edatsune Isao1
TL;DR: The invention improves recognition rates by providing an interactive speech recognition device that performs recognition by taking situational and environmental changes into consideration, thus enabling interactions that correspond to situational and environmental changes.
Abstract: The invention improves recognition rates by providing an interactive speech recognition device that performs recognition by taking situational and environmental changes into consideration, thus enabling interactions that correspond to situational and environmental changes. The invention comprises a speech analysis unit that creates a speech data pattern corresponding to the input speech; a timing circuit for generating time data, for example, as variable data; a coefficient setting unit receiving the time data from the timing circuit and generating weighting coefficients that change over time, in correspondence to the content of each recognition target speech; a speech recognition unit that receives the speech data pattern of the input speech from the speech analysis unit, and that at the same time obtains a weighting coefficient in effect for a pre-registered recognition target speech at the time from the coefficient setting unit, that computes final recognition data by multiplying the recognition data corresponding to each recognition target speech by its corresponding weighting coefficient, and that recognizes the input speech based on the computed final recognition result; a speech synthesis unit for outputting speech synthesis data based on the recognition data that takes the weighting coefficient into consideration; and a drive control unit for transmitting the output from the speech synthesis unit to the outside.
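The weighting scheme can be sketched as a product of raw match scores and time-indexed coefficients; indexing by hour of day is an assumption, since the patent only says the coefficients change over time.

```python
def recognize_with_time_weights(match_scores, hour, weights_by_hour):
    """Final recognition per the patent: each candidate's raw match score is
    multiplied by its time-varying weighting coefficient, and the best
    product wins. match_scores: dict of candidate phrase -> raw score."""
    weights = weights_by_hour[hour]
    return max(match_scores,
               key=lambda word: match_scores[word] * weights.get(word, 1.0))

# e.g. "good morning" is weighted up at 8 a.m.:
# recognize_with_time_weights({"good morning": 0.7, "good night": 0.8},
#                             8, {8: {"good morning": 1.3}})
```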

Proceedings ArticleDOI
07 May 1996
TL;DR: This work provides evidence for the claim that a modern continuous speech recognizer can be used successfully in "black-box" fashion for robustly interpreting spontaneous utterances in a dialogue with a human.
Abstract: This paper presents a new technique for overcoming several types of speech recognition errors by post-processing the output of a continuous speech recognizer. The post-processor output contains fewer errors, thereby making interpretation by higher-level modules, such as a parser, in a speech understanding system more reliable. The primary advantage to the post-processing approach over existing approaches for overcoming SR errors lies in its ability to introduce options that are not available in the SR module's output. This work provides evidence for the claim that a modern continuous speech recognizer can be used successfully in "black-box" fashion for robustly interpreting spontaneous utterances in a dialogue with a human.

Proceedings Article
01 Jan 1996
TL;DR: It is argued that major advances in speaker verification, speech recognition, and natural-sounding speech synthesis depend on increases in knowledge of the mechanisms underlying voice and speech production under emotional arousal or other attitudinal states, as well as on a more adequate understanding of listener decoding of affect from vocal quality.
Abstract: This introduction to a special session on “Emotion in recognition and synthesis” highlights the need to understand the effects of affective speaker states on voice and speech on a psychophysiological level. It is argued that major advances in speaker verification, speech recognition, and natural-sounding speech synthesis depend on increases in our knowledge of the mechanisms underlying voice and speech production under emotional arousal or other attitudinal states, as well as on a more adequate understanding of listener decoding of affect from vocal quality. A brief review of the current state of the art is provided.

Journal ArticleDOI
01 Sep 1996
TL;DR: The position that understanding of speech production and perception impacts methods of speech processing, and vice versa, is enunciated, focusing on how modern time-frequency signal analysis methods could help expedite needed advances in these areas.
Abstract: Modern speech processing research may be categorized into three broad areas: statistical, physiological, and perceptual. Statistical research investigates the nature of the variability of the speech waveform from a signal processing viewpoint. This approach relates to the processing of speech in order to obtain measurements of speech characteristics which demonstrate manageable variabilities across a wide range of the talker population, in the presence of noise or competing speakers, under the interaction of speech with the channel through which it is transmitted, and under the inherent interaction of the information content of speech itself (i.e., the contextual factor). Physiological research aims at constructing accurate models of the articulatory and auditory processes, helping to limit the signal space for speech processing. In the perceptual realm, work focuses on understanding the psychoacoustic and possibly the psycholinguistic aspects of the speech communication process that the human so conveniently conducts. By studying this working analysis/recognition system, insights may be garnered that will lead to improved methods of speech processing. Conversely, by studying the limitations of this system, particularly how it reduces the information rate of the received signal through, for example, masking and adaptation, improvements may be made in the efficiency of speech coding schemes without impacting the quality of the reconstructed speech. Thus, comprehension of speech production and perception impacts methods of speech processing, and vice versa. This paper enunciates such a position, focusing on how modern time-frequency signal analysis methods could help expedite needed advances in these areas.

Patent
25 Jun 1996
TL;DR: In this paper, the authors proposed a voice over data function that directly encodes digitized voice samples onto the carrier using quadrature amplitude modulation to transmit multiple bits of the voice sample for every baud.
Abstract: The voice over data component of a personal communications system enables the operator to simultaneously transmit voice and data communication to a remote site. This voice over data function directly encodes digitized voice samples onto the carrier using quadrature amplitude modulation to transmit multiple bits of the voice sample for every baud. The system also allocates selected bauds of the carrier to voice and to data so the voice over data may be transmitted using the same allocated bandwidth. The system may also dynamically reallocate the bandwidth over the telephone line depending on the demands of the voice grade digitized signal.

01 Jan 1996
TL;DR: This paper introduces the basic framework of a statistical structure that can accommodate multiple (asynchronous) observation streams (possibly exhibiting different frame rates) and will then be applied to the particular case of multi-band speech recognition and will be shown to yield significantly better noise robustness.
Abstract: In this paper, we discuss a new automatic speech recognition (ASR) approach based on independent processing and recombination of several feature streams. In this framework, it is assumed that the speech signal is represented in terms of multiple input streams, each input stream representing a different characteristic of the signal. If the streams are entirely synchronous, they may be accommodated simply (as they usually are in state-of-the-art systems). However, as discussed in the paper, it may be required to permit some degree of asynchrony between streams. This paper introduces the basic framework of a statistical structure that can accommodate multiple (asynchronous) observation streams (possibly exhibiting different frame rates). This approach will then be applied to the particular case of multi-band speech recognition and will be shown to yield significantly better noise robustness.
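One hedged reading of the unit-level recombination, in which per-stream frame rates are normalized away by averaging within the unit before a log-linear combination; this is an illustration of the framework, not the paper's exact formulation.

```python
import numpy as np

def unit_score(stream_log_posts, exponents):
    """Score one speech unit from asynchronous streams: each stream carries
    log-probabilities at its own frame rate, so averaging within the unit
    gives one rate-independent score per stream, then a weighted log-linear
    recombination with per-stream exponents."""
    return sum(e * float(np.mean(lp))
               for e, lp in zip(exponents, stream_log_posts))
```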

PatentDOI
TL;DR: A speech system includes a speech encoding system, which generates a data signal containing the speech segment IDs and the values of the corresponding prosodic parameters, and a speech decoding system that synthesizes speech from that signal.
Abstract: A speech system includes a speech encoding system and a speech decoding system. The speech encoding system includes a speech analyzer for identifying each of the speech segments (i.e., phonemes) in the received digitized speech signal. A pitch detector, a duration detector, and an amplitude detector are each coupled to the memory and the analyzer and detect various prosodic parameters of each received speech segment. A speech encoder generates a data signal that includes the speech segment IDs and the values of the corresponding prosodic parameters. The speech decoding system includes a digital data decoder and a speech synthesizer for generating a speech signal based on the segment IDs and prosodic parameter values.
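The data signal's per-segment payload can be sketched as a simple record; the field names and units below are mine, not the patent's.

```python
from dataclasses import dataclass

@dataclass
class EncodedSegment:
    """One entry in the transmitted data signal: a speech segment (phoneme)
    ID plus the prosodic values measured by the pitch, duration, and
    amplitude detectors."""
    segment_id: int
    pitch_hz: float
    duration_ms: float
    amplitude: float
```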

PatentDOI
TL;DR: In this article, a transcoder (TRCU1, TRCU2) was proposed for preventing tandem coding of speech in a mobile to mobile (MS1, MS2) call within a mobile communication system which employs a speech coding method for reducing transmission rate on the radio path.
Abstract: The invention relates to a transcoder (TRCU1, TRCU2) having means for preventing tandem coding of speech in a mobile-to-mobile (MS1, MS2) call within a mobile communication system which employs a speech coding method for reducing the transmission rate on the radio path. The transcoder (TRCU1, TRCU2) comprises a speech coder (52, 73) which encodes the speech signal into speech parameters for transmission to a mobile station, and decodes the speech parameters received from the mobile station into a speech signal according to said speech coding method, as well as a PCM coder (54, 72) for transmitting an uplink speech signal to and for receiving a downlink speech signal from a PCM interface in the form of PCM speech samples. In addition to the normal operation, the transcoder transmits and receives speech parameters through the PCM interface in a subchannel formed by the least significant bits of the PCM speech samples. Thus, it is possible to prevent tandem coding while maintaining the standard PCM interface and the signalling and services associated with it.
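The LSB subchannel is straightforward to sketch; the framing of the parameter bit stream is an assumption, as is the single-bit default.

```python
def embed_in_lsb_subchannel(pcm_samples, param_bits, n_lsb=1):
    """Carry coded speech parameters across the PCM interface in a
    subchannel formed by the n_lsb least significant bits of each PCM
    sample, as the patent describes."""
    mask = (1 << n_lsb) - 1
    bit_iter = iter(param_bits)
    out = []
    for sample in pcm_samples:
        sub = 0
        for i in range(n_lsb):
            sub |= next(bit_iter, 0) << i   # pad with zeros when bits run out
        out.append((sample & ~mask) | sub)  # overwrite the low bits
    return out
```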

Proceedings ArticleDOI
03 Oct 1996
TL;DR: It is demonstrated that some improvements in word accuracy result from augmenting the channel model with an account of word fertility in the channel, and that a modem continuous speech recognizer can be used in "black-box" fashion for robustly recognizing speech for which the recognizer was not originally trained.
Abstract: The authors have implemented a post-processor called SPEECHPP to correct word-level errors committed by an arbitrary speech recognizer. Applying a noisy-channel model, SPEECHPP uses a Viterbi beam-search that employs language and channel models. Previous work demonstrated that a simple word-for-word channel model was sufficient to yield substantial increases in word accuracy. The paper demonstrates that some improvements in word accuracy result from augmenting the channel model with an account of word fertility in the channel. The work further demonstrates that a modem continuous speech recognizer can be used in "black-box" fashion for robustly recognizing speech for which the recognizer was not originally trained. The work also demonstrates that in the case where the recognizer can be tuned to the new task, environment, or speaker, the post-processor can also contribute to performance improvements.