
Showing papers on "Voice activity detection published in 1996"


Journal ArticleDOI
TL;DR: Linear predictive (LP) analysis, the first step of feature extraction, is discussed, and various robust cepstral features derived from LP coefficients are described, including the affine transform, which is a feature transformation approach that integrates mismatch to simultaneously combat both channel and noise distortion.
Abstract: The future commercialization of speaker- and speech-recognition technology is impeded by the large degradation in system performance due to environmental differences between training and testing conditions. This is known as the "mismatched condition." Studies have shown [1] that most contemporary systems achieve good recognition performance if the conditions during training are similar to those during operation (matched conditions). Frequently, mismatched conditions are present in which the performance is dramatically degraded as compared to the ideal matched conditions. A common example of this mismatch is when training is done on clean speech and testing is performed on noise- or channel-corrupted speech. Robust speech techniques [2] attempt to maintain the performance of a speech processing system under such diverse conditions of operation. This article presents an overview of current speaker-recognition systems and the problems encountered in operation, and it focuses on the front-end feature extraction process of robust speech techniques as a method of improvement. Linear predictive (LP) analysis, the first step of feature extraction, is discussed, and various robust cepstral features derived from LP coefficients are described. Also described is the affine transform, which is a feature transformation approach that integrates mismatch to simultaneously combat both channel and noise distortion.
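The LP-to-cepstrum step the abstract refers to is the standard textbook recursion; a minimal sketch is below, assuming predictor-form coefficients a_k for H(z) = 1/(1 - sum_k a_k z^-k). The function name is mine, not the article's.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """LP coefficients a[0..p-1] (predictor form, H(z) = 1/(1 - sum a_k z^-k))
    to LP-derived cepstral coefficients via the standard recursion:
      c_n = a_n + sum_{k=1..n-1} (k/n) c_k a_{n-k}   (a_n = 0 for n > p).
    """
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```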

344 citations


Proceedings ArticleDOI
03 Oct 1996
TL;DR: The preliminary results presented in this paper show that such an approach, even using quite simple recombination strategies, can yield at least comparable performance on clean speech while providing better robustness in the case of noisy speech.
Abstract: In the framework of hidden Markov models (HMM) or hybrid HMM/artificial neural network (ANN) systems, we present a new approach towards automatic speech recognition (ASR). The general idea is to split the whole frequency band (represented in terms of critical bands) into a few sub-bands on which different recognizers are independently applied and then recombined at a certain speech unit level to yield global scores and a global recognition decision. The preliminary results presented in this paper show that such an approach, even using quite simple recombination strategies, can yield at least comparable performance on clean speech while providing better robustness in the case of noisy speech.
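As a rough illustration of the recombination idea (not the paper's exact strategy), one simple unit-level scheme is a weighted sum of per-band log-likelihoods; names and array shapes below are assumptions.

```python
import numpy as np

def recombine_subbands(band_log_likes, weights=None):
    """Weighted sum of per-sub-band log-likelihoods at the speech-unit level
    (equivalent to a product of band likelihoods when weights are equal).

    band_log_likes: array of shape (n_bands, n_units), one recognizer score
    per sub-band per candidate speech unit.
    """
    band_log_likes = np.asarray(band_log_likes, dtype=float)
    if weights is None:
        weights = np.full(band_log_likes.shape[0], 1.0 / band_log_likes.shape[0])
    global_scores = weights @ band_log_likes  # global score per candidate unit
    return int(np.argmax(global_scores)), global_scores
```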

312 citations


Journal ArticleDOI
TL;DR: It is suggested that recent studies based on a Source Generator Framework can provide a viable foundation on which to establish robust speech recognition techniques, and three novel approaches for signal enhancement and stress equalization are considered to address the issue of recognition under noisy, stressful conditions.

270 citations


PatentDOI
TL;DR: A real-time voice dialog system is described in which devices are controlled automatically by voice dialog, applying methods of voice input, voice signal processing and voice recognition, and syntactical-grammatical postediting, as well as dialog, executive sequencing, and interface control.
Abstract: The invention pertains to a voice dialog system wherein a process for automatic control of devices by voice dialog is used, applying methods of voice input, voice signal processing and voice recognition, syntactical-grammatical postediting as well as dialog, executive sequencing and interface control, and which is characterized in that syntax and command structures are set during real-time dialog operation; preprocessing, recognition and dialog control are designed for operation in a noise-encumbered environment; no user training is required for recognition of general commands; training of individual users is necessary for recognition of special commands; the input of commands is done in linked form, the number of words used to form a command for voice input being variable; real-time processing and execution of the voice dialog are established; and the voice input and output are done in hands-free mode.

263 citations



PatentDOI
TL;DR: A computer system for user speech actuation of access to stored information, the system including a central processing unit, a memory and a user input/output interface including a microphone for input of user speech utterances and audible sound signal processing circuitry, and a file system for accessing and storing information in the memory of the computer.
Abstract: A computer system for user speech actuation of access to stored information, the system including a central processing unit, a memory and a user input/output interface including a microphone for input of user speech utterances and audible sound signal processing circuitry, and a file system for accessing and storing information in the memory of the computer. A speech recognition processor operating on the computer system recognizes words based on the input speech utterances of the user in accordance with a set of language/acoustic model and speech recognition search parameters. Software running on the CPU scans a document accessed by a web browser to form a web triggered word set from a selected subset of information in the document. The language/acoustic model and speech recognition search parameters are modified dynamically using the web triggered word set, and used by the speech recognition processor for generating a word string for input to the browser to initiate a change in the information accessed.
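A minimal sketch of the word-set extraction step, assuming the document text is already available as a string; the tokenization rule and stopword filtering are illustrative, not the patent's.

```python
import re

def web_triggered_word_set(page_text, stopwords=frozenset()):
    """Scan the text of the document the browser has loaded and keep a subset
    of its words; the recognizer's language model and search parameters are
    then biased toward this on-page vocabulary."""
    words = set(re.findall(r"[a-z']+", page_text.lower()))
    return words - stopwords
```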

208 citations


Journal ArticleDOI
TL;DR: An algorithm for automatically detecting landmarks associated with segments having abrupt acoustics, which provides hypotheses about the underlying broad phonetic class at each landmark as a consequence of landmark detection.
Abstract: This thesis is a component of a proposed knowledge-based speech recognition system which uses landmarks to guide the search for distinctive features. In an utterance, landmarks identify localized regions where the acoustic manifestations of the linguistically-motivated distinctive features are most salient. This thesis describes an algorithm for automatically detecting landmarks associated with segments having abrupt acoustics. As a consequence of landmark detection, the algorithm also provides hypotheses about the underlying broad phonetic class at each landmark. The algorithm is hierarchically-structured, and is rooted in linguistic and speech production theory. It uses several factors to detect landmarks: energy abruptness in five frequency bands and at two levels of temporal resolution, segmental duration, specific broad phonetic class constraints, and articulatory constraints. Landmark detection experiments were performed on clean speech (including TIMIT), speech in noise, and telephone speech. On clean speech, the landmark detector performed relatively well, with a detection rate of about 90% if correct landmark type was required, and 94% if correct landmark type was not required. The insertion rate was 6%-9%. An analysis of the temporal precision of the landmark detector showed that a large majority of the landmarks were detected within 20 ms of the landmark transcription, and almost all were within 30 ms. For either speech in noise or telephone speech, performance understandably degraded due to the reduced information content in the speech signal. For each set of experiments, the landmark detection algorithm was manually customized to the database using knowledge about speech and the operating environment. One consequence of this knowledge-driven approach is that there is no degradation in performance between what is typically called the "training" data set and the test data set. This approach also allows careful evaluation and further improvements to be made in a methodical manner. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
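The energy-abruptness cue can be sketched as a rate-of-rise detector on a band-energy contour, run at two values of `step` for the coarse and fine temporal resolutions the thesis describes; the thresholding rule and parameter names below are assumptions, not the thesis's actual algorithm.

```python
import numpy as np

def abrupt_onsets(band_energy_db, step, threshold_db):
    """Flag frames where a band-energy contour (in dB) rises by more than
    threshold_db over `step` frames; negative peaks of the same difference
    signal would suggest offset landmarks."""
    band_energy_db = np.asarray(band_energy_db, dtype=float)
    rise = band_energy_db[step:] - band_energy_db[:-step]
    return np.flatnonzero(rise > threshold_db) + step
```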

204 citations


Proceedings ArticleDOI
Jay G. Wilpon1, C.N. Jacobsen1
07 May 1996
TL;DR: A thorough investigation into the performance of current automatic speech recognition technology with children and the elderly, using a connected digit recognizer and a major telephone speech database, finds that sufficient, representative training data yields good recognition only for speakers aged roughly 15 to 70; outside this range, error rates increase dramatically.
Abstract: Although children and the elderly have obvious needs for voice operated interfaces, hardly anything is known about the performance of current automatic speech recognition technology with these people. In this paper we report the results of a thorough investigation into this field using a connected digit recognizer and a major telephone speech database. One would generally assume that the recognition of speech from these people would only be a matter of having enough, sufficiently representative training data. This turns out to be true only as long as the speakers belong to the age range of 15 to approximately 70. Outside this range the error rates increase dramatically, even with balanced amounts of training data. For males, the lower limit is very sharp and can be attributed to the change of pitch frequency during puberty. For females, the lower limit is gradual and caused only by the slowly changing dimensions of the vocal tract. For both genders, the upper limit is very gradual and can possibly be attributed to changes in the glottis area and the internal control loops of the human articulatory system. The paper presents some supporting evidence for the above assertions and gives results for various attempts to improve the performance. Recognition of speech from children and the elderly will require much more research if we are to fully understand the effects of these age groups' characteristics on current and future speech recognition systems.

204 citations


Patent
Yasunaga Miyazawa1, Mitsuhiro Inazumi1, Hiroshi Hasegawa1, Isao Edatsune1, Osamu Urano1 
TL;DR: In this article, a speaker specific and non-speaker specific method and apparatus is provided for enabling speech-based remote control and for recognizing the speech of an unspecified speaker at extremely high recognition rates regardless of the speaker's age, sex, or individual speech mannerisms.
Abstract: Bifurcated speaker specific and non-speaker specific method and apparatus is provided for enabling speech-based remote control and for recognizing the speech of an unspecified speaker at extremely high recognition rates regardless of the speaker's age, sex, or individual speech mannerisms. A device main unit is provided with a speech recognition processor for recognizing speech and taking an appropriate action, and with a user terminal containing specific speaker capture and/or preprocessing capabilities. The user terminal exchanges data with the speech recognition processor using radio transmission. The user terminal may be provided with a conversion rule generator that compares the speech of a user with previously compiled standard speech feature data and, based on this comparison result, generates a conversion rule for converting the speaker's speech feature parameters to corresponding standard speaker's feature information. The speech recognition processor, in turn, may reference the conversion rule developed in the user terminal and perform speech recognition based on the input speech feature parameters that have been converted above.
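The patent leaves the form of the conversion rule open; one plausible hedged reading is an affine feature map fitted by least squares over frame-aligned data, sketched below. The affine form is my assumption, not the patent's.

```python
import numpy as np

def fit_conversion_rule(user_feats, standard_feats):
    """Fit a rule converting a user's speech feature vectors toward the
    standard speaker's feature data. Both inputs: shape (n_frames, n_dims),
    frame-aligned."""
    X = np.hstack([user_feats, np.ones((len(user_feats), 1))])  # add bias term
    W, *_ = np.linalg.lstsq(X, standard_feats, rcond=None)
    return W  # apply with: np.hstack([feats, ones]) @ W
```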

178 citations


Patent
Caroline G. Henton1
TL;DR: In this article, the authors proposed a formalized aliasing approach to improve the quality of electronic speech synthesis using pre-recorded segments of speech to fill in for other missing segments.
Abstract: The present invention improves upon electronic speech synthesis using pre-recorded segments of speech to fill in for other missing segments of speech. The formalized aliasing approach of the present invention overcomes the ad hoc aliasing approach of the prior art which oftentimes generated less than satisfactory speech synthesis sound output. By formalizing the relationship between missing speech sound samples and available speech sound samples, the present invention provides a structured approach to aliasing which results in improved synthetic speech sound quality. Further, the formalized aliasing approach of the present invention can be used to lessen storage requirements for speech sound samples by only storing as many sound samples as memory capacity can support.

177 citations


Proceedings ArticleDOI
03 Oct 1996
TL;DR: A new highly parallel approach to automatic recognition of speech, inspired by early Fetcher's research on articulation index, and based on independent probability estimates in several sub-bands of the available speech spectrum, is presented.
Abstract: A new highly parallel approach to automatic recognition of speech, inspired by early Fetcher's research on articulation index, and based on independent probability estimates in several sub-bands of the available speech spectrum, is presented. The approach is especially suitable for situations when part of the spectrum of speech is computed. In such cases, it can yield an order-of-magnitude improvement in the error rate over a conventional full-band recognizer.

PatentDOI
Yasunaga Miyazawa1, Mitsuhiro Inazumi1, Hiroshi Hasegawa1, Isao Edatsune1, Osamu Urano1 
TL;DR: Techniques for implementing adaptable voice activation operations for interactive speech recognition devices and instruments include tailoring the volume level of the synthesized voice response according to the perceived volume level as detected by the input sound signal power detector.
Abstract: Techniques for implementing adaptable voice activation operations for interactive speech recognition devices and instruments. Specifically, such speech recognition devices and instruments include an input sound signal power or volume detector in communication with a central CPU for bringing the CPU out of an initial sleep state upon detection of perceived voice that exceeds a predetermined threshold volume level and is continuously perceived for at least a certain period of time. If both these conditions are satisfied, the CPU is transitioned into an active mode so that the perceived voice can be analyzed against a set of registered key words to determine if a "power on" command or similar instruction has been received. If so, the CPU maintains an active state and normal speech recognition processing ensues until a "power off" command is received. However, if the perceived and analyzed voice cannot be recognized, it is deemed to be background noise and the minimum threshold is selectively updated to accommodate the volume level of the perceived but unrecognized voice. Other aspects include tailoring the volume level of the synthesized voice response according to the perceived volume level as detected by the input sound signal power detector, as well as modifying audible response volume in accordance with updated volume threshold levels.
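A compact sketch of the wake-up logic as described (power threshold plus minimum duration, with threshold adaptation when recognition fails); the specific averaging update rule and all parameter names are assumptions.

```python
def wake_on_voice(frame_powers, threshold, min_frames, spot_keyword):
    """Sleep/wake logic: wake the CPU only when frame power exceeds
    `threshold` for `min_frames` consecutive frames, then check for a
    registered key word; if none is recognized, treat the audio as
    background noise and raise the threshold toward its level."""
    run = 0
    for power in frame_powers:
        run = run + 1 if power > threshold else 0
        if run >= min_frames:
            if spot_keyword():            # e.g. "power on" recognized
                return True, threshold    # stay active for normal recognition
            threshold = 0.5 * (threshold + power)  # adapt toward noise level
            run = 0
    return False, threshold
```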

Patent
29 Feb 1996
TL;DR: In this paper, a call-placement system for telephone services in response to speech is described, which allows a customer to place a call by speaking a person's name which serves as a destination identifier without having to speak an additional command or steering word.
Abstract: Methods and apparatus for activating telephone services in response to speech are described. A directory including names is maintained for each customer. A speaker dependent speech template and a telephone number for each name, is maintained as part of each customer's directory. Speaker independent speech templates are used for recognizing commands. The present invention has the advantage of permitting a customer to place a call by speaking a person's name which serves as a destination identifier without having to speak an additional command or steering word to place the call. This is achieved by treating the receipt of a spoken name in the absence of a command as an implicit command to place a call. Explicit speaker independent commands are used to invoke features or services other than call placement. Speaker independent and speaker dependent speech recognition are performed on a customer's speech in parallel. An arbiter is used to decide which function or service should be performed when an apparent conflict arises as a result of both the speaker dependent and speaker independent speech recognition step outputs. Stochastic grammars, word spotting and/or out-of-vocabulary rejection are used as part of the speech recognition process to provide a user friendly interface which permits the use of spontaneous speech. Voice verification is performed on a selective basis where security is of concern.
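A hedged sketch of the arbiter logic the patent describes, treating a recognized name with no command as an implicit call; the (word, score) result format and the score-based tie-break are assumptions.

```python
def arbitrate(sd_name, si_command):
    """Arbiter for the parallel recognizers: an explicit speaker-independent
    command wins; a speaker-dependent name with no stronger command is
    treated as an implicit 'call' instruction. Each argument is either None
    or a (word, score) tuple."""
    if si_command and (not sd_name or si_command[1] >= sd_name[1]):
        return ("command", si_command[0])
    if sd_name:
        return ("call", sd_name[0])   # spoken name alone places the call
    return ("reject", None)           # out-of-vocabulary input
```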

Proceedings ArticleDOI
01 Jan 1996
TL;DR: The results of a study examining the effects speech coders have on speech recognition are presented, including the effects on speech recognition performance of tandeming each of the speech coders.
Abstract: Speech coders with bitrates as low as 2.4 kbits/s are now being developed for speech transmission in the telecommunications industry. For speech coders to work at this reduced bitrate, some speech information has to be removed, and it is only natural to expect that the performance of speech recognition systems will deteriorate when coded speech is applied as input to a recognition system. The results of a study to examine the effects speech coders have on speech recognition are presented. Six different speech coders ranging from 4.8 kbits/s to 40 kbits/s are used with two different speech recognition systems: (1) isolated word recognition and (2) phoneme recognition from continuous speech. The effects on speech recognition performance of tandeming each of the speech coders are also presented.

PatentDOI
TL;DR: In this paper, the authors proposed a voice activity detection device in which an input speech signal (x(n)) is divided in subsignals representing specific frequency bands and noise (N(s)) is estimated in the subsignal.
Abstract: The invention concerns a voice activity detection device in which an input speech signal (x(n)) is divided into subsignals (S(s)) representing specific frequency bands and noise (N(s)) is estimated in the subsignals. On the basis of the estimated noise in the subsignals, subdecision signals (SNR(s)) are generated, and a voice activity decision (Vind) for the input speech signal is formed on the basis of the subdecision signals. Spectrum components of the input speech signal and a noise estimate are calculated and compared. More specifically, a signal-to-noise ratio is calculated for each subsignal, and each signal-to-noise ratio represents a subdecision signal (SNR(s)). From the signal-to-noise ratios a value proportional to their sum is calculated and compared with a threshold value, and a voice activity decision signal (Vind) for the input speech signal is formed on the basis of the comparison.
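The decision logic maps directly onto a few lines of code; this sketch assumes an FFT power spectrum with per-band bin ranges, and the dB scaling is an implementation choice rather than the patent's specification.

```python
import numpy as np

def vad_decision(power_spectrum, band_edges, noise_power, threshold):
    """Sub-band VAD in the spirit of the patent: per-band SNRs SNR(s) act as
    the sub-decision signals, and the voice activity decision Vind compares
    a value proportional to their sum with a threshold.

    band_edges: FFT-bin boundaries of the frequency bands.
    noise_power: running per-band noise power estimates N(s)."""
    n_bands = len(band_edges) - 1
    snr = np.empty(n_bands)
    for s in range(n_bands):
        band = power_spectrum[band_edges[s]:band_edges[s + 1]]
        snr[s] = 10.0 * np.log10(band.sum() / noise_power[s] + 1e-12)
    vind = snr.sum() > threshold
    return vind, snr
```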

Book ChapterDOI
15 Apr 1996
TL;DR: Tests on small isolated-word vocabularies using a dynamic time warping based audio-visual recogniser demonstrate that real-time, contour-based lip tracking can be used to supplement acoustic-only speech recognisers enabling robust recognition of speech in the presence of acoustic noise.
Abstract: Developments in dynamic contour tracking permit sparse representation of the outlines of moving contours. Given the increasing computing power of general-purpose workstations it is now possible to track human faces and parts of faces in real-time without special hardware. This paper describes a real-time lip tracker that uses a Kalman filter based dynamic contour to track the outline of the lips. Two alternative lip trackers, one that tracks lips from a profile view and the other from a frontal view, were developed to extract visual speech recognition features from the lip contour. In both cases, visual features have been incorporated into an acoustic automatic speech recogniser. Tests on small isolated-word vocabularies using a dynamic time warping based audio-visual recogniser demonstrate that real-time, contour-based lip tracking can be used to supplement acoustic-only speech recognisers enabling robust recognition of speech in the presence of acoustic noise.
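The tracker's estimator is a standard Kalman filter; a generic predict/update step is sketched below, where x would stack the lip-contour control-point states and z the measured image positions. The matrix choices are the caller's, not the paper's.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a Kalman filter: F/H are the state and
    measurement models, Q/R the process and measurement noise covariances."""
    x = F @ x                        # predict state
    P = F @ P @ F.T + Q              # predict covariance
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + K @ (z - H @ x)          # update with measurement residual
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```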

PatentDOI
TL;DR: In this paper, a lexical network is structured to include Phonetic Constraint Nodes which organize the interword phonetic information in the network, and Word Class Nodes that organize the syntactic/semantic information in network.
Abstract: The vocabulary of a large-vocabulary speech recognition system is structured to effectuate the rapid and efficient addition of words to a lexical network storing the vocabulary. The lexical network is structured to include Phonetic Constraint Nodes, which organize the inter-word phonetic information in the network, and Word Class Nodes, which organize the syntactic/semantic information in the network. Network fragments, corresponding to phoneme pronunciations and labeled to specify permitted interconnections, are precompiled to facilitate the rapid generation of pronunciations for new words and thereby enable the rapid addition of words to the vocabulary even during speech recognition. Different language models and different vocabularies for different portions of a discourse are invoked depending in part on the discourse history.

Proceedings ArticleDOI
07 May 1996
TL;DR: A measure for evaluating the effectiveness of a post-classifier which estimates confidence-measures is defined, and the development of aPost- classifier for words decoded from the SWITCHBOARD database is described, which uses statistics derived from a Viterbi decoder.
Abstract: There is increasing interest in systems which attempt to automate a task or a transaction using speech input and output. To function effectively with imperfect speech recognition, such systems require an estimate of which words in the output from the recogniser are likely to be correct and which can probably be disregarded as incorrect, i.e. a confidence-measure for each decoded word. We define a measure for evaluating the effectiveness of a post-classifier which estimates confidence-measures, and describe the development of a post-classifier for words decoded from the SWITCHBOARD database, which uses statistics derived from a Viterbi decoder. Without any grouping of the decoded word-classes, the post-classifier increased the probability of deciding whether a decoded word was correct or incorrect by 32%. When grouping was used, longer words showed an improvement of 65%.
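One plausible form for such a post-classifier (not necessarily the paper's) is a logistic model over per-word decoder statistics:

```python
import numpy as np

def word_confidence(decoder_stats, w, b):
    """Map per-word statistics from the Viterbi decoder (e.g. normalized
    acoustic score, word duration, language-model score) to a confidence in
    [0, 1]; the feature list and the logistic form are illustrative."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, decoder_stats) + b)))
```

A decoded word whose confidence falls below a task-dependent threshold would then be disregarded as probably incorrect.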

Patent
30 May 1996
TL;DR: In this article, a visual feedback aid, for a computer system performing speech recognition functions, provides indications on a display monitor of the system representing the current state of operation of a system microphone, the current mode of operation in respect to speech, and a string of text representing the system's recognition (correct or incorrect) of commands instantly spoken into the microphone.
Abstract: A visual feedback aid, for a computer system performing speech recognition functions, provides indications on a display monitor of the system representing the current state of operation of a system microphone, the current mode of operation of the system in respect to speech, and a string of text representing the system's recognition (correct or incorrect) of commands instantly spoken into the microphone. The indications preferably are located in a reserved area of a display window associated with a currently active application involving speech recognition. The reserved area preferably would be a prominent one, such as the application title bar.

PatentDOI
Chi Wong1
TL;DR: In this article, a method of reducing the perplexity of a speech recognition vocabulary and dynamically selecting speech recognition acoustic model sets used in a simulated telephone operator apparatus is presented, where the directory of users of the telephone network is subdivided into subsets wherein each subset contains the names of users within a certain location or exchange.
Abstract: A method of reducing the perplexity of a speech recognition vocabulary and dynamically selecting speech recognition acoustic model sets used in a simulated telephone operator apparatus. The directory of users of the telephone network is subdivided into subsets wherein each subset contains the names of users within a certain location or exchange. A speech recognition vocabulary database is compiled for each subset and the appropriate database is loaded into the speech recognition apparatus in response to a requested call to the location covered by the subset. Furthermore, a site-specific acoustic model set is dynamically loaded according to the location of a calling party. An apparatus for carrying out the method is also discussed.

PatentDOI
TL;DR: In this article, a barge-in detector for use in connection with a speech recognition system forms a prompt replica for detecting the presence or absence of user input to the system, which is indicative of the prompt energy applied to an input of the system.
Abstract: A barge-in detector for use in connection with a speech recognition system forms a prompt replica for use in detecting the presence or absence of user input to the system. The replica is indicative of the prompt energy applied to an input of the system. The detector detects the application of user input to the system, even if concurrent with a prompt, and enables the system to quickly respond to the user input.
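A minimal sketch of the detection test, assuming per-frame energy estimates for the microphone signal and the prompt replica; the 6 dB margin is an invented default, not a figure from the patent.

```python
import math

def barge_in(mic_power, replica_power, margin_db=6.0):
    """Flag user input when microphone energy exceeds what the prompt
    replica alone explains by some margin, so speech is detected even
    while the prompt is still playing."""
    excess = 10.0 * math.log10((mic_power + 1e-12) / (replica_power + 1e-12))
    return excess > margin_db
```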

PatentDOI
Edatsune Isao1
TL;DR: The invention improves recognition rates by providing an interactive speech recognition device that performs recognition by taking situational and environmental changes into consideration, thus enabling interactions that correspond to situational and environmental changes.
Abstract: The invention improves recognition rates by providing an interactive speech recognition device that performs recognition by taking situational and environmental changes into consideration, thus enabling interactions that correspond to situational and environmental changes. The invention comprises a speech analysis unit that creates a speech data pattern corresponding to the input speech; a timing circuit for generating time data, for example, as variable data; a coefficient setting unit receiving the time data from the timing circuit and generating weighting coefficients that change over time, in correspondence to the content of each recognition target speech; a speech recognition unit that receives the speech data pattern of the input speech from the speech analysis unit, and that at the same time obtains a weighting coefficient in effect for a pre-registered recognition target speech at the time from the coefficient setting unit, that computes final recognition data by multiplying the recognition data corresponding to each recognition target speech by its corresponding weighting coefficient, and that recognizes the input speech based on the computed final recognition result; a speech synthesis unit for outputting speech synthesis data based on the recognition data that takes the weighting coefficient into consideration; and a drive control unit for transmitting the output from the speech synthesis unit to the outside.
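The weighting scheme can be sketched as a product of raw match scores and time-indexed coefficients; indexing by hour of day is an assumption, since the patent only says the coefficients change over time.

```python
def recognize_with_time_weights(match_scores, hour, weights_by_hour):
    """Final recognition per the patent: each candidate's raw match score is
    multiplied by its time-varying weighting coefficient, and the best
    product wins. match_scores: dict of candidate phrase -> raw score."""
    weights = weights_by_hour[hour]
    return max(match_scores,
               key=lambda word: match_scores[word] * weights.get(word, 1.0))

# e.g. "good morning" is weighted up at 8 a.m.:
# recognize_with_time_weights({"good morning": 0.7, "good night": 0.8},
#                             8, {8: {"good morning": 1.3}})
```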

Proceedings ArticleDOI
07 May 1996
TL;DR: This work provides evidence for the claim that a modern continuous speech recognizer can be used successfully in "black-box" fashion for robustly interpreting spontaneous utterances in a dialogue with a human.
Abstract: This paper presents a new technique for overcoming several types of speech recognition errors by post-processing the output of a continuous speech recognizer. The post-processor output contains fewer errors, thereby making interpretation by higher-level modules, such as a parser, in a speech understanding system more reliable. The primary advantage to the post-processing approach over existing approaches for overcoming SR errors lies in its ability to introduce options that are not available in the SR module's output. This work provides evidence for the claim that a modern continuous speech recognizer can be used successfully in "black-box" fashion for robustly interpreting spontaneous utterances in a dialogue with a human.

Proceedings Article
01 Jan 1996
TL;DR: It is argued that major advances in speaker verification, speech recognition, and natural-sounding speech synthesis depend on increases in knowledge of the mechanisms underlying voice and speech production under emotional arousal or other attitudinal states, as well as on a more adequate understanding of listener decoding of affect from vocal quality.
Abstract: This introduction to a special session on “Emotion in recognition and synthesis” highlights the need to understand the effects of affective speaker states on voice and speech on a psychophysiological level. It is argued that major advances in speaker verification, speech recognition, and natural-sounding speech synthesis depend on increases in our knowledge of the mechanisms underlying voice and speech production under emotional arousal or other attitudinal states, as well as on a more adequate understanding of listener decoding of affect from vocal quality. A brief review of the current state of the art is provided.

Journal ArticleDOI
01 Sep 1996
TL;DR: The position that understanding of speech production and perception impacts methods of speech processing, and vice versa, is enunciated, focusing on how modern time-frequency signal analysis methods could help expedite needed advances in these areas.
Abstract: Modern speech processing research may be categorized into three broad areas: statistical, physiological, and perceptual. Statistical research investigates the nature of the variability of the speech waveform from a signal processing viewpoint. This approach relates to the processing of speech in order to obtain measurements of speech characteristics which demonstrate manageable variabilities across a wide range of the talker population, in the presence of noise or competing speakers, under the interaction of speech with the channel through which it is transmitted, and under the inherent interaction of the information content of speech itself (i.e., the contextual factor). Physiological research aims at constructing accurate models of the articulatory and auditory processes, helping to limit the signal space for speech processing. In the perceptual realm, work focuses on understanding the psychoacoustic and possibly the psycholinguistic aspects of the speech communication process that the human so conveniently conducts. By studying this working analysis/recognition system, insights may be garnered that will lead to improved methods of speech processing. Conversely, by studying the limitations of this system, particularly how it reduces the information rate of the received signal through, for example, masking and adaptation, improvements may be made in the efficiency of speech coding schemes without impacting the quality of the reconstructed speech. Thus, comprehension of speech production and perception impacts methods of speech processing, and vice versa. This paper enunciates such a position, focusing on how modern time-frequency signal analysis methods could help expedite needed advances in these areas.

Patent
25 Jun 1996
TL;DR: In this paper, the authors proposed a voice over data function that directly encodes digitized voice samples onto the carrier using quadrature amplitude modulation to transmit multiple bits of the voice sample for every baud.
Abstract: The voice over data component of a personal communications system enables the operator to simultaneously transmit voice and data communication to a remote site. This voice over data function directly encodes digitized voice samples onto the carrier using quadrature amplitude modulation to transmit multiple bits of the voice sample for every baud. The system also allocates selected bauds of the carrier to voice and to data so the voice over data may be transmitted using the same allocated bandwidth. The system may also dynamically reallocate the bandwidth over the telephone line depending on the demands of the voice grade digitized signal.

01 Jan 1996
TL;DR: This paper introduces the basic framework of a statistical structure that can accommodate multiple (asynchronous) observation streams (possibly exhibiting different frame rates) and will then be applied to the particular case of multi-band speech recognition and will be shown to yield significantly better noise robustness.
Abstract: In this paper, we discuss a new automatic speech recognition (ASR) approach based on independent processing and recombination of several feature streams. In this framework, it is assumed that the speech signal is represented in terms of multiple input streams, each input stream representing a different characteristic of the signal. If the streams are entirely synchronous, they may be accommodated simply (as they usually are in state-of-the-art systems). However, as discussed in the paper, it may be required to permit some degree of asynchrony between streams. This paper introduces the basic framework of a statistical structure that can accommodate multiple (asynchronous) observation streams (possibly exhibiting different frame rates). This approach will then be applied to the particular case of multi-band speech recognition and will be shown to yield significantly better noise robustness.
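One hedged reading of the unit-level recombination, in which per-stream frame rates are normalized away by averaging within the unit before a log-linear combination; this is an illustration of the framework, not the paper's exact formulation.

```python
import numpy as np

def unit_score(stream_log_posts, exponents):
    """Score one speech unit from asynchronous streams: each stream carries
    log-probabilities at its own frame rate, so averaging within the unit
    gives one rate-independent score per stream, then a weighted log-linear
    recombination with per-stream exponents."""
    return sum(e * float(np.mean(lp))
               for e, lp in zip(exponents, stream_log_posts))
```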

PatentDOI
TL;DR: A speech system includes a speech encoding system, which generates a data signal containing the speech segment IDs and the values of the corresponding prosodic parameters, and a speech decoding system that synthesizes speech from that signal.
Abstract: A speech system includes a speech encoding system and a speech decoding system. The speech encoding system includes a speech analyzer for identifying each of the speech segments (i.e., phonemes) in the received digitized speech signal. A pitch detector, a duration detector, and an amplitude detector are each coupled to the memory and the analyzer and detect various prosodic parameters of each received speech segment. A speech encoder generates a data signal that includes the speech segment IDs and the values of the corresponding prosodic parameters. The speech decoding system includes a digital data decoder and a speech synthesizer for generating a speech signal based on the segment IDs and prosodic parameter values.
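The data signal's per-segment payload can be sketched as a simple record; the field names and units below are mine, not the patent's.

```python
from dataclasses import dataclass

@dataclass
class EncodedSegment:
    """One entry in the transmitted data signal: a speech segment (phoneme)
    ID plus the prosodic values measured by the pitch, duration, and
    amplitude detectors."""
    segment_id: int
    pitch_hz: float
    duration_ms: float
    amplitude: float
```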

PatentDOI
TL;DR: In this article, a transcoder (TRCU1, TRCU2) was proposed for preventing tandem coding of speech in a mobile to mobile (MS1, MS2) call within a mobile communication system which employs a speech coding method for reducing transmission rate on the radio path.
Abstract: The invention relates to a transcoder (TRCU1, TRCU2) having means for preventing tandem coding of speech in a mobile-to-mobile (MS1, MS2) call within a mobile communication system which employs a speech coding method for reducing the transmission rate on the radio path. The transcoder (TRCU1, TRCU2) comprises a speech coder (52, 73) which encodes the speech signal into speech parameters for transmission to a mobile station, and decodes the speech parameters received from the mobile station into a speech signal according to said speech coding method, as well as a PCM coder (54, 72) for transmitting an uplink speech signal to and for receiving a downlink speech signal from a PCM interface in the form of PCM speech samples. In addition to the normal operation, the transcoder transmits and receives speech parameters through the PCM interface in a subchannel formed by the least significant bits of the PCM speech samples. Thus, it is possible to prevent tandem coding while maintaining the standard PCM interface and the signalling and services associated with it.
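The LSB subchannel is straightforward to sketch; the framing of the parameter bit stream is an assumption, as is the single-bit default.

```python
def embed_in_lsb_subchannel(pcm_samples, param_bits, n_lsb=1):
    """Carry coded speech parameters across the PCM interface in a
    subchannel formed by the n_lsb least significant bits of each PCM
    sample, as the patent describes."""
    mask = (1 << n_lsb) - 1
    bit_iter = iter(param_bits)
    out = []
    for sample in pcm_samples:
        sub = 0
        for i in range(n_lsb):
            sub |= next(bit_iter, 0) << i   # pad with zeros when bits run out
        out.append((sample & ~mask) | sub)  # overwrite the low bits
    return out
```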

Proceedings ArticleDOI
03 Oct 1996
TL;DR: It is demonstrated that some improvements in word accuracy result from augmenting the channel model with an account of word fertility in the channel, and that a modem continuous speech recognizer can be used in "black-box" fashion for robustly recognizing speech for which the recognizer was not originally trained.
Abstract: The authors have implemented a post-processor called SPEECHPP to correct word-level errors committed by an arbitrary speech recognizer. Applying a noisy-channel model, SPEECHPP uses a Viterbi beam-search that employs language and channel models. Previous work demonstrated that a simple word-for-word channel model was sufficient to yield substantial increases in word accuracy. The paper demonstrates that some improvements in word accuracy result from augmenting the channel model with an account of word fertility in the channel. The work further demonstrates that a modem continuous speech recognizer can be used in "black-box" fashion for robustly recognizing speech for which the recognizer was not originally trained. The work also demonstrates that in the case where the recognizer can be tuned to the new task, environment, or speaker, the post-processor can also contribute to performance improvements.