
Showing papers on "Voice activity detection" published in 1994


Journal ArticleDOI
01 Oct 1994
TL;DR: The objective of this paper is to provide a tutorial overview of speech coding methodologies with emphasis on those algorithms that are part of the recent low-rate standards for cellular communications.
Abstract: The past decade has witnessed substantial progress towards the application of low-rate speech coders to civilian and military communications as well as computer-related voice applications. Central to this progress has been the development of new speech coders capable of producing high-quality speech at low data rates. Most of these coders incorporate mechanisms to: represent the spectral properties of speech, provide for speech waveform matching, and "optimize" the coder's performance for the human ear. A number of these coders have already been adopted in national and international cellular telephony standards. The objective of this paper is to provide a tutorial overview of speech coding methodologies with emphasis on those algorithms that are part of the recent low-rate standards for cellular communications. Although the emphasis is on the new low-rate coders, we attempt to provide a comprehensive survey by covering some of the traditional methodologies as well. We feel that this approach will not only point out key references but will also provide valuable background to the beginner. The paper starts with a historical perspective and continues with a brief discussion on the speech properties and performance measures. We then proceed with descriptions of waveform coders, sinusoidal transform coders, linear predictive vocoders, and analysis-by-synthesis linear predictive coders. Finally, we present concluding remarks followed by a discussion of opportunities for future research.

461 citations


PatentDOI
TL;DR: In this article, a distributed voice recognition system includes a digital signal processor (DSP), a nonvolatile storage medium (108), and a microprocessor (106); the DSP is configured to extract parameters from digitized input speech samples and provide the extracted parameters to the microprocessor.
Abstract: A distributed voice recognition system includes a digital signal processor (DSP) (104), a nonvolatile storage medium (108), and a microprocessor (106). The DSP (104) is configured to extract parameters from digitized input speech samples and provide the extracted parameters to the microprocessor (106). The nonvolatile storage medium contains a database of speech templates. The microprocessor is configured to read the contents of the nonvolatile storage medium (108), compare the parameters with the contents, and select a speech template based upon the comparison. The nonvolatile storage medium may be a flash memory. The DSP (104) may be a vocoder. If the DSP (104) is a vocoder, the parameters may be diagnostic data generated by the vocoder. The distributed voice recognition system may reside on an application specific integrated circuit (ASIC).
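
A minimal sketch of the DSP/microprocessor split described above, with hypothetical function names; the frame log-energy and zero-crossing features and the nearest-template distance are illustrative choices, not taken from the patent:

```python
import numpy as np

def extract_params(samples: np.ndarray, frame_len: int = 160) -> np.ndarray:
    """DSP side: reduce raw samples to per-frame feature vectors."""
    n_frames = len(samples) // frame_len
    feats = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = np.log(np.sum(frame ** 2) + 1e-9)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        feats.append([energy, zcr])
    return np.array(feats)

def match_template(feats: np.ndarray, templates: dict) -> str:
    """Microprocessor side: pick the stored template nearest the input."""
    def dist(a, b):
        n = min(len(a), len(b))   # crude alignment; real systems use DTW
        return np.mean((a[:n] - b[:n]) ** 2)
    return min(templates, key=lambda name: dist(feats, templates[name]))
```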

361 citations


PatentDOI
TL;DR: In this article, a method and system for synthesizing speech utilizing a periodic waveform decomposition and relocation coding scheme was proposed, where signals of voiced sound interval among original speech are decomposed into wavelets, each of which corresponds to a speech waveform for one period made by each glottal pulse.
Abstract: The present invention relates to a method and system for synthesizing speech utilizing a periodic waveform decomposition and relocation coding scheme. According to the scheme, signals of the voiced sound intervals of the original speech are decomposed into wavelets, each of which corresponds to the speech waveform for one period made by each glottal pulse. These wavelets are respectively coded and stored. The wavelets nearest to the positions where wavelets are to be located are selected from the stored wavelets and decoded. The decoded wavelets are superposed on each other such that the original sound quality is maintained while the duration and pitch frequency of the speech segment can be controlled arbitrarily.
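
The decompose-and-relocate idea can be sketched as follows, assuming the glottal pulse positions are already known and skipping the coding step; only the nearest-wavelet selection and superposition come from the abstract, the rest is illustrative:

```python
import numpy as np

def decompose(speech: np.ndarray, pulse_positions: list) -> list:
    """One wavelet per pitch period, bounded by consecutive glottal pulses."""
    return [speech[a:b] for a, b in zip(pulse_positions[:-1], pulse_positions[1:])]

def relocate(wavelets: list, src_positions: list, new_positions: list, out_len: int):
    """Place the nearest-in-time stored wavelet at each new pitch mark and add,
    so pitch (mark spacing) and duration (mark count) are controlled freely."""
    out = np.zeros(out_len)
    src = np.array(src_positions[:-1])        # start position of each wavelet
    for pos in new_positions:
        w = wavelets[int(np.argmin(np.abs(src - pos)))]
        end = min(pos + len(w), out_len)
        out[pos:end] += w[:end - pos]
    return out
```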

224 citations


Book ChapterDOI
Oded Ghitza
TL;DR: A state-of-the-art auditory model that simulates, in considerable detail, the outer parts of the auditory periphery up through the auditory nerve level is described and preliminary experimental results that confirm human usage of such integration are discussed, with different integration rules for different time-frequency regions depending on the phoneme-discrimination task.
Abstract: Auditory models that are capable of achieving human performance in tasks related to speech perception would provide a basis for realizing effective speech processing systems. Saving bits in speech coders, for example, relies on a perceptual tolerance to acoustic deviations from the original speech. Perceptual invariance to adverse signal conditions (noise, microphone and channel distortions, room reverberations) and to phonemic variability (due to nonuniqueness of articulatory gestures) may provide a basis for robust speech recognition. A state-of-the-art auditory model that simulates, in considerable detail, the outer parts of the auditory periphery up through the auditory nerve level is described. Speech information is extracted from the simulated auditory nerve firings, and used in place of the conventional input to several speech coding and recognition systems. The performance of these systems improves as a result of this replacement, but is still short of achieving human performance. The shortcomings occur, in particular, in tasks related to low bit-rate coding and to speech recognition. Since schemes for low bit-rate coding rely on signal manipulations that spread over durations of several tens of ms, and since schemes for speech recognition rely on phonemic/articulatory information that extends over similar time intervals, it is concluded that the shortcomings are due mainly to perceptually related rules over durations of 50-100 ms. These observations suggest a need for a study aimed at understanding how auditory nerve activity is integrated over time intervals of that duration. The author discusses preliminary experimental results that confirm human usage of such integration, with different integration rules for different time-frequency regions depending on the phoneme-discrimination task.

192 citations


Patent
21 Jan 1994
Abstract: A signal processing arrangement uses a codebook of first vector quantized speech feature signals formed responsive to a large collection of speech feature signals. The codebook is altered by combining the first speech feature signals of the codebook with second speech feature signals generated responsive to later input speech patterns during normal speech processing. A speaker recognition template can be updated in this fashion to take account of change which may occur in the voice and speaking characteristics of a known speaker.
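
One plausible reading of the codebook update, sketched as a convex blend of stored codewords with newly observed feature vectors; the blending weight alpha and nearest-codeword assignment are assumptions, not taken from the patent:

```python
import numpy as np

def update_codebook(codebook: np.ndarray, new_feats: np.ndarray,
                    alpha: float = 0.1) -> np.ndarray:
    """Move each codeword toward the mean of the new features it captures,
    so the speaker template tracks gradual voice changes."""
    updated = codebook.copy()
    # assign each new feature vector to its nearest codeword
    dists = np.linalg.norm(new_feats[:, None, :] - codebook[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    for k in range(len(codebook)):
        assigned = new_feats[nearest == k]
        if len(assigned):
            updated[k] = (1 - alpha) * codebook[k] + alpha * assigned.mean(axis=0)
    return updated
```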

175 citations


Patent
16 Nov 1994
TL;DR: In this article, a shared time-division duplexing (STDD) scheme is proposed to allow both uplink and downlink voice traffic to share a common channel by dynamically allocating time slots in the common information channel.
Abstract: A low delay multiple access scheme called Shared Time-Division Duplexing (STDD) allows both uplink and downlink voice traffic to share a common channel. The scheme contains separate uplink and downlink control channels and a common voice information channel. The control channels comprise means for signalling voice requirements and acknowledgements of the time slot allocation. Using speech activity detection, speech packets are generated for transmission only during talk spurts. STDD dynamically allocates time slots in the common information channel, taking advantage of co-ordinated two-way conversations to achieve high statistical multiplexing gain and more efficient realization of the common information channel.
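
A toy model of the allocation step might look like the following; the frame structure, request format, and first-come ordering are assumptions — only the idea that active talk spurts from either direction compete for one shared slot pool comes from the abstract:

```python
def allocate_slots(requests, slots_per_frame):
    """requests: list of (conversation_id, direction, active) tuples.
    Grant slots only to parties currently in a talk spurt."""
    active = [(cid, d) for cid, d, is_active in requests if is_active]
    granted = active[:slots_per_frame]   # grant in request order
    dropped = active[slots_per_frame:]   # contention loss this frame
    return granted, dropped

# Example: three conversations; only active talk spurts compete for 4 slots.
reqs = [(1, "up", True), (1, "down", False), (2, "up", True),
        (2, "down", True), (3, "up", False), (3, "down", True)]
print(allocate_slots(reqs, slots_per_frame=4))
```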

164 citations


PatentDOI
TL;DR: A telephony channel simulation process is disclosed for training a speech recognizer to respond to speech obtained from telephone systems.
Abstract: A telephony channel simulation process is disclosed for training a speech recognizer to respond to speech obtained from telephone systems. An input speech data set, whose bandwidth is higher than telephone bandwidth, is provided to a speech recognition training processor. The process performs a series of alterations to the input speech data set to obtain a modified speech data set. The modified speech data set enables the speech recognition processor to perform speech recognition on voice signals from a telephone system.
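
The kind of alteration such a process might apply can be sketched as band-limiting plus additive line noise; the 300-3400 Hz passband, filter order, and SNR are typical telephony values, not the patent's:

```python
import numpy as np
from scipy.signal import butter, lfilter

def simulate_telephone(speech: np.ndarray, fs: int, snr_db: float = 30.0):
    """Narrow wideband training speech to a telephone-like channel.
    fs: sampling rate in Hz (must exceed 6800 for this passband)."""
    b, a = butter(4, [300 / (fs / 2), 3400 / (fs / 2)], btype="band")
    narrow = lfilter(b, a, speech)
    noise_power = np.mean(narrow ** 2) / (10 ** (snr_db / 10))
    return narrow + np.random.randn(len(narrow)) * np.sqrt(noise_power)
```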

159 citations


Patent
29 Nov 1994
TL;DR: In this article, a negotiation handshake protocol is described which enables the two sites to negotiate the compression rate based on factors such as the speed or data bandwidth of the communications connection between the two sites, the data demand between the sites, and the amount of silence detected in the speech signal.
Abstract: The present invention includes software and hardware components to enable digital data communication over standard telephone lines. The present invention converts analog voice signals to digital data, compresses that data and places the compressed speech data into packets for transfer over the telephone lines to a remote site. A voice control digital signal processor (DSP) operates to use one of a plurality of speech compression algorithms which produce a scaleable amount of compression. The rate of compression is inversely proportional to the quality of the speech the compression algorithm is able to reproduce: the higher the compression, the lower the reproduction quality. The selection of the rate of compression is dependent on such factors as the speed or data bandwidth of the communications connection between the two sites, the data demand between the sites, and the amount of silence detected in the speech signal. The voice compression rate is dynamically changed as the aforementioned factors change. A negotiation handshake protocol is described which enables the two sites to negotiate the compression rate based on such factors.
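
A hedged sketch of the rate-selection logic: pick the lowest-compression codec whose rate fits the bandwidth left after data demand, compressing harder when silence dominates. The rate table and thresholds are invented for illustration:

```python
CODEC_RATES_BPS = [64000, 32000, 16000, 9600, 4800]   # low -> high compression

def select_rate(link_bps: int, data_demand_bps: int, silence_ratio: float) -> int:
    budget = max(link_bps - data_demand_bps, 0)
    if silence_ratio > 0.5:            # mostly silence: compress harder
        budget = min(budget, 9600)
    for rate in CODEC_RATES_BPS:       # prefer the best quality that fits
        if rate <= budget:
            return rate
    return CODEC_RATES_BPS[-1]

# Example: 28.8 kbit/s link, half consumed by data -> 9600 bit/s codec.
print(select_rate(28800, 14400, silence_ratio=0.2))
```

Re-running the selection as the link speed, data demand, or silence ratio changes gives the dynamic rate switching the abstract describes.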

140 citations


Patent
21 Jun 1994
TL;DR: In this article, a digital signal processor for converting and compressing digital voice data into a voice data packet, a modem for maintaining communication over the PSTN and a micro-controller for managing the establishment and maintenance of concurrent voice and non-voice data communication.
Abstract: Full-duplex, concurrent voice and non-voice communication over the public switched telephone network (PSTN) is maintained by a communication interface apparatus. A voice only connection is established between two sites initially. Concurrent voice/non-voice connection then is established by pressing an engage button on the communication interface apparatus at each site. Voice communication is temporarily lost, while the connection changes from a phone-to-phone voice-only link to an interface-to-interface voice/non-voice data link. The communication interface apparatus includes a digital signal processor for converting and compressing digital voice data into a voice data packet, a modem for maintaining communication over the PSTN and a micro-controller for managing the establishment and maintenance of concurrent voice and non-voice data communication. The micro-controller monitors non-voice data availability to and from a local computer or gaming device, voice data availability to and from the digital signal processor and voice packet and non-voice data packet availability to and from the modem. In one embodiment, non-voice data is transmitted at a higher priority. Voice data is transmitted at a lower priority and buffered such that real-time performance is preserved. A compression algorithm and modem baud rate are used which enable voice data to fit within the available bandwidth left over from the non-voice data communication.

101 citations


Patent
14 Nov 1994
TL;DR: In this paper, a voice activated device using speaker independent speech recognition is capable of receiving from a remote location the phonetic spellings needed for speech recognition in the device, as well as additional application data.
Abstract: A voice activated device using speaker independent speech recognition is capable of receiving from a remote location the phonetic spellings needed for speech recognition in the device. The phonetic spellings, as well as additional application data, are communicated to the voice activated device from the remote location and stored in the device. A user can then speak voice commands which are intercepted by the device where local processing of the voice commands takes place. The device makes available to the user extensive information pertaining to multiple network services or applications. The information may be communicated to the user by voice or via other communication media.

101 citations


PatentDOI
TL;DR: A voice recognition system and method for training the same are provided wherein a first voice signal representing an instruction as well as a predetermined instruction signal corresponding to the first voice signal and identifying the instruction are input to the voice recognition system.
Abstract: A voice recognition system and method for training the same are provided wherein a first voice signal representing an instruction as well as a predetermined instruction signal corresponding to the first voice signal and identifying the instruction are input to the voice recognition system. The system processes the first voice signal based on the predetermined instruction signal to produce voice recognition data for use by the system in identifying the instruction based on a second voice signal representing the same instruction. The processor stores the voice recognition data for subsequent use upon receipt of the second voice signal and carries out the instruction in response to the predetermined instruction signal corresponding to the first voice signal.

Proceedings ArticleDOI
31 Oct 1994
TL;DR: A continuous optical automatic speech recognizer that uses optical information from the oral-cavity shadow of a speaker is described; it achieves a 25.3 percent recognition rate on sentences having a perplexity of 150 without using any syntactic, semantic, acoustic, or contextual guides.
Abstract: We describe a continuous optical automatic speech recognizer (OASR) that uses optical information from the oral-cavity shadow of a speaker. The system achieves a 25.3 percent recognition rate on sentences having a perplexity of 150 without using any syntactic, semantic, acoustic, or contextual guides. We introduce 13, mostly dynamic, oral-cavity features used for optical recognition, present phones that appear optically similar (visemes) for our speaker, and present the recognition results for our hidden Markov models (HMMs) using visemes, trisemes, and generalized trisemes. We conclude that future research is warranted for optical recognition, especially when combined with other input modalities.

Journal ArticleDOI
TL;DR: Speech recognition and speech synthesis are technologies of particular interest for their support of direct communication between humans and computers through a communications mode humans commonly use among themselves and at which they are highly skilled.
Abstract: Speech recognition and speech synthesis are technologies of particular interest for their support of direct communication between humans and computers through a communications mode humans commonly use among themselves and at which they are highly skilled. Both manipulate speech in terms of its information content; recognition transforms human speech into text to be used literally (e.g., for dictation) or interpreted as commands to control applications, and synthesis allows the generation of spoken utterances from text.

Proceedings ArticleDOI
19 Apr 1994
TL;DR: It is found that other factors beyond a mere decrease in bandwidth cause the observed degradation in recognition accuracy, and that the environmental compensation algorithms RASTA and CDCN fail to compensate completely for degradations introduced by the telephone network.
Abstract: We compare speech recognition accuracy for high-quality speech recorded under controlled conditions with speech as it appears over long-distance telephone lines. In addition to comparing recognition accuracy we use telephone-channel simulation to identify the sources of degradation of speech over telephone lines that have the greatest impact on speech recognition accuracy. We first compare the performance of the CMU SPHINX-I system on the TIMIT and NTIMIT databases. We found that other factors beyond a mere decrease in bandwidth cause the observed degradation in recognition accuracy, and that the environmental compensation algorithms RASTA and CDCN fail to compensate completely for degradations introduced by the telephone network. We identify the most problematic telephone-channel impairments using a commercial telephone channel simulator and the SPHINX-II system. Of the various effects considered, additive noise and linear filtering appear to have the greatest impact on recognition accuracy. Finally, we examined the performance of three cepstral compensation algorithms in the presence of the most damaging conditions. We found the compensation algorithms to be effective except for the worst 1% of the telephone channels.

Dissertation
01 Jan 1994
TL;DR: SpeechSkimmer as mentioned in this paper uses simple speech processing techniques to allow a user to hear recorded sounds quickly, and at several levels of detail, and provides continuous real-time control of the speed and detail level of the audio presentation.
Abstract: Listening to a speech recording is much more difficult than visually scanning a document because of the transient and temporal nature of audio. Audio recordings capture the richness of speech, yet it is difficult to directly browse the stored information. This dissertation investigates techniques for structuring, filtering, and presenting recorded speech, allowing a user to navigate and interactively find information in the audio domain. This research makes it easier and more efficient to listen to recorded speech by using the SpeechSkimmer system. First, this dissertation describes Hyperspeech, a speech-only hypermedia system that explores issues of speech user interfaces, browsing, and the use of speech as data in an environment without a visual display. The system uses speech recognition input and synthetic speech feedback to aid in navigating through a database of digitally recorded speech. This system illustrates that managing and moving in time are crucial in speech interfaces. Hyperspeech uses manually segmented and structured speech recordings--a technique that is practical only in limited domains. Second, to overcome the limitations of Hyperspeech while retaining browsing capabilities, a variety of speech analysis and user interface techniques are explored. This research exploits properties of spontaneous speech to automatically select and present salient audio segments in a time-efficient manner. Two speech processing technologies, time compression and adaptive speech detection (to find hesitations and pauses), are reviewed in detail with a focus on techniques applicable to extracting and displaying speech information. Finally, this dissertation describes SpeechSkimmer, a user interface for interactively skimming speech recordings. SpeechSkimmer uses simple speech processing techniques to allow a user to hear recorded sounds quickly, and at several levels of detail. User interaction, through a manual input device, provides continuous real-time control of the speed and detail level of the audio presentation. SpeechSkimmer incorporates time-compressed speech, pause removal, automatic emphasis detection, and non-speech audio feedback to reduce the time needed to listen. This dissertation presents a multi-level structural approach to auditory skimming, and user interface techniques for interacting with recorded speech. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
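
Two of the techniques named above, pause removal and time compression, can be sketched crudely as follows; simple energy thresholding stands in for adaptive speech detection, and frame dropping stands in for pitch-preserving time compression (SpeechSkimmer's actual methods, such as SOLA-style compression, are more refined):

```python
import numpy as np

def remove_pauses(x: np.ndarray, fs: int, frame_ms: int = 20,
                  thresh_db: float = -40.0) -> np.ndarray:
    """Drop frames whose energy falls far below the loudest frame."""
    n = int(fs * frame_ms / 1000)
    frames = [x[i:i + n] for i in range(0, len(x) - n, n)]
    ref = np.max([np.mean(f ** 2) for f in frames]) + 1e-12
    keep = [f for f in frames
            if 10 * np.log10(np.mean(f ** 2) / ref + 1e-12) > thresh_db]
    return np.concatenate(keep) if keep else x[:0]

def compress(x: np.ndarray, fs: int, factor: float = 1.5,
             frame_ms: int = 20) -> np.ndarray:
    """Keep one frame per `factor` input frames, shortening playback time."""
    n = int(fs * frame_ms / 1000)
    frames = [x[i:i + n] for i in range(0, len(x) - n, n)]
    kept, acc = [], 0.0
    for f in frames:
        acc += 1.0
        if acc >= factor:
            kept.append(f)
            acc -= factor
    return np.concatenate(kept) if kept else x[:0]
```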

Proceedings ArticleDOI
Stephan Euler, J. Zinke
19 Apr 1994
TL;DR: The authors use a Gaussian classifier for estimation of the coding condition of a test utterance and the combination of this classifier and coder specific word models yields a high overall recognition performance.
Abstract: Examines the influence of different coders in the range from 64 kbit/sec to 4.8 kbit/sec on both a speaker independent isolated word recognizer and a speaker verification system. Applying systems trained with 64 kbit/sec to, e.g., the 4.8 kbit/sec data increases the error rate of the word recognizer by a factor of three. For rates below 13 kbit/sec the speaker verification is more affected than the word recognition. The performance improves significantly if word models are provided for the individual coding conditions. Therefore, the authors use a Gaussian classifier for estimation of the coding condition of a test utterance. The combination of this classifier and coder specific word models yields a high overall recognition performance.
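
The coding-condition classifier might be sketched as one Gaussian per coder scored by maximum likelihood, after which the matching coder-specific word models would be selected; the single-Gaussian, diagonal-covariance form is an assumption:

```python
import numpy as np

class GaussianCoderClassifier:
    """One diagonal Gaussian per coding condition, maximum-likelihood decision."""

    def fit(self, feats_by_coder: dict) -> "GaussianCoderClassifier":
        """feats_by_coder: {coder_name: (n_vectors, dim) feature array}."""
        self.params = {name: (f.mean(axis=0), f.var(axis=0) + 1e-6)
                       for name, f in feats_by_coder.items()}
        return self

    def classify(self, utterance: np.ndarray) -> str:
        """utterance: (n_frames, dim) features of the test utterance."""
        def loglik(mu, var):
            return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                        + (utterance - mu) ** 2 / var)))
        return max(self.params, key=lambda name: loglik(*self.params[name]))
```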

Patent
Hideki Kojima, Shinta Kimura
27 Jun 1994
TL;DR: In this article, a delay device is provided, coupled to the recognition device, for delaying the recognition result or the response output from the recognition device by a predetermined time; the delay device also supplies the delayed recognition result or response to the display as an output of the speech recognition apparatus.
Abstract: A speech recognition apparatus generates an output which is displayed on a display. The speech recognition apparatus includes a recognition device for recognizing an input speech and for outputting a recognition result or a response corresponding thereto. A delay device is provided, coupled to the recognition device, for delaying the recognition result or the response output from the recognition device by a predetermined time. The delay device also supplies the delayed recognition result or response to the display as an output of the speech recognition apparatus.

Patent
19 May 1994
TL;DR: In this article, the authors propose a speech detection apparatus consisting of a reference model maker for extracting a plurality of parameters for speech detection from training data, a parameter extractor, and a decision device for deciding whether or not the audio signal is speech by comparing the parameters extracted from the input audio signal with the reference model.
Abstract: The speech detection apparatus comprises: a reference model maker for extracting a plurality of parameters for speech detection from training data, and for making a reference model based on the parameters; a parameter extractor for extracting the plurality of parameters from each frame of an input audio signal; and a decision device for deciding whether or not the audio signal is speech, by comparing the parameters extracted from the input audio signal with the reference model. The reference model maker makes the reference model for each phoneme. The decision device includes: a similarity computing unit for comparing the parameters extracted from each frame of the input audio signal with the reference model, and for computing a similarity of the frame with respect to the reference model; a phoneme decision unit for deciding a phoneme of each frame of the input audio signal based on the similarity computed for each phoneme; and a final decision unit for deciding whether or not a specific period of the input audio signal including a plurality of frames is speech, based on the result of the phoneme decision for the plurality of frames.
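
A minimal sketch of the decision chain (similarity, per-frame phoneme decision, final period decision), assuming each phoneme's reference model is a single mean vector and that a period counts as speech when enough frames are labeled with speech phonemes; the distance measure and threshold are illustrative:

```python
import numpy as np

def frame_similarity(feat: np.ndarray, models: dict) -> dict:
    """Similarity of one frame to each phoneme's reference vector
    (negative Euclidean distance; real systems use richer models)."""
    return {ph: -float(np.linalg.norm(feat - mu)) for ph, mu in models.items()}

def is_speech_period(frames: np.ndarray, models: dict, speech_phones: set,
                     min_ratio: float = 0.5) -> bool:
    """Final decision: enough frames labeled with speech (not noise) phonemes."""
    hits = 0
    for f in frames:
        sims = frame_similarity(f, models)
        best = max(sims, key=sims.get)      # per-frame phoneme decision
        hits += best in speech_phones
    return hits / len(frames) >= min_ratio
```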

Journal ArticleDOI
TL;DR: The authors describe a novel approach to speech recognition by directly modeling the statistical characteristics of the speech waveforms, which allows them to remove the need for using speech preprocessors, which conventionally serve a role of converting speech waves into frame-based speech data subject to a subsequent modeling process.
Abstract: The authors describe a novel approach to speech recognition by directly modeling the statistical characteristics of the speech waveforms. This approach allows them to remove the need for using speech preprocessors, which conventionally serve a role of converting speech waveforms into frame-based speech data subject to a subsequent modeling process. Central to their method is the representation of the speech waveforms as the output of a time-varying filter excited by a Gaussian source time-varying in its power. In order to formulate a speech recognition algorithm based on this representation, the time variation in the characteristics of the filter and of the excitation source is described in a compact and parametric form of the Markov chain. They analyze in detail the comparative roles played by the filter modeling and by the source modeling in speech recognition performance. Based on the result of the analysis, they propose and evaluate a normalization procedure intended to remove the sensitivity of speech recognition accuracy to often uncontrollable speech power variations. The effectiveness of the proposed speech-waveform modeling approach is demonstrated in a speaker-dependent, discrete-utterance speech recognition task involving 18 highly confusable stop consonant-vowel syllables. The high accuracy obtained shows the promising potential of the proposed time-domain waveform modeling technique for speech recognition.
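
The core scoring step, the likelihood of a raw waveform segment under one state's filter and Gaussian source, can be sketched as an all-pole (AR) prediction residual, with a separate power normalization in the spirit of the procedure the abstract mentions; the AR order and normalization form are assumptions:

```python
import numpy as np

def segment_loglik(x: np.ndarray, ar_coefs: np.ndarray, source_var: float) -> float:
    """Gaussian log-likelihood of waveform x under a fixed AR filter
    excited by zero-mean Gaussian noise with variance source_var."""
    p = len(ar_coefs)
    # predicted x[n] = sum_{k=1..p} a_k * x[n-k]
    pred = np.convolve(x, np.concatenate(([0.0], ar_coefs)))[:len(x)]
    e = x[p:] - pred[p:]                       # prediction residual
    n = len(e)
    return (-0.5 * n * np.log(2 * np.pi * source_var)
            - np.sum(e ** 2) / (2 * source_var))

def power_normalize(x: np.ndarray) -> np.ndarray:
    """Remove overall power so scores ignore uncontrolled level changes."""
    return x / (np.sqrt(np.mean(x ** 2)) + 1e-12)
```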

PatentDOI
TL;DR: In this paper, the speed of an input speech is changed without any change of its pitch, and the speed can be modulated continuously on the basis of stored raw data of the speech.
Abstract: The speed of an input speech is changed without any change of the pitch of the input speech. Raw data of a speech are stored so that the speed of the speech can be modulated continuously on the basis of the raw data. In the speech speed conversion method, a speech speed conversion process for the input speech is carried out in a designated period when speech speed conversion is needed, while the conversion is not carried out in other periods. Further, the speech speed conversion apparatus has a unit for inputting a speech, a speech speed conversion unit for changing the speed of the input speech, and a unit for supplying the output of the speech speed conversion unit as an output speech to the listener's ears. The apparatus further includes a speech speed conversion switch and a unit for outputting a speech while changing the speech speed of the input speech in a period in which the speech speed conversion switch is turned on, but for outputting the input speech without any change in the other period in which the switch is turned off.
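
Speed change without pitch change is commonly done by overlap-add time-scale modification; a bare-bones sketch follows (plain OLA, not necessarily the patent's method, and prone to artifacts that practical systems such as WSOLA correct):

```python
import numpy as np

def change_speed(x: np.ndarray, rate: float, win: int = 512) -> np.ndarray:
    """Analysis hops of ana_hop are laid down every syn_hop samples, so
    duration scales by 1/rate while each window (hence pitch) is untouched."""
    syn_hop = win // 2
    ana_hop = int(syn_hop * rate)       # rate > 1 speeds up, < 1 slows down
    w = np.hanning(win)
    n_frames = max((len(x) - win) // ana_hop, 0)
    out = np.zeros(n_frames * syn_hop + win)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        frame = x[i * ana_hop:i * ana_hop + win] * w
        out[i * syn_hop:i * syn_hop + win] += frame
        norm[i * syn_hop:i * syn_hop + win] += w
    return out / np.maximum(norm, 1e-8)
```

Gating this function on a switch state gives the designated-period behavior: convert while the switch is on, pass the input through unchanged while it is off.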

Proceedings ArticleDOI
08 Jun 1994
TL;DR: Experiments are described to develop a new technique that requires only the received speech, which uses perceptually-based speaker-independent speech parameters such as perceptual-linear prediction coefficients and the perceptually weighted Bark spectrum to estimate subjective quality.
Abstract: Objective speech quality measures automatically assess performance of communication systems without the need for human listeners. Typical objective quality methods are based on some distortion measure between the known input speech record and the received output signal. This paper describes experiments to develop a new technique that requires only the received speech. The algorithm uses perceptually-based speaker-independent speech parameters such as perceptual-linear prediction coefficients and the perceptually weighted Bark spectrum. Parameters derived from a variety of undegraded source speech material provide reference centroids corresponding to high speech quality. The average distance between output speech parameters and the nearest reference centroid provides an indication of speech degradation, which is used to estimate subjective quality. The paper presents algorithm results for speech processed through low bit-rate codecs and subjected to bit errors due to impaired channel conditions. Output-based quality measures would be a valuable tool for monitoring performance of speech communication systems such as digital mobile radio networks and mobile satellite systems.
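
The centroid-distance idea reduces to a few lines once perceptual features are extracted; the linear mapping from average distance to a quality score is an invented placeholder:

```python
import numpy as np

def quality_score(output_feats: np.ndarray, centroids: np.ndarray,
                  scale: float = 1.0) -> float:
    """output_feats: (n_frames, dim) perceptual features of received speech;
    centroids: (k, dim) references derived from undegraded source speech."""
    dists = np.linalg.norm(output_feats[:, None, :] - centroids[None, :, :], axis=2)
    avg_nearest = dists.min(axis=1).mean()       # degradation indicator
    return max(0.0, 5.0 - scale * avg_nearest)   # crude 0-5 MOS-like estimate
```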

PatentDOI
TL;DR: A biofeedback system for speech disorders is provided which is adapted to detect disfluent speech, to provide auditory feedback enabling immediate fluent speech, and to control the auditory feedback in accordance with the disfluent speech, to enable immediate and carryover fluency.
Abstract: A biofeedback system for speech disorders is provided which is adapted to detect disfluent speech, to provide auditory feedback enabling immediate fluent speech, and to control the auditory feedback in accordance with the disfluent speech, to enable immediate and carryover fluency. The disfluent speech detector is preferably an electromyograph (EMG). The auditory feedback is preferably frequency-altered auditory feedback (FAF). The controller shifts the pitch of the user's voice in accordance with the user's disfluent speech. The biofeedback system may also be provided with delayed auditory feedback (DAF) which enables user control of speaking rate, with masking auditory feedback (MAF) which improves user awareness of the physical sensations of speech, and with a voice-operated switch (VOX) to switch the device off when the user stops talking. The biofeedback system may also include a timer on the DAF circuit to automatically vary the user's speaking rate at regular time intervals. The biofeedback system may also be provided with a telephone interface to enable fluent speech while talking on telephones. The system may also provide biofeedback regarding the user's vocal pitch, enabling users to speak or sing at a higher or lower pitch.

PatentDOI
Yumi Takizawa
TL;DR: In this article, the relationship between the durations of each recognition unit is obtained by a duration training circuit and, at the time of recognizing speech, the beginning and end of input speech are detected by a speech period sensing circuit; then, by using the relationship and the input speech period length, the durations of the recognition units in the input speech are estimated.
Abstract: At the time of training reference speech, the relationship between the durations of each recognition unit is obtained by a duration training circuit and, at the time of recognizing speech, the beginning and end of input speech are detected by a speech period sensing circuit; then, by using the relationship and the input speech period length, the durations of the recognition units in the input speech are estimated. Next, the reference speech and the input speech are matched by the matching means using the calculated estimation values, in such a manner that the recognition units have durations close to the estimated values.
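
The duration-estimation step can be sketched as rescaling trained relative durations by the detected input speech length; the names and rounding rule are illustrative:

```python
import numpy as np

def estimate_durations(rel_durations: np.ndarray, speech_len_frames: int) -> np.ndarray:
    """rel_durations: trained mean durations (in frames) of the units in a word.
    Returns per-unit duration targets scaled to the detected speech length."""
    scale = speech_len_frames / rel_durations.sum()
    return np.maximum(np.rint(rel_durations * scale).astype(int), 1)

# Example: trained durations 8/12/20 frames, detected period of 60 frames.
print(estimate_durations(np.array([8, 12, 20]), speech_len_frames=60))  # [12 18 30]
```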

Journal ArticleDOI
W.B. Kleijn, J. Haagen
TL;DR: The characteristic waveform is decomposed into a slowly evolving waveform and a rapidly evolving waveform, representing the quasi-periodic and other components of speech, respectively, which allows efficient coding of voiced and unvoiced speech at bit rates between 2 and 8 kb/s.
Abstract: The speech signal is represented by an evolving characteristic waveform. The characteristic waveform is decomposed into a slowly evolving waveform and a rapidly evolving waveform, representing the quasi-periodic and other components of speech, respectively. These two evolving waveforms have fundamentally different quantization requirements. The decomposition allows efficient coding of voiced and unvoiced speech at bit rates between 2 and 8 kb/s.
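
The decomposition can be sketched by filtering each sample position of the aligned characteristic waveforms along the evolution (cycle-to-cycle) axis; a moving average stands in for the low-pass filter, and the extraction and alignment of the characteristic waveforms is assumed done:

```python
import numpy as np

def sew_rew(char_waves: np.ndarray, span: int = 5):
    """char_waves: (n_cycles, samples_per_cycle) array of aligned cycles.
    Low-pass along axis 0 yields the slowly evolving waveform (SEW);
    the remainder is the rapidly evolving waveform (REW)."""
    kernel = np.ones(span) / span
    sew = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, char_waves)
    rew = char_waves - sew
    return sew, rew
```

The split matters for coding because, as the abstract notes, the two parts have different quantization requirements: the SEW changes slowly and can be updated rarely, while the REW tolerates coarse quantization.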

Patent
24 Feb 1994
TL;DR: In this paper, a voice/image simultaneous communication apparatus is proposed which transmits image data while permitting talking, achieving higher efficiency by encoding voice and synthesizing voice coded data with image data than by frequency multiplexing.
Abstract: A voice/image simultaneous communication apparatus which transmits image data while permitting talking, achieving higher efficiency by encoding voice and synthesizing voice coded data with image data than by frequency multiplexing. The voice/image simultaneous communication apparatus is constructed of a voice coder for coding an analog voice signal, a voice decoder for decoding the coded voice into an analog voice signal, a data synthesizer for synthesizing image coded data and voice coded data, a data separator for separating synthesized coded data into image coded data and voice coded data, and a modem capable of performing full duplex communication. The data synthesizer can change the ratio of synthesis between voice coded data and image coded data in accordance with the data transmission speed.

Journal ArticleDOI
N. Amitay, Sanjiv Nanda
TL;DR: The excess capacity of resource auction multiple access (RAMA), originally proposed for fast handoffs and resource allocation in wireless personal communications systems (PCS), is evaluated for statistical multiplexing of speech; delays experienced by transmitted packets are more evenly distributed in the case of fast speech detection.
Abstract: The excess capacity of resource auction multiple access (RAMA), originally proposed for fast handoffs and resource allocations in wireless personal communications systems (PCS), is evaluated for statistical multiplexing of speech. Using selected GSM parameters in conjunction with M-ary FSK for signaling, it is shown that, in cells with propagation delays of up to 45 μs, 216 assignments/s are feasible. The aim is to exploit this large assignment capacity to increase channel utilization. The authors show that, for packet dropping probabilities of 1%, RAMA can have a multiplexing gain as high as 2.63 with fast speech detection and 2.28 with slow speech detection. RAMA permits graceful degradation during peak traffic demand by operating at higher packet dropping probabilities. The authors also observe that, at low values of packet dropping probability, delays experienced by transmitted packets are more evenly distributed for the case of fast speech detection, while the bulk of the packets experience less delay with slow speech detection. Speech clipping statistics associated with various values of packet dropping probabilities are also presented.

PatentDOI
Bruno Lozach
TL;DR: A system for predictive coding of a digital speech signal with embedded codes, usable in any transmission system or for storing speech signals, which delivers indices representing the coded speech signal.
Abstract: A system for predictive coding of a digital speech signal with embedded codes used in any transmission system or for storing speech signals. The coded digital signal (Sn) is formed by a coded speech signal and, if appropriate, by auxiliary data. A perceptual weighting filter is formed by a filter for short-term prediction of the speech signal to be coded, in order to produce a frequency distribution of the quantization noise. A circuit makes it possible to subtract from the perceptual signal the contribution of the past excitation signal P0n, to deliver an updated perceptual signal Pn. A long-term prediction circuit is formed, as a closed loop, from a dictionary updated by the modelled past excitation r1n for the lowest throughput, and makes it possible to deliver an optimal waveform and an associated estimated gain which make up the estimated perceptual signal P1n. An orthonormal transform module includes an adaptive transform module and a module for progressive modelling by orthogonal vectors, thus making it possible to deliver indices representing the coded speech signal. A circuit makes it possible to insert auxiliary data by stealing bits from the coded speech signal. Decoding is performed through extraction of the data signal and transmission of indices representing the coded speech signal, which is modelled at the minimum throughput.

Journal ArticleDOI
TL;DR: The authors propose a choice of low delay, high quality speech coding and digital modulation systems based on adaptive DPCM, with QDPSK or pseudo-analog transmission (skewed DPSK), for use in conjunction with the STDD multiple access protocol.
Abstract: Various strategies to provide low-delay high-quality digital speech communications in a high-capacity wireless network are examined. Various multiple access schemes based on time-division and packet reservation are compared in terms of their statistical multiplexing capabilities, sensitivity to speech packet dropping, delay, robustness to lossy packet environments, and overhead efficiency. In particular, a low-delay multiple access scheme, called shared time-division duplexing (STDD), is proposed. This scheme allows both the uplink and downlink traffic to share a common channel, thereby achieving high statistical multiplexing gain even with a low population of simultaneous conversations. The authors also propose a choice of low delay, high quality speech coding and digital modulation systems based on adaptive DPCM, with QDPSK or pseudo-analog transmission (skewed DPSK), for use in conjunction with the STDD multiple access protocol. The choice of the alternative systems depends on required end-to-end delay, recovered speech quality and bandwidth efficiency. Typically, with a total capacity of 1 MBaud, 2 ms frame and 8 kBaud speech coding rate, low delay STDD is able to support 48 pairs of users compared to 38, 35, and 16 for TDMA with speech activity detection, basic TDMA and PRMA respectively. This corresponds to respective gains of 26%, 37% and 200%.

Patent
08 Jun 1994
TL;DR: In this article, a distributed voice system and method provides voice scripting and voice messaging to a call processing system, where audio voice is recorded in frames and the frames encapsulated into data packets.
Abstract: A distributed voice system and method provides voice scripting and voice messaging to a call processing system. According to this system and method, audio voice is recorded in frames and the frames encapsulated into data packets. The data packets are stored in a database as voice scripts or voice messages. To play back a voice script or voice message, its packets are sequentially retrieved from the database. As soon as the first packet is retrieved, the data are extracted therefrom and playback can begin. As the voice is being played back to a user (106), subsequent packets are retrieved, the data extracted therefrom, and the data buffered for playback. In this manner, a voice script or a voice message can be played back without interruption.

Patent
Benjamin K. Reaves
18 Jul 1994
TL;DR: In this article, the authors detect the beginning and ending portions of speech contained within an input signal based on the variance of smoothed frequency band limited energy and the history of the smoothed frequency band limited energy within the signal, which makes detection relatively independent of the absolute signal-to-noise ratio of the signal.
Abstract: The device detects the beginning and ending portions of speech contained within an input signal based on the variance of smoothed frequency band limited energy and the history of the smoothed frequency band limited energy within the signal. The use of the variance allows detection which is relatively independent of the absolute signal-to-noise ratio of the signal, and allows accurate detection within a wide variety of backgrounds such as music, motor noise, and background noise, such as other voices. The device can be easily implemented using off-the-shelf hardware along with a high-speed special purpose digital signal processor integrated circuit.
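
A sketch of the variance cue, assuming per-frame band-limited energy is already computed; the window lengths, the percentile-based background estimate, and the threshold factor are illustrative, not the patent's values:

```python
import numpy as np

def detect_endpoints(band_energy: np.ndarray, smooth: int = 5,
                     hist: int = 50, factor: float = 4.0) -> np.ndarray:
    """Flag frames where the short-term variance of smoothed band energy
    rises well above a background level tracked from recent history,
    making the decision relative rather than tied to absolute SNR."""
    kernel = np.ones(smooth) / smooth
    e = np.convolve(band_energy, kernel, mode="same")        # smoothed energy
    st_var = np.array([np.var(e[max(0, i - smooth):i + 1])   # short-term variance
                       for i in range(len(e))])
    flags = np.zeros(len(e), dtype=bool)
    for i in range(hist, len(e)):
        background = np.percentile(st_var[i - hist:i], 20) + 1e-12
        flags[i] = st_var[i] > factor * background
    return flags
```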