
Showing papers on "Speech coding published in 1994"


Journal ArticleDOI
01 Oct 1994
TL;DR: The objective of this paper is to provide a tutorial overview of speech coding methodologies with emphasis on those algorithms that are part of the recent low-rate standards for cellular communications.
Abstract: The past decade has witnessed substantial progress towards the application of low-rate speech coders to civilian and military communications as well as computer-related voice applications. Central to this progress has been the development of new speech coders capable of producing high-quality speech at low data rates. Most of these coders incorporate mechanisms to: represent the spectral properties of speech, provide for speech waveform matching, and "optimize" the coder's performance for the human ear. A number of these coders have already been adopted in national and international cellular telephony standards. The objective of this paper is to provide a tutorial overview of speech coding methodologies with emphasis on those algorithms that are part of the recent low-rate standards for cellular communications. Although the emphasis is on the new low-rate coders, we attempt to provide a comprehensive survey by covering some of the traditional methodologies as well. We feel that this approach will not only point out key references but will also provide valuable background to the beginner. The paper starts with a historical perspective and continues with a brief discussion on the speech properties and performance measures. We then proceed with descriptions of waveform coders, sinusoidal transform coders, linear predictive vocoders, and analysis-by-synthesis linear predictive coders. Finally, we present concluding remarks followed by a discussion of opportunities for future research.

461 citations


Journal ArticleDOI
TL;DR: This correspondence presents an experimental evaluation of different features and channel compensation techniques for robust speaker identification; it is shown that performance differences between the basic features are small, and the major gains are due to the channel compensation techniques.
Abstract: This correspondence presents an experimental evaluation of different features and channel compensation techniques for robust speaker identification. The goal is to keep all processing and classification steps constant and to vary only the features and compensations used to allow a controlled comparison. A general, maximum-likelihood classifier based on Gaussian mixture densities is used as the classifier, and experiments are conducted on the King speech database, a conversational, telephone-speech database. The features examined are mel-frequency and linear-frequency filterbank cepstral coefficients, linear prediction cepstral coefficients, and perceptual linear prediction (PLP) cepstral coefficients. The channel compensation techniques examined are cepstral mean removal, RASTA processing, and a quadratic trend removal technique. It is shown for this database that performance differences between the basic features are small, and the major gains are due to the channel compensation techniques. The best "across-the-divide" recognition accuracy of 92% is obtained for both high-order LPC features and band-limited filterbank features.
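
For concreteness, here is a minimal Python/NumPy sketch of cepstral mean removal, the simplest of the channel compensation techniques compared above; the array shapes and the synthetic offset are illustrative, not taken from the paper.

```python
import numpy as np

def cepstral_mean_removal(cepstra):
    """Remove the per-utterance mean from each cepstral coefficient.

    A stationary channel adds a roughly constant offset to log-spectral
    (hence cepstral) features, so subtracting the time average of each
    coefficient cancels the channel. `cepstra` is an (n_frames, n_coeffs)
    array of, e.g., MFCC vectors.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Hypothetical usage: 200 frames of 13 mel-cepstral coefficients.
mfcc = np.random.randn(200, 13) + 0.5   # pretend 0.5 is a channel offset
compensated = cepstral_mean_removal(mfcc)
```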

336 citations



Journal ArticleDOI
01 Jun 1994
TL;DR: Current activity in speech compression is dominated by research and development of a family of techniques commonly described as code-excited linear prediction (CELP) coding, which offer a quality versus bit rate tradeoff that significantly exceeds most prior compression techniques.
Abstract: Speech and audio compression has advanced rapidly in recent years, spurred on by cost-effective digital technology and diverse commercial applications. Recent activity in speech compression is dominated by research and development of a family of techniques commonly described as code-excited linear prediction (CELP) coding. These algorithms exploit models of speech production and auditory perception and offer a quality versus bit rate tradeoff that significantly exceeds most prior compression techniques for rates in the range of 4 to 16 kb/s. Techniques have also been emerging in recent years that offer enhanced quality in the neighborhood of 2.4 kb/s over traditional vocoder methods. Wideband audio compression is generally aimed at a quality that is nearly indistinguishable from consumer compact-disc audio. Subband and transform coding methods combined with sophisticated perceptual coding techniques dominate in this arena with nearly transparent quality achieved at bit rates in the neighborhood of 128 kb/s per channel.
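
As an illustration of the analysis-by-synthesis idea behind CELP, a toy codebook search might look as follows; the codebook, subframe size, and filter coefficients are placeholders, not any standard's parameters.

```python
import numpy as np
from scipy.signal import lfilter

def celp_codebook_search(target, codebook, lpc):
    """Analysis-by-synthesis search: filter each candidate excitation
    through the LPC synthesis filter 1/A(z) and keep the one whose
    output best matches the (perceptually weighted) target subframe."""
    best_idx, best_gain, best_score = 0, 0.0, -np.inf
    a = np.concatenate(([1.0], lpc))        # A(z) = 1 + a1 z^-1 + ...
    for k, code in enumerate(codebook):
        synth = lfilter([1.0], a, code)     # synthesized contribution
        corr = synth @ target
        energy = synth @ synth + 1e-12
        score = corr * corr / energy        # match after optimal gain
        if score > best_score:
            best_idx, best_gain, best_score = k, corr / energy, score
    return best_idx, best_gain

# Toy usage with a random stochastic codebook of 64 vectors.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((64, 40))    # 40-sample subframes
target = rng.standard_normal(40)
idx, gain = celp_codebook_search(target, codebook, lpc=np.array([-1.2, 0.5]))
```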

234 citations


PatentDOI
TL;DR: A method and system for synthesizing speech utilizing a periodic waveform decomposition and relocation coding scheme is proposed, in which the signals of voiced sound intervals of the original speech are decomposed into wavelets, each corresponding to the speech waveform of one period produced by a glottal pulse.
Abstract: The present invention relates to a method and system for synthesizing speech utilizing a periodic waveform decomposition and relocation coding scheme. According to the scheme, the signals of voiced sound intervals of the original speech are decomposed into wavelets, each of which corresponds to the speech waveform of one period produced by a glottal pulse. These wavelets are individually coded and stored. At synthesis, the stored wavelets nearest to the positions where wavelets are to be located are selected and decoded. The decoded wavelets are superposed so that the original sound quality is maintained while the duration and pitch frequency of the speech segment can be controlled arbitrarily.
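
A rough sketch of the relocation step, under the assumption that each stored "wavelet" is one windowed pitch period and that synthesis overlap-adds the nearest stored wavelet at each new pitch mark; this illustrates the idea and is not the patented implementation.

```python
import numpy as np

def relocate_wavelets(wavelets, positions, new_marks, out_len):
    """Overlap-add stored one-period `wavelets` (originally centred at
    `positions`) at the new pitch marks `new_marks`. Moving the marks
    closer together or farther apart changes pitch and duration while
    each period keeps its original shape, preserving sound quality."""
    out = np.zeros(out_len)
    for mark in new_marks:
        k = int(np.argmin(np.abs(np.asarray(positions) - mark)))  # nearest
        w = wavelets[k]
        start = mark - len(w) // 2
        lo, hi = max(start, 0), min(start + len(w), out_len)
        out[lo:hi] += w[lo - start:hi - start]
    return out
```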

224 citations


Journal ArticleDOI
TL;DR: This paper reviews methods for mapping from the acoustical properties of a speech signal to the geometry of the vocal tract that generated the signal, and shows how the nonuniqueness of this inverse mapping can be alleviated by imposing continuity constraints.
Abstract: This paper reviews methods for mapping from the acoustical properties of a speech signal to the geometry of the vocal tract that generated the signal. Such mapping techniques are studied for their potential application in speech synthesis, coding, and recognition. Mathematically, the estimation of the vocal tract shape from its output speech is a so-called inverse problem, where the direct problem is the synthesis of speech from a given time-varying geometry of the vocal tract and glottis. Different mappings are discussed: mapping via articulatory codebooks, mapping by nonlinear regression, mapping by basis functions, and mapping by neural networks. Besides being nonlinear, the acoustic-to-geometry mapping is also nonunique, i.e., more than one tract geometry might produce the same speech spectrum. The authors show how this nonuniqueness can be alleviated by imposing continuity constraints.
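
The continuity constraint can be made concrete with a Viterbi-style codebook search: each frame pays an acoustic match cost plus a penalty on vocal-tract shape change between consecutive frames. A sketch, assuming NumPy arrays of shapes (T, Da) for the acoustic frames, (K, Da) and (K, Ds) for the codebook; the weight `lam` and the squared-distance costs are illustrative.

```python
import numpy as np

def inverse_map(acoustic_frames, cb_acoustic, cb_shapes, lam=1.0):
    """Viterbi-style articulatory codebook lookup: per-frame spectral
    match cost plus a penalty on shape change between frames, so the
    recovered vocal-tract trajectory stays continuous."""
    T, K = len(acoustic_frames), len(cb_acoustic)
    local = ((acoustic_frames[:, None, :] - cb_acoustic[None, :, :]) ** 2).sum(-1)
    # trans[k, j]: shape distance incurred when moving from entry j to k
    trans = ((cb_shapes[:, None, :] - cb_shapes[None, :, :]) ** 2).sum(-1)
    cost, back = local[0].copy(), np.zeros((T, K), dtype=int)
    for t in range(1, T):
        step = cost[None, :] + lam * trans      # step[k, j]: reach k from j
        back[t] = np.argmin(step, axis=1)
        cost = local[t] + step[np.arange(K), back[t]]
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return np.asarray(path[::-1])               # one codebook index per frame
```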

220 citations


Book ChapterDOI
Oded Ghitza
TL;DR: A state-of-the-art auditory model that simulates, in considerable detail, the outer parts of the auditory periphery up through the auditory nerve level is described and preliminary experimental results that confirm human usage of such integration are discussed, with different integration rules for different time-frequency regions depending on the phoneme-discrimination task.
Abstract: Auditory models that are capable of achieving human performance in tasks related to speech perception would provide a basis for realizing effective speech processing systems. Saving bits in speech coders, for example, relies on a perceptual tolerance to acoustic deviations from the original speech. Perceptual invariance to adverse signal conditions (noise, microphone and channel distortions, room reverberations) and to phonemic variability (due to nonuniqueness of articulatory gestures) may provide a basis for robust speech recognition. A state-of-the-art auditory model that simulates, in considerable detail, the outer parts of the auditory periphery up through the auditory nerve level is described. Speech information is extracted from the simulated auditory nerve firings, and used in place of the conventional input to several speech coding and recognition systems. The performance of these systems improves as a result of this replacement, but is still short of achieving human performance. The shortcomings occur, in particular, in tasks related to low bit-rate coding and to speech recognition. Since schemes for low bit-rate coding rely on signal manipulations that spread over durations of several tens of ms, and since schemes for speech recognition rely on phonemic/articulatory information that extends over similar time intervals, it is concluded that the shortcomings are due mainly to perceptually related rules that operate over durations of 50-100 ms. These observations suggest a need for a study aimed at understanding how auditory nerve activity is integrated over time intervals of that duration. The author discusses preliminary experimental results that confirm human usage of such integration, with different integration rules for different time-frequency regions depending on the phoneme-discrimination task.

192 citations


Book
01 Jan 1994
TL;DR: This hands-on treatment of DSP will help communications engineers upgrade their skills in digital signal processing and make a smooth transition into the design of more advanced systems; it also meets the needs of students who want to bolster their knowledge in communications.
Abstract: From the Publisher: A great deal of modern communications equipment is being converted from analog to digital technology. This timely book explains many of the important concepts related to digital signal processing in easy-to-understand discussions of communications techniques, data transmission, filters, and hardware. Readers are given practical information on how to apply theory and algorithms to the design of radio receivers and transmitters. Among the areas discussed are analog to digital conversion - with emphasis on noise and distortion performance; manipulation of complex signals - positive and negative frequencies, plus Hilbert transformers; digital filters - guidelines for performance in communications, plus decimation and interpolation; hardware - multiplier accumulators, fast Fourier transform processors, digital signal processors, data flow techniques in equipment, and hardware simulation and testing; and speech processing - linear predictive coding (LPC), code excited linear predictive coding (CELP), and how to digitize speech at low data rates. Development of algorithms for oscillators, detectors, modulators, automatic gain control circuits, and other devices is clearly explained. Specific algorithms are provided for AM modulation, frequency modulation, FM detection, threshold extension, audio compression, automatic gain control, and squelch circuitry. Explanations of basic concepts of digital signal processing and data transmission are accompanied by reviews of signal representations, sampling, convolution, and z-transforms. Extensive real-world examples contribute to expertise in many facets of incorporating digital technology into devices. This hands-on treatment of DSP will help communications engineers upgrade their skills in digital signal processing and make a smooth transition into the design of more advanced systems. It also meets the needs of students who want to bolster their knowledge in communications.

164 citations


PatentDOI
TL;DR: A telephony channel simulation process is disclosed for training a speech recognizer to respond to speech obtained from telephone systems.
Abstract: A telephony channel simulation process is disclosed for training a speech recognizer to respond to speech obtained from telephone systems. An input speech data set, whose bandwidth is higher than the telephone bandwidth, is provided to a speech recognition training processor. The process performs a series of alterations to the input speech data set to obtain a modified speech data set. The modified speech data set enables the speech recognition processor to perform speech recognition on voice signals from a telephone system.
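
The patent does not spell out its series of alterations, but the flavor of such a simulation can be sketched as band-limiting wideband training speech to the telephone passband and resampling to the telephone rate; the filter order and band edges below are assumptions, not the patented process.

```python
from scipy.signal import butter, sosfilt, resample_poly

def telephonize(speech, fs_in=16000, fs_out=8000):
    """Approximate a telephony channel: restrict wideband training
    speech to roughly the 300-3400 Hz telephone band, then resample
    to the 8 kHz telephone rate."""
    sos = butter(4, [300, 3400], btype="bandpass", fs=fs_in, output="sos")
    narrowband = sosfilt(sos, speech)
    return resample_poly(narrowband, fs_out, fs_in)
```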

159 citations


Patent
29 Nov 1994
TL;DR: A negotiation handshake protocol is described which enables two sites to negotiate the compression rate based on factors such as the speed or data bandwidth of the communications connection between the sites, the data demand between the sites, and the amount of silence detected in the speech signal.
Abstract: The present invention includes software and hardware components to enable digital data communication over standard telephone lines. The present invention converts analog voice signals to digital data, compresses that data and places the compressed speech data into packets for transfer over the telephone lines to a remote site. A voice control digital signal processor (DSP) operates to use one of a plurality of speech compression algorithms which produce a scaleable amount of compression. The rate of compression is inversely proportional to the quality of the speech the compression algorithm is able to reproduce. The higher the compression, the lower the reproduction quality. The selection of the rate of compression is dependent on such factors as the speed or data bandwidth of the communications connection between the two sites, the data demand between the sites, and the amount of silence detected in the speech signal. The voice compression rate is dynamically changed as the aforementioned factors change. A negotiation handshake protocol is described which enables the two sites to negotiate the compression rate based on such factors.
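
A toy version of the rate-selection logic described above; the rate set, thresholds, and policy are all illustrative, not the patent's negotiation protocol.

```python
def pick_compression_rate(link_bps, data_demand_bps, silence_fraction,
                          rates_bps=(16000, 8000, 4800, 2400)):
    """Pick the highest-quality (least compressed) speech rate that
    still fits beside the data traffic. Detected silence stretches the
    budget, since silent intervals need not be transmitted."""
    budget = max(link_bps - data_demand_bps, rates_bps[-1])
    effective = budget / max(1.0 - silence_fraction, 0.1)
    for rate in rates_bps:                 # ordered best quality first
        if rate <= effective:
            return rate
    return rates_bps[-1]

# e.g. a 14.4 kb/s link carrying 4.8 kb/s of data, with 30% silence:
print(pick_compression_rate(14400, 4800, 0.3))   # -> 8000
```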

140 citations


Journal Article
TL;DR: Dolby AC-3 is a flexible audio data compression technology capable of encoding a range of audio channel formats into a low rate bit stream, based on a transform filter bank and psychoacoustics.
Abstract: Dolby AC-3 is a flexible audio data compression technology capable of encoding a range of audio channel formats into a low rate bit stream. Channel formats range from monophonic to 5.1 channels, and may include a number of associated audio services. Based on a transform filter bank and psychoacoustics, AC-3 includes the novel features of transmission of a variable frequency resolution spectral envelope and hybrid backward/forward adaptive bit allocation.

Journal ArticleDOI
TL;DR: A toll quality speech codec at 8 kb/s suitable for the future personal communications system is presented; it can support frame erasure rates of up to 3%, although the resulting performance degradation does not yet meet the ITU-T requirements.
Abstract: A toll quality speech codec at 8 kb/s suitable for the future personal communications system is presented. The codec is currently under standardization by the ITU-T (successor of CCITT), where the codec terms of reference were mainly determined considering PCS application. The encoding algorithm is based on algebraic code-excited linear prediction (ACELP) and has a speech frame of 10 ms. Efficient pitch and codebook search strategies, along with efficient quantization procedures, have been developed to achieve toll quality encoded speech with a complexity implementable on current fixed-point DSP chips. Formal subjective listening tests, performed by ITU-T SG 12, showed that the codec quality is equivalent to that of G.726 ADPCM at 32 kb/s in error-free conditions and that it outperforms G.726 under error conditions. The codec performs adequately under tandeming conditions and can support a frame erasure rate of up to 3%, although the resulting degradation in performance does not yet meet the ITU-T requirements; this is one subject of study for the next phase. The algorithm has been implemented on a single fixed-point DSP for the ITU-T subjective test, and required about 29 MIPS. An optimized version, however, requires 24 MIPS without any speech quality degradation.

PatentDOI
TL;DR: In this article, a method and apparatus provide a video image of facial features synchronized with synthetic speech, where text input is transformed into a string of phonemes and timing data, which are transmitted to an image generation unit.
Abstract: A method and apparatus provide a video image of facial features synchronized with synthetic speech. Text input is transformed into a string of phonemes and timing data, which are transmitted to an image generation unit. At the same time, a string of synthetic speech samples is transmitted to an audio server. The audio server produces signals for an audio speaker, causing the audio signals to be continuously audibilized; additionally, the audio server initializes a timer. The image generation unit reads the elapsed time from the timer and, by consulting the phoneme and timing data, determines the position of the phoneme currently being audibilized. The image generation unit then calculates the facial configuration corresponding to that position in the string of phonemes and causes the facial configuration to be displayed on a video device.
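
The timer-driven lookup reduces to mapping elapsed audio time onto cumulative phoneme durations; a sketch with hypothetical phoneme labels and durations, not values from the patent.

```python
from bisect import bisect_right
from itertools import accumulate

def current_phoneme(phonemes, durations_ms, elapsed_ms):
    """Map the audio timer reading to the phoneme now being spoken,
    using the cumulative end time of each phoneme in the string."""
    ends = list(accumulate(durations_ms))
    i = bisect_right(ends, elapsed_ms)
    return phonemes[min(i, len(phonemes) - 1)]

# e.g. "hello" as HH EH L OW with assumed durations, 165 ms in:
print(current_phoneme(["HH", "EH", "L", "OW"], [60, 90, 70, 120], 165))  # "L"
```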

Patent
Robert E. Holm
30 Sep 1994
TL;DR: In this paper, a transform based compression mechanism is proposed to establish a single compression technique which is applicable for all audio compression ranging from very low bit rate speech to CD/Audio quality music.
Abstract: A transform-based compression mechanism establishes a single compression technique which is applicable to all audio compression, ranging from very low bit rate speech to CD/Audio quality music. Additionally, a multiconferencing unit (multi-point bridge) is provided which takes advantage of the transform-based compression algorithm by providing a simple, low-cost way of combining multiple parties without the need for transcoding.

Proceedings ArticleDOI
19 Apr 1994
TL;DR: This paper describes the application of transform coded excitation (TCX) coding to encoding wideband speech and audio signals in the bit rate range of 16 kbits/s to 32 kbits/s and proposes novel quantization procedures including inter-frame prediction in the frequency domain.
Abstract: This paper describes the application of transform coded excitation (TCX) coding to encoding wideband speech and audio signals in the bit rate range of 16 kbits/s to 32 kbits/s. The approach uses a combination of time domain (linear prediction; pitch prediction) and frequency domain (transform coding; dynamic bit allocation) techniques, and utilizes a synthesis model similar to that of linear prediction coders such as CELP. However, at the encoder, the high complexity analysis-by-synthesis technique is bypassed by directly quantizing the so-called target signal in the frequency domain. The innovative excitation is derived at the decoder by inverse filtering the quantized target signal. The algorithm is intended for applications whereby a large number of bits is available for the innovative excitation. The TCX algorithm is utilized to encode wideband speech and audio signals with a 50-7000 Hz bandwidth. Novel quantization procedures including inter-frame prediction in the frequency domain are proposed to encode the target signal. The proposed algorithm achieves very high quality for speech at 16 kbits/s, and for music at 24 kbits/s.
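
A cartoon of the TCX signal path under heavy simplification: here the "target" is just the short-term LPC residual, and "quantization" keeps only the strongest frequency bins. The paper's actual target computation, pitch prediction, and quantizers are far more refined; everything below is illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def tcx_encode_frame(x, lpc, keep_bins=32):
    """Toy TCX step: compute a target (short-term residual here),
    'quantize' it in the frequency domain by keeping only the
    largest-magnitude bins, and return the quantized spectrum."""
    a = np.concatenate(([1.0], lpc))
    target = lfilter(a, [1.0], x)          # residual via A(z)
    spec = np.fft.rfft(target)
    weak = np.argsort(np.abs(spec))[:-keep_bins]
    spec[weak] = 0.0                       # zero all but the strongest bins
    return spec

def tcx_decode_frame(spec, lpc):
    """Decoder: inverse-transform the quantized target and pass it
    through the synthesis filter 1/A(z) to reconstruct the frame."""
    a = np.concatenate(([1.0], lpc))
    excitation = np.fft.irfft(spec)
    return lfilter([1.0], a, excitation)
```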

Journal ArticleDOI
TL;DR: The authors propose a singular value decomposition (SVD) approach to detect the instants of glottal closure from the speech signal; it amounts to calculating the Frobenius norms of signal matrices and is therefore computationally efficient.
Abstract: The detection of glottal closure instants has been a necessary step in several applications of speech processing, such as voice source analysis, speech prosody manipulation and speech synthesis. The paper presents a new algorithm for glottal closure detection that compares favorably with other methods available in terms of robustness and computational efficiency. The authors propose to use the singular value decomposition (SVD) approach to detect the instants of glottal closure from the speech signal. The proposed SVD method amounts to calculating the Frobenius norms of signal matrices and therefore is computationally efficient. Moreover, it produces well-defined and reliable peaks that indicate the instants of glottal closure. Finally, with the introduction of the total linear least squares technique, two other proposed methods are reinvestigated and unified into the SVD framework.
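
Since the squared Frobenius norm of a matrix equals the sum of its squared singular values, the peak-picking statistic can be computed without an explicit SVD; a sketch with illustrative matrix dimensions (the paper's exact matrix construction may differ).

```python
import numpy as np

def frobenius_track(speech, rows=20, cols=10):
    """Frobenius norm of a sliding (rows x cols) signal matrix built
    from consecutive speech samples; because ||X||_F^2 = sum of squared
    singular values, its peaks mark the glottal closure instants the
    SVD analysis identifies, at a fraction of the cost."""
    speech = np.asarray(speech, dtype=float)
    n = len(speech) - rows - cols + 2
    track = np.empty(max(n, 0))
    for t in range(len(track)):
        X = np.array([speech[t + i:t + i + cols] for i in range(rows)])
        track[t] = np.linalg.norm(X, "fro")
    return track
```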

Patent
22 Dec 1994
TL;DR: In this article, a very low bit rate video and audio coding system that performs allocation for distributing the bit rate based on the needs for the video and the audio signals is disclosed, where the audio signal information is stored in a FIFO buffer to await the arrival of the accompanying video signal.
Abstract: A very low bit rate video and audio coding system that performs allocation for distributing the bit rate based on the needs for the video and audio signals is disclosed. The audio time bands are set to frames as determined by the video signal. The audio is encoded first to determine the number of bits that will be saved based on the energy distribution of the audio signal. Audio signal information is stored in a FIFO buffer to await the arrival of the accompanying video signal. The video signal is then coded as an I picture, a P picture or an extra P picture based on the number of bits available in the buffer, the number of bits saved by the audio encoding, and the minimum number of bits required for video coding. After encoding, the video signal is sent to the FIFO buffer to be matched with the audio signal and outputted as one bit stream by a multiplexer.

Proceedings ArticleDOI
19 Apr 1994
TL;DR: The technique, called nonlinear predictive coding, is shown to be superior to the LPC technique; two different nonlinear predictors are presented, one based on a second-order Volterra filter and the other on a time delay neural network, the latter being found the more suitable for speech coding applications.
Abstract: Addresses the question of how to extract the nonlinearities in speech with the prime purpose of facilitating coding of the residual signal in residual excited coders. The short-term prediction of speech in speech coders is extensively based on linear models, e.g. the linear predictive coding technique (LPC), which is one of the most basic elements in modern speech coders. This technique does not allow extraction of nonlinear dependencies. If nonlinearities are absent from speech the technique is sufficient, but if the speech contains nonlinearities the technique is inadequate. The authors give evidence for nonlinearities in speech and propose nonlinear short-term predictors that can substitute for the LPC technique. The technique, called nonlinear predictive coding, is shown to be superior to the LPC technique. Two different nonlinear predictors are presented. The first is based on a second-order Volterra filter, and the second is based on a time delay neural network. The latter is shown to be the more suitable for speech coding applications.
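
A minimal second-order Volterra predictor of the kind named, fitted by least squares on a toy signal; the prediction order and test signal are illustrative, not the paper's setup.

```python
import numpy as np

def volterra_features(x, p=4):
    """Build regression rows with linear terms x[n-1..n-p] plus all
    quadratic cross-products x[n-i]*x[n-j], i <= j: the second-order
    Volterra expansion that captures dependencies a linear LPC
    predictor cannot."""
    rows, targets = [], []
    for n in range(p, len(x)):
        past = x[n - p:n][::-1]
        quad = [past[i] * past[j] for i in range(p) for j in range(i, p)]
        rows.append(np.concatenate([past, quad]))
        targets.append(x[n])
    return np.array(rows), np.array(targets)

# Fit predictor coefficients by least squares on a mildly nonlinear signal.
x = np.sin(np.linspace(0, 20, 500)) ** 3
H, y = volterra_features(x)
coef, *_ = np.linalg.lstsq(H, y, rcond=None)
residual = y - H @ coef        # what a residual-excited coder would encode
```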

PatentDOI
Yong Zhou
TL;DR: A low bit rate audio and video communication system is proposed which employs an integrated encoding system that dynamically allocates available bits among the audio and video signals to be encoded.
Abstract: Disclosed is a low bit rate audio and video communication system which employs an integrated encoding system that dynamically allocates available bits among the audio and video signals to be encoded based on the content of the audio and video information and the manner in which the audio and video information will be perceived by a viewer. A dynamic bit allocation and encoding process will evaluate the current content of the audio and video information and allocate the available bits among the audio and video signals to be encoded. In addition, an appropriate audio encoding technique is dynamically selected based on the current content of the audio signal. A face location detection subroutine will detect and model the location of faces in each video frame, in order that the facial regions may be more accurately encoded than other portions of the video frame. A lip motion detection subroutine will detect the location and movement of the lips of a person present in a video scene, in order to determine when a person is speaking and to encode the lip regions more accurately. The audio and video signals generated by a second party to a communication are monitored to determine if the second party is paying attention to the audio and video information transmitted by the first party to the communication.

PatentDOI
TL;DR: The stereophonic embodiment eliminates redundancies in the sum and difference signals, so that the stereo coding uses significantly less than twice the bit rate of the comparable monaural signal.
Abstract: A technique for the masking of quantizing noise in the coding of audio signals is usable with the types of channel coding known as "noiseless" or Huffman coding and with variable radix packing. In a multichannel environment, noise masking thresholds may be determined by combining sets of power spectra for each of the channels. The stereophonic embodiment eliminates redundancies in the sum and difference signals, so that the stereo coding uses significantly less than twice the bit rate of the comparable monaural signal. The technique can be used both in transmission of signals and in recording for reproduction, particularly recording and reproduction of music. Compatibility with the ISDN transmission rates known as 1B, 2B and 3B rates has been achieved.
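
The sum/difference idea can be sketched directly (psychoacoustic thresholding and the noiseless coding stage are omitted); the signals and correlation level are illustrative.

```python
import numpy as np

def ms_encode(left, right):
    """Sum/difference (mid/side) transform: correlated stereo content
    collapses into the sum channel, leaving a low-energy difference
    channel that costs far fewer bits to code."""
    return (left + right) / 2.0, (left - right) / 2.0

def ms_decode(mid, side):
    return mid + side, mid - side

L = np.random.randn(1024)
R = 0.9 * L + 0.1 * np.random.randn(1024)   # highly correlated channels
M, S = ms_encode(L, R)                      # S carries far less energy than M
```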

Journal ArticleDOI
TL;DR: The author proves that the matrixing operations in MPEG subband filtering can be efficiently computed using fast 32-point DCT or inverse DCT (IDCT) algorithms.
Abstract: Subband filtering is one of the most compute-intensive operations in the MPEG audio coding standard. The author proves that the matrixing operations in MPEG subband filtering can be efficiently computed using fast 32-point DCT or inverse DCT (IDCT) algorithms.
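
The matrixing in question is the 32x64 cosine product of the MPEG-1 audio analysis filter bank; a direct (slow) form is shown below, which the paper shows can be rerouted through a fast 32-point DCT/IDCT instead of the 2048 multiplies per block used here.

```python
import numpy as np

def mpeg_analysis_matrixing(Y):
    """Direct form of the MPEG-1 audio analysis matrixing:
    S[i] = sum_{k=0}^{63} cos((2i+1)(k-16)*pi/64) * Y[k], i = 0..31,
    where Y is the length-64 vector from the windowing stage."""
    i = np.arange(32)[:, None]
    k = np.arange(64)[None, :]
    M = np.cos((2 * i + 1) * (k - 16) * np.pi / 64)
    return M @ Y
```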

Dissertation
01 Jan 1994
TL;DR: SpeechSkimmer uses simple speech processing techniques to allow a user to hear recorded sounds quickly and at several levels of detail, and provides continuous real-time control of the speed and detail level of the audio presentation.
Abstract: Listening to a speech recording is much more difficult than visually scanning a document because of the transient and temporal nature of audio. Audio recordings capture the richness of speech, yet it is difficult to directly browse the stored information. This dissertation investigates techniques for structuring, filtering, and presenting recorded speech, allowing a user to navigate and interactively find information in the audio domain. This research makes it easier and more efficient to listen to recorded speech by using the SpeechSkimmer system. First, this dissertation describes Hyperspeech, a speech-only hypermedia system that explores issues of speech user interfaces, browsing, and the use of speech as data in an environment without a visual display. The system uses speech recognition input and synthetic speech feedback to aid in navigating through a database of digitally recorded speech. This system illustrates that managing and moving in time are crucial in speech interfaces. Hyperspeech uses manually segmented and structured speech recordings, a technique that is practical only in limited domains. Second, to overcome the limitations of Hyperspeech while retaining browsing capabilities, a variety of speech analysis and user interface techniques are explored. This research exploits properties of spontaneous speech to automatically select and present salient audio segments in a time-efficient manner. Two speech processing technologies, time compression and adaptive speech detection (to find hesitations and pauses), are reviewed in detail with a focus on techniques applicable to extracting and displaying speech information. Finally, this dissertation describes SpeechSkimmer, a user interface for interactively skimming speech recordings. SpeechSkimmer uses simple speech processing techniques to allow a user to hear recorded sounds quickly, and at several levels of detail. User interaction, through a manual input device, provides continuous real-time control of the speed and detail level of the audio presentation. SpeechSkimmer incorporates time-compressed speech, pause removal, automatic emphasis detection, and non-speech audio feedback to reduce the time needed to listen. This dissertation presents a multi-level structural approach to auditory skimming, and user interface techniques for interacting with recorded speech.
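
A crude sketch of the pause-shortening ingredient, assuming a simple frame-energy threshold; SpeechSkimmer's actual adaptive speech detection is more sophisticated, and all thresholds here are illustrative.

```python
import numpy as np

def shorten_pauses(speech, fs, frame_ms=20, rel_threshold=0.05, keep_ms=100):
    """Keep speech frames, but truncate any run of low-energy frames to
    `keep_ms`, so hesitations and pauses shrink while the rhythm of the
    recording survives."""
    n = int(fs * frame_ms / 1000)
    frames = [speech[i:i + n] for i in range(0, len(speech) - n, n)]
    energies = np.array([float(np.mean(f ** 2)) for f in frames])
    loud = energies > rel_threshold * energies.max()
    out, silent_run = [], 0
    max_silent = int(keep_ms / frame_ms)
    for frame, is_loud in zip(frames, loud):
        silent_run = 0 if is_loud else silent_run + 1
        if is_loud or silent_run <= max_silent:
            out.append(frame)
    return np.concatenate(out) if out else speech[:0]
```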

Proceedings ArticleDOI
Stephan Euler, J. Zinke
19 Apr 1994
TL;DR: The authors use a Gaussian classifier to estimate the coding condition of a test utterance; the combination of this classifier and coder-specific word models yields high overall recognition performance.
Abstract: Examines the influence of different coders in the range from 64 kbit/s to 4.8 kbit/s on both a speaker-independent isolated word recognizer and a speaker verification system. Applying systems trained with 64 kbit/s speech to, e.g., the 4.8 kbit/s data increases the error rate of the word recognizer by a factor of three. For rates below 13 kbit/s the speaker verification is more affected than the word recognition. The performance improves significantly if word models are provided for the individual coding conditions. Therefore, the authors use a Gaussian classifier for estimation of the coding condition of a test utterance. The combination of this classifier and coder-specific word models yields high overall recognition performance.
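
A sketch of the coding-condition classifier as a single diagonal Gaussian per coder, chosen by log-likelihood; the paper's exact classifier configuration may differ, and the class structure here is purely illustrative.

```python
import numpy as np

class CoderConditionClassifier:
    """One diagonal Gaussian per coding condition: classify a test
    utterance to the condition whose Gaussian gives its feature frames
    the highest total log-likelihood, then hand off to that condition's
    coder-specific word models."""

    def fit(self, features_by_condition):
        # features_by_condition: {condition: (N, D) training feature array}
        self.stats = {c: (f.mean(axis=0), f.var(axis=0) + 1e-6)
                      for c, f in features_by_condition.items()}
        return self

    def classify(self, frames):
        def loglik(mu, var):
            return float(-0.5 * (np.log(2 * np.pi * var)
                                 + (frames - mu) ** 2 / var).sum())
        return max(self.stats, key=lambda c: loglik(*self.stats[c]))
```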

Proceedings ArticleDOI
19 Apr 1994
TL;DR: The general formulation of warped linear prediction (WLP) is given and effective realizations with allpass filters are studied; the application of auditory WLP to speech coding and speech recognition has given good results.
Abstract: A linear prediction process is applied to frequency warped signals. The warping is realized by using orthonormal FAM (frequency modulated complex exponentials) functions. The general formulation of warped linear prediction (WLP) is given and effective realizations with allpass filters are studied. The application of auditory WLP to speech coding and speech recognition has given good results.
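
A sketch of the standard allpass realization of frequency warping, which matches the paper's approach only in spirit: replacing each unit delay with a first-order allpass D(z) = (z^{-1} - λ)/(1 - λ z^{-1}) bends the frequency axis toward an auditory (Bark-like) scale. The warped autocorrelation terms below can then go through ordinary Levinson-Durbin in place of the usual autocorrelation; λ is illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def warped_autocorr(x, order, lam=0.57):
    """Warped autocorrelation r[k] = <x, D^k x>: each successive lag is
    produced by one more pass of the signal through the first-order
    allpass D(z) = (z^-1 - lam)/(1 - lam z^-1) instead of a unit delay.
    lam ~ 0.57 approximates Bark warping at a 16 kHz sample rate."""
    x = np.asarray(x, dtype=float)
    r = np.empty(order + 1)
    y = x.copy()
    for k in range(order + 1):
        r[k] = x @ y
        y = lfilter([-lam, 1.0], [1.0, -lam], y)   # apply D(z) once
    return r
```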

Journal ArticleDOI
Y. Mahieux, J.P. Petit
TL;DR: A transform coding algorithm devoted to high quality audio coding at a bit rate of 64 kbps per monophonic channel is presented; although a complete system including framing, synchronization, and error correction has been developed, only the bit rate compression algorithm is described.
Abstract: This paper presents a transform coding algorithm devoted to high quality audio coding at a bit rate of 64 kbps per monophonic channel. It enables the transmission of high quality stereo sound through the basic access (2B channels) of ISDN. Although a complete system including framing, synchronization and error correction has been developed, only the bit rate compression algorithm is described here. A detailed analysis of the signal processing techniques such as the time/frequency transformation, the pre-echo reduction by adaptive filtering, the fast algorithm computations, etc., is provided. The use of psychoacoustical properties is also precisely reported. Finally, some subjective evaluation results and a real-time implementation of the coder using the AT&T DSP32C digital signal processor are presented.

Patent
19 May 1994
TL;DR: In this article, the authors propose a speech detection apparatus consisting of a reference model maker for extracting a plurality of parameters for speech detection from training data, and a parameter extractor and a decision device for deciding whether or not the audio signal is speech, by comparing the parameters extracted from the input audio signal with the reference model.
Abstract: The speech detection apparatus comprises: a reference model maker for extracting a plurality of parameters for speech detection from training data, and for making a reference model based on the parameters; a parameter extractor for extracting the plurality of parameters from each frame of an input audio signal; and a decision device for deciding whether or not the audio signal is speech, by comparing the parameters extracted from the input audio signal with the reference model. The reference model maker makes the reference model for each phoneme. The decision device includes: a similarity computing unit for comparing the parameters extracted from each frame of the input audio signal with the reference model, and for computing a similarity of the frame with respect to the reference model; a phoneme decision unit for deciding a phoneme of each frame of the input audio signal based on the similarity computed for each phoneme; and a final decision unit for deciding whether or not a specific period of the input audio signal including a plurality of frames is speech, based on the result of the phoneme decision for the plurality of frames.

Patent
Marvin L. Williams
16 Aug 1994
TL;DR: In this paper, a template is used to analyze audio input events and a speech audio input event is recorded and then the recorded non-speech audio inputs are processed to create a second entry in the template.
Abstract: A method and apparatus for analyzing audio input events. A template is utilized to analyze audio input events. A speech audio input event is identified. The identified speech audio input event is recorded. The recorded speech audio input event is processed to create a first entry in a template. A selected non-speech audio input event which occurs in a selected environment is identified. The identified non-speech audio input event is recorded. Then the recorded non-speech audio input event is processed to create a second entry in the template. Thereafter, speech audio input events and non-speech audio input events are distinguished by comparing an audio input event to the template.

Journal ArticleDOI
TL;DR: It is found that 2-D matched-filter microphone arrays are capable of producing high speaker identification scores in a hostile acoustic environment with multipath distortion and competing noise sources.
Abstract: Hands-free operation of speech processing equipment is sometimes desired so that the user is unencumbered by hand-held or body-worn microphones. This paper explores the use of array microphones to capture speech under adverse acoustic conditions, and to provide input to a system for automatic speaker identification. The system is evaluated using reverberated speech signals, generated by a computer model of room acoustics and transduced by different simulated microphone-arrays. For comparison, the system is also evaluated using close-talking microphone input. It is found that 2-D matched-filter microphone arrays are capable of producing high speaker identification scores in a hostile acoustic environment with multipath distortion and competing noise sources. The paper also explores the influence of vector quantization techniques, codebook size, and order of cepstrum coefficients on the performance of the speaker identification system.
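
Matched-filter array processing generalizes the familiar delay-and-sum beamformer, which is easy to sketch; the matched-filter version convolves each channel with its measured room impulse response instead of applying a pure delay. The sketch below is the simpler baseline, with illustrative inputs.

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """Simplest array-processing baseline: advance each microphone
    signal by its integer steering delay (in samples) and average, so
    the look direction adds coherently while noise and reflections
    from other directions partially cancel."""
    out = np.zeros(max(len(s) for s in mic_signals))
    for sig, d in zip(mic_signals, delays):
        out[:len(sig) - d] += sig[d:]
    return out / len(mic_signals)
```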

Journal ArticleDOI
TL;DR: The authors describe a novel approach to speech recognition that directly models the statistical characteristics of the speech waveforms, removing the need for speech preprocessors, which conventionally convert speech waveforms into frame-based speech data for a subsequent modeling process.
Abstract: The authors describe a novel approach to speech recognition by directly modeling the statistical characteristics of the speech waveforms. This approach allows them to remove the need for using speech preprocessors, which conventionally serve a role of converting speech waveforms into frame-based speech data subject to a subsequent modeling process. Central to their method is the representation of the speech waveforms as the output of a time-varying filter excited by a Gaussian source time-varying in its power. In order to formulate a speech recognition algorithm based on this representation, the time variation in the characteristics of the filter and of the excitation source is described in a compact and parametric form of the Markov chain. They analyze in detail the comparative roles played by the filter modeling and by the source modeling in speech recognition performance. Based on the result of the analysis, they propose and evaluate a normalization procedure intended to remove the sensitivity of speech recognition accuracy to often uncontrollable speech power variations. The effectiveness of the proposed speech-waveform modeling approach is demonstrated in a speaker-dependent, discrete-utterance speech recognition task involving 18 highly confusable stop consonant-vowel syllables. The high accuracy obtained shows promising potentials of the proposed time-domain waveform modeling technique for speech recognition.

PatentDOI
TL;DR: In this paper, the speed of an input speech is changed without any change of the pitch of the input speech, and the speed can be modulated continuously on the basis of the raw data of the speech.
Abstract: The speed of input speech is changed without any change in its pitch. Raw speech data are stored so that the speed can be modulated continuously on the basis of those raw data. In the speech speed conversion method, the conversion process is carried out in a designated period when speech speed conversion is needed, and is not carried out in other periods. The speech speed conversion apparatus has a unit for inputting speech, a speech speed conversion unit for changing the speed of the input speech, and a unit for supplying the output of the conversion unit as output speech to the listener's ears. The apparatus further includes a speech speed conversion switch, and a unit that outputs speech with its speed changed during periods in which the switch is turned on, and outputs the input speech unchanged during periods in which the switch is turned off.
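
One standard way to change speed without changing pitch is overlap-add time-scale modification; the sketch below is illustrative and is not the patented method, which also claims the switchable apparatus around the conversion unit. Frame and hop sizes are assumptions.

```python
import numpy as np

def ola_time_stretch(x, rate, frame=1024, hop=256):
    """Naive overlap-add time-scale modification: analysis frames are
    read every hop*rate samples but written every hop samples, so the
    duration scales by 1/rate (rate > 1 plays faster) while each frame,
    and hence the local pitch, is left untouched. Practical systems
    (SOLA/WSOLA) additionally align frames by cross-correlation to
    avoid phase artifacts at the overlaps."""
    win = np.hanning(frame)
    n_frames = int((len(x) - frame) / (hop * rate))
    out = np.zeros(n_frames * hop + frame)
    norm = np.zeros_like(out)
    for m in range(n_frames):
        a = int(m * hop * rate)                     # analysis position
        out[m * hop:m * hop + frame] += win * x[a:a + frame]
        norm[m * hop:m * hop + frame] += win
    return out / np.maximum(norm, 1e-8)
```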