
Showing papers on "Voice activity detection published in 1984"


01 Jan 1984
TL;DR: An automatic lipreading system is described, and the combination of the acoustic and visual recognition candidates is shown to yield a final recognition accuracy which greatly exceeds the acoustic recognition accuracy alone.
Abstract: Automatic recognition of the acoustic speech signal alone is inaccurate and computationally expensive. Additional sources of speech information, such as lipreading (or speechreading), should enhance automatic speech recognition, just as lipreading is used by humans to enhance speech recognition when the acoustic signal is degraded. This paper describes an automatic lipreading system which has been developed. A commercial device performs the acoustic speech recognition independently of the lipreading system. The recognition domain is restricted to isolated utterances and speaker dependent recognition. The speaker faces a solid state camera which sends digitized video to a minicomputer system with custom video processing hardware. The video data is sampled during an utterance and then reduced to a template consisting of visual speech parameter time sequences. The distances between the incoming template and all of the trained templates for each utterance in the vocabulary are computed and a visual recognition candidate is obtained. The combination of the acoustic and visual recognition candidates is shown to yield a final recognition accuracy which greatly exceeds the acoustic recognition accuracy alone. Practical considerations and the possible enhancement of speaker independent and continuous speech recognition systems are also discussed.
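
A minimal sketch of the kind of decision fusion the abstract describes, combining per-word template distances from the two modalities. The linear blend and min-max normalization are assumptions for illustration; the paper does not publish its combination rule.

```python
import numpy as np

def fuse_candidates(acoustic_dists, visual_dists, alpha=0.5):
    """Combine acoustic and visual template distances (one per
    vocabulary word, lower is better) into a single score and return
    the index of the winning word. The linear blend is an assumption."""
    a = np.asarray(acoustic_dists, dtype=float)
    v = np.asarray(visual_dists, dtype=float)
    # Min-max normalize each modality so the distances are comparable.
    a = (a - a.min()) / (np.ptp(a) + 1e-9)
    v = (v - v.min()) / (np.ptp(v) + 1e-9)
    return int(np.argmin(alpha * a + (1.0 - alpha) * v))

# Acoustically ambiguous words 0 and 2 are separated by the visual evidence.
print(fuse_candidates([0.42, 0.80, 0.40], [0.90, 0.70, 0.10]))  # -> 2
```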

389 citations


Proceedings ArticleDOI
Sharad Singhal1, B. S. Atal2
19 Mar 1984
TL;DR: This paper focuses on problems encountered in attempting to maintain speech quality while synthesizing speech using multi-pulse excitation at lower bit rates.
Abstract: The multi-pulse excitation model provides a method for producing natural-sounding speech at medium to low bit rates. Multi-pulse analysis obtains the all-pole filter excitation by minimizing a spectrally-weighted mean-squared error between the original and synthetic speech signals. Although the method provides high quality speech around 10 kbits/sec, speech quality suffers if the bit rate is lowered. In this paper, we focus on problems encountered in attempting to maintain speech quality while synthesizing speech using multi-pulse excitation at lower bit rates.

163 citations


Proceedings ArticleDOI
01 Mar 1984
TL;DR: In Harmonic Coding, the signal is synthesized in the time domain as a superimposition of "harmonics" whose instantaneous frequency varies continuously along an interpolation curve within each frame, so that fast pitch variations can be tracked with no difficulty.
Abstract: The Harmonic Coding concept has already shown its potential for efficiently coding speech. Previous implementations have used a frame rate of one frame every 16 ms. This was mainly due to the fact that, with longer frames, even a nonstationary spectral model (of low order) cannot reproduce the zones of fast-varying pitch with the desirable quality. However, the high framing rate is a limitation, since it implies that fewer bits will be available for encoding each frame. A solution for this problem has been devised: the signal is synthesized in the time domain, as a superimposition of "harmonics" whose instantaneous frequency varies continuously along an interpolation curve, within each frame. In this way, fast pitch variations can be tracked with no difficulty. Experimental results are presented, confirming these facts. The integration of this synthesis scheme in a speech coder is discussed.
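
A sketch of the synthesis idea, assuming a simple linear interpolation curve for the fundamental between frame edges: the phase of each harmonic is obtained by integrating its continuously varying instantaneous frequency, so pitch glides stay smooth within the frame.

```python
import numpy as np

def synth_frame(f0_start, f0_end, amps, dur=0.032, fs=8000):
    """Synthesize one frame as a superimposition of harmonics whose
    instantaneous frequency follows a linear glide from f0_start to
    f0_end (a stand-in for the paper's interpolation curve)."""
    n = int(dur * fs)
    t = np.arange(n) / fs
    f0 = f0_start + (f0_end - f0_start) * t / dur   # instantaneous F0
    phase = 2 * np.pi * np.cumsum(f0) / fs          # integrate frequency
    frame = np.zeros(n)
    for k, a in enumerate(amps, start=1):
        frame += a * np.sin(k * phase)              # harmonic k tracks k*f0(t)
    return frame

# 32 ms frame with the pitch gliding from 100 Hz up to 140 Hz.
x = synth_frame(100.0, 140.0, amps=[1.0, 0.5, 0.25])
```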

98 citations


PatentDOI
Mary Anne Garrett1, Gideon Shichman1
TL;DR: Voice signals amplified in a programmable gain, programmable bandwidth subsystem are digitized and buffered and the digitized signals are transferred over a dedicated high speed bus to a speech processor, thus relieving the speech processor and its resident host of the overhead of high data rate transfers.
Abstract: Voice signals amplified in a programmable gain, programmable bandwidth subsystem are digitized and buffered and the digitized signals are transferred over a dedicated high speed bus to a speech processor, thus relieving the speech processor and its resident host of the overhead of high data rate transfers and permitting a relatively low capacity computer to accomplish voice recognition on a real time basis.

95 citations


Patent
18 May 1984
TL;DR: Speech signal presence is detected by a VAD (Voice Activity Detector) in two steps: (1) signal energy above a threshold decides presence, below threshold decides ambiguity; (2) ambiguity is resolved by testing the rate of change of spectral parameters.
Abstract: Speech signal presence is detected in a VAD (Voice Activity Detector) in two steps: (1) Signal energy above a threshold decides presence, below threshold decides ambiguity; (2) ambiguity is resolved by testing the rate of change of spectral parameters.
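
A minimal sketch of the two-step decision described above; the energy and spectral-rate thresholds are tuning constants assumed for illustration, not values from the patent.

```python
import numpy as np

def vad_two_step(frames, energy_thresh=1e-3, rate_thresh=0.05):
    """Two-step VAD: energy above the threshold decides speech
    outright; below it the frame is ambiguous, and the decision falls
    back on the rate of change of the (magnitude) spectrum."""
    decisions, prev_spec = [], None
    for frame in frames:
        energy = float(np.mean(frame ** 2))
        spec = np.abs(np.fft.rfft(frame))
        if energy > energy_thresh:
            speech = True                          # step 1: energy decides
        elif prev_spec is None:
            speech = False
        else:
            rate = float(np.mean(np.abs(spec - prev_spec)))
            speech = rate > rate_thresh            # step 2: spectral change
        prev_spec = spec
        decisions.append(speech)
    return decisions
```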

79 citations


Proceedings ArticleDOI
01 Mar 1984
TL;DR: A system for automatic alignment of phonetic transcriptions with continuous speech has been developed; 93% of the segments are mapped into only one phoneme, and 70% of the time the offset between the boundary found by the automatic alignment system and a hand transcriber is less than 10 ms.
Abstract: A system for automatic alignment of phonetic transcriptions with continuous speech has been developed. The speech signal is first segmented into broad classes using a non-parametric pattern classifier. A knowledge-based dynamic programming algorithm then aligns the broad classes with the phonetic transcriptions. These broad classes provide "islands of reliability" for more detailed segmentation and refinement of boundaries. By doing alignment at the phonetic level, the system can often tolerate inter- and intra-speaker variability. The system was evaluated on sixty sentences spoken by three speakers, two male and one female. 93% of the segments are mapped into only one phoneme, and 70% of the time the offset between the boundary found by the automatic alignment system and a hand transcriber is less than 10 ms. The performance can be improved by applying more heuristic rules.

64 citations


Journal ArticleDOI
TL;DR: This paper presents an improved word-detection algorithm, which can incorporate both vocabulary (syntactic) and task (semantic) information, leading to word-detection accuracies close to 100 percent for isolated digit detection over a wide range of telephone transmission conditions.
Abstract: Accurate location of the endpoints of spoken words and phrases is important for reliable and robust speech recognition. The endpoint detection problem is fairly straightforward for high-level speech signals in low-level stationary noise environments (e.g., signal-to-noise ratios greater than 30-dB rms). However, this problem becomes considerably more difficult when either the speech signals are too low in level (relative to the background noise), or when the background noise becomes highly nonstationary. Such conditions are often encountered in the switched telephone network when the limitation on using local dialed-up lines is removed. In such cases the background noise is often highly variable in both level and spectral content because of transmission line characteristics, transients and tones from the line and/or from signal generators, etc. Conventional speech endpoint detectors have been shown to perform very poorly (on the order of 50-percent word detection) under these conditions. In this paper we present an improved word-detection algorithm, which can incorporate both vocabulary (syntactic) and task (semantic) information, leading to word-detection accuracies close to 100 percent for isolated digit detection over a wide range of telephone transmission conditions.
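
For contrast with the improved algorithm, here is a sketch of the conventional energy-threshold endpoint detector the paper reports failing under nonstationary telephone noise. The noise-floor estimate from leading frames and the 10 dB margin are assumptions.

```python
import numpy as np

def energy_endpoints(x, fs, frame_ms=10, margin_db=10.0):
    """Conventional endpoint detection: the threshold sits a fixed
    margin above a noise floor estimated from the first frames
    (assumed silence). Returns (start, end) sample indices or None."""
    n = int(fs * frame_ms / 1000)
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    log_e = np.array([10 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames])
    noise_floor = log_e[:10].mean()
    active = np.where(log_e > noise_floor + margin_db)[0]
    if active.size == 0:
        return None
    return active[0] * n, (active[-1] + 1) * n
```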

62 citations


Patent
25 May 1984
TL;DR: In this article, a video display of stored text is accompanied by associated speech from a speech synthesizer using coded sounds and intonation, and a central processor controls selection of text and speech.
Abstract: Video display of stored text is accompanied by associated speech from a speech synthesizer using coded sounds and intonation. A central processor controls selection of text and speech. Speech is selectable in one of a plurality of prestored languages coded in frequency and duration data.

52 citations



Proceedings ArticleDOI
01 Mar 1984
TL;DR: A computationally efficient formulation is derived for both covariance and correlation type analyses for multipulse coding of speech, and several methods for pulse amplitude and position determination are given, ranging from a purely sequential one to one which reoptimizes pulse amplitudes at each step.
Abstract: This paper discusses the analysis techniques used to derive the excitation waveform for multipulse coding of speech. A computationally efficient formulation is derived for both covariance and correlation type analyses. These methods differ in the way block edges are treated. Several methods for pulse amplitude and position determination are given, ranging from a purely sequential one to one which reoptimizes pulse amplitudes at each step. It is shown that the reoptimization scheme has a nested structure that allows a reduction in the computations. An efficient method for pulse position coding is given. This method can essentially achieve the entropy limit for randomly placed pulses. Experimental results are given for typical configurations including computational requirements and speech quality assessments.
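
A sketch of the purely sequential strategy named in the abstract, under the usual multipulse setup: given the target (weighted) signal and the synthesis filter's impulse response, each step places the pulse that best cancels the current residual. Frame-edge truncation of the impulse response is ignored for brevity, and no reoptimization of earlier amplitudes is attempted.

```python
import numpy as np

def multipulse_sequential(target, h, n_pulses):
    """Place n_pulses one at a time: the best position maximizes the
    correlation with the filter impulse response h, and the amplitude
    is the least-squares optimum for that position."""
    residual = np.asarray(target, dtype=float).copy()
    h = np.asarray(h, dtype=float)
    hh = float(np.dot(h, h))
    pulses = []
    for _ in range(n_pulses):
        corr = np.correlate(residual, h, mode="full")[len(h) - 1:]
        pos = int(np.argmax(np.abs(corr)))
        amp = corr[pos] / hh
        pulses.append((pos, amp))
        # Subtract this pulse's contribution from the residual.
        end = min(len(residual), pos + len(h))
        residual[pos:end] -= amp * h[:end - pos]
    return pulses
```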

48 citations


PatentDOI
Peter F. Brown1
TL;DR: In this article, a speech recognition method and apparatus employ a speech processing circuitry for repetitively deriving from a speech input, at a frame repetition rate, a plurality of acoustic parameters.
Abstract: A speech recognition method and apparatus employ a speech processing circuitry for repetitively deriving from a speech input, at a frame repetition rate, a plurality of acoustic parameters. The acoustic parameters represent the speech input signal for a frame time. A plurality of template matching and cost processing circuitries are connected to a system bus, along with the speech processing circuitry, for determining, or identifying, the speech units in the input speech, by comparing the acoustic parameters with stored template patterns. The apparatus can be expanded by adding more template matching and cost processing circuitry to the bus thereby increasing the speech recognition capacity of the apparatus. The speech processing circuitry establishes overlapping time durations for generating the acoustic parameters and further employs a sinc-Kaiser smoothing function in combination with a folding technique for providing a discrete Fourier transform. The Fourier spectra are transformed using a biased principal component analysis which optimizes the across class variance. The template matching and cost processing circuitries provide distributed processing, on demand, of the acoustic parameters for generating through a dynamic programming technique the recognition decision.

PatentDOI
TL;DR: In a speech recognition system, the beginning of speech versus non-speech (a cough or noise) is distinguished by reverting to a non-speech decision process whenever the likelihood cost of template (vocabulary) patterns, including silence, is worse than a predetermined threshold, established by a Joker Word which represents a non-vocabulary word score and path in the grammar graph.
Abstract: In a speech recognition system, the beginning of speech versus non-speech (a cough or noise) is distinguished by reverting to a non-speech decision process whenever the likelihood cost of template (vocabulary) patterns, including silence, is worse than a predetermined threshold, established by a Joker Word which represents a non-vocabulary word score and path in the grammar graph.
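
A toy sketch of the rejection rule, assuming costs are negative log likelihoods (lower is better) and that the Joker Word contributes a fixed threshold cost; the actual patent embeds the joker as a path in the grammar graph.

```python
def decode_or_reject(template_costs, joker_cost):
    """Return the best vocabulary word, or None (non-speech) when even
    the best template, including silence, scores worse than the joker."""
    best_word = min(template_costs, key=template_costs.get)
    if template_costs[best_word] > joker_cost:
        return None                       # revert to non-speech decision
    return best_word

print(decode_or_reject({"yes": 4.1, "no": 3.7, "<silence>": 5.0}, 3.0))  # None
print(decode_or_reject({"yes": 2.1, "no": 3.7, "<silence>": 5.0}, 3.0))  # yes
```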

PatentDOI
TL;DR: In this article, a speech recognition method and apparatus employ a speech processing circuitry for repetitively deriving from a speech input, at a frame repetition rate, a plurality of acoustic parameters.
Abstract: A speech recognition method and apparatus employ a speech processing circuitry for repetitively deriving from a speech input, at a frame repetition rate, a plurality of acoustic parameters. The acoustic parameters represent the speech input signal for a frame time. A plurality of template matching and cost processing circuitries are connected to a system bus, along with the speech processing circuitry, for determining, or identifying, the speech units in the input speech, by comparing the acoustic parameters with stored template patterns. The apparatus can be expanded by adding more template matching and cost processing circuitry to the bus thereby increasing the speech recognition capacity of the apparatus. Template pattern generation is advantageously aided by using a "joker" word to specify the time boundaries of utterances spoken in isolation, by finding the beginning and ending of an utterance surrounded by silence.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A system for speech analysis and enhancement which combines signal processing and symbolic processing in a closely coupled manner and attempts to reconstruct the original speech waveform using symbolic processing to help model the signal and to guide reconstruction.
Abstract: This paper describes a system for speech analysis and enhancement which combines signal processing and symbolic processing in a closely coupled manner. The system takes as input both a noisy speech signal and a symbolic description of the speech signal. The system attempts to reconstruct the original speech waveform using symbolic processing to help model the signal and to guide reconstruction. The system uses various signal processing algorithms for parameter estimation and reconstruction.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A further application for time-alignment algorithms is described, in which replacement dialogue for a film soundtrack may be automatically synchronized to reference dialogue recorded during filming, in a digital signal processing system that uses a DP algorithm.
Abstract: A number of applications exist in basic speech research for Dynamic Programming (DP) algorithms that can produce accurate time registration data for aligning one speech signal with a similar speech signal. In this paper, a further application for time-alignment algorithms is described, in which replacement dialogue for a film soundtrack may be automatically synchronized to reference dialogue recorded during filming. This is being carried out in a digital signal processing system that uses a DP algorithm capable of aligning utterances of indeterminate length accurately and efficiently in real-time. The main features of this system and the DP algorithm will be described.
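
A textbook DP (DTW) alignment sketch of the kind such a system builds on; the real-time, indeterminate-length algorithm in the paper is more elaborate.

```python
import numpy as np

def dtw_align(ref, rep):
    """Align two feature sequences (e.g. reference dialogue frames vs.
    replacement dialogue frames) and return the warping path as a list
    of (ref_index, rep_index) pairs."""
    n, m = len(ref), len(rep)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(ref[i - 1]) - np.asarray(rep[j - 1]))
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m                  # backtrack the best path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```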

Proceedings ArticleDOI
01 Mar 1984
TL;DR: The paper describes an automatic method, called Automatic Diphone Bootstrapping (or A.D.B.), for template extraction for Speaker-Adaptive Continuous Speech Recognition using "diphones" as speech units, which operates without any manual intervention and performed very well for all the speakers on which it was tested.
Abstract: The paper describes an automatic method, called Automatic Diphone Bootstrapping (or A.D.B.), for template extraction for Speaker-Adaptive Continuous Speech Recognition using "diphones" as speech units. Diphones have proved to be very suitable for C.S.R. as they meet the main requirements of phonetic units: invariance with the context and economy. Furthermore the performance of diphone-based speaker dependent C.S.R. systems is very high. For a long time manual extraction has been presented in the literature as the only completely reliable method for sub-word template creation for any speaker (see [1] as an example). Recently some automatic techniques for reference pattern extraction were developed [2,3], but they also require some manual corrections. The A.D.B. procedure operates without any manual intervention and performed very well for all the speakers on which it was tested. In a connected digit recognition task, a W.R.R. of 98.79% was achieved by using the speaker-adaptive templates created by the A.D.B. procedure.

Patent
01 Feb 1984
TL;DR: In this paper, a digital speech processor operates in parallel with a programmable digital computer to generate sequences of variable-length speech phrases and pauses at the request of the computer, using a separate command memory region, accessible to and loadable by the computer.
Abstract: A digital speech processor operates in parallel with a programmable digital computer to generate sequences of variable-length speech phrases and pauses at the request of the computer. A speech memory within the speech processor contains digitally-encoded speech data segments of varying length. A separate command memory region, accessible to and loadable by the computer, can be loaded with a plurality of commands. When sequentially executed by the speech processor, these commands cause the processor to generate an arbitrary sequence of spoken phrases and pauses without intervention by the computer. Each two-byte command causes the speech processor to retrieve from the speech memory a particular speech data segment and convert it into speech, or to pause for a time interval specified by a number within the command.
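
A sketch of the command interpreter the patent describes. The two-byte encoding below (high bit selects pause, remaining 15 bits carry a segment index or a pause duration) is invented for illustration; the patent does not fix a bit layout here.

```python
def run_speech_commands(commands, speech_memory, play, pause):
    """Execute a loaded command sequence without host intervention:
    each 16-bit command either plays a stored speech segment or
    pauses for a specified interval (encoding assumed)."""
    for cmd in commands:
        if cmd & 0x8000:
            pause(cmd & 0x7FFF)                   # pause, duration in ms
        else:
            play(speech_memory[cmd & 0x7FFF])     # play segment by index

run_speech_commands(
    [0x0001, 0x8064, 0x0002],                     # phrase 1, 100 ms pause, phrase 2
    {1: b"\x00" * 240, 2: b"\x00" * 320},
    play=lambda seg: print("play", len(seg), "bytes"),
    pause=lambda ms: print("pause", ms, "ms"),
)
```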

Proceedings ArticleDOI
19 Mar 1984
TL;DR: An artificial speech recognition experiment is introduced as a convenient means of assessing alignment accuracy, and alignment accuracy is found to be improved considerably by applying certain speaker adaptation transformations to the synthetic speech.
Abstract: A capacity to carry out reliable automatic time alignment of synthetic speech to naturally produced speech offers potential benefits in speech recognition and speaker recognition as well as in synthesis itself. Phrase alignment experiments are described that indicate that alignment to synthetic speech is more difficult than alignment of speech from two natural speakers. An artificial speech recognition experiment is introduced as a convenient means of assessing alignment accuracy. By this measure, alignment accuracy is found to be improved considerably by applying certain speaker adaptation transformations to the synthetic speech, by modifying the spectrum similarity metric, and by generating the synthetic spectra directly from the control parameters using simplified excitation spectra. The improvements seem to limit, however, at a level below that found between natural speakers. It is conjectured that further improvement requires modifications to the synthesis rules themselves.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: It is shown that prosodic information, e.g., the rhythmic structure of an input word, its syllabic structure, voiced/unvoiced regions in the word and the temporal distribution of back/front vowels, nasals and liquids and glides, can be used effectively to select a substantially reduced subvocabulary of candidates, before any fine phonetic analysis is attempted to recognize the word.
Abstract: Prosodic information is believed to be valuable information in human speech perception, but speech recognition systems to date have largely been based on segmental spectral analysis. In this paper I describe parts of a front end to a very-large-vocabulary isolated word recognition system using prosodic information. The present front end is template independent (speaker training for large vocabulary systems (> 20,000 words) is undesirable) and makes use of robust cues in the incoming speech to obtain a presorted vocabulary of candidates. It is shown that prosodic information, e.g., the rhythmic structure of an input word, its syllabic structure, voiced/unvoiced regions in the word and the temporal distribution of back/front vowels, nasals and liquids and glides, can be used effectively to select a substantially reduced subvocabulary of candidates, before any fine phonetic analysis is attempted to recognize the word.
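
A toy version of the presorting step: coarse, template-independent features index into the vocabulary, and fine phonetic analysis is only attempted on the survivors. The lexicon format and features are simplified assumptions.

```python
def presort(lexicon, n_syllables, voicing_pattern):
    """Keep only words whose stored prosodic description matches the
    coarse cues measured on the input word."""
    return [w for w, f in lexicon.items()
            if f["syl"] == n_syllables and f["voicing"] == voicing_pattern]

lexicon = {
    "seven": {"syl": 2, "voicing": "UV"},   # unvoiced onset, voiced rest
    "three": {"syl": 1, "voicing": "UV"},
    "nine":  {"syl": 1, "voicing": "V"},
}
print(presort(lexicon, 1, "UV"))            # -> ['three']
```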

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A concatenation of small sound segments is used to assemble a sentence along with duration and pitch contour extracted from a natural utterance of the same sentence to develop a speech coding system working at a very low bit rate.
Abstract: The purpose of this work was to develop a speech coding system working at a very low bit rate (100 bits/s) and capable of reproducing natural-sounding speech. The approach chosen was to use a concatenation of small sound segments to assemble a sentence, along with duration and pitch contour extracted (by Dynamic Time Warping) from a natural utterance of the same sentence. The segmental aspect is coded at the phonological level, thus leading to a bit rate around 60 bits/s. The decoding of the phonemes is obtained by a set of rules operating on phonetic features. The prosodic aspect is coded using stored pitch contours and duration patterns, thus leading to a bit rate around 40 bits/s.
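
A back-of-the-envelope check of the quoted budget; the per-phoneme figures below are assumptions, not from the paper.

```python
phoneme_rate = 12        # phonemes per second, typical of read speech (assumed)
bits_per_phoneme = 5     # enough for a ~32-symbol phoneme inventory (assumed)
segmental = phoneme_rate * bits_per_phoneme      # ~60 bits/s, as quoted
prosody = 40             # stored pitch-contour and duration indices, as quoted
print(segmental, "+", prosody, "=", segmental + prosody, "bits/s")  # 100 bits/s
```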

Journal ArticleDOI
TL;DR: A recognition scheme which adapts itself to mild degradations in speech and improves the reliability of recognition significantly is proposed, and techniques which adaptively discriminate between noisy and noise-free parameters by using a selective weighting procedure in the final distance calculation are suggested.

Journal ArticleDOI
TL;DR: An algorithm is proposed which will obtain, from an input speech signal, formant parameter data to control a parallel formant speech synthesiser by allowing some delay and employing variable-frame-rate techniques.
Abstract: An algorithm is proposed which will obtain, from an input speech signal, formant parameter data to control a parallel formant speech synthesiser. By allowing some delay and employing variable-frame-rate techniques, the parameter data can be obtained at a low frame rate (typically 20 frames per second) suitable for transmission or storage.
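
A sketch of the variable-frame-rate idea: a frame of formant parameters is transmitted only when it has moved far enough from the last transmitted frame. The distance measure and threshold are assumptions.

```python
import numpy as np

def select_frames(frames, thresh):
    """Return indices of frames to transmit or store: the first frame,
    then any frame whose parameters differ from the last kept frame
    by more than thresh (max absolute difference)."""
    kept = [0]
    for i in range(1, len(frames)):
        diff = np.abs(np.asarray(frames[i]) - np.asarray(frames[kept[-1]]))
        if np.max(diff) > thresh:
            kept.append(i)
    return kept
```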

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A Speaker Recognizability Test (SRT) is presented, which tries to establish how well a given communications system preserves a speaker's identity.
Abstract: Speech intelligibility and quality are the two most often tested features of speech coding systems. However, another feature of interest in store-and-forward applications is the preservation of a speaker's identity. Here, a Speaker Recognizability Test (SRT) is presented, which tries to establish how well a given communications system preserves a speaker's identity. Contrary to previous efforts, no attempt is made to identify the cues used by listeners for speaker recognition. Instead, listeners are asked directly to identify a speaker who says an utterance by comparing the uttered sentence with reference sentences, one from each speaker. Among the issues considered in the design of the test is the choice of speakers, the use of reference sentences from the same or different sessions of data collection, and the use of processed or unprocessed speech for reference.

Journal ArticleDOI
TL;DR: A microprocessor-based speech acquisition and processing system which uses waveform analysis techniques to extract measurements from the acoustic signal and operates in "real time" and employs noninvasive data-capturing techniques.
Abstract: Durational measurements of frication, aspiration, prevoicing, and voice onset are often difficult to perform from the spectrogram, and the resolution is limited to about 5 ms. In many instances, a ...

Journal ArticleDOI
TL;DR: This paper deals with the problems associated with full-duplex scrambled speech communications over analog two-wire telephone networks by analyzing the situation in which the two users speak at the same time and each should hear the other.
Abstract: This paper deals with the problems associated with full-duplex scrambled speech communications over analog two-wire telephone networks. The goal is not to describe particular scrambling systems or methods, but to analyze the situation in which the two users speak at the same time and each should hear the other. Current telephone scrambling devices preclude this feature. A general half-duplex scrambling model is described as a base to the discussion of pseudo- and true full-duplex communications. Finally, a novel true full-duplex scrambling architecture based on a paper by Cox and Tribolet [11] is presented.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: A speaker adaptation method that follows two steps -- selection of "persons" who have voices similar to the user's and generation of a speaker-adapted dictionary from their dictionaries is studied.
Abstract: A speaker-trained voice recognition system with a large vocabulary has a serious weak point, that is, the user must register a large number of words prior to its use. To be freed from this problem, the authors have studied a speaker adaptation method. This method follows two steps -- 1) selection of "persons" who have voices similar to the user's and 2) generation of a speaker-adapted dictionary from their dictionaries. Results of simulation using 1000-word speech samples by 40 male speakers (20 for standard dictionaries and 20 for performance evaluation) are reported. The results indicated the advantage of this method. The speaker-trained dictionary gave 90.1% recognition accuracy, the speaker-independent dictionary gave 83.6%, and the speaker-adapted dictionary which required only 10% of the vocabulary for training gave 85.7%.
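
A sketch of the two steps under simplifying assumptions: every dictionary maps the same words to fixed-length feature vectors, speaker similarity is summed Euclidean distance over the user's few training words, and the adapted dictionary is a plain average of the k closest speakers' entries.

```python
import numpy as np

def adapt_dictionary(user_words, speaker_dicts, k=3):
    """Step 1: rank stored speakers by distance between the user's
    training words and the same words in each speaker's dictionary.
    Step 2: average the k closest dictionaries into an adapted one."""
    def dist(d):
        return sum(np.linalg.norm(np.asarray(d[w]) - np.asarray(v))
                   for w, v in user_words.items())
    closest = sorted(speaker_dicts, key=dist)[:k]
    return {w: np.mean([np.asarray(d[w]) for d in closest], axis=0)
            for w in closest[0]}
```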

Journal ArticleDOI
TL;DR: The voice characteristics conversion technique described herein is best suited for speech systems, either text-to-speech or analysis-synthesis (meaning record-playback), that use LPC synthesizers.
Abstract: Text-to-speech systems available today generate virtually unlimited speech from a prestored library of component sounds, which is frequently created from a male voice. A major reason is the low pitch profile of the male voice; for example, speech analysis software and commercial synthesizers seem to work more accurately with the male voice than with the female voice. To overcome the analysis and synthesis problems with the female voice, and to provide a choice of more than one sex of voice output from a text-to-speech system, some means of voice characteristics conversion is needed. To do so, it is important to first understand which parameters in speech define the perception of sex. An attempt is made herein to first study these parameters and then to learn to adjust them so that the voice of one sex can be converted to another, or from one voice character to another. The voice characteristics conversion technique described herein is best suited for speech systems, either text-to-speech or analysis-synthesis (meaning record-playback), that use LPC synthesizers.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: Modifications are presented which improve the quality of the synthesized speech without requiring the transmission of any additional data and show an increase of up to 5 points in overall speech quality with the implementation of each of these improvements.
Abstract: The major weakness of the current narrowband LPC synthesizer lies in the use of a "canned" invariant excitation signal. The use of such an excitation signal is based on three primary assumptions, namely, (1) that the amplitude spectrum of the excitation signal is flat and time-invariant, (2) that the phase spectrum of the voiced excitation signal is a time-invariant function of frequency, and (3) that the probability density function (PDF) of the phase spectrum of the unvoiced excitation signal is also time-invariant. This paper critically examines these assumptions and presents modifications which improve the quality of the synthesized speech without requiring the transmission of any additional data. Diagnostic Acceptability Measure (DAM) tests show an increase of up to 5 points in overall speech quality with the implementation of each of these improvements.

Proceedings ArticleDOI
01 Mar 1984
TL;DR: This paper proposes a recognition scheme which adapts itself to mild degradations in speech, and suggests techniques which adaptively discriminate between noisy and noise-free parameters by using a selective weighting procedure in the final distance calculations.
Abstract: The performance of an isolated word speech recognition (IWSR) system is known to drop rapidly with increasing degradation of the input speech. In this paper we propose a recognition scheme which adapts itself to mild degradations in speech. The scheme does not need a priori information regarding the nature and extent of the noise. We suggest techniques which adaptively discriminate between noisy and noise-free parameters by using a selective weighting procedure in the final distance calculations. A suitable index is used to study the performance of the recognition system for small data sets. Our scheme lends itself to greater flexibility in handling degradations in the speech input than do the existing recognition schemes. We illustrate our scheme by simulating adaptive differential pulse code modulated (ADPCM) speech, where the main distortion is contributed by the quantization noise.
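
One plausible realization of the selective weighting, assuming a per-parameter noise-variance estimate is available: noisy dimensions are down-weighted in the final distance. Inverse-variance weighting is an assumption, not the paper's exact rule.

```python
import numpy as np

def weighted_distance(test, ref, noise_var):
    """Weighted squared distance in which parameters judged noisy
    (large estimated noise variance) contribute less."""
    w = 1.0 / (1.0 + np.asarray(noise_var, dtype=float))
    diff = np.asarray(test, dtype=float) - np.asarray(ref, dtype=float)
    return float(np.sum(w * diff ** 2) / np.sum(w))

# A dimension known to be noisy barely affects the distance.
print(weighted_distance([1.0, 5.0], [1.1, 0.0], noise_var=[0.0, 100.0]))
```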
