Showing papers on "Voice activity detection" published in 1983


Proceedings ArticleDOI
Bishnu S. Atal
14 Apr 1983
TL;DR: The aim is to determine the extent to which the bit rate of LPC parameters can be reduced without sacrificing speech quality.
Abstract: This paper describes a method for efficient coding of LPC log area parameters. It is now well recognized that sample-by-sample quantization of LPC parameters is not very efficient in minimizing the bit rate needed to code these parameters. Recent methods for reducing the bit rate have used vector and segment quantization methods. Much of the past work in this area has focussed on efficient coding of LPC parameters in the context of vocoders which put a ceiling on achievable speech quality. The results from these studies cannot be directly applied to synthesis of high quality speech. This paper describes a different approach to efficient coding of log area parameters. Our aim is to determine the extent to which the bit rate of LPC parameters can be reduced without sacrificing speech quality. Speech events occur generally at non-uniformly spaced time intervals. Moreover, some speech events are slow while others are fast. Uniform sampling of speech parameters is thus not efficient. We describe a non-uniform sampling and interpolation procedure for efficient coding of log area parameters. A temporal decomposition technique is used to represent the continuous variation of these parameters as a linearly-weighted sum of a number of discrete elementary components. The location and length of each component is automatically adapted to speech events. We find that each elementary component can be coded as a very low information rate signal.

377 citations


Journal ArticleDOI
TL;DR: Large-scale packet speech multiplexing experiments could not be carried out on ARPANET or SATNET where the network link capacities severely restrict the number of speech users that can be accommodated, but experiments are currently being carried out using a wide-band satellite-based packet system designed to accommodate a sufficient number of simultaneous users to support realistic experiments in efficient statistical multiplexing.
Abstract: The integration of digital voice with data in a common packet-switched network system offers a number of potential benefits, including reduced systems cost through sharing of switching and transmission resources, flexible internetworking among systems utilizing different transmission media, and enhanced services for users requiring access to both voice and data communications. Issues which it has been necessary to address in order to realize these benefits include reconstitution of speech from packets arriving at nonuniform intervals, maximization of packet speech multiplexing efficiency, and determination of the implementation requirements for terminals and switching in a large-scale packet voice/data system. A series of packet speech systems experiments to address these issues has been conducted under the sponsorship of the Defense Advanced Research Projects Agency (DARPA). In the initial experiments on the ARPANET, the basic feasibility of speech communication on a store-and-forward packet network was demonstrated. Techniques were developed for reconstitution of speech from packets, and protocols were developed for call setup and for speech transport. Later speech experiments utilizing the Atlantic packet satellite network (SATNET) led to the development of techniques for efficient voice conferencing in a broadcast environment, and for internetting speech between a store-and-forward net (ARPANET) and a broadcast net (SATNET). Large-scale packet speech multiplexing experiments could not be carried out on ARPANET or SATNET where the network link capacities severely restrict the number of speech users that can be accommodated. However, experiments are currently being carried out using a wide-band satellite-based packet system designed to accommodate a sufficient number of simultaneous users to support realistic experiments in efficient statistical multiplexing.
Key developments to date associated with the wide-band experiments have been 1) techniques for internetting via voice/data gateways from a variety of local access networks (packet cable, packet radio, and circuit-switched) to a long-haul broadcast satellite network and 2) compact implementations of packet voice terminals with full protocol and voice capabilities. Basic concepts and issues associated with packet speech systems are described. Requirements and techniques for speech processing, voice protocols, packetization and reconstitution, conferencing, and multiplexing are discussed in the context of a generic packet speech system configuration. Specific experimental configurations and key packet speech results on the ARPANET, SATNET, and wide-band system are reviewed.

155 citations


Journal ArticleDOI
TL;DR: Discrete-time analysis of two schemes for multiplexing voice and data is presented, one without and one with speech activity detectors; with detectors, a tradeoff exists between data message delay and speech interpolation advantage.
Abstract: Discrete-time analysis of two schemes for multiplexing voice and data is presented. In each scheme voice and data are multiplexed using the movable boundary frame allocation scheme. In the first scheme, speech activity detectors (SAD's) are not used, and hence, the variations in the voice traffic are only due to the on/off characteristics of voice. In the second scheme, SAD's are employed so that talker silences can be utilized for transmission of additional voice and/or data. In this scheme, the multiplexer performs digital speech interpolation as well as movable boundary frame allocation. The performance measures considered are probability of loss for voice calls, probability of speech clipping, speech packet rejection ratio, and the expected data message delay. In the case of the multiplexer with SAD, a tradeoff exists between data message delay and speech interpolation advantage. Some numerical examples are presented which illustrate the performance of the two multiplexers.

128 citations
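
The movable boundary idea in the abstract above can be sketched per frame as follows. This is a minimal illustration, not the paper's analytical model: the function name, the parameter names, and the rule that data may borrow every idle voice slot are assumptions made for the example.

```python
def allocate_frame(frame_slots, voice_limit, voice_demand, data_queue):
    """Allocate one frame under a simple movable-boundary rule (hypothetical sketch).

    frame_slots  -- total slots in the frame
    voice_limit  -- maximum number of slots voice may occupy
    voice_demand -- voice sources wanting a slot in this frame
                    (with speech activity detection: talkers currently in talkspurt)
    data_queue   -- data packets waiting, one slot each
    Returns (voice_served, data_served, voice_clipped).
    """
    # Voice is served first, up to its boundary.
    voice_served = min(voice_demand, voice_limit)
    # Voice demand beyond the boundary is lost (clipped) in this frame.
    voice_clipped = voice_demand - voice_served
    # The boundary "moves": data may use every slot voice left idle.
    data_slots = frame_slots - voice_served
    data_served = min(data_queue, data_slots)
    return voice_served, data_served, voice_clipped
```

With 10 slots and a voice limit of 6, a frame with 4 active talkers leaves 6 slots for data, while a frame with 8 active talkers clips 2 of them and leaves only 4 slots for data.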


Journal ArticleDOI
TL;DR: The results of a new method based on rate-distortion speech coding (speech coding by vector quantization), minimum cross-entropy pattern classification, and information-theoretic spectral distortion measures for discrete utterance speech recognition are presented.
Abstract: The results of a new method are presented for discrete utterance speech recognition. The method is based on rate-distortion speech coding (speech coding by vector quantization), minimum cross-entropy pattern classification, and information-theoretic spectral distortion measures. Separate vector quantization code books are designed from training sequences for each word in the recognition vocabulary. Inputs from outside the training sequence are classified by performing vector quantization and finding the code book that achieves the lowest average distortion per speech frame. The new method obviates time alignment. It achieves 99 percent accuracy for speaker-dependent recognition of a 20-word vocabulary that includes the ten digits, with higher accuracy for recognition of the digit subset. For speaker-independent recognition, the method achieves 88 percent accuracy for the 20-word vocabulary and 95 percent for the digit subset. Background of the method, detailed empirical results, and an analysis of computational requirements are presented.

92 citations
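
The classification rule in the abstract above (quantize the input with each word's code book and pick the code book with the lowest average distortion per frame) can be sketched as below. This is only an illustration under stated assumptions: squared Euclidean distance stands in for the information-theoretic spectral distortion measures the paper uses, the code books are toy values rather than trained ones, and all names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_error(a: Vector, b: Vector) -> float:
    # Stand-in distortion measure (assumption): squared Euclidean distance.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_distortion(frames: List[Vector], codebook: List[Vector]) -> float:
    # Quantize each frame to its nearest code vector and average the distortion.
    total = sum(min(squared_error(f, c) for c in codebook) for f in frames)
    return total / len(frames)


def classify(frames: List[Vector], codebooks: Dict[str, List[Vector]]) -> str:
    # Choose the word whose code book gives the lowest average distortion per frame.
    return min(codebooks, key=lambda word: average_distortion(frames, codebooks[word]))


# Toy code books, one per vocabulary word (hypothetical values).
codebooks = {
    "yes": [(0.0, 0.0), (1.0, 1.0)],
    "no": [(5.0, 5.0), (6.0, 6.0)],
}
utterance = [(0.1, 0.0), (0.9, 1.1)]
label = classify(utterance, codebooks)
```

Because each frame is scored independently against the code book, the frames never need to be aligned in time, which is the property the abstract highlights.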


Journal ArticleDOI
TL;DR: A research effort to provide speech-carrying capabilities to a data-oriented packet-switching radio network is described, including a new protocol, duct routing, that uses repeater redundancy to compensate for loss of connectivity due to node mobility.
Abstract: A research effort to provide speech-carrying capabilities to a data-oriented packet-switching radio network is described. The features of the network that limit its ability to carry packetized speech are discussed, and their effects on the network performance are analyzed. A new protocol, called duct routing, that enhances the network capabilities in a mobile environment is presented. That protocol makes use of repeater redundancy to compensate for loss of communication connectivity due to node mobility. A series of experiments to evaluate the network performance in carrying speech traffic, both with data and voice protocols, is described, and the results are presented and discussed.

43 citations


Journal ArticleDOI
TL;DR: A new diversity technique is proposed to combat Rayleigh fading in digital mobile radio systems transmitting speech signals using μ-law PCM encoded speech signals, and a statistical error detection strategy is evoked to identify the erroneous samples.
Abstract: A new diversity technique is proposed to combat Rayleigh fading in digital mobile radio systems transmitting speech signals. The speech signals are μ-law PCM encoded (μ = 255, 8 kHz sampling, 8 bits/code word, 64 kbit/s data rate), and alternate data words are used to form two streams called "odd" and "even." The even stream is delayed by τ seconds and the streams are interleaved prior to radio transmission using two-level PSK modulation. At the receiver the odd data stream is delayed by τ and interleaved with the even stream. Consequently, if an error burst occurs, the effect of the reshuffling of the data stream is, in general, to place words with bit errors in juxtaposition to those correctly received. After μ-law PCM decoding of the words, a statistical error detection strategy is evoked to identify the erroneous samples. These samples are replaced by adjacent sample interpolation to give the recovered speech sequence. No recourse to channel protection coding is made. In our experiments a Rayleigh fading envelope was generated from a hardware simulator and stored in a computer, along with four sentences of speech. The system was then simulated and the recovered speech perceived. The objective performance measures were segmental SNR for the audio signal, and BER. Different error detection strategies were examined and restrictions on τ investigated. For a mobile speed of 30 mph, SNR values of 32, 21, and 16 dB were obtained for BER values of 0.1, 1, and 2 percent, corresponding to SNR gains over an uncorrected system of 3, 9, and 11 dB, respectively.

31 citations
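
The odd/even reshuffling and the interpolation step in the abstract above can be illustrated with a small sketch. It makes several assumptions that are not in the source: the delay τ is expressed as a whole number of words rather than seconds, the set of erroneous samples is given directly instead of coming from a statistical detector, and the function names are hypothetical.

```python
def transmit_order(n_words, delay_words):
    """Original word indices in the order they go on air (hypothetical sketch).

    The 1st, 3rd, 5th ... words form the "odd" stream, the rest the "even"
    stream; the even stream is delayed by delay_words before interleaving.
    """
    odd = list(range(0, n_words, 2))
    even = list(range(1, n_words, 2))
    order = []
    for k in range(max(len(odd), len(even) + delay_words)):
        if k < len(odd):
            order.append(odd[k])
        if 0 <= k - delay_words < len(even):
            order.append(even[k - delay_words])
    return order


def conceal(samples, bad_indices):
    """Replace flagged samples by the mean of their adjacent samples.

    Assumes the neighbours of every flagged sample were received correctly,
    which is what the reshuffling is meant to make likely.
    """
    out = list(samples)
    for i in bad_indices:
        neighbours = [samples[j] for j in (i - 1, i + 1) if 0 <= j < len(samples)]
        out[i] = sum(neighbours) / len(neighbours)
    return out


order = transmit_order(8, 2)
burst = order[2:4]                       # two consecutive words hit by a fade
repaired = conceal([0, 10, 99, 30, 40], {2})
```

In this toy case the two words hit by the burst are far apart in the original sequence, so each damaged sample still has correctly received neighbours to interpolate from.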


Proceedings ArticleDOI
01 Apr 1983
TL;DR: Results indicate that discrimination between similar sounding words can be greatly improved, and an alternative DTW approach which is able to focus its attention on those parts of a speech pattern which serve to distinguish it from similar patterns is presented.
Abstract: Whole-word pattern matching using dynamic time-warping (DTW) has achieved considerable success as an algorithm for automatic speech recognition. However, the performance of such an algorithm is ultimately limited by its inability to discriminate between similar sounding words. The problem arises because all differences between speech patterns are treated as being equally important, hence the algorithm is particularly susceptible to confusions caused by irrelevant differences. This paper presents an alternative DTW approach which is able to focus its attention on those parts of a speech pattern which serve to distinguish it from similar patterns. A network-type data structure is derived from reference speech patterns, and the separate paths through the network determine the regions where recognition takes place. Results indicate that discrimination between similar sounding words can be greatly improved.

23 citations
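
For context, conventional whole-word DTW, the baseline the abstract above starts from, can be sketched as below. This is the textbook recursion, not the network-based alternative the paper proposes; the absolute difference used as local cost and the one-dimensional feature sequences are simplifying assumptions.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best time alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] holds the best accumulated cost aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # local frame-to-frame cost
            d[i][j] = local + min(d[i - 1][j],      # a advances, b holds
                                  d[i][j - 1],      # b advances, a holds
                                  d[i - 1][j - 1])  # both advance
    return d[n][m]
```

Every local cost enters the sum with the same weight, so a difference in an irrelevant region counts as much as one in the region that actually separates two similar words, which is the limitation the abstract describes.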


Journal ArticleDOI
TL;DR: This work concludes that relative to conventional radio telephoning in which two channels are dedicated to each transmitter/receiver pair, a bandwidth reduction of 30-35 percent can be achieved.
Abstract: This paper presents the basic architecture and performance of a mobile radio multiaccess voice/data system. Natural pauses in conversational speech allow bandwidth saving through interleaving of data packets and talkspurts from different voice sources. A speech detector designed specifically for the mobile environment is presented. Blocking and delay performance of the multiaccess uplink is analyzed for voice traffic, assuming no traffic effects from the low priority data packets. Performance results from simulation are then presented for two downlink strategies in a two-hop virtual circuit in which a base station acts as a relay. The results verify also that the uplink analysis is valid for low voice traffic. For the data traffic, simulation results are presented in terms of data packet transmission delay and probability of collision with talkspurts. The results indicate that data flow may be limited by the collision factor. This work concludes that relative to conventional radio telephoning in which two channels are dedicated to each transmitter/receiver pair, a bandwidth reduction of 30-35 percent can be achieved.

22 citations


Journal ArticleDOI
TL;DR: A 2:1 compression and expansion system that has been used as part of a 9.6 kbit/s speech coder is discussed and it is shown that for all the compression/expansion ratios of interest the buffer size needed is twice the maximum pitch period.
Abstract: Time domain harmonic scaling (TDHS) has been realized in real time on the Bell Laboratories digital signal processing (DSP) integrated circuit. It is an algorithm that can expand or compress the bandwidth and sampling rate of speech by taking advantage of the pitch structure in the speech signal. As such it is useful in a variety of speech applications including speech coding, speech enhancement, and rate modification. A single DSP can perform compression and a second DSP can perform expansion. Both operations require pitch information to be supplied with the input speech. Included in the system is a real-time pitch/periodicity detector which has also been implemented on a single DSP. Its design is based on a novel modification of the autocorrelation function type pitch detector. This paper presents details of both the TDHS and pitch detector implementation and discusses their performances. In particular in this paper we discuss a 2:1 compression and expansion system that has been used as part of a 9.6 kbit/s speech coder. TDHS was previously thought to require a much larger buffer than the RAM memory available in the DSP. We show that for all the compression/expansion ratios of interest the buffer size needed is twice the maximum pitch period.

19 citations


Journal ArticleDOI
TL;DR: The letter considers an integrated voice-data system in which data are synchronously multiplexed in analogue speech and the behaviour of the data buffer is studied in a more realistic way than in previous analyses, which leads to considerably different results.
Abstract: The letter considers an integrated voice-data system in which data are synchronously multiplexed in analogue speech. The behaviour of the data buffer of this system is studied in a more realistic way than in previous analyses, which leads to considerably different results.

19 citations


Proceedings ArticleDOI
T. Iwata, H. Ishizuka, M. Watari, T. Hoshi, M. Mizuno 
01 Jan 1983
TL;DR: A single chip implementing a distance calculator, a dynamic programming equation calculator, and pipelined operations for use in speech recognition is discussed; up to 340 isolated words or 40 connected words can be recognized in real time.
Abstract: This report will discuss a single chip implementing a distance calculator, dynamic programming equation calculator and pipelined operations for use in speech recognition. Up to 340 isolated words or 40 connected words can be recognized in real time.

Journal ArticleDOI
TL;DR: In this article, the authors considered the possibility of introducing packetized voice traffic into a packet-switched network and proposed simplified protocols and priority rules for voice handling, which are compared by means of analytical tools and simulation experiments considering the presence of voice, interactive, and batch data packets.
Abstract: This paper considers the possibility of introducing packetized voice traffic into a packet-switched network. It is well known that the network must assure voice packets sufficient delay characteristics for conversational speech, i.e., low delay between speaker and listener and low delay jitter or variance. To reach these goals, simplified protocols and priority rules for voice handling are proposed and evaluated. A model of a packet switching node structure capable of handling both data and voice is derived for both analytical and simulation approaches. The use of low bit rate voice encoders is considered. The necessity of avoiding the transmission of silent intervals is discussed in relation to the behavior of packet voice receivers. Proposed strategies are compared by means of analytical tools and simulation experiments considering the presence of voice, interactive, and batch data packets.

Proceedings ArticleDOI
01 Apr 1983
TL;DR: It was found that the system consisting of the prefilter working in tandem with the word recognizer increased word recognition accuracy.
Abstract: A series of experiments were performed to determine (1) the effects of using an energy-based endpoint detector and a conventional isolated word recognition system when the input speech is noisy and (2) the effects of placing a noise suppression prefilter in tandem with the word recognizer in an attempt to remove the noise prior to recognition. It was found that the system consisting of the prefilter working in tandem with the word recognizer increased word recognition accuracy.

Dissertation
01 Jan 1983

PatentDOI
Goldstern Ernest
TL;DR: The transmitter's speech recognizer converts words or word groups of the speech into digital data words, which are then transmitted to the receiver in a highly redundant manner, for example by a thousandfold repetition.
Abstract: The transmitter comprises a speech recognizer which converts words or word groups of the speech into digital data words. The data words are transmitted to the receiver in a highly redundant manner, for example by a thousandfold repetition. The receiver comprises means for recovering the non-redundant original data words and also comprises a speech generating arrangement which converts these data words into corresponding speech (words). Use: radio communication.


Journal ArticleDOI
TL;DR: Results measured over 16 ms, a phoneme, and word durations indicate that the adaptive frequency mapping algorithm significantly enhances the recovered speech compared to telephonic speech.
Abstract: Telephone channels restrict the bandwidth of speech signals to approximately 0.3-3.3 kHz, with the consequence that the intelligibility of unvoiced sounds may be significantly impaired. To prevent this band limitation of unvoiced sounds while still confining the speech to the telephonic bandwidth, we propose a scheme which, on recognizing the presence of unvoiced sounds extending to 7.6 kHz, frequency maps them into the band 0.3-3.3 kHz. Four mapping laws are considered and the unvoiced speech is compressed using each law. Frequency demapping is employed, and the law that has the best spectral match to the speech spectrum is selected. Voiced speech is band limited from 0.3 to 3.3 kHz. Results measured over 16 ms, a phoneme, and word durations indicate that the adaptive frequency mapping algorithm significantly enhances the recovered speech compared to telephonic speech. Informal listening experiences support these findings.

Patent
21 Sep 1983
TL;DR: A digital speech processor operates in parallel with a programmable digital computer to generate sequences of variable-length speech phrases and pauses at the request of the computer; a speech memory region within the speech processor contains digitally-encoded speech data segments of varying length.
Abstract: A digital speech processor operates in parallel with a programmable digital computer to generate sequences of variable-length speech phrases and pauses at the request of the computer. A speech memory region within the speech processor contains digitally-encoded speech data segments of varying length. A separate command memory region can be loaded with a plurality of commands. When sequentially executed by the speech processor, these commands cause the processor to generate an arbitrary sequence of spoken phrases and pauses without intervention by the computer. When the programmable digital computer is not operating the speech processor to synthesize spoken words, the speech and command memory regions are used as auxiliary random access memory to increase the size of the memory space of the computer.

Proceedings ArticleDOI
01 Apr 1983
TL;DR: Results found are: (1) limited time sequence compression does not impose any negative effect on DP or its alternatives and (2) variable threshold scheme performs better than the fixed threshold scheme.
Abstract: This paper investigates the effect of LPC based time compression schemes on dynamic programming (DP) and its alternatives. Two compression schemes, one with fixed threshold and the other with variable threshold both incorporated with two control factors, the rate of frame overlap and the step of interframe interval, are investigated. The test speech is 40-word alpha-digit vocabulary pronounced by 10 males and 10 females. Results found are: (1) limited time sequence compression does not impose any negative effect on DP or its alternatives and (2) variable threshold scheme performs better than the fixed threshold scheme. More detailed discussion on the compression schemes and DP interaction are included.

Proceedings ArticleDOI
14 Apr 1983
TL;DR: This paper discusses the formulation of the problem, the techniques developed, and the results of a limited-scale intelligibility test, which indicate that no intelligibility improvement is obtained from the processing.
Abstract: The development and testing of an algorithm to enhance the intelligibility of speech degraded by an interfering talker are reported. This paper discusses the formulation of the problem, the techniques developed, and the results of a limited-scale intelligibility test. While the test results indicate that no intelligibility improvement is obtained from the processing, several promising new directions for this problem have been identified.

Journal ArticleDOI
TL;DR: A low-cost recognition system for isolated words and small vocabulary (typically 15 words) is described, with possibility of integration in a stand-alone small-size CMOS chip, very-low-power consumption, and automatic adaptation to the speaker without any tedious training mode.
Abstract: A low-cost recognition system for isolated words and small vocabulary (typically 15 words) is described. The main features of the system are: possibility of integration in a stand-alone small-size CMOS chip, very-low-power consumption (typically 200 µW at 3 V supply voltage), and automatic adaptation to the speaker without any tedious training mode.

Journal ArticleDOI
TL;DR: Results of the analysis show that even during a conversation many useful idle periods do occur, and these silences could indeed be exploited by making them available to additional users and hence improve both the efficiency of the channels and their congestion.
Abstract: Two methods of speech detection at the syllabic level of voice traffic over land mobile radio telephone channels are presented. Both methods are based on the periodic comparison of the audio signal level with a threshold and provide an ON-OFF pattern of active-idle periods on the channel at a rate equal to 100 to 200 samples per second. One method is entirely digital whereas the other method uses an analog detector followed by digital processing. Results of the analysis show that during a conversation more than 50 percent of the time the channel is idle, with an average duration of the silences larger than 300 ms. These results and those concerning other parameters of interest indicate that even during a conversation many useful idle periods do occur. These silences could indeed be exploited by making them available to additional users and hence improve both the efficiency of the channels and their congestion.
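
The threshold comparison described in the abstract above can be sketched as follows. It is a minimal illustration, assuming the audio level has already been measured at the detector rate (100 to 200 samples per second in the source) and that a single fixed threshold is used; the function names and the toy level values are hypothetical, and no hangover or smoothing is modelled.

```python
def on_off_pattern(levels, threshold):
    # True marks an active (ON) sample, False an idle (OFF) sample.
    return [level > threshold for level in levels]


def idle_statistics(pattern, rate_hz):
    """Return (fraction of time idle, mean silence duration in ms)."""
    if not pattern:
        return 0.0, 0.0
    runs = []          # lengths of consecutive idle runs, in samples
    run = 0
    for active in pattern:
        if not active:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    idle_fraction = pattern.count(False) / len(pattern)
    mean_silence_ms = 1000.0 * sum(runs) / len(runs) / rate_hz if runs else 0.0
    return idle_fraction, mean_silence_ms


# Toy level measurements at an assumed 100 samples per second.
levels = [0.9, 0.8, 0.1, 0.0, 0.1, 0.7, 0.1, 0.0, 0.0, 0.6]
pattern = on_off_pattern(levels, threshold=0.5)
idle_fraction, mean_silence_ms = idle_statistics(pattern, rate_hz=100)
```

The two returned figures correspond to the quantities the abstract reports for real traffic: the share of time the channel is idle and the average duration of the silences.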

Journal ArticleDOI
TL;DR: The way ahead lies in exploiting the new technologies of automatic speech generation and recognition; while these are still in their infancy, there is an opportunity to ensure that they are incorporated into networks in the most satisfactory way for both administrations and customers.
Abstract: Signalling between a telephone user and the automatic network by means of dialled digits and coded tones offers little more than basic service. A wider range of services is possible by adopting the more natural medium of speech. A historical perspective illustrates developments from the speaking clock to voice guidance and improved information services. The way ahead lies in exploiting the new technologies of automatic speech generation and recognition. While these are still in their infancy there is an opportunity to ensure that they are incorporated into networks in the most satisfactory way for both administrations and customers. This will entail research into speech algorithms, hardware technologies, systems and human factors, with increasing future emphasis upon the last two.

Proceedings ArticleDOI
01 Apr 1983
TL;DR: A methodology is described to obtain a set of segments and rules that represents adequately the speech performance of a given speaker and how such a segment data base can be used for speech coding at very low bit rate, synthesis from unrestricted text, and continuous speech recognition.
Abstract: A methodology is described to obtain a set of segments and rules that represents adequately the speech performance of a given speaker. This methodology proceeds from an initial set of diphones extracted from a neutral context and modifies this set with larger and/or smaller segments depending on the match with natural utterances. Each segment is stored as a sequence of frames coded using LPC coefficients. An estimate of the likelihood of timescale distortion is associated with each frame. It represents knowledge on temporal variability that can be used by synthesis rules and/or pattern matching algorithms. It is then shown how such a segment data base can be used for 1) speech coding at very low bit rate (∼400 bit/s), 2) synthesis from unrestricted text, 3) continuous speech recognition.

Journal ArticleDOI
B.S. Babu
TL;DR: This paper describes a 2400 bit/s vocoder based on spectral envelope estimation, spectral coding to 48 bits, pitch extraction, and decreasing-chirp excitation for voiced synthesis; the system yields high performance in a quiet environment and is robust in acoustic noise environments.
Abstract: This paper describes a 2400 bit/s vocoder based on spectral envelope estimation, spectral coding to 48 bits, pitch extraction, and decreasing-chirp excitation for voiced synthesis. Several spectral smoothing and coding schemes are described and intelligibility test results compared. This vocoder was implemented on the CSP-30 high speed digital processor at the RADC/EEV Speech Processing Research and Development Facility at Hanscom AFB, MA. This system yields high performance in a quiet environment and is robust in acoustic noise environments at a data rate of 2400 bits/s.


Proceedings ArticleDOI
12 Dec 1983
TL;DR: The discussion suggests that one to one replacement of keyboard with voice overlooks some possible advantages of voice, and suggests that it is also possible to find operators who work well with voice.
Abstract: The performance of two speech recognition systems installed at two field sites was analyzed. The speech systems were part of larger computer systems that were performing real functions in industrial environments. The two sites appeared to be polarized in terms of expected suitability for speech recognition. The variables looked at included task complexity, memory load, requirements for verification and error correction, vocabulary and syntax, microphone, operator experience and complexity of host computer software. Accuracy and throughput were measured for the speech recognition system at each site. The same measurements were made for keyboard entry. Operator differences account for most of the variance in results. Accuracy with voice input was higher than with keyboard for most operators. The most accurate operators with keyboard also tended to be the most accurate with voice. Throughput data appears more sensitive to individual differences in dealing with voice input, although the throughput data was clouded by slow host system response times overall. The discussion suggests that one to one replacement of keyboard with voice overlooks some possible advantages of voice. It is also possible to find operators who work well with voice. For those who do not work well with voice, the problems appear to be related to general work habits and attitude, rather than to specific difficulties with speech.


Journal ArticleDOI
TL;DR: The design objectives and test results of a new services integrated system using a packetized speech technique and an experimental system for confirming the choice of design parameters as well as measuring the speech and data communication quality are presented.
Abstract: The design objectives and test results of a new services integrated system using a packetized speech technique are presented. First, several important design parameters are investigated through theoretical analysis and computer simulation. Second, an experimental system for confirming the choice of design parameters as well as measuring the speech and data communication quality is described, and the test results of this system are discussed.