Journal ArticleDOI

A Low-Power Speech Recognizer and Voice Activity Detector Using Deep Neural Networks

TL;DR: The authors argue that VADs should prioritize accuracy over area and power, and introduce a VAD circuit that uses an NN to classify modulation frequency features with 22.3-µW power consumption.
Abstract: This paper describes digital circuit architectures for automatic speech recognition (ASR) and voice activity detection (VAD) with improved accuracy, programmability, and scalability. Our ASR architecture is designed to minimize off-chip memory bandwidth, which is the main driver of system power consumption. A SIMD processor with 32 parallel execution units efficiently evaluates feed-forward deep neural networks (NNs) for ASR, limiting memory usage with a sparse quantized weight matrix format. We argue that VADs should prioritize accuracy over area and power, and introduce a VAD circuit that uses an NN to classify modulation frequency features with 22.3-µW power consumption. The 65-nm test chip is shown to perform a variety of ASR tasks in real time, with vocabularies ranging from 11 words to 145,000 words and full-chip power consumption ranging from 172 µW to 7.78 mW.
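
The bandwidth-saving idea in this abstract, evaluating feed-forward NN layers from a sparse, quantized weight representation, can be sketched in a few lines. This is a generic CSR-plus-codebook layout with hypothetical names, not the chip's actual weight format:

```python
import numpy as np

def compress(W, n_levels=16, threshold=0.05):
    """Quantize surviving weights to a small codebook and store them in CSR form."""
    mask = np.abs(W) > threshold
    values = W[mask]
    # Uniform codebook over the surviving weight range (the chip trains its own).
    codebook = np.linspace(values.min(), values.max(), n_levels)
    codes = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)
    col_idx = np.nonzero(mask)[1].astype(np.uint16)
    row_ptr = np.concatenate(([0], np.cumsum(mask.sum(axis=1)))).astype(np.uint32)
    return codebook, codes, col_idx, row_ptr

def sparse_matvec(codebook, codes, col_idx, row_ptr, x):
    """y = W @ x using only codebook lookups -- the kernel each SIMD lane evaluates."""
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        lo, hi = row_ptr[r], row_ptr[r + 1]
        y[r] = np.dot(codebook[codes[lo:hi]], x[col_idx[lo:hi]])
    return y

W = np.random.randn(32, 128) * (np.random.rand(32, 128) > 0.7)  # a sparse layer
x = np.random.randn(128)
y = sparse_matvec(*compress(W), x)
```

Storing a 4-bit code per nonzero instead of a full-precision weight is what shrinks the off-chip traffic the abstract identifies as the main power driver.
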
Citations
Proceedings ArticleDOI
04 Apr 2019
TL;DR: SONIC, an intermittence-aware software system with specialized support for DNN inference, is designed and implemented; it introduces loop continuation, a new technique that dramatically reduces the cost of guaranteeing correct intermittent execution for loop-heavy code like DNN inference.
Abstract: Energy-harvesting technology provides a promising platform for future IoT applications. However, since communication is very expensive in these devices, applications will require inference "beyond the edge" to avoid wasting precious energy on pointless communication. We show that application performance is highly sensitive to inference accuracy. Unfortunately, accurate inference requires large amounts of computation and memory, and energy-harvesting systems are severely resource-constrained. Moreover, energy-harvesting systems operate intermittently, suffering frequent power failures that corrupt results and impede forward progress. This paper overcomes these challenges to present the first full-scale demonstration of DNN inference on an energy-harvesting system. We design and implement SONIC, an intermittence-aware software system with specialized support for DNN inference. SONIC introduces loop continuation, a new technique that dramatically reduces the cost of guaranteeing correct intermittent execution for loop-heavy code like DNN inference. To build a complete system, we further present GENESIS, a tool that automatically compresses networks to optimally balance inference accuracy and energy, and TAILS, which exploits SIMD hardware available in some microcontrollers to improve energy efficiency. Both SONIC & TAILS guarantee correct intermittent execution without any hand-tuning or performance loss across different power systems. Across three neural networks on a commercially available microcontroller, SONIC & TAILS reduce inference energy by 6.9× and 12.2×, respectively, over the state-of-the-art.
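
The key technique named here, loop continuation, keeps loop state in nonvolatile memory so inference resumes mid-loop after a power failure. A toy model of the idea (Python standing in for FRAM on a microcontroller; names hypothetical, not the actual SONIC API):

```python
import random

# Toy loop continuation: the loop index and partial sum live in "nonvolatile"
# storage, so a power failure mid-loop costs at most one iteration instead of
# forcing the whole dot product to restart from scratch.
nonvolatile = {"i": 0, "acc": 0.0}   # stand-in for FRAM

def dot_with_continuation(w, x):
    while nonvolatile["i"] < len(w):
        i = nonvolatile["i"]
        acc = nonvolatile["acc"] + w[i] * x[i]
        # Commit index and accumulator together so re-execution stays consistent.
        nonvolatile["acc"], nonvolatile["i"] = acc, i + 1
        if random.random() < 0.1:
            raise RuntimeError("power failure")   # device browns out
    return nonvolatile["acc"]

w, x = [0.5] * 100, [2.0] * 100
while True:
    try:
        print(dot_with_continuation(w, x))   # 100.0 once enough energy arrives
        break
    except RuntimeError:
        pass   # harvester recharges; execution resumes mid-loop
```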

121 citations

Journal ArticleDOI
TL;DR: A modified version of rVAD is presented in which computationally intensive pitch extraction is replaced by a computationally efficient spectral flatness calculation; this significantly reduces computational complexity at the cost of moderately lower VAD performance, an advantage when processing large amounts of data or running on low-resource devices.

90 citations


Cites background from "A Low-Power Speech Recognizer and Voice Activity Detector Using Deep Neural Networks"

  • ...Voice activity detection (VAD), also called speech activity detection (SAD), is widely used in real-world speech systems for improving robustness against additive noises or discarding the non-speech part of a signal to reduce the computational cost of downstream processing [1]....

Proceedings ArticleDOI
23 Jul 2019
TL;DR: A supervised method measures the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM); with spectral clustering applied on top, the system significantly outperforms state-of-the-art methods and achieves a diarization error rate of 6.63% on the NIST SRE 2000 CALLHOME database.
Abstract: More and more neural network approaches have achieved considerable improvement upon submodules of speaker diarization system, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional algorithms like probabilistic linear discriminant analysis (PLDA) are widely used for scoring the similarity between two speech segments. In this paper, we propose a supervised method to measure the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM). Spectral clustering is applied on top of the similarity matrix to further improve the performance. Experimental results show that our system significantly outperforms the state-of-the-art methods and achieves a diarization error rate of 6.63% on the NIST SRE 2000 CALLHOME database.
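
The clustering stage described here can be reproduced in miniature: given a segment-similarity matrix such as the Bi-LSTM would produce (synthetic here), spectral clustering over the precomputed affinities recovers the speaker grouping. A minimal sketch using scikit-learn:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Stand-in for Bi-LSTM similarity scores between 10 speech segments:
# two speakers (segments 0-4 vs 5-9) plus noisy off-diagonal similarity.
rng = np.random.default_rng(0)
S = rng.uniform(0.0, 0.3, size=(10, 10))
S[:5, :5] += 0.6
S[5:, 5:] += 0.6
S = (S + S.T) / 2            # symmetrize; spectral clustering expects this
np.fill_diagonal(S, 1.0)

# Cluster directly on the precomputed affinity matrix, as the paper does
# after the Bi-LSTM scoring stage (number of speakers assumed known here).
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
print(labels)                # segments grouped by speaker, e.g. [0 0 0 0 0 1 1 1 1 1]
```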

89 citations


Cites background from "A Low-Power Speech Recognizer and Voice Activity Detector Using Deep Neural Networks"

  • ...In this paper, an oracle VAD is employed to remove nonspeech regions in audios....

  • ...Since an oracle VAD is employed in our implementation, we exclude FA and Miss from our evaluations....

  • ...First, a voice activity detector (VAD) [3] removes nonspeech regions from the audio input....

  • ...DER consists of three components: false alarm (FA), missed detection (Miss), and speaker confusion, among which FA and Miss are mostly caused by VAD errors....

Posted Content
TL;DR: SONIC is an intermittence-aware software system with specialized support for DNN inference; it introduces loop continuation, a new technique that dramatically reduces the cost of guaranteeing correct intermittent execution for loop-heavy code like DNN inference, and is complemented by GENESIS, a tool that automatically compresses networks to optimally balance inference accuracy and energy.

69 citations

Journal ArticleDOI
05 Jul 2018
TL;DR: The design, fabrication, evaluation, and use of a self-powered microphone that is thin, flexible, and easily manufactured is demonstrated; it takes advantage of the triboelectric nanogenerator to transform vibrations into an electric signal without an external power source.
Abstract: We demonstrate the design, fabrication, evaluation, and use of a self-powered microphone that is thin, flexible, and easily manufactured. Our technology is referred to as a Self-powered Audio Triboelectric Ultra-thin Rollable Nanogenerator (SATURN) microphone. This acoustic sensor takes advantage of the triboelectric nanogenerator (TENG) to transform vibrations into an electric signal without applying an external power source. The sound quality of the SATURN mic, in terms of acoustic sensitivity, frequency response, and directivity, is affected by a set of design parameters that we explore based on both theoretical simulation and empirical evaluation. The major advantage of this audio material sensor is that it can be manufactured simply and deployed easily to convert every-day objects and physical surfaces into microphones which can sense audio. We explore the space of potential applications for such a material as part of a self-sustainable interactive system.

48 citations


Cites background from "A Low-Power Speech Recognizer and Voice Activity Detector Using Deep Neural Networks"

  • ...To take advantage of the SATURN microphone as a self-powered sensor with high acoustic sensitivity, we should either connect it to a low-power processor [13, 49, 55], which allows both operation and recognition of sound at a few tens of microwatts as shown in Figure 29, or send the audio to a remote base station for recognition using analog backscatter [61], which would consume only a few microwatts that can be harvested from the environment....

References
Journal ArticleDOI
TL;DR: This article provides an overview of recent progress in using deep neural networks (DNNs) for acoustic modeling in speech recognition and represents the shared views of four research groups that have had recent successes with these methods.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
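
The hybrid approach this overview describes, a feed-forward network mapping a window of acoustic frames to posterior probabilities over HMM states, can be sketched with untrained stand-in weights (a toy illustration, not any group's actual model):

```python
import numpy as np

def splice(frames, context=5):
    """Stack +/-context neighboring frames, the usual DNN acoustic-model input."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Untrained stand-in weights; a real system learns these with backpropagation.
rng = np.random.default_rng(0)
n_feats, n_states, context = 13, 200, 5
W1 = rng.normal(0, 0.1, (n_feats * (2 * context + 1), 512))
W2 = rng.normal(0, 0.1, (512, n_states))

frames = rng.normal(size=(100, n_feats))               # e.g. 100 cepstral frames
hidden = np.maximum(splice(frames, context) @ W1, 0)   # one ReLU hidden layer
posteriors = softmax(hidden @ W2)                      # P(state | acoustics)

# Hybrid decoding uses scaled likelihoods: P(acoustics | state) is proportional
# to posterior / prior; priors come from training alignments in practice.
priors = np.full(n_states, 1.0 / n_states)
scaled_likelihoods = posteriors / priors
```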

9,091 citations


Additional excerpts

  • ...ASR due to their improved accuracy [14]....

Journal ArticleDOI
TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
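
The winning representation in this comparison, mel-frequency cepstrum coefficients, follows a standard pipeline: framing, FFT power spectrum, mel filterbank, log, and DCT. A self-contained sketch whose parameters echo the paper's ten coefficients every 6.4 ms; the remaining analysis settings are illustrative assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n_fft=512, hop=102, n_mels=26, n_ceps=10):
    """Textbook MFCC pipeline; a 6.4 ms hop at 16 kHz is ~102 samples."""
    # Frame and window the signal.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular filterbank spaced evenly on the mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log mel energies; the DCT decorrelates them into cepstral coefficients.
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]

coeffs = mfcc(np.random.randn(16000))   # 1 s of noise -> (n_frames, 10)
```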

4,822 citations

Journal ArticleDOI
TL;DR: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs) that optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, by reconfiguring the architecture for various CNN shapes.
Abstract: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs). It optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, for various CNN shapes by reconfiguring the architecture. CNNs are widely used in modern AI systems but also bring challenges in throughput and energy efficiency to the underlying hardware, because their computation requires a large amount of data, creating significant on-chip and off-chip data movement that is more energy-consuming than the computation itself. Minimizing data movement energy cost for any CNN shape is therefore the key to high throughput and energy efficiency. Eyeriss achieves these goals by using a proposed processing dataflow, called row stationary (RS), on a spatial architecture with 168 processing elements. The RS dataflow reconfigures the computation mapping of a given shape, which optimizes energy efficiency by maximally reusing data locally to reduce expensive data movement, such as DRAM accesses. Compression and data gating are also applied to further improve energy efficiency. Eyeriss processes the convolutional layers at 35 frames/s and 0.0029 DRAM accesses per multiply-and-accumulate (MAC) for AlexNet at 278 mW (batch size N = 4), and at 0.7 frames/s and 0.0035 DRAM accesses per MAC for VGG-16 at 236 mW (N = 3).
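
The row-stationary intuition, holding a filter row in a processing element so weights are fetched from DRAM once and reused across output positions, can be illustrated with a counting sketch (a toy model, not Eyeriss's actual 168-PE mapping):

```python
# Slide one input row past a resident filter row; each word crosses the
# (modeled) DRAM boundary once, so DRAM reads per MAC fall well below 1.
filter_row, input_row = list(range(3)), list(range(16))
out_len = len(input_row) - len(filter_row) + 1

dram_reads = len(filter_row) + len(input_row)   # each word fetched once
local_filter = filter_row[:]                    # held in the PE register file
output = [sum(local_filter[k] * input_row[i + k] for k in range(3))
          for i in range(out_len)]

macs = out_len * len(filter_row)
print(f"{dram_reads / macs:.3f} DRAM reads per MAC")   # reuse amortizes fetches
```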

2,165 citations

Book
01 Jan 1997
TL;DR: A textbook covering the speech recognition problem, hidden Markov models, the acoustic model, basic language modelling, the Viterbi search, hypothesis search on a tree and the fast match, and elements of information theory.
Abstract: The speech recognition problem; hidden Markov models; the acoustic model; basic language modelling; the Viterbi search; hypothesis search on a tree and the fast match; elements of information theory; the complexity of tasks and the quality of language models; the expectation-maximization algorithm and its consequences; decision trees and tree language models; phonetics from orthography: spelling-to-baseform mappings; triphones and allophones; maximum entropy probability estimation and language models; three applications of maximum entropy estimation to language modelling; estimation of probabilities from counts and the back-off method.
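
Among the topics listed, the Viterbi search lends itself to a compact worked example: a generic textbook decoder over a two-state HMM (not the book's tree-structured hypothesis search):

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely HMM state sequence for the emission scores in log_B.
    log_A: (S, S) transition log-probs; log_B: (T, S) per-frame emission
    log-probs; log_pi: (S,) initial log-probs."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A       # scores[prev, next]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]

log_A = np.log([[0.9, 0.1], [0.2, 0.8]])
log_pi = np.log([0.5, 0.5])
log_B = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]))
print(viterbi(log_A, log_B, log_pi))          # e.g. [0, 0, 1, 1]
```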

2,153 citations


"A Low-Power Speech Recognizer and V..." refers methods in this paper

  • ...We provide a brief overview of the hidden Markov model (HMM) framework for ASR [3], [4]....

Proceedings ArticleDOI
23 Mar 1992
TL;DR: SWITCHBOARD is a large multispeaker corpus of conversational speech and text that should be of interest to researchers in speaker authentication and large-vocabulary speech recognition.
Abstract: SWITCHBOARD is a large multispeaker corpus of conversational speech and text which should be of interest to researchers in speaker authentication and large vocabulary speech recognition. About 2500 conversations by 500 speakers from around the US were collected automatically over T1 lines at Texas Instruments. Designed for training and testing of a variety of speech processing algorithms, especially in speaker verification, it has over an hour of speech from each of 50 speakers, and several minutes each from hundreds of others. A time-aligned word-for-word transcription accompanies each recording.

2,102 citations