Journal ArticleDOI

A Low-Power Speech Recognizer and Voice Activity Detector Using Deep Neural Networks

TL;DR: The authors argue that VADs should prioritize accuracy over area and power, and introduce a VAD circuit that uses an NN to classify modulation frequency features with 22.3-µW power consumption.
Abstract: This paper describes digital circuit architectures for automatic speech recognition (ASR) and voice activity detection (VAD) with improved accuracy, programmability, and scalability. Our ASR architecture is designed to minimize off-chip memory bandwidth, which is the main driver of system power consumption. A SIMD processor with 32 parallel execution units efficiently evaluates feed-forward deep neural networks (NNs) for ASR, limiting memory usage with a sparse quantized weight matrix format. We argue that VADs should prioritize accuracy over area and power, and introduce a VAD circuit that uses an NN to classify modulation frequency features with 22.3-µW power consumption. The 65-nm test chip is shown to perform a variety of ASR tasks in real time, with vocabularies ranging from 11 words to 145,000 words and full-chip power consumption ranging from 172 µW to 7.78 mW.
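
The bandwidth-saving idea in this abstract, evaluating feed-forward NN layers from a sparse, quantized weight representation, can be sketched in a few lines. This is a generic CSR-plus-codebook layout with hypothetical names, not the chip's actual weight format:

```python
import numpy as np

def compress(W, n_levels=16, threshold=0.05):
    """Quantize surviving weights to a small codebook and store them in CSR form."""
    mask = np.abs(W) > threshold
    values = W[mask]
    # Uniform codebook over the surviving weight range (the chip trains its own).
    codebook = np.linspace(values.min(), values.max(), n_levels)
    codes = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)
    col_idx = np.nonzero(mask)[1].astype(np.uint16)
    row_ptr = np.concatenate(([0], np.cumsum(mask.sum(axis=1)))).astype(np.uint32)
    return codebook, codes, col_idx, row_ptr

def sparse_matvec(codebook, codes, col_idx, row_ptr, x):
    """y = W @ x using only codebook lookups -- the kernel each SIMD lane evaluates."""
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        lo, hi = row_ptr[r], row_ptr[r + 1]
        y[r] = np.dot(codebook[codes[lo:hi]], x[col_idx[lo:hi]])
    return y

W = np.random.randn(32, 128) * (np.random.rand(32, 128) > 0.7)  # a sparse layer
x = np.random.randn(128)
y = sparse_matvec(*compress(W), x)
```

Storing a 4-bit code per nonzero instead of a full-precision weight is what shrinks the off-chip traffic the abstract identifies as the main power driver.
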
Citations
Proceedings ArticleDOI
04 Apr 2019
TL;DR: SONIC, an intermittence-aware software system with specialized support for DNN inference, is designed and implemented; it introduces loop continuation, a new technique that dramatically reduces the cost of guaranteeing correct intermittent execution for loop-heavy code like DNN inference.
Abstract: Energy-harvesting technology provides a promising platform for future IoT applications. However, since communication is very expensive in these devices, applications will require inference "beyond the edge" to avoid wasting precious energy on pointless communication. We show that application performance is highly sensitive to inference accuracy. Unfortunately, accurate inference requires large amounts of computation and memory, and energy-harvesting systems are severely resource-constrained. Moreover, energy-harvesting systems operate intermittently, suffering frequent power failures that corrupt results and impede forward progress. This paper overcomes these challenges to present the first full-scale demonstration of DNN inference on an energy-harvesting system. We design and implement SONIC, an intermittence-aware software system with specialized support for DNN inference. SONIC introduces loop continuation, a new technique that dramatically reduces the cost of guaranteeing correct intermittent execution for loop-heavy code like DNN inference. To build a complete system, we further present GENESIS, a tool that automatically compresses networks to optimally balance inference accuracy and energy, and TAILS, which exploits SIMD hardware available in some microcontrollers to improve energy efficiency. Both SONIC & TAILS guarantee correct intermittent execution without any hand-tuning or performance loss across different power systems. Across three neural networks on a commercially available microcontroller, SONIC & TAILS reduce inference energy by 6.9× and 12.2×, respectively, over the state-of-the-art.
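
The key technique named here, loop continuation, keeps loop state in nonvolatile memory so inference resumes mid-loop after a power failure. A toy model of the idea (Python standing in for FRAM on a microcontroller; names hypothetical, not the actual SONIC API):

```python
import random

# Toy loop continuation: the loop index and partial sum live in "nonvolatile"
# storage, so a power failure mid-loop costs at most one iteration instead of
# forcing the whole dot product to restart from scratch.
nonvolatile = {"i": 0, "acc": 0.0}   # stand-in for FRAM

def dot_with_continuation(w, x):
    while nonvolatile["i"] < len(w):
        i = nonvolatile["i"]
        acc = nonvolatile["acc"] + w[i] * x[i]
        # Commit index and accumulator together so re-execution stays consistent.
        nonvolatile["acc"], nonvolatile["i"] = acc, i + 1
        if random.random() < 0.1:
            raise RuntimeError("power failure")   # device browns out
    return nonvolatile["acc"]

w, x = [0.5] * 100, [2.0] * 100
while True:
    try:
        print(dot_with_continuation(w, x))   # 100.0 once enough energy arrives
        break
    except RuntimeError:
        pass   # harvester recharges; execution resumes mid-loop
```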

121 citations

Journal ArticleDOI
TL;DR: A modified version of rVAD is presented in which computationally intensive pitch extraction is replaced by a computationally efficient spectral flatness calculation; this significantly reduces computational complexity at the cost of moderately lower VAD performance, an advantage when processing large amounts of data or running on low-resource devices.

90 citations


Cites background from "A Low-Power Speech Recognizer and Voice Activity Detector Using Deep Neural Networks"

  • ...Voice activity detection (VAD), also called speech activity detection (SAD), is widely used in real-world speech systems for improving robustness against additive noises or discarding the non-speech part of a signal to reduce the computational cost of downstream processing [1]....

Proceedings ArticleDOI
23 Jul 2019
TL;DR: A supervised method measures the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM); with spectral clustering applied on top, the system significantly outperforms state-of-the-art methods and achieves a diarization error rate of 6.63% on the NIST SRE 2000 CALLHOME database.
Abstract: More and more neural network approaches have achieved considerable improvement upon submodules of speaker diarization system, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional algorithms like probabilistic linear discriminant analysis (PLDA) are widely used for scoring the similarity between two speech segments. In this paper, we propose a supervised method to measure the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM). Spectral clustering is applied on top of the similarity matrix to further improve the performance. Experimental results show that our system significantly outperforms the state-of-the-art methods and achieves a diarization error rate of 6.63% on the NIST SRE 2000 CALLHOME database.
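
The clustering stage described here can be reproduced in miniature: given a segment-similarity matrix such as the Bi-LSTM would produce (synthetic here), spectral clustering over the precomputed affinities recovers the speaker grouping. A minimal sketch using scikit-learn:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Stand-in for Bi-LSTM similarity scores between 10 speech segments:
# two speakers (segments 0-4 vs 5-9) plus noisy off-diagonal similarity.
rng = np.random.default_rng(0)
S = rng.uniform(0.0, 0.3, size=(10, 10))
S[:5, :5] += 0.6
S[5:, 5:] += 0.6
S = (S + S.T) / 2            # symmetrize; spectral clustering expects this
np.fill_diagonal(S, 1.0)

# Cluster directly on the precomputed affinity matrix, as the paper does
# after the Bi-LSTM scoring stage (number of speakers assumed known here).
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
print(labels)                # segments grouped by speaker, e.g. [0 0 0 0 0 1 1 1 1 1]
```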

89 citations


Cites background from "A Low-Power Speech Recognizer and Voice Activity Detector Using Deep Neural Networks"

  • ...In this paper, an oracle VAD is employed to remove nonspeech regions in audios....

  • ...Since an oracle VAD is employed in our implementation, we exclude FA and Miss from our evaluations....

  • ...First, a voice activity detector (VAD) [3] removes nonspeech regions from the audio input....

  • ...DER consists of three components: false alarm (FA), missed detection (Miss), and speaker confusion, among which FA and Miss are mostly caused by VAD errors....

Posted Content
TL;DR: SONIC is an intermittence-aware software system with specialized support for DNN inference; it introduces loop continuation, a new technique that dramatically reduces the cost of guaranteeing correct intermittent execution for loop-heavy code like DNN inference, and is complemented by GENESIS, a tool that automatically compresses networks to optimally balance inference accuracy and energy.

69 citations

Journal ArticleDOI
05 Jul 2018
TL;DR: The design, fabrication, evaluation, and use of a self-powered microphone that is thin, flexible, and easily manufactured is demonstrated; it takes advantage of the triboelectric nanogenerator to transform vibrations into an electric signal without an external power source.
Abstract: We demonstrate the design, fabrication, evaluation, and use of a self-powered microphone that is thin, flexible, and easily manufactured. Our technology is referred to as a Self-powered Audio Triboelectric Ultra-thin Rollable Nanogenerator (SATURN) microphone. This acoustic sensor takes advantage of the triboelectric nanogenerator (TENG) to transform vibrations into an electric signal without applying an external power source. The sound quality of the SATURN mic, in terms of acoustic sensitivity, frequency response, and directivity, is affected by a set of design parameters that we explore based on both theoretical simulation and empirical evaluation. The major advantage of this audio material sensor is that it can be manufactured simply and deployed easily to convert every-day objects and physical surfaces into microphones which can sense audio. We explore the space of potential applications for such a material as part of a self-sustainable interactive system.

48 citations


Cites background from "A Low-Power Speech Recognizer and Voice Activity Detector Using Deep Neural Networks"

  • ...To take advantage of the SATURN microphone as a self-powered sensor with high acoustic sensitivity, we should either connect it to a low-power processor [13, 49, 55], which allows both operation and recognition of sound at a few tens of microwatts as shown in Figure 29, or send the audio to a remote base station for recognition using analog backscatter [61], which would consume only a few microwatts that can be harvested from the environment....

References
Journal ArticleDOI
TL;DR: This article provides an overview of recent progress in using deep neural networks (DNNs) for acoustic modeling in speech recognition and represents the shared views of four research groups that have had recent successes with these methods.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
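
The hybrid approach this overview describes, a feed-forward network mapping a window of acoustic frames to posterior probabilities over HMM states, can be sketched with untrained stand-in weights (a toy illustration, not any group's actual model):

```python
import numpy as np

def splice(frames, context=5):
    """Stack +/-context neighboring frames, the usual DNN acoustic-model input."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Untrained stand-in weights; a real system learns these with backpropagation.
rng = np.random.default_rng(0)
n_feats, n_states, context = 13, 200, 5
W1 = rng.normal(0, 0.1, (n_feats * (2 * context + 1), 512))
W2 = rng.normal(0, 0.1, (512, n_states))

frames = rng.normal(size=(100, n_feats))               # e.g. 100 cepstral frames
hidden = np.maximum(splice(frames, context) @ W1, 0)   # one ReLU hidden layer
posteriors = softmax(hidden @ W2)                      # P(state | acoustics)

# Hybrid decoding uses scaled likelihoods: P(acoustics | state) is proportional
# to posterior / prior; priors come from training alignments in practice.
priors = np.full(n_states, 1.0 / n_states)
scaled_likelihoods = posteriors / priors
```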

9,091 citations


Additional excerpts

  • ...ASR due to their improved accuracy [14]....

Journal ArticleDOI
TL;DR: In this article, several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations.
Abstract: Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system. The vocabulary included many phonetically similar monosyllabic words, therefore the emphasis was on the ability to retain phonetically significant acoustic information in the face of syntactic and duration variations. For each parameter set (based on a mel-frequency cepstrum, a linear frequency cepstrum, a linear prediction cepstrum, a linear prediction spectrum, or a set of reflection coefficients), word templates were generated using an efficient dynamic warping method, and test data were time registered with the templates. A set of ten mel-frequency cepstrum coefficients computed every 6.4 ms resulted in the best performance, namely 96.5 percent and 95.0 percent recognition with each of two speakers. The superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
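
The winning representation in this comparison, mel-frequency cepstrum coefficients, follows a standard pipeline: framing, FFT power spectrum, mel filterbank, log, and DCT. A self-contained sketch whose parameters echo the paper's ten coefficients every 6.4 ms; the remaining analysis settings are illustrative assumptions:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n_fft=512, hop=102, n_mels=26, n_ceps=10):
    """Textbook MFCC pipeline; a 6.4 ms hop at 16 kHz is ~102 samples."""
    # Frame and window the signal.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular filterbank spaced evenly on the mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log mel energies; the DCT decorrelates them into cepstral coefficients.
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]

coeffs = mfcc(np.random.randn(16000))   # 1 s of noise -> (n_frames, 10)
```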

4,822 citations

Journal ArticleDOI
TL;DR: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs) that optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, by reconfiguring the architecture for various CNN shapes.
Abstract: Eyeriss is an accelerator for state-of-the-art deep convolutional neural networks (CNNs). It optimizes for the energy efficiency of the entire system, including the accelerator chip and off-chip DRAM, for various CNN shapes by reconfiguring the architecture. CNNs are widely used in modern AI systems but also bring challenges in throughput and energy efficiency to the underlying hardware, because their computation requires a large amount of data, creating significant on-chip and off-chip data movement that is more energy-consuming than the computation itself. Minimizing data movement energy cost for any CNN shape is therefore the key to high throughput and energy efficiency. Eyeriss achieves these goals by using a proposed processing dataflow, called row stationary (RS), on a spatial architecture with 168 processing elements. The RS dataflow reconfigures the computation mapping of a given shape, which optimizes energy efficiency by maximally reusing data locally to reduce expensive data movement, such as DRAM accesses. Compression and data gating are also applied to further improve energy efficiency. Eyeriss processes the convolutional layers at 35 frames/s and 0.0029 DRAM accesses per multiply-and-accumulate (MAC) for AlexNet at 278 mW (batch size N = 4), and at 0.7 frames/s and 0.0035 DRAM accesses per MAC for VGG-16 at 236 mW (N = 3).
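
The row-stationary intuition, holding a filter row in a processing element so weights are fetched from DRAM once and reused across output positions, can be illustrated with a counting sketch (a toy model, not Eyeriss's actual 168-PE mapping):

```python
# Slide one input row past a resident filter row; each word crosses the
# (modeled) DRAM boundary once, so DRAM reads per MAC fall well below 1.
filter_row, input_row = list(range(3)), list(range(16))
out_len = len(input_row) - len(filter_row) + 1

dram_reads = len(filter_row) + len(input_row)   # each word fetched once
local_filter = filter_row[:]                    # held in the PE register file
output = [sum(local_filter[k] * input_row[i + k] for k in range(3))
          for i in range(out_len)]

macs = out_len * len(filter_row)
print(f"{dram_reads / macs:.3f} DRAM reads per MAC")   # reuse amortizes fetches
```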

2,165 citations

Book
01 Jan 1997
TL;DR: A textbook covering the speech recognition problem, hidden Markov models, the acoustic model, basic language modelling, the Viterbi search, hypothesis search on a tree and the fast match, and elements of information theory.
Abstract: The speech recognition problem; hidden Markov models; the acoustic model; basic language modelling; the Viterbi search; hypothesis search on a tree and the fast match; elements of information theory; the complexity of tasks and the quality of language models; the expectation-maximization algorithm and its consequences; decision trees and tree language models; phonetics from orthography: spelling-to-baseform mappings; triphones and allophones; maximum entropy probability estimation and language models; three applications of maximum entropy estimation to language modelling; estimation of probabilities from counts and the back-off method.
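
Among the topics listed, the Viterbi search lends itself to a compact worked example: a generic textbook decoder over a two-state HMM (not the book's tree-structured hypothesis search):

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely HMM state sequence for the emission scores in log_B.
    log_A: (S, S) transition log-probs; log_B: (T, S) per-frame emission
    log-probs; log_pi: (S,) initial log-probs."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A       # scores[prev, next]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]

log_A = np.log([[0.9, 0.1], [0.2, 0.8]])
log_pi = np.log([0.5, 0.5])
log_B = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]))
print(viterbi(log_A, log_B, log_pi))          # e.g. [0, 0, 1, 1]
```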

2,153 citations


"A Low-Power Speech Recognizer and V..." refers methods in this paper

  • ...We provide a brief overview of the hidden Markov model (HMM) framework for ASR [3], [4]....

Proceedings ArticleDOI
23 Mar 1992
TL;DR: SWITCHBOARD is a large multispeaker corpus of conversational speech and text that should be of interest to researchers in speaker authentication and large-vocabulary speech recognition.
Abstract: SWITCHBOARD is a large multispeaker corpus of conversational speech and text which should be of interest to researchers in speaker authentication and large vocabulary speech recognition. About 2500 conversations by 500 speakers from around the US were collected automatically over T1 lines at Texas Instruments. Designed for training and testing of a variety of speech processing algorithms, especially in speaker verification, it has over an hour of speech from each of 50 speakers, and several minutes each from hundreds of others. A time-aligned word-for-word transcription accompanies each recording.

2,102 citations