
A 90 nm CMOS, 6 µW Power-Proportional Acoustic Sensing Frontend for Voice Activity Detection

TL;DR: In this article, the authors presented a new power-proportional sensing paradigm and the use of machine-learning-assisted moderate-precision analog analytics for classification of speech and non-speech.
Abstract: This work presents a sub-6 µW acoustic frontend for speech/non-speech classification in a voice activity detection (VAD) system in 90 nm CMOS. Power consumption of the VAD system is minimized by architectural design around a new power-proportional sensing paradigm and the use of machine-learning-assisted moderate-precision analog analytics for classification. Power-proportional sensing allows for hierarchical and context-aware scaling of the frontend’s power consumption depending on the complexity of the ongoing information extraction, while the use of analog analytics brings increased power efficiency through switching on/off the computation of individual features depending on the features’ usefulness in a particular context. The proposed VAD system reduces the power consumption by 10× as compared to state-of-the-art (SotA) systems and yet achieves an 89% average hit rate (HR) for a 12 dB signal-to-acoustic-noise ratio (SANR) in babble context, which is on par with software-based VAD systems.

Summary (3 min read)

Introduction

  • Power consumption of the VAD system is minimized by architectural design around a new Power-Proportional sensing paradigm and the use of machine-learning assisted moderate-precision analog analytics for classification.
  • Power-Proportional sensing allows for hierarchical and context-aware scaling of the frontend’s power consumption depending on the complexity of the ongoing information extraction, while the use of analog analytics brings increased power efficiency through switching on/off the computation of individual features depending on the features’ usefulness in a particular context.
  • Technological innovations are changing the way people interact with electronic devices.
  • Yet, the information content in raw signals and its application relevance dynamically varies depending on the operating context.
  • VAD systems distinguish speech from non-speech in different background noise contexts for varying signal to acoustic noise ratios (SANR).

A. Power-Proportional Sensing

  • The core premise for Power-Proportional sensing is that power consumption of the sensing system scales proportionally with the complexity of the sensing task.
  • First, the amount of information extracted from the incoming signal can scale in complexity.
  • In such an architecture each processing stage extracts more complex information than the previous stage while consuming more power.
  • Context-awareness enables Power-Proportional sensing to scale power as the background noise context scales the complexity of information extraction, as shown in bold in Fig. 1.
  • SotA sensing systems do not exploit the power scaling opportunity offered by the above scenarios, and typically operate constantly in full processing mode.
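The hierarchical side of this idea can be illustrated with a small averaging model. This is only a sketch: the stage names follow the text, but the per-stage power values and the 20% activity fraction are illustrative assumptions (chosen to stay within the reported bounds of below 1 µW for detection and 6 µW in total), not measured data.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    power_uw: float         # power while the stage is awake (assumed value, in µW)
    active_fraction: float  # fraction of time the stage is awake (assumed value)


def average_power_uw(stages):
    """Time-averaged power when each stage only runs for its active fraction."""
    return sum(s.power_uw * s.active_fraction for s in stages)


# Hypothetical two-stage hierarchy: the detector is always awake,
# the feature extractor and classifier only wake up 20% of the time.
stages = [
    Stage("wakeup detector", power_uw=1.0, active_fraction=1.0),
    Stage("feature extraction + classifier", power_uw=5.0, active_fraction=0.2),
]

always_on_uw = sum(s.power_uw for s in stages)  # every stage permanently on
hierarchical_uw = average_power_uw(stages)      # later stages gated by earlier ones
print(f"always on: {always_on_uw:.1f} uW, hierarchical: {hierarchical_uw:.1f} uW")
```

Under these assumed numbers the average drops from 6 µW to 2 µW, because the more expensive stage only draws power when the cheaper stage has woken it up.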

B. Power Efficiency through Analog Analytics

  • The Power-Proportional sensing paradigm highlighted above needs hardware blocks whose power scales with complexity and precision.
  • Reduction in supply voltage due to technology scaling allows more power efficient digital circuits and questions the beneficial analog behavior in advanced technologies.
  • This is because with scaling, the cost of maintaining the same precision in analog increases, as a larger bias current is needed to reduce the noise floor and compensate for the reduction in signal swing.
  • Hence, absolute precision requirements for such systems are rather modest, and mismatches and offset impairments are automatically taken care of by the embedded trained classifier in the loop.
  • As demonstrated by this work, as well as some existing works, machine learning assisted [13, 14] and/or digital calibration [15] can improve SNR by 6 – 10 dB for comparable power which pushes the efficiency crossover point in the rightward direction as shown in Fig.

III. SYSTEM ARCHITECTURE AND SPECIFICATIONS

  • This section highlights the use of the aforementioned key principles in the developed VAD architecture [16] and derives the specifications for the analog/mixed-signal building blocks.
  • If the signal is speech, the classifier wakes up the microcontroller for more advanced processing.
  • This allows scaling the power with necessary information as outlined in Section II.
  • As further modelled in subsection B, considering that the analog feature-extraction blocks are in the loop during this training operation, all static analog impairments such as mismatch, gain errors, or offsets are absorbed in the trained feature thresholds and do not affect the classification accuracy.
  • Subsection B derives specifications for the targeted VAD system.

B. Specifications for VAD system

  • This section first derives the system level specifications and then the specifications for individual analog blocks.
  • Mathematically, each analog feature is defined as $af_i = \overline{\left| x(t) * h_i(t) \right|}$ (1), where $x(t)$ is the amplified acoustic signal, $h_i(t)$ is the impulse response of the band pass filter used to decompose the input signal into a smaller frequency band, and $|\cdot|$, $*$ and $\overline{(\cdot)}$ represent the absolute value, convolution, and averaging respectively.
  • The MATLAB model varies the number of computed features in the above frequency range by scaling the Q factor of the band pass filters.
  • The results of the above simulation are shown in Fig. 4(a).
  • It can be seen that more features improve classification accuracy, yet accuracy gains diminish beyond 16 features, allowing the design to be limited to a maximum of 16 individually (de)activated features.
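The feature definition above (band pass filtering, absolute value, averaging) can be mimicked in discrete time. The following is a minimal sketch, not the authors' MATLAB model: the windowed-sinc FIR filter, the 8 kHz sample rate, the band edges, and the 50 ms averaging window are all assumptions made for illustration.

```python
import numpy as np


def bandpass_fir(f_lo, f_hi, fs, taps=255):
    """Windowed-sinc band pass FIR: difference of two ideal low-pass responses."""
    n = np.arange(taps) - (taps - 1) / 2

    def lowpass(fc):
        return 2 * fc / fs * np.sinc(2 * fc / fs * n)

    return (lowpass(f_hi) - lowpass(f_lo)) * np.hamming(taps)


def analog_feature(x, h, avg_len):
    """Feature = average(|x * h|): filter, rectify, then moving average."""
    band = np.convolve(x, h, mode="same")    # convolution with the band filter
    rectified = np.abs(band)                 # absolute value
    window = np.ones(avg_len) / avg_len      # boxcar average
    return np.convolve(rectified, window, mode="same")


fs = 8000                                    # assumed sample rate in Hz
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)          # 1 kHz test tone

h_in = bandpass_fir(800, 1200, fs)           # hypothetical band containing the tone
h_out = bandpass_fir(2000, 2400, fs)         # hypothetical band without the tone

af_in = analog_feature(tone, h_in, avg_len=400)
af_out = analog_feature(tone, h_out, avg_len=400)
print(af_in[fs // 2], af_out[fs // 2])
```

The band that contains the tone produces a feature value near the mean of a rectified sine, while the other band stays close to zero, which is the kind of per-band energy estimate the classifier thresholds operate on.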

IV. SYSTEM IMPLEMENTATION

  • This section details the implementation nuances of the individual system blocks discussed in the previous section: namely the wakeup detector, the analog feature-extractor and the embedded mixed-signal classifier.
  • A further subsection discusses system training for the complete VAD system before discussing one-time calibration and measurement results in Section V.

A. Wakeup detector

  • The always-awake threshold-based wakeup detector acts as the system’s watch-dog that wakes up the analog feature-extractor only when a signal of sufficient strength is detected.
  • The wakeup detector is a low power 3-phase comparator and its schematic is shown in Fig.
  • Each amplifier is a PMOS input source-coupled single-ended differential amplifier and can be turned on/off individually to save power depending on the microphone’s signal-level and is designed to provide a mid-band gain of 20 dB.
  • Measured power consumption of this block is 700 nW when all four amplifier stages are turned on, and excluding the external bias.

B. Analog feature-extractor

  • On receiving the wakeup signal from the threshold-based wakeup detector, the analog feature-extractor decomposes the input signal into the set of 16 features.
  • This contributes to Power-Proportional information extraction, as it allows turning off amplifier stages of unused features along with all other circuitry involved in individual feature computation.
  • The sub-blocks of the analog feature-extractor are now explained in more detail.
  • To cover for this, the $f_{-3\,\text{dB}}$ of the amplifiers in each band also increases progressively from band 1 to band 16.
  • The architecture of the current-mode averaging is shown in Fig. 12.

C. Decision tree based classifier

  • The extracted feature subset, af5 - af12, is passed on to the on-chip classifier (Fig. 5) while the complete feature-set af1 - af16 can be passed on to an off-chip ADC for more complex information extraction, such as context-change detection and retraining the classifier as in [22].
  • In these cases, the Nyquist sampling rate for the features is only 16x2x16 = 512 Hz instead of 8 kHz for audio.
  • Each node of the decision tree can be configured to select one feature out of af5 - af12.
  • To this end, the on-chip decision tree classifier is trained with the authors' modified C4.5 algorithm with 160 s of labeled data from the standardized NOIZEUS database [23].
  • The authors' modification to C4.5 maximizes the information gain per watt and therefore outputs a resource-efficient model that maximizes the information capture while minimizing the power [22].
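The idea of ranking splits by information gain per watt can be sketched as follows. This is not the modified C4.5 algorithm of [22]; it is a simplified single-node split search, and the feature values, labels, and per-feature power costs are hypothetical.

```python
import math


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting at `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def best_split(features, labels, power_uw):
    """Return (score, feature, threshold) maximizing information gain per µW."""
    best = None
    for name, values in features.items():
        for thr in sorted(set(values))[:-1]:
            score = information_gain(values, labels, thr) / power_uw[name]
            if best is None or score > best[0]:
                best = (score, name, thr)
    return best


labels = [0, 0, 0, 1, 1, 1]                    # 0 = non-speech, 1 = speech
features = {
    "af5": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],     # separates the classes perfectly
    "af12": [0.1, 0.2, 0.6, 0.4, 0.8, 0.9],    # less informative
}
power_uw = {"af5": 0.5, "af12": 0.2}           # assumed cost of computing each feature

score, name, thr = best_split(features, labels, power_uw)
print(name, thr, round(score, 3))
```

With these made-up costs the cheaper feature is selected even though it separates the classes less cleanly, which is the trade-off a power-aware training criterion is meant to expose.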

B. System measurement results

  • The chip is integrated with the microcontroller using external level-shifters and DACs, to form the complete VAD.
  • Fig. 20 shows a one-time calibration to characterize for mismatch in the ADC and DAC paths.
  • This subsection also displays the classification accuracy results for the complete VAD system and illustrates the achieved Power-Proportionality.
  • Receiver operating characteristic (ROC) curves characterize the classifier systems and depict hit-rates (HR) for the variables under observation [24].
  • The power consumption for signal detection is measured to be below 1 µW, whereas power consumption for classification varies depending on complexity of the operating context and has an upper bound of 6 µW.
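For readers less familiar with the metric, one point on an ROC curve can be computed from per-frame labels as sketched below. This assumes the usual definitions (hit rate as the fraction of speech frames detected, false alarm rate as the fraction of non-speech frames flagged as speech); the paper's exact evaluation protocol is not reproduced here, and the label vectors are invented.

```python
def hit_and_false_alarm_rate(truth, decision):
    """Return (hit rate, false alarm rate) for 0/1 ground truth and decisions."""
    tp = sum(1 for t, d in zip(truth, decision) if t and d)
    fn = sum(1 for t, d in zip(truth, decision) if t and not d)
    fp = sum(1 for t, d in zip(truth, decision) if not t and d)
    tn = sum(1 for t, d in zip(truth, decision) if not t and not d)
    hit_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return hit_rate, false_alarm_rate


# Hypothetical frames: 1 = speech, 0 = non-speech.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
decision = [1, 1, 1, 0, 1, 0, 0, 0]

hr, far = hit_and_false_alarm_rate(truth, decision)
print(f"hit rate = {hr:.2f}, false alarm rate = {far:.2f}")
```

Sweeping the classifier's decision thresholds and plotting the resulting pairs gives the ROC curve.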

VI. CONCLUSIONS

  • This work demonstrates a power efficient acoustic sensing frontend for speech/non-speech classification in a voice activity detection system.
  • K. Badami, S. Lauwereins, W. Meert, and M. Verhelst, ‘Context-aware hierarchical information-sensing in a 6μW 90nm CMOS voice activity detector’, 2015 IEEE International Solid-State Circuits Conference - Digest of Technical Papers, 2015.


Citation: Komail Badami, Steven Lauwereins, Wannes Meert, Marian Verhelst, (2014), A 90 nm CMOS, 6 μW Power-Proportional Acoustic Sensing Frontend for Voice Activity Detection, IEEE Journal of Solid State Circuits, Vol. 51, Issue 1
Archived version: Author manuscript: the content is identical to the content of the published paper, but without the final typesetting by the publisher
Published version: http://ieeexplore.ieee.org/document/7315025/?arnumber=7315025
Journal homepage: http://sscs.ieee.org/en/publications/ieee-journal-of-solid-state-circuits-jssc
Author contact: komail.badami@esat.kuleuven.be
IR: https://lirias.kuleuven.be/handle/123456789/514022

A 90 nm CMOS, 6 μW Power-Proportional Acoustic
Sensing Frontend for Voice Activity Detection
Komail Badami, Steven Lauwereins, Wannes Meert, Marian Verhelst, KU Leuven, Belgium
Abstract – This work presents a sub-6 µW acoustic frontend for speech/non-speech
classification in a voice activity detection (VAD) system in 90 nm CMOS. Power consumption of the
VAD system is minimized by architectural design around a new Power-Proportional sensing
paradigm and the use of machine-learning assisted moderate-precision analog analytics for
classification. Power-Proportional sensing allows for hierarchical and context-aware scaling of
the frontend’s power consumption depending on the complexity of the ongoing information
extraction, while the use of analog analytics brings increased power efficiency through switching
on/off the computation of individual features depending on the features’ usefulness in a
particular context. The proposed VAD system reduces the power consumption by 10X as
compared to state-of-the-art systems and yet achieves an 89% average hit rate for a 12 dB signal
to acoustic noise ratio in babble context, which is on par with software-based VAD systems.
I. INTRODUCTION
Technological innovations are changing the way we interact with electronic devices.
Interactions like voice control and gesture recognition are rapidly gaining popularity. Such
natural interactive systems do not only need many integrated sensors, but also always-awake,
reactive sensor frontends. These frontends generate large amounts of raw signals that
state-of-the-art (SotA) frontends immediately digitize for processing on a DSP. This very robust approach is
not power efficient, as not all raw sensor signals are equally relevant. The net information
rate of a sensed signal is quite often significantly lower than its Nyquist rate [1-7]. Existing
works such as Information-Rate processing [1,2], Analog to Information conversion [3-5] and
Compressed Sensing [6,7] show power savings by extracting or compressing the information
from signals before digitizing the data. However, as these schemes operate in a static way, the
compression or extraction parameters are set beforehand. Yet, the information content in raw
signals and its application relevance dynamically varies depending on the operating context.
Operating such systems efficiently hence requires a dynamic system adaptation depending on the
context or signal information content. Existing systems do not perform such fine-grained adaptive
behavior, which severely limits their power savings as shown by the solid line in Fig. 1.
We propose a self-scalable, Power-Proportional sensing paradigm which gracefully scales the
system’s power consumption with the amount and complexity of extracted information, i.e. the
power consumption for such a system increases only as the task of information extraction gets
more complex. To this end, in this paper we propose key enablers for Power-Proportionality and
apply them to a proof of concept acoustic frontend for voice activity detection (VAD).
VAD systems distinguish speech from non-speech in different background noise contexts for
varying signal to acoustic noise ratios (SANR). SotA VAD systems [8-10] extract complex
features like Mel-Frequency Cepstral Coefficients, DCT etc. to differentiate speech from non-
speech. The high computational complexity of such features results in large power consumption,
typically about 50 - 100 µW [8-11] in addition to the power consumption of the required active
microphone. Such a continuous large power consumption is unacceptable for battery powered
always-on sensor frontends. This work exploits our new Power-Proportional sensing paradigm
along with moderate-precision, computationally-inexpensive, analog feature-extraction, coupled
with an embedded mixed-signal classifier to save more than 10X power consumption over SotA
without compromising on the classification accuracy.
The outline for this paper is as follows. Section II discusses insights into the design principles
for Power-Proportional sensing and explains the rationale behind the analog feature-extraction
instead of the commonly used digital scheme. Section III describes the architecture and
specification set for VAD while the detailed implementation is discussed in Section IV.
Measurement results for the chip and for the full VAD system are discussed in Section V.
II. KEY PRINCIPLES FOR POWER EFFICIENT SENSING
This section details the two key principles that allow our always-on sensing system to scale its
power consumption with the information extracted, saving 10X power over SotA VAD systems.
A. Power-Proportional Sensing
The core premise for Power-Proportional sensing is that power consumption of the sensing
system scales proportionally with the complexity of the sensing task. The sensing process with
the target of information extraction can increase in complexity along two dimensions:
First, the amount of information extracted from the incoming signal can scale in complexity.
Consider, for example, the task of speaker identification vs. speech detection. The former task
entails the latter as a prerequisite first step, hence justifying the increase in power consumption.
Enabling hierarchical operation for tasks of increasing complexity allows scaling of power
consumption with complexity of information extraction. In such an architecture each processing
stage extracts more complex information than the previous stage while consuming more power.
This enables information extraction by necessity, as is shown on the horizontal-axis in Fig. 1.
Secondly, even if the amount of extracted information remains the same, distinguishing the
useful information from the background noise (the context) is subject to varying levels of
difficulty. For this case consider the complexity of speech detection in a quiet office, in contrast
to a noisy street environment. The amount of information needed is the same in both cases, but in the
latter case as the background noise maps directly onto the information spectrum, it creates in-
band interference on the desired signal. As such, distinguishing speech from non-speech
becomes more complex, hence justifying the increase in power consumption. Context-awareness
enables Power-Proportional sensing to scale power as the background noise context scales the
complexity of information extraction, as shown in bold in Fig. 1. For the example above,
context-awareness allows the use of a much smaller discriminating feature subset in a low noise
environment and a relatively larger subset for noisy background contexts, hence scaling power.
SotA sensing systems do not exploit the power scaling opportunity offered by the above
scenarios, and typically operate constantly in full processing mode. This plateaus the on-state
power consumption for SotA sensing systems independent of system utility as shown in Fig. 1.
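The context-aware dimension can be sketched in the same spirit: the frontend computes only the features that are useful in the current context, so its power follows the size of the active subset. The feature names below follow the af naming used in this paper, but the chosen subsets, the always-on overhead, and the per-feature cost are hypothetical values for illustration only.

```python
ALWAYS_ON_UW = 1.0      # assumed always-on overhead in µW
PER_FEATURE_UW = 0.3    # assumed cost of computing one feature in µW

# Hypothetical mapping from noise context to the features kept active.
CONTEXT_FEATURES = {
    "quiet office": ["af7", "af9"],
    "noisy street": ["af5", "af6", "af7", "af8", "af9", "af10", "af11", "af12"],
}


def frontend_power_uw(context):
    """Power when only the features needed in this context are computed."""
    return ALWAYS_ON_UW + PER_FEATURE_UW * len(CONTEXT_FEATURES[context])


for context in CONTEXT_FEATURES:
    print(f"{context}: {frontend_power_uw(context):.1f} uW")
```

A frontend that always computes every feature would sit at the "noisy street" figure regardless of context, which corresponds to the plateau described above.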
B. Power Efficiency through Analog Analytics
The Power-Proportional sensing paradigm highlighted in the previous subsection needs hardware
blocks whose power scales with complexity and precision. Such power scaling with
precision is very different for analog and digital implementations. Analog power consumption
scales gradually for thermal-noise-limited systems with low-to-medium precision, while digital
has a logarithmic power vs. precision profile. As shown in [12] and in Fig. 2, for a
0.25 µm CMOS technology, analog computation is not only more power-efficient than digital for
low-to-medium resolution processing, but also exhibits better scalability.
Reduction in supply voltage due to technology scaling allows more power efficient digital
circuits and questions the beneficial analog behavior in advanced technologies. This is because
with scaling, the cost of maintaining the same precision in analog increases as a larger bias

Citations
Journal ArticleDOI
TL;DR: This work proposes an efficient hardware architecture to implement gradient boosted trees in applications under stringent power, area, and delay constraints, such as medical devices, and introduces the concepts of asynchronous tree operation and sequential feature extraction to achieve an unprecedented energy and area efficiency.
Abstract: Biomedical applications often require classifiers that are both accurate and cheap to implement. Today, deep neural networks achieve the state-of-the-art accuracy in most learning tasks that involve large data sets of unstructured data. However, the application of deep learning techniques may not be beneficial in problems with limited training sets and computational resources, or under domain-specific test time constraints. Among other algorithms, ensembles of decision trees, particularly the gradient boosted models have recently been very successful in machine learning competitions. Here, we propose an efficient hardware architecture to implement gradient boosted trees in applications under stringent power, area, and delay constraints, such as medical devices. Specifically, we introduce the concepts of asynchronous tree operation and sequential feature extraction to achieve an unprecedented energy and area efficiency. The proposed architecture is evaluated in automated seizure detection for epilepsy, using 3074 h of intracranial EEG data from 26 patients with 393 seizures. Average F1 scores of 99.23% and 87.86% are achieved for random and block-wise splitting of data into train/test sets, respectively, with an average detection latency of 1.1 s. The proposed classifier is fabricated in a 65-nm TSMC process, consuming 41.2 nJ/class in a total area of $540\times 1850\,\,\mathrm {\mu m}^{2}$ . This design improves the state-of-the-art by $27\times $ reduction in energy-area-latency product. Moreover, the proposed gradient-boosting architecture offers the flexibility to accommodate variable tree counts specific to each patient, to trade the predictive accuracy with energy. This patient-specific and energy-quality scalable classifier holds great promise for low-power sensor data classification in biomedical applications.

87 citations

Journal ArticleDOI
TL;DR: It is argued that VADs should prioritize accuracy over area and power, and a VAD circuit is introduced that uses an NN to classify modulation frequency features with 22.3 µW power consumption.
Abstract: This paper describes digital circuit architectures for automatic speech recognition (ASR) and voice activity detection (VAD) with improved accuracy, programmability, and scalability. Our ASR architecture is designed to minimize off-chip memory bandwidth, which is the main driver of system power consumption. A SIMD processor with 32 parallel execution units efficiently evaluates feed-forward deep neural networks (NNs) for ASR, limiting memory usage with a sparse quantized weight matrix format. We argue that VADs should prioritize accuracy over area and power, and introduce a VAD circuit that uses an NN to classify modulation frequency features with 22.3- $\mu \text{W}$ power consumption. The 65-nm test chip is shown to perform a variety of ASR tasks in real time, with vocabularies ranging from 11 words to 145 000 words and full-chip power consumption ranging from 172 $\mu \text{W}$ to 7.78 mW.

76 citations

Journal ArticleDOI
TL;DR: This article presents a complete mixed-signal system-on-chip, capable of directly interfacing to an analog microphone and performing keyword spotting (KWS) and speaker verification (SV), without any need for further external accesses.
Abstract: The use of speech-triggered wake-up interfaces has grown significantly in the last few years for use in ubiquitous and mobile devices. Since these interfaces must always be active, power consumption is one of their primary design metrics. This article presents a complete mixed-signal system-on-chip, capable of directly interfacing to an analog microphone and performing keyword spotting (KWS) and speaker verification (SV), without any need for further external accesses. Through the use of: a) an integrated single-chip digital-friendly design; b) hardware-aware algorithmic optimization; and c) memory- and power-optimized accelerators, ultra-low power is achieved while maintaining high accuracy for speech recognition tasks. The 65-nm implementation achieves 18.3 µW worst case power consumption or 10.6 µW power for typical real-time scenarios, 10× below state of the art (SoA).

63 citations

Journal ArticleDOI
TL;DR: A 0.5V 55μW 64×2-channel binaural silicon cochlea aiming for ultra-low-power IoE applications like event-driven VAD, sound source localization, speaker identification and primitive speech recognition is presented.
Abstract: This paper presents a $64 \times 2$ channel stereo-audio sensing front end with parallel asynchronous event output inspired by the biological cochlea. Each binaural channel performs feature extraction by analog bandpass filtering, and the filtered signal is encoded into events via asynchronous delta modulation (ADM). The channel central frequencies $f_{0}$ are geometrically scaled across the human hearing range. Two design techniques are highlighted to achieve the high system power efficiency: source-follower-based bandpass filters (BPFs) and asynchronous delta modulation (ADM) with adaptive self-oscillating comparison. The chip was fabricated in 0.18 $\mu \text{m}$ 1P6M CMOS, and occupies an area of $10.5 \times 4.8$ mm2. The core cochlea system operating under a 0.5 V power supply consumes 55 $\mu \text{W}$ at an output rate of 100k event/s. The measured range of $f_{0}$ is from 8 Hz to 20 kHz, and the BPF quality factor ${Q}$ can be tuned from 1 to almost 40. The 1 $\sigma $ mismatch of $f_{0}$ and ${Q}$ between two ears is 3.3% and 15%, respectively, across all channels at ${Q}\approx $ 10. Reconstruction of speech input from the event output of the chip is performed to validate the information integrity in event-domain representation, and vowel discrimination is demonstrated as a simple application using histograms of the output events. This type of silicon cochlea front end targets integration with embedded event-driven processors for low-power smart audio sensing with classification capabilities, such as voice activity detection and speaker identification.

62 citations

Journal ArticleDOI
TL;DR: An algorithm-circuit cross optimization is introduced to realize a 12-nW stand-alone microsystem that integrates the analog frontend with the digital backend signal classifier and replaces a conventional high-power/area-consuming parallel feature extraction using the fast Fourier transform.
Abstract: This paper presents an ultra-low power acoustic sensing and object recognition microsystem for Internet of Things applications. The microsystem is targeted for unattended ground sensor nodes where long-term (decades) life time is desired without the need for battery replacement. The system incorporates a microelectromechanical systems microphone as a frontend sensor along with active circuitry to identify target objects. We introduce an algorithm-circuit cross optimization to realize a 12-nW stand-alone microsystem that integrates the analog frontend with the digital backend signal classifier. The frequency-domain analysis of target audio signals reveals that the system can operate with a relatively low bandwidth ( 3 dB) which significantly relaxes power constraints on both analog frontend and digital backend circuits. To further relax the current requirement of the preceding amplifier, we propose an 8-bit SAR-analog-to-digital converter that is designed to have a highly reduced sampling capacitance ( 95% reliability and consumes only 12 nW with continuous monitoring.

39 citations

References
Proceedings ArticleDOI
20 May 2012
TL;DR: An Analog-to-Information spectral decomposition scheme suitable for parallel low-power analog and mixed-signal VLSI implementation and a feasible solution space given an on-line self-calibrating system are presented.
Abstract: This paper presents the design of an Analog-to-Information spectral decomposition scheme suitable for parallel low-power analog and mixed-signal VLSI implementation. The novel scheme extracts sufficient information to achieve good back-end signal detection and classification performance while using less power than purely digital spectral techniques such as FFT. Simulations of a prototype system in a mixed-signal 130nm CMOS process show a feasible solution space given an on-line self-calibrating system.

4 citations