
A 90 nm CMOS, 6 µW Power-Proportional Acoustic Sensing Frontend for Voice Activity Detection

TLDR
In this article, the authors presented a new power-proportional sensing paradigm and the use of machine-learning-assisted moderate-precision analog analytics for classification of speech and non-speech.
Abstract
This work presents a sub-6 µW acoustic frontend for speech/non-speech classification in a voice activity detection (VAD) system in 90 nm CMOS. Power consumption of the VAD system is minimized by architectural design around a new power-proportional sensing paradigm and the use of machine-learning-assisted moderate-precision analog analytics for classification. Power-proportional sensing allows for hierarchical and context-aware scaling of the frontend’s power consumption depending on the complexity of the ongoing information extraction, while the use of analog analytics brings increased power efficiency through switching on/off the computation of individual features depending on the features’ usefulness in a particular context. The proposed VAD system reduces the power consumption by 10× as compared to state-of-the-art (SotA) systems and yet achieves an 89% average hit rate (HR) for a 12 dB signal-to-acoustic-noise ratio (SANR) in babble context, which is on par with software-based VAD systems.


Citation: Komail Badami, Steven Lauwereins, Wannes Meert, Marian Verhelst, (2014), A 90 nm CMOS, 6 μW Power-Proportional Acoustic Sensing Frontend for Voice Activity Detection, IEEE Journal of Solid State Circuits, Vol. 51, Issue 1
Archived version: Author manuscript: the content is identical to the content of the published paper, but without the final typesetting by the publisher
Published version: http://ieeexplore.ieee.org/document/7315025/?arnumber=7315025
Journal homepage: http://sscs.ieee.org/en/publications/ieee-journal-of-solid-state-circuits-jssc
Author contact: komail.badami@esat.kuleuven.be
IR: https://lirias.kuleuven.be/handle/123456789/514022
I. INTRODUCTION
Technological innovations are changing the way we interact with electronic devices.
Interactions like voice control and gesture recognition are rapidly gaining popularity. Such
natural interactive systems need not only many integrated sensors, but also always-awake,
reactive sensor frontends. These frontends generate large amounts of raw signals that
state-of-the-art (SotA) frontends immediately digitize for processing on a DSP. This very robust
approach is not power efficient, as not all raw sensor signals are equally relevant. The net
information content of a sensed signal is quite often significantly smaller than what its Nyquist
rate suggests [1-7]. Existing works such as Information-Rate processing [1,2], Analog to
Information conversion [3-5] and Compressed Sensing [6,7] show power savings by extracting or
compressing the information from signals before digitizing the data. However, as these schemes
operate in a static way, the compression or extraction parameters are set beforehand. Yet, the
information content in raw signals and its application relevance vary dynamically depending on
the operating context. Operating such systems efficiently hence requires a dynamic system
adaptation depending on the context or signal information content. Existing systems do not
perform such fine-grained adaptive behavior, which severely limits their power savings, as shown
by the solid line in Fig. 1.
Komail Badami, Steven Lauwereins, Wannes Meert, Marian Verhelst, KU Leuven, Belgium
We propose a self-scalable, Power-Proportional sensing paradigm which gracefully scales the
system’s power consumption with the amount and complexity of extracted information, i.e., the
power consumption of such a system increases only as the task of information extraction gets
more complex. To this end, in this paper we propose key enablers for Power-Proportionality and
apply them to a proof-of-concept acoustic frontend for voice activity detection (VAD).
VAD systems distinguish speech from non-speech in different background noise contexts for
varying signal-to-acoustic-noise ratios (SANR). SotA VAD systems [8-10] extract complex
features like Mel-Frequency Cepstral Coefficients, DCT, etc. to differentiate speech from
non-speech. The high computational complexity of such features results in large power
consumption, typically about 50-100 µW [8-11], in addition to the power consumption of the
required active microphone. Such continuous large power consumption is unacceptable for
battery-powered always-on sensor frontends. This work exploits our new Power-Proportional
sensing paradigm along with moderate-precision, computationally inexpensive analog feature
extraction, coupled with an embedded mixed-signal classifier, to reduce power consumption by
more than 10X over SotA without compromising on the classification accuracy.
The outline for this paper is as follows. Section II discusses insights into the design principles
for Power-Proportional sensing and explains the rationale behind the analog feature-extraction
instead of the commonly used digital scheme. Section III describes the architecture and
specification set for VAD while the detailed implementation is discussed in Section IV.
Measurement results for the chip and for the full VAD system are discussed in Section V.
II. KEY PRINCIPLES FOR POWER EFFICIENT SENSING
This section details the two key principles that allow our always-on sensing system to scale its
power consumption with the information extracted, saving 10X power over SotA VAD systems.
A. Power-Proportional Sensing
The core premise for Power-Proportional sensing is that power consumption of the sensing
system scales proportionally with the complexity of the sensing task. The sensing process with
the target of information extraction can increase in complexity along two dimensions:
First, the amount of information extracted from the incoming signal can scale in complexity.
Consider, for example, the task of speaker identification versus speech detection. The former
task entails the latter as a prerequisite first step, hence justifying the increase in power
consumption. Enabling hierarchical operation for tasks of increasing complexity allows scaling of
power consumption with complexity of information extraction. In such an architecture, each
processing stage extracts more complex information than the previous stage while consuming more
power. This enables information extraction by necessity, as is shown on the horizontal axis in
Fig. 1.
Secondly, even if the amount of extracted information remains the same, distinguishing the
useful information from the background noise (the context) is subject to varying levels of
difficulty. Consider, for example, the complexity of speech detection in a quiet office in
contrast to a noisy street environment. The amount of information needed is the same in both
cases, but in the latter case the background noise maps directly onto the information spectrum
and creates in-band interference on the desired signal. As such, distinguishing speech from
non-speech becomes more complex, hence justifying the increase in power consumption.
Context-awareness enables Power-Proportional sensing to scale power as the background noise
context scales the complexity of information extraction, as shown in bold in Fig. 1. For the
example above, context-awareness allows the use of a much smaller discriminating feature subset
in a low-noise environment and a relatively larger subset for noisy background contexts, hence
scaling power.
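As an illustration of the context-aware scaling described above, the following Python sketch
keeps a small feature subset switched on in a quiet context and a larger one in a noisy context,
and sums the power of the enabled features. The feature names, the per-feature power values and
the context labels are hypothetical placeholders chosen for illustration; they are not taken from
the presented chip.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Feature:
    name: str
    power_nw: int  # hypothetical power drawn while this feature is computed


# Hypothetical feature bank; names and power values are illustrative only.
FEATURE_BANK = {
    "band_0": Feature("band_0", 200),
    "band_1": Feature("band_1", 300),
    "band_2": Feature("band_2", 400),
    "band_3": Feature("band_3", 500),
}

# Hypothetical mapping from background context to the features kept switched on.
CONTEXT_TO_FEATURES = {
    "quiet_office": ("band_0",),
    "noisy_street": ("band_0", "band_1", "band_2", "band_3"),
}


def enabled_features(context: str) -> List[Feature]:
    """Return the features that stay switched on in the given context."""
    return [FEATURE_BANK[name] for name in CONTEXT_TO_FEATURES[context]]


def feature_power_nw(context: str) -> int:
    """Sum the power of the enabled features; disabled features cost nothing here."""
    return sum(feature.power_nw for feature in enabled_features(context))


if __name__ == "__main__":
    for context in CONTEXT_TO_FEATURES:
        print(context, feature_power_nw(context), "nW")
```

In this simplified model a switched-off feature draws no power at all, so the feature-extraction
power follows the size of the subset that the context requires.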
SotA sensing systems do not exploit the power scaling opportunity offered by the above
scenarios, and typically operate constantly in full processing mode. This plateaus the on-state
power consumption for SotA sensing systems independent of system utility as shown in Fig. 1.
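The contrast between an always-on system and a hierarchical one can be made concrete with a small
expected-power calculation. The sketch below assumes a cascade in which the first stage is always
on and each later stage is powered only for the fraction of time in which the previous stage
forwards activity to it. The stage powers and forwarding fractions are hypothetical, and the
function names are introduced here for illustration only.

```python
from typing import Sequence


def cascade_average_power_uw(
    stage_power_uw: Sequence[float],
    forward_fraction: Sequence[float],
) -> float:
    """Expected power of a hierarchical cascade.

    stage_power_uw[k] is the power of stage k while it is active.
    forward_fraction[k] is the fraction of stage k's active time during
    which it wakes up stage k + 1 (so it has one entry fewer than the stages).
    """
    if len(forward_fraction) != len(stage_power_uw) - 1:
        raise ValueError("need exactly one forwarding fraction per stage transition")

    total = stage_power_uw[0]  # the first stage is always on
    active_fraction = 1.0
    for power, fraction in zip(stage_power_uw[1:], forward_fraction):
        active_fraction *= fraction  # later stages run only when all earlier ones forward
        total += active_fraction * power
    return total


def always_on_power_uw(stage_power_uw: Sequence[float]) -> float:
    """Power when every stage runs continuously, regardless of the input."""
    return sum(stage_power_uw)


if __name__ == "__main__":
    # Hypothetical three-stage hierarchy with increasing complexity and power.
    stages = [1.0, 4.0, 40.0]
    forwards = [0.25, 0.5]
    print("always on:", always_on_power_uw(stages), "uW")
    print("cascade:  ", cascade_average_power_uw(stages, forwards), "uW")
```

Under these assumed numbers the cascade spends most of its time in the cheapest stage, whereas the
always-on variant pays for the most complex stage continuously.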
B. Power Efficiency through Analog Analytics
The Power-Proportional sensing paradigm highlighted in the previous subsection needs hardware
blocks whose power scales with complexity and precision. Such power scaling with precision is
very different for analog and digital implementations. Analog power consumption scales gradually
for a thermal-noise-limited system with low-to-medium precision, while digital has a logarithmic
power versus precision profile. As shown in [12] and in Fig. 2, for a 0.25 µm CMOS technology,
analog computation is not only more power-efficient than digital for low-to-medium resolution
processing, but also exhibits better scalability.
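The qualitative difference between the two scaling profiles can be sketched with a toy cost model.
It assumes that the analog cost grows in proportion to the SNR (thermal-noise-limited operation)
and that the digital cost grows with the number of bits, log2(1 + SNR). Both functional forms and
both constants are assumptions made for illustration; they are not the 0.25 µm values behind [12]
or Fig. 2.

```python
import math

# Hypothetical cost constants in arbitrary units (not fitted to any technology).
ANALOG_COST_PER_UNIT_SNR = 1e-3
DIGITAL_COST_PER_BIT = 1.0


def analog_cost(snr: float) -> float:
    """Assumed thermal-noise-limited analog cost: proportional to the SNR."""
    return ANALOG_COST_PER_UNIT_SNR * snr


def digital_cost(snr: float) -> float:
    """Assumed digital cost: proportional to the number of bits, log2(1 + SNR)."""
    return DIGITAL_COST_PER_BIT * math.log2(1.0 + snr)


def cheaper_domain(snr: float) -> str:
    """Return which domain is cheaper at the given SNR (a power ratio, not dB)."""
    return "analog" if analog_cost(snr) < digital_cost(snr) else "digital"


if __name__ == "__main__":
    for snr_db in (20, 40, 60, 80):
        snr = 10.0 ** (snr_db / 10.0)  # convert dB to a power ratio
        print(snr_db, "dB:", cheaper_domain(snr))
```

With these assumed constants the analog cost is lower at low-to-medium SNR and the digital cost is
lower at high SNR; the location of the crossover depends entirely on the chosen constants.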
Reduction in supply voltage due to technology scaling allows more power-efficient digital
circuits and calls into question the analog advantage in advanced technologies. This is because
with scaling, the cost of maintaining the same precision in analog increases as a larger bias

References
Subjective comparison and evaluation of speech enhancement algorithms
TL;DR: A noisy speech corpus is developed suitable for evaluation of speech enhancement algorithms encompassing four classes of algorithms: spectral subtractive, subspace, statistical-model based and Wiener-type algorithms.

Analog versus digital: extrapolating from electronics to neurobiology
TL;DR: The results suggest that it is likely that the brain computes in a hybrid fashion and that an underappreciated and important reason for the efficiency of the human brain, which consumes only 12 W, is the hybrid and distributed nature of its architecture.

Design and Analysis of a Hardware-Efficient Compressed Sensing Architecture for Data Compression in Wireless Sensors
TL;DR: The design and measurement of the proposed architecture is presented in the context of medical sensors, however the tools and insights are generally applicable to any sparse data acquisition.

Development and analysis of an International Speech Test Signal (ISTS)
TL;DR: The primary intention is to include this test signal with a new measurement method for a new hearing aid standard (IEC 60118-15) that is based on natural recordings but is largely non-intelligible because of segmentation and remixing.

On Vowel Duration in English
TL;DR: The authors reported the average durations of 12 vowels of American English measured in bisyllabic nonsense utterances and the vowels occurred in 14 symmetrical consonantal environments and the utterances were produced by three male talkers.