A 90 nm CMOS, 6 μW Power-Proportional Acoustic Sensing Frontend for Voice Activity Detection

doi:10.1109/JSSC.2015.2487276

Citation: Komail Badami, Steven Lauwereins, Wannes Meert, Marian Verhelst, (2014), A 90 nm CMOS, 6 μW Power-Proportional Acoustic Sensing Frontend for Voice Activity Detection, IEEE Journal of Solid State Circuits, Vol. 51, Issue 1

Archived version: Author manuscript: the content is identical to the content of the published paper, but without the final typesetting by the publisher

Published version: http://ieeexplore.ieee.org/document/7315025/?arnumber=7315025

Journal homepage: http://sscs.ieee.org/en/publications/ieee-journal-of-solid-state-circuits-jssc

Author contact: komail.badami@esat.kuleuven.be

IR: https://lirias.kuleuven.be/handle/123456789/514022


Abstract – This work presents a sub-6 µW acoustic front-end for speech/non-speech classification in a voice activity detection (VAD) system, implemented in 90 nm CMOS. The power consumption of the VAD system is minimized by an architecture designed around a new Power-Proportional sensing paradigm and by the use of machine-learning-assisted, moderate-precision analog analytics for classification. Power-Proportional sensing allows hierarchical and context-aware scaling of the frontend’s power consumption depending on the complexity of the ongoing information extraction, while analog analytics increases power efficiency by switching the computation of individual features on or off depending on each feature’s usefulness in a particular context. The proposed VAD system reduces power consumption by 10X compared to state-of-the-art systems and yet achieves an 89% average hit rate for a 12 dB signal to acoustic noise ratio in babble context, which is on par with software-based VAD systems.

I. INTRODUCTION

Technological innovations are changing the way we interact with electronic devices. Interactions like voice control and gesture recognition are rapidly gaining popularity. Such natural interactive systems need not only many integrated sensors, but also always-awake, reactive sensor frontends. These frontends generate large amounts of raw signals that state-of-the-art (SotA) frontends immediately digitize for processing on a DSP. This very robust approach is not power efficient, as not all raw sensor signals are equally relevant. The net information rate of a sensed signal is quite often significantly smaller than its Nyquist rate [1-7]. Existing works such as Information-Rate processing [1,2], Analog to Information conversion [3-5] and Compressed Sensing [6,7] show power savings by extracting or compressing the information from signals before digitizing the data. However, these schemes operate in a static way: the compression or extraction parameters are set beforehand. Yet the information content in raw signals and its application relevance vary dynamically with the operating context. Operating such systems efficiently hence requires dynamic system adaptation depending on the context or signal information content. Existing systems do not perform such fine-grained adaptive behavior, which severely limits their power savings, as shown by the solid line in Fig. 1.

Komail Badami, Steven Lauwereins, Wannes Meert, Marian Verhelst, KU Leuven, Belgium

We propose a self-scalable, Power-Proportional sensing paradigm that gracefully scales the system’s power consumption with the amount and complexity of the extracted information, i.e., the power consumption of such a system increases only as the task of information extraction becomes more complex. To this end, this paper proposes key enablers for Power-Proportionality and applies them to a proof-of-concept acoustic frontend for voice activity detection (VAD).

VAD systems distinguish speech from non-speech in different background noise contexts for varying signal to acoustic noise ratios (SANR). SotA VAD systems [8-10] extract complex features such as Mel-Frequency Cepstral Coefficients, DCT, etc. to differentiate speech from non-speech. The high computational complexity of such features results in large power consumption, typically about 50 - 100 µW [8-11], in addition to the power consumption of the required active microphone. Such continuous, large power consumption is unacceptable for battery-powered always-on sensor frontends. This work exploits our new Power-Proportional sensing paradigm along with moderate-precision, computationally inexpensive analog feature extraction, coupled with an embedded mixed-signal classifier, to reduce power consumption by more than 10X relative to SotA without compromising classification accuracy.

This paper is organized as follows. Section II discusses the design principles for Power-Proportional sensing and explains the rationale behind using analog feature extraction instead of the commonly used digital scheme. Section III describes the architecture and the specification set for the VAD, while the detailed implementation is discussed in Section IV. Measurement results for the chip and for the full VAD system are discussed in Section V.

II. KEY PRINCIPLES FOR POWER EFFICIENT SENSING

This section details the two key principles that allow our always-on sensing system to scale its power consumption with the information extracted, saving 10X power over SotA VAD systems.

A. Power-Proportional Sensing

The core premise for Power-Proportional sensing is that power consumption of the sensing

system scales proportionally with the complexity of the sensing task. The sensing process with

the target of information extraction can increase in complexity along two dimensions:

First, the amount of information extracted from the incoming signal can scale in complexity. Consider, for example, the task of speaker identification versus speech detection. The former task entails the latter as a prerequisite first step, hence justifying the increase in power consumption. Enabling hierarchical operation for tasks of increasing complexity allows power consumption to scale with the complexity of information extraction. In such an architecture, each processing stage extracts more complex information than the previous stage while consuming more power. This enables information extraction by necessity, as shown on the horizontal axis in Fig. 1.

Secondly, even if the amount of extracted information remains the same, distinguishing the useful information from the background noise (the context) is subject to varying levels of difficulty. Consider the complexity of speech detection in a quiet office, in contrast to a noisy street environment. The amount of information needed is the same in both cases, but in the latter case the background noise maps directly onto the information spectrum and creates in-band interference on the desired signal. As such, distinguishing speech from non-speech becomes more complex, hence justifying the increase in power consumption. Context-awareness enables Power-Proportional sensing to scale power as the background noise context scales the complexity of information extraction, as shown in bold in Fig. 1. For the example above, context-awareness allows the use of a much smaller discriminating feature subset in a low-noise environment and a relatively larger subset for noisy background contexts, hence scaling power.
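The context-dependent choice of feature subset can be illustrated with a minimal sketch. All names and numbers here (the feature names, the per-feature power in nW, the base power and the two subsets) are hypothetical placeholders chosen for illustration and are not taken from the measured chip; a real system would derive the subsets from a trained classifier.

```python
# Hypothetical per-feature power cost in nanowatts (illustrative values only).
FEATURE_POWER_NW = {"band_0": 200, "band_1": 200, "band_2": 200, "band_3": 200}

# Hypothetical feature subsets per context; a real system would learn these.
CONTEXT_FEATURES = {
    "quiet_office": ["band_1"],
    "noisy_street": ["band_0", "band_1", "band_2", "band_3"],
}


def active_features(context):
    """Return the features that are switched on in the given context."""
    return CONTEXT_FEATURES[context]


def frontend_power_nw(context, base_nw=500):
    """Always-on base power plus the power of the features enabled for this context."""
    return base_nw + sum(FEATURE_POWER_NW[f] for f in active_features(context))


if __name__ == "__main__":
    for ctx in CONTEXT_FEATURES:
        print(ctx, active_features(ctx), frontend_power_nw(ctx), "nW")
```

In this toy model the quiet context enables a single feature and the noisy context enables all four, so power rises only when the discrimination task becomes harder.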

SotA sensing systems do not exploit the power scaling opportunity offered by the above scenarios, and typically operate constantly in full processing mode. As a result, the on-state power consumption of SotA sensing systems plateaus, independent of system utility, as shown in Fig. 1.
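The difference between an always-on system and a hierarchical, wake-on-demand system can be sketched with a simple average-power model. The stage powers and pass fractions below are assumed values for illustration only, and the "sound detection" stage is a hypothetical first stage added to make the hierarchy three levels deep; only speech detection and speaker identification are named in the text.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    power_uw: float       # power while the stage is running (hypothetical)
    pass_fraction: float  # fraction of its active time in which it wakes the next stage


def average_power_uw(stages):
    """Average power when each stage runs only while the previous one has triggered."""
    total = 0.0
    duty = 1.0  # the first stage is always on
    for stage in stages:
        total += duty * stage.power_uw
        duty *= stage.pass_fraction
    return total


def always_on_power_uw(stages):
    """Power when every stage runs continuously, regardless of need."""
    return sum(stage.power_uw for stage in stages)


# Illustrative hierarchy: each stage extracts more information and costs more power.
STAGES = [
    Stage("sound detection", 1.0, 0.5),
    Stage("speech detection", 4.0, 0.25),
    Stage("speaker identification", 40.0, 0.0),
]

if __name__ == "__main__":
    print("hierarchical:", average_power_uw(STAGES), "uW")
    print("always on:   ", always_on_power_uw(STAGES), "uW")
```

With these assumed numbers the expensive last stage is active only 12.5% of the time, so the hierarchical average stays well below the always-on total.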

B. Power Efficiency through Analog Analytics

The Power-Proportional sensing paradigm highlighted in the previous subsection needs hardware blocks whose power scales with complexity and precision. Such power scaling with precision is very different for analog and digital implementations. Analog power consumption scales gradually for thermal-noise-limited systems with low-to-medium precision, while digital has a logarithmic power versus precision profile. As shown in [12] and in Fig. 2 for a 0.25 µm CMOS technology, analog computation is not only more power-efficient than digital for low-to-medium resolution processing, but also exhibits better scalability.
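A toy model can make the existence of a crossover precision concrete. It assumes an analog cost that grows by roughly 4x per added bit (thermal-noise-limited) and a digital cost that grows linearly with word length, i.e. logarithmically with signal-to-noise ratio. Both scaling laws and both constants are assumptions for illustration; they are not values from [12] or from Fig. 2.

```python
def analog_power(bits, k_analog=1.0):
    """Assumed thermal-noise-limited analog cost: about 4x per extra bit of precision."""
    return k_analog * 4 ** bits


def digital_power(bits, k_digital=1000.0):
    """Assumed digital cost: linear in word length (logarithmic in SNR)."""
    return k_digital * bits


def crossover_bits(max_bits=16):
    """Smallest precision at which analog becomes more expensive than digital."""
    for bits in range(1, max_bits + 1):
        if analog_power(bits) > digital_power(bits):
            return bits
    return None  # no crossover within the searched range


if __name__ == "__main__":
    for bits in range(1, 9):
        print(bits, analog_power(bits), digital_power(bits))
    print("crossover at", crossover_bits(), "bits")
```

Under these arbitrary constants analog is cheaper up to 6 bits and digital wins from 7 bits onward; changing `k_analog` or `k_digital` moves the crossover, which is the kind of shift technology scaling would produce.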

Reduction in supply voltage due to technology scaling allows more power-efficient digital circuits and calls into question the benefit of analog in advanced technologies. This is because with scaling, the cost of maintaining the same precision in analog increases as a larger bias
