A 90 nm CMOS, 6 µW Power-Proportional Acoustic Sensing Frontend for Voice Activity Detection

doi:10.1109/JSSC.2015.2487276

Komail Badami¹, Steven Lauwereins¹, Wannes Meert², Marian Verhelst¹

Katholieke Universiteit Leuven¹, University of Copenhagen Faculty of Science²

01 Jan 2016, IEEE Journal of Solid-State Circuits (IEEE), Vol. 51, Iss. 1, pp. 291-302

TL;DR: In this article, the authors presented a new power-proportional sensing paradigm and the use of machine-learning-assisted moderate-precision analog analytics for classification of speech and non-speech.

Abstract: This work presents a sub-6 µW acoustic frontend for speech/non-speech classification in a voice activity detection (VAD) system in 90 nm CMOS. Power consumption of the VAD system is minimized by architectural design around a new power-proportional sensing paradigm and the use of machine-learning-assisted moderate-precision analog analytics for classification. Power-proportional sensing allows for hierarchical and context-aware scaling of the frontend’s power consumption depending on the complexity of the ongoing information extraction, while the use of analog analytics brings increased power efficiency through switching on/off the computation of individual features depending on the features’ usefulness in a particular context. The proposed VAD system reduces the power consumption by 10× compared to state-of-the-art (SotA) systems and yet achieves an 89% average hit rate (HR) for a 12 dB signal-to-acoustic-noise ratio (SANR) in babble context, which is on par with software-based VAD systems.

Summary

Introduction

Technological innovations are changing the way we interact with electronic devices.
Yet, the information content in raw signals and its application relevance dynamically varies depending on the operating context.
VAD systems distinguish speech from non-speech in different background noise contexts for varying signal to acoustic noise ratios (SANR).

A. Power-Proportional Sensing

The core premise for Power-Proportional sensing is that power consumption of the sensing system scales proportionally with the complexity of the sensing task.
First, the amount of information extracted from the incoming signal can scale in complexity.
In such an architecture each processing stage extracts more complex information than the previous stage while consuming more power.
Context-awareness enables Power-Proportional sensing to scale power as the background noise context scales the complexity of information extraction, as shown in bold in Fig. 1.
SotA sensing systems do not exploit the power scaling opportunity offered by the above scenarios, and typically operate constantly in full processing mode.

B. Power Efficiency through Analog Analytics

The Power-Proportional sensing paradigm highlighted in the previous subsection needs hardware blocks whose power scales with complexity and precision.
Reduction in supply voltage due to technology scaling allows more power-efficient digital circuits and calls into question the benefit of analog processing in advanced technologies.
This is because, with scaling, the cost of maintaining the same precision in analog increases, as a larger bias current is needed to reduce the noise floor, compensating for the reduction in signal swing.
Hence, absolute precision requirements for such systems are rather modest, and mismatches and offset impairments are automatically taken care of by the embedded trained classifier in the loop.
As demonstrated by this work, as well as some existing works, machine learning assistance [13, 14] and/or digital calibration [15] can improve SNR by 6 – 10 dB for comparable power, which pushes the efficiency crossover point to the right, as shown in Fig. 2.
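
As a rough worked illustration of what a 6 – 10 dB SNR improvement is worth, assume a thermal-noise-limited analog block whose power grows in proportion to the required SNR. This is a common first-order model and an assumption here, not a figure taken from this work. The ratio between the power $P_2$ needed to buy the extra SNR with bias current and the original power $P_1$ is then:

```latex
\frac{P_2}{P_1} = 10^{\Delta\mathrm{SNR}/10}, \qquad
\Delta\mathrm{SNR} = 6\ \mathrm{dB} \;\Rightarrow\; \frac{P_2}{P_1} \approx 4, \qquad
\Delta\mathrm{SNR} = 10\ \mathrm{dB} \;\Rightarrow\; \frac{P_2}{P_1} = 10 .
```

Under that assumption, recovering the same SNR through the trained classifier or calibration instead of bias current corresponds to roughly a 4× to 10× saving in analog power.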

III. SYSTEM ARCHITECTURE AND SPECIFICATIONS

This section highlights the use of the aforementioned key principles in the developed VAD architecture [16] and derives the specifications for the analog/mixed-signal building blocks.
If the signal is speech, the classifier wakes up the microcontroller for more advanced processing.
This allows scaling the power with necessary information as outlined in Section II.
As further modelled in subsection B, since the analog feature-extraction blocks are in the loop during this training operation, all static analog impairments such as mismatch, gain errors, or offsets are absorbed in the trained feature thresholds and do not affect the classification accuracy.
Subsection B derives the specifications for the targeted VAD system.

B. Specifications for VAD system

This section first derives the system level specifications and then the specifications for individual analog blocks.
Mathematically, each analog feature is defined as
$$af_i = \overline{\left| x(t) * h_i(t) \right|} \qquad (1)$$
where $x(t)$ is the amplified acoustic signal, $h_i(t)$ is the impulse response of the band pass filter used to decompose the input signal into a smaller frequency band, and $|\cdot|$, $*$ and $\overline{(\cdot)}$ represent the absolute value, convolution, and averaging respectively.
The MATLAB model varies the number of computed features in the above frequency range by scaling the Q factor of the band pass filters.
The results of the above simulation are shown in Fig. 4(a).
It can be seen that more features improve classification accuracy, yet accuracy gains diminish beyond 16 features, allowing us to limit our design to a maximum of 16 individually activated or deactivated features.
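
The feature definition in (1), band-pass filtering followed by rectification and averaging, can be sketched in discrete time as below. This is only a behavioral illustration under stated assumptions. The digital biquad band-pass stands in for the analog band pass filter, the 8 kHz sample rate and the centre frequencies are hypothetical choices, and Q = 1.3 is borrowed from the Fig. 11 caption. The function names are made up for this example.

```python
import math

def bandpass_coeffs(f0, q, fs):
    """Biquad band-pass with unity gain at f0 (a digital stand-in, assumed)."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)                    # feed-forward taps
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)    # feedback taps
    return b, a

def analog_feature(x, f0, q=1.3, fs=8000.0):
    """Average of |x filtered by the band at f0|, mirroring the structure of (1)."""
    b, a = bandpass_coeffs(f0, q, fs)
    x1 = x2 = y1 = y2 = 0.0
    acc = 0.0
    for xn in x:
        # Band-pass filtering (the convolution with the filter response)
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        acc += abs(yn)          # rectification (absolute value)
    return acc / len(x)         # averaging over the whole record

# Hypothetical usage: a 500 Hz tone seen by a 500 Hz band and a 2 kHz band.
fs = 8000.0
tone = [math.sin(2.0 * math.pi * 500.0 * n / fs) for n in range(8000)]
in_band = analog_feature(tone, 500.0, fs=fs)
out_of_band = analog_feature(tone, 2000.0, fs=fs)
print(in_band > out_of_band)
```

The in-band feature comes out close to the mean of a rectified sine, while the out-of-band feature is strongly attenuated, which is the property the classifier thresholds rely on.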

IV. SYSTEM IMPLEMENTATION

This section details the implementation of the individual system blocks discussed in the previous section, namely the wakeup detector, the analog feature-extractor and the embedded mixed-signal classifier.
A further subsection discusses system training for the complete VAD system before discussing one-time calibration and measurement results in Section V.

A. Wakeup detector

The always-awake threshold-based wakeup detector acts as the system’s watch-dog that wakes up the analog feature-extractor only when a signal of sufficient strength is detected.
The wakeup detector is a low power 3-phase comparator and its schematic is shown in Fig.
Each amplifier is a PMOS input source-coupled single-ended differential amplifier designed to provide a mid-band gain of 20 dB, and it can be turned on/off individually to save power depending on the microphone’s signal level.
Measured power consumption of this block is 700 nW when all four amplifier stages are turned on, and excluding the external bias.
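
The idea of enabling only as many 20 dB amplifier stages as the signal level requires can be illustrated with a small sketch. The 20 dB per-stage gain and the four stages come from the text above. The comparator threshold, the input levels, and the selection rule itself are hypothetical, since the actual control scheme is not described here.

```python
import math

STAGE_GAIN_DB = 20.0   # mid-band gain per amplifier stage (from the text)
MAX_STAGES = 4         # number of amplifier stages (from the text)

def stages_needed(input_vpeak, comparator_threshold_v):
    """Smallest number of stages whose combined gain lifts the input to the
    threshold, capped at MAX_STAGES (hypothetical selection rule)."""
    if input_vpeak <= 0.0 or comparator_threshold_v <= 0.0:
        raise ValueError("levels must be positive")
    needed_db = 20.0 * math.log10(comparator_threshold_v / input_vpeak)
    n = max(0, math.ceil(needed_db / STAGE_GAIN_DB))
    return min(n, MAX_STAGES)

def wakes_up(input_vpeak, comparator_threshold_v, n_stages):
    """True if the amplified input crosses the comparator threshold."""
    gain = 10.0 ** (n_stages * STAGE_GAIN_DB / 20.0)
    return input_vpeak * gain >= comparator_threshold_v

# Hypothetical usage: a 2 mV microphone signal and a 100 mV threshold.
n = stages_needed(2e-3, 0.1)
print(n, wakes_up(2e-3, 0.1, n))
```

A stronger microphone signal needs fewer active stages, so the stages left off do not draw bias current.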

B. Analog feature-extractor

On receiving the wakeup signal from the threshold-based wakeup detector, the analog feature-extractor decomposes the input signal into the set of 16 features.
This contributes to Power-Proportional information extraction, as it allows turning off amplifier stages of unused features along with all other circuitry involved in individual feature computation.
The sub-blocks of the analog feature-extractor are now explained in more detail.
To cover for this, the $f_{-3\,\mathrm{dB}}$ of the amplifiers in each band also increases progressively from band 1 to band 16.
The architecture of the current-mode averaging is shown in Fig. 12.

C. Decision tree based classifier

The extracted feature subset, af5 - af12, is passed on to the on-chip classifier (Fig. 5) while the complete feature-set af1 - af16 can be passed on to an off-chip ADC for more complex information extraction, such as context-change detection and retraining the classifier as in [22].
In these cases, the Nyquist sampling rate for the features is only 16 × 2 × 16 = 512 Hz instead of 8 kHz for audio.
Each node of the decision tree can be configured to select one feature out of af5 - af12.
To this end, the on-chip decision tree classifier is trained with our modified C4.5 algorithm with 160 s of labeled data from the standardized NOIZEUS database [23].
Our modification to C4.5 maximizes the information-gain/watt and therefore outputs a resource-efficient model that maximizes the information capture while minimizing the power [22].
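
The information-gain/watt criterion can be illustrated with a minimal split-selection sketch. This is not the modified C4.5 algorithm of [22]. It uses plain information gain on binary labels (1 = speech, 0 = non-speech), and the feature values and per-feature power figures are invented for the example.

```python
import math

def entropy(labels):
    """Binary entropy in bits of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    h = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:
            h -= q * math.log2(q)
    return h

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    cond = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - cond

def best_split_per_watt(features, labels, power_w):
    """Pick the (feature, threshold) with the highest gain per watt.

    features: dict name -> list of feature values
    power_w:  dict name -> power in watts to compute that feature (assumed)
    """
    best = None
    for name, values in features.items():
        for thr in sorted(set(values))[:-1]:   # candidate thresholds
            score = information_gain(values, labels, thr) / power_w[name]
            if best is None or score > best[2]:
                best = (name, thr, score)
    return best

# Hypothetical data: af5 separates the classes perfectly but costs more power.
features = {"af5": [0.1, 0.2, 0.8, 0.9], "af9": [0.1, 0.6, 0.5, 0.9]}
labels = [0, 0, 1, 1]
print(best_split_per_watt(features, labels, {"af5": 0.4e-6, "af9": 0.1e-6})[0])
```

With these made-up numbers the cheaper, less informative feature wins the node, which is the trade-off a gain-per-watt criterion is meant to expose. With equal power per feature the choice falls back to the most informative one.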

B. System measurement results

The chip is integrated with the microcontroller using external level-shifters and DACs, to form the complete VAD.
Fig. 20 shows a one-time calibration to characterize for mismatch in the ADC and DAC paths.
This subsection also displays the classification accuracy results for the complete VAD system and illustrates the achieved Power-Proportionality.
Receiver operating characteristic (ROC) curves characterize the classifier systems and depict hit-rates (HR) for the variables under observation [24].
The power consumption for signal detection is measured to be below 1 µW, whereas power consumption for classification varies depending on complexity of the operating context and has an upper bound of 6 µW.
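
To see how these two figures combine into an average, the sketch below weights them by the fraction of time the frontend spends classifying. It treats the quoted bounds (1 µW for detection, 6 µW for classification) as the total frontend power in each mode, which is an assumption. The duty-cycle value is hypothetical.

```python
P_DETECT_UW = 1.0     # upper bound quoted for signal detection
P_CLASSIFY_UW = 6.0   # upper bound quoted for classification

def average_power_uw(classify_fraction,
                     p_detect=P_DETECT_UW, p_classify=P_CLASSIFY_UW):
    """Time-weighted average power in µW for a given classification duty cycle."""
    if not 0.0 <= classify_fraction <= 1.0:
        raise ValueError("classify_fraction must lie in [0, 1]")
    return (1.0 - classify_fraction) * p_detect + classify_fraction * p_classify

# Hypothetical usage: classifying 10 % of the time gives about 1.5 µW.
print(average_power_uw(0.1))
```

The less often the wakeup detector triggers the later stages, the closer the average stays to the detection-only figure.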

VI. CONCLUSIONS

This work demonstrates a power efficient acoustic sensing frontend for speech/non-speech classification in a voice activity detection system.
K. Badami, S. Lauwereins, W. Meert, and M. Verhelst, ‘Context-aware hierarchical information-sensing in a 6μW 90nm CMOS voice activity detector’, 2015 IEEE International Solid-State Circuits Conference - Digest of Technical Papers, 2015.

Figures (23)

Fig. 16 Measured input referred noise at the LNA output.

Fig. 15 Measurement setup (top) and chip micrograph (bottom) with important blocks highlighted

Fig. 19 Measured power consumption of LNA and of each band for gain setting of 01 and 11

Fig. 17 Measured small signal magnitude response for LNA (a), amplifier with LNA (b), BPF with amplifier (c) in 16th band

Fig. 18 Measured large signal frequency response of complete bands for bands 3, 5, 7 and 10

Fig. 5 Histogram depicting average usefulness of computed features in exhibition background noise context for SANR of 0 dB

Fig. 8 Amplifier schematic highlighting gate leakage through the input pair

Fig. 7 Schematic and design parameters of the analog feature extraction block

Fig. 9 Simulated frequency response for LNA and amplifiers in even bands showing increasing f-3dB

Fig. 11 Simulated frequency response for a constant Q = 1.3 BPF filters in even bands

Fig. 12 Rectifier and LPF based averaging circuit

Fig. 10 First order gm – C based band pass filter topology

Table 2 Measured power consumption variation with classification task complexity illustrating achieved Power-Proportional operation

Table 3 Comparison with State of the art VAD and similar systems

Fig. 1 Power-Proportional sensing in contrast with State-of-the-art sensing systems

Fig. 2 Computation power scaling for analog (solid line) and digital (dashed line) implementations [12] and impact on efficiency cross over point due to voltage scaling and due to digital assistance by machine learning and / or calibration

Fig. 21 Comparison of classification accuracy to SotA software VADs

Fig. 22 Measured ROC curves depicting classification accuracy for multiple SANR in (a) Exhibition and (b) Car noise contexts

Fig. 20 Calibration scheme for ADC and DAC paths

Fig. 14 Architecture of (a) one node of DT classifier and (b) complete classifier

Fig. 13 Simulated response of the averaging circuit for a sinewave input of 20mVpp amplitude and 500 Hz frequency

Komail Badami, Steven Lauwereins, Wannes Meert, Marian Verhelst, KU Leuven, Belgium

I. INTRODUCTION

Technological innovations are changing the way we interact with electronic devices. Interactions like voice control and gesture recognition are rapidly gaining popularity. Such natural interactive systems need not only many integrated sensors, but also always-awake, reactive sensor frontends. These frontends generate large amounts of raw signals that state-of-the-art (SotA) frontends immediately digitize for processing on a DSP. This very robust approach is not power efficient, as not all raw sensor signals are equally relevant. The net information rate of a sensed signal is quite often significantly smaller than its Nyquist rate [1-7]. Existing works such as Information-Rate processing [1,2], Analog to Information conversion [3-5] and Compressed Sensing [6,7] show power savings by extracting or compressing the information from signals before digitizing the data. However, as these schemes operate in a static way, the compression or extraction parameters are set beforehand. Yet, the information content in raw signals and its application relevance vary dynamically depending on the operating context. Operating such systems efficiently hence requires a dynamic system adaptation depending on the context or signal information content. Existing systems do not perform such fine-grained adaptive behavior, which severely limits their power savings, as shown by the solid line in Fig. 1.

We propose a self-scalable, Power-Proportional sensing paradigm which gracefully scales the system’s power consumption with the amount and complexity of extracted information, i.e. the power consumption of such a system increases only as the task of information extraction gets more complex. To this end, in this paper we propose key enablers for Power-Proportionality and apply them to a proof-of-concept acoustic frontend for voice activity detection (VAD).

VAD systems distinguish speech from non-speech in different background noise contexts for varying signal to acoustic noise ratios (SANR). SotA VAD systems [8-10] extract complex features like Mel-Frequency Cepstral Coefficients, DCT etc. to differentiate speech from non-speech. The high computational complexity of such features results in large power consumption, typically about 50 - 100 µW [8-11], in addition to the power consumption of the required active microphone. Such a continuous large power consumption is unacceptable for battery-powered always-on sensor frontends. This work exploits our new Power-Proportional sensing paradigm along with moderate-precision, computationally inexpensive analog feature-extraction, coupled with an embedded mixed-signal classifier, to save more than 10× power over SotA without compromising on the classification accuracy.

The outline of this paper is as follows. Section II discusses insights into the design principles for Power-Proportional sensing and explains the rationale behind the analog feature-extraction instead of the commonly used digital scheme. Section III describes the architecture and specification set for the VAD, while the detailed implementation is discussed in Section IV. Measurement results for the chip and for the full VAD system are discussed in Section V.

II. KEY PRINCIPLES FOR POWER EFFICIENT SENSING

This section details the two key principles that allow our always-on sensing system to scale its power consumption with the information extracted, saving 10× power over SotA VAD systems.

A. Power-Proportional Sensing

The core premise for Power-Proportional sensing is that power consumption of the sensing system scales proportionally with the complexity of the sensing task. The sensing process with the target of information extraction can increase in complexity along two dimensions:

First, the amount of information extracted from the incoming signal can scale in complexity. Consider, for example, the task of speaker identification vs. speech detection. The former task entails the latter as a prerequisite first step, hence justifying the increase in power consumption. Enabling hierarchical operation for tasks of increasing complexity allows scaling of power consumption with complexity of information extraction. In such an architecture each processing stage extracts more complex information than the previous stage while consuming more power. This enables information extraction by necessity, as is shown on the horizontal axis in Fig. 1.

Secondly, even if the amount of extracted information remains the same, distinguishing the useful information from the background noise (the context) is subject to varying levels of difficulty. For this case consider the complexity of speech detection in a quiet office, in contrast to a noisy street environment. The amount of information needed is the same in both cases, but in the latter case, as the background noise maps directly onto the information spectrum, it creates in-band interference on the desired signal. As such, distinguishing speech from non-speech becomes more complex, hence justifying the increase in power consumption. Context-awareness enables Power-Proportional sensing to scale power as the background noise context scales the complexity of information extraction, as shown in bold in Fig. 1. For the example above, context-awareness allows the use of a much smaller discriminating feature subset in a low-noise environment and a relatively larger subset for noisy background contexts, hence scaling power.

SotA sensing systems do not exploit the power scaling opportunity offered by the above scenarios, and typically operate constantly in full processing mode. This plateaus the on-state power consumption for SotA sensing systems independent of system utility, as shown in Fig. 1.

B. Power Efficiency through Analog Analytics

The Power-Proportional sensing paradigm highlighted in the previous subsection needs hardware blocks whose power scales with complexity and precision. Such power scaling with precision is very different for analog and digital implementations. Analog power consumption scales gradually for thermal-noise-limited systems with low-to-medium precision, while digital has a logarithmic power vs. precision profile. As shown in [12] and in Fig. 2, for a 0.25 µm CMOS technology, analog computation is not only more power-efficient than digital for low-to-medium resolution processing, but also exhibits better scalability.

Reduction in supply voltage due to technology scaling allows more power-efficient digital circuits and calls into question the benefit of analog processing in advanced technologies. This is because, with scaling, the cost of maintaining the same precision in analog increases, as a larger bias current is needed to reduce the noise floor, compensating for the reduction in signal swing.
