Author

Yu Chen

Bio: Yu Chen is an academic researcher from the University of Michigan. The author has contributed to research in the topics of computer science and deep learning, has an h-index of 6, and has co-authored 8 publications receiving 127 citations.

Papers
Proceedings ArticleDOI
01 Feb 2019
TL;DR: No sub-$\mu\mathrm{W}$ VAD has been reported to date, preventing the use of VADs in unobtrusive mm-scale sensor nodes; moreover, prior designs' simple decision-tree or fixed neural-network-based approaches limit broader use for various acoustic event targets.
Abstract: Acoustic sensing is one of the most widely used sensing modalities to intelligently assess the environment. In particular, ultra-low power (ULP) always-on voice activity detection (VAD) is gaining attention as an enabling technology for IoT platforms. In many practical applications, acoustic events-of-interest occur infrequently. Therefore, the system power consumption is typically dominated by the always-on acoustic wakeup detector, while the remainder of the system is power-gated the vast majority of the time. A previous acoustic wakeup detector [1] consumed just 12 nW but could not process voice signals (up to 4 kHz bandwidth) or handle non-stationary events, which are essential qualities for a VAD. Prior VAD ICs [2], [3] demonstrated reliable performance but consumed significant power ($\gt20~\mu\mathrm{W}$) and lacked an analog frontend (AFE), which further increases power. Recent analog-domain feature-extraction-based VADs [4], [5] also reported $\mu\mathrm{W}$-level power consumption, and their simple decision tree [4] or fixed neural-network-based approach [5] limited broader use for various acoustic event targets. In summary, no sub-$\mu\mathrm{W}$ VAD has been reported to date, preventing the use of VADs in unobtrusive mm-scale sensor nodes.
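As a back-of-the-envelope illustration of why the always-on wakeup detector dominates system power in such nodes, consider a system that is power-gated except during rare acoustic events. The sketch below uses only the 12 nW wakeup figure cited above; the active-mode power and event duty cycle are illustrative assumptions, not values from the paper.

```python
# Illustrative average-power estimate for a duty-cycled acoustic sensor node.
# Only the 12 nW wakeup figure comes from the text; the rest are assumptions.

P_WAKEUP_NW = 12.0      # always-on acoustic wakeup detector [1], in nW
P_ACTIVE_UW = 500.0     # assumed whole-system power when awake, in uW
DUTY_CYCLE = 1e-5       # assumed fraction of time events keep the system awake

avg_nw = P_WAKEUP_NW + P_ACTIVE_UW * 1e3 * DUTY_CYCLE
print(f"average power: {avg_nw:.1f} nW "
      f"({P_WAKEUP_NW / avg_nw:.0%} from the always-on detector)")
```

With these assumed numbers, the always-on detector accounts for roughly 70% of the average power, which is why pushing it below 1 µW matters so much.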

44 citations

Proceedings ArticleDOI
01 Feb 2019
TL;DR: Visual SLAM requires massive computation in the CNN-based feature extraction and matching, as well as data-dependent dynamic memory access and control flow with high-precision operations, creating significant low-power design challenges.
Abstract: Simultaneous localization and mapping (SLAM) estimates an agent’s trajectory for all six degrees of freedom (6 DoF) and constructs a 3D map of an unknown surrounding. It is a fundamental kernel that enables head-mounted augmented/virtual reality devices and autonomous navigation of micro aerial vehicles. A noticeable recent trend in visual SLAM is to apply computation- and memory-intensive convolutional neural networks (CNNs) that outperform traditional hand-designed feature-based methods [1]. For each video frame, CNN-extracted features are matched with stored keypoints to estimate the agent’s 6-DoF pose by solving a perspective-n-points (PnP) non-linear optimization problem (Fig. 7.3.1, left). The agent’s long-term trajectory over multiple frames is refined by a bundle adjustment process (BA, Fig. 7.3.1, right), which involves a large-scale ($\sim$120-variable) non-linear optimization. Visual SLAM requires massive computation ($\gt$250 GOP/s) in the CNN-based feature extraction and matching, as well as data-dependent dynamic memory access and control flow with high-precision operations, creating significant low-power design challenges. Software implementations are impractical, resulting in 0.2 s runtime on a $\sim$3 GHz CPU + GPU system with a $\gt$100 MB memory footprint and $\gt$100 W power consumption. Prior ASICs have implemented either an incomplete SLAM system [2, 3] that lacks estimation of ego-motion or employed a simplified (non-CNN) feature extraction and tracking [2, 4, 5] that limits SLAM quality and range. A recent ASIC [5] augments visual SLAM with an off-chip high-precision inertial measurement unit (IMU), simplifying the computational complexity, but incurring additional power and cost overhead.
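For readers unfamiliar with the PnP step, the minimal Python sketch below recovers a 6-DoF pose from 3D-2D correspondences with OpenCV's non-linear solver. The synthetic points, camera intrinsics, and ground-truth pose are assumptions standing in for CNN feature matches and a real camera.

```python
# Minimal sketch of the perspective-n-point (PnP) pose-estimation step in
# visual SLAM. Synthetic 3D-2D correspondences stand in for CNN feature
# matches; in a real pipeline these come from descriptor matching.
import numpy as np
import cv2

rng = np.random.default_rng(0)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # assumed intrinsics

pts3d = rng.uniform(-1, 1, (50, 3)) + np.array([0, 0, 5.0])  # map keypoints
rvec_true = np.array([0.1, -0.2, 0.05])                      # ground-truth rotation
tvec_true = np.array([0.3, -0.1, 0.2])                       # ground-truth translation

pts2d, _ = cv2.projectPoints(pts3d, rvec_true, tvec_true, K, None)

ok, rvec, tvec = cv2.solvePnP(pts3d, pts2d, K, None)  # non-linear optimization
print("recovered rotation:", rvec.ravel())
print("recovered translation:", tvec.ravel())
```

Bundle adjustment generalizes the same re-projection-error minimization to many frames and map points jointly, which is where the large-scale optimization mentioned above comes from.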

40 citations

Journal ArticleDOI
TL;DR: An algorithm-circuit cross optimization is introduced to realize a 12-nW stand-alone microsystem that integrates the analog frontend with the digital backend signal classifier and replaces conventional high-power, area-consuming parallel feature extraction based on the fast Fourier transform.
Abstract: This paper presents an ultra-low-power acoustic sensing and object recognition microsystem for Internet of Things applications. The microsystem is targeted for unattended ground sensor nodes where a long (decades) lifetime is desired without the need for battery replacement. The system incorporates a microelectromechanical systems (MEMS) microphone as a frontend sensor along with active circuitry to identify target objects. We introduce an algorithm-circuit cross optimization to realize a 12-nW stand-alone microsystem that integrates the analog frontend with the digital backend signal classifier. The frequency-domain analysis of target audio signals reveals that the system can operate with a relatively low 3-dB bandwidth, which significantly relaxes power constraints on both the analog frontend and digital backend circuits. To further relax the current requirement of the preceding amplifier, we propose an 8-bit SAR analog-to-digital converter designed with a highly reduced sampling capacitance. The system achieves greater than 95% reliability and consumes only 12 nW with continuous monitoring.
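To make the serialized (one-band-at-a-time) frequency-domain feature extraction idea concrete, here is a hedged software sketch: band energies computed sequentially with band-pass filters instead of one parallel FFT. The sample rate, band edges, and test signal are illustrative assumptions, not the chip's actual parameters.

```python
# Sketch of serialized frequency-domain feature extraction: per-band energies
# computed one band at a time, a hardware-friendly alternative to a parallel
# FFT. Band edges and the input signal are illustrative assumptions.
import numpy as np
from scipy.signal import butter, lfilter

FS = 2000                                # assumed low sample rate (Hz)
t = np.arange(FS) / FS
x = np.sin(2 * np.pi * 150 * t) + 0.1 * np.random.randn(FS)   # toy input

bands = [(50, 150), (150, 300), (300, 500)]  # illustrative analysis bands (Hz)
features = []
for lo, hi in bands:                     # one band at a time -> serialized
    b, a = butter(2, [lo, hi], btype="bandpass", fs=FS)
    y = lfilter(b, a, x)
    features.append(np.mean(y ** 2))     # band energy feeds the classifier

print([f"{f:.4f}" for f in features])
```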

39 citations

Proceedings ArticleDOI
20 Jun 2021
TL;DR: In this paper, a deep autoencoder trained with a generative adversarial network (GAN) learns a compressed latent representation of each video frame, and a convolutional long short-term memory (ConvLSTM) network is employed to predict the latent vector representation of the future frame.
Abstract: Learning-based video compression has achieved substantial progress during recent years. The most influential approaches adopt deep neural networks (DNNs) to remove spatial and temporal redundancies by finding the appropriate lower-dimensional representations of frames in the video. We propose a novel DNN-based framework that predicts and compresses video sequences in the latent vector space. The proposed method first learns the efficient lower-dimensional latent space representation of each video frame and then performs inter-frame prediction in that latent domain. The proposed latent domain compression of individual frames is obtained by a deep autoencoder trained with a generative adversarial network (GAN). To exploit the temporal correlation within the video frame sequence, we employ a convolutional long short-term memory (ConvLSTM) network to predict the latent vector representation of the future frame. We demonstrate our method with two applications, video compression and abnormal event detection, that share an identical latent frame prediction network. The proposed method exhibits superior or competitive performance compared to the state-of-the-art algorithms specifically designed for either video compression or anomaly detection.
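A minimal PyTorch sketch of the latent-domain prediction step is shown below: a ConvLSTM cell consumes a sequence of latent frames and a 1x1 convolution maps its hidden state to the predicted next latent. The GAN-trained autoencoder is omitted, and all tensor sizes are illustrative assumptions rather than the paper's configuration.

```python
# Minimal ConvLSTM-based latent frame prediction. Encoder/decoder and GAN
# training are omitted; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.conv(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

B, C, H, W, T = 2, 8, 16, 16, 5          # batch, latent channels, spatial, time
cell = ConvLSTMCell(C, 32)
head = nn.Conv2d(32, C, 1)               # map hidden state back to latent space
h = torch.zeros(B, 32, H, W); c = torch.zeros_like(h)
latents = torch.randn(T, B, C, H, W)     # stand-in for autoencoder outputs
for t in range(T):
    h, c = cell(latents[t], (h, c))
pred_next_latent = head(h)               # a decoder would reconstruct the frame
print(pred_next_latent.shape)            # torch.Size([2, 8, 16, 16])
```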

39 citations

Journal ArticleDOI
TL;DR: This article presents a voice and acoustic activity detector that uses a mixer-based architecture and an ultra-low-power neural network (NN)-based classifier; it also features inaudible acoustic signature detection for intentional remote silent wakeup of the system while re-using a subset of the same system components.
Abstract: This article presents a voice and acoustic activity detector that uses a mixer-based architecture and an ultra-low-power neural network (NN)-based classifier. By sequentially scanning 4 kHz of frequency bands and down-converting to below 500 Hz, feature extraction power consumption is reduced by 4$\times$. The NN processor employs computational sprinting, enabling a 12$\times$ power reduction. The system also features inaudible acoustic signature detection for intentional remote silent wakeup of the system while re-using a subset of the same system components. The measurement results achieve 91.5%/90% speech/non-speech hit rates at 10-dB SNR with babble noise and 142-nW power consumption. Acoustic signature detection consumes 66 nW, successfully detecting a signature 10 dB below the noise level.
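The mixer-based scanning idea can be illustrated in a few lines of Python: each 500 Hz-wide slice of the 4 kHz voice band is down-converted to baseband by multiplication with a local oscillator and then low-pass filtered. The LO frequencies, filter order, and test tone are assumptions for illustration only.

```python
# Sketch of mixer-based band scanning: multiply the input by a local
# oscillator (LO) to shift one band down to baseband, then low-pass filter.
# Frequencies, filter order, and the test tone are illustrative assumptions.
import numpy as np
from scipy.signal import butter, lfilter

FS = 8000                                   # covers the 4 kHz voice band
t = np.arange(FS) / FS
x = np.sin(2 * np.pi * 1300 * t)            # toy tone inside the 1-1.5 kHz band

for f_lo in (500, 1000, 1500, 2000, 2500, 3000, 3500):  # sequential scan
    mixed = x * np.cos(2 * np.pi * f_lo * t)            # down-convert
    b, a = butter(4, 500, btype="low", fs=FS)           # keep < 500 Hz
    base = lfilter(b, a, mixed)
    print(f"LO {f_lo:4d} Hz -> baseband energy {np.mean(base**2):.4f}")
```

Only the LO settings adjacent to the tone's band produce significant baseband energy, which is how the scan localizes activity while the backend runs at a low rate.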

35 citations


Cited by
Journal ArticleDOI
TL;DR: This tutorial summarizes the efforts to date on semantic-aware and task-oriented communications, starting from their early adaptations and covering the foundations, algorithms, and potential implementations, with a focus on approaches that utilize information theory to provide the foundations.
Abstract: Communication systems to date primarily aim at reliably communicating bit sequences. Such an approach provides efficient engineering designs that are agnostic to the meanings of the messages or to the goal that the message exchange aims to achieve. Next generation systems, however, can be potentially enriched by folding message semantics and goals of communication into their design. Further, these systems can be made cognizant of the context in which communication exchange takes place, thereby providing avenues for novel design insights. This tutorial summarizes the efforts to date, starting from its early adaptations, semantic-aware and task-oriented communications, covering the foundations, algorithms and potential implementations. The focus is on approaches that utilize information theory to provide the foundations, as well as the significant role of learning in semantics and task-aware communications.

67 citations

Journal ArticleDOI
TL;DR: This paper presents an ultra-low-power voice activity detector (VAD) that uses analog signal processing for acoustic feature extraction (AFE) directly on the microphone output, approximate event-driven analog-to-digital conversion (ED-ADC), and digital deep neural network (DNN) for speech/non-speech classification.
Abstract: This paper presents an ultra-low-power voice activity detector (VAD). It uses analog signal processing for acoustic feature extraction (AFE) directly on the microphone output, approximate event-driven analog-to-digital conversion (ED-ADC), and a digital deep neural network (DNN) for speech/non-speech classification. New circuits, including the low-noise amplifier, bandpass filter, and full-wave rectifier, contribute to a more than 9$\times$ normalized power/channel reduction in the feature-extraction front-end compared to the best prior art. The digital DNN is a three-hidden-layer binarized multilayer perceptron (MLP) with a 2-neuron output layer and a 48-neuron input layer that receives parallel event streams from the ED-ADCs. To obtain the DNN weights via off-line training, a customized front-end model written in Python is constructed to accelerate feature generation in software emulation, and the model parameters are extracted from Spectre simulations. The chip, fabricated in 0.18-$\mu\text{m}$ CMOS, has a core area of 1.66$\times$1.52 mm$^2$ and consumes 1 $\mu\text{W}$. The classification measurements using the 1-hour 10-dB signal-to-noise ratio audio with restaurant background noise show a mean speech/non-speech hit rate of 84.4%/85.4% with a 1.88%/4.65% 1-$\sigma$ variation across ten dies that are all loaded with the same weights.
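As a software sketch of the binarized MLP described above, the NumPy snippet below runs a forward pass with ±1 weights and a sign activation. The input and output widths follow the abstract (48 and 2 neurons); the hidden-layer width, random weights, and input events are illustrative assumptions, not the trained network.

```python
# Sketch of a binarized MLP: +/-1 weights, sign activations. Layer count and
# input/output widths follow the abstract; everything else is assumed.
import numpy as np

rng = np.random.default_rng(0)
sizes = [48, 64, 64, 64, 2]              # 48 inputs, 3 hidden layers, 2 outputs
weights = [np.sign(rng.standard_normal((m, n)) + 1e-9)  # binarized to +/-1
           for m, n in zip(sizes[:-1], sizes[1:])]

def binarized_mlp(x):
    for w in weights[:-1]:
        x = np.sign(x @ w)               # sign activation after each hidden layer
        x[x == 0] = 1                    # break ties toward +1
    return x @ weights[-1]               # 2 output scores: speech / non-speech

events = rng.integers(0, 2, 48).astype(float)   # stand-in for ED-ADC event counts
scores = binarized_mlp(events)
print("speech" if scores[0] > scores[1] else "non-speech", scores)
```

Binarized weights reduce each multiply-accumulate to an add/subtract, which is the property that makes such a classifier practical at µW-and-below power budgets.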

57 citations

Proceedings ArticleDOI
10 Jun 2018
TL;DR: The vision of BARNET (Backscattering Activity Recognition NEtwork of Tags), a network of passive RF tags that use RF backscatter for tag-to-tag communication, is presented and the BARNET tag architecture shows that an ASIC implementation can run on harvested RF power.
Abstract: We present the vision of BARNET (Backscattering Activity Recognition NEtwork of Tags), a network of passive RF tags that use RF backscatter for tag-to-tag communication. BARNET not only provides identification of tagged objects but also can serve as a 'device-free' activity recognition system. BARNET's key innovation is the concept of backscatter channel state information (BCSI) which can be measured via systematic multiphase probing of the backscatter tag-to-tag channel using innovative processing on the passive tags. So far such measurements were only possible using active radio receivers that consume much higher power. Changes in BCSI provide signatures for different activities in the environment that can be learned using suitable machine learning tools. We develop the BARNET tag architecture which shows that an ASIC implementation can run on harvested RF power. We develop a printed circuit board (PCB) prototype using discrete components to evaluate activity recognition performance. We show that the prototype can recognize human daily activities with an average error around 6%. Overall, BARNET uses passive tags to achieve the same level of performance as systems that use powered, active radios.
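A toy numerical model of multiphase probing may help: a tag reflects with several discrete phase offsets, and the envelope seen at a second passive tag peaks when the probe phase cancels the channel phase, revealing coarse BCSI. The path gains, phase grid, and noise level below are assumptions, not BARNET's actual measurement procedure.

```python
# Toy model of backscatter channel state information (BCSI) via multiphase
# probing. All values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
direct = 1.0 + 0.0j                        # direct (unmodulated) path
ch = 0.3 * np.exp(1j * 1.1)                # unknown tag-to-tag channel

phases = 2 * np.pi * np.arange(8) / 8      # K = 8 probing phases
env = np.abs(direct + ch * np.exp(1j * phases))   # envelope at receiver tag
env += 0.005 * rng.standard_normal(env.size)      # measurement noise

# The envelope peaks when the probe phase cancels the channel phase.
est_phase = (-phases[np.argmax(env)]) % (2 * np.pi)
print(f"true channel phase 1.10 rad, coarse estimate {est_phase:.2f} rad "
      f"(resolution 2*pi/8)")
```

Activities in the environment perturb the channel term, so tracking such phase/amplitude estimates over time yields the activity signatures that the machine learning stage classifies.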

52 citations

Journal ArticleDOI
TL;DR: Vega, as discussed by the authors, is an IoT endnode system on chip (SoC) capable of scaling from a 1.7-μW fully retentive cognitive sleep mode up to 32.2-GOPS (at 49.4 mW) peak performance.
Abstract: The Internet-of-Things (IoT) requires endnodes with ultra-low-power always-on capability for a long battery lifetime, as well as high performance, energy efficiency, and extreme flexibility to deal with complex and fast-evolving near-sensor analytics algorithms (NSAAs). We present Vega, an IoT endnode system on chip (SoC) capable of scaling from a 1.7-μW fully retentive cognitive sleep mode up to 32.2-GOPS (at 49.4 mW) peak performance on NSAAs, including mobile deep neural network (DNN) inference, exploiting 1.6 MB of state-retentive SRAM, and 4 MB of non-volatile magnetoresistive random access memory (MRAM). To meet the performance and flexibility requirements of NSAAs, the SoC features ten RISC-V cores: one core for SoC and IO management and a nine-core cluster supporting multi-precision single instruction multiple data (SIMD) integer and floating-point (FP) computation. Vega achieves the state-of-the-art (SoA)-leading efficiency of 615 GOPS/W on 8-bit INT computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware acceleration). On FP computation, it achieves the SoA-leading efficiency of 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively. Two programmable machine learning (ML) accelerators boost energy efficiency in cognitive sleep and active states.

46 citations
