Journal ArticleDOI
Speech Processing for Digital Home Assistants: Combining signal processing with deep-learning techniques
Reinhold Haeb-Umbach, Shinji Watanabe, Tomohiro Nakatani, Michiel Bacchiani, Björn Hoffmeister, Michael L. Seltzer, Heiga Zen, Mehrez Souden +7 more
TL;DR: The purpose of this article is to describe, in a way that is amenable to the nonspecialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants.
Abstract: Once a popular theme of futuristic science fiction or far-fetched technology forecasts, digital home assistants with a spoken language interface have become a ubiquitous commodity today. This success has been made possible by major advancements in signal processing and machine learning for so-called far-field speech recognition, where the commands are spoken at a distance from the sound-capturing device. The challenges encountered are quite unique and different from many other use cases of automatic speech recognition (ASR). The purpose of this article is to describe, in a way that is amenable to the nonspecialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants. These technologies include multichannel acoustic echo cancellation (MAEC), microphone array processing and dereverberation techniques for signal enhancement, reliable wake-up word and end-of-interaction detection, and high-quality speech synthesis, as well as sophisticated statistical models for speech and language, learned from large amounts of heterogeneous training data. In all of these fields, deep learning (DL) has played a critical role.
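To make the first of these components concrete, below is a minimal sketch of a normalized least-mean-squares (NLMS) adaptive filter, the classical single-channel core of an acoustic echo canceller; a deployed MAEC runs such adaptation per loudspeaker-microphone pair and adds control logic such as double-talk detection. All names and parameter values here are illustrative assumptions, not taken from the article.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filter_len=256, mu=0.5, eps=1e-8):
    """Suppress the loudspeaker echo in `mic` given the far-end reference.

    far_end : samples sent to the loudspeaker (echo reference)
    mic     : microphone signal = echo + near-end speech
    Returns the error signal, i.e., the estimate of the near-end speech.
    """
    w = np.zeros(filter_len)                          # adaptive FIR echo-path model
    out = np.zeros(len(mic))
    for n in range(filter_len - 1, len(mic)):
        x = far_end[n - filter_len + 1:n + 1][::-1]   # most recent reference samples
        e = mic[n] - w @ x                            # residual after echo estimate
        w += mu * e * x / (x @ x + eps)               # NLMS update with step size mu
        out[n] = e
    return out
```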
Citations
Posted Content
CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
Shinji Watanabe, Michael I. Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant +20 more
TL;DR: Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
Journal ArticleDOI
Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR
TL;DR: A novel method of time-varying beamforming with estimated complex spectra is proposed for single- and multi-channel speech enhancement, where deep neural networks predict the real and imaginary components of the direct-path signal from those of the noisy and reverberant input.
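As a rough illustration of the mapping idea summarized above (not the architecture of the cited paper), the sketch below assumes PyTorch and shows a toy network that takes the real and imaginary parts of a noisy STFT frame and predicts those of the direct-path signal; all layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class ComplexSpectralMapper(nn.Module):
    """Toy complex spectral mapping: noisy complex STFT frame in, estimate of
    the direct-path signal's real and imaginary components out."""

    def __init__(self, n_freq=257, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_freq, hidden),   # input: [real; imag] of one frame
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * n_freq),   # output: [real; imag] estimate
        )

    def forward(self, noisy_stft):           # (batch, n_freq), complex-valued
        x = torch.cat([noisy_stft.real, noisy_stft.imag], dim=-1)
        real, imag = self.net(x).chunk(2, dim=-1)
        return torch.complex(real, imag)
```

Such a model would typically be trained with a loss on the real and imaginary parts of the clean direct-path target, after which the estimated complex spectra can drive a time-varying beamformer as in the paper.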
Journal ArticleDOI
A review of speaker diarization: Recent advances with deep learning
Tae Jin Park, Naoyuki Kanda, Dimitrios Dimitriadis, Kyu Jeong Han, Shinji Watanabe, Shrikanth S. Narayanan +5 more
TL;DR: Speaker diarization is the task of labeling audio or video recordings with classes corresponding to speaker identity, or, in short, of identifying "who spoke when"; this paper reviews recent advances in the task driven by deep learning.
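The clustering stage common to many diarization systems can be sketched as below, assuming per-segment speaker embeddings (e.g., x-vectors) have already been extracted by an upstream model and that the number of speakers is known; this is an illustrative simplification, not the specific pipeline of the cited review.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(embeddings, segment_times, n_speakers=2):
    """Answer "who spoke when" by clustering per-segment speaker embeddings.

    embeddings    : (n_segments, dim) array of speaker embeddings
    segment_times : list of (start_sec, end_sec) per segment
    Returns (start_sec, end_sec, speaker_id) triples.
    """
    # Length-normalize so Euclidean distances reflect cosine similarity.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(emb)
    return [(s, e, int(lab)) for (s, e), lab in zip(segment_times, labels)]
```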
Journal ArticleDOI
Noninvasive Neural Interfacing With Wearable Muscle Sensors: Combining Convolutive Blind Source Separation Methods and Deep Learning Techniques for Neural Decoding
Aleš Holobar, Dario Farina +1 more
TL;DR: In this paper, the authors present a brief overview of neural interfaces and discuss the properties of multichannel surface electromyography (sEMG) in comparison to other central (CNS) and peripheral nervous system (PNS) recording modalities, with a focus on recent breakthroughs in convolutive blind source separation (BSS) methods and deep learning techniques.
Posted Content
Deep Learning based Multi-Source Localization with Source Splitting and its Effectiveness in Multi-Talker Speech Recognition.
TL;DR: In this paper, a source splitting mechanism was proposed that creates source-specific intermediate representations inside the network, so that the directions of arrival (DOAs) of all speakers can be estimated simultaneously from the audio mixture.
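For contrast with the neural approach summarized above, the classical signal-processing baseline for DOA estimation with a two-microphone pair is the generalized cross-correlation with phase transform (GCC-PHAT); a minimal single-source, free-field sketch, with all parameter values illustrative:

```python
import numpy as np

def gcc_phat_doa(x1, x2, fs, mic_distance, c=343.0):
    """Estimate one source's direction of arrival (degrees) from two mics."""
    n = len(x1) + len(x2)
    cross = np.fft.rfft(x1, n) * np.conj(np.fft.rfft(x2, n))
    cross /= np.abs(cross) + 1e-12             # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    max_shift = int(fs * mic_distance / c)     # physically plausible lags only
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs   # time difference of arrival
    return np.degrees(np.arcsin(np.clip(tau * c / mic_distance, -1.0, 1.0)))
```

A single microphone pair resolves only one angle; estimating the DOAs of several simultaneous talkers from a mixture, as in the cited work, is exactly where learned models take over.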
References
Proceedings ArticleDOI
Deep Residual Learning for Image Recognition
TL;DR: The authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously, provide comprehensive empirical evidence that these residual networks are easier to optimize and gain accuracy from considerably increased depth, and won first place in the ILSVRC 2015 classification task.
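The core construct is the residual block, in which the stacked layers learn only a correction F(x) that is added to an identity shortcut, so the output is ReLU(F(x) + x). A minimal PyTorch sketch of the basic block, with channel counts illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: the layers learn the residual F(x), and the
    identity shortcut carries x through unchanged."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # identity shortcut makes F(x) a residual
```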
Posted Content
WaveNet: A Generative Model for Raw Audio
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, Koray Kavukcuoglu +8 more
TL;DR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; the model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones, can be trained efficiently on data with tens of thousands of samples per second of audio, and can also be employed as a discriminative model, returning promising results for phoneme recognition.
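WaveNet makes this sample-by-sample autoregression tractable with stacks of dilated causal convolutions, doubling the dilation at each layer so the receptive field grows exponentially with depth. Below is a minimal sketch of that mechanism only, omitting WaveNet's gated activations, residual and skip connections, and categorical output distribution; layer counts are illustrative.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """Stack of dilated causal 1-D convolutions: each output sample depends on
    a window of past samples whose length doubles with every layer."""

    def __init__(self, channels=32, n_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(n_layers)           # receptive field: 2**n_layers
        )

    def forward(self, x):                      # x: (batch, channels, time)
        for conv in self.layers:
            pad = conv.dilation[0]             # left-pad so no future leaks in
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return x
```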
Journal ArticleDOI
Image method for efficiently simulating small-room acoustics
Jont B. Allen, David A. Berkley +1 more
TL;DR: Image techniques are developed, in theory and in practice, for simulating the impulse response between two points in a small rectangular room; when convolved with any desired input signal, this response simulates the room reverberation of that signal.
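A minimal sketch of the image method for a rectangular room, assuming a single frequency-independent reflection coefficient shared by all six walls and rounding delays to whole samples; the cited paper additionally treats per-wall coefficients and practical refinements.

```python
import numpy as np
from itertools import product

def image_method_rir(room, src, mic, fs=16000, beta=0.9, max_order=4, c=343.0):
    """Room impulse response between src and mic in a rectangular room,
    built by summing the contributions of mirrored image sources."""
    length = int(fs * (2 * max_order + 1) * max(room) / c) + 1
    rir = np.zeros(length)
    grid = range(-max_order, max_order + 1)
    for n, l, m in product(grid, repeat=3):        # room translations per axis
        for q, j, k in product((0, 1), repeat=3):  # mirrored source or not
            img = np.array([(1 - 2 * q) * src[0] + 2 * n * room[0],
                            (1 - 2 * j) * src[1] + 2 * l * room[1],
                            (1 - 2 * k) * src[2] + 2 * m * room[2]])
            refl = abs(n - q) + abs(n) + abs(l - j) + abs(l) + abs(m - k) + abs(m)
            dist = np.linalg.norm(img - mic)
            t = int(round(fs * dist / c))
            if t < length:                         # drop images beyond the tail
                rir[t] += beta ** refl / (4 * np.pi * max(dist, 1e-3))
    return rir

# Illustrative call: a 5 m x 4 m x 3 m room, source and mic 2.5 m apart.
rir = image_method_rir((5.0, 4.0, 3.0), np.array([1.0, 1.5, 1.2]),
                       np.array([3.5, 2.0, 1.2]))
```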