Journal Article (DOI)

Speech Processing for Digital Home Assistants: Combining signal processing with deep-learning techniques

TLDR
The purpose of this article is to describe, in a way that is amenable to the nonspecialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants.
Abstract
Once a popular theme of futuristic science fiction or far-fetched technology forecasts, digital home assistants with a spoken language interface have become a ubiquitous commodity today. This success has been made possible by major advancements in signal processing and machine learning for so-called far-field speech recognition, where the commands are spoken at a distance from the sound-capturing device. The challenges encountered are quite unique and different from many other use cases of automatic speech recognition (ASR). The purpose of this article is to describe, in a way that is amenable to the nonspecialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants. These technologies include multichannel acoustic echo cancellation (MAEC), microphone array processing and dereverberation techniques for signal enhancement, reliable wake-up word and end-of-interaction detection, and high-quality speech synthesis as well as sophisticated statistical models for speech and language, learned from large amounts of heterogeneous training data. In all of these fields, deep learning (DL) has played a critical role.
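To make one of the building blocks named above concrete, here is a minimal sketch of a single-channel acoustic echo canceller built on a normalized least-mean-squares (NLMS) adaptive filter, a classical core of MAEC systems. This is a NumPy illustration, not the article's implementation; the filter length, step size, and synthetic echo path are illustrative assumptions.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, num_taps=128, mu=0.5, eps=1e-8):
    """Suppress the far-end echo in the microphone signal with an NLMS filter.

    far_end : loudspeaker (reference) signal
    mic     : microphone signal = echo(far_end) + near-end speech
    Returns the estimated near-end (echo-suppressed) signal.
    """
    w = np.zeros(num_taps)       # adaptive estimate of the echo path
    buf = np.zeros(num_taps)     # most recent far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        echo_hat = w @ buf                       # predicted echo
        e = mic[n] - echo_hat                    # residual = near-end estimate
        out[n] = e
        w += mu * e * buf / (buf @ buf + eps)    # NLMS update
    return out

# Toy demo: synthetic decaying echo path plus a near-end speech burst.
rng = np.random.default_rng(0)
far = rng.standard_normal(16000)
echo_path = rng.standard_normal(64) * np.exp(-np.arange(64) / 10.0)
near = np.zeros(16000)
near[8000:8400] = rng.standard_normal(400)
mic = np.convolve(far, echo_path)[:16000] + near
clean = nlms_echo_canceller(far, mic)
print("residual echo power before:", np.mean((mic - near) ** 2))
print("residual echo power after: ", np.mean((clean - near) ** 2))
```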



Citations
Posted Content

CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

TL;DR: Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario, with a complete set of reproducible open-source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
Journal Article (DOI)

Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR

TL;DR: A novel method of time-varying beamforming with estimated complex spectra for single- and multi-channel speech enhancement, where deep neural networks are used to predict the real and imaginary components of the direct-path signal from noisy and reverberant ones.
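The following is a minimal sketch of the complex-spectral-mapping framework described above: an STFT front end, a network that maps the noisy real/imaginary spectrogram to a prediction of the direct-path real/imaginary components, and an inverse STFT for reconstruction. The `dnn` here is a placeholder identity function standing in for the paper's trained network; the window and hop sizes are assumptions.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    frames = [win * x[i:i + n_fft] for i in range(0, len(x) - n_fft, hop)]
    return np.fft.rfft(np.array(frames), axis=1)      # (T, F) complex

def istft(X, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(X) - 1) + n_fft)
    norm = np.zeros_like(out)
    for t, spec in enumerate(X):                       # weighted overlap-add
        out[t * hop:t * hop + n_fft] += win * np.fft.irfft(spec, n_fft)
        norm[t * hop:t * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def enhance(noisy, dnn):
    """Complex spectral mapping: the network maps the noisy real/imaginary
    spectrogram directly to the real/imaginary parts of the direct-path
    signal, which is then resynthesized with the inverse STFT."""
    X = stft(noisy)
    feats = np.stack([X.real, X.imag], axis=-1)        # (T, F, 2) input
    pred = dnn(feats)                                  # (T, F, 2) prediction
    return istft(pred[..., 0] + 1j * pred[..., 1])

# Placeholder "network": identity mapping, just to make the sketch runnable.
identity_dnn = lambda feats: feats
y = enhance(np.random.default_rng(0).standard_normal(16000), identity_dnn)
```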
Journal Article (DOI)

A review of speaker diarization: Recent advances with deep learning

TL;DR: Speaker diarization is the task of labeling audio or video recordings with classes corresponding to speaker identity, or, in short, identifying "who spoke when" in a recording.
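As a small illustration of the "who spoke when" output such a system produces, the sketch below collapses per-frame speaker decisions into time-stamped segments. The frame duration and labels are toy assumptions, and the upstream embedding-extraction and clustering stages are omitted.

```python
import numpy as np

def frames_to_segments(labels, frame_dur=0.01):
    """Collapse per-frame speaker labels into 'who spoke when' segments,
    the standard output of a diarization system."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start * frame_dur, i * frame_dur, labels[start]))
            start = i
    return segments

# Toy per-frame decisions from a hypothetical upstream clusterer.
frame_labels = np.array(["spk0"] * 150 + ["spk1"] * 250 + ["spk0"] * 100)
for onset, offset, spk in frames_to_segments(frame_labels):
    print(f"{onset:6.2f}s - {offset:6.2f}s  {spk}")
```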
Journal Article (DOI)

Noninvasive Neural Interfacing With Wearable Muscle Sensors: Combining Convolutive Blind Source Separation Methods and Deep Learning Techniques for Neural Decoding

TL;DR: In this paper, the authors present a brief overview of neural interfaces and discuss the properties of multichannel sEMG in comparison to other CNS and PNS recording modalities, with a focus on recent breakthroughs in convolutive blind source separation (BSS) methods and deep learning techniques.
Posted Content

Deep Learning based Multi-Source Localization with Source Splitting and its Effectiveness in Multi-Talker Speech Recognition.

TL;DR: In this paper, a source splitting mechanism is proposed that creates source-specific intermediate representations inside the network, allowing the direction of arrival (DOA) of all speakers to be estimated simultaneously from the audio mixture.
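For contrast with the deep-learning approach above, the sketch below estimates a single source's DOA with classical GCC-PHAT on one microphone pair: whiten the cross-spectrum, pick the peak lag, and convert the time difference of arrival to an angle. The sampling rate, microphone spacing, and synthetic delay are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def gcc_phat_doa(x1, x2, fs=16000, mic_dist=0.1, c=343.0):
    """Estimate a single source's DOA from one mic pair via GCC-PHAT."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)  # PHAT weighting
    max_lag = int(fs * mic_dist / c)             # physically possible lags
    cc = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])
    tdoa = (np.argmax(np.abs(cc)) - max_lag) / fs
    return np.degrees(np.arcsin(np.clip(tdoa * c / mic_dist, -1, 1)))

# Toy check: the second mic receives the source delayed by 3 samples,
# corresponding to an off-axis source at roughly -40 degrees.
rng = np.random.default_rng(1)
s = rng.standard_normal(4096)
print(f"estimated DOA: {gcc_phat_doa(s, np.roll(s, 3)):.1f} deg")
```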
References
Proceedings Article (DOI)

Deep Residual Learning for Image Recognition

TL;DR: The authors present a residual learning framework to ease the training of networks substantially deeper than those used previously, provide comprehensive empirical evidence that these residual networks are easier to optimize and gain accuracy from considerably increased depth, and won 1st place in the ILSVRC 2015 classification task.
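The core idea is compact enough to sketch: a residual block computes y = x + F(x), so the layers only have to learn the residual F rather than the full mapping. The toy NumPy block below uses fully connected layers in place of the paper's convolutions; the dimensions and weight scales are illustrative assumptions.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Deep residual learning in miniature: compute the residual F(x)
    with two layers and add the identity shortcut, y = x + F(x)."""
    relu = lambda z: np.maximum(z, 0.0)
    return relu(x + relu(x @ w1) @ w2)   # identity skip connection

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))                       # batch of 4 vectors
w1 = rng.standard_normal((64, 64)) * 0.1
w2 = rng.standard_normal((64, 64)) * 0.1
print(residual_block(x, w1, w2).shape)                 # (4, 64)
```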
Posted Content

WaveNet: A Generative Model for Raw Audio

TL;DR: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms that is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; it can be trained efficiently on data with tens of thousands of samples per second of audio and also yields promising results as a discriminative model for phoneme recognition.
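The mechanism that makes this tractable is the dilated causal convolution: each output sample depends only on past samples, and stacking layers with exponentially growing dilations expands the receptive field cheaply. Below is a minimal NumPy sketch of that mechanism with toy filter weights; it omits WaveNet's gated activations, skip connections, and sample-by-sample autoregressive generation loop.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """One dilated causal convolution: the output at time t depends only on
    inputs at t, t-d, t-2d, ..., preserving the autoregressive ordering."""
    pad = dilation * (len(w) - 1)
    xp = np.concatenate([np.zeros(pad), x])   # left-pad so no future leaks in
    return sum(w[k] * xp[pad - k * dilation : pad - k * dilation + len(x)]
               for k in range(len(w)))

# Stack with exponentially growing dilations: the receptive field roughly
# doubles per layer, which is how WaveNet covers long audio contexts cheaply.
x = np.random.default_rng(0).standard_normal(1024)
for d in (1, 2, 4, 8, 16):
    x = np.tanh(causal_dilated_conv(x, np.array([0.5, 0.4]), d))
print(x.shape)   # (1024,), same length, strictly causal
```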
Journal Article (DOI)

Image method for efficiently simulating small-room acoustics

TL;DR: This paper develops the theoretical and practical use of image techniques for simulating the impulse response between two points in a small rectangular room; the resulting impulse response, when convolved with any desired input signal, simulates the room reverberation of that signal.
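A one-dimensional toy version of the image method conveys the idea: mirroring the source across the two walls yields a lattice of image sources, and each image contributes a delayed, attenuated impulse to the room impulse response. The reflection coefficient, room geometry, and truncation order below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def image_method_rir_1d(src, mic, room_len, fs=16000, c=343.0,
                        beta=0.9, max_n=20):
    """Toy 1-D image method: walls at x=0 and x=room_len mirror the source
    into images at 2*n*L + src (|2n| reflections) and 2*n*L - src
    (|2n-1| reflections); each image adds a delayed, attenuated impulse."""
    rir = np.zeros(fs)                      # 1 second of impulse response
    for n in range(-max_n, max_n + 1):
        for sign, refl in ((+1, abs(2 * n)), (-1, abs(2 * n - 1))):
            img = 2 * n * room_len + sign * src
            dist = abs(img - mic)
            sample = int(round(dist / c * fs))
            if 0 < sample < len(rir):
                rir[sample] += beta ** refl / max(dist, 1e-3)
    return rir

rir = image_method_rir_1d(src=1.0, mic=3.0, room_len=5.0)
# Convolving any dry signal with the RIR simulates its reverberation:
dry = np.random.default_rng(0).standard_normal(16000)
wet = np.convolve(dry, rir)[:16000]
```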
