scispace - formally typeset
Search or ask a question
Journal ArticleDOI

Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition

TL;DR: This paper introduces a neural network architecture, which performs multichannel filtering in the first layer of the network, and shows that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target Speaker direction.
Abstract: Multichannel automatic speech recognition (ASR) systems commonly separate speech enhancement, including localization, beamforming, and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture, which performs multichannel filtering in the first layer of the network, and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model.
Citations
More filters
Journal ArticleDOI
TL;DR: A thorough examination of the different studies that have been conducted since 2006, when deep learning first arose as a new area of machine learning, for speech applications is provided.
Abstract: Over the past decades, a tremendous amount of research has been done on the use of machine learning for speech processing applications, especially speech recognition. However, in the past few years, research has focused on utilizing deep learning for speech-related applications. This new area of machine learning has yielded far better results when compared to others in a variety of applications including speech, and thus became a very attractive area of research. This paper provides a thorough examination of the different studies that have been conducted since 2006, when deep learning first arose as a new area of machine learning, for speech applications. A thorough statistical analysis is provided in this review which was conducted by extracting specific information from 174 papers published between the years 2006 and 2018. The results provided in this paper shed light on the trends of research in this area as well as bring focus to new research topics.

701 citations

Journal ArticleDOI
TL;DR: A thorough investigation of deep learning in its applications and mechanisms is sought, as a categorical collection of state of the art in deep learning research, to provide a broad reference for those seeking a primer on deep learning and its various implementations, platforms, algorithms, and uses in a variety of smart-world systems.
Abstract: Deep learning has exploded in the public consciousness, primarily as predictive and analytical products suffuse our world, in the form of numerous human-centered smart-world systems, including targeted advertisements, natural language assistants and interpreters, and prototype self-driving vehicle systems. Yet to most, the underlying mechanisms that enable such human-centered smart products remain obscure. In contrast, researchers across disciplines have been incorporating deep learning into their research to solve problems that could not have been approached before. In this paper, we seek to provide a thorough investigation of deep learning in its applications and mechanisms. Specifically, as a categorical collection of state of the art in deep learning research, we hope to provide a broad reference for those seeking a primer on deep learning and its various implementations, platforms, algorithms, and uses in a variety of smart-world systems. Furthermore, we hope to outline recent key advancements in the technology, and provide insight into areas, in which deep learning can improve investigation, as well as highlight new areas of research that have yet to see the application of deep learning, but could nonetheless benefit immensely. We hope this survey provides a valuable reference for new deep learning practitioners, as well as those seeking to innovate in the application of deep learning.

411 citations


Cites methods from "Multichannel Signal Processing With..."

  • ...[112] applied a deep neural network framework to jointly perform multichannel enhancement and acoustic modeling for automatic speech recognition (ASR)....

    [...]

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The structure and application of an acoustic room simulator to generate large-scale simulated data for training deep neural networks for far-field speech recognition and performance is evaluated using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in earlier work.
Abstract: We describe the structure and application of an acoustic room simulator to generate large-scale simulated data for training deep neural networks for far-field speech recognition. The system simulates millions of different room dimensions, a wide distribution of reverberation time and signal-to-noise ratios, and a range of microphone and sound source locations. We start with a relatively clean training set as the source and artificially create simulated data by randomly sampling a noise configuration for every new training example. As a result, the acoustic model is trained using examples that are virtually never repeated. We evaluate performance of this approach based on room simulation using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in our earlier work, which uses CFFT layers and LSTM AMs for joint multichannel processing and acoustic modeling. Results show that the simulator-driven approach is quite effective in obtaining large improvements not only in simulated test conditions, but also in real / rerecorded conditions. This room simulation system has been employed in training acoustic models including the ones for the recently released Google Home.

263 citations


Cites methods from "Multichannel Signal Processing With..."

  • ...The model performs multichannel processing and acoustic modeling jointly, and has been shown to provide comparable or better performance to traditional enhancement techniques like beamforming [23]....

    [...]

Journal ArticleDOI
TL;DR: A review of recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems.
Abstract: Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition but still remains an important challenge. Data-driven supervised approaches, especially the ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks. In the meanwhile, we discuss the pros and cons of these approaches and provide their experimental results on benchmark databases. We expect that this overview can facilitate the development of the robustness of speech recognition systems in acoustic noisy environments.

218 citations


Cites background from "Multichannel Signal Processing With..."

  • ...The de facto standard metric to evaluate the performance of ASR systems is Word Error Rate (WER) or Word Accuracy Rate (WAR)....

    [...]

  • ...For example, in Weninger et al. (2015) the authors conducted speech recognition on the enhanced speech and found that SDR and WER improvements are significantly correlated with Spearman’s rho= 0.84 in single-channel case, and Spearman’s rho = 0.92 in two-channel case, evaluated on the CHiME-2 benchmark database....

    [...]

  • ...Rather than using neural networks to support traditional beamformers and post-filters for speech enhancement, joint front- and back-end multi-channel ASR systems have recently attracted considerable attention with a goal of decreasing the WER directly (Hoshen et al. 2015; Liu et al. 2014; Swietojanski et al. 2014)....

    [...]

  • ...Although no research has proved that a good value of these intermediate metrics for enhancement techniques necessarily leads to a better WER or WAR, experimental results have frequently shown a strong correlation between them....

    [...]

  • ...A follow-up work was reported in Sainath et al. (2017), where the authors employed two convolutional layers, instead of one layer, at the front-end....

    [...]

Proceedings ArticleDOI
01 Sep 2018
TL;DR: In this paper, a deep neural network was proposed to estimate the directions of arrival (DOA) of multiple sound sources in anechoic, matched and unmatched reverberant conditions.
Abstract: This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision and generate SPS with high signal-to-noise ratio.

191 citations

References
More filters
Proceedings Article
31 Mar 2010
TL;DR: The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
Abstract: Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activations functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence. 1 Deep Neural Networks Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features. They include Appearing in Proceedings of the 13 International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: WC Weston et al., 2008). Much attention has recently been devoted to them (see (Bengio, 2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009), suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures. Most of the recent experimental results with deep architecture are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pretraining (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a “better” basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization. But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results. So here instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multilayer neural networks. Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pretraining is a particular form of initialization and it has a drastic impact).

9,500 citations

Journal ArticleDOI
TL;DR: In this paper, a maximum likelihood estimator is developed for determining time delay between signals received at two spatially separated sensors in the presence of uncorrelated noise, where the role of the prefilters is to accentuate the signal passed to the correlator at frequencies for which the signal-to-noise (S/N) ratio is highest and suppress the noise power.
Abstract: A maximum likelihood (ML) estimator is developed for determining time delay between signals received at two spatially separated sensors in the presence of uncorrelated noise. This ML estimator can be realized as a pair of receiver prefilters followed by a cross correlator. The time argument at which the correlator achieves a maximum is the delay estimate. The ML estimator is compared with several other proposed processors of similar form. Under certain conditions the ML estimator is shown to be identical to one proposed by Hannan and Thomson [10] and MacDonald and Schultheiss [21]. Qualitatively, the role of the prefilters is to accentuate the signal passed to the correlator at frequencies for which the signal-to-noise (S/N) ratio is highest and, simultaneously, to suppress the noise power. The same type of prefiltering is provided by the generalized Eckart filter, which maximizes the S/N ratio of the correlator output. For low S/N ratio, the ML estimator is shown to be equivalent to Eckart prefiltering.

4,317 citations


"Multichannel Signal Processing With..." refers methods in this paper

  • ...The input to the FP module is a concatenation of frames of raw input samples xc(l)[t] from all the channels, and can also include features typically computed for localization such as cross correlation features [32], [31], [33]....

    [...]

Journal ArticleDOI
TL;DR: An overview of beamforming from a signal-processing perspective is provided, with an emphasis on recent research.
Abstract: An overview of beamforming from a signal-processing perspective is provided, with an emphasis on recent research. Data-independent, statistically optimum, adaptive, and partially adaptive beamforming are discussed. Basic notation, terminology, and concepts are included. Several beamformer implementations are briefly described. >

4,122 citations


"Multichannel Signal Processing With..." refers background in this paper

  • ...The beampatterns show the magnitude response in dB as a function of frequency and direction of arrival, i.e. each horizontal slice of the beampattern corresponds to the filter’s magnitude response for a signal coming from a particular direction, and each vertical slice corresponds to the filter’s…...

    [...]

Journal ArticleDOI
TL;DR: The theoretical and practical use of image techniques for simulating the impulse response between two points in a small rectangular room, when convolved with any desired input signal, simulates room reverberation of the input signal.
Abstract: Image methods are commonly used for the analysis of the acoustic properties of enclosures. In this paper we discuss the theoretical and practical use of image techniques for simulating, on a digital computer, the impulse response between two points in a small rectangular room. The resulting impulse response, when convolved with any desired input signal, such as speech, simulates room reverberation of the input signal. This technique is useful in signal processing or psychoacoustic studies. The entire process is carried out on a digital computer so that a wide range of room parameters can be studied with accurate control over the experimental conditions. A fortran implementation of this model has been included.

3,720 citations

Proceedings Article
03 Dec 2012
TL;DR: This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
Abstract: Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly- sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.

3,475 citations