Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition
Citations
701 citations
411 citations
Cites methods from "Multichannel Signal Processing With..."
...[112] applied a deep neural network framework to jointly perform multichannel enhancement and acoustic modeling for automatic speech recognition (ASR)....
[...]
263 citations
Cites methods from "Multichannel Signal Processing With..."
...The model performs multichannel processing and acoustic modeling jointly, and has been shown to provide comparable or better performance to traditional enhancement techniques like beamforming [23]....
[...]
218 citations
Cites background from "Multichannel Signal Processing With..."
...The de facto standard metric to evaluate the performance of ASR systems is Word Error Rate (WER) or Word Accuracy Rate (WAR)....
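As an illustration of the metric named in this excerpt, the following is a minimal sketch of WER computed as word-level edit distance divided by the reference length. The function name `word_error_rate` and whitespace tokenization are assumptions made for the example, and WAR is taken here as 1 - WER, a common convention that the excerpt itself does not spell out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()  # assumption: words are whitespace-separated
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy_rate(reference: str, hypothesis: str) -> float:
    """WAR under the assumed convention WAR = 1 - WER."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

Because insertions are counted, WER can exceed 1 when the hypothesis is much longer than the reference, in which case WAR under this convention becomes negative.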
[...]
...For example, in Weninger et al. (2015) the authors conducted speech recognition on the enhanced speech and found that SDR and WER improvements are significantly correlated with Spearman’s rho = 0.84 in single-channel case, and Spearman’s rho = 0.92 in two-channel case, evaluated on the CHiME-2 benchmark database....
[...]
...Rather than using neural networks to support traditional beamformers and post-filters for speech enhancement, joint front- and back-end multi-channel ASR systems have recently attracted considerable attention with a goal of decreasing the WER directly (Hoshen et al. 2015; Liu et al. 2014; Swietojanski et al. 2014)....
[...]
...Although no research has proved that a good value of these intermediate metrics for enhancement techniques necessarily leads to a better WER or WAR, experimental results have frequently shown a strong correlation between them....
[...]
...A follow-up work was reported in Sainath et al. (2017), where the authors employed two convolutional layers, instead of one layer, at the front-end....
[...]
191 citations
References
9,500 citations
4,317 citations
"Multichannel Signal Processing With..." refers to methods in this paper
...The input to the FP module is a concatenation of frames of raw input samples xc(l)[t] from all the channels, and can also include features typically computed for localization such as cross correlation features [32], [31], [33]....
[...]
4,122 citations
"Multichannel Signal Processing With..." refers to background in this paper
...The beampatterns show the magnitude response in dB as a function of frequency and direction of arrival, i.e. each horizontal slice of the beampattern corresponds to the filter’s magnitude response for a signal coming from a particular direction, and each vertical slice corresponds to the filter’s…...
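To make the layout described in this excerpt concrete, here is a minimal sketch of how a beampattern could be computed for a filter-and-sum array. It assumes a linear array, far-field plane waves, and an angle measured from the array axis; the function name `beampattern_db` and all parameter names are hypothetical and are not taken from the paper.

```python
import cmath
import math


def beampattern_db(filters, mic_positions_m, freqs_hz, angles_deg,
                   sample_rate_hz, speed_of_sound=343.0):
    """Return a grid of magnitude responses in dB.

    Rows index direction of arrival and columns index frequency, so each row
    is the response to a signal arriving from one particular direction.
    filters[c] is the list of FIR taps applied to channel c.
    """
    pattern = []
    for angle in angles_deg:
        cos_theta = math.cos(math.radians(angle))
        row = []
        for freq in freqs_hz:
            omega = 2 * math.pi * freq / sample_rate_hz  # radians per sample
            total = 0j
            for taps, position in zip(filters, mic_positions_m):
                # Frequency response of this channel's FIR filter.
                response = sum(tap * cmath.exp(-1j * omega * n)
                               for n, tap in enumerate(taps))
                # Assumed far-field model: arrival delay relative to the origin.
                delay_s = position * cos_theta / speed_of_sound
                total += response * cmath.exp(-2j * math.pi * freq * delay_s)
            # Floor the magnitude to avoid log10(0) at exact nulls.
            row.append(20 * math.log10(max(abs(total), 1e-12)))
        pattern.append(row)
    return pattern
```

With two microphones 4 cm apart and single-tap filters of 0.5 each (a plain delay-and-sum beamformer steered to broadside), the row for 90 degrees stays at roughly 0 dB across frequency, while the row for 0 degrees shows a null near 4287.5 Hz, where the inter-microphone delay equals half a period.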
[...]
3,720 citations
3,475 citations