Home
/
Authors
/
Lianwu Chen

Author

Lianwu Chen

Bio: Lianwu Chen is an academic researcher from Tencent. The author has contributed to research in topics: Speech enhancement & Acoustic model. The author has an hindex of 11, co-authored 43 publications receiving 377 citations.

Topics: Speech enhancement, Acoustic model, Word error rate, PESQ, Computer science ...read more

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Neural Spatial Filter: Target Speaker Speech Separation Assisted with Directional Information.

[...]

Rongzhi Gu¹, Lianwu Chen², Shi-Xiong Zhang³, Zheng Jimeng², Yong Xu⁴, Meng Yu², Dan Su², Yuexian Zou¹, Dong Yu² - Show less +5 more•Institutions (4)

Peking University¹, Tencent², Microsoft³, Nanjing Forestry University⁴

15 Sep 2019

TL;DR: This paper integrates an attention mechanism to dynamically tune the model’s attention to the reliable input features to alleviate spatial ambiguity problem when multiple speakers are closely located and significantly improves the performance of speech separation against the baseline single-channel and multi-channel speech separation methods.

...read moreread less

Abstract: The recent exploration of deep learning for supervised speech separation has significantly accelerated the progress on the multi-talker speech separation problem. The multi-channel approaches have attracted much research attention due to the benefit of spatial information. In this paper, integrated with the power spectra and inter-channel spatial features at the input level, we explore to leverage directional features, which imply the speaker source from the desired target direction, for target speaker separation. In addition, we incorporate an attention mechanism to dynamically tune the model’s attention to the reliable input features to alleviate spatial ambiguity problem when multiple speakers are closely located. We demonstrate, on the far-field WSJ0 2-mix dataset, that our proposed approach significantly improves the performance of speech separation against the baseline single-channel and multi-channel speech separation methods.

...read moreread less

79 citations

Journal Article•DOI•

Multi-Modal Multi-Channel Target Speech Separation

[...]

Rongzhi Gu¹, Shi-Xiong Zhang¹, Yong Xu¹, Lianwu Chen¹, Yuexian Zou², Dong Yu¹ - Show less +2 more•Institutions (2)

Tencent¹, Peking University²

16 Mar 2020-IEEE Journal of Selected Topics in Signal Processing

TL;DR: A general multi-modal framework for target speech separation is proposed by utilizing all the available information of the target speaker, including his/her spatial location, voice characteristics and lip movements, and a factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multi- modalities at embedding level.

...read moreread less

Abstract: Target speech separation refers to extracting a target speaker's voice from an overlapped audio of simultaneous talkers. Previously the use of visual modality for target speech separation has demonstrated great potentials. This work proposes a general multi-modal framework for target speech separation by utilizing all the available information of the target speaker, including his/her spatial location, voice characteristics and lip movements. Also, under this framework, we investigate on the fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multi-modalities at embedding level. This method firstly factorizes the mixture audio into a set of acoustic subspaces, then leverages the target's information from other modalities to enhance these subspace acoustic embeddings with a learnable attention scheme. To validate the robustness of proposed multi-modal separation model in practical scenarios, the system was evaluated under the condition that one of the modalities is temporarily missing, invalid or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released) that spatialized by simulated room impulse responses (RIRs). Experiment results illustrate that our proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches, while can still support real-time processing.

...read moreread less

72 citations

Proceedings Article•DOI•

Time Domain Audio Visual Speech Separation

[...]

Jian Wu¹, Yong Xu¹, Shi-Xiong Zhang¹, Lianwu Chen¹, Meng Yu¹, Lei Xie², Dong Yu¹ - Show less +3 more•Institutions (2)

Tencent¹, Northwestern Polytechnical University²

01 Dec 2019

TL;DR: In this paper, a new time-domain audio-visual architecture for target speaker extraction from monaural mixtures is proposed, which includes an audio encoder, a video encoder that extracts lip embedding from video streams, a multi-modal separation network and an audio decoder.

...read moreread less

Abstract: Audio-visual multi-modal modeling has been demonstrated to be effective in many speech related tasks, such as speech recognition and speech enhancement. This paper introduces a new time-domain audio-visual architecture for target speaker extraction from monaural mixtures. The architecture generalizes the previous TasNet (time-domain speech separation network) to enable multi-modal learning and at meanwhile it extends the classical audio-visual speech separation from frequency-domain to time-domain. The main components of proposed architecture include an audio encoder, a video encoder that extracts lip embedding from video streams, a multi-modal separation network and an audio decoder. Experiments on simulated mixtures based on recently released LRS2 dataset show that our method can bring 3dB+ and 4dB+ Si-SNR improvements on two- and three-speaker cases respectively, compared to audio-only TasNet and frequency-domain audio-visual networks.

...read moreread less

71 citations

Proceedings Article•DOI•

Deep Extractor Network for Target Speaker Recovery From Single Channel Speech Mixtures

[...]

Jun Wang¹, Jie Chen¹, Dan Su¹, Lianwu Chen¹, Meng Yu¹, Yanmin Qian², Dong Yu¹ - Show less +3 more•Institutions (2)

Tencent¹, Shanghai Jiao Tong University²

24 Jul 2018

TL;DR: A novel "deep extractor network" which creates an extractor point for the target speaker in a canonical high dimensional embedding space, and pulls together the time-frequency bins corresponding to thetarget speaker.

...read moreread less

Abstract: Speaker-aware source separation methods are promising workarounds for major difficulties such as arbitrary source permutation and unknown number of sources. However, it remains challenging to achieve satisfying performance provided a very short available target speaker utterance (anchor). Here we present a novel "deep extractor network" which creates an extractor point for the target speaker in a canonical high dimensional embedding space, and pulls together the time-frequency bins corresponding to the target speaker. The proposed model is different from prior works in that the canonical embedding space encodes knowledges of both the anchor and the mixture during an end-to-end training phase: First, embeddings for the anchor and mixture speech are separately constructed in a primary embedding space, and then combined as an input to feed-forward layers to transform to a canonical embedding space which we discover more stable than the primary one. Experimental results show that given a very short utterance, the proposed model can efficiently recover high quality target speech from a mixture, which outperforms various baseline models, with 5.2% and 6.6% relative improvements in SDR and PESQ respectively compared with a baseline oracle deep attracor model. Meanwhile, we show it can be generalized well to more than one interfering speaker.

...read moreread less

63 citations

Posted Content•

End-to-End Multi-Channel Speech Separation.

[...]

Rongzhi Gu, Jian Wu, Shi-Xiong Zhang, Lianwu Chen, Yong Xu, Meng Yu, Dan Su, Yuexian Zou, Dong Yu - Show less +5 more

15 May 2019-arXiv: Sound

TL;DR: This paper proposes a new end-to-end model for multi-channel speech separation that reformulate the traditional short time Fourier transform and inter-channel phase difference as a function of time-domain convolution with a special kernel.

...read moreread less

Abstract: The end-to-end approach for single-channel speech separation has been studied recently and shown promising results. This paper extended the previous approach and proposed a new end-to-end model for multi-channel speech separation. The primary contributions of this work include 1) an integrated waveform-in waveform-out separation system in a single neural network architecture. 2) We reformulate the traditional short time Fourier transform (STFT) and inter-channel phase difference (IPD) as a function of time-domain convolution with a special kernel. 3) We further relaxed those fixed kernels to be learnable, so that the entire architecture becomes purely data-driven and can be trained from end-to-end. We demonstrate on the WSJ0 far-field speech separation task that, with the benefit of learnable spatial features, our proposed end-to-end multi-channel model significantly improved the performance of previous end-to-end single-channel method and traditional multi-channel methods.

...read moreread less

62 citations

1
2
3
4
…
5
6
7
8
9
10

Collapse

Cited by

PDF

Open Access

More filters

Posted Content•

DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement

[...]

Yanxin Hu¹, Yun Liu, Shubo Lv, Mengtao Xing¹, Shimin Zhang¹, Yihui Fu¹, Jian Wu², Bihong Zhang, Lei Xie¹ - Show less +5 more•Institutions (2)

Northwestern Polytechnical University¹, Microsoft²

01 Aug 2020-arXiv: Audio and Speech Processing

TL;DR: A new network structure simulating the complex-valued operation, called Deep Complex Convolution Recurrent Network (DCCRN), where both CNN and RNN structures can handle complex- valued operation.

...read moreread less

Abstract: Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or speech spectrum, via a naive convolution neural network (CNN) or recurrent neural network (RNN). Some recent studies use complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or real and imaginary part, respectively. Particularly, convolution recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper, we design a new network structure simulating the complex-valued operation, called Deep Complex Convolution Recurrent Network (DCCRN), where both CNN and RNN structures can handle complex-valued operation. The proposed DCCRN models are very competitive over other previous networks, either on objective or subjective metric. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first for the real-time-track and second for the non-real-time track in terms of Mean Opinion Score (MOS).

...read moreread less

237 citations

Posted Content•

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.

[...]

Quan Wang, Hannah Muckenhirn, Kevin W. Wilson, Prashant Sridhar, Zelin Wu, John R. Hershey, Rif A. Saurous, Ron Weiss, Ye Jia, Ignacio Lopez Moreno - Show less +6 more

11 Oct 2018-arXiv: Audio and Speech Processing

TL;DR: In this paper, a speaker recognition network that produces speaker-discriminative embeddings and a spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask.

...read moreread less

Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.

...read moreread less

197 citations

Journal Article•DOI•

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

[...]

Chao Zhang¹, Zichao Yang, Xiaodong He, Li Deng•Institutions (1)

University of Cambridge¹

15 Apr 2020-IEEE Journal of Selected Topics in Signal Processing

TL;DR: A technical review of available models and learning methods for multimodal intelligence, focusing on the combination of vision and natural language modalities, which has become an important topic in both the computer vision andnatural language processing research communities.

...read moreread less

Abstract: Deep learning methods haverevolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in their input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embedding, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of a broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.

...read moreread less

174 citations

Journal Article•DOI•

SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures

[...]

Katerina Zmolikova¹, Marc Delcroix², Keisuke Kinoshita², Tsubasa Ochiai², Tomohiro Nakatani², Lukas Burget¹, Jan Cernocky¹ - Show less +3 more•Institutions (2)

Brno University of Technology¹, NTT Communications Corp²

13 Jun 2019-IEEE Journal of Selected Topics in Signal Processing

TL;DR: This paper introduces SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker and shows the benefit of including speaker information in the processing and the effectiveness of the proposed method.

...read moreread less

Abstract: The processing of speech corrupted by interfering overlapping speakers is one of the challenging problems with regards to today's automatic speech recognition systems. Recently, approaches based on deep learning have made great progress toward solving this problem. Most of these approaches tackle the problem as speech separation, i.e., they blindly recover all the speakers from the mixture. In some scenarios, such as smart personal devices, we may however be interested in recovering one target speaker from a mixture. In this paper, we introduce SpeakerBeam, a method for extracting a target speaker from the mixture based on an adaptation utterance spoken by the target speaker. Formulating the problem as speaker extraction avoids certain issues such as label permutation and the need to determine the number of speakers in the mixture. With SpeakerBeam, we jointly learn to extract a representation from the adaptation utterance characterizing the target speaker and to use this representation to extract the speaker. We explore several ways to do this, mostly inspired by speaker adaptation in acoustic models for automatic speech recognition. We evaluate the performance on the widely used WSJ0-2mix and WSJ0-3mix datasets, and these datasets modified with more noise or more realistic overlapping patterns. We further analyze the learned behavior by exploring the speaker representations and assessing the effect of the length of the adaptation data. The results show the benefit of including speaker information in the processing and the effectiveness of the proposed method.

...read moreread less

158 citations

Journal Article•DOI•

Speech Processing for Digital Home Assistants: Combining signal processing with deep-learning techniques

[...]

Reinhold Haeb-Umbach¹, Shinji Watanabe², Tomohiro Nakatani, Michiel Bacchiani³, Bjorn Hoffmeister⁴, Michael L. Seltzer⁵, Heiga Zen³, Mehrez Souden⁴ - Show less +4 more•Institutions (5)

University of Paderborn¹, Johns Hopkins University², Google³, Apple Inc.⁴, Facebook⁵

30 Oct 2019-IEEE Signal Processing Magazine

TL;DR: The purpose of this article is to describe, in a way that is amenable to the nonspecialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants.

...read moreread less

Abstract: Once a popular theme of futuristic science fiction or far-fetched technology forecasts, digital home assistants with a spoken language interface have become a ubiquitous commodity today. This success has been made possible by major advancements in signal processing and machine learning for so-called far-field speech recognition, where the commands are spoken at a distance from the sound-capturing device. The challenges encountered are quite unique and different from many other use cases of automatic speech recognition (ASR). The purpose of this article is to describe, in a way that is amenable to the nonspecialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants. These technologies include multichannel acoustic echo cancellation (MAEC), microphone array processing and dereverberation techniques for signal enhancement, reliable wake-up word and end-of-interaction detection, and high-quality speech synthesis as well as sophisticated statistical models for speech and language, learned from large amounts of heterogeneous training data. In all of these fields, deep learning (DL) has played a critical role.

...read moreread less

115 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101

Collapse