Home
/
Authors
/
Kun Han

Author

Kun Han

Other affiliations: Ohio State University, University of Science and Technology of China

Bio: Kun Han is an academic researcher from DiDi. The author has contributed to research in topics: Speech processing & Supervised learning. The author has an hindex of 13, co-authored 34 publications receiving 1471 citations. Previous affiliations of Kun Han include Ohio State University & University of Science and Technology of China.

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Speech emotion recognition using deep neural network and extreme learning machine.

[...]

Kun Han¹, Dong Yu², Ivan Tashev²•Institutions (2)

Ohio State University¹, Microsoft²

14 Sep 2014

TL;DR: The experimental results demonstrate that the proposed approach effectively learns emotional information from low-level features and leads to 20% relative accuracy improvement compared to the state of the art approaches.

...read moreread less

Abstract: Speech emotion recognition is a challenging problem partly because it is unclear what features are effective for the task. In this paper we propose to utilize deep neural networks (DNNs) to extract high level features from raw data and show that they are effective for speech emotion recognition. We first produce an emotion state probability distribution for each speech segment using DNNs. We then construct utterance-level features from segment-level probability distributions. These utterancelevel features are then fed into an extreme learning machine (ELM), a special simple and efficient single-hidden-layer neural network, to identify utterance-level emotions. The experimental results demonstrate that the proposed approach effectively learns emotional information from low-level features and leads to 20% relative accuracy improvement compared to the stateof-the-art approaches.

...read moreread less

681 citations

Journal Article•DOI•

Learning spectral mapping for speech dereverberation and denoising

[...]

Kun Han¹, Yuxuan Wang¹, DeLiang Wang¹, William S. Woods, Ivo Merks, Tao Zhang - Show less +2 more•Institutions (1)

Ohio State University¹

01 Jun 2015-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: Deep neural networks are trained to directly learn a spectral mapping from the magnitude spectrogram of corrupted speech to that of clean speech, which substantially attenuates the distortion caused by reverberation, as well as background noise, and is conceptually simple.

...read moreread less

Abstract: In real-world environments, human speech is usually distorted by both reverberation and background noise, which have negative effects on speech intelligibility and speech quality. They also cause performance degradation in many speech technology applications, such as automatic speech recognition. Therefore, the dereverberation and denoising problems must be dealt with in daily listening environments. In this paper, we propose to perform speech dereverberation using supervised learning, and the supervised approach is then extended to address both dereverberation and denoising. Deep neural networks are trained to directly learn a spectral mapping from the magnitude spectrogram of corrupted speech to that of clean speech. The proposed approach substantially attenuates the distortion caused by reverberation, as well as background noise, and is conceptually simple. Systematic experiments show that the proposed approach leads to significant improvements of predicted speech intelligibility and quality, as well as automatic speech recognition in reverberant noisy conditions. Comparisons show that our approach substantially outperforms related methods.

...read moreread less

229 citations

Journal Article•DOI•

Exploring Monaural Features for Classification-Based Speech Segregation

[...]

Yuxuan Wang¹, Kun Han¹, DeLiang Wang¹•Institutions (1)

Ohio State University¹

01 Feb 2013-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper expands T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cep stral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP), and proposes to use a group Lasso approach to select complementary features in a principled way.

...read moreread less

Abstract: Monaural speech segregation has been a very challenging problem for decades. By casting speech segregation as a binary classification problem, recent advances have been made in computational auditory scene analysis on segregation of both voiced and unvoiced speech. So far, pitch and amplitude modulation spectrogram have been used as two main kinds of time-frequency (T-F) unit level features in classification. In this paper, we expand T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP). Comprehensive comparisons are performed in order to identify effective features for classification-based speech segregation. Our experiments in matched and unmatched test conditions show that these newly included features significantly improve speech segregation performance. Specifically, GFCC and RASTA-PLP are the best single features in matched-noise and unmatched-noise test conditions, respectively. We also find that pitch-based features are crucial for good generalization to unseen environments. To further explore complementarity in terms of discriminative power, we propose to use a group Lasso approach to select complementary features in a principled way. The final combined feature set yields promising results in both matched and unmatched test conditions.

...read moreread less

192 citations

Journal Article•DOI•

A classification based approach to speech segregation

[...]

Kun Han¹, DeLiang Wang•Institutions (1)

Ohio State University¹

08 Nov 2012-Journal of the Acoustical Society of America

TL;DR: This study proposes a classification approach to estimate the Ideal binary mask (IBM) and employs support vector machines to classify time-frequency units as either target- or interference-dominant.

...read moreread less

Abstract: A key problem in computational auditory scene analysis (CASA) is monaural speech segregation, which has proven to be very challenging. For monaural mixtures, one can only utilize the intrinsic properties of speech or interference to segregate target speech from background noise. Ideal binary mask (IBM) has been proposed as a main goal of sound segregation in CASA and has led to substantial improvements of human speech intelligibility in noise. This study proposes a classification approach to estimate the IBM and employs support vector machines to classify time-frequency units as either target- or interference-dominant. A re-thresholding method is incorporated to improve classification results and maximize hit minus false alarm rates. An auditory segmentation stage is utilized to further improve estimated masks. Systematic evaluations show that the proposed approach produces high quality estimated IBMs and outperforms a recent system in terms of classification accuracy.

...read moreread less

105 citations

Proceedings Article•DOI•

Using Context Information for Dialog Act Classification in DNN Framework

[...]

Yang Liu, Kun Han¹, Zhao Tan², Yun Lei•Institutions (2)

DiDi¹, Washington University in St. Louis²

01 Sep 2017

TL;DR: This paper proposes several ways of using context information for DA classification, all in the deep learning framework, and demonstrates that incorporating context information significantly improves DA classification and achieves new state-of-the-art performance.

...read moreread less

Abstract: Previous work on dialog act (DA) classification has investigated different methods, such as hidden Markov models, maximum entropy, conditional random fields, graphical models, and support vector machines A few recent studies explored using deep learning neural networks for DA classification, however, it is not clear yet what is the best method for using dialog context or DA sequential information, and how much gain it brings This paper proposes several ways of using context information for DA classification, all in the deep learning framework The baseline system classifies each utterance using the convolutional neural networks (CNN) Our proposed methods include using hierarchical models (recurrent neural networks (RNN) or CNN) for DA sequence tagging where the bottom layer takes the sentence CNN representation as input, concatenating predictions from the previous utterances with the CNN vector for classification, and performing sequence decoding based on the predictions from the sentence CNN model We conduct thorough experiments and comparisons on the Switchboard corpus, demonstrate that incorporating context information significantly improves DA classification, and show that we achieve new state-of-the-art performance for this task

...read moreread less

104 citations

1
2
3
4
…
5
6
7

Collapse

Cited by

PDF

Open Access

More filters

Proceedings Article•DOI•

Deep clustering: Discriminative embeddings for segmentation and separation

[...]

John R. Hershey¹, Zhuo Chen², Jonathan Le Roux¹, Shinji Watanabe¹•Institutions (2)

Mitsubishi Electric Research Laboratories¹, Columbia University²

20 Mar 2016

TL;DR: In this paper, a deep network is trained to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures.

...read moreread less

Abstract: We address the problem of "cocktail-party" source separation in a deep learning framework called deep clustering. Previous deep network approaches to separation have shown promising performance in scenarios with a fixed number of sources, each belonging to a distinct signal class, such as speech and noise. However, for arbitrary source classes and number, "class-based" methods are not suitable. Instead, we train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pair-wise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6dB. More dramatically, the same model does surprisingly well with three-speaker mixtures.

...read moreread less

1,216 citations

Journal Article•DOI•

On training targets for supervised speech separation

[...]

Yuxuan Wang¹, Arun Narayanan¹, DeLiang Wang¹•Institutions (1)

Ohio State University¹

01 Dec 2014-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: Results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics, and that masking based targets, in general, are significantly better than spectral envelope based targets.

...read moreread less

Abstract: Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and large speech intelligibility gains. The supervised learning framework, however, is not restricted to the use of binary targets. In this study, we evaluate and compare separation results by using different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking based targets, in general, are significantly better than spectral envelope based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation.

...read moreread less

1,046 citations

Journal Article•DOI•

Supervised Speech Separation Based on Deep Learning: An Overview

[...]

DeLiang Wang¹, Jitong Chen¹•Institutions (1)

Ohio State University¹

01 Oct 2018-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A comprehensive overview of deep learning-based supervised speech separation can be found in this paper, where three main components of supervised separation are discussed: learning machines, training targets, and acoustic features.

...read moreread less

Abstract: Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then, we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multitalker separation), and speech dereverberation, as well as multimicrophone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.

...read moreread less

1,009 citations

Journal Article•DOI•

A review of affective computing

[...]

Soujanya Poria¹, Erik Cambria², Rajiv Bajpai², Amir Hussain¹•Institutions (2)

University of Stirling¹, Nanyang Technological University²

01 Sep 2017-Information Fusion

TL;DR: This first of its kind, comprehensive literature review of the diverse field of affective computing focuses mainly on the use of audio, visual and text information for multimodal affect analysis, and outlines existing methods for fusing information from different modalities.

...read moreread less

969 citations

Journal Article•DOI•

A Survey on Deep Learning: Algorithms, Techniques, and Applications

[...]

Samira Pouyanfar¹, Saad Sadiq², Yilin Yan², Haiman Tian¹, Yudong Tao², Maria Presa Reyes¹, Mei-Ling Shyu², Shu-Ching Chen¹, S. Sitharama Iyengar¹ - Show less +5 more•Institutions (2)

Florida International University¹, University of Miami²

18 Sep 2018-ACM Computing Surveys

TL;DR: A comprehensive review of historical and recent state-of-the-art approaches in visual, audio, and text processing; social network analysis; and natural language processing is presented, followed by the in-depth analysis on pivoting and groundbreaking advances in deep learning applications.

...read moreread less

Abstract: The field of machine learning is witnessing its golden era as deep learning slowly becomes the leader in this domain. Deep learning uses multiple layers to represent the abstractions of data to build computational models. Some key enabler deep learning algorithms such as generative adversarial networks, convolutional neural networks, and model transfers have completely changed our perception of information processing. However, there exists an aperture of understanding behind this tremendously fast-paced domain, because it was never previously represented from a multiscope perspective. The lack of core understanding renders these powerful methods as black-box machines that inhibit development at a fundamental level. Moreover, deep learning has repeatedly been perceived as a silver bullet to all stumbling blocks in machine learning, which is far from the truth. This article presents a comprehensive review of historical and recent state-of-the-art approaches in visual, audio, and text processing; social network analysis; and natural language processing, followed by the in-depth analysis on pivoting and groundbreaking advances in deep learning applications. It was also undertaken to review the issues faced in deep learning such as unsupervised learning, black-box models, and online learning and to illustrate how these challenges can be transformed into prolific future research avenues.

...read moreread less

824 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse