Home
/
Authors
/
Ji Ming

Author

Ji Ming

Bio: Ji Ming is an academic researcher from Queen's University Belfast. The author has contributed to research in topics: Hidden Markov model & Speaker recognition. The author has an hindex of 18, co-authored 99 publications receiving 1452 citations.

Papers published on a yearly basis

2021
2019
2018
2017
2016
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
1992
1991
1990

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Robust Speaker Recognition in Noisy Conditions

[...]

Ji Ming¹, Timothy J. Hazen², James Glass², Douglas A. Reynolds²•Institutions (2)

Queen's University Belfast¹, Massachusetts Institute of Technology²

01 Jul 2007-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper describes a method that combines multicondition model training and missing-feature theory to model noise with unknown temporal-spectral characteristics, and is found to achieve lower error rates.

...read moreread less

Abstract: This paper investigates the problem of speaker identification and verification in noisy conditions, assuming that speech signals are corrupted by environmental noise, but knowledge about the noise characteristics is not available. This research is motivated in part by the potential application of speaker recognition technologies on handheld devices or the Internet. While the technologies promise an additional biometric layer of security to protect the user, the practical implementation of such systems faces many challenges. One of these is environmental noise. Due to the mobile nature of such systems, the noise sources can be highly time-varying and potentially unknown. This raises the requirement for noise robustness in the absence of information about the noise. This paper describes a method that combines multicondition model training and missing-feature theory to model noise with unknown temporal-spectral characteristics. Multicondition training is conducted using simulated noisy data with limited noise variation, providing a ldquocoarserdquo compensation for the noise, and missing-feature theory is applied to refine the compensation by ignoring noise variation outside the given training conditions, thereby reducing the training and testing mismatch. This paper is focused on several issues relating to the implementation of the new model for real-world applications. These include the generation of multicondition training data to model noisy speech, the combination of different training data to optimize the recognition performance, and the reduction of the model's complexity. The new algorithm was tested using two databases with simulated and realistic noisy speech data. The first database is a redevelopment of the TIMIT database by rerecording the data in the presence of various noise types, used to test the model for speaker identification with a focus on the varieties of noise. The second database is a handheld-device database collected in realistic noisy conditions, used to further validate the model for real-world speaker verification. The new model is compared to baseline systems and is found to achieve lower error rates.

...read moreread less

277 citations

Proceedings Article•DOI•

Extension of Zipf's law to words and phrases

[...]

Le Quan Ha¹, Elvira I. Sicilia-Garcia¹, Ji Ming¹, Francis Jack Smith¹•Institutions (1)

Queen's University Belfast¹

24 Aug 2002

TL;DR: The law for single words is shown to be valid only for high frequency words, but when single word and n-gram phrases are combined together in one list and put in order of frequency the combined list follows Zipf's law accurately for all words and phrases, down to the lowest frequencies in both languages.

...read moreread less

Abstract: Zipf's law states that the frequency of word tokens in a large corpus of natural language is inversely proportional to the rank. The law is investigated for two languages English and Mandarin and for n-gram word phrases as well as for single words. The law for single words is shown to be valid only for high frequency words. However, when single word and n-gram phrases are combined together in one list and put in order of frequency the combined list follows Zipf's law accurately for all words and phrases, down to the lowest frequencies in both languages. The Zipf curves for the two languages are then almost identical.

...read moreread less

124 citations

Journal Article•DOI•

Comparison of image transform-based features for visual speech recognition in clean and corrupted videos

[...]

Rowan Seymour¹, Darryl Stewart¹, Ji Ming¹•Institutions (1)

Queen's University Belfast¹

01 Apr 2008-Eurasip Journal on Image and Video Processing

TL;DR: A study into the performance of a variety of different image transform-based feature types for speaker-independent visual speech recognition of isolated digits includes the first reported use of features extracted using a discrete curvelet transform.

...read moreread less

Abstract: We present results of a study into the performance of a variety of different image transform-based feature types for speaker-independent visual speech recognition of isolated digits. This includes the first reported use of features extracted using a discrete curvelet transform. The study will show a comparison of some methods for selecting features of each feature type and show the relative benefits of both static and dynamic visual features. The performance of the features will be tested on both clean video data and also video data corrupted in a variety of ways to assess each feature type's robustness to potential real-world conditions. One of the test conditions involves a novel form of video corruption we call jitter which simulates camera and/or head movement during recording.

...read moreread less

77 citations

Journal Article•DOI•

Robust Audio-Visual Speech Recognition Under Noisy Audio-Video Conditions

[...]

Darryl Stewart¹, Rowan Seymour², Adrian Pass, Ji Ming¹•Institutions (2)

Queen's University Belfast¹, University of Washington²

13 Jan 2014-IEEE Transactions on Systems, Man, and Cybernetics

TL;DR: The maximum weighted stream posterior model is presented as a robust and efficient stream integration method for audio-visual speech recognition in environments, where the audio or video streams may be subjected to unknown and time-varying corruption.

...read moreread less

Abstract: This paper presents the maximum weighted stream posterior (MWSP) model as a robust and efficient stream integration method for audio-visual speech recognition in environments, where the audio or video streams may be subjected to unknown and time-varying corruption. A significant advantage of MWSP is that it does not require any specific measurements of the signal in either stream to calculate appropriate stream weights during recognition, and as such it is modality-independent. This also means that MWSP complements and can be used alongside many of the other approaches that have been proposed in the literature for this problem. For evaluation we used the large XM2VTS database for speaker-independent audio-visual speech recognition. The extensive tests include both clean and corrupted utterances with corruption added in either/both the video and audio streams using a variety of types (e.g., MPEG-4 video compression) and levels of noise. The experiments show that this approach gives excellent performance in comparison to another well-known dynamic stream weighting approach and also compared to any fixed-weighted integration approach in both clean conditions or when noise is added to either stream. Furthermore, our experiments show that the MWSP approach dynamically selects suitable integration weights on a frame-by-frame basis according to the level of noise in the streams and also according to the naturally fluctuating relative reliability of the modalities even in clean conditions. The MWSP approach is shown to maintain robust recognition performance in all tested conditions, while requiring no prior knowledge about the type or level of noise.

...read moreread less

70 citations

Proceedings Article•DOI•

A corpus-based approach to speech enhancement from nonstationary noise.

[...]

Ji Ming¹, Ramji Srinivasan¹, Danny Crookes¹•Institutions (1)

Queen's University Belfast¹

26 Sep 2010

TL;DR: In this paper, the authors proposed a method to estimate clean speech by recognizing long segments of the clean speech as whole units, which can reduce the requirement for prior information about the noise which can be difficult to estimate for fast-varying noise.

...read moreread less

Abstract: Temporal dynamics and speaker characteristics are two important features of speech that distinguish speech from noise In this paper, we propose a method to maximally extract these two features of speech for speech enhancement We demonstrate that this can reduce the requirement for prior information about the noise, which can be difficult to estimate for fast-varying noise Given noisy speech, the new approach estimates clean speech by recognizing long segments of the clean speech as whole units In the recognition, clean speech sentences, taken from a speech corpus, are used as examples Matching segments are identified between the noisy sentence and the corpus sentences The estimate is formed by using the longest matching segments found in the corpus sentences Longer speech segments as whole units contain more distinct dynamics and richer speaker characteristics, and can be identified more accurately from noise than shorter speech segments Therefore, estimation based on the longest recognized segments increases the noise immunity and hence the estimation accuracy The new approach consists of a statistical model to represent up to sentence-long temporal dynamics in the corpus speech, and an algorithm to identify the longest matching segments between the noisy sentence and the corpus sentences The algorithm is made more robust to noise uncertainty by introducing missing-feature based noise compensation into the corpus sentences Experiments have been conducted on the TIMIT database for speech enhancement from various types of nonstationary noise including song, music, and crosstalk speech The new approach has shown improved performance over conventional enhancement algorithms in both objective and subjective evaluations

...read moreread less

62 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

I and i

[...]

Kevin Barraclough

08 Dec 2001-BMJ

TL;DR: There is, I think, something ethereal about i —the square root of minus one, which seems an odd beast at that time—an intruder hovering on the edge of reality.

...read moreread less

Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

...read moreread less

33,785 citations

Journal Article•DOI•

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

[...]

Geoffrey E. Hinton¹, Li Deng², Dong Yu², George E. Dahl¹, Abdelrahman Mohamed¹, Navdeep Jaitly¹, Andrew W. Senior³, Vincent Vanhoucke³, Patrick Nguyen³, Tara N. Sainath⁴, Brian Kingsbury⁴ - Show less +7 more•Institutions (4)

University of Toronto¹, Microsoft², Google³, IBM⁴

18 Oct 2012-IEEE Signal Processing Magazine

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

...read moreread less

Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

...read moreread less

9,091 citations

Journal Article•

Deep Neural Networks for Acoustic Modeling in Speech Recognition

[...]

Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew W. Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, Brian Kingsbury - Show less +7 more

01 Nov 2012-IEEE Signal Processing Magazine

TL;DR: This paper provides an overview of this progress and repres nts the shared views of four research groups who have had recent successes in using deep neural networks for a coustic modeling in speech recognition.

...read moreread less

2,527 citations

Book•

Supervised Sequence Labelling with Recurrent Neural Networks

[...]

Alex Graves

09 Feb 2012

TL;DR: A new type of output layer that allows recurrent networks to be trained directly for sequence labelling tasks where the alignment between the inputs and the labels is unknown, and an extension of the long short-term memory network architecture to multidimensional data, such as images and video sequences.

...read moreread less

Abstract: Recurrent neural networks are powerful sequence learners. They are able to incorporate context information in a flexible way, and are robust to localised distortions of the input data. These properties make them well suited to sequence labelling, where input sequences are transcribed with streams of labels. The aim of this thesis is to advance the state-of-the-art in supervised sequence labelling with recurrent networks. Its two main contributions are (1) a new type of output layer that allows recurrent networks to be trained directly for sequence labelling tasks where the alignment between the inputs and the labels is unknown, and (2) an extension of the long short-term memory network architecture to multidimensional data, such as images and video sequences.

...read moreread less

2,101 citations

Journal Article•DOI•

Acoustic Modeling Using Deep Belief Networks

[...]

Abdelrahman Mohamed¹, George E. Dahl¹, Geoffrey E. Hinton¹•Institutions (1)

University of Toronto¹

01 Jan 2012-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: It is shown that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters.

...read moreread less

Abstract: Gaussian mixture models are currently the dominant technique for modeling the emission distribution of hidden Markov models for speech recognition. We show that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters. These networks are first pre-trained as a multi-layer generative model of a window of spectral feature vectors without making use of any discriminative information. Once the generative pre-training has designed the features, we perform discriminative fine-tuning using backpropagation to adjust the features slightly to make them better at predicting a probability distribution over the states of monophone hidden Markov models.

...read moreread less

1,767 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse