Author

Ikuyo Masuda-Katsuse

Bio: Ikuyo Masuda-Katsuse is an academic researcher from Kindai University. The author has contributed to research in topics: Pronunciation & Web application. The author has an h-index of 4 and has co-authored 10 publications receiving 1,709 citations.

Papers
Journal ArticleDOI
TL;DR: A set of simple new procedures has been developed to enable the real-time manipulation of speech parameters by using pitch-adaptive spectral analysis combined with a surface reconstruction method in the time–frequency region.

1,741 citations

Journal ArticleDOI
TL;DR: A computational model that dynamically tracks and predicts changes in spectral shape was verified in psychophysical experiments and applied to phonemic restoration and to the segregation of two simultaneous utterances, proving effective in such engineering applications.

19 citations

Journal ArticleDOI
TL;DR: This article investigates the relation between word intelligibility in the presence of noise and the adequacy of accent type in those words, finding that spoken words with a more adequate accent type were more intelligible.
Abstract: This paper investigates the contribution of pitch-accent information to Japanese spoken-word recognition. The pitch accent of spoken words was manipulated by controlling F0. First, the author investigated the relation between word intelligibility in the presence of noise and the adequacy of the accent type of those words. In the intelligibility test, participants were presented with speech stimuli mixed with pink noise and were required to identify each word. In the rating test, the same participants were presented with the same speech stimuli and were required to rate the adequacy of the words' accent types. Results indicated that spoken words with a more adequate accent type were more intelligible in the presence of noise. Next, the author investigated the relation between reaction time in shadowing words and the adequacy of the accent type of those words. In the shadowing task, participants were required to shadow a word whose accent type had been manipulated as soon as they identified it; the same participants also took the rating test. Reaction times for words with an adequate accent type were shorter than for words with an inadequate one. These results support the hypothesis that pitch-accent information in Japanese spoken words might facilitate word recognition.

4 citations

Proceedings Article
01 Jan 2001
TL;DR: A new method is proposed for speech recognition in the presence of non-stationary, unpredictable, high-level noise, obtained by extending PreFEst; the method neither needs to know the noise characteristics in advance nor estimates them during processing.
Abstract: In this paper, we propose a new method for speech recognition in the presence of non-stationary, unpredictable and high-level noise by extending PreFEst [3]. The method does not need to know noise characteristics in advance and does not even estimate them in its process. A small set of evaluations demonstrates the feasibility of the method by showing a good performance even with a signal-to-noise ratio of less than 10 dB.

4 citations

Proceedings Article
01 Jan 2014
TL;DR: A system that helps children who have difficulty pronouncing words correctly to practice their pronunciation; it allows exercises to be individually tailored to each child's pronunciation needs.
Abstract: We developed a system with which children who have difficulty correctly pronouncing words can practice their pronunciation. It allows exercises to be individually tailored to each child's pronunciation needs. Three speech evaluation methods were prepared for each type of presented word: automatic speech recognition, phonemic discrimination between the correct and the probable error pronunciation of a consonant period, and articulation tests by speech-language-hearing therapists. For 3 or 4 months, we performed practical field tests with nine students in special-support education classes at four elementary schools. In the tests, we realized medical-educational-engineering collaboration with the technical support of local-community volunteers.

3 citations


Cited by
Posted Content
TL;DR: This paper proposes WaveNet, a deep neural network for generating raw audio waveforms, which is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones.
Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
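The key idea in the abstract above is the autoregressive factorization: each sample is drawn conditioned on all previous ones, p(x) = ∏ₜ p(xₜ | x₁, …, xₜ₋₁). The sketch below illustrates only that sampling loop; the function `toy_next_sample_logits` and the 8-level quantization are assumptions standing in for WaveNet's actual convolutional network, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_next_sample_logits(context, n_levels=8):
    """Hypothetical stand-in for the network: favors levels near the
    previous quantized sample. WaveNet instead uses dilated causal
    convolutions over the whole context."""
    prev = context[-1]
    return -np.abs(np.arange(n_levels) - prev).astype(float)

def sample_autoregressively(n_steps, n_levels=8):
    # p(x) = prod_t p(x_t | x_1 .. x_{t-1}): each new sample is drawn
    # from a distribution conditioned on everything generated so far.
    x = [n_levels // 2]  # seed sample
    for _ in range(n_steps):
        logits = toy_next_sample_logits(np.array(x), n_levels)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        x.append(int(rng.choice(n_levels, p=p)))
    return x
```

The expensive part in practice is exactly this loop: generation is inherently sequential, one sample at a time, which is why later work focused on fast inference.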

4,002 citations

Journal ArticleDOI
TL;DR: An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds, based on the well-known autocorrelation method with a number of modifications that combine to prevent errors.
Abstract: An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds. It is based on the well-known autocorrelation method with a number of modifications that combine to prevent errors. The algorithm has several desirable features. Error rates are about three times lower than the best competing methods, as evaluated over a database of speech recorded together with a laryngograph signal. There is no upper limit on the frequency search range, so the algorithm is suited for high-pitched voices and music. The algorithm is relatively simple and may be implemented efficiently and with low latency, and it involves few parameters that must be tuned. It is based on a signal model (periodic signal) that may be extended in several ways to handle various forms of aperiodicity that occur in particular applications. Finally, interesting parallels may be drawn with models of auditory processing.
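The abstract describes a modified autocorrelation method; a minimal sketch of two of that family's best-known ingredients (a difference function, its cumulative mean normalized form, and an absolute-threshold dip search) is shown below. This is an illustrative approximation under assumed parameter values, not a faithful reimplementation of the published algorithm:

```python
import numpy as np

def yin_f0(x, sr, fmin=50.0, fmax=500.0, threshold=0.1):
    """Estimate F0 of a frame with a YIN-style normalized difference."""
    tau_min = int(sr / fmax)
    tau_max = int(sr / fmin)
    n = len(x)
    # Difference function d(tau) = sum_t (x[t] - x[t + tau])^2
    d = np.array([np.sum((x[:n - tau_max] - x[tau:tau + n - tau_max]) ** 2)
                  for tau in range(tau_max + 1)])
    # Cumulative mean normalized difference: d'(0) = 1,
    # d'(tau) = d(tau) / ((1/tau) * sum_{j=1..tau} d(j))
    cmnd = np.ones_like(d)
    cumsum = np.cumsum(d[1:])
    cmnd[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(cumsum, 1e-12)
    # Take the first dip below the absolute threshold, then walk to its
    # local minimum; this prevents the octave errors of picking the
    # global minimum directly.
    for tau in range(tau_min, tau_max):
        if cmnd[tau] < threshold:
            while tau + 1 < tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return sr / tau
    return sr / (tau_min + np.argmin(cmnd[tau_min:tau_max]))
```

On a clean 220 Hz sine this recovers the pitch to within a few hertz; the full algorithm adds parabolic interpolation and a best-local-estimate step for higher accuracy.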

1,975 citations

Journal ArticleDOI
15 Apr 2007
TL;DR: This paper gives a general overview of techniques in statistical parametric speech synthesis, and contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years.
Abstract: This paper gives a general overview of techniques in statistical parametric speech synthesis. One instance of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable synthetic speech. This paper also contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years. Advantages and disadvantages of statistical parametric synthesis are highlighted, and we identify where we expect the key developments to appear in the immediate future.
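In HMM-based generation synthesis, the synthesizer does not emit speech parameters directly; it solves for the static parameter trajectory that best matches both the static and the dynamic (delta) feature means under the model. Below is a sketch of that maximum-likelihood generation step for one scalar stream; the clamped delta window and diagonal variances are simplifying assumptions for illustration:

```python
import numpy as np

def mlpg(mu_static, mu_delta, var_static, var_delta):
    """Maximum-likelihood parameter generation (sketch): find the static
    sequence c maximizing N(W c; mu, Sigma), i.e. solve the normal
    equations (W' P W) c = W' P mu with P the diagonal precision."""
    T = len(mu_static)
    I = np.eye(T)
    # Delta window: delta_t = (c_{t+1} - c_{t-1}) / 2, clamped at edges
    D = np.zeros((T, T))
    for t in range(T):
        D[t, min(t + 1, T - 1)] += 0.5
        D[t, max(t - 1, 0)] -= 0.5
    W = np.vstack([I, D])                       # stacks static + delta rows
    mu = np.concatenate([mu_static, mu_delta])
    prec = np.concatenate([1.0 / var_static, 1.0 / var_delta])
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

The delta constraints are what make the generated trajectory smooth rather than a stepwise sequence of state means; real systems solve the same banded system efficiently per dimension.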

1,270 citations

Journal ArticleDOI
TL;DR: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of realtime applications using speech and showed that it was superior to the other systems in terms of both sound quality and processing speed.
Abstract: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. This new speech synthesis system offers not only high sound quality but also fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output with natural speech, including consonants. Its processing speed was also compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real-time factor (RTF) indicated that it was fast enough for real-time processing.
Key words: speech analysis, speech synthesis, vocoder, sound quality, real-time processing
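The real-time factor quoted in the abstract is simply processing time divided by the duration of the audio processed; RTF < 1 means the system runs faster than real time, and "over ten times faster than real time" corresponds to RTF < 0.1:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF as commonly reported for vocoders: time taken to process an
    utterance divided by the utterance's duration."""
    return processing_seconds / audio_seconds
```

For example, analyzing and resynthesizing 5 s of audio in 0.5 s gives an RTF of 0.1.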

1,025 citations

Journal ArticleDOI
TL;DR: In this article, a Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers, and a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory is proposed.
Abstract: In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, the deterioration of speech quality is caused by some problems: 1) appropriate spectral movements are not always caused by the frame-based conversion process, and 2) the converted spectra are excessively smoothed by statistical modeling. In order to address those problems, we propose a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory. Not only static but also dynamic feature statistics are used for realizing the appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in view of both speech quality and conversion accuracy for speaker individuality.
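The conventional frame-by-frame baseline described above maps each source frame to the conditional expectation E[y | x] under a joint GMM of source and target features. A sketch for scalar features is below; the parameter values in the test are made up for illustration, and the paper's actual contribution goes further, adding trajectory (dynamic-feature) modeling and a global variance term:

```python
import numpy as np

def gmm_mmse_convert(x, weights, mu_x, mu_y, s_xx, s_yx):
    """Frame-wise MMSE mapping E[y | x] under a joint GMM p(x, y),
    shown for scalar features. Arrays hold one entry per mixture
    component: weight, source mean, target mean, source variance,
    and cross covariance."""
    # Responsibilities gamma_m(x) from the source marginal p(x | m)
    lik = weights * np.exp(-0.5 * (x - mu_x) ** 2 / s_xx) \
        / np.sqrt(2 * np.pi * s_xx)
    gamma = lik / lik.sum()
    # Per-component conditional means E[y | x, m]
    cond = mu_y + s_yx / s_xx * (x - mu_x)
    # MMSE estimate: responsibility-weighted sum of conditional means
    return float(gamma @ cond)
```

Because each frame is converted independently, the output trajectory can be jumpy and oversmoothed, which is exactly the deterioration the trajectory-based method in this paper is designed to fix.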

914 citations