Author

Ikuyo Masuda-Katsuse

Bio: Ikuyo Masuda-Katsuse is an academic researcher from Kindai University. The author has contributed to research in topics: Pronunciation & Web application. The author has an h-index of 4 and has co-authored 10 publications receiving 1,709 citations.

Papers
Journal ArticleDOI
TL;DR: A set of simple new procedures has been developed to enable the real-time manipulation of speech parameters by using pitch-adaptive spectral analysis combined with a surface reconstruction method in the time–frequency region.

1,741 citations

Journal ArticleDOI
TL;DR: A computational model that dynamically tracks and predicts changes in spectral shape was verified in psychophysical experiments and applied to phonemic restoration and to the segregation of two simultaneous utterances, proving effective in such engineering applications.

19 citations

Journal ArticleDOI
TL;DR: This article investigates the relation between word intelligibility in the presence of noise and the adequacy of accent type in those words, finding that spoken words with a more adequate accent type were more intelligible.
Abstract: This paper investigates the contribution of pitch-accent information to Japanese spoken-word recognition. The pitch accent of spoken words was manipulated by controlling F0. First, the author investigated the relation between word intelligibility in the presence of noise and the adequacy of the accent type of those words. In the intelligibility test, participants were presented with speech stimuli mixed with pink noise and were required to identify each word. In the rating test, the same participants were presented with the same speech stimuli and were required to rate the adequacy of the words' accent types. Results indicated that spoken words with a more adequate accent type were more intelligible in the presence of noise. Next, the author investigated the relation between reaction time in shadowing words and the adequacy of the accent type of those words. In the shadowing task, participants were required to shadow a word whose accent type had been manipulated as soon as they identified it; the same participants also took the rating test. Reaction times for words with an adequate accent type were shorter than for words with an inadequate one. These results support the hypothesis that pitch-accent information in Japanese spoken words might facilitate word recognition.

4 citations

Proceedings Article
01 Jan 2001
TL;DR: A new method is proposed for speech recognition in the presence of non-stationary, unpredictable, high-level noise, obtained by extending PreFEst; the method neither needs to know the noise characteristics in advance nor estimates them during processing.
Abstract: In this paper, we propose a new method for speech recognition in the presence of non-stationary, unpredictable and high-level noise by extending PreFEst [3]. The method does not need to know noise characteristics in advance and does not even estimate them in its process. A small set of evaluations demonstrates the feasibility of the method by showing a good performance even with a signal-to-noise ratio of less than 10 dB.

4 citations

Proceedings Article
01 Jan 2014
TL;DR: A system that helps children who have difficulty pronouncing words correctly to practice their pronunciation; it allows exercises to be individually tailored to each child's pronunciation needs.
Abstract: We developed a system with which children who have difficulty correctly pronouncing words can practice their pronunciation. It allows exercises to be individually tailored to each child's pronunciation needs. Three speech evaluation methods were prepared for each type of presented word: automatic speech recognition, phonemic discrimination between the correct and the probable error pronunciation of a consonant period, and articulation tests by speech-language-hearing therapists. For 3 or 4 months, we performed practical field tests with nine students in special-support education classes at four elementary schools. In the tests, we realized medical-educational-engineering collaboration with the technical support of local-community volunteers.

3 citations


Cited by
Posted Content
TL;DR: This paper proposes WaveNet, a deep neural network for generating raw audio waveforms, which is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones.
Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
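The key idea in the abstract above is the autoregressive factorization: each sample is drawn conditioned on all previous ones, p(x) = ∏ₜ p(xₜ | x₁, …, xₜ₋₁). The sketch below illustrates only that sampling loop; the function `toy_next_sample_logits` and the 8-level quantization are assumptions standing in for WaveNet's actual convolutional network, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_next_sample_logits(context, n_levels=8):
    """Hypothetical stand-in for the network: favors levels near the
    previous quantized sample. WaveNet instead uses dilated causal
    convolutions over the whole context."""
    prev = context[-1]
    return -np.abs(np.arange(n_levels) - prev).astype(float)

def sample_autoregressively(n_steps, n_levels=8):
    # p(x) = prod_t p(x_t | x_1 .. x_{t-1}): each new sample is drawn
    # from a distribution conditioned on everything generated so far.
    x = [n_levels // 2]  # seed sample
    for _ in range(n_steps):
        logits = toy_next_sample_logits(np.array(x), n_levels)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        x.append(int(rng.choice(n_levels, p=p)))
    return x
```

The expensive part in practice is exactly this loop: generation is inherently sequential, one sample at a time, which is why later work focused on fast inference.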

4,002 citations

Journal ArticleDOI
TL;DR: An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds, based on the well-known autocorrelation method with a number of modifications that combine to prevent errors.
Abstract: An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds. It is based on the well-known autocorrelation method with a number of modifications that combine to prevent errors. The algorithm has several desirable features. Error rates are about three times lower than the best competing methods, as evaluated over a database of speech recorded together with a laryngograph signal. There is no upper limit on the frequency search range, so the algorithm is suited for high-pitched voices and music. The algorithm is relatively simple and may be implemented efficiently and with low latency, and it involves few parameters that must be tuned. It is based on a signal model (periodic signal) that may be extended in several ways to handle various forms of aperiodicity that occur in particular applications. Finally, interesting parallels may be drawn with models of auditory processing.
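The abstract describes a modified autocorrelation method; a minimal sketch of two of that family's best-known ingredients (a difference function, its cumulative mean normalized form, and an absolute-threshold dip search) is shown below. This is an illustrative approximation under assumed parameter values, not a faithful reimplementation of the published algorithm:

```python
import numpy as np

def yin_f0(x, sr, fmin=50.0, fmax=500.0, threshold=0.1):
    """Estimate F0 of a frame with a YIN-style normalized difference."""
    tau_min = int(sr / fmax)
    tau_max = int(sr / fmin)
    n = len(x)
    # Difference function d(tau) = sum_t (x[t] - x[t + tau])^2
    d = np.array([np.sum((x[:n - tau_max] - x[tau:tau + n - tau_max]) ** 2)
                  for tau in range(tau_max + 1)])
    # Cumulative mean normalized difference: d'(0) = 1,
    # d'(tau) = d(tau) / ((1/tau) * sum_{j=1..tau} d(j))
    cmnd = np.ones_like(d)
    cumsum = np.cumsum(d[1:])
    cmnd[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(cumsum, 1e-12)
    # Take the first dip below the absolute threshold, then walk to its
    # local minimum; this prevents the octave errors of picking the
    # global minimum directly.
    for tau in range(tau_min, tau_max):
        if cmnd[tau] < threshold:
            while tau + 1 < tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return sr / tau
    return sr / (tau_min + np.argmin(cmnd[tau_min:tau_max]))
```

On a clean 220 Hz sine this recovers the pitch to within a few hertz; the full algorithm adds parabolic interpolation and a best-local-estimate step for higher accuracy.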

1,975 citations

Journal ArticleDOI
15 Apr 2007
TL;DR: This paper gives a general overview of techniques in statistical parametric speech synthesis, and contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years.
Abstract: This paper gives a general overview of techniques in statistical parametric speech synthesis. One instance of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable synthetic speech. This paper also contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years. Advantages and disadvantages of statistical parametric synthesis are highlighted, and we identify where we expect the key developments to appear in the immediate future.
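In HMM-based generation synthesis, the synthesizer does not emit speech parameters directly; it solves for the static parameter trajectory that best matches both the static and the dynamic (delta) feature means under the model. Below is a sketch of that maximum-likelihood generation step for one scalar stream; the clamped delta window and diagonal variances are simplifying assumptions for illustration:

```python
import numpy as np

def mlpg(mu_static, mu_delta, var_static, var_delta):
    """Maximum-likelihood parameter generation (sketch): find the static
    sequence c maximizing N(W c; mu, Sigma), i.e. solve the normal
    equations (W' P W) c = W' P mu with P the diagonal precision."""
    T = len(mu_static)
    I = np.eye(T)
    # Delta window: delta_t = (c_{t+1} - c_{t-1}) / 2, clamped at edges
    D = np.zeros((T, T))
    for t in range(T):
        D[t, min(t + 1, T - 1)] += 0.5
        D[t, max(t - 1, 0)] -= 0.5
    W = np.vstack([I, D])                       # stacks static + delta rows
    mu = np.concatenate([mu_static, mu_delta])
    prec = np.concatenate([1.0 / var_static, 1.0 / var_delta])
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

The delta constraints are what make the generated trajectory smooth rather than a stepwise sequence of state means; real systems solve the same banded system efficiently per dimension.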

1,270 citations

Journal ArticleDOI
TL;DR: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of realtime applications using speech and showed that it was superior to the other systems in terms of both sound quality and processing speed.
Abstract: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. This new speech synthesis system offers not only high sound quality but also fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output with natural speech, including consonants. Its processing speed was also compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real-time factor (RTF) indicated that it was fast enough for real-time processing.
Key words: speech analysis, speech synthesis, vocoder, sound quality, real-time processing
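The real-time factor quoted in the abstract is simply processing time divided by the duration of the audio processed; RTF < 1 means the system runs faster than real time, and "over ten times faster than real time" corresponds to RTF < 0.1:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF as commonly reported for vocoders: time taken to process an
    utterance divided by the utterance's duration."""
    return processing_seconds / audio_seconds
```

For example, analyzing and resynthesizing 5 s of audio in 0.5 s gives an RTF of 0.1.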

1,025 citations

Journal ArticleDOI
TL;DR: In this article, a Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers, and a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory is proposed.
Abstract: In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, the deterioration of speech quality is caused by some problems: 1) appropriate spectral movements are not always caused by the frame-based conversion process, and 2) the converted spectra are excessively smoothed by statistical modeling. In order to address those problems, we propose a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory. Not only static but also dynamic feature statistics are used for realizing the appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in view of both speech quality and conversion accuracy for speaker individuality.
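The conventional frame-by-frame baseline described above maps each source frame to the conditional expectation E[y | x] under a joint GMM of source and target features. A sketch for scalar features is below; the parameter values in the test are made up for illustration, and the paper's actual contribution goes further, adding trajectory (dynamic-feature) modeling and a global variance term:

```python
import numpy as np

def gmm_mmse_convert(x, weights, mu_x, mu_y, s_xx, s_yx):
    """Frame-wise MMSE mapping E[y | x] under a joint GMM p(x, y),
    shown for scalar features. Arrays hold one entry per mixture
    component: weight, source mean, target mean, source variance,
    and cross covariance."""
    # Responsibilities gamma_m(x) from the source marginal p(x | m)
    lik = weights * np.exp(-0.5 * (x - mu_x) ** 2 / s_xx) \
        / np.sqrt(2 * np.pi * s_xx)
    gamma = lik / lik.sum()
    # Per-component conditional means E[y | x, m]
    cond = mu_y + s_yx / s_xx * (x - mu_x)
    # MMSE estimate: responsibility-weighted sum of conditional means
    return float(gamma @ cond)
```

Because each frame is converted independently, the output trajectory can be jumpy and oversmoothed, which is exactly the deterioration the trajectory-based method in this paper is designed to fix.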

914 citations