Author

Yoshinori Sagisaka

Bio: Yoshinori Sagisaka is an academic researcher from Nippon Telegraph and Telephone. The author has contributed to research in topics: Speech synthesis & Phrase. The author has an h-index of 13 and has co-authored 28 publications receiving 775 citations.

Papers
Journal ArticleDOI
TL;DR: A large-scale Japanese speech database is described; it has been used to develop algorithms for speech recognition and synthesis and to gather acoustic, phonetic, and linguistic evidence that serves as basic data for speech technologies.

282 citations

Book
01 Jan 1992
TL;DR: This edited volume on speech perception, production, and linguistic structure includes the fuzzy logical model of speech perception as a framework for research and theory, studies of adaptability to talker differences in Japanese monosyllabic perception, and the effect of F0 lowering on vowel identification.
Abstract: Part 1 Speech perception:
- assimilation and contrast in vowel perception, Sumi Shigeno
- perception of vowel quality in a phonologically neutralized context, Robert Allen Fox
- modelling human vowel identification using aspects of formant trajectory and context, Caroline B. Huang
- psychoacoustic evidence for contextual effect models, Masato Akagi
- the fuzzy logical model of speech perception - a framework for research and theory, Dominic W. Massaro
- the effect of F0 on vowel identification, Tatsuya Hirahara and Hiroaki Kato
- paying attention to differences among talkers, Howard C. Nusbaum and Todd M. Morin
- adaptability to differences between talkers in Japanese monosyllabic perception, Kazuhiko Kakehi
- talker normalization in speech perception, David B. Pisoni
- perception of American English /r/ and /l/ by native speakers of Japanese, Reiko A. Yamada and Yoh'ichi Tohkura
- some effects of training Japanese listeners to identify English /r/ and /l/, Scott E. Lively et al.
- learning non-native phoneme contrasts - interactions among subject, stimulus and task variables, Winifred Strange
- speech processing and segmentation in Romance languages, Jacques Mehler and Anne Christophe
- speech prototypes - studies on the nature, function, ontogeny and phylogeny of the "centre" of speech categories, Patricia K. Kuhl
- learning to hear phonetic information, Howard C. Nusbaum and Lisa Lee
- processing constraints of the native phonological repertoire on the native language, Anne Cutler
- perceptual normalization of vocal tract size in young children and infants, Shigeru Kiritani et al.
- two mechanisms of processing sound sequences, Morio Kohno

Part 2 Speech production and linguistic structure:
- what is the input to the speech production mechanism?, John J. Ohala
- modelling the process of fundamental frequency contour generation, Hiroya Fujisaki
- sensorimotor transformations and control strategies in speech, Kevin G. Munhall et al.
- articulatory correlates of linguistically contrastive events - where are they?, Eric Vatikiotis-Bateson and Janet Fletcher
- intonational categories and the articulatory control of duration, Mary E. Beckman and Jan Edwards
- perceptual vs physical models of intonation, Rene Collier
- F0 lowering - peripheral mechanisms and motor programming, Kiyoshi Honda
- the control of segmental duration in speech synthesis using statistical methods, Nobuyoshi Kaiki and Yoshinori Sagisaka
- segmental elasticity and timing in Japanese speech, Nick Campbell
- the production and perception of word boundaries, Anne Cutler
- syntactic influences on prosody, Jacques Terken and Rene Collier
- to what extent is speech production controlled by speech perception? some questions and some experimental evidence, Sieb G. Nooteboom and Wieke Eefting
- on the modelling of segmental duration control, Yoshinori Sagisaka
- evidence for speech rhythms across languages, Mary E. Beckman

95 citations

Journal ArticleDOI
TL;DR: A speech spectrum transformation method interpolates multiple speakers' spectral patterns, using a multi-functional representation with Radial Basis Function networks, to generate new spectrum patterns close to those of the target speaker.
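For intuition, here is a minimal numpy sketch of the general idea of interpolating reference speakers' spectral patterns with Gaussian radial basis functions. The speaker positions, shapes, and exact-interpolation fit are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: RBF interpolation between speakers' spectral patterns.
import numpy as np

def rbf_interpolate(anchor_points, anchor_spectra, query, width=1.0):
    """Interpolate spectra with Gaussian radial basis functions.

    anchor_points  : (S, D) positions assigned to S reference speakers
    anchor_spectra : (S, F) spectral envelope of each speaker (F bins)
    query          : (D,) position at which to synthesize a new spectrum
    """
    # Gram matrix of RBF activations between the anchors
    d2 = ((anchor_points[:, None, :] - anchor_points[None, :, :]) ** 2).sum(-1)
    G = np.exp(-d2 / (2 * width**2))
    # Solve for weights that reproduce each anchor spectrum exactly
    W = np.linalg.solve(G, anchor_spectra)              # (S, F)
    # Activate the basis functions at the query point and mix
    q2 = ((anchor_points - query) ** 2).sum(-1)
    phi = np.exp(-q2 / (2 * width**2))                  # (S,)
    return phi @ W                                      # (F,)

pts = np.array([[0.0], [1.0], [2.0]])                   # three reference speakers
spectra = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0]])  # toy 2-bin "spectra"
print(rbf_interpolate(pts, spectra, np.array([0.5])))   # an in-between speaker
```

Moving the query point between anchors yields spectra that blend smoothly toward the nearest reference speakers, which is the effect the TL;DR describes.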

63 citations

Journal ArticleDOI
TL;DR: This paper proposes a method for automatically generating a pronunciation dictionary based on a pronunciation neural network that predicts plausible surface pronunciations from the canonical pronunciation; the resulting dictionary gives consistently higher recognition rates than a conventional dictionary.
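The paper's network itself is not reproduced here; as a hedged stand-in, the toy script below enumerates surface-pronunciation variants of a canonical phoneme string from a hand-written substitution table, which plays the role the pronunciation network plays in the paper. All phones and probabilities are made-up illustrations.

```python
# Hedged sketch: expanding a canonical pronunciation into plausible variants.
# VARIANTS stands in for a learned model P(surface_phone | canonical_phone).
from itertools import product

VARIANTS = {
    "t":  {"t": 0.7, "dx": 0.3},   # toy example: flapping
    "ih": {"ih": 0.8, "ax": 0.2},  # toy example: vowel reduction
}

def expand(canonical, threshold=0.1):
    """Enumerate surface pronunciations whose probability clears threshold."""
    choices = [list(VARIANTS.get(p, {p: 1.0}).items()) for p in canonical]
    out = []
    for combo in product(*choices):
        prob = 1.0
        for _, p in combo:
            prob *= p
        if prob >= threshold:
            out.append(([ph for ph, _ in combo], prob))
    return sorted(out, key=lambda pair: -pair[1])

print(expand(["w", "ih", "n", "t", "er"]))  # toy "winter"-like input
```

A dictionary built this way lists each retained variant alongside the canonical form, which is how extra, plausible pronunciations can raise recognition rates.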

57 citations

Journal ArticleDOI
TL;DR: A compensatory durational change across a vowel and its adjacent consonant is perceptually less salient than expected for simultaneous modification of the two segments, suggesting a time perception range wider than a single segment.
Abstract: Perceptual sensitivity to temporal modification in two consecutive speech segments was measured in word contexts to explore two questions: (1) is there an interaction between multiple segmental durations, and (2) what aspect of the stimulus context determines the perceptually salient temporal markers? Experiment 1 obtained acceptability ratings for words with temporal modifications. The results showed that a compensatory change in the durations of a vowel (V) and its adjacent consonant (C) is perceptually less salient than would be expected from the simultaneous modification of the two segments. This finding suggests the presence of a time perception range wider than a single segment (V or C). The results of experiment 1 also showed that rating scores for compensatory modification between V and C do not depend on the temporal order of the modified pair (VC or CV), but rather on the loudness difference between V and C: acceptability decreased as the loudness difference between V and C grew. This suggests that perceptually salient markers are located around major jumps in loudness. The second finding, the dependence on the loudness jump, was replicated in experiment 2, which used a detection task for temporal modifications of nonspeech stimuli modeling the time-loudness features of the speech stimuli. Experiment 3 further investigated the influence of the temporal order of V and C by using the detection task on the speech stimuli instead of acceptability ratings.

36 citations


Cited by
Journal ArticleDOI
TL;DR: An episodic model tested against speech production data from a word-shadowing task predicted the shadowing-response-time patterns, and it correctly predicted a tendency for shadowers to spontaneously imitate the acoustic patterns of words and nonwords.
Abstract: In this article the author proposes an episodic theory of spoken word representation, perception, and production. By most theories, idiosyncratic aspects of speech (voice details, ambient noise, etc.) are considered noise and are filtered in perception. However, episodic theories suggest that perceptual details are stored in memory and are integral to later perception. In this research the author tested an episodic model (MINERVA 2; D. L. Hintzman, 1986) against speech production data from a word-shadowing task. The model predicted the shadowing-response-time patterns, and it correctly predicted a tendency for shadowers to spontaneously imitate the acoustic patterns of words and nonwords. It also correctly predicted imitation strength as a function of "abstract" stimulus properties, such as word frequency. Taken together, the data and theory suggest that detailed episodes constitute the basic substrate of the mental lexicon. Early in the 20th century, Semon (1909/1923) described a memory theory that anticipated many aspects of contemporary theories (Schacter, Eich, & Tulving, 1978). In modern parlance, this was an episodic (or exemplar) theory, which assumes that every experience, such as perceiving a spoken word, leaves a unique memory trace. On presentation of a new word, all stored traces are activated, each according to its similarity to the stimulus. The most activated traces connect the new word to stored knowledge, the essence of recognition. The multiple-trace assumption allowed Semon's theory to explain the apparent permanence of specific memories; the challenge was also to create ...
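Since MINERVA 2's retrieval step is well documented (Hintzman, 1986), a compact sketch is easy to give: every episode is a trace, a probe activates each trace by its cubed similarity, and the weighted blend of traces is the "echo". The feature coding and sizes below are toy choices, and the similarity normalization is simplified.

```python
# Hedged sketch of MINERVA 2-style episodic retrieval (Hintzman, 1986).
import numpy as np

rng = np.random.default_rng(0)
traces = rng.choice([-1.0, 0.0, 1.0], size=(100, 32))  # 100 episodes, 32 features

def echo(probe, traces):
    # similarity: feature match averaged over the probe's non-zero positions
    # (a simplification of Hintzman's normalization)
    sims = (traces @ probe) / np.count_nonzero(probe)
    acts = sims ** 3                      # cubing sharpens retrieval
    intensity = acts.sum()                # familiarity / echo-intensity signal
    content = acts @ traces               # blended retrieved pattern
    return intensity, content

probe = traces[0] * (rng.random(32) < 0.8)  # degraded version of a stored word
print(echo(probe, traces)[0])
```

Because the echo blends the probe with its stored neighbors, repeated words retrieve stronger, more specific echoes, which is the mechanism behind the predicted shadowing-time and imitation effects.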

1,399 citations

Journal ArticleDOI
15 Apr 2007
TL;DR: This paper gives a general overview of techniques in statistical parametric speech synthesis, and contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years.
Abstract: This paper gives a general overview of techniques in statistical parametric speech synthesis. One of the instances of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable speech synthesis. This paper also contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years. Advantages and disadvantages of statistical parametric synthesis are highlighted as well as identifying where we expect the key developments to appear in the immediate future.
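One concrete step worth sketching is the maximum-likelihood parameter generation used in HMM-based synthesis: choose the static feature trajectory whose stacked static+delta observations best fit the state-level Gaussians. The toy sizes and the simple delta window below are assumptions for illustration, not a particular system's configuration.

```python
# Hedged sketch: ML parameter generation with delta constraints.
import numpy as np

T = 5
rng = np.random.default_rng(0)
mu = rng.standard_normal(2 * T)         # per-frame [static; delta] means from HMM states
prec = np.ones(2 * T)                   # diagonal precisions (1 / variance)

# W maps the static trajectory c (T,) to stacked [static, delta] features (2T,)
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                   # static row: o_static[t] = c[t]
    if t > 0:
        W[2 * t + 1, t - 1] = -0.5      # delta row: 0.5 * (c[t+1] - c[t-1])
    if t < T - 1:
        W[2 * t + 1, t + 1] = 0.5

A = W.T @ (prec[:, None] * W)           # W' Sigma^-1 W
b = W.T @ (prec * mu)                   # W' Sigma^-1 mu
c = np.linalg.solve(A, b)               # ML static trajectory
print(c)
```

The delta rows are what make the solution a smooth trajectory rather than a sequence of independent per-frame means; this smoothing is a key difference from naive frame-wise generation.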

1,270 citations

Journal ArticleDOI
TL;DR: A new methodology is designed for representing the relationship between two sets of spectral envelopes; the proposed transform greatly improves the quality and naturalness of converted speech signals compared with previously proposed conversion methods.
Abstract: Voice conversion, as considered in this paper, is defined as modifying the speech signal of one speaker (source speaker) so that it sounds as if it had been pronounced by a different speaker (target speaker). Our contribution includes the design of a new methodology for representing the relationship between two sets of spectral envelopes. The proposed method is based on the use of a Gaussian mixture model of the source speaker spectral envelopes. The conversion itself is represented by a continuous parametric function which takes into account the probabilistic classification provided by the mixture model. The parameters of the conversion function are estimated by least squares optimization on the training data. This conversion method is implemented in the context of the HNM (harmonic+noise model) system, which allows high-quality modifications of speech signals. Compared to earlier methods based on vector quantization, the proposed conversion scheme results in a much better match between the converted envelopes and the target envelopes. Evaluation by objective tests and formal listening tests shows that the proposed transform greatly improves the quality and naturalness of the converted speech signals compared with previous proposed conversion methods.
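A hedged numpy sketch of this style of conversion function: a soft mixture of per-component linear regressions, weighted by the source GMM's posterior probabilities. All parameters below are toy stand-ins rather than trained values, and the least-squares training step described in the paper is omitted.

```python
# Hedged sketch: GMM-weighted spectral conversion function.
import numpy as np

K, D = 2, 3                                   # mixture components, feature dim (toy)
rng = np.random.default_rng(1)
w   = np.full(K, 1.0 / K)                     # mixture weights
mu  = rng.standard_normal((K, D))             # source-speaker component means
var = np.ones((K, D))                         # diagonal source variances
nu  = rng.standard_normal((K, D))             # learned target offsets (toy)
A   = np.stack([np.eye(D)] * K)               # learned per-component matrices (toy)

def posteriors(x):
    """P(component i | source frame x) under the diagonal GMM."""
    logp = -0.5 * (((x - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(1) + np.log(w)
    p = np.exp(logp - logp.max())
    return p / p.sum()

def convert(x):
    p = posteriors(x)                         # soft classification of the frame
    terms = nu + np.einsum('kij,kj->ki', A, (x - mu) / var)
    return p @ terms                          # (D,) converted envelope

print(convert(rng.standard_normal(D)))
```

The soft posterior weighting is what makes the mapping continuous across the acoustic space, in contrast to the hard codebook switching of the earlier vector-quantization methods the abstract mentions.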

1,109 citations

Journal ArticleDOI
TL;DR: In this article, a Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers, and a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory is proposed.
Abstract: In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, the deterioration of speech quality is caused by some problems: 1) appropriate spectral movements are not always caused by the frame-based conversion process, and 2) the converted spectra are excessively smoothed by statistical modeling. In order to address those problems, we propose a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory. Not only static but also dynamic feature statistics are used for realizing the appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in view of both speech quality and conversion accuracy for speaker individuality.
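Under the common single-mixture-sequence approximation, the trajectory estimate this paper builds on has a closed form. The sketch below states the standard result, with symbols as usually defined in this literature: y is the static target trajectory, W the matrix appending dynamic (delta) features, and E and D the mixture-sequence-dependent conditional mean vector and covariance matrix.

```latex
\hat{\mathbf{y}}
  = \arg\max_{\mathbf{y}} \; p\!\left(\mathbf{W}\mathbf{y} \,\middle|\, \mathbf{X}, \lambda\right)
  = \left(\mathbf{W}^{\top}\mathbf{D}^{-1}\mathbf{W}\right)^{-1}
    \mathbf{W}^{\top}\mathbf{D}^{-1}\mathbf{E}
```

The global-variance extension then penalizes solutions whose per-utterance variance falls below that of natural speech, which counteracts the oversmoothing the abstract describes.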

914 citations

Proceedings ArticleDOI
22 Sep 2008
TL;DR: Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH 2008), held September 22-26, 2008, in Brisbane, Australia.
Abstract: INTERSPEECH2008: 9th Annual Conference of the International Speech Communication Association, September 22-26, 2008, Brisbane, Australia.

796 citations