Author

Yoshinori Sagisaka

Bio: Yoshinori Sagisaka is an academic researcher from Nippon Telegraph and Telephone. The author has contributed to research in topics: Speech synthesis & Phrase. The author has an h-index of 13 and has co-authored 28 publications receiving 775 citations.

Papers
Journal ArticleDOI
TL;DR: A large-scale Japanese speech database is described; it has been used to develop algorithms for speech recognition and synthesis and to gather acoustic, phonetic, and linguistic evidence that serves as basic data for speech technologies.

282 citations

Book
01 Jan 1992
TL;DR: This edited volume on speech perception, production, and linguistic structure includes the fuzzy logical model of speech perception as a framework for research and theory, studies of adaptability to talker differences in Japanese monosyllabic perception, and the effect of F0 lowering on vowel identification.
Abstract: Part 1 Speech perception:
- assimilation and contrast in vowel perception, Sumi Shigeno
- perception of vowel quality in a phonologically neutralized context, Robert Allen Fox
- modelling human vowel identification using aspects of formant trajectory and context, Caroline B. Huang
- psychoacoustic evidence for contextual effect models, Masato Akagi
- the fuzzy logical model of speech perception - a framework for research and theory, Dominic W. Massaro
- the effect of F0 on vowel identification, Tatsuya Hirahara and Hiroaki Kato
- paying attention to differences among talkers, Howard C. Nusbaum and Todd M. Morin
- adaptability to differences between talkers in Japanese monosyllabic perception, Kazuhiko Kakehi
- talker normalization in speech perception, David B. Pisoni
- perception of American English /r/ and /l/ by native speakers of Japanese, Reiko A. Yamada and Yoh'ichi Tohkura
- some effects of training Japanese listeners to identify English /r/ and /l/, Scott E. Lively et al.
- learning non-native phoneme contrasts - interactions among subject, stimulus and task variables, Winifred Strange
- speech processing and segmentation in Romance languages, Jacques Mehler and Anne Christophe
- speech prototypes - studies on the nature, function, ontogeny and phylogeny of the "centre" of speech categories, Patricia K. Kuhl
- learning to hear phonetic information, Howard C. Nusbaum and Lisa Lee
- processing constraints of the native phonological repertoire on the native language, Anne Cutler
- perceptual normalization of vocal tract size in young children and infants, Shigeru Kiritani et al.
- two mechanisms of processing sound sequences, Morio Kohno

Part 2 Speech production and linguistic structure:
- what is the input to the speech production mechanism?, John J. Ohala
- modelling the process of fundamental frequency contour generation, Hiroya Fujisaki
- sensorimotor transformations and control strategies in speech, Kevin G. Munhall et al.
- articulatory correlates of linguistically contrastive events - where are they?, Eric Vatikiotis-Bateson and Janet Fletcher
- intonational categories and the articulatory control of duration, Mary E. Beckman and Jan Edwards
- perceptual vs physical models of intonation, Rene Collier
- F0 lowering - peripheral mechanisms and motor programming, Kiyoshi Honda
- the control of segmental duration in speech synthesis using statistical methods, Nobuyoshi Kaiki and Yoshinori Sagisaka
- segmental elasticity and timing in Japanese speech, Nick Campbell
- the production and perception of word boundaries, Anne Cutler
- syntactic influences on prosody, Jacques Terken and Rene Collier
- to what extent is speech production controlled by speech perception? some questions and some experimental evidence, Sieb G. Nooteboom and Wieke Eefting
- on the modelling of segmental duration control, Yoshinori Sagisaka
- evidence for speech rhythms across languages, Mary E. Beckman

95 citations

Journal ArticleDOI
TL;DR: A speech spectrum transformation method interpolates multiple speakers' spectral patterns, using a multi-functional representation with Radial Basis Function networks, to generate new spectrum patterns close to those of the target speaker.
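For intuition, here is a minimal numpy sketch of the general idea of interpolating reference speakers' spectral patterns with Gaussian radial basis functions. The speaker positions, shapes, and exact-interpolation fit are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: RBF interpolation between speakers' spectral patterns.
import numpy as np

def rbf_interpolate(anchor_points, anchor_spectra, query, width=1.0):
    """Interpolate spectra with Gaussian radial basis functions.

    anchor_points  : (S, D) positions assigned to S reference speakers
    anchor_spectra : (S, F) spectral envelope of each speaker (F bins)
    query          : (D,) position at which to synthesize a new spectrum
    """
    # Gram matrix of RBF activations between the anchors
    d2 = ((anchor_points[:, None, :] - anchor_points[None, :, :]) ** 2).sum(-1)
    G = np.exp(-d2 / (2 * width**2))
    # Solve for weights that reproduce each anchor spectrum exactly
    W = np.linalg.solve(G, anchor_spectra)              # (S, F)
    # Activate the basis functions at the query point and mix
    q2 = ((anchor_points - query) ** 2).sum(-1)
    phi = np.exp(-q2 / (2 * width**2))                  # (S,)
    return phi @ W                                      # (F,)

pts = np.array([[0.0], [1.0], [2.0]])                   # three reference speakers
spectra = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0]])  # toy 2-bin "spectra"
print(rbf_interpolate(pts, spectra, np.array([0.5])))   # an in-between speaker
```

Moving the query point between anchors yields spectra that blend smoothly toward the nearest reference speakers, which is the effect the TL;DR describes.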

63 citations

Journal ArticleDOI
TL;DR: This paper proposes a method for automatically generating a pronunciation dictionary based on a pronunciation neural network that predicts plausible surface pronunciations from the canonical pronunciation; the resulting dictionary gives consistently higher recognition rates than a conventional dictionary.
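The paper's network itself is not reproduced here; as a hedged stand-in, the toy script below enumerates surface-pronunciation variants of a canonical phoneme string from a hand-written substitution table, which plays the role the pronunciation network plays in the paper. All phones and probabilities are made-up illustrations.

```python
# Hedged sketch: expanding a canonical pronunciation into plausible variants.
# VARIANTS stands in for a learned model P(surface_phone | canonical_phone).
from itertools import product

VARIANTS = {
    "t":  {"t": 0.7, "dx": 0.3},   # toy example: flapping
    "ih": {"ih": 0.8, "ax": 0.2},  # toy example: vowel reduction
}

def expand(canonical, threshold=0.1):
    """Enumerate surface pronunciations whose probability clears threshold."""
    choices = [list(VARIANTS.get(p, {p: 1.0}).items()) for p in canonical]
    out = []
    for combo in product(*choices):
        prob = 1.0
        for _, p in combo:
            prob *= p
        if prob >= threshold:
            out.append(([ph for ph, _ in combo], prob))
    return sorted(out, key=lambda pair: -pair[1])

print(expand(["w", "ih", "n", "t", "er"]))  # toy "winter"-like input
```

A dictionary built this way lists each retained variant alongside the canonical form, which is how extra, plausible pronunciations can raise recognition rates.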

57 citations

Journal ArticleDOI
TL;DR: A compensatory durational change across a vowel and its adjacent consonant is perceptually less salient than expected for simultaneous modification of the two segments, suggesting a time perception range wider than a single segment.
Abstract: Perceptual sensitivity to temporal modification in two consecutive speech segments was measured in word contexts to explore two questions: (1) is there an interaction between multiple segmental durations, and (2) what aspect of the stimulus context determines the perceptually salient temporal markers? Experiment 1 obtained acceptability ratings for words with temporal modifications. The results showed that a compensatory change in the durations of a vowel (V) and its adjacent consonant (C) is perceptually less salient than would be expected from the simultaneous modification of the two segments. This finding suggests the presence of a time perception range wider than a single segment (V or C). The results of experiment 1 also showed that rating scores for compensatory modification between V and C do not depend on the temporal order of the modified pair (VC or CV), but rather on the loudness difference between V and C: acceptability decreased as the loudness difference between V and C grew. This suggests that perceptually salient markers are located around major jumps in loudness. The second finding, the dependence on the loudness jump, was replicated in experiment 2, which used a detection task for temporal modifications of nonspeech stimuli modeling the time-loudness features of the speech stimuli. Experiment 3 further investigated the influence of the temporal order of V and C by using the detection task on the speech stimuli instead of acceptability ratings.

36 citations


Cited by
Journal ArticleDOI
TL;DR: An episodic model tested against speech production data from a word-shadowing task predicted the shadowing-response-time patterns, and it correctly predicted a tendency for shadowers to spontaneously imitate the acoustic patterns of words and nonwords.
Abstract: In this article the author proposes an episodic theory of spoken word representation, perception, and production. By most theories, idiosyncratic aspects of speech (voice details, ambient noise, etc.) are considered noise and are filtered in perception. However, episodic theories suggest that perceptual details are stored in memory and are integral to later perception. In this research the author tested an episodic model (MINERVA 2; D. L. Hintzman, 1986) against speech production data from a word-shadowing task. The model predicted the shadowing-response-time patterns, and it correctly predicted a tendency for shadowers to spontaneously imitate the acoustic patterns of words and nonwords. It also correctly predicted imitation strength as a function of "abstract" stimulus properties, such as word frequency. Taken together, the data and theory suggest that detailed episodes constitute the basic substrate of the mental lexicon. Early in the 20th century, Semon (1909/1923) described a memory theory that anticipated many aspects of contemporary theories (Schacter, Eich, & Tulving, 1978). In modern parlance, this was an episodic (or exemplar) theory, which assumes that every experience, such as perceiving a spoken word, leaves a unique memory trace. On presentation of a new word, all stored traces are activated, each according to its similarity to the stimulus. The most activated traces connect the new word to stored knowledge, the essence of recognition. The multiple-trace assumption allowed Semon's theory to explain the apparent permanence of specific memories; the challenge was also to create ...
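Since MINERVA 2's retrieval step is well documented (Hintzman, 1986), a compact sketch is easy to give: every episode is a trace, a probe activates each trace by its cubed similarity, and the weighted blend of traces is the "echo". The feature coding and sizes below are toy choices, and the similarity normalization is simplified.

```python
# Hedged sketch of MINERVA 2-style episodic retrieval (Hintzman, 1986).
import numpy as np

rng = np.random.default_rng(0)
traces = rng.choice([-1.0, 0.0, 1.0], size=(100, 32))  # 100 episodes, 32 features

def echo(probe, traces):
    # similarity: feature match averaged over the probe's non-zero positions
    # (a simplification of Hintzman's normalization)
    sims = (traces @ probe) / np.count_nonzero(probe)
    acts = sims ** 3                      # cubing sharpens retrieval
    intensity = acts.sum()                # familiarity / echo-intensity signal
    content = acts @ traces               # blended retrieved pattern
    return intensity, content

probe = traces[0] * (rng.random(32) < 0.8)  # degraded version of a stored word
print(echo(probe, traces)[0])
```

Because the echo blends the probe with its stored neighbors, repeated words retrieve stronger, more specific echoes, which is the mechanism behind the predicted shadowing-time and imitation effects.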

1,399 citations

Journal ArticleDOI
15 Apr 2007
TL;DR: This paper gives a general overview of techniques in statistical parametric speech synthesis, and contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years.
Abstract: This paper gives a general overview of techniques in statistical parametric speech synthesis. One of the instances of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable speech synthesis. This paper also contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years. Advantages and disadvantages of statistical parametric synthesis are highlighted as well as identifying where we expect the key developments to appear in the immediate future.
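One concrete step worth sketching is the maximum-likelihood parameter generation used in HMM-based synthesis: choose the static feature trajectory whose stacked static+delta observations best fit the state-level Gaussians. The toy sizes and the simple delta window below are assumptions for illustration, not a particular system's configuration.

```python
# Hedged sketch: ML parameter generation with delta constraints.
import numpy as np

T = 5
rng = np.random.default_rng(0)
mu = rng.standard_normal(2 * T)         # per-frame [static; delta] means from HMM states
prec = np.ones(2 * T)                   # diagonal precisions (1 / variance)

# W maps the static trajectory c (T,) to stacked [static, delta] features (2T,)
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                   # static row: o_static[t] = c[t]
    if t > 0:
        W[2 * t + 1, t - 1] = -0.5      # delta row: 0.5 * (c[t+1] - c[t-1])
    if t < T - 1:
        W[2 * t + 1, t + 1] = 0.5

A = W.T @ (prec[:, None] * W)           # W' Sigma^-1 W
b = W.T @ (prec * mu)                   # W' Sigma^-1 mu
c = np.linalg.solve(A, b)               # ML static trajectory
print(c)
```

The delta rows are what make the solution a smooth trajectory rather than a sequence of independent per-frame means; this smoothing is a key difference from naive frame-wise generation.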

1,270 citations

Journal ArticleDOI
TL;DR: A new methodology is designed for representing the relationship between two sets of spectral envelopes; the proposed transform greatly improves the quality and naturalness of converted speech signals compared with previously proposed conversion methods.
Abstract: Voice conversion, as considered in this paper, is defined as modifying the speech signal of one speaker (source speaker) so that it sounds as if it had been pronounced by a different speaker (target speaker). Our contribution includes the design of a new methodology for representing the relationship between two sets of spectral envelopes. The proposed method is based on the use of a Gaussian mixture model of the source speaker spectral envelopes. The conversion itself is represented by a continuous parametric function which takes into account the probabilistic classification provided by the mixture model. The parameters of the conversion function are estimated by least squares optimization on the training data. This conversion method is implemented in the context of the HNM (harmonic+noise model) system, which allows high-quality modifications of speech signals. Compared to earlier methods based on vector quantization, the proposed conversion scheme results in a much better match between the converted envelopes and the target envelopes. Evaluation by objective tests and formal listening tests shows that the proposed transform greatly improves the quality and naturalness of the converted speech signals compared with previous proposed conversion methods.
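A hedged numpy sketch of this style of conversion function: a soft mixture of per-component linear regressions, weighted by the source GMM's posterior probabilities. All parameters below are toy stand-ins rather than trained values, and the least-squares training step described in the paper is omitted.

```python
# Hedged sketch: GMM-weighted spectral conversion function.
import numpy as np

K, D = 2, 3                                   # mixture components, feature dim (toy)
rng = np.random.default_rng(1)
w   = np.full(K, 1.0 / K)                     # mixture weights
mu  = rng.standard_normal((K, D))             # source-speaker component means
var = np.ones((K, D))                         # diagonal source variances
nu  = rng.standard_normal((K, D))             # learned target offsets (toy)
A   = np.stack([np.eye(D)] * K)               # learned per-component matrices (toy)

def posteriors(x):
    """P(component i | source frame x) under the diagonal GMM."""
    logp = -0.5 * (((x - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(1) + np.log(w)
    p = np.exp(logp - logp.max())
    return p / p.sum()

def convert(x):
    p = posteriors(x)                         # soft classification of the frame
    terms = nu + np.einsum('kij,kj->ki', A, (x - mu) / var)
    return p @ terms                          # (D,) converted envelope

print(convert(rng.standard_normal(D)))
```

The soft posterior weighting is what makes the mapping continuous across the acoustic space, in contrast to the hard codebook switching of the earlier vector-quantization methods the abstract mentions.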

1,109 citations

Journal ArticleDOI
TL;DR: In this article, a Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers, and a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory is proposed.
Abstract: In this paper, we describe a novel spectral conversion method for voice conversion (VC). A Gaussian mixture model (GMM) of the joint probability density of source and target features is employed for performing spectral conversion between speakers. The conventional method converts spectral parameters frame by frame based on the minimum mean square error. Although it is reasonably effective, the deterioration of speech quality is caused by some problems: 1) appropriate spectral movements are not always caused by the frame-based conversion process, and 2) the converted spectra are excessively smoothed by statistical modeling. In order to address those problems, we propose a conversion method based on the maximum-likelihood estimation of a spectral parameter trajectory. Not only static but also dynamic feature statistics are used for realizing the appropriate converted spectrum sequence. Moreover, the oversmoothing effect is alleviated by considering a global variance feature of the converted spectra. Experimental results indicate that the performance of VC can be dramatically improved by the proposed method in view of both speech quality and conversion accuracy for speaker individuality.
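Under the common single-mixture-sequence approximation, the trajectory estimate this paper builds on has a closed form. The sketch below states the standard result, with symbols as usually defined in this literature: y is the static target trajectory, W the matrix appending dynamic (delta) features, and E and D the mixture-sequence-dependent conditional mean vector and covariance matrix.

```latex
\hat{\mathbf{y}}
  = \arg\max_{\mathbf{y}} \; p\!\left(\mathbf{W}\mathbf{y} \,\middle|\, \mathbf{X}, \lambda\right)
  = \left(\mathbf{W}^{\top}\mathbf{D}^{-1}\mathbf{W}\right)^{-1}
    \mathbf{W}^{\top}\mathbf{D}^{-1}\mathbf{E}
```

The global-variance extension then penalizes solutions whose per-utterance variance falls below that of natural speech, which counteracts the oversmoothing the abstract describes.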

914 citations

Proceedings ArticleDOI
22 Sep 2008
TL;DR: Proceedings of the 9th Annual Conference of the International Speech Communication Association (INTERSPEECH 2008), held September 22-26, 2008, in Brisbane, Australia.
Abstract: INTERSPEECH2008: 9th Annual Conference of the International Speech Communication Association, September 22-26, 2008, Brisbane, Australia.

796 citations