Home
/
Authors
/
Takao Kobayashi

Author

Takao Kobayashi

Other affiliations: Ericsson Radio Systems, IBM, Nagoya Institute of Technology ...read more

Bio: Takao Kobayashi is an academic researcher from Tokyo Institute of Technology. The author has contributed to research in topics: Speech synthesis & Hidden Markov model. The author has an hindex of 41, co-authored 235 publications receiving 8359 citations. Previous affiliations of Takao Kobayashi include Ericsson Radio Systems & IBM.

Papers published on a yearly basis

2022
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
1993
1992
1991
1990
1989
1988
1986
1984
1974

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

Speech parameter generation algorithms for HMM-based speech synthesis

[...]

Keiichi Tokuda¹, Takayoshi Yoshimura¹, Takashi Masuko², Takao Kobayashi², Tadashi Kitamura¹ - Show less +1 more•Institutions (2)

Nagoya Institute of Technology¹, Tokyo Institute of Technology²

05 Jun 2000

TL;DR: A speech parameter generation algorithm for HMM-based speech synthesis, in which the speech parameter sequence is generated from HMMs whose observation vector consists of a spectral parameter vector and its dynamic feature vectors, is derived.

...read moreread less

Abstract: This paper derives a speech parameter generation algorithm for HMM-based speech synthesis, in which the speech parameter sequence is generated from HMMs whose observation vector consists of a spectral parameter vector and its dynamic feature vectors. In the algorithm, we assume that the state sequence (state and mixture sequence for the multi-mixture case) or a part of the state sequence is unobservable (i.e., hidden or latent). As a result, the algorithm iterates the forward-backward algorithm and the parameter generation algorithm for the case where the state sequence is given. Experimental results show that by using the algorithm, we can reproduce clear formant structure from multi-mixture HMMs as compared with that produced from single-mixture HMMs.

...read moreread less

1,071 citations

Proceedings Article•

Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis

[...]

Takayoshi Yoshimura, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Tadashi Kitamura - Show less +1 more

01 Jan 1999

TL;DR: An HMM-based speech synthesis system in which spectrum, pitch and state duration are modeled simultaneously in a unified framework of HMM is described.

...read moreread less

Abstract: In this paper, we describe an HMM-based speech synthesis system in which spectrum, pitch and state duration are modeled simultaneously in a unified framework of HMM. In the system, pitch and state duration are modeled by multi-space probability distribution HMMs and multi-dimensional Gaussian distributions, respectively. The distributions for spectral parameter, pitch parameter and the state duration are clustered independently by using a decision-tree based context clustering technique. Synthetic speech is generated by using an speech parameter generation algorithm from HMM and a mel-cepstrum based vocoding technique. Through informal listening tests, we have confirmed that the proposed system successfully synthesizes natural-sounding speech which resembles the speaker in the training database.

...read moreread less

759 citations

Proceedings Article•DOI•

An adaptive algorithm for mel-cepstral analysis of speech

[...]

T. Fukada¹, Keiichi Tokuda², Takao Kobayashi², Satoshi Imai²•Institutions (2)

Canon Inc.¹, Tokyo Institute of Technology²

23 Mar 1992

TL;DR: The authors apply the criterion used in the unbiased estimation of log spectrum to the spectral model represented by the mel-cepstral coefficients to solve the nonlinear minimization problem involved in the method and derive an adaptive algorithm whose convergence is guaranteed.

...read moreread less

Abstract: The authors describe a mel-cepstral analysis method and its adaptive algorithm. In the proposed method, the authors apply the criterion used in the unbiased estimation of log spectrum to the spectral model represented by the mel-cepstral coefficients. To solve the nonlinear minimization problem involved in the method, they give an iterative algorithm whose convergence is guaranteed. Furthermore, they derive an adaptive algorithm for the mel-cepstral analysis by introducing an instantaneous estimate for gradient of the criterion. The adaptive mel-cepstral analysis system is implemented with an IIR adaptive filter which has an exponential transfer function, and whose stability is guaranteed. The authors also present examples of speech analysis and results of an isolated word recognition experiment. >

...read moreread less

374 citations

Journal Article•DOI•

Analysis of Speaker Adaptation Algorithms for HMM-Based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm

[...]

Junichi Yamagishi¹, Takao Kobayashi², Y. Nakano², K. Ogata², J. Isogai² - Show less +1 more•Institutions (2)

University of Edinburgh¹, Tokyo Institute of Technology²

01 Jan 2009-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A new adaptation algorithm is proposed called constrained structural maximum a posteriori linear regression (CSMAPLR) whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms.

...read moreread less

Abstract: In this paper, we analyze the effects of several factors and configuration choices encountered during training and model construction when we want to obtain better and more stable adaptation in HMM-based speech synthesis. We then propose a new adaptation algorithm called constrained structural maximum a posteriori linear regression (CSMAPLR) whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms. Here, we investigate six major aspects of the speaker adaptation: initial models; the amount of the training data for the initial models; the transform functions, estimation criteria, and sensitivity of several linear regression adaptation algorithms; and combination algorithms. Analyzing the effect of the initial model, we compare speaker-dependent models, gender-independent models, and the simultaneous use of the gender-dependent models to single use of the gender-dependent models. Analyzing the effect of the transform functions, we compare the transform function for only mean vectors with that for mean vectors and covariance matrices. Analyzing the effect of the estimation criteria, we compare the ML criterion with a robust estimation criterion called structural MAP. We evaluate the sensitivity of several thresholds for the piecewise linear regression algorithms and take up methods combining MAP adaptation with the linear regression algorithms. We incorporate these adaptation algorithms into our speech synthesis system and present several subjective and objective evaluation results showing the utility and effectiveness of these algorithms in speaker adaptation for HMM-based speech synthesis.

...read moreread less

373 citations

Proceedings Article•DOI•

Speech parameter generation from HMM using dynamic features

[...]

Keiichi Tokuda¹, Takao Kobayashi¹, Satoshi Imai¹•Institutions (1)

Tokyo Institute of Technology¹

09 May 1995

TL;DR: It is shown that the parameter generation from HMMs using the dynamic features results in searching for the optimum state sequence and solving a set of linear equations for each possible state sequence.

...read moreread less

Abstract: This paper proposes an algorithm for speech parameter generation from HMMs which include the dynamic features. The performance of speech recognition based on HMMs has been improved by introducing the dynamic features of speech. Thus we surmise that, if there is a method for speech parameter generation from HMMs which include the dynamic features, it will be useful for speech synthesis by rule. It is shown that the parameter generation from HMMs using the dynamic features results in searching for the optimum state sequence and solving a set of linear equations for each possible state sequence. We derive a fast algorithm for the solution by the analogy of the RLS algorithm for adaptive filtering. We also show the effect of incorporating the dynamic features by an example of speech parameter generation.

...read moreread less

299 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

YIN, a fundamental frequency estimator for speech and music

[...]

Alain de Cheveigné¹, Hideki Kawahara•Institutions (1)

IRCAM¹

03 Apr 2002-Journal of the Acoustical Society of America

TL;DR: An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds, based on the well-known autocorrelation method with a number of modifications that combine to prevent errors.

...read moreread less

Abstract: An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds. It is based on the well-known autocorrelation method with a number of modifications that combine to prevent errors. The algorithm has several desirable features. Error rates are about three times lower than the best competing methods, as evaluated over a database of speech recorded together with a laryngograph signal. There is no upper limit on the frequency search range, so the algorithm is suited for high-pitched voices and music. The algorithm is relatively simple and may be implemented efficiently and with low latency, and it involves few parameters that must be tuned. It is based on a signal model (periodic signal) that may be extended in several ways to handle various forms of aperiodicity that occur in particular applications. Finally, interesting parallels may be drawn with models of auditory processing.

...read moreread less

1,975 citations

Journal Article•DOI•

Multimodal Machine Learning: A Survey and Taxonomy

[...]

Tadas Baltrusaitis¹, Chaitanya Ahuja², Louis-Philippe Morency²•Institutions (2)

Microsoft¹, Carnegie Mellon University²

01 Feb 2019-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.

...read moreread less

Abstract: Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together Multimodal machine learning aims to build models that can process and relate information from multiple modalities It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research

...read moreread less

1,945 citations

Proceedings Article•DOI•

librosa: Audio and Music Signal Analysis in Python

[...]

Brian McFee¹, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, Oriol Nieto - Show less +3 more•Institutions (1)

New York University¹

01 Jan 2015

TL;DR: A brief overview of the librosa library's functionality is provided, along with explanations of the design goals, software development practices, and notational conventions.

...read moreread less

Abstract: This document describes version 0.4.0 of librosa: a Python pack- age for audio and music signal processing. At a high level, librosa provides implementations of a variety of common functions used throughout the field of music information retrieval. In this document, a brief overview of the library's functionality is provided, along with explanations of the design goals, software development practices, and notational conventions.

...read moreread less

1,793 citations

Journal Article•DOI•

Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds

[...]

Hideki Kawahara, Ikuyo Masuda-Katsuse, Alain de Cheveigné

01 Apr 1999-Speech Communication

TL;DR: A set of simple new procedures has been developed to enable the real-time manipulation of speech parameters by using pitch-adaptive spectral analysis combined with a surface reconstruction method in the time–frequency region.

...read moreread less

1,741 citations

Journal Article•DOI•

Statistical Parametric Speech Synthesis

[...]

Alan W. Black¹, Heiga Zen², Keiichi Tokuda²•Institutions (2)

Carnegie Mellon University¹, Nagoya Institute of Technology²

15 Apr 2007

TL;DR: This paper gives a general overview of techniques in statistical parametric speech synthesis, and contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years.

...read moreread less

Abstract: This paper gives a general overview of techniques in statistical parametric speech synthesis. One of the instances of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable speech synthesis. This paper also contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years. Advantages and disadvantages of statistical parametric synthesis are highlighted as well as identifying where we expect the key developments to appear in the immediate future.

...read moreread less

1,270 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse