Author

Shinnosuke Takamichi

Bio: Shinnosuke Takamichi is an academic researcher from the University of Tokyo. The author has contributed to research in topics: Speech synthesis & Computer science. The author has an h-index of 18 and has co-authored 102 publications receiving 1,044 citations. Previous affiliations of Shinnosuke Takamichi include the Nara Institute of Science and Technology and Nippon Telegraph and Telephone.

Papers
Journal ArticleDOI
TL;DR: The proposed method can generate more natural spectral parameters and $F_0$ than the conventional minimum generation error training algorithm regardless of its hyperparameter settings, and it is found that a Wasserstein GAN minimizing the Earth-Mover's distance works best in terms of improving the synthetic speech quality.
Abstract: A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural network techniques can be applied to artificially synthesize speech waveforms, the synthetic speech quality is low compared with that of natural speech. One of the issues causing the quality degradation is an oversmoothing effect often observed in the generated speech parameters. A GAN introduced in this paper consists of two neural networks: a discriminator to distinguish natural and generated samples, and a generator to deceive the discriminator. In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator. Since the objective of the GANs is to minimize the divergence (i.e., distribution difference) between the natural and generated speech parameters, the proposed method effectively alleviates the oversmoothing effect on the generated speech parameters. We evaluated the effectiveness of the proposed method for text-to-speech and voice conversion, and found that it can generate more natural spectral parameters and $F_0$ than the conventional minimum generation error training algorithm regardless of its hyperparameter settings. Furthermore, we investigated the effect of the divergence of various GANs, and found that a Wasserstein GAN minimizing the Earth-Mover's distance works best in terms of improving the synthetic speech quality.
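
The training objective lends itself to a short sketch. Below is a minimal PyTorch illustration of the weighted-sum loss described above, assuming simple feed-forward networks; the architectures, feature dimensions, and the weight w are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    # Acoustic model: linguistic features -> speech parameters (dimensions assumed).
    acoustic_model = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, 60))
    # Discriminator: speech parameters -> probability of being natural.
    discriminator = nn.Sequential(nn.Linear(60, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

    mge_loss = nn.MSELoss()   # conventional minimum generation error term
    adv_loss = nn.BCELoss()   # adversarial term for deceiving the discriminator
    w = 1.0                   # hyperparameter weighting the adversarial term

    def generator_loss(linguistic_feats, natural_params):
        generated = acoustic_model(linguistic_feats)
        d_out = discriminator(generated)
        # The acoustic model is rewarded when the discriminator
        # labels its output as natural (target = 1).
        return mge_loss(generated, natural_params) + w * adv_loss(d_out, torch.ones_like(d_out))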

178 citations

Proceedings ArticleDOI
15 Apr 2018
TL;DR: Experimental results demonstrate that PPGs successfully improve both naturalness and speaker similarity of the converted speech, and that both speaker codes and d-vectors can be adopted in the VAE-based many-to-many non-parallel VC.
Abstract: This paper proposes novel frameworks for non-parallel voice conversion (VC) using variational autoencoders (VAEs). Although conventional VAE-based VC models can be trained using non-parallel speech corpora with given speaker representations, the phonetic content of the converted speech tends to vanish because of an over-regularization issue often observed in the latent variables of the VAEs. To overcome this issue, this paper proposes a VAE-based non-parallel VC conditioned not only on the speaker representations but also on the phonetic content of speech represented as phonetic posteriorgrams (PPGs). Since the phonetic content is given during training, we can expect the VC models to effectively learn speaker-independent latent features of speech. Building on this point, this paper also extends the conventional VAE-based non-parallel VC to many-to-many VC that can convert an arbitrary speaker's characteristics into those of another arbitrary speaker. We investigate two methods to estimate speaker representations for speakers not included in the speech corpora used for training the VC models: 1) adapting conventional speaker codes, and 2) using d-vectors as the speaker representations. Experimental results demonstrate that 1) PPGs successfully improve both naturalness and speaker similarity of the converted speech, and 2) both speaker codes and d-vectors can be adopted in the VAE-based many-to-many non-parallel VC.
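
The conditioning scheme admits a compact sketch: the decoder receives the latent variable together with a speaker representation (a speaker code or d-vector) and the frame-level PPG, so conversion amounts to decoding with a target speaker's representation. All dimensions and layer choices in this PyTorch sketch are assumptions for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn

    class CondVAE(nn.Module):
        def __init__(self, feat_dim=60, z_dim=16, spk_dim=32, ppg_dim=144):
            super().__init__()
            self.enc = nn.Linear(feat_dim, 2 * z_dim)  # outputs mean and log-variance
            self.dec = nn.Linear(z_dim + spk_dim + ppg_dim, feat_dim)

        def forward(self, x, spk, ppg):
            mu, logvar = self.enc(x).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
            # Conversion: decode the same z with the target speaker's
            # representation instead of the source speaker's.
            return self.dec(torch.cat([z, spk, ppg], dim=-1)), mu, logvar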

114 citations

Posted Content
TL;DR: A novel Japanese speech corpus, named the "JSUT corpus," that is aimed at achieving end-to-end speech synthesis; it consists of 10 hours of reading-style speech data and its transcriptions and covers all of the main pronunciations of daily-use Japanese characters.
Abstract: Thanks to improvements in machine learning techniques, including deep learning, a free large-scale speech corpus that can be shared between academic institutions and commercial companies plays an important role. However, such a corpus for Japanese speech synthesis does not exist. In this paper, we designed a novel Japanese speech corpus, named the "JSUT corpus," that is aimed at achieving end-to-end speech synthesis. The corpus consists of 10 hours of reading-style speech data and its transcriptions and covers all of the main pronunciations of daily-use Japanese characters. We also describe how we designed and analyzed the corpus. The corpus is freely available online.

86 citations

Proceedings ArticleDOI
04 May 2014
TL;DR: The Modulation Spectrum (MS) of the speech parameter trajectory is introduced as a new feature to effectively capture the over-smoothing effect, and a postfilter is proposed based on the MS.
Abstract: In this paper, we propose a postfilter to compensate the modulation spectrum in HMM-based speech synthesis. In order to alleviate the over-smoothing effect, which is a main cause of quality degradation in HMM-based speech synthesis, it is necessary to consider features that can capture over-smoothing. Global Variance (GV) is one well-known example of such a feature, and the effectiveness of a parameter generation algorithm considering the GV has been confirmed. However, the quality gap between natural speech and synthetic speech is still large. In this paper, we introduce the Modulation Spectrum (MS) of a speech parameter trajectory as a new feature to effectively capture the over-smoothing effect, and we propose a postfilter based on the MS. The MS is represented as the power spectrum of the parameter trajectory. The generated speech parameter sequence is filtered to ensure that its MS has a pattern similar to that of natural speech. Experimental results show quality improvements when the proposed methods are applied to spectral and F0 components, compared with conventional methods considering the GV.
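
The core idea can be roughly illustrated in numpy: treat the MS as the power spectrum of a parameter trajectory, then rescale the trajectory's spectral magnitudes toward a natural-speech average. The interpolation weight k and the simple magnitude substitution below are assumptions for illustration, not the paper's actual filter design.

    import numpy as np

    def modulation_spectrum(traj, n_fft=64):
        # MS = power spectrum of the (1-D) parameter trajectory.
        return np.abs(np.fft.rfft(traj, n=n_fft)) ** 2

    def ms_postfilter(generated, natural_ms_mean, n_fft=64, k=0.85):
        # Trajectory assumed to be at most n_fft frames long; natural_ms_mean
        # is modulation_spectrum() averaged over natural utterances.
        spec = np.fft.rfft(generated, n=n_fft)
        gen_ms = np.abs(spec) ** 2 + 1e-12
        # Interpolate the log MS toward the natural average (k controls emphasis).
        target_ms = np.exp((1 - k) * np.log(gen_ms) + k * np.log(natural_ms_mean + 1e-12))
        spec *= np.sqrt(target_ms / gen_ms)  # keep phase, rescale magnitude
        return np.fft.irfft(spec, n=n_fft)[: len(generated)]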

72 citations

Journal ArticleDOI
TL;DR: This paper proposes postfilters to modify the MS utterance by utterance or segment by segment to make the MS of synthetic speech close to that of natural speech, applicable to various synthesizers based on statistical parametric speech synthesis.
Abstract: This paper presents novel approaches based on the modulation spectrum (MS) for high-quality statistical parametric speech synthesis, including text-to-speech (TTS) and voice conversion (VC). Although statistical parametric speech synthesis offers various advantages over concatenative speech synthesis, the synthetic speech quality is still not as good as that of concatenative speech synthesis or the quality of natural speech. One of the biggest issues causing the quality degradation is the over-smoothing effect often observed in the generated speech parameter trajectories. Global variance (GV) is known as a feature well correlated with the over-smoothing effect, and the effectiveness of keeping the GV of the generated speech parameter trajectories similar to that of natural speech has been confirmed. However, the quality gap between natural speech and synthetic speech is still large. In this paper, we propose using the MS of the generated speech parameter trajectories as a new feature to effectively quantify the over-smoothing effect. Moreover, we propose postfilters that modify the MS utterance by utterance or segment by segment to make the MS of synthetic speech close to that of natural speech. The proposed postfilters are applicable to various synthesizers based on statistical parametric speech synthesis. We first evaluate the proposed method in the framework of hidden Markov model (HMM)-based TTS, examining its properties from different perspectives. Furthermore, the effectiveness of the proposed postfilters is also evaluated in Gaussian mixture model (GMM)-based VC and classification and regression tree (CART)-based TTS (a.k.a. CLUSTERGEN). The experimental results demonstrate that 1) the proposed utterance-level postfilter achieves quality comparable to the conventional generation algorithm considering the GV, and yields significant further improvements when applied on top of the GV-based generation algorithm in HMM-based TTS, 2) the proposed segment-level postfilter, capable of achieving low-delay synthesis, also yields significant improvements in synthetic speech quality, and 3) the proposed postfilters are effective not only in HMM-based TTS but also in GMM-based VC and CLUSTERGEN.
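
The segment-level, low-delay variant described here can be pictured as applying the same filtering to short overlapping windows and recombining them by overlap-add, so latency is bounded by one segment. The segment length, hop size, and reuse of the hypothetical ms_postfilter() from the earlier sketch are all assumptions.

    import numpy as np

    def segment_ms_postfilter(traj, natural_ms_mean, seg=64, hop=32):
        # traj: 1-D float array, assumed at least seg frames long.
        # ms_postfilter() is the illustrative function from the sketch above;
        # natural_ms_mean must be computed with n_fft equal to seg.
        out = np.zeros_like(traj)
        norm = np.zeros_like(traj)
        win = np.hanning(seg)
        for start in range(0, len(traj) - seg + 1, hop):
            piece = traj[start:start + seg] * win
            out[start:start + seg] += ms_postfilter(piece, natural_ms_mean, n_fft=seg)
            norm[start:start + seg] += win
        return out / np.maximum(norm, 1e-8)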

67 citations


Cited by
Posted Content
TL;DR: This paper proposes WaveNet, a deep neural network for generating raw audio waveforms; the model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones.
Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
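
The autoregressive structure can be illustrated with a toy stack of causal dilated convolutions, in which each output sample depends only on past samples; the PyTorch sketch below omits WaveNet's gated activations and skip connections, and all sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CausalDilatedConv(nn.Module):
        def __init__(self, channels, dilation):
            super().__init__()
            self.pad = dilation  # left-pad so the convolution never sees the future
            self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

        def forward(self, x):
            return self.conv(nn.functional.pad(x, (self.pad, 0)))

    class TinyWaveNet(nn.Module):
        def __init__(self, channels=32, n_classes=256):
            super().__init__()
            self.inp = nn.Conv1d(1, channels, 1)
            # Doubling dilations give an exponentially growing receptive field.
            self.stack = nn.ModuleList(CausalDilatedConv(channels, 2 ** i) for i in range(8))
            self.out = nn.Conv1d(channels, n_classes, 1)  # per-sample categorical logits

        def forward(self, x):  # x: (batch, 1, time)
            h = self.inp(x)
            for layer in self.stack:
                h = torch.relu(layer(h)) + h  # residual connection
            return self.out(h)  # logits over e.g. 256 mu-law levels per sample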

4,002 citations


Journal ArticleDOI
S. Biyiksiz
01 Mar 1985
TL;DR: This book by Elliott and Rao is a valuable contribution to the general areas of signal processing and communications and can be used for a graduate level course in perhaps two ways.
Abstract: There has been a great deal of material in the area of discrete-time transforms published in recent years. This book does an excellent job of presenting important aspects of such material in a clear manner. The book has 11 chapters and a very useful appendix. Seven of these chapters are essentially devoted to the Fourier series/transform, the discrete Fourier transform, the fast Fourier transform (FFT), and applications of the FFT in the area of spectral estimation. Chapters 8 through 10 deal with many other discrete-time transforms and algorithms to compute them. Of these transforms, the Karhunen-Loève, the discrete cosine, and the Walsh-Hadamard transforms are perhaps the most well known. A lucid discussion of number theoretic transforms is presented in Chapter 11. This reviewer feels that the authors have done a fine job of compiling the pertinent material and presenting it in a concise and clear manner. There are a number of problems at the end of each chapter, an appreciable number of which are challenging. The authors have included a comprehensive set of references at the end of the book. In brief, this book is a valuable contribution to the general areas of signal processing and communications. It can be used for a graduate-level course in perhaps two ways. One would be to cover the first seven chapters in great detail. The other would be to cover the whole book by focusing on different topics in a selective manner. This book by Elliott and Rao is extremely useful to researchers and engineers who are working in the areas of signal processing and communications. It is also an excellent reference book, and hence a valuable addition to one's library.

843 citations