Author

Shinnosuke Takamichi

Bio: Shinnosuke Takamichi is an academic researcher from the University of Tokyo. The author has contributed to research in topics: Speech synthesis & Computer science. The author has an h-index of 18 and has co-authored 102 publications receiving 1,044 citations. Previous affiliations of Shinnosuke Takamichi include the Nara Institute of Science and Technology and Nippon Telegraph and Telephone.

Papers
Journal ArticleDOI
TL;DR: The proposed method can generate more natural spectral parameters and $F_0$ than the conventional minimum generation error training algorithm regardless of its hyperparameter settings, and it is found that a Wasserstein GAN minimizing the Earth-Mover's distance works best in terms of improving the synthetic speech quality.
Abstract: A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural network techniques can be applied to artificially synthesize speech waveforms, the synthetic speech quality is low compared with that of natural speech. One of the issues causing the quality degradation is an oversmoothing effect often observed in the generated speech parameters. A GAN introduced in this paper consists of two neural networks: a discriminator to distinguish natural and generated samples, and a generator to deceive the discriminator. In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator. Since the objective of the GANs is to minimize the divergence (i.e., distribution difference) between the natural and generated speech parameters, the proposed method effectively alleviates the oversmoothing effect on the generated speech parameters. We evaluated the effectiveness of the proposed method for text-to-speech and voice conversion, and found that it can generate more natural spectral parameters and $F_0$ than the conventional minimum generation error training algorithm regardless of its hyperparameter settings. Furthermore, we investigated the effect of the divergence of various GANs, and found that a Wasserstein GAN minimizing the Earth-Mover's distance works best in terms of improving the synthetic speech quality.
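
The training objective lends itself to a short sketch. Below is a minimal PyTorch illustration of the weighted-sum loss described above, assuming simple feed-forward networks; the architectures, feature dimensions, and the weight w are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    # Acoustic model: linguistic features -> speech parameters (dimensions assumed).
    acoustic_model = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, 60))
    # Discriminator: speech parameters -> probability of being natural.
    discriminator = nn.Sequential(nn.Linear(60, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

    mge_loss = nn.MSELoss()   # conventional minimum generation error term
    adv_loss = nn.BCELoss()   # adversarial term for deceiving the discriminator
    w = 1.0                   # hyperparameter weighting the adversarial term

    def generator_loss(linguistic_feats, natural_params):
        generated = acoustic_model(linguistic_feats)
        d_out = discriminator(generated)
        # The acoustic model is rewarded when the discriminator
        # labels its output as natural (target = 1).
        return mge_loss(generated, natural_params) + w * adv_loss(d_out, torch.ones_like(d_out))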

178 citations

Proceedings ArticleDOI
15 Apr 2018
TL;DR: Experimental results demonstrate that PPGs successfully improve both naturalness and speaker similarity of the converted speech, and that both speaker codes and d-vectors can be adopted in the VAE-based many-to-many non-parallel VC.
Abstract: This paper proposes novel frameworks for non-parallel voice conversion (VC) using variational autoencoders (VAEs). Although conventional VAE-based VC models can be trained using non-parallel speech corpora with given speaker representations, the phonetic content of the converted speech tends to vanish because of an over-regularization issue often observed in the latent variables of the VAEs. To overcome this issue, this paper proposes a VAE-based non-parallel VC conditioned not only on the speaker representations but also on the phonetic content of speech represented as phonetic posteriorgrams (PPGs). Since the phonetic content is given during training, we can expect the VC models to effectively learn speaker-independent latent features of speech. Building on this point, this paper also extends the conventional VAE-based non-parallel VC to many-to-many VC that can convert an arbitrary speaker's characteristics into those of another arbitrary speaker. We investigate two methods to estimate speaker representations for speakers not included in the speech corpora used for training the VC models: 1) adapting conventional speaker codes, and 2) using d-vectors as the speaker representations. Experimental results demonstrate that 1) PPGs successfully improve both naturalness and speaker similarity of the converted speech, and 2) both speaker codes and d-vectors can be adopted in the VAE-based many-to-many non-parallel VC.
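
The conditioning scheme admits a compact sketch: the decoder receives the latent variable together with a speaker representation (a speaker code or d-vector) and the frame-level PPG, so conversion amounts to decoding with a target speaker's representation. All dimensions and layer choices in this PyTorch sketch are assumptions for illustration, not the paper's architecture.

    import torch
    import torch.nn as nn

    class CondVAE(nn.Module):
        def __init__(self, feat_dim=60, z_dim=16, spk_dim=32, ppg_dim=144):
            super().__init__()
            self.enc = nn.Linear(feat_dim, 2 * z_dim)  # outputs mean and log-variance
            self.dec = nn.Linear(z_dim + spk_dim + ppg_dim, feat_dim)

        def forward(self, x, spk, ppg):
            mu, logvar = self.enc(x).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
            # Conversion: decode the same z with the target speaker's
            # representation instead of the source speaker's.
            return self.dec(torch.cat([z, spk, ppg], dim=-1)), mu, logvar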

114 citations

Posted Content
TL;DR: A novel Japanese speech corpus, named the "JSUT corpus," that is aimed at achieving end-to-end speech synthesis; it consists of 10 hours of reading-style speech data and its transcriptions and covers all of the main pronunciations of daily-use Japanese characters.
Abstract: Thanks to improvements in machine learning techniques, including deep learning, a free large-scale speech corpus that can be shared between academic institutions and commercial companies plays an important role. However, such a corpus for Japanese speech synthesis does not exist. In this paper, we designed a novel Japanese speech corpus, named the "JSUT corpus," that is aimed at achieving end-to-end speech synthesis. The corpus consists of 10 hours of reading-style speech data and its transcriptions and covers all of the main pronunciations of daily-use Japanese characters. We also describe how we designed and analyzed the corpus. The corpus is freely available online.

86 citations

Proceedings ArticleDOI
04 May 2014
TL;DR: The Modulation Spectrum (MS) of the speech parameter trajectory is introduced as a new feature to effectively capture the over-smoothing effect, and a postfilter is proposed based on the MS.
Abstract: In this paper, we propose a postfilter to compensate the modulation spectrum in HMM-based speech synthesis. In order to alleviate the over-smoothing effect, which is a main cause of quality degradation in HMM-based speech synthesis, it is necessary to consider features that can capture over-smoothing. Global Variance (GV) is one well-known example of such a feature, and the effectiveness of a parameter generation algorithm considering the GV has been confirmed. However, the quality gap between natural speech and synthetic speech is still large. In this paper, we introduce the Modulation Spectrum (MS) of a speech parameter trajectory as a new feature to effectively capture the over-smoothing effect, and we propose a postfilter based on the MS. The MS is represented as the power spectrum of the parameter trajectory. The generated speech parameter sequence is filtered to ensure that its MS has a pattern similar to that of natural speech. Experimental results show quality improvements when the proposed methods are applied to spectral and F0 components, compared with conventional methods considering the GV.
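
The core idea can be roughly illustrated in numpy: treat the MS as the power spectrum of a parameter trajectory, then rescale the trajectory's spectral magnitudes toward a natural-speech average. The interpolation weight k and the simple magnitude substitution below are assumptions for illustration, not the paper's actual filter design.

    import numpy as np

    def modulation_spectrum(traj, n_fft=64):
        # MS = power spectrum of the (1-D) parameter trajectory.
        return np.abs(np.fft.rfft(traj, n=n_fft)) ** 2

    def ms_postfilter(generated, natural_ms_mean, n_fft=64, k=0.85):
        # Trajectory assumed to be at most n_fft frames long; natural_ms_mean
        # is modulation_spectrum() averaged over natural utterances.
        spec = np.fft.rfft(generated, n=n_fft)
        gen_ms = np.abs(spec) ** 2 + 1e-12
        # Interpolate the log MS toward the natural average (k controls emphasis).
        target_ms = np.exp((1 - k) * np.log(gen_ms) + k * np.log(natural_ms_mean + 1e-12))
        spec *= np.sqrt(target_ms / gen_ms)  # keep phase, rescale magnitude
        return np.fft.irfft(spec, n=n_fft)[: len(generated)]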

72 citations

Journal ArticleDOI
TL;DR: This paper proposes postfilters to modify the MS utterance by utterance or segment by segment to make the MS of synthetic speech close to that of natural speech, applicable to various synthesizers based on statistical parametric speech synthesis.
Abstract: This paper presents novel approaches based on the modulation spectrum (MS) for high-quality statistical parametric speech synthesis, including text-to-speech (TTS) and voice conversion (VC). Although statistical parametric speech synthesis offers various advantages over concatenative speech synthesis, the synthetic speech quality is still not as good as that of concatenative speech synthesis or the quality of natural speech. One of the biggest issues causing the quality degradation is the over-smoothing effect often observed in the generated speech parameter trajectories. Global variance (GV) is known as a feature well correlated with the over-smoothing effect, and the effectiveness of keeping the GV of the generated speech parameter trajectories similar to that of natural speech has been confirmed. However, the quality gap between natural speech and synthetic speech is still large. In this paper, we propose using the MS of the generated speech parameter trajectories as a new feature to effectively quantify the over-smoothing effect. Moreover, we propose postfilters that modify the MS utterance by utterance or segment by segment to make the MS of synthetic speech close to that of natural speech. The proposed postfilters are applicable to various synthesizers based on statistical parametric speech synthesis. We first evaluate the proposed method in the framework of hidden Markov model (HMM)-based TTS, examining its properties from different perspectives. Furthermore, the effectiveness of the proposed postfilters is also evaluated in Gaussian mixture model (GMM)-based VC and classification and regression tree (CART)-based TTS (a.k.a. CLUSTERGEN). The experimental results demonstrate that 1) the proposed utterance-level postfilter achieves quality comparable to the conventional generation algorithm considering the GV, and yields significant further improvements when applied on top of the GV-based generation algorithm in HMM-based TTS, 2) the proposed segment-level postfilter, capable of achieving low-delay synthesis, also yields significant improvements in synthetic speech quality, and 3) the proposed postfilters are effective not only in HMM-based TTS but also in GMM-based VC and CLUSTERGEN.
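
The segment-level, low-delay variant described here can be pictured as applying the same filtering to short overlapping windows and recombining them by overlap-add, so latency is bounded by one segment. The segment length, hop size, and reuse of the hypothetical ms_postfilter() from the earlier sketch are all assumptions.

    import numpy as np

    def segment_ms_postfilter(traj, natural_ms_mean, seg=64, hop=32):
        # traj: 1-D float array, assumed at least seg frames long.
        # ms_postfilter() is the illustrative function from the sketch above;
        # natural_ms_mean must be computed with n_fft equal to seg.
        out = np.zeros_like(traj)
        norm = np.zeros_like(traj)
        win = np.hanning(seg)
        for start in range(0, len(traj) - seg + 1, hop):
            piece = traj[start:start + seg] * win
            out[start:start + seg] += ms_postfilter(piece, natural_ms_mean, n_fft=seg)
            norm[start:start + seg] += win
        return out / np.maximum(norm, 1e-8)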

67 citations


Cited by
Posted Content
TL;DR: This paper proposes WaveNet, a deep neural network for generating raw audio waveforms; the model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones.
Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
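
The autoregressive structure can be illustrated with a toy stack of causal dilated convolutions, in which each output sample depends only on past samples; the PyTorch sketch below omits WaveNet's gated activations and skip connections, and all sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class CausalDilatedConv(nn.Module):
        def __init__(self, channels, dilation):
            super().__init__()
            self.pad = dilation  # left-pad so the convolution never sees the future
            self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

        def forward(self, x):
            return self.conv(nn.functional.pad(x, (self.pad, 0)))

    class TinyWaveNet(nn.Module):
        def __init__(self, channels=32, n_classes=256):
            super().__init__()
            self.inp = nn.Conv1d(1, channels, 1)
            # Doubling dilations give an exponentially growing receptive field.
            self.stack = nn.ModuleList(CausalDilatedConv(channels, 2 ** i) for i in range(8))
            self.out = nn.Conv1d(channels, n_classes, 1)  # per-sample categorical logits

        def forward(self, x):  # x: (batch, 1, time)
            h = self.inp(x)
            for layer in self.stack:
                h = torch.relu(layer(h)) + h  # residual connection
            return self.out(h)  # logits over e.g. 256 mu-law levels per sample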

4,002 citations


Journal ArticleDOI
S. Biyiksiz
01 Mar 1985
TL;DR: This book by Elliott and Rao is a valuable contribution to the general areas of signal processing and communications and can be used for a graduate level course in perhaps two ways.
Abstract: There has been a great deal of material in the area of discrete-time transforms published in recent years. This book does an excellent job of presenting important aspects of such material in a clear manner. The book has 11 chapters and a very useful appendix. Seven of these chapters are essentially devoted to the Fourier series/transform, the discrete Fourier transform, the fast Fourier transform (FFT), and applications of the FFT in the area of spectral estimation. Chapters 8 through 10 deal with many other discrete-time transforms and algorithms to compute them. Of these transforms, the Karhunen-Loève, the discrete cosine, and the Walsh-Hadamard transforms are perhaps the most well known. A lucid discussion of number theoretic transforms is presented in Chapter 11. This reviewer feels that the authors have done a fine job of compiling the pertinent material and presenting it in a concise and clear manner. There are a number of problems at the end of each chapter, an appreciable number of which are challenging. The authors have included a comprehensive set of references at the end of the book. In brief, this book is a valuable contribution to the general areas of signal processing and communications. It can be used for a graduate-level course in perhaps two ways. One would be to cover the first seven chapters in great detail. The other would be to cover the whole book by focusing on different topics in a selective manner. This book by Elliott and Rao is extremely useful to researchers and engineers who are working in the areas of signal processing and communications. It is also an excellent reference book, and hence a valuable addition to one's library.

843 citations