Journal Article

INCA Algorithm for Training Voice Conversion Systems From Nonparallel Corpora

Abstract
Most existing voice conversion systems, particularly those based on Gaussian mixture models, require a set of paired acoustic vectors from the source and target speakers to learn their corresponding transformation function. The alignment of phonetically equivalent source and target vectors is not problematic when the training corpus is parallel, which means that both speakers utter the same training sentences. However, in some practical situations, such as cross-lingual voice conversion, it is not possible to obtain such parallel utterances. With an aim towards increasing the versatility of current voice conversion systems, this paper proposes a new iterative alignment method that allows pairing phonetically equivalent acoustic vectors from nonparallel utterances from different speakers, even under cross-lingual conditions. This method is based on existing voice conversion techniques, and it does not require any phonetic or linguistic information. Subjective evaluation experiments show that the performance of the resulting voice conversion system is very similar to that of an equivalent system trained on a parallel corpus.
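The iterative alignment idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says the method builds on existing voice conversion techniques, so this sketch assumes a simple per-dimension affine map as a stand-in for a GMM-based conversion function, and it assumes a nearest-neighbour search in both directions as the pairing rule. All function and parameter names (`nearest`, `fit_diag_affine`, `iterative_alignment`, `n_iter`) are hypothetical.

```python
def nearest(query, pool):
    """Return the index of the vector in pool closest to query (squared Euclidean)."""
    return min(
        range(len(pool)),
        key=lambda j: sum((q - p) ** 2 for q, p in zip(query, pool[j])),
    )


def fit_diag_affine(pairs):
    """Least-squares fit of y ~ a*x + b, independently for each dimension.

    Stand-in (assumption) for training a real conversion function such as a GMM.
    """
    n = len(pairs)
    dim = len(pairs[0][0])
    params = []
    for d in range(dim):
        xs = [x[d] for x, _ in pairs]
        ys = [y[d] for _, y in pairs]
        mx, my = sum(xs) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        a = cov / var if var > 1e-12 else 1.0  # fall back to identity slope
        params.append((a, my - a * mx))
    return params


def apply_transform(params, x):
    """Apply the per-dimension affine map to one vector."""
    return [a * v + b for (a, b), v in zip(params, x)]


def iterative_alignment(source, target, n_iter=10):
    """Alternate between pairing vectors and re-estimating the conversion.

    source, target: lists of equal-dimension acoustic vectors taken from
    nonparallel utterances. Returns the final transform parameters and the
    final list of (source_index, target_index) pairs.
    """
    converted = [list(x) for x in source]  # start from the unconverted source
    params = [(1.0, 0.0)] * len(source[0])
    index_pairs = []
    for _ in range(n_iter):
        # Pairing step: nearest neighbours in both directions (assumption).
        s2t = [(i, nearest(c, target)) for i, c in enumerate(converted)]
        t2s = [(nearest(t, converted), j) for j, t in enumerate(target)]
        index_pairs = sorted(set(s2t + t2s))
        # Training step: fit the conversion on the ORIGINAL source vectors.
        pairs = [(source[i], target[j]) for i, j in index_pairs]
        params = fit_diag_affine(pairs)
        # Conversion step: move the source vectors closer to the target space.
        converted = [apply_transform(params, x) for x in source]
    return params, index_pairs
```

In this sketch the pairing is always recomputed from the converted source vectors, while the transform is always refit on the original ones, so each pass can correct pairs that were wrong in the previous pass. A fixed iteration count is used for simplicity; a practical system would more likely stop when the pairing or the alignment distance stops changing.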


Citations
Proceedings Article

Phonetic posteriorgrams for many-to-one voice conversion without parallel data training

TL;DR: This paper proposes a novel approach to voice conversion with non-parallel training data to bridge between speakers by means of Phonetic PosteriorGrams obtained from a speaker-independent automatic speech recognition system.
Proceedings Article

Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks

TL;DR: In this article, a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) is proposed for voice conversion from unaligned speech corpora.
Proceedings Article

Voice conversion from non-parallel corpora using variational auto-encoder

TL;DR: In this article, a variational auto-encoder framework for spectral conversion with unaligned corpora is proposed; it requires neither parallel corpora nor phonetic alignments to train a spectral conversion system.
Journal Article

An overview of voice conversion systems

TL;DR: This article provides an overview of real-world applications of voice conversion (VC) systems, extensively studies existing systems proposed in the literature, and discusses remaining challenges.
Journal Article

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

TL;DR: This article provides a comprehensive overview of the state-of-the-art of voice conversion techniques and their performance evaluation methods from the statistical approaches to deep learning, and discusses their promise and limitations.
References
Book

Fundamentals of speech recognition

TL;DR: This book presents a modelling framework for speech recognition that automates the labor-intensive, time-consuming, and therefore expensive process of manually modeling speech.
Proceedings Article

Voice conversion through vector quantization

TL;DR: The authors propose a new voice conversion technique through vector quantization and spectrum mapping which makes it possible to precisely control voice individuality.
Journal Article

Voice transformation using PSOLA technique

TL;DR: A new system for voice conversion is described that combines a PSOLA (Pitch Synchronous Overlap and Add)-derived synthesizer and a module for spectral transformation, which produces a satisfyingly natural “transformed” voice.

High-resolution voice transformation

TL;DR: A new type of speech corpus is proposed that is especially suited to voice transformation (VT) research and development because it consists of naturally time-aligned sentences, which yields high-quality speech samples that differ only in their segmental properties, the focus of transformation.