Journal Article

VoCo: text-based insertion and replacement in audio narration

Abstract
Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems allow the editor to work in a text transcript of the narration and perform select, cut, copy, and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. Our approach is to use a text-to-speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor's own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.

