arXiv: Audio and Speech Processing

About: arXiv: Audio and Speech Processing is an academic journal. It publishes mainly in the areas of speech enhancement and word error rate. Over its lifetime, it has published 2,942 papers, which have received 13,076 citations.

Topics: Speech enhancement, Word error rate, Artificial neural network, Speech synthesis, Computer science

Papers

Posted Content•

Conformer: Convolution-augmented Transformer for Speech Recognition

Anmol Gulati¹, James Qin¹, Chung-Cheng Chiu¹, Niki Parmar¹, Yu Zhang¹, Jiahui Yu², Wei Han¹, Shibo Wang, Zhengdong Zhang¹, Yonghui Wu¹, Ruoming Pang¹

Institutions: Google¹, Adobe Systems²

16 May 2020-arXiv: Audio and Speech Processing

TL;DR: This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies.

Abstract: Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way. To this regard, we propose the convolution-augmented transformer for speech recognition, named Conformer. Conformer significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother. We also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
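For readers unfamiliar with the metric quoted above, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a figure such as 2.1% means roughly two word-level errors per hundred reference words. Real evaluations usually normalize text (case, punctuation) before scoring, which this sketch omits.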

1,270 citations

Posted Content•

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

Yi Ren¹, Chenxu Hu¹, Xu Tan¹, Tao Qin², Sheng Zhao², Zhou Zhao², Tie-Yan Liu²

Institutions: Zhejiang University¹, Microsoft²

08 Jun 2020-arXiv: Audio and Speech Processing

TL;DR: FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with ground-truth target instead of the simplified output from teacher, and introducing more variation information of speech as conditional inputs.

Abstract: Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch and energy from speech waveform and directly take them as conditional inputs in training and use predicted values in inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of fully end-to-end inference. Experimental results show that 1) FastSpeech 2 achieves a 3x training speed-up over FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech 2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even surpass autoregressive models. Audio samples are available at this https URL.

529 citations

Posted Content•

DiffWave: A Versatile Diffusion Model for Audio Synthesis

Zhifeng Kong¹, Wei Ping², Jiaji Huang³, Kexin Zhao³, Bryan Catanzaro²

Institutions: University of California, San Diego¹, Nvidia², Baidu³

21 Sep 2020-arXiv: Audio and Speech Processing

TL;DR: DiffWave significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.

Abstract: In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.

459 citations

Posted Content•

MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment

Hao-Wen Dong¹, Wen-Yi Hsiao¹, Li-Chia Yang¹, Yi-Hsuan Yang¹

Institutions: Academia Sinica¹

19 Sep 2017-arXiv: Audio and Speech Processing

TL;DR: In this article, three models for symbolic multi-track music generation under the framework of generative adversarial networks (GANs) are proposed, referred to as the jamming model, the composer model and the hybrid model.

Abstract: Generating music has a few notable differences from generating images and videos. First, music is an art of time, necessitating a temporal model. Second, music is usually composed of multiple instruments/tracks with their own temporal dynamics, but collectively they unfold over time interdependently. Lastly, musical notes are often grouped into chords, arpeggios or melodies in polyphonic music, and thereby introducing a chronological ordering of notes is not naturally suitable. In this paper, we propose three models for symbolic multi-track music generation under the framework of generative adversarial networks (GANs). The three models, which differ in the underlying assumptions and accordingly the network architectures, are referred to as the jamming model, the composer model and the hybrid model. We trained the proposed models on a dataset of over one hundred thousand bars of rock music and applied them to generate piano-rolls of five tracks: bass, drums, guitar, piano and strings. A few intra-track and inter-track objective metrics are also proposed to evaluate the generative results, in addition to a subjective user study. We show that our models can generate coherent music of four bars right from scratch (i.e. without human inputs). We also extend our models to human-AI cooperative music generation: given a specific track composed by human, we can generate four additional tracks to accompany it. All code, the dataset and the rendered audio samples are available at this https URL .

319 citations

Posted Content•

Jukebox: A Generative Model for Music

Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever¹

Institutions: OpenAI¹

30 Apr 2020-arXiv: Audio and Speech Processing

TL;DR: It is shown that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes, and can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable.

Abstract: We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multi-scale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples at this https URL, along with model weights and code at this https URL

312 citations

Network Information

Related Journals (5)

arXiv: Computation and Language

24.8K papers, 481.5K citations

86% related

arXiv: Learning

45K papers, 837.1K citations

82% related

IEEE Signal Processing Magazine

2.1K papers, 288.9K citations

50K papers, 1.1M citations

80% related

arXiv: Machine Learning

12.4K papers, 260.6K citations

78% related

Performance

Metrics

Papers: 2,942

Citations: 29,657

No. of papers from the Journal in previous years
Year	Papers
2021	889
2020	1,244
2019	508
2018	259
2017	42