Topic

Spectrogram

About: Spectrogram is a research topic. Over its lifetime, 5,813 publications have been published within this topic, receiving 81,547 citations.


Papers
Posted Content
Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu
TL;DR: A non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder, called Parallel Tacotron, which is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware.
Abstract: Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to their efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve the naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly decreased inference time.

54 citations

Journal ArticleDOI
TL;DR: In this article, the authors introduce the Wigner distribution function (WDF) as a self-windowed complex spectrogram and suggest some methods for the optical generation of the WDF of two-dimensional signals.
Abstract: We introduce the Wigner distribution function (WDF) as a self-windowed complex spectrogram and suggest some methods for the optical generation of the WDF of two-dimensional signals. The resulting WDFs, since they are four-dimensional functions, are represented as sectional images displayed either in parallel or as temporal sequences. We give some experimental results for real-valued input signals obtained from different coherent-optical WDF processors.

54 citations
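
For orientation, the "self-windowed complex spectrogram" reading of the WDF in the abstract above can be made concrete with the standard textbook definitions. The notation (signal f, window g, position x, spatial frequencies ν and μ) is assumed here for illustration and is not taken from the paper, whose conventions may differ.

```latex
% Spectrogram: squared modulus of a Fourier transform of f seen through a separate window g
S_f(x,\nu) = \left| \int f(x')\, g^{*}(x'-x)\, e^{-i 2\pi \nu x'}\, dx' \right|^{2}

% Wigner distribution function (WDF) of a one-dimensional signal f
W_f(x,\nu) = \int f\!\left(x+\tfrac{x'}{2}\right) f^{*}\!\left(x-\tfrac{x'}{2}\right) e^{-i 2\pi \nu x'}\, dx'

% Substituting u = x + x'/2 shows the signal acting as its own (reflected) window
W_f(x,\nu) = 2\, e^{i 4\pi \nu x} \int f(u)\, f^{*}(2x-u)\, e^{-i 4\pi \nu u}\, du

% For a two-dimensional signal f(x,y) the WDF depends on four variables
W_f(x,y,\nu,\mu) = \iint f\!\left(x+\tfrac{x'}{2},\, y+\tfrac{y'}{2}\right)
                         f^{*}\!\left(x-\tfrac{x'}{2},\, y-\tfrac{y'}{2}\right)
                         e^{-i 2\pi (\nu x' + \mu y')}\, dx'\, dy'
```

The third line has the same form as the windowed transform inside the spectrogram, with g replaced by a reflected copy of f and without the squared modulus. The last line is why the WDF of a two-dimensional signal is four-dimensional and has to be displayed as sectional images.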

Posted Content
TL;DR: A deep neural network, trained jointly on different speakers, is able to extract individual speaker characteristics and gives promising results in reconstructing intelligible speech with superior word recognition accuracy.
Abstract: In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use an auditory spectrogram as the spectral representation of speech, together with its corresponding sound generation method, resulting in more natural-sounding reconstructed speech. Our proposed network consists of an autoencoder that extracts bottleneck features from the auditory spectrogram, which are then used as the target for our main lip reading network comprising CNN, LSTM and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with a 98% correlation and also improves the quality of reconstructed speech from the main lip reading network. Our model, trained jointly on different speakers, is able to extract individual speaker characteristics and gives promising results in reconstructing intelligible speech with superior word recognition accuracy.

54 citations

Journal ArticleDOI
TL;DR: It is demonstrated that two previously proposed methods for combining the information content from multiple spectrograms into a single, positive time-frequency function are optimal in a cross-entropy sense.
Abstract: We demonstrate that two previously proposed methods for combining the information content from multiple spectrograms into a single, positive time-frequency function are optimal in a cross-entropy sense. The goal in combining the spectrograms is to obtain an improved approximation of the joint time-frequency signal density by overcoming limitations of any single spectrogram. An example of each method is provided, and results are compared with spectrograms and a Cohen-Posch (1985) time-frequency density (TFD) of a nonstationary pulsed tone signal. The proposed combinations are effective and can be efficiently computed.

54 citations
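
The abstract above does not name the two combination methods, so the following is only a minimal sketch of one plausible way to merge several spectrograms into a single positive time-frequency function: an element-wise geometric mean. The choice of the geometric mean, the function name `combine_geometric_mean`, and the `eps` floor are assumptions made for illustration, not details confirmed by the paper. The spectrograms are assumed to share one time-frequency grid and to be given as nested lists of non-negative values.

```python
import math


def combine_geometric_mean(spectrograms, eps=1e-12):
    """Element-wise geometric mean of spectrograms on a shared grid.

    spectrograms: list of 2-D nested lists (frequency x time), all the same
    shape, with non-negative entries. eps is a hypothetical floor that keeps
    the logarithm finite for empty cells.
    """
    if not spectrograms:
        raise ValueError("need at least one spectrogram")
    count = len(spectrograms)
    rows = len(spectrograms[0])
    cols = len(spectrograms[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Average in the log domain, then map back: exp(mean(log s)).
            log_sum = sum(math.log(max(s[i][j], eps)) for s in spectrograms)
            combined[i][j] = math.exp(log_sum / count)
    return combined


# Toy example: two one-row "spectrograms" that agree only in the first cell.
wide_window = [[4.0, 1.0]]
narrow_window = [[1.0, 0.0]]
result = combine_geometric_mean([wide_window, narrow_window])
```

In this toy case the first cell keeps energy (the geometric mean of 4 and 1 is 2), while the second cell is driven close to zero because one spectrogram has no energy there. That behaviour is the intuition behind combining spectrograms computed with different windows: smearing that appears in only one of them is suppressed.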

Proceedings ArticleDOI
01 Apr 1981
TL;DR: A fast nonlinear time alignment method is presented, which is based on a preprocessing of the normalized speech spectrogram by means of a segmentation of the trace in the spectral feature space, which offers savings in computing time by a factor of 10 or more as compared to conventional dynamic programming.
Abstract: A fast nonlinear time alignment method is presented, which is based on a preprocessing of the normalized speech spectrogram by means of a segmentation of the trace in the spectral feature space. After such trace segmentation the patterns have a fixed format and allow for a subsequent classification with a distance measure which is obtained from conventional dynamic programming with extreme constraints. Since, due to the trace segmentation preprocessing, these extreme constraints can be applied without performance degradation, the described method offers savings in computing time by a factor of 10 or more as compared to conventional dynamic programming. As a side benefit, reference pattern memory savings by a factor of 3 or more are obtained.

54 citations
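
The trace segmentation step described in the abstract above can be illustrated with a minimal sketch: treat the sequence of spectral feature vectors as a path (the "trace") in feature space and resample it at points equally spaced along its arc length, so that every utterance yields a pattern of the same fixed size regardless of speaking rate. The function name `trace_segment`, the Euclidean distance, and the linear interpolation between frames are assumptions for illustration; the paper's exact procedure may differ.

```python
import math


def trace_segment(frames, num_points):
    """Resample a feature trajectory at num_points positions that are
    equally spaced along its arc length (Euclidean, an assumed choice)."""
    if num_points < 2 or len(frames) < 2:
        raise ValueError("need at least two frames and two output points")
    # Cumulative path length up to each frame.
    cum = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        cum.append(cum[-1] + math.dist(prev, cur))
    total = cum[-1]
    if total == 0.0:
        # Degenerate trace: all frames identical.
        return [list(frames[0]) for _ in range(num_points)]
    pattern = []
    seg = 0  # index of the frame pair currently being interpolated
    for k in range(num_points):
        target = total * k / (num_points - 1)
        # Advance to the segment whose end lies at or beyond the target.
        while seg < len(frames) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        w = 0.0 if span == 0.0 else min(1.0, (target - cum[seg]) / span)
        pattern.append(
            [a + w * (b - a) for a, b in zip(frames[seg], frames[seg + 1])]
        )
    return pattern


# The same path spoken "slowly" (repeated frames) and "quickly".
slow = [[0, 0], [0, 0], [1, 0], [2, 0], [2, 0], [2, 2]]
fast = [[0, 0], [2, 0], [2, 2]]
slow_pattern = trace_segment(slow, 5)
fast_pattern = trace_segment(fast, 5)
```

Both inputs follow the same path at different speeds, and both produce the pattern `[[0, 0], [1, 0], [2, 0], [2, 1], [2, 2]]` (as floats). Because stationary stretches add no arc length, most of the timing variation is removed before any comparison, which is consistent with the abstract's point that the subsequent dynamic programming can be run under extreme constraints.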


Network Information
Related Topics (5)
Deep learning
79.8K papers, 2.1M citations
79% related
Convolutional neural network
74.7K papers, 2M citations
78% related
Feature extraction
111.8K papers, 2.1M citations
77% related
Wavelet
78K papers, 1.3M citations
76% related
Support vector machine
73.6K papers, 1.7M citations
75% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2024    1
2023    627
2022    1,396
2021    488
2020    595
2019    593