Topic: Spectrogram
About: Spectrogram is a research topic. Over its lifetime, 5,813 publications have been published within this topic, receiving 81,547 citations.
[Chart: papers published on a yearly basis]
Papers
[...]
TL;DR: Describes Tacotron 2, a neural network architecture for speech synthesis directly from text: a recurrent sequence-to-sequence feature prediction network maps character embeddings to mel-scale spectrograms, and a modified WaveNet model acts as a vocoder to synthesize time-domain waveforms from those spectrograms.
Abstract: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and $F_{0}$ features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
1,094 citations
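The mel-scale spectrogram this system predicts, and on which its WaveNet vocoder is conditioned, is an STFT magnitude mapped through a mel filterbank and log-compressed. Below is a minimal sketch using librosa; the frame and filterbank parameters are illustrative assumptions, not the paper's published configuration.

```python
# Minimal mel-spectrogram sketch (librosa). Parameter values are
# illustrative assumptions, not Tacotron 2's published configuration.
import numpy as np
import librosa

def mel_spectrogram(wav_path, n_fft=1024, hop_length=256, n_mels=80):
    y, sr = librosa.load(wav_path, sr=None)       # waveform at native rate
    # Power STFT -> mel filterbank -> log compression
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=n_mels, power=2.0)
    return librosa.power_to_db(S, ref=np.max)     # shape: (n_mels, n_frames)
```

A model like Tacotron 2 predicts frames of this (n_mels, n_frames) matrix from text, and the vocoder inverts them back to a waveform.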
[...]
TL;DR: The reassignment method, first applied by Kodera, Gendrin, and de Villedary (1976) to the spectrogram, is generalized to any bilinear time-frequency or time-scale distribution.
Abstract: In this paper, the use of the reassignment method, first applied by Kodera, Gendrin, and de Villedary (1976) to the spectrogram, is generalized to any bilinear time-frequency or time-scale distribution. This method creates a modified version of a representation by moving its values away from where they are computed, so as to produce a better localization of the signal components. We first propose a new formulation of this method, followed by a thorough theoretical study of its characteristics. Its practical use for a large variety of known time-frequency and time-scale distributions is then addressed. Finally, some experimental results are reported to demonstrate the performance of this method.
1,082 citations
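Reassignment sharpens a spectrogram by moving each value from its bin center to the local center of gravity of the signal energy, improving component localization. librosa implements the spectrogram case; a minimal sketch on a synthetic chirp (parameter values are illustrative):

```python
# Sketch: time-frequency reassignment of a spectrogram via librosa.
# The paper generalizes this idea to arbitrary bilinear time-frequency
# and time-scale distributions; librosa covers the spectrogram case.
import numpy as np
import librosa

sr = 22050
t = np.arange(int(2.0 * sr)) / sr
y = np.sin(2 * np.pi * (220.0 + 220.0 * t) * t)   # linear chirp test signal

# For each STFT bin: reassigned frequency, reassigned time, magnitude
freqs, times, mags = librosa.reassigned_spectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512)
# Scatter-plotting (times, freqs) weighted by mags shows the sharpened
# representation versus the blurred fixed STFT grid.
```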
[...]
TL;DR: An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented; it achieves better separation quality than previous algorithms.
Abstract: An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented. The algorithm is based on factorizing the magnitude spectrogram of an input signal into a sum of components, each of which has a fixed magnitude spectrum and a time-varying gain. Each sound source, in turn, is modeled as a sum of one or more components. The parameters of the components are estimated by minimizing the reconstruction error between the input spectrogram and the model, while restricting the component spectrograms to be nonnegative and favoring components whose gains are slowly varying and sparse. Temporal continuity is favored by using a cost term which is the sum of squared differences between the gains in adjacent frames, and sparseness is favored by penalizing nonzero gains. The proposed iterative estimation algorithm is initialized with random values, and the gains and the spectra are then alternately updated using multiplicative update rules until the values converge. Simulation experiments were carried out using generated mixtures of pitched musical instrument samples and drum sounds. The performance of the proposed method was compared with independent subspace analysis and basic nonnegative matrix factorization, which are based on the same linear model. According to these simulations, the proposed method achieves better separation quality than the previous algorithms. In particular, the temporal continuity criterion improved the detection of pitched musical sounds. The sparseness criterion did not produce significant improvements.
1,038 citations
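The linear model underneath this method factorizes the magnitude spectrogram V (frequencies × frames) as V ≈ WH, with nonnegative component spectra in W and time-varying gains in H. A minimal sketch of the baseline multiplicative-update NMF in NumPy follows; the paper's actual cost adds the temporal-continuity and sparseness penalties on the gains, which this sketch omits.

```python
# Baseline NMF of a magnitude spectrogram with multiplicative updates
# for the squared-error cost. The paper's method adds temporal-continuity
# and sparseness terms to the cost function; those are omitted here.
import numpy as np

def nmf(V, n_components=10, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_components))    # fixed component spectra
    H = rng.random((n_components, n_frames))  # time-varying gains
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # eps guards against /0
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H                               # V is approximated by W @ H

# Each source estimate is a sum of its components' rank-one spectrograms,
# e.g. W[:, [k]] @ H[[k], :] for component k.
```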
[...]
TL;DR: The authors introduce the co-occurrence smoothing algorithm, which enables accurate recognition even with very limited training data; since the results were evaluated on a standard database, they can serve as benchmarks for future systems.
Abstract: Hidden Markov modeling is extended to speaker-independent phone recognition. Using multiple codebooks of various linear-predictive-coding (LPC) parameters and discrete hidden Markov models (HMMs), the authors obtain a speaker-independent phone recognition accuracy of 58.8-73.8% on the TIMIT database, depending on the type of acoustic and language models used. In comparison, the performance of expert spectrogram readers is only 69% without use of higher-level knowledge. The authors introduce the co-occurrence smoothing algorithm, which enables accurate recognition even with very limited training data. Since the results were evaluated on a standard database, they can be used as benchmarks to evaluate future systems.
867 citations
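A rough sketch of the smoothing step under one plausible reading of the idea: a state's sparse output distribution over VQ codewords is spread onto codewords that frequently co-occur with the observed ones. The co-occurrence matrix C and its estimation are assumed here for illustration, not taken from the paper.

```python
# Rough sketch of co-occurrence smoothing for discrete-HMM output
# distributions. The form of C and its estimation are assumptions for
# illustration, not the paper's exact procedure.
import numpy as np

def smooth_outputs(B, C):
    """B: (n_states, n_codewords) sparse output probabilities.
    C: (n_codewords, n_codewords), row-stochastic; C[i, j] approximates
       P(codeword j | codeword i) from observed co-occurrences.
    Returns renormalized, smoothed output distributions."""
    B_smooth = B @ C   # redistribute mass toward co-occurring codewords
    return B_smooth / B_smooth.sum(axis=1, keepdims=True)
```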
[...]
TL;DR: In this paper, a deep network is trained to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures.
Abstract: We address the problem of "cocktail-party" source separation in a deep learning framework called deep clustering. Previous deep network approaches to separation have shown promising performance in scenarios with a fixed number of sources, each belonging to a distinct signal class, such as speech and noise. However, for arbitrary source classes and number, "class-based" methods are not suitable. Instead, we train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pair-wise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB. More dramatically, the same model does surprisingly well with three-speaker mixtures.
854 citations
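The decoding step described above is simple to sketch: cluster the per-bin embeddings with K-means and turn the cluster labels into binary time-frequency masks. In the sketch below, embed_net stands in for the trained embedding network and is an assumption, not part of the paper's released code.

```python
# Sketch of the deep-clustering decode step: K-means over per-bin
# embeddings -> binary masks -> one masked spectrogram per source.
# `embed_net` is a hypothetical stand-in for the trained network.
import numpy as np
from sklearn.cluster import KMeans

def decode_sources(mix_spec, embed_net, n_sources=2):
    n_frames, n_bins = mix_spec.shape
    V = embed_net(mix_spec)        # assumed shape: (n_frames * n_bins, D)
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(V)
    masks = [(labels == k).reshape(n_frames, n_bins) for k in range(n_sources)]
    return [mix_spec * m for m in masks]
```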