
Spectrogram

About: Spectrogram is a research topic. Over its lifetime, 5,813 publications have been published within this topic, receiving 81,547 citations.


Papers
Posted Content
Ron Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma
TL;DR: A sequence-to-sequence neural network which directly generates speech waveforms from text inputs, extending the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop, enabling parallel training and synthesis.
Abstract: We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within each block are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding blocks. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features or additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system, with significantly improved generation speed.
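For intuition, here is a minimal, hypothetical sketch of the block-autoregressive decoding loop described above: each fixed-length block of waveform samples is produced jointly by inverting a normalizing flow whose parameters are conditioned on previously generated blocks. The toy affine flow and all names below are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

BLOCK_LEN = 256   # each block holds hundreds of waveform samples
NUM_BLOCKS = 8
rng = np.random.default_rng(0)


def toy_flow_inverse(z, context):
    """Invert a toy affine flow: x = scale(context) * z + shift(context).

    A real model would use a deep conditional normalizing flow; this
    stand-in only shows that the whole block is transformed in parallel.
    """
    ctx_mean = context[-BLOCK_LEN:].mean() if context.size else 0.0
    scale = 0.1 + 0.05 * np.tanh(ctx_mean)
    shift = 0.01 * ctx_mean
    return scale * z + shift


def decode(num_blocks=NUM_BLOCKS):
    waveform = np.zeros(0)
    for _ in range(num_blocks):
        z = rng.standard_normal(BLOCK_LEN)             # latent for one block
        block = toy_flow_inverse(z, waveform)          # parallel within a block
        waveform = np.concatenate([waveform, block])   # autoregressive across blocks
    return waveform


print(decode().shape)  # (NUM_BLOCKS * BLOCK_LEN,)
```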

40 citations

Journal ArticleDOI
TL;DR: A novel method is proposed for diagnosing bearing faults and their degradation level under variable shaft speed, achieving very high accuracy and robustness even in noisy environments.
Abstract: Predicting bearing faults is an essential task in machine health monitoring because bearings are vital components of rotary machines, especially heavy motor machines. Moreover, indicating the degradation level of bearings will help factories plan maintenance schedules. With advancements in the extraction of useful information from vibration signals, diagnosis of motor failures by maintenance engineers can be gradually replaced by an automatic detection process. In particular, state-of-the-art methods using deep learning have contributed significantly to automatic fault diagnosis. This paper proposes a novel method for diagnosing bearing faults and their degradation level under variable shaft speed. In the proposed method, vibration signals are represented as spectrograms through Short-Time Fourier Transform (STFT) preprocessing so that deep learning methods can be applied. Then, feature extraction and health status classification are performed by a convolutional neural network (CNN), VGG16. According to our experiments, the proposed method achieves very high accuracy and robustness for bearing fault diagnosis even in noisy environments.
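A hedged sketch of the described pipeline, assuming SciPy for the STFT and a recent torchvision for the VGG16 backbone; the sampling rate, class count, and data are illustrative placeholders, and the training loop is omitted.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import stft
from torchvision import models

NUM_CLASSES = 10  # illustrative: fault types x degradation levels


def vibration_to_spectrogram(signal, fs=12_000, nperseg=256):
    """Turn a 1-D vibration signal into a 3-channel log-magnitude spectrogram."""
    _, _, Z = stft(signal, fs=fs, nperseg=nperseg)
    spec = np.log1p(np.abs(Z)).astype(np.float32)
    # Replicate to 3 channels so an ImageNet-style VGG16 accepts it.
    return torch.from_numpy(spec).unsqueeze(0).repeat(3, 1, 1)


# VGG16 backbone with the last layer replaced for health-status classes.
# (Use weights=models.VGG16_Weights.DEFAULT for ImageNet pretraining.)
model = models.vgg16(weights=None)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)

# Forward pass on a synthetic vibration segment (1 s at 12 kHz).
segment = np.random.randn(12_000).astype(np.float32)
x = vibration_to_spectrogram(segment).unsqueeze(0)   # (1, 3, F, T)
x = nn.functional.interpolate(x, size=(224, 224))    # resize to VGG16 input
print(model(x).shape)                                # torch.Size([1, NUM_CLASSES])
```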

40 citations

Proceedings ArticleDOI
14 Aug 2009
TL;DR: The presented algorithm was tested using actual stressful speech utterances from the SUSAS (Speech Under Simulated and Actual Stress) database at the vowel level, and the results indicated that the proposed method can be applied to voiced speech under speech-independent conditions.
Abstract: This paper presents a new system for automatic stress detection in speech. In the feature extraction process, speech spectrograms were used as the primary features. Sigma-pi neuron cells were then employed to derive the secondary features. The analysis was performed on three alternative sets of analytical frequency bands: critical bands, Bark scale bands, and equivalent rectangular bandwidth (ERB) scale bands. The presented algorithm was tested using actual stressful speech utterances from the SUSAS (Speech Under Simulated and Actual Stress) database at the vowel level. The automatic stress-level classification was implemented using Gaussian mixture model (GMM) and k-nearest neighbor (KNN) classifiers. The strongest effect on the classification results was observed when selecting the type of frequency bands. The ERB scale provided the highest classification results, ranging from 67.84% to 73.76%. The classification results did not differ between data sets containing specific types of vowels and data sets containing mixtures of vowels. This indicates that the proposed method can be applied to voiced speech under speech-independent conditions.
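A rough illustration of the classification stage under stated assumptions: spectrogram energies are pooled into coarse, equal-width frequency bands (standing in for the paper's critical/Bark/ERB bands and sigma-pi secondary features), then classified with scikit-learn's KNN and per-class GMMs. The data here are synthetic.

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsClassifier

FS = 8_000


def band_energy_features(signal, n_bands=16):
    """Pool spectrogram energy into coarse frequency bands (illustrative only)."""
    _, _, S = spectrogram(signal, fs=FS, nperseg=256)
    bands = np.array_split(S, n_bands, axis=0)       # split along frequency
    return np.array([b.mean() for b in bands])       # one energy per band


# Synthetic stand-in data: two "stress levels", 40 utterances each.
rng = np.random.default_rng(0)
X = np.vstack([band_energy_features(rng.standard_normal(FS) * (1.0 + lab))
               for lab in (0, 1) for _ in range(40)])
y = np.repeat([0, 1], 40)

# KNN classifier, as in the paper's comparison.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# One GMM per class; predict the class whose GMM gives the higher likelihood.
gmms = [GaussianMixture(n_components=2, random_state=0).fit(X[y == c]) for c in (0, 1)]
gmm_pred = np.stack([g.score_samples(X) for g in gmms], axis=1).argmax(axis=1)
print("KNN acc:", knn.score(X, y), "GMM acc:", (gmm_pred == y).mean())
```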

39 citations

Posted Content
TL;DR: In this paper, a cycle-consistent generative adversarial network (CycleGAN) is proposed for unsupervised speech domain adaptation, which employs multiple independent discriminators on the power spectrogram, each in charge of different frequency bands.
Abstract: Domain adaptation plays an important role for speech recognition models, in particular for domains that have low resources. We propose a novel generative model based on the cycle-consistent generative adversarial network (CycleGAN) for unsupervised non-parallel speech domain adaptation. The proposed model employs multiple independent discriminators on the power spectrogram, each in charge of different frequency bands. As a result, we have 1) better discriminators that focus on fine-grained details of the frequency features, and 2) a generator that is capable of generating more realistic domain-adapted spectrograms. We demonstrate the effectiveness of our method on speech recognition with gender adaptation, where the model only has access to supervised data from one gender during training but is evaluated on the other at test time. Our model achieves an average relative improvement of 7.41% in phoneme error rate and 11.10% in word error rate over the baseline, on the TIMIT and WSJ datasets, respectively. Qualitatively, our model also generates more natural-sounding speech when conditioned on data from the other domain.
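A hedged PyTorch sketch of the multi-discriminator idea: the power spectrogram is split along the frequency axis and each band is judged by its own small discriminator. The generator, cycle-consistency losses, and network sizes are omitted or illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

N_BANDS = 4  # illustrative number of frequency bands


def make_band_discriminator():
    """A small convolutional discriminator for one frequency band."""
    return nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
    )


class BandDiscriminators(nn.Module):
    """Independent discriminators, each judging a different frequency band."""

    def __init__(self, n_bands=N_BANDS):
        super().__init__()
        self.n_bands = n_bands
        self.discs = nn.ModuleList(make_band_discriminator() for _ in range(n_bands))

    def forward(self, spec):                              # spec: (B, 1, F, T)
        bands = torch.chunk(spec, self.n_bands, dim=2)    # split along frequency
        return [d(b) for d, b in zip(self.discs, bands)]  # one realism score per band


# Example: score a batch of (generated) power spectrograms.
scores = BandDiscriminators()(torch.rand(2, 1, 128, 200))
print([s.shape for s in scores])  # N_BANDS x torch.Size([2, 1])
```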

39 citations

Proceedings ArticleDOI
25 Mar 2012
TL;DR: New variants of the non-negative matrix factorization concept are introduced that incorporate music-specific constraints, exploiting the structural regularities of music spectrograms.
Abstract: Music spectrograms typically have many structural regularities that can be exploited to help solve the problem of decomposing a given spectrogram into distinct musically meaningful components. In this paper, we introduce new variants of the non-negative matrix factorization concept that incorporate music-specific constraints.
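For reference, a minimal sketch of plain (unconstrained) NMF on a magnitude spectrogram using scikit-learn; the paper's music-specific constraints (for example, constraints on the structure of the spectral templates or their activations) are not reproduced here.

```python
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

# Synthetic stand-in: two tones with different on/off patterns, mixed together.
fs = 16_000
t = np.arange(2 * fs) / fs
mix = np.sin(2 * np.pi * 440 * t) * (t < 1.0) + 0.5 * np.sin(2 * np.pi * 660 * t) * (t > 0.5)

# Magnitude spectrogram V (frequency x time) is the matrix being factorized.
_, _, Z = stft(mix, fs=fs, nperseg=1024)
V = np.abs(Z)

# Plain NMF: V ≈ W @ H, where W holds spectral templates and H their activations.
# The paper's variants add music-specific constraints on W and/or H.
model = NMF(n_components=2, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(V)   # (freq_bins, n_components)
H = model.components_        # (n_components, time_frames)
print(W.shape, H.shape)
```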

39 citations


Network Information
Related Topics (5)
Deep learning: 79.8K papers, 2.1M citations (79% related)
Convolutional neural network: 74.7K papers, 2M citations (78% related)
Feature extraction: 111.8K papers, 2.1M citations (77% related)
Wavelet: 78K papers, 1.3M citations (76% related)
Support vector machine: 73.6K papers, 1.7M citations (75% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2024    1
2023    627
2022    1,396
2021    488
2020    595
2019    593