Topic

Spectrogram

About: Spectrogram is a research topic. Over its lifetime, 5,813 publications have been published within this topic, receiving 81,547 citations.

Papers

Journal Article

Speaker-independent phone recognition using hidden Markov models

Kai-Fu Lee¹, H.-W. Hon¹

Carnegie Mellon University¹

01 Nov 1989 - IEEE Transactions on Acoustics, Speech, and Signal Processing

TL;DR: The authors introduce the co-occurrence smoothing algorithm, which enables accurate recognition even with very limited training data. Because the results were evaluated on a standard database, they can be used as benchmarks to evaluate future systems.

Abstract: Hidden Markov modeling is extended to speaker-independent phone recognition. Using multiple codebooks of various linear-predictive-coding (LPC) parameters and discrete hidden Markov models (HMMs), the authors obtain a speaker-independent phone recognition accuracy of 58.8-73.8% on the TIMIT database, depending on the type of acoustic and language models used. In comparison, the performance of expert spectrogram readers is only 69% without use of higher level knowledge. The authors introduce the co-occurrence smoothing algorithm, which enables accurate recognition even with very limited training data. Since the results were evaluated on a standard database, they can be used as benchmarks to evaluate future systems.

895 citations

Posted Content

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Jonathan Shen, Ruoming Pang, Ron Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu

16 Dec 2017 - arXiv: Computation and Language

TL;DR: Tacotron 2 uses a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms.

Abstract: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.

733 citations
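
To make the mel-scale spectrogram mentioned in the abstract above concrete, here is a minimal NumPy sketch of how such a representation can be computed from a waveform. The frame length, hop size, number of mel bands, the Hann window, and the 2595 * log10(1 + f / 700) mel formula are all assumptions chosen for illustration; they are not the settings reported for Tacotron 2, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def stft_magnitude(x, n_fft=512, hop=128):
    # Slice the signal into overlapping Hann-windowed frames and take the FFT.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft // 2 + 1, n_frames)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(x, sr, n_fft=512, hop=128, n_mels=40):
    mag = stft_magnitude(x, n_fft, hop)
    mel = mel_filterbank(sr, n_fft, n_mels) @ mag
    return np.log(np.maximum(mel, 1e-5))  # floor avoids log(0)

# Toy input: one second of a 440 Hz sine at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2.0 * np.pi * 440.0 * t)
mel_spec = log_mel_spectrogram(tone, sr)  # (n_mels, n_frames)
```

The result is a small matrix with one row per mel band and one column per frame, which is the kind of compact acoustic representation a vocoder can be conditioned on.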

Journal Article

The local mean decomposition and its application to EEG perception data.

Jonathan S Smith

22 Dec 2005 - Journal of the Royal Society Interface

TL;DR: The paper presents the results of applying LMD to a set of scalp electroencephalogram (EEG) visual perception data, and suggests that there is a statistically significant difference between the theta phase concentrations of the perception and no perception EEG data.

Abstract: This paper describes the local mean decomposition (LMD), a new iterative approach to demodulating amplitude and frequency modulated signals. The new method decomposes such signals into a set of functions, each of which is the product of an envelope signal and a frequency modulated signal from which a time-varying instantaneous frequency can be derived. The LMD method can be used to analyse a wide variety of natural signals such as electrocardiograms, functional magnetic resonance imaging data, and earthquake data. The paper presents the results of applying LMD to a set of scalp electroencephalogram (EEG) visual perception data. The LMD instantaneous frequency and energy structure of the EEG is examined, and compared with results obtained using the spectrogram. The nature of visual perception is investigated by measuring the degree of EEG instantaneous phase concentration that occurs following stimulus onset over multiple trials. The analysis suggests that there is a statistically significant difference between the theta phase concentrations of the perception and no perception EEG data.

705 citations

Journal Article

Complex ratio masking for monaural speech separation

Donald S. Williamson¹, Yuxuan Wang¹, DeLiang Wang¹

Ohio State University¹

01 Mar 2016 - IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.

Abstract: Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech, and enhance only the magnitude spectrum while leaving the phase spectrum unchanged. This is done because there was a belief that the phase spectrum is unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider magnitude and phase spectrum enhancements. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them to related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.

699 citations
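
The idea of a ratio mask defined in the complex domain, as described in the abstract above, can be sketched as follows. This assumes the ideal mask is the complex ratio of the clean STFT to the noisy STFT, so that multiplying the noisy STFT by the mask recovers the clean STFT in both magnitude and phase. Random complex arrays stand in for real STFTs, the function names are hypothetical, and the deep neural network that estimates the mask is not shown.

```python
import numpy as np

def complex_ideal_ratio_mask(clean_stft, noisy_stft, eps=1e-12):
    """Real and imaginary parts of the mask M with M * Y ~= S (complex product)."""
    y_r, y_i = noisy_stft.real, noisy_stft.imag
    s_r, s_i = clean_stft.real, clean_stft.imag
    denom = y_r ** 2 + y_i ** 2 + eps  # |Y|^2, with eps guarding against 0
    mask_real = (y_r * s_r + y_i * s_i) / denom
    mask_imag = (y_r * s_i - y_i * s_r) / denom
    return mask_real, mask_imag

def apply_complex_mask(mask_real, mask_imag, noisy_stft):
    # Complex multiplication changes both magnitude and phase.
    return (mask_real + 1j * mask_imag) * noisy_stft

# Toy "STFTs": 257 frequency bins by 50 frames of random complex values.
rng = np.random.default_rng(0)
shape = (257, 50)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise

m_r, m_i = complex_ideal_ratio_mask(clean, noisy)
recovered = apply_complex_mask(m_r, m_i, noisy)

# Baseline for comparison: ideal magnitude combined with the noisy phase.
magnitude_only = np.abs(clean) * np.exp(1j * np.angle(noisy))
err_complex = np.linalg.norm(recovered - clean)
err_magnitude_only = np.linalg.norm(magnitude_only - clean)
```

Even with a perfect magnitude estimate, the magnitude-only baseline keeps the noisy phase and so leaves a residual error, whereas the ideal complex mask does not. In practice the mask would be estimated rather than computed from the clean signal.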

Journal Article

Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation

Alexey Ozerov¹, Cédric Févotte¹

Télécom ParisTech¹

01 Mar 2010 - IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: In this article, a general data-driven object-based model of multichannel audio data, assumed generated as a possibly underdetermined convolutive mixture of source signals, is considered.

Abstract: We consider inference in a general data-driven object-based model of multichannel audio data, assumed generated as a possibly underdetermined convolutive mixture of source signals. We work in the short-time Fourier transform (STFT) domain, where convolution is routinely approximated as linear instantaneous mixing in each frequency band. Each source STFT is given a model inspired from nonnegative matrix factorization (NMF) with the Itakura-Saito divergence, which underlies a statistical model of superimposed Gaussian components. We address estimation of the mixing and source parameters using two methods. The first one consists of maximizing the exact joint likelihood of the multichannel data using an expectation-maximization (EM) algorithm. The second method consists of maximizing the sum of individual likelihoods of all channels using a multiplicative update algorithm inspired from NMF methodology. Our decomposition algorithms are applied to stereo audio source separation in various settings, covering blind and supervised separation, music and speech sources, synthetic instantaneous and convolutive mixtures, as well as professionally produced music recordings. Our EM method produces competitive results with respect to state-of-the-art as illustrated on two tasks from the international Signal Separation Evaluation Campaign (SiSEC 2008).

636 citations
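
As a rough illustration of NMF with the Itakura-Saito divergence and multiplicative updates, both mentioned in the abstract above, the following sketch factorizes a single nonnegative matrix. It is a single-channel simplification and not the multichannel EM or multiplicative algorithm of the paper: the square-root form of the update, the random initialization, the toy data, and all function names are assumptions made for this example.

```python
import numpy as np

def is_divergence(v, v_hat):
    # Itakura-Saito divergence summed over all entries: x/y - log(x/y) - 1.
    ratio = v / v_hat
    return float(np.sum(ratio - np.log(ratio) - 1.0))

def is_nmf(v, n_components, n_iter=100, seed=0, eps=1e-12):
    """Approximate a positive matrix v (e.g. a power spectrogram) as w @ h."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = v.shape
    w = rng.random((n_freq, n_components)) + eps
    h = rng.random((n_components, n_frames)) + eps
    history = []
    for _ in range(n_iter):
        # Multiplicative update for the activations h (square-root variant).
        v_hat = w @ h + eps
        h *= np.sqrt((w.T @ (v / v_hat ** 2)) / (w.T @ (1.0 / v_hat)))
        # Multiplicative update for the spectral templates w.
        v_hat = w @ h + eps
        w *= np.sqrt(((v / v_hat ** 2) @ h.T) / ((1.0 / v_hat) @ h.T))
        history.append(is_divergence(v, w @ h + eps))
    return w, h, history

# Toy data: a positive 64 x 100 matrix with an underlying rank-3 structure.
rng = np.random.default_rng(1)
v = rng.random((64, 3)) @ rng.random((3, 100)) + 1e-3
w, h, history = is_nmf(v, n_components=3)
```

Because the updates only multiply by nonnegative factors, `w` and `h` stay nonnegative throughout, and `history` records the divergence after each iteration so that convergence can be inspected.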

Performance

Metrics

Papers: 7,848
Citations: 107,060

No. of papers in the topic in previous years
Year	Papers
2024	1
2023	627
2022	1,396
2021	488
2020	595
2019	593