Spectrogram

About: Spectrogram is a research topic. Over its lifetime, 5,813 publications have been published on this topic, receiving 81,547 citations.


Papers
Proceedings Article
03 May 2021
TL;DR: DiffWave is a diffusion probabilistic model for conditional and unconditional waveform generation; it is non-autoregressive and converts white noise into a structured waveform through a Markov chain with a constant number of steps at synthesis.
Abstract: In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive and converts a white-noise signal into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrograms, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43) while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models on the challenging unconditional generation task in terms of audio quality and sample diversity, as judged by various automatic and human evaluations.
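For intuition, here is a minimal NumPy sketch of the reverse (denoising) Markov chain the abstract describes: synthesis starts from white noise and repeatedly applies a learned denoiser for a fixed number of steps. The placeholder `denoise_fn`, the noise schedule `betas`, and the choice of variance `sqrt(beta_t)` are illustrative assumptions, not DiffWave's exact configuration.

```python
import numpy as np

def reverse_diffusion(denoise_fn, length, betas, rng=np.random.default_rng(0)):
    """DDPM-style reverse chain: turn white noise into a structured waveform."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(length)            # x_T ~ N(0, I): pure white noise
    for t in reversed(range(len(betas))):
        eps_hat = denoise_fn(x, t)             # network's noise estimate at step t
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(length) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise   # sample x_{t-1}
    return x
```

Because the number of steps in the chain is fixed, synthesis cost is constant in sequence length per step, which is what lets this approach run orders of magnitude faster than sample-by-sample autoregressive vocoders.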

27 citations

Posted Content
TL;DR: MelGAN-VC, a voice conversion method that relies on non-parallel speech data and is able to convert audio signals of arbitrary length from a source voice to a target voice, is proposed and applied to perform music style transfer.
Abstract: Traditional voice conversion methods rely on parallel recordings of multiple speakers pronouncing the same sentences. For real-world applications, however, parallel data is rarely available. We propose MelGAN-VC, a voice conversion method that relies on non-parallel speech data and is able to convert audio signals of arbitrary length from a source voice to a target voice. We first compute spectrograms from waveform data and then perform a domain translation using a Generative Adversarial Network (GAN) architecture. An additional siamese network helps preserve speech information in the translation process without sacrificing the ability to flexibly model the style of the target speaker. We test our framework on a dataset of clean speech recordings, as well as on a collection of noisy real-world speech examples. Finally, we apply the same method to perform music style transfer, translating arbitrarily long music samples from one genre to another and showing that our framework is flexible and can be used for audio manipulation applications other than voice conversion.
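As a rough illustration of the first step, this is how a log-mel spectrogram might be computed with librosa before the GAN translation; the file name, sample rate, and STFT parameters are placeholders, not MelGAN-VC's actual settings.

```python
import librosa
import numpy as np

# Load a waveform and compute a log-mel spectrogram, the 2-D representation
# that the GAN then translates from the source-voice to the target-voice domain.
y, sr = librosa.load("source_voice.wav", sr=22050)   # placeholder path and rate
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(mel + 1e-6)   # log compression stabilizes GAN training
```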

27 citations

Journal ArticleDOI
TL;DR: Comparisons showed that the proposed speech enhancement approaches outperformed related speech enhancement methods and led to large improvements in perceived speech quality and intelligibility, as well as in automatic speech recognition.
Abstract: Human speech in real-world environments is typically degraded by background noise, which reduces perceptual speech quality and intelligibility and causes performance degradation in speech-related technologies such as hearing aids and automatic speech recognition systems. Noise also corrupts the phase of the clean speech, introducing perceptual disturbance that further harms speech quality. Speech enhancement must therefore be dealt with vigilantly in everyday listening environments. In this article, speech enhancement is performed using supervised learning of spectral masking. Deep neural networks (DNN) and recurrent neural networks (RNN) are trained to learn spectral masks from the magnitude spectrograms of the degraded speech. An iterative procedure is adopted as a post-processing step to deal with the noisy phase. Additionally, an intelligibility improvement filter incorporates critical-band importance-function weights, where higher weights contribute more toward intelligibility. Systematic experiments on the TIMIT database demonstrated that the proposed approaches greatly attenuate the background noise and yield large improvements in perceived speech quality and intelligibility, as well as in automatic speech recognition: STOI improved by 17.6%, SDR by 5.22 dB, and PESQ by 19% over the noisy speech utterances. These comparisons show that the proposed approaches outperform related speech enhancement methods.
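A minimal sketch of the spectral-masking pipeline, assuming a trained network is available behind the placeholder `mask_fn`; the paper's iterative phase post-processing and intelligibility filter are omitted here.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, mask_fn, fs=16000, nperseg=512):
    """Supervised spectral masking: estimate a time-frequency mask on the
    magnitude spectrogram, then resynthesize using the noisy phase."""
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    mask = mask_fn(mag)                  # DNN/RNN mask in [0, 1], one value per bin
    enhanced_mag = mask * mag            # suppress noise-dominated bins
    _, clean_est = istft(enhanced_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return clean_est
```

In the paper the mask comes from a trained DNN or RNN; here any callable returning per-bin values in [0, 1] works, which is what makes the noisy-phase problem visible: the mask only fixes magnitudes, so phase must be handled separately.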

27 citations

Journal ArticleDOI
Peng Lei1, Jiawei Liang1, Zhenyu Guan1, Jun Wang1, Tong Zheng1 
TL;DR: An acceleration method for a convolutional neural network (CNN) on a field-programmable gate array (FPGA), targeting embedded millimeter-wave (mmW) radar-based human activity classification, is presented; it maintains high classification accuracy while improving execution speed, memory requirements, and power consumption.
Abstract: Deep learning techniques have attracted much attention in radar automatic target recognition. In this paper, we investigate an acceleration method for a convolutional neural network (CNN) on a field-programmable gate array (FPGA) for the embedded application of millimeter-wave (mmW) radar-based human activity classification. Considering the micro-Doppler effect caused by a person's body movements, the spectrogram of the mmW radar echoes is adopted as the CNN input. Then, according to the CNN architecture and the properties of the FPGA implementation, several parallel processing strategies are designed, together with data quantization and an optimized classification decision, to accelerate CNN execution. Finally, comparative experiments and discussions are carried out on a measured dataset of nine individuals performing four different actions, collected with a 77-GHz mmW radar. The results show that the proposed method not only maintains high classification accuracy but also improves execution speed, memory requirements, and power consumption. Specifically, compared with the implementation of the same network model on a graphics processing unit, it achieves a speedup of about 30.42% at the cost of a classification accuracy loss of only 0.27%.
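Data quantization is one of the acceleration levers mentioned above. The following sketch shows symmetric fixed-point weight quantization of the kind used to fit CNN arithmetic into FPGA resources; the bit widths are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def quantize_fixed_point(weights, total_bits=8, frac_bits=6):
    """Symmetric fixed-point quantization of CNN weights: round each value
    to a signed integer grid, then map back to the represented real value."""
    scale = 2 ** frac_bits
    lo = -(2 ** (total_bits - 1))           # most negative integer code
    hi = 2 ** (total_bits - 1) - 1          # most positive integer code
    q = np.clip(np.round(weights * scale), lo, hi)   # integer codes for the FPGA
    return q / scale                         # dequantized values for verification
```

Shrinking weights from 32-bit floats to a few fixed-point bits is what buys the reported memory and power savings, at the cost of the small accuracy loss noted in the abstract.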

27 citations

Proceedings ArticleDOI
17 Sep 2003
TL;DR: An enhanced method for the detection of wheezes, based on the spectrogram of breath-sound recordings, is proposed; it could be used for long-term wheezing screening in sleep laboratories, resulting in a significant data-volume reduction.
Abstract: An enhanced method for the detection of wheezes, based on the spectrogram of breath-sound recordings, is proposed. Identifying wheezes within the total breath cycle would contribute to the diagnosis of pathologies in patients with obstructive airway diseases. Fast and quite simple techniques are applied to automatically locate and identify wheezing episodes. Amplitude criteria are applied to the peaks of the spectrogram to discriminate wheezes from normal breath sounds, while frequency- and time-continuity criteria are used to improve the results. The proposed detector could be used for long-term wheezing screening in sleep laboratories, resulting in a significant data-volume reduction.
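A toy version of that detection logic: an amplitude criterion on spectrogram peaks plus a frequency/time-continuity check. All thresholds (`db_margin`, `max_freq_jump`, `min_frames`) are illustrative guesses, not the paper's values.

```python
import numpy as np
from scipy.signal import spectrogram

def detect_wheezes(x, fs, db_margin=10.0, max_freq_jump=50.0, min_frames=5):
    """Flag spectrogram frames whose peak stands out in amplitude, then keep
    only runs where the peak frequency persists across consecutive frames."""
    f, t, Sxx = spectrogram(x, fs=fs, nperseg=1024, noverlap=512)
    Sdb = 10.0 * np.log10(Sxx + 1e-12)
    peak_db = Sdb.max(axis=0)                   # strongest bin in each frame
    peak_f = f[Sdb.argmax(axis=0)]              # frequency of that bin
    strong = peak_db > Sdb.mean() + db_margin   # amplitude criterion
    events, start = [], None
    for i in range(len(t)):
        cont = strong[i] and (start is None or
                              abs(peak_f[i] - peak_f[i - 1]) < max_freq_jump)
        if cont and start is None:
            start = i                            # open a candidate episode
        elif not cont and start is not None:
            if i - start >= min_frames:          # time-continuity criterion
                events.append((t[start], t[i - 1]))
            start = None
    if start is not None and len(t) - start >= min_frames:
        events.append((t[start], t[-1]))
    return events                                # list of (start_s, end_s) episodes
```

Returning only episode boundaries rather than raw audio is also what enables the data-volume reduction the abstract mentions for long-term screening.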

27 citations


Network Information
Related Topics (5)
Deep learning: 79.8K papers, 2.1M citations (79% related)
Convolutional neural network: 74.7K papers, 2M citations (78% related)
Feature extraction: 111.8K papers, 2.1M citations (77% related)
Wavelet: 78K papers, 1.3M citations (76% related)
Support vector machine: 73.6K papers, 1.7M citations (75% related)
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2024    1
2023    627
2022    1,396
2021    488
2020    595
2019    593