Topic

Spectrogram

About: Spectrogram is a research topic. Over its lifetime, 5,813 publications have been published on this topic, receiving 81,547 citations.


Papers
Journal ArticleDOI
TL;DR: Experimental results show that the proposed Sequence-to-sequence ConvErsion NeTwork (SCENT) obtained better objective and subjective performance than baseline methods using Gaussian mixture models and deep neural networks as acoustic models.
Abstract: In this paper, a neural network named Sequence-to-sequence ConvErsion NeTwork (SCENT) is presented for acoustic modeling in voice conversion. At the training stage, a SCENT model is estimated by implicitly aligning the feature sequences of source and target speakers using an attention mechanism. At the conversion stage, acoustic features and durations of source utterances are converted simultaneously using the unified acoustic model. Mel-scale spectrograms are adopted as acoustic features, which contain both excitation and vocal tract descriptions of the speech signals. The bottleneck features extracted from source speech using an automatic speech recognition (ASR) model are appended as auxiliary input. A WaveNet vocoder conditioned on Mel-spectrograms is built to reconstruct waveforms from the outputs of the SCENT model. It is worth noting that our proposed method can achieve appropriate duration conversion, which is difficult for conventional methods. Experimental results show that our proposed method obtained better objective and subjective performance than the baseline methods using Gaussian mixture models (GMM) and deep neural networks (DNN) as acoustic models. This proposed method also outperformed our previous work, which achieved the top rank in Voice Conversion Challenge 2018. Ablation tests further confirmed the effectiveness of several components in our proposed method.
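
As a concrete illustration of the Mel-scale spectrogram features described above, here is a minimal Python sketch using librosa. The sample rate, FFT size, hop length, and 80 mel bins are common choices for WaveNet-style vocoder conditioning, assumed here for illustration rather than taken from the paper.

```python
# Minimal sketch: extracting log-Mel spectrogram features of the kind
# SCENT conditions on. All parameter values are illustrative assumptions.
import numpy as np
import librosa

def log_mel_spectrogram(wav, sr=16000, n_fft=1024, hop=256, n_mels=80):
    """Return a log-Mel spectrogram of shape (n_mels, frames)."""
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# Usage on one second of placeholder audio:
y = np.random.randn(16000).astype(np.float32)
print(log_mel_spectrogram(y).shape)  # (80, 63)
```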

32 citations

Proceedings ArticleDOI
01 Sep 2018
TL;DR: This paper explicitly integrates phase reconstruction into the separation algorithm through a loss function defined on time-domain signals, and lets the network learn a modified version of the STFT/iSTFT time-frequency representations from data instead of keeping them fixed.
Abstract: Progress in solving the cocktail party problem, i.e., separating the speech from multiple overlapping speakers, has recently accelerated with the invention of techniques such as deep clustering and permutation-free mask inference. These approaches typically focus on estimating target STFT magnitudes and ignore problems of phase inconsistency. In this paper, we explicitly integrate phase reconstruction into our separation algorithm using a loss function defined on time-domain signals. A deep neural network structure is defined by unfolding a phase reconstruction algorithm and treating each iteration as a layer in our network. Furthermore, instead of using fixed STFT/iSTFT time-frequency representations, we allow our network to learn a modified version of these representations from data. We compare several variants of these unfolded phase reconstruction networks, achieving state-of-the-art results on the publicly available wsj0-2mix dataset, and show improved performance when the STFT/iSTFT-like representations are allowed to adapt.
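
For intuition about the unfolding idea, the sketch below unrolls the classical Griffin-Lim phase-reconstruction iteration in NumPy/librosa, with each loop pass playing the role of one network layer. It is a fixed, non-trainable stand-in: in the paper, the STFT/iSTFT-like transforms are learned and the unrolled stack is trained with a time-domain loss.

```python
# Plain Griffin-Lim, unrolled for a fixed number of "layers".
import numpy as np
import librosa

def unfolded_phase_reconstruction(mag, n_layers=5, n_fft=512, hop=128):
    """Reconstruct a waveform from an STFT magnitude `mag` of shape
    (1 + n_fft // 2, frames)."""
    spec = mag.astype(complex)  # start from zero phase
    for _ in range(n_layers):
        # One "layer": project to the time domain and back ...
        wav = librosa.istft(spec, hop_length=hop)
        reproj = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
        # ... then keep the re-estimated phase and snap magnitudes
        # back to the targets.
        spec = mag * np.exp(1j * np.angle(reproj))
    return librosa.istft(spec, hop_length=hop)

# Usage: rebuild a signal from its own magnitude spectrogram.
x = np.random.randn(8192).astype(np.float32)
M = np.abs(librosa.stft(x, n_fft=512, hop_length=128))
x_hat = unfolded_phase_reconstruction(M)
```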

32 citations

Journal Article
TL;DR: This paper proposes methods that apply 3-D microphone arrays, directional analysis of measured room responses, and visualization of data, yielding useful information about the time-frequency-direction properties of the responses.
Abstract: Room impulse responses are inherently multidimensional, including components in three coordinate directions, each one further being described as a time-frequency representation. Such 5-dimensional data is difficult to visualize and interpret. We propose methods that apply 3-D microphone arrays, directional analysis of measured room responses, and visualization of data, yielding useful information about the time-frequency-direction properties of the responses. The applicability of the methods is demonstrated with three different cases of real measurements.

INTRODUCTION A room impulse response, measured from a source to a receiver position, is inherently multidimensional. Traditionally, the evolution of an omnidirectional sound pressure response in a single point has been studied as a function of time and frequency. However, dividing the response further into directional components can reveal much more information about the actual propagation of sound in the room, as well as about its perceptual aspects. In this paper we propose methods that are based on 3-D microphone arrays, directional analysis of the measured responses, and visualization of such data in a way that yields maximal information about the time-frequency-direction properties of the response. The measurement of directional room responses is made with a special 3-D microphone probe which basically consists of two intensity probes in each of the x-, y-, and z-coordinate directions and is constructed of small electret capsules. The responses are analyzed either with a uniform or an auditorily motivated time-frequency resolution. The analysis results in a significant amount of 5-dimensional data that is hard to visualize and interpret. Based on measured x/y/z-intensity components, intensity vectors (magnitude and direction) can be plotted in a spectrogram-like map, one vector for each time-frequency bin, illustrating the directional evolution of the field in time and frequency. Additionally, a pressure-related time-frequency spectrogram can be overlaid with the vectors, in gray levels or colors, illustrating for example a perceptually motivated spectrogram with no directional information. One such map can be used to illustrate the horizontal information and another can be added for the elevation information. This technique is part of a Matlab visualization toolbox for directional room responses developed by the authors, and it includes several other possibilities for analyzing and representing room acoustical data. Traditional parameters and presentations are also available, some of them in 3-D versions, such as energy-time plots in desired directions. The paper starts with a discussion on measurements of directional room responses and sound intensity. This is followed by descriptions of the visualization method and the auditorily motivated time-frequency analysis. Finally, the applicability of the methods is demonstrated with three different cases of real measurements.

DIRECTIONAL SOUND PRESSURE COMPONENTS Existing literature on room acoustics discusses mainly omnidirectional measurements, with the exception of some special directional parameters. Directional room responses can be measured with either directional microphones or arrays of microphones. However, an array of omnidirectional microphones has some distinct advantages compared to directional microphones.
Omnidirectional capsules can be made smaller and they usually behave more like ideal transducers. Further, if the omnidirectional signals are stored at measurement time, it is possible afterwards to create varying directivity patterns based on a single measurement. Typical directivity patterns can be formed with an array of two or more closely spaced omnidirectional microphones and some equalization to compensate for the resulting non-flat magnitude response. For example, the difference of two microphone signals gives a dipole pattern, and adding an appropriate delay to one of the signals changes the pattern to a cardioid. Okubo et al. [1] have also proposed a method that uses a product of cardioid and dipole signals to achieve a directivity pattern more suitable for some directional room acoustics measurements. Various directional sound pressure responses can be used to plot traditional impulse responses, energy-time curves, or spectrograms that give information about the directional properties of the room responses. With larger microphone arrays it is also possible to form directivity patterns with very narrow beams and thus good spatial resolution. However, groups of similar plots for several different directions are not very visual or easy to interpret. Sound intensity as a vector quantity can solve some of the visualization problems in the method we are proposing in this paper.

SOUND INTENSITY Sound intensity [2] describes the propagation of energy in a sound field. The instantaneous intensity vector is defined as the product of the instantaneous sound pressure p(t) and particle velocity u(t):

I(t) = p(t) u(t)    (1)

Based on the linearized fluid momentum equation, the particle velocity in the direction n can be written in the form

u_n(t) = -(1/ρ0) ∫ ∂p(t)/∂n dt    (2)

where ρ0 is the density of air.
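
A hedged sketch of the spectrogram-like intensity map described above: following Eq. (1), the active intensity per time-frequency bin in one coordinate direction can be estimated as Re{P(t,f) U*(t,f)} from the STFTs of pressure and particle velocity. The STFT settings and this standard estimator are assumptions for illustration, not details from the paper.

```python
# Active intensity per time-frequency bin for one coordinate direction.
import numpy as np
from scipy.signal import stft

def intensity_map(p, u, fs=48000, nperseg=1024):
    """Re{P(t,f) * conj(U(t,f))} for pressure p and one
    particle-velocity component u; returns (freqs, frames)."""
    _, _, P = stft(p, fs=fs, nperseg=nperseg)
    _, _, U = stft(u, fs=fs, nperseg=nperseg)
    return np.real(P * np.conj(U))

# With x-, y-, and z-velocity components, three such maps give the
# magnitude and direction of one intensity vector per bin, ready to
# overlay on a pressure spectrogram.
ix = intensity_map(np.random.randn(48000), np.random.randn(48000))
print(ix.shape)  # (freqs, frames)
```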

32 citations

Journal ArticleDOI
TL;DR: A chi-squared description of the spectrogram distribution appears accurate when the analysis window used to construct the spectrogram decreases to zero at its boundaries, regardless of the level of correlation contained in the signal.
Abstract: Given a correlated Gaussian signal, may a chi-squared law of probability always be used to describe the distribution of a spectrogram coefficient? If not, would a "chi-squared description" lead to an acceptable amount of error when detection problems are to be faced in the time-frequency domain? These two questions prompted the study reported in this paper. After deriving the probability distribution of spectrogram coefficients for a noncentered, correlated Gaussian signal, the Kullback-Leibler divergence is first used to evaluate to what extent the nonwhiteness of the signal and the Fourier analysis window impact the probability distribution of the spectrogram. To complete the analysis, a detection task formulated as a binary hypothesis test is considered. We evaluate the error in the probability of false alarm when the likelihood ratio test is expressed with chi-squared laws. From these results, a chi-squared description of the spectrogram distribution appears accurate when the analysis window used to construct the spectrogram decreases to zero at its boundaries, regardless of the level of correlation contained in the signal. When other analysis windows are used, the length of the window and the correlation contained in the analyzed signal impact the validity of the chi-squared description.
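
As a quick numerical illustration of the favorable case identified above, the sketch below analyzes white Gaussian noise with a Hann window, which decays to zero at its boundaries; an interior squared-magnitude spectrogram coefficient should then be close to a scaled chi-squared law with 2 degrees of freedom, i.e. an exponential. The signal length, window, and goodness-of-fit test are illustrative choices, not the paper's protocol.

```python
# Check |STFT|^2 of white Gaussian noise against a scaled chi2(2)
# (equivalently, exponential) law. Overlapping frames are weakly
# correlated, so the KS p-value is only indicative.
import numpy as np
from scipy.signal import stft
from scipy.stats import kstest

rng = np.random.default_rng(0)
x = rng.standard_normal(2**16)
_, _, Z = stft(x, nperseg=256)          # Hann window by default
coeffs = np.abs(Z[64, :])**2            # one interior frequency bin
scale = coeffs.mean()                   # exponential scale = sample mean
print(kstest(coeffs, "expon", args=(0, scale)))  # large p-value expected
```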

32 citations

Proceedings ArticleDOI
20 Mar 2016
TL;DR: A novel source separation method aiming to overcome the difficulty of modelling non-stationary signals, based on a signal representation that divides the complex spectrogram into a grid of patches of arbitrary size, which reveals spectral and temporal modulation textures.
Abstract: In this paper we present a novel source separation method aiming to overcome the difficulty of modelling non-stationary signals. The method can be applied to mixtures of musical instruments with frequency and/or amplitude modulation, e.g. as typically caused by vibrato. It is based on a signal representation that divides the complex spectrogram into a grid of patches of arbitrary size. These complex patches are then processed by a two-dimensional discrete Fourier transform, forming a tensor representation which reveals spectral and temporal modulation textures. Our representation can be seen as an alternative to modulation transforms computed on magnitude spectrograms. An adapted factorization model makes it possible to decompose different time-varying harmonic sources based on their particular common modulation profile: hence the name Common Fate Model. The method is evaluated on musical instrument mixtures playing the same fundamental frequency (unison), showing improvement over other state-of-the-art methods.
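
A minimal sketch of the representation stage described above, before any factorization: cut the complex spectrogram into a grid of patches and apply a 2-D DFT to each patch, yielding a 4-D modulation tensor. The STFT settings and the 16x8 patch size are arbitrary illustrative choices, in the spirit of the paper's "patches of arbitrary size".

```python
# Complex spectrogram -> grid of patches -> per-patch 2-D DFT.
import numpy as np
from scipy.signal import stft

def common_fate_tensor(x, nperseg=1024, patch=(16, 8)):
    _, _, Z = stft(x, nperseg=nperseg)   # complex spectrogram
    pf, pt = patch
    F = Z.shape[0] - Z.shape[0] % pf     # trim to a whole number of
    T = Z.shape[1] - Z.shape[1] % pt     # patches along each axis
    grid = Z[:F, :T].reshape(F // pf, pf, T // pt, pt)
    # A 2-D DFT over each patch's (freq, time) axes reveals its
    # spectro-temporal modulation texture.
    return np.fft.fft2(grid, axes=(1, 3))

tensor = common_fate_tensor(np.random.randn(2**15))
print(tensor.shape)  # (freq patches, pf, time patches, pt)
```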

31 citations


Network Information
Related Topics (5)

Topic                           Papers    Citations   Relatedness
Deep learning                   79.8K     2.1M        79%
Convolutional neural network    74.7K     2M          78%
Feature extraction              111.8K    2.1M        77%
Wavelet                         78K       1.3M        76%
Support vector machine          73.6K     1.7M        75%
Performance Metrics

No. of papers in the topic in previous years:

Year    Papers
2024    1
2023    627
2022    1,396
2021    488
2020    595
2019    593