Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks. However, little is known about its effectiveness in other challenging situations such as music source separation. Contrary to conventional networks that directly estimate the source signals, deep clustering generates an embedding for each time-frequency bin, and separates sources by clustering the bins in the embedding space. We show that deep clustering outperforms conventional networks on a singing voice separation task, in both matched and mismatched conditions, even though conventional networks have the advantage of end-to-end training for best signal approximation, presumably because its more flexible objective engenders better regularization. Since the strengths of deep clustering and conventional network architectures appear complementary, we explore combining them in a single hybrid network trained via an approach akin to multi-task learning. Remarkably, the combination significantly outperforms either of its components.
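The clustering step described in this abstract can be illustrated with a small sketch. It assumes the network has already produced a D-dimensional embedding for every time-frequency bin; here those embeddings are synthetic stand-ins, and the helper names `kmeans` and `masks_from_embeddings` are hypothetical, not taken from the paper. Plain k-means is used for illustration only.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centre.
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        dist = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def masks_from_embeddings(embeddings, n_sources):
    """Cluster (T, F, D) embeddings into (n_sources, T, F) binary masks."""
    T, F, D = embeddings.shape
    labels = kmeans(embeddings.reshape(T * F, D), n_sources)
    masks = [(labels == j).reshape(T, F) for j in range(n_sources)]
    return np.stack(masks).astype(float)

# Synthetic embeddings (an assumption): each bin sits near one of two prototypes.
rng = np.random.default_rng(0)
T, F, D = 50, 64, 8
true_labels = rng.integers(0, 2, size=(T, F))
prototypes = np.eye(D)[:2]
embeddings = prototypes[true_labels] + 0.05 * rng.standard_normal((T, F, D))

masks = masks_from_embeddings(embeddings, n_sources=2)
```

Multiplying each mask with the mixture spectrogram would then give one estimate per source; the cluster order is arbitrary, so the masks are only defined up to a permutation of the sources.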

Deep clustering and conventional networks for music separation: Stronger together

The effect of reduced spectral contrast on the speech‐reception threshold (SRT) for sentences in noise, and on phoneme identification, was investigated with 16 normal‐hearing subjects. The SRT increases—to about the same extent for a male as for a female voice—as spectral energy is smeared over bandwidths exceeding the ear's critical bandwidth. Phoneme identification shows that vowels are more susceptible to this type of processing than consonants. Vowels are primarily confused with the back vowels /ɔ,u/, and consonants are confused where place of articulation is concerned. In competing speech normal‐hearing subjects show a 6–8 dB lower SRT for sentences than in steady‐state noise, while sensorineurally hearing‐impaired subjects do not [J. M. Festen and R. Plomp, J. Acoust. Soc. Am. 88, 1725–1736 (1990)]. As frequency resolution may contribute to this effect, and as fluctuating interferences of speech are very common in daily situations, the extent of the threshold difference between noise and speech mask...

Effect of spectral envelope smearing on speech reception in noise and competing speech

Popular music is often composed of an accompaniment and a lead component, the latter typically consisting of vocals. Filtering such mixtures to extract one or both components has many applications, such as automatic karaoke and remixing. This particular case of source separation yields very specific challenges and opportunities, including the particular complexity of musical structures, but also relevant prior knowledge coming from acoustics, musicology or sound engineering. Due to both its importance in applications and its difficulty, lead and accompaniment separation has been a popular topic in signal processing for decades. In this article, we provide a comprehensive review of this research topic, organizing the different approaches according to whether they are model-based or data-centered. For model-based methods, we organize them according to whether they concentrate on the lead signal, the accompaniment, or both. For data-centered approaches, we discuss the particular difficulty of obtaining data for learning lead separation systems, and then review recent approaches, notably those based on deep learning. Finally, we discuss the delicate problem of evaluating the quality of music separation through adequate metrics and present the results of the largest evaluation, to date, of lead and accompaniment separation systems. In conjunction with the above, a comprehensive list of references is provided, along with relevant pointers to available implementations and repositories.

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01766781/document

An Overview of Lead and Accompaniment Separation in Music

Rapid technological progress in recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem of identifying a speaker from their voice regardless of the content. In this study, the authors designed and implemented a novel text-independent multimodal speaker identification system based on wavelet analysis and neural networks. Wavelet analysis comprises discrete wavelet transform, wavelet packet transform, wavelet sub-band coding and Mel-frequency cepstral coefficients (MFCCs). The learning module comprises general regressive, probabilistic and radial basis function neural networks, forming decisions through a majority voting scheme. The system was found to be competitive and it improved the identification rate by 15% as compared with the classical MFCC. In addition, it reduced the identification time by 40% as compared with the back-propagation neural network, Gaussian mixture model and principal component analysis. Performance tests conducted using the GRID database corpora have shown that this approach has faster identification time and greater accuracy compared with traditional approaches, and it is applicable to real-time, text-independent speaker identification systems.

Speaker identification using multimodal neural networks and wavelet analysis

With the development of Earth observation programs, more and more multi-temporal synthetic aperture radar (SAR) data are available from remote sensing platforms. Unsupervised methods for SAR image change detection are therefore in demand. Recently, deep learning-based methods have displayed promising performance for remote sensing image analysis. However, these methods can only provide excellent performance when the number of training samples is sufficiently large. In this paper, a novel, simple method for SAR image change detection is proposed. The proposed method uses two singular value decomposition (SVD) analyses to learn the non-linear relations between multi-temporal images. By this means, the proposed method can generate more representative feature expressions with fewer samples. It is therefore simple yet effective, and easy to design and train. Firstly, deep semi-nonnegative matrix factorization (Deep Semi-NMF) is utilized to select pixels that have a high probability of being changed or unchanged as samples. Next, image patches centered at these sample pixels are generated from the input multi-temporal SAR images. Then, we build SVD networks, which comprise two SVD convolutional layers and one histogram feature generation layer. Finally, pixels in both multi-temporal SAR images are classified by the SVD networks, and then the final change map can be obtained. Experimental results on three SAR datasets demonstrate the effectiveness and robustness of the proposed method.

Change Detection in SAR Images Based on Deep Semi-NMF and SVD Networks

This paper proposes a voice activity detection (VAD) algorithm based on an energy-related feature of the frequency modulation of harmonics. A multi-resolution spectro-temporal analysis framework, which was developed to extract texture features of the audio signal from its Fourier spectrogram, is used to extract frequency modulation features of the speech signal. The proposed algorithm labels the voice-active segments of the speech signal by comparing the energy-related feature of the frequency modulation of harmonics with a threshold. The proposed VAD is then implemented on a Texas Instruments (TI) digital signal processor (DSP) platform for real-time operation. Simulations conducted on the DSP platform demonstrate that the proposed VAD performs significantly better than three standard VADs, ITU-T G.729B, ETSI AMR1 and AMR2, in non-stationary noise in terms of the receiver operating characteristic (ROC) curves and the recognition rates from a practical distributed speech recognition (DSR) system.

Key words: digital signal processor, frequency modulation, spectro-temporal analysis, voice activity detection
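The decision rule in this abstract, comparing a per-frame feature with a threshold, can be sketched as follows. This is only an illustration: it substitutes plain short-time energy for the paper's frequency-modulation-of-harmonics feature, and the function names, frame sizes, and threshold value are assumptions chosen for the toy signal.

```python
import numpy as np

def frame_energy(signal, frame_len=256, hop=128):
    """Mean squared amplitude of each (overlapping) frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return np.mean(frames ** 2, axis=1)

def vad(signal, threshold, frame_len=256, hop=128):
    """Label a frame as active when its energy feature exceeds the threshold."""
    return frame_energy(signal, frame_len, hop) > threshold

# Toy signal (an assumption): low-level noise, then a 440 Hz tone, then noise.
rng = np.random.default_rng(0)
fs = 8000
noise = lambda n: 0.01 * rng.standard_normal(n)
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(4096) / fs)
signal = np.concatenate([noise(4096), tone + noise(4096), noise(4096)])

decisions = vad(signal, threshold=0.01)
```

In practice the threshold would be tuned, for example by sweeping it to trace out an ROC curve as the abstract's evaluation does.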

https://ir.nctu.edu.tw:443/bitstream/11536/129426/1/a68cbaa0802dfebfa4e94a77e3339be5.pdf

Robust Voice Activity Detection Algorithm Based on Feature of Frequency Modulation of Harmonics and Its DSP Implementation

This paper presents a Bayesian nonnegative matrix factorization (NMF) approach to extract singing voice from background music accompaniment. Using this approach, the likelihood function based on NMF is represented by a Poisson distribution and the NMF parameters, consisting of basis and weight matrices, are characterized by exponential priors. A variational Bayesian expectation-maximization algorithm is developed to learn variational parameters and model parameters for monaural source separation. A clustering algorithm is performed to establish two groups of bases: one is for singing voice and the other is for background music. Model complexity is controlled by adaptively selecting the number of bases for different mixed signals according to the variational lower bound. Model regularization is tackled through the uncertainty modeling via variational inference based on marginal likelihood. The experimental results on the MIR-1K database show that the proposed method performs better than various unsupervised separation algorithms in terms of the global normalized source-to-distortion ratio.

Bayesian singing-voice separation

The two-dimensional spectro-temporal modulation filtering concept of the auditory model [T. Chi, P. Ru, and S. A. Shamma, J. Acoust. Soc. Am. 118(2), 887–906 (2005)] is implemented on the Fourier spectrogram. The Fourier magnitude spectrogram is analyzed in terms of its joint spectro-temporal modulations, which embed the temporal dynamics and spectral structures. Instead of iterative projection methods, the overlap-and-add method is adopted to invert modified Fourier spectrograms back to sounds. The proposed framework not only provides a similar spectro-temporal analytical process for sounds as the auditory model but also produces synthesized sounds with better quality in a timely manner, which makes the proposed framework feasible for human speech recognition (HSR) applications as well.
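Overlap-and-add inversion of a modified spectrogram can be sketched in a few lines. The abstract does not say how phase is handled, so this sketch assumes the original phase is reused alongside the modified magnitude; the window choice (periodic Hann, 50% overlap) and the function names are likewise assumptions, and the modulation filtering itself is replaced by a simple gain.

```python
import numpy as np

def stft(x, n_fft=256):
    """Short-time Fourier transform with a periodic Hann window, 50% overlap."""
    hop = n_fft // 2
    # Periodic Hann: copies shifted by n_fft/2 sum to exactly 1.
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft_ola(spec, n_fft=256):
    """Invert each frame and overlap-add the results (no iteration needed)."""
    hop = n_fft // 2
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + n_fft)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
spec = stft(x)

# Unmodified spectrogram: the interior of the signal is recovered.
y = istft_ola(spec)

# Modified magnitude (here simply halved) recombined with the original phase.
halved = istft_ola(0.5 * np.abs(spec) * np.exp(1j * np.angle(spec)))
```

Only the first and last half-frames are attenuated, because they are covered by a single window; padding the input would avoid that edge effect.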

https://ir.nctu.edu.tw:443/bitstream/11536/8926/1/000290450400005.pdf

Multiband analysis and synthesis of spectro-temporal modulations of Fourier spectrogram.

Spectro-temporal modulations of speech encode speech structures and speaker characteristics. An algorithm which distinguishes speech from non-speech based on spectro-temporal modulation energies is proposed and evaluated in robust text-independent closed-set speaker identification simulations using the TIMIT and GRID corpora. Simulation results show the proposed method produces much higher speaker identification rates in all signal-to-noise ratio (SNR) conditions than the baseline system using mel-frequency cepstral coefficients. In addition, the proposed method also outperforms the system that uses auditory-based nonnegative tensor cepstral coefficients [Q. Wu and L. Zhang, “Auditory sparse representation for robust speaker recognition based on tensor structure,” EURASIP J. Audio, Speech, Music Process. 2008, 578612 (2008)] in low-SNR (≤ 10 dB) conditions.

https://ir.nctu.edu.tw:443/bitstream/11536/16329/1/000303601600003.pdf

Spectro-temporal modulation energy based mask for robust speaker identification

This paper proposes a layered nonnegative matrix factorization (L-NMF) algorithm for speech separation. The standard NMF method extracts parts-based bases out of nonnegative training data and is often used to separate mixed spectrograms. The proposed L-NMF algorithm comprises several layers of standard NMF blocks. During training, each layer of the L-NMF is initialized separately and then fine-tuned by minimizing the propagated reconstruction error. More complicated bases of the training data emerge in deeper layers of the L-NMF by progressively combining parts-based bases extracted in the first layer. In other words, these complicated bases contain collective information of the parts-based bases. The bases deciphered by all layers are then used to separate spectrograms in the conventional NMF way. Simulation results show the proposed L-NMF outperforms the standard NMF in terms of the source-to-distortion ratio (SDR).
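As background for the "conventional NMF way" of separating spectrograms, here is a minimal single-layer sketch; it does not implement the layered L-NMF. It assumes the Euclidean cost with the standard multiplicative updates, random toy spectrograms in place of real training data, and a ratio mask for reconstruction. The names `nmf` and `separate` are hypothetical.

```python
import numpy as np

def nmf(V, n_bases, n_iter=200, W=None, seed=0, eps=1e-9):
    """Euclidean NMF, V ~= W @ H, by multiplicative updates.
    If W is supplied it is held fixed and only H is estimated."""
    rng = np.random.default_rng(seed)
    update_W = W is None
    if update_W:
        W = rng.random((V.shape[0], n_bases)) + eps
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        if update_W:
            W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def separate(V_mix, W_a, W_b, eps=1e-9):
    """Fit activations for the stacked bases, then split the mixture by ratio mask."""
    W = np.hstack([W_a, W_b])
    _, H = nmf(V_mix, W.shape[1], W=W)
    k = W_a.shape[1]
    est_a, est_b = W_a @ H[:k], W_b @ H[k:]
    total = est_a + est_b + eps
    return V_mix * est_a / total, V_mix * est_b / total

# Toy nonnegative "spectrograms" (an assumption, not real speech data).
rng = np.random.default_rng(1)
n_freq, n_time = 40, 60
train_a = rng.random((n_freq, 3)) @ rng.random((3, n_time))
train_b = rng.random((n_freq, 3)) @ rng.random((3, n_time))

W_a, _ = nmf(train_a, 3)   # bases learned per source from training data
W_b, _ = nmf(train_b, 3)
mix = train_a + train_b
out_a, out_b = separate(mix, W_a, W_b)
```

The ratio mask guarantees that the two estimates are nonnegative and add back up to the mixture, which is one common design choice; other reconstruction rules are possible.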

Chung Chien Hsu

Papers

Robust Voice Activity Detection Algorithm Based on Feature of Frequency Modulation of Harmonics and Its DSP Implementation

Bayesian singing-voice separation

Multiband analysis and synthesis of spectro-temporal modulations of Fourier spectrogram.

Spectro-temporal modulation energy based mask for robust speaker identification

Layered Nonnegative Matrix Factorization for Speech Separation