There is, I think, something ethereal about i, the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time, an intruder hovering on the edge of reality.

Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

I and i

Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then, we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multitalker separation), and speech dereverberation, as well as multimicrophone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.

Supervised Speech Separation Based on Deep Learning: An Overview

In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker independent multitalker speech separation. Specifically, uPIT extends the recently proposed permutation invariant training (PIT) technique with an utterance-level cost function, hence eliminating the need for solving an additional permutation problem during inference, which is otherwise required by frame-level PIT. We achieve this using recurrent neural networks (RNNs) that, during training, minimize the utterance-level separation error, hence forcing separated frames belonging to the same speaker to be aligned to the same output stream. In practice, this allows RNNs, trained with uPIT, to separate multitalker mixed speech without any prior knowledge of signal duration, number of speakers, speaker identity, or gender. We evaluated uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that uPIT outperforms techniques based on nonnegative matrix factorization and computational auditory scene analysis, and compares favorably with deep clustering, and the deep attractor network. Furthermore, we found that models trained with uPIT generalize well to unseen speakers and languages. Finally, we found that a single model, trained with uPIT, can handle both two-speaker, and three-speaker speech mixtures.

Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks
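The utterance-level objective described in the uPIT abstract above can be illustrated with a small sketch. It assumes a mean squared error between estimated and reference magnitude frames and searches all speaker permutations once per utterance. The function names `utterance_mse` and `upit_loss` are hypothetical, and the paper's actual cost function and network are not reproduced here.

```python
from itertools import permutations


def utterance_mse(est, ref):
    """Mean squared error between two streams over all frames and bins."""
    total, count = 0.0, 0
    for est_frame, ref_frame in zip(est, ref):
        for e, r in zip(est_frame, ref_frame):
            total += (e - r) ** 2
            count += 1
    return total / count


def upit_loss(estimates, references):
    """Return (loss, best_perm) with ONE permutation chosen for the whole utterance.

    estimates[s] and references[k] are lists of frames; each frame is a list
    of magnitudes. best_perm[s] is the reference assigned to output stream s.
    """
    n = len(references)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average the per-stream error under this output-to-speaker assignment.
        loss = sum(
            utterance_mse(estimates[s], references[perm[s]]) for s in range(n)
        ) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy data: two speakers, three frames, two frequency bins per frame.
refs = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
]
# The output streams are swapped relative to the references.
ests = [refs[1], refs[0]]
loss, perm = upit_loss(ests, refs)
```

Because the permutation is fixed for the whole utterance rather than per frame, every frame of a given output stream is tied to the same speaker, which is the property the abstract attributes to uPIT.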

Monaural source separation is important for many real world applications. It is challenging because, with only a single channel of information available, without any constraints, an infinite number of solutions are possible. In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including monaural speech separation, monaural singing voice separation, and speech denoising. The joint optimization of the deep recurrent neural networks with an extra masking layer enforces a reconstruction constraint. Moreover, we explore a discriminative criterion for training neural networks to further enhance the separation performance. We evaluate the proposed system on the TSP, MIR-1K, and TIMIT datasets for speech separation, singing voice separation, and speech denoising tasks, respectively. Our approaches achieve 2.30--4.98 dB SDR gain compared to NMF models in the speech separation task, 2.30--2.48 dB GNSDR gain and 4.32--5.42 dB GSIR gain compared to existing models in the singing voice separation task, and outperform NMF and DNN baselines in the speech denoising task.

Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation
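The reconstruction constraint mentioned in the abstract above can be sketched as a soft time-frequency masking step. The sketch assumes the mask is the ratio of the two predicted magnitudes, applied to the mixture magnitude. The function name `soft_mask_layer` and the `eps` stabilizer are illustrative choices, not details taken from the paper.

```python
def soft_mask_layer(pred1, pred2, mixture, eps=1e-8):
    """Apply a soft mask to one frame of mixture magnitudes.

    pred1, pred2: raw network predictions for the two sources (lists of floats).
    mixture: mixture magnitudes for the same frame.
    Returns two masked outputs whose sum equals the mixture in every bin.
    """
    out1, out2 = [], []
    for p1, p2, x in zip(pred1, pred2, mixture):
        denom = abs(p1) + abs(p2) + eps  # eps avoids division by zero
        m1 = abs(p1) / denom             # soft mask for source 1, in [0, 1]
        out1.append(m1 * x)
        out2.append((1.0 - m1) * x)      # complementary mask for source 2
    return out1, out2


mix = [2.0, 1.0, 4.0]
src1, src2 = soft_mask_layer([3.0, 0.5, 1.0], [1.0, 0.5, 3.0], mix)
```

Since the two masks sum to one in each bin, the two outputs always add back up to the mixture, which is one way to read "enforces a reconstruction constraint".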

The title of this technical report says almost everything: this is indeed "a short bibliography on AI and the arts". It is presented in four sections: General Arguments, Proposals, and Approaches (31 references); Artificial Intelligence in Music (124 references); Artificial Intelligence in Literature and the Performing Arts (13 references); and Artificial Intelligence and Visual Art (57 references). About a quarter of these have short abstracts. Creating a bibliography can be a monumental task, and this bibliography should be viewed as a good and useful start, though it is by no means complete. For comparison, consider the 4,585-entry bibliography Computer Applications in Music by Deta Davis (A-R Editions). No direct comparison is intended (or possible), but my point is that many more papers are likely to exist. As a rough check, I looked for several pre-1990 AI and Music articles and books (including my own, of course) in the bibliography. Out of five papers from well-known sources, only one was listed. On the other hand, I discovered a number of papers in this report that were unknown to me, so I am grateful to have a new source of references. In their introduction, the authors acknowledge the need for more references and even offer a cup of coffee in reward for each new one. I will be sending a number of contributions, so the next time anyone is in Vienna, the coffee is on me. I hope the authors will continue to collect abstracts and publish an updated report in the future.

Auditory Scene Analysis: The Perceptual Organization of Sound by Albert Bregman (review)

Domain adaptation has shown promising advances for alleviating domain shift problem. However, recent visual domain adaptation works usually focus on non-sequential object recognition with a global coarse alignment, which is inadequate to transfer effective knowledge for sequence-like text images with variable-length fine-grained character information. In this paper, we develop a Sequence-to-Sequence Domain Adaptation Network (SSDAN) for robust text image recognition, which could exploit unsupervised sequence data by an attention-based sequence encoder-decoder network. In the SSDAN, a gated attention similarity (GAS) unit is introduced to adaptively focus on aligning the distribution of the source and target sequence data in an attended character-level feature space rather than a global coarse alignment. Extensive text recognition experiments show the SSDAN could efficiently transfer sequence knowledge and validate the promising power of the proposed model towards real world applications in various recognition scenarios, including the natural scene text, handwritten text and even mathematical expression recognition.

Sequence-To-Sequence Domain Adaptation Network for Robust Text Image Recognition

Audio deepfake detection is an emerging topic that was included in ASVspoof 2021. However, recent shared tasks have not covered many challenging real-life scenarios. The first Audio Deep Synthesis Detection challenge (ADD) was organized to fill this gap. ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake game (FG). The LF track focuses on bona fide and fully fake utterances with various real-world noises. The PF track aims to distinguish partially fake audio from real audio. The FG track is a rivalry game that includes two tasks: an audio generation task and an audio fake detection task. In this paper, we describe the datasets, evaluation metrics, and protocols. We also report major findings that reflect the recent advances in audio deepfake detection tasks.

ADD 2022: the first Audio Deep Synthesis Detection Challenge

Deep learning based speech separation usually uses a supervised algorithm to learn a mapping function from noisy features to separation targets. These separation targets, either ideal masks or magnitude spectrograms, have prominent spectro-temporal structures. Nonnegative matrix factorization (NMF) is a well-known representation learning technique that is capable of capturing the basic spectral structures. Therefore, combining deep learning and NMF into an integrated whole is a promising strategy. However, previous methods typically apply deep neural networks (DNN) and NMF to speech separation as separate stages. In this paper, we propose a joint scheme that combines the strengths of both DNN and NMF for speech separation. NMF is used to learn the basis spectra, which are then integrated into a DNN to directly reconstruct the magnitude spectrograms of speech and noise. Instead of predicting the activation coefficients inferred by NMF, which previous methods use as an intermediate target, the DNN in our system directly optimizes an actual separation objective, so that accumulated errors can be alleviated. Moreover, we explore a discriminative training objective with sparsity constraints to further suppress noise and preserve more speech components. Systematic experiments show that the proposed models are competitive with the previous methods.

Deep Learning Based Speech Separation via NMF-Style Reconstructions
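The NMF step the abstract above relies on, learning nonnegative basis spectra W and activations H such that V ≈ WH, can be sketched with the standard multiplicative updates for the squared Euclidean cost. This shows plain NMF only and is an assumption about the basis-learning stage. The integration of the bases into a DNN, as the paper proposes, is not shown, and all function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, rank, iters=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V (bins x frames) into W (bases) and H (activations)."""
    rng = random.Random(seed)
    n_bins, n_frames = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_bins)]
    h = [[rng.random() + 0.1 for _ in range(n_frames)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_frames)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_bins)]
    return w, h


# Toy "spectrogram": 3 frequency bins x 3 frames, all entries positive.
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0]]
W0, H0 = nmf(V, rank=2, iters=0)  # random initialization only
W, H = nmf(V, rank=2)             # after the multiplicative updates
```

The updates only multiply by nonnegative ratios, so W and H stay nonnegative throughout, which is what lets the columns of W be read as basis spectra.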

Speech separation and pitch estimation in noisy conditions are considered to be a "chicken-and-egg" problem. On one hand, pitch information is an important cue for speech separation. On the other hand, speech separation makes pitch estimation easier when background noise is removed. In this paper, we propose a supervised learning architecture to solve these two problems iteratively. The proposed algorithm is based on the deep stacking network (DSN), which provides a method for stacking simple processing modules to build deep architectures. Each module is a classifier whose target is the ideal binary mask (IBM), and the input vector includes spectral features, pitch-based features and the output from the previous module. During the testing stage, we estimate the pitch using the separation results and update the pitch-based features to the next module. When embedded into the DSN, pitch estimation and speech separation each run several times. We obtain the final results from the last module. Systematic evaluations show that the proposed system results in both a high quality estimated binary mask and accurate pitch estimation and outperforms recent systems in its generalization ability.

A pairwise algorithm using the deep stacking network for speech separation and pitch estimation
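The ideal binary mask (IBM) used as the classification target in the abstract above is commonly defined per time-frequency unit: 1 where the local SNR exceeds a threshold, 0 otherwise. The sketch below follows that common definition. The threshold parameter `lc_db` (local criterion, 0 dB here) and the function name are assumptions, since the abstract does not state the paper's settings.

```python
import math


def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Compute the IBM from premixed speech and noise energies.

    Both inputs are lists of frames, each frame a list of per-bin energies.
    A unit is labeled 1 when its local SNR in dB exceeds lc_db, else 0.
    """
    tiny = 1e-12  # guards against log of zero and division by zero
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            snr_db = 10.0 * math.log10((s + tiny) / (n + tiny))
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


speech = [[4.0, 1.0], [0.5, 9.0]]
noise = [[1.0, 4.0], [2.0, 1.0]]
ibm = ideal_binary_mask(speech, noise)
```

Computing the IBM requires the clean speech and noise separately, so it is available only when building training targets. At test time each module has to estimate it from the mixture.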

Recently, end-to-end systems have made significant breakthroughs in the field of speech recognition. However, a single end-to-end architecture is not especially robust to input variations caused by noise and reverberation, resulting in dramatic performance degradation in real-world conditions. To alleviate this issue, the mainstream approach is to use a well-designed speech enhancement module as the front-end of ASR. However, enhancement modules can introduce speech distortions and mismatches with the training data, which sometimes degrade ASR performance. In this paper, we propose a jointly adversarial enhancement training to boost the robustness of end-to-end systems. Specifically, we use a jointly compositional scheme of a mask-based enhancement network, an attention-based encoder-decoder network and a discriminant network during training. The discriminator is used to distinguish between the enhanced features from the enhancement network and clean features, which guides the enhancement network to produce outputs closer to the realistic distribution. With the joint optimization of the recognition, enhancement and adversarial losses, the compositional scheme is expected to learn more robust representations for the recognition task automatically. Systematic experiments on AISHELL-1 show that the proposed method improves the noise robustness of end-to-end systems and achieves a relative error rate reduction of 4.6% over multi-condition training.

Shuai Nie

Papers

Sequence-To-Sequence Domain Adaptation Network for Robust Text Image Recognition

ADD 2022: the first Audio Deep Synthesis Detection Challenge

Deep Learning Based Speech Separation via NMF-Style Reconstructions

A pairwise algorithm using the deep stacking network for speech separation and pitch estimation

Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition