
Showing papers in "arXiv: Audio and Speech Processing in 2017"


Posted Content
TL;DR: In this article, three models for symbolic multi-track music generation under the framework of generative adversarial networks (GANs) are proposed, referred to as the jamming model, the composer model and the hybrid model.
Abstract: Generating music has a few notable differences from generating images and videos. First, music is an art of time, necessitating a temporal model. Second, music is usually composed of multiple instruments/tracks with their own temporal dynamics, but collectively they unfold over time interdependently. Lastly, musical notes are often grouped into chords, arpeggios or melodies in polyphonic music, and thereby introducing a chronological ordering of notes is not naturally suitable. In this paper, we propose three models for symbolic multi-track music generation under the framework of generative adversarial networks (GANs). The three models, which differ in the underlying assumptions and accordingly the network architectures, are referred to as the jamming model, the composer model and the hybrid model. We trained the proposed models on a dataset of over one hundred thousand bars of rock music and applied them to generate piano-rolls of five tracks: bass, drums, guitar, piano and strings. A few intra-track and inter-track objective metrics are also proposed to evaluate the generative results, in addition to a subjective user study. We show that our models can generate coherent music of four bars right from scratch (i.e. without human inputs). We also extend our models to human-AI cooperative music generation: given a specific track composed by human, we can generate four additional tracks to accompany it. All code, the dataset and the rendered audio samples are available at this https URL .

319 citations


Posted Content
TL;DR: This paper proposes a new loss function, the generalized end-to-end (GE2E) loss, which makes the training of speaker verification models more efficient than the authors' previous tuple-based end-to-end (TE2E) loss and does not require an initial stage of example selection.
Abstract: In this paper, we propose a new loss function called generalized end-to-end (GE2E) loss, which makes the training of speaker verification models more efficient than our previous tuple-based end-to-end (TE2E) loss function. Unlike TE2E, the GE2E loss function updates the network in a way that emphasizes examples that are difficult to verify at each step of the training process. Additionally, the GE2E loss does not require an initial stage of example selection. With these properties, our model with the new loss function decreases speaker verification EER by more than 10%, while reducing the training time by 60% at the same time. We also introduce the MultiReader technique, which allows us to do domain adaptation - training a more accurate model that supports multiple keywords (i.e. "OK Google" and "Hey Google") as well as multiple dialects.

239 citations


Posted Content
TL;DR: This work combines LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out-of-domain data from voice search logs.
Abstract: For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. Our system is evaluated on three standard public datasets, suggesting that d-vector based diarization systems offer significant advantages over traditional i-vector based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while our model is trained with out-of-domain data from voice search logs.

170 citations


Posted Content
TL;DR: This paper presents a single sequence-to-sequence ASR model trained on 9 different Indian languages, which have very little overlap in their scripts, and finds that this model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually.
Abstract: Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word unit, lexicon and word inventories are typically language specific. In contrast, sequence-to-sequence models are well suited for multilingual ASR because they encapsulate an acoustic, pronunciation and language model jointly in a single network. In this work we present a single sequence-to-sequence ASR model trained on 9 different Indian languages, which have very little overlap in their scripts. Specifically, we take a union of language-specific grapheme sets and train a grapheme-based sequence-to-sequence model jointly on data from all languages. We find that this model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually. By modifying the model to accept a language identifier as an additional input feature, we further improve performance by an additional 7% relative and eliminate confusion between different languages.
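
The union of language-specific grapheme sets described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_joint_vocab`, the toy inventories, and the shared space symbol are all hypothetical.

```python
def build_joint_vocab(grapheme_sets):
    """Union per-language grapheme inventories into one sorted output vocabulary."""
    joint = set()
    for graphemes in grapheme_sets.values():
        joint |= set(graphemes)
    # Sorting gives every grapheme a stable index across runs.
    return sorted(joint)


# Hypothetical toy inventories; real inventories are far larger.
per_language = {
    "hindi": ["क", "ख", " "],
    "tamil": ["க", "ச", " "],
    "kannada": ["ಕ", "ಖ", " "],
}

vocab = build_joint_vocab(per_language)
# One shared output layer indexes into this joint vocabulary.
token_to_id = {grapheme: index for index, grapheme in enumerate(vocab)}
```

Because the scripts barely overlap, the joint vocabulary is close to the sum of the individual inventories; in this toy case only the space symbol is shared.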

95 citations


Posted Content
TL;DR: This article used shallow fusion with an external language model at inference time to improve the performance of a competitive attention-based sequence-to-sequence model, obviating the need for second-pass rescoring.
Abstract: Attention-based sequence-to-sequence models for automatic speech recognition jointly train an acoustic model, language model, and alignment mechanism. Thus, the language model component is only trained on transcribed audio-text pairs. This leads to the use of shallow fusion with an external language model at inference time. Shallow fusion refers to log-linear interpolation with a separately trained language model at each step of the beam search. In this work, we investigate the behavior of shallow fusion across a range of conditions: different types of language models, different decoding units, and different tasks. On Google Voice Search, we demonstrate that the use of shallow fusion with a neural LM with wordpieces yields a 9.1% relative word error rate reduction (WERR) over our competitive attention-based sequence-to-sequence model, obviating the need for second-pass rescoring.
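
The log-linear interpolation that this abstract calls shallow fusion can be sketched for a single beam-search step as follows. This is an illustrative sketch, not the paper's implementation: the function name `shallow_fusion_step`, the weight `lm_weight`, and the toy probabilities are assumptions.

```python
import math


def shallow_fusion_step(s2s_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine per-token log-probabilities for one beam-search step.

    Both arguments map candidate tokens to log-probabilities. The
    interpolation weight is a hypothetical tuning parameter, not a value
    taken from the paper.
    """
    return {
        token: s2s_log_probs[token] + lm_weight * lm_log_probs[token]
        for token in s2s_log_probs
    }


# Toy step: the sequence-to-sequence model slightly prefers "their",
# while the external language model strongly prefers "there".
s2s = {"their": math.log(0.55), "there": math.log(0.45)}
lm = {"their": math.log(0.10), "there": math.log(0.90)}

fused = shallow_fusion_step(s2s, lm, lm_weight=0.3)
best = max(fused, key=fused.get)
```

With `lm_weight=0.0` the ranking falls back to the sequence-to-sequence scores alone, which makes the weight a convenient knob for checking how much the external model changes the hypotheses.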

63 citations


Posted Content
TL;DR: In this article, the authors explore the possibility of training a single model to serve different English dialects, which simplifies the process of training multi-dialect systems without the need for separate acoustic, pronunciation and language models for each dialect.
Abstract: Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding separate components of a typical system, namely acoustic (AM), pronunciation (PM) and language (LM) models into a single neural network. In this work, we look at one such sequence-to-sequence model, namely listen, attend and spell (LAS), and explore the possibility of training a single model to serve different English dialects, which simplifies the process of training multi-dialect systems without the need for separate AM, PM and LMs for each dialect. We show that simply pooling the data from all dialects into one LAS model falls behind the performance of a model fine-tuned on each dialect. We then look at incorporating dialect-specific information into the model, both by modifying the training targets by inserting the dialect symbol at the end of the original grapheme sequence and also feeding a 1-hot representation of the dialect information into all layers of the model. Experimental results on seven English dialects show that our proposed system is effective in modeling dialect variations within a single LAS model, outperforming a LAS model trained individually on each of the seven dialects by 3.1 ~ 16.5% relative.
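
The two ways of injecting dialect information mentioned in this abstract (a dialect symbol appended to the target sequence, and a 1-hot dialect vector fed to the model) can be sketched as below. The helper names, the `<en-gb>` token format, and the dialect list are hypothetical and only serve the illustration.

```python
def add_dialect_target(graphemes, dialect):
    """Append a dialect symbol to the end of the grapheme target sequence."""
    return list(graphemes) + [f"<{dialect}>"]


def one_hot_dialect(dialect, dialects):
    """Return a 1-hot vector marking the position of `dialect` in `dialects`."""
    if dialect not in dialects:
        raise ValueError(f"unknown dialect: {dialect}")
    return [1.0 if name == dialect else 0.0 for name in dialects]


# Hypothetical dialect inventory; the paper covers seven English dialects.
DIALECTS = ["en-au", "en-gb", "en-in", "en-us"]

targets = add_dialect_target("hello", "en-gb")
dialect_vector = one_hot_dialect("en-gb", DIALECTS)
```

In a full model the 1-hot vector would be concatenated to the inputs of each layer; that wiring depends on the network framework and is left out here.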

57 citations


Posted Content
TL;DR: This work develops an end-to-end speaker verification system that is initialized to mimic an i-vector + PLDA baseline and, after regularized end-to-end training, outperforms that baseline on both long and short duration utterances.
Abstract: Recently several end-to-end speaker verification systems based on deep neural networks (DNNs) have been proposed. These systems have been proven to be competitive for text-dependent tasks as well as for text-independent tasks with short utterances. However, for text-independent tasks with longer utterances, end-to-end systems are still outperformed by standard i-vector + PLDA systems. In this work, we develop an end-to-end speaker verification system that is initialized to mimic an i-vector + PLDA baseline. The system is then further trained in an end-to-end manner but regularized so that it does not deviate too far from the initial system. In this way we mitigate overfitting which normally limits the performance of end-to-end systems. The proposed system outperforms the i-vector + PLDA baseline on both long and short duration utterances.

45 citations


Posted Content
TL;DR: This work describes how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s and shows that the speech produced by the system is able to additionally perform implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener.
Abstract: Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the speech produced by the system is able to additionally perform implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener, even when that speaker has not been used during the training of the generative model.

41 citations


Posted Content
TL;DR: Experiments show that the performance of the universal phoneme-based CTC system can be improved by applying LHUC and that the system is extensible to new phonemes during cross-lingual adaptation; applying dropout during adaptation further improves the system, which achieves competitive performance with Deep Neural Network / Hidden Markov Model (DNN/HMM) systems on limited data.
Abstract: Multilingual models for Automatic Speech Recognition (ASR) are attractive as they have been shown to benefit from more training data, and better lend themselves to adaptation to under-resourced languages. However, initialisation from monolingual context-dependent models leads to an explosion of context-dependent states. Connectionist Temporal Classification (CTC) is a potential solution to this as it performs well with monophone labels. We investigate multilingual CTC in the context of adaptation and regularisation techniques that have been shown to be beneficial in more conventional contexts. The multilingual model is trained to model a universal International Phonetic Alphabet (IPA)-based phone set using the CTC loss function. Learning Hidden Unit Contribution (LHUC) is investigated to perform language adaptive training. In addition, dropout during cross-lingual adaptation is also studied and tested in order to mitigate the overfitting problem. Experiments show that the performance of the universal phoneme-based CTC system can be improved by applying LHUC and it is extensible to new phonemes during cross-lingual adaptation. Updating all the parameters shows consistent improvement on limited data. Applying dropout during adaptation can further improve the system and achieve competitive performance with Deep Neural Network / Hidden Markov Model (DNN/HMM) systems on limited data.

37 citations


Proceedings Article
TL;DR: In this paper, an LSTM network is used to predict skeleton points from audio recordings of a violin or a piano player playing, and the predicted points are applied onto a rigged avatar to create the animation.
Abstract: We present a method that takes as input audio of violin or piano playing and outputs a video of skeleton predictions, which are then used to animate an avatar. The key idea is to create an animation of an avatar that moves its hands similarly to how a pianist or violinist would, just from audio. Fully detailed, correct arm and finger motion is the ultimate goal; however, it is not clear whether body movement can be predicted from music at all. In this paper, we present the first result showing that natural body dynamics can indeed be predicted. We built an LSTM network that is trained on violin and piano recital videos uploaded to the Internet. The predicted points are applied onto a rigged avatar to create the animation.

35 citations


Posted Content
TL;DR: In this paper, the authors explore the potential of deep learning in classifying audio concepts on User-Generated Content videos, using two cascaded neural networks in a hierarchical configuration to analyze the short- and long-term context information.
Abstract: Audio-based multimedia retrieval tasks may identify semantic information in audio streams, i.e., audio concepts (such as music, laughter, or a revving engine). Conventional Gaussian-Mixture-Models have had some success in classifying a reduced set of audio concepts. However, multi-class classification can benefit from context window analysis and the discriminating power of deeper architectures. Although deep learning has shown promise in various applications such as speech and object recognition, it has not yet met the expectations for other fields such as audio concept classification. This paper explores, for the first time, the potential of deep learning in classifying audio concepts on User-Generated Content videos. The proposed system is comprised of two cascaded neural networks in a hierarchical configuration to analyze the short- and long-term context information. Our system outperforms a GMM approach by a relative 54%, a Neural Network by 33%, and a Deep Neural Network by 12% on the TRECVID-MED database.

Posted Content
TL;DR: It is demonstrated that deep neural networks that use lower bit precision significantly reduce the processing time (up to 30x); however, their performance impact is low only in the case of classification tasks such as those present in voice activity detection.
Abstract: While deep neural networks have shown powerful performance in many audio applications, their large computation and memory demand has been a challenge for real-time processing. In this paper, we study the impact of scaling the precision of neural networks on the performance of two common audio processing tasks, namely, voice-activity detection and single-channel speech enhancement. We determine the optimal pair of weight/neuron bit precision by exploring its impact on both the performance and processing time. Through experiments conducted with real user data, we demonstrate that deep neural networks that use lower bit precision significantly reduce the processing time (up to 30x). However, their performance impact is low (< 3.14%) only in the case of classification tasks such as those present in voice activity detection.
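
To make the idea of scaling bit precision concrete, the sketch below rounds values onto a symmetric uniform grid. The abstract does not specify the quantization scheme, so this uniform quantizer is an assumption chosen for illustration; the function name `quantize_uniform`, the clipping range `max_abs`, and the toy weights are hypothetical.

```python
def quantize_uniform(values, num_bits, max_abs=1.0):
    """Round values onto a symmetric uniform grid.

    The grid has 2**(num_bits - 1) - 1 levels on each side of zero.
    This scheme is an illustrative assumption, not the paper's method.
    """
    if num_bits < 2:
        raise ValueError("num_bits must be at least 2 for a signed grid")
    levels = 2 ** (num_bits - 1) - 1
    step = max_abs / levels
    quantized = []
    for value in values:
        # Clip to the representable range, then snap to the nearest level.
        clipped = max(-max_abs, min(max_abs, value))
        quantized.append(round(clipped / step) * step)
    return quantized


weights = [0.30, -0.81, 1.50, 0.004]
q8 = quantize_uniform(weights, num_bits=8)  # fine grid, small rounding error
q3 = quantize_uniform(weights, num_bits=3)  # coarse grid, large rounding error
```

Lower bit widths shrink storage and arithmetic cost but enlarge the rounding error per value, which is the trade-off the paper evaluates on voice-activity detection and speech enhancement.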

Posted Content
TL;DR: This work focuses on reliable detection and segmentation of bird vocalizations as recorded in the open field, and suggests two approaches: first, DenseNets are applied to weakly labeled data to infer the attention map of the dataset, and second, a deep autoencoder, namely the U-net, is used to produce a binary mask that encircles the spectral blobs of vocalizations while suppressing other audio sources.
Abstract: This work focuses on reliable detection and segmentation of bird vocalizations as recorded in the open field. Acoustic detection of avian sounds can be used for the automatized monitoring of multiple bird taxa and querying in long-term recordings for species of interest. These tasks are tackled in this work, by suggesting two approaches: A) First, DenseNets are applied to weakly labeled data to infer the attention map of the dataset (i.e. Salience and CAM). We push this idea further by directing attention maps to the YOLO v2 deep-net-based detection framework to localize bird vocalizations. B) A deep autoencoder, namely the U-net, maps the audio spectrogram of bird vocalizations to its corresponding binary mask that encircles the spectral blobs of vocalizations while suppressing other audio sources. We focus solely on procedures requiring minimum human attendance, suitable to scan massive volumes of data, in order to analyze them, evaluate insights and hypotheses and identify patterns of bird activity. Hopefully, this approach will be valuable to researchers, conservation practitioners, and decision makers that need to design policies on biodiversity issues.

Posted Content
TL;DR: The second task of DCASE 2016 evaluated sound event detection systems on synthetic mixtures of office sounds, studying how the tested algorithms behave under controlled levels of audio complexity, with the added benefit of a very accurate ground truth.
Abstract: As part of the 2016 public evaluation challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016), the second task focused on evaluating sound event detection systems using synthetic mixtures of office sounds. This task, which follows the `Event Detection - Office Synthetic' task of DCASE 2013, studies the behaviour of tested algorithms when facing controlled levels of audio complexity with respect to background noise and polyphony/density, with the added benefit of a very accurate ground truth. This paper presents the task formulation, evaluation metrics, submitted systems, and provides a statistical analysis of the results achieved, with respect to various aspects of the evaluation dataset.

Posted Content
TL;DR: This work extends the authors' previous approach to training CTC-based systems multilingually and builds systems based on either graphemes or phonemes, reducing the performance gap between monolingual and multilingual setups.
Abstract: Training automatic speech recognition (ASR) systems requires large amounts of data in the target language in order to achieve good performance. Whereas large training corpora are readily available for languages like English, there exists a long tail of languages which suffer from a lack of resources. One method to handle data sparsity is to use data from additional source languages and build a multilingual system. Recently, ASR systems based on recurrent neural networks (RNNs) trained with connectionist temporal classification (CTC) have gained substantial research interest. In this work, we extended our previous approach towards training CTC-based systems multilingually. Our systems feature a global phone set, based on the joint phone sets of each source language. We evaluated the use of different language combinations as well as the addition of Language Feature Vectors (LFVs). As a contrastive experiment, we built systems based on graphemes as well. Systems having a multilingual phone set are known to suffer in performance compared to their monolingual counterparts. With our proposed approach, we could reduce the gap between these mono- and multilingual setups, using either graphemes or phonemes.

Posted Content
TL;DR: The authors propose two supervised learning settings that utilize spontaneity to improve speech emotion recognition: a hierarchical model that performs spontaneity detection before performing emotion recognition, and a multitask learning model that jointly learns to recognize both spontaneity and emotion.
Abstract: We investigate the effect and usefulness of spontaneity (i.e. whether a given speech is spontaneous or not) in speech in the context of emotion recognition. We hypothesize that emotional content in speech is interrelated with its spontaneity, and use spontaneity classification as an auxiliary task to the problem of emotion recognition. We propose two supervised learning settings that utilize spontaneity to improve speech emotion recognition: a hierarchical model that performs spontaneity detection before performing emotion recognition, and a multitask learning model that jointly learns to recognize both spontaneity and emotion. Through various experiments on the well known IEMOCAP database, we show that by using spontaneity detection as an additional task, significant improvement can be achieved over emotion recognition systems that are unaware of spontaneity. We achieve state-of-the-art emotion recognition accuracy (4-class, 69.1%) on the IEMOCAP database outperforming several relevant and competitive baselines.

Journal Article
TL;DR: In this article, a joint pressure and velocity matching (JPVM) approach is proposed to control the sound field inside the local listening zones by evoking the sound pressure and particle velocity on surrounding contours.
Abstract: In this paper, a recently proposed approach to multizone sound field synthesis, referred to as Joint Pressure and Velocity Matching (JPVM), is investigated analytically using a spherical harmonics representation of the sound field. The approach is motivated by the Kirchhoff-Helmholtz integral equation and aims at controlling the sound field inside the local listening zones by evoking the sound pressure and particle velocity on surrounding contours. Based on the findings of the modal analysis, an improved version of JPVM is proposed which provides both better performance and lower complexity. In particular, it is shown analytically that the optimization of the tangential component of the particle velocity vector, as is done in the original JPVM approach, is very susceptible to errors and thus not pursued anymore. The analysis furthermore provides fundamental insights as to how the spherical harmonics used to describe the 3D variant sound field translate into 2D basis functions as observed on the contours surrounding the zones. By means of simulations, it is verified that discarding the tangential component of the particle velocity vector ultimately leads to an improved performance. Finally, the impact of sensor noise on the reproduction performance is assessed.

Posted Content
TL;DR: This paper analyzes the use of attention mechanisms for sequence summarization in the authors' end-to-end text-dependent speaker recognition system and shows that attention-based models can improve the Equal Error Rate (EER) of the speaker verification system by 14% relative compared to their non-attention LSTM baseline model.
Abstract: Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning due to their ability to summarize relevant information that expands through the entire length of an input sequence. In this paper, we analyze the usage of attention mechanisms to the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies and their variants of the attention layer, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by relatively 14% compared to our non-attention LSTM baseline model.
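
For readers unfamiliar with the Equal Error Rate reported here, the sketch below approximates it from verification scores by sweeping a decision threshold until the false-accept and false-reject rates meet. The function name `equal_error_rate` and the toy scores are hypothetical; production toolkits usually interpolate along the error curve rather than pick from observed thresholds.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest to each other.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        false_reject = sum(s < threshold for s in target_scores) / len(target_scores)
        false_accept = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(false_accept - false_reject)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (false_accept + false_reject) / 2
    return eer


# Toy scores: one same-speaker trial scores below one impostor trial.
same_speaker = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(same_speaker, impostor)
```

A 14% relative improvement means the new EER is 0.86 times the baseline EER, not that the EER drops by 14 percentage points.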

Posted Content
TL;DR: The DIRHA-ENGLISH multi-microphone corpus is composed of both real and simulated material and includes 12 US and 12 UK native English speakers; each speaker uttered different sets of phonetically rich sentences, newspaper articles, conversational speech, keywords, and commands.
Abstract: This paper introduces the contents and the possible usage of the DIRHA-ENGLISH multi-microphone corpus, recently realized under the EC DIRHA project. The reference scenario is a domestic environment equipped with a large number of microphones and microphone arrays distributed in space. The corpus is composed of both real and simulated material, and it includes 12 US and 12 UK English native speakers. Each speaker uttered different sets of phonetically-rich sentences, newspaper articles, conversational speech, keywords, and commands. From this material, a large set of 1-minute sequences was generated, which also includes typical domestic background noise as well as inter/intra-room reverberation effects. Dev and test sets were derived, which represent a very precious material for different studies on multi-microphone speech processing and distant-speech recognition. Various tasks and corresponding Kaldi recipes have already been developed. The paper reports a first set of baseline results obtained using different techniques, including Deep Neural Networks (DNN), aligned with the state-of-the-art at international level.

Posted Content
TL;DR: This paper revises the classical contaminated-speech approach in the context of modern DNN-HMM systems and proposes three methods, namely asymmetric context windowing, close-talk based supervision, and close-talk based pre-training, which together yield a 15% error rate reduction over the baseline systems.
Abstract: Despite the significant progress made in the last years, state-of-the-art speech recognition technologies provide a satisfactory performance only in the close-talking condition. Robustness of distant speech recognition in adverse acoustic conditions, on the other hand, remains a crucial open issue for future applications of human-machine interaction. To this end, several advances in speech enhancement, acoustic scene analysis as well as acoustic modeling, have recently contributed to improve the state-of-the-art in the field. One of the most effective approaches to derive a robust acoustic modeling is based on using contaminated speech, which proved helpful in reducing the acoustic mismatch between training and testing conditions. In this paper, we revise this classical approach in the context of modern DNN-HMM systems, and propose the adoption of three methods, namely, asymmetric context windowing, close-talk based supervision, and close-talk based pre-training. The experimental results, obtained using both real and simulated data, show a significant advantage in using these three methods, overall providing a 15% error rate reduction compared to the baseline systems. The same trend in performance is confirmed with both a small, high-quality training set and a large one.

Journal Article
TL;DR: In this paper, a new approach for the analysis of nonstationary signals is proposed, with a focus on audio applications, via stationarity-breaking operators acting on Gaussian stationary random signals.
Abstract: A new approach for the analysis of nonstationary signals is proposed, with a focus on audio applications. Following earlier contributions, nonstationarity is modeled via stationarity-breaking operators acting on Gaussian stationary random signals. The focus is on time warping and amplitude modulation, and an approximate maximum-likelihood approach based on suitable approximations in the wavelet transform domain is developed. This paper provides theoretical analysis of the approximations, and introduces JEFAS, a corresponding estimation algorithm. The latter is tested and validated on synthetic as well as real audio signals.

Posted Content
TL;DR: In this article, a robust time-warping algorithm which synchronizes two singing recordings can provide a promising solution to the singing voice correction problem, which aligns amateur singing recordings to professional ones, and a new pitch contour is generated given the alignment information.
Abstract: Expressive singing voice correction is an appealing but challenging problem. A robust time-warping algorithm which synchronizes two singing recordings can provide a promising solution. We thereby propose to address the problem by canonical time warping (CTW) which aligns amateur singing recordings to professional ones. A new pitch contour is generated given the alignment information, and a pitch-corrected singing is synthesized back through the vocoder. The objective evaluation shows that CTW is robust against pitch-shifting and time-stretching effects, and the subjective test demonstrates that CTW prevails over the other methods, including DTW and the commercial auto-tuning software. Finally, we demonstrate the applicability of the proposed method in a practical, real-world scenario.

Posted Content
TL;DR: It is shown that for Carnatic music, the note transitions and movements have a greater role in defining the raga structure than the exact note positions, and that the proposed stochastic models of repetitive note patterns, obtained from raga notations of known compositions, outperform the state-of-the-art melody-based raga identification technique on equivalent melodic data corresponding to the same compositions.
Abstract: Carnatic music, a form of Indian Art Music, has relied on an oral tradition for transferring knowledge across several generations. Over the last two hundred years, the use of prescriptive notations has been adopted for learning, sight-playing and sight-singing. Prescriptive notations offer generic guidelines for a raga rendition and do not include information about the ornamentations or the gamakas, which are considered to be critical for characterizing a raga. In this paper, we show that prescriptive notations contain raga attributes and can reliably identify a raga of Carnatic music from its octave-folded prescriptive notations. We restrict the notations to 7 notes and suppress the finer note position information. A dictionary based approach captures the statistics of repetitive note patterns within a raga notation. The proposed stochastic models of repetitive note patterns (or SMRNP in short), obtained from raga notations of known compositions, outperform the state-of-the-art melody-based raga identification technique on equivalent melodic data corresponding to the same compositions. This in turn shows that for Carnatic music, the note transitions and movements have a greater role in defining the raga structure than the exact note positions.
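
A dictionary of repetitive note patterns, as mentioned in this abstract, can be approximated by counting short note n-grams in a 7-note, octave-folded sequence. This is only a sketch of the general idea: the actual SMRNP models are not specified here, and the function name `note_pattern_counts`, the `max_len` parameter, and the toy phrase are assumptions.

```python
from collections import Counter


def note_pattern_counts(notes, max_len=3):
    """Count note patterns of length 2..max_len and keep those that repeat."""
    counts = Counter()
    for length in range(2, max_len + 1):
        for start in range(len(notes) - length + 1):
            counts[tuple(notes[start:start + length])] += 1
    # A pattern is "repetitive" here if it occurs more than once.
    return {pattern: count for pattern, count in counts.items() if count > 1}


# Hypothetical phrase using the seven svara names S R G M P D N.
phrase = "S R G M P M G R S R G M".split()
patterns = note_pattern_counts(phrase, max_len=3)
```

Comparing such pattern statistics across ragas relies on note transitions rather than exact pitch positions, which matches the conclusion drawn in the abstract.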

Posted Content
TL;DR: In this paper, the authors describe an approach to the generation of realistic corpora in a domestic context and demonstrate that a comparable performance trend can be observed with both real and simulated data across different recognition frameworks, acoustic models, as well as multi-microphone processing techniques.
Abstract: The availability of realistic simulated corpora is of key importance for the future progress of distant speech recognition technology. The reliability, flexibility and low computational cost of a data simulation process may ultimately allow researchers to train, tune and test different techniques in a variety of acoustic scenarios, avoiding the laborious effort of directly recording real data from the targeted environment. In the last decade, several simulated corpora have been released to the research community, including the data-sets distributed in the context of projects and international challenges, such as CHiME and REVERB. These efforts were extremely useful to derive baselines and common evaluation frameworks for comparison purposes. At the same time, in many cases they highlighted the need of a better coherence between real and simulated conditions. In this paper, we examine this issue and we describe our approach to the generation of realistic corpora in a domestic context. Experimental validation, conducted in a multi-microphone scenario, shows that a comparable performance trend can be observed with both real and simulated data across different recognition frameworks, acoustic models, as well as multi-microphone processing techniques.

Posted Content
TL;DR: In this paper, the authors found that altering the level of emphasis on landmarks, by re-weighting the acoustic likelihood of frames accordingly, tends to reduce the phone error rate (PER). Using landmarks as a heuristic, one of their hybrid DNN frame-dropping strategies maintained a PER within 0.44% of optimal when scoring less than half (41.2%) of the frames.
Abstract: Most mainstream Automatic Speech Recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory is based on a contradictory idea, that some frames are more important than others. Acoustic landmark theory exploits the quantal nonlinear articulatory-acoustic relationships from human speech perception experiments, and provides theoretical support for extracting acoustic features in the vicinity of landmark regions where an abrupt change occurs in the spectrum of speech signals. In this work, we conduct experiments on the TIMIT corpus with both GMM- and DNN-based ASR systems and find that frames containing landmarks are more informative than others. We find that altering the level of emphasis on landmarks, by re-weighting the acoustic likelihood of frames accordingly, tends to reduce the phone error rate (PER). Furthermore, by leveraging the landmark as a heuristic, one of our hybrid DNN frame-dropping strategies maintained a PER within 0.44% of optimal when scoring less than half (41.2% to be precise) of the frames. This hybrid strategy outperforms other non-heuristic-based methods and demonstrates the potential of landmarks for reducing computation.
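
The re-weighting and frame-dropping ideas above can be pictured with a small sketch. It assumes per-frame acoustic log-likelihoods and a landmark flag per frame are already available. The function names, the weights and the numbers are hypothetical, and "dropping" is simplified here to giving non-landmark frames a weight of zero, which may differ from the strategies evaluated in the paper.

```python
def weighted_log_likelihood(frame_loglikes, is_landmark,
                            landmark_weight=1.0, other_weight=1.0):
    """Sum per-frame acoustic log-likelihoods with landmark-dependent weights."""
    if len(frame_loglikes) != len(is_landmark):
        raise ValueError("need one landmark flag per frame")
    return sum((landmark_weight if lm else other_weight) * ll
               for ll, lm in zip(frame_loglikes, is_landmark))

def scored_fraction(is_landmark, other_weight):
    """Fraction of frames that must still be scored; zero-weight frames are skipped."""
    if other_weight != 0:
        return 1.0
    return sum(1 for lm in is_landmark if lm) / len(is_landmark)

# Hypothetical values for a four-frame utterance.
loglikes = [-2.0, -1.0, -3.0, -0.5]
landmarks = [False, True, False, True]

baseline = weighted_log_likelihood(loglikes, landmarks)                        # all frames equal
emphasized = weighted_log_likelihood(loglikes, landmarks, landmark_weight=2.0)  # stress landmarks
dropped = weighted_log_likelihood(loglikes, landmarks, other_weight=0.0)        # landmarks only
```

With `other_weight=0.0` only half of the toy frames contribute, which is the kind of computation saving the abstract refers to.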

Posted Content
TL;DR: This work derives a generic overcomplete frame thresholding scheme based on risk minimization and validate the method on a large scale bird activity detection task via the scattering network architecture performed by means of continuous wavelets, known for being an adequate dictionary in audio environments.
Abstract: In this work, we derive a generic overcomplete frame thresholding scheme based on risk minimization. Overcomplete frames being favored for analysis tasks such as classification, regression or anomaly detection, we provide a way to leverage those optimal representations in real-world applications through the use of thresholding. We validate the method on a large scale bird activity detection task via the scattering network architecture performed by means of continuous wavelets, known for being an adequate dictionary in audio environments.

Posted Content
TL;DR: Mel-Frequency Cepstral Coefficients are used to extract the features of a voice in order to judge whether a speaker is present in a multi-speaker environment and to determine who the speaker is.
Abstract: This paper proposes an original statistical decision theory to accomplish a multi-speaker recognition task in the cocktail party problem. This theory relies on the assumption that the varied frequencies of speakers obey a Gaussian distribution and that the relationship of their voiceprints can be represented by Euclidean distance vectors. This paper uses Mel-Frequency Cepstral Coefficients to extract the features of a voice in order to judge whether a speaker is present in a multi-speaker environment and to determine who the speaker is. Finally, a thirteen-dimensional constellation drawing is established by mapping from Manhattan distances of speakers in order to take a thorough consideration of gross influential factors.

Posted Content
TL;DR: A new method to enhance head shadow in low frequencies is presented, resulting in interaural level differences that can be used to unambiguously localize sounds and is promising for bilateral cochlear implant or hearing aid users and for improved speech perception in multi-talker environments.
Abstract: A new method to enhance head shadow in low frequencies is presented, resulting in interaural level differences that can be used to unambiguously localize sounds. Enhancement is achieved with a fixed beamformer with ipsilateral directionality in each ear. The microphone array consists of one microphone per device. The method naturally handles multiple sources without sound location estimations. In a localization experiment with simulated bimodal listeners, performance improved from 51° to 28° root-mean-square error compared with standard omni-directional microphones. The method is also promising for bilateral cochlear implant or hearing aid users and for improved speech perception in multi-talker environments.

Posted Content
TL;DR: In this article, the authors proposed Language Feature Vectors (LFVs) to train language adaptive multilingual systems using recurrent neural networks (RNNs) and applied them to the hidden layers of RNNs.
Abstract: In this work, we focus on multilingual systems based on recurrent neural networks (RNNs), trained using the Connectionist Temporal Classification (CTC) loss function. Using a multilingual set of acoustic units poses difficulties. To address this issue, we proposed Language Feature Vectors (LFVs) to train language adaptive multilingual systems. Language adaptation, in contrast to speaker adaptation, needs to be applied not only on the feature level, but also to deeper layers of the network. In this work, we therefore extended our previous approach by introducing a novel technique which we call "modulation". Based on this method, we modulated the hidden layers of RNNs using LFVs. We evaluated this approach in both full and low resource conditions, as well as for grapheme and phone based systems. Lower error rates throughout the different conditions could be achieved by the use of the modulation.
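
A minimal sketch of what modulating a hidden layer with an LFV could look like follows. The abstract does not define the operation, so element-wise scaling of the layer outputs is an assumption here. The layer is a plain feed-forward tanh layer rather than a recurrent one, and the function name `modulated_layer` and all numbers are hypothetical.

```python
import math

def modulated_layer(inputs, weights, biases, lfv):
    """One tanh layer whose outputs are scaled element-wise by a language feature vector.

    Assumption: "modulation" is modelled as element-wise multiplication.
    """
    if len(lfv) != len(weights):
        raise ValueError("LFV needs one entry per hidden unit")
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
              for row, b in zip(weights, biases)]
    return [h * g for h, g in zip(hidden, lfv)]

# Hypothetical two-unit layer: the LFV damps the first unit and boosts the second.
out = modulated_layer(inputs=[1.0, 0.0],
                      weights=[[1.0, 0.0], [0.0, 1.0]],
                      biases=[0.0, 0.0],
                      lfv=[0.5, 2.0])
```

Under this assumption the language information acts as a per-unit gain inside the network, in contrast to appending the LFV to the input features only.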

Posted Content
TL;DR: The proposed solution is able to find sampling points that result in a better-conditioned SHM while also maintaining all the application-specific requirements.
Abstract: In this paper, we study the conditioning of the Spherical Harmonic Matrix (SHM), which is widely used in the discrete, limited-order orthogonal representation of sound fields. SHMs have been widely used in audio applications such as spatial sound reproduction using loudspeakers and the orthogonal representation of Head Related Transfer Functions (HRTFs). The conditioning behaviour of the SHM depends on the sampling positions chosen in the 3D space. Identifying the optimal sampling points in the continuous 3D space that result in a well-conditioned SHM for any number of sampling points is a highly challenging task. In this work, an attempt has been made to solve a discrete version of the above problem using optimization-based techniques. The discrete problem is to identify, from a discrete set of densely sampled positions in the 3D space, the sampling points that minimize the condition number of the SHM. This method has subsequently been used to identify the geometry of loudspeakers in spatial sound reproduction, and to select spatial sampling configurations for HRTF measurement. The application-specific requirements have been formulated as additional constraints of the optimization problem. Recently developed mixed-integer optimization solvers have been used to solve the formulated problem. The performance of the obtained sampling positions in each application is compared with the existing configurations. Objective measures such as condition number, D-measure, and spectral distortion are used to study the performance of the sampling configurations resulting from the proposed and the existing methods. It is observed that the proposed solution is able to find sampling points that result in a better-conditioned SHM while also maintaining all the application-specific requirements.
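
The discrete selection problem described above can be written compactly as follows. The notation is an assumption introduced here for illustration, not the paper's: $M$ candidate positions, a subset $S$ of $K$ selected indices, $Y_S$ the SHM of order $N$ whose rows are the spherical harmonics evaluated at the selected positions, and $\mathcal{F}$ the set of selections that satisfy the application-specific constraints. The 2-norm condition number is also an assumed choice.

```latex
\kappa(Y_S) = \frac{\sigma_{\max}(Y_S)}{\sigma_{\min}(Y_S)},
\qquad Y_S \in \mathbb{C}^{K \times (N+1)^2}

S^{\star} = \operatorname*{arg\,min}_{\substack{S \subseteq \{1,\dots,M\} \\ |S| = K}}
            \kappa(Y_S)
\quad \text{subject to} \quad S \in \mathcal{F}
```

Encoding the selection with binary indicators $z_i \in \{0, 1\}$ and $\sum_{i=1}^{M} z_i = K$ is one way to see why the problem is posed for mixed-integer solvers.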