
Showing papers in "Eurasip Journal on Audio, Speech, and Music Processing in 2019"


Journal ArticleDOI
TL;DR: The proposed speech emotion recognition method, based on a decision tree support vector machine (SVM) model with Fisher feature selection, can effectively reduce emotional confusion and improve the emotion recognition rate.
Abstract: The overall recognition rate decreases as emotional confusion increases in multi-class speech emotion recognition. To solve this problem, we propose a speech emotion recognition method based on a decision tree support vector machine (SVM) model with Fisher feature selection. At the feature selection stage, the Fisher criterion is used to filter out the feature parameters with higher discriminative ability. At the emotion classification stage, an algorithm is proposed to determine the structure of the decision tree. The decision tree SVM realizes a two-step classification consisting of a first rough classification followed by a fine classification. Thus, redundant parameters are eliminated and the performance of emotion recognition is improved. In this method, the decision tree SVM framework is first established by calculating the confusion degree of emotions, and then the features with higher discriminative ability are selected for each SVM of the decision tree according to the Fisher criterion. Finally, speech emotion recognition is performed with this model. Decision tree SVMs with Fisher feature selection are constructed on the CASIA Chinese emotional speech corpus and the Berlin speech corpus to validate the effectiveness of our framework. The experimental results show that the average emotion recognition rate of the proposed method is 9% higher than the traditional SVM classification method on CASIA, and 8.26% higher on the Berlin speech corpus. This verifies that the proposed method can effectively reduce emotional confusion and improve the emotion recognition rate.
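As an illustration of the feature selection step described above, here is a minimal Python sketch of Fisher-ratio feature ranking followed by an SVM classifier, assuming a feature matrix X and emotion labels y; the function names and the choice of k are illustrative, not taken from the paper.

```python
# Minimal sketch: Fisher-ratio feature selection followed by an SVM classifier.
# Assumes X (n_samples x n_features) and integer emotion labels y; not the authors' code.
import numpy as np
from sklearn.svm import SVC

def fisher_ratio(X, y):
    """Between-class variance over within-class variance, computed per feature."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / (within + 1e-12)

def select_and_train(X, y, k=50):
    scores = fisher_ratio(X, y)
    top = np.argsort(scores)[::-1][:k]          # keep the k most discriminative features
    clf = SVC(kernel="rbf").fit(X[:, top], y)   # e.g., one SVM node of the decision tree
    return top, clf
```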

81 citations


Journal ArticleDOI
TL;DR: This work aims to study the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments, selected from the Google AudioSet dataset.
Abstract: Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependencies. This work studies the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments (216 h), selected from the Google AudioSet dataset. These segments belong to YouTube videos and have been represented as mel-spectrograms. We propose and compare two approaches. The first is the training of two different neural networks, one for speech detection and another for music detection. The second consists of training a single neural network to tackle both tasks at the same time. The studied architectures include fully connected, convolutional, and LSTM (long short-term memory) recurrent networks. Comparative results are provided in terms of classification performance and model complexity. We highlight the performance of convolutional architectures, especially in combination with an LSTM stage. The hybrid convolutional-LSTM models achieve the best overall results (85% accuracy) in the three proposed tasks. Furthermore, a distractor analysis of the results has been carried out in order to identify which events in the ontology are the most harmful for the performance of the models, showing some difficult scenarios for the detection of music and speech.
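The hybrid convolutional-LSTM architecture highlighted above can be sketched as follows in PyTorch; layer sizes, pooling factors, and the binary output are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative hybrid CNN-LSTM over mel-spectrograms (batch, 1, n_mels, n_frames).
# Layer sizes are assumptions for the sketch, not the paper's exact configuration.
import torch
import torch.nn as nn

class ConvLSTMDetector(nn.Module):
    def __init__(self, n_mels=64, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        self.lstm = nn.LSTM(input_size=64 * (n_mels // 4), hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, n_classes)   # e.g., speech / non-speech decision

    def forward(self, x):                     # x: (batch, 1, n_mels, n_frames)
        z = self.conv(x)                      # (batch, 64, n_mels/4, n_frames/4)
        z = z.permute(0, 3, 1, 2).flatten(2)  # (batch, time, channels * freq)
        out, _ = self.lstm(z)
        return self.fc(out[:, -1])            # classify from the last time step
```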

32 citations


Journal ArticleDOI
TL;DR: The effect of sparse measurement grids on the reproduced binaural signal is studied by analyzing both aliasing and truncation errors and indicates a substantial effect of truncation error on the loudness stability, while the added aliasing seems to significantly reduce this effect.
Abstract: In response to renewed interest in virtual and augmented reality, the need for high-quality spatial audio systems has emerged. The reproduction of immersive and realistic virtual sound requires high resolution individualized head-related transfer function (HRTF) sets. In order to acquire an individualized HRTF, a large number of spatial measurements are needed. However, such a measurement process requires expensive and specialized equipment, which motivates the use of sparsely measured HRTFs. Previous studies have demonstrated that spherical harmonics (SH) can be used to reconstruct the HRTFs from a relatively small number of spatial samples, but reducing the number of samples may produce spatial aliasing error. Furthermore, by measuring the HRTF on a sparse grid the SH representation will be order-limited, leading to constrained spatial resolution. In this paper, the effect of sparse measurement grids on the reproduced binaural signal is studied by analyzing both aliasing and truncation errors. The expected effect of these errors on the perceived loudness stability of the virtual sound source is studied theoretically, as well as perceptually by an experimental investigation. Results indicate a substantial effect of truncation error on the loudness stability, while the added aliasing seems to significantly reduce this effect.
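To make the order limitation concrete, the following sketch fits an order-limited spherical harmonic (SH) representation to HRTF samples by least squares, assuming NumPy/SciPy; the grid, order, and function names are illustrative, and lowering the order or the number of measurement points reproduces the truncation/aliasing trade-off discussed above.

```python
# Sketch: least-squares fit of an order-N spherical harmonic (SH) representation
# to HRTF samples measured on a (possibly sparse) grid. Grid and order are illustrative.
import numpy as np
from scipy.special import sph_harm

def sh_matrix(order, theta, phi):
    """SH basis matrix with (order+1)**2 columns; theta = azimuth, phi = colatitude."""
    cols = [sph_harm(m, n, theta, phi)
            for n in range(order + 1) for m in range(-n, n + 1)]
    return np.stack(cols, axis=1)              # (n_points, (order+1)**2)

def fit_hrtf_sh(hrtf, theta, phi, order):
    """Return SH coefficients; truncation to 'order' limits the spatial resolution."""
    Y = sh_matrix(order, theta, phi)
    coeffs, *_ = np.linalg.lstsq(Y, hrtf, rcond=None)
    return coeffs                              # reconstruct anywhere via sh_matrix(...) @ coeffs
```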

24 citations


Journal ArticleDOI
TL;DR: The proposed method consistently showed better performance in all the three languages than the baseline system, and the F-score ranged from 86.5% for British data to 95.9% for Korean drama data.
Abstract: We propose a new method for music detection from broadcasting contents using the convolutional neural networks with a Mel-scale kernel. In this detection task, music segments should be annotated from the broadcast data, where music, speech, and noise are mixed. The convolutional neural network is composed of a convolutional layer with kernel that is trained to extract robust features. The Mel-scale changes the kernel size, and the backpropagation algorithm trains the kernel shape. We used 52 h of mixed broadcast data (25 h of music) to train the convolutional network and 24 h of collected broadcast data (ratio of music of 50–76%) for testing. The test data consisted of various genres (drama, documentary, news, kids, reality, and so on) that are broadcast in British English, Spanish, and Korean languages. The proposed method consistently showed better performance in all the three languages than the baseline system, and the F-score ranged from 86.5% for British data to 95.9% for Korean drama data. Our music detection system takes about 28 s to process a 1-min signal using only one CPU with 4 cores.

23 citations


Journal ArticleDOI
TL;DR: Experiments on National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings achieve better performance than the baseline, and the c-vector system performs the best.
Abstract: Phonetic information is one of the most essential components of a speech signal, playing an important role in many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems since it occurs primarily at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing methods only apply phonetic information to the frame-wise trained speaker embeddings. To address this weakness, this paper proposes phonetic adaptation and hybrid multi-task learning and further combines these into c-vector and simplified c-vector architectures. Experiments on the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings achieve better performance than the baseline. The c-vector system performs the best, providing over 30% and 15% relative improvements in equal error rate (EER) for the core-extended and 10 s–10 s conditions, respectively. On the NIST SRE 2016, 2018, and VoxCeleb datasets, the proposed c-vector approach improves the performance even when there is a language mismatch within the training sets or between the training and evaluation sets. Extensive experimental results demonstrate the effectiveness and robustness of the proposed methods.

20 citations


Journal ArticleDOI
TL;DR: This paper improves the discriminating ability of the relative phase (RP) feature by proposing two new auditory filter-based RP features for replay attack detection; a combination of the scores of these proposed RP features with a standard magnitude-based feature, the constant Q transform cepstral coefficient (CQCC), is also applied to further improve detection reliability.
Abstract: There are many studies on detecting human speech from artificially generated speech and on automatic speaker verification (ASV) that aim to detect and verify whether the given speech belongs to a given speaker. Recent studies demonstrate the success of the relative phase (RP) feature in speaker recognition/verification and in the detection of synthesized and converted speech. However, few studies focus on the RP feature for replay attack detection. In this paper, we improve the discriminating ability of the RP feature by proposing two new auditory filter-based RP features for replay attack detection. The key idea is to integrate the advantage of RP-based features in signal representation with the advantage of auditory filter banks. For the first proposed feature, we apply a Mel-filter bank to convert the signal representation of conventional RP information from a linear scale to a Mel scale, where the modified representation is called the Mel-scale RP feature. For the other proposed feature, a gammatone filter bank is applied to scale the RP information, where the scaled RP feature is called the gammatone-scale RP feature. These two proposed phase-based features are designed to achieve better performance than the conventional RP feature, owing in part to their auditory-scale resolution. In addition to the use of the individual Mel/gammatone-scale RP features, a combination of the scores of these proposed RP features and a standard magnitude-based feature, the constant Q transform cepstral coefficient (CQCC), is also applied to further improve the reliability of the detection decision. The effectiveness of the proposed Mel-scale RP feature, gammatone-scale RP feature, and their combination is evaluated on the ASVspoof 2017 dataset. On the evaluation dataset, our proposed methods demonstrate significant improvement over the existing RP feature and the baseline CQCC feature. The combination of the CQCC and gammatone-scale RP provides the best performance compared with the individual baseline features and other combination methods.
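A minimal sketch of the score-level fusion mentioned above, assuming per-utterance countermeasure scores from a CQCC-based system and a gammatone-scale RP system are already available; the z-normalisation and the weight alpha are illustrative choices, not the paper's fusion rule.

```python
# Sketch of score-level fusion between a CQCC-based countermeasure score and a
# gammatone-scale RP score. The weight alpha is an illustrative tuning parameter.
import numpy as np

def fuse_scores(score_cqcc, score_gt_rp, alpha=0.5):
    """Weighted sum of z-normalised detector scores; higher = more likely genuine."""
    z = lambda s: (s - np.mean(s)) / (np.std(s) + 1e-12)
    return alpha * z(np.asarray(score_cqcc)) + (1 - alpha) * z(np.asarray(score_gt_rp))
```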

17 citations


Journal ArticleDOI
TL;DR: The results of the experiments show that creating ASR systems with DSL can achieve an accuracy comparable to traditional methods, while simultaneously making use of unlabeled data, which obviously is much cheaper to obtain and comes in larger sizes.
Abstract: Current automatic speech recognition (ASR) systems achieve 90–95% accuracy or more, depending on the methodology applied and the datasets used. However, the level of accuracy decreases significantly when the same ASR system is used by a non-native speaker of the language to be recognized. At the same time, the volume of labeled datasets of non-native speech samples is extremely limited both in size and in the number of languages covered. This problem makes it difficult to train or build sufficiently accurate ASR systems targeted at non-native speakers, which, consequently, calls for a different approach that would make use of vast amounts of unlabeled data. In this paper, we address this issue by employing dual supervised learning (DSL) and reinforcement learning with a policy gradient methodology. We tested DSL in a warm-start approach, with two models trained beforehand, and in a semi-warm-start approach with only one of the two models pre-trained. The experiments were conducted on English spoken by Japanese and Polish speakers. The results of our experiments show that creating ASR systems with DSL can achieve an accuracy comparable to traditional methods, while simultaneously making use of unlabeled data, which is much cheaper to obtain and available in much larger quantities.

13 citations


Journal ArticleDOI
TL;DR: In this paper, a modification to DTW that performs individual and independent pairwise alignment of feature trajectories is proposed, termed feature trajectory dynamic time warping (FTDTW), which is applied as a similarity measure in the agglomerative hierarchical clustering of speech segments.
Abstract: Dynamic time warping (DTW) can be used to compute the similarity between two sequences of generally differing length. We propose a modification to DTW that performs individual and independent pairwise alignment of feature trajectories. The modified technique, termed feature trajectory dynamic time warping (FTDTW), is applied as a similarity measure in the agglomerative hierarchical clustering of speech segments. Experiments using MFCC and PLP parametrisations extracted from TIMIT and from the Spoken Arabic Digit Dataset (SADD) show consistent and statistically significant improvements in the quality of the resulting clusters in terms of F-measure and normalised mutual information (NMI).
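The following is a rough Python sketch of the FTDTW idea as described: each feature trajectory (e.g., each MFCC coefficient over time) is aligned independently with DTW and the per-dimension costs are summed; implementation details such as the local cost and step pattern are assumptions.

```python
# Sketch of the FTDTW idea: run DTW independently on each feature trajectory
# and sum the per-dimension costs. This follows the paper's description;
# details may differ from the authors' implementation.
import numpy as np

def dtw_1d(a, b):
    """Classic DTW cost between two 1-D trajectories."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def ftdtw(X, Y):
    """X: (Tx, d), Y: (Ty, d) feature matrices; align each of the d trajectories separately."""
    return sum(dtw_1d(X[:, k], Y[:, k]) for k in range(X.shape[1]))
```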

12 citations


Journal ArticleDOI
TL;DR: This work introduces a joint model trained with nonnegative matrix factorization (NMF)-based high-level features and puts forward a hybrid attention mechanism by incorporating multi-head attentions and calculating attention scores over multi-level outputs.
Abstract: A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing attention and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model can impose additional restrictions on alignments. To better explore end-to-end models, we propose improvements to feature extraction and the attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism by incorporating multi-head attention and calculating attention scores over multi-level outputs. Experiments on TIMIT indicate that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method exhibits a word error rate (WER) that is only 0.2% worse in absolute terms than the best referenced method, which is trained on a much larger dataset, and that it outperforms all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable to the state-of-the-art end-to-end system in WER.

11 citations


Journal ArticleDOI
Teng Zhang, Ji Wu
TL;DR: Experiments on audio source separation and audio scene classification tasks show performance improvements of the proposed filter banks when compared with traditional fixed-parameter triangular or Gaussian filters on the Mel scale.
Abstract: Filter banks applied to spectra play an important role in many audio applications. Traditionally, the filters are distributed linearly on a perceptual frequency scale such as the Mel scale. To make the output smoother, these filters are often placed so that they overlap with each other. However, fixed-parameter filters are usually derived from psychoacoustic experiments and selected empirically. To make filter banks discriminative, the authors use a neural network structure to learn the frequency center, bandwidth, gain, and shape of the filters adaptively when the filter banks are used as a feature extractor. This paper investigates several different constraints on discriminative frequency filter banks and the dual spectrum reconstruction problem. Experiments on audio source separation and audio scene classification tasks show performance improvements of the proposed filter banks when compared with traditional fixed-parameter triangular or Gaussian filters on the Mel scale. The classification errors on the LITIS ROUEN and DCASE2016 datasets are reduced by 13.9% and 4.6% in relative terms.
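A hedged PyTorch sketch of a discriminatively trainable filter bank, with learnable center, bandwidth, and gain for Gaussian-shaped filters applied to magnitude spectra; the parameterization and initialization are illustrative and not necessarily those used by the authors.

```python
# Sketch of a discriminatively trainable filter bank: Gaussian filters on the
# frequency axis with learnable center, bandwidth, and gain, applied to magnitude
# spectra. Initialization and parameterization are illustrative assumptions.
import torch
import torch.nn as nn

class LearnableFilterBank(nn.Module):
    def __init__(self, n_filters=40, n_fft_bins=257):
        super().__init__()
        self.register_buffer("freqs", torch.linspace(0, 1, n_fft_bins))
        self.centers = nn.Parameter(torch.linspace(0, 1, n_filters))   # could be Mel-spaced
        self.log_bw = nn.Parameter(torch.full((n_filters,), -3.0))     # log bandwidth
        self.log_gain = nn.Parameter(torch.zeros(n_filters))

    def forward(self, spec):                       # spec: (batch, frames, n_fft_bins)
        bw = self.log_bw.exp().unsqueeze(1)        # (n_filters, 1)
        gain = self.log_gain.exp().unsqueeze(1)
        fb = gain * torch.exp(-0.5 * ((self.freqs - self.centers.unsqueeze(1)) / bw) ** 2)
        return torch.log1p(spec @ fb.t())          # (batch, frames, n_filters)
```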

10 citations


Journal ArticleDOI
TL;DR: The proposed adaptive-averaging a priori SNR estimation employing critical band processing improves speech quality under different noise conditions and maintains the advantage of the decision-directed (DD) approach in eliminating musical noise under different SNR conditions; a modified objective measurement, the modified Hamming distance, is also proposed.
Abstract: In this paper, an adaptive-averaging a priori SNR estimation employing critical band processing is proposed. The proposed method modifies the current decision-directed a priori SNR estimation to achieve faster tracking when the SNR changes. The decision-directed (DD) estimator employs a fixed weighting with a value close to one, which makes it slow in following the onsets of speech utterances. The proposed SNR estimator solves this issue by employing an adaptive weighting factor, which allows improved tracking of onset changes in the speech signal and, as a consequence, better preservation of speech components. This adaptive technique ensures that the weighting between the modified decision-directed a priori estimate and the maximum likelihood a priori estimate is a function of the speech absence probability. The estimate of the speech absence probability is modeled by a sigmoid function. Furthermore, a critical band mapping for the short-time Fourier transform analysis-synthesis system is utilized in the speech enhancement to achieve less musical noise. In addition, to evaluate the ability of the a priori SNR estimation method to preserve speech components, we propose a modified objective measurement known as the modified Hamming distance. Evaluations are performed utilizing both objective and subjective measurements. The experimental results show that the proposed method improves the speech quality under different noise conditions. Moreover, it maintains the advantage of the DD approach in eliminating musical noise under different SNR conditions. The objective results are supported by subjective listening tests with 10 subjects (5 males and 5 females).
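A minimal NumPy sketch of the decision-directed a priori SNR estimator with an adaptive weighting factor driven by a sigmoid of the speech absence probability, as described above; the sigmoid constants and variable names are illustrative, not the paper's tuned values.

```python
# Sketch of decision-directed (DD) a priori SNR estimation with an adaptive
# weighting factor controlled by a sigmoid of the speech absence probability (SAP).
# Constants are illustrative, not the paper's tuned values.
import numpy as np

def sigmoid(x, slope=10.0, center=0.5):
    return 1.0 / (1.0 + np.exp(-slope * (x - center)))

def a_priori_snr(noisy_power, noise_power, prev_clean_power, sap):
    """One frame update; all arguments are per-frequency-bin arrays."""
    gamma = noisy_power / (noise_power + 1e-12)          # a posteriori SNR
    xi_ml = np.maximum(gamma - 1.0, 0.0)                 # maximum-likelihood estimate
    xi_dd = prev_clean_power / (noise_power + 1e-12)     # DD term from the previous frame
    alpha = sigmoid(sap)                                 # speech likely absent -> lean on DD term
    return alpha * xi_dd + (1.0 - alpha) * xi_ml
```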

Journal ArticleDOI
TL;DR: A discriminative learning method for emotion recognition using both articulatory and acoustic information is proposed and shown to be more effective at distinguishing happiness from other emotions.
Abstract: Speech emotion recognition methods combining articulatory information with acoustic features have previously been shown to improve recognition performance. Collection of articulatory data on a large scale may not be feasible in many scenarios, thus restricting the scope and applicability of such methods. In this paper, a discriminative learning method for emotion recognition using both articulatory and acoustic information is proposed. A traditional l1-regularized logistic regression cost function is extended with additional constraints that force the model to reconstruct articulatory data. This leads to sparse and interpretable representations jointly optimized for both tasks. Furthermore, the model only requires articulatory features during training; only speech features are required for inference on out-of-sample data. Experiments are conducted to evaluate emotion recognition performance over the vowels /AA/, /AE/, /IY/, /UW/ and over complete utterances. Incorporating articulatory information is shown to significantly improve the performance for valence-based classification. Results obtained for within-corpus and cross-corpus categorical emotion recognition indicate that the proposed method is more effective at distinguishing happiness from other emotions.
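The joint objective can be sketched as follows in an autograd-friendly PyTorch style, assuming a softmax/logistic classifier on acoustic features plus a linear articulatory reconstruction term; the exact formulation, variable names, and hyperparameters are assumptions for illustration.

```python
# Sketch of the joint objective: l1-regularised logistic regression on acoustic
# features plus a reconstruction term that maps the same features to articulatory
# data. Weights lam_l1 and lam_art are illustrative hyperparameters.
import torch
import torch.nn.functional as F

def joint_loss(W, b, A, X, y, X_art, lam_l1=1e-3, lam_art=1e-2):
    """
    W: (n_feat, n_classes) classifier weights, b: (n_classes,) bias,
    A: (n_feat, n_art) articulatory reconstruction map,
    X: (n, n_feat) acoustic features, y: (n,) emotion labels,
    X_art: (n, n_art) articulatory features (needed at training time only).
    """
    logits = X @ W + b
    clf = F.cross_entropy(logits, y)                    # logistic / softmax classification loss
    sparsity = lam_l1 * W.abs().sum()                   # l1 penalty -> sparse, interpretable weights
    recon = lam_art * F.mse_loss(X @ A, X_art)          # articulatory reconstruction constraint
    return clf + sparsity + recon
```

At inference, only the classifier part (W, b) and acoustic features are needed, which matches the property stated in the abstract.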

Journal ArticleDOI
TL;DR: The results suggest that the QbE STD task is still in progress and that the performance of these systems is highly sensitive to changes in the data domain; nevertheless, QbE STD strategies are able to outperform text-based STD in unseen data domains.
Abstract: The huge amount of information stored in audio and video repositories makes search on speech (SoS) a priority area nowadays. Within SoS, Query-by-Example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given a spoken query. Research in this area is continuously fostered with the organization of QbE STD evaluations. This paper presents a multi-domain, internationally open evaluation for QbE STD in Spanish. The evaluation aims at retrieving the speech files that contain the queries, providing their start and end times, and a score that reflects the confidence given to the detection. Three different Spanish speech databases that encompass different domains have been employed in the evaluation: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast television (TV) shows; and the COREMAH database, which contains two-person spontaneous conversations about different topics. The evaluation has been designed carefully so that several analyses of the main results can be carried out. We present the evaluation itself, the three databases, the evaluation metrics, the systems submitted to the evaluation, the results, and the detailed post-evaluation analyses based on some query properties (within-vocabulary/out-of-vocabulary queries, single-word/multi-word queries, and native/foreign queries). Fusion results of the primary systems submitted to the evaluation are also presented. Three different teams took part in the evaluation, and ten different systems were submitted. The results suggest that the QbE STD task is still in progress, and the performance of these systems is highly sensitive to changes in the data domain. Nevertheless, QbE STD strategies are able to outperform text-based STD in unseen data domains.

Journal ArticleDOI
TL;DR: This paper proposes a score-informed source separation framework based on non-negative matrix factorization (NMF) and dynamic time warping (DTW) that suits both offline and online systems and has been evaluated and compared with other state-of-the-art methods for single-channel source separation of small ensembles and large orchestral ensembles.
Abstract: In this paper, we propose a score-informed source separation framework based on non-negative matrix factorization (NMF) and dynamic time warping (DTW) that suits both offline and online systems. The proposed framework is composed of three stages: training, alignment, and separation. In the training stage, the score is encoded as a sequence of individual occurrences and unique combinations of notes denoted as score units. Then, we propose an NMF-based signal model where the basis functions for each score unit are represented as a weighted combination of spectral patterns for each note and instrument in the score, obtained from an over-complete dictionary trained a priori. In the alignment stage, the time-varying gains are estimated at the frame level by computing the projection of each score-unit basis function onto the captured audio signal. Then, under the assumption that only one score unit is active at a time, we propose an online DTW scheme to synchronize the score information with the performance. Finally, in the separation stage, the obtained gains are refined using local low-rank NMF and the separated sources are obtained using a soft-filter strategy. The framework has been evaluated and compared with other state-of-the-art methods for single-channel source separation of small ensembles and large orchestral ensembles, obtaining reliable results in terms of SDR and SIR. Finally, our method has been evaluated on the specific task of acoustic minus one, and some demos are presented.
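A simplified stand-in for the gain estimation described above: with the score-unit basis functions held fixed, time-varying gains can be obtained with standard multiplicative NMF updates and turned into a soft filter; this is a generic sketch, not the paper's full alignment/separation pipeline.

```python
# Sketch: estimate time-varying gains G for fixed basis functions W (score-unit
# spectral patterns) by multiplicative NMF updates under a Frobenius cost, then
# build a Wiener-like soft filter. A simplified stand-in for the paper's stages.
import numpy as np

def estimate_gains(V, W, n_iter=100, eps=1e-12):
    """V: (n_freq, n_frames) magnitude spectrogram, W: (n_freq, n_units) fixed bases."""
    G = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        G *= (W.T @ V) / (W.T @ (W @ G) + eps)   # standard multiplicative update, W held fixed
    return G

def soft_mask(W, G, unit_idx, eps=1e-12):
    """Soft filter for the source(s) associated with the selected score units."""
    V_hat = W @ G
    V_src = W[:, unit_idx] @ G[unit_idx, :]
    return V_src / (V_hat + eps)
```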

Journal ArticleDOI
TL;DR: This work proposes an unsupervised adaptation method which does not need in-domain labeled data, only the recording that is being diarized, and which is fully compatible with supervised adaptation.
Abstract: We present a novel model adaptation approach to deal with data variability for speaker diarization in a broadcast environment. Expensive human-annotated data can be used to mitigate the domain mismatch by means of supervised model adaptation approaches. By contrast, we propose an unsupervised adaptation method which does not need in-domain labeled data, only the recording that we are diarizing. We rely on an inner adaptation block which combines Agglomerative Hierarchical Clustering (AHC) and Mean-Shift (MS) clustering techniques with a Fully Bayesian Probabilistic Linear Discriminant Analysis (PLDA) to produce pseudo-speaker labels suitable for model adaptation. We propose multiple adaptation approaches based on this basic block, including unsupervised and semi-supervised ones. Our proposed solutions, analyzed on the Multi-Genre Broadcast 2015 (MGB) dataset, report significant improvements (16% relative improvement) with respect to the baseline, also outperforming a supervised adaptation proposal with low resources (9% relative improvement). Furthermore, our proposed unsupervised adaptation is fully compatible with a supervised one. The joint use of both adaptation techniques (supervised and unsupervised) shows a 13% relative improvement with respect to considering only the supervised adaptation.

Journal ArticleDOI
TL;DR: An overview of communication enhancement techniques for masks based on digital signal processing is given; measurements of real-time mask systems show that the communication can be improved significantly.
Abstract: So-called full-face masks are essential for fire fighters to ensure respiratory protection in smoke diving incidents. While such masks are absolutely necessary for protection purposes on the one hand, they impair the voice communication of fire fighters drastically on the other. For this reason, communication systems should be used to amplify the speech and, therefore, to improve the communication quality. This paper gives an overview of communication enhancement techniques for masks based on digital signal processing. The presented communication system picks up the speech signal with a microphone in the mask, enhances it, and plays back the amplified signal through loudspeakers located on the outside of the mask. Since breathing noise is also picked up by the microphone, it is advantageous to recognize and suppress it, especially since breathing noise is very loud (usually much louder than the recorded voice). A voice activity detection stage distinguishes between side talkers, pause, breathing out, breathing in, and speech. It ensures that only speech components are played back. Because the microphone is located close to the loudspeakers, the output signals couple back into the microphone and feedback may occur even at moderate gains. This can be reduced by feedback reduction (consisting of cancellation and suppression approaches). To enhance the functionality of the canceler, a decorrelation stage can be applied to the enhanced signal before loudspeaker playback. As a consequence of all processing stages, the communication can be improved significantly, as measurements of real-time mask systems show.
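As a sketch of the feedback cancellation stage, the following NLMS adaptive filter estimates the loudspeaker-to-microphone path and subtracts the predicted feedback from the microphone signal; this is the textbook formulation, not the specific real-time mask system described above.

```python
# Sketch of an NLMS adaptive filter as a feedback canceller: estimate the
# loudspeaker-to-microphone path and subtract its contribution from the mic signal.
# Standard textbook formulation, not the mask system described in the abstract.
import numpy as np

def nlms_feedback_canceller(mic, loudspeaker, n_taps=128, mu=0.1, eps=1e-8):
    w = np.zeros(n_taps)                         # estimated feedback path
    err = np.zeros_like(mic, dtype=float)        # feedback-compensated signal
    for n in range(n_taps, len(mic)):
        x = loudspeaker[n - n_taps:n][::-1]      # most recent loudspeaker samples
        y_hat = w @ x                            # predicted feedback component
        err[n] = mic[n] - y_hat
        w += mu * err[n] * x / (x @ x + eps)     # normalised LMS update
    return err, w
```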

Journal ArticleDOI
TL;DR: Systematic evaluation shows that the SID system is robust against variations in a singer's style and in the structure of songs, and is effective in identifying cover songs and singers.
Abstract: Singing voice analysis has been a topic of research supporting several applications in the domain of music information retrieval. One such major area is singer identification (SID). There has been an enormous increase in the production of movies and songs in the Bollywood industry over the past decades. Surveying this extensive dataset of singers, the paper presents a singer identification system for Indian playback singers. Four acoustic features, namely formants, harmonic spectral envelope, vibrato, and timbre, which uniquely describe the singer, are extracted from the singing voice segments. Using the combination of these multiple acoustic features, we address the major challenges in SID such as variations in a singer's voice, testing of multilingual songs, and the album effect. Systematic evaluation shows that the SID system is robust against variations in the singer's singing style and the structure of songs and is effective in identifying cover songs and singers. The results are reported on an in-house a cappella database consisting of 26 singers and 550 songs. By performing dimension reduction of the feature vector and using a Support Vector Machine classifier, we achieved an accuracy of 86% using a fourfold cross-validation process. In addition, performance comparison of the proposed work with other existing approaches reveals its superiority in terms of volume of dataset and song duration.
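A minimal scikit-learn sketch of the classification back end described above (dimension reduction plus an SVM with fourfold cross-validation); feature extraction is assumed done, and the PCA component count and kernel are illustrative choices.

```python
# Sketch: dimension reduction followed by an SVM singer classifier with fourfold
# cross-validation. Feature extraction (formants, vibrato, timbre, ...) is assumed
# done; the component count and kernel are illustrative choices.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def singer_id_accuracy(X, y, n_components=40):
    """X: (n_segments, n_features) acoustic feature vectors, y: singer labels."""
    model = make_pipeline(StandardScaler(), PCA(n_components=n_components), SVC(kernel="rbf"))
    scores = cross_val_score(model, X, y, cv=4)   # fourfold cross-validation
    return np.mean(scores)
```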

Journal ArticleDOI
TL;DR: This work introduces a unit selection-based text-to-speech-and-singing (US-TTS&S) synthesis framework, which integrates speech- to- singing (STS) conversion to enable the generation of both speech and singing from an input text and a score, respectively, using the same neutral speech corpus.
Abstract: Text-to-speech (TTS) synthesis systems have been widely used in general-purpose applications based on the generation of speech. Nonetheless, there are some domains, such as storytelling or voice output aid devices, which may also require singing. To enable a corpus-based TTS system to sing, a supplementary singing database should be recorded. This solution, however, might be too costly for eventual singing needs, or even unfeasible if the original speaker is unavailable or unable to sing properly. This work introduces a unit selection-based text-to-speech-and-singing (US-TTS&S) synthesis framework, which integrates speech-to-singing (STS) conversion to enable the generation of both speech and singing from an input text and a score, respectively, using the same neutral speech corpus. The viability of the proposal is evaluated considering three vocal ranges and two tempos on a proof-of-concept implementation using a 2.6-h Spanish neutral speech corpus. The experiments show that challenging STS transformation factors are required to sing beyond the corpus vocal range and/or with notes longer than 150 ms. While score-driven US configurations allow the reduction of pitch-scale factors, time-scale factors are not reduced due to the short length of the spoken vowels. Moreover, in the MUSHRA test, text-driven and score-driven US configurations obtain similar naturalness rates of around 40 for all the analysed scenarios. Although these naturalness scores are far from those of vocaloid, the singing scores of around 60 which were obtained validate that the framework could reasonably address eventual singing needs.

Journal ArticleDOI
TL;DR: A deep neural network (DNN)-based method is proposed to learn the mapping between input features of noisy speech and the T-F masks; experimental results show that the codebook-driven method achieves better performance than conventional methods, and that the DNN-based method performs better than the codebook-driven method.
Abstract: According to the encoding and decoding mechanism of binaural cue coding (BCC), in this paper, the speech and noise are considered the left-channel and right-channel signals of the BCC framework, respectively. Subsequently, the speech signal is estimated from noisy speech when the inter-channel level difference (ICLD) and inter-channel correlation (ICC) between speech and noise are given. In this paper, exact inter-channel cues and pre-enhanced inter-channel cues are used for speech restoration. The exact inter-channel cues are extracted from clean speech and noise, and the pre-enhanced inter-channel cues are extracted from the pre-enhanced speech and estimated noise. After that, they are combined, one by one, to form a codebook. Once the pre-enhanced cues are extracted from noisy speech, the exact cues are estimated by a mapping between the pre-enhanced cues and the prior codebook. Next, the estimated exact cues are used to obtain a time-frequency (T-F) mask for enhancing noisy speech based on the decoding of BCC. In addition, in order to further improve the accuracy of the T-F mask based on the inter-channel cues, a deep neural network (DNN)-based method is proposed to learn the mapping relationship between the input features of noisy speech and the T-F masks. Experimental results show that the codebook-driven method achieves better performance than conventional methods, and the DNN-based method performs better than the codebook-driven method.
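For illustration, the exact inter-channel level difference (ICLD) between speech and noise and the resulting T-F mask can be computed as in the sketch below; this follows the general BCC-style decoding idea and is not the paper's exact codebook or DNN stage.

```python
# Sketch: exact inter-channel level difference (ICLD) between speech and noise
# spectrograms, and a Wiener-like T-F mask derived from it (general BCC-style
# decoding; the codebook mapping and DNN estimation are omitted).
import numpy as np

def icld(speech_spec, noise_spec, eps=1e-12):
    """Per-bin level difference in dB between clean speech and noise spectrograms."""
    return 10.0 * np.log10((np.abs(speech_spec) ** 2 + eps) /
                           (np.abs(noise_spec) ** 2 + eps))

def tf_mask_from_icld(icld_db):
    """Map ICLD to a soft mask in [0, 1]: ratio of speech power to total power."""
    ratio = 10.0 ** (icld_db / 10.0)            # |S|^2 / |N|^2
    return ratio / (1.0 + ratio)                # = |S|^2 / (|S|^2 + |N|^2)

def enhance(noisy_spec, mask):
    return mask * noisy_spec                    # apply the T-F mask to the noisy spectrogram
```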

Journal ArticleDOI
TL;DR: In this paper, a latent class model (LCM) is applied to the task of speaker diarization, which is similar to Kenny's variational Bayes (VB) method in that it uses soft information and avoids premature hard decisions in its iterations.
Abstract: In this paper, we apply a latent class model (LCM) to the task of speaker diarization. LCM is similar to Patrick Kenny's variational Bayes (VB) method in that it uses soft information and avoids premature hard decisions in its iterations. In contrast to the VB method, which is based on a generative model, LCM provides a framework allowing both generative and discriminative models. The discriminative property is realized through the use of the i-vector (Ivec), probabilistic linear discriminant analysis (PLDA), and a support vector machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid are introduced. In addition, three further improvements are applied to enhance performance: (1) adding neighbor windows to extract more speaker information for each short segment; (2) using a hidden Markov model to avoid frequent speaker change points; and (3) using agglomerative hierarchical clustering for initialization and to provide hard and soft priors, in order to overcome the problem of initial sensitivity. Experiments on the National Institute of Standards and Technology Rich Transcription 2009 speaker diarization database, under the condition of a single distant microphone, show that the diarization error rate (DER) of the proposed methods has substantial relative improvements compared with mainstream systems. Compared to the VB method, the relative improvements of the LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments on our collected database, CALLHOME97, CALLHOME00, and SRE08 short2-summed trial conditions also show that the proposed LCM-Ivec-Hybrid system has the best overall performance.

Journal ArticleDOI
TL;DR: The obtained results suggest that the STD task is still in progress and performance is highly sensitive to changes in the data domain.
Abstract: Search on speech (SoS) is a challenging area due to the huge amount of information stored in audio and video repositories. Spoken term detection (STD) is an SoS-related task aiming to retrieve data from a speech repository given a textual representation of a search term (which can include one or more words). This paper presents a multi-domain internationally open evaluation for STD in Spanish. The evaluation has been designed carefully so that several analyses of the main results can be carried out. The evaluation task aims at retrieving the speech files that contain the terms, providing their start and end times, and a score that reflects the confidence given to the detection. Three different Spanish speech databases that encompass different domains have been employed in the evaluation: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast news programs; and the COREMAH database, which contains 2-people spontaneous speech conversations about different topics. We present the evaluation itself, the three databases, the evaluation metric, the systems submitted to the evaluation, the results, and detailed post-evaluation analyses based on some term properties (within-vocabulary/out-of-vocabulary terms, single-word/multi-word terms, and native/foreign terms). Fusion results of the primary systems submitted to the evaluation are also presented. Three different research groups took part in the evaluation, and 11 different systems were submitted. The obtained results suggest that the STD task is still in progress and performance is highly sensitive to changes in the data domain.

Journal ArticleDOI
TL;DR: Two novel linguistic features extracted from text input for prosody generation in a Mandarin text-to-speech system are identified as promising: the punctuation confidence, which measures the likelihood that a major punctuation mark can be inserted at a word boundary, and the quotation confidence, which measures the likelihood that a word string is quoted as a meaningful or emphasized unit.
Abstract: This paper proposes two novel linguistic features extracted from text input for prosody generation in a Mandarin text-to-speech system. The first feature is the punctuation confidence (PC), which measures the likelihood that a major punctuation mark (MPM) can be inserted at a word boundary. The second feature is the quotation confidence (QC), which measures the likelihood that a word string is quoted as a meaningful or emphasized unit. The proposed PC and QC features are influenced by the properties of automatic Chinese punctuation generation and linguistic characteristic of the Chinese punctuation system. Because MPMs are highly correlated with prosodic–acoustic features and quoted word strings serve crucial roles in human language understanding, the two features could potentially provide useful information for prosody generation. This idea was realized by employing conditional random-field-based models for predicting MPMs, quoted word string locations, and their associated confidences—that is, PC and QC—for each word boundary. The predicted punctuations and their confidences were then combined with traditional linguistic features to predict prosodic–acoustic features for performing speech synthesis using multilayer perceptrons. Both objective and subjective tests demonstrated that the prosody generated with the proposed linguistic features was superior to that generated without the proposed features. Therefore, the proposed PC and QC are identified as promising features for Mandarin prosody generation.
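A rough sketch of how a per-boundary punctuation confidence could be obtained from a CRF, assuming the sklearn_crfsuite package; the feature template, label set ('MPM'/'NONE'), and function names are illustrative assumptions rather than the paper's actual model.

```python
# Sketch: CRF-based prediction of major punctuation marks (MPMs) at word boundaries
# and a punctuation confidence (PC) from the marginal probabilities.
# Assumes the sklearn_crfsuite package; features and labels here are illustrative only.
import sklearn_crfsuite

def boundary_features(words, i):
    return {
        "word": words[i],
        "prev_word": words[i - 1] if i > 0 else "<s>",
        "is_last": i == len(words) - 1,
    }

def train_pc_model(sentences, boundary_labels):
    """sentences: list of word lists; boundary_labels: per-word labels, e.g. 'MPM' or 'NONE'."""
    X = [[boundary_features(s, i) for i in range(len(s))] for s in sentences]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X, boundary_labels)
    return crf

def punctuation_confidence(crf, words):
    """PC at each word boundary = marginal probability of the 'MPM' label."""
    feats = [boundary_features(words, i) for i in range(len(words))]
    marginals = crf.predict_marginals_single(feats)
    return [m.get("MPM", 0.0) for m in marginals]
```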

Journal ArticleDOI
TL;DR: This paper investigates a room-localized SAD system for smart homes equipped with multiple microphones distributed in multiple rooms, significantly extending earlier work and significantly outperforming alternative baselines.
Abstract: Voice-enabled interaction systems in domestic environments have attracted significant interest recently, being the focus of smart home research projects and commercial voice assistant home devices. Within the multi-module pipelines of such systems, speech activity detection (SAD) constitutes a crucial component, providing input to their activation and speech recognition subsystems. In typical multi-room domestic environments, SAD may also convey spatial intelligence to the interaction, in addition to its traditional temporal segmentation output, by assigning speech activity at the room level. Such room-localized SAD can, for example, disambiguate user command referents, allow localized system feedback, and enable parallel voice interaction sessions by multiple subjects in different rooms. In this paper, we investigate a room-localized SAD system for smart homes equipped with multiple microphones distributed in multiple rooms, significantly extending our earlier work. The system employs a two-stage algorithm, incorporating a set of hand-crafted features specially designed to discriminate room-inside vs. room-outside speech at its second stage, refining SAD hypotheses obtained at its first stage by traditional statistical modeling and acoustic front-end processing. Both algorithmic stages exploit multi-microphone information, combining it at the signal, feature, or decision level. The proposed approach is extensively evaluated on both simulated and real data recorded in a multi-room, multi-microphone smart home, significantly outperforming alternative baselines. Further, it remains robust to reduced microphone setups, while also comparing favorably to deep learning-based alternatives.

Journal ArticleDOI
TL;DR: An innovative parallel dictionary-learning method using non-negative Tucker decomposition (NTD) is proposed, which estimates the dictionary matrix for NMF-VC without using parallel data.
Abstract: Voice conversion (VC) is a technique for converting exclusively the speaker-specific information in the source speech while preserving the associated phonemic information. Non-negative matrix factorization (NMF)-based VC has been widely researched because of the natural-sounding voice it achieves when compared with conventional Gaussian mixture model-based VC. In conventional NMF-VC, models are trained using parallel data, which means the speech data require elaborate pre-processing to generate parallel data. NMF-VC also tends to be a large model, as this method keeps several parallel exemplars in the dictionary matrix, leading to a high computational cost. In this study, an innovative parallel dictionary-learning method using non-negative Tucker decomposition (NTD) is proposed. The proposed method uses tensor decomposition and decomposes an input observation into a set of mode matrices and one core tensor. The proposed NTD-based dictionary-learning method estimates the dictionary matrix for NMF-VC without using parallel data. The experimental results show that the proposed method outperforms other methods in both parallel and non-parallel settings.
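A brief sketch of non-negative Tucker decomposition applied to a non-negative speech feature tensor, assuming the tensorly library; the tensor layout, ranks, and the use of the frequency-mode factor as a dictionary are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch: learning a dictionary without parallel data via non-negative Tucker
# decomposition, assuming the tensorly library. Tensor layout and ranks are
# illustrative; this is not the authors' exact pipeline.
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_tucker

def learn_ntd_dictionary(spectrogram_tensor, ranks=(30, 20, 10)):
    """spectrogram_tensor: e.g., (n_freq, n_frames, n_utterances) non-negative tensor."""
    tensor = tl.tensor(np.maximum(spectrogram_tensor, 0.0))
    core, factors = non_negative_tucker(tensor, rank=list(ranks), n_iter_max=200)
    freq_dict = factors[0]            # frequency-mode matrix, usable as an NMF-style dictionary
    return core, factors, freq_dict
```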