Showing papers in "Speech Communication in 2015"
••
TL;DR: This review examines how common paralinguistic speech characteristics are affected by depression and suicidality, and how this information can be applied in classification and prediction systems.
607 citations
••
TL;DR: A survey of past work and priority research directions for the future is provided, showing that future research should address the lack of standard datasets and the over-fitting of existing countermeasures to specific, known spoofing attacks.
433 citations
••
TL;DR: Experimental results on an isolated English word corpus recorded by non-native (L2) English learners show that the proposed GOP measure improves the performance of the GOP-based mispronunciation detection approach, with precision and recall improving by 7.4% compared with the conventional GOP estimated from a GMM-HMM.
184 citations
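The classic Goodness of Pronunciation (GOP) measure that this entry builds on is a duration-normalised log-likelihood ratio: how well the canonical phone explains the acoustics, relative to the best competing phone, averaged over the segment's frames. A minimal sketch of that idea (the per-frame scores and phone labels below are illustrative, not from the paper):

```python
def gop(frame_loglik, canonical):
    """Duration-normalised GOP for the canonical phone.

    frame_loglik: list of dicts {phone: log-likelihood}, one per frame.
    Returns the mean difference between the canonical phone's score and
    the best-scoring phone's score (<= 0; near 0 means well pronounced).
    """
    total = 0.0
    for frame in frame_loglik:
        best = max(frame.values())
        total += frame[canonical] - best
    return total / len(frame_loglik)

# toy example: two frames, three candidate phones
frames = [{"ae": -1.0, "eh": -1.2, "ih": -3.0},
          {"ae": -0.8, "eh": -0.7, "ih": -2.5}]
print(round(gop(frames, "ae"), 3))  # → -0.05 (close to 0: likely correct)
print(round(gop(frames, "ih"), 3))  # → -1.9  (far below 0: likely mispronounced)
```

A mispronunciation detector then thresholds this score per phone; the paper's contribution concerns how the underlying posteriors are estimated.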
••
TL;DR: Experiments showed that deep feature based methods can obtain significant performance improvements over the traditional baselines, whether they are directly applied in the GMM-UBM system or utilized as identity vectors.
176 citations
••
TL;DR: The performance of the human listening panel shows that imitated speech increases the difficulty of the speaker verification task; a statistically significant association between listener accuracy and self-reported factors was found only when familiar voices were present in the test.
84 citations
••
TL;DR: The hypothesis that the effects of depression in speech manifest as a reduction in the spread of phonetic events in acoustic space as modelled by Gaussian Mixture Models in combination with Mel Frequency Cepstral Coefficient (MFCC) is investigated.
81 citations
••
TL;DR: The findings demonstrate that such spoofing-oriented playback attacks can be effectively detected and should not be considered a significant argument against applications of text-dependent speaker verification.
75 citations
••
TL;DR: Objective and subjective evaluations indicated that the proposed spectral envelope estimation algorithm can obtain a temporally stable spectral envelope and synthesize speech with higher sound quality than speech synthesized with other algorithms.
73 citations
••
TL;DR: Experimental results indicate that: (i) the MHEC feature is highly effective and performs favorably compared to other conventional and state-of-the-art front-ends, and (ii) the power-law non-linearity consistently yields the best performance across different conditions for both SID and LID tasks.
61 citations
••
TL;DR: The paper discusses the evaluation of audiovisual speech synthesizers, elaborates on the hardware requirements for performing visual speech synthesis, and describes some important future directions that should stimulate the use of audiovisual speech synthesis technology in real-life applications.
60 citations
••
[...]
TL;DR: This study presents a novel expert-based approach to assess the quality of ongoing Spoken Dialog System (SDS) interactions and concludes that this paradigm could render SDSs more user-friendly and improve user acceptance.
••
TL;DR: The results show improvements in an instrumental measure of intelligibility and in frequency-weighted SNR over the complex-valued non-negative matrix factorization (CNMF) source separation approach, spatial sound source separation, and conventional beamforming methods such as the delay-and-sum beamformer (DSB) and the minimum variance distortionless response (MVDR) beamformer.
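The delay-and-sum beamformer (DSB) used above as a baseline is simple enough to sketch: each microphone channel is advanced by its steering delay so the target source adds coherently, then the channels are averaged. A toy illustration with integer sample delays (the signal values and two-mic geometry are made up for the example):

```python
def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer with integer sample delays:
    advance each channel by its steering delay, then average."""
    # output length limited by the shortest advanced channel
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[d + t] for ch, d in zip(channels, delays)) / len(channels)
            for t in range(n)]

# two-mic toy example: the source reaches mic 2 one sample later
sig = [0.0, 1.0, 0.5, -0.5, 0.0]
mic1 = sig
mic2 = [0.0] + sig[:-1]  # delayed copy
out = delay_and_sum([mic1, mic2], [0, 1])
print(out)  # → [0.0, 1.0, 0.5, -0.5] — coherent sum recovers the signal (truncated)
```

MVDR generalizes this by choosing channel weights that minimize output noise power subject to unit gain toward the target direction, which is why it serves as the stronger conventional baseline.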
••
TL;DR: All features, both spectral and prosodic, are necessary for achieving optimal performance on the data from L1-English speakers; MFCC log-posterior probability features are the single best set of features, followed by duration, energy, pitch, and finally spectral tilt features.
••
TL;DR: Analysis of the age factor suggests that mother tongue detection is easier in older speaker groups than in younger ones, and that mother tongue traits might be better preserved in older speakers than in younger speakers when speaking the second language.
••
[...]
TL;DR: The speech processing pipeline to automatically detect common errors associated with CAS is described, which contains modules for voice activity detection, pronunciation verification, and lexical stress verification.
••
TL;DR: Results show that the proposed QMF approach successfully improves the system performance in terms of both discrimination and calibration, compared with conventional linear calibration.
••
TL;DR: It is noticed that the more challenging open-ended tasks benefit significantly more from the use of DNN-HMMs than constrained item types do, which indicates the potential to build reliable spoken assessment applications based on constrained tasks when little domain-specific training data is available.
••
TL;DR: New measures of syntactic complexity for use in automatic scoring systems for second-language spontaneous speech are studied; the results suggest that they show a reasonable association with human-rated proficiency scores compared to conventional measures of syntactic complexity.
••
TL;DR: This work introduces a set of databases that contain both speech and electroglottograph data, providing arguably the first direct insight into how cognitive load affects the voice source, and shows that glottal-based features carry complementary information with respect to formant-based features.
••
TL;DR: It is argued that all of these findings are expected within an exemplar approach assuming storage of tonal information with lexical items, and discussed the implications of this for the production and mental representation of intonation.
••
TL;DR: The distributions of LRs were found to be relatively stable across systems, although LRs for individual comparisons may be substantially affected; as expected, the Mismatched systems produced the worst validity, while the Matched systems produced the best validity.
••
TL;DR: This study's results show that responses to the VF task contain a large number of extraneous utterances and noise that lead to relatively poor baseline ASR performance, but it is found that speaker adaptation combined with confidence scoring significantly improves all three metrics and can enable use of ASR for reliable estimates of the traditional manual VF scores.
••
TL;DR: It is found that the simple distribution-based detection method is capable of detecting clipped speech with high accuracy, and that the DNN-based reconstruction can achieve promising performance gains for speaker recognition on clipped speech.
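The distribution-based detection idea in this entry rests on a simple observation: clipping piles sample values up at the edges of the amplitude histogram. A minimal sketch of that heuristic (the threshold and toy waveforms are illustrative, not the paper's exact method):

```python
def clipping_fraction(samples, threshold=0.99):
    """Distribution-based clipping check: fraction of samples whose
    magnitude sits at or above a threshold of the observed peak.
    A large fraction means the waveform is flattened at its extremes."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return 0.0
    return sum(abs(s) >= threshold * peak for s in samples) / len(samples)

clean = [0.1, -0.3, 0.8, -0.6, 0.2]
clipped = [1.0, 1.0, -1.0, 0.4, 1.0]
print(clipping_fraction(clean))    # → 0.2 (only the single peak sample)
print(clipping_fraction(clipped))  # → 0.8 (most samples pinned at the rails)
```

A detector would compare this fraction against a small tolerance; unclipped speech concentrates only a handful of samples near its peak, while clipped speech concentrates many.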
••
TL;DR: This paper shows that the relationship between lexical units and acoustic features can be factored into two parts through a latent variable, namely, an acoustic model and a lexical model and proposes an approach that addresses both acoustic and phonetic lexical resource constraints in ASR system development.
••
TL;DR: This study investigated the relationship between clearly produced and plain citation form speech styles and motion of visible articulators, and found significant effects of speech style as well as speaker gender and saliency of visual speech cues.
••
TL;DR: The results suggest that talkers actively monitor their environment and are able to adopt appropriate speech production strategies for efficient and effective communication in adverse conditions.
••
TL;DR: The results of this study will be useful in a proposed application of speech ABR to objective hearing aid fitting, if the separation of the brain's responses to different vowels is found to be correlated with perceptual discrimination.
••
TL;DR: This paper focuses on employing adaptive scales for the computation of perceptually scaled continuous wavelet transform (CWT) coefficients and adaptive thresholding of these coefficients for speech enhancement, and finds that, for the white Gaussian noise case, the SNR and SSNR of the proposed method were better than those of all methods under comparison.
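Coefficient thresholding of the kind this entry describes commonly uses the soft-thresholding rule: shrink every coefficient's magnitude by the threshold and zero out anything below it, on the assumption that low-magnitude coefficients are mostly noise. A generic sketch of that rule (not the paper's adaptive scheme; the coefficients and threshold are made up):

```python
def soft_threshold(coeffs, thr):
    """Soft thresholding: reduce each coefficient's magnitude by thr,
    zeroing any coefficient whose magnitude falls below the threshold."""
    out = []
    for c in coeffs:
        m = max(abs(c) - thr, 0.0)  # shrunken magnitude
        out.append(m if c >= 0 else -m)
    return out

print(soft_threshold([2.0, -0.3, 0.5, -1.5], 0.5))  # → [1.5, -0.0, 0.0, -1.0]
```

In an enhancement pipeline the thresholded coefficients are inverse-transformed back to a waveform; the paper's contribution is making both the wavelet scales and the threshold adaptive rather than fixed.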
••
TL;DR: A source separation algorithm based on the von Mises mixture model and the complex Gaussian mixture model is developed, where the model parameters are estimated via an expectation–maximization (EM) algorithm and a T–F mask is derived from the model parameters for recovering the sources.
••
TL;DR: The results highlight the feasibility of instrumental quality prediction for TTS signals, provided that broad training material is employed; high prediction accuracy, however, requires nonlinear model structures.