Showing papers in "Speech Communication in 2007"
••
TL;DR: A noisy speech corpus suitable for evaluating speech enhancement algorithms is developed, covering four classes of algorithms: spectral subtractive, subspace, statistical-model-based, and Wiener-type algorithms.
634 citations
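The spectral-subtractive class mentioned above can be sketched minimally as follows; the over-subtraction factor, spectral floor, and function name are illustrative assumptions, not the corpus authors' implementation.

```python
# Minimal magnitude spectral subtraction sketch (hypothetical parameters).
import numpy as np

def spectral_subtract(noisy_stft, noise_mag, alpha=2.0, beta=0.01):
    """Subtract an estimated noise magnitude spectrum from each STFT frame.

    noisy_stft: 2-D complex array (frames x bins)
    noise_mag:  1-D array, estimated noise magnitude per bin
    alpha:      over-subtraction factor (assumed value)
    beta:       spectral floor to limit musical noise (assumed value)
    """
    mag = np.abs(noisy_stft)
    phase = np.angle(noisy_stft)
    clean_mag = mag - alpha * noise_mag            # subtract noise estimate
    clean_mag = np.maximum(clean_mag, beta * mag)  # apply spectral floor
    return clean_mag * np.exp(1j * phase)          # reattach the noisy phase
```

Statistical-model-based and Wiener-type methods replace the subtraction rule with a gain derived from estimated a priori and a posteriori SNRs, but operate on the same STFT representation.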
••
TL;DR: Current advances in automatic speech recognition (ASR) and spoken language systems are outlined, along with deficiencies in dealing with the variation naturally present in speech.
507 citations
••
TL;DR: An image-based, text-free evaluation system is presented that provides intuitive assessment of emotion primitives and yields high inter-evaluator agreement; speaker-dependent modeling of emotion expression is proposed, since the emotion primitives are particularly suited for capturing dynamics and intrinsic variations in emotion expression.
309 citations
••
TL;DR: This research aims to improve the automatic perception of vocal emotion in two ways: by comparing two emotional speech data sources, natural spontaneous emotional speech and acted or portrayed emotional speech, and by examining two classification methods that have not previously been applied: stacked generalisation and unweighted vote.
305 citations
••
TL;DR: It was found that, compared with British adult-directed speech, vowels were equivalently hyperarticulated in infant- and foreigner-directed speech, and that these linguistic modifications are independent of vocal pitch and affective valence.
215 citations
••
TL;DR: The results show that the proposed approach always outperforms the use of transformations in the feature space and yields even better results when combined with linear input transformations.
173 citations
••
TL;DR: The development of a gender-independent laugh detector, aimed at enabling automatic emotion recognition, is described. Acoustic measurements showed differences between laughter and speech in mean pitch and in the ratio of unvoiced to voiced durations, indicating that these prosodic features are indeed useful for discriminating laughter from speech.
169 citations
••
TL;DR: It is shown that the SP-SDW-MWF is more robust against signal model errors than the GSC, and that the block-structured step size matrix yields faster convergence and better tracking performance than the diagonal step size matrix, at only a slightly higher computational cost.
167 citations
••
TL;DR: The implementation and evaluation of an open-domain unit selection speech synthesis engine designed to be flexible enough to encourage further unit selection research and allow rapid voice development by users with minimal speech synthesis knowledge and experience are presented.
161 citations
••
TL;DR: This study significantly improved the intelligibility of dysarthric vowels of one speaker from 48% to 54%, as evaluated by a vowel identification task using 64 CVC stimuli judged by 24 listeners.
161 citations
••
TL;DR: The robustness of approaches to the automatic classification of emotions in speech is addressed and it is suggested that existing approaches are efficient enough to handle larger amounts of training data without any reduction in classification accuracy.
••
TL;DR: The use of several methods for speaker adaptive acoustic modeling to cope with inter-speaker spectral variability and to improve recognition performance for children proved to be effective in recognition of read speech with a vocabulary of about 11k words.
••
TL;DR: An overview of past and present efforts to link human and automatic speech recognition research is provided and an overview of the literature describing the performance difference between machines and human listeners is presented.
••
TL;DR: The research findings indicated that Persian apologies are as formulaic in pragmatic structure as English apologies, and that the values assigned to the two context-external variables have a significant effect on the frequency of intensifiers in different situations.
••
TL;DR: Using a single set of speaker-independent, noise-level-independent parameters, the model was able to predict not only the intelligibility of individual speakers to a remarkable degree, but could also account for most of the token-wise intelligibilities of the letter keywords.
••
TL;DR: Overall results indicate that SNR and SSNR improvements for the proposed approach are comparable to those of the Ephraim-Malah filter, with BWT enhancement giving the best results of all methods in the noisiest (-10 dB and -5 dB input SNR) conditions.
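The SSNR metric reported above is a per-frame average of clamped SNRs; the frame length and clamping range below are common conventions, assumed here rather than taken from the paper.

```python
# Sketch of segmental SNR (SSNR); frame_len, floor, and ceil are assumptions.
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, floor=-10.0, ceil=35.0):
    """Average per-frame SNR in dB, with each frame clamped to [floor, ceil]."""
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        noise_energy = np.sum((s - e) ** 2) + 1e-12  # avoid divide-by-zero
        snr = 10.0 * np.log10(np.sum(s ** 2) / noise_energy + 1e-12)
        snrs.append(np.clip(snr, floor, ceil))      # clamp outlier frames
    return float(np.mean(snrs))
```

Unlike the global SNR, the per-frame clamping keeps silent or near-perfect frames from dominating the average, which is why SSNR is often preferred for enhancement evaluation.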
••
TL;DR: One application of chirp group delay in feature extraction for automatic speech recognition (ASR) is presented, and it is shown that chirp group delay representations, which can be guaranteed to be spike-free, are potentially useful for improving ASR performance.
••
TL;DR: Results confirm that lexical masking occurs only when some words in the babble are detectable, and suggest that different levels of linguistic information can be extracted from background babble and cause different types of linguistic competition for target-word identification.
••
TL;DR: The present study derives the MMSE estimator under speech presence uncertainty and a Laplacian statistical model, and demonstrates that the assumed distribution of the DFT coefficients can have a significant effect on the quality of the enhanced speech.
••
TL;DR: It is demonstrated that speech recognition error rates for interactive read aloud can be reduced by more than 50% through a combination of advances in both statistical language and acoustic modeling.
••
TL;DR: It is argued that progress is hampered by the fragmentation of the field across many different disciplines, coupled with a failure to create an integrated view of the fundamental mechanisms that underpin one organism's ability to communicate with another.
••
TL;DR: It is shown that the "decision-directed" approach to speech spectral variance estimation can have a significant bias at low SNRs, which generally leads to too much speech suppression.
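The decision-directed estimator discussed above combines the previous frame's clean-speech estimate with the current a posteriori SNR; this sketch uses the standard form, with the smoothing factor and variable names as assumptions.

```python
# Sketch of the decision-directed a priori SNR estimate (Ephraim-Malah style).
import numpy as np

def decision_directed_xi(prev_clean_amp, noise_var, gamma, alpha=0.98):
    """Estimate the a priori SNR xi for the current frame.

    prev_clean_amp: clean-speech amplitude estimate from the previous frame
    noise_var:      noise power estimate per frequency bin
    gamma:          a posteriori SNR of the current frame
    alpha:          smoothing weight (assumed 0.98); values near 1 bias xi
                    toward the past frame, the low-SNR bias discussed above
    """
    return (alpha * (prev_clean_amp ** 2) / noise_var
            + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))
```

The max(gamma - 1, 0) term keeps the instantaneous contribution non-negative; the heavy weighting of the past term is what produces the over-suppression at low SNRs noted in the abstract.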
••
TL;DR: The results provide support for an autosegmental-metrical account of the intonational phonology of French in which the early rise is a bitonal (LH) phrase accent that serves as a cue to content word beginnings.
••
TL;DR: The results suggest that in addition to content cues, voice cues can be used by Chinese listeners to release speech from masking by other talkers.
••
TL;DR: This study proposes a new feature vector that will allow better classification of emotional/stressed states and achieves good discrimination between neutral, angry, loud and Lombard states for the simulated domain of the Speech Under Simulated and Actual Stress (SUSAS) database.
••
TL;DR: This paper reviews the progress of Thai speech technology in five areas of research: fundamental analyses and tools, text-to-speech synthesis (TTS), automatic speech recognition (ASR), speech applications, and language resources.
••
TL;DR: Results show that, compared with a conventional fragment generation approach, the proposed system produces more coherent fragments across different conditions, which results in significantly better recognition accuracy.
••
TL;DR: A probabilistic algorithm for phrase stress assignment accounts for both prominence and constituency prosodic relations by considering the coupling between a dependency-grammar system of markers and constituent-size constraints, which copes with intra- and inter-speaker prosodic variability.
••
TL;DR: Continuous Korean-English speech recognition experiments show that the proposed method achieves an average word error rate reduction of 12.75% compared with a speech recognition system whose baseline acoustic models were trained on native speech.
••
TL;DR: Results from the analysis of Japanese vowel data suggested that contraction and relaxation of the three subdivisions of the genioglossus play a dominant role in forming tongue shapes for vowels.