
Showing papers in "Speech Communication in 2011"


Journal ArticleDOI
TL;DR: The basic phenomena reflected in the last fifteen years of research are addressed, with commentary on databases, modelling and annotation, the unit of analysis and prototypicality, and automatic processing, including discussions of features, classification, robustness, evaluation, and implementation and system integration.

671 citations


Journal ArticleDOI
TL;DR: Modulation spectral features are proposed for the automatic recognition of human affective information from speech; they yield a substantial improvement in recognition performance when used to augment prosodic features, which have been used extensively for emotion recognition.

359 citations
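A minimal sketch of the idea behind modulation spectral features, assuming a NumPy/SciPy setup: take the magnitude spectrogram (the per-band temporal envelopes), apply a second Fourier analysis along time, and pool the resulting modulation energies. The frame sizes, band groupings, and normalisation below are placeholders, not the filterbank configuration used in the paper.

```python
import numpy as np
from scipy.signal import stft

def modulation_spectral_features(x, fs, n_mod_bands=8):
    """Rough sketch: energy distribution of sub-band envelope modulations."""
    # 1) acoustic-frequency analysis: magnitude spectrogram (per-band envelopes)
    _, _, X = stft(x, fs=fs, nperseg=512, noverlap=384)
    env = np.abs(X)
    # 2) modulation-frequency analysis: FFT of each envelope trajectory
    env = env - env.mean(axis=1, keepdims=True)
    M = np.abs(np.fft.rfft(env, axis=1))        # (acoustic band, modulation freq)
    # 3) pool modulation energies into a few coarse modulation bands
    edges = np.linspace(0, M.shape[1], n_mod_bands + 1, dtype=int)
    feats = np.array([M[:, a:b].sum() for a, b in zip(edges[:-1], edges[1:])])
    return feats / (feats.sum() + 1e-12)        # normalised modulation energy profile
```

A profile of this kind, describing how envelope energy is distributed over modulation frequency, is the sort of feature that would then be combined with prosodic features for the emotion classifier.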


Journal ArticleDOI
TL;DR: The results of the oracle experiments show that accurate phase spectrum estimates can contribute considerably to speech quality, and that using mismatched analysis windows when computing the magnitude and phase spectra provides significant improvements in both objective and subjective speech quality.

357 citations
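The mismatched-window idea can be illustrated with a short STFT sketch: the magnitude is taken from one analysis window and the phase from another before resynthesis. The window pair and overlap below are arbitrary placeholders rather than the optimised choices reported in the paper.

```python
import numpy as np
from scipy.signal import stft, istft, get_window

def mismatched_window_analysis(x, fs, nperseg=512, noverlap=384):
    """Sketch: combine the magnitude from one STFT analysis window with the
    phase from another, then resynthesise."""
    _, _, X_mag = stft(x, fs, window=get_window('hamming', nperseg),
                       nperseg=nperseg, noverlap=noverlap)
    _, _, X_phs = stft(x, fs, window=get_window('boxcar', nperseg),
                       nperseg=nperseg, noverlap=noverlap)
    # ... an enhancement gain would normally modify the magnitude here ...
    X = np.abs(X_mag) * np.exp(1j * np.angle(X_phs))
    _, x_hat = istft(X, fs, window=get_window('hamming', nperseg),
                     nperseg=nperseg, noverlap=noverlap)
    return x_hat
```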


Journal ArticleDOI
TL;DR: In this article, a hierarchical computational structure is proposed for emotion recognition; it maps an input speech utterance into one of multiple emotion classes through successive layers of binary classification.

291 citations
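As an illustration of such a hierarchy, the sketch below chains binary classifiers so that an utterance-level feature vector is routed through successive yes/no decisions down to a leaf emotion class. The two-level grouping (active vs. passive, then within-group) and the scikit-learn classifiers are assumptions for illustration, not the structure or models used in the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class HierarchicalEmotionClassifier:
    """Sketch of a two-layer binary decision hierarchy:
    layer 1: active vs. passive emotions, layer 2: class within each group."""
    def __init__(self):
        self.root = LogisticRegression(max_iter=1000)
        self.active = LogisticRegression(max_iter=1000)   # e.g. anger vs. happiness
        self.passive = LogisticRegression(max_iter=1000)  # e.g. sadness vs. neutral

    def fit(self, X, y):
        y = np.asarray(y)
        is_active = np.isin(y, ['anger', 'happiness']).astype(int)
        self.root.fit(X, is_active)
        self.active.fit(X[is_active == 1], y[is_active == 1])
        self.passive.fit(X[is_active == 0], y[is_active == 0])
        return self

    def predict(self, X):
        top = self.root.predict(X)
        out = np.empty(len(X), dtype=object)
        idx_a, idx_p = top == 1, top == 0
        if idx_a.any():
            out[idx_a] = self.active.predict(X[idx_a])
        if idx_p.any():
            out[idx_p] = self.passive.predict(X[idx_p])
        return out
```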


Journal ArticleDOI
TL;DR: This study investigates auditory-model-based DOA estimation, emphasizing known features and limitations of auditory binaural processing, such as high temporal resolution, a restricted frequency range for exploiting temporal fine structure, and a limited range over which interaural time delay can be compensated.

127 citations


Journal ArticleDOI
TL;DR: The performance of a spoken dialogue system that provides substantive dynamic responses to automatically detected user affective states is evaluated and a detailed system error analysis is presented that reveals challenges for real-time affect detection and adaptation.

110 citations


Journal ArticleDOI
TL;DR: The present study elaborates on the exploitation of both linguistic and acoustic feature modeling for anger classification, evaluating classification success with the F1 measure in addition to overall accuracy figures.

100 citations


Journal ArticleDOI
TL;DR: Three new objective measures are proposed for predicting the intelligibility of processed speech in noisy conditions; they use a critical-band spectral representation of the clean and noise-suppressed signals and are based on measuring the SNR loss incurred after the corrupted signal passes through a speech enhancement algorithm.

88 citations
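A very rough sketch of an SNR-loss style measure is given below, under the assumption that the loss in each band is the drop in band SNR caused by the enhancement residual. The published measures use specific critical-band filters, SNR limits, and band weightings that are not reproduced here.

```python
import numpy as np
from scipy.signal import stft

def mean_snr_loss(clean, enhanced, noisy, fs, n_bands=20):
    """Sketch of a band-SNR-loss style intelligibility measure (illustrative only)."""
    def band_power(x):
        _, _, X = stft(x, fs, nperseg=512, noverlap=256)
        P = np.abs(X) ** 2
        edges = np.linspace(0, P.shape[0], n_bands + 1, dtype=int)
        # per-frame power in each (here: uniform, not critical) band
        return np.stack([P[a:b].sum(axis=0) for a, b in zip(edges[:-1], edges[1:])])

    S, Y, D = band_power(clean), band_power(enhanced), band_power(noisy)
    eps = 1e-12
    snr_in = 10 * np.log10(S / (np.abs(D - S) + eps) + eps)    # clean vs. additive noise
    snr_out = 10 * np.log10(S / (np.abs(Y - S) + eps) + eps)   # clean vs. enhancement error
    loss = np.clip(snr_in - snr_out, 0, 30)                    # limit the per-band range
    return loss.mean()
```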


Journal ArticleDOI
TL;DR: Gracie is the first spoken dialog system that recognizes a user's emotional state from his or her speech and responds with appropriate emotional coloring, showing that dialog systems can tap into this important level of interpersonal interaction with today's technology.

84 citations


Journal ArticleDOI
TL;DR: Two procedures for the calculation of forensic likelihood ratios were tested on the same set of acoustic-phonetic data; the performance of the fused GMM-UBM system was much better than that of the fused MVKD system.

82 citations
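For the GMM-UBM procedure, the score of a questioned recording is essentially a log-likelihood ratio between a suspect model and a universal background model. The sketch below uses scikit-learn's GaussianMixture as a stand-in; the MAP adaptation and logistic-regression calibration/fusion used in the study are omitted.

```python
from sklearn.mixture import GaussianMixture

def gmm_ubm_log_lr(ubm_feats, suspect_feats, trace_feats, n_components=64):
    """Sketch: average frame log-likelihood ratio between a suspect-specific GMM
    and a universal background model (UBM) for a questioned (trace) recording."""
    ubm = GaussianMixture(n_components, covariance_type='diag',
                          max_iter=200).fit(ubm_feats)
    # simple stand-in for MAP adaptation: initialise on the UBM's solution
    spk = GaussianMixture(n_components, covariance_type='diag',
                          means_init=ubm.means_, weights_init=ubm.weights_,
                          max_iter=20).fit(suspect_feats)
    return (spk.score_samples(trace_feats) - ubm.score_samples(trace_feats)).mean()
```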


Journal ArticleDOI
TL;DR: Results of classification experiments show that the spectral centroid features consistently and significantly outperform a baseline system employing MFCC, pitch, and intensity features, and that fusing an SCF-based system with an SCA-based system yields a relative reduction in error rate.
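A sketch of subband spectral centroid frequency (SCF) and spectral centroid amplitude (SCA) features for a single frame is given below; the band edges and the exact SCA formulation are illustrative choices and may differ from the paper's definitions.

```python
import numpy as np

def spectral_centroid_features(frame, fs, n_bands=8, n_fft=512):
    """Sketch of per-subband spectral centroid frequency (SCF) and
    spectral centroid amplitude (SCA) for one analysis frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    scf, sca = [], []
    for a, b in zip(edges[:-1], edges[1:]):
        w, f = spec[a:b], freqs[a:b]
        denom = w.sum() + 1e-12
        scf.append((f * w).sum() / denom)   # centroid frequency of the band
        sca.append((w * w).sum() / denom)   # one common amplitude-weighted formulation
    return np.array(scf), np.array(sca)
```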

Journal ArticleDOI
TL;DR: Using the database, some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, are revisited with the aim of improving speaker similarity, an acknowledged weakness of current HMM-based speech synthesisers.

Journal ArticleDOI
TL;DR: To stimulate expressively rich and vivid conversation, the "4-frame cartoon sorting task" was devised, and the perceived emotional states of speakers could be estimated accurately from the speech parameters in most cases.

Journal ArticleDOI
TL;DR: Data indicate that practice patterns have a significant effect on the fluency characteristics of public speaking performance, as speakers who started practicing earlier were less disfluent than those who started later.

Journal ArticleDOI
TL;DR: The results from objective experiments and blind subjective listening tests using the NOIZEUS corpus show that the MDKF (with clean speech parameters) outperforms all of the acoustic- and time-domain enhancement methods that were evaluated, including the time-domain Kalman filter with clean speech parameters.
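The modulation-domain idea can be sketched with a scalar Kalman filter tracking the temporal envelope of one spectral bin; the MDKF itself uses higher-order AR models of clean-speech modulation (hence the "clean speech parameters"), which this random-walk stand-in does not reproduce.

```python
import numpy as np

def kalman_smooth_envelope(env, var_process=1e-3, var_obs=1e-2):
    """Sketch: first-order Kalman filter applied to one spectral bin's temporal
    envelope (the 'modulation domain'), with fixed placeholder variances."""
    x_hat, p = env[0], 1.0
    out = np.empty_like(env)
    for t, z in enumerate(env):
        p = p + var_process                 # predict (random-walk state model)
        k = p / (p + var_obs)               # Kalman gain
        x_hat = x_hat + k * (z - x_hat)     # update with the noisy envelope sample
        p = (1.0 - k) * p
        out[t] = x_hat
    return out
```

In a full enhancer, a filter of this kind would be applied to every frequency bin of the noisy magnitude spectrogram before recombining with the noisy phase.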

Journal ArticleDOI
TL;DR: The main advantages of the proposed TS-BASE/WF are its effectiveness in dealing with non-stationary multiple-source interference signals and its success in preserving binaural cues after processing, as confirmed by comprehensive objective and subjective evaluations in different acoustical spatial configurations.

Journal ArticleDOI
TL;DR: The experimental results show that the proposed envelope and phase based features can improve recognition rates in clean and noisy conditions compared to the reference MFCC-based recognizer.

Journal ArticleDOI
TL;DR: This paper proposes novel spectrally weighted mel-frequency cepstral coefficient (WMFCC) features for emotion recognition from speech and evaluates them together with standard spectral and prosodic features using HMM-based classifiers on the spontaneous FAU Aibo emotional speech corpus.

Journal ArticleDOI
TL;DR: This paper describes efforts to transfer feature extraction and statistical modeling techniques from the fields of speaker and language identification to the related field of emotion recognition, and shows how Gaussian mixture modeling techniques can be applied on top of the extracted features.

Journal ArticleDOI
TL;DR: It is shown that the complex cepstrum causal-anticausal decomposition can be effectively used for glottal flow estimation when specific windowing criteria are met and has the potential to be used for voice quality analysis.

Journal ArticleDOI
TL;DR: An analysis based on phoneme confusions for both feature types suggests that spectro-temporal and purely spectral features carry complementary information.

Journal ArticleDOI
TL;DR: Major identified accent-specific cues include the devoicing of voiced stop consonants, the "rolled r", and schwa fronting or raising; these can contribute to improved pronunciation modeling in automatic speech recognition of accented speech.

Journal ArticleDOI
TL;DR: A noisy-speech enhancement method is presented that combines linear prediction (LP) residual weighting in the time domain with spectral processing in the frequency domain to provide better noise suppression as well as better enhancement in the speech regions.
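The time-domain half of such a scheme rests on the LP residual, obtained by inverse filtering with the estimated predictor. The sketch below shows that step only; the paper's residual weighting function and the complementary frequency-domain processing are not reproduced.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_residual(x, order=10):
    """Sketch: linear-prediction residual via autocorrelation LPC and inverse filtering."""
    # autocorrelation method for the LP coefficients
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])      # predictor coefficients
    inv_filter = np.concatenate(([1.0], -a))           # A(z) = 1 - sum_k a_k z^-k
    residual = lfilter(inv_filter, [1.0], x)           # e[n] = A(z) x[n]
    # a weighted residual could then be passed back through 1/A(z), e.g.:
    # enhanced = lfilter([1.0], inv_filter, weight * residual)
    return residual, inv_filter
```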

Journal ArticleDOI
TL;DR: The results show that turn-taking cues realized with a synthetic voice affect the judgements similarly to the corresponding human versions, and that there is no difference in reaction times between the two conditions.

Journal ArticleDOI
TL;DR: A class of hierarchical directed graphical models is developed and applied on the task of recognizing affective categories from prosody in both acted and natural speech, achieving rates within nearly 10% of human recognition accuracy despite only focusing on prosody.

Journal ArticleDOI
TL;DR: Evaluations of listener motion strategies demonstrated that two strategies were particularly effective for localisation; one was simply to move towards the most likely source location, which is beneficial in increasing the signal-to-noise ratio, particularly in reverberant conditions.

Journal ArticleDOI
TL;DR: Experiments on a word-level emphasis synthesis task show that all context-adaptive training approaches can outperform the standard full-context-dependent HMM approach, with the MLLR-based system achieving the best performance.

Journal ArticleDOI
TL;DR: This paper proposes a resampling technique, namely utterance partitioning with acoustic vector resampling (UP-AVR), to mitigate the data imbalance problem in GMM-SVM systems.
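The core of UP-AVR can be sketched as follows: randomly permute an enrollment utterance's feature vectors and split them into sub-utterances, repeating with several permutations, so that one utterance yields many target-class examples (each of which would be turned into a GMM supervector for the SVM). The partition and resampling counts below are placeholders.

```python
import numpy as np

def utterance_partitioning_avr(frames, n_partitions=4, n_resamplings=3, seed=0):
    """Sketch of utterance partitioning with acoustic vector resampling (UP-AVR)."""
    rng = np.random.default_rng(seed)
    subs = [frames]                               # keep the full utterance as well
    for _ in range(n_resamplings):
        order = rng.permutation(len(frames))      # resample the frame order
        for part in np.array_split(frames[order], n_partitions):
            subs.append(part)                     # each part acts as a sub-utterance
    return subs
```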

Journal ArticleDOI
TL;DR: Comparisons of French-speaking children and adults indicate that, whereas general cognitive maturation has some influence on the development of perceptual categorization, domain-specific effects also play a role, the structural complexity of the categories being one of them.

Journal ArticleDOI
TL;DR: Results indicate that listeners place a great deal of perceptual importance on the presence of artifacts and discontinuities in the speech, somewhat less importance on aspects of segmental quality, and very little importance on stress/intonation appropriateness.