
Showing papers in "Eurasip Journal on Audio, Speech, and Music Processing in 2012"


Journal ArticleDOI
TL;DR: Current digital audio steganographic techniques are reviewed and evaluated based on robustness, security, and hiding capacity indicators, and a robustness-based classification of steganographic models, depending on where they occur in the embedding process, is provided.
Abstract: The rapid spread of digital data usage in many real-life applications has urged new and effective ways to ensure their security. Efficient secrecy can be achieved, at least in part, by implementing steganography techniques. Novel and versatile audio steganographic methods have been proposed. The goal of steganographic systems is to obtain a secure and robust way to conceal a high rate of secret data. In this paper we focus on digital audio steganography, which has emerged as a prominent means of data hiding across novel telecommunication technologies such as covered voice-over-IP and audio conferencing. The multitude of steganographic criteria has led to a great diversity in these system design techniques. In this paper, we review current digital audio steganographic techniques and evaluate their performance based on robustness, security, and hiding capacity indicators. Another contribution of this paper is a robustness-based classification of steganographic models depending on their occurrence in the embedding process. A survey of major trends in audio steganography applications is also discussed.

175 citations


Journal ArticleDOI
TL;DR: Experimental results prove the efficiency of the proposed hiding technique: the stego signals are perceptually indistinguishable from the equivalent cover signal, while the secret speech message can be recovered with only slight degradation in quality.
Abstract: A new method to secure speech communication using the discrete wavelet transform (DWT) and the fast Fourier transform is presented in this article. In the first phase of the hiding technique, we separate the high-frequency components of the speech from the low-frequency components using the DWT. In the second phase, we exploit the low-pass spectral properties of the speech spectrum to hide another secret speech signal in the low-amplitude high-frequency regions of the cover speech signal. The proposed method allows hiding a large amount of secret information while rendering steganalysis more complex. Experimental results prove the efficiency of the proposed hiding technique: the stego signals are perceptually indistinguishable from the equivalent cover signal, while the secret speech message can be recovered with only slight degradation in quality.

69 citations
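
As a rough illustration of the DWT-based hiding idea described above, the sketch below embeds a secret signal's samples additively in the detail (high-frequency) DWT coefficients of a cover signal and recovers them by differencing against the cover. The wavelet, embedding strength alpha, and additive rule are illustrative assumptions, not the authors' exact scheme.

```python
import numpy as np
import pywt  # PyWavelets

def embed(cover, secret, wavelet="db4", alpha=0.01):
    """Embed a shorter secret signal additively in the detail band of the cover."""
    approx, detail = pywt.dwt(cover, wavelet)        # one-level DWT of the cover
    detail = detail.copy()
    n = min(len(secret), len(detail))
    detail[:n] += alpha * secret[:n]                 # assumed additive embedding rule
    return pywt.idwt(approx, detail, wavelet)        # stego signal

def extract(stego, cover, n, wavelet="db4", alpha=0.01):
    """Recover the secret by differencing detail bands (non-blind: needs the cover)."""
    _, d_stego = pywt.dwt(stego, wavelet)
    _, d_cover = pywt.dwt(cover, wavelet)
    return (d_stego[:n] - d_cover[:n]) / alpha

# Example with random placeholder signals:
# cover = np.random.randn(16000); secret = np.random.randn(4000)
# stego = embed(cover, secret); recovered = extract(stego, cover, n=4000)
```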


Journal ArticleDOI
TL;DR: A new set of acoustic features for automatic emotion recognition from audio based on the perceptual quality metrics that are given in perceptual evaluation of audio quality known as ITU BS.1387 recommendation is proposed.
Abstract: In this article, we propose a new set of acoustic features for automatic emotion recognition from audio. The features are based on the perceptual quality metrics that are given in perceptual evaluation of audio quality known as ITU BS.1387 recommendation. Starting from the outer and middle ear models of the auditory system, we base our features on the masked perceptual loudness, which defines relatively objective criteria for emotion detection. The features computed in critical bands based on the reference concept include the partial loudness of the emotional difference, emotional difference-to-perceptual mask ratio, measures of alterations of temporal envelopes, measures of harmonics of the emotional difference, the occurrence probability of emotional blocks, and perceptual bandwidth. A soft-majority voting decision rule that strengthens the conventional majority voting is proposed to assess the classifier outputs. Compared to state-of-the-art systems including the Munich Open-Source Emotion and Affect Recognition Toolkit, Hidden Markov Toolkit, and Generalized Discriminant Analysis, it is shown that the emotion recognition rates are improved by 7-16% for EMO-DB and 7-11% for VAM for the "all" and "valence" tasks.

63 citations
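
The soft-majority voting rule mentioned above can be contrasted with conventional hard majority voting in a small sketch. The formulation below (per-block class scores summed over blocks) is an assumed, generic version rather than the paper's exact rule.

```python
import numpy as np

def hard_majority(block_scores):
    """Conventional majority voting: one hard vote per block, most frequent class wins."""
    votes = np.argmax(block_scores, axis=1)
    return int(np.bincount(votes, minlength=block_scores.shape[1]).argmax())

def soft_majority(block_scores):
    """Soft-majority idea: accumulate the per-block scores and pick the largest total."""
    return int(np.argmax(block_scores.sum(axis=0)))

# block_scores: (num_blocks, num_emotions) classifier outputs for one utterance.
# scores = np.array([[0.40, 0.60], [0.55, 0.45], [0.52, 0.48]])
# hard_majority(scores) -> 0, while soft_majority(scores) -> 1, since the soft
# evidence for class 1 outweighs the two narrow hard votes for class 0.
```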


Journal ArticleDOI
TL;DR: New feature extraction methods, which utilize wavelet decomposition and reduced-order linear predictive coding (LPC) coefficients, have been proposed for speech recognition, and the experimental results show the superiority of the proposed techniques over conventional methods such as linear predictive cepstral coefficients, Mel-frequency cepstral coefficients, spectral subtraction, and cepstral mean normalization in the presence of additive white Gaussian noise.
Abstract: In this article, new feature extraction methods, which utilize wavelet decomposition and reduced-order linear predictive coding (LPC) coefficients, have been proposed for speech recognition. The coefficients have been derived from speech frames decomposed using the discrete wavelet transform. LPC coefficients derived from subband decomposition (abbreviated as WLPC) of a speech frame provide a better representation than modeling the frame directly. The WLPC coefficients have been further normalized in the cepstrum domain to obtain a new set of features denoted as wavelet subband cepstral mean normalized features. The proposed approaches provide effective (better recognition rate), efficient (reduced feature vector dimension), and noise-robust features. The performance of these techniques has been evaluated on the TI-46 isolated word database and a self-created Marathi digits database in a white noise environment using the continuous density hidden Markov model. The experimental results also show the superiority of the proposed techniques over conventional methods such as linear predictive cepstral coefficients, Mel-frequency cepstral coefficients, spectral subtraction, and cepstral mean normalization in the presence of additive white Gaussian noise.

57 citations
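
A rough, hypothetical sketch of the WLPC idea follows: each frame is split into approximation and detail subbands with a one-level DWT, and a low-order LPC model is fitted to each subband. The wavelet choice, LPC order, and autocorrelation-method implementation are illustrative assumptions only.

```python
import numpy as np
import pywt

def lpc_autocorr(x, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin recursion)."""
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:n + order]   # autocorrelation lags 0..order
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]          # RHS is built from old values before assignment
        err *= 1.0 - k * k
    return a[1:]                                            # drop the leading 1

def wlpc_features(frame, wavelet="db4", order=6):
    """Concatenate low-order LPC fits of the approximation and detail subbands."""
    approx, detail = pywt.dwt(frame, wavelet)
    return np.concatenate([lpc_autocorr(approx, order), lpc_autocorr(detail, order)])

# frame = np.random.randn(256)      # one windowed speech frame (placeholder)
# features = wlpc_features(frame)   # 12-dimensional WLPC-style feature vector
```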


Journal ArticleDOI
TL;DR: This work proposes a unifying framework to generate emotions across voice, gesture, and music, by representing emotional states as a 4-parameter tuple of speed, intensity, regularity, and extent (SIRE).
Abstract: It has long been speculated that expressions of emotion in different modalities share the same underlying 'code', whether it be a dance step, musical phrase, or tone of voice. This is the first attempt to implement this theory across three modalities, inspired by the polyvalence and repeatability of robotics. We propose a unifying framework to generate emotions across voice, gesture, and music, by representing emotional states as a 4-parameter tuple of speed, intensity, regularity, and extent (SIRE). Our results show that a simple 4-tuple can capture four emotions recognizable at greater than chance level across gesture and voice, and at least two emotions across all three modalities. An application for multi-modal, expressive music robots is discussed.

36 citations


Journal ArticleDOI
TL;DR: Evaluation results are presented for the task of speaker diarization of broadcast news, which was part of the Albayzin 2010 evaluation campaign of language and speech technologies; the evaluation data consist of a subset of the Catalan broadcast news database recorded from the 3/24 TV channel.
Abstract: In this article, we present the evaluation results for the task of speaker diarization of broadcast news, which was part of the Albayzin 2010 evaluation campaign of language and speech technologies. The evaluation data consists of a subset of the Catalan broadcast news database recorded from the 3/24 TV channel. The description of five submitted systems from five different research labs is given, marking the common as well as the distinctive system features. The diarization performance is analyzed in the context of the diarization error rate, the number of detected speakers and also the acoustic background conditions. An effort is also made to put the achieved results in relation to the particular system design features.

33 citations


Journal ArticleDOI
TL;DR: A novel voice activity detection (VAD) approach based on phoneme recognition using Gaussian Mixture Model based Hidden Markov Model (HMM/GMM) is proposed, which shows a higher speech/non-speech detection accuracy over a wide range of SNR regimes compared with some existing VAD methods.
Abstract: In this article, a novel voice activity detection (VAD) approach based on phoneme recognition using a Gaussian Mixture Model based Hidden Markov Model (HMM/GMM) is proposed. Some sophisticated speech features such as high-order statistics (HOS), harmonic structure information, and Mel-frequency cepstral coefficients (MFCCs) are employed to represent each speech/non-speech segment. The main idea of this new method is to regard non-speech as a new phoneme alongside the conventional phonemes in Mandarin; all of them are then trained under the maximum likelihood principle with the Baum-Welch algorithm using the GMM/HMM model. The Viterbi decoding algorithm is finally used to search for the maximum likelihood of the observed signals. The proposed method shows a higher speech/non-speech detection accuracy over a wide range of SNR regimes compared with some existing VAD methods. We also propose a different method to demonstrate that conventional speech enhancement, even with accurate VAD, is not effective enough for automatic speech recognition (ASR) at low SNR regimes.

23 citations


Journal ArticleDOI
TL;DR: Analysis of data covers the issues of the number of basic dimensions in music mood, their relation to valence and arousal, the distribution of moods in the valence–arousal plane, distinctiveness of the labels, and appropriate (number of) labels for full coverage of the plane.
Abstract: Mood is an important aspect of music, and knowledge of mood can be used as a basic feature in music recommender and retrieval systems. A listening experiment was carried out establishing ratings for various moods and a number of attributes, e.g., valence and arousal. The analysis of these data covers the number of basic dimensions in music mood, their relation to valence and arousal, the distribution of moods in the valence-arousal plane, the distinctiveness of the labels, and the appropriate (number of) labels for full coverage of the plane. It is also shown that subject-averaged valence and arousal ratings can be predicted from music features by a linear model.

17 citations
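
The final claim, that subject-averaged valence and arousal ratings can be predicted from music features with a linear model, corresponds to an ordinary multi-output linear regression. The snippet below is a toy illustration with placeholder data, not the study's actual features or ratings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))   # 40 excerpts x 5 music features (placeholder values)
Y = rng.normal(size=(40, 2))   # columns: subject-averaged valence and arousal ratings

model = LinearRegression().fit(X, Y)      # one linear model with two outputs
valence_arousal = model.predict(X[:1])    # predicted (valence, arousal) for one excerpt
```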


Journal ArticleDOI
TL;DR: The proposed synthesis technique surpasses the synthesis performance of evolutionary artificial neural networks, exhibiting a considerable capability to accurately distinguish among different audio classes.
Abstract: A vast number of audio features have been proposed in the literature to characterize the content of audio signals. In order to overcome specific problems related to the existing features (such as lack of discriminative power), as well as to reduce the need for manual feature selection, in this article, we propose an evolutionary feature synthesis technique with a built-in feature selection scheme. The proposed synthesis process searches for optimal linear/nonlinear operators and feature weights from a pre-defined multi-dimensional search space to generate a highly discriminative set of new (artificial) features. The evolutionary search process is based on a stochastic optimization approach in which a multi-dimensional particle swarm optimization algorithm, along with fractional global best formation and heterogeneous particle behavior techniques, is applied. Unlike many existing feature generation approaches, the dimensionality of the synthesized feature vector is also searched and optimized within a set range in order to better meet the varying requirements set by many practical applications and classifiers. The new features generated by the proposed synthesis approach are compared with typical low-level audio features in several classification and retrieval tasks. The results demonstrate a clear improvement of up to 15-20% in average retrieval performance. Moreover, the proposed synthesis technique surpasses the synthesis performance of evolutionary artificial neural networks, exhibiting a considerable capability to accurately distinguish among different audio classes.

15 citations


Journal ArticleDOI
TL;DR: A class of methods for solving the permutation problem based on information theoretic distance measures is presented, which have been tested on different real-room speech mixtures with different reverberation times in conjunction with different ICA algorithms.
Abstract: The problem of blind source separation (BSS) of convolved acoustic signals is of great interest for many classes of applications. Due to the convolutive mixing process, the source separation is performed in the frequency domain, using independent component analysis (ICA). However, frequency domain BSS involves several major problems that must be solved. One of these is the permutation problem. The permutation ambiguity of ICA needs to be resolved so that each separated signal contains the frequency components of only one source signal. This article presents a class of methods for solving the permutation problem based on information theoretic distance measures. The proposed algorithms have been tested on different real-room speech mixtures with different reverberation times in conjunction with different ICA algorithms.

15 citations
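
To make the permutation problem concrete, the sketch below greedily aligns the per-bin outputs of a two-source frequency-domain separation by comparing magnitude envelopes of adjacent bins with a symmetric Kullback-Leibler divergence. This is only a schematic stand-in; the article's specific information-theoretic distance measures and the ICA stage itself are not reproduced.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two non-negative envelopes (normalised first)."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return np.sum(p * np.log((p + eps) / (q + eps))) + np.sum(q * np.log((q + eps) / (p + eps)))

def align_permutations(env):
    """env: (freq_bins, 2, frames) magnitude envelopes of two separated outputs."""
    env = env.copy()
    for f in range(1, env.shape[0]):
        keep = sym_kl(env[f, 0], env[f - 1, 0]) + sym_kl(env[f, 1], env[f - 1, 1])
        swap = sym_kl(env[f, 1], env[f - 1, 0]) + sym_kl(env[f, 0], env[f - 1, 1])
        if swap < keep:                         # swapped channels match the neighbour better
            env[f] = env[f, ::-1].copy()
    return env
```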


Journal ArticleDOI
TL;DR: A novel interleaver scheme based on the chaotic Baker map is presented for protection against error bursts and reduction of packet loss of audio signals over wireless networks, and it improves the quality of the received audio signal.
Abstract: This article studies a vital issue in wireless communications, which is the transmission of audio signals over wireless networks. It presents a novel interleaver scheme for protection against error bursts and reduction of the packet loss of audio signals. The proposed technique is a chaotic interleaver based on the chaotic Baker map. It is used as a data randomizing tool to improve the quality of audio over mobile communication channels. A comparison study between the proposed chaotic interleaving scheme and the traditional block and convolutional interleaving schemes for audio transmission over uncorrelated and correlated fading channels is presented. The simulation results show the superiority of the proposed chaotic interleaving scheme over the traditional schemes. The simulation results also reveal that the proposed chaotic interleaver improves the quality of the received audio signal and increases the throughput over the wireless link through packet loss reduction.
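
A hedged sketch of a Baker-map style interleaver is given below, using one common discretisation of the chaotic Baker map; the block size and key are illustrative choices, not the paper's settings. Audio samples or bytes are written into an N x N block, their positions are permuted by the map, and the block is read out for transmission; de-interleaving applies the inverse permutation.

```python
import numpy as np

def baker_permutation(N, key):
    """Position map (r, s) -> B(r, s) for a secret key (n1, ..., nk) with sum(key) == N."""
    assert sum(key) == N and all(N % n == 0 for n in key)
    perm = np.empty((N, N, 2), dtype=int)
    Ni = 0
    for n in key:
        q = N // n
        for r in range(Ni, Ni + n):
            for s in range(N):
                perm[r, s] = (q * (r - Ni) + s % q, (s - s % q) // q + Ni)
        Ni += n
    return perm

def interleave(block, perm):
    """Scatter the samples of an N x N block according to the chaotic permutation."""
    out = np.empty_like(block)
    for r in range(block.shape[0]):
        for s in range(block.shape[1]):
            rr, ss = perm[r, s]
            out[rr, ss] = block[r, s]
    return out

# data = np.arange(64).reshape(8, 8)          # one 8 x 8 block of audio samples (placeholder)
# perm = baker_permutation(8, key=(2, 4, 2))  # illustrative secret key
# scrambled = interleave(data, perm)
```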

Journal ArticleDOI
Masami Akamine1, Jitendra Ajmera2
TL;DR: This article proposes a new acoustic model using decision trees (DTs) as replacements for Gaussian mixture models (GMM) to compute the observation likelihoods for a given hidden Markov model state in a speech recognition system.
Abstract: This article proposes a new acoustic model using decision trees (DTs) as replacements for Gaussian mixture models (GMM) to compute the observation likelihoods for a given hidden Markov model state in a speech recognition system. DTs have a number of advantageous properties, such as that they do not impose restrictions on the number or types of features, and that they automatically perform feature selection. This article explores and exploits DTs for the purpose of large vocabulary speech recognition. Equal and decoding questions have newly been introduced into DTs to directly model the gender- and context-dependent acoustic space. Experimental results for the 5k ARPA Wall Street Journal task show that context information significantly improves the performance of DT-based acoustic models, as expected. Context-dependent DT-based models are highly compact compared to conventional GMM-based acoustic models. This means that the proposed models have effective data-sharing across various context classes.

Journal ArticleDOI
TL;DR: The proposed approach improves the widely used spectral subtraction, which inherently suffers from musical noise artifacts, through a psychoacoustic masking and critical-band variance normalization technique that improves automatic speech recognition (ASR) performance.
Abstract: This article describes a modified technique for enhancing noisy speech to improve automatic speech recognition (ASR) performance. The proposed approach improves the widely used spectral subtraction which inherently suffers from the associated musical noise effects. Through a psychoacoustic masking and critical band variance normalization technique, the artifacts produced by spectral subtraction are minimized for improving the ASR accuracy. The popular advanced ETSI-2 front end is tested for comparison purposes. The performed speech recognition evaluations on the noisy standard AURORA-2 tasks show enhanced performance for all noise conditions.
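
For reference, the baseline that the article improves upon, plain spectral subtraction, can be sketched as follows. The psychoacoustic masking and critical-band variance normalization refinements are not reproduced here, and the over-subtraction factor and spectral floor are illustrative values.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, alpha=2.0, beta=0.01):
    """Basic magnitude spectral subtraction with an over-subtraction factor and floor."""
    f, t, X = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)       # noise estimate from leading frames
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * noise_mag)   # floor avoids negative magnitudes
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced

# noisy = np.random.randn(16000)                 # placeholder noisy speech at fs = 8 kHz
# enhanced = spectral_subtraction(noisy, fs=8000)
```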

Journal ArticleDOI
TL;DR: This study proposes a music-aided framework for affective interaction of service robots with humans and proposes a novel approach to identify human emotions in the perception system, which exhibited superior performance over the conventional approach.
Abstract: This study proposes a music-aided framework for affective interaction of service robots with humans. The framework consists of three systems, respectively, for perception, memory, and expression on the basis of the human brain mechanism. We propose a novel approach to identify human emotions in the perception system. The conventional approaches use speech and facial expressions as representative bimodal indicators for emotion recognition. But, our approach uses the mood of music as a supplementary indicator to more correctly determine emotions along with speech and facial expressions. For multimodal emotion recognition, we propose an effective decision criterion using records of bimodal recognition results relevant to the musical mood. The memory and expression systems also utilize musical data to provide natural and affective reactions to human emotions. For evaluation of our approach, we simulated the proposed human-robot interaction with a service robot, iRobiQ. Our perception system exhibited superior performance over the conventional approach, and most human participants noted favorable reactions toward the music-aided affective interaction.

Journal ArticleDOI
TL;DR: A novel approach for robust dialogue act detection in a spoken dialogue system is proposed, which achieves a detection accuracy of 85.1%, which is significantly better than the baseline performance of 62.3% using a naïve Bayes classifier.
Abstract: A novel approach for robust dialogue act detection in a spoken dialogue system is proposed. Shallow representation named partial sentence trees are employed to represent automatic speech recognition outputs. Parsing results of partial sentences can be decomposed into derivation rules, which turn out to be salient features for dialogue act detection. Data-driven dialogue acts are learned via an unsupervised learning algorithm called spectral clustering, in a vector space whose axes correspond to derivation rules. The proposed method is evaluated in a Mandarin spoken dialogue system for tourist-information services. Combined with information obtained from the automatic speech recognition module and from a Markov model on dialogue act sequence, the proposed method achieves a detection accuracy of 85.1%, which is significantly better than the baseline performance of 62.3% using a naive Bayes classifier. Furthermore, the average number of turns per dialogue session also decreases significantly with the improved detection accuracy.

Journal ArticleDOI
TL;DR: This article investigates the use of a robotic arm as a bidirectional tangible interface for musical expression, actively modifying the compliant control strategy to create a bind between gestural input and music output.
Abstract: The availability of haptic interfaces in music content processing offers interesting possibilities of performer-instrument interaction for musical expression. These new musical instruments can precisely modulate the haptic feedback, and map it to a sonic output, thus offering new artistic content creation possibilities. With this article, we investigate the use of a robotic arm as a bidirectional tangible interface for musical expression, actively modifying the compliant control strategy to create a bind between gestural input and music output. The user can define recursive modulations of music parameters by grasping and gradually refining periodic movements on a gravity-compensated robot manipulator. The robot learns on-line the new desired trajectory, increasing its stiffness as the modulation refinement proceeds. This article reports early results of an artistic performance that has been carried out with the collaboration of a musician, who played with the robot as part of his live stage setup.

Journal ArticleDOI
TL;DR: A speaker-dependent model interpolation method for statistical emotional speech synthesis that achieves sound performance on the emotional expressiveness, the naturalness, and the target speaker similarity without the need to collect the emotional speech of thetarget speaker, saving the cost of data collection and labeling.
Abstract: In this article, we propose a speaker-dependent model interpolation method for statistical emotional speech synthesis. The basic idea is to combine the neutral model set of the target speaker and an emotional model set selected from a pool of speakers. For model selection and interpolation weight determination, we propose to use a novel monophone-based Mahalanobis distance, which is a proper distance measure between two Hidden Markov Model sets. We design Latin-square evaluation to reduce the systematic bias in the subjective listening tests. The proposed interpolation method achieves sound performance on the emotional expressiveness, the naturalness, and the target speaker similarity. Moreover, such performance is achieved without the need to collect the emotional speech of the target speaker, saving the cost of data collection and labeling.
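
The monophone-based Mahalanobis distance used for model selection can be pictured with a loose sketch: compare the mean vectors of matching monophone Gaussians under an averaged diagonal variance, and average over all shared monophones. The exact definition in the paper may differ (for example in how states and mixture components are handled), so treat this as an assumption-laden illustration.

```python
import numpy as np

def mahalanobis_diag(mu1, var1, mu2, var2):
    """Mahalanobis-style distance between two diagonal Gaussians using averaged variances."""
    var = 0.5 * (var1 + var2)
    d = mu1 - mu2
    return np.sqrt(np.sum(d * d / var))

def model_set_distance(set_a, set_b):
    """set_a, set_b: dicts mapping monophone name -> (mean vector, diagonal variance vector)."""
    shared = set(set_a) & set(set_b)
    return float(np.mean([mahalanobis_diag(*set_a[p], *set_b[p]) for p in shared]))

# neutral = {"a": (np.zeros(39), np.ones(39)), "i": (np.ones(39), np.ones(39))}
# emotional = {"a": (0.5 * np.ones(39), np.ones(39)), "i": (np.ones(39), 2 * np.ones(39))}
# d = model_set_distance(neutral, emotional)   # smaller d -> closer candidate model set
```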

Journal ArticleDOI
TL;DR: An audiovisual integration method for beat-tracking for live guitar performances that is capable of real-time processing with a suppressed number of particles while preserving the estimation accuracy, and demonstrates an ensemble with the humanoid HRP-2 that plays the theremin with a human guitarist.
Abstract: The aim of this paper is to improve beat-tracking for live guitar performances. Beat-tracking is a function to estimate musical measurements, for example musical tempo and phase. This method is critical to achieve a synchronized ensemble performance such as musical robot accompaniment. Beat-tracking of a live guitar performance has to deal with three challenges: tempo fluctuation, beat pattern complexity and environmental noise. To cope with these problems, we devise an audiovisual integration method for beat-tracking. The auditory beat features are estimated in terms of tactus (phase) and tempo (period) by Spectro-Temporal Pattern Matching (STPM), robust against stationary noise. The visual beat features are estimated by tracking the position of the hand relative to the guitar using optical flow, mean shift and the Hough transform. Both estimated features are integrated using a particle filter to aggregate the multimodal information based on a beat location model and a hand's trajectory model. Experimental results confirm that our beat-tracking improves the F-measure by 8.9 points on average over the Murata beat-tracking method, which uses STPM and rule-based beat detection. The results also show that the system is capable of real-time processing with a suppressed number of particles while preserving the estimation accuracy. We demonstrate an ensemble with the humanoid HRP-2 that plays the theremin with a human guitarist.

Journal ArticleDOI
TL;DR: A biologically-motivated multi-resolution speaker information representation obtained by performing an intricate yet computationally-efficient analysis of the information-rich spectro-temporal attributes of the speech signal is explored.
Abstract: Humans exhibit a remarkable ability to reliably classify sound sources in the environment even in presence of high levels of noise. In contrast, most engineering systems suffer a drastic drop in performance when speech signals are corrupted with channel or background distortions. Our brains are equipped with elaborate machinery for speech analysis and feature extraction, which hold great lessons for improving the performance of automatic speech processing systems under adverse conditions. The work presented here explores a biologically-motivated multi-resolution speaker information representation obtained by performing an intricate yet computationally-efficient analysis of the information-rich spectro-temporal attributes of the speech signal. We evaluate the proposed features in a speaker verification task performed on NIST SRE 2010 data. The biomimetic approach yields significant robustness in presence of non-stationary noise and reverberation, offering a new framework for deriving reliable features for speaker recognition and speech processing.

Journal ArticleDOI
TL;DR: Evidence is found that supports the use of the proposed dance representation for flexibly modeling and synthesizing dance sequences from different popular dance styles, with potential developments for the generation of expressive and natural movement profiles onto humanoid dancing characters.
Abstract: Dance movements are a complex class of human behavior which convey forms of non-verbal and subjective communication that are performed as cultural vocabularies in all human cultures. The singularity of dance forms imposes fascinating challenges to computer animation and robotics, which in turn presents outstanding opportunities to deepen our understanding about the phenomenon of dance by means of developing models, analyses and syntheses of motion patterns. In this article, we formalize a model for the analysis and representation of popular dance styles of repetitive gestures by specifying the parameters and validation procedures necessary to describe the spatiotemporal elements of the dance movement in relation to its music temporal structure (musical meter). Our representation model is able to precisely describe the structure of dance gestures according to the structure of musical meter, at different temporal resolutions, and is flexible enough to convey the variability of the spatiotemporal relation between music structure and movement in space. It results in a compact and discrete mid-level representation of the dance that can be further applied to algorithms for the generation of movements in different humanoid dancing characters. The validation of our representation model relies upon two hypotheses: (i) the impact of metric resolution and (ii) the impact of variability towards fully and naturally representing a particular dance style of repetitive gestures. We numerically and subjectively assess these hypotheses by analyzing solo dance sequences of Afro-Brazilian samba and American Charleston, captured with a MoCap (Motion Capture) system. From these analyses, we build a set of dance representations modeled with different parameters, and re-synthesize motion sequence variations of the represented dance styles. For specifically assessing the metric hypothesis, we compare the captured dance sequences with repetitive sequences of a fixed dance motion pattern, synthesized at different metric resolutions for both dance styles. In order to evaluate the hypothesis of variability, we compare the same repetitive sequences with others synthesized with variability, by generating and concatenating stochastic variations of the represented dance pattern. The observed results validate the proposition that different dance styles of repetitive gestures might require a minimum and sufficient metric resolution to be fully represented by the proposed representation model. Yet, these also suggest that additional information may be required to synthesize variability in the dance sequences while assuring the naturalness of the performance. Nevertheless, we found evidence that supports the use of the proposed dance representation for flexibly modeling and synthesizing dance sequences from different popular dance styles, with potential developments for the generation of expressive and natural movement profiles onto humanoid dancing characters.

Journal ArticleDOI
TL;DR: The first experiment showed that visual rhythm has little influence on rope-turning cooperation between humans, and the second experiment provided firmer evidence for the same hypothesis because the test subjects neglected their visual rhythms.
Abstract: As fundamental research for human-robot interaction, this paper addresses the rhythmic reference of a human while turning a rope with another human. We hypothesized that, when interpreting rhythm cues to form a rhythm reference, humans rely on auditory and force rhythms more than visual ones. We examined test subjects aged 21-23 years. We masked the perception of each test subject using three kinds of masks: an eye-mask, headphones, and a force mask. The force mask is composed of a robot arm and a remote controller. These instruments allow a test subject to turn a rope without feeling force from the rope. In the first experiment, each test subject interacted with an operator who turned a rope with a constant rhythm. Eight experiments were conducted for each test subject wearing combinations of masks. We measured the angular velocity of the force between a test subject/the operator and the rope, calculated the error between the angular velocities of the force directions, and validated the error. In the second experiment, two test subjects interacted with each other. An auditory rhythm of 1.6-2.4 Hz was presented through headphones to indicate the target turning frequency. In addition to the auditory rhythm, the test subjects wore eye-masks. The first experiment showed that visual rhythm has little influence on rope-turning cooperation between humans. The second experiment provided firmer evidence for the same hypothesis, because the test subjects neglected their visual rhythms.

Journal ArticleDOI
TL;DR: Heuristics that exploit the harmonic information of each pitch are used to tackle limitations of NMF in polyphonic music transcription, and they significantly improve the accuracy of the transcription output compared to the standard NMF approach.
Abstract: This article discusses our research on polyphonic music transcription using non-negative matrix factorisation (NMF). The application of NMF in polyphonic transcription offers an alternative approach in which observed frequency spectra from polyphonic audio can be seen as an aggregation of spectra from monophonic components. However, it is not easy to find accurate aggregations using a standard NMF procedure, since there are many ways to satisfy the factoring V ≈ WH. Three limitations associated with the application of standard NMF to factor frequency spectra are (i) the permutation of the transcription output; (ii) the unknown factorisation rank r; and (iii) the factors W and H, which have a tendency to be trapped in a sub-optimal solution. This work explores the use of heuristics that exploit the harmonic information of each pitch to tackle these limitations. In our implementation, this harmonic information is learned from training data consisting of the pitches of a desired instrument, while the unknown effective r is approximated from the correlation between the input signal and the training data. This approach offers an effective exploitation of the domain knowledge. The empirical results show that the proposed approach can significantly improve the accuracy of the transcription output compared to the standard NMF approach.
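
One simple way to picture the template-based idea is to keep W fixed to pre-learned per-pitch harmonic templates and update only the activations H with standard multiplicative NMF rules, so that the known templates side-step the permutation and rank issues. The sketch below uses the Euclidean update and is not the paper's full heuristic procedure.

```python
import numpy as np

def transcribe(V, W, iters=200, eps=1e-9):
    """V: (freq, frames) magnitude spectrogram; W: (freq, pitches) fixed harmonic templates."""
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # Euclidean multiplicative update with W fixed
    return H                                    # thresholding H yields the piano-roll transcription

# W could be learned beforehand by running NMF on isolated notes of the target instrument.
```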

Journal ArticleDOI
TL;DR: A fast Multi-Candidate (MC) approach that solves the per-Gaussian CLSQ problems approximately by selecting the best from a small set of candidate solutions, which are generated as the MDT solutions on a reduced set of cluster Gaussians is proposed.
Abstract: The application of Missing Data Techniques (MDT) to increase the noise robustness of HMM/GMM-based large vocabulary speech recognizers is hampered by a large computational burden. The likelihood evaluations imply solving many constrained least squares (CLSQ) optimization problems. As an alternative, researchers have proposed frontend MDT or have made oversimplifying independence assumptions for the backend acoustic model. In this article, we propose a fast Multi-Candidate (MC) approach that solves the per-Gaussian CLSQ problems approximately by selecting the best from a small set of candidate solutions, which are generated as the MDT solutions on a reduced set of cluster Gaussians. Experiments show that the MC MDT runs equally fast as the uncompensated recognizer while achieving the accuracy of the full backend optimization approach. The experiments also show that exploiting the more accurate acoustic model of the backend does pay off in terms of accuracy when compared to frontend MDT.

Journal ArticleDOI
TL;DR: A consistent and theoretically sound framework for combining perception and control for accurate musical timing is presented and a hierarchical hidden Markov model that combines event detection and tempo tracking is developed.
Abstract: Interaction with human musicians is a challenging task for robots as it involves online perception and precise synchronization. In this paper, we present a consistent and theoretically sound framework for combining perception and control for accurate musical timing. For the perception, we develop a hierarchical hidden Markov model that combines event detection and tempo tracking. The robot performance is formulated as a linear quadratic control problem that is able to generate a surprisingly complex timing behavior in adapting the tempo. We provide results with both simulated and real data. In our experiments, a simple Lego robot percussionist accompanied the music by detecting the tempo and position of clave patterns in the polyphonic music. The robot successfully synchronized itself with the music by quickly adapting to the changes in the tempo.

Journal ArticleDOI
TL;DR: A study on force-feedback interaction with a model of a neural oscillator provides insight into enhanced human-robot interactions for controlling musical sound and suggests an extension of dynamic pattern theory to force- feedback tasks.
Abstract: A study on force-feedback interaction with a model of a neural oscillator provides insight into enhanced human-robot interactions for controlling musical sound. We provide differential equations and discrete-time computable equations for the core oscillator model developed by Edward Large for simulating rhythm perception. Using a mechanical analog parameterization, we derive a force-feedback model structure that enables a human to share control of a virtual percussion instrument with a "robotic" neural oscillator. A formal human subject test indicated that strong coupling (STRNG) between the force-feedback device and the neural oscillator provided subjects with the best control. Overall, the human subjects predominantly found the interaction to be "enjoyable" and "fun" or "entertaining." However, there were indications that some subjects preferred a medium-strength coupling (MED), presumably because they were unaccustomed to such strong force-feedback interaction with an external agent. With related models, test subjects performed better when they could synchronize their input in phase with a dominant sensory feedback modality. In contrast, subjects tended to perform worse when an optimal strategy was to move the force-feedback device with a 90° phase lag. Our results suggest an extension of dynamic pattern theory to force-feedback tasks. In closing, we provide an overview of how a similar force-feedback scenario could be used in a more complex musical robotics setting.

Journal ArticleDOI
TL;DR: A novel architecture is demonstrated for embedding phone-based language recognition into a large vocabulary continuous speech recognition decoder by sharing the same decoding process but generating separate lattices, with phone lattice reconstruction algorithms proposed to compensate for the prior bias introduced by the pronunciation dictionary and the language model of the LVCSR decoder.
Abstract: An increasing number of multilingual applications require language recognition (LRE) as a frontend, but desire low additional computational cost. This article demonstrates a novel architecture for embedding phone based language recognition into a large vocabulary continuous speech recognition (LVCSR) decoder by sharing the same decoding process but generating separate lattices. To compensate for the prior bias introduced by the pronunciation dictionary and the language model of the LVCSR decoder, three different phone lattice reconstruction algorithms are proposed. The underlying goals of these algorithms are to override pronunciation and grammar restrictions to provide richer phonetic information. All of the new algorithms incorporate a vector space modeling backend for improved LRE accuracy. Evaluated on a Mandarin/English detection task, the proposed integrated LVCSR-LRE system using frame-expanded N-best phone lattice achieves comparable performance to a state-of-the-art phone recognition-vector space modeling (PRVSM) system, but with an added computational cost three times lower than that of a separate PRVSM system.

Journal ArticleDOI
TL;DR: New phase parameters, channel phase differences (CPDs), defined as the phase differences between the mono downmix and the stereo channels, are introduced and can noticeably improve sound quality for stereo inputs with low ICCs.
Abstract: Conventional parametric stereo (PS) audio coding employs inter-channel phase difference and overall phase difference as phase parameters. In this article, it is shown that those parameters cannot correctly represent the phase relationship between the stereo channels when inter-channel correlation (ICC) is less than one, which is common in practical situations. To solve this problem, we introduce new phase parameters, channel phase differences (CPDs), defined as the phase differences between the mono downmix and the stereo channels. Since CPDs have a descriptive relationship with ICC as well as inter-channel intensity difference, they are more relevant for representing the phase difference between the channels in practical situations. We also propose methods of synthesizing CPDs at the decoder. Through computer simulations and subjective listening tests, it is confirmed that the proposed methods produce significantly lower phase errors than conventional PS and can noticeably improve sound quality for stereo inputs with low ICCs.
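
The relationship between the conventional and proposed phase parameters can be illustrated per time-frequency bin: the inter-channel phase difference compares left and right directly, whereas channel phase differences compare each channel against the mono downmix. The snippet below is a simplified illustration; the parameter-band grouping and quantisation used in parametric stereo coding are omitted.

```python
import numpy as np
from scipy.signal import stft

def phase_parameters(left, right, fs):
    """Per-bin IPD and CPD-style phase parameters from the stereo STFTs and their downmix."""
    _, _, L = stft(left, fs=fs, nperseg=1024)
    _, _, R = stft(right, fs=fs, nperseg=1024)
    M = 0.5 * (L + R)                         # mono downmix
    ipd = np.angle(L * np.conj(R))            # conventional inter-channel phase difference
    cpd_l = np.angle(L * np.conj(M))          # phase of the left channel relative to the downmix
    cpd_r = np.angle(R * np.conj(M))          # phase of the right channel relative to the downmix
    return ipd, cpd_l, cpd_r
```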

Journal ArticleDOI
TL;DR: It is demonstrated experimentally that consistency is definitely acquired when the syllable is located in the same word, and that the energy warping process within a syllable must be considered in a text-to-speech system to improve the synthesized speech quality.
Abstract: In this study, a consistency analysis of the energy parameter for Mandarin speech is presented. Identified through inspection of the human pronunciation process, the consistency can be interpreted as a high correlation of a warping curve between the spectrum and the prosody within a syllable. The consistency analysis proceeds in three steps. First, the hidden Markov model (HMM) algorithm is used to decode HMM-state sequences within a syllable and, at the same time, to divide them into three segments. Second, for a designated syllable, vector quantization (VQ) with the Linde-Buzo-Gray algorithm is used to train the VQ codebooks of each segment. Third, the energy vector of each segment is encoded as an index by the VQ codebooks, and then the probability of each possible path is evaluated as a prerequisite to analyzing the consistency. It is demonstrated experimentally that consistency is definitely acquired when the syllable is located in the same word. These results suggest a research direction in which the energy warping process within a syllable should be considered in a text-to-speech system to improve the synthesized speech quality.
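
The Linde-Buzo-Gray codebook training mentioned in the second step can be sketched generically as follows (codeword splitting followed by Lloyd refinement); the segment definitions, codebook size, and distance measure are assumptions for illustration.

```python
import numpy as np

def lbg(vectors, size, iters=20, eps=1e-3):
    """Train a VQ codebook by repeatedly splitting codewords and refining with Lloyd iterations.
    In this simple version, size should be a power of two."""
    codebook = vectors.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])   # split every codeword
        for _ in range(iters):                                               # Lloyd refinement
            d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
            idx = d.argmin(axis=1)
            for k in range(len(codebook)):
                if np.any(idx == k):
                    codebook[k] = vectors[idx == k].mean(axis=0)
    return codebook

def encode(vectors, codebook):
    """Encode each vector as the index of its nearest codeword."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

# energy = np.random.rand(500, 4)      # placeholder energy vectors for one segment
# cb = lbg(energy, size=8); indices = encode(energy, cb)
```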

Journal ArticleDOI
TL;DR: The authors implement a real hands-free communication system in which the proposed beamforming technique has proven its superiority with respect to the usual two-microphone approach in terms of echo reduction, while guaranteeing a comparable spatial image.
Abstract: In this article, the authors propose an optimally designed fixed beamformer (BF) for stereophonic acoustic echo cancelation (SAEC) in real hands-free communication applications. Several contributions related to the combination of beamforming and echo cancelation have appeared in the literature so far, but, to the authors' knowledge, the idea of using optimal fixed BFs in a real-time SAEC system both for echo reduction and stereophonic audio rendering is first addressed in this contribution. The employment of such designed BFs allows both issues to be addressed positively, as the several simulated and real tests seem to confirm. In particular, the stereo-recording quality attainable through the proposed approach has been preliminarily evaluated by means of subjective listening tests. Moreover, the overall system robustness against microphone array imperfections and the presence of noise has been experimentally evaluated. This allowed the authors to implement a real hands-free communication system in which the proposed beamforming technique has proven its superiority with respect to the usual two-microphone one in terms of echo reduction, while guaranteeing a comparable spatial image. Moreover, the proposed framework requires only a low computational cost increment with respect to the baseline approach, since only a few extra filtering operations with short filters need to be executed. Furthermore, according to the performed simulations, the BF-based SAEC configuration seems not to require the signal decorrelation module, resulting in an overall computational saving.