
Showing papers in "EURASIP Journal on Audio, Speech, and Music Processing" in 2006


Journal ArticleDOI
TL;DR: Results are provided that shed light on the characteristics and attributes of the DAPGF and OZGF responses that provide a robust foundation for modeling cochlear transfer functions and can also serve as "design curves" for fitting these responses to frequency-domain physiological data.
Abstract: This paper deals with continuous-time filter transfer functions that resemble tuning curves at a particular set of places on the basilar membrane of the biological cochlea and that are suitable for practical VLSI implementations. The resulting filters can be used in a filterbank architecture to realize cochlear implants or auditory processors of increased biorealism. To put the reader into context, the paper starts with a short review of the gammatone filter and then presents two of its variants, namely, the differentiated all-pole gammatone filter (DAPGF) and the one-zero gammatone filter (OZGF), filter responses that provide a robust foundation for modeling cochlear transfer functions. The DAPGF and OZGF responses are attractive because they exhibit certain characteristics suitable for modeling a variety of auditory data: level-dependent gain, a linear tail for frequencies well below the center frequency, asymmetry, and so forth. In addition, their form suggests implementation by means of cascades of N identical two-pole systems, which renders them excellent candidates for efficient analog or digital VLSI realizations. We provide results that shed light on their characteristics and attributes and that can also serve as "design curves" for fitting these responses to frequency-domain physiological data. The DAPGF and OZGF responses are essentially a "missing link" between physiological, electrical, and mechanical models for auditory filtering.
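
To make the cascade structure concrete, the following minimal Python sketch evaluates the magnitude response of a cascade of N identical two-pole sections with one added zero, which reduces to the DAPGF when the zero sits at DC and to the OZGF otherwise. The function name, per-stage normalization, and default Q and order are illustrative choices, not the paper's parameters.

```python
import numpy as np

def gammatone_like_magnitude(f, f0=1000.0, q=8.0, order=4, fz=None):
    """Magnitude response of `order` identical two-pole sections
    (center frequency f0, quality factor q) with one added zero.
    fz=None puts the zero at DC (a pure differentiator), giving the
    DAPGF; a finite fz gives the OZGF. Normalization is illustrative."""
    s = 1j * 2.0 * np.pi * np.asarray(f, dtype=float)
    w0 = 2.0 * np.pi * f0
    stage = w0**2 / (s**2 + (w0 / q) * s + w0**2)  # one two-pole stage
    h = stage**order                               # cascade of N stages
    if fz is None:
        h *= s / w0              # DAPGF: differentiate the all-pole cascade
    else:
        wz = 2.0 * np.pi * fz
        h *= (s + wz) / wz       # OZGF: one real zero at -wz
    return np.abs(h)

# Example: gain in dB of a 4th-order DAPGF centered at 1 kHz
freqs = np.linspace(100.0, 4000.0, 512)
gain_db = 20 * np.log10(gammatone_like_magnitude(freqs))
```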

78 citations


Journal ArticleDOI
TL;DR: A new technique for separating two speech signals from a single recording is presented; it effectively adds vocal-tract-related filter characteristics as a new cue to CASA models using a new grouping technique based on underdetermined blind source separation.
Abstract: We present a new technique for separating two speech signals from a single recording. The proposed method bridges the gap between underdetermined blind source separation techniques and those techniques that model the human auditory system, that is, computational auditory scene analysis (CASA). For this purpose, we decompose the speech signal into the excitation signal and the vocal-tract-related filter and then estimate the components from the mixed speech using a hybrid model. We first express the probability density function (PDF) of the mixed speech's log spectral vectors in terms of the PDFs of the underlying speech signals' vocal-tract-related filters. Then, the mean vectors of the PDFs of the vocal-tract-related filters are obtained using a maximum likelihood estimator given the mixed signal. Finally, the estimated vocal-tract-related filters along with the extracted fundamental frequencies are used to reconstruct estimates of the individual speech signals. The proposed technique effectively adds vocal-tract-related filter characteristics as a new cue to CASA models using a new grouping technique based on underdetermined blind source separation. We compare our model with both an underdetermined blind source separation method and a CASA method. The experimental results show that our model outperforms both techniques in terms of SNR improvement and the percentage of crosstalk suppression.
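
As a rough illustration of the estimation step, the sketch below uses the common log-max approximation (the log spectrum of a mixture is close to the elementwise max of the sources' log spectra) to select the best-fitting pair of vocal-tract-related filter prototypes by brute force. The paper's actual method is a maximum likelihood estimator over PDFs; the codebook names here are hypothetical.

```python
import numpy as np

def pick_filter_pair(mix_logspec, codebook_a, codebook_b):
    """Choose the pair of vocal-tract log-spectral prototypes (one per
    speaker) whose elementwise max best explains the observed mixture
    log spectrum -- a brute-force stand-in for the paper's ML step."""
    best_pair, best_err = (0, 0), np.inf
    for i, a in enumerate(codebook_a):        # codebooks: (K, n_bins)
        for j, b in enumerate(codebook_b):
            err = np.sum((np.maximum(a, b) - mix_logspec) ** 2)
            if err < best_err:
                best_pair, best_err = (i, j), err
    return best_pair
```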

53 citations


Journal ArticleDOI
TL;DR: The results from perceptual experiments show that the listeners' accent background impacts their ability to categorize accents, and the comprehensibility of the speech contributes to accent perception.
Abstract: Variability of speaker accent is a challenge for effective human communication as well as for speech technology, including automatic speech recognition and accent identification. The motivation of this study is to contribute to a deeper understanding of accent variation across speakers from a cognitive perspective. The goal is to provide a perceptual assessment of accent variation in native and non-native English. The main focus is to investigate how a listener's accent background affects accent perception and comprehensibility. The results from perceptual experiments show that listeners' accent background impacts their ability to categorize accents. Speaker accent type affects perceptual accent classification. The interaction between listener accent background and speaker accent type is significant for both accent perception and speech comprehension. In addition, the results indicate that the comprehensibility of the speech contributes to accent perception. The outcomes point to the complex nature of accent perception and provide a foundation for further investigation of the involvement of cognitive processing in accent perception. These findings contribute to a richer understanding of the cognitive aspects of accent variation and its application to speech technology.

35 citations


Journal ArticleDOI
TL;DR: It is found that some of the classical solutions obtain a moderate signal enhancement, while more advanced subspace-based dereverberation techniques fail to enhance the signals despite their high computational load.
Abstract: Dereverberation is required in various speech processing applications such as hands-free telephony and voice-controlled systems, especially when the signals to be processed were recorded in a moderately or highly reverberant environment. In this paper, we compare a number of classical and more recently developed multimicrophone dereverberation algorithms, and we validate the different algorithmic settings by means of two performance indices and a speech recognition system. It is found that some of the classical solutions obtain a moderate signal enhancement. More advanced subspace-based dereverberation techniques, on the other hand, fail to enhance the signals despite their high computational load.
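
For context on the "classical" end of such a comparison, here is a minimal delay-and-sum beamformer, one of the standard multimicrophone baselines; it is shown purely as an illustration and is not necessarily among the paper's evaluated algorithms.

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """Time-align each microphone signal toward the source and average;
    the coherent speech component adds up while reverberant tails
    partially cancel. np.roll wraps at the edges, acceptable for a sketch."""
    aligned = [np.roll(x, -int(d)) for x, d in zip(mic_signals, delays)]
    return np.mean(aligned, axis=0)
```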

18 citations


Journal ArticleDOI
TL;DR: This work is the result of an interdisciplinary collaboration between scientists from the fields of audio signal processing, phonetics, and cognitive neuroscience, aimed at studying the perception of modifications in meter, rhythm, semantics, and harmony in language and music.
Abstract: This work is the result of an interdisciplinary collaboration between scientists from the fields of audio signal processing, phonetics, and cognitive neuroscience, aimed at studying the perception of modifications in meter, rhythm, semantics, and harmony in language and music. A special time-stretching algorithm was developed to work with natural speech. In the language part, French sentences ending with trisyllabic congruous or incongruous words, metrically modified or not, were constructed. In the music part, short melodies made of triplets, rhythmically and/or harmonically modified, were composed. These stimuli were presented to a group of listeners who were asked to focus their attention either on meter/rhythm or semantics/harmony and to judge whether or not the sentences/melodies were acceptable. Language ERP analyses indicate that semantically incongruous words are processed independently of the subject's attention, thus arguing for automatic semantic processing. In addition, metric incongruities seem to influence semantic processing. Music ERP analyses show that rhythmic incongruities are processed independently of attention, revealing automatic processing of rhythm in music.

7 citations


Journal ArticleDOI
TL;DR: This paper proposes one such transform-based compression technique, where the joint time-frequency properties of nonstationary audio signals are exploited to create a compact energy representation of the signal in fewer coefficients.
Abstract: Wideband digital audio signals have a very high data rate associated with them due to their complex nature and the demand for high-quality reproduction. Although recent technological advancements have significantly reduced the cost of bandwidth and miniaturized storage facilities, the rapid increase in the volume of digital audio content continually drives the need for better compression algorithms. Over the years various perceptually lossless compression techniques have been introduced, and transform-based compression techniques have made a significant impact in recent years. In this paper, we propose one such transform-based compression technique, where the joint time-frequency (TF) properties of nonstationary audio signals are exploited to create a compact energy representation of the signal in fewer coefficients. The decomposition coefficients were processed and perceptually filtered to retain only the relevant coefficients. Perceptual filtering (psychoacoustics) was applied in a novel way by analyzing and performing TF-specific psychoacoustic experiments. An added advantage of the proposed technique is that, due to its signal-adaptive nature, it does not need predetermined segmentation of audio signals for processing. Eight stereo audio signal samples of different varieties were used in the study. Subjective (mean opinion score, MOS) listening tests were performed, and subjective difference grades (SDG) were used to compare the performance of the proposed coder with the MP3, AAC, and HE-AAC encoders. Compression ratios in the range of 8 to 40 were achieved by the proposed technique, with SDGs ranging from −0.53 to −2.27.
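
The retain-only-relevant-coefficients idea can be caricatured in a few lines: coefficients falling far enough below the frame's strongest component are treated as masked and dropped. The fixed relative threshold below is purely illustrative; the paper derives its thresholds from TF-specific psychoacoustic experiments.

```python
import numpy as np

def prune_tf_coefficients(coeffs, rel_threshold_db=-60.0):
    """Zero out time-frequency coefficients more than rel_threshold_db
    below the strongest coefficient; only the survivors would be coded."""
    mag_db = 20.0 * np.log10(np.abs(coeffs) + 1e-12)
    keep = mag_db >= mag_db.max() + rel_threshold_db
    return np.where(keep, coeffs, 0.0)
```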

5 citations


Journal ArticleDOI
TL;DR: This paper experimentally shows the importance of perceptual continuity of the expressive strength in vocal timbre for natural change in vocal expression and concludes that applying continuity was highly effective for achieving perceptual naturalness.
Abstract: This paper experimentally shows the importance of perceptual continuity of expressive strength in vocal timbre for natural change in vocal expression. In order to synthesize various and continuous expressive strengths in vocal timbre, we investigated gradually changing expressions by applying the STRAIGHT speech morphing algorithm to singing voices. Here, a singing voice without expression is used as the base of morphing, and singing voices with three different expressions are used as the targets. Through statistical analyses of perceptual evaluations, we confirmed that the proposed morphing algorithm provides perceptual continuity of vocal timbre. Our results showed the following: (i) gradual strengths in absolute evaluations, and (ii) a perceptually linear strength obtained by correcting the intervals of the morph ratio with the inverse (reciprocal) of an equation that approximates the perceptual strength. Finally, we concluded that applying continuity was highly effective for achieving perceptual naturalness, judging from the results showing that (iii) our gradual transformation method performs well for perceived naturalness.
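
The morph-ratio correction in (ii) can be sketched as follows: choose uniformly spaced target strengths and map them through the inverse of a function fitted to the perceptual ratings. A power law is assumed here purely for illustration; the paper fits its own approximating equation.

```python
import numpy as np

def corrected_morph_ratios(n_steps=5, gamma=2.0):
    """Morph ratios warped so perceived expressive strength grows
    linearly, assuming (hypothetically) a power-law fit S(r) = r**gamma
    to the ratings; the corrected ratios are S^{-1} of a uniform grid."""
    target_strengths = np.linspace(0.0, 1.0, n_steps)
    return target_strengths ** (1.0 / gamma)
```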

5 citations


Journal ArticleDOI
TL;DR: An exclusive maximum selective-tap time-domain convolutive BSS algorithm (XM BSS) is proposed that reduces the interchannel coherence of the tap-input vectors and improves the conditioning of the autocorrelation matrix resulting in improved convergence rate and reduced misalignment.
Abstract: We investigate novel algorithms to improve the convergence and reduce the complexity of time-domain convolutive blind source separation (BSS) algorithms. First, we propose the MMax partial-update time-domain convolutive BSS (MMax BSS) algorithm. We demonstrate that the partial-update scheme applied in the single-channel MMax LMS algorithm can be extended to multichannel time-domain convolutive BSS with little deterioration in performance and possible computational savings. Next, we propose an exclusive-maximum selective-tap time-domain convolutive BSS algorithm (XM BSS) that reduces the interchannel coherence of the tap-input vectors and improves the conditioning of the autocorrelation matrix, resulting in an improved convergence rate and reduced misalignment. Moreover, the computational complexity is reduced, since only half of the tap inputs are selected for updating. Simulation results show a significant improvement in convergence rate compared to existing techniques.
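
For readers unfamiliar with MMax partial updating, the single-channel version that the paper extends can be sketched in a few lines: per iteration, only the m taps whose current inputs are largest in magnitude are adapted. Variable names are illustrative.

```python
import numpy as np

def mmax_lms_update(w, x_vec, error, mu, m):
    """One MMax partial-update LMS step: adapt only the m filter taps
    whose current tap inputs are largest in magnitude, trading a small
    performance loss for reduced computation."""
    idx = np.argpartition(np.abs(x_vec), -m)[-m:]  # m largest |inputs|
    w[idx] += mu * error * x_vec[idx]              # LMS update on those taps
    return w
```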

2 citations


Journal ArticleDOI
TL;DR: This issue contains seven papers that exemplify the breadth and depth of current work in perceptual modeling and its applications, including efficient FFT-based processing that mimics two-tone suppression, which is a key attribute of simultaneous masking.
Abstract: This is a special issue published in volume 2007 of "EURASIP Journal on Audio, Speech, and Music Processing." All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. New understandings of human auditory perception have recently contributed to advances in numerous areas related to audio, speech, and music processing. These include coding, speech and speaker recognition, synthesis, signal separation, signal enhancement, automatic content identification and retrieval, and quality estimation. Researchers continue to seek more detailed, accurate, and robust characterizations of human auditory perception, from the periphery to the auditory cortex, and in some cases whole-brain inventories. This special issue on Perceptual Models for Speech, Audio, and Music Processing contains seven papers that exemplify the breadth and depth of current work in perceptual modeling and its applications. The issue opens with "Practical gammatone-like filters for auditory processing" by A. G. Katsiamis et al., which contains a nice review of how to make cochlea-like filters using classical signal processing methods. As described in the paper, the human cochlea is nonlinear. The nonlinearity in the cochlea is believed to control for dynamic range issues, perhaps due to the small dynamic range of neurons. Having a time-domain version of the cochlea with a built-in nonlinearity is an important tool in many signal processing applications. This paper shows one way this might be accomplished using a cascade of second-order sections. While we do not know how the human cochlea accomplishes this task of nonlinear filtering, the technique described here is one reasonable method for solving this very difficult problem. B. Raj et al. apply perceptual modeling to the automatic speech recognition problem in "An FFT-based companding front end for noise-robust automatic speech recognition." These authors describe efficient FFT-based processing that mimics two-tone suppression, which is a key attribute of simultaneous masking. This processing involves a bank of relatively wide filters, followed by a compressive nonlinearity, then relatively narrow filters, and finally an expansion stage. The net result is that strong spectral components tend to reduce the level of weaker neighboring spectral components, and this is a form of spectral peak enhancement. The authors apply this work as a preprocessor for a mel-cepstrum HMM-based automatic speech recognition algorithm and they demonstrate improved performance for a …
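
The suppressive effect of that wide-filter/compress/narrow-filter/expand chain can be caricatured for a single channel as below: the channel's narrow-band envelope is scaled by the broad-band envelope raised to p − 1 (compressive for p < 1), so strong components inside the wide band but outside the narrow band pull the channel's output down, loosely mimicking two-tone suppression. The weights and exponent are illustrative, not the authors' design.

```python
import numpy as np

def companded_channel_output(power_spec, wide_w, narrow_w, p=0.3):
    """One-channel caricature of the companding chain: the narrow-band
    envelope is scaled by the compressed broad-band envelope, so strong
    spectral neighbors within the wide band suppress the channel."""
    e_wide = np.sqrt(np.dot(wide_w, power_spec))      # broad-band envelope
    e_narrow = np.sqrt(np.dot(narrow_w, power_spec))  # narrow-band envelope
    return e_narrow * (e_wide + 1e-12) ** (p - 1.0)   # p < 1: compressive
```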

1 citation


Journal ArticleDOI
TL;DR: A multiple-description MSVQ targeted for communication over packet-loss channels is proposed and investigated, and a practical example involving quantization of speech line spectral frequency (LSF) vectors is presented to demonstrate the potential advantage of MD-MSVQ over interleaving-based MSVQ as well as traditional MSVQ based on error concealment at the receiver.
Abstract: Multistage vector quantization (MSVQ) is a technique for low complexity implementation of high-dimensional quantizers, which has found applications within speech, audio, and image coding. In this paper, a multiple-description MSVQ (MD-MSVQ) targeted for communication over packet-loss channels is proposed and investigated. An MD-MSVQ can be viewed as a generalization of a previously reported interleaving-based transmission scheme for multistage quantizers. An algorithm for optimizing the codebooks of an MD-MSVQ for a given packet-loss probability is suggested, and a practical example involving quantization of speech line spectral frequency (LSF) vectors is presented to demonstrate the potential advantage of MD-MSVQ over interleaving-based MSVQ as well as traditional MSVQ based on error concealment at the receiver.
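
To fix ideas, plain multistage VQ encoding looks like the sketch below: each stage quantizes the residual left by the previous stages, producing one index per stage. In an MD-MSVQ those stage indices would then be distributed across packet "descriptions" so a lost packet costs only some stages; the codebook optimization for a given loss probability is the paper's contribution and is not shown here.

```python
import numpy as np

def msvq_encode(x, stage_codebooks):
    """Multistage VQ: each stage quantizes the residual left by the
    previous stages; the code is the list of per-stage indices."""
    residual = np.array(x, dtype=float)
    indices = []
    for cb in stage_codebooks:                 # each cb: (K, dim) array
        errs = np.sum((cb - residual) ** 2, axis=1)
        k = int(np.argmin(errs))
        indices.append(k)
        residual -= cb[k]
    return indices
```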