
Showing papers by "Paavo Alku published in 2010"


Journal ArticleDOI
TL;DR: The data suggest that detrimental effects of prematurity on language skills are based on the low degree of specialization to native language early in development, and delayed or atypical perceptual narrowing was associated with slower language acquisition.
Abstract: Early auditory experiences are a prerequisite for speech and language acquisition. In healthy children, phoneme discrimination abilities improve for native and degrade for unfamiliar, socially irrelevant phoneme contrasts between 6 and 12 months of age as the brain tunes itself to, and specializes in, the native spoken language. This process is known as perceptual narrowing and has been found to predict normal native language acquisition. Prematurely born infants are known to be at an elevated risk for later language problems, but it remains unclear whether these problems relate to early perceptual narrowing. To address this question, we investigated early neurophysiological phoneme discrimination abilities and later language skills in prematurely born infants and in healthy, full-term infants. Our follow-up study shows for the first time that the perceptual narrowing for non-native phoneme contrasts found in the healthy controls at 12 months was not observed in very prematurely born infants. An electric mismatch response of the brain indicated that whereas full-term infants gradually lost their ability to discriminate non-native phonemes from 6 to 12 months of age, prematurely born infants retained this ability. Language performance tested at the age of 2 years showed a significant delay in the prematurely born group. Moreover, those infants who had not become specialized in native phonemes at the age of one year performed worse in the communicative language test (MacArthur Communicative Development Inventories) at the age of two years. Thus, the decline in sensitivity to non-native phonemes served as a predictor of further language development. Our data suggest that the detrimental effects of prematurity on language skills are based on a low degree of specialization to the native language early in development. Moreover, delayed or atypical perceptual narrowing was associated with slower language acquisition. The results hence suggest that language problems related to prematurity may partially originate from this early tuning stage of language acquisition.

93 citations


Journal ArticleDOI
TL;DR: Two temporally weighted variants of linear predictive modeling are introduced to speaker verification and compared to FFT, which is normally used in computing MFCCs, and to conventional linear prediction; the effect of speech enhancement (spectral subtraction) on system performance with each of the four feature representations is also investigated.
Abstract: Text-independent speaker verification under additive noise corruption is considered. In the popular mel-frequency cepstral coefficient (MFCC) front-end, the conventional Fourier-based spectrum estimation is substituted with weighted linear predictive methods, which have earlier shown success in noise-robust speech recognition. Two temporally weighted variants of linear predictive modeling are introduced to speaker verification and compared to FFT, which is normally used in computing MFCCs, and to conventional linear prediction. The effect of speech enhancement (spectral subtraction) on system performance with each of the four feature representations is also investigated. Experiments by the authors on the NIST 2002 SRE corpus indicate that the accuracies of the conventional and proposed features are close to each other on clean data. For factory noise at the 0 dB SNR level, baseline FFT and the better of the proposed features give EERs of 17.4% and 15.6%, respectively. These accuracies improve to 11.6% and 11.2%, respectively, when spectral subtraction is included as a preprocessing method. The new features hold a promise for noise-robust speaker verification.
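As a rough illustration of the front-end modification described above (not the authors' implementation; plain numpy/scipy, with the mel filter bank matrix `mel_fb` assumed to be given), the FFT-based power spectrum in the MFCC pipeline can be replaced by an all-pole (LP) spectrum as follows:

```python
import numpy as np
from scipy.fftpack import dct

def lp_coefficients(frame, order=20):
    """Conventional autocorrelation-method LP on a (pre-windowed) frame."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))            # A(z) = 1 - sum_k a_k z^-k

def lp_power_spectrum(a, n_fft=512):
    """All-pole power spectrum 1 / |A(e^jw)|^2 (gain omitted for cepstral features)."""
    A = np.fft.rfft(a, n_fft)
    return 1.0 / (np.abs(A) ** 2 + 1e-12)

def mfcc_from_power_spectrum(power_spec, mel_fb, n_ceps=12):
    """Standard MFCC steps: mel filter bank, log compression, DCT."""
    mel_energies = mel_fb @ power_spec            # mel_fb: (n_mels, n_fft // 2 + 1)
    return dct(np.log(mel_energies + 1e-12), norm="ortho")[:n_ceps]
```

Swapping `lp_power_spectrum` for the usual |FFT|^2 of the frame reproduces the baseline FFT-MFCC features; the temporally weighted LP variants studied in the paper change only how the coefficient vector `a` is estimated.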

69 citations


Journal ArticleDOI
TL;DR: It appeared to be possible to identify valences from vowel samples of short duration (approximately 150 ms), and the normalized amplitude quotient (NAQ) tended to differentiate between the valences and activity levels perceived in both genders.

47 citations


Journal ArticleDOI
TL;DR: The current MMN results imply enhanced processing of linguistically relevant information at the pre-attentive stage and in this way support the domain-specific model of speech perception.

46 citations


01 Jan 2010
TL;DR: GlottHMM, described in this paper, is a hidden Markov model (HMM) based speech synthesis system that utilizes glottal inverse filtering to separate the vocal tract from the glottal source.
Abstract: This paper describes the GlottHMM speech synthesis entry for Blizzard Challenge 2010. GlottHMM is a hidden Markov model (HMM) based speech synthesis system that utilizes glottal inverse filtering for separating the vocal tract from the glottal source. The source and the filter characteristics are modeled separately in the framework of HMM. In the synthesis stage, natural glottal flow pulses are used to generate the excitation signal, and the excitation signal is further modified according to the desired voice source characteristics generated by the HMM. In order to prevent the over-smoothing of the vocal tract filter parameters, a new formant enhancement method is used to make the vocal tract resonances sharper. Finally, speech is synthesized by filtering the glottal excitation by the vocal tract filter. Index Terms: speech synthesis, hidden Markov model, glottal inverse filtering
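The synthesis-stage idea described above can be sketched roughly as follows; this is an illustrative simplification, not the GlottHMM code, and it omits pulse-library selection, spectral matching of the source, and formant enhancement:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(glottal_pulse, f0, frame_len, fs, vocal_tract_a, gain=1.0):
    """Build an excitation by repeating a natural glottal flow pulse at the target F0
    and filter it with the all-pole vocal tract model 1/A(z)."""
    period = int(round(fs / f0))                                  # pitch period in samples
    # Stretch the stored pulse to the target period (linear interpolation placeholder).
    t = np.linspace(0, len(glottal_pulse) - 1, period)
    pulse = np.interp(t, np.arange(len(glottal_pulse)), glottal_pulse)
    excitation = np.zeros(frame_len)
    for start in range(0, frame_len - period + 1, period):        # pulse train at F0
        excitation[start:start + period] += pulse
    return gain * lfilter([1.0], vocal_tract_a, excitation)       # vocal tract filtering
```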

36 citations


Journal ArticleDOI
TL;DR: The results of an MEG study utilizing realistic spatial sound stimuli presented in a stimulus-specific adaptation paradigm support a population rate code model where neurons in the right hemisphere are more often tuned to the left than to the right of the perceiver, while in the left hemisphere these two neuronal populations are of equal size.

34 citations


Journal ArticleDOI
TL;DR: It is proposed that the increased activity of AEFs reflects cortical processing of acoustic properties common to both speech and non-speech stimuli, and is most likely caused by spectral changes brought about by the decrease of amplitude resolution.
Abstract: Recent studies have shown that the human right-hemispheric auditory cortex is particularly sensitive to reduction in sound quality, with an increase in distortion resulting in an amplification of the auditory N1m response measured in the magnetoencephalography (MEG). Here, we examined whether this sensitivity is specific to the processing of acoustic properties of speech or whether it can be observed also in the processing of sounds with a simple spectral structure. We degraded speech stimuli (vowel /a/), complex non-speech stimuli (a composite of five sinusoidals), and sinusoidal tones by decreasing the amplitude resolution of the signal waveform. The amplitude resolution was impoverished by reducing the number of bits to represent the signal samples. Auditory evoked magnetic fields (AEFs) were measured in the left and right hemisphere of sixteen healthy subjects. We found that the AEF amplitudes increased significantly with stimulus distortion for all stimulus types, which indicates that the right-hemispheric N1m sensitivity is not related exclusively to degradation of acoustic properties of speech. In addition, the P1m and P2m responses were amplified with increasing distortion similarly in both hemispheres. The AEF latencies were not systematically affected by the distortion. We propose that the increased activity of AEFs reflects cortical processing of acoustic properties common to both speech and non-speech stimuli. More specifically, the enhancement is most likely caused by spectral changes brought about by the decrease of amplitude resolution, in particular the introduction of periodic, signal-dependent distortion to the original sound. Converging evidence suggests that the observed AEF amplification could reflect cortical sensitivity to periodic sounds.
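One straightforward way to implement the amplitude-resolution reduction described above is uniform requantization of the waveform to a given number of bits; the sketch below is an assumption about the procedure, not necessarily the exact quantizer used in the study:

```python
import numpy as np

def reduce_amplitude_resolution(signal, n_bits):
    """Uniformly requantize a signal scaled to [-1, 1) to n_bits per sample."""
    levels = 2 ** n_bits
    q = np.clip(np.floor((signal + 1.0) * 0.5 * levels), 0, levels - 1)
    return (q + 0.5) / levels * 2.0 - 1.0         # reconstruct mid-step values
```

Decreasing `n_bits` introduces the periodic, signal-dependent quantization distortion referred to in the abstract.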

30 citations


Tuomo Raitio, Antti Suni, Hannu Pulakka, Martti Vainio, Paavo Alku
01 Jan 2010
TL;DR: Experiments indicate that formant enhancement prior to HMM training improves the quality of synthetic speech by providing sharper formants, and that the performance of the new formant enhancement method is similar to that of the existing method.
Abstract: Hidden Markov model (HMM) based speech synthesis has a tendency to over-smooth the spectral envelope of speech, which makes the speech sound muffled. One means to compensate for the over-smoothing is to enhance the formants of the spectral model. This paper compares the performance of different formant enhancement methods, and studies the enhancement of the formants prior to HMM training in order to preemptively compensate for the over-smoothing. A new method for enhancing the formants of an all-pole model is also introduced. Experiments indicate that the formant enhancement prior to HMM training improves the quality of synthetic speech by providing sharper formants, and the performance of the new formant enhancement method is similar to the existing method.
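For context, a textbook way to sharpen the formants of an all-pole model is to scale its pole radii toward the unit circle; the sketch below shows that classical variant and is not a reconstruction of the new method proposed in the paper:

```python
import numpy as np

def sharpen_formants(a, gamma=1.02):
    """Scale the poles of the all-pole model radially by `gamma` (> 1 moves the poles
    toward the unit circle and narrows the formant bandwidths).
    `a` is the coefficient vector [1, a_1, ..., a_p] of A(z)."""
    a = np.asarray(a, dtype=float)
    enhanced = a * gamma ** np.arange(len(a))     # A(z/gamma): roots move from p_k to gamma*p_k
    if np.max(np.abs(np.roots(enhanced))) >= 1.0: # keep the enhanced model stable
        raise ValueError("gamma too large: enhanced all-pole model is unstable")
    return enhanced
```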

27 citations


Proceedings Article
01 Jan 2010
TL;DR: A generalized formulation of linear prediction (LP), including both conventional and temporally weighted LP analysis methods as special cases, is introduced and shown to lead to performance improvements in several cases involving channel distortion and additive noise mismatch between the training and recognition conditions.
Abstract: This paper introduces a generalized formulation of linear prediction (LP), including both conventional and temporally weighted LP analysis methods as special cases. The temporally weighted methods have recently been successfully applied to noise robust spectrum analysis in speech and speaker recognition applications. In comparison to those earlier methods, the new generalized approach allows more versatility in weighting different parts of the data in the LP analysis. Two such weighted methods are evaluated and compared to the conventional spectrum modeling methods FFT and LP, as well as the temporally weighted methods WLP and SWLP, by substituting each of them in turn as the spectrum estimation method of the MFCC feature extraction stage of a GMM-UBM based speaker verification system. The new methods are shown to lead to performance improvement in several cases involving channel distortion and additive noise mismatch between the training and recognition conditions.
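A compact way to see how temporal weighting enters the LP analysis is through the weighted normal equations; the sketch below uses a short-time-energy (STE) weight as one plausible choice, and its exact conventions (windowing, normalization) may differ from those in the paper:

```python
import numpy as np

def weighted_lp(frame, order=20, weights=None):
    """Weighted LP: minimize sum_n w[n] * e[n]^2, where e[n] is the prediction error.
    weights == ones reproduces conventional LP; the default below uses a short-time
    energy (STE) weight computed from the `order` preceding samples."""
    N = len(frame)
    if weights is None:
        padded = np.concatenate((np.zeros(order), frame))
        weights = np.array([np.sum(padded[n:n + order] ** 2) for n in range(N)]) + 1e-8
    # Weighted normal equations (X^T W X) a = X^T W x.
    X = np.zeros((N, order))
    for k in range(1, order + 1):
        X[k:, k - 1] = frame[:N - k]              # delayed copies x[n - k]
    R = X.T @ (weights[:, None] * X)
    r = X.T @ (weights * frame)
    a = np.linalg.solve(R, r)
    return np.concatenate(([1.0], -a))            # A(z) = 1 - sum_k a_k z^-k
```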

26 citations


Journal ArticleDOI
TL;DR: The results suggest that the Turkish-German children have not yet fully acquired the German phonetic inventory despite living in Germany since birth and being immersed in a German-speaking environment.

26 citations


Journal ArticleDOI
TL;DR: The latency of the transient brain response was prolonged in aged compared to young subjects, and the accuracy of behavioral responses to sinusoids was diminished among the aged.

Proceedings ArticleDOI
26 Sep 2010
TL;DR: The results supported the hypothesis – formed by an earlier study of voice quality changes in running speech – that more prominent syllables are produced with a less tense voice quality and less prominent ones with a more tense phonation.
Abstract: Prominence relations in speech are signaled by various ways including such phonetic means as voice fundamental frequency, intensity, and duration. A less studied acoustic feature affecting prominence is the so called voice quality which is determined by changes in the airflow caused by different laryngeal settings. We investigated the changes in voice quality with respect to linguistic prosodic signaling of focus in simple three word utterances. We used inverse filtering based methods for calculating and parametrizing the glottal flow in several different vowels and focus conditions. The results supported our hypothesis – formed by an earlier study of voice quality changes in running speech – that more prominent syllables are produced with a less tense voice quality and less prominent ones with a more tense phonation. We provide both physiological and linguistic explanations for the phenomena.
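A glottal flow parameter commonly used in this line of work (and mentioned in the emotion-perception paper above) is the normalized amplitude quotient (NAQ); a simplified frame-level computation, assuming the glottal flow has already been estimated by inverse filtering, might look like this:

```python
import numpy as np

def normalized_amplitude_quotient(glottal_flow, f0, fs):
    """Frame-level NAQ: peak-to-peak glottal flow divided by the product of the
    negative peak of the flow derivative and the fundamental period. Smaller NAQ
    values correspond to a more pressed (tense) phonation."""
    flow_ac = np.max(glottal_flow) - np.min(glottal_flow)    # AC flow amplitude
    d_peak = abs(np.min(np.diff(glottal_flow) * fs))         # negative peak of dU/dt
    return flow_ac / (d_peak * (1.0 / f0))                   # NAQ = f_ac / (d_peak * T)
```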

Proceedings ArticleDOI
01 Aug 2010
TL;DR: A new method for the bandwidth extension of telephone speech, using only the information in the narrowband speech, is presented; listening tests show that it improves speech quality compared with a previously published bandwidth extension method.
Abstract: The limited audio bandwidth used in telephone systems degrades both the quality and the intelligibility of speech. This paper presents a new method for the bandwidth extension of telephone speech. Frequency components are added to the frequency band 4–8 kHz using only the information in the narrowband speech. First, a wideband excitation is generated by spectral folding from the narrowband linear prediction residual. The highband of this signal is divided into four subbands with a filter bank, and a neural network is used to weight the subbands based on features calculated from the narrowband speech. Bandwidth-extended speech is obtained by summing the weighted subbands and the original narrowband signal. Listening tests show that this new method improves speech quality compared with a previously published bandwidth extension method.
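To make the spectral-folding step concrete, the sketch below upsamples the narrowband LP residual by zero insertion, which mirrors the 0-4 kHz spectrum into the 4-8 kHz band; the filter-bank split and neural-network weighting described above are not reproduced, and the filter parameters are illustrative:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def spectral_fold_excitation(nb_residual):
    """Upsample the narrowband LP residual by zero insertion; this mirrors the
    0-4 kHz spectrum into 4-8 kHz, giving a spectrally flat wideband excitation."""
    wb = np.zeros(2 * len(nb_residual))
    wb[::2] = nb_residual
    return wb

def highband(wb_excitation, fs_wb=16000, cutoff=4000.0, numtaps=101):
    """Isolate the 4-8 kHz band on which the subband weighting would operate."""
    hp = firwin(numtaps, cutoff, fs=fs_wb, pass_zero=False)  # linear-phase highpass FIR
    return lfilter(hp, [1.0], wb_excitation)
```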

Journal ArticleDOI
TL;DR: A temporal window of integration for the periodicity of speech sounds in the F0 range of typical male speech is defined; its length is 3-5 cycles, or 30-50 ms, and response latencies were shorter for the periodic than for the aperiodic stimuli.
Abstract: Cortical sensitivity to the periodicity of speech sounds has been evidenced by larger, more anterior responses to periodic than to aperiodic vowels in several non-invasive studies of the human brain. The current study investigated the temporal integration underlying the cortical sensitivity to speech periodicity by studying the increase in periodicity-specific cortical activation with growing stimulus duration. Periodicity-specific activation was estimated from magnetoencephalography as the differences between the N1m responses elicited by periodic and aperiodic vowel stimuli. The duration of the vowel stimuli with a fundamental frequency (F0=106 Hz) representative of typical male speech was varied in units corresponding to the vowel fundamental period (9.4 ms) and ranged from one to ten units. Cortical sensitivity to speech periodicity, as reflected by larger and more anterior responses to periodic than to aperiodic stimuli, was observed when stimulus duration was 3 cycles or more. Further, for stimulus durations of 5 cycles and above, response latency was shorter for the periodic than for the aperiodic stimuli. Together the current results define a temporal window of integration for the periodicity of speech sounds in the F0 range of typical male speech. The length of this window is 3-5 cycles, or 30-50 ms.

Journal ArticleDOI
TL;DR: Directly observable, non-invasive brain measures can be used to assess effects of stroke that are related to the behavioral symptoms patients manifest; left-hemispheric ischemic stroke impairs the processing of sinusoidal and speech sounds.

01 Jan 2010
TL;DR: Two temporally weighted variants of linear predictive (LP) modeling are introduced to speaker verification and compared to FFT, which is normally used in computing MFCCs, and to conventional LP; the effect of speech enhancement (spectral subtraction) on system performance is investigated with each of the four feature representations.
Abstract: We consider text-independent speaker verification under additive noise corruption. In the popular mel-frequency cepstral coefficient (MFCC) front-end, we substitute the conventional Fourier-based spectrum estimation with weighted linear predictive methods, which have earlier shown success in noise-robust speech recognition. We introduce two temporally weighted variants of linear predictive (LP) modeling to speaker verification and compare them to FFT, which is normally used in computing MFCCs, and to conventional LP. We also investigate the effect of speech enhancement (spectral subtraction) on the system performance with each of the four feature representations. Our experiments on the NIST 2002 SRE corpus indicate that the accuracies of the conventional and proposed features are close to each other on clean data. At the 0 dB SNR level, baseline FFT and the better of the proposed features give EERs of 17.4% and 15.6%, respectively. These accuracies improve to 11.6% and 11.2%, respectively, when spectral subtraction is included as a pre-processing method. The new features hold a promise for noise-robust speaker verification.