
Showing papers by "Unto K. Laine" published in 2005


Patent
30 Nov 2005
TL;DR: The patent describes a method that segments speech with a vector-autoregressive (VAR) model, predicting changes in a vector time series depicting speech from data both preceding and following the prediction point in time.
Abstract: The invention relates to a method for the segmentation of speech using an automatic method. The invention is characterized by the use of the vector-autoregressive (VAR) method in the segmentation. In it, the changes taking place in a vector time series depicting speech are predicted on the basis of data both preceding the prediction point in time and following the prediction point in time, with the aid of a vector-autoregressive model.

2 citations


Proceedings Article
04 Sep 2005
TL;DR: A preliminary test of a novel method for automatic speech segmentation, which is based on detecting unpredicted changes in the auditory time-frequency picture of speech at phone boundaries, together with a study of which kinds of phone boundaries allow the most reliable and robust detection by the method.
Abstract: A vector autoregressive (VAR) model is used in the auditory time-frequency domain to predict spectral changes. Forward and backward prediction errors increase at the phone boundaries. These error signals are then used to study and detect the boundaries of the largest changes allowing the most reliable automatic segmentation. Using a fully unsupervised method yields segments consisting of a variable number of phones. The quality of performance of this method was tested with a set of 150 Finnish sentences pronounced by one female and two male speakers. The performance for English was tested using the TIMIT core test set. The boundaries between stops and vowels, in particular, are detected with high probability and precision.

1. Introduction

Many subfields of speech technology need robust methods for automatic phonetic speech segmentation. Preferably these methods would be fully speaker and language independent. They should perform segmentation without any prior information about the speaker or the utterance in question. These methods should not apply any type of prior learning, and they should be able to process unknown utterances in a fully unsupervised manner. This paper describes a preliminary test of a novel method for automatic speech segmentation, which fulfills the hard demands mentioned to a certain degree.

Segmentation methods described in the literature can be classified into explicit and implicit methods. They also vary in terms of segmentation units (e.g. phonemes, syllables, words). In explicit methods, the underlying phoneme sequence is known prior to the segmentation. These methods are used in speech synthesis, for example. Implicit methods split the utterance into smaller units without using any information about the underlying phoneme sequence. These methods are based on analyzing the acoustic properties of the signal and detecting either spectrally stable parts or rapid variations of signal.
An example of a method based on locating spectrally stable parts is in [1], where the correlation between parameters computed from nearby frames has been used as a measure of stability. In [2], segment boundaries are implicitly detected by comparing the means of frames around potential boundaries using a “jump-function.” In [3], the variations of the short-term energy function are used as a measure to produce syllable-like units using minimum phase group delay functions.

In the case of continuous speech, the signal cannot be strictly divided into stable and varying parts which would correspond one-to-one with phones and segment boundaries. No phone in continuous speech produces steady spectra, but instead within a phone there are always slow spectral movements which are, to some degree, possible to predict. The method proposed in this paper does not detect these slow spectral variations, but rather is based on detecting unpredicted changes in the auditory time-frequency picture of speech at phone boundaries. These unpredicted changes happen most often when moving from one phoneme class to another. A change in the speech production mechanism changes the acoustic signal in an unpredictable manner. Knowing that not all transitions produce a large or rapid spectral change, a question of this study is which kinds of phone boundaries allow the most reliable and robust detection by the method.

When facing speaker-independent, unlimited-vocabulary (e.g. inflectional languages) continuous speech recognition, the words have to be split into smaller units such as morphemes; hence, not every phone boundary needs to be detected. Segments similar to syllables or morphemes consisting of one to many phones apply as well, as long as the total number of different segments is not too high for modeling purposes.

The novel method presented in this paper produces segments consisting of phone clusters of different lengths. The core idea is to model the spectral variation by using a vector autoregressive (VAR) model.
The model performs forward and backward predictions in the auditory time-frequency domain with associated prediction errors. The segment boundary candidates are found based on these error signals.
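
The forward and backward prediction idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a first-order VAR model without an intercept, refitted by ridge-regularized least squares on a sliding window, and the names `fit_var1`, `prediction_errors`, `window`, and `ridge` are hypothetical. The input `x` stands in for any sequence of feature vectors, one per frame.

```python
import numpy as np

def fit_var1(frames, ridge=1e-3):
    """Least-squares fit of x[t] ~ A @ x[t-1] on one window of shape (T, D).

    Assumption: order-1 model, no intercept; the ridge term only keeps the
    linear solve well conditioned.
    """
    past, future = frames[:-1], frames[1:]
    d = frames.shape[1]
    gram = past.T @ past + ridge * np.eye(d)
    # Normal equations give A transposed, so transpose the solution back.
    return np.linalg.solve(gram, past.T @ future).T

def prediction_errors(x, window=10, ridge=1e-3):
    """Forward and backward one-step prediction error norms per frame."""
    n = len(x)
    fwd = np.zeros(n)
    bwd = np.zeros(n)
    for t in range(window, n - window):
        # Forward: model fitted on the frames preceding t predicts frame t.
        a_f = fit_var1(x[t - window:t], ridge)
        fwd[t] = np.linalg.norm(x[t] - a_f @ x[t - 1])
        # Backward: model fitted on the time-reversed frames following t.
        a_b = fit_var1(x[t + 1:t + window + 1][::-1], ridge)
        bwd[t] = np.linalg.norm(x[t] - a_b @ x[t + 1])
    return fwd, bwd
```

Under these assumptions, frames where `fwd + bwd` peaks would serve as segment boundary candidates; how the error signals are actually combined and thresholded is not specified in the text above.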

1 citation