
Showing papers by "Unto K. Laine" published in 2011


Book ChapterDOI
13 Jun 2011
TL;DR: Several researchers have developed blind speech segmentation methods that require no external or prior knowledge about the speech to be segmented and therefore need no extensive training on carefully prepared speech material.
Abstract: Automated segmentation of speech into phone-sized units has been a subject of study for over 30 years, as it plays a central role in many speech processing and ASR applications. While segmentation by hand is relatively precise, it is also extremely laborious and tedious, which is one reason why automated methods are widely used. For example, phonetic analysis of speech (Mermelstein, 1975), audio content classification (Zhang & Kuo, 1999), and word recognition (Antal, 2004) use segmentation to divide continuous audio signals into discrete, non-overlapping units that provide structural descriptions for the different parts of a processed signal. In automatic segmentation of speech, the best results have so far been achieved with semi-automatic HMMs that require prior training (see, e.g., Makhoul & Schwartz, 1994). Algorithms that use additional linguistic information, such as phonetic annotation, during the segmentation process are often also effective (e.g., Hemert, 1991). The use of these types of algorithms is well justified for many purposes, but extensive training may not always be possible, nor may adequately rich descriptions of the speech material be available, for instance in real-time applications. Training also imposes limitations on the material that can be segmented effectively, with the results depending strongly on, e.g., the language and vocabulary of the training and target material. Therefore, several researchers have concurrently worked on blind speech segmentation methods that require no external or prior knowledge about the speech to be segmented (Almpanidis & Kotropoulos, 2008; Aversano et al., 2001; Cherniz et al., 2007; Esposito & Aversano, 2005; Estevan et al., 2007; Sharma & Mammone, 1996). These so-called blind segmentation algorithms have many potential applications in speech processing that are complementary to supervised segmentation, since they need not be trained extensively on carefully prepared speech material. Importantly, blind algorithms do not have to make assumptions about the underlying signal conditions, whereas trained algorithms suffer segmentation errors from mismatches between training data and processed input, e.g., due to changes in background noise conditions or microphone properties. Blind methods also provide a valuable tool for studying speech at a basic level, as in phonetic research; they are language-independent; and they can serve as a processing step in self-learning agents attempting to make sense of sensory input where externally supplied linguistic knowledge cannot be used (e.g., Rasanen & Driesen, 2009; Rasanen et al., 2008).
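
As a concrete illustration of the blind idea, a minimal sketch (in Python) is given below: it hypothesizes a boundary wherever the frame-to-frame spectral change peaks. This follows the generic recipe only, not any of the specific algorithms cited above; the window sizes, the log-spectral distance, and the delta threshold are all illustrative assumptions.

```python
import numpy as np

def blind_segment(signal, sr, frame_ms=10.0, win_ms=25.0, n_fft=512, delta=2.0):
    """Hypothesize phone boundaries at peaks of frame-to-frame spectral change."""
    hop, win = int(sr * frame_ms / 1000), int(sr * win_ms / 1000)
    window = np.hamming(win)
    # Log-magnitude spectra of overlapping windowed frames
    frames = [signal[i:i + win] * window for i in range(0, len(signal) - win, hop)]
    spectra = np.log(np.abs(np.fft.rfft(frames, n_fft)) + 1e-10)
    # Spectral-change curve: Euclidean distance between consecutive spectra
    d = np.linalg.norm(np.diff(spectra, axis=0), axis=1)
    # Boundaries: local maxima exceeding mean + delta * std of the curve
    thr = d.mean() + delta * d.std()
    peaks = [i for i in range(1, len(d) - 1)
             if d[i] > thr and d[i] >= d[i - 1] and d[i] >= d[i + 1]]
    return [p * hop / sr for p in peaks]  # approximate boundary times in seconds

# Example: two concatenated tones should yield one boundary near 0.5 s
sr = 16000
t = np.arange(sr // 2) / sr
sig = np.concatenate([np.sin(2 * np.pi * 300 * t), np.sin(2 * np.pi * 1200 * t)])
print(blind_segment(sig, sr))
```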

29 citations


Proceedings Article
01 Aug 2011
TL;DR: Results from the experiments show that the combination of audio and acceleration data enhances the classification accuracy of physical activities with all classifiers, whereas environment classification does not benefit notably from acceleration features.
Abstract: This work studies the combination of audio and acceleration sensory streams for automatic classification of user context. Instead of performing sensory fusion at the feature level, we study the combination of classifier output distributions using a number of different classifiers. The performance of the algorithms is evaluated on a data set collected with casually worn mobile phones from a variety of real-world environments and user activities. Results from the experiments show that the combination of audio and acceleration data enhances the classification accuracy of physical activities with all classifiers, whereas environment classification does not benefit notably from acceleration features.
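
Decision-level fusion of this kind can be sketched as combining the per-class posterior distributions produced by the two sensor-specific classifiers. The Python snippet below illustrates the classic weighted sum and product combination rules; the paper does not state which rules it evaluates, and the weight and example posteriors here are invented for illustration.

```python
import numpy as np

def fuse_posteriors(p_audio, p_accel, weight=0.5, rule="sum"):
    """Combine per-class posteriors from two sensor-specific classifiers
    (decision-level fusion, not feature-level)."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_accel = np.asarray(p_accel, dtype=float)
    if rule == "sum":          # weighted sum rule
        fused = weight * p_audio + (1.0 - weight) * p_accel
    elif rule == "product":    # product rule (assumes stream independence)
        fused = (p_audio ** weight) * (p_accel ** (1.0 - weight))
    else:
        raise ValueError("unknown rule: " + rule)
    return fused / fused.sum()  # renormalize to a distribution

# Example: three activity classes on which the two streams disagree
p_audio = [0.6, 0.3, 0.1]   # hypothetical posteriors from the audio classifier
p_accel = [0.2, 0.7, 0.1]   # hypothetical posteriors from the accelerometer
print(fuse_posteriors(p_audio, p_accel, rule="product"))
```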

8 citations


Proceedings Article
01 Jan 2011
TL;DR: A computationally effective method for speech inversion is proposed, using a two-pole predictor structure to maintain better articulatory dynamics than conventional dynamic programming methods.
Abstract: An articulatory model of speech production is created for the purpose of studying the links between speech production and perception. A computationally effective method for speech inversion is proposed, using a two-pole predictor structure in order to maintain better articulatory dynamics when compared to conventional dynamic programming methods. Preliminary tests of the effect of inversion are performed on 2500 Finnish syllables extracted from continuous speech, consisting of 125 different syllable classes. A cluster selectivity test shows that the syllables are more reliably clustered using the automatically obtained parametric representation of articulatory gestures than using the original formant representation that serves as the starting point for the inversion.

Index Terms: articulatory model, speech inversion, motor theory, vocal tract

1. Introduction to articulatory modeling

Speech events are more conveniently described in articulatory than in acoustic terms. Individual articulators move rather slowly and smoothly compared to the spectral characteristics of speech signals. Since the relative trajectories of the different articulators remain rather similar in the production of speech sounds regardless of the speaker, modeling speech perception with articulatory models may help overcome many of the problems that arise from ambiguity in the purely acoustic domain. In the 20th century, research on articulatory modeling boomed when first electrical and then digital models of speech production could be implemented. Researchers have often referred to the articulatory models developed by Coker [1], Mermelstein [2], or Maeda [3], for example, when studying the speech inverse problem. The seven articulatory parameters of Maeda's model were estimated from x-ray tracings using so-called arbitrary factor analysis in order to make the parameters maximally uncorrelated with each other. Mermelstein's geometrical articulatory model depicts the positions of the articulators in the midsagittal plane; the lips, jaw, tongue, velum, and hyoid are treated as movable structures. In the 1990s and 2000s, more complex vocal tract and tongue models were developed. For example, Dang and Honda have created a 3D articulatory model that uses physiological constraints typical of human articulation in inverting vowel-to-vowel sequences [4].
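
As a rough, hypothetical reading of how a two-pole predictor could constrain inversion, the sketch below scores articulatory codebook candidates by their acoustic match plus their deviation from a second-order (two-pole) autoregressive prediction of the articulator trajectory. The function, the codebook layout, the coefficients a1 and a2, and the smoothness weight lam are assumptions, not details from the paper.

```python
import numpy as np

def invert_frame(acoustic_target, codebook, x_prev1, x_prev2,
                 a1=1.6, a2=-0.7, lam=0.5):
    """Choose the articulatory vector for one frame by balancing acoustic
    match against deviation from a two-pole (second-order AR) prediction."""
    # Two-pole prediction from the two previous articulatory states;
    # a1, a2 chosen so the poles lie inside the unit circle (stable dynamics)
    x_pred = a1 * x_prev1 + a2 * x_prev2
    costs = [np.sum((ac - acoustic_target) ** 2)      # acoustic fit
             + lam * np.sum((art - x_pred) ** 2)      # dynamic smoothness
             for art, ac in codebook]
    return codebook[int(np.argmin(costs))][0]

# Toy codebook: (articulatory vector, formant-like acoustic output) pairs
codebook = [(np.array([0.1, 0.2]), np.array([500.0, 1500.0])),
            (np.array([0.3, 0.1]), np.array([700.0, 1200.0]))]
x = invert_frame(np.array([520.0, 1480.0]), codebook,
                 x_prev1=np.array([0.1, 0.2]), x_prev2=np.array([0.1, 0.2]))
print(x)
```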

2 citations


Proceedings ArticleDOI
27 Aug 2011
TL;DR: A novel method for stop consonant recognition is presented, based on statistical properties of the short temporal fine structure of the burst part; classification is also evaluated with a simple frequency-domain method.
Abstract: The automatic classification of unvoiced stop consonants is widely considered a difficult task for traditional frequency-domain and even time-frequency methods. The main reason for this is their short duration and diverse temporal structure. In this paper we present a novel method for stop consonant recognition. The method is based on statistical properties of the short temporal fine structure of the burst part. Classification is also evaluated with a simple frequency-domain method.
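
The abstract does not specify which statistics of the burst fine structure are used. As a hypothetical example of such descriptors, the sketch below summarizes a burst segment by its zero-crossing interval distribution and the amplitudes of its local extrema, two simple time-domain fine-structure measures.

```python
import numpy as np

def burst_fine_structure_stats(burst):
    """Summary statistics of the time-domain fine structure of a stop burst:
    zero-crossing intervals and local-extremum amplitudes."""
    burst = np.asarray(burst, dtype=float)
    # Sample indices where the waveform changes sign
    zc = np.where(np.diff(np.signbit(burst)))[0]
    intervals = np.diff(zc)                      # samples between crossings
    # Amplitudes of local extrema (samples above/below both neighbors)
    mid = burst[1:-1]
    extrema = np.abs(mid[(mid - burst[:-2]) * (mid - burst[2:]) > 0])
    return {
        "zc_rate": len(zc) / len(burst),
        "zc_interval_mean": float(intervals.mean()) if intervals.size else 0.0,
        "zc_interval_std": float(intervals.std()) if intervals.size else 0.0,
        "extremum_amp_mean": float(extrema.mean()) if extrema.size else 0.0,
        "extremum_amp_std": float(extrema.std()) if extrema.size else 0.0,
    }

# Example on a short noise-like burst segment
rng = np.random.default_rng(0)
print(burst_fine_structure_stats(rng.standard_normal(200)))
```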

2 citations


Proceedings ArticleDOI
Unto K. Laine
27 Aug 2011
TL;DR: It is argued that syntactic methods may provide universal tools to model and describe structures from the very elementary level of signals up to the highest one, that of language.
Abstract: A new method for inferring specific stochastic grammars is presented. The process, called Hybrid Model Learner (HML), applies entropy rate to guide an agglomeration process of type ab->c. Each rule derived from the input sequence is associated with a certain entropy-rate difference. A grammar automatically inferred from an example sequence can be used to detect and recognize similar structures in unknown sequences. Two important schools of thought, structuralism and 'stochasticism', are discussed, including how the two have met and how they influence current statistical learning methods. It is argued that syntactic methods may provide universal tools to model and describe structures from the very elementary level of signals up to the highest one, that of language.
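
The ab->c agglomeration guided by entropy rate can be illustrated with a toy sketch: greedily merge the most frequent adjacent symbol pair into a new symbol and record the change in bigram entropy rate that each merge produces. This is a schematic reading only; HML's actual merge criterion and rule bookkeeping are not described here, and the greedy frequency-based pair choice below is an assumption.

```python
import math
from collections import Counter

def entropy_rate(seq):
    """First-order (bigram) conditional entropy H(X_t | X_{t-1}) in bits."""
    uni, bi = Counter(seq[:-1]), Counter(zip(seq[:-1], seq[1:]))
    n = len(seq) - 1
    return -sum(c / n * math.log2(c / uni[a]) for (a, b), c in bi.items())

def merge_pair(seq, pair, new_sym):
    """Rewrite every occurrence of the pair ab as the new symbol c."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_sym); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

def infer_grammar(seq, n_rules=5):
    """Greedy ab->c agglomeration; each rule is tagged with the
    entropy-rate difference it produced."""
    rules = []
    for k in range(n_rules):
        pair = Counter(zip(seq[:-1], seq[1:])).most_common(1)[0][0]
        h_before = entropy_rate(seq)
        new_sym = f"<{k}>"
        seq = merge_pair(seq, pair, new_sym)
        rules.append((pair, new_sym, entropy_rate(seq) - h_before))
    return rules, seq

rules, _ = infer_grammar(list("abcabcabdabc"))
for lhs, rhs, dh in rules:
    print(f"{lhs[0]}{lhs[1]} -> {rhs}   dH = {dh:+.3f} bits")
```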

1 citation