Bio: Trym Holter is an academic researcher from SINTEF. The author has contributed to research in topics: Closed captioning & User requirements document. The author has an h-index of 3 and has co-authored 8 publications receiving 95 citations.
TL;DR: A maximum likelihood based algorithm for fully automatic, data-driven modelling of pronunciation, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word, providing a consistent framework for optimisation of automatic speech recognition systems.
Abstract: This paper addresses the problem of generating lexical word representations that properly represent natural pronunciation variations for the purpose of improved speech recognition accuracy. In order to create a consistent framework for optimisation of automatic speech recognition systems, we present a maximum likelihood based algorithm for fully automatic data-driven modelling of pronunciation, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word. We also propose an extension of this formulation in order to achieve optimal modelling of pronunciation variations. Since different words will not in general exhibit the same amount of pronunciation variation, the procedure allows words to be represented by a different number of baseforms. The methods improve the subword description of the vocabulary words and have been shown to improve recognition performance on the DARPA Resource Management task.
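The selection of multiple baseforms per word can be pictured as a greedy maximum-likelihood procedure: start from the single best-scoring baseform and add further baseforms only while they raise the total likelihood of the word's acoustic tokens. The sketch below is a minimal illustration, not the paper's algorithm; the function name, the toy log-likelihood scores, and the phone strings are hypothetical, and it assumes the per-token log-likelihoods under each candidate baseform's concatenated subword HMMs have already been computed.

```python
def select_baseforms(loglik, max_baseforms=2):
    """Greedy maximum-likelihood baseform selection (hypothetical sketch).

    loglik: dict mapping candidate baseform -> list of per-token
    log-likelihoods under that baseform's concatenated subword HMMs.
    Each acoustic token is scored by the best baseform in the current
    set; baseforms are added while they improve the total log-likelihood.
    """
    candidates = dict(loglik)
    # Start with the single baseform that maximizes the total likelihood.
    best = max(candidates, key=lambda b: sum(candidates[b]))
    chosen = [best]
    del candidates[best]
    n_tokens = len(loglik[best])

    def total(forms):
        # Each token contributes its best score among the given baseforms.
        return sum(max(loglik[f][t] for f in forms) for t in range(n_tokens))

    while candidates and len(chosen) < max_baseforms:
        nxt = max(candidates, key=lambda b: total(chosen + [b]))
        if total(chosen + [nxt]) > total(chosen):
            chosen.append(nxt)
            del candidates[nxt]
        else:
            break
    return chosen

# Toy scores: two tokens of one word, each preferring a different baseform.
loglik = {"ax b aw t": [-10.0, -50.0], "ax b ae t": [-50.0, -12.0]}
select_baseforms(loglik)  # -> ['ax b aw t', 'ax b ae t']
```

With the toy scores above, the two tokens prefer different baseforms, so the word ends up with two lexicon entries, mirroring the paper's point that different words may warrant different numbers of baseforms.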
01 Jan 2000
TL;DR: The experiments showed that the turn error rate was more than twice as large for the real dialogues as for theWoZ calls, i.e., 13.3% versus 5.7%.
Abstract: This paper describes the development and testing of a pilot spoken dialogue system for bus travel information in the city of Trondheim, Norway. The system-driven dialogue was designed on the basis of analyzed recordings from human-human operator dialogues, Wizard-of-Oz (WoZ) dialogues, and a text-based web inquiry system. The dialogue system employs a flexible speech recognizer and an utterance concatenation procedure for speech output. Even though the system is intended for research only, it has been accessible through a public phone number since October 1999. During this period all dialogues have been recorded. From these, approximately 350 dialogues were selected for annotation and comparison to 120 dialogues from the WoZ recordings. The experiments showed that the turn error rate was more than twice as large for the real dialogues as for the WoZ calls, i.e., 13.3% versus 5.7%. Thus, the WoZ results did not give a reliable estimate of the true performance. Our experiments indicate that the current flexible speech recognizer should be further optimized.
01 Jan 2000
TL;DR: This application will provide the hearing impaired with an option to read captions for live broadcast programs, i.e., when off-line captioning is not feasible.
Abstract: A system for on-line generation of closed captions (subtitles) for broadcast of live TV-programs is described. During broadcast, a commentator formulates a possibly condensed, but semantically correct version of the original speech. These compressed phrases are recognized by a continuous speech recognizer, and the resulting captions are fed into the teletext system. This application will provide the hearing impaired with an option to read captions for live broadcast programs, i.e., when off-line captioning is not feasible. The main advantage in using a speech recognizer rather than a stenography-based system (e.g., Velotype) is the relaxed requirements for commentator training. Also, the amount of text generated by a system based on stenography tends to be large, thus making it harder to read.
01 Jan 2000
TL;DR: A statistical framework for modeling (and decoding) semantic concepts based on discrete hidden Markov models (DHMMs) is described, in which each semantic concept class is modeled as a multi-state DHMM whose observations are the recognized words.
Abstract: A key issue in a spoken dialogue system is the successful semantic interpretation of the output from the speech recognizer. Extracting the semantic concepts, i.e. the meaningful phrases, of an utterance is traditionally performed using rule based methods. In this paper we describe a statistical framework for modeling (and decoding) semantic concepts based on discrete hidden Markov models (DHMMs). Each semantic concept class is modeled as a multi-state DHMM, where the observations are the recognized words. The proposed decoding procedure is capable of parsing an utterance into a sequence of phrases, each belonging to a different concept class. The phrase sequence will correspond to a concept segmentation and class identification, whilst the semantic entities constituting each phrase contain the semantic value. The algorithm has been tested on a dialogue system for bus route information in Norwegian. The results confirm the applicability of the procedure. Semantically relevant concepts in input inquiries could be identified with 6.9% error rate on the sentence level. The corresponding segmentation error rate was 8.6% when concept segmentation information was available during training. Without this information, i.e. if the training was performed in an embedded mode, the segmentation error rate increased to 23.5%.
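The decoding idea can be illustrated with a standard Viterbi pass over a discrete HMM whose states carry concept-class labels, so the best state path yields both the segmentation and the class identification in one pass. This is a toy sketch, not the authors' system: the two single-state concept models, the probabilities, and the example words are all invented, and unseen words receive a small floor probability.

```python
import math

def viterbi(words, states, trans, emit, init):
    """Viterbi decoding over a discrete HMM (illustrative sketch).

    states: list of (concept_class, state_id) pairs, so the best state
    path directly labels each recognized word with a concept class.
    Unseen words and transitions get small floor probabilities.
    """
    V = [{s: math.log(init[s]) + math.log(emit[s].get(words[0], 1e-9))
          for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        col, bp = {}, {}
        for s in states:
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans[p].get(s, 1e-12)))
                 for p in states),
                key=lambda x: x[1])
            col[s] = score + math.log(emit[s].get(words[t], 1e-9))
            bp[s] = prev
        V.append(col)
        back.append(bp)
    # Backtrack from the best final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [s[0] for s in path]  # concept-class label per word

# Toy bus-information example with invented probabilities.
states = [("FROM", 0), ("TO", 0)]
init = {("FROM", 0): 0.9, ("TO", 0): 0.1}
trans = {("FROM", 0): {("FROM", 0): 0.6, ("TO", 0): 0.4},
         ("TO", 0): {("TO", 0): 0.9, ("FROM", 0): 0.1}}
emit = {("FROM", 0): {"from": 0.5, "sentrum": 0.4},
        ("TO", 0): {"to": 0.5, "lade": 0.4}}
labels = viterbi(["from", "sentrum", "to", "lade"], states, trans, emit, init)
# -> ['FROM', 'FROM', 'TO', 'TO']
```

The label sequence directly gives the concept segmentation ("from sentrum" as a FROM phrase, "to lade" as a TO phrase), which is the role the decoding procedure plays in the paper.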
15 Sep 2004
TL;DR: A mobile context-aware system for maintenance work is proposed, based on electronically tagged equipment and handheld wireless terminals with a multimodal user interface, with particular attention to voice interaction in noisy industrial scenarios.
Abstract: Maintenance workers in the oil and process industry have typically had minimal IT support, relying on paper-based solutions both for the information they need to bring into the field and for data capture. This paper proposes a mobile context-aware system for maintenance work based on electronically tagged equipment and handheld wireless terminals with a multimodal user interface. Particular attention has been given to voice interaction in noisy industrial scenarios, utilising the PARAT earplug. A proof-of-concept demonstrator of the system has been developed. The paper presents the demonstrator architecture and experiences gained through this work.
TL;DR: This contribution provides an overview of the publications on pronunciation variation modeling in automatic speech recognition, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop'.
Abstract: The focus in automatic speech recognition (ASR) research has gradually shifted from isolated words to conversational speech. Consequently, the amount of pronunciation variation present in the speech under study has gradually increased. Pronunciation variation will deteriorate the performance of an ASR system if it is not well accounted for. This is probably the main reason why research on modeling pronunciation variation for ASR has increased lately. In this contribution, we provide an overview of the publications on this topic, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop'. (Whenever we mention 'the Rolduc workshop' in the text, we refer to the ESCA Tutorial and Research Workshop "Modeling pronunciation variation for ASR" that was held in Rolduc from 4 to 6 May 1998. This special issue of Speech Communication contains a selection of papers presented at that workshop.) First, the most important characteristics that distinguish the various studies on pronunciation variation modeling are discussed. Subsequently, the issues of evaluation and comparison are addressed. Particular attention is paid to some of the most important factors that make it difficult to compare the different methods in an objective way. Finally, some conclusions are drawn as to the importance of objective evaluation and the way in which it could be carried out.
TL;DR: Presents the structure of the patterns collection: the patterns are suggested solutions to problems, grouped into problem areas that are in turn grouped into three main problem areas. This structure is valuable both as an index for identifying patterns to use and as a fairly comprehensive overview of the issues in designing user interfaces for mobile applications.
Abstract: The topic of this paper is a collection of user interface (UI) design patterns for mobile applications. In the paper we present the structure of the patterns collection: the patterns are suggested solutions to problems, grouped into a set of problem areas that are further grouped into three main problem areas. This structure is valuable both as an index for identifying patterns to use and as a fairly comprehensive overview of issues in designing user interfaces for mobile applications. To show the breadth of the patterns collection we present six individual problems with connected design patterns in some detail, each coming from a different problem area. They represent important and relevant problems, and are on different levels of abstraction, thus showing how patterns may be used to present problems and solutions at different levels of detail. To show the relevance and usefulness of the patterns collection for usability professionals with a mixed background, we present some relevant findings from a validation of the patterns collection. In addition to verifying the relevance and usefulness of the patterns collection, the validation also shows both expected and surprising correlations between background and perceived relevance and usefulness. One important finding from the validation is an indication that the patterns collection is best suited for experienced UI developers wanting to start developing mobile UIs. Using a patterns collection for documenting design knowledge and experience has been a mixed experience, so we discuss its pros and cons. Finally, we present related work and future research.
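The three-level grouping described above (patterns under problem areas under main problem areas) can be pictured as a nested index. The structure and lookup below are a hypothetical sketch; the area names, problems, and pattern names are invented for illustration and do not come from the paper.

```python
# Hypothetical three-level structure: main problem area -> problem area
# -> list of problem/pattern entries. All names below are invented.
patterns_collection = {
    "Utilizing screen space": {
        "Presenting lists": [
            {"problem": "Long lists on a small screen",
             "pattern": "Paged list with item count"},
        ],
    },
    "Interaction mechanisms": {
        "Text input": [
            {"problem": "Typing is slow on a keypad",
             "pattern": "Offer selection instead of free text"},
        ],
    },
    "Design at large": {
        "Consistency": [
            {"problem": "Behaviour differs across devices",
             "pattern": "Device-independent interaction model"},
        ],
    },
}

def find_patterns(collection, keyword):
    """Use the hierarchy as an index: match a keyword in the problem text."""
    hits = []
    for main_area, areas in collection.items():
        for area, entries in areas.items():
            for e in entries:
                if keyword.lower() in e["problem"].lower():
                    hits.append((main_area, area, e["pattern"]))
    return hits
```

The lookup illustrates the paper's point that the grouping doubles as an index: a designer can navigate from a problem description to candidate patterns without scanning the whole collection.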
01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.
01 Jan 1999
TL;DR: This dissertation examines how pronunciations vary in spontaneous speech, and how speaking rate and word predictability can be used to predict when greater pronunciation variation can be expected; the results suggest that it may be appropriate to build models for syllables and words that dynamically change the pronunciations used in the speech recognizer based on the extended context.
Abstract: As of this writing, the automatic recognition of spontaneous speech by computer is fraught with errors; many systems transcribe one out of every three to five words incorrectly, whereas humans can transcribe spontaneous speech with one error in twenty words or better. This high error rate is due in part to the poor modeling of pronunciations within spontaneous speech. This dissertation examines how pronunciations vary in this speaking style, and how speaking rate and word predictability can be used to predict when greater pronunciation variation can be expected. It includes an investigation of the relationship between speaking rate, word predictability, pronunciations, and errors made by speech recognition systems. The results of these studies suggest that for spontaneous speech, it may be appropriate to build models for syllables and words that can dynamically change the pronunciations used in the speech recognizer based on the extended context (including surrounding words, phones, speaking rate, etc.). Implementation of new pronunciation models automatically derived from data within the ICSI speech recognition system has shown a 4–5% relative improvement on the Broadcast News recognition task. Roughly two thirds of these gains can be attributed to static baseform improvements; adding the ability to dynamically adjust pronunciations within the recognizer provides the other third of the improvement. The Broadcast News task also allows for comparison of performance on different styles of speech: the new pronunciation models do not help for pre-planned speech, but they provide a significant gain for spontaneous speech. Not only do the automatically learned pronunciation models capture some of the linguistic variation due to the speaking style, but they also represent variation in the acoustic model due to channel effects. 
The largest improvement was seen in the telephone speech condition, in which 12% of the errors produced by the baseline system were corrected.
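The dynamic-pronunciation idea, choosing among variants based on extended context such as local speaking rate, can be reduced to a toy selection rule. Everything below (the lexicon entries, the syllable-rate threshold, the phone strings) is hypothetical and only illustrates the mechanism, not the models learned in the dissertation.

```python
def choose_pronunciation(word, rate_syl_per_sec, lexicon, threshold=5.0):
    """Toy illustration of context-dependent pronunciation selection.

    lexicon maps word -> {"canonical": [...], "reduced": [...]}.
    When the local speaking rate exceeds a threshold, the reduced
    variant is preferred; all names and values here are hypothetical.
    """
    variants = lexicon[word]
    if rate_syl_per_sec > threshold and "reduced" in variants:
        return variants["reduced"]
    return variants["canonical"]

# Invented lexicon entry with a canonical and a reduced variant.
lexicon = {"because": {"canonical": ["b", "ix", "k", "ah", "z"],
                       "reduced": ["k", "ah", "z"]}}
choose_pronunciation("because", 6.2, lexicon)  # fast speech -> reduced form
choose_pronunciation("because", 3.1, lexicon)  # slow speech -> canonical form
```

In the dissertation this decision is made by models learned from data over a richer context (surrounding words, phones, rate); the rule above only shows where such a decision plugs into the recognizer's lexicon lookup.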
TL;DR: A Markov chain Monte Carlo (MCMC) sampling scheme is constructed in which sampling from all the posterior probability distributions is very easy; the scheme has been tested in extensive computer simulations on finite discrete-valued observed data.
Abstract: Hidden Markov models (HMMs) represent a very important tool for analysis of signals and systems. In the past two decades, HMMs have attracted the attention of various research communities, including the ones in statistics, engineering, and mathematics. Their extensive use in signal processing and, in particular, speech processing is well documented. A major weakness of conventional HMMs is their inflexibility in modeling state durations. This weakness can be avoided by adopting a more complicated class of HMMs known as nonstationary HMMs. We analyze nonstationary HMMs whose state transition probabilities are functions of time that indirectly model state durations by a given probability mass function and whose observation spaces are discrete. The objective of our work is to estimate all the unknowns of a nonstationary HMM, which include its parameters and the state sequence. To that end, we construct a Markov chain Monte Carlo (MCMC) sampling scheme, where sampling from all the posterior probability distributions is very easy. The proposed MCMC sampling scheme has been tested in extensive computer simulations on finite discrete-valued observed data, and some of the simulation results are presented.
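A much-simplified version of the sampling idea can be sketched as single-site Gibbs sampling of the state sequence. Unlike the paper's scheme, the sketch below holds the model parameters fixed and samples only the states; the time-indexed transition function mimics a nonstationary HMM's time-varying transition probabilities, and all values in the usage example are toy numbers.

```python
import random

def gibbs_states(obs, n_states, trans, emit, init, iters=200, seed=0):
    """Single-site Gibbs sampler for an HMM state sequence (simplified sketch).

    obs: discrete observations; emit[s][o]: emission probabilities;
    trans(t): transition matrix at time t (nonstationary); init: initial
    state probabilities. The paper also samples the parameters; here they
    are held fixed to keep the example short.
    """
    rng = random.Random(seed)
    T = len(obs)
    z = [rng.randrange(n_states) for _ in range(T)]
    for _ in range(iters):
        for t in range(T):
            # Full conditional of z[t] given its neighbours and the data.
            w = []
            for s in range(n_states):
                p = emit[s][obs[t]]
                p *= init[s] if t == 0 else trans(t)[z[t - 1]][s]
                if t + 1 < T:
                    p *= trans(t + 1)[s][z[t + 1]]
                w.append(p)
            # Draw z[t] proportionally to the conditional weights.
            r = rng.random() * sum(w)
            acc = 0.0
            for s in range(n_states):
                acc += w[s]
                if r <= acc:
                    z[t] = s
                    break
    return z

# Toy two-state example; trans is constant in t here, but any function
# of t would model nonstationary transition probabilities.
emit = [[0.99, 0.01], [0.01, 0.99]]
trans = lambda t: [[0.9, 0.1], [0.1, 0.9]]
z = gibbs_states([0, 0, 0, 1, 1, 1], 2, trans, emit, [0.5, 0.5])
# z is one posterior sample of the state sequence; with emissions this
# sharp it is very likely [0, 0, 0, 1, 1, 1]
```

Because each full conditional involves only the two neighbouring states and one observation, sampling is indeed "very easy" in the sense the abstract describes; the paper's full scheme additionally draws the transition and emission parameters from their posteriors.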