Author
P. O. Husoy
Bio: P. O. Husoy is an academic researcher from the Norwegian Institute of Technology. The author has contributed to research in the topics of Speaker recognition and Speech coding, has an h-index of 2, and has co-authored 2 publications receiving 43 citations.
Papers
23 May 1989
TL;DR: The authors describe a system for speaker-dependent speech recognition based on acoustic subword units that showed results comparable to those of whole-word-based systems.
Abstract: The authors describe a system for speaker-dependent speech recognition based on acoustic subword units. Several strategies for automatic generation of an acoustic lexicon are outlined. Preliminary tests have been performed on a small vocabulary. In these tests, the proposed system showed results comparable to those of whole-word-based systems.
42 citations
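The abstract above does not detail the lexicon-generation strategies, so the following is only a minimal sketch of one common data-driven approach: uniform segmentation of training tokens followed by k-means clustering of the segment means. The names (segment_features, build_acoustic_lexicon, word_tokens) are hypothetical, and MFCC-style feature matrices are assumed rather than the authors' actual front end.

import numpy as np
from sklearn.cluster import KMeans

def segment_features(features, n_segments):
    # Split a (frames x dims) feature matrix into roughly equal segments
    # and represent each segment by its mean vector.
    splits = np.array_split(features, n_segments, axis=0)
    return [seg.mean(axis=0) for seg in splits]

def build_acoustic_lexicon(word_tokens, n_segments=3, n_units=64):
    # word_tokens maps each vocabulary word to a list of (frames x dims)
    # feature matrices, e.g. MFCCs for several utterances of that word.
    # Returns a codebook of acoustic subword units and a lexicon mapping
    # each word to a sequence of unit indices.
    segment_vectors = []
    for tokens in word_tokens.values():
        for feats in tokens:
            segment_vectors.extend(segment_features(feats, n_segments))
    # Cluster the pooled segments into an inventory of acoustic subword units.
    codebook = KMeans(n_clusters=n_units, n_init=10).fit(np.vstack(segment_vectors))
    # Transcribe each word as the unit sequence of one reference token.
    lexicon = {}
    for word, tokens in word_tokens.items():
        ref = segment_features(tokens[0], n_segments)
        lexicon[word] = codebook.predict(np.vstack(ref)).tolist()
    return codebook, lexicon

In a full system the reference transcription would normally be refined iteratively, for instance by re-segmenting the tokens against the trained unit models before updating the lexicon.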
Proceedings Article
01 Jan 1991
2 citations
Cited by
TL;DR: Outlines current advances in automatic speech recognition (ASR) and spoken language systems, as well as remaining deficiencies in dealing with the variation naturally present in speech.
Abstract: Major progress is being recorded regularly on both the technology and exploitation of automatic speech recognition (ASR) and spoken language systems. However, there are still technological barriers to flexible solutions and user satisfaction under some circumstances. This is related to several factors, such as the sensitivity to the environment (background noise), or the weak representation of grammatical and semantic knowledge.
Current research is also emphasizing deficiencies in dealing with variation naturally present in speech. For instance, the lack of robustness to foreign accents precludes use by specific populations. Also, some applications, like directory assistance, particularly stress the core recognition technology due to the very high active vocabulary (application perplexity). There are many factors affecting speech realization: regional, sociolinguistic, or related to the environment or the speaker herself. These create a wide range of variations that may not be modeled correctly (speaker, gender, speaking rate, vocal effort, regional accent, speaking style, non-stationarity, etc.), especially when resources for system training are scarce. This paper outlines current advances related to these topics.
507 citations
01 Jan 1999
TL;DR: Problems with the phoneme as the basic subword unit in speech recognition are raised, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech.
Abstract: The notion that a word is composed of a sequence of phone segments, sometimes referred to as ‘beads on a string’, has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives – automatically derived subword units and linguistically motivated distinctive feature systems – and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies.
151 citations
TL;DR: Presents a complete probabilistic formulation for the automatic design of subword units and a dictionary, given only the acoustic data and their transcriptions; the framework permits easy incorporation of external sources of information, such as the spellings of words in a nonideographic script.
Abstract: Large vocabulary continuous speech recognition (LVCSR) systems traditionally represent words in terms of smaller subword units. Both during training and during recognition, they require a mapping table, called the dictionary, which maps words into sequences of these subword units. The performance of the LVCSR system depends critically on the definition of the subword units and the accuracy of the dictionary. In current LVCSR systems, both these components are manually designed. While manually designed subword units generalize well, they may not be the optimal units of classification for the specific task or environment for which an LVCSR system is trained. Moreover, when human expertise is not available, it may not be possible to design good subword units manually. There is clearly a need for data-driven design of these LVCSR components. In this paper, we present a complete probabilistic formulation for the automatic design of subword units and dictionary, given only the acoustic data and their transcriptions. The proposed framework permits easy incorporation of external sources of information, such as the spellings of words in terms of a nonideographic script.
82 citations
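The abstract above does not state the objective explicitly; under a standard reading of such a maximum likelihood formulation (an assumption here, not a quotation from the paper), the subword-unit parameters \Lambda and the dictionary D are chosen jointly to maximise the likelihood of the training acoustics given their word-level transcriptions:

\[ (\hat{D}, \hat{\Lambda}) = \arg\max_{D,\,\Lambda} \prod_{i=1}^{N} P\big(\mathbf{X}_i \mid D(w_i), \Lambda\big) \]

where \mathbf{X}_i is the acoustic observation sequence of the i-th training utterance, w_i its word transcription, and D(w_i) the subword-unit sequence that the dictionary assigns to it. In practice the maximisation alternates between re-estimating \Lambda with D fixed and re-deriving D with \Lambda fixed.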
TL;DR: A joint solution to the related problems of learning a unit inventory and a corresponding lexicon from data; on a speaker-independent read speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.
Abstract: Although most parameters in a speech recognition system are estimated from data by the use of an objective function, the unit inventory and lexicon are generally hand crafted and therefore unlikely to be optimal. This paper proposes a joint solution to the related problems of learning a unit inventory and corresponding lexicon from data. On a speaker-independent read speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities.
66 citations
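The joint learning described above is usually realised as an alternating optimisation; a skeleton of that loop is sketched below, with hypothetical refit_unit_models, rederive_lexicon and likelihood callables standing in for the actual estimation steps (an illustration under those assumptions, not the authors' algorithm).

def joint_learn(data, units, lexicon, refit_unit_models, rederive_lexicon,
                likelihood, max_iters=10, tol=1e-3):
    # Alternate between re-estimating unit models with the lexicon fixed
    # and re-deriving the lexicon with the unit models fixed, until the
    # training-data likelihood stops improving.
    prev_ll = float("-inf")
    for _ in range(max_iters):
        units = refit_unit_models(data, lexicon)   # acoustic-model step
        lexicon = rederive_lexicon(data, units)    # lexicon step
        ll = likelihood(data, units, lexicon)
        if ll - prev_ll < tol:
            break  # converged: no meaningful likelihood gain
        prev_ll = ll
    return units, lexicon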
TL;DR: A maximum likelihood based algorithm for fully automatic, data-driven modelling of pronunciation, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word, creating a consistent framework for the optimisation of automatic speech recognition systems.
Abstract: This paper addresses the problem of generating lexical word representations that properly represent natural pronunciation variations for the purpose of improved speech recognition accuracy. In order to create a consistent framework for optimisation of automatic speech recognition systems, we present a maximum likelihood based algorithm for fully automatic data-driven modelling of pronunciation, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word. We also propose an extension of this formulation in order to achieve optimal modelling of pronunciation variations. Since different words will not in general exhibit the same amount of pronunciation variation, the procedure allows words to be represented by a different number of baseforms. The methods improve the subword description of the vocabulary words and have been shown to improve recognition performance on the DARPA Resource Management task.
63 citations
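The selection criterion described above is likelihood based; a minimal sketch of one greedy variant of such data-driven baseform selection follows. The decode_units and score_units callables are hypothetical stand-ins for an unconstrained subword-unit recogniser and a forced-alignment scorer, so this is an illustration under those assumptions rather than the authors' implementation.

def select_baseforms(tokens, decode_units, score_units, max_baseforms=3):
    # tokens        : acoustic feature sequences for one vocabulary word
    # decode_units  : token -> (unit sequence, log-likelihood) from an
    #                 unconstrained subword recogniser (hypothetical interface)
    # score_units   : (token, unit sequence) -> forced-alignment log-likelihood
    #                 (hypothetical interface)
    # Candidate pronunciations are the unconstrained decodings of the tokens.
    candidates = {tuple(decode_units(t)[0]) for t in tokens}
    chosen = []

    def total_ll(baseforms):
        # Each token is credited with its best-matching baseform.
        return sum(max(score_units(t, list(b)) for b in baseforms) for t in tokens)

    while candidates and len(chosen) < max_baseforms:
        best = max(candidates, key=lambda c: total_ll(chosen + [c]))
        if chosen and total_ll(chosen + [best]) <= total_ll(chosen):
            break  # adding another baseform no longer improves the likelihood
        chosen.append(best)
        candidates.remove(best)
    return [list(b) for b in chosen]

Words with little pronunciation variation end up with a single baseform, while more variable words accumulate several, which mirrors the variable number of baseforms per word described in the abstract.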