
Showing papers in "ACM Transactions on Speech and Language Processing in 2011"


Journal ArticleDOI
TL;DR: This work experiments with both a step-wise approach, where spatial prepositions are found first and the related trajectors and landmarks are then extracted, and a joint learning approach, where a spatial relation and its composing indicator, trajector, and landmark are classified collectively.
Abstract: This article reports on the novel task of spatial role labeling in natural language text. It proposes machine learning methods to extract spatial roles and their relations. This work experiments with both a step-wise approach, where spatial prepositions are found first and the related trajectors and landmarks are then extracted, and a joint learning approach, where a spatial relation and its composing indicator, trajector, and landmark are classified collectively. Context-dependent learning techniques, such as a skip-chain conditional random field, yield good results on the GUM-evaluation (Maptask) data and the CLEF-IAPR TC-12 Image Benchmark. An extensive error analysis, including feature assessment, and a cross-domain evaluation pinpoint the main bottlenecks and avenues for future research.
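
As a rough illustration of the step-wise approach described above (not the authors' learned CRF models), the sketch below first detects spatial indicators from a small preposition list and then takes the nearest nouns on either side as trajector and landmark. The preposition list and nearest-noun heuristic are illustrative assumptions only.

```python
# Minimal sketch of step-wise spatial role labeling:
# (1) find spatial indicators, (2) pick trajector/landmark candidates.
# The preposition list and nearest-noun heuristic are illustrative
# assumptions, not the paper's learned models.

SPATIAL_PREPS = {"on", "in", "under", "above", "behind", "near"}

def label_spatial_roles(tokens, pos_tags):
    """Return (indicator, trajector, landmark) triples for one sentence."""
    triples = []
    for i, (tok, tag) in enumerate(zip(tokens, pos_tags)):
        if tok.lower() in SPATIAL_PREPS and tag == "IN":
            # Trajector: nearest noun to the left; landmark: nearest to the right.
            trajector = next((tokens[j] for j in range(i - 1, -1, -1)
                              if pos_tags[j].startswith("NN")), None)
            landmark = next((tokens[j] for j in range(i + 1, len(tokens))
                             if pos_tags[j].startswith("NN")), None)
            triples.append((tok, trajector, landmark))
    return triples

tokens = ["The", "book", "is", "on", "the", "table"]
pos = ["DT", "NN", "VBZ", "IN", "DT", "NN"]
print(label_spatial_roles(tokens, pos))  # [('on', 'book', 'table')]
```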

129 citations


Journal ArticleDOI
TL;DR: Experimental results show that a set of approximate dynamic programming algorithms, combined with a method for learning a sparse representation of the value function, can learn good dialogue policies directly from data, avoiding user modeling errors.
Abstract: Spoken Dialogue Systems (SDS) are systems which have the ability to interact with human beings using natural language as the medium of interaction. A dialogue policy plays a crucial role in determining the functioning of the dialogue management module. Handcrafting the dialogue policy is not always an option, considering the complexity of the dialogue task and the stochastic behavior of users. In recent years, Reinforcement Learning (RL) has proved to be an efficient approach for dialogue policy optimization. Yet most conventional RL algorithms are data intensive and demand techniques such as user simulation, which is likely to introduce additional modeling errors. This paper explores the possibility of using a set of approximate dynamic programming algorithms for policy optimization in SDS. Moreover, these algorithms are combined with a method for learning a sparse representation of the value function. Experimental results show that these algorithms, when applied to dialogue management optimization, are particularly sample efficient, since they learn from a few hundred dialogue examples. These algorithms learn in an off-policy manner, meaning that they can learn optimal policies from dialogue examples generated with quite a simple strategy. Thus they can learn good dialogue policies directly from data, avoiding user modeling errors.
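
The batch, off-policy learning described above can be sketched with a generic fitted Q-iteration on linear state-action features, one member of the approximate dynamic programming family. This is not the paper's specific algorithms or sparse value representation; the features, rewards, and transitions below are toy assumptions.

```python
import numpy as np

# Sketch of batch, off-policy fitted Q-iteration with a linear value
# function. The feature map, reward, and transitions are invented; a real
# system would use logged (state, action, reward, next_state) dialogue data.

rng = np.random.default_rng(0)
n_feats, n_actions, gamma = 4, 2, 0.95

def phi(s, a):
    """Joint state-action feature vector (one block per action)."""
    v = np.zeros(n_feats * n_actions)
    v[a * n_feats:(a + 1) * n_feats] = s
    return v

# A fixed batch of transitions, e.g. collected with a simple
# hand-crafted exploration policy (hence "off-policy" learning).
batch = [(rng.random(n_feats), rng.integers(n_actions),
          rng.random(), rng.random(n_feats)) for _ in range(300)]

w = np.zeros(n_feats * n_actions)
for _ in range(50):  # fitted Q-iteration
    X = np.array([phi(s, a) for s, a, _, _ in batch])
    y = np.array([r + gamma * max(phi(s2, a2) @ w for a2 in range(n_actions))
                  for _, _, r, s2 in batch])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

greedy = lambda s: max(range(n_actions), key=lambda a: phi(s, a) @ w)
print("greedy action for a sample state:", greedy(rng.random(n_feats)))
```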

112 citations


Journal ArticleDOI
TL;DR: Describes how spoken dialogs using Automatic Speech Recognition (ASR) and natural language processing were developed to stimulate students' thinking, reasoning, and self-explanations.
Abstract: This article describes My Science Tutor (MyST), an intelligent tutoring system designed to improve science learning by students in 3rd, 4th, and 5th grades (7 to 11 years old) through conversational dialogs with a virtual science tutor. In our study, individual students engage in spoken dialogs with the virtual tutor Marni during 15 to 20 minute sessions following classroom science investigations to discuss and extend concepts embedded in the investigations. The spoken dialogs in MyST are designed to scaffold learning by presenting open-ended questions accompanied by illustrations or animations related to the classroom investigations and the science concepts being learned. The focus of the interactions is to elicit self-expression from students. To this end, Marni applies some of the principles of Questioning the Author, a proven approach to classroom conversations, to challenge students to think about and integrate new concepts with prior knowledge to construct enriched mental models that can be used to explain and predict scientific phenomena. In this article, we describe how spoken dialogs using Automatic Speech Recognition (ASR) and natural language processing were developed to stimulate students' thinking, reasoning, and self-explanations. We describe the MyST system architecture and the Wizard of Oz procedure that was used to collect data from tutorial sessions with elementary school students. Using data collected with the procedure, we present evaluations of the ASR and semantic parsing components. A formal evaluation of learning gains resulting from system use is currently being conducted. This article also presents survey results of teachers' and children's impressions of MyST.

69 citations


Journal ArticleDOI
TL;DR: A novel algorithm for learning parameters in statistical dialogue systems modeled as Partially Observable Markov Decision Processes (POMDPs); experiments show that model parameters estimated to maximize the expected cumulative reward result in significantly improved performance compared to baseline hand-crafted model parameters.
Abstract: This article presents a novel algorithm for learning parameters in statistical dialogue systems which are modeled as Partially Observable Markov Decision Processes (POMDPs). The three main components of a POMDP dialogue manager are a dialogue model representing dialogue state information; a policy that selects the system's responses based on the inferred state; and a reward function that specifies the desired behavior of the system. Ideally both the model parameters and the policy would be designed to maximize the cumulative reward. However, while there are many techniques available for learning the optimal policy, no good ways of learning the optimal model parameters that scale to real-world dialogue systems have been found yet. The presented algorithm, called the Natural Actor and Belief Critic (NABC), is a policy gradient method that offers a solution to this problem. Based on observed rewards, the algorithm estimates the natural gradient of the expected cumulative reward. The resulting gradient is then used to adapt both the prior distribution of the dialogue model parameters and the policy parameters. In addition, the article presents a variant of the NABC algorithm, called the Natural Belief Critic (NBC), which assumes that the policy is fixed and only the model parameters need to be estimated. The algorithms are evaluated on a spoken dialogue system in the tourist information domain. The experiments show that model parameters estimated to maximize the expected cumulative reward result in significantly improved performance compared to the baseline hand-crafted model parameters. The algorithms are also compared to optimization techniques using plain gradients and state-of-the-art random search algorithms. In all cases, the algorithms based on the natural gradient work significantly better.
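
A minimal sketch of the natural-gradient step at the core of methods like NABC, using a toy two-action softmax policy rather than the paper's dialogue model and reward: the vanilla policy gradient is preconditioned by the inverse of an empirical Fisher matrix.

```python
import numpy as np

# Natural-gradient policy update on a toy softmax policy over two actions.
# The policy, reward, and batch size are invented for illustration; NABC
# additionally adapts dialogue-model parameters, which this sketch omits.
rng = np.random.default_rng(1)
theta = np.zeros(2)

def sample_episode(theta):
    p = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=p)
    score = -p
    score[a] += 1.0                      # grad of log pi(a) for a softmax
    reward = 1.0 if a == 0 else 0.2      # toy reward function
    return score, reward

for _ in range(200):
    scores, rewards = zip(*(sample_episode(theta) for _ in range(64)))
    S, r = np.array(scores), np.array(rewards)
    g = (S * r[:, None]).mean(axis=0)            # vanilla policy gradient
    F = (S.T @ S) / len(S) + 1e-3 * np.eye(2)    # empirical Fisher matrix
    theta += 0.5 * np.linalg.solve(F, g)         # natural gradient step

print("learned preference for action 0:", theta)
```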

53 citations


Journal ArticleDOI
TL;DR: The Hidden Information State model provides a principled way of ensuring tractability in a POMDP-based dialogue model; extending it with complements allows a more complex user goal to be represented and enables an effective pruning technique that preserves overall system performance within a limited computational resource more effectively than existing approaches.
Abstract: Effective dialogue management is critically dependent on the information that is encoded in the dialogue state. In order to deploy reinforcement learning for policy optimization, dialogue must be modeled as a Markov Decision Process. This requires that the dialogue state encode all relevant information obtained during the dialogue prior to that state. This can be achieved by combining the user goal, the dialogue history, and the last user action to form the dialogue state. In addition, to gain robustness to input errors, dialogue must be modeled as a Partially Observable Markov Decision Process (POMDP) and hence, a distribution over all possible states must be maintained at every dialogue turn. This poses a potential computational limitation since there can be a very large number of dialogue states. The Hidden Information State model provides a principled way of ensuring tractability in a POMDP-based dialogue model. The key feature of this model is the grouping of user goals into partitions that are dynamically built during the dialogue. In this article, we extend this model further to incorporate the notion of complements. This allows for a more complex user goal to be represented, and it enables an effective pruning technique to be implemented that preserves the overall system performance within a limited computational resource more effectively than existing approaches.
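
The key data structure, user-goal partitions built dynamically during the dialogue, can be sketched as below. The flat constraint dictionaries and the fixed belief split are invented for illustration; the paper's model derives beliefs from an observation model, and its complements extension is what licenses the "not value" partitions here.

```python
# Toy illustration of HIS-style goal partitioning: start with one partition
# covering all user goals and split it lazily as constraints are mentioned.
# Beliefs here use a made-up fixed split probability, not a real
# observation model; "complement" partitions hold all non-matching goals.

partitions = [({"any": True}, 1.0)]  # (constraints, belief) pairs

def split(partitions, slot, value, p_match=0.8):
    new = []
    for constraints, belief in partitions:
        if slot in constraints:
            new.append((constraints, belief))
            continue
        match = dict(constraints, **{slot: value})
        complement = dict(constraints, **{slot: f"not {value}"})
        new.append((match, belief * p_match))
        new.append((complement, belief * (1 - p_match)))
    return new

partitions = split(partitions, "food", "italian")
partitions = split(partitions, "area", "centre")
for constraints, belief in sorted(partitions, key=lambda x: -x[1]):
    print(round(belief, 3), constraints)
```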

51 citations


Journal ArticleDOI
TL;DR: Initial results of FLORA, an accessible computer program that uses speech recognition to provide an accurate measure of children's oral reading ability, are presented and compared to human scoring on 783 recordings of grade-level text passages read aloud by first- through fourth-grade students in classroom settings.
Abstract: We present initial results of FLORA, an accessible computer program that uses speech recognition to provide an accurate measure of children's oral reading ability. FLORA presents grade-level text passages to children, who read the passages out loud, and computes the number of words correct per minute (WCPM), a standard measure of oral reading fluency. We describe the main components of the FLORA program, including the system architecture and the speech recognition subsystems. We compare results of FLORA to human scoring on 783 recordings of grade-level text passages read aloud by first- through fourth-grade students in classroom settings. On average, FLORA WCPM scores were within 3 to 4 words of human scorers across students in different grade levels and schools.
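
The WCPM measure itself is easy to sketch: align the recognized word sequence against the reference passage, count matching words, and normalize by reading time. Here difflib stands in for the system's actual ASR-based alignment.

```python
# Sketch of the WCPM computation FLORA reports: words read correctly,
# normalized to a per-minute rate. difflib's sequence alignment is an
# illustrative stand-in for the system's own alignment of ASR output.
from difflib import SequenceMatcher

def wcpm(reference_words, hyp_words, seconds):
    matcher = SequenceMatcher(a=reference_words, b=hyp_words)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return 60.0 * correct / seconds

ref = "the quick brown fox jumps over the lazy dog".split()
hyp = "the quick brown fox jumped over the dog".split()
print(round(wcpm(ref, hyp, seconds=12.0), 1))  # words correct per minute
```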

40 citations


Journal ArticleDOI
TL;DR: This article proposes a hierarchical reinforcement learning approach that optimizes subbehaviors rather than full behaviors for spatially-aware dialogue systems, extending an existing reinforcement learning algorithm to support reusable policies and thereby perform fast learning.
Abstract: This article addresses the problem of scalable optimization for spatially-aware dialogue systems. These kinds of systems must perceive, reason, and act about the spatial environment where they are embedded. We formulate the problem in terms of Semi-Markov Decision Processes and propose a hierarchical reinforcement learning approach to optimize subbehaviors rather than full behaviors. Because of the vast number of policies that are required to control the interaction in a dynamic environment (e.g., a dialogue system assisting a user to navigate in a building from one location to another), our learning approach is based on two stages: (a) the first stage learns low-level behavior in advance; and (b) the second stage learns high-level behavior in real time. For this purpose we extend an existing algorithm from the reinforcement learning literature in order to support reusable policies and therefore to perform fast learning. We argue that our learning approach makes the problem feasible, and we report on a novel reinforcement learning dialogue system that performs a joint optimization between dialogue and spatial behaviors. Our experiments, using simulated and real environments, are based on a text-based dialogue system for indoor navigation. Experimental results in a realistic environment reported an overall user satisfaction of 89%, which suggests that our proposed approach is attractive for application in real interactions, as it combines fast learning with adaptive and reasonable behavior.
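
A toy sketch of the two-stage idea: low-level subbehaviors ("options") are fixed in advance, and a high-level policy over them is learned with SMDP Q-learning, discounting by the number of primitive steps each option consumes. The corridor world, options, and rewards are invented; this is not the paper's navigation domain or its extended algorithm.

```python
import random

# SMDP Q-learning over fixed low-level options in a 10-cell corridor.
# Everything here (world, options, rewards) is illustrative only.
random.seed(0)
GOAL, GAMMA, ALPHA, OPTIONS = 9, 0.95, 0.1, ("left3", "right3")

def run_option(state, option):
    """Execute an option for up to 3 steps; return (s', steps, reward)."""
    step = 1 if option == "right3" else -1
    for k in range(1, 4):
        state = min(max(state + step, 0), GOAL)
        if state == GOAL:
            return state, k, 10.0
    return state, 3, -1.0

Q = {(s, o): 0.0 for s in range(GOAL + 1) for o in OPTIONS}
for _ in range(2000):
    s = random.randrange(GOAL)
    while s != GOAL:
        if random.random() < 0.2:                      # epsilon-greedy
            o = random.choice(OPTIONS)
        else:
            o = max(OPTIONS, key=lambda x: Q[s, x])
        s2, k, r = run_option(s, o)
        best_next = 0.0 if s2 == GOAL else max(Q[s2, o2] for o2 in OPTIONS)
        # SMDP update: discount by gamma ** k, the option's duration.
        Q[s, o] += ALPHA * (r + GAMMA ** k * best_next - Q[s, o])
        s = s2

print({s: max(OPTIONS, key=lambda o: Q[s, o]) for s in range(GOAL)})
```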

26 citations


Journal ArticleDOI
TL;DR: The final age/gender detection system evaluated using a six-hour child abuse (CA) test set achieved promising results given the extremely difficult conditions of this type of video material.
Abstract: This article presents a description of the INESC-ID Age and Gender classification systems, which were developed to aid the detection of child abuse material within the scope of the European project I-DASH. The Age and Gender classification systems are composed of the fusion of four and six individual subsystems, respectively, trained with short- and long-term acoustic and prosodic features and different classification strategies (Gaussian Mixture Model-Universal Background Model (GMM-UBM), Multi-Layer Perceptrons (MLP), and Support Vector Machines (SVM)), trained over five different speech corpora. The best results, obtained by the calibration and linear logistic regression fusion back-end, show an absolute improvement of 2% in unweighted accuracy for Age and 1% for Gender when compared to the best individual front-end systems on the development set. The final age/gender detection system evaluated using a six-hour child abuse (CA) test set achieved promising results given the extremely difficult conditions of this type of video material. In order to further improve the performance in the CA domain, the classification modules were adapted using unsupervised selection of training data. An automatic data selection algorithm using frame-level posterior probabilities was developed. Performance improvement after adapting the classification modules was around 10% relative when compared with the baseline classifiers.
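
A linear-logistic-regression fusion back-end of the kind described can be sketched in a few lines: each column is one subsystem's score for an utterance, and a logistic regression over those scores produces the fused decision. The scores below are synthetic stand-ins for real GMM-UBM/MLP/SVM outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Score-level fusion via logistic regression. The three "subsystem" score
# columns are synthetic; in the paper they would come from the individual
# age/gender classifiers, and calibration would precede fusion.
rng = np.random.default_rng(2)
n = 500
labels = rng.integers(0, 2, n)                          # e.g. a binary gender label
scores = labels[:, None] + rng.normal(0, 1.5, (n, 3))   # 3 noisy subsystems

fusion = LogisticRegression().fit(scores, labels)
print("fused accuracy:", fusion.score(scores, labels))
print("per-subsystem weights:", fusion.coef_)
```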

23 citations


Journal ArticleDOI
TL;DR: The generalized models perform better at predicting fluency and comprehension posttest scores of 55 children ages 7-10, with adjusted R² of 0.6, and could help teachers identify which students are making adequate progress.
Abstract: We compare two types of models to assess the prosody of children's oral reading. Template models measure how well the child's prosodic contour in reading a given sentence correlates in pitch, intensity, pauses, or word reading times with an adult narration of the same sentence. We evaluate template models directly against a common rubric used to assess fluency by hand, and indirectly by their ability to predict fluency and comprehension test scores and gains of 10 children who used Project LISTEN's Reading Tutor; the template models outpredict the human assessment. We also use the same set of adult narrations to train generalized models for mapping text to prosody, and use them to evaluate children's prosody. Using only durational features for both types of models, the generalized models perform better at predicting fluency and comprehension posttest scores of 55 children ages 7-10, with adjusted R² of 0.6. Such models could help teachers identify which students are making adequate progress. The generalized models have the additional advantage of not requiring an adult narration of every sentence.
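
The template-model idea reduces to correlating the child's prosodic contour with an adult narration of the same sentence. Below is a minimal sketch using word durations only; the duration values are invented, and a real system would extract them from forced alignments.

```python
import numpy as np

# Template-model sketch: correlate a child's per-word reading times for a
# sentence with an adult narration of the same sentence. Durations are
# invented stand-ins for forced-alignment output.
adult_ms = np.array([310, 120, 450, 200, 380, 520])   # adult word durations
child_ms = np.array([400, 150, 600, 260, 500, 900])   # child word durations

corr = np.corrcoef(adult_ms, child_ms)[0, 1]
print(f"duration correlation with adult template: {corr:.2f}")
```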

17 citations


Journal ArticleDOI
TL;DR: The FAU Aibo Emotion Corpus is used which contains emotionally colored spontaneous children's speech recorded in a child-robot interaction scenario and the Tandem model prevails over a triphone-based Hidden Markov Model approach.
Abstract: In this article, we focus on keyword detection in children's speech as it is needed in voice command systems. We use the FAU Aibo Emotion Corpus which contains emotionally colored spontaneous children's speech recorded in a child-robot interaction scenario and investigate various recent keyword spotting techniques. As the principle of bidirectional Long Short-Term Memory (BLSTM) is known to be well-suited for context-sensitive phoneme prediction, we incorporate a BLSTM network into a Tandem model for flexible coarticulation modeling in children's speech. Our experiments reveal that the Tandem model prevails over a triphone-based Hidden Markov Model approach.
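
The BLSTM front end of a Tandem model can be sketched as a bidirectional LSTM mapping acoustic frames to per-frame phoneme posteriors, which would then feed the HMM stage. The dimensions, feature type, and phoneme inventory below are illustrative assumptions, and the sketch uses PyTorch rather than the authors' toolkit.

```python
import torch
import torch.nn as nn

# Minimal BLSTM phoneme-posterior network of the kind used as the neural
# half of a Tandem model. All sizes here are assumptions for illustration.
class BLSTMPhonemeNet(nn.Module):
    def __init__(self, n_feats=39, n_hidden=128, n_phonemes=40):
        super().__init__()
        self.blstm = nn.LSTM(n_feats, n_hidden, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_phonemes)

    def forward(self, x):                    # x: (batch, frames, n_feats)
        h, _ = self.blstm(x)
        return self.out(h).log_softmax(-1)   # per-frame phoneme posteriors

net = BLSTMPhonemeNet()
frames = torch.randn(1, 200, 39)             # 200 frames of MFCC-like features
print(net(frames).shape)                      # torch.Size([1, 200, 40])
```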

15 citations


Journal ArticleDOI
TL;DR: A user model for user simulation and a system state representation in spoken decision support dialogue systems that accommodate the user's evolving knowledge and preferences are presented.
Abstract: This article presents a user model for user simulation and a system state representation in spoken decision support dialogue systems. When selecting from a group of alternatives, users apply different decision-making criteria with different priorities. At the beginning of the dialogue, however, users often do not have a definite goal or criteria in which they place value; thus, they can learn about new features while interacting with the system and accordingly create new criteria. In this article, we present a user model and dialogue state representation that accommodate these patterns by considering the user's knowledge and preferences. To estimate the parameters used in the user model, we implemented a trial sightseeing guidance system, collected dialogue data, and trained a user simulator. Since the user parameters are not observable by the system, the dialogue is modeled as a partially observable Markov decision process (POMDP), and a dialogue state representation was introduced based on the model. We then optimized its dialogue strategy so that users can make better choices. The dialogue strategy is evaluated using a user simulator trained from a large number of dialogues collected using a trial dialogue system.
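
Tracking unobservable user parameters as a belief, as in the POMDP formulation above, can be sketched with a simple Bayesian re-weighting of preference hypotheses. The hypotheses and likelihoods below are invented for illustration, not the paper's trained user model.

```python
# POMDP-style belief tracking over hidden user preferences: re-weight each
# preference hypothesis by the likelihood of the observed user act.
# Hypotheses and likelihood values are illustrative assumptions.
hypotheses = {"cares_price": 0.5, "cares_distance": 0.5}

def update(belief, observation, likelihood):
    post = {h: b * likelihood[h][observation] for h, b in belief.items()}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

likelihood = {
    "cares_price":    {"asked_price": 0.8, "asked_distance": 0.2},
    "cares_distance": {"asked_price": 0.3, "asked_distance": 0.7},
}
belief = update(hypotheses, "asked_price", likelihood)
print(belief)  # belief shifts toward the price-sensitive hypothesis
```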

Journal ArticleDOI
TL;DR: A novel system to automatically diagnose reading disorders is presented, based on a speech recognition engine with a module for prosodic analysis that identifies 98.3% of the 120 children correctly.
Abstract: We present a novel system to automatically diagnose reading disorders. The system is based on a speech recognition engine with a module for prosodic analysis. The reading disorder test is based on eight different subtests. In each of the subtests, the system achieves a recognition accuracy of at least 95%. As in the perceptual version of the test, the results of the subtests are then joined into a final test result to determine whether the child has a reading disorder. In the final classification stage, the system identifies 98.3% of the 120 children correctly. In the future, our system will facilitate the clinical evaluation of reading disorders.

Journal ArticleDOI
TL;DR: The finding suggests that the authors do not always need to construct a realistic user simulation, and can save engineering cost by wisely choosing simulation models that are appropriate for their task.
Abstract: Recent studies show that user simulations can be used to generate training corpora for learning dialogue strategies automatically. However, it is unclear what type of simulation is most suitable in a particular task setting. We observe that a simulation which generates random behaviors in a restricted way outperforms simulations that mimic human user behaviors in a statistical way. Our finding suggests that we do not always need to construct a realistic user simulation. Since constructing realistic user simulations is not a trivial task, we can save engineering cost by wisely choosing simulation models that are appropriate for our task.

Journal ArticleDOI
TL;DR: A recent technique, ℓ1-regularized logistic regression, is applied to learn dialogue classifiers using a rich feature set and fewer data points than features; one such classifier characterizes differences in the behavior of children when they choose the story they read.
Abstract: The richness of multimodal dialogue makes the space of possible features required to describe it very large relative to the amount of training data. However, conventional classifier learners require large amounts of data to avoid overfitting, or do not generalize well to unseen examples. To learn dialogue classifiers using a rich feature set and fewer data points than features, we apply a recent technique, ℓ1-regularized logistic regression. We demonstrate this approach empirically on real data from Project LISTEN's Reading Tutor, which displays a story on a computer screen and listens to a child read aloud. We train a classifier to predict task completion (i.e., whether the student will finish reading the story) with 71% accuracy on a balanced, unseen test set. To characterize differences in the behavior of children when they choose the story they read, we likewise train and test a classifier that with 73.6% accuracy infers who chose the story based on the ensuing dialogue. Both classifiers significantly outperform baselines and reveal relevant features of the dialogue.
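
The core technique is standard enough to sketch directly: ℓ1 regularization drives most feature weights to exactly zero, which is what makes a rich dialogue feature set usable with fewer data points than features. The data below is synthetic, not the Reading Tutor corpus.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# L1-regularized logistic regression with more features than examples.
# The synthetic labels depend on only 2 of 300 features; the L1 penalty
# should zero out most of the remaining weights.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 300))            # 60 dialogues, 300 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only 2 features are relevant

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
n_used = np.count_nonzero(clf.coef_)
print(f"nonzero weights: {n_used} of {clf.coef_.size}")
```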

Journal ArticleDOI
TL;DR: The articles collected together in this special issue represent important themes in these research areas and illustrate current research directions in the field, and it is hoped that they will form a valuable resource for the international community of dialogue system researchers, both in industry and academia.
Abstract: In the 1960s, AI researchers were predicting that machines capable of spoken dialogue behavior, somewhat like the preceding example, would be possible within a few decades. Fifty years later, we are still confronted with very difficult problems in AI, but we have powerful new tools with which to address the spoken dialogue problem. The research landscape in spoken dialogue systems has undergone significant changes over the past decade. This transformation has been the result of new momentum and fresh insights coming from the investigation of data-driven, statistical machine learning methods in three core areas of dialogue system research: spoken language understanding, dialogue management, and natural language generation. These methods hold the promise of mathematically precise approaches to system design, optimization, and evaluation, based on data collected from real user interactions with dialogue systems. The articles collected together in this special issue represent important themes in these research areas and illustrate current research directions in the field. As such, we hope that they will form a valuable resource for the international community of dialogue system researchers, both in industry and academia. Speech and language processing techniques have now achieved such a level of maturity that voice-enabled user interfaces are widely deployed and have created a billion dollar industry. However, the design and development of these interfaces is far from a simple and standardized process. Indeed, it is not enough to simply plug together speech recognition and synthesis systems: recognized speech must be understood in the context of the application, the overall interaction must be appropriately managed, and spoken

Journal ArticleDOI
TL;DR: This research automatically verifies preliterate children's pronunciations of English letter-names and the sounds each letter represents, and discusses the various confounding factors for this assessment task that impact automatic verification performance.
Abstract: Automatic literacy assessment is an area of research that has shown significant progress in recent years. Technology can be used to automatically administer reading tasks and analyze and interpret children's reading skills. It has the potential to transform the classroom dynamic by providing useful information to teachers in a repeatable, consistent, and affordable way. While most previous research has focused on automatically assessing children reading words and sentences, assessment of children's earlier foundational skills is needed. We address this problem in this research by automatically verifying preliterate children's pronunciations of English letter-names and the sounds each letter represents (“letter-sounds”). The children analyzed in this study were from a diverse bilingual background and were recorded in actual kindergarten to second grade classrooms. We first manually verified (accept/reject) the letter-name and letter-sound utterances, which serve as the ground truth in this study. Next, we investigated four automatic verification methods that were based on automatic speech recognition techniques. We attained percent agreement with human evaluations of 90% and 85% for the letter-name and letter-sound tasks, respectively. Humans agree between themselves an average of 95% of the time for both tasks. We discuss the various confounding factors for this assessment task, such as background noise and the presence of disfluencies, that impact automatic verification performance.

Journal ArticleDOI
TL;DR: Seven articles included in this special issue of ACM Transactions on Speech and Language Processing cover diverse research areas related to children’s speech, including improving the state-of-the-art in core technology: word-spotting in a child-robot interaction scenario and age/gender classification.
Abstract: The rapid advancement of speech recognition and spoken dialogue technologies has enabled the use of voice in numerous interactive applications today. Although children represent an important user segment for speech processing technologies, the majority of the research effort so far has focused on adult users. This is due to a variety of reasons, including a lack of appropriate speech data from children across the developmental trajectory, challenges associated with conducting experiments with children, and fundamental misconceptions about child-computer interaction. From the very early experiments at AT&T Bell Labs, it was clear that children's speech posed a challenge to speech recognizers designed for adult voices. The acoustic characteristics of children's speech vary widely with age, resulting in spectral and temporal patterns that differ from those of adults. In addition, variability in pronunciation is greater in children's speech, further hampering the modeling of speech and spoken language patterns. In the past two decades, the performance gap between automatic speech recognition of children's and adults' speech has been narrowed through the use of a variety of feature extraction, acoustic modeling, and model adaptation techniques. These advances have made it possible to design and build complex spoken dialogue systems and language training software for children. However, a number of research challenges remain. These challenges are especially relevant for processing spontaneous speech and for handling speech of younger children, especially the little-researched preschool population. The behavior of children interacting with computers is different from that of adults. For children, playing and learning are intertwined activities (especially at a young age). Children easily assume roles (play-acting) and often adopt a more exploratory behavior when faced with a task. As a result, child-computer interaction patterns are often different from those of adults; interaction patterns and skills also vary significantly with age. For example, when using a conversational interface, children have a different language strategy for initiating and guiding conversational exchanges and may adopt different linguistic registers than adults. In order to develop reliable voice-interactive systems, further studies are needed to better understand the characteristics of children's speech and the different aspects of speech-based interaction, including the role of speech in multimodal interfaces. The development of prototype systems for a broad range of applications is also important to provide experimental evidence of the degree of progress in speech technologies and to help focus research on application-specific problems that emerge when these systems are used in realistic operating environments. This effort is necessarily an interdisciplinary one where researchers from diverse fields such as acoustics, natural language, artificial intelligence, robotics, human-computer interaction, multimedia systems, psychology, and education come together to produce high-quality research and technological innovation. The seven articles included in this special issue of ACM Transactions on Speech and Language Processing cover diverse research areas related to children's speech. The first two articles focus on improving the state-of-the-art in core technology: word-spotting in a child-robot interaction scenario and age/gender classification. The rest of the articles