
Showing papers by "Kazuya Takeda published in 2010"


Proceedings Article
01 May 2010
TL;DR: The results show that the proposed method achieved good classification performance; the classification accuracy was 94.7% when classifying dialogs into those with and without task completion.
Abstract: In this paper, we propose a method for estimating user satisfaction with a spoken dialog system using an N-gram-based dialog history model. We have collected a large amount of spoken dialog data, accompanied by usability evaluation scores given by users in real environments. The database was collected in a field test in which naive users operated a client-server music retrieval system with a spoken dialog interface on their own PCs. An N-gram model is trained on sequences of user and/or system dialog acts for each of six user satisfaction levels: 1 to 5 and φ (task not completed). The satisfaction level is then estimated from the N-gram likelihood. Experiments conducted on this large set of real data show that the proposed method achieves good classification performance; the classification accuracy was 94.7% when classifying dialogs into those with and without task completion. Even when the classifier detected all of the task-incompleted dialogs correctly, the proposed method achieved a false detection rate of only 6%.
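A minimal sketch of the likelihood-based classification idea described in the abstract, assuming dialog acts arrive as token sequences; the function names, the bigram order, and the add-alpha smoothing are illustrative choices, not the paper's exact model:

```python
from collections import Counter
import math

def train_bigram(sequences):
    """Count bigram and context statistics over the dialog-act sequences
    belonging to one satisfaction level."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    return bigrams, contexts

def log_likelihood(model, seq, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-likelihood of one dialog-act sequence."""
    bigrams, contexts = model
    padded = ["<s>"] + list(seq) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + alpha) / (contexts[a] + alpha * vocab_size))
        for a, b in zip(padded[:-1], padded[1:])
    )

def estimate_satisfaction(models, seq, vocab_size):
    """Pick the level (1-5 or 'phi') whose model assigns the highest likelihood."""
    return max(models, key=lambda level: log_likelihood(models[level], seq, vocab_size))
```

The per-level models would be built as `{level: train_bigram(dialogs[level]) for level in dialogs}` over the six satisfaction levels.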

41 citations


01 Jan 2010
TL;DR: An audio-visual speech corpus, CENSREC-1-AV, for noisy speech recognition is introduced; its baseline achieved roughly a 35% relative error reduction in low-SNR conditions compared with an audio-only ASR method.
Abstract: In this paper, an audio-visual speech corpus, CENSREC-1-AV, for noisy speech recognition is introduced. CENSREC-1-AV consists of an audio-visual database and a baseline bimodal speech recognition system that uses both audio and visual information. The database contains 3,234 training utterances by 42 speakers and 1,963 test utterances by 51 speakers. Each utterance consists of a speech signal together with color and infrared images of the speaker's mouth region. The baseline system, whose multi-stream HMMs are trained on the training data, is provided so that users can evaluate their own bimodal speech recognizers against it. A preliminary experiment evaluated the baseline on acoustically noisy test data; the results show that roughly a 35% relative error reduction was achieved in low-SNR conditions compared with an audio-only ASR method.
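At the core of a multi-stream HMM is a stream-weighted combination of per-stream log-likelihoods; a minimal sketch of that combination, where the weight value and the per-frame scores are placeholders rather than the corpus baseline's settings:

```python
def multistream_score(audio_loglik, visual_loglik, audio_weight=0.7):
    """Multi-stream HMM state score for one frame: a weighted sum of the
    audio-stream and visual-stream log-likelihoods. Lowering audio_weight
    shifts trust toward the visual stream, which is what pays off at low SNR."""
    return audio_weight * audio_loglik + (1.0 - audio_weight) * visual_loglik
```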

31 citations


Proceedings ArticleDOI
21 Jun 2010
TL;DR: A browsing and retrieval system for driving data is developed that provides a multi-modal data browser, query- and similarity-based retrieval functions, and a fast browsing function that skips redundant scenes, including scenes that an image-based method could not skip.
Abstract: With the increased presence and recent advances of drive recorders, rich driving data that include video, vehicle acceleration signals, driver speech, GPS data, and several sensor signals can be continuously recorded and stored. These advances enable researchers to study driving behavior more extensively for traffic safety. However, the increasing variety and amount of driving data complicate browsing various data simultaneously and finding desired data in large databases. In this study, we develop a browsing and retrieval system for driving data that provides a multi-modal data browser, query- and similarity-based retrieval functions, and a fast browsing function that skips redundant scenes. So that data can be shared among several users, the system can be accessed over a network from PCs or smartphones. As its retrieval algorithm, the system uses time-series active search, which has been used successfully for fast search of audio and video data. Within a few seconds, the system can retrieve driving scenes similar to an input scene from among 80,000 scenes. Retrieval performance was compared across retrieval conditions by changing the codebook size of the vector quantization for the histogram features and the combination of driving signals. Experimental results showed that retrieval performance above 97% was achieved for left/right turns and curves using a combination of complementary information such as steering angle and lateral acceleration. We also compared the proposed method with a conventional image-based retrieval method using subjective similarity scores of driving scenes. The proposed system retrieved similar scenes with about 75% retrieval performance, five points higher than the image-based method, because the image-based method is sensitive to image changes outside the region of interest for driving data retrieval. The fast browsing function also skipped scenes that the image-based method could not skip.
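A sketch of the histogram-based matching underlying the retrieval function, with vector quantization of driving signals and histogram intersection as the similarity; the adaptive skip width that makes time-series active search fast is omitted for brevity, and all names are illustrative:

```python
import numpy as np

def vq_histogram(frames, codebook):
    """Quantize driving-signal frames (e.g., steering angle and lateral
    acceleration) to their nearest codewords and return a normalized
    histogram of codeword counts."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    codes = dists.argmin(axis=1)
    hist = np.bincount(codes, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def find_similar_scenes(query_hist, window_hists, threshold=0.8):
    """Score every candidate window by histogram intersection and keep those
    above threshold. Time-series active search additionally derives a safe
    number of windows to skip from the similarity deficit; a full scan is
    shown here instead."""
    sims = np.minimum(window_hists, query_hist).sum(axis=1)
    return [(i, s) for i, s in enumerate(sims) if s >= threshold]
```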

10 citations


Proceedings Article
26 Sep 2010
TL;DR: A method of detecting task-incompleted users for a spoken dialog system using an N-gram-based dialog history model is proposed; experimental results show that the method achieves good classification performance.
Abstract: In this paper, we propose a method of detecting task-incompleted users of a spoken dialog system using an N-gram-based dialog history model. We collected a large amount of spoken dialog data, accompanied by usability evaluation scores given by users in real environments. The database was built in a field test in which naive users operated a client-server music retrieval system with a spoken dialog interface on their own PCs. An N-gram model was trained on sequences of user and/or system dialog acts for two dialog classes: dialogs that completed the music retrieval task and those that did not. The system then detects unknown dialogs that did not complete the task based on the N-gram likelihood. Experiments were conducted on this large set of real data, and the results show that our proposed method achieves good classification performance. When the classifier correctly detected all of the task-incompleted dialogs, the proposed method achieved a false detection rate of 6%.
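Reusing `log_likelihood` from the satisfaction-estimation sketch above, detection reduces to a likelihood comparison between the two class models; the `bias` term is an illustrative knob for moving along the recall/false-detection trade-off reported above:

```python
def is_task_incompleted(seq, completed_model, incompleted_model, vocab_size, bias=0.0):
    """Flag a dialog as task-incompleted when the incompleted-class N-gram
    model scores higher than the completed-class model; increasing bias
    catches more incompleted dialogs at the cost of more false detections."""
    return (log_likelihood(incompleted_model, seq, vocab_size)
            - log_likelihood(completed_model, seq, vocab_size) + bias) > 0.0
```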

4 citations



Proceedings ArticleDOI
14 Mar 2010
TL;DR: In the proposed method, in order to solve the permutation problem in FD-ICA through clustering of acoustic transfer functions, amplitude and phase differences are optimally combined as a function of frequency.
Abstract: A sound source separation method based on frequency-domain independent component analysis (FD-ICA) is proposed. The method fully exploits a dodecahedral microphone array (DHMA), which has several merits: 1) the array is very small and thus easy to handle; 2) the amplitude difference among microphones on different surfaces is large; and 3) it is less affected by spatial aliasing in the higher frequency region. In the proposed method, to solve the permutation problem in FD-ICA through clustering of acoustic transfer functions, amplitude and phase differences are optimally combined as a function of frequency. A DHMA 8 cm in diameter with 60 microphones is used in the experiment, where up to twelve sound sources (speech and musical instruments) are separated by the proposed algorithm. The proposed method attains a 24 dB improvement in signal-to-interference ratio (SIR) in the twelve-source case, up to 10 dB better than the conventional method, confirming its effectiveness.
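A sketch of the frequency-dependent feature combination used to cluster acoustic transfer functions for permutation alignment; the linear weighting between the amplitude and phase terms is an illustrative choice standing in for the paper's optimal combination:

```python
import numpy as np

def permutation_feature(mixing_col, freq_hz, fmax_hz, ref_mic=0):
    """Feature vector for one separated source at one frequency bin, built
    from the estimated acoustic transfer function (a column of the inverse
    of the FD-ICA demixing matrix). Phase differences are reliable at low
    frequencies; amplitude differences across the dodecahedral array take
    over at high frequencies, where spatial aliasing corrupts phase."""
    h = mixing_col / mixing_col[ref_mic]      # normalize to the reference mic
    amp = np.log(np.abs(h) + 1e-12)           # inter-mic amplitude differences
    phase = np.angle(h) / max(freq_hz, 1.0)   # frequency-normalized phase
    w = freq_hz / fmax_hz                     # 0 at DC -> 1 at fmax
    return np.concatenate([w * amp, (1.0 - w) * phase])
```

Clustering these features over all bins (e.g., k-means with one cluster per source) then yields a consistent source ordering across frequencies.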

3 citations


Journal ArticleDOI
TL;DR: This letter proposes a dimensionality reduction method that minimizes the maximum classification error, along with two interpolated methods that bridge the average- and maximum-error criteria.
Abstract: Acoustic feature transformation is widely used to reduce dimensionality and improve speech recognition performance. In this letter we focus on dimensionality reduction methods that minimize the average classification error. Unfortunately, minimizing the average classification error may cause considerable overlap between the distributions of some classes. To mitigate this risk, we propose a dimensionality reduction method that minimizes the maximum classification error. We also propose two interpolated methods that span the average- and maximum-classification-error criteria. Experimental results show that the proposed methods improve speech recognition performance.
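A sketch of an interpolated criterion of this kind, with the maximum replaced by a logsumexp smooth maximum so the objective stays differentiable for gradient-based optimization of the projection; the interpolation form and the pairwise error estimates are illustrative, not the letter's exact formulation:

```python
import numpy as np

def interpolated_error(pairwise_errors, lam, beta=50.0):
    """Objective interpolating between the average and the (smooth) maximum
    of pairwise classification-error estimates: lam=0 gives the average-error
    criterion, lam=1 the minimum-maximum-error criterion that suppresses the
    worst class overlap."""
    e = np.asarray(pairwise_errors, dtype=float)
    smooth_max = np.log(np.exp(beta * (e - e.max())).sum()) / beta + e.max()
    return (1.0 - lam) * e.mean() + lam * smooth_max
```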

2 citations



Journal ArticleDOI
TL;DR: This letter investigates the effectiveness of acoustic feature transformation and discriminative training of HMMs, individually and in combination, as well as their robustness under matched and mismatched noise conditions between training and evaluation environments.
Abstract: To improve speech recognition performance, acoustic feature transformation based on discriminant analysis has been widely used, as has discriminative training of HMMs. In this letter we investigate the effectiveness of these two techniques and of their combination. We also investigate their robustness under matched and mismatched noise conditions between the training and evaluation environments.

1 citation


01 Dec 2010
TL;DR: Experimental results showed that the proposed method needs a significantly smaller amount of the target speaker's utterances than conventional MLLR, MAP, and SAT.
Abstract: We propose a technique for generating a large amount of target-speaker-like speech features by converting prepared speech features from many speakers into features similar to those of the target speaker using a transformation matrix. To generate a large amount of target-speaker-like features, the system needs only a very small amount of the target speaker's utterances, which enables it to adapt the acoustic model efficiently. To evaluate the proposed method, we prepared 100 reference speakers and 12 target (test) speakers. We conducted experiments on an isolated word recognition task using a speech database collected in real PC-based distributed environments, and compared our proposed method with MLLR, MAP, and a method theoretically equivalent to SAT. Experimental results showed that the proposed method needs a significantly smaller amount of the target speaker's utterances than conventional MLLR, MAP, and SAT.
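A minimal sketch of the feature-conversion step, assuming the small target sample has already been frame-aligned (e.g., by DTW on matching words) with each reference speaker's data; the least-squares estimator is an illustrative stand-in for the paper's transformation matrix:

```python
import numpy as np

def estimate_transform(ref_frames, tgt_frames):
    """Affine transform mapping a reference speaker's feature frames onto
    the target speaker's, estimated by least squares from a small amount
    of aligned target data."""
    X = np.hstack([ref_frames, np.ones((len(ref_frames), 1))])  # append bias term
    W, *_ = np.linalg.lstsq(X, tgt_frames, rcond=None)
    return W

def convert(frames, W):
    """Turn any amount of a reference speaker's features into
    target-speaker-like features for acoustic model adaptation."""
    X = np.hstack([frames, np.ones((len(frames), 1))])
    return X @ W
```

The converted features from all reference speakers then serve as a large adaptation corpus for the acoustic model.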

01 Jan 2010
TL;DR: In this paper, a new method of visualizing the characteristics of head-related transfer functions (HRTFs) is proposed, which can illustrate the HRTF and other factors, such as reverberation, separately.
Abstract: A new method of visualizing the characteristics of the head-related transfer function (HRTF) is proposed. The proposed visualization method can illustrate the HRTFs and other factors, such as reverberation, separately. The HRTF is the acoustic transfer function between a sound source and the ear canal entrance, defined as a function of time and of the direction of the sound source. Since the HRTF depends on the sound source direction and on the subject, HRTFs are usually measured with a dummy head or a human subject. Measured HRTFs are generally visualized in a figure whose axes correspond to the angle of the sound source and the temporal frequency. Such a figure can illustrate the differences in HRTFs among sound source directions, and most previous work employed frequency analysis in the time domain, emphasizing the temporal variation. In this paper, the measured HRTFs are instead analyzed with spatio-temporal frequency analysis, which is used to examine the efficiency of the proposed visualization method. Spatio-temporal frequency analysis visualizes and analyzes the characteristics of HRTFs using the spectrum calculated by a two-dimensional Fourier transform in time and space. In our experiments, the theoretical properties of the spatio-temporal frequency characteristics are investigated, and the influence of reverberation in the measurement environment is also examined. Moreover, a dereverberation method is proposed. The results show that the characteristics of HRTFs are mostly concentrated in a specific frequency band and that the proposed visualization method is effective for illustrating the HRTF and other factors such as reverberation and reflected waves. The dereverberation method decreases the average reverberation time from 382.4 ms to 316.0 ms in a reverberant condition.
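A sketch of the spatio-temporal frequency analysis, assuming head-related impulse responses measured at equally spaced source directions; the array shapes and names are illustrative:

```python
import numpy as np

def spatio_temporal_spectrum(hrirs, fs, angle_step_deg):
    """Two-dimensional Fourier transform of HRIRs over source direction and
    time. hrirs has shape (n_directions, n_samples), with directions spaced
    by angle_step_deg around the listener. Returns the magnitude spectrum
    plus its spatial- and temporal-frequency axes."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(hrirs)))
    spatial_freqs = np.fft.fftshift(
        np.fft.fftfreq(hrirs.shape[0], d=np.deg2rad(angle_step_deg)))
    temporal_freqs = np.fft.fftshift(np.fft.fftfreq(hrirs.shape[1], d=1.0 / fs))
    return spec, spatial_freqs, temporal_freqs
```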

Proceedings ArticleDOI
14 Mar 2010
TL;DR: The results show that the user's cognition of a task is reflected in the grasp form and the possible size of object movement.
Abstract: We study the effect of cognitive states, i.e., feelings about tasks, on grasping behavior, in order to estimate users' feelings from their motion. Since people solve the inverse kinematics problem of grasping based on their cognition of the task, the way they grasp an object reflects their cognitive state. We analyze how a cup is grasped depending on whether the user is stressed. Physical properties of grasping, namely the volume and entropy of Grasp Jacobian ellipsoids, are analyzed. The volume of the Grasp Jacobian ellipsoid, which indicates the possible size of object movement, shrank after the grasp motion was learned, and the volumes differed significantly between the relaxed and stressed cognitive conditions. These results show that the user's cognition of a task is reflected in the grasp form and the possible size of object movement.
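A sketch of the two ellipsoid measures used in this kind of analysis, computed from the singular values of the Grasp Jacobian; the exact definitions in the paper may differ:

```python
import numpy as np

def grasp_ellipsoid_volume(J):
    """Volume of the Grasp Jacobian ellipsoid, proportional to the product
    of the singular values of J: a smaller volume means a smaller possible
    object movement for the same joint motion."""
    return float(np.prod(np.linalg.svd(J, compute_uv=False)))

def grasp_ellipsoid_entropy(J):
    """Entropy of the normalized singular-value distribution, measuring how
    isotropic the ellipsoid is (illustrative definition)."""
    s = np.linalg.svd(J, compute_uv=False)
    p = s / s.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))
```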