
Showing papers by "Kazuya Takeda" published in 2004


Proceedings ArticleDOI
27 Jun 2004
TL;DR: Experimental results show that the dynamic features of the pedal pressures, modeled with Gaussian mixture models, significantly improve the performance of driver identification.
Abstract: We investigate the uniqueness of driver behavior in vehicles and the possibility of using it for personal identification, with the objectives of achieving safer driving, assisting the driver in case of emergencies, and serving as part of a multi-mode biometric signature for driver identification. We use Gaussian mixture models (GMM) to model the individual characteristics of accelerator and brake pedal pressures, focusing not only on the static features but also on the dynamics of the pedal pressures. Experimental results show that the dynamic features significantly improve the performance of driver identification.

66 citations
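
The static-plus-dynamic modeling above maps directly onto standard GMM tooling. Below is a minimal sketch (not the authors' code), assuming pedal pressure arrives as a 1-D sampled signal; the delta window, mixture count, and data are illustrative.

```python
# Sketch: GMM driver identification from pedal-pressure features augmented
# with delta (dynamic) features, as the abstract describes. Illustrative only.
import numpy as np
from sklearn.mixture import GaussianMixture

def delta(x, w=2):
    """Regression-based delta features over a +/- w sample window."""
    pad = np.pad(x, (w, w), mode='edge')
    num = sum(k * (pad[w + k:len(x) + w + k] - pad[w - k:len(x) + w - k])
              for k in range(1, w + 1))
    return num / (2 * sum(k * k for k in range(1, w + 1)))

def features(pedal):
    """Stack the static pedal-pressure value with its delta (dynamics)."""
    return np.column_stack([pedal, delta(pedal)])

# Train one GMM per driver (dummy signals stand in for real pedal data).
rng = np.random.default_rng(0)
train = {d: rng.random(1000) for d in ('driver_a', 'driver_b')}
models = {d: GaussianMixture(n_components=8, covariance_type='diag').fit(features(x))
          for d, x in train.items()}

# Identify: pick the driver whose model gives the highest average log-likelihood.
test = rng.random(500)
scores = {d: m.score(features(test)) for d, m in models.items()}
print(max(scores, key=scores.get))
```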


Proceedings ArticleDOI
04 Oct 2004
Abstract: ICSLP2004: the 8th International Conference on Spoken Language Processing, October 4-8, 2004, Jeju Island, Korea.

55 citations


Journal ArticleDOI
01 Feb 2004
TL;DR: Preliminary results on intermedia signal conversion are described as an example of corpus-based in-car speech signal processing research.
Abstract: An ongoing project for constructing a multimedia corpus of dialogues under driving conditions is reported. More than 500 subjects have been enrolled in this corpus development, and more than 2 gigabytes of signals have been collected during approximately 60 minutes of driving per subject. Twelve microphones and three video cameras are installed in a car to obtain audio and video data. In addition, five signals regarding car control and the location of the car provided by the Global Positioning System (GPS) are recorded. All signals are simultaneously recorded directly onto the hard disks of the PCs on board the specially designed data collection vehicle (DCV). The in-car dialogues are initiated by a human operator, an automatic speech recognition (ASR) system, and a Wizard of OZ (WOZ) system so as to collect as many speech disfluencies as possible. In addition to the details of data collection, preliminary results on intermedia signal conversion are described as an example of corpus-based in-car speech signal processing research.

17 citations


Journal ArticleDOI
TL;DR: A one-pass decoding algorithm is modified to decode input speech of unbounded length so that, with appropriate non-speech models for silence and ambient noise, continuous speech recognition can be performed without explicit end-point detection.
Abstract: A new continuous speech recognition method that does not need explicit speech end-point detection is proposed. A one-pass decoding algorithm is modified to decode input speech of unbounded length so that, with appropriate non-speech models for silence and ambient noise, continuous speech recognition can be performed without explicit end-point detection. The basic algorithm 1) decodes a processing block of a predetermined length, 2) traces back to find the frame in the processing block at which the word histories of the preceding block merge into one, and 3) restarts decoding from that boundary frame with the merged word history. The effectiveness of the method is verified by spoken dialogue transcription experiments. On a 5-minute dialogue recorded in a moving car, the proposed method gives better word accuracy than explicit end-point detection combined with a conventional one-pass decoder.

7 citations
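
The three-step loop can be sketched schematically. The Decoder and Lattice classes below are stand-in stubs, not a real ASR decoder API; they only illustrate the control flow of decoding a fixed block, tracing back to the frame where all hypotheses share one word history, and restarting there.

```python
# Schematic sketch (not the paper's implementation) of block-wise decoding
# without explicit end-point detection.

class Lattice:
    """Stub result of decoding one block."""
    def __init__(self, merge_t, words):
        self._t, self._words = merge_t, words
    def merge_frame(self):           # earliest frame where all active
        return self._t               # hypotheses agree on one word history
    def best_history(self, until):   # merged word history up to that frame
        return self._words

class Decoder:
    """Stub standing in for a one-pass decoder with non-speech models."""
    def decode_block(self, block, history):
        return Lattice(merge_t=len(block) // 2, words=history + ['<word>'])

BLOCK = 1000  # predetermined processing-block length (frames)

def recognize_stream(frames, decoder):
    start, history = 0, []
    while start < len(frames):
        block = frames[start:start + BLOCK]
        lattice = decoder.decode_block(block, history)  # step 1: decode block
        t = max(1, lattice.merge_frame())               # step 2: trace back
        history = lattice.best_history(until=t)
        start += t                                      # step 3: restart from
    return history                                      # the merge frame

print(recognize_stream(list(range(3000)), Decoder()))
```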


Proceedings ArticleDOI
04 Oct 2004
TL;DR: A system called the Data Collection Vehicle (DCV), which supports synchronous recording of multi-channel audio data from 16 microphones that can be placed in flexible positions, multi-channel video data from 3 cameras, and vehicle-related data.
Abstract: CIAIR, Nagoya University, has been compiling an in-car speech database since 1999. This paper reports on various characteristics of the database. We have developed a system called the Data Collection Vehicle (DCV), which supports synchronous recording of multi-channel audio data from 16 microphones that can be placed in flexible positions, multi-channel video data from 3 cameras, and vehicle-related data. In the compilation process, each subject had conversations with three types of dialogue system: a human, a "Wizard of OZ" system, and a conversational system. In this paper, we present the specifications and the characteristics of the database.

6 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: The proposed approach outperforms linear regression methods and an adaptive beamformer by 8% and 3%, respectively, in terms of averaged recognition accuracy.
Abstract: In this paper, we address issues in improving hands-free speech recognition performance in different car environments using multiple spatially distributed microphones. In previous work, we proposed multiple regression of the log spectra (MRLS) for estimating the log spectra of speech at a close-talking microphone. In this paper, the idea is extended to nonlinear regressions. Isolated word recognition experiments in real car environments show that, compared to the nearest distant microphone, recognition accuracy can be improved by about 40% under very noisy driving conditions using the optimized regression method. The proposed approach outperforms linear regression methods and an adaptive beamformer by 8% and 3%, respectively, in terms of averaged recognition accuracy.

4 citations
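
As one way to picture the nonlinear extension of MRLS, the sketch below regresses a close-talking microphone's log spectrum from the stacked log spectra of several distant microphones with a small MLP. The regressor choice, dimensions, and random data are assumptions, not the paper's exact formulation.

```python
# Sketch: nonlinear regression from distant-mic log spectra to the
# close-talking mic's log spectrum. Data and model choice are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_frames, n_mics, n_bins = 2000, 5, 64

X = rng.normal(size=(n_frames, n_mics * n_bins))  # distant mics, stacked per frame
y = rng.normal(size=(n_frames, n_bins))           # close-talking mic target

reg = MLPRegressor(hidden_layer_sizes=(128,), max_iter=200).fit(X, y)
enhanced_logspec = reg.predict(X[:10])            # estimated clean log spectra
```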


Proceedings ArticleDOI
04 Oct 2004
TL;DR: Leveraging the synchrony between speech and finger tapping provides a 46% relative improvement in connected digit recognition experiments and a 1% absolute improvement in LVCSR experiments.
Abstract: Behavioral synchronization between speech and finger tapping provides a novel approach to improving speech recognition accuracy. We exploit a sequence of finger-tapping timings recorded alongside an utterance in two distinct ways: in the first method, HMM state transition probabilities at word boundaries are controlled by the timing of the finger tapping; in the second, the probability (relative frequency) of the finger tapping is used as a 'feature' and combined with MFCC in an HMM recognition system. We evaluate these methods through connected digit recognition under different noise conditions (AURORA-2J) and LVCSR tasks. Leveraging the synchrony between speech and finger tapping provides a 46% relative improvement in the connected digit recognition experiments and a 1% absolute improvement in the LVCSR experiments.

4 citations
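
A minimal sketch of the second method's feature combination, assuming a 10 ms frame shift and a simple tap-counting window (both assumptions): a per-frame tapping relative frequency is appended to the MFCC vector before HMM training and decoding.

```python
# Sketch: append a per-frame finger-tapping probability (relative frequency
# of taps near each frame) to the MFCC features. Constants are assumptions.
import numpy as np

FRAME_RATE = 100.0  # frames per second, i.e., an assumed 10 ms frame shift

def tap_feature(tap_times, n_frames, width=0.1):
    """Relative tap frequency within +/- width seconds of each frame center."""
    centers = np.arange(n_frames) / FRAME_RATE
    counts = np.array([np.sum(np.abs(tap_times - t) <= width) for t in centers])
    return counts / max(1, len(tap_times))

def augment(mfcc, tap_times):
    """mfcc: (n_frames, n_ceps) -> (n_frames, n_ceps + 1) feature matrix."""
    return np.column_stack([mfcc, tap_feature(tap_times, len(mfcc))])

mfcc = np.zeros((300, 13))              # placeholder MFCCs for a 3 s utterance
taps = np.array([0.5, 1.1, 1.8, 2.4])  # recorded tap timings in seconds
print(augment(mfcc, taps).shape)        # (300, 14)
```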


Proceedings ArticleDOI
04 Oct 2004
TL;DR: A robust audio-visual integration system for source tracking and speech enhancement for an in-vehicle speech dialogue system that can improve speech recognition accuracy by up to 40.75% compared with using audio data alone.
Abstract: Human-computer interaction for in-vehicle information and navigation systems is a challenging problem because of the diverse and changing acoustic environments. We propose that integrating video and audio information can significantly improve dialogue system performance, since the visual modality is not affected by acoustic noise. In this paper, we propose a robust audio-visual integration system for source tracking and speech enhancement for an in-vehicle speech dialogue system. The proposed system integrates both audio and visual information to locate the desired speaker source. Using real data collected in car environments, the proposed system improves speech recognition accuracy by up to 40.75% compared with using audio data alone.

4 citations
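
The abstract does not spell out the enhancement method, but a common way to realize such audio-visual integration is to let the visual position estimate steer a delay-and-sum beamformer. The sketch below is illustrative only, with assumed array geometry, sample rate, and face-tracking output.

```python
# Sketch: steer a delay-and-sum beamformer toward a speaker position that a
# visual face tracker would supply. Geometry and constants are assumptions.
import numpy as np

FS = 16000.0  # sample rate (Hz)
C = 343.0     # speed of sound (m/s)
mic_pos = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]])  # mic xy positions (m)

def delay_and_sum(signals, src_xy):
    """signals: (n_mics, n_samples); src_xy: speaker position from video."""
    dists = np.linalg.norm(mic_pos - src_xy, axis=1)
    delays = np.round((dists - dists.min()) / C * FS).astype(int)
    n = signals.shape[1] - delays.max()
    aligned = np.stack([s[d:d + n] for s, d in zip(signals, delays)])
    return aligned.mean(axis=0)   # time-align channels, then average

sigs = np.random.default_rng(0).normal(size=(3, 16000))  # dummy mic signals
out = delay_and_sum(sigs, np.array([1.0, 0.5]))          # visually located source
```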


Proceedings ArticleDOI
04 Oct 2004
TL;DR: The proposed speech enhancement method modifies the noise estimation of the minimum statistics method and combines it with maximum a posteriori (MAP) decomposition, using a Rician conditional probability and a non-Gaussian statistical model of speech.
Abstract: In this paper, we propose a speech enhancement method based on spectral magnitude estimation. We modify the noise estimation of the minimum statistics method and combine it with maximum a posteriori (MAP) decomposition, using a Rician conditional probability and a non-Gaussian statistical model of speech. We derive two versions, magnitude decomposition and magnitude-phase decomposition, and compare them to spectral subtraction and other MAP methods based on Gaussian statistics (MMSE, LSA). The experiments show the advantage of the proposed method in improving both SNR (by up to 12 dB) and recognition accuracy (by up to 21% over the baseline).

4 citations
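
For context, the baseline minimum statistics tracker that the method modifies can be sketched as follows: recursively smooth the noisy power spectrum, then take a per-bin sliding minimum as the noise estimate. The constants are illustrative, and neither the paper's modification nor the Rician MAP step is reproduced here.

```python
# Sketch of basic minimum-statistics noise estimation (the starting point the
# paper modifies). Smoothing constant and window length are illustrative.
import numpy as np

def min_stats_noise(power, alpha=0.85, win=100):
    """power: (n_frames, n_bins) noisy power spectrogram -> noise estimate."""
    smoothed = np.empty_like(power)
    smoothed[0] = power[0]
    for t in range(1, len(power)):   # first-order recursive smoothing
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * power[t]
    noise = np.empty_like(power)
    for t in range(len(power)):      # sliding minimum over the last `win` frames
        noise[t] = smoothed[max(0, t - win + 1):t + 1].min(axis=0)
    return noise

spec = np.abs(np.random.default_rng(0).normal(size=(500, 257))) ** 2  # dummy
noise = min_stats_noise(spec)
```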


Proceedings ArticleDOI
04 Oct 2004
TL;DR: A new method to expand an example-based spoken dialogue system to handle context-dependent utterances, together with a new framework named the "GROW architecture," which consists of the dialogue system and a Wizard-of-OZ (WOZ) system.
Abstract: In this paper, we propose a new method to expand an example-based spoken dialogue system to handle context-dependent utterances. The dialogue system refers to dialogue examples to find one that is suitable for advancing the dialogue. Here, the dialogue context is expressed in the form of dialogue slots. By constructing dialogue examples from the text of utterances together with the dialogue slots, the system can handle context-dependent dialogue. We also propose a new framework of spoken dialogue, named the "GROW architecture," which consists of the dialogue system and a Wizard-of-OZ (WOZ) system. Using the WOZ system to add dialogue examples over a network makes it efficient to augment the set of dialogue examples.

4 citations
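
A toy sketch of example-based response selection with dialogue slots as context; the word-overlap-plus-slot-match score and the examples are invented for illustration and are not the paper's actual matching metric.

```python
# Sketch: pick the stored dialogue example that best matches the utterance
# text and the current slot context. Scoring and data are invented.
examples = [
    {"text": "list italian restaurants", "slots": {"area": None},
     "response": "Which area are you interested in?"},
    {"text": "how about in that area", "slots": {"area": "Sakae"},
     "response": "There are 12 Italian restaurants in Sakae."},
]

def score(example, utterance, context):
    overlap = set(utterance.split()) & set(example["text"].split())
    slot_match = sum(context.get(k) == v for k, v in example["slots"].items())
    return len(overlap) + slot_match

def respond(utterance, context):
    best = max(examples, key=lambda e: score(e, utterance, context))
    return best["response"]

print(respond("restaurants in that area", {"area": "Sakae"}))
```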



Proceedings ArticleDOI
04 Oct 2004
TL;DR: The dependency of conversational utterances on the mode of dialogue is analyzed, and some characteristics such as sentence complexity, loudness of the voice, and speaking rate are found to differ significantly among the dialogue modes.
Abstract: The dependency of conversational utterances on the mode of dialogue is analyzed. A speech corpus of 800 speakers collected under three different modes, i.e., talking to a human operator, a WOZ system, and an ASR system, is used for the analysis. Some characteristics, such as sentence complexity, loudness of the voice, and speaking rate, are found to differ significantly among the dialogue modes. Linear regression analysis also clarifies the relative importance of those characteristics for speech recognition accuracy.
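
The kind of regression analysis described can be sketched on synthetic data: regress recognition accuracy on standardized utterance characteristics and read relative importance off the coefficients. All numbers below are dummies, not the corpus statistics.

```python
# Sketch: linear regression of recognition accuracy on utterance
# characteristics; coefficients indicate relative importance. Dummy data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 800  # speakers
X = np.column_stack([
    rng.normal(size=n),   # sentence complexity (standardized)
    rng.normal(size=n),   # loudness of the voice (standardized)
    rng.normal(size=n),   # speaking rate (standardized)
])
accuracy = 80 + X @ np.array([-2.0, 1.0, -3.0]) + rng.normal(size=n)

model = LinearRegression().fit(X, accuracy)
for name, coef in zip(["complexity", "loudness", "speaking rate"], model.coef_):
    print(f"{name}: {coef:+.2f}")
```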


Book ChapterDOI
30 Nov 2004
TL;DR: A new multi-channel method of noisy speech recognition, which estimates the log spectrum of speech at a close-talking microphone based on multiple regression of the log spectra (MRLS) of noisy signals captured by distributed microphones, is described.
Abstract: This paper describes a new multi-channel method of noisy speech recognition, which estimates the log spectrum of speech at a close-talking microphone based on multiple regression of the log spectra (MRLS) of noisy signals captured by distributed microphones. The advantages of the proposed method are as follows: it does not make any assumptions about the positions of the speaker and noise sources with respect to the microphones, so the system can be trained for various seating positions of drivers, and the regression weights can be statistically optimized over a certain length of speech segments (e.g., sentences of speech) under particular road conditions. The performance of the proposed method is illustrated by speech recognition of real in-car dialogue data. In comparison to the nearest distant microphone and a multi-microphone adaptive beamformer, the proposed approach obtains relative word error rate (WER) reductions of 9.8% and 3.6%, respectively.
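
A minimal sketch of the linear MRLS idea for a single frequency bin, assuming per-frame log spectra and least-squares optimization of the regression weights over training segments; the dimensions and data are dummies.

```python
# Sketch: estimate the close-talking mic's log spectrum (one bin) as an
# affine combination of the distant mics' log spectra, with weights fit by
# least squares over training frames. Data are dummies.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_mics = 5000, 5
X = rng.normal(size=(n_frames, n_mics))  # distant-mic log spectra, one bin
y = rng.normal(size=n_frames)            # close-talking log spectrum, same bin

A = np.column_stack([X, np.ones(n_frames)])  # add a bias term
w, *_ = np.linalg.lstsq(A, y, rcond=None)    # statistically optimized weights
estimate = A @ w                             # estimated close-talk log spectrum
```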