
Showing papers by "Kazuya Takeda published in 2008"


BookDOI
04 Dec 2008
TL;DR: In-Vehicle Corpus and Signal Processing for Driver Behavior, as discussed by the authors, is a collection of expanded papers from the third biennial DSPinCARS workshop, held in Istanbul in June 2007.
Abstract: In-Vehicle Corpus and Signal Processing for Driver Behavior comprises expanded papers from the third biennial DSPinCARS, held in Istanbul in June 2007. The goal is to bring together scholars working on the latest techniques, standards, and emerging deployments in this field, which is central to life in an age of wireless communications, smart vehicles, and human-machine-assisted safer, more comfortable driving. Topics covered in this book include: improved vehicle safety; safe driver-assistance systems; smart vehicles; wireless LAN-based vehicular location information processing; EEG emotion recognition systems; and new methods for predicting driving actions using driving signals. In-Vehicle Corpus and Signal Processing for Driver Behavior is appropriate for researchers, engineers, and professionals working in signal processing technologies, next-generation vehicle design, and networks for mobile platforms.

57 citations


Proceedings ArticleDOI
22 Sep 2008
TL;DR: The results of evaluation experiments proved that CENSREC-4 is an effective database for evaluating new dereverberation methods, since the traditional dereverberation process had difficulty sufficiently improving recognition performance.
Abstract: In this paper, we introduce a collection of databases and evaluation tools called CENSREC-4, an evaluation framework for distant-talking speech under hands-free conditions. Distant-talking speech recognition is crucial for a hands-free speech interface. Therefore, we measured room impulse responses to investigate reverberant speech recognition in various environments. The data contained in CENSREC-4 are connected digit utterances, as in CENSREC-1. Two subsets are included: basic data sets and extra data sets. The basic data sets provide the evaluation environment for speech data convolved with the room impulse responses. The extra data sets consist of simulated and recorded data. Evaluation tools are provided only for the basic data sets. The results of evaluation experiments proved that CENSREC-4 is an effective database for evaluating new dereverberation methods, since the traditional dereverberation process had difficulty sufficiently improving recognition performance. Index Terms: Various environments, Impulse response, Convolution, Real recorded data, Evaluation framework
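The core simulation step behind a framework like this is convolving clean utterances with measured room impulse responses. Below is a minimal sketch of that step; the file names and normalization are illustrative assumptions, not details from CENSREC-4.

```python
# Minimal sketch: simulate reverberant evaluation data by convolving clean
# speech with a measured room impulse response (RIR). File names and the
# sampling rate are hypothetical, not taken from CENSREC-4.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

fs_speech, clean = wavfile.read("clean_digits.wav")      # hypothetical clean utterance
fs_rir, rir = wavfile.read("room_impulse_response.wav")  # hypothetical measured RIR
assert fs_speech == fs_rir, "resample first if the rates differ"

clean = clean.astype(np.float64)
rir = rir.astype(np.float64)
rir /= np.max(np.abs(rir))                               # normalize the RIR peak

reverberant = fftconvolve(clean, rir)[:len(clean)]       # truncate to input length

# rescale into 16-bit range before writing
reverberant *= 0.9 * 32767 / np.max(np.abs(reverberant))
wavfile.write("reverberant_digits.wav", fs_speech, reverberant.astype(np.int16))
```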

28 citations


Proceedings ArticleDOI
12 May 2008
TL;DR: Subjective evaluation shows that there is no significant difference between natural and reconstructed sound when more than 6 virtual sources are used, and the effectiveness of the encoding algorithm as well as the virtual source representation is confirmed.
Abstract: A sound field reproduction method that uses blind source separation and head-related transfer functions is proposed. In the proposed system, multichannel acoustic signals captured at distant microphones are encoded into a set of location/signal pairs of virtual sound sources based on frequency-domain ICA. After the locations and signals of the virtual sources are estimated, the spatial sound at a selected point is constructed by convolving the controlled acoustic transfer functions with each signal. In the evaluation, the sound field created by 6 sound sources is captured using 48 distant microphones and encoded into a set of virtual sound sources. Subjective evaluation shows that there is no significant difference between natural and reconstructed sound when more than 6 virtual sources are used, confirming the effectiveness of the encoding algorithm as well as the virtual source representation.
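A sketch of the decoding step only, under stated assumptions: each virtual source signal is convolved with its acoustic transfer function to the listening point and the results are summed. The ICA-based encoding that estimates the virtual sources is omitted, and all arrays below are placeholders.

```python
# Decoding sketch: reconstruct the sound at a listening point by convolving
# each virtual source signal with its transfer function and summing.
import numpy as np
from scipy.signal import fftconvolve

def reconstruct(virtual_sources, transfer_functions):
    """virtual_sources: list of 1-D signals; transfer_functions: matching
    list of impulse responses from each virtual source to the listener."""
    n = max(len(s) + len(h) - 1 for s, h in zip(virtual_sources, transfer_functions))
    out = np.zeros(n)
    for s, h in zip(virtual_sources, transfer_functions):
        y = fftconvolve(s, h)
        out[:len(y)] += y
    return out

rng = np.random.default_rng(0)
sources = [rng.standard_normal(16000) for _ in range(6)]   # 6 toy virtual sources
irs = [rng.standard_normal(512) * np.exp(-np.arange(512) / 64.0)
       for _ in range(6)]                                  # toy impulse responses
listening_point_signal = reconstruct(sources, irs)
```

For binaural output, the same loop would run once per ear with ear-specific transfer functions.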

26 citations


Proceedings ArticleDOI
10 Oct 2008
TL;DR: A statistical driver model is proposed that assumes a driver plans various vehicle trajectories depending on the surrounding vehicles and then selects a safe and comfortable one; trajectories are generated from an HMM.
Abstract: This paper describes a method to generate vehicle trajectories of lane change paths for individual drivers. Although each driver has consistent preferences in lane change behavior, lane-changing time and vehicle trajectory are uncertain due to the presence of surrounding vehicles. To model this uncertainty, we propose a statistical driver model. We assume that a driver plans various vehicle trajectories depending on the surrounding vehicles and then selects a safe and comfortable trajectory. Lane change patterns of each driver are modeled with a hidden Markov model (HMM), which is trained using longitudinal vehicle velocity, lateral vehicle position, and their dynamic features. Vehicle trajectories are generated from the HMM under a maximum likelihood criterion at random lane-changing times and state durations. Experimental results show that the vehicle trajectories generated from the HMM included a trajectory similar to that of the target driver.
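A toy sketch of the generation idea: a left-to-right HMM whose states carry means for [velocity, lateral position] emits a trajectory by drawing a random duration per state and outputting the state means. The per-state means and durations below are invented, and the paper's dynamic (delta) features and proper ML parameter-generation algorithm are omitted.

```python
# Toy left-to-right HMM trajectory generation for a lane change.
import numpy as np

rng = np.random.default_rng(1)

# hypothetical per-state means: [velocity (m/s), lateral position (m)]
state_means = np.array([[25.0, 0.0],    # keep lane
                        [26.0, 0.9],    # start moving
                        [27.0, 2.5],    # crossing
                        [26.5, 3.5]])   # settled in new lane

def generate_trajectory(means, mean_duration=15):
    frames = []
    for mu in means:
        d = max(1, rng.poisson(mean_duration))   # random state duration
        frames.append(np.tile(mu, (d, 1)))       # emit the state mean d times
    return np.vstack(frames)                     # (T, 2) velocity / lateral pos

traj = generate_trajectory(state_means)
print(traj.shape)
```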

21 citations


Proceedings ArticleDOI
04 Jun 2008
TL;DR: This paper proposes a transcription protocol based on six major groups: driver mental state, driver actions, driver's secondary task, driving environment, vehicle status, and speech/background noise, and integrates transcriptions, driving behavior, and physiological signals using a Bayesian network.
Abstract: In this paper we present our ongoing collection of multi-modal real-world driving data. Video, speech, driving behavior, and physiological signals from 150 drivers have already been collected. To provide a more meaningful description of the collected data, we propose a transcription protocol based on six major groups: driver mental state, driver actions, driver's secondary task, driving environment, vehicle status, and speech/background noise. Data from 30 drivers have been transcribed. We then show how transcription reliability can be improved by properly training annotators. Finally, we integrate transcriptions, driving behavior, and physiological signals using a Bayesian network for estimating a driver's level of irritation. Estimations are compared to actual values assessed by the drivers themselves. Preliminary results are very encouraging.
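As a naive-Bayes simplification of the fusion idea: discretized evidence (e.g., heart-rate level, traffic density, secondary-task activity) is combined into a posterior over the driver's irritation level. The conditional probability tables below are invented for illustration; the paper learns a full Bayesian network from the annotated corpus.

```python
# Naive-Bayes style fusion of discretized multimodal evidence into a
# posterior over irritation level. All probabilities are made up.
import numpy as np

levels = ["calm", "mild", "irritated"]
prior = np.array([0.6, 0.3, 0.1])

# P(observation is present | irritation level), one row per evidence variable
cpt = {
    "heart_rate_high": np.array([0.1, 0.3, 0.7]),
    "dense_traffic":   np.array([0.3, 0.5, 0.8]),
    "secondary_task":  np.array([0.2, 0.4, 0.5]),
}

def posterior(evidence):
    """evidence: dict var -> bool; returns P(level | evidence)."""
    p = prior.copy()
    for var, present in evidence.items():
        p *= cpt[var] if present else 1.0 - cpt[var]
    return p / p.sum()

print(dict(zip(levels, posterior(
    {"heart_rate_high": True, "dense_traffic": True, "secondary_task": False}))))
```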

18 citations


Proceedings Article
01 May 2008
TL;DR: The results of evaluation experiments proved that CENSREC-4 is an effective database suitable for evaluating new dereverberation methods, since the traditional dereverberation process had difficulty sufficiently improving recognition performance.
Abstract: Recently, speech recognition performance has been drastically improved by statistical methods and huge speech databases. Attention is now focused on improving performance in realistic environments such as noisy conditions. Since October 2001, our working group of the Information Processing Society of Japan has been working on evaluation methodologies and frameworks for Japanese noisy speech recognition. We have released frameworks, each including databases and evaluation tools, called CENSREC-1 (Corpus and Environment for Noisy Speech RECognition 1; formerly AURORA-2J), CENSREC-2 (in-car connected digits recognition), CENSREC-3 (in-car isolated word recognition), and CENSREC-1-C (voice activity detection under noisy conditions). In this paper, we introduce a new collection of databases and evaluation tools named CENSREC-4, an evaluation framework for distant-talking speech under hands-free conditions. Distant-talking speech recognition is crucial for a hands-free speech interface. Therefore, we measured room impulse responses to investigate reverberant speech recognition. The results of evaluation experiments proved that CENSREC-4 is an effective database suitable for evaluating new dereverberation methods, since the traditional dereverberation process had difficulty sufficiently improving recognition performance. The framework was released in March 2008, and many studies are being conducted with it in Japan.

17 citations


Proceedings ArticleDOI
05 Nov 2008
TL;DR: Two novel methods are proposed for arbitrary listening-point generation for 3D audio-video (3DAV) integration in a large-scale multipoint camera and microphone system that can process and display information from any recorded 3D scene in real time.
Abstract: In this paper, we propose two novel methods for arbitrary listening-point generation for 3D audio-video (3DAV) integration in a large-scale multipoint camera and microphone system able to process and display information from any recorded 3D scene in real time. With this system, users can freely control their own viewpoint/listening-point position. An arbitrary listening point can be generated either by (i) a ray-space representation of the sound wave field (source-sound independent) for multiple frequency layers, or by (ii) acoustic transfer function estimation (source-sound dependent) together with blind separation of the sound sources. Arbitrary viewpoint generation is based on the ray-space method, enhanced by multipass dynamic programming for geometry compensation. Integration is done either by (i) a joint ray-space representation of sound wave and image, or by (ii) combining each camera's video signal with the acoustic transfer function of the same location as integrated 3DAV data. The prototype integrated audio-visual viewer achieves both good image and sound quality at 15 frames/second.

13 citations


Proceedings ArticleDOI
20 Oct 2008
TL;DR: An integrative recognition method for speech accompanied by gestures such as pointing is proposed, based on the probability distribution of the time gap between the starting times of an utterance and its accompanying gesture.
Abstract: We propose an integrative recognition method for speech accompanied by gestures such as pointing. Simultaneously generated speech and pointing complementarily help the recognition of both, so integrating these modalities may improve recognition performance. As an example of such multimodal speech, we selected the explanation of a geometry problem. While the problem was being solved, speech and fingertip movements were recorded with a close-talking microphone and a 3D position sensor. To find the correspondence between utterance and gestures, we model the probability distribution of the time gap between the starting times of an utterance and its accompanying gesture. We also propose an integrative recognition method using this distribution. With this method, we obtained an improvement of approximately 3 points for both speech and fingertip movement recognition performance.
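A schematic sketch of how such a time-gap distribution can rescore joint hypotheses: each speech/gesture hypothesis pair combines the two recognizers' log scores with the log probability of the observed onset gap. A Gaussian gap model and all scores below are invented for illustration; the paper estimates the distribution from data.

```python
# Rescore joint speech/gesture hypothesis pairs with a time-gap prior.
import math

GAP_MEAN, GAP_STD = 0.3, 0.4   # hypothetical: gesture starts ~0.3 s after speech

def log_gap_prob(gap):
    z = (gap - GAP_MEAN) / GAP_STD
    return -0.5 * z * z - math.log(GAP_STD * math.sqrt(2 * math.pi))

def joint_score(speech_hyp, gesture_hyp, weight=1.0):
    gap = gesture_hyp["onset"] - speech_hyp["onset"]
    return speech_hyp["logp"] + gesture_hyp["logp"] + weight * log_gap_prob(gap)

speech_nbest = [{"text": "this angle", "onset": 1.00, "logp": -4.2},
                {"text": "this apple", "onset": 1.00, "logp": -4.0}]
gesture_nbest = [{"target": "vertex A", "onset": 1.25, "logp": -1.1},
                 {"target": "vertex B", "onset": 2.60, "logp": -0.9}]

best = max(((s, g) for s in speech_nbest for g in gesture_nbest),
           key=lambda pair: joint_score(*pair))
print(best[0]["text"], "+", best[1]["target"])
```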

8 citations


Proceedings Article
01 May 2008
TL;DR: Comparing utterance length, speaking rate, and filler rate of driver utterances in human-human and human-machine dialogs, it is found that drivers tended to use longer and faster utterances with more fillers when talking with humans than with machines.
Abstract: In this paper, a large-scale real-world speech database is introduced along with other multimedia driving data. We designed a data collection vehicle equipped with various sensors to synchronously record twelve-channel speech, three-channel video, driving behavior (including gas and brake pedal pressures, steering angles, and vehicle velocities), and physiological signals (including driver heart rate, skin conductance, and emotion-based sweating on the palms and soles). These multimodal data are collected while driving on city streets and expressways under four different driving task conditions: two kinds of monologue, human-human dialog, and human-machine dialog. We investigated the response timing of drivers to navigator utterances and found that most responses overlapped with the preceding utterance, due to the task characteristics and the features of Japanese. Comparing utterance length, speaking rate, and the filler rate of driver utterances in human-human and human-machine dialogs, we found that drivers tended to use longer and faster utterances with more fillers when talking with humans than with machines.

7 citations


Proceedings ArticleDOI
22 Sep 2008
TL;DR: The key idea of the proposed system is to train a linear transformation between document and music spaces so that query documents can be mapped onto a music space in which similarities based on acoustic characteristics are represented.
Abstract: Building and combining document and music spaces of songs are discussed for a new music recommendation application that uses commonly read texts, such as Web logs, as query input. The most important application of this flexible recommendation system is music query-by-Webpage, in which a song that appropriately matches a Webpage is automatically played. The key idea of the proposed system is to train a linear transformation between document and music spaces so that query documents can be mapped onto a music space in which similarities based on acoustic characteristics are represented. The basic system has been trained using 2,650 pairs of songs and review texts. Through experimental evaluations, we show the effectiveness of the system, which performs three times better than the previous system. Web text as a training corpus and a bigram representation of the document vector are also investigated for improving the system, and their effectiveness is likewise confirmed.
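A minimal sketch of the core idea, assuming placeholder features: learn a linear map W from document vectors to music-space vectors on paired (review text, song) data via least squares, then project a query document and retrieve the nearest song by cosine similarity. Dimensions and data below are invented.

```python
# Learn a linear document-to-music-space map on paired data, then retrieve.
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d_doc, d_music = 2650, 300, 60
D = rng.standard_normal((n_pairs, d_doc))     # document vectors (e.g., tf-idf/LSA)
M = rng.standard_normal((n_pairs, d_music))   # acoustic music-space vectors

# least-squares solution of D @ W ≈ M
W, *_ = np.linalg.lstsq(D, M, rcond=None)

def recommend(query_doc_vec, song_vectors):
    q = query_doc_vec @ W                      # map the query into music space
    sims = (song_vectors @ q) / (
        np.linalg.norm(song_vectors, axis=1) * np.linalg.norm(q) + 1e-12)
    return int(np.argmax(sims))                # index of the best-matching song

print(recommend(D[0], M))                      # ideally retrieves a song near pair 0
```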

6 citations



Proceedings ArticleDOI
18 Jun 2008
TL;DR: Experimental results show that the percentages of risky steering operations estimated for individual drivers correlate with driver risk evaluation scores given by a risk consulting expert.
Abstract: Risky steering operations are detected based on the relationship between the radius of road curvature and road design speed defined in the road construction ordinance. Vehicle motion while steering is approximated as a circular motion, and the vehicle trajectory radius is estimated from lateral acceleration and vehicle velocity captured with a drive recorder based on a circular motion equation. Steering operation behaviors are evaluated for 203 drivers. Experimental results show that the percentages of risky steering operations estimated for individual drivers correlate with driver risk evaluation scores given by a risk consulting expert. We also observed situations of risky steering by recording video along with driving data using a data collection vehicle.
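The detection idea reduces to basic circular-motion kinematics: estimate the trajectory radius R = v^2 / a_lat from velocity and lateral acceleration, and flag the sample if R falls below the minimum curve radius for the road's design speed. A minimal sketch follows; the radius table is illustrative, not the actual ordinance values.

```python
# Flag risky steering by comparing the estimated turning radius against a
# minimum curve radius per design speed (threshold values are hypothetical).
def trajectory_radius(v_mps, a_lat_mps2):
    """Circular-motion estimate of turning radius in meters: R = v^2 / |a_lat|."""
    return float("inf") if abs(a_lat_mps2) < 1e-6 else v_mps**2 / abs(a_lat_mps2)

# hypothetical minimum curve radius (m) keyed by design speed (km/h)
MIN_RADIUS = {40: 60.0, 50: 100.0, 60: 150.0, 80: 280.0}

def is_risky(v_kmh, a_lat_mps2, design_speed_kmh):
    r = trajectory_radius(v_kmh / 3.6, a_lat_mps2)
    return r < MIN_RADIUS[design_speed_kmh]

print(is_risky(v_kmh=70.0, a_lat_mps2=3.0, design_speed_kmh=60))  # True here
```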

Proceedings Article
01 Aug 2008
TL;DR: Estimation methods for all sound source directions on the horizontal plane, based on a Gaussian mixture model (GMM) of binaural signals, are proposed; results indicate that all sound source directions can be estimated with a small amount of known information.
Abstract: We propose and evaluate methods for estimating all sound source directions on the horizontal plane based on a Gaussian mixture model (GMM) using binaural signals. A GMM-based method can estimate a sound source direction on which the GMM has already been trained, but it cannot estimate directions for which no corresponding model exists. Three methods with interpolation techniques are investigated. Two generate GMMs for all directions by interpolating either an acoustic transfer function or the statistical values of the GMM; the third calculates the posterior probability for all directions with a limited number of GMMs. In our experiments, we investigated six interval conditions. The interpolation methods based on the acoustic transfer function and on the statistical values of the GMM achieve better performance. Although only 12 GMMs were trained at 30° intervals, the method interpolating the statistical values of the GMM achieved 62.5% accuracy (45/72) with 2.8° estimation error. These results indicate that the proposed method can estimate all sound source directions with a small amount of known information.
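A simplified sketch of one interpolation idea: models trained only every 30° are densified by linearly interpolating their means to intermediate directions, and the direction whose model best explains the binaural feature is chosen. Single Gaussians with identity covariance stand in for the paper's GMMs, and all features are synthetic.

```python
# Interpolate trained model means to a fine direction grid, then pick the
# direction with the highest (Gaussian, identity-covariance) log-likelihood.
import numpy as np

rng = np.random.default_rng(2)
trained_dirs = np.arange(0, 360, 30)                       # 12 trained models
dim = 8                                                    # binaural feature dim
trained_means = rng.standard_normal((len(trained_dirs), dim))

def interpolated_mean(direction):
    """Circular linear interpolation of trained means to any direction."""
    lo = (direction // 30) % 12
    hi = (lo + 1) % 12
    w = (direction % 30) / 30.0
    return (1 - w) * trained_means[lo] + w * trained_means[hi]

def estimate_direction(feature, grid_step=5):
    grid = np.arange(0, 360, grid_step)
    scores = [-np.sum((feature - interpolated_mean(d)) ** 2) for d in grid]
    return grid[int(np.argmax(scores))]

true_dir = 75                                              # between trained models
feat = interpolated_mean(true_dir) + 0.05 * rng.standard_normal(dim)
print(estimate_direction(feat))                            # expected near 75
```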

Proceedings ArticleDOI
22 Sep 2008
TL;DR: A novel representation of F0 contours is proposed that provides a computationally efficient algorithm for automatically estimating the parameters of an F0 control model for singing voices and can identify both the target musical note sequence and the dynamics of singing behaviors included in the F0 contours.
Abstract: In this paper, we propose a novel representation of F0 contours that provides a computationally efficient algorithm for automatically estimating the parameters of an F0 control model for singing voices. Although the best-known F0 control model, based on a second-order system with a piece-wise constant function as its input, can generate the F0 contours of natural singing voices, it has no means of learning its parameters automatically from observed F0 contours. Therefore, by modeling the piece-wise constant function with a hidden Markov model (HMM) and approximating the second-order differential equation by a difference equation, we estimate the model parameters optimally through iteration of Viterbi training and an LPC-like solver. Our representation is a generative model and can identify both the target musical note sequence and the dynamics of singing behaviors included in the F0 contours. Our experimental results show that the proposed method can separate the dynamics from the target musical note sequence and generate F0 contours using the estimated model parameters.
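A sketch of the generative view: the note sequence is a piece-wise constant target u[n], and the F0 contour y[n] is its response through a second-order system written as a difference equation (a critically damped filter here). The pole location and note values are illustrative; the paper estimates such parameters from data via Viterbi training and an LPC-like solver.

```python
# Generate an F0-like contour from a piece-wise constant note target through
# a second-order difference equation (double real pole at z = a, DC gain 1).
import numpy as np
from scipy.signal import lfilter, lfilter_zi

# piece-wise constant target: three notes on a log-F0 (semitone-like) scale
u = np.concatenate([np.full(50, 60.0), np.full(50, 64.0), np.full(50, 62.0)])

a = 0.9                             # pole closer to 1 = slower, smoother transitions
b = [(1 - a) ** 2]                  # numerator chosen so the DC gain is exactly 1
den = [1.0, -2 * a, a ** 2]         # denominator: (1 - a z^-1)^2

zi = lfilter_zi(b, den) * u[0]      # start in steady state at the first note
y, _ = lfilter(b, den, u, zi=zi)    # smoothed F0-like contour following the notes
```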

25 Nov 2008
TL;DR: A Bayesian network based stochastic model was built that predicts the subjective score of system usability from personal profiles and several objective metrics; each user's satisfaction index could be predicted for 35.2% of the subjects using the trained Bayesian network.
Abstract: As Internet-based voice communication tools continue to spread, people have more chances to use microphones on their private PCs in a variety of acoustic environments. When PC-based speech input applications are used, this variety of environments causes speech recognition performance to degrade. To improve speech recognition accuracy, it is crucial to collect speech data in the environment in which the system is used[1]. We collected speech interactions with PC-based applications in a wide range of user environments through a field test, obtaining 488 hours of recorded data including 29 hours of speech segments, corresponding to about sixty thousand utterances. In addition to collecting data, we assessed system usability by a questionnaire that asked about usability and the subjective impression of speech recognition performance. Using the system data log and the questionnaire results, we analyzed the relationship between subjective performance and objective metrics. Through this analysis, a Bayesian network based stochastic model was built that predicts the subjective score of system usability from personal profiles and several objective metrics. Experimental results showed that each user's satisfaction index could be predicted for 35.2% of the subjects using the trained Bayesian network.

Journal ArticleDOI
TL;DR: In this paper, the authors measured HRTFs for about 2,300 directions in sagittal and frontal coordinates and constructed a database of head-related transfer functions (HRTFs).
Abstract: 3D sounds can be generated by using a head-related transfer function (HRTF), which is defined as the acoustic transfer function between a sound source and the entrance of the ear canal. Since an HRTF depends on the subject and the sound source direction, many HRTF measurements have been conducted. In most cases, HRTFs were measured in horizontal coordinates; however, measurements in other coordinates are also useful. In previous research, HRTFs measured in sagittal coordinates were used to investigate the relation between spectral cues and vertical angle perception. Although HRTF measurement in frontal coordinates is rarely conducted, it has the advantage of sampling HRTFs densely in the front and rear, where sound localization is very sensitive. We therefore measured HRTFs for about 2,300 directions in sagittal and frontal coordinates and constructed a database. The measurements were conducted in a soundproof chamber with two head-and-torso simulators (B&K 4128 and KEMAR). The HRTF database can be downloaded at http://www.sp.m.is.nagoya-u.ac.jp/HRTF/ .


Journal ArticleDOI
TL;DR: A multichannel speech enhancement method based on MAP speech spectral magnitude estimation using a generalized gamma model of the speech prior distribution, where the model parameters are adapted from the actual noisy speech frame by frame, resulting in better performance of the speech enhancement algorithm.
Abstract: We present a multichannel speech enhancement method based on MAP speech spectral magnitude estimation using a generalized gamma model of the speech prior distribution, where the model parameters are adapted from the actual noisy speech in a frame-by-frame manner. Using a more general prior distribution with online adaptive estimation of its parameters is shown to be effective for speech spectral estimation in noisy environments. Furthermore, multichannel information in the form of cross-channel statistics is shown to be useful for better adapting the prior distribution parameters to the actual observation, resulting in better performance of the speech enhancement algorithm. We tested the proposed algorithm on an in-car speech database and obtained significant improvements in speech recognition performance, particularly under non-stationary noise conditions such as music, air-conditioner noise, and open windows.
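A stand-in sketch of the frame-by-frame spectral pipeline only: STFT, noise estimation from leading frames, a spectral gain, and reconstruction. A plain Wiener gain replaces the paper's MAP magnitude estimator under an adaptive generalized gamma prior (and the multichannel cross-channel statistics), which are substantially more involved.

```python
# Single-channel spectral-gain enhancement skeleton (Wiener gain stand-in).
import numpy as np
from scipy.signal import stft, istft

fs = 16000
noisy = np.random.default_rng(3).standard_normal(fs)   # placeholder noisy audio

f, t, X = stft(noisy, fs=fs, nperseg=512)
noise_psd = np.mean(np.abs(X[:, :10]) ** 2, axis=1, keepdims=True)  # leading frames

snr = np.maximum(np.abs(X) ** 2 / noise_psd - 1.0, 1e-3)  # a priori SNR estimate
gain = snr / (snr + 1.0)                                  # Wiener gain per bin/frame
_, enhanced = istft(gain * X, fs=fs, nperseg=512)         # back to the time domain
```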

01 Jan 2008
TL;DR: New evaluation frameworks for bimodal speech recognition in noisy conditions and real environments are introduced; a baseline method and its recognition results will also be provided with these corpora.
Abstract: This paper introduces upcoming evaluation frameworks for bimodal speech recognition in noisy conditions and real environments. To develop speech recognition that is robust in noisy environments, bimodal speech recognition, which uses both acoustic and visual information, has attracted particular attention over the past decade. Since many methods and techniques for bimodal speech recognition have been proposed, a common evaluation framework, including audio-visual speech data and a baseline system, is needed to evaluate and compare these techniques and bimodal recognition schemes. The audio-visual evaluation frameworks CENSREC-1-AV and CENSREC-2-AV are being built by the CENSREC project in Japan; CENSREC-1-AV includes artificially noise-added waveforms and image sequences, whereas CENSREC-2-AV consists of audio-visual data recorded in in-car environments. A baseline method and its recognition results will also be provided with these corpora. Index Terms: evaluation framework, audio-visual speech corpus, bimodal speech recognition, noisy environments.

Journal ArticleDOI
TL;DR: An information retrieval system for telephone dialogue in a load dispatch center is developed; it realizes information retrieval with arbitrary keywords and is verified by telephone dialogue transcription and information retrieval experiments.
Abstract: We have developed an information retrieval system for telephone dialogue in a load dispatch center. In load dispatching operations, the need to record and retrieve telephone dialogues is high. The proposed system gives a solution for this task and realizes information retrieval with arbitrary keywords. The effectiveness of the system is verified by telephone dialogue transcription and information retrieval experiments. On 30 telephone dialogues in a load dispatch center, we obtain 59.5% average word correctness and 44.4% average word accuracy. In the information retrieval experiment, with 20 keywords, we obtain 87.3% average precision and 67.2% average recall. © 2007 Wiley Periodicals, Inc. Electr Eng Jpn, 162(3): 44–50, 2008; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/eej.20402
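A toy sketch of keyword retrieval over ASR transcripts, with the precision and recall measures used in the evaluation. Transcripts, keywords, and relevance labels below are invented for illustration.

```python
# Keyword search over transcripts plus per-keyword precision/recall.
transcripts = {
    1: "please switch the transformer at substation three to standby",
    2: "confirm the breaker status before restoring the feeder line",
    3: "weather report says heavy rain near the north substation",
}
relevant = {"substation": {1, 3}, "breaker": {2}}   # invented ground-truth labels

def search(keyword):
    return {doc_id for doc_id, text in transcripts.items() if keyword in text}

def precision_recall(keyword):
    hits, truth = search(keyword), relevant[keyword]
    tp = len(hits & truth)
    precision = tp / len(hits) if hits else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

for kw in relevant:
    print(kw, precision_recall(kw))
```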