Author

Ryuichi Nisimura

Bio: Ryuichi Nisimura is an academic researcher from Wakayama University. The author has contributed to research on the topics of speech processing and spectral envelope. The author has an h-index of 10 and has co-authored 49 publications receiving 762 citations. Previous affiliations of Ryuichi Nisimura include Nara Institute of Science and Technology.

Papers
Proceedings Article
12 May 2008
TL;DR: A simple new method for estimating temporally stable power spectra is introduced to provide a unified basis for computing an interference-free spectrum, the fundamental frequency (F0), as well as aperiodicity estimation.
Abstract: A simple new method for estimating temporally stable power spectra is introduced to provide a unified basis for computing an interference-free spectrum, the fundamental frequency (F0), as well as aperiodicity estimation. F0 adaptive spectral smoothing and cepstral liftering based on consistent sampling theory are employed for interference-free spectral estimation. A perturbation spectrum, calculated from temporally stable power and interference-free spectra, provides the basis for both F0 and aperiodicity estimation. The proposed approach eliminates ad-hoc parameter tuning and the heavy demand on computational power, from which STRAIGHT has suffered in the past.

339 citations

Proceedings Article
17 May 2004
TL;DR: An automatic approach based on statistical learning is proposed for discriminating between adult and child speakers, enabling flexible spoken dialogue with both adult and child users.
Abstract: The Takemaru-kun system is a real-world speech-oriented guidance system located at the Ikoma-City North Community Center. The system has operated daily since November 2002, providing visitors with a speech interface for information retrieval. It also serves as a field test of a speech interface and as a means of collecting actual utterance data. Analysis and evaluation of the collected utterances reveal that flexible processing is required according to the user's age group. When a system is installed in a public place, the growing number of child users can no longer be disregarded. The paper proposes an automatic approach, based on statistical learning, for discriminating between adult and child speakers. This proposal enables flexible spoken dialogue with both adult and child users. As parameter vectors for machine learning, acoustic and linguistic properties extracted from speech recognition log-likelihood scores are adopted to discriminate a user's age group. Although GMM-based recognition uses only acoustic properties, this method can also consider linguistic properties. In experiments with SVM-based screening, we obtained a 92.4% discrimination rate on actual users' utterances. The advantage of using linguistic properties is also shown. The paper also gives an overview of the Takemaru-kun system and the data collection status from the field test. Child speech recognition performance is evaluated using the collected utterances.

73 citations

Proceedings Article
04 Oct 2004
TL;DR: ICSLP2004: the 8th International Conference on Spoken Language Processing, October 4-8, 2004, Jeju Island, Korea.

61 citations

Proceedings Article
19 Apr 2009
TL;DR: A generalized framework of auditory morphing based on the speech analysis, modification and resynthesis system STRAIGHT is proposed that enables each morphing rate of representational aspects to be a function of time, including the temporal axis itself.
Abstract: A generalized framework of auditory morphing based on the speech analysis, modification and resynthesis system STRAIGHT is proposed that enables each morphing rate of representational aspects to be a function of time, including the temporal axis itself. Two types of algorithms were derived: an incremental algorithm for real-time manipulation of morphing rates and a batch processing algorithm for off-line post-production applications. By defining morphing in terms of the derivative of mapping functions in the logarithmic domain, breakdown of morphing resynthesis found in the previous formulation in the case of extrapolations was eliminated. A method to alleviate perceptual defects in extrapolation is also introduced.

54 citations
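
The auditory morphing abstract above mentions defining morphing through the derivative of mapping functions in the logarithmic domain. The following is only a minimal sketch of why a log-domain blend can help under extrapolation, not the paper's formulation: the function names, the sample derivative values, and the simple cumulative-sum integration are all assumptions made for illustration. A linear blend of two positive time-axis derivatives can turn negative when the morphing rate leaves the range 0 to 1, whereas a blend of their logarithms always stays positive, so the integrated mapping remains monotonic.

```python
import math

def linear_blend(d_a, d_b, rate):
    """Blend two derivative sequences directly (can go negative when extrapolating)."""
    return [(1.0 - rate) * a + rate * b for a, b in zip(d_a, d_b)]

def log_domain_blend(d_a, d_b, rate):
    """Blend the logarithms of two positive derivative sequences, then exponentiate."""
    return [math.exp((1.0 - rate) * math.log(a) + rate * math.log(b))
            for a, b in zip(d_a, d_b)]

def integrate(derivative, step=1.0):
    """Cumulative sum that turns a derivative sequence back into a mapping function."""
    total, mapping = 0.0, [0.0]
    for d in derivative:
        total += d * step
        mapping.append(total)
    return mapping

# Hypothetical time-axis derivatives for two utterances A and B (illustrative values).
d_a = [1.0, 1.0, 1.0]
d_b = [2.0, 0.25, 3.0]
rate = 2.0  # a rate above 1 extrapolates beyond B

linear = linear_blend(d_a, d_b, rate)
logarithmic = log_domain_blend(d_a, d_b, rate)
mapping = integrate(logarithmic)

print("linear blend:    ", linear)
print("log-domain blend:", logarithmic)
print("time mapping:    ", mapping)
```

With these illustrative values the linear blend contains a negative derivative, which would make the time axis run backwards, while the log-domain blend yields a strictly increasing mapping.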

Proceedings Article
10 Dec 2002
TL;DR: This paper describes the speech-related parts of ASKA, a humanoid robot implemented at a university reception desk for computerized university guidance, which can deal with a wide task domain with a large vocabulary of 20k words using a word trigram model and an elaborated speaker-independent acoustic model.
Abstract: We implemented a humanoid robot, ASKA, at our university reception desk for computerized university guidance. ASKA can recognize a user's spoken question and answer it with its text-to-speech voice, hand gestures, and head movements. This paper describes the speech-related parts of ASKA. ASKA can deal with a wide task domain with a large vocabulary of 20k words using a word trigram model and an elaborated speaker-independent acoustic model. ASKA can also respond by detecting keywords and key phrases in the N-best speech recognition results. The word recognition rate for the reception task is 90.9%, and the rate for the out-of-domain task is 78.9%. The correct response rate for the reception task is 61.7%. Users can enjoy question answering with ASKA.

43 citations
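
The ASKA abstract above mentions responding through keyword and key-phrase detection in the N-best speech recognition results. The abstract does not give the actual procedure, so the following is only a minimal sketch of one plausible reading: each hypothesis in the N-best list votes for every keyword it contains, and the response tied to the most-voted keyword is returned. The function name `select_response`, the keyword table, and the sample hypotheses are all hypothetical.

```python
def select_response(n_best, responses):
    """Return the response whose keyword occurs in the most N-best hypotheses.

    n_best:    list of recognition hypotheses (strings), best first.
    responses: dict mapping a keyword or key phrase to a response string.
    Returns None when no keyword is found in any hypothesis.
    """
    votes = {}
    for hypothesis in n_best:
        text = hypothesis.lower()
        for keyword in responses:
            if keyword in text:
                votes[keyword] = votes.get(keyword, 0) + 1
    if not votes:
        return None
    best_keyword = max(votes, key=votes.get)
    return responses[best_keyword]

# Hypothetical keyword table and N-best list (not taken from the paper).
responses = {
    "library": "The library is next to the main gate.",
    "cafeteria": "The cafeteria is on the first floor.",
}
n_best = [
    "where is the library",
    "where is the library today",
    "where is the cafeteria",
]

answer = select_response(n_best, responses)
print(answer)
```

Voting across several hypotheses, rather than trusting only the top one, is one way such a scheme could tolerate recognition errors in individual hypotheses.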


Cited by
Journal Article
TL;DR: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of realtime applications using speech and showed that it was superior to the other systems in terms of both sound quality and processing speed.
Abstract: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. This new speech synthesis system offers not only high sound quality but also fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output against natural speech, including consonants. Its processing speed was also compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real-time factor (RTF) indicated that it was fast enough for real-time processing.
Key words: speech analysis, speech synthesis, vocoder, sound quality, real-time processing

1,025 citations
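
The WORLD abstract above uses the real-time factor (RTF) to argue that the system is fast enough for real-time use. As a small worked example, the RTF is conventionally the processing time divided by the duration of the audio being processed, so a value below 1 means processing keeps up with the incoming signal. The function name and the timing values below are illustrative assumptions, not measurements from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (RTF < 1 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# Illustrative numbers only: 0.5 s of computation for 2.0 s of speech.
rtf = real_time_factor(0.5, 2.0)
is_real_time = rtf < 1.0
print(f"RTF = {rtf:.2f}, real-time capable: {is_real_time}")
```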

01 Jan 2014

872 citations

Journal Article
02 May 2018-Neuron
TL;DR: A core goal of auditory neuroscience is to build quantitative models that predict cortical responses to natural sounds, and hierarchical neural networks for speech and music recognition were optimized to solve ecologically relevant tasks.

403 citations

Proceedings Article
03 May 2010
TL;DR: Experiments show that the interest points in conjunction with a boosted patch classifier are significantly better in detecting body parts in depth images than state-of-the-art sliding-window based detectors.
Abstract: We deal with the problem of detecting and identifying body parts in depth images at video frame rates. Our solution involves a novel interest point detector for mesh and range data that is particularly well suited for analyzing human shape. The interest points, which are based on identifying geodesic extrema on the surface mesh, coincide with salient points of the body, which can be classified as, e.g., hand, foot or head using local shape descriptors. Our approach also provides a natural way of estimating a 3D orientation vector for a given interest point. This can be used to normalize the local shape descriptors to simplify the classification problem as well as to directly estimate the orientation of body parts in space. Experiments involving ground truth labels acquired via an active motion capture system show that our interest points in conjunction with a boosted patch classifier are significantly better in detecting body parts in depth images than state-of-the-art sliding-window based detectors.

335 citations