Showing papers by "Kazuya Takeda published in 1996"


Proceedings ArticleDOI
03 Oct 1996
TL;DR: The variability of Lombard speech under different noise conditions and an adaptation method for the resulting Lombard speech are discussed, and it is shown that five words are enough to learn an interpolated transformation matrix for unknown noise conditions.
Abstract: The variability of Lombard speech under different noise conditions and an adaptation method for the resulting Lombard speech are discussed. For this purpose, various kinds of Lombard speech are recorded under different conditions of noise injected into an earphone, with controlled feedback of the speaker's voice. First, DTW word recognition experiments using clean speech as a reference are performed to show that the higher the noise level becomes, the more seriously the utterance is affected. Second, linear transformation of the cepstral feature vector is tested to show that, given enough training data (more than 100 words), the transformation matrix can be correctly learned for each of the noise conditions. Interpolation of the transformation matrix is then proposed in order to reduce the number of adaptation parameters and training samples. The authors show, finally, that five words are enough to learn the interpolated transformation matrix for unknown noise conditions.
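
The adaptation scheme lends itself to a short sketch. The following Python fragment is a minimal illustration under simplified assumptions, not the paper's implementation: a cepstral transformation matrix is learned per noise condition by least squares over paired feature vectors, and matrices for an unseen noise level are obtained by linear interpolation. All function names and the random stand-in data are hypothetical.

```python
import numpy as np

def learn_transform(lombard, clean):
    """Least-squares matrix mapping Lombard cepstra to clean cepstra.
    lombard, clean: (n_frames, n_ceps) arrays of paired feature vectors."""
    A, *_ = np.linalg.lstsq(lombard, clean, rcond=None)
    return A  # (n_ceps, n_ceps)

def interpolate_transform(A_low, A_high, noise_db, low_db, high_db):
    """Linearly interpolate transforms learned at two known noise levels."""
    w = (noise_db - low_db) / (high_db - low_db)
    return (1.0 - w) * A_low + w * A_high

# Stand-in data; in practice these would be paired cepstra per condition.
rng = np.random.default_rng(0)
clean = rng.standard_normal((200, 12))
lombard_70 = clean + 0.1 * clean @ rng.standard_normal((12, 12))
lombard_90 = clean + 0.2 * clean @ rng.standard_normal((12, 12))
A70 = learn_transform(lombard_70, clean)
A90 = learn_transform(lombard_90, clean)
A80 = interpolate_transform(A70, A90, noise_db=80, low_db=70, high_db=90)
adapted = lombard_70 @ A80  # features mapped toward the clean space
```

Because interpolation leaves only a small number of free parameters per condition, far fewer adaptation samples are needed than for learning a full matrix, which is the motivation behind the paper's five-word result.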

29 citations


Proceedings ArticleDOI
03 Oct 1996
TL;DR: It is shown that the amplitude distribution of the difference signal of HSLN approaches a Gaussian distribution from a Gamma distribution as the number of superpositions increases, which clarifies that the temporal change of the spectral envelope plays an important role in discriminating speech from noise.
Abstract: Human speech-like noise (HSLN) is a kind of babble noise generated by superimposing independent speech signals, typically more than one thousand times. Since the basic character of HSLN varies from that of overlapped speech to that of stationary noise while its long-time spectrum keeps the same shape, we investigate the perceptual discrimination of speech from stationary noise, and its acoustic correlates, using HSLN with various numbers of superpositions. First, we confirm through subjective tests that the perceptual score, i.e., how much the HSLN sounds like stationary noise, is proportional to the number of superpositions. Then, we show that the amplitude distribution of the difference signal of HSLN approaches a Gaussian distribution from a Gamma distribution as the number of superpositions increases. A further subjective test on three HSLN stimuli with different dynamic characteristics clarifies that the temporal change of the spectral envelope plays an important role in discriminating speech from noise.
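
The Gamma-to-Gaussian convergence can be made concrete with a small numpy experiment. This is a sketch under simplified assumptions: real speech is replaced by a heavy-tailed Gamma-distributed stand-in signal, and HSLN is simulated by summing randomly shifted copies of it. By the central limit theorem, the excess kurtosis of the difference signal should fall toward 0 (the Gaussian value) as the number of superpositions grows.

```python
import numpy as np

def hsln(signal, n_super, rng):
    """Simulate HSLN by superimposing n_super randomly shifted copies."""
    out = np.zeros(len(signal))
    for _ in range(n_super):
        out += np.roll(signal, rng.integers(len(signal)))
    return out / np.sqrt(n_super)  # keep the variance comparable

rng = np.random.default_rng(0)
# Heavy-tailed stand-in for a speech signal: Gamma magnitudes, random sign.
speech = rng.gamma(0.5, 1.0, 48000) * rng.choice([-1.0, 1.0], 48000)
for n in (1, 10, 100, 1000):
    d = np.diff(hsln(speech, n, rng))            # difference signal
    kurt = np.mean(d**4) / np.mean(d**2)**2 - 3  # excess kurtosis, 0 = Gaussian
    print(f"n={n:5d}  excess kurtosis: {kurt:6.2f}")
```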

19 citations


Journal ArticleDOI
TL;DR: In this paper, a three-layer neural network is trained for the nonlinear method, whereas basic linear interpolation is used for the linear method, and the signal-to-deviation ratio (SDR) between the measured and interpolated HRTFs is calculated for objective evaluation of the methods.
Abstract: Two (linear and nonlinear) interpolation methods of the head‐related transfer function (HRTF) are exploited in order to realize virtual auditory localization. In both methods, the HRTFs of the left and right ears are represented by a delay time and a common impulse response, where the delay time is determined so that the cross correlation of the two HRTFs takes its maximum value. A three‐layer neural network is trained for the nonlinear method, whereas basic linear interpolation is used for the linear method. Evaluation tests are performed using HRTF prototypes Web‐published by the MIT Media Lab. The signal‐to‐deviation ratio (SDR) between the measured and interpolated HRTFs is calculated for objective evaluation of the methods. The SDR of the nonlinear method, 50 to 70 dB, is much better than that of the linear method, 5 to 30 dB. On the other hand, there is no significant difference in the subjective evaluation of localizing earphone‐presented sounds generated with the two interpolated HRTFs. Furthermore, the results of these subjective tests are nearly identical to those obtained with the measured HRTFs.
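
The delay-plus-common-impulse-response representation and the SDR metric are easy to state in code. The sketch below is an assumption-laden paraphrase rather than the paper's implementation (its exact windowing and network are not reproduced): the delay is the lag that maximizes the cross-correlation of two head-related impulse responses, and interp_linear stands in for the paper's linear method.

```python
import numpy as np

def align_delay(h_ref, h):
    """Lag (in samples) that maximizes the cross-correlation of two HRIRs."""
    xc = np.correlate(h, h_ref, mode="full")
    return int(np.argmax(xc)) - (len(h_ref) - 1)

def interp_linear(h_a, h_b, w):
    """Linear interpolation of two delay-aligned impulse responses."""
    return (1.0 - w) * h_a + w * h_b

def sdr_db(h_meas, h_interp):
    """Signal-to-deviation ratio of an interpolated HRTF vs. the measurement."""
    dev = h_meas - h_interp
    return 10.0 * np.log10(np.sum(h_meas**2) / np.sum(dev**2))
```

The nonlinear variant would replace interp_linear with a trained three-layer network that predicts the common impulse response from the source direction.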

16 citations


Journal Article
TL;DR: A parametric representation for the vocal-tract log-area function that is directly and simply related to basic acoustic characteristics of the human vocal tract is proposed.
Abstract: The objective of this paper is to find a parametric representation for the vocal-tract log-area function that is directly and simply related to basic acoustic characteristics of the human vocal tract. The importance of this representation is associated with the solution of the articulatory-to-acoustic inverse problem, where a simple mapping from the articulatory space onto the acoustic space can be very useful. The method is as follows: First, given a corpus of log-area functions, a parametric model is derived following a factor analysis technique. After that, the articulatory space, defined by the parametric model, is filled with approximately uniformly distributed points, and the corresponding first three formant frequencies are calculated. These formants define an acoustic space onto which the articulatory space maps. In the next step, an independent component analysis technique is used to determine acoustic and articulatory coordinate systems whose components are as independent as possible. Finally, using singular value decomposition, the acoustic and articulatory coordinate systems are rotated so that each of the first three components of the articulatory space has major influence on one, and only one, component of the acoustic space. An example showing how the proposed model can be applied to the solution of the articulatory-to-acoustic inverse problem is given at the end of the paper.
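
The four-step pipeline (factor analysis, sampling the articulatory space, ICA, SVD rotation) can be sketched structurally in a few lines. The fragment below is a sketch only: PCA via SVD substitutes for the paper's factor analysis, a random linear map substitutes for the formant computation, and scikit-learn's FastICA supplies the independence step. All data here are random stand-ins.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Stand-in corpus: (n_samples, n_sections) vocal-tract log-area functions.
corpus = rng.standard_normal((500, 20))

# Step 1: factor-analysis-like reduction, keep the top 3 components via SVD.
U, S, Vt = np.linalg.svd(corpus - corpus.mean(axis=0), full_matrices=False)
artic = U[:, :3] * S[:3]                       # 3-parameter articulatory space

# Step 2: in the paper, F1-F3 are computed for each sampled point; here a
# random linear map stands in for that acoustic simulation.
acoustic = artic @ rng.standard_normal((3, 3))

# Step 3: ICA makes the coordinates of each space as independent as possible.
artic_ic = FastICA(n_components=3, random_state=0).fit_transform(artic)
acoust_ic = FastICA(n_components=3, random_state=0).fit_transform(acoustic)

# Step 4: SVD of the cross-covariance rotates both spaces so that each
# articulatory component dominates one, and only one, acoustic component.
Ua, _, Vta = np.linalg.svd(artic_ic.T @ acoust_ic)
artic_rot = artic_ic @ Ua
acoust_rot = acoust_ic @ Vta.T
```

The final SVD diagonalizes the cross-covariance between the two coordinate systems, which is precisely what gives each articulatory component a major influence on a single acoustic component.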

12 citations


Patent
24 Sep 1996
TL;DR: In this patent, an inverse filter with a characteristic inverse to the transmission characteristic of the articulatory organs is used to recover the voice-source waveform, whose spectrum intensity pattern is then used to perform speaker recognition without being affected by differences in phonemes.
Abstract: PROBLEM TO BE SOLVED: To perform speaker recognition precisely, without being affected by differences in phonemes. SOLUTION: An inverse filter 12, whose characteristic is the inverse of the transmission characteristic of the articulatory organs, is set from the result of a linear predictive analysis; the input voice waveform is filtered by this inverse filter 12 to generate a voice-source waveform, and from the obtained voice-source waveform a spectrum intensity pattern is obtained through a frequency analysis part 14. The spectrum intensity pattern of this voice-source waveform is divided into n frequency bands F1-Fn by a filter bank part 15, and evaluation values V1-Vn representing the ruggedness of the spectrum intensity are calculated for every frequency band by a harmonic structure analysis part 16. The evaluation value Vi in each frequency band Fi is used as the feature pattern for speaker recognition.
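
A rough Python rendering of the claimed pipeline follows. All parameter choices (LPC order, band count, and the ruggedness measure, here the standard deviation of the log magnitude per band) are assumptions for illustration, not values from the patent.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def source_band_ruggedness(y, order=16, n_bands=8):
    """Sketch of the patent's pipeline: LPC inverse filtering recovers a
    voice-source (residual) waveform, whose spectrum is split into bands;
    a per-band 'ruggedness' value is returned as the speaker feature."""
    a = librosa.lpc(y, order=order)        # prediction coefficients [1, a1, ...]
    residual = lfilter(a, [1.0], y)        # inverse (whitening) filter
    spec = np.abs(np.fft.rfft(residual))
    bands = np.array_split(spec, n_bands)  # filter-bank split into F1..Fn
    # One plausible ruggedness measure: spread of the log band magnitudes.
    return [float(np.std(np.log(b + 1e-12))) for b in bands]
```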

6 citations


Proceedings ArticleDOI
03 Oct 1996
TL;DR: In a real computer room, it is shown that SBXCOR is more robust than the two-channel-summed SBCOR, and the resulting performance is much better than that of the smoothed group delay spectrum and the mel-frequency cepstral coefficient.
Abstract: This paper describes subband-crosscorrelation (SBXCOR) analysis using two-channel signals. SBXCOR analysis extends subband-autocorrelation (SBCOR) analysis, a signal processing technique that extracts periodicities present in speech signals. In this paper, the performance of SBXCOR is investigated using a DTW word recognizer, under computer-simulated acoustic conditions and a real environmental condition. Under the simulated condition, it is assumed that the speech signals in the two channels are perfectly synchronized while the noises are uncorrelated; consequently, the effective signal-to-noise ratio of the signal generated by simply summing the two channels is raised by about 3 dB. In that case, it is shown that SBXCOR is less robust than SBCOR extracted from the two-channel-summed signal, but more robust than conventional one-channel SBCOR. The resulting performance was much better than that of the smoothed group delay spectrum and the mel-frequency cepstral coefficient. In a real computer room, it is shown that SBXCOR is more robust than the two-channel-summed SBCOR.
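
As a sketch of the idea (not the authors' exact analysis, whose subband filters and normalization are not specified here), SBXCOR can be approximated by band-pass filtering both channels and taking the peak normalized cross-correlation within each band:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def sbxcor(x1, x2, sr, edges, max_lag):
    """Subband cross-correlation sketch: for each band, the peak of the
    normalized cross-correlation of the two channels over +/- max_lag."""
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        b1, b2 = sosfiltfilt(sos, x1), sosfiltfilt(sos, x2)
        xc = np.correlate(b1, b2, mode="full")
        mid = len(b1) - 1
        xc = xc[mid - max_lag : mid + max_lag + 1]
        xc /= np.sqrt(np.sum(b1**2) * np.sum(b2**2)) + 1e-12
        feats.append(float(xc.max()))
    return np.array(feats)

# Example band edges in Hz (hypothetical): sbxcor(x1, x2, 16000,
#   edges=[100, 300, 700, 1500, 3000], max_lag=160)
```

Passing the same signal for both channels reduces this to a subband autocorrelation, i.e., a one-channel SBCOR-like measure.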

5 citations


Patent
07 May 1996
TL;DR: In this patent, the authors propose a method to correctly detect the end of an utterance (on-hook) and improve the performance of continuous speech recognition, without erroneously judging that the user has stopped talking when a long silence interval occurs within a sentence.
Abstract: PURPOSE: To correctly detect the end of an utterance (on-hook) and to improve the performance of continuous voice recognition, without erroneously judging that the user has stopped talking when a long silence interval occurs within a sentence. CONSTITUTION: A partial sentence generating section 9 obtains each partial sentence and its collating score, as a partial sentence collating result 9A, from the word collating result 7A of the input voices against the word standard patterns. An on-hook detecting section 13 checks whether a first condition is satisfied: the collating score of a partial sentence accepted by the grammar rule 5A is the maximum among the collating scores of all partial sentences. It also checks whether a second condition is met: the duration of input voice judged to coincide with the standard pattern of silence exceeds a preset prescribed time. The time at which the first and second conditions are simultaneously satisfied is judged to be the on-hook point of the conversation, and an on-hook detecting signal 13A is transmitted to a word prediction section 6 and a recognition result output section 10.
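
The two-condition test reduces to a few lines of logic. This is a hypothetical paraphrase: partials, score, accepted_by_grammar, and the 800 ms threshold are illustrative names and values, not taken from the patent.

```python
def detect_on_hook(partials, silence_ms, min_silence_ms=800):
    """On-hook is declared when (1) the best-scoring partial sentence is one
    accepted by the grammar, and (2) the silence matched against the silence
    standard pattern has lasted longer than a preset time."""
    best = max(partials, key=lambda p: p["score"])
    return best["accepted_by_grammar"] and silence_ms >= min_silence_ms

# Example: a grammatically complete hypothesis plus 900 ms of silence.
partials = [
    {"text": "set an alarm for seven", "score": 0.91, "accepted_by_grammar": True},
    {"text": "set an alarm for", "score": 0.88, "accepted_by_grammar": False},
]
print(detect_on_hook(partials, silence_ms=900))  # True
```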

4 citations


Patent
12 Jan 1996
TL;DR: In this patent, the causal relationship between the system announcement and the user's speech starting time is exploited to perform high-precision voice recognition without requiring precise detection of the speech starting time.
Abstract: PURPOSE: To perform high-precision voice recognition, by utilizing the causal relationship between the system announcement and the user's speech starting time, without having to detect the speech starting time precisely or restrict it to a single point. CONSTITUTION: A silence interval is provided in the system announcement, and the user's speech starting time is thereby controlled by an announcement uttering device 2. In conjunction with this, a computing section 12 computes a first speech-starting-point likelihood a(t) from a previously prepared starting-time prediction distribution. A computing section 14 computes a second speech-starting-point likelihood b(t) from feature parameters 13 for speech detection, and a computing section 15 computes a third speech-starting-point likelihood α(t) from the two. The speech starting time is determined by comparing α(t) with a reference value Ref; a switch 17 is then turned on and voice recognition is initiated.
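
In code, the combination step might look like the sketch below. The patent does not state how α(t) is computed from a(t) and b(t); the product rule here is an assumption, as are all names.

```python
import numpy as np

def detect_start(a, b, ref):
    """Combine a prior start-time likelihood a(t) with an acoustic likelihood
    b(t) into alpha(t), and trigger recognition at the first frame where
    alpha(t) exceeds the reference value Ref. Returns the frame index,
    or None if the threshold is never crossed."""
    alpha = a * b                      # assumed combination rule
    hits = np.flatnonzero(alpha > ref)
    return int(hits[0]) if hits.size else None
```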

3 citations


Journal ArticleDOI
TL;DR: In this paper, a blind separation method based on the assumption that the source signals are statistically independent of each other is evaluated under real conditions, and the results show that the signal-to-deviation ratio (SDR) between the source and the separated signals is proportional to the SNR of the source signals.
Abstract: The blind separation method proposed by Bell et al. [Proc. ICASSP 95 (1995)] is a novel approach to source separation based only upon the assumption that the source signals are statistically independent of each other. In that paper, only simple computer-simulation results were described. Here, the conditions of the experiments are extended and the method is evaluated under real conditions. In the first experiment, the SNRs of the source signals are controlled by adding white Gaussian noise in order to clarify to what extent the source signals can be regarded as independent. The results clarify that the SDR (i.e., the signal-to-deviation ratio between the source and the separated signals) is proportional to the SNR of the source signals when the SNR is below 20 dB, and saturates at 40 dB when the SNR is better than 20 dB. In the second experiment, the lengths of the signals are controlled to find a sufficient source-signal length for processing. From the results it is found that the source-signal length must be more than 2000 points to achieve an SDR of 30 dB.
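
The evaluation metric is straightforward to reproduce. Below is a minimal SDR function with one assumption the abstract implies but does not state: since blind separation recovers sources only up to scale, the separated signal is first scale-aligned to the source by least squares.

```python
import numpy as np

def sdr_db(source, separated):
    """Signal-to-deviation ratio between a source and its separated estimate,
    after least-squares scale alignment of the estimate."""
    g = np.dot(separated, source) / np.dot(separated, separated)
    dev = source - g * separated
    return 10.0 * np.log10(np.sum(source**2) / np.sum(dev**2))
```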

2 citations


Journal ArticleDOI
TL;DR: In this paper, a zero-crossing analysis is proposed as a fine and robust time-domain preprocessing step for direction-of-arrival (DOA) finding.
Abstract: Since time precision in finding delays between channels is one of the most important issues in estimating the direction of arrival (DOA) from multichannel signals, a zero‐crossing analysis is proposed as a fine and robust time‐domain preprocessing step for DOA finding. As a preliminary evaluation of noise robustness, it is investigated how far the zero‐crossing points move from their original positions in the presence of noise. For test speech, a 48‐kHz sampling of a male voice (upsampled from an 8‐kHz signal) is used, after adding machine‐generated white noise so that the SNR of the signal becomes 0 to 20 dB. The robustness of the zero‐crossing analysis can be concluded from the summarized results: (1) the mean value of the displacement is less than 100 μs throughout the SNR range; and (2) the standard deviation of the distribution becomes large as the SNR decreases, but remains less than 0.125 even under the 0 dB SNR condition.
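
The robustness measure can be restated in a few lines of numpy. This sketch measures, for each zero-crossing of the noisy signal, the distance to the nearest zero-crossing of the clean signal; the function names and the brute-force nearest-neighbor search are illustrative choices, not the paper's.

```python
import numpy as np

def zero_crossings(x):
    """Sample indices where the signal changes sign."""
    return np.flatnonzero(np.signbit(x[:-1]) != np.signbit(x[1:]))

def crossing_shift_us(clean, noisy, sr):
    """Distance (in microseconds) from each noisy-signal zero-crossing to
    the nearest clean-signal zero-crossing."""
    zc_c, zc_n = zero_crossings(clean), zero_crossings(noisy)
    d = np.abs(zc_n[:, None] - zc_c[None, :]).min(axis=1)
    return d / sr * 1e6

# At 48 kHz, the paper reports mean shifts under 100 us even near 0 dB SNR.
```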

1 citation


Patent
20 Aug 1996
TL;DR: In this patent, the authors propose an environment-adaptive method for improving the recognition rate of speech recognition using statistical models, in which the non-speech model 13B is rebuilt from noise-superposed training data before recognition starts and is then used together with the voice model 6A.
Abstract: PURPOSE: To provide an environment-adaptive method for improving the recognition rate of a speech recognition system using statistical models. CONSTITUTION: Of the statistical models, namely a voice model 6A and a non-speech model 13B, the voice model 6A is made in advance from the speech data for learning 4, in the same manner as the conventional method, while the non-speech model 13B is made from data 12 obtained by superposing the environmental noise data for learning 10 onto the speech data for learning 4, before speech recognition starts. Then, the non-speech model 13B and the voice model 6A made from the speech data for learning 4 are used in the speech recognition.
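
A condensed sketch of the adaptation step: environment noise recorded just before recognition is superposed on the training speech at some SNR, and a simple statistical model is refit on the mixed data. The GMM on frame log-energies below is a stand-in for the patent's non-speech model; the target SNR, frame size, and all names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def make_nonspeech_model(train_speech, env_noise, snr_db=10.0, frame=256):
    """Superpose recorded environment noise on the training speech at a
    target SNR, then fit a GMM on frame log-energies as a stand-in for
    the patent's non-speech statistical model."""
    n = np.resize(env_noise, train_speech.shape)      # tile noise to length
    g = np.sqrt(np.sum(train_speech**2) / (np.sum(n**2) * 10 ** (snr_db / 10)))
    mixed = train_speech + g * n                      # noise-superposed data
    frames = mixed[: len(mixed) // frame * frame].reshape(-1, frame)
    feats = np.log(np.sum(frames**2, axis=1, keepdims=True) + 1e-12)
    return GaussianMixture(n_components=4, random_state=0).fit(feats)
```

Because only the non-speech model is rebuilt, adaptation can run quickly at deployment time while the voice model remains the one trained offline.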