
Showing papers in "Acoustical Science and Technology in 2006"


Journal ArticleDOI
TL;DR: This review outlines the historical background, architecture, underlying principles, and representative applications of STRAIGHT.
Abstract: STRAIGHT, a speech analysis, modification, and synthesis system, is an extension of the classical channel VOCODER that exploits the advantages of progress in information processing technologies and a new conceptualization of the role of repetitive structures in speech sounds. This review outlines the historical background, architecture, underlying principles, and representative applications of STRAIGHT.

269 citations


Journal ArticleDOI
TL;DR: The phonemic restoration effect as discussed by the authors reveals the sophisticated capability of the brain underlying robust speech perception in noisy situations often encountered in daily life, and basic aspects of the phonemic restoration effect are described with audio demonstrations.
Abstract: Under certain conditions, sounds actually missing from a speech signal can be synthesized by the brain and clearly heard. This illusory phenomenon, known as the phonemic restoration effect, reveals the sophisticated capability of the brain underlying robust speech perception in noisy situations often encountered in daily life. In this article, basic aspects of the phonemic restoration effect are described with audio demonstrations.

68 citations


Journal ArticleDOI
Yukio Iwaya
TL;DR: In this article, the authors proposed an individualization method of HRTFs called the Determination method of OptimuM Impulse-response by Sound Orientation (DOMISO), which can be applied to a virtual auditory display (VAD).
Abstract: A virtual auditory display (VAD) is a system for generating spatialized sound for a listener. Commonly, VAD techniques are based on convolving head-related transfer functions (HRTFs) with a sound source. When the HRTFs in a VAD are not fitted to the specific listener, localization accuracy is often low, with large localization errors typically appearing as frequent front-back confusion. However, measuring HRTFs for each listener for all sound-source directions requires a special measuring apparatus and a long measurement time, and imposes a physical load on the listener. The author has therefore proposed an individualization method of HRTFs called the Determination method of OptimuM Impulse-response by Sound Orientation (DOMISO). In this paper, DOMISO and its effects are introduced.
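
The core VAD operation the abstract refers to, convolving a source signal with head-related impulse responses (HRIRs), can be sketched in a few lines of Python. This is only a minimal illustration: the placeholder HRIRs below stand in for measured or DOMISO-selected responses.

```python
# Minimal sketch of binaural rendering for a VAD: spatialize a monaural
# source by convolving it with a left/right pair of HRIRs.
import numpy as np
from scipy.signal import fftconvolve

fs = 48_000
source = np.random.randn(fs)              # 1 s of noise as a test source
hrir_left = np.random.randn(512) * 0.01   # placeholder HRIRs; real ones are
hrir_right = np.random.randn(512) * 0.01  # measured or individually selected

left = fftconvolve(source, hrir_left)     # binaural rendering = convolution
right = fftconvolve(source, hrir_right)
binaural = np.stack([left, right], axis=1)
```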

59 citations


Journal ArticleDOI
TL;DR: In this study, the differences in the effectiveness of various Japanese sounds in identifying speakers were investigated, and the stimuli used in the experiment were analysed in order to explain these differences in terms of acoustical distances.
Abstract: 1. Introduction Speech sounds convey not only linguistic or phonological information but also nonlinguistic information, including the speaker's individuality [1]. It is known that the usefulness of speech content for speaker identification differs depending on the types of sounds it contains, and it has been reported that voiced sonorants, such as vowels and nasals, are the most effective for speaker identification by both humans [2–4] and machines [5]. The speaker's individuality contained in speech sounds should have acoustic correlates, whose properties can be measured as acoustic parameters [6]. In this study, we conducted a human speaker identification test and investigated the differences in the effectiveness of various Japanese sounds in identifying the speakers. We also analysed the stimuli used in the experiment in order to explain these differences in terms of acoustical distances.
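
The excerpt does not specify which distance measure the authors used; a common choice for such "acoustical distances" is a cepstral distance between spectral envelopes. A minimal sketch of that idea, with synthetic frames standing in for the vowel stimuli:

```python
import numpy as np

def cepstral_distance(x, y, n_coef=12, n_fft=512):
    """Euclidean distance between low-order real cepstra of two frames."""
    def cepstrum(sig):
        spec = np.abs(np.fft.rfft(sig, n_fft)) + 1e-12
        return np.fft.irfft(np.log(spec))[1:n_coef + 1]
    return np.linalg.norm(cepstrum(x) - cepstrum(y))

# Example: distance between two synthetic vowel-like frames
t = np.arange(512) / 16_000
frame_a = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
frame_b = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 360 * t)
print(cepstral_distance(frame_a, frame_b))
```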

39 citations


Journal ArticleDOI
TL;DR: In this paper, sounds of the closing of the doors of various passenger cars were recorded and presented via headphones in a sound-proof room, and German and Japanese groups of participants formed mental images of the cars involved.
Abstract: It is reported that the sound of a car door closing is one of the main factors to determine the overall impression of the car. Much effort has been made to improve the quality of this sound. In this study, sounds of the closing of the doors of various passenger cars were recorded and presented via headphones in a sound-proof room. Based on these sounds, German and Japanese groups of participants formed mental images of the cars involved. They were also asked to evaluate the quality of the sounds. Generally speaking, though there were some differences, similar results were obtained with both groups of participants. It was found that the impressions of the sound quality varied considerably and that there was a correlation between the impression of the sound and the mental image of the car. It was suggested that the image of a car is related to the sound of the car.

36 citations


Journal ArticleDOI
TL;DR: In this paper, a 2AFC procedure combined with a 3-down 1-up transformed up-down method was employed to obtain threshold values that were less affected by the listener's criterion of judgment.

Abstract: Hearing thresholds for pure tones from 2 kHz to 28 kHz were measured. A 2AFC procedure combined with a 3-down 1-up transformed up-down method was employed to obtain threshold values that were less affected by the listener's criterion of judgment. From some listeners, threshold values of 88 dB SPL or higher were obtained for a tone at 24 kHz, whereas thresholds could not be obtained from any participant at 26 kHz or above. Furthermore, thresholds were also measured under masking by a noise low-pass filtered at 20 kHz. At frequencies above 20 kHz, the threshold values with and without the masking noise differed by only a few decibels, indicating that tone detection was not affected by subharmonic components that might have appeared in lower frequency regions. The measurements also showed that the threshold increased rather gradually for tones from 20 to 24 kHz, whereas it increased sharply from 14 to 20 kHz.
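
The 3-down 1-up transformed up-down rule converges on the level yielding about 79.4% correct responses. A minimal simulation sketch of such a track follows; the step size, reversal count, and simulated psychometric function are illustrative choices, not the paper's.

```python
import random

def staircase_3down1up(present_trial, start_db=60.0, step_db=2.0, n_reversals=8):
    """Generic 3-down 1-up transformed up-down track (targets ~79.4% correct).
    `present_trial(level)` must run one 2AFC trial and return True if correct."""
    level, correct_run, reversals, last_dir = start_db, 0, [], None
    while len(reversals) < n_reversals:
        if present_trial(level):
            correct_run += 1
            if correct_run == 3:            # three correct in a row: step down
                correct_run = 0
                if last_dir == 'up':
                    reversals.append(level)
                level, last_dir = level - step_db, 'down'
        else:                               # one wrong answer: step up
            correct_run = 0
            if last_dir == 'down':
                reversals.append(level)
            level, last_dir = level + step_db, 'up'
    return sum(reversals) / len(reversals)  # threshold estimate

# Simulated 2AFC listener (50% chance below threshold) with a 70 dB threshold
sim = lambda level: random.random() < (0.5 if level < 70 else 0.95)
print(staircase_3down1up(sim))
```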

33 citations


Journal ArticleDOI
TL;DR: The relationship between the viscous boundary layer and the resonance frequency of the sound generated in a loop-tube-type thermoacoustic cooling system is investigated in this article.
Abstract: The relationship between the viscous boundary layer and the resonance frequency of the sound generated in a loop-tube-type thermoacoustic cooling system is investigated. The frequency of the sound was observed for various loop-tube lengths, inner pressures, and working fluids, and the influence of the viscous boundary layer upon the resonance frequency is discussed. It has generally been considered that the sound generated in the loop tube resonates such that one wavelength fits the tube length. Under certain conditions, however, two wavelengths fit the tube length. This results from the influence of the viscous boundary layer. It is found that the resonance frequency of the loop tube is determined such that the thickness of the viscous boundary layer becomes smaller than the stack channel radius; as a result, the tube resonates at two wavelengths under certain conditions. The frequency is an important parameter for a thermoacoustic cooling system, and from the obtained results, one of the factors governing its selection is identified.
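
The viscous boundary layer thickness referred to here is conventionally the viscous penetration depth, delta_nu = sqrt(2*nu/omega); since it shrinks as frequency rises, a higher resonance mode can keep it below the stack channel radius. A quick illustrative calculation (the fluid properties and frequencies are example values only):

```python
import numpy as np

def viscous_boundary_layer(freq_hz, kinematic_viscosity):
    """delta_nu = sqrt(2*nu/omega), the standard viscous penetration depth."""
    return np.sqrt(2.0 * kinematic_viscosity / (2.0 * np.pi * freq_hz))

nu_air = 1.5e-5   # m^2/s, air at ~20 degC and atmospheric pressure
for f in (100.0, 200.0, 400.0):  # doubling f (one -> two wavelengths) shrinks delta
    print(f, "Hz:", viscous_boundary_layer(f, nu_air) * 1e3, "mm")
```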

30 citations



Journal ArticleDOI
Takayuki Arai
TL;DR: In this paper, the author designs a three-tube model with a simple mechanism to produce several different vowels; in earlier models such as Umeda and Teranishi's, the vocal-tract shape is set using 10-mm (or 15-mm) thick plastic strips closely inserted from one side.
Abstract: 1. Introduction A series of physical models of the human vocal tract for education in acoustics and speech science has been proposed by our group [1–8], and we have successfully shown the effectiveness of hands-on activities for an intuitive understanding of the mechanisms and phenomena. Arai [1] replicated Chiba and Kajiyama's physical models of the human vocal tract on the basis of their measurements [9]. Arai's model [1] consists of cylinder-type and plate-type vocal-tract models; they are simple but offer a powerful demonstration of vowel production with a sound source such as an electrolarynx or a whistle-type artificial larynx. The driver unit of a horn speaker can also be used as a transducer to produce an arbitrary sound source. One can feed signals to the driver unit not only from an oscillator, but also from a computer through a digital/analog converter and an amplifier, so that any arbitrary signal can serve as a source signal. We have recently shown additional physical models of the human vocal tract that are useful for education. We have presented Umeda and Teranishi's model [10] with several sound sources fed through a driver unit in pedagogical situations [11]. In this model, one can change the cross-sectional areas by moving 10-mm (or 15-mm) thick plastic strips, closely inserted from one side. In Arai [8], we further extended our previously proposed physical models of the human vocal tract to lung models and head-shaped models. The head-shaped models can produce vowel sounds and provide a visual demonstration of how the vocal tract is positioned in the head. The lung models with the whistle-type artificial larynx give a visual demonstration of the human respiratory system and phonation. In one extended version of the head-shaped model [8], the tongue can be manipulated by hand, so many different vowels can be produced with the model by changing the position of the tongue. In Umeda and Teranishi's model [10], the shape of the vocal tract can also be changed by moving a set of thick plastic strips. None of these models, however, offered a simple way to change the vocal-tract shape in order to produce different vowel sounds; in other words, the previous models required many degrees of freedom to control the vocal-tract shape. In this study, we design a three-tube model with a simple mechanism to produce several different vowels.
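
For context, the resonances that such tube models manipulate can be estimated from the textbook uniform-tube baseline, f_k = (2k-1)c/(4L) for a tube closed at the glottis end and open at the lips; a three-tube model shifts these by changing section areas. A small sketch under those standard assumptions (not the paper's specific geometry):

```python
import numpy as np

def closed_open_tube_formants(length_m, c=350.0, n=3):
    """Resonances of a uniform tube closed at one end (glottis) and open at
    the other (lips): f_k = (2k-1)c / (4L). A crude baseline that a
    three-tube model perturbs by changing the section areas."""
    k = np.arange(1, n + 1)
    return (2 * k - 1) * c / (4.0 * length_m)

print(closed_open_tube_formants(0.17))  # ~515, 1544, 2574 Hz: neutral vowel
```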

20 citations



Journal ArticleDOI
TL;DR: In this article, the authors defined the instantaneous frequency as the quantity obtained by converting a real-valued signal into a complex analytic signal and differentiating the time-dependent phase with respect to time.
Abstract: In this paper, the instantaneous frequency is defined as the quantity obtained by converting a real-valued time signal into a complex analytic signal and differentiating the time-dependent phase with respect to time. Theoretical expressions of the instantaneous frequencies of signals given as combinations of sinusoidal components are presented. These are compared with the results obtained by numerical methods using the discrete Fourier transform in order to confirm the validity of the expressions and to check the accuracy of the numerical methods. A reason for the existence of negative instantaneous frequencies is given by a vector representation of the signal components. The instantaneous frequencies of frequency- and amplitude-modulated signals are also discussed. For a periodic sinusoidal burst signal, it was found that the instantaneous frequency stays at zero during the period when the signal is zero and takes a value equal to twice the frequency of the sinusoid at the onset of the signal.
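
The definition in the abstract maps directly onto a numerical recipe: form the analytic signal with a Hilbert transform and differentiate the unwrapped phase. A minimal sketch (the chirp test signal and sampling rate are illustrative):

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_frequency(x, fs):
    """IF as defined above: phase derivative of the analytic signal."""
    analytic = hilbert(x)                       # x + j * Hilbert(x)
    phase = np.unwrap(np.angle(analytic))
    return np.diff(phase) * fs / (2.0 * np.pi)  # in Hz

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * (100 * t + 50 * t ** 2))  # chirp: 100 Hz -> 200 Hz
f_inst = instantaneous_frequency(x, fs)
print(f_inst[100], f_inst[-100])                 # rises from ~100 to ~200 Hz
```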

Journal ArticleDOI
TL;DR: In this paper, psychoacoustical experiments using onomatopoeic representations were conducted to clarify the features of environmental sounds, and three factors obtained were the emotion factor, the clearness factor, and the powerfulness factor.
Abstract: In order to clarify the features of environmental sounds, psychoacoustical experiments using onomatopoeic representations were conducted. The onomatopoeic representations obtained for each sound and participant were expressed using phonetic parameters, such as place of articulation, manner of articulation, and vowels. The onomatopoeic representations were classified based on similarities of the phonetic parameters using hierarchical cluster analysis. As a result, similar acoustic properties were identified in the stimuli expressed by onomatopoeic representations classified into the same clusters. Furthermore, the auditory impressions associated with the stimuli were measured by the semantic differential method using 13 adjective pairs. Factor analysis was applied to the average ratings for each sound on each scale. The three factors obtained were an emotion factor, a clearness factor, and a powerfulness factor. From these results, the relationships among the acoustic properties of the sound stimuli, the impressions associated with them, and their onomatopoeic features are discussed. For example, onomatopoeic representations including voiced consonants were frequently used for sounds with spectral components below about 1 kHz, inducing “powerfulness” and “darkness, dullness, and muddiness” impressions. Furthermore, similar relationships were found in supplementary experiments using various band noises.
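
The classification step can be sketched with standard tools: encode each onomatopoeic token as a phonetic feature vector and apply hierarchical clustering. The feature coding below is a toy stand-in for the paper's scheme, not its actual parameter set:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy phonetic feature vectors for onomatopoeic tokens
# (voiced?, plosive?, fricative?, back vowel?) -- illustrative coding only.
tokens = {"goro": [1, 1, 0, 1], "koro": [0, 1, 0, 1],
          "sara": [0, 0, 1, 0], "zara": [1, 0, 1, 0]}
X = np.array(list(tokens.values()), dtype=float)

# Average-linkage hierarchical clustering on Hamming feature distances
Z = linkage(X, method="average", metric="hamming")
print(dict(zip(tokens, fcluster(Z, t=2, criterion="maxclust"))))
```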

Journal ArticleDOI
TL;DR: In this paper, a multi-carrier modulation system was proposed to reduce the effect of the reverberant signals and to improve the transmission data rate of a gas pipe communication system.
Abstract: In recent years, several studies on acoustic communication systems utilizing gas pipelines have been carried out. In conventional acoustic communication technology, however, the transmission rate cannot be improved because of the reverberant signals arising from reflections at the bends and branches in pipes. To reduce the effect of the reverberant signals and to improve the transmission data rate, we studied a multi-carrier modulation system. In this study, to avoid intersymbol interference, we employed special modulation symbols composed of multiple carrier frequencies that change cyclically. To evaluate the proposed system, we constructed an experimental setup simulating a pipe system equivalent to that of a six-story building as the communication path. We achieved a transmission rate of 3,840 bps using the proposed method.
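
The idea of symbols composed of carrier frequencies that change cyclically can be illustrated as on-off keying of a tone set that alternates from symbol to symbol, so reverberant energy from one symbol falls outside the next symbol's band. A hedged sketch; the tone sets, symbol length, and rates below are invented for illustration and are not the paper's parameters:

```python
import numpy as np

fs, sym_len = 48_000, 2048
tone_sets = [np.arange(4_000, 8_000, 500),     # tone set A (Hz)
             np.arange(8_250, 12_250, 500)]    # tone set B: cycled per symbol

def modulate(bits, sym_idx):
    """One multicarrier symbol: each bit keys one tone on/off (OOK)."""
    t = np.arange(sym_len) / fs
    freqs = tone_sets[sym_idx % len(tone_sets)]
    return sum(b * np.sin(2 * np.pi * f * t) for b, f in zip(bits, freqs))

frames = [modulate(np.random.randint(0, 2, 8), i) for i in range(4)]
signal = np.concatenate(frames)   # reverb from frame i misses frame i+1's band
```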

Journal ArticleDOI
TL;DR: In this paper, the authors proposed the simultaneous equations method, which is based on a principle different from that of the filtered-x algorithm, which requires a filter modeled on the secondary path from a loudspeaker to an error microphone.
Abstract: In this study, we verify the performance of the simultaneous equations method using an experimental active noise control system. The simultaneous equations method is based on a principle different from that of the filtered-x algorithm, which requires a filter modeled on the secondary path from a loudspeaker to an error microphone. Instead of this filter, called the secondary path filter, the method uses an auxiliary filter identifying the overall path consisting of the primary path, the noise control filter, and the secondary path. As inferred from the configuration of the overall path, the auxiliary filter can provide two independent equations when two different coefficient vectors are given to the noise control filter. The method thereby estimates the coefficient vector of the noise control filter that minimizes the output of the error microphone. In this paper, we propose the application of a frequency-domain adaptive algorithm to the identification of the overall path, from which an improvement in the noise reduction speed is expected. We also present computer simulation results demonstrating that the simultaneous equations method can automatically recover the noise reduction effect degraded by path changes, and finally, using an experimental system, we show that the method works successfully in practical systems.
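
The identification idea is easiest to see per frequency bin: with control filter C_i in place, the identified overall path is H_i = P + C_i * S, so two distinct C_1, C_2 give two equations in the unknown primary path P and secondary path S, after which the nulling control filter is W = -P/S. A toy frequency-domain sketch of just this algebra, with random complex "paths" standing in for measured ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
P = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # primary path
S = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # secondary path
C1 = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # two trial control
C2 = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # filter settings

H1, H2 = P + C1 * S, P + C2 * S      # what the auxiliary filter identifies
S_hat = (H1 - H2) / (C1 - C2)        # solve the two simultaneous equations
P_hat = H1 - C1 * S_hat
W = -P_hat / S_hat                   # control filter nulling P + W*S
print(np.max(np.abs(P + W * S)))     # ~0: perfect cancellation in this toy case
```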


Journal ArticleDOI
TL;DR: In this article, a scale-model experiment and three-dimensional wave-based numerical analysis were conducted to investigate the effect of unevenly distributed sound absorbers on the reverberation time in a room with an absorptive floor and/or ceiling.
Abstract: The reverberation time in a room with unevenly distributed sound absorbers, such as a room having an absorptive floor and/or ceiling, is often observed to be longer in the middle- and high-frequency ranges than the values obtained using the Sabine/Eyring formulae. In the present study, this phenomenon was investigated through a scale-model experiment and three-dimensional wave-based numerical analysis. The reverberation time in a room having an absorptive floor and/or ceiling was verified to be longer in the middle- and high-frequency ranges, and the arrangement of the absorbers was found to affect the frequency characteristic of the reverberation time. The increase in the reverberation time is caused by the slow decay of the axial and tangential modes in the horizontal direction. The reverberation time is longer in the high-frequency range (in which the wavelength is sufficiently short compared with the height of the ceiling) than in the low-frequency range, even when the frequency characteristics of the absorption coefficients of the absorbers are flat. As a means of improving such an uneven reverberation time, both the placement of diffusers in the vertical direction and the use of inwardly inclined walls (in rooms with highly absorptive floors) were found to be effective.
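
For reference, the Sabine and Eyring predictions against which the measured reverberation times are compared can be computed as below. The room dimensions and absorption coefficients are illustrative, with absorption concentrated on the floor and ceiling as in the scenario the paper studies:

```python
import numpy as np

def rt_sabine(volume, surface, alpha_mean):
    """Sabine: RT = 0.161 V / (S * mean alpha)."""
    return 0.161 * volume / (surface * alpha_mean)

def rt_eyring(volume, surface, alpha_mean):
    """Eyring: RT = 0.161 V / (-S * ln(1 - mean alpha))."""
    return 0.161 * volume / (-surface * np.log(1.0 - alpha_mean))

# 10 m x 8 m x 3 m room; absorption concentrated on floor + ceiling
V, S = 10 * 8 * 3, 2 * (10 * 8 + 10 * 3 + 8 * 3)
alpha = (2 * 80 * 0.6 + (S - 160) * 0.05) / S   # area-weighted mean alpha
print(rt_sabine(V, S, alpha), rt_eyring(V, S, alpha))  # ~0.38 s vs ~0.30 s
```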

Journal ArticleDOI
TL;DR: In this article, the authors proposed a temperature dependence evaluation method for acoustic characteristics of the electret condenser microphone and reported the temperature dependence of the dominant parameter that affects the sensitivity of the microphone and described a method to design a microphone with stable temperature characteristics.
Abstract: Electret condenser microphones for mobile system terminals should be more robust than those for ordinary consumer equipment. Generally, the diaphragm-electret type can achieve flat temperature characteristics with respect to sensitivity because the temperature dependence of the diaphragm stiffness and the gain characteristics of the FET offset each other. However, the fixed-electrode-electret type using a PET diaphragm is not sufficiently robust with respect to temperature, because the PET membrane has the same temperature coefficient as the FET, so there is no offsetting and it is difficult to compensate for the temperature characteristics. In order to improve robustness, this paper proposes a method for evaluating the temperature dependence of the acoustic characteristics of the electret condenser microphone. It then reports the temperature dependence of the dominant parameter that affects the sensitivity of the microphone and describes a method to design a microphone with stable temperature characteristics. In a new trial, silicon, an inorganic material, is applied to the diaphragm. It is subsequently demonstrated that an impedance converter composed of a source-follower circuit is effective for high-temperature environments of up to 80°C. Finally, the possibility of an electret condenser microphone with flat temperature characteristics is discussed.



Journal ArticleDOI
TL;DR: A method is proposed for predicting the absorption characteristics of a fibrous material, that is, glass wool covered with a perforated facing and an impermeable film, typically used for noise barriers, based on Ingard and Bolt's model.
Abstract: We propose a method for predicting the absorption characteristics of a fibrous material, that is, glass wool, covered with a perforated facing and an impermeable film, as typically used for noise barriers. The method is based on Ingard and Bolt's model. It accounts for the interactions among the perforated facing, the film, and the fibrous material. The interaction occurs in areas where they are close to each other; this area was determined empirically as the coverage. The coverage is approximately 10 mm for a perforated facing with an open area ratio of 0.2. Within the coverage, the perforated facing increases the acoustic impedance of the film and fibrous material according to distance. The fibrous material adds acoustic resistance to the film when the film contacts it. The formulae for their acoustic impedances were derived from numerous acoustic impedance measurements made using an impedance tube. The end correction of the holes of the perforated facing was modified using the relationship between the measured resonance frequencies of Helmholtz resonators with the perforated plate and their open area ratios. Results predicted by this method agree well with measured results in most instances.
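
The Helmholtz-resonator measurement mentioned at the end concerns the classic resonance of a perforated facing over an air cavity, f0 = (c/2*pi) * sqrt(eps/(L * t_eff)), where the effective neck length t_eff includes an end correction. A sketch of this baseline formula only; the paper modifies the end correction empirically, whereas the 0.85d-per-side correction and the dimensions here are textbook values, not the paper's:

```python
import numpy as np

def perforated_panel_resonance(open_ratio, t_plate, d_hole, cavity_depth, c=343.0):
    """Classic resonance of a perforated facing over an air cavity:
    f0 = c/(2*pi) * sqrt(eps / (L * t_eff)), with t_eff the plate thickness
    plus an end correction of about 0.85*d per side of each hole."""
    t_eff = t_plate + 0.85 * d_hole * 2
    return c / (2 * np.pi) * np.sqrt(open_ratio / (cavity_depth * t_eff))

# 0.2 open area ratio (as in the abstract), 1 mm plate, 5 mm holes, 50 mm cavity
print(perforated_panel_resonance(0.2, 1e-3, 5e-3, 0.05))  # ~1.1 kHz
```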


Journal ArticleDOI
TL;DR: Findings from fMRI measurement indicate that various sites in the brain, which are not ordinarily used for speech recognition, participate in making NVSS intelligible.
Abstract: Recent work on the perception of noise-vocoded speech sound (NVSS) has revealed that amplitude envelope information is very important for speech perception when spectral information is not sufficiently available. Basically, fundamental frequency information is not available and formant peaks cannot be identified in NVSS. However, we can still recognize accent and distinguish a male voice from a female voice in NVSS. Moreover, a melody can be perceived from lyrics once the lyrics are intelligible. In the present study, findings from fMRI measurements are introduced to show neural activities in the central nervous system while listening to NVSS. The present data indicate that various sites in the brain, which are not ordinarily used for speech recognition, participate in making NVSS intelligible. Applications of the present work include an innovative speech processor and a training system for hearing-impaired people.
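
NVSS stimuli of this kind are typically generated by Shannon-style noise vocoding: split the signal into bands, extract each band's amplitude envelope, and use it to modulate band-limited noise. A minimal sketch under those general assumptions; the band count and edges are illustrative, and the paper's exact processing is not given in this excerpt:

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocode(x, fs, n_bands=4, fmin=100.0, fmax=4000.0):
    """Replace each band's fine structure with noise, keeping its envelope."""
    edges = np.geomspace(fmin, fmax, n_bands + 1)
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        env = np.abs(hilbert(sosfilt(sos, x)))       # amplitude envelope
        noise = sosfilt(sos, np.random.randn(len(x)))
        out += env * noise                           # envelope-modulated noise
    return out

fs = 16_000
speech = np.random.randn(fs)   # stand-in for a real speech waveform
nvss = noise_vocode(speech, fs)
```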


Journal ArticleDOI
TL;DR: In this article, glottal flow past non-vibrating rigid vocal cords was simulated; the initially simple symmetric jet developed into an unsteady, complicated flow with vortices distributed in two-dimensional space.
Abstract: The present study is intended as an investigation of speech dynamics, particularly the unsteady motion of glottal flow in the larynx. In order to focus on fluid motion alone, the vocal cords are assumed to be non-vibrating rigid bodies, although the glottal sound source is usually described as the interaction between the flow and the vibrating vocal cords. The glottal flow based on the two-dimensional rigid body model is simulated by solving the basic equations of a compressible viscous fluid subject to appropriate initial and boundary conditions. The obtained results demonstrate that the initial glottal flow is a simple symmetric jet and that the flow becomes an unsteady, complicated flow with vortices distributed in two-dimensional space. Furthermore, the structure of the complicated flow changes with time. These results indicate that simple assumptions, such as linearization of the fluid equations or one-dimensional models, are inappropriate for the analysis of the speech production process.

Journal ArticleDOI
TL;DR: In this paper, a modified version of the filtered-reference LMS algorithm is introduced to reduce the difference between the theoretical formulation and the practical control process for various kinds of noise signals; numerical simulations indicate that the practical implementation of the algorithm requires stricter conditions on the convergence coefficient than the theoretical prediction.
Abstract: A filtered-reference LMS algorithm is often used in practical active noise control systems. This algorithm is derived under the assumption of a stationary noise signal, and the order of signal convolution is switched in the derivation process. However, the order of convolution cannot be changed in a real physical process. We examine the differences between these situations, i.e., the theoretical formulation versus the practical control process, for various kinds of noise signals. Amplitude-modulated and low-pass noise signals are used as examples of disturbance signals. Results of the numerical simulations indicate that the practical implementation of the algorithm requires stricter conditions on the convergence coefficient than those in the theoretical prediction. Analytical examination with a transformed coefficient update procedure also reveals the difference between them. To reduce this difference and achieve robust attenuation under practical conditions, a modified version of the filtered-reference LMS algorithm is introduced. The advantages of this modified version are verified through a series of simulations.
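
For orientation, a baseline filtered-reference (filtered-x) LMS loop is sketched below. Note that the error is formed with the commuted convolution e = d + w.fx, which is exactly the order switch the abstract says holds only in the theoretical formulation; the paper's modified algorithm is not reproduced here, and all paths are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 20_000, 32
p = rng.standard_normal(64) * np.exp(-np.arange(64) / 10)  # primary path
s = rng.standard_normal(16) * np.exp(-np.arange(16) / 5)   # secondary path
x = rng.standard_normal(N)         # stationary reference noise
d = np.convolve(x, p)[:N]          # disturbance at the error microphone
fx = np.convolve(x, s)[:N]         # reference filtered by secondary-path model
w, mu = np.zeros(L), 1e-3
err = np.zeros(N)
for n in range(L, N):
    fx_vec = fx[n - L + 1:n + 1][::-1]
    # commuted form e = d + w^T fx: exact only for a frozen (slowly varying)
    # control filter -- this is the convolution-order switch discussed above
    err[n] = d[n] + w @ fx_vec
    w -= mu * err[n] * fx_vec      # filtered-reference LMS update
print(np.mean(err[:2000] ** 2), np.mean(err[-2000:] ** 2))  # MSE drops
```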

Journal ArticleDOI
TL;DR: This paper proposes an equalization algorithm that is less sensitive to the order misadjustment of the transfer functions, and shows that the proposed method works well even when the order is highly overestimated.
Abstract: This paper addresses the blind dereverberation problem of a single-input multiple-output acoustic system. Many conventional approaches require the precise order of the transfer functions. In this paper, we propose an equalization algorithm that is less sensitive to order misadjustment of the transfer functions. First, the transfer functions are estimated using an overestimated order, and the inverse filter set for these estimated transfer functions is calculated. Since the estimated transfer functions contain a common polynomial, the signal processed by the inverse filter set suffers from the effect of this common polynomial. We then extract this polynomial to compensate for the distortion. The proposed algorithm recovers the input signal as long as the channel order is overestimated. Simulation results show that the proposed method works well even when the order is highly overestimated.
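
The "inverse filter set" step can be illustrated with the classical multichannel (MINT) result: for two co-prime channels, FIR inverse filters g1, g2 with h1*g1 + h2*g2 = delta exist and can be found by least squares. The sketch below shows only that inversion step for a known two-channel system, not the paper's blind estimation, order overestimation, or common-polynomial extraction:

```python
import numpy as np
from scipy.linalg import toeplitz

def conv_matrix(h, n_cols):
    """Convolution (Sylvester) matrix H such that conv(h, g) = H @ g."""
    col = np.r_[h, np.zeros(n_cols - 1)]
    row = np.r_[h[0], np.zeros(n_cols - 1)]
    return toeplitz(col, row)

rng = np.random.default_rng(2)
h1, h2 = rng.standard_normal(10), rng.standard_normal(10)  # two channels
Lg = 9   # MINT: inverse filters of length Lh - 1 suffice for co-prime channels
H = np.hstack([conv_matrix(h1, Lg), conv_matrix(h2, Lg)])  # 18 x 18 system
target = np.zeros(H.shape[0]); target[0] = 1.0             # want a pure delta
g = np.linalg.lstsq(H, target, rcond=None)[0]
g1, g2 = g[:Lg], g[Lg:]
print(np.round(np.convolve(h1, g1) + np.convolve(h2, g2), 6))  # ~[1, 0, ..., 0]
```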




Journal ArticleDOI
TL;DR: This study implements a digital pattern playback and explores its usefulness for education by proposing two simple but versatile algorithms based on the concepts of amplitude modulation (AM) and the fast Fourier transform (FFT).
Abstract: 1. Introduction
Pattern playback, a device that converts a spectrographic representation back to a speech signal, was developed by Cooper and his colleagues at Haskins Laboratories in the late 1940s [1] and has contributed tremendously to the rapid development of research in speech science [2–4]. By converting a spectrogram into a sound, we can test which acoustic cue projected on the spectrogram is important for speech perception. Furthermore, we can simplify the acoustic cue and/or systematically change an aspect of the acoustic cue, redraw a spectrographic representation, and synthesize stimulus sounds. By doing this, many studies have been conducted, such as the study of the locus theory, which accounts for the importance of the second formant trajectory of a following vowel for the perception of a preceding stop consonant [5]. Today, we can easily implement a modern pattern playback with digital technology, and this is valuable for pedagogical applications. Thus, in this study, we implement a digital pattern playback and explore its usefulness for education [6].

2. Principle
In the original "pattern playback" [1], the light source and tone wheel generate an optical set of harmonics at 120 Hz, and the amplitudes of the harmonics are modulated by a given spectrogram. The spectrogram is placed on top of a belt moving at a constant speed, and an amplitude-modulated signal is output from the loudspeaker. This analog version of pattern playback can easily be implemented with modern digital technology. In fact, Nye et al. reported a digital version of the pattern playback from Haskins Laboratories using a PDP-11 computer system [7]. In this study, we propose two simple but versatile algorithms for digital pattern playback.

The first algorithm, or the AM method, is based on the concept of amplitude modulation (AM). In this algorithm, the amplitudes of harmonics are modulated by the darkness pattern of a spectrogram as shown in Fig. 1. This is somewhat similar to the original pattern playback based on the source-filter theory of speech production. Changing the fundamental frequency of the harmonics yields a variation in pitch, and it eventually allows us to put intonation onto the output sounds. As an alternative option, we can also use a noise source, instead of the harmonic source, to produce unvoiced sounds. Many studies discuss how to reconstruct the original phase components from a spectrographic representation (e.g., [8]). However, the original pattern playback, even without the reconstruction of phase components, is still extremely powerful for educational purposes because it shows the importance of formant transitions, et cetera. Furthermore, we want to implement a simple, digital system that everybody can use. For this reason, our system does not reconstruct the phase components or change the fundamental frequency during playback.

The second algorithm, or the FFT method, is based on the fast Fourier transform (FFT). In this algorithm, a time slice of a given spectrogram is treated as a logarithmic spectrum of that time frame, and the spectrum is converted back into the time domain by the inverse FFT as shown in Fig. 2. Because we are not reconstructing the original phase, we simply set the phase components to zero. Because our aim is a simple algorithm with no pitch change during playback, we have carefully chosen a frame shift dependent on the fundamental period. In other words, we used the frame shift that exactly matches the desired fundamental period. To do this, we first reduce the frequency resolution of the spectrum to obtain only the spectral envelope (especially for a spectrogram obtained by a narrow-band analysis), which reflects the vocal-tract filter. Then, by taking the inverse FFT, we get an impulse response of the filter for that time frame. Finally, we place the impulse response along the time axis frame-by-frame with the time interval of the frame shift, which is also equivalent to the fundamental period. We are technically able to change the time intervals to place the impulse responses depending on the instantaneous pitch contour, although we maintain a constant fundamental period. In theory, we can use a variety of sets of values for each parameter. In practice, we use the following values. For the sampling frequency, 8 to 16 kHz is preferable. For the frame length, 256 or 512 points is optimal. We can use a frame shift of 3–13 ms. This range is suitable for producing a speech sound uttered by an adult male or female, because the fundamental period is set to the frame shift. We often use the frame shift of 10 ms, as when the fundamental frequency is 100 Hz. We can reconstruct an intelligible speech sound as long as the spectrum within a frame is represented at about 40 points or more up to 8 kHz. A non-linear transformation of
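
The FFT method described above translates almost directly into code: treat each spectrogram column as a zero-phase log-magnitude spectrum, take the inverse FFT to get an impulse response, and lay the responses down at intervals of one fundamental period. A minimal sketch in which the toy "spectrogram" and the parameter values follow the ranges suggested in the text, while the implementation details are an illustrative reading rather than the authors' code:

```python
import numpy as np

def fft_pattern_playback(spec_db, fs=16_000, f0=100.0, n_fft=512):
    """spec_db: (n_frames, n_fft//2 + 1) log-magnitude 'spectrogram picture'.
    Zero-phase inverse FFT per column; responses placed every 1/f0 seconds."""
    period = int(round(fs / f0))          # frame shift = fundamental period
    out = np.zeros(len(spec_db) * period + n_fft)
    for i, frame_db in enumerate(spec_db):
        mag = 10.0 ** (frame_db / 20.0)   # back to linear magnitude
        ir = np.fft.irfft(mag, n_fft)     # zero phase: the spectrum is real
        ir = np.roll(ir, n_fft // 2)      # center the impulse response
        out[i * period:i * period + n_fft] += ir
    return out / (np.abs(out).max() + 1e-12)

# Toy 'spectrogram': a steady pattern with peaks near 500 and 1500 Hz
freqs = np.fft.rfftfreq(512, 1 / 16_000)
col = -60 + 50 * (np.exp(-((freqs - 500) / 80) ** 2)
                  + np.exp(-((freqs - 1500) / 120) ** 2))
y = fft_pattern_playback(np.tile(col, (50, 1)))   # 0.5 s of buzzy vowel
```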