
Showing papers in "Journal of the Acoustical Society of America in 2006"


Journal ArticleDOI
TL;DR: An audio-visual corpus consisting of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers is presented to support the use of common material in speech perception and automatic speech recognition studies.
Abstract: An audio-visual corpus has been collected to support the use of common material in speech perception and automatic speech recognition studies. The corpus consists of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers. Sentences are simple, syntactically identical phrases such as "place green at B 4 now". Intelligibility tests using the audio signals suggest that the material is easily identifiable in quiet and low levels of stationary noise. The annotated corpus is available on the web for research use.

1,088 citations


Journal ArticleDOI
TL;DR: A book-length treatment of computational auditory scene analysis (CASA), covering its fundamentals, multiple-F0 estimation, feature-based and model-based speech segregation, binaural localization, reverberation, analysis of musical audio signals, robust automatic speech recognition, and neural and perceptual modeling.
Abstract: Foreword. Preface. Contributors. Acronyms.
1. Fundamentals of Computational Auditory Scene Analysis (DeLiang Wang and Guy J. Brown). 1.1 Human Auditory Scene Analysis. 1.1.1 Structure and Function of the Auditory System. 1.1.2 Perceptual Organization of Simple Stimuli. 1.1.3 Perceptual Segregation of Speech from Other Sounds. 1.1.4 Perceptual Mechanisms. 1.2 Computational Auditory Scene Analysis (CASA). 1.2.1 What Is CASA? 1.2.2 What Is the Goal of CASA? 1.2.3 Why CASA? 1.3 Basics of CASA Systems. 1.3.1 System Architecture. 1.3.2 Cochleagram. 1.3.3 Correlogram. 1.3.4 Cross-Correlogram. 1.3.5 Time-Frequency Masks. 1.3.6 Resynthesis. 1.4 CASA Evaluation. 1.4.1 Evaluation Criteria. 1.4.2 Corpora. 1.5 Other Sound Separation Approaches. 1.6 A Brief History of CASA (Prior to 2000). 1.6.1 Monaural CASA Systems. 1.6.2 Binaural CASA Systems. 1.6.3 Neural CASA Models. 1.7 Conclusions. Acknowledgments. References.
2. Multiple F0 Estimation (Alain de Cheveigné). 2.1 Introduction. 2.2 Signal Models. 2.3 Single-Voice F0 Estimation. 2.3.1 Spectral Approach. 2.3.2 Temporal Approach. 2.3.3 Spectrotemporal Approach. 2.4 Multiple-Voice F0 Estimation. 2.4.1 Spectral Approach. 2.4.2 Temporal Approach. 2.4.3 Spectrotemporal Approach. 2.5 Issues. 2.5.1 Spectral Resolution. 2.5.2 Temporal Resolution. 2.5.3 Spectrotemporal Resolution. 2.6 Other Sources of Information. 2.6.1 Temporal and Spectral Continuity. 2.6.2 Instrument Models. 2.6.3 Learning-Based Techniques. 2.7 Estimating the Number of Sources. 2.8 Evaluation. 2.9 Application Scenarios. 2.10 Conclusion. Acknowledgments. References.
3. Feature-Based Speech Segregation (DeLiang Wang). 3.1 Introduction. 3.2 Feature Extraction. 3.2.1 Pitch Detection. 3.2.2 Onset and Offset Detection. 3.2.3 Amplitude Modulation Extraction. 3.2.4 Frequency Modulation Detection. 3.3 Auditory Segmentation. 3.3.1 What Is the Goal of Auditory Segmentation? 3.3.2 Segmentation Based on Cross-Channel Correlation and Temporal Continuity. 3.3.3 Segmentation Based on Onset and Offset Analysis. 3.4 Simultaneous Grouping. 3.4.1 Voiced Speech Segregation. 3.4.2 Unvoiced Speech Segregation. 3.5 Sequential Grouping. 3.5.1 Spectrum-Based Sequential Grouping. 3.5.2 Pitch-Based Sequential Grouping. 3.5.3 Model-Based Sequential Grouping. 3.6 Discussion. Acknowledgments. References.
4. Model-Based Scene Analysis (Daniel P. W. Ellis). 4.1 Introduction. 4.2 Source Separation as Inference. 4.3 Hidden Markov Models. 4.4 Aspects of Model-Based Systems. 4.4.1 Constraints: Types and Representations. 4.4.2 Fitting Models. 4.4.3 Generating Output. 4.5 Discussion. 4.5.1 Unknown Interference. 4.5.2 Ambiguity and Adaptation. 4.5.3 Relations to Other Separation Approaches. 4.6 Conclusions. References.
5. Binaural Sound Localization (Richard M. Stern, Guy J. Brown, and DeLiang Wang). 5.1 Introduction. 5.2 Physical and Physiological Mechanisms Underlying Auditory Localization. 5.2.1 Physical Cues. 5.2.2 Physiological Estimation of ITD and IID. 5.3 Spatial Perception of Single Sources. 5.3.1 Sensitivity to Differences in Interaural Time and Intensity. 5.3.2 Lateralization of Single Sources. 5.3.3 Localization of Single Sources. 5.3.4 The Precedence Effect. 5.4 Spatial Perception of Multiple Sources. 5.4.1 Localization of Multiple Sources. 5.4.2 Binaural Signal Detection. 5.5 Models of Binaural Perception. 5.5.1 Classical Models of Binaural Hearing. 5.5.2 Cross-Correlation-Based Models of Binaural Interaction. 5.5.3 Some Extensions to Cross-Correlation-Based Binaural Models. 5.6 Multisource Sound Localization. 5.6.1 Estimating Source Azimuth from Interaural Cross-Correlation. 5.6.2 Methods for Resolving Azimuth Ambiguity. 5.6.3 Localization of Moving Sources. 5.7 General Discussion. Acknowledgments. References.
6. Localization-Based Grouping (Albert S. Feng and Douglas L. Jones). 6.1 Introduction. 6.2 Classical Beamforming Techniques. 6.2.1 Fixed Beamforming Techniques. 6.2.2 Adaptive Beamforming Techniques. 6.2.3 Independent Component Analysis Techniques. 6.2.4 Other Localization-Based Techniques. 6.3 Location-Based Grouping Using Interaural Time Difference Cue. 6.4 Location-Based Grouping Using Interaural Intensity Difference Cue. 6.5 Location-Based Grouping Using Multiple Binaural Cues. 6.6 Discussion and Conclusions. Acknowledgments. References.
7. Reverberation (Guy J. Brown and Kalle J. Palomäki). 7.1 Introduction. 7.2 Effects of Reverberation on Listeners. 7.2.1 Speech Perception. 7.2.2 Sound Localization. 7.2.3 Source Separation and Signal Detection. 7.2.4 Distance Perception. 7.2.5 Auditory Spatial Impression. 7.3 Effects of Reverberation on Machines. 7.4 Mechanisms Underlying Robustness to Reverberation in Human Listeners. 7.4.1 The Role of Slow Temporal Modulations in Speech Perception. 7.4.2 The Binaural Advantage. 7.4.3 The Precedence Effect. 7.4.4 Perceptual Compensation for Spectral Envelope Distortion. 7.5 Reverberation-Robust Acoustic Processing. 7.5.1 Dereverberation. 7.5.2 Reverberation-Robust Acoustic Features. 7.5.3 Reverberation Masking. 7.6 CASA and Reverberation. 7.6.1 Systems Based on Directional Filtering. 7.6.2 CASA for Robust ASR in Reverberant Conditions. 7.6.3 Systems that Use Multiple Cues. 7.7 Discussion and Conclusions. Acknowledgments. References.
8. Analysis of Musical Audio Signals (Masataka Goto). 8.1 Introduction. 8.2 Music Scene Description. 8.2.1 Music Scene Descriptions. 8.2.2 Difficulties Associated with Musical Audio Signals. 8.3 Estimating Melody and Bass Lines. 8.3.1 PreFEst-front-end: Forming the Observed Probability Density Functions. 8.3.2 PreFEst-core: Estimating the F0's Probability Density Function. 8.3.3 PreFEst-back-end: Sequential F0 Tracking by Multiple-Agent Architecture. 8.3.4 Other Methods. 8.4 Estimating Beat Structure. 8.4.1 Estimating Period and Phase. 8.4.2 Dealing with Ambiguity. 8.4.3 Using Musical Knowledge. 8.5 Estimating Chorus Sections and Repeated Sections. 8.5.1 Extracting Acoustic Features and Calculating Their Similarity. 8.5.2 Finding Repeated Sections. 8.5.3 Grouping Repeated Sections. 8.5.4 Detecting Modulated Repetition. 8.5.5 Selecting Chorus Sections. 8.5.6 Other Methods. 8.6 Discussion and Conclusions. 8.6.1 Importance. 8.6.2 Evaluation Issues. 8.6.3 Future Directions. References.
9. Robust Automatic Speech Recognition (Jon Barker). 9.1 Introduction. 9.2 ASA and Speech Perception in Humans. 9.2.1 Speech Perception and Simultaneous Grouping. 9.2.2 Speech Perception and Sequential Grouping. 9.2.3 Speech Schemes. 9.2.4 Challenges to the ASA Account of Speech Perception. 9.2.5 Interim Summary. 9.3 Speech Recognition by Machine. 9.3.1 The Statistical Basis of ASR. 9.3.2 Traditional Approaches to Robust ASR. 9.3.3 CASA-Driven Approaches to ASR. 9.4 Primitive CASA and ASR. 9.4.1 Speech and Time-Frequency Masking. 9.4.2 The Missing-Data Approach to ASR. 9.4.3 Marginalization-Based Missing-Data ASR Systems. 9.4.4 Imputation-Based Missing-Data Solutions. 9.4.5 Estimating the Missing-Data Mask. 9.4.6 Difficulties with the Missing-Data Approach. 9.5 Model-Based CASA and ASR. 9.5.1 The Speech Fragment Decoding Framework. 9.5.2 Coupling Source Segregation and Recognition. 9.6 Discussion and Conclusions. 9.7 Concluding Remarks. References.
10. Neural and Perceptual Modeling (Guy J. Brown and DeLiang Wang). 10.1 Introduction. 10.2 The Neural Basis of Auditory Grouping. 10.2.1 Theoretical Solutions to the Binding Problem. 10.2.2 Empirical Results on Binding and ASA. 10.3 Models of Individual Neurons. 10.3.1 Relaxation Oscillators. 10.3.2 Spike Oscillators. 10.3.3 A Model of a Specific Auditory Neuron. 10.4 Models of Specific Perceptual Phenomena. 10.4.1 Perceptual Streaming of Tone Sequences. 10.4.2 Perceptual Segregation of Concurrent Vowels with Different F0s. 10.5 The Oscillatory Correlation Framework for CASA. 10.5.1 Speech Segregation Based on Oscillatory Correlation. 10.6 Schema-Driven Grouping. 10.7 Discussion. 10.7.1 Temporal or Spatial Coding of Auditory Grouping. 10.7.2 Physiological Support for Neural Time Delays. 10.7.3 Convergence of Psychological, Physiological, and Computational Approaches. 10.7.4 Neural Models as a Framework for CASA. 10.7.5 The Role of Attention. 10.7.6 Schema-Based Organization. Acknowledgments. References.
Index.

940 citations


Journal ArticleDOI
TL;DR: An automatic speech recognition system, adapted for use with partially specified inputs, was used to identify consonants in noise; a transmitted information analysis revealed that cues to voicing are degraded more in the model than in human auditory processing.
Abstract: Do listeners process noisy speech by taking advantage of "glimpses"-spectrotemporal regions in which the target signal is least affected by the background? This study used an automatic speech recognition system, adapted for use with partially specified inputs, to identify consonants in noise. Twelve masking conditions were chosen to create a range of glimpse sizes. Several different glimpsing models were employed, differing in the local signal-to-noise ratio (SNR) used for detection, the minimum glimpse size, and the use of information in the masked regions. Recognition results were compared with behavioral data. A quantitative analysis demonstrated that the proportion of the time-frequency plane glimpsed is a good predictor of intelligibility. Recognition scores in each noise condition confirmed that sufficient information exists in glimpses to support consonant identification. Close fits to listeners' performance were obtained at two local SNR thresholds: one at around 8 dB and another in the range -5 to -2 dB. A transmitted information analysis revealed that cues to voicing are degraded more in the model than in human auditory processing.
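A minimal sketch of the glimpsing metric described above, in Python: a time-frequency cell counts as "glimpsed" when its local SNR exceeds a criterion, and the proportion of glimpsed cells is the intelligibility predictor. The random spectrograms, the 3-cell minimum glimpse size, and the function names are illustrative assumptions, not the paper's fitted system (which used a missing-data ASR back end).

```python
import numpy as np
from scipy import ndimage

def glimpse_proportion(target_spec, masker_spec, snr_thresh_db=-5.0,
                       min_glimpse_cells=3):
    """Fraction of the time-frequency plane covered by connected regions
    where the local SNR exceeds a threshold.

    target_spec, masker_spec: power spectrograms (freq x time), same shape.
    snr_thresh_db: local SNR criterion (the paper reports close fits near
                   -5 to -2 dB and around +8 dB).
    min_glimpse_cells: smallest connected region counted as a glimpse
                       (illustrative value).
    """
    eps = np.finfo(float).tiny
    local_snr_db = 10.0 * np.log10((target_spec + eps) / (masker_spec + eps))
    mask = local_snr_db > snr_thresh_db

    # Discard connected regions smaller than the minimum glimpse size.
    labels, n_regions = ndimage.label(mask)
    for region in range(1, n_regions + 1):
        if np.sum(labels == region) < min_glimpse_cells:
            mask[labels == region] = False
    return mask.mean()

# Toy example with random 'target' and 'masker' power spectrograms.
rng = np.random.default_rng(0)
t = rng.exponential(1.0, size=(64, 100))
m = rng.exponential(1.0, size=(64, 100))
print(f"proportion glimpsed: {glimpse_proportion(t, m):.3f}")
```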

693 citations


Journal ArticleDOI
TL;DR: The results suggest that talkers in conversational settings are susceptible to phonetic convergence, which can mark nonlinguistic functions in social discourse and can form the basis for phenomena such as accent change and dialect formation.
Abstract: Following research that found imitation in single-word shadowing, this study examines the degree to which interacting talkers increase similarity in phonetic repertoire during conversational interaction. Between-talker repetitions of the same lexical items produced in a conversational task were examined for phonetic convergence by asking a separate set of listeners to detect similarity in pronunciation across items in a perceptual task. In general, a listener judged a repeated item spoken by one talker in the task to be more similar to a sample production spoken by the talker’s partner than corresponding pre- and postinteraction utterances. Both the role of a participant in the task and the sex of the pair of talkers affected the degree of convergence. These results suggest that talkers in conversational settings are susceptible to phonetic convergence, which can mark nonlinguistic functions in social discourse and can form the basis for phenomena such as accent change and dialect formation.

662 citations


Journal ArticleDOI
TL;DR: The techniques developed in this work can be used to design lattices with a desired band structure, and the observed spatial filtering effects due to anisotropy at high frequencies (short wavelengths) of wave propagation are consistent with the lattice symmetries.
Abstract: Plane wave propagation in infinite two-dimensional periodic lattices is investigated using Floquet-Bloch principles. Frequency bandgaps and spatial filtering phenomena are examined in four representative planar lattice topologies: hexagonal honeycomb, Kagome lattice, triangular honeycomb, and the square honeycomb. These topologies exhibit dramatic differences in their long-wavelength deformation properties. Long-wavelength asymptotes to the dispersion curves based on homogenization theory are in good agreement with the numerical results for each of the four lattices. The slenderness ratio of the constituent beams of the lattice (or relative density) has a significant influence on the band structure. The techniques developed in this work can be used to design lattices with a desired band structure. The observed spatial filtering effects due to anisotropy at high frequencies (short wavelengths) of wave propagation are consistent with the lattice symmetries.
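The paper computes band structure for 2D beam lattices with finite elements; as a hedged illustration of the same Floquet-Bloch principle, the sketch below evaluates the closed-form dispersion of the simplest periodic system, a 1D diatomic spring-mass chain, where a frequency band gap opens between the acoustic and optical branches. All parameter values are arbitrary.

```python
import numpy as np

# Floquet-Bloch illustration on a 1D diatomic spring-mass chain
# (stiffness k, masses m1 and m2, unit-cell length a). The standard
# dispersion relation gives two branches; no real wave number exists
# between them, i.e. a band gap.
k, m1, m2, a = 1.0, 1.0, 2.0, 1.0
q = np.linspace(0.0, np.pi / a, 200)          # Bloch wave number, first BZ

s = k * (1.0 / m1 + 1.0 / m2)
disc = np.sqrt(s**2 - 4.0 * k**2 * np.sin(q * a / 2.0)**2 / (m1 * m2))
w_acoustic = np.sqrt(s - disc)                 # lower (acoustic) branch
w_optical = np.sqrt(s + disc)                  # upper (optical) branch

gap_lo, gap_hi = w_acoustic.max(), w_optical.min()
print(f"band gap: {gap_lo:.3f} .. {gap_hi:.3f} rad/s")
```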

593 citations


PatentDOI
TL;DR: An emotion recognition system for assessing human emotional behavior from communication by a speaker includes a processing system configured to receive signals representative of the verbal and/or non-verbal communication.
Abstract: An emotion recognition system for assessing human emotional behavior from communication by a speaker includes a processing system configured to receive signals representative of the verbal and/or non-verbal communication. The processing system derives signal features from the received signals. The processing system is further configured to implement at least one intermediate mapping between the signal features and one or more elements of an emotional ontology in order to perform an emotion recognition decision. The emotional ontology provides a gradient representation of the human emotional behavior.

495 citations


Journal ArticleDOI
TL;DR: In this article, the authors developed a robust experimental procedure to track the evolution of fatigue damage in a nickel-base superalloy with the acoustic nonlinearity parameter, β, and demonstrates its effectiveness by making repeatable measurements of β in multiple specimens, subjected to both high and low-cycle fatigue.
Abstract: This research develops a robust experimental procedure to track the evolution of fatigue damage in a nickel-base superalloy with the acoustic nonlinearity parameter, β, and demonstrates its effectiveness by making repeatable measurements of β in multiple specimens, subjected to both high- and low-cycle fatigue. The measurement procedure developed in this research is robust in that it is based on conventional piezoelectric contact transducers, which are readily available off the shelf, and it offers the potential for field applications. In addition, the measurement procedure enables the user to isolate sample nonlinearity from measurement system nonlinearity. The experimental results show that there is a significant increase in β linked to the high plasticity of low-cycle fatigue, and illustrate how these nonlinear ultrasonic measurements quantitatively characterize the damage state of a specimen in the early stages of fatigue. The high-cycle fatigue results are less definitive (the increase in β is not as...

428 citations


Journal ArticleDOI
TL;DR: Recent measurement at a previously studied location illustrates the magnitude of increases in ocean ambient noise in the Northeast Pacific over the past four decades.
Abstract: Recent measurement at a previously studied location illustrates the magnitude of increases in ocean ambient noise in the Northeast Pacific over the past four decades. Continuous measurements west of San Nicolas Island, California, over 138 days, spanning 2003-2004 are compared to measurements made during the 1960s at the same site. Ambient noise levels at 30-50 Hz were 10-12 dB higher (95% CI = 2.6 dB) in 2003-2004 than in 1964-1966, suggesting an average noise increase rate of 2.5-3 dB per decade. Above 50 Hz the noise level differences between recording periods gradually diminished to only 1-3 dB at 100-300 Hz. Above 300 Hz the 1964-1966 ambient noise levels were higher than in 2003-2004, owing to a diel component which was absent in the more recent data. Low frequency (10-50 Hz) ocean ambient noise levels are closely related to shipping vessel traffic. The number of commercial vessels plying the world's oceans approximately doubled between 1965 and 2003 and the gross tonnage quadrupled, with a corresponding increase in horsepower. Increases in commercial shipping are believed to account for the observed low-frequency ambient noise increase.
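The quoted rate is simple arithmetic over the measurement interval: the midpoints of the two recording periods (1964-1966 and 2003-2004) are separated by roughly 38.5 years, i.e. 3.85 decades, so

\[
\frac{10\ \mathrm{dB}}{3.85\ \mathrm{decades}} \approx 2.6\ \mathrm{dB/decade},
\qquad
\frac{12\ \mathrm{dB}}{3.85\ \mathrm{decades}} \approx 3.1\ \mathrm{dB/decade},
\]

consistent with the stated 2.5-3 dB per decade. Since 10 log10(2) ≈ 3.0 dB, a 3 dB/decade trend corresponds to a doubling of ambient noise power every decade.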

397 citations


Journal ArticleDOI
TL;DR: This study attempted to isolate the effects that energetic masking, defined as the loss of detectable target information due to the spectral overlap of the target and masking signals, has on multitalker speech perception through the use of ideal time-frequency binary masks.
Abstract: When a target speech signal is obscured by an interfering speech wave form, comprehension of the target message depends both on the successful detection of the energy from the target speech wave form and on the successful extraction and recognition of the spectro-temporal energy pattern of the target out of a background of acoustically similar masker sounds. This study attempted to isolate the effects that energetic masking, defined as the loss of detectable target information due to the spectral overlap of the target and masking signals, has on multitalker speech perception. This was achieved through the use of ideal time-frequency binary masks that retained those spectro-temporal regions of the acoustic mixture that were dominated by the target speech but eliminated those regions that were dominated by the interfering speech. The results suggest that energetic masking plays a relatively small role in the overall masking that occurs when speech is masked by interfering speech but a much more significant role when speech is masked by interfering noise.
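A minimal sketch of the ideal time-frequency binary mask construction described above: keep the cells where the target dominates the interferer, zero out the rest, and apply the mask to the mixture. The arrays, the 0 dB local criterion, and the names are illustrative assumptions, not the paper's stimuli.

```python
import numpy as np

def ideal_binary_mask(target_spec, interferer_spec, lc_db=0.0):
    """Ideal time-frequency binary mask: 1 where the target exceeds the
    interferer by at least lc_db (local criterion), 0 elsewhere.
    Spectrograms are power, shape (freq x time)."""
    eps = np.finfo(float).tiny
    local_snr_db = 10.0 * np.log10((target_spec + eps) /
                                   (interferer_spec + eps))
    return (local_snr_db > lc_db).astype(float)

# The masked mixture retains only the target-dominated regions.
rng = np.random.default_rng(1)
target = rng.exponential(1.0, size=(64, 100))
interferer = rng.exponential(1.0, size=(64, 100))
mixture = target + interferer
masked = mixture * ideal_binary_mask(target, interferer)
print(f"fraction of cells retained: {np.mean(masked > 0):.3f}")
```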

388 citations


Journal ArticleDOI
TL;DR: State-of-the-art finite-element methods for time-harmonic acoustics governed by the Helmholtz equation are reviewed, and mesh resolution requirements to control phase error and to bound dispersion or pollution errors measured in global norms at large wave numbers are described.
Abstract: State-of-the-art finite-element methods for time-harmonic acoustics governed by the Helmholtz equation are reviewed. Four major current challenges in the field are specifically addressed: the effective treatment of acoustic scattering in unbounded domains, including local and nonlocal absorbing boundary conditions, infinite elements, and absorbing layers; numerical dispersion errors that arise in the approximation of short unresolved waves, polluting resolved scales, and requiring a large computational effort; efficient algebraic equation solving methods for the resulting complex-symmetric (non-Hermitian) matrix systems including sparse iterative and domain decomposition methods; and a posteriori error estimates for the Helmholtz operator required for adaptive methods. Mesh resolution to control phase error and bound dispersion or pollution errors measured in global norms for large wave numbers in finite-element methods are described. Stabilized, multiscale, and other wave-based discretization methods developed to reduce this error are reviewed. A review of finite-element methods for acoustic inverse problems and shape optimization is also given.
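To make the dispersion/pollution point concrete, the sketch below evaluates the textbook discrete dispersion relation for 1D linear finite elements with a consistent mass matrix. This 1D closed-form result is only an illustration of the error mechanism the review analyzes in far more general settings.

```python
import numpy as np

# For u'' + k^2 u = 0 discretized with 1D linear elements (consistent
# mass), the numerical wavenumber kh_num satisfies
#     cos(kh_num) = (1 - (kh)^2/3) / (1 + (kh)^2/6),
# i.e. kh_num ~ kh * (1 - (kh)^2/24) for small kh. The accumulated phase
# error over a domain of length L scales like k*L*(kh)^2: this is the
# 'pollution' effect -- keeping kh fixed is not enough at large k.
for kh in (0.5, 0.25, 0.125):
    t = kh * kh
    kh_num = np.arccos((1.0 - t / 3.0) / (1.0 + t / 6.0))
    rel_err = (kh_num - kh) / kh
    print(f"kh = {kh:5.3f}:  relative phase error = {rel_err:+.2e}")
```

Halving kh reduces the relative phase error by about a factor of four, matching the (kh)^2/24 estimate in the comment.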

368 citations


Journal ArticleDOI
TL;DR: The results confirm the significant influence of the shell on the bubble dynamics: shell elasticity increases the resonance frequency by about 50%, and shell viscosity is responsible for about 70% of the total damping.
Abstract: A new optical characterization of the behavior of single ultrasound contrast bubbles is presented. The method consists of insonifying individual bubbles several times in succession, sweeping the applied frequency, and recording movies of the bubble response at up to 25 million frames/s with an ultrahigh-speed camera operated in a segmented mode. The method, termed microbubble spectroscopy, enables the reconstruction of a resonance curve in a single run. The data are analyzed with a linearized model for coated bubbles. The results confirm the significant influence of the shell on the bubble dynamics: shell elasticity increases the resonance frequency by about 50%, and shell viscosity is responsible for about 70% of the total damping. The obtained value for shell elasticity is in quantitative agreement with previously reported values. The shell viscosity increases significantly with the radius, revealing a new nonlinear behavior of the phospholipid coating.

PatentDOI
TL;DR: The ultrasonic surgical tool has an elongate waveguide (1) operatively connected or connectable at a proximal end to a source of ultrasonic vibrations as discussed by the authors, and the operative element is curved in a plane transverse to that of the ridge.
Abstract: The ultrasonic surgical tool has an elongate waveguide (1) operatively connected or connectable at a proximal end to a source of ultrasonic vibrations. At a distal end, an operative element comprises a radially-extending ridge (2) defined between a substantially parallel pair of grooves (4) extending longitudinally of the waveguide (1). The operative element is curved in a plane transverse to that of the ridge (2). This arrangement is ergonomically superior and allows a surgeon to work for longer and with improved control. It also allows a clear visualisation of the operative elements of the tool and the target tissue.

PatentDOI
TL;DR: A real-time speech recognition system includes distributed processing across a client and server for recognizing a spoken query by a user, where the partitioning of responsibility for speech recognition operations can be done on a client-by-client or connection-by-connection basis.
Abstract: A real-time speech recognition system includes distributed processing across a client and server for recognizing a spoken query by a user. Both the client and server can dedicate a variable number of processing resources for performing speech recognition functions. The partitioning of responsibility for speech recognition operations can be done on a client by client or connection by connection basis.

Journal ArticleDOI
TL;DR: Through overpressure experiments it was shown that both nonlinear propagation and cavitation mechanisms participate in accelerating lesion inception and growth, but no lesion displacement or distortion was observed in the absence of boiling.
Abstract: The importance of nonlinear acoustic wave propagation and ultrasound-induced cavitation in the acceleration of thermal lesion production by high intensity focused ultrasound was investigated experimentally and theoretically in a transparent protein-containing gel. A numerical model that accounted for nonlinear acoustic propagation was used to simulate experimental conditions. Various exposure regimes with equal total ultrasound energy but variable peak acoustic pressure were studied for single lesions and lesion stripes obtained by moving the transducer. Static overpressure was applied to suppress cavitation. Strong enhancement of lesion production was observed for high amplitude waves and was supported by modeling. Through overpressure experiments it was shown that both nonlinear propagation and cavitation mechanisms participate in accelerating lesion inception and growth. Using B-mode ultrasound, cavitation was observed at normal ambient pressure as weakly enhanced echogenicity in the focal region, but was not detected with overpressure. Formation of tadpole-shaped lesions, shifted toward the transducer, was always observed to be due to boiling. Boiling bubbles were visible in the gel and were evident as strongly echogenic regions in B-mode images. These experiments indicate that nonlinear propagation and cavitation accelerate heating, but no lesion displacement or distortion was observed in the absence of boiling.

Journal ArticleDOI
TL;DR: The results demonstrate that even when equally informative and discriminable, acoustic cues are not necessarily equally weighted in categorization; listeners exhibit biases when integrating multiple acoustic dimensions.
Abstract: The ability to integrate and weight information across dimensions is central to perception and is particularly important for speech categorization. The present experiments investigate cue weighting by training participants to categorize sounds drawn from a two-dimensional acoustic space defined by the center frequency (CF) and modulation frequency (MF) of frequency-modulated sine waves. These dimensions were psychophysically matched to be equally discriminable and, in the first experiment, were equally informative for accurate categorization. Nevertheless, listeners' category responses reflected a bias for use of CF. This bias remained even when the informativeness of CF was decreased by shifting distributions to create more overlap in CF. A reversal of weighting (MF over CF) was obtained when distribution variance was increased for CF. These results demonstrate that even when equally informative and discriminable, acoustic cues are not necessarily equally weighted in categorization; listeners exhibit biases when integrating multiple acoustic dimensions. Moreover, changes in weighting strategies can be driven by changes in input distribution parameters. This methodology provides potential insights into the acquisition of speech sound categories, particularly second-language categories. One implication is that ineffective cue weighting strategies for phonetic categories may be alleviated by manipulating the variance of uninformative dimensions in training stimuli.
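A toy sketch of how cue weights can be read off categorization responses: simulate a listener who over-weights one of two equally informative cues, then fit a logistic categorizer whose coefficient magnitudes estimate the perceptual weights. This is a generic stand-in for the paper's behavioral analysis; all names and values are invented.

```python
import numpy as np

# Stimuli vary along two standardized, equally informative dimensions
# (think CF and MF). The simulated listener over-weights CF 3:1.
rng = np.random.default_rng(2)
n = 2000
cf = rng.normal(0.0, 1.0, n)             # cue 1 (e.g. CF)
mf = rng.normal(0.0, 1.0, n)             # cue 2 (e.g. MF)
p_resp = 1.0 / (1.0 + np.exp(-(3.0 * cf + 1.0 * mf)))
resp = rng.random(n) < p_resp            # binary category responses

# Fit logistic regression by gradient descent; the coefficients
# recover the listener's relative cue weights.
X = np.column_stack([cf, mf])
w = np.zeros(2)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - resp) / n
print(f"recovered weights  CF: {w[0]:.2f}  MF: {w[1]:.2f}")
```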

Journal ArticleDOI
TL;DR: The results of this cross-language study of the categorical nature of tone perception lead the authors to adopt a memory-based, multistore model of perception in which categorization is domain-general but influenced by long-term categorical representations.
Abstract: Whether or not categorical perception results from the operation of a special, language-specific, speech mode remains controversial. In this cross-language (Mandarin Chinese, English) study of the categorical nature of tone perception, we compared native Mandarin and English speakers’ perception of a physical continuum of fundamental frequency contours ranging from a level to rising tone in both Mandarin speech and a homologous (nonspeech) harmonic tone. This design permits us to evaluate the effect of language experience by comparing Chinese and English groups; to determine whether categorical perception is speech-specific or domain-general by comparing speech to nonspeech stimuli for both groups; and to examine whether categorical perception involves a separate categorical process, distinct from regions of sensory discontinuity, by comparing speech to nonspeech stimuli for English listeners. Results show evidence of strong categorical perception of speech stimuli for Chinese but not English listeners. Categorical perception of nonspeech stimuli was comparable to that for speech stimuli for Chinese but weaker for English listeners, and perception of nonspeech stimuli was more categorical for English listeners than was perception of speech stimuli. These findings lead us to adopt a memory-based, multistore model of perception in which categorization is domain-general but influenced by long-term categorical representations.

Journal ArticleDOI
TL;DR: In two studies, the first formant of monosyllabic consonant-vowel-consonant words was shifted electronically and fed back to participants quickly enough that they perceived the modified speech as their own productions; participants compensated and appeared to more actively stabilize their productions from trial to trial.
Abstract: Auditory feedback during speech production is known to play a role in speech sound acquisition and is also important for the maintenance of accurate articulation. In two studies the first formant (F1) of monosyllabic consonant-vowel-consonant words (CVCs) was shifted electronically and fed back to the participant very quickly so that participants perceived the modified speech as their own productions. When feedback was shifted up (experiment 1 and 2) or down (experiment 1) participants compensated by producing F1 in the opposite frequency direction from baseline. The threshold size of manipulation that initiated a compensation in F1 was usually greater than 60Hz. When normal feedback was returned, F1 did not return immediately to baseline but showed an exponential deadaptation pattern. Experiment 1 showed that this effect was not influenced by the direction of the F1 shift, with both raising and lowering of F1 exhibiting the same effects. Experiment 2 showed that manipulating the number of trials that F1 ...
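One compact way to parametrize the "exponential deadaptation pattern" reported above (a generic form; the symbols and the trial-indexed decay are illustrative, not the paper's fitted model) is

\[
F_1(n) \;=\; F_1^{\mathrm{base}} \;+\; \Delta F_1\, e^{-n/\tau},
\]

where n counts trials after normal feedback is restored, ΔF1 is the residual compensation at the moment feedback returns, and τ sets how quickly production returns to baseline.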

Journal ArticleDOI
TL;DR: The findings suggest that the potential for acquiring absolute pitch may be universal, and may be realized by enabling infants to associate pitches with verbal labels during the critical period for acquisition of features of their native language.
Abstract: Absolute pitch is extremely rare in the U.S. and Europe; this rarity has so far been unexplained. This paper reports a substantial difference in the prevalence of absolute pitch in two normal populations, in a large-scale study employing an on-site test, without self-selection from within the target populations. Music conservatory students in the U.S. and China were tested. The Chinese subjects spoke the tone language Mandarin, in which pitch is involved in conveying the meaning of words. The American subjects were nontone language speakers. The earlier the age of onset of musical training, the greater the prevalence of absolute pitch; however, its prevalence was far greater among the Chinese than the U.S. students for each level of age of onset of musical training. The findings suggest that the potential for acquiring absolute pitch may be universal, and may be realized by enabling infants to associate pitches with verbal labels during the critical period for acquisition of features of their native language.

Journal ArticleDOI
TL;DR: The rapid formant compensations found here suggest that auditory feedback control is similar for both F0 and formants.
Abstract: Auditory feedback influences human speech production, as demonstrated by studies using rapid pitch and loudness changes. Feedback has also been investigated using the gradual manipulation of formants in adaptation studies with whispered speech. In the work reported here, the first formant of steady-state isolated vowels was unexpectedly altered within trials for voiced speech. This was achieved using a real-time formant tracking and filtering system developed for this purpose. The first formant of the vowel /ɛ/ was manipulated 100% toward either /æ/ or /ɪ/, and participants responded by altering their production, with average F1 compensation as large as 16.3% and 10.6% of the applied formant shift, respectively. Compensation was estimated to begin <460 ms after stimulus onset. The rapid formant compensations found here suggest that auditory feedback control is similar for both F0 and formants.

Journal ArticleDOI
TL;DR: An expression is derived for the radiation force on a sphere placed on the axis of an ideal acoustic Bessel beam propagating in an inviscid fluid using the partial-wave coefficients found in the analysis of the scattering when the sphere is placed in a plane wave traveling in the same external fluid.
Abstract: An expression is derived for the radiation force on a sphere placed on the axis of an ideal acoustic Bessel beam propagating in an inviscid fluid. The expression uses the partial-wave coefficients found in the analysis of the scattering when the sphere is placed in a plane wave traveling in the same external fluid. The Bessel beam is characterized by the cone angle β of its plane wave components, where β=0 gives the limiting case of an ordinary plane wave. Examples are found for fluid spheres where the radiation force reverses in direction, so the force is opposite the direction of the beam propagation. Negative axial forces are found to be correlated with conditions giving reduced backscattering by the beam. This condition may also be helpful in the design of acoustic tweezers for biophysical applications. Other potential applications include the manipulation of objects in microgravity. Islands in the (ka, β) parameter plane having a negative radiation force are calculated for the case of a hexane drop in water.
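For reference, the ideal zero-order Bessel beam referred to above is commonly written (a standard textbook form, not quoted from the paper) as

\[
p(r, z, t) \;=\; p_0\, J_0(k r \sin\beta)\, e^{i(k z \cos\beta \,-\, \omega t)},
\]

where J0 is the zero-order Bessel function, k = ω/c, r is the radial distance from the beam axis, and β is the cone angle of the plane-wave components; setting β = 0 gives J0(0) = 1 and recovers the ordinary plane-wave limit mentioned in the abstract.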

Journal ArticleDOI
TL;DR: Design considerations, assembly details, and operating procedures of one version of a cost-effective basic fiber-optic probe hydrophone (FOPH) are described to convey practical information to groups interested in constructing a similar device.
Abstract: Design considerations, assembly details, and operating procedures of one version of a cost-effective basic fiber-optic probe hydrophone (FOPH) are described in order to convey practical information to groups interested in constructing a similar device. The use of fiber optic hydrophones can overcome some of the limitations associated with traditional polyvinylidene difluoride (PVDF) hydrophones for calibration of acoustic fields. Compared to standard PVDF hydrophones, FOPH systems generally have larger bandwidths, enhanced spatial resolution, reduced directionality, and greater immunity to electromagnetic interference, though they can be limited by significantly lower sensitivities. The FOPH system presently described employs a 100-μm multimode optical fiber as the sensing element and incorporates a 1-W laser diode module, 2×2 optical coupler, and general-purpose 50-MHz silicon p-i-n photodetector. Wave forms generated using the FOPH system and a reference PVDF hydrophone are compared, and intrinsic and substitution methods for calibrating the FOPH system are discussed. The voltage-to-pressure transfer factor is approximately 0.8 mV/MPa (-302 dB re 1 V/μPa), though straightforward modifications to the optical components in the FOPH system are discussed that can significantly increase this value. Recommendations are presented to guide the choice of optical components and to provide practical insight into the routine usage of the FOPH device.
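The two quoted sensitivity figures are consistent with each other; a short check of the unit conversion from 0.8 mV/MPa to dB re 1 V/μPa:

```python
import math

# Converting the reported FOPH transfer factor to the dB form quoted
# in the abstract: 0.8 mV/MPa expressed re 1 V/uPa.
mv_per_mpa = 0.8
v_per_upa = (mv_per_mpa * 1e-3) / 1e12      # 1 MPa = 1e12 uPa
sens_db = 20.0 * math.log10(v_per_upa)      # dB re 1 V/uPa
print(f"{sens_db:.0f} dB re 1 V/uPa")       # -> -302 dB re 1 V/uPa
```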

Journal ArticleDOI
TL;DR: This paper demonstrates how the smoothing spline ANOVA (SS ANOVA) can be applied to the comparison of tongue curves and shows some data comparing obstruents produced in word-final and word-medial coda position.
Abstract: Ultrasound imaging of the tongue is increasingly common in speech production research. However, there has been little standardization regarding the quantification and statistical analysis of ultrasound data. In linguistic studies, researchers may want to determine whether the tongue shape for an articulation under two different conditions (e.g., consonants in word-final versus word-medial position) is the same or different. This paper demonstrates how the smoothing spline ANOVA (SS ANOVA) can be applied to the comparison of tongue curves [Gu, Smoothing Spline ANOVA Models (Springer, New York, 2002)]. The SS ANOVA is a technique for determining whether or not there are significant differences between the smoothing splines that are the best fits for two data sets being compared. If the interaction term of the SS ANOVA model is statistically significant, then the groups have different shapes. Since the interaction may be significant even if only a small section of the curves are different (i.e., the tongue root is the same, but the tip of one group is raised), Bayesian confidence intervals are used to determine which sections of the curves are statistically different. SS ANOVAs are illustrated with some data comparing obstruents produced in word-final and word-medial coda position.

Journal ArticleDOI
TL;DR: Speech intelligibility measurements were carried out with 8 normal-hearing and 15 hearing-impaired listeners, collecting speech reception threshold (SRT) data for three different room acoustic conditions and eight directions of a single noise source.
Abstract: Binaural speech intelligibility of individual listeners under realistic conditions was predicted using a model consisting of a gammatone filter bank, an independent equalization-cancellation (EC) process in each frequency band, a gammatone resynthesis, and the speech intelligibility index (SII). Hearing loss was simulated by adding uncorrelated masking noises (according to the pure-tone audiogram) to the ear channels. Speech intelligibility measurements were carried out with 8 normal-hearing and 15 hearing-impaired listeners, collecting speech reception threshold (SRT) data for three different room acoustic conditions (anechoic, office room, cafeteria hall) and eight directions of a single noise source (speech in front). Artificial EC processing errors derived from binaural masking level difference data using pure tones were incorporated into the model. Except for an adjustment of the SII-to-intelligibility mapping function, no model parameter was fitted to the SRT data of this study. The overall correlation coefficient between predicted and observed SRTs was 0.95. The dependence of the SRT of an individual listener on the noise direction and on room acoustics was predicted with a median correlation coefficient of 0.91. The effect of individual hearing impairment was predicted with a median correlation coefficient of 0.95. However, for mild hearing losses the release from masking was overestimated.
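A single-band sketch of the equalization-cancellation (EC) idea at the heart of the model: delay one ear's signal so the masker aligns across ears, then subtract, so the masker largely cancels while a frontal target survives. Everything here (one band instead of a gammatone filter bank, a white masker, a fixed integer-sample ITD, and a small internal noise standing in for the model's EC processing errors) is an illustrative assumption, not the paper's pipeline.

```python
import numpy as np

fs = 16000
n = fs                                           # one second of signal
t = np.arange(n) / fs
rng = np.random.default_rng(3)

target = np.sin(2 * np.pi * 500 * t)             # frontal: same at both ears
masker = rng.normal(0.0, 1.0, n)
itd = 8                                          # masker ITD in samples

internal = 0.05 * rng.normal(size=(2, n))        # stands in for EC errors
left = target + masker + internal[0]
right = target + np.roll(masker, itd) + internal[1]

# Equalize: search the delay that minimizes post-cancellation power
# (the masker dominates here, so this aligns the masker), then cancel.
best = min(range(-32, 33),
           key=lambda d: np.mean((left - np.roll(right, d)) ** 2))
out = left - np.roll(right, best)

def snr_db(sig, ref):
    return 10 * np.log10(np.mean(sig ** 2) / np.mean(ref ** 2))

out_t = target - np.roll(target, best)           # target after EC
out_m = out - out_t                              # residual masker + internal
print(f"equalizing delay: {best} samples (true masker ITD: {-itd})")
print(f"SNR in:  {snr_db(target, masker):5.1f} dB")
print(f"SNR out: {snr_db(out_t, out_m):5.1f} dB")
```

The model applies this operation independently in each gammatone band, with empirically derived processing errors, before passing the result to the SII stage.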

Journal ArticleDOI
TL;DR: This study compared English and Spanish listeners' perceptions of English intervocalic consonants as a function of masker type; the results suggest that non-native listeners are more adversely affected by both energetic and informational masking.
Abstract: Spoken communication in a non-native language is especially difficult in the presence of noise. This study compared English and Spanish listeners' perceptions of English intervocalic consonants as a function of masker type. Three maskers (stationary noise, multitalker babble, and competing speech) provided varying amounts of energetic and informational masking. Competing English and Spanish speech maskers were used to examine the effect of masker language. Non-native performance fell short of that of native listeners in quiet, but a larger performance differential was found for all masking conditions. Both groups performed better in competing speech than in stationary noise, and both suffered most in babble. Since babble is a less effective energetic masker than stationary noise, these results suggest that non-native listeners are more adversely affected by both energetic and informational masking. A strong correlation was found between non-native performance in quiet and degree of deterioration in noise, suggesting that non-native phonetic category learning can be fragile. A small effect of language background was evident: English listeners performed better when the competing speech was Spanish.

PatentDOI
TL;DR: A wide variety of actuator types may be employed to provide synchronized vibration for haptic feedback, including linear actuators, rotary actuators, rotating eccentric mass actuators, and rocking or pivoting mass actuators.
Abstract: The present invention relates to synchronized vibration devices (620) that can provide haptic feedback to a user. A wide variety of actuator types may be employed to provide synchronized vibration, including linear actuators (100), rotary actuators (300), rotating eccentric mass actuators (304), and rocking or pivoting mass actuators (400, 490). A controller (502) may send signals to one or more driver circuits (504) for directing operation of the actuators. The controller may provide direction and amplitude control (508), vibration control (512), and frequency control (510) to direct the haptic experience. Parameters such as frequency, phase, amplitude, duration, and direction can be programmed or input as different patterns suitable for use in gaming, virtual reality and real-world situations.

Journal ArticleDOI
TL;DR: Tests of the smooth signal redundancy hypothesis with a very high-quality corpus collected for speech synthesis confirm the duration/language redundancy results achieved in previous work, and show a significant relationship between language redundancy factors and the first two formants, although these results vary considerably by vowel.
Abstract: The language redundancy of a syllable, measured by its predictability given its context and inherent frequency, has been shown to have a strong inverse relationship with syllabic duration. This relationship is predicted by the smooth signal redundancy hypothesis, which proposes that robust communication in a noisy environment can be achieved with an inverse relationship between language redundancy and the predictability given acoustic observations (acoustic redundancy). A general version of the hypothesis predicts similar relationships between the spectral characteristics of speech and language redundancy. However, investigating this claim is hampered by difficulties in measuring the spectral characteristics of speech within large conversational corpora, and difficulties in forming models of acoustic redundancy based on these spectral characteristics. This paper addresses these difficulties by testing the smooth signal redundancy hypothesis with a very high-quality corpus collected for speech synthesis, and presents both durational and spectral data from vowel nuclei on a vowel-by-vowel basis. Results confirm the duration/language redundancy results achieved in previous work, and show a significant relationship between language redundancy factors and the first two formants, although these results vary considerably by vowel. In general, however, vowels show increased centralization with increased language redundancy.

Journal ArticleDOI
TL;DR: A computational model to simulate normal and impaired auditory-nerve (AN) fiber responses in cats is presented; its wider dynamic range is achieved by providing two modes of basilar membrane excitation to the inner hair cell (IHC) rather than one.
Abstract: This paper presents a computational model to simulate normal and impaired auditory-nerve (AN) fiber responses in cats. The model responses match physiological data over a wider dynamic range than previous auditory models. This is achieved by providing two modes of basilar membrane excitation to the inner hair cell (IHC) rather than one. The two modes are generated by two parallel filters, component 1 (C1) and component 2 (C2), and the outputs are subsequently transduced by two separate functions. The responses are then added and passed through the IHC low-pass filter followed by the IHC-AN synapse model and discharge generator. The C1 filter is a narrow-band, chirp filter with the gain and bandwidth controlled by a nonlinear feed-forward control path. This filter is responsible for low and moderate level responses. A linear, static, and broadly tuned C2 filter followed by a nonlinear, inverted and nonrectifying C2 transduction function is critical for producing transition region and high-level effects. Consistent with Kiang's two-factor cancellation hypothesis, the interaction between the two paths produces effects such as the C1/C2 transition and peak splitting in the period histogram. The model responses are consistent with a wide range of physiological data from both normal and impaired ears for stimuli presented at levels spanning the dynamic range of hearing.
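A structural sketch of the two-path signal flow described above: a narrowband C1 path for low and moderate levels, a broad C2 path with an inverted, nonrectifying transduction for high levels, summed before the IHC low-pass filter. The filters and transduction functions below are crude stand-ins chosen only to show the architecture; the paper's fitted, level-dependent components (nonlinear chirp filter, feed-forward control path, synapse model, discharge generator) are far more elaborate.

```python
import numpy as np
from scipy import signal

fs = 100_000
t = np.arange(int(0.05 * fs)) / fs
stim = np.sin(2 * np.pi * 1000 * t)              # 1 kHz tone at CF

# C1: narrowband band-pass around CF (the model uses a nonlinear chirp
# filter whose gain and bandwidth vary with level).
b1, a1 = signal.butter(2, [900, 1100], btype="bandpass", fs=fs)
c1 = signal.lfilter(b1, a1, stim)

# C2: broad, linear, static filter.
b2, a2 = signal.butter(1, [500, 4000], btype="bandpass", fs=fs)
c2 = signal.lfilter(b2, a2, stim)

def c1_transduce(x):
    return np.tanh(3.0 * x)                      # compressive stand-in

def c2_transduce(x):
    # Inverted and nonrectifying (odd-symmetric): negligible at low
    # levels, opposes C1 at high levels (cf. the cancellation hypothesis).
    return -0.1 * np.tanh((3.0 * x) ** 3)

# Sum the two transduced modes, then apply the IHC low-pass filter.
blp, alp = signal.butter(2, 3000, btype="low", fs=fs)
ihc = signal.lfilter(blp, alp, c1_transduce(c1) + c2_transduce(c2))
print(f"IHC output range: {ihc.min():.3f} .. {ihc.max():.3f}")
```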

PatentDOI
TL;DR: Both speech and alternate-modality inputs are used when entering spoken information into a mobile device; the alternate-modality inputs can be used to perform sequential commitment of words in a speech recognition result.
Abstract: Both speech and alternate modality inputs are used in inputting information spoken into a mobile device. The alternate modality inputs can be used to perform sequential commitment of words in a speech recognition result.

Journal ArticleDOI
TL;DR: In this article, an enhancement of the synchronized switch damping technique on voltage source (SSDV) is presented, which allows fitting the mechanical braking force resulting from the SSDV process to the vibration level.
Abstract: The synchronized switch damping (SSD) principle and derived techniques have been developed to address the problem of structural damping. Compared with standard passive piezoelectric damping, these new semi-passive techniques offer the advantage of self-adaptation to environmental variations. Unlike active damping systems, their implementation does not require any sophisticated signal processing or any bulky power amplifier. This paper presents an enhancement of the SSD technique on a voltage source (SSDV), which is the most effective of the SSD techniques. The former SSDV technique uses a constant continuous voltage source, whereas the proposed enhancement uses an adaptive continuous voltage source, which permits fitting the mechanical braking force resulting from the SSDV process to the vibration level. A theoretical analysis of the SSDV techniques is proposed. Experimental results for structural damping under single-frequency excitation and for vibration control of a smart board under white-noise excitation are presented and confirm the interest of the enhanced SSDV compared to other SSD techniques. Depending on the excitation type, a 4- to 10-dB damping gain can be achieved.

Journal ArticleDOI
TL;DR: Analytical inverse solutions allow the macroscopic thermal parameters of rigid open-cell porous media to be determined from acoustical measurements of the dynamic bulk modulus, and can be used to assess the validity of the descriptive models for a given material.
Abstract: In this paper, the question of the acoustical determination of macroscopic thermal parameters used to describe heat exchanges in rigid open-cell porous media subjected to acoustical excitations is addressed. The proposed method is based on the measurement of the dynamic bulk modulus of the material, and analytical inverse solutions derived from different semiphenomenological models governing the thermal dissipation of acoustic waves in the material. Three models are considered: (1) Champoux–Allard model [J. Appl. Phys. 20, 1975–1979 (1991)] requiring knowledge of the porosity and thermal characteristic length, (2) Lafarge et al. model [J. Acoust. Soc. Am. 102, 1995–2006 (1997)] using the same parameters and the thermal permeability, and (3) Wilson model [J. Acoust. Soc. Am. 94, 1136–1145 (1993)] that requires two adjusted parameters. Except for the porosity that is obtained from direct measurement, all the other thermal parameters are derived from the analytical inversion of the models. The method is appl...