
Showing papers in "Journal of the Acoustical Society of America in 1994"


Journal ArticleDOI
TL;DR: The mean-squared level of each digitally recorded sentence was adjusted to equate intelligibility when presented in spectrally matched noise to normal-hearing listeners, and statistical reliability and efficiency suit it to practical applications in which measures of speech intelligibility are required.
Abstract: A large set of sentence materials, chosen for their uniformity in length and representation of natural speech, has been developed for the measurement of sentence speech reception thresholds (sSRTs). The mean‐squared level of each digitally recorded sentence was adjusted to equate intelligibility when presented in spectrally matched noise to normal‐hearing listeners. These materials were cast into 25 phonemically balanced lists of ten sentences for adaptive measurement of sSRTs. The 95% confidence interval for these measurements is ±2.98 dB for sSRTs in quiet and ±2.41 dB for sSRTs in noise, as defined by the variability of repeated measures with different lists. Average sSRTs in quiet were 23.91 dB(A). Average sSRTs in 72 dB(A) noise were 69.08 dB(A), or −2.92 dB signal/noise ratio. Low‐pass filtering increased sSRTs slightly in quiet and noise as the 4‐ and 8‐kHz octave bands were eliminated. Much larger increases in sSRT occurred when the 2‐kHz octave band was eliminated and bandwidth dropped below 2.5 kHz. Reliability was not degraded substantially until bandwidth dropped below 2.5 kHz. The statistical reliability and efficiency of the test suit it to practical applications in which measures of speech intelligibility are required.
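The adaptive sSRT measurement can be illustrated with a generic 1-up/1-down staircase that converges toward the 50%-intelligibility level. A minimal sketch, assuming a hypothetical logistic psychometric function (the function names, step size, and stopping rule are illustrative, not the authors' procedure):

```python
import random

def adaptive_srt(p_correct_at, start_level=70.0, step=2.0, trials=30, seed=0):
    """Simple 1-up/1-down adaptive track: the presentation level drops
    after a correct response and rises after an error, so the track
    hovers near the 50%-correct point of the psychometric function."""
    rng = random.Random(seed)
    level = start_level
    levels = []
    for _ in range(trials):
        levels.append(level)
        correct = rng.random() < p_correct_at(level)
        level += -step if correct else step
    # SRT estimated as the mean level over the last half of the track
    return sum(levels[trials // 2:]) / (trials - trials // 2)

# Hypothetical logistic psychometric function with its 50% point at 65 dB
def p(level):
    return 1.0 / (1.0 + 10 ** (-(level - 65.0) / 4.0))
```

With repeated lists, the spread of such estimates is what the ±2.41 to ±2.98 dB confidence intervals above quantify.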

1,909 citations


Journal ArticleDOI
TL;DR: Analysis of the formant data shows numerous differences between the present data and those of PB, both in terms of average frequencies of F1 and F2, and the degree of overlap among adjacent vowels.
Abstract: This study was designed as a replication and extension of the classic study of vowel acoustics by Peterson and Barney (PB) [J. Acoust. Soc. Am. 24, 175–184 (1952)]. Recordings were made of 50 men, 50 women, and 50 children producing the vowels /i, ɪ, ɛ, æ, ɝ, ʌ, ɑ, ɔ, ʊ, u/ in h–V–d syllables. Formant contours for F1–F4 were measured from LPC spectra using a custom interactive editing tool. For comparison with the PB data, formant patterns were sampled at a time that was judged by visual inspection to be maximally steady. Preliminary analysis shows numerous differences between the present data and those of PB, both in terms of average formant frequencies for vowels, and the degree of overlap among adjacent vowels. As with the original study, listening tests showed that the signals were nearly always identified as the vowel intended by the talker.

1,891 citations


Journal ArticleDOI
TL;DR: In this book, the authors present methods for estimating the effective density and the bulk modulus of open cell foams, fibrous materials, and porous layers with cylindrical pores, together with transfer matrix and finite element methods for layered poroelastic media.
Abstract: Preface to the second edition.
1 Plane waves in isotropic fluids and solids. 1.1 Introduction. 1.2 Notation - vector operators. 1.3 Strain in a deformable medium. 1.4 Stress in a deformable medium. 1.5 Stress-strain relations for an isotropic elastic medium. 1.6 Equations of motion. 1.7 Wave equation in a fluid. 1.8 Wave equations in an elastic solid. References.
2 Acoustic impedance at normal incidence of fluids. Substitution of a fluid layer for a porous layer. 2.1 Introduction. 2.2 Plane waves in unbounded fluids. 2.3 Main properties of impedance at normal incidence. 2.4 Reflection coefficient and absorption coefficient at normal incidence. 2.5 Fluids equivalent to porous materials: the laws of Delany and Bazley. 2.6 Examples. 2.7 The complex exponential representation. References.
3 Acoustic impedance at oblique incidence in fluids. Substitution of a fluid layer for a porous layer. 3.1 Introduction. 3.2 Inhomogeneous plane waves in isotropic fluids. 3.3 Reflection and refraction at oblique incidence. 3.4 Impedance at oblique incidence in isotropic fluids. 3.5 Reflection coefficient and absorption coefficient at oblique incidence. 3.6 Examples. 3.7 Plane waves in fluids equivalent to transversely isotropic porous media. 3.8 Impedance at oblique incidence at the surface of a fluid equivalent to an anisotropic porous material. 3.9 Example. References.
4 Sound propagation in cylindrical tubes and porous materials having cylindrical pores. 4.1 Introduction. 4.2 Viscosity effects. 4.3 Thermal effects. 4.4 Effective density and bulk modulus for cylindrical tubes having triangular, rectangular and hexagonal cross-sections. 4.5 High- and low-frequency approximation. 4.6 Evaluation of the effective density and the bulk modulus of the air in layers of porous materials with identical pores perpendicular to the surface. 4.7 The Biot model for rigid framed materials. 4.8 Impedance of a layer with identical pores perpendicular to the surface. 4.9 Tortuosity and flow resistivity in a simple anisotropic material. 4.10 Impedance at normal incidence and sound propagation in oblique pores. Appendix 4.A Important expressions. Description on the microscopic scale. Effective density and bulk modulus. References.
5 Sound propagation in porous materials having a rigid frame. 5.1 Introduction. 5.2 Viscous and thermal dynamic and static permeability. 5.3 Classical tortuosity, characteristic dimensions, quasi-static tortuosity. 5.4 Models for the effective density and the bulk modulus of the saturating fluid. 5.5 Simpler models. 5.6 Prediction of the effective density and the bulk modulus of open cell foams and fibrous materials with the different models. 5.7 Fluid layer equivalent to a porous layer. 5.8 Summary of the semi-phenomenological models. 5.9 Homogenization. 5.10 Double porosity media. Appendix 5.A Simplified calculation of the tortuosity for a porous material having pores made up of an alternating sequence of cylinders. Appendix 5.B Calculation of the characteristic length Λ'. Appendix 5.C Calculation of the characteristic length Λ for a cylinder perpendicular to the direction of propagation. References.
6 Biot theory of sound propagation in porous materials having an elastic frame. 6.1 Introduction. 6.2 Stress and strain in porous materials. 6.3 Inertial forces in the Biot theory. 6.4 Wave equations. 6.5 The two compressional waves and the shear wave. 6.6 Prediction of surface impedance at normal incidence for a layer of porous material backed by an impervious rigid wall. Appendix 6.A Other representations of the Biot theory. References.
7 Point source above rigid framed porous layers. 7.1 Introduction. 7.2 Sommerfeld representation of the monopole field over a plane reflecting surface. 7.3 The complex sin θ plane. 7.4 The method of steepest descent (passage path method). 7.5 Poles of the reflection coefficient. 7.6 The pole subtraction method. 7.7 Pole localization. 7.8 The modified version of the Chien and Soroka model. Appendix 7.A Evaluation of N. Appendix 7.B Evaluation of p_r by the pole subtraction method. Appendix 7.C From the pole subtraction to the passage path: locally reacting surface. References.
8 Porous frame excitation by point sources in air and by stress circular and line sources - modes of air saturated porous frames. 8.1 Introduction. 8.2 Prediction of the frame displacement. 8.3 Semi-infinite layer - Rayleigh wave. 8.4 Layer of finite thickness - modified Rayleigh wave. 8.5 Layer of finite thickness - modes and resonances. Appendix 8.A Coefficients r_ij and M_ij. Appendix 8.B Double Fourier transform and Hankel transform. Appendix 8.C Rayleigh pole contribution. References.
9 Porous materials with perforated facings. 9.1 Introduction. 9.2 Inertial effect and flow resistance. 9.3 Impedance at normal incidence of a layered porous material covered by a perforated facing - Helmholtz resonator. 9.4 Impedance at oblique incidence of a layered porous material covered by a facing having circular perforations. References.
10 Transversally isotropic poroelastic media. 10.1 Introduction. 10.2 Frame in vacuum. 10.3 Transversally isotropic poroelastic layer. 10.4 Waves with a given slowness component in the symmetry plane. 10.5 Sound source in air above a layer of finite thickness. 10.6 Mechanical excitation at the surface of the porous layer. 10.7 Symmetry axis different from the normal to the surface. 10.8 Rayleigh poles and Rayleigh waves. 10.9 Transfer matrix representation of transversally isotropic poroelastic media. Appendix 10.A Coefficients T_i in Equation (10.46). Appendix 10.B Coefficients A_i in Equation (10.97). References.
11 Modelling multilayered systems with porous materials using the transfer matrix method. 11.1 Introduction. 11.2 Transfer matrix method. 11.3 Matrix representation of classical media. 11.4 Coupling transfer matrices. 11.5 Assembling the global transfer matrix. 11.6 Calculation of the acoustic indicators. 11.7 Applications. Appendix 11.A The elements T_ij of the transfer matrix [T]. References.
12 Extensions to the transfer matrix method. 12.1 Introduction. 12.2 Finite size correction for the transmission problem. 12.3 Finite size correction for the absorption problem. 12.4 Point load excitation. 12.5 Point source excitation. 12.6 Other applications. Appendix 12.A An algorithm to evaluate the geometrical radiation impedance. References.
13 Finite element modelling of poroelastic materials. 13.1 Introduction. 13.2 Displacement based formulations. 13.3 The mixed displacement-pressure formulation. 13.4 Coupling conditions. 13.5 Other formulations in terms of mixed variables. 13.6 Numerical implementation. 13.7 Dissipated power within a porous medium. 13.8 Radiation conditions. 13.9 Examples. References. Index.
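The normal-incidence impedance calculations in the early chapters reduce, in the simplest configuration, to a fluid-equivalent layer backed by a rigid wall: the surface impedance is Zs = -j Zc cot(kc d), and the absorption coefficient follows from the reflection coefficient. A minimal sketch assuming the e^(jωt) time convention (function names are illustrative):

```python
import cmath

def surface_impedance_rigid_backed(Zc, kc, d):
    """Normal-incidence surface impedance of a fluid-equivalent layer of
    thickness d (m) backed by a rigid wall: Zs = -j * Zc / tan(kc * d).
    Zc is the characteristic impedance, kc the (possibly complex) wave
    number in the layer, with the e^(j*omega*t) convention."""
    return -1j * Zc / cmath.tan(kc * d)

def absorption_coefficient(Zs, Z0=413.0):
    """alpha = 1 - |R|^2 with R = (Zs - Z0) / (Zs + Z0); Z0 is the
    characteristic impedance of air (about 413 rayl)."""
    R = (Zs - Z0) / (Zs + Z0)
    return 1.0 - abs(R) ** 2
```

For a lossless layer (real Zc and kc) the surface impedance is purely reactive and the absorption is zero; losses enter through the complex effective density and bulk modulus models the book develops.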

1,375 citations


Journal ArticleDOI
TL;DR: The effect of smearing the temporal envelope on the speech-reception threshold (SRT) for sentences in noise and on phoneme identification was investigated for normal-hearing listeners, showing a severe reduction in sentence intelligibility for narrow processing bands at low cutoff frequencies.
Abstract: The effect of smearing the temporal envelope on the speech-reception threshold (SRT) for sentences in noise and on phoneme identification was investigated for normal-hearing listeners. For this purpose, the speech signal was split up into a series of frequency bands (width of 1/4, 1/2, or 1 oct) and the amplitude envelope for each band was low-pass filtered at cutoff frequencies of 0, 1/2, 1, 2, 4, 8, 16, 32, or 64 Hz. Results for 36 subjects show (1) a severe reduction in sentence intelligibility for narrow processing bands at low cutoff frequencies (0-2 Hz); and (2) a marginal contribution of modulation frequencies above 16 Hz to the intelligibility of sentences (provided that lower modulation frequencies are completely present). For cutoff frequencies above 4 Hz, the SRT appears to be independent of the frequency bandwidth upon which envelope filtering takes place. Vowel and consonant identification with nonsense syllables were studied for cutoff frequencies of 0, 2, 4, 8, or 16 Hz in 1/4-oct bands. Results for 24 subjects indicate that consonants are more affected than vowels. Errors in vowel identification mainly consist of reduced recognition of diphthongs and of confusions between long and short vowels. In case of consonant recognition, stops appear to suffer most, with confusion patterns depending on the position in the syllable (initial, medial, or final).
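The envelope-smearing manipulation (band-wise low-pass filtering of the amplitude envelope) can be sketched with an FFT-based Hilbert envelope and an ideal low-pass filter. This is a simplified, numpy-only sketch of the general idea, not the authors' exact processing chain:

```python
import numpy as np

def smear_envelope(band_signal, fs, cutoff_hz):
    """Replace a band-limited signal's Hilbert envelope with a low-pass
    filtered version (ideal FFT-domain filter), keeping the temporal
    fine structure. Sketch of 'temporal smearing' for one band."""
    n = len(band_signal)
    # analytic signal via FFT (numpy-only Hilbert transform)
    spec = np.fft.fft(band_signal)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    analytic = np.fft.ifft(spec * h)
    envelope = np.abs(analytic)
    fine = analytic / np.maximum(envelope, 1e-12)  # unit-magnitude carrier
    # ideal low-pass on the envelope
    env_spec = np.fft.fft(envelope)
    freqs = np.abs(np.fft.fftfreq(n, 1.0 / fs))
    env_spec[freqs > cutoff_hz] = 0.0
    smeared_env = np.real(np.fft.ifft(env_spec))
    return np.real(fine) * np.clip(smeared_env, 0.0, None)
```

In the study this kind of operation is applied per frequency band (1/4, 1/2, or 1 oct wide) before the bands are recombined.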

856 citations


Journal ArticleDOI
TL;DR: Based on the modulation transfer function of some conditions, it is concluded that fast multichannel dynamic compression leads to an insignificant change in masked SRT.
Abstract: The effect of reducing low‐frequency modulations in the temporal envelope on the speech‐reception threshold (SRT) for sentences in noise and on phoneme identification was investigated. For this purpose, speech was split up into a series of frequency bands (1/4, 1/2, or 1 oct wide) and the amplitude envelope for each band was high‐pass filtered at cutoff frequencies of 1, 2, 4, 8, 16, 32, 64, or 128 Hz, or ∞ (completely flattened). Results for 42 normal‐hearing listeners show: (1) A clear reduction in sentence intelligibility with narrow‐band processing for cutoff frequencies above 64 Hz; and (2) no reduction of sentence intelligibility when only amplitude variations below 4 Hz are reduced. Based on the modulation transfer function of some conditions, it is concluded that fast multichannel dynamic compression leads to an insignificant change in masked SRT. Combining these results with previous data on low‐pass envelope filtering (temporal smearing) [Drullman et al., J. Acoust. Soc. Am. 95, 1053–1064 (1994)] shows that at 8–10 Hz the temporal modulation spectrum is divided into two equally important parts. Vowel and consonant identification with nonsense syllables were studied for cutoff frequencies of 2, 8, 32, 128 Hz, and ∞, processed in 1/4‐oct bands. Results for 12 subjects indicate that, just as for low‐pass envelope filtering, consonants are more affected than vowels. Errors in vowel identification mainly consist of reduced recognition of diphthongs and of durational confusions. For the consonants there are no clear confusion patterns, but stops appear to suffer least. In most cases, the responses tend to fall into the correct category (stop, fricative, or vowel‐like).

634 citations


Journal ArticleDOI
TL;DR: The long-term average speech spectrum (LTASS) and some dynamic characteristics of speech were determined for 12 languages: English (several dialects), Swedish, Danish, German, French (Canadian), Japanese, Cantonese, Mandarin, Russian, Welsh, Singhalese, and Vietnamese.
Abstract: The long‐term average speech spectrum (LTASS) and some dynamic characteristics of speech were determined for 12 languages: English (several dialects), Swedish, Danish, German, French (Canadian), Japanese, Cantonese, Mandarin, Russian, Welsh, Singhalese, and Vietnamese. The LTASS only was also measured for Arabic. Speech samples (18) were recorded, using standardized equipment and procedures, in 15 localities for (usually) ten male and ten female talkers. All analyses were conducted at the National Acoustic Laboratories, Sydney. The LTASS was similar for all languages although there were many statistically significant differences. Such differences were small and not always consistent for male and female samples of the same language. For one‐third octave bands of speech, the maximum short‐term rms level was 10 dB above the maximum long‐term rms level, consistent across languages and frequency. A ‘‘universal’’ LTASS is suggested as being applicable, across languages, for many purposes including use in hearing aid prescription procedures and in the Articulation Index.
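An LTASS-style measurement reduces to averaging short-term power spectra over many frames. A generic sketch (frame length, window, and dB reference are illustrative assumptions, not the standardized procedure used in the study):

```python
import numpy as np

def ltass_db(signal, fs, frame_len=1024, hop=512):
    """Long-term average spectrum: mean power spectrum over Hann-windowed
    frames, returned in dB re an arbitrary reference."""
    win = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * win
              for i in range(0, len(signal) - frame_len + 1, hop)]
    power = np.mean([np.abs(np.fft.rfft(f)) ** 2 for f in frames], axis=0)
    freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
    return freqs, 10.0 * np.log10(np.maximum(power, 1e-20))
```

The study additionally integrates such spectra into one-third octave bands and tracks short-term maxima, which is how the 10 dB peak-to-long-term figure above is obtained.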

507 citations


Journal ArticleDOI
TL;DR: In this paper, a biomechanically motivated version of the vowel undershoot model was used for English front vowels embedded in a /w-l/ frame and carrying constant main stress.
Abstract: Acoustic observations are reported for English front vowels embedded in a /w—l/ frame and carrying constant main stress. The vowels were produced by five speakers in clear and citation‐form styles at varying durations but at a constant speaking rate. The acoustic analyses revealed (i) that formant patterns were systematically displaced in the direction of the frequencies of the consonants of the adjacent pseudosymmetrical context; (ii) that those displacements depended in a lawful manner on vowel duration; (iii) that this context and duration dependence was more limited for clear than for citation‐form speech, and that the smaller formant shifts of clear speech tended to be achieved by increases in the rate of formant frequency change. The findings are compatible with a revised, and biomechanically motivated, version of the vowel undershoot model [Lindblom, J. Acoust. Soc. Am. 35, 1773–1781 (1963)] that derives formant patterns from numerical information on three variables: The ‘‘locus‐target’’ distance, vowel duration, and rate of formant frequency change. The results further indicate that the ‘‘clear’’ samples were not merely louder, but involved a systematic, undershoot‐compensating reorganization of the acoustic patterns.

446 citations


Journal ArticleDOI
TL;DR: In this paper, a time domain expression of causality analogous in function to the Kramers-Kronig relations in the frequency domain is used to derive the causal wave equations.
Abstract: For attenuation described by a slowly varying power law function of frequency, α=α0‖ω‖y, classical lossy time domain wave equations exist only for the restricted cases where y=0 or y=2. For the frequently occurring practical situation in which attenuation is much smaller than the wave number, a lossy dispersion characteristic is derived that has the desired attenuation general power law dependence. In order to obtain the corresponding time domain lossy wave equation, time domain loss operators similar in function to existing derivative operators are developed through the use of generalized functions. Three forms of lossy wave equations are found, depending on whether y is an even or odd integer or a noninteger. A time domain expression of causality analogous in function to the Kramers–Kronig relations in the frequency domain is used to derive the causal wave equations. Final causal versions of the time domain wave equations are obtained even for the cases where y≥1, which, according to the Paley–Wiener th...
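The power-law attenuation model α = α0|ω|^y is straightforward to evaluate directly; for a plane wave the amplitude then decays as exp(-α x). A minimal numeric sketch (the constants and units below are illustrative assumptions):

```python
import math

def power_law_attenuation(f_hz, alpha0, y):
    """Attenuation coefficient alpha = alpha0 * |omega|**y (Np/m).
    alpha0 and y are medium-dependent; the restricted classical cases
    are y = 0 and y = 2, while many media fall in between."""
    omega = 2.0 * math.pi * f_hz
    return alpha0 * abs(omega) ** y

def amplitude_after(f_hz, alpha0, y, distance_m, a0=1.0):
    """Plane-wave amplitude after propagating distance_m meters."""
    return a0 * math.exp(-power_law_attenuation(f_hz, alpha0, y) * distance_m)
```

Higher frequencies decay faster whenever y > 0, which is the dispersion behavior the time-domain loss operators in the paper are built to reproduce causally.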

413 citations


PatentDOI
TL;DR: In this article, a distributed voice recognition system includes a digital signal processor (DSP), a nonvolatile storage medium (108), and a microprocessor (106), which is configured to extract parameters from digitized input speech samples and provide the extracted parameters to the microprocessor.
Abstract: A distributed voice recognition system includes a digital signal processor (DSP)(104), a nonvolatile storage medium (108), and a microprocessor (106). The DSP (104) is configured to extract parameters from digitized input speech samples and provide the extracted parameters to the microprocessor (106). The nonvolatile storage medium contains a database of speech templates. The microprocessor is configured to read the contents of the nonvolatile storage medium (108), compare the parameters with the contents, and select a speech template based upon the comparison. The nonvolatile storage medium may be a flash memory. The DSP (104) may be a vocoder. If the DSP (104) is a vocoder, the parameters may be diagnostic data generated by the vocoder. The distributed voice recognition system may reside on an application specific integrated circuit (ASIC).

361 citations


PatentDOI
TL;DR: In this paper, a teleconferencing system is disclosed having a video camera for generating a video signal representative of a video image of a first station and a microphone array (150, 160) for receiving sound from one or more fixed non-overlapping volume zones (151-159) into which the first station is divided.
Abstract: A teleconferencing system (100) is disclosed having a video camera for generating a video signal representative of a video image of a first station B. A microphone array (150, 160) is also provided in the first station for receiving a sound from one or more fixed non-overlapping volume zones (151-159) into which the first station is divided. The microphone array is also provided for generating a monochannel audio signal (170) representative of the received sound and a direction signal indicating, based on the sound received from each zone, from which of the volume zones the sound originated. The teleconferencing system also includes a display device (120A) at a second station A for displaying a video image of the first station. A loudspeaker control device (140) is also provided at the second station for selecting a virtual location (121) on the displayed video image depending on the direction signal, and for generating stereo sound from the monochannel audio signal which stereo sound emanates from the virtual location on the displayed video image.

351 citations


PatentDOI
TL;DR: A speech dialogue system capable of realizing natural and smooth dialogue between the system and a human user, and easy maneuverability of the system.
Abstract: A speech dialogue system capable of realizing natural and smooth dialogue between the system and a human user, and easy maneuverability of the system. In the system, a semantic content of input speech from a user is understood and a semantic content determination of a response output is made according to the understood semantic content of the input speech. Then, a speech response and a visual response according to the determined response output are generated and outputted to the user. The dialogue between the system and the user is managed by controlling transitions between user states during which the input speech is to be entered and system states during which the system response is to be outputted. The understanding of a semantic content of input speech from a user is made by detecting keywords in the input speech, with the keywords to be detected in the input speech limited in advance, according to a state of a dialogue between the user and the system.

Journal ArticleDOI
TL;DR: In this paper, multiparameter inversion of underwater acoustic models is posed as an optimization problem and solved by a directed Monte Carlo search using a genetic algorithm formulated with steady-state reproduction without duplicates.
Abstract: The goal of many underwater acoustic modeling problems is to find the physical parameters of the environment. With the increase in computer power and the development of advanced numerical models it is now feasible to carry out multiparameter inversion. The inversion is posed as an optimization problem, which is solved by a directed Monte Carlo search using genetic algorithms. The genetic algorithm presented in this paper is formulated by steady‐state reproduction without duplicates. For the selection of ‘‘parents’’ the object function is scaled according to a Boltzmann distribution with a ‘‘temperature’’ equal to the fitness of one of the members in the population. The inversion would be incomplete if not followed by an analysis of the uncertainties of the result. When using genetic algorithms the response from many environmental parameter sets has to be computed in order to estimate the solution. The many samples of the models are used to estimate the a posteriori probabilities of the model parameters. T...
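The search strategy described (steady-state reproduction without duplicates, Boltzmann-weighted parent selection) can be sketched on a toy one-parameter inversion. Here the "temperature" is taken as the median population misfit, a simplification of the paper's choice of the misfit of one population member; everything below is an illustration, not the authors' implementation:

```python
import math
import random

def genetic_inversion(objective, bounds, pop_size=20, steps=300, seed=1):
    """Steady-state GA sketch: parents are drawn with Boltzmann weights
    exp(-misfit / T); the child replaces the worst individual, and
    duplicates of existing members are rejected."""
    rng = random.Random(seed)
    pop = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
           for _ in range(pop_size)]
    misfit = [objective(m) for m in pop]
    for _ in range(steps):
        temp = max(sorted(misfit)[pop_size // 2], 1e-9)  # median misfit as "temperature"
        weights = [math.exp(-e / temp) for e in misfit]
        p1, p2 = rng.choices(pop, weights=weights, k=2)
        # uniform crossover plus a small Gaussian mutation, clipped to bounds
        child = tuple(
            min(max((a if rng.random() < 0.5 else b)
                    + rng.gauss(0.0, 0.05 * (hi - lo)), lo), hi)
            for (a, b), (lo, hi) in zip(zip(p1, p2), bounds))
        if child in pop:
            continue  # steady-state reproduction without duplicates
        e_child = objective(child)
        worst = max(range(pop_size), key=lambda i: misfit[i])
        if e_child < misfit[worst]:
            pop[worst], misfit[worst] = child, e_child
    best = min(range(pop_size), key=lambda i: misfit[i])
    return pop[best], misfit[best]
```

In the paper, the many forward-model evaluations accumulated during the search are reused to estimate a posteriori probabilities of the model parameters.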

Journal ArticleDOI
TL;DR: In this article, a revised fluid mechanical description of the air flow through the glottis is proposed in which the separation point is allowed to move, since the usual fixed-separation-point assumption appears quite unrealistic and the position of the separation point is an important parameter in phonation models.
Abstract: Most flow models used in numerical simulation of voiced sound production rely, for the sake of simplicity, upon a certain number of assumptions. While most of these assumptions constitute reasonable first approximations, others appear more doubtful. In particular, it is implicitly assumed that the air flow through the glottal channel separates from the walls at a fixed point. Since this assumption appears quite unrealistic, and considering that the position of the separation point is an important parameter in phonation models, in this paper a revised fluid mechanical description of the air flow through the glottis is proposed, in which the separation point is allowed to move. This theoretical model, as well as the assumptions made, are validated using steady- and unsteady-flow measurements combined with flow visualizations. In order to evaluate the effective impact of the revised theory, we then present an application to a simple mechanical model of the vocal cords derived from the classical two-mass model. As expected, implementation of a moving separation point appears to be of great importance for the modeling of glottal signals. It is further shown that the numerical model coupled with a more realistic description of the vocal cord collision can lead to signals surprisingly close to those observed in real speech by inverse filtering.

Journal ArticleDOI
TL;DR: The authors trained monolingual speakers of Japanese to identify English /r/ and /l/ using a high-variability training procedure; performance during training varied as a function of the talker and phonetic environment.
Abstract: Monolingual speakers of Japanese were trained to identify English /r/ and /l/ using Logan et al.’s [J. Acoust. Soc. Am. 89, 874–886 (1991)] high‐variability training procedure. Subjects’ performance improved from the pretest to the post‐test and during the 3 weeks of training. Performance during training varied as a function of talker and phonetic environment. Generalization accuracy to new words depended on the voice of the talker producing the /r/–/l/ contrast: Subjects were significantly more accurate when new words were produced by a familiar talker than when new words were produced by an unfamiliar talker. This difference could not be attributed to differences in intelligibility of the stimuli. Three and six months after the conclusion of training, subjects returned to the laboratory and were given the post‐test and tests of generalization again. Performance was surprisingly good on each test after 3 months without any further training: Accuracy decreased only 2% from the post‐test given at the end o...

Journal ArticleDOI
TL;DR: Performance was not markedly affected by the phase relationship between the components of a complex, except for stimuli with intermediate F0's in the MID spectral region, where FDLs and FMDDTs were much higher for ALT-phase stimuli than for SINE- phase stimuli, consistent with their unclear pitch, and when FMTs were measured.
Abstract: A series of experiments investigated the influence of harmonic resolvability on the pitch of, and the discriminability of differences in fundamental frequency (F0) between, frequency‐modulated (FM) harmonic complexes. Both F0 (62.5 to 250 Hz) and spectral region (LOW: 125–625 Hz, MID: 1375–1875 Hz, and HIGH: 3900–5400 Hz) were varied orthogonally. The harmonics that comprised each complex could be summed in either sine (0°) phase (SINE) or alternating sine‐cosine (0°–90°) phase (ALT). Stimuli were presented in a continuous pink‐noise background. Pitch‐matching experiments revealed that the pitch of ALT‐phase stimuli, relative to SINE‐phase stimuli, was increased by an octave in the HIGH region, for all F0’s, but was the same as that of SINE‐phase stimuli when presented in the LOW region. In the MID region, the pitch of ALT‐phase relative to SINE‐phase stimuli depended on F0, being an octave higher at low F0’s, equal at high F0’s, and unclear at intermediate F0’s. The same stimuli were then used in three measures of discriminability: FM detection thresholds (FMTs), frequency difference limens (FDLs), and FM direction discrimination thresholds (FMDDTs, defined as the minimum FM depth necessary for listeners to discriminate between two complexes modulated 180° out of phase with each other). For all three measures, at all F0’s, thresholds were low (<4% for FMTs, <5% for FMDDTs, and <1.5% for FDLs) when stimuli were presented in the LOW region, and high (≳10% for FMTs, ≳7% for FMDDTs, and ≳2.5% for FDLs) when presented in the HIGH region. When stimuli were presented in the MID region, thresholds were low for low F0’s, and high for high F0’s. Performance was not markedly affected by the phase relationship between the components of a complex, except for stimuli with intermediate F0’s in the MID spectral region, where FDLs and FMDDTs were much higher for ALT‐phase stimuli than for SINE‐phase stimuli, consistent with their unclear pitch. 
This difference was much smaller when FMTs were measured. The interaction between F0 and spectral region for both sets of experiments can be accounted for by a single definition of resolvability.

PatentDOI
TL;DR: The invention provides a method of large vocabulary speech recognition that employs a single tree-structured phonetic hidden Markov model (HMM) at each frame of a time-synchronous process, and phonetic context information is exploited, even before the complete context of a phoneme is known.
Abstract: The invention provides a method of large vocabulary speech recognition that employs a single tree-structured phonetic hidden Markov model (HMM) at each frame of a time-synchronous process. A grammar probability is utilized upon recognition of each phoneme of a word, before recognition of the entire word is complete. Thus, grammar probabilities are exploited as early as possible during recognition of a word. At each frame of the recognition process, a grammar probability is determined for the transition from the most likely preceding grammar state to a set of words that share at least one common phoneme. The grammar probability is combined with accumulating phonetic evidence to provide a measure of the likelihood that a state in the HMM will lead to the word most likely to have been spoken. In a preferred embodiment, phonetic context information is exploited, even before the complete context of a phoneme is known. Instead of an exact triphone model, wherein the phonemes previous and subsequent to a phoneme are considered, a composite triphone model is used that exploits partial phonetic context information to provide a phonetic model that is more accurate than a phonetic model that ignores context. In another preferred embodiment, the single phonetic tree method is used as the forward pass of a forward/backward recognition process, wherein the backward pass employs a recognition process other than the single phonetic tree method.

PatentDOI
Julian M. Kupiec
TL;DR: In this paper, a system and method for automatically transcribing an input question from a form convenient for user input into a form suitable for use by a computer is presented, where the question is transduced into a signal that is converted into a sequence of symbols.
Abstract: A system and method for automatically transcribing an input question from a form convenient for user input into a form suitable for use by a computer. The question is a sequence of words represented in a form convenient for the user, such as a spoken utterance or a handwritten phrase. The question is transduced into a signal that is converted into a sequence of symbols. A set of hypotheses is generated from the sequence of symbols. The hypotheses are sequences of words represented in a form suitable for use by the computer, such as text. One or more information retrieval queries are constructed and executed to retrieve documents from a corpus (database). Retrieved documents are analyzed to produce an evaluation of the hypotheses of the set and to select one or more preferred hypotheses from the set. The preferred hypotheses are output to a display, speech synthesizer, or applications program. Additionally, retrieved documents relevant to the preferred hypotheses can be selected and output.

Journal ArticleDOI
TL;DR: Two predictors of intelligibility are used to quantify the environmental degradations: the articulation index (AI) and the speech transmission index (STI).
Abstract: The effect of articulating clearly on speech intelligibility is analyzed for ten normal‐hearing and two hearing‐impaired listeners in noisy, reverberant, and combined environments. Clear speech is more intelligible than conversational speech for each listener in every environment. The difference in intelligibility due to speaking style increases as noise and/or reverberation increase. The average difference in intelligibility is 20 percentage points for the normal‐hearing listeners and 26 percentage points for the hearing‐impaired listeners. Two predictors of intelligibility are used to quantify the environmental degradations: The articulation index (AI) and the speech transmission index (STI). Both are shown to predict, reliably, performance levels within a speaking style for normal‐hearing listeners. The AI is unable to represent the reduction in intelligibility scores due to reverberation for the hearing‐impaired listeners. Neither predictor can account for the difference in intelligibility due to spea...
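The articulation index referenced here is, in simplified form, a band-importance-weighted sum of clipped band signal-to-noise ratios. A hedged sketch (the flat 0-30 dB audibility mapping and the function name are illustrative; the standardized AI procedure in ANSI S3.5 includes further corrections):

```python
def articulation_index(band_snr_db, band_importance):
    """Simplified AI-style predictor: each band's SNR is clipped to
    [0, 30] dB, normalized to [0, 1], and weighted by its band
    importance (weights summing to 1)."""
    assert abs(sum(band_importance) - 1.0) < 1e-6
    ai = 0.0
    for snr, w in zip(band_snr_db, band_importance):
        audible = min(max(snr, 0.0), 30.0) / 30.0  # fraction of the band's dynamic range audible
        ai += w * audible
    return ai
```

A predictor of this additive form captures noise masking well but, as the study notes, cannot by itself represent reverberation effects for hearing-impaired listeners or the clear-speech advantage.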

PatentDOI
TL;DR: In this paper, a speech recognizer is employed to recognize a lexeme corresponding to that uttered by a caller in a directory assistance call, on the basis of the acoustics of the caller's utterance and the probability index.
Abstract: In a telecommunications system, automatic directory assistance uses a voice processing unit comprising a lexicon of lexemes and data representing a predetermined relationship between each lexeme and calling numbers in a locality served by the automated directory assistance apparatus. The voice processing unit issues messages to a caller making a directory assistance call to prompt the caller to utter a required one of said lexemes. The unit detects the calling number originating a directory assistance call and, responsive to the calling number and the relationship data computes a probability index representing the likelihood of a lexeme being the subject of the directory assistance call. The unit employs a speech recognizer to recognize, on the basis of the acoustics of the caller's utterance and the probability index, a lexeme corresponding to that uttered by the caller.
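The combination of acoustics and probability index described above can be read as a Bayes-rule ranking. A minimal sketch, with invented scores and lexemes (the patent does not give numeric values):

```python
# Hypothetical sketch: combine an acoustic log-likelihood with a
# per-calling-number prior ("probability index") to rank lexemes:
#   score(lexeme) = log P(acoustics | lexeme) + log P(lexeme | calling number)
import math

def rank_lexemes(acoustic_loglik, locality_prior):
    """Return lexemes sorted by combined log score, best first."""
    scores = {lex: ll + math.log(locality_prior[lex])
              for lex, ll in acoustic_loglik.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Acoustically "smith" and "smythe" are near-ties; the caller's locality
# makes "smith" far more likely, so it wins.
best = rank_lexemes(
    {"smith": -10.0, "smythe": -9.5},
    {"smith": 0.09, "smythe": 0.001},
)
print(best[0])  # smith
```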

Journal ArticleDOI
TL;DR: Temporal analysis of multisyllabic constructs reveals several syntactical rules for syllable transitions in Mustached bats, which exhibit a large intrinsic variation in their physical structure compared to the stereotypic echolocation pulses.
Abstract: Mustached bats, Pteronotus parnellii parnellii, spend most of their lives in the dark and use their auditory system for acoustic communication as well as echolocation. The sound spectrograms of their communication sounds or ‘‘calls’’ revealed that this species produces a rich variety of calls. These calls consist of one or more of the 33 different types of discrete sounds or ‘‘syllables’’ that are emitted singly and/or in combination. These syllables can be further classified as 19 simple syllables, 14 composites, and three subsyllables. Simple syllables consist of characteristic geometric patterns of CF (constant frequency), FM (frequency modulation), and NB (noise burst) sounds that are defined quantitatively using statistical criteria. Composites consist of simple syllables or subsyllables conjoined without any silent interval. Most syllable types exhibit a large intrinsic variation in their physical structure compared to the stereotypic echolocation pulses. Syllable domains are defined on the basis of ...

PatentDOI
TL;DR: An active noise cancellation system as discussed by the authors includes a series of features for more effective cancellation, greater reliability, and improved stability, such as locating a residual microphone radially offset from the center of a sound generator to detect a signal more similar to that incident upon the eardrum of the user.
Abstract: An active noise cancellation system includes a series of features for more effective cancellation, greater reliability, and improved stability. A particular feature adapted for headset systems includes locating a residual microphone radially offset from the center of a sound generator to detect a signal more similar to that incident upon the eardrum of the user. In addition, an open back headset design includes perforations on the side of the headset instead of the back, so that the perforations are less susceptible to inadvertent blockage. The system also includes a mechanism for detecting changes in the acoustic characteristics of the environment that may be caused, for example, by pressure exerted upon the earpieces, and that may destabilize the cancellation system. The system automatically responds to such changes, for example, by reducing the gain or the frequency response of the system to preserve stability. The system further includes other methods for detecting imminent instability and compensating, such as detecting the onset of signals within enhancement frequencies characteristic of the onset of instability, and adjusting the gain or frequency response of the system or suppressing the enhanced signals. The system further includes a mechanism for conserving battery life by turning the system off when sound levels are low, or adjusting the power supply to the system to correspond to the current power requirements of the system.

Journal ArticleDOI
TL;DR: In this paper, the amplitude envelopes of 24 1/4-oct bands (covering 100–6400 Hz) were processed in several ways (e.g., fast compression) in order to assess the importance of the modulation peaks and troughs.
Abstract: This paper describes a number of listening experiments to investigate the relative contribution of temporal envelope modulations and fine structure to speech intelligibility. The amplitude envelopes of 24 1/4‐oct bands (covering 100–6400 Hz) were processed in several ways (e.g., fast compression) in order to assess the importance of the modulation peaks and troughs. Results for 60 normal‐hearing subjects show that reduction of modulations by the addition of noise is more detrimental to sentence intelligibility than the same degree of reduction achieved by direct manipulation of the envelope; in some cases the benefit in speech‐reception threshold (SRT) is almost 7 dB. Two crossover levels can be defined in dividing the temporal envelope into two equally important parts. The first crossover level divides the envelope into two perceptually equal parts: Removing modulations either x dB below or above that level yields the same intelligibility score. The second crossover level divides the envelope into two acoustically equal peak and trough parts. The perceptual level is 9–12 dB higher than the acoustic level, indicating that envelope peaks are perceptually more important than troughs. Further results showed that 24 intact temporal speech envelopes with noise fine structure retain perfect intelligibility. In general, for the present type of signal manipulations, no one‐to‐one relation between the modulation‐transfer function and the intelligibility scores could be established.
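One of the envelope manipulations described above, removing modulations below a given level, amounts to clipping the dB envelope at that level. A toy sketch with invented values (the study's actual processing operated on 24 band envelopes):

```python
# Illustrative sketch: "removing modulations below level L" by flattening
# all troughs of a band's dB envelope at L. Values are stand-ins.

def clip_envelope_db(env_db, floor_db):
    """Raise every envelope sample below floor_db up to floor_db."""
    return [max(e, floor_db) for e in env_db]

env = [-30.0, -5.0, -20.0, 0.0, -40.0]   # one band envelope in dB
print(clip_envelope_db(env, -15.0))       # troughs raised to -15 dB
```

Sweeping the clipping level up or down and measuring intelligibility is, in spirit, how the perceptual crossover level described above is located.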

Journal ArticleDOI
TL;DR: Results indicate that monaural temporal acuity and binaural echo suppression may be based on different processes and that old subjects may have larger temporal windows than young subjects.
Abstract: Thresholds for detecting a gap between two Gaussian-enveloped (standard deviation = 0.5 ms), 2-kHz tones were determined in young and old listeners. The gap-detection thresholds of old adults were more variable and about twice as large as those obtained from young adults. Moreover, gap-detection thresholds were not correlated with audiometric thresholds in either group. Estimates of the width of the temporal window of young subjects, based on the detection of a gap between two tone pips, were smaller than those typically obtained when a relatively long duration pure tone is interrupted [Moore et al., J. Acoust. Soc. Am. 85, 1266-1275 (1989)]. Because the amount of time it takes to recover from an adapting stimulus is likely to affect gap detection thresholds [Glasberg et al., J. Acoust. Soc. Am. 81, 1546-1556 (1987)], smaller estimates of temporal window size would be expected in this paradigm if the amount of adaptation produced by the first tone pip was negligible. The larger gap-detection thresholds of old subjects indicate that they may have larger temporal windows than young subjects. The lack of correlation between audiometric and gap-detection thresholds indicates that this loss of temporal acuity is not related to the degree of sensorineural hearing loss. In a second experiment on the precedence effect using the same subjects, a Gaussian-enveloped tone was presented over earphones to the left ear followed by the same tone pip presented to the right ear. To more realistically approximate a sound field situation, the tone pip presented to each ear was followed 0.6 ms later by an attenuated version presented to the contralateral ear. The delay between the left- and right-ear tone-pips was varied and the transition point between hearing a single tone on the left, and hearing two such sounds in close succession (one coming from the left and the other from the right) was determined. 
The transition point in this experiment did not differ between young and old subjects nor were these transition points correlated with gap-detection thresholds. These results indicate that monaural temporal acuity and binaural echo suppression may be based on different processes.

Journal ArticleDOI
TL;DR: It is concluded that there is significant subject variability in the magnitude of the reflectance for the ten ear canals, believed to be due to cochlear and middle ear impedance differences.
Abstract: The pressure reflectance R(ω) is the transfer function that may be defined for a linear one-port network as the ratio of the reflected complex pressure to the incident complex pressure. The reflectance is closely related to the impedance of the one-port. The energy reflectance |R(ω)|² is defined as the squared magnitude of the pressure reflectance; it represents the ratio of reflected to incident energy. In the human ear canal the energy reflectance is important because it is a measure of the inefficiency of the middle ear and cochlea, and because of the insight provided by its simple frequency-domain interpretation. One may characterize the ear canal impedance by use of the pressure reflectance and its magnitude, sidestepping the difficult problems of (a) the unknown canal length from the measurement point to the eardrum, (b) the complicated geometry of the drum, and (c) the cross-sectional area changes in the canal as a function of distance. Reported here are acoustic impedance measurements, looking into the ear canal, made on ten young adults with normal hearing (ages 18–24). The measurement point in the canal was approximately 0.85 cm from the entrance of the canal. From these measurements, the pressure reflectance in the canal is computed, and impedance and reflectance measurements from 0.1 to 15.0 kHz are compared among ears. The average reflectance and the standard deviation of the reflectance for the ten subjects have been determined. The impedance and reflectance of two common ear simulators, the Bruel & Kjaer 4157 and the Industrial Research Products DB-100 (Zwislocki) coupler, are also measured and compared to the average human measurements. All measurements are made using controls that assure a uniform accuracy in the acoustic calibration across subjects. This is done by the use of two standard acoustic resistors whose impedances are known.
From the experimental results, it is concluded that there is significant subject variability in the magnitude of the reflectance for the ten ear canals. This variability is believed to be due to cochlear and middle ear impedance differences. An attempt was made at modeling the reflectance but, as discussed in the paper, several problems presently stand in the way of these models. Such models would be useful for acoustic virtual-reality systems and for active noise control earphones.
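The relation between impedance and reflectance used above is the standard one-port formula, which can be sketched directly; the normalized characteristic impedance Z0 below is illustrative, not the paper's measured ear-canal value:

```python
# Pressure reflectance from acoustic impedance (standard one-port
# relation, not the paper's measurement procedure):
#   R(w) = (Z(w) - Z0) / (Z(w) + Z0),  energy reflectance = |R(w)|**2,
# where Z0 is the characteristic impedance of the canal at the
# measurement point (assumed uniform cross section).

def pressure_reflectance(Z, Z0):
    return (Z - Z0) / (Z + Z0)

def energy_reflectance(Z, Z0):
    return abs(pressure_reflectance(Z, Z0)) ** 2

Z0 = 1.0  # normalized characteristic impedance (illustrative)
print(energy_reflectance(Z0, Z0))        # matched load: 0.0 (all energy absorbed)
print(energy_reflectance(1e9 + 0j, Z0))  # rigid termination: ~1.0 (all reflected)
```

An energy reflectance near 1 at a given frequency means the middle ear and cochlea absorb almost none of the incident energy there.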

PatentDOI
Peter William Farrett1
TL;DR: A set of intonation intervals for a chosen dialect is applied to the intonational contour of a phoneme string derived from a single set of stored linguistic units, e.g., phonemes.
Abstract: A set of intonation intervals for a chosen dialect is applied to the intonational contour of a phoneme string derived from a single set of stored linguistic units, e.g., phonemes. Sets of intonation intervals are stored to simulate or recognize different dialects or languages from a single set of stored phonemes. The interval rules preferably use a prosodic analysis of the phoneme string or other cues to apply a given interval to the phoneme string. A second set of interval data is provided for semantic information. The speech system is based on the observation that each dialect and language possesses its own set of musical relationships or intonation intervals. These musical relationships are used by a human listener to identify the particular dialect or language. The speech system may be either a speech synthesis or speech analysis tool or may be a combined speech synthesis/analysis system.
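Reading the "musical relationships" above as equal-tempered interval ratios, applying a dialect's interval to an F0 contour can be sketched as follows; the interval values are invented for illustration, since the patent gives no numbers:

```python
# Illustrative sketch: shaping an F0 contour with musical intervals
# expressed in equal-tempered semitones (2**(n/12) frequency ratio).
# The interval inventory here is hypothetical, not from the patent.

def semitone_ratio(n):
    """Frequency ratio of an n-semitone musical interval."""
    return 2.0 ** (n / 12.0)

def apply_intervals(f0_contour_hz, intervals_semitones):
    """Shift each F0 sample by the corresponding dialect interval."""
    return [f0 * semitone_ratio(n)
            for f0, n in zip(f0_contour_hz, intervals_semitones)]

# A flat 100-Hz contour shaped by a rise through a major third (+4)
# to a perfect fifth (+7 semitones).
print(apply_intervals([100.0, 100.0, 100.0], [0, 4, 7]))
```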

Journal ArticleDOI
TL;DR: In this paper, a 3D time-harmonic acoustic infinite element for modeling acoustic fields in exterior domains, typically surrounding a structure, is described, which is based on a new multipole expansion that is the exact solution for arbitrary scattered and/or radiated fields exterior to a prolate spheroid.
Abstract: This paper describes a new three‐dimensional (3‐D) time‐harmonic acoustic infinite element for modeling acoustic fields in exterior domains, typically surrounding a structure. This ‘‘prolate spheroidal infinite element’’ is based on a new multipole expansion that is the exact solution for arbitrary scattered and/or radiated fields exterior to a prolate spheroid of any eccentricity. A combination of both prolate and oblate spheroidal infinite elements (the latter to be published separately) provides a capability for very efficiently modeling acoustic fields surrounding structures of virtually any practical shape. This new prolate element has symmetric matrices that are as cheap to generate as for 2‐D elements because only 2‐D integrals need to be numerically evaluated. The prolate element (along with a symmetric‐matrix fluid–structure coupling element, also to be published separately) fits naturally into purely structural finite element codes, thereby providing a structural acoustics capability. For the cl...

PatentDOI
TL;DR: In this article, a method and system for synthesizing speech utilizing a periodic waveform decomposition and relocation coding scheme was proposed, where signals of voiced sound interval among original speech are decomposed into wavelets, each of which corresponds to a speech waveform for one period made by each glottal pulse.
Abstract: The present invention relates to a method and system for synthesizing speech utilizing a periodic waveform decomposition and relocation coding scheme. According to the scheme, signals of the voiced sound intervals of the original speech are decomposed into wavelets, each of which corresponds to the speech waveform of one period produced by a glottal pulse. These wavelets are respectively coded and stored. The wavelets nearest to the positions where wavelets are to be located are selected from the stored wavelets and decoded. The decoded wavelets are superposed on one another such that the original sound quality is maintained while the duration and pitch frequency of a speech segment can be controlled arbitrarily.
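The relocation step described above is in spirit a pitch-synchronous overlap-add: copies of a stored period waveform are summed at new pitch marks. A minimal sketch (not the patent's coding scheme), using a Hann window as a stand-in for one glottal-period wavelet:

```python
# Minimal overlap-add sketch of the decompose/relocate idea: one stored
# pitch-period "wavelet" is copied to new pitch marks and summed, so
# pitch period and duration can be changed independently of the stored
# waveform. The Hann shape is a stand-in for a real glottal-period wavelet.
import math

def hann_wavelet(length):
    """A windowed stand-in for one glottal-period waveform."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def overlap_add(wavelet, pitch_marks, out_len):
    """Place a copy of the wavelet at each pitch mark and sum."""
    out = [0.0] * out_len
    for mark in pitch_marks:
        for n, s in enumerate(wavelet):
            if 0 <= mark + n < out_len:
                out[mark + n] += s
    return out

w = hann_wavelet(80)                        # one "period" of 80 samples
y = overlap_add(w, range(0, 400, 50), 480)  # relocate at a 50-sample period
print(len(y))  # 480
```

Moving the pitch marks closer together raises the synthesized pitch; adding or dropping marks stretches or shrinks duration.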

Journal ArticleDOI
TL;DR: Three-day-old infants were tested with a non-nutritive sucking paradigm, and the results of two experiments suggest that infants can discriminate between items that contain a word boundary and items that do not, lending plausibility to the hypothesis that infants might use word boundary cues during lexical acquisition.
Abstract: Babies, like adults, hear mostly continuous speech. Unlike adults, however, they are not acquainted with the words that constitute the utterances; yet in order to construct representations for words, they have to retrieve them from the speech wave. Given the apparent lack of obvious cues to word boundaries (such as pauses between words), this is not a trivial problem. Among the several mechanisms that could be explored to solve this bootstrapping problem for lexical acquisition, a tentative but reasonable one posits the existence of some cues (other than silence) that signal word boundaries. In order to test this hypothesis, infants were used as informants in our experiments. It was hypothesized that if word boundary cues exist, and if infants are to use them in the course of language acquisition, then they should at least perceive these cues. As a consequence, infants should be able to discriminate sequences that contain a word boundary from those that do not. A number of bisyllabic stimuli were extracte...

Journal ArticleDOI
TL;DR: In this article, an original formalism that allows a thorough understanding of cross-correlation-based phase aberration correction techniques in a scattering medium is developed, based on the analysis of the second-order statistics of the pressure field scattered by a random distribution of scatterers.
Abstract: An original formalism that allows a thorough understanding of cross‐correlation‐based phase aberration correction techniques in a scattering medium is developed. This formalism is based on the analysis of the second‐order statistics of the pressure field scattered by a random distribution of scatterers. One of the major interests of this analysis is the ability it provides to evaluate and monitor the convergence of phase aberration correction techniques. This is achieved by the computation of a single parameter C that serves as a focusing criterion. The theoretical developments are validated experimentally.
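The cross-correlation step underlying such correction techniques can be illustrated in a toy form: estimate the relative delay between two element signals as the lag maximizing their cross-correlation. (The paper's focusing criterion C is a second-order-statistics refinement of this idea and is not computed here.)

```python
# Toy delay estimation by peak cross-correlation, the basic operation
# behind cross-correlation-based phase aberration correction.

def best_lag(x, y, max_lag):
    """Delay (in samples) of y relative to x, by peak cross-correlation."""
    def xcorr(lag):
        return sum(x[n] * y[n + lag]
                   for n in range(len(x))
                   if 0 <= n + lag < len(y))
    return max(range(-max_lag, max_lag + 1), key=xcorr)

# y is x delayed by 3 samples (an "aberrated" arrival).
x = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
y = [0.0] * 3 + x[:-3]
print(best_lag(x, y, 5))  # 3
```

In an array, the per-element delays estimated this way are applied as phase corrections before refocusing.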