Author

F. Soong

Bio: F. Soong is an academic researcher. The author has contributed to research in topics: Speaker recognition & Speech coding. The author has an h-index of 1 and has co-authored 1 publication receiving 156 citations.

Papers
Proceedings ArticleDOI
06 Apr 1987
TL;DR: Three different approaches for automatically segmenting speech into phonetic units are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on a constrained-clustering vector quantization approach.
Abstract: For large vocabulary and continuous speech recognition, the sub-word-unit-based approach is a viable alternative to the whole-word-unit-based approach. For preparing a large inventory of subword units, automatic segmentation is preferable to manual segmentation, as it substantially reduces the work associated with generating templates and gives more consistent results. In this paper we discuss methods for automatically segmenting speech into phonetic units. Three different approaches are described: one based on template matching, one based on detecting the spectral changes that occur at the boundaries between phonetic units, and one based on a constrained-clustering vector quantization approach. An evaluation of the performance of the automatic segmentation methods is given.
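The spectral-change approach lends itself to a compact illustration. The sketch below is a rough approximation, not the paper's method: the feature type, frame rate, and peak-picking thresholds are assumed. It frames the signal, computes MFCCs, and places candidate phone boundaries at peaks of the frame-to-frame spectral distance.

```python
# Spectral-change segmentation sketch (assumed parameterization, not
# the paper's exact settings): boundaries are hypothesized where the
# short-time spectrum changes fastest.
import numpy as np
import librosa
from scipy.signal import find_peaks

def spectral_change_boundaries(y, sr, hop_s=0.010, min_gap_s=0.030):
    hop = int(hop_s * sr)
    # 13 MFCCs per frame at an assumed 10 ms hop
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    # Euclidean distance between adjacent frames as a spectral-change score
    change = np.linalg.norm(np.diff(mfcc, axis=1), axis=0)
    # Peaks in the change curve, at least min_gap_s apart, are candidate boundaries
    peaks, _ = find_peaks(change,
                          distance=max(1, int(min_gap_s / hop_s)),
                          height=np.mean(change))
    return peaks * hop_s  # boundary times in seconds

# y, sr = librosa.load("utterance.wav", sr=16000)
# print(spectral_change_boundaries(y, sr))
```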

156 citations


Cited by
Journal ArticleDOI
TL;DR: Outlines current advances in automatic speech recognition (ASR) and spoken language systems, and the remaining deficiencies in dealing with the variation naturally present in speech.
Abstract: Major progress is regularly recorded in both the technology and the exploitation of automatic speech recognition (ASR) and spoken language systems. However, technological barriers to flexible solutions and user satisfaction remain under some circumstances, related to factors such as sensitivity to the environment (background noise) or the weak representation of grammatical and semantic knowledge. Current research also emphasizes deficiencies in dealing with the variation naturally present in speech. For instance, the lack of robustness to foreign accents precludes use by certain populations. Some applications, like directory assistance, particularly stress the core recognition technology because of their very high active vocabulary (application perplexity). Many factors affect speech realization: regional, sociolinguistic, or related to the environment or the speaker herself. These create a wide range of variation that may not be modeled correctly (speaker, gender, speaking rate, vocal effort, regional accent, speaking style, non-stationarity, etc.), especially when resources for system training are scarce. This paper outlines current advances related to these topics.
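As a rough illustration of the "application perplexity" point above: perplexity is the geometric-mean inverse probability a language model assigns to the test words, so a directory-assistance task whose N listings are roughly equally likely has perplexity near N. A minimal sketch (the flat unigram model is an assumption for illustration, not how a deployed system is scored):

```python
import math

def perplexity(probs):
    """Perplexity of a word sequence given per-word model probabilities."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))

# A directory-assistance-style task: each of 10000 names is (roughly)
# equally likely, so perplexity is close to the vocabulary size.
print(perplexity([1 / 10000] * 5))  # 10000.0
```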

507 citations

Proceedings ArticleDOI
07 Sep 2014
TL;DR: WiHear is presented, which enables Wi-Fi signals to “hear” people's talks within the radio range without deploying any devices, and can simultaneously “hear” multiple people's talks by leveraging MIMO technology.
Abstract: Recent literature advances Wi-Fi signals to "see" people's motions and locations. This paper asks the following question: Can Wi-Fi "hear" our talks? We present WiHear, which enables Wi-Fi signals to "hear" our talks without deploying any devices. To achieve this, WiHear needs to detect and analyze fine-grained radio reflections from mouth movements. WiHear solves this micro-movement detection problem by introducing Mouth Motion Profile, which leverages partial multipath effects and wavelet packet transformation. Since Wi-Fi signals do not require line-of-sight, WiHear can "hear" people's talks within the radio range. Further, WiHear can simultaneously "hear" multiple people's talks by leveraging MIMO technology. We implement WiHear on both the USRP N210 platform and commercial Wi-Fi infrastructure. Results show that within our pre-defined vocabulary, WiHear can achieve detection accuracy of 91% on average for a single individual speaking no more than 6 words, and up to 74% for no more than 3 people talking simultaneously. Moreover, the detection accuracy can be further improved by deploying multiple receivers from different angles.
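A compact sketch of the wavelet-packet step that the Mouth Motion Profile builds on: decompose a received-signal amplitude trace into sub-bands and measure per-band energy, expecting slow mouth motion to concentrate in the lowest bands. The signal source, wavelet, decomposition depth, and band selection are assumptions for illustration, not WiHear's actual pipeline.

```python
# Wavelet packet sub-band energy sketch (assumed wavelet and depth,
# simulated signal; not WiHear's implementation).
import numpy as np
import pywt

def subband_energies(amplitude_trace, wavelet="db4", level=4):
    wp = pywt.WaveletPacket(data=amplitude_trace, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    # Leaf nodes at the chosen level, ordered from low to high frequency
    nodes = wp.get_level(level, order="freq")
    return np.array([np.sum(np.square(n.data)) for n in nodes])

# Simulated amplitude trace: a slow "mouth motion" component plus noise
t = np.linspace(0, 1, 1024)
trace = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.randn(t.size)
energies = subband_energies(trace)
print(energies[:4] / energies.sum())  # most energy sits in the lowest bands
```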

250 citations

Journal ArticleDOI
TL;DR: WiHear is presented, which enables Wi-Fi signals to “hear” people's talks within the radio range without deploying any devices, and can simultaneously “hear” multiple people's talks by leveraging MIMO technology.
Abstract: Recent literature advances Wi-Fi signals to “see” people's motions and locations. This paper asks the following question: Can Wi-Fi “hear” our talks? We present WiHear, which enables Wi-Fi signals to “hear” our talks without deploying any devices. To achieve this, WiHear needs to detect and analyze fine-grained radio reflections from mouth movements. WiHear solves this micro-movement detection problem by introducing Mouth Motion Profile, which leverages partial multipath effects and wavelet packet transformation. Since Wi-Fi signals do not require line-of-sight, WiHear can “hear” people's talks within the radio range. Further, WiHear can simultaneously “hear” multiple people's talks by leveraging MIMO technology. We implement WiHear on both the USRP N210 platform and commercial Wi-Fi infrastructure. Results show that within our pre-defined vocabulary, WiHear can achieve detection accuracy of 91 percent on average for a single individual speaking no more than six words, and up to 74 percent for no more than three people talking simultaneously. Moreover, the detection accuracy can be further improved by deploying multiple receivers from different angles.

249 citations

Journal ArticleDOI
TL;DR: It appeared that spectral vowel reduction is better interpreted as the result of increased contextual assimilation than as a tendency to centralize.
Abstract: The effect of sentence accent, word stress, and word class (function words versus content words) on the acoustic properties of 9 Dutch vowels in fluent speech was investigated. A list of sentences was read aloud by 15 male speakers. Each sentence contained one syllable of interest: a monosyllabic function word, an unstressed syllable of a content word, or a stressed syllable of a content word. The same syllable occurred in all three conditions. Sentence accent was manipulated with questions that preceded the sentences. A total of 3465 vowels were segmented from the syllables and analysed. All three factors had a significant effect both on the steady-state formant frequencies (F1 and F2) and on the duration of the vowels. Word stress and word class had a stronger effect on the vowels than sentence accent. A listening experiment showed the perceptual significance of the acoustic measurements. Spectral vowel reduction was better interpreted as the result of increased contextual assimilation than as a tendency to centralize. We also studied changes in the dynamics of the formant tracks due to the experimental conditions. Formant tracks of reduced vowels became flatter, which supports the view of increased contextual assimilation. Three simple models of vowel reduction are discussed.
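Steady-state formant measurements of the kind reported above are commonly obtained with LPC analysis. The sketch below shows that standard recipe (the LPC order, pre-emphasis, and bandwidth filtering are common-practice assumptions, not necessarily the study's exact procedure): fit a linear prediction polynomial to a vowel segment and convert pole angles to formant frequencies.

```python
# LPC-based F1/F2 estimation sketch (common-practice settings assumed).
import numpy as np
import librosa

def lpc_formants(segment, sr, order=12, max_bw=400.0):
    # Pre-emphasis to flatten the spectral tilt before LPC fitting
    seg = np.append(segment[0], segment[1:] - 0.97 * segment[:-1])
    a = librosa.lpc(seg, order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # pole angle -> frequency (Hz)
    bws = -sr / np.pi * np.log(np.abs(roots))    # pole radius -> bandwidth (Hz)
    # Keep plausible, narrow-bandwidth resonances; lowest two are F1 and F2
    formants = sorted(f for f, b in zip(freqs, bws)
                      if 90.0 < f < sr / 2 and b < max_bw)
    return formants[:2]

# segment, sr = librosa.load("vowel.wav", sr=16000)
# print(lpc_formants(segment, sr))
```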

189 citations

Journal ArticleDOI
TL;DR: This article describes an automatic procedure for the segmentation of speech: given either the linguistic or the phonetic content of a speech utterance, the system provides phone boundaries.
Abstract: Accurate database documentation at the phonetic level is very important for speech research; however, manual segmentation and labeling is a time-consuming and error-prone task. This article describes an automatic procedure for the segmentation of speech: given either the linguistic or the phonetic content of a speech utterance, the system provides phone boundaries. The technique is based on an acoustic-phonetic-unit Hidden Markov Model (HMM) recognizer; both the recognizer and the segmentation system were designed using the DARPA-TIMIT acoustic-phonetic continuous speech database of American English. Segmentation and labeling experiments were conducted under different conditions to check the reliability of the resulting system. Satisfactory results were obtained, especially when the system is trained with some manually presegmented material. The size of this material is a crucial factor, and system performance has been evaluated with respect to this parameter. The system provides 88.3% correct boundary location, given a tolerance of 20 ms, when only 256 phonetically balanced sentences are used for its training.
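The 20-ms boundary-accuracy figure can be made concrete with a small sketch of the metric: a hypothesized boundary counts as correct if it falls within the tolerance of a reference boundary, with each reference matched at most once (the greedy matching here is an assumption for illustration; the paper does not specify its matching rule).

```python
# Boundary-accuracy sketch: fraction of reference boundaries matched
# by a hypothesis within a time tolerance (20 ms in the paper).
def boundary_accuracy(ref, hyp, tol=0.020):
    hyp = sorted(hyp)
    used = [False] * len(hyp)
    hits = 0
    for r in sorted(ref):
        for i, h in enumerate(hyp):
            if not used[i] and abs(h - r) <= tol:
                used[i] = True  # each hypothesis matches at most one reference
                hits += 1
                break
    return hits / len(ref) if ref else 1.0

ref = [0.10, 0.25, 0.42, 0.60]   # reference boundary times (s)
hyp = [0.11, 0.30, 0.43, 0.61]   # hypothesized boundary times (s)
print(boundary_accuracy(ref, hyp))  # 0.75: the 0.30 guess misses 0.25 by > 20 ms
```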

186 citations