
Showing papers on "Voice activity detection published in 2007"


Book
07 Jun 2007
TL;DR: Clear and concise, this book explores how human listeners compensate for acoustic noise in noisy environments and suggests steps that can be taken to realize the full potential of these algorithms under realistic conditions.
Abstract: With the proliferation of mobile devices and hearing devices, including hearing aids and cochlear implants, there is a growing and pressing need to design algorithms that can improve speech intelligibility without sacrificing quality. Responding to this need, Speech Enhancement: Theory and Practice, Second Edition introduces readers to the basic problems of speech enhancement and the various algorithms proposed to solve these problems. Updated and expanded, this second edition of the bestselling textbook broadens its scope to include evaluation measures and enhancement algorithms aimed at improving speech intelligibility. Organized into four parts, the book begins with a review of the fundamentals needed to understand and design better speech enhancement algorithms. The second part describes all the major enhancement algorithms and, because these require an estimate of the noise spectrum, also covers noise estimation algorithms. The third part looks at the measures used to assess the performance, in terms of speech quality and intelligibility, of speech enhancement methods; it also evaluates and compares several of the algorithms. The fourth part presents binary mask algorithms for improving speech intelligibility under ideal conditions and suggests steps that can be taken to realize the full potential of these algorithms under realistic conditions. New in this edition: updates in every chapter; a new chapter on objective speech intelligibility measures; a new chapter on algorithms for improving speech intelligibility; real-world noise recordings (on the accompanying CD); MATLAB code for the implementation of intelligibility measures (on the accompanying CD); and MATLAB and C/C++ code for the implementation of algorithms to improve speech intelligibility (on the accompanying CD). Clear and concise, this book explores how human listeners compensate for acoustic noise in noisy environments. Written by a pioneer in speech enhancement and noise reduction in cochlear implants, it is an essential resource for anyone who wants to implement or incorporate the latest speech enhancement algorithms to improve the quality and intelligibility of speech degraded by noise. The accompanying CD provides MATLAB implementations of representative speech enhancement algorithms as well as speech and noise databases for the evaluation of enhancement algorithms.

2,269 citations


Book
30 Nov 2007
TL;DR: A comprehensive overview of digital speech processing that ranges from the basic nature of the speech signal, through a variety of methods of representing speech in digital form, to applications in voice communication and automatic synthesis and recognition of speech.
Abstract: Since even before the time of Alexander Graham Bell's revolutionary invention, engineers and scientists have studied the phenomenon of speech communication with an eye on creating more efficient and effective systems of human-to-human and human-to-machine communication. Starting in the 1960s, digital signal processing (DSP) assumed a central role in speech studies, and today DSP is the key to realizing the fruits of the knowledge that has been gained through decades of research. Concomitant advances in integrated circuit technology and computer architecture have aligned to create a technological environment with virtually limitless opportunities for innovation in speech communication applications. In this text, we highlight the central role of DSP techniques in modern speech communication research and applications. We present a comprehensive overview of digital speech processing that ranges from the basic nature of the speech signal, through a variety of methods of representing speech in digital form, to applications in voice communication and automatic synthesis and recognition of speech. The breadth of this subject does not allow us to discuss any aspect of speech processing to great depth; hence our goal is to provide a useful introduction to the wide range of important concepts that comprise the field of digital speech processing. A more comprehensive treatment will appear in the forthcoming book, Theory and Application of Digital Speech Processing [101].

369 citations


Book ChapterDOI
01 Jun 2007
TL;DR: This chapter gives a comprehensive overview of the main challenges in voice activity detection, reviews the different solutions that have been reported in the state of the art, and describes the evaluation frameworks that are normally used.
Abstract: An important drawback affecting most speech processing systems is environmental noise and its harmful effect on system performance. Examples of such systems are the new wireless communications voice services or digital hearing aid devices. In speech recognition, there are still technical barriers inhibiting such systems from meeting the demands of modern applications. Numerous noise reduction techniques have been developed to palliate the effect of noise on system performance, and these often require an estimate of the noise statistics obtained by means of a precise voice activity detector (VAD). Speech/non-speech detection is an unsolved problem in speech processing and affects numerous applications including robust speech recognition (Karray and Martin, 2003; Ramirez et al., 2003), discontinuous transmission (ITU, 1996; ETSI, 1999), real-time speech transmission on the Internet (Sangwan et al., 2002), and combined noise reduction and echo cancellation schemes in the context of telephony (Basbug et al., 2004; Gustafsson et al., 2002). The speech/non-speech classification task is not as trivial as it appears, and most VAD algorithms fail when the level of background noise increases. During the last decade, numerous researchers have developed different strategies for detecting speech in a noisy signal (Sohn et al., 1999; Cho and Kondoz, 2001; Gazor and Zhang, 2003; Armani et al., 2003) and have evaluated the influence of VAD effectiveness on the performance of speech processing systems (Bouquin-Jeannes and Faucon, 1995). Most of the approaches have focused on the development of robust algorithms, with special attention paid to the derivation and study of noise-robust features and decision rules (Woo et al., 2000; Li et al., 2002; Marzinzik and Kollmeier, 2002). The different VAD methods include those based on energy thresholds (Woo et al., 2000), pitch detection (Chengalvarayan, 1999), spectrum analysis (Marzinzik and Kollmeier, 2002), zero-crossing rate (ITU, 1996), periodicity measures (Tucker, 1992), higher-order statistics in the LPC residual domain (Nemer et al., 2001), and combinations of different features (ITU, 1993; ETSI, 1999; Tanyer and Ozer, 2000). This chapter gives a comprehensive overview of the main challenges in voice activity detection, the different solutions reported in a complete review of the state of the art, and the evaluation frameworks that are normally used. The application of VADs to speech coding, speech enhancement and robust speech recognition systems is shown and discussed. Three different VAD methods are described and compared to standardized and

256 citations
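Among the VAD families surveyed in the chapter above, the energy-threshold approach (Woo et al., 2000) is the simplest to state. The following is a minimal sketch of that family; the frame sizes, the percentile-based noise floor, and the 6 dB margin are illustrative assumptions chosen here for exposition, not values from the chapter.

```python
import numpy as np

def energy_vad(x, fs, frame_ms=25, hop_ms=10, margin_db=6.0):
    """Flag frames whose log-energy exceeds the estimated noise floor by margin_db."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame) // hop)
    e = np.array([10 * np.log10(np.sum(x[i*hop:i*hop+frame]**2) + 1e-12)
                  for i in range(n_frames)])
    noise_floor = np.percentile(e, 10)   # assume the ~10% quietest frames are noise
    return e > noise_floor + margin_db   # boolean speech/non-speech decision per frame
```

Real detectors of this kind add hangover smoothing and adaptive thresholds, which is precisely where they start to fail as the background noise level rises.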


Journal ArticleDOI
TL;DR: Overall, the analysis of consonant confusion matrices suggests that in order for noise reduction algorithms to improve speech intelligibility, they need to improve the place and manner feature scores.
Abstract: An evaluation of the intelligibility of noise reduction algorithms is reported. IEEE sentences and consonants were corrupted by four types of noise (babble, car, street, and train) at two signal-to-noise ratio levels (0 and 5 dB), and then processed by eight speech enhancement methods encompassing four classes of algorithms: spectral subtractive, subspace, statistical model based, and Wiener-type. The enhanced speech was presented to normal-hearing listeners for identification. With the exception of a single noise condition, no algorithm produced significant improvements in speech intelligibility. Information transmission analysis of the consonant confusion matrices indicated that no algorithm significantly improved the place feature score, which is critically important for speech recognition. The algorithms found in previous studies to perform best in terms of overall quality were not the same algorithms that performed best in terms of speech intelligibility. The subspace algorithm, for instance, was previously found to perform the worst in terms of overall quality, but performed well in the present study in terms of preserving speech intelligibility. Overall, the analysis of consonant confusion matrices suggests that for noise reduction algorithms to improve speech intelligibility, they need to improve the place and manner feature scores.

251 citations
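The place and manner scores above come from Miller-Nicely style information transmission analysis of the consonant confusion matrices. A minimal sketch of that computation, assuming a count matrix whose rows and columns are the presented and responded classes of a single feature (e.g., place), with the grouping of consonants into classes done beforehand:

```python
import numpy as np

def relative_info_transmitted(confusions):
    """Miller-Nicely relative transmitted information for a count matrix
    (rows: presented feature classes, cols: responded feature classes)."""
    p = confusions / confusions.sum()
    pi = p.sum(axis=1, keepdims=True)    # stimulus marginals
    pj = p.sum(axis=0, keepdims=True)    # response marginals
    nz = p > 0
    mutual = np.sum(p[nz] * np.log2(p[nz] / (pi @ pj)[nz]))
    h_stimulus = -np.sum(pi[pi > 0] * np.log2(pi[pi > 0]))
    return mutual / h_stimulus           # 1.0 = feature perfectly transmitted
```

A score near 1.0 means listeners almost never confused consonants across classes of that feature, even if they confused consonants within a class.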


Patent
29 Jun 2007
TL;DR: In this paper, an audible indication of a user's position within a given speech grammar framework is provided for a speech-enabled software application, and recognition of speech grammars is limited to use only when the software application that has requested a given set of speech grammars is in focus for a user of an associated mobile computing device.
Abstract: An audible indication of a user's position within a given speech grammar framework is provided for a speech-enabled software application, and recognition of speech grammars are limited to use only when a software application that has requested a given set of speech grammars is in focus by a user of an associated mobile computing device.

244 citations


Patent
30 Nov 2007
TL;DR: In this paper, an overall system/method for text-input using a multimodal interface with speech recognition is described, where pluralities of modes interact with the main speech mode to provide the speech-recognition system with partial knowledge of the text corresponding to the spoken utterance forming the input to the speech recognition system.
Abstract: The disclosure describes an overall system/method for text input using a multimodal interface with speech recognition. Specifically, a plurality of modes interact with the main speech mode to provide the speech recognition system with partial knowledge of the text corresponding to the spoken utterance forming its input. The knowledge from the other modes is used to dynamically change the ASR system's active vocabulary, thereby significantly increasing recognition accuracy and significantly reducing processing requirements. Additionally, the speech recognition system is configured using three different system configurations (always listening, partially listening, and push-to-speak), and for each of these a corresponding user interface is proposed (speak-and-type, type-and-speak, and speak-while-typing). Finally, the overall user interface of the proposed system is designed to enhance existing standard text input methods, thereby minimizing the behavior change for mobile users.

243 citations


Patent
29 Oct 2007
TL;DR: In this paper, a method of speech recognition is described for use with mobile devices, where a portion of an initial speech recognition result is presented on the mobile device including a set of general alternate recognition hypotheses associated with the portion of the speech recognition results.
Abstract: A method of speech recognition is described for use with mobile devices. A portion of an initial speech recognition result is presented on the mobile device including a set of general alternate recognition hypotheses associated with the portion of the speech recognition result. A key input representative of one or more associated letters is received from the user. The user is provided with a set of restricted alternate recognition hypotheses starting with the one or more letters associated with the key input. Then a user selection is accepted of one of the restricted alternate recognition hypotheses to represent a corrected speech recognition result.

216 citations
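A sketch of the correction loop described in the patent above, under two assumptions made here for illustration: each key maps to a set of letters as on a phone keypad, and hypotheses are plain strings.

```python
KEY_LETTERS = {"2": "abc", "3": "def"}  # assumed keypad mapping (partial)

def restrict_hypotheses(hypotheses, key):
    """Keep only alternate hypotheses starting with a letter on the pressed key."""
    letters = set(KEY_LETTERS[key])
    return [h for h in hypotheses if h and h[0].lower() in letters]

# e.g. restrict_hypotheses(["ball", "call", "doll"], "2") -> ["ball", "call"]
```

The user then picks one of the surviving restricted hypotheses as the corrected recognition result.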


Patent
19 Nov 2007
TL;DR: In this article, a speech recognition method with barge-in is proposed, comprising starting a synthesis of recorded speech, receiving a user speech input signal providing information regarding a user choice, detecting an initial portion of the user input signal, selectively altering the synthesized speech, and recognizing the user choice.
Abstract: Embodiments of the present invention improve methods of performing speech recognition with barge-in. In one embodiment, the present invention includes a speech recognition method comprising starting a synthesis of recorded speech, receiving a user speech input signal providing information regarding a user choice, detecting an initial portion of the user speech input signal, selectively altering the synthesis of recorded speech, and recognizing the user choice.

213 citations


Journal ArticleDOI
TL;DR: A survey of a growing body of work in which representations of speech production are used to improve automatic speech recognition is provided.
Abstract: Although much is known about how speech is produced, and research into speech production has resulted in measured articulatory data, feature systems of different kinds, and numerous models, speech production knowledge is almost totally ignored in current mainstream approaches to automatic speech recognition. Representations of speech production allow simple explanations for many phenomena observed in speech which cannot be easily analyzed from either acoustic signal or phonetic transcription alone. In this article, a survey of a growing body of work in which such representations are used to improve automatic speech recognition is provided.

207 citations


Patent
Hisayuki Nagashima1
11 Sep 2007
TL;DR: In this article, the authors present a voice recognition system that includes a first voice recognition processing unit for executing processing of assigning weights of a first ratio to a sound score and a language score calculated for the input voice and recognizing the voice using the obtained scores to determine a type of a domain representing the control object based on a result of the processing.
Abstract: A voice recognition device, a voice recognition method, and a voice recognition program capable of appropriately restricting recognition objects based on voice input from a user, so as to recognize the input voice with accuracy, are provided. The voice recognition device includes a first voice recognition processing unit, which assigns weights of a first ratio to a sound score and a language score calculated for the input voice and recognizes the voice using the obtained scores to determine the type of a domain representing the control object, and a second voice recognition processing unit, which, using the domain of the determined type as the recognition object, assigns weights of a second ratio to the sound score and the language score, the weight on the sound score being greater in the second ratio than in the first, and recognizes the voice using the obtained scores to determine the control content of the control object.

192 citations
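The two-pass weighting in the patent above reduces to combining the sound (acoustic) and language scores linearly with a different ratio per pass. A minimal sketch; the 0.5 and 0.7 weights in the comments are illustrative assumptions, since the patent only requires that the sound-score weight be greater in the second pass:

```python
def combined_score(sound_score, language_score, sound_weight):
    """Weighted combination of sound and language scores (log domain assumed)."""
    return sound_weight * sound_score + (1.0 - sound_weight) * language_score

# First pass (domain-type selection): balanced weights, e.g. sound_weight = 0.5.
# Second pass (control content within the domain): heavier sound weight, e.g. 0.7,
# since the domain restriction already supplies much of the linguistic constraint.
```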


Journal ArticleDOI
TL;DR: Because zeros close to the unit circle in the z-plane and pitch periodicity effects prevent the group delay function from capturing the short-time spectral structure of speech, the function is modified to overcome these effects; cepstral features extracted from it are called the modified group delay feature (MODGDF).
Abstract: Spectral representation of speech is complete when both the Fourier transform magnitude and phase spectra are specified. In conventional speech recognition systems, features are generally derived from the short-time magnitude spectrum. Although the importance of Fourier transform phase in speech perception has been realized, few attempts have been made to extract features from it. This is primarily because the resonances of the speech signal, which manifest as transitions in the phase spectrum, are completely masked by the wrapping of the phase spectrum. Hence, an alternative to processing the Fourier transform phase for extracting speech features is to process the group delay function, which can be directly computed from the speech signal. The group delay function has been used in earlier efforts to extract pitch and formant information from the speech signal. In all these efforts, no attempt was made to extract features from the speech signal and use them for speech recognition applications. This is primarily because the group delay function fails to capture the short-time spectral structure of speech owing to zeros that are close to the unit circle in the z-plane and also due to pitch periodicity effects. In this paper, the group delay function is modified to overcome these effects. Cepstral features are extracted from the modified group delay function and are called the modified group delay feature (MODGDF). The MODGDF is used for three speech recognition tasks, namely speaker, language, and continuous-speech recognition. Based on the results of feature and performance evaluation, the significance of the MODGDF as a new feature for speech recognition is discussed.
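A sketch of the MODGDF computation following its usual formulation: the group delay numerator is kept, the |X(ω)|² denominator is replaced by a cepstrally smoothed spectrum raised to 2γ, the result is sign-preservingly compressed with an exponent α, and cepstral-style coefficients are taken with a DCT. The lifter length and the α, γ values below are illustrative assumptions, not the paper's tuned settings.

```python
import numpy as np
from scipy.fftpack import dct

def modgdf(x, nfft=512, lifter=30, alpha=0.4, gamma=0.9, n_ceps=13):
    """Modified group delay feature for one windowed frame x."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, nfft)
    Y = np.fft.rfft(n * x, nfft)                  # DFT of n*x(n)
    # Cepstrally smoothed magnitude spectrum replaces |X|^2 in the denominator,
    # taming the zeros close to the unit circle.
    logmag = np.log(np.abs(X) + 1e-12)
    c = np.fft.irfft(logmag)
    c[lifter:-lifter] = 0.0                       # keep low quefrencies only
    S = np.exp(np.fft.rfft(c).real)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-12)
    tau_m = np.sign(tau) * np.abs(tau) ** alpha   # compressed group delay
    return dct(tau_m, norm='ortho')[:n_ceps]      # cepstral-style coefficients
```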

Patent
01 Oct 2007
TL;DR: In this paper, a post-recognition processor coupled to an interface is used to compare recognized speech data generated by the speech recognition engine to contextual information retained in a memory and to transmit the modified recognized speech data to a parsing component.
Abstract: A system for improving speech recognition includes an interface linked to a speech recognition engine. A post-recognition processor coupled to the interface compares recognized speech data generated by the speech recognition engine to contextual information retained in a memory, generates modified recognized speech data, and transmits the modified recognized speech data to a parsing component.

Journal ArticleDOI
TL;DR: It is shown that the SP-SDW-MWF is more robust against signal model errors than the GSC, and that the block-structured step size matrix gives rise to faster convergence and better tracking performance than the diagonal step size matrix, at only a slightly higher computational cost.

Patent
15 Nov 2007
TL;DR: A speech processing system includes a multiplexer that receives speech data input as part of a conversation turn in a conversation session between two or more users, where one user is a speaker and each of the other users is a listener in each conversation turn, as mentioned in this paper.
Abstract: A speech processing system includes a multiplexer that receives speech data input as part of a conversation turn in a conversation session between two or more users, where one user is a speaker and each of the other users is a listener in each conversation turn. A speech recognizing engine converts the speech data to an input string of acoustic data, while a speech modifier forms an output string based on the input string by changing an item of acoustic data according to a rule. The system also includes a phoneme speech engine for converting the first output string of acoustic data, including modified and unmodified data, to speech data for output via the multiplexer to listeners during the conversation turn.

Patent
08 Aug 2007
TL;DR: In this paper, a solution for customizing synthetic voice characteristics in a user specific fashion is presented, where a data store can be searched for a speech profile associated with the user, and a set of speech output characteristics established for the user from the profile can be determined.
Abstract: The present invention discloses a solution for customizing synthetic voice characteristics in a user specific fashion. The solution can establish a communication between a user and a voice response system. A data store can be searched for a speech profile associated with the user. When a speech profile is found, a set of speech output characteristics established for the user from the profile can be determined. Parameters and settings of a text-to-speech engine can be adjusted in accordance with the determined set of speech output characteristics. During the established communication, synthetic speech can be generated using the adjusted text-to-speech engine. Thus, each detected user can hear a synthetic speech generated by a different voice specifically selected for that user. When no user profile is detected, a default voice or a voice based upon a user's speech or communication details can be used.

Patent
20 Mar 2007
TL;DR: In this paper, a multimodal digital audio editor is coupled with an ASR engine for indexing digitized speech with words represented in the digitised speech, with the ASR system recognizing user speech including a recognized word and inserting the recognized word into a speech recognition grammar.
Abstract: Indexing digitized speech with words represented in the digitized speech, with a multimodal digital audio editor operating on a multimodal device supporting modes of user interaction, the modes of user interaction including a voice mode and one or more non-voice modes, the multimodal digital audio editor operatively coupled to an ASR engine, including providing by the multimodal digital audio editor to the ASR engine digitized speech for recognition; receiving in the multimodal digital audio editor from the ASR engine recognized user speech including a recognized word, also including information indicating where, in the digitized speech, representation of the recognized word begins; and inserting by the multimodal digital audio editor the recognized word, in association with the information indicating where, in the digitized speech, representation of the recognized word begins, into a speech recognition grammar, the speech recognition grammar voice enabling user interface commands of the multimodal digital audio editor.

Patent
18 May 2007
TL;DR: In this article, a speech synthesis system and method including an application consisting of two networked parts, a client and a server, which uses the capabilities of the server to speech enable a client that does not have speech capabilities.
Abstract: A speech synthesis system and method including an application consisting of two networked parts, a client and a server, which uses the capabilities of the server to speech-enable a client that does not have speech capabilities. The system has been designed to enable a client computer with audio capabilities to connect and request text-to-speech operations via a network or Internet connection.

Patent
01 Aug 2007
TL;DR: In this paper, an energy change of each frame making up signals including speech and non-speech signals is detected and a speech segment corresponding to frames that include only speech signals from among the frames based on the detected energy change.
Abstract: A speech recognition method, medium, and system. The method includes detecting an energy change of each frame making up signals including speech and non-speech signals, and identifying a speech segment corresponding to frames that include only speech signals from among the frames based inclusive of the detected energy change.
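A sketch of the claim above, under assumptions made here for illustration (fixed 20 ms frames, ±10 dB change thresholds): compute per-frame energy and mark the span between a sharp frame-to-frame energy rise and the next sharp fall as the speech segment.

```python
import numpy as np

def speech_segment_by_energy_change(x, fs, frame_ms=20, rise_db=10.0, fall_db=10.0):
    """Return (start_frame, end_frame) bracketed by a sharp energy rise/fall, or None."""
    frame = int(fs * frame_ms / 1000)
    n = len(x) // frame
    e = np.array([10 * np.log10(np.sum(x[i*frame:(i+1)*frame]**2) + 1e-12)
                  for i in range(n)])
    de = np.diff(e)                          # frame-to-frame energy change
    rises = np.where(de > rise_db)[0]
    falls = np.where(de < -fall_db)[0]
    if rises.size == 0 or falls.size == 0:
        return None
    start = rises[0] + 1                     # first frame after the sharp rise
    ends = falls[falls >= start]
    return (start, int(ends[0])) if ends.size else (start, n - 1)
```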

Journal ArticleDOI
TL;DR: It is demonstrated that speech recognition error rates for interactive read aloud can be reduced by more than 50% through a combination of advances in both statistical language and acoustic modeling.

Patent
22 Mar 2007
TL;DR: In this article, a novel adaptive beamforming technique with enhanced noise suppression capability is proposed, where the sound source presence probability is estimated based on the instantaneous direction of arrival of the input signals and voice activity detection.
Abstract: A novel adaptive beamforming technique with enhanced noise suppression capability. The technique incorporates the sound-source presence probability into an adaptive blocking matrix. In one embodiment the sound-source presence probability is estimated based on the instantaneous direction of arrival of the input signals and voice activity detection. The technique guarantees robustness to steering vector errors without imposing ad hoc constraints on the adaptive filter coefficients. It can provide good suppression performance for both directional interference signals as well as isotropic ambient noise.
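The central idea in the patent above is to gate the adaptation of the blocking matrix by the sound-source presence probability, so the blocking filters adapt toward the target only when the target is likely active. A simplified single-filter NLMS sketch; the NLMS form and the scalar probability input are assumptions made here for illustration (the patent estimates the probability from the instantaneous DOA and voice activity detection):

```python
import numpy as np

def adapt_blocking_filter(b, ref_taps, mic_sample, p_presence, mu=0.1):
    """One NLMS step for a blocking-matrix filter b, gated by presence probability.
    ref_taps:   recent fixed-beamformer output samples (target reference), len(b) taps
    mic_sample: current sample of one microphone channel"""
    leakage = np.dot(b, ref_taps)            # estimated target component in the mic
    e = mic_sample - leakage                 # ideally target-free noise reference
    step = mu * p_presence / (np.dot(ref_taps, ref_taps) + 1e-12)
    return b + step * e * ref_taps, e        # updated filter, noise-reference sample
```

With p_presence near zero the filter freezes, which is what prevents the blocking matrix from adapting to (and later cancelling) interference during target silence.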

Patent
20 Mar 2007
TL;DR: In this paper, the system includes a voice recognition unit and a speech processing server that work together to enable users to interact with the system using voice commands guided by navigation context sensitive voice prompts, and provide user-requested data in a verbalized format back to the users.
Abstract: A system for providing access to data via a voice interface. In one embodiment, the system includes a voice recognition unit and a speech processing server that work together to enable users to interact with the system using voice commands guided by navigation context sensitive voice prompts, and provide user-requested data in a verbalized format back to the users. Digitized voice waveform data are processed to determine the voice commands of the user. The system also uses a “grammar” that enables users to retrieve data using intuitive natural language speech queries. In response to such a query, a corresponding data query is generated by the system to retrieve one or more data sets corresponding to the query. The user is then enabled to browse the data that are returned through voice command navigation, wherein the system “reads” the data back to the user using text-to-speech (TTS) conversion and system prompts.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: This work proposes a speech separation method that employs a maximum signal-to-noise (MaxSNR) beamformer combined with a voice activity detector and online clustering, and discusses the scaling ambiguity problem as regards the MaxSNR beamformer, and provides their solutions.
Abstract: We propose a speech separation method for a meeting situation, where each speaker speaks only some of the time and the number of speakers changes every moment. Many source separation methods have already been proposed; however, they consider a case where all the speakers keep speaking, which is not always true in a real meeting. In such cases, in addition to separation, speech detection and the classification of the detected speech according to speaker become important issues. For that purpose, we propose a method that employs a maximum signal-to-noise (MaxSNR) beamformer combined with a voice activity detector and online clustering. We also discuss the scaling ambiguity problem as regards the MaxSNR beamformer and provide solutions. We report some encouraging results for a real meeting in a room with a reverberation time of about 350 ms.
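For reference, the MaxSNR beamformer picks weights maximizing w^H R_s w / w^H R_n w, i.e., the principal generalized eigenvector of the speech and noise spatial covariance pair, and the scaling of that eigenvector is inherently arbitrary, which is exactly the ambiguity the paper discusses. A minimal sketch, assuming R_s and R_n have already been estimated from VAD-labeled frames:

```python
import numpy as np
from scipy.linalg import eigh

def max_snr_weights(R_s, R_n):
    """Principal generalized eigenvector of (R_s, R_n): maximizes output SNR.
    Note the arbitrary scaling -- some post-scaling rule must resolve it."""
    eigvals, eigvecs = eigh(R_s, R_n)    # generalized problem, ascending eigenvalues
    return eigvecs[:, -1]                # eigenvector for the largest eigenvalue
```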

Book
24 Sep 2007
TL;DR: This book discusses speech signals and waveform coding, predictive coding, wavelets and pitch detection, and the quadratic spline wavelets, and concludes with a comparison of speech transceivers and voice over the Internet Protocol.
Abstract: About the Authors. Other Wiley and IEEE Press Books on Related Topics. Preface and Motivation. Acknowledgements. 1 Speech Signals and Waveform Coding. 2 Predictive Coding. 3 Analysis-by-synthesis Principles. 4 Speech Spectral Quantization. 5 RPE Coding. 6 Forward-Adaptive CELP Coding. 7 Standard CELP Codecs. 8 Backward-Adaptive CELP Coding. 9 Wideband Speech Coding. 10 MPEG-4 Audio Compression and Transmission. 11 Overview of Low-rate Speech Coding. 12 Linear Predictive Vocoder. 13 Wavelets and Pitch Detection. 14 Zinc Function Excitation. 15 Mixed-Multiband Excitation. 16 Sinusoidal Transform Coding Below 4kbps. 17 Conclusions on Low Rate Coding. 18 Comparison of Speech Transceivers. 19 Voice Over the Internet Protocol. A Constructing the Quadratic Spline Wavelets. B Zinc Function Excitation. C Probability Density Function for Amplitudes. Bibliography. Index. Author Index.

Journal ArticleDOI
TL;DR: The accurate speaker tracking provided by the audio-visual sensor array proved beneficial to improve the recognition performance in a microphone array-based speech recognition system, both in terms of enhancement and recognition.
Abstract: This paper addresses the problem of distant speech acquisition in multiparty meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering. Beamforming techniques, however, rely on knowledge of the speaker location. In this paper, we present an integrated approach, in which an audio-visual multiperson tracker is used to track active speakers with high accuracy. Speech enhancement is then achieved using microphone array beamforming followed by a novel postfiltering stage. Finally, speech recognition is performed to evaluate the quality of the enhanced speech signal. The approach is evaluated on data recorded in a real meeting room for stationary speaker, moving speaker, and overlapping speech scenarios. The results show that the speech enhancement and recognition performance achieved using our approach is significantly better than that of a single table-top microphone and comparable to that of a lapel microphone for some of the scenarios. The results also indicate that the audio-visual-based system performs significantly better than the audio-only system, both in terms of enhancement and recognition. This reveals that the accurate speaker tracking provided by the audio-visual sensor array is beneficial for improving recognition performance in a microphone array-based speech recognition system.

Proceedings ArticleDOI
01 Dec 2007
TL;DR: In contrast to the common belief that "there is no data like more data", it is found possible to select a highly informative subset of data that produces recognition performance comparable to a system that makes use of a much larger amount of data.
Abstract: This paper presents a strategy for efficiently selecting informative data from large corpora of transcribed speech. We propose to choose data uniformly according to the distribution of some target speech unit (phoneme, word, character, etc.). In our experiment, in contrast to the common belief that "there is no data like more data", we found it possible to select a highly informative subset of data that produces recognition performance comparable to a system that makes use of a much larger amount of data. At the same time, our selection process is efficient and fast.
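One simple way to realize "choose data uniformly according to the distribution of some target speech unit" is a greedy pass that always picks the utterance whose units are currently rarest in the selection. This is a sketch of the idea, not necessarily the paper's exact criterion; utterances are assumed to come with their unit (e.g., phoneme) sequences, and budget must not exceed the pool size:

```python
from collections import Counter

def select_uniform(utterances, budget):
    """Greedily pick utterances whose units are rarest in the selection so far.
    utterances: list of (utt_id, [unit, unit, ...]) pairs."""
    counts = Counter()
    chosen, pool = [], list(utterances)
    for _ in range(budget):
        # Prefer the utterance whose units have the lowest average count so far,
        # pushing the selected set's unit distribution toward uniform.
        best = min(pool, key=lambda u: sum(counts[p] for p in u[1]) / len(u[1]))
        pool.remove(best)
        chosen.append(best[0])
        counts.update(best[1])
    return chosen
```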

Journal ArticleDOI
TL;DR: The experimental results demonstrate the improved robustness of the method described in this work when tracking sources emitting real-world speech signals, which typically involve significant silence gaps between utterances.
Abstract: In noisy and reverberant environments, the problem of acoustic source localisation and tracking (ASLT) using an array of microphones presents a number of challenging difficulties. One of the main issues when considering real-world situations involving human speakers is the temporally discontinuous nature of speech signals: the presence of silence gaps in the speech can easily misguide the tracking algorithm, even in practical environments with low to moderate noise and reverberation levels. A natural extension of currently available sound source tracking algorithms is the integration of a voice activity detection (VAD) scheme. We describe a new ASLT algorithm based on a particle filtering (PF) approach, where VAD measurements are fused within the statistical framework of the PF implementation. Tracking accuracy results for the proposed method are presented on the basis of synthetic audio samples generated with the image method, whereas performance results obtained with a real-time implementation of the algorithm, using real audio data recorded in a reverberant room, are published elsewhere. Compared to a previously proposed PF algorithm, the experimental results demonstrate the improved robustness of the method described in this work when tracking sources emitting real-world speech signals, which typically involve significant silence gaps between utterances.
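A condensed sketch of the fusion idea above: a standard particle filter tracks the source position, but the measurement update is tempered by the VAD's speech probability, so silence gaps coast on the motion model instead of dragging the track toward noise. The random-walk dynamics and the Gaussian likelihood below are placeholder assumptions, not the paper's exact models:

```python
import numpy as np

def pf_step(particles, weights, measurement, p_speech,
            sigma_dyn=0.05, sigma_meas=0.3):
    """One particle-filter step with the likelihood tempered by VAD probability.
    particles: (N, d) positions; weights: (N,) summing to 1; measurement: (d,) or None."""
    particles = particles + np.random.normal(0, sigma_dyn, particles.shape)  # predict
    if measurement is not None:
        d2 = np.sum((particles - measurement) ** 2, axis=1)
        lik = np.exp(-0.5 * d2 / sigma_meas**2)
        weights = weights * lik ** p_speech   # p_speech = 0 leaves weights untouched
        weights = weights / weights.sum()
    idx = np.random.choice(len(particles), len(particles), p=weights)        # resample
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```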

Journal ArticleDOI
TL;DR: Neuroevolution Artificial Bandwidth Expansion (NEABE) is proposed, a new method that uses spectral folding to create the initial spectral components above the telephone band and was found to be preferred over narrowband speech in about 80% of the test cases.
Abstract: The limited bandwidth of 0.3-3.4 kHz in current telephone systems reduces both the quality and the intelligibility of speech. Artificial bandwidth expansion is a method that expands the bandwidth of the narrowband speech signal at the receiving end of the transmission link by adding new frequency components to the higher frequencies, i.e., up to 8 kHz. In this paper, a new method for artificial bandwidth expansion, termed Neuroevolution Artificial Bandwidth Expansion (NEABE), is proposed. The method uses spectral folding to create the initial spectral components above the telephone band. The spectral envelope is then shaped in the frequency domain, based on a set of parameters given by a neural network. Subjective listening tests were used to evaluate the performance of the proposed algorithm, and the results showed that NEABE speech was preferred over narrowband speech in about 80% of the test cases.
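The spectral-folding step mentioned above has a very compact form: inserting a zero between successive narrowband samples doubles the sampling rate and mirrors the 0-4 kHz spectrum into the 4-8 kHz band, which the network-driven envelope shaping then sculpts. A sketch of just the folding step:

```python
import numpy as np

def spectral_fold(narrowband):
    """Zero-insertion upsampling: mirrors the baseband spectrum into the upper band.
    Input sampled at 8 kHz -> output at 16 kHz with a folded image above 4 kHz."""
    wideband = np.zeros(2 * len(narrowband))
    wideband[::2] = narrowband
    return 2.0 * wideband        # gain compensation for the inserted zeros
```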

Proceedings ArticleDOI
10 Dec 2007
TL;DR: This paper proposes a novel design of real-time speech hiding for G.711 codec, which is widely supported by almost every VoIP device and shows that the processing time for the proposed algorithm takes only 0.257 ms, suitable for real- time VoIP applications.
Abstract: Real-time speech hiding conceals a secret speech signal inside a cover speech signal in a real-time communication system. By hiding one secret speech into the cover speech, we obtain a stego speech, which sounds meaningful and is indistinguishable from the original cover speech. Therefore, even if attackers capture the audio packets on the Internet, they will not notice that there is another speech hidden inside. In this paper, we propose a scheme for speech hiding in a real-time communication system such as voice over Internet Protocol (VoIP). We propose a novel design of real-time speech hiding for the G.711 codec, which is supported by almost every VoIP device. Experimental results show that the processing time for the proposed algorithm is only 0.257 ms, which is suitable for real-time VoIP applications.
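The abstract above does not spell out the embedding, but a common baseline for real-time hiding in G.711 is least-significant-bit substitution directly on the 8-bit codewords, which needs no re-encoding and so costs almost nothing per packet. A sketch of that baseline (an assumption made here for illustration, not necessarily the paper's exact scheme):

```python
def embed_bits(g711_bytes, secret_bits):
    """Overwrite the LSB of each 8-bit G.711 codeword with one secret bit."""
    out = bytearray(g711_bytes)
    for i, bit in enumerate(secret_bits[:len(out)]):
        out[i] = (out[i] & 0xFE) | (bit & 1)
    return bytes(out)

def extract_bits(g711_bytes, n_bits):
    """Recover the embedded bit stream from the codeword LSBs."""
    return [b & 1 for b in g711_bytes[:n_bits]]
```

Since μ-law/A-law codewords are logarithmic, flipping the LSB perturbs the decoded sample only slightly, which is why the stego speech remains hard to distinguish from the cover.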

Patent
04 Jan 2007
TL;DR: In this article, the authors present a system for intelligent control of microphones in speech processing applications, which allows the capturing, recording and preprocessing of speech data in the captured audio in a way that optimizes speech decoding accuracy.
Abstract: Systems and methods for intelligent control of microphones in speech processing applications, which allow capturing, recording, and preprocessing of speech data in the captured audio in a way that optimizes speech decoding accuracy.

Proceedings ArticleDOI
27 Aug 2007
TL;DR: In this article, a statistical model approach is proposed to estimate statistical models sequentially without a priori knowledge of noise, and the proposed method constructs a clean speech / silence state transition model beforehand, and sequentially adapts the model to the noisy environment by using a switching Kalman filter.
Abstract: This paper addresses the problem of voice activity detection (VAD) in noisy environments. The VAD method proposed in this paper is based on a statistical model approach, and estimates statistical models sequentially without a priori knowledge of noise. Namely, the proposed method constructs a clean speech / silence state transition model beforehand, and sequentially adapts the model to the noisy environment by using a switching Kalman filter when a signal is observed. In this paper, we carried out two evaluations. In the first, we observed that the proposed method significantly outperforms conventional methods as regards voice activity detection accuracy in simulated noise environments. Second, we evaluated the proposed method on a VAD evaluation framework, CENSREC-1-C. The evaluation results revealed that the proposed method significantly outperforms the baseline results of CENSREC-1-C as regards VAD accuracy in real environments. In addition, we confirmed that the proposed method helps to improve the accuracy of concatenated speech recognition in real environments.
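The paper above adapts its models sequentially with a switching Kalman filter. As a simpler illustration of the statistical-model VAD family it builds on, the classic per-frame likelihood-ratio test (in the style of Sohn et al., 1999, heavily simplified here) compares Gaussian models of noisy speech and noise in each frequency bin; the noise PSD and a priori SNR are assumed to be estimated elsewhere:

```python
import numpy as np

def lr_vad_frame(spectrum_power, noise_power, snr_prior, threshold=1.0):
    """Log likelihood-ratio VAD decision for one frame (Sohn-style, simplified).
    spectrum_power: |X_k|^2 per bin; noise_power: estimated noise PSD per bin;
    snr_prior: assumed a priori SNR per bin (xi_k)."""
    gamma = spectrum_power / (noise_power + 1e-12)          # a posteriori SNR
    # log LR per bin: gamma*xi/(1+xi) - log(1+xi)
    log_lr = gamma * snr_prior / (1.0 + snr_prior) - np.log1p(snr_prior)
    return np.mean(log_lr) > threshold                      # geometric-mean decision
```

What the paper adds on top of this kind of test is the sequential, a-priori-knowledge-free adaptation of the speech/silence models themselves, which is what its switching Kalman filter provides.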