
Showing papers on "Speech coding" published in 2016


Journal ArticleDOI
TL;DR: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of realtime applications using speech and showed that it was superior to the other systems in terms of both sound quality and processing speed.
Abstract: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of realtime applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. This new speech synthesis system offers not only high sound quality but also fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output with natural speech, including consonants. Its processing speed was also compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real time factor (RTF) indicated that it was fast enough for real-time processing. key words: speech analysis, speech synthesis, vocoder, sound quality, realtime processing

1,025 citations
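As a rough illustration of the analysis/synthesis pipeline and the real-time factor measurement discussed above, here is a minimal sketch using the pyworld Python bindings of WORLD (dio/stonemask/cheaptrick/d4c/synthesize); the input file name is a placeholder and the exact analysis settings are assumptions, not the paper's configuration.

```python
# Minimal WORLD analysis/resynthesis sketch with RTF measurement.
# Assumes the pyworld bindings and a mono WAV file "speech.wav" (hypothetical path).
import time

import numpy as np
import soundfile as sf   # pip install soundfile
import pyworld as pw     # pip install pyworld

x, fs = sf.read("speech.wav")          # pyworld expects a float64 mono signal
x = np.ascontiguousarray(x, dtype=np.float64)

start = time.perf_counter()
f0, t = pw.dio(x, fs)                  # raw F0 estimation
f0 = pw.stonemask(x, f0, t, fs)        # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)       # spectral envelope
ap = pw.d4c(x, f0, t, fs)              # aperiodicity
y = pw.synthesize(f0, sp, ap, fs)      # resynthesis
elapsed = time.perf_counter() - start

rtf = elapsed / (len(x) / fs)          # real-time factor: processing time / audio duration
print(f"RTF = {rtf:.3f} (values below 1.0 indicate faster than real time)")
```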


Journal ArticleDOI
TL;DR: The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
Abstract: Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech, and enhance only the magnitude spectrum while leaving the phase spectrum unchanged. This is done because there was a belief that the phase spectrum is unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider magnitude and phase spectrum enhancements. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them to related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.

699 citations
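For readers unfamiliar with the complex-domain target used above, the sketch below computes the complex ideal ratio mask (cIRM) from paired clean/noisy STFTs with numpy; the STFT settings and the random signals are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: complex ideal ratio mask (cIRM) from clean speech S and noisy mixture Y,
# defined so that S = M * Y under complex multiplication.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)                 # placeholder "clean speech"
noisy = clean + 0.5 * rng.standard_normal(fs)   # placeholder noisy mixture

_, _, S = stft(clean, fs=fs, nperseg=512)
_, _, Y = stft(noisy, fs=fs, nperseg=512)

eps = 1e-8
denom = Y.real**2 + Y.imag**2 + eps
Mr = (Y.real * S.real + Y.imag * S.imag) / denom   # real part of the cIRM
Mi = (Y.real * S.imag - Y.imag * S.real) / denom   # imaginary part of the cIRM

# Applying the mask recovers the clean STFT (up to eps): S_hat = (Mr + 1j*Mi) * Y.
S_hat = (Mr + 1j * Mi) * Y
_, s_hat = istft(S_hat, fs=fs, nperseg=512)
print("max reconstruction error:", np.max(np.abs(S_hat - S)))
```

A DNN would be trained to predict Mr and Mi from features of the noisy input; the formulas above only define the training target.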


Journal ArticleDOI
TL;DR: It is shown that phase-aware signal processing is an important emerging field with high potential in current speech communication applications and can complement the possible solutions that magnitude-only methods suggest.

126 citations


Posted Content
TL;DR: This paper deals with improving speech quality in an office environment where multiple stationary as well as non-stationary noises can be simultaneously present in speech, and proposes several strategies based on Deep Neural Networks for speech enhancement in these scenarios.
Abstract: In this paper we consider the problem of speech enhancement in real-world like conditions where multiple noises can simultaneously corrupt speech. Most of the current literature on speech enhancement focuses primarily on the presence of a single noise in corrupted speech, which is far from real-world environments. Specifically, we deal with improving speech quality in an office environment where multiple stationary as well as non-stationary noises can be simultaneously present in speech. We propose several strategies based on Deep Neural Networks (DNN) for speech enhancement in these scenarios. We also investigate a DNN training strategy based on psychoacoustic models from speech coding for enhancement of noisy speech.

91 citations
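A minimal sketch of the data-preparation step implied above, mixing several simultaneous noise sources into clean speech at chosen SNRs; all signals and SNR values are placeholders.

```python
# Sketch: corrupt clean speech with several simultaneous noises at given per-noise SNRs.
import numpy as np

def scale_noise(speech, noise, snr_db):
    """Return `noise` scaled so that the speech-to-noise ratio is `snr_db` dB."""
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2) + 1e-12
    return noise * np.sqrt(p_speech / (p_noise * 10.0**(snr_db / 10.0)))

rng = np.random.default_rng(0)
fs = 16000
speech = rng.standard_normal(3 * fs)                  # placeholder clean speech
noises = {"keyboard": rng.standard_normal(3 * fs),    # placeholder stationary noise
          "babble":   rng.standard_normal(3 * fs)}    # placeholder non-stationary noise

# Multiple noises corrupt the same utterance simultaneously, each at 5 dB SNR here.
noisy = speech + sum(scale_noise(speech, n, snr_db=5.0) for n in noises.values())
```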



Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper proposes to enhance the noisy and reverberant speech by learning a mapping to reverberant target speech rather than anechoic target speech, and develops a masking-based method for denoising and compares it with the spectral mapping method.
Abstract: In the real world, speech is usually distorted by both reverberation and background noise. In such conditions, speech intelligibility is degraded substantially, especially for hearing-impaired (HI) listeners. As a consequence, it is essential to enhance speech in the noisy and reverberant environment. Recently, deep neural networks have been introduced to learn a spectral mapping to enhance corrupted speech, and have shown significant improvements in objective metrics and automatic speech recognition scores. However, listening tests have not yet shown any speech intelligibility benefit. In this paper, we propose to enhance the noisy and reverberant speech by learning a mapping to reverberant target speech rather than anechoic target speech. A preliminary listening test was conducted, and the results show that the proposed algorithm is able to improve speech intelligibility of HI listeners in some conditions. Moreover, we develop a masking-based method for denoising and compare it with the spectral mapping method. Evaluation results show that the masking-based method outperforms the mapping-based method.

72 citations
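The masking-based alternative mentioned above is commonly trained against an ideal ratio mask (IRM); the sketch below shows the target computation and its application under illustrative STFT settings, not the authors' exact configuration.

```python
# Sketch: ideal ratio mask (IRM) target and masking-based enhancement.
import numpy as np
from scipy.signal import stft, istft

fs, nper = 16000, 512
rng = np.random.default_rng(1)
speech = rng.standard_normal(2 * fs)       # placeholder target speech (anechoic or reverberant)
noise = rng.standard_normal(2 * fs)        # placeholder noise
noisy = speech + noise

_, _, S = stft(speech, fs=fs, nperseg=nper)
_, _, N = stft(noise, fs=fs, nperseg=nper)
_, _, Y = stft(noisy, fs=fs, nperseg=nper)

irm = np.sqrt(np.abs(S)**2 / (np.abs(S)**2 + np.abs(N)**2 + 1e-12))   # training target in [0, 1]

# At test time a DNN would predict `irm` from features of Y; here the ideal mask is applied
# to the noisy magnitude while keeping the noisy phase.
enhanced_stft = irm * np.abs(Y) * np.exp(1j * np.angle(Y))
_, enhanced = istft(enhanced_stft, fs=fs, nperseg=nper)
```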


Journal ArticleDOI
TL;DR: The proposed technique, called the separable deep auto encoder (SDAE), confines the clean speech reconstruction to the convex hull spanned by a pre-trained speech dictionary, given the under-determined nature of the optimization problem.
Abstract: Unseen noise estimation is a key yet challenging step to make a speech enhancement algorithm work in adverse environments. At worst, the only prior knowledge we know about the encountered noise is that it is different from the involved speech. Therefore, by subtracting the components which cannot be adequately represented by a well-defined speech model, the noises can be estimated and removed. Given the good performance of deep learning in signal representation, a deep auto encoder (DAE) is employed in this work for accurately modeling the clean speech spectrum. In the subsequent stage of speech enhancement, an extra DAE is introduced to represent the residual part obtained by subtracting the estimated clean speech spectrum (by using the pre-trained DAE) from the noisy speech spectrum. By adjusting the estimated clean speech spectrum and the unknown parameters of the noise DAE, one can reach a stationary point to minimize the total reconstruction error of the noisy speech spectrum. The enhanced speech signal is thus obtained by transforming the estimated clean speech spectrum back into the time domain. The proposed technique is called the separable deep auto encoder (SDAE). Given the under-determined nature of the above optimization problem, the clean speech reconstruction is confined in the convex hull spanned by a pre-trained speech dictionary. New learning algorithms are investigated to respect the non-negativity of the parameters in the SDAE. Experimental results on TIMIT with 20 noise types at various noise levels demonstrate the superiority of the proposed method over the conventional baselines.

69 citations


Proceedings ArticleDOI
08 Sep 2016
TL;DR: In this article, the authors considered the problem of speech enhancement in real-world like conditions where multiple noises can simultaneously corrupt speech and proposed several strategies based on Deep Neural Networks (DNN) for speech enhancement.
Abstract: In this paper we consider the problem of speech enhancement in real-world like conditions where multiple noises can simultaneously corrupt speech. Most of the current literature on speech enhancement focuses primarily on the presence of a single noise in corrupted speech, which is far from real-world environments. Specifically, we deal with improving speech quality in an office environment where multiple stationary as well as non-stationary noises can be simultaneously present in speech. We propose several strategies based on Deep Neural Networks (DNN) for speech enhancement in these scenarios. We also investigate a DNN training strategy based on psychoacoustic models from speech coding for enhancement of noisy speech.

67 citations


Journal ArticleDOI
TL;DR: It is found that real-time synthesis of vowels and consonants was possible with good intelligibility, opening the way to future speech BCI applications using such an articulatory-based speech synthesizer.
Abstract: Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real-time. To reach this goal, a prerequisite is to develop a speech synthesizer producing intelligible speech in real-time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real-time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed using a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded on a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the position of sensors glued on different speech articulators into acoustic parameters that are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained as assessed by perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed the real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results open the way to future speech BCI applications using such an articulatory-based speech synthesizer.

66 citations


Posted Content
TL;DR: An overview of Speex, the technology involved in it and how it can be used in applications is presented.
Abstract: The Speex project was started in 2002 to address the need for a free, open-source speech codec. Speex is based on the Code Excited Linear Prediction (CELP) algorithm and, unlike the previously existing Vorbis codec, is optimised for transmitting speech for low latency communication over an unreliable packet network. This paper presents an overview of Speex, the technology involved in it and how it can be used in applications. The most recent developments in Speex, such as the fixed-point port, acoustic echo cancellation and noise suppression are also addressed.

52 citations


Journal ArticleDOI
TL;DR: In this paper, a quantum representation of digital audio (QRDA) is proposed to present quantum audio, which uses two entangled qubit sequences to store the audio amplitude and time information.
Abstract: Multimedia refers to content that uses a combination of different content forms. It includes two main media: image and audio. However, in contrast with the rapid development of quantum image processing, quantum audio has almost never been studied. In order to change this status, a quantum representation of digital audio (QRDA) is proposed in this paper to represent quantum audio. QRDA uses two entangled qubit sequences to store the audio amplitude and time information. The two qubit sequences are both in the basis states |0〉 and |1〉. The preparation of a QRDA audio from the initial state |0〉 is given, to store an audio signal on quantum computers. Then some exemplary quantum audio processing operations are performed to indicate QRDA's usability.
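To make the representation concrete, here is a small classical simulation (a sketch of the general idea, not the paper's preparation circuits) that builds a QRDA-style superposition (1/√N) Σ_t |a_t⟩|t⟩ as a numpy state vector, with the quantized amplitude and the time index both encoded in computational basis states.

```python
# Sketch: classical simulation of a QRDA-like state (1/sqrt(N)) * sum_t |a_t>|t>,
# with q qubits for the quantized amplitude and n qubits for the time index.
import numpy as np

samples = np.array([0.0, 0.5, -0.25, 1.0])   # placeholder audio, values in [-1, 1]
q, n = 3, 2                                  # amplitude qubits, time qubits (2**n == len(samples))
N = len(samples)

# Quantize amplitudes to q-bit unsigned integers.
levels = 2**q - 1
a = np.round((samples + 1.0) / 2.0 * levels).astype(int)

state = np.zeros(2**(q + n), dtype=complex)
for t in range(N):
    index = a[t] * 2**n + t                  # basis index of |a_t>|t>
    state[index] = 1.0 / np.sqrt(N)

assert np.isclose(np.vdot(state, state).real, 1.0)   # valid quantum state (unit norm)
```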

Journal ArticleDOI
TL;DR: In this work, Mel-frequency cepstral coefficient (MFCC) features are extracted for each speech sample in both the training and test sets, and a Gaussian mixture model (GMM) is used to classify the speech by accent.
Abstract: Speech processing is a very important research area; speaker recognition, speech synthesis, speech coding, and speech noise reduction are some of its subareas. Many languages have different speaking styles, called accents or dialects. Identifying the accent before speech recognition can improve the performance of speech recognition systems. The more accents a language has, the more crucial accent recognition becomes. Telugu is an Indian language widely spoken in the southern part of India. The Telugu language has several accents; the main ones are coastal Andhra, Telangana, and Rayalaseema. In the present work, speech samples were collected from native speakers of different Telugu accents for both training and testing. Mel-frequency cepstral coefficient (MFCC) features are extracted for each training and test sample. A Gaussian mixture model (GMM) is then used to classify the speech by accent. The overall accuracy of the proposed system in recognizing a speaker's region of origin from accent is 91%.
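A minimal sketch of the MFCC + GMM pipeline described above, using librosa and scikit-learn; the file lists, number of mixture components, and accent labels are placeholders rather than the authors' setup.

```python
# Sketch: accent identification with per-accent GMMs over MFCC features.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)

# Hypothetical training lists per accent.
train_files = {"coastal_andhra": ["ca_01.wav"],
               "telangana": ["tg_01.wav"],
               "rayalaseema": ["rs_01.wav"]}

models = {}
for accent, files in train_files.items():
    X = np.vstack([mfcc_frames(f) for f in files])
    models[accent] = GaussianMixture(n_components=16, covariance_type="diag", max_iter=200).fit(X)

def classify(path):
    X = mfcc_frames(path)
    # score() returns the average per-frame log-likelihood; pick the best-scoring accent model.
    return max(models, key=lambda accent: models[accent].score(X))

print(classify("test_utterance.wav"))   # hypothetical test file
```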

Patent
29 Feb 2016
TL;DR: In this article, a device detects a wake-up keyword from a received speech signal of a user by using a wake-up keyword model, and transmits a wake-up keyword detection/non-detection signal and the received speech signal of the user to a speech recognition server.
Abstract: A device detects a wake-up keyword from a received speech signal of a user by using a wake-up keyword model, and transmits a wake-up keyword detection/non-detection signal and the received speech signal of the user to a speech recognition server. The speech recognition server performs a recognition process on the speech signal of the user by setting a speech recognition model according to the detection or non-detection of the wake-up keyword.

Patent
08 Jun 2016
TL;DR: In this article, a system and method for recognizing mixed speech from a source is presented, in which two neural networks are trained to recognize the speech of the speakers with higher and lower levels of a speech characteristic, and decoding optimizes the joint likelihood of the two speech signals while considering the probability that a specific frame is a switching point of the speech characteristic.
Abstract: The claimed subject matter includes a system and method for recognizing mixed speech from a source. The method includes training a first neural network to recognize the speech signal spoken by the speaker with a higher level of a speech characteristic from a mixed speech sample. The method also includes training a second neural network to recognize the speech signal spoken by the speaker with a lower level of the speech characteristic from the mixed speech sample. Additionally, the method includes decoding the mixed speech sample with the first neural network and the second neural network by optimizing the joint likelihood of observing the two speech signals considering the probability that a specific frame is a switching point of the speech characteristic.

Patent
10 Jun 2016
TL;DR: In this paper, an audio diarization system segments the audio input into speech and non-speech segments, and these segments are convolved with one or more head related transfer functions (HRTFs) so the sounds localize to different sound localization points (SLPs) for the user.
Abstract: Speech and/or non-speech in an audio input are convolved to localize sounds to different locations for a user. An audio diarization system segments the audio input into speech and non-speech segments. These segments are convolved with one or more head related transfer functions (HRTFs) so the sounds localize to different sound localization points (SLPs) for the user.
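A minimal sketch of the convolution step described above: a mono segment is convolved with a left/right head-related impulse response (HRIR) pair to localize it to one sound localization point; the HRIR arrays below are synthetic placeholders, not measured responses.

```python
# Sketch: localize a mono segment by convolving it with a left/right HRIR pair.
import numpy as np
from scipy.signal import fftconvolve

fs = 44100
rng = np.random.default_rng(2)
segment = rng.standard_normal(fs)          # placeholder mono speech or non-speech segment

# Placeholder HRIRs for one sound localization point (real systems load measured HRTF sets).
hrir_left = rng.standard_normal(256) * np.exp(-np.arange(256) / 32.0)
hrir_right = rng.standard_normal(256) * np.exp(-np.arange(256) / 32.0)

left = fftconvolve(segment, hrir_left, mode="full")
right = fftconvolve(segment, hrir_right, mode="full")
binaural = np.stack([left, right], axis=1)   # (samples, 2) stereo output for headphone playback
```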


Proceedings ArticleDOI
08 Sep 2016
TL;DR: A sliding window deep neural network is presented that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset and outperforms a baseline HMM inversion approach in both objective and subjective evaluations.
Abstract: We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations and perform a thorough analysis of our results.
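To clarify the overlap-and-average step described above, a small numpy sketch: per-window predictions of visual features are laid back onto the frame timeline and averaged wherever windows overlap. The window length, hop, and the predictions themselves are stand-ins for the trained network's output.

```python
# Sketch: average overlapping per-window predictions back into a frame-level trajectory.
import numpy as np

n_frames, feat_dim, win = 100, 20, 11
rng = np.random.default_rng(3)
# Stand-in for DNN outputs: one (win, feat_dim) visual prediction per window position.
window_preds = rng.standard_normal((n_frames - win + 1, win, feat_dim))

accum = np.zeros((n_frames, feat_dim))
counts = np.zeros((n_frames, 1))
for start, pred in enumerate(window_preds):
    accum[start:start + win] += pred
    counts[start:start + win] += 1

trajectory = accum / counts    # smoothly varying visual feature trajectory, (n_frames, feat_dim)
```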

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper illustrates the hierarchical structure of audio data, discusses how to classify audio data using an SVM classifier with a Gaussian kernel, and demonstrates that the proposed method achieves higher audio classification accuracy.
Abstract: Audio classification has great theoretical and practical value in both pattern recognition and artificial intelligence. In this paper, we propose a novel audio classification method based on machine learning techniques. Firstly, we illustrate the hierarchical structure of audio data, which is made up of four layers: 1) Audio frame, 2) Audio clip, 3) Audio shot, and 4) Audio high level semantic unit. Secondly, three types of audio features are extracted to construct the feature vector, including 1) Short time energy, 2) Zero crossing rate and 3) Mel-Frequency cepstral coefficients. Thirdly, we discuss how to classify audio data using an SVM classifier with a Gaussian kernel. Finally, experimental results demonstrate that the proposed method is able to achieve higher audio classification accuracy.
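A minimal sketch of the clip-level pipeline described above (short-time energy, zero-crossing rate, MFCCs, then an RBF-kernel SVM), with random placeholder clips in place of a labeled corpus.

```python
# Sketch: clip-level audio classification with energy/ZCR/MFCC features and an RBF-kernel SVM.
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_features(y, sr):
    energy = librosa.feature.rms(y=y)                     # short-time energy (RMS per frame)
    zcr = librosa.feature.zero_crossing_rate(y)           # zero crossing rate per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # MFCCs per frame
    feats = np.vstack([energy, zcr, mfcc])
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])   # clip-level statistics

# Placeholder "clips": in practice these come from a labeled audio corpus.
sr = 16000
rng = np.random.default_rng(4)
clips = [rng.standard_normal(sr) for _ in range(20)]
labels = [i % 2 for i in range(20)]                       # e.g. 0 = speech, 1 = music

X = np.array([clip_features(y, sr) for y in clips])
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, labels)   # Gaussian (RBF) kernel SVM
print(clf.predict(X[:3]))
```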

Journal ArticleDOI
TL;DR: An audio fingerprinting method that adapts findings from the field of blind astrometry to define simple, efficiently representable characteristic feature combinations called quads is proposed.
Abstract: We propose an audio fingerprinting method that adapts findings from the field of blind astrometry to define simple, efficiently representable characteristic feature combinations called quads. Based on these, an audio identification algorithm is described that is robust to noise and severe time-frequency scale distortions and accurately identifies the underlying scale transform factors. The low number and compact representation of content features allows for efficient application of exact fixed-radius near-neighbor search methods for fingerprint matching in large audio collections. We demonstrate the practicability of the method on a collection of 100,000 songs, analyze its performance for a diverse set of noise as well as severe speed, tempo and pitch scale modifications, and identify a number of advantages of our method over two state-of-the-art distortion-robust audio identification algorithms.
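To give an idea of the quad feature mentioned above (a sketch of the general principle, not the paper's exact descriptor or hashing scheme): four spectral peaks A, B, C, D are normalized so that A maps to (0, 0) and B to (1, 1), which makes the coordinates of the inner points invariant to time/frequency translation and per-axis scaling.

```python
# Sketch: translation- and scale-invariant quad descriptor from four spectral peaks.
import numpy as np

def quad_descriptor(A, B, C, D):
    """Normalize peaks (time, freq) so A -> (0, 0) and B -> (1, 1); return (C', D')."""
    A, B, C, D = (np.asarray(p, dtype=float) for p in (A, B, C, D))
    span = B - A
    return np.concatenate([(C - A) / span, (D - A) / span])

# Example peaks as (time_bin, frequency_bin); C and D lie inside the A-B box.
peaks = [(100, 40), (180, 120), (120, 60), (150, 100)]
desc = quad_descriptor(*peaks)

# Scaling time by 1.5x and frequency by 0.8x (plus a shift) leaves the descriptor unchanged.
scale, shift = np.array([1.5, 0.8]), np.array([30, 10])
desc_scaled = quad_descriptor(*(np.array(p) * scale + shift for p in peaks))
assert np.allclose(desc, desc_scaled)
```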

Proceedings ArticleDOI
20 Mar 2016
TL;DR: Experiments on the corpus of the second CHiME speech separation and recognition challenge (task-2) demonstrate the effectiveness of this novel phoneme-specific speech separation method in terms of objective measures of speech intelligibility and quality, as well as recognition performance.
Abstract: Speech separation or enhancement algorithms seldom exploit information about phoneme identities. In this study, we propose a novel phoneme-specific speech separation method. Rather than training a single global model to enhance all the frames, we train a separate model for each phoneme to process its corresponding frames. A robust ASR system is employed to identify the phoneme identity of each frame. This way, the information from ASR systems and language models can directly influence speech separation by selecting a phoneme-specific model to use at the test stage. In addition, phoneme-specific models have fewer variations to model and do not exhibit the data imbalance problem. The improved enhancement results can in turn help recognition. Experiments on the corpus of the second CHiME speech separation and recognition challenge (task-2) demonstrate the effectiveness of this method in terms of objective measures of speech intelligibility and quality, as well as recognition performance.
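A minimal sketch of the test-time dispatch described above: an ASR system labels each frame with a phoneme, and the frame is then processed by the model trained for that phoneme. The "models" and labels below are stand-ins for the trained phoneme-specific DNNs and the ASR output.

```python
# Sketch: route each frame to a phoneme-specific enhancement model chosen by ASR labels.
import numpy as np

feat_dim, n_frames = 64, 50
rng = np.random.default_rng(5)
noisy_frames = rng.standard_normal((n_frames, feat_dim))
asr_labels = rng.choice(["aa", "iy", "s"], size=n_frames)      # stand-in per-frame phoneme labels

# Stand-in "models": one enhancement function per phoneme (a trained DNN in the paper).
models = {ph: (lambda gain: (lambda frame: gain * frame))(g)
          for ph, g in [("aa", 0.9), ("iy", 0.8), ("s", 0.7)]}

enhanced = np.stack([models[ph](frame) for ph, frame in zip(asr_labels, noisy_frames)])
```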

Journal ArticleDOI
TL;DR: In this article, a hybrid approach is proposed combining the generative mixture of Gaussians (MoG) model and the discriminative deep neural network (DNN), which is executed in two phases, the training phase which does not recur, and the test phase.
Abstract: In this paper, we present a single-microphone speech enhancement algorithm. A hybrid approach is proposed merging the generative mixture of Gaussians (MoG) model and the discriminative deep neural network (DNN). The proposed algorithm is executed in two phases, the training phase, which does not recur, and the test phase. First, the noise-free speech log-power spectral density is modeled as an MoG, representing the phoneme-based diversity in the speech signal. A DNN is then trained with phoneme labeled database of clean speech signals for phoneme classification with mel-frequency cepstral coefficients as the input features. In the test phase, a noisy utterance of an untrained speech is processed. Given the phoneme classification results of the noisy speech utterance, a speech presence probability (SPP) is obtained using both the generative and discriminative models. SPP-controlled attenuation is then applied to the noisy speech while simultaneously, the noise estimate is updated. The discriminative DNN maintains the continuity of the speech and the generative phoneme-based MoG preserves the speech spectral structure. Extensive experimental study using real speech and noise signals is provided. We also compare the proposed algorithm with alternative speech enhancement algorithms. We show that we obtain a significant improvement over previous methods in terms of speech quality measures. Finally, we analyze the contribution of all components of the proposed algorithm indicating their combined importance.
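A small sketch of the SPP-controlled attenuation step described above: time-frequency bins with low speech presence probability are attenuated toward a spectral floor while the noise estimate is updated mostly where speech is absent. The SPP values below are random stand-ins for the output of the MoG/DNN hybrid, and the smoothing constants are assumptions.

```python
# Sketch: speech-presence-probability (SPP) controlled attenuation with recursive noise update.
import numpy as np

n_frames, n_bins = 100, 257
rng = np.random.default_rng(6)
noisy_power = rng.random((n_frames, n_bins)) + 0.1      # placeholder |Y|^2 per frame/bin
spp = rng.random((n_frames, n_bins))                    # stand-in SPP from the MoG/DNN hybrid

g_min = 10.0**(-15 / 20.0)                              # spectral floor of -15 dB (assumed)
alpha = 0.9                                             # noise-update smoothing factor (assumed)

noise_est = noisy_power[0].copy()
enhanced_power = np.empty_like(noisy_power)
for t in range(n_frames):
    gain = spp[t] + (1.0 - spp[t]) * g_min              # attenuate only where speech is unlikely
    enhanced_power[t] = gain * noisy_power[t]
    # Update the noise estimate mainly in bins where speech is probably absent.
    noise_est = spp[t] * noise_est + (1.0 - spp[t]) * (alpha * noise_est
                                                       + (1.0 - alpha) * noisy_power[t])
```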

Patent
28 Jun 2016
TL;DR: In this article, a coding scheme for coding a spatially sampled information signal using sub-division and coding schemes for coding an information signal with sub- and multi-tree structures are described.
Abstract: Coding schemes for coding a spatially sampled information signal using sub-division and coding schemes for coding a sub-division or a multitree structure are described, wherein representative embodiments relate to picture and/or video coding applications.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: A speech enhancement algorithm integrating an artificial neural network (NN) into CI coding strategies is proposed, which decomposes the noisy input signal into time-frequency units, extracts a set of auditory-inspired features and feeds them to the NN to produce an estimation of which CI channels contain more perceptually important information.
Abstract: Traditionally, algorithms that attempt to significantly improve speech intelligibility in noise for cochlear implant (CI) users have met with limited success, particularly in the presence of a fluctuating masker. In the present study, a speech enhancement algorithm integrating an artificial neural network (NN) into CI coding strategies is proposed. The algorithm decomposes the noisy input signal into time-frequency units, extracts a set of auditory-inspired features and feeds them to the NN to produce an estimation of which CI channels contain more perceptually important information (higher signal-to-noise ratio, SNR). This estimate is then used accordingly to retain a subset of channels for electrical stimulation, as in traditional n-of-m coding strategies. The proposed algorithm was tested with 10 normal-hearing participants listening to CI noise-vocoder simulations against a conventional Wiener filter based enhancement algorithm. Significant improvements in speech intelligibility in stationary and fluctuating noise were found over both unprocessed and Wiener filter processed conditions.
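A minimal sketch of the n-of-m selection step described above: per-channel SNR estimates (the neural network's output in the study, random stand-ins here) decide which channels are retained for stimulation in each analysis frame.

```python
# Sketch: n-of-m channel selection driven by estimated per-channel SNR.
import numpy as np

m_channels, n_keep, n_frames = 22, 8, 100
rng = np.random.default_rng(7)
envelopes = rng.random((n_frames, m_channels))                 # placeholder channel envelopes
snr_estimates = rng.standard_normal((n_frames, m_channels))    # stand-in NN output (dB)

selected = np.zeros_like(envelopes)
for t in range(n_frames):
    keep = np.argsort(snr_estimates[t])[-n_keep:]       # channels with the highest estimated SNR
    selected[t, keep] = envelopes[t, keep]              # only these channels are stimulated
```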

Proceedings ArticleDOI
01 Dec 2016
TL;DR: Investigation into the use of visual features of lip motion as additional information to improve deep neural network (DNN) based speech enhancement confirms the effectiveness of including visual information in an audio-only speech enhancement framework.
Abstract: This paper proposes a novel framework that integrates audio and visual information for speech enhancement. Most speech enhancement approaches consider audio features only to design filters or transfer functions to convert noisy speech signals to clean ones. Visual data, which provide useful complementary information to audio data, have been integrated with audio data in many speech-related approaches to attain more effective speech processing performance. This paper presents our investigation into the use of the visual features of the motion of lips as additional visual information to improve the speech enhancement capability of deep neural network (DNN) speech enhancement performance. The experimental results show that the performance of DNN with audio-visual inputs exceeds that of DNN with audio inputs only in four standardized objective evaluations, thereby confirming the effectiveness of the inclusion of visual information into an audio-only speech enhancement framework.

Journal ArticleDOI
TL;DR: While error rates increase considerably under degraded speech conditions, large relative equal error rate (EER) reductions were observed when using a PLDA model trained with a large number of degraded sessions per speaker.
Abstract: The state-of-the-art speaker-recognition systems suffer from significant performance loss on degraded speech conditions and acoustic mismatch between enrolment and test phases. Past international evaluation campaigns, such as the NIST speaker recognition evaluation (SRE), have partly addressed these challenges in some evaluation conditions. This work aims at further assessing and compensating for the effect of a wide variety of speech-degradation processes on speaker-recognition performance. We present an open-source simulator generating degraded telephone, VoIP, and interview-speech recordings using a comprehensive list of narrow-band, wide-band, and audio codecs, together with a database of over 60 h of environmental noise recordings and over 100 impulse responses collected from publicly available data. We provide speaker-verification results obtained with an i-vector-based system using either a clean or degraded PLDA back-end on a NIST SRE subset of data corrupted by the proposed simulator. While error rates increase considerably under degraded speech conditions, large relative equal error rate (EER) reductions were observed when using a PLDA model trained with a large number of degraded sessions per speaker.

Journal ArticleDOI
TL;DR: A novel VLBR speech coding framework based on neural networks (NNs) for end-to-end speech analysis and synthesis without HMMs is proposed; listeners significantly prefer the NN-based approach due to fewer discontinuities and speech artifacts in the encoded speech.
Abstract: Most current very low bit rate (VLBR) speech coding systems use hidden Markov model (HMM) based speech recognition and synthesis techniques. This allows transmission of information (such as phonemes) segment by segment, which decreases the bit rate. However, an encoder based on phoneme speech recognition may create bursts of segmental errors; these would be further propagated to any suprasegmental (such as syllable) information coding. Together with the errors of voicing detection in pitch parametrization, HMM-based speech coding leads to speech discontinuities and unnatural speech sound artifacts. In this paper, we propose a novel VLBR speech coding framework based on neural networks (NNs) for end-to-end speech analysis and synthesis without HMMs. The speech coding framework relies on a phonological (subphonetic) representation of speech. It is designed as a composition of deep and spiking NNs: a bank of phonological analyzers at the transmitter, and a phonological synthesizer at the receiver. These are both realized as deep NNs, along with a spiking NN as an incremental and robust encoder of syllable boundaries for coding of continuous fundamental frequency (F0). A combination of phonological features defines many more sound patterns than the phonetic features defined by HMM-based speech coders; this finer analysis/synthesis code contributes to smoother encoded speech. Listeners significantly prefer the NN-based approach due to fewer discontinuities and speech artifacts in the encoded speech. A single forward pass is required during the speech encoding and decoding. The proposed VLBR speech coding operates at a bit rate of approximately 360 bits/s.

Journal ArticleDOI
TL;DR: The improved version of a steganographic algorithm for IP telephony based on approximating the F0 parameter, which is responsible for conveying information about the pitch of the speech signal, yielded a significantly lower decrease in speech quality, when compared with the original version of HideF0.
Abstract: This paper presents an improved version of a steganographic algorithm for IP telephony called HideF0. It is based on approximating the F0 parameter, which is responsible for conveying information about the pitch of the speech signal. The bits saved due to simplification of the pitch contour are used for the hidden transmission. In our experiments, the proposed method was applied to the narrowband Speex codec working in five different modes, with bitrates between 5,950 bps and 24,600 bps. We showed that HideF0 was able to create hidden channels with steganographic bandwidths of around 200 bps at the expense of a steganographic cost of between 0.5 and 0.7 MOS, depending on the Speex mode. Because of placing the approximation flag in the voice packet header, the improved version of the proposed algorithm yielded a significantly lower decrease in speech quality, when compared with the original version of HideF0. In addition, for low bitrates of the hidden channel (i.e., below ca. 50 bps) it was able to operate without introducing any steganographic cost. Copyright © 2016 John Wiley & Sons, Ltd.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: This work investigates a single channel Kalman filter based speech enhancement algorithm, whose parameters are estimated using a codebook based approach, and results indicate that the enhancement algorithm is able to improve the speech intelligibility and quality according to objective measures.
Abstract: Enhancement of speech in non-stationary background noise is a challenging task, and conventional single channel speech enhancement algorithms have not been able to improve the speech intelligibility in such scenarios. The work proposed in this paper investigates a single channel Kalman filter based speech enhancement algorithm, whose parameters are estimated using a codebook based approach. The results indicate that the enhancement algorithm is able to improve the speech intelligibility and quality according to objective measures. Moreover, we investigate the effects of utilizing a speaker specific trained codebook over a generic speech codebook in relation to the performance of the speech enhancement system.
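The paper uses a codebook-driven Kalman filter with autoregressive models of speech and noise; the sketch below shows only the core Kalman recursion for an AR(2) speech model in white noise, with the AR coefficients and variances (which the codebook would supply per frame) left as fixed placeholder constants.

```python
# Sketch: Kalman filtering of noisy speech with an AR(2) speech model.
# In the codebook-based approach the AR coefficients and variances would be
# selected per frame from a trained codebook; here they are fixed placeholders.
import numpy as np

a1, a2 = 1.3, -0.4        # placeholder AR(2) coefficients of the speech model
q = 0.01                  # placeholder excitation (process-noise) variance
r = 0.1                   # placeholder observation (background-noise) variance

F = np.array([[a1, a2],
              [1.0, 0.0]])              # state transition for [s_t, s_{t-1}]
H = np.array([[1.0, 0.0]])              # we observe s_t plus noise
Q = np.array([[q, 0.0], [0.0, 0.0]])

rng = np.random.default_rng(8)
noisy = rng.standard_normal(1000)       # placeholder noisy speech samples

x = np.zeros((2, 1))                    # state estimate
P = np.eye(2)                           # state covariance
enhanced = np.empty_like(noisy)
for t, y in enumerate(noisy):
    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the noisy observation.
    K = P @ H.T / (H @ P @ H.T + r)     # Kalman gain (scalar innovation here)
    x = x + K * (y - (H @ x).item())
    P = (np.eye(2) - K @ H) @ P
    enhanced[t] = x[0, 0]               # current clean-speech estimate
```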

Proceedings ArticleDOI
20 Mar 2016
TL;DR: A segmentation system for multi-genre broadcast audio with deep neural network (DNN) based speech/non-speech detection and a further stage of change-point detection and clustering is used to obtain homogeneous segments.
Abstract: Automatic segmentation is a crucial initial processing step for processing multi-genre broadcast (MGB) audio. It is very challenging since the data exhibits a wide range of both speech types and background conditions with many types of non-speech audio. This paper describes a segmentation system for multi-genre broadcast audio with deep neural network (DNN) based speech/non-speech detection. A further stage of change-point detection and clustering is used to obtain homogeneous segments. Suitable DNN inputs, context window sizes and architectures are studied with a series of experiments using a large corpus of MGB television audio. For MGB transcription, the improved segmenter yields roughly half the increase in word error rate, over manual segmentation, compared to the baseline DNN segmenter supplied for the 2015 ASRU MGB challenge.
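A small sketch of the segmentation logic described above: frame-level speech/non-speech posteriors (the DNN output, random stand-ins here) are smoothed and thresholded, and contiguous speech runs become segments. The change-point detection and clustering stage is omitted, and the frame shift and filter length are assumptions.

```python
# Sketch: turn frame-level speech posteriors into speech segments by smoothing + thresholding.
import numpy as np
from scipy.ndimage import median_filter

frame_shift = 0.01                          # 10 ms frames (assumed)
rng = np.random.default_rng(9)
posteriors = rng.random(3000)               # stand-in DNN speech posteriors per frame

speech = median_filter(posteriors, size=51) > 0.5    # smooth, then threshold

# Extract contiguous speech runs as (start_time, end_time) segments.
changes = np.flatnonzero(np.diff(speech.astype(int)))
bounds = np.concatenate(([0], changes + 1, [len(speech)]))
segments = [(s * frame_shift, e * frame_shift)
            for s, e in zip(bounds[:-1], bounds[1:]) if speech[s]]
print(f"{len(segments)} speech segments found")
```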

08 Oct 2016
TL;DR: Introduction to audio analysis.
Abstract: Introduction to audio analysis.