
Showing papers on "Speech coding published in 2017"


Posted Content
TL;DR: Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.
Abstract: Robust speech processing in multi-talker environments requires effective speech separation. Recent deep learning systems have made significant progress toward solving this problem, yet it remains challenging, particularly in real-time, short-latency applications. Most methods attempt to construct a mask for each source in the time-frequency representation of the mixture signal, which is not necessarily an optimal representation for speech separation. In addition, time-frequency decomposition brings inherent problems such as phase/magnitude decoupling and the long time window required to achieve sufficient frequency resolution. We propose the Time-domain Audio Separation Network (TasNet) to overcome these limitations. We directly model the signal in the time domain using an encoder-decoder framework and perform the source separation on nonnegative encoder outputs. This method removes the frequency decomposition step and reduces the separation problem to the estimation of source masks on encoder outputs, which are then synthesized by the decoder. Our system outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required output latency. This makes TasNet suitable for applications where low-power, real-time implementation is desirable, such as in hearable and telecommunication devices.
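
To make the encoder-mask-decoder data flow concrete, here is a minimal numpy sketch of the TasNet-style pipeline; it is not the authors' code, and the random bases and placeholder masks stand in for the trained encoder, separation network, and decoder.

```python
# Illustrative sketch of a TasNet-style time-domain pipeline (not the authors' code).
# The learned encoder/decoder bases and the mask network are replaced by random
# matrices and placeholder masks, purely to show the data flow.
import numpy as np

def frame(signal, win=40, hop=20):
    n = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop:i * hop + win] for i in range(n)])

def overlap_add(frames, hop=20):
    win = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + win)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + win] += f
    return out

rng = np.random.default_rng(0)
mixture = rng.standard_normal(16000)           # 1 s of dummy audio at 16 kHz
basis_enc = rng.standard_normal((40, 256))     # learned in the real model
basis_dec = rng.standard_normal((256, 40))     # learned in the real model

segments = frame(mixture)                      # (frames, 40)
weights = np.maximum(segments @ basis_enc, 0)  # nonnegative encoder outputs

# A separation network would predict one mask per source; use placeholders here.
masks = rng.random((2, *weights.shape))
masks /= masks.sum(axis=0, keepdims=True)      # masks sum to one across sources

sources = [overlap_add((masks[k] * weights) @ basis_dec) for k in range(2)]
print(len(sources), sources[0].shape)          # 2 separated time-domain signals
```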

148 citations



Journal ArticleDOI
07 Jun 2017-eLife
TL;DR: The results demonstrate a role of auditory-frontal interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments.
Abstract: Seeing a speaker’s face enhances speech intelligibility in adverse environments. We investigated the underlying network mechanisms by quantifying local speech representations and directed connectivity in MEG data obtained while human participants listened to speech of varying acoustic SNR and visual context. During high acoustic SNR, speech encoding by temporally entrained brain activity was strong in temporal and inferior frontal cortex, while during low SNR, strong entrainment emerged in premotor and superior frontal cortex. These changes in local encoding were accompanied by changes in directed connectivity along the ventral stream and the auditory-premotor axis. Importantly, the behavioral benefit arising from seeing the speaker’s face was not predicted by changes in local encoding but rather by enhanced functional connectivity between temporal and inferior frontal cortex. Our results demonstrate a role of auditory-frontal interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments.

78 citations


Proceedings ArticleDOI
19 Jun 2017
TL;DR: Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.
Abstract: Methods for adapting and controlling the characteristics of output speech are important topics in speech synthesis. In this work, we investigated the performance of DNN-based text-to-speech systems that in parallel to conventional text input also take speaker, gender, and age codes as inputs, in order to 1) perform multi-speaker synthesis, 2) perform speaker adaptation using small amounts of target-speaker adaptation data, and 3) modify synthetic speech characteristics based on the input codes. Using a large-scale, studio-quality speech corpus with 135 speakers of both genders and ages between tens and eighties, we performed three experiments: 1) First, we used a subset of speakers to construct a DNN-based, multi-speaker acoustic model with speaker codes. 2) Next, we performed speaker adaptation by estimating code vectors for new speakers via backpropagation from a small amount of adaptation material. 3) Finally, we experimented with manually manipulating input code vectors to alter the gender and/or age characteristics of the synthesised speech. Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.
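
The "code vector" idea amounts to concatenating categorical speaker, gender, and age codes with the conventional per-frame linguistic input. The sketch below shows one simple way to build such an augmented input vector; dimensions and the one-hot encoding scheme are illustrative assumptions, not the paper's exact configuration (which compares several encoding schemes, and estimates the codes by backpropagation at adaptation time).

```python
# Minimal sketch: linguistic input features concatenated with one-hot speaker,
# gender, and age codes before being fed to the acoustic DNN. All dimensions are
# placeholders.
import numpy as np

def build_input(linguistic, speaker_id, gender_id, age_decade,
                n_speakers=135, n_genders=2, n_ages=8):
    speaker = np.eye(n_speakers)[speaker_id]
    gender = np.eye(n_genders)[gender_id]
    age = np.eye(n_ages)[age_decade]
    return np.concatenate([linguistic, speaker, gender, age])

frame_features = np.random.rand(300)            # per-frame linguistic features (dummy)
x = build_input(frame_features, speaker_id=7, gender_id=1, age_decade=3)
print(x.shape)                                   # (300 + 135 + 2 + 8,) = (445,)
```

For adaptation, the same network is kept fixed while the code portion of the input is treated as a free parameter and optimized on the target speaker's data; for manipulation, the gender or age part of the code is simply edited before synthesis.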

71 citations


Journal ArticleDOI
TL;DR: The proposed QCCN steganalysis method can effectively detect QIM steganography in an encoded speech stream when applied to low-bit-rate speech codecs such as G.723.1 and G.729.
Abstract: Steganalysis of quantization index modulation (QIM) steganography in a low-bit-rate encoded speech stream is conducted in this research. According to speech generation theory and the phoneme distribution properties of language, we first point out that the correlation characteristics of the split vector quantization (VQ) codewords of linear predictive coding filter coefficients are changed after QIM steganography. Based on this observation, we construct a model called the quantization codeword correlation network (QCCN), built from the split VQ codewords of adjacent speech frames. The QCCN model is then pruned to yield a stronger correlation network. After quantifying the correlation characteristics of the vertices in the pruned network, we obtain feature vectors that are sensitive for steganalysis. Finally, we build a high-performance detector using a support vector machine (SVM) classifier. Experimental results show that the proposed QCCN steganalysis method can effectively detect QIM steganography in an encoded speech stream when applied to low-bit-rate speech codecs such as G.723.1 and G.729.
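
The sketch below illustrates the general shape of such a pipeline: adjacent-frame codeword pairs define the edges of a correlation network, the network is pruned, per-vertex statistics become the feature vector, and an SVM does the detection. The pruning threshold, degree-based features, and random codeword streams are illustrative placeholders, not the paper's exact quantification of vertex correlation.

```python
# Hedged sketch of a QCCN-style steganalysis pipeline with placeholder data.
import numpy as np
from collections import Counter
from sklearn.svm import SVC

def qccn_features(codewords, prune_below=2, n_features=64):
    # Adjacent-frame codeword pairs form the (undirected) edges of the network.
    edges = Counter(tuple(sorted(p)) for p in zip(codewords[:-1], codewords[1:]))
    # Prune weak edges, then use vertex degrees of the pruned network as features.
    degree = Counter()
    for (a, b), w in edges.items():
        if w >= prune_below and a != b:
            degree[a] += 1
            degree[b] += 1
    feats = sorted(degree.values(), reverse=True)[:n_features]
    return np.array(feats + [0] * (n_features - len(feats)), dtype=float)

rng = np.random.default_rng(1)
# Random codeword streams stand in for real cover/stego VQ index sequences.
cover = [qccn_features(rng.integers(0, 64, 2000)) for _ in range(50)]
stego = [qccn_features(rng.integers(0, 64, 2000)) for _ in range(50)]
clf = SVC().fit(np.vstack(cover + stego), [0] * 50 + [1] * 50)
```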

50 citations


Proceedings ArticleDOI
05 Mar 2017
TL;DR: A two-stage algorithm to deal with the confounding effects of noise and reverberation separately, where denoising and dereverberation are conducted sequentially using deep neural networks is proposed, and it substantially outperforms one-stage enhancement baselines.
Abstract: In daily listening environments, speech is commonly corrupted by room reverberation and background noise. These distortions are detrimental to speech intelligibility and quality, and they severely degrade the performance of automatic speech and speaker recognition systems. In this paper, we propose a two-stage algorithm to deal with the confounding effects of noise and reverberation separately, where denoising and dereverberation are conducted sequentially using deep neural networks. In addition, we design a new objective function that incorporates clean phase information during training. Because the objective function emphasizes more important time-frequency (T-F) units, a better magnitude estimate is obtained during testing. By jointly training the two-stage model to optimize the proposed objective function, our algorithm significantly improves objective metrics of speech intelligibility and quality, and substantially outperforms one-stage enhancement baselines.
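
One common way to fold clean-phase information into a magnitude objective is a phase-sensitive target, in which the clean magnitude is scaled by the cosine of the clean/noisy phase difference. The sketch below computes such a target from STFTs; it illustrates the general idea only and is not claimed to be the authors' exact objective function.

```python
# Sketch of a phase-sensitive training target: T-F units whose phase is badly
# corrupted contribute less to the magnitude loss. Illustrative, not the paper's
# exact formulation.
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)                  # dummy signals
noisy = clean + 0.5 * rng.standard_normal(fs)

_, _, S = stft(clean, fs=fs, nperseg=512)
_, _, Y = stft(noisy, fs=fs, nperseg=512)

target = np.abs(S) * np.cos(np.angle(S) - np.angle(Y))   # phase-sensitive target
estimate = np.abs(Y)                                     # placeholder DNN output
loss = np.mean((estimate - target) ** 2)
print(loss)
```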

49 citations



Proceedings ArticleDOI
05 Mar 2017
TL;DR: A deep neural network is used to estimate the real and imaginary components of the complex ideal ratio mask (cIRM), which results in clean and anechoic speech when applied to a reverberant-noisy mixture and shows that phase is important for dereverberation, and that complex ratio masking outperforms related methods.
Abstract: Traditional speech separation systems enhance the magnitude response of noisy speech. Recent studies, however, have shown that perceptual speech quality is significantly improved when magnitude and phase are both enhanced. These studies, however, have not determined if phase enhancement is beneficial in environments that contain reverberation as well as noise. In this paper, we present an approach that jointly enhances the magnitude and phase of reverberant and noisy speech. We use a deep neural network to estimate the real and imaginary components of the complex ideal ratio mask (cIRM), which results in clean and anechoic speech when applied to a reverberant-noisy mixture. Our results show that phase is important for dereverberation, and that complex ratio masking outperforms related methods.
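
The cIRM is the complex mask M that satisfies S = M * Y in the STFT domain, so its ideal value is simply S / Y per T-F unit; the network in the paper is trained to estimate its real and imaginary parts (usually after compression to a bounded range, omitted here). A minimal sketch:

```python
# Sketch of the complex ideal ratio mask (cIRM). Applying the ideal mask to the
# mixture recovers the clean STFT; a DNN would estimate the real and imaginary
# components of this mask from the reverberant-noisy input.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(2 * fs)
noisy = clean + 0.3 * rng.standard_normal(2 * fs)    # dummy reverberant-noisy mix

_, _, S = stft(clean, fs=fs, nperseg=512)
_, _, Y = stft(noisy, fs=fs, nperseg=512)

eps = 1e-8
cirm = (S * np.conj(Y)) / (np.abs(Y) ** 2 + eps)     # real + imaginary components

_, resynth = istft(cirm * Y, fs=fs, nperseg=512)     # masked mixture -> clean speech
print(np.allclose(resynth[:len(clean)], clean, atol=1e-3))
```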

47 citations


Journal ArticleDOI
TL;DR: This paper proposes a framework for detecting double compressed AMR audio based on the stacked autoencoder (SAE) network and the universal background model-Gaussian mixture model (UBM-GMM), and uses the SAE to learn the optimal features automatically from the audio waveforms.
Abstract: The adaptive multi-rate (AMR) audio codec adopted by many portable recording devices is widely used in speech compression. The use of AMR speech recordings as evidence in court is growing. Nowadays, it is easy to tamper with digital speech recordings, which makes audio forensics increasingly important. The detection of double compressed audio is one of the key issues in audio forensics. In this paper, we propose a framework for detecting double compressed AMR audio based on the stacked autoencoder (SAE) network and the universal background model-Gaussian mixture model (UBM-GMM). Instead of hand-crafted features, we used the SAE to learn the optimal features automatically from the audio waveforms. Audio frames are used as network input and the last hidden layer’s output constitutes the features of a single frame. For an audio clip with many frames, the features of all the frames are aggregated and classified by UBM-GMM. Experimental results show that our method is effective in distinguishing single/double compressed AMR audio and outperforms the existing methods by achieving a detection accuracy of 98% on the TIMIT database. Exhaustive experiments demonstrate the effectiveness and robustness of the proposed method.

46 citations


Patent
17 May 2017
TL;DR: In this article, a platform for multiple media devices connected via a network is configured to process speech, such as voice commands, detected at the media devices, and respond to the detected speech by causing the devices to simultaneously perform one or more requested actions.
Abstract: Provided are methods, systems, and apparatuses for detecting, processing, and responding to audio signals, including speech signals, within a designated area or space. A platform for multiple media devices connected via a network is configured to process speech, such as voice commands, detected at the media devices, and respond to the detected speech by causing the media devices to simultaneously perform one or more requested actions. The platform is capable of scoring the quality of a speech request, handling speech requests from multiple end points of the platform using a centralized processing approach, a de-centralized processing approach, or a combination thereof, and also manipulating partial processing of speech requests from multiple end points into a coherent whole when necessary.

45 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigated whether and how spectral and temporal cues in a precursor sentence that has been processed under high vs. low cognitive load influence the perception of a subsequent target word.

Journal ArticleDOI
TL;DR: This paper is concerned with generating intelligible audio speech from a video of a person talking, and regression and classification methods are proposed first to estimate static spectral envelope features from active appearance model visual features and two further methods are developed to incorporate temporal information into the prediction.
Abstract: This paper is concerned with generating intelligible audio speech from a video of a person talking. Regression and classification methods are proposed first to estimate static spectral envelope features from active appearance model visual features. Two further methods are then developed to incorporate temporal information into the prediction: A feature-level method using multiple frames and a model-level method based on recurrent neural networks. Speech excitation information is not available from the visual signal, so methods to artificially generate aperiodicity and fundamental frequency are developed. These are combined within the STRAIGHT vocoder to produce a speech signal. The various systems are optimized through objective tests before applying subjective intelligibility tests that determine a word accuracy of 85% from a set of human listeners on the GRID audio-visual speech database. This compares favorably with a previous regression-based system that serves as a baseline, which achieved a word accuracy of 33%.

Proceedings ArticleDOI
05 Mar 2017
TL;DR: A novel optimization problem, involving the minimization of nuclear norms and matrix ℓ1-norms is solved and the proposed method is evaluated in 1) visual localization and audio separation and 2) visual-assisted audio denoising.
Abstract: The ability to localize visual objects that are associated with an audio source and at the same time separate the audio signal is a cornerstone of several audio-visual signal processing applications. Past efforts usually focused on localizing only the visual objects, without audio separation abilities. Besides, they often rely on computationally expensive pre-processing steps that segment image pixels into object regions before applying localization approaches. We aim to address the problem of audio-visual source localization and separation in an unsupervised manner. The proposed approach employs low-rank structure to model the background visual and audio information, and sparsity to extract the sparsely correlated components between the audio and visual modalities. In particular, this model decomposes each dataset into a sum of two terms: low-rank matrices capturing the background, uncorrelated information, and sparse correlated components modelling the sound source in the visual modality and the associated sound in the audio modality. To this end, a novel optimization problem involving the minimization of nuclear norms and matrix ℓ1-norms is solved. We evaluate the proposed method on 1) visual localization and audio separation and 2) visual-assisted audio denoising. The experimental results demonstrate the effectiveness of the proposed method.
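
In its generic robust-PCA-style form, a nuclear-norm/ℓ1 objective of the kind referred to above looks as follows; the paper's exact constraints and the coupling between the audio and visual sparse terms may differ.

```latex
% Generic low-rank-plus-sparse decomposition; the coupling between modalities used
% in the paper may differ from this form.
\begin{aligned}
\min_{L_a, L_v, S_a, S_v} \quad & \|L_a\|_* + \|L_v\|_* + \lambda_a \|S_a\|_1 + \lambda_v \|S_v\|_1 \\
\text{subject to} \quad & A = L_a + S_a, \qquad V = L_v + S_v,
\end{aligned}
```

where A and V are the audio and visual data matrices, the low-rank terms L capture the uncorrelated background in each modality, and the sparse terms S model the sound source and its visual counterpart.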

Proceedings ArticleDOI
05 Mar 2017
TL;DR: This paper presents an optimal multi-channel Wiener filter, which consists of an eigenvector beamformer and a single-channel postfilter, and shows that both components solely depend on a speech presence probability, which is learned using a deep neural network.
Abstract: In this paper, we present an optimal multi-channel Wiener filter, which consists of an eigenvector beamformer and a single-channel postfilter. We show that both components solely depend on a speech presence probability, which we learn using a deep neural network, consisting of a deep autoencoder and a softmax regression layer. To prevent the DNN from learning specific speaker and noise types, we do not use the signal energy as input feature, but rather the cosine distance between the dominant eigenvectors of consecutive frames of the power spectral density of the noisy speech signal. We compare our system against the BeamformIt toolkit, and state-of-the-art approaches such as the front-end of the best system of the CHiME3 challenge. We show that our system yields superior results, both in terms of perceptual speech quality and classification error.
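
The sketch below shows one way to compute the input feature described above: the cosine distance between the dominant eigenvectors of the spatial PSD matrices of consecutive frames. The array geometry, per-frame rank-1 PSD estimate (unsmoothed here), and dummy data are illustrative assumptions.

```python
# Sketch of the eigenvector-based DNN input feature for one frequency bin.
import numpy as np

def dominant_eigenvector(psd):
    # psd: (channels, channels) Hermitian matrix; eigh returns ascending eigenvalues.
    _, vecs = np.linalg.eigh(psd)
    return vecs[:, -1]

def cosine_distance_feature(stft_frames):
    # stft_frames: (frames, channels) complex STFT values for one frequency bin.
    feats, prev = [], None
    for frame in stft_frames:
        psd = np.outer(frame, frame.conj())   # instantaneous PSD (recursively smoothed in practice)
        v = dominant_eigenvector(psd)
        if prev is not None:
            cos = np.abs(np.vdot(prev, v)) / (np.linalg.norm(prev) * np.linalg.norm(v))
            feats.append(1.0 - cos)           # small when the dominant subspace is stable
        prev = v
    return np.array(feats)

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 6)) + 1j * rng.standard_normal((100, 6))
print(cosine_distance_feature(frames).shape)   # (99,)
```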

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed approach, which embeds information during the linear predictive coding (LPC) process based on matrix embedding (ME), leads to better performance with less speech distortion and better security.
Abstract: The extensive use of Voice over IP (VoIP) applications makes the low bit-rate speech stream a very suitable steganographic cover medium. To incorporate steganography into a low bit-rate speech codec, we propose a novel approach to embed information during the linear predictive coding (LPC) process based on matrix embedding (ME). In the proposed method, a mapping table is constructed based on the criterion of minimum distance between linear-predictive-coefficient vectors, and the embedding position and template are selected according to a private key so as to choose the cover frames. The original speech data of the chosen frames are partially encoded to obtain the codewords for embedding, and the codewords that need to be modified are then selected according to the secret bits and the ME algorithm. Each selected codeword is changed into its best replacement codeword according to the mapping table. When embedding k (k > 1) bits into 2^k - 1 codewords, the embedding efficiency of our method is k times that of the LPC-based quantization index modulation method. The performance of the proposed approach is evaluated in two aspects: the distortion in speech quality introduced by embedding, and the security under steganalysis. The experimental results demonstrate that the proposed approach leads to better performance with less speech distortion and better security.
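
The "k bits into 2^k - 1 cover elements with at most one change" property is the classical Hamming-code form of matrix embedding. The sketch below shows it on plain bits to keep the example self-contained; in the scheme above the cover elements are derived from LPC codeword choices rather than raw bits.

```python
# Sketch of binary matrix embedding with a Hamming code: k secret bits are carried
# by n = 2^k - 1 cover bits, and at most one cover bit is changed.
import numpy as np

def parity_matrix(k):
    # Columns are the binary representations of 1 .. 2^k - 1.
    n = 2 ** k - 1
    return np.array([[(j >> i) & 1 for j in range(1, n + 1)] for i in range(k)])

def embed(cover, secret, k):
    H = parity_matrix(k)
    syndrome = (H @ cover) % 2
    diff = syndrome ^ secret
    stego = cover.copy()
    if diff.any():                                   # flip exactly one position
        pos = int(sum(int(b) << i for i, b in enumerate(diff))) - 1
        stego[pos] ^= 1
    return stego

def extract(stego, k):
    return (parity_matrix(k) @ stego) % 2

k = 3
rng = np.random.default_rng(0)
cover = rng.integers(0, 2, 2 ** k - 1)
secret = rng.integers(0, 2, k)
stego = embed(cover, secret, k)
assert np.array_equal(extract(stego, k), secret)     # secret recovered
assert np.sum(stego != cover) <= 1                   # at most one change
```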

Posted Content
TL;DR: This work describes how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s, and shows that the system additionally performs implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener.
Abstract: Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the system additionally performs implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener, even when that speaker was not used during the training of the generative model.

Journal ArticleDOI
TL;DR: A new technique to detect adulterations in audio recordings is proposed by exploiting abnormal variations in the electrical network frequency (ENF) signal that may be embedded in a questioned audio recording.
Abstract: Audio authentication is a critical task in multimedia forensics demanding robust methods to detect and identify tampered audio recordings. In this paper, a new technique to detect adulterations in audio recordings is proposed by exploiting abnormal variations in the electrical network frequency (ENF) signal that may be embedded in a questioned audio recording. These abnormal variations are caused by abrupt phase discontinuities due to insertions and suppressions of audio snippets during tampering. First, we propose an ESPRIT-Hilbert ENF estimator in conjunction with an outlier detector based on the sample kurtosis of the estimated ENF. Next, we use the computed kurtosis as an input to a support vector machine classifier to indicate the presence of tampering. The proposed scheme, herein designated SPHINS, significantly outperforms related previous tampering detection approaches in the conducted tests. We validate our results using the Carioca 1 corpus with 100 unedited authorized audio recordings of phone calls.
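
The core cue is easy to demonstrate: an edit introduces a phase jump in the embedded network-frequency component, which appears as a spike in the estimated instantaneous frequency and therefore as high sample kurtosis. The sketch below uses a simple Hilbert-based frequency estimate on a synthetic 50 Hz tone rather than the ESPRIT-Hilbert estimator of the paper.

```python
# Sketch of the kurtosis-based tampering cue on an idealized ENF component.
import numpy as np
from scipy.signal import hilbert
from scipy.stats import kurtosis

fs = 1000
t = np.arange(0, 10, 1 / fs)
enf = np.cos(2 * np.pi * 50 * t)                               # idealized 50 Hz component
enf[5000:] = np.cos(2 * np.pi * 50 * t[5000:] + np.pi / 2)     # splice: abrupt phase jump

phase = np.unwrap(np.angle(hilbert(enf)))
inst_freq = np.diff(phase) * fs / (2 * np.pi)                  # ~50 Hz except at the splice

print("kurtosis of estimated ENF:", kurtosis(inst_freq))
# A heavy-tailed (high-kurtosis) frequency estimate flags a possible edit; the paper
# feeds this statistic to an SVM rather than thresholding it directly.
```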


Journal ArticleDOI
TL;DR: The proposed novel QIM steganography based on the replacement of quantization index set in linear predictive coding (LPC) outperforms the state-of-the-art LPC-based approach in low-bit-rate speech codec with respect to both steganographic capacity and steganalysis resistance.
Abstract: In this paper, we focus on quantization-index-modulation (QIM) steganography in low-bit-rate speech codecs and contribute to improving its steganalysis resistance. A novel QIM steganography is proposed based on the replacement of the quantization index set in linear predictive coding (LPC). In this method, each quantization index set is seen as a point in a quantization index space, and steganography is conducted in that space. Compared with other methods, our algorithm significantly improves the embedding efficiency: at most one quantization index needs to be changed when three binary bits are hidden. The number of alterations introduced by the proposed approach is much lower than that of current methods with the same embedding rate. Due to the fewer cover changes, the proposed steganography is less detectable. Moreover, a division strategy based on a genetic algorithm is proposed to reduce the additional distortion introduced by the replacements. In our experiment, ITU-T G.723.1 is selected as the codec, and the experimental results show that the proposed approach outperforms the state-of-the-art LPC-based approach in a low-bit-rate speech codec with respect to both steganographic capacity and steganalysis resistance.

Journal ArticleDOI
TL;DR: An overview of perceptually motivated techniques is presented, with a focus on multichannel audio recording and reproduction, audio source and reflection culling, and artificial reverberators.
Abstract: Developments in immersive audio technologies have been evolving in two directions: physically motivated systems and perceptually motivated systems. Physically motivated techniques aim to reproduce a physically accurate approximation of desired sound fields by employing a very high equipment load and sophisticated, computationally intensive algorithms. Perceptually motivated techniques, however, aim to render only the perceptually relevant aspects of the sound scene by means of modest computational and equipment load. This article presents an overview of perceptually motivated techniques, with a focus on multichannel audio recording and reproduction, audio source and reflection culling, and artificial reverberators.

Journal ArticleDOI
TL;DR: An improved codebook-driven Wiener filter combined with the speech-presence probability is developed, so that the proposed method achieves the goal of removing the residual noise between the harmonics of noisy speech.
Abstract: In this paper, we present a novel method for estimating the short-term linear predictive parameters of speech and noise in the codebook-driven Wiener filtering speech enhancement method. We use only a pretrained spectral-shape codebook of speech to model the a priori information about the linear predictive coefficients of speech; the spectral shape of noise is estimated online directly, instead of using a noise codebook, which avoids the problem of noise classification. Unlike existing codebook-driven methods, in which the linear predictive gains of speech and noise are estimated by a maximum-likelihood method, the proposed method exploits a multiplicative update rule to estimate the linear predictive gains more accurately. The estimated gains help to preserve more speech components in the enhanced speech. A Bayesian parameter estimator that does not require a noise codebook is also developed. Moreover, we develop an improved codebook-driven Wiener filter combined with the speech-presence probability, so that the proposed method achieves the goal of removing the residual noise between the harmonics of noisy speech.
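
Given per-frame speech and noise LPC parameters (shape coefficients plus gains), the Wiener filter itself is the ratio of the modelled speech PSD to the modelled noisy PSD. The sketch below shows that final step with placeholder LPC values standing in for the codebook search, online noise-shape estimate, and multiplicative gain updates described above.

```python
# Sketch of building a Wiener gain from all-pole (LPC) spectral models.
import numpy as np
from scipy.signal import freqz

def lpc_psd(lpc_coeffs, gain, n_fft=512):
    # PSD of an all-pole model: gain^2 / |A(e^{jw})|^2, with A(z) = 1 + a1 z^-1 + ...
    _, h = freqz([1.0], np.concatenate(([1.0], lpc_coeffs)), worN=n_fft)
    return (gain ** 2) * np.abs(h) ** 2

# Placeholder LPC parameters (in practice: speech shape from the codebook, noise
# shape estimated online, gains from the multiplicative updates).
speech_psd = lpc_psd(np.array([-1.2, 0.8]), gain=1.0)
noise_psd = lpc_psd(np.array([-0.1]), gain=0.5)

wiener_gain = speech_psd / (speech_psd + noise_psd)   # applied per frequency bin
print(wiener_gain.min(), wiener_gain.max())           # values in (0, 1)
```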

Proceedings ArticleDOI
01 Mar 2017
TL;DR: A novel training algorithm for high-quality Deep Neural Network (DNN)-based speech synthesis that takes into account an Anti-Spoofing Verification (ASV) as an additional constraint in the acoustic model training.
Abstract: This paper proposes a novel training algorithm for high-quality Deep Neural Network (DNN)-based speech synthesis. The parameters of synthetic speech tend to be over-smoothed, and this causes significant quality degradation in synthetic speech. The proposed algorithm takes an Anti-Spoofing Verification (ASV) into account as an additional constraint in the acoustic model training. The ASV is a discriminator trained to distinguish natural and synthetic speech. Since the acoustic models for speech synthesis are trained so that the ASV recognizes the synthetic speech parameters as natural speech, the synthetic speech parameters become distributed in the same manner as natural speech parameters. Additionally, we find that the algorithm compensates not only for the parameter distributions, but also for the global variance and the correlations of the synthetic speech parameters. The experimental results demonstrate that 1) the algorithm outperforms the conventional training algorithm in terms of speech quality, and 2) it is robust against the hyper-parameter settings.
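
The training criterion can be pictured as the usual acoustic-model loss plus an adversarial-style term that rewards synthetic parameters the ASV classifies as natural. The PyTorch sketch below shows that combination; the network sizes, weight w, and dummy data are placeholders, and in the paper the ASV is trained separately (its parameters would be frozen during this step).

```python
# Sketch of an acoustic-model loss augmented with an anti-spoofing (ASV) term.
import torch
import torch.nn as nn

feat_dim = 60
acoustic_model = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, feat_dim))
asv = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))  # trained beforehand

linguistic = torch.randn(32, 300)            # dummy batch of linguistic features
natural = torch.randn(32, feat_dim)          # dummy natural speech parameters

generated = acoustic_model(linguistic)
mse = nn.functional.mse_loss(generated, natural)
# ASV term: push the generated parameters towards being labelled "natural" (label 1).
asv_term = nn.functional.binary_cross_entropy_with_logits(
    asv(generated), torch.ones(32, 1))
w = 0.1                                      # relative weight (hyper-parameter)
loss = mse + w * asv_term
loss.backward()                              # updates the acoustic model (ASV kept frozen in practice)
```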

Proceedings ArticleDOI
16 Jun 2017
TL;DR: The approach advocates a demarcation of responsibilities between the client and server-side components for performing the speech recognition task, which symbolically encodes the audio and encrypts the data before uploading to the server.
Abstract: This paper presents a strategy for enabling speech recognition to be performed in the cloud whilst preserving the privacy of users. The approach advocates a demarcation of responsibilities between the client and server-side components for performing the speech recognition task. On the client-side resides the acoustic model, which symbolically encodes the audio and encrypts the data before uploading to the server. The server-side then employs searchable encryption to enable the phonetic search of the speech content. Some preliminary results for speech encoding and searchable encryption are presented.

Patent
Bo Li, Ron Weiss, Michiel Bacchiani, Tara N. Sainath, Kevin W. Wilson
20 Dec 2017
TL;DR: In this article, adaptive neural beamforming for multichannel speech recognition is disclosed: a first set of filter parameters for a first filter and a second set of filter parameters for a second filter are generated from the first and second audio data channels using a trained recurrent neural network.
Abstract: FIELD: information technology. SUBSTANCE: the invention discloses means for adaptive neural beamforming for multichannel speech recognition. A first channel of audio data corresponding to a speech fragment and a second channel of audio data corresponding to the same fragment are received. Using a trained recurrent neural network, a first set of filter parameters for a first filter and a second set of filter parameters for a second filter are generated, each based on the first and second audio data channels. A single combined audio data channel is generated by combining the first-channel audio data filtered with the first filter and the second-channel audio data filtered with the second filter. The combined channel is then input to a neural network trained as an acoustic model. EFFECT: improved speech recognition accuracy. 20 cl, 5 dwg
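
The combination step the patent describes is a filter-and-sum operation: each channel is convolved with its own filter and the results are summed into a single channel for the acoustic model. The sketch below uses fixed random filter taps purely to show that step; in the patent the taps are predicted per frame by the trained recurrent network.

```python
# Minimal filter-and-sum sketch with placeholder signals and filter taps.
import numpy as np

rng = np.random.default_rng(0)
channel_1 = rng.standard_normal(16000)
channel_2 = rng.standard_normal(16000)

# In the patent these taps are re-estimated by an RNN that sees both channels;
# random taps are used here only to show the combination.
filter_1 = rng.standard_normal(128)
filter_2 = rng.standard_normal(128)

combined = (np.convolve(channel_1, filter_1, mode="same")
            + np.convolve(channel_2, filter_2, mode="same"))
print(combined.shape)    # single combined channel fed to the acoustic model
```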

Journal ArticleDOI
TL;DR: An epoch extraction method is proposed that treats the vertical striations present in the time-frequency representation of voiced speech as candidate locations for the epochs; its identification rate on telephone-quality speech remains comparable to that on clean speech.
Abstract: Epoch extraction from speech involves the suppression of vocal tract resonances, either by linear prediction based inverse filtering or by filtering at very low frequency. Degradations due to the channel effect and the significant attenuation of low-frequency components (below 300 Hz) make epoch extraction from telephone-quality speech challenging. An epoch extraction method is proposed that considers the vertical striations present in the time-frequency representation of voiced speech as candidate locations for the epochs. A time-frequency representation with better-localized vertical striations is estimated using a single-pole-filter based filter bank, and the time marginal of this representation is computed to locate the epochs. The proposed algorithm is evaluated on a database of five speakers that provides simultaneous speech and electroglottographic recordings; telephone-quality speech is simulated using the G.191 software tools. The identification rate of state-of-the-art methods degrades substantially for telephone-quality speech, whereas that of the proposed method remains comparable to that on clean speech.
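
The core idea can be sketched very simply: build a time-frequency representation of voiced speech, take its marginal over frequency, and pick peaks of that time marginal as epoch (glottal closure) candidates. The sketch below substitutes an STFT for the single-pole filter bank of the paper and uses a synthetic pulse train as the "voiced" signal.

```python
# Sketch of locating epoch candidates from the time marginal of a T-F representation.
import numpy as np
from scipy.signal import stft, find_peaks, lfilter

fs = 8000
excitation = np.zeros(fs)
true_epochs = np.arange(0, fs, 80)                # 100 Hz pulse train ("voicing")
excitation[true_epochs] = 1.0
speech = lfilter([1.0], [1.0, -1.5, 0.9], excitation)   # crude vocal-tract resonance

_, _, Z = stft(speech, fs=fs, nperseg=64, noverlap=63)  # hop of one sample
time_marginal = np.abs(Z).sum(axis=0)             # vertical striations show up as peaks

peaks, _ = find_peaks(time_marginal, distance=40)
print(len(true_epochs), len(peaks))               # detected epoch candidates
```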

Posted Content
TL;DR: In this paper, the authors explored the potential of deep learning in classifying audio concepts on user-Generated Content videos, using two cascaded neural networks in a hierarchical configuration to analyze the short and long-term context information.
Abstract: Audio-based multimedia retrieval tasks may identify semantic information in audio streams, i.e., audio concepts (such as music, laughter, or a revving engine). Conventional Gaussian mixture models have had some success in classifying a reduced set of audio concepts. However, multi-class classification can benefit from context window analysis and the discriminating power of deeper architectures. Although deep learning has shown promise in various applications such as speech and object recognition, it has not yet met the expectations for other fields such as audio concept classification. This paper explores, for the first time, the potential of deep learning in classifying audio concepts on User-Generated Content videos. The proposed system is comprised of two cascaded neural networks in a hierarchical configuration to analyze the short- and long-term context information. Our system outperforms a GMM approach by a relative 54%, a Neural Network by 33%, and a Deep Neural Network by 12% on the TRECVID-MED database.

Book ChapterDOI
23 Aug 2017
TL;DR: This paper proposes a novel adaptive audio steganography in the time domain based on the advanced audio coding (AAC) and the Syndrome-Trellis coding (STC) and shows that the method can significantly outperform conventional \(\pm 1\) LSB based steganography in terms of security and audio quality.
Abstract: Most existing audio steganographic methods embed secret messages according to a pseudorandom number generator, thus some auditory sensitive parts in cover audio, such as mute or near-mute segments, will be contaminated, which would lead to poor perceptual quality and may introduce some detectable artifacts for steganalysis. In this paper, we propose a novel adaptive audio steganography in the time domain based on the advanced audio coding (AAC) and the Syndrome-Trellis coding (STC). The proposed method firstly compresses a given wave signal into AAC compressed file with a high bitrate, and then obtains a residual signal by comparing the signal before and after AAC compression. According to the quantity and sign of the residual signal, \(\pm 1\) embedding costs are assigned to the audio samples. Finally, the STC is used to create the stego audio. The extensive results evaluated on 10,000 music and 10,000 speech audio clips have shown that our method can significantly outperform the conventional \(\pm 1\) LSB based steganography in terms of security and audio quality.
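
The adaptive part of the scheme is the cost assignment: the cover is compared with its codec-compressed version, and the residual determines how cheap a +1 or -1 change is at each sample (large residual marks noisy, hard-to-model regions). The sketch below is only an illustration of that idea; the quantizer stands in for AAC, the particular cost mapping is an assumption rather than the paper's rule, and the resulting costs would then be handed to an STC embedder (not shown).

```python
# Hedged sketch of residual-driven +/-1 embedding-cost assignment.
import numpy as np

rng = np.random.default_rng(0)
cover = rng.integers(-2000, 2000, size=16000).astype(np.int32)   # 16-bit-style samples

compressed = (np.round(cover / 8) * 8).astype(np.int32)          # placeholder for AAC encode/decode
residual = cover - compressed                                    # signed compression residual

# Cheap direction: the change that moves the sample towards the codec output,
# i.e. a change that mimics ordinary compression noise; other changes stay expensive.
rho_plus = np.where(residual < 0, 1.0, 10.0)    # cost of a +1 change
rho_minus = np.where(residual > 0, 1.0, 10.0)   # cost of a -1 change

print(rho_plus[:5], rho_minus[:5])
```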

Book ChapterDOI
22 Oct 2017
TL;DR: A method for embedding and extracting steganographic information in low-rate speech coding using a three-dimensional Sudoku matrix is proposed; analysis shows that this method can enhance the concealment of the hidden information and improve the steganographic capacity of low-rate speech coding.
Abstract: Using low-rate compressed speech coding for large-capacity steganography is always a big challenge due to its low redundant information. To overcome this challenge, we propose a method for embedding and extracting hidden information in low-rate speech coding using a three-dimensional Sudoku matrix. Analysis shows that this method can enhance the concealment of the steganographic information and improve the steganographic capacity of low-rate speech coding. Experimental results on the typical low-rate speech coding standard G.723.1 showed a steganographic capacity of 200 bit/s with a reduction of less than 10% in the perceptual evaluation score of the coded speech quality.

Posted Content
TL;DR: A deep neural network model is presented which optimizes all the steps of a wideband speech coding pipeline end-to-end directly from raw speech data - no manual feature engineering necessary, and it trains in hours.
Abstract: Modern compression algorithms are often the result of laborious domain-specific research; industry standards such as MP3, JPEG, and AMR-WB took years to develop and were largely hand-designed. We present a deep neural network model which optimizes all the steps of a wideband speech coding pipeline (compression, quantization, entropy coding, and decompression) end-to-end directly from raw speech data -- no manual feature engineering necessary, and it trains in hours. In testing, our DNN-based coder performs on par with the AMR-WB standard at a variety of bitrates (~9 kbps up to ~24 kbps). It also runs in real time on a 3.8 GHz Intel CPU.

Journal ArticleDOI
TL;DR: A novel realization integrates full-sentence speech correlation with clean speech recognition, formulated as a constrained maximization problem, to overcome the data sparsity problem; without any noise estimation it significantly outperforms conventional methods that use optimized noise tracking.
Abstract: Conventional speech enhancement methods, based on frame, multiframe, or segment estimation, require knowledge about the noise. This paper presents a new method that aims to reduce or effectively remove this requirement. It is shown that by using the zero-mean normalized correlation coefficient (ZNCC) as the comparison measure, and by extending the effective length of speech segment matching to sentence-long speech utterances, it is possible to obtain an accurate speech estimate from noise without requiring specific knowledge about the noise. The new method could thus be used to deal with unpredictable noise or noise without proper training data. This paper is focused on realizing and evaluating this potential. We propose a novel realization that integrates full-sentence speech correlation with clean speech recognition, formulated as a constrained maximization problem, to overcome the data sparsity problem. We then propose an efficient implementation algorithm to solve this constrained maximization problem and produce sentence-level speech estimates. For evaluation, we build the new system on one training dataset and test it on two different test datasets across two databases, for a range of different noises including highly nonstationary ones. It is shown that the new approach, without any estimation of the noise, is able to significantly outperform conventional methods that use optimized noise tracking, in terms of various objective measures including automatic speech recognition.
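
The comparison measure itself is straightforward: the zero-mean normalized correlation coefficient between two equal-length sequences. In the paper it is applied to sentence-long feature trajectories; the sketch below uses plain 1-D vectors to show the computation.

```python
# Sketch of the zero-mean normalized correlation coefficient (ZNCC).
import numpy as np

def zncc(x, y):
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
clean = rng.standard_normal(1000)
noisy = clean + 0.5 * rng.standard_normal(1000)

print(zncc(noisy, clean))                        # close to 1 for a well-matched estimate
print(zncc(noisy, rng.standard_normal(1000)))    # near 0 for an unrelated sequence
```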