
Showing papers on "Speech coding published in 2017"


Posted Content
TL;DR: Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.
Abstract: Robust speech processing in multi-talker environments requires effective speech separation. Recent deep learning systems have made significant progress toward solving this problem, yet it remains challenging, particularly in real-time, short-latency applications. Most methods attempt to construct a mask for each source in the time-frequency representation of the mixture signal, which is not necessarily an optimal representation for speech separation. In addition, time-frequency decomposition brings inherent problems such as phase/magnitude decoupling and the long time window required to achieve sufficient frequency resolution. We propose the Time-domain Audio Separation Network (TasNet) to overcome these limitations. We directly model the signal in the time domain using an encoder-decoder framework and perform the source separation on nonnegative encoder outputs. This method removes the frequency decomposition step and reduces the separation problem to the estimation of source masks on encoder outputs, which are then synthesized by the decoder. Our system outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required output latency. This makes TasNet suitable for applications where low-power, real-time implementation is desirable, such as in hearable and telecommunication devices.
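
To make the encoder-mask-decoder data flow concrete, here is a minimal numpy sketch of the TasNet-style pipeline; it is not the authors' code, and the random bases and placeholder masks stand in for the trained encoder, separation network, and decoder.

```python
# Illustrative sketch of a TasNet-style time-domain pipeline (not the authors' code).
# The learned encoder/decoder bases and the mask network are replaced by random
# matrices and placeholder masks, purely to show the data flow.
import numpy as np

def frame(signal, win=40, hop=20):
    n = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop:i * hop + win] for i in range(n)])

def overlap_add(frames, hop=20):
    win = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + win)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + win] += f
    return out

rng = np.random.default_rng(0)
mixture = rng.standard_normal(16000)           # 1 s of dummy audio at 16 kHz
basis_enc = rng.standard_normal((40, 256))     # learned in the real model
basis_dec = rng.standard_normal((256, 40))     # learned in the real model

segments = frame(mixture)                      # (frames, 40)
weights = np.maximum(segments @ basis_enc, 0)  # nonnegative encoder outputs

# A separation network would predict one mask per source; use placeholders here.
masks = rng.random((2, *weights.shape))
masks /= masks.sum(axis=0, keepdims=True)      # masks sum to one across sources

sources = [overlap_add((masks[k] * weights) @ basis_dec) for k in range(2)]
print(len(sources), sources[0].shape)          # 2 separated time-domain signals
```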

148 citations



Journal ArticleDOI
07 Jun 2017-eLife
TL;DR: The results demonstrate a role of auditory-frontal interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments.
Abstract: Seeing a speaker’s face enhances speech intelligibility in adverse environments. We investigated the underlying network mechanisms by quantifying local speech representations and directed connectivity in MEG data obtained while human participants listened to speech of varying acoustic SNR and visual context. During high acoustic SNR, speech encoding by temporally entrained brain activity was strong in temporal and inferior frontal cortex, while during low SNR, strong entrainment emerged in premotor and superior frontal cortex. These changes in local encoding were accompanied by changes in directed connectivity along the ventral stream and the auditory-premotor axis. Importantly, the behavioral benefit arising from seeing the speaker’s face was not predicted by changes in local encoding but rather by enhanced functional connectivity between temporal and inferior frontal cortex. Our results demonstrate a role of auditory-frontal interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments.

78 citations


Proceedings ArticleDOI
19 Jun 2017
TL;DR: Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.
Abstract: Methods for adapting and controlling the characteristics of output speech are important topics in speech synthesis. In this work, we investigated the performance of DNN-based text-to-speech systems that in parallel to conventional text input also take speaker, gender, and age codes as inputs, in order to 1) perform multi-speaker synthesis, 2) perform speaker adaptation using small amounts of target-speaker adaptation data, and 3) modify synthetic speech characteristics based on the input codes. Using a large-scale, studio-quality speech corpus with 135 speakers of both genders and ages between tens and eighties, we performed three experiments: 1) First, we used a subset of speakers to construct a DNN-based, multi-speaker acoustic model with speaker codes. 2) Next, we performed speaker adaptation by estimating code vectors for new speakers via backpropagation from a small amount of adaptation material. 3) Finally, we experimented with manually manipulating input code vectors to alter the gender and/or age characteristics of the synthesised speech. Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.
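
The "code vector" idea amounts to concatenating categorical speaker, gender, and age codes with the conventional per-frame linguistic input. The sketch below shows one simple way to build such an augmented input vector; dimensions and the one-hot encoding scheme are illustrative assumptions, not the paper's exact configuration (which compares several encoding schemes, and estimates the codes by backpropagation at adaptation time).

```python
# Minimal sketch: linguistic input features concatenated with one-hot speaker,
# gender, and age codes before being fed to the acoustic DNN. All dimensions are
# placeholders.
import numpy as np

def build_input(linguistic, speaker_id, gender_id, age_decade,
                n_speakers=135, n_genders=2, n_ages=8):
    speaker = np.eye(n_speakers)[speaker_id]
    gender = np.eye(n_genders)[gender_id]
    age = np.eye(n_ages)[age_decade]
    return np.concatenate([linguistic, speaker, gender, age])

frame_features = np.random.rand(300)            # per-frame linguistic features (dummy)
x = build_input(frame_features, speaker_id=7, gender_id=1, age_decade=3)
print(x.shape)                                   # (300 + 135 + 2 + 8,) = (445,)
```

For adaptation, the same network is kept fixed while the code portion of the input is treated as a free parameter and optimized on the target speaker's data; for manipulation, the gender or age part of the code is simply edited before synthesis.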

71 citations


Journal ArticleDOI
TL;DR: The proposed QCCN steganalysis method can effectively detect QIM steganography in an encoded speech stream when applied to low-bit-rate speech codecs such as G.723.1 and G.729.
Abstract: Steganalysis of quantization index modulation (QIM) steganography in a low-bit-rate encoded speech stream is conducted in this research. According to speech generation theory and the phoneme distribution properties of language, we first point out that the correlation characteristics of the split vector quantization (VQ) codewords of linear predictive coding filter coefficients are changed after QIM steganography. Based on this observation, we construct a model called the quantization codeword correlation network (QCCN), built from the split VQ codewords of adjacent speech frames. The QCCN model is then pruned to yield a stronger correlation network. After quantifying the correlation characteristics of the vertices in the pruned network, we obtain feature vectors that are sensitive for steganalysis. Finally, we build a high-performance detector using a support vector machine (SVM) classifier. Experimental results show that the proposed QCCN steganalysis method can effectively detect QIM steganography in an encoded speech stream when applied to low-bit-rate speech codecs such as G.723.1 and G.729.
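
The sketch below illustrates the general shape of such a pipeline: adjacent-frame codeword pairs define the edges of a correlation network, the network is pruned, per-vertex statistics become the feature vector, and an SVM does the detection. The pruning threshold, degree-based features, and random codeword streams are illustrative placeholders, not the paper's exact quantification of vertex correlation.

```python
# Hedged sketch of a QCCN-style steganalysis pipeline with placeholder data.
import numpy as np
from collections import Counter
from sklearn.svm import SVC

def qccn_features(codewords, prune_below=2, n_features=64):
    # Adjacent-frame codeword pairs form the (undirected) edges of the network.
    edges = Counter(tuple(sorted(p)) for p in zip(codewords[:-1], codewords[1:]))
    # Prune weak edges, then use vertex degrees of the pruned network as features.
    degree = Counter()
    for (a, b), w in edges.items():
        if w >= prune_below and a != b:
            degree[a] += 1
            degree[b] += 1
    feats = sorted(degree.values(), reverse=True)[:n_features]
    return np.array(feats + [0] * (n_features - len(feats)), dtype=float)

rng = np.random.default_rng(1)
# Random codeword streams stand in for real cover/stego VQ index sequences.
cover = [qccn_features(rng.integers(0, 64, 2000)) for _ in range(50)]
stego = [qccn_features(rng.integers(0, 64, 2000)) for _ in range(50)]
clf = SVC().fit(np.vstack(cover + stego), [0] * 50 + [1] * 50)
```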

50 citations


Proceedings ArticleDOI
05 Mar 2017
TL;DR: A two-stage algorithm to deal with the confounding effects of noise and reverberation separately, where denoising and dereverberation are conducted sequentially using deep neural networks is proposed, and it substantially outperforms one-stage enhancement baselines.
Abstract: In daily listening environments, speech is commonly corrupted by room reverberation and background noise. These distortions are detrimental to speech intelligibility and quality, and they severely degrade the performance of automatic speech and speaker recognition systems. In this paper, we propose a two-stage algorithm to deal with the confounding effects of noise and reverberation separately, where denoising and dereverberation are conducted sequentially using deep neural networks. In addition, we design a new objective function that incorporates clean phase information during training. Because the objective function emphasizes more important time-frequency (T-F) units, a better magnitude estimate is obtained during testing. By jointly training the two-stage model to optimize the proposed objective function, our algorithm significantly improves objective metrics of speech intelligibility and quality, and substantially outperforms one-stage enhancement baselines.
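
One common way to fold clean-phase information into a magnitude objective is a phase-sensitive target, in which the clean magnitude is scaled by the cosine of the clean/noisy phase difference. The sketch below computes such a target from STFTs; it illustrates the general idea only and is not claimed to be the authors' exact objective function.

```python
# Sketch of a phase-sensitive training target: T-F units whose phase is badly
# corrupted contribute less to the magnitude loss. Illustrative, not the paper's
# exact formulation.
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)                  # dummy signals
noisy = clean + 0.5 * rng.standard_normal(fs)

_, _, S = stft(clean, fs=fs, nperseg=512)
_, _, Y = stft(noisy, fs=fs, nperseg=512)

target = np.abs(S) * np.cos(np.angle(S) - np.angle(Y))   # phase-sensitive target
estimate = np.abs(Y)                                     # placeholder DNN output
loss = np.mean((estimate - target) ** 2)
print(loss)
```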

49 citations



Proceedings ArticleDOI
05 Mar 2017
TL;DR: A deep neural network is used to estimate the real and imaginary components of the complex ideal ratio mask (cIRM), which results in clean and anechoic speech when applied to a reverberant-noisy mixture and shows that phase is important for dereverberation, and that complex ratio masking outperforms related methods.
Abstract: Traditional speech separation systems enhance the magnitude response of noisy speech. Recent studies, however, have shown that perceptual speech quality is significantly improved when magnitude and phase are both enhanced. These studies, however, have not determined if phase enhancement is beneficial in environments that contain reverberation as well as noise. In this paper, we present an approach that jointly enhances the magnitude and phase of reverberant and noisy speech. We use a deep neural network to estimate the real and imaginary components of the complex ideal ratio mask (cIRM), which results in clean and anechoic speech when applied to a reverberant-noisy mixture. Our results show that phase is important for dereverberation, and that complex ratio masking outperforms related methods.
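
The cIRM is the complex mask M that satisfies S = M * Y in the STFT domain, so its ideal value is simply S / Y per T-F unit; the network in the paper is trained to estimate its real and imaginary parts (usually after compression to a bounded range, omitted here). A minimal sketch:

```python
# Sketch of the complex ideal ratio mask (cIRM). Applying the ideal mask to the
# mixture recovers the clean STFT; a DNN would estimate the real and imaginary
# components of this mask from the reverberant-noisy input.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(2 * fs)
noisy = clean + 0.3 * rng.standard_normal(2 * fs)    # dummy reverberant-noisy mix

_, _, S = stft(clean, fs=fs, nperseg=512)
_, _, Y = stft(noisy, fs=fs, nperseg=512)

eps = 1e-8
cirm = (S * np.conj(Y)) / (np.abs(Y) ** 2 + eps)     # real + imaginary components

_, resynth = istft(cirm * Y, fs=fs, nperseg=512)     # masked mixture -> clean speech
print(np.allclose(resynth[:len(clean)], clean, atol=1e-3))
```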

47 citations


Journal ArticleDOI
TL;DR: This paper proposes a framework for detecting double compressed AMR audio based on the stacked autoencoder (SAE) network and the universal background model-Gaussian mixture model (UBM-GMM), and uses the SAE to learn the optimal features automatically from the audio waveforms.
Abstract: The adaptive multi-rate (AMR) audio codec adopted by many portable recording devices is widely used in speech compression. The use of AMR speech recordings as evidence in court is growing. Nowadays, it is easy to tamper with digital speech recordings, which makes audio forensics increasingly important. The detection of double compressed audio is one of the key issues in audio forensics. In this paper, we propose a framework for detecting double compressed AMR audio based on the stacked autoencoder (SAE) network and the universal background model-Gaussian mixture model (UBM-GMM). Instead of hand-crafted features, we used the SAE to learn the optimal features automatically from the audio waveforms. Audio frames are used as network input and the last hidden layer’s output constitutes the features of a single frame. For an audio clip with many frames, the features of all the frames are aggregated and classified by UBM-GMM. Experimental results show that our method is effective in distinguishing single/double compressed AMR audio and outperforms the existing methods by achieving a detection accuracy of 98% on the TIMIT database. Exhaustive experiments demonstrate the effectiveness and robustness of the proposed method.

46 citations


Patent
17 May 2017
TL;DR: In this article, a platform for multiple media devices connected via a network is configured to process speech, such as voice commands, detected at the media devices, and respond to the detected speech by causing the devices to simultaneously perform one or more requested actions.
Abstract: Provided are methods, systems, and apparatuses for detecting, processing, and responding to audio signals, including speech signals, within a designated area or space. A platform for multiple media devices connected via a network is configured to process speech, such as voice commands, detected at the media devices, and respond to the detected speech by causing the media devices to simultaneously perform one or more requested actions. The platform is capable of scoring the quality of a speech request, handling speech requests from multiple end points of the platform using a centralized processing approach, a de-centralized processing approach, or a combination thereof, and also manipulating partial processing of speech requests from multiple end points into a coherent whole when necessary.

45 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigated whether and how spectral and temporal cues in a precursor sentence that has been processed under high vs. low cognitive load influence the perception of a subsequent target word.

Journal ArticleDOI
TL;DR: This paper is concerned with generating intelligible audio speech from a video of a person talking, and regression and classification methods are proposed first to estimate static spectral envelope features from active appearance model visual features and two further methods are developed to incorporate temporal information into the prediction.
Abstract: This paper is concerned with generating intelligible audio speech from a video of a person talking. Regression and classification methods are proposed first to estimate static spectral envelope features from active appearance model visual features. Two further methods are then developed to incorporate temporal information into the prediction: A feature-level method using multiple frames and a model-level method based on recurrent neural networks. Speech excitation information is not available from the visual signal, so methods to artificially generate aperiodicity and fundamental frequency are developed. These are combined within the STRAIGHT vocoder to produce a speech signal. The various systems are optimized through objective tests before applying subjective intelligibility tests that determine a word accuracy of 85% from a set of human listeners on the GRID audio-visual speech database. This compares favorably with a previous regression-based system that serves as a baseline, which achieved a word accuracy of 33%.

Proceedings ArticleDOI
05 Mar 2017
TL;DR: A novel optimization problem, involving the minimization of nuclear norms and matrix ℓ1-norms is solved and the proposed method is evaluated in 1) visual localization and audio separation and 2) visual-assisted audio denoising.
Abstract: The ability to localize visual objects that are associated with an audio source and at the same time separate the audio signal is a cornerstone of several audio-visual signal processing applications. Past efforts usually focused on localizing only the visual objects, without audio separation abilities. Besides, they often rely on computationally expensive pre-processing steps that segment image pixels into object regions before applying localization approaches. We aim to address the problem of audio-visual source localization and separation in an unsupervised manner. The proposed approach employs low-rank structure to model the background visual and audio information, and sparsity to extract the sparsely correlated components between the audio and visual modalities. In particular, this model decomposes each dataset into a sum of two terms: low-rank matrices capturing the background, uncorrelated information, and sparse correlated components modelling the sound source in the visual modality and the associated sound in the audio modality. To this end, a novel optimization problem involving the minimization of nuclear norms and matrix ℓ1-norms is solved. We evaluate the proposed method on 1) visual localization and audio separation and 2) visual-assisted audio denoising. The experimental results demonstrate the effectiveness of the proposed method.
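
In its generic robust-PCA-style form, a nuclear-norm/ℓ1 objective of the kind referred to above looks as follows; the paper's exact constraints and the coupling between the audio and visual sparse terms may differ.

```latex
% Generic low-rank-plus-sparse decomposition; the coupling between modalities used
% in the paper may differ from this form.
\begin{aligned}
\min_{L_a, L_v, S_a, S_v} \quad & \|L_a\|_* + \|L_v\|_* + \lambda_a \|S_a\|_1 + \lambda_v \|S_v\|_1 \\
\text{subject to} \quad & A = L_a + S_a, \qquad V = L_v + S_v,
\end{aligned}
```

where A and V are the audio and visual data matrices, the low-rank terms L capture the uncorrelated background in each modality, and the sparse terms S model the sound source and its visual counterpart.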

Proceedings ArticleDOI
05 Mar 2017
TL;DR: This paper presents an optimal multi-channel Wiener filter, which consists of an eigenvector beamformer and a single-channel postfilter, and shows that both components solely depend on a speech presence probability, which is learned using a deep neural network.
Abstract: In this paper, we present an optimal multi-channel Wiener filter, which consists of an eigenvector beamformer and a single-channel postfilter. We show that both components solely depend on a speech presence probability, which we learn using a deep neural network, consisting of a deep autoencoder and a softmax regression layer. To prevent the DNN from learning specific speaker and noise types, we do not use the signal energy as input feature, but rather the cosine distance between the dominant eigenvectors of consecutive frames of the power spectral density of the noisy speech signal. We compare our system against the BeamformIt toolkit, and state-of-the-art approaches such as the front-end of the best system of the CHiME3 challenge. We show that our system yields superior results, both in terms of perceptual speech quality and classification error.
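
The sketch below shows one way to compute the input feature described above: the cosine distance between the dominant eigenvectors of the spatial PSD matrices of consecutive frames. The array geometry, per-frame rank-1 PSD estimate (unsmoothed here), and dummy data are illustrative assumptions.

```python
# Sketch of the eigenvector-based DNN input feature for one frequency bin.
import numpy as np

def dominant_eigenvector(psd):
    # psd: (channels, channels) Hermitian matrix; eigh returns ascending eigenvalues.
    _, vecs = np.linalg.eigh(psd)
    return vecs[:, -1]

def cosine_distance_feature(stft_frames):
    # stft_frames: (frames, channels) complex STFT values for one frequency bin.
    feats, prev = [], None
    for frame in stft_frames:
        psd = np.outer(frame, frame.conj())   # instantaneous PSD (recursively smoothed in practice)
        v = dominant_eigenvector(psd)
        if prev is not None:
            cos = np.abs(np.vdot(prev, v)) / (np.linalg.norm(prev) * np.linalg.norm(v))
            feats.append(1.0 - cos)           # small when the dominant subspace is stable
        prev = v
    return np.array(feats)

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 6)) + 1j * rng.standard_normal((100, 6))
print(cosine_distance_feature(frames).shape)   # (99,)
```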

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed approach, which embeds information during the linear predictive coding (LPC) process based on matrix embedding (ME), leads to better performance with less speech distortion and better security.
Abstract: The extensive use of Voice over IP (VoIP) applications makes the low bit-rate speech stream a very suitable steganographic cover medium. To incorporate steganography into a low bit-rate speech codec, we propose a novel approach to embed information during the linear predictive coding (LPC) process based on matrix embedding (ME). In the proposed method, a mapping table is constructed based on the criterion of minimum distance between linear-predictive-coefficient vectors, and the embedding position and template are selected according to a private key so as to choose the cover frames. The original speech data of the chosen frames are partially encoded to obtain the codewords for embedding, and the codewords that need to be modified are then selected according to the secret bits and the ME algorithm. Each selected codeword is changed into its best replacement codeword according to the mapping table. When embedding k (k > 1) bits into 2^k - 1 codewords, the embedding efficiency of our method is k times that of the LPC-based quantization index modulation method. The performance of the proposed approach is evaluated in two aspects: the distortion in speech quality introduced by embedding, and the security under steganalysis. The experimental results demonstrate that the proposed approach leads to better performance with less speech distortion and better security.
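
The "k bits into 2^k - 1 cover elements with at most one change" property is the classical Hamming-code form of matrix embedding. The sketch below shows it on plain bits to keep the example self-contained; in the scheme above the cover elements are derived from LPC codeword choices rather than raw bits.

```python
# Sketch of binary matrix embedding with a Hamming code: k secret bits are carried
# by n = 2^k - 1 cover bits, and at most one cover bit is changed.
import numpy as np

def parity_matrix(k):
    # Columns are the binary representations of 1 .. 2^k - 1.
    n = 2 ** k - 1
    return np.array([[(j >> i) & 1 for j in range(1, n + 1)] for i in range(k)])

def embed(cover, secret, k):
    H = parity_matrix(k)
    syndrome = (H @ cover) % 2
    diff = syndrome ^ secret
    stego = cover.copy()
    if diff.any():                                   # flip exactly one position
        pos = int(sum(int(b) << i for i, b in enumerate(diff))) - 1
        stego[pos] ^= 1
    return stego

def extract(stego, k):
    return (parity_matrix(k) @ stego) % 2

k = 3
rng = np.random.default_rng(0)
cover = rng.integers(0, 2, 2 ** k - 1)
secret = rng.integers(0, 2, k)
stego = embed(cover, secret, k)
assert np.array_equal(extract(stego, k), secret)     # secret recovered
assert np.sum(stego != cover) <= 1                   # at most one change
```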

Posted Content
TL;DR: This work describes how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s, and shows that the system additionally performs implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener.
Abstract: Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the system additionally performs implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener, even when that speaker was not used during the training of the generative model.

Journal ArticleDOI
TL;DR: A new technique to detect adulterations in audio recordings is proposed by exploiting abnormal variations in the electrical network frequency (ENF) signal that may be embedded in a questioned audio recording.
Abstract: Audio authentication is a critical task in multimedia forensics demanding robust methods to detect and identify tampered audio recordings. In this paper, a new technique to detect adulterations in audio recordings is proposed by exploiting abnormal variations in the electrical network frequency (ENF) signal that may be embedded in a questioned audio recording. These abnormal variations are caused by abrupt phase discontinuities due to insertions and suppressions of audio snippets during tampering. First, we propose an ESPRIT-Hilbert ENF estimator in conjunction with an outlier detector based on the sample kurtosis of the estimated ENF. Next, we use the computed kurtosis as an input to a support vector machine classifier to indicate the presence of tampering. The proposed scheme, herein designated SPHINS, significantly outperforms related previous tampering detection approaches in the conducted tests. We validate our results using the Carioca 1 corpus with 100 unedited authorized audio recordings of phone calls.
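
The core cue is easy to demonstrate: an edit introduces a phase jump in the embedded network-frequency component, which appears as a spike in the estimated instantaneous frequency and therefore as high sample kurtosis. The sketch below uses a simple Hilbert-based frequency estimate on a synthetic 50 Hz tone rather than the ESPRIT-Hilbert estimator of the paper.

```python
# Sketch of the kurtosis-based tampering cue on an idealized ENF component.
import numpy as np
from scipy.signal import hilbert
from scipy.stats import kurtosis

fs = 1000
t = np.arange(0, 10, 1 / fs)
enf = np.cos(2 * np.pi * 50 * t)                               # idealized 50 Hz component
enf[5000:] = np.cos(2 * np.pi * 50 * t[5000:] + np.pi / 2)     # splice: abrupt phase jump

phase = np.unwrap(np.angle(hilbert(enf)))
inst_freq = np.diff(phase) * fs / (2 * np.pi)                  # ~50 Hz except at the splice

print("kurtosis of estimated ENF:", kurtosis(inst_freq))
# A heavy-tailed (high-kurtosis) frequency estimate flags a possible edit; the paper
# feeds this statistic to an SVM rather than thresholding it directly.
```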


Journal ArticleDOI
TL;DR: The proposed novel QIM steganography based on the replacement of quantization index set in linear predictive coding (LPC) outperforms the state-of-the-art LPC-based approach in low-bit-rate speech codec with respect to both steganographic capacity and steganalysis resistance.
Abstract: In this paper, we focus on quantization-index-modulation (QIM) steganography in low-bit-rate speech codecs and contribute to improving its steganalysis resistance. A novel QIM steganography is proposed based on the replacement of the quantization index set in linear predictive coding (LPC). In this method, each quantization index set is seen as a point in a quantization index space, and steganography is conducted in that space. Compared with other methods, our algorithm significantly improves the embedding efficiency: at most one quantization index needs to be changed when three binary bits are hidden. The number of alterations introduced by the proposed approach is much lower than that of current methods with the same embedding rate. Due to the fewer cover changes, the proposed steganography is less detectable. Moreover, a division strategy based on a genetic algorithm is proposed to reduce the additional distortion introduced by the replacements. In our experiment, ITU-T G.723.1 is selected as the codec, and the experimental results show that the proposed approach outperforms the state-of-the-art LPC-based approach in a low-bit-rate speech codec with respect to both steganographic capacity and steganalysis resistance.

Journal ArticleDOI
TL;DR: An overview of perceptually motivated techniques is presented, with a focus on multichannel audio recording and reproduction, audio source and reflection culling, and artificial reverberators.
Abstract: Developments in immersive audio technologies have been evolving in two directions: physically motivated systems and perceptually motivated systems. Physically motivated techniques aim to reproduce a physically accurate approximation of desired sound fields by employing a very high equipment load and sophisticated, computationally intensive algorithms. Perceptually motivated techniques, however, aim to render only the perceptually relevant aspects of the sound scene by means of modest computational and equipment load. This article presents an overview of perceptually motivated techniques, with a focus on multichannel audio recording and reproduction, audio source and reflection culling, and artificial reverberators.

Journal ArticleDOI
TL;DR: An improved codebook-driven Wiener filter combined with the speech-presence probability is developed, so that the proposed method achieves the goal of removing the residual noise between the harmonics of noisy speech.
Abstract: In this paper, we present a novel method for estimating the short-term linear predictive parameters of speech and noise in the codebook-driven Wiener filtering speech enhancement method. We use only a pretrained spectral-shape codebook of speech to model the a priori information about the linear predictive coefficients of speech; the spectral shape of noise is estimated online directly, instead of using a noise codebook, which avoids the problem of noise classification. Unlike existing codebook-driven methods, in which the linear predictive gains of speech and noise are estimated by a maximum-likelihood method, the proposed method exploits a multiplicative update rule to estimate the linear predictive gains more accurately. The estimated gains help to preserve more speech components in the enhanced speech. A Bayesian parameter estimator that does not require a noise codebook is also developed. Moreover, we develop an improved codebook-driven Wiener filter combined with the speech-presence probability, so that the proposed method achieves the goal of removing the residual noise between the harmonics of noisy speech.
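
Given per-frame speech and noise LPC parameters (shape coefficients plus gains), the Wiener filter itself is the ratio of the modelled speech PSD to the modelled noisy PSD. The sketch below shows that final step with placeholder LPC values standing in for the codebook search, online noise-shape estimate, and multiplicative gain updates described above.

```python
# Sketch of building a Wiener gain from all-pole (LPC) spectral models.
import numpy as np
from scipy.signal import freqz

def lpc_psd(lpc_coeffs, gain, n_fft=512):
    # PSD of an all-pole model: gain^2 / |A(e^{jw})|^2, with A(z) = 1 + a1 z^-1 + ...
    _, h = freqz([1.0], np.concatenate(([1.0], lpc_coeffs)), worN=n_fft)
    return (gain ** 2) * np.abs(h) ** 2

# Placeholder LPC parameters (in practice: speech shape from the codebook, noise
# shape estimated online, gains from the multiplicative updates).
speech_psd = lpc_psd(np.array([-1.2, 0.8]), gain=1.0)
noise_psd = lpc_psd(np.array([-0.1]), gain=0.5)

wiener_gain = speech_psd / (speech_psd + noise_psd)   # applied per frequency bin
print(wiener_gain.min(), wiener_gain.max())           # values in (0, 1)
```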

Proceedings ArticleDOI
01 Mar 2017
TL;DR: A novel training algorithm for high-quality Deep Neural Network (DNN)-based speech synthesis that takes into account an Anti-Spoofing Verification (ASV) as an additional constraint in the acoustic model training.
Abstract: This paper proposes a novel training algorithm for high-quality Deep Neural Network (DNN)-based speech synthesis. The parameters of synthetic speech tend to be over-smoothed, and this causes significant quality degradation in synthetic speech. The proposed algorithm takes an Anti-Spoofing Verification (ASV) into account as an additional constraint in the acoustic model training. The ASV is a discriminator trained to distinguish natural and synthetic speech. Since the acoustic models for speech synthesis are trained so that the ASV recognizes the synthetic speech parameters as natural speech, the synthetic speech parameters become distributed in the same manner as natural speech parameters. Additionally, we find that the algorithm compensates not only for the parameter distributions, but also for the global variance and the correlations of the synthetic speech parameters. The experimental results demonstrate that 1) the algorithm outperforms the conventional training algorithm in terms of speech quality, and 2) it is robust against the hyper-parameter settings.
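
The training criterion can be pictured as the usual acoustic-model loss plus an adversarial-style term that rewards synthetic parameters the ASV classifies as natural. The PyTorch sketch below shows that combination; the network sizes, weight w, and dummy data are placeholders, and in the paper the ASV is trained separately (its parameters would be frozen during this step).

```python
# Sketch of an acoustic-model loss augmented with an anti-spoofing (ASV) term.
import torch
import torch.nn as nn

feat_dim = 60
acoustic_model = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, feat_dim))
asv = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))  # trained beforehand

linguistic = torch.randn(32, 300)            # dummy batch of linguistic features
natural = torch.randn(32, feat_dim)          # dummy natural speech parameters

generated = acoustic_model(linguistic)
mse = nn.functional.mse_loss(generated, natural)
# ASV term: push the generated parameters towards being labelled "natural" (label 1).
asv_term = nn.functional.binary_cross_entropy_with_logits(
    asv(generated), torch.ones(32, 1))
w = 0.1                                      # relative weight (hyper-parameter)
loss = mse + w * asv_term
loss.backward()                              # updates the acoustic model (ASV kept frozen in practice)
```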

Proceedings ArticleDOI
16 Jun 2017
TL;DR: The approach advocates a demarcation of responsibilities between the client and server-side components for performing the speech recognition task, which symbolically encodes the audio and encrypts the data before uploading to the server.
Abstract: This paper presents a strategy for enabling speech recognition to be performed in the cloud whilst preserving the privacy of users. The approach advocates a demarcation of responsibilities between the client and server-side components for performing the speech recognition task. On the client-side resides the acoustic model, which symbolically encodes the audio and encrypts the data before uploading to the server. The server-side then employs searchable encryption to enable the phonetic search of the speech content. Some preliminary results for speech encoding and searchable encryption are presented.

Patent
Bo Li, Ron Weiss, Michiel Bacchiani, Tara N. Sainath, Kevin W. Wilson
20 Dec 2017
TL;DR: In this article, adaptive neural beamforming for multichannel speech recognition is disclosed: a first set of filter parameters for a first filter and a second set of filter parameters for a second filter are generated from the first and second audio data channels using a trained recurrent neural network.
Abstract: FIELD: information technology. SUBSTANCE: the invention discloses means for adaptive neural beamforming for multichannel speech recognition. A first channel of audio data corresponding to a speech fragment and a second channel of audio data corresponding to the same fragment are received. Using a trained recurrent neural network, a first set of filter parameters for a first filter and a second set of filter parameters for a second filter are generated, each based on the first and second audio data channels. A single combined audio data channel is generated by combining the first-channel audio data filtered with the first filter and the second-channel audio data filtered with the second filter. The combined channel is then input to a neural network trained as an acoustic model. EFFECT: improved speech recognition accuracy. 20 cl, 5 dwg
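
The combination step the patent describes is a filter-and-sum operation: each channel is convolved with its own filter and the results are summed into a single channel for the acoustic model. The sketch below uses fixed random filter taps purely to show that step; in the patent the taps are predicted per frame by the trained recurrent network.

```python
# Minimal filter-and-sum sketch with placeholder signals and filter taps.
import numpy as np

rng = np.random.default_rng(0)
channel_1 = rng.standard_normal(16000)
channel_2 = rng.standard_normal(16000)

# In the patent these taps are re-estimated by an RNN that sees both channels;
# random taps are used here only to show the combination.
filter_1 = rng.standard_normal(128)
filter_2 = rng.standard_normal(128)

combined = (np.convolve(channel_1, filter_1, mode="same")
            + np.convolve(channel_2, filter_2, mode="same"))
print(combined.shape)    # single combined channel fed to the acoustic model
```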

Journal ArticleDOI
TL;DR: An epoch extraction method is proposed that treats the vertical striations present in the time-frequency representation of voiced speech as candidate locations for the epochs; its identification rate on telephone-quality speech remains comparable to that on clean speech.
Abstract: Epoch extraction from speech involves the suppression of vocal tract resonances, either by linear prediction based inverse filtering or by filtering at very low frequency. Degradations due to the channel effect and the significant attenuation of low-frequency components (below 300 Hz) make epoch extraction from telephone-quality speech challenging. An epoch extraction method is proposed that considers the vertical striations present in the time-frequency representation of voiced speech as candidate locations for the epochs. A time-frequency representation with better-localized vertical striations is estimated using a single-pole-filter based filter bank, and the time marginal of this representation is computed to locate the epochs. The proposed algorithm is evaluated on a database of five speakers that provides simultaneous speech and electroglottographic recordings; telephone-quality speech is simulated using the G.191 software tools. The identification rate of state-of-the-art methods degrades substantially for telephone-quality speech, whereas that of the proposed method remains comparable to that on clean speech.
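
The core idea can be sketched very simply: build a time-frequency representation of voiced speech, take its marginal over frequency, and pick peaks of that time marginal as epoch (glottal closure) candidates. The sketch below substitutes an STFT for the single-pole filter bank of the paper and uses a synthetic pulse train as the "voiced" signal.

```python
# Sketch of locating epoch candidates from the time marginal of a T-F representation.
import numpy as np
from scipy.signal import stft, find_peaks, lfilter

fs = 8000
excitation = np.zeros(fs)
true_epochs = np.arange(0, fs, 80)                # 100 Hz pulse train ("voicing")
excitation[true_epochs] = 1.0
speech = lfilter([1.0], [1.0, -1.5, 0.9], excitation)   # crude vocal-tract resonance

_, _, Z = stft(speech, fs=fs, nperseg=64, noverlap=63)  # hop of one sample
time_marginal = np.abs(Z).sum(axis=0)             # vertical striations show up as peaks

peaks, _ = find_peaks(time_marginal, distance=40)
print(len(true_epochs), len(peaks))               # detected epoch candidates
```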

Posted Content
TL;DR: In this paper, the authors explored the potential of deep learning in classifying audio concepts on user-Generated Content videos, using two cascaded neural networks in a hierarchical configuration to analyze the short and long-term context information.
Abstract: Audio-based multimedia retrieval tasks may identify semantic information in audio streams, i.e., audio concepts (such as music, laughter, or a revving engine). Conventional Gaussian mixture models have had some success in classifying a reduced set of audio concepts. However, multi-class classification can benefit from context window analysis and the discriminating power of deeper architectures. Although deep learning has shown promise in various applications such as speech and object recognition, it has not yet met the expectations for other fields such as audio concept classification. This paper explores, for the first time, the potential of deep learning in classifying audio concepts on User-Generated Content videos. The proposed system is comprised of two cascaded neural networks in a hierarchical configuration to analyze the short- and long-term context information. Our system outperforms a GMM approach by a relative 54%, a Neural Network by 33%, and a Deep Neural Network by 12% on the TRECVID-MED database.

Book ChapterDOI
23 Aug 2017
TL;DR: This paper proposes a novel adaptive audio steganography in the time domain based on the advanced audio coding (AAC) and the Syndrome-Trellis coding (STC) and shows that the method can significantly outperform conventional \(\pm 1\) LSB based steganography in terms of security and audio quality.
Abstract: Most existing audio steganographic methods embed secret messages according to a pseudorandom number generator, thus some auditory sensitive parts in cover audio, such as mute or near-mute segments, will be contaminated, which would lead to poor perceptual quality and may introduce some detectable artifacts for steganalysis. In this paper, we propose a novel adaptive audio steganography in the time domain based on the advanced audio coding (AAC) and the Syndrome-Trellis coding (STC). The proposed method firstly compresses a given wave signal into AAC compressed file with a high bitrate, and then obtains a residual signal by comparing the signal before and after AAC compression. According to the quantity and sign of the residual signal, \(\pm 1\) embedding costs are assigned to the audio samples. Finally, the STC is used to create the stego audio. The extensive results evaluated on 10,000 music and 10,000 speech audio clips have shown that our method can significantly outperform the conventional \(\pm 1\) LSB based steganography in terms of security and audio quality.
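
The adaptive part of the scheme is the cost assignment: the cover is compared with its codec-compressed version, and the residual determines how cheap a +1 or -1 change is at each sample (large residual marks noisy, hard-to-model regions). The sketch below is only an illustration of that idea; the quantizer stands in for AAC, the particular cost mapping is an assumption rather than the paper's rule, and the resulting costs would then be handed to an STC embedder (not shown).

```python
# Hedged sketch of residual-driven +/-1 embedding-cost assignment.
import numpy as np

rng = np.random.default_rng(0)
cover = rng.integers(-2000, 2000, size=16000).astype(np.int32)   # 16-bit-style samples

compressed = (np.round(cover / 8) * 8).astype(np.int32)          # placeholder for AAC encode/decode
residual = cover - compressed                                    # signed compression residual

# Cheap direction: the change that moves the sample towards the codec output,
# i.e. a change that mimics ordinary compression noise; other changes stay expensive.
rho_plus = np.where(residual < 0, 1.0, 10.0)    # cost of a +1 change
rho_minus = np.where(residual > 0, 1.0, 10.0)   # cost of a -1 change

print(rho_plus[:5], rho_minus[:5])
```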

Book ChapterDOI
22 Oct 2017
TL;DR: A method for embedding and extracting steganographic information in low-rate speech coding using a three-dimensional Sudoku matrix is proposed; analysis shows that this method can enhance the concealment of the hidden information and improve the steganographic capacity of low-rate speech coding.
Abstract: Using low-rate compressed speech coding for large-capacity steganography is always a big challenge due to its low redundant information. To overcome this challenge, we propose a method for embedding and extracting hidden information in low-rate speech coding using a three-dimensional Sudoku matrix. Analysis shows that this method can enhance the concealment of the steganographic information and improve the steganographic capacity of low-rate speech coding. Experimental results on the typical low-rate speech coding standard G.723.1 showed a steganographic capacity of 200 bit/s with a reduction of less than 10% in the perceptual evaluation score of the coded speech quality.

Posted Content
TL;DR: A deep neural network model is presented which optimizes all the steps of a wideband speech coding pipeline end-to-end directly from raw speech data - no manual feature engineering necessary, and it trains in hours.
Abstract: Modern compression algorithms are often the result of laborious domain-specific research; industry standards such as MP3, JPEG, and AMR-WB took years to develop and were largely hand-designed. We present a deep neural network model which optimizes all the steps of a wideband speech coding pipeline (compression, quantization, entropy coding, and decompression) end-to-end directly from raw speech data -- no manual feature engineering necessary, and it trains in hours. In testing, our DNN-based coder performs on par with the AMR-WB standard at a variety of bitrates (~9 kbps up to ~24 kbps). It also runs in real time on a 3.8 GHz Intel CPU.

Journal ArticleDOI
TL;DR: A novel realization integrates full-sentence speech correlation with clean speech recognition, formulated as a constrained maximization problem, to overcome the data sparsity problem; without any noise estimation it significantly outperforms conventional methods that use optimized noise tracking.
Abstract: Conventional speech enhancement methods, based on frame, multiframe, or segment estimation, require knowledge about the noise. This paper presents a new method that aims to reduce or effectively remove this requirement. It is shown that by using the zero-mean normalized correlation coefficient (ZNCC) as the comparison measure, and by extending the effective length of speech segment matching to sentence-long speech utterances, it is possible to obtain an accurate speech estimate from noise without requiring specific knowledge about the noise. The new method could thus be used to deal with unpredictable noise or noise without proper training data. This paper is focused on realizing and evaluating this potential. We propose a novel realization that integrates full-sentence speech correlation with clean speech recognition, formulated as a constrained maximization problem, to overcome the data sparsity problem. We then propose an efficient implementation algorithm to solve this constrained maximization problem and produce sentence-level speech estimates. For evaluation, we build the new system on one training dataset and test it on two different test datasets across two databases, for a range of different noises including highly nonstationary ones. It is shown that the new approach, without any estimation of the noise, is able to significantly outperform conventional methods that use optimized noise tracking, in terms of various objective measures including automatic speech recognition.
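
The comparison measure itself is straightforward: the zero-mean normalized correlation coefficient between two equal-length sequences. In the paper it is applied to sentence-long feature trajectories; the sketch below uses plain 1-D vectors to show the computation.

```python
# Sketch of the zero-mean normalized correlation coefficient (ZNCC).
import numpy as np

def zncc(x, y):
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
clean = rng.standard_normal(1000)
noisy = clean + 0.5 * rng.standard_normal(1000)

print(zncc(noisy, clean))                        # close to 1 for a well-matched estimate
print(zncc(noisy, rng.standard_normal(1000)))    # near 0 for an unrelated sequence
```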