
Showing papers on "Speech coding" published in 2019


Proceedings ArticleDOI
12 May 2019
TL;DR: This work demonstrates that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality.
Abstract: In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality. A prosody-transparent and speaker-independent model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits perceptual quality which is around halfway between the MELP codec at 2.4 kbps and AMR-WB codec at 23.05 kbps. In addition, when training on high-quality recorded speech with the test speaker included in the training set, a model coding speech at 1.6 kbps produces output of similar perceptual quality to that generated by AMR-WB at 23.05 kbps.
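For readers unfamiliar with the VQ-VAE bottleneck at the heart of such codecs, the following is a minimal sketch, assuming PyTorch, of a vector-quantization layer: the encoder output is snapped to the nearest codebook entry, the codebook indices form the transmitted bitstream, and a straight-through estimator passes gradients back to the encoder. It illustrates the general technique only, not the authors' exact VQ-VAE/WaveNet configuration; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z_e):                      # z_e: (batch, time, dim) encoder output
        flat = z_e.reshape(-1, z_e.size(-1))
        # squared Euclidean distance from each encoder vector to every codebook entry
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                    # the codebook indices form the bitstream
        z_q = self.codebook(idx).view_as(z_e)
        # codebook loss + commitment loss; straight-through estimator for the encoder
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, loss
```

A WaveNet-style decoder would then be conditioned on the quantized representation (plus any speaker or prosody conditioning) to synthesize the output waveform.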

96 citations


Journal ArticleDOI
TL;DR: Two postprocessing approaches applying convolutional neural networks either in the time domain or the cepstral domain to enhance the coded speech without any modification of the codecs are proposed.
Abstract: Enhancing coded speech suffering from far-end acoustic background noise, quantization noise, and potentially transmission errors is a challenging task. In this paper, we propose two postprocessing approaches applying convolutional neural networks either in the time domain or the cepstral domain to enhance the coded speech without any modification of the codecs. The time-domain approach follows an end-to-end fashion, whereas the cepstral-domain approach uses analysis–synthesis with cepstral-domain features. The proposed postprocessors in both domains are evaluated for various narrowband and wideband speech codecs in a wide range of conditions. The proposed postprocessor improves perceptual evaluation of speech quality (PESQ) by up to 0.25 mean opinion score listening quality objective (MOS-LQO) points for G.711, 0.30 points for G.726, 0.82 points for G.722, and 0.26 points for the adaptive multirate wideband codec. In a subjective comparison category rating listening test, the proposed postprocessor on G.711-coded speech exceeds the speech quality of an ITU-T-standardized postfilter by 0.36 CMOS points, and obtains a clear preference of 1.77 CMOS points compared to legacy G.711, even better than uncoded speech, with statistical significance. The source code for the cepstral-domain approach to enhance G.711-coded speech is available at https://github.com/ifnspaml/Enhancement-Coded-Speech.
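As a rough illustration of a time-domain postprocessor of this kind, the sketch below (assuming PyTorch; the layer counts and kernel sizes are arbitrary, not the topology from the paper) applies a small stack of 1-D convolutions to the decoded signal and adds the result back as a residual correction.

```python
import torch.nn as nn

class TimeDomainPostfilter(nn.Module):
    """Predicts a correction for the coding error and adds it to the coded signal."""
    def __init__(self, channels=32, kernel=9, layers=4):
        super().__init__()
        blocks, in_ch = [], 1
        for _ in range(layers):
            blocks += [nn.Conv1d(in_ch, channels, kernel, padding=kernel // 2), nn.ReLU()]
            in_ch = channels
        blocks.append(nn.Conv1d(channels, 1, kernel, padding=kernel // 2))
        self.net = nn.Sequential(*blocks)

    def forward(self, coded):                # coded: (batch, 1, samples) decoded speech
        return coded + self.net(coded)       # residual connection around the CNN
```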

65 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: Speech and text autoencoders that share encoders and decoders with an automatic speech recognition (ASR) model are introduced to improve ASR performance with large speech-only and text-only training datasets.
Abstract: We introduce speech and text autoencoders that share encoders and decoders with an automatic speech recognition (ASR) model to improve ASR performance with large speech-only and text-only training datasets. To build the speech and text autoencoders, we leverage state-of-the-art ASR and text-to-speech (TTS) encoder–decoder architectures. These autoencoders learn features from speech-only and text-only datasets by switching the encoders and decoders used in the ASR and TTS models. Simultaneously, they aim to encode features compatible with the ASR and TTS models through a multi-task loss. Additionally, we anticipate that joint TTS training can also improve ASR performance because both the ASR and TTS models learn transformations between speech and text. The experimental results we obtained with our semi-supervised end-to-end ASR/TTS training revealed reductions, relative to a model initially trained with a small paired subset of the LibriSpeech corpus, in the character error rate from 10.4% to 8.4% and in the word error rate from 20.6% to 18.0% after retraining the model with a large unpaired subset of the corpus.
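The essence of the approach is the sharing of encoders and decoders across four loss terms: supervised ASR and TTS on paired data, a speech autoencoder (ASR encoder feeding the TTS decoder) on speech-only data, and a text autoencoder (TTS encoder feeding the ASR decoder) on text-only data. The sketch below assumes hypothetical `asr_enc`, `asr_dec`, `tts_enc`, and `tts_dec` objects with `loss` methods and illustrates only how such a multi-task objective could be composed; it is not the authors' implementation.

```python
def semi_supervised_step(asr_enc, asr_dec, tts_enc, tts_dec,
                         paired, speech_only, text_only,
                         weights=(1.0, 1.0, 0.5, 0.5)):
    """Combine supervised ASR/TTS losses with the two autoencoder losses."""
    speech, text = paired
    loss_asr = asr_dec.loss(asr_enc(speech), text)              # supervised ASR (paired data)
    loss_tts = tts_dec.loss(tts_enc(text), speech)              # supervised TTS (paired data)
    loss_sae = tts_dec.loss(asr_enc(speech_only), speech_only)  # speech autoencoder (speech only)
    loss_tae = asr_dec.loss(tts_enc(text_only), text_only)      # text autoencoder (text only)
    w = weights
    return w[0] * loss_asr + w[1] * loss_tts + w[2] * loss_sae + w[3] * loss_tae
```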

42 citations


Journal ArticleDOI
TL;DR: An insight is given into several fields, covering speech production and auditory perception, cognitive aspects of speech communication and language understanding, both speech recognition and text-to-speech synthesis in more detail, and consequently the main directions in the development of spoken dialogue systems.
Abstract: Speech technologies have been developed for decades as a typical signal processing area, while the last decade has brought huge progress based on new machine learning paradigms. Owing not only to their intrinsic complexity but also to their relation with the cognitive sciences, speech technologies are now viewed as a prime example of an interdisciplinary knowledge area. This review article on speech signal analysis and processing, corresponding machine learning algorithms, and applied computational intelligence aims to give an insight into several fields, covering speech production and auditory perception, cognitive aspects of speech communication and language understanding, both speech recognition and text-to-speech synthesis in more detail, and consequently the main directions in the development of spoken dialogue systems. Additionally, the article discusses the concepts and recent advances in speech signal compression, coding, and transmission, including cognitive speech coding. To conclude, the main intention of this article is to highlight recent achievements and challenges based on new machine learning paradigms that, over the last decade, have had an immense impact on the field of speech signal processing.

40 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: The proposed E2E TTS systems can generate controllable foreign-accented speech at the character level using only a mixture of monolingual training data and are confirmed to be effective in terms of quality and speaker similarity of the generated speech.
Abstract: State-of-the-art text-to-speech (TTS) synthesis models can produce monolingual speech with high intelligibility and naturalness. However, when the models are applied to synthesize code-switched (CS) speech, the performance declines seriously. Conventionally, developing a CS TTS system requires multilingual data to incorporate language-specific and cross-lingual knowledge. Recently, end-to-end (E2E) architectures have achieved satisfactory results in monolingual TTS. The architecture enables training from one end, alphabetic text input, to the other end, acoustic feature output. In this paper, we explore the use of the E2E framework for CS TTS, using a combination of Mandarin and English monolingual speech corpora uttered by two female speakers. To handle alphabetic input from different languages, we explore two kinds of encoders: (1) a shared multilingual encoder with explicit language embedding (LDE); (2) a separated monolingual encoder (SPE) for each language. The two systems use identical decoder architectures, where a discriminative code is incorporated to enable the model to generate speech in one speaker's voice consistently. Experiments confirm the effectiveness of the proposed modifications to the E2E TTS framework in terms of quality and speaker similarity of the generated speech. Moreover, our proposed systems can generate controllable foreign-accented speech at the character level using only a mixture of monolingual training data.

33 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: In this paper, a cross-module residual learning (CMRL) pipeline is proposed as a module carrier with each module reconstructing the residual from its preceding modules, which shows better objective performance than AMR-WB and OPUS.
Abstract: Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier, with each module reconstructing the residual from its preceding modules. CMRL differs from other DNN-based speech codecs in that, rather than modeling the speech compression problem in a single large neural network, it optimizes a series of less-complicated modules in a two-phase training scheme. The proposed method shows better objective performance than AMR-WB and a state-of-the-art DNN-based speech codec with a similar network architecture. As an end-to-end model, it takes raw PCM signals as input, but it is also compatible with linear predictive coding (LPC), showing better subjective quality at high bitrates than AMR-WB and OPUS. The gain is achieved using only 0.9 million trainable parameters, a significantly less complex architecture than other DNN-based codecs in the literature.
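The cross-module residual idea itself is compact: each module codes whatever its predecessors failed to reconstruct, and the decoder sums the partial reconstructions. A minimal sketch is shown below, assuming each module is a trainable autoencoder-style object with hypothetical `encode`/`decode` methods; CMRL's two-phase training scheme is not reproduced here.

```python
def cmrl_encode(x, modules):
    """Each module codes the residual left over by the modules before it."""
    residual, codes = x, []
    for m in modules:
        code = m.encode(residual)            # hypothetical module interface
        codes.append(code)
        residual = residual - m.decode(code)
    return codes

def cmrl_decode(codes, modules):
    """The decoder simply sums the per-module reconstructions."""
    return sum(m.decode(c) for m, c in zip(modules, codes))
```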

33 citations


Journal ArticleDOI
01 Sep 2019
TL;DR: Electroencephalography is used to study how a mixture of two speech streams is represented in the brain as subjects attend to one stream or the other, and finds that low-frequency bands show greater speech-EEG coherence when the speech stream is attended compared to when it is ignored.
Abstract: The ability to selectively attend to speech in the presence of other competing talkers is critical for everyday communication; yet the neural mechanisms facilitating this process are poorly understood. Here, we use electroencephalography (EEG) to study how a mixture of two speech streams is represented in the brain as subjects attend to one stream or the other. To characterize the speech-EEG relationships and how they are modulated by attention, we estimate the statistical association between each canonical EEG frequency band (delta, theta, alpha, beta, low-gamma, and high-gamma) and the envelope of each of ten different frequency bands in the input speech. Consistent with previous literature, we find that low-frequency (delta and theta) bands show greater speech-EEG coherence when the speech stream is attended compared to when it is ignored. We also find that the envelope of the low-gamma band shows a similar attention effect, a result not previously reported with EEG. This is consistent with the prevailing theory that neural dynamics in the gamma range are important for attention-dependent routing of information in cortical circuits. In addition, we also find that the greatest attention-dependent increases in speech-EEG coherence are seen in the mid-frequency acoustic bands (0.5-3 kHz) of input speech and the temporal-parietal EEG sensors. Finally, we find individual differences in the following: (1) the specific set of speech-EEG associations that are the strongest, (2) the EEG and speech features that are the most informative about attentional focus, and (3) the overall magnitude of attentional enhancement of speech-EEG coherence.
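A band-wise speech-EEG coherence of the kind analyzed here can be estimated with standard tools. The sketch below, assuming NumPy/SciPy and that a single EEG channel and a speech-band envelope have already been extracted, aligned, and sampled at the same rate, band-limits the EEG and averages the magnitude-squared coherence within the band; it illustrates the measure only, not the authors' analysis pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt, coherence

def band_coherence(eeg, envelope, fs, band):
    """Mean magnitude-squared coherence between a band-limited EEG channel and a speech envelope."""
    lo, hi = band
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    eeg_band = filtfilt(b, a, eeg)
    f, cxy = coherence(eeg_band, envelope, fs=fs, nperseg=int(2 * fs))
    return cxy[(f >= lo) & (f <= hi)].mean()

# e.g. delta-band (1-4 Hz) coherence for attended vs. ignored speech envelopes:
# c_att = band_coherence(eeg_channel, env_attended, fs=128, band=(1, 4))
# c_ign = band_coherence(eeg_channel, env_ignored,  fs=128, band=(1, 4))
```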

31 citations


Journal ArticleDOI
TL;DR: It is found that the performance of different tasks on identical speech sounds leads to neural enhancement of the acoustic features in the stimuli that are critically relevant to task performance.
Abstract: Speech is the most important signal in our auditory environment, and the processing of speech is highly dependent on context. However, it is unknown how contextual demands influence the neural encoding of speech. Here, we examine the context dependence of auditory cortical mechanisms for speech encoding at the level of the representation of fundamental acoustic features (spectrotemporal modulations) using model-based functional magnetic resonance imaging. We found that the performance of different tasks on identical speech sounds leads to neural enhancement of the acoustic features in the stimuli that are critically relevant to task performance. These task effects were observed at the earliest stages of auditory cortical processing, in line with interactive accounts of speech processing. Our work provides important insights into the mechanisms that underlie the processing of contextually relevant acoustic information within our rich and dynamic auditory environment.

26 citations


Proceedings ArticleDOI
Janusz Klejsa, Per Hedelin, Cong Zhou, Roy M. Fejgin, Lars Villemoes
12 May 2019
TL;DR: A speech coding scheme employing a generative model based on SampleRNN that, while operating at significantly lower bitrates, matches or surpasses the perceptual quality of state-of-the-art classic wide-band codecs is provided.
Abstract: We provide a speech coding scheme employing a generative model based on SampleRNN that, while operating at significantly lower bitrates, matches or surpasses the perceptual quality of state-of-the-art classic wide-band codecs. Moreover, it is demonstrated that the proposed scheme can provide a meaningful rate-distortion trade-off without retraining. We evaluate the proposed scheme in a series of listening tests and discuss limitations of the approach.

25 citations


Journal ArticleDOI
TL;DR: A new end-to-end communication system is proposed to increase transmission speed, robustness, and security in order to meet the requirements of mobile systems, which handle an exponentially increasing amount of data over time.
Abstract: A new end-to-end communication system is proposed to increase transmission speed, robustness, and security in order to meet the requirements of mobile systems, which handle an exponentially increasing amount of data over time. The design relies on compressed sensing-based source coding instead of the speech coding standards supported in current mobile communication systems. The proposed compressed sensing source coding method reduces speech coding complexity by using simple quantisation and binary encoding, saves communication system resources, and encrypts communications without additional cost. The performance of the resulting communication system is evaluated for speech communication over a 10 dB Rayleigh channel in terms of perceptual evaluation of speech quality (PESQ) scores and the coherence speech intelligibility index (CSII) when convolutional coding, orthogonal frequency division multiplexing, and diversity schemes are used. The results show that, for a bit rate of 12.8 kbit/s, the proposed scheme achieves fair speech intelligibility (a CSII value of 0.5) and good output speech quality (a PESQ score of 3.33).
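A compressed sensing-based source coder of the general kind described above measures each speech frame with a random sensing matrix and then applies simple scalar quantization; because reconstruction requires knowledge of the sensing matrix, the matrix can also act as a shared secret. The following sketch, assuming NumPy and purely illustrative frame and measurement sizes, shows only the measure-then-quantize encoder; the sparse-recovery decoder is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 160, 64                                     # frame length and number of measurements
Phi = rng.standard_normal((M, N)) / np.sqrt(M)     # sensing matrix, shared by both ends

def cs_encode(frame, bits=4):
    """Compressive measurement followed by simple uniform scalar quantization."""
    y = Phi @ frame
    lo, hi = y.min(), y.max()
    step = (hi - lo) / (2 ** bits - 1) or 1.0
    q = np.round((y - lo) / step).astype(int)      # quantization indices to binary-encode
    return q, lo, step

# The decoder recovers the frame from (q, lo, step) with a sparse solver
# (e.g. OMP or basis pursuit over a DCT dictionary); omitted here.
```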

25 citations


Journal ArticleDOI
19 Sep 2019-Symmetry
TL;DR: A novel idea is to weave together three separate convolution layers: a traditional time convolution and two different frequency convolutions (a mel-frequency cepstral coefficient (MFCC) convolution and a spectrum convolution), which takes into account more of the detail contained in the analysed signal.
Abstract: This work presents a new approach to speech recognition based on the specific coding of the time and frequency characteristics of speech. Convolutional neural networks were chosen because they show high resistance to cross-spectral distortions and to differences in vocal tract length. Until now, two layers have been used: a time convolution and a frequency convolution. The novel idea is to weave together three separate convolution layers: a traditional time convolution and two different frequency convolutions (a mel-frequency cepstral coefficient (MFCC) convolution and a spectrum convolution). This approach takes into account more of the detail contained in the analysed signal. The idea is to create patterns for sounds in the form of RGB (red, green, blue) images. Experiments were carried out for isolated words and continuous speech with this neural network structure. A method for dividing continuous speech into syllables is also proposed. The method can be used for symmetrical stereo sound.
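One possible reading of the three-channel "RGB pattern" idea is sketched below, assuming librosa and SciPy are available and that the image size and frame length are arbitrary choices: a stacked-waveform channel, an MFCC channel, and a magnitude-spectrum channel are each resized to a common image and stacked as three colour planes. This is only one plausible construction, not the paper's exact feature extraction.

```python
import numpy as np
import librosa
from scipy.ndimage import zoom

def rgb_pattern(y, sr, size=(64, 64), frame=256):
    """Stack time, MFCC and spectrum channels into one (H, W, 3) image-like pattern."""
    def fit(img):
        img = np.asarray(img, dtype=float)
        return zoom(img, (size[0] / img.shape[0], size[1] / img.shape[1]), order=1)

    n = (len(y) // frame) * frame
    time_img = fit(y[:n].reshape(-1, frame))                     # stacked waveform frames
    mfcc_img = fit(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20))  # MFCC channel
    spec_img = fit(np.abs(librosa.stft(y, n_fft=512)))           # spectrum channel
    chans = [(c - c.min()) / (np.ptp(c) + 1e-9)
             for c in (time_img, mfcc_img, spec_img)]
    return np.stack(chans, axis=-1)
```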

Journal ArticleDOI
TL;DR: Electrocortical recordings in participants who maintain sentences in memory identify the phase of left frontotemporal beta oscillations as the most prominent information carrier of sentence identity, providing evidence for a theoretical model of speech memory representations and explaining why interfering with beta oscillations in the left inferior frontal cortex diminishes verbal working memory capacity.
Abstract: The way the human brain represents speech in memory is still unknown. An obvious characteristic of speech is its evolution over time. During speech processing, neural oscillations are modulated by the temporal properties of the acoustic speech signal, but acquired knowledge of the temporal structure of language also influences speech perception-related brain activity. This suggests that speech could be represented in the temporal domain, a form of representation that the brain also uses to encode autobiographic memories. Empirical evidence for such a memory code is lacking. We investigated the nature of speech memory representations using direct cortical recordings in the left perisylvian cortex during delayed sentence reproduction in female and male patients undergoing awake tumor surgery. Our results reveal that the brain endogenously represents speech in the temporal domain. Temporal pattern similarity analyses revealed that the phase of frontotemporal low-frequency oscillations, primarily in the beta range, represents sentence identity in working memory. The positive relationship between beta power during working memory and task performance suggests that working memory representations benefit from increased phase separation. SIGNIFICANCE STATEMENT: Memory is an endogenous source of information based on experience. While neural oscillations encode autobiographic memories in the temporal domain, little is known about their contribution to memory representations of human speech. Our electrocortical recordings in participants who maintain sentences in memory identify the phase of left frontotemporal beta oscillations as the most prominent information carrier of sentence identity. These observations provide evidence for a theoretical model of speech memory representations and explain why interfering with beta oscillations in the left inferior frontal cortex diminishes verbal working memory capacity. The lack of sentence identity coding at the syllabic rate suggests that sentences are represented in memory in a more abstract form than during speech perception and production.

Journal ArticleDOI
TL;DR: A universal codebook-based speech enhancement framework that relies on a single codebook to encode both speech and noise components, and shows that the proposed ITF-based ASPP approach achieves a good balance in the trade-off between binaural noise reduction and binaural cue preservation.
Abstract: In this work, we present a universal codebook-based speech enhancement framework that relies on a single codebook to encode both speech and noise components. The atomic speech presence probability (ASPP) is defined as the probability that a given codebook atom encodes speech at a given point in time. We develop ASPP estimators based on binaural cues including the interaural phase and level difference (IPD and ILD), the interaural coherence magnitude (ICM), as well as a combined version leveraging the full interaural transfer function (ITF). We evaluate the performance of the resulting ASPP-based speech enhancement algorithms on binaural mixtures of reverberant speech and real-world noise. The proposed approach improves both objective speech quality and intelligibility over a wide range of input SNR, as measured with PESQ and binaural STOI metrics, outperforming two binaural speech enhancement benchmark methods. We show that the proposed ITF-based ASPP approach achieves a good balance of the trade-off between binaural noise reduction and binaural cue preservation.
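The binaural cues that drive the ASPP estimators can be computed directly from the two ear-channel STFTs. The sketch below, assuming NumPy and complex STFT matrices `L` and `R` of shape (frequency, frames), computes IPD, ILD, and an interaural coherence magnitude; the codebook and ASPP estimation itself is not shown.

```python
import numpy as np

def binaural_cues(L, R, eps=1e-12):
    """IPD, ILD and interaural coherence magnitude from left/right complex STFTs."""
    ipd = np.angle(L * np.conj(R))                                   # phase difference
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))       # level difference (dB)
    num = np.abs(np.mean(L * np.conj(R), axis=1))
    den = np.sqrt(np.mean(np.abs(L) ** 2, axis=1)
                  * np.mean(np.abs(R) ** 2, axis=1)) + eps
    icm = num / den                                                  # coherence magnitude per bin
    return ipd, ild, icm
```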


Proceedings ArticleDOI
01 Oct 2019
TL;DR: In this paper, instead of applying the mean squared error (MSE) as the loss function during DNN training for speech enhancement, a perceptual weighting filter loss motivated by the weighting filters as employed in analysis-by-synthesis speech coding was proposed.
Abstract: Single-channel speech enhancement with deep neural networks (DNNs) has shown promising performance and is thus intensively being studied. In this paper, instead of applying the mean squared error (MSE) as the loss function during DNN training for speech enhancement, we design a perceptual weighting filter loss motivated by the weighting filter as it is employed in analysis-by-synthesis speech coding, e.g., in code-excited linear prediction (CELP). The experimental results show that the proposed simple loss function improves the speech enhancement performance compared to a reference DNN with MSE loss in terms of perceptual quality and noise attenuation. The proposed loss function can be advantageously applied to an existing DNN-based speech enhancement system, without modification of the DNN topology for speech enhancement.
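In CELP, the perceptual weighting filter has the form W(z) = A(z/γ1)/A(z/γ2), where A(z) is the LPC polynomial of the clean frame and γ1 > γ2 control how strongly the formant regions are de-emphasized. The sketch below, assuming NumPy/SciPy and illustrative γ values, filters the enhancement error through such a filter before taking the mean squared value, as a stand-in for the kind of weighted loss described above (not the authors' exact training loss).

```python
import numpy as np
from scipy.signal import lfilter

def weighted_mse(error, lpc, g1=0.92, g2=0.6):
    """MSE of the error signal after filtering through W(z) = A(z/g1) / A(z/g2)."""
    k = np.arange(len(lpc))
    num = lpc * g1 ** k        # A(z/g1)
    den = lpc * g2 ** k        # A(z/g2)
    weighted = lfilter(num, den, error)
    return np.mean(weighted ** 2)

# `lpc` is [1, a1, ..., aP] from an LPC analysis of the clean reference frame;
# `error` is the time-domain difference between enhanced and clean frames.
```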

Journal ArticleDOI
TL;DR: An AMR adaptive steganographic scheme that selects the embedding position adaptively within the unvoiced speech segments, as determined by the distribution characteristics of adjacent pitch delays, and embeds the secret message by modifying the pitch delay without destroying the short-time stability of the pitch delay.
Abstract: In this paper, a novel AMR adaptive steganographic scheme based on the pitch delay of unvoiced speech (PDU-AAS) is proposed. Existing AMR steganographic schemes based on pitch delay destroy the short-time relative stability of the pitch delay in voiced speech segments, making them easier to detect with existing steganalysis methods. The pitch delay distributions of AMR voiced and unvoiced speech segments are analyzed in detail, and, based on the characteristic that the pitch delay sequence of AMR unvoiced speech does not exhibit short-term relative stability, we propose an adaptive scheme that selects the embedding position within unvoiced speech segments according to the distribution characteristics of adjacent pitch delays and embeds the secret message by modifying the pitch delay without destroying its short-time stability. The experimental results show that the scheme has good concealment and hiding capacity. Most importantly, comparative experiments show that the scheme is secure against detection by existing steganalysis algorithms. The principle of the scheme can be applied to other steganographic schemes based on the pitch delay of speech codecs, such as G.723.1 and G.729.

Journal ArticleDOI
TL;DR: A new framework based on i-vector/PLDA and maximum entropy (ME) is proposed to improve the performance of a speaker identification system in the presence of speech coding distortion, and it is shown that the proposed method achieves improved performance compared with the i-vector/PLDA and MEGMM baselines.
Abstract: The system combining i-vectors and probabilistic linear discriminant analysis (PLDA) has been applied with great success to the speaker recognition task. The i-vector space gives a low-dimensional representation of a speech segment and provides the training data for a PLDA model, which offers greater robustness under different conditions. In this paper, we propose a new framework based on i-vector/PLDA and maximum entropy (ME) to improve the performance of a speaker identification system in the presence of speech coding distortion. Results are reported on the TIMIT database and on coded speech obtained by passing the TIMIT test utterances through the AMR encoder/decoder. Our results show that the proposed method achieves improved performance when compared with the i-vector/PLDA and MEGMM baselines.

Journal ArticleDOI
Yanzhen Ren, Hanyi Yang, Hongxia Wu, Weiping Tu, Lina Wang
TL;DR: This paper presents a steganographic scheme in the AMR fixed codebook (FCB) domain based on the pulse distribution model (PDM-AFS), which is obtained from the distribution characteristics of the FCB value in the cover audio.
Abstract: Adaptive multi-rate (AMR), a popular audio compression standard, is widely used in mobile communication and mobile Internet applications and has become a novel carrier for hiding information. To improve the statistical security, this paper presents a steganographic scheme in the AMR fixed codebook (FCB) domain based on the pulse distribution model (PDM-AFS), which is obtained from the distribution characteristics of the FCB value in the cover audio. The pulse positions in stego audio are controlled by message encoding and random masking to make the statistical distribution of the FCB parameters close to that of the cover audio. The experimental results show that the statistical security of the proposed scheme is better than that of the existing schemes. Furthermore, the hiding capacity is maintained compared with the existing schemes. The average hiding capacity can reach 2.06 kbps at an audio compression rate of 12.2 kbps, and the auditory concealment is good. To the best of our knowledge, this is the first secure AMR FCB steganographic scheme that improves the statistical security based on the distribution model of the cover audio. This scheme can be extended to other audio compression codecs under the principle of algebraic code excited linear prediction (ACELP), such as G.723.1 and G.729.

Proceedings ArticleDOI
12 May 2019
TL;DR: This work introduces a method to improve the quality of simple scalar quantization in the context of acoustic sensor networks by combining ideas from sparse reconstruction, artificial neural networks and weighting filters.
Abstract: We introduce a method to improve the quality of simple scalar quantization in the context of acoustic sensor networks by combining ideas from sparse reconstruction, artificial neural networks and weighting filters. We start from the observation that optimization methods based on sparse reconstruction resemble the structure of a neural network. Hence, building upon a successful enhancement method, we unroll the algorithms and use this to build a neural network which we train to obtain enhanced decoding. In addition, the weighting filter from code-excited linear predictive (CELP) speech coding is integrated into the loss function of the neural network, achieving perceptually improved reconstructed speech. Our experiments show that our proposed trained methods allow for better speech reconstruction than the reference optimization methods.
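Unrolling a sparse-reconstruction algorithm into a trainable network is most easily pictured with a LISTA-style unrolled ISTA: each iteration becomes a layer with learned matrices and a learned soft-threshold. The sketch below, assuming PyTorch, shows that structure only; it is in the spirit of the unrolling described above, not the authors' architecture, and it omits the CELP weighting-filter loss.

```python
import torch
import torch.nn as nn

class UnrolledISTA(nn.Module):
    """LISTA-style network: each ISTA iteration becomes a layer with learned weights."""
    def __init__(self, meas_dim, sig_dim, steps=8):
        super().__init__()
        self.W = nn.Linear(meas_dim, sig_dim, bias=False)     # learned back-projection
        self.S = nn.Linear(sig_dim, sig_dim, bias=False)      # learned state update
        self.theta = nn.Parameter(torch.full((steps,), 0.1))  # per-layer soft thresholds
        self.steps = steps

    def forward(self, y):                  # y: (batch, meas_dim) quantized measurements
        x = torch.zeros(y.size(0), self.S.in_features, device=y.device)
        for k in range(self.steps):
            z = self.W(y) + self.S(x)
            x = torch.sign(z) * torch.clamp(z.abs() - self.theta[k], min=0)  # soft threshold
        return x
```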

Proceedings ArticleDOI
07 May 2019
TL;DR: A WaveNet-based delay-free adaptive differential pulse code modulation (ADPCM) speech coding system is proposed, in which WaveNet, a state-of-the-art model for neural-network-based speech waveform synthesis, serves as the adaptive predictor; the system improves speech quality and outperforms the conventional ADPCM system based on ITU-T Recommendation G.726.
Abstract: This paper proposes a WaveNet-based delay-free adaptive differential pulse code modulation (ADPCM) speech coding system. The WaveNet generative model, which is a state-of-the-art model for neural-network-based speech waveform synthesis, is used as the adaptive predictor in ADPCM. To further improve speech quality, mel-cepstrum-based noise shaping and postfiltering were integrated with the proposed ADPCM system. Both objective and subjective evaluation results indicate that the proposed ADPCM system outperformed not only the conventional ADPCM system based on ITU-T Recommendation G.726 but also the ADPCM system based on adaptive mel-cepstral analysis.
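Stripped of the neural predictor, the delay-free ADPCM loop is simple: predict the next sample from already-reconstructed samples, quantize the prediction residual, and update the history with the reconstruction so that encoder and decoder stay in sync. The sketch below assumes NumPy and a generic `predictor(history)` callable standing in for the WaveNet model; the step size is fixed rather than adaptive, and the noise shaping and postfiltering from the paper are omitted.

```python
import numpy as np

def adpcm_encode(x, predictor, step=0.05, context=256):
    """Delay-free ADPCM loop with a generic sample predictor and a fixed-step quantizer."""
    history, codes = np.zeros(context), []
    for sample in x:
        pred = predictor(history)                      # adaptive prediction (e.g. a WaveNet)
        code = int(np.round((sample - pred) / step))   # quantize the prediction residual
        codes.append(code)
        recon = pred + code * step                     # what the decoder will reconstruct
        history = np.append(history[1:], recon)        # predict from reconstructed samples
    return codes
```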

Journal ArticleDOI
01 May 2019-Heliyon
TL;DR: A novel scalable speech coding scheme based on compressive sensing, which can operate at bit rates from 3.275 to 7.275 kbps, is designed and implemented; it offers listening quality for the reconstructed speech similar to that of the Adaptive Multi-Rate Narrowband (AMR-NB) codec at 6.7 kbps and the Enhanced Voice Services (EVS) codec at 7.2 kbps.

Journal ArticleDOI
TL;DR: The LB-ABE in this paper is employed together with a preexisting ABE toward high-frequency components to obtain spectrally balanced speech signals and the gap of speech quality between wideband and NB speech was significantly reduced when employing the proposed ABE toward low frequencies.
Abstract: Conventional narrowband (NB) telephony suffers from limited acoustic bandwidth at the receiver side, leading to degraded speech quality and intelligibility. In this paper, artificial speech bandwidth extension (ABE) of NB speech toward missing frequencies below about 300 Hz (low-frequency band, LB) is proposed to enhance the speech quality. The LB-ABE in this paper is employed together with a preexisting ABE toward high-frequency components to obtain spectrally balanced speech signals. In an instrumental quality assessment, the spectral distance in the LB was improved by more than 5 dB compared to NB speech. In a subjective listening test, the gap of speech quality between wideband and NB speech was significantly reduced when employing the proposed ABE toward low frequencies. The LB extension was found to further improve the preexisting ABE toward higher frequencies by a significant 0.26 CMOS points.


Proceedings ArticleDOI
01 Sep 2019
TL;DR: The RLP-based TV-CAR speech analysis is proposed and evaluated with the F0 estimation of speech using IRAPT (Instantaneous RAPT) with Keele Pitch Database under noisy conditions.
Abstract: Linear prediction (LP) analysis estimates AR (auto-regressive) coefficients that represent an all-pole spectrum, and has recently been applied in speech synthesis besides speech coding. We have previously proposed $l_2$-norm optimization-based TV-CAR (time-varying complex AR) speech analysis for the analytic signal, using the MMSE (minimum mean square error) or ELS (extended least squares) method, and have applied it to speech processing tasks such as robust ASR and F0 estimation. On the other hand, B. Kleijn et al. have proposed the regularized linear prediction (RLP) method to suppress the pitch-related bias that causes an overestimation of the first formant. In RLP, an $l_2$-norm regularization term, the norm of the spectral change across frequency, is introduced to suppress rapid spectral changes; the parameters are estimated by minimizing the $l_2$-norm criterion plus this regularization penalty. In this paper, RLP-based TV-CAR speech analysis is proposed and evaluated on F0 estimation of speech using IRAPT (instantaneous RAPT) with the Keele Pitch Database under noisy conditions.
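As a rough numerical illustration of regularized LP, the sketch below (assuming NumPy) solves an $l_2$-regularized least-squares problem for the predictor coefficients. Note that the penalty here is a plain Tikhonov term on the coefficients for simplicity, whereas the RLP penalty described above measures spectral change across frequency, so this is a simplified stand-in rather than the RLP estimator itself.

```python
import numpy as np

def regularized_lp(x, order=10, lam=1e-2):
    """Covariance-method LP with an l2 (Tikhonov) penalty on the predictor coefficients."""
    # predict x[n] from its `order` past samples, for n = order .. len(x)-1
    X = np.column_stack([x[order - k - 1: len(x) - k - 1] for k in range(order)])
    target = x[order:]
    a = np.linalg.solve(X.T @ X + lam * np.eye(order), X.T @ target)
    return np.concatenate(([1.0], -a))     # A(z) = 1 - sum_k a_k z^{-k}
```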

Proceedings ArticleDOI
15 Sep 2019
TL;DR: A new type of spectral feature that is both compact and interpolable, and thus ideally suited for regression approaches that involve averaging, is proposed by means of a speaker-independent variational autoencoder (VAE), which learns a latent space based on the low-dimensional manifold of high-resolution speech spectra.
Abstract: We propose a new type of spectral feature that is both compact and interpolable, and thus ideally suited for regression approaches that involve averaging. The feature is realized by means of a speaker-independent variational autoencoder (VAE), which learns a latent space based on the low-dimensional manifold of high-resolution speech spectra. In vocoding experiments, we show that using a 12-dimensional VAE feature (VAE-12) results in significantly better perceived speech quality compared to a 12-dimensional MCEP feature. In voice conversion experiments, using VAE-12 results in significantly better perceived speech quality than 40-dimensional MCEPs, with similar speaker accuracy. In habitual-to-clear style conversion experiments, we significantly improve the speech intelligibility for one of three speakers, using a custom skip-connection deep neural network, with the average keyword recall accuracy increasing from 24% to 46%.

Journal ArticleDOI
TL;DR: A novel framework for end-to-end secure voice communication over GSM networks using the AES-256 encryption algorithm is presented, believed to be the first solution that uses a single codebook for the transmission of secure voice.
Abstract: The global system for mobile communication (GSM) is a widely used digital mobile service around the world. Although GSM was designed as a secure wireless system, it is now vulnerable to different targeted attacks. There is a need to address security domains, especially the confidentiality of communication. This paper presents a novel framework for end-to-end secure voice communication over GSM networks using the AES-256 encryption algorithm. A special modem and speech coding technique are designed to enable the transmission of encrypted speech over the GSM voice channel. To the best of our knowledge, this is the first solution that uses a single codebook for the transmission of secure voice. An efficient low-bit-rate (1.9 kbps) speech coder is also designed for use with the proposed modulation scheme for optimal results. Speech characteristics such as pitch, energy, and line spectral frequencies are extracted and preserved before compression and encryption of the speech. Previously, the best achieved data rate was 1.6 kbps with three codebooks, whilst the proposed approach achieves 2 kbps with a 0% bit error rate. The empirical results show that the methodology can be used in real-time applications to transmit encrypted voice over the GSM network.

Journal ArticleDOI
TL;DR: This work has dealt with the challenge of embedding a secret speech into a cover speech coded by a very low bit rate speech coder while maintaining a reasonable level of speech quality and was able to create hidden channels with maximum steganographic bandwidths up to 266.64 bit/s.
Abstract: In this study, the authors present a new steganographic technique called random least significant bits of pitch and Fourier magnitude steganography (RLPFS). It is based on hiding secret speech coded by the mixed excitation linear prediction (MELP) speech coder in a speech bitstream (the cover signal), which is also encoded by the MELP coder. First, RLPFS embeds the hidden speech using one of the following modes: pitch-based steganography, Fourier-magnitude-based steganography, or both; the mode is selected randomly. Second, during transmission, the stego speech, the mode number, and the number of embedded bits are transmitted either through a covert channel created in the transmission protocol or through the cover speech. In this work, the authors address the challenge of embedding secret speech into cover speech coded by a very low bit rate speech coder while maintaining a reasonable level of speech quality. They show that RLPFS can create hidden channels with maximum steganographic bandwidths of up to 266.64 bit/s at the cost of a steganographic noise of between 0.031 and 0.62 mean opinion score points. The study also considers the security of the parameters, the synchronisation of the receiver to deal with packet loss during transmission, and the resistance of the proposed method against steganalysis.

Journal ArticleDOI
TL;DR: A novel two-digit ADM with six-level quantization using variable-length coding is presented for encoding time-varying signals modelled by a Laplacian distribution, indicating that the proposed configuration outperforms baseline ADM algorithms, including constant factor delta modulation (CFDM) and continuously variable slope delta modulation (CVSDM), and operates in a much wider dynamic range.
Abstract: Delta modulation (DM) is a simple waveform coding algorithm used mostly when timely data delivery is more important than the transmitted data quality. While the implementation of DM is fairly simple and inexpensive, it suffers from several limitations, such as slope overload and granular noise, which can be overcome using adaptive delta modulation (ADM). This paper presents a novel two-digit ADM with six-level quantization using variable-length coding for encoding time-varying signals modelled by a Laplacian distribution. Two variants of the quantizer are employed: a distortion-constrained quantizer, which is optimally designed for minimal mean-squared error (MSE), and a rate-constrained quantizer, which is suboptimal in the minimal-MSE sense but enables minimal loss in SQNR for the target bit rate. Experimental results using real speech signals are provided, indicating that the proposed configuration outperforms the baseline ADM algorithms, including constant factor delta modulation (CFDM) and continuously variable slope delta modulation (CVSDM), and operates in a much wider dynamic range.
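For orientation, a basic adaptive delta modulator transmits one bit per sample and grows or shrinks its step size depending on whether consecutive bits agree (slope overload) or alternate (granular noise). The sketch below, assuming NumPy and illustrative adaptation factors, shows a CFDM-style encoder; it does not reproduce the paper's two-digit, six-level, variable-length design.

```python
import numpy as np

def adm_encode(x, step0=0.01, up=1.5, down=0.66):
    """One-bit-per-sample adaptive delta modulation with constant-factor step adaptation."""
    est, step, bits = 0.0, step0, []
    for sample in x:
        bit = 1 if sample >= est else 0
        bits.append(bit)
        est += step if bit else -step
        if len(bits) > 1:
            # equal consecutive bits suggest slope overload -> grow the step,
            # alternating bits suggest granular noise -> shrink it
            step = step * up if bits[-1] == bits[-2] else step * down
    return bits
```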

Journal ArticleDOI
TL;DR: It is demonstrated that the final bottleneck layer of the CNN encoder provides an abstract and compact representation of the speech signal that is sufficient to reconstruct the original speech signal in high quality using the CNN decoder.
Abstract: This paper proposes a convolutional neural network (CNN)-based encoder model to compress and code speech signal directly from raw input speech. Although the model can synthesize wideband speech by implicit bandwidth extension, narrowband is preferred for IP telephony and telecommunications purposes. The model takes time domain speech samples as inputs and encodes them using a cascade of convolutional filters in multiple layers, where pooling is applied after some layers to downsample the encoded speech by half. The final bottleneck layer of the CNN encoder provides an abstract and compact representation of the speech signal. In this paper, it is demonstrated that this compact representation is sufficient to reconstruct the original speech signal in high quality using the CNN decoder. This paper also discusses the theoretical background of why and how CNN may be used for end-to-end speech compression and coding. The complexity, delay, memory requirements, and bit rate versus quality are discussed in the experimental results.
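The encoder structure described above, a cascade of convolutions with pooling that halves the time resolution at each stage and ends in a narrow bottleneck, can be sketched as follows, assuming PyTorch and arbitrary layer sizes; the matching decoder would mirror it with upsampling. This is an illustration of the general architecture, not the paper's exact model.

```python
import torch.nn as nn

def cnn_speech_encoder(stages=4, channels=32, bottleneck=16, kernel=9):
    """Cascade of 1-D convolutions; pooling halves the time resolution at every stage."""
    layers, in_ch = [], 1
    for _ in range(stages):
        layers += [nn.Conv1d(in_ch, channels, kernel, padding=kernel // 2),
                   nn.ReLU(),
                   nn.MaxPool1d(2)]
        in_ch = channels
    layers.append(nn.Conv1d(channels, bottleneck, 1))   # compact bottleneck representation
    return nn.Sequential(*layers)
```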