
Showing papers on "Speech coding" published in 2019


Proceedings ArticleDOI
12 May 2019
TL;DR: This work demonstrates that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality.
Abstract: In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality. A prosody-transparent and speaker-independent model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits perceptual quality which is around halfway between the MELP codec at 2.4 kbps and AMR-WB codec at 23.05 kbps. In addition, when training on high-quality recorded speech with the test speaker included in the training set, a model coding speech at 1.6 kbps produces output of similar perceptual quality to that generated by AMR-WB at 23.05 kbps.
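For readers unfamiliar with the VQ-VAE bottleneck at the heart of such codecs, the following is a minimal sketch, assuming PyTorch, of a vector-quantization layer: the encoder output is snapped to the nearest codebook entry, the codebook indices form the transmitted bitstream, and a straight-through estimator passes gradients back to the encoder. It illustrates the general technique only, not the authors' exact VQ-VAE/WaveNet configuration; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z_e):                      # z_e: (batch, time, dim) encoder output
        flat = z_e.reshape(-1, z_e.size(-1))
        # squared Euclidean distance from each encoder vector to every codebook entry
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                    # the codebook indices form the bitstream
        z_q = self.codebook(idx).view_as(z_e)
        # codebook loss + commitment loss; straight-through estimator for the encoder
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, loss
```

A WaveNet-style decoder would then be conditioned on the quantized representation (plus any speaker or prosody conditioning) to synthesize the output waveform.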

96 citations


Journal ArticleDOI
TL;DR: Two postprocessing approaches applying convolutional neural networks either in the time domain or the cepstral domain to enhance the coded speech without any modification of the codecs are proposed.
Abstract: Enhancing coded speech suffering from far-end acoustic background noise, quantization noise, and potentially transmission errors is a challenging task. In this paper, we propose two postprocessing approaches applying convolutional neural networks either in the time domain or the cepstral domain to enhance the coded speech without any modification of the codecs. The time-domain approach follows an end-to-end fashion, whereas the cepstral-domain approach uses analysis–synthesis with cepstral-domain features. The proposed postprocessors in both domains are evaluated for various narrowband and wideband speech codecs in a wide range of conditions. The proposed postprocessor improves perceptual evaluation of speech quality (PESQ) by up to 0.25 mean opinion score listening quality objective (MOS-LQO) points for G.711, 0.30 points for G.726, 0.82 points for G.722, and 0.26 points for the adaptive multirate wideband codec. In a subjective comparison category rating listening test, the proposed postprocessor on G.711-coded speech exceeds the speech quality of an ITU-T-standardized postfilter by 0.36 CMOS points, and obtains a clear preference of 1.77 CMOS points compared to legacy G.711, even better than uncoded speech, with statistical significance. The source code for the cepstral-domain approach to enhance G.711-coded speech is available at https://github.com/ifnspaml/Enhancement-Coded-Speech.
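As a rough illustration of a time-domain postprocessor of this kind, the sketch below (assuming PyTorch; the layer counts and kernel sizes are arbitrary, not the topology from the paper) applies a small stack of 1-D convolutions to the decoded signal and adds the result back as a residual correction.

```python
import torch.nn as nn

class TimeDomainPostfilter(nn.Module):
    """Predicts a correction for the coding error and adds it to the coded signal."""
    def __init__(self, channels=32, kernel=9, layers=4):
        super().__init__()
        blocks, in_ch = [], 1
        for _ in range(layers):
            blocks += [nn.Conv1d(in_ch, channels, kernel, padding=kernel // 2), nn.ReLU()]
            in_ch = channels
        blocks.append(nn.Conv1d(channels, 1, kernel, padding=kernel // 2))
        self.net = nn.Sequential(*blocks)

    def forward(self, coded):                # coded: (batch, 1, samples) decoded speech
        return coded + self.net(coded)       # residual connection around the CNN
```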

65 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: Speech and text autoencoders that share encoders and decoders with an automatic speech recognition (ASR) model are introduced to improve ASR performance with large speech-only and text-only training datasets.
Abstract: We introduce speech and text autoencoders that share encoders and decoders with an automatic speech recognition (ASR) model to improve ASR performance with large speech-only and text-only training datasets. To build the speech and text autoencoders, we leverage state-of-the-art ASR and text-to-speech (TTS) encoder–decoder architectures. These autoencoders learn features from speech-only and text-only datasets by switching the encoders and decoders used in the ASR and TTS models. Simultaneously, they aim to encode features compatible with the ASR and TTS models through a multi-task loss. Additionally, we anticipate that joint TTS training can also improve ASR performance because both the ASR and TTS models learn transformations between speech and text. The experimental results we obtained with our semi-supervised end-to-end ASR/TTS training revealed reductions, relative to a model initially trained with a small paired subset of the LibriSpeech corpus, in the character error rate from 10.4% to 8.4% and in the word error rate from 20.6% to 18.0% after retraining the model with a large unpaired subset of the corpus.
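The essence of the approach is the sharing of encoders and decoders across four loss terms: supervised ASR and TTS on paired data, a speech autoencoder (ASR encoder feeding the TTS decoder) on speech-only data, and a text autoencoder (TTS encoder feeding the ASR decoder) on text-only data. The sketch below assumes hypothetical `asr_enc`, `asr_dec`, `tts_enc`, and `tts_dec` objects with `loss` methods and illustrates only how such a multi-task objective could be composed; it is not the authors' implementation.

```python
def semi_supervised_step(asr_enc, asr_dec, tts_enc, tts_dec,
                         paired, speech_only, text_only,
                         weights=(1.0, 1.0, 0.5, 0.5)):
    """Combine supervised ASR/TTS losses with the two autoencoder losses."""
    speech, text = paired
    loss_asr = asr_dec.loss(asr_enc(speech), text)              # supervised ASR (paired data)
    loss_tts = tts_dec.loss(tts_enc(text), speech)              # supervised TTS (paired data)
    loss_sae = tts_dec.loss(asr_enc(speech_only), speech_only)  # speech autoencoder (speech only)
    loss_tae = asr_dec.loss(tts_enc(text_only), text_only)      # text autoencoder (text only)
    w = weights
    return w[0] * loss_asr + w[1] * loss_tts + w[2] * loss_sae + w[3] * loss_tae
```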

42 citations


Journal ArticleDOI
TL;DR: An insight is given into several fields, covering speech production and auditory perception, cognitive aspects of speech communication and language understanding, both speech recognition and text-to-speech synthesis in more detail, and consequently the main directions in the development of spoken dialogue systems.
Abstract: Speech technologies have been developed for decades as a typical signal processing area, while the last decade has brought huge progress based on new machine learning paradigms. Owing not only to their intrinsic complexity but also to their relation with the cognitive sciences, speech technologies are now viewed as a prime example of an interdisciplinary knowledge area. This review article on speech signal analysis and processing, corresponding machine learning algorithms, and applied computational intelligence aims to give an insight into several fields, covering speech production and auditory perception, cognitive aspects of speech communication and language understanding, both speech recognition and text-to-speech synthesis in more detail, and consequently the main directions in the development of spoken dialogue systems. Additionally, the article discusses the concepts and recent advances in speech signal compression, coding, and transmission, including cognitive speech coding. To conclude, the main intention of this article is to highlight recent achievements and challenges based on new machine learning paradigms that, over the last decade, have had an immense impact on the field of speech signal processing.

40 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: The proposed E2E TTS systems can generate controllable foreign-accented speech at the character level using only a mixture of monolingual training data and are confirmed to be effective in terms of quality and speaker similarity of the generated speech.
Abstract: State-of-the-art text-to-speech (TTS) synthesis models can produce monolingual speech with high intelligibility and naturalness. However, when the models are applied to synthesize code-switched (CS) speech, the performance declines seriously. Conventionally, developing a CS TTS system requires multilingual data to incorporate language-specific and cross-lingual knowledge. Recently, end-to-end (E2E) architectures have achieved satisfactory results in monolingual TTS. The architecture enables training from one end, alphabetic text input, to the other end, acoustic feature output. In this paper, we explore the use of the E2E framework for CS TTS, using a combination of Mandarin and English monolingual speech corpora uttered by two female speakers. To handle alphabetic input from different languages, we explore two kinds of encoders: (1) a shared multilingual encoder with explicit language embedding (LDE); (2) a separated monolingual encoder (SPE) for each language. The two systems use identical decoder architectures, where a discriminative code is incorporated to enable the model to generate speech in one speaker's voice consistently. Experiments confirm the effectiveness of the proposed modifications to the E2E TTS framework in terms of quality and speaker similarity of the generated speech. Moreover, our proposed systems can generate controllable foreign-accented speech at the character level using only a mixture of monolingual training data.

33 citations


Proceedings ArticleDOI
15 Sep 2019
TL;DR: In this paper, a cross-module residual learning (CMRL) pipeline is proposed as a module carrier with each module reconstructing the residual from its preceding modules, which shows better objective performance than AMR-WB and OPUS.
Abstract: Speech codecs learn compact representations of speech signals to facilitate data transmission. Many recent deep neural network (DNN) based end-to-end speech codecs achieve low bitrates and high perceptual quality at the cost of model complexity. We propose a cross-module residual learning (CMRL) pipeline as a module carrier, with each module reconstructing the residual from its preceding modules. CMRL differs from other DNN-based speech codecs in that, rather than modeling the speech compression problem in a single large neural network, it optimizes a series of less-complicated modules in a two-phase training scheme. The proposed method shows better objective performance than AMR-WB and a state-of-the-art DNN-based speech codec with a similar network architecture. As an end-to-end model, it takes raw PCM signals as input, but it is also compatible with linear predictive coding (LPC), showing better subjective quality at high bitrates than AMR-WB and OPUS. The gain is achieved using only 0.9 million trainable parameters, a significantly less complex architecture than other DNN-based codecs in the literature.
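The cross-module residual idea itself is compact: each module codes whatever its predecessors failed to reconstruct, and the decoder sums the partial reconstructions. A minimal sketch is shown below, assuming each module is a trainable autoencoder-style object with hypothetical `encode`/`decode` methods; CMRL's two-phase training scheme is not reproduced here.

```python
def cmrl_encode(x, modules):
    """Each module codes the residual left over by the modules before it."""
    residual, codes = x, []
    for m in modules:
        code = m.encode(residual)            # hypothetical module interface
        codes.append(code)
        residual = residual - m.decode(code)
    return codes

def cmrl_decode(codes, modules):
    """The decoder simply sums the per-module reconstructions."""
    return sum(m.decode(c) for m, c in zip(modules, codes))
```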

33 citations


Journal ArticleDOI
01 Sep 2019
TL;DR: Electroencephalography is used to study how a mixture of two speech streams is represented in the brain as subjects attend to one stream or the other, and finds that low-frequency bands show greater speech-EEG coherence when the speech stream is attended compared to when it is ignored.
Abstract: The ability to selectively attend to speech in the presence of other competing talkers is critical for everyday communication; yet the neural mechanisms facilitating this process are poorly understood. Here, we use electroencephalography (EEG) to study how a mixture of two speech streams is represented in the brain as subjects attend to one stream or the other. To characterize the speech-EEG relationships and how they are modulated by attention, we estimate the statistical association between each canonical EEG frequency band (delta, theta, alpha, beta, low-gamma, and high-gamma) and the envelope of each of ten different frequency bands in the input speech. Consistent with previous literature, we find that low-frequency (delta and theta) bands show greater speech-EEG coherence when the speech stream is attended compared to when it is ignored. We also find that the envelope of the low-gamma band shows a similar attention effect, a result not previously reported with EEG. This is consistent with the prevailing theory that neural dynamics in the gamma range are important for attention-dependent routing of information in cortical circuits. In addition, we also find that the greatest attention-dependent increases in speech-EEG coherence are seen in the mid-frequency acoustic bands (0.5-3 kHz) of input speech and the temporal-parietal EEG sensors. Finally, we find individual differences in the following: (1) the specific set of speech-EEG associations that are the strongest, (2) the EEG and speech features that are the most informative about attentional focus, and (3) the overall magnitude of attentional enhancement of speech-EEG coherence.
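A band-wise speech-EEG coherence of the kind analyzed here can be estimated with standard tools. The sketch below, assuming NumPy/SciPy and that a single EEG channel and a speech-band envelope have already been extracted, aligned, and sampled at the same rate, band-limits the EEG and averages the magnitude-squared coherence within the band; it illustrates the measure only, not the authors' analysis pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt, coherence

def band_coherence(eeg, envelope, fs, band):
    """Mean magnitude-squared coherence between a band-limited EEG channel and a speech envelope."""
    lo, hi = band
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    eeg_band = filtfilt(b, a, eeg)
    f, cxy = coherence(eeg_band, envelope, fs=fs, nperseg=int(2 * fs))
    return cxy[(f >= lo) & (f <= hi)].mean()

# e.g. delta-band (1-4 Hz) coherence for attended vs. ignored speech envelopes:
# c_att = band_coherence(eeg_channel, env_attended, fs=128, band=(1, 4))
# c_ign = band_coherence(eeg_channel, env_ignored,  fs=128, band=(1, 4))
```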

31 citations


Journal ArticleDOI
TL;DR: It is found that the performance of different tasks on identical speech sounds leads to neural enhancement of the acoustic features in the stimuli that are critically relevant to task performance.
Abstract: Speech is the most important signal in our auditory environment, and the processing of speech is highly dependent on context. However, it is unknown how contextual demands influence the neural encoding of speech. Here, we examine the context dependence of auditory cortical mechanisms for speech encoding at the level of the representation of fundamental acoustic features (spectrotemporal modulations) using model-based functional magnetic resonance imaging. We found that the performance of different tasks on identical speech sounds leads to neural enhancement of the acoustic features in the stimuli that are critically relevant to task performance. These task effects were observed at the earliest stages of auditory cortical processing, in line with interactive accounts of speech processing. Our work provides important insights into the mechanisms that underlie the processing of contextually relevant acoustic information within our rich and dynamic auditory environment.

26 citations


Proceedings ArticleDOI
Janusz Klejsa, Per Hedelin, Cong Zhou, Roy M. Fejgin, Lars Villemoes
12 May 2019
TL;DR: A speech coding scheme employing a generative model based on SampleRNN that, while operating at significantly lower bitrates, matches or surpasses the perceptual quality of state-of-the-art classic wide-band codecs is provided.
Abstract: We provide a speech coding scheme employing a generative model based on SampleRNN that, while operating at significantly lower bitrates, matches or surpasses the perceptual quality of state-of-the-art classic wide-band codecs. Moreover, it is demonstrated that the proposed scheme can provide a meaningful rate-distortion trade-off without retraining. We evaluate the proposed scheme in a series of listening tests and discuss limitations of the approach.

25 citations


Journal ArticleDOI
TL;DR: A new end-to-end communication system is proposed to increase transmission speed, robustness, and security in order to meet the requirements of mobile systems, which handle an exponentially increasing amount of data over time.
Abstract: A new end-to-end communication system is proposed to increase transmission speed, robustness, and security in order to meet the requirements of mobile systems, which handle an exponentially increasing amount of data over time. The design relies on compressed sensing-based source coding instead of the speech coding standards supported in current mobile communication systems. The proposed compressed sensing source coding method reduces speech coding complexity by using simple quantisation and binary encoding, saves communication system resources, and encrypts communications without additional cost. The performance of the resulting communication system is evaluated for speech communication over a 10 dB Rayleigh channel in terms of perceptual evaluation of speech quality (PESQ) scores and the coherence speech intelligibility index (CSII) when convolutional coding, orthogonal frequency division multiplexing, and diversity schemes are used. The results show that, for a bit rate of 12.8 kbit/s, the proposed scheme achieves fair speech intelligibility (a CSII value of 0.5) and good output speech quality (a PESQ score of 3.33).
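A compressed sensing-based source coder of the general kind described above measures each speech frame with a random sensing matrix and then applies simple scalar quantization; because reconstruction requires knowledge of the sensing matrix, the matrix can also act as a shared secret. The following sketch, assuming NumPy and purely illustrative frame and measurement sizes, shows only the measure-then-quantize encoder; the sparse-recovery decoder is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 160, 64                                     # frame length and number of measurements
Phi = rng.standard_normal((M, N)) / np.sqrt(M)     # sensing matrix, shared by both ends

def cs_encode(frame, bits=4):
    """Compressive measurement followed by simple uniform scalar quantization."""
    y = Phi @ frame
    lo, hi = y.min(), y.max()
    step = (hi - lo) / (2 ** bits - 1) or 1.0
    q = np.round((y - lo) / step).astype(int)      # quantization indices to binary-encode
    return q, lo, step

# The decoder recovers the frame from (q, lo, step) with a sparse solver
# (e.g. OMP or basis pursuit over a DCT dictionary); omitted here.
```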

25 citations


Journal ArticleDOI
19 Sep 2019-Symmetry
TL;DR: A novel idea is to weave together three separate convolution layers: a traditional time convolution and two different frequency convolutions (a mel-frequency cepstral coefficient (MFCC) convolution and a spectrum convolution), which takes into account more of the detail contained in the analysed signal.
Abstract: This work presents a new approach to speech recognition based on the specific coding of the time and frequency characteristics of speech. Convolutional neural networks were chosen because they show high resistance to cross-spectral distortions and to differences in vocal tract length. Until now, two layers have been used: a time convolution and a frequency convolution. The novel idea is to weave together three separate convolution layers: a traditional time convolution and two different frequency convolutions (a mel-frequency cepstral coefficient (MFCC) convolution and a spectrum convolution). This approach takes into account more of the detail contained in the analysed signal. The idea is to create patterns for sounds in the form of RGB (red, green, blue) images. Experiments were carried out for isolated words and continuous speech with this neural network structure. A method for dividing continuous speech into syllables is also proposed. The method can be used for symmetrical stereo sound.
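One possible reading of the three-channel "RGB pattern" idea is sketched below, assuming librosa and SciPy are available and that the image size and frame length are arbitrary choices: a stacked-waveform channel, an MFCC channel, and a magnitude-spectrum channel are each resized to a common image and stacked as three colour planes. This is only one plausible construction, not the paper's exact feature extraction.

```python
import numpy as np
import librosa
from scipy.ndimage import zoom

def rgb_pattern(y, sr, size=(64, 64), frame=256):
    """Stack time, MFCC and spectrum channels into one (H, W, 3) image-like pattern."""
    def fit(img):
        img = np.asarray(img, dtype=float)
        return zoom(img, (size[0] / img.shape[0], size[1] / img.shape[1]), order=1)

    n = (len(y) // frame) * frame
    time_img = fit(y[:n].reshape(-1, frame))                     # stacked waveform frames
    mfcc_img = fit(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20))  # MFCC channel
    spec_img = fit(np.abs(librosa.stft(y, n_fft=512)))           # spectrum channel
    chans = [(c - c.min()) / (np.ptp(c) + 1e-9)
             for c in (time_img, mfcc_img, spec_img)]
    return np.stack(chans, axis=-1)
```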

Journal ArticleDOI
TL;DR: Electrocortical recordings in participants who maintain sentences in memory identify the phase of left frontotemporal beta oscillations as the most prominent information carrier of sentence identity, providing evidence for a theoretical model of speech memory representations and explaining why interfering with beta oscillations in the left inferior frontal cortex diminishes verbal working memory capacity.
Abstract: The way the human brain represents speech in memory is still unknown. An obvious characteristic of speech is its evolution over time. During speech processing, neural oscillations are modulated by the temporal properties of the acoustic speech signal, but acquired knowledge of the temporal structure of language also influences speech perception-related brain activity. This suggests that speech could be represented in the temporal domain, a form of representation that the brain also uses to encode autobiographic memories. Empirical evidence for such a memory code is lacking. We investigated the nature of speech memory representations using direct cortical recordings in the left perisylvian cortex during delayed sentence reproduction in female and male patients undergoing awake tumor surgery. Our results reveal that the brain endogenously represents speech in the temporal domain. Temporal pattern similarity analyses revealed that the phase of frontotemporal low-frequency oscillations, primarily in the beta range, represents sentence identity in working memory. The positive relationship between beta power during working memory and task performance suggests that working memory representations benefit from increased phase separation. SIGNIFICANCE STATEMENT: Memory is an endogenous source of information based on experience. While neural oscillations encode autobiographic memories in the temporal domain, little is known about their contribution to memory representations of human speech. Our electrocortical recordings in participants who maintain sentences in memory identify the phase of left frontotemporal beta oscillations as the most prominent information carrier of sentence identity. These observations provide evidence for a theoretical model of speech memory representations and explain why interfering with beta oscillations in the left inferior frontal cortex diminishes verbal working memory capacity. The lack of sentence identity coding at the syllabic rate suggests that sentences are represented in memory in a more abstract form than during speech perception and production.

Journal ArticleDOI
TL;DR: A universal codebook-based speech enhancement framework that relies on a single codebook to encode both speech and noise components, and shows that the proposed ITF-based ASPP approach achieves a good balance in the trade-off between binaural noise reduction and binaural cue preservation.
Abstract: In this work, we present a universal codebook-based speech enhancement framework that relies on a single codebook to encode both speech and noise components. The atomic speech presence probability (ASPP) is defined as the probability that a given codebook atom encodes speech at a given point in time. We develop ASPP estimators based on binaural cues including the interaural phase and level difference (IPD and ILD), the interaural coherence magnitude (ICM), as well as a combined version leveraging the full interaural transfer function (ITF). We evaluate the performance of the resulting ASPP-based speech enhancement algorithms on binaural mixtures of reverberant speech and real-world noise. The proposed approach improves both objective speech quality and intelligibility over a wide range of input SNR, as measured with PESQ and binaural STOI metrics, outperforming two binaural speech enhancement benchmark methods. We show that the proposed ITF-based ASPP approach achieves a good balance of the trade-off between binaural noise reduction and binaural cue preservation.
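The binaural cues that drive the ASPP estimators can be computed directly from the two ear-channel STFTs. The sketch below, assuming NumPy and complex STFT matrices `L` and `R` of shape (frequency, frames), computes IPD, ILD, and an interaural coherence magnitude; the codebook and ASPP estimation itself is not shown.

```python
import numpy as np

def binaural_cues(L, R, eps=1e-12):
    """IPD, ILD and interaural coherence magnitude from left/right complex STFTs."""
    ipd = np.angle(L * np.conj(R))                                   # phase difference
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))       # level difference (dB)
    num = np.abs(np.mean(L * np.conj(R), axis=1))
    den = np.sqrt(np.mean(np.abs(L) ** 2, axis=1)
                  * np.mean(np.abs(R) ** 2, axis=1)) + eps
    icm = num / den                                                  # coherence magnitude per bin
    return ipd, ild, icm
```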


Proceedings ArticleDOI
01 Oct 2019
TL;DR: In this paper, instead of applying the mean squared error (MSE) as the loss function during DNN training for speech enhancement, a perceptual weighting filter loss motivated by the weighting filters as employed in analysis-by-synthesis speech coding was proposed.
Abstract: Single-channel speech enhancement with deep neural networks (DNNs) has shown promising performance and is thus intensively being studied. In this paper, instead of applying the mean squared error (MSE) as the loss function during DNN training for speech enhancement, we design a perceptual weighting filter loss motivated by the weighting filter as it is employed in analysis-by-synthesis speech coding, e.g., in code-excited linear prediction (CELP). The experimental results show that the proposed simple loss function improves the speech enhancement performance compared to a reference DNN with MSE loss in terms of perceptual quality and noise attenuation. The proposed loss function can be advantageously applied to an existing DNN-based speech enhancement system, without modification of the DNN topology for speech enhancement.
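In CELP, the perceptual weighting filter has the form W(z) = A(z/γ1)/A(z/γ2), where A(z) is the LPC polynomial of the clean frame and γ1 > γ2 control how strongly the formant regions are de-emphasized. The sketch below, assuming NumPy/SciPy and illustrative γ values, filters the enhancement error through such a filter before taking the mean squared value, as a stand-in for the kind of weighted loss described above (not the authors' exact training loss).

```python
import numpy as np
from scipy.signal import lfilter

def weighted_mse(error, lpc, g1=0.92, g2=0.6):
    """MSE of the error signal after filtering through W(z) = A(z/g1) / A(z/g2)."""
    k = np.arange(len(lpc))
    num = lpc * g1 ** k        # A(z/g1)
    den = lpc * g2 ** k        # A(z/g2)
    weighted = lfilter(num, den, error)
    return np.mean(weighted ** 2)

# `lpc` is [1, a1, ..., aP] from an LPC analysis of the clean reference frame;
# `error` is the time-domain difference between enhanced and clean frames.
```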

Journal ArticleDOI
TL;DR: An AMR adaptive steganographic scheme that selects the embedding position adaptively within the unvoiced speech segments, as determined by the distribution characteristics of adjacent pitch delays, and embeds the secret message by modifying the pitch delay without destroying the short-time stability of the pitch delay.
Abstract: In this paper, a novel AMR adaptive steganographic scheme based on the pitch delay of unvoiced speech (PDU-AAS) is proposed. Existing AMR steganographic schemes based on pitch delay destroy the short-time relative stability of the pitch delay in voiced speech segments, making them easier to detect with existing steganalysis methods. The pitch delay distributions of AMR voiced and unvoiced speech segments are analyzed in detail, and, based on the characteristic that the pitch delay sequence of AMR unvoiced speech does not exhibit short-term relative stability, we propose an adaptive scheme that selects the embedding position within unvoiced speech segments according to the distribution characteristics of adjacent pitch delays and embeds the secret message by modifying the pitch delay without destroying its short-time stability. The experimental results show that the scheme has good concealment and hiding capacity. Most importantly, comparative experiments show that the scheme is secure against detection by existing steganalysis algorithms. The principle of the scheme can be applied to other steganographic schemes based on the pitch delay of speech codecs, such as G.723.1 and G.729.

Journal ArticleDOI
TL;DR: A new framework based on i-vector/PLDA and maximum entropy (ME) is proposed to improve the performance of a speaker identification system in the presence of speech coding distortion, and it is shown that the proposed method achieves improved performance compared with the i-vector/PLDA and MEGMM baselines.
Abstract: The system combining i-vectors and probabilistic linear discriminant analysis (PLDA) has been applied with great success to the speaker recognition task. The i-vector space gives a low-dimensional representation of a speech segment and provides the training data for a PLDA model, which offers greater robustness under different conditions. In this paper, we propose a new framework based on i-vector/PLDA and maximum entropy (ME) to improve the performance of a speaker identification system in the presence of speech coding distortion. Results are reported on the TIMIT database and on coded speech obtained by passing the TIMIT test utterances through the AMR encoder/decoder. Our results show that the proposed method achieves improved performance when compared with the i-vector/PLDA and MEGMM baselines.

Journal ArticleDOI
Yanzhen Ren, Hanyi Yang, Hongxia Wu, Weiping Tu, Lina Wang
TL;DR: This paper presents a steganographic scheme in the AMR fixed codebook (FCB) domain based on the pulse distribution model (PDM-AFS), which is obtained from the distribution characteristics of the FCB value in the cover audio.
Abstract: Adaptive multi-rate (AMR), a popular audio compression standard, is widely used in mobile communication and mobile Internet applications and has become a novel carrier for hiding information. To improve the statistical security, this paper presents a steganographic scheme in the AMR fixed codebook (FCB) domain based on the pulse distribution model (PDM-AFS), which is obtained from the distribution characteristics of the FCB value in the cover audio. The pulse positions in stego audio are controlled by message encoding and random masking to make the statistical distribution of the FCB parameters close to that of the cover audio. The experimental results show that the statistical security of the proposed scheme is better than that of the existing schemes. Furthermore, the hiding capacity is maintained compared with the existing schemes. The average hiding capacity can reach 2.06 kbps at an audio compression rate of 12.2 kbps, and the auditory concealment is good. To the best of our knowledge, this is the first secure AMR FCB steganographic scheme that improves the statistical security based on the distribution model of the cover audio. This scheme can be extended to other audio compression codecs under the principle of algebraic code excited linear prediction (ACELP), such as G.723.1 and G.729.

Proceedings ArticleDOI
12 May 2019
TL;DR: This work introduces a method to improve the quality of simple scalar quantization in the context of acoustic sensor networks by combining ideas from sparse reconstruction, artificial neural networks and weighting filters.
Abstract: We introduce a method to improve the quality of simple scalar quantization in the context of acoustic sensor networks by combining ideas from sparse reconstruction, artificial neural networks and weighting filters. We start from the observation that optimization methods based on sparse reconstruction resemble the structure of a neural network. Hence, building upon a successful enhancement method, we unroll the algorithms and use this to build a neural network which we train to obtain enhanced decoding. In addition, the weighting filter from code-excited linear predictive (CELP) speech coding is integrated into the loss function of the neural network, achieving perceptually improved reconstructed speech. Our experiments show that our proposed trained methods allow for better speech reconstruction than the reference optimization methods.
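Unrolling a sparse-reconstruction algorithm into a trainable network is most easily pictured with a LISTA-style unrolled ISTA: each iteration becomes a layer with learned matrices and a learned soft-threshold. The sketch below, assuming PyTorch, shows that structure only; it is in the spirit of the unrolling described above, not the authors' architecture, and it omits the CELP weighting-filter loss.

```python
import torch
import torch.nn as nn

class UnrolledISTA(nn.Module):
    """LISTA-style network: each ISTA iteration becomes a layer with learned weights."""
    def __init__(self, meas_dim, sig_dim, steps=8):
        super().__init__()
        self.W = nn.Linear(meas_dim, sig_dim, bias=False)     # learned back-projection
        self.S = nn.Linear(sig_dim, sig_dim, bias=False)      # learned state update
        self.theta = nn.Parameter(torch.full((steps,), 0.1))  # per-layer soft thresholds
        self.steps = steps

    def forward(self, y):                  # y: (batch, meas_dim) quantized measurements
        x = torch.zeros(y.size(0), self.S.in_features, device=y.device)
        for k in range(self.steps):
            z = self.W(y) + self.S(x)
            x = torch.sign(z) * torch.clamp(z.abs() - self.theta[k], min=0)  # soft threshold
        return x
```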

Proceedings ArticleDOI
07 May 2019
TL;DR: A WaveNet-based delay-free adaptive differential pulse code modulation (ADPCM) speech coding system is proposed, in which WaveNet, a state-of-the-art model for neural-network-based speech waveform synthesis, serves as the adaptive predictor; the system improves speech quality and outperforms the conventional ADPCM system based on ITU-T Recommendation G.726.
Abstract: This paper proposes a WaveNet-based delay-free adaptive differential pulse code modulation (ADPCM) speech coding system. The WaveNet generative model, which is a state-of-the-art model for neural-network-based speech waveform synthesis, is used as the adaptive predictor in ADPCM. To further improve speech quality, mel-cepstrum-based noise shaping and postfiltering were integrated with the proposed ADPCM system. Both objective and subjective evaluation results indicate that the proposed ADPCM system outperformed not only the conventional ADPCM system based on ITU-T Recommendation G.726 but also the ADPCM system based on adaptive mel-cepstral analysis.
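Stripped of the neural predictor, the delay-free ADPCM loop is simple: predict the next sample from already-reconstructed samples, quantize the prediction residual, and update the history with the reconstruction so that encoder and decoder stay in sync. The sketch below assumes NumPy and a generic `predictor(history)` callable standing in for the WaveNet model; the step size is fixed rather than adaptive, and the noise shaping and postfiltering from the paper are omitted.

```python
import numpy as np

def adpcm_encode(x, predictor, step=0.05, context=256):
    """Delay-free ADPCM loop with a generic sample predictor and a fixed-step quantizer."""
    history, codes = np.zeros(context), []
    for sample in x:
        pred = predictor(history)                      # adaptive prediction (e.g. a WaveNet)
        code = int(np.round((sample - pred) / step))   # quantize the prediction residual
        codes.append(code)
        recon = pred + code * step                     # what the decoder will reconstruct
        history = np.append(history[1:], recon)        # predict from reconstructed samples
    return codes
```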

Journal ArticleDOI
01 May 2019-Heliyon
TL;DR: A novel scalable speech coding scheme based on compressive sensing, which can operate at bit rates from 3.275 to 7.275 kbps, is designed and implemented; it offers listening quality for the reconstructed speech similar to that of the Adaptive Multi-Rate Narrowband (AMR-NB) codec at 6.7 kbps and the Enhanced Voice Services (EVS) codec at 7.2 kbps.

Journal ArticleDOI
TL;DR: The LB-ABE in this paper is employed together with a preexisting ABE toward high-frequency components to obtain spectrally balanced speech signals and the gap of speech quality between wideband and NB speech was significantly reduced when employing the proposed ABE toward low frequencies.
Abstract: Conventional narrowband (NB) telephony suffers from limited acoustic bandwidth at the receiver side, leading to degraded speech quality and intelligibility. In this paper, artificial speech bandwidth extension (ABE) of NB speech toward missing frequencies below about 300 Hz (low-frequency band, LB) is proposed to enhance the speech quality. The LB-ABE in this paper is employed together with a preexisting ABE toward high-frequency components to obtain spectrally balanced speech signals. In an instrumental quality assessment, the spectral distance in the LB was improved by more than 5 dB compared to NB speech. In a subjective listening test, the gap of speech quality between wideband and NB speech was significantly reduced when employing the proposed ABE toward low frequencies. The LB extension was found to further improve the preexisting ABE toward higher frequencies by a significant 0.26 CMOS points.


Proceedings ArticleDOI
01 Sep 2019
TL;DR: The RLP-based TV-CAR speech analysis is proposed and evaluated with the F0 estimation of speech using IRAPT (Instantaneous RAPT) with Keele Pitch Database under noisy conditions.
Abstract: Linear prediction (LP) analysis estimates AR (auto-regressive) coefficients that represent an all-pole spectrum, and has recently been applied in speech synthesis besides speech coding. We have previously proposed $l_2$-norm optimization-based TV-CAR (time-varying complex AR) speech analysis for the analytic signal, using the MMSE (minimum mean square error) or ELS (extended least squares) method, and have applied it to speech processing tasks such as robust ASR and F0 estimation. On the other hand, B. Kleijn et al. have proposed the regularized linear prediction (RLP) method to suppress the pitch-related bias that causes an overestimation of the first formant. In RLP, an $l_2$-norm regularization term, the norm of the spectral change across frequency, is introduced to suppress rapid spectral changes; the parameters are estimated by minimizing the $l_2$-norm criterion plus this regularization penalty. In this paper, RLP-based TV-CAR speech analysis is proposed and evaluated on F0 estimation of speech using IRAPT (instantaneous RAPT) with the Keele Pitch Database under noisy conditions.
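As a rough numerical illustration of regularized LP, the sketch below (assuming NumPy) solves an $l_2$-regularized least-squares problem for the predictor coefficients. Note that the penalty here is a plain Tikhonov term on the coefficients for simplicity, whereas the RLP penalty described above measures spectral change across frequency, so this is a simplified stand-in rather than the RLP estimator itself.

```python
import numpy as np

def regularized_lp(x, order=10, lam=1e-2):
    """Covariance-method LP with an l2 (Tikhonov) penalty on the predictor coefficients."""
    # predict x[n] from its `order` past samples, for n = order .. len(x)-1
    X = np.column_stack([x[order - k - 1: len(x) - k - 1] for k in range(order)])
    target = x[order:]
    a = np.linalg.solve(X.T @ X + lam * np.eye(order), X.T @ target)
    return np.concatenate(([1.0], -a))     # A(z) = 1 - sum_k a_k z^{-k}
```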

Proceedings ArticleDOI
15 Sep 2019
TL;DR: A new type of spectral feature that is both compact and interpolable, and thus ideally suited for regression approaches that involve averaging, is proposed by means of a speaker-independent variational autoencoder (VAE), which learns a latent space based on the low-dimensional manifold of high-resolution speech spectra.
Abstract: We propose a new type of spectral feature that is both compact and interpolable, and thus ideally suited for regression approaches that involve averaging. The feature is realized by means of a speaker-independent variational autoencoder (VAE), which learns a latent space based on the low-dimensional manifold of high-resolution speech spectra. In vocoding experiments, we show that using a 12-dimensional VAE feature (VAE-12) results in significantly better perceived speech quality compared to a 12-dimensional MCEP feature. In voice conversion experiments, using VAE-12 results in significantly better perceived speech quality than 40-dimensional MCEPs, with similar speaker accuracy. In habitual-to-clear style conversion experiments, we significantly improve the speech intelligibility for one of three speakers, using a custom skip-connection deep neural network, with the average keyword recall accuracy increasing from 24% to 46%.

Journal ArticleDOI
TL;DR: A novel framework for end-to-end secure voice communication over GSM networks using the AES-256 encryption algorithm is presented, believed to be the first solution that uses a single codebook for the transmission of secure voice.
Abstract: The global system for mobile communication (GSM) is a widely used digital mobile service around the world. Although GSM was designed as a secure wireless system, it is now vulnerable to different targeted attacks. There is a need to address security domains, especially the confidentiality of communication. This paper presents a novel framework for end-to-end secure voice communication over GSM networks using the AES-256 encryption algorithm. A special modem and speech coding technique are designed to enable the transmission of encrypted speech over the GSM voice channel. To the best of our knowledge, this is the first solution that uses a single codebook for the transmission of secure voice. An efficient low-bit-rate (1.9 kbps) speech coder is also designed for use with the proposed modulation scheme for optimal results. Speech characteristics such as pitch, energy, and line spectral frequencies are extracted and preserved before compression and encryption of the speech. Previously, the best achieved data rate was 1.6 kbps with three codebooks, whilst the proposed approach achieves 2 kbps with a 0% bit error rate. The empirical results show that the methodology can be used in real-time applications to transmit encrypted voice over the GSM network.

Journal ArticleDOI
TL;DR: This work has dealt with the challenge of embedding a secret speech into a cover speech coded by a very low bit rate speech coder while maintaining a reasonable level of speech quality and was able to create hidden channels with maximum steganographic bandwidths up to 266.64 bit/s.
Abstract: In this study, the authors present a new steganographic technique called random least significant bits of pitch and Fourier magnitude steganography (RLPFS). It is based on hiding secret speech coded by the mixed excitation linear prediction (MELP) speech coder in a speech bitstream (the cover signal), which is also encoded by the MELP coder. First, RLPFS embeds the hidden speech using one of the following modes: pitch-based steganography, Fourier-magnitude-based steganography, or both; the mode is selected randomly. Second, during transmission, the stego speech, the mode number, and the number of embedded bits are transmitted either through a covert channel created in the transmission protocol or through the cover speech. In this work, the authors address the challenge of embedding secret speech into cover speech coded by a very low bit rate speech coder while maintaining a reasonable level of speech quality. They show that RLPFS can create hidden channels with maximum steganographic bandwidths of up to 266.64 bit/s at the cost of a steganographic noise of between 0.031 and 0.62 mean opinion score points. The study also considers the security of the parameters, the synchronisation of the receiver to deal with packet loss during transmission, and the resistance of the proposed method against steganalysis.

Journal ArticleDOI
TL;DR: A novel two-digit ADM with six-level quantization using variable-length coding is presented for encoding time-varying signals modelled by a Laplacian distribution, indicating that the proposed configuration outperforms baseline ADM algorithms, including constant factor delta modulation (CFDM) and continuously variable slope delta modulation (CVSDM), and operates in a much wider dynamic range.
Abstract: Delta modulation (DM) is a simple waveform coding algorithm used mostly when timely data delivery is more important than the transmitted data quality. While the implementation of DM is fairly simple and inexpensive, it suffers from several limitations, such as slope overload and granular noise, which can be overcome using adaptive delta modulation (ADM). This paper presents a novel two-digit ADM with six-level quantization using variable-length coding for encoding time-varying signals modelled by a Laplacian distribution. Two variants of the quantizer are employed: a distortion-constrained quantizer, which is optimally designed for minimal mean-squared error (MSE), and a rate-constrained quantizer, which is suboptimal in the minimal-MSE sense but enables minimal loss in SQNR for the target bit rate. Experimental results using real speech signals are provided, indicating that the proposed configuration outperforms the baseline ADM algorithms, including constant factor delta modulation (CFDM) and continuously variable slope delta modulation (CVSDM), and operates in a much wider dynamic range.
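For orientation, a basic adaptive delta modulator transmits one bit per sample and grows or shrinks its step size depending on whether consecutive bits agree (slope overload) or alternate (granular noise). The sketch below, assuming NumPy and illustrative adaptation factors, shows a CFDM-style encoder; it does not reproduce the paper's two-digit, six-level, variable-length design.

```python
import numpy as np

def adm_encode(x, step0=0.01, up=1.5, down=0.66):
    """One-bit-per-sample adaptive delta modulation with constant-factor step adaptation."""
    est, step, bits = 0.0, step0, []
    for sample in x:
        bit = 1 if sample >= est else 0
        bits.append(bit)
        est += step if bit else -step
        if len(bits) > 1:
            # equal consecutive bits suggest slope overload -> grow the step,
            # alternating bits suggest granular noise -> shrink it
            step = step * up if bits[-1] == bits[-2] else step * down
    return bits
```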

Journal ArticleDOI
TL;DR: It is demonstrated that the final bottleneck layer of the CNN encoder provides an abstract and compact representation of the speech signal that is sufficient to reconstruct the original speech signal in high quality using the CNN decoder.
Abstract: This paper proposes a convolutional neural network (CNN)-based encoder model to compress and code speech signal directly from raw input speech. Although the model can synthesize wideband speech by implicit bandwidth extension, narrowband is preferred for IP telephony and telecommunications purposes. The model takes time domain speech samples as inputs and encodes them using a cascade of convolutional filters in multiple layers, where pooling is applied after some layers to downsample the encoded speech by half. The final bottleneck layer of the CNN encoder provides an abstract and compact representation of the speech signal. In this paper, it is demonstrated that this compact representation is sufficient to reconstruct the original speech signal in high quality using the CNN decoder. This paper also discusses the theoretical background of why and how CNN may be used for end-to-end speech compression and coding. The complexity, delay, memory requirements, and bit rate versus quality are discussed in the experimental results.
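The encoder structure described above, a cascade of convolutions with pooling that halves the time resolution at each stage and ends in a narrow bottleneck, can be sketched as follows, assuming PyTorch and arbitrary layer sizes; the matching decoder would mirror it with upsampling. This is an illustration of the general architecture, not the paper's exact model.

```python
import torch.nn as nn

def cnn_speech_encoder(stages=4, channels=32, bottleneck=16, kernel=9):
    """Cascade of 1-D convolutions; pooling halves the time resolution at every stage."""
    layers, in_ch = [], 1
    for _ in range(stages):
        layers += [nn.Conv1d(in_ch, channels, kernel, padding=kernel // 2),
                   nn.ReLU(),
                   nn.MaxPool1d(2)]
        in_ch = channels
    layers.append(nn.Conv1d(channels, bottleneck, 1))   # compact bottleneck representation
    return nn.Sequential(*layers)
```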