Showing papers on "Speech coding" published in 2018


Reference BookDOI
03 Oct 2018
TL;DR: A reference text covering the analytical foundations of speech processing (discrete-time signals and transforms, probability and random processes, linear and dynamic system models, optimization and estimation theory, and statistical pattern recognition) together with the fundamentals of speech science, computational phonology and phonetics, and speech technology in selected areas (speech recognition, enhancement, and synthesis).
Abstract: Analytical background and techniques: discrete-time signals, systems and transforms; analysis of discrete-time speech signals; probability and random processes; linear model and dynamic system model; optimization methods and estimation theory; statistical pattern recognition. Fundamentals of speech science: phonetic process; phonological process. Computational phonology and phonetics: computational phonology; computational models for speech production; computational models for auditory speech processing. Speech technology in selected areas: speech recognition; speech enhancement; speech synthesis.

244 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a new deep network for audio event recognition, called AENet, which uses a convolutional neural network (CNN) operating on a large temporal input.
Abstract: We propose a new deep network for audio event recognition, called AENet. In contrast to speech, sounds coming from audio events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of clear subword units that are present in speech. In order to incorporate this long-time frequency structure of audio events, we introduce a convolutional neural network (CNN) operating on a large temporal input. In contrast to previous works, this allows us to train an audio event detection system end to end. The combination of our network architecture and a novel data augmentation outperforms previous methods for audio event detection by 16%. Furthermore, we perform transfer learning and show that our model learned generic audio features, similar to the way CNNs learn generic features on vision tasks. In video analysis, combining visual features and traditional audio features, such as mel frequency cepstral coefficients, typically only leads to marginal improvements. Instead, combining visual features with our AENet features, which can be computed efficiently on a GPU, leads to significant performance improvements on action recognition and video highlight detection. In video highlight detection, our audio features improve the performance by more than 8% over visual features alone.
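
As an illustration of the long-temporal-context idea described above, the sketch below (an assumption, not the actual AENet architecture) stacks strided 1-D convolutions so that a single audio-event decision is made from several seconds of audio; layer sizes and the class count are placeholders.

```python
# Minimal PyTorch sketch of a CNN with a long temporal receptive field.
# Layer sizes, strides and the number of classes are illustrative only.
import torch
import torch.nn as nn

class LongContextAudioCNN(nn.Module):
    def __init__(self, n_classes=28):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=32, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),              # pool over the whole (long) time axis
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, wave):                      # wave: (batch, 1, samples), several seconds long
        return self.classifier(self.features(wave).squeeze(-1))

logits = LongContextAudioCNN()(torch.randn(2, 1, 16000 * 4))   # two 4-second clips at 16 kHz
```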

147 citations


Proceedings ArticleDOI
04 May 2018
TL;DR: In this article, a WaveNet generative speech model is used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s.
Abstract: Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the speech produced by the system is able to additionally perform implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener, even when that speaker has not been used during the training of the generative model.

128 citations


Proceedings ArticleDOI
15 Apr 2018
TL;DR: Experimental results demonstrate that PPGs successfully improve both naturalness and speaker similarity of the converted speech, and that both speaker codes and d-vectors can be adopted in the VAE-based many-to-many non-parallel VC.
Abstract: This paper proposes novel frameworks for non-parallel voice conversion (VC) using variational autoencoders (VAEs). Although conventional VAE-based VC models can be trained using non-parallel speech corpora with given speaker representations, the phonetic content of the converted speech tends to vanish because of an over-regularization issue often observed in the latent variables of VAEs. To overcome this issue, this paper proposes a VAE-based non-parallel VC conditioned not only on the speaker representations but also on the phonetic content of the speech, represented as phonetic posteriorgrams (PPGs). Since the phonetic content is given during training, we can expect the VC models to effectively learn speaker-independent latent features of speech. Building on this point, this paper also extends the conventional VAE-based non-parallel VC to many-to-many VC that can convert an arbitrary speaker's characteristics into those of another arbitrary speaker. We investigate two methods to estimate speaker representations for speakers not included in the speech corpora used for training the VC models: 1) adapting conventional speaker codes, and 2) using d-vectors as the speaker representations. Experimental results demonstrate that 1) PPGs successfully improve both naturalness and speaker similarity of the converted speech, and 2) both speaker codes and d-vectors can be adopted in the VAE-based many-to-many non-parallel VC.

114 citations


Journal ArticleDOI
TL;DR: Results showed stronger low-frequency cortical tracking of the speech envelope in IDS than in ADS, which suggests that IDS has a privileged status in facilitating successful cortical tracking of incoming speech which may, in turn, augment infants’ early speech processing and even later language development.
Abstract: This study assessed cortical tracking of temporal information in incoming natural speech in seven-month-old infants. Cortical tracking refers to the process by which neural activity follows the dynamic patterns of the speech input. In adults, it has been shown to involve attentional mechanisms and to facilitate effective speech encoding. However, in infants, cortical tracking or its effects on speech processing have not been investigated. This study measured cortical tracking of speech in infants and, given the involvement of attentional mechanisms in this process, cortical tracking of both infant-directed speech (IDS), which is highly attractive to infants, and the less captivating adult-directed speech (ADS), were compared. IDS is the speech register parents use when addressing young infants. In comparison to ADS, it is characterised by several acoustic qualities that capture infants’ attention to linguistic input and assist language learning. Seven-month-old infants’ cortical responses were recorded via electroencephalography as they listened to IDS or ADS recordings. Results showed stronger low-frequency cortical tracking of the speech envelope in IDS than in ADS. This suggests that IDS has a privileged status in facilitating successful cortical tracking of incoming speech which may, in turn, augment infants’ early speech processing and even later language development.

60 citations


Journal ArticleDOI
01 Mar 2018
TL;DR: The availability of prior information enhanced the perceived clarity of degraded speech, and this enhancement was positively correlated with changes in phoneme-level encoding across subjects, suggesting that prior information affects the early encoding of natural speech in a dual manner.
Abstract: In real-world environments, humans comprehend speech by actively integrating prior knowledge (P) and expectations with sensory input. Recent studies have revealed effects of prior information in temporal and frontal cortical areas and have suggested that these effects are underpinned by enhanced encoding of speech-specific features, rather than a broad enhancement or suppression of cortical activity. However, in terms of the specific hierarchical stages of processing involved in speech comprehension, the effects of integrating bottom-up sensory responses and top-down predictions are still unclear. In addition, it is unclear whether the predictability that comes with prior information may differentially affect speech encoding relative to the perceptual enhancement that comes with that prediction. One way to investigate these issues is through examining the impact of P on indices of cortical tracking of continuous speech features. Here, we did this by presenting participants with degraded speech sentences that either were or were not preceded by a clear recording of the same sentences while recording non-invasive electroencephalography (EEG). We assessed the impact of prior information on an isolated index of cortical tracking that reflected phoneme-level processing. Our findings suggest the possibility that prior information affects the early encoding of natural speech in a dual manner. Firstly, the availability of prior information, as hypothesized, enhanced the perceived clarity of degraded speech, which was positively correlated with changes in phoneme-level encoding across subjects. In addition, P induced an overall reduction of this cortical measure, which we interpret as resulting from the increase in predictability.

52 citations


Proceedings ArticleDOI
15 Apr 2018
TL;DR: In this article, a deep neural network model was proposed to optimize all the steps of a wideband speech coding pipeline (compression, quantization, entropy coding, and decompression) end-to-end directly from raw speech data.
Abstract: Modern compression algorithms are often the result of laborious domain-specific research; industry standards such as MP3, JPEG, and AMR-WB took years to develop and were largely hand-designed. We present a deep neural network model which optimizes all the steps of a wideband speech coding pipeline (compression, quantization, entropy coding, and decompression) end-to-end directly from raw speech data - no manual feature engineering necessary, and it trains in hours. In testing, our DNN-based coder performs on par with the AMR-WB standard at a variety of bitrates (~9 kbps up to ~24 kbps). It also runs in real time on a 3.8 GHz Intel CPU.
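
The following PyTorch sketch illustrates the end-to-end principle only; it is a toy stand-in, not the paper's architecture. A small encoder maps a speech frame to a latent vector, a scalar quantizer with a straight-through gradient makes the latent transmittable, and a decoder reconstructs the frame, so the whole chain can be trained from raw speech with a reconstruction loss.

```python
# Toy end-to-end "codec": encoder -> scalar quantizer (straight-through) -> decoder.
# Frame length, latent size and quantizer resolution are illustrative assumptions.
import torch
import torch.nn as nn

class TinySpeechCodec(nn.Module):
    def __init__(self, frame=320, latent=64, levels=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(frame, 256), nn.Tanh(), nn.Linear(256, latent), nn.Tanh())
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.Tanh(), nn.Linear(256, frame))
        self.levels = levels

    def forward(self, x):
        z = self.enc(x)
        zq = torch.round(z * self.levels) / self.levels   # scalar quantization of the latent
        zq = z + (zq - z).detach()                        # straight-through gradient estimator
        return self.dec(zq)

codec = TinySpeechCodec()
frames = torch.randn(8, 320)                              # stand-in for 20 ms frames at 16 kHz
loss = torch.mean((codec(frames) - frames) ** 2)          # train on real speech with e.g. Adam
loss.backward()
```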

49 citations


Journal ArticleDOI
TL;DR: This study introduces a new method for speech signal encryption and compression in a single step using compressive sensing (CS), and the contourlet transform is used to increase the sparsity of the signal required by CS.
Abstract: This study introduces a new method for speech signal encryption and compression in a single step. The combined compression/encryption procedures are accomplished using compressive sensing (CS). The contourlet transform is used to increase the sparsity of the signal required by CS. Due to its randomness properties and very high sensitivity to initial conditions, a chaotic system is used to generate the sensing matrix of CS. This largely increases the key size of the encryption, to 10^135 when a logistic map is used. A spectral segmental signal-to-noise ratio of -36.813 dB is obtained as a measure of encryption strength. The quality of the reconstructed speech is given by means of the signal-to-noise ratio (SNR) and perceptual evaluation of speech quality (PESQ). For a 60% compression ratio the proposed method gives 48.203 dB SNR and 4.437 PESQ for voiced speech segments. However, for continuous speech (voiced and unvoiced), it gives 41.097 dB SNR and 4.321 PESQ.
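
A minimal numpy sketch of the chaotic-sensing idea is given below: a logistic map, seeded by the secret key, generates the compressive-sensing measurement matrix, so the measurements serve simultaneously as the compressed and encrypted signal. The map parameters, matrix construction and sizes are illustrative assumptions, not the authors' implementation.

```python
# Chaotic (logistic-map) generation of a CS measurement matrix; the seed acts as the key.
import numpy as np

def logistic_map(x0, r, n):
    """Generate n values of the logistic map x_{k+1} = r * x_k * (1 - x_k)."""
    x = np.empty(n)
    x[0] = x0
    for k in range(1, n):
        x[k] = r * x[k - 1] * (1.0 - x[k - 1])
    return x

def sensing_matrix(key_x0, key_r, m, n):
    """Build an m x n measurement matrix from the chaotic sequence (zero-mean, unit-norm rows)."""
    seq = logistic_map(key_x0, key_r, m * n)
    phi = seq.reshape(m, n) - 0.5                      # centre the sequence
    return phi / np.linalg.norm(phi, axis=1, keepdims=True)

# "Encrypt + compress" one frame of a sparsified speech representation.
rng = np.random.default_rng(0)
x = np.zeros(256); x[rng.choice(256, 20, replace=False)] = rng.standard_normal(20)  # sparse coefficients
phi = sensing_matrix(key_x0=0.3741, key_r=3.99, m=154, n=256)   # measurements ~60% of original dimension
y = phi @ x   # measurements double as ciphertext; recovery needs the key and a CS solver
```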

29 citations


Journal ArticleDOI
TL;DR: This proof-of-concept study presents a novel signal processing algorithm specific to BCIs that aims to improve sound localization in noise, along with suggestions for modifying and further improving the BCI algorithm.
Abstract: Bilateral cochlear implant (BCI) users only have very limited spatial hearing abilities. Speech coding strategies transmit interaural level differences (ILDs) but in a distorted manner. Interaural time difference (ITD) information transmission is even more limited. With these cues, most BCI users can coarsely localize a single source in quiet, but performance quickly declines in the presence of other sound. This proof-of-concept study presents a novel signal processing algorithm specific for BCIs, with the aim to improve sound localization in noise. The core part of the BCI algorithm duplicates a monophonic electrode pulse pattern and applies quasistationary natural or artificial ITDs or ILDs based on the estimated direction of the dominant source. Three experiments were conducted to evaluate different algorithm variants: Experiment 1 tested if ITD transmission alone enables BCI subjects to lateralize speech. Results showed that six out of nine BCI subjects were able to lateralize intelligible speech in quiet solely based on ITDs. Experiments 2 and 3 assessed azimuthal angle discrimination in noise with natural or modified ILDs and ITDs. Angle discrimination for frontal locations was possible with all variants, including the pure ITD case, but for lateral reference angles, it was only possible with a linearized ILD mapping. Speech intelligibility in noise, limitations, and challenges of this interaural cue transmission approach are discussed alongside suggestions for modifying and further improving the BCI algorithm.

22 citations


Proceedings ArticleDOI
01 Dec 2018
TL;DR: An end-to-end automatic speech recognition model designed for a common low-resource scenario: no pronunciation dictionary or phonemic transcripts, very limited transcribed speech, and much larger non-parallel text and speech corpora is developed.
Abstract: In this paper, we develop an end-to-end automatic speech recognition (ASR) model designed for a common low-resource scenario: no pronunciation dictionary or phonemic transcripts, very limited transcribed speech, and much larger non-parallel text and speech corpora. Our semi-supervised model is built on top of an encoder-decoder model with attention and takes advantage of non-parallel speech and text corpora in several ways: a denoising text autoencoder that shares parameters with the ASR decoder, a speech autoencoder that shares parameters with the ASR encoder, and adversarial training that encourages the speech and text encoders to use the same embedding space. We show that a model with this architecture significantly outperforms the baseline in this low-resource condition. We additionally perform an ablation evaluation, demonstrating that all of our added components contribute substantially to the overall performance of our model. We propose several avenues for further work, noting in particular that a model with this architecture could potentially enable fully unsupervised speech recognition.

21 citations


Journal ArticleDOI
TL;DR: The presented experiments demonstrate that the proposed randomizations yield uncorrelated signals, that perceptual quality is competitive, and that the complexity of the proposed methods is feasible for practical applications.
Abstract: Efficient coding of speech and audio in a distributed system requires that quantization errors across nodes are uncorrelated. Yet, with conventional methods at low bitrates, quantization levels become increasingly sparse, which does not correspond to the distribution of the input signal and, importantly, also reduces coding efficiency in a distributed system. We have recently proposed a distributed speech and audio codec design, which applies quantization in a randomized domain such that quantization errors are randomly rotated in the output domain. Similar to dithering, this ensures that quantization errors across nodes are uncorrelated and coding efficiency is retained. In this paper, we improve this approach by proposing faster randomization methods, with a computational complexity of $\mathcal O(N\log N)$ . The presented experiments demonstrate that the proposed randomizations yield uncorrelated signals, that perceptual quality is competitive, and that the complexity of the proposed methods is feasible for practical applications.
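
A sketch of one possible O(N log N) randomization of this kind follows (an assumed construction, not necessarily the paper's): a random diagonal sign flip followed by an orthonormal fast transform. Both factors are orthogonal, so quantization noise added in the randomized domain is spread over all output samples, much like dithering.

```python
# Fast randomized rotation: random sign flips followed by an orthonormal DCT (O(N log N)).
import numpy as np
from scipy.fft import dct, idct

def randomize(x, seed):
    signs = np.where(np.random.default_rng(seed).random(x.size) < 0.5, -1.0, 1.0)
    return dct(signs * x, norm='ortho'), signs       # orthogonal, fast

def derandomize(y, signs):
    return signs * idct(y, norm='ortho')

x = np.random.default_rng(1).standard_normal(1024)
y, s = randomize(x, seed=42)
x_hat = derandomize(np.round(y * 4) / 4, s)          # coarse quantization in the randomized domain
```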

Posted Content
TL;DR: In this article, a speech coding scheme employing a generative model based on SampleRNN that matches or surpasses the perceptual quality of state-of-the-art wide-band codecs was proposed.
Abstract: We provide a speech coding scheme employing a generative model based on SampleRNN that, while operating at significantly lower bitrates, matches or surpasses the perceptual quality of state-of-the-art classic wide-band codecs. Moreover, it is demonstrated that the proposed scheme can provide a meaningful rate-distortion trade-off without retraining. We evaluate the proposed scheme in a series of listening tests and discuss limitations of the approach.

Journal ArticleDOI
TL;DR: The mechanisms of speech perception that have led to perceptual speech coding are reviewed; the application of cognitive speech processing in speech compression presents a paradigm shift from perceptual (auditory) speech processing toward cognitive (auditory plus cortical) speech processing.
Abstract: Speech coding is a field in which compression paradigms have not changed in the last 30 years. Speech signals are most commonly encoded with compression methods that have roots in linear predictive theory dating back to the early 1940s. This article bridges this influential theory with recent cognitive studies applicable in speech communication engineering. It reviews the mechanisms of speech perception that have led to perceptual speech coding. The focus is on human speech communication and machine learning and the application of cognitive speech processing in speech compression that presents a paradigm shift from perceptual (auditory) speech processing toward cognitive (auditory plus cortical) speech processing.

Proceedings ArticleDOI
15 Apr 2018
TL;DR: In a subjective comparison category rating test, the proposed ABE solution significantly outperforms the competing ABE baseline and was found to improve NB speech quality by 0.80 CMOS points, while the computation time is reduced to about 3 % compared to the ABE baseline.
Abstract: In this work, we present a simple deep neural network (DNN)-based regression approach to artificial speech bandwidth extension (ABE) in the frequency domain for estimating missing speech components in the range 4 … 7 kHz. The upper band (UB) spectral magnitudes are found by first estimating the UB cepstrum by means of a DNN regression and subsequent conversion to the spectral domain, leading to more efficient and better-generalizing model training than estimating the highly redundant UB magnitudes directly. As a second novelty, the phase information for the estimated upper band spectral magnitudes is generated by spectrally shifting the NB phase. Apart from framing, this very simple approach does not introduce additional algorithmic delay. A cross-database and cross-language task is defined for training and evaluation of the ABE framework. In a subjective comparison category rating test, the proposed ABE solution significantly outperforms the competing ABE baseline and was found to improve NB speech quality by 0.80 CMOS points, while the computation time is reduced to about 3% compared to the ABE baseline.
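
The numpy sketch below illustrates the phase trick described in the abstract: the upper-band phase is taken as a spectrally shifted copy of the narrowband phase, so only the upper-band magnitudes need to be estimated (here assumed to come from the DNN cepstral regression). The FFT size and band edges are illustrative assumptions.

```python
# Assemble a wideband spectrum from the NB signal, estimated UB magnitudes and shifted NB phase.
import numpy as np

def extend_frame(nb_frame, ub_magnitude, n_fft=512):
    """nb_frame: 16-kHz frame with empty upper band; ub_magnitude: estimated |X| for bins n_fft//4..n_fft//2."""
    spec = np.fft.rfft(nb_frame, n_fft)
    nb_bins = n_fft // 4                           # 0..4 kHz at 16 kHz sampling
    ub_bins = slice(nb_bins, 2 * nb_bins + 1)
    shifted_phase = np.angle(spec[:nb_bins + 1])   # NB phase, reused for the upper band
    spec[ub_bins] = ub_magnitude * np.exp(1j * shifted_phase)
    return np.fft.irfft(spec, n_fft)
```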

Proceedings ArticleDOI
01 Sep 2018
TL;DR: Basis vectors parameterised by autoregressive (AR) coefficients are proposed, and this parametric representation is observed to be beneficial when performing online speech enhancement in low-delay scenarios.
Abstract: In this paper, we propose a speech enhancement method based on non-negative matrix factorization (NMF) techniques. NMF techniques allow us to approximate the power spectral density (PSD) of the noisy signal as a weighted linear combination of trained speech and noise basis vectors arranged as the columns of a matrix. In this work, we propose to use basis vectors that are parameterised by autoregressive (AR) coefficients. A parametric representation of the spectral basis is beneficial as it can encompass signal characteristics such as the speech production model. The parametric representation of the basis vectors is observed to be beneficial when performing online speech enhancement in low-delay scenarios.
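
The sketch below illustrates the idea under stated assumptions: each basis vector is the power spectrum of an all-pole (AR) model, and only the non-negative activations are estimated from the noisy PSD with standard multiplicative updates. The AR coefficient sets, iteration count and update rule are illustrative, not the authors' exact formulation.

```python
# AR-parameterised NMF bases: columns of W are all-pole power spectra; only H is estimated.
import numpy as np
from scipy.signal import freqz

def ar_basis(ar_coeff_sets, n_freq=257):
    """Stack 1/|A(e^jw)|^2 envelopes (one per AR coefficient set) as the columns of W."""
    cols = []
    for a in ar_coeff_sets:                       # a = [1, a1, ..., ap]
        _, h = freqz(1.0, a, worN=n_freq)
        cols.append(np.abs(h) ** 2)
    return np.stack(cols, axis=1)

def nmf_activations(noisy_psd, W, n_iter=50, eps=1e-12):
    """Estimate activations H >= 0 so that W @ H approximates the noisy PSD (KL-style multiplicative update)."""
    H = np.full((W.shape[1], noisy_psd.shape[1]), 0.1)
    for _ in range(n_iter):
        V_hat = W @ H + eps
        H *= (W.T @ (noisy_psd / V_hat)) / (W.T @ np.ones_like(noisy_psd) + eps)
    return H
```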

Proceedings ArticleDOI
01 Sep 2018
TL;DR: A convolutional neural network (CNN)-based postprocessor applying cepstral domain features to enhance the transcoded speech for various narrowband and wideband codecs without any modification of these codecs is presented.
Abstract: Postprocessors can be advantageously used to enhance transcoded speech after the decoder on the receiver side. In this paper we present a convolutional neural network (CNN)-based postprocessor applying cepstral domain features to enhance the transcoded speech for various narrowband and wideband codecs without any modification of these codecs. Simulations show that the proposed postprocessor is able to improve the coded speech quality (PESQ or WB-PESQ) by 0.25 MOS-LQO points for G.711, 0.26 points for G.726, 0.81 points for G.722, and 0.2 points for AMR-WB. Moreover, a superior performance is observed for the proposed postprocessor compared to an ITU-T-standardized postfilter for G.711. We also show that AMR-WB at lower bitrates together with our proposed postprocessor is able to exceed the speech quality of AMR-WB at higher bitrates without postprocessing (up to 3 modes higher).

Proceedings Article
01 Sep 2018
TL;DR: In this article, a combined defense incorporating compressions, speech coding, filtering, and audio panning was shown to be quite effective against the attack on the Speech Commands Model, detecting audio adversarial examples with 93.5% precision and 91.2% recall.
Abstract: An adversarial attack is an exploitative process in which minute alterations are made to natural inputs, causing the inputs to be misclassified by neural models. In the field of speech recognition, this has become an issue of increasing significance. Although adversarial attacks were originally introduced in computer vision, they have since infiltrated the realm of speech recognition. In 2017, a genetic attack was shown to be quite potent against the Speech Commands Model. Limited-vocabulary speech classifiers, such as the Speech Commands Model, are used in a variety of applications, particularly in telephony; as such, adversarial examples produced by this attack pose a major security threat. This paper explores various methods of detecting these adversarial examples with combinations of audio preprocessing. One particular combined defense incorporating compressions, speech coding, filtering, and audio panning was shown to be quite effective against the attack on the Speech Commands Model, detecting audio adversarial examples with 93.5% precision and 91.2% recall.
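
A hedged sketch of the detection principle follows: the classifier is run on the raw input and on several preprocessed copies, and a label disagreement flags a likely adversarial example. The `classify` callable, the median filter and the toy panning transform are hypothetical stand-ins for the compressions, speech coding, filtering and panning combinations studied in the paper.

```python
# Disagreement-based detection: adversarial perturbations tend not to survive preprocessing.
import numpy as np
from scipy.signal import medfilt

def pan_mix(x, alpha=0.5):
    """Toy 'audio panning' stand-in: attenuate and re-mix a slightly delayed copy."""
    return alpha * x + (1 - alpha) * np.roll(x, 1)

def is_adversarial(x, classify, preprocessors=(lambda x: medfilt(x, 5), pan_mix)):
    """classify: any callable mapping a waveform to a label (hypothetical)."""
    base = classify(x)
    return any(classify(p(x)) != base for p in preprocessors)
```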

Journal ArticleDOI
TL;DR: Two strategies using direct formant estimation algorithms, FACE and VFACE, are developed within this study and may serve as potential supplementary channel selection techniques for the ACE sound processing strategy for cochlear implants.
Abstract: The Advanced Combination Encoder (ACE) signal processing strategy is used in the majority of cochlear implant (CI) sound processors manufactured by Cochlear Corporation. This “n-of-m” strategy selects the “n” out of “m” available frequency channels with the highest spectral energy in each stimulation cycle. It is hypothesized that at low signal-to-noise ratio (SNR) conditions, noise-dominant frequency channels are susceptible to selection, neglecting channels containing target speech cues. In order to improve speech segregation in noise, explicit encoding of formant frequency locations within the standard channel selection framework of ACE is suggested. Two strategies using direct formant estimation algorithms are developed within this study: FACE (formant-ACE) and VFACE (voiced-activated-formant-ACE). Speech intelligibility from eight CI users is compared across 11 acoustic conditions, including mixtures of noise and reverberation at multiple SNRs. Significant intelligibility gains were observed with VFACE over ACE in 5 dB babble noise; however, results with FACE/VFACE in all other conditions were comparable to standard ACE. An increased selection of channels associated with the second formant frequency is observed for FACE and VFACE. Both proposed methods may serve as potential supplementary channel selection techniques for the ACE sound processing strategy for cochlear implants.
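
The sketch below shows the "n-of-m" selection that ACE performs each stimulation cycle and a formant-priority variant in the spirit of FACE/VFACE: reserve slots for the channels nearest the estimated formants, then fill the rest with the highest-energy channels. Formant estimation itself is omitted, and the channel count and numbers are illustrative.

```python
# n-of-m channel selection with optional forced inclusion of formant channels.
import numpy as np

def select_channels(channel_energies, n, formant_channels=()):
    """Return the indices of the n channels to stimulate in this cycle."""
    forced = [c for c in formant_channels if c < len(channel_energies)][:n]
    remaining = np.argsort(channel_energies)[::-1]            # all channels, highest energy first
    picked = list(dict.fromkeys(list(forced) + list(remaining)))[:n]
    return sorted(picked)

energies = np.random.default_rng(3).random(22)                # e.g. 22 filterbank channels
print(select_channels(energies, n=8, formant_channels=(4, 11)))
```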

Journal ArticleDOI
TL;DR: The developed AFM signal model is found to be suitable for representation of an entire speech phoneme irrespective of its time duration, and the model is shown to be applicable for low bit-rate speech coding.
Abstract: In this paper, we propose a novel multicomponent amplitude and frequency modulated (AFM) signal model for parametric representation of speech phonemes. An efficient technique is developed for parameter estimation of the proposed model. The Fourier–Bessel series expansion is used to separate a multicomponent speech signal into a set of individual components. The discrete energy separation algorithm is used to extract the amplitude envelope (AE) and the instantaneous frequency (IF) of each component of the speech signal. Then, the parameter estimation of the proposed AFM signal model is carried out by analysing the AE and IF parts of the signal component. The developed model is found to be suitable for representation of an entire speech phoneme (voiced or unvoiced) irrespective of its time duration, and the model is shown to be applicable for low bit-rate speech coding. The symmetric Itakura–Saito and the root-mean-square log-spectral distance measures are used for comparison of the original and reconstructed speech signals.
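
As a concrete illustration of the energy-separation step mentioned above, the sketch below implements a DESA-1-style discrete energy separation using the Teager energy operator to recover the amplitude envelope and instantaneous frequency of a single AM-FM component; the Fourier-Bessel decomposition and model parameter fitting are omitted, and edge handling is simplified.

```python
# DESA-1-style amplitude/frequency separation of one AM-FM component.
import numpy as np

def teager(x):
    """Teager energy operator Psi[x(n)] = x(n)^2 - x(n-1) x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa1(x):
    y = np.diff(x)                        # backward-difference signal
    psi_x = teager(x)
    psi_y = teager(y)
    num = psi_y[:-1] + psi_y[1:]          # Psi[y(n)] + Psi[y(n+1)], aligned with Psi[x(n)]
    psi_x = psi_x[1:1 + num.size]
    cos_omega = 1.0 - num / (4.0 * psi_x + 1e-12)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))                 # instantaneous frequency (rad/sample)
    ae = np.sqrt(np.abs(psi_x) / np.maximum(1.0 - cos_omega ** 2, 1e-12))  # amplitude envelope
    return ae, omega

fs = 8000
t = np.arange(0, 0.05, 1 / fs)
x = (1 + 0.3 * np.cos(2 * np.pi * 20 * t)) * np.cos(2 * np.pi * 500 * t)  # toy AM-FM component
ae, omega = desa1(x)
```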

Proceedings ArticleDOI
20 May 2018
TL;DR: This work investigates emphatic speech synthesis and control mechanisms in the E2E framework and proposes an E2E-based method for transferring emphasis characteristics between speakers; experiments indicate the effectiveness of the proposed model.
Abstract: End-to-end text-to-speech (E2E TTS) synthesis has achieved great success. This work investigates emphatic speech synthesis and control mechanisms in the E2E framework and proposes an E2E-based method for transferring emphasis characteristics between speakers. Characteristic differences between emphatic and neutral speech are learned from a small-scale corpus containing parallel neutral and emphatic speech utterances recorded by one speaker and are further transferred to another speaker, so that we can generate emphatic speech with the latter speaker's voice. An emphasis embedding is injected into the encoder of the extended E2E TTS model to capture the aforementioned differences, while the decoder and attention module are used to decode those differences into synthetic neutral / emphatic speech. Speaker codes linked to the decoder and attention module give the E2E model the ability to transfer characteristics between speakers. To control the emphatic strength, an encoder memory manipulation mechanism is proposed. Experimental results indicate the effectiveness of our proposed model.

Journal ArticleDOI
Zhen Ma
TL;DR: An arbitrary-location pulse determination algorithm based on multipulse linear prediction coding (MP-LPC) is presented that can determine all the amplitudes of the pulses at one time for given pulse locations without the use of analysis-by-synthesis.
Abstract: An arbitrary-location pulse determination algorithm based on multipulse linear prediction coding (MP-LPC) is presented. This algorithm can determine all the amplitudes of the pulses at one time for given pulse locations without the use of analysis-by-synthesis. This ensures that the pulses are optimal in a least-squares sense, providing the theoretical foundation for improving the quality of synthesized speech. A fixed-location pulse linear prediction coding (FLP-LPC) method is proposed based on the arbitrary-location pulse determination algorithm. Simulation in MATLAB showed the superior quality of speech synthesized using pulses in different locations and processed with the arbitrary-location pulse determination algorithm. The algorithm improved speech quality without increasing coding time, which was approximately 1.5% of the coding time of MP-LPC. Pulse locations in FLP-LPC are fixed and do not need to be transmitted, with only the LSF, gain, and 16 pulse amplitudes requiring coding and transmission. FLP-LPC allows the generation of synthesized speech similar to G.729-coded speech at a rate of 2.5 kbps.
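
The least-squares idea described above can be sketched as follows: with the pulse locations fixed, each pulse contributes a shifted copy of the LP synthesis filter's impulse response, so all pulse amplitudes are found in a single linear least-squares solve rather than by analysis-by-synthesis. The function below is an illustrative reconstruction, not the author's code; perceptual weighting is omitted and the LP coefficients are placeholders.

```python
# Joint least-squares solution for multipulse amplitudes at given pulse locations.
import numpy as np
from scipy.signal import lfilter

def pulse_amplitudes(target, lpc_a, locations, frame_len):
    """Solve min_g || target - sum_k g_k * h(n - m_k) ||^2 for the pulse gains g.

    target: e.g. the (perceptually weighted) speech frame; lpc_a: [1, a1, ..., ap].
    """
    impulse = np.eye(1, frame_len)[0]
    h = lfilter([1.0], lpc_a, impulse)                    # impulse response of 1/A(z), truncated
    H = np.zeros((frame_len, len(locations)))
    for k, m in enumerate(locations):
        H[m:, k] = h[:frame_len - m]                      # shifted impulse responses as columns
    g, *_ = np.linalg.lstsq(H, target, rcond=None)
    return g
```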

Journal ArticleDOI
TL;DR: The experimental results prove that the presented AMR-WB steganography can provide higher and more flexible embedding capacity without inducing perceptible distortion compared with state-of-the-art methods.
Abstract: Steganography is a means of covert communication that reveals neither the occurrence nor the real purpose of communication. The adaptive multirate wideband (AMR-WB) codec is widely adopted in mobile handsets and is also the recommended speech codec for VoLTE. In this paper, a novel AMR-WB speech steganography is proposed based on a diameter-neighbor codebook partition algorithm. Different embedding capacities may be achieved by adjusting the iterative parameters during codebook division. The experimental results prove that the presented AMR-WB steganography can provide higher and more flexible embedding capacity without inducing perceptible distortion compared with state-of-the-art methods. With 48 iterations of cluster merging, twice the embedding capacity of the complementary-neighbor-vertices-based embedding method may be obtained with a decrease of only around 2% in speech quality and much the same undetectability. Moreover, both the quality of the stego speech and the security with respect to statistical steganalysis are better than those of the recent speech steganography based on neighbor-index-division codebook partition.

Proceedings ArticleDOI
01 Dec 2018
TL;DR: This paper reconsiders the use of MOS naturalness as an instrument for measuring the quality (vs. intelligibility) of speech and considers an earlier proposed alternative, the paired comparison or “AB” test, and presents new empirical evidence that this is indeed a better method for evaluating TTS quality.
Abstract: This paper reconsiders the use of MOS naturalness as an instrument for measuring the quality (vs. intelligibility) of speech. We reconsider an earlier proposed alternative, the paired comparison or “AB” test, and present new empirical evidence that this is indeed a better method for evaluating TTS quality. Using this, we evaluate three older TTS systems along with a recent deep-learning approach against native North-American and Indian speech and show that, in fact, TTS had already crossed the threshold of human-like speech synthesis some time ago. This suggests that a systematic reappraisal of the concept of abstract “naturalness” of speech is in order.
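
For reference, a paired-comparison ("AB") test is typically scored as sketched below: count how often listeners prefer one system over the other and test the preference rate against chance. The counts here are invented purely for illustration and are not results from the paper.

```python
# Scoring an AB preference test with a binomial test against the 50% chance level.
from scipy.stats import binomtest

prefers_system_a = 132            # hypothetical count of trials won by system A
trials = 200                      # hypothetical total number of paired trials
result = binomtest(prefers_system_a, trials, p=0.5)
print(prefers_system_a / trials, result.pvalue)   # preference rate and its significance
```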

Proceedings ArticleDOI
01 Dec 2018
TL;DR: Experimental results show that the proposed coding technique can transmit speech audio waveforms at 50% of their original bit rate, and the WaveNet-based speech coder remains effective for unknown speakers.
Abstract: This paper presents a WaveNet-based zero-delay lossless speech coding technique for high-quality communications. The WaveNet generative model, which is a state-of-the-art model for neural-network-based speech waveform synthesis, is used in both the encoder and decoder. In the encoder, discrete speech signals are losslessly compressed using sample-by-sample entropy coding. The decoder fully reconstructs the original speech signals from the compressed signals without algorithmic delay. Experimental results show that the proposed coding technique can transmit speech audio waveforms at 50% of their original bit rate, and the WaveNet-based speech coder remains effective for unknown speakers.
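
The sketch below illustrates the coding principle rather than the system itself: a predictive model assigns a probability to each next sample given the past, and an ideal entropy coder spends about -log2 p bits on it. The placeholder uniform model costs 8 bits per 8-bit sample; a strong predictive model such as WaveNet drives the average cost down, consistent with the roughly 50% rate reported above.

```python
# Ideal bit cost of predictive sample-by-sample entropy coding.
import numpy as np

def ideal_bits(samples, predict_proba):
    """Sum of -log2 p(x_n | x_<n): the bit cost an ideal entropy coder would achieve."""
    bits = 0.0
    for n, x in enumerate(samples):
        p = predict_proba(samples[:n])              # length-256 probability vector over mu-law levels
        bits += -np.log2(p[x])
    return bits

uniform = lambda history: np.full(256, 1.0 / 256)   # placeholder model: exactly 8 bits/sample
samples = np.random.default_rng(0).integers(0, 256, 100)
print(ideal_bits(samples, uniform) / len(samples))  # ~8.0; a sharp predictive model lowers this
```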

Proceedings ArticleDOI
22 Mar 2018
TL;DR: In this study, speech recognition systems and the methods used in the literature were reviewed, and a Turkish speech recognition application was developed, achieving 91% correct recognition with the SM-SVM classifier and 71% with the LS-SVM classifier.
Abstract: Speech recognition systems aim to make human-machine communication quick and easy. In recent years, various studies have been carried out to develop speech recognition systems; examples include speech recognition, speaker recognition, and speaker verification. In this study, speech recognition systems and the methods used in the literature were reviewed, and a Turkish speech recognition application was developed. The application consists of speech coding and speech recognition. First, 20 Turkish words that are frequently used on the computer were determined. Twenty recordings were made of each word, so a total of 400 word utterances were recorded on the computer with a microphone. In the speech coding part of the application, these recordings are encoded by the Linear Predictive Coding (LPC) method and the LPC parameters for each word are obtained. In the speech recognition part, the Support Vector Machines (SVM) method is used. Two types of SVM classifiers are designed: the Soft Margin SVM (SM-SVM) classifier and the Least Squares SVM (LS-SVM) classifier. Classification consists of training and testing stages; of the 400 coded words, 200 were used for the training phase and 200 for the testing phase. As a result, 91% correct recognition was achieved with the SM-SVM classifier and 71% with the LS-SVM classifier.
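
A hedged sketch of this pipeline is given below: frame-wise LPC coefficients obtained with the autocorrelation method (Levinson-Durbin) are pooled into one feature vector per recorded word and fed to a soft-margin SVM. Frame length, model order and the pooling scheme are illustrative assumptions; corpus loading is omitted.

```python
# LPC feature extraction per word, to be classified with an SVM.
import numpy as np

def lpc(frame, order=10):
    """Levinson-Durbin solution of the autocorrelation normal equations; returns a1..ap."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1); a[0] = 1.0
    err = r[0] + 1e-12
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        err *= (1.0 - k * k)
    return a[1:]

def word_features(signal, frame_len=240, order=10):
    """Mean and standard deviation of frame-wise LPC coefficients over one word recording."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, frame_len)]
    coeffs = np.array([lpc(f * np.hamming(frame_len), order) for f in frames])
    return np.concatenate([coeffs.mean(axis=0), coeffs.std(axis=0)])

# Classification (corpus loading not shown):
# X = np.array([word_features(sig) for sig in recordings]); y = np.array(labels)
# from sklearn.svm import SVC; clf = SVC(kernel='linear', C=1.0).fit(X, y)   # soft-margin SVM
```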

Proceedings ArticleDOI
01 Sep 2018
TL;DR: The article presents quality-assessment scores for speech signals transmitted via the DAB+ system, obtained with the ACR procedure, together with a comparative analysis of the subjective assessment of speech encoded with the AAC and HE-AAC techniques.
Abstract: The article presents quality-assessment scores for speech signals transmitted via the DAB+ system. The subjective tests were conducted using the ACR procedure, in accordance with the International Telecommunication Union recommendation, and the results are presented as MOS values for various bit rates. A comparative analysis of the subjective assessment of speech encoded with the AAC and HE-AAC techniques is presented. The tests were carried out on testing material prepared for Polish speech.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: A secure steganography scheme for SILK is proposed that embeds secret messages by modifying the LSF (Line Spectral Frequency) quantization indices based on the statistical distribution of the LSF codebook; to the best of the authors' knowledge, this is the first work to hide information in SILK.
Abstract: SILK is a speech codec for real-time packet-based voice communications that is widely used in many popular mobile Internet applications, such as Skype, WeChat, QQ, and WhatsApp, which makes it a novel and ideal carrier for information hiding. In this paper, a secure steganography scheme for SILK is proposed, which embeds secret messages by modifying the LSF (Line Spectral Frequency) quantization indices based on the statistical distribution of the LSF codebook. The experimental results show that the auditory concealment of the proposed scheme is excellent and the decrease in PESQ is very small. The average hiding capacity reaches 129 bps and 223 bps at sampling rates of 8 kHz and 16 kHz, respectively. More importantly, the proposed scheme has good statistical security: the statistical distribution of the LSF codebook is used as a constraint so that the distribution of the stego codewords stays close to that of the cover audio. Under a steganalysis scheme adapted from the existing steganalysis scheme for G.723.1, the average correct detection rate is under 55.4% for both cover and stego audio. To the best of our knowledge, this is the first work to hide information in SILK. Based on the similar principles of speech compression, the method can be extended to other CELP codecs, such as G.723.1, G.729, and AMR.
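
The sketch below illustrates codebook-partition embedding in its simplest form: the LSF codebook is split into two groups, the quantizer is restricted to the group matching the bit to be hidden, and the receiver recovers the bit from the group of the received index. A plain parity split is used here for clarity; the paper's distribution-aware partition is not reproduced.

```python
# Toy codebook-partition steganography on LSF quantization indices.
import numpy as np

def embed_bit(lsf_vector, codebook, bit):
    """Quantize lsf_vector using only codewords whose index parity equals `bit` (0 or 1)."""
    indices = np.arange(len(codebook))
    candidates = indices[indices % 2 == bit]
    dists = np.linalg.norm(codebook[candidates] - lsf_vector, axis=1)
    return int(candidates[np.argmin(dists)])     # transmitted quantization index

def extract_bit(index):
    """Receiver side: the hidden bit is the group (here, parity) of the received index."""
    return index % 2
```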

Proceedings ArticleDOI
Tom Bäckström
01 Oct 2018
TL;DR: The opportunities and challenges are summarized in three concepts: collaboration, unification, and privacy; collaboration will increase the demand for privacy protection in speech interfaces, and it is likely that technologies for supporting privacy and generating trust will be in high demand.
Abstract: Recent speech and audio coding standards such as 3GPP Enhanced Voice Services match the foreseeable needs and requirements for the transmission of speech and audio when using current transmission infrastructure and applications. Trends in Internet-of-Things technology and developments in personal digital assistants (PDAs), however, prompt us to consider future requirements for speech and audio codecs. The opportunities and challenges are here summarized in three concepts: collaboration, unification and privacy. First, an increasing number of devices will in the future be speech-operated, whereby the ability to focus voice commands on a specific device becomes essential. We therefore need methods which allow collaboration between devices, such that ambiguities can be resolved. Second, such collaboration can be achieved with a unified and standardized communication protocol between voice-operated devices. To achieve such collaboration protocols, we need to develop distributed speech coding technology for ad-hoc IoT networks. Finally, however, collaboration will increase the demand for privacy protection in speech interfaces, and it is therefore likely that technologies for supporting privacy and generating trust will be in high demand.

Journal ArticleDOI
TL;DR: The ECC strategy represents a neuro-physiological approach that could potentially improve the perception of more complex sound patterns with cochlear implants, possibly due to ECC’s ability to present more of the input spectral contents compared to ACE, which is restricted to a fixed number of maxima.
Abstract: A novel cochlear implant coding strategy based on neural excitability has been developed and implemented using Matlab/Simulink. Unlike present-day coding strategies, the Excitability Controlled Coding (ECC) strategy uses a model of the excitability state of the target neural population to determine its stimulus selection, with the aim of more efficient stimulation as well as reduced channel interaction. Central to the ECC algorithm is an excitability state model, which takes into account the supposed refractory behaviour of the stimulated neural populations. The excitability state, used to weight the input signal for selecting the stimuli, is estimated and updated after the presentation of each stimulus, and used iteratively in selecting the next stimulus. Additionally, ECC regulates the frequency of stimulation on a given channel as a function of the corresponding input stimulus intensity. Details of the model, the implementation, and the results of benchtop and subjective tests are presented and discussed. Compared to the Advanced Combination Encoder (ACE) strategy, ECC produces a better spectral representation of an input signal and can potentially reduce channel interactions. Pilot test results from 4 CI recipients suggest that ECC may have some advantage over ACE in complex situations such as speech in noise, possibly due to ECC's ability to present more of the input spectral contents compared to ACE, which is restricted to a fixed number of maxima. The ECC strategy represents a neuro-physiological approach that could potentially improve the perception of more complex sound patterns with cochlear implants.
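
An illustrative sketch of the excitability-weighted selection loop described above follows. Each channel keeps an excitability state that collapses when the channel is stimulated and recovers over subsequent pulses (a crude stand-in for refractoriness), and the next stimulus goes to the channel with the highest excitability-weighted input. The recovery constant and update rule are assumptions for illustration, not the published ECC model.

```python
# Excitability-weighted stimulus selection for one analysis frame.
import numpy as np

def ecc_cycle(envelopes, n_pulses, recovery=0.2):
    """envelopes: per-channel input intensities; returns the channels stimulated in this cycle."""
    excitability = np.ones_like(envelopes)
    selected = []
    for _ in range(n_pulses):
        ch = int(np.argmax(excitability * envelopes))
        selected.append(ch)
        excitability[ch] = 0.0                                      # just stimulated -> refractory
        excitability = np.minimum(1.0, excitability + recovery)     # gradual recovery on all channels
    return selected

print(ecc_cycle(np.random.default_rng(7).random(22), n_pulses=8))
```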