
Showing papers on "Speech coding" published in 2016


Journal ArticleDOI
TL;DR: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of realtime applications using speech and showed that it was superior to the other systems in terms of both sound quality and processing speed.
Abstract: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of realtime applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. This new speech synthesis system offers not only high sound quality but also fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output with natural speech, including consonants. Its processing speed was also compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real time factor (RTF) indicated that it was fast enough for real-time processing. key words: speech analysis, speech synthesis, vocoder, sound quality, realtime processing

1,025 citations
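As a rough illustration of the analysis/synthesis pipeline and the real-time factor measurement discussed above, here is a minimal sketch using the pyworld Python bindings of WORLD (dio/stonemask/cheaptrick/d4c/synthesize); the input file name is a placeholder and the exact analysis settings are assumptions, not the paper's configuration.

```python
# Minimal WORLD analysis/resynthesis sketch with RTF measurement.
# Assumes the pyworld bindings and a mono WAV file "speech.wav" (hypothetical path).
import time

import numpy as np
import soundfile as sf   # pip install soundfile
import pyworld as pw     # pip install pyworld

x, fs = sf.read("speech.wav")          # pyworld expects a float64 mono signal
x = np.ascontiguousarray(x, dtype=np.float64)

start = time.perf_counter()
f0, t = pw.dio(x, fs)                  # raw F0 estimation
f0 = pw.stonemask(x, f0, t, fs)        # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)       # spectral envelope
ap = pw.d4c(x, f0, t, fs)              # aperiodicity
y = pw.synthesize(f0, sp, ap, fs)      # resynthesis
elapsed = time.perf_counter() - start

rtf = elapsed / (len(x) / fs)          # real-time factor: processing time / audio duration
print(f"RTF = {rtf:.3f} (values below 1.0 indicate faster than real time)")
```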


Journal ArticleDOI
TL;DR: The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
Abstract: Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech, and enhance only the magnitude spectrum while leaving the phase spectrum unchanged. This is done because there was a belief that the phase spectrum is unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider magnitude and phase spectrum enhancements. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them to related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.

699 citations
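For readers unfamiliar with the complex-domain target used above, the sketch below computes the complex ideal ratio mask (cIRM) from paired clean/noisy STFTs with numpy; the STFT settings and the random signals are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: complex ideal ratio mask (cIRM) from clean speech S and noisy mixture Y,
# defined so that S = M * Y under complex multiplication.
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)                 # placeholder "clean speech"
noisy = clean + 0.5 * rng.standard_normal(fs)   # placeholder noisy mixture

_, _, S = stft(clean, fs=fs, nperseg=512)
_, _, Y = stft(noisy, fs=fs, nperseg=512)

eps = 1e-8
denom = Y.real**2 + Y.imag**2 + eps
Mr = (Y.real * S.real + Y.imag * S.imag) / denom   # real part of the cIRM
Mi = (Y.real * S.imag - Y.imag * S.real) / denom   # imaginary part of the cIRM

# Applying the mask recovers the clean STFT (up to eps): S_hat = (Mr + 1j*Mi) * Y.
S_hat = (Mr + 1j * Mi) * Y
_, s_hat = istft(S_hat, fs=fs, nperseg=512)
print("max reconstruction error:", np.max(np.abs(S_hat - S)))
```

A DNN would be trained to predict Mr and Mi from features of the noisy input; the formulas above only define the training target.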


Journal ArticleDOI
TL;DR: It is shown that phase-aware signal processing is an important emerging field with high potential in current speech communication applications and can complement the possible solutions that magnitude-only methods suggest.

126 citations


Posted Content
TL;DR: This paper deals with improving speech quality in an office environment where multiple stationary as well as non-stationary noises can be simultaneously present in speech, and proposes several strategies based on Deep Neural Networks for speech enhancement in these scenarios.
Abstract: In this paper we consider the problem of speech enhancement in real-world like conditions where multiple noises can simultaneously corrupt speech. Most of the current literature on speech enhancement focuses primarily on the presence of a single noise in corrupted speech, which is far from real-world environments. Specifically, we deal with improving speech quality in an office environment where multiple stationary as well as non-stationary noises can be simultaneously present in speech. We propose several strategies based on Deep Neural Networks (DNN) for speech enhancement in these scenarios. We also investigate a DNN training strategy based on psychoacoustic models from speech coding for enhancement of noisy speech.

91 citations
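A minimal sketch of the data-preparation step implied above, mixing several simultaneous noise sources into clean speech at chosen SNRs; all signals and SNR values are placeholders.

```python
# Sketch: corrupt clean speech with several simultaneous noises at given per-noise SNRs.
import numpy as np

def scale_noise(speech, noise, snr_db):
    """Return `noise` scaled so that the speech-to-noise ratio is `snr_db` dB."""
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2) + 1e-12
    return noise * np.sqrt(p_speech / (p_noise * 10.0**(snr_db / 10.0)))

rng = np.random.default_rng(0)
fs = 16000
speech = rng.standard_normal(3 * fs)                  # placeholder clean speech
noises = {"keyboard": rng.standard_normal(3 * fs),    # placeholder stationary noise
          "babble":   rng.standard_normal(3 * fs)}    # placeholder non-stationary noise

# Multiple noises corrupt the same utterance simultaneously, each at 5 dB SNR here.
noisy = speech + sum(scale_noise(speech, n, snr_db=5.0) for n in noises.values())
```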



Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper proposes to enhance the noisy and reverberant speech by learning a mapping to reverberant target speech rather than anechoic target speech, and develops a masking-based method for denoising and compares it with the spectral mapping method.
Abstract: In the real world, speech is usually distorted by both reverberation and background noise. In such conditions, speech intelligibility is degraded substantially, especially for hearing-impaired (HI) listeners. As a consequence, it is essential to enhance speech in the noisy and reverberant environment. Recently, deep neural networks have been introduced to learn a spectral mapping to enhance corrupted speech, and have shown significant improvements in objective metrics and automatic speech recognition scores. However, listening tests have not yet shown any speech intelligibility benefit. In this paper, we propose to enhance the noisy and reverberant speech by learning a mapping to reverberant target speech rather than anechoic target speech. A preliminary listening test was conducted, and the results show that the proposed algorithm is able to improve speech intelligibility of HI listeners in some conditions. Moreover, we develop a masking-based method for denoising and compare it with the spectral mapping method. Evaluation results show that the masking-based method outperforms the mapping-based method.

72 citations
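The masking-based alternative mentioned above is commonly trained against an ideal ratio mask (IRM); the sketch below shows the target computation and its application under illustrative STFT settings, not the authors' exact configuration.

```python
# Sketch: ideal ratio mask (IRM) target and masking-based enhancement.
import numpy as np
from scipy.signal import stft, istft

fs, nper = 16000, 512
rng = np.random.default_rng(1)
speech = rng.standard_normal(2 * fs)       # placeholder target speech (anechoic or reverberant)
noise = rng.standard_normal(2 * fs)        # placeholder noise
noisy = speech + noise

_, _, S = stft(speech, fs=fs, nperseg=nper)
_, _, N = stft(noise, fs=fs, nperseg=nper)
_, _, Y = stft(noisy, fs=fs, nperseg=nper)

irm = np.sqrt(np.abs(S)**2 / (np.abs(S)**2 + np.abs(N)**2 + 1e-12))   # training target in [0, 1]

# At test time a DNN would predict `irm` from features of Y; here the ideal mask is applied
# to the noisy magnitude while keeping the noisy phase.
enhanced_stft = irm * np.abs(Y) * np.exp(1j * np.angle(Y))
_, enhanced = istft(enhanced_stft, fs=fs, nperseg=nper)
```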


Journal ArticleDOI
TL;DR: The proposed technique, called the separable deep auto encoder (SDAE), confines the clean speech reconstruction to the convex hull spanned by a pre-trained speech dictionary, given the under-determined nature of the optimization problem.
Abstract: Unseen noise estimation is a key yet challenging step to make a speech enhancement algorithm work in adverse environments. At worst, the only prior knowledge we know about the encountered noise is that it is different from the involved speech. Therefore, by subtracting the components which cannot be adequately represented by a well-defined speech model, the noises can be estimated and removed. Given the good performance of deep learning in signal representation, a deep auto encoder (DAE) is employed in this work for accurately modeling the clean speech spectrum. In the subsequent stage of speech enhancement, an extra DAE is introduced to represent the residual part obtained by subtracting the estimated clean speech spectrum (by using the pre-trained DAE) from the noisy speech spectrum. By adjusting the estimated clean speech spectrum and the unknown parameters of the noise DAE, one can reach a stationary point to minimize the total reconstruction error of the noisy speech spectrum. The enhanced speech signal is thus obtained by transforming the estimated clean speech spectrum back into the time domain. The proposed technique is called the separable deep auto encoder (SDAE). Given the under-determined nature of the above optimization problem, the clean speech reconstruction is confined in the convex hull spanned by a pre-trained speech dictionary. New learning algorithms are investigated to respect the non-negativity of the parameters in the SDAE. Experimental results on TIMIT with 20 noise types at various noise levels demonstrate the superiority of the proposed method over the conventional baselines.

69 citations


Proceedings ArticleDOI
08 Sep 2016
TL;DR: In this article, the authors considered the problem of speech enhancement in real-world like conditions where multiple noises can simultaneously corrupt speech and proposed several strategies based on Deep Neural Networks (DNN) for speech enhancement.
Abstract: In this paper we consider the problem of speech enhancement in real-world like conditions where multiple noises can simultaneously corrupt speech. Most of the current literature on speech enhancement focuses primarily on the presence of a single noise in corrupted speech, which is far from real-world environments. Specifically, we deal with improving speech quality in an office environment where multiple stationary as well as non-stationary noises can be simultaneously present in speech. We propose several strategies based on Deep Neural Networks (DNN) for speech enhancement in these scenarios. We also investigate a DNN training strategy based on psychoacoustic models from speech coding for enhancement of noisy speech.

67 citations


Journal ArticleDOI
TL;DR: It is found that real-time synthesis of vowels and consonants was possible with good intelligibility, opening the way to future speech BCI applications using such an articulatory-based speech synthesizer.
Abstract: Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real-time. To reach this goal, a prerequisite is to develop a speech synthesizer producing intelligible speech in real-time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real-time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed using a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded on a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the position of sensors glued on different speech articulators into acoustic parameters that are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained as assessed by perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed the real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results open the way to future speech BCI applications using such an articulatory-based speech synthesizer.

66 citations


Posted Content
TL;DR: An overview of Speex, the technology involved in it and how it can be used in applications is presented.
Abstract: The Speex project was started in 2002 to address the need for a free, open-source speech codec. Speex is based on the Code Excited Linear Prediction (CELP) algorithm and, unlike the previously existing Vorbis codec, is optimised for transmitting speech for low latency communication over an unreliable packet network. This paper presents an overview of Speex, the technology involved in it and how it can be used in applications. The most recent developments in Speex, such as the fixed-point port, acoustic echo cancellation and noise suppression are also addressed.

52 citations


Journal ArticleDOI
TL;DR: In this paper, a quantum representation of digital audio (QRDA) is proposed to present quantum audio, which uses two entangled qubit sequences to store the audio amplitude and time information.
Abstract: Multimedia refers to content that uses a combination of different content forms. It includes two main media: image and audio. However, in contrast with the rapid development of quantum image processing, quantum audio has almost never been studied. In order to change this status, a quantum representation of digital audio (QRDA) is proposed in this paper to represent quantum audio. QRDA uses two entangled qubit sequences to store the audio amplitude and time information. The two qubit sequences are both in the basis states |0〉 and |1〉. The preparation of a QRDA audio from the initial state |0〉 is given, to store an audio signal on quantum computers. Then some exemplary quantum audio processing operations are performed to indicate QRDA's usability.
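To make the representation concrete, here is a small classical simulation (a sketch of the general idea, not the paper's preparation circuits) that builds a QRDA-style superposition (1/√N) Σ_t |a_t⟩|t⟩ as a numpy state vector, with the quantized amplitude and the time index both encoded in computational basis states.

```python
# Sketch: classical simulation of a QRDA-like state (1/sqrt(N)) * sum_t |a_t>|t>,
# with q qubits for the quantized amplitude and n qubits for the time index.
import numpy as np

samples = np.array([0.0, 0.5, -0.25, 1.0])   # placeholder audio, values in [-1, 1]
q, n = 3, 2                                  # amplitude qubits, time qubits (2**n == len(samples))
N = len(samples)

# Quantize amplitudes to q-bit unsigned integers.
levels = 2**q - 1
a = np.round((samples + 1.0) / 2.0 * levels).astype(int)

state = np.zeros(2**(q + n), dtype=complex)
for t in range(N):
    index = a[t] * 2**n + t                  # basis index of |a_t>|t>
    state[index] = 1.0 / np.sqrt(N)

assert np.isclose(np.vdot(state, state).real, 1.0)   # valid quantum state (unit norm)
```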

Journal ArticleDOI
TL;DR: In this work, Mel-frequency cepstral coefficient (MFCC) features are extracted for each speech sample in both the training and test sets, and a Gaussian mixture model (GMM) is used to classify the speech by accent.
Abstract: Speech processing is a very important research area; speaker recognition, speech synthesis, speech coding, and speech noise reduction are some of its subareas. Many languages have different speaking styles, called accents or dialects. Identifying the accent before speech recognition can improve the performance of speech recognition systems. The more accents a language has, the more crucial accent recognition becomes. Telugu is an Indian language widely spoken in the southern part of India. The Telugu language has several accents; the main ones are coastal Andhra, Telangana, and Rayalaseema. In the present work, speech samples were collected from native speakers of different Telugu accents for both training and testing. Mel-frequency cepstral coefficient (MFCC) features are extracted for each training and test sample. A Gaussian mixture model (GMM) is then used to classify the speech by accent. The overall accuracy of the proposed system in recognizing a speaker's region of origin from accent is 91%.
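A minimal sketch of the MFCC + GMM pipeline described above, using librosa and scikit-learn; the file lists, number of mixture components, and accent labels are placeholders rather than the authors' setup.

```python
# Sketch: accent identification with per-accent GMMs over MFCC features.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)

# Hypothetical training lists per accent.
train_files = {"coastal_andhra": ["ca_01.wav"],
               "telangana": ["tg_01.wav"],
               "rayalaseema": ["rs_01.wav"]}

models = {}
for accent, files in train_files.items():
    X = np.vstack([mfcc_frames(f) for f in files])
    models[accent] = GaussianMixture(n_components=16, covariance_type="diag", max_iter=200).fit(X)

def classify(path):
    X = mfcc_frames(path)
    # score() returns the average per-frame log-likelihood; pick the best-scoring accent model.
    return max(models, key=lambda accent: models[accent].score(X))

print(classify("test_utterance.wav"))   # hypothetical test file
```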

Patent
29 Feb 2016
TL;DR: In this article, a device detects a wake-up keyword from a received speech signal of a user by using a wake-up keyword model, and transmits a wake-up keyword detection/non-detection signal and the received speech signal of the user to a speech recognition server.
Abstract: A device detects a wake-up keyword from a received speech signal of a user by using a wake-up keyword model, and transmits a wake-up keyword detection/non-detection signal and the received speech signal of the user to a speech recognition server. The speech recognition server performs a recognition process on the speech signal of the user by setting a speech recognition model according to the detection or non-detection of the wake-up keyword.

Patent
08 Jun 2016
TL;DR: In this article, a system and method for recognizing mixed speech from a source is presented, in which two neural networks are trained to recognize the speech of the speakers with higher and lower levels of a speech characteristic, and decoding optimizes the joint likelihood of the two speech signals while considering the probability that a specific frame is a switching point of the speech characteristic.
Abstract: The claimed subject matter includes a system and method for recognizing mixed speech from a source. The method includes training a first neural network to recognize the speech signal spoken by the speaker with a higher level of a speech characteristic from a mixed speech sample. The method also includes training a second neural network to recognize the speech signal spoken by the speaker with a lower level of the speech characteristic from the mixed speech sample. Additionally, the method includes decoding the mixed speech sample with the first neural network and the second neural network by optimizing the joint likelihood of observing the two speech signals considering the probability that a specific frame is a switching point of the speech characteristic.

Patent
10 Jun 2016
TL;DR: In this paper, an audio diarization system segments the audio input into speech and non-speech segments, and these segments are convolved with one or more head related transfer functions (HRTFs) so the sounds localize to different sound localization points (SLPs) for the user.
Abstract: Speech and/or non-speech in an audio input are convolved to localize sounds to different locations for a user. An audio diarization system segments the audio input into speech and non-speech segments. These segments are convolved with one or more head related transfer functions (HRTFs) so the sounds localize to different sound localization points (SLPs) for the user.
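A minimal sketch of the convolution step described above: a mono segment is convolved with a left/right head-related impulse response (HRIR) pair to localize it to one sound localization point; the HRIR arrays below are synthetic placeholders, not measured responses.

```python
# Sketch: localize a mono segment by convolving it with a left/right HRIR pair.
import numpy as np
from scipy.signal import fftconvolve

fs = 44100
rng = np.random.default_rng(2)
segment = rng.standard_normal(fs)          # placeholder mono speech or non-speech segment

# Placeholder HRIRs for one sound localization point (real systems load measured HRTF sets).
hrir_left = rng.standard_normal(256) * np.exp(-np.arange(256) / 32.0)
hrir_right = rng.standard_normal(256) * np.exp(-np.arange(256) / 32.0)

left = fftconvolve(segment, hrir_left, mode="full")
right = fftconvolve(segment, hrir_right, mode="full")
binaural = np.stack([left, right], axis=1)   # (samples, 2) stereo output for headphone playback
```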


Proceedings ArticleDOI
08 Sep 2016
TL;DR: A sliding window deep neural network is presented that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset and outperforms a baseline HMM inversion approach in both objective and subjective evaluations.
Abstract: We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations and perform a thorough analysis of our results.
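To clarify the overlap-and-average step described above, a small numpy sketch: per-window predictions of visual features are laid back onto the frame timeline and averaged wherever windows overlap. The window length, hop, and the predictions themselves are stand-ins for the trained network's output.

```python
# Sketch: average overlapping per-window predictions back into a frame-level trajectory.
import numpy as np

n_frames, feat_dim, win = 100, 20, 11
rng = np.random.default_rng(3)
# Stand-in for DNN outputs: one (win, feat_dim) visual prediction per window position.
window_preds = rng.standard_normal((n_frames - win + 1, win, feat_dim))

accum = np.zeros((n_frames, feat_dim))
counts = np.zeros((n_frames, 1))
for start, pred in enumerate(window_preds):
    accum[start:start + win] += pred
    counts[start:start + win] += 1

trajectory = accum / counts    # smoothly varying visual feature trajectory, (n_frames, feat_dim)
```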

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This paper illustrates the hierarchical structure of audio data, discusses how to classify audio data using an SVM classifier with a Gaussian kernel, and demonstrates that the proposed method achieves higher audio classification accuracy.
Abstract: Audio classification has great theoretical and practical value in both pattern recognition and artificial intelligence. In this paper, we propose a novel audio classification method based on machine learning techniques. Firstly, we illustrate the hierarchical structure of audio data, which is made up of four layers: 1) Audio frame, 2) Audio clip, 3) Audio shot, and 4) Audio high level semantic unit. Secondly, three types of audio features are extracted to construct the feature vector, including 1) Short time energy, 2) Zero crossing rate and 3) Mel-Frequency cepstral coefficients. Thirdly, we discuss how to classify audio data using an SVM classifier with a Gaussian kernel. Finally, experimental results demonstrate that the proposed method is able to achieve higher audio classification accuracy.
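A minimal sketch of the clip-level pipeline described above (short-time energy, zero-crossing rate, MFCCs, then an RBF-kernel SVM), with random placeholder clips in place of a labeled corpus.

```python
# Sketch: clip-level audio classification with energy/ZCR/MFCC features and an RBF-kernel SVM.
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_features(y, sr):
    energy = librosa.feature.rms(y=y)                     # short-time energy (RMS per frame)
    zcr = librosa.feature.zero_crossing_rate(y)           # zero crossing rate per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # MFCCs per frame
    feats = np.vstack([energy, zcr, mfcc])
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])   # clip-level statistics

# Placeholder "clips": in practice these come from a labeled audio corpus.
sr = 16000
rng = np.random.default_rng(4)
clips = [rng.standard_normal(sr) for _ in range(20)]
labels = [i % 2 for i in range(20)]                       # e.g. 0 = speech, 1 = music

X = np.array([clip_features(y, sr) for y in clips])
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, labels)   # Gaussian (RBF) kernel SVM
print(clf.predict(X[:3]))
```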

Journal ArticleDOI
TL;DR: An audio fingerprinting method that adapts findings from the field of blind astrometry to define simple, efficiently representable characteristic feature combinations called quads is proposed.
Abstract: We propose an audio fingerprinting method that adapts findings from the field of blind astrometry to define simple, efficiently representable characteristic feature combinations called quads. Based on these, an audio identification algorithm is described that is robust to noise and severe time-frequency scale distortions and accurately identifies the underlying scale transform factors. The low number and compact representation of content features allows for efficient application of exact fixed-radius near-neighbor search methods for fingerprint matching in large audio collections. We demonstrate the practicability of the method on a collection of 100,000 songs, analyze its performance for a diverse set of noise as well as severe speed, tempo and pitch scale modifications, and identify a number of advantages of our method over two state-of-the-art distortion-robust audio identification algorithms.
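To give an idea of the quad feature mentioned above (a sketch of the general principle, not the paper's exact descriptor or hashing scheme): four spectral peaks A, B, C, D are normalized so that A maps to (0, 0) and B to (1, 1), which makes the coordinates of the inner points invariant to time/frequency translation and per-axis scaling.

```python
# Sketch: translation- and scale-invariant quad descriptor from four spectral peaks.
import numpy as np

def quad_descriptor(A, B, C, D):
    """Normalize peaks (time, freq) so A -> (0, 0) and B -> (1, 1); return (C', D')."""
    A, B, C, D = (np.asarray(p, dtype=float) for p in (A, B, C, D))
    span = B - A
    return np.concatenate([(C - A) / span, (D - A) / span])

# Example peaks as (time_bin, frequency_bin); C and D lie inside the A-B box.
peaks = [(100, 40), (180, 120), (120, 60), (150, 100)]
desc = quad_descriptor(*peaks)

# Scaling time by 1.5x and frequency by 0.8x (plus a shift) leaves the descriptor unchanged.
scale, shift = np.array([1.5, 0.8]), np.array([30, 10])
desc_scaled = quad_descriptor(*(np.array(p) * scale + shift for p in peaks))
assert np.allclose(desc, desc_scaled)
```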

Proceedings ArticleDOI
20 Mar 2016
TL;DR: Experiments on the corpus of the second CHiME speech separation and recognition challenge (task-2) demonstrate the effectiveness of this novel phoneme-specific speech separation method in terms of objective measures of speech intelligibility and quality, as well as recognition performance.
Abstract: Speech separation or enhancement algorithms seldom exploit information about phoneme identities. In this study, we propose a novel phoneme-specific speech separation method. Rather than training a single global model to enhance all the frames, we train a separate model for each phoneme to process its corresponding frames. A robust ASR system is employed to identify the phoneme identity of each frame. This way, the information from ASR systems and language models can directly influence speech separation by selecting a phoneme-specific model to use at the test stage. In addition, phoneme-specific models have fewer variations to model and do not exhibit the data imbalance problem. The improved enhancement results can in turn help recognition. Experiments on the corpus of the second CHiME speech separation and recognition challenge (task-2) demonstrate the effectiveness of this method in terms of objective measures of speech intelligibility and quality, as well as recognition performance.
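A minimal sketch of the test-time dispatch described above: an ASR system labels each frame with a phoneme, and the frame is then processed by the model trained for that phoneme. The "models" and labels below are stand-ins for the trained phoneme-specific DNNs and the ASR output.

```python
# Sketch: route each frame to a phoneme-specific enhancement model chosen by ASR labels.
import numpy as np

feat_dim, n_frames = 64, 50
rng = np.random.default_rng(5)
noisy_frames = rng.standard_normal((n_frames, feat_dim))
asr_labels = rng.choice(["aa", "iy", "s"], size=n_frames)      # stand-in per-frame phoneme labels

# Stand-in "models": one enhancement function per phoneme (a trained DNN in the paper).
models = {ph: (lambda gain: (lambda frame: gain * frame))(g)
          for ph, g in [("aa", 0.9), ("iy", 0.8), ("s", 0.7)]}

enhanced = np.stack([models[ph](frame) for ph, frame in zip(asr_labels, noisy_frames)])
```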

Journal ArticleDOI
TL;DR: In this article, a hybrid approach is proposed combining the generative mixture of Gaussians (MoG) model and the discriminative deep neural network (DNN), which is executed in two phases, the training phase which does not recur, and the test phase.
Abstract: In this paper, we present a single-microphone speech enhancement algorithm. A hybrid approach is proposed merging the generative mixture of Gaussians (MoG) model and the discriminative deep neural network (DNN). The proposed algorithm is executed in two phases, the training phase, which does not recur, and the test phase. First, the noise-free speech log-power spectral density is modeled as an MoG, representing the phoneme-based diversity in the speech signal. A DNN is then trained with phoneme labeled database of clean speech signals for phoneme classification with mel-frequency cepstral coefficients as the input features. In the test phase, a noisy utterance of an untrained speech is processed. Given the phoneme classification results of the noisy speech utterance, a speech presence probability (SPP) is obtained using both the generative and discriminative models. SPP-controlled attenuation is then applied to the noisy speech while simultaneously, the noise estimate is updated. The discriminative DNN maintains the continuity of the speech and the generative phoneme-based MoG preserves the speech spectral structure. Extensive experimental study using real speech and noise signals is provided. We also compare the proposed algorithm with alternative speech enhancement algorithms. We show that we obtain a significant improvement over previous methods in terms of speech quality measures. Finally, we analyze the contribution of all components of the proposed algorithm indicating their combined importance.
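A small sketch of the SPP-controlled attenuation step described above: time-frequency bins with low speech presence probability are attenuated toward a spectral floor while the noise estimate is updated mostly where speech is absent. The SPP values below are random stand-ins for the output of the MoG/DNN hybrid, and the smoothing constants are assumptions.

```python
# Sketch: speech-presence-probability (SPP) controlled attenuation with recursive noise update.
import numpy as np

n_frames, n_bins = 100, 257
rng = np.random.default_rng(6)
noisy_power = rng.random((n_frames, n_bins)) + 0.1      # placeholder |Y|^2 per frame/bin
spp = rng.random((n_frames, n_bins))                    # stand-in SPP from the MoG/DNN hybrid

g_min = 10.0**(-15 / 20.0)                              # spectral floor of -15 dB (assumed)
alpha = 0.9                                             # noise-update smoothing factor (assumed)

noise_est = noisy_power[0].copy()
enhanced_power = np.empty_like(noisy_power)
for t in range(n_frames):
    gain = spp[t] + (1.0 - spp[t]) * g_min              # attenuate only where speech is unlikely
    enhanced_power[t] = gain * noisy_power[t]
    # Update the noise estimate mainly in bins where speech is probably absent.
    noise_est = spp[t] * noise_est + (1.0 - spp[t]) * (alpha * noise_est
                                                       + (1.0 - alpha) * noisy_power[t])
```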

Patent
28 Jun 2016
TL;DR: In this article, a coding scheme for coding a spatially sampled information signal using sub-division and coding schemes for coding an information signal with sub- and multi-tree structures are described.
Abstract: Coding schemes for coding a spatially sampled information signal using sub-division and coding schemes for coding a sub-division or a multitree structure are described, wherein representative embodiments relate to picture and/or video coding applications.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: A speech enhancement algorithm integrating an artificial neural network (NN) into CI coding strategies is proposed, which decomposes the noisy input signal into time-frequency units, extracts a set of auditory-inspired features and feeds them to the NN to produce an estimation of which CI channels contain more perceptually important information.
Abstract: Traditionally, algorithms that attempt to significantly improve speech intelligibility in noise for cochlear implant (CI) users have met with limited success, particularly in the presence of a fluctuating masker. In the present study, a speech enhancement algorithm integrating an artificial neural network (NN) into CI coding strategies is proposed. The algorithm decomposes the noisy input signal into time-frequency units, extracts a set of auditory-inspired features and feeds them to the NN to produce an estimation of which CI channels contain more perceptually important information (higher signal-to-noise ratio, SNR). This estimate is then used accordingly to retain a subset of channels for electrical stimulation, as in traditional n-of-m coding strategies. The proposed algorithm was tested with 10 normal-hearing participants listening to CI noise-vocoder simulations against a conventional Wiener filter based enhancement algorithm. Significant improvements in speech intelligibility in stationary and fluctuating noise were found over both unprocessed and Wiener filter processed conditions.
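A minimal sketch of the n-of-m selection step described above: per-channel SNR estimates (the neural network's output in the study, random stand-ins here) decide which channels are retained for stimulation in each analysis frame.

```python
# Sketch: n-of-m channel selection driven by estimated per-channel SNR.
import numpy as np

m_channels, n_keep, n_frames = 22, 8, 100
rng = np.random.default_rng(7)
envelopes = rng.random((n_frames, m_channels))                 # placeholder channel envelopes
snr_estimates = rng.standard_normal((n_frames, m_channels))    # stand-in NN output (dB)

selected = np.zeros_like(envelopes)
for t in range(n_frames):
    keep = np.argsort(snr_estimates[t])[-n_keep:]       # channels with the highest estimated SNR
    selected[t, keep] = envelopes[t, keep]              # only these channels are stimulated
```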

Proceedings ArticleDOI
01 Dec 2016
TL;DR: Investigation into the use of visual features of lip motion as additional information to improve deep neural network (DNN) based speech enhancement confirms the effectiveness of including visual information in an audio-only speech enhancement framework.
Abstract: This paper proposes a novel framework that integrates audio and visual information for speech enhancement. Most speech enhancement approaches consider audio features only to design filters or transfer functions to convert noisy speech signals to clean ones. Visual data, which provide useful complementary information to audio data, have been integrated with audio data in many speech-related approaches to attain more effective speech processing performance. This paper presents our investigation into the use of the visual features of the motion of lips as additional visual information to improve the speech enhancement capability of deep neural network (DNN) speech enhancement performance. The experimental results show that the performance of DNN with audio-visual inputs exceeds that of DNN with audio inputs only in four standardized objective evaluations, thereby confirming the effectiveness of the inclusion of visual information into an audio-only speech enhancement framework.

Journal ArticleDOI
TL;DR: While error rates increase considerably under degraded speech conditions, large relative equal error rate (EER) reductions were observed when using a PLDA model trained with a large number of degraded sessions per speaker.
Abstract: The state-of-the-art speaker-recognition systems suffer from significant performance loss on degraded speech conditions and acoustic mismatch between enrolment and test phases. Past international evaluation campaigns, such as the NIST speaker recognition evaluation (SRE), have partly addressed these challenges in some evaluation conditions. This work aims at further assessing and compensating for the effect of a wide variety of speech-degradation processes on speaker-recognition performance. We present an open-source simulator generating degraded telephone, VoIP, and interview-speech recordings using a comprehensive list of narrow-band, wide-band, and audio codecs, together with a database of over 60 h of environmental noise recordings and over 100 impulse responses collected from publicly available data. We provide speaker-verification results obtained with an i-vector-based system using either a clean or degraded PLDA back-end on a NIST SRE subset of data corrupted by the proposed simulator. While error rates increase considerably under degraded speech conditions, large relative equal error rate (EER) reductions were observed when using a PLDA model trained with a large number of degraded sessions per speaker.

Journal ArticleDOI
TL;DR: A novel VLBR speech coding framework based on neural networks (NNs) for end-to-end speech analysis and synthesis without HMMs is proposed; listeners significantly prefer the NN-based approach due to fewer discontinuities and speech artifacts in the encoded speech.
Abstract: Most current very low bit rate (VLBR) speech coding systems use hidden Markov model (HMM) based speech recognition and synthesis techniques. This allows transmission of information (such as phonemes) segment by segment, which decreases the bit rate. However, an encoder based on phoneme speech recognition may create bursts of segmental errors; these would be further propagated to any suprasegmental (such as syllable) information coding. Together with the errors of voicing detection in pitch parametrization, HMM-based speech coding leads to speech discontinuities and unnatural speech sound artifacts. In this paper, we propose a novel VLBR speech coding framework based on neural networks (NNs) for end-to-end speech analysis and synthesis without HMMs. The speech coding framework relies on a phonological (subphonetic) representation of speech. It is designed as a composition of deep and spiking NNs: a bank of phonological analyzers at the transmitter, and a phonological synthesizer at the receiver. These are both realized as deep NNs, along with a spiking NN as an incremental and robust encoder of syllable boundaries for coding of continuous fundamental frequency (F0). A combination of phonological features defines many more sound patterns than the phonetic features defined by HMM-based speech coders; this finer analysis/synthesis code contributes to smoother encoded speech. Listeners significantly prefer the NN-based approach due to fewer discontinuities and speech artifacts in the encoded speech. A single forward pass is required during the speech encoding and decoding. The proposed VLBR speech coding operates at a bit rate of approximately 360 bits/s.

Journal ArticleDOI
TL;DR: The improved version of a steganographic algorithm for IP telephony based on approximating the F0 parameter, which is responsible for conveying information about the pitch of the speech signal, yielded a significantly lower decrease in speech quality, when compared with the original version of HideF0.
Abstract: This paper presents an improved version of a steganographic algorithm for IP telephony called HideF0. It is based on approximating the F0 parameter, which is responsible for conveying information about the pitch of the speech signal. The bits saved due to simplification of the pitch contour are used for the hidden transmission. In our experiments, the proposed method was applied to the narrowband Speex codec working in five different modes, with bitrates between 5,950 bps and 24,600 bps. We showed that HideF0 was able to create hidden channels with steganographic bandwidths of around 200 bps at the expense of a steganographic cost of between 0.5 and 0.7 MOS, depending on the Speex mode. Because of placing the approximation flag in the voice packet header, the improved version of the proposed algorithm yielded a significantly lower decrease in speech quality, when compared with the original version of HideF0. In addition, for low bitrates of the hidden channel (i.e., below ca. 50 bps) it was able to operate without introducing any steganographic cost. Copyright © 2016 John Wiley & Sons, Ltd.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: This work investigates a single channel Kalman filter based speech enhancement algorithm, whose parameters are estimated using a codebook based approach, and results indicate that the enhancement algorithm is able to improve the speech intelligibility and quality according to objective measures.
Abstract: Enhancement of speech in non-stationary background noise is a challenging task, and conventional single channel speech enhancement algorithms have not been able to improve the speech intelligibility in such scenarios. The work proposed in this paper investigates a single channel Kalman filter based speech enhancement algorithm, whose parameters are estimated using a codebook based approach. The results indicate that the enhancement algorithm is able to improve the speech intelligibility and quality according to objective measures. Moreover, we investigate the effects of utilizing a speaker specific trained codebook over a generic speech codebook in relation to the performance of the speech enhancement system.
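The paper uses a codebook-driven Kalman filter with autoregressive models of speech and noise; the sketch below shows only the core Kalman recursion for an AR(2) speech model in white noise, with the AR coefficients and variances (which the codebook would supply per frame) left as fixed placeholder constants.

```python
# Sketch: Kalman filtering of noisy speech with an AR(2) speech model.
# In the codebook-based approach the AR coefficients and variances would be
# selected per frame from a trained codebook; here they are fixed placeholders.
import numpy as np

a1, a2 = 1.3, -0.4        # placeholder AR(2) coefficients of the speech model
q = 0.01                  # placeholder excitation (process-noise) variance
r = 0.1                   # placeholder observation (background-noise) variance

F = np.array([[a1, a2],
              [1.0, 0.0]])              # state transition for [s_t, s_{t-1}]
H = np.array([[1.0, 0.0]])              # we observe s_t plus noise
Q = np.array([[q, 0.0], [0.0, 0.0]])

rng = np.random.default_rng(8)
noisy = rng.standard_normal(1000)       # placeholder noisy speech samples

x = np.zeros((2, 1))                    # state estimate
P = np.eye(2)                           # state covariance
enhanced = np.empty_like(noisy)
for t, y in enumerate(noisy):
    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the noisy observation.
    K = P @ H.T / (H @ P @ H.T + r)     # Kalman gain (scalar innovation here)
    x = x + K * (y - (H @ x).item())
    P = (np.eye(2) - K @ H) @ P
    enhanced[t] = x[0, 0]               # current clean-speech estimate
```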

Proceedings ArticleDOI
20 Mar 2016
TL;DR: A segmentation system for multi-genre broadcast audio with deep neural network (DNN) based speech/non-speech detection and a further stage of change-point detection and clustering is used to obtain homogeneous segments.
Abstract: Automatic segmentation is a crucial initial processing step for processing multi-genre broadcast (MGB) audio. It is very challenging since the data exhibits a wide range of both speech types and background conditions with many types of non-speech audio. This paper describes a segmentation system for multi-genre broadcast audio with deep neural network (DNN) based speech/non-speech detection. A further stage of change-point detection and clustering is used to obtain homogeneous segments. Suitable DNN inputs, context window sizes and architectures are studied with a series of experiments using a large corpus of MGB television audio. For MGB transcription, the improved segmenter yields roughly half the increase in word error rate, over manual segmentation, compared to the baseline DNN segmenter supplied for the 2015 ASRU MGB challenge.
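A small sketch of the segmentation logic described above: frame-level speech/non-speech posteriors (the DNN output, random stand-ins here) are smoothed and thresholded, and contiguous speech runs become segments. The change-point detection and clustering stage is omitted, and the frame shift and filter length are assumptions.

```python
# Sketch: turn frame-level speech posteriors into speech segments by smoothing + thresholding.
import numpy as np
from scipy.ndimage import median_filter

frame_shift = 0.01                          # 10 ms frames (assumed)
rng = np.random.default_rng(9)
posteriors = rng.random(3000)               # stand-in DNN speech posteriors per frame

speech = median_filter(posteriors, size=51) > 0.5    # smooth, then threshold

# Extract contiguous speech runs as (start_time, end_time) segments.
changes = np.flatnonzero(np.diff(speech.astype(int)))
bounds = np.concatenate(([0], changes + 1, [len(speech)]))
segments = [(s * frame_shift, e * frame_shift)
            for s, e in zip(bounds[:-1], bounds[1:]) if speech[s]]
print(f"{len(segments)} speech segments found")
```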

08 Oct 2016
TL;DR: Introduction to audio analysis.
Abstract: Introduction to audio analysis.