Proceedings ArticleDOI

Improving of Open-Set Language Identification by Using Deep SVM and Thresholding Functions

01 Oct 2017-pp 796-802
TL;DR: This paper proposes a deep SVM based LID back-end system to improve target-language identification, and defines three OOS thresholding formulations used to decide whether a speech segment belongs to a target or an OOS language.
Abstract: State-of-the-art language identification (LID) systems are based on an iVector feature extractor front-end followed by a multi-class recognition back-end. Identification accuracy degrades considerably when LID systems face open-set languages. Compared to the in-set identification task, the open-set task better mimics the real challenge of language identification. In this paper, we propose an approach to the problem of out-of-set (OOS) data detection in the context of open-set language identification with zero knowledge of OOS languages. The main feature of this study is the emphasis on in-set (target) language identification on the one hand, and on OOS language detection on the other. Accordingly, we propose a deep SVM based LID back-end system to improve target-language identification. Along with that, we define three OOS thresholding formulations, which are used to decide whether a speech segment belongs to a target or an OOS language. The experimental results demonstrate the effectiveness of the deep SVM back-end system as compared to state-of-the-art techniques. Moreover, the thresholding functions perfectly detect and reject the OOS data. A relative decrease of 6% in Equal Error Rate (EER) is reported over classical OOS detection methods in discriminating target and OOS languages.
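The abstract describes a back-end that scores each target language and then applies a thresholding function to flag OOS segments. The paper's three formulations are not reproduced here; the sketch below shows only the generic max-score thresholding idea, with illustrative names and values:

```python
# Sketch of score-based OOS thresholding: the back-end emits one score per
# target language, and a segment is flagged OOS when even the best-scoring
# target falls below a threshold (no OOS training data needed).
def detect_oos(scores, threshold):
    """scores: dict mapping language -> back-end score.
    Returns (best_language, is_oos)."""
    best_lang = max(scores, key=scores.get)
    return best_lang, scores[best_lang] < threshold

# A confident in-set segment vs. a low-confidence (likely OOS) one.
in_set = detect_oos({"eng": 0.91, "fra": 0.05, "deu": 0.04}, threshold=0.5)
oos = detect_oos({"eng": 0.36, "fra": 0.33, "deu": 0.31}, threshold=0.5)
```

The threshold itself would be tuned on development data, e.g. to the EER operating point reported in the paper.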
Citations
Journal ArticleDOI
TL;DR: This paper proposes using speech signal patterns, represented as image-based features, for spoken language identification, achieving a highest accuracy of 99.96%, which outperforms the state-of-the-art reported results.
Abstract: Speech recognition-based applications are widespread in Western countries, but not to a similar degree in East Asia. Language complexity could potentially be one of the primary reasons behind this lag. Besides, multilingual countries like India need to be considered so that language identification (words and phrases) becomes possible through speech signals. Unlike previous works, in this paper we propose to use speech signal patterns for spoken language identification, where image-based features are used. The concept is primarily inspired by the fact that a speech signal can be read/visualized. In our experiment, we use spectrograms (for image data) and deep learning for spoken language classification. Using the IIIT-H Indic speech database for Indic languages, we achieve the highest accuracy of 99.96%, which outperforms the state-of-the-art reported results. Furthermore, for a relative decrease of 4018.60% in the signal-to-noise ratio, a decrease of only 0.50% in accuracy indicates that our concept is fairly robust.
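The "speech as image" idea above rests on turning the waveform into a spectrogram before applying a CNN. A minimal, purely illustrative sketch of that front-end step (naive per-frame DFT; real systems use windowing, FFTs, and log scaling):

```python
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Naive magnitude spectrogram: slice the signal into overlapping
    frames and take a DFT per frame. The resulting time x frequency
    grid is the 'image' a CNN would consume."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    spec = []
    for frame in frames:
        mags = []
        for k in range(frame_len // 2 + 1):   # non-negative frequencies only
            re = sum(x * math.cos(-2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(frame))
            im = sum(x * math.sin(-2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(frame))
            mags.append(math.hypot(re, im))
        spec.append(mags)
    return spec

# A sine at 8 cycles per frame concentrates its energy in frequency bin 8.
sig = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(sig)
```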

20 citations

Journal ArticleDOI
TL;DR: This paper proposes image-based features for speech signal classification, since different languages can be identified by visualizing their speech patterns; the highest accuracy obtained was 94.51%.
Abstract: Like other applications under the purview of pattern classification, analyzing speech signals is crucial. People often mix different languages while talking, which complicates this task. This happens mostly in India, since different languages are used from one state to another. The southern part of India in particular suffers from this situation, where distinguishing its languages is important. In this paper, we propose image-based features for speech signal classification, because it is possible to identify different patterns by visualizing speech. Modified Mel-frequency cepstral coefficient (MFCC) features, namely MFCC-Statistics Grade (MFCC-SG), were extracted, visualized by plotting techniques, and thereafter fed to a convolutional neural network. In this study, we used the top four languages, namely Telugu, Tamil, Malayalam, and Kannada. Experiments were performed on more than 900 hours of data collected from YouTube, yielding over 150,000 images, and the highest accuracy of 94.51% was obtained.
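The MFCC-SG features summarize frame-level MFCCs with statistics before visualization. The exact statistics used are not stated in this excerpt; the sketch below (mean/std/min/max per coefficient) only illustrates the aggregation step under that assumption:

```python
import statistics

def mfcc_stats(mfcc_frames):
    """mfcc_frames: list of frames, each a list of cepstral coefficients.
    Returns one summary-statistics dict per coefficient index."""
    per_coeff = list(zip(*mfcc_frames))  # transpose to coefficient-major
    return [
        {"mean": statistics.fmean(c), "std": statistics.pstdev(c),
         "min": min(c), "max": max(c)}
        for c in per_coeff
    ]

# Toy input: 3 frames of 2 coefficients each.
frames = [[1.0, -2.0], [3.0, -4.0], [5.0, -6.0]]
stats = mfcc_stats(frames)
```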

8 citations


Cites methods from "Improving of Open-Set Language Iden..."

  • ...[29] used deep SVM for detecting out of set languages in the task of language identification and presented 3 formulations for the out of set languages as well....


Journal ArticleDOI
TL;DR: Speech recognition in multilingual scenario is not trivial in the case when multiple languages are used in one conversation and language must be identified before speech recognition as such...
Abstract: Speech recognition in multilingual scenario is not trivial in the case when multiple languages are used in one conversation. Language must be identified before we process speech recognition as such...

7 citations

Journal ArticleDOI
TL;DR: This work tackles the open-set task by adapting two modern-day state-of-the-art approaches to closed-set language identification: the first using a CRNN with attention and the second using a TDNN.
Abstract: While most modern speech Language Identification methods are closed-set, we want to see if they can be modified and adapted for the open-set problem. When switching to the open-set problem, the solution gains the ability to reject an audio input when it fails to match any of our known language options. We tackle the open-set task by adapting two modern-day state-of-the-art approaches to closed-set language identification: the first using a CRNN with attention and the second using a TDNN. In addition to enhancing our input feature embeddings using MFCCs, log spectral features, and pitch, we will be attempting two approaches to out-of-set language detection: one using thresholds, and the other essentially performing a verification task. We will compare both the performance of the TDNN and the CRNN, as well as our detection approaches.
Journal ArticleDOI
TL;DR: In this article, the authors proposed a semi-open-set approach for the spoken dialect recognition task, in which a closed-set model is exposed to unknown-class inputs, with utterances from other unknown classes also included.
References
Journal ArticleDOI
TL;DR: A novel soft margin perspective for MKL is presented, introducing an additional slack variable, called the kernel slack variable, for each quadratic constraint of MKL, where each constraint corresponds to one support vector machine model using a single base kernel.
Abstract: Multiple kernel learning (MKL) has been proposed for kernel methods by learning the optimal kernel from a set of predefined base kernels. However, the traditional L1-MKL method often achieves worse results than the simplest method using the average of the base kernels (i.e., the average kernel) in some practical applications. In order to improve the effectiveness of MKL, this paper presents a novel soft margin perspective for MKL. Specifically, we introduce an additional slack variable, called the kernel slack variable, for each quadratic constraint of MKL, where each constraint corresponds to one support vector machine model using a single base kernel. We first show that L1-MKL can be deemed hard margin MKL, and then propose a novel soft margin framework for MKL. Three commonly used loss functions, including the hinge loss, the square hinge loss, and the square loss, can be readily incorporated into this framework, leading to new soft margin MKL objective functions. Many existing MKL methods can be shown to be special cases under our soft margin framework. For example, the hinge loss soft margin MKL leads to a new box constraint for the kernel combination coefficients. Using different hyper-parameter values for this formulation, we can inherently bridge the average-kernel method, L1-MKL, and the hinge loss soft margin MKL. The square hinge loss soft margin MKL unifies the family of elastic net constraint/regularizer based approaches, and the square loss soft margin MKL incorporates L2-MKL naturally. Moreover, we develop efficient algorithms for solving both the hinge loss and square hinge loss soft margin MKL. Comprehensive experimental studies of various MKL algorithms on several benchmark data sets and two real-world applications, video action recognition and event recognition, demonstrate that our proposed algorithms can efficiently achieve an effective yet sparse solution for MKL.
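At its core, MKL replaces a single Gram matrix with a learned convex combination of base Gram matrices; with uniform weights this reduces to the average-kernel baseline the abstract compares against. A minimal sketch of the combination step only (the weights here are fixed, not learned as in the paper):

```python
# Combine base kernels K = sum_k w_k * K_k with convex weights.
def combine_kernels(base_kernels, weights):
    """base_kernels: list of n x n Gram matrices (lists of lists);
    weights: non-negative coefficients summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9 and all(w >= 0 for w in weights)
    n = len(base_kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, base_kernels))
             for j in range(n)] for i in range(n)]

K1 = [[1.0, 0.2], [0.2, 1.0]]
K2 = [[1.0, 0.8], [0.8, 1.0]]
K_avg = combine_kernels([K1, K2], [0.5, 0.5])  # average-kernel baseline
```

The learned variants differ only in how the weight vector is optimized (e.g. under the box constraint mentioned for the hinge loss case).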

130 citations

01 Jan 2004
TL;DR: This work proposes the use of sequence kernels for language recognition in support vector machines and applies its methods to the NIST 2003 language evaluation task, demonstrating the potential of the new SVM methods.
Abstract: Support vector machines (SVMs) have become a popular tool for discriminative classification. Powerful theoretical and computational tools for support vector machines have enabled significant improvements in pattern classification in several areas. An exciting area of recent application of support vector machines is in speech processing. A key aspect of applying SVMs to speech is to provide a SVM kernel which compares sequences of feature vectors–a sequence kernel. We propose the use of sequence kernels for language recognition. We apply our methods to the NIST 2003 language evaluation task. Results demonstrate the potential of the new SVM methods.
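A sequence kernel compares two variable-length sequences of feature vectors so an SVM can operate on whole utterances. One simple instance (an assumption here, not necessarily the kernel of the cited paper) averages a frame-level linear kernel over all frame pairs, which equals the dot product of the sequences' mean vectors:

```python
def mean_vector(seq):
    """Mean feature vector of a sequence of equal-length frames."""
    dim = len(seq[0])
    return [sum(frame[d] for frame in seq) / len(seq) for d in range(dim)]

def sequence_kernel(seq_a, seq_b):
    """Average pairwise linear kernel = dot product of the means."""
    ma, mb = mean_vector(seq_a), mean_vector(seq_b)
    return sum(x * y for x, y in zip(ma, mb))

a = [[1.0, 0.0], [3.0, 0.0]]  # mean (2.0, 0.0)
b = [[0.0, 2.0], [4.0, 2.0]]  # mean (2.0, 2.0)
k = sequence_kernel(a, b)
```

Because the kernel depends only on sequence statistics, utterances of different durations become directly comparable.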

108 citations


"Improving of Open-Set Language Iden..." refers background in this paper

  • ...Support vector machines (SVM) has been the most competitive back-end system over the recent years for acoustic modeling, and i-vector modeling specifically [1], [9], [10]....


Proceedings ArticleDOI
14 Mar 2010
TL;DR: This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2009 Language Recognition Evaluation (LRE), consisting of a fusion of three core recognizers: two based on spectral similarity and one based on tokenization.
Abstract: This paper presents a description of the MIT Lincoln Laboratory language recognition system submitted to the NIST 2009 Language Recognition Evaluation (LRE). This system consists of a fusion of three core recognizers, two based on spectral similarity and one based on tokenization. The 2009 LRE differed from previous ones in that test data included narrowband segments from worldwide Voice of America broadcasts as well as conventional recorded conversational telephone speech. Results are presented for the 23-language closed-set and open-set detection tasks at the 30, 10, and 3 second durations along with a discussion of the language-pair task. On the 30 second 23-language closed set detection task, the system achieved a 1.64 average error rate.
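The system above fuses the outputs of three core recognizers into final scores. The paper's actual fusion is trained; the sketch below shows only the generic weighted-sum idea with illustrative names and weights:

```python
# Score-level fusion: a weighted sum of per-language scores across
# recognizers (weights would normally be trained on development data).
def fuse(score_sets, weights):
    """score_sets: one dict of language -> score per recognizer."""
    langs = score_sets[0].keys()
    return {lang: sum(w * s[lang] for w, s in zip(weights, score_sets))
            for lang in langs}

spectral_a = {"eng": 0.7, "spa": 0.3}   # spectral recognizer 1
spectral_b = {"eng": 0.6, "spa": 0.4}   # spectral recognizer 2
token = {"eng": 0.8, "spa": 0.2}        # tokenization recognizer
fused = fuse([spectral_a, spectral_b, token], [0.4, 0.3, 0.3])
```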

90 citations

Journal ArticleDOI
01 Jul 2014-PLOS ONE
TL;DR: This work proposes using Deep Bottleneck Features (DBF) for spoken LID, and shows that DBFs can form a low-dimensional compact representation of the original inputs with a powerful descriptive and discriminative capability.
Abstract: A key problem in spoken language identification (LID) is to design effective representations which are specific to language information. For example, in recent years, representations based on both phonotactic and acoustic features have proven their effectiveness for LID. Although advances in machine learning have led to significant improvements, LID performance is still lacking, especially for short duration speech utterances. With the hypothesis that language information is weak and represented only latently in speech, and is largely dependent on the statistical properties of the speech content, existing representations may be insufficient. Furthermore they may be susceptible to the variations caused by different speakers, specific content of the speech segments, and background noise. To address this, we propose using Deep Bottleneck Features (DBF) for spoken LID, motivated by the success of Deep Neural Networks (DNN) in speech recognition. We show that DBFs can form a low-dimensional compact representation of the original inputs with a powerful descriptive and discriminative capability. To evaluate the effectiveness of this, we design two acoustic models, termed DBF-TV and parallel DBF-TV (PDBF-TV), using a DBF based i-vector representation for each speech utterance. Results on NIST language recognition evaluation 2009 (LRE09) show significant improvements over state-of-the-art systems. By fusing the output of phonotactic and acoustic approaches, we achieve an EER of 1.08%, 1.89% and 7.01% for 30 s, 10 s and 3 s test utterances respectively. Furthermore, various DBF configurations have been extensively evaluated, and an optimal system proposed.

71 citations


"Improving of Open-Set Language Iden..." refers background in this paper

  • ...Since then, different alternatives have been introduced including i-vectors based on bottleneck features [7], [8]....


Proceedings ArticleDOI
28 Jun 2006
TL;DR: This paper presents a description of the MIT Lincoln Laboratory submissions to the 2005 NIST Language Recognition Evaluation (LRE05); trends based on NIST evaluations conducted since 1996 show a steady improvement in language recognition performance.
Abstract: This paper presents a description of the MIT Lincoln Laboratory submissions to the 2005 NIST Language Recognition Evaluation (LRE05). As was true in 2003, the 2005 submissions were combinations of core cepstral and phonotactic recognizers whose outputs were fused to generate final scores. For the 2005 evaluation, Lincoln Laboratory had five submissions built upon fused combinations of six core systems. Major improvements included the generation of phone streams using lattices, SVM-based language models using lattice-derived phonotactics, and binary tree language models. In addition, a development corpus was assembled that was designed to test robustness to unseen languages and sources. Language recognition trends based on NIST evaluations conducted since 1996 show a steady improvement in language recognition performance.

60 citations


"Improving of Open-Set Language Iden..." refers methods in this paper

  • ...This approach is straightforward and fast for application and in contrast to existing works [18], [20], [21], does not rely on the use of additional data from OOS languages which may not be available....


  • ...[21] NIST LRE 2005 OOS modeling SVM and tokenizer BenZeghiba et al....

