Proceedings ArticleDOI

A text-independent speaker verification model: A comparative analysis

01 Jun 2017
TL;DR: In this paper, the authors explore the various methods available for each block of the speaker recognition process, with the objective of identifying the best techniques for obtaining precise results.
Abstract: The most pressing challenge in the field of voice biometrics is selecting the most efficient technique of speaker recognition. Every individual's voice is distinctive; factors such as physical differences in the vocal organs, accent and pronunciation contribute to the problem's complexity. In this paper, we explore the various methods available for each block of the speaker recognition process, with the objective of identifying the best techniques for obtaining precise results. We study the results on text-independent corpora. We use the MFCC (Mel-frequency cepstral coefficient), LPCC (linear predictive cepstral coefficient) and PLP (perceptual linear prediction) algorithms for feature extraction, PCA (Principal Component Analysis) and t-SNE for dimensionality reduction, and SVM (Support Vector Machine), feed-forward neural network, nearest neighbor and decision tree algorithms for the classification block of the speaker recognition system, and comparatively analyze each block to determine the best technique.
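A minimal sketch of the classification block described above, using a plain nearest-neighbor rule in pure Python. The toy vectors below stand in for real MFCC/LPCC/PLP feature vectors, which would come from a separate feature-extraction step; the speaker labels and values are invented for illustration.

```python
import math

# Toy "feature vectors" standing in for per-utterance MFCC averages
# (real MFCCs would come from a feature-extraction library).
train = [
    ([1.0, 0.2, 0.1], "speaker_a"),
    ([0.9, 0.3, 0.2], "speaker_a"),
    ([0.1, 1.1, 0.9], "speaker_b"),
    ([0.2, 0.9, 1.0], "speaker_b"),
]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbor(x, training_set):
    """Classify x by the label of its closest training vector."""
    return min(training_set, key=lambda pair: euclidean(x, pair[0]))[1]

print(nearest_neighbor([0.95, 0.25, 0.15], train))  # speaker_a
print(nearest_neighbor([0.15, 1.0, 0.95], train))   # speaker_b
```

Swapping in SVM, decision tree, or feed-forward classifiers at the same point in the pipeline is what the comparison in the paper amounts to.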
Citations
Proceedings ArticleDOI
01 Jan 2019
TL;DR: This work used discriminant correlation analysis (DCA) to fuse features from face and voice and the K-nearest neighbors (KNN) algorithm to classify them, showing that fusion increased recognition accuracy by 52.45% compared to using face alone and by 81.62% compared to using voice alone.
Abstract: Biometric authentication is a promising approach to securing the Internet of Things (IoT). Although existing research shows that using multiple biometrics for authentication helps increase recognition accuracy, the majority of biometric approaches for IoT today continue to rely on a single modality. We propose a multimodal biometric approach for IoT based on face and voice modalities that is designed to scale to the limited resources of an IoT device. Our work builds on the foundation of Gofman et al. [7] in implementing face and voice feature-level fusion on mobile devices. We used discriminant correlation analysis (DCA) to fuse features from face and voice and used the K-nearest neighbors (KNN) algorithm to classify the features. The approach was implemented on the Raspberry Pi IoT device and was evaluated on a dataset of face images and voice files acquired using a Samsung Galaxy S5 device in real-world conditions such as dark rooms and noisy settings. The results show that fusion increased recognition accuracy by 52.45% compared to using face alone and 81.62% compared to using voice alone. It took an average of 1.34 seconds to enroll a user and 0.91 seconds to perform the authentication. To further optimize execution speed and reduce power consumption, we implemented classification on a field-programmable gate array (FPGA) chip that can be easily integrated into an IoT device. Experimental results showed that the proposed FPGA-accelerated KNN could achieve 150x faster execution time and 12x lower energy consumption compared to a CPU.
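The fusion-plus-KNN pipeline can be sketched as follows. Note that plain concatenation stands in for DCA here (DCA additionally transforms the two feature sets to maximize their pairwise correlation before fusing them), and the enrolled feature values and user names are invented for illustration.

```python
from collections import Counter
import math

def fuse(face_feat, voice_feat):
    """Feature-level fusion by concatenation (a simple stand-in for DCA)."""
    return face_feat + voice_feat

def knn(x, training_set, k=3):
    """K-nearest-neighbors majority vote over Euclidean distance."""
    dist = lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    neighbors = sorted(training_set, key=lambda pair: dist(x, pair[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical enrolled users, each with (face, voice) feature pairs.
train = [
    (fuse([0.9, 0.1], [0.8, 0.2]), "alice"),
    (fuse([1.0, 0.2], [0.7, 0.3]), "alice"),
    (fuse([0.8, 0.0], [0.9, 0.1]), "alice"),
    (fuse([0.1, 0.9], [0.2, 0.8]), "bob"),
    (fuse([0.2, 1.0], [0.3, 0.7]), "bob"),
    (fuse([0.0, 0.8], [0.1, 0.9]), "bob"),
]
probe = fuse([0.85, 0.15], [0.75, 0.25])
print(knn(probe, train))  # alice
```

KNN's simplicity is also what makes it a good candidate for the FPGA acceleration described in the abstract: the distance computations are independent and parallelize naturally.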

23 citations

Journal ArticleDOI
TL;DR: This paper presents an overview of some Machine Learning approaches for biometric pattern recognition; the approaches presented are suited to identifying single or several individuals.
Abstract: Biometrics, as a computer science field, can be understood as the discipline that studies how to generate computer models of the physical (e.g. hand geometry, fingerprints, iris and so on) and behavioral (e.g. signature, a kind of behavior pattern) characteristics of human beings for identifying single or several individuals. Usually, these characteristics are used to provide authentication information for security systems. However, some of these characteristics are hard to obtain properly, and several algorithms are needed both to process them and to use them in security systems. In this sense, this paper presents an overview of some Machine Learning approaches for biometric pattern recognition.

17 citations

Journal ArticleDOI
TL;DR: The results indicated that t-distributed stochastic neighbor embedding (t-SNE), successfully employed in several studies, also performed well in the analysis of the indris' repertoire and may open new perspectives towards shared methodological techniques for comparing animal vocal repertoires.
Abstract: Although there is a growing number of studies focusing on acoustic communication, the lack of shared analytic approaches leads to inconsistency among them. Here, we introduce a computational method used to examine 3360 calls recorded from wild indris (Indri indri) between 2005 and 2018. We split each sound into ten portions of equal length and from each portion extracted spectral coefficients, considering frequency values up to 15,000 Hz. We submitted the set of acoustic features first to a t-distributed stochastic neighbor embedding algorithm, then to a hard-clustering procedure using a k-means algorithm. The t-distributed stochastic neighbor embedding (t-SNE) mapping indicated the presence of eight different groups, consistent with the acoustic structure of the a priori identification of calls, while the cluster analysis revealed that an overlap between distinct call types might exist. Our results indicated that t-SNE, successfully employed in several studies, also showed good performance in the analysis of the indris' repertoire and may open new perspectives towards the achievement of shared methodological techniques for the comparison of animal vocal repertoires.
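The hard-clustering step can be sketched with a plain k-means implementation; the 2-D toy points below stand in for a t-SNE embedding of the spectral features (the real study clusters eight call-type groups rather than two).

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: alternate assigning points to the nearest center
    and recomputing each center as its cluster's mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        centers = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centers[i]
            for i, pts in enumerate(clusters)
        ]
    return centers, clusters

# Two well-separated toy "call type" groups in the embedding plane.
group_a = [(0.0 + 0.1 * i, 0.0) for i in range(5)]
group_b = [(5.0 + 0.1 * i, 5.0) for i in range(5)]
centers, clusters = kmeans(group_a + group_b, k=2)
print(sorted(len(c) for c in clusters))  # [5, 5]
```

Running k-means on the low-dimensional embedding rather than the raw spectral coefficients is exactly the two-stage design the abstract describes.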

12 citations

Journal ArticleDOI
27 Jan 2020 - PeerJ
TL;DR: The experimental results obtained using Mel-frequency cepstral coefficients (MFCCs) and hidden Markov models (HMMs) support the finding that a reduction in training data by a factor of 10 does not significantly affect the recognition performance.
Abstract: Automated acoustic recognition of birds is considered an important technology in support of biodiversity monitoring and biodiversity conservation activities. These activities require processing large amounts of soundscape recordings. Typically, recordings are transformed to a number of acoustic features, and a machine learning method is used to build models and recognize the sound events of interest. The main problem is the scalability of data processing, either for developing models or for processing recordings made over long time periods. In those cases, the processing time and resources required might become prohibitive for the average user. To address this problem, we evaluated the applicability of three data reduction methods. These methods were applied to a series of acoustic feature vectors as an additional postprocessing step, which aims to reduce the computational demand during training. The experimental results obtained using Mel-frequency cepstral coefficients (MFCCs) and hidden Markov models (HMMs) support the finding that a reduction in training data by a factor of 10 does not significantly affect the recognition performance.
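Two simple data-reduction schemes of the kind evaluated can be sketched as follows (the specific methods in the paper may differ; these are illustrative): keeping every tenth feature vector, and averaging consecutive blocks of ten.

```python
def decimate(frames, factor=10):
    """Keep every `factor`-th feature vector."""
    return frames[::factor]

def block_average(frames, factor=10):
    """Average consecutive blocks of `factor` frames into one vector."""
    out = []
    for start in range(0, len(frames) - factor + 1, factor):
        block = frames[start:start + factor]
        out.append([sum(col) / factor for col in zip(*block)])
    return out

# 100 toy 3-dimensional "MFCC frames".
frames = [[float(t), float(t) * 0.5, 1.0] for t in range(100)]
print(len(decimate(frames)))       # 10
print(len(block_average(frames)))  # 10
print(block_average(frames)[0])    # [4.5, 2.25, 1.0]
```

Either scheme cuts the training data handed to the HMMs by a factor of 10, which is the reduction the abstract reports as having no significant effect on recognition performance.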

5 citations

Proceedings ArticleDOI
10 Sep 2020
TL;DR: An intelligent speech recognition system is designed and presented here in detail as part of this work, achieving a novel smart home system design framework with an intelligent instruction-based operation mechanism.
Abstract: The design of a smart home using Internet of Things (IoT) and Machine Learning technology is presented in this paper. The design is primarily based on the LoRaWAN protocol, and the main objective of this work was to establish an IoT network built on the integration of sensors, a gateway, a network server and a data visualization system. More importantly, an intelligent speech recognition system is designed and presented here in detail as part of this work to achieve a novel smart home system design framework with an intelligent instruction-based operation mechanism. Under low-noise conditions, the speaker recognition success rate is above 90% on the THCHS-30 dataset.

5 citations

References
Journal ArticleDOI
Hynek Hermansky1
TL;DR: A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, which uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum, and yields a low-dimensional representation of speech.
Abstract: A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is presented and examined. This technique uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum: (1) the critical-band spectral resolution, (2) the equal-loudness curve, and (3) the intensity-loudness power law. The auditory spectrum is then approximated by an autoregressive all-pole model. A 5th-order all-pole model is effective in suppressing speaker-dependent details of the auditory spectrum. In comparison with conventional linear predictive (LP) analysis, PLP analysis is more consistent with human hearing. The effective second formant F2' and the 3.5-Bark spectral-peak integration theories of vowel perception are well accounted for. PLP analysis is computationally efficient and yields a low-dimensional representation of speech. These properties are found to be useful in speaker-independent automatic-speech recognition.
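Two of the PLP stages can be sketched directly: the intensity-loudness power law, and the all-pole fit via the Levinson-Durbin recursion (a standard way to solve for autoregressive coefficients from autocorrelations; the autocorrelation values below are synthetic, not from real speech).

```python
def cube_root_compress(power_spectrum):
    """Intensity-loudness power law: perceived loudness ~ intensity**0.33."""
    return [p ** 0.33 for p in power_spectrum]

def levinson_durbin(r, order):
    """Fit an all-pole (autoregressive) model from autocorrelation values r.
    Returns coefficients of A(z) = 1 + a1*z^-1 + ... and the residual error.
    The paper uses a 5th-order model at this stage."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a_new = a[:]
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= 1.0 - k * k
    return a, err

# Autocorrelation of a simple AR(1) process with coefficient 0.5:
# the fit should recover a1 = -0.5 and leave higher coefficients near zero.
r = [0.5 ** lag for lag in range(6)]
a, err = levinson_durbin(r, order=5)
print(a[:3])
```

The remaining PLP stages (critical-band integration and equal-loudness weighting) are frequency-domain warps and weights applied before these steps; they are omitted here for brevity.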

2,969 citations

Book
02 Apr 2013
TL;DR: This book covers the general principles and ideas of designing biometric-based systems and their underlying tradeoffs, and explores some of the numerous privacy and security implications of biometrics.
Abstract: Biometrics: Personal Identification in Networked Society is a comprehensive and accessible source of state-of-the-art information on all existing and emerging biometrics: the science of automatically identifying individuals based on their physiological or behavioral characteristics. In particular, the book covers:
* General principles and ideas of designing biometric-based systems and their underlying tradeoffs
* Identification of important issues in the evaluation of biometrics-based systems
* Integration of biometric cues, and the integration of biometrics with other existing technologies
* Assessment of the capabilities and limitations of different biometrics
* The comprehensive examination of biometric methods in commercial use and in research development
* Exploration of some of the numerous privacy and security implications of biometrics
Also included are chapters on face and eye identification, speaker recognition, networking, and other timely technology-related issues. All chapters are written by leading internationally recognized experts from academia and industry. Biometrics: Personal Identification in Networked Society is an invaluable work for scientists, engineers, application developers, systems integrators, and others working in biometrics.

1,845 citations

Posted Content
TL;DR: This paper demonstrates the viability of MFCC for feature extraction and DTW for comparing test patterns, and explains why alignment is important for better performance.
Abstract: Digital processing of the speech signal and voice recognition algorithms are very important for fast and accurate automatic voice recognition technology. The voice is a signal of infinite information; direct analysis and synthesis of the complex voice signal is difficult due to the amount of information it contains. Therefore, digital signal processes such as feature extraction and feature matching are introduced to represent the voice signal. Several methods such as Linear Predictive Coding (LPC), Hidden Markov Models (HMM) and Artificial Neural Networks (ANN) have been evaluated with a view to identifying a straightforward and effective method for the voice signal. The extraction and matching processes are implemented right after the pre-processing (filtering) of the signal is performed. Mel-Frequency Cepstral Coefficients (MFCCs), a non-parametric method for modelling the human auditory perception system, are utilized as the extraction technique. The non-linear sequence alignment known as Dynamic Time Warping (DTW), introduced by Sakoe and Chiba, is used as the feature-matching technique. Since the voice signal tends to have a different temporal rate across utterances, alignment is important to produce better performance. This paper presents the viability of MFCC to extract features and DTW to compare the test patterns.
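The DTW alignment the paper relies on can be sketched in a few lines; `dist` defaults to absolute difference here, where a real system would compare per-frame MFCC vectors instead.

```python
def dtw_distance(s, t, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping: minimal cumulative distance over all
    monotonic alignments of two sequences, possibly of different lengths."""
    INF = float("inf")
    n, m = len(s), len(t)
    # D[i][j] = best cost of aligning s[:i] with t[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(s[i - 1], t[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A sequence and a time-stretched copy of it align with zero cost,
# which is why DTW handles differing speaking rates.
print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # 0.0
print(dtw_distance([1, 2, 3], [2, 3, 4]))           # 2.0
```

This is why the abstract stresses alignment: utterances of the same word at different speaking rates have equal DTW cost where a rigid frame-by-frame comparison would not even be defined.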

846 citations

Posted Content
TL;DR: Some of the most widely used methods for reducing the information in each segment of the audio signal to a relatively small number of parameters, or features, are presented.
Abstract: The time-domain waveform of a speech signal carries all of the auditory information. From the phonological point of view, however, little can be said on the basis of the waveform itself. Past research in mathematics, acoustics, and speech technology has provided many methods for converting the data into information that can be interpreted correctly. In order to find statistically relevant information in incoming data, it is important to have mechanisms for reducing the information in each segment of the audio signal to a relatively small number of parameters, or features. These features should describe each segment in such a characteristic way that other similar segments can be grouped together by comparing their features. There are numerous ways to describe the speech signal in terms of parameters; though they all have their strengths and weaknesses, we present some of the most widely used methods and their importance.
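As an illustration of reducing each segment to a few parameters, here is a sketch computing two of the simplest per-frame features, short-time energy and zero-crossing rate (the frame length and the toy signal are arbitrary choices, not values from the survey).

```python
def frame_features(signal, frame_len=160):
    """Reduce each frame of samples to two summary parameters:
    short-time energy and zero-crossing rate."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # Mean squared amplitude of the frame.
        energy = sum(x * x for x in frame) / frame_len
        # Fraction of adjacent sample pairs that change sign.
        zcr = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        ) / (frame_len - 1)
        feats.append((energy, zcr))
    return feats

# A toy alternating-sign "signal": maximal zero-crossing rate, unit energy.
signal = [(-1.0) ** n for n in range(320)]
feats = frame_features(signal)
print(len(feats))  # 2
print(feats[0])    # (1.0, 1.0)
```

Richer features such as MFCCs or LPC coefficients follow the same pattern: each frame of raw samples collapses to a short vector that similar frames share.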

103 citations