Proceedings Article•DOI•

Statistics based features for unvoiced sound classification

TL;DR: This work investigates if statistics obtained by decomposing sounds using a set of filter-banks and computing the moments of the filter responses, along with their correlation values can be used as features for classifying unvoiced sounds.
Abstract: Unvoiced phonemes have a significant presence in spoken English. These phonemes are hard to classify due to their weak energy and lack of periodicity. Sound textures, such as the sound made by a flowing stream of water or by falling droplets of rain, have aperiodic temporal properties similar to those of unvoiced phonemes. These sounds are easily differentiated by the human ear. Recent studies on sound texture analysis and synthesis have shown that the human auditory system perceives sound textures using simple statistics. These statistics are obtained by decomposing sounds using a set of filter-banks and computing the moments of the filter responses, along with their correlation values. In this work, we investigate whether the above-mentioned statistics, which are easy to extract, can also be used as features for classifying unvoiced sounds. To incorporate the moments and correlation values as features, a framework containing multiple classifiers is proposed. Experiments conducted on the TIMIT dataset gave an accuracy on par with the latest reported in the literature, at a lower computational cost.
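The statistics the abstract describes (per-band moments plus cross-band correlations) can be sketched roughly as follows. This is a hypothetical numpy illustration with a simple FFT-domain filter bank; the band edges, envelope estimate, and feature layout are assumptions, not the paper's actual front end:

```python
import numpy as np

def texture_features(x, sr=16000, n_bands=6):
    """Hypothetical texture-statistic features: decompose with a bank of
    FFT-domain band-pass filters, then take envelope moments
    (mean, std, skewness, kurtosis) and cross-band correlations."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    edges = np.geomspace(100.0, sr / 2, n_bands + 1)   # assumed log-spaced band edges
    envs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        Xb = np.where((freqs >= lo) & (freqs < hi), X, 0)
        band = np.fft.irfft(Xb, len(x))                # band-limited signal
        envs.append(np.abs(band))                      # crude amplitude envelope
    E = np.array(envs)                                 # (n_bands, n_samples)
    mu = E.mean(axis=1)
    sd = E.std(axis=1) + 1e-12
    Z = (E - mu[:, None]) / sd[:, None]
    skew = (Z ** 3).mean(axis=1)
    kurt = (Z ** 4).mean(axis=1)
    corr = np.corrcoef(E)[np.triu_indices(n_bands, k=1)]  # pairwise band correlations
    return np.concatenate([mu, sd, skew, kurt, corr])

rng = np.random.default_rng(0)
feat = texture_features(rng.standard_normal(4096))     # 4*6 moments + 15 correlations
```

With 6 bands this yields a 39-dimensional vector (four moments per band plus 15 pairwise correlations), which would then feed the multi-classifier framework.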
Citations
Proceedings Article•DOI•
11 Apr 2014
TL;DR: Two dimensionality reduction algorithms, t-distributed Stochastic Neighbor Embedding and Sequential Forward Floating Selection, were used to obtain a compact representation of the data; representing the data by a feature vector with as few as 3 dimensions yields a classification rate of almost 90%, outperforming most results obtained in previous studies.
Abstract: Classification of unvoiced fricatives is an important stage in applications such as spoken term detection and audio-video synchronization, and in technologies for the hearing impaired. Due to their acoustic similarity, extraction of multiple features and construction of high-dimensional feature vectors are required for successful classification of these phonemes. In this study two dimensionality reduction algorithms, namely, t-distributed Stochastic Neighbor Embedding (t-SNE) and Sequential Forward Floating Selection (SFFS), were used to obtain a compact representation of the data. A classification stage (kNN or SVM) was then applied, in which we compared the identification rates between the original feature vector and the low-dimensional representation. A total of 1000 unvoiced fricatives (/s/, /sh/, /f/ and /th/) derived from the TIMIT speech database, containing 25000 short frames of 8 ms each, were used for the evaluation. We show that representing the data by a feature vector with as few as 3 dimensions yields a classification rate of almost 90%, which outperforms most of the results obtained in previous studies.
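The feature-selection half of the pipeline above can be illustrated with a toy sketch. This uses plain greedy forward selection with a leave-one-out 1-NN criterion on synthetic data; it is a simplification of SFFS (which also includes backward "floating" steps), and all data, dimensions, and thresholds here are invented:

```python
import numpy as np

def knn_accuracy(X, y):
    """Leave-one-out 1-NN accuracy, used as the selection criterion."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)               # exclude each point itself
    return np.mean(y[D.argmin(axis=1)] == y)

def forward_select(X, y, target_dim=3):
    """Greedy forward feature selection down to target_dim features."""
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < target_dim:
        scores = [knn_accuracy(X[:, chosen + [j]], y) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(1)
n = 120
y = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, 10))
X[:, 4] += 3.0 * y        # planted informative feature
X[:, 7] -= 2.0 * y        # planted informative feature
sel = forward_select(X, y, target_dim=3)
```

On this synthetic set the two planted features should be picked up, mirroring the paper's finding that a handful of well-chosen dimensions can carry most of the discriminative information.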

4 citations


Cites methods from "Statistics based features for unvoi..."

  • ...For example in [18] a correct identification rate of 84% was reported, using a bark bands spectral representation and a canonical discriminant analysis, while in [19], a correct rate of 86....


Patent•
23 Dec 2015
TL;DR: An online forecasting method for high-frequency mechanical noise of a structure, belonging to the technical field of noise forecasting, is disclosed; the method can be applied in online forecasting engineering practice and has wide application prospects.
Abstract: The invention discloses an online forecasting method for high-frequency mechanical noise of a structure, belonging to the technical field of noise forecasting. The method comprises the following steps: building a reasonable and effective constraint-and-load statistical energy analysis model for the engineering structure; obtaining the mass data of the various excited sub-systems from the statistical energy analysis model; measuring the vibration response data of the excited sub-systems in a test; calculating the energy data of the excited sub-systems by combining the mass data with the response data; obtaining the radiated-sound-power transfer mobility data corresponding to the excited sub-systems from the model; and finally calculating the radiated sound power of the structure, completing the online forecast. The method achieves rapid calculation from load to radiated sound power by exploiting the invariance of the system transfer mobility, solving the long-runtime problem of traditional algorithms and enabling rapid forecasting of structural mechanical noise. The method offers considerable accuracy with relatively short elapsed time, can be applied in online forecasting engineering practice, and has wide application prospects.
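The step sequence in the abstract (subsystem energy from mass and measured vibration response, then radiated power through a precomputed transfer mobility) reduces to a short calculation. The sketch below is a hypothetical illustration; the masses, velocities, and mobility values are invented placeholders, not values from the patent:

```python
import numpy as np

# Hypothetical SEA-style forecast: energy per excited subsystem from
# mass and RMS velocity response, then total radiated sound power via
# fixed energy-to-power transfer mobilities obtained offline from the model.
mass = np.array([12.0, 8.5, 20.0])        # subsystem masses (kg), assumed
v_rms = np.array([0.02, 0.05, 0.01])      # measured RMS velocities (m/s), assumed
mobility = np.array([0.3, 0.45, 0.2])     # energy-to-power mobilities (1/s), assumed

energy = mass * v_rms ** 2                # kinetic-energy estimate per subsystem (J)
radiated_power = np.sum(mobility * energy)  # total radiated sound power (W)
```

Because the mobilities are invariant and precomputed, only the cheap energy update runs online, which is the source of the claimed speedup.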

1 citation

DOI•
20 Jul 2022
TL;DR: This research obtains a model that can perform pitch estimation with a 90.14% F1 score and an average user evaluation of 8.4 out of 10.
Abstract: This research explores several variations of the automatic music transcription method, specifically the pitch estimation task. Pitch estimation here converts an acoustic piano recording into a digitally transcribed song format. First, several techniques, including the short-time Fast Fourier transform and the constant-Q transform, provide a spectrogram representation of a WAV piano recording. This is then fed into a combination of a Convolutional Neural Network (ConvNet) and a Long Short-Term Memory (LSTM) neural network. The resulting transcription is a digitally transcribed song in the form of a MIDI file. The MAESTRO dataset was used for training, with each training run varying the learning rate and the spectrogram representation. This research obtains a model that performs pitch estimation with a 90.14% F1 score and an average user evaluation of 8.4 out of 10.
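The spectrogram front end of this pipeline can be sketched as below; as a stand-in for the ConvNet+LSTM stage, frame-wise peak picking maps the dominant STFT bin to a MIDI note number. This is a toy illustration on a synthetic tone, not the paper's model, and the frame sizes are assumptions:

```python
import numpy as np

def stft_pitch_to_midi(x, sr=16000, n_fft=2048, hop=512):
    """Toy pitch front end: short-time FFT spectrogram, then frame-wise
    peak picking mapped to MIDI note numbers.  The learned ConvNet+LSTM
    stage of the paper is replaced here by a simple argmax."""
    window = np.hanning(n_fft)
    notes = []
    for start in range(0, len(x) - n_fft, hop):
        frame = x[start:start + n_fft] * window
        mag = np.abs(np.fft.rfft(frame))
        f0 = np.argmax(mag) * sr / n_fft              # dominant frequency (Hz)
        notes.append(int(round(69 + 12 * np.log2(f0 / 440.0))))  # Hz -> MIDI
    return notes

sr = 16000
t = np.arange(sr) / sr
a4 = np.sin(2 * np.pi * 440.0 * t)                    # one second of A4
notes = stft_pitch_to_midi(a4, sr)
```

A real transcriber would replace the argmax with the trained network and emit note onsets/offsets into a MIDI file, but the spectrogram-in, note-numbers-out shape of the problem is the same.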
References
Journal Article•DOI•
Yoav Freund, Robert E. Schapire
01 Aug 1997
TL;DR: The model studied can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting; the multiplicative weight-update Littlestone-Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases but applicable to a considerably more general class of learning problems.
Abstract: In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update Littlestone-Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games, and prediction of points in R^n. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of the new boosting algorithm to the problem of learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line.
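The boosting algorithm from the second part of the abstract can be sketched with decision stumps as weak learners. This is a minimal AdaBoost-style illustration on synthetic data, not the authors' exact formulation; the dataset and round count are invented:

```python
import numpy as np

def adaboost_stumps(X, y, rounds=20):
    """Minimal AdaBoost sketch: multiplicatively reweight examples and
    combine weak threshold stumps.  Labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        best = None
        for j in range(d):                        # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # stump weight
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)            # multiplicative weight update
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(score)

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)        # diagonal decision boundary
model = adaboost_stumps(X, y)
acc = np.mean(predict(model, X) == y)
```

Note how no prior knowledge about stump quality is needed: each round's weight alpha is derived from its observed weighted error, which is the property the abstract highlights.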

15,813 citations

Journal Article•DOI•
08 Sep 2011-Neuron
TL;DR: The results suggest that sound texture perception is mediated by relatively simple statistics of early auditory representations, presumably computed by downstream neural populations, and the synthesis methodology offers a powerful tool for their further investigation.

342 citations

Book•
01 Jan 1923

183 citations


"Statistics based features for unvoi..." refers background in this paper

  • ...Unvoiced sounds, due to their low energy and noise like structure, are hard to recognize....


Journal Article•DOI•
TL;DR: Systematic evaluation shows that the proposed system extracts a majority of unvoiced speech without including much interference, and it performs substantially better than spectral subtraction.
Abstract: Monaural speech segregation has proven to be extremely challenging. While efforts in computational auditory scene analysis have led to considerable progress in voiced speech segregation, little attention has been given to unvoiced speech, which lacks harmonic structure and has weaker energy, hence more susceptible to interference. This study proposes a new approach to the problem of segregating unvoiced speech from nonspeech interference. The study first addresses the question of how much speech is unvoiced. The segregation process occurs in two stages: Segmentation and grouping. In segmentation, the proposed model decomposes an input mixture into contiguous time-frequency segments by a multiscale analysis of event onsets and offsets. Grouping of unvoiced segments is based on Bayesian classification of acoustic-phonetic features. The proposed model for unvoiced speech segregation joins an existing model for voiced speech segregation to produce an overall system that can deal with both voiced and unvoiced speech. Systematic evaluation shows that the proposed system extracts a majority of unvoiced speech without including much interference, and it performs substantially better than spectral subtraction.
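The segmentation stage described above can be caricatured in a few lines: a single energy envelope is split at threshold crossings, standing in for the paper's multiscale time-frequency onset/offset analysis. The envelope values and threshold below are invented for illustration:

```python
import numpy as np

def onset_offset_segments(env, thresh):
    """Toy segmentation: contiguous segments where a band's energy
    envelope exceeds a threshold, as a stand-in for the paper's
    multiscale onset/offset analysis."""
    active = env > thresh
    edges = np.diff(active.astype(int))
    onsets = list(np.where(edges == 1)[0] + 1)    # rising crossings
    offsets = list(np.where(edges == -1)[0] + 1)  # falling crossings
    if active[0]:
        onsets.insert(0, 0)                       # segment open at start
    if active[-1]:
        offsets.append(len(env))                  # segment open at end
    return list(zip(onsets, offsets))

env = np.array([0.0, 0.1, 0.9, 0.8, 0.2, 0.0, 0.7, 0.6, 0.1])
segs = onset_offset_segments(env, thresh=0.5)
```

In the actual system this runs per frequency channel and at multiple scales, and the resulting segments are then grouped by Bayesian classification of acoustic-phonetic features.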

74 citations


"Statistics based features for unvoi..." refers background in this paper

  • ...These sounds make up to 21.0% of the total phonemes spoken in English language [1]....


Journal Article•DOI•
TL;DR: A statistically guided, knowledge-based, acoustic-phonetic system for the automatic classification of fricatives in speaker-independent continuous speech is proposed, which uses an auditory-based front-end processing system and incorporates new algorithms for the extraction and manipulation of the acoustic- phonetic features that proved to be rich in their information content.
Abstract: In this article, the acoustic-phonetic characteristics of the American English fricative consonants are investigated from the automatic classification standpoint. The features studied in the literature are evaluated and new features are proposed. To test the value of the extracted features, a statistically guided, knowledge-based, acoustic-phonetic system for the automatic classification of fricatives in speaker-independent continuous speech is proposed. The system uses an auditory-based front-end processing system and incorporates new algorithms for the extraction and manipulation of the acoustic-phonetic features that proved to be rich in their information content. Classification experiments are performed using hard-decision algorithms on fricatives extracted from the TIMIT database continuous speech of 60 speakers (not used in the design/training process) from seven different dialects of American English. An accuracy of 93% is obtained for voicing detection, 91% for place of articulation detection, and 87% for the overall classification of fricatives.

61 citations


"Statistics based features for unvoi..." refers methods in this paper

  • ...A non-linear manifold learning technique called diffusion maps was advocated to improve the classification accuracy....
