Mel-frequency cepstrum

About: Mel-frequency cepstrum is a research topic. Over its lifetime, 6455 publications have been published within this topic, receiving 92772 citations. The topic is also known as: Mel Frequency Cepstral Coefficients.


Papers
Journal Article
TL;DR: An empirical feature analysis for audio environment characterization is performed, and the matching pursuit algorithm is proposed for obtaining effective time-frequency features that supplement MFCCs and yield higher recognition accuracy for environmental sounds.
Abstract: The paper considers the task of recognizing environmental sounds for the understanding of a scene or context surrounding an audio sensor. A variety of features have been proposed for audio recognition, including the popular Mel-frequency cepstral coefficients (MFCCs), which describe the audio spectral shape. Environmental sounds, such as chirpings of insects and sounds of rain, are typically noise-like with a broad flat spectrum and may include strong temporal domain signatures. However, few temporal-domain features have previously been developed to characterize such diverse audio signals. Here, we perform an empirical feature analysis for audio environment characterization and propose to use the matching pursuit (MP) algorithm to obtain effective time-frequency features. The MP-based method utilizes a dictionary of atoms for feature selection, resulting in a flexible, intuitive and physically interpretable set of features. The MP-based features are adopted to supplement the MFCC features to yield higher recognition accuracy for environmental sounds. Extensive experiments are conducted to demonstrate the effectiveness of these joint features for unstructured environmental sound classification, including listening tests to study human recognition capabilities. Our recognition system has been shown to produce performance comparable to that of human listeners.
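As a rough illustration of the paper's core idea, the greedy MP loop can be sketched in a few lines of numpy. The unit-norm dictionary, atom count, and stopping rule below are simplified assumptions; the paper's actual Gabor dictionary and its mapping from selected atoms to features are not reproduced here:

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms=10):
    """Greedy MP decomposition of `signal` over unit-norm `dictionary` columns."""
    residual = signal.astype(float).copy()
    indices, coeffs = [], []
    for _ in range(n_atoms):
        correlations = dictionary.T @ residual        # inner product with every atom
        k = int(np.argmax(np.abs(correlations)))      # best-matching atom
        indices.append(k)
        coeffs.append(correlations[k])
        residual = residual - correlations[k] * dictionary[:, k]  # remove its contribution
    return indices, coeffs
```

In the paper, interpretable features are derived from the parameters of the selected atoms; the sketch returns atom indices and coefficients from which such parameters could be read off.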

626 citations

Journal Article
01 Oct 1977
TL;DR: The power, complex, and phase cepstra are shown to be easily related to one another, and the interpretation and processing of data in such areas as speech, seismology, and hydroacoustics are discussed.
Abstract: This paper is a pragmatic tutorial review of the cepstrum literature focusing on data processing. The power, complex, and phase cepstra are shown to be easily related to one another. Problems associated with phase unwrapping, linear phase components, spectrum notching, aliasing, oversampling, and extending the data sequence with zeros are discussed. The advantages and disadvantages of windowing the sampled data sequence, the log spectrum, and the complex cepstrum are presented. The influence of noise upon the data processing procedures is discussed throughout the paper, but is not thoroughly analyzed. The effects of various forms of liftering the cepstrum are described. The results obtained by applying whitening and trend removal techniques to the spectrum prior to the calculation of the cepstrum are discussed. We have attempted to synthesize the results, procedures, and information peculiar to the many fields that are finding cepstrum analysis useful. In particular we discuss the interpretation and processing of data in such areas as speech, seismology, and hydroacoustics. But we must caution the reader that the paper is heavily influenced by our own experiences; specific procedures that have been found useful in one field should not be considered as totally general to other fields. It is hoped that this review will be of value to those familiar with the field and reduce the time required for those wishing to become so.
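For orientation, the power (real) cepstrum at the center of this review reduces to a couple of lines of numpy: the inverse Fourier transform of the log magnitude spectrum. The Hann window, the epsilon guard against log(0), and the lifter cutoff below are illustrative assumptions, not prescriptions from the paper:

```python
import numpy as np

def real_cepstrum(x, eps=1e-12):
    """Power/real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(x * np.hanning(len(x)))    # windowed spectrum
    return np.fft.ifft(np.log(np.abs(spectrum) + eps)).real

def low_time_lifter(c, cutoff=30):
    """Keep only low-quefrency coefficients (the spectral-envelope part)."""
    liftered = np.zeros_like(c)
    liftered[:cutoff] = c[:cutoff]
    liftered[-(cutoff - 1):] = c[-(cutoff - 1):]     # mirrored half of a real cepstrum
    return liftered
```

Liftering of this kind, discussed at length in the paper, separates the slowly varying spectral envelope (low quefrency) from fine periodic structure such as echoes or pitch (high quefrency).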

607 citations

Proceedings Article
01 Aug 2016
TL;DR: The recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of baseline acoustic scene classification and sound event detection systems using mel-frequency cepstral coefficients and Gaussian mixture models are presented.
Abstract: We introduce the TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark the onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of a supervised acoustic scene classification system and an event detection baseline system using mel-frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and a common ground for comparison of different techniques.
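A baseline of this kind can be approximated with standard tools. The sketch below pairs librosa MFCCs with scikit-learn GMMs, one model per scene; the library calls are standard, but every parameter (sampling rate, n_mfcc, mixture count, file lists) is an illustrative assumption rather than the released baseline's configuration:

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=44100, n_mfcc=20):
    """Load audio and return per-frame MFCCs with shape (frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_scene_models(files_by_scene, n_components=16):
    """Fit one GMM per acoustic scene on pooled MFCC frames."""
    models = {}
    for scene, paths in files_by_scene.items():
        feats = np.vstack([mfcc_frames(p) for p in paths])
        models[scene] = GaussianMixture(n_components=n_components).fit(feats)
    return models

def classify(path, models):
    """Pick the scene whose GMM gives the highest average frame log-likelihood."""
    feats = mfcc_frames(path)
    return max(models, key=lambda s: models[s].score(feats))
```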

519 citations

Journal Article
TL;DR: A scattering transform defines a locally translation invariant representation which is stable to time-warping deformation and extends MFCC representations by computing modulation spectrum coefficients of multiple orders, through cascades of wavelet convolutions and modulus operators.
Abstract: A scattering transform defines a locally translation invariant representation which is stable to time-warping deformation. It extends MFCC representations by computing modulation spectrum coefficients of multiple orders, through cascades of wavelet convolutions and modulus operators. Second-order scattering coefficients characterize transient phenomena such as attacks and amplitude modulation. A frequency transposition invariant representation is obtained by applying a scattering transform along log-frequency. State-of-the-art classification results are obtained for musical genre and phone classification on the GTZAN and TIMIT databases, respectively.
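The wavelet-modulus cascade can be illustrated compactly. The numpy sketch below uses crude Gaussian band-pass filters as wavelet stand-ins and assumes the signal length is divisible by the pooling size; real scattering implementations (e.g., the Kymatio library) use properly constructed wavelet filter banks:

```python
import numpy as np

def gabor_bank(n, centers, bw=0.05):
    """Frequency-domain Gaussian band-pass filters (crude wavelet stand-ins)."""
    f = np.fft.fftfreq(n)
    return [np.exp(-((f - c) ** 2) / (2 * bw ** 2)) for c in centers]

def wavelet_modulus(x, filt):
    """|x * psi|: filter in the Fourier domain, then take the modulus."""
    return np.abs(np.fft.ifft(np.fft.fft(x) * filt))

def scattering_orders(x, bank1, bank2, pool=64):
    """First- and second-order scattering-style coefficients.

    Assumes len(x) is divisible by `pool`; mean-pooling stands in for the
    low-pass averaging that makes the representation locally invariant.
    """
    def avg(u):
        return u.reshape(-1, pool).mean(axis=1)
    s1, s2 = [], []
    for f1 in bank1:
        u1 = wavelet_modulus(x, f1)                  # first wavelet-modulus layer
        s1.append(avg(u1))
        for f2 in bank2:
            s2.append(avg(wavelet_modulus(u1, f2)))  # second layer: modulation content
    return np.array(s1), np.array(s2)
```

For example, `scattering_orders(np.random.randn(4096), gabor_bank(4096, [0.05, 0.1, 0.2]), gabor_bank(4096, [0.01, 0.02]))` returns first- and second-order coefficient arrays; the second order captures the amplitude-modulation structure that averaged first-order (MFCC-like) coefficients discard.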

495 citations

Journal Article
TL;DR: A connectionist-hidden Markov model (HMM) system for noise-robust AVSR is introduced, and it is demonstrated that a word recognition rate gain of approximately 65% is attained with denoised MFCCs under a 10 dB signal-to-noise ratio (SNR) for the audio signal input.
Abstract: Audio-visual speech recognition (AVSR) is thought to be one of the most promising solutions for reliable speech recognition, particularly when the audio is corrupted by noise. However, cautious selection of sensory features is crucial for attaining high recognition performance. In the machine-learning community, deep learning approaches have recently attracted increasing attention because deep neural networks can effectively extract robust latent features that enable various recognition algorithms to demonstrate revolutionary generalization capabilities under diverse application conditions. This study introduces a connectionist-hidden Markov model (HMM) system for noise-robust AVSR. First, a deep denoising autoencoder is utilized for acquiring noise-robust audio features. By preparing the training data for the network as pairs of consecutive multiple steps of deteriorated audio features and the corresponding clean features, the network is trained to output denoised audio features from the corresponding features deteriorated by noise. Second, a convolutional neural network (CNN) is utilized to extract visual features from raw mouth area images. By preparing the training data for the CNN as pairs of raw images and the corresponding phoneme label outputs, the network is trained to predict phoneme labels from the corresponding mouth area input images. Finally, a multi-stream HMM (MSHMM) is applied to integrate the acquired audio and visual HMMs, independently trained with the respective features. Comparing the cases in which normal and denoised mel-frequency cepstral coefficients (MFCCs) are used as audio features for the HMM, our unimodal isolated word recognition results demonstrate that a word recognition rate gain of approximately 65% is attained with denoised MFCCs under a 10 dB signal-to-noise ratio (SNR) for the audio signal input. Moreover, our multimodal isolated word recognition results using the MSHMM with denoised MFCCs and the acquired visual features demonstrate that an additional word recognition rate gain is attained for SNR conditions below 10 dB.
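The first stage, the denoising autoencoder for audio features, can be sketched in PyTorch: the network is trained on pairs of noise-corrupted and clean stacked MFCC frames so that it learns to emit denoised features. The layer sizes, the 11-frame context window, and the optimizer settings below are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

context, n_mfcc = 11, 39           # assumed: 11 consecutive frames stacked as one input
dim = context * n_mfcc

denoiser = nn.Sequential(
    nn.Linear(dim, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),  # bottleneck holding the denoised latent code
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, dim),             # reconstruct the clean feature window
)

optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def train_step(noisy, clean, loss_fn=nn.MSELoss()):
    """One update on a batch of (noisy, clean) stacked-MFCC pairs, shape (batch, dim)."""
    optimizer.zero_grad()
    loss = loss_fn(denoiser(noisy), clean)
    loss.backward()
    optimizer.step()
    return loss.item()
```

At recognition time, the trained network's outputs would replace the noisy MFCC stream feeding the audio HMM.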

493 citations


Network Information
Related Topics (5)

Topic                           Papers     Citations   Related
Feature extraction              111.8K     2.1M        88%
Deep learning                   79.8K      2.1M        85%
Wireless sensor network         142K       2.4M        84%
Convolutional neural network    74.7K      2M          84%
Feature (computer vision)       128.2K     1.7M        83%
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2023    304
2022    772
2021    363
2020    423
2019    419
2018    431