Journal ArticleDOI

A comparative performance study of several pitch detection algorithms

TL;DR: Seven pitch detection algorithms were compared on a speech data base of eight utterances spoken by three males, three females, and one child, and their relative performance was assessed as a function of recording condition and pitch range of the various speakers.
Abstract: A comparative performance study of seven pitch detection algorithms was conducted. A speech data base, consisting of eight utterances spoken by three males, three females, and one child was constructed. Telephone, close talking microphone, and wideband recordings were made of each of the utterances. For each of the utterances in the data base, a "standard" pitch contour was semiautomatically measured using a highly sophisticated interactive pitch detection program. The "standard" pitch contour was then compared with the pitch contour that was obtained from each of the seven programmed pitch detectors. The algorithms used in this study were 1) a center clipping, infinite-peak clipping, modified autocorrelation method (AUTOC), 2) the cepstral method (CEP), 3) the simplified inverse filtering technique (SIFT) method, 4) the parallel processing time-domain method (PPROC), 5) the data reduction method (DARD), 6) a spectral flattening linear predictive coding (LPC) method, and 7) the average magnitude difference function (AMDF) method. A set of measurements was made on the pitch contours to quantify the various types of errors which occur in each of the above methods. Included among the error measurements were the average and standard deviation of the error in pitch period during voiced regions, the number of gross errors in the pitch period, and the average number of voiced-unvoiced classification errors. For each of the error measurements, the individual pitch detectors could be rank ordered as a measure of their relative performance as a function of recording condition and pitch range of the various speakers. Performance scores are presented for each of the seven pitch detectors based on each of the categories of error.
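To make the error categories concrete, here is a minimal Python/NumPy sketch of how a candidate pitch contour might be scored against a "standard" contour. The function name, the frame representation (pitch period in milliseconds, 0 for unvoiced), and the 1 ms gross-error threshold are illustrative choices, not the paper's exact criteria.

```python
import numpy as np

def score_pitch_contour(standard, candidate, gross_threshold_ms=1.0):
    """Tally pitch-contour errors frame by frame, in the spirit of the
    error categories above.  Contours are arrays of pitch periods in
    milliseconds, with 0 marking an unvoiced frame."""
    standard = np.asarray(standard, dtype=float)
    candidate = np.asarray(candidate, dtype=float)

    voiced_std = standard > 0
    voiced_cand = candidate > 0

    # Voiced-unvoiced classification errors: frames where the detector
    # disagrees with the standard contour.
    vuv_errors = int(np.sum(voiced_std != voiced_cand))

    # On frames both call voiced, split period deviations into gross
    # errors (beyond the threshold) and fine errors (within it).
    both = voiced_std & voiced_cand
    deviation = candidate[both] - standard[both]
    gross = np.abs(deviation) > gross_threshold_ms
    fine = deviation[~gross]

    return {
        "vuv_errors": vuv_errors,
        "gross_errors": int(np.sum(gross)),
        "fine_error_mean": float(fine.mean()) if fine.size else 0.0,
        "fine_error_std": float(fine.std()) if fine.size else 0.0,
    }

# Six frames: two V-UV disagreements and one gross (halved) period.
standard = [5.0, 5.1, 5.2, 0.0, 5.0, 5.1]
candidate = [5.2, 2.6, 5.1, 5.0, 0.0, 5.1]
print(score_pitch_contour(standard, candidate))
```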
Citations
Journal ArticleDOI
26 Jun 1979
TL;DR: An overview of the variety of techniques that have been proposed for enhancement and bandwidth compression of speech degraded by additive background noise, together with a unifying framework in which the relationships between these systems are more visible and which suggests fruitful directions for further research.
Abstract: Over the past several years there has been considerable attention focused on the problem of enhancement and bandwidth compression of speech degraded by additive background noise. This interest is motivated by several factors, including a broad set of important applications, the apparent lack of robustness in current speech-compression systems, and the development of several potentially promising and practical solutions. One objective of this paper is to provide an overview of the variety of techniques that have been proposed for enhancement and bandwidth compression of speech degraded by additive background noise. A second objective is to suggest a unifying framework in terms of which the relationships between these systems are more visible, and which hopefully provides a structure that will suggest fruitful directions for further research.

1,236 citations

Journal ArticleDOI
Lawrence R. Rabiner
TL;DR: Several types of (nonlinear) preprocessing which can be used to effectively spectrally flatten the speech signal are presented and an algorithm for adaptively choosing a frame size for an autocorrelation pitch analysis is discussed.
Abstract: One of the most time-honored methods of detecting pitch is to use some type of autocorrelation analysis on speech which has been appropriately preprocessed. The goal of the speech preprocessing in most systems is to whiten, or spectrally flatten, the signal so as to eliminate the effects of the vocal tract spectrum on the detailed shape of the resulting autocorrelation function. The purpose of this paper is to present some results on several types of (nonlinear) preprocessing which can be used to effectively spectrally flatten the speech signal. The types of nonlinearities which are considered are classified by a nonlinear input-output quantizer characteristic. By appropriate adjustment of the quantizer threshold levels, both the ordinary (linear) autocorrelation analysis and the center clipping-peak clipping autocorrelation of Dubnowski et al. [1] can be obtained. Results are presented to demonstrate the degree of spectrum flattening obtained using these methods. Each of the proposed methods was tested on several of the utterances used in a recent pitch detector comparison study by Rabiner et al. [2]. Results of this comparison are included in this paper. One final topic discussed in this paper is an algorithm for adaptively choosing a frame size for an autocorrelation pitch analysis.
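As a rough illustration of the center-clipping idea, here is a minimal Python/NumPy sketch of a three-level quantizer characteristic followed by an autocorrelation peak pick. The 30% clipping threshold, the fixed frame, and the lag limits are assumptions for the example; the paper's family of quantizers and its adaptive frame-size algorithm are more elaborate.

```python
import numpy as np

def center_clip(x, clip_fraction=0.3):
    """Three-level center/peak clipper: samples inside +/- threshold go
    to 0, samples outside go to +/-1 (one setting of the quantizer
    characteristic; the threshold fraction is an illustrative choice)."""
    threshold = clip_fraction * np.max(np.abs(x))
    return np.sign(x) * (np.abs(x) > threshold)

def autocorr_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Clip, autocorrelate, and take the peak lag in a plausible range."""
    y = center_clip(frame)
    r = np.correlate(y, y, mode="full")[len(y) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

# A synthetic 120 Hz "vowel" with three harmonics.
fs = 8000
t = np.arange(0, 0.03, 1.0 / fs)
x = sum(a * np.sin(2 * np.pi * 120 * k * t)
        for k, a in [(1, 1.0), (2, 0.6), (3, 0.4)])
print(round(autocorr_pitch(x, fs), 1))  # close to 120 Hz
```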

572 citations

Journal ArticleDOI
TL;DR: This paper describes MARSYAS, a framework for experimenting, evaluating and integrating techniques for audio content analysis in restricted domains and a new method for temporal segmentation based on audio texture that is combined with audio analysis techniques and used for hierarchical browsing, classification and annotation of audio files.
Abstract: Existing audio tools handle the increasing amount of computer audio data inadequately. The typical tape-recorder paradigm for audio interfaces is inflexible and time consuming, especially for large data sets. On the other hand, completely automatic audio analysis and annotation is impossible using current techniques. Alternative solutions are semi-automatic user interfaces that let users interact with sound in flexible ways based on content. This approach offers significant advantages over manual browsing, annotation and retrieval. Furthermore, it can be implemented using existing techniques for audio content analysis in restricted domains. This paper describes MARSYAS, a framework for experimenting, evaluating and integrating such techniques. As a test for the architecture, some recently proposed techniques have been implemented and tested. In addition, a new method for temporal segmentation based on audio texture is described. This method is combined with audio analysis techniques and used for hierarchical browsing, classification and annotation of audio files.

444 citations

Journal ArticleDOI
TL;DR: The spectral smoothness principle is proposed as an efficient new mechanism in estimating the spectral envelopes of detected sounds and works robustly in noise, and is able to handle sounds that exhibit inharmonicities.
Abstract: A new method for estimating the fundamental frequencies of concurrent musical sounds is described. The method is based on an iterative approach, where the fundamental frequency of the most prominent sound is estimated, the sound is subtracted from the mixture, and the process is repeated for the residual signal. For the estimation stage, an algorithm is proposed which utilizes the frequency relationships of simultaneous spectral components, without assuming ideal harmonicity. For the subtraction stage, the spectral smoothness principle is proposed as an efficient new mechanism in estimating the spectral envelopes of detected sounds. With these techniques, multiple fundamental frequency estimation can be performed quite accurately in a single time frame, without the use of long-term temporal features. The experimental data comprised recorded samples of 30 musical instruments from four different sources. Multiple fundamental frequency estimation was performed for random sound source and pitch combinations. Error rates for mixtures ranging from one to six simultaneous sounds were 1.8%, 3.9%, 6.3%, 9.9%, 14%, and 18%, respectively. In musical interval and chord identification tasks, the algorithm outperformed the average of ten trained musicians. The method works robustly in noise, and is able to handle sounds that exhibit inharmonicities. The inharmonicity factor and spectral envelope of each sound are estimated along with the fundamental frequency.
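A toy Python/NumPy rendering of the iterative estimate-and-subtract loop may help fix ideas. Everything here is a simplification of mine: salience is a 1/k-weighted harmonic sum (a crude guard against octave-below errors), and "subtraction" just zeroes bins around each harmonic, whereas the actual method fits smooth spectral envelopes per the spectral smoothness principle and does not assume ideal harmonicity.

```python
import numpy as np

def salience(mag, freqs, f0, n_harm=8):
    """Weighted sum of spectral magnitude at the harmonics of f0; the
    1/k weight is a crude guard against subharmonic errors."""
    score, bins = 0.0, []
    for k in range(1, n_harm + 1):
        i = int(np.argmin(np.abs(freqs - k * f0)))
        score += mag[i] / k
        bins.append(i)
    return score, bins

def iterative_f0(signal, fs, n_sounds=2):
    """Toy estimate-and-subtract loop: pick the most salient F0, cancel
    its harmonics in the magnitude spectrum, repeat on the residual."""
    mag = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    candidates = np.arange(80.0, 500.0, 1.0)
    found = []
    for _ in range(n_sounds):
        scores = [salience(mag, freqs, f0)[0] for f0 in candidates]
        best = float(candidates[int(np.argmax(scores))])
        found.append(best)
        _, bins = salience(mag, freqs, best)
        for i in bins:                       # remove the detected sound
            mag[max(0, i - 2): i + 3] = 0.0  # widen to cover leakage
    return found

# Example: a 220 Hz + 340 Hz mixture, four harmonics each.
fs, dur = 8000, 0.2
t = np.arange(0, dur, 1.0 / fs)
mix = (sum(np.sin(2 * np.pi * 220 * k * t) / k for k in range(1, 5))
       + sum(np.sin(2 * np.pi * 340 * k * t) / k for k in range(1, 5)))
print(iterative_f0(mix, fs))  # two values near 220 and 340 (order may vary)
```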

356 citations

Journal ArticleDOI
TL;DR: A predominant-F0 estimation method called PreFEst is proposed that does not rely on the unreliable fundamental component and obtains the most predominant F0 supported by harmonics within an intentionally limited frequency range.

345 citations


Cites background or methods from "A comparative performance study of ..."

  • ...While we do not intend to build a psychoacoustical model of human perception, certain psychoacoustical results may have some relevance concerning our strategy: Ritsma (1967) reported that the ear uses a rather limited spectral region in achieving a well-defined pitch perception; Plomp (1967) concluded that for fundamental frequencies up to about 1400 Hz, the pitch of a complex tone is determined by the second and higher harmonics rather than by the fundamental....

    [...]

  • ...Most previous F0 estimation methods (Noll, 1967; Schroeder, 1968; Rabiner et al., 1976; Nehorai and Porat, 1986; Charpentier, 1986; Ohmura, 1994; Abe et al., 1996; Kawahara et al., 1999) have been premised upon the input audio signal containing just a single-pitch sound with aperiodic noise....

    [...]

References
Book
02 Dec 2011
TL;DR: A book-length treatment of linear prediction of speech: formulations and solution algorithms, acoustic tube modeling, speech synthesis structures, spectral analysis, formant and fundamental frequency estimation, computational considerations, and vocoders.
Abstract: Table of contents:
1. Introduction: 1.1 Basic Physical Principles; 1.2 Acoustical Waveform Examples; 1.3 Speech Analysis and Synthesis Models; 1.4 The Linear Prediction Model; 1.5 Organization of Book.
2. Formulations: 2.1 Historical Perspective; 2.2 Maximum Likelihood; 2.3 Minimum Variance; 2.4 Prony's Method; 2.5 Correlation Matching; 2.6 PARCOR (Partial Correlation); 2.6.1 Inner Products and an Orthogonality Principle; 2.6.2 The PARCOR Lattice Structure.
3. Solutions and Properties: 3.1 Introduction; 3.2 Vector Spaces and Inner Products; 3.2.1 Filter or Polynomial Norms; 3.2.2 Properties of Inner Products; 3.2.3 Orthogonality Relations; 3.3 Solution Algorithms; 3.3.1 Correlation Matrix; 3.3.2 Initialization; 3.3.3 Gram-Schmidt Orthogonalization; 3.3.4 Levinson Recursion; 3.3.5 Updating Am(z); 3.3.6 A Test Example; 3.4 Matrix Forms.
4. Acoustic Tube Modeling: 4.1 Introduction; 4.2 Acoustic Tube Derivation; 4.2.1 Single Section Derivation; 4.2.2 Continuity Conditions; 4.2.3 Boundary Conditions; 4.3 Relationship between Acoustic Tube and Linear Prediction; 4.4 An Algorithm, Examples, and Evaluation; 4.4.1 An Algorithm; 4.4.2 Examples; 4.4.3 Evaluation of the Procedure; 4.5 Estimation of Lip Impedance; 4.5.1 Lip Impedance Derivation; 4.6 Further Topics; 4.6.1 Losses in the Acoustic Tube Model; 4.6.2 Acoustic Tube Stability.
5. Speech Synthesis Structures: 5.1 Introduction; 5.2 Stability; 5.2.1 Step-up Procedure; 5.2.2 Step-down Procedure; 5.2.3 Polynomial Properties; 5.2.4 A Bound on |Fm(z)|; 5.2.5 Necessary and Sufficient Stability Conditions; 5.2.6 Application of Results; 5.3 Recursive Parameter Evaluation; 5.3.1 Inner Product Properties; 5.3.2 Equation Summary with Program; 5.4 A General Synthesis Structure; 5.5 Specific Speech Synthesis Structures; 5.5.1 The Direct Form; 5.5.2 Two-Multiplier Lattice Model; 5.5.3 Kelly-Lochbaum Model; 5.5.4 One-Multiplier Models; 5.5.5 Normalized Filter Model; 5.5.6 A Test Example.
6. Spectral Analysis: 6.1 Introduction; 6.2 Spectral Properties; 6.2.1 Zero Mean All-Pole Model; 6.2.2 Gain Factor for Spectral Matching; 6.2.3 Limiting Spectral Match; 6.2.4 Non-uniform Spectral Weighting; 6.2.5 Minimax Spectral Matching; 6.3 A Spectral Flatness Model; 6.3.1 A Spectral Flatness Measure; 6.3.2 Spectral Flatness Transformations; 6.3.3 Numerical Evaluation; 6.3.4 Experimental Results; 6.3.5 Driving Function Models; 6.4 Selective Linear Prediction; 6.4.1 Selective Linear Prediction (SLP) Algorithm; 6.4.2 A Selective Linear Prediction Program; 6.4.3 Computational Considerations; 6.5 Considerations in Choice of Analysis Conditions; 6.5.1 Choice of Method; 6.5.2 Sampling Rates; 6.5.3 Order of Filter; 6.5.4 Choice of Analysis Interval; 6.5.5 Windowing; 6.5.6 Pre-emphasis; 6.6 Spectral Evaluation Techniques; 6.7 Pole Enhancement.
7. Automatic Formant Trajectory Estimation: 7.1 Introduction; 7.2 Formant Trajectory Estimation Procedure; 7.2.1 Introduction; 7.2.2 Raw Data from A(z); 7.2.3 Examples of Raw Data; 7.3 Comparison of Raw Data from Linear Prediction and Cepstral Smoothing; 7.4 Algorithm 1; 7.5 Algorithm 2; 7.5.1 Definition of Anchor Points; 7.5.2 Processing of Each Voiced Segment; 7.5.3 Final Smoothing; 7.5.4 Results and Discussion; 7.6 Formant Estimation Accuracy; 7.6.1 An Example of Synthetic Speech Analysis; 7.6.2 An Example of Real Speech Analysis; 7.6.3 Influence of Voice Periodicity.
8. Fundamental Frequency Estimation: 8.1 Introduction; 8.2 Preprocessing by Spectral Flattening; 8.2.1 Analysis of Voiced Speech with Spectral Regularity; 8.2.2 Analysis of Voiced Speech with Spectral Irregularities; 8.2.3 The STREAK Algorithm; 8.3 Correlation Techniques; 8.3.1 Autocorrelation Analysis; 8.3.2 Modified Autocorrelation Analysis; 8.3.3 Filtered Error Signal Autocorrelation Analysis; 8.3.4 Practical Considerations; 8.3.5 The SIFT Algorithm.
9. Computational Considerations in Analysis: 9.1 Introduction; 9.2 Ill-Conditioning; 9.2.1 A Measure of Ill-Conditioning; 9.2.2 Pre-emphasis of Speech Data; 9.2.3 Prefiltering before Sampling; 9.3 Implementing Linear Prediction Analysis; 9.3.1 Autocorrelation Method; 9.3.2 Covariance Method; 9.3.3 Computational Comparison; 9.4 Finite Word Length Considerations; 9.4.1 Finite Word Length Coefficient Computation; 9.4.2 Finite Word Length Solution of Equations; 9.4.3 Overall Finite Word Length Implementation.
10. Vocoders: 10.1 Introduction; 10.2 Techniques; 10.2.1 Coefficient Transformations; 10.2.2 Encoding and Decoding; 10.2.3 Variable Frame Rate Transmission; 10.2.4 Excitation and Synthesis Gain Matching; 10.2.5 A Linear Prediction Synthesizer Program; 10.3 Low Bit Rate Pitch Excited Vocoders; 10.3.1 Maximum Likelihood and PARCOR Vocoders; 10.3.2 Autocorrelation Method Vocoders; 10.3.3 Covariance Method Vocoders; 10.4 Base-Band Excited Vocoders.
11. Further Topics: 11.1 Speaker Identification and Verification; 11.2 Isolated Word Recognition; 11.3 Acoustical Detection of Laryngeal Pathology; 11.4 Pole-Zero Estimation; 11.5 Summary and Future Directions.
References.

1,945 citations


"A comparative performance study of ..." refers methods in this paper

  • ...autocorrelation method of LPC analysis [14]. The 2-kHz...

    [...]

Journal ArticleDOI
TL;DR: Algorithms were developed heuristically for picking those peaks corresponding to voiced‐speech segments and the vocal pitch periods, which were then used to derive the excitation for a computer‐simulated channel vocoder.
Abstract: The cepstrum, defined as the power spectrum of the logarithm of the power spectrum, has a strong peak corresponding to the pitch period of the voiced‐speech segment being analyzed. Cepstra were calculated on a digital computer and were automatically plotted on microfilm. Algorithms were developed heuristically for picking those peaks corresponding to voiced‐speech segments and the vocal pitch periods. This information was then used to derive the excitation for a computer‐simulated channel vocoder. The pitch quality of the vocoded speech was judged by experienced listeners in informal comparison tests to be indistinguishable from the original speech.
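A minimal Python/NumPy sketch of the cepstral peak-picking idea, using the common modern variant of the cepstrum (inverse transform of the log magnitude spectrum) rather than the "power spectrum of the log power spectrum" formulation quoted above; the quefrency limits and the absence of a voiced/unvoiced decision are simplifications.

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Estimate pitch as the strongest cepstral peak in the quefrency
    range of plausible pitch periods.  The heuristics for picking peaks
    and deciding voicing in the original are far more elaborate."""
    windowed = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.abs(np.fft.irfft(log_mag))
    lo, hi = int(fs / fmax), int(fs / fmin)
    period = lo + int(np.argmax(cepstrum[lo:hi]))
    return fs / period

# A rich 125 Hz harmonic tone gives a clear cepstral peak at 8 ms.
fs = 8000
t = np.arange(0, 0.04, 1.0 / fs)
x = sum(np.sin(2 * np.pi * 125 * k * t) for k in range(1, 20))
print(cepstral_pitch(x, fs))  # 125.0
```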

851 citations


"A comparative performance study of ..." refers background in this paper

  • ...the cepstral pitch detector [5]) to estimate the period of the...

    [...]

  • ...speech processing literature (e.g., [5]-[11])....

    [...]

Journal ArticleDOI
TL;DR: The implementation of the AMDF pitch extractor (nonreal-time simulation and real-time) is described and experimental results presented to illustrate its basic measurement properties.
Abstract: This paper describes a method for using the average magnitude difference function (AMDF) and associated decision logic to estimate the pitch period of voiced speech sounds. The AMDF is a variation on autocorrelation analysis where, instead of correlating the input speech at various delays (where multiplications and summations are formed at each value of delay), a difference signal is formed between the delayed speech and the original and, at each delay, the absolute magnitude of the difference is taken. The difference signal is always zero at delay = 0, and exhibits deep nulls at delays corresponding to the pitch period of voiced sounds. Some of the reasons the AMDF is attractive include the following. 1) It is a simple measurement which gives a good estimate of pitch contour, 2) it has no multiply operations, 3) its dynamic range characteristics are suitable for implementation on a 16-bit machine, and 4) the nature of its operations makes it suitable for implementation on a programmable processor or in special purpose hardware. The implementation of the AMDF pitch extractor (nonreal-time simulation and real-time) is described and experimental results are presented to illustrate its basic measurement properties.
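The AMDF itself is only a few lines. The Python/NumPy sketch below (names and lag limits are my own) shows the null-picking idea, without the decision logic the paper describes.

```python
import numpy as np

def amdf(frame, max_lag):
    """Average magnitude difference function: for each delay, the mean
    absolute difference between the frame and its delayed copy.  Note
    there are no multiplies, which is property 2) above."""
    x = np.asarray(frame, dtype=float)
    return np.array([np.mean(np.abs(x[lag:] - x[:-lag]))
                     for lag in range(1, max_lag + 1)])

def amdf_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Pitch period = delay of the deepest AMDF null in a plausible range."""
    d = amdf(frame, int(fs / fmin))
    lo = int(fs / fmax)
    lag = lo + int(np.argmin(d[lo - 1:]))  # d[0] corresponds to lag 1
    return fs / lag

# A 100 Hz harmonic tone: deep null at an 80-sample (10 ms) delay.
fs = 8000
t = np.arange(0, 0.04, 1.0 / fs)
x = sum(np.sin(2 * np.pi * 100 * k * t) / k for k in range(1, 6))
print(amdf_pitch(x, fs))  # 100.0
```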

562 citations


"A comparative performance study of ..." refers methods or result in this paper

  • ...[10]. (The version used in this study was kindly supplied by...

    [...]

  • ...method. Details of implementation differ somewhat from those of [10].)...

    [...]

  • ...7) Average magnitude difference function (AMDF) (NSA version, [10])....

    [...]

Journal ArticleDOI
TL;DR: A pattern recognition approach for deciding whether a given segment of a speech signal should be classified as voiced speech, unvoiced speech, or silence, based on measurements made on the signal, which has been found to provide reliable classification with speech segments as short as 10 ms.
Abstract: In speech analysis, the voiced-unvoiced decision is usually performed in conjunction with pitch analysis. The linking of the voiced-unvoiced (V-UV) decision to pitch analysis not only results in unnecessary complexity, but makes it difficult to classify short speech segments which are less than a few pitch periods in duration. In this paper, we describe a pattern recognition approach for deciding whether a given segment of a speech signal should be classified as voiced speech, unvoiced speech, or silence, based on measurements made on the signal. In this method, five different measurements are made on the speech segment to be classified. The measured parameters are the zero-crossing rate, the speech energy, the correlation between adjacent speech samples, the first predictor coefficient from a 12-pole linear predictive coding (LPC) analysis, and the energy in the prediction error. The speech segment is assigned to a particular class based on a minimum-distance rule obtained under the assumption that the measured parameters are distributed according to the multidimensional Gaussian probability density function. The means and covariances for the Gaussian distribution are determined from manually classified speech data included in a training set. The method has been found to provide reliable classification with speech segments as short as 10 ms and has been used for both speech analysis-synthesis and recognition applications. A simple nonlinear smoothing algorithm is described to provide a smooth 3-level contour of an utterance for use in speech recognition applications. Quantitative results and several examples illustrating the performance of the method are included in the paper.
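The decision rule is easy to sketch. Below is a small Python/NumPy illustration of a minimum-distance classifier of the kind described: each class gets a mean and covariance from training data, and a frame's feature vector goes to the class with the smallest Mahalanobis distance. The class name, the two-dimensional toy features, and the omission of the log-determinant term a full Gaussian likelihood would carry are all simplifications.

```python
import numpy as np

class MinimumDistanceVUV:
    """Per-class Gaussian statistics plus a nearest-class rule, in the
    spirit of the voiced/unvoiced/silence classifier above.  Feature
    extraction (zero-crossing rate, energy, LPC measures, ...) is
    assumed to happen elsewhere."""

    def fit(self, features_by_class):
        # features_by_class maps label -> (n_frames, n_features) array.
        self.stats = {}
        for label, feats in features_by_class.items():
            feats = np.asarray(feats, dtype=float)
            mean = feats.mean(axis=0)
            cov_inv = np.linalg.inv(np.cov(feats, rowvar=False))
            self.stats[label] = (mean, cov_inv)
        return self

    def classify(self, x):
        x = np.asarray(x, dtype=float)

        def mahalanobis(stats):
            mean, cov_inv = stats
            d = x - mean
            return float(d @ cov_inv @ d)

        return min(self.stats, key=lambda label: mahalanobis(self.stats[label]))

# Toy 2-D features (think energy vs. zero-crossing rate), three classes.
rng = np.random.default_rng(0)
train = {
    "voiced":   rng.normal([0.9, 0.1], 0.05, size=(100, 2)),
    "unvoiced": rng.normal([0.2, 0.7], 0.05, size=(100, 2)),
    "silence":  rng.normal([0.0, 0.1], 0.02, size=(100, 2)),
}
clf = MinimumDistanceVUV().fit(train)
print(clf.classify([0.85, 0.12]))  # voiced
```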

479 citations


"A comparative performance study of ..." refers methods in this paper

  • ...recognition technique to classify each 10-ms interval of speech as voiced or unvoiced [15]....

    [...]

  • ...With careful training, voiced-unvoiced accuracies on the order of 99 percent have been obtained [15]....

    [...]

  • ...pattern recognition approach (of the LPC method) using the five parameters discussed in [15]....

    [...]

Journal ArticleDOI
TL;DR: It is demonstrated that the simplified inverse filter tracking algorithm (hereafter referred to as the SIFT algorithm) encompasses the desirable properties of both autocorrelation and cepstral pitch analysis techniques.
Abstract: In this paper a new method for estimating F0, the fundamental frequency of voiced speech versus time, is presented. The algorithm is based upon a simplified version of a general technique for fundamental frequency extraction using digital inverse filtering. It is demonstrated that the simplified inverse filter tracking algorithm (hereafter referred to as the SIFT algorithm) encompasses the desirable properties of both autocorrelation and cepstral pitch analysis techniques. In addition, the SIFT algorithm is composed of only a relatively small number of elementary arithmetic operations. In machine language, SIFT should run in several times real time, while with special-purpose hardware it could easily be realized in real time.
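A compressed Python/NumPy sketch of the SIFT idea follows: a low-order LPC inverse filter flattens the spectrum, and the residual is autocorrelated for the period. The direct normal-equation solve, the filter order, and the omission of the published algorithm's low-pass filtering, decimation, and peak interpolation are all simplifications of mine.

```python
import numpy as np

def lpc_coefficients(x, order=4):
    """Autocorrelation-method LPC by solving the normal equations
    directly (a Levinson recursion is the usual cheaper route)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))  # inverse filter A(z)

def sift_pitch(frame, fs, order=4, fmin=50.0, fmax=400.0):
    """Inverse-filter to flatten the spectrum, then autocorrelate the
    residual and pick the peak lag as the pitch period."""
    a = lpc_coefficients(frame, order)
    residual = np.convolve(frame, a, mode="same")
    r = np.correlate(residual, residual, mode="full")[len(residual) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

# A 110 Hz harmonic tone: the residual autocorrelation peaks near 73 samples.
fs = 8000
t = np.arange(0, 0.04, 1.0 / fs)
x = sum(np.sin(2 * np.pi * 110 * k * t) / k for k in range(1, 8))
print(round(sift_pitch(x, fs), 1))  # close to 110 Hz
```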

398 citations