Journal ArticleDOI

Improved phase vocoder time-scale modification of audio

01 May 1999-IEEE Transactions on Speech and Audio Processing (IEEE)-Vol. 7, Iss: 3, pp 323-332
TL;DR: This paper examines the problem of phasiness in the context of time-scale modification, provides new insights into its causes, and introduces two extensions to the standard phase vocoder algorithm that significantly improve the resulting sound quality.
Abstract: The phase vocoder is a well established tool for time scaling and pitch shifting speech and audio signals via modification of their short-time Fourier transforms (STFTs). In contrast to time-domain time-scaling and pitch-shifting techniques, the phase vocoder is generally considered to yield high quality results, especially for large modification factors and/or polyphonic signals. However, the phase vocoder is also known for introducing a characteristic perceptual artifact, often described as "phasiness", "reverberation", or "loss of presence". This paper examines the problem of phasiness in the context of time-scale modification and provides new insights into its causes. Two extensions to the standard phase vocoder algorithm are introduced, and the resulting sound quality is shown to be significantly improved. Moreover, the modified phase vocoder is shown to provide a factor-of-two decrease in computational cost.
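The paper's improvements build on the standard phase vocoder, whose core idea can be sketched as follows. This is a minimal illustration of conventional STFT phase propagation, not the authors' improved algorithm; the function name and parameters are illustrative.

```python
import numpy as np

def phase_vocoder(x, rate, n_fft=1024, hop=256):
    """Time-scale x by `rate` (>1 shortens) via STFT phase propagation."""
    win = np.hanning(n_fft)
    hop_a = hop * rate  # analysis hop; the synthesis hop stays at `hop`
    starts = np.arange(0, len(x) - n_fft, hop_a)
    stft = np.array([np.fft.rfft(win * x[int(s):int(s) + n_fft])
                     for s in starts])
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft  # bin freq, rad/sample
    phase = np.angle(stft[0])
    out = np.zeros(len(stft) * hop + n_fft)
    for i in range(len(stft)):
        # Resynthesize frame i with the accumulated phase, then overlap-add.
        frame = np.fft.irfft(np.abs(stft[i]) * np.exp(1j * phase))
        out[i * hop:i * hop + n_fft] += win * frame
        if i + 1 < len(stft):
            # Measured minus expected phase advance, wrapped to [-pi, pi),
            # gives each bin's true frequency; accumulate it over one
            # synthesis hop.
            dphi = np.angle(stft[i + 1]) - np.angle(stft[i]) - omega * hop_a
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
            phase += (omega + dphi / hop_a) * hop
    return out
```

Because synthesis phases in each bin evolve independently, their relationships across bins drift apart over time; that loss of phase coherence is the source of the "phasiness" the paper addresses.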


Citations
Book
23 Nov 2007
TL;DR: This new edition now contains essential information on steganalysis and steganography, and digital watermark embedding is given a complete update with new processes and applications.
Abstract: Digital audio, video, images, and documents are flying through cyberspace to their respective owners. Unfortunately, along the way, individuals may choose to intervene and take this content for themselves. Digital watermarking and steganography technology greatly reduces the instances of this by limiting or eliminating the ability of third parties to decipher the content that they have taken. The many techniques of digital watermarking (embedding a code) and steganography (hiding information) continue to evolve as applications that necessitate them do the same. The authors of this second edition provide an update on the framework for applying these techniques that they provided researchers and professionals in the first well-received edition. Steganography and steganalysis (the art of detecting hidden information) have been added to a robust treatment of digital watermarking, as many in each field research and deal with the other. New material includes watermarking with side information, QIM, and dirty-paper codes. The revision and inclusion of new material by these influential authors has created a must-own book for anyone in this profession. *This new edition now contains essential information on steganalysis and steganography *New concepts and new applications including QIM introduced *Digital watermark embedding is given a complete update with new processes and applications
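As a much-simplified illustration of steganography (hiding information in a host signal), the toy sketch below embeds bits in the least-significant bits of 16-bit audio samples. This is not a scheme from the book; robust methods such as QIM and dirty-paper codes are far more sophisticated, and the function names here are invented.

```python
import numpy as np

def lsb_embed(samples, bits):
    """Hide a bit sequence in the LSBs of 16-bit samples (toy example)."""
    out = samples.copy()
    out[:len(bits)] = (out[:len(bits)] & ~1) | np.asarray(bits, dtype=out.dtype)
    return out

def lsb_extract(samples, n):
    """Recover the first n hidden bits from the sample LSBs."""
    return (samples[:n] & 1).astype(np.uint8)
```

A one-bit-per-sample change is perceptually negligible at 16-bit resolution, but it is also trivially detectable and fragile, which is exactly why the book's steganalysis and watermarking machinery exists.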

1,773 citations

Book
26 Jan 2011
Abstract: CONCRETE

447 citations

Patent
15 Jun 2007
TL;DR: In this paper, a method to create new music by listening to a plurality of music, learning from the plurality, and performing concatenative synthesis based on the listening and the learning to create the new music is described.
Abstract: Automated creation of new music by listening is disclosed. A method to create new music may comprise listening to a plurality of music, learning from the plurality of music, and performing concatenative synthesis based on the listening and the learning to create the new music. The method may be performed on a computing device having an audio interface, such as a personal computer.
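The patent's listening and learning stages are not specified here; as a loose illustration of the final concatenative-synthesis step only, the toy sketch below (the function name and frame-matching feature are invented) rebuilds a target signal from the closest-matching frames of a corpus by magnitude-spectrum distance.

```python
import numpy as np

def concat_resynthesize(target, corpus, frame=256):
    """Rebuild `target` from the nearest corpus frames (toy concatenative
    synthesis using squared magnitude-spectrum distance)."""
    def frames(x):
        n = len(x) // frame
        return x[:n * frame].reshape(n, frame)
    corpus_frames = frames(corpus)
    corpus_specs = np.abs(np.fft.rfft(corpus_frames, axis=1))
    out = []
    for f in frames(target):
        spec = np.abs(np.fft.rfft(f))
        best = np.argmin(np.sum((corpus_specs - spec) ** 2, axis=1))
        out.append(corpus_frames[best])
    return np.concatenate(out)
```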

214 citations

Journal ArticleDOI
TL;DR: This work focuses on single-channel speech enhancement algorithms which rely on spectrotemporal properties, and can be employed when the miniaturization of devices only allows for using a single microphone.
Abstract: With the advancement of technology, both assisted listening devices and speech communication devices are becoming more portable and also more frequently used. As a consequence, users of devices such as hearing aids, cochlear implants, and mobile telephones, expect their devices to work robustly anywhere and at any time. This holds in particular for challenging noisy environments like a cafeteria, a restaurant, a subway, a factory, or in traffic. One way to make assisted listening devices robust to noise is to apply speech enhancement algorithms. To improve the corrupted speech, spatial diversity can be exploited by a constructive combination of microphone signals (so-called beamforming), and by exploiting the different spectro-temporal properties of speech and noise. Here, we focus on single-channel speech enhancement algorithms which rely on spectro-temporal properties. On the one hand, these algorithms can be employed when the miniaturization of devices only allows for using a single microphone. On the other hand, when multiple microphones are available, single-channel algorithms can be employed as a postprocessor at the output of a beamformer. To exploit the short-term stationary properties of natural sounds, many of these approaches process the signal in a time-frequency representation, most frequently the short-time discrete Fourier transform (STFT) domain. In this domain, the coefficients of the signal are complex-valued, and can therefore be represented by their absolute value (referred to in the literature both as STFT magnitude and STFT amplitude) and their phase. While the modeling and processing of the STFT magnitude has been the center of interest in the past three decades, phase has been largely ignored.
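The survey covers a family of STFT-magnitude-domain enhancers. A minimal sketch in the spirit of spectral subtraction, one representative of that family, could look like the following; it assumes the first few frames are noise-dominated (a strong simplification) and keeps the noisy phase unchanged, which is exactly the practice the survey questions.

```python
import numpy as np

def spectral_subtract(noisy, noise_frames=5, n_fft=512, hop=256):
    """Toy single-channel enhancement: subtract an estimated noise magnitude
    from each frame's STFT magnitude, reuse the noisy phase, overlap-add."""
    win = np.hanning(n_fft)
    starts = range(0, len(noisy) - n_fft, hop)
    specs = [np.fft.rfft(win * noisy[s:s + n_fft]) for s in starts]
    # Assume the opening frames contain only noise (a simplification).
    noise_mag = np.mean([np.abs(s) for s in specs[:noise_frames]], axis=0)
    out = np.zeros(len(noisy))
    for i, spec in enumerate(specs):
        # Subtract in magnitude, with a spectral floor to limit musical noise.
        mag = np.maximum(np.abs(spec) - noise_mag, 0.05 * np.abs(spec))
        frame = np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
        out[i * hop:i * hop + n_fft] += win * frame
    return out
```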

210 citations

PatentDOI
TL;DR: In this paper, an audio signal is analyzed using multiple psychoacoustic criteria to identify a region of the signal in which time scaling and/or pitch shifting processing would be inaudible or minimally audible.
Abstract: In one alternative, an audio signal is analyzed using multiple psychoacoustic criteria to identify a region of the signal in which time scaling and/or pitch shifting processing would be inaudible or minimally audible, and the signal is time scaled and/or pitch shifted within that region. In another alternative, the signal is divided into auditory events, and the signal is time scaled and/or pitch shifted within an auditory event. In a further alternative, the signal is divided into auditory events, and the auditory events are analyzed using a psychoacoustic criterion to identify those auditory events in which the time scaling and/or pitch shifting processing of the signal would be inaudible or minimally audible. Further alternatives provide for multiple channels of audio.

171 citations

References
Journal ArticleDOI
TL;DR: An algorithm to estimate a signal from its modified short-time Fourier transform (STFT) by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT magnitude is presented.
Abstract: In this paper, we present an algorithm to estimate a signal from its modified short-time Fourier transform (STFT). This algorithm is computationally simple and is obtained by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT. Using this algorithm, we also develop an iterative algorithm to estimate a signal from its modified STFT magnitude. The iterative algorithm is shown to decrease, in each iteration, the mean squared error between the STFT magnitude of the estimated signal and the modified STFT magnitude. The major computation involved in the iterative algorithm is the discrete Fourier transform (DFT) computation, and the algorithm appears to be real-time implementable with current hardware technology. The algorithm developed in this paper has been applied to the time-scale modification of speech. The resulting system generates very high-quality speech, and appears to be better in performance than any existing method.
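The iterative magnitude-only estimation described in this reference is widely known as the Griffin-Lim algorithm. A compact sketch of the alternating projection it performs (impose the target magnitude, invert by least-squares overlap-add, re-analyze, repeat) is below; the parameter choices are illustrative.

```python
import numpy as np

def griffin_lim(target_mag, n_fft=512, hop=128, n_iter=32, seed=0):
    """Estimate a signal whose STFT magnitude approximates `target_mag`
    (frames x bins) by alternating projection from a random start."""
    rng = np.random.default_rng(seed)
    win = np.hanning(n_fft)
    n_frames = target_mag.shape[0]
    length = (n_frames - 1) * hop + n_fft
    x = rng.standard_normal(length)
    for _ in range(n_iter):
        # Project onto the set of STFTs with the target magnitude.
        specs = np.array([np.fft.rfft(win * x[i * hop:i * hop + n_fft])
                          for i in range(n_frames)])
        specs = target_mag * np.exp(1j * np.angle(specs))
        # Least-squares inversion: window-weighted overlap-add, normalized.
        x = np.zeros(length)
        norm = np.zeros(length)
        for i in range(n_frames):
            x[i * hop:i * hop + n_fft] += win * np.fft.irfft(specs[i])
            norm[i * hop:i * hop + n_fft] += win ** 2
        x /= np.maximum(norm, 1e-8)
    return x
```

Each iteration provably does not increase the squared error between the estimate's STFT magnitude and the target, which is the monotone-convergence property the abstract describes.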

1,899 citations

Journal ArticleDOI
TL;DR: A sinusoidal model for the speech waveform is used to develop a new analysis/synthesis technique that is characterized by the amplitudes, frequencies, and phases of the component sine waves, which forms the basis for new approaches to the problems of speech transformations including time-scale and pitch-scale modification, and midrate speech coding.
Abstract: A sinusoidal model for the speech waveform is used to develop a new analysis/synthesis technique that is characterized by the amplitudes, frequencies, and phases of the component sine waves. These parameters are estimated from the short-time Fourier transform using a simple peak-picking algorithm. Rapid changes in the highly resolved spectral components are tracked using the concept of "birth" and "death" of the underlying sine waves. For a given frequency track a cubic function is used to unwrap and interpolate the phase such that the phase track is maximally smooth. This phase function is applied to a sine-wave generator, which is amplitude modulated and added to the other sine waves to give the final speech output. The resulting synthetic waveform preserves the general waveform shape and is essentially perceptually indistinguishable from the original speech. Furthermore, in the presence of noise the perceptual characteristics of the speech as well as the noise are maintained. In addition, it was found that the representation was sufficiently general that high-quality reproduction was obtained for a larger class of inputs including: two overlapping, superposed speech waveforms; music waveforms; speech in musical backgrounds; and certain marine biologic sounds. Finally, the analysis/synthesis system forms the basis for new approaches to the problems of speech transformations including time-scale and pitch-scale modification, and midrate speech coding [8], [9].
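The analysis side of this sinusoidal model starts from a simple peak-picking step. A crude version of that step alone is sketched below (frequency resolution limited to one bin; the paper's partial tracking and cubic phase interpolation are omitted, and the function name is invented).

```python
import numpy as np

def pick_peaks(frame, sr, n_fft=2048, max_peaks=5):
    """Crude sinusoidal analysis: return (frequency_hz, amplitude, phase)
    for the largest local maxima of one frame's magnitude spectrum."""
    spec = np.fft.rfft(np.hanning(len(frame)) * frame, n_fft)
    mag = np.abs(spec)
    # A bin is a peak if it exceeds both neighbors.
    is_peak = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
    bins = np.where(is_peak)[0] + 1
    bins = bins[np.argsort(mag[bins])[::-1][:max_peaks]]  # loudest first
    return [(b * sr / n_fft, mag[b], np.angle(spec[b])) for b in bins]
```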

1,659 citations

Proceedings ArticleDOI
14 Apr 1983
TL;DR: An algorithm to estimate a signal from its modified short-time Fourier transform (STFT) by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT magnitude is presented.
Abstract: In this paper, we present an algorithm to estimate a signal from its modified short-time Fourier transform (STFT). This algorithm is computationally simple and is obtained by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT. Using this algorithm, we also develop an iterative algorithm to estimate a signal from its modified STFT magnitude. The iterative algorithm is shown to decrease, in each iteration, the mean squared error between the STFT magnitude of the estimated signal and the modified STFT magnitude. The major computation involved in the iterative algorithm is the discrete Fourier transform (DFT) computation, and the algorithm appears to be real-time implementable with current hardware technology. The algorithm developed in this paper has been applied to the time-scale modification of speech. The resulting system generates very high-quality speech, and appears to be better in performance than any existing method.

532 citations

Proceedings ArticleDOI
S. Roucos1, A. Wilgus1
26 Apr 1985
TL;DR: A new and simple method for speech rate modification that yields high quality rate-modified speech and both objective and informal subjective results for the new and previous TSM methods are presented.
Abstract: We present a new and simple method for speech rate modification that yields high quality rate-modified speech. Earlier algorithms either required a significant amount of computation for good quality output speech or resulted in poor quality rate-modified speech. The algorithm we describe allows arbitrary linear or nonlinear scaling of the time axis. The algorithm operates in the time domain using a modified overlap-and-add (OLA) procedure on the waveform. It requires moderate computation and could be easily implemented in real time on currently available hardware. The algorithm works equally well on single voice speech, multiple-voice speech, and speech in noise. In this paper, we discuss an earlier algorithm for time-scale modification (TSM), and present both objective and informal subjective results for the new and previous TSM methods.
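A toy sketch loosely modeled on the synchronized overlap-add idea follows; it is not the authors' exact algorithm, and the function name and parameters are illustrative. Each analysis frame is shifted within a small search range to the lag that best correlates with the output tail before cross-fading, which is what keeps waveform periods aligned.

```python
import numpy as np

def sola(x, rate, frame=1024, hop=256, search=128):
    """Time-domain time-scale modification in the spirit of synchronized
    overlap-add: rate > 1 shortens the signal."""
    overlap = frame - hop
    fade = np.linspace(0.0, 1.0, overlap)
    out = list(x[:frame].astype(float))
    i = 1
    while True:
        start = int(i * hop * rate)  # nominal analysis position
        lo = max(start - search, 0)
        hi = min(start + search, len(x) - frame)
        if hi <= lo:
            break
        tail = np.array(out[-overlap:])
        # Pick the candidate offset whose frame head best matches the tail.
        corr = [float(np.dot(tail, x[s:s + overlap])) for s in range(lo, hi)]
        s = lo + int(np.argmax(corr))
        seg = x[s:s + frame].astype(float)
        # Cross-fade the overlapping region, then append the remainder.
        out[-overlap:] = (1 - fade) * tail + fade * seg[:overlap]
        out.extend(seg[overlap:])
        i += 1
    return np.array(out)
```

The per-frame correlation search is the "moderate computation" such methods trade for avoiding STFT analysis entirely.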

420 citations


"Improved phase vocoder time-scale m..." refers background in this paper

  • ...More recently, various authors have noted that the iterative process can be greatly accelerated by calculating good sets of initial STFT phase values [13]....


Journal ArticleDOI
TL;DR: This contribution reviews frequency-domain algorithms (phase-vocoder) and time-domain algorithms (Time-Domain Pitch-Synchronous Overlap/Add and the like) in the same framework and presents more recent variations of these schemes.

363 citations


"Improved phase vocoder time-scale m..." refers background or methods in this paper

  • ...A full discussion of time-domain time-scaling techniques and their shortcomings can be found in [7] or [5]....


  • ...Further elaboration of this point can be found in [4], [7]....


  • ...Finally, when speech signals are processed, all the above phase-locked techniques still exhibit more reverberation or phasiness than time-domain techniques such as the PSOLA technique [7]....
