
Showing papers on "Speech coding published in 2000"


Journal ArticleDOI
01 Apr 2000
TL;DR: This paper reviews methodologies that achieve perceptually transparent coding of FM- and CD-quality audio signals, including algorithms that manipulate transform components, subband signal decompositions, sinusoidal signal components, and linear prediction parameters, as well as hybrid algorithms that make use of more than one signal model.
Abstract: During the last decade, CD-quality digital audio has essentially replaced analog audio. Emerging digital audio applications for network, wireless, and multimedia computing systems face a series of constraints such as reduced channel bandwidth, limited storage capacity, and low cost. These new applications have created a demand for high-quality digital audio delivery at low bit rates. In response to this need, considerable research has been devoted to the development of algorithms for perceptually transparent coding of high-fidelity (CD-quality) digital audio. As a result, many algorithms have been proposed, and several have now become international and/or commercial product standards. This paper reviews algorithms for perceptually transparent coding of CD-quality digital audio, including both research and standardization activities. This paper is organized as follows. First, psychoacoustic principles are described, with the MPEG psychoacoustic signal analysis model 1 discussed in some detail. Next, filter bank design issues and algorithms are addressed, with a particular emphasis placed on the modified discrete cosine transform, a perfect reconstruction cosine-modulated filter bank that has become of central importance in perceptual audio coding. Then, we review methodologies that achieve perceptually transparent coding of FM- and CD-quality audio signals, including algorithms that manipulate transform components, subband signal decompositions, sinusoidal signal components, and linear prediction parameters, as well as hybrid algorithms that make use of more than one signal model. These discussions concentrate on architectures and applications of those techniques that utilize psychoacoustic models to efficiently exploit the masking characteristics of the human receiver. Several algorithms that have become international and/or commercial standards receive in-depth treatment, including the ISO/IEC MPEG family (-1, -2, -4), the Lucent Technologies PAC/EPAC/MPAC, the Dolby AC-2/AC-3, and the Sony ATRAC/SDDS algorithms. Then, we describe subjective evaluation methodologies in some detail, including the ITU-R BS.1116 recommendation on subjective measurements of small impairments. This paper concludes with a discussion of future research directions.
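
As a concrete illustration of the modified discrete cosine transform discussed above, here is a minimal numpy sketch (not from the paper) of a windowed MDCT/IMDCT pair. The sine window and the 2/N inverse normalization are common conventions, not the paper's; with 50% overlap-add the time-domain aliasing cancels and interior samples reconstruct exactly, which is the perfect-reconstruction property the review highlights.

```python
import numpy as np

def mdct(frame, N):
    """N MDCT coefficients from a 2N-sample frame (window applied inside)."""
    n, k = np.arange(2 * N), np.arange(N)
    w = np.sin(np.pi / (2 * N) * (n + 0.5))  # sine window (Princen-Bradley condition)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (w * frame) @ basis

def imdct(coeffs, N):
    """2N windowed time samples (with time-domain aliasing) from N coefficients."""
    n, k = np.arange(2 * N), np.arange(N)
    w = np.sin(np.pi / (2 * N) * (n + 0.5))
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * w * (basis @ coeffs)

# 50%-overlapped frames: aliasing cancels between adjacent frames on overlap-add.
N = 256
x = np.random.randn(4 * N)
y = np.zeros_like(x)
for i in range(0, len(x) - 2 * N + 1, N):
    y[i:i + 2 * N] += imdct(mdct(x[i:i + 2 * N], N), N)
assert np.allclose(x[N:-N], y[N:-N])  # interior samples reconstruct exactly
```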

938 citations


Proceedings ArticleDOI
30 Jul 2000
TL;DR: By analyzing local self-similarity, this method can find individual note boundaries or natural segment boundaries such as verse/chorus or speech/music transitions, even in the absence of cues such as silence.
Abstract: The paper describes methods for automatically locating points of significant change in music or audio by analyzing local self-similarity. This method can find individual note boundaries or natural segment boundaries such as verse/chorus or speech/music transitions, even in the absence of cues such as silence. This approach uses the signal to model itself, and thus neither relies on particular acoustic cues nor requires training. We present a wide variety of applications, including indexing, segmenting, and beat tracking of music and audio. The method works well on a wide variety of audio sources.
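
To make the self-similarity approach concrete, the sketch below computes a Foote-style novelty curve: a cosine self-similarity matrix over per-frame feature vectors is correlated along its main diagonal with a checkerboard kernel, and peaks in the result suggest boundaries. The feature choice and kernel width L are assumptions, not the paper's exact settings.

```python
import numpy as np

def novelty_curve(features, L=16):
    """features: (n_frames, n_dims), e.g. log-magnitude spectra per frame.
    Returns a per-frame novelty score; peaks suggest segment boundaries."""
    F = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    S = F @ F.T                                   # cosine self-similarity matrix
    sign = np.sign(np.arange(L) - (L - 1) / 2.0)  # -1 for past half, +1 for future half
    kernel = np.outer(sign, sign)                 # +1 within-segment, -1 across boundary
    nov = np.zeros(len(S))
    for i in range(L // 2, len(S) - L // 2):
        nov[i] = np.sum(S[i - L // 2:i + L // 2, i - L // 2:i + L // 2] * kernel)
    return nov
```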

442 citations


Patent
Eric Thelen1, Stefan Besling1
23 Mar 2000
TL;DR: In this article, a distributed speech recognition system includes at least one client station and a server station connected via a network, such as Internet, where a speech controller directs at least part of the speech input signal to a local speech recognizer.
Abstract: A distributed speech recognition system includes at least one client station and a server station connected via a network, such as Internet. The client station includes means for receiving the speech input signal from a user. A speech controller directs at least part of the speech input signal to a local speech recognizer. The, preferably limited, speech recognizer is capable of recognizing at least part of the speech input, for instance a spoken command for starting full recognition. In dependence on the outcome of the recognition, the speech controller selectively directs a part of the speech input signal via the network to the server station. The server station includes means for receiving the speech equivalent signal from the network and a large/huge vocabulary speech recognizer for recognizing the received speech equivalent signal.

290 citations


Proceedings ArticleDOI
30 Jul 2000
TL;DR: This paper presents a solution for robust watermarking of audio data, reflects on the security properties of the technique, and shows good robustness of the approach against MP3 compression and other common signal processing manipulations.
Abstract: This paper considers the desired properties and possible applications of audio watermarking algorithms. Special attention is given to statistical methods working in the Fourier domain. It presents a solution to robust watermarking of audio data and reflects the security properties of the technique. Experimental results show good robustness of the approach against MP3 compression and other common signal processing manipulations. Enhancements to the presented methods are discussed.
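
The abstract does not spell out the embedding rule, so the following is only a generic, hypothetical illustration of the statistical Fourier-domain family it refers to: a keyed pseudorandom +/-1 pattern is multiplicatively embedded in a frame's magnitude spectrum and later detected by correlation. The strength alpha and the log-magnitude detector are illustrative choices, not the paper's.

```python
import numpy as np

def embed(frame, key, alpha=0.05):
    """Scale each magnitude bin by (1 + alpha * p), with p a keyed +/-1 pattern."""
    rng = np.random.default_rng(key)
    X = np.fft.rfft(frame)
    p = rng.choice([-1.0, 1.0], size=X.shape)
    return np.fft.irfft(X * (1.0 + alpha * p), n=len(frame))

def detect(frame, key):
    """Correlate log-magnitudes with the keyed pattern; clearly positive => marked."""
    rng = np.random.default_rng(key)
    X = np.abs(np.fft.rfft(frame)) + 1e-12
    p = rng.choice([-1.0, 1.0], size=X.shape)
    return np.corrcoef(np.log(X), p)[0, 1]
```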

233 citations


Proceedings ArticleDOI
05 Jun 2000
TL;DR: This work presents the results of combining the line spectral frequencies (LSFs) and zero crossing-based features for frame-level narrowband speech/music discrimination and shows the good discriminating power of these features.
Abstract: Automatic discrimination of speech and music is an important tool in many multimedia applications. Previous work has focused on using long-term features such as differential parameters, variances and time-averages of spectral parameters. These classifiers use features estimated over windows of 0.5-5 seconds, and are relatively complex. We present our results of combining the line spectral frequencies (LSFs) and zero crossing-based features for frame-level narrowband speech/music discrimination. Our classification results for different types of music and speech show the good discriminating power of these features. Our classification algorithms operate using only a frame delay of 20 ms, making them suitable for real-time multimedia applications.
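
Of the two feature families, the zero-crossing side is trivial to compute within the stated 20 ms frame delay; a sketch assuming 8 kHz narrowband input (160 samples per frame) follows. The LSFs would come from the LP analysis that a narrowband speech codec already performs, so they are not recomputed here.

```python
import numpy as np

def zero_crossing_counts(x, frame_len=160):
    """Zero-crossing count per non-overlapping 20 ms frame (160 samples @ 8 kHz)."""
    signs = np.signbit(x)
    crossings = signs[1:] != signs[:-1]   # True where the waveform changes sign
    n_frames = len(x) // frame_len
    return np.array([np.count_nonzero(crossings[i * frame_len:(i + 1) * frame_len - 1])
                     for i in range(n_frames)])
```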

229 citations


Journal Article
TL;DR: A methodology, frequency-warped digital signal processing, is presented in a tutorial paper as a means to design or implement digital signal-processing algorithms directly in a way that is relevant for auditory perception.
Abstract: Modern audio techniques, such as audio coding and sound reproduction, emphasize the modeling of auditory perception as one of the cornerstones for system design. A methodology, frequency-warped digital signal processing, is presented in a tutorial paper as a means to design or implement digital signal-processing algorithms directly in a way that is relevant for auditory perception. Several audio applications are considered in which this approach shows advantages when used as a design or implementation tool or as a conceptual framework of design.
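
Concretely, frequency warping replaces each unit delay z^-1 with a first-order allpass, and a widely cited approximation due to Smith and Abel (1999) chooses the allpass coefficient so that the warped frequency axis approximates the Bark scale. The sketch below shows that textbook construction; it is not code from this paper.

```python
import numpy as np

def bark_lambda(fs_hz):
    """Allpass coefficient giving an approximately Bark-warped frequency axis
    (Smith & Abel 1999); about 0.756 at fs = 44.1 kHz."""
    fs_khz = fs_hz / 1000.0
    return 1.0674 * np.sqrt((2.0 / np.pi) * np.arctan(0.06583 * fs_khz)) - 0.1916

def warped_frequency(omega, lam):
    """Frequency mapping induced by substituting the allpass for each delay;
    positive lam stretches low frequencies (finer resolution there)."""
    return omega + 2.0 * np.arctan2(lam * np.sin(omega), 1.0 - lam * np.cos(omega))
```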

201 citations


Proceedings ArticleDOI
05 Jun 2000
TL;DR: Both the objective and subjective test results show that the proposed algorithm outperforms the conventional codebook mapping method.
Abstract: Reconstruction of wideband speech from its narrowband version is an attractive issue, since it can enhance the speech quality without modifying the existing communication networks. This paper proposes a new recovery method of wideband speech from narrowband speech. In the proposed method, the narrowband spectral envelope of input speech is transformed to a wideband spectral envelope based on the Gaussian mixture model (GMM), whose parameters are calculated by a joint density estimation technique. Then the lowband and highband speech signals are reconstructed by the LPC synthesizer using the reconstructed spectral envelope. This paper also proposes a codeword-dependent power estimation method. Both the objective and subjective test results show that the proposed algorithm outperforms the conventional codebook mapping method.
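
The envelope mapping described is the standard conditional-expectation estimator for a joint-density GMM. The sketch below is a generic version of that technique; the feature type (e.g. LSF vectors), mixture count, and the sklearn fit are assumptions rather than the paper's implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_joint_gmm(X_nb, Y_wb, n_components=8):
    """Fit a full-covariance GMM to stacked [narrowband, wideband] feature vectors."""
    return GaussianMixture(n_components, covariance_type='full').fit(
        np.hstack([X_nb, Y_wb]))

def map_envelope(gmm, x, dx):
    """MMSE estimate of wideband features given narrowband feature vector x;
    dx is the narrowband feature dimension."""
    # Mixture responsibilities from the narrowband marginal alone
    # (constant (2*pi)^{dx/2} factors cancel in the normalization).
    resp = np.zeros(gmm.n_components)
    for m in range(gmm.n_components):
        d = x - gmm.means_[m, :dx]
        S_xx = gmm.covariances_[m][:dx, :dx]
        resp[m] = (gmm.weights_[m] * np.exp(-0.5 * d @ np.linalg.solve(S_xx, d))
                   / np.sqrt(np.linalg.det(S_xx)))
    resp /= resp.sum()
    # Per-mixture linear regression from narrowband to wideband, weighted by resp.
    y_hat = np.zeros(gmm.means_.shape[1] - dx)
    for m in range(gmm.n_components):
        S = gmm.covariances_[m]
        y_hat += resp[m] * (gmm.means_[m, dx:] + S[dx:, :dx] @ np.linalg.solve(
            S[:dx, :dx], x - gmm.means_[m, :dx]))
    return y_hat
```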

197 citations


Patent
09 Aug 2000
TL;DR: Methods and systems for testing speech recognition systems are presented in which the speech recognition device to be tested is directly monitored using a text-to-speech (TTS) device.
Abstract: Methods and systems for testing speech recognition systems are disclosed in which the speech recognition device to be tested is directly monitored in accordance with a text-to-speech device. The collection of reference texts to be used by the speech recognition device is provided by a text-to-speech device, preferably, in one embodiment, implemented within the same computer system. In such an embodiment, a digital audio file stored within a storage area of a computer system is generated from a reference text using a text-to-speech device. The digital audio file is later read using a speech recognition device to generate a decoded (or recognized) text representative of the reference text. The reference text and the decoded text are compared in an alignment operation and an error report representative of the recognition rate of the speech recognition device is finally generated.

178 citations


Patent
19 Jun 2000
TL;DR: In this paper, a multi-channel audio compression technology is presented that extends the range of sampling frequencies compared to existing technologies and/or lowers the noise floor while remaining compatible with those earlier generation technologies.
Abstract: A multi-channel audio compression technology is presented that extends the range of sampling frequencies compared to existing technologies and/or lowers the noise floor while remaining compatible with those earlier generation technologies. The high-sampling frequency multi-channel audio (12) is decomposed into core audio up to the existing sampling frequencies and a difference signal up to the sampling frequencies of the next generation technologies. The core audio is encoded (18) using the first generation technology such as DTS, DOLBY AC-3 or MPEG I or MPEG II such that the encoded core bit stream (20) is fully compatible with a comparable decoder in the market. The difference signal (34) is encoded (36) using technologies that extend the sampling frequency and/or improve the quality of the core audio. The compressed difference signal (38) is attached as an extension to the core bit stream (20). The extension data will be ignored by the first generation decoders but can be decoded by the second generation decoders. By summing the decoded core and extension audio signals together (28), a second generation decoder can effectively extend the audio signal bandwidth and/or improve the signal to noise ratio beyond that available through the core decoder alone.

176 citations


Proceedings ArticleDOI
05 Jun 2000
TL;DR: An overview of design techniques for digital fractional delay filters and of their applications is given.
Abstract: In numerous applications, such as communications, audio and music technology, speech coding and synthesis, antenna and transducer arrays, and time delay estimation, not only the sampling frequency but the actual sampling instants are of crucial importance. Digital fractional delay (FD) filters provide a useful building block that can be used for fine-tuning the sampling instants, i.e., implement the required bandlimited interpolation. In this paper an overview of design techniques and applications is given.
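
One classic design from this family is the Lagrange-interpolation FIR fractional delay filter, which has a closed-form solution; the sketch below follows that standard construction (the order and the recommended delay range near the filter midpoint are the usual conventions, not specifics of this overview).

```python
import numpy as np

def lagrange_fd(D, N=3):
    """Order-N Lagrange fractional delay FIR: h[n] = prod_{k != n} (D - k)/(n - k).
    D is the total delay in samples; accuracy is best for D near N/2."""
    n = np.arange(N + 1)
    h = np.ones(N + 1)
    for k in range(N + 1):
        mask = n != k
        h[mask] *= (D - k) / (n[mask] - k)
    return h

# Delay a sinusoid by 1.4 samples (1 sample of latency + 0.4 fractional):
x = np.sin(0.1 * np.arange(64))
y = np.convolve(x, lagrange_fd(1.4, N=3))
```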

170 citations


Proceedings Article
01 Jan 2000
TL;DR: The optimal linear transform is derived to combine the audio and visual information and an implementation that avoids the numerical problems caused by computing the correlation matrices is described.
Abstract: FaceSync is an optimal linear algorithm that finds the degree of synchronization between the audio and image recordings of a human speaker. Using canonical correlation, it finds the best direction to combine all the audio and image data, projecting them onto a single axis. FaceSync uses Pearson's correlation to measure the degree of synchronization between the audio and image data. We derive the optimal linear transform to combine the audio and visual information and describe an implementation that avoids the numerical problems caused by computing the correlation matrices.
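
A standard numerically stable route to canonical correlation, orthonormalizing each data block instead of forming and inverting correlation matrices, seems to be the kind of implementation the abstract alludes to. A generic sketch, with matrix shapes as assumptions (rows are synchronized time samples):

```python
import numpy as np

def first_canonical_correlation(A, V):
    """A: (T, da) audio features; V: (T, dv) video features, with T > da, dv.
    Returns the largest correlation achievable by projecting each onto one axis."""
    A = A - A.mean(axis=0)
    V = V - V.mean(axis=0)
    Qa, _ = np.linalg.qr(A)   # orthonormal basis for the audio subspace
    Qv, _ = np.linalg.qr(V)   # orthonormal basis for the video subspace
    s = np.linalg.svd(Qa.T @ Qv, compute_uv=False)
    return s[0]               # singular values are the canonical correlations
```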

Journal ArticleDOI
TL;DR: An electrocardiogram (ECG) compression algorithm, called analysis by synthesis ECG compressor (ASEC), is introduced and was found to be superior to several well-known ECG compression algorithms at all tested bit rates.
Abstract: An electrocardiogram (ECG) compression algorithm, called analysis by synthesis ECG compressor (ASEC), is introduced. The ASEC algorithm is based on analysis by synthesis coding, and consists of a beat codebook, long and short-term predictors, and an adaptive residual quantizer. The compression algorithm uses a defined distortion measure in order to efficiently encode every heartbeat, with minimum bit rate, while maintaining a predetermined distortion level. The compression algorithm was implemented and tested with both the percentage rms difference (PRD) measure and the recently introduced weighted diagnostic distortion (WDD) measure. The compression algorithm has been evaluated with the MIT-BIH Arrhythmia Database. A mean compression rate of approximately 100 bits/s (compression ratio of about 30:1) has been achieved with a good reconstructed signal quality (WDD below 4% and PRD below 8%). The ASEC was compared with several well-known ECG compression algorithms and was found to be superior at all tested bit rates. A mean opinion score (MOS) test was also applied. The testers were three independent expert cardiologists. As in the quantitative test, the proposed compression algorithm was found to be superior to the other tested compression algorithms.
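
For reference, one common form of the PRD measure quoted above is sketched below; definitions differ on whether the signal mean is removed in the denominator, so that is left as a parameter rather than asserted as this paper's exact convention.

```python
import numpy as np

def prd(x, x_rec, subtract_mean=True):
    """Percentage RMS difference between original and reconstructed ECG."""
    ref = x - x.mean() if subtract_mean else x
    return 100.0 * np.sqrt(np.sum((x - x_rec) ** 2) / np.sum(ref ** 2))
```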

Patent
14 Mar 2000
TL;DR: In this paper, a speech synthesizer is provided with a plurality of speech synthesizers for converting text data to speech data, and each synthesizer converts text data of a different language to speech data in that language.
Abstract: In a speech synthesizer for converting text data to speech data, it is possible to realize high quality speech output even if the text data to be converted is in multiple languages. The speech synthesizer is provided with a plurality of speech synthesizers for converting text data to speech data, and each speech synthesizer converts text data of a different language to speech data in that language. For conversion of particular text data to speech data, one of the plurality of speech synthesizers is selected and caused to carry out that conversion.

Proceedings ArticleDOI
T.A. Faruquie1, Chalapathy Neti1, Nitendra Rajput1, L. V. Subramaniam1, Ashish Verma1 
31 Jan 2000
TL;DR: This work presents a novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in this case, English.
Abstract: Audio-driven facial animation is an interesting and evolving technique for human-computer interaction. Based on an incoming audio stream, a face image is animated with full lip synchronization. This requires a speech recognition system in the language in which audio is provided to get the time alignment for the phonetic sequence of the audio signal. However, building a speech recognition system is data intensive and is a very tedious and time consuming task. We present a novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in our case, English. The method presented here can also be used for text to audio-visual speech synthesis.

Patent
TL;DR: A method of producing synthetic visual speech according to this invention includes receiving an input containing speech information; one or more visemes that correspond to the speech input are then identified.
Abstract: A method of producing synthetic visual speech according to this invention includes receiving an input containing speech information. One or more visemes that correspond to the speech input are then identified. Next, the weights of those visemes are calculated using a coarticulation engine including viseme deformability information. Finally, a synthetic visual speech output is produced based on the visemes' weights over time (or tracks). The synthetic visual speech output is combined with a synchronized audio output corresponding to the input to produce a multimedia output containing a 3D lipsyncing animation.

Journal ArticleDOI
TL;DR: An overview of recent bibliographic references dealing with speech processing in mobile terminals is given; a fairly large list of references drawn from many conference proceedings and journals is presented and commented upon.

Proceedings ArticleDOI
28 May 2000
TL;DR: This paper gives an overview of the HILN tools, presents the recent advances in signal modelling and parameter coding, and concludes with an evaluation of the subjective audio quality.
Abstract: The MPEG-4 Audio Standard combines tools for efficient and flexible coding of audio. For very low bitrate applications, tools based on a parametric signal representation are utilised. The parametric speech coding tools (HVXC) are already available in Version 1 of MPEG-4. The main focus of this paper is on the parametric audio coding tools "Harmonic and Individual Lines plus Noise" (HILN) which are included in Version 2 of MPEG-4. As already indicated by their name, the HILN tools are based on the decomposition of the audio signal into components which are described by appropriate source models and represented by model parameters. This paper gives an overview of the HILN tools, presents the recent advances in signal modelling and parameter coding, and concludes with an evaluation of the subjective audio quality.

Patent
Yang Gao1, Adil Benyassine2, Jes Thyssen2, Eyal Shlomot2, Huan-Yu Su2 
15 Sep 2000
TL;DR: In this paper, a speech compression system capable of encoding a speech signal into a bitstream for subsequent decoding to generate synthesized speech is disclosed, which optimizes the bandwidth consumed by the bitstream by balancing the desired average bit rate with the perceptual quality of the reconstructed speech.
Abstract: A speech compression system capable of encoding a speech signal into a bitstream for subsequent decoding to generate synthesized speech is disclosed. The speech compression system optimizes the bandwidth consumed by the bitstream by balancing the desired average bit rate with the perceptual quality of the reconstructed speech. The speech compression system comprises a full-rate codec, a half-rate codec, a quarter-rate codec and an eighth-rate codec. The codecs are selectively activated based on a rate selection. In addition, the full and half-rate codecs are selectively activated based on a type classification. Each codec is selectively activated to encode and decode the speech signals at different bit rates emphasizing different aspects of the speech signal to enhance overall quality of the synthesized speech.

Patent
Steven D. Curtin1
11 Apr 2000
TL;DR: In this paper, a digital wireless premises audio system, a method of operating the same and a home theater system incorporating the audio system or the method, is presented, which includes a digital audio encoder/transmitter, located on the premises, that accepts an audio channel in digital form, encodes the channel into a stream of digital data and wirelessly transmits the stream about the premises.
Abstract: A digital wireless premises audio system, a method of operating the same and a home theater system incorporating the audio system or the method. In one embodiment, the audio system includes: (1) a digital audio encoder/transmitter, located on the premises, that accepts an audio channel in digital form, encodes the channel into a stream of digital data and wirelessly transmits the stream about the premises and (2) a speaker module, located on the premises, couplable to a power source and including, in series, a digital audio receiver/decoder, an audio amplifier and a speaker, that receives the stream, decodes the audio channel therefrom, converts the audio channel to analog form and employs power from the power source to amplify the audio channel and drive the speaker therewith.

PatentDOI
TL;DR: A method and apparatus for first training and then recognizing speech, using subband cepstral features to improve string recognition accuracy rates for speech inputs.
Abstract: A method and apparatus for first training and then recognizing speech. The method and apparatus use subband cepstral features to improve the recognition string accuracy rates for speech inputs.

Patent
20 Mar 2000
TL;DR: In this article, a speech recognition operation is performed on the audio data initially using a speaker independent acoustic model and the recognized text in addition to audio time stamps are produced by the speech recognition operator.
Abstract: Automated methods and apparatus for synchronizing audio and text data, e.g., in the form of electronic files, representing audio and text expressions of the same work or information are described. Also described are automated methods of detecting errors and other discrepancies between the audio and text versions of the same work. A speech recognition operation is performed on the audio data initially using a speaker independent acoustic model. The recognized text in addition to audio time stamps are produced by the speech recognition operation. The recognized text is compared to the text in text data to identify correctly recognized words. The acoustic model is then retrained using the correctly recognized text and corresponding audio segments from the audio data transforming the initial acoustic model into a speaker trained acoustic model. The retrained acoustic model is then used to perform an additional speech recognition operation on the audio data. The audio and text data are synchronized using the results of the updated acoustic model. In addition, one or more error reports based on the final recognition results are generated showing discrepancies between the recognized words and the words included in the text. By retraining the acoustic model in the above described manner, improved accuracy is achieved.
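
The step of identifying correctly recognized words is conventionally a Levenshtein-style alignment of the reference text against the recognizer output; the sketch below shows that generic technique, not the patent's specific procedure.

```python
def align_matches(ref, hyp):
    """Align reference words against recognized words by edit distance and
    return (ref_index, hyp_index) pairs of exact matches, i.e. the words
    whose audio segments are safe to reuse for acoustic-model retraining."""
    n, m = len(ref), len(hyp)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + cost)
    matches, i, j = [], n, m
    while i > 0 and j > 0:                       # backtrace, collecting matches
        if ref[i - 1] == hyp[j - 1] and D[i][j] == D[i - 1][j - 1]:
            matches.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1                  # substitution
        elif D[i][j] == D[i - 1][j] + 1:
            i -= 1                               # deletion
        else:
            j -= 1                               # insertion
    return matches[::-1]
```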

Patent
02 Aug 2000
TL;DR: A system and method for processing audio and speech signals that uses a pitch and voicing dependent spectral estimation algorithm (voicing algorithm) to accurately represent voiced speech, unvoiced speech, and mixed speech in the presence of background noise.
Abstract: A system and method are provided for processing audio and speech signals using a pitch and voicing dependent spectral estimation algorithm (voicing algorithm) to accurately represent voiced speech, unvoiced speech, and mixed speech in the presence of background noise, and background noise with a single model. The present invention also modifies the synthesis model based on an estimate of the current input signal to improve the perceptual quality of the speech and background noise under a variety of input conditions. The present invention also improves the voicing dependent spectral estimation algorithm robustness by introducing the use of a Multi-Layer Neural Network in the estimation process. The voicing dependent spectral estimation algorithm provides an accurate and robust estimate of the voicing probability under a variety of background noise conditions. This is essential to providing high quality intelligible speech in the presence of background noise. In one embodiment, the waveform coding is implemented by separating the input signal into at least two sub-band signals and encoding one of the at least two sub-band signals using a first encoding algorithm to produce at least one encoded output signal; and encoding another of said at least two sub-band signals using a second encoding algorithm to produce at least one other encoded output signal, where the first encoding algorithm is different from the second encoding algorithm. In accordance with the described embodiment, the present invention provides an encoder that codes N user defined sub-band signals in the baseband with one of a plurality of waveform coding algorithms, and encodes N user defined sub-band signals with one of a plurality of parametric coding algorithms. That is, the selected waveform/parametric encoding algorithm may be different in each sub-band.

Journal ArticleDOI
TL;DR: It is shown that MVDR modeling provides a class of all-pole models that are flexible for tackling a wide variety of speech modeling objectives and the high order MVDR spectrum provides a robust model for all types of speech including voiced speech, unvoiced speech, and mixed spectra.
Abstract: We present all-pole models based upon the minimum variance distortionless response (MVDR) spectrum for spectral modeling of speech. The MVDR method, which is popular in array processing, provides all-pole spectra that are robust for modeling both voiced and unvoiced speech. Although linear prediction (LP) is a popular method for obtaining all-pole model parameters, LP spectral envelopes overestimate and overemphasize the medium and high pitch voiced speech spectral powers, thereby featuring unwanted sharp contours, and do not improve in spectral envelope modeling performance as the filter order is increased. In contrast, the MVDR all-pole spectrum which can be easily obtained from the LP coefficients, features improved spectral envelope modeling as the filter order is increased. In particular, the high order MVDR spectrum models voiced speech spectra very well, particularly at the perceptually important harmonics, and features a smooth contoured envelope. Furthermore, the MVDR spectrum can be based upon either conventional time domain correlation estimates or upon spectral samples, a task that is common in frequency domain speech coding. In particular, the MVDR spectrum of sufficient order provides an all-pole envelope that models a set of spectral samples exactly. In addition, the MVDR all-pole spectrum is also suitable for modeling unvoiced speech spectra.
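
The paper's observation that the MVDR spectrum "can be easily obtained from the LP coefficients" refers to what is usually computed via Musicus' correlation method. A sketch of that standard computation, assuming real LP coefficients and a prediction error power obtained from, e.g., Levinson-Durbin:

```python
import numpy as np

def mvdr_spectrum(a, pe, n_freq=512):
    """MVDR all-pole spectrum from LP coefficients a = [1, a1, ..., aM]
    and prediction error power pe, via Musicus' method."""
    a = np.asarray(a, dtype=float)
    M = len(a) - 1
    mu = np.zeros(M + 1)
    for k in range(M + 1):
        i = np.arange(M - k + 1)
        mu[k] = np.sum((M + 1 - k - 2 * i) * a[i] * a[i + k]) / pe
    omega = np.linspace(0, np.pi, n_freq)
    # Denominator = mu[0] + 2 * sum_k mu[k] cos(k*omega); spectrum is its reciprocal.
    denom = mu[0] + 2.0 * np.cos(np.outer(omega, np.arange(1, M + 1))) @ mu[1:]
    return 1.0 / denom
```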

Patent
18 Apr 2000
TL;DR: In this paper, a method of communication with a speech enabled remote telephony device such as a mobile phone is described comprising the following steps: receiving user speech input into the mobile phone as part of a dialogue with an interactive voice response telephony application, performing speech recognition to convert the speech into text and converting the text into tones such as DTMF tones.
Abstract: Speech encoding in a client server system such as a laptop, personal data assistant or mobile phone communicating with an interactive voice response telephony application. A method of communication with a speech enabled remote telephony device such as a mobile phone is described comprising the following steps. Receiving user speech input into the mobile phone as part of a dialogue with an interactive voice response telephony application. Performing speech recognition to convert the speech into text and converting the text into tones such as DTMF tones. Transmitting the DTMF tones over the voice channel to an interactive voice response (IVR) telephony application. An allowed response feature converts the user's response to a known valid response of the IVR application. A language conversion feature allows a person to speak in one language to an IVR application operating in a different language.

Patent
04 Aug 2000
TL;DR: In this paper, a scalable data structure for audio transmission includes core and augmentation layers, the former for carrying a first coding of an audio signal that places post decode noise beneath a desired noise spectrum, the latter for carrying offset data regarding the desired noises spectrum and data about coding of the audio signal.
Abstract: Scalable coding of audio into a core layer in response to a desired noise spectrum established according to psychoacoustic principles supports coding augmentation data into augmentation layers in response to various criteria including offset of such desired noise spectrum. Compatible decoding provides a plurality of decoded resolutions from a single signal. Coding is preferably performed on subband signals generated according to spectral transform, quadrature mirror filtering, or other conventional processing of audio input. A scalable data structure for audio transmission includes core and augmentation layers, the former for carrying a first coding of an audio signal that places post decode noise beneath a desired noise spectrum, the latter for carrying offset data regarding the desired noise spectrum and data about coding of the audio signal that places post decode noise beneath the desired noise spectrum shifted by the offset data.

PatentDOI
Xiaobo Pi1, Ying Jia1
TL;DR: In this paper, an interactive voice response system is described that supports full duplex data transfer to enable the playing of a voice prompt to a user of telephony system while the system listens for voice barge-in from the user.
Abstract: An interactive voice response system is described that supports full duplex data transfer to enable the playing of a voice prompt to a user of telephony system while the system listens for voice barge-in from the user. The system includes a speech detection module that may utilize various criteria such as frame energy magnitude and duration thresholds to detect speech. The system also includes an automatic speech recognition engine. When the automatic speech recognition engine recognizes a segment of speech, a feature extraction module may be used to subtract a prompt echo spectrum, which corresponds to the currently playing voice prompt, from an echo-dirtied speech spectrum recorded by the system. In order to improve spectrum subtraction, an estimation of the time delay between the echo-dirtied speech and the prompt echo may also be performed.
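
The echo-handling steps can be illustrated generically: estimate the playback-to-echo delay by cross-correlation, then subtract the (aligned, scaled) prompt-echo magnitude spectrum with a spectral floor. Parameter names and the flooring rule below are assumptions, not the patent's specification.

```python
import numpy as np

def estimate_delay(mic, prompt):
    """Lag (in samples) of the prompt echo within the microphone signal,
    taken from the peak of the cross-correlation."""
    xc = np.correlate(mic, prompt, mode='full')
    return int(np.argmax(xc)) - (len(prompt) - 1)

def subtract_prompt_echo(speech_mag, prompt_mag, gain=1.0, floor=0.01):
    """Magnitude spectral subtraction of the prompt echo, kept above a floor."""
    return np.maximum(speech_mag - gain * prompt_mag, floor * speech_mag)
```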

Book
01 Jan 2000
TL;DR: Auditory Processing of Speech; Perceptual Coding Considerations; Research in Perceptual Speech Coding; Appendix: Related Internet Sites.
Abstract (table of contents):
INTRODUCTION
SPEECH PRODUCTION: The Speech Chain; Articulation; Source-Filter Model
SPEECH ANALYSIS TECHNIQUES: Sampling and the Speech Waveform; Systems and Filtering; z Transform; Fourier Transform; Discrete Fourier Transform; Windowing Signal Segments
LINEAR PREDICTION VOCAL TRACT MODELING: Sound Propagation in the Vocal Tract; Estimation of LP Parameters; Transformations of LP Parameters for Quantization; Examples of LP Modeling
PITCH EXTRACTION: Autocorrelation Pitch Extraction; Cepstral Pitch Extraction; Frequency-Domain Error Minimization; Pitch Tracking
AUDITORY INFORMATION PROCESSING: The Basilar Membrane: A Spectrum Analyzer; Critical Bands; Thresholds of Audibility and Detectability; Monaural Masking
QUANTIZATION AND WAVEFORM CODERS: Uniform Quantization; Nonlinear Quantization; Adaptive Quantization; Vector Quantization
QUALITY EVALUATION: Objective Measures; Subjective Measures; Perceptual Objective Measures
VOICE CODING CONCEPTS: Channel Vocoder; Formant Vocoders; The Sinusoidal Speech Coder; Linear Prediction Vocoder
LINEAR PREDICTION ANALYSIS BY SYNTHESIS: Analysis by Synthesis; Estimation of Excitation; Multi-Pulse Linear Prediction Coder; Regular Pulse Excited LP Coder; Code Excited Linear Prediction Coder
MIXED EXCITATION CODING: Multi-Band Excitation Vocoder; Mixed Excitation Linear Prediction Coder; Split Band LPC Coder; Harmonic Vector Excitation Coder; Waveform Interpolation Coding
PERCEPTUAL SPEECH CODING: Auditory Processing of Speech; Perceptual Coding Considerations; Research in Perceptual Speech Coding
APPENDIX: RELATED INTERNET SITES

Patent
Donald W. Moses1, Robert W. Moses1
12 Oct 2000
TL;DR: A computer-implemented system for providing a digital watermark in an audio signal, inserting the watermark where it will be masked by the audio signal itself.
Abstract: The foregoing problems are solved and a technical advance is achieved by a computer-implemented system for providing a digital watermark in an audio signal. In a preferred embodiment, an audio file (108), such as a .WAV file, representing an audio signal to be watermarked is processed using an algorithm of the present invention herein referred to as the 'PAWS algorithm' (104) to determine and log the location and number of opportunities that exist for inserting a watermark into the audio signal such that it will be masked by the audio signal. The user can adjust (17) certain parameters (112) of the PAWS algorithm (104) before the audio file is processed. A/B/X testing between the original and watermarked files is also supported to allow the user to undo or re-encode the watermark, if desired.

Proceedings ArticleDOI
05 Jun 2000
TL;DR: This paper describes the calculation of features directly from MPEG audio compressed data and implements two case studies: a general audio segmentation algorithm and a music/speech classifier.
Abstract: There is a huge amount of audio data available that is compressed using the MPEG audio compression standard. Sound analysis is based on the computation of short time feature vectors that describe the instantaneous spectral content of the sound. An interesting possibility is the calculation of features directly from compressed data. Since the bulk of the feature calculation is performed during the encoding stage this process has a significant performance advantage if the available data is compressed. Combining decoding and analysis in one stage is also very important for audio streaming applications. In this paper, we describe the calculation of features directly from MPEG audio compressed data. Two of the basic processes of analyzing sound are: segmentation and classification. To illustrate the effectiveness of the calculated features we have implemented two case studies: a general audio segmentation algorithm and a music/speech classifier. Experimental data is provided to show that the results obtained are comparable with sound analysis algorithms working directly with audio samples.
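
As an illustration of compressed-domain analysis, simple per-frame descriptors can be computed directly from the 32 polyphase subband magnitudes that MPEG audio layers expose before full reconstruction. The sketch below (the feature pair is our choice, not the paper's exact set) computes a spectral centroid and rolloff per frame.

```python
import numpy as np

def centroid_and_rolloff(subband_mag, rolloff=0.85):
    """subband_mag: magnitudes of one frame's 32 MPEG polyphase subbands.
    Returns the spectral centroid (in subband units) and the rolloff subband
    below which `rolloff` of the spectral energy lies."""
    idx = np.arange(1, len(subband_mag) + 1)
    total = subband_mag.sum() + 1e-12
    centroid = float(np.sum(idx * subband_mag) / total)
    roll = int(np.searchsorted(np.cumsum(subband_mag), rolloff * total)) + 1
    return centroid, roll
```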

Patent
07 Sep 2000
TL;DR: In this article, a method for overlapping stored audio elements in a system for providing a customized radio broadcast is proposed, which includes the steps of dividing a first audio element into a plurality of audio element components.
Abstract: A method for overlapping stored audio elements in a system for providing a customized radio broadcast. The method includes the steps of dividing a first audio element into a plurality of audio element components; selecting one of said audio element components; decompressing the selected audio element component; selecting a second audio element; decompressing the second audio element; mixing the decompressed audio element component with the decompressed second audio element to form a mixed audio element component; and compressing the mixed audio element component to form a compressed overlapping audio element component. The compressed overlapping audio element component may replace the selected audio component. The first audio element may be a song, while the second audio element may be a DJ introduction. Accordingly, the compressed overlapping audio element may be broadcast followed by the remaining components of the song audio element.