
Showing papers on "Audio signal processing published in 2001"


Journal ArticleDOI
TL;DR: A heuristic rule-based procedure, built upon morphological and statistical analysis of the time-varying functions of extracted audio features, is proposed to segment and classify audio signals.
Abstract: While current approaches for audiovisual data segmentation and classification are mostly focused on visual cues, audio signals may actually play a more important role in content parsing for many applications. An approach to automatic segmentation and classification of audiovisual data based on audio content analysis is proposed. The audio signal from movies or TV programs is segmented and classified into basic types such as speech, music, song, environmental sound, speech with music background, environmental sound with music background, silence, etc. Simple audio features including the energy function, the average zero-crossing rate, the fundamental frequency, and the spectral peak tracks are extracted to ensure the feasibility of real-time processing. A heuristic rule-based procedure, built upon morphological and statistical analysis of the time-varying functions of these audio features, is proposed to segment and classify audio signals. Experimental results show that the proposed scheme achieves an accuracy rate of more than 90% in audio classification.
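The energy and zero-crossing features this paper extracts can be sketched as follows; the frame length, hop size and test signal are illustrative choices, not the paper's settings:

```python
import numpy as np

def short_time_features(x, frame_len=512, hop=256):
    """Per-frame short-time energy and zero-crossing rate.

    A minimal sketch of two of the paper's features; frame_len and
    hop are illustrative, not the paper's values.
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        energy[i] = np.mean(frame ** 2)
        # Fraction of adjacent sample pairs whose signs differ.
        zcr[i] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    return energy, zcr

# A 440 Hz tone at an 8 kHz sampling rate crosses zero ~880 times/s.
t = np.linspace(0, 1, 8000, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
e, z = short_time_features(tone)
```

Silence yields near-zero energy while unvoiced speech shows a much higher zero-crossing rate than voiced speech or music, which is what makes such cheap features viable for real-time classification.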

473 citations


PatentDOI
TL;DR: In this paper, a method and system is described which allows users to identify (pre-recorded) sounds such as music, radio broadcast, commercials, and other audio signals in almost any environment.
Abstract: A method and system is described which allows users to identify (pre-recorded) sounds such as music, radio broadcasts, commercials, and other audio signals in almost any environment. The audio signal (or sound) must be a recording represented in a database of recordings. The service can quickly identify the signal from just a few seconds of excerpted audio, while tolerating high noise and distortion. Once the signal is identified to the user, the user may perform transactions interactively in real-time or offline using the identification information.

399 citations


Proceedings ArticleDOI
21 Oct 2001
TL;DR: This method attempts to identify the chorus or refrain of a song by identifying repeated sections of the audio waveform using a reduced spectral representation of the selection based on a chroma transformation of the spectrum.
Abstract: An important application for use with multimedia databases is a browsing aid, which allows a user to quickly and efficiently preview selections from either a database or from the results of a database query. Methods for facilitating browsing, though, are necessarily media dependent. We present one such method that produces short, representative samples (or "audio thumbnails") of selections of popular music. This method attempts to identify the chorus or refrain of a song by identifying repeated sections of the audio waveform. A reduced spectral representation of the selection based on a chroma transformation of the spectrum is used to find repeating patterns. This representation encodes harmonic relationships in a signal and thus is ideal for popular music, which is often characterized by prominent harmonic progressions. The method is evaluated over a sizable database of popular music and found to perform well, with most of the errors resulting from songs that do not meet our structural assumptions.
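A chroma transformation of the kind the paper relies on can be sketched by folding FFT bin magnitudes onto twelve pitch classes; the window length and normalisation here are illustrative assumptions, not the paper's exact mapping:

```python
import numpy as np

def chroma_vector(frame, sr):
    """Collapse an FFT magnitude spectrum onto 12 pitch classes.

    A hedged sketch of a chroma transformation; the paper's exact
    binning and windowing may differ.
    """
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    chroma = np.zeros(12)
    for mag, f in zip(spec[1:], freqs[1:]):  # skip the DC bin
        # MIDI-style pitch number, folded onto one octave (C = 0).
        pitch = 69 + 12 * np.log2(f / 440.0)
        chroma[int(round(pitch)) % 12] += mag
    return chroma / (chroma.sum() + 1e-12)

sr = 8000
t = np.arange(4096) / sr
a440 = np.sin(2 * np.pi * 440 * t)  # concert A, pitch class 9
c = chroma_vector(a440, sr)
```

Because octave information is discarded, repeated renditions of the same harmonic progression map to similar chroma sequences, which is why repeated choruses can be located by matching these vectors over time.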

325 citations


Patent
16 May 2001
TL;DR: In this article, the authors propose a wireless communication system for digital audio players that provides for increased functionality, such as communication, interaction and synchronization between a computing platform and various mobile, portable or fixed digital audio players, as well as providing a communication link between the players themselves.
Abstract: A wireless communication system, in particular one for digital audio players, provides for increased functionality, such as communication, interaction and synchronization between a computing platform and various mobile, portable or fixed digital audio players, as well as a communication link between the various digital audio players themselves. The computing platform may act, for example, through a wireless network or wireless communication platform, to control the digital audio players; to act as a cache of digital audio data for the digital audio players; as well as to provide a gateway to the Internet to enable the digital audio players to access additional digital audio content and other information. The computing platform may also be used to automatically update digital audio content on the digital audio players; synchronize digital audio content and playlists between digital audio players; and automatically continue a particular playlist as the user moves from one digital audio player to another.

306 citations


Patent
21 Jun 2001
TL;DR: In this article, the authors present a lighting program to control a plurality of light emitting diodes (LEDs) in response to at least one characteristic of an audio input.
Abstract: Methods and apparatus for executing a lighting program to control a plurality of light emitting diodes (LEDs) in response to at least one characteristic of an audio input. In one embodiment, the audio input is digitally processed to determine the at least one characteristic. In other embodiments, control signals for the LEDs are generated in response to a timer and/or input from a user interface, as well as in response to the at least one characteristic of the audio input. In another embodiment, the control signals for the LEDs are generated by a same computer that processes the audio input to transmit signals to speakers to audibly play the audio input. In a further embodiment, a GUI is provided to assist in authoring the lighting program. In another embodiment, the audio signal is processed before being played back. In a further embodiment, the lighting program anticipates changes in the audio input.

291 citations


Patent
14 Aug 2001
TL;DR: In this paper, the authors present systems and methods for receiving live speech, converting the speech to text, and transferring the text to a user, as desired, in one or more different languages.
Abstract: The present invention relates to systems and methods for audio processing. For example, the present invention provides systems and methods for receiving live speech, converting the speech to text, and transferring the text to a user. As desired, the speech or text can be translated into one or more different languages. Systems and methods for real-time conversion and transmission of speech and text are provided.

278 citations


Book
01 Jan 2001
TL;DR: This book presents an introduction to real-time digital signal processing and its implementation on the TMS320C55x digital signal processor, covering FIR, IIR and adaptive filter design, frequency analysis, speech and audio processing, channel coding, and an introduction to digital image processing.
Abstract: Preface. Chapter 1. Introduction to Real-Time Digital Signal Processing. Chapter 2. Introduction to TMS320C55x Digital Signal Processor. Chapter 3. DSP Fundamentals and Implementation Considerations. Chapter 4. Design and Implementation of FIR Filters. Chapter 5. Design and Implementation of IIR Filters. Chapter 6. Frequency Analysis and Fast Fourier Transform. Chapter 7. Adaptive Filtering. Chapter 8. Digital Signal Generators. Chapter 9. Dual-Tone Multi-Frequency Detection. Chapter 10. Adaptive Echo Cancellation. Chapter 11. Speech Coding Techniques. Chapter 12. Speech Enhancement Techniques. Chapter 13. Audio Signal Processing. Chapter 14. Channel Coding Techniques. Chapter 15. Introduction to Digital Image Processing. Appendix A: Some Useful Formulas and Definitions. A.1 Trigonometric Identities. A.2 Geometric Series. A.3 Complex Variables. A.4 Units of Power. References. Appendix B: Software Organization and List of Experiments. Index.

228 citations


PatentDOI
Yun-Ting Lin1, Yong Yan1
TL;DR: In this paper, a sound imaging system and method for generating multi-channel audio data from an audio/video signal having an audio component and a video component is presented, which includes a system for associating sound sources within the audio component to video objects within the video component.
Abstract: A sound imaging system and method for generating multi-channel audio data from an audio/video signal having an audio component and a video component. The system comprises: a system for associating sound sources within the audio component to video objects within the video component of the audio/video signal; a system for determining position information of each sound source based on a position of the associated video object in the video component; and a system for assigning sound sources to audio channels based on the position information of each sound source.

213 citations


Patent
14 Feb 2001
TL;DR: An audio controller for use with laptop and notebook digital computers for reproducing compressed digital audio recordings is described in this paper, where the controller includes a drive interface for traversing and accessing audio data files stored on a drive of a computer system.
Abstract: An audio controller for use with laptop and notebook digital computers for reproducing compressed digital audio recordings. The controller includes a drive interface for traversing and accessing audio data files stored on a drive of a computer system. Function keys coupled to the controller permit users to access drives containing desired audio data. The selected audio data is read from the drive into the controller. Decoding circuitry decodes the audio data and generates a decoded audio data stream. The data stream can be converted to an analog signal by the controller, or sent to the audio subsystem of the computer system. Advantageously, the controller operates when the computer system is in an inactive state, for example in power saving mode or OFF, and operates in passthrough mode when the computer system is ON or active.

187 citations


Proceedings ArticleDOI
02 Apr 2001
TL;DR: The architecture of a public automated evaluation service for still images, sound and video is presented; the new set of tests relates to audio data and addresses the usual equalisation and normalisation as well as time stretching, pitch shifting and specially designed audio attack algorithms.
Abstract: We briefly present the architecture of a public automated evaluation service we are developing for still images, sound and video. We also detail new tests that will be included in this platform. The set of tests is related to audio data and addresses the usual equalisation and normalisation but also time stretching, pitch shifting and specially designed audio attack algorithms. These attacks are discussed and results on watermark attacks and perceived quality after applying the attacks are provided.

160 citations


Patent
26 Jun 2001
TL;DR: In this paper, a vehicle audio system includes a wireless audio sensor (28) configured to wirelessly detect different portable audio sources brought into the vehicle and audio output devices (20) located in the vehicle for outputting audio signals from the different audio sources.
Abstract: A vehicle audio system includes a wireless audio sensor (28) configured to wirelessly detect different portable audio sources brought into the vehicle. Audio output devices (20) are located in the vehicle for outputting audio signals from the different audio sources. A processor selectively connects the different audio sources to the different audio output devices (20). In another aspect, the audio system includes object sensors (16, 18, 22) that detect objects located outside the vehicle. The processor generates warning signals that are output from the different audio output devices (20) according to where the objects are detected by the object sensors (16, 18, 22).

Patent
30 Jan 2001
TL;DR: In this article, a method for providing multiple users with voice-to-remaining audio (VRA) adjustment capability includes receiving at a first decoder a voice signal and a remaining audio signal and simultaneously receiving at a second decoder the voice signal and the remaining audio signal.
Abstract: A method for providing multiple users with voice-to-remaining audio (VRA) adjustment capability includes receiving at a first decoder (14) a voice signal and a remaining audio signal and simultaneously receiving at a second decoder (15), the voice signal and the remaining audio signal, wherein the voice signal and the remaining audio signal are received separately, and separately adjusting by each of the decoders, the separately received voice and remaining audio signals.
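The VRA adjustment described amounts to each decoder scaling the separately received streams before summing them; this sketch with decibel gain parameters is an illustration, not the patent's implementation:

```python
import numpy as np

def vra_mix(voice, remaining, voice_gain_db=0.0, rest_gain_db=0.0):
    """Mix separately received voice and remaining-audio streams.

    A minimal sketch of voice-to-remaining audio (VRA) adjustment:
    each listener's decoder scales the two streams independently
    before summing. The dB gain parameters are illustrative.
    """
    gv = 10.0 ** (voice_gain_db / 20.0)   # voice gain, dB to linear
    gr = 10.0 ** (rest_gain_db / 20.0)    # remaining-audio gain
    return gv * voice + gr * remaining
```

Because the streams arrive separately, each decoder can choose its own ratio; for example, one listener can boost dialog by 20 dB while another leaves the original balance untouched.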

Patent
10 Dec 2001
TL;DR: In this article, the authors propose a technique to adjust the ratio of the primary vocal/dialog content of an audio program relative to the remaining portion of the audio content in that program.
Abstract: The invention enables the inclusion of voice and remaining audio information at different parts of the audio production process. In particular, the invention embodies special techniques for VRA-capable digital mastering and accommodation of VRA by those classes of audio compression formats that sustain smaller losses of audio data than codecs, such as the AC3 compression format, with equal or greater net losses. The invention facilitates an end-listener's voice-to-remaining audio (VRA) adjustment upon the playback of digital audio media formats by focusing on new configurations of multiple parts of the entire digital audio system, thereby enabling a new technique intended to benefit audio end-users (end-listeners) who wish to control the ratio of the primary vocal/dialog content of an audio program relative to the remaining portion of the audio content in that program.

Patent
04 Sep 2001
TL;DR: In this article, an audio converter device and a method for using the same is presented, in which the audio data is decompressed and converted into analog electrical data and then transferred to an audio playback device.
Abstract: An audio converter device and a method for using the same are provided. In one embodiment, the audio converter device receives the digital audio data from a first device via a local area network. The audio converter device decompresses the digital audio data and converts the digital audio data into analog electrical data. The audio converter device transfers the analog electrical data to an audio playback device.

Patent
24 Jan 2001
TL;DR: In this article, a system and method for monitoring a plurality of audio sources and switching from one to another of the audio sources in accordance with a stored program is presented, where an audio output device receives a signal from each of the portable electronic devices and selectively switches the contents of its output according to at least one preprogrammed user preference.
Abstract: A system and method is provided for monitoring a plurality of audio sources and switching from one to another of the audio sources in accordance with a stored program. An audio output device receives a signal from each of the portable electronic devices and selectively switches the contents of its output according to at least one preprogrammed user preference. The audio output device also automatically communicates with transceiver modules connected to local information systems, for example within a vehicle, office or shopping center. Based on programs stored in a storage device connected to the headset, the local information sources may be monitored and selected to interrupt other audio sources received by the headset when desired.

Patent
16 May 2001
TL;DR: In this article, the authors propose a gateway device for use in a wireless network of digital audio playback devices, which is wirelessly linked to one or more digital audio players to provide a gateway to the Internet for the digital audio devices.
Abstract: A digital audio gateway device for use in a wireless network of digital audio playback devices. The gateway device is wirelessly linked to one or more digital audio playback devices to provide a gateway to the Internet for the digital audio playback devices. In addition to functioning as a gateway, the device provides additional functionality and may act as a cache of digital audio data for the various digital audio players connected in the wireless network and may also act to automatically update digital audio content on the audio players, synchronize digital audio content and playlists between the digital audio players and continue automatically or upon user request a particular playlist as the user moves from one digital audio player to another.

Patent
30 May 2001
TL;DR: In this paper, a post-processing algorithm is used to simulate live or theater sound, where audio signals are selectively post-processed according to equipment availability and listener preferences, and a center channel equalizer balances the signal playback.
Abstract: The method and system of the present invention sequence audio post-processing algorithms to simulate live or theater sound. An audio signal is selectively post-processed according to equipment availability and listener preferences. Downmixing or Prologic algorithms are applied to a signal arriving at the sound system. A listener inputs their speaker configuration to a player console. Desired post-processing effects are likewise indicated to the console. For instance, if surround sound equipment is both available and selected, then surround portions of the audio signal are parsed to surround speakers. Bass management techniques then transfer low frequency channels of the signal to compatible speakers. VES or DCS algorithms further manipulate the surround portion of the signal to create an illusion of immersion, and a center channel equalizer balances the signal playback. Alternatively, the post-processed signal is transmitted to a headphone set.

Patent
30 Mar 2001
TL;DR: In this paper, a video system for presenting content from a content provider to a user includes a tuner to select a program from a plurality of programs, and an analog output port is coupled to the analog output of the tuner, and is connected to a storage device to record the selected program represented by an analog signal.
Abstract: A video system for presenting content from a content provider to a user includes a tuner to select a program from a plurality of programs. The tuner outputs the selected program at an analog output when the selected program is represented by an analog signal. An analog output port is coupled to the analog output of the tuner, and is configured to be connectable to a storage device to record the selected program represented by an analog signal. An analog signal processing circuit is coupled to the analog output of the tuner to receive the analog signal representing the selected program from the tuner and to generate a digital representation of the analog signal. A first interface module is configured to be connectable to the storage device to receive recorded programs from the storage device. An overlay module is coupled to the analog signal processing circuit and to the first interface module. The overlay module selectively overlays information to a program received from one of the analog signal processing circuit and the first interface module.

PatentDOI
TL;DR: Time-limited electrical audio signals are fed to an electromechanical output transducer in addition to the signals from the hearing aid input; some of the time-limited audio signals are user-defined and can be programmed by the user.
Abstract: Time-limited electrical audio signals are fed to an electromechanical output transducer in addition to the signals from the hearing aid input. Some of the time-limited audio signals are user-defined. The process is implemented in a hearing aid having an electromechanical transducer and a signal processor. An audio signal generator has a user-changeable memory and/or a read/write memory that can be programmed by the user.

Patent
08 Nov 2001
TL;DR: In this paper, the authors present a data processing system and method that uses RTSP and associated protocols to support voice applications and audio processing by various, distributed, speech processing engines.
Abstract: The present invention relates to a data processing system and method and, more particularly, to a computer aided telephony system and method which uses RTSP and associated protocols to support voice applications and audio processing by various, distributed, speech processing engines. Since RTSP is used to distribute the tasks to be performed by the speech processing engines, a distributed and scalable system can be realised. Furthermore, the integration of third party speech processing engines is greatly simplified due to the RTSP or HTTP interface to those engines.

Patent
25 May 2001
TL;DR: In this paper, an improved pulse oximeter includes audio signal generation means, controlled by algorithms in a processing element which continuously transform the signals from the sensor into signal quality information, which is converted into an audio signal and annunciated for the operator's use in guiding sensor placement.
Abstract: An improved pulse oximeter is disclosed. The pulse oximeter includes audio signal generation means, controlled by algorithms in a processing element which continuously transform the signals from the sensor into signal quality information. This information is converted into an audio signal and annunciated for the operator's use in guiding sensor placement. This signal quality information is available even in the absence of successful computation of pulse rate and/or oxygen saturation level. It furthermore can reflect signal quality changes that may be too subtle to be reflected in the typical numerical representation of pulse rate and oxygen saturation trend. The audio representation of the signal quality can further be modulated to convey other system and/or physiological status and alerts.

Proceedings ArticleDOI
21 Oct 2001
TL;DR: Preliminary experimental results suggest that the listener's ability to identify messages in a multi-talker environment significantly improves by enhancing a monophonic signal with the proposed scheme.
Abstract: We introduce a new scheme for simultaneous placement of a number of sources in auditory space. The scheme is based on an assumption about the relevance of localization cues in different critical bands. Given the sum signal of a number of sources, i.e. a monophonic signal, and a set of parameters (side-information) the scheme is capable of generating a binaural signal by spatially placing the sources contained in the monophonic signal. Potential applications for the scheme are multi-talker desktop conferencing and audio coding. Preliminary experimental results suggest that the listener's ability to identify messages in a multi-talker environment significantly improves by enhancing a monophonic signal with the proposed scheme.

Proceedings ArticleDOI
06 Aug 2001
TL;DR: Under the same modelling assumptions as the Ephraim-Malah suppression rule, three alternative Bayesian suppression rules for short-time spectral attenuation are derived and shown to be efficient to implement and to yield a more intuitive interpretation.
Abstract: Short-time spectral attenuation is a common form of audio signal enhancement in which a time-varying filter, or suppression rule, is applied to the frequency-domain transform of a corrupted signal. The suppression rule (see Ephraim, Y. and Malah, D., IEEE Trans. on Acoustics, Speech and Signal Proc., vol.ASSP-32, no.6, p.1109-21, 1984) for speech enhancement is both optimal in the minimum mean-square error sense and well-known for its associated colourless residual noise; however, it requires the computation of exponential and Bessel functions. We show that, under the same modelling assumptions, alternative Bayesian approaches lead to suppression rules exhibiting almost identical behaviour. We derive three such rules and show that they are efficient to implement and yield a more intuitive interpretation.
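Short-time spectral attenuation of the kind discussed can be sketched as below; this uses the simple Wiener gain with a known noise power spectrum, as an illustration of the family of suppression rules, not the Ephraim-Malah rule or the paper's derived rules:

```python
import numpy as np

def wiener_suppress(noisy, noise_psd, frame=256, hop=128):
    """Frame-by-frame spectral attenuation with a Wiener-style gain.

    A minimal sketch of short-time spectral attenuation: window,
    transform, apply a per-bin suppression gain, and overlap-add.
    The noise power spectrum is assumed known; frame and hop sizes
    are illustrative.
    """
    win = np.hanning(frame)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame + 1, hop):
        seg = noisy[start : start + frame] * win
        spec = np.fft.rfft(seg)
        # A-priori SNR estimate via spectral subtraction, floored at 0.
        snr = np.maximum(np.abs(spec) ** 2 / (noise_psd + 1e-12) - 1.0, 0.0)
        gain = snr / (snr + 1.0)              # Wiener suppression rule
        out[start : start + frame] += np.fft.irfft(gain * spec, n=frame)
    return out
```

Bins whose power sits near the noise floor get a gain near zero while high-SNR bins pass almost unchanged; the rules compared in the paper differ mainly in how this gain is derived from the same statistical model.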

Proceedings ArticleDOI
07 May 2001
TL;DR: The conference proceedings are published in six volumes and deal with speech processing; image and multidimensional signal processing; sensor array and multichannel signal processing; and audio and electroacoustics.
Abstract: The conference proceedings are published in six volumes. Volume I deals with speech processing. Volume II deals with: speech processing; industry technology track; design and implementation of signal processing systems; neural networks for signal processing. Volume III deals with: image and multidimensional signal processing; multimedia signal processing. Volume IV deals with signal processing for communications. Volume V deals with: signal processing education; sensor array and multichannel signal processing; audio and electroacoustics. Volume VI deals with signal processing theory and methods.

Proceedings ArticleDOI
07 May 2001
TL;DR: This work addresses the problem of audio-visual information fusion to provide highly robust speech recognition and proposes a technique based on composite HMMs that can account for stream asynchrony and different levels of information integration, and shows how these models can be trained jointly based on maximum likelihood estimation.
Abstract: Addresses the problem of audio-visual information fusion to provide highly robust speech recognition. We investigate methods that make different assumptions about asynchrony and conditional dependence across streams and propose a technique based on composite HMMs that can account for stream asynchrony and different levels of information integration. We show how these models can be trained jointly based on maximum likelihood estimation. Experiments, performed for a speaker-independent large vocabulary continuous speech recognition task and different integration methods, show that best performance is obtained by asynchronous stream integration. This system reduces the error rate at an 8.5 dB SNR with additive speech "babble" noise by 27% relative over audio-only models and by 12% relative over traditional audio-visual models using concatenative feature fusion.

Proceedings ArticleDOI
Wu Chou1, Liang Gu2
07 May 2001
TL;DR: A new set of features derived from the harmonic coefficient and its 4 Hz modulation values are developed in this paper and these new features provide additional and reliable cues to separate speech from singing, which leads to further improvements in speech/music discrimination.
Abstract: In this paper, an approach for robust singing signal detection in speech/music discrimination is proposed and applied to applications of audio indexing. Conventional approaches in speech/music discrimination can provide reasonable performance with regular music signals but often perform poorly with singing segments. This is due mainly to the fact that speech and singing signals are extremely close and traditional features used in speech recognition do not provide a reliable cue for speech and singing signal discrimination. In order to improve the robustness of speech/music discrimination, a new set of features derived from the harmonic coefficient and its 4 Hz modulation values are developed in this paper, and these new features provide additional and reliable cues to separate speech from singing. In addition, a rule-based post-filtering scheme is also described which leads to further improvements in speech/music discrimination. Source-independent audio indexing experiments on the PBS Skills database indicate that the proposed approach can greatly reduce the classification error rate on singing segments in the audio stream. Compared with existing approaches, the overall segmentation error rate is reduced by more than 30%, averaged over all shows in the database.
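A 4 Hz modulation feature of the general kind described can be sketched as an energy ratio computed on the signal envelope; the band edges and frame sizes below are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def modulation_4hz_energy(x, sr, frame=400, hop=160):
    """Ratio of envelope energy near 4 Hz to total envelope energy.

    A hedged sketch of a 4 Hz modulation feature: speech syllable
    rates concentrate envelope energy around 4 Hz, while music and
    sustained singing typically do not. The 2-6 Hz band edges and
    the frame/hop sizes are illustrative choices.
    """
    # Short-time energy envelope, sampled at sr / hop Hz.
    env = np.array([np.sum(x[i:i + frame] ** 2)
                    for i in range(0, len(x) - frame + 1, hop)])
    env = env - env.mean()
    env_sr = sr / hop
    spec = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(len(env), d=1.0 / env_sr)
    band = (freqs >= 2.0) & (freqs <= 6.0)
    return float(spec[band].sum() / (spec.sum() + 1e-12))
```

A signal whose amplitude is modulated at a syllable-like 4 Hz rate scores high on this ratio, which is the intuition behind using modulation features to tell speech from singing.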

Patent
07 Nov 2001
TL;DR: Perceptual coding of spatial cues (PCSC) as discussed by the authors is used to convert two or more input audio signals into a combined audio signal embedded with two or more sets of auditory scene parameters, where each set corresponds to a different frequency band in the combined signal.
Abstract: Perceptual coding of spatial cues (PCSC) is used to convert two or more input audio signals into a combined audio signal that is embedded with two or more sets of one or more auditory scene parameters, where each set of auditory scene parameters (e.g., one or more spatial cues such as an inter-ear level difference (ILD), inter-ear time difference (ITD), and/or head-related transfer function (HRTF)) corresponds to a different frequency band in the combined audio signal. A PCSC-based receiver is able to extract the auditory scene parameters and apply them to the corresponding frequency bands of the combined audio signal to synthesize an auditory scene. The technique used to embed the auditory scene parameters into the combined signal enables a legacy receiver that is unaware of the embedded auditory scene parameters to play back the combined audio signal in a conventional manner, thereby providing backwards compatibility. In one embodiment, two or more input signals are used to generate a mono audio signal with embedded spatial cues. A PCSC-based receiver can extract and apply the spatial cues to generate two (or more) output audio channels, while a legacy receiver is able to play back the mono audio signal in a conventional (i.e., mono) manner. The backwards compatibility feature can be combined with a layered coding technique and/or a multi-descriptive coding technique to improve error protection when the embedded audio signal is transmitted over one or more lossy channels.

Patent
27 Feb 2001
TL;DR: In this paper, a disc jockey mixing console with analog controls, crossfader and scratchpad is used to mix audio tracks from compressed digital audio data sound recordings, which allows far superior manual dexterity for adjustment of volume and speed when mixing audio tracks as compared to digital mouse.
Abstract: Device includes a disc jockey mixing console with analog controls, crossfader and scratchpad and is used to mix audio tracks from compressed digital audio data sound recordings. One device substitutes for mixing console, turntables and records. For compressed digital audio data recordings, analog controls allow far superior manual dexterity for adjustment of volume and speed when mixing audio tracks as compared to digital mouse. Includes two audio outputs—a headphone and a main speaker output—having digital to analog convertors, analog controls in the form of knobs and sliders, a touch screen LCD panel for selecting and queuing songs and a computer with a processor, ROM storage means, RAM storage means, software and a hard disc to store audio track files. Optional interface between device and personal computer to upload songs, audio input and CD ROM drive for converting audio to compressed digital audio data.

PatentDOI
TL;DR: In this article, a system and method of audio processing provides enhanced speech recognition, where the multi-channel audio signal from the microphones may be processed by a beamforming network to generate a single-channel enhanced audio signal, on which voice activity is detected.
Abstract: A system and method of audio processing provides enhanced speech recognition. Audio input is received at a plurality of microphones. The multi-channel audio signal from the microphones may be processed by a beamforming network to generate a single-channel enhanced audio signal, on which voice activity is detected. Audio signals from the microphones are additionally processed by an adaptable noise cancellation filter having variable filter coefficients to generate a noise-suppressed audio signal. The variable filter coefficients are updated during periods of voice inactivity. A speech recognition engine may apply a speech recognition algorithm to the noise-suppressed audio signal and generate an appropriate output. The operation of the speech recognition engine and the adaptable noise cancellation filter may advantageously be controlled based on voice activity detected in the single-channel enhanced audio signal from the beamforming network.
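An adaptable noise cancellation filter with variable coefficients can be sketched with the NLMS algorithm; NLMS and these parameter values are assumptions for illustration, not details taken from the patent:

```python
import numpy as np

def nlms_cancel(reference, primary, taps=16, mu=0.5):
    """Adaptive noise cancellation with the NLMS algorithm.

    A minimal sketch of an adaptable noise-cancellation filter:
    the variable coefficients w are updated from a noise reference
    so that the filter output tracks the noise in the primary
    channel, leaving the error e as the noise-suppressed signal.
    taps and mu are illustrative.
    """
    w = np.zeros(taps)                       # variable filter coefficients
    out = np.zeros(len(primary))
    for n in range(taps - 1, len(primary)):
        x = reference[n - taps + 1 : n + 1][::-1]  # newest sample first
        y = w @ x                                  # noise estimate
        e = primary[n] - y                         # noise-suppressed output
        w += mu * e * x / (x @ x + 1e-8)           # normalized LMS update
        out[n] = e
    return out
```

Freezing the update (skipping the `w +=` line) during detected voice activity, as the patent describes, prevents the filter from adapting to, and cancelling, the speech itself.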

Patent
04 May 2001
TL;DR: In this article, an auditory scene is synthesized by applying two or more different sets of one or more spatial parameters (e.g., an inter-ear level difference (ILD), interear time difference (ITD), and/or head-related transfer function (HRTF)) to two different frequency bands of a combined audio signal, where each different frequency band is treated as if it corresponded to a single audio source in the auditory scene.
Abstract: An auditory scene is synthesized by applying two or more different sets of one or more spatial parameters (e.g., an inter-ear level difference (ILD), inter-ear time difference (ITD), and/or head-related transfer function (HRTF)) to two or more different frequency bands of a combined audio signal, where each different frequency band is treated as if it corresponded to a single audio source in the auditory scene. In one embodiment, the combined audio signal corresponds to the combination of two or more different source signals, where each different frequency band corresponds to a region of the combined audio signal in which one of the source signals dominates the others. In this embodiment, the different sets of spatial parameters are applied to synthesize an auditory scene comprising the different source signals. In another embodiment, the combined audio signal corresponds to the combination of the left and right audio signals of a binaural signal corresponding to an input auditory scene. In this embodiment, the different sets of spatial parameters are applied to reconstruct the input auditory scene. In either case, transmission bandwidth requirements are reduced by reducing to one the number of different audio signals that need to be transmitted to a receiver configured to synthesize/reconstruct the auditory scene.
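Applying one set of spatial parameters (an ILD and an ITD) to a signal can be sketched as below; the patent applies different parameter sets to different frequency bands, whereas this broadband version uses a single set, and the parameter values are illustrative:

```python
import numpy as np

def spatialize(mono, sr, ild_db=-6.0, itd_s=0.0005):
    """Render a mono source as a two-channel signal via ILD and ITD.

    A minimal broadband sketch: the right channel is delayed by the
    inter-ear time difference and scaled by the inter-ear level
    difference. The default parameter values are illustrative.
    """
    delay = int(round(itd_s * sr))        # ITD in whole samples
    gain = 10.0 ** (ild_db / 20.0)        # ILD as a linear gain
    left = mono.astype(float)
    right = np.concatenate([np.zeros(delay),
                            mono[: len(mono) - delay]]) * gain
    return np.stack([left, right])
```

Doing this separately per frequency band, with a different (delay, gain) pair for each band, is what lets a single combined signal carry several sources at distinct apparent positions.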