
Showing papers on "Speech coding" published in 2008


Journal ArticleDOI
TL;DR: One important goal of the research programme is to develop linking hypotheses between putative neurobiological primitives and those primitives derived from linguistic inquiry, to arrive ultimately at a biologically sensible and theoretically satisfying model of representation and computation in speech.
Abstract: Speech perception consists of a set of computations that take continuously varying acoustic waveforms as input and generate discrete representations that make contact with the lexical representations stored in long-term memory as output. Because the perceptual objects that are recognized by the speech perception system enter into subsequent linguistic computation, the format that is used for lexical representation and processing fundamentally constrains the speech perceptual processes. Consequently, theories of speech perception must, at some level, be tightly linked to theories of lexical representation. Minimally, speech perception must yield representations that smoothly and rapidly interface with stored lexical items. Adopting the perspective of Marr, we argue and provide neurobiological and psychophysical evidence for the following research programme. First, at the implementational level, speech perception is a multi-time resolution process, with perceptual analyses occurring concurrently on at least two time scales (approx. 20–80 ms, approx. 150–300 ms), commensurate with (sub)segmental and syllabic analyses, respectively. Second, at the algorithmic level, we suggest that perception proceeds on the basis of internal forward models, or uses an ‘analysis-by-synthesis’ approach. Third, at the computational level (in the sense of Marr), the theory of lexical representation that we adopt is principally informed by phonological research and assumes that words are represented in the mental lexicon in terms of sequences of discrete segments composed of distinctive features. One important goal of the research programme is to develop linking hypotheses between putative neurobiological primitives (e.g. temporal primitives) and those primitives derived from linguistic inquiry, to arrive ultimately at a biologically sensible and theoretically satisfying model of representation and computation in speech.

443 citations
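
As a rough illustration of the paper's implementational-level claim, the sketch below (not the authors' model; the function name and window lengths are illustrative choices within the quoted ranges) computes two short-time energy envelopes of a waveform concurrently, one on a ~25 ms segmental scale and one on a ~200 ms syllabic scale.

```python
import numpy as np

def multiscale_envelopes(x, fs, scales_ms=(25.0, 200.0)):
    """Smoothed energy envelopes computed concurrently on several time scales."""
    envelopes = []
    for win_ms in scales_ms:
        win = max(1, int(fs * win_ms / 1000.0))
        kernel = np.hanning(win)
        kernel /= kernel.sum()
        envelopes.append(np.convolve(x ** 2, kernel, mode="same"))
    return envelopes

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
segmental_env, syllabic_env = multiscale_envelopes(x, fs)
```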


Journal Article
TL;DR: This paper will describe the chosen reference model architecture, the association between the different operational modes and applications, and the current status of the standardization process.
Abstract: Following the recent trend of employing parametric enhancement tools for increasing coding or spatial rendering efficiency, Spatial Audio Object Coding (SAOC) is one of the recent standardization activities in the MPEG audio group. SAOC is a technique for efficient coding and flexible, user-controllable rendering of multiple audio objects based on transmission of a mono or stereo downmix of the object signals. The SAOC system extends the MPEG Surround standard by re-using its spatial rendering capabilities. This paper will describe the chosen reference model architecture, the association between the different operational modes and applications, and the current status of the standardization process.

202 citations
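
A minimal sketch of the general idea behind object coding, assuming nothing about the actual SAOC bitstream format: a mono downmix is transmitted together with per-frame object level information, which a receiver can use for user-controllable rendering. All function names and the frame size are invented for illustration.

```python
import numpy as np

def encode_objects(objects, frame=1024):
    """Mono downmix plus per-frame object power levels as side information."""
    n = min(len(o) for o in objects)
    downmix = sum(o[:n] for o in objects)
    levels = np.array([[np.mean(o[i:i + frame] ** 2) for o in objects]
                       for i in range(0, n - frame + 1, frame)])
    return downmix, levels

def render_object(downmix, levels, obj_idx, frame=1024):
    """User-controlled rendering: re-weight the downmix toward one object."""
    out = np.zeros_like(downmix)
    for i, lv in enumerate(levels):
        s = slice(i * frame, (i + 1) * frame)
        out[s] = downmix[s] * lv[obj_idx] / (lv.sum() + 1e-12)
    return out

rng = np.random.default_rng(1)
obj_a, obj_b = rng.standard_normal(4096), 0.3 * rng.standard_normal(4096)
dmx, side = encode_objects([obj_a, obj_b])
solo_a = render_object(dmx, side, obj_idx=0)   # emphasize object A
```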


PatentDOI
Shuhei Maegawa
TL;DR: A speech recognition method includes a model selection step which selects a recognition model and translation dictionary information based on characteristic information of input speech and a speech recognition step which translates input speech into text data based on the selected recognition model as mentioned in this paper.
Abstract: A speech recognition method includes a model selection step which selects a recognition model and translation dictionary information based on characteristic information of input speech, a speech recognition step which translates input speech into text data based on the selected recognition model, and a translation step which translates the text data based on the selected translation dictionary information.

197 citations


Patent
30 Dec 2008
TL;DR: In this article, an audio coding system is proposed comprising a linear prediction unit for filtering an input signal based on an adaptive filter, a transformation unit for transforming a frame of the filtered input signal into a transform domain, and a quantization unit for quantizing the transform domain signal.
Abstract: The present invention teaches a new audio coding system that can code both general audio and speech signals well at low bit rates. A proposed audio coding system comprises linear prediction unit for filtering an input signal based on an adaptive filter; a transformation unit for transforming a frame of the filtered input signal into a transform domain; and a quantization unit for quantizing the transform domain signal. The quantization unit decides, based on input signal characteristics, to encode the transform domain signal with a model-based quantizer or a non-model-based quantizer. Preferably, the decision is based on the frame size applied by the transformation unit.

170 citations
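
The following sketch illustrates the decision logic described above under stated assumptions: which quantizer goes with which frame size, the threshold, and the "model" used are all illustrative guesses, not the patent's specification.

```python
import numpy as np

def uniform_quantize(coeffs, step=0.02):
    """Non-model-based: plain uniform scalar quantization."""
    return np.round(coeffs / step) * step

def model_based_quantize(coeffs, step=0.02):
    """Model-based stand-in: normalize by an estimated signal scale
    (a crude statistical model), quantize, then restore the scale."""
    scale = np.std(coeffs) + 1e-12
    return np.round(coeffs / (scale * step)) * scale * step

def quantize_frame(coeffs, frame_size, long_frame=1024):
    # Assumption: long frames (general audio) use the non-model-based
    # quantizer, short frames (speech-like input) the model-based one.
    if frame_size >= long_frame:
        return uniform_quantize(coeffs)
    return model_based_quantize(coeffs)
```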


Patent
03 Jul 2008
TL;DR: In this article, an electronic device and method were presented for obtaining a digital speech signal together with a control command issued while the signal is being obtained, and for temporally associating the control command with the substantially corresponding time instant in the signal to which the command was directed, whereby one or more punctuation marks or other, optionally symbolic, elements are at least logically positioned at the corresponding text location.
Abstract: Electronic device and method for obtaining a digital speech signal and a control command relating to the digital speech signal while obtaining the digital speech signal, and for temporally associating the control command with a substantially corresponding time instant in the digital speech signal to which the control command was directed, wherein the control command determines one or more punctuation marks or other, optionally symbolic, elements to be at least logically positioned at a text location corresponding to the communication instant relative to the digital speech signal so as to cultivate the speech-to-text conversion procedure.

159 citations
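
A minimal sketch of the temporal-association idea, assuming word-level timestamps from a speech-to-text engine; the data layout and function name are hypothetical.

```python
def apply_punctuation(words, commands):
    """words: [(word, start_s, end_s)] from a recognizer (hypothetical);
    commands: [(time_s, mark)] captured while recording.
    Attach each mark to the word that had most recently started."""
    out = [w for w, _, _ in words]
    for t, mark in sorted(commands):
        idx = max((i for i, (_, s, _) in enumerate(words) if s <= t), default=0)
        out[idx] += mark
    return " ".join(out)

words = [("hello", 0.0, 0.4), ("world", 0.5, 0.9), ("again", 1.2, 1.6)]
commands = [(1.0, ","), (1.7, ".")]
print(apply_punctuation(words, commands))   # -> hello world, again.
```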


Patent
Michael M. Lee
02 Apr 2008
TL;DR: In this paper, the authors present a system for altering an audio output so that a recorded audio track sounds as if a different person had recorded it when it is played back.
Abstract: Methods, systems and computer readable media for altering an audio output are provided. In some embodiments, the system may change the original frequency content of an audio data file to a second frequency content so that a recorded audio track will sound as if a different person had recorded it when it is played back. In other embodiments, the system may receive an audio data file and a voice signature, and it may apply the voice signature to the audio data file to alter the audio output of the audio data file. In that instance, the audio data file may be a textual representation of a recorded audio data file.

148 citations
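
One naive way to realize the first embodiment (changing the original frequency content so the voice sounds like a different person) is plain resampling, sketched below; a real system would preserve duration, which this sketch deliberately does not.

```python
import numpy as np

def shift_frequency_content(x, factor=1.2):
    """Resample so every frequency in x is scaled by `factor` on playback
    (pitch and formants move together; the duration changes as well)."""
    n_out = int(len(x) / factor)
    src = np.arange(n_out) * factor          # fractional read positions
    return np.interp(src, np.arange(len(x)), x)

fs = 16000
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 150 * t)            # stand-in for a recorded voice
altered = shift_frequency_content(voice, 1.2)  # sounds like "someone else"
```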


DissertationDOI
21 Nov 2008
TL;DR: In this thesis methods are presented for which no external training data is required for training models, and these novel methods have been implemented in a large vocabulary continuous speech recognition system called SHoUT.
Abstract: In this thesis, research on large vocabulary continuous speech recognition for unknown audio conditions is presented. For automatic speech recognition systems based on statistical methods, it is important that the conditions of the audio used for training the statistical models match the conditions of the audio to be processed. Any mismatch will decrease the accuracy of the recognition. If it is unpredictable what kind of data can be expected, or in other words if the conditions of the audio to be processed are unknown, it is impossible to tune the models. If the material consists of 'surprise data' the output of the system is likely to be poor. In this thesis, methods are presented for which no external training data is required for training models. These novel methods have been implemented in a large vocabulary continuous speech recognition system called SHoUT. This system consists of three subsystems: speech/non-speech classification, speaker diarization and automatic speech recognition.

129 citations


Journal ArticleDOI
TL;DR: The combination of various deformation- and fault-tolerance mechanisms allows us to employ standard indexing techniques to obtain an efficient, index-based matching procedure, thus providing an important step towards semantically searching large-scale real-world music collections.
Abstract: Given a large audio database of music recordings, the goal of classical audio identification is to identify a particular audio recording by means of a short audio fragment. Even though recent identification algorithms show a significant degree of robustness towards noise, MP3 compression artifacts, and uniform temporal distortions, the notion of similarity is rather close to the identity. In this paper, we address a higher level retrieval problem, which we refer to as audio matching: given a short query audio clip, the goal is to automatically retrieve all excerpts from all recordings within the database that musically correspond to the query. In our matching scenario, opposed to classical audio identification, we allow semantically motivated variations as they typically occur in different interpretations of a piece of music. To this end, this paper presents an efficient and robust audio matching procedure that works even in the presence of significant variations, such as nonlinear temporal, dynamical, and spectral deviations, where existing algorithms for audio identification would fail. Furthermore, the combination of various deformation- and fault-tolerance mechanisms allows us to employ standard indexing techniques to obtain an efficient, index-based matching procedure, thus providing an important step towards semantically searching large-scale real-world music collections.

125 citations
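
The sketch below illustrates one deformation-tolerance mechanism in the spirit of the paper, matching several tempo-scaled variants of the query against the database features; the coarse log-energy feature is a stand-in for the chroma-like features actually used, and all names are illustrative.

```python
import numpy as np

def feature_sequence(x, fs, hop_s=0.1):
    """Coarse stand-in feature: windowed log-energy, one value per hop."""
    n = int(fs * hop_s)
    return np.array([np.log(1e-9 + np.mean(x[i * n:(i + 1) * n] ** 2))
                     for i in range(len(x) // n)])

def audio_match(db_feats, q_feats, tempo_factors=(0.8, 0.9, 1.0, 1.1, 1.25)):
    """Slide tempo-scaled variants of the query over the database features;
    return (best_cost, best_frame_position)."""
    best = (np.inf, -1)
    for f in tempo_factors:
        m = max(2, int(round(len(q_feats) * f)))
        q = np.interp(np.linspace(0, len(q_feats) - 1, m),
                      np.arange(len(q_feats)), q_feats)
        for pos in range(len(db_feats) - m + 1):
            cost = float(np.mean((db_feats[pos:pos + m] - q) ** 2))
            best = min(best, (cost, pos))
    return best
```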


PatentDOI
TL;DR: An audio system for processing two channels of audio input to provide more than two output channels is described in this article, where the audio processing includes separating the input signals into frequency bands and processing the frequency bands according to processes which may differ from band to band.
Abstract: An audio system for processing two channels of audio input to provide more than two output channels. The input may be conventional stereo material or compressed audio signal data. The audio processing includes separating the input signals into frequency bands and processing the frequency bands according to processes which may differ from band to band. The audio processing includes no processing of L−R signals.

119 citations
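
A toy sketch of band-wise 2-to-N processing consistent with the abstract: the input is split into frequency bands, each band is processed differently, and the derived centre channel is built only from the mid (L+R) signal, i.e., with no processing of L−R. Band edges and gains are invented.

```python
import numpy as np

def derive_centre_channel(left, right, fs, edges=(300.0, 2000.0),
                          gains=(0.7, 1.0, 0.5)):
    """Split L/R into bands with FFT masks and build a centre channel from
    the mid signal of each band, with a band-dependent gain."""
    n = len(left)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    L, R = np.fft.rfft(left), np.fft.rfft(right)
    bounds = (0.0,) + tuple(edges) + (fs / 2.0 + 1.0,)
    centre = np.zeros_like(L)
    for lo, hi, g in zip(bounds[:-1], bounds[1:], gains):
        band = (freqs >= lo) & (freqs < hi)
        centre[band] = g * 0.5 * (L[band] + R[band])   # mid only, no L-R term
    return np.fft.irfft(centre, n)
```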


Patent
Jonathan Alastair Gibbs
09 Sep 2008
TL;DR: In this article, a frame processor is used to determine an inter-time difference between the first audio signal and the second audio signal, and a set of delays is used to delay at least one of the first and second audio signals in response to the inter-time difference signal.
Abstract: An encoding apparatus comprises a frame processor (105) which receives a multi channel audio signal comprising at least a first audio signal from a first microphone (101) and a second audio signal from a second microphone (103). An ITD processor (107) then determines an inter time difference between the first audio signal and the second audio signal and a set of delays (109, 111) generates a compensated multi channel audio signal from the multi channel audio signal by delaying at least one of the first and second audio signals in response to the inter time difference signal. A combiner (113) then generates a mono signal by combining channels of the compensated multi channel audio signal and a mono signal encoder (115) encodes the mono signal. The inter time difference may specifically be determined by an algorithm based on determining cross correlations between the first and second audio signals.

112 citations
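
A minimal sketch of the described pipeline, assuming plain time-domain cross-correlation for the inter-time difference (the patent mentions a cross-correlation-based determination): estimate the delay, compensate one channel, and downmix to mono. Names and the lag limit are illustrative.

```python
import numpy as np

def estimate_itd(x1, x2, max_lag):
    """Delay of x2 relative to x1, in samples, via cross-correlation."""
    corr = np.correlate(x2, x1, mode="full")
    lags = np.arange(-(len(x1) - 1), len(x2))
    keep = np.abs(lags) <= max_lag
    return int(lags[keep][np.argmax(corr[keep])])

def encode_mono_with_itd(x1, x2, max_lag=300):
    d = estimate_itd(x1, x2, max_lag)
    if d > 0:                      # x2 lags x1: advance x2
        x2 = x2[d:]
    elif d < 0:                    # x1 lags x2: advance x1
        x1 = x1[-d:]
    n = min(len(x1), len(x2))
    mono = 0.5 * (x1[:n] + x2[:n])           # combiner
    return mono, d                 # d would be sent as side information

rng = np.random.default_rng(0)
s = rng.standard_normal(2000)
mono, d = encode_mono_with_itd(s, np.concatenate([np.zeros(5), s[:-5]]))
print(d)   # -> 5
```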


Journal ArticleDOI
TL;DR: A novel method for underdetermined blind source separation using an instantaneous mixing model which assumes closely spaced microphones is proposed and is applicable to segregate speech signals under reverberant conditions and is compared to another state-of-the-art algorithm.
Abstract: Separation of speech mixtures, often referred to as the cocktail party problem, has been studied for decades. In many source separation tasks, the separation method is limited by the assumption of at least as many sensors as sources. Further, many methods require that the number of signals within the recorded mixtures be known in advance. In many real-world applications, these limitations are too restrictive. We propose a novel method for underdetermined blind source separation using an instantaneous mixing model which assumes closely spaced microphones. Two source separation techniques have been combined, independent component analysis (ICA) and binary time-frequency (T-F) masking. By estimating binary masks from the outputs of an ICA algorithm, it is possible in an iterative way to extract basis speech signals from a convolutive mixture. The basis signals are afterwards improved by grouping similar signals. Using two microphones, we can separate, in principle, an arbitrary number of mixed speech signals. We show separation results for mixtures with as many as seven speech signals under instantaneous conditions. We also show that the proposed method is applicable to segregate speech signals under reverberant conditions, and we compare our proposed method to another state-of-the-art algorithm. The number of source signals is not assumed to be known in advance and it is possible to maintain the extracted signals as stereo signals.
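
A sketch of just the binary time-frequency masking step, assuming the two ICA outputs are already available (the ICA stage itself is omitted): each T-F unit of the mixture is assigned to whichever ICA output dominates it.

```python
import numpy as np
from scipy.signal import stft, istft

def binary_mask_separate(ica_out1, ica_out2, mixture, fs, nperseg=512):
    """Assign each T-F unit of the mixture to the dominating ICA output."""
    _, _, Y1 = stft(ica_out1, fs, nperseg=nperseg)
    _, _, Y2 = stft(ica_out2, fs, nperseg=nperseg)
    _, _, M = stft(mixture, fs, nperseg=nperseg)
    mask = np.abs(Y1) > np.abs(Y2)           # binary T-F mask
    _, s1 = istft(M * mask, fs, nperseg=nperseg)
    _, s2 = istft(M * ~mask, fs, nperseg=nperseg)
    return s1, s2
```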

Proceedings ArticleDOI
08 Dec 2008
TL;DR: This is the first work adopting graph theory to improve the codebook partition while using QIM in low bit-rate streaming media; the proposed algorithm guarantees that every codeword is in the opposite part to its nearest neighbor, and that the distortion is limited by a bound.
Abstract: In this paper we introduce a novel codebook partition algorithm for quantization index modulation (QIM), which is applied to information hiding in instant low bit-rate speech stream. The QIM method divides the codebook into two parts, each representing '0' and '1' respectively. Instead of randomly partitioning the codebook, the relationship between codewords is considered. The proposed algorithm - complementary neighbor vertices (CNV) guarantees that every codeword is in the opposite part to its nearest neighbor, and the distortion is limited by a bound. The feasibility of CNV is proved with graph theory. Moreover, in our work the secret message is embedded in the field of vector quantization index of LPC coefficients, getting the benefit that the distortion due to QIM is lightened adaptively by the rest of the encoding procedure. Experiments on iLBC and G.723.1 verify the effectiveness of the proposed method. Both objective and subjective assessments show the proposed method only slightly decreases the speech quality to an indistinguishable degree. The hiding capacity is no less than 100 bps. To the best of our knowledge, this is the first work adopting graph theory to improve the codebook partition while using QIM in low bit-rate streaming media.
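
The sketch below shows a greedy approximation of the complementary-neighbor idea (the paper's CNV algorithm uses graph theory to give actual guarantees), plus the corresponding QIM embedding step; codebook shapes and names are illustrative.

```python
import numpy as np

def partition_codebook(codebook):
    """Greedy complementary-neighbor labeling: where possible, give each
    codeword the opposite label of its nearest neighbour."""
    n = len(codebook)
    d = np.linalg.norm(codebook[:, None] - codebook[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nearest = d.argmin(axis=1)
    label = -np.ones(n, dtype=int)           # -1 = unassigned
    for i in range(n):
        if label[i] == -1:
            label[i] = 0
        if label[nearest[i]] == -1:
            label[nearest[i]] = 1 - label[i]
    return label

def qim_embed(vector, codebook, label, bit):
    """Quantize to the nearest codeword inside the part matching `bit`."""
    part = np.flatnonzero(label == bit)
    return int(part[np.linalg.norm(codebook[part] - vector, axis=1).argmin()])

codebook = np.random.default_rng(2).standard_normal((16, 4))
label = partition_codebook(codebook)
index = qim_embed(codebook[3] + 0.01, codebook, label, bit=1)
```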

Journal ArticleDOI
TL;DR: Experimental results indicate that the proposed analysis-by-synthesis echo hiding scheme is superior to the conventional schemes in terms of robustness, security, and perceptual quality.
Abstract: Audio watermarking using echo hiding has fairly good perceptual quality. However, security and the tradeoff between robustness and imperceptibility are still relevant issues. This paper presents the echo hiding scheme in which the analysis-by-synthesis approach, interlaced kernels, and frequency hopping are adopted to achieve high robustness, security, and perceptual quality. The amplitudes of the embedded echoes are adequately adapted during the embedding process by considering not only the characteristics of the host signals, but also cases in which the watermarked audio signals have suffered various attacks. Additionally, the interlaced kernels are introduced such that the echo positions of the interlaced kernels for embedding "zero" and "one" are interchanged alternately to minimize the influence of host signals and various attacks on the watermarked data. Frequency hopping is employed to increase the robustness and security of the proposed echo hiding scheme in which each audio segment for watermarking is established by combining the fractions selected from all frequency bands based on a pseudonoise sequence as a secret key. Experimental results indicate that the proposed analysis-by-synthesis echo hiding scheme is superior to the conventional schemes in terms of robustness, security, and perceptual quality.
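
A stripped-down echo-hiding sketch showing only the basic embed/detect mechanism via a single echo kernel and cepstral detection; the paper's analysis-by-synthesis adaptation, interlaced kernels, and frequency hopping are omitted, and the delays and echo amplitude are illustrative.

```python
import numpy as np

D0, D1 = 100, 150          # echo delays (samples) for bits 0 and 1

def embed_bit(segment, bit, alpha=0.3):
    d = D1 if bit else D0
    out = segment.copy()
    out[d:] += alpha * segment[:-d]          # add one echo
    return out

def detect_bit(segment):
    """The real cepstrum peaks at the quefrency of the embedded echo."""
    ceps = np.fft.irfft(np.log(np.abs(np.fft.rfft(segment)) + 1e-12))
    return int(ceps[D1] > ceps[D0])

host = np.random.default_rng(0).standard_normal(4096)
print(detect_bit(embed_bit(host, 1)))        # -> 1
```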

Book
02 Jan 2008
TL;DR: This book is a thorough reference to the 3GPP and MPEG Parametric Stereo standards and the MPEG Surround multi-channel audio coding standard and describes key developments in coding techniques, which is an important factor in the optimization of advanced entertainment, communications and signal processing applications.
Abstract: This book collects a wealth of information about spatial audio coding into one comprehensible volume. It is a thorough reference to the 3GPP and MPEG Parametric Stereo standards and the MPEG Surround multi-channel audio coding standard. It describes key developments in coding techniques, which is an important factor in the optimization of advanced entertainment, communications and signal processing applications. Until recently, technologies for coding audio signals, such as redundancy reduction and sophisticated source and receiver models, did not incorporate spatial characteristics of source and receiving ends. Spatial audio coding achieves much higher compression ratios than conventional coders. It does this by representing multi-channel audio signals as a downmix signal plus side information that describes the perceptually-relevant spatial information. Written by experts in spatial audio coding, Spatial Audio Processing:
- reviews psychoacoustics (the relationship between physical measures of sound and the corresponding percepts) and spatial audio sound formats and reproduction systems;
- brings together the processing, acquisition, mixing, playback, and perception of spatial audio, with the latest coding techniques;
- analyses algorithms for the efficient manipulation of multiple, discrete and combined spatial audio channels, including both MP3 and MPEG Surround;
- shows how the same insights on source and receiver models can also be applied for manipulation of audio signals, such as the synthesis of virtual auditory scenes employing head-related transfer function (HRTF) processing and stereo to N-channel audio upmix.
Audio processing research engineers and audio coding research and implementation engineers will find this an insightful guide. Academic audio and psychoacoustic researchers, including post-graduate and third/fourth year students taking courses in signal processing, audio and speech processing, and telecommunications, will also benefit from the information inside.

Patent
25 Mar 2008
TL;DR: An audio signal processing system is configured to separate an audio signal into a dry signal component and one or more reverberant signal components, which can be separately modified and then recombined to form a processed audio signal as mentioned in this paper.
Abstract: An audio signal processing system is configured to separate an audio signal into a dry signal component and one or more reverberant signal components. The dry signal component and the reverberant signal components can be separately modified and then recombined to form a processed audio signal. Alternatively, the dry signal component may be combined with an artificial reverberation component to form the processed audio signal. Modification of the reverberation signal component and generation of the artificial reverberation component may be performed in order to modify the acoustic characteristics of an acoustic space in which the audio signal is driving loudspeakers. The audio signal may be a pre-recorded audio signal or a live audio signal generated inside or outside the acoustic space.

Patent
05 Jun 2008
TL;DR: In this article, an audio encoder for encoding an audio signal includes an impulse extractor (10) for extracting an impulse-like portion from the audio signal, which is encoded and forwarded to an output interface (22).
Abstract: An audio encoder for encoding an audio signal includes an impulse extractor (10) for extracting an impulse-like portion from the audio signal. This impulse-like portion is encoded and forwarded to an output interface (22). Furthermore, the audio encoder includes a signal encoder (16) which encodes a residual signal derived from the original audio signal so that the impulse-like portion is reduced or eliminated in the residual audio signal. The output interface (22) forwards both, the encoded signals, i.e., the encoded impulse signal (12) and the encoded residual signal (20) for transmission or storage. On the decoder-side, both signal portions are separately decoded and then combined to obtain a decoded audio signal.

BookDOI
01 Jan 2008
TL;DR: This volume covers topics including Speech Recognition in Mobile Phones, Handheld Speech to Speech Translation Systems, Automotive Speech Recognition, and Energy Aware Speech Recognition for Mobile Devices.
Abstract: Contents: Network Speech Recognition; Network, Distributed and Embedded Speech Recognition: An Overview; Speech Coding and Packet Loss Effects on Speech and Speaker Recognition; Speech Recognition Over Mobile Networks; Speech Recognition Over IP Networks; Distributed Speech Recognition; Distributed Speech Recognition Standards; Speech Feature Extraction and Reconstruction; Quantization of Speech Features: Source Coding; Error Recovery: Channel Coding and Packetization; Error Concealment; Embedded Speech Recognition; Algorithm Optimizations: Low Computational Complexity; Algorithm Optimizations: Low Memory Footprint; Fixed-Point Arithmetic; Systems and Applications; Software Architectures for Networked Mobile Speech Applications; Speech Recognition in Mobile Phones; Handheld Speech to Speech Translation System; Automotive Speech Recognition; Energy Aware Speech Recognition for Mobile Devices.

Journal ArticleDOI
TL;DR: The data from both experiments combined indicate that, in contrast to normal hearing, timing cues available from natural head-width delays do not offer binaural advantages with present methods of electrical stimulation, even when fine-timing cues are explicitly coded.
Abstract: Four adult bilateral cochlear implant users, with good open-set sentence recognition, were tested with three different sound coding strategies for binaural speech unmasking and their ability to localize 100 and 500 Hz click trains in noise. Two of the strategies tested were envelope-based strategies that are clinically widely used. The third was a research strategy that additionally preserved fine-timing cues at low frequencies. Speech reception thresholds were determined in diotic noise for diotic and interaurally time-delayed speech using direct audio input to a bilateral research processor. Localization in noise was assessed in the free field. Overall results, for both speech and localization tests, were similar with all three strategies. None provided a binaural speech unmasking advantage due to the application of 700 μs interaural time delay to the speech signal, and localization results showed similar response patterns across strategies that were well accounted for by the use of broadband interaural level cues. The data from both experiments combined indicate that, in contrast to normal hearing, timing cues available from natural head-width delays do not offer binaural advantages with present methods of electrical stimulation, even when fine-timing cues are explicitly coded.

Journal ArticleDOI
TL;DR: This paper presents LSP manipulation methods that can be used to alter frequencies within the represented signal in a consistent and relevant way, and considers the use of LSPs for analysis of non-speech information.

Journal ArticleDOI
TL;DR: These experiments concern the intelligibility of target speech in the presence of a background talker; using a noise vocoder, they showed that intelligibility was lower when fast single-channel compression was applied to the target and background after mixing rather than before.
Abstract: These experiments are concerned with the intelligibility of target speech in the presence of a background talker. Using a noise vocoder, Stone and Moore [J. Acoust. Soc. Am. 114, 1023-1034 (2003)] showed that single-channel fast-acting compression degraded intelligibility, but slow compression did not. Stone and Moore [J. Acoust. Soc. Am. 116, 2311-2323 (2004)] showed that intelligibility was lower when fast single-channel compression was applied to the target and background after mixing rather than before, and suggested that this was partly due to compression after mixing introducing "comodulation" between the target and background talkers. Experiment 1 here showed a similar effect for multi-channel compression. In experiment 2, intelligibility was measured as a function of the speed of multi-channel compression applied after mixing. For both eight- and 12-channel vocoders with one compressor per channel, intelligibility decreased as compression speed increased. For the eight-channel vocoder, a compressor that only affected modulation depth for rates below 2 Hz still reduced intelligibility. Experiment 3 used 12- or 18-channel vocoders. There were between 1 and 12 compression channels, and four speeds of compression. Intelligibility decreased as the number and speed of compression channels increased. The results are interpreted using several measures of the effects of compression, especially "across-source modulation correlation."
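
For concreteness, here is a sketch of a single-channel compressor whose "speed" is set by its attack and release times, the variable manipulated in these experiments (the multi-channel conditions run one such compressor per vocoder channel); all parameter values are illustrative.

```python
import numpy as np

def compress(x, fs, attack_ms=5.0, release_ms=50.0, ratio=3.0):
    """Fast compression = short attack/release times; slow = long ones."""
    a_att = np.exp(-1.0 / (fs * attack_ms / 1000.0))
    a_rel = np.exp(-1.0 / (fs * release_ms / 1000.0))
    env = 0.1                                # running level estimate
    out = np.empty_like(x)
    for i, s in enumerate(x):
        coef = a_att if abs(s) > env else a_rel
        env = coef * env + (1.0 - coef) * abs(s)
        gain = min(10.0, env ** (1.0 / ratio - 1.0))   # capped power-law gain
        out[i] = s * gain
    return out

y = compress(np.random.default_rng(0).standard_normal(16000), 16000)
```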

Patent
01 Feb 2008
TL;DR: In this paper, an input multi-channel representation is converted into a different output multichannel representation of a spatial audio signal, in which an intermediate representation of the audio signal is derived, the intermediate representation having direction parameters indicating a direction of origin of a portion of the signal.
Abstract: An input multi-channel representation is converted into a different output multi-channel representation of a spatial audio signal, in that an intermediate representation of the spatial audio signal is derived, the intermediate representation having direction parameters indicating a direction of origin of a portion of the spatial audio signal; and in that the output multi-channel representation of the spatial audio signal is generated using the intermediate representation of the spatial audio signal.
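
A toy sketch of such an intermediate representation, assuming a stereo input and using a simple per-tile level-ratio panning angle as the direction parameter; this is far cruder than any real directional analysis but shows the downmix-plus-direction structure.

```python
import numpy as np
from scipy.signal import stft, istft

def to_intermediate(left, right, fs, nperseg=1024):
    """Downmix plus a per-tile direction parameter (a crude panning angle
    from the channel level ratio: 0 = hard left, pi/2 = hard right)."""
    _, _, L = stft(left, fs, nperseg=nperseg)
    _, _, R = stft(right, fs, nperseg=nperseg)
    return L + R, np.arctan2(np.abs(R), np.abs(L))

def render(downmix, direction, fs, n_out=3, nperseg=1024):
    """Re-render to n_out channels by panning each tile to its direction."""
    outs = []
    for c in np.linspace(0.0, np.pi / 2.0, n_out):   # output channel angles
        gain = np.cos(direction - c) ** 2
        _, y = istft(downmix * gain, fs, nperseg=nperseg)
        outs.append(y)
    return outs
```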

Patent
17 Oct 2008
TL;DR: In this article, an audio decoder for decoding a multi-audio-object signal having an audio signal of a first type and an audio signals of a second type encoded therein is described.
Abstract: An audio decoder for decoding a multi-audio-object signal having an audio signal of a first type and an audio signal of a second type encoded therein is described, the multi-audio-object signal consisting of a downmix signal (56) and side information (58), the side information comprising level information (60) of the audio signal of the first type and the audio signal of the second type in a first predetermined time/frequency resolution (42), and a residual signal (62) specifying residual level values in a second predetermined time/frequency resolution, the audio decoder comprising means (52) for computing prediction coefficients (64) based on the level information (60); and means (54) for up-mixing the downmix signal (56) based on the prediction coefficients (64) and the residual signal (62) to obtain a first up-mix audio signal approximating the audio signal of the first type and/or a second up-mix audio signal approximating the audio signal of the second type.

Proceedings ArticleDOI
12 May 2008
TL;DR: The key element of the method is an alternative search strategy for the ACELP codebook which allows for joint data hiding and speech coding and it is pointed out that the method can also be exploited to reduce the codec bit rate.
Abstract: A new method for hiding digital data in the bitstream of an ACELP speech codec is proposed in this paper. The key element of our method is an alternative search strategy for the ACELP codebook which allows for joint data hiding and speech coding. The concept has been applied, by way of example, to the AMR speech codec (12.2 kbit/s mode), and it is shown that steganographic data can be reliably transmitted at a rate of up to 2 kbit/s, both with a negligible effect on the subjective quality of the coded speech and with reasonable computational complexity. Apart from data hiding, it is further pointed out that our method can also be exploited to reduce the codec bit rate.
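
A sketch of the joint coding/hiding principle under stated assumptions: the hidden bits restrict the codebook search to the subset of indices whose low-order bits equal the payload, so the decoder can read the bits directly from the chosen index. This is a generic stand-in, not the actual ACELP pulse-position search.

```python
import numpy as np

def search_codebook(target, codebook, hidden_bits):
    """Best codeword among indices whose low bits equal the payload."""
    k = len(hidden_bits)
    payload = int("".join(map(str, hidden_bits)), 2)
    allowed = [i for i in range(len(codebook)) if i % (1 << k) == payload]
    errors = [np.sum((target - codebook[i]) ** 2) for i in allowed]
    return allowed[int(np.argmin(errors))]

def extract_bits(index, k):
    """Decoder side: the payload is simply the index modulo 2**k."""
    return [int(b) for b in format(index % (1 << k), f"0{k}b")]

codebook = np.random.default_rng(3).standard_normal((64, 8))
idx = search_codebook(codebook[10], codebook, [1, 0])
print(extract_bits(idx, 2))                  # -> [1, 0]
```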

BookDOI
17 Sep 2008
TL;DR: In this paper, the state of the art in important areas of speech and audio signal processing is discussed, including multi-microphone systems, specific approaches for noise reduction, and evaluations of speech signals and speech processing systems.
Abstract: The book reflects the state of the art in important areas of speech and audio signal processing. It presents topics that have so far been missing from the literature, along with the most recent findings in the field. Leading international experts report on their fields of work and their new results. A considerable amount of space is devoted to multi-microphone systems, specific approaches for noise reduction, and evaluations of speech signals and speech processing systems. Multi-microphone systems include automatic calibration of microphones, localisation of sound sources, and source separation procedures. Also covered are recent approaches to the problem of adaptive echo and noise suppression. A novel solution allows the design of filter banks exhibiting bands spaced according to the Bark scale and especially short delay times. Furthermore, a method for engine noise reduction and proposals for improving the signal/noise ratio based on partial signal reconstruction or using a noise reference are reported. A number of contributions deal with speech quality. Besides basic considerations for quality evaluation, specific methods for bandwidth extension of telephone speech are described. Procedures to reduce the reverberation of audio signals can help to increase speech intelligibility and speech recognition rates. In addition, solutions for specific applications in speech and audio signal processing are reported, including, e.g., the enhancement of audio signal reproduction in automobiles and the automatic evaluation of hands-free systems and hearing aids.

Patent
Jussi Virolainen, Jarmo Hiipakka
22 Apr 2008
TL;DR: In this article, an apparatus for utilizing spatial information for audio signal enhancement in a multiple distributed network may include a processor, which can be configured to receive representations of a plurality of audio signals including at least one audio signal received at a first device and at least a second audio signal receiving at a second device.
Abstract: An apparatus for utilizing spatial information for audio signal enhancement in a multiple distributed network may include a processor. The processor may be configured to receive representations of a plurality of audio signals including at least one audio signal received at a first device and at least a second audio signal received at a second device. The first and second devices may be part of a common acoustic space network and may be arbitrarily positioned with respect to each other. The processor may be further configured to combine the first and second audio signals to form a composite audio signal, and provide for communication of the composite audio signal along with spatial information relating to a sound source of at least one of the plurality of audio signals to another device.

Journal ArticleDOI
TL;DR: The findings from the present study suggest that the SNR criterion is an effective selection criterion for n-of-m strategies with the potential of restoring speech intelligibility.
Abstract: In the n-of-m strategy, the signal is processed through m bandpass filters from which only the n maximum envelope amplitudes are selected for stimulation. While this maximum selection criterion, adopted in the advanced combination encoder strategy, works well in quiet, it can be problematic in noise as it is sensitive to the spectral composition of the input signal and does not account for situations in which the masker completely dominates the target. A new selection criterion is proposed based on the signal-to-noise ratio (SNR) of individual channels. The new criterion selects target-dominated (SNR ≥ 0 dB) channels and discards masker-dominated (SNR < 0 dB) channels. Experiment 1 assessed cochlear implant users’ performance with the proposed strategy assuming that the channel SNRs are known. Results indicated that the proposed strategy can restore speech intelligibility to the level attained in quiet independent of the type of masker (babble or continuous noise) and SNR level (0–10 dB) used. Results from exper...
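
A minimal sketch of the proposed selection criterion, assuming (as in experiment 1) that the per-channel SNRs are known: target-dominated channels (SNR ≥ 0 dB) are kept, masker-dominated ones discarded.

```python
import numpy as np

def select_channels(target_env, masker_env):
    """Per-frame SNR selection: keep channels whose SNR is >= 0 dB.
    Inputs are per-channel envelope powers for one stimulation frame."""
    snr_db = 10 * np.log10(target_env / (masker_env + 1e-12) + 1e-12)
    return np.where(snr_db >= 0.0)[0]

target = np.array([1.0, 0.2, 3.0, 0.05])
masker = np.array([0.5, 0.8, 0.5, 0.4])
print(select_channels(target, masker))   # -> [0 2]
```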

Patent
31 Dec 2008
TL;DR: In this article, a method and a system for providing multiple display systems with an enhanced acoustics experience is described, where a source audio signal having a plurality of source audio channels is generated from an audio signal source.
Abstract: A method and a system are described providing multiple display systems with an enhanced acoustics experience. A source audio signal having a plurality of source audio channels is generated from an audio signal source. The system includes a plurality of speakers connected to a plurality of display systems. A speaker configuration gatherer determines the spatial configuration of the speakers. An audio signal processor is provided to generate synthesized audio signal based on the contents of the source audio signal and spatial configuration of the speakers. The synthesized audio signal is mapped and delivered to the speakers to produce an enhanced sound field.

Book ChapterDOI
25 Aug 2008
TL;DR: Enriched transcription, that is, enhancing the automatic word transcripts with meta-data derived from the audio data, is discussed, followed by some highlights of recent progress and remaining challenges in speech recognition.
Abstract: This paper addresses some of the recent trends in speech processing, with a focus on speech-to-text transcription as a means to facilitate access to multimedia information in a multilingual context. A brief overview of automatic speech recognition is given along with indicative performance measures for a range of tasks. Enriched transcription, that is, enhancing the automatic word transcripts with meta-data derived from the audio data, is discussed, followed by some highlights of recent progress and remaining challenges in speech recognition.

Book
11 Feb 2008
TL;DR: Advances in Digital Speech Transmission provides an up-to-date overview of the field, including topics such as speech coding in heterogeneous communication networks, wideband coding, and the quality assessment of wideband speech.
Abstract: Speech processing and speech transmission technology are expanding fields of active research. New challenges arise from the 'anywhere, anytime' paradigm of mobile communications, the ubiquitous use of voice communication systems in noisy environments and the convergence of communication networks toward Internet based transmission protocols, such as Voice over IP. As a consequence, new speech coding, new enhancement and error concealment, and new quality assessment methods are emerging. Advances in Digital Speech Transmission provides an up-to-date overview of the field, including topics such as speech coding in heterogeneous communication networks, wideband coding, and the quality assessment of wideband speech. The book:
- provides an insight into the latest developments in speech processing and speech transmission, making it an essential reference to those working in these fields;
- offers a balanced overview of technology and applications;
- discusses topics such as speech coding in heterogeneous communications networks, wideband coding, and the quality assessment of wideband speech;
- explains speech signal processing in hearing instruments and man-machine interfaces from an applications point of view;
- covers speech coding for Voice over IP, blind source separation, digital hearing aids and speech processing for automatic speech recognition.
Advances in Digital Speech Transmission serves as an essential link between the basics and the type of technology and applications (prospective) engineers work on in industry labs and academia. The book will also be of interest to advanced students, researchers, and other professionals who need to brush up their knowledge in this field.

Patent
30 Sep 2008
TL;DR: In this article, a demultiplexer (401) and decoder (403) are used to generate a binaural audio signal, which is a downmix of an N-channel audio signal and spatial parameter data.
Abstract: An apparatus for generating a binaural audio signal comprises a demultiplexer (401) and decoder (403) which receives audio data comprising an M-channel audio signal which is a downmix of an N-channel audio signal and spatial parameter data for upmixing the M-channel audio signal to the N-channel audio signal. A conversion processor (411) converts spatial parameters of the spatial parameter data into first binaural parameters in response to at least one binaural perceptual transfer function. A matrix processor (409) converts the M-channel audio signal into a first stereo signal in response to the first binaural parameters. A stereo filter (415, 417) generates the binaural audio signal by filtering the first stereo signal. The filter coefficients for the stereo filter are determined in response to the at least one binaural perceptual transfer function by a coefficient processor (419). The combination of parameter conversion/processing and filtering allows a high quality binaural signal to be generated with low complexity.