
Showing papers on "Speech coding published in 2002"


Book
01 Jan 2002
TL;DR: This book presents the methods underlying perceptual audio coding, including quantization, time-to-frequency mapping, psychoacoustic models, bit allocation, and quality measurement, and surveys the major audio coding standards: MPEG-1, MPEG-2, MPEG-2 AAC, Dolby AC-3, and MPEG-4 Audio.
Abstract: Foreword. Preface. I: Audio Coding Methods. 2. Quantization. 3. Representation of Audio Signals. 4. Time to Frequency Mapping Part I: The PQMF. 5. Time to Frequency Mapping Part II: The MDCT. 6. Introduction to Psychoacoustics. 7. Psychoacoustic Models for Audio Coding. 8. Bit Allocation Strategies. 9. Building a Perceptual Audio Decoder. 10. Quality Measurement of Perceptual Audio Codecs. II: Audio Coding Standards. 11. MPEG-1 Audio. 12. MPEG-2 Audio. 13. MPEG-2 AAC. 14. Dolby AC-3. 15. MPEG-4 Audio. Index.

367 citations



PatentDOI
TL;DR: In this article, the authors propose a speech recognition technique for video and audio signals that consists of processing a video signal associated with an arbitrary content video source, processing an audio signal associated with the video signal, and recognizing at least a portion of the processed audio signal using at least a portion of the processed video signal to generate an output signal representative of the audio signal.
Abstract: Techniques for providing speech recognition comprise the steps of processing a video signal associated with an arbitrary content video source, processing an audio signal associated with the video signal, and recognizing at least a portion of the processed audio signal, using at least a portion of the processed video signal, to generate an output signal representative of the audio signal.

302 citations


Proceedings ArticleDOI
Ara V. Nefian, Luhong Liang, Xiaobo Pi, Liu Xiaoxiang, Crusoe Mao, Kevin Murphy
13 May 2002
TL;DR: This paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM) to model the state asynchrony of the audio and visual observations sequences while still preserving their natural correlation over time.
Abstract: In recent years several speech recognition systems that use visual together with audio information showed significant increase in performance over the standard speech recognition systems. The use of visual features is justified by both the bimodality of the speech generation and by the need of features that are invariant to acoustic noise perturbation. The audio-visual speech recognition system presented in this paper introduces a novel audio-visual fusion technique that uses a coupled hidden Markov model (HMM). The statistical properties of the coupled-HMM allow us to model the state asynchrony of the audio and visual observations sequences while still preserving their natural correlation over time. The experimental results show that the coupled HMM outperforms the multistream HMM in audio visual speech recognition.

252 citations


Patent
30 May 2002
TL;DR: In this paper, a voice operated portable information management system that is substantially language independent and capable of supporting a substantially unlimited vocabulary is presented, which includes a microphone, speaker, clock and GPS connected to a speech processing system.
Abstract: A voice operated portable information management system that is substantially language independent and capable of supporting a substantially unlimited vocabulary. The system (Fig. 1) includes a microphone, speaker, clock and GPS connected to a speech processing system. The speech processing system (Fig. 4): 1) generates and stores compressed speech data corresponding to a user's speech received through the microphone, 2) compares the stored speech data, 3) re-synthesizes the stored speech data for output as speech through the speaker, 4) provides an audible user interface including a speech assistant for providing instructions in the user's language, 5) stores user-specific compressed speech data, including commands, received in response to prompts from the speech assistant for purposes of adapting the system to the user's speech, 6) identifies memo management commands spoken by the user, and stores and organizes compressed speech data as a function of the identified commands, and 7) identifies memo retrieval commands spoken by the user, and retrieves and outputs the stored speech data as a function of the commands.

160 citations


PatentDOI
Philip R. Wiser, LeeAnn Heringer, Gerry Kearby, Leon Rishniw, Jason S. Brownell
TL;DR: In this paper, audio processing profiles are organized according to specific delivery bandwidths such that a sound engineer can quickly and efficiently encode audio signals for each of a number of distinct delivery media.
Abstract: Essentially all of the processing parameters which control processing of a source audio signal to produce an encoded audio signal are stored in an audio processing profile. Multiple audio processing profiles are stored in a processing profile database such that specific combinations of processing parameters can be retrieved and used at a later time. Audio processing profiles are organized according to specific delivery bandwidths such that a sound engineer can quickly and efficiently encode audio signals for each of a number of distinct delivery media. Synchronized A/B switching during playback of various encoded audio signals allows the sound engineer to detect nuances in the sound characteristics of the various encoded audio signals.

151 citations


Journal ArticleDOI
TL;DR: There was remarkably close agreement in the pattern of group mean scores for the three strategies for CNC words and CUNY sentences in noise between the present study and the Conversion study, and the same percentage of subjects preferred each strategy.
Abstract: ObjectiveThe objective of this study was to determine whether 1) the SPEAK, ACE or CIS speech coding strategy was associated with significantly better speech recognition for individual subjects implanted with the Nucleus CI24M internal device who used the SPrint™ speech processor, and 2) whether a s

143 citations


Patent
Donald T. Tang, Ligin Shen, Qin Shi, Wei Zhang
05 Apr 2002
TL;DR: In this paper, a method for generating personalized speech from text includes the steps of analyzing the input text to get standard parameters of the speech to be synthesized from a standard text-to-speech database.
Abstract: A method for generating personalized speech from text includes the steps of analyzing the input text to get standard parameters of the speech to be synthesized from a standard text-to-speech database; mapping the standard speech parameters to the personalized speech parameters via a personalization model obtained in a training process; and synthesizing speech of the input text based on the personalized speech parameters. The method can be used to simulate the speech of the target person so as to make the speech produced by a TTS system more attractive and personalized.

127 citations
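
As a rough illustration of the mapping step described above, the sketch below learns an affine map from standard to personalized parameter vectors by least squares and applies it at synthesis time; the three-dimensional parameter layout, the training data, and the function names are placeholders, not the patent's actual model or features.

```python
# Minimal sketch of the parameter-mapping step: learn a linear map from
# "standard" TTS parameters to "personalized" ones from paired training
# vectors, then apply it at synthesis time. The feature layout (pitch,
# duration, energy) is a placeholder, not the patent's format.
import numpy as np

def train_personalization_model(standard, personal):
    """Least-squares affine map: personal ~= standard @ W + b."""
    X = np.hstack([standard, np.ones((standard.shape[0], 1))])  # add bias column
    coeffs, *_ = np.linalg.lstsq(X, personal, rcond=None)
    return coeffs  # shape (d+1, d)

def personalize(standard_params, coeffs):
    x = np.append(standard_params, 1.0)
    return x @ coeffs

# Toy usage with random paired data standing in for aligned recordings.
rng = np.random.default_rng(0)
std = rng.normal(size=(200, 3))              # e.g. [pitch, duration, energy]
per = std * [1.1, 0.9, 1.0] + [12.0, 5.0, 0.0] + rng.normal(0, 0.1, (200, 3))
W = train_personalization_model(std, per)
print(personalize(std[0], W), per[0])
```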


PatentDOI
TL;DR: In this article, a transform coding method for music signals was proposed, which is suitable for use in a hybrid codec, whereby a common linear predictive (LP) synthesis filter was employed for both speech and music signals.
Abstract: The present invention provides a transform coding method efficient for music signals that is suitable for use in a hybrid codec, whereby a common Linear Predictive (LP) synthesis filter is employed for both speech and music signals. The LP synthesis filter switches between a speech excitation generator and a transform excitation generator, in accordance with the coding of a speech or music signal, respectively. For coding speech signals, the conventional CELP technique may be used, while a novel asymmetrical overlap-add transform technique is applied for coding music signals. In performing the common LP synthesis filtering, interpolation of the LP coefficients is conducted for signals in overlap-add operation regions. The invention enables smooth transitions when the decoder switches between speech and music decoding modes.

126 citations


Journal ArticleDOI
TL;DR: A subjective listening test of the combined pre-filter/lossless coder and a state-of-the-art perceptual audio coder (PAC) shows that the new method achieves a comparable compression ratio and audio quality with a lower delay.
Abstract: This paper proposes a versatile perceptual audio coding method that achieves high compression ratios and is capable of low encoding/decoding delay. It accommodates a variety of source signals (including both music and speech) with different sampling rates. It is based on separating irrelevance and redundancy reductions into independent functional units. This contrasts traditional audio coding where both are integrated within the same subband decomposition. The separation allows for the independent optimization of the irrelevance and redundancy reduction units. For both reductions, we rely on adaptive filtering and predictive coding as much as possible to minimize the delay. A psycho-acoustically controlled adaptive linear filter is used for the irrelevance reduction, and the redundancy reduction is carried out by a predictive lossless coding scheme, which is termed weighted cascaded least mean squared (WCLMS) method. Experiments are carried out on a database of moderate size which contains mono-signals of different sampling rates and varying nature (music, speech, or mixed). They show that the proposed WCLMS lossless coder outperforms other competing lossless coders in terms of compression ratios and delay, as applied to the pre-filtered signal. Moreover, a subjective listening test of the combined pre-filter/lossless coder and a state-of-the-art perceptual audio coder (PAC) shows that the new method achieves a comparable compression ratio and audio quality with a lower delay.

125 citations
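
The redundancy-reduction side of this scheme can be illustrated with a single adaptive predictor whose integer residual is what would be entropy-coded; the sketch below uses one NLMS predictor rather than the paper's weighted cascade (WCLMS), and the filter order and step size are arbitrary choices.

```python
# Minimal sketch of the redundancy-reduction idea: an adaptive (NLMS)
# predictor runs over integer samples and only the prediction residual is
# entropy-coded. This is a single predictor, not the weighted cascade of
# predictors (WCLMS) described in the paper.
import numpy as np

def nlms_residual(x, order=16, mu=0.5, eps=1e-6):
    w = np.zeros(order)
    residual = np.empty_like(x)
    for n in range(len(x)):
        past = x[max(0, n - order):n][::-1].astype(float)
        past = np.pad(past, (0, order - len(past)))
        pred = int(round(w @ past))
        residual[n] = x[n] - pred               # integer residual -> lossless
        err = x[n] - w @ past
        w += mu * err * past / (past @ past + eps)
    return residual

def nlms_reconstruct(residual, order=16, mu=0.5, eps=1e-6):
    w = np.zeros(order)
    x = np.empty_like(residual)
    for n in range(len(residual)):
        past = x[max(0, n - order):n][::-1].astype(float)
        past = np.pad(past, (0, order - len(past)))
        pred = int(round(w @ past))
        x[n] = residual[n] + pred               # decoder stays in sync with encoder
        err = x[n] - w @ past
        w += mu * err * past / (past @ past + eps)
    return x

samples = (1000 * np.sin(np.arange(2000) * 0.05)).astype(np.int32)
res = nlms_residual(samples)
assert np.array_equal(nlms_reconstruct(res), samples)   # perfect reconstruction
```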


Patent
25 Feb 2002
TL;DR: In this paper, a method for time aligning audio signals, wherein one signal has been derived from the other or both have been derived separately from another signal, comprises deriving reduced-information characterizations of the audio signals using auditory scene analysis.
Abstract: A method for time aligning audio signals, wherein one signal has been derived from the other or both have been derived from another signal, comprises deriving reduced-information characterizations of the audio signals using auditory scene analysis. The time offset of one characterization with respect to the other characterization is calculated and the temporal relationship of the audio signals with respect to each other is modified in response to the time offset such that the audio signals are coincident with each other. These principles may also be applied to a method for time aligning a video signal and an audio signal that will be subjected to differential time offsets.
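
A minimal sketch of the alignment idea, using a frame-energy envelope as a stand-in for the patent's reduced-information (auditory scene analysis) characterizations: the lag that maximizes the cross-correlation of the two characterizations gives the time offset used to re-align the signals.

```python
# Derive a reduced-information characterization of each signal (here a simple
# frame-energy envelope), estimate the lag that maximizes their
# cross-correlation, and report it as the offset in samples.
import numpy as np

def envelope(x, frame=256):
    n = len(x) // frame
    return np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2) for i in range(n)])

def estimate_offset(a, b, frame=256):
    ea, eb = envelope(a, frame), envelope(b, frame)
    ea, eb = ea - ea.mean(), eb - eb.mean()
    corr = np.correlate(ea, eb, mode="full")
    lag_frames = np.argmax(corr) - (len(eb) - 1)
    return lag_frames * frame            # lag in samples; negative when b is delayed

rng = np.random.default_rng(1)
x = rng.normal(size=48000)
delayed = np.concatenate([np.zeros(4096), x])[:48000]   # derived, delayed copy
print(estimate_offset(x, delayed))       # prints about -4096: 'delayed' trails x
```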

Patent
09 Oct 2002
TL;DR: In this article, the authors present a method and apparatus for identifying broadcast digital audio signals, in which the digital audio signal is provided to a processing structure configured to identify a program-identifying code in the received digital audio signal, identify a program-identifying code in a decompressed received digital audio signal, identify a feature signature in the received digital audio signal, and identify a feature signature in the decompressed received digital audio signal.
Abstract: Method and apparatus for identifying broadcast digital audio signals include structure and/or function whereby the digital audio signal is provided to processing structure which is configured to (i) identify a program-identifying code in the received digital audio signal, (ii) identify a program-identifying code in a decompressed received digital audio signal, (iii) identify a feature signature in the received digital audio signal, and (iv) identify a feature signature in the decompressed received digital audio signal. Preferably, such processing structure is disposed in a dwelling or a monitoring site in an audience measurement system, such as the Nielsen TV ratings system.

Journal ArticleDOI
TL;DR: A spectral-domain speech enhancement algorithm based on a mixture model for the short-time spectrum of the clean speech signal and on a maximum assumption in the production of the noisy speech spectrum, which shows improved performance compared to alternative speech enhancement algorithms.
Abstract: We present a spectral domain, speech enhancement algorithm. The new algorithm is based on a mixture model for the short time spectrum of the clean speech signal, and on a maximum assumption in the production of the noisy speech spectrum. In the past this model was used in the context of noise robust speech recognition. In this paper we show that this model is also effective for improving the quality of speech signals corrupted by additive noise. The computational requirements of the algorithm can be significantly reduced, essentially without paying performance penalties, by incorporating a dual codebook scheme with tied variances. Experiments, using recorded speech signals and actual noise sources, show that in spite of its low computational requirements, the algorithm shows improved performance compared to alternative speech enhancement algorithms.


Proceedings ArticleDOI
13 May 2002
TL;DR: The characteristics of voiced speech can be used to derive a coherently added signal from the linear prediction (LP) residuals of the degraded speech data from different microphones to enhance speech degraded by noise and reverberation.
Abstract: This paper proposes an approach for processing speech from multiple microphones to enhance speech degraded by noise and reverberation. The approach is based on exploiting the features of the excitation source in speech production. In particular, the characteristics of voiced speech can be used to derive a coherently added signal from the linear prediction (LP) residuals of the degraded speech data from different microphones. A weight function is derived from the coherently added signal. For coherent addition the time-delay between a pair of microphones is estimated using the knowledge of the source information present in the LP residual. The enhanced speech is generated by exciting the time varying all-pole filter with the weighted LP residual.
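
Two of the ingredients, computing the LP residual by inverse filtering and estimating the inter-microphone delay from the cross-correlation of the residuals, can be sketched as below; the coherent addition and weight-function stages are omitted, and the LP order is an arbitrary choice.

```python
# LP residual via the autocorrelation method (Levinson-Durbin) and inverse
# filtering, plus a residual-based delay estimate between two microphones.
import numpy as np
from scipy.signal import lfilter

def lp_coefficients(x, order=12):
    """Autocorrelation method via the Levinson-Durbin recursion."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1); a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / (err + 1e-12)
        a[1:i + 1] = np.append(a[1:i], 0.0) + k * np.append(a[1:i][::-1], 1.0)
        err *= (1 - k * k)
    return a                      # A(z) = 1 + a1 z^-1 + ...; residual = A(z) applied to x

def lp_residual(x, order=12):
    return lfilter(lp_coefficients(x, order), [1.0], x)

def delay_between(mic1, mic2, order=12):
    r1, r2 = lp_residual(mic1, order), lp_residual(mic2, order)
    corr = np.correlate(r1, r2, mode="full")
    return np.argmax(corr) - (len(r2) - 1)   # negative: mic2 lags mic1
```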

Proceedings ArticleDOI
07 Nov 2002
TL;DR: Four time-domain VAD algorithms are compared in terms of speech quality, compression level and computational complexity, and their relative merits and demerits are presented along with the subjective quality of speech after the pruning of silence periods.
Abstract: We discuss techniques for voice activity detection (VAD) for voice over Internet Protocol (VoIP). VAD aids in reducing the bandwidth requirement of a voice session, thereby using bandwidth efficiently. Such a scheme would be implemented in the application layer. Thus the VAD is independent of the lower layers in the network stack (see Flood, J.E., "Telecommunications Switching - Traffic and Networks", Prentice Hall India). We compare four time-domain VAD algorithms in terms of speech quality, compression level and computational complexity. A comparison of the relative merits and demerits along with the subjective quality of speech after the pruning of silence periods is presented for all the algorithms. A quantitative measurement of speech quality for different algorithms is also presented.
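
For illustration, a representative time-domain VAD of the kind compared here classifies each frame by short-time energy and zero-crossing rate against thresholds calibrated on an assumed leading noise-only segment; the thresholds and frame size below are illustrative, not taken from the paper.

```python
# Energy plus zero-crossing-rate VAD: one simple time-domain scheme.
import numpy as np

def vad(x, frame=160, noise_frames=10, energy_factor=3.0, zcr_max=0.25):
    frames = x[:len(x) // frame * frame].reshape(-1, frame)
    energy = np.mean(frames.astype(float) ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    noise_floor = np.mean(energy[:noise_frames])          # assume leading silence
    return (energy > energy_factor * noise_floor) & (zcr < zcr_max)

# Frames flagged False can be pruned or replaced by comfort noise before
# packetization, which is where the VoIP bandwidth saving comes from.
```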

Proceedings ArticleDOI
Luhong Liang1, Xiaoxing Liu1, Yibao Zhao1, Xiaobo Pi1, Ara V. Nefian1 
07 Nov 2002
TL;DR: The speaker-independent audio-visual continuous speech recognition system presented relies on a robust set of visual features obtained from the accurate detection and tracking of the mouth region, and integrates the visual and acoustic observation sequences using a coupled hidden Markov model (CHMM).
Abstract: The increase in the number of multimedia applications that require robust speech recognition systems determined a large interest in the study of audio-visual speech recognition (AVSR) systems. The use of visual features in AVSR is justified by both the audio and visual modality of the speech generation and the need for features that are invariant to acoustic noise perturbation. The speaker independent audio-visual continuous speech recognition system presented relies on a robust set of visual features obtained from the accurate detection and tracking of the mouth region. Further, the visual and acoustic observation sequences are integrated using a coupled hidden Markov model (CHMM). The statistical properties of the CHMM can model the audio and visual state asynchrony while preserving their natural correlation over time. The experimental results show that the current system tested on the XM2VTS database reduces by over 55% the error rate of the audio only speech recognition system at SNR of 0 dB.

Proceedings ArticleDOI
07 Aug 2002
TL;DR: This work carried out a fundamental investigation of the impact of packet loss and talkers on perceived speech quality to provide the basis for developing an artificial neural network (ANN) model to predict speech quality for VoIP.
Abstract: Perceived speech quality is the key metric for QoS in VoIP applications. Our primary aims are to carry out a fundamental investigation of the impact of packet loss and talkers on perceived speech quality using an objective method and, thus, to provide the basis for developing an artificial neural network (ANN) model to predict speech quality for VoIP. The impact on perceived speech quality of packet loss and of different talkers was investigated for three modern codecs (G.729, G.723.1 and AMR) using the new ITU PESQ algorithm. Results show that packet loss burstiness, loss locations/patterns and the gender of talkers have an impact. Packet size has, in general, no obvious influence on perceived speech quality for the same network conditions, but the deviation in speech quality depends on packet size and codec. Based on the investigation, we used talkspurt-based conditional and unconditional packet loss rates (which are perceptually more relevant than network packet loss rates), codec type and the gender of the talker (extracted from decoder) as inputs to an ANN model to predict speech quality directly from network parameters. Results show that high prediction accuracy was obtained from the ANN model (correlation coefficients for the test and validation datasets were 0.952 and 0.946 respectively). This work should help to develop efficient, nonintrusive QoS monitoring and control strategies for VoIP applications.
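
A minimal sketch of the prediction stage: a small MLP maps the paper's input features (talkspurt-based conditional and unconditional loss rates, codec type, talker gender) to a quality score. The synthetic training targets below merely stand in for the PESQ-derived scores used in the study.

```python
# Small neural-network regressor from network/codec/talker features to a
# quality score. The data is random placeholder data, not the paper's.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# columns: [cond_loss_rate, uncond_loss_rate, codec_id, gender]
X = np.column_stack([rng.uniform(0, 0.3, 500), rng.uniform(0, 0.2, 500),
                     rng.integers(0, 3, 500), rng.integers(0, 2, 500)])
y = 4.2 - 6.0 * X[:, 1] - 2.0 * X[:, 0] + rng.normal(0, 0.1, 500)  # fake MOS

model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X[:400], y[:400])
print("validation correlation:", np.corrcoef(model.predict(X[400:]), y[400:])[0, 1])
```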

Proceedings ArticleDOI
07 Nov 2002
TL;DR: Computationally efficient methods for detecting non-Gaussian impulsive noise in digital speech and audio signals are presented and can be applied in real time to a digital data stream.
Abstract: Computationally efficient methods for detecting non-Gaussian impulsive noise in digital speech and audio signals are presented. The aim of the detection is to find the errors without false detections in the case of, for example, percussive sounds in a music signal or stop consonants in a speech signal. Various methods for computing a detection signal and a threshold curve are studied and tested. The detection can be applied in real time to a digital data stream.
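
One plausible realization of the detection-signal-plus-threshold-curve scheme, not the paper's exact method: the detection signal is the magnitude of a second-difference (high-pass) filter output and the threshold curve is a scaled sliding median, so percussive passages raise the threshold locally instead of triggering false detections.

```python
# Detection signal = |second difference|; threshold curve = scaled sliding
# median of the detection signal. Filter choice and the factor are illustrative.
import numpy as np
from scipy.ndimage import median_filter

def detect_impulses(x, win=501, factor=6.0):
    detection = np.abs(np.diff(x, n=2, prepend=x[0], append=x[-1]))
    threshold = factor * median_filter(detection, size=win) + 1e-9
    return detection > threshold          # boolean mask of suspected clicks

fs = 8000
clean = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)
noisy = clean.copy(); noisy[[1234, 5000, 7000]] += 2.0   # inject clicks
print(np.flatnonzero(detect_impulses(noisy))[:10])
```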


Proceedings ArticleDOI
Tian Wang, Kazuhito Koishida, V. Cuperman, A. Gersho, J.S. Collura
06 Oct 2002
TL;DR: Key algorithm features of the future NATO narrow band voice coder (NBVC) are presented, a 1.2/2.4 kbps speech coder with noise preprocessor based on the MELP analysis algorithm that achieves quality close to the existing federal standard 2.2 kbps.
Abstract: This paper presents key algorithm features of the future NATO narrow band voice coder (NBVC), a 1.2/2.4 kbps speech coder with noise preprocessor based on the MELP analysis algorithm. At 1.2 kbps, the MELP parameters for three consecutive frames are grouped into a superframe and jointly quantized to obtain high coding efficiency. The inter-frame redundancy is exploited with distinct quantization schemes for different unvoiced/voiced (U/V) frame combinations in the superframe. Novel techniques used at 1.2 kbps include pitch vector quantization using pitch differentials, joint quantization of pitch and U/V decisions and LSF quantization with a forward-backward interpolation method. A new harmonic synthesizer is introduced for both rates which improves the reproduction quality. Subjective test results indicate that the 1.2 kbps speech coder achieves quality close to the existing federal standard 2.4 kbps MELP coder.
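
The superframe idea can be sketched for the pitch parameter alone: three frames' pitch values are grouped, the first quantized absolutely on a log scale and the other two as differentials relative to it. The grids and bit allocations below are illustrative only, not the NBVC codebooks.

```python
# Toy superframe pitch quantizer: absolute index for frame 1, differential
# indices for frames 2 and 3. Grids and bit counts are placeholders.
import numpy as np

LOG_PITCH_GRID = np.linspace(np.log(20.0), np.log(160.0), 64)    # 6-bit absolute
DIFF_GRID = np.linspace(-0.3, 0.3, 16)                           # 4-bit differential

def quantize_pitch_superframe(pitches):           # pitches: 3 values in samples
    logp = np.log(np.asarray(pitches, dtype=float))
    i0 = int(np.argmin(np.abs(LOG_PITCH_GRID - logp[0])))
    base = LOG_PITCH_GRID[i0]
    d1 = int(np.argmin(np.abs(DIFF_GRID - (logp[1] - base))))
    d2 = int(np.argmin(np.abs(DIFF_GRID - (logp[2] - base))))
    return (i0, d1, d2)                           # 6 + 4 + 4 = 14 bits per superframe

def dequantize_pitch_superframe(idx):
    i0, d1, d2 = idx
    base = LOG_PITCH_GRID[i0]
    return np.exp([base, base + DIFF_GRID[d1], base + DIFF_GRID[d2]])

print(dequantize_pitch_superframe(quantize_pitch_superframe([52.0, 54.0, 57.0])))
```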

Patent
Jussi Virolainen1, Ari Lakaniemi1
14 Jun 2002
TL;DR: In this article, an error concealment method for multi-channel digital audio is proposed, where the first and second audio channels are correlated with each other in a manner so that a spatial sensation is typically perceived when listened to by a user.
Abstract: An error concealment method for multi-channel digital audio involves receiving an audio signal having audio data forming a first audio channel and a second audio channel included therein, wherein the first and second audio channels are correlated with each other in a manner so that a spatial sensation is typically perceived when listened to by a user. Erroneous first-channel data is detected in the first audio channel, and second-channel data is obtained from the second audio channel. The erroneous first-channel data of the first audio channel is corrected by using the second-channel data. Upon detection of the erroneous first-channel data, a spatially perceivable inter-channel relation between the first and second audio channels is determined, and the determined inter-channel relation is used when correcting the erroneous first-channel data of the first audio channel so as to preserve the spatial sensation perceived by the user.
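
A minimal sketch of the concealment step under the stated assumptions: a corrupted frame in one channel is replaced by the corresponding frame of the other channel, scaled so that the recently observed inter-channel level difference, and hence the spatial sensation, is roughly preserved. Frame length and the level estimator are illustrative.

```python
# Replace a bad frame of the left channel with the right channel's frame,
# scaled to the recent inter-channel level ratio.
import numpy as np

def conceal_frame(left, right, frame_start, frame_len, history=4):
    h0, h1 = max(0, frame_start - history * frame_len), frame_start
    level_l = np.sqrt(np.mean(left[h0:h1] ** 2)) + 1e-12     # level before the error
    level_r = np.sqrt(np.mean(right[h0:h1] ** 2)) + 1e-12
    gain = level_l / level_r                                  # preserves level difference
    sl = slice(frame_start, frame_start + frame_len)
    repaired = left.copy()
    repaired[sl] = gain * right[sl]
    return repaired

rng = np.random.default_rng(0)
right = rng.normal(size=8000)
left = 0.7 * right + rng.normal(0.0, 0.05, 8000)              # correlated stereo pair
repaired = conceal_frame(left, right, frame_start=4000, frame_len=160)
```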

Proceedings ArticleDOI
13 May 2002
TL;DR: This paper presents an integer approximation of this lapped transform, called IntMDCT, which is derived from theMDCT using the lifting scheme, and inherits most of the attractive properties of the MDCT, exhibiting a good spectral representation of the audio signal, critical sampling and overlapping of blocks.
Abstract: The Modified Discrete Cosine Transform (MDCT) is widely used in modern perceptual audio coding schemes. In this paper we present an integer approximation of this lapped transform, called IntMDCT, which is derived from the MDCT using the lifting scheme. This reversible integer transform inherits most of the attractive properties of the MDCT, exhibiting a good spectral representation of the audio signal, critical sampling and overlapping of blocks. This makes the IntMDCT well suited for both lossless audio coding as well as for combined perceptual and lossless audio coding. A scalable system is presented providing a lossless enhancement of perceptual audio coding schemes, such as MPEG-2 AAC.
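
The reversible building block behind this kind of lifting-based integer transform can be shown in isolation: a Givens rotation factored into three lifting steps stays exactly invertible even when each step's output is rounded to an integer. A full IntMDCT applies such rotations inside the MDCT structure; only the single reversible rotation is sketched here.

```python
# Integer-to-integer Givens rotation via three lifting steps with rounding.
import numpy as np

def lift_rotate(x1, x2, alpha):
    c = (np.cos(alpha) - 1.0) / np.sin(alpha)
    s = np.sin(alpha)
    x1 = x1 + int(round(c * x2))
    x2 = x2 + int(round(s * x1))
    x1 = x1 + int(round(c * x2))
    return x1, x2

def lift_rotate_inverse(x1, x2, alpha):
    c = (np.cos(alpha) - 1.0) / np.sin(alpha)
    s = np.sin(alpha)
    x1 = x1 - int(round(c * x2))      # undo the lifting steps in reverse order
    x2 = x2 - int(round(s * x1))
    x1 = x1 - int(round(c * x2))
    return x1, x2

a, b = 12345, -6789
y1, y2 = lift_rotate(a, b, alpha=0.7)
assert lift_rotate_inverse(y1, y2, alpha=0.7) == (a, b)   # exactly reversible
print((a, b), "->", (y1, y2))
```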

Proceedings Article
01 Jan 2002
TL;DR: This work investigates how the selection of the DCT coefficients influences the recognition scores in a hybrid ANN/HMM audio-visual speech recognition system on a continuous word recognition task with a vocabulary of 30 numbers.
Abstract: Encouraged by the good performance of the DCT in audiovisual speech recognition [1], we investigate how the selection of the DCT coefficients influences the recognition scores in a hybrid ANN/HMM audio-visual speech recognition system on a continuous word recognition task with a vocabulary of 30 numbers. Three sets of coefficients, based on the mean energy, the variance and the variance relative to the mean value, were chosen. The performance of these coefficients is evaluated in a video only and an audio-visual recognition scenario with varying Signal to Noise Ratios (SNR). The audio-visual tests are performed with 5 types of additional noise at 12 SNR values each. Furthermore the results of the DCT based recognition are compared to those obtained via chroma-keyed geometric lip features [2]. In order to achieve this comparison, a second audio-visual database without chroma-key has been recorded. This database has similar content but a different speaker.
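
A sketch of the variance-based selection criterion, assuming the mouth region has already been cropped to a fixed-size grayscale image per frame: the 2D DCT is taken and the coefficient positions with the highest variance across training frames are kept as the visual feature vector. Image size and the number of retained coefficients are placeholders.

```python
# 2D DCT per mouth image, then keep the highest-variance coefficient positions.
import numpy as np
from scipy.fft import dctn

def select_dct_features(rois, keep=32):
    """rois: array of shape (num_frames, H, W) with grayscale mouth regions."""
    coeffs = np.array([dctn(r, norm="ortho") for r in rois])
    flat = coeffs.reshape(len(rois), -1)
    order = np.argsort(flat.var(axis=0))[::-1][:keep]   # highest-variance positions
    return order, flat[:, order]                        # indices + per-frame features

rng = np.random.default_rng(0)
frames = rng.random((100, 32, 32))
idx, feats = select_dct_features(frames)
print(feats.shape)      # (100, 32): one 32-dimensional visual feature vector per frame
```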

Journal ArticleDOI
TL;DR: An audio-visual automatic continuous speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions, and a robust and automatic algorithm is described to extract FAPs from visual data, which does not require hand labeling or extensive training procedures.
Abstract: We describe an audio-visual automatic continuous speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system utilizes facial animation parameters (FAPs) supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust and automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. Principal component analysis (PCA) was performed on the FAPs in order to decrease the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large vocabulary (approximately 1000 words) speech recognition experiments. The experiments use clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only speech recognition WERs at various SNRs (0-30 dB) with additive white Gaussian noise, and by 19% relative to the audio-only speech recognition WER under clean audio conditions.
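
The dimensionality-reduction step can be sketched as plain PCA over the per-frame FAP vectors, with the projection weights onto the leading components used as visual features; the FAP dimensionality and number of components below are placeholders, not the paper's configuration.

```python
# PCA of FAP vectors via SVD; the projection weights are the visual features.
import numpy as np

def pca_project(faps, num_components=6):
    """faps: (num_frames, num_faps) matrix of extracted FAP values."""
    mean = faps.mean(axis=0)
    centered = faps - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:num_components]                 # principal directions
    return centered @ basis.T, mean, basis      # projection weights = visual features

rng = np.random.default_rng(0)
fap_tracks = rng.normal(size=(300, 10))         # e.g. 10 outer-lip FAPs per frame
features, mean, basis = pca_project(fap_tracks)
print(features.shape)                           # (300, 6)
```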

Journal Article
TL;DR: In this paper, an overview of various nonlinear processing techniques applied to speech signals is presented, including speech coding, speech synthesis, speech and speaker recognition, voice analysis and enhancement, and analyses and simulation of dysphonic voices.
Abstract: This article presents an overview of various nonlinear processing techniques applied to speech signals. Evidence relating to the existence of nonlinearities in speech is presented, and the main differences between linear and nonlinear analysis are summarized. A brief review is given of the important nonlinear speech processing techniques reported to date, and their applications to speech coding, speech synthesis, speech and speaker recognition, voice analysis and enhancement, and analyses and simulation of dysphonic voices.

Proceedings ArticleDOI
13 May 2002
TL;DR: Results from a subjective test suggest that BCC, combined with existing mono audio coders, offers better quality than conventional stereo and multi-channel perceptual transform audio coders for a wide range of bitrates.
Abstract: We present a novel concept for representing multi-channel audio signals: Binaural Cue Coding (BCC). BCC aims at separating the basic audio content and the information relevant for spatial perception. A multi-channel audio signal is represented as a mono signal and BCC parameters. We present two types of applications of BCC. Firstly, a number of separate sound source signals are reduced to a mono signal and BCC parameters. In this case, the decoder has control over the location of each source in auditory space. In other words, the decoder can render spatial images as if the separate source signals were given. Secondly, a multi-channel audio signal is reduced to a mono signal and BCC parameters. In this case the decoder generates a multi-channel signal with a spatial image similar to the spatial image of the input signal of the encoder. Results from a subjective test suggest that BCC, combined with existing mono audio coders, offers better quality than conventional stereo and multi-channel perceptual transform audio coders for a wide range of bitrates.
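
A minimal sketch of the BCC principle for one cue type, assuming STFT analysis and a uniform band layout (real BCC uses perceptually motivated bands and additional time/phase cues): the encoder keeps a mono downmix plus per-band inter-channel level differences, and the decoder redistributes the mono signal across the channels from those cues.

```python
# Mono downmix + per-band inter-channel level differences (ILDs) as the only
# transmitted spatial cue; the decoder re-applies them as per-band gains.
import numpy as np
from scipy.signal import stft, istft

def bcc_encode(left, right, fs=16000, nband=20):
    f, t, L = stft(left, fs, nperseg=512)
    _, _, R = stft(right, fs, nperseg=512)
    mono = (L + R) / 2
    bands = np.array_split(np.arange(len(f)), nband)          # fixed band layout
    ild = np.array([[10 * np.log10((np.abs(L[b, i]) ** 2).sum() + 1e-12)
                     - 10 * np.log10((np.abs(R[b, i]) ** 2).sum() + 1e-12)
                     for i in range(L.shape[1])] for b in bands])
    return mono, ild, bands

def bcc_decode(mono, ild, bands, fs=16000):
    L, R = np.zeros_like(mono), np.zeros_like(mono)
    for bi, b in enumerate(bands):
        gl = 10 ** (ild[bi] / 20)                # per-frame linear level ratio
        L[b, :] = mono[b, :] * 2 * gl / (1 + gl)
        R[b, :] = mono[b, :] * 2 / (1 + gl)
    return istft(L, fs, nperseg=512)[1], istft(R, fs, nperseg=512)[1]
```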

Journal ArticleDOI
TL;DR: Together, source coding, channel coding and the modified recognition engine are shown to provide good recognition accuracy over a wide range of communication channels with bit rates of 1.2 kbps or less.
Abstract: We present a framework for developing source coding, channel coding and decoding as well as erasure concealment techniques adapted for distributed (wireless or packet-based) speech recognition. It is shown that speech recognition as opposed to speech coding, is more sensitive to channel errors than channel erasures, and appropriate channel coding design criteria are determined. For channel decoding, we introduce a novel technique for combining at the receiver soft decision decoding with error detection. Frame erasure concealment techniques are used at the decoder to deal with unreliable frames. At the recognition stage, we present a technique to modify the recognition engine itself to take into account the time-varying reliability of the decoded feature after channel transmission. The resulting engine, referred to as weighted Viterbi recognition, further improves the recognition accuracy. Together, source coding, channel coding and the modified recognition engine are shown to provide good recognition accuracy over a wide range of communication channels with bit rates of 1.2 kbps or less.
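
The weighted Viterbi idea can be sketched by scaling each frame's observation log-likelihood with a reliability weight in the otherwise standard recursion, so unreliable frames contribute less to the path score; the toy HMM and the weights below are placeholders.

```python
# Viterbi decoding with per-frame reliability weights on the observation
# log-likelihoods.
import numpy as np

def weighted_viterbi(log_trans, log_obs, log_init, weights):
    """log_obs: (T, N) frame log-likelihoods; weights: (T,) reliabilities in [0, 1]."""
    T, N = log_obs.shape
    delta = log_init + weights[0] * log_obs[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans            # (from_state, to_state)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + weights[t] * log_obs[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: a frame with weight 0 is effectively ignored by the recognizer.
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_obs = np.log(np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]))
print(weighted_viterbi(log_trans, log_obs, np.log([0.5, 0.5]), np.array([1.0, 0.0, 1.0])))
```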

Journal ArticleDOI
TL;DR: A theoretical framework is presented showing that it is indeed possible to separate a source when some of its spectral characteristics are provided to the system and how, if a statistical model of the joint probability of visual and spectral audio input is learnt to quantify the audio-visual coherence, separation can be achieved by maximizing this probability.
Abstract: We present a new approach to the source separation problem in the case of multiple speech signals. The method is based on the use of automatic lipreading: the objective is to extract an acoustic speech signal from other acoustic signals by exploiting its coherence with the speaker's lip movements. We consider the case of an additive stationary mixture of decorrelated sources, with no further assumptions on independence or non-Gaussian character. Firstly, we present a theoretical framework showing that it is indeed possible to separate a source when some of its spectral characteristics are provided to the system. Then we address the case of audio-visual sources. We show how, if a statistical model of the joint probability of visual and spectral audio input is learnt to quantify the audio-visual coherence, separation can be achieved by maximizing this probability. Finally, we present a number of separation results on a corpus of vowel-plosive-vowel sequences uttered by a single speaker, embedded in a mixture of other voices. We show that separation can be quite good for mixtures of 2, 3, and 5 sources. These results, while very preliminary, are encouraging, and are discussed in respect to their potential complementarity with traditional pure audio separation or enhancement techniques.

Patent
25 Mar 2002
TL;DR: In this article, the authors proposed a data transmission method and a data receiving method which enable audio data to be multiplexed into video data and be transmitted using a DVI standard cable or the like satisfactorily with a simple configuration.
Abstract: This invention provides a data transmission method and a data receiving method which enable audio data to be multiplexed into video data and be transmitted using a DVI standard cable or the like satisfactorily with a simple configuration. From a data transmitting end, a superimposed video/audio data signal in which audio data are superimposed over a video blanking interval of video data in superimposition timing that is generated using a video blank sync signal and a pixel clock, are transmitted to a data receiving end through the DVI cable, together with the video blank sync signal and the pixel clock. On the data receiving end, a timing signal for extracting the audio data from the superimposed video/audio data signal is generated using the transmitted video blank sync signal and pixel clock, and the superimposed video/audio data signal is separated into video data and audio data using the generated timing signal, as well as the digital audio data are converted into an analog audio signal using an audio clock that is generated by dividing the frequency of the pixel clock.