
Showing papers in "Journal of The Audio Engineering Society in 2013"


Journal Article
TL;DR: The authors present the Perceptual Objective Listening Quality Assessment (POLQA), the third-generation speech quality measurement algorithm, which provides a new measurement standard for predicting Mean Opinion Scores that outperforms the older PESQ standard.
Abstract: In this and the companion paper Part II, the authors present the Perceptual Objective Listening Quality Assessment (POLQA), the third-generation speech quality measurement algorithm, standardized by the International Telecommunication Union in 2011 as Recommendation P.863. In contrast to the previous standard (P.862 Perceptual Evaluation of Speech Quality), a more complex temporal alignment was developed allowing for the alignment of a wide variety of complex distortions for which P.862 was known to fail, such as multiple delay variations within utterances as well as temporal stretching and compression of the degraded signal. When this new algorithm is used in combination with the advanced perceptual model described in Part II, it provides a new measurement standard for predicting Mean Opinion Scores that outperforms the older PESQ standard, especially for wideband and super wideband speech signals (7 and 14 kHz audio bandwidth). Part I provides the basics of the POLQA approach and outlines the core elements of the temporal alignment.

132 citations



Journal Article
TL;DR: The Spatially Oriented Format for Acoustics (SOFA) is presented: a data exchange format for representing head-related transfer functions.
Abstract: Spatially Oriented Format for Acoustics: A Data Exchange Format Representing Head-Related Transfer Functions

55 citations


Journal Article
TL;DR: In this paper, a personal audio system was implemented in an automobile cabin using the dual arrays; performance was consistent with the simulations, and a contrast of 15 dB between bright and dark seats was possible.
Abstract: With the individual requirements of different occupants and the proliferation of audio sources in the automobile, there is an interest in implementing independent front and rear listening zones to match the preferences of the occupants. Because simulations showed the physical limits to creating personal listening zones, low- and high-frequency arrays were considered separately. Four standard audio loudspeakers were used for frequencies below 200 Hz, and phase-shift loudspeaker arrays mounted at the headrests were used for frequencies above 200 Hz. The split-band technique avoids the need for full-bandwidth loudspeakers in the headrests. To validate the results, a personal audio system was implemented in an automobile cabin using the dual arrays; performance was consistent with the simulations. A contrast of 15 dB between bright and dark seats was possible.
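The split-band idea above, with low frequencies handled by the cabin loudspeakers and everything above 200 Hz by the headrest arrays, can be sketched with a complementary crossover. The following is a minimal illustration, not the filters used in the paper: a one-pole low-pass whose residual forms the high band, so the two bands sum back to the input exactly.

```python
import math

def onepole_crossover(x, fc=200.0, fs=48000.0):
    """Split x into complementary low/high bands around fc.

    First-order complementary pair: the high band is the input minus
    the low band, so lo[n] + hi[n] reconstructs x[n] exactly.
    """
    a = 1.0 - math.exp(-2.0 * math.pi * fc / fs)
    lo, state = [], 0.0
    for s in x:
        state += a * (s - state)        # one-pole low-pass
        lo.append(state)
    hi = [s - l for s, l in zip(x, lo)]
    return lo, hi

# Example: a 50 Hz tone should land mostly in the low band.
fs = 48000.0
tone = [math.sin(2.0 * math.pi * 50.0 * n / fs) for n in range(4800)]
lo, hi = onepole_crossover(tone, fc=200.0, fs=fs)
```

The complementary structure means no band energy is lost at the crossover point, at the cost of a gentle 6 dB/octave slope; the production system would use steeper filters.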

54 citations


Journal Article
TL;DR: The Opus codec, standardized by the IETF as RFC 6716, combines a linear prediction coder with a transform coder; the authors describe the transform coder, with particular attention to the psychoacoustic knowledge built into the format, and show that it outperforms existing audio codecs that do not operate under real-time constraints.
Abstract: The IETF recently standardized the Opus codec as RFC 6716. Opus targets a wide range of real-time Internet applications by combining a linear prediction coder with a transform coder. We describe the transform coder, with particular attention to the psychoacoustic knowledge built into the format. The result outperforms existing audio codecs that do not operate under real-time constraints.

45 citations


Journal Article
TL;DR: All aspects of this standardization effort are outlined, starting with the history and motivation of the MPEG work item, describing all technical features of the final system, and further discussing listening test results and performance numbers which show the advantages of the new system over current state-of-the-art codecs.
Abstract: In early 2012 the ISO/IEC JTC1/SC29/WG11 (MPEG) finalized the new MPEG-D Unified Speech and Audio Coding standard. The new codec brings together the previously separated worlds of general audio coding and speech coding. It does so by integrating elements from audio coding and speech coding into a unified system. The present publication outlines all aspects of this standardization effort, starting with the history and motivation of the MPEG work item, describing all technical features of the final system, and further discussing listening test results and performance numbers which show the advantages of the new system over current state-of-the-art codecs.

42 citations


Journal Article
TL;DR: The authors go through the main components that constitute the voice part of the Opus speech and audio codec, providing an overview, offering insights, and discussing the design decisions made during development.
Abstract: In this paper, we describe the voice mode of the Opus speech and audio codec. As only the decoder is standardized, the details in this paper will help anyone who wants to modify the encoder or gain a better understanding of the codec. We go through the main components that constitute the voice part of the codec, provide an overview, give insights, and discuss the design decisions made during the development. Tests have shown that Opus quality is comparable to or better than several state-of-the-art voice codecs, while covering a much broader application area than competing codecs.

41 citations


Journal Article
TL;DR: An automated approach to dynamic range compression is presented in which the parameters are configured automatically based on real-time side-chain feature extraction from the input signal, leaving only the threshold as a user-controlled parameter to set the preferred amount of compression.
Abstract: Dynamic range compression is a nonlinear, time-dependent audio effect. As such, preferred parameter settings are difficult to achieve even when there is advance knowledge of the input signal and the desired perceptual characteristics of the output. We introduce an automated approach to dynamic range compression where the parameters are configured automatically based on real-time, side-chain feature extraction from the input signal. Parameters are all dynamically varied depending on extracted features, leaving only the threshold as a user-controlled parameter to set the preferred amount of compression. We analyze a series of automation techniques, including a comparison of methods based on different signal characteristics. Subjective evaluation was performed with amateur and professional sound engineers, which established preference for dynamic range compressor parameters when applied to musical signals and allowed us to compare the performance of our various approaches against manual parameter settings.
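The abstract does not specify the feature-to-parameter mappings, so the sketch below is a hypothetical illustration of the general idea only: the user sets just the threshold, while the ratio is derived from a side-chain feature of the input (here the crest factor, an assumed mapping, not the paper's).

```python
import math

def auto_compress(x, threshold_db=-20.0, fs=48000.0):
    """Feature-driven compressor sketch: threshold is the only user
    parameter; the ratio is derived from the input's crest factor
    (an assumed mapping, for illustration only)."""
    peak = max(abs(s) for s in x) or 1e-12
    rms = math.sqrt(sum(s * s for s in x) / len(x)) or 1e-12
    crest_db = 20.0 * math.log10(peak / rms)
    ratio = 1.0 + crest_db / 6.0              # more dynamic input -> stronger ratio
    att = math.exp(-1.0 / (0.005 * fs))       # 5 ms attack, fixed for the sketch
    rel = math.exp(-1.0 / (0.100 * fs))       # 100 ms release
    env, out = 0.0, []
    for s in x:
        coef = att if abs(s) > env else rel   # classic envelope follower
        env = coef * env + (1.0 - coef) * abs(s)
        lvl_db = 20.0 * math.log10(max(env, 1e-12))
        over = lvl_db - threshold_db
        gain_db = -over * (1.0 - 1.0 / ratio) if over > 0.0 else 0.0
        out.append(s * 10.0 ** (gain_db / 20.0))
    return out

# A steady 220 Hz tone well above threshold gets turned down.
x = [math.sin(2.0 * math.pi * 220.0 * n / 48000.0) for n in range(4800)]
y = auto_compress(x)
```

The paper varies all parameters dynamically from extracted features; this sketch fixes the time constants to keep the structure visible.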

36 citations


Journal Article
TL;DR: In this paper, the effects of both a finite-sized baffle and finite-sized sources on optimized array performance are investigated for a two-source array mounted on a mobile-phone-sized device, through both finite element simulations and a real-time implementation.
Abstract: The widespread use of loudspeakers on mobile devices to reproduce audio in public spaces has led to issues of both user privacy and noise nuisance. Previous work has investigated the use of acoustic contrast control to optimize the performance of small arrays of loudspeakers to create a zone within which the audio program is audible, while minimizing the level reproduced elsewhere. These investigations have generally assumed that the dimensions of both the device within which the array is mounted and the loudspeaker drivers themselves are negligible, so that the array can be modeled as monopoles in the free field. Although this is reasonable at low frequencies, the effect of both a finite-sized baffle and finite-sized sources on the optimized array performance is significant at higher frequencies. These effects are investigated for a two-source loudspeaker array mounted on a mobile-phone-sized device through both finite element simulations and a real-time implementation. The baffle is shown to reduce the performance of the array at frequencies greater than around 1 kHz for the geometry considered here, but the directivity of the individual drivers then enhances the performance at higher frequencies. The effects of implementing the optimal filters in the time domain for a real-time system are also investigated.
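The free-field monopole model the abstract describes can be used to evaluate the acoustic contrast between a "bright" and a "dark" zone. The sketch below uses an illustrative two-source geometry and driving weights, not the paper's optimized ones:

```python
import cmath
import math

def pressure(src_pos, weights, pt, k):
    """Free-field monopole sum: p = sum_m w_m * exp(-j k r_m) / (4 pi r_m)."""
    p = 0.0 + 0.0j
    for (xs, ys), w in zip(src_pos, weights):
        r = math.hypot(pt[0] - xs, pt[1] - ys)
        p += w * cmath.exp(-1j * k * r) / (4.0 * math.pi * r)
    return p

def contrast_db(src_pos, weights, bright, dark, f, c=343.0):
    """Acoustic contrast: mean-square pressure in the bright zone over
    that in the dark zone, expressed in dB."""
    k = 2.0 * math.pi * f / c
    eb = sum(abs(pressure(src_pos, weights, p, k)) ** 2 for p in bright) / len(bright)
    ed = sum(abs(pressure(src_pos, weights, p, k)) ** 2 for p in dark) / len(dark)
    return 10.0 * math.log10(eb / ed)

# Illustrative geometry: two closely spaced sources driven out of phase
# favour an on-axis bright point over an off-axis dark point.
sources = [(-0.05, 0.0), (0.05, 0.0)]
weights = [1.0, -1.0]
bright = [(0.5, 0.0)]
dark = [(0.1, 0.5)]       # slightly off the symmetry axis
c_db = contrast_db(sources, weights, bright, dark, 1000.0)
```

Acoustic contrast control would choose the complex weights to maximize this ratio over the zones; here the weights are fixed by hand to keep the metric itself in focus.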

35 citations



Journal Article
TL;DR: A frequency-domain pitch-shifting approach based on the invertible constant-Q transform (CQT) is proposed, in which pitch shifting is performed by translating partials along the geometrically spaced frequency axis rather than by spectral stretching as in the Fourier-transform domain.
Abstract: Pitch shifting of polyphonic music is usually performed by manipulating the time-frequency representation of the input signal. Most approaches proposed in the past are based on the Fourier transform although its linear frequency bin spacing is known to be inadequate to some degree for analyzing and processing music signals. Recently invertible constant-Q transforms (CQT) featuring high Q-factors have been proposed exhibiting a more suitable geometrical bin spacing. In this paper a frequency-domain pitch shifting approach based on the CQT is proposed. The CQT is specifically attractive for pitch shifting because it can be implemented by frequency translation (shifting partials along the frequency axis) as opposed to spectral stretching in the Fourier transform domain. Furthermore, the high time resolution of CQT at high frequencies improves transient preservation. Audio examples are provided to illustrate the results achieved with the proposed method.
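Why frequency translation works in the CQT can be seen directly from its geometric bin spacing. A small sketch, with B and f_min as assumed, illustrative parameters:

```python
import math

# With B bins per octave, CQT centre frequencies are geometrically
# spaced: f_k = f_min * 2**(k / B). (B and f_min are assumed values.)
B = 24
f_min = 55.0
centres = [f_min * 2.0 ** (k / B) for k in range(8 * B)]

# Translating coefficients by s bins transposes every partial by the
# same ratio 2**(s / B) -- a pitch shift by pure index shift.
s = 4                                  # 4 / 24 octave = 2 semitones
ratio = 2.0 ** (s / B)
assert all(abs(centres[k + s] / centres[k] - ratio) < 1e-9
           for k in range(len(centres) - s))
```

In the linearly spaced Fourier domain the same transposition requires stretching the spectrum by a frequency-dependent amount, which is why the CQT makes the operation so much simpler.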

Journal Article
TL;DR: In this article, a knowledge-engineered mixing engine is introduced that uses semantic mixing rules and bases mixing decisions on instrument tags as well as elementary, low-level signal features, derived from practical mixing engineering textbooks.
Abstract: In this paper a knowledge-engineered mixing engine is introduced that uses semantic mixing rules and bases mixing decisions on instrument tags as well as elementary, low-level signal features. Mixing rules are derived from practical mixing engineering textbooks. The performance of the system is compared to existing automatic mixing tools as well as human engineers by means of a listening test, and future directions are established.


Journal Article
TL;DR: In this paper, the authors present a listening test to identify the perceptual dimensions associated with the artifacts of focused sources in Wave Field Synthesis, and the influence of the loudspeaker-array length on the two resulting dimensions is evaluated further in a second listening test.
Abstract: Wave Field Synthesis (WFS) allows virtual sound sources to be synthesized that are located between the loudspeaker array and the listener. Such sources are known as focused sources. Due to practical limitations related to real loudspeaker arrays, such as spatial sampling and truncation, there are different artifacts in the synthesized sound field of focused sources. In this paper we present a listening test to identify the perceptual dimensions that are associated with these artifacts. Two main dimensions were found, one describing the amount of perceptual artifacts and the other one describing the localization of the focused source. The influence of the array length on these two dimensions is evaluated further in a second listening test. A binaural model is used to model the perceived location of focused sources found in the second test and to analyze dominant localization cues.



Journal Article
TL;DR: In this article, head motion was recorded both in laboratory judgments of spatial attributes and in natural listening activities such as concerts, movies, and video games; because the statistics of movement were similar in the two settings, the laboratory results were used as the basis of an objective model of spatial listening behavior.
Abstract: Understanding the way in which listeners move their heads must be part of any objective model for evaluating and reproducing the sonic experience of space. Head movement is part of the listening experience because it allows for sensing the spatial distribution of parameters. In the first experiment, the head positions of subjects were recorded while they were asked to evaluate the perceived source location, apparent source width, envelopment, and timbre of synthesized stimuli. Head motion was larger when judging source width than when judging direction or timbre. In the second experiment, head movement was observed in natural listening activities such as concerts, movies, and video games. Because the statistics of movement were similar to those observed in the first experiment, laboratory results can be used as the basis of an objective model of spatial behavior. The results were based on 10 subjects.

Journal Article
TL;DR: In this paper, a generalized overlapping strategy for multiple exponential sweeps is proposed that considers the length of each harmonic impulse response as well as the temporal structure of the desired impulse responses measured in anechoic environments.
Abstract: Measuring the spatial features of sound sources and receivers is typically a time-consuming task, especially when a high spatial resolution is required, as independent measurements have to be conducted for each measured direction. A speed-up in measurement time can be achieved with parallel measurement techniques using arrays of sound sensors or sources. For linear and time-invariant systems, only loose restrictions are imposed on the excitation signal and the measurement method. Nevertheless, when measuring a sound receiver, e.g., a directional microphone, the signals emitted by the multiple sound sources must be separable. Acoustic systems can be treated as linear systems at low input levels. However, at moderate levels, loudspeakers show non-linear behavior that cannot be neglected. To enable parallelized measurements at these levels, the multiple exponential sweep method has recently been introduced to measure the acoustic transfer characteristics of weakly non-linear sound sources using exponential sweeps. This method decreases the measurement time compared to sequential measurements. However, compared to the ideal linear case, the measurement duration is increased due to the harmonic impulse responses that occur. A novel generalized overlapping strategy for these sweeps is proposed that considers the length of each harmonic impulse response and, additionally, the temporal structure of the desired impulse responses measured in anechoic environments. It is shown that the resulting optimized multiple exponential sweep method can yield even shorter measurement times than the original method.
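The exponential sweep underlying the method can be generated from the closed-form expression for a sine whose instantaneous frequency grows exponentially from f1 to f2 (a standard construction; the paper's overlap optimization is not reproduced here):

```python
import math

def exp_sweep(f1, f2, T, fs):
    """Exponential sine sweep from f1 to f2 Hz over T seconds; the
    instantaneous frequency grows as f1 * (f2 / f1)**(t / T)."""
    L = T / math.log(f2 / f1)
    return [math.sin(2.0 * math.pi * f1 * L * (math.exp(n / fs / L) - 1.0))
            for n in range(int(T * fs))]

sweep = exp_sweep(20.0, 20000.0, 2.0, 48000.0)
```

With this excitation, harmonic distortion products deconvolve to impulse responses at distinct negative delays, which is what makes the overlapping strategy for multiple sweeps possible in the first place.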


Journal Article
TL;DR: In this paper, the authors evaluated four state-of-the-art acoustic feedback cancellation systems for hearing aid applications and showed that significant improvements in cancellation performance can be made over traditional systems by allowing small alterations in the loudspeaker signal and a computational complexity increase by a factor of 2–3.
Abstract: In this work we evaluate four state-of-the-art acoustic feedback cancellation systems for hearing aid applications. We show that significant improvements in cancellation performance can be made over traditional systems by allowing small alterations in the loudspeaker signal and a computational complexity increase by a factor of 2–3. The evaluation is based on a listening test and objective assessments of simulation results.

Journal Article
TL;DR: The long-term spectral contours of a large dataset of popular commercial recordings are analyzed for overall, yearly, and genre-specific trends, and a novel method for averaging spectral distributions is proposed that yields results amenable to comparison.
Abstract: In this work, the long-term spectral contours of a large dataset of popular commercial recordings were analyzed. The aim was to analyze overall trends, as well as yearly and genre-specific ones. A novel method for averaging spectral distributions is proposed, which yields results that are amenable to comparison. With it, we found that there is a consistent leaning towards a target equalization curve that stems from practices in the music industry, but also to some extent mimics the natural, acoustic spectra of ensembles.

For as long as spectral analysis has been a viable tool in the commercial sector, audio engineers have looked at integrated spectral responses as possible answers for audio quality. Michael Paul Stavrou [1] states that, while at Abbey Road, he lost endless afternoons hopelessly chasing the elusive hit-song characteristic in technical parameters, and Neil Dorfsman [2] acknowledges that, while many sound engineers would not admit to doing it, he feels that most of them use spectral analysis and comparison to previous or other commercial work as a standard tool during mixing. In the mixing context, "achieving frequency balance (also referred to as tonal balance) is a prime challenge in most mixes" [3]. Bob Katz [4] proposes that the tonal balance of a symphony orchestra is the ideal reference for the spectral distribution of music. Yet there is no consistent academic study that tackles the question of how generally similar the spectral response of critically acclaimed tracks is, nor has anyone analyzed the surrounding factors upon which it depends.

The seminal work in spectrum analysis of musical signals is [5] (in which live signals are used), and it pioneered the 1/3-octave filter-bank analysis process that influenced most early studies of the same type. The musical signals were of individual instruments and ensembles in live rooms. McKnight [6] took a similar approach in the realm of pre-recorded material but was looking for technical correction measures in the distribution format and used a small dataset. The earliest study that is closest to ours is Bauer's [7], where the author looked for the average statistical distribution of a small classical dataset. Moller [8] is the only analysis that tries to track down the yearly evolution of spectra. The BBC [9] researched the spectral content of pop music, using custom recordings made for the purpose of the test, and [10] focused on the effect of the Compact Disc medium on the spectral contour of recordings. Recently, [11] and [12] returned to the subject with a broader dataset, but their analyses focused more on dynamics and panning than frequency response, and their dataset does not follow any objective criteria of popularity. No study relies on a detailed FFT approach as we do, often choosing instead the coarser and more error-prone Real Time Analysis (RTA) filter-bank approach; nor has any of the aforementioned works tackled a really large representative dataset that follows the idea of commercial popularity, and thus a 'best-practices' approach.

For our analysis to be consistent with general public preference, we must run it on a dataset that includes the most commercially relevant songs of the time period of interest. We chose to select songs that had been number ones in either the US or the UK charts, found primarily from [13, 14] and Wikipedia. The Anglo-Saxon bias was considered acceptable as most of the western world's music industry has a very strong Anglo-Saxon influence. The list of all the aforementioned singles can be found at [15], a document which also indicates the songs we were able to use.
Our dataset comprises about half of the singles that have been number one over the last 60 years, with a good representation of both genre and year of production (as there were no pilot tests that would allow an estimation of the ideal sample size, we tried, as is customary, to get the largest possible number of observations). All the songs in our dataset are uncompressed and, while we tried to find un-remastered versions, this was not always possible. This means that we are giving extra prominence to current standards of production, and the differences we present should be even greater than what our data suggest. Table 1 shows the number of songs we had available, divided by decade.

Years        Number of Songs
50s          71
60s          156
70s          129
80s          193
90s          96
After 2000   127
Total        772

Table 1: Number of songs per decade in the dataset.

In Section 2 we will look at the overall average of all the songs in our collection. In Sections 3 and 4 the data will be broken down by year and genre, respectively, and some additional low-level features are introduced to better characterize the differences we are unveiling. Section 5 presents an overview of the present research along with some viable future directions and applications. The aforementioned accompanying website [15] includes more detailed plots, discussion of remastering, and extended numerical data for the results that have been found in this research.

1. OVERALL AVERAGE SPECTRUM OF COMMERCIAL RECORDINGS

Our main analysis focused on the monaural (left + right channel over two) average long-term spectrum of the aforementioned dataset. In order for spectra to be comparable, we first make sure that all songs are sampled at the same frequency (44.1 kHz being the obvious candidate for us, as most works stemmed from CD copies), and that we apply the same window length (4096 samples) to all content, so that the frequency resolution is consistent (≈ 10 Hz).
Let:

\[
X(k,\tau) = \sum_{n=\tau\, w_{\mathrm{len}}}^{(\tau+1)\, w_{\mathrm{len}} - 1} x(n)\, e^{-j 2 \pi k \frac{n}{N}},
\qquad
k \in \left\{0, 1, \ldots, \tfrac{N}{2} - 1\right\},
\qquad
\tau \in \left\{0, 1, \ldots, \left\lfloor \tfrac{x_{\mathrm{len}}}{w_{\mathrm{len}}} \right\rfloor \right\},
\]

where N = wlen = 4096 is the window length and xlen is the length of the signal.
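A direct, if naive, transcription of this frame-wise DFT and its long-term magnitude average is sketched below; a tiny window length is used purely to keep the illustration fast, whereas the paper uses wlen = 4096:

```python
import cmath
import math

def avg_spectrum(x, wlen):
    """Long-term average magnitude spectrum: frame-wise DFTs X(k, tau),
    averaged over all complete frames tau. The frame index n is local
    here, which leaves magnitudes (the quantity averaged) unchanged."""
    nframes = len(x) // wlen
    bins = wlen // 2
    avg = [0.0] * bins
    for t in range(nframes):
        frame = x[t * wlen:(t + 1) * wlen]
        for k in range(bins):
            X = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / wlen)
                    for n in range(wlen))
            avg[k] += abs(X) / nframes
    return avg

# An 8 Hz sine at fs = 64 Hz with wlen = 16 puts all its energy in
# bin k = 2 (bin spacing fs / wlen = 4 Hz).
fs, wlen = 64, 16
sig = [math.sin(2.0 * math.pi * 8.0 * n / fs) for n in range(4 * wlen)]
spec = avg_spectrum(sig, wlen)
```

A production version would of course use an FFT and a window function; this sketch only mirrors the equation term by term.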


Journal Article
Abstract: This work has been partially funded by the Spanish Ministerio de Economía (TEC2009-13741 and TEC2012-38142-C04-01) and the EU through FEDER funds, Generalitat Valenciana PROMETEO 2009/2013, and Universitat Politècnica de València through Programa de Apoyo a la Investigación y Desarrollo (PAID-05-11 and PAID-05-12).


Journal Article
TL;DR: In this paper, independent influences of interchannel level difference (ICLD) and interchannel time difference (ICTD) on the panning of 2-channel stereo phantom images for various musical sources were investigated.
Abstract: This study investigates the independent influences of interchannel level difference (ICLD) and interchannel time difference (ICTD) on the panning of two-channel stereo phantom images for various musical sources. The results indicate that level panning performs robustly regardless of the spectral and temporal characteristics of the source signals, whereas time panning is not suitable for a continuous source with a high fundamental frequency. Statistical differences between the data obtained for different sources are found to be insignificant, and from this a unified set of ICLD and ICTD values for 10°, 20°, and 30° image positions is derived. Linear level- and time-panning functions for the two separate panning regions of 0°–20° and 21°–30° are further proposed, and their applicability to arbitrary loudspeaker base angles is also considered. These perceptual panning functions are expected to be more accurate than the theoretical sine or tangent law in terms of matching between predicted and actually perceived image positions.
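For comparison, the theoretical tangent law mentioned in the abstract maps a target image angle to a pair of constant-power channel gains; the paper's perceptual panning functions would replace this mapping. A minimal sketch:

```python
import math

def tangent_law_gains(theta_deg, base_deg=30.0):
    """Constant-power stereo gains from the tangent law:
    tan(theta) / tan(theta0) = (gL - gR) / (gL + gR),
    for a loudspeaker base half-angle theta0."""
    t = math.tan(math.radians(theta_deg)) / math.tan(math.radians(base_deg))
    gl, gr = 1.0 + t, 1.0 - t
    norm = math.hypot(gl, gr)            # normalise to gl^2 + gr^2 = 1
    return gl / norm, gr / norm

gl, gr = tangent_law_gains(10.0)         # a 10 degree phantom image
```

At 0° the gains are equal, and at the full base angle the signal collapses entirely into one channel; the paper's point is that perceived positions deviate from this idealized prediction.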


Journal Article
TL;DR: This paper explores the particular test case of the timpani drum using the finite difference time domain (FDTD) method, which is suitable for parallelization, and examines performance on the Nvidia Tesla architecture, simulation results and sound examples.
Abstract: Powerful parallel hardware, such as general purpose graphical processing units (GPGPUs), is now becoming a viable tool in audio engineering research, with great potential in applications involving heavy computational loads. One particular area of interest is large-scale, audio-rate physical modeling synthesis of instruments embedded in a full 3D environment. This paper explores the particular test case of the timpani drum using the finite difference time domain (FDTD) method, which is suitable for parallelization. The constituents of the timpani model, namely a membrane and a cavity, are introduced, as well as the coupling conditions to the acoustic field. FDTD methods are then developed, with special attention paid to implementation issues in parallel hardware. An analysis of performance on the Nvidia Tesla architecture, simulation results, and sound examples are presented.
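The core FDTD membrane update is a simple stencil, which is why the scheme parallelizes so well: every interior grid point can be updated independently per time step (one GPU thread per point). Below is a scalar sketch of the ideal-membrane scheme only, not the paper's full coupled timpani model:

```python
def fdtd_membrane(nx=20, ny=20, steps=50, lam=0.5):
    """Leapfrog FDTD update for an ideal square membrane with fixed
    edges: u_next = 2u - u_prev + lam^2 * (discrete Laplacian of u).
    Stable in 2D for the Courant number lam = c*dt/dx <= 1/sqrt(2)."""
    u = [[0.0] * ny for _ in range(nx)]
    up = [[0.0] * ny for _ in range(nx)]
    u[nx // 2][ny // 2] = 1.0            # initial "strike" at the centre
    for _ in range(steps):
        un = [[0.0] * ny for _ in range(nx)]
        for i in range(1, nx - 1):       # each (i, j) update is independent:
            for j in range(1, ny - 1):   # this is the parallelizable kernel
                lap = (u[i + 1][j] + u[i - 1][j] + u[i][j + 1]
                       + u[i][j - 1] - 4.0 * u[i][j])
                un[i][j] = 2.0 * u[i][j] - up[i][j] + lam * lam * lap
        up, u = u, un
    return u

state = fdtd_membrane()
```

On a GPU, the two inner loops become the thread grid; the data dependence is only between successive time steps, never within one.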

Journal Article
TL;DR: This paper compares several precedence models and their influence on the performance of a baseline separation algorithm and finds the one that was based on interaural coherence and onset-based inhibition produced the greatest performance improvement.
Abstract: Reverberation is a problem for source separation algorithms. Because the precedence effect allows human listeners to suppress the perception of reflections arising from room boundaries, numerous computational models have incorporated it. However, relatively little work has been done on using the precedence effect in source separation algorithms. This paper compares several precedence models and their influence on the performance of a baseline separation algorithm. The models were tested in a variety of reverberant rooms and with a range of mixing parameters. Although there was a large difference in performance among the models, the one that was based on interaural coherence and onset-based inhibition produced the greatest performance improvement. There is a trade-off between selecting reliable cues that correspond closely to free-field conditions and maximizing the proportion of the input signals that contributes to localization. For optimal source separation performance, it is necessary to adapt the dynamic component of the precedence model to the acoustic conditions of the room.
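Interaural coherence, one cue behind the best-performing model, can be estimated as the maximum of the normalized cross-correlation between the ear signals over a range of lags; frames with high coherence are treated as reliable, free-field-like cues. A sketch, not the exact model compared in the paper:

```python
import math

def interaural_coherence(left, right, max_lag=16):
    """Interaural coherence as the maximum normalised cross-correlation
    over a range of integer lags (a common cue-selection feature in
    precedence models; an illustrative version, not the paper's)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            l, r = left[lag:], right[:len(right) - lag]
        else:
            l, r = left[:len(left) + lag], right[-lag:]
        num = dot(l, r)
        den = (dot(l, l) * dot(r, r)) ** 0.5 or 1e-12
        best = max(best, num / den)
    return best

sig = [math.sin(0.1 * n) for n in range(256)]
ic_same = interaural_coherence(sig, sig)   # identical ear signals
```

Diffuse reverberation decorrelates the two ears, so thresholding this value implements exactly the trade-off the abstract describes: stricter thresholds select cleaner cues but discard more of the signal.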

Journal Article
TL;DR: The approach of so-called "Port-Hamiltonian Systems" (PHS) is considered; this framework naturally preserves the energetic behavior of elementary components and the power exchanges between them, which guarantees the passivity of the simulations.
Abstract: Several methods are available to simulate electronic circuits. However, for nonlinear circuits, the stability guarantee is not straightforward. In this paper, the approach of so-called "Port-Hamiltonian Systems" (PHS) is considered. This framework naturally preserves the energetic behavior of elementary components and the power exchanges between them, which guarantees the passivity of the simulations.
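The energy-preserving character of the PHS framework can be illustrated on a toy lossless system: for a linear oscillator written in Hamiltonian form, the implicit midpoint rule conserves the stored energy exactly, which is the discrete analogue of the passivity guarantee. A sketch with assumed, illustrative values (not a circuit from the paper):

```python
def phs_midpoint_steps(q0, p0, dt, n):
    """Lossless Hamiltonian toy system (an LC-like oscillator):
    d/dt [q, p] = J * grad H, with J = [[0, 1], [-1, 0]] and
    H = (q^2 + p^2) / 2. The implicit midpoint rule, solved here in
    closed form for this linear system, conserves H exactly."""
    q, p = q0, p0
    states = [(q, p)]
    for _ in range(n):
        d = 1.0 + (dt / 2.0) ** 2
        qn = ((1.0 - (dt / 2.0) ** 2) * q + dt * p) / d
        pn = ((1.0 - (dt / 2.0) ** 2) * p - dt * q) / d
        q, p = qn, pn
        states.append((q, p))
    return states

H = lambda q, p: 0.5 * (q * q + p * p)
traj = phs_midpoint_steps(1.0, 0.0, 0.1, 100)
```

With dissipative components the same structure makes the stored energy non-increasing rather than constant, which is what rules out numerical blow-up for nonlinear circuits.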