
Nonnegative Matrix Factorization

Abstract
Matrix factorization or factor analysis is an important task that is helpful in the analysis of high-dimensional real-world data. SVD is a classical method for matrix factorization, which gives the optimal low-rank approximation to a real-valued matrix in terms of the squared error. Many application areas, including information retrieval, pattern recognition, and data mining, require processing of binary rather than real data.



Musical Audio Source Separation Based on User-Selected F0 Track

Jean-Louis Durrieu and Jean-Philippe Thiran
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Signal Processing Laboratory (LTS5)
Switzerland
firstname.lastname@epfl.ch
Abstract. A system for user-guided audio source separation is presented in this article. Following previous works on time-frequency music representations, the proposed User Interface allows the user to select the desired audio source, by means of the assumed fundamental frequency (F0) track of that source. The system then automatically refines the selected F0 tracks, estimates and separates the corresponding source from the mixture. The interface was tested and the separation results compare positively to the results of a fully automatic system, showing that the F0 track selection improves the separation performance.

Keywords: User-guided Audio Source Separation, Graphical User Interface, Non-negative Matrix Factorization
1 INTRODUCTION
Most audio signals are mixtures of different sources, such as a speaker, an instrument, or noise. Applications such as speech enhancement or musical remixing require the identification and the extraction of one such source from the others.
While many existing musical source separation algorithms aim at blindly
separating all the different instruments, the aim of the proposed system is to
separate the source defined by the user. Let {x_t}_{t=1...T} be a single-channel mixture signal of duration T. Let {v_t}_t and {m_{r,t}}_t respectively be the mono signals of the source of interest, usually a singing voice, and of the R remaining sources, i.e. the musical accompaniment. These signals are mixed such that:

    x_t = v_t + Σ_{r=1}^{R} m_{r,t}    (1)
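As a minimal numerical illustration of the mixing model (1), with arbitrary placeholder signals standing in for the voice and the accompaniment:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 44100                                            # 1 s at 44.1 kHz
v = np.sin(2 * np.pi * 440 * np.arange(T) / 44100)   # source of interest (toy sinusoid)
R = 2
m = 0.1 * rng.standard_normal((R, T))                # R accompaniment signals (toy noise)

# Single-channel mixture: x_t = v_t + sum_r m_{r,t}
x = v + m.sum(axis=0)
```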
The task at hand is to estimate the signal of interest v_t, given user-provided information on the corresponding source. We propose a separation system that
J.-L. Durrieu; e-mail: jean-louis.durrieu@epfl.ch
This work was funded by the Swiss CTI agency, project n. 11359.1 PFES-ES, in collaboration with SpeedLingua SA, Geneva, Switzerland.

allows the user to choose the source in an intuitive way, thanks to a representation
of the polyphonic pitch content of the audio excerpt. The system was tested by
several users on a SiSEC 2011 [10] data set, and the contribution of the users is
shown to improve the separation performance compared to the automatic system
in [3].
This paper is organized as follows. The relevance of user-guided source separation is first discussed, followed by the presentation of the proposed Graphical User Interface (GUI). The underlying signal model and representation, and the algorithm for source separation, mostly derived from the authors' previous works [3], are then briefly stated. The separation guided by the users is thereafter discussed and compared with the automatic separation system. Finally, we conclude with perspectives for the proposed system and concept.
2 User-Guided Source Separation
2.1 Related Works
Audio source separation methods essentially mimic auditory abilities: a human
being can focus on the individual instruments of a mixture thanks to their locations, energies, pitch ranges or timbres. With multi-channel signals, such as
stereo signals, one can infer spatial information [2], or train models to extract
specific sources, even with single-channel signals [1].
The user can be required to provide some meta-information, such as the
instrument name in a supervised framework [13], a musical score [5], the time
intervals of activity for each instrument [7] or a sung target sound [11]. Musical
scores or correct singing are however difficult to acquire, and are often not aligned
with the mixture signal.
Expert users can be asked to choose the desired source through its position [14], or by selecting components that are played by the desired instrument, thanks to intermediate separation results [15]. In [8], the automatically estimated melody line can be corrected by the user.
2.2 F0-guided Musical Source Separation
For musical audio excerpts, in particular for vocal sources, many studies have
shown the relevance of the fundamental frequency (F0) contours. In [5], the
authors use the music score to extract the notes, which helps estimate the actual F0 line of the instrument to remove. In [9], an estimated F0 contour is
used to separate the corresponding instrument.
The goal of this work is to study to what extent user input can improve the
separation of a specific source. Indeed, automatic separation based on F0 contours raises several ill-posed issues. First, with many interfering
sources, it is difficult to automatically decide whether a specific source is present
or not. Furthermore, octave and other harmonic-related confusions in the F0
representation can lead to erroneous separations. These errors may easily be
corrected by a trained user who uses the context to solve these ambiguities.

3 Graphical User Interface
3.1 Ergonomic issues
Allowing the user to dynamically choose the desired source requires a representation that clearly displays the possible choices. The waveform does not make it possible to locate, in time and in frequency, sources that overlap in time. Time-frequency representations (TFR), such as the short-term Fourier transform (STFT), are therefore required to visually identify such sources. With time on the x-axis and frequency on the y-axis, the sinusoids (horizontal lines) or the noises (vertical patterns) corresponding to the desired source easily stand out. Such an approach would however require a significant amount of work, and would not scale well.
Harmonic sources exhibit a characteristic graphical pattern, in the STFT, for each F0: the system in [3] identifies these patterns and provides the energy of the different F0s for each signal frame. From such a representation, the user can select the desired source thanks to its melody line, with little effort.
Furthermore, representing the pitch on the Western musical scale is a visualization that many users can understand. For instance, in [6], Klapuri proposes such a "piano-roll" visualization.
In this article, the mid-level representation introduced in [3] was chosen,
because it is easy to configure so as to look like a piano-roll. The method however
relies on a fixed dictionary of harmonic spectral shapes, and the proposed system
is therefore better suited for the separation of corresponding sources, such as
wind instruments, voice or bowed string instruments.
3.2 Practical solutions
Using Python/NumPy, with the Matplotlib and PyQt4 modules [12], it was
possible to design a GUI taking advantage of the representation in [3].
A screen capture of the proposed GUI application is shown in Fig. 1, with the
following elements: (1) specify the audio file and the output folder, (2) parameter
controls, for the analysis window length, the minimum and maximum candidate
F0, (3) a button to “load the file” (computing the decomposition of Sect. 4.1),
(4) the waveform of the audio file, (5) the energies for each frame and for each
F0 candidate, on which the user can select the melody F0 track (time on x-axis
and F0 on y-axis), (6) a toolbar, for zooming and exploring, (7) a representation
(musical staff) to indicate the corresponding F0s or notes, (8) normalization
choices for the image, (9) buttons toggling between selection (“Lead”) and de-
selection (“Delete”), plus a field to choose the vertical extent of the selection
(in semitones), (10) "Separate" and "Separate (Auto)" buttons to launch the separation with or without the user-selected track, respectively.
The user can select a region on (5) and thus identify it as a desired F0 range.
Once she is finished with her choice, she can start the separation with one of
the “Separate” buttons. The underlying mechanisms are further explained in the
following section.

Fig. 1. GUI for selecting the desired F0 track.
4 F0 Representation and Separation Algorithm
The audio signal model presented in [3] is first briefly described. The computation
of the F0 representation is then discussed, and at last the user-assisted separation
algorithm of the selected source is presented.
4.1 Audio Signal Model
The audio mixture is modelled through its F × N short-term power spectrum
(STPS) matrix S, defined as the power of its STFT X, with F the number of
Fourier frequencies and N the number of frames. For simplicity, the model is
presented for the single-channel case, but the stereo model of [3] was used for
the experiments of this article.
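The STPS computation can be sketched as follows; the Hann window, the 2048-sample frames and the hop size are illustrative choices, not necessarily those of [3]:

```python
import numpy as np

def stps(x, n_fft=2048, hop=256):
    """Short-term power spectrum: F x N matrix of |STFT|^2."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)], axis=1)
    X = np.fft.rfft(frames, axis=0)      # F = n_fft // 2 + 1 Fourier frequencies
    return np.abs(X) ** 2                # power spectrum, element-wise

x = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100.0)   # toy 1 s signal
S = stps(x)                              # F x N nonnegative matrix
```

With n_fft = 2048 this yields F = 1025 frequency bins, matching the dimensions used in the paper.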
S is assumed to be the sum of the STPS of the signal of interest, S_V, and the residual STPS S_M:

    S = S_V + S_M    (2)
S_V is the element-wise product of a "source" part (F0) by a "filter" part (Φ):

    S_V = S_Φ ⊙ S_F0    (3)
All the contributions S_Φ, S_F0 and S_M are further modelled as non-negative matrix products of a spectral shape matrix (W_Φ, W_F0 and W_M, with K, U and R elementary shapes, respectively) by the corresponding amplitude matrix (H_Φ, H_F0 and H_M). Finally:

    S = (W_Φ H_Φ) ⊙ (W_F0 H_F0) + W_M H_M    (4)
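The structure of (4) can be sketched with random placeholder factors; the dimensions F, N, K, U and R below are illustrative values, not estimates from a real signal:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 1025, 100            # Fourier frequencies x frames
K, U, R = 4, 577, 40        # elementary shapes per factor (illustrative)

W_phi, H_phi = rng.random((F, K)), rng.random((K, N))   # filter part
W_f0,  H_f0  = rng.random((F, U)), rng.random((U, N))   # source (F0) part
W_m,   H_m   = rng.random((F, R)), rng.random((R, N))   # residual part

# Eq. (4): element-wise (Hadamard) product of the filter and source parts,
# plus the accompaniment term.
S_v = (W_phi @ H_phi) * (W_f0 @ H_f0)
S = S_v + W_m @ H_m
```

All factors being nonnegative, the resulting model matrix S is nonnegative as well.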
In (4), all the parameters of the right-hand side are estimated on the signal, except the matrix W_F0, which is a dictionary of harmonic spectral "combs", parameterized by their F0 frequencies. As discussed in [3], a careful choice of the F0s used in that dictionary leads to the desired representation in H_F0: in our case, we chose log2-spaced F0 values, i.e. a scale proportional to the Western musical scale. The number of F0s per semitone is fixed to 16, and the user can choose the extents of the scale, to fit the expected tessitura.
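The log2-spaced F0 grid (16 F0s per semitone) can be built as follows; the 100–800 Hz range and the Gaussian harmonic-comb shapes are illustrative simplifications, not the exact dictionary of [3]:

```python
import numpy as np

fs, n_fft = 44100, 2048
f_min, f_max, per_semitone = 100.0, 800.0, 16   # illustrative tessitura

# log2-spaced F0 candidates: `per_semitone` values per semitone, 12 semitones/octave
n_semitones = 12 * np.log2(f_max / f_min)
U = int(np.floor(n_semitones * per_semitone)) + 1
f0s = f_min * 2 ** (np.arange(U) / (12 * per_semitone))

# Simplified harmonic comb for one F0: Gaussian bumps at the harmonic frequencies
freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
def comb(f0, n_harmonics=20, width=20.0):
    h = np.arange(1, n_harmonics + 1)[:, None] * f0
    return np.exp(-0.5 * ((freqs[None, :] - h) / width) ** 2).sum(axis=0)

W_f0 = np.stack([comb(f) for f in f0s], axis=1)   # F x U dictionary
```

With these settings the grid spans three octaves and contains U = 577 F0 candidates.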
The other parameters are estimated thanks to the Non-negative Matrix Factorization (NMF) algorithm developed in [3]. The resulting matrix H_F0 finally provides the user with an image in which high values correspond to high energies associated with F0 frequencies, as shown in Fig. 1.
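The full source/filter algorithm of [3] is beyond this sketch; as an illustration of the general principle only, here is a plain NMF S ≈ WH with standard multiplicative updates for the Kullback-Leibler divergence:

```python
import numpy as np

def nmf(S, U, n_iter=25, seed=0):
    """Plain NMF with multiplicative KL-divergence updates, S ~= W @ H."""
    rng = np.random.default_rng(seed)
    F, N = S.shape
    W = rng.random((F, U)) + 1e-3
    H = rng.random((U, N)) + 1e-3
    eps = 1e-12
    ones = np.ones_like(S)
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative throughout
        H *= (W.T @ (S / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((S / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H

S_demo = np.random.default_rng(1).random((64, 40)) + 0.1   # toy nonnegative data
W, H = nmf(S_demo, U=8)
```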
4.2 F0 line selection and usage
The user can then, through the GUI of Fig. 1, select the zones containing the F0 values that correspond to the desired melody. A binary mask matrix H, of the same size as H_F0, initialized to 0 everywhere, is updated each time the user draws a curve with the mouse (while holding the left button) over the H_F0 image. All the coefficients along that curve, as well as the coefficients located within a user-defined vertical extent (half a semitone by default), are set to 1. The program superimposes the contour of the selection on the H_F0 image.
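The mask update can be sketched as follows; representing the mouse trajectory as a list of (frame, F0-bin) points is an assumption about how the GUI discretizes the drawn curve:

```python
import numpy as np

U, N = 577, 200
H_mask = np.zeros((U, N))            # binary selection mask, same size as H_F0

def select_curve(mask, curve, extent_semitones=0.5, bins_per_semitone=16):
    """Set mask to 1 along `curve` (list of (frame, f0_bin) points),
    plus a vertical band of +/- extent_semitones around each point."""
    half = int(round(extent_semitones * bins_per_semitone))
    n_bins = mask.shape[0]
    for frame, f0_bin in curve:
        lo, hi = max(0, f0_bin - half), min(n_bins, f0_bin + half + 1)
        mask[lo:hi, frame] = 1.0
    return mask

curve = [(n, 100 + (n % 5)) for n in range(50)]   # toy mouse trajectory
H_mask = select_curve(H_mask, curve)
```

A "Delete" tool would do the converse, zeroing the same band.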
Once all the desired tracks have been selected, the user can trigger the separation, given her mask H. Let H̃_F0 = H ⊙ H_F0. Assuming the desired source generates smooth melody lines, the melody path is then tracked in H̃_F0 with a Viterbi algorithm [4]: the user-defined regions are therefore used to restrict the melody tracking. The user can also refine the chosen regions with a narrower vertical extent, effectively allowing non-smooth melodies if needed.
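A minimal Viterbi-style tracker over the masked energy matrix can look like this; the additive penalty on inter-frame F0 jumps is an illustrative transition model, not the exact one of [4]:

```python
import numpy as np

def viterbi_melody(E, jump_penalty=0.1):
    """Track a smooth melody path through an F0-energy matrix E (U x N):
    maximize the sum of log-energies minus a penalty on F0 jumps."""
    U, N = E.shape
    logE = np.log(E + 1e-12)
    cost = logE[:, 0].copy()
    back = np.zeros((U, N), dtype=int)
    # penalty[i, j]: cost of jumping from bin j to bin i between frames
    penalty = -jump_penalty * np.abs(np.arange(U)[:, None] - np.arange(U)[None, :])
    for n in range(1, N):
        trans = cost[None, :] + penalty          # best predecessor for each bin
        back[:, n] = np.argmax(trans, axis=1)
        cost = trans[np.arange(U), back[:, n]] + logE[:, n]
    path = np.empty(N, dtype=int)
    path[-1] = int(np.argmax(cost))
    for n in range(N - 1, 0, -1):                # backtrack the optimal path
        path[n - 1] = back[path[n], n]
    return path

E = np.full((20, 10), 1e-3)                      # toy masked energy matrix
E[5, :] = 1.0                                    # one clear melody ridge
melody = viterbi_melody(E)
```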
Finally, the smoothed-out melody line is used to create a refined version of H̃_F0, zeroing coefficients lying too far from the melody. The parameters are then re-estimated, using H̃_F0 as the initial H_F0 matrix. These updated parameters {H_F0, W_Φ, H_Φ, W_M, H_M} are used to compute the separated sources. This second estimation round focuses on voiced patterns, and a third round is done to include more unvoiced elements [3].
4.3 Separating the Selected Source
Wiener filters are used to separate the sources, obtaining the estimates of the STFTs V and M, using [3]:

    V̂ = [ (W_Φ H_Φ ⊙ W_F0 H_F0) / (W_Φ H_Φ ⊙ W_F0 H_F0 + W_M H_M) ] ⊙ X  and  M̂ = X − V̂    (5)

where the division is element-wise.
The time-domain signals are then retrieved using an inverse STFT (overlap-add
procedure).
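The Wiener filtering step (5) reduces to an element-wise soft mask applied to the mixture STFT; the matrices below are random placeholders standing in for the estimated model terms:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 1025, 100
S_v = rng.random((F, N))         # placeholder for (W_Phi H_Phi) * (W_F0 H_F0)
S_m = rng.random((F, N))         # placeholder for W_M H_M
X = rng.random((F, N)) + 1j * rng.random((F, N))   # placeholder mixture STFT

mask = S_v / (S_v + S_m)         # element-wise Wiener gain in [0, 1]
V_hat = mask * X                 # estimated source STFT, eq. (5)
M_hat = X - V_hat                # estimated residual STFT
```

Both estimates sum back exactly to the mixture STFT, so the separation is conservative by construction.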
