
Nonnegative Matrix Factorization

Abstract
Matrix factorization or factor analysis is an important task that is helpful in the analysis of high-dimensional real-world data. SVD is a classical method for matrix factorization, which gives the optimal low-rank approximation to a real-valued matrix in terms of the squared error. Many application areas, including information retrieval, pattern recognition, and data mining, require processing of binary rather than real data.



Musical Audio Source Separation Based on User-Selected F0 Track

Jean-Louis Durrieu and Jean-Philippe Thiran
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Signal Processing Laboratory (LTS5)
Switzerland
firstname.lastname@epfl.ch
Abstract. A system for user-guided audio source separation is presented in this article. Following previous works on time-frequency music representations, the proposed User Interface allows the user to select the desired audio source, by means of the assumed fundamental frequency (F0) track of that source. The system then automatically refines the selected F0 tracks, estimates and separates the corresponding source from the mixture. The interface was tested and the separation results compare positively to the results of a fully automatic system, showing that the F0 track selection improves the separation performance.

Keywords: User-guided Audio Source Separation, Graphical User Interface, Non-negative Matrix Factorization
1 INTRODUCTION
Most audio signals are mixtures of different sources, such as a speaker, an instrument, or noise. Applications such as speech enhancement or musical remixing require the identification and the extraction of one such source from the others.
While many existing musical source separation algorithms aim at blindly
separating all the different instruments, the aim of the proposed system is to
separate the source defined by the user. Let {x_t}_{t=1...T} be a single-channel mixture signal of duration T. Let {v_t}_t and {m_{r,t}}_t respectively be the mono signals of the source of interest, usually a singing voice, and of the R remaining sources, i.e. the musical accompaniment. These signals are mixed such that:

    x_t = v_t + Σ_{r=1}^{R} m_{r,t}    (1)
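As a minimal numerical illustration of the mixing model (1), with arbitrary placeholder signals standing in for the voice and the accompaniment:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 44100                                            # 1 s at 44.1 kHz
v = np.sin(2 * np.pi * 440 * np.arange(T) / 44100)   # source of interest (toy sinusoid)
R = 2
m = 0.1 * rng.standard_normal((R, T))                # R accompaniment signals (toy noise)

# Single-channel mixture: x_t = v_t + sum_r m_{r,t}
x = v + m.sum(axis=0)
```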
The task at hand is to estimate the signal of interest v_t, given user-provided information on the corresponding source. We propose a separation system that
J.-L. Durrieu; e-mail: jean-louis.durrieu@epfl.ch
This work was funded by the Swiss CTI agency, project n. 11359.1 PFES-ES, in collaboration with SpeedLingua SA, Geneva, Switzerland.

allows the user to choose the source in an intuitive way, thanks to a representation
of the polyphonic pitch content of the audio excerpt. The system was tested by
several users on a SiSEC 2011 [10] data set, and the contribution of the users is
shown to improve the separation performance compared to the automatic system
in [3].
This paper is organized as follows. The relevance of user-guided source separation is first discussed, followed by the presentation of the proposed Graphical User Interface (GUI). The underlying signal model and representation, and the algorithm for source separation, mostly derived from the authors' previous works [3], are then briefly stated. The separation guided by the users is thereafter discussed and compared with the automatic separation system. Finally, we conclude with perspectives for the proposed system and concept.
2 User-Guided Source Separation
2.1 Related Works
Audio source separation methods essentially mimic auditory abilities: a human
being can focus on the individual instruments of a mixture thanks to their locations, energies, pitch ranges or timbres. With multi-channel signals, such as
stereo signals, one can infer spatial information [2], or train models to extract
specific sources, even with single-channel signals [1].
The user can be required to provide some meta-information, such as the
instrument name in a supervised framework [13], a musical score [5], the time
intervals of activity for each instrument [7] or a sung target sound [11]. Musical
scores or correct singing are however difficult to acquire, and are often not aligned
with the mixture signal.
Expert users can be asked to choose the desired source through its position [14], or by selecting components that are played by the desired instrument, thanks to intermediate separation results [15]. In [8], the automatically estimated melody line can be corrected by the user.
2.2 F0-guided Musical Source Separation
For musical audio excerpts, in particular for vocal sources, many studies have
shown the relevance of the fundamental frequency (F0) contours. In [5], the
authors use the music score to extract the notes, which helps estimate the actual F0 line of the instrument to remove. In [9], an estimated F0 contour is
used to separate the corresponding instrument.
The goal of this work is to study to what extent user input can improve the
separation of a specific source. Indeed, automatic separation based on F0 contours raises several ill-posed issues. First, with many interfering
sources, it is difficult to automatically decide whether a specific source is present
or not. Furthermore, octave and other harmonic-related confusions in the F0
representation can lead to erroneous separations. These errors may easily be
corrected by a trained user who uses the context to solve these ambiguities.

3 Graphical User Interface
3.1 Ergonomic issues
Allowing the user to dynamically choose the desired source requires a representation that clearly displays the possible choices. The waveform does not make it possible to locate, in time and in frequency, sources that overlap in time. Time-frequency representations (TFR), such as the short-term Fourier transform (STFT), are therefore required to visually identify such sources. With time on the x-axis and frequency on the y-axis, the sinusoids (horizontal lines) or the noises (vertical patterns) corresponding to the desired source easily stand out. Such an approach would however require a significant amount of work, and would not scale well.
Harmonic sources exhibit a characteristic graphical pattern, in the STFT, for each F0: the system in [3] identifies these patterns and provides the energy of the different F0s for each signal frame. From such a representation, the user can select the desired source thanks to its melody line, with little effort.
Furthermore, representing the pitch on the Western musical scale is a visualization that many users can understand. For instance, in [6], Klapuri proposes such a "piano-roll" visualization.
In this article, the mid-level representation introduced in [3] was chosen,
because it is easy to configure so as to look like a piano-roll. The method however
relies on a fixed dictionary of harmonic spectral shapes, and the proposed system
is therefore better suited for the separation of corresponding sources, such as
wind instruments, voice or bowed string instruments.
3.2 Practical solutions
Using Python/NumPy, with the Matplotlib and PyQt4 modules [12], it was
possible to design a GUI taking advantage of the representation in [3].
A screen capture of the proposed GUI application is shown in Fig. 1, with the
following elements: (1) specify the audio file and the output folder, (2) parameter
controls, for the analysis window length, the minimum and maximum candidate
F0, (3) a button to “load the file” (computing the decomposition of Sect. 4.1),
(4) the waveform of the audio file, (5) the energies for each frame and for each
F0 candidate, on which the user can select the melody F0 track (time on x-axis
and F0 on y-axis), (6) a toolbar, for zooming and exploring, (7) a representation
(musical staff) to indicate the corresponding F0s or notes, (8) normalization
choices for the image, (9) buttons toggling between selection (“Lead”) and de-
selection (“Delete”), plus a field to choose the vertical extent of the selection
(in semitones), (10) "Separate" and "Separate (Auto)" buttons to launch the separation with or without the user-selected track, respectively.
The user can select a region on (5) and thus identify it as a desired F0 range.
Once she is finished with her choice, she can start the separation with one of
the “Separate” buttons. The underlying mechanisms are further explained in the
following section.

Fig. 1. GUI for selecting the desired F0 track.
4 F0 Representation and Separation Algorithm
The audio signal model presented in [3] is first briefly described. The computation
of the F0 representation is then discussed, and at last the user-assisted separation
algorithm of the selected source is presented.
4.1 Audio Signal Model
The audio mixture is modelled through its F × N short-term power spectrum
(STPS) matrix S, defined as the power of its STFT X, with F the number of
Fourier frequencies and N the number of frames. For simplicity, the model is
presented for the single-channel case, but the stereo model of [3] was used for
the experiments of this article.
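The STPS computation can be sketched as follows; the Hann window, the 2048-sample frames and the hop size are illustrative choices, not necessarily those of [3]:

```python
import numpy as np

def stps(x, n_fft=2048, hop=256):
    """Short-term power spectrum: F x N matrix of |STFT|^2."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win
                       for i in range(n_frames)], axis=1)
    X = np.fft.rfft(frames, axis=0)      # F = n_fft // 2 + 1 Fourier frequencies
    return np.abs(X) ** 2                # power spectrum, element-wise

x = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100.0)   # toy 1 s signal
S = stps(x)                              # F x N nonnegative matrix
```

With n_fft = 2048 this yields F = 1025 frequency bins, matching the dimensions used in the paper.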
S is assumed to be the sum of the STPS of the signal of interest, S_V, and the residual STPS S_M:

    S = S_V + S_M    (2)
S_V is the element-wise product of a "source" part (F0) by a "filter" part (Φ):

    S_V = S_Φ ⊙ S_F0    (3)
All the contributions S_Φ, S_F0 and S_M are further modelled as non-negative matrix products of a spectral shape matrix (W_Φ, W_F0 and W_M, with K, U and R elementary shapes, respectively) by the corresponding amplitude matrix (H_Φ, H_F0 and H_M). Finally:

    S = (W_Φ H_Φ) ⊙ (W_F0 H_F0) + W_M H_M    (4)
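The structure of (4) can be sketched with random placeholder factors; the dimensions F, N, K, U and R below are illustrative values, not estimates from a real signal:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 1025, 100            # Fourier frequencies x frames
K, U, R = 4, 577, 40        # elementary shapes per factor (illustrative)

W_phi, H_phi = rng.random((F, K)), rng.random((K, N))   # filter part
W_f0,  H_f0  = rng.random((F, U)), rng.random((U, N))   # source (F0) part
W_m,   H_m   = rng.random((F, R)), rng.random((R, N))   # residual part

# Eq. (4): element-wise (Hadamard) product of the filter and source parts,
# plus the accompaniment term.
S_v = (W_phi @ H_phi) * (W_f0 @ H_f0)
S = S_v + W_m @ H_m
```

All factors being nonnegative, the resulting model matrix S is nonnegative as well.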
In (4), all the parameters of the right-hand side are estimated on the signal, except the matrix W_F0, which is a dictionary of harmonic spectral "combs", parameterized by their F0 frequencies. As discussed in [3], a careful choice of the F0s used in that dictionary leads to the desired representation in H_F0: in our case, we chose log2-spaced F0 values, i.e. a scale proportional to the Western musical scale. The number of F0s per semitone is fixed to 16, and the user can choose the extents of the scale, to fit the expected tessitura.
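The log2-spaced F0 grid (16 F0s per semitone) can be built as follows; the 100–800 Hz range and the Gaussian harmonic-comb shapes are illustrative simplifications, not the exact dictionary of [3]:

```python
import numpy as np

fs, n_fft = 44100, 2048
f_min, f_max, per_semitone = 100.0, 800.0, 16   # illustrative tessitura

# log2-spaced F0 candidates: `per_semitone` values per semitone, 12 semitones/octave
n_semitones = 12 * np.log2(f_max / f_min)
U = int(np.floor(n_semitones * per_semitone)) + 1
f0s = f_min * 2 ** (np.arange(U) / (12 * per_semitone))

# Simplified harmonic comb for one F0: Gaussian bumps at the harmonic frequencies
freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
def comb(f0, n_harmonics=20, width=20.0):
    h = np.arange(1, n_harmonics + 1)[:, None] * f0
    return np.exp(-0.5 * ((freqs[None, :] - h) / width) ** 2).sum(axis=0)

W_f0 = np.stack([comb(f) for f in f0s], axis=1)   # F x U dictionary
```

With these settings the grid spans three octaves and contains U = 577 F0 candidates.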
The other parameters are estimated thanks to the Non-negative Matrix Factorization (NMF) algorithm developed in [3]. The resulting matrix H_F0 finally provides the user with an image in which high values correspond to high energies associated with F0 frequencies, as shown in Fig. 1.
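The full source/filter algorithm of [3] is beyond this sketch; as an illustration of the general principle only, here is a plain NMF S ≈ WH with standard multiplicative updates for the Kullback-Leibler divergence:

```python
import numpy as np

def nmf(S, U, n_iter=25, seed=0):
    """Plain NMF with multiplicative KL-divergence updates, S ~= W @ H."""
    rng = np.random.default_rng(seed)
    F, N = S.shape
    W = rng.random((F, U)) + 1e-3
    H = rng.random((U, N)) + 1e-3
    eps = 1e-12
    ones = np.ones_like(S)
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative throughout
        H *= (W.T @ (S / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((S / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H

S_demo = np.random.default_rng(1).random((64, 40)) + 0.1   # toy nonnegative data
W, H = nmf(S_demo, U=8)
```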
4.2 F0 line selection and usage
The user can then, through the GUI of Fig. 1, select the zones containing the F0 values that correspond to the desired melody. A binary mask matrix H, of the same size as H_F0, initialized to 0 everywhere, is updated each time the user draws a curve with the mouse (while holding the left button) over the H_F0 image. All the coefficients along that curve, as well as the coefficients located within a user-defined vertical extent (half a semitone by default), are set to 1. The program superimposes the contour of the selection on the H_F0 image.
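The mask update can be sketched as follows; representing the mouse trajectory as a list of (frame, F0-bin) points is an assumption about how the GUI discretizes the drawn curve:

```python
import numpy as np

U, N = 577, 200
H_mask = np.zeros((U, N))            # binary selection mask, same size as H_F0

def select_curve(mask, curve, extent_semitones=0.5, bins_per_semitone=16):
    """Set mask to 1 along `curve` (list of (frame, f0_bin) points),
    plus a vertical band of +/- extent_semitones around each point."""
    half = int(round(extent_semitones * bins_per_semitone))
    n_bins = mask.shape[0]
    for frame, f0_bin in curve:
        lo, hi = max(0, f0_bin - half), min(n_bins, f0_bin + half + 1)
        mask[lo:hi, frame] = 1.0
    return mask

curve = [(n, 100 + (n % 5)) for n in range(50)]   # toy mouse trajectory
H_mask = select_curve(H_mask, curve)
```

A "Delete" tool would do the converse, zeroing the same band.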
Once all the desired tracks have been selected, the user can trigger the separation, given her mask H. Let H̃_F0 = H ⊙ H_F0. Assuming the desired source generates smooth melody lines, the melody path is then tracked in H̃_F0 with a Viterbi algorithm [4]: the user-defined regions are therefore used to restrict the melody tracking. The user can also refine the chosen regions with a narrower vertical extent, effectively allowing non-smooth melodies if needed.
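A minimal Viterbi-style tracker over the masked energy matrix can look like this; the additive penalty on inter-frame F0 jumps is an illustrative transition model, not the exact one of [4]:

```python
import numpy as np

def viterbi_melody(E, jump_penalty=0.1):
    """Track a smooth melody path through an F0-energy matrix E (U x N):
    maximize the sum of log-energies minus a penalty on F0 jumps."""
    U, N = E.shape
    logE = np.log(E + 1e-12)
    cost = logE[:, 0].copy()
    back = np.zeros((U, N), dtype=int)
    # penalty[i, j]: cost of jumping from bin j to bin i between frames
    penalty = -jump_penalty * np.abs(np.arange(U)[:, None] - np.arange(U)[None, :])
    for n in range(1, N):
        trans = cost[None, :] + penalty          # best predecessor for each bin
        back[:, n] = np.argmax(trans, axis=1)
        cost = trans[np.arange(U), back[:, n]] + logE[:, n]
    path = np.empty(N, dtype=int)
    path[-1] = int(np.argmax(cost))
    for n in range(N - 1, 0, -1):                # backtrack the optimal path
        path[n - 1] = back[path[n], n]
    return path

E = np.full((20, 10), 1e-3)                      # toy masked energy matrix
E[5, :] = 1.0                                    # one clear melody ridge
melody = viterbi_melody(E)
```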
Finally, the smoothed-out melody line is used to create a refined version of H̃_F0, zeroing coefficients lying too far from the melody. The parameters are then re-estimated, using H̃_F0 as the initial H_F0 matrix. These updated parameters {H_F0, W_Φ, H_Φ, W_M, H_M} are used to compute the separated sources. This second estimation round focuses on voiced patterns, and a third round is done to include more unvoiced elements [3].
4.3 Separating the Selected Source
Wiener filters are used to separate the sources, obtaining the estimates of the STFTs V and M, using [3]:

    V̂ = [ (W_Φ H_Φ ⊙ W_F0 H_F0) / (W_Φ H_Φ ⊙ W_F0 H_F0 + W_M H_M) ] ⊙ X  and  M̂ = X − V̂    (5)

where the division is element-wise.
The time-domain signals are then retrieved using an inverse STFT (overlap-add
procedure).
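The Wiener filtering step (5) reduces to an element-wise soft mask applied to the mixture STFT; the matrices below are random placeholders standing in for the estimated model terms:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 1025, 100
S_v = rng.random((F, N))         # placeholder for (W_Phi H_Phi) * (W_F0 H_F0)
S_m = rng.random((F, N))         # placeholder for W_M H_M
X = rng.random((F, N)) + 1j * rng.random((F, N))   # placeholder mixture STFT

mask = S_v / (S_v + S_m)         # element-wise Wiener gain in [0, 1]
V_hat = mask * X                 # estimated source STFT, eq. (5)
M_hat = X - V_hat                # estimated residual STFT
```

Both estimates sum back exactly to the mixture STFT, so the separation is conservative by construction.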
