Exploring Monaural Features for Classification-Based Speech Segregation
Yuxuan Wang, Kun Han, and DeLiang Wang, Fellow, IEEE
Abstract—Monaural speech segregation has been a very challenging problem for decades. By casting speech segregation as a binary classification problem, recent advances have been made in computational auditory scene analysis on segregation of both voiced and unvoiced speech. So far, pitch and amplitude modulation spectrogram have been used as two main kinds of time-frequency (T-F) unit level features in classification. In this paper, we expand T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP). Comprehensive comparisons are performed in order to identify effective features for classification-based speech segregation. Our experiments in matched and unmatched test conditions show that these newly included features significantly improve speech segregation performance. Specifically, GFCC and RASTA-PLP are the best single features in matched-noise and unmatched-noise test conditions, respectively. We also find that pitch-based features are crucial for good generalization to unseen environments. To further explore complementarity in terms of discriminative power, we propose to use a group Lasso approach to select complementary features in a principled way. The final combined feature set yields promising results in both matched and unmatched test conditions.
Index Terms—Binary classification, computational auditory scene analysis (CASA), feature combination, group Lasso, monaural speech segregation.
I. INTRODUCTION

SPEECH segregation, also known as the cocktail party problem, refers to the problem of segregating target speech from its background interference. Monaural speech segregation, which is the task of speech segregation from monaural recordings, is important for many real-world applications including robust speech and speaker recognition, audio information retrieval, and hearing aids design (see, e.g., [1], [7]). However, despite decades of effort, monaural speech segregation still remains one of the hardest problems in signal and speech processing. In this paper, we are concerned with monaural speech segregation from nonspeech interference; in other words, we do not address multitalker separation.

Manuscript received February 16, 2012; revised June 05, 2012; accepted September 20, 2012. Date of publication October 02, 2012; date of current version November 21, 2012. This work was supported in part by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-08-1-0155 and in part by an STTR grant from the AFOSR. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Bryan Pardo.
Y. Wang and K. Han are with the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA (e-mail: wangyuxu@cse.ohio-state.edu; hank@cse.ohio-state.edu).
D. Wang is with the Department of Computer Science and Engineering and the Center for Cognitive Science, The Ohio State University, Columbus, OH 43210 USA (e-mail: dwang@cse.ohio-state.edu).
Digital Object Identifier 10.1109/TASL.2012.2221459
Numerous algorithms have been developed to attack the monaural speech segregation problem. For example, spectral subtraction [4] and Wiener filtering [6] are two representative techniques. However, assumptions regarding the background interference are needed to make them work reasonably well. Another line of research relies on source models, e.g., training models for different speakers. Algorithms such as [19], [27], [28] can work well if the statistical properties of the observations correspond well to training conditions. Generalization to different sources usually requires model adaptation, which is a non-trivial issue.
Computational auditory scene analysis (CASA), which is inspired by Bregman's account of auditory scene analysis (ASA) [2], has shown considerable promise in the last decade. The estimation of the ideal binary mask (IBM) is suggested as a primary goal of CASA [35]. The IBM is a time-frequency (T-F) binary mask, constructed from premixed target and interference. A mask value of 1 for a T-F unit indicates that the signal-to-noise ratio (SNR) within the unit exceeds a threshold (target-dominant), and 0 otherwise (interference-dominant). In this work, we use a 0 dB threshold in all the experiments. A series of recent experiments [5], [24], [37] shows that IBM processing of sound mixtures yields large speech intelligibility gains.
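As a concrete illustration, the following Python sketch constructs the IBM from premixed signals. The per-unit energy arrays and all names are our own assumptions (not the authors' code); only the local SNR comparison against the 0 dB criterion follows the definition above.

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """IBM from premixed target and interference.

    target_energy, noise_energy: (channels, frames) arrays of per-unit
    energies of the clean target and the interference (e.g., cochleagram
    energies). lc_db: local SNR criterion in dB (0 dB in this work).
    """
    eps = np.finfo(float).eps
    local_snr_db = 10.0 * np.log10((target_energy + eps) / (noise_energy + eps))
    return (local_snr_db > lc_db).astype(np.uint8)  # 1 = target-dominant
```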
The estimation of the IBM may be viewed as binary classification of T-F units. Recent studies have applied this formulation and achieved good speech segregation results in both anechoic and reverberant environments [11], [14], [20], [22], [23], [29], [39]. In [14], [20], pitch-based features are used to train a classifier to separate target-dominant and interference-dominant units. However, pitch-based features cannot deal with unvoiced speech, which lacks harmonic structure. Seltzer et al. [29] and Weiss et al. [39] use comb filter and spectrogram statistics as features. In [11], [22], [23], the amplitude modulation spectrogram (AMS) is used, which makes unvoiced speech segregation possible, as AMS is a characteristic of both voiced and unvoiced speech. Unfortunately, the generalization ability of AMS is not good [11].
For classification, the use of an appropriate classifier is obviously important. Our previous study [11] focuses on classifier comparisons, and suggests that support vector machines (SVMs) work better than Gaussian mixture models (GMMs). However, that study only uses two existing features. Equally important for classification is the choice of appropriate features, which has been less studied. It should be noted that we are concerned with T-F unit level features, i.e., spectral/cepstral features extracted from each T-F unit. Feature extraction is possible

because a T-F unit is a signal of a certain length. To our knowledge, aside from the features used in [29], only pitch and AMS have been used as T-F unit level features. On the other hand, in the speech and speaker recognition community, many acoustic features have been explored, such as gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients (MFCC), relative spectral transform (RASTA) and perceptual linear prediction (PLP), each having its own advantages. However, they have not been studied as T-F unit level features for classification-based speech segregation.
The objective of this paper is to conduct a comprehensive feature study for classification-based speech segregation. To this end, we fix the SVM as the classifier and explore the use of existing speech and speaker features under the same classification framework. Our contributions are as follows:
• We propose to extract conventional speech/speaker features within each T-F unit to significantly enlarge the feature repository for unit classification.
• We propose a principled method to identify a complementary feature set. It has been shown in speech recognition that complementarity exists between basic acoustic features [9], [42]. To investigate complementary features in terms of discriminative power, we address the corresponding group variable selection problem using a group least absolute shrinkage and selection operator (Lasso) [41].
• We systematically compare the segregation performance of the newly included features and their combinations in various acoustic environments.
This paper is organized as follows. We present an overview of the system, along with the methodology for extracting features at the T-F unit level, in Section II. Section III describes a group Lasso approach to combining different features. Unit labeling results are reported in Section IV. We conclude this paper in Section V.
II. SYSTEM OVERVIEW AND FEATURE EXTRACTION
We describe the architecture of our segregation system as follows. A sound mixture with a 16 kHz sampling frequency is first fed into a 64-channel gammatone filterbank, with center frequencies equally spaced from 50 Hz to 8000 Hz on the equivalent rectangular bandwidth (ERB) rate scale. Gammatone filters model human auditory filters (critical bands) [26], and 64 channels provide an adequate frequency representation (see e.g., [37]). The output in each channel is then divided into 20-ms frames with 10-ms overlap between consecutive frames. This procedure produces a time-frequency representation of the sound mixture, called a cochleagram [36]. Our computational goal is to estimate the ideal binary mask for the mixture. Since the energy distribution of speech signals in different channels can be very different, we train a Gaussian-kernel SVM [11] for each subband channel separately, and ground truth labels are provided by the IBM. We use 5-fold cross validation to determine the hyperparameters.
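A minimal sketch of this front end is given below, using SciPy's gammatone filter design (scipy.signal.gammatone, SciPy 1.6+) as a stand-in for the authors' filterbank; the ERB-rate conversion is the standard Glasberg-Moore formula, and all function names are our own.

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def erb_space(low_hz, high_hz, n=64):
    """Center frequencies equally spaced on the ERB-rate scale."""
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)   # Hz -> ERB rate
    hz = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437  # ERB rate -> Hz
    return hz(np.linspace(erb(low_hz), erb(high_hz), n))

def cochleagram(x, fs=16000, n_channels=64, frame_len=320, hop=160):
    """64-channel gammatone analysis with 20-ms frames and a 10-ms shift.

    Returns per-unit energies of shape (n_channels, n_frames). The top
    center frequency is kept just below fs/2 to satisfy SciPy's input
    validation (the paper's filterbank spans 50 Hz to 8000 Hz).
    """
    cfs = erb_space(50.0, fs / 2 - 1.0, n_channels)
    n_frames = 1 + (len(x) - frame_len) // hop
    cg = np.zeros((n_channels, n_frames))
    for c, cf in enumerate(cfs):
        b, a = gammatone(cf, 'fir', fs=fs)   # SciPy >= 1.6
        sub = lfilter(b, a, x)               # subband signal of channel c
        for m in range(n_frames):
            frame = sub[m * hop : m * hop + frame_len]
            cg[c, m] = np.sum(frame ** 2)    # energy of T-F unit u(c, m)
    return cg
```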
Feature extraction is performed at the T-F unit level in the way described below. After obtaining a binary mask, i.e., an estimated IBM, from the trained SVM classifiers, the target speech is segregated from the sound mixture in a resynthesis step [36].
Note that we do not perform auditory segmentation, which is usually done for better segregation [11], [20], as we want to directly compare the unit labeling performance of each feature type. Auditory segmentation refers to a stage of processing that breaks the auditory scene into contiguous T-F regions, each of which contains acoustic energy mainly from a single sound source.

Fig. 1. Illustration of deriving RASTA-PLP features for the T-F unit in channel 20 and at frame 50.
Acoustic features are usually derived at the frame level. But since a binary decision needs to be made for each T-F unit, we need to find an appropriate representation for each T-F unit (recall that each T-F unit contains a slice of a subband signal). This can be done in a straightforward way as follows. To get acoustic features for the T-F unit $u(c,m)$ in channel $c$ and at frame $m$, we take the filtered output of channel $c$. Treating this subband signal as the input, conventional frame-level acoustic feature extraction is carried out, and the feature vector at frame $m$ is taken as the feature representation for $u(c,m)$. The unit level features derived this way obviously contain redundancy, as the subband signals are limited to the bandwidth of the corresponding gammatone filters. Nevertheless, such redundancy does no harm to classification in our experiments. We have also proposed a method to reduce the dimensionality of unit level features, which derives different acoustic features based on bandlimited spectral features; interested readers are referred to our technical report [38]. Fig. 1 illustrates how to derive a 12th order RASTA-PLP feature vector (including the zeroth cepstral coefficient) for the T-F unit in channel 20 and at frame 50.
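The recipe in this paragraph is generic, so it can be sketched once for all feature types: filter the mixture through channel c, run any frame-level extractor on the subband signal, and keep the vector at frame m. The helper below is a hypothetical illustration under that reading, not the authors' code.

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def unit_feature(x, fs, cf, frame_feature_fn, m):
    """Feature vector for the T-F unit u(c, m) in the channel centered at cf.

    frame_feature_fn: any frame-level extractor mapping a 1-D signal to a
    (n_frames, dim) array computed with a 10-ms frame shift, e.g., a
    RASTA-PLP, MFCC, or GFCC routine. This mirrors the recipe above:
    filter the mixture, extract frame-level features from the subband
    signal, and keep the vector at frame m.
    """
    b, a = gammatone(cf, 'fir', fs=fs)
    subband = lfilter(b, a, x)              # band-limited signal of channel c
    feats = frame_feature_fn(subband, fs)   # (n_frames, dim)
    return feats[m]                         # representation of u(c, m)
```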
In the following, we describe the features used in our experiments. These features have been successfully used in many speech processing tasks. We use the RASTAMAT toolbox [8] for extracting MFCC, PLP, and RASTA-PLP features.
A. Amplitude Modulation Spectrogram

AMS features have been applied to speech segregation problems recently [23]. To extract AMS features, we extract the envelope of the mixture signal by full-wave rectification and decimate it by a factor of 4. The decimated envelope is Hanning windowed and zero-padded for a 256-point FFT. The resulting FFT magnitudes are integrated by 15 triangular windows uniformly spaced from 15.6 to 400 Hz, producing a 15-D AMS feature vector.
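A sketch of this AMS recipe for a single T-F unit follows; where [23] leaves details unstated (exact frame handling, triangular-window half-width), the choices below are our assumptions.

```python
import numpy as np
from scipy.signal import decimate

def ams_unit(subband_frame, fs=16000, n_bands=15):
    """15-D AMS features for one T-F unit (a frame of a subband signal)."""
    env = np.abs(subband_frame)            # envelope via full-wave rectification
    env = decimate(env, 4)                 # decimate by a factor of 4
    windowed = env * np.hanning(len(env))  # Hanning window
    spec = np.abs(np.fft.rfft(windowed, n=256))  # zero-padded 256-point FFT
    freqs = np.fft.rfftfreq(256, d=4.0 / fs)     # modulation frequencies (Hz)
    centers = np.linspace(15.6, 400.0, n_bands)  # triangular-window centers
    width = centers[1] - centers[0]              # assumed half-width = spacing
    ams = np.empty(n_bands)
    for k, fc in enumerate(centers):
        tri = np.clip(1.0 - np.abs(freqs - fc) / width, 0.0, None)
        ams[k] = np.sum(tri * spec)        # integrate FFT magnitudes
    return ams
```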
B. Perceptual Linear Prediction

PLP [12] is a popular representation in speech recognition, and it is designed to find smooth spectra consisting of resonant peaks. To derive PLPs, we first warp the power spectrum to a 20-channel Bark scale using trapezoidal filters. Then, equal loudness preemphasis is applied, followed by applying an intensity loudness law. Finally, cepstral coefficients from linear prediction form the PLP features. Following common practice in speech recognition, we use a 12th order linear prediction model, yielding 13-D (including the zeroth cepstral coefficient) PLP features.
C. Relative Spectral Transform-PLP

RASTA filtering [13] is often coupled with PLP for robust speech recognition. In our experiments, we use a log-RASTA filtering approach. After the power spectrum is warped to the Bark scale, we log-compress the resulting auditory spectrum, filter it by the RASTA filter (a single pole at 0.94), and expand it again by an exponential function. Subsequently, PLP analysis is performed on this filtered spectrum. In essence, RASTA filtering serves as a modulation-frequency bandpass filter, which emphasizes the modulation frequency range most relevant to speech while discarding lower or higher modulation frequencies. As with PLP, we use 13-D RASTA-PLP features in this paper.
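The log-RASTA step itself is compact enough to sketch; the bandpass numerator coefficients below follow the RASTAMAT implementation [8] (our reading of it), with the single pole at 0.94 as stated above.

```python
import numpy as np
from scipy.signal import lfilter

def log_rasta_filter(bark_spectrum, pole=0.94):
    """log-RASTA filtering of a Bark-scale power spectrogram.

    bark_spectrum: (n_bands, n_frames) array. Each band's log trajectory
    is bandpass filtered along time by the RASTA filter, then expanded by
    exp(); PLP analysis would follow on the returned spectrum.
    """
    eps = np.finfo(float).eps
    log_spec = np.log(bark_spectrum + eps)       # log compression
    b = np.array([0.2, 0.1, 0.0, -0.1, -0.2])    # bandpass numerator (RASTAMAT)
    a = np.array([1.0, -pole])                   # single pole at 0.94
    filtered = lfilter(b, a, log_spec, axis=1)   # filter along the time axis
    return np.exp(filtered)                      # expand again
```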
D. Gammatone Frequency Cepstral Coefficient

To get GFCC features [31], a signal is first decomposed by a 64-channel gammatone filterbank. Then, we decimate each filter response to an effective sampling rate of 100 Hz, resulting in a 10-ms frame shift. The magnitudes of the decimated filter outputs are then loudness-compressed by a cubic root operation. Finally, a discrete cosine transform (DCT) is applied to the compressed signal to yield GFCC. As suggested in [30], we use 31-D GFCC in this paper.
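A compact sketch of this GFCC pipeline, assuming the 64 gammatone responses have already been computed (e.g., with the filterbank sketched in the system overview); decimation by plain subsampling is our reading of [31].

```python
import numpy as np
from scipy.fftpack import dct

def gfcc(responses, fs=16000, n_coeffs=31):
    """31-D GFCC from 64-channel gammatone filter responses.

    responses: (64, n_samples) array of subband signals. Each response is
    decimated to 100 Hz (10-ms frame shift), magnitude-compressed by a
    cubic root, and a DCT across channels yields the cepstral coefficients.
    """
    step = fs // 100                      # decimate to an effective 100 Hz
    g = np.abs(responses[:, ::step])      # (64, n_frames)
    g = g ** (1.0 / 3.0)                  # cubic root loudness compression
    return dct(g, type=2, axis=0, norm='ortho')[:n_coeffs]
```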
E. Mel-Frequency Cepstral Coefficient

We follow the standard procedure to get MFCC. The signal is first preemphasized, followed by a 512-point short-time Fourier transform with a 20-ms Hamming window to get its power spectrogram. The power spectra are then warped to the mel scale, followed by a log operation and DCT. Note that we warp the magnitudes to a 64-channel mel scale, for fair comparison with GFCC, in which a 64-channel gammatone filterbank is used for subband analysis. We use 31-D MFCC in this paper.
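For reference, a similar configuration can be approximated with librosa; mapping the stated parameters onto librosa's keyword arguments is our assumption, not the authors' setup.

```python
import librosa

def mfcc_31d(x, fs=16000):
    """31-D MFCC roughly matching the setup above: preemphasis, 512-point
    STFT with a 20-ms Hamming window and 10-ms shift, a 64-channel mel
    filterbank (for fair comparison with GFCC), log compression, and DCT.
    """
    y = librosa.effects.preemphasis(x)
    return librosa.feature.mfcc(
        y=y, sr=fs, n_mfcc=31,
        n_fft=512, win_length=320, hop_length=160,  # 20-ms window, 10-ms shift
        window='hamming', n_mels=64)
```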
F. Pitch-Based Features

Pitch is a primary cue for ASA. In our experiments, we use a set of pitch-based features originally proposed in [14], whose effectiveness has been confirmed in both anechoic and reverberant environments with additive noise [17], [20]. Although we are only concerned with nonspeech interference in this paper, it should be noted that pitch can also be effective for segregating target speech from competing speech. To get pitch-based features for the T-F unit $u(c,m)$, we first calculate the normalized autocorrelation function at each time lag $\tau$, denoted by $A(c,m,\tau)$:

$$A(c,m,\tau) = \frac{\sum_{n} x(c, mT + nT_s)\, x(c, mT + nT_s + \tau T_s)}{\sqrt{\sum_{n} x^2(c, mT + nT_s) \sum_{n} x^2(c, mT + nT_s + \tau T_s)}} \qquad (1)$$
where $x(c, \cdot)$ is the filter response of channel $c$, $T$ is the frame shift, and $T_s$ is the sampling period. The summation is over a 20-ms frame. If the signal in $u(c,m)$ is voiced and dominated by the target speech, it should have a period close to the pitch period at frame $m$. That is, given the pitch period $\tau_S(m)$ of the target speech at frame $m$, $A(c,m,\tau_S(m))$ measures how well the signal in $u(c,m)$ is consistent with the target speech.
The second and third features involve the average instantaneous frequency $\bar{f}(c,m)$, derived from the zero-crossing rate of $x(c, \cdot)$. If the signal in $u(c,m)$ belongs to target speech, the product of $\bar{f}(c,m)$ and $\tau_S(m)$ gives a harmonic number. Hence, we set the second feature to be the nearest integer of $\bar{f}(c,m)\tau_S(m)$, and the third feature to be the difference between the actual value of the product and its nearest integer. These two features carry information complementary to the first feature $A(c,m,\tau_S(m))$ [17].
The next three features are the same as the first three except that they are extracted from the envelopes of the filter responses. The envelopes are calculated by using a low-pass FIR filter designed with a Kaiser window of 18.25 ms. The resulting 6-D feature vector is

$$\Big[ A(c,m,\tau_S(m)),\; \big\lfloor \bar{f}(c,m)\tau_S(m) \big\rceil,\; \bar{f}(c,m)\tau_S(m) - \big\lfloor \bar{f}(c,m)\tau_S(m) \big\rceil,\; A_E(c,m,\tau_S(m)),\; \big\lfloor \bar{f}_E(c,m)\tau_S(m) \big\rceil,\; \bar{f}_E(c,m)\tau_S(m) - \big\lfloor \bar{f}_E(c,m)\tau_S(m) \big\rceil \Big]^T \qquad (2)$$

where $\lfloor \cdot \rceil$ denotes the rounding operation and the subscript $E$ indicates the envelope. It should be noted that pitch exists only in voiced speech. In this study, classifiers are trained on ground truth pitch extracted from clean speech by PRAAT [3], but tested on pitch estimated by a recently proposed multipitch tracker [21].
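A sketch of (1) and (2) for one unit follows. Pitch-period and instantaneous-frequency estimation are taken as given (ground truth pitch in training, the multipitch tracker [21] in testing), and all names are hypothetical.

```python
import numpy as np

def normalized_acf(sig, m, tau, frame_len=320, hop=160):
    """A(c, m, tau) of (1) for a subband signal `sig` at frame m and lag
    tau (both in samples); the sums run over a 20-ms frame."""
    s = m * hop
    u = sig[s : s + frame_len]
    v = sig[s + tau : s + tau + frame_len]
    eps = np.finfo(float).eps
    return np.sum(u * v) / (np.sqrt(np.sum(u**2) * np.sum(v**2)) + eps)

def pitch_features(sub, env, m, tau_s, f_sub, f_env):
    """6-D vector of (2). sub/env: filter response and its envelope;
    tau_s: pitch period tau_S(m) in samples; f_sub/f_env: average
    instantaneous frequencies (cycles/sample) of the response and the
    envelope, from zero-crossing rates (estimation omitted here)."""
    feats = []
    for sig, f in ((sub, f_sub), (env, f_env)):
        prod = f * tau_s        # should lie close to a harmonic number
        feats += [normalized_acf(sig, m, tau_s),
                  np.round(prod), prod - np.round(prod)]
    return np.array(feats)
```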
III. FEATURE COMBINATION: A GROUP LASSO APPROACH
Different acoustic features characterize different properties of the speech signal. As observed in speech recognition, feature combination may lead to significant performance improvements [9], [42]. Feature combination is usually done in one of three ways. The simplest method is to directly try different combinations. The exponential number of possibilities renders this method unrealistic when the number of features is large. The second way is to perform an unsupervised feature transformation such as kernel-PCA [32] on the concatenated feature vector. The third way is to apply a supervised feature transformation such as linear discriminant analysis (LDA) [9] to the concatenated feature vector. However, an issue with feature transformation relates to complementarity; i.e., it is unclear which feature types are complementary after transformation. Here, by complementarity, we mean that each feature type provides complementary information to boost classification, and thus their combination (concatenation in this paper) should outperform an individual type. Therefore, our goal is to find a principled way to select a set of complementary features, and such complementarity should be related to the discrimination of target-dominance and interference-dominance. This problem can be cast as a group variable selection problem, which is to find important groups of explanatory factors for prediction in the regression framework.

Group Lasso [41], a generalization of the widely used Lasso operator [34], is designed to tackle this problem by incorporating a mixed-norm regularization over regression coefficients. Since our labels are binary, we use the logistic regression extension of group Lasso [25], which can be efficiently solved by block coordinate gradient descent. The estimator is

$$\min_{\boldsymbol{\beta},\, b}\ \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i(\mathbf{x}_i^{T}\boldsymbol{\beta} + b)\right)\right) + \lambda \sum_{g=1}^{G} \left\|\boldsymbol{\beta}_{I_g}\right\|_2 \qquad (3)$$

where $\mathbf{x}_i$ is the $i$th training sample, $y_i$ is the ground truth label scaled to $\{-1, 1\}$, and $b$ is the intercept. $\|\cdot\|_2$ refers to the $\ell_2$ norm. The $G$ groups are predefined and non-overlapping, and $I_g$ is the index set of the $g$th group. The first term in the minimization is a standard log loss that concerns discrimination. The second term is an $\ell_1/\ell_2$ mixed-norm regularization, which imposes an $\ell_1$ regularization between groups and an $\ell_2$ regularization within each group. It is well known that the $\ell_1$ norm induces sparsity; therefore the $\ell_1/\ell_2$ regularization results in group sparsity and hence group level feature selection. The regularization parameter $\lambda$ controls the level of sparsity of the resulting model. In practice, we usually calculate $\lambda_{\max}$ first, above which $\boldsymbol{\beta}$ is very close to zero. We then use $\lambda = t\lambda_{\max}$ with $t \in [0, 1]$ in (3) for the ease of choosing appropriate parameter values.
To do feature combination, all the features are concatenated together to form a long feature vector, and each feature type is defined as a group; e.g., AMS (all 15 feature elements) is defined as the first group, PLP as the second, and so on. Then, for a fixed $t$ (hence $\lambda$), we solve (3) to get $\boldsymbol{\beta}$. Since group sparsity is induced, $\boldsymbol{\beta}_{I_g}$ will be zero (or small in magnitude) for some groups $g$, meaning that these groups (feature types) contribute little to discrimination in the presence of the other groups. Groups are selected if the magnitudes of their regression coefficients are greater than zero. Since (3) is solved at each channel separately, different types of features may get selected for different channels. A subband SVM classifier is then trained on the selected features and a cross-validation accuracy is obtained. To select a "global" set of complementary features, we average the cross-validation accuracies and the corresponding regression coefficients across frequency channels. Features having significant average responses or peaks are considered to be complementary for the particular choice of $t$. This is done for $t$ varying from 0 to 1 with a step size of 0.05. To achieve a good trade-off between discriminative power and model complexity, which is the number of groups selected, we empirically determine the final combination by weighing the averaged cross-validation accuracies against the corresponding model complexity.
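The estimator (3) can be prototyped with a simple proximal gradient loop that block soft-thresholds each feature group; this is only a sketch standing in for the block coordinate gradient descent of [25], and the names and defaults are our own.

```python
import numpy as np
from scipy.special import expit

def group_lasso_logistic(X, y, groups, lam, n_iter=500):
    """Prototype solver for (3): logistic loss + l1/l2 group penalty.

    X: (n, d) concatenated feature matrix; y: labels in {-1, +1};
    groups: list of index arrays, one per feature type (e.g., the 15 AMS
    dimensions form the first group); lam: regularization weight.
    Returns (beta, intercept).
    """
    n, d = X.shape
    beta, b0 = np.zeros(d), 0.0
    # Step size from the Lipschitz bound of the averaged logistic loss.
    lr = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        z = y * (X @ beta + b0)
        g = -y * expit(-z)                  # derivative of log(1 + e^{-z})
        beta -= lr * (X.T @ g) / n          # gradient step on the log loss
        b0 -= lr * np.mean(g)
        for idx in groups:                  # proximal step: block (group)
            nrm = np.linalg.norm(beta[idx])
            shrink = max(0.0, 1.0 - lr * lam / (nrm + 1e-12))
            beta[idx] *= shrink             # soft-thresholding -> group sparsity
    return beta, b0
```

Groups whose coefficient blocks shrink to (near) zero contribute little in the presence of the others; the surviving feature types form the candidate complementary set for that choice of regularization weight.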
IV. EVALUATION RESULTS
A. Experimental Setup

We use the IEEE corpus [18] for most of our evaluations. All utterances are downsampled to 16 kHz. For training, we mix 50 utterances recorded by a female talker with three types of noise at 0 dB. The three noises are: N1—bird chirps, N2—crow noise, and N3—cocktail party noise [14]. We choose 20 new utterances from the IEEE corpus for testing. The test utterances are different from those in training. Unless stated otherwise, test utterances from the same female talker are used, i.e., a speaker-dependent setting. This enables us to directly compare with [23], where the same speaker is used in training and testing. Relaxing speaker dependency is examined in Section IV-I. Two test conditions are employed. In the matched-noise condition, we mix the test utterances with different cuts from the trained noises (i.e., N1-N3) in order to test the performance on unseen utterances. In the unmatched-noise condition, the test utterances are mixed with three unseen noises: N4—crowd noise at a playground, N5—electric fan noise, and N6—traffic noise. The test mixtures are all mixed at 0 dB except in Section IV-H. There are approximately 800 seconds of mixtures for training in most of the experiments. The experiments in Section IV-G use longer training data, as the number of training utterances is increased. For testing, there are approximately 650 seconds of mixtures for the IEEE test set and 700 seconds for the TIMIT test set (see Section IV-I); with 64 channels and a 10-ms frame shift, this amounts to roughly 4.2 × 10^6 and 4.5 × 10^6 T-F units to be classified, respectively.
The dimensionality of each feature is described in Section II. As mentioned before, for the pitch-based features, ground truth pitch and estimated pitch are used in training and testing, respectively. We use PITCH to denote the 6-D pitch-based features. To put the performance of our classification-based segregation in perspective, we include results from a recent CASA system, the tandem algorithm [17], which jointly performs voiced speech segregation and pitch estimation in an iterative fashion. The tandem algorithm is initialized by the same estimated pitch from [21]. We use ideal sequential grouping for the tandem algorithm, because the algorithm does not deal with the issue of sequential grouping; i.e., it does not have a way to group pitch contours (and their associated masks) of the same speaker across time to form a segregated sentence. These results therefore represent the ceiling performance of the tandem algorithm.

Aside from the tandem algorithm, which tries to estimate the IBM explicitly, we focus on comparisons between different features under the same framework. Comparisons with fundamentally different techniques are not included in this study, which is about feature exploration for classification-based speech separation.
B. Evaluation Criteria

Since the task is classification, it is straightforward to measure the performance using classification accuracy. However, simply using accuracy as the evaluation criterion may not be appropriate, as miss and false-alarm errors are treated equally. Speech intelligibility studies [23], [24] have shown that false-alarm (FA) errors are far more detrimental to human speech intelligibility than miss errors. Kim et al. have thus proposed the HIT-FA rate as an evaluation criterion, and shown that this rate is well correlated with intelligibility [24]. The HIT rate is the percentage of correctly classified target-dominant T-F units in the IBM. The FA rate is the percentage of wrongly classified interference-dominant T-F units in the IBM. Therefore, we use HIT-FA as our main evaluation criterion.
TABLE I
SEGREGATION PERFORMANCE FOR SINGLE FEATURES IN THE MATCHED-NOISE CONDITION. BOLDFACE INDICATES BEST RESULT. "*" INDICATES THE RESULT IS SIGNIFICANTLY BETTER THAN AMS AT A 5% SIGNIFICANCE LEVEL

TABLE II
SEGREGATION PERFORMANCE FOR SINGLE FEATURES IN THE UNMATCHED-NOISE CONDITION
Another criterion is the IBM-modulated SNR of the segregated speech. When computing SNRs, the target speech resynthesized from the IBM is used as the ground truth signal [15], [17], as the IBM represents the ground truth of classification. This IBM-modulated SNR complements the above classification-based criteria by taking into account the underlying signal energy of each T-F unit.

We should note that other evaluation criteria have been developed in the speech separation community, including SNR and source-to-distortion ratio (SDR). Unlike the IBM, which is directly motivated by the auditory masking phenomenon, SNR and SDR do not take perceptual effects into consideration. Also, it is well known that SNR may not correlate with speech intelligibility, and the relationship between SDR and speech intelligibility is still unknown. Because of its correlation with speech intelligibility, we prefer the HIT-FA rate over SNR and SDR.
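Both criteria are straightforward to compute from an estimated mask and the IBM; a minimal sketch (our own helper names):

```python
import numpy as np

def hit_fa(est_mask, ibm):
    """HIT, FA, and HIT-FA (in percent) of an estimated mask vs. the IBM."""
    tgt, itf = (ibm == 1), (ibm == 0)
    hit = 100.0 * np.count_nonzero(est_mask[tgt] == 1) / max(tgt.sum(), 1)
    fa = 100.0 * np.count_nonzero(est_mask[itf] == 1) / max(itf.sum(), 1)
    return hit, fa, hit - fa

def ibm_modulated_snr(est_speech, ibm_speech):
    """Output SNR (dB) with the IBM-resynthesized target as ground truth."""
    eps = np.finfo(float).eps
    noise = ibm_speech - est_speech
    return 10.0 * np.log10(np.sum(ibm_speech**2) / (np.sum(noise**2) + eps))
```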
C. Single Features

In terms of HIT-FA, we document unit labeling performance at three levels: voiced speech intervals (pitched frames), unvoiced speech intervals (unpitched frames), and overall. Voiced/unvoiced speech intervals are determined by ground truth pitch. Both classification accuracy and SNR are evaluated at the overall level. Table I gives the results in the matched-noise test condition. In this condition, all features are able to maintain a low FA rate. The performance differences mainly stem from the HIT rate. Clearly, AMS does not perform well compared with the other features, as it fails to label many target-dominant units. In contrast, GFCC manages to achieve high HIT rates, with 79% overall HIT-FA, which is significantly better than the other single features. The classification accuracy and SNR using GFCC are also significantly higher than those obtained by the other features (except MFCC in terms of SNR). Unvoiced speech is important to speech intelligibility, and its segregation is a difficult task due to the lack of harmonicity and weak energy [16]. Again, AMS performs the worst, whereas GFCC does a very good job at segregating unvoiced speech. The good performance of GFCC is probably due to its effectiveness as a speaker identification feature [31]. An encouraging observation in the matched-noise condition is that some general acoustic features such as GFCC and MFCC significantly outperform PITCH even in voiced intervals. This remains true even when ground truth pitch is used in (2), which achieves 72% HIT-FA in voiced intervals. Similarly, the tandem algorithm, which includes auditory segmentation, is not competitive. For systematic comparison, we have produced the receiver operating characteristic (ROC) curves for overall classification obtained by using single features; interested readers are referred to our technical report [38].
Unlike the matched-noise condition, the unseen broadband noises are more demanding for generalization. The segregation results in the unmatched-noise condition are listed in Table II. We can see that the classification accuracy and both the HIT rate and the FA rate are affected, and the main degradation comes from substantially increased FA rates. Contrary to the other features, PITCH is the least affected feature type, with only a 5% reduction in HIT-FA. Using ground truth pitch, it is able to achieve 68% HIT-FA in voiced intervals. As the pitch-based features reflect intrinsic properties of speech, we do not expect that the change of interference will dramatically change pitch characteristics in target-dominant T-F units. Similarly, the tandem algorithm obtains a fairly low FA rate and achieves the best HIT-FA result in voiced intervals in this condition. Among the others, it is interesting to see that RASTA-PLP becomes the best performing feature type in terms of all three criteria. As shown in [13], RASTA-PLP effectively acts as a modulation-frequency filter, which retains slow modulations corresponding to speech.

We have used Student's t-tests at a 5% significance level to examine whether an improvement is statistically significant. We use the symbol "*" to denote that a result is significantly better than the previously studied AMS feature. As can be seen in Tables I and II, almost all the improvements are statistically significant.

References
[4] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 2, pp. 113–120, 1979.
[12] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738–1752, 1990.
[34] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc. Ser. B, vol. 58, no. 1, pp. 267–288, 1996.
[41] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Roy. Statist. Soc. Ser. B, vol. 68, no. 1, pp. 49–67, 2006.