Exploring Monaural Features for Classification-Based Speech Segregation
Yuxuan Wang, Kun Han, and DeLiang Wang, Fellow, IEEE
Abstract—Monaural speech segregation has been a very challenging problem for decades. By casting speech segregation as a binary classification problem, recent advances have been made in computational auditory scene analysis on segregation of both voiced and unvoiced speech. So far, pitch and amplitude modulation spectrogram have been used as two main kinds of time-frequency (T-F) unit level features in classification. In this paper, we expand T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP). Comprehensive comparisons are performed in order to identify effective features for classification-based speech segregation. Our experiments in matched and unmatched test conditions show that these newly included features significantly improve speech segregation performance. Specifically, GFCC and RASTA-PLP are the best single features in matched-noise and unmatched-noise test conditions, respectively. We also find that pitch-based features are crucial for good generalization to unseen environments. To further explore complementarity in terms of discriminative power, we propose to use a group Lasso approach to select complementary features in a principled way. The final combined feature set yields promising results in both matched and unmatched test conditions.
Index Terms—Binary classification, computational auditory scene analysis (CASA), feature combination, group Lasso, monaural speech segregation.
I. INTRODUCTION

SPEECH segregation, also known as the cocktail party problem, refers to the problem of segregating target speech from its background interference. Monaural speech segregation, which is the task of speech segregation from monaural recordings, is important for many real-world applications including robust speech and speaker recognition, audio information retrieval, and hearing aids design (see, e.g., [1], [7]). However, despite decades of effort, monaural speech segregation still remains one of the hardest problems in signal and speech processing. In this paper, we are concerned with monaural speech segregation from nonspeech interference; in other words, we do not address multitalker separation.

Manuscript received February 16, 2012; revised June 05, 2012; accepted September 20, 2012. Date of publication October 02, 2012; date of current version November 21, 2012. This work was supported in part by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-08-1-0155 and in part by an STTR grant from the AFOSR. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Bryan Pardo.
Y. Wang and K. Han are with the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA (e-mail: wangyuxu@cse.ohio-state.edu; hank@cse.ohio-state.edu).
D. Wang is with the Department of Computer Science and Engineering and the Center for Cognitive Science, The Ohio State University, Columbus, OH 43210 USA (e-mail: dwang@cse.ohio-state.edu).
Digital Object Identifier 10.1109/TASL.2012.2221459
Numerous algorithms have been developed to attack the monaural speech segregation problem. For example, spectral subtraction [4] and Wiener filtering [6] are two representative techniques. However, assumptions regarding the background interference are needed to make them work reasonably well. Another line of research relies on source models, e.g., training models for different speakers. Algorithms such as [19], [27], [28] can work well if the statistical properties of the observations correspond well to training conditions. Generalization to different sources usually requires model adaptation, which is a non-trivial issue.
Computational auditory scene analysis (CASA), which is inspired by Bregman's account of auditory scene analysis (ASA) [2], has shown considerable promise in the last decade. The estimation of the ideal binary mask (IBM) is suggested as a primary goal of CASA [35]. The IBM is a time-frequency (T-F) binary mask, constructed from premixed target and interference. A mask value of 1 for a T-F unit indicates that the signal-to-noise ratio (SNR) within the unit exceeds a threshold (target-dominant), and 0 otherwise (interference-dominant). In this work, we use a 0 dB threshold in all the experiments. A series of recent experiments [5], [24], [37] shows that IBM processing of sound mixtures yields large speech intelligibility gains.
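As a concrete illustration, the following Python sketch constructs the IBM from premixed signals. The per-unit energy arrays and all names are our own assumptions (not the authors' code); only the local SNR comparison against the 0 dB criterion follows the definition above.

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """IBM from premixed target and interference.

    target_energy, noise_energy: (channels, frames) arrays of per-unit
    energies of the clean target and the interference (e.g., cochleagram
    energies). lc_db: local SNR criterion in dB (0 dB in this work).
    """
    eps = np.finfo(float).eps
    local_snr_db = 10.0 * np.log10((target_energy + eps) / (noise_energy + eps))
    return (local_snr_db > lc_db).astype(np.uint8)  # 1 = target-dominant
```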
The estimation of the IBM may be viewed as binary classification of T-F units. Recent studies have applied this formulation and achieved good speech segregation results in both anechoic and reverberant environments [11], [14], [20], [22], [23], [29], [39]. In [14], [20], pitch-based features are used to train a classifier to separate target-dominant and interference-dominant units. However, pitch-based features cannot deal with unvoiced speech, which lacks harmonic structure. Seltzer et al. [29] and Weiss et al. [39] use comb filter and spectrogram statistics as features. In [11], [22], [23], the amplitude modulation spectrogram (AMS) is used, which makes unvoiced speech segregation possible, as AMS is a characteristic of both voiced and unvoiced speech. Unfortunately, the generalization ability of AMS is not good [11].
For classification, the use of an appropriate classifier is obviously important. Our previous study [11] focuses on classifier comparisons, and suggests that support vector machines (SVMs) work better than Gaussian mixture models (GMMs). However, that study only uses two existing features. Equally important for classification is the choice of appropriate features, which has been less studied. It should be noted that we are concerned with T-F unit level features, i.e., spectral/cepstral features extracted from each T-F unit. Feature extraction is possible

because a T-F unit is a signal of a certain length. To our knowledge, aside from the features used in [29], only pitch and AMS have been used as T-F unit level features. On the other hand, in the speech and speaker recognition community, many acoustic features have been explored, such as gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients (MFCC), relative spectral transform (RASTA) and perceptual linear prediction (PLP), each having its own advantages. However, they have not been studied as T-F unit level features for classification-based speech segregation.
The objective of this paper is to conduct a comprehensive feature study for classification-based speech segregation. To this end, we fix the SVM as the classifier and explore the use of existing speech and speaker features under the same classification framework. Our contributions are as follows:
• We propose to extract conventional speech/speaker features within each T-F unit to significantly enlarge the feature repository for unit classification.
• We propose a principled method to identify a complementary feature set. It has been shown in speech recognition that complementarity exists between basic acoustic features [9], [42]. To investigate complementary features in terms of discriminative power, we address the corresponding group variable selection problem using a group least absolute shrinkage and selection operator (Lasso) [41].
• We systematically compare the segregation performance of the newly included features and their combinations in various acoustic environments.
This paper is organized as follows. We present an overview of the system, along with the methodology for extracting features at the T-F unit level, in Section II. Section III describes a group Lasso approach to combining different features. Unit labeling results are reported in Section IV. We conclude this paper in Section V.
II. SYSTEM OVERVIEW AND FEATURE EXTRACTION
We describe the architecture of our segregation system as follows. A sound mixture with a 16 kHz sampling frequency is first fed into a 64-channel gammatone filterbank, with center frequencies equally spaced from 50 Hz to 8000 Hz on the equivalent rectangular bandwidth (ERB) rate scale. Gammatone filters model human auditory filters (critical bands) [26], and 64 channels provide an adequate frequency representation (see e.g., [37]). The output in each channel is then divided into 20-ms frames with 10-ms overlap between consecutive frames. This procedure produces a time-frequency representation of the sound mixture, called a cochleagram [36]. Our computational goal is to estimate the ideal binary mask for the mixture. Since the energy distribution of speech signals in different channels can be very different, we train a Gaussian-kernel SVM [11] for each subband channel separately, and ground truth labels are provided by the IBM. We use 5-fold cross validation to determine the hyperparameters.
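A minimal sketch of this front end is given below, using SciPy's gammatone filter design (scipy.signal.gammatone, SciPy 1.6+) as a stand-in for the authors' filterbank; the ERB-rate conversion is the standard Glasberg-Moore formula, and all function names are our own.

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def erb_space(low_hz, high_hz, n=64):
    """Center frequencies equally spaced on the ERB-rate scale."""
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)   # Hz -> ERB rate
    hz = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437  # ERB rate -> Hz
    return hz(np.linspace(erb(low_hz), erb(high_hz), n))

def cochleagram(x, fs=16000, n_channels=64, frame_len=320, hop=160):
    """64-channel gammatone analysis with 20-ms frames and a 10-ms shift.

    Returns per-unit energies of shape (n_channels, n_frames). The top
    center frequency is kept just below fs/2 to satisfy SciPy's input
    validation (the paper's filterbank spans 50 Hz to 8000 Hz).
    """
    cfs = erb_space(50.0, fs / 2 - 1.0, n_channels)
    n_frames = 1 + (len(x) - frame_len) // hop
    cg = np.zeros((n_channels, n_frames))
    for c, cf in enumerate(cfs):
        b, a = gammatone(cf, 'fir', fs=fs)   # SciPy >= 1.6
        sub = lfilter(b, a, x)               # subband signal of channel c
        for m in range(n_frames):
            frame = sub[m * hop : m * hop + frame_len]
            cg[c, m] = np.sum(frame ** 2)    # energy of T-F unit u(c, m)
    return cg
```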
Feature extraction is performed at the T-F unit level in the way described below. After obtaining a binary mask, i.e., an estimated IBM, from the trained SVM classifiers, the target speech is segregated from the sound mixture in a resynthesis step [36].
Note that we do not perform auditory segmentation, which is usually done for better segregation [11], [20], as we want to directly compare the unit labeling performance of each feature type. Auditory segmentation refers to a stage of processing that breaks the auditory scene into contiguous T-F regions, each of which contains acoustic energy mainly from a single sound source.

Fig. 1. Illustration of deriving RASTA-PLP features for the T-F unit in channel 20 and at frame 50.
Acoustic features are usually derived at the frame level. But since a binary decision needs to be made for each T-F unit, we need to find an appropriate representation for each T-F unit (recall that each T-F unit contains a slice of a subband signal). This can be done in a straightforward way as follows. To get acoustic features for the T-F unit $u(c,m)$ in channel $c$ and at frame $m$, we take the filtered output of channel $c$. Treating this subband signal as the input, conventional frame-level acoustic feature extraction is carried out, and the feature vector at frame $m$ is taken as the feature representation for $u(c,m)$. The unit level features derived this way obviously contain redundancy, as the subband signals are limited to the bandwidth of the corresponding gammatone filters. Nevertheless, such redundancy does no harm to classification in our experiments. We have also proposed a method to reduce the dimensionality of unit level features, which derives different acoustic features based on bandlimited spectral features; interested readers are referred to our technical report [38]. Fig. 1 illustrates how to derive a 12th order RASTA-PLP feature vector (including the zeroth cepstral coefficient) for the T-F unit in channel 20 and at frame 50.
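The recipe in this paragraph is generic, so it can be sketched once for all feature types: filter the mixture through channel c, run any frame-level extractor on the subband signal, and keep the vector at frame m. The helper below is a hypothetical illustration under that reading, not the authors' code.

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def unit_feature(x, fs, cf, frame_feature_fn, m):
    """Feature vector for the T-F unit u(c, m) in the channel centered at cf.

    frame_feature_fn: any frame-level extractor mapping a 1-D signal to a
    (n_frames, dim) array computed with a 10-ms frame shift, e.g., a
    RASTA-PLP, MFCC, or GFCC routine. This mirrors the recipe above:
    filter the mixture, extract frame-level features from the subband
    signal, and keep the vector at frame m.
    """
    b, a = gammatone(cf, 'fir', fs=fs)
    subband = lfilter(b, a, x)              # band-limited signal of channel c
    feats = frame_feature_fn(subband, fs)   # (n_frames, dim)
    return feats[m]                         # representation of u(c, m)
```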
In the following, we describe the features used in our experiments. These features have been successfully used in many speech processing tasks. We use the RASTAMAT toolbox [8] for extracting MFCC, PLP, and RASTA-PLP features.
A. Amplitude Modulation Spectrogram

AMS features have been applied to speech segregation problems recently [23]. To extract AMS features, we extract the envelope of the mixture signal by full-wave rectification and decimate it by a factor of 4. The decimated envelope is Hanning windowed and zero-padded for a 256-point FFT. The resulting FFT magnitudes are integrated by 15 triangular windows uniformly spaced from 15.6 to 400 Hz, producing a 15-D AMS feature vector.
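A sketch of this AMS recipe for a single T-F unit follows; where [23] leaves details unstated (exact frame handling, triangular-window half-width), the choices below are our assumptions.

```python
import numpy as np
from scipy.signal import decimate

def ams_unit(subband_frame, fs=16000, n_bands=15):
    """15-D AMS features for one T-F unit (a frame of a subband signal)."""
    env = np.abs(subband_frame)            # envelope via full-wave rectification
    env = decimate(env, 4)                 # decimate by a factor of 4
    windowed = env * np.hanning(len(env))  # Hanning window
    spec = np.abs(np.fft.rfft(windowed, n=256))  # zero-padded 256-point FFT
    freqs = np.fft.rfftfreq(256, d=4.0 / fs)     # modulation frequencies (Hz)
    centers = np.linspace(15.6, 400.0, n_bands)  # triangular-window centers
    width = centers[1] - centers[0]              # assumed half-width = spacing
    ams = np.empty(n_bands)
    for k, fc in enumerate(centers):
        tri = np.clip(1.0 - np.abs(freqs - fc) / width, 0.0, None)
        ams[k] = np.sum(tri * spec)        # integrate FFT magnitudes
    return ams
```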
B. Perceptual Linear Prediction

PLP [12] is a popular representation in speech recognition, and it is designed to find smooth spectra consisting of resonant peaks. To derive PLPs, we first warp the power spectrum to a 20-channel Bark scale using trapezoidal filters. Then, equal loudness preemphasis is applied, followed by applying an intensity loudness law. Finally, cepstral coefficients from linear prediction form the PLP features. Following common practice in speech recognition, we use a 12th order linear prediction model, yielding 13-D (including the zeroth cepstral coefficient) PLP features.
C. Relative Spectral Transform-PLP

RASTA filtering [13] is often coupled with PLP for robust speech recognition. In our experiments, we use a log-RASTA filtering approach. After the power spectrum is warped to the Bark scale, we log-compress the resulting auditory spectrum, filter it by the RASTA filter (a single pole at 0.94), and expand it again by an exponential function. Subsequently, PLP analysis is performed on this filtered spectrum. In essence, RASTA filtering serves as a modulation-frequency bandpass filter, which emphasizes the modulation frequency range most relevant to speech while discarding lower or higher modulation frequencies. As with PLP, we use 13-D RASTA-PLP features in this paper.
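The log-RASTA step itself is compact enough to sketch; the bandpass numerator coefficients below follow the RASTAMAT implementation [8] (our reading of it), with the single pole at 0.94 as stated above.

```python
import numpy as np
from scipy.signal import lfilter

def log_rasta_filter(bark_spectrum, pole=0.94):
    """log-RASTA filtering of a Bark-scale power spectrogram.

    bark_spectrum: (n_bands, n_frames) array. Each band's log trajectory
    is bandpass filtered along time by the RASTA filter, then expanded by
    exp(); PLP analysis would follow on the returned spectrum.
    """
    eps = np.finfo(float).eps
    log_spec = np.log(bark_spectrum + eps)       # log compression
    b = np.array([0.2, 0.1, 0.0, -0.1, -0.2])    # bandpass numerator (RASTAMAT)
    a = np.array([1.0, -pole])                   # single pole at 0.94
    filtered = lfilter(b, a, log_spec, axis=1)   # filter along the time axis
    return np.exp(filtered)                      # expand again
```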
D. Gammatone Frequency Cepstral Coefficient

To get GFCC features [31], a signal is first decomposed by a 64-channel gammatone filterbank. Then, we decimate each filter response to an effective sampling rate of 100 Hz, resulting in a 10-ms frame shift. The magnitudes of the decimated filter outputs are then loudness-compressed by a cubic root operation. Finally, a discrete cosine transform (DCT) is applied to the compressed signal to yield GFCC. As suggested in [30], we use 31-D GFCC in this paper.
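A compact sketch of this GFCC pipeline, assuming the 64 gammatone responses have already been computed (e.g., with the filterbank sketched in the system overview); decimation by plain subsampling is our reading of [31].

```python
import numpy as np
from scipy.fftpack import dct

def gfcc(responses, fs=16000, n_coeffs=31):
    """31-D GFCC from 64-channel gammatone filter responses.

    responses: (64, n_samples) array of subband signals. Each response is
    decimated to 100 Hz (10-ms frame shift), magnitude-compressed by a
    cubic root, and a DCT across channels yields the cepstral coefficients.
    """
    step = fs // 100                      # decimate to an effective 100 Hz
    g = np.abs(responses[:, ::step])      # (64, n_frames)
    g = g ** (1.0 / 3.0)                  # cubic root loudness compression
    return dct(g, type=2, axis=0, norm='ortho')[:n_coeffs]
```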
E. Mel-Frequency Cepstral Coefficient

We follow the standard procedure to get MFCC. The signal is first preemphasized, followed by a 512-point short-time Fourier transform with a 20-ms Hamming window to get its power spectrogram. The power spectra are then warped to the mel scale, followed by a log operation and DCT. Note that we warp the magnitudes to a 64-channel mel scale, for fair comparison with GFCC, in which a 64-channel gammatone filterbank is used for subband analysis. We use 31-D MFCC in this paper.
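For reference, a similar configuration can be approximated with librosa; mapping the stated parameters onto librosa's keyword arguments is our assumption, not the authors' setup.

```python
import librosa

def mfcc_31d(x, fs=16000):
    """31-D MFCC roughly matching the setup above: preemphasis, 512-point
    STFT with a 20-ms Hamming window and 10-ms shift, a 64-channel mel
    filterbank (for fair comparison with GFCC), log compression, and DCT.
    """
    y = librosa.effects.preemphasis(x)
    return librosa.feature.mfcc(
        y=y, sr=fs, n_mfcc=31,
        n_fft=512, win_length=320, hop_length=160,  # 20-ms window, 10-ms shift
        window='hamming', n_mels=64)
```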
F. Pitch-Based Features

Pitch is a primary cue for ASA. In our experiments, we use a set of pitch-based features originally proposed in [14], whose effectiveness has been confirmed in both anechoic and reverberant environments with additive noise [17], [20]. Although we are only concerned with nonspeech interference in this paper, it should be noted that pitch can also be effective for segregating target speech from competing speech. To get pitch-based features for the T-F unit $u(c,m)$, we first calculate the normalized autocorrelation function at each time lag $\tau$, denoted by $A(c,m,\tau)$:

$$A(c,m,\tau) = \frac{\sum_{n} x(c, mT + nT_s)\, x(c, mT + nT_s + \tau T_s)}{\sqrt{\sum_{n} x^2(c, mT + nT_s) \sum_{n} x^2(c, mT + nT_s + \tau T_s)}} \qquad (1)$$
where $x(c, \cdot)$ is the filter response of channel $c$, $T$ is the frame shift, and $T_s$ is the sampling period. The summation is over a 20-ms frame. If the signal in $u(c,m)$ is voiced and dominated by the target speech, it should have a period close to the pitch period at frame $m$. That is, given the pitch period $\tau_S(m)$ of the target speech at frame $m$, $A(c,m,\tau_S(m))$ measures how well the signal in $u(c,m)$ is consistent with the target speech.
The second and third features involve the average instantaneous frequency $\bar{f}(c,m)$, derived from the zero-crossing rate of $x(c, \cdot)$. If the signal in $u(c,m)$ belongs to target speech, the product of $\bar{f}(c,m)$ and $\tau_S(m)$ gives a harmonic number. Hence, we set the second feature to be the nearest integer of $\bar{f}(c,m)\tau_S(m)$, and the third feature to be the difference between the actual value of the product and its nearest integer. These two features carry information complementary to the first feature $A(c,m,\tau_S(m))$ [17].
The next three features are the same as the first three except that they are extracted from the envelopes of the filter responses. The envelopes are calculated by using a low-pass FIR filter designed with a Kaiser window of 18.25 ms. The resulting 6-D feature vector is

$$\Big[ A(c,m,\tau_S(m)),\; \big\lfloor \bar{f}(c,m)\tau_S(m) \big\rceil,\; \bar{f}(c,m)\tau_S(m) - \big\lfloor \bar{f}(c,m)\tau_S(m) \big\rceil,\; A_E(c,m,\tau_S(m)),\; \big\lfloor \bar{f}_E(c,m)\tau_S(m) \big\rceil,\; \bar{f}_E(c,m)\tau_S(m) - \big\lfloor \bar{f}_E(c,m)\tau_S(m) \big\rceil \Big]^T \qquad (2)$$

where $\lfloor \cdot \rceil$ denotes the rounding operation and the subscript $E$ indicates the envelope. It should be noted that pitch exists only in voiced speech. In this study, classifiers are trained on ground truth pitch extracted from clean speech by PRAAT [3], but tested on pitch estimated by a recently proposed multipitch tracker [21].
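A sketch of (1) and (2) for one unit follows. Pitch-period and instantaneous-frequency estimation are taken as given (ground truth pitch in training, the multipitch tracker [21] in testing), and all names are hypothetical.

```python
import numpy as np

def normalized_acf(sig, m, tau, frame_len=320, hop=160):
    """A(c, m, tau) of (1) for a subband signal `sig` at frame m and lag
    tau (both in samples); the sums run over a 20-ms frame."""
    s = m * hop
    u = sig[s : s + frame_len]
    v = sig[s + tau : s + tau + frame_len]
    eps = np.finfo(float).eps
    return np.sum(u * v) / (np.sqrt(np.sum(u**2) * np.sum(v**2)) + eps)

def pitch_features(sub, env, m, tau_s, f_sub, f_env):
    """6-D vector of (2). sub/env: filter response and its envelope;
    tau_s: pitch period tau_S(m) in samples; f_sub/f_env: average
    instantaneous frequencies (cycles/sample) of the response and the
    envelope, from zero-crossing rates (estimation omitted here)."""
    feats = []
    for sig, f in ((sub, f_sub), (env, f_env)):
        prod = f * tau_s        # should lie close to a harmonic number
        feats += [normalized_acf(sig, m, tau_s),
                  np.round(prod), prod - np.round(prod)]
    return np.array(feats)
```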
III. FEATURE COMBINATION: A GROUP LASSO APPROACH
Different acoustic features characterize different properties of the speech signal. As observed in speech recognition, feature combination may lead to significant performance improvements [9], [42]. Feature combination is usually done in one of three ways. The simplest method is to directly try different combinations. The exponential number of possibilities renders this method unrealistic when the number of features is large. The second way is to perform an unsupervised feature transformation such as kernel-PCA [32] on the concatenated feature vector. The third way is to apply a supervised feature transformation such as linear discriminant analysis (LDA) [9] to the concatenated feature vector. However, an issue with feature transformation relates to complementarity; i.e., it is unclear which feature types are complementary after transformation. Here, by complementarity, we mean that each feature type provides complementary information to boost classification, and thus their combination (concatenation in this paper) should outperform an individual type. Therefore, our goal is to find a principled way to select a set of complementary features, and such complementarity should be related to the discrimination of target-dominance and interference-dominance. This problem can be cast as a group variable selection problem, which is to find important groups of explanatory factors for prediction in the regression framework.

Group Lasso [41], a generalization of the widely used Lasso operator [34], is designed to tackle this problem by incorporating a mixed-norm regularization over regression coefficients. Since our labels are binary, we use the logistic regression extension of group Lasso [25], which can be efficiently solved by block coordinate gradient descent. The estimator is

$$\min_{\boldsymbol{\beta},\, b}\ \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i(\mathbf{x}_i^{T}\boldsymbol{\beta} + b)\right)\right) + \lambda \sum_{g=1}^{G} \left\|\boldsymbol{\beta}_{I_g}\right\|_2 \qquad (3)$$

where $\mathbf{x}_i$ is the $i$th training sample, $y_i$ is the ground truth label scaled to $\{-1, 1\}$, and $b$ is the intercept. $\|\cdot\|_2$ refers to the $\ell_2$ norm. The $G$ groups are predefined and non-overlapping, and $I_g$ is the index set of the $g$th group. The first term in the minimization is a standard log loss that concerns discrimination. The second term is an $\ell_1/\ell_2$ mixed-norm regularization, which imposes an $\ell_1$ regularization between groups and an $\ell_2$ regularization within each group. It is well known that the $\ell_1$ norm induces sparsity; therefore the $\ell_1/\ell_2$ regularization results in group sparsity and hence group level feature selection. The regularization parameter $\lambda$ controls the level of sparsity of the resulting model. In practice, we usually calculate $\lambda_{\max}$ first, above which $\boldsymbol{\beta}$ is very close to zero. We then use $\lambda = t\lambda_{\max}$ with $t \in [0, 1]$ in (3) for the ease of choosing appropriate parameter values.
To do feature combination, all the features are concatenated together to form a long feature vector, and each feature type is defined as a group; e.g., AMS (all 15 feature elements) is defined as the first group, PLP as the second, and so on. Then, for a fixed $t$ (hence $\lambda$), we solve (3) to get $\boldsymbol{\beta}$. Since group sparsity is induced, $\boldsymbol{\beta}_{I_g}$ will be zero (or small in magnitude) for some groups $g$, meaning that these groups (feature types) contribute little to discrimination in the presence of the other groups. Groups are selected if the magnitudes of their regression coefficients are greater than zero. Since (3) is solved at each channel separately, different types of features may get selected for different channels. A subband SVM classifier is then trained on the selected features and a cross-validation accuracy is obtained. To select a "global" set of complementary features, we average the cross-validation accuracies and the corresponding regression coefficients across frequency channels. Features having significant average responses or peaks are considered to be complementary for the particular choice of $t$. This is done for $t$ varying from 0 to 1 with a step size of 0.05. To achieve a good trade-off between discriminative power and model complexity, which is the number of groups selected, we empirically determine the final combination by weighing the averaged cross-validation accuracies against the corresponding model complexity.
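The estimator (3) can be prototyped with a simple proximal gradient loop that block soft-thresholds each feature group; this is only a sketch standing in for the block coordinate gradient descent of [25], and the names and defaults are our own.

```python
import numpy as np
from scipy.special import expit

def group_lasso_logistic(X, y, groups, lam, n_iter=500):
    """Prototype solver for (3): logistic loss + l1/l2 group penalty.

    X: (n, d) concatenated feature matrix; y: labels in {-1, +1};
    groups: list of index arrays, one per feature type (e.g., the 15 AMS
    dimensions form the first group); lam: regularization weight.
    Returns (beta, intercept).
    """
    n, d = X.shape
    beta, b0 = np.zeros(d), 0.0
    # Step size from the Lipschitz bound of the averaged logistic loss.
    lr = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        z = y * (X @ beta + b0)
        g = -y * expit(-z)                  # derivative of log(1 + e^{-z})
        beta -= lr * (X.T @ g) / n          # gradient step on the log loss
        b0 -= lr * np.mean(g)
        for idx in groups:                  # proximal step: block (group)
            nrm = np.linalg.norm(beta[idx])
            shrink = max(0.0, 1.0 - lr * lam / (nrm + 1e-12))
            beta[idx] *= shrink             # soft-thresholding -> group sparsity
    return beta, b0
```

Groups whose coefficient blocks shrink to (near) zero contribute little in the presence of the others; the surviving feature types form the candidate complementary set for that choice of regularization weight.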
IV. EVALUATION RESULTS
A. Experimental Setup

We use the IEEE corpus [18] for most of our evaluations. All utterances are downsampled to 16 kHz. For training, we mix 50 utterances recorded by a female talker with three types of noise at 0 dB. The three noises are: N1—bird chirps, N2—crow noise, and N3—cocktail party noise [14]. We choose 20 new utterances from the IEEE corpus for testing. The test utterances are different from those in training. Unless stated otherwise, test utterances from the same female talker are used, i.e., a speaker-dependent setting. This enables us to directly compare with [23], where the same speaker is used in training and testing. Relaxing speaker dependency is examined in Section IV-I. Two test conditions are employed. In the matched-noise condition, we mix the test utterances with different cuts from the trained noises (i.e., N1-N3) in order to test the performance on unseen utterances. In the unmatched-noise condition, the test utterances are mixed with three unseen noises: N4—crowd noise at a playground, N5—electric fan noise, and N6—traffic noise. The test mixtures are all mixed at 0 dB except in Section IV-H. There are approximately 800 seconds of mixtures for training in most of the experiments. The experiments in Section IV-G use longer training data, as the number of training utterances is increased. For testing, there are approximately 650 seconds of mixtures for the IEEE test set and 700 seconds for the TIMIT test set (see Section IV-I); with 64 channels and a 10-ms frame shift, this amounts to roughly 4.2 × 10^6 and 4.5 × 10^6 T-F units to be classified, respectively.
The dimensionality of each feature is described in Section II. As mentioned before, for the pitch-based features, ground truth pitch and estimated pitch are used in training and testing, respectively. We use PITCH to denote the 6-D pitch-based features. To put the performance of our classification-based segregation in perspective, we include results from a recent CASA system, the tandem algorithm [17], which jointly performs voiced speech segregation and pitch estimation in an iterative fashion. The tandem algorithm is initialized by the same estimated pitch from [21]. We use ideal sequential grouping for the tandem algorithm, because the algorithm does not deal with the issue of sequential grouping; i.e., it does not have a way to group pitch contours (and their associated masks) of the same speaker across time to form a segregated sentence. These results therefore represent the ceiling performance of the tandem algorithm.

Aside from the tandem algorithm, which tries to estimate the IBM explicitly, we focus on comparisons between different features under the same framework. Comparisons with fundamentally different techniques are not included in this study, which is about feature exploration for classification-based speech separation.
B. Evaluation Criteria

Since the task is classification, it is straightforward to measure the performance using classification accuracy. However, simply using accuracy as the evaluation criterion may not be appropriate, as miss and false-alarm errors are treated equally. Speech intelligibility studies [23], [24] have shown that false-alarm (FA) errors are far more detrimental to human speech intelligibility than miss errors. Kim et al. have thus proposed the HIT-FA rate as an evaluation criterion, and shown that this rate is well correlated with intelligibility [24]. The HIT rate is the percentage of correctly classified target-dominant T-F units in the IBM. The FA rate is the percentage of wrongly classified interference-dominant T-F units in the IBM. Therefore, we use HIT-FA as our main evaluation criterion.
TABLE I
SEGREGATION PERFORMANCE FOR SINGLE FEATURES IN THE MATCHED-NOISE CONDITION. BOLDFACE INDICATES BEST RESULT. "*" INDICATES THE RESULT IS SIGNIFICANTLY BETTER THAN AMS AT A 5% SIGNIFICANCE LEVEL

TABLE II
SEGREGATION PERFORMANCE FOR SINGLE FEATURES IN THE UNMATCHED-NOISE CONDITION
Another criterion is the IBM-modulated SNR of the segregated speech. When computing SNRs, the target speech resynthesized from the IBM is used as the ground truth signal [15], [17], as the IBM represents the ground truth of classification. This IBM-modulated SNR complements the above classification-based criteria by taking into account the underlying signal energy of each T-F unit.

We should note that other evaluation criteria have been developed in the speech separation community, including SNR and source-to-distortion ratio (SDR). Unlike the IBM, which is directly motivated by the auditory masking phenomenon, SNR and SDR do not take perceptual effects into consideration. Also, it is well known that SNR may not correlate with speech intelligibility, and the relationship between SDR and speech intelligibility is still unknown. Because of its correlation with speech intelligibility, we prefer the HIT-FA rate over SNR and SDR.
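Both criteria are straightforward to compute from an estimated mask and the IBM; a minimal sketch (our own helper names):

```python
import numpy as np

def hit_fa(est_mask, ibm):
    """HIT, FA, and HIT-FA (in percent) of an estimated mask vs. the IBM."""
    tgt, itf = (ibm == 1), (ibm == 0)
    hit = 100.0 * np.count_nonzero(est_mask[tgt] == 1) / max(tgt.sum(), 1)
    fa = 100.0 * np.count_nonzero(est_mask[itf] == 1) / max(itf.sum(), 1)
    return hit, fa, hit - fa

def ibm_modulated_snr(est_speech, ibm_speech):
    """Output SNR (dB) with the IBM-resynthesized target as ground truth."""
    eps = np.finfo(float).eps
    noise = ibm_speech - est_speech
    return 10.0 * np.log10(np.sum(ibm_speech**2) / (np.sum(noise**2) + eps))
```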
C. Single Features

In terms of HIT-FA, we document unit labeling performance at three levels: voiced speech intervals (pitched frames), unvoiced speech intervals (unpitched frames), and overall. Voiced/unvoiced speech intervals are determined by ground truth pitch. Both classification accuracy and SNR are evaluated at the overall level. Table I gives the results in the matched-noise test condition. In this condition, all features are able to maintain a low FA rate. The performance differences mainly stem from the HIT rate. Clearly, AMS does not perform well compared with the other features, as it fails to label many target-dominant units. In contrast, GFCC manages to achieve high HIT rates, with 79% overall HIT-FA, which is significantly better than the other single features. The classification accuracy and SNR using GFCC are also significantly higher than those obtained by the other features (except MFCC in terms of SNR). Unvoiced speech is important to speech intelligibility, and its segregation is a difficult task due to the lack of harmonicity and weak energy [16]. Again, AMS performs the worst, whereas GFCC does a very good job at segregating unvoiced speech. The good performance of GFCC is probably due to its effectiveness as a speaker identification feature [31]. An encouraging observation in the matched-noise condition is that some general acoustic features such as GFCC and MFCC significantly outperform PITCH even in voiced intervals. This remains true even when ground truth pitch is used in (2), which achieves 72% HIT-FA in voiced intervals. Similarly, the tandem algorithm, which includes auditory segmentation, is not competitive. For systematic comparison, we have produced the receiver operating characteristic (ROC) curves for overall classification obtained by using single features; interested readers are referred to our technical report [38].
Unlike the matched-noise condition, the unseen broadband noises are more demanding for generalization. The segregation results in the unmatched-noise condition are listed in Table II. We can see that the classification accuracy and both the HIT rate and the FA rate are affected, and the main degradation comes from substantially increased FA rates. Contrary to the other features, PITCH is the least affected feature type, with only a 5% reduction in HIT-FA. Using ground truth pitch, it is able to achieve 68% HIT-FA in voiced intervals. As the pitch-based features reflect intrinsic properties of speech, we do not expect that the change of interference will dramatically change pitch characteristics in target-dominant T-F units. Similarly, the tandem algorithm obtains a fairly low FA rate and achieves the best HIT-FA result in voiced intervals in this condition. Among the others, it is interesting to see that RASTA-PLP becomes the best performing feature type in terms of all three criteria. As shown in [13], RASTA-PLP effectively acts as a modulation-frequency filter, which retains slow modulations corresponding to speech.

We have used Student's t-tests at a 5% significance level to examine whether an improvement is statistically significant. We use the symbol "*" to denote that a result is significantly better than the previously studied AMS feature. As can be seen in Tables I and II, almost all the improvements are statistically significant.

References
[4] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 2, pp. 113–120, 1979.
[12] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738–1752, 1990.
[34] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc. Ser. B, vol. 58, no. 1, pp. 267–288, 1996.
[41] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Roy. Statist. Soc. Ser. B, vol. 68, no. 1, pp. 49–67, 2006.