
236    IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-32, NO. 2, APRIL 1984
Signal Estimation from Modified Short-Time Fourier Transform

DANIEL W. GRIFFIN AND JAE S. LIM, SENIOR MEMBER, IEEE
Abstract-In this paper, we present an algorithm to estimate a signal from its modified short-time Fourier transform (STFT). This algorithm is computationally simple and is obtained by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT. Using this algorithm, we also develop an iterative algorithm to estimate a signal from its modified STFT magnitude. The iterative algorithm is shown to decrease, in each iteration, the mean squared error between the STFT magnitude of the estimated signal and the modified STFT magnitude. The major computation involved in the iterative algorithm is the discrete Fourier transform (DFT) computation, and the algorithm appears to be real-time implementable with current hardware technology. The algorithm developed in this paper has been applied to the time-scale modification of speech. The resulting system generates very high-quality speech, and appears to be better in performance than any existing method.
I. INTRODUCTION

In a number of practical applications [1]-[5], it is desirable to modify the short-time Fourier transform (STFT) or the short-time Fourier transform magnitude (STFTM) and then estimate the processed signal from the modified STFT (MSTFT) or the modified STFTM (MSTFTM). For example, in speech enhancement by spectral subtraction [2], [3], the STFT is modified by combining the STFT phase of the degraded speech with a MSTFTM, and then a signal is reconstructed from the MSTFT. As another example, in the time-scale modification of speech, one approach is to modify the STFTM and then to reconstruct a signal from the MSTFTM. In most applications, including the two cited above, the MSTFT or MSTFTM is not valid in the sense that no signal has the MSTFT or MSTFTM, and therefore it is important to develop algorithms to estimate a signal whose STFT or STFTM is close in some sense to the MSTFT or MSTFTM. Previous approaches to this problem have been mostly heuristic [6]-[8], and have been limited to estimating a signal from the MSTFT [6], [7]. In this paper, we develop new algorithms based on theoretical grounds to estimate a signal from the MSTFT or the MSTFTM. In addition, the new algorithm is applied to the problem of time-scale modification of speech. The resulting system is considerably simpler conceptually and appears to have better performance than the system described by Portnoff [1].
The paper is organized as follows. In Section II, we develop an algorithm to estimate a signal from the MSTFT by minimizing the mean squared error between the STFT of the estimated signal and the MSTFT. The resulting algorithm is quite simple computationally. In Section III, the algorithm in Section II is used to develop an iterative algorithm that estimates a signal from the MSTFTM. The iterative algorithm is shown to decrease, in each iteration, the mean squared error between the STFTM of the estimated signal and the MSTFTM. In Section IV, we present an example of the successful application of our theoretical results. Specifically, we develop a time-scale speech modification system by modifying the STFTM first and then estimating a signal from the MSTFTM using the algorithm developed in Section III. The resulting system has been demonstrated to generate very high quality, time-scale modified speech.

Manuscript received December 27, 1982; revised May 12, 1983, and September 26, 1983. This work was supported in part by the Advanced Research Projects Agency monitored by ONR under Contract N00014-81-K-0742 NR-049-509 and the National Science Foundation under Grant ECS80-07102.
The authors are with the Research Laboratory of Electronics, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139.
II. SIGNAL ESTIMATION FROM MODIFIED SHORT-TIME FOURIER TRANSFORM
Let x(n) and X_w(mS, ω) denote a real sequence and its STFT. The variable S is a positive integer, which represents the sampling period of X_w(n, ω) in the variable n. Let the analysis window used in the STFT be denoted by w(n), and with little loss of generality, w(n) is assumed to be real, L points long, and nonzero for 0 ≤ n ≤ L - 1. From the definition of the STFT,

    X_w(mS, \omega) = F_l[x_w(mS, l)] = \sum_{l=-\infty}^{\infty} x_w(mS, l)\, e^{-j\omega l}    (1)

where

    x_w(mS, l) = w(mS - l)\, x(l)    (2)

and F_l[x_w(mS, l)] represents the Fourier transform of x_w(mS, l) with respect to the variable l.
Let Y_w(mS, ω) denote the given MSTFT and let y_w(mS, l) be given by

    y_w(mS, l) = \frac{1}{2\pi} \int_{\omega=-\pi}^{\pi} Y_w(mS, \omega)\, e^{j\omega l}\, d\omega.    (3)
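Sampled at N DFT frequencies, (1)-(3) map directly onto FFT routines. The following NumPy sketch is my own illustration, not code from the paper; the frame-indexing convention (frame m covering x[mS : mS + L]) is chosen for clarity and differs from the paper's w(mS - l) convention only in how frames are indexed:

```python
import numpy as np

def stft(x, w, S, nfft=None):
    # Sampled STFT of eq. (1): one DFT per windowed frame, frames S apart.
    L = len(w)
    nfft = nfft or L
    n_frames = 1 + (len(x) - L) // S
    frames = np.stack([w * x[m * S:m * S + L] for m in range(n_frames)])
    return np.fft.rfft(frames, n=nfft, axis=1)   # row m holds X_w(mS, omega)

def istft_frames(X, L):
    # Eq. (3): per-frame inverse transform giving y_w(mS, l).
    # No overlap-add is done here; combining the frames is the subject of (6).
    return np.fft.irfft(X, axis=1)[:, :L]
```

Each row of the result is one short-time spectrum; inverting row by row recovers the windowed segments exactly, which is the starting point for the synthesis formula (6).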
An arbitrary Y_w(mS, ω), in general, is not a valid STFT in the sense that there is no sequence whose STFT is given by Y_w(mS, ω). In this section, we develop a new algorithm to estimate a sequence x(n) whose STFT X_w(mS, ω) is closest to Y_w(mS, ω) in the squared error sense.
Consider the following distance measure between x(n) and a given MSTFT Y_w(mS, ω):

    D[x(n), Y_w(mS, \omega)] = \sum_{m=-\infty}^{\infty} \frac{1}{2\pi} \int_{\omega=-\pi}^{\pi} |X_w(mS, \omega) - Y_w(mS, \omega)|^2\, d\omega.    (4)
0096-3518/84/0400-0236$01.00 © 1984 IEEE

The distance measure in (4) is the squared error between X_w(mS, ω) and Y_w(mS, ω) integrated over all ω and summed over all m. It has been written as a function of x(n) and Y_w(mS, ω) to emphasize that X_w(mS, ω) is a valid STFT while Y_w(mS, ω) is not necessarily a valid STFT. By Parseval's theorem, (4) can be written as

    D[x(n), Y_w(mS, \omega)] = \sum_{m=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} \left[ x_w(mS, l) - y_w(mS, l) \right]^2.    (5)

Since (5) is in the quadratic form of x(n), minimization of D[x(n), Y_w(mS, ω)] can be accomplished by setting the gradient with respect to x(n) to zero and solving for x(n), which leads to the following result:

    x(n) = \frac{\sum_{m=-\infty}^{\infty} w(mS - n)\, y_w(mS, n)}{\sum_{m=-\infty}^{\infty} w^2(mS - n)}.    (6)

This solution is similar in form to the standard overlap-add procedure [6], [7], or the weighted overlap-add procedure [9], [10]. The overlap-add procedure can be expressed as

    x(n) = \frac{\sum_{m=-\infty}^{\infty} y_w(mS, n)}{\sum_{m=-\infty}^{\infty} w(mS - n)}.    (7)

The weighted overlap-add procedure can be expressed as

    x(n) = \sum_{m=-\infty}^{\infty} f(mS - n)\, y_w(mS, n)    (8)

for some "synthesis" filter f(n). The major difference between (6) and (7) is that (6) specifies that y_w(mS, n) should be windowed with the analysis window before being overlap added and w(mS - n) should be squared before summation over the variable m for normalization. The difference between (6) and (8) is that (6) explicitly specifies what f(n) is and has the normalization constant. In addition, the major difference between (6), and (7) and (8), is that (6) was theoretically derived explicitly for the purpose of estimating a signal from the MSTFT based on the least squares error criterion of (4). Equations (7) and (8), however, were derived to reconstruct a signal from its exact STFT or to estimate a signal from the MSTFT for a very restricted class of modifications, and were sometimes used as ad hoc methods to estimate a signal from the MSTFT. From the computational point of view, the differences cited above are minor in terms of both the number of arithmetic operations and the amount of on-line storage required. For example, (6) can be implemented with little on-line storage and delay, in the same manner [10] as the standard overlap-add procedure of (7) or the weighted overlap-add procedure of (8). Since the algorithm represented by (6) minimizes the distance measure of (4), it will be referred to as LSEE-MSTFT, meaning least squares error estimation from the MSTFT.

In the standard overlap-add method, the window is usually normalized so that \sum_{m=-\infty}^{\infty} w(mS - n) is unity for all n in order to reduce computation. As in the overlap-add method, the window in (6) can be normalized so that \sum_{m=-\infty}^{\infty} w^2(mS - n) is unity for all n. Any nonzero window can be normalized in this manner for maximum window overlap (S = 1). For partial window overlap, however, the window is more restricted. Several windows which have this property for partial window overlap are discussed below.

When the window shift (S) divides the window length (L) evenly, the rectangular window defined by

    w_r(n) = \sqrt{S/L} \quad \text{for } 0 \le n \le L - 1, \text{ and } 0 \text{ otherwise}    (9)

has the property

    \sum_{m=-\infty}^{\infty} w_r^2(mS - n) = \sum_{m=0}^{(L/S)-1} \frac{S}{L} = 1.    (10)

We can further show with some algebra that if the window length (L) is a multiple of four times the window shift (S), then the sinusoidal window defined by

    w_s(n) = \frac{2\, w_r(n)}{\sqrt{4a^2 + 2b^2}} \left[ a + b \cos\!\left( \frac{2\pi n}{L} + \phi \right) \right] \quad \text{for } 0 \le n \le L - 1, \text{ and } 0 \text{ otherwise}    (11)

has the property given by (10). In addition, we require that this class of sinusoidal windows be symmetric so that w(n) = w(L - 1 - n). This requirement can be satisfied by choosing φ = π/L. By choosing values for a and b, windows similar to the Hamming window and the Hanning window can be obtained. Thus, the modified Hamming window used for time-scale modification of speech in Section IV will be defined as (11) for a = 0.54, b = -0.46, and φ = π/L. The major difference between this definition and the standard definition of the Hamming window is that the period of the sine wave is L in the modified Hamming window as opposed to L - 1 for the standard Hamming window. Similarly, a modified Hanning window can be defined as (11) for a = 0.5, b = -0.5, and φ = π/L. Use of these modified windows eliminates the need for normalizing by \sum_{m=-\infty}^{\infty} w^2(mS - n) in (6), which reduces computation and/or storage requirements for partial window overlap.

Estimating x(n) based on (6) minimizes the squared error between X_w(mS, ω) and Y_w(mS, ω), and therefore can be used directly to estimate a sequence from a MSTFT. As will be discussed in the next section, (6) can also be used to develop an iterative algorithm that estimates a signal from the MSTFTM.
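As a concrete illustration of the synthesis formula (6) and of a window with the normalization property (10)-(11), here is a NumPy sketch (my own code and naming, not from the paper):

```python
import numpy as np

def modified_hamming(L, S):
    # Eq. (11) with a = 0.54, b = -0.46, phi = pi/L and w_r(n) = sqrt(S/L).
    # For L a multiple of 4S, sum_m w^2(mS - n) = 1 at every sample (eq. (10)).
    a, b = 0.54, -0.46
    n = np.arange(L)
    wr = np.sqrt(S / L)
    return 2 * wr / np.sqrt(4 * a**2 + 2 * b**2) * (
        a + b * np.cos(2 * np.pi * n / L + np.pi / L))

def lsee_mstft(y_frames, w, S, n_out):
    # Eq. (6): x(n) = sum_m w(mS-n) y_w(mS, n) / sum_m w^2(mS-n).
    # y_frames[m] holds y_w(mS, l), the inverse DFT of the given MSTFT frame.
    L = len(w)
    num = np.zeros(n_out)
    den = np.zeros(n_out)
    for m, frame in enumerate(y_frames):
        sl = slice(m * S, m * S + L)
        num[sl] += w * frame      # window again before overlap-adding
        den[sl] += w * w          # squared-window normalization
    den[den == 0] = 1.0           # samples covered by no window stay zero
    return num / den
```

With L = 4S the denominator in (6) is identically one over the covered samples, so for a modified Hamming window the division can be skipped, as the text notes.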
III. SIGNAL ESTIMATION FROM MODIFIED STFT MAGNITUDE
In this section, we consider the problem of estimating x(n) from the modified STFT magnitude |Y_w(mS, ω)|. The algorithm we develop is an iterative procedure based on the LSEE-MSTFT algorithm which is similar in style to several other iterative algorithms [11], [12]. In this algorithm, the squared error between |X_w(mS, ω)| and |Y_w(mS, ω)| is decreased in each iteration. Let x^i(n) denote the estimated x(n) after the ith iteration. The (i+1)st estimate x^{i+1}(n) is obtained by taking the STFT of x^i(n), replacing the magnitude of X_w^i(mS, ω) with the given magnitude |Y_w(mS, ω)|, and then finding the signal with STFT closest to this modified STFT using (6). The iterative algorithm, which is illustrated in Fig. 1, results in the following update equation:

    x^{i+1}(n) = \frac{\sum_{m=-\infty}^{\infty} w(mS - n)\, \hat{x}_w^i(mS, n)}{\sum_{m=-\infty}^{\infty} w^2(mS - n)}    (12)

where \hat{x}_w^i(mS, l) is the inverse Fourier transform, with respect to ω, of

    \hat{X}_w^i(mS, \omega) = X_w^i(mS, \omega)\, \frac{|Y_w(mS, \omega)|}{|X_w^i(mS, \omega)|}.    (13)

In (13), if |X_w^i(mS, ω)| = 0, then \hat{X}_w^i(mS, ω) is set to |Y_w(mS, ω)|.

Fig. 1. LSEE-MSTFTM algorithm.

It can be shown (see Appendix) that the algorithm in Fig. 1 decreases in each iteration the following distance measure:

    D_M[x(n), |Y_w(mS, \omega)|] = \sum_{m=-\infty}^{\infty} \frac{1}{2\pi} \int_{\omega=-\pi}^{\pi} \left[\, |X_w(mS, \omega)| - |Y_w(mS, \omega)| \,\right]^2 d\omega.    (14)

It can also be shown (see Appendix) that the algorithm always converges to a set consisting of the critical points of the distance measure D_M as a function of x(n). This algorithm will be referred to as LSEE-MSTFTM.
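One possible NumPy rendering of the Fig. 1 iteration (a sketch under my own naming and conventions, not the authors' code; the 2L-point default FFT length mirrors the 512-point FFT's the paper uses with 256-point windows):

```python
import numpy as np

def lsee_mstftm(Ymag, w, S, n_out, n_iter=100, nfft=None, seed=0):
    # Each pass: STFT the current estimate, impose the given magnitude
    # |Y_w(mS, omega)| while keeping the current phase (eq. (13)), then
    # resynthesize with the LSEE-MSTFT formula (eq. (6)).
    L = len(w)
    nfft = nfft or 2 * L
    M = Ymag.shape[0]
    x = np.random.default_rng(seed).standard_normal(n_out)  # noise init
    for _ in range(n_iter):
        frames = np.stack([w * x[m * S:m * S + L] for m in range(M)])
        X = np.fft.rfft(frames, n=nfft, axis=1)
        mag = np.abs(X)
        # eq. (13); where |X| = 0 the modified STFT is set to |Y| itself
        phase = np.where(mag > 0, X / np.where(mag > 0, mag, 1.0), 1.0 + 0j)
        y = np.fft.irfft(phase * Ymag, axis=1)[:, :L]
        num = np.zeros(n_out)
        den = np.zeros(n_out)
        for m in range(M):
            sl = slice(m * S, m * S + L)
            num[sl] += w * y[m]
            den[sl] += w * w
        x = num / np.maximum(den, 1e-12)   # eq. (6)
    return x
```

Tracking the DFT-domain analogue of D_M across passes shows the monotone decrease the section proves.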
It is possible to develop ad hoc methods to estimate x(n) from the MSTFTM by modifying the iterative algorithm in Fig. 1. For example, suppose we use in one step of the iterative procedure the standard overlap-add method rather than the LSEE-MSTFT method in obtaining the next estimate x^{i+1}(n) from the MSTFT \hat{X}_w^i(mS, ω). This results in the following update equation:

    x^{i+1}(n) = \frac{\sum_{m=-\infty}^{\infty} \hat{x}_w^i(mS, n)}{\sum_{m=-\infty}^{\infty} w(mS - n)}    (15)

where \hat{X}_w^i(mS, ω) is given by (13). This algorithm will be called OA (overlap-add)-MSTFTM to distinguish it from the LSEE-MSTFTM algorithm. Although OA-MSTFTM requires fewer multiplications per iteration since one less windowing step is required, it is not guaranteed to converge to the critical points of D_M. As will be shown in Section IV, however, OA-MSTFTM does appear to reduce D_M enough to produce a reasonable signal estimate for the purposes of time-scale modification of speech.
IV. TIME-SCALE MODIFICATION OF SPEECH

One method of decomposing a speech signal y(n) is to represent it as the convolution of an excitation function with the vocal tract impulse response. Consequently, the STFT magnitude of this speech signal |Y_w(mS, ω)| can be written as the product of a component due to the excitation function |P_w(mS, ω)| and a component due to the vocal tract impulse response |H_w(mS, ω)|. This decomposition is valid if the analysis window is long enough to include several vocal tract impulse responses and short enough so that the speech signal is approximately stationary over the window length. Under these conditions, the function |P_w(mS, ω)| will correspond to the rapidly varying portion of |Y_w(mS, ω)| with ω, taking on an harmonic structure for voiced speech or noise for unvoiced speech. The function |H_w(mS, ω)| will correspond to the slowly varying portion of |Y_w(mS, ω)| with ω, and will include the formant information of the speech signal. Since the speech signal is assumed to be approximately stationary over the window length, |P_w(mS, ω)| and |H_w(mS, ω)| will change slowly with the time index mS as the pitch period and vocal tract impulse response change.
The goal of time-scale modification is to modify the rate at which |P_w(mS, ω)| and |H_w(mS, ω)| vary with time, and hence the rate at which |Y_w(mS, ω)| varies with time, without affecting the spectral characteristics. This can be accomplished by estimating a signal with STFT magnitude close to a time-scale modified version of |Y_w(mS, ω)|. A time-scale modification of S1:S2 can be performed by calculating |Y_w(mS1, ω)| at the window shift S1 and X_w^i(mS2, ω) at the window shift S2 in the LSEE-MSTFTM or OA-MSTFTM algorithms. For example, |Y_w(mS1, ω)| for the sentence "line up at the screen door," sampled at 10 kHz, is shown in Fig. 2 for a 256-point modified Hamming window and a window shift S1 of 128. Fig. 3(a) shows a 128:64 time-scale modified version of |Y_w(mS1, ω)| produced by displaying these samples of |Y_w(mS1, ω)| with a spacing of 64 samples instead of 128 samples. A signal with STFTM close to this MSTFTM was estimated by starting with an initial white Gaussian noise sequence and then iterating with LSEE-MSTFTM until the distance measure D_M was decreased to the desired level. The Fourier transforms in the algorithm were implemented with 512-point FFT's. Fig. 3(b) shows |X_w^i(mS2, ω)| for S2 = 64 after 100 iterations. Similarly, Fig. 3(c) shows |X_w^i(mS2, ω)| after 100 iterations of the OA-MSTFTM algorithm using the same initial estimate. Comparisons of Fig. 3(b) and 3(c) with Fig. 3(a) indicate that the STFTM of the signal estimate is very close to the desired MSTFTM and that the performance of LSEE-MSTFTM and OA-MSTFTM is similar. In Fig. 4, the distance measure D_M is shown as a function of the number of iterations for LSEE-MSTFTM and OA-MSTFTM.

Fig. 2. STFTM of "line up at the screen door."
Fig. 3. (a) 128:64 time-scale compressed STFTM of original speech. (b) STFTM of LSEE-MSTFTM estimate. (c) STFTM of OA-MSTFTM estimate.

Although OA-MSTFTM performs somewhat better during the initial iterations, LSEE-MSTFTM eventually surpasses it. This same performance difference was noted in all of the examples where these two methods were compared. In addition, LSEE-MSTFTM was observed to always decrease D_M, whereas OA-MSTFTM usually stopped decreasing D_M after about 100 iterations and in some cases increased D_M as more iterations were performed.
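The S1:S2 procedure used in these examples can be sketched compactly as follows (my own NumPy illustration, not the authors' code; the zero-magnitude case of (13) is handled crudely here by a small floor):

```python
import numpy as np

def timescale(x, w, S1, S2, n_iter=30, seed=0):
    # Measure |Y_w(m S1, omega)| at analysis shift S1, then iterate
    # LSEE-MSTFTM at synthesis shift S2 so the estimate's STFTM matches it.
    L = len(w)
    M = 1 + (len(x) - L) // S1
    Ymag = np.abs(np.fft.rfft(
        np.stack([w * x[m * S1:m * S1 + L] for m in range(M)]), axis=1))
    n_out = (M - 1) * S2 + L                  # duration rescaled by S2/S1
    y = np.random.default_rng(seed).standard_normal(n_out)
    for _ in range(n_iter):
        F = np.fft.rfft(
            np.stack([w * y[m * S2:m * S2 + L] for m in range(M)]), axis=1)
        phase = F / np.maximum(np.abs(F), 1e-12)     # keep current phase
        fr = np.fft.irfft(phase * Ymag, axis=1)[:, :L]
        num = np.zeros(n_out)
        den = np.zeros(n_out)
        for m in range(M):
            sl = slice(m * S2, m * S2 + L)
            num[sl] += w * fr[m]
            den[sl] += w * w
        y = num / np.maximum(den, 1e-12)             # eq. (6) synthesis
    return y

# e.g. a 2:1 compression of speech x with a 256-point window:
# y = timescale(x, some_window_of_length_256, S1=128, S2=64)
```

The same skeleton covers noninteger ratios such as the 35:64 expansion below, since only the two shifts change.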
To show that these methods perform as well for noninteger compression or expansion factors, the second example shows a 35:64 expansion. Fig. 5(a) shows a 35:64 time-scale modified version of |Y_w(mS1, ω)| calculated from the original speech signal. As in the first example, the initial estimate was a white Gaussian noise sequence. Fig. 5(b) and 5(c) show the STFTM of the signal estimate after 100 iterations using a 256-point modified Hamming window for LSEE-MSTFTM and OA-MSTFTM, respectively. In both of these examples, the resultant signal estimate was clean high quality speech and the estimates produced by LSEE-MSTFTM and OA-MSTFTM were indistinguishable in listening tests.
The final example consists of a 1:2 time-scale expansion of the 2:1 time-scale compressed speech generated in the first example. The STFTM of the signal estimates produced are then compared with the STFTM of the original speech signal. Fig. 6(a) and 6(b) show the STFTM of the signal estimates after 100 iterations of LSEE-MSTFTM and OA-MSTFTM, respectively. Comparisons of Fig. 6(a) and 6(b) with Fig. 2 show that both LSEE-MSTFTM and OA-MSTFTM produce a signal estimate with STFTM close to the STFTM of the original speech signal. The primary difference between these signal estimates and the original speech signal is that a small amount of reverberation is detectable in the signal estimate due to the nonstationarity of the 2:1 time-scale compressed speech over the window length.

Fig. 4. D_M versus iteration number of LSEE-MSTFTM and OA-MSTFTM.
In addition to the above three examples, other speech material including noisy speech has been processed by the two methods at various compression and expansion ratios. Informal listening appears to indicate that the performance of these methods is superior to that of the system by Portnoff [1]. It should be noted that this approach to time-scale modification of speech differs considerably from that of Portnoff. In Portnoff's method, the phase of Y_w(mS, ω) is explicitly obtained by phase unwrapping, which is undesirable due to various considerations including the computational aspect. In the LSEE-MSTFTM or OA-MSTFTM algorithms, the phase of Y_w(mS, ω) is implicitly estimated in the process of estimating a signal with STFTM close to |Y_w(mS, ω)| and no phase unwrapping is performed.
Even though we used a large number of iterations (100) for the examples illustrated in this paper, we have observed that essentially the same results in terms of speech quality can be obtained after 25 to 100 iterations. In addition, we have observed that speech quality improves rapidly initially and then more slowly as the number of iterations increases. This is evidenced, to some extent, in Fig. 4, where D_M decreases rapidly initially but more slowly as the number of iterations increases. With a better choice of the initial estimate of x(n) than a Gaussian noise sequence, it may be possible to reduce the number of iterations required to achieve a certain performance.
Despite the large number of iterations¹ required, real-time² implementation appears possible if enough processors are used in series. Specifically, as input data are received, the ith processor can perform the ith iteration and the (i+1)st processor which follows the ith processor can perform the (i+1)st iteration. The inherent delay associated with each iteration is only the length of the analysis window, L data points. This is due to the fact that the computational aspect of each iteration of the algorithm is essentially the same as the weighted overlap-add method [10], in which the delay between the input and output data is L points, assuming the required computation for each windowed data segment can be performed during the time corresponding to the window shift, S data points. As an example that illustrates the computational requirements and delay involved, suppose S1 = S2 = 64, L = 256, the size of the DFT used is 512, the number of iterations required and the number of processors available is 50, and speech is sampled at a 10 kHz rate. Since the major computations involved in the algorithm are due to the DFT and IDFT, if each processor can compute two 512-point DFT's once every 6.4 ms, then the iterative algorithm can be implemented in real time with a delay of about 1.3 s. Current hardware technology is more than capable of handling such computational requirements, and a delay of a few seconds is not a serious problem in most applications of time-scale modification of speech.

¹Due to iterations, the total number of computations is considerably larger than Portnoff's method [1]. In a multiprocessor environment, however, the computational requirement of each processor is comparable or perhaps less than that of Portnoff's method.

²The definition of "real time" for time-scale modification depends on the application. In applications where the input to the algorithm is from some storage device and the output is converted to an analog signal which the user listens to, the algorithm must produce one output sample in an average time less than T1, where T1 is the sampling period associated with the digital-to-analog converter used to generate output speech. In applications where the input to the algorithm is digitized directly from the user's speech and the output is placed on some storage device, the algorithm must process an input data sample in an average time less than T2, where T2 is the sampling period associated with the analog-to-digital converter used to digitize the input speech.

Fig. 5. (a) 35:64 time-scale expanded STFTM of original speech. (b) STFTM of LSEE-MSTFTM estimate. (c) STFTM of OA-MSTFTM estimate.

Fig. 6. 1:2 expansion of 2:1 compressed speech for (a) LSEE-MSTFTM and (b) OA-MSTFTM.
Even though LSEE-MSTFTM and OA-MSTFTM had similar

References

R. W. Gerchberg, "A practical algorithm for the determination of phase from image and diffraction plane pictures," 1972.
L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals.
J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech."
J. B. Allen, "Short term spectral analysis, synthesis, and modification by discrete Fourier transform."
M. R. Portnoff, "Time-frequency representation of digital signals and systems based on short-time Fourier analysis."