
236    IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-32, NO. 2, APRIL 1984
Signal Estimation from Modified Short-Time Fourier Transform

DANIEL W. GRIFFIN AND JAE S. LIM, SENIOR MEMBER, IEEE
Abstract-In this paper, we present an algorithm to estimate a signal from its modified short-time Fourier transform (STFT). This algorithm is computationally simple and is obtained by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT. Using this algorithm, we also develop an iterative algorithm to estimate a signal from its modified STFT magnitude. The iterative algorithm is shown to decrease, in each iteration, the mean squared error between the STFT magnitude of the estimated signal and the modified STFT magnitude. The major computation involved in the iterative algorithm is the discrete Fourier transform (DFT) computation, and the algorithm appears to be real-time implementable with current hardware technology. The algorithm developed in this paper has been applied to the time-scale modification of speech. The resulting system generates very high-quality speech, and appears to be better in performance than any existing method.
I. INTRODUCTION

In a number of practical applications [1]-[5], it is desirable to modify the short-time Fourier transform (STFT) or the short-time Fourier transform magnitude (STFTM) and then estimate the processed signal from the modified STFT (MSTFT) or the modified STFTM (MSTFTM). For example, in speech enhancement by spectral subtraction [2], [3], the STFT is modified by combining the STFT phase of the degraded speech with a MSTFTM, and then a signal is reconstructed from the MSTFT. As another example, in the time-scale modification of speech, one approach is to modify the STFTM and then to reconstruct a signal from the MSTFTM. In most applications, including the two cited above, the MSTFT or MSTFTM is not valid in the sense that no signal has the MSTFT or MSTFTM, and therefore it is important to develop algorithms to estimate a signal whose STFT or STFTM is close in some sense to the MSTFT or MSTFTM. Previous approaches to this problem have been mostly heuristic [6]-[8], and have been limited to estimating a signal from the MSTFT [6], [7]. In this paper, we develop new algorithms based on theoretical grounds to estimate a signal from the MSTFT or the MSTFTM. In addition, the new algorithm is applied to the problem of time-scale modification of speech. The resulting system is considerably simpler conceptually and appears to have better performance than the system described by Portnoff [1].
The paper is organized as follows. In Section II, we develop an algorithm to estimate a signal from the MSTFT by minimizing the mean squared error between the STFT of the estimated signal and the MSTFT. The resulting algorithm is quite simple computationally. In Section III, the algorithm in Section II is used to develop an iterative algorithm that estimates a signal from the MSTFTM. The iterative algorithm is shown to decrease, in each iteration, the mean squared error between the STFTM of the estimated signal and the MSTFTM. In Section IV, we present an example of the successful application of our theoretical results. Specifically, we develop a time-scale speech modification system by modifying the STFTM first and then estimating a signal from the MSTFTM using the algorithm developed in Section III. The resulting system has been demonstrated to generate very high quality, time-scale modified speech.

Manuscript received December 27, 1982; revised May 12, 1983, and September 26, 1983. This work was supported in part by the Advanced Research Projects Agency monitored by ONR under Contract N00014-81-K-0742 NR-049-509 and the National Science Foundation under Grant ECS80-07102.
The authors are with the Research Laboratory of Electronics, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139.
II. SIGNAL ESTIMATION FROM MODIFIED SHORT-TIME FOURIER TRANSFORM
Let x(n) and X_w(mS, ω) denote a real sequence and its STFT. The variable S is a positive integer, which represents the sampling period of X_w(n, ω) in the variable n. Let the analysis window used in the STFT be denoted by w(n), and with little loss of generality, w(n) is assumed to be real, L points long, and nonzero for 0 ≤ n ≤ L - 1. From the definition of the STFT,

    X_w(mS, \omega) = F_l[x_w(mS, l)] = \sum_{l=-\infty}^{\infty} x_w(mS, l)\, e^{-j\omega l}    (1)

where

    x_w(mS, l) = w(mS - l)\, x(l)    (2)

and F_l[x_w(mS, l)] represents the Fourier transform of x_w(mS, l) with respect to the variable l.
Let Y_w(mS, ω) denote the given MSTFT and let y_w(mS, l) be given by

    y_w(mS, l) = \frac{1}{2\pi} \int_{\omega=-\pi}^{\pi} Y_w(mS, \omega)\, e^{j\omega l}\, d\omega.    (3)
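Sampled at N DFT frequencies, (1)-(3) map directly onto FFT routines. The following NumPy sketch is my own illustration, not code from the paper; the frame-indexing convention (frame m covering x[mS : mS + L]) is chosen for clarity and differs from the paper's w(mS - l) convention only in how frames are indexed:

```python
import numpy as np

def stft(x, w, S, nfft=None):
    # Sampled STFT of eq. (1): one DFT per windowed frame, frames S apart.
    L = len(w)
    nfft = nfft or L
    n_frames = 1 + (len(x) - L) // S
    frames = np.stack([w * x[m * S:m * S + L] for m in range(n_frames)])
    return np.fft.rfft(frames, n=nfft, axis=1)   # row m holds X_w(mS, omega)

def istft_frames(X, L):
    # Eq. (3): per-frame inverse transform giving y_w(mS, l).
    # No overlap-add is done here; combining the frames is the subject of (6).
    return np.fft.irfft(X, axis=1)[:, :L]
```

Each row of the result is one short-time spectrum; inverting row by row recovers the windowed segments exactly, which is the starting point for the synthesis formula (6).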
An arbitrary Y_w(mS, ω), in general, is not a valid STFT in the sense that there is no sequence whose STFT is given by Y_w(mS, ω). In this section, we develop a new algorithm to estimate a sequence x(n) whose STFT X_w(mS, ω) is closest to Y_w(mS, ω) in the squared error sense.
Consider the following distance measure between x(n) and a given MSTFT Y_w(mS, ω):

    D[x(n), Y_w(mS, \omega)] = \sum_{m=-\infty}^{\infty} \frac{1}{2\pi} \int_{\omega=-\pi}^{\pi} |X_w(mS, \omega) - Y_w(mS, \omega)|^2\, d\omega.    (4)
0096-3518/84/0400-0236$01.00 © 1984 IEEE

The distance measure in (4) is the squared error between X_w(mS, ω) and Y_w(mS, ω) integrated over all ω and summed over all m. It has been written as a function of x(n) and Y_w(mS, ω) to emphasize that X_w(mS, ω) is a valid STFT while Y_w(mS, ω) is not necessarily a valid STFT. By Parseval's theorem, (4) can be written as

    D[x(n), Y_w(mS, \omega)] = \sum_{m=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} \left[ x_w(mS, l) - y_w(mS, l) \right]^2.    (5)

Since (5) is in the quadratic form of x(n), minimization of D[x(n), Y_w(mS, ω)] can be accomplished by setting the gradient with respect to x(n) to zero and solving for x(n), which leads to the following result:

    x(n) = \frac{\sum_{m=-\infty}^{\infty} w(mS - n)\, y_w(mS, n)}{\sum_{m=-\infty}^{\infty} w^2(mS - n)}.    (6)

This solution is similar in form to the standard overlap-add procedure [6], [7], or the weighted overlap-add procedure [9], [10]. The overlap-add procedure can be expressed as

    x(n) = \frac{\sum_{m=-\infty}^{\infty} y_w(mS, n)}{\sum_{m=-\infty}^{\infty} w(mS - n)}.    (7)

The weighted overlap-add procedure can be expressed as

    x(n) = \sum_{m=-\infty}^{\infty} f(mS - n)\, y_w(mS, n)    (8)

for some "synthesis" filter f(n). The major difference between (6) and (7) is that (6) specifies that y_w(mS, n) should be windowed with the analysis window before being overlap added and w(mS - n) should be squared before summation over the variable m for normalization. The difference between (6) and (8) is that (6) explicitly specifies what f(n) is and has the normalization constant. In addition, the major difference between (6), and (7) and (8), is that (6) was theoretically derived explicitly for the purpose of estimating a signal from the MSTFT based on the least squares error criterion of (4). Equations (7) and (8), however, were derived to reconstruct a signal from its exact STFT or to estimate a signal from the MSTFT for a very restricted class of modifications, and were sometimes used as ad hoc methods to estimate a signal from the MSTFT. From the computational point of view, the differences cited above are minor in terms of both the number of arithmetic operations and the amount of on-line storage required. For example, (6) can be implemented with little on-line storage and delay, in the same manner [10] as the standard overlap-add procedure of (7) or the weighted overlap-add procedure of (8). Since the algorithm represented by (6) minimizes the distance measure of (4), it will be referred to as LSEE-MSTFT, meaning least squares error estimation from the MSTFT.

In the standard overlap-add method, the window is usually normalized so that \sum_{m=-\infty}^{\infty} w(mS - n) is unity for all n in order to reduce computation. As in the overlap-add method, the window in (6) can be normalized so that \sum_{m=-\infty}^{\infty} w^2(mS - n) is unity for all n. Any nonzero window can be normalized in this manner for maximum window overlap (S = 1). For partial window overlap, however, the window is more restricted. Several windows which have this property for partial window overlap are discussed below.

When the window shift (S) divides the window length (L) evenly, the rectangular window defined by

    w_r(n) = \sqrt{S/L} \quad \text{for } 0 \le n \le L - 1, \text{ and } 0 \text{ otherwise}    (9)

has the property

    \sum_{m=-\infty}^{\infty} w_r^2(mS - n) = \sum_{m=0}^{(L/S)-1} \frac{S}{L} = 1.    (10)

We can further show with some algebra that if the window length (L) is a multiple of four times the window shift (S), then the sinusoidal window defined by

    w_s(n) = \frac{2\, w_r(n)}{\sqrt{4a^2 + 2b^2}} \left[ a + b \cos\!\left( \frac{2\pi n}{L} + \phi \right) \right] \quad \text{for } 0 \le n \le L - 1, \text{ and } 0 \text{ otherwise}    (11)

has the property given by (10). In addition, we require that this class of sinusoidal windows be symmetric so that w(n) = w(L - 1 - n). This requirement can be satisfied by choosing φ = π/L. By choosing values for a and b, windows similar to the Hamming window and the Hanning window can be obtained. Thus, the modified Hamming window used for time-scale modification of speech in Section IV will be defined as (11) for a = 0.54, b = -0.46, and φ = π/L. The major difference between this definition and the standard definition of the Hamming window is that the period of the sine wave is L in the modified Hamming window as opposed to L - 1 for the standard Hamming window. Similarly, a modified Hanning window can be defined as (11) for a = 0.5, b = -0.5, and φ = π/L. Use of these modified windows eliminates the need for normalizing by \sum_{m=-\infty}^{\infty} w^2(mS - n) in (6), which reduces computation and/or storage requirements for partial window overlap.

Estimating x(n) based on (6) minimizes the squared error between X_w(mS, ω) and Y_w(mS, ω), and therefore can be used directly to estimate a sequence from a MSTFT. As will be discussed in the next section, (6) can also be used to develop an iterative algorithm that estimates a signal from the MSTFTM.
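As a concrete illustration of the synthesis formula (6) and of a window with the normalization property (10)-(11), here is a NumPy sketch (my own code and naming, not from the paper):

```python
import numpy as np

def modified_hamming(L, S):
    # Eq. (11) with a = 0.54, b = -0.46, phi = pi/L and w_r(n) = sqrt(S/L).
    # For L a multiple of 4S, sum_m w^2(mS - n) = 1 at every sample (eq. (10)).
    a, b = 0.54, -0.46
    n = np.arange(L)
    wr = np.sqrt(S / L)
    return 2 * wr / np.sqrt(4 * a**2 + 2 * b**2) * (
        a + b * np.cos(2 * np.pi * n / L + np.pi / L))

def lsee_mstft(y_frames, w, S, n_out):
    # Eq. (6): x(n) = sum_m w(mS-n) y_w(mS, n) / sum_m w^2(mS-n).
    # y_frames[m] holds y_w(mS, l), the inverse DFT of the given MSTFT frame.
    L = len(w)
    num = np.zeros(n_out)
    den = np.zeros(n_out)
    for m, frame in enumerate(y_frames):
        sl = slice(m * S, m * S + L)
        num[sl] += w * frame      # window again before overlap-adding
        den[sl] += w * w          # squared-window normalization
    den[den == 0] = 1.0           # samples covered by no window stay zero
    return num / den
```

With L = 4S the denominator in (6) is identically one over the covered samples, so for a modified Hamming window the division can be skipped, as the text notes.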
III. SIGNAL ESTIMATION FROM MODIFIED STFT MAGNITUDE
In this section, we consider the problem of estimating x(n) from the modified STFT magnitude |Y_w(mS, ω)|. The algorithm we develop is an iterative procedure based on the LSEE-MSTFT algorithm which is similar in style to several other iterative algorithms [11], [12]. In this algorithm, the squared error between |X_w(mS, ω)| and |Y_w(mS, ω)| is decreased in each iteration. Let x^i(n) denote the estimated x(n) after the ith iteration. The (i+1)st estimate x^{i+1}(n) is obtained by taking the STFT of x^i(n), replacing the magnitude of X_w^i(mS, ω) with the given magnitude |Y_w(mS, ω)|, and then finding the signal with STFT closest to this modified STFT using (6). The iterative algorithm, which is illustrated in Fig. 1, results in the following update equation:

    x^{i+1}(n) = \frac{\sum_{m=-\infty}^{\infty} w(mS - n)\, \hat{x}_w^i(mS, n)}{\sum_{m=-\infty}^{\infty} w^2(mS - n)}    (12)

where \hat{x}_w^i(mS, l) is the inverse Fourier transform, with respect to ω, of

    \hat{X}_w^i(mS, \omega) = X_w^i(mS, \omega)\, \frac{|Y_w(mS, \omega)|}{|X_w^i(mS, \omega)|}.    (13)

In (13), if |X_w^i(mS, ω)| = 0, then \hat{X}_w^i(mS, ω) is set to |Y_w(mS, ω)|.

Fig. 1. LSEE-MSTFTM algorithm.

It can be shown (see Appendix) that the algorithm in Fig. 1 decreases in each iteration the following distance measure:

    D_M[x(n), |Y_w(mS, \omega)|] = \sum_{m=-\infty}^{\infty} \frac{1}{2\pi} \int_{\omega=-\pi}^{\pi} \left[\, |X_w(mS, \omega)| - |Y_w(mS, \omega)| \,\right]^2 d\omega.    (14)

It can also be shown (see Appendix) that the algorithm always converges to a set consisting of the critical points of the distance measure D_M as a function of x(n). This algorithm will be referred to as LSEE-MSTFTM.
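One possible NumPy rendering of the Fig. 1 iteration (a sketch under my own naming and conventions, not the authors' code; the 2L-point default FFT length mirrors the 512-point FFT's the paper uses with 256-point windows):

```python
import numpy as np

def lsee_mstftm(Ymag, w, S, n_out, n_iter=100, nfft=None, seed=0):
    # Each pass: STFT the current estimate, impose the given magnitude
    # |Y_w(mS, omega)| while keeping the current phase (eq. (13)), then
    # resynthesize with the LSEE-MSTFT formula (eq. (6)).
    L = len(w)
    nfft = nfft or 2 * L
    M = Ymag.shape[0]
    x = np.random.default_rng(seed).standard_normal(n_out)  # noise init
    for _ in range(n_iter):
        frames = np.stack([w * x[m * S:m * S + L] for m in range(M)])
        X = np.fft.rfft(frames, n=nfft, axis=1)
        mag = np.abs(X)
        # eq. (13); where |X| = 0 the modified STFT is set to |Y| itself
        phase = np.where(mag > 0, X / np.where(mag > 0, mag, 1.0), 1.0 + 0j)
        y = np.fft.irfft(phase * Ymag, axis=1)[:, :L]
        num = np.zeros(n_out)
        den = np.zeros(n_out)
        for m in range(M):
            sl = slice(m * S, m * S + L)
            num[sl] += w * y[m]
            den[sl] += w * w
        x = num / np.maximum(den, 1e-12)   # eq. (6)
    return x
```

Tracking the DFT-domain analogue of D_M across passes shows the monotone decrease the section proves.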
It is possible to develop ad hoc methods to estimate x(n) from the MSTFTM by modifying the iterative algorithm in Fig. 1. For example, suppose we use in one step of the iterative procedure the standard overlap-add method rather than the LSEE-MSTFT method in obtaining the next estimate x^{i+1}(n) from the MSTFT \hat{X}_w^i(mS, ω). This results in the following update equation:

    x^{i+1}(n) = \frac{\sum_{m=-\infty}^{\infty} \hat{x}_w^i(mS, n)}{\sum_{m=-\infty}^{\infty} w(mS - n)}    (15)

where \hat{X}_w^i(mS, ω) is given by (13). This algorithm will be called OA (overlap-add)-MSTFTM to distinguish it from the LSEE-MSTFTM algorithm. Although OA-MSTFTM requires fewer multiplications per iteration since one less windowing step is required, it is not guaranteed to converge to the critical points of D_M. As will be shown in Section IV, however, OA-MSTFTM does appear to reduce D_M enough to produce a reasonable signal estimate for the purposes of time-scale modification of speech.
IV. TIME-SCALE MODIFICATION OF SPEECH

One method of decomposing a speech signal y(n) is to represent it as the convolution of an excitation function with the vocal tract impulse response. Consequently, the STFT magnitude of this speech signal |Y_w(mS, ω)| can be written as the product of a component due to the excitation function |P_w(mS, ω)| and a component due to the vocal tract impulse response |H_w(mS, ω)|. This decomposition is valid if the analysis window is long enough to include several vocal tract impulse responses and short enough so that the speech signal is approximately stationary over the window length. Under these conditions, the function |P_w(mS, ω)| will correspond to the rapidly varying portion of |Y_w(mS, ω)| with ω, taking on an harmonic structure for voiced speech or noise for unvoiced speech. The function |H_w(mS, ω)| will correspond to the slowly varying portion of |Y_w(mS, ω)| with ω, and will include the formant information of the speech signal. Since the speech signal is assumed to be approximately stationary over the window length, |P_w(mS, ω)| and |H_w(mS, ω)| will change slowly with the time index mS as the pitch period and vocal tract impulse response change.
The goal of time-scale modification is to modify the rate at which |P_w(mS, ω)| and |H_w(mS, ω)| vary with time, and hence the rate at which |Y_w(mS, ω)| varies with time, without affecting the spectral characteristics. This can be accomplished by estimating a signal with STFT magnitude close to a time-scale modified version of |Y_w(mS, ω)|. A time-scale modification of S1:S2 can be performed by calculating |Y_w(mS1, ω)| at the window shift S1 and X_w^i(mS2, ω) at the window shift S2 in the LSEE-MSTFTM or OA-MSTFTM algorithms. For example, |Y_w(mS1, ω)| for the sentence "line up at the screen door," sampled at 10 kHz, is shown in Fig. 2 for a 256-point modified Hamming window and a window shift S1 of 128. Fig. 3(a) shows a 128:64 time-scale modified version of |Y_w(mS1, ω)| produced by displaying these samples of |Y_w(mS1, ω)| with a spacing of 64 samples instead of 128 samples. A signal with STFTM close to this MSTFTM was estimated by starting with an initial white Gaussian noise sequence and then iterating with LSEE-MSTFTM until the distance measure D_M was decreased to the desired level. The Fourier transforms in the algorithm were implemented with 512-point FFT's. Fig. 3(b) shows |X_w^i(mS2, ω)| for S2 = 64 after 100 iterations. Similarly, Fig. 3(c) shows |X_w^i(mS2, ω)| after 100 iterations of the OA-MSTFTM algorithm using the same initial estimate. Comparisons of Fig. 3(b) and 3(c) with Fig. 3(a) indicate that the STFTM of the signal estimate is very close to the desired MSTFTM and that the performance of LSEE-MSTFTM and OA-MSTFTM is similar. In Fig. 4, the distance measure D_M is shown as a function of the number of iterations for LSEE-MSTFTM and OA-MSTFTM.

Fig. 2. STFTM of "line up at the screen door."
Fig. 3. (a) 128:64 time-scale compressed STFTM of original speech. (b) STFTM of LSEE-MSTFTM estimate. (c) STFTM of OA-MSTFTM estimate.

Although OA-MSTFTM performs somewhat better during the initial iterations, LSEE-MSTFTM eventually surpasses it. This same performance difference was noted in all of the examples where these two methods were compared. In addition, LSEE-MSTFTM was observed to always decrease D_M, whereas OA-MSTFTM usually stopped decreasing D_M after about 100 iterations and in some cases increased D_M as more iterations were performed.
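The S1:S2 procedure used in these examples can be sketched compactly as follows (my own NumPy illustration, not the authors' code; the zero-magnitude case of (13) is handled crudely here by a small floor):

```python
import numpy as np

def timescale(x, w, S1, S2, n_iter=30, seed=0):
    # Measure |Y_w(m S1, omega)| at analysis shift S1, then iterate
    # LSEE-MSTFTM at synthesis shift S2 so the estimate's STFTM matches it.
    L = len(w)
    M = 1 + (len(x) - L) // S1
    Ymag = np.abs(np.fft.rfft(
        np.stack([w * x[m * S1:m * S1 + L] for m in range(M)]), axis=1))
    n_out = (M - 1) * S2 + L                  # duration rescaled by S2/S1
    y = np.random.default_rng(seed).standard_normal(n_out)
    for _ in range(n_iter):
        F = np.fft.rfft(
            np.stack([w * y[m * S2:m * S2 + L] for m in range(M)]), axis=1)
        phase = F / np.maximum(np.abs(F), 1e-12)     # keep current phase
        fr = np.fft.irfft(phase * Ymag, axis=1)[:, :L]
        num = np.zeros(n_out)
        den = np.zeros(n_out)
        for m in range(M):
            sl = slice(m * S2, m * S2 + L)
            num[sl] += w * fr[m]
            den[sl] += w * w
        y = num / np.maximum(den, 1e-12)             # eq. (6) synthesis
    return y

# e.g. a 2:1 compression of speech x with a 256-point window:
# y = timescale(x, some_window_of_length_256, S1=128, S2=64)
```

The same skeleton covers noninteger ratios such as the 35:64 expansion below, since only the two shifts change.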
To show that these methods perform as well for noninteger compression or expansion factors, the second example shows a 35:64 expansion. Fig. 5(a) shows a 35:64 time-scale modified version of |Y_w(mS1, ω)| calculated from the original speech signal. As in the first example, the initial estimate was a white Gaussian noise sequence. Fig. 5(b) and 5(c) show the STFTM of the signal estimate after 100 iterations using a 256-point modified Hamming window for LSEE-MSTFTM and OA-MSTFTM, respectively. In both of these examples, the resultant signal estimate was clean high quality speech and the estimates produced by LSEE-MSTFTM and OA-MSTFTM were indistinguishable in listening tests.
The final example consists of a 1:2 time-scale expansion of the 2:1 time-scale compressed speech generated in the first example. The STFTM of the signal estimates produced are then compared with the STFTM of the original speech signal. Fig. 6(a) and 6(b) show the STFTM of the signal estimates after 100 iterations of LSEE-MSTFTM and OA-MSTFTM, respectively. Comparisons of Fig. 6(a) and 6(b) with Fig. 2 show that both LSEE-MSTFTM and OA-MSTFTM produce a signal estimate with STFTM close to the STFTM of the original speech signal. The primary difference between these signal estimates and the original speech signal is that a small amount of reverberation is detectable in the signal estimate due to the nonstationarity of the 2:1 time-scale compressed speech over the window length.

Fig. 4. D_M versus iteration number of LSEE-MSTFTM and OA-MSTFTM.
In addition to the above three examples, other speech material including noisy speech has been processed by the two methods at various compression and expansion ratios. Informal listening appears to indicate that the performance of these methods is superior to that of the system by Portnoff [1]. It should be noted that this approach to time-scale modification of speech differs considerably from that of Portnoff. In Portnoff's method, the phase of Y_w(mS, ω) is explicitly obtained by phase unwrapping, which is undesirable due to various considerations including the computational aspect. In the LSEE-MSTFTM or OA-MSTFTM algorithms, the phase of Y_w(mS, ω) is implicitly estimated in the process of estimating a signal with STFTM close to |Y_w(mS, ω)| and no phase unwrapping is performed.
Even though we used a large number of iterations (100) for the examples illustrated in this paper, we have observed that essentially the same results in terms of speech quality can be obtained after 25 to 100 iterations. In addition, we have observed that speech quality improves rapidly initially and then more slowly as the number of iterations increases. This is evidenced, to some extent, in Fig. 4, where D_M decreases rapidly initially but more slowly as the number of iterations increases. With a better choice of the initial estimate of x(n) than a Gaussian noise sequence, it may be possible to reduce the number of iterations required to achieve a certain performance.
Despite the large number of iterations¹ required, real-time² implementation appears possible if enough processors are used in series. Specifically, as input data are received, the ith processor can perform the ith iteration and the (i+1)st processor which follows the ith processor can perform the (i+1)st iteration. The inherent delay associated with each iteration is only the length of the analysis window, L data points. This is due to the fact that the computational aspect of each iteration of the algorithm is essentially the same as the weighted overlap-add method [10], in which the delay between the input and output data is L points, assuming the required computation for each windowed data segment can be performed during the time corresponding to the window shift, S data points. As an example that illustrates the computational requirements and delay involved, suppose S1 = S2 = 64, L = 256, the size of the DFT used is 512, the number of iterations required and the number of processors available is 50, and speech is sampled at a 10 kHz rate. Since the major computations involved in the algorithm are due to the DFT and IDFT, if each processor can compute two 512-point DFT's once every 6.4 ms, then the iterative algorithm can be implemented in real time with a delay of about 1.3 s. Current hardware technology is more than capable of handling such computational requirements, and a delay of a few seconds is not a serious problem in most applications of time-scale modification of speech.

¹Due to iterations, the total number of computations is considerably larger than Portnoff's method [1]. In a multiprocessor environment, however, the computational requirement of each processor is comparable or perhaps less than that of Portnoff's method.

²The definition of "real time" for time-scale modification depends on the application. In applications where the input to the algorithm is from some storage device and the output is converted to an analog signal which the user listens to, the algorithm must produce one output sample in an average time less than T1, where T1 is the sampling period associated with the digital-to-analog converter used to generate output speech. In applications where the input to the algorithm is digitized directly from the user's speech and the output is placed on some storage device, the algorithm must process an input data sample in an average time less than T2, where T2 is the sampling period associated with the analog-to-digital converter used to digitize the input speech.

Fig. 5. (a) 35:64 time-scale expanded STFTM of original speech. (b) STFTM of LSEE-MSTFTM estimate. (c) STFTM of OA-MSTFTM estimate.

Fig. 6. 1:2 expansion of 2:1 compressed speech for (a) LSEE-MSTFTM and (b) OA-MSTFTM.
Even though LSEE-MSTFTM and OA-MSTFTM had similar

References

R. W. Gerchberg, "A practical algorithm for the determination of phase from image and diffraction plane pictures," 1972.
L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals.
J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech."
J. B. Allen, "Short term spectral analysis, synthesis, and modification by discrete Fourier transform."
M. R. Portnoff, "Time-frequency representation of digital signals and systems based on short-time Fourier analysis."