
3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces
László Tóth and Amin Honarmandi Shandiz
Institute of Informatics, University of Szeged, Szeged, Hungary
{tothl,shandiz}@inf.u-szeged.hu
Abstract. Silent speech interfaces (SSI) aim to reconstruct the speech
signal from a recording of the articulatory movement, such as an ultra-
sound video of the tongue. Currently, deep neural networks are the most
successful technology for this task. The efficient solution requires meth-
ods that do not simply process single images, but are able to extract
the tongue movement information from a sequence of video frames. One
option for this is to apply recurrent neural structures such as the long
short-term memory network (LSTM) in combination with 2D convolu-
tional neural networks (CNNs). Here, we experiment with another app-
roach that extends the CNN to perform 3D convolution, where the extra
dimension corresponds to time. In particular, we apply the spatial and
temporal convolutions in a decomposed form, which proved very suc-
cessful recently in video action recognition. We find experimentally that
our 3D network outperforms the CNN+LSTM model, indicating that
3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI
systems.
Keywords: Silent speech interface · Convolutional neural network · 3D convolution · Ultrasound video
1 Introduction
During the last couple of years, there has been an increasing interest in
articulatory-to-acoustic conversion, which seeks to reproduce the speech signal
from a recording of the articulatory organs, giving the technological background
for creating “Silent Speech Interfaces” (SSI) [6,28]. These interfaces allow us to
record the soundless articulatory movement, and then automatically generate
speech from the movement information, while the subject is actually not pro-
ducing any sound. Such an SSI system could be very useful for the speaking
impaired who are able to move their articulators, but have lost their ability
to produce any sound (e.g. due to a laryngectomy or some injury of the vocal
cords). It could also be applied in human-computer interaction in situations
where regular speech is not feasible (e.g. extremely noisy environments or mil-
itary applications). Several solutions exist for the recording of the articulatory
movements, the simplest approach being a lip video [1,8]. But one may also
apply electromagnetic articulography (EMA, [18,19]), ultrasound tongue imag-
ing (UTI, [5,10,16,20]) or permanent magnetic articulography (PMA, [9]). Sur-
face Electromyography (sEMG, [14,15,24]) is also an option, while some authors
use a combination of the above methods [6]. Here we are going to work with
ultrasound tongue videos.
To convert the movement recordings into speech, the conventional approach
is to apply a two-step procedure of ‘recognition-and-synthesis’ [28]. In this case,
the biosignal is first converted into text by a properly adjusted speech recognition
system. The text is then converted into speech using text-to-speech synthesis [7,
13,31]. The drawbacks of this approach are the relatively large delay between
the input and the output, and that the errors made by the speech recognizer
will inevitably appear as errors in the TTS output. Also, all information related
to speech prosody is lost, even though certain prosodic components such as energy and
pitch can be reasonably well estimated from the articulatory signal [10].
Current SSI systems prefer the ‘direct synthesis’ principle, where speech is
generated directly from the articulatory data, without any intermediate step.
Moreover, as the Deep Neural Network (DNN) technology has recently become
dominant in practically all areas of speech technology, such as speech recogni-
tion [11], speech synthesis [22] and language modeling [33], most recent studies
have attempted to solve the articulatory-to-acoustic conversion problem by using
deep learning, regardless of the recording technique applied [5,9,10,14,16,20,25].
In this paper, we also apply deep neural networks to convert the ultrasound video
of the tongue movement to speech. Although some early studies used simple fully
connected neural networks [5,16], as we are working with images, it seems more
reasonable to apply convolutional neural networks (CNN), which are currently
very popular and successful in image recognition [21]. Thus, many recent studies
on SSI systems use CNNs [14,20,25].
Our input here is a video, that is, not just one still image, but a sequence of
images. This sequence carries extra information about the time trajectory of the
tongue movement, which might be exploited by processing several neighboring
video frames at the same time. There are several options to create a network
structure for processing time sequences. For such data, recurrent neural
networks such as the long short-term memory network (LSTM) are usually applied,
typically stacked on top of a 2D CNN that processes the individual
frames [9,19,23,25]. Alternatively, one may experiment with extending the 2D
CNN structure to 3D, by adding time as an extra dimension [17,20,32]. Here,
we follow the latter approach, and we investigate the applicability of a special
3D CNN model called the (2+1)D CNN [30] for ultrasound-based direct speech
synthesis, and compare the results with those of a CNN+LSTM model. We find
that our 3D CNN model achieves a lower error rate, while it is smaller, and its
training is faster. We conclude that for ultrasound video-based SSI systems, 3D
CNNs are definitely a feasible alternative to recurrent neural models.
The paper is structured as follows. Section 2 gives a technological overview
of the CNNs we are going to apply. In Sect. 3 we describe the data acquisition
and processing steps for the ultrasound videos and the speech signal. Section 4
presents our experimental set-up. We present the experimental results and dis-
cuss them in Sect. 5, and the paper is closed with the conclusions in Sect. 6.
Fig. 1. Illustration of how the (2+1)D CNN operates. The video frames (at the bottom)
are first processed by layers that perform 2D spatial convolution, then their outputs
are combined by 1D temporal convolution. The model is allowed to skip video frames
by changing the stride parameter of the temporal convolution.
2 Convolutional Neural Networks for Video Processing
Ever since the invention of ‘AlexNet’, CNNs have remained the leading tech-
nology in the recognition of still images [21]. These standard CNNs apply the
convolution along the two spatial axes, that is, in two dimensions (2D). How-
ever, there are several tasks where the input is a video, and handling the video as
a sequence (instead of simply processing separate frames) is vital for obtaining
good recognition results. A typical example is human gait recognition, but the
same holds for action recognition in general [17,36,37]. In these cases, the sequence
of video frames forms a three-dimensional data array, with the temporal axis
being the third dimension in addition to the two spatial dimensions (cf. Fig. 1).
For the processing of sequences, recurrent neural structures such as the LSTM
are the most powerful tool [12]. However, the training of these networks is known
to be slow and problematic, which led to the invention of simplified models, such
as the gated recurrent unit (GRU) [3] or the quasi-recurrent neural network [2].
Alternatively, several convolutional network structures have been proposed that
handle time sequences without recurrent connections. In speech recognition,
time-delay neural networks (TDNNs) have proved very successful [26,29], but
we can also mention the feedforward sequential memory network [34]. As regards
video processing, several modified CNN structures have been proposed to handle
the temporal sequence of video frames [17,36,37]. Unfortunately, the standard
2D convolution may be extended to 3D in many possible ways, giving a lot of
choices for optimization. Tran et al. performed an experimental comparison of
several 3D variants, and they got the best results when they decomposed the
spatial and temporal convolution steps [30]. The model they called ‘(2+1)D con-
volution’ first performs a 2D convolution along the spatial axes, and then a 1D
convolution along the time axis (see Fig. 1). By changing the stride parameter
of the 1D convolution, the model can skip several video frames, thus covering a
wider time context without increasing the number of processed frames. Interest-
ingly, a very similar network structure proved very efficient in speech recognition
as well [29]. Stacking several such processing blocks on top of each other is also
possible, resulting in a very deep network [30]. Here, we are going to experiment
with a similar (2+1)D network structure for ultrasound-based SSI systems.
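As an illustration of this decomposition, a minimal Keras sketch of one (2+1)D block is given below. The filter counts, kernel sizes and strides are placeholder values chosen for clarity, not the exact settings of the network evaluated in this paper.

```python
# Sketch of a decomposed "(2+1)D" block in Keras: a 2D spatial convolution
# applied to every frame, followed by a strided 1D temporal convolution.
# All layer sizes below are illustrative placeholders.
from tensorflow.keras import layers, models

def block_2plus1d(x, n_spatial=30, n_temporal=60, t_stride=2):
    # Spatial step: the kernel spans only the image axes (1 x 3 x 3),
    # so each video frame is filtered independently.
    x = layers.Conv3D(n_spatial, kernel_size=(1, 3, 3), strides=(1, 2, 2),
                      padding='same', activation='swish')(x)
    # Temporal step: the kernel spans only the time axis (3 x 1 x 1);
    # a temporal stride > 1 skips frames and widens the time context.
    x = layers.Conv3D(n_temporal, kernel_size=(3, 1, 1), strides=(t_stride, 1, 1),
                      padding='same', activation='swish')(x)
    return x

# Input: a short sequence of ultrasound frames (time, height, width, channels).
inp = layers.Input(shape=(5, 128, 64, 1))
model = models.Model(inp, block_2plus1d(inp))
```

Several such blocks can be stacked on top of each other in the same functional style to obtain a deeper network.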
3 Data Acquisition and Signal Preprocessing
The ultrasound recordings were collected from a Hungarian female subject (42
years old, with normal speaking abilities) while she was reading sentences aloud.
Her tongue movement was recorded in a midsagittal orientation placing the
ultrasonic imaging probe under the jaw using a “Micro” ultrasound system
by Articulate Instruments Ltd. The transducer was fixed using a stabilization
headset. The 2–4 MHz/64 element 20 mm radius convex ultrasound transducer
produced 82 images per second. The speech signal was recorded in parallel with
an Audio-Technica ATR 3350 omnidirectional condenser microphone placed at
a distance of 20 cm from the lips. The ultrasound and the audio signals were
synchronized using the software tool provided with the equipment. Altogether
438 sentences (approximately half an hour) were recorded from the subject,
which were divided into training, development and test sets in a 310-41-87 ratio. We
should add that the same dataset was used in several earlier studies [5,10].
The ultrasound probe records 946 samples along each of its 64 scan lines.
The recorded data can be converted to conventional ultrasound images using
the software tools provided. However, due to its irregular shape, this image is
harder to process by computers, while it contains no extra information compared
to the original scan data. Hence, we worked with the original 946 × 64 data items,
which were downsampled to 128 × 64 pixels. Figure 2 shows an example of the
data samples arranged as a rectangular image, and the standard ultrasound-
style display generated from it. The intensity range of the data was min-max
normalized to the [−1, 1] interval before feeding it to the network.
Fig. 2. Example of displaying the ultrasound recordings as a) a rectangular image of
raw data samples, and b) an anatomically correct image obtained by interpolation.

The speech signal was recorded with a sampling rate of 11025 Hz, and then
processed by a vocoder from the SPTK toolkit (http://sp-tk.sourceforge.net).
The vocoder represented the speech signal by 12 Mel-Generalized Cepstral
Coefficients (MGCC) converted to a Line Spectral Pair representation (LSP),
with the signal's gain being the 13th parameter. These 13 coefficients served as
the training targets in the DNN modeling experiments, as the speech signal can
be reasonably well reconstructed from these parameters. Although perfect
reconstruction would require the estimation of the pitch (F0 parameter) as well,
in this study we ignored this component during the experiments. To facilitate
training, each of the 13 targets was standardized to zero mean and unit variance.
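The scaling steps described above amount to the following NumPy sketch; the scope over which the min-max statistics are taken (per recording or over the training set) and the function names are our own choices for illustration.

```python
import numpy as np

def scale_ultrasound(frames, lo=None, hi=None):
    # Min-max normalize the ultrasound intensities to the [-1, 1] interval.
    lo = frames.min() if lo is None else lo
    hi = frames.max() if hi is None else hi
    return 2.0 * (frames - lo) / (hi - lo) - 1.0

def standardize_targets(targets, train_targets):
    # Standardize each of the 13 vocoder parameters to zero mean and unit
    # variance, using statistics estimated on the training set.
    mean = train_targets.mean(axis=0)
    std = train_targets.std(axis=0)
    return (targets - mean) / std
```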
4 Experimental Set-Up
We implemented our deep neural networks in Keras, using a TensorFlow back-
end [4]. We created three different models: a simple fully connected network
(FCN), a convolutional network that processes one frame of video (2D CNN),
and a convolutional network that can process several subsequent video frames as
input (3D CNN). To keep them comparable with respect to parameter count, all
three models had approximately 3.3 million tunable parameters. Training was
performed using the stochastic gradient descent (SGD) method with a batch size
of 100. The training objective function was the mean squared error (MSE).
Fully Connected Network (FCN): The simplest possible DNN type is a
network with fully connected layers. To be comparable with an earlier study [5],
our FCN model consisted of 5 fully connected hidden layers, with an output layer
of 13 neurons for the 13 training targets. The input of the network consisted of
one video frame (128 × 64 = 8192 pixels). Each hidden layer had 350 neurons, so
the model was about 4 times smaller than the FCN described in [5]. Apart
from the linear output layer, all layers applied the swish activation function [27],
and were followed by a dropout layer with the dropout rate set to 0.2.
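The description above corresponds to a small Keras model along the following lines; the SGD learning rate is not specified in the text and is left at its default value here.

```python
# Sketch of the FCN baseline: 5 hidden layers of 350 swish units, each followed
# by dropout (0.2), and a linear output layer with 13 units (12 MGC-LSP + gain).
from tensorflow.keras import layers, models, optimizers

inp = layers.Input(shape=(128 * 64,))   # one flattened video frame
x = inp
for _ in range(5):
    x = layers.Dense(350, activation='swish')(x)
    x = layers.Dropout(0.2)(x)
out = layers.Dense(13, activation='linear')(x)

fcn = models.Model(inp, out)
fcn.compile(optimizer=optimizers.SGD(), loss='mse')   # trained with batch size 100
```

With these settings the model has roughly 3.4 million trainable weights, in line with the parameter budget mentioned in the experimental set-up.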
Convolutional Network (2D CNN): Similar to the FCN, the input to this
network consisted of only one frame of data. The network performed spatial con-
volution on the input image via its four convolutional layers below the uppermost

References
Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips
Biosignal-Based Spoken Communication: A Survey
Session independent non-audible speech recognition using surface electromyography
Deep-FSMN for Large Vocabulary Continuous Speech Recognition
Pooling the Convolutional Layers in Deep ConvNets for Video Action Recognition