
3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces
László Tóth and Amin Honarmandi Shandiz
Institute of Informatics, University of Szeged, Szeged, Hungary
{tothl,shandiz}@inf.u-szeged.hu
Abstract. Silent speech interfaces (SSI) aim to reconstruct the speech
signal from a recording of the articulatory movement, such as an ultra-
sound video of the tongue. Currently, deep neural networks are the most
successful technology for this task. The efficient solution requires meth-
ods that do not simply process single images, but are able to extract
the tongue movement information from a sequence of video frames. One
option for this is to apply recurrent neural structures such as the long
short-term memory network (LSTM) in combination with 2D convolu-
tional neural networks (CNNs). Here, we experiment with another app-
roach that extends the CNN to perform 3D convolution, where the extra
dimension corresponds to time. In particular, we apply the spatial and
temporal convolutions in a decomposed form, which proved very suc-
cessful recently in video action recognition. We find experimentally that
our 3D network outperforms the CNN+LSTM model, indicating that
3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI
systems.
Keywords: Silent speech interface · Convolutional neural network · 3D convolution · Ultrasound video
1 Introduction
During the last couple of years, there has been an increasing interest in
articulatory-to-acoustic conversion, which seeks to reproduce the speech signal
from a recording of the articulatory organs, giving the technological background
for creating “Silent Speech Interfaces” (SSI) [6,28]. These interfaces allow us to
record the soundless articulatory movement, and then automatically generate
speech from the movement information, while the subject is actually not pro-
ducing any sound. Such an SSI system could be very useful for the speaking
impaired who are able to move their articulators, but have lost their ability
to produce any sound (e.g. due to a laryngectomy or some injury of the vocal
cords). It could also be applied in human-computer interaction in situations
where regular speech is not feasible (e.g. extremely noisy environments or mil-
itary applications). Several solutions exist for the recording of the articulatory
movements, the simplest approach being a lip video [1,8]. But one may also
apply electromagnetic articulography (EMA, [18,19]), ultrasound tongue imag-
ing (UTI, [5,10,16,20]) or permanent magnetic articulography (PMA, [9]). Sur-
face Electromyography (sEMG, [14,15,24]) is also an option, while some authors
use a combination of the above methods [6]. Here we are going to work with
ultrasound tongue videos.
To convert the movement recordings into speech, the conventional approach
is to apply a two-step procedure of ‘recognition-and-synthesis’ [28]. In this case,
the biosignal is first converted into text by a properly adjusted speech recognition
system. The text is then converted into speech using text-to-speech synthesis [7,
13,31]. The drawbacks of this approach are the relatively large delay between
the input and the output, and that the errors made by the speech recognizer
will inevitably appear as errors in the TTS output. Also, all information related
to speech prosody is lost, even though certain prosodic components such as energy and
pitch can be reasonably well estimated from the articulatory signal [10].
Current SSI systems prefer the ‘direct synthesis’ principle, where speech is
generated directly from the articulatory data, without any intermediate step.
Moreover, as the Deep Neural Network (DNN) technology has recently become
dominant in practically all areas of speech technology, such as speech recogni-
tion [11], speech synthesis [22] and language modeling [33], most recent studies
have attempted to solve the articulatory-to-acoustic conversion problem by using
deep learning, regardless of the recording technique applied [5,9,10,14,16,20,25].
In this paper, we also apply deep neural networks to convert the ultrasound video
of the tongue movement to speech. Although some early studies used simple fully
connected neural networks [5,16], as we are working with images, it seems more
reasonable to apply convolutional neural networks (CNN), which are currently
very popular and successful in image recognition [21]. Thus, many recent studies
on SSI systems use CNNs [14,20,25].
Our input here is a video, that is, not just one still image, but a sequence of
images. This sequence carries extra information about the time trajectory of the
tongue movement, which might be exploited by processing several neighboring
video frames at the same time. There are several options to create a network
structure for processing time sequences. For such data, recurrent neural
networks such as the long short-term memory network (LSTM) are usually applied,
typically stacked on top of a 2D CNN that processes the individual
frames [9,19,23,25]. Alternatively, one may experiment with extending the 2D
CNN structure to 3D, by adding time as an extra dimension [17,20,32]. Here,
we follow the latter approach, and we investigate the applicability of a special
3D CNN model called the (2+1)D CNN [30] for ultrasound-based direct speech
synthesis, and compare the results with those of a CNN+LSTM model. We find
that our 3D CNN model achieves a lower error rate, while it is smaller, and its
training is faster. We conclude that for ultrasound video-based SSI systems, 3D
CNNs are definitely a feasible alternative to recurrent neural models.
The paper is structured as follows. Section 2 gives a technological overview
of the CNNs we are going to apply. In Sect. 3 we describe the data acquisition
and processing steps for the ultrasound videos and the speech signal. Section 4
presents our experimental set-up. We present the experimental results and dis-
cuss them in Sect. 5, and the paper is closed with the conclusions in Sect. 6.
Fig. 1. Illustration of how the (2+1)D CNN operates. The video frames (at the bottom)
are first processed by layers that perform 2D spatial convolution, then their outputs
are combined by 1D temporal convolution. The model is allowed to skip video frames
by changing the stride parameter of the temporal convolution.
2 Convolutional Neural Networks for Video Processing
Ever since the invention of ‘AlexNet’, CNNs have remained the leading tech-
nology in the recognition of still images [21]. These standard CNNs apply the
convolution along the two spatial axes, that is, in two dimensions (2D). How-
ever, there are several tasks where the input is a video, and handling the video as
a sequence (instead of simply processing separate frames) is vital for obtaining
good recognition results. A typical example is human gait recognition, but the
same holds for action recognition in general [17,36,37]. In these cases, the sequence
of video frames forms a three-dimensional data array, with the temporal axis
being the third dimension in addition to the two spatial dimensions (cf. Fig. 1).
For the processing of sequences, recurrent neural structures such as the LSTM
are the most powerful tool [12]. However, the training of these networks is known
to be slow and problematic, which led to the invention of simplified models, such
as the gated recurrent unit (GRU) [3] or the quasi-recurrent neural network [2].
Alternatively, several convolutional network structures have been proposed that
handle time sequences without recurrent connections. In speech recognition,
time-delay neural networks (TDNNs) have proved very successful [26,29], but
we can also mention the feedforward sequential memory network [34]. As regards
video processing, several modified CNN structures have been proposed to handle
the temporal sequence of video frames [17,36,37]. Unfortunately, the standard
2D convolution may be extended to 3D in many possible ways, giving a lot of
choices for optimization. Tran et al. performed an experimental comparison of
several 3D variants, and they got the best results when they decomposed the
spatial and temporal convolution steps [30]. The model they called ‘(2+1)D con-
volution’ first performs a 2D convolution along the spatial axes, and then a 1D
convolution along the time axis (see Fig. 1). By changing the stride parameter
of the 1D convolution, the model can skip several video frames, thus covering a
wider time context without increasing the number of processed frames. Interest-
ingly, a very similar network structure proved very efficient in speech recognition
as well [29]. Stacking several such processing blocks on top of each other is also
possible, resulting in a very deep network [30]. Here, we are going to experiment
with a similar (2+1)D network structure for ultrasound-based SSI systems.
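As an illustration of this decomposition, a minimal Keras sketch of one (2+1)D block is given below. The filter counts, kernel sizes and strides are placeholder values chosen for clarity, not the exact settings of the network evaluated in this paper.

```python
# Sketch of a decomposed "(2+1)D" block in Keras: a 2D spatial convolution
# applied to every frame, followed by a strided 1D temporal convolution.
# All layer sizes below are illustrative placeholders.
from tensorflow.keras import layers, models

def block_2plus1d(x, n_spatial=30, n_temporal=60, t_stride=2):
    # Spatial step: the kernel spans only the image axes (1 x 3 x 3),
    # so each video frame is filtered independently.
    x = layers.Conv3D(n_spatial, kernel_size=(1, 3, 3), strides=(1, 2, 2),
                      padding='same', activation='swish')(x)
    # Temporal step: the kernel spans only the time axis (3 x 1 x 1);
    # a temporal stride > 1 skips frames and widens the time context.
    x = layers.Conv3D(n_temporal, kernel_size=(3, 1, 1), strides=(t_stride, 1, 1),
                      padding='same', activation='swish')(x)
    return x

# Input: a short sequence of ultrasound frames (time, height, width, channels).
inp = layers.Input(shape=(5, 128, 64, 1))
model = models.Model(inp, block_2plus1d(inp))
```

Several such blocks can be stacked on top of each other in the same functional style to obtain a deeper network.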
3 Data Acquisition and Signal Preprocessing
The ultrasound recordings were collected from a Hungarian female subject (42
years old, with normal speaking abilities) while she was reading sentences aloud.
Her tongue movement was recorded in a midsagittal orientation placing the
ultrasonic imaging probe under the jaw using a “Micro” ultrasound system
by Articulate Instruments Ltd. The transducer was fixed using a stabilization
headset. The 2–4 MHz/64 element 20 mm radius convex ultrasound transducer
produced 82 images per second. The speech signal was recorded in parallel with
an Audio-Technica ATR 3350 omnidirectional condenser microphone placed at
a distance of 20 cm from the lips. The ultrasound and the audio signals were
synchronized using the software tool provided with the equipment. Altogether
438 sentences (approximately half an hour) were recorded from the subject,
which were divided into training, development and test sets in a 310-41-87 ratio. We
should add that the same dataset was used in several earlier studies [5,10].
The ultrasound probe records 946 samples along each of its 64 scan lines.
The recorded data can be converted to conventional ultrasound images using
the software tools provided. However, due to its irregular shape, this image is
harder to process by computers, while it contains no extra information compared
to the original scan data. Hence, we worked with the original 946 × 64 data items,
which were downsampled to 128 × 64 pixels. Figure 2 shows an example of the
data samples arranged as a rectangular image, and the standard ultrasound-
style display generated from it. The intensity range of the data was min-max
normalized to the [−1, 1] interval before feeding it to the network.
Fig. 2. Example of displaying the ultrasound recordings as a) a rectangular image of
raw data samples, and b) an anatomically correct image obtained by interpolation.

The speech signal was recorded with a sampling rate of 11025 Hz, and then
processed by a vocoder from the SPTK toolkit (http://sp-tk.sourceforge.net).
The vocoder represented the speech signal by 12 Mel-Generalized Cepstral
Coefficients (MGCC) converted to a Line Spectral Pair representation (LSP),
with the signal's gain being the 13th parameter. These 13 coefficients served as
the training targets in the DNN modeling experiments, as the speech signal can
be reasonably well reconstructed from these parameters. Although perfect
reconstruction would require the estimation of the pitch (F0 parameter) as well,
in this study we ignored this component during the experiments. To facilitate
training, each of the 13 targets was standardized to zero mean and unit variance.
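The scaling steps described above amount to the following NumPy sketch; the scope over which the min-max statistics are taken (per recording or over the training set) and the function names are our own choices for illustration.

```python
import numpy as np

def scale_ultrasound(frames, lo=None, hi=None):
    # Min-max normalize the ultrasound intensities to the [-1, 1] interval.
    lo = frames.min() if lo is None else lo
    hi = frames.max() if hi is None else hi
    return 2.0 * (frames - lo) / (hi - lo) - 1.0

def standardize_targets(targets, train_targets):
    # Standardize each of the 13 vocoder parameters to zero mean and unit
    # variance, using statistics estimated on the training set.
    mean = train_targets.mean(axis=0)
    std = train_targets.std(axis=0)
    return (targets - mean) / std
```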
4 Experimental Set-Up
We implemented our deep neural networks in Keras, using a TensorFlow back-
end [4]. We created three different models: a simple fully connected network
(FCN), a convolutional network that processes one frame of video (2D CNN),
and a convolutional network that can process several subsequent video frames as
input (3D CNN). To keep them comparable with respect to parameter count, all
three models had approximately 3.3 million tunable parameters. Training was
performed using the stochastic gradient descent (SGD) method with a batch size
of 100. The training objective function was the mean squared error (MSE).
Fully Connected Network (FCN): The simplest possible DNN type is a
network with fully connected layers. To be comparable with an earlier study [5],
our FCN model consisted of 5 fully connected hidden layers, with an output layer
of 13 neurons for the 13 training targets. The input of the network consisted of
one video frame (128 × 64 = 8192 pixels). Each hidden layer had 350 neurons, so
the model was about 4 times smaller than the FCN described in [5]. Apart
from the linear output layer, all layers applied the swish activation function [27],
and were followed by a dropout layer with the dropout rate set to 0.2.
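The description above corresponds to a small Keras model along the following lines; the SGD learning rate is not specified in the text and is left at its default value here.

```python
# Sketch of the FCN baseline: 5 hidden layers of 350 swish units, each followed
# by dropout (0.2), and a linear output layer with 13 units (12 MGC-LSP + gain).
from tensorflow.keras import layers, models, optimizers

inp = layers.Input(shape=(128 * 64,))   # one flattened video frame
x = inp
for _ in range(5):
    x = layers.Dense(350, activation='swish')(x)
    x = layers.Dropout(0.2)(x)
out = layers.Dense(13, activation='linear')(x)

fcn = models.Model(inp, out)
fcn.compile(optimizer=optimizers.SGD(), loss='mse')   # trained with batch size 100
```

With these settings the model has roughly 3.4 million trainable weights, in line with the parameter budget mentioned in the experimental set-up.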
Convolutional Network (2D CNN): Similar to the FCN, the input to this
network consisted of only one frame of data. The network performed spatial con-
volution on the input image via its four convolutional layers below the uppermost

References
Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips
Biosignal-Based Spoken Communication: A Survey
Session independent non-audible speech recognition using surface electromyography
Deep-FSMN for Large Vocabulary Continuous Speech Recognition
Pooling the Convolutional Layers in Deep ConvNets for Video Action Recognition