Book Chapter

3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

12 Oct 2020 · pp. 159–169
TL;DR: This work experiments with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time, and finds experimentally that the 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.
Abstract: Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. The efficient solution requires methods that do not simply process single images, but are able to extract the tongue movement information from a sequence of video frames. One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, we experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, we apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. We find experimentally that our 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.

Summary (2 min read)

1 Introduction

  • During the last couple of years, there has been an increasing interest in articulatory-to-acoustic conversion, which seeks to reproduce the speech signal from a recording of the articulatory organs, giving the technological background for creating “Silent Speech Interfaces” (SSI) [6,28].
  • Current SSI systems prefer the ‘direct synthesis’ principle, where speech is generated directly from the articulatory data, without any intermediate step.
  • This sequence carries extra information about the time trajectory of the tongue movement, which might be exploited by processing several neighboring video frames at the same time.
  • Alternatively, one may experiment with extending the 2D CNN structure to 3D, by adding time as an extra dimension [17,20,32].

2 Convolutional Neural Networks for Video Processing

  • Ever since the invention of ‘Alexnet’, CNNs have remained the leading technology in the recognition of still images [21].
  • There are several tasks where the input is a video, and handling the video as a sequence (instead of simply processing separate frames) is vital for obtaining good recognition results.
  • For the processing of sequences, recurrent neural structures such as the LSTM are the most powerful tool [12].
  • The model they called ‘(2+1)D convolution’ first performs a 2D convolution along the spatial axes, and then a 1D convolution along the time axis (see Fig. 1).
  • Interestingly, a very similar network structure proved very efficient in speech recognition as well [29].

3 Data Acquisition and Signal Preprocessing

  • The ultrasound recordings were collected from a Hungarian female subject (42 years old, with normal speaking abilities) while she was reading sentences aloud.
  • The ultrasound and the audio signals were synchronized using the software tool provided with the equipment.
  • Figure 2 shows an example of the data samples arranged as a rectangular image, and the standard ultrasound-style display generated from it.
  • The speech signal was recorded with a sampling rate of 11025 Hz, and then processed by a vocoder from the SPTK toolkit (http://sp-tk.sourceforge.net).
  • To facilitate training, each of the 13 targets was standardized to zero mean and unit variance.

4 Experimental Set-Up

  • The authors implemented their deep neural networks in Keras, using a Tensorflow backend [4].
  • To keep them comparable with respect to parameter count, all three models had approximately 3.3 million tunable parameters.
  • Training was performed using the stochastic gradient descent method (SGD) with a batch size of 100.
  • The simplest possible DNN type is a network with fully connected layers.
  • The second model, the Convolutional Network (2D CNN), was similar to the FCN in that its input also consisted of only one frame of data.

(Figure: schematic structure of the 2D CNN and 3D CNN models.)

  • The optimal network meta-parameters were found experimentally, and all hidden layers applied the swish activation function [27].
  • Following the concept of (2+1)D convolution described in Sect. 2, the five frames were first processed only spatially, and then got combined along the time axis just below the uppermost dense layer.
  • There are several options for evaluating the performance of their networks.
  • In the simplest case, the authors can compare their performance by simple objective metrics, such as the value of the target function optimized during training (the MSE function in their case).
  • Hence, many authors apply subjective listening tests such as the MUSHRA method [25].

5 Results and Discussion

  • As for the 3D CNN, the authors found that the value of the stride parameter s has a significant impact on the error rate attained.
  • Along with the MSE values, the correlation-based R2 scores are also shown (a small evaluation sketch is given after this list).
  • They obtained slightly better results than those given by their FCN, presumably due to the fact that their network had about 4 times as many parameters.
  • These simple methods failed to significantly reduce the error rate.
  • Moreover, instead of reducing the input size by feature selection, it seems to be more efficient to send the frames through several neural layers, with a relatively narrow ‘bottleneck’ layer on top.
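As a rough illustration of the objective evaluation mentioned in the bullets above, the sketch below computes the MSE and the R2 score for predicted vocoder-parameter trajectories; the array names and shapes are assumptions, not taken from the paper.

# Sketch: objective evaluation of predicted vocoder parameters (assumed shapes/names).
import numpy as np

def evaluate(y_true, y_pred):
    # y_true, y_pred: arrays of shape (num_frames, 13) with standardized targets/predictions
    mse = np.mean((y_true - y_pred) ** 2)                          # mean squared error
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)                # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                                     # per-target R^2
    return mse, r2.mean()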

6 Conclusions

  • Here, the authors implemented a 3D CNN for ultrasound-based articulation-to-acoustic conversion, where the CNN applied separate spatial and temporal components, motivated by the (2+1)D CNN of Tran et al. [30].
  • The model was compared with a CNN+LSTM architecture that was recently proposed for the same task.
  • This study was supported by the National Research, Development and Innovation Office of Hungary through project FK 124584 and by the AI National Excellence Program (grant 2018-1.2.1-NKP-2018-00008) and by grant TUDFO/47138-1/2019-ITM of the Ministry of Innovation and Technology.
  • The GPU card used for the computations was donated by the NVIDIA Corporation.
  • The authors thank the MTA-ELTE Lendület Lingual Articulation Research Group for providing the ultrasound recordings.


3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

László Tóth and Amin Honarmandi Shandiz
Institute of Informatics, University of Szeged, Szeged, Hungary
{tothl,shandiz}@inf.u-szeged.hu
Abstract. Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. The efficient solution requires methods that do not simply process single images, but are able to extract the tongue movement information from a sequence of video frames. One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, we experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, we apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. We find experimentally that our 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.

Keywords: Silent speech interface · Convolutional neural network · 3D convolution · Ultrasound video
1 Introduction
During the last couple of years, there has been an increasing interest in articulatory-to-acoustic conversion, which seeks to reproduce the speech signal from a recording of the articulatory organs, giving the technological background for creating "Silent Speech Interfaces" (SSI) [6,28]. These interfaces allow us to record the soundless articulatory movement, and then automatically generate speech from the movement information, while the subject is actually not producing any sound. Such an SSI system could be very useful for the speaking impaired who are able to move their articulators, but have lost their ability to produce any sound (e.g. due to a laryngectomy or some injury of the vocal cords). It could also be applied in human-computer interaction in situations where regular speech is not feasible (e.g. extremely noisy environments or military applications). Several solutions exist for the recording of the articulatory movements, the simplest approach being a lip video [1,8]. But one may also apply electromagnetic articulography (EMA, [18,19]), ultrasound tongue imaging (UTI, [5,10,16,20]) or permanent magnetic articulography (PMA, [9]). Surface electromyography (sEMG, [14,15,24]) is also an option, while some authors use a combination of the above methods [6]. Here we are going to work with ultrasound tongue videos.

© Springer Nature Switzerland AG 2020
L. Rutkowski et al. (Eds.): ICAISC 2020, LNAI 12415, pp. 159–169, 2020. https://doi.org/10.1007/978-3-030-61401-0_16
To convert the movement recordings into speech, the conventional approach is to apply a two-step procedure of 'recognition-and-synthesis' [28]. In this case, the biosignal is first converted into text by a properly adjusted speech recognition system. The text is then converted into speech using text-to-speech synthesis [7,13,31]. The drawbacks of this approach are the relatively large delay between the input and the output, and that the errors made by the speech recognizer will inevitably appear as errors in the TTS output. Also, all information related to speech prosody is lost, while certain prosodic components such as energy and pitch can be reasonably well estimated from the articulatory signal [10].

Current SSI systems prefer the 'direct synthesis' principle, where speech is generated directly from the articulatory data, without any intermediate step. Moreover, as the Deep Neural Network (DNN) technology has recently become dominant in practically all areas of speech technology, such as speech recognition [11], speech synthesis [22] and language modeling [33], most recent studies have attempted to solve the articulatory-to-acoustic conversion problem by using deep learning, regardless of the recording technique applied [5,9,10,14,16,20,25]. In this paper, we also apply deep neural networks to convert the ultrasound video of the tongue movement to speech. Although some early studies used simple fully connected neural networks [5,16], as we are working with images, it seems more reasonable to apply convolutional neural networks (CNN), which are currently very popular and successful in image recognition [21]. Thus, many recent studies on SSI systems use CNNs [14,20,25].
Our input here is a video, that is, not just one still image, but a sequence of images. This sequence carries extra information about the time trajectory of the tongue movement, which might be exploited by processing several neighboring video frames at the same time. There are several options to create a network structure for processing time sequences. For such data, usually recurrent neural networks such as the long short-term memory network (LSTM) are applied, typically stacking it on top of a 2D CNN that seeks to process the individual frames [9,19,23,25]. Alternatively, one may experiment with extending the 2D CNN structure to 3D, by adding time as an extra dimension [17,20,32]. Here, we follow the latter approach, and we investigate the applicability of a special 3D CNN model called the (2+1)D CNN [30] for ultrasound-based direct speech synthesis, and compare the results with those of a CNN+LSTM model. We find that our 3D CNN model achieves a lower error rate, while it is smaller, and its training is faster. We conclude that for ultrasound video-based SSI systems, 3D CNNs are definitely a feasible alternative to recurrent neural models.
The paper is structured as follows. Section 2 gives a technological overview of the CNNs we are going to apply. In Sect. 3 we describe the data acquisition and processing steps for the ultrasound videos and the speech signal. Section 4 presents our experimental set-up. We present the experimental results and discuss them in Sect. 5, and the paper is closed with the conclusions in Sect. 6.
Fig. 1. Illustration of how the (2+1)D CNN operates. The video frames (at the bottom) are first processed by layers that perform 2D spatial convolution, then their outputs are combined by 1D temporal convolution. The model is allowed to skip video frames by changing the stride parameter of the temporal convolution.
2 Convolutional Neural Networks for Video Processing
Ever since the invention of 'Alexnet', CNNs have remained the leading technology in the recognition of still images [21]. These standard CNNs apply the convolution along the two spatial axes, that is, in two dimensions (2D). However, there are several tasks where the input is a video, and handling the video as a sequence (instead of simply processing separate frames) is vital for obtaining good recognition results. The best example is human gait recognition, but we can talk about action recognition in general [17,36,37]. In these cases, the sequence of video frames forms a three-dimensional data array, with the temporal axis being the third dimension in addition to the two spatial dimensions (cf. Fig. 1).

For the processing of sequences, recurrent neural structures such as the LSTM are the most powerful tool [12]. However, the training of these networks is known to be slow and problematic, which led to the invention of simplified models, such as the gated recurrent unit (GRU) [3] or the quasi-recurrent neural network [2]. Alternatively, several convolutional network structures have been proposed that handle time sequences without recurrent connections. In speech recognition, time-delay neural networks (TDNNs) have proved very successful [26,29], but we can also mention the feedforward sequential memory network [34]. As regards video processing, several modified CNN structures have been proposed to handle the temporal sequence of video frames [17,36,37]. Unfortunately, the standard 2D convolution may be extended to 3D in many possible ways, giving a lot of choices for optimization. Tran et al. performed an experimental comparison of several 3D variants, and they got the best results when they decomposed the spatial and temporal convolution steps [30]. The model they called '(2+1)D convolution' first performs a 2D convolution along the spatial axes, and then a 1D convolution along the time axis (see Fig. 1). By changing the stride parameter of the 1D convolution, the model can skip several video frames, thus covering a wider time context without increasing the number of processed frames. Interestingly, a very similar network structure proved very efficient in speech recognition as well [29]. Stacking several such processing blocks on top of each other is also possible, resulting in a very deep network [30]. Here, we are going to experiment with a similar (2+1)D network structure for ultrasound-based SSI systems.
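To make the decomposition concrete, the sketch below shows one way to express a (2+1)D block in Keras (the framework used in Sect. 4): a Conv3D layer with a (1, k, k) kernel performs the spatial convolution, and a second Conv3D layer with a (k, 1, 1) kernel and an adjustable stride performs the temporal convolution. The filter counts, kernel sizes and pooling are illustrative assumptions, not the authors' exact configuration; a recent TensorFlow 2.x with the built-in 'swish' activation is assumed.

# Minimal sketch of a (2+1)D block; settings are illustrative, not the paper's exact ones.
from tensorflow.keras import layers, models

def conv2plus1d_block(x, filters, spatial_kernel=3, temporal_kernel=3, temporal_stride=1):
    # 2D spatial convolution: a (1, k, k) kernel leaves the time axis untouched
    x = layers.Conv3D(filters, (1, spatial_kernel, spatial_kernel),
                      padding='same', activation='swish')(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)        # spatial pooling only
    # 1D temporal convolution: a (k, 1, 1) kernel; the stride lets the model skip frames
    x = layers.Conv3D(filters, (temporal_kernel, 1, 1),
                      strides=(temporal_stride, 1, 1),
                      padding='same', activation='swish')(x)
    return x

# Example: 5 input frames of 128 x 64 ultrasound data, one channel
inp = layers.Input(shape=(5, 128, 64, 1))
h = conv2plus1d_block(inp, filters=30, temporal_stride=1)
h = layers.Flatten()(h)
out = layers.Dense(13, activation='linear')(h)             # 13 vocoder targets
model = models.Model(inp, out)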
3 Data Acquisition and Signal Preprocessing
The ultrasound recordings were collected from a Hungarian female subject (42 years old, with normal speaking abilities) while she was reading sentences aloud. Her tongue movement was recorded in a midsagittal orientation placing the ultrasonic imaging probe under the jaw using a "Micro" ultrasound system by Articulate Instruments Ltd. The transducer was fixed using a stabilization headset. The 2–4 MHz/64 element 20 mm radius convex ultrasound transducer produced 82 images per second. The speech signal was recorded in parallel with an Audio-Technica ATR 3350 omnidirectional condenser microphone placed at a distance of 20 cm from the lips. The ultrasound and the audio signals were synchronized using the software tool provided with the equipment. Altogether 438 sentences (approximately half an hour) were recorded from the subject, which were divided into train, development and test sets in a 310-41-87 ratio. We should add that the same dataset was used in several earlier studies [5,10].
The ultrasound probe records 946 samples along each of its 64 scan lines. The recorded data can be converted to conventional ultrasound images using the software tools provided. However, due to its irregular shape, this image is harder to process by computers, while it contains no extra information compared to the original scan data. Hence, we worked with the original 946 × 64 data items, which were downsampled to 128 × 64 pixels. Figure 2 shows an example of the data samples arranged as a rectangular image, and the standard ultrasound-style display generated from it. The intensity range of the data was min-max normalized to the [−1, 1] interval before feeding it to the network.
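For illustration, a minimal sketch of this frame preprocessing is given below. The resizing routine and the per-frame min-max normalization are assumptions, since the text does not specify how the downsampling and normalization were implemented.

# Sketch of the frame preprocessing described above (assumptions noted in comments).
import numpy as np
from skimage.transform import resize   # any standard image-resizing routine would do

def preprocess_frame(raw_frame):
    # raw_frame: (946, 64) array of raw samples along the 64 scan lines
    frame = resize(raw_frame, (128, 64), anti_aliasing=True)   # downsample to 128 x 64
    lo, hi = frame.min(), frame.max()                          # per-frame min-max (assumed)
    return 2.0 * (frame - lo) / (hi - lo) - 1.0                # normalize to [-1, 1]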
Fig. 2. Example of displaying the ultrasound recordings as a) a rectangular image of raw data samples, and b) an anatomically correct image, obtained by interpolation.

The speech signal was recorded with a sampling rate of 11025 Hz, and then processed by a vocoder from the SPTK toolkit (http://sp-tk.sourceforge.net). The vocoder represented the speech signals by 12 Mel-Generalized Cepstral Coefficients (MGCC) converted to a Line Spectral Pair representation (LSP), with the signal's gain being the 13th parameter. These 13 coefficients served as the training targets in the DNN modeling experiments, as the speech signal can be reasonably well reconstructed from these parameters. Although perfect reconstruction would require the estimation of the pitch (F0 parameter) as well, in this study we ignored this component during the experiments. To facilitate training, each of the 13 targets was standardized to zero mean and unit variance.
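A minimal sketch of this standardization step follows; computing the statistics on the training set and reusing them for the development and test data is an assumption (standard practice), as the text does not state it explicitly.

# Sketch: standardize the 13 vocoder targets to zero mean and unit variance.
import numpy as np

def standardize_targets(train_targets, dev_targets, test_targets):
    # each argument: array of shape (num_frames, 13) holding the MGC-LSP coefficients + gain
    mean = train_targets.mean(axis=0)            # statistics from the training set (assumed)
    std = train_targets.std(axis=0)
    return ((train_targets - mean) / std,
            (dev_targets - mean) / std,
            (test_targets - mean) / std)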
4 Experimental Set-Up
We implemented our deep neural networks in Keras, using a TensorFlow backend [4]. We created three different models: a simple fully connected network (FCN), a convolutional network that processes one frame of video (2D CNN), and a convolutional network that can process several subsequent video frames as input (3D CNN). To keep them comparable with respect to parameter count, all three models had approximately 3.3 million tunable parameters. Training was performed using the stochastic gradient descent method (SGD) with a batch size of 100. The training objective function was the mean squared error (MSE).
Fully Connected Network (FCN): The simplest possible DNN type is a network with fully connected layers. To be comparable with an earlier study [5], our FCN model consisted of 5 fully connected hidden layers, with an output layer of 13 neurons for the 13 training targets. The input of the network consisted of one video frame (128 × 64 = 8192 pixels). Each hidden layer had 350 neurons, so the model was about 4 times smaller compared to the FCN described in [5]. Apart from the linear output layer, all layers applied the swish activation function [27], and were followed by a dropout layer with the dropout rate set to 0.2.
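For illustration, this description maps almost directly onto a Keras model. The sketch below follows the stated configuration (five 350-unit swish hidden layers with dropout 0.2, a linear 13-unit output, SGD with batch size 100 and MSE loss); the learning rate is not given in the paper and is an assumption, and a recent TensorFlow 2.x with the built-in 'swish' activation is assumed.

# Keras sketch of the FCN baseline following the description above.
from tensorflow.keras import layers, models, optimizers

fcn = models.Sequential()
fcn.add(layers.Input(shape=(128 * 64,)))        # one 128 x 64 ultrasound frame, flattened
for _ in range(5):                              # 5 hidden layers of 350 'swish' units
    fcn.add(layers.Dense(350, activation='swish'))
    fcn.add(layers.Dropout(0.2))                # dropout rate 0.2, as described
fcn.add(layers.Dense(13, activation='linear'))  # linear output for the 13 vocoder targets

fcn.compile(optimizer=optimizers.SGD(learning_rate=0.01),   # learning rate is an assumption
            loss='mse')                                     # MSE objective, as described
# fcn.fit(x_train, y_train, batch_size=100, validation_data=(x_dev, y_dev))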
Convolutional Network (2D CNN): Similar to the FCN, the input to this network consisted of only one frame of data. The network performed spatial convolution on the input image via its four convolutional layers below the uppermost

Citations
Posted Content
TL;DR: In this article, a convolutional neural network classifier was used to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal.
Abstract: Voice Activity Detection (VAD) is not an easy task when the input audio signal is noisy, and it is even more complicated when the input is not even an audio recording. This is the case with Silent Speech Interfaces (SSI) where we record the movement of the articulatory organs during speech, and we aim to reconstruct the speech signal from this recording. Our SSI system synthesizes speech from ultrasonic videos of the tongue movement, and the quality of the resulting speech signals is evaluated by metrics such as the mean squared error loss function of the underlying neural network and the Mel-Cepstral Distortion (MCD) of the reconstructed speech compared to the original. Here, we first demonstrate that the amount of silence in the training data can have an influence both on the MCD evaluation metric and on the performance of the neural network model. Then, we train a convolutional neural network classifier to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal. In the experiments our ultrasound-based speech/silence separator achieved a classification accuracy of about 85% and an AUC score around 86%.

1 citation

Proceedings ArticleDOI
TL;DR: In this paper, a survey of human-inspired speech technologies developed for European Portuguese, and of the computational models they integrate and that made them possible, is presented, covering systems for synthesis and recognition as well as information on the methods adopted for the speech production studies that were performed, in parallel, to support them.
Abstract: This paper surveys human-inspired speech technologies developed for European Portuguese and the computational models they integrate and made them possible. In this regard, it covers systems for synthesis and recognition as well as information on the methods adopted for the speech production studies that were performed, in parallel, to support them. And, on doing so, it can also contribute to provide an entry point for those who work in the field but are not familiar with these particular areas, including: context, history, and comprehensive references. As the great majority of work in these areas for European Portuguese was done by the first author's research group, this paper can also be seen as a review of more than 25 years of research at University of Aveiro in these topics.
Book ChapterDOI
01 Jan 2022
TL;DR: In this article, the authors compared various combinations of 2D-CNN, 3D-CNN and ConvLSTM layers for a silent speech interface task, and obtained the best result with a hybrid model that consists of a combination of 3D-CNN and ConvLSTM layers.
Abstract: Silent Speech Interfaces aim to reconstruct the acoustic signal from a sequence of ultrasound tongue images that records the articulatory movement. The extraction of information about the tongue movement requires us to efficiently process the whole sequence of images, not just as a single image. Several approaches have been suggested to process such sequential image data. The classic neural network structure combines two-dimensional convolutional (2D-CNN) layers that process the images separately with recurrent layers (e.g. an LSTM) on top of them to fuse the information along time. More recently, it was shown that one may also apply a 3D-CNN network that can extract information along both the spatial and the temporal axes in parallel, achieving a similar accuracy while being less time consuming. A third option is to apply the less well-known ConvLSTM layer type, which combines the advantages of LSTM and CNN layers by replacing matrix multiplication with the convolution operation. In this paper, we experimentally compared various combinations of the above-mentioned layer types for a silent speech interface task, and we obtained the best result with a hybrid model that consists of a combination of 3D-CNN and ConvLSTM layers. This hybrid network is slightly faster, smaller and more accurate than our previous 3D-CNN model.
Posted Content
TL;DR: In this article, the authors presented multi-speaker experiments using the recently published TaL80 corpus and adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos.
Abstract: Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings represent not only the linguistic content, but are also highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In the experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize nicely to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought about a marginal reduction in the error rate of the spectral estimation step.
Journal ArticleDOI
TL;DR: In this paper, a spatial transformer network (STN) module is proposed to perform an affine transformation on the input ultrasound tongue images, to aid speaker and session adaptation of ultrasound-based SSI models.
Abstract: Thanks to the latest deep learning algorithms, silent speech interfaces (SSI) are now able to synthesize intelligible speech from articulatory movement data under certain conditions. However, the resulting models are rather speaker-specific, making a quick switch between users troublesome. Even for the same speaker, these models perform poorly cross-session, i.e. after dismounting and re-mounting the recording equipment. To aid quick speaker and session adaptation of ultrasound tongue imaging-based SSI models, we extend our deep networks with a spatial transformer network (STN) module, capable of performing an affine transformation on the input images. Although the STN part takes up only about 10% of the network, our experiments show that adapting just the STN module might allow to reduce MSE by 88% on the average, compared to retraining the whole network. The improvement is even larger (around 92%) when adapting the network to different recording sessions from the same speaker.
References
Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on the ImageNet classification task.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations

Proceedings ArticleDOI
01 Jan 2014
TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
Abstract: In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

19,998 citations

Journal ArticleDOI
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

9,091 citations

Journal ArticleDOI
TL;DR: A novel 3D CNN model for action recognition is developed, which extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
Abstract: We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance, we propose regularizing the outputs with high-level features and combining the predictions of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.

4,545 citations

Frequently Asked Questions (2)
Q1. What contributions have the authors mentioned in the paper "3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces"?

One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, the authors experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, the authors apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. The authors find experimentally that their 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.

In the future, the authors plan to investigate more sophisticated network types such as the ConvLSTM network that directly integrates the advantages of the convolutional and LSTM units [35].