3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces
Summary
1 Introduction
- In recent years, there has been increasing interest in articulatory-to-acoustic conversion, which seeks to reproduce the speech signal from a recording of the articulatory organs, providing the technological background for creating “Silent Speech Interfaces” (SSI) [6,28].
- Current SSI systems prefer the ‘direct synthesis’ principle, where speech is generated directly from the articulatory data, without any intermediate step.
- The sequence of ultrasound video frames carries extra information about the time trajectory of the tongue movement, which can be exploited by processing several neighboring frames at the same time.
- Alternatively, one may experiment with extending the 2D CNN structure to 3D by adding time as an extra dimension [17,20,32], as sketched below.
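To make the idea concrete, below is a minimal Keras sketch of a convolution that treats time as a third axis. The input shape, filter count, and kernel sizes are illustrative assumptions, not the configuration used in the paper.

```python
from tensorflow.keras import layers, models

# Assumed input: 5 neighboring ultrasound frames of 64x128 pixels, 1 channel.
NUM_FRAMES, HEIGHT, WIDTH = 5, 64, 128

model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, 1)),
    # A full 3D kernel spans time, height, and width simultaneously.
    layers.Conv3D(16, kernel_size=(3, 3, 3), activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),   # pool only along the spatial axes
    layers.Flatten(),
    layers.Dense(13),                           # 13 vocoder targets (see Sect. 3)
])
```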
2 Convolutional Neural Networks for Video Processing
- Ever since the invention of ‘AlexNet’, CNNs have remained the leading technology in the recognition of still images [21].
- There are several tasks where the input is a video, and handling the video as a sequence (instead of simply processing separate frames) is vital for obtaining good recognition results.
- For the processing of sequences, recurrent neural structures such as the LSTM are among the most powerful tools [12].
- The ‘(2+1)D convolution’ model of Tran et al. [30] first performs a 2D convolution along the spatial axes, followed by a 1D convolution along the time axis (see Fig. 1); a minimal sketch follows this list.
- Interestingly, a very similar network structure proved very efficient in speech recognition as well [29].
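Following the (2+1)D idea, the full 3D kernel from the sketch above can be factorized into a spatial-only step and a temporal-only step. This is a minimal sketch under the same illustrative assumptions, not the authors' exact architecture:

```python
from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH = 5, 64, 128   # illustrative, as before

model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, 1)),
    # 2D step: the kernel covers only the spatial axes, one frame at a time.
    layers.Conv3D(16, kernel_size=(1, 3, 3), activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    # 1D step: the kernel covers only the time axis, combining the frames.
    layers.Conv3D(16, kernel_size=(NUM_FRAMES, 1, 1), activation="relu"),
    layers.Flatten(),
    layers.Dense(13),
])
```

Compared with a full 3D kernel, the factorization inserts an extra nonlinearity between the spatial and temporal steps and reduces the number of parameters per block.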
3 Data Acquisition and Signal Preprocessing
- The ultrasound recordings were collected from a Hungarian female subject (42 years old, with normal speaking abilities) while she was reading sentences aloud.
- The ultrasound and the audio signals were synchronized using the software tool provided with the equipment.
- Figure 2 shows an example of the data samples arranged as a rectangular image, and the standard ultrasound-style display generated from it.
- The speech signal was recorded with a sampling rate of 11025 Hz, and then processed by a vocoder from the SPTK toolkit (http://sp-tk.sourceforge.net).
- To facilitate training, each of the 13 targets was standardized to zero mean and unit variance, as sketched below.
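A minimal sketch of this standardization step; the array shapes are an assumption about how the 13 vocoder targets are stored:

```python
import numpy as np

def standardize_targets(train_targets, test_targets):
    """Standardize each of the 13 vocoder targets to zero mean and unit variance.

    Both arrays have shape (num_frames, 13). Statistics are computed on the
    training set only and returned so that predictions can be de-standardized
    (pred * std + mean) before synthesis.
    """
    mean = train_targets.mean(axis=0)
    std = train_targets.std(axis=0)
    return (train_targets - mean) / std, (test_targets - mean) / std, mean, std
```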
4 Experimental Set-Up
- The authors implemented their deep neural networks in Keras, using a TensorFlow backend [4].
- To keep them comparable with respect to parameter count, all three models had approximately 3.3 million tunable parameters.
- Training was performed using stochastic gradient descent (SGD) with a batch size of 100; a training-configuration sketch follows this list.
- The simplest possible DNN type is a network with fully connected layers.
- The second network was a convolutional network (2D CNN); like the FCN, it received only a single frame of data as input.
- The optimal network meta-parameters were found experimentally, and all hidden layers applied the swish activation function [27].
- Following the concept of (2+1)D convolution described in Sect. 2, the five frames were first processed only spatially, and were then combined along the time axis just below the uppermost dense layer.
- There are several options for evaluating the performance of their networks.
- In the simplest case, performance can be compared using simple objective metrics, such as the value of the target function optimized during training (in this case, the MSE).
- Since such objective metrics do not always reflect perceptual quality, many authors also apply subjective listening tests such as the MUSHRA method [25].
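The following sketch pulls the training configuration together, using the FCN baseline as the example; the hidden layer width and the flattened input size are illustrative, and a TensorFlow version with the built-in 'swish' activation is assumed:

```python
from tensorflow.keras import layers, models, optimizers

def build_fcn(input_dim, hidden_units=350):
    """FCN baseline: 5 fully connected hidden layers with swish activations
    and a linear output layer with one neuron per vocoder target."""
    model = models.Sequential([layers.Input(shape=(input_dim,))])
    for _ in range(5):
        model.add(layers.Dense(hidden_units, activation="swish"))
    model.add(layers.Dense(13))
    return model

model = build_fcn(input_dim=64 * 128)               # assumed flattened frame size
model.compile(optimizer=optimizers.SGD(), loss="mse")
# model.fit(x_train, y_train, batch_size=100, ...)  # batch size as in the paper
```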
5 Results and Discussion
- As for the 3D CNN, the authors found that the value of the stride parameter s has a significant impact on the error rate attained.
- Along with the MSE values, the correlation-based R² scores are now also shown; see the evaluation sketch after this list.
- They obtained slightly better results than those given by their FCN, presumably because their network had about four times as many parameters.
- These simple methods failed to significantly reduce the error rate.
- Moreover, instead of reducing the input size by feature selection, it seems to be more efficient to send the frames through several neural layers, with a relatively narrow ‘bottleneck’ layer on top.
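A sketch of how these objective scores can be computed; note that scikit-learn's r2_score is the coefficient of determination, used here as a stand-in for the paper's correlation-based R², and it may differ slightly from a squared-correlation definition:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def objective_scores(y_true, y_pred):
    """MSE and mean per-target R^2 for target arrays of shape (num_frames, 13)."""
    mse = mean_squared_error(y_true, y_pred)            # averaged over all targets
    r2 = np.mean([r2_score(y_true[:, i], y_pred[:, i])  # one score per vocoder target
                  for i in range(y_true.shape[1])])
    return mse, float(r2)
```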
6 Conclusions
- Here, the authors implemented a 3D CNN for ultrasound-based articulatory-to-acoustic conversion, where the CNN applied separate spatial and temporal components, motivated by the (2+1)D CNN of Tran et al. [30].
- The model was compared with a CNN+LSTM architecture that was recently proposed for the same task.
- This study was supported by the National Research, Development and Innovation Office of Hungary (project FK 124584), by the AI National Excellence Program (grant 2018-1.2.1-NKP-2018-00008), and by grant TUDFO/47138-1/2019-ITM of the Ministry of Innovation and Technology.
- The GPU card used for the computations was donated by the NVIDIA Corporation.
- The authors thank the MTA-ELTE Lendület Lingual Articulation Research Group for providing the ultrasound recordings.
Frequently Asked Questions
Q2. What future work is mentioned in the paper "3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces"?
In the future, the authors plan to investigate more sophisticated network types, such as the ConvLSTM network, which directly integrates the advantages of the convolutional and LSTM units [35].
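Keras already ships a convolutional LSTM layer, so a minimal sketch of that future direction could look like the following; all shapes and sizes are illustrative assumptions, not a configuration from the paper:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(5, 64, 128, 1)),   # (frames, height, width, channels)
    # Convolutional gates over the frame sequence; only the final state is kept.
    layers.ConvLSTM2D(16, kernel_size=(3, 3)),
    layers.Flatten(),
    layers.Dense(13),
])
```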