Book ChapterDOI

3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

12 Oct 2020, pp. 159–169 (ICAISC 2020, LNAI 12415), https://doi.org/10.1007/978-3-030-61401-0_16
TL;DR: This work experiments with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time, and finds experimentally that the 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.
Abstract: Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. The efficient solution requires methods that do not simply process single images, but are able to extract the tongue movement information from a sequence of video frames. One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, we experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, we apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. We find experimentally that our 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.

Summary (2 min read)

1 Introduction

  • During the last couple of years, there has been an increasing interest in articulatory-to-acoustic conversion, which seeks to reproduce the speech signal from a recording of the articulatory organs, giving the technological background for creating “Silent Speech Interfaces” (SSI) [6,28].
  • Current SSI systems prefer the ‘direct synthesis’ principle, where speech is generated directly from the articulatory data, without any intermediate step.
  • This sequence carries extra information about the time trajectory of the tongue movement, which might be exploited by processing several neighboring video frames at the same time.
  • Alternatively, one may experiment with extending the 2D CNN structure to 3D, by adding time as an extra dimension [17,20,32].

2 Convolutional Neural Networks for Video Processing

  • Ever since the invention of ‘Alexnet’, CNNs have remained the leading technology in the recognition of still images [21].
  • There are several tasks where the input is a video, and handling the video as a sequence (instead of simply processing separate frames) is vital for obtaining good recognition results.
  • For the processing of sequences, recurrent neural structures such as the LSTM are the most powerful tool [12].
  • The model they called ‘(2+1)D convolution’ first performs a 2D convolution along the spatial axes, and then a 1D convolution along the time axis (see Fig. 1).
  • Interestingly, a very similar network structure proved very efficient in speech recognition as well [29].

3 Data Acquisition and Signal Preprocessing

  • The ultrasound recordings were collected from a Hungarian female subject (42 years old, with normal speaking abilities) while she was reading sentences aloud.
  • The ultrasound and the audio signals were synchronized using the software tool provided with the equipment.
  • Figure 2 shows an example of the data samples arranged as a rectangular image, and the standard ultrasound-style display generated from it.
  • The speech signal was recorded with a sampling rate of 11025 Hz, and then processed by a vocoder from the SPTK toolkit (http://sp-tk.sourceforge.net).
  • To facilitate training, each of the 13 targets was standardized to zero mean and unit variance.

4 Experimental Set-Up

  • The authors implemented their deep neural networks in Keras, using a Tensorflow backend [4].
  • To keep them comparable with respect to parameter count, all three models had approximately 3.3 million tunable parameters.
  • Training was performed using the stochastic gradient descent method (SGD) with a batch size of 100.
  • The simplest possible DNN type is a network with fully connected layers.
  • Similar to the FCN, the input to the convolutional network (2D CNN) consisted of only one frame of data.


  • The optimal network meta-parameters were found experimentally, and all hidden layers applied the swish activation function [27].
  • Following the concept of (2+1)D convolution described in Sect. 2, the five frames were first processed only spatially, and then got combined along the time axis just below the uppermost dense layer.
  • There are several options for evaluating the performance of their networks.
  • In the simplest case, the authors can compare their performance by simple objective metrics, such as the value of the target function optimized during training (the MSE function in their case).
  • Hence, many authors apply subjective listening tests such as the MUSHRA method [25].

5 Results and Discussion

  • As for the 3D CNN, the authors found that the value of the stride parameter s has a significant impact on the error rate attained.
  • Along with the MSE values, the correlation-based R² scores are now also shown (a minimal sketch of computing both metrics follows this list).
  • They obtained slightly better results than those given by their FCN, presumably due to the fact that their network had about 4 times as many parameters.
  • These simple methods failed to significantly reduce the error rate.
  • Moreover, instead of reducing the input size by feature selection, it seems to be more efficient to send the frames through several neural layers, with a relatively narrow ‘bottleneck’ layer on top.
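As an illustration of the two objective metrics mentioned above (the MSE loss optimized during training and the correlation-based R² score), here is a minimal Python sketch using scikit-learn; the array names and the per-target averaging of R² are our assumptions rather than details taken from the paper.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def objective_scores(y_true, y_pred):
    # y_true, y_pred: (num_frames, 13) arrays of standardized vocoder
    # targets and network outputs. Returns the overall MSE and the
    # per-target R^2 averaged over the 13 coefficients.
    mse = mean_squared_error(y_true, y_pred)
    r2 = np.mean([r2_score(y_true[:, i], y_pred[:, i])
                  for i in range(y_true.shape[1])])
    return mse, r2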

6 Conclusions

  • Here, the authors implemented a 3D CNN for ultrasound-based articulatory-to-acoustic conversion, where the CNN applied separate spatial and temporal components, motivated by the (2+1)D CNN of Tran et al. [30].
  • The model was compared with a CNN+LSTM architecture that was recently proposed for the same task.
  • This study was supported by the National Research, Development and Innovation Office of Hungary through project FK 124584 and by the AI National Excellence Program (grant 2018-1.2.1-NKP-2018-00008) and by grant TUDFO/47138-1/2019-ITM of the Ministry of Innovation and Technology.
  • The GPU card used for the computations was donated by the NVIDIA Corporation.
  • The authors thank the MTA-ELTE Lendület Lingual Articulation Research Group for providing the ultrasound recordings.



3D Convolutional Neural Networks
for Ultrasound-Based Silent Speech
Interfaces
László Tóth and Amin Honarmandi Shandiz
Institute of Informatics, University of Szeged, Szeged, Hungary
{tothl,shandiz}@inf.u-szeged.hu
Abstract. Silent speech interfaces (SSI) aim to reconstruct the speech
signal from a recording of the articulatory movement, such as an ultra-
sound video of the tongue. Currently, deep neural networks are the most
successful technology for this task. The efficient solution requires meth-
ods that do not simply process single images, but are able to extract
the tongue movement information from a sequence of video frames. One
option for this is to apply recurrent neural structures such as the long
short-term memory network (LSTM) in combination with 2D convolu-
tional neural networks (CNNs). Here, we experiment with another app-
roach that extends the CNN to perform 3D convolution, where the extra
dimension corresponds to time. In particular, we apply the spatial and
temporal convolutions in a decomposed form, which proved very suc-
cessful recently in video action recognition. We find experimentally that
our 3D network outperforms the CNN+LSTM model, indicating that
3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI
systems.
Keywords: Silent speech interface · Convolutional neural network · 3D convolution · Ultrasound video
1 Introduction
During the last couple of years, there has been an increasing interest in
articulatory-to-acoustic conversion, which seeks to reproduce the speech signal
from a recording of the articulatory organs, giving the technological background
for creating “Silent Speech Interfaces” (SSI) [6,28]. These interfaces allow us to
record the soundless articulatory movement, and then automatically generate
speech from the movement information, while the subject is actually not pro-
ducing any sound. Such an SSI system could be very useful for the speaking
impaired who are able to move their articulators, but have lost their ability
to produce any sound (e.g. due to a laryngectomy or some injury of the vocal
cords). It could also be applied in human-computer interaction in situations
where regular speech is not feasible (e.g. extremely noisy environments or mil-
itary applications). Several solutions exist for the recording of the articulatory
movements, the simplest approach being a lip video [1,8]. But one may also
apply electromagnetic articulography (EMA, [18,19]), ultrasound tongue imag-
ing (UTI, [5,10,16,20]) or permanent magnetic articulography (PMA, [9]). Sur-
face Electromyography (sEMG, [14,15,24]) is also an option, while some authors
use a combination of the above methods [6]. Here we are going to work with
ultrasound tongue videos.
To convert the movement recordings into speech, the conventional approach
is to apply a two-step procedure of ‘recognition-and-synthesis’ [28]. In this case,
the biosignal is first converted into text by a properly adjusted speech recognition
system. The text is then converted into speech using text-to-speech synthesis [7,
13,31]. The drawbacks of this approach are the relatively large delay between
the input and the output, and that the errors made by the speech recognizer
will inevitably appear as errors in the TTS output. Also, all information related
to speech prosody is lost, while certain prosodic components such as energy and
pitch can be reasonably well estimated from the articulatory signal [10].
Current SSI systems prefer the ‘direct synthesis’ principle, where speech is
generated directly from the articulatory data, without any intermediate step.
Moreover, as recently the Deep Neural Network (DNN) technology has become
dominant in practically all areas of speech technology, such as speech recogni-
tion [11], speech synthesis [22] and language modeling [33], most recent studies
have attempted to solve the articulatory-to-acoustic conversion problem by using
deep learning, regardless of the recording technique applied [5,9,10,14,16,20,25].
In this paper, we also apply deep neural networks to convert the ultrasound video
of the tongue movement to speech. Although some early studies used simple fully
connected neural networks [5,16], as we are working with images, it seems more
reasonable to apply convolutional neural networks (CNN), which are currently
very popular and successful in image recognition [21]. Thus, many recent studies
on SSI systems use CNNs [14,20,25].
Our input here is a video, that is, not just one still image, but a sequence of
images. This sequence carries extra information about the time trajectory of the
tongue movement, which might be exploited by processing several neighboring
video frames at the same time. There are several options to create a network
structure for processing time sequences. For such data, usually recurrent neural
networks such as the long short-term memory network (LSTM) are applied,
typically stacking it on top of a 2D CNN that seeks to process the individual
frames [9,19,23,25]. Alternatively, one may experiment with extending the 2D
CNN structure to 3D, by adding time as an extra dimension [17,20,32]. Here,
we follow the latter approach, and we investigate the applicability of a special
3D CNN model called the (2+1)D CNN [30] for ultrasound-based direct speech
synthesis, and compare the results with those of a CNN+LSTM model. We find
that our 3D CNN model achieves a lower error rate, while it is smaller, and its
training is faster. We conclude that for ultrasound video-based SSI systems, 3D
CNNs are definitely a feasible alternative to recurrent neural models.
The paper is structured as follows. Section 2 gives a technological overview
of the CNNs we are going to apply. In Sect. 3 we describe the data acquisition

and processing steps for the ultrasound videos and the speech signal. Section 4
presents our experimental set-up. We present the experimental results and dis-
cuss them in Sect. 5, and the paper is closed with the conclusions in Sect. 6.
Fig. 1. Illustration of how the (2+1)D CNN operates. The video frames (at the bottom)
are first processed by layers that perform 2D spatial convolution, then their outputs
are combined by 1D temporal convolution. The model is allowed to skip video frames
by changing the stride parameter of the temporal convolution.
2 Convolutional Neural Networks for Video Processing
Ever since the invention of ‘Alexnet’, CNNs have remained the leading tech-
nology in the recognition of still images [21]. These standard CNNs apply the
convolution along the two spatial axes, that is, in two dimensions (2D). How-
ever, there are several tasks where the input is a video, and handling the video as
a sequence (instead of simply processing separate frames) is vital for obtaining
good recognition results. The best example is human gait recognition, but we can
talk about action recognition in general [17,36,37]. In these cases, the sequence
of video frames forms a three-dimensional data array, with the temporal axis
being the third dimension in addition to the two spatial dimensions (cf. Fig. 1).
For the processing of sequences, recurrent neural structures such as the LSTM
are the most powerful tool [12]. However, the training of these networks is known
to be slow and problematic, which led to the invention of simplified models, such
as the gated recurrent unit (GRU) [3] or the quasi-recurrent neural network [2].
Alternatively, several convolutional network structures have been proposed that
handle time sequences without recurrent connections. In speech recognition,

time-delay neural networks (TDNNs) have proved very successful [26,29], but
we can also mention the feedforward sequential memory network [34]. As regards
video processing, several modified CNN structures have been proposed to handle
the temporal sequence of video frames [17,36,37]. Unfortunately, the standard
2D convolution may be extended to 3D in many possible ways, giving a lot of
choices for optimization. Tran et al. performed an experimental comparison of
several 3D variants, and they got the best results when they decomposed the
spatial and temporal convolution steps [30]. The model they called ‘(2+1)D con-
volution’ first performs a 2D convolution along the spatial axes, and then a 1D
convolution along the time axis (see Fig. 1). By changing the stride parameter
of the 1D convolution, the model can skip several video frames, thus covering a
wider time context without increasing the number of processed frames. Interest-
ingly, a very similar network structure proved very efficient in speech recognition
as well [29]. Stacking several such processing blocks on top of each other is also
possible, resulting in a very deep network [30]. Here, we are going to experiment
with a similar (2+1)D network structure for ultrasound-based SSI systems.
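To make the decomposition concrete, the sketch below builds a small (2+1)D network in Keras in the spirit of the structure described above. It is only an illustration under assumed layer sizes (the filter counts, pooling sizes and dense layer are not taken from the paper): the spatial stage uses 1 × 3 × 3 kernels that never mix information across frames, and the temporal stage uses a 3 × 1 × 1 kernel whose stride lets the model skip video frames.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_2plus1d_sketch(frames=5, height=128, width=64, temporal_stride=1):
    # Toy (2+1)D network: 2D spatial convolutions applied to each frame
    # independently, followed by a 1D temporal convolution. All layer
    # sizes are illustrative assumptions, not the paper's configuration.
    inputs = layers.Input(shape=(frames, height, width, 1))
    # Spatial stage: 1 x 3 x 3 kernels convolve each frame on its own.
    x = layers.Conv3D(30, kernel_size=(1, 3, 3), activation=tf.nn.swish)(inputs)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
    x = layers.Conv3D(60, kernel_size=(1, 3, 3), activation=tf.nn.swish)(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
    # Temporal stage: a 3 x 1 x 1 kernel fuses the frame-wise feature maps;
    # a larger stride makes the model skip frames along the time axis.
    x = layers.Conv3D(60, kernel_size=(3, 1, 1),
                      strides=(temporal_stride, 1, 1),
                      activation=tf.nn.swish)(x)
    x = layers.MaxPooling3D(pool_size=(1, 4, 4))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(500, activation=tf.nn.swish)(x)
    outputs = layers.Dense(13, activation='linear')(x)  # 13 vocoder targets
    return models.Model(inputs, outputs)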
3 Data Acquisition and Signal Preprocessing
The ultrasound recordings were collected from a Hungarian female subject (42
years old, with normal speaking abilities) while she was reading sentences aloud.
Her tongue movement was recorded in a midsagittal orientation placing the
ultrasonic imaging probe under the jaw using a “Micro” ultrasound system
by Articulate Instruments Ltd. The transducer was fixed using a stabilization
headset. The 2–4 MHz/64 element 20 mm radius convex ultrasound transducer
produced 82 images per second. The speech signal was recorded in parallel with
an Audio-Technica ATR 3350 omnidirectional condenser microphone placed at
a distance of 20 cm from the lips. The ultrasound and the audio signals were
synchronized using the software tool provided with the equipment. Altogether
438 sentences (approximately half an hour) were recorded from the subject,
which were divided into train, development and test sets in a 310-41-87 ratio. We
should add that the same dataset was used in several earlier studies [5,10].
The ultrasound probe records 946 samples along each of its 64 scan lines.
The recorded data can be converted to conventional ultrasound images using
the software tools provided. However, due to its irregular shape, this image is
harder to process by computers, while it contains no extra information compared
to the original scan data. Hence, we worked with the original 946 × 64 data items,
which were downsampled to 128 × 64 pixels. Figure 2 shows an example of the
data samples arranged as a rectangular image, and the standard ultrasound-
style display generated from it. The intensity range of the data was min-max
normalized to the [−1, 1] interval before feeding it to the network.
The speech signal was recorded with a sampling rate of 11025 Hz, and then
processed by a vocoder from the SPTK toolkit (http://sp-tk.sourceforge.net).
The vocoder represented the speech signals by 12 Mel-Generalized Cepstral Coef-
ficients (MGCC) converted to a Line Spectral Pair representation (LSP), with

Fig. 2. Example of displaying the ultrasound recordings as a) a rectangular image of
raw data samples b) an anatomically correct image, obtained by interpolation.
the signal’s gain being the 13th parameter. These 13 coefficients served as the
training targets in the DNN modeling experiments, as the speech signal can be
reasonably well reconstructed from these parameters. Although perfect recon-
struction would require the estimation of the pitch (F0 parameter) as well, in this
study we ignored this component during the experiments. To facilitate training,
each of the 13 targets was standardized to zero mean and unit variance.
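A rough sketch of these preprocessing steps is given below. It assumes NumPy arrays and uses scikit-image for the downsampling; the choice of library, the per-frame scope of the min-max normalization and all function names are our assumptions, not details stated in the paper.

import numpy as np
from skimage.transform import resize  # downsampling tool assumed, not named in the paper

def preprocess_ultrasound_frame(raw_frame):
    # Downsample one raw 946 x 64 scan to 128 x 64 pixels and min-max
    # normalize its intensities to the [-1, 1] interval (here per frame;
    # the paper does not specify the exact normalization scope).
    img = resize(raw_frame.astype(np.float32), (128, 64), anti_aliasing=True)
    lo, hi = img.min(), img.max()
    return 2.0 * (img - lo) / (hi - lo) - 1.0

def standardize_targets(train_targets):
    # Standardize the 13 vocoder parameters (12 MGC-LSP coefficients plus
    # the gain) to zero mean and unit variance using training-set statistics.
    mean = train_targets.mean(axis=0)
    std = train_targets.std(axis=0) + 1e-8
    return (train_targets - mean) / std, mean, std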
4 Experimental Set-Up
We implemented our deep neural networks in Keras, using a Tensorflow back-
end [4]. We created three different models: a simple fully connected network
(FCN), a convolutional network that processes one frame of video (2D CNN),
and a convolutional network that can process several subsequent video frames as
input (3D CNN). To keep them comparable with respect to parameter count, all
three models had approximately 3.3 million tunable parameters. Training was
performed using the stochastic gradient descent method (SGD) with a batch size
of 100. The training objective function was the mean squared error (MSE).
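As a sketch of this training configuration, the snippet below shows the corresponding Keras calls: plain SGD, MSE loss and a batch size of 100, as described above. The learning rate, the number of epochs and the variable names are our assumptions.

from tensorflow.keras.optimizers import SGD

def train(model, x_train, y_train, x_dev, y_dev):
    # model: any of the three networks (FCN, 2D CNN or 3D CNN);
    # learning rate and epoch count are illustrative, not from the paper.
    model.compile(optimizer=SGD(learning_rate=0.01), loss='mse')
    model.fit(x_train, y_train,
              validation_data=(x_dev, y_dev),
              batch_size=100,
              epochs=100)
    return model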
Fully Connected Network (FCN): The simplest possible DNN type is a
network with fully connected layers. To be comparable with an earlier study [5],
our FCN model consisted of 5 fully connected hidden layers, with an output layer
of 13 neurons for the 13 training targets. The input of the network consisted of
one video frame (128 × 64 = 8192 pixels). Each hidden layer had 350 neurons, so
the model was about 4 times smaller compared to the FCN described in [5]. Apart
from the linear output layer, all layers applied the swish activation function [27],
and were followed by a dropout layer with the dropout rate set to 0.2.
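The following is a minimal Keras sketch of this FCN baseline, reconstructed from the description above (an 8192-dimensional input, five hidden layers of 350 swish units each followed by 0.2 dropout, and a linear output layer of 13 neurons); it is our own reconstruction, not the authors' code.

import tensorflow as tf
from tensorflow.keras import layers, models

fcn = models.Sequential()
fcn.add(layers.Dense(350, activation=tf.nn.swish, input_shape=(128 * 64,)))
fcn.add(layers.Dropout(0.2))
for _ in range(4):                      # four more hidden layers of 350 units
    fcn.add(layers.Dense(350, activation=tf.nn.swish))
    fcn.add(layers.Dropout(0.2))
fcn.add(layers.Dense(13, activation='linear'))  # 13 standardized vocoder targets
fcn.summary()  # roughly 3.3 million trainable parameters, as stated in Sect. 4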
Convolutional Network (2D CNN): Similar to the FCN, the input to this
network consisted of only one frame of data. The network performed spatial con-
volution on the input image via its four convolutional layers below the uppermost

Citations
Posted Content
TL;DR: In this article, the authors compare the performance of three deep neural architectures for the estimation task, combining convolutional (CNN) and recurrence-based (LSTM) neural layers.
Abstract: Several approaches exist for the recording of articulatory movements, such as electromagnetic and permanent magnetic articulography, ultrasound tongue imaging and surface electromyography. Although magnetic resonance imaging (MRI) is more costly than the above approaches, the recent developments in this area now allow the recording of real-time MRI videos of the articulators with an acceptable resolution. Here, we experiment with the reconstruction of the speech signal from a real-time MRI recording using deep neural networks. Instead of estimating speech directly, our networks are trained to output a spectral vector, from which we reconstruct the speech signal using the WaveGlow neural vocoder. We compare the performance of three deep neural architectures for the estimation task, combining convolutional (CNN) and recurrence-based (LSTM) neural layers. Besides the mean absolute error (MAE) of our networks, we also evaluate our models by comparing the speech signals obtained using several objective speech quality metrics like the mean cepstral distortion (MCD), Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ) and Signal-to-Distortion Ratio (SDR). The results indicate that our approach can successfully reconstruct the gross spectral shape, but more improvements are needed to reproduce the fine spectral details.

5 citations

Book ChapterDOI
28 Jun 2021
TL;DR: In this paper, a Generative Adversarial Network (GAN) is proposed to improve the perceptual quality of the generated signals by increasing their similarity to real signals, where the similarity is evaluated via a discriminator network.
Abstract: Besides the well-known classification task, these days neural networks are frequently being applied to generate or transform data, such as images and audio signals. In such tasks, the conventional loss functions like the mean squared error (MSE) may not give satisfactory results. To improve the perceptual quality of the generated signals, one possibility is to increase their similarity to real signals, where the similarity is evaluated via a discriminator network. The combination of the generator and discriminator nets is called a Generative Adversarial Network (GAN). Here, we evaluate this adversarial training framework in the articulatory-to-acoustic mapping task, where the goal is to reconstruct the speech signal from a recording of the movement of articulatory organs. As the generator, we apply a 3D convolutional network that gave us good results in an earlier study. To turn it into a GAN, we extend the conventional MSE training loss with an adversarial loss component provided by a discriminator network. As for the evaluation, we report various objective speech quality metrics such as the Perceptual Evaluation of Speech Quality (PESQ), and the Mel-Cepstral Distortion (MCD). Our results indicate that the application of the adversarial training loss brings about a slight, but consistent improvement in all these metrics.

5 citations

Book ChapterDOI
06 Sep 2021
TL;DR: In this paper, a convolutional neural network classifier was used to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal.
Abstract: Voice Activity Detection (VAD) is not easy task when the input audio signal is noisy, and it is even more complicated when the input is not even an audio recording. This is the case with Silent Speech Interfaces (SSI) where we record the movement of the articulatory organs during speech, and we aim to reconstruct the speech signal from this recording. Our SSI system synthesizes speech from ultrasonic videos of the tongue movement, and the quality of the resulting speech signals are evaluated by metrics such as the mean squared error loss function of the underlying neural network and the Mel-Cepstral Distortion (MCD) of the reconstructed speech compared to the original. Here, we first demonstrate that the amount of silence in the training data can have an influence both on the MCD evaluation metric and on the performance of the neural network model. Then, we train a convolutional neural network classifier to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal. In the experiments our ultrasound-based speech/silence separator achieved a classification accuracy of about 85% and an AUC score around 86%.

3 citations

Proceedings ArticleDOI
26 Jun 2022
TL;DR: This paper experimentally compared various combinations of the above layer types for a silent speech interface task, and obtained the best result with a hybrid model that consists of a combination of 3D-CNN and ConvLSTM layers.
Abstract: Silent Speech Interfaces aim to reconstruct the acoustic signal from a sequence of ultrasound tongue images that records the articulatory movement. The extraction of information about the tongue movement requires us to efficiently process the whole sequence of images, not just as a single image. Several approaches have been suggested to process such sequential image data. The classic neural network structure combines two-dimensional convolutional (2D-CNN) layers that process the images separately with recurrent layers (e.g. an LSTM) on top of them to fuse the information along time. More recently, it was shown that one may also apply a 3D-CNN network that can extract information along both the spatial and the temporal axes in parallel, achieving a similar accuracy while being less time consuming. A third option is to apply the less well-known ConvLSTM layer type, which combines the advantages of LSTM and CNN layers by replacing matrix multiplication with the convolution operation. In this paper, we experimentally compared various combinations of the above-mentioned layer types for a silent speech interface task, and we obtained the best result with a hybrid model that consists of a combination of 3D-CNN and ConvLSTM layers. This hybrid network is slightly faster, smaller and more accurate than our previous 3D-CNN model.

1 citation

Book ChapterDOI
TL;DR: In this paper, a Generative Adversarial Network (GAN) is proposed to improve the perceptual quality of the generated signals by increasing their similarity to real signals, where the similarity is evaluated via a discriminator network.
Abstract: Besides the well-known classification task, these days neural networks are frequently being applied to generate or transform data, such as images and audio signals. In such tasks, the conventional loss functions like the mean squared error (MSE) may not give satisfactory results. To improve the perceptual quality of the generated signals, one possibility is to increase their similarity to real signals, where the similarity is evaluated via a discriminator network. The combination of the generator and discriminator nets is called a Generative Adversarial Network (GAN). Here, we evaluate this adversarial training framework in the articulatory-to-acoustic mapping task, where the goal is to reconstruct the speech signal from a recording of the movement of articulatory organs. As the generator, we apply a 3D convolutional network that gave us good results in an earlier study. To turn it into a GAN, we extend the conventional MSE training loss with an adversarial loss component provided by a discriminator network. As for the evaluation, we report various objective speech quality metrics such as the Perceptual Evaluation of Speech Quality (PESQ), and the Mel-Cepstral Distortion (MCD). Our results indicate that the application of the adversarial training loss brings about a slight, but consistent improvement in all these metrics.

1 citation

References
Journal ArticleDOI
TL;DR: This work promises to lead to a technology that truly will give people whose larynx has been removed their voices back, with the best results obtained for a silent-speech system without a restricted vocabulary and with an unobtrusive device that delivers audio in close to real time.
Abstract: This paper describes a technique that generates speech acoustics from articulator movements. Our motivation is to help people who can no longer speak following laryngectomy, a procedure that is carried out tens of thousands of times per year in the Western world. Our method for sensing articulator movement, permanent magnetic articulography, relies on small, unobtrusive magnets attached to the lips and tongue. Changes in magnetic field caused by magnet movements are sensed and form the input to a process that is trained to estimate speech acoustics. In the experiments reported here this “Direct Synthesis” technique is developed for normal speakers, with glued-on magnets, allowing us to train with parallel sensor and acoustic data. We describe three machine learning techniques for this task, based on Gaussian mixture models, deep neural networks, and recurrent neural networks (RNNs). We evaluate our techniques with objective acoustic distortion measures and subjective listening tests over spoken sentences read from novels (the CMU Arctic corpus). Our results show that the best performing technique is a bidirectional RNN (BiRNN), which employs both past and future contexts to predict the acoustics from the sensor data. BiRNNs are not suitable for synthesis in real time but fixed-lag RNNs give similar results and, because they only look a little way into the future, overcome this problem. Listening tests show that the speech produced by this method has a natural quality that preserves the identity of the speaker. Furthermore, we obtain up to 92% intelligibility on the challenging CMU Arctic material. To our knowledge, these are the best results obtained for a silent-speech system without a restricted vocabulary and with an unobtrusive device that delivers audio in close to real time. This work promises to lead to a technology that truly will give people whose larynx has been removed their voices back.

70 citations

Proceedings ArticleDOI
27 Feb 2017
TL;DR: In this article, an end-to-end model based on a convolutional neural network (CNN) was proposed for generating an intelligible acoustic speech signal from silent video frames of a speaking person.
Abstract: Speechreading is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible acoustic speech signal from silent video frames of a speaking person. The proposed CNN generates sound features for each frame based on its neighboring frames. Waveforms are then synthesized from the learned speech features to produce intelligible speech. We show that by leveraging the automatic feature learning capabilities of a CNN, we can obtain state-of-the-art word intelligibility on the GRID dataset, and show promising results for learning out-of-vocabulary (OOV) words.

64 citations

Posted Content
TL;DR: A deep neural network, trained jointly on different speakers is able to extract individual speaker characteristics and gives promising results of reconstructing intelligible speech with superior word recognition accuracy.
Abstract: In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use auditory spectrogram as spectral representation of speech and its corresponding sound generation method resulting in a more natural sounding reconstructed speech. Our proposed network consists of an autoencoder to extract bottleneck features from the auditory spectrogram which is then used as target to our main lip reading network comprising of CNN, LSTM and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with a 98% correlation and also improves the quality of reconstructed speech from the main lip reading network. Our model, trained jointly on different speakers is able to extract individual speaker characteristics and gives promising results of reconstructing intelligible speech with superior word recognition accuracy.

54 citations


"3D Convolutional Neural Networks fo..." refers background in this paper

  • ...Several solutions exist for the recording of the articulatory movements, the simplest approach being a lip video [8,1]....

    [...]

Proceedings ArticleDOI
20 Aug 2017
TL;DR: It is found that the representation that used several neighboring image frames in combination with a feature selection method was preferred both by the subjects taking part in the listening experiments, and in terms of the Normalized Mean Squared Error.
Abstract: In this paper we present our initial results in articulatory-toacoustic conversion based on tongue movement recordings using Deep Neural Networks (DNNs). Despite the fact that deep learning has revolutionized several fields, so far only a few researchers have applied DNNs for this task. Here, we compare various possible feature representation approaches combined with DNN-based regression. As the input, we recorded synchronized 2D ultrasound images and speech signals. The task of the DNN was to estimate Mel-Generalized Cepstrum-based Line Spectral Pair (MGC-LSP) coefficients, which then served as input to a standard pulse-noise vocoder for speech synthesis. As the raw ultrasound images have a relatively high resolution, we experimented with various feature selection and transformation approaches to reduce the size of the feature vectors. The synthetic speech signals resulting from the various DNN configurations were evaluated both using objective measures and a subjective listening test. We found that the representation that used several neighboring image frames in combination with a feature selection method was preferred both by the subjects taking part in the listening experiments, and in terms of the Normalized Mean Squared Error. Our results may be useful for creating Silent Speech Interface applications in the future.

54 citations


"3D Convolutional Neural Networks fo..." refers background or methods or result in this paper

  • ...But one may also apply electromagnetic articulography (EMA, [18,19]), ultrasound tongue imaging (UTI, [16,5,10,20]) or permanent magnetic articulography (PMA, [9])....

    [...]

  • ...The authors of [5] applied a fully connected network on the same data set....

    [...]

  • ...Moreover, as recently the Deep Neural Network (DNN) technology have become dominant in practically all areas of speech technology, such as speech recognition [11], speech synthesis [22] and language modeling [33], most recent studies have attempted to solve the articulatory-to-acoustic conversion problem by using deep learning, regardless of the recording technique applied [5,10,14,16,9,25,20]....

    [...]

  • ...To be comparable with an earlier study [5], our FCN model consisted of 5 fully connected hidden layers, with an output layer of 13 neurons for the 13 training targets....

    [...]

  • ...Each hidden layers had 350 neurons, so the model was about 4 times smaller compared to the FCN described in [5]....

    [...]

Proceedings ArticleDOI
01 Apr 2018
TL;DR: In this paper, an autoencoder is used to extract bottleneck features from the auditory spectrogram which is then used as target to the main lip reading network comprising of CNN, LSTM and fully connected layers.
Abstract: In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use auditory spectrogram as spectral representation of speech and its corresponding sound generation method resulting in a more natural sounding reconstructed speech. Our proposed network consists of an autoencoder to extract bottleneck features from the auditory spectrogram which is then used as target to our main lip reading network comprising of CNN, LSTM and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with a 98% correlation and also improves the quality of reconstructed speech from the main lip reading network. Our model, trained jointly on different speakers is able to extract individual speaker characteristics and gives promising results of reconstructing intelligible speech with superior word recognition accuracy.

47 citations

Frequently Asked Questions (2)
Q1. What contributions have the authors mentioned in the paper "3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces"?

One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, the authors experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, the authors apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. The authors find experimentally that their 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.

In the future, the authors plan to investigate more sophisticated network types such as the ConvLSTM network that directly integrates the advantages of the convolutional and LSTM units [35].