Book Chapter

3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

12 Oct 2020, ICAISC 2020, LNAI 12415, pp. 159–169. https://doi.org/10.1007/978-3-030-61401-0_16
TL;DR: This work experiments with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time, and finds experimentally that the 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.
Abstract: Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. The efficient solution requires methods that do not simply process single images, but are able to extract the tongue movement information from a sequence of video frames. One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, we experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, we apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. We find experimentally that our 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.

Summary (2 min read)

1 Introduction

  • During the last couple of years, there has been an increasing interest in articulatory-to-acoustic conversion, which seeks to reproduce the speech signal from a recording of the articulatory organs, giving the technological background for creating “Silent Speech Interfaces” (SSI) [6,28].
  • Current SSI systems prefer the ‘direct synthesis’ principle, where speech is generated directly from the articulatory data, without any intermediate step.
  • The input video sequence carries extra information about the time trajectory of the tongue movement, which might be exploited by processing several neighboring video frames at the same time.
  • Alternatively, one may experiment with extending the 2D CNN structure to 3D, by adding time as an extra dimension [17,20,32].

2 Convolutional Neural Networks for Video Processing

  • Ever since the invention of ‘Alexnet’, CNNs have remained the leading technology in the recognition of still images [21].
  • There are several tasks where the input is a video, and handling the video as a sequence (instead of simply processing separate frames) is vital for obtaining good recognition results.
  • For the processing of sequences, recurrent neural structures such as the LSTM are the most powerful tool [12].
  • The model that Tran et al. called ‘(2+1)D convolution’ first performs a 2D convolution along the spatial axes, and then a 1D convolution along the time axis (see Fig. 1).
  • Interestingly, a very similar network structure proved very efficient in speech recognition as well [29].

3 Data Acquisition and Signal Preprocessing

  • The ultrasound recordings were collected from a Hungarian female subject (42 years old, with normal speaking abilities) while she was reading sentences aloud.
  • The ultrasound and the audio signals were synchronized using the software tool provided with the equipment.
  • Figure 2 shows an example of the data samples arranged as a rectangular image, and the standard ultrasound-style display generated from it.
  • The speech signal was recorded with a sampling rate of 11025 Hz, and then processed by a vocoder from the SPTK toolkit (http://sp-tk.sourceforge.net).
  • To facilitate training, each of the 13 targets was standardized to zero mean and unit variance.

4 Experimental Set-Up

  • The authors implemented their deep neural networks in Keras, using a Tensorflow backend [4].
  • To keep them comparable with respect to parameter count, all three models had approximately 3.3 million tunable parameters.
  • Training was performed using the stochastic gradient descent method (SGD) with a batch size of 100.
  • The simplest possible DNN type is a network with fully connected layers.
  • Similar to the FCN, the input to the convolutional network (2D CNN) consisted of only one frame of data.


  • The optimal network meta-parameters were found experimentally, and all hidden layers applied the swish activation function [27].
  • Following the concept of (2+1)D convolution described in Sect. 2, the five frames were first processed only spatially, and then got combined along the time axis just below the uppermost dense layer.
  • There are several options for evaluating the performance of their networks.
  • In the simplest case, the authors can compare their performance by simple objective metrics, such as the value of the target function optimized during training (the MSE function in their case).
  • Hence, many authors apply subjective listening tests such as the MUSHRA method [25].

5 Results and Discussion

  • As for the 3D CNN, the authors found that the value of the stride parameter s has a significant impact on the error rate attained.
  • Along with the MSE values, the correlation-based R2 scores are also reported (see the evaluation sketch after this list).
  • They obtained slightly better results than those given by their FCN, presumably due to the fact that their network had about 4 times as many parameters.
  • These simple methods failed to significantly reduce the error rate.
  • Moreover, instead of reducing the input size by feature selection, it seems to be more efficient to send the frames through several neural layers, with a relatively narrow ‘bottleneck’ layer on top.
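The objective evaluation mentioned above boils down to two frame-level metrics: the mean squared error that the networks optimize and a correlation-based R2 score per vocoder target. The sketch below shows one straightforward way to compute them; the function and variable names are illustrative assumptions, and averaging R2 over the 13 targets is one possible convention rather than a choice documented in the paper.

```python
# Sketch of the objective metrics (MSE and correlation-based R^2).
import numpy as np

def mse_and_r2(y_true, y_pred):
    """y_true, y_pred: arrays of shape (num_frames, 13) with vocoder targets."""
    mse = np.mean((y_true - y_pred) ** 2)
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    r2 = 1.0 - ss_res / ss_tot          # one R^2 value per target dimension
    return mse, float(r2.mean())        # report MSE and the mean R^2
```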

6 Conclusions

  • Here, the authors implemented a 3D CNN for ultrasound-based articulatory-to-acoustic conversion, where the CNN applied separate spatial and temporal components, motivated by the (2+1)D CNN of Tran et al. [30].
  • The model was compared with a CNN+LSTM architecture that was recently proposed for the same task.
  • This study was supported by the National Research, Development and Innovation Office of Hungary through project FK 124584 and by the AI National Excellence Program (grant 2018-1.2.1-NKP-2018-00008) and by grant TUDFO/47138-1/2019-ITM of the Ministry of Innovation and Technology.
  • The GPU card used for the computations was donated by the NVIDIA Corporation.
  • The authors thank the MTA-ELTE Lendület Lingual Articulation Research Group for providing the ultrasound recordings.


3D Convolutional Neural Networks
for Ultrasound-Based Silent Speech
Interfaces
László Tóth and Amin Honarmandi Shandiz
Institute of Informatics, University of Szeged, Szeged, Hungary
{tothl,shandiz}@inf.u-szeged.hu
Abstract. Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. The efficient solution requires methods that do not simply process single images, but are able to extract the tongue movement information from a sequence of video frames. One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, we experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, we apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. We find experimentally that our 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.
Keywords: Silent speech interface · Convolutional neural network · 3D convolution · Ultrasound video
1 Introduction
During the last couple of years, there has been an increasing interest in articulatory-to-acoustic conversion, which seeks to reproduce the speech signal from a recording of the articulatory organs, giving the technological background for creating “Silent Speech Interfaces” (SSI) [6,28]. These interfaces allow us to record the soundless articulatory movement, and then automatically generate speech from the movement information, while the subject is actually not producing any sound. Such an SSI system could be very useful for the speaking impaired who are able to move their articulators, but have lost their ability to produce any sound (e.g. due to a laryngectomy or some injury of the vocal cords). It could also be applied in human-computer interaction in situations where regular speech is not feasible (e.g. extremely noisy environments or military applications).
Several solutions exist for the recording of the articulatory movements, the simplest approach being a lip video [1,8]. But one may also apply electromagnetic articulography (EMA, [18,19]), ultrasound tongue imaging (UTI, [5,10,16,20]) or permanent magnetic articulography (PMA, [9]). Surface electromyography (sEMG, [14,15,24]) is also an option, while some authors use a combination of the above methods [6]. Here we are going to work with ultrasound tongue videos.
To convert the movement recordings into speech, the conventional approach is to apply a two-step procedure of ‘recognition-and-synthesis’ [28]. In this case, the biosignal is first converted into text by a properly adjusted speech recognition system. The text is then converted into speech using text-to-speech synthesis [7,13,31]. The drawbacks of this approach are the relatively large delay between the input and the output, and that the errors made by the speech recognizer will inevitably appear as errors in the TTS output. Also, all information related to speech prosody is lost, while certain prosodic components such as energy and pitch can be reasonably well estimated from the articulatory signal [10].
Current SSI systems prefer the ‘direct synthesis’ principle, where speech is generated directly from the articulatory data, without any intermediate step. Moreover, as recently the Deep Neural Network (DNN) technology has become dominant in practically all areas of speech technology, such as speech recognition [11], speech synthesis [22] and language modeling [33], most recent studies have attempted to solve the articulatory-to-acoustic conversion problem by using deep learning, regardless of the recording technique applied [5,9,10,14,16,20,25]. In this paper, we also apply deep neural networks to convert the ultrasound video of the tongue movement to speech. Although some early studies used simple fully connected neural networks [5,16], as we are working with images, it seems more reasonable to apply convolutional neural networks (CNN), which are currently very popular and successful in image recognition [21]. Thus, many recent studies on SSI systems use CNNs [14,20,25].
Our input here is a video, that is, not just one still image, but a sequence of images. This sequence carries extra information about the time trajectory of the tongue movement, which might be exploited by processing several neighboring video frames at the same time. There are several options to create a network structure for processing a time sequence. For such data, usually recurrent neural networks such as the long short-term memory network (LSTM) are applied, typically stacking it on top of a 2D CNN that seeks to process the individual frames [9,19,23,25]. Alternatively, one may experiment with extending the 2D CNN structure to 3D, by adding time as an extra dimension [17,20,32]. Here, we follow the latter approach, and we investigate the applicability of a special 3D CNN model called the (2+1)D CNN [30] for ultrasound-based direct speech synthesis, and compare the results with those of a CNN+LSTM model. We find that our 3D CNN model achieves a lower error rate, while it is smaller, and its training is faster. We conclude that for ultrasound video-based SSI systems, 3D CNNs are definitely a feasible alternative to recurrent neural models.
The paper is structured as follows. Section 2 gives a technological overview of the CNNs we are going to apply. In Sect. 3 we describe the data acquisition and processing steps for the ultrasound videos and the speech signal. Section 4 presents our experimental set-up. We present the experimental results and discuss them in Sect. 5, and the paper is closed with the conclusions in Sect. 6.
Fig. 1. Illustration of how the (2+1)D CNN operates. The video frames (at the bottom) are first processed by layers that perform 2D spatial convolution, then their outputs are combined by 1D temporal convolution. The model is allowed to skip video frames by changing the stride parameter of the temporal convolution.
2 Convolutional Neural Networks for Video Processing
Ever since the invention of ‘Alexnet’, CNNs have remained the leading technology in the recognition of still images [21]. These standard CNNs apply the convolution along the two spatial axes, that is, in two dimensions (2D). However, there are several tasks where the input is a video, and handling the video as a sequence (instead of simply processing separate frames) is vital for obtaining good recognition results. The best example is human gait recognition, but we can talk about action recognition in general [17,36,37]. In these cases, the sequence of video frames forms a three-dimensional data array, with the temporal axis being the third dimension in addition to the two spatial dimensions (cf. Fig. 1). For the processing of sequences, recurrent neural structures such as the LSTM are the most powerful tool [12]. However, the training of these networks is known to be slow and problematic, which led to the invention of simplified models, such as the gated recurrent unit (GRU) [3] or the quasi-recurrent neural network [2]. Alternatively, several convolutional network structures have been proposed that handle time sequences without recurrent connections.
In speech recognition, time-delay neural networks (TDNNs) have proved very successful [26,29], but we can also mention the feedforward sequential memory network [34]. As regards video processing, several modified CNN structures have been proposed to handle the temporal sequence of video frames [17,36,37]. Unfortunately, the standard 2D convolution may be extended to 3D in many possible ways, giving a lot of choices for optimization. Tran et al. performed an experimental comparison of several 3D variants, and they got the best results when they decomposed the spatial and temporal convolution steps [30]. The model they called ‘(2+1)D convolution’ first performs a 2D convolution along the spatial axes, and then a 1D convolution along the time axis (see Fig. 1). By changing the stride parameter of the 1D convolution, the model can skip several video frames, thus covering a wider time context without increasing the number of processed frames. Interestingly, a very similar network structure proved very efficient in speech recognition as well [29]. Stacking several such processing blocks on top of each other is also possible, resulting in a very deep network [30]. Here, we are going to experiment with a similar (2+1)D network structure for ultrasound-based SSI systems.
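As a concrete illustration, the sketch below shows one way such a decomposed (2+1)D block could be written in Keras, the framework used later in the paper. The filter counts, kernel sizes, pooling, temporal stride and dense layer width are illustrative assumptions rather than the paper's hyper-parameters; the point is only that (1, k, k) kernels convolve purely spatially, while a subsequent (k_t, 1, 1) kernel with a temporal stride fuses, and optionally skips, frames.

```python
# Minimal sketch of a decomposed (2+1)D block in Keras (TensorFlow 2.x).
# All layer sizes and the temporal stride are illustrative assumptions.
from tensorflow.keras import layers, models

def sketch_2plus1d(n_frames=9, height=128, width=64, k_t=3, s=3):
    inp = layers.Input(shape=(n_frames, height, width, 1))
    # (1, k, k) kernels: purely spatial convolution, applied frame by frame
    x = layers.Conv3D(30, (1, 13, 13), strides=(1, 2, 2), activation='swish')(inp)
    x = layers.Conv3D(60, (1, 13, 13), strides=(1, 2, 2), activation='swish')(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
    # (k_t, 1, 1) kernel: purely temporal convolution; the stride s lets the
    # model skip frames, widening the time context it covers
    x = layers.Conv3D(60, (k_t, 1, 1), strides=(s, 1, 1), activation='swish')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(500, activation='swish')(x)
    out = layers.Dense(13, activation='linear')(x)   # 13 vocoder parameters
    return models.Model(inp, out)
```

With these illustrative settings the model ends up with a few million parameters, the same order of magnitude as the networks described in Sect. 4.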
3 Data Acquisition and Signal Preprocessing
The ultrasound recordings were collected from a Hungarian female subject (42 years old, with normal speaking abilities) while she was reading sentences aloud. Her tongue movement was recorded in a midsagittal orientation placing the ultrasonic imaging probe under the jaw using a “Micro” ultrasound system by Articulate Instruments Ltd. The transducer was fixed using a stabilization headset. The 2–4 MHz/64 element 20 mm radius convex ultrasound transducer produced 82 images per second. The speech signal was recorded in parallel with an Audio-Technica ATR 3350 omnidirectional condenser microphone placed at a distance of 20 cm from the lips. The ultrasound and the audio signals were synchronized using the software tool provided with the equipment. Altogether 438 sentences (approximately half an hour) were recorded from the subject, and the material was divided into train, development and test sets in a 310-41-87 ratio. We should add that the same dataset was used in several earlier studies [5,10].
The ultrasound probe records 946 samples along each of its 64 scan lines. The recorded data can be converted to conventional ultrasound images using the software tools provided. However, due to its irregular shape, this image is harder to process by computers, while it contains no extra information compared to the original scan data. Hence, we worked with the original 946 × 64 data items, which were downsampled to 128 × 64 pixels. Figure 2 shows an example of the data samples arranged as a rectangular image, and the standard ultrasound-style display generated from it. The intensity range of the data was min-max normalized to the [−1, 1] interval before feeding it to the network.
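A minimal sketch of this frame preprocessing is given below. It assumes the raw data arrives as a NumPy array of shape (num_frames, 946, 64); the choice of scikit-image for the resizing step and the function name are illustrative assumptions, while the target size of 128 × 64 and the min-max scaling to [−1, 1] come from the text above.

```python
# Sketch of the input-side preprocessing described above (assumptions noted).
import numpy as np
from skimage.transform import resize   # any image-resizing routine would do

def preprocess_ultrasound(raw_frames):
    """raw_frames: array of shape (num_frames, 946, 64) with raw scan data."""
    # Downsample each 946 x 64 frame to 128 x 64 pixels
    frames = np.stack([
        resize(f.astype(np.float32), (128, 64),
               preserve_range=True, anti_aliasing=True)
        for f in raw_frames
    ])
    # Min-max normalize the intensity range to [-1, 1]
    lo, hi = frames.min(), frames.max()
    return 2.0 * (frames - lo) / (hi - lo) - 1.0
```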
Fig. 2. Example of displaying the ultrasound recordings as (a) a rectangular image of raw data samples and (b) an anatomically correct image, obtained by interpolation.

The speech signal was recorded with a sampling rate of 11025 Hz, and then processed by a vocoder from the SPTK toolkit (http://sp-tk.sourceforge.net). The vocoder represented the speech signals by 12 Mel-Generalized Cepstral Coefficients (MGCC) converted to a Line Spectral Pair representation (LSP), with the signal's gain being the 13th parameter. These 13 coefficients served as the training targets in the DNN modeling experiments, as the speech signal can be reasonably well reconstructed from these parameters. Although perfect reconstruction would require the estimation of the pitch (F0 parameter) as well, in this study we ignored this component during the experiments. To facilitate training, each of the 13 targets was standardized to zero mean and unit variance.
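The target-side preprocessing thus amounts to per-dimension standardization of the 13 vocoder parameters. The short sketch below illustrates this; computing the statistics on the training set only is a common choice and an assumption here, as the text does not state which split was used, and the function name is illustrative.

```python
# Sketch of the target-side standardization (assumptions noted above).
import numpy as np

def standardize_targets(train_y, dev_y, test_y):
    """Each array has shape (num_frames, 13): 12 MGC-LSP coefficients + gain."""
    mean = train_y.mean(axis=0)
    std = train_y.std(axis=0)
    scaled = [(y - mean) / std for y in (train_y, dev_y, test_y)]
    # mean and std must be kept: network outputs have to be de-standardized
    # with them before they are passed to the vocoder for synthesis.
    return scaled, (mean, std)
```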
4 Experimental Set-Up
We implemented our deep neural networks in Keras, using a Tensorflow backend [4]. We created three different models: a simple fully connected network (FCN), a convolutional network that processes one frame of video (2D CNN), and a convolutional network that can process several subsequent video frames as input (3D CNN). To keep them comparable with respect to parameter count, all three models had approximately 3.3 million tunable parameters. Training was performed using the stochastic gradient descent method (SGD) with a batch size of 100. The training objective function was the mean squared error (MSE).
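In Keras terms, this training configuration could look roughly like the sketch below. The optimizer, loss and batch size come from the paragraph above; the learning rate, the epoch count, the early-stopping callback and the argument names (model, train_x, dev_x, ...) are illustrative assumptions.

```python
# Sketch of the training set-up (SGD, MSE loss, batch size 100); the learning
# rate, epoch count and early stopping are assumptions, not taken from the paper.
from tensorflow.keras import optimizers, callbacks

def train_model(model, train_x, train_y, dev_x, dev_y):
    model.compile(optimizer=optimizers.SGD(learning_rate=0.01), loss='mse')
    model.fit(train_x, train_y,
              validation_data=(dev_x, dev_y),
              batch_size=100, epochs=100,
              callbacks=[callbacks.EarlyStopping(patience=3,
                                                 restore_best_weights=True)])
    return model
```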
Fully Connected Network (FCN): The simplest possible DNN type is a network with fully connected layers. To be comparable with an earlier study [5], our FCN model consisted of 5 fully connected hidden layers, with an output layer of 13 neurons for the 13 training targets. The input of the network consisted of one video frame (128 × 64 = 8192 pixels). Each hidden layer had 350 neurons, so the model was about 4 times smaller compared to the FCN described in [5]. Apart from the linear output layer, all layers applied the swish activation function [27], and were followed by a dropout layer with the dropout rate set to 0.2.
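A minimal Keras sketch of this FCN baseline, following the layer sizes given above (five hidden layers of 350 swish units with 0.2 dropout, a linear 13-unit output, and one flattened 128 × 64 frame as input); anything beyond these stated values is an assumption.

```python
# Sketch of the FCN baseline described above.
from tensorflow.keras import layers, models

fcn = models.Sequential([layers.Input(shape=(128 * 64,))])  # one flattened frame
for _ in range(5):
    fcn.add(layers.Dense(350, activation='swish'))
    fcn.add(layers.Dropout(0.2))
fcn.add(layers.Dense(13, activation='linear'))              # 13 vocoder targets
```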
Convolutional Network (2D CNN): Similar to the FCN, the input to this network consisted of only one frame of data. The network performed spatial convolution on the input image via its four convolutional layers below the uppermost

Citations
Posted Content
TL;DR: In this article, the authors compare the performance of three deep neural architectures for the estimation task, combining convolutional (CNN) and recurrence-based (LSTM) neural layers.
Abstract: Several approaches exist for the recording of articulatory movements, such as electromagnetic and permanent magnetic articulography, ultrasound tongue imaging and surface electromyography. Although magnetic resonance imaging (MRI) is more costly than the above approaches, the recent developments in this area now allow the recording of real-time MRI videos of the articulators with an acceptable resolution. Here, we experiment with the reconstruction of the speech signal from a real-time MRI recording using deep neural networks. Instead of estimating speech directly, our networks are trained to output a spectral vector, from which we reconstruct the speech signal using the WaveGlow neural vocoder. We compare the performance of three deep neural architectures for the estimation task, combining convolutional (CNN) and recurrence-based (LSTM) neural layers. Besides the mean absolute error (MAE) of our networks, we also evaluate our models by comparing the speech signals obtained using several objective speech quality metrics like the mean cepstral distortion (MCD), Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ) and Signal-to-Distortion Ratio (SDR). The results indicate that our approach can successfully reconstruct the gross spectral shape, but more improvements are needed to reproduce the fine spectral details.

5 citations

Book Chapter
28 Jun 2021
TL;DR: In this paper, a Generative Adversarial Network (GAN) is proposed to improve the perceptual quality of the generated signals by increasing their similarity to real signals, where the similarity is evaluated via a discriminator network.
Abstract: Besides the well-known classification task, these days neural networks are frequently being applied to generate or transform data, such as images and audio signals. In such tasks, the conventional loss functions like the mean squared error (MSE) may not give satisfactory results. To improve the perceptual quality of the generated signals, one possibility is to increase their similarity to real signals, where the similarity is evaluated via a discriminator network. The combination of the generator and discriminator nets is called a Generative Adversarial Network (GAN). Here, we evaluate this adversarial training framework in the articulatory-to-acoustic mapping task, where the goal is to reconstruct the speech signal from a recording of the movement of articulatory organs. As the generator, we apply a 3D convolutional network that gave us good results in an earlier study. To turn it into a GAN, we extend the conventional MSE training loss with an adversarial loss component provided by a discriminator network. As for the evaluation, we report various objective speech quality metrics such as the Perceptual Evaluation of Speech Quality (PESQ), and the Mel-Cepstral Distortion (MCD). Our results indicate that the application of the adversarial training loss brings about a slight, but consistent improvement in all these metrics.

5 citations

Book Chapter
06 Sep 2021
TL;DR: In this paper, a convolutional neural network classifier was used to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal.
Abstract: Voice Activity Detection (VAD) is not an easy task when the input audio signal is noisy, and it is even more complicated when the input is not even an audio recording. This is the case with Silent Speech Interfaces (SSI) where we record the movement of the articulatory organs during speech, and we aim to reconstruct the speech signal from this recording. Our SSI system synthesizes speech from ultrasonic videos of the tongue movement, and the quality of the resulting speech signals is evaluated by metrics such as the mean squared error loss function of the underlying neural network and the Mel-Cepstral Distortion (MCD) of the reconstructed speech compared to the original. Here, we first demonstrate that the amount of silence in the training data can have an influence both on the MCD evaluation metric and on the performance of the neural network model. Then, we train a convolutional neural network classifier to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal. In the experiments our ultrasound-based speech/silence separator achieved a classification accuracy of about 85% and an AUC score around 86%.

3 citations

Proceedings Article
26 Jun 2022
TL;DR: This paper experimentally compared various combinations of the above layer types for a silent speech interface task, and obtained the best result with a hybrid model that consists of a combination of 3D-CNN and ConvLSTM layers.
Abstract: Silent Speech Interfaces aim to reconstruct the acoustic signal from a sequence of ultrasound tongue images that records the articulatory movement. The extraction of information about the tongue movement requires us to efficiently process the whole sequence of images, not just as a single image. Several approaches have been suggested to process such sequential image data. The classic neural network structure combines two-dimensional convolutional (2D-CNN) layers that process the images separately with recurrent layers (e.g. an LSTM) on top of them to fuse the information along time. More recently, it was shown that one may also apply a 3D-CNN network that can extract information along both the spatial and the temporal axes in parallel, achieving a similar accuracy while being less time consuming. A third option is to apply the less well-known ConvLSTM layer type, which combines the advantages of LSTM and CNN layers by replacing matrix multiplication with the convolution operation. In this paper, we experimentally compared various combinations of the above-mentioned layer types for a silent speech interface task, and we obtained the best result with a hybrid model that consists of a combination of 3D-CNN and ConvLSTM layers. This hybrid network is slightly faster, smaller and more accurate than our previous 3D-CNN model.

1 citation


References
Proceedings Article
01 Jan 2018
TL;DR: This work proposes a new CNN architecture TrajectoryNet, which incorporates trajectory convolution, a new operation for integrating features along the temporal dimension, to replace the existing temporal convolution.
Abstract: How to leverage the temporal dimension is a key question in video analysis. Recent works suggest an efficient approach to video feature learning, i.e., factorizing 3D convolutions into separate components respectively for spatial and temporal convolutions. The temporal convolution, however, comes with an implicit assumption: the feature maps across time steps are well aligned so that the features at the same locations can be aggregated. This assumption may be overly strong in practical applications, especially in action recognition where the motion serves as a crucial cue. In this work, we propose a new CNN architecture TrajectoryNet, which incorporates trajectory convolution, a new operation for integrating features along the temporal dimension, to replace the existing temporal convolution. This operation explicitly takes into account the changes in contents caused by deformation or motion, allowing the visual features to be aggregated along the motion paths, i.e., trajectories. On two large-scale action recognition datasets, namely, Something-Something and Kinetics, the proposed network architecture achieves notable improvement over strong baselines.

93 citations


"3D Convolutional Neural Networks fo..." refers background or methods in this paper

  • ...As regards video processing, several modified CNN structures have been proposed to handle the temporal sequence of video frames [17,36,37]....


  • ...The best example is human gait recognition, but we can talk about action recognition in general [17,36,37]....


Journal Article
TL;DR: The characteristics of the speech EMG signal are described, techniques for extracting relevant features are introduced, different EMG-to-speech mapping methods are presented, and an evaluation of the different methods for real-time capability and conversion quality is presented.
Abstract: Silent speech interfaces are systems that enable speech communication even when an acoustic signal is unavailable. Over the last years, public interest in such interfaces has intensified. They provide solutions for some of the challenges faced by today's speech-driven technologies, such as robustness to noise and usability for people with speech impediments. In this paper, we provide an overview over our silent speech interface. It is based on facial surface electromyography (EMG), which we use to record the electrical signals that control muscle contraction during speech production. These signals are then converted directly to an audible speech waveform, retaining important paralinguistic speech cues for information such as speaker identity and mood. This paper gives an overview over our state-of-the-art direct EMG-to-speech transformation system. This paper describes the characteristics of the speech EMG signal, introduces techniques for extracting relevant features, presents different EMG-to-speech mapping methods, and finally, presents an evaluation of the different methods for real-time capability and conversion quality.

91 citations

Proceedings Article
04 May 2014
TL;DR: The two network architectures (convolution along the frequency axis and time-domain convolution) can be readily combined; the combined model achieves an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
Abstract: Convolutional neural networks have proved very successful in image recognition, thanks to their tolerance to small translations. They have recently been applied to speech recognition as well, using a spectral representation as input. However, in this case the translations along the two axes - time and frequency - should be handled quite differently. So far, most authors have focused on convolution along the frequency axis, which offers invariance to speaker and speaking style variations. Other researchers have developed a different network architecture that applies time-domain convolution in order to process a longer time-span of input in a hierarchical manner. These two approaches have different background motivations, and both offer significant gains over a standard fully connected network. Here we show that the two network architectures can be readily combined, along with their advantages. With the combined model we report an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.

87 citations


"3D Convolutional Neural Networks fo..." refers background in this paper

  • ...Interestingly, a very similar network structure proved very efficient in speech recognition as well [29]....


  • ...In speech recognition, time-delay neural networks (TDNNs) have proved very successful [26,29], but we can also mention the feedforward sequential memory network [34]....


Proceedings Article
02 May 2019
TL;DR: A system is proposed that detects a user's unvoiced utterance and recognizes the utterance contents without the user uttering any voice, and it is confirmed that audio signals generated by the system can control existing smart speakers.
Abstract: The availability of digital devices operated by voice is expanding rapidly. However, the applications of voice interfaces are still restricted. For example, speaking in public places becomes an annoyance to the surrounding people, and secret information should not be uttered. Environmental noise may reduce the accuracy of speech recognition. To address these limitations, a system to detect a user's unvoiced utterance is proposed. From internal information observed by an ultrasonic imaging sensor attached to the underside of the jaw, our proposed system recognizes the utterance contents without the user's uttering voice. Our proposed deep neural network model is used to obtain acoustic features from a sequence of ultrasound images. We confirmed that audio signals generated by our system can control the existing smart speakers. We also observed that a user can adjust their oral movement to learn and improve the accuracy of their voice recognition.

82 citations

Journal Article
TL;DR: This paper adopts a bidirectional long short-term memory recurrent neural network (BLSTM) as an articulatory model to effectively model the articulatory movements with long-range articulatory history and proposes Procrustes matching-based articulatory normalization by removing locational, rotational, and scaling differences.
Abstract: Silent speech recognition (SSR) converts nonaudio information such as articulatory movements into text. SSR has the potential to enable persons with laryngectomy to communicate through natural spoken expression. Current SSR systems have largely relied on speaker-dependent recognition models. The high degree of variability in articulatory patterns across different speakers has been a barrier for developing effective speaker-independent SSR approaches. Speaker-independent SSR approaches, however, are critical for reducing the amount of training data required from each speaker. In this paper, we investigate speaker-independent SSR from the movements of flesh points on tongue and lips with articulatory normalization methods that reduce the interspeaker variation. To minimize the across-speaker physiological differences of the articulators, we propose Procrustes matching-based articulatory normalization by removing locational, rotational, and scaling differences. To further normalize the articulatory data, we apply feature-space maximum likelihood linear regression and i-vector. In this paper, we adopt a bidirectional long short-term memory recurrent neural network (BLSTM) as an articulatory model to effectively model the articulatory movements with long-range articulatory history. A silent speech dataset with flesh-point articulatory movements was collected using an electromagnetic articulograph from 12 healthy and two laryngectomized English speakers. Experimental results showed the effectiveness of our speaker-independent SSR approaches on healthy as well as laryngectomy speakers. In addition, BLSTM outperformed the standard deep neural network. The best performance was obtained by the BLSTM with all the three normalization approaches combined.

77 citations


"3D Convolutional Neural Networks fo..." refers background or methods in this paper

  • ...For such data, usually recurrent neural networks such as the long short-term memory network (LSTM) are applied, typically stacking it on top of a 2D CNN that seeks to process the individual frames [9,25,23,19]....


  • ...But one may also apply electromagnetic articulography (EMA, [18,19]), ultrasound tongue imaging (UTI, [16,5,10,20]) or permanent magnetic articulography (PMA, [9])....


Frequently Asked Questions (2)
Q1. What contributions have the authors mentioned in the paper "3d convolutional neural networks for ultrasound-based silent speech interfaces" ?

One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, the authors experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, the authors apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. The authors find experimentally that their 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.

Q2. What future works have the authors mentioned in the paper?

In the future, the authors plan to investigate more sophisticated network types such as the ConvLSTM network that directly integrates the advantages of the convolutional and LSTM units [35].