Book Chapter

3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

12 Oct 2020-pp 159-169

TL;DR: This work experiments with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time, and finds experimentally that the 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.

Abstract: Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. The efficient solution requires methods that do not simply process single images, but are able to extract the tongue movement information from a sequence of video frames. One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, we experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, we apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. We find experimentally that our 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.

Summary

1 Introduction

  • During the last couple of years, there has been an increasing interest in articulatory-to-acoustic conversion, which seeks to reproduce the speech signal from a recording of the articulatory organs, giving the technological background for creating “Silent Speech Interfaces” (SSI) [6,28].
  • Current SSI systems prefer the ‘direct synthesis’ principle, where speech is generated directly from the articulatory data, without any intermediate step.
  • This sequence carries extra information about the time trajectory of the tongue movement, which might be exploited by processing several neighboring video frames at the same time.
  • Alternatively, one may experiment with extending the 2D CNN structure to 3D, by adding time as an extra dimension [17,20,32].

2 Convolutional Neural Networks for Video Processing

  • Ever since the invention of ‘AlexNet’, CNNs have remained the leading technology in the recognition of still images [21].
  • There are several tasks where the input is a video, and handling the video as a sequence (instead of simply processing separate frames) is vital for obtaining good recognition results.
  • For the processing of sequences, recurrent neural structures such as the LSTM are the most powerful tool [12].
  • The model that Tran et al. called ‘(2+1)D convolution’ first performs a 2D convolution along the spatial axes, and then a 1D convolution along the time axis (see Fig. 1) [30].
  • Interestingly, a very similar network structure proved very efficient in speech recognition as well [29].

3 Data Acquisition and Signal Preprocessing

  • The ultrasound recordings were collected from a Hungarian female subject (42 years old, with normal speaking abilities) while she was reading sentences aloud.
  • The ultrasound and the audio signals were synchronized using the software tool provided with the equipment.
  • Figure 2 shows an example of the data samples arranged as a rectangular image, and the standard ultrasound-style display generated from it.
  • The speech signal was recorded with a sampling rate of 11025 Hz, and then processed by a vocoder from the SPTK toolkit (http://sp-tk.sourceforge.net).
  • To facilitate training, each of the 13 targets was standardized to zero mean and unit variance.

4 Experimental Set-Up

  • The authors implemented their deep neural networks in Keras, using a TensorFlow backend [4].
  • To keep them comparable with respect to parameter count, all three models had approximately 3.3 million tunable parameters.
  • Training was performed using the stochastic gradient descent method (SGD) with a batch size of 100.
  • The simplest possible DNN type is a network with fully connected layers.
  • Convolutional Network (2D CNN): similar to the FCN, the input to this network consisted of only one frame of data.

  • The optimal network meta-parameters were found experimentally, and all hidden layers applied the swish activation function [27].
  • Following the concept of (2+1)D convolution described in Sect. 2, the five frames were first processed only spatially, and were then combined along the time axis just below the uppermost dense layer.
  • There are several options for evaluating the performance of the networks.
  • In the simplest case, the networks can be compared by simple objective metrics, such as the value of the target function optimized during training (the MSE function in this case).
  • Hence, many authors apply subjective listening tests such as the MUSHRA method [25].
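
To make the five-frame input more concrete, the following Python sketch lists which frames such an input would cover for a given stride, and how long a time window that corresponds to at the 82 frames per second reported for the recordings. It assumes that a stride of s simply means taking every s-th frame. The function names are illustrative and are not taken from the authors' code.

```python
FRAME_RATE_HZ = 82.0  # ultrasound frame rate reported in the data section


def frame_offsets(n_frames=5, stride=1):
    """Frame indices (relative to the first frame) used by the network.

    Assumption: a stride of s means that every s-th frame is taken.
    """
    return [k * stride for k in range(n_frames)]


def context_span_ms(n_frames=5, stride=1, frame_rate_hz=FRAME_RATE_HZ):
    """Time between the first and the last selected frame, in milliseconds."""
    return (n_frames - 1) * stride * 1000.0 / frame_rate_hz


if __name__ == "__main__":
    for s in (1, 2, 3):
        print(s, frame_offsets(5, s), round(context_span_ms(5, s), 1), "ms")
```

Under this reading, the number of processed frames stays at five while the covered time window grows linearly with the stride.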

5 Results and Discussion

  • As for the 3D CNN, the authors found that the value of the stride parameter s has a significant impact on the error rate attained.
  • Along with the MSE values, the correlation-based R² scores are now also shown.
  • The earlier study [5] obtained slightly better results than the authors' FCN, presumably because its network had about 4 times as many parameters.
  • These simple methods failed to significantly reduce the error rate.
  • Moreover, instead of reducing the input size by feature selection, it seems to be more efficient to send the frames through several neural layers, with a relatively narrow ‘bottleneck’ layer on top.
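
The two objective metrics mentioned above can be sketched in a few lines of Python. The summary does not say exactly how the R² score is computed, so the sketch shows two common readings as assumptions: the coefficient of determination and the squared Pearson correlation. All function names are illustrative.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_determination(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed reading)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r_squared(y_true, y_pred):
    """Squared Pearson correlation (alternative reading of 'correlation-based').

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)
```

The two readings can differ noticeably: a prediction that is perfectly correlated with the target but wrongly scaled has a squared correlation of 1, while its coefficient of determination can even be negative.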

6 Conclusions

  • Here, the authors implemented a 3D CNN for ultrasound-based articulation-to-acoustic conversion, where the CNN applied separate spatial and temporal components, motivated by the (2+1)D CNN of Tran et al. [30].
  • The model was compared with a CNN+LSTM architecture that was recently proposed for the same task.
  • This study was supported by the National Research, Development and Innovation Office of Hungary through project FK 124584 and by the AI National Excellence Program (grant 2018-1.2.1-NKP-2018-00008) and by grant TUDFO/47138-1/2019-ITM of the Ministry of Innovation and Technology.
  • The GPU card used for the computations was donated by the NVIDIA Corporation.
  • The authors thank the MTA-ELTE Lendület Lingual Articulation Research Group for providing the ultrasound recordings.

3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces
László Tóth (✉) and Amin Honarmandi Shandiz
Institute of Informatics, University of Szeged, Szeged, Hungary
{tothl,shandiz}@inf.u-szeged.hu
Abstract. Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. The efficient solution requires methods that do not simply process single images, but are able to extract the tongue movement information from a sequence of video frames. One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, we experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, we apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. We find experimentally that our 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.
Keywords: Silent speech interface · Convolutional neural network · 3D convolution · Ultrasound video
1 Introduction
During the last couple of years, there has been an increasing interest in articulatory-to-acoustic conversion, which seeks to reproduce the speech signal from a recording of the articulatory organs, giving the technological background for creating “Silent Speech Interfaces” (SSI) [6,28]. These interfaces allow us to record the soundless articulatory movement, and then automatically generate speech from the movement information, while the subject is actually not producing any sound. Such an SSI system could be very useful for the speaking impaired who are able to move their articulators, but have lost their ability to produce any sound (e.g. due to a laryngectomy or some injury of the vocal cords). It could also be applied in human-computer interaction in situations where regular speech is not feasible (e.g. extremely noisy environments or military applications). Several solutions exist for the recording of the articulatory movements, the simplest approach being a lip video [1,8]. But one may also apply electromagnetic articulography (EMA, [18,19]), ultrasound tongue imaging (UTI, [5,10,16,20]) or permanent magnetic articulography (PMA, [9]). Surface electromyography (sEMG, [14,15,24]) is also an option, while some authors use a combination of the above methods [6]. Here we are going to work with ultrasound tongue videos.
© Springer Nature Switzerland AG 2020. L. Rutkowski et al. (Eds.): ICAISC 2020, LNAI 12415, pp. 159–169, 2020. https://doi.org/10.1007/978-3-030-61401-0_16
To convert the movement recordings into speech, the conventional approach is to apply a two-step procedure of ‘recognition-and-synthesis’ [28]. In this case, the biosignal is first converted into text by a properly adjusted speech recognition system. The text is then converted into speech using text-to-speech (TTS) synthesis [7,13,31]. The drawbacks of this approach are the relatively large delay between the input and the output, and that the errors made by the speech recognizer will inevitably appear as errors in the TTS output. Also, all information related to speech prosody is lost, while certain prosodic components such as energy and pitch can be reasonably well estimated from the articulatory signal [10].
Current SSI systems prefer the ‘direct synthesis’ principle, where speech is generated directly from the articulatory data, without any intermediate step. Moreover, as Deep Neural Network (DNN) technology has recently become dominant in practically all areas of speech technology, such as speech recognition [11], speech synthesis [22] and language modeling [33], most recent studies have attempted to solve the articulatory-to-acoustic conversion problem by using deep learning, regardless of the recording technique applied [5,9,10,14,16,20,25]. In this paper, we also apply deep neural networks to convert the ultrasound video of the tongue movement to speech. Although some early studies used simple fully connected neural networks [5,16], as we are working with images, it seems more reasonable to apply convolutional neural networks (CNN), which are currently very popular and successful in image recognition [21]. Thus, many recent studies on SSI systems use CNNs [14,20,25].
Our input here is a video, that is, not just one still image, but a sequence of images. This sequence carries extra information about the time trajectory of the tongue movement, which might be exploited by processing several neighboring video frames at the same time. There are several options for creating a network structure that processes time sequences. For such data, recurrent neural networks such as the long short-term memory network (LSTM) are usually applied, typically stacked on top of a 2D CNN that processes the individual frames [9,19,23,25]. Alternatively, one may experiment with extending the 2D CNN structure to 3D, by adding time as an extra dimension [17,20,32]. Here, we follow the latter approach, and we investigate the applicability of a special 3D CNN model called the (2+1)D CNN [30] for ultrasound-based direct speech synthesis, and compare the results with those of a CNN+LSTM model. We find that our 3D CNN model achieves a lower error rate, while it is smaller, and its training is faster. We conclude that for ultrasound video-based SSI systems, 3D CNNs are definitely a feasible alternative to recurrent neural models.
The paper is structured as follows. Section 2 gives a technological overview of the CNNs we are going to apply. In Sect. 3 we describe the data acquisition and processing steps for the ultrasound videos and the speech signal. Section 4 presents our experimental set-up. We present the experimental results and discuss them in Sect. 5, and the paper is closed with the conclusions in Sect. 6.
[Figure 1: a stack of video frames with spatial axes x and y and time axis t, processed by 2D spatial convolution and then by 1D temporal convolution.]
Fig. 1. Illustration of how the (2+1)D CNN operates. The video frames (at the bottom) are first processed by layers that perform 2D spatial convolution, then their outputs are combined by 1D temporal convolution. The model is allowed to skip video frames by changing the stride parameter of the temporal convolution.
2 Convolutional Neural Networks for Video Processing
Ever since the invention of ‘AlexNet’, CNNs have remained the leading technology in the recognition of still images [21]. These standard CNNs apply the convolution along the two spatial axes, that is, in two dimensions (2D). However, there are several tasks where the input is a video, and handling the video as a sequence (instead of simply processing separate frames) is vital for obtaining good recognition results. The best example is human gait recognition, but we can talk about action recognition in general [17,36,37]. In these cases, the sequence of video frames forms a three-dimensional data array, with the temporal axis being the third dimension in addition to the two spatial dimensions (cf. Fig. 1).
For the processing of sequences, recurrent neural structures such as the LSTM are the most powerful tool [12]. However, the training of these networks is known to be slow and problematic, which led to the invention of simplified models, such as the gated recurrent unit (GRU) [3] or the quasi-recurrent neural network [2]. Alternatively, several convolutional network structures have been proposed that handle time sequences without recurrent connections. In speech recognition, time-delay neural networks (TDNNs) have proved very successful [26,29], but we can also mention the feedforward sequential memory network [34]. As regards video processing, several modified CNN structures have been proposed to handle the temporal sequence of video frames [17,36,37]. Unfortunately, the standard 2D convolution may be extended to 3D in many possible ways, giving a lot of choices for optimization. Tran et al. performed an experimental comparison of several 3D variants, and they got the best results when they decomposed the spatial and temporal convolution steps [30]. The model they called ‘(2+1)D convolution’ first performs a 2D convolution along the spatial axes, and then a 1D convolution along the time axis (see Fig. 1). By changing the stride parameter of the 1D convolution, the model can skip several video frames, thus covering a wider time context without increasing the number of processed frames. Interestingly, a very similar network structure proved very efficient in speech recognition as well [29]. Stacking several such processing blocks on top of each other is also possible, resulting in a very deep network [30]. Here, we are going to experiment with a similar (2+1)D network structure for ultrasound-based SSI systems.
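
The decomposition described above can be illustrated with a small, dependency-free Python sketch: each frame is first filtered by a 2D spatial kernel, and the resulting feature maps are then combined by a 1D kernel along the time axis. This is a single-channel toy version written for illustration only. It is not the authors' network, and it assumes that the temporal stride is realized by keeping every s-th frame. All function names are hypothetical.

```python
def conv2d_valid(frame, kernel):
    """2D 'valid' convolution (cross-correlation) of one frame with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(frame) - kh + 1
    out_w = len(frame[0]) - kw + 1
    return [[sum(frame[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def conv1d_time(maps, weights):
    """1D 'valid' convolution along the time axis of a list of feature maps."""
    taps = len(weights)
    h, w = len(maps[0]), len(maps[0][0])
    return [[[sum(weights[k] * maps[t + k][y][x] for k in range(taps))
              for x in range(w)]
             for y in range(h)]
            for t in range(len(maps) - taps + 1)]


def conv2plus1d(frames, spatial_kernel, temporal_weights, stride=1):
    """Spatial filtering per frame, followed by temporal combination.

    Assumption: the temporal stride keeps every `stride`-th frame.
    """
    selected = frames[::stride]
    maps = [conv2d_valid(f, spatial_kernel) for f in selected]
    return conv1d_time(maps, temporal_weights)
```

With five 3 × 3 frames, a 2 × 2 spatial kernel, three temporal taps and a stride of 2, the sketch uses frames 0, 2 and 4 and returns a single 2 × 2 output map.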
3 Data Acquisition and Signal Preprocessing
The ultrasound recordings were collected from a Hungarian female subject (42 years old, with normal speaking abilities) while she was reading sentences aloud. Her tongue movement was recorded in a midsagittal orientation by placing the ultrasonic imaging probe under the jaw, using a “Micro” ultrasound system by Articulate Instruments Ltd. The transducer was fixed using a stabilization headset. The 2–4 MHz/64 element 20 mm radius convex ultrasound transducer produced 82 images per second. The speech signal was recorded in parallel with an Audio-Technica ATR 3350 omnidirectional condenser microphone placed at a distance of 20 cm from the lips. The ultrasound and the audio signals were synchronized using the software tool provided with the equipment. Altogether 438 sentences (approximately half an hour) were recorded from the subject, which were divided into train, development and test sets of 310, 41 and 87 sentences, respectively. We should add that the same dataset was used in several earlier studies [5,10].
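
As a small illustration of the partitioning, the sketch below splits a list of 438 items into parts of 310, 41 and 87. The text does not say how the sentences were assigned to the three sets, so the contiguous split used here is only an assumption, and the function name is hypothetical.

```python
def split_sentences(items, n_train=310, n_dev=41, n_test=87):
    """Split a list into contiguous train, development and test parts.

    Assumption: a contiguous split; the actual assignment is not specified.
    """
    if len(items) != n_train + n_dev + n_test:
        raise ValueError("set sizes do not add up to the number of items")
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_sentences(list(range(438)))
    print(len(train), len(dev), len(test))
```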
The ultrasound probe records 946 samples along each of its 64 scan lines. The recorded data can be converted to conventional ultrasound images using the software tools provided. However, due to its irregular shape, this image is harder to process by computers, while it contains no extra information compared to the original scan data. Hence, we worked with the original 946 × 64 data items, which were downsampled to 128 × 64 pixels. Figure 2 shows an example of the data samples arranged as a rectangular image, and the standard ultrasound-style display generated from it. The intensity range of the data was min-max normalized to the [−1, 1] interval before feeding it to the network.
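
The min-max normalization step can be sketched as follows. The text does not state whether the minimum and maximum are taken per image or over the whole dataset, so the per-image version below is an assumption, as is the function name. The downsampling step is left out because its method is not described.

```python
def minmax_to_symmetric_range(image):
    """Linearly rescale pixel intensities of one image to the [-1, 1] interval.

    Assumption: the minimum and maximum are computed per image.
    """
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant image carries no contrast; map it to the interval centre.
        return [[0.0 for _ in row] for row in image]
    return [[2.0 * (v - lo) / (hi - lo) - 1.0 for v in row] for row in image]
```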
Fig. 2. Example of displaying the ultrasound recordings as (a) a rectangular image of raw data samples and (b) an anatomically correct image, obtained by interpolation.
The speech signal was recorded with a sampling rate of 11025 Hz, and then processed by a vocoder from the SPTK toolkit (http://sp-tk.sourceforge.net). The vocoder represented the speech signals by 12 Mel-Generalized Cepstral Coefficients (MGCC) converted to a Line Spectral Pair representation (LSP), with the signal’s gain being the 13th parameter. These 13 coefficients served as the training targets in the DNN modeling experiments, as the speech signal can be reasonably well reconstructed from these parameters. Although perfect reconstruction would require the estimation of the pitch (F0 parameter) as well, in this study we ignored this component during the experiments. To facilitate training, each of the 13 targets was standardized to zero mean and unit variance.
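
A minimal sketch of the target standardization is given below. It assumes that the mean and standard deviation of each target dimension are estimated on the training set and then reused for the other sets, which the text does not state explicitly. The function names are illustrative, and the example uses two target dimensions instead of 13 for brevity.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_targets):
    """Per-dimension mean and standard deviation of the training targets."""
    columns = list(zip(*train_targets))
    means = [fmean(c) for c in columns]
    # Guard against constant dimensions to avoid division by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def standardize(targets, means, stds):
    """Shift and scale every target vector to zero mean and unit variance."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in targets]


if __name__ == "__main__":
    train = [[1.0, 10.0], [3.0, 30.0]]
    means, stds = fit_standardizer(train)
    print(standardize(train, means, stds))
```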
4 Experimental Set-Up
We implemented our deep neural networks in Keras, using a TensorFlow backend [4]. We created three different models: a simple fully connected network (FCN), a convolutional network that processes one frame of video (2D CNN), and a convolutional network that can process several subsequent video frames as input (3D CNN). To keep them comparable with respect to parameter count, all three models had approximately 3.3 million tunable parameters. Training was performed using the stochastic gradient descent method (SGD) with a batch size of 100. The training objective function was the mean squared error (MSE).
Fully Connected Network (FCN): The simplest possible DNN type is a network with fully connected layers. To be comparable with an earlier study [5], our FCN model consisted of 5 fully connected hidden layers, with an output layer of 13 neurons for the 13 training targets. The input of the network consisted of one video frame (128 × 64 = 8192 pixels). Each hidden layer had 350 neurons, so the model was about 4 times smaller compared to the FCN described in [5]. Apart from the linear output layer, all layers applied the swish activation function [27], and were followed by a dropout layer with the dropout rate set to 0.2.
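
The layer sizes given above are enough to check the quoted parameter budget with a few lines of Python. The sketch assumes that every fully connected layer has one bias per output neuron; the function names are illustrative.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer (bias assumed)."""
    return n_in * n_out + n_out


def fcn_param_count(n_input, hidden_sizes, n_output):
    """Total number of trainable parameters of a fully connected network."""
    sizes = [n_input, *hidden_sizes, n_output]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


if __name__ == "__main__":
    # One 128 x 64 frame in, five hidden layers of 350 neurons, 13 targets out.
    print(fcn_param_count(128 * 64, [350] * 5, 13))  # 3363513
```

The result, about 3.36 million, agrees with the figure of approximately 3.3 million tunable parameters. Most of these parameters sit in the first layer, which connects the 8192 input pixels to the 350 hidden neurons.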
Convolutional Network (2D CNN): Similar to the FCN, the input to this network consisted of only one frame of data. The network performed spatial convolution on the input image via its four convolutional layers below the uppermost

Citations

Book Chapter
28 Jun 2021
Abstract: Besides the well-known classification task, these days neural networks are frequently being applied to generate or transform data, such as images and audio signals. In such tasks, the conventional loss functions like the mean squared error (MSE) may not give satisfactory results. To improve the perceptual quality of the generated signals, one possibility is to increase their similarity to real signals, where the similarity is evaluated via a discriminator network. The combination of the generator and discriminator nets is called a Generative Adversarial Network (GAN). Here, we evaluate this adversarial training framework in the articulatory-to-acoustic mapping task, where the goal is to reconstruct the speech signal from a recording of the movement of articulatory organs. As the generator, we apply a 3D convolutional network that gave us good results in an earlier study. To turn it into a GAN, we extend the conventional MSE training loss with an adversarial loss component provided by a discriminator network. As for the evaluation, we report various objective speech quality metrics such as the Perceptual Evaluation of Speech Quality (PESQ), and the Mel-Cepstral Distortion (MCD). Our results indicate that the application of the adversarial training loss brings about a slight, but consistent improvement in all these metrics.

4 citations


Posted Content
Abstract: Several approaches exist for the recording of articulatory movements, such as electromagnetic and permanent magnetic articulography, ultrasound tongue imaging and surface electromyography. Although magnetic resonance imaging (MRI) is more costly than the above approaches, the recent developments in this area now allow the recording of real-time MRI videos of the articulators with an acceptable resolution. Here, we experiment with the reconstruction of the speech signal from a real-time MRI recording using deep neural networks. Instead of estimating speech directly, our networks are trained to output a spectral vector, from which we reconstruct the speech signal using the WaveGlow neural vocoder. We compare the performance of three deep neural architectures for the estimation task, combining convolutional (CNN) and recurrence-based (LSTM) neural layers. Besides the mean absolute error (MAE) of our networks, we also evaluate our models by comparing the speech signals obtained using several objective speech quality metrics like the mean cepstral distortion (MCD), Short-Time Objective Intelligibility (STOI), Perceptual Evaluation of Speech Quality (PESQ) and Signal-to-Distortion Ratio (SDR). The results indicate that our approach can successfully reconstruct the gross spectral shape, but more improvements are needed to reproduce the fine spectral details.

3 citations


Posted Content
Abstract: Voice Activity Detection (VAD) is not an easy task when the input audio signal is noisy, and it is even more complicated when the input is not even an audio recording. This is the case with Silent Speech Interfaces (SSI) where we record the movement of the articulatory organs during speech, and we aim to reconstruct the speech signal from this recording. Our SSI system synthesizes speech from ultrasonic videos of the tongue movement, and the quality of the resulting speech signals is evaluated by metrics such as the mean squared error loss function of the underlying neural network and the Mel-Cepstral Distortion (MCD) of the reconstructed speech compared to the original. Here, we first demonstrate that the amount of silence in the training data can have an influence both on the MCD evaluation metric and on the performance of the neural network model. Then, we train a convolutional neural network classifier to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal. In the experiments our ultrasound-based speech/silence separator achieved a classification accuracy of about 85% and an AUC score around 86%.

1 citations


Posted Content
Abstract: Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings represent not only the linguistic content, but are also highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In the experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize nicely to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought about a marginal reduction in the error rate of the spectral estimation step.

References

Proceedings Article
03 Dec 2012
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,871 citations


Journal Article
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

49,735 citations


Proceedings Article
01 Jan 2014
Abstract: In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.

14,140 citations


Journal Article
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

7,700 citations


Journal Article
Abstract: We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance, we propose regularizing the outputs with high-level features and combining the predictions of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.

3,755 citations


Frequently Asked Questions (2)
Q1. What contributions have the authors mentioned in the paper "3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces"?

One option for this is to apply recurrent neural structures such as the long short-term memory network (LSTM) in combination with 2D convolutional neural networks (CNNs). Here, the authors experiment with another approach that extends the CNN to perform 3D convolution, where the extra dimension corresponds to time. In particular, the authors apply the spatial and temporal convolutions in a decomposed form, which proved very successful recently in video action recognition. The authors find experimentally that their 3D network outperforms the CNN+LSTM model, indicating that 3D CNNs may be a feasible alternative to CNN+LSTM networks in SSI systems.

In the future, the authors plan to investigate more sophisticated network types such as the ConvLSTM network that directly integrates the advantages of the convolutional and LSTM units [35].