Proceedings ArticleDOI

Re-Sign: Re-Aligned End-to-End Sequence Modelling with Deep Recurrent CNN-HMMs

01 Jul 2017 · pp. 3416-3424
TL;DR: This work proposes an algorithm that treats the provided training labels as weak labels and refines the label-to-image alignment on-the-fly in a weakly supervised fashion; embedded into an HMM, the resulting deep model continuously improves its performance over several re-alignments.
Abstract: This work presents an iterative re-alignment approach applicable to visual sequence labelling tasks such as gesture recognition, activity recognition and continuous sign language recognition. Previous methods dealing with video data usually rely on given frame labels to train their classifiers. However, looking at recent data sets, these labels often tend to be noisy, which is commonly overlooked. We propose an algorithm that treats the provided training labels as weak labels and refines the label-to-image alignment on-the-fly in a weakly supervised fashion. Given a series of frames and sequence-level labels, a deep recurrent CNN-BLSTM network is trained end-to-end. Embedded into an HMM, the resulting deep model corrects the frame labels and continuously improves its performance over several re-alignments. We evaluate on two challenging publicly available sign recognition benchmark data sets featuring over 1000 classes. We outperform the state-of-the-art by up to 10% absolute and 30% relative.
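The re-alignment step at the heart of the abstract can be pictured with a toy forced-alignment routine: given per-frame class scores and the ordered sequence-level labels, a Viterbi-style dynamic program finds the monotonic frame-to-label assignment with the highest total score; those frame labels then serve as refined targets for the next training round. The sketch below is a minimal illustration in plain Python with an assumed data layout, not the authors' HMM implementation:

```python
def forced_align(log_probs, labels):
    """Toy Viterbi-style forced alignment (illustrative, not the paper's code).

    log_probs: per-frame dicts mapping class -> log-probability.
    labels:    ordered sequence-level labels; assumes len(log_probs) >= len(labels).
    Returns the frame-level label sequence that keeps the label order,
    gives every label at least one frame, and maximises the summed score.
    """
    T, N = len(log_probs), len(labels)
    NEG = float("-inf")
    # dp[t][j]: best score for frames 0..t with frame t assigned to labels[j]
    dp = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    dp[0][0] = log_probs[0][labels[0]]
    for t in range(1, T):
        for j in range(N):
            stay = dp[t - 1][j]                        # same label as previous frame
            move = dp[t - 1][j - 1] if j > 0 else NEG  # advance to the next label
            best = max(stay, move)
            if best == NEG:
                continue  # state unreachable (more labels than frames so far)
            back[t][j] = j if stay >= move else j - 1
            dp[t][j] = best + log_probs[t][labels[j]]
    # backtrace from the final label at the last frame
    j = N - 1
    path = [labels[j]]
    for t in range(T - 1, 0, -1):
        j = back[t][j]
        path.append(labels[j])
    return path[::-1]
```

In the full system this alignment runs inside the HMM and alternates with CNN-BLSTM training over several EM-style iterations.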


Citations
Proceedings ArticleDOI
18 Jun 2018
TL;DR: This work formalizes SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge), jointly learning the spatial representations, the underlying language model, and the mapping between sign and spoken language.
Abstract: Sign Language Recognition (SLR) has been an active research field for the last two decades. However, most research to date has considered SLR as a naive gesture recognition problem. SLR seeks to recognize a sequence of continuous signs but neglects the underlying rich grammatical and linguistic structures of sign language that differ from spoken language. In contrast, we introduce the Sign Language Translation (SLT) problem. Here, the objective is to generate spoken language translations from sign language videos, taking into account the different word orders and grammar. We formalize SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge). This allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language. To evaluate the performance of Neural SLT, we collected the first publicly available Continuous SLT dataset, RWTH-PHOENIX-Weather 2014T. It provides spoken language translations and gloss level annotations for German Sign Language videos of weather broadcasts. Our dataset contains over 0.95M frames with >67K signs from a sign vocabulary of >1K and >99K words from a German vocabulary of >2.8K. We report quantitative and qualitative results for various SLT setups to underpin future research in this newly established field. The upper bound for translation performance is calculated at 19.26 BLEU-4, while our end-to-end frame-level and gloss-level tokenization networks were able to achieve 9.58 and 18.13 respectively.

382 citations


Cites background or methods from "Re-Sign: Re-Aligned End-to-End Sequ..."

  • ...Therefore, the numerous advances in SLR [15] and even the move to the challenging Continuous SLR (CSLR) [33, 36] problem, do not allow us to provide meaningful interpretations...


  • ...Computer vision researchers adopted CTC and applied it to weakly labeled visual problems, such as lip reading [3], action recognition [30], hand shape recognition [6] and CSLR [6, 17]....


  • ...To achieve NMT from sign videos, we employed CNN based spatial embedding, various tokenization methods including state-of-the-art RNN-HMM hybrids [36] and attention-based encoder-decoder networks, to jointly learn to align, recognize and translate sign videos to spoken text....


  • ...The PHOENIX datasets were created for CSLR and they provide sequence level gloss annotations....


  • ...[36] as our Tokenization Layer, which is the state-of-the-art CSLR....


Proceedings ArticleDOI
24 Oct 2019
TL;DR: The results of an interdisciplinary workshop are presented, providing key background that is often overlooked by computer scientists, a review of the state-of-the-art, a set of pressing challenges, and a call to action for the research community.
Abstract: Developing successful sign language recognition, generation, and translation systems requires expertise in a wide range of fields, including computer vision, computer graphics, natural language processing, human-computer interaction, linguistics, and Deaf culture. Despite the need for deep interdisciplinary knowledge, existing research occurs in separate disciplinary silos, and tackles separate portions of the sign language processing pipeline. This leads to three key questions: 1) What does an interdisciplinary view of the current landscape reveal? 2) What are the biggest challenges facing the field? and 3) What are the calls to action for people working in the field? To help answer these questions, we brought together a diverse group of experts for a two-day workshop. This paper presents the results of that interdisciplinary workshop, providing key background that is often overlooked by computer scientists, a review of the state-of-the-art, a set of pressing challenges, and a call to action for the research community.

237 citations


Cites background from "Re-Sign: Re-Aligned End-to-End Sequ..."

  • ...[flattened excerpt of Table 1, "Popular public corpora of sign language video": Signum [118]: 465-sign vocabulary, 25 signers (24 train / 1 test), 15,075 videos; MS-ASL [62]: 1,000-sign vocabulary, 222 signers (165 train / 37 dev / 20 test), 25,513 videos; RWTH Phoenix [43]: 1,081-sign vocabulary, 9 signers, 6,841 videos; RWTH Phoenix SI5 [74]: 1,081-sign vocabulary, 9 signers (8 train / 1 test), 4,667 videos; Devisign [22]: 2,000-sign vocabulary, 8 signers, 24,000 videos.] These datasets are commonly used for sign language recognit...


  • ...on method impacts both content and signer identity. For example, some corpora are formed of professional interpreters paid to interpret spoken content, such as news channels that provide interpreting [43,74,21]. Others are formed of expert signers paid to sign desired corpus content (e.g., [65,124,118]). Yet other corpora consist of sign language videos posted on sites such as YouTube (e.g., [62]) – these post...


Journal ArticleDOI
TL;DR: This work develops a continuous sign language (SL) recognition framework with deep neural networks, which directly transcribes videos of SL sentences to sequences of ordered gloss labels; the proposed architecture adopts deep convolutional neural networks with stacked temporal fusion layers as the feature extraction module.
Abstract: This work develops a continuous sign language (SL) recognition framework with deep neural networks, which directly transcribes videos of SL sentences to sequences of ordered gloss labels. Previous methods dealing with continuous SL recognition usually employ hidden Markov models with limited capacity to capture the temporal information. In contrast, our proposed architecture adopts deep convolutional neural networks with stacked temporal fusion layers as the feature extraction module, and bidirectional recurrent neural networks as the sequence learning module. We propose an iterative optimization process for our architecture to fully exploit the representation capability of deep neural networks with limited data. We first train the end-to-end recognition model for alignment proposal, and then use the alignment proposal as strong supervisory information to directly tune the feature extraction module. This training process can run iteratively to achieve improvements on the recognition performance. We further contribute by exploring the multimodal fusion of RGB images and optical flow in sign language. Our method is evaluated on two challenging SL recognition benchmarks, and outperforms the state of the art by a relative improvement of more than 15% on both databases.

229 citations


Cites background or methods or result from "Re-Sign: Re-Aligned End-to-End Sequ..."

  • ...Several recent approaches taking advantage of neural networks have also been proposed for continuous SL recognition [12], [13], [22]....


  • ...In their later works [13], [22], they use the frame-state alignment, provided by a baseline HMM recognition system, as frame labelling to train the embedded neural networks....


  • ...In contrast with these works [12], [13], [22], [25], our sequence learning module of recurrent neural networks with end-to-end training shows much more learning capacity and better performance for the dynamic dependencies....


  • ...However, the frame-wise labelling adopted in [12], [13], [22] is noisy for training the deep neural networks, and HMMs may struggle to learn the complex dynamic variations, considering their limited representation capability....


  • ...1% on test set reported in [22], our results reduce WER of the state-of-the-art by a margin around 5%, which is a relative improvement of more than 10%....


Journal ArticleDOI
TL;DR: This work applies the approach to the domain of sign language recognition, exploiting the sequential parallelism to learn sign language, mouth shape and hand shape classifiers, and clearly outperforms the state-of-the-art on all data sets with significantly faster convergence using the parallel alignment approach.
Abstract: In this work we present a new approach to the field of weakly supervised learning in the video domain. Our method is relevant to sequence learning problems which can be split up into sub-problems that occur in parallel. Here, we experiment with sign language data. The approach exploits sequence constraints within each independent stream and combines them by explicitly imposing synchronisation points to make use of parallelism that all sub-problems share. We do this with multi-stream HMMs while adding intermediate synchronisation constraints among the streams. We embed powerful CNN-LSTM models in each HMM stream following the hybrid approach. This allows the discovery of attributes which on their own lack sufficient discriminative power to be identified. We apply the approach to the domain of sign language recognition exploiting the sequential parallelism to learn sign language, mouth shape and hand shape classifiers. We evaluate the classifiers on three publicly available benchmark data sets featuring challenging real-life sign language with over 1,000 classes, full sentence based lip-reading and articulated hand shape recognition on a fine-grained hand shape taxonomy featuring over 60 different hand shapes. We clearly outperform the state-of-the-art on all data sets and observe significantly faster convergence using the parallel alignment approach.

210 citations


Cites methods or result from "Re-Sign: Re-Aligned End-to-End Sequ..."

  • ...Hybrid CNN-LSTM-HMM models learnt with an Expectation Maximization (EM) algorithm alleviate such problems [4]....


  • ...Comparable to [4] and with a similar effect as ‘curriculum learning’ [69], we observe fastest convergence if we first train a CNN-HMM model with randomly shuffled input images for 4 EM re-alignment iterations starting from a flat start in each stream....


  • ...We extend our previous work on hybrid HMM modelling for sign language recognition [3] [4] [5] by adding multi-stream HMMs with synchronisation constraints....


  • ...For comparison, we provide the WER achieved on a single-stream system, which represents the same setup as in [4]....


  • ...The hybrid HMM modelling has been shown to outperform other sequence learning approaches on sign language recognition data sets while requiring less memory and allowing for deeper architectures [4]....


Posted Content
TL;DR: A novel transformer-based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner; a Connectionist Temporal Classification (CTC) loss binds the recognition and translation problems into a single unified architecture.
Abstract: Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the-art in translation requires gloss level tokenization in order to work. We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solves two co-dependent sequence-to-sequence learning problems, and leads to significant performance gains. We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report state-of-the-art sign language recognition and translation results achieved by our Sign Language Transformers. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models, in some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 Score). We also share new baseline translation results using transformer networks for several other text-to-text sign language translation tasks.
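What makes the CTC loss suitable for binding the two problems is its many-to-one path collapse: any frame-level labelling maps to a gloss sequence by first merging repeated symbols and then deleting the blank token, so no ground-truth timing is required. A minimal sketch of that collapse rule (plain Python; the blank symbol chosen here is an assumption):

```python
def ctc_collapse(path, blank="-"):
    """Map a frame-level CTC path to its label sequence:
    merge consecutive repeats, then drop blank tokens."""
    out, prev = [], None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return out
```

During training, the CTC loss sums the probabilities of all frame-level paths that collapse to the target gloss sequence, which is what lets the network learn recognition without frame-level annotation.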

183 citations


Cites methods from "Re-Sign: Re-Aligned End-to-End Sequ..."

  • ...utilized a state-of-the-art CSLR method [41] to obtain sign glosses, and then used an attention-based text-to-text NMT model [44] to learn the sign gloss to spoken language sentence translation, p(S|G) [9]....


  • ...The resulting Sign2Gloss2Text model first recognized glosses from continuous sign videos using a state-of-the-art CSLR method [41], which worked as a tokenization layer....


  • ...This made the use of available CSLR methods [42, 41] (that were designed to learn from weakly annotated data) infeasible, as they are built on the assumption that sign language videos and corresponding annotations share the same temporal order....


References
Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers (some followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieves state-of-the-art performance on ImageNet classification.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
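The gating mechanism described in the abstract can be written out for a single toy unit. The sketch below uses the now-standard variant with a forget gate (a later extension by Gers et al., not part of the original 1997 formulation); the weight names in W are illustrative assumptions:

```python
import math

def lstm_step(x, h, c, W):
    """One scalar LSTM time step. The additive cell update (the 'constant
    error carousel') is what preserves gradient flow over long time lags."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    i = sig(W["wi"] * x + W["ui"] * h + W["bi"])        # input gate
    f = sig(W["wf"] * x + W["uf"] * h + W["bf"])        # forget gate
    o = sig(W["wo"] * x + W["uo"] * h + W["bo"])        # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h + W["bg"])  # candidate value
    c_new = f * c + i * g            # additive update: constant error flow
    h_new = o * math.tanh(c_new)     # gated output
    return h_new, c_new
```

With all weights and biases at zero, each gate opens halfway (sigmoid of 0), so the cell state is simply halved at every step; the multiplicative gates learn to open and close access to this error-carrying pathway.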

72,897 citations


"Re-Sign: Re-Aligned End-to-End Sequ..." refers background in this paper

  • ...LSTMs [19] were introduced nearly two decades ago....


Proceedings Article
04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

55,235 citations


"Re-Sign: Re-Aligned End-to-End Sequ..." refers methods in this paper

  • ...After comparing different CNN architectures [34, 24, 37], we opted for the 22 layer deep GoogleNet [37] architecture, which we initially pre-train on the 1....

