Re-Sign: Re-Aligned End-to-End Sequence Modelling with Deep Recurrent CNN-HMMs
Citations
382 citations
Cites background or methods from "Re-Sign: Re-Aligned End-to-End Sequ..."
...Therefore, the numerous advances in SLR [15], and even the move to the challenging Continuous SLR (CSLR) problem [33, 36], do not allow us to provide meaningful interpretations...
[...]
...Computer vision researchers adopted CTC and applied it to weakly labeled visual problems, such as lip reading [3], action recognition [30], hand shape recognition [6] and CSLR [6, 17]....
[...]
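Connectionist Temporal Classification (CTC), mentioned in the excerpt above, learns from weak, sequence-level labels by summing over every frame-level alignment that collapses (merge repeats, drop blanks) to the target label sequence. A minimal sketch of that forward recursion on toy per-frame probabilities, checked against brute-force enumeration — the numbers and vocabulary are illustrative, not taken from the cited papers:

```python
from itertools import product

BLANK = 0  # conventional CTC blank index

def ctc_prob(probs, target):
    """P(target | probs) under CTC via the standard forward (alpha) recursion.
    probs: per-frame distributions over {BLANK, 1, ..., V-1}."""
    ext = [BLANK]
    for c in target:
        ext += [c, BLANK]              # interleave blanks: b c1 b c2 b ...
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][BLANK]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s >= 1:
                a += alpha[t - 1][s - 1]
            # skip transition allowed between distinct non-blank labels
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

def collapse(path):
    """CTC collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for c in path:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return out

def ctc_prob_brute(probs, target):
    """Reference implementation: enumerate all V^T alignment paths."""
    T, V = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(V), repeat=T):
        if collapse(path) == list(target):
            p = 1.0
            for t, c in enumerate(path):
                p *= probs[t][c]
            total += p
    return total
```

The dynamic program is what makes CTC practical: the brute-force sum above is exponential in the number of frames, while the alpha recursion is O(T · S).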
...To achieve NMT from sign videos, we employed CNN based spatial embedding, various tokenization methods including state-of-the-art RNN-HMM hybrids [36] and attention-based encoder-decoder networks, to jointly learn to align, recognize and translate sign videos to spoken text....
[...]
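The attention-based encoder-decoder mentioned in the excerpt above weights each encoder state by a normalized score against the current decoder state, and feeds the resulting context vector into the decoder. A minimal dot-product attention sketch in plain Python — the vectors and the scoring function are illustrative assumptions, not the cited architecture:

```python
import math

def attend(query, keys, values):
    """Dot-product attention: score the decoder query against each encoder
    key, softmax-normalize the scores, and return the attention weights
    plus the weighted sum (context vector) of the encoder values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                              # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    context = [sum(weights[i] * values[i][d] for i in range(len(values)))
               for d in range(dim)]
    return weights, context
```

In translation models of this family, the weights double as a soft alignment between decoder steps and encoder positions, which is why such models can "jointly learn to align" and translate.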
...The PHOENIX datasets were created for CSLR and they provide sequence level gloss annotations....
[...]
...[36] as our Tokenization Layer, which is the state-of-the-art CSLR....
[...]
237 citations
Cites background from "Re-Sign: Re-Aligned End-to-End Sequ..."
...00 yes no
Signum [118]: 465 (24 train, 1 test), -, 25, yes, 15,075, yes, no
MS-ASL [62]: 1,000 (165 train, 37 dev, 20 test), -, 222, yes, 25,513, no, yes
RWTH Phoenix [43]: 1,081, 9, no, 6,841, yes, yes
RWTH Phoenix SI5 [74]: 1,081 (8 train, 1 test), -, 9, yes, 4,667, yes, yes
Devisign [22]: 2,000, 8, no, 24,000, no, no
Table 1. Popular public corpora of sign language video. These datasets are commonly used for sign language recognit...
[...]
...on method impacts both content and signer identity. For example, some corpora are formed of professional interpreters paid to interpret spoken content, such as news channels that provide interpreting [43,74,21]. Others are formed of expert signers paid to sign desired corpus content (e.g., [65,124,118]). Yet other corpora consist of sign language videos posted on sites such as YouTube (e.g., [62]) – these post...
[...]
229 citations
Cites background or methods or result from "Re-Sign: Re-Aligned End-to-End Sequ..."
...Several recent approaches taking advantage of neural networks have also been proposed for continuous SL recognition [12], [13], [22]....
[...]
...In their later works [13], [22], they use the frame-state alignment, provided by a baseline HMM recognition system, as frame labelling to train the embedded neural networks....
[...]
...In contrast with these works [12], [13], [22], [25], our sequence learning module, based on recurrent neural networks trained end-to-end, shows much greater learning capacity and better performance on dynamic dependencies....
[...]
...However, the frame-wise labelling adopted in [12], [13], [22] is noisy for training the deep neural networks, and HMMs may struggle to learn the complex dynamic variations, considering their limited representation capability....
[...]
...1% on the test set reported in [22], our results reduce the WER of the state of the art by a margin of around 5%, which is a relative improvement of more than 10%....
[...]
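The WER margin quoted above is an edit-distance metric over words — (substitutions + deletions + insertions) divided by the reference length — and the "relative improvement" divides the absolute WER reduction by the baseline WER. A small sketch; the sentences and baseline numbers below are made up for illustration:

```python
def wer(ref, hyp):
    """Word error rate: (S + D + I) / len(ref), computed as the
    Levenshtein distance between the word sequences."""
    R, H = ref.split(), hyp.split()
    d = [[0] * (len(H) + 1) for _ in range(len(R) + 1)]
    for i in range(len(R) + 1):
        d[i][0] = i                      # delete all remaining ref words
    for j in range(len(H) + 1):
        d[0][j] = j                      # insert all remaining hyp words
    for i in range(1, len(R) + 1):
        for j in range(1, len(H) + 1):
            sub = d[i - 1][j - 1] + (R[i - 1] != H[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[len(R)][len(H)] / len(R)

# Illustrative numbers only: a 5-point absolute drop from a 44% baseline
# is a relative improvement of about 11.4%, i.e. more than 10%.
baseline_wer, new_wer = 0.44, 0.39
relative_improvement = (baseline_wer - new_wer) / baseline_wer
```

Note that WER can exceed 1.0 when the hypothesis inserts many extra words, which is why it is reported as an error rate rather than an accuracy.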
210 citations
Cites methods or result from "Re-Sign: Re-Aligned End-to-End Sequ..."
...Hybrid CNN-LSTM-HMM models learnt with an Expectation Maximization (EM) algorithm alleviate such problems [4]....
[...]
...Comparable to [4] and with a similar effect as ‘curriculum learning’ [69], we observe the fastest convergence if we first train a CNN-HMM model with randomly shuffled input images for 4 EM re-alignment iterations, starting from a flat start in each stream....
[...]
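The EM re-alignment iterations described above alternate between training the network on the current frame-to-state labels and re-deriving those labels with a Viterbi forced alignment through a left-to-right HMM. A toy sketch of the alignment step — the emission scores are invented stand-ins for the network's per-frame log-likelihoods:

```python
def force_align(log_probs):
    """Viterbi forced alignment of T frames to N left-to-right states:
    each frame gets one state, states are visited in order (stay or
    advance by one), and the max-score path ending in the last state
    is returned. log_probs[t][s] = log-likelihood of frame t in state s."""
    T, N = len(log_probs), len(log_probs[0])
    NEG = float("-inf")
    dp = [[NEG] * N for _ in range(T)]
    bp = [[0] * N for _ in range(T)]
    dp[0][0] = log_probs[0][0]           # must start in the first state
    for t in range(1, T):
        for s in range(N):
            stay = dp[t - 1][s]
            move = dp[t - 1][s - 1] if s > 0 else NEG
            dp[t][s] = max(stay, move) + log_probs[t][s]
            bp[t][s] = s if stay >= move else s - 1
    path = [N - 1]                       # backtrace from the final state
    for t in range(T - 1, 0, -1):
        path.append(bp[t][path[-1]])
    return path[::-1]

# One EM-style outer loop would then look like (sketch only):
#   for it in range(num_realignments):
#       labels = [force_align(model_scores(video)) for video in data]
#       ...retrain the network on the (frame, state-label) pairs...
```

Each re-alignment cleans up the frame labelling the network is trained on, which is the mechanism the excerpt credits with alleviating noisy-label problems.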
...We extend our previous work on hybrid HMM modelling for sign language recognition [3] [4] [5] by adding multi-stream HMMs with synchronisation constraints....
[...]
...For comparison, we provide the WER achieved on a single-stream system, which represents the same setup as in [4]....
[...]
...Hybrid HMM modelling has been shown to outperform other sequence learning approaches on sign language recognition data sets, while requiring less memory and allowing for deeper architectures [4]....
[...]
183 citations
Cites methods from "Re-Sign: Re-Aligned End-to-End Sequ..."
...utilized a state-of-the-art CSLR method [41] to obtain sign glosses, and then used an attention-based text-to-text NMT model [44] to learn the sign gloss to spoken language sentence translation, p(S|G) [9]....
[...]
...The resulting Sign2Gloss2Text model first recognized glosses from continuous sign videos using a state-of-the-art CSLR method [41], which worked as a tokenization layer....
[...]
...This made the use of available CSLR methods [42, 41] (that were designed to learn from weakly annotated data) infeasible, as they are built on the assumption that sign language videos and corresponding annotations share the same temporal order....
[...]