Advances in Phonetics-based Sub-Unit Modeling for Transcription Alignment
and Sign Language Recognition
Vassilis Pitsikalis and Stavros Theodorakis
School of Electrical and Computer Engineering
National Technical University of Athens
{vpitsik,sth}@cs.ntua.gr
Christian Vogler
Institute for Language and Speech Processing
Athena R.C.
cvogler@ilsp.athena-innovation.gr
Petros Maragos
School of Electrical and Computer Engineering
National Technical University of Athens
maragos@cs.ntua.gr
Abstract
We explore novel directions for incorporating phonetic
transcriptions into sub-unit based statistical models for sign
language recognition. First, we employ a new symbolic pro-
cessing approach for converting sign language annotations,
based on HamNoSys symbols, into structured sequences
of labels according to the Posture-Detention-Transition-
Steady Shift phonetic model. Next, we exploit these la-
bels, and their correspondence with visual features to con-
struct phonetics-based statistical sub-unit models. We also
align these sequences, via the statistical sub-unit construc-
tion and decoding, to the visual data to extract time bound-
ary information that they would lack otherwise. The result-
ing phonetic sub-units offer new perspectives for sign lan-
guage analysis, phonetic modeling, and automatic recogni-
tion. We evaluate this approach via sign language recogni-
tion experiments on an extended Lemmas Corpus of Greek
Sign Language, which results not only in improved perfor-
mance compared to pure data-driven approaches, but also
in meaningful phonetic sub-unit models that can be further
exploited in interdisciplinary sign language analysis.
1. Introduction
Phonetic transcriptions are crucial for the performance of
sign language (SL) and speech recognition systems. For the
recognition of SL, which is the primary means of commu-
nication for many deaf people, this has not been practical,
due to the huge level of effort required for creating detailed
phonetic annotations, unlike the case of speech recognition.
Another problem is the lack of appropriate phonetic models
in the area of SL linguistics (although this is changing now).
Thus, data-driven methods have prevailed in recent years.
We propose a novel approach to address these issues. It
is based on two aspects: (1) converting SL annotations into
structured sequential phonetic labels, and (2) incorporating
these labels into a sub-unit-based statistical framework for
training, alignment, and recognition. This framework can
be applied similarly to arbitrary gesture data.
Recent successful data-driven methods include [1, 4, 2,
5, 3, 12, 8]. One employs a linguistic feature vector based
on measured visual features, such as relative hand move-
ments [2]. Another one clusters independent frames via
K-means, and produces “phenones” [1]. Instead of single
frames, [4, 5, 12] cluster sequences of frames on the feature
level, such that they exploit the dynamics inherent to sign
language. Recently, separate features and modeling for dy-
namic vs. static segments have been proposed [8].
These data-driven approaches allow adapting recogni-
tion systems to the concrete feature space, and work even
in the face of insufficient detailed transcriptions. As men-
tioned before, creating such transcriptions requires an im-
practical amount of effort, unlike phoneme-level transcrip-
tions for speech recognition. Yet, their value is clear: they
simplify adding new words to the lexicon, and allow cap-
turing commonalities across signs. They can also be used
to create meaningful representations of intra-sign segments,
for further linguistic or interdisciplinary processing.
Our approach is based on having annotations in Ham-
NoSys [9], the creation of which requires less effort than
full phonetic descriptions, and incorporating them into a
statistical recognition system. This is conceptually similar
to taking a written word and converting it into its pronunci-
ation in speech recognition, and has hitherto not been pos-
sible for SL recognition. Our first contribution is that we
have developed a parsing system for converting HamNoSys
into structured phonetic sequences of labels, according to
the Posture-Detention-Transition-Steady Shift (PDTS) sys-
tem [6]. However, they do not provide any timing informa-
tion, which leads us to the second contribution: We employ
simple visual tracking features extracted from sign language
videos. Using them in conjunction with the phonetic la-
bels, we construct sub-units via a statistical hidden Markov
model (HMM)-based system, which allows us to align the
PDTS sequences with the visual data segments. The result-
ing output consists of sub-units that are no longer purely
data-driven, in contrast to previous work. Rather, they are
phonetic sub-units, each of which corresponds to a mean-
ingful PDTS label, along with the timing information on
where they occur in the data.
Once the segments have been mapped to their PDTS la-
bels, the output of the recognition system produces phonetic
labels during decoding. Such labels are invaluable in inter-
disciplinary research tasks, such as linguistic analysis and
synthesis. We evaluate the proposed approach by perform-
ing recognition experiments on a new corpus of 1000 Greek
Sign Language lemmata, with promising results.
2. Data, Visual Processing and Overview
Data: The Greek Sign Language (GSL) Lemmas Cor-
pus consists of 1046 isolated signs, 5 repetitions each, from
two native signers (male and female). The videos have a
uniform background and a resolution of 1440x1080 pixels,
recorded at 25 frames per second, interlaced.
Visual Processing: For the segmentation and detection
of the signer’s hands and head in the Greek Sign Lan-
guage (GSL) Lemmas Corpus, we employed a skin color
model utilizing a Gaussian Markov Model (GMM), ac-
companied by morphological processing to enhance skin
detection. Moreover, for tracking we employed forward-
backward linear prediction, and template matching, in or-
der to disambiguate occlusions. The adopted approach is
described in [10]. The extracted feature vector has five com-
ponents, and consists of the planar coordinates of the dom-
inant hand, the instantaneous direction, and the velocity.
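To make the feature vector concrete, the following is a minimal sketch of how such a five-dimensional vector could be assembled from tracked dominant-hand centroids; the split into position, unit direction, and speed, as well as the function name, are assumptions for illustration only, since the paper does not spell out the exact definitions.

```python
import numpy as np

def movement_features(centroids, fps=25.0):
    """Sketch: build a 5-D feature vector per frame from dominant-hand centroids.

    centroids: (T, 2) array of planar (x, y) hand positions, one row per frame.
    Returns (T, 5): [x, y, dir_x, dir_y, speed] -- an assumed decomposition of
    "planar coordinates, instantaneous direction, and velocity".
    """
    centroids = np.asarray(centroids, dtype=float)
    # Frame-to-frame displacement (prepend a zero row so lengths match).
    disp = np.vstack([np.zeros((1, 2)), np.diff(centroids, axis=0)])
    step = np.linalg.norm(disp, axis=1)          # per-frame displacement length
    speed = step * fps                           # approximate velocity magnitude
    # Unit direction of motion; zero where the hand is (nearly) static.
    direction = np.divide(disp, step[:, None],
                          out=np.zeros_like(disp),
                          where=step[:, None] > 1e-6)
    return np.hstack([centroids, direction, speed[:, None]])

# Example: a short synthetic trajectory moving down-right.
if __name__ == "__main__":
    traj = np.cumsum(np.tile([[3.0, 2.0]], (10, 1)), axis=0)
    print(movement_features(traj).shape)  # (10, 5)
```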
Overview: In the following, we adopt the Greek signs
for PILE, IMMEDIATELY, and EUROPE as examples from
the corpus. Figure 1 shows the initial and end frames of
each sign superimposed. The arrows illustrate the move-
ments of the hands between the frames. In the next sections
we present details on the articulation of these signs via rep-
resentative examples alongside the contributions.
3. Data-Driven Sub-Units without Phonetic
Evidence for Recognition
Our data-driven approach is based on the work in [8].
Other previous approaches include [1, 4, 5].

(a) PILE (b) IMMEDIATELY (c) EUROPE
Figure 1. Overview of articulation for three selected GSL signs.

We segment signs automatically and construct data-driven sub-units, which are the primitive segments that are used to
construct all signs that share similar articulation parame-
ters. Based on simple movement-related measurements for
the dominant hand, the first step for sub-unit construction
involves the unsupervised partitioning of the segments into
two groups with respect to their movement dynamics for
each sign unit, a model-based process finds the segmenta-
tion points and assigns them the label “static” or “dynamic.
For the second step, the sub-unit construction (i.e., the
statistical modeling and the features employed for the static
or dynamic segments) depends on the assigned label: For
static segments, we employ K-means for clustering based
on their position. For dynamic segments, we employ hier-
archical clustering based on their DTW distances wrt. the
instantaneous direction. Thus, after clustering we end up
with a lexicon, where each sign consists of a sequence of
dynamic and static sub-units. The characteristics of the ap-
proach above imply a sequential structure of dynamic and
static segments that are explicitly accounted for by the pro-
posed sub-unit construction and statistical modeling.
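A minimal sketch of this two-step construction follows, assuming scikit-learn and SciPy and a naive DTW over per-frame direction vectors; the segmentation itself, the exact distance definitions, and the cluster counts are not taken from the paper and are only placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw_distance(a, b):
    """Naive DTW between two sequences of instantaneous-direction vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

def cluster_static(static_positions, n_clusters=8):
    """Static segments: K-means on (x, y) positions (cluster count is an assumption)."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(static_positions)

def cluster_dynamic(direction_sequences, n_clusters=8):
    """Dynamic segments: hierarchical clustering on pairwise DTW distances."""
    n = len(direction_sequences)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(direction_sequences[i],
                                                   direction_sequences[j])
    Z = linkage(squareform(dist), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```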
4. Conversion of Annotations to Phonetic
Transcriptions
There has been little progress in the area of phonetic
modeling for the purposes of SL recognition since the work
of Vogler and Metaxas [11]. It is possible that the lack of
widely available phonetic transcriptions in sign language
corpora has contributed to this state of affairs. Because
of the level of detail required, such transcriptions are time-
consuming to produce and involve a steep learning curve.
In this paper, we propose a different approach that con-
sists of generating annotations that are merely detailed
enough to reproduce the sign, and having the computer con-
vert these to the full phonetic structure. This approach has
the advantage that it takes far less time and human train-
ing to produce the annotations. A disadvantage, however, is
that such annotations make assumptions that require com-
plex inferences by the conversion code. Describing such
inferences in detail is beyond the scope of this paper; in the
following we give a general overview of the method.
Like in the work by Vogler and Metaxas, the basic pho-
netic structure of a sign is a sequence of segments, which we
model according to Johnson’s and Liddell’s recent work on
the Posture-Detention-Transition-Steady Shift (PDTS) sys-
tem [6]. It supersedes the older Movement-Hold model [7]
used in earlier work, and fixes many of its shortcomings.¹
In this system, each sign can be considered as a sequence
of key points in the form of postures (P), with associated
hand configuration and location information. Transitions
(T) correspond to hand movements between the key points,
with attached trajectory information. Detentions (D) are
like P, but the hand is held stationary; steady shifts are like
T, but with a slow, deliberate movement; in this paper we
distinguish only among P, D and T. In addition, we con-
sider epenthesis movements (E) [7] to be distinct from T;
the former are transitions between two locations without an
explicit path, and primarily occur when the hands move into
position between signs, and during repeated movements.
An example of the basic structure of the sign for PILE (E P T P T P E) is shown in Fig. 2 and Table 1.
The annotations of the signs are coded in HamNoSys [9],
a symbolic annotation system that can describe a sign in
sufficient detail to display it in an animated avatar. It mod-
els signs as clusters of handshape, orientation, location,
and movement, without explicit segmentation information,
which makes it unsuitable for direct application to recog-
nition systems. HamNoSys’s philosophy is minimalist, in
the sense that it avoids redundancy and strives to describe a
sign in detail with as few symbols as possible. To this end, it
provides symmetry and repetition operators, and describes
only how a sign’s configuration changes over time. As an
example consider the first part of the sign for PILE:
[HamNoSys symbol string, not reproduced]
This annotation says that the hands move symmetrically,
so it needs to provide only the hand configuration and loca-
tion for the right hand, and the fact that the fingers of both
hands touch each other. In contrast, the left hand’s informa-
tion (mirrored along the x axis) is implied.
In order to model signs properly in the recognition sys-
tem, we require that all information, according to the PDTS
system, is made explicit for every segment; that is, Ps and
Ds contain the full information on hand configuration and
location, and Ts contain the full information on movement
trajectories, for each hand respectively. Our conversion
method from HamNoSys to the PDTS structure resolves the
implied parts, and splits each sign into its constituent seg-
ments. The key step consists of accumulating deltas, which
¹ Specifically, movements no longer have attached location information, which previously had prevented a direct adaptation to recognition systems. In addition, there is a strict alternation of P/D with T/S, whereas the older model could have sequences of movements without intervening holds.
Table 1. Phonetic PDTS labels of the corresponding sub-units for the sign “PILE” (location and trajectories only).
Frames  Type  PDTS label
1:12    E     rest-position  location-head
13:13   P     location-head
14:25   T     directedmotion, curve-r, direction-o, second-direction-do, tense-true
26:27   P     location-torso, side=right beside
28:50   T     directedmotion, direction-dr, small
51:51   P     location-torso, side=right beside down
52:66   E     location-torso, side=right beside down  rest-position
describe how a posture or transition has changed with re-
spect to a prototype. These are then applied in a specific
order. Note that this process also works for independent
channels of information, such as hand configuration versus
location, dominant hand versus nondominant hand, and so
on, and provides relative timings of segments across chan-
nels; however, the details are beyond the scope of this paper.
Further examples of PDTS sequences can be found in
Tables 1 and 2. The details of the conversion are beyond the
scope of this paper, due to space limitations, and will be
published separately.
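As an illustration of what the conversion produces, a PDTS sequence can be held as an ordered list of typed, labeled segments; the sketch below encodes the PILE example from Table 1 (the segment labels come from that table, while the dataclass itself and the arrow notation are only an assumed representation).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class PDTSSegment:
    seg_type: str                      # one of "P", "D", "T", "E"
    label: str                         # PDTS label (location or movement description)
    frames: Optional[Tuple[int, int]] = None   # filled in only after alignment (Sec. 5)

# The sign PILE as a structured PDTS sequence (labels from Table 1);
# frame boundaries are unknown until the statistical alignment step.
PILE: List[PDTSSegment] = [
    PDTSSegment("E", "rest-position -> location-head"),
    PDTSSegment("P", "location-head"),
    PDTSSegment("T", "directedmotion, curve-r, direction-o, second-direction-do, tense-true"),
    PDTSSegment("P", "location-torso, side=right beside"),
    PDTSSegment("T", "directedmotion, direction-dr, small"),
    PDTSSegment("P", "location-torso, side=right beside down"),
    PDTSSegment("E", "location-torso, side=right beside down -> rest-position"),
]
```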
5. Phonetics-Based Sub-units, Training, Alignment and Recognition
In the previous section we have covered our first main
contribution. Our second main contribution consists of in-
corporating the phonetic labels into a statistical recogni-
tion system. The data-driven-only sub-units from Section 3,
without any phonetic information, adapt well to specific fea-
ture spaces. However, they produce meaningless sub-unit
labels, which cannot be exploited for interdisciplinary sign
language processing (e.g., synthesis, linguistics).
We call the process of incorporating the phonetic in-
formation “Phonetic Sub-unit Construction for Recognition”. This is the first time that the following are taken into
account in an automatic, statistical, and systematic way:
(1) phonetic transcriptions of SL, provided as described in
the previous section by the PDTS system, and (2) the corre-
sponding underlying visual data and features from process-
ing the video data and the feature extraction. This process
involves the following steps: (1) phonetic sub-unit
construction and training, (2) phonetic label alignment and
segmentation, (3) lexicon construction, and (4) recognition.
5.1. Phonetic Sub-Unit Model Training
For each phonetic label provided by the PDTS system,
and the features from the visual processing, we train one
sub-unit HMM. These sub-units have both phonetic labels
from the PDTS structure, and statistical parameters stemming from the data-driven models, as a result of the training step. An example is illustrated in Table 1, which lists the sequence of phonetic labels for the sign “PILE”.

(a) E (b) P (c) T (d) P (e) T (f) P (g) E
Figure 2. Sign for PILE: Segments after incorporation of PDTS phonetic labels into Phonetic Sub-unit Construction, Training and Alignment. Superimposed start and end frames of each sequence of segments, accompanied by an arrow for transitions and epenthesis. Each segment corresponds to a single phonetic label. PDTS segment labels are of type Epenthesis (E), Posture (P), or Transition (T).
We use different HMM parameters for each type of sub-
unit. Distinguishing between movements (T/E) and pos-
tures/detentions (P/D) corresponds to making a distinction
between dynamic and static segments, as described in Sec-
tion 3. This also is consistent with the concepts in the old
Movement-Hold model [7]. For T and E, we employ a 6-state and a 3-state Bakis HMM topology, respectively. For P and D, we use a 1-state HMM and a 2-state left-right HMM, respectively. One mixture with a diagonal covariance matrix was employed for each HMM. We initialize the pho-
netic sub-unit models in a uniform way with a flat-start pro-
cedure using the global mean and covariance of the feature
space, and employ embedded training on strings of concate-
nated sub-unit models with unsegmented data.
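The following numpy sketch illustrates the flat-start initialization just described: type-dependent state counts with a Bakis (left-to-right) transition structure, and every state initialized with the global mean and diagonal covariance of the feature space. The self-loop probability, the skip transitions, and the dictionary layout are assumptions; the embedded re-estimation over concatenated models is not shown.

```python
import numpy as np

STATES_PER_TYPE = {"T": 6, "E": 3, "P": 1, "D": 2}   # state counts from Section 5.1

def flat_start_hmm(seg_type, features, self_loop=0.6):
    """Initialize one sub-unit HMM: Bakis transitions + global mean/diag covariance."""
    n = STATES_PER_TYPE[seg_type]
    # Left-to-right (Bakis) transition matrix: self-loop, step to the next state,
    # and a skip of one state where possible (skip/self-loop values are assumptions).
    A = np.zeros((n, n))
    for i in range(n):
        targets = [j for j in (i, i + 1, i + 2) if j < n]
        A[i, targets] = [self_loop] + \
            [(1.0 - self_loop) / max(len(targets) - 1, 1)] * (len(targets) - 1)
        A[i] /= A[i].sum()
    # Flat start: every state shares the global statistics of the feature space.
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6        # diagonal covariance
    means = np.tile(mean, (n, 1))
    covs = np.tile(var, (n, 1))
    start = np.zeros(n)
    start[0] = 1.0
    return {"startprob": start, "transmat": A, "means": means, "covars": covs}

# Example: initialize a Transition (T) model from pooled 5-D features.
if __name__ == "__main__":
    feats = np.random.randn(1000, 5)
    hmm = flat_start_hmm("T", feats)
    print(hmm["transmat"].shape, hmm["means"].shape)   # (6, 6) (6, 5)
```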
5.2. Alignment and Time Segmentation
We concatenate the trained HMMs into a recognition
network and decode each feature sequence via the Viterbi
algorithm. This results in a sequence of phonetic PDTS
labels, together with their respective starting and ending
frames. Doing this for all sequences results in a lexicon
with segmentation boundaries for each PDTS label.
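Decoding yields one sub-unit label per frame; collapsing consecutive identical labels gives the per-label start and end frames that populate the lexicon. A small sketch follows (the label strings are illustrative shorthand, not the paper's exact PDTS labels):

```python
from itertools import groupby

def path_to_segments(frame_labels):
    """Collapse a per-frame decoded label path into (label, start, end) segments.

    frame_labels: list with one PDTS sub-unit label per frame, e.g. the Viterbi
    output mapped from states back to their sub-unit models.
    """
    segments, t = [], 0
    for label, run in groupby(frame_labels):
        n = len(list(run))
        segments.append((label, t, t + n - 1))   # inclusive frame boundaries
        t += n
    return segments

# Example: a 10-frame decoding E E E P P T T T T P
print(path_to_segments(["E"] * 3 + ["P"] * 2 + ["T"] * 4 + ["P"]))
# [('E', 0, 2), ('P', 3, 4), ('T', 5, 8), ('P', 9, 9)]
```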
We recognize signs by decoding unseen test data in the
HMM network on the PDTS label level. We evaluate the
accuracy on the sign level, based on the lexicon above.
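Sign-level evaluation then reduces to matching the decoded label sequence against the lexicon entries. The sketch below uses edit distance as the matching criterion, which is only an assumption (the paper does not specify its matching rule), and the toy lexicon with shorthand labels is purely illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

def recognize(decoded_labels, lexicon):
    """Return the lexicon sign whose PDTS label sequence is closest to the decoding.

    lexicon: dict mapping sign gloss -> list of PDTS labels.
    """
    return min(lexicon, key=lambda sign: edit_distance(decoded_labels, lexicon[sign]))

# Example with a toy lexicon of two signs (shorthand labels, hypothetical).
lexicon = {
    "PILE":   ["E-to-head", "P-head", "T-curve", "P-torso", "T-down-right", "P-torso-low", "E-rest"],
    "EUROPE": ["E-to-head", "P-head", "T-circular", "P-head", "E-rest"],
}
print(recognize(["E-to-head", "P-head", "T-circular", "P-head", "E-rest"], lexicon))  # EUROPE
```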
Fig. 2 shows an example of the segmentation acquired
during the decoding, which illustrates the sequence of pho-
netic sub-units for the above-mentioned sign for “PILE”.
Each image corresponds to a phonetic PDTS segment pro-
duced by the decoding. For visualization, we adopt the fol-
lowing conventions: (1) For T and E segments, we superim-
pose their respective initial and final frames. We also high-
light specific movement trajectories with an arrow from the
initial to the final hand position in the respective segment.
(2) For P and D segments, we show only the first frame of
the segment, as the hand does not move within them. In ad-
dition, the labels corresponding to this sign, along with the
segmentation boundaries, are listed in Table 1.
5.3. Phonetic Sub-Units Results
Figs. 3 and 4 show examples of movement-based sub-
units (T and E), using x and y coordinates mapped from
the signing space. For the corresponding phonetic labels
see Table 2. Fig. 3(a) shows a common epenthesis sub-unit
(E-to-head). It models the movement from the rest position
to the head, a common starting posture. Fig. 3(b) corre-
sponds to a circular transition sub-unit (T-circular). An in-
dicative sign that contains this sub-unit is “EUROPE” (see
Fig. 1(c)). Fig. 3(c) and 3(d) depict directed transition sub-
units (T-down-right, T-in-left) with right-down and left di-
rections, respectively. Representative signs are “PILE” and
“IMMEDIATELY”, respectively (see Fig. 1(a), 1(b)).
In Fig. 4 we show results for the P and D sub-units, with
the actual coordinates for four different postures superim-
posed in different colors. P-head, P-stomach, P-shoulder
and P-head-top correspond to locations at the signer’s head,
stomach, shoulder and top of head, respectively.
In all these figures, there are cases of compact phonetic
sub-units with less variance, of sparsely populated ones
(i.e., few available data), and some that contain outliers.
For instance, the sub-unit P-head-top is compact, but has
few data. In contrast, P-head has more data and increased
variance. The sub-unit for the initial transition from the
rest posture to the starting position occurs in many signs,
whereas other sub-units may occur in only a single sign.
Outliers and high variances seem to be caused by visual pro-
cessing inaccuracies (we perform 2D, rather than 3D, pro-
cessing), tracking or parameter estimation errors, or human
annotator errors, or actual data exhibiting such properties.
6. Sign Language Recognition Experiments
The recognition task in this paper was conducted on one
signer and 961 out of the 1046 signs. Approximately half of
the missing 85 signs share the same pronunciation with an-
other sign, and thus are the same for recognition purposes,
while the other half were eliminated due to unacceptably
poor tracking or poor segmentation of the five repetitions
[Figure 3 plots: (a) E-to-head, (b) T-circular, (c) T-down-right, (d) T-in-left; axes are x and y coordinates in the signing space.]
Figure 3. Sub-units after Phonetic Sub-unit Construction, Training and Alignment. (a) corresponds to an epenthesis sub-unit (E-to-head)
and (b-d) to transition sub-units (T-circular, T-down-right, T-in-left). Trajectories are illustrated in the real signing space normalized wrt.
their initial position (x,y) = (0,0). Red marker indicates trajectories’ start position. See Table 2 for the corresponding phonetic labels.
[Figure 5 plots: (a) Sign Accuracy % vs. Number of Signs, (b) Number of Sub-units vs. Number of Signs, (c) Sign Accuracy % vs. # Clusters in Dynamic Segments; curves for DD and P.]
Figure 5. Comparison of Data-Driven (DD) sub-units without phonetic evidence vs. the Phonetics-based approach (P). (a) Sign accuracy, (b) number of sub-units. In (a,b) the x-axis corresponds to the variation in the number of signs. (c) For the maximum number of signs, sign accuracy as affected by the number of sub-units (x-axis) in the DD case; in the Phonetic approach the number of sub-units is predefined.
[Figure 4 plot: posture sub-unit data superimposed in the signing space; legend: head, stomach, shoulder, head-top.]
Figure 4. Sub-units after Phonetic Sub-unit Construction, Training and Alignment. Data for multiple posture phonetic sub-units superimposed in the signing space, indicating their relative position to the signer. Sub-units with multiple colored pixels are: P-forehead, P-stomach, P-shoulder, P-head-top. The legend shows the primary locations of the corresponding phonetic labels (see also Table 2).
into individual signs. The data were split randomly into four
training examples and one testing example per sign, which
was the same across all experiments. Future work should
expand these experiments to both signers and the full set, as
more tracking results come in and improve.

Table 2. Examples of phonetic sub-units (PSU) and the signs in which they occur. ’*’ corresponds to multiple signs.
PSU           Sign          Type  PDTS Label
E-to-head     *             E     rest-position location-head
T-circular    EUROPE        T     circularmotion, axis=i
T-down-right  PILE          T     directedmotion, direction=dr, small
T-in-left     IMMEDIATELY   T     directedmotion, direction=il, fast=true, halt=true
P-forehead    *             P     location=forehead
P-stomach     *             P     location=stomach
P-shoulder    *             P     location=shouldertop, side=right beside
P-head-top    *             P     location=head-top

The visual processing and feature extraction were conducted as described
in Section 2. The modeling and recognition proceeded, as
described in the previous section. Our evaluation criterion
was the number of correctly recognized signs, via matching
sequences of phonetic labels to the lexicon.
We first compare the two approaches for sub-unit con-
struction, as follows: (1) Data-Driven (DD): Data-driven
sub-unit construction, which does not make use of any pho-
netic transcription labels. (2) Phonetic (P): Phonetics-based
approach which makes use of the PDTS phonetic labels, via
the statistically trained sub-unit models.
Second, we evaluate the relationship between lexicon
References
American Sign Language: The Phonological Base (Journal Article)
A Framework for Recognizing the Simultaneous Aspects of American Sign Language (Journal Article)
A Linguistic Feature Vector for the Visual Interpretation of Sign Language (Book Chapter)
Towards an automatic sign language recognition system using subunits (Journal Article)
Modelling and segmenting subunits for sign language recognition based on hand motion analysis (Journal Article)