Advances in Phonetics-based Sub-Unit Modeling for Transcription Alignment
and Sign Language Recognition
Vassilis Pitsikalis and Stavros Theodorakis
School of Electrical and Computer Engineering
National Technical University of Athens
{vpitsik,sth}@cs.ntua.gr
Christian Vogler
Institute for Language and Speech Processing
Athena R.C.
cvogler@ilsp.athena-innovation.gr
Petros Maragos
School of Electrical and Computer Engineering
National Technical University of Athens
maragos@cs.ntua.gr
Abstract
We explore novel directions for incorporating phonetic
transcriptions into sub-unit based statistical models for sign
language recognition. First, we employ a new symbolic pro-
cessing approach for converting sign language annotations,
based on HamNoSys symbols, into structured sequences
of labels according to the Posture-Detention-Transition-
Steady Shift phonetic model. Next, we exploit these la-
bels, and their correspondence with visual features to con-
struct phonetics-based statistical sub-unit models. We also
align these sequences, via the statistical sub-unit construc-
tion and decoding, to the visual data to extract time bound-
ary information that they would lack otherwise. The result-
ing phonetic sub-units offer new perspectives for sign lan-
guage analysis, phonetic modeling, and automatic recogni-
tion. We evaluate this approach via sign language recogni-
tion experiments on an extended Lemmas Corpus of Greek
Sign Language, which results not only in improved perfor-
mance compared to pure data-driven approaches, but also
in meaningful phonetic sub-unit models that can be further
exploited in interdisciplinary sign language analysis.
1. Introduction
Phonetic transcriptions are crucial for the performance of
sign language (SL) and speech recognition systems. For the
recognition of SL, which is the primary means of commu-
nication for many deaf people, this has not been practical,
due to the huge level of effort required for creating detailed
phonetic annotations, unlike the case of speech recognition.
Another problem is the lack of appropriate phonetic models
in the area of SL linguistics (although this is changing now).
Thus, data-driven methods have prevailed in recent years.
We propose a novel approach to address these issues. It
is based on two aspects: (1) converting SL annotations into
structured sequential phonetic labels, and (2) incorporating
these labels into a sub-unit-based statistical framework for
training, alignment, and recognition. This framework can
be applied similarly to arbitrary gesture data.
Recent successful data-driven methods include [1, 4, 2,
5, 3, 12, 8]. One employs a linguistic feature vector based
on measured visual features, such as relative hand move-
ments [2]. Another one clusters independent frames via
K-means, and produces “phenones” [1]. Instead of single
frames, [4, 5, 12] cluster sequences of frames on the feature
level, such that they exploit the dynamics inherent to sign
language. Recently, separate features and modeling for dy-
namic vs. static segments have been proposed [8].
These data-driven approaches allow adapting recogni-
tion systems to the concrete feature space, and work even
in the face of insufficient detailed transcriptions. As men-
tioned before, creating such transcriptions requires an im-
practical amount of effort, unlike phoneme-level transcrip-
tions for speech recognition. Yet, their value is clear: they
simplify adding new words to the lexicon, and allow cap-
turing commonalities across signs. They can also be used
to create meaningful representations of intra-sign segments,
for further linguistic or interdisciplinary processing.
Our approach is based on having annotations in Ham-
NoSys [9], the creation of which requires less effort than
full phonetic descriptions, and incorporating them into a
statistical recognition system. This is conceptually similar
to taking a written word and converting it into its pronunci-
ation in speech recognition, and has hitherto not been pos-
sible for SL recognition. Our first contribution is that we
have developed a parsing system for converting HamNoSys
into structured phonetic sequences of labels, according to
the Posture-Detention-Transition-Steady Shift (PDTS) sys-
tem [6]. However, they do not provide any timing informa-
tion, which leads us to the second contribution: We employ
simple visual tracking features extracted from sign language
videos. Using them in conjunction with the phonetic la-
bels, we construct sub-units via a statistical hidden Markov
model (HMM)-based system, which allows us to align the
PDTS sequences with the visual data segments. The result-
ing output consists of sub-units that are no longer purely
data-driven, in contrast to previous work. Rather, they are
phonetic sub-units, each of which corresponds to a mean-
ingful PDTS label, along with the timing information on
where they occur in the data.
Once the segments have been mapped to their PDTS la-
bels, the output of the recognition system produces phonetic
labels during decoding. Such labels are invaluable in inter-
disciplinary research tasks, such as linguistic analysis and
synthesis. We evaluate the proposed approach by perform-
ing recognition experiments on a new corpus of 1000 Greek
Sign Language lemmata, with promising results.
2. Data, Visual Processing and Overview
Data: The Greek Sign Language (GSL) Lemmas Cor-
pus consists of 1046 isolated signs, 5 repetitions each, from
two native signers (male and female). The videos have a
uniform background and a resolution of 1440x1080 pixels,
recorded at 25 frames per second, interlaced.
Visual Processing: For the segmentation and detection
of the signer’s hands and head in the Greek Sign Lan-
guage (GSL) Lemmas Corpus, we employed a skin color
model utilizing a Gaussian Markov Model (GMM), ac-
companied by morphological processing to enhance skin
detection. Moreover, for tracking we employed forward-
backward linear prediction, and template matching, in or-
der to disambiguate occlusions. The adopted approach is
described in [10]. The extracted feature vector has five com-
ponents, and consists of the planar coordinates of the dom-
inant hand, the instantaneous direction, and the velocity.
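To make the feature vector concrete, the following is a minimal sketch of how such a five-dimensional vector could be assembled from tracked dominant-hand centroids; the split into position, unit direction, and speed, as well as the function name, are assumptions for illustration only, since the paper does not spell out the exact definitions.

```python
import numpy as np

def movement_features(centroids, fps=25.0):
    """Sketch: build a 5-D feature vector per frame from dominant-hand centroids.

    centroids: (T, 2) array of planar (x, y) hand positions, one row per frame.
    Returns (T, 5): [x, y, dir_x, dir_y, speed] -- an assumed decomposition of
    "planar coordinates, instantaneous direction, and velocity".
    """
    centroids = np.asarray(centroids, dtype=float)
    # Frame-to-frame displacement (prepend a zero row so lengths match).
    disp = np.vstack([np.zeros((1, 2)), np.diff(centroids, axis=0)])
    step = np.linalg.norm(disp, axis=1)          # per-frame displacement length
    speed = step * fps                           # approximate velocity magnitude
    # Unit direction of motion; zero where the hand is (nearly) static.
    direction = np.divide(disp, step[:, None],
                          out=np.zeros_like(disp),
                          where=step[:, None] > 1e-6)
    return np.hstack([centroids, direction, speed[:, None]])

# Example: a short synthetic trajectory moving down-right.
if __name__ == "__main__":
    traj = np.cumsum(np.tile([[3.0, 2.0]], (10, 1)), axis=0)
    print(movement_features(traj).shape)  # (10, 5)
```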
Overview: In the following, we adopt the Greek signs
for PILE, IMMEDIATELY, and EUROPE as examples from
the corpus. Figure 1 shows the initial and end frames of
each sign superimposed. The arrows illustrate the move-
ments of the hands between the frames. In the next sections
we present details on the articulation of these signs via rep-
resentative examples alongside the contributions.
3. Data-Driven Sub-Units without Phonetic
Evidence for Recognition
Our data-driven approach is based on the work in [8].
Other previous approaches include [1, 4, 5].

(a) PILE (b) IMMEDIATELY (c) EUROPE
Figure 1. Overview of articulation for three selected GSL signs.

We segment signs automatically and construct data-driven sub-units, which are the primitive segments that are used to
construct all signs that share similar articulation parame-
ters. Based on simple movement-related measurements for
the dominant hand, the first step for sub-unit construction
involves the unsupervised partitioning of the segments into
two groups with respect to their movement dynamics for
each sign unit, a model-based process finds the segmenta-
tion points and assigns them the label “static” or “dynamic.
For the second step, the sub-unit construction (i.e., the
statistical modeling and the features employed for the static
or dynamic segments) depends on the assigned label: For
static segments, we employ K-means for clustering based
on their position. For dynamic segments, we employ hier-
archical clustering based on their DTW distances wrt. the
instantaneous direction. Thus, after clustering we end up
with a lexicon, where each sign consists of a sequence of
dynamic and static sub-units. The characteristics of the ap-
proach above imply a sequential structure of dynamic and
static segments that are explicitly accounted for by the pro-
posed sub-unit construction and statistical modeling.
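A minimal sketch of this two-step construction follows, assuming scikit-learn and SciPy and a naive DTW over per-frame direction vectors; the segmentation itself, the exact distance definitions, and the cluster counts are not taken from the paper and are only placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw_distance(a, b):
    """Naive DTW between two sequences of instantaneous-direction vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[len(a), len(b)]

def cluster_static(static_positions, n_clusters=8):
    """Static segments: K-means on (x, y) positions (cluster count is an assumption)."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(static_positions)

def cluster_dynamic(direction_sequences, n_clusters=8):
    """Dynamic segments: hierarchical clustering on pairwise DTW distances."""
    n = len(direction_sequences)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(direction_sequences[i],
                                                   direction_sequences[j])
    Z = linkage(squareform(dist), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```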
4. Conversion of Annotations to Phonetic
Transcriptions
There has been little progress in the area of phonetic
modeling for the purposes of SL recognition since the work
of Vogler and Metaxas [11]. It is possible that the lack of
widely available phonetic transcriptions in sign language
corpora has contributed to this state of affairs. Because
of the level of detail required, such transcriptions are time-
consuming to produce and involve a steep learning curve.
In this paper, we propose a different approach that con-
sists of generating annotations that are merely detailed
enough to reproduce the sign, and having the computer con-
vert these to the full phonetic structure. This approach has
the advantage that it takes far less time and human train-
ing to produce the annotations. A disadvantage, however, is
that such annotations make assumptions that require com-
plex inferences by the conversion code. Describing such
inferences in detail is beyond the scope of this paper; in the
following we give a general overview of the method.
Like in the work by Vogler and Metaxas, the basic pho-
netic structure of a sign is a sequence of segments, which we
model according to Johnson’s and Liddell’s recent work on
the Posture-Detention-Transition-Steady Shift (PDTS) sys-
tem [6]. It supersedes the older Movement-Hold model [7]
used in earlier work, and fixes many of its shortcomings.¹
In this system, each sign can be considered as a sequence
of key points in the form of postures (P), with associated
hand configuration and location information. Transitions
(T) correspond to hand movements between the key points,
with attached trajectory information. Detentions (D) are
like P, but the hand is held stationary; steady shifts are like
T, but with a slow, deliberate movement; in this paper we
distinguish only among P, D and T. In addition, we con-
sider epenthesis movements (E) [7] to be distinct from T;
the former are transitions between two locations without an
explicit path, and primarily occur when the hands move into
position between signs, and during repeated movements.
An example of the basic structure of the sign for PILE (E P T P T P E) is shown in Fig. 2 and Table 1.
The annotations of the signs are coded in HamNoSys [9],
a symbolic annotation system that can describe a sign in
sufficient detail to display it in an animated avatar. It mod-
els signs as clusters of handshape, orientation, location,
and movement, without explicit segmentation information,
which makes it unsuitable for direct application to recog-
nition systems. HamNoSys’s philosophy is minimalist, in
the sense that it avoids redundancy and strives to describe a
sign in detail with as few symbols as possible. To this end, it
provides symmetry and repetition operators, and describes
only how a sign’s configuration changes over time. As an
example consider the first part of the sign for PILE:
[HamNoSys symbol string, not reproduced]
This annotation says that the hands move symmetrically,
so it needs to provide only the hand configuration and loca-
tion for the right hand, and the fact that the fingers of both
hands touch each other. In contrast, the left hand’s informa-
tion (mirrored along the x axis) is implied.
In order to model signs properly in the recognition sys-
tem, we require that all information, according to the PDTS
system, is made explicit for every segment; that is, Ps and
Ds contain the full information on hand configuration and
location, and Ts contain the full information on movement
trajectories, for each hand respectively. Our conversion
method from HamNoSys to the PDTS structure resolves the
implied parts, and splits each sign into its constituent seg-
ments. The key step consists of accumulating deltas, which
¹ Specifically, movements no longer have attached location information, which previously had prevented a direct adaptation to recognition systems. In addition, there is a strict alternation of P/D with T/S, whereas the older model could have sequences of movements without intervening holds.
Table 1. Phonetic PDTS labels of the corresponding sub-units for the sign “PILE” (location and trajectories only).
Frames  Type  PDTS label
1:12    E     rest-position  location-head
13:13   P     location-head
14:25   T     directedmotion, curve-r, direction-o, second-direction-do, tense-true
26:27   P     location-torso, side=right beside
28:50   T     directedmotion, direction-dr, small
51:51   P     location-torso, side=right beside down
52:66   E     location-torso, side=right beside down  rest-position
describe how a posture or transition has changed with re-
spect to a prototype. These are then applied in a specific
order. Note that this process also works for independent
channels of information, such as hand configuration versus
location, dominant hand versus nondominant hand, and so
on, and provides relative timings of segments across chan-
nels; however, the details are beyond the scope of this paper.
Further examples of PDTS sequences can be found in
Tables 1 and 2. The details of the conversion are beyond the
scope of this paper, due to space limitations, and will be
published separately.
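As an illustration of what the conversion produces, a PDTS sequence can be held as an ordered list of typed, labeled segments; the sketch below encodes the PILE example from Table 1 (the segment labels come from that table, while the dataclass itself and the arrow notation are only an assumed representation).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class PDTSSegment:
    seg_type: str                      # one of "P", "D", "T", "E"
    label: str                         # PDTS label (location or movement description)
    frames: Optional[Tuple[int, int]] = None   # filled in only after alignment (Sec. 5)

# The sign PILE as a structured PDTS sequence (labels from Table 1);
# frame boundaries are unknown until the statistical alignment step.
PILE: List[PDTSSegment] = [
    PDTSSegment("E", "rest-position -> location-head"),
    PDTSSegment("P", "location-head"),
    PDTSSegment("T", "directedmotion, curve-r, direction-o, second-direction-do, tense-true"),
    PDTSSegment("P", "location-torso, side=right beside"),
    PDTSSegment("T", "directedmotion, direction-dr, small"),
    PDTSSegment("P", "location-torso, side=right beside down"),
    PDTSSegment("E", "location-torso, side=right beside down -> rest-position"),
]
```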
5. Phonetics-Based Sub-units, Training, Alignment and Recognition
In the previous section we have covered our first main
contribution. Our second main contribution consists of in-
corporating the phonetic labels into a statistical recogni-
tion system. The data-driven-only sub-units from Section 3,
without any phonetic information, adapt well to specific fea-
ture spaces. However, they produce meaningless sub-unit
labels, which cannot be exploited for interdisciplinary sign
language processing (e.g., synthesis, linguistics).
We call the process of incorporating the phonetic in-
formation “Phonetic Sub-unit Construction for Recognition”. This is the first time that the following are taken into
account in an automatic, statistical, and systematic way:
(1) phonetic transcriptions of SL, provided as described in
the previous section by the PDTS system, and (2) the corre-
sponding underlying visual data and features from process-
ing the video data and the feature extraction. This process
involves the following steps: (1) phonetic sub-unit
construction and training, (2) phonetic label alignment and
segmentation, (3) lexicon construction, and (4) recognition.
5.1. Phonetic Sub-Unit Model Training
For each phonetic label provided by the PDTS system,
and the features from the visual processing, we train one
sub-unit HMM. These sub-units have both phonetic labels
from the PDTS structure, and statistical parameters stemming from the data-driven models, as a result of the training step. An example is illustrated in Table 1, which lists the sequence of phonetic labels for the sign “PILE”.

(a) E (b) P (c) T (d) P (e) T (f) P (g) E
Figure 2. Sign for PILE: Segments after incorporation of PDTS phonetic labels into Phonetic Sub-unit Construction, Training and Alignment. Superimposed start and end frames of each sequence of segments, accompanied by an arrow for transitions and epenthesis. Each segment corresponds to a single phonetic label. PDTS segment labels are of type Epenthesis (E), Posture (P), or Transition (T).
We use different HMM parameters for each type of sub-
unit. Distinguishing between movements (T/E) and pos-
tures/detentions (P/D) corresponds to making a distinction
between dynamic and static segments, as described in Sec-
tion 3. This also is consistent with the concepts in the old
Movement-Hold model [7]. For T and E, we employ a 6-state and a 3-state Bakis HMM topology, respectively. For P and D, we use a 1-state HMM and a 2-state left-right HMM, respectively. One mixture with a diagonal covariance matrix was employed for each HMM. We initialize the pho-
netic sub-unit models in a uniform way with a flat-start pro-
cedure using the global mean and covariance of the feature
space, and employ embedded training on strings of concate-
nated sub-unit models with unsegmented data.
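The following numpy sketch illustrates the flat-start initialization just described: type-dependent state counts with a Bakis (left-to-right) transition structure, and every state initialized with the global mean and diagonal covariance of the feature space. The self-loop probability, the skip transitions, and the dictionary layout are assumptions; the embedded re-estimation over concatenated models is not shown.

```python
import numpy as np

STATES_PER_TYPE = {"T": 6, "E": 3, "P": 1, "D": 2}   # state counts from Section 5.1

def flat_start_hmm(seg_type, features, self_loop=0.6):
    """Initialize one sub-unit HMM: Bakis transitions + global mean/diag covariance."""
    n = STATES_PER_TYPE[seg_type]
    # Left-to-right (Bakis) transition matrix: self-loop, step to the next state,
    # and a skip of one state where possible (skip/self-loop values are assumptions).
    A = np.zeros((n, n))
    for i in range(n):
        targets = [j for j in (i, i + 1, i + 2) if j < n]
        A[i, targets] = [self_loop] + \
            [(1.0 - self_loop) / max(len(targets) - 1, 1)] * (len(targets) - 1)
        A[i] /= A[i].sum()
    # Flat start: every state shares the global statistics of the feature space.
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6        # diagonal covariance
    means = np.tile(mean, (n, 1))
    covs = np.tile(var, (n, 1))
    start = np.zeros(n)
    start[0] = 1.0
    return {"startprob": start, "transmat": A, "means": means, "covars": covs}

# Example: initialize a Transition (T) model from pooled 5-D features.
if __name__ == "__main__":
    feats = np.random.randn(1000, 5)
    hmm = flat_start_hmm("T", feats)
    print(hmm["transmat"].shape, hmm["means"].shape)   # (6, 6) (6, 5)
```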
5.2. Alignment and Time Segmentation
We concatenate the trained HMMs into a recognition
network and decode each feature sequence via the Viterbi
algorithm. This results in a sequence of phonetic PDTS
labels, together with their respective starting and ending
frames. Doing this for all sequences results in a lexicon
with segmentation boundaries for each PDTS label.
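Decoding yields one sub-unit label per frame; collapsing consecutive identical labels gives the per-label start and end frames that populate the lexicon. A small sketch follows (the label strings are illustrative shorthand, not the paper's exact PDTS labels):

```python
from itertools import groupby

def path_to_segments(frame_labels):
    """Collapse a per-frame decoded label path into (label, start, end) segments.

    frame_labels: list with one PDTS sub-unit label per frame, e.g. the Viterbi
    output mapped from states back to their sub-unit models.
    """
    segments, t = [], 0
    for label, run in groupby(frame_labels):
        n = len(list(run))
        segments.append((label, t, t + n - 1))   # inclusive frame boundaries
        t += n
    return segments

# Example: a 10-frame decoding E E E P P T T T T P
print(path_to_segments(["E"] * 3 + ["P"] * 2 + ["T"] * 4 + ["P"]))
# [('E', 0, 2), ('P', 3, 4), ('T', 5, 8), ('P', 9, 9)]
```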
We recognize signs by decoding unseen test data in the
HMM network on the PDTS label level. We evaluate the
accuracy on the sign level, based on the lexicon above.
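Sign-level evaluation then reduces to matching the decoded label sequence against the lexicon entries. The sketch below uses edit distance as the matching criterion, which is only an assumption (the paper does not specify its matching rule), and the toy lexicon with shorthand labels is purely illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

def recognize(decoded_labels, lexicon):
    """Return the lexicon sign whose PDTS label sequence is closest to the decoding.

    lexicon: dict mapping sign gloss -> list of PDTS labels.
    """
    return min(lexicon, key=lambda sign: edit_distance(decoded_labels, lexicon[sign]))

# Example with a toy lexicon of two signs (shorthand labels, hypothetical).
lexicon = {
    "PILE":   ["E-to-head", "P-head", "T-curve", "P-torso", "T-down-right", "P-torso-low", "E-rest"],
    "EUROPE": ["E-to-head", "P-head", "T-circular", "P-head", "E-rest"],
}
print(recognize(["E-to-head", "P-head", "T-circular", "P-head", "E-rest"], lexicon))  # EUROPE
```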
Fig. 2 shows an example of the segmentation acquired
during the decoding, which illustrates the sequence of pho-
netic sub-units for the above-mentioned sign for “PILE”.
Each image corresponds to a phonetic PDTS segment pro-
duced by the decoding. For visualization, we adopt the fol-
lowing conventions: (1) For T and E segments, we superim-
pose their respective initial and final frames. We also high-
light specific movement trajectories with an arrow from the
initial to the final hand position in the respective segment.
(2) For P and D segments, we show only the first frame of
the segment, as the hand does not move within them. In ad-
dition, the labels corresponding to this sign, along with the
segmentation boundaries, are listed in Table 1.
5.3. Phonetic Sub-Units Results
Figs. 3 and 4 show examples of movement-based sub-
units (T and E), using x and y coordinates mapped from
the signing space. For the corresponding phonetic labels
see Table 2. Fig. 3(a) shows a common epenthesis sub-unit
(E-to-head). It models the movement from the rest position
to the head, a common starting posture. Fig. 3(b) corre-
sponds to a circular transition sub-unit (T-circular). An in-
dicative sign that contains this sub-unit is “EUROPE” (see
Fig. 1(c)). Fig. 3(c) and 3(d) depict directed transition sub-
units (T-down-right, T-in-left) with right-down and left di-
rections, respectively. Representative signs are “PILE” and
“IMMEDIATELY”, respectively (see Fig. 1(a), 1(b)).
In Fig. 4 we show results for the P and D sub-units, with
the actual coordinates for four different postures superim-
posed in different colors. P-head, P-stomach, P-shoulder
and P-head-top correspond to locations at the signer’s head,
stomach, shoulder and top of head, respectively.
In all these figures, there are cases of compact phonetic
sub-units with less variance, of sparsely populated ones
(i.e., few available data), and some that contain outliers.
For instance, the sub-unit P-head-top is compact, but has
few data. In contrast, P-head has more data and increased
variance. The sub-unit for the initial transition from the
rest posture to the starting position occurs in many signs,
whereas other sub-units may occur in only a single sign.
Outliers and high variances seem to be caused by visual pro-
cessing inaccuracies (we perform 2D, rather than 3D, pro-
cessing), tracking or parameter estimation errors, or human
annotator errors, or actual data exhibiting such properties.
6. Sign Language Recognition Experiments
The recognition task in this paper was conducted on one
signer and 961 out of the 1046 signs. Approximately half of
the missing 85 signs share the same pronunciation with an-
other sign, and thus are the same for recognition purposes,
while the other half were eliminated due to unacceptably
poor tracking or poor segmentation of the five repetitions
[Figure 3 plots: (a) E-to-head, (b) T-circular, (c) T-down-right, (d) T-in-left; axes are x and y coordinates in the signing space.]
Figure 3. Sub-units after Phonetic Sub-unit Construction, Training and Alignment. (a) corresponds to an epenthesis sub-unit (E-to-head)
and (b-d) to transition sub-units (T-circular, T-down-right, T-in-left). Trajectories are illustrated in the real signing space normalized wrt.
their initial position (x,y) = (0,0). Red marker indicates trajectories’ start position. See Table 2 for the corresponding phonetic labels.
[Figure 5 plots: (a) Sign Accuracy % vs. Number of Signs, (b) Number of Sub-units vs. Number of Signs, (c) Sign Accuracy % vs. # Clusters in Dynamic Segments; curves for DD and P.]
Figure 5. Comparison of Data-Driven (DD) sub-units without phonetic evidence vs. the Phonetics-based approach (P). (a) Sign accuracy, (b) number of sub-units. In (a,b) the x-axis corresponds to the variation in the number of signs. (c) For the maximum number of signs, sign accuracy as affected by the number of sub-units (x-axis) in the DD case; in the Phonetic approach the number of sub-units is predefined.
[Figure 4 plot: posture sub-unit data superimposed in the signing space; legend: head, stomach, shoulder, head-top.]
Figure 4. Sub-units after Phonetic Sub-unit Construction, Training and Alignment. Data for multiple posture phonetic sub-units superimposed in the signing space, indicating their relative position to the signer. Sub-units with multiple colored pixels are: P-forehead, P-stomach, P-shoulder, P-head-top. The legend shows the primary locations of the corresponding phonetic labels (see also Table 2).
into individual signs. The data were split randomly into four
training examples and one testing example per sign, which
was the same across all experiments. Future work should
expand these experiments to both signers and the full set, as
more tracking results come in and improve.

Table 2. Examples of phonetic sub-units (PSU) and the signs in which they occur. ’*’ corresponds to multiple signs.
PSU           Sign          Type  PDTS Label
E-to-head     *             E     rest-position location-head
T-circular    EUROPE        T     circularmotion, axis=i
T-down-right  PILE          T     directedmotion, direction=dr, small
T-in-left     IMMEDIATELY   T     directedmotion, direction=il, fast=true, halt=true
P-forehead    *             P     location=forehead
P-stomach     *             P     location=stomach
P-shoulder    *             P     location=shouldertop, side=right beside
P-head-top    *             P     location=head-top

The visual processing and feature extraction were conducted as described
in Section 2. The modeling and recognition proceeded, as
described in the previous section. Our evaluation criterion
was the number of correctly recognized signs, via matching
sequences of phonetic labels to the lexicon.
We first compare the two approaches for sub-unit con-
struction, as follows: (1) Data-Driven (DD): Data-driven
sub-unit construction, which does not make use of any pho-
netic transcription labels. (2) Phonetic (P): Phonetics-based
approach which makes use of the PDTS phonetic labels, via
the statistically trained sub-unit models.
Second, we evaluate the relationship between lexicon
References
American Sign Language: The Phonological Base (Journal Article)
A Framework for Recognizing the Simultaneous Aspects of American Sign Language (Journal Article)
A Linguistic Feature Vector for the Visual Interpretation of Sign Language (Book Chapter)
Towards an automatic sign language recognition system using subunits (Journal Article)
Modelling and segmenting subunits for sign language recognition based on hand motion analysis (Journal Article)