Proceedings ArticleDOI

Advances in phonetics-based sub-unit modeling for transcription alignment and sign language recognition

TL;DR: A new symbolic processing approach for converting sign language annotations into structured sequences of labels according to the Posture-Detention-Transition-Steady Shift phonetic model results not only in improved performance compared to pure data-driven approaches, but also in meaningful phonetic sub-unit models that can be further exploited in interdisciplinary sign language analysis.


Advances in Phonetics-based Sub-Unit Modeling for Transcription Alignment
and Sign Language Recognition
Vassilis Pitsikalis and Stavros Theodorakis
School of Electrical and Computer Engineering
National Technical University of Athens
{vpitsik,sth}@cs.ntua.gr
Christian Vogler
Institute for Language and Speech Processing
Athena R.C.
cvogler@ilsp.athena-innovation.gr
Petros Maragos
School of Electrical and Computer Engineering
National Technical University of Athens
maragos@cs.ntua.gr
Abstract
We explore novel directions for incorporating phonetic transcriptions into sub-unit based statistical models for sign language recognition. First, we employ a new symbolic processing approach for converting sign language annotations, based on HamNoSys symbols, into structured sequences of labels according to the Posture-Detention-Transition-Steady Shift phonetic model. Next, we exploit these labels, and their correspondence with visual features, to construct phonetics-based statistical sub-unit models. We also align these sequences, via the statistical sub-unit construction and decoding, to the visual data to extract time boundary information that they would lack otherwise. The resulting phonetic sub-units offer new perspectives for sign language analysis, phonetic modeling, and automatic recognition. We evaluate this approach via sign language recognition experiments on an extended Lemmas Corpus of Greek Sign Language, which results not only in improved performance compared to pure data-driven approaches, but also in meaningful phonetic sub-unit models that can be further exploited in interdisciplinary sign language analysis.
1. Introduction
Phonetic transcriptions are crucial for the performance of
sign language (SL) and speech recognition systems. For the
recognition of SL, which is the primary means of commu-
nication for many deaf people, this has not been practical,
due to the huge level of effort required for creating detailed
phonetic annotations, unlike the case of speech recognition.
Another problem is the lack of appropriate phonetic models
in the area of SL linguistics (although this is changing now).
Thus, data-driven methods have prevailed in recent years.
We propose a novel approach to address these issues. It
is based on two aspects: (1) converting SL annotations into
structured sequential phonetic labels, and (2) incorporating
these labels into a sub-unit-based statistical framework for
training, alignment, and recognition. This framework can
be applied similarly to arbitrary gesture data.
Recent successful data-driven methods include [1, 4, 2,
5, 3, 12, 8]. One employs a linguistic feature vector based
on measured visual features, such as relative hand move-
ments [2]. Another one clusters independent frames via
K-means, and produces “phenones” [1]. Instead of single
frames, [4, 5, 12] cluster sequences of frames on the feature
level, such that they exploit the dynamics inherent to sign
language. Recently, separate features and modeling for dy-
namic vs. static segments have been proposed [8].
These data-driven approaches allow adapting recogni-
tion systems to the concrete feature space, and work even
in the face of insufficiently detailed transcriptions. As men-
tioned before, creating such transcriptions requires an im-
practical amount of effort, unlike phoneme-level transcrip-
tions for speech recognition. Yet, their value is clear: they
simplify adding new words to the lexicon, and allow cap-
turing commonalities across signs. They can also be used
to create meaningful representations of intra-sign segments,
for further linguistic or interdisciplinary processing.
Our approach is based on having annotations in Ham-
NoSys [9], the creation of which requires less effort than
full phonetic descriptions, and incorporating them into a
statistical recognition system. This is conceptually similar
to taking a written word and converting it into its pronunci-
ation in speech recognition, and has hitherto not been pos-
sible for SL recognition. Our first contribution is that we

have developed a parsing system for converting HamNoSys
into structured phonetic sequences of labels, according to
the Posture-Detention-Transition-Steady Shift (PDTS) sys-
tem [6]. However, they do not provide any timing informa-
tion, which leads us to the second contribution: We employ
simple visual tracking features extracted from sign language
videos. Using them in conjunction with the phonetic la-
bels, we construct sub-units via a statistical hidden Markov
model (HMM)-based system, which allows us to align the
PDTS sequences with the visual data segments. The result-
ing output consists of sub-units that are no longer purely
data-driven, in contrast to previous work. Rather, they are
phonetic sub-units, each of which corresponds to a mean-
ingful PDTS label, along with the timing information on
where they occur in the data.
Once the segments have been mapped to their PDTS la-
bels, the output of the recognition system produces phonetic
labels during decoding. Such labels are invaluable in inter-
disciplinary research tasks, such as linguistic analysis and
synthesis. We evaluate the proposed approach by perform-
ing recognition experiments on a new corpus of 1000 Greek
Sign Language lemmata, with promising results.
2. Data, Visual Processing and Overview
Data: The Greek Sign Language (GSL) Lemmas Cor-
pus consists of 1046 isolated signs, 5 repetitions each, from
two native signers (male and female). The videos have a
uniform background and a resolution of 1440x1080 pixels,
recorded at 25 frames per second, interlaced.
Visual Processing: For the segmentation and detection
of the signer’s hands and head in the Greek Sign Lan-
guage (GSL) Lemmas Corpus, we employed a skin color
model utilizing a Gaussian Markov Model (GMM), ac-
companied by morphological processing to enhance skin
detection. Moreover, for tracking we employed forward-
backward linear prediction, and template matching, in or-
der to disambiguate occlusions. The adopted approach is
described in [10]. The extracted feature vector has five com-
ponents, and consists of the planar coordinates of the dom-
inant hand, the instantaneous direction, and the velocity.
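To make the five-component feature vector concrete, the following is a minimal Python sketch of one plausible reading (the function name, the unit-vector encoding of direction, and the normalization are our assumptions, not the paper's specification):

```python
import numpy as np

def movement_features(xy, fps=25.0):
    """Hypothetical five-component feature vector: planar coordinates
    of the dominant hand (2), instantaneous direction as a unit
    vector (2), and speed (1). `xy` is a (T, 2) array of tracked
    dominant-hand positions, one row per video frame."""
    disp = np.diff(xy, axis=0, prepend=xy[:1])           # frame-to-frame displacement
    dist = np.linalg.norm(disp, axis=1)                  # displacement magnitude
    direction = disp / np.maximum(dist, 1e-6)[:, None]   # unit direction vector
    speed = dist * fps                                   # e.g. pixels per second
    return np.hstack([xy, direction, speed[:, None]])    # shape (T, 5)
```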
Overview: In the following, we adopt the Greek signs
for PILE, IMMEDIATELY, and EUROPE as examples from
the corpus. Figure 1 shows the initial and end frames of
each sign superimposed. The arrows illustrate the move-
ments of the hands between the frames. In the next sections
we present details on the articulation of these signs via rep-
resentative examples alongside the contributions.
3. Data-Driven Sub-Units without Phonetic
Evidence for Recognition
Our data-driven approach is based on the work in [8].
Other previous approaches include [1, 4, 5].

Figure 1. Overview of articulation for three selected GSL signs: (a) PILE, (b) IMMEDIATELY, (c) EUROPE.

We segment signs automatically and construct data-driven sub-units, which are the primitive segments that are used to
construct all signs that share similar articulation parame-
ters. Based on simple movement-related measurements for
the dominant hand, the first step for sub-unit construction
involves the unsupervised partitioning of the segments into two groups with respect to their movement dynamics: for each sign unit, a model-based process finds the segmentation points and assigns them the label “static” or “dynamic”.
For the second step, the sub-unit construction (i.e., the
statistical modeling and the features employed for the static
or dynamic segments) depends on the assigned label: For
static segments, we employ K-means for clustering based
on their position. For dynamic segments, we employ hier-
archical clustering based on their DTW distances wrt. the
instantaneous direction. Thus, after clustering we end up
with a lexicon, where each sign consists of a sequence of
dynamic and static sub-units. The characteristics of the ap-
proach above imply a sequential structure of dynamic and
static segments that are explicitly accounted for by the pro-
posed sub-unit construction and statistical modeling.
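As a rough sketch of this two-step construction, the Python fragment below clusters static segments by mean hand position with K-means and dynamic segments by average-linkage hierarchical clustering over pairwise DTW distances of their direction sequences (helper names, cluster counts, and the linkage choice are our assumptions, not the paper's settings):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

def dtw_distance(a, b):
    """Plain dynamic-time-warping distance between two feature
    sequences (e.g. instantaneous-direction vectors)."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

def build_subunits(static_segs, dynamic_segs, k_static=50, k_dynamic=200):
    # Static segments: cluster the mean (x, y) hand position.
    positions = np.array([seg.mean(axis=0) for seg in static_segs])
    static_labels = KMeans(n_clusters=k_static, n_init=10).fit_predict(positions)

    # Dynamic segments: hierarchical clustering on pairwise DTW distances.
    n = len(dynamic_segs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(dynamic_segs[i], dynamic_segs[j])
    condensed = dist[np.triu_indices(n, k=1)]        # SciPy expects condensed form
    dynamic_labels = fcluster(linkage(condensed, method="average"),
                              t=k_dynamic, criterion="maxclust")
    return static_labels, dynamic_labels
```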
4. Conversion of Annotations to Phonetic
Transcriptions
There has been little progress in the area of phonetic
modeling for the purposes of SL recognition since the work
of Vogler and Metaxas [11]. It is possible that the lack of
widely available phonetic transcriptions in sign language
corpora has contributed to this state of affairs. Because
of the level of detail required, such transcriptions are time-
consuming to produce and involve a steep learning curve.
In this paper, we propose a different approach that con-
sists of generating annotations that are merely detailed
enough to reproduce the sign, and having the computer con-
vert these to the full phonetic structure. This approach has
the advantage that it takes far less time and human train-
ing to produce the annotations. A disadvantage, however, is
that such annotations make assumptions that require com-
plex inferences by the conversion code. Describing such
inferences in detail is beyond the scope of this paper; in the

following we give a general overview of the method.
Like in the work by Vogler and Metaxas, the basic pho-
netic structure of a sign is a sequence of segments, which we
model according to Johnson’s and Liddell’s recent work on
the Posture-Detention-Transition-Steady Shift (PDTS) sys-
tem [6]. It supersedes the older Movement-Hold model [7]
used in earlier work, and fixes many of its shortcomings.¹
In this system, each sign can be considered as a sequence
of key points in the form of postures (P), with associated
hand configuration and location information. Transitions
(T) correspond to hand movements between the key points,
with attached trajectory information. Detentions (D) are like P, but the hand is held stationary; steady shifts are like T, but with a slow, deliberate movement; in this paper we
distinguish only among P, D and T. In addition, we con-
sider epenthesis movements (E) [7] to be distinct from T;
the former are transitions between two locations without an
explicit path, and primarily occur when the hands move into
position between signs, and during repeated movements.
An example of the basic structure of the sign for PILE, EPTPTPE, is shown in Fig. 2 and Table 1.
The annotations of the signs are coded in HamNoSys [9],
a symbolic annotation system that can describe a sign in
sufficient detail to display it in an animated avatar. It mod-
els signs as clusters of handshape, orientation, location,
and movement, without explicit segmentation information,
which makes it unsuitable for direct application to recog-
nition systems. HamNoSys’s philosophy is minimalist, in
the sense that it avoids redundancy and strives to describe a
sign in detail with as few symbols as possible. To this end, it
provides symmetry and repetition operators, and describes
only how a sign’s configuration changes over time. As an example consider the first part of the sign for PILE:

[HamNoSys symbol string for PILE]
This annotation says that the hands move symmetrically,
so it needs to provide only the hand configuration and loca-
tion for the right hand, and the fact that the fingers of both
hands touch each other. In contrast, the left hand’s informa-
tion (mirrored along the x axis) is implied.
In order to model signs properly in the recognition sys-
tem, we require that all information, according to the PDTS
system, is made explicit for every segment; that is, Ps and
Ds contain the full information on hand configuration and
location, and Ts contain the full information on movement
trajectories, for each hand respectively. Our conversion
method from HamNoSys to the PDTS structure resolves the
implied parts, and splits the signs into their constituent seg-
ments. The key step consists of accumulating deltas, which describe how a posture or transition has changed with respect to a prototype. These are then applied in a specific order. Note that this process also works for independent channels of information, such as hand configuration versus location, dominant hand versus nondominant hand, and so on, and provides relative timings of segments across channels; however, the details are beyond the scope of this paper.

¹ Specifically, movements no longer have attached location information, which previously had prevented a direct adaptation to recognition systems. In addition, there is a strict alternation of P/D with T/S, whereas the older model could have sequences of movements without intervening holds.

Table 1. Phonetic PDTS labels of the corresponding sub-units for the sign “PILE” (location and trajectories only).

Frames  Type  PDTS label
1:12    E     rest-position → location-head
13:13   P     location-head
14:25   T     directedmotion, curve-r, direction-o, second-direction-do, tense-true
26:27   P     location-torso, side=right beside
28:50   T     directedmotion, direction-dr, small
51:51   P     location-torso, side=right beside down
52:66   E     location-torso, side=right beside down → rest-position
Further examples of PDTS sequences can be found in Tables 1 and 2. The details of the conversion are beyond the scope of this paper, due to space limitations, and will be published separately.
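To convey the flavor of the delta-accumulation step, here is a deliberately simplified Python sketch (the token vocabulary, the `Posture` record, and its fields are hypothetical; the real converter also handles orientation, symmetry, repetition, and multiple channels):

```python
import dataclasses
from dataclasses import dataclass

@dataclass(frozen=True)
class Posture:
    location: str
    handshape: str

def to_pdts(tokens, initial):
    """Accumulate deltas over a prototype posture and emit an explicit
    P/T sequence: each movement token becomes a Transition carrying its
    trajectory description, and each attribute delta updates the current
    posture, whose fully resolved state is emitted as the next Posture."""
    current = initial
    segments = [("P", current)]
    for kind, value in tokens:
        if kind == "move":
            segments.append(("T", value))
        else:  # a delta such as "location", applied wrt. the current posture
            current = dataclasses.replace(current, **{kind: value})
            segments.append(("P", current))
    return segments

# First part of a PILE-like sign: P (at head) -> T (curved motion) -> P (at torso).
pdts = to_pdts(
    [("move", "directedmotion, curve-r"), ("location", "torso, right-beside")],
    Posture(location="head", handshape="flat"),
)
```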
5. Phonetic-Based Sub-units, Training, Alignment and Recognition
In the previous section we have covered our first main
contribution. Our second main contribution consists of in-
corporating the phonetic labels into a statistical recogni-
tion system. The data-driven-only sub-units from Section 3,
without any phonetic information, adapt well to specific fea-
ture spaces. However, they produce meaningless sub-unit
labels, which cannot be exploited for interdisciplinary sign
language processing (e.g., synthesis, linguistics).
We call the process of incorporating the phonetic in-
formation “Phonetic Sub-unit Construction for Recognition”. This is the first time that the following are taken into
account in an automatic, statistical, and systematic way:
(1) phonetic transcriptions of SL, provided as described in
the previous section by the PDTS system, and (2) the corre-
sponding underlying visual data and features from process-
ing the video data and the feature extraction. The process involves: (1) phonetic sub-unit construction and training, (2) phonetic label alignment and segmentation, (3) lexicon construction, and (4) recognition.
5.1. Phonetic Sub-Unit Model Training
For each phonetic label provided by the PDTS system,
and the features from the visual processing, we train one
sub-unit HMM. These sub-units have both phonetic labels
from the PDTS structure, and statistical parameters stemming from the data-driven models, as a result of the training step. An example is illustrated in Table 1, which lists the sequence of phonetic labels for the sign “PILE”.

Figure 2. Sign for PILE: segments (a) E, (b) P, (c) T, (d) P, (e) T, (f) P, (g) E after incorporation of PDTS phonetic labels into Phonetic Sub-unit Construction, Training and Alignment. Superimposed start and end frames of each sequence of segments, accompanied by an arrow for transitions and epenthesis. Each segment corresponds to a single phonetic label. PDTS segment labels are of type Epenthesis (E), Posture (P), Transition (T).
We use different HMM parameters for each type of sub-
unit. Distinguishing between movements (T/E) and pos-
tures/detentions (P/D) corresponds to making a distinction
between dynamic and static segments, as described in Sec-
tion 3. This also is consistent with the concepts in the old
Movement-Hold model [7]. For T and E, we employ a 6-state and a 3-state Bakis HMM topology, respectively. For P and D, we use a 1-state HMM and a 2-state left-right HMM, respectively. One mixture and a diagonal covariance matrix were employed for each HMM. We initialize the pho-
netic sub-unit models in a uniform way with a flat-start pro-
cedure using the global mean and covariance of the feature
space, and employ embedded training on strings of concate-
nated sub-unit models with unsegmented data.
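A minimal sketch of these per-type topologies and the flat-start initialization might look as follows in Python (helper names are hypothetical; whether the Bakis variant allows exactly one-state skips is our assumption, and the embedded re-estimation itself is not shown):

```python
import numpy as np

# States per PDTS segment type, as given in the text above.
N_STATES = {"T": 6, "E": 3, "P": 1, "D": 2}

def left_to_right_transmat(n, max_jump):
    """Left-to-right transition matrix: max_jump=2 yields a Bakis
    topology (one-state skips allowed), max_jump=1 a strict
    left-right chain."""
    A = np.zeros((n, n))
    for i in range(n):
        js = list(range(i, min(i + max_jump + 1, n)))
        A[i, js] = 1.0 / len(js)   # uniform over self-loop and forward moves
    return A

def flat_start_hmm(segment_type, global_mean, global_cov):
    """Flat start: every state of every sub-unit HMM begins with the
    global feature mean and the diagonal of the global covariance;
    embedded training on concatenated label strings then refines them."""
    n = N_STATES[segment_type]
    max_jump = 2 if segment_type in ("T", "E") else 1
    return {
        "transmat": left_to_right_transmat(n, max_jump),
        "means": np.tile(global_mean, (n, 1)),           # one Gaussian per state
        "covars": np.tile(np.diag(global_cov), (n, 1)),  # diagonal covariance
    }
```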
5.2. Alignment and Time Segmentation
We concatenate the trained HMMs into a recognition
network and decode each feature sequence via the Viterbi
algorithm. This results in a sequence of phonetic PDTS
labels, together with their respective starting and ending
frames. Doing this for all sequences results in a lexicon
with segmentation boundaries for each PDTS label.
We recognize signs by decoding unseen test data in the
HMM network on the PDTS label level. We evaluate the
accuracy on the sign level, based on the lexicon above.
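The sketch below illustrates this decode-then-match evaluation (Python; `viterbi_decode` stands in for the Viterbi pass over the concatenated sub-unit network, and scoring lexicon entries by edit distance is our assumption about how decoded label strings are matched, not a detail given in the paper):

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    row = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, y in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1,       # deletion
                                       row[j - 1] + 1,   # insertion
                                       prev + (x != y))  # substitution / match
    return row[-1]

def recognize(features, lexicon, viterbi_decode):
    """Decode one feature sequence into PDTS labels with frame
    boundaries, then return the lexicon sign whose PDTS label
    sequence best matches the decoded one."""
    aligned = viterbi_decode(features)   # [(label, start_frame, end_frame), ...]
    decoded = tuple(label for label, start, end in aligned)
    sign = min(lexicon, key=lambda s: edit_distance(decoded, lexicon[s]))
    return sign, aligned
```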
Fig. 2 shows an example of the segmentation acquired
during the decoding, which illustrates the sequence of pho-
netic sub-units for the above-mentioned sign for “PILE”.
Each image corresponds to a phonetic PDTS segment pro-
duced by the decoding. For visualization, we adopt the fol-
lowing conventions: (1) For T and E segments, we superim-
pose their respective initial and final frames. We also high-
light specific movement trajectories with an arrow from the
initial to the final hand position in the respective segment.
(2) For P and D segments, we show only the first frame of
the segment, as the hand does not move within them. In ad-
dition, the labels corresponding to this sign, along with the
segmentation boundaries, are listed in Table 1.
5.3. Phonetic Sub-Units Results
Figs. 3 and 4 show examples of movement-based sub-units (T and E), using x and y coordinates mapped from the signing space. For the corresponding phonetic labels see Table 2. Fig. 3(a) shows a common epenthesis sub-unit (E-to-head). It models the movement from the rest position to the head, a common starting posture. Fig. 3(b) corresponds to a circular transition sub-unit (T-circular). An indicative sign that contains this sub-unit is “EUROPE” (see Fig. 1(c)). Figs. 3(c) and 3(d) depict directed transition sub-units (T-down-right, T-in-left) with right-down and left directions respectively. Representative signs are “PILE” and “IMMEDIATELY”, respectively (see Fig. 1(a), 1(b)).
In Fig. 4 we show results for the P and D sub-units, with
the actual coordinates for four different postures superim-
posed in different colors. P-head, P-stomach, P-shoulder
and P-head-top correspond to locations at the signer’s head,
stomach, shoulder and top of head, respectively.
In all these figures, there are cases of compact phonetic
sub-units with less variance, of sparsely populated ones
(i.e., few available data), and some that contain outliers.
For instance, the sub-unit P-head-top is compact, but has
few data. In contrast, P-head has more data and increased
variance. The sub-unit for the initial transition from the
rest posture to the starting position occurs in many signs,
whereas other sub-units may occur in only a single sign.
Outliers and high variances seem to be caused by visual pro-
cessing inaccuracies (we perform 2D, rather than 3D, pro-
cessing), tracking or parameter estimation errors, or human
annotator errors, or actual data exhibiting such properties.
6. Sign Language Recognition Experiments
The recognition task in this paper was conducted on one
signer and 961 out of the 1046 signs. Approximately half of
the missing 85 signs share the same pronunciation with an-
other sign, and thus are the same for recognition purposes, while the other half were eliminated due to unacceptably poor tracking or poor segmentation of the five repetitions into individual signs.
Figure 3. Sub-units after Phonetic Sub-unit Construction, Training and Alignment: (a) an epenthesis sub-unit (E-to-head); (b-d) transition sub-units (T-circular, T-down-right, T-in-left). Trajectories are illustrated in the real signing space, normalized wrt. their initial position (x, y) = (0, 0); a red marker indicates each trajectory's start position. See Table 2 for the corresponding phonetic labels.
Figure 5. Comparison of Data-Driven (DD) sub-units without phonetic evidence vs. the Phonetics-based approach (P): (a) sign accuracy and (b) number of sub-units, with the x-axis varying the number of signs; (c) for the maximum number of signs, sign accuracy as affected by the number of sub-units (x-axis) in the DD case; in the Phonetic approach the number of sub-units is predefined.
Figure 4. Sub-units after Phonetic Sub-unit Construction, Training and Alignment. Data for multiple posture phonetic sub-units are superimposed in the signing space, indicating their relative position to the signer: P-forehead, P-stomach, P-shoulder, P-head-top. The legend shows the primary locations of the corresponding phonetic labels (see also Table 2).
The data were split randomly into four training examples and one testing example per sign, which was the same across all experiments. Future work should expand these experiments to both signers and the full set, as more tracking results come in and improve.
Table 2. Examples of phonetic sub-units (PSU) and the signs where they occur. ‘*’ corresponds to multiple signs.

PSU           Sign          Type   PDTS label
E-to-head     *             E      rest-position → location-head
T-circular    EUROPE        T      circularmotion, axis=i
T-down-right  PILE          T      directedmotion, direction=dr, small
T-in-left     IMMEDIATELY   T      directedmotion, direction=il, fast=true, halt=true
P-forehead    *             P      location=forehead
P-stomach     *             P      location=stomach
P-shoulder    *             P      location=shouldertop, side=right beside
P-head-top    *             P      location=head-top
The visual processing and feature extraction were conducted as described in Section 2. The modeling and recognition proceeded as described in the previous section. Our evaluation criterion
was the number of correctly recognized signs, via matching
sequences of phonetic labels to the lexicon.
We first compare the two approaches for sub-unit con-
struction, as follows: (1) Data-Driven (DD): Data-driven
sub-unit construction, which does not make use of any pho-
netic transcription labels. (2) Phonetic (P): Phonetics-based
approach which makes use of the PDTS phonetic labels, via
the statistically trained sub-unit models.
Second, we evaluate the relationship between lexicon

Citations
More filters
Journal ArticleDOI
TL;DR: This work presents a statistical recognition approach performing large vocabulary continuous sign language recognition across different signers, and is the first time system design on a large data set with true focus on real-life applicability is thoroughly presented.

309 citations


Cites background from "Advances in phonetics-based sub-uni..."

  • ...[55] extract sub-unit definitions from linguistic annotation in HamNoSys [31] to improve an HMM-based system recognising...


01 Jan 2017
TL;DR: In this paper, sign language recognition using linguistic sub-units is discussed, which includes those learned from appearance data as well as those inferred from both 2D or 3D tracking data.
Abstract: This paper discusses sign language recognition using linguistic sub-units. It presents three types of sub-units for consideration; those learnt from appearance data as well as those inferred from both 2D or 3D tracking data. These sub-units are then combined using a sign level classifier; here, two options are presented. The first uses Markov Models to encode the temporal changes between sub-units. The second makes use of Sequential Pattern Boosting to apply discriminative feature selection at the same time as encoding temporal information. This approach is more robust to noise and performs well in signer independent tests, improving results from the 54% achieved by the Markov Chains to 76%.

146 citations

Book ChapterDOI
TL;DR: This paper discusses sign language recognition using linguistic sub-units, presenting three types of sub- units for consideration; those learnt from appearance data as well as those inferred from both 2D or 3D tracking data.
Abstract: This paper discusses sign language recognition using linguistic sub-units. It presents three types of sub-units for consideration; those learnt from appearance data as well as those inferred from both 2D or 3D tracking data. These sub-units are then combined using a sign level classifier; here, two options are presented. The first uses Markov Models to encode the temporal changes between sub-units. The second makes use of Sequential Pattern Boosting to apply discriminative feature selection at the same time as encoding temporal information. This approach is more robust to noise and performs well in signer independent tests, improving results from the 54% achieved by the Markov Chains to 76%.

135 citations


Cites background from "Advances in phonetics-based sub-uni..."

  • ...Future Work The learnt sub-units show promise and, as shown by the work of Pitsikalis et al. (2011), there are several avenues which can be explored....


  • ...Using the motion of the hands, the sign can be split into its component parts (as in Pitsikalis et al., 2011), that are then aligned with the sign annotations....


  • ...The learnt sub-units show promise and, as shown by the work of Pitsikalis et al. (2011), there are several avenues which can be explored....


  • ...Cooper and Bowden (2010) learnt linguistic sub-units from hand annotated data which they combined with Markov models to create sign level classifiers, while Pitsikalis et al. (2011) presented a method which incorporated phonetic transcriptions into sub-unit based statistical models....


Proceedings ArticleDOI
16 Jun 2012
TL;DR: This paper presents a novel, discriminative, multi-class classifier based on Sequential Pattern Trees that is efficient to learn, and scalable for use with large classifier banks, well suited to Sign Language Recognition.
Abstract: This paper presents a novel, discriminative, multi-class classifier based on Sequential Pattern Trees. It is efficient to learn, compared to other Sequential Pattern methods, and scalable for use with large classifier banks. For these reasons it is well suited to Sign Language Recognition. Using deterministic robust features based on hand trajectories, sign level classifiers are built from sub-units. Results are presented both on a large lexicon single signer data set and a multi-signer Kinect™ data set. In both cases it is shown to outperform the non-discriminative Markov model approach and be equivalent to previous, more costly, Sequential Pattern (SP) techniques.

74 citations


Cites methods from "Advances in phonetics-based sub-uni..."

  • ...[9] proposed a method which uses linguistic labelling to split signs into sub-units....


Journal ArticleDOI
TL;DR: This article proposes a covariance matrix-based representation to naturally fuse information from multimodal sources to utilize long-term dynamics over an isolated sign sequence, and demonstrates that the proposed method outperforms the state-of-the-art methods both in accuracy and computational cost.
Abstract: In this article, to utilize long-term dynamics over an isolated sign sequence, we propose a covariance matrix-based representation to naturally fuse information from multimodal sources. To tackle the drawback induced by the commonly used Riemannian metric, the proximity of covariance matrices is measured on the Grassmann manifold. However, the inherent Grassmann metric cannot be directly applied to the covariance matrix. We solve this problem by evaluating and selecting the most significant singular vectors of covariance matrices of sign sequences. The resulting compact representation is called the Grassmann covariance matrix. Finally, the Grassmann metric is used to be a kernel for the support vector machine, which enables learning of the signs in a discriminative manner. To validate the proposed method, we collect three challenging sign language datasets, on which comprehensive evaluations show that the proposed method outperforms the state-of-the-art methods both in accuracy and computational cost.

72 citations


Cites methods from "Advances in phonetics-based sub-uni..."

  • ...…recognition from these sequential observations, hidden state–based methods like the hidden Markov model (HMM) [Liang and Ouhyoung 1996; Starner et al. 1998; Gao et al. 2004; Pitsikalis et al. 2011] and conditional random fields (CRF) [Yang et al. 2009; Kong and Ranganath 2014] were frequently used....


References
More filters
Journal ArticleDOI
TL;DR: In this paper, the authors outline the phonological structures and processes in American Sign Language (ASL) and present a segmental phonetic description system for ASL phonetic segmentation.
Abstract: This paper has the ambitious goal of outlining the phonological structures and processes we have analyzed in American Sign Language (ASL). In order to do this we have divided the paper into five parts. In section 1 we detail the types of sequential phenomena found in the production of individual signs, allowing us to argue that ASL signs are composed of sequences of phonological segments, just as are words in spoken languages. Section 2 provides the details of a segmental phonetic transcription system. Using the descriptions made available by the transcription system, Section 3 briefly discusses both paradigmatic and syntagmatic contrast in ASL signs. Section 4 deals with the various types of phonological processes at work in the language, processes remarkable in their similarity to phonological processes found in spoken languages. We conclude the paper with an overview of the major types of phonological effects of ASL's rich system of morphological processes. We realize that the majority of readers will come to this paper with neither sign language proficiency nor a knowledge of sign language structure. As a result, many will encounter reference to ASL signs without knowing their form. Although we have been unable to illustrate all the examples, we hope we have provided sufficient illustrations to make the paper more accessible.

703 citations


"Advances in phonetics-based sub-uni..." refers background or methods or result in this paper

  • ...In addition, we consider epenthesis movements (E) [7] to be distinct from T; the former are transitions between two locations without an explicit path, and primarily occur when the hands move into position between signs, and during repeated movements....


  • ...This also is consistent with the concepts in the old Movement-Hold model [7]....


  • ...It supersedes the older Movement-Hold model [7] used in earlier work, and fixes many of its shortcomings(1)....


Journal ArticleDOI
TL;DR: This paper presents a novel framework to ASL recognition that aspires to being a solution to the scalability problems, based on breaking down the signs into their phonemes and modeling them with parallel hidden Markov models.

321 citations


"Advances in phonetics-based sub-uni..." refers background in this paper

  • ...There has been little progress in the area of phonetic modeling for the purposes of SL recognition since the work of Vogler and Metaxas [11]....


  • ...Like in the work by Vogler and Metaxas, the basic phonetic structure of a sign is a sequence of segments, which we model according to Johnson’s and Liddell’s recent work on the Posture-Detention-Transition-Steady Shift (PDTS) system [6]....


Book ChapterDOI
11 May 2004
TL;DR: A novel approach to sign language recognition that provides extremely high classification rates on minimal training data using only single instance training outperforming previous approaches where thousands of training examples are required.
Abstract: This paper presents a novel approach to sign language recognition that provides extremely high classification rates on minimal training data. Key to this approach is a 2 stage classification procedure where an initial classification stage extracts a high level description of hand shape and motion. This high level description is based upon sign linguistics and describes actions at a conceptual level easily understood by humans. Moreover, such a description broadly generalises temporal activities naturally overcoming variability of people and environments. A second stage of classification is then used to model the temporal transitions of individual signs using a classifier bank of Markov chains combined with Independent Component Analysis. We demonstrate classification rates as high as 97.67% for a lexicon of 43 words using only single instance training outperforming previous approaches where thousands of training examples are required.

178 citations


"Advances in phonetics-based sub-uni..." refers background in this paper

  • ...Recent successful data-driven methods include [1, 4, 2, 5, 3, 12, 8]....


  • ...One employs a linguistic feature vector based on measured visual features, such as relative hand movements [2]....


Journal Article
TL;DR: In this article, an automatic recognition of German continuous sign language is presented. The statistical approach is based on the Bayes decision rule for minimum error rate, which can be used to reduce the amount of necessary training material.
Abstract: This paper is concerned with the automatic recognition of German continuous sign language. For the most user-friendliness only one single color video camera is used for image recording. The statistical approach is based on the Bayes decision rule for minimum error rate. Following speech recognition system design, which are in general based on subunits, here the idea of an automatic sign language recognition system using subunits rather than models for whole signs will be outlined. The advantage of such a system will be a future reduction of necessary training material. Furthermore, a simplified enlargement of the existing vocabulary is expected. Since it is difficult to define subunits for sign language, this approach employs totally self-organized subunits called fenone. K-means algorithm is used for the definition of such fenones. The software prototype of the system is currently evaluated in experiments.

101 citations

Journal ArticleDOI
TL;DR: This paper attempts to define and segment subunits using computer vision techniques, which also can be basically explained by sign language linguistics and correlates highly with the definition of syllables in sign language while sharing characteristics of syllable in spoken languages.

81 citations


"Advances in phonetics-based sub-uni..." refers background in this paper

  • ...Instead of single frames, [4, 5, 12] cluster sequences of frames on the feature level, such that they exploit the dynamics inherent to sign language....


  • ...Recent successful data-driven methods include [1, 4, 2, 5, 3, 12, 8]....


  • ...Other previous approaches include [1, 4, 5]....


Frequently Asked Questions (8)
Q1. What are the contributions in "Advances in phonetics-based sub-unit modeling for transcription alignment and sign language recognition"?

The authors also align these sequences, via the statistical sub-unit construction and decoding, to the visual data to extract time boundary information that they would lack otherwise. The authors evaluate this approach via sign language recognition experiments on an extended Lemmas Corpus of Greek Sign Language, which results not only in improved performance compared to pure data-driven approaches, but also in meaningful phonetic sub-unit models that can be further exploited in interdisciplinary sign language analysis. 

Outliers and high variances seem to be caused by visual processing inaccuracies (we perform 2D, rather than 3D, processing), tracking or parameter estimation errors, or human annotator errors, or actual data exhibiting such properties. 

The process involves: (1) phonetic sub-unit construction and training, (2) phonetic label alignment and segmentation, (3) lexicon construction, and (4) recognition.

The authors expect that other disciplines, such as linguistics, can greatly benefit from their results for the analysis of sign languages.

For the segmentation and detection of the signer’s hands and head in the Greek Sign Language (GSL) Lemmas Corpus, the authors employed a skin color model utilizing a Gaussian Markov Model (GMM), accompanied by morphological processing to enhance skin detection. 

The annotations of the signs are coded in HamNoSys [9], a symbolic annotation system that can describe a sign in sufficient detail to display it in an animated avatar. 

By increasing the number of signs, the recognition performance for both approaches decreases; this is expected as the recognition task becomes harder. 

Their conversion method from HamNoSys to the PDTS structure resolves the implied parts, and splits the signs into their constituent segments.