
Finding recurrent patterns from continuous sign language sentences for automated extraction of signs

Sunita Nayak, Kester Duncan¹, Sudeep Sarkar¹, Barbara L. Loeding²

University of South Florida¹, Florida Polytechnic University²

01 Jan 2012-Journal of Machine Learning Research (Springer, Cham)-Vol. 13, Iss: 1, pp 2589-2615

TL;DR: In this paper, a probabilistic framework is presented to automatically learn recurring signs from multiple sign language video sequences containing the vocabulary of interest, which is robust to the variations produced by adjacent signs.


Abstract: We present a probabilistic framework to automatically learn models of recurring signs from multiple sign language video sequences containing the vocabulary of interest. We extract the parts of the signs that are present in most occurrences of the sign in context and are robust to the variations produced by adjacent signs. Each sentence video is first transformed into a multidimensional time series representation, capturing the motion and shape aspects of the sign. Skin color blobs are extracted from frames of color video sequences, and a probabilistic relational distribution is formed for each frame using the contour and edge pixels from the skin blobs. Each sentence is represented as a trajectory in a low dimensional space called the space of relational distributions. Given these time series trajectories, we extract signemes from multiple sentences concurrently using iterated conditional modes (ICM). We show results by learning single signs from a collection of sentences with one common pervading sign, multiple signs from a collection of sentences with more than one common sign, and single signs from a mixed collection of sentences. The extracted signemes demonstrate that our approach is robust to some extent to the variations produced within a sign due to different contexts. We also show results whereby these learned sign models are used for spotting signs in test sequences.


Summary (3 min read)


1. Introduction

Most of the existing work in sign language assumes that the training signs are already available and often signs used in the training set are the isolated signs with the boundaries chopped off, or manually selected frames from continuous sentences.
The process is iterated till the parameter values converge to a stable solution.
The authors also extract single signs from a mixed collection of sentences where there is more than one common sign in context.

2. Relational Distributions

The authors use relational distributions to capture the global and relative configuration of the hands and the face in an image.
The authors start from some level of segmentation of the object from the scene.
It captures the global configuration of the low-level primitives.
Figure 3(c) illustrates how motion is captured using relational distributions.
Each bin then counts the pairs of edge pixels between which the horizontal and vertical distances each lie in some fixed range that depends on the location of the bin in the histogram.
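The binning described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `relational_distribution`, the bin count `n_bins`, the distance cap `max_dist`, and the use of absolute (unsigned) distances are choices made here, not values or conventions taken from the paper.

```python
import numpy as np

def relational_distribution(edge_points, n_bins=8, max_dist=64):
    """Normalized 2D histogram of pairwise horizontal/vertical pixel distances.

    edge_points: (N, 2) array-like of (x, y) edge-pixel coordinates.
    n_bins, max_dist and the use of absolute distances are illustrative
    assumptions, not settings reported in the paper.
    """
    pts = np.asarray(edge_points, dtype=float)
    # Distances between every ordered pair of points along each image axis.
    dx = np.abs(pts[:, None, 0] - pts[None, :, 0])
    dy = np.abs(pts[:, None, 1] - pts[None, :, 1])
    # Drop the diagonal so a pixel is never paired with itself.
    off_diag = ~np.eye(len(pts), dtype=bool)
    hist, _, _ = np.histogram2d(
        dx[off_diag], dy[off_diag],
        bins=n_bins, range=[[0, max_dist], [0, max_dist]],
    )
    total = hist.sum()
    # Normalize the counts so the histogram sums to one.
    return hist / total if total > 0 else hist
```

Each row of the result indexes a horizontal-distance range and each column a vertical-distance range, so a bin's location fixes the ranges it counts, as in the description above.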

3. Problem Formulation

Sign language sentences are series of signs.
Figure 4 illustrates the traces of the first vs. second dimension in the feature space, of three sentences S1, S2 and S3 with only one common sign, R, among them.
Table 3 defines the notations that will be used in this paper.
Also note that p(θ) is hard to compute or even sample from because it is computationally expensive to compute the denominator in Equation 2, as it involves the summation over all possible parameter combinations.
In other words, the authors construct a probability density function of the possible starting points and widths in each sentence, given the estimated starting points and widths of the common pattern in all other sentences, that is, f (θi|θ(i)).

3.1 Distance Measure

The distance function d in the above equations needs to be chosen carefully such that it is not biased towards the shorter subsequences.
Here, the authors briefly describe how they compute the distance between two substrings using dynamic time warping.
Let l1 and l2 represent the length of the two substrings and e(i, j) represent the Euclidean distance between the ith data point from the first substring and the jth data point from the second substring.
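A minimal dynamic time warping sketch using the symbols above (l1, l2 and e(i, j)) might look as follows. The division by (l1 + l2) is an assumption made here to illustrate one way of reducing the bias towards shorter subsequences; the summary does not state the authors' exact normalization, and the function names are hypothetical.

```python
import numpy as np

def _as_sequence(x):
    """Return x as an (length, dimensions) float array."""
    x = np.asarray(x, dtype=float)
    return x.reshape(-1, 1) if x.ndim == 1 else x

def dtw_distance(a, b):
    """Length-normalized DTW cost between two (possibly multivariate) substrings.

    Normalizing by (l1 + l2) is an assumption of this sketch, not
    necessarily the normalization used in the paper.
    """
    a, b = _as_sequence(a), _as_sequence(b)
    l1, l2 = len(a), len(b)
    # e[i, j]: Euclidean distance between the i-th point of a and the j-th point of b.
    e = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # D[i, j]: cheapest cumulative cost of aligning a[:i] with b[:j].
    D = np.full((l1 + 1, l2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            D[i, j] = e[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[l1, l2] / (l1 + l2)
```

Without some form of length normalization, a very short substring would accumulate fewer terms and tend to win by default, which is the bias the text warns about.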

3.2 Parameter Estimation

Gibbs sampling (Casella and George, 1992) is a Markov Chain Monte Carlo approach (Gilks et al., 1998) that allows us to sample the conditional probability density f (θi|θ(i)) for all the sequences sequentially and then iterate the whole process until convergence.
ICM has much faster convergence, but it is also known to be heavily dependent on the initialization.
The values for ai and wi are updated with those that maximize the conditional density f (θi|θ(i)).
The vertical axis in the probabilities represents the starting locations and the horizontal axis represents the possible widths.
Note that the probabilities are spread out in the first iteration for each sentence and slowly converge to a fixed starting location for each of them.
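The sequential update can be sketched as below. The sketch assumes that the conditional density f(θi|θ(i)) decreases monotonically with the summed distance between the candidate segment in sentence i and the current segments in all other sentences, so that maximizing the density amounts to minimizing that sum. This form, the function name `icm`, and the 0-based start indices are assumptions for illustration; the summary does not give the exact density.

```python
import numpy as np

def icm(sequences, theta0, widths, dist, max_iter=50):
    """Greedy ICM-style update of (start, width) per sentence.

    sequences: list of arrays, one per sentence.
    theta0:    list of (a_i, w_i) initial values, with 0-based starts.
    widths:    candidate widths to try.
    dist:      distance between two subsequences (for example, DTW).
    Assumption: the mode of the conditional density is the candidate with
    the smallest summed distance to the segments in the other sentences.
    """
    theta = list(theta0)
    for _ in range(max_iter):
        previous = list(theta)
        for i, seq in enumerate(sequences):
            # Current segments in all other sentences (already-updated ones included).
            others = [sequences[j][a:a + w] for j, (a, w) in enumerate(theta) if j != i]
            best, best_cost = theta[i], np.inf
            for w in widths:
                for a in range(len(seq) - w + 1):
                    cost = sum(dist(seq[a:a + w], o) for o in others)
                    if cost < best_cost:
                        best, best_cost = (a, w), cost
            theta[i] = best  # later sentences in this sweep see the new value
        if theta == previous:  # no parameter changed: converged
            break
    return theta
```

Because every step takes the single best candidate, the loop settles quickly, but the point it settles on depends on where it started, which is the initialization dependence noted above.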

3.3 Sampling Starting Points For ICM

In order to address the local convergence nature of ICM, the authors adopt a uniform random sampling-based approach.
The value for a_i^0 is obtained by sampling a starting point based on uniform random distribution from the set of all possible starting points in the ith sequence, that is, from the set {1, ..., (L_i − w_i^0 + 1)}.
ICM is run using each initial parameter vector generated and the most common solution is considered as the final solution.
The authors run it a number of times equal to the average number of frames per sentence in the given set of sentences for extracting the sign.
The most frequently occurring value (the mode) across runs is assigned as the final value for each parameter.
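The multi-start procedure can be sketched as follows. The names `sample_initial_theta` and `multi_start`, the `run_icm` callable (any routine mapping an initial parameter vector to a converged one), and the `width_range` argument are hypothetical. The sketch also assumes every sentence is at least as long as the minimum width.

```python
import random
from collections import Counter

def sample_initial_theta(lengths, width_range, rng):
    """Draw (a_i^0, w_i^0) for each sentence; starts are 1-based as in the text."""
    w_min, w_max = width_range
    theta0 = []
    for L in lengths:
        w0 = rng.randint(w_min, min(w_max, L))  # initial width from the given range
        a0 = rng.randint(1, L - w0 + 1)         # uniform over {1, ..., L - w0 + 1}
        theta0.append((a0, w0))
    return theta0

def multi_start(lengths, width_range, run_icm, n_runs=None, seed=0):
    """Run ICM from many random initializations and keep per-parameter modes."""
    rng = random.Random(seed)
    if n_runs is None:
        # Default: as many runs as the average sentence length in frames.
        n_runs = max(1, round(sum(lengths) / len(lengths)))
    solutions = [run_icm(sample_initial_theta(lengths, width_range, rng))
                 for _ in range(n_runs)]
    final = []
    for i in range(len(lengths)):
        a_mode = Counter(sol[i][0] for sol in solutions).most_common(1)[0][0]
        w_mode = Counter(sol[i][1] for sol in solutions).most_common(1)[0][0]
        final.append((a_mode, w_mode))
    return final
```

Taking the mode separately for each parameter follows the last step above; an alternative reading, taking the mode over whole solution vectors, would only change the final loop.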

4. Experiments And Results

The authors present visual and quantitative results of their approach for extracting signemes from video sequences representing sentences from American Sign Language.
The authors first describe the data set used, then present the results of the automatic common pattern extraction.

4.1 Data Set

The authors' data set consists of 155 American Sign Language (ASL) video sequences organized into 12 groups based on the vocabulary (word that pervades the sentences of the group).
The breakdown of these ‘pure’ groups and the number of sentences in each are as follows.
The initial parameter vector for each ICM run was chosen independently using uniform random sampling.
This data set was used to extract 12 common subsequences when the authors searched for the first most common sign, and 24 common subsequences when they searched for the second most common sign.
All of the signs were performed by the same signer with plain clothing and background.

4.2 Common Pattern Extraction Results

The authors present the results of their method for extracting common patterns from sign language sentences.
The authors first present results for extracting the single most common sign and multiple common signs from the ‘pure’ sentence groups, followed by results for the most common patterns from the ‘mixed’ groups.

4.2.1 EXTRACTING THE MOST COMMON PATTERN

The authors perform extraction of the most common patterns from the ‘pure’ sentence groups.
The authors possess a priori knowledge of the most common word due to the organization of the sentence groups.
As can be seen, the extracted patterns and the corresponding ground truth patterns are quite similar, except for a few frames at the beginning and end of some of the patterns.
Figure 10(b) shows the corresponding scatter plot for the end position of the patterns in the sentences.
As can be seen most of the points in the scatter plots lie along the diagonal.

4.2.2 EXTRACTING MULTIPLE COMMON SIGNS

In this section the authors present some visual results for the extraction of the two most common signs from the ‘pure’ groups of sentences.
The authors focused on extracting only two signs because the shortest ASL sentence contained two signs.
Figure 13 shows the results for the two most common signs extracted from the sentence ‘BAGGAGE THERE NOT MINE THERE’.
The extracted subsequences correspond to the ASL words ‘BAGGAGE’ and ‘MINE’.
The word ‘BAGGAGE’ appears in all the 14 sentences of the group, whereas the word ‘MINE’ (or ‘MY’) shows up in 11 sentences coinciding with what was expected.

4.2.3 EXTRACTING THE MOST COMMON PATTERNS FROM MIXED SENTENCES

The authors perform extraction of the most common patterns from the collection of ‘mixed’ sentences as outlined in Section 4.1.
Figure 15(a) shows the scatter plot of the ground truth start positions vs. the estimated start positions of the pattern extracted from each of the sentences.
The frame width range for the sign ‘HAVE’ is between 4 and 6 frames with 4 being the minimum width and 6 being the maximum width.
Combining these width ranges could be done using an average of the two or by selecting the minimum and maximum values between the two.
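The two ways of combining width ranges can be written out as a small worked example. The range (4, 6) for 'HAVE' comes from the text; the second range (8, 12) is a made-up placeholder, since the summary does not give the other sign's range, and rounding the averages down is a choice of this sketch.

```python
def combine_by_average(r1, r2):
    """Average the two minimum widths and the two maximum widths (rounded down)."""
    return ((r1[0] + r2[0]) // 2, (r1[1] + r2[1]) // 2)

def combine_by_envelope(r1, r2):
    """Take the smallest minimum and the largest maximum of the two ranges."""
    return (min(r1[0], r2[0]), max(r1[1], r2[1]))

have_range = (4, 6)    # frames, from the text
other_range = (8, 12)  # hypothetical second sign, for illustration only

averaged = combine_by_average(have_range, other_range)   # (6, 9)
envelope = combine_by_envelope(have_range, other_range)  # (4, 12)
```

The envelope keeps every width either sign could take at the cost of a larger search, while the average keeps the search small but may exclude the true width of one of the signs.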

4.3 Sign Localization

The same process that is used for training sign models is used for sign localization.
The set of points representing the signeme were matched with the segments of the SoRD points from the test sentences to find the segment with the minimum matching score, which would represent the sign in the test sentence.
The plot of the Start Offset vs. the End Offset is shown in Figure 16.
The points for different signs are scattered in the four quadrants depending on the nature of the overlap between the ground truth sign and the retrieved signeme.
The closer it is to the origin, the better the quality.
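Localization and the offset measure can be sketched together. The function names, the 0-based frame indices, and the sign convention of the offsets (retrieved minus ground truth) are assumptions of this sketch. `dist` stands for whatever matching score is used, for example a dynamic time warping distance.

```python
import numpy as np

def localize(signeme, sentence, widths, dist):
    """Slide over the test sentence and return (start, width, score) of the best match."""
    best = (None, None, np.inf)
    for w in widths:
        for a in range(len(sentence) - w + 1):
            score = dist(signeme, sentence[a:a + w])
            if score < best[2]:  # keep the segment with the minimum matching score
                best = (a, w, score)
    return best

def offsets(found, truth):
    """Start and end offsets, in frames, of a retrieved segment from the ground truth.

    found, truth: (start, width) pairs. Retrieved minus ground truth is an
    assumed convention; (0, 0) means the two segments coincide exactly.
    """
    (a, w), (ga, gw) = found, truth
    start_offset = a - ga
    end_offset = (a + w - 1) - (ga + gw - 1)
    return start_offset, end_offset
```

Plotting the two offsets against each other places a perfect retrieval at the origin, and the quadrant of any other point indicates whether the retrieved segment starts or ends early or late relative to the ground truth.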


Figures (17)

Figure 12: Signemes extracted from sentences

Figure 3: Variations in relational distributions with motion. (a) Motion sequence. (b) Edge pixels from the skin color blobs. (c) Relational distributions constructed from the low level features (edge pixels) of the images in the motion sequence. The horizontal axis of the relational distribution represents the horizontal distance between the edge pixels and its vertical axis represents the vertical distance between edge pixels.

Figure 14: Extraction of the two most common patterns or signemes from the sentence ‘MY PASSPORT THERE STILL GOOD THERE’.

Figure 13: Extraction of the two most common patterns or signemes from the sentence ‘BAGGAGE THERE NOT MINE THERE’.

Figure 7: Convergence of values of the parameter set. The above plot shows the norm of the difference between two consecutive parameter vectors representing the set of starting points and widths of the common subsequence in the given set of sequences. It shows the typical convergence with a given initialization vector. ICM is repeated with multiple initializations and the most frequently occurring solution is considered as the final solution.

Figure 1: Movement epenthesis in sign language sentences. Frames corresponding to the common sign ‘BUY’ are marked in red. Signs adjacent to BUY are marked in magenta. Frames between marked frames represent movement epenthesis, that is, the transition between signs. Note that the sign itself is also affected by having different signs preceding or following it.

Figure 9: The first dimension of the video sequences containing a common sign ‘DEPART’. The sequences are indicated by the dotted curves and the solid lines on each of them indicate the common pattern or signeme. The odd columns represent the ground truth and the even columns show the results.

Figure 15: Extraction of the most common patterns or signemes from the ‘mixed’ sentence groups. The closer the points are to the diagonal, the closer the result is to the ground truth.

Figure 6: Convergence of the conditional probability density f (θi|θ(i)) for sentences S1...S6 from a given set of sentences S1...S14. The brighter regions represent a higher probability value. The vertical axis in the probabilities represents the starting locations and the horizontal axis represents the possible widths. Note that the probabilities are spread out in the first iteration and it slowly converges to a particular starting location. They are still spread across the horizontal (width) axis because we vary the width only in a small range that is decided based on the amount of motion present in the sign.

Figure 10: Extraction of the most common patterns or signemes from the ‘pure’ sentence groups. The closer the points are to the diagonal, the closer the result is to the ground truth.

Figure 5: Sequential update of the parameter values using ICM. (a), (b) and (c) respectively show the parameter updates in the first sentence, the ith and the nth sentences. In the rth iteration, the parameters of the common sign in the ith sentence are computed based on the parameter values of the previous (i−1) sentences obtained in the same iteration, and those of the (i+1)th to nth sentences obtained in the previous, that is, the (r−1)th iteration.

Figure 16: Start Offset vs. End Offset of Localized Signs

Figure 11: Signemes extracted from sentences

Figure 2: Overview of our approach. Each of the n sentences is represented as a sequence in the Space of Relational Distributions, and common patterns are extracted using iterated conditional modes (ICM). The parameter set {a1,w1, ...an,wn} is initialized using uniform random sampling and the conditional density corresponding to each sentence is updated in a sequential manner.

Figure 8: Histograms showing the start and end locations of signs extracted from 14 different sen-


Journal of Machine Learning Research 13 (2012) 2589-2615 Submitted 11/11; Revised 5/12; Published 9/12

Finding Recurrent Patterns from Continuous Sign Language Sentences for Automated Extraction of Signs

Sunita Nayak SNAYAK@TAAZ.COM

Taaz Inc.

4250 Executive Square, Suite 420

La Jolla, CA 92037 USA

Kester Duncan KKDUNCAN@CSE.USF.EDU

Sudeep Sarkar SARKAR@CSE.USF.EDU

Department of Computer Science & Engineering

University of South Florida

Tampa, FL 33620, USA

Barbara Loeding BARBARA@USF.EDU

Department of Special Education

University of South Florida

Lakeland, FL 33803, USA

Editor: Isabelle Guyon


Keywords: pattern extraction, sign language recognition, signeme extraction, sign modeling, iterated conditional modes

1. Introduction

Sign language research in the computer vision community has primarily focused on improving recognition rates of signs either by improving the motion representation and similarity measures (Yang et al., 2002; Al-Jarrah and Halawani, 2001; Athitsos et al., 2004; Cui and Weng, 2000; Wang et al., 2007; Bauer and Hienz, 2000) or by adding linguistic clues during the recognition process (Bowden et al., 2004; Derpanis et al., 2004). Ong and Ranganath (2005) presented a review of the automated sign language research and also highlighted one important issue in continuous sign language recognition. While signing a sentence, there exist transitions of the hands between two consecutive signs that do not belong to either sign. This is called movement epenthesis (Liddell and Johnson, 1989). This needs to be dealt with first before dealing with any other phonological issues in sign language (Ong and Ranganath, 2005). Most of the existing work in sign language assumes that the training signs are already available and often signs used in the training set are the isolated signs with the boundaries chopped off, or manually selected frames from continuous sentences.

© 2012 Sunita Nayak, Kester Duncan, Sudeep Sarkar and Barbara Loeding.

The ability to recognize isolated signs does not guarantee the recognition of signs in continuous sentences. Unlike isolated signs, a sign in a continuous sentence is strongly affected by its context in the sentence. Figure 1 shows two sentences ‘I BUY TICKET WHERE?’ and ‘YOU CAN BUY THIS FOR HER’ with a common sign ‘BUY’ between them. The frames representing the sign ‘BUY’ and the neighboring signs are marked. The unmarked frames between the signs indicate the frames corresponding to movement epenthesis. It can be observed that the same sign ‘BUY’ is preceded and succeeded by movement epenthesis that depends on the end and start of the preceding and succeeding sign respectively. The movement epenthesis also affects how the sign is signed. This effect makes the automated extraction, modeling and recognition of signs from continuous sentences more difficult when compared to just plain gestures, isolated signs, or finger spelling.

In this paper, we address the problem of automatically extracting the part of a sign that is most common in all occurrences of the sign, and hence expected to be robust with respect to the variation of adjacent signs. These common parts can be used for spotting or recognition of signs in continuous sign language sentences. They can also be used by sign language experts for teaching or studying variations between instances of signs in continuous sign language sentences, or in automated sign language tutoring systems. Furthermore, they can be used even in the process of translating sign language videos directly to spoken words.

In a related work inspired by the success of the use of phonemes in speech recognition, the authors sought to extract common parts in different instances of a sign and thus arrive at a phoneme-analogue for signs (Bauer and Kraiss, 2002). But unlike speech, sign language does not have a completely defined set of phonemes. Hence, we consider extracting commonalities at the sentence and sub-sentence level.

A different but closely related problem is the extraction of common subsequences, also called motifs, from very long multiple gene sequences in biology (Bailey and Elkan, 1995; Lawrence et al., 1993; Pevzner and Sze, 2000; Rigoutsos and Floratos, 1998). Lawrence et al. (1993) used a Gibbs sampling approach based on discrete matches or mismatches of subsequences that were strings of symbols of gene sequences. Bailey and Elkan (1995) used expectation maximization to find common subsequences in univariate biopolymer sequences. In biology, researchers deal with univariate discrete sequences, and hence their algorithms are not always directly applicable to other multivariate continuous domains in time series like speech or sign language. Some researchers tried to symbolize a continuous time series into discrete sequences and used existing algorithms from bioinformatics. For example, Chiu et al. (2003) symbolized the time series into a sequence of symbols using local approximations and used random projections to extract common subsequences in noisy data. Tanaka et al. (2005) extended their work by performing principal component analysis on the multivariate time series data and projected them onto a single dimension and symbolized the data into discrete sequences. However, it is not always possible to get all the important information in the first principal component alone. Further extending his work, Duchêne et al. (2007) find recurrent patterns from multivariate discrete data using time series random projections.

Figure 1: Movement epenthesis in sign language sentences. (a) Continuous sentence ‘I BUY TICKET WHERE?’ (b) Continuous sentence ‘YOU CAN BUY THIS FOR HER’. Frames corresponding to the common sign ‘BUY’ are marked in red. Signs adjacent to BUY are marked in magenta. Frames between marked frames represent movement epenthesis, that is, the transition between signs. Note that the sign itself is also affected by having different signs preceding or following it.

Due to the inherent continuous nature of many time series data like gesture and speech, new methods were developed that do not require approximating the data to a sequence of discrete symbols. Denton (2005) used a continuous random-walk noise model to cluster similar substrings. Nayak et al. (2005) and Minnen et al. (2007) use continuous multivariate sequences and dynamic time warping to find distances between the substrings. Oates (2002), Nayak et al. (2005) and Nayak et al. (2009a) are among the few works in finding recurrent patterns that address non-uniform sampling of time series. The recurrent pattern extraction approach proposed in this paper is based on multivariate continuous time series, uses dynamic time warping to find distances between substrings, and handles length variations of common patterns.

Following the success of Hidden Markov Models (HMMs) in speech recognition, they were used by sign language researchers (Vogler and Metaxas, 1999; Starner and Pentland, 1997; Bowden et al., 2004; Bauer and Hienz, 2000; Starner et al., 1998) for representing and recognizing signs. However, HMMs require a large amount of training data and, unlike speech data, data from native signers is not easily available. Hence, non-HMM-based approaches have been used (Farhadi et al., 2007; Nayak et al., 2009a; Yang et al., 2010; Buehler et al., 2009; Nayak et al., 2009b; Oszust and Wysocki, 2010; Han et al., 2009). In this paper, we use a continuous trajectory representation of signs in a multidimensional space and use dynamic time warping to match subsequences. The relative configuration of the two hands and face in each frame is represented by a relational distribution (Vega and Sarkar, 2003; Nayak et al., 2005), which in itself is a probability density function. The motion dynamics of the signer is captured as changes in the relational distributions. It also allows us to interpolate motion, if required, for data sets with lower frame capture rates. It should also be noted that, unlike many of the previous works in sign language that perform tracking of the hands using 3D magnetic trackers or color gloves (Fang et al., 2004; Vogler and Metaxas, 2001; Wang et al., 2002; Ma et al., 2000; Cooper and Bowden, 2009), our representation does not require tracking and relies on skin segmentation.

We present a Bayesian framework to extract the common subsequences or signemes from all the given sentences simultaneously. Figure 2 depicts the overview of our approach. With this framework, we can extract the first most common sign, the second most common sign, the third most common sign and so on. We represent each sentence as a trajectory in a multi-dimensional space that implicitly captures the shape and motion in the video. Skin color blobs are extracted from frames of color video, and a relational distribution is formed for each frame using the edge pixels in the skin blobs. Each sentence is then represented as a trajectory in a low dimensional space called the space of relational distributions, which is arrived at by performing principal component analysis (PCA) on the relational distributions. There are other alternatives to PCA that are possible and discussed in Nayak et al. (2009b). The other choices do not change the nature of the signeme finding approach, they only affect the quality of the features. The starting locations (a_1, ..., a_n) and widths (w_1, ..., w_n) of the candidate signemes in all the n sentences are together represented by a parameter vector. The starting locations are initialized with random starting locations, based on uniform random sampling from each sentence, and the initial width values are randomly selected from a given range of values. The parameter vector is updated sequentially by sampling the starting point and width of the possible signeme in each sentence from a joint conditional distribution that is based on the locations and widths of the target possible signeme in all other sentences. The process is iterated till the parameter values converge to a stable solution. Monte Carlo approaches like Gibbs sampling (Robert and Casella, 2004; Gilks et al., 1998; Casella and George, 1992), which is a special case of the Metropolis-Hastings algorithm (Chib and Greenberg, 1995), can be used for global optimization while updating the parameter vector by performing importance sampling on the conditional probability distribution. However, this has a high burn-in period.

In this paper, we adopt a greedy approach based on the use of iterated conditional modes (ICM) (Besag, 1986). ICM converges much faster than a Gibbs sampler, but is known to be largely dependent on the initialization. We overcome this limitation by performing ICM a number of times equal to the average length of the n sentences, with different initializations. The most frequently occurring solution from all the ICM runs is considered as the final solution.

Figure 2: Overview of our approach. Each of the n sentences is represented as a sequence in the Space of Relational Distributions, and common patterns are extracted using iterated conditional modes (ICM). The parameter set {a_1, w_1, ..., a_n, w_n} is initialized using uniform random sampling and the conditional density corresponding to each sentence is updated in a sequential manner.

The work in this paper builds on the work of Nayak et al. (2009a) and is different in multiple respects. We propose a system that is generalized to extract more than one common sign from a collection of sentences (first most common sign, second most common sign and so on), whereas


Frequently Asked Questions (2)

Q1. What are the contributions in "Finding recurrent patterns from continuous sign language sentences for automated extraction of signs"?

The authors present a probabilistic framework to automatically learn models of recurring signs from multiple sign language video sequences containing the vocabulary of interest. The authors extract the parts of the signs that are present in most occurrences of the sign in context and are robust to the variations produced by adjacent signs. Given these time series trajectories, the authors extract signemes from multiple sentences concurrently using iterated conditional modes (ICM). The authors show results by learning single signs from a collection of sentences with one common pervading sign, multiple signs from a collection of sentences with more than one common sign, and single signs from a mixed collection of sentences. The extracted signemes demonstrate that their approach is robust to some extent to the variations produced within a sign due to different contexts. The authors also show results whereby these learned sign models are used for spotting signs in test sequences.

Q2. What future works have the authors mentioned in the paper "Finding recurrent patterns from continuous sign language sentences for automated extraction of signs"?

The authors plan to extend their work to handle the large variations encountered when automatically recognizing signemes across different signers. They also plan to work on a variation of dynamic time warping that is robust to amplitude differences between various instances of signs.
