Q2. What contributions do the authors make in the paper "Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals"?
The authors test speaker-dependent phoneme-to-viseme (P2V) maps to examine how similarly speakers talk visually, and they use signed-rank tests to measure the distance between individuals.
Q3. How do the authors measure the distances between the speaker-dependent P2V maps?
The authors use the Wilcoxon signed-rank test [32] to measure the distances between the speaker-dependent P2V maps before drawing conclusions from the observations.
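As an illustration, here is a minimal sketch of this kind of paired comparison using SciPy's implementation of the Wilcoxon signed-rank test; the accuracy values below are hypothetical, not the paper's results:

```python
# Hedged sketch: paired comparison of two speaker-dependent P2V maps using
# the Wilcoxon signed-rank test. All accuracy values are hypothetical.
from scipy.stats import wilcoxon

# Hypothetical per-fold word-recognition accuracies for the same speaker,
# decoded once with their own P2V map and once with another speaker's map.
acc_own_map = [0.42, 0.39, 0.45, 0.41, 0.38, 0.44, 0.40]
acc_other_map = [0.35, 0.33, 0.37, 0.36, 0.31, 0.38, 0.34]

# The test is non-parametric and works on paired samples; a small p-value
# indicates the two maps yield significantly different performance.
statistic, p_value = wilcoxon(acc_own_map, acc_other_map)
print(f"W = {statistic}, p = {p_value:.4f}")
```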
Q4. How many letters of the alphabet are used in the AVL2 dataset?
All 26 letters are used: in measuring the performance of AVL2 speakers, the authors note that the classification network restricts its output to one of the 26 letters of the alphabet.
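To make that restriction concrete, here is a minimal sketch (the scores and function are illustrative assumptions, not the authors' architecture) of a 26-way classification output:

```python
# Hedged sketch: a 26-way classifier output can only ever produce a letter.
import string
import numpy as np

def classify_letter(scores: np.ndarray) -> str:
    """Map a vector of 26 class scores to one letter A-Z."""
    assert scores.shape == (26,), "output layer is restricted to 26 classes"
    return string.ascii_uppercase[int(np.argmax(scores))]

# Hypothetical network outputs: whatever the scores, the prediction is a letter.
print(classify_letter(np.random.rand(26)))
```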
Q5. What is the disadvantage of a speaker-dependent P2V map?
The authors note that a benefit of this is more training samples per class, which compensates for the limited data in currently available datasets, but the disadvantage is poorer generalization between differently articulated sounds.
Q6. How many words are in the RMAV dataset?
Formerly known as LiLIR, the RMAV dataset consists of 20 British English speakers (the authors use 12: seven male and five female), with up to 200 utterances per speaker of the Resource Management (RM) sentences from [51], totalling around 1000 words per speaker.
Q7. Why is the resulting set of visemes smaller than the set of phonemes?
Due to the many-to-one relationship in traditional mappings of phonemes to visemes, any resulting set of visemes will always be smaller than the set of phonemes.
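A small illustration of this many-to-one structure (the groupings below follow a common lip-shape clustering, not the paper's specific P2V maps):

```python
# Illustrative many-to-one phoneme-to-viseme map; the groupings are a
# textbook lip-shape clustering, not the authors' derived P2V maps.
p2v = {
    "p": "V1", "b": "V1", "m": "V1",             # bilabials share one lip gesture
    "f": "V2", "v": "V2",                        # labiodentals
    "t": "V3", "d": "V3", "s": "V3", "z": "V3",  # alveolars
}

# Because several phonemes collapse onto each viseme, the viseme set is
# necessarily smaller than the phoneme set.
print(len(set(p2v.keys())), "phonemes ->", len(set(p2v.values())), "visemes")
```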
Q8. What datasets are used to test speaker-dependent visemes?
Benchmarking against speaker-dependent results, the authors experiment with speakers from both the AVLetters2 (AVL2) and Resource Management Audio-Visual (RMAV) datasets.
Q9. How can the authors address lipreading dependency on training speakers?
The authors can address lipreading's dependency on training speakers by generalizing to speakers who are visually similar in their viseme usage and gesture trajectories.
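One way such a grouping could be realized, as a hedged sketch (the usage vectors and the cosine measure are assumptions for illustration, not the authors' method):

```python
# Hedged sketch: pick the training speaker whose viseme-usage profile is
# most similar to a new speaker's. Vectors and the cosine measure are
# illustrative assumptions, not taken from the paper.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical normalized viseme-usage histograms for training speakers.
train_usage = {
    "speaker_1": np.array([0.40, 0.30, 0.30]),
    "speaker_2": np.array([0.10, 0.70, 0.20]),
    "speaker_3": np.array([0.50, 0.20, 0.30]),
}
new_speaker = np.array([0.45, 0.25, 0.30])

# Borrow models/visemes from the most visually similar training speaker.
best_match = max(train_usage,
                 key=lambda s: cosine_similarity(train_usage[s], new_speaker))
print(best_match)
```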