Q2. What contributions do the authors make in the paper "Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals"?
The authors test speaker-dependent phoneme-to-viseme (P2V) maps to examine how similarly speakers talk visually, and they use signed-rank tests to measure the distance between individuals.
Q3. How do the authors measure the distances between the speaker-dependent P2V maps?
The authors use the Wilcoxon signed-rank test [32] to measure the distances between the speaker-dependent P2V maps before drawing conclusions from the observations.
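As an illustration, here is a minimal sketch of this kind of paired comparison using SciPy's implementation of the Wilcoxon signed-rank test; the accuracy values below are hypothetical, not the paper's results:

```python
# Hedged sketch: paired comparison of two speaker-dependent P2V maps using
# the Wilcoxon signed-rank test. All accuracy values are hypothetical.
from scipy.stats import wilcoxon

# Hypothetical per-fold word-recognition accuracies for the same speaker,
# decoded once with their own P2V map and once with another speaker's map.
acc_own_map = [0.42, 0.39, 0.45, 0.41, 0.38, 0.44, 0.40]
acc_other_map = [0.35, 0.33, 0.37, 0.36, 0.31, 0.38, 0.34]

# The test is non-parametric and works on paired samples; a small p-value
# indicates the two maps yield significantly different performance.
statistic, p_value = wilcoxon(acc_own_map, acc_other_map)
print(f"W = {statistic}, p = {p_value:.4f}")
```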
Q4. How many letters of the alphabet are used in the AVL2 dataset?
All 26 letters are used: in measuring the performance of AVL2 speakers, the authors note that the classification network restricts its output to one of the 26 letters of the alphabet.
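To make that restriction concrete, here is a minimal sketch (the scores and function are illustrative assumptions, not the authors' architecture) of a 26-way classification output:

```python
# Hedged sketch: a 26-way classifier output can only ever produce a letter.
import string
import numpy as np

def classify_letter(scores: np.ndarray) -> str:
    """Map a vector of 26 class scores to one letter A-Z."""
    assert scores.shape == (26,), "output layer is restricted to 26 classes"
    return string.ascii_uppercase[int(np.argmax(scores))]

# Hypothetical network outputs: whatever the scores, the prediction is a letter.
print(classify_letter(np.random.rand(26)))
```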
Q5. What is the disadvantage of a speaker-dependent P2V map?
The authors note that a benefit of this is more training samples per class, which compensates for the limited data in currently available datasets, but the disadvantage is poorer generalization between differently articulated sounds.
Q6. How many words are in the RMAV dataset?
Formerly known as LiLIR, the RMAV dataset consists of 20 British English speakers (the authors use 12: seven male and five female), with up to 200 utterances per speaker of the Resource Management (RM) sentences from [51], totalling around 1000 words per speaker.
Q7. Why is the resulting set of visemes smaller than the set of phonemes?
Due to the many-to-one relationship in traditional mappings of phonemes to visemes, any resulting set of visemes will always be smaller than the set of phonemes.
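A small illustration of this many-to-one structure (the groupings below follow a common lip-shape clustering, not the paper's specific P2V maps):

```python
# Illustrative many-to-one phoneme-to-viseme map; the groupings are a
# textbook lip-shape clustering, not the authors' derived P2V maps.
p2v = {
    "p": "V1", "b": "V1", "m": "V1",             # bilabials share one lip gesture
    "f": "V2", "v": "V2",                        # labiodentals
    "t": "V3", "d": "V3", "s": "V3", "z": "V3",  # alveolars
}

# Because several phonemes collapse onto each viseme, the viseme set is
# necessarily smaller than the phoneme set.
print(len(set(p2v.keys())), "phonemes ->", len(set(p2v.values())), "visemes")
```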
Q8. What datasets are used to test speaker-dependent visemes?
Benchmarking against speaker-dependent results, the authors experiment with speakers from both the AVLetters2 (AVL2) and Resource Management Audio-Visual (RMAV) datasets.
Q9. How can the authors address lipreading dependency on training speakers?
The authors can address lipreading's dependency on training speakers by generalizing to speakers who are visually similar in their viseme usage and gesture trajectories.
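One way such a grouping could be realized, as a hedged sketch (the usage vectors and the cosine measure are assumptions for illustration, not the authors' method):

```python
# Hedged sketch: pick the training speaker whose viseme-usage profile is
# most similar to a new speaker's. Vectors and the cosine measure are
# illustrative assumptions, not taken from the paper.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical normalized viseme-usage histograms for training speakers.
train_usage = {
    "speaker_1": np.array([0.40, 0.30, 0.30]),
    "speaker_2": np.array([0.10, 0.70, 0.20]),
    "speaker_3": np.array([0.50, 0.20, 0.30]),
}
new_speaker = np.array([0.45, 0.25, 0.30])

# Borrow models/visemes from the most visually similar training speaker.
best_match = max(train_usage,
                 key=lambda s: cosine_similarity(train_usage[s], new_speaker))
print(best_match)
```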