Open Access Journal Article (DOI)

Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals

TLDR
Speakers share the same repertoire of mouth gestures; where they differ is in how they use those gestures. A phoneme-clustering method is used to form new phoneme-to-viseme maps for both individual and multiple speakers.
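As a rough illustration of the phoneme-clustering idea (a sketch, not the article's exact procedure), the snippet below groups phonemes into viseme classes by hierarchically clustering a phoneme confusion matrix. The phoneme set, confusion counts and cluster count are all hypothetical placeholders.

```python
# Minimal sketch of clustering phonemes into visemes from a confusion matrix.
# All data below are made up for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

phonemes = ["p", "b", "m", "f", "v", "t", "d"]

# Symmetric confusion counts between phoneme classes (hypothetical numbers).
confusion = np.array([
    [50, 30, 28,  2,  1,  3,  2],
    [30, 48, 25,  1,  2,  2,  3],
    [28, 25, 45,  2,  1,  1,  2],
    [ 2,  1,  2, 40, 35,  3,  2],
    [ 1,  2,  1, 35, 42,  2,  3],
    [ 3,  2,  1,  3,  2, 44, 33],
    [ 2,  3,  2,  2,  3, 33, 41],
], dtype=float)

# Turn confusions into distances: highly confused phonemes are "close".
distance = 1.0 - confusion / confusion.max()
np.fill_diagonal(distance, 0.0)

# Agglomerative clustering; cut the tree into a chosen number of visemes.
Z = linkage(squareform(distance, checks=False), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")

p2v = {p: f"V{l}" for p, l in zip(phonemes, labels)}
print(p2v)  # e.g. {'p': 'V1', 'b': 'V1', 'm': 'V1', 'f': 'V2', ...}
```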
About
This article was published in Computer Speech & Language on 2018-11-01 and is currently open access. It has received 6 citations to date. The article focuses on the topics: Viseme & Gesture.


Citations
Proceedings Article (DOI)

The speaker-independent lipreading play-off; a survey of lipreading machines

TL;DR: A systematic survey of experiments with the TCD-TIMIT dataset is undertaken, using both conventional approaches and deep learning methods, to provide a series of wholly speaker-independent benchmarks, and shows that the best speaker-independent machine scores 69.58% accuracy with CNN features and an SVM classifier.
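For context, a minimal sketch of the kind of pipeline benchmarked here: an SVM classifier trained on precomputed CNN feature vectors. The feature matrix and labels below are random placeholders, not TCD-TIMIT data, and the survey's actual features and kernel may differ.

```python
# Sketch: SVM classifier over precomputed CNN features (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))      # stand-in for per-utterance CNN features
y = rng.integers(0, 10, size=200)    # stand-in for class labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```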
Journal Article (DOI)

Alternative Visual Units for an Optimized Phoneme-Based Lipreading System

TL;DR: A structured approach to creating speaker-dependent visemes with a fixed number of visemes within each set, based upon clustering phonemes, which significantly improves on previous lipreading results with RMAV speakers.
Journal Article (DOI)

Alternative visual units for an optimized phoneme-based lipreading system

TL;DR: In this paper, a structured approach was proposed to create speaker-dependent visemes with a fixed number of visemes within each set, each set having a unique phoneme-to-viseme mapping.
Journal Article (DOI)

Viseme set identification from Malayalam phonemes and allophones

TL;DR: The coarticulation effect in visual speech is studied by creating a many-to-many allophone-to-viseme mapping based solely on a data-driven approach; both mapping methods make use of the K-means data clustering algorithm.
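A minimal sketch of a data-driven allophone-to-viseme grouping with K-means, assuming each allophone is summarised by a fixed-length visual feature vector; the allophone labels and features below are synthetic stand-ins, not the Malayalam data used in this paper.

```python
# Sketch: K-means grouping of allophones into viseme classes (synthetic data).
import numpy as np
from sklearn.cluster import KMeans

allophones = ["p", "p_h", "b", "m", "t", "t_h", "d", "n"]
rng = np.random.default_rng(1)
features = rng.normal(size=(len(allophones), 8))  # placeholder visual features

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
viseme_map = {a: f"V{c}" for a, c in zip(allophones, kmeans.labels_)}
print(viseme_map)
```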
Book Chapter (DOI)

Linguistically involved data-driven approach for Malayalam phoneme-to-viseme mapping

TL;DR: This chapter discusses the primary task of identifying visemes and the number of frames required to encode the temporal evolution of vowel and consonant phonemes, and analyzes three phoneme-to-viseme mappings.
References
Book Chapter (DOI)

Individual Comparisons by Ranking Methods

TL;DR: The comparison of two treatments generally falls into one of two categories: (a) a number of replications for each of the two treatments, which are unpaired, or (b) a series of paired comparisons, some of which may be positive and some negative.
Journal Article (DOI)

A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation

TL;DR: This paper reviewed the nonparametric estimation of statistical error, mainly the bias and standard error of an estimator, or the error rate of a prediction rule, at a relaxed mathematical level, omitting most proofs, regularity conditions and technical details.
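For illustration only, a small bootstrap estimate of the standard error of a mean, in the spirit of the nonparametric error estimation this reference reviews; the per-speaker accuracy values are invented.

```python
# Sketch: bootstrap standard error of a sample mean (invented data).
import numpy as np

rng = np.random.default_rng(2)
accuracies = rng.normal(loc=0.45, scale=0.05, size=12)  # hypothetical per-speaker scores

B = 2000
boot_means = np.array([
    rng.choice(accuracies, size=accuracies.size, replace=True).mean()
    for _ in range(B)
])
print("bootstrap SE of the mean:", boot_means.std(ddof=1))
```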
Proceedings Article

Unsupervised Domain Adaptation by Backpropagation

TL;DR: The method performs very well in a series of image classification experiments, achieving an adaptation effect in the presence of large domain shifts and outperforming the previous state of the art on the Office datasets.
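A hedged PyTorch sketch of the gradient-reversal idea behind this reference: the layer is the identity in the forward pass and negates (and scales) the gradient in the backward pass, so minimising a domain classifier's loss pushes the feature extractor toward domain-invariant features. The tensor shapes and module choices below are illustrative, not taken from the original paper.

```python
# Sketch: gradient reversal layer for domain-adversarial training.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)          # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the features.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: features -> grad_reverse -> domain classifier head.
feats = torch.randn(4, 16, requires_grad=True)
domain_logits = torch.nn.Linear(16, 2)(grad_reverse(feats, lambd=0.5))
```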
Journal Article (DOI)

Active Appearance Models Revisited

TL;DR: This work proposes an efficient fitting algorithm for AAMs based on the inverse compositional image alignment algorithm, shows that the effects of appearance variation during fitting can be precomputed (“projected out”) using this algorithm, and shows how it can be extended to include a global shape-normalising warp.
Journal Article (DOI)

An audio-visual corpus for speech perception and automatic speech recognition

TL;DR: An audio-visual corpus consisting of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers is presented to support the use of common material in speech perception and automatic speech recognition studies.
Frequently Asked Questions
Q1. What are the contributions in "Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals"?

The authors test these phoneme-to-viseme maps to examine how similarly speakers talk visually, and they use signed-rank tests to measure the distance between individuals.

The authors use the Wilcoxon signed-rank test [32] to measure the distances between the speaker-dependent P2V maps before drawing conclusions from the observations.
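A minimal sketch, assuming paired per-fold accuracy scores are available for two speaker-dependent P2V maps, of how such a Wilcoxon signed-rank comparison can be computed with SciPy; the numbers below are hypothetical, not results from the article.

```python
# Sketch: Wilcoxon signed-rank test between two P2V maps (hypothetical scores).
from scipy.stats import wilcoxon

# Paired word-accuracy scores for the same test folds under two P2V maps.
map_a = [0.42, 0.38, 0.45, 0.40, 0.37, 0.44, 0.41]
map_b = [0.39, 0.36, 0.44, 0.35, 0.33, 0.42, 0.38]

statistic, p_value = wilcoxon(map_a, map_b)
print(statistic, p_value)
```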

To measure the performance of AVL2 speakers, the authors note that a classification network restricts the output to one of the 26 letters of the alphabet (with the AVL2 dataset).

The authors note that a benefit of this is more training samples per class, which compensates for the limited data in currently available datasets, but the disadvantage is generalization between differently articulated sounds.

Formerly known as LiLIR, the RMAV dataset consists of 20 British English speakers (the authors use 12: seven male and five female), with up to 200 utterances per speaker of the Resource Management (RM) sentences from [51], totalling around 1000 words per speaker.

Due to the many-to-one relationship in traditional mappings of phonemes to visemes, any resulting set of visemes will always be smaller than the set of phonemes. 
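A toy example of this many-to-one property (the mapping below is illustrative, not the article's actual P2V map): collapsing several phonemes onto each viseme necessarily leaves fewer visemes than phonemes.

```python
# Toy phoneme-to-viseme map: many phonemes share one visual class.
p2v = {
    "p": "V1", "b": "V1", "m": "V1",   # bilabials share one viseme
    "f": "V2", "v": "V2",              # labiodentals
    "t": "V3", "d": "V3", "n": "V3",
}
assert len(set(p2v.values())) < len(p2v)
print(len(p2v), "phonemes ->", len(set(p2v.values())), "visemes")
```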

Benchmarked against speaker-dependent results, the authors experiment with speakers from both the AVLetters2 (AVL2) and Resource Management Audio-Visual (RMAV) datasets.

The authors can address lipreading dependency on training speakers by generalizing to those speakers who are visually similar in viseme usage/trajectory through gestures.