Sankar Basu
Researcher at IBM
Publications - 28
Citations - 1134
Sankar Basu is an academic researcher at IBM. He has contributed to research on speech processing and audio mining, has an h-index of 16, and has co-authored 28 publications receiving 1,133 citations.
Papers
Patent
Method and apparatus for audio-visual speech detection and recognition
TL;DR: In this article, the authors propose an audio-visual speech recognition technique that processes a video signal from an arbitrary-content video source along with its associated audio signal, then recognizes at least a portion of the processed audio signal using the processed video signal to generate an output signal representative of the audio content.
Patent
Methods and apparatus for audio-visual speaker recognition and utterance verification
Sankar Basu,Homayoon S. M. Beigi,Stephane H. Maes,Benoît Maison,Chalapathy Neti,Andrew William Senior +5 more
TL;DR: In this paper, an identification and/or verification decision is made based on both the processed audio signal and the processed video signal, a step referred to as unsupervised utterance verification.
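The decision step described above combines evidence from the audio and video channels. A minimal sketch of one common way to do this, score-level fusion, is shown below; the weights, threshold, and function names are hypothetical and not taken from the patent.

```python
# Hypothetical score-level fusion for an audio-visual accept/reject decision.
# The weight and threshold values are illustrative, not from the patent.

def fuse_scores(audio_score: float, video_score: float,
                audio_weight: float = 0.6, threshold: float = 0.5) -> bool:
    """Accept the claimed identity if the weighted combined score
    clears a fixed threshold."""
    combined = audio_weight * audio_score + (1.0 - audio_weight) * video_score
    return combined >= threshold

# Strong audio match plus weak video match still passes:
# 0.6 * 0.9 + 0.4 * 0.3 = 0.66 >= 0.5
print(fuse_scores(0.9, 0.3))  # True
```

Weighting lets the more reliable modality (often audio) dominate while the other channel still contributes corroborating evidence.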
Patent
Method and apparatus for active annotation of multimedia content
TL;DR: In this paper, the authors propose an annotation framework in which supervised training with partially labeled data is facilitated using active learning, which results in propagation of labels to unlabeled data and greatly facilitates the user in annotating large amounts of multimedia content.
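The active-learning idea behind this annotation framework can be sketched with uncertainty sampling: ask the user to label only the items the current classifier is least sure about. The function and data below are illustrative placeholders, not the patent's actual method.

```python
# Minimal uncertainty-sampling sketch (names and data are hypothetical).
# For a binary classifier, the most uncertain items are those whose
# predicted probability is closest to 0.5.

def select_for_annotation(probs, k=2):
    """Return indices of the k unlabeled items with predictions
    closest to 0.5, i.e. highest uncertainty."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]

# Predicted P(label = 1) for five unlabeled items
probs = [0.95, 0.52, 0.10, 0.47, 0.88]
print(select_for_annotation(probs))  # [1, 3]: the two most uncertain items
```

Labeling only the selected items, retraining, and repeating is what lets labels propagate cheaply to the remaining unlabeled data.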
Patent
Adaptive probabilistic query expansion
TL;DR: In this article, an expanding operation expands the query into sub-queries, at least one of which is expanded probabilistically; an adapting operation then modifies the search so that the relevance of the search results increases when the search is repeated.
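The probabilistic expansion step can be sketched as drawing related terms from a weighted table and keeping only those above a probability cutoff. The table, weights, and cutoff below are made up for illustration and do not come from the patent.

```python
# Hypothetical probabilistic query expansion: each query term is expanded
# with related terms whose association probability clears a cutoff.
# The expansion table and values are illustrative only.

EXPANSIONS = {
    "car": [("automobile", 0.8), ("vehicle", 0.6)],
    "speech": [("audio", 0.7), ("voice", 0.5)],
}

def expand_query(terms, min_prob=0.6):
    """Return the original terms plus related terms whose association
    probability is at least min_prob."""
    expanded = list(terms)
    for t in terms:
        for related, p in EXPANSIONS.get(t, []):
            if p >= min_prob:
                expanded.append(related)
    return expanded

print(expand_query(["car", "speech"]))
# ['car', 'speech', 'automobile', 'vehicle', 'audio']
```

An adaptive system could then raise or lower `min_prob` between repeated searches depending on whether the expanded results proved relevant.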
Proceedings Article
A cascade image transform for speaker independent automatic speechreading
TL;DR: A three-stage, pixel-based visual front end for automatic speechreading (lipreading) that improves recognition of spoken words and phonemes, with significant classification-accuracy gains from each added stage; combined, the stages yield up to a 27% improvement.