Recognizing and Presenting the Storytelling Video Structure With Deep Multimodal Networks
TLDR
A novel scene detection algorithm which employs semantic, visual, textual, and audio cues is presented, along with a demonstration of how the hierarchical decomposition of the storytelling video structure can improve retrieval results presentation with semantically and aesthetically effective thumbnails.

Abstract
In this paper, we propose a novel scene detection algorithm which employs semantic, visual, textual, and audio cues. We also show how the hierarchical decomposition of the storytelling video structure can improve retrieval results presentation with semantically and aesthetically effective thumbnails. Our method is built upon two advancements of the state of the art: the first is a semantic feature extraction method that builds video-specific concept detectors; the second is a multimodal feature embedding learning scheme that maps the feature vector of a shot to a space in which the Euclidean distance has task-specific semantic properties. The proposed method is able to decompose the video into annotated temporal segments, which allow for query-specific thumbnail extraction. Extensive experiments are performed on different data sets to demonstrate the effectiveness of our algorithm. An in-depth discussion on how to deal with the subjectivity of the task is conducted, and a strategy to overcome the problem is suggested.
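The abstract states that shot embeddings are learned so that Euclidean distance carries task-specific semantic meaning, but does not spell out the training objective here. A common way to obtain such a metric space is a triplet-style hinge loss that pulls embeddings of shots from the same scene together and pushes shots from different scenes apart; the sketch below illustrates that general idea with toy vectors (the function name, margin value, and example embeddings are illustrative assumptions, not the paper's actual formulation).

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared Euclidean distances:
    pull the anchor toward the positive shot embedding and push it
    away from the negative one by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # same-scene distance
    d_neg = np.sum((anchor - negative) ** 2)  # different-scene distance
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D shot embeddings: anchor and positive from the same scene,
# negative from a different scene.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.2])

loss_separated = triplet_loss(anchor, positive, negative)  # pairs well separated
loss_violated  = triplet_loss(anchor, negative, positive)  # roles swapped
```

When the same-scene pair is already closer than the cross-scene pair by more than the margin, the loss is zero; swapping the roles produces a positive loss, which is the gradient signal that would reshape the embedding space during training.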
Citations
Journal ArticleDOI
A Metaverse: Taxonomy, Components, Applications, and Open Challenges
TL;DR: In this article, the authors divide the concepts and essential techniques necessary for realizing the Metaverse into three components (i.e., hardware, software, and contents), adopting this component-based view rather than a marketing- or hardware-driven approach in order to conduct a comprehensive analysis.
Journal ArticleDOI
Deep Binary Reconstruction for Cross-Modal Hashing
Di Hu, Feiping Nie, Xuelong Li, et al.
TL;DR: A novel multimodal deep binary reconstruction model is proposed, which can be trained to simultaneously model the correlation across modalities and learn the binary hashing codes, where the model can be easily optimized by a standard gradient descent optimizer.
Journal ArticleDOI
Multitask Learning for Cross-Domain Image Captioning
TL;DR: A novel Multitask Learning Algorithm for cross-Domain Image Captioning (MLADIC) is introduced: a multitask system that simultaneously optimizes two coupled objectives, image captioning and text-to-image synthesis, via a dual learning mechanism, with the aim that leveraging the correlation between the two dual tasks will enhance image captioning performance in the target domain.
Journal ArticleDOI
High-Quality Image Captioning With Fine-Grained and Semantic-Guided Visual Attention
TL;DR: A mechanism of fine-grained and semantic-guided visual attention is created, which accurately links the relevant visual information with each semantic meaning inside the text and significantly outperforms all other methods that use VGG-based CNN encoders without fine-tuning.
Journal ArticleDOI
Temporal Action Localization in Untrimmed Videos Using Action Pattern Trees
TL;DR: A novel framework for automatically localizing action instances in long untrimmed videos based on action pattern trees (AP-Trees) is presented, and deep neural networks are introduced to annotate the segments by simultaneously leveraging the spatio-temporal information and the high-level semantic features of segments.
References
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings ArticleDOI
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Journal Article
Dropout: a simple way to prevent neural networks from overfitting
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Journal ArticleDOI
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei, et al.
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection spanning hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Proceedings Article
Distributed Representations of Words and Phrases and their Compositionality
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.