Open Access Journal Article

Recognizing and Presenting the Storytelling Video Structure With Deep Multimodal Networks

TLDR
A novel scene detection algorithm is proposed that employs semantic, visual, textual, and audio cues. The paper also shows how the hierarchical decomposition of the storytelling video structure can improve the presentation of retrieval results with semantically and aesthetically effective thumbnails.
Abstract
In this paper, we propose a novel scene detection algorithm which employs semantic, visual, textual, and audio cues. We also show how the hierarchical decomposition of the storytelling video structure can improve the presentation of retrieval results with semantically and aesthetically effective thumbnails. Our method is built upon two advancements of the state of the art: the first is semantic feature extraction, which builds video-specific concept detectors; the second is multimodal feature embedding learning, which maps the feature vector of a shot to a space in which the Euclidean distance has task-specific semantic properties. The proposed method is able to decompose the video into annotated temporal segments, which allows for query-specific thumbnail extraction. Extensive experiments are performed on different data sets to demonstrate the effectiveness of our algorithm. We also discuss in depth how to deal with the subjectivity of the task and suggest a strategy to overcome the problem.
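
As a rough illustration of why an embedding space with a meaningful Euclidean distance is useful for scene detection, the sketch below groups consecutive shots into scenes whenever adjacent shot embeddings are close. It is not the algorithm of the paper: the function name `segment_scenes`, the fixed `threshold` rule, and the hand-made 2-D vectors standing in for learned multimodal embeddings are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def segment_scenes(
    embeddings: Sequence[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Group consecutive shot indices into scenes.

    Hypothetical rule (an assumption, not the paper's method): a new scene
    starts whenever the Euclidean distance between the embeddings of two
    adjacent shots exceeds `threshold`.
    """
    scenes: List[List[int]] = []
    for i, emb in enumerate(embeddings):
        if i > 0 and math.dist(embeddings[i - 1], emb) <= threshold:
            # Close to the previous shot: extend the current scene.
            scenes[-1].append(i)
        else:
            # First shot, or a large jump in embedding space: open a new scene.
            scenes.append([i])
    return scenes


# Toy 2-D vectors standing in for learned shot embeddings (made up for the example).
shots = [(0.0, 0.1), (0.1, 0.0), (2.0, 2.1), (2.1, 2.0), (5.0, 0.0)]
scenes = segment_scenes(shots, threshold=1.0)
print(scenes)  # [[0, 1], [2, 3], [4]]
```

With a learned embedding, the same distance comparison would operate on the mapped shot features; a fixed threshold is only the simplest possible grouping rule and would likely need tuning per data set.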


Citations
Journal Article

A Metaverse: Taxonomy, Components, Applications, and Open Challenges

01 Jan 2022
TL;DR: The authors divide the concepts and essential techniques necessary for realizing the Metaverse into three components (hardware, software, and contents), rather than taking a marketing or hardware approach, to conduct a comprehensive analysis.
Journal Article

Deep Binary Reconstruction for Cross-Modal Hashing

TL;DR: A novel multimodal deep binary reconstruction model is proposed, which can be trained to simultaneously model the correlation across modalities and learn the binary hashing codes, where the model can be easily optimized by a standard gradient descent optimizer.
Journal Article

Multitask Learning for Cross-Domain Image Captioning

TL;DR: A novel Multitask Learning Algorithm for cross-Domain Image Captioning (MLADIC) is introduced. It is a multitask system that simultaneously optimizes two coupled objectives via a dual learning mechanism, image captioning and text-to-image synthesis, with the aim of leveraging the correlation between the two dual tasks to enhance image captioning performance in the target domain.
Journal Article

High-Quality Image Captioning With Fine-Grained and Semantic-Guided Visual Attention

TL;DR: A mechanism of fine-grained and semantic-guided visual attention is created that can accurately link the relevant visual information with each semantic meaning inside the text; the approach significantly outperforms all other methods that use VGG-based CNN encoders without fine-tuning.
Journal Article

Temporal Action Localization in Untrimmed Videos Using Action Pattern Trees

TL;DR: A novel framework for automatically localizing action instances in a long untrimmed video, based on action pattern trees (AP-Trees), is presented. Deep neural networks are introduced to annotate the segments by simultaneously leveraging their spatio-temporal information and high-level semantic features.
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings Article

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Journal Article

ImageNet Large Scale Visual Recognition Challenge

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images. It has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.