Recognizing and Presenting the Storytelling Video Structure With Deep Multimodal Networks
TLDR
A novel scene detection algorithm which employs semantic, visual, textual, and audio cues is presented, along with a demonstration of how the hierarchical decomposition of the storytelling video structure can improve retrieval results presentation with semantically and aesthetically effective thumbnails.

Abstract
In this paper, we propose a novel scene detection algorithm which employs semantic, visual, textual, and audio cues. We also show how the hierarchical decomposition of the storytelling video structure can improve retrieval results presentation with semantically and aesthetically effective thumbnails. Our method is built upon two advancements of the state of the art: the first is a semantic feature extraction method that builds video-specific concept detectors; the second is a multimodal feature embedding learning scheme that maps the feature vector of a shot to a space in which the Euclidean distance has task-specific semantic properties. The proposed method is able to decompose the video into annotated temporal segments, which allow for query-specific thumbnail extraction. Extensive experiments are performed on different data sets to demonstrate the effectiveness of our algorithm. An in-depth discussion on how to deal with the subjectivity of the task is conducted, and a strategy to overcome the problem is suggested.
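The abstract states that shot embeddings are learned so that Euclidean distance carries task-specific semantic meaning, but does not spell out the training objective here. A common way to obtain such a metric space is a triplet-style hinge loss that pulls embeddings of shots from the same scene together and pushes shots from different scenes apart; the sketch below illustrates that general idea with toy vectors (the function name, margin value, and example embeddings are illustrative assumptions, not the paper's actual formulation).

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared Euclidean distances:
    pull the anchor toward the positive shot embedding and push it
    away from the negative one by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # same-scene distance
    d_neg = np.sum((anchor - negative) ** 2)  # different-scene distance
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D shot embeddings: anchor and positive from the same scene,
# negative from a different scene.
anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.2])

loss_separated = triplet_loss(anchor, positive, negative)  # pairs well separated
loss_violated  = triplet_loss(anchor, negative, positive)  # roles swapped
```

When the same-scene pair is already closer than the cross-scene pair by more than the margin, the loss is zero; swapping the roles produces a positive loss, which is the gradient signal that would reshape the embedding space during training.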
Citations
Journal ArticleDOI
A Metaverse: Taxonomy, Components, Applications, and Open Challenges
TL;DR: In this article, the authors divide the concepts and essential techniques necessary for realizing the Metaverse into three components (i.e., hardware, software, and contents), adopting this component-based view rather than a marketing- or hardware-driven approach in order to conduct a comprehensive analysis.
Journal ArticleDOI
Deep Binary Reconstruction for Cross-Modal Hashing
Di Hu, Feiping Nie, Xuelong Li, et al.
TL;DR: A novel multimodal deep binary reconstruction model is proposed, which can be trained to simultaneously model the correlation across modalities and learn the binary hashing codes, where the model can be easily optimized by a standard gradient descent optimizer.
Journal ArticleDOI
Multitask Learning for Cross-Domain Image Captioning
TL;DR: A novel Multitask Learning Algorithm for cross-Domain Image Captioning (MLADIC) is introduced: a multitask system that simultaneously optimizes two coupled objectives, image captioning and text-to-image synthesis, via a dual learning mechanism, with the aim that leveraging the correlation between the two dual tasks will enhance image captioning performance in the target domain.
Journal ArticleDOI
High-Quality Image Captioning With Fine-Grained and Semantic-Guided Visual Attention
TL;DR: A mechanism of fine-grained and semantic-guided visual attention is created, which accurately links the relevant visual information with each semantic meaning inside the text and significantly outperforms all other methods that use VGG-based CNN encoders without fine-tuning.
Journal ArticleDOI
Temporal Action Localization in Untrimmed Videos Using Action Pattern Trees
TL;DR: A novel framework for automatically localizing action instances in long untrimmed videos based on action pattern trees (AP-Trees) is presented, and deep neural networks are introduced to annotate the segments by simultaneously leveraging the spatio-temporal information and the high-level semantic features of segments.
References
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings ArticleDOI
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Journal Article
Dropout: a simple way to prevent neural networks from overfitting
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Journal ArticleDOI
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei, et al.
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection spanning hundreds of object categories and millions of images; it has been run annually from 2010 to the present, attracting participation from more than fifty institutions.
Proceedings Article
Distributed Representations of Words and Phrases and their Compositionality
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.