TL;DR: The SocialStories benchmark, comprising a total of 40 curated stories covering sports and cultural events, provides the experimental setup and introduces novel quantitative metrics for a rigorous evaluation of visual storytelling with social media data.
Abstract: Media editors in the newsroom are constantly pressed to provide a "like being there" coverage of live events. Social media provides a disorganised collection of images and videos that media professionals need to grasp before publishing their latest news updates. Automated news visual storyline editing with social media content can be very challenging, as it not only entails finding the right content but also making sure that the news content evolves coherently over time. To tackle these issues, this paper proposes a benchmark for assessing social media visual storylines. The SocialStories benchmark, comprising a total of 40 curated stories covering sports and cultural events, provides the experimental setup and introduces novel quantitative metrics to perform a rigorous evaluation of visual storytelling with social media data.
Editorial coverage of events is often a challenging task: media professionals need to identify interesting stories, summarise each story, and illustrate the story episodes in order to inform the public about how an event unfolded over time.
The authors created three types of storylines: news-article topics, investigative topics, and review topics.
The authors propose a new metric that assesses the quality of a visual storyline in terms of the relevance of its segment illustrations and the transitions between them.
2.1 SocialStories: Event Data and Storylines
To enable social media visual storyline illustration, a data collection strategy was designed to create a suitable corpus, limiting the retrieved documents to those posted during the span of the event.
Events adequate for storytelling were selected, namely those with strong social dynamics in terms of temporal variations with respect to their semantics (textual vocabulary and visual content); the dataset is available at https://novasearch.org/datasets/.
Le Tour de France (TDF) is one of the main road cycling competitions.
The authors' keyword-based approach consists of querying the social media APIs with a set of keyword terms.
In addition, a set of relevant hashtags grouping content on the same topic was manually defined.
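A minimal sketch of this collection strategy is shown below, assuming an already-fetched stream of posts rather than a live API call; the event window, keyword terms and hashtags are illustrative placeholders, not the authors' exact configuration.

```python
from datetime import datetime

# Hypothetical event definition: keyword terms and manually curated
# hashtags, with retrieval limited to the span of the event, as
# described above. All values are illustrative.
EVENT_WINDOW = (datetime(2016, 7, 2), datetime(2016, 7, 24))  # e.g. a TDF edition
KEYWORDS = {"tour de france", "tdf"}                          # assumed query terms
HASHTAGS = {"#tdf", "#tourdefrance"}                          # assumed curated tags

def is_relevant(post: dict) -> bool:
    """Keep posts published during the event that match a keyword or hashtag."""
    if not (EVENT_WINDOW[0] <= post["created_at"] <= EVENT_WINDOW[1]):
        return False
    text = post["text"].lower()
    return any(k in text for k in KEYWORDS) or any(h in text for h in HASHTAGS)

posts = [
    {"text": "Peloton rolling out! #TDF", "created_at": datetime(2016, 7, 3)},
    {"text": "Unrelated concert tonight", "created_at": datetime(2016, 7, 3)},
]
corpus = [p for p in posts if is_relevant(p)]  # keeps only the first post
```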
2.2 Visual Storyline Quality Metric
Media editors constantly judge the quality of news material to decide whether it deserves to be published.
The task requires considerable skill, and deriving a methodology from such a process is not straightforward.
The first step towards quantifying visual storyline quality is the human judgement of these different dimensions.
Once a visual storyline is generated, annotators judge the relevance of each story segment illustration as: si = 0, the image/video is not relevant to the story segment; si = 1, the image/video is relevant to the story segment; si = 2, the image/video is highly relevant to the story segment.
Given the underlying subjectivity of the task, the values of α and β that optimally represent the human perception of visual stories are, in fact, average values.
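Since the paper's exact formulation is not reproduced here, the following is only a minimal sketch, assuming the Quality score linearly combines average segment relevance with average transition quality via the weights α and β discussed above.

```python
def storyline_quality(relevance, transitions, alpha=0.5, beta=0.5):
    """Hypothetical Quality score combining the per-segment relevance
    judgements (0, 1 or 2) with the transition judgements between
    consecutive illustrations (assumed normalised to [0, 1]). The
    weights alpha and beta stand in for the human-perception weights
    discussed above; the paper's exact formula may differ."""
    rel = sum(relevance) / (2 * len(relevance))  # map 0..2 to 0..1
    trans = sum(transitions) / len(transitions) if transitions else 0.0
    return alpha * rel + beta * trans

# A 3-segment story with judgements (2, 1, 2) and two decent transitions:
print(storyline_quality([2, 1, 2], [1.0, 0.5]))  # -> ~0.79
```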
3.1 Protocol and Ground-truth
The goal of this experiment is to demonstrate the robustness of the proposed benchmark.
Target storylines and segments were obtained using several methods, resulting in a total of 40 generated storylines (20 for each event), each comprising 3 to 4 segments.
Ground truth for relevant segment illustrations, transitions, and global story quality was obtained as described in the following section.
Stories were visualised and assessed in a specifically designed prototype interface.
Using the subjective assessment of the annotators, the score proposed in Section 2.2 was calculated for each story.
3.2 Quality Metric vs Human Judgement
To compare the metric with human judgement, the authors computed the metric from the relevance of segments and the transitions between them, and related it to the overall story rating assigned by annotators.
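As an illustration of how such a comparison can be quantified, the sketch below correlates per-story metric scores with annotator ratings; the numbers are made up for illustration and do not come from the paper.

```python
import statistics

# Hypothetical per-story Quality scores and the overall ratings the
# annotators assigned to the same stories (values are illustrative).
quality_scores = [0.35, 0.50, 0.62, 0.78, 0.91]
annotator_ratings = [1, 2, 3, 4, 5]

# Pearson correlation quantifies how well the metric tracks human
# judgement (statistics.correlation requires Python 3.10+).
r = statistics.correlation(quality_scores, annotator_ratings)
print(f"Pearson r = {r:.2f}")
```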
Figure 3 compares the annotator rating to the quality metric.
These values show that linear increments in the ratings provided by the annotators were matched by the metric.
Thus, these results show that the Quality metric effectively emulates the human perception of visual storyline quality.
3.3 Automatic Visual Storytelling
Figure 4 (a) presents the influence of illustrations in the story Quality metric introduced in Section 2.2.
In scenarios where relevant content is scarce, such approaches are hindered by noise. Hence, the performance of these baselines was lower than that of Text Retrieval.
The CNN Dense baseline minimises the distance between representations extracted from the penultimate layer of the visual concept detector.
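Below is a minimal sketch of such a baseline, assuming Euclidean distance over the 4096-d penultimate (fc7) activations of a pre-trained torchvision VGG16 (torchvision >= 0.13 weights API); the authors' exact configuration may differ.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Load a pre-trained VGG16 and drop its final classification layer,
# so the network outputs penultimate-layer (fc7) activations.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """4096-d penultimate-layer representation of one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vgg(img).squeeze(0)

def transition_distance(path_a: str, path_b: str) -> float:
    # Smaller distances suggest visually smoother transitions
    # between consecutive segment illustrations.
    return torch.dist(embed(path_a), embed(path_b)).item()
```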
Additionally, and similarly to what was observed while assessing the segment illustration baselines, Figure 4 shows that creating storylines with good transitions is easier for the TDF dataset than for the EdFest dataset.
4 CONCLUSIONS
This paper addressed the problem of automatic visual story editing using social media data, a task that ran at TRECVID 2018.
Media professionals are asked to cover large events and are required to manually process large amounts of social media data to create event plots and select appropriate pieces of content for each segment.
The main contribution of this paper is a benchmark to assess the overall quality of a visual story based on the relevance of individual illustrations and transitions between consecutive segment illustrations.
The proposed experimental test-bed proved effective in the assessment of story editing and composition with social media material.
This work has been partially funded by the GoLocal CMU-Portugal project Ref. CMUP-ERI/TIC/0046/2014, by the COGNITUS H2020 ICT project No 687605 and by the project NOVA LINCS Ref.
TL;DR: Looking forward, this penetration of AI opens new challenges, such as the interpretability of deep learning (to enable the use of AI in an accountable way, as well as to enable AI-inspired low-complexity algorithms) and applicability in systems which require low-complexity solutions and/or do not have enough training data.
Abstract: Numerous breakthroughs in multimedia signal processing are being enabled by applications of machine learning in tasks such as multimedia creation, enhancement, classification and compression [1]. Notably, in the context of production and distribution of television programmes, it has been successfully demonstrated how Artificial Intelligence (AI) can support innovation in the creative sector. In the context of delivering TV programmes of stunning visual quality, applications of deep learning have enabled significant advances when the original content is of poor quality or resolution, or when delivery channels are very limited. Examples where enhancement of originally poor quality is needed include new content forms (e.g. user-generated content) and historical content (e.g. archives), while limitations of delivery channels can, first of all, be addressed by improving content compression. As a state-of-the-art example, the benefits of deep-learning solutions have recently been demonstrated within an end-to-end platform for the management of user-generated content [2], where deep learning is applied to increase video resolution, evaluate video quality and enrich the video with automatic metadata. Within this particular application space, where a large amount of user-generated content is available, progress has also been made in addressing visual story editing with social media data in automatic ways, making programmes from large amounts of content faster [3]. Broadcasters are also interested in restoring historical content more cheaply. For example, adding colour to "black and white" content has until now been an expensive and time-consuming task; however, new algorithms have recently been developed to perform the task more efficiently. Generative Adversarial Networks (GANs) have become the baseline for many image-to-image translation tasks, including image colourisation. Aiming at the generation of more naturally coloured images from "black and white" sources, the newest algorithms are capable of generalising the colour of natural images, producing realistic and plausible results [4]. In the context of content delivery, new generations of compression standards enable a significant reduction of the required bandwidth [5], however at the cost of increased computational complexity. This is another area where AI can be utilised for better efficiency, either in its simple forms such as decision trees [6,7] or as more advanced deep convolutional neural networks [8]. Looking forward, this penetration of AI opens new challenges, such as the interpretability of deep learning (to enable the use of AI in an accountable way, as well as to enable AI-inspired low-complexity algorithms) and applicability in systems which require low-complexity solutions and/or do not have enough training data. Overall, however, the further benefits of these new approaches include the automation of many traditional production tasks, which has the potential to transform the way content providers make their programmes in cheaper and more effective ways.
2 citations
Cites background from "A Benchmark of Visual Storytelling ..."
...Within this particular application space, where a large amount of user-generated content is available, progress has also been made in addressing visual story editing with social media data in automatic ways, making programmes from large amounts of content faster [3]....
TL;DR: A context-enriched Multimodal Transformer model is proposed, NewsLXMERT, capable of jointly attending to complementary multimodal news data perspectives, to create knowledge-rich and diverse multimodal sequences.
Abstract: The connection between news and the images that illustrate them goes beyond visual concept to natural language matching. Instead, the open-domain and event-reporting nature of news leads to semantically complex texts, in which images are used as a contextualizing element. This connection is often governed by a certain level of indirection, with journalistic criteria also playing an important role. In this paper, we address the complex challenge of connecting images to news text. A context-enriched Multimodal Transformer model is proposed, NewsLXMERT, capable of jointly attending to complementary multimodal news data perspectives. The idea is to create knowledge-rich and diverse multimodal sequences, going beyond the news headline (often lacking the necessary context) and visual objects, to effectively ground images to news pieces. A comprehensive evaluation of challenging image-news piece matching settings is conducted, where we show the effectiveness of NewsLXMERT, the importance of leveraging the additional context and demonstrate the usefulness of the obtained pre-trained news representations for transfer-learning. Finally, to shed light on the heterogeneous nature of the problem, we contribute with a systematic model-driven study that identifies image-news matching profiles, thus explaining news piece-image matches.
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
55,235 citations
"A Benchmark of Visual Storytelling ..." refers methods in this paper
...A visual concept detector baseline, based on a pre-trained VGG16 [15] CNN, did not perform as expected....
TL;DR: This paper investigates the real-time interaction of events such as earthquakes in Twitter and proposes an algorithm to monitor tweets and to detect a target event and produces a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location.
Abstract: Twitter, a popular microblogging service, has received much attention recently. An important characteristic of Twitter is its real-time nature. For example, when an earthquake occurs, people make many Twitter posts (tweets) related to the earthquake, which enables detection of earthquake occurrence promptly, simply by observing the tweets. As described in this paper, we investigate the real-time interaction of events such as earthquakes in Twitter and propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location. We consider each Twitter user as a sensor and apply Kalman filtering and particle filtering, which are widely used for location estimation in ubiquitous/pervasive computing. The particle filter works better than other comparable methods for estimating the centers of earthquakes and the trajectories of typhoons. As an application, we construct an earthquake reporting system in Japan. Because of the numerous earthquakes and the large number of Twitter users throughout the country, we can detect an earthquake with high probability (96% of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or more are detected) merely by monitoring tweets. Our system detects earthquakes promptly and sends e-mails to registered users. Notification is delivered much faster than the announcements that are broadcast by the JMA.
3,976 citations
"A Benchmark of Visual Storytelling ..." refers background in this paper
...The timeline of an event, e.g. a music festival, a sport tournament or a natural disaster [13], contains visual and textual pieces of information that are strongly correlated. There are several ways of presenting the same event, by covering specific storylines, each offering different perspecti...
TL;DR: It is argued that for some highly structured and recurring events, such as sports, it is better to use more sophisticated techniques to summarize the relevant tweets, and a solution based on learning the underlying hidden state representation of the event via Hidden Markov Models is given.
Abstract: Twitter has become exceedingly popular, with hundreds of millions of tweets being posted every day on a wide variety of topics. This has helped make real-time search applications possible with leading search engines routinely displaying relevant tweets in response to user queries. Recent research has shown that a considerable fraction of these tweets are about "events," and the detection of novel events in the tweet-stream has attracted a lot of research interest. However, very little research has focused on properly displaying this real-time information about events. For instance, the leading search engines simply display all tweets matching the queries in reverse chronological order. In this paper we argue that for some highly structured and recurring events, such as sports, it is better to use more sophisticated techniques to summarize the relevant tweets. We formalize the problem of summarizing event-tweets and give a solution based on learning the underlying hidden state representation of the event via Hidden Markov Models. In addition, through extensive experiments on real-world data we show that our model significantly outperforms some intuitive and competitive baselines.
331 citations
"A Benchmark of Visual Storytelling ..." refers background in this paper
...Social media platforms, including Twitter, Flickr or YouTube, provide a stream of multimodal social media content, naturally yielding an unfiltered event timeline. These timelines can be listened to, mined [2, 3, 6], and exploited to gather visual content, specifically images and videos [11, 14]. The primary contribution of this paper is the introduction of a quality metric to assess visual storylines. This metric ...
TL;DR: Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
Abstract: We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
231 citations
"A Benchmark of Visual Storytelling ..." refers background in this paper
...Assessing the success of news visual storyline creation is a complex task. In this section, we address this task and propose the SocialStories benchmark. Visual storytelling datasets like [7] and [8] contain sequences of image-caption pairs that capture a specific activity, e.g., "playing frisbee with a dog". A characteristic of these stories is that the sequence of visual elem...
Q1. What have the authors contributed in "A benchmark of visual storytelling in social media" ?
Media editors in the newsroom are constantly pressed to provide a "like being there" coverage of live events. To tackle these issues, this paper proposes a benchmark for assessing social media visual storylines. The SocialStories benchmark, comprising a total of 40 curated stories covering sports and cultural events, provides the experimental setup and introduces novel quantitative metrics to perform a rigorous evaluation of visual storytelling with social media data.