
Chapter 1
Video fragmentation and reverse search on the Web
Evlampios Apostolidis, Konstantinos Apostolidis, Ioannis Patras, Vasileios Mezaris
Evlampios Apostolidis
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece and School of Electronic Engineering and Computer Science, Queen Mary University, London, UK, e-mail: apostolid@iti.gr

Konstantinos Apostolidis
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece, e-mail: kapost@iti.gr

Ioannis Patras
School of Electronic Engineering and Computer Science, Queen Mary University, London, UK, e-mail: i.patras@qmul.ac.uk

Vasileios Mezaris
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece, e-mail: bmezaris@iti.gr

Abstract This chapter is focused on methods and tools for video fragmentation and reverse search on the Web. These technologies can assist journalists when they are dealing with fake news - which nowadays are rapidly spread via social media platforms - that rely on the reuse of a previously posted video from a past event with the intention to mislead the viewers about a contemporary event. The fragmentation of a video into visually and temporally coherent parts and the extraction of a representative keyframe for each defined fragment enable the provision of a complete and concise keyframe-based summary of the video. Contrary to straightforward approaches that sample video frames with a constant step, the summary generated through video fragmentation and keyframe extraction is considerably more effective for discovering the video content and for performing a fragment-level search for the video on the Web. This chapter starts by explaining the nature and characteristics of this type of reuse-based fake news in its introductory part, and continues with an overview of existing approaches for the temporal fragmentation of single-shot videos into sub-shots (the most appropriate level of temporal granularity when dealing with user-generated videos) and of tools for performing reverse search of a video on the Web. Subsequently, it describes two state-of-the-art methods for video sub-shot fragmentation - one relying on the assessment of the visual coherence over sequences of frames, and another one that is based on the identification of camera activity during the video recording - and presents the InVID web application that enables the fine-grained (at the fragment level) reverse search for near-duplicates of a given video on the Web. In the sequel, the chapter reports the findings of a series of experimental evaluations regarding the efficiency of the above-mentioned technologies, which indicate their competence to generate a concise and complete keyframe-based summary of the video content and the usefulness of this fragment-level representation for fine-grained reverse video search on the Web. Finally, it draws conclusions about the effectiveness of the presented technologies and outlines our future plans for further advancing them.
1.1 Introduction
Recent advances in video capturing technology have made it possible to embed powerful, high-resolution video sensors into portable devices, such as camcorders, digital cameras, tablets and smartphones, and most of these devices now offer network connectivity and file sharing functionalities. The latter, combined with the rise and widespread use of social networks (such as Facebook, Twitter, Instagram) and video sharing platforms (such as YouTube, Vimeo, DailyMotion), resulted in an enormous increase in the number of videos captured and shared online by amateur users on a daily basis. These user-generated videos (UGVs) can nowadays be recorded at any time and place using smartphones, tablets and a variety of video cameras (such as GoPro action cameras) that can be attached to sticks, body parts or even drones. The ubiquitous use of video capturing devices, combined with the ease of sharing videos through social networks and video sharing platforms, leads to a wealth of UGVs being available online.
In recent years, these online-shared UGVs have been, in many cases, the only evidence of a breaking or evolving story. The sudden and unexpected occurrence of such events makes their timely coverage by news and media organizations impossible. However, the presence (in most cases) of eyewitnesses capturing the story with their smartphones and instantly sharing the recorded video (even live, i.e. during its recording) via social networks makes the UGV the only, and highly valuable, source of information about the breaking event. In this newly formed technological environment that facilitates information diffusion through a variety of social media platforms, journalists and investigators alike are increasingly turning to these platforms to find media recordings of events. Newsrooms in TV stations and online news platforms make use of video to illustrate and report on news events, and since professional journalists are not always at the scene of a breaking or evolving story (as mentioned above), it is the content shared by users that can be used for reporting the story. Nevertheless, the rise of social media as a news source has also seen a rise in fake news, i.e. the spread of deliberate misinformation or disinformation on these platforms. As a result, online-shared user-generated content is increasingly called into question, and people's trust in journalism is severely shaken.
One type of fake, probably the easiest to produce and thus one of the most commonly encountered by journalists, relies on the reuse of a video from an earlier event with the claim that it shows a contemporary event. An example of such a fake is depicted in Fig. 1.1. In this figure, the image on the left is a screenshot of a video showing a hurricane that struck Dolores, Uruguay on May 29, 2016; the image in the middle is a screenshot of the same video posted with the claim that it shows Hurricane Otto striking Bocas del Toro, Panama on November 24, 2016; and the image on the right is a screenshot of a tweet that uses the same video with the claim that it shows the activity of Hurricane Irma in the islands near the United States on September 9, 2017.
Fig. 1.1: Example of a fake news item based on the reuse of a video from a hurricane in Uruguay (image on the left) to deliberately mislead people about the strike of Hurricane Otto in Panama (image in the middle) and the strike of Hurricane Irma in the US islands (image on the right).
The identification and debunking of such fakes requires the detection of the original video through a search for prior occurrences of this video (or parts of it) on the Web. Early approaches to this task were based on manually taking screenshots of the video in the player and uploading these images to the reverse image search functionality of popular Web search engines (e.g. Google search). This process can be highly laborious and time-consuming, while its effectiveness depends on a limited set of manually taken screenshots of the video.
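For illustration, the short sketch below automates this baseline: it samples frames from a video at a constant temporal step and writes them to disk for manual reverse image search. This is a minimal, hypothetical example built on OpenCV's standard video I/O; the step size and file naming are arbitrary assumptions, not part of any tool discussed in this chapter.

```python
import cv2

def sample_frames(video_path, step_seconds=5.0, out_pattern="frame_{:04d}.png"):
    """Save one frame every `step_seconds` (the naive constant-step baseline)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back to 25 fps if unknown
    step = max(1, int(round(fps * step_seconds)))
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video (or read error)
            break
        if index % step == 0:
            cv2.imwrite(out_pattern.format(saved), frame)
            saved += 1
        index += 1
    cap.release()
    return saved  # number of screenshots produced
```

A fixed step inevitably trades completeness against conciseness: a small step floods the analyst with near-identical screenshots, while a large one can skip short but visually distinct parts of the video - exactly the limitation that fragmentation-based keyframe extraction is meant to overcome.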
However, the timely identification of media posted online which (claim to) illustrate a (breaking) news event is for many journalists the foremost challenge, as they need to meet deadlines to publish a news story online or fill a news broadcast with content. The time needed for an extensive and effective search for the posted video, combined with the lack of expertise of many journalists and the time pressure to publish the story, can seriously affect the credibility of the published news item. And the publication or re-publication of fake news can significantly harm the reliability of the entire news organization. An example of mis-verification of a fake video by an Italian news organization is presented in Fig. 1.2.
A video from the filming of the “World War Z” movie (left part of Fig. 1.2) was used in a tweet claiming to show a Hummer attack against police at Notre-Dame, Paris, France on June 6, 2017 (middle part of Fig. 1.2) and in another tweet claiming to show an attack at Gare Centrale, Brussels, Belgium two weeks later (right part of Fig. 1.2). The fake tweet about the Paris attack was used in a news item published by the aforementioned news organization, causing serious damage to its trustworthiness.
Fig. 1.2: Example of a fake news item based on the reuse of a video from the filming of the “World War Z” movie (image on the left) to deliberately mislead people about a Hummer attack at Notre-Dame, Paris (image in the middle) and at Gare Centrale in Brussels (image on the right).
Several tools that enable the identification of near-duplicates of a video on the Web have been developed over the last years, a fact that indicates the usefulness and applicability of this process for journalists and members of the media verification community. Nevertheless, the existing solutions (presented in detail in Section 1.2.2) exhibit several limitations that restrict the effectiveness of the video reverse search task. In particular, some of these solutions rely on a limited set of video thumbnails provided by the video sharing platform; examples are the YouTube DataViewer of Amnesty International (https://citizenevidence.amnestyusa.org/) and the Custom Reverse Image Search of IntelTechniques (https://inteltechniques.com/osint/reverse.video.html). Other technologies demand the extraction of video frames for performing reverse image search, e.g. the TinEye search engine (https://tineye.com/) and the Karma Decay web application (http://karmadecay.com/). A number of tools enable this reverse search only on closed collections of videos, which significantly limits the scope of the investigation; examples are the Berify (https://berify.com/), RevIMG (http://www.revimg.com/) and Videntifier (http://www.videntifier.com) platforms. Last but not least, a commonality among the aforementioned technologies is that none of them supports the analysis of locally stored videos.
Aiming to offer a more effective approach for reverse video search on the Web, in InVID we developed: a) an algorithm for the temporal fragmentation of (single-shot) UGVs into sub-shots (presented in Section 1.3.1.1), and b) a web application that integrates this algorithm and makes possible the time-efficient, fragment-level reverse search for near-duplicates of a given video on the Web (described in Section 1.3.2). The developed algorithm allows the identification of visually and temporally coherent parts of the processed video, and the extraction of a dynamic number of keyframes in a manner that ensures a complete and concise representation of the defined - visually discrete - parts of the video. Moreover, the compatibility of the web application with several video sharing platforms and social networks is complemented by its ability to directly process videos that are stored locally on the user's machine. In a nutshell, our complete technology helps users quickly discover the temporal structure of a video, extract detailed information about the video content, and use this data in their reverse video search queries.
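To give a feel for what such fragment-level keyframe extraction involves, the sketch below declares a sub-shot boundary whenever the visual similarity between consecutive frames drops below a threshold, and keeps the middle frame of each resulting fragment as its keyframe. This is a deliberately simplified stand-in and not the InVID algorithm of Section 1.3.1.1; the HSV-histogram similarity measure and the threshold value are illustrative assumptions only.

```python
import cv2

def fragment_and_pick_keyframes(video_path, threshold=0.6):
    """Split a single-shot video into visually coherent fragments and
    pick one keyframe (the middle frame) per fragment.

    A fragment boundary is declared when the correlation between the
    hue/saturation histograms of consecutive frames falls below
    `threshold` (an arbitrary value chosen for illustration).
    """
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, index = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # 2D hue/saturation histogram as a cheap visual-coherence signal
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None and \
                cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
            boundaries.append(index)  # a new fragment starts at this frame
        prev_hist = hist
        index += 1
    cap.release()
    boundaries.append(index)  # close the last fragment
    keyframes = [(boundaries[i] + boundaries[i + 1]) // 2
                 for i in range(len(boundaries) - 1)]
    return boundaries, keyframes
```

The choice of visual feature matters here: as the evaluations reported later in this chapter indicate, HSV histograms favour precision, whereas the DCT-based features utilised by the method of Section 1.3.1.1 yield notably higher recall and a better overall F-score.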
In the following, Section 1.2 discusses the current state of the art on methods for video sub-shot fragmentation (Section 1.2.1) and tools for reverse video search on the Web (Section 1.2.2). Then, Section 1.3 is dedicated to the presentation of two advanced approaches for video sub-shot fragmentation - the InVID method that relies on the visual resemblance of the video content (see Section 1.3.1.1) and another algorithm that is based on the extraction of motion information (see Section 1.3.1.2, and the sketch below for the basic idea) - and the description of the InVID web application for reverse video search on the Web (see Section 1.3.2). Subsequently, Section 1.4 reports the extracted findings regarding the performance of the aforementioned methods (see Section 1.4.1) and tool (see Section 1.4.2), while the last Section 1.5 concludes the document and presents our future plans in this research area.
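For intuition on the motion-based family of methods, the following sketch computes a mean displacement vector between two consecutive frames using Shi-Tomasi corner detection [38] and Pyramidal Lucas-Kanade (PLK) optical flow [7], both available in OpenCV. It is only an illustrative building block, not the algorithm of Section 1.3.1.2, which, as discussed later, computes such displacement vectors separately for each frame quartile.

```python
import cv2
import numpy as np

def mean_displacement(prev_gray, curr_gray):
    """Mean motion vector between two consecutive grayscale frames,
    estimated from Shi-Tomasi corners tracked with Pyramidal Lucas-Kanade."""
    corners = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                      qualityLevel=0.01, minDistance=7)
    if corners is None:  # e.g. a textureless frame
        return np.zeros(2)
    moved, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   corners, None)
    good = status.ravel() == 1
    if not good.any():  # tracking failed for all corners
        return np.zeros(2)
    return (moved[good] - corners[good]).reshape(-1, 2).mean(axis=0)
```

A camera-activity-based fragmenter would aggregate such vectors over time: runs of near-zero vectors indicate a static camera, runs of consistently oriented vectors indicate pans or tilts, and a change between such regimes is a natural sub-shot boundary candidate.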
1.2 Related Work
This part presents the related work, both in terms of methods for the temporal fragmentation of uninterruptedly captured (i.e. single-shot) videos into sub-shots (Section 1.2.1) and tools for finding near-duplicates of a given video on the Web (Section 1.2.2).
1.2.1 Video Fragmentation
A variety of methods dealing with the temporal fragmentation of single-shot videos
have been proposed over the last couple of decades. Most of them are related to
approaches for video summarization and keyframe selection (e.g. [21, 9, 29, 15]),
some focus on the analysis of egocentric or wearable videos (e.g. [27, 41, 19]),
others aim to address the need for detecting duplicates of videos (e.g. [8]), a number
of them is related to the indexing and annotation of personal videos (e.g. [28]),

References

[5] H. Bay, A. Ess, T. Tuytelaars, L. Van Gool: Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding 110(3), 346-359 (2008)
[14] M. A. Fischler, R. C. Bolles: Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381-395 (1981)
[26] D. G. Lowe: Object recognition from local scale-invariant features. In: Proc. IEEE Int. Conf. on Computer Vision (ICCV), pp. 1150-1157 (1999)
[37] E. Rublee, V. Rabaud, K. Konolige, G. Bradski: ORB: An efficient alternative to SIFT or SURF. In: Proc. IEEE Int. Conf. on Computer Vision (ICCV), pp. 2564-2571 (2011)
[38] J. Shi, C. Tomasi: Good features to track. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 593-600 (1994)