
Showing papers by "Pavel Korshunov" published in 2018


Posted Content
TL;DR: This paper presents the first publicly available set of Deepfake videos generated from videos of the VidTIMIT database, and demonstrates that GAN-generated Deepfake videos are challenging for both face recognition systems and existing detection methods.
Abstract: It is becoming increasingly easy to automatically replace the face of one person in a video with the face of another by using a pre-trained generative adversarial network (GAN). Recent public scandals, e.g., the faces of celebrities being swapped onto pornographic videos, call for automated ways to detect these Deepfake videos. To help develop such methods, in this paper, we present the first publicly available set of Deepfake videos generated from videos of the VidTIMIT database. We used open-source GAN-based software to create the Deepfakes, and we emphasize that training and blending parameters can significantly impact the quality of the resulting videos. To demonstrate this impact, we generated videos of low and high visual quality (320 videos each) using differently tuned parameter sets. We show that state-of-the-art face recognition systems based on the VGG and FaceNet neural networks are vulnerable to Deepfake videos, with 85.62% and 95.00% false acceptance rates respectively, which means that methods for detecting Deepfake videos are necessary. Among several baseline approaches, we found that an audio-visual approach based on lip-sync inconsistency detection was not able to distinguish Deepfake videos. The best performing method, which is based on visual quality metrics and is often used in the presentation attack detection domain, resulted in an 8.97% equal error rate on high-quality Deepfakes. Our experiments demonstrate that GAN-generated Deepfake videos are challenging for both face recognition systems and existing detection methods, and further development of face-swapping technology will only make them more so.
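
The abstract reports two kinds of error rates: false acceptance rate (FAR) for the face recognition systems and equal error rate (EER) for the detectors. Below is a minimal sketch of how such metrics are typically computed from match scores; the scores, labels, and the 0.5 threshold are illustrative placeholders, not values or code from the paper.

```python
# Sketch: computing EER and FAR from verification scores (illustrative data).
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """EER: the operating point where false acceptance and false rejection rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

def compute_far(attack_scores, threshold):
    """FAR: fraction of Deepfake probes accepted as genuine at a given threshold."""
    return float(np.mean(np.asarray(attack_scores) >= threshold))

# Hypothetical similarity scores: label 1 = genuine pair, 0 = Deepfake pair.
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.92, 0.88, 0.75, 0.81, 0.64, 0.40])
print("EER:", compute_eer(labels, scores))
print("FAR @ 0.5:", compute_far(scores[labels == 0], 0.5))
```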

369 citations


Proceedings ArticleDOI
01 Sep 2018
TL;DR: Evaluates several approaches proposed for lip-syncing and dubbing detection, based on convolutional and recurrent networks, and compares them with systems based on more traditional classifiers.
Abstract: With the increasing amount of video consumed by people daily, there is a danger of a rise in maliciously modified video content (i.e., 'fake news') that could be used to damage innocent people or to impose a certain agenda, e.g., to meddle in elections. In this paper, we consider audio manipulations in video of a person speaking to the camera. Such manipulation is easy to perform: for instance, one can simply replace part of the audio, yet this can dramatically change the message and the meaning of the video. With the goal of developing an automated system that can detect these audio-visual speaker inconsistencies, we consider several approaches proposed for lip-syncing and dubbing detection, based on convolutional and recurrent networks, and compare them with systems based on more traditional classifiers. We evaluated these methods on the publicly available VidTIMIT, AMI, and GRID databases, for which we generated sets of tampered data.
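
As a rough illustration of the recurrent-network family of approaches mentioned above, here is a minimal sketch (not the authors' architecture) of an LSTM that classifies concatenated per-frame audio and visual features as consistent or tampered; the sequence length, feature dimensions, and synthetic training data are all assumptions.

```python
# Sketch: an LSTM over fused audio-visual feature sequences (assumed shapes).
import numpy as np
import tensorflow as tf

SEQ_LEN, AUDIO_DIM, VISUAL_DIM = 50, 13, 20  # assumed per-frame dimensions

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN, AUDIO_DIM + VISUAL_DIM)),
    tf.keras.layers.LSTM(64),                          # temporal modeling of sync cues
    tf.keras.layers.Dense(1, activation="sigmoid"),    # 1 = tampered audio track
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Synthetic stand-in data: a real pipeline would align MFCC frames with
# mouth-region features extracted from the corresponding video frames.
x = np.random.randn(8, SEQ_LEN, AUDIO_DIM + VISUAL_DIM).astype("float32")
y = np.random.randint(0, 2, size=(8, 1)).astype("float32")
model.fit(x, y, epochs=1, verbose=0)
```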

69 citations


Proceedings ArticleDOI
01 Jan 2018
TL;DR: This paper considers shallow and deep CNN architectures implemented using TensorFlow and compares their performance with a state-of-the-art MFCC and GMM-based system on two large databases with presentation attacks: the publicly available voicePA and the proprietary BioCPqD-PA.
Abstract: Research in the area of automatic speaker verification (ASV) has advanced enough for the industry to start using ASV systems in practical applications. However, these systems are highly vulnerable to spoofing or presentation attacks (PAs), limiting their wide deployment. Several speech-based presentation attack detection (PAD) methods have been proposed recently, but most of them are based on hand-crafted frequency- or phase-based features. Although convolutional neural networks (CNNs) have already shown breakthrough results in face recognition, little is known about whether CNNs are as effective at detecting presentation attacks in speech. In this paper, to investigate the applicability of CNNs to PAD, we consider shallow and deep examples of CNN architectures implemented using TensorFlow and compare their performance with a state-of-the-art MFCC and GMM-based system on two large databases with presentation attacks: the publicly available voicePA and the proprietary BioCPqD-PA. We study the impact of increasing CNN depth on performance and examine how the networks perform on unknown attacks by using one database for training and the other for evaluation. The results demonstrate that CNNs are able to learn a database significantly better than hand-crafted features (with increasing depth also improving performance). However, CNN-based PADs still lack the ability to generalize across databases and are unable to detect unknown attacks well.
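
For context, an MFCC and GMM-based PAD baseline of the kind referenced above typically fits one GMM to bona fide speech and one to attacks, then scores a test utterance by a log-likelihood ratio. The sketch below follows that general recipe using librosa and scikit-learn; the file names, number of mixture components, and feature configuration are assumptions, not the paper's setup.

```python
# Sketch: a two-GMM MFCC baseline for speech presentation attack detection.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, n_mfcc=20):
    """Per-frame MFCCs, shape (frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Fit one GMM per class on stacked per-frame features (hypothetical file lists).
bona_fide = np.vstack([mfcc_features(p) for p in ["real1.wav", "real2.wav"]])
attacks = np.vstack([mfcc_features(p) for p in ["spoof1.wav", "spoof2.wav"]])
gmm_real = GaussianMixture(n_components=64, covariance_type="diag").fit(bona_fide)
gmm_attack = GaussianMixture(n_components=64, covariance_type="diag").fit(attacks)

def pad_score(path):
    """Mean log-likelihood ratio; positive favors bona fide, threshold to decide."""
    feats = mfcc_features(path)
    return gmm_real.score(feats) - gmm_attack.score(feats)
```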

15 citations


Journal ArticleDOI
TL;DR: A psychophysical study assessing the human visual system's ability to perceive detail when impacted by surround luminance; the results are consistent with an existing surround-effect visual model, which has a basis in the behavior of cone photoreceptors.
Abstract: One of the key new attributes of high-dynamic-range (HDR) imaging and displays is the ability to present many stops of shadow detail and, with the best systems, a perceptually pure black. Displays perform at their best in a dark room, as no ambient illumination impinges on the surface of the display, which can elevate the display's perceived black level. In addition, the viewer can see the most shadow detail when the region surrounding the display is also dark. Beyond applications where a display is viewed in a dark-surround environment, there are also viewing conditions with higher ambient light levels. Knowledgeable viewers prevent ambient illumination from reflecting off the display, but even then, the surround luminance is increased. To understand the impact of this surrounding ambient illumination on black level visibility and shadow detail, and to further guide ambient compensation algorithms, we performed a psychophysical study to assess the human visual system's ability to perceive detail when impacted by surround luminance. For the stimuli, we used a Gabor signal to probe the visual system's best capability. For the display, we used a Pulsar HDR display with a large neutral density filter placed over it to enable black levels as low as 0.0006 cd/m², relevant for organic light-emitting diode display and cinema applications. The surround luminance levels ranged from fully dark up to 100 cd/m², and for each of these, shadow detail thresholds were measured as a function of display mean luminance levels from 0.001 to 400 cd/m². The results are useful for perceptual display performance assessment and tone-mapping applications. Further analysis found the results to be consistent with an existing surround-effect visual model, which has a basis in the behavior of cone photoreceptors.
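
A Gabor signal, as used for the stimuli here, is a sinusoidal grating windowed by a Gaussian envelope. The sketch below generates such a patch in normalized luminance units; the size, spatial frequency, and contrast values are illustrative and not the study's actual parameters.

```python
# Sketch: generating a Gabor patch (grating times Gaussian window), illustrative values.
import numpy as np

def gabor_patch(size=256, cycles=4.0, sigma_frac=0.15, contrast=0.05, mean_lum=0.5):
    half = size // 2
    x, y = np.meshgrid(np.arange(-half, half), np.arange(-half, half))
    grating = np.sin(2 * np.pi * cycles * x / size)                  # vertical grating
    envelope = np.exp(-(x**2 + y**2) / (2 * (sigma_frac * size) ** 2))  # Gaussian window
    return mean_lum * (1.0 + contrast * grating * envelope)          # luminance image

stim = gabor_patch()
print(stim.shape, stim.min(), stim.max())
```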

3 citations