Top 8 papers published by Pavel Korshunov from Idiap Research Institute in 2019

Proceedings Article•DOI•

Vulnerability assessment and detection of Deepfake videos

[...]

Pavel Korshunov¹, Sébastien Marcel¹•Institutions (1)

04 Jun 2019

TL;DR: This paper presents the first publicly available set of Deepfake videos generated from videos of VidTIMIT database, and demonstrates that GAN-generated Deep fake videos are challenging for both face recognition systems and existing detection methods.

...read moreread less

Abstract: It is becoming increasingly easy to automatically replace a face of one person in a video with the face of another person by using a pre-trained generative adversarial network (GAN). Recent public scandals, e.g., the faces of celebrities being swapped onto pornographic videos, call for automated ways to detect these Deepfake videos. To help developing such methods, in this paper, we present the first publicly available set of Deepfake videos generated from videos of VidTIMIT database. We used open source software based on GANs to create the Deepfakes, and we emphasize that training and blending parameters can significantly impact the quality of the resulted videos. To demonstrate this impact, we generated videos with low and high visual quality (320 videos each) using differently tuned parameter sets. We showed that the state of the art face recognition systems based on VGG and Facenet neural networks are vulnerable to Deepfake videos, with 85.62% and 95.00% false acceptance rates (on high quality versions) respectively, which means methods for detecting Deepfake videos are necessary. By considering several baseline approaches, we found the best performing method based on visual quality metrics, which is often used in presentation attack detection domain, to lead to 8.97% equal error rate on high quality Deep-fakes. Our experiments demonstrate that GAN-generated Deepfake videos are challenging for both face recognition systems and existing detection methods, and the further development of face swapping technology will make it even more so.

...read moreread less

86 citations

Journal Article•DOI•

Overview and evaluation of the JPEG XT HDR image compression standard

[...]

Alessandro Artusi¹, Rafal Mantiuk², Thomas Richter³, Philippe Hanhart⁴, Pavel Korshunov⁵, Massimiliano Agostinelli, Arkady Ten⁶, Touradj Ebrahimi⁴ - Show less +4 more•Institutions (6)

University of Girona¹, University of Cambridge², University of Stuttgart³, École Polytechnique Fédérale de Lausanne⁴, Idiap Research Institute⁵, Dolby Laboratories⁶

01 Apr 2019-Journal of Real-time Image Processing

TL;DR: The paper introduces three of currently defined profiles in JPEG XT, each constraining the common decoder architecture to a subset of allowable configurations, and assess the coding efficiency of each profile extensively through subjective assessments, using 24 naïve subjects to evaluate 20 images and objective evaluations.

...read moreread less

Abstract: Standards play an important role in providing a common set of specifications and allowing inter-operability between devices and systems. Until recently, no standard for high-dynamic-range (HDR) image coding had been adopted by the market, and HDR imaging relies on proprietary and vendor-specific formats which are unsuitable for storage or exchange of such images. To resolve this situation, the JPEG Committee is developing a new coding standard called JPEG XT that is backward compatible to the popular JPEG compression, allowing it to be implemented using standard 8-bit JPEG coding hardware or software. In this paper, we present design principles and technical details of JPEG XT. It is based on a two-layer design, a base layer containing a low-dynamic-range image accessible to legacy implementations, and an extension layer providing the full dynamic range. The paper introduces three of currently defined profiles in JPEG XT, each constraining the common decoder architecture to a subset of allowable configurations. We assess the coding efficiency of each profile extensively through subjective assessments, using 24 naive subjects to evaluate 20 images, and objective evaluations, using 106 images with five different tone-mapping operators and at 100 different bit rates. The objective results (based on benchmarking with subjective scores) demonstrate that JPEG XT can encode HDR images at bit rates varying from 1.1 to 1.9 bit/pixel for estimated mean opinion score (MOS) values above 4.5 out of 5, which is considered as fully transparent in many applications. This corresponds to 23-times bitstream reduction compared to lossless OpenEXR PIZ compression.

...read moreread less

65 citations

Posted Content•

The Speed Submission to DIHARD II: Contributions & Lessons Learned

[...]

Sahidullah, Jose Patino¹, Samuele Cornell, Ruiqing Yin, Sunit Sivasankaran, Hervé Bredin, Pavel Korshunov², Alessio Brutti³, Romain Serizel, Emmanuel Vincent, Nicholas Evans¹, Sébastien Marcel², Stefano Squartini, Claude Barras - Show less +10 more•Institutions (3)

Institut Eurécom¹, Idiap Research Institute², fondazione bruno kessler³

07 Nov 2019-arXiv: Audio and Speech Processing

TL;DR: This paper describes the speaker diarization systems developed for the Second DIHARD Speech Diarization Challenge (DIHARD II) by the Speed team, and presents several components of the system, including categorization of domains, speech enhancement, speech activity detection, and speaker embeddings.

...read moreread less

Abstract: This paper describes the speaker diarization systems developed for the Second DIHARD Speech Diarization Challenge (DIHARD II) by the Speed team. Besides describing the system, which considerably outperformed the challenge baselines, we also focus on the lessons learned from numerous approaches that we tried for single and multi-channel systems. We present several components of our diarization system, including categorization of domains, speech enhancement, speech activity detection, speaker embeddings, clustering methods, resegmentation, and system fusion. We analyze and discuss the effect of each such component on the overall diarization performance within the realistic settings of the challenge.

...read moreread less

12 citations

Proceedings Article•

Tampered Speaker Inconsistency Detection with Phonetically Aware Audio-visual Features

[...]

Pavel Korshunov¹, Michael Halstead², Diego Castan³, Martin Graciarena³, Mitchell McLaren³, Brian Burns³, Aaron Lawson³, Sébastien Marcel¹ - Show less +4 more•Institutions (3)

Idiap Research Institute¹, Queensland University of Technology², SRI International³

01 Jan 2019

TL;DR: This paper demonstrates that by replacing standard MFCC features with embeddings from a DNN trained for automatic speech recognition, combined with mouth landmarks (visual features), this model can achieve a significant performance improvement on several challenging publicly available databases of speakers, for which it generated sets of tampered data.

...read moreread less

Abstract: The recent increase in social media based propaganda, i.e., ‘fake news’, calls for automated methods to detect tampered content. In this paper, we focus on detecting tampering in a video with a person speaking to a camera. This form of manipulation is easy to perform, since one can just replace a part of the audio, dramatically changing the meaning of the video. We consider several detection approaches based on phonetic features and recurrent networks. We demonstrate that by replacing standard MFCC features with embeddings from a DNN trained for automatic speech recognition, combined with mouth landmarks (visual features), we can achieve a significant performance improvement on several challenging publicly available databases of speakers (VidTIMIT, AMI, and GRID), for which we generated sets of tampered data. The evaluations demonstrate a relative equal error rate reduction of 55% (to 4.5% from 10.0%) on the large GRID corpus based dataset and a satisfying generalization of the model on other datasets.

...read moreread less

11 citations

Book Chapter•DOI•

A Cross-database Study of Voice Presentation Attack Detection

[...]

Pavel Korshunov¹, Sébastien Marcel¹•Institutions (1)

Idiap Research Institute¹

01 Jan 2019

TL;DR: This chapter presents an overview of the latest databases and the techniques to detect presentation attacks, and discusses the performance of PAD systems based on handcrafted features and traditional Gaussian mixture model (GMM) classifiers, and demonstrates whether the score fusion techniques can improve the performanceof PADs.

...read moreread less

Abstract: Despite an increasing interest in speaker recognition technologies, a significant obstacle still hinders their wide deployment—their high vulnerability to spoofing or presentation attacks. These attacks can be easy to perform. For instance, if an attacker has access to a speech sample from a target user, he/she can replay it using a loudspeaker or a smartphone to the recognition system during the authentication process. The ease of executing presentation attacks and the fact that no technical knowledge of the biometric system is required to make these attacks especially threatening in practical application. Therefore, late research focuses on collecting data databases with such attacks and on development of presentation attack detection (PAD) systems. In this chapter, we present an overview of the latest databases and the techniques to detect presentation attacks. We consider several prominent databases that contain bona fide and attack data, including ASVspoof 2015, ASVspoof 2017, AVspoof, voicePA, and BioCPqD-PA (the only proprietary database). Using these databases, we focus on the performance of PAD systems in the cross-database scenario or in the presence of “unknown” (not available during training) attacks, as these scenarios are closer to practice, when pretrained systems need to detect attacks in unforeseen conditions. We first present and discuss the performance of PAD systems based on handcrafted features and traditional Gaussian mixture model (GMM) classifiers. We then demonstrate whether the score fusion techniques can improve the performance of PADs. We also present some of the latest results of using neural networks for presentation attack detection. The experiments show that PAD systems struggle to generalize across databases and mostly unable to detect unknown attacks, with systems based on neural networks demonstrating better performance compared to the systems based on handcrafted features.

...read moreread less

10 citations

Posted Content•

Vulnerability of Face Recognition to Deep Morphing

[...]

Pavel Korshunov¹, Sébastien Marcel¹•Institutions (1)

Idiap Research Institute¹

03 Oct 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper presents the publicly available dataset of the Deepfake videos with faces morphed with a GAN-based algorithm and considers several baseline approaches for detecting deep morphs, finding that the method based on visual quality metrics leads to the best performance.

...read moreread less

Abstract: It is increasingly easy to automatically swap faces in images and video or morph two faces into one using generative adversarial networks (GANs). The high quality of the resulted deep-morph raises the question of how vulnerable the current face recognition systems are to such fake images and videos. It also calls for automated ways to detect these GAN-generated faces. In this paper, we present the publicly available dataset of the Deepfake videos with faces morphed with a GAN-based algorithm. To generate these videos, we used open source software based on GANs, and we emphasize that training and blending parameters can significantly impact the quality of the resulted videos. We show that the state of the art face recognition systems based on VGG and Facenet neural networks are vulnerable to the deep morph videos, with 85.62 and 95.00 false acceptance rates, respectively, which means methods for detecting these videos are necessary. We consider several baseline approaches for detecting deep morphs and find that the method based on visual quality metrics (often used in presentation attack detection domain) leads to the best performance with 8.97 equal error rate. Our experiments demonstrate that GAN-generated deep morph videos are challenging for both face recognition systems and existing detection methods, and the further development of deep morphing technologies will make it even more so.

...read moreread less

10 citations

Posted Content•

pyannote.audio: neural building blocks for speaker diarization

[...]

Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, Gregory Gelly, Pavel Korshunov, Marvin Lavechin, Diego Fustes, Hadrien Titeux, Wassim Bouaziz, Marie-Philippe Gill - Show less +6 more

04 Nov 2019-arXiv: Audio and Speech Processing

TL;DR: The pyannote.audio toolkit as discussed by the authors provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.

...read moreread less

Abstract: We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization. Based on PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding -- reaching state-of-the-art performance for most of them.

...read moreread less

7 citations

Proceedings Article•

Vulnerability of Face Recognition to Deep Morphing

[...]

Pavel Korshunov¹, Sébastien Marcel¹•Institutions (1)

Idiap Research Institute¹

01 Jan 2019

TL;DR: In this article, the authors presented the publicly available dataset of the Deepfake videos with faces morphed with a GAN-based algorithm, and they used open source software based on GANs, and emphasize that training and blending parameters can significantly impact the quality of the resulted videos.

...read moreread less

Abstract: It is increasingly easy to automatically swap faces in images and video or morph two faces into one using generative adversarial networks (GANs). The high quality of the resulted deep-morph raises the question of how vulnerable the current face recognition systems are to such fake images and videos. It also calls for automated ways to detect these GAN-generated faces. In this paper, we present the publicly available dataset of the Deepfake videos with faces morphed with a GAN-based algorithm. To generate these videos, we used open source software based on GANs, and we emphasize that training and blending parameters can significantly impact the quality of the resulted videos. We show that the state of the art face recognition systems based on VGG and Facenet neural networks are vulnerable to the deep morph videos, with 85.62 and 95.00 false acceptance rates, respectively, which means methods for detecting these videos are necessary. We consider several baseline approaches for detecting deep morphs and find that the method based on visual quality metrics (often used in presentation attack detection domain) leads to the best performance with 8.97 equal error rate. Our experiments demonstrate that GAN-generated deep morph videos are challenging for both face recognition systems and existing detection methods, and the further development of deep morphing technologies will make it even more so.

...read moreread less

2 citations

Showing papers by "Pavel Korshunov published in 2019"