
Showing papers in "IEEE Journal of Selected Topics in Signal Processing in 2020"


Journal ArticleDOI
TL;DR: In this paper, the authors present an analysis of the methods for visual media integrity verification, that is, the detection of manipulated images and videos, and highlight the limits of current forensic tools, the most relevant issues, the upcoming challenges, and suggest future directions for research.
Abstract: With the rapid progress in recent years, techniques that generate and manipulate multimedia content can now provide a very advanced level of realism. The boundary between real and synthetic media has become very thin. On the one hand, this opens the door to a series of exciting applications in different fields such as creative arts, advertising, film production, and video games. On the other hand, it poses enormous security threats. Software packages freely available on the web allow any individual, without special skills, to create very realistic fake images and videos. These can be used to manipulate public opinion during elections, commit fraud, discredit or blackmail people. Therefore, there is an urgent need for automated tools capable of detecting false multimedia content and avoiding the spread of dangerous false information. This review paper aims to present an analysis of the methods for visual media integrity verification, that is, the detection of manipulated images and videos. Special emphasis will be placed on the emerging phenomenon of deepfakes, fake media created through deep learning tools, and on modern data-driven forensic methods to fight them. The analysis will help highlight the limits of current forensic tools, the most relevant issues, the upcoming challenges, and suggest future directions for research.

251 citations


Journal ArticleDOI
TL;DR: This article reviews both datasets and visual attention modelling approaches for 360° video/image, and overviews the compression approaches, which utilize either the spherical characteristics or visual attention models.
Abstract: Nowadays, 360° video/image has become increasingly popular and has drawn great attention. The spherical viewing range of 360° video/image leads to huge data volumes, which poses challenges to 360° video/image processing, in particular the bottlenecks of storage and transmission. Accordingly, recent years have witnessed an explosive emergence of works on 360° video/image processing. In this article, we review the state-of-the-art works on 360° video/image processing from the aspects of perception, assessment and compression. First, this article reviews both datasets and visual attention modelling approaches for 360° video/image. Second, we survey the related works on both subjective and objective visual quality assessment (VQA) of 360° video/image. Third, we overview the compression approaches for 360° video/image, which utilize either the spherical characteristics or visual attention models. Finally, we summarize this overview article and outline future research trends in 360° video/image processing.

191 citations


Journal ArticleDOI
TL;DR: A technical review of available models and learning methods for multimodal intelligence, focusing on the combination of vision and natural language modalities, which has become an important topic in both the computer vision andnatural language processing research communities.
Abstract: Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in its input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal representations, fusing multimodal signals at various levels, and multimodal applications. Regarding multimodal representation learning, we review the key concepts of embedding, which unify multimodal signals into a single vector space and thereby enable cross-modality signal processing. We also review the properties of many types of embeddings that are constructed and learned for general downstream tasks. Regarding multimodal fusion, this review focuses on special architectures for the integration of representations of unimodal signals for a particular task. Regarding applications, selected areas of broad interest in the current literature are covered, including image-to-text caption generation, text-to-image generation, and visual question answering. We believe that this review will facilitate future studies in the emerging field of multimodal intelligence for related communities.

174 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed framework for detecting image operator chain based on convolutional neural network not only obtains significant detection performance but also can distinguish the order in some cases that previous works were unable to identify.
Abstract: Many forensic techniques have recently been developed to determine whether an image has undergone a specific manipulation operation. When multiple consecutive operations are applied to images, forensic analysts not only need to identify the existence of each manipulation operation, but also to distinguish the order of the involved operations. However, image operator chain detection is still a challenging problem. In this paper, an order forensics framework for detecting image operator chains based on a convolutional neural network (CNN) is presented. A two-stream CNN architecture is designed to capture both tampering artifact evidence and local noise residual evidence. Specifically, a new CNN-based method is proposed for forensically detecting a chain made of two image operators; it automatically learns manipulation detection features directly from image data. Further, we empirically investigate the robustness of our proposed method in two practical scenarios: forensic investigators have no access to the operating parameters, and manipulations are applied to a JPEG-compressed image. Experimental results show that the proposed framework not only obtains significant detection performance but also can distinguish the order in some cases that previous works were unable to identify.

131 citations
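As a rough illustration of the two-stream idea described above (not the authors' network), the following PyTorch sketch runs one stream on the RGB patch and a second on a fixed high-pass noise residual, then fuses both to classify the operator chain; the filter, channel sizes, and number of chain classes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamChainClassifier(nn.Module):
    """Toy two-stream CNN: image stream + noise-residual stream (hypothetical layout)."""
    def __init__(self, num_chains=5):
        super().__init__()
        # Fixed high-pass filter as a crude noise-residual extractor.
        hp = torch.tensor([[[-1., 2., -1.], [2., -4., 2.], [-1., 2., -1.]]]) / 4.0
        self.register_buffer("hp_kernel", hp.repeat(3, 1, 1, 1))  # depthwise, 3 channels
        self.image_stream = self._make_stream(3)
        self.noise_stream = self._make_stream(3)
        self.classifier = nn.Linear(2 * 64, num_chains)

    @staticmethod
    def _make_stream(in_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        residual = F.conv2d(x, self.hp_kernel, padding=1, groups=3)
        feats = torch.cat([self.image_stream(x), self.noise_stream(residual)], dim=1)
        return self.classifier(feats)

logits = TwoStreamChainClassifier()(torch.randn(2, 3, 64, 64))  # (2, num_chains)
```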


Journal ArticleDOI
TL;DR: A spherical bit-rate equalization strategy is developed to obtain a block-level Lagrangian multiplier for the rate-distortion optimization process in video coding, and two algorithms are developed to enhance compression efficiency for the ERP and CMP pictures, respectively.
Abstract: To provide an excellent visual experience for customers, virtual reality (VR) sources require higher resolutions and better visual quality than traditional picture sequences. The content of a VR video can be mapped onto a sphere by playing devices to present a $360^{\circ}$ scene, which is usually called VR360 in the industrial community. The most popular formats for VR360 sources are the equirectangular projection (ERP) and the cubemap projection (CMP). Both ERP and CMP pictures can be effectively projected onto a virtual three-dimensional spherical surface for rendering. This brings a new challenge to the compression of VR video sources: how to allocate the bit-rate properly for the mainstream projection formats. The most intuitive way to deal with this challenge is to empirically assign a fixed quantization parameter (QP) to each coding unit according to its position, which evidently lacks precision and rationality, and thus degrades coding performance. This research proposes a new entropy equilibrium optimization (EEO) methodology to enhance the coding performance of VR360 videos. Specifically, we develop a spherical bit-rate equalization strategy to obtain a block-level Lagrangian multiplier ($\lambda$) for the rate-distortion optimization process in video coding. The appropriate QP value for each block is then dynamically determined in accordance with its $\lambda$. Based on our EEO methodology, we develop two algorithms, EEOA-ERP and EEOA-CMP, to enhance compression efficiency for ERP and CMP pictures, respectively. Experimental results demonstrate that both algorithms achieve significant BD-Rate savings and outperform the HM16.17 platform for the all-intra (AI), low-delay (LD) and random-access (RA) configurations. Concretely, compared with the state-of-the-art algorithm WSU-ERP, the proposed EEOA-ERP achieves a BD-Rate saving of 0.37% in the LD configuration. Furthermore, the proposed EEOA-CMP gains 2.6% in objective quality in the RA configuration when compared with the HM16.17 VR CMP under the common test conditions.

108 citations
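For readers unfamiliar with how a block-level $\lambda$ translates into a QP, the sketch below inverts the generic HEVC-style relation $\lambda = c \cdot 2^{(QP-12)/3}$; the constant c = 0.85 and the clipping range are common HM defaults, not necessarily the values used by the EEO method.

```python
import math

def qp_from_lambda(lmbda, c=0.85):
    """Invert the common HEVC-style relation lambda = c * 2**((QP - 12) / 3).

    The constant c and the [0, 51] clipping range are generic HM defaults,
    not the paper's exact values; EEO supplies the block-level lambda.
    """
    qp = 12.0 + 3.0 * math.log2(lmbda / c)
    return int(round(min(max(qp, 0), 51)))

# A perceptually important block gets a smaller lambda, hence a finer QP.
print(qp_from_lambda(10.0))  # frame-level-ish lambda
print(qp_from_lambda(2.5))   # more important block -> lower QP
```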


Journal ArticleDOI
TL;DR: Results show that classification models based solely on acoustic speech features extracted through the new active data representation (ADR) method can achieve accuracy levels comparable to those achieved by models that employ higher-level language features.
Abstract: Speech analysis could provide an indicator of Alzheimer's disease and help develop clinical tools for automatically detecting and monitoring disease progression. While previous studies have employed acoustic (speech) features for characterisation of Alzheimer's dementia, these studies focused on a few common prosodic features, often in combination with lexical and syntactic features which require transcription. We present a detailed study of the predictive value of purely acoustic features automatically extracted from spontaneous speech for Alzheimer's dementia detection, from a computational paralinguistics perspective. The effectiveness of several state-of-the-art paralinguistic feature sets for Alzheimer's detection was assessed on a balanced sample of DementiaBank's Pitt spontaneous speech dataset, with patients matched by gender and age. The feature sets assessed were the extended Geneva minimalistic acoustic parameter set (eGeMAPS), the emobase feature set, the ComParE 2013 feature set, and new Multi-Resolution Cochleagram (MRCG) features. Furthermore, we introduce a new active data representation (ADR) method for feature extraction in Alzheimer's dementia recognition. Results show that classification models based solely on acoustic speech features extracted through our ADR method can achieve accuracy levels comparable to those achieved by models that employ higher-level language features. Analysis of the results suggests that all feature sets contribute information not captured by other feature sets. We show that while the eGeMAPS feature set provides slightly better accuracy than other feature sets individually (71.34%), “hard fusion” of feature sets improves accuracy to 78.70%.

108 citations
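A minimal sketch of the general pipeline (acoustic functionals plus a classifier), assuming the opensmile Python package for eGeMAPS extraction; the file list and labels are placeholders, and the paper's ADR representation is not reproduced here.

```python
# pip install opensmile scikit-learn
import opensmile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# eGeMAPS functionals: one feature vector per recording.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# 'wav_files' and 'labels' (AD vs. control) are placeholders for a matched
# DementiaBank-style split; the paper's ADR step is not reproduced here.
wav_files = ["speaker01.wav", "speaker02.wav"]
labels = [1, 0]

X = [smile.process_file(f).to_numpy().ravel() for f in wav_files]
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X, labels)
```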


Journal ArticleDOI
TL;DR: In this article, the authors proposed a regularization by artifact-removal (RARE) algorithm, which can leverage priors learned on datasets containing only undersampled measurements, which is applicable to problems where it is practically impossible to have fully-sampled groundtruth data for training.
Abstract: Regularization by denoising (RED) is an image reconstruction framework that uses an image denoiser as a prior. Recent work has shown the state-of-the-art performance of RED with learned denoisers corresponding to pre-trained convolutional neural nets (CNNs). In this work, we propose to broaden the current denoiser-centric view of RED by considering priors corresponding to networks trained for more general artifact-removal. The key benefit of the proposed family of algorithms, called regularization by artifact-removal (RARE) , is that it can leverage priors learned on datasets containing only undersampled measurements. This makes RARE applicable to problems where it is practically impossible to have fully-sampled groundtruth data for training. We validate RARE on both simulated and experimentally collected data by reconstructing free-breathing whole-body 3D MRI into ten respiratory phases from heavily undersampled k-space measurements. Our results corroborate the potential of learning regularizers for iterative inversion directly on undersampled and noisy measurements.

100 citations
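The RED-style update underlying this family of methods can be sketched in a few lines of NumPy; here the forward operator, measurements, and "restorer" are toy stand-ins, with RARE corresponding to replacing the restorer by an artifact-removal network trained on undersampled data.

```python
import numpy as np

def red_iterate(y, A, restorer, x0, gamma=1e-3, tau=0.5, n_iter=100):
    """Generic RED/RARE-style update: data-fit gradient plus tau * (x - R(x)).

    'restorer' stands in for the learned artifact-removal network R; a crude
    smoother is used below so the sketch runs end to end.
    """
    x = x0.copy()
    for _ in range(n_iter):
        grad_data = A.T @ (A @ x - y)          # least-squares data term
        grad_prior = tau * (x - restorer(x))   # RED-style prior term
        x = x - gamma * (grad_data + grad_prior)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))             # toy undersampled forward model
x_true = np.zeros(100); x_true[::10] = 1.0
y = A @ x_true
smooth = lambda v: np.convolve(v, np.ones(3) / 3, mode="same")  # toy "restorer"
x_hat = red_iterate(y, A, smooth, x0=A.T @ y)
```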


Journal ArticleDOI
TL;DR: The results obtained in the empirical evaluation show that additional efforts are required to develop robust facial manipulation detection systems against unseen conditions and spoof techniques, such as the one proposed in this study.
Abstract: The availability of large-scale facial databases, together with the remarkable progresses of deep learning technologies, in particular Generative Adversarial Networks (GANs), have led to the generation of extremely realistic fake facial content, raising obvious concerns about the potential for misuse. Such concerns have fostered the research on manipulation detection methods that, contrary to humans, have already achieved astonishing results in various scenarios. In this study, we focus on the synthesis of entire facial images, which is a specific type of facial manipulation. The main contributions of this study are four-fold: i) a novel strategy to remove GAN “fingerprints” from synthetic fake images based on autoencoders is described, in order to spoof facial manipulation detection systems while keeping the visual quality of the resulting images; ii) an in-depth analysis of the recent literature in facial manipulation detection; iii) a complete experimental assessment of this type of facial manipulation, considering the state-of-the-art fake detection systems (based on holistic deep networks, steganalysis, and local artifacts), highlighting how challenging this task is in unconstrained scenarios; and finally iv) we announce a novel public database, named iFakeFaceDB, resulting from the application of our proposed GAN-fingerprint Removal approach (GANprintR) to already very realistic synthetic fake images. The results obtained in our empirical evaluation show that additional efforts are required to develop robust facial manipulation detection systems against unseen conditions and spoof techniques, such as the one proposed in this study.

98 citations
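To make the autoencoder idea concrete, here is a minimal (illustrative, not the paper's) PyTorch autoencoder: the low-dimensional bottleneck discards the high-frequency content where GAN fingerprints tend to reside while keeping the face visually plausible; layer sizes and the 64x64 input are placeholders.

```python
import torch
import torch.nn as nn

class TinyGANprintRemover(nn.Module):
    """Illustrative autoencoder: the bottleneck discards high-frequency detail
    where GAN fingerprints tend to live, while keeping the face visually intact."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Trained with a reconstruction loss on real faces, then applied to synthetic
# faces at test time (training loop omitted).
out = TinyGANprintRemover()(torch.rand(1, 3, 64, 64))
```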


Journal ArticleDOI
TL;DR: A new approach for synergistic recovery of undersampled multi-contrast acquisitions based on conditional generative adversarial networks is proposed, which mitigates the limitations of pure learning-based reconstruction or synthesis by utilizing three priors: shared high-frequency prior available in the source contrast to preserve high-spatial-frequency details, low-frequencyPrior available inThe undersampling target contrast to prevent feature leakage/loss, and perceptual prior to improve recovery of high-level features.
Abstract: Multi-contrast MRI acquisitions of an anatomy enrich the magnitude of information available for diagnosis. Yet, excessive scan times associated with additional contrasts may be a limiting factor. Two mainstream frameworks for enhanced scan efficiency are reconstruction of undersampled acquisitions and synthesis of missing acquisitions. Recently, deep learning methods have enabled significant performance improvements in both frameworks. Yet, reconstruction performance decreases towards higher acceleration factors with diminished sampling density at high-spatial-frequencies, whereas synthesis can manifest artefactual sensitivity or insensitivity to image features due to the absence of data samples from the target contrast. In this article, we propose a new approach for synergistic recovery of undersampled multi-contrast acquisitions based on conditional generative adversarial networks. The proposed method mitigates the limitations of pure learning-based reconstruction or synthesis by utilizing three priors: shared high-frequency prior available in the source contrast to preserve high-spatial-frequency details, low-frequency prior available in the undersampled target contrast to prevent feature leakage/loss, and perceptual prior to improve recovery of high-level features. Demonstrations on brain MRI datasets from healthy subjects and patients indicate the superior performance of the proposed method compared to pure reconstruction and synthesis methods. The proposed method can help improve the quality and scan efficiency of multi-contrast MRI exams.

93 citations


Journal ArticleDOI
TL;DR: This paper builds a compressed VR image quality (CVIQ) database, and proposes a multi-channel convolutional neural network (CNN) for blind 360-degree image quality assessment (MC360IQA), which achieves the best performance among the state-of-the-art full-reference and no-reference image quality assessment (IQA) models on the CVIQ database and other available 360-degree IQA databases.
Abstract: 360-degree images/videos have become dramatically more common in recent years. Their omnidirectional view results in high resolutions, which makes 360-degree images/videos difficult to transmit and store. To deal with this problem, video coding technologies are used to compress the omnidirectional content, but they introduce compression distortion. Therefore, it is important to study how popular coding technologies affect the quality of 360-degree images. In this paper, we present a study on both subjective and objective quality assessment of compressed virtual reality (VR) images. We first build a compressed VR image quality (CVIQ) database including 16 reference images and 528 compressed ones produced with three prevailing coding technologies. Then, we propose a multi-channel convolutional neural network (CNN) for blind 360-degree image quality assessment (MC360IQA). To be consistent with the visual content seen in the VR device, we project each 360-degree image into six viewport images, which are adopted as inputs of the proposed model. MC360IQA consists of two parts, a multi-channel CNN and an image quality regressor. The multi-channel CNN includes six parallel hyper-ResNet34 networks, where the hyper structure is used to incorporate features from intermediate layers. The image quality regressor fuses the features and regresses them to final scores. The experimental results show that our model achieves the best performance among the state-of-the-art full-reference (FR) and no-reference (NR) image quality assessment (IQA) models on the CVIQ database and other available 360-degree IQA databases.

91 citations
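A simplified sketch of the viewport-based design, assuming torchvision's resnet34 as the per-viewport trunk (recent torchvision API); the paper's hyper connections to intermediate layers and all training details are omitted.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class ViewportIQA(nn.Module):
    """Sketch of a multi-channel viewport model: a shared ResNet34 trunk over six
    viewports, concatenated features, and a small quality regressor. The paper's
    hyper connections to intermediate layers are not reproduced."""
    def __init__(self):
        super().__init__()
        trunk = resnet34(weights=None)
        trunk.fc = nn.Identity()               # 512-d feature per viewport
        self.trunk = trunk
        self.regressor = nn.Sequential(
            nn.Linear(6 * 512, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, viewports):              # (B, 6, 3, H, W)
        feats = [self.trunk(viewports[:, i]) for i in range(6)]
        return self.regressor(torch.cat(feats, dim=1)).squeeze(-1)

score = ViewportIQA()(torch.rand(2, 6, 3, 224, 224))  # (2,) predicted quality scores
```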


Journal ArticleDOI
TL;DR: This paper proposes a novel framework to design an OPtimization-INspired Explicable deep Network, dubbed OPINE-Net, for adaptive sampling and recovery, which allows image blocks to be sampled independently but reconstructed jointly to further enhance the performance.
Abstract: In order to improve the compressed sensing (CS) performance of natural images, in this paper we propose a novel framework to design an OPtimization-INspired Explicable deep Network, dubbed OPINE-Net, for adaptive sampling and recovery. Both orthogonal and binary constraints on the sampling matrix are incorporated into OPINE-Net simultaneously. In particular, OPINE-Net is composed of three subnets: a sampling subnet, an initialization subnet and a recovery subnet, and all the parameters in OPINE-Net (e.g. sampling matrix, nonlinear transforms, shrinkage threshold) are learned end-to-end, rather than hand-crafted. Moreover, considering the relationship among neighboring blocks, an enhanced version OPINE-Net$^+$ is developed, which allows image blocks to be sampled independently but reconstructed jointly to further enhance the performance. In addition, some interesting findings about the learned sampling matrix are presented. Compared with existing state-of-the-art network-based CS methods, the proposed hardware-friendly OPINE-Nets not only achieve better performance but also require much fewer parameters and much less storage space, while maintaining a real-time running speed.
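A toy unrolled recovery in the same spirit, with a learned sampling matrix and learnable step size and shrinkage threshold; OPINE-Net's orthogonal/binary constraints on the sampling matrix and its convolutional transforms are not reproduced here.

```python
import torch
import torch.nn as nn

class UnrolledBCS(nn.Module):
    """Toy block-CS sketch: a learned sampling matrix Phi plus K unrolled
    ISTA-style steps with a learnable step size and soft-threshold level.
    Constraints and conv transforms from OPINE-Net are omitted."""
    def __init__(self, n=256, m=64, K=5):
        super().__init__()
        self.phi = nn.Parameter(torch.randn(m, n) / n ** 0.5)  # sampling matrix
        self.step = nn.Parameter(torch.tensor(0.5))
        self.theta = nn.Parameter(torch.tensor(0.01))
        self.K = K

    def forward(self, x_blocks):                 # (B, n) vectorized image blocks
        y = x_blocks @ self.phi.t()              # sampling
        x = y @ self.phi                         # initialization: Phi^T y
        for _ in range(self.K):
            grad = (x @ self.phi.t() - y) @ self.phi
            x = x - self.step * grad
            x = torch.sign(x) * torch.clamp(x.abs() - self.theta, min=0)  # soft threshold
        return x

recon = UnrolledBCS()(torch.rand(8, 256))
```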

Journal ArticleDOI
TL;DR: This work introduces simple, yet surprisingly efficient digital forensic methods for audio spoof and visual deepfake detection that combine convolutional latent representations with bidirectional recurrent structures and entropy-based cost functions.
Abstract: Deepfakes, or artificially generated audiovisual renderings, can be used to defame a public figure or influence public opinion. With the recent discovery of generative adversarial networks, an attacker using a normal desktop computer fitted with an off-the-shelf graphics processing unit can make renditions realistic enough to easily fool a human observer. Detecting deepfakes is thus becoming important for reporters, social media platforms, and the general public. In this work, we introduce simple, yet surprisingly efficient digital forensic methods for audio spoof and visual deepfake detection. Our methods combine convolutional latent representations with bidirectional recurrent structures and entropy-based cost functions. The latent representations for both audio and video are carefully chosen to extract semantically rich information from the recordings. By feeding these into a recurrent framework, we can detect both spatial and temporal signatures of deepfake renditions. The entropy-based cost functions work well in isolation as well as in context with traditional cost functions. We demonstrate our methods on the FaceForensics++ and Celeb-DF video datasets and the ASVSpoof 2019 Logical Access audio datasets, achieving new benchmarks in all categories. We also perform extensive studies to demonstrate generalization to new domains and gain further insight into the effectiveness of the new architectures.
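A generic sketch of the convolutional-plus-recurrent recipe for video (per-frame embeddings fed to a bidirectional GRU); the specific latent extractors and the entropy-based cost functions from the paper are not included, and all layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class ConvRecurrentDetector(nn.Module):
    """Sketch: per-frame convolutional embedding -> bidirectional GRU -> real/fake.
    The paper's latent extractors and entropy-based losses are omitted."""
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)

    def forward(self, clip):                      # (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.frame_encoder(clip.flatten(0, 1)).view(B, T, -1)
        _, h = self.rnn(feats)                    # h: (2, B, hidden)
        return self.head(torch.cat([h[0], h[1]], dim=1))

logits = ConvRecurrentDetector()(torch.rand(2, 8, 3, 64, 64))  # (2, 2)
```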

Journal ArticleDOI
Rongzhi Gu1, Shi-Xiong Zhang1, Yong Xu1, Lianwu Chen1, Yuexian Zou2, Dong Yu1 
TL;DR: A general multi-modal framework for target speech separation is proposed by utilizing all the available information of the target speaker, including his/her spatial location, voice characteristics and lip movements, and a factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multiple modalities at the embedding level.
Abstract: Target speech separation refers to extracting a target speaker's voice from the overlapped audio of simultaneous talkers. Previously, the use of the visual modality for target speech separation has demonstrated great potential. This work proposes a general multi-modal framework for target speech separation that utilizes all the available information about the target speaker, including his/her spatial location, voice characteristics and lip movements. Also, under this framework, we investigate fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of the multiple modalities at the embedding level. This method first factorizes the mixture audio into a set of acoustic subspaces, then leverages the target's information from other modalities to enhance these subspace acoustic embeddings with a learnable attention scheme. To validate the robustness of the proposed multi-modal separation model in practical scenarios, the system was evaluated under the condition that one of the modalities is temporarily missing, invalid or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released) that is spatialized by simulated room impulse responses (RIRs). Experimental results illustrate that our proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches, while still supporting real-time processing.

Journal ArticleDOI
TL;DR: The results in reconstruction of the fastMRI knee dataset show that the proposed history-cognizant approach reduces residual aliasing artifacts compared to its conventional unrolled counterpart without requiring extra computational power or increasing reconstruction time.
Abstract: Inverse problems for accelerated MRI typically incorporate domain-specific knowledge about the forward encoding operator in a regularized reconstruction framework. Recently physics-driven deep learning (DL) methods have been proposed to use neural networks for data-driven regularization. These methods unroll iterative optimization algorithms to solve the inverse problem objective function, by alternating between domain-specific data consistency and data-driven regularization via neural networks. The whole unrolled network is then trained end-to-end to learn the parameters of the network. Due to simplicity of data consistency updates with gradient descent steps, proximal gradient descent (PGD) is a common approach to unroll physics-driven DL reconstruction methods. However, PGD methods have slow convergence rates, necessitating a higher number of unrolled iterations, leading to memory issues in training and slower reconstruction times in testing. Inspired by efficient variants of PGD methods that use a history of the previous iterates, in this article, we propose a history-cognizant unrolling of the optimization algorithm with dense connections across iterations for improved performance. In our approach, the gradient descent steps are calculated at a trainable combination of the outputs of all the previous regularization units. We also apply this idea to unrolling variable splitting methods with quadratic relaxation. Our results in reconstruction of the fastMRI knee dataset show that the proposed history-cognizant approach reduces residual aliasing artifacts compared to its conventional unrolled counterpart without requiring extra computational power or increasing reconstruction time.

Journal ArticleDOI
TL;DR: A multichannel forward model is used, consisting of a non-uniform Fourier transform with continuously defined sampling locations, to realize the data consistency block within a model-based deep learning image reconstruction scheme.
Abstract: Modern MRI schemes, which rely on compressed sensing or deep learning algorithms to recover MRI data from undersampled multichannel Fourier measurements, are widely used to reduce the scan time. The image quality of these approaches is heavily dependent on the sampling pattern. In this article, we introduce a continuous strategy to optimize the sampling pattern and the network parameters jointly. We use a multichannel forward model, consisting of a non-uniform Fourier transform with continuously defined sampling locations, to realize the data consistency block within a model-based deep learning image reconstruction scheme. This approach facilitates the joint and continuous optimization of the sampling pattern and the CNN parameters to improve image quality. We observe that the joint optimization of the sampling patterns and the reconstruction module significantly improves the performance of most deep learning reconstruction algorithms. The source code of the proposed joint learning framework is available at https://github.com/hkaggarwal/J-MoDL .

Journal ArticleDOI
TL;DR: DeepCABAC as mentioned in this paper applies a novel quantization scheme that minimizes a rate-distortion function while simultaneously taking the impact of quantization on the DNN performance into account, achieving higher compression rates than previously proposed coding techniques for DNN compression.
Abstract: In the past decade, deep neural networks (DNNs) have shown state-of-the-art performance on a wide range of complex machine learning tasks. Many of these results have been achieved while growing the size of DNNs, creating a demand for their efficient compression and transmission. In this work we present DeepCABAC, a universal compression algorithm for DNNs that is based on applying a Context-based Adaptive Binary Arithmetic Coder (CABAC) to the DNN parameters. CABAC was originally designed for the H.264/AVC video coding standard and became the state of the art for the lossless compression part of video compression. DeepCABAC applies a novel quantization scheme that minimizes a rate-distortion function while simultaneously taking the impact of quantization on the DNN performance into account. Experimental results show that DeepCABAC consistently attains higher compression rates than previously proposed coding techniques for DNN compression. For instance, it is able to compress the VGG16 ImageNet model by a factor of 63.6 with no loss of accuracy, thus representing the entire network with merely 9 MB. The source code for encoding and decoding can be found at https://github.com/fraunhoferhhi/DeepCABAC .
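A toy illustration of rate-distortion-aware quantization in this spirit: pick the scalar step size minimizing distortion plus a weight times an empirical entropy estimate of the quantized coefficients; the actual CABAC coder and the accuracy-aware distortion term are omitted, and the weight values below are synthetic.

```python
import numpy as np

def rd_quantize(weights, lam=1e-4, steps=np.geomspace(1e-4, 1e-1, 30)):
    """Toy rate-distortion quantization sketch: choose the scalar step size that
    minimizes distortion + lam * (empirical entropy rate). The real CABAC entropy
    coder and accuracy-aware distortion from DeepCABAC are not implemented."""
    best = None
    for delta in steps:
        q = np.round(weights / delta)
        distortion = np.mean((weights - q * delta) ** 2)
        _, counts = np.unique(q, return_counts=True)
        p = counts / counts.sum()
        rate = -(p * np.log2(p)).sum()            # bits per weight (entropy estimate)
        cost = distortion + lam * rate
        if best is None or cost < best[0]:
            best = (cost, delta, q * delta)
    return best[1], best[2]

w = np.random.default_rng(0).normal(scale=0.05, size=10000)   # synthetic weights
delta, w_hat = rd_quantize(w)
```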

Journal ArticleDOI
TL;DR: A novel approach that combines unsupervised learning, knowledge transfer and hierarchical attention for the task of speech-based depression severity measurement, a Hierarchical Attention Transfer Network (HATN), uses hierarchical attention autoencoders to learn attention from a source task (speech recognition) and then transfers this knowledge into a depression analysis system.
Abstract: Early interventions in mental health conditions such as Major Depressive Disorder (MDD) are critical to improved health outcomes, as they can help reduce the burden of the disease. As the efficient diagnosis of depression severity is therefore highly desirable, the use of behavioural cues such as speech characteristics in diagnosis is attracting increasing interest in the field of quantitative mental health research. However, despite the widespread use of machine learning methods in the depression analysis community, the lack of adequate labelled data has become a bottleneck preventing the broader application of techniques such as deep learning. Accordingly, we herein describe a deep learning approach that combines unsupervised learning, knowledge transfer and hierarchical attention for the task of speech-based depression severity measurement. Our novel approach, a Hierarchical Attention Transfer Network (HATN), uses hierarchical attention autoencoders to learn attention from a source task (speech recognition) and then transfers this knowledge into a depression analysis system. Experiments based on the depression sub-challenge dataset of the Audio/Visual Emotion Challenge (AVEC) 2017 demonstrate the effectiveness of our proposed model. On the test set, our technique outperformed other speech-based systems presented in the literature, achieving a Root Mean Square Error (RMSE) of 5.51 and a Mean Absolute Error (MAE) of 4.20 on a Patient Health Questionnaire (PHQ)-8 scale [0, 24]. To the best of our knowledge, these scores represent the best-known speech results on the AVEC 2017 depression corpus to date.

Journal ArticleDOI
TL;DR: It was observed that the performance achieved with the studied glottal source features is comparable or better than that of conventional MFCCs and perceptual linear prediction (PLP) features, which indicates the complementary nature of the features.
Abstract: Automatic detection of voice pathology enables objective assessment and earlier intervention for the diagnosis. This study provides a systematic analysis of glottal source features and investigates their effectiveness in voice pathology detection. Glottal source features are extracted using glottal flows estimated with the quasi-closed phase (QCP) glottal inverse filtering method, using approximate glottal source signals computed with the zero frequency filtering (ZFF) method, and using acoustic voice signals directly. In addition, we propose to derive mel-frequency cepstral coefficients (MFCCs) from the glottal source waveforms computed by QCP and ZFF to effectively capture the variations in glottal source spectra of pathological voice. Experiments were carried out using two databases, the Hospital Universitario Principe de Asturias (HUPA) database and the Saarbrucken Voice Disorders (SVD) database. Analysis of features revealed that the glottal source contains information that discriminates normal and pathological voice. Pathology detection experiments were carried out using support vector machine (SVM). From the detection experiments it was observed that the performance achieved with the studied glottal source features is comparable or better than that of conventional MFCCs and perceptual linear prediction (PLP) features. The best detection performance was achieved when the glottal source features were combined with the conventional MFCCs and PLP features, which indicates the complementary nature of the features.

Journal ArticleDOI
TL;DR: A novel end-to-end multisetting MR image synthesis method based on generative adversarial networks (GANs) - a deep learning model that can produce high quality synthesized images.
Abstract: In magnetic resonance imaging (MRI), several images can be obtained using different imaging settings (e.g. T1, T2, DWI, and FLAIR). These images have similar anatomical structures but different contrasts, which provides a wealth of information for diagnosis. However, the images under specific imaging settings may not be available due to limited scanning time or corruption caused by noise. It is therefore attractive to derive the missing images of some settings from the available MR images. In this paper, we propose a novel end-to-end multisetting MR image synthesis method. The proposed method is based on generative adversarial networks (GANs), a deep learning model. In the proposed method, the MR images obtained with different settings are used as inputs to a GAN, and each image is encoded by an encoder. Each encoder includes a refinement structure which is used to extract a multiscale feature map from an input image. The multiscale feature maps from different input images are then fused to generate several desired target images under specific settings. Because the resultant images obtained with GANs have blurred edges, we fuse gradient prior information into the model to protect high-frequency information such as important tissue textures of medical images. In the proposed model, the multiscale information is also adopted in the adversarial learning (not just in the generator or discriminator) so that we can produce high-quality synthesized images. We evaluated the proposed method on two public datasets: BRATS and ISLES. Experimental results demonstrate that the proposed approach is superior to current state-of-the-art methods.

Journal ArticleDOI
TL;DR: This article reviews existing speech and language features used in this domain, covering language diversity, syntactic complexity, semantic coherence, and timing, and proposes new research directions to further advance the field.
Abstract: It is widely accepted that information derived from analyzing speech (the acoustic signal) and language production (words and sentences) serves as a useful window into the health of an individual's cognitive ability. In fact, most neuropsychological testing batteries have a component related to speech and language where clinicians elicit speech from patients for subjective evaluation across a broad set of dimensions. With advances in speech signal processing and natural language processing, there has been recent interest in developing tools to detect more subtle changes in cognitive-linguistic function. This work relies on extracting a set of features from recorded and transcribed speech for objective assessments of speech and language, early diagnosis of neurological disease, and tracking of disease after diagnosis. With an emphasis on cognitive and thought disorders, in this paper we provide a review of existing speech and language features used in this domain, discuss their clinical application, and highlight their advantages and disadvantages. Broadly speaking, the review is split into two categories: language features based on natural language processing and speech features based on speech signal processing. Within each category, we consider features that aim to measure complementary dimensions of cognitive-linguistics, including language diversity, syntactic complexity, semantic coherence, and timing. We conclude the review with a proposal of new research directions to further advance the field.

Journal ArticleDOI
TL;DR: The results show that SIMBA can significantly reduce the computational burden of 3D image formation without sacrificing the imaging quality and the theoretical fixed-point convergence of SIMBA under nonexpansive denoisers for convex data-fidelity terms is established.
Abstract: Two features desired in a three-dimensional (3D) optical tomographic image reconstruction algorithm are the ability to reduce imaging artifacts and to do fast processing of large data volumes. Traditional iterative inversion algorithms are impractical in this context due to their heavy computational and memory requirements. We propose and experimentally validate a novel scalable iterative minibatch algorithm (SIMBA) for fast and high-quality optical tomographic imaging. SIMBA enables high-quality imaging by combining two complementary information sources: the physics of the imaging system characterized by its forward model and the imaging prior characterized by a denoising deep neural net. SIMBA easily scales to very large 3D tomographic datasets by processing only a small subset of measurements at each iteration. We establish the theoretical fixed-point convergence of SIMBA under nonexpansive denoisers for convex data-fidelity terms. We validate SIMBA on both simulated and experimentally collected intensity diffraction tomography (IDT) datasets. Our results show that SIMBA can significantly reduce the computational burden of 3D image formation without sacrificing the imaging quality.
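A NumPy sketch of a minibatch plug-and-play iteration of this kind: each step uses a random subset of measurement operators for the data-fidelity gradient, followed by a denoising step; the operators and denoiser below are toy stand-ins rather than the IDT forward model or the learned denoiser from the paper.

```python
import numpy as np

def simba_like(y_list, A_list, denoise, x0, gamma=1e-3, batch=4, n_iter=200, seed=0):
    """Minibatch plug-and-play sketch: each iteration uses a random subset of the
    per-measurement operators, then applies a denoiser. Operators and denoiser
    are toy stand-ins, not the IDT physics model or a deep denoiser."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_iter):
        idx = rng.choice(len(A_list), size=batch, replace=False)
        grad = sum(A_list[i].T @ (A_list[i] @ x - y_list[i]) for i in idx) / batch
        x = denoise(x - gamma * grad)
    return x

rng = np.random.default_rng(1)
x_true = np.zeros(100); x_true[::7] = 1.0
A_list = [rng.standard_normal((10, 100)) for _ in range(20)]   # toy measurement operators
y_list = [A @ x_true for A in A_list]
soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - 1e-3, 0)  # toy "denoiser"
x_hat = simba_like(y_list, A_list, soft, x0=np.zeros(100))
```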

Journal ArticleDOI
Ke Tan1, Yong Xu2, Shi-Xiong Zhang2, Meng Yu2, Dong Yu2 
TL;DR: This study addresses joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation, and proposes a novel multimodal network that exploits both audio and visual signals.
Abstract: Background noise, interfering speech and room reverberation frequently distort target speech in real listening environments. In this study, we address joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation. In order to tackle this fundamentally difficult problem, we propose a novel multimodal network that exploits both audio and visual signals. The proposed network architecture adopts a two-stage strategy, where a separation module is employed to attenuate background noise and interfering speech in the first stage and a dereverberation module to suppress room reverberation in the second stage. The two modules are first trained separately, and then integrated for joint training, which is based on a new multi-objective loss function. Our experimental results show that the proposed multimodal network yields consistently better objective intelligibility and perceptual quality than several one-stage and two-stage baselines. We find that our network achieves a 21.10% improvement in ESTOI and a 0.79 improvement in PESQ over the unprocessed mixtures. Moreover, our network architecture does not require the knowledge of the number of speakers.

Journal ArticleDOI
TL;DR: In this article, a dual dynamic inference (DDI) framework is proposed to integrate both input-dependent and resource-dependent dynamic inference mechanisms under a unified framework in order to fit the varying IoT resource requirements in practice.
Abstract: State-of-the-art convolutional neural networks (CNNs) yield record-breaking predictive performance, yet at the cost of high-energy-consumption inference, which prohibits their wide deployment in resource-constrained Internet of Things (IoT) applications. We propose a dual dynamic inference (DDI) framework that highlights the following aspects: 1) we integrate both input-dependent and resource-dependent dynamic inference mechanisms under a unified framework in order to fit the varying IoT resource requirements in practice; DDI is able both to constantly suppress unnecessary costs for easy samples and to halt inference for all samples when hard resource constraints are enforced; 2) we propose a flexible multi-grained learning-to-skip (MGL2S) approach for input-dependent inference which allows simultaneous layer-wise and channel-wise skipping; 3) we extend DDI to complex CNN backbones such as DenseNet and show that DDI can be applied towards optimizing any specific resource goal, including inference latency and energy cost. Extensive experiments demonstrate the superior inference accuracy-resource trade-off achieved by DDI, as well as the flexibility to control such a trade-off compared to existing peer methods. Specifically, DDI can achieve up to 4× computational savings with the same or even higher accuracy compared to existing competitive baselines.

Journal ArticleDOI
TL;DR: This paper addresses the problem of semantic segmentation in cardiac MR images using a dilated Convolutional Neural Network, and outperforms other methods featuring dilated convolutions in this challenge up until now.
Abstract: Semantic segmentation of cardiac MR images is a challenging task that is important for the medical assessment of heart diseases. With a detailed localization of specific regions of interest such as the Right and Left Ventricular Cavities and the Myocardium, doctors can infer important information about the presence of cardiovascular diseases, which are today a major cause of death globally. This paper addresses the problem of semantic segmentation in cardiac MR images using a dilated Convolutional Neural Network. Opting for dilated convolutions allowed us to work in full resolution throughout the network's layers, preserving localization accuracy, while maintaining a relatively small number of trainable parameters. To assist the network's training process we designed a custom loss function. Furthermore, we developed new augmentation techniques and also adapted existing ones, to cope with the lack of sufficient training images. Consequently, the training set grows not only in size but also in diversity, and the network trains quickly and efficiently without overfitting. Our pre- and post-processing steps are also crucial to the whole process. We apply our methodology to the Right and Left Ventricles (RV, LV) and the Myocardium (MYO) according to the Automated Cardiac Diagnosis Challenge (ACDC) with promising results. Submitting our algorithm's predictions to the Post-2017-MICCAI-challenge testing phase, we achieved similar scores (average Dice coefficient 0.916) on the test data set compared to the state of the art featured in the ACDC leaderboard, but with significantly fewer parameters than the leading method. Our approach outperforms other methods featuring dilated convolutions in this challenge up until now.
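To illustrate why dilation helps here, the following PyTorch sketch stacks 3x3 convolutions with growing dilation so the receptive field expands while the feature maps stay at full resolution; the depth, channel counts, and class count are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

def dilated_segmenter(in_ch=1, n_classes=4):
    """Full-resolution segmentation sketch: stacked 3x3 convolutions with growing
    dilation enlarge the receptive field without any pooling, so localization is
    preserved. Depth and channel counts are illustrative only."""
    layers, ch = [], in_ch
    for dilation in (1, 1, 2, 4, 8, 16):
        layers += [nn.Conv2d(ch, 32, 3, padding=dilation, dilation=dilation),
                   nn.BatchNorm2d(32), nn.ReLU()]
        ch = 32
    layers.append(nn.Conv2d(ch, n_classes, 1))     # per-pixel class scores
    return nn.Sequential(*layers)

net = dilated_segmenter()
logits = net(torch.rand(1, 1, 224, 224))           # (1, 4, 224, 224), same spatial size
```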

Journal ArticleDOI
TL;DR: In the study, 450 distorted images obtained from 15 pristine 3D VR images modified by 6 types of distortion of varying severities were evaluated by 42 subjects in a controlled VR setting and made available as part of the new database, in hopes that the relationships between gaze direction and perceived quality might be better understood.
Abstract: Virtual Reality (VR) and its applications have attracted significant and increasing attention. However, the requirements of much larger file sizes, different storage formats, and immersive viewing conditions pose significant challenges to the goals of acquiring, transmitting, compressing and displaying high quality VR content. Towards meeting these challenges, it is important to be able to understand the distortions that arise and that can affect the perceived quality of displayed VR content. It is also important to develop ways to automatically predict VR picture quality. Meeting these challenges requires basic tools in the form of large, representative subjective VR quality databases on which VR quality models can be developed and which can be used to benchmark VR quality prediction algorithms. Towards making progress in this direction, here we present the results of an immersive 3D subjective image quality assessment study. In the study, 450 distorted images obtained from 15 pristine 3D VR images modified by 6 types of distortion of varying severities were evaluated by 42 subjects in a controlled VR setting. Both the subject ratings as well as eye tracking data were recorded and made available as part of the new database, in hopes that the relationships between gaze direction and perceived quality might be better understood. We also evaluated several publicly available IQA models on the new database, and also report a statistical evaluation of the performances of the compared IQA models.

Journal ArticleDOI
TL;DR: A novel greedy approach called cluster pruning is proposed, which provides a structured way of removing filters in a CNN by considering the importance of filters and the underlying hardware architecture, and outperforms the conventional filter pruning methodology.
Abstract: Even though Convolutional Neural Networks (CNNs) have shown superior results in the field of computer vision, it is still a challenging task to implement computer vision algorithms in real time at the edge, especially on a low-cost IoT device, due to the high memory consumption and computational complexity of a CNN. Network compression methodologies such as weight pruning, filter pruning, and quantization are used to overcome this problem. Even though filter pruning has shown better performance compared to other techniques, the irregularity of the number of filters pruned across different layers of a CNN may not comply with the majority of neural computing hardware architectures. In this paper, a novel greedy approach called cluster pruning is proposed, which provides a structured way of removing filters in a CNN by considering the importance of filters and the underlying hardware architecture. The proposed methodology is compared with the conventional filter pruning algorithm on the Pascal-VOC open dataset and the Head-Counting dataset, our own dataset developed to detect and count people entering a room. We benchmark our proposed method on three hardware architectures, namely CPU, GPU, and the Intel Movidius Neural Compute Stick (NCS), using the popular SSD-MobileNet and SSD-SqueezeNet neural network architectures used for edge-AI vision applications. Results demonstrate that our method outperforms the conventional filter pruning methodology on both datasets and on the above-mentioned hardware architectures. Furthermore, a low-cost IoT hardware setup consisting of an Intel Movidius NCS is proposed to deploy an edge-AI application using our proposed pruning methodology.
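An illustrative (and heavily simplified) per-layer pass in this spirit using scikit-learn: cluster filters by an importance score and prune the weakest within each cluster so removals stay structured; the L1-norm importance measure, cluster count, and keep ratio are assumptions, not the paper's hardware-aware criteria.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_prune_indices(conv_weight, n_clusters=4, keep_ratio=0.75, seed=0):
    """Illustrative cluster pruning for one conv layer (out_ch, in_ch, k, k):
    cluster filters by their L1 norm, then drop the weakest filters within each
    cluster so every cluster keeps roughly keep_ratio of its members."""
    norms = np.abs(conv_weight).reshape(conv_weight.shape[0], -1).sum(axis=1)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(
        norms.reshape(-1, 1))
    keep = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        n_keep = max(1, int(round(keep_ratio * len(members))))
        keep.extend(members[np.argsort(norms[members])[::-1][:n_keep]])
    return np.sort(np.array(keep))                 # indices of filters to keep

w = np.random.randn(64, 32, 3, 3)                  # synthetic conv weights
print(cluster_prune_indices(w).shape)
```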

Journal ArticleDOI
TL;DR: A novel detection algorithm is designed to exploit a structural defect in GANs, taking advantage of the most vulnerable link in GAN generators – the up-sampling process conducted by the Transposed Convolution operation.
Abstract: With generative adversarial networks (GANs) achieving realistic image generation, fake image detection research has become an imminent need. In this paper, a novel detection algorithm is designed to exploit a structural defect in GANs, taking advantage of the most vulnerable link in GAN generators - the up-sampling process conducted by the Transposed Convolution operation. The Transposed Convolution in this process causes a lack of global information in the generated images. Therefore, the Self-Attention mechanism is adopted correspondingly, equipping the algorithm with a much better comprehension of global information than other current works adopting pure CNN networks, which is reflected in a significant increase in detection accuracy. Through a thorough comparison with current works and a corresponding careful analysis, it is verified that our proposed algorithm outperforms other current works in the field. Also, with experiments conducted on other image-generation categories and on images that have undergone common real-life post-processing methods, our proposed algorithm shows decent robustness across various categories of images under different real-world circumstances, rather than being restricted to particular image types and pure laboratory situations.

Journal ArticleDOI
TL;DR: This paper introduces a novel graph-based representation of an image that captures key forensic relationships among regions in the image, which is called the Forensic Similarity Graph, and presents two community detection techniques, adapted from literature, to detect and localize image forgeries.
Abstract: In this paper, we propose new image forgery detection and localization algorithms by recasting these problems as graph-based community detection problems. To do this, we introduce a novel graph-based representation of an image, which we call the Forensic Similarity Graph, that captures key forensic relationships among regions in the image. In this representation, small image patches are represented by graph vertices with edges assigned according to the forensic similarity between patches. Localized tampering introduces unique structure into this graph, which aligns with a concept called “community structure” in graph-theory literature. In the Forensic Similarity Graph, communities correspond to the tampered and unaltered regions in the image. As a result, forgery detection is performed by identifying whether multiple communities exist, and forgery localization is performed by partitioning these communities. We present two community detection techniques, adapted from literature, to detect and localize image forgeries. We experimentally show that our proposed community detection methods outperform existing state-of-the-art forgery detection and localization methods, which do not capture such community structure.
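A sketch of the graph-and-community step using networkx; the pairwise "forensic similarity" values below are random placeholders standing in for the learned similarity network applied to image patches, and the edge threshold is illustrative.

```python
import itertools
import networkx as nx
import numpy as np
from networkx.algorithms.community import greedy_modularity_communities

# Placeholder "forensic similarity" between patches; in the paper this comes from
# a learned pairwise similarity network applied to small image patches.
rng = np.random.default_rng(0)
n_patches = 30
tampered = set(range(22, 30))                      # pretend the last 8 patches are forged

G = nx.Graph()
G.add_nodes_from(range(n_patches))
for i, j in itertools.combinations(range(n_patches), 2):
    same_region = (i in tampered) == (j in tampered)
    sim = rng.uniform(0.7, 1.0) if same_region else rng.uniform(0.0, 0.3)
    if sim > 0.5:                                  # edge threshold (illustrative)
        G.add_edge(i, j, weight=sim)

communities = greedy_modularity_communities(G, weight="weight")
# More than one community suggests a forgery; each community localizes a region.
print(len(communities), [sorted(c) for c in communities])
```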

Journal ArticleDOI
TL;DR: A two-stage student-teacher approach is presented to make state-of-the-art neural networks for sound event detection fit on current microcontrollers, and the approach is tested on an ARM Cortex M4, particularly focusing on issues related to 8-bits quantization.
Abstract: Outdoor acoustic event detection is an exciting research field but challenged by the need for complex algorithms and deep learning techniques, typically requiring many computational, memory, and energy resources. These challenges discourage IoT implementations, where an efficient use of resources is required. However, current embedded technologies and microcontrollers have increased their capabilities without penalizing energy efficiency. This paper addresses the application of sound event detection at the very edge, by optimizing deep learning techniques on resource-constrained embedded platforms for the IoT. The contribution is two-fold: firstly, a two-stage student-teacher approach is presented to make state-of-the-art neural networks for sound event detection fit on current microcontrollers; secondly, we test our approach on an ARM Cortex M4, particularly focusing on issues related to 8-bits quantization. Our embedded implementation can achieve 68% accuracy in recognition on Urbansound8k, not far from state-of-the-art performance, with an inference time of 125 ms for each second of the audio stream, and power consumption of 5.5 mW in just 34.3 kB of RAM.
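A minimal soft-target distillation loss of the kind used in student-teacher training, written in PyTorch; the paper's two-stage schedule, model architectures, and 8-bit microcontroller deployment details are not shown, and the temperature and mixing weight are generic defaults.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard soft-target distillation: KL between temperature-softened teacher
    and student distributions, blended with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Synthetic logits and labels just to show the call signature.
loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```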

Journal ArticleDOI
TL;DR: This paper introduces BB-UNet (Bounding Box U-Net), a deep learning model that integrates location as well as shape priors into model training; it outperforms state-of-the-art methods in fully supervised learning frameworks and achieves relevant results in the weakly supervised setting.
Abstract: Medical image segmentation is the process of anatomically isolating organs for analysis and treatment. Leading works within this domain emerged with the well-known U-Net. Despite its success, recent works have shown the limitations of U-Net when conducting segmentation under image particularities such as noise, corruption or lack of contrast. Integrating prior knowledge makes it possible to overcome segmentation ambiguities. This paper introduces BB-UNet (Bounding Box U-Net), a deep learning model that integrates location as well as shape priors into model training. The proposed model is inspired by U-Net and incorporates priors through a novel convolutional layer introduced at the level of the skip connections. The proposed architecture introduces attention kernels into the training process in order to guide the model on where to look for the organs. Moreover, it fine-tunes the encoder layers based on positional constraints. The proposed model is exploited within two main paradigms: as a solo model in a fully supervised framework and as an ancillary model in a weakly supervised setting. In the current experiments, manual bounding boxes are fed at inference, and as such BB-UNet is exploited in a semi-automatic setting; however, BB-UNet has the potential to be part of a fully automated process if it relies on a preliminary object detection step. To validate the performance of the proposed model, experiments are conducted on two public datasets: the SegTHOR dataset, which focuses on the segmentation of thoracic organs at risk in computed tomography (CT) images, and the Cardiac dataset, a mono-modal MRI dataset released as part of the Decathlon challenge and dedicated to segmentation of the left atrium. Results show that the proposed method outperforms state-of-the-art methods in fully supervised learning frameworks and achieves relevant results in the weakly supervised setting.
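To make the skip-connection prior concrete, here is a simplified PyTorch sketch (not the paper's exact layer): encoder features are masked by a rasterized bounding-box prior and convolved before being passed to the decoder; channel counts and the binary-mask encoding of the prior are assumptions.

```python
import torch
import torch.nn as nn

class BoxPriorSkip(nn.Module):
    """Simplified skip-connection block: multiply encoder features by a binary
    bounding-box mask, then convolve, so the decoder is steered toward the boxed
    region. Channel counts and the prior encoding are illustrative."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())

    def forward(self, enc_feat, box_mask):       # (B, C, H, W), (B, 1, H, W) in {0, 1}
        return self.conv(enc_feat * box_mask)

feat = torch.rand(1, 64, 32, 32)
mask = torch.zeros(1, 1, 32, 32); mask[..., 8:24, 10:28] = 1.0   # rasterized bounding box
skip = BoxPriorSkip()(feat, mask)                # used in place of the plain skip feature
```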