
Showing papers in "Signal Processing: Image Communication" in 2020


Journal ArticleDOI
TL;DR: In this article, a comprehensive and in-depth survey of deep learning-based underwater image enhancement is provided, covering various perspectives ranging from algorithms to open issues, and a qualitative and quantitative comparison of deep algorithms on diverse datasets is conducted to serve as a benchmark.
Abstract: The powerful representation capacity of deep learning has made it inevitable for the underwater image enhancement community to employ its potential. The exploration of deep underwater image enhancement networks is increasing over time; hence, a comprehensive survey is the need of the hour. In this paper, our aim is two-fold: (1) to provide a comprehensive and in-depth survey of deep learning-based underwater image enhancement, covering various perspectives ranging from algorithms to open issues, and (2) to conduct a qualitative and quantitative comparison of the deep algorithms on diverse datasets to serve as a benchmark, which has barely been explored before. We first introduce the underwater image formation models, which are the basis of training data synthesis and the design of deep networks, and are also helpful for understanding the process of underwater image degradation. Then, we review deep underwater image enhancement algorithms and present a glimpse of several aspects of current networks, including architecture, parameters, training data, loss functions, and training configurations. We also summarize the evaluation metrics and underwater image datasets. Following that, a systematic experimental comparison is carried out to analyze the robustness and effectiveness of deep algorithms. Meanwhile, we point out the shortcomings of current benchmark datasets and evaluation metrics. Finally, we discuss several unsolved open issues and suggest possible research directions. We hope that the efforts made in this paper can serve as a comprehensive reference for future research and a call for the development of deep learning-based underwater image enhancement.

141 citations


Journal ArticleDOI
TL;DR: This work proposes a two-branch network that compensates for globally distorted color and locally reduced contrast, respectively, and designs a compressed-histogram equalization to complement the data-driven deep learning, whose parameters are fixed after training.
Abstract: Due to light absorption and scattering, captured underwater images usually contain severe color distortion and contrast reduction. To address these problems, we combine the merits of deep learning and conventional image enhancement technology to improve underwater image quality. We first propose a two-branch network that compensates for the globally distorted color and the locally reduced contrast, respectively. Adopting this global–local network greatly eases the learning problem, so that it can be handled by a lightweight network architecture. To cope with the complex and changeable underwater environment, we then design a compressed-histogram equalization to complement the data-driven deep learning, whose parameters are fixed after training. The proposed compression strategy is able to generate vivid results without introducing over-enhancement or extra computational burden. Experiments demonstrate that our method significantly outperforms several state-of-the-art methods in both qualitative and quantitative evaluations.
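
As a rough illustration of the global–local idea, here is a minimal PyTorch sketch; the layer sizes, the per-channel gain formulation, and all names are illustrative assumptions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class GlobalLocalNet(nn.Module):
    """Sketch of a two-branch enhancer: global color correction + local contrast residual."""
    def __init__(self):
        super().__init__()
        # Global branch: pool the whole image to estimate per-channel color gains.
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(3, 16, 1), nn.ReLU(),
            nn.Conv2d(16, 3, 1), nn.Sigmoid())
        # Local branch: a shallow fully convolutional net for contrast residuals.
        self.local_branch = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1))

    def forward(self, x):
        gains = self.global_branch(x)                 # (N, 3, 1, 1) color correction
        return torch.clamp(x * gains + self.local_branch(x), 0.0, 1.0)

enhanced = GlobalLocalNet()(torch.rand(1, 3, 64, 64))  # enhanced image in [0, 1]
```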

110 citations


Journal ArticleDOI
TL;DR: A new color image encryption scheme based on DNA operations and a spatiotemporal chaotic system is presented; the key streams are associated with the secret keys and the plain image, which makes the cryptosystem plain-image dependent and improves its ability to resist known-plaintext and chosen-plaintext attacks.
Abstract: In this paper, a new color image encryption scheme based on DNA operations and a spatiotemporal chaotic system is presented. Firstly, to hide the distribution information of the plain image, we convert the plain image into three DNA matrices based on DNA random encoding rules. Then, the DNA matrices are combined into a new matrix, which is permuted by a scramble matrix generated by the mixed linear-nonlinear coupled map lattices (MLNCML) system. The key streams are associated with the secret keys and the plain image, which makes the cryptosystem plain-image dependent and improves its ability to resist known-plaintext and chosen-plaintext attacks. Thereafter, to resist statistical attacks, the scrambled matrix is decomposed into three matrices and diffused by DNA deletion-insertion operations. Finally, the three matrices are decoded based on DNA random decoding rules and recombined into the three channels of the cipher image. Simulation results demonstrate that the proposed image cryptosystem has good security and can resist various potential attacks.
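
To make the DNA-coding step concrete, here is a minimal numpy sketch of one of the eight standard complementary encoding rules; the paper selects rules randomly, so this fixed rule is only for illustration:

```python
import numpy as np

# One valid DNA encoding rule: 00 -> A, 01 -> C, 10 -> G, 11 -> T.
RULE = np.array(list("ACGT"))

def dna_encode(channel: np.ndarray) -> np.ndarray:
    """Map each 8-bit pixel of a color channel to four DNA bases."""
    bits = np.unpackbits(channel.astype(np.uint8).ravel())  # 8 bits per pixel, MSB first
    pairs = bits.reshape(-1, 2)                             # group into 2-bit symbols
    return RULE[pairs[:, 0] * 2 + pairs[:, 1]]              # 4 bases per pixel

print(dna_encode(np.array([[0b00011011]])))  # -> ['A' 'C' 'G' 'T']
```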

102 citations


Journal ArticleDOI
TL;DR: This paper proposes a conditional generative adversarial network (cGAN) in which the clear underwater image is produced by a multi-scale generator, and a dual discriminator is employed to grab local and global semantic information, enforcing the results generated by the multi-scale generator to be realistic and natural.
Abstract: Underwater images play an essential role in acquiring and understanding underwater information. High-quality underwater images can guarantee the reliability of underwater intelligent systems. Unfortunately, underwater images are characterized by low contrast, color casts, blurring, low light, and uneven illumination, which severely affect the perception and processing of underwater information. To improve the quality of acquired underwater images, numerous methods have been proposed, particularly with the emergence of deep learning technologies. However, the performance of underwater image enhancement methods is still unsatisfactory due to the lack of sufficient training data and effective network structures. In this paper, we address this problem with a conditional generative adversarial network (cGAN), in which the clear underwater image is produced by a multi-scale generator. Besides, we employ a dual discriminator to grab local and global semantic information, which enforces the results generated by the multi-scale generator to be realistic and natural. Experiments on real-world and synthetic underwater images demonstrate that the proposed method performs favorably against state-of-the-art underwater image enhancement methods.

71 citations


Journal ArticleDOI
TL;DR: The proposed scheme can reconstruct alterations at extremely high rates (up to 80%), obtaining good quality for the self-recovered altered regions, with higher visual performance than similar state-of-the-art schemes.
Abstract: In this paper, a fragile watermarking scheme for color-image authentication and self-recovery is proposed. The original image is divided into non-overlapping blocks, and for each i-th block, the watermarks used for recovery and authentication are generated and embedded into a different block according to an embedding sequence given by a permutation process. The designed scheme embeds the watermarks generated by each block within the 2 LSBs, after which a bit-adjustment phase is applied to increase the quality of the watermarked image. To increase the quality of the recovered image, we use the bilateral filter in the post-processing stage, which efficiently suppresses noise while preserving image edges. Additionally, high accuracy in the tamper detection process is achieved by employing a hierarchical tamper detection algorithm. Finally, to solve the tampering coincidence problem, three recovery watermarks are embedded in different positions to reconstruct a specific block, and a proposed inpainting algorithm is implemented to regenerate those regions affected by this problem. Simulation results demonstrate that the watermarked images have high quality, and that the proposed scheme can reconstruct alterations at extremely high rates (up to 80%), obtaining good quality for the self-recovered altered regions, with higher visual performance than similar state-of-the-art schemes.
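
For reference, a minimal numpy sketch of the 2-LSB embedding step alone; the watermark generation, the permutation-based embedding sequence, and the bit-adjustment phase of the actual scheme are omitted:

```python
import numpy as np

def embed_2lsb(channel: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Overwrite the two least significant bits of each pixel with watermark bits."""
    flat = channel.astype(np.uint8).ravel()
    pairs = bits.reshape(-1, 2)                              # 2 bits per pixel
    symbols = (pairs[:, 0] * 2 + pairs[:, 1]).astype(np.uint8)
    flat[:len(symbols)] = (flat[:len(symbols)] & 0xFC) | symbols
    return flat.reshape(channel.shape)

cover = np.random.randint(0, 256, (4, 4), dtype=np.uint8)
bits = np.random.randint(0, 2, 32)                           # 32 bits fill all 16 pixels
marked = embed_2lsb(cover, bits)
# Extraction reads the two LSBs back out of every pixel.
extracted = np.stack([(marked.ravel() >> 1) & 1, marked.ravel() & 1], 1).ravel()
assert np.array_equal(extracted, bits)
```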

65 citations


Journal ArticleDOI
TL;DR: In this article, the general structure and classifications of image hashing based tamper detection techniques, together with their properties, are examined, and the evaluation datasets and different performance metrics are discussed.
Abstract: Perceptual hashing is used for multimedia content identification and authentication through perceptual digests based on an understanding of the multimedia content. This paper presents a literature review of image hashing for image authentication over the last decade. The objective of this paper is to provide a comprehensive survey and to highlight the pros and cons of existing state-of-the-art techniques. In this article, the general structure and classifications of image hashing based tamper detection techniques, together with their properties, are examined. Furthermore, the evaluation datasets and different performance metrics are also discussed. The paper concludes with recommendations and good practices drawn from the reviewed techniques.

65 citations


Journal ArticleDOI
TL;DR: This work proposes a VQA model based on multi-objective visual relationship detection, extending the appearance model by the principle of word-vector similarity; it is benchmarked on the DAQUAR dataset and evaluated by accuracy, WUPS@0.0, and WUPS@0.9.
Abstract: Visual question answering (VQA) is a learning task involving two major fields, computer vision and natural language processing. The development of deep learning technology has contributed to the advancement of this research area. Although research on question answering models has made great progress, the low accuracy of VQA models is mainly because current question answering model structures are relatively simple, their attention mechanisms deviate from human attention, and they lack a higher level of logical reasoning ability. In response to these problems, we propose a VQA model based on multi-objective visual relationship detection. Firstly, the appearance feature is used to replace the image features of the original object, and the appearance model is extended by the principle of word-vector similarity. The appearance features and relationship predicates are then fed into the word-vector space and represented by fixed-length vectors. Finally, the concatenation of the image feature and the question vector is fed into the classifier to generate an output answer. Our method is benchmarked on the DAQUAR dataset and evaluated by accuracy, WUPS@0.0, and WUPS@0.9.

61 citations


Journal ArticleDOI
TL;DR: This work proposes a bimodal fusion algorithm for speech emotion recognition, in which facial expression and speech information are optimally fused, achieving better performance than uni-modal emotion recognition.
Abstract: Emotion recognition is a hot research topic in modern intelligent systems. The technique is pervasively used in autonomous vehicles, remote medical service, and human–computer interaction (HCI). Traditional speech emotion recognition algorithms cannot be effectively generalized, since they assume that training and testing data come from the same domain and share the same data distribution. In practice, however, speech data is acquired from different devices and recording environments, so the data may differ significantly in terms of language, emotional types and tags. To solve this problem, in this work we propose a bimodal fusion algorithm for speech emotion recognition, in which facial expression and speech information are optimally fused. We first combine a CNN and an RNN to achieve facial emotion recognition. Subsequently, we leverage MFCCs to convert the speech signal into images, so that an LSTM and a CNN can be used to recognize speech emotion. Finally, we utilize a weighted decision fusion method to fuse the facial expression and speech signals for emotion recognition. Comprehensive experimental results demonstrate that, compared with uni-modal emotion recognition, bimodal feature-based emotion recognition achieves better performance.
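
The final fusion step can be illustrated with a short numpy sketch; the weights and class posteriors below are made-up numbers, and the paper's exact weighting strategy may differ:

```python
import numpy as np

def fuse_decisions(p_face: np.ndarray, p_speech: np.ndarray, w_face: float = 0.5) -> int:
    """Weighted decision fusion: combine class posteriors from both modalities."""
    fused = w_face * p_face + (1.0 - w_face) * p_speech
    return int(np.argmax(fused))

# Posteriors over 4 emotion classes from each classifier (illustrative numbers).
p_face = np.array([0.10, 0.60, 0.20, 0.10])
p_speech = np.array([0.05, 0.30, 0.55, 0.10])
print(fuse_decisions(p_face, p_speech, w_face=0.6))  # -> 1 (face branch dominates)
```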

59 citations


Journal ArticleDOI
TL;DR: A fusion tracking method combining information from RGB and thermal infrared images (RGB-T) is presented, based on the fact that infrared images reveal the thermal radiation of objects and thus provide complementary features; it improves tracking performance significantly compared to methods based on images of a single modality.
Abstract: The task of object tracking is very important due to its various applications. However, most object tracking methods are based on visible images, which may fail when visible images are unreliable, for example when illumination conditions are poor. To address this issue, in this paper a fusion tracking method that combines information from RGB and thermal infrared images (RGB-T) is presented, based on the fact that infrared images reveal the thermal radiation of objects and thus provide complementary features. In particular, a fusion tracking method based on dynamic Siamese networks with multi-layer fusion, termed DSiamMFT, is proposed. Visible and infrared images are first processed by two dynamic Siamese networks, namely the visible and infrared networks, respectively. Then, multi-layer feature fusion is performed to adaptively integrate multi-level deep features between the visible and infrared networks. Response maps produced from different fused layer features are then combined through an elementwise fusion approach to produce the final response map, from which the target can be located. Extensive experiments on large datasets with various challenging scenarios have been conducted. The results demonstrate that the proposed method shows very competitive performance against state-of-the-art RGB-T trackers. The proposed approach also improves tracking performance significantly compared to methods based on images of a single modality.
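
As a sketch of the last fusion stage, response maps from the two modalities can be combined elementwise; the elementwise maximum used below is one simple choice, not necessarily the combination rule of DSiamMFT:

```python
import numpy as np

def fuse_response_maps(maps: list) -> tuple:
    """Elementwise fusion of response maps; the peak of the fused map locates the target."""
    fused = np.maximum.reduce(maps)              # elementwise maximum as one simple choice
    return np.unravel_index(np.argmax(fused), fused.shape)

rgb_resp = np.random.rand(17, 17)                # response from the visible branch
ir_resp = np.random.rand(17, 17)                 # response from the infrared branch
print(fuse_response_maps([rgb_resp, ir_resp]))   # (row, col) of the predicted target
```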

58 citations


Journal ArticleDOI
TL;DR: The probabilistic Kalman filter (PKF) is presented, which takes stored trajectories into account to improve tracking estimation; it has higher accuracy than the standard Kalman filter and can handle widespread problems such as occlusion.
Abstract: The Kalman filter has been successfully applied to tracking moving objects in real-time situations. However, the filter cannot take existing prior knowledge into account to improve its predictions. In moving object tracking, the trajectories of multiple targets in the same environment may be available, which can be viewed as prior knowledge for the tracking procedure. This paper presents the probabilistic Kalman filter (PKF), which takes the stored trajectories into account to improve tracking estimation. The PKF adds an extra stage after the two steps (prediction and update) of the Kalman filter to refine the estimated positions of the targets. The refinement is obtained by applying the Viterbi algorithm to a probabilistic graph constructed from the observed trajectories. The graph is built offline and can be adapted during online tracking. The proposed tracker has higher accuracy than the standard Kalman filter and can handle widespread problems such as occlusion. Another significant achievement of the proposed tracker is tracking objects with anomalous behaviors by drawing inferences from the constructed probabilistic graph. The PKF was applied to several manually built videos and several other video datasets containing severe occlusions, demonstrating significant performance in comparison with other state-of-the-art trackers.
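
For context, the two Kalman steps that the PKF refines are the standard predict/update cycle; a minimal numpy sketch for a 2D constant-velocity tracker (the Viterbi refinement stage itself is not shown):

```python
import numpy as np

# Constant-velocity Kalman filter for 2D tracking: state x = [px, py, vx, vy].
F = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]], float)  # motion model
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)                              # observe position
Q, R = np.eye(4) * 1e-2, np.eye(2) * 1e-1                                      # noise covariances

def kalman_step(x, P, z):
    """One predict/update cycle; the PKF adds a graph-based refinement after this."""
    x, P = F @ x, F @ P @ F.T + Q                    # predict
    S = H @ P @ H.T + R                              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                   # Kalman gain
    x = x + K @ (z - H @ x)                          # update with measurement z
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = np.zeros(4), np.eye(4)
x, P = kalman_step(x, P, z=np.array([1.0, 2.0]))
```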

55 citations


Journal ArticleDOI
TL;DR: Theoretical analysis and experimental results show that the proposed zero-watermarking algorithm achieves a good trade-off between robustness and discriminability, and has certain superiority in terms of security, capacity and time complexity.
Abstract: As one promising solution, zero-watermarking techniques have been proposed to enhance image visual quality and have been applied to protect the intellectual property rights of medical, remote sensing and military images. Owing to their favorable image description capability and geometric invariance, moments and moment invariants have become a popular tool for zero-watermarking. However, two issues of moments-based zero-watermarking methods should be addressed: first, most of them ignore the analysis of and experiments on discriminability, resulting in a high false positive ratio; second, direct computation of the moments from their definition is inefficient, numerically unstable and inaccurate, which severely affects the performance of these moments-based methods. To overcome these two challenges, in this paper we present a Fast Quaternion Generic Polar Complex Exponential Transform (FQGPCET) based color image zero-watermarking algorithm. We first propose a novel computation strategy, i.e., FGPCET, to solve the moment computation problems. We then show that it is possible to generate a robust and discriminative image feature by mixing the low-order QGPCET moments/coefficients. Finally, we develop a new color image zero-watermarking approach using FQGPCET and an asymmetric tent map. Theoretical analysis and experimental results show that the proposed zero-watermarking algorithm achieves a good trade-off between robustness and discriminability, and has certain advantages in terms of security, capacity and time complexity.

Journal ArticleDOI
TL;DR: An HRI model of a robotic arm is proposed for robot arm manipulation, using a 3D SSD architecture for the localization and identification of gestures and arm movements, and a DTW template-matching algorithm to trace dynamic gestures.
Abstract: Human–robot interaction (HRI) has become a research hotspot in computer vision and robotics due to its wide application in the human–computer interaction (HCI) domain. Based on the explored algorithms of gesture recognition and limb movement recognition in somatosensory interaction, an HRI model of a robotic arm is proposed for robot arm manipulation. More specifically, a 3D SSD architecture is used for the localization and identification of gestures and arm movements. Then, a DTW (dynamic time warping) template-matching algorithm is adopted to trace the dynamic gestures. Interactive scenarios and interactive modes are designed for the experiments and implementation. Virtual interactive experimental results demonstrate the usefulness of our method.
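
DTW itself is a classic algorithm; below is a minimal numpy sketch of the template matching it performs (the feature dimensions and sequence lengths are arbitrary illustrative choices):

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic time warping distance between two gesture feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-to-frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

template = np.random.rand(20, 3)   # stored gesture template (20 frames, 3-D features)
query = np.random.rand(25, 3)      # observed gesture of a different length
print(dtw_distance(query, template))
```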

Journal ArticleDOI
TL;DR: A novel fusion method that combines separable dictionary optimization with Gabor filters in the non-subsampled contourlet transform (NSCT) domain is proposed, achieving state-of-the-art performance in both visual quality and objective assessment.
Abstract: Sparse representation (SR) has been widely used in image fusion in recent years. However, segmenting the source image into vectors reduces the correlation and structural information of texture in conventional SR methods, and extracting texture with the sliding-window technique is likely to cause spatial inconsistency in flat regions of multi-modality medical fusion images. To solve these problems, a novel fusion method that combines separable dictionary optimization with Gabor filters in the non-subsampled contourlet transform (NSCT) domain is proposed. Firstly, the source images are decomposed into high frequency (HF) and low frequency (LF) components by the NSCT. Then the HF components are sparsely reconstructed by separable dictionaries with iteratively updated sparse coding and dictionary training. In this process, the sparse coefficients and separable dictionaries are updated by orthogonal matching pursuit (OMP) and a manifold-based conjugate gradient method, respectively. Meanwhile, the Gabor energy is utilized as a weighting factor to guide the fusion of the LF components, which further improves the fusion of low-significance features in the flat regions. Finally, the fused components are inverse-transformed by the NSCT to obtain the fusion image. Experimental results demonstrate that the proposal is more competitive, achieving state-of-the-art performance in both visual quality and objective assessment.

Journal ArticleDOI
TL;DR: A four-stream framework is proposed to improve VI-ReId performance, which outperforms the current state of the art by a large margin; its performance is further improved by employing a re-ranking algorithm for post-processing.
Abstract: Visible–infrared cross-modality person re-identification (VI-ReId) is an essential task for video surveillance in poorly illuminated or dark environments. Despite many recent studies on person re-identification in the visible domain (ReId), there are few studies dealing specifically with VI-ReId. Besides challenges that are common to both ReId and VI-ReId, such as pose/illumination variations, background clutter and occlusion, VI-ReId has additional challenges, as color information is not available in infrared images. As a result, the performance of VI-ReId systems is typically lower than that of ReId systems. In this work, we propose a four-stream framework to improve VI-ReId performance. We train a separate deep convolutional neural network in each stream using different representations of the input images. We expect that different and complementary features can be learned from each stream. In our framework, grayscale and infrared input images are used to train the ResNet in the first stream. In the second stream, RGB and three-channel infrared images (created by repeating the infrared channel) are used. In the remaining two streams, we use local pattern maps as input images. These maps are generated utilizing the local Zernike moments transformation. Local pattern maps are obtained from grayscale and infrared images in the third stream and from RGB and three-channel infrared images in the last stream. We improve the performance of the proposed framework by employing a re-ranking algorithm for post-processing. Our results indicate that the proposed framework outperforms the current state of the art by a large margin, improving Rank-1/mAP by 29.79%/30.91% on the SYSU-MM01 dataset and by 9.73%/16.36% on the RegDB dataset.

Journal ArticleDOI
TL;DR: Experiments on the Flickr30k and COCO datasets indicate that the proposed adaptive attention model with a visual sentinel exhibits significant improvement in terms of the BLEU and METEOR evaluation criteria.
Abstract: Considering the image captioning problem, it is difficult to correctly extract the global features of images. At the same time, most attention methods force each word to correspond to an image region, ignoring the phenomenon that words such as "the" in the description text cannot correspond to any image region. To address these problems, an adaptive attention model with a visual sentinel is proposed in this paper. In the encoding phase, the model introduces DenseNet to extract the global features of the image. At the same time, at each time step, a sentinel gate set by the adaptive attention mechanism decides whether to use the image feature information for word generation. In the decoding phase, a long short-term memory (LSTM) network is applied as the language generation model for image captioning to improve the quality of the generated captions. Experiments on the Flickr30k and COCO datasets indicate that the proposed model exhibits significant improvement in terms of the BLEU and METEOR evaluation criteria.
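
The sentinel gate can be summarized in a few lines: at each step the context is a convex mix of the attended image features and the sentinel vector. The sketch below is a generic rendering of this idea with assumed tensor shapes, not the paper's exact implementation:

```python
import torch

def adaptive_context(att_weights, features, beta, sentinel):
    """Mix attended image features with the visual sentinel.

    att_weights: (N, K) attention over K image regions
    features:    (N, K, D) region features
    beta:        (N, 1) sentinel gate in [0, 1]; beta near 1 means "do not look at the image"
    sentinel:    (N, D) visual sentinel vector derived from the LSTM memory
    """
    c = torch.bmm(att_weights.unsqueeze(1), features).squeeze(1)  # (N, D) visual context
    return beta * sentinel + (1.0 - beta) * c

ctx = adaptive_context(torch.softmax(torch.rand(2, 49), -1),
                       torch.rand(2, 49, 512), torch.rand(2, 1), torch.rand(2, 512))
```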

Journal ArticleDOI
TL;DR: The proposed encryption method embeds the encryption into the compression process, in which a small part of the data is encrypted quickly, while maintaining the good coding characteristics of set partitioning in hierarchical trees (SPIHT).
Abstract: In this paper, a novel method for lossless image encryption based on set partitioning in hierarchical trees and cellular automata is proposed. The proposed method embeds the encryption into the compression process, in which a small part of the data is encrypted quickly, while maintaining the good coding characteristics of set partitioning in hierarchical trees (SPIHT). The proposed encryption system adopts three stages of scrambling and diffusion. In each stage of encryption, a different chaotic system is used to generate a plaintext-related key stream to maintain high security and to resist certain attacks. Moreover, the channel length of the coded-and-compressed color image is more uncertain, making it harder for attackers to decipher the algorithm. The experimental results indicate that the length of the bitstream is compressed to 50% of the original image, showing that the proposed algorithm has a higher lossless compression ratio than existing algorithms. Meanwhile, the encryption scheme passes the entropy analysis, sensitivity analysis, lossless recovery test, and SP800-22 test.

Journal ArticleDOI
TL;DR: A public, well-structured and complete dataset, named the Multiview, Multimodal and Multispectral Driver Action Dataset (3MDAD), is put forward; it presents multiple modalities, spectra and views under different times and weather conditions, and driver action recognition results are analyzed independently for each modality and for several combinations of modalities.
Abstract: Driver distraction and fatigue have become leading causes of severe traffic accidents. Hence, driver inattention monitoring systems are crucial. Even with the growing development of advanced driver assistance systems and the introduction of third-level autonomous vehicles, this task remains trending and complex due to challenges such as illumination change and dynamic backgrounds. Only a limited number of public datasets are available to reliably compare and validate driver inattention monitoring methods. In this paper, we put forward a public, well-structured and complete dataset, named the Multiview, Multimodal and Multispectral Driver Action Dataset (3MDAD). The dataset is mainly composed of two sets: the first recorded in the daytime and the second at nighttime. Each set consists of two synchronized data modalities, each from both frontal and side views. More than 60 drivers were asked to execute 16 in-vehicle actions under a wide range of naturalistic driving settings. In contrast to other public datasets, 3MDAD presents multiple modalities, spectra and views under different times and weather conditions. To highlight the utility of our dataset, we independently analyze the driver action recognition results adapted to each modality and those obtained from several combinations of modalities.

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed method outperforms other state-of-the-art methods, and achieves a higher embedding capacity than that of methods based on vacating room after encryption.
Abstract: This paper proposes a novel reversible data hiding method in encrypted images based on a specific encryption process. In the proposed specific encryption algorithm, a stream cipher and prediction errors are combined to vacate room for data embedding. After that, a permutation operation is performed on the encrypted image to improve security. In the embedding process, a large amount of secret data can be embedded in the encrypted image by pixel value expansion, because most of the pixel values are less than 128 after the specific encryption process. At the receiver end, the encrypted image can be recovered from the marked encrypted image without knowing the secret data. Therefore, if the recipient has only the encryption key, the original image can be perfectly recovered; if the recipient has only the data-hiding key, the secret data can be extracted; and if the recipient has both keys, both the original image and the secret data are available. The proposed method achieves a higher embedding capacity than methods based on vacating room after encryption. It does not require the image owner to perform reversible data hiding techniques on the original image, which is more convenient than methods based on reserving room before encryption. Experimental results demonstrate that the proposed method outperforms other state-of-the-art methods.

Journal ArticleDOI
TL;DR: This work proposes a densely connected cascade network that densely connects both the CNN-based sub-networks and the data-consistency sub-networks, thus taking advantage of the data consistency of k-space data in a densely connected fashion for more accurate MRI reconstruction.
Abstract: The progress of convolutional neural network (CNN) based super-resolution has shown its potential in the image processing community. Meanwhile, compressed sensing MRI (CS-MRI) provides the possibility to accelerate the traditional MRI acquisition process. In this work, on the basis of decomposing the cascade network into a series of alternating CNN-based sub-networks and data-consistency sub-networks, we investigate the performance of cascade networks in CS-MRI by employing various CNN-based super-resolution methods in the CNN-based sub-network. Furthermore, recognizing that existing methods of exploring dense connections within the CNN-based sub-network are insufficient to utilize the feature information, we propose a densely connected cascade network for more accurate MRI reconstruction. Specifically, the proposed network densely connects both the CNN-based sub-networks and the data-consistency sub-networks, thus taking advantage of the data consistency of k-space data in a densely connected fashion. Experimental results on various MR data demonstrate that the proposed network is superior to current cascade networks in reconstruction quality.
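
A data-consistency sub-network essentially re-imposes the acquired k-space samples on the CNN output. Below is a minimal noiseless numpy sketch of this idea; real data-consistency layers also handle noise weighting and coil sensitivities, which are omitted here:

```python
import numpy as np

def data_consistency(recon: np.ndarray, k_sampled: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Replace the reconstruction's k-space values with the acquired samples where known."""
    k = np.fft.fft2(recon)
    k = np.where(mask, k_sampled, k)     # keep measured data, trust the CNN elsewhere
    return np.fft.ifft2(k)

image = np.random.rand(128, 128)                 # stand-in for a CNN sub-network output
mask = np.random.rand(128, 128) < 0.3            # 30% sampled k-space locations
k_sampled = np.fft.fft2(image) * mask            # acquired measurements (toy example)
recon = data_consistency(image, k_sampled, mask).real
```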

Journal ArticleDOI
TL;DR: A VR simulator of a forestry crane used for loading logs onto a truck is investigated to study the effects of latency on the subjective experience, with regard to delays in the crane control interface.
Abstract: In this article, we have investigated a VR simulator of a forestry crane used for loading logs onto a truck. We have mainly studied the Quality of Experience (QoE) aspects that may be relevant for task completion, and whether any discomfort-related symptoms are experienced during task execution. QoE experiments were designed to capture the general subjective experience of using the simulator and to study task performance. The focus was to study the effects of latency on the subjective experience, with regard to delays in the crane control interface. Subjective studies were performed with controlled delays added to the display update and the hand controller (joystick) signals. The added delays ranged from 0 to 30 ms for the display update and from 0 to 800 ms for the hand controller. We found a strong effect of latency in the display update and a significant negative effect of the 800 ms added delay in the hand controller (in total approx. 880 ms latency including the system delay). The Simulator Sickness Questionnaire (SSQ) gave significantly higher scores after the experiment than before it, but a majority of the participants reported experiencing only minor symptoms. Some test subjects stopped the test before finishing due to their symptoms, particularly due to the added latency in the display update.

Journal ArticleDOI
TL;DR: In order to transmit multiple images synchronously, an encryption algorithm combining equal modulus decomposition with the quaternion gyrator transform is introduced; the initial conditions are closely related to the plaintext images, which gives the cryptosystem high security.
Abstract: In order to transmit multiple images synchronously, this paper introduces an encryption algorithm combining equal modulus decomposition with the quaternion gyrator transform. Firstly, each color image is encoded into a quaternion-valued matrix for holistic processing. With a chaotic random phase mask, the quaternion gyrator spectrum is obtained. It is subsequently split into a complex-valued interim matrix, and equal modulus decomposition is performed to enhance security, whereby the spectrum is divided into two complex-valued masks. Thereafter, the two sets of phase masks are respectively superimposed and followed by gyrator transforms. Finally, a real-valued matrix is constructed as the final ciphertext image by splicing the real and imaginary parts together. The phase masks, generated using the chaotic system, and the real-valued ciphertext are convenient for storage and transmission. Moreover, the initial conditions are closely related to the plaintext images, which gives the cryptosystem high security. Numerical simulations demonstrate the reliability of the proposed cryptosystem.

Journal ArticleDOI
TL;DR: Quantitative experimental results show that the proposed multi-scale attention convolutional neural network (MSA-CNN) achieves state-of-the-art performance in image-based driver action recognition.
Abstract: Driver distraction has become a global issue, causing a dramatic increase in road accidents and casualties. However, recognizing distracted driving actions remains a challenging task in the field of computer vision, since inter-class variations between different driver action categories are quite subtle. To overcome this difficulty, in this paper, a novel deep learning based approach is proposed to extract fine-grained feature representations for image-based driver action recognition. Specifically, we improve the existing convolutional neural network in two respects: (1) we employ multi-scale convolutional blocks with receptive fields of different kernel sizes to generate hierarchical feature maps, and adopt a maximum selection unit to adaptively combine multi-scale information; (2) we incorporate an attention mechanism to learn pixel saliency and channel saliency among convolutional features, so that it can guide the network to intensify local detail information and suppress global background information. For the experiments, we evaluate the designed architecture on multiple driver action datasets. Quantitative experimental results show that the proposed multi-scale attention convolutional neural network (MSA-CNN) achieves state-of-the-art performance in image-based driver action recognition.
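
The multi-scale block with a maximum selection unit can be sketched in a few lines of PyTorch; the kernel sizes and channel counts here are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel convolutions with different receptive fields, merged by elementwise max."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (1, 3, 5))

    def forward(self, x):
        # Maximum selection unit: keep the strongest response across scales per position.
        return torch.stack([b(x) for b in self.branches]).max(dim=0).values

feats = MultiScaleBlock(3, 16)(torch.rand(1, 3, 64, 64))  # -> (1, 16, 64, 64)
```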

Journal ArticleDOI
Yuchun Fang, Yifan Li, Tu Xiaokang, Taifeng Tan, Xin Wang
TL;DR: The proposed U-Net based method combines Hybrid Dilated Convolution (HDC) and spectral normalization to fill in missing regions of any shape with sharp structures and fine-detailed textures, and can generate realistic and semantically plausible images.
Abstract: Image completion is a challenging task which aims to fill the missing or masked regions in images with plausibly synthesized contents. In this paper, we focus on face image inpainting tasks, aiming at reconstructing missing or damaged regions of an incomplete face image given the context information. We specially design the U-Net architecture to tackle the problem. The proposed U-Net based method combines Hybrid Dilated Convolution (HDC) and spectral normalization to fill in missing regions of any shape with sharp structures and fine-detailed textures. We perform both qualitative and quantitative evaluation on two challenging face datasets. Experimental results demonstrate that our method outperforms previous learning-based inpainting methods. The proposed method can generate realistic and semantically plausible images.

Journal ArticleDOI
TL;DR: The proposed TDFSSD network is trained end to end and outperforms state-of-the-art methods across three datasets; all results show the efficiency of the proposed method for object detection.
Abstract: Object detection across different scales is challenging due to the variance of object scales. Thus, a novel detection network, the Top-Down Feature Fusion Single Shot MultiBox Detector (TDFSSD), is proposed. The proposed network is based on the Single Shot MultiBox Detector (SSD), using VGG-16 as the backbone with a novel, simple yet efficient feature fusion module, namely the Top-Down Feature Fusion Module. The proposed module iteratively fuses higher-level features, which contain semantic information, into lower-level features, which contain boundary information. Extensive experiments have been conducted on the PASCAL VOC2007, PASCAL VOC2012, and MS COCO datasets to demonstrate the efficiency of the proposed method. The proposed TDFSSD network is trained end to end and outperforms state-of-the-art methods across the three datasets. The TDFSSD network achieves 81.7% and 80.1% mAP on VOC2007 and VOC2012 respectively, outperforming the reported best results of both one-stage and two-stage frameworks. In the meantime, it achieves 33.4% mAP on MS COCO test-dev, and notably 17.2% average precision (AP) on small objects. All of these results show the efficiency of the proposed method for object detection. Code and model are available at: https://github.com/dongfengxijian/TDFSSD .
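
One plausible rendering of a single top-down fusion step in PyTorch follows; the channel counts match typical SSD/VGG-16 maps, but the module details (lateral 1x1 convolution, nearest upsampling, elementwise sum) are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Fuse a semantically strong higher-level map into a finer lower-level map."""
    def __init__(self, high_ch: int, low_ch: int):
        super().__init__()
        self.lateral = nn.Conv2d(high_ch, low_ch, 1)   # match channel counts

    def forward(self, high, low):
        up = F.interpolate(self.lateral(high), size=low.shape[-2:], mode='nearest')
        return low + up                                # elementwise sum fusion

low = torch.rand(1, 256, 38, 38)    # e.g. a shallower, finer SSD feature map
high = torch.rand(1, 512, 19, 19)   # a deeper, coarser map with richer semantics
fused = TopDownFusion(512, 256)(high, low)   # -> (1, 256, 38, 38)
```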

Journal ArticleDOI
TL;DR: Experimental results on a remote sensing image quality database from the GeoEye-1 and WorldView-2 satellites show that the proposed model can optimally discover the essential features of the image, effectively extract the high-frequency information at each level of the image, and achieve better overall quality assessment performance than other state-of-the-art methods.
Abstract: Aiming at the problem that remote sensing image quality evaluation models with manually extracted features lack robustness and generality, this paper proposes a 3D CNN-based architecture and nuclear power plant for accurate remote sensing image quality assessment. The model incorporates two sub-networks. The DSVL-based sub-network is employed to extract multi-scale, multi-direction and high-level features by layer-wise training. Afterwards, the extracted feature maps are fused and fed as input data to the second sub-network, which is designed with a 3D CNN architecture and nuclear power plant for remote sensing image quality assessment. Experimental results on a remote sensing image quality database from the GeoEye-1 and WorldView-2 satellites show that the proposed model can optimally discover the essential features of the image, effectively extract the high-frequency information at each level of the image, and achieve better overall quality assessment performance than other state-of-the-art methods.

Journal ArticleDOI
TL;DR: Experimental results and analysis show that the proposed reversible data hiding scheme achieves good performance in capacity and imperceptibility compared with existing methods.
Abstract: By leveraging secret data coding using the remainder storage based exploiting modification direction (RSBEMD) method, and recording pixel change operations via multi-segment left and right histogram shifting, a novel reversible data hiding (RDH) scheme is proposed in this paper. The secret data are first encoded by applying specific pixel change operations to the pixels in groups. After that, multi-segment left and right histogram shifting based on threshold manipulation is implemented to record the pixel change operations. Furthermore, a multiple embedding policy based on chessboard prediction (CBP) and threshold manipulation is put forward, and the threshold can be adjusted to achieve adaptive data hiding. Experimental results and analysis show that the scheme is reversible and achieves good performance in capacity and imperceptibility compared with existing methods.
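
For intuition, a minimal numpy sketch of the classic single-peak histogram shifting that multi-segment left and right shifting generalizes; overflow handling and the recording of pixel change operations are omitted:

```python
import numpy as np

def hs_embed(img: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Single-peak histogram shifting: embed bits at the histogram peak."""
    hist = np.bincount(img.ravel(), minlength=256)
    peak = int(np.argmax(hist))                          # most frequent gray level
    zero = peak + 1 + int(np.argmin(hist[peak + 1:]))    # emptiest bin right of the peak
    out = img.astype(np.int32)
    out[(out > peak) & (out < zero)] += 1                # shift to free the bin peak+1
    carriers = np.flatnonzero(img.ravel() == peak)       # peak pixels carry the payload
    out.ravel()[carriers[:len(bits)]] += bits[:len(carriers)]  # peak decodes as 0, peak+1 as 1
    return out.astype(np.uint8)

cover = np.random.randint(0, 200, (64, 64), dtype=np.uint8)
stego = hs_embed(cover, np.random.randint(0, 2, 30))
```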

Journal ArticleDOI
TL;DR: An adaptive scheme is designed on a cube with opposite colour patterns, with optimisations handling the problem of opposite features being invisible to cameras at opposite positions; based on this, the proposed calibration method does not require a common visible feature point in advance.
Abstract: A camera array placed in a spherical arrangement facilitates a new paradigm for future interactive visual applications, and the calibration of the cameras is a crucial task for such an implementation. It is difficult to apply previous calibration methods to this special camera system because opposite features are commonly found among the opposing cameras used in the system. In this case, part-by-part calibration may fail due to error accumulation. In this paper, we propose an opposite feature based camera calibration method for a camera array in a spherical arrangement. An adaptive scheme is designed on a cube with opposite colour patterns, together with optimisations, to handle the problem of opposite features being invisible to cameras at opposite positions; based on this, our method does not require a common visible feature point in advance. By applying a practical camera array in a spherical arrangement and using 3D modelling, the superiority of the proposed method was verified through computer simulations. The results demonstrate that the proposed calibration method can contribute to the imaging system and future visual applications.

Journal ArticleDOI
TL;DR: The goal of the current paper is to study the effects of the CSM in a more comprehensive manner and then to examine and compare different strategies for mitigating it, providing a systematic study of the various factors that can give rise to CSM in image steganalysis.
Abstract: The Cover-Source Mismatch (CSM) has long been recognized as a major problem in modern steganography and steganalysis. Indeed, while the vast majority of works in steganography and steganalysis have been tailored to a specific reference database, namely BOSSbase, recent works show that, because of CSM, the results may differ greatly when this dataset is changed. Although the CSM has already been the subject of several publications, these prior works investigated only a few elements in a limited setup. The goal of the current paper is to study the effects of the CSM in a more comprehensive manner and then to examine and compare different strategies for mitigating it. It first defines two parameters, the source difficulty and the source inconsistency, which are involved in the CSM. Then, using different steganographic schemes and feature sets, it provides a systematic study of the various factors that can give rise to CSM in image steganalysis. Finally, two practical ways to mitigate the CSM are presented and their performances compared for different training set sizes: training techniques promoting either the diversity of different sources or the specificity of one targeted source, which is identified beforehand by training a multi-class classifier.

Journal ArticleDOI
TL;DR: This work proposes an image alignment based perceptual image hash method and a hash-based image forging detection and tampering localization method with broad-spectrum robustness, including tolerance of content-preserving manipulations and resilience to geometric distortion.
Abstract: Perceptual image hashing is an emerging technology closely related to many applications, such as image content authentication, image forging detection, image similarity detection, and image retrieval. In this work, we propose an image alignment based perceptual image hash method and a hash-based image forging detection and tampering localization method. In the proposed method, we introduce an image alignment process to provide a framework that allows the image hash to tolerate a wide range of geometric distortions. The image hash is generated by utilizing hybrid perceptual features extracted from global and local Zernike moments, combined with DCT-based statistical features of the image. The proposed method can detect various image forgeries and compromised image regions. Furthermore, it has broad-spectrum robustness, including tolerance of content-preserving manipulations and resilience to geometric distortion. Compared with state-of-the-art schemes, the proposed method provides satisfactory comprehensive performance in content-based image forging detection and tampering localization.

Journal ArticleDOI
TL;DR: A correlation network with a Shannon fusion for learning over a pre-trained CNN is proposed; it captures multimodal correlations over arbitrary timestamps and is validated in comparison experiments on the UCF-101 and HMDB-51 datasets.
Abstract: This paper describes a network that captures multimodal correlations over arbitrary timestamps. The proposed scheme operates as a complementary, extended network over a multimodal convolutional neural network (CNN). Spatial and temporal streams are required for action recognition with a deep CNN, but reducing overfitting and fusing these two streams remain open problems. The existing fusion approach averages the two streams. Here we propose a correlation network with a Shannon fusion for learning over a pre-trained CNN. A long video may contain spatiotemporal correlations over arbitrary timestamps, which can be captured by forming the correlation network from simple fully connected layers. This approach was found to complement existing network fusion methods. The importance of multimodal correlation is validated in comparison experiments on the UCF-101 and HMDB-51 datasets. The multimodal correlation enhanced the accuracy of the video recognition results.
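
"Shannon fusion" suggests an entropy-based weighting of the two streams; the sketch below is one plausible reading (weights inversely proportional to prediction entropy), not necessarily the paper's exact rule:

```python
import numpy as np

def shannon_fusion(p_spatial: np.ndarray, p_temporal: np.ndarray) -> np.ndarray:
    """Entropy-weighted fusion: the more confident (lower-entropy) stream gets more weight."""
    def entropy(p):
        return -np.sum(p * np.log(p + 1e-12))
    w = np.array([1.0 / (entropy(p) + 1e-12) for p in (p_spatial, p_temporal)])
    w /= w.sum()
    return w[0] * p_spatial + w[1] * p_temporal

p_rgb = np.array([0.70, 0.20, 0.10])    # spatial-stream class posteriors (toy values)
p_flow = np.array([0.40, 0.35, 0.25])   # temporal-stream class posteriors (toy values)
print(shannon_fusion(p_rgb, p_flow))    # fused posteriors, still summing to 1
```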