
Showing papers in "IEEE Signal Processing Letters in 2021"


Journal ArticleDOI
TL;DR: In this article, the authors proposed an anti-spoofing system to detect unknown synthetic voice spoofing attacks (i.e., text-to-speech or voice conversion) using one-class learning.
Abstract: Human voices can be used to authenticate the identity of a speaker, but automatic speaker verification (ASV) systems are vulnerable to voice spoofing attacks such as impersonation, replay, text-to-speech, and voice conversion. Recently, researchers have developed anti-spoofing techniques to improve the reliability of ASV systems against spoofing attacks. However, most methods have difficulty detecting unknown attacks in practical use, which often have statistical distributions different from those of known attacks. In particular, the fast development of synthetic voice spoofing algorithms is generating increasingly powerful attacks, putting ASV systems at risk of unseen attacks. In this work, we propose an anti-spoofing system to detect unknown synthetic voice spoofing attacks (i.e., text-to-speech or voice conversion) using one-class learning. The key idea is to compact the bona fide speech representation and inject an angular margin to separate the spoofing attacks in the embedding space. Without resorting to any data augmentation methods, our proposed system achieves an equal error rate (EER) of 2.19% on the evaluation set of the ASVspoof 2019 Challenge logical access scenario, outperforming all existing single systems (i.e., those without model ensemble).
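
The core idea (compacting bona fide embeddings around a target direction while pushing spoofed embeddings away by an angular margin) can be sketched as a one-class softmax loss. The PyTorch sketch below is illustrative: the margin values, scale factor, and embedding dimension are our assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneClassSoftmax(nn.Module):
    """Minimal one-class angular-margin loss: bona fide embeddings are pulled
    toward a learned direction w, spoofed ones pushed away from it."""
    def __init__(self, feat_dim=256, m_real=0.9, m_fake=0.2, alpha=20.0):
        super().__init__()
        self.w = nn.Parameter(torch.randn(feat_dim))
        self.m_real, self.m_fake, self.alpha = m_real, m_fake, alpha

    def forward(self, emb, labels):
        # cosine score between each embedding and the bona fide direction
        scores = F.normalize(emb, dim=1) @ F.normalize(self.w, dim=0)
        # label 0 = bona fide (want score > m_real); label 1 = spoof (want score < m_fake)
        margin = torch.where(labels == 0, self.m_real - scores, scores - self.m_fake)
        # softplus(x) = log(1 + e^x), the smooth hinge used by one-class softmax losses
        return F.softplus(self.alpha * margin).mean()
```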

106 citations


Journal ArticleDOI
TL;DR: In this article, the authors considered the target detection problem in a sensing architecture where the radar is aided by a reconfigurable intelligent surface (RIS), that can be modeled as an array of sub-wavelength small reflective elements capable of imposing a tunable phase shift to the impinging waves and, ultimately, providing the radar with an additional echo of the target.
Abstract: In this work, we consider the target detection problem in a sensing architecture where the radar is aided by a reconfigurable intelligent surface (RIS), that can be modeled as an array of sub-wavelength small reflective elements capable of imposing a tunable phase shift to the impinging waves and, ultimately, of providing the radar with an additional echo of the target. A theoretical analysis is carried out for closely- and widely-spaced (with respect to the target) radar and RIS and for different beampattern configurations, and some examples are provided to show that large gains can be achieved by the considered detection architecture.
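
For orientation, the "additional echo" idea can be written as a two-path received-signal model; the notation here is our own illustration, not the paper's:

$$ r(t) = \alpha_d\, s(t - \tau_d) + \alpha_r\, e^{j\phi}\, s(t - \tau_r) + n(t), $$

where $\alpha_d$ and $\alpha_r$ are the gains of the direct radar-target path and the RIS-assisted path, $\phi$ is the aggregate tunable RIS phase shift, and $n(t)$ is noise. Detection then amounts to a threshold test on the matched-filter output, and choosing $\phi$ so that the two echoes combine coherently is what yields the gains reported for the considered geometries.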

68 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a two-stage RIS-aided channel estimation (TRICE) framework, where every stage is formulated as a multidimensional direction-of-arrival (DOA) estimation problem.
Abstract: We consider the channel estimation problem in point-to-point reconfigurable intelligent surface (RIS)-aided millimeter-wave (mmWave) MIMO systems. By exploiting the low-rank nature of mmWave channels in the angular domains, we propose a non-iterative Two-stage RIS-aided Channel Estimation (TRICE) framework, where every stage is formulated as a multidimensional direction-of-arrival (DOA) estimation problem. As a result, our TRICE framework is very general in the sense that any efficient multidimensional DOA estimation solution can be readily used in every stage to estimate the associated channel parameters. Numerical results show that the TRICE framework has a lower training overhead and a lower computational complexity, as compared to benchmark solutions.

62 citations


Journal ArticleDOI
TL;DR: In this article, a ridge regression-based high precision error prediction algorithm for reversible data hiding is proposed, which minimizes the residual sum of squares between predicted and target pixels subject to the constraint expressed in terms of the L2-norm.
Abstract: An efficient predictor is crucial for high embedding capacity and low image distortion. In this letter, a ridge regression-based high-precision error prediction algorithm for reversible data hiding is proposed. Ridge regression is a penalized least-squares algorithm, which solves the overfitting problem of the least-squares method. The ridge regression predictor minimizes the residual sum of squares between predicted and target pixels subject to a constraint expressed in terms of the L2-norm. Compared to a least-squares-based predictor, the ridge regression-based predictor yields a larger share of small prediction errors, indicating that the proposed method has higher accuracy. In addition, the eight neighbor pixels of the target pixels and their two different combinations are selected as training and support sets, respectively. This selection scheme further improves the prediction accuracy. Experimental results show that the proposed method outperforms state-of-the-art adaptive reversible data hiding in terms of prediction accuracy and embedding performance.
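
As a concrete illustration of the predictor's core computation, the sketch below fits ridge-regression weights in closed form on the training set (8-neighbor vectors of nearby pixels) and applies them to a target pixel's neighbors. The regularization strength lam is an illustrative assumption.

```python
import numpy as np

def ridge_predict(X_train, y_train, x_query, lam=0.1):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y.
    X_train: (n, 8) neighbor vectors from the support set,
    y_train: (n,) corresponding pixel values,
    x_query: (8,) neighbors of the pixel to predict."""
    XtX = X_train.T @ X_train
    w = np.linalg.solve(XtX + lam * np.eye(X_train.shape[1]), X_train.T @ y_train)
    return float(x_query @ w)  # predicted pixel value
```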

61 citations


Journal ArticleDOI
TL;DR: In this paper, the implicit compensation between estimated magnitude and phase was analyzed for monaural speech separation and robust automatic speech recognition (ASR) tasks in noisy-reverberant conditions.
Abstract: Deep neural network (DNN) based end-to-end optimization in the complex time-frequency (T-F) domain or time domain has shown considerable potential in monaural speech separation. Many recent studies optimize loss functions defined solely in the time or complex domain, without including a loss on magnitude. Although such loss functions typically produce better scores on objective time-domain metrics, they produce worse scores on speech quality and intelligibility metrics and usually lead to worse speech recognition performance, compared with including a loss on magnitude. While this phenomenon has been observed experimentally in many studies, it is seldom explained accurately, and its fundamental cause is not thoroughly understood. This letter provides a novel view from the perspective of the implicit compensation between estimated magnitude and phase. Analytical results based on monaural speech separation and robust automatic speech recognition (ASR) tasks in noisy-reverberant conditions support the validity of our view.
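
The compensation effect can be made concrete with a short calculation in our own notation, consistent with the abstract's setting: for a T-F bin with clean value $|S|e^{j\theta}$ and a fixed phase estimation error $\Delta\theta$, the complex-domain loss

$$ \big|\,\hat{A}\,e^{j(\theta+\Delta\theta)} - |S|\,e^{j\theta}\,\big|^2 = \hat{A}^2 + |S|^2 - 2\hat{A}\,|S|\cos\Delta\theta $$

is minimized over the estimated magnitude at $\hat{A} = |S|\cos\Delta\theta < |S|$. That is, the loss-optimal magnitude shrinks to compensate for the phase error, which is exactly why a purely complex- or time-domain loss can hurt magnitude-sensitive quality and intelligibility metrics.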

60 citations


Journal ArticleDOI
TL;DR: In this paper, two different non-autoregressive transformer structures for automatic speech recognition (ASR) are proposed: the Audio-Conditional Masked Language Model (A-CMLM) and the Audio-Factorized Masked Language Model (A-FMLM).
Abstract: Very deep transformers outperform conventional bi-directional long short-term memory networks for automatic speech recognition (ASR) by a significant margin. However, being autoregressive models, their computational complexity is still a prohibitive factor for deployment in production systems. To address this problem, we study two different non-autoregressive transformer structures for ASR: the Audio-Conditional Masked Language Model (A-CMLM) and the Audio-Factorized Masked Language Model (A-FMLM). When training these frameworks, the decoder input tokens are randomly replaced by special mask tokens. The network is then optimized to predict the masked tokens by taking both the unmasked context tokens and the input speech into consideration. During inference, we start from all masked tokens and the network iteratively predicts missing tokens based on partial results. A new decoding strategy is proposed as an example, which proceeds from the most confident predictions to the rest. Results on Mandarin (AISHELL), Japanese (CSJ), and English (LibriSpeech) benchmarks show that such a non-autoregressive network can be trained effectively for ASR. On AISHELL in particular, the proposed method outperformed the Kaldi ASR system and matched the performance of the state-of-the-art autoregressive transformer with a 7× speedup.
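
The iterative decoding loop can be sketched in a few lines. The code below is a generic mask-predict skeleton, not the authors' implementation; predict_fn stands for a hypothetical network interface returning per-position token ids and confidences given the current partially masked hypothesis.

```python
import numpy as np

def mask_predict_decode(predict_fn, length, n_iters=4, mask_id=-1):
    """Start from an all-masked sequence; at each iteration keep the most
    confident predictions and re-mask the rest for the next pass."""
    tokens = np.full(length, mask_id, dtype=int)
    for it in range(1, n_iters + 1):
        ids, conf = predict_fn(tokens)        # hypothetical interface
        n_keep = (length * it) // n_iters     # unmask progressively more positions
        keep = np.argsort(-conf)[:n_keep]     # indices of most confident predictions
        tokens = np.full(length, mask_id, dtype=int)
        tokens[keep] = ids[keep]
    return tokens
```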

56 citations


Journal ArticleDOI
TL;DR: In this paper, a coupled coarray tensor CPD-based two-dimensional DOA estimation method for a specially designed coprime L-shaped array is proposed, in which a shifting coarray concatenation approach is developed to factorize the partitioned fourth-order coarray statistics into multiple coupled coarray tensors.
Abstract: The conventional canonical polyadic decomposition (CPD) approach for tensor-based sparse array direction-of-arrival (DOA) estimation typically partitions the coarray statistics to generate a full-rank coarray tensor for decomposition. However, such an operation ignores the spatial relevance among the partitioned coarray statistics. In this letter, we propose a coupled coarray tensor CPD-based two-dimensional DOA estimation method for a specially designed coprime L-shaped array. In particular, a shifting coarray concatenation approach is developed to factorize the partitioned fourth-order coarray statistics into multiple coupled coarray tensors. To make full use of the inherent spatial relevance among these coarray tensors, a coupled coarray tensor CPD approach is proposed to jointly decompose them for high-accuracy, closed-form DOA estimation. The uniqueness condition analysis of the coupled coarray tensor CPD guarantees an increased number of degrees of freedom for the proposed method.

46 citations


Journal ArticleDOI
TL;DR: In this letter, the authors propose a CNN-based prediction approach that divides a grayscale image into two sets and applies one set to predict the other for data embedding.
Abstract: How to predict images is an important issue in the reversible data hiding (RDH) community. In this letter, we propose a novel CNN-based prediction approach that divides a grayscale image into two sets and applies one set to predict the other for data embedding. The proposed CNN predictor is a lightweight and computation-efficient network with multiple receptive fields and global optimization. The predictor can be trained quickly and well using 1000 images randomly selected from ImageNet. Furthermore, we propose a two-stage embedding scheme for this predictor. Experimental results show that the CNN predictor makes full use of more surrounding pixels to improve prediction performance. Experiments also show that the CNN predictor, combined with expansion embedding and histogram shifting techniques, provides better embedding performance than classical linear predictors.
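
The two-set division can be sketched with a checkerboard mask, so that each set's pixels are surrounded by the other set's; whether the paper uses exactly this interleaving pattern is our assumption based on the description.

```python
import numpy as np

def two_set_split(img):
    """Checkerboard partition of a grayscale image: 'even' pixels serve as
    context for predicting 'odd' pixels, and vice versa."""
    parity = (np.indices(img.shape).sum(axis=0) % 2).astype(bool)
    set_a, set_b = img.copy(), img.copy()
    set_a[parity] = 0      # keep even-parity pixels only
    set_b[~parity] = 0     # keep odd-parity pixels only
    return set_a, set_b
```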

42 citations


Journal ArticleDOI
TL;DR: In this letter, the authors propose a Non-local Aggregation Network (NANet) with a well-designed Multi-modality Non-local Aggregation Module (MNAM), which enables the aggregation of non-local RGB-D information along both spatial and channel dimensions.
Abstract: Exploiting both RGB (2D appearance) and Depth (3D geometry) information can improve the performance of semantic segmentation. However, due to the inherent difference between RGB and Depth information, it remains challenging to integrate RGB-D features effectively. In this letter, to address this issue, we propose a Non-local Aggregation Network (NANet), with a well-designed Multi-modality Non-local Aggregation Module (MNAM), to better exploit the non-local context of RGB-D features at multiple stages. Compared with most existing RGB-D semantic segmentation schemes, which only exploit local RGB-D features, the MNAM enables the aggregation of non-local RGB-D information along both spatial and channel dimensions. The proposed NANet achieves performance comparable to state-of-the-art methods on the popular RGB-D benchmarks NYUDv2 and SUN-RGBD.
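
For reference, a standard 2D non-local (self-attention) block of the kind such modules build on is sketched below; this is the generic single-modality formulation, not the paper's multi-modality MNAM, and the channel-reduction factor is an assumption.

```python
import torch
import torch.nn as nn

class NonLocal2d(nn.Module):
    """Generic non-local block: every spatial position attends to all others."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2                      # channel reduction (an assumption)
        self.theta = nn.Conv2d(channels, c, 1)
        self.phi = nn.Conv2d(channels, c, 1)
        self.g = nn.Conv2d(channels, c, 1)
        self.out = nn.Conv2d(c, channels, 1)

    def forward(self, x):
        b, ch, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C/2)
        k = self.phi(x).flatten(2)                     # (B, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C/2)
        attn = torch.softmax(q @ k, dim=-1)            # all-pairs spatial affinity
        y = (attn @ v).transpose(1, 2).reshape(b, ch // 2, h, w)
        return x + self.out(y)                         # residual aggregation
```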

41 citations


Journal ArticleDOI
TL;DR: In this paper, the authors show that the pre-transform and hand-crafted features could simply be replaced by end-to-end DNNs and experimentally verify that by only using standard components, a light-weight neural network could outperform the state-of-the-art methods for the ASVspoof2019 challenge.
Abstract: The constant Q transform (CQT) has been shown to be one of the most effective speech signal pre-transforms to facilitate synthetic speech detection, followed by either hand-crafted (subband) constant Q cepstral coefficient (CQCC) feature extraction and a back-end binary classifier, or a deep neural network (DNN) directly for further feature extraction and classification. Despite the rich literature on such a pipeline, we show in this paper that the pre-transform and hand-crafted features could simply be replaced by end-to-end DNNs. Specifically, we experimentally verify that by only using standard components, a light-weight neural network could outperform the state-of-the-art methods for the ASVspoof2019 challenge. The proposed model is termed Time-domain Synthetic Speech Detection Net (TSSDNet), having ResNet- or Inception-style structures. We further demonstrate that the proposed models also have attractive generalization capability. Trained on ASVspoof2019, they could achieve promising detection performance when tested on disjoint ASVspoof2015, significantly better than the existing cross-dataset results. This paper reveals the great potential of end-to-end DNNs for synthetic speech detection, without hand-crafted features.

40 citations


Journal ArticleDOI
TL;DR: This paper investigates intelligent reflecting surface (IRS)-assisted secret key generation, which aims to maximize the secret key capacity by adjusting the placement of the IRS units, analyzing and deriving the key capacity expression of the IRS-assisted system from the perspective of information theory.
Abstract: In secret key generation of physical layer security technology, it is challenging to achieve high key capacity and low bit inconsistency rate. This paper investigates intelligent reflecting surface (IRS)-assisted secret key generation, which aims to maximize the secret key capacity by adjusting the placement of the IRS units. Specifically, we first analyze and deduce the key capacity expression of the IRS-assisted system from the perspective of information theory. Then we investigate how to use the channel state information (CSI) to place the IRS units effectively so as to maximize the secret key capacity. Simulation results show that our scheme could improve the quality of secret key generation significantly.
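
For intuition, the secret key capacity of two legitimate parties observing reciprocal channel estimates $h_A = h + n_A$ and $h_B = h + n_B$ (Gaussian channel $h$ with variance $\sigma_h^2$ and independent estimation noise with variance $\sigma_n^2$) is the mutual information

$$ C_K = I(h_A; h_B) = \frac{1}{2}\log_2\frac{(\sigma_h^2+\sigma_n^2)^2}{(\sigma_h^2+\sigma_n^2)^2-\sigma_h^4}. $$

This is a textbook expression given here for orientation, not the paper's IRS-specific derivation; the IRS placement enters by shaping the effective channel variance $\sigma_h^2$ seen by both parties.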

Journal ArticleDOI
TL;DR: In this letter, the authors propose a pure remote photoplethysmography transformer (TransRPPG) framework for learning intrinsic liveness representations efficiently, which is lightweight and efficient (with only 547 K parameters and 763 M FLOPs).
Abstract: 3D mask face presentation attack detection (PAD) plays a vital role in securing face recognition systems from emergent 3D mask attacks. Recently, remote photoplethysmography (rPPG) has been developed as an intrinsic liveness clue for 3D mask PAD that does not rely on the mask appearance. However, rPPG features for 3D mask PAD still require expert knowledge to design manually, which limits further progress in the deep learning and big data era. In this letter, we propose a pure rPPG transformer (TransRPPG) framework for learning intrinsic liveness representations efficiently. First, rPPG-based multi-scale spatial-temporal maps (MSTmaps) are constructed from facial skin and background regions. Then the transformer fully mines the global relationships within the MSTmaps for liveness representation and gives a binary prediction for 3D mask detection. Comprehensive experiments are conducted on two benchmark datasets to demonstrate the efficacy of TransRPPG in both intra- and cross-dataset testing. Our TransRPPG is lightweight and efficient (with only 547 K parameters and 763 M FLOPs), which is promising for mobile-level applications.

Journal ArticleDOI
TL;DR: In this paper, a multi-information fusion CNN (MF-CNN) model is proposed to early-terminate the Quad-Tree plus Multi-type Tree (QTMT)-based CU partition process by jointly using multi-domain information.
Abstract: Versatile Video Coding (VVC) achieves superior coding efficiency compared with High Efficiency Video Coding (HEVC), but its excellent coding performance comes at the cost of several highly complex coding tools, such as Quad-Tree plus Multi-type Tree (QTMT)-based Coding Units (CUs) and multiple inter prediction modes. To reduce the computational complexity of VVC, a CNN-based fast inter coding method is proposed in this paper. First, a multi-information fusion CNN (MF-CNN) model is proposed to early-terminate the QTMT-based CU partition process by jointly using multi-domain information. Then, a content complexity-based early Merge mode decision is proposed to skip the time-consuming inter prediction modes by considering the CU prediction residuals and the confidence of the MF-CNN. Experimental results show that the proposed method reduces VVC encoding time by 30.63% on average, while the Bjøntegaard Delta Bit Rate (BDBR) increases by about 3%.

Journal ArticleDOI
TL;DR: In this paper, a spatial-temporal graph deconvolutional network (ST-GDN) is proposed to improve message aggregation by removing the embedding redundancy of the input graphs node-wise, frame-wise, or element-wise at different network layers.
Abstract: Benefiting from the powerful ability of spatial-temporal Graph Convolutional Networks (ST-GCNs), skeleton-based human action recognition has achieved promising success. However, node interaction through message propagation does not always provide complementary information. Instead, it may even produce destructive noise and thus make learned representations indistinguishable. Inevitably, the graph representation also becomes over-smoothed, especially when multiple GCN layers are stacked. This paper proposes spatial-temporal graph deconvolutional networks (ST-GDNs), a novel and flexible graph deconvolution technique, to alleviate this issue. At its core, the method provides better message aggregation by removing the embedding redundancy of the input graphs node-wise, frame-wise, or element-wise at different network layers. Extensive experiments on three of the currently most challenging benchmarks verify that ST-GDN consistently improves performance and largely reduces the model size on these datasets.

Journal ArticleDOI
TL;DR: In this article, a new robust norm based on an exponential hyperbolic cosine function (EHCF) is introduced, and a corresponding EHCF-based adaptive filter, called the exponential hyperbolic cosine adaptive filter (EHCAF), is developed.
Abstract: In recent years, correntropy-based algorithms, which include the maximum correntropy criterion (MCC), generalized MCC (GMCC), and kernel MCC (KMCC), and hyperbolic cosine function-based algorithms, such as the hyperbolic cosine adaptive filter (HCAF), logarithmic HCAF (LHCAF), and least lncosh (Llncosh) filter, have been widely utilized in adaptive filtering due to their robustness to non-Gaussian/impulsive background noise. However, the performance of such algorithms suffers from high steady-state misalignment. To minimize the steady-state misalignment while keeping comparable computational complexity, a new robust norm based on an exponential hyperbolic cosine function (EHCF) is introduced, and a corresponding EHCF-based adaptive filter, called the exponential hyperbolic cosine adaptive filter (EHCAF), is developed in this letter. The computational complexity and a bound on the learning rate for stability of the proposed algorithm are also studied. A set of simulation studies has been carried out for a system identification scenario to assess the performance of the proposed algorithm. Further, the EHCAF algorithm has been extended to the filtered-x EHCAF (Fx-EHCAF) algorithm for robust room equalization.
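
To make the family of hyperbolic-cosine costs concrete, the sketch below implements the Llncosh baseline named in the abstract: an LMS-type filter minimizing $\frac{1}{\lambda}\ln\cosh(\lambda e)$, whose gradient yields a bounded $\tanh(\lambda e)$ error nonlinearity. The EHCF cost of the paper differs in detail; the step size and λ here are illustrative.

```python
import numpy as np

def llncosh_filter(x, d, L=16, mu=0.01, lam=5.0):
    """LMS-type adaptive filter with ln-cosh cost: the tanh-saturated error
    term makes the update robust to impulsive noise in the desired signal d."""
    w = np.zeros(L)
    for n in range(L, len(x)):
        u = x[n - L:n][::-1]            # most recent L input samples
        e = d[n] - w @ u                # a priori error
        w = w + mu * np.tanh(lam * e) * u
    return w
```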

Journal ArticleDOI
Xiujun Shu, Ge Li, Xiao Wang, Weijian Ruan, Qi Tian
TL;DR: Shu et al. propose a semantic-guided pixel sampling approach for the cloth-changing person re-ID task, which does not explicitly define which features to extract but forces the model to automatically learn cloth-irrelevant cues.
Abstract: Cloth-changing person re-identification (re-ID) is a newly rising research topic that aims at retrieving pedestrians whose clothes have changed. This task is quite challenging and has not been fully studied to date. Current works mainly focus on body shape or contour sketches, but these are not robust enough to view and posture variations. The key to this task is to exploit cloth-irrelevant cues. This paper proposes a semantic-guided pixel sampling approach for the cloth-changing person re-ID task. We do not explicitly define which features to extract but force the model to automatically learn cloth-irrelevant cues. Specifically, we first recognize a pedestrian's upper clothes and pants, then randomly change them by sampling pixels from other pedestrians. The changed samples retain their identity labels but exchange the pixels of clothes or pants among different pedestrians. Besides, we adopt a loss function to constrain the learned features to remain consistent before and after the changes. In this way, the model is forced to learn cues that are irrelevant to upper clothes and pants. We conduct extensive experiments on the recently released PRCC dataset. Our method achieves 65.8% Rank-1 accuracy, outperforming previous methods by a large margin. The code is available at https://github.com/shuxjweb/pixel_sampling.git.
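
The pixel-sampling augmentation can be sketched directly from the description: given parsing masks for the clothes regions, one pedestrian's clothes pixels are replaced by pixels sampled from another pedestrian's clothes while the identity label is kept. The mask format and sampling scheme below are our assumptions.

```python
import numpy as np

def sample_swap_clothes(img_a, mask_a, img_b, mask_b, seed=0):
    """img_*: (H, W, 3) uint8 images; mask_*: (H, W) boolean clothes masks.
    Returns img_a with its clothes pixels resampled from img_b's clothes."""
    rng = np.random.default_rng(seed)
    out = img_a.copy()
    donor = img_b[mask_b]                              # (n_donor, 3) clothes pixels of B
    idx = rng.integers(0, donor.shape[0], mask_a.sum())
    out[mask_a] = donor[idx]                           # identity label of A is kept
    return out
```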

Journal ArticleDOI
TL;DR: Experimental results on the RML, eNTERFACE05, and BAUM-1s datasets show that the recognition rate of the method is higher than that of other state-of-the-art methods.
Abstract: To effectively fuse speech and visual features, this letter proposes a multi-modal emotion recognition method that fuses correlated features of the speech and visual modalities. First, speech and visual features are extracted by a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN), respectively. Second, the speech and visual features are processed by a feature correlation analysis algorithm for multi-modal fusion. In addition, the class information of the speech and visual features is also incorporated into the feature correlation analysis algorithm, which effectively fuses the speech and visual features and improves the performance of multi-modal emotion recognition. Finally, a support vector machine (SVM) completes the classification for multi-modal speech-visual emotion recognition. Experimental results on the RML, eNTERFACE05, and BAUM-1s datasets show that the recognition rate of our method is higher than that of other state-of-the-art methods.
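
The "feature correlation analysis" step is in the spirit of canonical correlation analysis. As an illustration (not the paper's exact algorithm, which additionally injects class information), a plain CCA fusion of the two CNN feature streams looks like this, with random arrays standing in for real features:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
speech_feats = rng.normal(size=(200, 64))    # stand-in for 2D-CNN speech features
visual_feats = rng.normal(size=(200, 128))   # stand-in for 3D-CNN visual features

cca = CCA(n_components=8)                    # number of components is an assumption
s_c, v_c = cca.fit_transform(speech_feats, visual_feats)
fused = np.hstack([s_c, v_c])                # fused representation fed to the SVM
```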

Journal ArticleDOI
TL;DR: In this letter, the authors propose a novel discriminative correlation filter (DCF) tracking model that introduces a dynamic spatial regularization weight, which encourages the filter to focus on more reliable regions during the training stage.
Abstract: With their wide field of view and high flexibility, unmanned aerial vehicles (UAVs) have been widely used for object tracking in recent years. However, their limited computing capability poses great challenges to tracking algorithms. On the other hand, Discriminative Correlation Filter (DCF) based trackers have attracted great attention due to their computational efficiency and superior accuracy. Many studies introduce spatial and temporal regularization into the DCF framework to achieve a more robust appearance model and further enhance tracking performance. However, such algorithms generally set fixed spatial or temporal regularization parameters, which lack flexibility and adaptability in cluttered and challenging scenarios. To tackle this issue, in this letter, we propose a novel DCF tracking model that introduces a dynamic spatial regularization weight, which encourages the filter to focus on more reliable regions during the training stage. Furthermore, our method optimizes the spatial and temporal regularization weights simultaneously using the Alternating Direction Method of Multipliers (ADMM), where each sub-problem has a closed-form solution. Through this joint optimization, our tracker can not only suppress potential distractors but also construct a robust target appearance on the basis of reliable historical information. Experiments on two UAV benchmarks demonstrate that our tracker performs favorably against other state-of-the-art algorithms.

Journal ArticleDOI
TL;DR: In this paper, a pre-trained acoustic encoder and a linguistic encoder are fused into an end-to-end ASR model to learn the transfer from speech to language during fine-tuning on limited labeled data.
Abstract: End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demand of end-to-end models. Self-supervised acoustic pre-training has already shown impressive ASR performance, but the transcriptions are still inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The lengths of the two modalities are matched by a monotonic attention mechanism without additional parameters. Besides, a fully connected layer is introduced for the hidden mapping between modalities. We further propose a scheduled fine-tuning strategy to preserve and utilize the text-context modeling ability of the pre-trained linguistic encoder. Experiments demonstrate the effective use of the pre-trained modules. Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.

Journal ArticleDOI
TL;DR: In this paper, the authors propose two new multi-label classification networks: MCG-Net, based on a graph convolutional network, and MCGS-Net, based on a graph convolutional network and self-supervised learning.
Abstract: Accurate diagnosis of fundus disease can effectively reduce the disease's further deterioration and provide targeted treatment plans for patients. Fundus image classification is a multi-label classification task, because one fundus image may contain one or more diseases. For multi-label classification of fundus images, we propose two new multi-label classification networks: MCG-Net, based on a graph convolutional network, and MCGS-Net, based on a graph convolutional network and self-supervised learning. Here, the graph convolutional network is used to capture the relevant information of the multi-label fundus images, and self-supervised learning is used to enhance the generalization ability of the network by learning from more unannotated data. We use the ROC curve, Precision score, Recall score, Kappa score, F-1 score, and AUC value as evaluation metrics and test on two datasets. Compared with other methods, our methods achieve better classification performance and generalization ability for multi-label fundus image classification.

Journal ArticleDOI
Jinyuan Liu, Yuhui Wu, Zhanbo Huang, Risheng Liu, Xin Fan
TL;DR: Liu et al. propose a Neural Architecture Search (NAS)-based deep learning network to realize infrared and visible image fusion, which can automatically discover modality-oriented feature representations.
Abstract: Nowadays, driven by the high demand for autonomous driving and surveillance, infrared and visible image fusion (IVIF) has attracted significant attention from both industry and the research community. Existing learning-based IVIF methods have tried various architectures to extract features, but these hand-crafted architectures cannot adequately represent the typical features of different modalities, resulting in undesirable artifacts in their fused results. To alleviate this issue, we propose a Neural Architecture Search (NAS)-based deep learning network for the IVIF task, which can automatically discover modality-oriented feature representations. Our network consists of two modality-oriented encoders and a unified decoder, in addition to a self-visual saliency weight module (SvSW). The two modality-oriented encoders aim to automatically learn different intrinsic feature representations from infrared- and visible-modality images. Subsequently, these intermediate features are merged via the SvSW module. Finally, the fused image is recovered by the unified decoder. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by a large margin, especially in generating distinct targets and abundant details.

Journal ArticleDOI
Haijun Liu, Yanxia Chai, Xiaoheng Tan, Dong Li, Xichuan Zhou
TL;DR: Liu et al. propose a dual-granularity triplet loss for visible-thermal person re-identification (VT-ReID), which organizes the sample-based triplet loss and center-based triplet loss in a hierarchical fine-to-coarse granularity manner, with some simple configurations of typical operations such as pooling and batch normalization.
Abstract: This letter presents a conceptually simple and effective dual-granularity triplet loss for visible-thermal person re-identification (VT-ReID). Generally, ReID models are trained with the sample-based triplet loss and identification loss at the fine granularity level. Further, a center-based loss can be introduced to encourage intra-class compactness and inter-class discrimination at the coarse granularity level. Our proposed dual-granularity triplet loss organizes the sample-based triplet loss and center-based triplet loss in a hierarchical fine-to-coarse granularity manner, with only some simple configurations of typical operations, such as pooling and batch normalization. Experiments on the RegDB and SYSU-MM01 datasets show that, with only global features, our dual-granularity triplet loss improves VT-ReID performance by a significant margin. It can serve as a strong VT-ReID baseline to support future research.
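
A minimal sketch of the coarse-granularity (center-based) term, under our own simplifying assumptions: per-identity centers are computed in each modality, same-identity cross-modality centers are pulled together, and the hardest different-identity center is pushed away by a margin. This illustrates the idea only, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def center_triplet(v_feats, t_feats, labels, margin=0.3):
    """v_feats/t_feats: (N, D) visible/thermal embeddings; labels: (N,) identities."""
    ids = labels.unique()
    cv = torch.stack([v_feats[labels == i].mean(0) for i in ids])  # visible centers
    ct = torch.stack([t_feats[labels == i].mean(0) for i in ids])  # thermal centers
    pos = (cv - ct).norm(dim=1)                  # same-identity, cross-modality distance
    dist = torch.cdist(cv, ct)
    dist.fill_diagonal_(float("inf"))            # exclude the positive pairs
    neg = dist.min(dim=1).values                 # hardest negative center per identity
    return F.relu(pos - neg + margin).mean()
```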

Journal ArticleDOI
TL;DR: BitDance as mentioned in this paper uses color and geometry texture descriptors to estimate the perceived quality of the test point clouds, and shows that the proposed PC quality assessment metric performs very well when compared to state-of-the-art quality metrics.
Abstract: Point Clouds (PCs) have recently been adopted as the preferred data structure for representing 3D visual contents. Examples of Point Cloud (PC) applications range from 3D representations of small objects up to large scenes, both static and dynamic in time. PC adoption triggered the development of new coding, transmission, and display methodologies that culminated in new international standards for PC compression. Along with these, in the last couple of years, novel methods have been developed for evaluating the visual quality of PC contents. This paper presents a new objective full-reference visual quality assessment metric for static PC contents, named BitDance, which uses color and geometry texture descriptors. The proposed method first extracts the statistics of the color and geometry information of the reference and test PCs. It then compares the color and geometry statistics and combines them to estimate the perceived quality of the test PC. Using publicly available PC quality assessment datasets, we show that the proposed PC quality assessment metric performs very well compared to state-of-the-art quality metrics. In particular, the method performs well for different types of PC datasets, including those where geometry and color are not degraded with similar intensities. BitDance is a low-complexity algorithm, with an optimized C++ source code available for download at github.com/rafael2k/bitdance-pc_metric.

Journal ArticleDOI
TL;DR: In this paper, the authors propose a two-stream encoder generative adversarial network (TSE-GAN) with progressive training for co-saliency detection, in which the SOD-GAN and the classification network are separately trained on salient object detection datasets and on co-saliency datasets with only category labels.
Abstract: Recent end-to-end co-saliency models perform well; however, they cannot express the semantic consistency among a group of images well and usually require many co-saliency labels. To this end, a two-stream encoder generative adversarial network (TSE-GAN) with progressive training is proposed in this paper. In the pre-training stage, the salient object detection generative adversarial network (SOD-GAN) and a classification network (CN) are separately trained on salient object detection (SOD) datasets and on co-saliency datasets with only category labels, to learn intra-saliency and preliminary inter-saliency cues and to alleviate the problem of insufficient co-saliency labels. In the second training stage, the backbone of the TSE-GAN is inherited from the trained SOD-GAN. The encoder of the trained SOD-GAN (SOD-Encoder) is used to extract intra-saliency features, while a group-wise semantic encoder (GS-Encoder), constructed from the multi-level group-wise category features extracted by the CN, extracts inter-saliency features with better semantic consistency. The TSE-GAN, built by incorporating the GS-Encoder into the SOD-GAN, is then trained on co-saliency datasets for co-saliency detection. Comprehensive comparisons with 13 state-of-the-art methods demonstrate the effectiveness of the proposed method.

Journal ArticleDOI
Ke Tan, Buye Xu, Anurag Kumar, Eliya Nachmani, Yossi Adi
TL;DR: This study extends a newly-developed gated recurrent neural network for monaural separation by additionally incorporating self-attention mechanisms and dense connectivity and develops an end-to-end multiple-input multiple-output system, which directly maps from the binaural waveform of the mixture to those of the speech signals.
Abstract: Most existing deep learning based binaural speaker separation systems focus on producing a monaural estimate for each of the target speakers, and thus do not preserve the interaural cues, which are crucial for human listeners to perform sound localization and lateralization. In this study, we address talker-independent binaural speaker separation with interaural cues preserved in the estimated binaural signals. Specifically, we extend a newly-developed gated recurrent neural network for monaural separation by additionally incorporating self-attention mechanisms and dense connectivity. We develop an end-to-end multiple-input multiple-output system, which directly maps from the binaural waveform of the mixture to those of the speech signals. The experimental results show that our proposed approach achieves significantly better separation performance than a recent binaural separation approach. In addition, our approach effectively preserves the interaural cues, which improves the accuracy of sound localization.

Journal ArticleDOI
Yufei Lin, Liquan Shen, Zhengyong Wang, Kun Wang, Xi Zhang
TL;DR: Lin et al. propose a two-stage network for underwater image restoration, which divides the restoration process into horizontal and vertical distortion restoration, and introduce a novel attenuation coefficient prior attention block (ACPAB) to adaptively recalibrate the RGB channel-wise feature maps of an image suffering from vertical distortion.
Abstract: Underwater images suffer from severe color casts, low contrast, and blurriness, which are caused by scattering and absorption as light propagates through water. However, existing deep learning methods treat the restoration process as a whole and do not fully consider the underwater physical distortion process. Thus, they cannot adequately tackle both absorption and scattering, leading to poor restoration results. To address this problem, we propose a novel two-stage network for underwater image restoration (UIR), which divides the restoration process into two parts, namely horizontal and vertical distortion restoration. In the first stage, a model-based network handles horizontal distortion by directly embedding the underwater physical model into the network. The attenuation coefficient, as a feature representation characterizing water-type information, is first estimated to guide the accurate estimation of the parameters of the physical model. In the second stage, to tackle vertical distortion and reconstruct a clear underwater image, we put forth a novel attenuation coefficient prior attention block (ACPAB) to adaptively recalibrate the RGB channel-wise feature maps of an image suffering from vertical distortion. Experiments on both a synthetic dataset and real-world underwater images demonstrate that our method tackles scattering and absorption more effectively than several state-of-the-art methods.
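
The "underwater physical model" referred to above is commonly written as the simplified image-formation model, given here in our own rendering per channel $c \in \{R, G, B\}$ (whether the paper uses exactly this parameterization is an assumption on our part):

$$ I_c(x) = J_c(x)\, e^{-\beta_c d(x)} + B_c\left(1 - e^{-\beta_c d(x)}\right), $$

where $I_c$ is the observed image, $J_c$ the scene radiance to recover, $d(x)$ the scene depth, $B_c$ the background light, and $\beta_c$ the wavelength-dependent attenuation coefficient that the first stage estimates.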

Journal ArticleDOI
TL;DR: In this letter, the authors propose a novel method for early-stage glaucoma detection based on the newly introduced two-dimensional tensor empirical wavelet transform (2D-T-EWT).
Abstract: Glaucoma is a chronic eye disease that damages the optic nerve and may cause permanent vision loss. Conventional instrument-based methods for glaucoma detection are manual and time-consuming. Many approaches have recently been proposed for automatic glaucoma classification using retinal fundus images. However, none of the existing methods can be used efficiently for early-stage glaucoma detection. In this letter, we propose a novel method for glaucoma classification based on the newly introduced two-dimensional tensor empirical wavelet transform (2D-T-EWT). In this study, the pre-processed images are decomposed into sub-band images (SBIs) using the 2D-T-EWT. Then, texture-based grey level co-occurrence matrix (GLCM), chip histogram, and moment invariant features are extracted from the decomposed SBIs. Robust features are then selected and ranked using Student's t-test. Finally, a trained multi-class least squares support vector machine (MC-LS-SVM) classifier performs the classification. The experimental results show that our method outperforms state-of-the-art approaches for glaucoma classification, achieving the highest classification accuracy of 93.65% using tenfold cross-validation.
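
The texture-feature stage can be illustrated with scikit-image's GLCM utilities (spelled graycomatrix/graycoprops in recent versions); the distances, angles, and property set below are illustrative choices, not necessarily those of the paper.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(img_u8):
    """img_u8: 2D uint8 sub-band image. Returns a small GLCM texture descriptor."""
    glcm = graycomatrix(img_u8, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "homogeneity", "energy", "correlation")
    return np.hstack([graycoprops(glcm, p).ravel() for p in props])
```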

Journal ArticleDOI
TL;DR: In this paper, the authors present a linguistic steganalysis method based on a graph neural network, in which texts are represented as directed graphs with associated information, where nodes denote words and edges show associations between words.
Abstract: Recent linguistic steganalysis methods model texts as sequences and use deep learning models to extract discriminative features for detecting the presence of secret information in texts. However, natural language has a complex syntactic structure, and sequences have limited representational ability for text modeling. Moreover, previous methods tend to extract features from local continuous word sequences, which cannot effectively model global characteristics. In this paper, we present a linguistic steganalysis method based on a graph neural network. In the proposed method, texts are represented as directed graphs with associated information, where nodes denote words and edges show associations between words. By training a graph convolutional network for feature extraction, each node of a graph can collect contextual information to update its own representation, effectively alleviating the poor representation of polysemous words. Meanwhile, we adopt a globally shared matrix to record the correlation strengths between words, so that each text can effectively utilize global information to obtain a better self-representation. Experimental results show that the proposed method achieves state-of-the-art performance compared with previous works.
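
Building the directed word graph from a text is straightforward; a minimal version using only adjacency of consecutive words (one plausible notion of "association", assumed here) is:

```python
from collections import defaultdict

def text_to_graph(tokens):
    """Nodes are unique words; directed edge weights count how often one
    word immediately precedes another, approximating association strength."""
    edges = defaultdict(int)
    for a, b in zip(tokens, tokens[1:]):
        edges[(a, b)] += 1
    nodes = sorted(set(tokens))
    return nodes, dict(edges)

nodes, edges = text_to_graph("the secret is hidden in the text".split())
```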

Journal ArticleDOI
TL;DR: In this article, a new constant false alarm rate (CFAR) detector was proposed to accelerate the existing superpixel (SP)-based CFAR detectors for ship detection in SAR images.
Abstract: In this letter, we propose a new constant false alarm rate (CFAR) detector to accelerate existing superpixel (SP)-based CFAR detectors for ship detection in synthetic aperture radar (SAR) images. In our method, we design a new density-censoring operation to rapidly identify background clutter SPs (BCSPs) with high densities before the local CFAR detection. In this way, a large number of non-informative BCSPs are removed without the time-consuming calculation of decision thresholds, and only a few candidate ship target SPs (STSPs) are retained. This reduces the computational cost of the subsequent local CFAR detection and the number of false alarms it produces. During the local CFAR detection process for the retained candidate STSPs, we also propose an improved method to define their neighboring clutter regions (for the calculation of decision thresholds) using the BCSPs identified by the density-censoring operation. Experiments on measured SAR images validate that the proposed CFAR method reduces the computational cost of commonly used SP-based CFAR methods by 75%-96% with similar or better detection performance.
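
For context, the per-cell decision threshold that such detectors compute, and that the density-censoring step lets the method skip for most SPs, follows from the desired false-alarm rate. The classical cell-averaging CFAR rule for exponentially distributed clutter is sketched below; the paper's SP-based detector differs in how reference regions are chosen.

```python
import numpy as np

def ca_cfar_threshold(reference_cells, p_fa):
    """Cell-averaging CFAR: with N exponential reference samples, the scale
    alpha = N * (P_fa**(-1/N) - 1) applied to their mean gives a threshold
    with constant false-alarm rate P_fa."""
    N = len(reference_cells)
    alpha = N * (p_fa ** (-1.0 / N) - 1.0)
    return alpha * np.mean(reference_cells)
```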

Journal ArticleDOI
TL;DR: In this letter, the authors propose a channel and space attention neural network (CSANN) for image denoising, which concatenates the noise level with the average and maximum values of each channel (and, analogously, of each spatial location) as input to convolutional attention networks that learn the relationships between channels and between spatial locations.
Abstract: Recently, convolutional neural networks (CNNs) have been widely used in image denoising. However, in most CNN denoising methods, all channels are treated equally and the relationships between spatial locations are neglected. In this letter, we propose a novel channel and space attention neural network (CSANN) for image denoising. In CSANN, we concatenate the noise level with the average and maximum values of each channel as input and propose a convolutional network to learn the relationships between channels. Meanwhile, we combine the noise-level map with the average and maximum values of each spatial location as input and use a convolutional network to learn the relationships between spatial locations. Moreover, we combine the two as an attention network and introduce it into the main CNN and the symmetric skip connections, which makes the channels weighted by the attention network play different roles in the subsequent convolutions and offsets the performance degradation caused by using a single convolution kernel across spatial locations. In addition, the use of symmetric skip connections and ResNet blocks avoids the vanishing gradient problem and the loss of shallow features. Experimental results show that, compared with some state-of-the-art denoising algorithms, CSANN produces better visual results and higher peak signal-to-noise ratio (PSNR) values.
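
The channel-attention branch described above can be sketched as follows: per-channel average and maximum statistics are concatenated with the noise level and mapped to per-channel weights. Layer sizes and the MLP structure are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Noise-conditioned channel attention: per-channel avg/max statistics plus
    the scalar noise level are mapped to sigmoid gating weights per channel."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, x, noise_level):
        # x: (B, C, H, W); noise_level: (B, 1) scalar noise estimate per image
        avg = x.mean(dim=(2, 3))                     # (B, C) channel averages
        mx = x.amax(dim=(2, 3))                      # (B, C) channel maxima
        w = self.mlp(torch.cat([avg, mx, noise_level], dim=1))
        return x * w.unsqueeze(-1).unsqueeze(-1)     # reweight each channel
```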