Showing papers in "Pattern Recognition in 2022"
TL;DR: Wang et al. as mentioned in this paper proposed edge computing based video pre-processing to eliminate redundant frames, migrating part or all of the video processing task to the edge, thereby reducing the computing, storage and network bandwidth requirements of the cloud center and enhancing the effectiveness of video analysis.
Abstract: In Internet of Things enabled intelligent transportation systems, a huge amount of vehicle video data is generated, and real-time, accurate video analysis is an important and challenging task, especially in complex street scenes. We therefore propose edge computing based video pre-processing to eliminate redundant frames, migrating part or all of the video processing task to the edge, thereby reducing the computing, storage and network bandwidth requirements of the cloud center and enhancing the effectiveness of video analysis. To eliminate redundancy in traffic video, we present motion-magnitude detection based on spatio-temporal interest points (STIP) and a multi-modal linear feature combination that splits a video into super-frame segments of interest. We then select key frames from these segments of interest in long videos through the design and detection of prominent regions. Finally, extensive numerical experiments show that our methods outperform previous algorithms at the various stages of redundancy elimination, video segmentation, key frame selection and vehicle detection.
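For intuition, here is a minimal Python sketch of the general pipeline the abstract describes: score each frame by motion magnitude, group contiguous high-motion frames into segments of interest, and pick one key frame per segment. The paper's actual method uses STIP-based motion magnitude, a multi-modal linear feature combination and prominent-region detection; the simple frame-difference score and thresholds below are illustrative stand-ins.

```python
# A minimal sketch (not the paper's implementation) of the general idea:
# score frames by motion, keep contiguous high-motion runs as segments of
# interest, and pick one key frame per segment.
import numpy as np

def motion_scores(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W) grayscale video; returns a per-frame motion score."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    return np.concatenate([[0.0], diffs])          # first frame gets score 0

def segments_of_interest(scores, thresh):
    """Group consecutive frames whose score exceeds `thresh` into segments."""
    segs, start = [], None
    for t, s in enumerate(scores):
        if s >= thresh and start is None:
            start = t
        elif s < thresh and start is not None:
            segs.append((start, t)); start = None
    if start is not None:
        segs.append((start, len(scores)))
    return segs

def key_frames(scores, segs):
    """Pick the highest-scoring frame inside each segment as its key frame."""
    return [int(np.argmax(scores[a:b]) + a) for a, b in segs]

# toy usage with a random "video"
video = np.random.rand(100, 64, 64)
s = motion_scores(video)
segs = segments_of_interest(s, thresh=s.mean())
print(segs, key_frames(s, segs))
```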
91 citations
TL;DR: Zhang et al. as mentioned in this paper proposed a novel approach to estimate both aleatoric and epistemic uncertainties for stereo matching in an end-to-end way, where the uncertainty parameters are predicted for each potential disparity and then averaged via the guidance of matching probability distribution.
Abstract: Although deep learning-based stereo matching approaches have achieved excellent performance in recent years, it is still a non-trivial task to estimate the uncertainty of the produced disparity map. In this paper, we propose a novel approach to estimate both aleatoric and epistemic uncertainties for stereo matching in an end-to-end way. We introduce an evidential distribution, named the Normal Inverse-Gamma (NIG) distribution, whose parameters can be used to calculate the uncertainty. Instead of being directly regressed from aggregated features, the uncertainty parameters are predicted for each potential disparity and then averaged via the guidance of the matching probability distribution. Furthermore, considering the sparsity of ground truth in real scene datasets, we design two additional losses. The first tries to enlarge uncertainty on incorrect predictions, so uncertainty becomes more sensitive to erroneous regions. The second enforces the smoothness of the uncertainty in regions with smooth disparity. Most stereo matching models, such as PSM-Net, GA-Net, and AA-Net, can be easily integrated with our approach. Experiments on multiple benchmark datasets show that our method improves stereo matching results. We prove that both aleatoric and epistemic uncertainties are well calibrated with incorrect predictions. In particular, our method can capture increased epistemic uncertainty on out-of-distribution data, making it effective at preventing a system from potentially fatal consequences. Code is available at https://github.com/Dawnstar8411/StereoMatching-Uncertainty.
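As a rough illustration of how the NIG parameterization yields the two uncertainties, the numpy sketch below averages per-disparity NIG parameters (gamma, nu, alpha, beta) with a matching probability distribution and then reads off aleatoric and epistemic uncertainty using the standard evidential-regression formulas; the tensor names, shapes and exact aggregation are assumptions, not the authors' code.

```python
import numpy as np

def nig_uncertainties(gamma, nu, alpha, beta, match_prob):
    """
    gamma, nu, alpha, beta: (D, H, W) NIG parameters, one set per candidate disparity.
    match_prob:             (D, H, W) softmax matching probabilities over disparities.
    Returns per-pixel (aleatoric, epistemic) uncertainty maps of shape (H, W).
    """
    # probability-weighted average of the NIG parameters over the disparity axis
    g = (match_prob * gamma).sum(0)
    v = (match_prob * nu).sum(0)
    a = (match_prob * alpha).sum(0)
    b = (match_prob * beta).sum(0)
    aleatoric = b / (a - 1.0)          # E[sigma^2]
    epistemic = b / (v * (a - 1.0))    # Var[mu]
    return aleatoric, epistemic

# toy usage with random parameters (alpha > 1, nu > 0 so the formulas are defined)
D, H, W = 8, 4, 4
logits = np.random.rand(D, H, W)
p = np.exp(logits) / np.exp(logits).sum(0, keepdims=True)
al, ep = nig_uncertainties(np.random.rand(D, H, W), 1 + np.random.rand(D, H, W),
                           2 + np.random.rand(D, H, W), 1 + np.random.rand(D, H, W), p)
print(al.shape, ep.shape)
```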
74 citations
TL;DR: In this paper, a neighborhood linear discriminant analysis (nLDA) is proposed, in which the scatter matrices are defined on a neighborhood consisting of reverse nearest neighbors and the neighborhood can be naturally regarded as the smallest subclass.
Abstract: Linear Discriminant Analysis (LDA) assumes that all samples from the same class are independently and identically distributed (i.i.d.). LDA may fail in cases where this assumption does not hold. In particular, when a class contains several clusters (or subclasses), LDA cannot correctly depict the internal structure, because the scatter matrices that LDA relies on are defined at the class level. To mitigate this problem, this paper proposes a neighborhood linear discriminant analysis (nLDA) in which the scatter matrices are defined on a neighborhood consisting of reverse nearest neighbors. Thus, the new discriminator does not need the i.i.d. assumption. In addition, the neighborhood can be naturally regarded as the smallest subclass, which is easier to obtain than a subclass without resorting to any clustering algorithm. The projected directions are sought to make the within-neighborhood scatter as small as possible and the between-neighborhood scatter as large as possible, simultaneously. The experimental results show that nLDA performs significantly better than previous discriminators, such as LDA, LFDA, ccLDA, LM-NNDA, and ℓ2,1-RLDA.
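A compact numpy/scipy sketch of the core idea follows: build each sample's neighborhood from its reverse nearest neighbors, accumulate within- and between-neighborhood scatter matrices, and take the leading generalized eigenvectors as projection directions. The class-label handling, exact neighborhood definition and regularization details of nLDA are simplified here.

```python
# Illustrative sketch, not the authors' code.
import numpy as np
from scipy.linalg import eigh

def nlda_fit(X, k=5, dim=2):
    n, d = X.shape
    # pairwise distances and k-nearest-neighbor lists (excluding self)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    knn = np.argsort(D, axis=1)[:, :k]
    # reverse nearest neighbors: j is in RNN(i) iff i is in kNN(j)
    rnn = [np.where((knn == i).any(axis=1))[0] for i in range(n)]
    Sw = np.zeros((d, d)); Sb = np.zeros((d, d)); mu = X.mean(0)
    for i in range(n):
        nbhd = np.concatenate([[i], rnn[i]])            # neighborhood = sample + its RNNs
        m_i = X[nbhd].mean(0)
        diff = X[nbhd] - m_i
        Sw += diff.T @ diff                             # within-neighborhood scatter
        Sb += len(nbhd) * np.outer(m_i - mu, m_i - mu)  # between-neighborhood scatter
    # generalized eigenproblem: directions maximising Sb relative to Sw
    w, V = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return V[:, ::-1][:, :dim]                          # top-`dim` projection directions

W = nlda_fit(np.random.randn(60, 10))
print(W.shape)   # (10, 2)
```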
64 citations
TL;DR: A novel convolution autoencoder architecture that can dissociate the spatio-temporal representation to separately capture the spatial and the temporal information is explored, since abnormal events are usually different from the normality in appearance and/or motion behavior.
Abstract: Anomaly detection in videos remains a challenging task due to the ambiguous definition of anomaly and the complexity of visual scenes in real video data. Different from previous work, which utilizes reconstruction or prediction as an auxiliary task to learn temporal regularity, in this work we explore a novel convolution autoencoder architecture that can dissociate the spatio-temporal representation to capture the spatial and the temporal information separately, since abnormal events usually differ from normality in appearance and/or motion behavior. Specifically, the spatial autoencoder models normality in the appearance feature space by learning to reconstruct the input of the first individual frame (FIF), while the temporal part takes the first four consecutive frames as input and the RGB difference as output to simulate the motion of optical flow in an efficient way. Abnormal events, which are irregular in appearance or in motion behavior, lead to a large reconstruction error. To improve detection performance on fast-moving outliers, we exploit a variance-based attention module and insert it into the motion autoencoder to highlight large movement areas. In addition, we propose a deep K-means clustering strategy to force the spatial and the motion encoders to extract a compact representation. Extensive experiments on several publicly available datasets have demonstrated the effectiveness of our method, which achieves state-of-the-art performance. The code is publicly released.
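The PyTorch sketch below illustrates the dissociated two-stream design at a toy scale: a spatial autoencoder reconstructs the first individual frame, a motion autoencoder maps the first four frames to an RGB-difference target, and the anomaly score is the combined reconstruction error. The layer sizes, the RGB-difference definition, the variance-based attention module and the deep K-means term are omitted or simplified and are not the released architecture.

```python
import torch
import torch.nn as nn

def conv_ae(in_ch, out_ch):
    # tiny convolutional encoder-decoder, purely illustrative
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
    )

spatial_ae = conv_ae(in_ch=3, out_ch=3)        # FIF -> reconstructed FIF
motion_ae  = conv_ae(in_ch=12, out_ch=3)       # 4 stacked RGB frames -> RGB difference

def anomaly_score(clip):                        # clip: (B, 5, 3, H, W), 5 consecutive frames
    fif = clip[:, 0]                            # first individual frame
    first4 = clip[:, :4].flatten(1, 2)          # (B, 12, H, W)
    rgb_diff = clip[:, 4] - clip[:, 3]          # crude stand-in for the RGB-difference target
    err_s = ((spatial_ae(fif) - fif) ** 2).mean(dim=(1, 2, 3))
    err_m = ((motion_ae(first4) - rgb_diff) ** 2).mean(dim=(1, 2, 3))
    return err_s + err_m                        # larger error -> more anomalous

print(anomaly_score(torch.rand(2, 5, 3, 64, 64)).shape)   # torch.Size([2])
```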
61 citations
TL;DR: In this article, a multi-scale visual transformer model, referred to as GasHis-Transformer, is proposed for Gastric Histopathological Image Detection (GHID), which enables the automatic global detection of gastric cancer images.
Abstract: In this paper, a multi-scale visual transformer model, referred to as GasHis-Transformer, is proposed for Gastric Histopathological Image Detection (GHID), which enables the automatic global detection of gastric cancer images. The GasHis-Transformer model consists of two key modules designed to extract global and local information, using a position-encoded transformer model and a convolutional neural network with local convolution, respectively. A publicly available hematoxylin and eosin (H&E) stained gastric histopathological image dataset is used in the experiments. Furthermore, a DropConnect based lightweight network is proposed to reduce the model size and training time of GasHis-Transformer for clinical applications with improved confidence. Moreover, a series of contrast and extended experiments verify the robustness, extensibility and stability of GasHis-Transformer. In conclusion, GasHis-Transformer demonstrates high global detection performance and shows significant potential in the GHID task.
61 citations
TL;DR: In this article, a multi-view deep autoencoder model is proposed to fuse the spectral and spatial features extracted from the hyperspectral image into a joint latent representation space.
Abstract: The classification of hyperspectral images is a challenging task due to the high-dimensional space, with a large number of spectral bands, and the low number of labeled training samples. To overcome these challenges, we propose a novel methodology for hyperspectral image classification based on multi-view deep neural networks that fuses both spectral and spatial features using only a small number of labeled samples. First, we process the initial hyperspectral image to extract a set of spectral and spatial features. Each spectral vector is the spectral signature of a pixel of the image. The spatial features are extracted using a simple deep autoencoder, which reduces the high dimensionality of the data while taking into account the neighborhood region of each pixel. Second, we propose a multi-view deep autoencoder model that fuses the spectral and spatial features extracted from the hyperspectral image into a joint latent representation space. Finally, a semi-supervised graph convolutional network is trained on the fused latent representation space to perform hyperspectral image classification. The main advantage of the proposed approach is that it allows the automatic extraction of relevant information while preserving the spatial and spectral features of the data, and improves the classification of hyperspectral images even when the number of labeled samples is low. Experiments are conducted on three real hyperspectral image datasets: Indian Pines, Salinas, and Pavia University. Results show that the proposed approach is competitive in classification performance with the state of the art.
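A schematic PyTorch sketch of the multi-view fusion step is given below: two encoders map the spectral vector and the spatial feature vector into a joint latent code from which both views are reconstructed; the joint code would then feed the semi-supervised GCN. Dimensions and the fusion rule are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiViewAE(nn.Module):
    def __init__(self, spec_dim=200, spat_dim=128, latent_dim=32):
        super().__init__()
        self.enc_spec = nn.Sequential(nn.Linear(spec_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.enc_spat = nn.Sequential(nn.Linear(spat_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.fuse     = nn.Linear(2 * latent_dim, latent_dim)      # joint latent representation
        self.dec_spec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, spec_dim))
        self.dec_spat = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, spat_dim))

    def forward(self, spec, spat):
        z = self.fuse(torch.cat([self.enc_spec(spec), self.enc_spat(spat)], dim=1))
        return self.dec_spec(z), self.dec_spat(z), z               # z would feed the downstream GCN

# toy usage: reconstruct both views from the joint code
model = MultiViewAE()
spec, spat = torch.rand(8, 200), torch.rand(8, 128)
rec_spec, rec_spat, z = model(spec, spat)
loss = nn.functional.mse_loss(rec_spec, spec) + nn.functional.mse_loss(rec_spat, spat)
print(z.shape, float(loss))   # torch.Size([8, 32]) ...
```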
58 citations
TL;DR: SARS-Net as mentioned in this paper is a CADx system combining graph convolutional networks and Convolutional Neural Networks for detecting abnormalities in a patient's CXR images for presence of severe acute respiratory syndrome coronavirus.
Abstract: COVID-19 has emerged as one of the deadliest pandemics that has ever crept upon humanity. Screening tests are currently the most reliable and accurate steps in detecting severe acute respiratory syndrome coronavirus in a patient, and the most used is RT-PCR testing. Various researchers and early studies implied that visual indicators (abnormalities) in a patient's chest X-ray (CXR) or computed tomography (CT) imaging were a valuable characteristic of a COVID-19 patient that can be leveraged to detect the virus in a vast population. Motivated by various contributions from the open-source community to tackle the COVID-19 pandemic, we introduce SARS-Net, a CADx system combining Graph Convolutional Networks and Convolutional Neural Networks for detecting abnormalities in a patient's CXR images that indicate the presence of COVID-19 infection. In this paper, we introduce and evaluate the performance of a custom-made deep learning architecture, SARS-Net, to classify and detect chest X-ray images for COVID-19 diagnosis. Quantitative analysis shows that the proposed model achieves higher accuracy than previously reported state-of-the-art methods. Our proposed model achieved an accuracy of 97.60% and a sensitivity of 92.90% on the validation set.
53 citations
TL;DR: A novel encoder-decoder architecture, called contextual ensemble network (CENet), for semantic segmentation, where the contextual cues are aggregated via densely upsampling the convolutional features of deep layers to the shallow deconvolutional layers.
Abstract: Recently, exploring features from different layers in fully convolutional networks (FCNs) has gained substantial attention for capturing context information in semantic segmentation. This paper presents a novel encoder-decoder architecture, called contextual ensemble network (CENet), for semantic segmentation, where the contextual cues are aggregated via densely upsampling the convolutional features of deep layers to the shallow deconvolutional layers. The proposed CENet is trained for end-to-end segmentation to match the resolution of the input image, and allows us to fully explore contextual features through an ensemble of dense deconvolutions. We evaluate CENet on two widely used semantic segmentation datasets: PASCAL VOC 2012 and Cityscapes. The experimental results demonstrate that CENet achieves superior performance with respect to recent state-of-the-art results. Furthermore, we also evaluate CENet on the MS COCO dataset and the ISBI 2012 dataset for instance segmentation and biological segmentation, respectively. The experimental results show that CENet obtains promising results on these two datasets.
TL;DR: Action Transformer (AcT) as discussed by the authors exploits 2D pose representations over small temporal windows, providing a low-latency solution for accurate and effective real-time performance, and consistently outperforms more elaborate networks that mix convolutional, recurrent and attentive layers.
Abstract: Deep neural networks based purely on attention have been successful across several domains, relying on minimal architectural priors from the designer. In Human Action Recognition (HAR), attention mechanisms have been primarily adopted on top of standard convolutional or recurrent layers, improving the overall generalization capability. In this work, we introduce the Action Transformer (AcT), a simple, fully self-attentional architecture that consistently outperforms more elaborate networks that mix convolutional, recurrent, and attentive layers. In order to limit computational and energy requirements, building on previous human action recognition research, the proposed approach exploits 2D pose representations over small temporal windows, providing a low-latency solution for accurate and effective real-time performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as an attempt to build a formal training and evaluation benchmark for real-time, short-time HAR. The proposed methodology was extensively tested on MPOSE2021 and compared to several state-of-the-art architectures, proving the effectiveness of the AcT model and laying the foundations for future work on HAR.
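The following condensed PyTorch sketch conveys the flavor of a fully self-attentional classifier over short 2D-pose windows: each frame's keypoints become one token, a learned [CLS] token and positional embedding are added, and a Transformer encoder produces the action logits. The layer sizes and these specific design choices are illustrative, not the released AcT model.

```python
import torch
import torch.nn as nn

class PoseTransformer(nn.Module):
    def __init__(self, n_joints=17, window=30, d_model=64, n_classes=20):
        super().__init__()
        self.proj = nn.Linear(2 * n_joints, d_model)                 # one token per frame
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))          # learned [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, window + 1, d_model)) # learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, poses):                                        # poses: (B, window, n_joints, 2)
        x = self.proj(poses.flatten(2))                              # (B, window, d_model)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], 1) + self.pos
        return self.head(self.encoder(x)[:, 0])                      # classify from the [CLS] token

logits = PoseTransformer()(torch.rand(4, 30, 17, 2))
print(logits.shape)   # torch.Size([4, 20])
```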
TL;DR: In this article , a multi-modality graph neural network (MAGNN) is proposed to learn from these multimodal inputs for financial time series prediction, which provides investors with a profitable as well as interpretable option and enables them to make informed investment decisions.
Abstract: Financial time series analysis plays a central role in hedging market risks and optimizing investment decisions. This is a challenging task as the problems are always accompanied by multi-modality streams and lead-lag effects. For example, the price movements of stock are reflections of complicated market states in different diffusion speeds, including historical price series, media news, associated events, etc. Furthermore, the financial industry requires forecasting models to be interpretable and compliant. Therefore, in this paper, we propose a multi-modality graph neural network (MAGNN) to learn from these multimodal inputs for financial time series prediction. The heterogeneous graph network is constructed by the sources as nodes and relations in our financial knowledge graph as edges. To ensure the model interpretability, we leverage a two-phase attention mechanism for joint optimization, allowing end-users to investigate the importance of inner-modality and inter-modality sources. Extensive experiments on real-world datasets demonstrate the superior performance of MAGNN in financial market prediction. Our method provides investors with a profitable as well as interpretable option and enables them to make informed investment decisions.
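To make the two-phase attention concrete, the toy PyTorch sketch below first attends over the sources within each modality and then over the resulting per-modality summaries, so both sets of weights can be inspected for interpretability; it is a schematic stand-in, not the MAGNN implementation, which operates on a heterogeneous financial knowledge graph.

```python
import torch
import torch.nn as nn

class TwoPhaseAttention(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.inner = nn.Linear(d, 1)     # scores sources within one modality
        self.inter = nn.Linear(d, 1)     # scores the per-modality summaries

    def forward(self, modalities):
        # modalities: list of (n_sources_m, d) tensors, one per modality (price, news, events, ...)
        summaries, inner_w = [], []
        for feats in modalities:
            w = torch.softmax(self.inner(feats), dim=0)              # inner-modality weights
            summaries.append((w * feats).sum(0)); inner_w.append(w)
        S = torch.stack(summaries)                                   # (n_modalities, d)
        w2 = torch.softmax(self.inter(S), dim=0)                     # inter-modality weights
        return (w2 * S).sum(0), inner_w, w2                          # fused market representation

fused, inner_w, inter_w = TwoPhaseAttention()([torch.rand(5, 32), torch.rand(3, 32), torch.rand(7, 32)])
print(fused.shape, inter_w.squeeze(-1))
```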
TL;DR: In this paper, an uncertainty-aware mean teacher is incorporated into the scribble-based segmentation method, encouraging the segmentation predictions to be consistent under different perturbations for an input image.
Abstract: Segmentation of infections from CT scans is important for accurate diagnosis and follow-up in tackling COVID-19. Although convolutional neural networks have great potential to automate the segmentation task, most existing deep learning-based infection segmentation methods require fully annotated ground-truth labels for training, which is time-consuming and labor-intensive. This paper proposes a novel weakly supervised segmentation method for COVID-19 infections in CT slices that only requires scribble supervision and is enhanced with uncertainty-aware self-ensembling and transformation-consistent techniques. Specifically, to deal with the difficulty caused by the shortage of supervision, an uncertainty-aware mean teacher is incorporated into the scribble-based segmentation method, encouraging the segmentation predictions to be consistent under different perturbations of an input image. This mean teacher model can guide the student model to be trained using information in images without requiring manual annotations. On the other hand, considering that the output of the mean teacher contains both correct and unreliable predictions, treating every prediction of the teacher model equally may degrade the performance of the student network. To alleviate this problem, a pixel-level uncertainty measure on the predictions of the teacher model is calculated, and the student model is guided only by reliable predictions from the teacher model. To further regularize the network, a transformation-consistent strategy is also incorporated, which requires the prediction to follow the same transformation if a transform is performed on an input image of the network. The proposed method has been evaluated on two public datasets and one local dataset. The experimental results demonstrate that the proposed method is more effective than other weakly supervised methods and achieves performance similar to fully supervised ones.
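A simplified PyTorch sketch of the training signals described above follows: partial cross-entropy on the scribbled pixels, an exponential-moving-average mean teacher, and a consistency loss that only trusts teacher predictions with low uncertainty. The entropy-based uncertainty, noise model and threshold are illustrative stand-ins, and the transformation-consistency term is omitted.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, alpha=0.99):
    # exponential moving average of student weights -> teacher weights
    for tp, sp in zip(teacher.parameters(), student.parameters()):
        tp.data.mul_(alpha).add_(sp.data, alpha=1 - alpha)

def weak_losses(student, teacher, image, scribble, ignore_index=255, tau=0.5):
    s_logits = student(image)
    with torch.no_grad():
        t_prob = torch.softmax(teacher(image + 0.05 * torch.randn_like(image)), dim=1)
        # per-pixel uncertainty as the entropy of the teacher's prediction
        uncertainty = -(t_prob * torch.log(t_prob + 1e-8)).sum(dim=1)
        reliable = (uncertainty < tau).float()                       # keep only confident pixels
    sup = F.cross_entropy(s_logits, scribble, ignore_index=ignore_index)   # scribbled pixels only
    cons = (reliable * ((torch.softmax(s_logits, 1) - t_prob) ** 2).mean(1)).mean()
    return sup + cons

# toy usage with a 1x1-conv "network" standing in for the segmentation model
net = torch.nn.Conv2d(3, 2, 1)
teacher = copy.deepcopy(net)
img = torch.rand(2, 3, 64, 64)
scrib = torch.full((2, 64, 64), 255, dtype=torch.long); scrib[:, 30:34, 30:34] = 1
loss = weak_losses(net, teacher, img, scrib)
loss.backward(); ema_update(teacher, net)
print(float(loss))
```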
TL;DR: In this article, a multi-modality graph neural network (MAGNN) is proposed to learn from these multimodal inputs for financial time series prediction, which provides investors with a profitable as well as interpretable option and enables them to make informed investment decisions.
Abstract: Financial time series analysis plays a central role in hedging market risks and optimizing investment decisions. This is a challenging task as the problems are always accompanied by multi-modality streams and lead-lag effects. For example, the price movements of stock are reflections of complicated market states in different diffusion speeds, including historical price series, media news, associated events, etc. Furthermore, the financial industry requires forecasting models to be interpretable and compliant. Therefore, in this paper, we propose a multi-modality graph neural network (MAGNN) to learn from these multimodal inputs for financial time series prediction. The heterogeneous graph network is constructed by the sources as nodes and relations in our financial knowledge graph as edges. To ensure the model interpretability, we leverage a two-phase attention mechanism for joint optimization, allowing end-users to investigate the importance of inner-modality and inter-modality sources. Extensive experiments on real-world datasets demonstrate the superior performance of MAGNN in financial market prediction. Our method provides investors with a profitable as well as interpretable option and enables them to make informed investment decisions.
TL;DR: In this paper , an uncertainty-aware mean teacher is incorporated into the scribble-based segmentation method, encouraging the segmentation predictions to be consistent under different perturbations for an input image.
Abstract: Segmentation of infections from CT scans is important for accurate diagnosis and follow-up in tackling the COVID-19. Although the convolutional neural network has great potential to automate the segmentation task, most existing deep learning-based infection segmentation methods require fully annotated ground-truth labels for training, which is time-consuming and labor-intensive. This paper proposed a novel weakly supervised segmentation method for COVID-19 infections in CT slices, which only requires scribble supervision and is enhanced with the uncertainty-aware self-ensembling and transformation-consistent techniques. Specifically, to deal with the difficulty caused by the shortage of supervision, an uncertainty-aware mean teacher is incorporated into the scribble-based segmentation method, encouraging the segmentation predictions to be consistent under different perturbations for an input image. This mean teacher model can guide the student model to be trained using information in images without requiring manual annotations. On the other hand, considering the output of the mean teacher contains both correct and unreliable predictions, equally treating each prediction in the teacher model may degrade the performance of the student network. To alleviate this problem, the pixel level uncertainty measure on the predictions of the teacher model is calculated, and then the student model is only guided by reliable predictions from the teacher model. To further regularize the network, a transformation-consistent strategy is also incorporated, which requires the prediction to follow the same transformation if a transform is performed on an input image of the network. The proposed method has been evaluated on two public datasets and one local dataset. The experimental results demonstrate that the proposed method is more effective than other weakly supervised methods and achieves similar performance as those fully supervised.
TL;DR: In this article, a polarization fusion network with geometric feature embedding (PFGFE-Net) was proposed to solve the two defects of traditional feature abandonment and insufficient utilization of traditional features.
Abstract: Current synthetic aperture radar (SAR) ship classifiers using convolutional neural networks (CNNs) offer state-of-the-art performance. Yet, they still have two defects potentially hindering accuracy progress – polarization insufficient utilization and traditional feature abandonment. Therefore, we propose a polarization fusion network with geometric feature embedding (PFGFE-Net) to solve them. PFGFE-Net achieves the polarization fusion (PF) from the input data, feature-level, and decision-level. Moreover, the geometric feature embedding (GFE) enriches expert experience. Results on OpenSARShip reveal PFGFE-Net's excellent performance.
TL;DR: Zhang et al. as discussed by the authors proposed an effective bidirectional pyramid architecture to enhance internal representations of features to cater to fine-grained image recognition task in the few-shot learning scenario.
Abstract: Few-shot fine-grained recognition (FS-FGR) aims to distinguish highly similar objects from different sub-categories with limited supervision. However, traditional few-shot learning solutions typically exploit image-level features and focus on capturing global silhouettes while overlooking local details, resulting in an inevitable loss of inconspicuous but distinguishable information. Thus, how to effectively address fine-grained recognition given limited samples remains a major challenge. In this article, we propose an effective bidirectional pyramid architecture that enhances internal feature representations to suit the fine-grained image recognition task in the few-shot learning scenario. Specifically, we deploy a multi-scale feature pyramid and a multi-level attention pyramid on the backbone network, and progressively aggregate features from different granular spaces through both of them. We further present an attention-guided refinement strategy, in collaboration with the multi-level attention pyramid, to reduce the uncertainty brought by backgrounds under the condition of limited samples. In addition, the proposed method is trained with the meta-learning framework in an end-to-end fashion without any extra supervision. Extensive experimental results on four challenging and widely used fine-grained benchmarks show that the proposed method performs favorably against the state of the art, especially in the one-shot scenario.
TL;DR: An underwater image enhancement method that does not require training on synthetic underwater images and eliminates the dependence on underwater ground-truth images is presented and the experimental results indicate that the proposed method produces visually satisfactory results.
Abstract: In recent years, underwater image enhancement methods based on deep learning have achieved remarkable results. Since the images obtained in complex underwater scenarios lack a ground truth, these algorithms mainly train models on underwater images synthesized from in-air images. Synthesized underwater images are different from real-world underwater images; this difference leads to the limited generalizability of the training model when enhancing real-world underwater images. In this work, we present an underwater image enhancement method that does not require training on synthetic underwater images and eliminates the dependence on underwater ground-truth images. Specifically, a novel domain adaptation framework for real-world underwater image enhancement inspired by transfer learning is presented; it transfers in-air image dehazing to real-world underwater image enhancement. The experimental results on different real-world underwater scenes indicate that the proposed method produces visually satisfactory results.
TL;DR: In this paper , a high-order coverage function (HCF) neuron model was proposed to replace the fully-connected (FC) layers, which utilizes weight coefficients and hyper-parameters to mine underlying geometry with arbitrary shapes in an n-dimensional space.
Abstract: Recent advances in deep neural networks (DNNs) have mainly focused on innovations in network architecture and loss function. In this paper, we introduce a flexible high-order coverage function (HCF) neuron model to replace the fully-connected (FC) layers. The approximation theorem and proof for the HCF are also presented to demonstrate its fitting ability. Unlike the FC layers, which cannot handle high-dimensional data well, the HCF utilizes weight coefficients and hyper-parameters to mine underlying geometries with arbitrary shapes in an n-dimensional space. To explore the power and potential of our HCF neuron model, a high-order coverage function neural network (HCFNN) is proposed, which incorporates the HCF neuron as the building block. Moreover, a novel adaptive optimization method for weights and hyper-parameters is designed to achieve effective network learning. Comprehensive experiments on nine datasets in several domains validate the effectiveness and generalizability of the HCF and HCFNN. The proposed method provides a new perspective for further developments in DNNs and ensures wide application in the field of image classification. The source code is available at https://github.com/Tough2011/HCFNet.git
TL;DR: In this paper , a comprehensive survey of 3D object detection for autonomous driving is presented, encompassing all the main concerns including sensors, datasets, performance metrics and the recent state-of-the-art detection methods, together with their pros and cons.
Abstract: Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of the perception stack, especially for path planning, motion prediction, and collision avoidance. Taking a quick glance at the progress that has been made, we attribute the challenges to visual appearance recovery in the absence of depth information from images, representation learning from partially occluded unstructured point clouds, and semantic alignment over heterogeneous features from cross modalities. Despite existing efforts, 3D object detection for autonomous driving is still in its infancy. Recently, a large body of literature has been devoted to this 3D vision task. Nevertheless, few investigations have looked into collecting and structuring this growing knowledge. We therefore aim to fill this gap with a comprehensive survey, encompassing all the main concerns including sensors, datasets, performance metrics and the recent state-of-the-art detection methods, together with their pros and cons. Furthermore, we provide quantitative comparisons with the state of the art. A case study on fifteen selected representative methods is presented, covering runtime analysis, error analysis, and robustness analysis. Finally, we provide concluding remarks after an in-depth analysis of the surveyed works and identify promising directions for future work.
TL;DR: A comprehensive survey of image augmentation for deep learning using a novel informative taxonomy is presented in this article , where the algorithms are classified into three categories: model-free, model-based, and optimizing policy-based.
Abstract: Although deep learning has achieved satisfactory performance in computer vision, a large volume of images is required. However, collecting images is often expensive and challenging. Many image augmentation algorithms have been proposed to alleviate this issue. Understanding existing algorithms is therefore essential for selecting suitable methods and developing novel ones for a given task. In this study, we perform a comprehensive survey of image augmentation for deep learning using a novel informative taxonomy. To examine the basic objective of image augmentation, we introduce challenges in computer vision tasks and vicinity distribution. The algorithms are then classified into three categories: model-free, model-based, and optimizing policy-based. The model-free category employs methods from image processing, whereas the model-based approach leverages image generation models to synthesize images. In contrast, the optimizing policy-based approach aims to find an optimal combination of operations. Based on this analysis, we believe that our survey enhances the understanding necessary for choosing suitable methods and designing novel algorithms.
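As a small example of the model-free category, the snippet below composes classical image-processing operations into an augmentation pipeline with torchvision; the specific operations and magnitudes are arbitrary illustrative choices.

```python
import torch
from torchvision import transforms

# a model-free augmentation pipeline built from classical image operations
model_free_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
])

img = transforms.ToPILImage()(torch.rand(3, 256, 256))   # stand-in image
print(model_free_aug(img).size)                           # (224, 224)
```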
TL;DR: Wang et al. as discussed by the authors proposed the Embedding Unmasking Model (EUM), operated on top of existing face recognition models, which enables the EUM to produce embeddings similar to those of unmasked faces of the same identities.
Abstract: Using the face as a biometric identity trait is motivated by the contactless nature of the capture process and the high accuracy of the recognition algorithms. With the current COVID-19 pandemic, wearing a face mask has been imposed in public places to keep the pandemic under control. However, face occlusion due to wearing a mask presents an emerging challenge for face recognition systems. In this paper, we present a solution to improve masked face recognition performance. Specifically, we propose the Embedding Unmasking Model (EUM), operated on top of existing face recognition models. We also propose a novel loss function, the Self-restrained Triplet (SRT), which enables the EUM to produce embeddings similar to those of unmasked faces of the same identities. The evaluation results achieved on three face recognition models, two real masked datasets, and two synthetically generated masked face datasets prove that our proposed approach significantly improves performance in most experimental settings.
TL;DR: In this article , a taxonomy of X-ray security imaging algorithms is presented, with a particular focus on object classification, detection, segmentation and anomaly detection tasks, and a performance benchmark is provided based on the current and future trends in deep learning.
Abstract: X-ray security screening is widely used to maintain aviation/transport security, and its significance poses a particular interest in automated screening systems. This paper aims to review computerised X-ray security imaging algorithms by taxonomising the field into conventional machine learning and contemporary deep learning applications. The first part briefly discusses the classical machine learning approaches utilised within X-ray security imaging, while the latter part thoroughly investigates the use of modern deep learning algorithms. The proposed taxonomy sub-categorises the use of deep learning approaches into supervised and unsupervised learning, with a particular focus on object classification, detection, segmentation and anomaly detection tasks. The paper further explores well-established X-ray datasets and provides a performance benchmark. Based on the current and future trends in deep learning, the paper finally presents a discussion and future directions for X-ray security imagery.
TL;DR: Wang et al. as mentioned in this paper proposed a high frequency attention-guided Siamese network (HFA-Net), whose built-in high frequency attention block is composed of two main stages, spatial-wise attention (SA) and high frequency enhancement (HF).
Abstract: • We propose a new convolutional neural network-based Siamese architecture, the high frequency attention Siamese network, for finer recognition of changed building objects in very-high-resolution remotely sensed images. • With the supplementary high frequency information provided by the high frequency enhancement block, the proposed method acquires a better feature representation ability, bringing the expected performance improvement. • The enhancement of global high frequency information in deep neural networks is preliminarily confirmed to be beneficial in building change detection. • Comprehensive comparisons between recent change detection methods and the proposed method are given, indicating that our method achieves state-of-the-art performance in building change detection. Building change detection (BCD) can now be handled well thanks to the boom in deep-learning based computer vision techniques. However, segmentation and recognition of objects with sharp boundaries still suffer from poorly captured high frequency information, which can result in deteriorated annotation of building boundaries in BCD. To better capture the high frequency pattern within the deep learning pipeline, we propose a high frequency attention-guided Siamese network (HFA-Net) in which a novel built-in high frequency attention block (HFAB) is applied. HFA-Net is designed to enhance the high frequency information of buildings via the HFAB, which is composed of two main stages, i.e., spatial-wise attention (SA) and high frequency enhancement (HF). The SA first guides the model to search for and focus on buildings, and HF is then employed to highlight the high frequency information of the input feature maps. With the high frequency information of buildings enhanced by the HFAB, HFA-Net is able to better detect the edges of changed buildings, so as to improve the performance of BCD. Our proposed method is evaluated on three widely used public datasets, i.e., WHU-CD, LEVIR-CD, and the Google dataset. Remarkable experimental results on these datasets indicate that our proposed method better detects the edges of changed buildings and shows better performance. The source code will be released at: https://github.com/HaiXing-1998/HFANet.
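An illustrative PyTorch sketch of such a two-stage block is shown below: a spatial attention map re-weights the feature map, and a crude high-pass step (the feature minus its blurred version) then emphasizes high frequency content. The concrete operators are assumptions for illustration, not the released HFAB.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HFAttentionBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.sa = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # spatial attention from avg/max maps

    def forward(self, x):
        # stage 1: spatial-wise attention (SA)
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.sa(pooled))
        # stage 2: high frequency enhancement (HF) via a crude high-pass filter
        low = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        return x + (x - low)                                   # boost edges / fine structure

feat = torch.rand(2, 64, 32, 32)
print(HFAttentionBlock()(feat).shape)   # torch.Size([2, 64, 32, 32])
```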
TL;DR: In this paper , a 2D convolutional model was proposed to predict the future trajectories of pedestrians, which achieved state-of-the-art results on the ETH and TrajNet datasets.
Abstract: Predicting the future trajectories of pedestrians is a challenging problem with a range of applications, from crowd surveillance to autonomous driving. In the literature, methods for pedestrian trajectory prediction have evolved, transitioning from physics-based models to data-driven models based on recurrent neural networks. In this work, we propose a new approach to pedestrian trajectory prediction with the introduction of a novel 2D convolutional model. This new model outperforms recurrent models and achieves state-of-the-art results on the ETH and TrajNet datasets. We also present an effective system to represent pedestrian positions and powerful data augmentation techniques, such as the addition of Gaussian noise and the use of random rotations, which can be applied to any model. As an additional exploratory analysis, we present experimental results on the inclusion of occupancy methods to model social information, which empirically show that these methods are ineffective in capturing social interaction.
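For reference, the short numpy sketch below applies the two augmentations mentioned in the abstract, random rotation and additive Gaussian noise, to a trajectory given as (T, 2) xy-coordinates; the noise scale and rotation range are illustrative values, not the paper's settings.

```python
import numpy as np

def augment_trajectory(traj, noise_std=0.05, rng=np.random.default_rng()):
    theta = rng.uniform(0, 2 * np.pi)                        # random rotation about the origin
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    rotated = traj @ R.T
    return rotated + rng.normal(0.0, noise_std, traj.shape)  # additive Gaussian noise

traj = np.cumsum(np.random.randn(12, 2) * 0.1, axis=0)       # toy pedestrian track
print(augment_trajectory(traj).shape)                        # (12, 2)
```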