
Showing papers in "The Visual Computer in 2021"


Journal ArticleDOI
TL;DR: This work proposes a face detector named YOLO-face, based on YOLOv3, to improve face detection performance; the approach includes anchor boxes more appropriate for face detection and a more precise regression loss function.
Abstract: Face detection is one of the important tasks of object detection. Typically, detection is the first stage of pattern recognition and identity authentication. In recent years, deep learning-based algorithms for object detection have grown rapidly. These algorithms can be generally divided into two categories, i.e., two-stage detectors like Faster R-CNN and one-stage detectors like YOLO. Although YOLO and its variants are not as accurate as two-stage detectors, they outperform their counterparts by a large margin in speed. YOLO performs well on objects of normal size but struggles to detect small objects, and its accuracy decreases notably when dealing with objects that exhibit large scale variations, such as faces. To address the detection of faces at varying scales, we propose a face detector named YOLO-face, based on YOLOv3, to improve face detection performance. The present approach includes anchor boxes more appropriate for face detection and a more precise regression loss function. The improved detector significantly increases accuracy while maintaining fast detection speed. Experiments on the WIDER FACE and FDDB datasets show that our improved algorithm outperforms YOLO and its variants.

128 citations
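
The abstract notes that YOLO-face uses anchor boxes better suited to faces. A common way to derive detection anchors (popularized by YOLOv2/v3) is k-means clustering over ground-truth box sizes; the sketch below illustrates that general idea only and is not the authors' exact procedure (box loading is assumed).

```python
# Illustrative sketch: deriving face-shaped anchor boxes by k-means over
# ground-truth (width, height) pairs, as popularized by YOLOv2/v3.
# This is NOT the authors' exact procedure; box data loading is assumed.
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes and anchors when both are centered at the origin."""
    w = np.minimum(boxes[:, None, 0], anchors[None, :, 0])
    h = np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)   # minimize 1 - IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]

# boxes: N x 2 array of (w, h) from WIDER FACE annotations (loading assumed)
# print(kmeans_anchors(boxes, k=9))
```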


Journal ArticleDOI
TL;DR: The authors present a feature-based method for 2D face images that uses speeded-up robust features (SURF) and scale-invariant feature transform (SIFT) for feature extraction and achieves a maximum recognition accuracy of 99.7%.
Abstract: Face recognition is the process of identifying people through facial images. It has become vital for security and surveillance applications and is required everywhere, including institutions, organizations, offices, and social places. Face recognition faces a number of challenges, including face pose, age, gender, illumination, and other variable conditions. Another challenge is that the database size for these applications is usually small, which makes training and recognition difficult. Face recognition methods can be divided into two major categories, appearance-based methods and feature-based methods. In this paper, the authors present a feature-based method for 2D face images. Speeded-up robust features (SURF) and scale-invariant feature transform (SIFT) are used for feature extraction. Five public datasets, namely Yale2B, Face 94, M2VTS, ORL, and FERET, are used for the experimental work. Various combinations of SIFT and SURF features with two classification techniques, namely decision tree and random forest, have been evaluated in this work. A maximum recognition accuracy of 99.7% has been reported by the authors with a combination of SIFT (64 components) and SURF (32 components).

110 citations
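
As a rough illustration of the kind of pipeline the abstract describes, the sketch below extracts SIFT descriptors with OpenCV, mean-pools them into one vector per face image, and trains a scikit-learn random forest. It is a simplification: the authors' specific SIFT/SURF component combinations are not reproduced, and data loading is assumed.

```python
# Illustrative sketch: SIFT descriptors + random forest for face recognition.
# A simplification of the paper's pipeline: each image is represented by the
# mean of its SIFT descriptors; the exact SIFT/SURF fusion is not reproduced.
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

sift = cv2.SIFT_create()

def image_vector(gray_img, dim=128):
    """Mean-pool the 128-D SIFT descriptors of one grayscale face image."""
    _, desc = sift.detectAndCompute(gray_img, None)
    return np.zeros(dim) if desc is None else desc.mean(axis=0)

# X_imgs: list of grayscale face crops, y: identity labels (loading assumed)
# X = np.stack([image_vector(img) for img in X_imgs])
# clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# pred = clf.predict(X)   # evaluate on a held-out split in practice
```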


Journal ArticleDOI
TL;DR: An improved image inpainting method based on a new encoder combined with a context loss function is proposed; experiments demonstrate that the proposed algorithm has better adaptive capability than the comparison algorithms on a number of image categories.
Abstract: Image inpainting refers to repairing the missing parts of a broken or incomplete image. Existing image inpainting algorithms based on neural network models suffer from structural distortions and blurred textures along visible connections, and overfitting and overlearning can easily emerge during the inpainting procedure; after repair, there are often obvious signs of repair in damaged areas, semantic discontinuities, and unclear regions. This paper proposes an improved image inpainting method based on a new encoder combined with a context loss function. In order to obtain clear repaired images and ensure that the semantic features of images are fully learned, a generative network based on a fusion of squeeze-and-excitation networks and deep residual learning is proposed, which improves the use of network features while reducing network parameters. At the same time, a discriminative network based on a squeeze-and-excitation residual network is proposed to strengthen the capability of the discriminator. In order to make the generated image more realistic and the restored image more similar to the original, a joint context-aware loss (contextual perception loss) is also proposed to constrain the similarity of local features, so that the repaired image is closer to the original picture. The experimental results demonstrate that the proposed algorithm has better adaptive capability than the comparison algorithms on a number of image categories. In addition, the results of the inpainting procedure are also superior to those of five state-of-the-art algorithms.

98 citations
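
The generator described above fuses squeeze-and-excitation (SE) with residual learning. Below is a minimal PyTorch sketch of a generic SE residual block of that kind, offered only as an illustration of the building block, not the authors' exact architecture.

```python
# Illustrative sketch: a squeeze-and-excitation residual block (SE-ResNet style),
# the kind of building block the paper fuses into its inpainting networks.
# Not the authors' exact architecture.
import torch
import torch.nn as nn

class SEResBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.se = nn.Sequential(                       # squeeze (GAP) + excite (two 1x1 convs)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.body(x)
        return x + y * self.se(y)   # channel-wise reweighting + residual shortcut

# x = torch.randn(1, 64, 32, 32); print(SEResBlock(64)(x).shape)
```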


Journal ArticleDOI
TL;DR: A hybrid of convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) is used, which does automatic feature extraction from the raw sensor data with minimal data pre-processing and outperforms the other compared approaches.
Abstract: Human activity recognition (HAR) has become a significant area of research in human behavior analysis, human–computer interaction, and pervasive computing. Recently, deep learning (DL)-based methods have been applied successfully to time-series data generated from smartphones and wearable sensors to predict various human activities. Even though DL-based approaches perform very well in activity recognition, they still face challenges in handling time-series data. Several issues persist with time-series data, such as difficulties in feature extraction and heavily biased data. Moreover, most HAR approaches rely on manual feature engineering. In this paper, to design a robust classification model for HAR using wearable sensor data, a hybrid of a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) is used. The proposed multibranch CNN-BiLSTM network performs automatic feature extraction from the raw sensor data with minimal data pre-processing. The use of CNN and BiLSTM makes the model capable of learning local features as well as long-term dependencies in sequential data. The different filter sizes used in the proposed model capture various temporal local dependencies and thus help to improve the feature extraction process. To evaluate the model performance, three benchmark datasets, i.e., WISDM, UCI-HAR, and PAMAP2, are utilized. The proposed model has achieved 96.05%, 96.37%, and 94.29% accuracy on the WISDM, UCI-HAR, and PAMAP2 datasets, respectively. The obtained experimental results demonstrate that the proposed model outperforms the other compared approaches.

65 citations
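
A minimal PyTorch sketch of a multibranch CNN-BiLSTM over windowed sensor data is shown below; the branch count, filter sizes, and hidden sizes are illustrative assumptions rather than the paper's configuration.

```python
# Illustrative sketch: a multibranch CNN-BiLSTM for windowed wearable-sensor data.
# Branch count, filter sizes and hidden sizes are hypothetical, not the paper's.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, n_channels=9, n_classes=6, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv1d(n_channels, 32, k, padding=k // 2), nn.ReLU())
            for k in kernel_sizes                      # different temporal receptive fields
        ])
        self.bilstm = nn.LSTM(32 * len(kernel_sizes), 64,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, x):                              # x: (batch, channels, time)
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        out, _ = self.bilstm(feats.transpose(1, 2))    # (batch, time, features)
        return self.fc(out[:, -1])                     # classify from the last time step

# x = torch.randn(8, 9, 128)   # e.g. UCI-HAR-like windows: 9 channels, 128 samples
# print(CNNBiLSTM()(x).shape)  # -> torch.Size([8, 6])
```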


Journal ArticleDOI
TL;DR: A multi-phase blending method with incremental blending intensity is proposed to improve the accuracy of object detectors; it achieves remarkable improvements and outperforms the state-of-the-art one-stage detector RFBNet for real-time processing.
Abstract: Object detection is an important topic for visual data processing in the visual computing area. Although a number of approaches have been studied, it still remains a challenge. Blending training, in which classifiers are trained with blended images and correspondingly blended labels, is an effective way to improve image classifiers. However, our experiments show that directly transferring existing blending methods from classification to object detection makes the training process harder and eventually leads to poor performance. Inspired by this observation, this paper presents a multi-phase blending method with incremental blending intensity that improves the accuracy of object detectors and achieves remarkable improvements. Firstly, to adapt the blending method to the detection task, we propose a smoothly scheduled, incremental blending intensity to control the degree of multi-phase blending. Based on this dynamic coefficient, we propose an incremental blending method in which the blending intensity increases smoothly from zero to full. Therefore, more complex and varied data can be created to regularize the network. Secondly, we design an incremental hybrid loss function to replace the original loss function; the blending intensity in our loss function increases smoothly and is controlled by our scheduled coefficient. Thirdly, we further discard more negative examples in our multi-phase training process than other typical training methods do. By doing so, we regularize the neural network to enhance its generalization capability through data diversity and eventually improve object detection accuracy. Another advantage is that there is no negative effect at evaluation time because our method is applied only during training. Typical experiments show that the proposed method improves the generalization of detection networks. On PASCAL VOC and MS COCO, our method outperforms the state-of-the-art one-stage detector RFBNet for real-time processing.

56 citations
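
The core idea, blending images and labels with an intensity that ramps smoothly from zero to full over training, can be sketched as scheduled mixup. The cosine schedule and the alpha parameter below are assumptions, not the authors' choices.

```python
# Illustrative sketch: mixup-style blending with an intensity that ramps up
# smoothly over training, in the spirit of the paper's incremental blending.
# The cosine schedule and alpha value are assumptions, not the authors' settings.
import numpy as np
import torch

def blend_intensity(step, total_steps):
    """Smoothly increase blending intensity from 0 to 1 during training."""
    return 0.5 * (1.0 - np.cos(np.pi * min(step / total_steps, 1.0)))

def blended_batch(images, one_hot_labels, step, total_steps, alpha=1.0):
    lam = np.random.beta(alpha, alpha)                              # raw mixup coefficient
    lam = 1.0 - blend_intensity(step, total_steps) * (1.0 - lam)    # pull toward 1 early on
    perm = torch.randperm(images.size(0))
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * one_hot_labels + (1.0 - lam) * one_hot_labels[perm]
    return mixed_x, mixed_y

# x, y = next(iter(loader))         # data loading assumed; y as one-hot labels
# mx, my = blended_batch(x, y, step=100, total_steps=10_000)
```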


Journal ArticleDOI
TL;DR: Experimental and comparative results demonstrate the stability and improved performance of the proposed scheme compared to its parent watermarking schemes, and it is free of false positive detection error.
Abstract: This paper presents a new intelligent image watermarking scheme based on the discrete wavelet transform (DWT) and singular value decomposition (SVD) using the human visual system (HVS) and particle swarm optimization (PSO). The cover image is transformed by a one-level DWT, and the LL sub-band of the transformed image is chosen for embedding. To achieve the highest possible visual quality, the embedding regions are selected based on the HVS. After applying SVD to the selected regions, every two watermark bits are embedded indirectly into the U and V^T components of the SVD decomposition of the selected regions, instead of embedding one watermark bit into the U component and compensating in the V^T component, which yields twice the capacity with reasonable imperceptibility. In addition, to increase robustness without losing transparency, the scaling factors are chosen automatically by PSO based on attack test results and predefined conditions, instead of using fixed or manually set scaling factors for all cover images. Experimental and comparative results demonstrate the stability and improved performance of the proposed scheme compared to its parent watermarking schemes. Moreover, the proposed scheme is free of false positive detection error.

55 citations
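
A minimal sketch of the transform-domain step the scheme relies on (one-level DWT, then SVD of an LL-sub-band block) is given below using pywt and numpy. HVS-based region selection, the exact U/V^T embedding rule, and PSO tuning of the scaling factors are omitted.

```python
# Illustrative sketch: the DWT + SVD decomposition step on an 8x8 block of the
# LL sub-band, the domain in which the paper embeds watermark bits. HVS-based
# region selection, the U/V^T embedding rule, and PSO tuning are omitted.
import numpy as np
import pywt

def ll_block_svd(cover_gray, row, col, block=8):
    """One-level Haar DWT, then SVD of one LL-sub-band block."""
    LL, (LH, HL, HH) = pywt.dwt2(cover_gray.astype(float), 'haar')
    region = LL[row:row + block, col:col + block]
    U, S, Vt = np.linalg.svd(region, full_matrices=False)
    return U, S, Vt, (LL, (LH, HL, HH))

# cover = ...  # grayscale cover image as a 2-D numpy array (loading assumed)
# U, S, Vt, coeffs = ll_block_svd(cover, 0, 0)
# ... modify U / Vt according to the watermark bits, rebuild the block,
# ... then pywt.idwt2(coeffs, 'haar') reconstructs the watermarked image.
```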


Journal ArticleDOI
TL;DR: The statistical analysis of the proposed 1-DCP chaotic map shows that it has a simple structure, a high chaotic behavior, and an infinite chaotic range, which makes it a perfect candidate for the design of chaos-based cryptographic systems.
Abstract: In this paper, we propose a new real one-dimensional cosine polynomial (1-DCP) chaotic map. The statistical analysis of the proposed map shows that it has a simple structure, a high chaotic behavior, and an infinite chaotic range. Therefore, the proposed map is a perfect candidate for the design of chaos-based cryptographic systems. Moreover, we propose an application of the 1-DCP map in the design of a new efficient image encryption scheme (1-DCPIE) to further demonstrate the map's good cryptographic properties. In the new scheme, we significantly reduce the encryption time by raising the basic processing unit from the pixel level to the row/column level and replacing the classical sequential permutation–substitution architecture with a parallel one. We apply several simulation and security tests to the proposed scheme and compare its performance with some recently proposed encryption schemes. The simulation results show that 1-DCPIE achieves a better security level and a higher encryption speed.

49 citations
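
The abstract does not give the 1-DCP map equation, so the sketch below uses a classical Chebyshev cosine map purely as a stand-in to show how a 1-D chaotic orbit can be quantized into a keystream for a substitution step.

```python
# Illustrative sketch: turning a 1-D chaotic map into a byte keystream for
# image encryption. The abstract does not give the 1-DCP equation, so a
# classical Chebyshev cosine map is used here purely as a stand-in.
import numpy as np

def chebyshev_keystream(x0, k, n_bytes, burn_in=1000):
    """Iterate x_{n+1} = cos(k * arccos(x_n)) and quantize the orbit to bytes."""
    x = x0
    for _ in range(burn_in):                 # discard the transient
        x = np.cos(k * np.arccos(x))
    out = np.empty(n_bytes, dtype=np.uint8)
    for i in range(n_bytes):
        x = np.cos(k * np.arccos(x))
        out[i] = int(abs(x) * 1e6) % 256     # simple quantization to a byte
    return out

# img = ...  # flattened uint8 image (loading assumed)
# cipher = img ^ chebyshev_keystream(0.3, 4, img.size)   # XOR substitution step
```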


Journal ArticleDOI
TL;DR: A framework is developed to learn robust face verification in an unconstrained environment using convolutional neural networks with aggressive data augmentation and an adaptive fusion of softmax loss and center loss as supervision signals.
Abstract: In recent years, convolutional neural networks have proven to be a highly efficient approach to face recognition. In this paper, we develop such a framework to learn robust face verification in an unconstrained environment using aggressive data augmentation. Our objective is to learn a deep face representation from large-scale data containing many noisy and occluded faces. In addition, we add an adaptive fusion of softmax loss and center loss as supervision signals, which helps to improve performance and to conduct the final classification. The experimental results show that the suggested system achieves performance comparable to other state-of-the-art methods on the Labeled Faces in the Wild and YouTube Faces verification tasks.

47 citations


Journal ArticleDOI
TL;DR: In this paper, the authors summarize the current literature on deep multimodal learning and provide insights and directions for future research, and present a collection of benchmark datasets for solving problems in various vision domains.
Abstract: The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.

45 citations


Journal ArticleDOI
TL;DR: A real-time system to identify the type of disease present in a crop based on leaf images using machine learning is proposed, and a deep convolutional neural network architecture is proposed to classify the crop disease.
Abstract: Early identification of crop disease can help farmers take timely precautions and countermeasures for its removal. In this paper, a real-time system to identify the type of disease present in a crop from leaf images using machine learning is proposed. A deep convolutional neural network architecture is proposed to classify the crop disease, and a single shot detector is used for identification and localization of the leaf. These models are deployed on embedded hardware, an Nvidia Jetson TX1, for real-time in-field plant disease detection and identification. The disease classification accuracy achieved is around 96.88%, and the classification results are compared with existing convolutional neural network architectures. The high success rate of the proposed system in actual field tests also makes it a completely deployable system.

44 citations


Journal ArticleDOI
TL;DR: Five prominent commercial emotion recognition systems are selected and their performance is evaluated via two experiments, finding that the systems varied in how well they handled manipulated images that simulate realistic image distortion.
Abstract: Currently, there are several widely used commercial cloud-based services that attempt to recognize an individual’s emotions based on their facial expressions. Most research into facial emotion recognition has used high-resolution, front-oriented, full-face images. However, when images are collected in naturalistic settings (e.g., using a smartphone’s frontal camera), these images are likely to be far from ideal due to camera positioning, lighting conditions, and camera shake. The impact these conditions have on the accuracy of commercial emotion recognition services has not been studied in full detail. To fill this gap, we selected five prominent commercial emotion recognition systems—Amazon Rekognition, Baidu Research, Face++, Microsoft Azure, and Affectiva—and evaluated their performance via two experiments. In Experiment 1, we compared the systems’ accuracy at classifying images drawn from three standardized facial expression databases. In Experiment 2, we first identified several common scenarios (e.g., partially visible face) that can lead to poor-quality pictures during smartphone use, and manipulated the same set of images used in Experiment 1 to simulate these scenarios. We used the manipulated images to again compare the systems’ classification performance, finding that the systems varied in how well they handled manipulated images that simulate realistic image distortion. Based on our findings, we offer recommendations for developers and researchers who would like to use commercial facial emotion recognition technologies in their applications.

Journal ArticleDOI
TL;DR: Several chaos-theory analysis tests demonstrate that the proposed 1-DCF map has many good cryptographic properties, such as highly chaotic behavior, a large chaotic range, an infinite number of unstable fixed points, and a sensitivity to initial conditions far superior to that of most low-dimensional chaotic maps.
Abstract: In this paper, we propose a new real one-dimensional cosine fractional (1-DCF) chaotic map. Several chaos-theory analysis tests demonstrate that the proposed map has many good cryptographic properties, such as highly chaotic behavior, a large chaotic range, an infinite number of unstable fixed points, and a sensitivity to initial conditions far superior to that of most low-dimensional chaotic maps. Given these attractive features, we use the 1-DCF map to design a novel fast image encryption scheme for real-time image processing. Unlike most existing encryption schemes, we adopt a permutation-less architecture to increase the encryption speed. Despite the absence of a permutation phase, a high security level is obtained by using a substitution process with high sensitivity to the plain image. Moreover, we replace the natural row-order encryption with a more secure random-like encryption order generated from the secret key. Experimentation and simulations show that the new scheme is better than many recently proposed encryption schemes in both security and speed.

Journal ArticleDOI
TL;DR: An in-depth overview of recent object tracking research is provided, and the latest research trend in object tracking based on convolutional neural networks, which is receiving growing attention, is reviewed.
Abstract: Visual object tracking has become one of the most active research topics in computer vision, growing in commercial development as well as academic research. Many visual trackers have been proposed in the last two decades. Recent studies of computer vision for dynamic scenes include motion detection, object classification, environment modeling, tracking of moving objects, understanding of object behaviors, object identification, and data fusion from multiple sensors. This paper provides an in-depth overview of recent object tracking research. Object tracking in realistic scenarios often faces challenging problems such as camera motion, occlusion, illumination effects, clutter, and similar appearance. A variety of trackers have been published that combine multiple techniques to solve multiple visual tracking sub-problems. This paper also reviews the latest research trend in object tracking based on convolutional neural networks, which is receiving growing attention. Finally, the paper discusses future challenges and research directions for object tracking problems that still need extensive study in the coming years.

Journal ArticleDOI
TL;DR: Experimental results on the KTH, Weizmann, UT-Interaction, and TenthLab datasets show that the proposed algorithm achieves higher accuracy than other methods in the literature.
Abstract: In order to improve the accuracy of human abnormal behavior recognition, a two-stream convolutional neural network model is proposed. This model includes two main parts, VMHI and FRGB. Firstly, motion history images are extracted and input into a VGG-16 convolutional neural network for training. Then, the RGB image is input into the Faster R-CNN algorithm for training, using Kalman filter-assisted data annotation. Finally, the two-stream VMHI and FRGB results are fused. The algorithm can recognize not only single-person behavior but also two-person interaction behavior, and it improves the recognition accuracy of similar actions. Experimental results on the KTH, Weizmann, UT-Interaction, and TenthLab datasets show that the proposed algorithm achieves higher accuracy than other methods in the literature.
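
The VMHI stream consumes motion history images built by frame differencing. A minimal numpy sketch of that step is shown below; the threshold and decay duration are assumptions, not the paper's settings.

```python
# Illustrative sketch: building a motion history image (MHI) by frame
# differencing, the kind of input the VMHI stream consumes. The threshold
# and decay duration are assumptions, not the paper's settings.
import numpy as np

def update_mhi(mhi, prev_gray, cur_gray, timestamp, duration=15, thresh=30):
    """Set moving pixels to the current timestamp, decay old motion away."""
    motion = np.abs(cur_gray.astype(int) - prev_gray.astype(int)) > thresh
    mhi[motion] = timestamp
    mhi[~motion & (mhi < timestamp - duration)] = 0    # forget stale motion
    return mhi

# frames: list of grayscale frames (loading assumed)
# mhi = np.zeros(frames[0].shape, dtype=float)
# for t in range(1, len(frames)):
#     mhi = update_mhi(mhi, frames[t - 1], frames[t], t)
# mhi_img = (255 * mhi / mhi.max()).astype(np.uint8)    # normalized MHI for VGG-16
```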

Journal ArticleDOI
TL;DR: A novel method to detect fights or violent actions based on learning both the spatial and temporal features from equally spaced sequential frames of a video, using the proposed feature fusion method to take into account the motion information.
Abstract: Human behavior detection is essential for public safety and monitoring. However, in human-based surveillance systems, it requires continuous human attention and observation, which is a difficult task. Detection of violent human behavior using autonomous surveillance systems is of critical importance for uninterrupted video surveillance. In this paper, we propose a novel method to detect fights or violent actions by learning both the spatial and temporal features from equally spaced sequential frames of a video. Multi-level features for two sequential frames, extracted from the convolutional neural network’s top and bottom layers, are combined using the proposed feature fusion method to take the motion information into account. We also propose a Wide-Dense Residual Block to learn these combined spatial features from the two input frames. These learned features are then concatenated and fed to long short-term memory units for capturing temporal dependencies. The feature fusion method and the use of additional wide-dense residual blocks enable the network to learn combined features from the input frames effectively and yield better accuracy. Experimental results evaluated on four publicly available datasets, HockeyFight, Movies, ViolentFlow, and BEHAVE, show the superior performance of the proposed model in comparison with state-of-the-art methods.

Journal ArticleDOI
TL;DR: An image encryption scheme based on an improved Arnold map that includes a Divide & Rotate operation and pixel shuffling is presented; it shows high sensitivity and resistance against common attacks.
Abstract: In this paper, we present an image encryption scheme based on an improved Arnold map. The improvement to the Arnold map includes a Divide & Rotate operation and pixel shuffling. The resulting shuffled Arnold map shows better results in terms of performance and speed. The proposed encryption scheme applies a preprocessing procedure to the plain image. We use the shuffled Arnold map in the confusion process for only one round. For the diffusion process, we execute a Forward & Backward process to apply an integer-level manipulation of pixel values. The evaluation of the proposed method shows high sensitivity and resistance against common attacks.
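
For reference, the classical Arnold cat map confusion step on an N x N image is sketched below; the paper's Divide & Rotate operation and pixel shuffling are not specified in the abstract and are not reproduced.

```python
# Illustrative sketch: the classical Arnold cat map confusion step on an N x N
# image. The paper's Divide & Rotate and pixel-shuffling improvements are not
# specified in the abstract and are not reproduced here.
import numpy as np

def arnold_cat(img, rounds=1):
    """Scatter each pixel (x, y) to ((x + y) mod N, (x + 2y) mod N)."""
    n = img.shape[0]
    assert img.shape[0] == img.shape[1], "classical Arnold map needs a square image"
    x, y = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
    out = img.copy()
    for _ in range(rounds):
        nxt = np.empty_like(out)
        nxt[(x + y) % n, (x + 2 * y) % n] = out[x, y]   # bijective permutation mod N
        out = nxt
    return out

# scrambled = arnold_cat(gray_image, rounds=3)   # image loading assumed
```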

Journal ArticleDOI
TL;DR: A novel lightweight end-to-end feature refinement network (FRNet) is proposed to address the issue that consecutive down-sampling operations in the encoder lead to the loss of spatial information, which is important for medical image segmentation.
Abstract: Medical image segmentation is a crucial but challenging task for computer-aided diagnosis. In recent years, fully convolutional network-based methods have been widely applied to medical image segmentation. U-shape-based approaches are among the most successful structures in this medical field. However, the consecutive down-sampling operations in the encoder lead to the loss of spatial information, which is important for medical image segmentation. In this paper, we present a novel lightweight end-to-end feature refinement network (FRNet) to address this issue. The structure of our model is simple and efficient. Specifically, the network adopts an encoder-decoder network as the backbone, and two additional paths, a spatial refinement path and a semantic refinement path, are applied to the encoder and decoder, respectively, to improve the detailed representation ability and discriminative ability of our model. In addition, we introduce a feature adaptive fusion block (FAF block) that effectively combines features of different depths. The proposed FRNet can be trained in an end-to-end way. We have evaluated our method on three different medical image segmentation tasks. Experimental results show that FRNet performs better than state-of-the-art approaches, achieving, without any post-processing, a high average accuracy of 0.968 and 0.936 for blood vessel segmentation and skin lesion segmentation, respectively. We further demonstrate that our method can be easily applied to other network structures to improve their performance.

Journal ArticleDOI
TL;DR: This paper presents an efficient dual integrated convolution neural network (DICNN) model for real-time recognition of facial expressions in the wild on an embedded platform; the model is optimized using the TensorRT SDK and deployed on an Nvidia Xavier embedded platform.
Abstract: Automatic recognition of facial expressions in the wild is a challenging problem and has drawn a lot of attention from the computer vision and pattern recognition community. Since their emergence, the deep learning techniques have proved their efficacy in facial expression recognition (FER) tasks. However, these techniques are parameter intensive, and thus, could not be deployed on resource-constrained embedded platforms for real-world applications. To mitigate these limitations of the deep learning inspired FER systems, in this paper, we present an efficient dual integrated convolution neural network (DICNN) model for the recognition of facial expressions in the wild in real-time, running on an embedded platform. The designed DICNN model with just 1.08M parameters and 5.40 MB memory storage size achieves optimal performance by maintaining a proper balance between recognition accuracy and computational efficiency. We evaluated the DICNN model on four FER benchmark datasets (FER2013, FERPlus, RAF-DB, and CKPlus) using different performance evaluation metrics, namely the recognition accuracy, precision, recall, and F1-score. Finally, to provide a portable solution with high throughput inference, we optimized the designed DICNN model using TensorRT SDK and deployed it on an Nvidia Xavier embedded platform. Comparative analysis results with the other state-of-the-art methods revealed the effectiveness of the designed FER system, which achieved competitive accuracy with multi-fold improvement in the execution speed.

Journal ArticleDOI
TL;DR: A deep learning (DL) and transfer learning-based approach is proposed to classify histopathological images for breast cancer diagnosis that outperforms the baseline methods in terms of multiple performance measures.
Abstract: Breast cancer is one of the leading causes of death among women nowadays. Several methods have been proposed for the detection of breast cancer, including various machine learning-based automatic diagnosis systems known as computer-aided diagnosis (CAD) systems. Initial CAD systems were based on classical machine learning algorithms; however, owing to their automatic feature extraction ability, convolutional neural network (CNN)-based deep learning models are now widely adopted. Deep learning is used in many fields, and healthcare is one of the essential fields it has transformed. Another common issue faced by patients is the difference of opinion among pathologists and medical practitioners; such human errors often lead to misleading or delayed judgments, which may be fatal. To improve decision consistency and efficiency and to reduce errors, researchers in healthcare are using deep learning-based approaches and have achieved state-of-the-art results. In this study, a deep learning (DL) and transfer learning-based approach is proposed to classify histopathological images for breast cancer diagnosis. We adopt a patch selection approach to classify breast histopathological images with a small number of training images using transfer learning without losing performance. Initially, patches are extracted from whole slide images and fed into a CNN for feature extraction. Based on these features, the discriminative patches are selected and then fed to an EfficientNet architecture pre-trained on the ImageNet dataset. Features extracted from the EfficientNet architecture are also used to train an SVM classifier. The proposed model outperforms the baseline methods in terms of multiple performance measures.
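
A minimal sketch of the final stage described above (frozen EfficientNet features feeding an SVM) is given below with torchvision and scikit-learn. Patch extraction and discriminative patch selection are omitted, and the B0 variant is an assumption.

```python
# Illustrative sketch: frozen EfficientNet features feeding an SVM classifier,
# mirroring the paper's final stage. Patch extraction / discriminative patch
# selection are omitted, and the B0 variant is an assumption.
import torch
from torchvision import models
from sklearn.svm import SVC

weights = models.EfficientNet_B0_Weights.IMAGENET1K_V1
backbone = models.efficientnet_b0(weights=weights).eval()
preprocess = weights.transforms()                     # ImageNet resize / normalization

@torch.no_grad()
def embed(pil_images):
    """Return one pooled feature vector per image patch."""
    x = torch.stack([preprocess(im) for im in pil_images])
    feats = backbone.avgpool(backbone.features(x)).flatten(1)
    return feats.numpy()

# patches, labels = ...                               # histopathology patches (assumed)
# clf = SVC(kernel='rbf').fit(embed(patches), labels)
# print(clf.predict(embed(patches[:4])))
```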

Journal ArticleDOI
TL;DR: The experimental results show that compared with the existing schemes, the proposed watermarking scheme has higher performance, such as better invisibility, stronger robustness and shorter execution time.
Abstract: In order to protect the copyright of color images effectively, a blind color image watermarking scheme with high performance in the spatial domain is proposed in this paper, combining the advantages of spatial-domain and frequency-domain watermarking schemes. The presented scheme does not require an actual discrete cosine transform (DCT) or discrete Hartley transform (DHT), but only uses different quantization steps to complete the embedding and blind extraction of the color watermark in the spatial domain, according to the unique characteristics of the direct current (DC) components of the DCT and DHT. The contributions of this paper include the following: (1) the scheme combines the strengths of spatial-domain and frequency-domain watermarking, giving fast speed and strong robustness; (2) the scheme makes full use of the energy aggregation characteristics of image blocks, so the invisibility of the watermarking scheme is greatly improved; and (3) different quantization steps are chosen to embed and extract the watermark in different layers, which effectively reduces the modification range of pixel values. The experimental results show that, compared with existing schemes, the proposed watermarking scheme has higher performance, including better invisibility, stronger robustness, and shorter execution time.

Journal ArticleDOI
TL;DR: This paper proposes an end-to-end scale-invariant head detection framework that can handle a broad range of scales and demonstrates that scale variations can be handled by modeling a set of specialized scale-specific convolutional neural networks with different receptive fields.
Abstract: Crowd counting in high-density crowds has significant importance in crowd safety and crowd management. Existing state-of-the-art methods employ regression models to count the number of people in an image. However, regression models are blind and cannot localize the individuals in the scene. On the other hand, detection-based crowd counting in high-density crowds is a challenging problem due to significant variations in scales, poses, and appearances. The variations in poses and appearances can be handled through large-capacity convolutional neural networks. However, the problem of scale lies at the heart of every detector and needs to be addressed for effective crowd counting. In this paper, we propose an end-to-end scale-invariant head detection framework that can handle a broad range of scales. We demonstrate that scale variations can be handled by modeling a set of specialized scale-specific convolutional neural networks with different receptive fields. These scale-specific detectors are combined into a single backbone network whose parameters are optimized in an end-to-end fashion. We evaluated our framework on the challenging benchmark datasets UCF-QNRF and UCSD. The experimental results demonstrate that the proposed framework beats existing methods by a large margin.

Journal ArticleDOI
TL;DR: In this paper, an improved YOLO model was proposed to detect oil palm loose fruits from UAV images, where the images are augmented by brightness, rotation, and blurring to simulate the actual natural environment.
Abstract: Manual harvesting of loose fruits in oil palm plantations is both time consuming and physically laborious. An automatic harvesting system is an alternative solution for precision agriculture, which requires accurate visual information about the targets. Current state-of-the-art one-stage object detection methods provide excellent detection accuracy; however, they are computationally intensive and impractical for embedded systems. This paper proposes an improved YOLO model to detect oil palm loose fruits from unmanned aerial vehicle images. In order to improve the robustness of the detection system, the images are augmented by brightness, rotation, and blurring to simulate the actual natural environment. The proposed improved YOLO model adopts several improvements: a densely connected neural network for better feature reuse, the swish activation function, multi-layer detection to enhance detection of small targets, and prior box optimization to obtain accurate bounding box information. The experimental results show that the proposed model achieves an outstanding average precision of 99.76% with a detection time of 34.06 ms. In addition, the proposed model is also lightweight and requires less training time, which is significant in reducing hardware costs. The results exhibit the superiority of the proposed improved YOLO model over several existing state-of-the-art detection models.

Journal ArticleDOI
TL;DR: An up-to-date review of computational analysis techniques for measuring the emotional facial expression of people with PD (PWP) is presented, along with an overview of clinical applications of automated facial expression analysis.
Abstract: Among the various means of communication, the human face is the most powerful. Persons suffering from Parkinson’s disease (PD) experience hypomimia, which often leads to a reduction in facial expression. Hypomimia affects social interaction and has a highly undesirable impact on the quality of life of patients as well as their relatives. To track the longitudinal progression of PD, the Movement Disorder Society’s Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) is usually used in clinical studies, and item 3.2 (i.e., facial expression) of the MDS-UPDRS defines hypomimia levels. Assessment of facial expressions has traditionally relied on an observer-based scale, which can be time-consuming. Computational analysis techniques for facial expressions can assist the clinician in decision making; the intention of such techniques is to predict an objective and accurate score for facial expression. The aim of this paper is to present an up-to-date review of computational analysis techniques for measuring the emotional facial expression of people with PD (PWP), along with an overview of clinical applications of automated facial expression analysis. This led us to examine a pilot experimental work on masked face detection in PD, for which a deep learning-based model was trained on an NVIDIA GeForce 920M GPU. It was observed that the deep learning-based model yields 85% accuracy on the testing images.

Journal ArticleDOI
TL;DR: A deep neural network-based approach for view classification and content-based image retrieval is proposed and its application to efficient medical image retrieval is demonstrated; an approach for assigning body part orientation view classification labels is also designed, intended to reduce the variance that occurs in different types of scans.
Abstract: In medical applications, retrieving similar images from repositories is essential for supporting diagnostic imaging-based clinical analysis and decision support systems. However, this is a challenging task due to the multi-modal and multi-dimensional nature of medical images. In practical scenarios, the availability of large and balanced datasets that can be used for developing intelligent systems for efficient medical image management is quite limited. Traditional models often fail to capture the latent characteristics of images and have achieved limited accuracy when applied to medical images. To address these issues, a deep neural network-based approach for view classification and content-based image retrieval is proposed, and its application to efficient medical image retrieval is demonstrated. We also design an approach for assigning body part orientation view classification labels, intended to reduce the variance that occurs in different types of scans. The learned features are used first to predict class labels and later to model the feature space for similarity computation in the retrieval task. The outcome of this approach is measured in terms of an error score. When benchmarked against 12 state-of-the-art works, the model achieved the lowest error score of 132.45, with a 9.62–63.14% improvement over other works, thus highlighting its suitability for real-world applications.

Journal ArticleDOI
TL;DR: A deep learning-based DeGlow–DeHaze iterative model that accounts for varying colors and glows is introduced; the multi-path CNN model outperforms other state-of-the-art methods in terms of the PSNR and SSIM evaluation parameters and computation time.
Abstract: In this paper, we address the single-image haze removal problem in nighttime scenes. Night haze removal is a severely ill-posed problem due to the presence of various visible night light sources with varying colors and non-uniform illumination. These light sources have different shapes and introduce a noticeable amount of glow in night scenes. To overcome these effects, we introduce a deep learning-based DeGlow–DeHaze iterative model that accounts for varying colors and glows. The proposed model is a linear combination of three terms: the direct transmission attenuation, airlight, and glow. First, a multi-path dilated convolution DeGlow network is introduced to interactively learn local features with different receptive fields and reduce the glow effect. The glow term is estimated with a binary mask that indicates the location of the illumination sources. As a result, the nighttime image is left with only the direct transmission and airlight terms. Finally, a separate post-processing DeHaze network is included to remove the haze effect from the glow-reduced image. For model training, we collected night hazy images from internal and external resources, synthesized transmission maps from the NYU depth datasets, and consequently restored the haze-free images. The quantitative and qualitative evaluations show the effectiveness of our model on several real and synthetic images and compare our results with existing night haze models. The experimental results demonstrate that our multi-path CNN model outperforms other state-of-the-art methods in terms of PSNR (19.25 dB), SSIM (0.9958), and computation time.

Journal ArticleDOI
TL;DR: Experimental findings demonstrate that the proposed Gabor-modulated CNN network has limited network parameters to learn; therefore, it is quite easy to train such networks.
Abstract: MR brain tumor classification is one of the most widely used approaches in medical prognosis. However, analyzing and processing MR brain images is still quite a task for radiologists. To address this problem, existing canonical techniques have already been evaluated. There are numerous MR brain tumor classification approaches in use for medical diagnosis. In this paper, we develop an automated computer-aided network for diagnosing the MR brain tumor class, i.e., HGG or LGG. We propose a Gabor-modulated convolutional filter-based classifier for brain tumor classification. The inclusion of Gabor filter dynamics endows the network with the competency to deal with spatial and orientational transformations. This modification (modulation) of conventional convolutional filters by Gabor filters empowers the proposed architecture to learn relatively smaller feature maps, thereby decreasing the network parameter requirement. We introduce some skip connections into our modulated CNN architecture without introducing extra network parameters. Pre-trained networks, i.e., AlexNet, GoogLeNet (Inception V1), ResNet, and VGG-19, are considered for the performance evaluation of our proposed Gabor-modulated CNN. Additionally, some popular machine learning classification techniques are also considered for comparative analysis. Experimental findings demonstrate that our proposed network has limited network parameters to learn; therefore, it is quite easy to train such networks.
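
Gabor modulation of convolutional filters can be sketched as multiplying each learned kernel by a bank of fixed Gabor kernels at several orientations, as in the Gabor-CNN literature; the exact modulation rule used by the authors may differ, so the sketch below is only an illustration of the idea.

```python
# Illustrative sketch: modulating learned convolution kernels with a bank of
# fixed Gabor kernels at several orientations, in the spirit of the paper's
# Gabor-modulated filters. The authors' exact modulation rule may differ.
import cv2
import numpy as np
import torch

def gabor_bank(ksize=3, n_orient=4, sigma=1.0, lambd=2.0, gamma=0.5):
    thetas = np.arange(n_orient) * np.pi / n_orient
    bank = [cv2.getGaborKernel((ksize, ksize), sigma, t, lambd, gamma) for t in thetas]
    return torch.tensor(np.stack(bank), dtype=torch.float32)   # (n_orient, k, k)

def modulate(weight, bank):
    """Element-wise multiply each learned kernel by every Gabor orientation."""
    # weight: (out_ch, in_ch, k, k) -> (out_ch * n_orient, in_ch, k, k)
    out = weight[:, None] * bank[None, :, None]          # broadcast over orientations
    return out.reshape(-1, *weight.shape[1:])

# w = torch.randn(16, 8, 3, 3)                # a learned conv weight (assumed)
# print(modulate(w, gabor_bank()).shape)      # -> torch.Size([64, 8, 3, 3])
```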

Journal ArticleDOI
TL;DR: This paper proposes a novel human action recognition method by fusing spatial and temporal features learned from a simple unsupervised convolutional neural network called principal component analysis network (PCANet) in combination with bag-of-features (BoF) and vector of locally aggregated descriptors (VLAD) encoding schemes.
Abstract: Human action recognition is still a challenging topic in the computer vision field that has attracted a large number of researchers. It has significant importance in a variety of applications such as intelligent video surveillance, sports analysis, and human–computer interaction. Recent works attempt to exploit progress in deep learning architectures to learn spatial and temporal features from action videos. However, it remains unclear how to combine spatial and temporal information with a convolutional neural network. In this paper, we propose a novel human action recognition method that fuses spatial and temporal features learned from a simple unsupervised convolutional neural network, the principal component analysis network (PCANet), in combination with bag-of-features (BoF) and vector of locally aggregated descriptors (VLAD) encoding schemes. Firstly, both spatial and temporal features are learned via PCANet using a subset of frames and temporal templates for each video, while their dimensionality is reduced using a whitening transformation (WT). The temporal templates are calculated using short-time motion energy images (ST-MEI) based on frame differencing. Then, the encoding scheme is applied to represent the final dual spatiotemporal PCANet features by feature fusion. Finally, a support vector machine (SVM) classifier is used for action recognition. Extensive experiments have been performed on two popular datasets, namely KTH and UCF Sports, to evaluate the performance of the proposed method. Our experimental results using a leave-one-out evaluation strategy demonstrate that the proposed method gives satisfactory and comparable results on both datasets.

Journal ArticleDOI
TL;DR: This paper proposes an unsupervised fabric defect detection method based on the human visual attention mechanism, which introduces two-dimensional entropy that can reflect the spatial distribution characteristics of images, building on one-dimensional entropy and the relationship between information entropy and image texture.
Abstract: The automatic detection of defects is an important part of the fabric production process. However, existing methods of detecting defects in fabrics with periodic patterns lack adaptability and perform poorly in detection. In this paper, we propose an unsupervised fabric defect detection method based on the human visual attention mechanism. The method introduces two-dimensional entropy which can reflect the spatial distribution characteristics of images based on one-dimensional entropy, according to the relationship between information entropy and image texture. The image is reconstructed into a quaternion matrix by combining two-dimensional entropy and three feature maps that characterize the opponent color space representation of the input image. The hypercomplex Fourier transform is then used to transform the quaternion image matrix into the frequency domain. We propose a new method for local tuning of amplitude spectrum, thereby suppressing the background pattern while retaining the defect region. Finally, the inverse transform is performed to obtain a saliency map. Through experimental comparisons and a series of numerical evaluations, we demonstrate that the proposed method has a better detection effect compared to state-of-the-art methods in fabric defect detection.
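
The two-dimensional entropy referred to above is commonly computed from the joint histogram of each pixel's gray level and its local neighborhood mean. A minimal numpy/scipy sketch of that computation follows; the 3x3 neighborhood is an assumption, not necessarily the paper's setting.

```python
# Illustrative sketch: two-dimensional image entropy from the joint histogram of
# each pixel's gray level and its local neighborhood mean, the quantity the
# method builds its saliency features on. The 3x3 neighborhood is an assumption.
import numpy as np
from scipy.ndimage import uniform_filter

def entropy_2d(gray, levels=256):
    gray = gray.astype(np.uint8)
    nbr_mean = uniform_filter(gray.astype(float), size=3).astype(np.uint8)
    joint, _, _ = np.histogram2d(gray.ravel(), nbr_mean.ravel(),
                                 bins=levels, range=[[0, levels], [0, levels]])
    p = joint / joint.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# print(entropy_2d(fabric_gray_image))    # image loading assumed
```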

Journal ArticleDOI
TL;DR: This paper proposes a genetic programming (GP)-based method that combines the two well-known features of histograms of oriented gradients and local binary patterns, and significantly outperforms or achieves similar performance to relevant methods from the state-of-the-art, even with a limited number of training instances.
Abstract: Classifying texture images relies heavily on the quality of the extracted features. However, producing a reliable set of features is a difficult task that often requires human intervention to select a set of prominent primitives. The process becomes more difficult when it comes to fuse low-level descriptors because of data redundancy and high dimensionality. To overcome these challenges, several approaches use machine learning to automate primitive detection and feature extraction while combining low-level descriptors. Nevertheless, most of these approaches performed the two processes separately while ignoring the correlation between them. In this paper, we propose a genetic programming (GP)-based method that combines the two well-known features of histograms of oriented gradients and local binary patterns. Indeed, a three-layer tree-based binary program is learned using genetic programming for each pair of classes. The three layers incorporate patch detection, feature fusion and classification in the GP optimization process. The feature fusion function is designed to handle different variations, notably illumination and rotation, while reducing dimensionality. The proposed method has been compared, using six challenging collections of images, with multiple domain-expert GP and non-GP methods for binary and multi-class classifications. Results show that the proposed method significantly outperforms or achieves similar performance to relevant methods from the state-of-the-art, even with a limited number of training instances.
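
As a point of reference for the two descriptors being fused, the sketch below extracts HOG and uniform-LBP features with scikit-image and simply concatenates them; the learned three-layer GP fusion itself is not reproduced.

```python
# Illustrative sketch: extracting the two descriptors the GP method fuses (HOG
# and LBP) and concatenating them as a simple baseline feature vector. The
# learned three-layer GP fusion itself is not reproduced here.
import numpy as np
from skimage.feature import hog, local_binary_pattern

def hog_lbp_features(gray, P=8, R=1):
    h = hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    lbp = local_binary_pattern(gray, P, R, method='uniform')
    lbp_hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return np.concatenate([h, lbp_hist])

# X = np.stack([hog_lbp_features(img) for img in texture_images])  # loading assumed
```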

Journal ArticleDOI
TL;DR: The aim of the paper is to propose a method to detect the presence of neovascularization that involves image processing methods such as resizing, green channel filtering, and Gaussian filtering, and morphology techniques such as erosion and dilation.
Abstract: Diabetic retinopathy (DR), also called diabetic eye disease, causes damage to the retina due to diabetes mellitus and leads to blindness when the disease reaches an extreme stage. The medical tests for the proliferative stage of diabetic retinopathy (PDR) involve many procedures and take considerable time and money. Hence, to resolve this problem, this model is proposed to detect and identify the proliferative stage of diabetic retinopathy, whose hallmark feature is neovascularization. In the proposed system, the paper aims to correctly identify the presence of neovascularization using color fundus images. The presence of neovascularization in an eye is an indication that the eye is affected by PDR. Neovascularization is the development of new abnormal blood vessels in the retina. Since the occurrence of neovascularization may lead to partial or complete vision loss, timely and accurate prediction is important. The aim of the paper is to propose a method to detect the presence of neovascularization that involves image processing methods such as resizing, green channel filtering, and Gaussian filtering, and morphology techniques such as erosion and dilation. For classification, different CNN layers have been used and modeled together in a VGG-16 architecture. The model was trained and tested on 2200 images altogether from the Kaggle database. The proposed model was tested using the DRIVE and STARE datasets, and the accuracy, specificity, sensitivity, precision, and F1 score achieved are 0.96, 0.99, 0.95, 0.99, and 0.97, respectively, on DRIVE and 0.95, 0.99, 0.9375, 0.96, and 0.95, respectively, on STARE.
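
A minimal OpenCV sketch of the preprocessing steps named in the abstract (resizing, green channel extraction, Gaussian filtering, erosion and dilation) is given below; the kernel sizes and target size are assumptions, not the paper's settings.

```python
# Illustrative sketch: the preprocessing steps named in the abstract (resize,
# green channel, Gaussian filter, erosion and dilation) applied to a color
# fundus image before classification. Kernel sizes are assumptions.
import cv2
import numpy as np

def preprocess_fundus(bgr_image, size=(224, 224)):
    img = cv2.resize(bgr_image, size)
    green = img[:, :, 1]                               # green channel: best vessel contrast
    blurred = cv2.GaussianBlur(green, (5, 5), 0)
    kernel = np.ones((3, 3), np.uint8)
    opened = cv2.dilate(cv2.erode(blurred, kernel), kernel)   # erosion then dilation
    return opened

# prepped = preprocess_fundus(cv2.imread('fundus.jpg'))   # path is illustrative
```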