
Showing papers in "IET Computer Vision in 2017"


Journal ArticleDOI
TL;DR: The authors present an analytical framework to classify and to evaluate these methods based on some important functional measures, and a categorisation of the state-of-the-art approaches in deep learning for human action recognition is presented.
Abstract: A study is carried out on one of the most important issues in human action recognition: how to create proper data representations with a high-level abstraction from high-dimensional, noisy video data. Most of the recent successful studies in this area focus on deep learning. Deep learning methods have gained superiority over other approaches in the field of image recognition. In this survey, the authors first investigate the role of deep learning in both image and video processing and recognition. Owing to the variety and abundance of deep learning methods, the authors discuss them in a comparative form. For this purpose, the authors present an analytical framework to classify and to evaluate these methods based on some important functional measures. Furthermore, a categorisation of the state-of-the-art approaches in deep learning for human action recognition is presented. The authors summarise the significantly related works in each approach and discuss their performance.

60 citations


Journal ArticleDOI
TL;DR: The proposed framework performs well in classifying the digital mammograms as normal, benign or malignant and its subclasses as well and exhibits significant improvement in performance over the traditional methods.
Abstract: In this study, a novel deep learning-based framework for classifying digital mammograms is introduced. The development of this methodology is based on deep learning strategies that model the presence of tumour tissues with level sets. It is difficult to robustly segment mammogram images due to the low contrast between normal and lesion tissues. Therefore, the Chan-Vese level set method is used to extract the initial contour of mammograms, and a deep learning convolutional neural network (DL-CNN) algorithm is used to learn the features of mammary-specific masses and microcalcification clusters. To increase the classification accuracy and reduce false positives, a well-known fully complex-valued relaxation network classifier is used in the last stage of the DL-CNN network. Experimental results using the standard benchmark breast cancer datasets (MIAS and BCDR) show that the proposed method exhibits significant improvement in performance over traditional methods. The accuracy, sensitivity, specificity and AUC achieved are 99%, 0.9875, 1.0 and 0.9815, respectively. The proposed framework performs well in classifying digital mammograms as normal, benign or malignant, as well as their subclasses.
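
The two-stage idea above (a Chan-Vese contour to localise candidate tissue, then a CNN to classify it) can be sketched roughly as follows. This is a minimal illustration using scikit-image's chan_vese and an assumed toy network; the layer sizes and patch size are assumptions, not the authors' DL-CNN or their relaxation-network classifier.

```python
# Sketch: Chan-Vese initialisation, then CNN classification of the region.
import numpy as np
import torch
import torch.nn as nn
from skimage.segmentation import chan_vese

def lesion_mask(mammogram: np.ndarray) -> np.ndarray:
    # Chan-Vese level set gives the initial contour / candidate region.
    return chan_vese(mammogram.astype(float), mu=0.25)

class SmallCNN(nn.Module):
    # Stand-in for the paper's DL-CNN; depth and widths are assumptions.
    def __init__(self, n_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 16 * 16, n_classes))

    def forward(self, x):
        return self.head(self.features(x))

if __name__ == "__main__":
    img = np.random.rand(64, 64)                # toy stand-in for a mammogram patch
    roi = torch.from_numpy((img * lesion_mask(img)).astype("float32"))[None, None]
    logits = SmallCNN()(roi)                    # normal / benign / malignant scores
    print(logits.shape)
```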

56 citations


Journal ArticleDOI
TL;DR: The authors identify and introduce existing, publicly available, benchmark datasets and software resources that fuse colour and depth data for MHT and present a brief comparative evaluation of the performance of those works that have applied their methods to these datasets.
Abstract: Multiple human tracking (MHT) is a fundamental task in many computer vision applications. Appearance-based approaches, primarily formulated on RGB data, are constrained and affected by problems arising from occlusions and/or illumination variations. In recent years, the arrival of cheap RGB-depth devices has led to many new approaches to MHT, and many of these integrate colour and depth cues to improve each and every stage of the process. In this survey, the authors present the common processing pipeline of these methods and review their methodology based (a) on how they implement this pipeline and (b) on what role depth plays within each stage of it. They identify and introduce existing, publicly available, benchmark datasets and software resources that fuse colour and depth data for MHT. Finally, they present a brief comparative evaluation of the performance of those works that have applied their methods to these datasets.

42 citations


Journal ArticleDOI
TL;DR: This study addresses the automatic multi-person tracking problem in complex scenes from a single, static, uncalibrated camera using a sequential tracking-by-detection framework, which can be applied to real-time applications.
Abstract: This study addresses the automatic multi-person tracking problem in complex scenes from a single, static, uncalibrated camera. In contrast with offline tracking approaches, a novel online multi-person tracking method is proposed based on a sequential tracking-by-detection framework, which can be applied to real-time applications. A two-stage data association is first developed to handle the drifting targets stemming from occlusions and people's abrupt motion changes. Subsequently, a novel online appearance learning is developed by using the incremental/decremental support vector machine with an adaptive training sample collection strategy to ensure reliable data association and rapid learning. Experimental results show the effectiveness and robustness of the proposed method while demonstrating its compatibility with real-time applications.
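
The data association step at the heart of such a tracking-by-detection pipeline can be illustrated with a minimal sketch. Here a plain Euclidean cost and the Hungarian algorithm stand in for the paper's two-stage association and incremental SVM appearance scores, which are not reproduced.

```python
# Minimal one-stage association sketch: assign detections to tracks by
# solving a linear assignment problem over a position cost (an assumption).
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, max_cost=50.0):
    # Cost: Euclidean distance between track predictions and detections.
    cost = np.linalg.norm(tracks[:, None] - detections[None, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]

tracks = np.array([[10.0, 12.0], [40.0, 42.0]])
detections = np.array([[11.0, 13.0], [80.0, 90.0], [39.0, 41.0]])
print(associate(tracks, detections))   # [(0, 0), (1, 2)]
```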

34 citations


Journal ArticleDOI
TL;DR: The authors propose a fitting-based optimisation method for salient object detection algorithms that analyses the quantitative relationship between saliency and ground truth values, and uses the derived relationship to fit the saliency values to the original saliency maps.
Abstract: To overcome some major problems with traditional saliency evaluation metrics, full-reference image quality assessment (IQA) metrics, which have similar but stricter objectives, are used. Inspired by the root mean absolute error, the authors propose a fitting-based optimisation method for salient object detection algorithms. Their algorithm analyses the quantitative relationship between saliency and ground truth values, and uses the derived relationship to fit the saliency values to the original saliency maps. This ensures that the resulting images, which are composed of fitted values, are closer to the ground truth. The proposed algorithm first computes the statistics of the ground truth and saliency maps computed by each salient object detection algorithm. These statistics are used to compute the parameters of four fitting models, which generally agree with the characteristics of the statistical data. For a new saliency map, they use the fitting model with the computed parameters to obtain the fitted saliency values, which are confined to the range [0, 255]. Finally, they evaluate their saliency optimisation algorithm using traditional evaluation metrics, IQA metrics, and a content-based image retrieval application. The results show that the proposed approach improves the quality of the optimised saliency maps.
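
A rough sketch of the fitting idea, under the assumption of a single cubic polynomial standing in for the paper's four fitting models: learn a mapping from saliency values to ground-truth values on training pairs, apply it to a new map, and clip to [0, 255].

```python
# Sketch: fit saliency-to-ground-truth mapping, apply, clip to [0, 255].
import numpy as np

def fit_saliency_model(saliency_maps, ground_truths, degree=3):
    s = np.concatenate([m.ravel() for m in saliency_maps]).astype(float)
    g = np.concatenate([m.ravel() for m in ground_truths]).astype(float)
    return np.polynomial.polynomial.polyfit(s, g, degree)   # least-squares fit

def optimise(saliency_map, coeffs):
    fitted = np.polynomial.polynomial.polyval(saliency_map.astype(float), coeffs)
    return np.clip(fitted, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
sal = [rng.integers(0, 256, (32, 32)) for _ in range(4)]
gt = [(m > 128).astype(float) * 255 for m in sal]           # toy ground truth
coeffs = fit_saliency_model(sal, gt)
print(optimise(sal[0], coeffs).dtype)
```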

34 citations


Journal ArticleDOI
TL;DR: A novel two-stream fully convolutional network architecture for action recognition is designed that can significantly reduce parameters while maintaining performance, achieving state-of-the-art results on two challenging datasets.
Abstract: Human action recognition is an important and challenging topic in computer vision. Recently, convolutional neural networks (CNNs) have established impressive results for many image recognition tasks. CNNs usually contain millions of parameters, which are prone to overfitting when trained on small datasets. Therefore, CNNs do not produce superior performance over traditional methods for action recognition. In this study, the authors design a novel two-stream fully convolutional network architecture for action recognition which can significantly reduce parameters while maintaining performance. To exploit the advantage of spatial-temporal features, a linear weighted fusion method is used to fuse the two streams' feature maps, and a video pooling method is adopted to construct the video-level features. At the same time, the authors also demonstrate that improved dense trajectories have a significant impact on action recognition. The authors' method achieves state-of-the-art performance on two challenging datasets: UCF101 (93.0%) and HMDB51 (70.2%).
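
The linear weighted fusion and video pooling steps can be sketched as follows; the fusion weight and the max-pooling choice are assumptions, not the paper's exact configuration.

```python
# Sketch: weighted sum of two-stream feature maps, then video-level pooling.
import numpy as np

def linear_weighted_fusion(spatial_maps, temporal_maps, w=0.6):
    # Element-wise weighted sum of same-shaped feature maps.
    return w * spatial_maps + (1.0 - w) * temporal_maps

def video_pooling(frame_features):
    # Max-pool frame-level features over time into a video-level descriptor.
    return frame_features.max(axis=0)

frames, c, h, wd = 16, 64, 7, 7
fused = linear_weighted_fusion(np.random.rand(frames, c, h, wd),
                               np.random.rand(frames, c, h, wd))
video_feature = video_pooling(fused.reshape(frames, -1))
print(video_feature.shape)   # (c * h * w,)
```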

29 citations


Journal ArticleDOI
TL;DR: A localisation-segmentation framework is presented to cater for small-object segmentation of the left ventricle (LV) cavity from cardiac magnetic resonance (CMR) images, showing the potential of deep learning approaches for this particular application.
Abstract: This work conducts a feasibility study of deep learning approaches for automatic segmentation of the left ventricle (LV) cavity from cardiac magnetic resonance (CMR) images. Automatic LV cavity segmentation is a challenging task, partially due to the small size of the object compared to the large CMR image background, especially at the apex. To cater for small-object segmentation, the authors present a localisation-segmentation framework that first locates the object in the large full image, then segments the object within the small cropped region of interest. The localisation is performed by a deep regression model based on convolutional neural networks, while the segmentation is done by deep neural networks based on the U-Net architecture. They also employ the Dice loss function for the training process of the segmentation models, to investigate its effect on the segmentation performance. The deep learning models are trained and evaluated using public endocardium-annotated CMR datasets from the York University and MICCAI 2009 LV Challenge websites. The average Dice metric values of the authors' proposed framework are 0.91 and 0.93, respectively, on these two databases. These results are promising compared to the best results achieved by the current state of the art, which shows the potential of deep learning approaches for this particular application.
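
The Dice loss mentioned above is standard and easy to state; a minimal PyTorch version follows (the smoothing constant is a common assumption, not taken from the paper).

```python
# Minimal Dice loss sketch for binary segmentation training.
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    # pred: sigmoid probabilities, target: binary mask, both (N, 1, H, W).
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (union + eps)
    return 1 - dice.mean()

pred = torch.rand(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
print(dice_loss(pred, target).item())
```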

28 citations


Journal ArticleDOI
TL;DR: The devised sparse representation method employs both the original and virtual training samples to improve classification accuracy, since the two kinds of training samples allow sample information to be fully exploited and satisfactory robustness to be obtained.
Abstract: Among all image representation and classification methods, sparse representation has proven to be an extremely powerful tool. However, a limited number of training samples is an unavoidable problem for sparse representation methods. Many efforts have been devoted to improving the performance of sparse representation methods. In this study, the authors propose a novel framework to improve the classification accuracy of sparse representation methods. They first introduce the concept of approximations of all training samples (i.e. virtual training samples). The advantage of this is that the application of virtual training samples allows noise in the original training samples to be partially reduced. They then propose an efficient and competent objective function to disclose more discriminant information between different classes, which is very significant for obtaining a better classification result. The devised sparse representation method employs both the original and virtual training samples to improve classification accuracy, since the two kinds of training samples allow sample information to be fully exploited and satisfactory robustness to be obtained. The experimental results on the JAFFE, ORL, Columbia Object Image Library (COIL-100), AR and CMU PIE databases show that the proposed method outperforms state-of-the-art image classification methods.
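
A loose sketch of the virtual-training-sample idea: each original sample gets a noise-reduced approximation, and both are used as dictionary atoms for sparse-representation classification. The blending rule and the OMP solver below are assumptions of this example, not the authors' objective function.

```python
# Sketch: sparse-representation classification over original + virtual samples.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def make_virtual(X, alpha=0.5):
    # Virtual samples as noise-reduced approximations: a blend of each
    # sample with the global training mean (one simple assumption).
    return alpha * X + (1 - alpha) * X.mean(axis=0)

def src_classify(x, X, y, n_nonzero=10):
    # Label = class whose atoms give the smallest reconstruction residual.
    D = np.vstack([X, make_virtual(X)])
    labels = np.concatenate([y, y])
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero,
                                    fit_intercept=False).fit(D.T, x)
    coef = omp.coef_
    residuals = {c: np.linalg.norm(x - D[labels == c].T @ coef[labels == c])
                 for c in np.unique(labels)}
    return min(residuals, key=residuals.get)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 64))                    # 40 training samples, 64-dim
y = np.repeat(np.arange(4), 10)                  # 4 classes
print(src_classify(X[0] + 0.1 * rng.normal(size=64), X, y))
```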

23 citations


Journal ArticleDOI
TL;DR: A large margin relative distance learning (LMRDL) method which learns the metric from triplet constraints, so that the problem of imbalanced sample pairs can be bypassed.
Abstract: Distance metric learning has achieved great success in person re-identification. Most existing methods that learn metrics from pairwise constraints suffer from the problem of imbalanced data. In this study, the authors present a large margin relative distance learning (LMRDL) method which learns the metric from triplet constraints, so that the problem of imbalanced sample pairs can be bypassed. Different from existing triplet-based methods, LMRDL employs an improved triplet loss that enforces penalisation on the triplets with minimal inter-class distance, and this leads to a more stringent constraint to guide the learning. To suppress the large variations of pedestrians' appearance in different camera views, the authors propose to learn the metric over the intra-class subspace. The proposed method is formulated as a logistic metric learning problem with a positive semi-definite constraint, and the authors derive an efficient optimisation scheme to solve it based on the accelerated proximal gradient approach. Experimental results show that the proposed method achieves state-of-the-art performance on three challenging datasets (VIPeR, PRID450S, and GRID).
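
The improved triplet constraint, which penalises the triplets with minimal inter-class distance, can be sketched as a hinge over a learned Mahalanobis distance. The loss below is an illustrative reading, not the paper's exact formulation.

```python
# Sketch: large-margin triplet hinge under a learned metric M = L^T L.
import numpy as np

def triplet_hinge(anchor, positive, negatives, L, margin=1.0):
    d = lambda a, b: np.sum((L @ (a - b)) ** 2)      # learned squared distance
    d_neg = min(d(anchor, n) for n in negatives)     # minimal inter-class distance
    return max(0.0, margin + d(anchor, positive) - d_neg)

rng = np.random.default_rng(0)
L = np.eye(8)                                        # identity metric to start
a, p = rng.normal(size=8), rng.normal(size=8)
negs = rng.normal(size=(5, 8))
print(triplet_hinge(a, p, negs, L))
```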

22 citations


Journal ArticleDOI
TL;DR: This work describes several important enhancements made in the original framework related to the pre-processing steps, feature calculation and training setup and proposes the augmented framework, which stands out in terms of the detection accuracy and computational complexity compared to contemporary detectors.
Abstract: The histogram intersection kernel support vector machine (SVM) is accepted as a better discriminator than its linear counterpart when used for pedestrian detection in images and video frames. Its computational complexity has, however, limited its use in practical real-time detectors. To circumvent this problem, prior work proposed a low-complexity detection framework based on integer-only histogram of oriented gradients features, which allow a look-up-table-based implementation of the kernel SVM, leading to further simplification without compromising detection performance. This work describes several important enhancements made to the original framework related to the pre-processing steps, feature calculation and training setup. As a result, the augmented framework proposed in this study stands out in terms of detection accuracy and computational complexity compared to contemporary detectors. The best detector described in this study achieves 8% and 2% lower miss rates (MRs) on the ETH and INRIA pedestrian datasets, respectively, compared to the well-known boosting-cascades-based aggregate channel feature detector, despite avoiding complex floating-point operations. Moreover, the proposed detector performs markedly better in scenarios where fewer than 10^-2 false positives per image are desired, as demonstrated through the MR versus false positives curves.
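
The look-up-table trick works because the histogram intersection kernel decision function decomposes per dimension; with integer features, each dimension's partial sum can be precomputed into a table indexed by the feature value. A minimal sketch (feature ranges and sizes are assumptions):

```python
# Sketch: LUT implementation of a histogram intersection kernel SVM.
import numpy as np

def build_tables(support_vectors, dual_coefs, max_val):
    # tables[d][v] = sum_i alpha_i * min(sv_i[d], v), for v = 0..max_val
    n_dim = support_vectors.shape[1]
    tables = np.zeros((n_dim, max_val + 1))
    for v in range(max_val + 1):
        tables[:, v] = dual_coefs @ np.minimum(support_vectors, v)
    return tables

def decision(x, tables, bias=0.0):
    # One table look-up per dimension replaces all kernel evaluations.
    return tables[np.arange(len(x)), x].sum() + bias

rng = np.random.default_rng(0)
svs = rng.integers(0, 16, size=(20, 36))         # integer HOG-like features
alphas = rng.normal(size=20)
tables = build_tables(svs, alphas, max_val=15)
print(decision(rng.integers(0, 16, size=36), tables))
```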

20 citations


Journal ArticleDOI
TL;DR: This work uses a pretrained deep residual neural network to extract features, and then utilises a sparse partial least-squares regression approach to estimate ages from the side-view of face images.
Abstract: In recent years, automatic facial age estimation has gained popularity due to its numerous applications. Much work has been done on frontal images and, lately, minimal estimation errors have been achieved on most of the benchmark databases. However, in reality, images obtained in unconstrained environments are not always frontal. For instance, when conducting a demographic study or crowd analysis, one may get profile images of the face. To the best of the authors' knowledge, no attempt has been made to estimate ages from the side view of face images. Here, the authors address this by using a pretrained deep residual neural network to extract features, and then utilise a sparse partial least-squares regression approach to estimate ages. Despite having less information compared with frontal images, the results show that the extracted deep features achieve promising performance.
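
The pipeline, pretrained residual features followed by PLS regression, might look like the following sketch; sklearn's dense PLSRegression stands in for the sparse PLS variant used in the paper, and the backbone choice and dimensions are assumptions.

```python
# Sketch: ResNet feature extraction + partial least-squares age regression.
import numpy as np
import torch
import torchvision.models as models
from sklearn.cross_decomposition import PLSRegression

resnet = models.resnet18(weights=None)       # pretrained weights in practice
resnet.fc = torch.nn.Identity()              # expose the 512-d feature layer
resnet.eval()

def extract(images):                         # images: (N, 3, 224, 224) tensor
    with torch.no_grad():
        return resnet(images).numpy()

X = extract(torch.rand(20, 3, 224, 224))     # toy profile-face batch
ages = np.random.uniform(18, 70, size=20)
pls = PLSRegression(n_components=8).fit(X, ages)
print(pls.predict(X[:3]).ravel())
```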

Journal ArticleDOI
TL;DR: The face location algorithm is developed to improve the reliability of face detection and extract a face region with a high proportion of skin and facial structure estimation is exploited to further reduce the impact of non-skin factors on dynamic skin colour modelling.
Abstract: Reliable and accurate facial skin extraction is the most critical and urgent issue for adaptive skin detection. Aiming at resolving this issue, the authors propose an adaptive skin detection method using face location and facial structure estimation. The face location algorithm is developed to improve the reliability of face detection and extract a face region with a high proportion of skin. Facial structure estimation is exploited to further reduce the impact of non-skin factors on dynamic skin colour modelling. The colour space distribution model of extracted facial skin is very close to that of real facial skin. Finally, the skin in an image is obtained by using a hybrid colour space strategy. Extensive experimental comparisons with some state-of-the-art methods have shown the superior performance of the proposed method.

Journal ArticleDOI
TL;DR: The aim of optimisation is to distribute particles in high-likelihood area according to the cognitive effect and improve quality of particles, while the objective of dynamic resampling is to maintain diversity in the particle set.
Abstract: Particle filters (PFs) are sequential Monte Carlo methods that use a particle representation of the state-space model to implement the recursive Bayesian filter for non-linear and non-Gaussian systems. Owing to this property, PFs have been extensively used for object tracking in recent years. Although PFs provide a robust object tracking framework, they suffer from shortcomings. Particle degeneracy and the particle impoverishment brought about by the resampling step result in a poor construction of the posterior probability density function (PDF) of the state. To overcome these problems, this work amalgamates two characteristics of population-based heuristic optimisation algorithms, exploration and exploitation, with a PF implementing a dynamic resampling method. The aim of the optimisation is to distribute particles in the high-likelihood area according to the cognitive effect and improve the quality of particles, while the objective of dynamic resampling is to maintain diversity in the particle set. This work uses the very efficient spider monkey optimisation to achieve this. Furthermore, to test the efficiency of the proposed algorithm, experiments were carried out on a one-dimensional state estimation problem, the bearings-only tracking problem, standard videos and synthesised videos. The metrics obtained show that the proposed algorithm outperforms the simple PF, particle swarm optimisation-based PF, and cuckoo search-based PF, and effectively handles different challenges inherent in object tracking.
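
For context, here is a minimal bootstrap particle filter with systematic resampling, the step whose degeneracy/impoverishment trade-off the paper targets; the spider monkey optimisation move that repositions particles is not reproduced.

```python
# Sketch: bootstrap particle filter for a 1D state with Gaussian likelihood.
import numpy as np

def systematic_resample(weights, rng):
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    return np.searchsorted(np.cumsum(weights), positions)

def pf_step(particles, weights, observation, rng, noise=0.5):
    particles = particles + rng.normal(0, noise, size=particles.shape)  # predict
    weights = weights * np.exp(-0.5 * (observation - particles) ** 2)   # update
    weights /= weights.sum()
    idx = systematic_resample(weights, rng)     # fixes degeneracy, but
    return particles[idx], np.full(len(idx), 1 / len(idx))  # impoverishes

rng = np.random.default_rng(0)
particles, weights = rng.normal(0, 1, 100), np.full(100, 0.01)
for obs in [0.2, 0.4, 0.7]:
    particles, weights = pf_step(particles, weights, obs, rng)
print(particles.mean())
```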

Journal ArticleDOI
TL;DR: The authors propose a novel, semi-supervised multi-label active learning (SSMAL) method that combines automated annotation with human annotation to reduce the annotation workload associated with the active learning process.
Abstract: Multi-label image classification has attracted considerable attention in machine learning recently. Active learning is widely used in multi-label learning because it can effectively reduce the human annotation workload required to construct high-performance classifiers. However, annotation by experts is costly, especially when the number of labels in a dataset is large. Inspired by the idea of semi-supervised learning, in this study, the authors propose a novel, semi-supervised multi-label active learning (SSMAL) method that combines automated annotation with human annotation to reduce the annotation workload associated with the active learning process. In SSMAL, they capture three aspects of potentially useful information – classification prediction information, label correlation information, and example spatial information – and they use this information to develop an effective strategy for automated annotation of selected unlabelled example-label pairs. The experimental results obtained in this study demonstrate the effectiveness of the authors' proposed approach.

Journal ArticleDOI
TL;DR: The 2D singularities of masses and their surrounding regions are analysed with the Ripplet-II transform to quantify the texture information of mammographic regions and classify them as benign or malignant.
Abstract: Masses are one of the prevalent early signs of breast cancer visible in mammograms. However, their variation in shape, size, and appearance often hinders the proper diagnosis of mammographic masses. This study analyses the 2D singularities of masses and their surrounding regions with the Ripplet-II transform to classify them as benign or malignant. Since benign and malignant masses may change the orientation patterns of normal breast tissues differently, several textural features, including Ripplet-II coefficients and statistical co-variates derived from the Ripplet-II transformed images, are extracted to quantify the texture information of mammographic regions. The important features are then selected using a stepwise logistic regression technique and evaluated using linear discriminant analysis and a support vector machine with ten-fold cross-validation. The best performance, in terms of an area under the receiver operating characteristic curve of 0.91 ± 0.01 and 0.83 ± 0.01 and an accuracy of 87.28 ± 0.02 and 75.60 ± 0.01, is obtained with the proposed method when experimenting with 58 images from the mini-MIAS and 200 images from the Digital Database for Screening Mammography databases, respectively.

Journal ArticleDOI
TL;DR: The authors explore different feature spaces by employing features commonly used in object detection to improve the performance of detector in feature space and propose a robust scale estimation algorithm that estimates the size of the object in the current frame.
Abstract: In this study, the authors propose two kinds of improvements to a baseline tracker that employs the tracking-by-detection framework. First, they explore different feature spaces by employing features commonly used in object detection to improve the performance of the detector. Second, they propose a robust scale estimation algorithm that estimates the size of the object in the current frame. Their experimental results on the challenging online tracking benchmark (OTB-13) dataset show that a reduced-dimensionality histogram of oriented gradients boosts the performance of the tracker. The proposed scale estimation algorithm provides a significant gain and reduces the failures of the tracker in challenging scenarios. The improved tracker is compared with 13 state-of-the-art trackers. The quantitative and qualitative results show that the performance of the tracker is comparable with the state of the art against initialisation errors, variations in illumination, scale and motion, out-of-plane and in-plane rotations, deformations and low resolution.

Journal ArticleDOI
TL;DR: A system that directly transcribes scene text images to text without character segmentation is developed, achieving competitive performance compared with the state of the art on several public scene text datasets, including both lexicon-based and lexicon-free ones.
Abstract: Text recognition in natural scenes remains a challenging problem due to the highly variable appearance of text in unconstrained conditions. The authors develop a system that directly transcribes scene text images to text without character segmentation. They formulate the problem as sequence labelling. They build a convolutional recurrent neural network by using deep convolutional neural networks (CNNs) for modelling text appearance and recurrent neural networks (RNNs) for sequence dynamics. The two models are complementary in modelling capabilities and are therefore integrated together to form the segmentation-free system. They train a Gaussian mixture model-hidden Markov model to supervise the training of the CNN model. The system is data driven and needs no hand-labelled training data. Their method has several appealing properties: (i) it can recognise arbitrary-length text images; (ii) the recognition process does not involve sophisticated character segmentation; (iii) it is trained on scene text images with only word-level transcriptions; (iv) it can recognise both lexicon-based and lexicon-free text. The proposed system achieves competitive performance compared with the state of the art on several public scene text datasets, including both lexicon-based and lexicon-free ones.
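
The convolutional recurrent architecture can be sketched as a CNN that collapses the image height into per-column features feeding a bidirectional LSTM; all layer sizes below are assumptions, and the GMM-HMM-supervised training the paper describes is omitted.

```python
# Sketch: convolutional recurrent network producing per-column label scores.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_chars=37):              # 26 letters + 10 digits + blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
        self.out = nn.Linear(512, n_chars)

    def forward(self, x):                        # x: (N, 1, 32, W)
        f = self.cnn(x)                          # (N, 128, 8, W/2)
        f = f.permute(0, 3, 1, 2).flatten(2)     # (N, W/2, 128*8)
        h, _ = self.rnn(f)
        return self.out(h)                       # per-column label scores

print(CRNN()(torch.rand(2, 1, 32, 100)).shape)   # (2, 50, 37)
```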

Journal ArticleDOI
TL;DR: A computer-aided diagnosis system to differentiate between four breast imaging reporting and data system (Bi-RADS) classes in digitised mammograms is proposed, inspired by the approach of the doctor during the radiologic examination.
Abstract: The goal of this study is to propose a computer-aided diagnosis system to differentiate between the four breast imaging reporting and data system (BI-RADS) classes in digitised mammograms. This system is inspired by the approach of the doctor during the radiologic examination, as agreed in BI-RADS, where masses are described by their form, their boundary and their density. The segmentation of masses in the authors' approach is manual, because it is assumed that detection has already been performed. Once the segmented region is available, the feature extraction process can be carried out. 22 visual characteristics are automatically computed from shape, edge and textural properties; only one human feature is used in this study, which is the patient's age. Classification is finally done using a multi-layer perceptron according to two separate schemes; the first one classifies masses to distinguish between the four BI-RADS classes (2, 3, 4 and 5), while the second one classifies abnormalities into two classes (benign and malignant). The proposed approach has been evaluated on 480 mammographic masses extracted from the digital database for screening mammography, and the obtained results are encouraging.
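
The classification stage, a multi-layer perceptron over the 23 features (22 visual descriptors plus age), can be illustrated with a small sklearn sketch on synthetic data; the hidden-layer size and training settings are assumptions.

```python
# Sketch: MLP over 23 features predicting the BI-RADS class.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 23))                        # 22 visual features + age
y = rng.integers(2, 6, size=200)                 # BI-RADS classes 2..5
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)
print(clf.predict(X[:5]))
```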

Journal ArticleDOI
TL;DR: This study addresses the problem of efficiently combining the joint, RGB and depth modalities of the Kinect sensor in order to recognise human actions with a multi-layered fusion scheme that builds specialised local and global SVM models and iteratively fuses their different scores.
Abstract: This study addresses the problem of efficiently combining the joint, RGB and depth modalities of the Kinect sensor in order to recognise human actions. For this purpose, a multi-layered fusion scheme concatenates different specific features, builds specialised local and global SVM models and then iteratively fuses their different scores. The authors essentially contribute at two levels: (i) they combine the performance of local descriptors with the strength of global bag-of-visual-words representations, and are then able to generate improved local decisions that allow the handling of noisy frames; (ii) they also study the performance of multiple fusion schemes guided by different feature concatenations, Fisher vector representation concatenation and late iterative score fusion. To prove the efficiency of their approach, they have evaluated their experiments on two challenging public datasets: CAD-60 and CGC-2014. Competitive results are obtained for both benchmarks.

Journal ArticleDOI
TL;DR: A new algorithm is proposed to generalise intra-class variations of multi-sample subjects to single-sample subjects by a deep autoencoder and reconstruct new samples.
Abstract: One sample per person (OSPP) face recognition is a challenging problem in the face recognition community. A lack of samples is the main reason for the failure of most algorithms in OSPP. In this study, the authors propose a new algorithm to generalise the intra-class variations of multi-sample subjects to single-sample subjects by deep autoencoder and to reconstruct new samples. In the proposed algorithm, a generalised deep autoencoder is first trained with all images in the gallery, then a class-specific deep autoencoder (CDA) is fine-tuned for each single-sample subject with its single sample. Samples of the multi-sample subject that is most similar to the single-sample subject are input to the corresponding CDA to reconstruct new samples. For classification, minimum L2 distance, principal component analysis, a sparse representation-based classifier and softmax regression are used. Experiments on the Extended Yale Face Database B, the AR database and the CMU PIE database are provided to show the validity of the proposed algorithm.
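
The generalise-then-reconstruct idea can be sketched with a small autoencoder: fine-tune on the single sample, then feed the most similar multi-sample subject's images through it to obtain new variations; all sizes and the training budget below are assumptions.

```python
# Sketch: class-specific autoencoder fine-tuning and sample reconstruction.
import torch
import torch.nn as nn

class DeepAE(nn.Module):
    def __init__(self, dim=1024, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, hidden))
        self.dec = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x):
        return self.dec(self.enc(x))

def fine_tune(ae, single_sample, steps=100, lr=1e-3):
    # Class-specific deep autoencoder: adapt the generalised AE to one subject.
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(ae(single_sample), single_sample)
        loss.backward()
        opt.step()
    return ae

ae = fine_tune(DeepAE(), torch.rand(1, 1024))
new_samples = ae(torch.rand(5, 1024)).detach()   # variations from a similar subject
print(new_samples.shape)
```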

Journal ArticleDOI
TL;DR: Experimental results show how the proposed approach outperforms state-of-the-art methods and provides an accurate segmentation and labelling of RGBD data.
Abstract: We present an approach for segmentation and semantic labelling of RGBD data that exploits geometrical cues together with deep learning techniques. An initial over-segmentation is performed using spectral clustering, and a set of non-uniform rational B-spline surfaces is fitted to the extracted segments. Then a convolutional neural network (CNN) receives as input colour and geometry data together with the surface fitting parameters. The network is made of nine convolutional stages followed by a softmax classifier and produces a vector of descriptors for each sample. In the next step, an iterative merging algorithm recombines the output of the over-segmentation into larger regions matching the various elements of the scene. The couples of adjacent segments with higher similarity according to the CNN features are candidates for merging, and the surface fitting accuracy is used to detect which couples of segments belong to the same surface. Finally, a set of labelled segments is obtained by combining the segmentation output with the descriptors from the CNN. Experimental results show how the proposed approach outperforms state-of-the-art methods and provides an accurate segmentation and labelling.

Journal ArticleDOI
TL;DR: A novel extended social force model-based mean shift tracking algorithm is proposed in which the pedestrian environment is fully taken into consideration; the algorithm achieves encouraging performance when obstacles exist.
Abstract: It has been shown that the mean shift tracking algorithm can achieve excellent results in pedestrian tracking tasks. It empirically estimates the target position in the current frame by locating the maximum of a density function in the local neighbourhood of the target position from the previous frame. However, this method only considers the target's past trajectory, without considering the influence of the pedestrian environment, when applied to pedestrian tracking. In practice, pedestrians always keep a safe distance away from obstacles when planning their paths. To address the issue of obstacle avoidance, this paper proposes a novel extended social force model-based mean shift tracking algorithm in which the pedestrian environment is fully taken into consideration. First, an extended social force model is presented to quantify the interaction between pedestrian and obstacle by means of force. Furthermore, directional weights and speed weights are introduced to adjust the strength of the force according to differences in individual perspectives and relative velocities. Finally, the initial target position is predicted by Newton's laws of motion, and the mean shift method is then integrated to track the target position. Experimental results show that this algorithm achieves encouraging performance when obstacles exist.
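
An obstacle-repulsion force of the kind the extended social force model quantifies might be sketched as below; the exponential form, the constants and the directional weighting rule are illustrative assumptions, not the paper's exact model.

```python
# Sketch: social-force-style obstacle repulsion with a directional weight.
import numpy as np

def obstacle_force(pos, vel, obstacle, A=2.0, B=0.5, fov_weight=0.4):
    diff = pos - obstacle
    dist = np.linalg.norm(diff)
    direction = diff / dist
    force = A * np.exp(-dist / B) * direction        # exponential repulsion
    # Directional weight: obstacles behind the walking direction matter less.
    in_view = np.dot(vel / np.linalg.norm(vel), -direction) > 0
    return force if in_view else fov_weight * force

pos, vel = np.array([0.0, 0.0]), np.array([1.0, 0.0])
print(obstacle_force(pos, vel, np.array([0.8, 0.2])))
```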

Journal ArticleDOI
TL;DR: The proposed discriminative feature learning scheme achieves satisfying recognition results, reaching accuracy rates as high as 91.87% on the CK+ database, 82.24% on the KDEF database, and 78.94% on the CMU Multi-PIE database in the LOSO scenario, performing better than the comparison methods.
Abstract: Recently, researchers have proposed different feature descriptors to achieve robust performance for facial expression recognition (FER). However, finding a discriminative feature descriptor remains one of the critical tasks. In this paper, we propose a discriminative feature learning scheme to improve the representation power of expressions. First, we obtain a discriminative feature matrix (DFM) based on a pixel difference representation. Subsequently, all DFMs corresponding to the training samples are used to construct a discriminative feature dictionary (DFD). Next, the DFD is projected onto a vertical two-dimensional linear discriminant analysis (V-2DLDA) space to compute the between- and within-class scatter, because V-2DLDA works well with the DFD in matrix representation and achieves good efficiency. Finally, a nearest neighbour (NN) classifier is used to determine the labels of the query samples. The DFD represents local feature changes that are robust to expression, illumination, etc. Besides, we exploit V-2DLDA to find an optimal projection matrix, since it not only preserves the discriminative features but also reduces the dimensionality. The proposed method achieves satisfying recognition results, reaching accuracy rates as high as 91.87% on the CK+ database, 82.24% on the KDEF database, and 78.94% on the CMU Multi-PIE database in the LOSO scenario, performing better than the comparison methods.

Journal ArticleDOI
TL;DR: An object-based semantic image representation is integrated into a deep features-based retrieval framework to select the relevant images and a novel phrase selection paradigm and sentence generation model which depends on a joint analysis of salient regions in the input and retrieved images within a clustering framework are presented.
Abstract: In the past few years, automatically generating descriptions for images has attracted a lot of attention in computer vision and natural language processing research. Among the existing approaches, data-driven methods have been proven to be highly effective. These methods compare the given image against a large set of training images to determine a set of relevant images, then generate a description using the associated captions. In this study, the authors propose to integrate an object-based semantic image representation into a deep features-based retrieval framework to select the relevant images. Moreover, they present a novel phrase selection paradigm and a sentence generation model which depends on a joint analysis of salient regions in the input and retrieved images within a clustering framework. The authors demonstrate the effectiveness of their proposed approach on Flickr8K and Flickr30K benchmark datasets and show that their model gives highly competitive results compared with the state-of-the-art models.

Journal ArticleDOI
TL;DR: The proposed multiple object tracking approach has the capability to deal with long-term and complete occlusion without any prior training of the shape and motion model of the objects, and is cost effective in terms of memory and/or computation compared with existing state-of-the-art techniques.
Abstract: This study presents a novel multiple object tracking (MOT) approach that models an object's appearance based on K-means, while introducing a new statistical measure for the association of objects after occlusion. The proposed method is tested on several standard datasets covering complex situations in both indoor and outdoor environments. The experimental results show that the proposed model successfully tracks multiple objects in the presence of occlusion with high accuracy. Moreover, the presented work has the capability to deal with long-term and complete occlusion without any prior training of the shape and motion model of the objects. The accuracy of the proposed method is comparable with that of existing state-of-the-art techniques, as it successfully deals with all MOT cases in the standard datasets. Most importantly, the proposed method is cost effective in terms of memory and/or computation compared with existing state-of-the-art techniques. These traits make the proposed system very useful for real-time embedded video surveillance platforms, especially those with low memory/compute resources.

Journal ArticleDOI
TL;DR: A new method for transmission tower detection that involves the use of visual features and the linear content of the scene and a descriptor based on a grid of two-dimensional feature descriptors that is useful not only for object detection, but also for tracking the area of interest.
Abstract: In this study, the authors propose a new method for transmission tower detection that involves the use of visual features and the linear content of the scene. For this process, they developed a descriptor based on a grid of two-dimensional feature descriptors that is useful not only for object detection, but also for tracking the area of interest. For detection and classification, they used a support vector machine. The experiments were conducted with a dataset of real-world images from transmission tower videos, which were used to validate the strategy by comparing it with the ground truth. The results showed that the proposed method is fast and appropriate for tower detection in video sequences of environments that include rural and urban areas. The detection took less than 50 ms and was faster than other methods.

Journal ArticleDOI
TL;DR: A novel hypothesis generation scheme that uses a voting and penalisation mechanism to accurately select a true-positive candidate is proposed and achieves 100% detection accuracy on German TSD benchmark and achieves 4.0% better detection accuracy, when compared with other well-known methods (under partially occluded settings), on KTSD dataset.
Abstract: In advanced driver assistance systems, accurate detection of traffic signs plays an important role in extracting information about the road ahead. However, traffic signs are persistently occluded by vehicles, trees, and other structures on the road. The performance of a detector decreases drastically when occlusions are encountered, especially when it is trained using full object templates. Therefore, we propose a new method called discriminative patches (d-patches), which is a traffic sign detection (TSD) framework with occlusion-handling capability. D-patches are those regions of an object that possess more discriminative features than their surroundings. They are mined during training and are used for classification instead of the full object templates. Furthermore, we observe that the distribution of redundant detections around a true positive is different from that around a false positive. Based on this observation, we propose a novel hypothesis generation scheme that uses a voting and penalisation mechanism to accurately select a true-positive candidate. We also introduce a new Korean TSD (KTSD) dataset with several evaluation settings to facilitate detector evaluation under different conditions. The proposed method achieves 100% detection accuracy on the German TSD benchmark and 4.0% better detection accuracy than other well-known methods (under partially occluded settings) on the KTSD dataset.

Journal ArticleDOI
TL;DR: The quantitative and qualitative analyses of the experimental results show the superiority of the proposed technique over the conventional and state-of-the-art methods.
Abstract: In this study, a novel single-image super-resolution (SR) method is proposed, which uses a dictionary generated from pairs of high-resolution (HR) images and their corresponding low-resolution (LR) representations. First, HR and LR dictionaries are created by dividing HR and LR images into patches. Afterwards, when performing SR, the distance between every patch of the input LR image and the available LR patches in the LR dictionary is calculated. The LR dictionary patch with the minimum distance to the input LR patch is taken, and its counterpart from the HR dictionary is passed through an illumination enhancement process, resulting in consistency of illumination between neighbouring patches. This process is applied to all patches of the LR image. Finally, in order to remove the blocking effect caused by merging the patches, an average of the obtained HR image and the interpolated image is calculated. Furthermore, it is shown that the size of the dictionaries can be reduced to a great degree. The speed of the system is improved by 62.5%. The quantitative and qualitative analyses of the experimental results show the superiority of the proposed technique over conventional and state-of-the-art methods.
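
The core dictionary look-up can be sketched in a few lines: find the nearest LR dictionary patch and return its HR counterpart. The illumination-enhancement and merging steps are omitted, and the patch sizes are assumptions.

```python
# Sketch: nearest-patch look-up between paired LR and HR dictionaries.
import numpy as np

def super_resolve_patch(lr_patch, lr_dict, hr_dict):
    # lr_dict: (n, p*p) flattened LR patches; hr_dict: matching HR patches.
    dists = np.linalg.norm(lr_dict - lr_patch.ravel(), axis=1)
    return hr_dict[np.argmin(dists)]

rng = np.random.default_rng(0)
lr_dict = rng.random((500, 25))                  # 5x5 LR patches
hr_dict = rng.random((500, 100))                 # paired 10x10 HR patches
hr_patch = super_resolve_patch(rng.random((5, 5)), lr_dict, hr_dict)
print(hr_patch.shape)
```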

Journal ArticleDOI
TL;DR: A novel class-wise two-dimensional principal component analysis (PCA)-based face recognition algorithm that can successively detect and recognise faces in not only images but also in video files is presented.
Abstract: Interest in biometric identification systems has led to many face recognition task-oriented studies. These studies often address the detection of face images taken from a camera and the recognition of faces via extracted meaningful features. To meet the requirement of defining data with fewer features, principal component analysis (PCA)-based techniques are widely used due to their efficiency and simplicity. There is remarkable interest in extending this traditional technique in various ways to exploit its efficiency. From this viewpoint, this study is specifically focused on PCA-based face recognition techniques. By enhancing the methods in the reviewed studies, a novel class-wise two-dimensional PCA-based face recognition algorithm is presented. Unlike the traditional method, this method generates more than one subspace by considering within-class scattering. A system based on the presented approach can successively detect and recognise faces not only in images but also in video files. In addition, analyses were conducted to evaluate the efficiency of the proposed algorithm and its extension compared with the other PCA-based methods addressed. On the basis of the experimental results, it is clear that the presented approach and its extension are superior to the compared PCA-based algorithms.
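
A simplified reading of class-wise 2DPCA: compute one 2DPCA subspace per class from that class's image scatter matrix and assign a query to the class with the smallest reconstruction residual. This sketch is an interpretation, not the authors' exact algorithm.

```python
# Sketch: per-class 2DPCA subspaces with residual-based classification.
import numpy as np

def twodpca(images, k):
    mean = images.mean(axis=0)
    G = sum((a - mean).T @ (a - mean) for a in images) / len(images)
    _, vecs = np.linalg.eigh(G)
    return mean, vecs[:, -k:]                    # top-k image scatter directions

def residual(img, mean, W):
    proj = (img - mean) @ W                      # 2DPCA feature matrix
    return np.linalg.norm((img - mean) - proj @ W.T)

rng = np.random.default_rng(0)
classes = {c: rng.random((10, 20, 20)) + c for c in range(3)}
models = {c: twodpca(X, k=4) for c, X in classes.items()}
query = classes[1][0]
print(min(models, key=lambda c: residual(query, *models[c])))
```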

Journal ArticleDOI
TL;DR: It is shown that a webcam-based VOG system can provide similar accuracy to that of a head-mounted IR-based VOG system, and it is proved that the authors' iris localisation algorithm outperforms current state-of-the-art methods on the popular BioID dataset in terms of accuracy.
Abstract: Video-oculography (VOG) is a tool providing diagnostic information about the progress of diseases that cause regression of vergence eye movements, such as Parkinson's disease (PD). The majority of existing systems are based on sophisticated infra-red (IR) devices. In this study, the authors show that a webcam-based VOG system can provide accuracy similar to that of a head-mounted IR-based VOG system. They also prove that the authors' iris localisation algorithm outperforms current state-of-the-art methods on the popular BioID dataset in terms of accuracy. The proposed system consists of a set of image processing algorithms: face detection, facial feature localisation and iris localisation. They have performed examinations on patients suffering from PD using their system, with a JAZZ-novo head-mounted device with an IR sensor as reference. In the experiments, they obtained a mean correlation of 0.841 between the results from their method and those from the JAZZ-novo. They have shown that the accuracy of their visual system is similar to the accuracy of IR head-mounted devices. In the future, they plan to extend their experiments to inexpensive high-frame-rate cameras, which can potentially provide more diagnostic parameters.