Book Chapter

Recurrent Image Annotation with Explicit Inter-label Dependencies

TL;DR: This paper proposes a novel approach in which the RNN is explicitly forced to learn multiple relevant inter-label dependencies, without the need to feed the ground-truth in any particular order, and outperforms several state-of-the-art techniques on two popular datasets.
Abstract: Inspired by the success of the CNN-RNN framework in the image captioning task, several works have explored this framework for multi-label image annotation, with the hope that an RNN following a CNN would encode inter-label dependencies better than a CNN alone. To do so, for each training sample, the earlier methods converted the ground-truth label-set into a sequence of labels based on their frequencies (e.g., rare-to-frequent) for training the RNN. However, since the ground-truth is an unordered set of labels, imposing a fixed and predefined sequence on them does not naturally align with this task. To address this, some recent papers have proposed techniques that are capable of training the RNN without feeding the ground-truth labels in a particular sequence/order. However, most of these techniques leave it to the RNN to implicitly choose one sequence for the ground-truth labels corresponding to each sample at the time of training, thus making it inherently biased. In this paper, we address this limitation and propose a novel approach in which the RNN is explicitly forced to learn multiple relevant inter-label dependencies, without the need to feed the ground-truth in any particular order. Using thorough empirical comparisons, we demonstrate that our approach outperforms several state-of-the-art techniques on two popular datasets (MS-COCO and NUS-WIDE). Additionally, it provides a new perspective of looking at an unordered set of labels as equivalent to a collection of different permutations (sequences) of those labels, thus naturally aligning with the image annotation task. Our code is available at: https://github.com/ayushidutta/multi-order-rnn.
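To make the contrast in the abstract concrete, the following is a minimal Python sketch of the two views of a ground-truth label set: a single fixed rare-to-frequent sequence, and the collection of all permutations of the same labels. It is an illustration only and does not reproduce the paper's training procedure or the linked repository. The toy annotations and the helper names `rare_to_frequent` and `all_orderings` are hypothetical.

```python
from collections import Counter
from itertools import permutations


def rare_to_frequent(label_set, label_counts):
    """Order labels by ascending training frequency (ties broken alphabetically)."""
    return sorted(label_set, key=lambda lab: (label_counts[lab], lab))


def all_orderings(label_set):
    """Return every permutation of the label set, in a deterministic order."""
    return [list(p) for p in permutations(sorted(label_set))]


# Hypothetical toy training annotations: each image has an unordered label set.
train_annotations = [
    {"person", "dog"},
    {"person", "car"},
    {"person", "dog", "frisbee"},
]

# Count how often each label occurs across the training set.
label_counts = Counter(lab for labels in train_annotations for lab in labels)

sample = {"person", "dog", "frisbee"}

# Earlier methods: one predefined sequence per sample.
fixed_sequence = rare_to_frequent(sample, label_counts)

# Set-as-permutations view: the same labels in every possible order.
orderings = all_orderings(sample)

print(fixed_sequence)
print(len(orderings))
```

The fixed sequence is just one element of `orderings`. The number of permutations grows factorially with the size of the label set, so enumerating them is only practical for small toy examples like this one.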
Citations
Journal Article
Wei Zhou, Zhiwu Xia, Pengli Dou, Tao Su, Haifeng Hu 
TL;DR: A novel image multi-label classification framework which aims to align Image Semantics with Label Concepts (ISLC) by proposing a residual encoder to learn salient object features in the images, and exploiting the self-attention layer in an aligned decoder to automatically capture the correlation between labels.
Abstract: The image multi-label classification task is to correctly predict the multiple object categories present in an image. To capture the correlation between labels, graph convolution network based methods have to manually count label co-occurrence probabilities from the training data to construct a pre-defined graph as the input to the graph network, which is inflexible and may degrade model generalizability. Moreover, most current methods cannot effectively align the learned salient object features with the label concepts, so the model's predictions may not be consistent with the image content. Therefore, learning the salient semantic features of images, capturing the correlation between labels, and then effectively aligning the two is one of the keys to improving performance on the image multi-label classification task. To this end, we propose a novel image multi-label classification framework which aims to align Image Semantics with Label Concepts (ISLC). Specifically, we propose a residual encoder to learn salient object features in the images, and exploit the self-attention layer in the aligned decoder to automatically capture the correlation between labels. Then, we leverage the cross-attention layers in the aligned decoder to align image semantic features with label concepts, so that the labels predicted by the model are more consistent with the image content. Finally, the output features of the last layers of the residual encoder and the aligned decoder are fused to obtain the final output feature for classification. The proposed ISLC model achieves good performance on various prevalent multi-label image datasets such as MS-COCO 2014, PASCAL VOC 2007, VG-500, and NUS-WIDE, with 87.2%, 96.9%, 39.4%, and 64.2%, respectively.

1 citation

Journal Article
Wei Zhou, Yanke Hou, Dihu Chen, Haifeng Hu, Tao Su 
TL;DR: This paper proposes an Attention-Augmented Memory Network (AAMN) for the image multi-label classification task, whose categorical memory module excavates the contextual information of various categories from the dataset to augment the current input feature.
Abstract: The purpose of image multi-label classification is to predict all the object categories present in an image. Some recent works exploit graph convolution networks to capture the correlation between labels. Although promising results have been reported, these methods cannot learn salient object features in the images and ignore the correlation between channel feature maps. In addition, current research only learns the feature information within an individual input image, but fails to mine the contextual information of various categories from the dataset to enhance the input feature representation. To address these issues, we propose an Attention-Augmented Memory Network (AAMN) model for the image multi-label classification task. Specifically, we first propose a novel categorical memory module to excavate the contextual information of various categories from the dataset to augment the current input feature. Secondly, we design a new channel-relation exploration module to capture the inter-channel relationship of features, so as to enhance the correlation between objects in the images. Thirdly, we develop a spatial-relation enhancement module to model second-order statistics of features and capture long-range dependencies between pixels in feature maps, so as to learn salient object features. Experimental results on standard benchmarks, including MS-COCO 2014, PASCAL VOC 2007, and VG-500, demonstrate the effectiveness and superiority of the AAMN model, which outperforms current state-of-the-art methods.

1 citation

Journal Article
Wei Zhou, Pengli Dou, Tao Su, Hai Hu, Zhijie Zheng 
TL;DR: This paper proposes FL-Tran, a Transformer-based feature learning network that learns salient features and excavates potentially useful features for the multi-label image classification task.

1 citation

Book Chapter
01 Jan 2022
TL;DR: This chapter proposes a generalized adversary generation mechanism based on a worst-case perturbation that, when added to the feature vector of the original sample, yields an adversarial sample without requiring the attacker to have access to either the training data or the model.
Abstract: Multi-label classification is a generalization of single-label classification, where an unseen sample is automatically assigned a subset of semantically relevant labels from a given vocabulary. In parallel, recent research has demonstrated the impact of adversarial examples, which are modifications of original samples and aim at fooling machine learning models. Unlike existing adversary generation techniques which are specific to single-label data and mostly assume the availability of training data and/or model to the attacker, in this paper, we propose a generalized adversary generation mechanism by generating a worst-case perturbation. This perturbation, when added to the feature vector of the original sample, generates an adversarial sample without the need for the availability of either training data or model to the attacker. Next, to the best of our knowledge for the first time, we study and demonstrate the effect of feature normalization as a defense mechanism against adversarial attacks. Extensive experiments show the effectiveness of our adversarial attack and defense mechanisms using state-of-the-art max-margin multi-label classification algorithms on two benchmark datasets.
Keywords: Multi-label learning, Image annotation, Max-margin classifier, Adversarial attack, Feature normalization
Book Chapter
01 Jan 2022
TL;DR: This chapter proposes an online signature verification model based on deep learning, which reports equal error rates (EER) of 9.72% and 3.1% in the Skilled_01 categories of the MCYT-100 and SVC data sets, respectively.
Abstract: An online signature is a multivariate time series and a commonly used biometric source for user verification. Deep learning (DL) is increasingly becoming ubiquitous as a paradigm for solving problems that come with a wealth of data. Convolution has been its main workhorse. Recently, DL has made its entry into online signature verification (OSV), a standard biometric method that has mostly been dealt with in traditional settings. However, embracing a DL solution to a problem requires certain issues to be tackled, viz. (i) type of convolution, (ii) order of convolution, and (iii) input representation. In this work, we experimentally analyse each of the issues mentioned above regarding OSV, and subsequently present a superior model that reports state-of-the-art (SOTA) performance on three widely used data-sets, namely MCYT-100, SVC, and Mobisig. Specifically, the proposed model reports an equal error rate (EER) of 9.72% and 3.1% in Skilled_01 categories of MCYT-100 and SVC data-sets, with gains of around 4% and 3% over the next best performing methods, respectively. The experimental outcome confirms that the interrelationship between the type and order of convolution operation and the input signature representation plays a significant role in the performance of OSV frameworks.
Keywords: Online signature verification, Deep learning, Impact of convolution, Step size, One shot learning
References
Proceedings Article
27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

123,388 citations
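
The residual reformulation described in the abstract above (layers learn a residual function with reference to the layer input) can be sketched in a few lines of plain Python. This is a toy illustration, not the published architecture: the two-layer fully-connected form of F, the placement of the ReLU, and the helper names `relu`, `matvec`, and `residual_block` are all assumptions made for the example.

```python
def relu(v):
    """Element-wise rectified linear unit on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = w2 @ relu(w1 @ x)."""
    fx = matvec(w2, relu(matvec(w1, x)))            # learned residual F(x)
    return relu([f + xi for f, xi in zip(fx, x)])   # shortcut adds the input back


x = [1.0, -2.0, 0.5]

# With all-zero weights F(x) = 0, so the block passes relu(x) through unchanged.
zeros = [[0.0] * 3 for _ in range(3)]
out_identity = residual_block(x, zeros, zeros)

# With w1 = I and w2 = 0.5 * I, the block adds half of relu(x) onto x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
half = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
out = residual_block(x, identity, half)

print(out_identity)
print(out)
```

The zero-weight case shows the property the reformulation relies on: when the best mapping is close to the identity, the layers only need to drive F toward zero rather than learn the identity from scratch.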

Proceedings Article
03 Dec 2012
TL;DR: A large, deep convolutional neural network, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieved ImageNet error rates considerably better than the previous state of the art.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Journal Article
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations

Proceedings Article
01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

49,914 citations

Proceedings Article
07 Jun 2015
TL;DR: Inception is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations
