Journal ArticleDOI

Context-Aware Query Selection for Active Learning in Event Recognition

01 Mar 2020-IEEE Transactions on Pattern Analysis and Machine Intelligence (Institute of Electrical and Electronics Engineers (IEEE))-Vol. 42, Iss: 3, pp 554-567
TL;DR: This work proposes a continuous-learning framework for context-aware activity recognition from unlabeled video that employs a novel active-learning technique that not only exploits the informativeness of the individual activities but also utilizes their contextual information during query selection, which leads to significant reduction in expensive manual annotation effort.
Abstract: Activity recognition is a challenging problem with many practical applications. In addition to visual features, recent approaches have benefited from the use of context, e.g., inter-relationships among the activities and objects. However, these approaches require data to be labeled and entirely available beforehand, and are not designed to be updated continuously, which makes them unsuitable for surveillance applications. In contrast, we propose a continuous-learning framework for context-aware activity recognition from unlabeled video, which has two distinct advantages over existing methods. First, it employs a novel active-learning technique that not only exploits the informativeness of the individual activities but also utilizes their contextual information during query selection; this leads to a significant reduction in expensive manual annotation effort. Second, the learned models can be adapted online as more data becomes available. We formulate a conditional random field model that encodes the context and devise an information-theoretic approach that utilizes entropy and mutual information of the nodes to compute the set of most informative queries, which are labeled by a human. These labels are combined with graphical inference techniques for incremental updates. We provide a theoretical formulation of the active-learning framework with an analytic solution. Experiments on six challenging datasets demonstrate that our framework achieves superior performance with significantly less manual labeling.
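The paper's criterion combines entropy with mutual information over a conditional random field; as a simplified, entropy-only sketch of the query-selection idea (the function names and the toy beliefs below are illustrative assumptions, not from the paper):

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def select_queries(marginals, k):
    """Return the k nodes whose current label beliefs are most uncertain.
    `marginals` maps a node id to its list of class probabilities."""
    ranked = sorted(marginals, key=lambda n: entropy(marginals[n]), reverse=True)
    return ranked[:k]

# Toy beliefs for three activity nodes (illustrative values).
beliefs = {
    "a": [0.98, 0.01, 0.01],  # confident -> low entropy
    "b": [0.34, 0.33, 0.33],  # uncertain -> high entropy
    "c": [0.70, 0.20, 0.10],
}
print(select_queries(beliefs, 2))  # ['b', 'c']
```

The full method additionally weighs each node's mutual information with its contextual neighbours, so that answering one query also reduces uncertainty elsewhere in the graph.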
Citations
Proceedings ArticleDOI
12 Oct 2020
TL;DR: A scene-aware context reasoning method that exploits context information from visual features for unsupervised abnormal event detection in videos, which bridges the semantic gap between visual context and the meaning of abnormal events is proposed.
Abstract: In this paper, we propose a scene-aware context reasoning method that exploits context information from visual features for unsupervised abnormal event detection in videos, which bridges the semantic gap between visual context and the meaning of abnormal events. In particular, we build a spatio-temporal context graph to model visual context information including appearances of objects, spatio-temporal relationships among objects and scene types. The context information is encoded into the nodes and edges of the graph, and their states are iteratively updated by using multiple RNNs with message passing for context reasoning. To infer the spatio-temporal context graph in various scenes, we develop a graph-based deep Gaussian mixture model for scene clustering in an unsupervised manner. We then compute frame-level anomaly scores based on the context information to discriminate abnormal events in various scenes. Evaluations on three challenging datasets, including the UCF-Crime, Avenue, and ShanghaiTech datasets, demonstrate the effectiveness of our method.
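The frame-level anomaly scoring described above can be illustrated with a much simpler stand-in: fit a single Gaussian to features of "normal" frames, score new frames by squared Mahalanobis distance, and threshold at a quantile of the normal scores. The paper's actual model is a graph-based deep Gaussian mixture; everything below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 8))  # features of "normal" frames

mu = normal.mean(axis=0)
cov = np.cov(normal, rowvar=False) + 1e-6 * np.eye(8)  # regularized covariance
cov_inv = np.linalg.inv(cov)

def anomaly_score(x):
    """Squared Mahalanobis distance to the normal-frame model."""
    d = x - mu
    return float(d @ cov_inv @ d)

# Threshold at the 99th percentile of scores seen on normal data.
tau = float(np.quantile([anomaly_score(f) for f in normal], 0.99))

outlier = np.full(8, 6.0)  # a frame far from anything seen in training
print(anomaly_score(outlier) > tau)  # True
```

A mixture model generalizes this by scoring each frame against the nearest of several scene clusters rather than one global Gaussian.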

56 citations


Cites background from "Context-Aware Query Selection for A..."

  • ...Psychological evidence shows that humans can recognize objects and scenes comprehensively through exploiting visual context information [1, 34], and a variety of computer vision tasks benefit from context information [7, 30, 32, 33]....


Journal ArticleDOI
18 Sep 2020-Sensors
TL;DR: For the first time in the state-of-the-art, a meta-feature based Long Short-Term Memory (LSTM) hashing model for person re-identification is presented and is tested on three challenging datasets, showing that the proposed method is fully competitive with respect to other methods based on visual features.
Abstract: Person re-identification is concerned with matching people across disjoint camera views at different places and different time instants. This task is of great interest in computer vision, especially in video-surveillance applications where the re-identification and tracking of persons are required in uncontrolled crowded spaces and after long time periods. The latter aspects are responsible for most of the currently unsolved problems of person re-identification: the presence of many people in a location, as well as the passing of hours or days, gives rise to important changes in the visual appearance of people (e.g., clothes, lighting, and occlusions), making person re-identification a very hard task. In this paper, for the first time in the state of the art, a meta-feature based Long Short-Term Memory (LSTM) hashing model for person re-identification is presented. Starting from 2D skeletons extracted from RGB video streams, the proposed method computes a set of novel meta-features based on movement, gait, and bone proportions. These features are analysed by a network composed of a single LSTM layer and two dense layers. The first layer is used to create a pattern of the person’s identity; the latter two are used to generate a bodyprint hash through binary coding. The effectiveness of the proposed method is tested on three challenging datasets, that is, iLIDS-VID, PRID 2011, and MARS. In particular, the reported results show that the proposed method, which is not based on the visual appearance of people, is fully competitive with respect to other methods based on visual features. In addition, thanks to its skeleton-model abstraction, the method is a concrete contribution to addressing open problems, such as long-term re-identification and severe illumination changes, which tend to heavily influence the visual appearance of persons.
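The "bodyprint hash through binary coding" step can be sketched with sign-based coding and Hamming-distance matching. Here a fixed random projection stands in for the trained dense layers, and all vectors are synthetic; this is an illustrative assumption, not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
proj = rng.normal(size=(64, 32))  # stand-in for the trained dense-layer weights

def bodyprint_hash(features):
    """Binary-code a feature vector: the sign of each projection becomes one bit."""
    return (features @ proj > 0).astype(np.uint8)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(a != b))

person = rng.normal(size=64)                # meta-features of one identity
same = person + 0.02 * rng.normal(size=64)  # same identity, slight noise
other = rng.normal(size=64)                 # a different identity

# The same identity should land much closer in Hamming space.
print(hamming(bodyprint_hash(person), bodyprint_hash(same)),
      hamming(bodyprint_hash(person), bodyprint_hash(other)))
```

Compact binary codes make gallery search cheap: comparing two identities is a 32-bit XOR-and-popcount rather than a float-vector distance.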

9 citations


Cites background from "Context-Aware Query Selection for A..."

  • ...Moving to stationary and Pan–Tilt–Zoom (PTZ) cameras, very recent works, such as that reported in Reference [17], have shown that, even in challenging application fields, robust systems can be implemented and applied, for security contexts, in everyday life....


Posted Content
TL;DR: A noisy-label-filtering based learning approach where the inter-relationships that are quite common in natural data are utilized to detect wrong labels and update the recognition model with correct labels, which results in better recognition performance.
Abstract: Several works in computer vision have demonstrated the effectiveness of active learning for adapting the recognition model when new unlabeled data becomes available. Most of these works assume that labels obtained from the annotator are correct. However, in a practical scenario, as the quality of the labels depends on the annotator, some of the labels might be wrong, which results in degraded recognition performance. In this paper, we address the problems of i) how a system can identify which of the queried labels are wrong and ii) how a multi-class active learning system can be adapted to minimize the negative impact of label noise. Towards solving these problems, we propose a noisy-label-filtering based learning approach where the inter-relationships (context) that are quite common in natural data are utilized to detect wrong labels. We construct a graphical representation of the unlabeled data to encode these relationships and obtain new beliefs on the graph when noisy labels are available. Comparing the new beliefs with the prior relational information, we generate a dissimilarity score to detect the incorrect labels and update the recognition model with correct labels, which results in better recognition performance. This is demonstrated in three different applications: scene classification, activity classification, and document classification.
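The dissimilarity-score idea can be sketched as comparing the label distribution implied by a sample's contextual neighbours against the annotator's answer, flagging large divergences. The KL-divergence criterion, the smoothing factor `alpha`, and the threshold `tau` below are all illustrative assumptions, not the paper's exact formulation:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions (q must be > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def flag_noisy(context_belief, label, n_classes, tau=1.0, alpha=0.1):
    """Flag a queried label whose (smoothed) one-hot distribution diverges
    too far from the belief implied by the sample's contextual neighbours."""
    smoothed = [(1 - alpha) if c == label else alpha / (n_classes - 1)
                for c in range(n_classes)]
    return kl(context_belief, smoothed) > tau

# Context strongly suggests class 0, but the annotator said class 2 -> flagged.
print(flag_noisy([0.80, 0.15, 0.05], 2, 3))  # True
# Context agrees with the annotator -> accepted.
print(flag_noisy([0.10, 0.10, 0.80], 2, 3))  # False
```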

4 citations


Cites background or methods from "Context-Aware Query Selection for A..."

  • ...coding contextual relationships in several applications [42], [28], [18], we also utilize graphical representation to encode contextual relationships....


  • ...Entropy [3], Batch Rank [58], and CAAL [28]....


  • ...There have been some active learning methods that utilize relational information [27], [1], [28]....


  • ...We select three commonly used active learning methods: Entropy [3], Batch Rank [58], and CAAL [28]....


Posted Content
TL;DR: This paper proposes a Cost-Quality Adaptive Active Learning (CQAAL) approach for CNER in Chinese EHRs, which maintains a balance between the annotation quality, labeling cost, and the informativeness of selected instances.
Abstract: Clinical Named Entity Recognition (CNER) aims to automatically identify clinical terminologies in Electronic Health Records (EHRs), which is a fundamental and crucial step for clinical research. To train a high-performance model for CNER, a large number of EHRs with high-quality labels is usually required. However, labeling EHRs, especially Chinese EHRs, is time-consuming and expensive. One effective solution to this is active learning, where a model asks labelers to annotate data which the model is uncertain of. Conventional active learning assumes a single labeler that always provides noiseless answers to queried labels. However, in real settings, multiple labelers provide diverse quality of annotation with varied costs, and labelers with low overall annotation quality can still assign correct labels for some specific instances. In this paper, we propose a Cost-Quality Adaptive Active Learning (CQAAL) approach for CNER in Chinese EHRs, which maintains a balance between the annotation quality, labeling costs, and the informativeness of selected instances. Specifically, CQAAL selects cost-effective instance-labeler pairs to achieve better annotation quality with lower costs in an adaptive manner. Computational results on the CCKS-2017 Task 2 benchmark dataset demonstrate the superiority and effectiveness of the proposed CQAAL.
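The instance-labeler pair selection can be sketched as a greedy score that trades off informativeness and labeler accuracy against cost. The scoring formula and the toy data below are illustrative assumptions, not CQAAL's actual adaptive criterion:

```python
def select_pair(instances, labelers):
    """Greedy pick: highest (informativeness * labeler accuracy) per unit cost.
    `instances`: {id: informativeness}; `labelers`: {id: (accuracy, cost)}."""
    best = max(
        ((i, l, info * acc / cost)
         for i, info in instances.items()
         for l, (acc, cost) in labelers.items()),
        key=lambda t: t[2],
    )
    return best[0], best[1]

instances = {"ehr_1": 0.9, "ehr_2": 0.4}               # uncertainty scores
labelers = {"expert": (0.95, 5.0), "crowd": (0.70, 1.0)}  # (accuracy, cost)
print(select_pair(instances, labelers))  # ('ehr_1', 'crowd')
```

Here the cheap labeler wins because its accuracy-per-cost ratio is higher; an adaptive scheme would route only the instances the cheap labeler is likely to get wrong to the expensive expert.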

3 citations


Cites methods from "Context-Aware Query Selection for A..."

  • ...It has been widely used in many Natural Language Processing (NLP) tasks, such as text classification [8] and event recognition [9]....


References
Book ChapterDOI


01 Jan 2012

139,059 citations


"Context-Aware Query Selection for A..." refers background in this paper

  • ...This is a challenging and much bigger movie action dataset [2]....


Journal Article

28,685 citations


"Context-Aware Query Selection for A..." refers result in this paper

  • ...Even though SCSG performs better than CAQS by 1.7 percent, CAQS consumes only 33 percent of the manually labeled data compared to 100 percent for SCSG. Table 1 summarizes the performance comparison against other state-of-the-art methods....


  • ...We compared our work with two structure learning methods, SSVM and SCSG. SSVM learns the structure with structural SVM and SCSG learns the structure with AND-OR graphs....


  • ...We compare the results on the UCLAOffice dataset against stochastic context sensitive grammar (SCSG) [3], and an SVM-based bag-of-words approach....


Proceedings ArticleDOI
07 Dec 2015
TL;DR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.
Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets, 2) A homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets, and 3) Our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks. In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.
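Downstream users of C3D typically take the 4096-dimensional fc-layer activations per clip and compress them with PCA for faster processing. A minimal SVD-based sketch, using random vectors as a stand-in for real C3D features (the data and dimensions below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
clips = rng.normal(size=(300, 4096))  # stand-in for per-clip C3D fc features

def pca_compress(X, k):
    """Project features onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                          # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # (n_samples, k)

compressed = pca_compress(clips, 256)
print(compressed.shape)  # (300, 256)
```

Note that with `full_matrices=False` the number of recoverable components is bounded by `min(n_samples, n_features)`, so at least `k` samples are needed to keep `k` components.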

7,091 citations


"Context-Aware Query Selection for A..." refers methods in this paper

  • ...We use an off-the-shelf C3D model trained on the Sports-1M [44] dataset....


  • ...Given the video segment, we extract a C3D feature of size 4096 for each sixteen frames with a temporal stride of eight frames....


  • ...C3D exploits 3D convolution that makes it better than conventional 2D convolution for motion description....


  • ...We extract 4096-dimensional C3D features from this cropped video and then use PCA to compress these 4096-dimensional features into 256 dimensions for faster processing in the later steps....


  • ...We use C3D [43] features as a generic feature descriptor for video segments for the UCF50 and VIRAT datasets....


Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Abstract: Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).

4,876 citations


"Context-Aware Query Selection for A..." refers methods in this paper

  • ...We use an off-the-shelf C3D model trained on the Sports-1M [44] dataset....
