Concepts Not Alone: Exploring Pairwise Relationships for Zero-Shot Video Activity Recognition

Chuang Gan¹, Ming Lin³, Yi Yang², Gerard de Melo¹ and Alexander G. Hauptmann⁴
¹IIIS, Tsinghua University, Beijing, China
²QCIS, University of Technology Sydney, Sydney, Australia
³DCM&B, University of Michigan, Ann Arbor, USA
⁴SCS, Carnegie Mellon University, Pittsburgh, USA
Abstract
Vast quantities of videos are now being captured at astonishing rates, but the majority of these are not labelled. To cope with such data, we consider the task of content-based activity recognition in videos without any manually labelled examples, also known as zero-shot video recognition. To achieve this, videos are represented in terms of detected visual concepts, which are then scored as relevant or irrelevant according to their similarity with a given textual query. In this paper, we propose a more robust approach for scoring concepts in order to alleviate many of the brittleness and low precision problems of previous work. We jointly consider semantic relatedness, visual reliability, and discriminative power when selecting concepts. To handle noise and non-linearities in the ranking scores of the selected concepts, we propose a novel pairwise order matrix approach for score aggregation. Extensive experiments on the large-scale TRECVID Multimedia Event Detection data show the superiority of our approach.
1 Introduction
Motivation. The increasing ubiquity of devices capable of capturing videos has led to an explosion in the amount of recorded video content. Smartphones, action cameras, and surveillance cameras mean that ever-increasing amounts of our daily activities are captured on video. Due to the torrential volume of this data, the vast majority of videos are never labeled. Moreover, even for those that are shared online, the human-supplied metadata is often vague or unspecific (e.g., "Albufeira, Summer 2015"). Unfortunately, video search engines such as YouTube, Yahoo, and Bing crucially depend on textual keyword matching. Their approach works well for popular videos but fails hopelessly for long-tail content or personal video collections with insufficient metadata.
Fortunately, encouraging progress has been made on content-based video analysis in recent years. Standard approaches rely on low-level audio/visual input features that are fed into machine learning algorithms such as support vector machines (SVMs) (Chang and Lin 2011) or deep convolutional neural networks (Karpathy et al. 2014; Gan et al. 2015b). These achieve promising results when there are sufficient numbers of labeled training examples for every search query of interest.
However, due to the large number of possible search queries, query-specific training is not always feasible. Zero-shot learning (Lampert, Nickisch, and Harmeling 2009) addresses this problem by providing an alternative paradigm that does not require positive training exemplars for every class of videos. Given a textual query, one aims to retrieve videos that are most relevant to it, exploiting visual attributes of the videos.

While several algorithms (Jiang et al. 2015a; Wu et al. 2014; Dalton, Allan, and Mirajkar 2013; Habibian, Mensink, and Snoek 2014; Liu et al. 2013; Singh et al. 2015) have recently been proposed for such zero-shot video activity recognition, state-of-the-art systems still suffer from brittleness and low precision.
Contributions. In this paper, we show how to make zero-shot learning more robust. Similar to previous work, our system consists of two main components. The first of these aims at a semantic query interpretation, in which the system selects concepts pertaining to the query description from a large pool of potential candidate concepts. The second component produces an aggregation of the individual concept-specific video ranking lists. We propose important strategies for making both more robust:
• We propose a simple yet effective concept selection approach for representing queries. Unlike previous work, concept reliability and discriminative power are considered as critical indicators in order to ensure robust zero-shot activity recognition.
• We devise a novel robust video ranking approach that relies on the recovery of a low-rank order matrix from multiple pairwise order matrices for different concept ranking lists.
Experimental results on challenging unconstrained video data confirm that the proposed system outperforms state-of-the-art zero-shot approaches.
2 Related Work
Video analysis has attracted a lot of research interest in the past decade. A recent review can be found in (Jiang et al. 2013). Standard video activity recognition systems, despite their reasonable recognition performance, rely on custom low-level representations, such as improved dense trajectories (Wang and Schmid 2013) and Mel-Frequency Cepstral Coefficients (MFCC) (Rabiner and Schafer 2007). These suffer from several deal-breaking drawbacks. First, they are incapable of providing a semantic interpretation of a video. Second, because of their high dimensionality, training effective event classifiers on the low-level representation requires a substantial number of training examples per class. When only a few or even no positive examples are available, the power of low-level representations is limited. In our work, we instead draw on concept detection to derive higher-level representations.
Semantic video representations describe a video in terms of pre-defined pools of activity concepts and attributes, and have proven both robust for video activity recognition and interpretable by humans (Hauptmann et al. 2007). This form of representation has inspired the development of zero-shot video activity recognition. Most existing zero-shot learning frameworks follow a two-stage classification approach. Given a novel class, first its semantic properties (i.e., attributes or concepts) are identified, then its class label is predicted as a ranking function of those attributes or concepts. To identify relevant concepts, most of the current approaches rely on manually defined concept schemes or blindly adopt inventories based on some outside knowledge sources, such as WordNet (Lin 1998), Wikipedia (Gabrilovich and Markovitch 2007), or label co-occurrence data (Mensink, Gavves, and Snoek). In contrast to our approach, none of the existing approaches take visual reliability and discriminative power into consideration. As we will see in Section 4, the selection of concepts has a dramatic impact on the retrieval results.
When fusing rankings, most existing approaches to zero-shot activity recognition directly combine the raw ranking scores of related attributes or concepts into the final ranking list. Unlike these methods, we propose a more robust ranking function: we first convert the raw ranking scores into scale-invariant pairwise order matrices, and then decompose the multiple pairwise order matrices into a single shared order matrix, which is then consulted to generate the ranking list. Our experiments confirm that this leads to more robust ranking results.

Our work is also related to the sentence generation task (Guadarrama et al. 2013; Sun, Gan, and Nevatia 2015), where the goal is to generate natural language descriptions of a video. Our goal, in contrast, is to address the opposite direction: given a textual query, we seek to retrieve videos that match the query. Our work is further related to matrix factorization, which has been widely used in different tasks (Ye et al. 2012; Fan et al. 2014; Yan et al. 2015).
3 Proposed Method
3.1 Overview
In our approach, we first represent both the video data and the textual description by embedding them in a semantic concept space. The overall framework of our approach is outlined in Figure 1. It consists of two major components.

The first of the two components aims at a semantic query interpretation. Given a query, i.e., a textual description of the target event, we apply text analytics methods to extract salient words that describe this target class. The system selects relevant concepts matching these query words from a large pool of potential candidate concepts.

We are given a textual query q as input. Based on a vocabulary of d visual concepts, we first select relevant concepts according to their semantic similarity with q. A higher similarity score indicates that the corresponding concept is more related to the target video class denoted by q.

We also apply a bank of visual concept detectors to generate semantic representations for the videos. Thus, each video is represented as a vector whose elements are detection scores of different semantic concepts.

Having embedded both the query and the videos as numerical vectors of concepts, we select relevant concepts based on their similarity to the query, but also prune away selected semantic concepts that have insufficient reliability or discriminative power.
The second component produces a ranked list of videos. This is achieved by aggregating the individual concept-specific ranking lists. After the preliminary filtering, the detection scores of the selected concepts are still noisy. To alleviate the problems of noise and non-linearity across the different ranking lists, we first convert the raw ranking scores of the selected concepts into pairwise order matrices, in which each entry characterizes the comparative relationship of two test samples. Our hypothesis is that the relative score relations are consistent within component ranking lists, despite the large variations that may exist in the absolute values of the raw scores. Thus, we take the pairwise order matrices of different semantic concepts as input, and recover a common shared rank-2 order matrix. Even though each original matrix might be inaccurate, the joint relations from multiple matrices may be complementary with respect to each other, and hence the shared common matrix may well model the correct order. Finally, we transform the shared order matrix into a ranking list as the final retrieval result.
3.2 Query Interpretation
The goal of the query interpretation phase is to create a semantic representation of the query in terms of the salient semantic concepts that it expresses. These are taken from a large pool of candidate concepts and should not only be relevant to the textual query but also visually discernible in the videos. It is tedious, if not impossible, to manually label all related concepts in a large pool of potential candidate concepts. Therefore, most recent work relies on natural language processing technologies to compute the semantic similarity between the query and the candidate concepts (Gan et al. 2015a).

While fully automatic, this approach alone is far from satisfactory, due to the well-known semantic gap between query and visual content. For instance, a given abstract concept such as meeting may have vastly different visual representations in different circumstances. It appears unreasonable to choose the relevant visual concepts based only on semantic similarity scores between the concept names and the textual query. In our approach, we posit that the selected relevant concepts should have the following key properties:
1. Semantic relevance: the selected concepts should be semantically related to the textual query.
2. Visual reliability: the selected concepts should be reliably detectable across different datasets.
3. Event discriminativeness: the selected concepts should be discriminative enough for detecting the video activity.

Figure 1: An illustration of our zero-shot video activity recognition framework. We first automatically compute semantic similarity scores between the query and concepts based on the cosine distances of continuous word vectors, and filter out the concepts that are not sufficiently visually reliable or discriminative for events. Then we convert the raw ranking scores of the selected concepts into multiple pairwise order matrices, which are taken as input to recover a rank-2 order matrix under low-rank and skew-symmetric constraints. A robust ranking score vector is finally extracted to fit the recovered low-rank order matrix.
Query Analysis. To automatically generate query terms from an event description, we apply standard natural language processing techniques to clean up the textual description, including removal of common stop words and lemmatization (stemming) to normalize word inflections. We then compute the TF-IDF score of the remaining terms and select the top 5 terms as event query terms. Next, we compute the similarity of the query terms with the concept terms as described below, and average over all query terms.
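To make this query-analysis step concrete, the following minimal Python sketch illustrates one possible implementation. It assumes NLTK's stop-word list and WordNet lemmatizer (which require the corresponding NLTK data packages) and scikit-learn's TfidfVectorizer; the helper names `preprocess` and `extract_query_terms` are ours, not the authors'.

```python
import re
from nltk.corpus import stopwords            # requires the NLTK "stopwords" data package
from nltk.stem import WordNetLemmatizer      # requires the NLTK "wordnet" data package
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and lemmatize."""
    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words("english"))
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t not in stop)

def extract_query_terms(event_descriptions, target_idx, k=5):
    """Return the top-k TF-IDF terms of one event description as its query terms."""
    cleaned = [preprocess(d) for d in event_descriptions]
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(cleaned)             # one row per event description
    row = tfidf[target_idx].toarray().ravel()
    vocab = vec.get_feature_names_out()
    top = row.argsort()[::-1][:k]
    return [vocab[i] for i in top if row[i] > 0]
```

TF-IDF is computed across all event descriptions, so terms that appear in every description (e.g. "people") are automatically down-weighted.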
Semantic Similarity Computation. The semantic similarity computation requires us to have trained a model that can quantify the degree of semantic similarity between two words. This can conveniently be done beforehand in an offline process. We draw on the recent success of the skip-gram with negative sampling neural network model (Mikolov et al. 2013). Given a large text corpus such as Wikipedia, the objective is to produce vectors that represent words such that vector similarity reflects word similarity. The training objective of the skip-gram model achieves this by optimizing for word representations that allow for predicting the surrounding context words in a sentence. More precisely, given a sequence of words $\{w_1, w_2, \ldots, w_d\}$, it searches for a vector representation of each word $w_i$ such that

$$\frac{1}{d} \sum_{t=1}^{d} \sum_{-c \le j \le c,\, j \neq 0} \log P(w_{t+j} \mid w_t) \tag{1}$$
is maximized, where $c$ controls the context window length. The probability of $w_{t+j}$ given $w_t$ is defined by the softmax function

$$P(w_i \mid w_j) = \frac{\exp(w_i^{\top} w_j)}{\sum_{w} \exp(w^{\top} w_j)} \tag{2}$$
In order to optimize this more efficiently, a binary Huffman tree is used to predict words from the vocabulary, and training is carried out with stochastic gradient ascent, using negative sampling to limit the number of predictions (Mikolov et al. 2013).
Once we have optimized the word vector representations using the corpus, we can measure the similarity between words via the standard cosine measure. The larger the cosine score between two word vectors, the more they are deemed semantically related.
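As an illustration of how this selection could be implemented, here is a small Python sketch. It assumes the word vectors have already been trained offline (e.g., a skip-gram model on Wikipedia) and are available as a plain dict mapping words to numpy arrays; `concept_similarity` and `rank_concepts` are our own helper names.

```python
import numpy as np

def cosine(u, v):
    """Standard cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def concept_similarity(query_terms, concept_name, word_vectors):
    """Average cosine similarity between the query terms and a concept name.

    Multi-word concept names (e.g. "bee hive") are represented by the mean of
    their word vectors; terms missing from the vocabulary are skipped.
    """
    concept_words = [w for w in concept_name.lower().split() if w in word_vectors]
    if not concept_words:
        return 0.0
    c_vec = np.mean([word_vectors[w] for w in concept_words], axis=0)
    sims = [cosine(word_vectors[t], c_vec) for t in query_terms if t in word_vectors]
    return float(np.mean(sims)) if sims else 0.0

def rank_concepts(query_terms, concept_names, word_vectors):
    """Score every candidate concept against the query and sort by similarity."""
    scored = [(c, concept_similarity(query_terms, c, word_vectors)) for c in concept_names]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```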
Reliability and Discriminativeness Validation. While concepts selected by the above pipeline are likely semantically related, they need not be visually reliable or discriminative enough for activity recognition. To test reliability, we use two-fold drop-out cross validation. The precision averaged over the two folds reflects the reliability of the tested concept. We filter out concepts whose precision is below a threshold (set to 80% in all our experiments to reasonably balance precision and recall).

The discriminative power is assessed using detection scores on held-out data. Concepts that have detection scores over 1/2 for more than 50% of all activity classes are deleted. For example, person is semantically related to the event bee keeping and has a high reliability score. However, it obtains high scores on most of the videos, so the term lacks discriminative power for discerning more specific activities. Therefore, we may prune such concepts from the concept pool.
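A compact sketch of these two filters, assuming the cross-validation precisions and held-out detection scores have already been computed; the thresholds mirror the ones stated above (80% precision, scores above 1/2 on at most 50% of classes), but the data-structure names are ours.

```python
import numpy as np

def filter_concepts(candidates, cv_precision, heldout_scores,
                    min_precision=0.8, fire_thresh=0.5, max_fire_ratio=0.5):
    """Drop candidate concepts that are unreliable or non-discriminative.

    cv_precision[c]   : averaged precision of concept c under two-fold cross validation
    heldout_scores[c] : 1-D array of detection scores of concept c, one value per
                        held-out activity class (here: the UCF101 classes)

    A concept is kept only if its precision is at least `min_precision` and it
    scores above `fire_thresh` on at most `max_fire_ratio` of the activity classes.
    """
    kept = []
    for c in candidates:
        reliable = cv_precision[c] >= min_precision
        fire_ratio = float(np.mean(np.asarray(heldout_scores[c]) > fire_thresh))
        discriminative = fire_ratio <= max_fire_ratio
        if reliable and discriminative:
            kept.append(c)
    return kept
```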
3.3 Pairwise Order Matrix Construction
The next step is to use the semantic representation of the query to retrieve and rank a set of videos. For retrieval, most existing approaches (Jiang et al. 2015a; Wu et al. 2014; Dalton, Allan, and Mirajkar 2013; Habibian, Mensink, and Snoek 2014; Liu et al. 2013) naively average the detection scores of the selected relevant concepts. This leads to a suboptimal solution. First, the scales of the detection scores of different concepts are not comparable, even after normalization. Thus, it is unwise to use the same weight when fusing them, but manually fine-tuning weights is also not possible in practice. Second, even within the same concept, the detection scores are not necessarily linear. For example, we may not be able to discern any apparent difference between 0.5 and 0.9, yet perceive a marked difference between 0.1 and 0.15.

Assume that we have d concepts and n videos in the system. We apply d concept detectors to these videos. Our implementation uses a deep convolutional neural network (CNN) architecture (Krizhevsky, Sutskever, and Hinton 2012). We take the key frames of a given test video as input, run a forward pass through the CNN, and use the softmax score as the concept detection score for the key frame. To arrive at a video-level representation, we rely on simple average pooling.
This process yields a detection score matrix $X \in \mathbb{R}^{d \times n}$. Each column $X_i$ stores the detection scores of the $i$-th video with respect to all d concepts. For the k-th concept, we construct a pairwise order matrix $T^{(k)}$ with

$$T^{(k)}_{i,j} = \mathrm{sign}(X_{k,i} - X_{k,j}).$$

The matrix $T^{(k)}$ encodes the pairwise order of every two videos measured under the k-th concept. In particular, $T^{(k)}_{i,j} = 1$ indicates that the $i$-th sample is detected as positive with greater confidence than the $j$-th sample for the k-th concept, while $T^{(k)}_{i,j} = -1$ indicates the opposite comparative relation. Meanwhile, $T^{(k)}_{i,j} = 0$ indicates that the $i$-th sample has a detection confidence value similar to that of the $j$-th sample. As this order matrix captures relative assessments between different samples, it is not influenced by the scale or non-linearity of the detection scores within or between concepts.
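The score matrix X and the per-concept order matrices can be built with a few lines of numpy. This is a sketch under the definitions above (average pooling of frame scores, then $T^{(k)}_{i,j} = \mathrm{sign}(X_{k,i} - X_{k,j})$), not the authors' code; the function names are ours.

```python
import numpy as np

def video_level_scores(frame_scores):
    """Average-pool frame-level CNN softmax scores (frames x d) into one d-vector."""
    return np.asarray(frame_scores, dtype=float).mean(axis=0)

def pairwise_order_matrices(X):
    """Build one pairwise order matrix per concept from a d x n score matrix X.

    The result has shape (d, n, n) with T[k, i, j] = sign(X[k, i] - X[k, j]);
    each n x n slice is skew-symmetric and invariant to the scale or any
    monotone distortion of the raw detection scores.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, :, None] - X[:, None, :]      # shape (d, n, n)
    return np.sign(diff)
```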
Assume that there is a ground truth ranking score vector denoted as s. The corresponding pairwise order matrix $\hat{T}$ is defined by

$$\hat{T} = s e^{\top} - e s^{\top} \tag{3}$$

where $e = [1, 1, \cdots]^{\top}$. The next proposition shows that the rank of $\hat{T}$ is exactly 2.

Proposition 1. The rank of $\hat{T}$ is exactly 2 when the scores in s are not all equal to a constant value.
Proof. It is easy to confirm that $\mathrm{rank}(\hat{T}) \le 2$ from Eq. (3). If there is some s such that $\hat{T}$ has rank 1, then $\hat{T}_i = c_i \hat{T}_1$ for some constants $c_i$. Since $\hat{T} = -\hat{T}^{\top}$ and $\hat{T}_{i,i} = 0$, we have $\hat{T} = 0$, which contradicts the assumption that the scores in s are not all equal. Hence the rank is exactly 2.
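As a quick sanity check of Proposition 1, the following short snippet (ours, plain numpy) verifies numerically that $\hat{T} = s e^{\top} - e s^{\top}$ is skew-symmetric and has rank 2 for a random, non-constant score vector s.

```python
import numpy as np

# Numerical check of Proposition 1: for a non-constant score vector s,
# T_hat = s e^T - e s^T is skew-symmetric and has rank exactly 2.
rng = np.random.default_rng(0)
s = rng.random(6)                          # random, hence non-constant, scores
e = np.ones_like(s)
T_hat = np.outer(s, e) - np.outer(e, s)
assert np.allclose(T_hat, -T_hat.T)        # skew-symmetric
assert np.linalg.matrix_rank(T_hat) == 2   # rank is exactly 2
```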
Proposition 1 entails that $\mathrm{rank}(\hat{T}) = 2$ is a restricted convex relaxation with respect to directly optimizing s (Yuan, Li, and Zhang 2014). This suggests that we may use rank-2 hard iterative singular value thresholding, as we will explain shortly in more detail.

Based on the above discussion, we now turn to studying how to estimate $\hat{T}$ from a set of pairwise order matrices $T^{(k)}, k = 1, 2, \cdots, d$, under rank-2 and skew-symmetric constraints.
3.4 Recovering the Pairwise Order Matrix
Assume that we have a set of d pairwise order matrices $T^{(1)}, T^{(2)}, \cdots, T^{(d)}$. Because the detection scores are noisy, the order constraints of the $T^{(k)}$ may contradict each other. We need a robust approach to recover the matrix $\hat{T}$ by adaptively fusing the $T^{(k)}$. To this end, each time $\hat{T}_{i,j}$ violates the given order $T^{(k)}_{i,j}$, we penalize it with a loss function $\ell(\hat{T}_{i,j}, T^{(k)}_{i,j})$. To maximize the margin, we use the hinge loss in this paper (although theoretically any other loss function is applicable):
$$\ell(\hat{T}, T^{(k)}) \triangleq \sum_{i,j=1}^{n} \left[1 - T^{(k)}_{i,j} \hat{T}_{i,j}\right]_{+} \tag{4}$$

where $[z]_+ = \max(z, 0)$. Then we have the following optimization problem to recover $\hat{T}$:
$$\min_{\hat{T}}\ \ell(\hat{T}) \triangleq \sum_{k=1}^{d} \ell(\hat{T}, T^{(k)}) + \lambda \|\hat{T}\|^2 \tag{5}$$
$$\text{s.t.}\quad \mathrm{rank}(\hat{T}) = 2 \tag{6}$$
$$\qquad\;\; \hat{T} = -\hat{T}^{\top}. \tag{7}$$

Algorithm 1 Hard Iterative Singular Value Thresholding
1: Input: $T^{(k)}$, step size $\eta$.
2: $\hat{T}_0 = 0$
3: for $t = 1, 2, \cdots, L$ do
4:   $\hat{G}_t = \hat{T}_{t-1} - \eta \nabla_{\hat{T}}\, \ell(\hat{T})$
5:   $G_t = \frac{1}{2}(\hat{G}_t - \hat{G}_t^{\top})$
6:   SVD: $G_t = \sum_i \lambda_i U_i V_i^{\top}$
7:   rank-2 thresholding: $\hat{T}_t = \lambda_1 U_1 V_1^{\top} + \lambda_2 U_2 V_2^{\top}$
8: end for
9: Output: $\hat{T} = \hat{T}_L$
The above optimization can be effectively solved with hard iterative singular value thresholding, as depicted in Algorithm 1. It is not difficult to show that Algorithm 1 converges geometrically to the global optimum because the optimization problem is restricted convex.

In Algorithm 1, we first carry out a gradient descent step with respect to $\hat{T}$ with step size $\eta$. In line 5, we project the intermediate solution onto the skew-symmetric subspace. In lines 6 and 7, we greedily threshold the intermediate solution with the rank-2 constraint. It is important to note that line 5 must come before lines 6 and 7, because a skew-symmetric matrix is still skew-symmetric after rank-2 singular value thresholding, but the reverse does not hold.
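For concreteness, here is a small numpy sketch of Algorithm 1 applied to the hinge loss of Eq. (4) and the regularized objective of Eq. (5). The step size, regularization weight, and iteration count are placeholder values, and the subgradient computation is our own reading of the objective, not the authors' reference implementation.

```python
import numpy as np

def hard_isvt(T_list, step=0.01, lam=0.1, n_iter=100):
    """Recover a shared rank-2, skew-symmetric order matrix from noisy
    per-concept pairwise order matrices (sketch of Algorithm 1)."""
    T_list = [np.asarray(T, dtype=float) for T in T_list]
    n = T_list[0].shape[0]
    T_hat = np.zeros((n, n))
    for _ in range(n_iter):
        # Subgradient of  sum_k sum_ij [1 - T^(k)_ij * T_hat_ij]_+  +  lam * ||T_hat||^2
        grad = 2.0 * lam * T_hat
        for T_k in T_list:
            violated = (1.0 - T_k * T_hat) > 0            # margin violations
            grad -= T_k * violated
        G = T_hat - step * grad                           # line 4: gradient descent step
        G = 0.5 * (G - G.T)                               # line 5: skew-symmetric projection
        U, sing, Vt = np.linalg.svd(G)                    # line 6: SVD
        T_hat = (U[:, :2] * sing[:2]) @ Vt[:2, :]         # line 7: keep the top-2 singular values
    return T_hat
```

Because numpy returns the singular values in descending order, keeping the first two columns of U and rows of Vt implements the rank-2 thresholding of line 7.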
After obtaining the optimized matrix $\hat{T}$, we seek to recover the final ranking score vector $\hat{s}$. Based on the rank-2 assumption mentioned above, we expect that $\hat{T}$ is generated from $\hat{s}$ as $\hat{T} \approx \hat{s} e^{\top} - e \hat{s}^{\top}$. The authors of (Jiang et al. 2011) have shown that using $(1/m)\,\hat{T} e$ as the recovered s provides the best least-squares approximation, which can be formally described as follows:

$$(1/m)\,\hat{T} e = \arg\min_{\hat{s}} \left\|\hat{T} - (\hat{s} e^{\top} - e \hat{s}^{\top})\right\|. \tag{8}$$

Therefore, we can treat $(1/m)\,\hat{T} e$ as the recovered $\hat{s}$, giving us our final retrieval results.
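Continuing the sketch above, the final scores follow directly from Eq. (8). The helper `recover_scores` is our own name, and the commented usage assumes the functions from the earlier sketches.

```python
import numpy as np

def recover_scores(T_hat):
    """Least-squares ranking scores from the fused order matrix: s_hat = (1/m) * T_hat @ e."""
    m = T_hat.shape[0]
    return T_hat @ np.ones(m) / m

# Putting the pieces together (using the sketches above):
#   T_list  = pairwise_order_matrices(X)        # one n x n order matrix per selected concept
#   s_hat   = recover_scores(hard_isvt(T_list))
#   ranking = np.argsort(-s_hat)                # video indices, highest score first
```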
3.5 Out-of-Sample Extension
We can additionally deal with the case of new out-of-sample test videos. Given a new test video $x_{m+1}$, we first represent it semantically as an n-dimensional vector. For each dimension, we find its nearest neighbour among the existing test data $X = \{x_1, x_2, \ldots, x_m\}$. Let $x_i$ denote the nearest example for the $i$-th semantic concept, and $w_i$ denote the feature similarity based on the $i$-th semantic feature type. Then, the ranking score of $x_{m+1}$ can be computed as

$$\hat{s}(x_{m+1}) = \sum_{i=1}^{n} \frac{w(t^{i}_{m+1}, x_i)}{\sum_{i=1}^{n} w(t^{i}_{m+1}, x_i)}\, \hat{s}(x_i),$$

where $\hat{s}(x_i)$ is the aggregated score for sample $x_i$.
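A rough sketch of this nearest-neighbour extension is given below. The concrete similarity measure w(·,·) is not spelled out in the text, so the inverse-distance weight used here is our own assumption; the function name is also ours.

```python
import numpy as np

def out_of_sample_score(x_new, X, s_hat):
    """Score a new video without re-running the order-matrix recovery.

    x_new : concept detection scores for the new video (length d)
    X     : d x m matrix of concept scores for the existing test videos
    s_hat : length-m vector of aggregated ranking scores recovered above
    """
    d, m = X.shape
    weights, neighbour_scores = [], []
    for i in range(d):
        j = int(np.argmin(np.abs(X[i] - x_new[i])))     # nearest neighbour on concept i
        w = 1.0 / (1.0 + abs(X[i, j] - x_new[i]))       # assumed similarity measure
        weights.append(w)
        neighbour_scores.append(s_hat[j])
    weights = np.asarray(weights)
    return float(np.dot(weights, neighbour_scores) / weights.sum())
```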
4 Experiments
4.1 Experimental Setup
Dataset and Metrics. We conduct experiments on the challenging TRECVID Multimedia Event Detection datasets from 2013 (MED13test) and 2014 (MED14test). Each includes 25,000 test videos (over 960 hours of video) with per-video ground truth annotations for 20 event categories, all officially provided by NIST. Each category has a textual description in the form of an event name, definition, explication, and related evidence types. Since we focus on zero-shot event detection, the experiments are conducted without using any examples. To evaluate the results, we apply the official metrics: average precision (AP) per event, and mean average precision (mAP) obtained by averaging over all 20 events.
Image-based Concepts. We obtain 1000 image-based concept detectors using a deep convolutional neural network (CNN) (Krizhevsky, Sutskever, and Hinton 2012). We use the VGG19 architecture (Simonyan and Zisserman 2015), as implemented in the Caffe toolbox (Jia 2013). The network is trained on the ImageNet ILSVRC-2014 dataset (Deng et al. 2009), which includes 1.2M training images categorized into 1000 classes.
Video-based Concepts. We also obtain video-based concepts from four publicly available datasets: UCF101 (Soomro, Zamir, and Shah 2012), FCVID (Jiang et al. 2015b), Google Sports1M (Karpathy et al. 2014), and ActivityNet (Heilbron et al.). They contain 101 action categories, 239 action categories, 487 sports categories, and 203 activity categories, respectively. We extract improved dense trajectory features (Wang and Schmid 2013) from the videos and aggregate the local features into video-level feature vectors using Fisher vectors (Oneata et al. 2013). We train linear SVM classifiers and employ 5-fold cross validation to select the parameters.
Held-Out Data. In order to obtain the discriminative power scores, we test on the UCF101 dataset (crcv.ucf.edu/data/).
4.2 Experimental Results
Comparison with Previous Work. In Table 1, we compare our approach with other recent state-of-the-art systems, specifically the Bi-Concept approach (Habibian, Mensink, and Snoek 2014), EventNet (Ye et al.), the weak concepts approach (Wu et al. 2014), Selecting (Singh et al. 2015), and SPaR (Jiang et al. 2014). The first three of these only rely on concept aggregation, while Selecting and SPaR combine concept aggregation and re-ranking strategies. We report results on MED13test, as this allows us to directly quote the values given in the original papers, for fairness. The results are comparable, as we use the same data split. To better analyse our approach, we also implement a traditional attribute-based retrieval approach ("Basic") (Gan et al. 2015a). However, for a fair comparison, we use our concept features, as these are stronger, as shown later on.
