
Showing papers by "Fumin Shen published in 2019"


Journal ArticleDOI
TL;DR: A novel Binary Multi-View Clustering (BMVC) framework, which can dexterously manipulate multi-view image data and easily scale to large data, and is formulated by two key components: compact collaborative discrete representation learning and binary clustering structure learning, in a joint learning framework.
Abstract: Clustering is a long-standing and important research problem; however, it remains challenging when handling large-scale image data from diverse sources. In this paper, we present a novel Binary Multi-View Clustering (BMVC) framework, which can dexterously manipulate multi-view image data and easily scale to large data. To achieve this goal, we formulate BMVC with two key components, compact collaborative discrete representation learning and binary clustering structure learning, in a joint learning framework. Specifically, BMVC collaboratively encodes the multi-view image descriptors into a compact common binary code space by considering their complementary information; the collaborative binary representations are meanwhile clustered by a binary matrix factorization model, such that the cluster structures are optimized in the Hamming space by pure, extremely fast bit operations. For efficiency, code balance constraints are imposed on both the binary data representations and the cluster centroids. Finally, the resulting optimization problem is solved by an alternating optimization scheme with guaranteed fast convergence. Extensive experiments on four large-scale multi-view image datasets demonstrate that the proposed method enjoys a significant reduction in both computation and memory footprint, while achieving superior (in most cases) or very competitive performance in comparison with state-of-the-art clustering methods.

319 citations
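As an aside on the "pure, extremely fast bit operations" mentioned above: once samples and centroids are both binary, cluster assignment reduces to XOR plus popcount. A minimal numpy sketch of that step (illustrative only, not the BMVC implementation; the code length and names are made up):

```python
import numpy as np

def hamming_assign(codes, centroids):
    """Assign each binary code to its nearest centroid in Hamming space.

    codes:     (n, d) uint8 array of packed bits (d*8 bits per sample)
    centroids: (k, d) uint8 array of packed binary centroids
    Returns the index of the nearest centroid for every sample.
    """
    xor = codes[:, None, :] ^ centroids[None, :, :]      # XOR exposes differing bits
    dist = np.unpackbits(xor, axis=-1).sum(axis=-1)      # popcount = Hamming distance
    return dist.argmin(axis=1)

# Toy example: 6 samples, 2 centroids, 128-bit codes packed into 16 bytes each.
rng = np.random.default_rng(0)
codes = rng.integers(0, 256, size=(6, 16), dtype=np.uint8)
centroids = rng.integers(0, 256, size=(2, 16), dtype=np.uint8)
print(hamming_assign(codes, centroids))
```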


Journal ArticleDOI
TL;DR: A novel video captioning framework that integrates bidirectional long short-term memory (BiLSTM) and a soft attention mechanism to generate better global representations for videos and enhance the recognition of lasting motions in videos.
Abstract: Video captioning has been attracting broad research attention in the multimedia community. However, most existing approaches rely heavily on static visual information or only partially capture local temporal knowledge (e.g., within 16 frames), and thus can hardly describe motions accurately from a global view. In this paper, we propose a novel video captioning framework that integrates bidirectional long short-term memory (BiLSTM) and a soft attention mechanism to generate better global representations for videos and enhance the recognition of lasting motions in videos. To generate video captions, we exploit another long short-term memory network as a decoder to fully explore global contextual information. The benefits of our proposed method are twofold: 1) the BiLSTM structure comprehensively preserves global temporal and visual information, and 2) the soft attention mechanism enables the language decoder to recognize and focus on principal targets in the complex content. We verify the effectiveness of the proposed framework on two widely used benchmarks, the Microsoft Video Description corpus and MSR-Video to Text, and the experimental results demonstrate the superiority of the proposed approach compared to several state-of-the-art methods.

200 citations
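For readers unfamiliar with the soft attention step described above, a generic additive-attention sketch over per-frame features looks roughly as follows (not the paper's exact formulation; all dimensions and parameter names are placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(frame_feats, decoder_state, W_f, W_h, w):
    """Weight per-frame features by relevance to the current decoder state.

    frame_feats:   (T, Df) BiLSTM outputs, one vector per frame
    decoder_state: (Dh,)   current hidden state of the caption decoder
    W_f, W_h, w:   learned projection parameters (random placeholders here)
    Returns the attended context vector and the attention weights.
    """
    scores = np.tanh(frame_feats @ W_f + decoder_state @ W_h) @ w  # (T,)
    alpha = softmax(scores)                                        # attention weights
    context = alpha @ frame_feats                                  # (Df,) weighted sum
    return context, alpha

T, Df, Dh, Da = 20, 512, 256, 128
rng = np.random.default_rng(0)
ctx, alpha = soft_attention(rng.normal(size=(T, Df)), rng.normal(size=Dh),
                            rng.normal(size=(Df, Da)), rng.normal(size=(Dh, Da)),
                            rng.normal(size=Da))
print(ctx.shape, alpha.sum())  # (512,) and attention weights sum to ~1
```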


Journal ArticleDOI
TL;DR: This paper unifies the projections of text and image into the Hamming space within a common reconstructive embedding through a rigorous mathematical reformulation, which not only reduces the optimization complexity significantly but also facilitates inter-modal similarity preservation across modalities.
Abstract: In this paper, we study the problem of cross-modal retrieval using hashing-based approximate nearest neighbor search techniques. Most existing cross-modal hashing works mainly address the issue of multi-modal integration complexity by using the same mapping and similarity calculation for data from different media types. Nonetheless, this may cause information loss during the mapping process because it overlooks the specifics of each individual modality. In this paper, we propose a simple yet effective cross-modal hashing approach, termed collective reconstructive embeddings (CRE), which can simultaneously address the heterogeneity and the integration complexity of multi-modal data. To address the heterogeneity challenge, we propose to process heterogeneous types of data using different modality-specific models. Specifically, we model textual data with a cosine similarity-based reconstructive embedding to alleviate data sparsity as much as possible, while for image data we utilize the Euclidean distance to characterize the relationships of the projected hash codes. Meanwhile, we unify the projections of text and image into the Hamming space within a common reconstructive embedding through a rigorous mathematical reformulation, which not only reduces the optimization complexity significantly but also facilitates inter-modal similarity preservation. We further incorporate code balance and uncorrelation criteria into the problem and devise an efficient iterative algorithm for optimization. Comprehensive experiments on four widely used multimodal benchmarks show that the proposed CRE achieves superior performance compared with the state of the art on several challenging cross-modal tasks.

113 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: This work proposes Additive Focal Variational Auto-encoder (AF-VAE), a novel approach that can arbitrarily manipulate high-resolution face images using a simple yet effective model and only weak supervision of reconstruction and KL divergence losses.
Abstract: Recent studies have shown remarkable success in the face manipulation task with the advance of the GAN and VAE paradigms, but the outputs are sometimes limited to low resolution and lack diversity. In this work, we propose the Additive Focal Variational Auto-encoder (AF-VAE), a novel approach that can arbitrarily manipulate high-resolution face images using a simple yet effective model and only weak supervision of reconstruction and KL divergence losses. First, a novel additive Gaussian Mixture assumption is introduced with an unsupervised clustering mechanism in the structural latent space, which endows better disentanglement and boosts multi-modal representation with external memory. Second, to improve the perceptual quality of synthesized results, two simple strategies in architecture design are further tailored and discussed with respect to the behavior of the Human Visual System (HVS) for the first time, allowing for fine control over model complexity and sample quality. Human opinion studies and new state-of-the-art Inception Score (IS) / Fréchet Inception Distance (FID) results demonstrate the superiority of our approach over existing algorithms, advancing both the fidelity and the extremity of the face manipulation task.

71 citations
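The "weak supervision of reconstruction and KL divergence losses" refers to the standard VAE objective; a minimal numpy sketch of that objective for a diagonal-Gaussian posterior is shown below (purely illustrative, not the AF-VAE model or its additive/focal extensions):

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Standard VAE objective: reconstruction error + beta * KL(q(z|x) || N(0, I)).

    x, x_recon: flattened input image and its reconstruction
    mu, logvar: mean and log-variance of the approximate posterior q(z|x)
    """
    recon = np.sum((x - x_recon) ** 2)                          # L2 reconstruction term
    kl = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar))   # closed-form Gaussian KL
    return recon + beta * kl

rng = np.random.default_rng(0)
x, x_recon = rng.random(64 * 64 * 3), rng.random(64 * 64 * 3)
mu, logvar = rng.normal(size=128), rng.normal(size=128)
print(vae_loss(x, x_recon, mu, logvar))
```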


Journal ArticleDOI
TL;DR: A novel feature fusion network for facial expression recognition in a cross-domain manner, enabling facial expression recognition in a wide range of scenarios and achieving state-of-the-art performance.

54 citations


Journal ArticleDOI
TL;DR: A novel instance-level multi-instance learning model is proposed to enhance classifier learning by exploring PI from untagged corpora, which can effectively eliminate the dependency on manually labeled data and obtain much richer PI.
Abstract: The accuracy of data-driven learning approaches is often unsatisfactory when the training data is inadequate either in quantity or quality. Manually labeled privileged information (PI), e.g., attributes, tags or properties, is usually incorporated to improve classifier learning. However, the process of manually labeling is time-consuming and labor-intensive. Moreover, due to the limitations of personal knowledge, manually labeled PI may not be rich enough. To address these issues, we propose to enhance classifier learning by exploring PI from untagged corpora, which can effectively eliminate the dependency on manually labeled data and obtain much richer PI. In detail, we treat each selected PI as a subcategory and learn one classifier for each subcategory independently. The classifiers for all subcategories are integrated together to form a more powerful category classifier. Particularly, we propose a novel instance-level multi-instance learning model to simultaneously select a subset of training images from each subcategory and learn the optimal SVM classifiers based on the selected images. Extensive experiments on four benchmark data sets demonstrate the superiority of our proposed approach.

51 citations
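To make the subcategory-to-category integration concrete, the toy scikit-learn sketch below trains one linear SVM per subcategory and scores the category as the maximum subcategory response (a plain illustration of the integration idea only, not the paper's instance-level multi-instance model; all data and names are synthetic):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic setup: 3 subcategories of one target category vs. shared negatives.
subcategory_data = [rng.normal(loc=c, size=(50, 16)) for c in (1.0, 2.0, 3.0)]
negatives = rng.normal(loc=-1.0, size=(150, 16))

# One classifier per subcategory, each trained against the negatives.
classifiers = []
for pos in subcategory_data:
    X = np.vstack([pos, negatives])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(negatives))])
    classifiers.append(LinearSVC(C=1.0, max_iter=5000).fit(X, y))

def category_score(x):
    """Integrate subcategory classifiers: a sample belongs to the category
    if any subcategory classifier fires, so take the max decision value."""
    return max(clf.decision_function(x.reshape(1, -1))[0] for clf in classifiers)

print(category_score(rng.normal(loc=2.0, size=16)))   # positive-looking sample
print(category_score(rng.normal(loc=-1.0, size=16)))  # negative-looking sample
```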


Journal ArticleDOI
TL;DR: This paper presents a multimodal framework that solves the problem of polysemy by allowing sense-specific diversity in search results, training one visual classifier for each selected semantic sense and using the learned sense-specific classifiers to distinguish multiple visual senses.
Abstract: Labeled image datasets have played a critical role in high-level image understanding. However, the process of manual labeling is both time consuming and labor intensive. To reduce the dependence on manually labeled data, there have been increasing research efforts on learning visual classifiers by directly exploiting web images. One issue that limits their performance is the problem of polysemy. Existing unsupervised approaches attempt to reduce the influence of visual polysemy by filtering out irrelevant images, but do not directly address polysemy. To this end, in this paper, we present a multimodal framework that solves the problem of polysemy by allowing sense-specific diversity in search results. Specifically, we first discover a list of possible semantic senses from untagged corpora to retrieve sense-specific images. Then, we merge visually similar semantic senses and prune noise by using the retrieved images. Finally, we train one visual classifier for each selected semantic sense and use the learned sense-specific classifiers to distinguish multiple visual senses. Extensive experiments on classifying images into sense-specific categories and reranking search results demonstrate the superiority of our proposed approach.

44 citations


Journal ArticleDOI
TL;DR: This paper proposes a novel binary embedding-based zero-shot learning (BZSL) method, which recognizes the visual instances from unseen classes through an intermediate discriminative Hamming space, which well alleviates the visual-semantic bias problem.
Abstract: Zero-shot learning aims to classify visual instances from unseen classes in the absence of training examples. This is typically achieved by directly mapping visual features to a semantic embedding space of classes (e.g., attributes or word vectors), where the similarity between the two modalities can be readily measured. However, the semantic space may not be reliable for recognition due to noisy class embeddings or the visual bias problem. In this paper, we propose a novel binary embedding-based zero-shot learning (BZSL) method, which recognizes visual instances from unseen classes through an intermediate discriminative Hamming space. Specifically, BZSL jointly learns two binary coding functions to encode both visual instances and class embeddings into the Hamming space, which well alleviates the visual-semantic bias problem. As a desirable property, classifying an unseen instance can thereby be done efficiently by retrieving its nearest class codes with minimal Hamming distance. During training, by introducing two auxiliary variables for the coding functions, we formulate an equivalent correlation maximization problem, which admits an analytical solution. The resulting algorithm thus enjoys both highly efficient training and scalable novel-class inference. Extensive experiments on four benchmark datasets, including the full ImageNet Fall 2011 dataset with over 20k unseen classes, demonstrate the superiority of our method on the zero-shot learning task. In particular, we show that increasing the binary embedding dimension consistently improves the recognition accuracy.

41 citations


Posted Content
TL;DR: This work proposes to fool an image captioning system into generating targeted partial captions for an image polluted by adversarial noises, even when the targeted captions are totally irrelevant to the image content.
Abstract: In this work, we study the robustness of a CNN+RNN based image captioning system subjected to adversarial noises. We propose to fool an image captioning system into generating targeted partial captions for an image polluted by adversarial noises, even when the targeted captions are totally irrelevant to the image content. A partial caption indicates that the words at some locations in the caption are observed, while words at other locations are not restricted. This is the first work to study exact adversarial attacks of targeted partial captions. Due to the sequential dependencies among words in a caption, we formulate the generation of adversarial noises for targeted partial captions as a structured output learning problem with latent variables. Both the generalized expectation maximization algorithm and structural SVMs with latent variables are then adopted to optimize the problem. The proposed methods generate very successful attacks against three popular CNN+RNN based image captioning models. Furthermore, the proposed attack methods are used to understand the inner mechanism of image captioning systems, providing guidance to further improve automatic image captioning systems towards human captioning.

32 citations


Proceedings ArticleDOI
10 May 2019
TL;DR: In this paper, the authors study the robustness of a CNN+RNN based image captioning system subjected to adversarial noises and propose to fool it into generating targeted partial captions for an image polluted by adversarial noise.
Abstract: In this work, we study the robustness of a CNN+RNN based image captioning system subjected to adversarial noises. We propose to fool an image captioning system into generating targeted partial captions for an image polluted by adversarial noises, even when the targeted captions are totally irrelevant to the image content. A partial caption indicates that the words at some locations in the caption are observed, while words at other locations are not restricted. This is the first work to study exact adversarial attacks of targeted partial captions. Due to the sequential dependencies among words in a caption, we formulate the generation of adversarial noises for targeted partial captions as a structured output learning problem with latent variables. Both the generalized expectation maximization algorithm and structural SVMs with latent variables are then adopted to optimize the problem. The proposed methods generate very successful attacks against three popular CNN+RNN based image captioning models. Furthermore, the proposed attack methods are used to understand the inner mechanism of image captioning systems, providing guidance to further improve automatic image captioning systems towards human captioning.

32 citations


Proceedings ArticleDOI
15 Jun 2019
TL;DR: A novel framework for efficient sketch-based 3D shape retrieval, i.e., Deep Sketch-Shape Hashing (DSSH), which tackles the challenging problem from two perspectives and improves both the retrieval efficiency and accuracy remarkably, compared to the state-of-the-art methods.
Abstract: Sketch-based 3D shape retrieval has been extensively studied in recent works, most of which focus on improving the retrieval accuracy, whilst neglecting the efficiency. In this paper, we propose a novel framework for efficient sketch-based 3D shape retrieval, i.e., Deep Sketch-Shape Hashing (DSSH), which tackles the challenging problem from two perspectives. Firstly, we propose an intuitive 3D shape representation method to deal with unaligned shapes with arbitrary poses. Specifically, the proposed Segmented Stochastic-viewing Shape Network models discriminative 3D representations by a set of 2D images rendered from multiple views, which are stochastically selected from non-overlapping spatial segments of a 3D sphere. Secondly, Batch-Hard Binary Coding (BHBC) is developed to learn semantics-preserving compact binary codes by mining the hardest samples. The overall framework is jointly learned by developing an alternating iteration algorithm. Extensive experimental results on three benchmarks show that DSSH improves both the retrieval efficiency and accuracy remarkably, compared to the state-of-the-art methods.
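"Mining the hardest samples" is in the spirit of batch-hard mining; the generic numpy sketch below picks, for each anchor in a batch, its hardest positive (farthest same-label sample) and hardest negative (closest other-label sample). This illustrates the mining strategy only, not the BHBC objective itself:

```python
import numpy as np

def batch_hard_pairs(features, labels):
    """For each anchor, return the index of its hardest positive
    (farthest same-label sample) and hardest negative (closest other-label sample)."""
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    pos_mask = labels[:, None] == labels[None, :]
    neg_mask = ~pos_mask                            # different label -> valid negative
    np.fill_diagonal(pos_mask, False)               # an anchor is not its own positive
    hardest_pos = np.where(pos_mask, dist, -np.inf).argmax(axis=1)
    hardest_neg = np.where(neg_mask, dist, np.inf).argmin(axis=1)
    return hardest_pos, hardest_neg

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 32))                    # e.g. sketch/shape embeddings
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(batch_hard_pairs(feats, labels))
```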

Proceedings ArticleDOI
15 Jun 2019
TL;DR: A polynomial pooling (P-pooling) function is proposed that finds an intermediate form between max and average pooling to provide an optimally balanced and self-adjusted pooling strategy for semantic segmentation.
Abstract: Semantic segmentation is an important computer vision task, which aims to allocate a semantic label to each pixel in an image. When training a segmentation model, it is common to fine-tune a classification network pre-trained on a large-scale dataset. However, as an intrinsic property of the classification model, invariance to spatial perturbation resulting from the loss of detail-sensitivity prevents segmentation networks from achieving high performance. The use of standard poolings is one of the key factors behind this invariance. The most common standard poolings are max and average pooling. Max pooling increases both the invariance to spatial perturbations and the non-linearity of the network. Average pooling, on the other hand, is sensitive to spatial perturbations but is a linear function. For semantic segmentation, we prefer both the preservation of detailed cues within a local feature region and the non-linearity that increases a network's functional complexity. In this work, we propose a polynomial pooling (P-pooling) function that finds an intermediate form between max and average pooling to provide an optimally balanced and self-adjusted pooling strategy for semantic segmentation. P-pooling is differentiable and can be applied to a variety of pre-trained networks. Extensive studies on the PASCAL VOC, Cityscapes and ADE20k datasets demonstrate the superiority of P-pooling over other poolings. Experiments on various network architectures and state-of-the-art training strategies also show that models with P-pooling layers consistently outperform those directly fine-tuned using pre-trained classification models.
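One common way to realize such an intermediate pooling is a power-weighted mean whose exponent interpolates between average pooling (p = 0) and a max-like response (large p). The sketch below uses that generic form; it is not claimed to be the paper's exact P-pooling definition:

```python
import numpy as np

def poly_pool(window, p=2.0, eps=1e-6):
    """Power-weighted mean over a (non-negative) pooling window:
    p = 0 recovers average pooling, large p approaches max pooling."""
    w = np.power(window, p)
    return np.sum(w * window) / (np.sum(w) + eps)

window = np.array([0.1, 0.4, 0.9, 0.2])  # e.g. one 2x2 window after ReLU
for p in (0.0, 1.0, 2.0, 8.0, 64.0):
    print(p, round(poly_pool(window, p), 4))
# p = 0 gives the mean (0.4); p = 64 is already very close to the max (0.9).
```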

Journal ArticleDOI
TL;DR: This paper proposes a novel framework, error correcting input and output (EC-IO) coding, which does class-level and instance-level encoding under a unified mapping space, and presents the hashing model, EC-IOH, by approximating the mapping space with the Hamming space.
Abstract: Most learning-based hashing algorithms leverage sample-to-sample similarities, such as neighborhood structure, to generate binary codes, which achieve promising results for image retrieval. This type of method is referred to as instance-level encoding. However, it is nontrivial to define a scalar that represents sample-to-sample similarity while encoding both the semantic labels and the data structure. To address this issue, in this paper, we seek to use a class-level encoding method, which encodes the class-to-class relationship, to take the semantic information of classes into consideration. Based on these two encodings, we propose a novel framework, error correcting input and output (EC-IO) coding, which performs class-level and instance-level encoding in a unified mapping space. Our proposed model contains two major components: distribution preservation and error correction. With these two components, our model maps the input features of samples and the output codes of classes into a unified space to encode the intrinsic structure of the data and the semantic information of the classes simultaneously. Under this framework, we present our hashing model, EC-IO hashing (EC-IOH), by approximating the mapping space with the Hamming space. Extensive experiments are conducted to evaluate the retrieval performance, and EC-IOH exhibits superior and competitive performance compared with popular supervised and unsupervised hashing methods.

Journal ArticleDOI
TL;DR: This paper proposes to transfer deep features, learned originally for image classification, to the visual tracking domain via some “grafted” auxiliary networks, which improves the tracking performance significantly in both accuracy and efficiency.
Abstract: Visual tracking is one of the fundamental problems in computer vision. Recently, some deep-learning-based tracking algorithms have demonstrated record-breaking performance. However, due to the high complexity of neural networks, most deep trackers suffer from low tracking speed and are thus impractical in many real-world applications. Some recently proposed deep trackers with smaller network structures achieve high efficiency, but at the cost of a significant decrease in precision. In this paper, we propose to transfer deep features, which are learned originally for image classification, to the visual tracking domain. The domain adaptation is achieved via some “grafted” auxiliary networks, which are trained by regressing the object location in tracking frames. This adaptation improves the tracking performance significantly in both accuracy and efficiency. The resulting deep tracker runs in real time and also achieves state-of-the-art accuracy in experiments on two well-adopted benchmarks with more than 100 test videos. Furthermore, the adaptation is also naturally used to introduce the objectness concept into visual tracking. This removes a long-standing target ambiguity in visual tracking tasks, and we illustrate the empirical superiority of this better-defined task.

Posted Content
TL;DR: A large-scale RGB-D action dataset for arbitrary-view action analysis is collected, including RGB videos, depth and skeleton sequences, and a View-guided Skeleton CNN (VS-CNN) is proposed to tackle the problem of arbitrary-view action recognition.
Abstract: Current research on action recognition mainly focuses on single-view and multi-view recognition, which can hardly satisfy the requirements of human-robot interaction (HRI) applications that need to recognize actions from arbitrary views. The lack of datasets also sets up barriers. To provide data for arbitrary-view action recognition, we newly collect a large-scale RGB-D action dataset for arbitrary-view action analysis, including RGB videos, depth and skeleton sequences. The dataset includes action samples captured from 8 fixed viewpoints and varying-view sequences covering the entire 360-degree range of view angles. In total, 118 persons were invited to perform 40 action categories, and 25,600 video samples were collected. Our dataset involves more participants, more viewpoints and a larger number of samples than existing datasets. More importantly, it is the first dataset containing entire 360-degree varying-view sequences. The dataset provides sufficient data for multi-view, cross-view and arbitrary-view action analysis. Besides, we propose a View-guided Skeleton CNN (VS-CNN) to tackle the problem of arbitrary-view action recognition. Experimental results show that the VS-CNN achieves superior performance.

Posted Content
TL;DR: A meta-learning-based online optimization approach to dynamically learn the interpolation policy in a data-adaptive way (learning to learn better); extensive experiments show that the MetaMixUp-adapted SSL greatly outperforms MixUp and many state-of-the-art methods on the CIFAR-10 and SVHN benchmarks under the SSL configuration.
Abstract: MixUp is an effective data augmentation method that regularizes deep neural networks via random linear interpolations between pairs of samples and their labels. It plays an important role in model regularization, semi-supervised learning and domain adaptation. However, despite its empirical success, the deficiency of randomly mixing samples has been poorly studied. Since deep networks are capable of memorizing the entire dataset, corrupted samples generated by vanilla MixUp with a badly chosen interpolation policy will degrade the performance of networks. To overcome the underfitting caused by corrupted samples, and inspired by meta-learning (learning to learn), we propose a novel technique for learning to mix up, namely MetaMixUp. Unlike vanilla MixUp, which samples the interpolation policy from a predefined distribution, this paper introduces a meta-learning based online optimization approach to dynamically learn the interpolation policy in a data-adaptive way. The validation set performance measured via meta-learning captures the underfitting issue, which provides more information to refine the interpolation policy. Furthermore, we adapt our method to pseudo-label based semi-supervised learning (SSL) along with a refined pseudo-labeling strategy. In our experiments, our method achieves better performance than vanilla MixUp and its variants under the supervised learning configuration. In particular, extensive experiments show that our MetaMixUp-adapted SSL greatly outperforms MixUp and many state-of-the-art methods on the CIFAR-10 and SVHN benchmarks under the SSL configuration.
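For reference, the vanilla MixUp being improved upon takes only a few lines: sample an interpolation coefficient from a Beta distribution and convex-combine both inputs and one-hot labels. A minimal numpy sketch follows (MetaMixUp would replace the fixed Beta sampling with a meta-learned policy):

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=np.random.default_rng()):
    """Vanilla MixUp: convex-combine each sample with a randomly paired one."""
    lam = rng.beta(alpha, alpha)          # interpolation coefficient
    perm = rng.permutation(len(x))        # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x = rng.random((4, 32, 32, 3))                       # toy image batch
y = np.eye(10)[rng.integers(0, 10, size=4)]          # one-hot labels
x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.2, rng=rng)
print(lam, y_mix[0])
```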

Journal ArticleDOI
TL;DR: A novel Word-to-Region Attention Network (WRAN) is proposed, which can simultaneously locate pertinent object regions, instead of a uniform grid of image regions of equal size, and identify the corresponding words of the reference question, as well as enforce consistency between image object regions and the core semantics of questions.
Abstract: Visual attention, which allows more concentration on the image regions that are relevant to a reference question, brings remarkable performance improvements in Visual Question Answering (VQA). Most VQA attention models employ the entire reference question representation to query relevant image regions. Nonetheless, only certain salient words of the question play an effective role in an attention operation. In this paper, we propose a novel Word-to-Region Attention Network (WRAN), which can 1) simultaneously locate pertinent object regions, instead of a uniform grid of image regions of equal size, and identify the corresponding words of the reference question, as well as 2) enforce consistency between image object regions and the core semantics of questions. We evaluate the proposed model on the VQA v1.0 and VQA v2.0 datasets. Experimental results demonstrate the superiority of the proposed model compared to state-of-the-art methods.

Posted Content
TL;DR: This paper proposes an additive Gaussian Mixture assumption with an unsupervised clustering mechanism in the structural latent space, which endows better disentanglement and boosts multi-modal representation with external memory; two simple strategies in architecture design are further tailored and discussed with respect to the behavior of the Human Visual System (HVS) for the first time, allowing for fine control over model complexity and sample quality.
Abstract: Recent studies have shown remarkable success in the face manipulation task with the advance of the GAN and VAE paradigms, but the outputs are sometimes limited to low resolution and lack diversity. In this work, we propose the Additive Focal Variational Auto-encoder (AF-VAE), a novel approach that can arbitrarily manipulate high-resolution face images using a simple yet effective model and only weak supervision of reconstruction and KL divergence losses. First, a novel additive Gaussian Mixture assumption is introduced with an unsupervised clustering mechanism in the structural latent space, which endows better disentanglement and boosts multi-modal representation with external memory. Second, to improve the perceptual quality of synthesized results, two simple strategies in architecture design are further tailored and discussed with respect to the behavior of the Human Visual System (HVS) for the first time, allowing for fine control over model complexity and sample quality. Human opinion studies and new state-of-the-art Inception Score (IS) / Fréchet Inception Distance (FID) results demonstrate the superiority of our approach over existing algorithms, advancing both the fidelity and the extremity of the face manipulation task.

Posted Content
TL;DR: This paper proposes an adaptive multi-model framework that resolves polysemy by visual disambiguation and can adapt to the dynamic changes in the search results.
Abstract: Due to the high cost of manual annotation, learning directly from the web has attracted broad attention. One issue that limits their performance is the problem of visual polysemy. To address this issue, we present an adaptive multi-model framework that resolves polysemy by visual disambiguation. Compared to existing methods, the primary advantage of our approach lies in that our approach can adapt to the dynamic changes in the search results. Our proposed framework consists of two major steps: we first discover and dynamically select the text queries according to the image search results, then we employ the proposed saliency-guided deep multi-instance learning network to remove outliers and learn classification models for visual disambiguation. Extensive experiments demonstrate the superiority of our proposed approach.

Patent
02 Apr 2019
TL;DR: Improved sketch-based image retrieval (SBIR) as discussed by the authors utilizes an architecture comprising three interconnected neural networks to enable zero-shot image recognition and retrieval based on free-hand sketches.
Abstract: This disclosure relates to improved sketch-based image retrieval (SBIR) techniques. The SBIR techniques utilize an architecture comprising three interconnected neural networks to enable zero-shot image recognition and retrieval based on free-hand sketches. Zero-shot learning may be implemented to retrieve one or more images corresponding to the sketches without prior training on all categories of the sketches. The neural network architecture may do so, at least in part, by training encoder hashing functions to mitigate heterogeneity of sketches and images, and by applying semantic knowledge that is learned during a limited training phase to unknown categories.

Proceedings ArticleDOI
01 May 2019
TL;DR: An adaptive multi-model framework that resolves polysemy by visual disambiguation, adapting to the dynamic changes in the search results and employing the proposed saliency-guided deep multi-instance learning network to remove outliers and learn classification models for visual disambiguation.
Abstract: Due to the high cost of manual annotation, learning directly from the web has attracted broad attention. One issue that limits their performance is the problem of visual polysemy. To address this issue, we present an adaptive multi-model framework that resolves polysemy by visual disambiguation. Compared to existing methods, the primary advantage of our approach lies in that our approach can adapt to the dynamic changes in the search results. Our proposed framework consists of two major steps: we first discover and dynamically select the text queries according to the image search results, then we employ the proposed saliency-guided deep multi-instance learning network to remove outliers and learn classification models for visual disambiguation. Extensive experiments demonstrate the superiority of our proposed approach.

Journal ArticleDOI
TL;DR: A novel multi-view deep neural network, termed Fusion by Synthesis (FS), which leverages word embeddings of classes as complementary information to attributes and performs zero-shot prediction by fusing the word embeddings of unseen classes and the synthesized attributes in the visual feature space.

Posted Content
TL;DR: This paper proposes an efficient temporal reasoning graph (TRG) to simultaneously capture the appearance features and temporal relations between video sequences at multiple time scales, which can extract discriminative features for activity recognition.
Abstract: Although great success has been achieved in activity analysis, many challenges remain. Most existing work on activity recognition pays attention to designing efficient architectures or video sampling strategies. However, due to the fine-grained nature of actions and the long-term structure of videos, activity recognition is expected to reason about the temporal relations between video sequences. In this paper, we propose an efficient temporal reasoning graph (TRG) to simultaneously capture the appearance features and the temporal relations between video sequences at multiple time scales. Specifically, we construct learnable temporal relation graphs to explore temporal relations over multi-scale ranges. Additionally, to facilitate multi-scale temporal relation extraction, we design a multi-head temporal adjacency matrix to represent multiple kinds of temporal relations. Eventually, a multi-head temporal relation aggregator is proposed to extract the semantic meaning of the features convolved through the graphs. Extensive experiments are performed on widely used large-scale datasets, such as Something-Something and Charades, and the results show that our model achieves state-of-the-art performance. Further analysis shows that temporal relation reasoning with our TRG can extract discriminative features for activity recognition.

Proceedings ArticleDOI
15 Oct 2019
TL;DR: This paper proposes Generative Reconstructive Hashing (GRH), which maps incomplete video features to the feature distributions of complete videos, and a discriminative hashing module that further fills the gap between full video features and features estimated from partial videos by projecting both into a common binary feature space.
Abstract: In the literature of video analysis, most research, such as retrieval and recognition, hypothesizes that each input video contains at least one complete semantic entity, e.g., an activity, action or event. However, this hypothesis does not hold in many realistic scenarios for two main reasons. First, complete videos whose quality is good enough for automatic analysis are not always accessible because of heavy motion blur, occlusions, interruptions, etc. Second, extracting features from complete videos often fails to meet the speed and storage requirements of large-scale use cases. To tackle these challenges, incomplete videos are more useful, but research on them is seldom mentioned. In this paper, we propose a novel and effective hashing framework specialized for large-scale incomplete video analysis, called Generative Reconstructive Hashing (GRH). To begin with, an adversarial generative network is specially designed to map incomplete video features to the feature distributions of complete videos, so that features of incomplete videos become indistinguishable from those of complete videos. Then, the discriminative hashing module further fills the gap between full video features and features estimated from partial videos by projecting both into a common binary feature space, which allows improvements in efficiency compared with real-valued methods. GRH is the first end-to-end framework for incomplete video analysis. Extensive experiments on various datasets demonstrate GRH's superior effectiveness and efficiency on retrieval and recognition tasks. GRH outperforms recent state-of-the-art methods by 5.44/3.22/4.82 in terms of mAP on the HMDB51/UCF101/CCV datasets, respectively.

Posted Content
TL;DR: This work evaluates the proposed webly supervised approach on the benchmark Pascal VOC 2007 dataset, and the results demonstrate the superiority of the method over many other state-of-the-art methods in image data collection.
Abstract: Recent successes in visual recognition can be primarily attributed to feature representation, learning algorithms, and the ever-increasing size of labeled training data. Extensive research has been devoted to the first two, but much less attention has been paid to the third. Due to the high cost of manual labeling, the size of recent efforts such as ImageNet is still relatively small with respect to daily applications. In this work, we mainly focus on how to automatically generate identifying image data for a given visual concept on a vast scale. With the generated image data, we can train a robust recognition model for the given concept. We evaluate the proposed webly supervised approach on the benchmark Pascal VOC 2007 dataset, and the results demonstrate the superiority of our proposed approach in image data collection.

Journal ArticleDOI
TL;DR: An event early embedding model (EEEM) that can extract social events from noise, find the previous similar events, and predict future dynamics of a new event with very limited information is designed and a denoising approach is derived from the knowledge of signal analysis to eliminate social noise and extract events.
Abstract: Social media has become one of the most credible sources for delivering messages, breaking news, and events. Predicting the future dynamics of an event at a very early stage is significantly valuable, e.g., helping companies anticipate marketing trends before the event matures. However, this prediction is non-trivial because a) social events are always mixed with “noise” under the same topic and b) the information obtained at an early stage is too sparse and limited to support an accurate prediction. In order to overcome these two problems, in this paper, we design an event early embedding model (EEEM) that can 1) extract social events from noise, 2) find previous similar events, and 3) predict the future dynamics of a new event with very limited information. Specifically, a denoising approach is derived from the knowledge of signal analysis to eliminate social noise and extract events. Moreover, we propose a novel prediction scheme based on the locally linear embedding algorithm to construct the volume of a new event from its k nearest neighbors. Compared to previous work that only fits the historical volume dynamics to make a prediction, our predictive model is based on both the volume information and the content information of events. Extensive experiments conducted on a large-scale Twitter dataset demonstrate the capacity of our model to extract events and its promising prediction performance when considering both volume and content information. Compared with predicting using only the content or only the volume feature, our proposed fusion method that considers both achieves the best performance.
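The locally-linear-embedding-style prediction step can be sketched compactly: solve for weights that best reconstruct a new event's early-stage volume from its k nearest historical events, then apply the same weights to those events' later volumes. A toy numpy sketch on synthetic data (not the EEEM implementation; the horizon lengths and names are made up):

```python
import numpy as np

def lle_predict(early, history_early, history_future, k=5, reg=1e-3):
    """Reconstruct `early` from its k nearest neighbors in `history_early`,
    then carry the reconstruction weights over to `history_future`."""
    d = np.linalg.norm(history_early - early, axis=1)
    nn = np.argsort(d)[:k]                            # k nearest past events
    Z = history_early[nn] - early                     # local coordinates
    G = Z @ Z.T + reg * np.eye(k)                     # regularized local Gram matrix
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                                      # weights sum to 1 (LLE constraint)
    return w @ history_future[nn]

rng = np.random.default_rng(0)
history_early = rng.random((200, 12))   # first 12 hourly volumes of 200 past events
history_future = rng.random((200, 48))  # their subsequent 48 hourly volumes
new_event_early = rng.random(12)
print(lle_predict(new_event_early, history_early, history_future).shape)  # (48,)
```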

Posted Content
TL;DR: The theoretical convergence of the proposed algorithm is proven, and the explicit convergence rates are derived, for objective functions with Lipschitz continuous gradients, which are commonly adopted in practice.
Abstract: Binary optimization, a representative subclass of discrete optimization, plays an important role in mathematical optimization and has various applications in computer vision and machine learning. Usually, binary optimization problems are NP-hard and difficult to solve due to the binary constraints, especially when the number of variables is very large. Existing methods often suffer from high computational costs or large accumulated quantization errors, or are only designed for specific tasks. In this paper, we propose a fast algorithm to find effective approximate solutions for general binary optimization problems. The proposed algorithm iteratively solves minimization problems related to the linear surrogates of loss functions, which leads to the updating of some binary variables most impacting the value of loss functions in each step. Our method supports a wide class of empirical objective functions with/without restrictions on the numbers of $1$s and $-1$s in the binary variables. Furthermore, the theoretical convergence of our algorithm is proven, and the explicit convergence rates are derived, for objective functions with Lipschitz continuous gradients, which are commonly adopted in practice. Extensive experiments on several binary optimization tasks and large-scale datasets demonstrate the superiority of the proposed algorithm over several state-of-the-art methods in terms of both effectiveness and efficiency.
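One way to picture the linear-surrogate idea: at the current binary point, take a first-order model of the loss and flip the coordinates whose gradient signs agree with their current values, since those flips decrease the surrogate the most. The toy numpy sketch below applies this greedy rule to a binary quadratic objective; it is illustrative only and does not reproduce the paper's algorithm or its convergence guarantees:

```python
import numpy as np

def surrogate_binary_descent(grad_fn, b, n_iters=50, flips_per_iter=5):
    """Greedy descent on {-1, +1}^n guided by a linear surrogate of the loss.

    At each step the linear surrogate f(b) + g^T (x - b) is reduced by flipping
    a few coordinates where sign(b) == sign(g) and |g| is largest.
    """
    for _ in range(n_iters):
        g = grad_fn(b)
        gain = b * g                     # positive where flipping reduces the surrogate
        idx = np.argsort(-gain)[:flips_per_iter]
        idx = idx[gain[idx] > 0]         # only flip coordinates that actually help
        if idx.size == 0:
            break
        b = b.copy()
        b[idx] = -b[idx]
    return b

# Toy objective: f(b) = 0.5 * b^T A b - c^T b, with gradient A b - c.
rng = np.random.default_rng(0)
A = rng.normal(size=(64, 64)); A = A @ A.T / 64.0
c = rng.normal(size=64)
b0 = np.sign(rng.normal(size=64))
b = surrogate_binary_descent(lambda b: A @ b - c, b0)
f = lambda b: 0.5 * b @ A @ b - c @ b
print("objective before:", f(b0), " after:", f(b))
```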

Posted Content
TL;DR: A novel cooperative cross-stream network that investigates the conjoint information in multiple different modalities and enhances the discriminative power of the deeply learned features and reduces the undesired modality discrepancy by jointly optimizing a modality ranking constraint and a cross-entropy loss for both homogeneous and heterogeneous modalities.
Abstract: The spatial and temporal stream model has achieved great success in video action recognition. Most existing works pay more attention to designing effective feature fusion methods and train the two-stream model in a separate way. However, it is hard to ensure discriminability and to explore complementary information between different streams in existing works. In this work, we propose a novel cooperative cross-stream network that investigates the conjoint information in multiple different modalities. Feature extraction for the joint spatial and temporal stream networks is accomplished in an end-to-end learning manner. The network extracts complementary information across modalities through a connection block, which aims at exploring correlations between different stream features. Furthermore, different from a conventional ConvNet that learns deep separable features with only one cross-entropy loss, our proposed model enhances the discriminative power of the deeply learned features and reduces the undesired modality discrepancy by jointly optimizing a modality ranking constraint and a cross-entropy loss for both homogeneous and heterogeneous modalities. The modality ranking constraint consists of an intra-modality discriminative embedding and an inter-modality triplet constraint, and it reduces both the intra-modality and cross-modality feature variations. Experiments on three benchmark datasets demonstrate that, by making appearance and motion feature extraction cooperate, our method achieves state-of-the-art or competitive performance compared with existing results.

Proceedings ArticleDOI
15 Oct 2019
TL;DR: A framework named TSE, combining in-session item embedding with a time decay factor for multimodal recommendation, which performed well in the Content-based Video Relevance Prediction Challenge and won first place in the competition.
Abstract: Computing correlations between TV series is one of the most important tasks for personalized online streaming services. Given the relevance of TV series and viewer feedback, we can calculate a TV series correlation table based on viewers' implicit feedback, but this does not perform well for newly added "cold start" TV series. In this paper, we aim to improve correlation computing in the cold-start phase. We propose a framework named Time-aware Session Embedding (TSE), with in-session item embedding and a time decay factor, for multimodal recommendation. We use a lower-dimensional vector as the item embedding and weight the items with a time decay factor. The framework performed well in the Content-based Video Relevance Prediction Challenge, and we won first place in this competition.
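The "time decay factor" can be sketched in a few lines: represent a session by a weighted average of its item embeddings, with weights decaying with how long ago each item was watched. A toy numpy sketch (the decay form, half-life and names are assumptions, not the paper's exact factor):

```python
import numpy as np

def session_embedding(item_vecs, ages_in_days, half_life=7.0):
    """Time-aware session embedding: exponentially down-weight older items."""
    w = np.power(0.5, ages_in_days / half_life)       # weight halves every `half_life` days
    w /= w.sum()
    return w @ item_vecs                              # weighted average of item embeddings

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(5, 64))                  # embeddings of 5 items in a session
ages = np.array([0.5, 1.0, 3.0, 10.0, 30.0])          # days since each item was watched
emb = session_embedding(item_vecs, ages)
print(emb.shape)                                      # (64,)
```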