
Showing papers by "Jun Xiao published in 2019"


Proceedings Article•DOI•
Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, Yueting Zhuang
15 Jun 2019
TL;DR: A self-supervised spatiotemporal learning technique which leverages the chronological order of videos to learn the spatiotemporal representation of the video by predicting the order of shuffled clips from the video.
Abstract: We propose a self-supervised spatiotemporal learning technique which leverages the chronological order of videos. Our method learns the spatiotemporal representation of a video by predicting the order of shuffled clips from that video. The category of the video is not required, which gives our technique the potential to take advantage of infinite unannotated videos. Related works use frames instead; compared to frames, clips are more consistent with the video dynamics, help to reduce the uncertainty of orders, and are more appropriate for learning a video representation. 3D convolutional neural networks are used to extract features for the clips, and these features are processed to predict the actual order. The learned representations are evaluated via nearest-neighbor retrieval experiments. We also use the learned networks as pre-trained models and fine-tune them on the action recognition task. Three types of 3D convolutional neural networks are tested in experiments, and we obtain large improvements over existing self-supervised methods.
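To make the pretext task concrete, here is a minimal sketch of clip-order prediction, assuming a generic 3D-CNN backbone; `Simple3DCNN`, the clip count, and all sizes are illustrative stand-ins, not the authors' implementation.

```python
# A minimal sketch of the clip-order pretext task described above.
import itertools
import random
import torch
import torch.nn as nn

N_CLIPS = 3
PERMS = list(itertools.permutations(range(N_CLIPS)))  # 3! = 6 order classes

class Simple3DCNN(nn.Module):
    """Stand-in 3D backbone: clips (B, C, T, H, W) -> features (B, D)."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Conv3d(3, dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)

    def forward(self, x):
        return self.pool(self.conv(x)).flatten(1)

class ClipOrderNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = Simple3DCNN(dim)
        self.head = nn.Linear(N_CLIPS * dim, len(PERMS))

    def forward(self, clips):          # clips: (B, N_CLIPS, C, T, H, W)
        feats = [self.backbone(clips[:, i]) for i in range(N_CLIPS)]
        return self.head(torch.cat(feats, dim=1))  # permutation logits

# One self-supervised step: shuffle clips, predict which permutation was used.
clips = torch.randn(2, N_CLIPS, 3, 8, 32, 32)      # fake video clips
label = random.randrange(len(PERMS))
shuffled = clips[:, list(PERMS[label])]
loss = nn.CrossEntropyLoss()(ClipOrderNet()(shuffled),
                             torch.full((2,), label, dtype=torch.long))
```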

406 citations


Proceedings Article•DOI•
01 Oct 2019
TL;DR: CMAT is a multi-agent policy gradient method that frames objects into cooperative agents, and then directly maximizes a graph-level metric as the reward, and uses a counterfactual baseline that disentangles the agent-specific reward by fixing the predictions of other agents.
Abstract: Scene graphs --- objects as nodes and visual relationships as edges --- describe the whereabouts and interactions of objects in an image for comprehensive scene understanding. To generate coherent scene graphs, almost all existing methods exploit the fruitful visual context by modeling message passing among objects. For example, ``person'' on ``bike'' can help to determine the relationship ``ride'', which in turn contributes to the confidence of the two objects. However, we argue that the visual context is not properly learned by using the prevailing cross-entropy based supervised learning paradigm, which is not sensitive to graph inconsistency: errors at the hub or non-hub nodes should not be penalized equally. To this end, we propose a Counterfactual critic Multi-Agent Training (CMAT) approach. CMAT is a multi-agent policy gradient method that frames objects into cooperative agents, and then directly maximizes a graph-level metric as the reward. In particular, to assign the reward properly to each agent, CMAT uses a counterfactual baseline that disentangles the agent-specific reward by fixing the predictions of other agents. Extensive validations on the challenging Visual Genome benchmark show that CMAT achieves a state-of-the-art performance by significant gains under various settings and metrics.
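The counterfactual baseline can be illustrated with a short sketch: an agent's credit is the graph-level reward minus the expected reward when only that agent's prediction is resampled, with every other agent's prediction held fixed. The reward function below is a toy placeholder, not the Visual Genome metric.

```python
# A hedged sketch of the counterfactual baseline idea from CMAT.
from typing import Callable, List
import torch

def counterfactual_advantages(
    labels: List[int],                 # each agent's (object's) predicted class
    probs: torch.Tensor,               # (n_agents, n_classes) class distributions
    graph_reward: Callable[[List[int]], float],
) -> torch.Tensor:
    actual = graph_reward(labels)
    advs = []
    for i in range(len(labels)):
        # Baseline: marginalise agent i's prediction, others held fixed.
        baseline = 0.0
        for c in range(probs.shape[1]):
            swapped = list(labels)
            swapped[i] = c
            baseline += probs[i, c].item() * graph_reward(swapped)
        advs.append(actual - baseline)
    return torch.tensor(advs)

# Toy usage: reward = fraction of agents matching a fixed target graph.
target = [1, 0, 2]
reward = lambda ls: sum(a == b for a, b in zip(ls, target)) / len(target)
probs = torch.softmax(torch.randn(3, 4), dim=1)
print(counterfactual_advantages([1, 3, 2], probs, reward))
```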

126 citations


Proceedings Article•DOI•
01 Nov 2019
TL;DR: This paper proposes a novel dense bottom-up framework: DEnse Bottom-Up Grounding (DEBUG), which regards all frames falling in the ground truth segment as foreground, and each foreground frame regresses the unique distances from its location to bi-directional ground truth boundaries.
Abstract: In this paper, we focus on natural language video localization: localizing (i.e., grounding) a natural language description in a long and untrimmed video sequence. All currently published models for addressing this problem can be categorized into two types: (i) the top-down approach, which does classification and regression for a set of pre-cut video segment candidates; (ii) the bottom-up approach, which directly predicts probabilities for each video frame being one of the temporal boundaries (i.e., the start and end time points). However, both approaches suffer from limitations: the former is computation-intensive due to densely placed candidates, while the latter has so far trailed the performance of its top-down counterpart. To this end, we propose a novel dense bottom-up framework: DEnse Bottom-Up Grounding (DEBUG). DEBUG regards all frames falling in the ground-truth segment as foreground, and each foreground frame regresses the unique distances from its location to the bi-directional ground-truth boundaries. Extensive experiments on three challenging benchmarks (TACoS, Charades-STA, and ActivityNet Captions) show that DEBUG is able to match the speed of bottom-up models while surpassing the performance of the state-of-the-art top-down models.
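A minimal sketch of the dense supervision this describes, assuming per-frame predictions: frames inside the ground-truth segment are foreground, and each regresses its distances to the two boundaries.

```python
# Sketch of DEBUG-style per-frame targets; shapes and API are illustrative.
import torch

def debug_targets(num_frames: int, s: int, e: int):
    t = torch.arange(num_frames, dtype=torch.float)
    foreground = (t >= s) & (t <= e)            # binary classification target
    d_start, d_end = t - s, e - t               # regression targets
    # Distances are only supervised on foreground frames.
    dists = torch.stack([d_start, d_end], dim=1)
    dists[~foreground] = 0.0
    return foreground.float(), dists

fg, dists = debug_targets(num_frames=10, s=3, e=7)
# At inference, a frame's predicted (d_start, d_end) recovers a segment:
# start = t - d_start, end = t + d_end; confident frames vote for boundaries.
print(fg)        # tensor([0., 0., 0., 1., 1., 1., 1., 1., 0., 0.])
print(dists[5])  # tensor([2., 2.])  frame 5 is 2 from start=3, 2 from end=7
```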

114 citations


Journal Article•DOI•
TL;DR: An integrated approach for enhancing design ideation by applying artificial intelligence and data mining techniques, consisting of two models, a semantic ideation network and a visual concept combination model, which provide inspiration semantically and visually based on computational creativity theory.

70 citations


Proceedings Article•DOI•
Xufeng Qian, Yueting Zhuang, Yimeng Li, Shaoning Xiao, Shiliang Pu, Jun Xiao
15 Oct 2019
TL;DR: By combining the model (VRD-GCN) and the proposed association method, the framework for video relation detection achieves the best performance on the latest benchmarks, and a series of ablation studies demonstrates the method's effectiveness.
Abstract: What we perceive from visual content are not only collections of objects but also the interactions between them. Visual relations, denoted by triplets of the form ⟨subject, predicate, object⟩, can convey a wealth of information for visual understanding. Different from static images, and because of the additional temporal channel, dynamic relations in videos are often correlated in both the spatial and temporal dimensions, which makes relation detection in videos a more complex and challenging task. In this paper, we abstract videos into fully-connected spatial-temporal graphs. We pass messages and conduct reasoning in these 3D graphs with a novel VidVRD model using a graph convolutional network. Our model can take advantage of spatial-temporal contextual cues to make better predictions on objects as well as their dynamic relationships. Furthermore, an online association method with a Siamese network is proposed for accurate relation instance association. By combining our model (VRD-GCN) and the proposed association method, our framework for video relation detection achieves the best performance on the latest benchmarks. We validate our approach on the benchmark ImageNet-VidVRD dataset. The experimental results show that our framework outperforms the state-of-the-art by a large margin, and a series of ablation studies demonstrates our method's effectiveness.
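As a rough illustration of the message passing, here is a generic graph-convolution step over a fully-connected spatio-temporal graph; the attention-style affinity and all sizes are assumptions, not the exact VRD-GCN layer.

```python
# One message-passing step on a fully-connected spatio-temporal graph:
# nodes are object proposals across frames, each aggregating weighted
# context from all others.
import torch
import torch.nn as nn

class STGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, nodes):                    # nodes: (N, dim)
        # Fully-connected adjacency, weighted by learned pairwise affinity.
        att = torch.softmax(self.q(nodes) @ self.k(nodes).T
                            / nodes.shape[1] ** 0.5, dim=-1)
        return torch.relu(nodes + att @ self.v(nodes))  # residual update

# 5 objects per frame over 4 frames -> 20 nodes in one 3D graph.
nodes = torch.randn(20, 256)
ctx = STGraphConv(256)(nodes)   # context-aware node features
# Relation prediction would then score pairs, e.g. a classifier over
# concatenated subject/object features: score(ctx[i], ctx[j]).
```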

69 citations


Proceedings Article•DOI•
Weike Jin, Zhou Zhao, Mao Gu, Jun Yu, Jun Xiao, Yueting Zhuang
15 Oct 2019
TL;DR: A new attention mechanism called multi-interaction is proposed, which can capture both element-wise and segment-wise sequence interactions simultaneously and achieves new state-of-the-art performance.
Abstract: Video question answering is an important task for testing a machine's ability to understand video. Existing methods normally focus on combining recurrent and convolutional neural networks to capture spatial and temporal information of the video. Recently, some work has also shown that attention mechanisms can achieve better performance. In this paper, we propose a new model called the Multi-interaction network for video question answering. There are two types of interactions in our model. The first type is the multi-modal interaction between the visual and textual information. The second type is the multi-level interaction inside the multi-modal interaction. Specifically, instead of using original self-attention, we propose a new attention mechanism called multi-interaction, which can capture both element-wise and segment-wise sequence interactions simultaneously. In addition to the normal frame-level interaction, we also take object relations into consideration in order to obtain more fine-grained information, such as motions and other potential relations among these objects. We evaluate our method on TGIF-QA and two other video QA datasets. The qualitative and quantitative experimental results show the effectiveness of our model, which achieves new state-of-the-art performance.
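A hedged sketch of the element-wise vs. segment-wise distinction: one attention runs over individual time steps and one over pooled local segments, and the two are fused. The window size and additive fusion are illustrative assumptions.

```python
# Sketch of capturing both granularities of sequence interaction.
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(q, k, v):                                  # scaled dot-product
    w = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

def multi_interaction(x, win=4):                      # x: (B, T, D)
    elem = attend(x, x, x)                            # element-wise self-attn
    # Segment-wise: average-pool length-`win` windows, let tokens attend to them.
    seg = F.avg_pool1d(x.transpose(1, 2), win, stride=win).transpose(1, 2)
    seg_ctx = attend(x, seg, seg)                     # tokens attend to segments
    return elem + seg_ctx                             # fuse both granularities

frames = torch.randn(2, 16, 64)                       # e.g. 16 video frames
out = multi_interaction(frames)                       # (2, 16, 64)
```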

52 citations


Journal Article•DOI•
Weike Jin, Zhou Zhao, Yimeng Li, Jie Li, Jun Xiao, Yueting Zhuang
03 Jul 2019
TL;DR: A knowledge-based progressive spatial-temporal attention network is proposed to tackle the problem of video question answering by taking the spatial and temporal dimension of video content into account and employing an external knowledge base to improve the answering ability of the network.
Abstract: Visual Question Answering (VQA) is a challenging task that has gained increasing attention from both the computer vision and the natural language processing communities in recent years. Given a question in natural language, a VQA system is designed to automatically generate the answer according to the referenced visual content. Though there has recently been much interest in this topic, the existing work on visual question answering mainly focuses on a single static image, which is only a small part of the dynamic and sequential visual data in the real world. As a natural extension, video question answering (VideoQA) is less explored. Because of the inherent temporal structure in video, approaches designed for ImageQA may not be effective when applied to video question answering. In this article, we not only take the spatial and temporal dimensions of video content into account but also employ an external knowledge base to improve the answering ability of the network. More specifically, we propose a knowledge-based progressive spatial-temporal attention network to tackle this problem. We obtain both object and region features of the video frames from a region proposal network. The knowledge representation is generated by a word-level attention mechanism using the comment information of each object extracted from DBpedia. Then, we develop a question-knowledge-guided progressive spatial-temporal attention network to learn the joint video representation for the video question answering task. We construct a large-scale video question answering dataset. The extensive experiments based on two different datasets validate the effectiveness of our method.
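The word-level attention that forms the knowledge representation might look like the following sketch: the question vector scores each embedded word of an object's DBpedia comment, and the weighted sum becomes that object's knowledge feature. The bilinear scorer and all sizes are assumptions.

```python
# Sketch of word-level attention over an object's DBpedia comment.
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Bilinear(dim, dim, 1)

    def forward(self, question, comment_words):   # (D,), (L, D)
        L = comment_words.shape[0]
        s = self.score(question.expand(L, -1), comment_words).squeeze(-1)
        alpha = torch.softmax(s, dim=0)            # word-level weights
        return alpha @ comment_words               # (D,) knowledge vector

q = torch.randn(128)                               # encoded question
comment = torch.randn(12, 128)                     # 12 embedded comment words
knowledge = WordAttention(128)(q, comment)
```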

12 citations


Journal Article•DOI•
TL;DR: This work utilizes a universal spatial model orthogonal to RNN model enhancement and proposes two simple geometric features, inspired by previous work, which outperform other methods and achieve state-of-the-art results on two datasets.
Abstract: Currently, RNN-based methods achieve excellent performance on action recognition using skeletons. But the inputs of these approaches are limited to the coordinates of joints, and they improve performance mainly by extending RNN models in different ways and exploring relations of body parts directly from joint coordinates. Our method utilizes a universal spatial model that is orthogonal to RNN model enhancement. Specifically, we propose two simple geometric features, inspired by previous work. With experiments on a 3-layer LSTM (Long Short-Term Memory) framework, we find that the geometric relational features based on vectors and normal vectors outperform other methods and achieve state-of-the-art results on two datasets. Moreover, we show that utilizing our features as input requires less data for training.
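The two feature types the abstract names can be sketched directly: vectors between joints, and normal vectors of planes spanned by joint triples. The specific joints chosen here are illustrative.

```python
# Sketch of the two geometric feature types: joint vectors and plane normals.
import numpy as np

def bone_vector(joints, a, b):
    """Vector feature: directed edge from joint a to joint b, (3,)."""
    return joints[b] - joints[a]

def plane_normal(joints, a, b, c):
    """Normal-vector feature: unit normal of the plane through three joints."""
    n = np.cross(joints[b] - joints[a], joints[c] - joints[a])
    return n / (np.linalg.norm(n) + 1e-8)

# A toy 25-joint skeleton frame (e.g. NTU RGB+D layout), random coordinates.
joints = np.random.randn(25, 3)
feat = np.concatenate([
    bone_vector(joints, 0, 1),       # spine segment
    plane_normal(joints, 0, 1, 2),   # torso plane normal
])
# Stacking such features per frame yields the LSTM input sequence.
```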

11 citations


Proceedings Article•DOI•
Weike Jin, Zhou Zhao, Mao Gu, Jun Yu, Jun Xiao, Yueting Zhuang
18 Jul 2019
TL;DR: A novel approach for video dialog called the multi-grained convolutional self-attention context network, which combines video information with dialog history; it achieves higher time efficiency, and extensive experiments also show the effectiveness of the method.
Abstract: Video dialog is a new and challenging task, which requires an AI agent to maintain a meaningful dialog with humans in natural language about video contents. Specifically, given a video, a dialog history, and a new question about the video, the agent has to combine video information with the dialog history to infer the answer. Due to the complexity of video information, methods for image dialog might not be effective when applied directly to video dialog. In this paper, we propose a novel approach for video dialog called the multi-grained convolutional self-attention context network, which combines video information with dialog history. Instead of using an RNN to encode the sequence information, we design a multi-grained convolutional self-attention mechanism to capture both element-level and segment-level interactions, which contain multi-grained sequence information. Then, we design a hierarchical dialog history encoder to learn the context-aware question representation and a two-stream video encoder to learn the context-aware video representation. We evaluate our method on two large-scale datasets. Due to the flexibility and parallelism of the new attention mechanism, our method can achieve higher time efficiency, and extensive experiments also show the effectiveness of our method.
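A sketch of the convolutional self-attention idea, assuming 1-D convolutions of different kernel sizes produce element-level and segment-level keys and values; the kernel sizes and additive fusion are illustrative choices.

```python
# Sketch: replacing the RNN encoder with convolutional self-attention.
import torch
import torch.nn as nn

class ConvSelfAttention(nn.Module):
    def __init__(self, dim, kernel):
        super().__init__()
        self.kv = nn.Conv1d(dim, 2 * dim, kernel, padding=kernel // 2)

    def forward(self, x):                          # x: (B, T, D)
        k, v = self.kv(x.transpose(1, 2)).transpose(1, 2).chunk(2, dim=-1)
        w = torch.softmax(x @ k.transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)
        return w @ v

x = torch.randn(2, 20, 64)                         # dialog/video sequence
fine = ConvSelfAttention(64, kernel=1)(x)          # element-grained (k=1)
coarse = ConvSelfAttention(64, kernel=5)(x)        # segment-grained (k=5)
out = fine + coarse                                # multi-grained fusion
# Unlike an RNN, every position is computed in parallel, which is where
# the higher time efficiency mentioned in the abstract comes from.
```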

9 citations


Proceedings Article•DOI•
01 Nov 2019
TL;DR: This paper introduces a novel progressive inference mechanism for video dialog, which progressively updates query information based on dialog history and video content until the agent thinks the information is sufficient and unambiguous.
Abstract: Video dialog is a new and challenging task, which requires the agent to answer questions by combining video information with dialog history. Different from single-turn video question answering, the additional dialog history is important for video dialog, since it often includes contextual information for the question. Existing visual dialog methods mainly use an RNN to encode the dialog history as a single vector representation, which can be too rough and simplistic. Some more advanced methods utilize hierarchical structure, attention, and memory mechanisms, but still lack an explicit reasoning process. In this paper, we introduce a novel progressive inference mechanism for video dialog, which progressively updates query information based on dialog history and video content until the agent thinks the information is sufficient and unambiguous. In order to tackle the multi-modal fusion problem, we propose a cross-transformer module, which can learn more fine-grained and comprehensive interactions both inside and between the modalities. Besides answer generation, we also consider question generation, which is more challenging but significant for a complete video dialog system. We evaluate our method on two large-scale datasets, and the extensive experiments show the effectiveness of our method.
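The progressive update loop might be sketched as follows: the query repeatedly absorbs context attended from the dialog history and the video, and a learned sufficiency gate decides when to stop. The gate, step limit, and fusion are assumptions, not the authors' exact design.

```python
# Sketch of a progressive inference loop with a stopping gate.
import torch
import torch.nn as nn

class ProgressiveInference(nn.Module):
    def __init__(self, dim, max_steps=4):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)
        self.gate = nn.Linear(dim, 1)          # "is the query unambiguous?"
        self.max_steps = max_steps

    def attend(self, q, ctx):                  # q: (D,), ctx: (N, D)
        w = torch.softmax(ctx @ q / q.shape[0] ** 0.5, dim=0)
        return w @ ctx

    def forward(self, query, history, video):
        for _ in range(self.max_steps):
            h = self.attend(query, history)    # context from dialog history
            v = self.attend(query, video)      # context from video
            query = torch.tanh(self.fuse(torch.cat([query, h, v])))
            if torch.sigmoid(self.gate(query)) > 0.5:  # sufficient: stop early
                break
        return query

q = ProgressiveInference(128)(torch.randn(128),
                              torch.randn(6, 128),   # 6 history turns
                              torch.randn(30, 128))  # 30 video segments
```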

6 citations


Proceedings Article•DOI•
Yutong Wang, Jiyuan Zheng, Qijiong Liu, Zhou Zhao, Jun Xiao, Yueting Zhuang
01 Jul 2019
TL;DR: The Weak Supervision Enhanced Generative Network (WeGen) is proposed which automatically discovers relevant features of the passage given the answer span in a weakly supervised manner to improve the quality of generated questions.
Abstract: Automatic question generation according to an answer within a given passage is useful for many applications, such as question answering systems and dialogue systems. Current neural methods mostly take two steps: they extract several important sentences based on the candidate answer, through manual rules or supervised neural networks, and then use an encoder-decoder framework to generate questions about these sentences. These approaches neglect the semantic relations between the answer and the context of the whole passage, which are sometimes necessary for answering the question. To address this problem, we propose the Weak Supervision Enhanced Generative Network (WeGen), which automatically discovers relevant features of the passage given the answer span in a weakly supervised manner to improve the quality of generated questions. More specifically, we devise a discriminator, the Relation Guider, to capture the relations between the whole passage and the associated answer; the Multi-Interaction mechanism is then deployed to transfer the knowledge dynamically for our question generation system. Experiments show the effectiveness of our method in both automatic and human evaluations.
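A minimal sketch of the weak-supervision idea: a discriminator scores how related each passage sentence is to the answer span, and the soft weights condition generation instead of hard sentence extraction. The bilinear scorer and all sizes are illustrative assumptions.

```python
# Sketch: a discriminator ("Relation Guider"-style) produces soft sentence
# relevance given the answer, replacing hard sentence extraction.
import torch
import torch.nn as nn

dim = 128
relation_guider = nn.Bilinear(dim, dim, 1)       # sentence-answer affinity

def weak_supervision_weights(sent_vecs, answer_vec):
    """Soft relevance over sentences given the answer, (S,)."""
    S = sent_vecs.shape[0]
    scores = relation_guider(sent_vecs, answer_vec.expand(S, -1)).squeeze(-1)
    return torch.softmax(scores, dim=0)

sents = torch.randn(8, dim)                      # 8 encoded passage sentences
answer = torch.randn(dim)                        # encoded answer span
w = weak_supervision_weights(sents, answer)
context = w @ sents                              # weighted passage context
# `context` then conditions the encoder-decoder that emits the question.
```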

Journal Article•DOI•
Yimeng Li, Jun Xiao, Di Xie, Jian Shao, Jinlong Wang
TL;DR: An adversarial learning framework is proposed, which can learn an invariant human pose latent representation from 3D-annotated datasets to optimize estimation on monocular images with only 2D annotations, and adds a viewpoint-invariant module to automatically regulate observation viewpoints for the generated 3D pose.

Posted Content•
Chao Wu, Jun Xiao, Gang Huang, Fei Wu
TL;DR: A novel model training paradigm based on blockchain, named Galaxy Learning, is proposed, which aims to train a model with distributed data while preserving data ownership for the data owners.
Abstract: The recent rapid development of artificial intelligence (AI), mainly driven by machine learning research, especially deep learning, has achieved phenomenal success in various applications. However, to further apply AI technologies in real-world contexts, several significant issues regarding the AI ecosystem must be addressed. We identify the main issues as data privacy, ownership, and exchange, which are difficult to solve with the current centralized paradigm of machine learning training. As a result, we propose a novel model training paradigm based on blockchain, named Galaxy Learning, which aims to train a model with distributed data while preserving data ownership for the data owners. In this new paradigm, encrypted models are moved around instead of data, and are federated once trained. Model training, as well as communication, is achieved with blockchain and its smart contracts. Training data is priced according to its contribution, rather than through an exchange of data ownership. In this position paper, we describe the motivation, paradigm, design, and challenges as well as opportunities of Galaxy Learning.
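A highly simplified sketch of one round under this paradigm: the model travels to each data owner, trains locally on data that never moves, and each owner's contribution is recorded for pricing. Encryption and the blockchain/smart-contract layer are elided here, and the sequential visiting order is an assumption.

```python
# Toy sketch of a "model moves, data stays" round with contribution tracking.
import copy
import torch
import torch.nn as nn

def galaxy_round(model, owners):
    """One round: model visits each owner; a ledger records contributions."""
    ledger = {}                                  # stand-in for on-chain record
    for name, (x, y) in owners.items():
        local = copy.deepcopy(model)             # model moves, data stays put
        opt = torch.optim.SGD(local.parameters(), lr=0.1)
        before = nn.functional.mse_loss(local(x), y).item()
        for _ in range(5):
            opt.zero_grad()
            nn.functional.mse_loss(local(x), y).backward()
            opt.step()
        after = nn.functional.mse_loss(local(x), y).item()
        ledger[name] = before - after            # contribution ~ improvement
        model = local                            # pass model to next owner
    return model, ledger

owners = {f"owner{i}": (torch.randn(32, 4), torch.randn(32, 1)) for i in range(3)}
model, ledger = galaxy_round(nn.Linear(4, 1), owners)
print(ledger)                                    # input to contribution pricing
```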

Proceedings Article•DOI•
Lifeng Liu, Yifan Hu, Jiawei Yu, Fengda Zhang, Gang Huang, Jun Xiao, Chao Wu
26 Aug 2019
TL;DR: Based on FL, a novel and decentralized approach to training encrypted models with privacy-preserved data on blockchain is presented, and experiments demonstrate that the approach is practical in real-world applications.
Abstract: Currently, training neural networks often requires a large corpus of data from multiple parties. However, in many cases data owners are reluctant to share their sensitive data with third parties for modelling. Therefore, Federated Learning (FL) has arisen as an alternative that enables collaborative training of models without sharing raw data, by distributing modelling tasks to multiple data owners. Based on FL, we present a novel and decentralized approach to training encrypted models with privacy-preserved data on blockchain. In our approach, the blockchain is adopted as the machine learning environment where different actors (i.e., model providers and data providers) collaborate on the training task. During the training process, an encryption algorithm is used to protect the privacy of the data and the trained model. Our experiments demonstrate that our approach is practical in real-world applications.
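Since the approach builds on FL, a plain federated-averaging round gives the flavor: each data provider trains locally, and only model weights are aggregated. In the paper these weights are encrypted; encryption is elided in this sketch, and all shapes are illustrative.

```python
# Minimal federated-averaging sketch: local updates, then parameter averaging.
import copy
import torch
import torch.nn as nn

def local_update(model, x, y, steps=5, lr=0.1):
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(local(x), y).backward()
        opt.step()
    return local.state_dict()                   # only weights leave the owner

def fed_avg(states):
    """Average each parameter over all providers' trained weights."""
    return {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}

global_model = nn.Linear(4, 1)
providers = [(torch.randn(32, 4), torch.randn(32, 1)) for _ in range(3)]
for _ in range(10):                             # training rounds
    states = [local_update(global_model, x, y) for x, y in providers]
    global_model.load_state_dict(fed_avg(states))
```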

Journal Article•DOI•
TL;DR: A new paradigm is proposed that does not require feature selection, so that data can speak for itself without manually picked features, and the deep network is used as a methodology to explore previously unknown relationships and capture the complexity and non-linearity between target variables and a large number of input features in big social data.
Abstract: Exploratory analysis is an important way to gain understanding and find unknown relationships from various data sources, especially in the era of big data. Traditional paradigms of social science data analysis follow the steps of feature selection, modeling, and prediction. In this paper, we propose a new paradigm that does not require feature selection, so that data can speak for itself without manually picked features. Besides, we propose using the deep network as a methodology to explore previously unknown relationships and to capture the complexity and non-linearity between target variables and a large number of input features in big social data. The new paradigm is intended as a relatively generic approach that can be widely used in different scenarios. In order to validate the feasibility of the paradigm, we use country-level indicator forecasting as a case study. The process includes: 1) data collection and preparation, and 2) modeling and experiments. The data collection and preparation part builds a data warehouse and conducts the extract-transform-load process to eliminate data format inconsistencies. The modeling and experiment part includes model setup and changes to the model structure to achieve relatively high prediction accuracy at both the model level and the case level. We find some patterns concerning network capacity modification and the influence of time-interval differences on the test results, both of which deserve further research.
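The "no feature selection" paradigm amounts to feeding every available indicator into a deep network and letting it find the non-linear relationships; a minimal sketch, with illustrative layer sizes, follows.

```python
# Sketch: all indicators go in raw; the network does the feature discovery.
import torch
import torch.nn as nn

n_features = 500                        # every indicator, none hand-picked
model = nn.Sequential(
    nn.Linear(n_features, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1),                   # target country-level indicator
)
x = torch.randn(128, n_features)        # 128 country-year rows after ETL
y = torch.randn(128, 1)
loss = nn.functional.mse_loss(model(x), y)
```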

Posted Content•
Yutong Wang, Jiyuan Zheng, Qijiong Liu, Zhou Zhao, Jun Xiao, Yueting Zhuang
TL;DR: This article proposes the Weak Supervision Enhanced Generative Network (WeGen), which automatically discovers relevant features of the passage given the answer span in a weakly supervised manner to improve the quality of generated questions.
Abstract: Automatic question generation according to an answer within a given passage is useful for many applications, such as question answering systems and dialogue systems. Current neural methods mostly take two steps: they extract several important sentences based on the candidate answer, through manual rules or supervised neural networks, and then use an encoder-decoder framework to generate questions about these sentences. These approaches neglect the semantic relations between the answer and the context of the whole passage, which are sometimes necessary for answering the question. To address this problem, we propose the Weak Supervision Enhanced Generative Network (WeGen), which automatically discovers relevant features of the passage given the answer span in a weakly supervised manner to improve the quality of generated questions. More specifically, we devise a discriminator, the Relation Guider, to capture the relations between the whole passage and the associated answer; the Multi-Interaction mechanism is then deployed to transfer the knowledge dynamically for our question generation system. Experiments show the effectiveness of our method in both automatic and human evaluations.