
Showing papers by "Jun Xiao published in 2019"


Proceedings Article•DOI•
Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, Yueting Zhuang
15 Jun 2019
TL;DR: A self-supervised spatiotemporal learning technique which leverages the chronological order of videos to learn the spatiotemporal representation of the video by predicting the order of shuffled clips from the video.
Abstract: We propose a self-supervised spatiotemporal learning technique which leverages the chronological order of videos. Our method learns the spatiotemporal representation of a video by predicting the order of shuffled clips from that video. The category of the video is not required, which gives our technique the potential to take advantage of infinite unannotated videos. Related works use frames instead; compared to frames, clips are more consistent with the video dynamics, help to reduce the uncertainty of orders, and are more appropriate for learning a video representation. 3D convolutional neural networks are used to extract features for the clips, and these features are processed to predict the actual order. The learned representations are evaluated via nearest-neighbor retrieval experiments. We also use the learned networks as pre-trained models and fine-tune them on the action recognition task. Three types of 3D convolutional neural networks are tested in experiments, and we obtain large improvements over existing self-supervised methods.
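To make the pretext task concrete, here is a minimal sketch of clip-order prediction, assuming a generic 3D-CNN backbone; `Simple3DCNN`, the clip count, and all sizes are illustrative stand-ins, not the authors' implementation.

```python
# A minimal sketch of the clip-order pretext task described above.
import itertools
import random
import torch
import torch.nn as nn

N_CLIPS = 3
PERMS = list(itertools.permutations(range(N_CLIPS)))  # 3! = 6 order classes

class Simple3DCNN(nn.Module):
    """Stand-in 3D backbone: clips (B, C, T, H, W) -> features (B, D)."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Conv3d(3, dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)

    def forward(self, x):
        return self.pool(self.conv(x)).flatten(1)

class ClipOrderNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = Simple3DCNN(dim)
        self.head = nn.Linear(N_CLIPS * dim, len(PERMS))

    def forward(self, clips):          # clips: (B, N_CLIPS, C, T, H, W)
        feats = [self.backbone(clips[:, i]) for i in range(N_CLIPS)]
        return self.head(torch.cat(feats, dim=1))  # permutation logits

# One self-supervised step: shuffle clips, predict which permutation was used.
clips = torch.randn(2, N_CLIPS, 3, 8, 32, 32)      # fake video clips
label = random.randrange(len(PERMS))
shuffled = clips[:, list(PERMS[label])]
loss = nn.CrossEntropyLoss()(ClipOrderNet()(shuffled),
                             torch.full((2,), label, dtype=torch.long))
```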

406 citations


Proceedings Article•DOI•
01 Oct 2019
TL;DR: CMAT is a multi-agent policy gradient method that frames objects into cooperative agents, and then directly maximizes a graph-level metric as the reward, and uses a counterfactual baseline that disentangles the agent-specific reward by fixing the predictions of other agents.
Abstract: Scene graphs --- objects as nodes and visual relationships as edges --- describe the whereabouts and interactions of objects in an image for comprehensive scene understanding. To generate coherent scene graphs, almost all existing methods exploit the fruitful visual context by modeling message passing among objects. For example, ``person'' on ``bike'' can help to determine the relationship ``ride'', which in turn contributes to the confidence of the two objects. However, we argue that the visual context is not properly learned by using the prevailing cross-entropy based supervised learning paradigm, which is not sensitive to graph inconsistency: errors at the hub or non-hub nodes should not be penalized equally. To this end, we propose a Counterfactual critic Multi-Agent Training (CMAT) approach. CMAT is a multi-agent policy gradient method that frames objects into cooperative agents, and then directly maximizes a graph-level metric as the reward. In particular, to assign the reward properly to each agent, CMAT uses a counterfactual baseline that disentangles the agent-specific reward by fixing the predictions of other agents. Extensive validations on the challenging Visual Genome benchmark show that CMAT achieves a state-of-the-art performance by significant gains under various settings and metrics.
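The counterfactual baseline can be illustrated with a short sketch: an agent's credit is the graph-level reward minus the expected reward when only that agent's prediction is resampled, with every other agent's prediction held fixed. The reward function below is a toy placeholder, not the Visual Genome metric.

```python
# A hedged sketch of the counterfactual baseline idea from CMAT.
from typing import Callable, List
import torch

def counterfactual_advantages(
    labels: List[int],                 # each agent's (object's) predicted class
    probs: torch.Tensor,               # (n_agents, n_classes) class distributions
    graph_reward: Callable[[List[int]], float],
) -> torch.Tensor:
    actual = graph_reward(labels)
    advs = []
    for i in range(len(labels)):
        # Baseline: marginalise agent i's prediction, others held fixed.
        baseline = 0.0
        for c in range(probs.shape[1]):
            swapped = list(labels)
            swapped[i] = c
            baseline += probs[i, c].item() * graph_reward(swapped)
        advs.append(actual - baseline)
    return torch.tensor(advs)

# Toy usage: reward = fraction of agents matching a fixed target graph.
target = [1, 0, 2]
reward = lambda ls: sum(a == b for a, b in zip(ls, target)) / len(target)
probs = torch.softmax(torch.randn(3, 4), dim=1)
print(counterfactual_advantages([1, 3, 2], probs, reward))
```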

126 citations


Proceedings Article•DOI•
01 Nov 2019
TL;DR: This paper proposes a novel dense bottom-up framework: DEnse Bottom-Up Grounding (DEBUG), which regards all frames falling in the ground truth segment as foreground, and each foreground frame regresses the unique distances from its location to bi-directional ground truth boundaries.
Abstract: In this paper, we focus on natural language video localization: localizing (i.e., grounding) a natural language description in a long and untrimmed video sequence. All currently published models for addressing this problem can be categorized into two types: (i) the top-down approach, which does classification and regression for a set of pre-cut video segment candidates; (ii) the bottom-up approach, which directly predicts probabilities for each video frame being one of the temporal boundaries (i.e., the start and end time points). However, both approaches suffer from limitations: the former is computation-intensive due to densely placed candidates, while the latter has so far trailed the performance of its top-down counterpart. To this end, we propose a novel dense bottom-up framework: DEnse Bottom-Up Grounding (DEBUG). DEBUG regards all frames falling in the ground-truth segment as foreground, and each foreground frame regresses the unique distances from its location to the bi-directional ground-truth boundaries. Extensive experiments on three challenging benchmarks (TACoS, Charades-STA, and ActivityNet Captions) show that DEBUG is able to match the speed of bottom-up models while surpassing the performance of the state-of-the-art top-down models.
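A minimal sketch of the dense supervision this describes, assuming per-frame predictions: frames inside the ground-truth segment are foreground, and each regresses its distances to the two boundaries.

```python
# Sketch of DEBUG-style per-frame targets; shapes and API are illustrative.
import torch

def debug_targets(num_frames: int, s: int, e: int):
    t = torch.arange(num_frames, dtype=torch.float)
    foreground = (t >= s) & (t <= e)            # binary classification target
    d_start, d_end = t - s, e - t               # regression targets
    # Distances are only supervised on foreground frames.
    dists = torch.stack([d_start, d_end], dim=1)
    dists[~foreground] = 0.0
    return foreground.float(), dists

fg, dists = debug_targets(num_frames=10, s=3, e=7)
# At inference, a frame's predicted (d_start, d_end) recovers a segment:
# start = t - d_start, end = t + d_end; confident frames vote for boundaries.
print(fg)        # tensor([0., 0., 0., 1., 1., 1., 1., 1., 0., 0.])
print(dists[5])  # tensor([2., 2.])  frame 5 is 2 from start=3, 2 from end=7
```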

114 citations


Journal Article•DOI•
TL;DR: An integrated approach for enhancing design ideation by applying artificial intelligence and data mining techniques, consisting of two models, a semantic ideation network and a visual concept combination model, which provide inspiration semantically and visually based on computational creativity theory.

70 citations


Proceedings Article•DOI•
Xufeng Qian, Yueting Zhuang, Yimeng Li, Shaoning Xiao, Shiliang Pu, Jun Xiao
15 Oct 2019
TL;DR: By combining the model (VRD-GCN) and the proposed association method, the framework for video relation detection achieves the best performance on the latest benchmarks, and a series of ablation studies demonstrates the method's effectiveness.
Abstract: What we perceive from visual content are not only collections of objects but also the interactions between them. Visual relations, denoted by triplets of the form ⟨subject, predicate, object⟩, can convey a wealth of information for visual understanding. Different from static images, and because of the additional temporal channel, dynamic relations in videos are often correlated in both the spatial and temporal dimensions, which makes relation detection in videos a more complex and challenging task. In this paper, we abstract videos into fully-connected spatial-temporal graphs. We pass messages and conduct reasoning in these 3D graphs with a novel VidVRD model using a graph convolutional network. Our model can take advantage of spatial-temporal contextual cues to make better predictions on objects as well as their dynamic relationships. Furthermore, an online association method with a Siamese network is proposed for accurate relation instance association. By combining our model (VRD-GCN) and the proposed association method, our framework for video relation detection achieves the best performance on the latest benchmarks. We validate our approach on the benchmark ImageNet-VidVRD dataset. The experimental results show that our framework outperforms the state-of-the-art by a large margin, and a series of ablation studies demonstrates our method's effectiveness.
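As a rough illustration of the message passing, here is a generic graph-convolution step over a fully-connected spatio-temporal graph; the attention-style affinity and all sizes are assumptions, not the exact VRD-GCN layer.

```python
# One message-passing step on a fully-connected spatio-temporal graph:
# nodes are object proposals across frames, each aggregating weighted
# context from all others.
import torch
import torch.nn as nn

class STGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, nodes):                    # nodes: (N, dim)
        # Fully-connected adjacency, weighted by learned pairwise affinity.
        att = torch.softmax(self.q(nodes) @ self.k(nodes).T
                            / nodes.shape[1] ** 0.5, dim=-1)
        return torch.relu(nodes + att @ self.v(nodes))  # residual update

# 5 objects per frame over 4 frames -> 20 nodes in one 3D graph.
nodes = torch.randn(20, 256)
ctx = STGraphConv(256)(nodes)   # context-aware node features
# Relation prediction would then score pairs, e.g. a classifier over
# concatenated subject/object features: score(ctx[i], ctx[j]).
```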

69 citations


Proceedings Article•DOI•
Weike Jin, Zhou Zhao, Mao Gu, Jun Yu, Jun Xiao, Yueting Zhuang
15 Oct 2019
TL;DR: A new attention mechanism called multi-interaction is proposed, which can capture both element-wise and segment-wise sequence interactions simultaneously and achieves new state-of-the-art performance.
Abstract: Video question answering is an important task for testing a machine's ability to understand video. Existing methods normally focus on combining recurrent and convolutional neural networks to capture spatial and temporal information of the video. Recently, some work has also shown that attention mechanisms can achieve better performance. In this paper, we propose a new model called the Multi-interaction network for video question answering. There are two types of interactions in our model. The first type is the multi-modal interaction between the visual and textual information. The second type is the multi-level interaction inside the multi-modal interaction. Specifically, instead of using original self-attention, we propose a new attention mechanism called multi-interaction, which can capture both element-wise and segment-wise sequence interactions simultaneously. In addition to the normal frame-level interaction, we also take object relations into consideration in order to obtain more fine-grained information, such as motions and other potential relations among these objects. We evaluate our method on TGIF-QA and two other video QA datasets. The qualitative and quantitative experimental results show the effectiveness of our model, which achieves new state-of-the-art performance.
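A hedged sketch of the element-wise vs. segment-wise distinction: one attention runs over individual time steps and one over pooled local segments, and the two are fused. The window size and additive fusion are illustrative assumptions.

```python
# Sketch of capturing both granularities of sequence interaction.
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(q, k, v):                                  # scaled dot-product
    w = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

def multi_interaction(x, win=4):                      # x: (B, T, D)
    elem = attend(x, x, x)                            # element-wise self-attn
    # Segment-wise: average-pool length-`win` windows, let tokens attend to them.
    seg = F.avg_pool1d(x.transpose(1, 2), win, stride=win).transpose(1, 2)
    seg_ctx = attend(x, seg, seg)                     # tokens attend to segments
    return elem + seg_ctx                             # fuse both granularities

frames = torch.randn(2, 16, 64)                       # e.g. 16 video frames
out = multi_interaction(frames)                       # (2, 16, 64)
```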

52 citations


Journal Article•DOI•
Weike Jin, Zhou Zhao, Yimeng Li, Jie Li, Jun Xiao, Yueting Zhuang
03 Jul 2019
TL;DR: A knowledge-based progressive spatial-temporal attention network is proposed to tackle the problem of video question answering by taking the spatial and temporal dimension of video content into account and employing an external knowledge base to improve the answering ability of the network.
Abstract: Visual Question Answering (VQA) is a challenging task that has gained increasing attention from both the computer vision and the natural language processing communities in recent years. Given a question in natural language, a VQA system is designed to automatically generate the answer according to the referenced visual content. Though there has recently been much interest in this topic, the existing work on visual question answering mainly focuses on a single static image, which is only a small part of the dynamic and sequential visual data in the real world. As a natural extension, video question answering (VideoQA) is less explored. Because of the inherent temporal structure in video, approaches designed for ImageQA may not be effective when applied to video question answering. In this article, we not only take the spatial and temporal dimensions of video content into account but also employ an external knowledge base to improve the answering ability of the network. More specifically, we propose a knowledge-based progressive spatial-temporal attention network to tackle this problem. We obtain both object and region features of the video frames from a region proposal network. The knowledge representation is generated by a word-level attention mechanism using the comment information of each object extracted from DBpedia. Then, we develop a question-knowledge-guided progressive spatial-temporal attention network to learn the joint video representation for the video question answering task. We construct a large-scale video question answering dataset. The extensive experiments based on two different datasets validate the effectiveness of our method.
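The word-level attention that forms the knowledge representation might look like the following sketch: the question vector scores each embedded word of an object's DBpedia comment, and the weighted sum becomes that object's knowledge feature. The bilinear scorer and all sizes are assumptions.

```python
# Sketch of word-level attention over an object's DBpedia comment.
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Bilinear(dim, dim, 1)

    def forward(self, question, comment_words):   # (D,), (L, D)
        L = comment_words.shape[0]
        s = self.score(question.expand(L, -1), comment_words).squeeze(-1)
        alpha = torch.softmax(s, dim=0)            # word-level weights
        return alpha @ comment_words               # (D,) knowledge vector

q = torch.randn(128)                               # encoded question
comment = torch.randn(12, 128)                     # 12 embedded comment words
knowledge = WordAttention(128)(q, comment)
```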

12 citations


Journal Article•DOI•
TL;DR: This work utilizes a universal spatial model orthogonal to RNN model enhancement and proposes two simple geometric features, inspired by previous work, which outperform other methods and achieve state-of-the-art results on two datasets.
Abstract: Currently, RNN-based methods achieve excellent performance on action recognition using skeletons. But the inputs of these approaches are limited to the coordinates of joints, and they improve performance mainly by extending RNN models in different ways and exploring relations of body parts directly from joint coordinates. Our method utilizes a universal spatial model that is orthogonal to RNN model enhancement. Specifically, we propose two simple geometric features, inspired by previous work. With experiments on a 3-layer LSTM (Long Short-Term Memory) framework, we find that the geometric relational features based on vectors and normal vectors outperform other methods and achieve state-of-the-art results on two datasets. Moreover, we show that utilizing our features as input requires less data for training.
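The two feature types the abstract names can be sketched directly: vectors between joints, and normal vectors of planes spanned by joint triples. The specific joints chosen here are illustrative.

```python
# Sketch of the two geometric feature types: joint vectors and plane normals.
import numpy as np

def bone_vector(joints, a, b):
    """Vector feature: directed edge from joint a to joint b, (3,)."""
    return joints[b] - joints[a]

def plane_normal(joints, a, b, c):
    """Normal-vector feature: unit normal of the plane through three joints."""
    n = np.cross(joints[b] - joints[a], joints[c] - joints[a])
    return n / (np.linalg.norm(n) + 1e-8)

# A toy 25-joint skeleton frame (e.g. NTU RGB+D layout), random coordinates.
joints = np.random.randn(25, 3)
feat = np.concatenate([
    bone_vector(joints, 0, 1),       # spine segment
    plane_normal(joints, 0, 1, 2),   # torso plane normal
])
# Stacking such features per frame yields the LSTM input sequence.
```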

11 citations


Proceedings Article•DOI•
Weike Jin, Zhou Zhao, Mao Gu, Jun Yu, Jun Xiao, Yueting Zhuang
18 Jul 2019
TL;DR: A novel approach for video dialog called the multi-grained convolutional self-attention context network, which combines video information with dialog history; it achieves higher time efficiency, and extensive experiments also show the effectiveness of the method.
Abstract: Video dialog is a new and challenging task, which requires an AI agent to maintain a meaningful dialog with humans in natural language about video contents. Specifically, given a video, a dialog history, and a new question about the video, the agent has to combine video information with the dialog history to infer the answer. Due to the complexity of video information, methods for image dialog might not be effective when applied directly to video dialog. In this paper, we propose a novel approach for video dialog called the multi-grained convolutional self-attention context network, which combines video information with dialog history. Instead of using an RNN to encode the sequence information, we design a multi-grained convolutional self-attention mechanism to capture both element-level and segment-level interactions, which contain multi-grained sequence information. Then, we design a hierarchical dialog history encoder to learn the context-aware question representation and a two-stream video encoder to learn the context-aware video representation. We evaluate our method on two large-scale datasets. Due to the flexibility and parallelism of the new attention mechanism, our method can achieve higher time efficiency, and extensive experiments also show the effectiveness of our method.
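A sketch of the convolutional self-attention idea, assuming 1-D convolutions of different kernel sizes produce element-level and segment-level keys and values; the kernel sizes and additive fusion are illustrative choices.

```python
# Sketch: replacing the RNN encoder with convolutional self-attention.
import torch
import torch.nn as nn

class ConvSelfAttention(nn.Module):
    def __init__(self, dim, kernel):
        super().__init__()
        self.kv = nn.Conv1d(dim, 2 * dim, kernel, padding=kernel // 2)

    def forward(self, x):                          # x: (B, T, D)
        k, v = self.kv(x.transpose(1, 2)).transpose(1, 2).chunk(2, dim=-1)
        w = torch.softmax(x @ k.transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)
        return w @ v

x = torch.randn(2, 20, 64)                         # dialog/video sequence
fine = ConvSelfAttention(64, kernel=1)(x)          # element-grained (k=1)
coarse = ConvSelfAttention(64, kernel=5)(x)        # segment-grained (k=5)
out = fine + coarse                                # multi-grained fusion
# Unlike an RNN, every position is computed in parallel, which is where
# the higher time efficiency mentioned in the abstract comes from.
```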

9 citations


Proceedings Article•DOI•
01 Nov 2019
TL;DR: This paper introduces a novel progressive inference mechanism for video dialog, which progressively updates query information based on dialog history and video content until the agent thinks the information is sufficient and unambiguous.
Abstract: Video dialog is a new and challenging task, which requires the agent to answer questions by combining video information with dialog history. Different from single-turn video question answering, the additional dialog history is important for video dialog, since it often includes contextual information for the question. Existing visual dialog methods mainly use an RNN to encode the dialog history as a single vector representation, which can be too rough and simplistic. Some more advanced methods utilize hierarchical structure, attention, and memory mechanisms, but still lack an explicit reasoning process. In this paper, we introduce a novel progressive inference mechanism for video dialog, which progressively updates query information based on dialog history and video content until the agent thinks the information is sufficient and unambiguous. In order to tackle the multi-modal fusion problem, we propose a cross-transformer module, which can learn more fine-grained and comprehensive interactions both inside and between the modalities. Besides answer generation, we also consider question generation, which is more challenging but significant for a complete video dialog system. We evaluate our method on two large-scale datasets, and the extensive experiments show the effectiveness of our method.
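The progressive update loop might be sketched as follows: the query repeatedly absorbs context attended from the dialog history and the video, and a learned sufficiency gate decides when to stop. The gate, step limit, and fusion are assumptions, not the authors' exact design.

```python
# Sketch of a progressive inference loop with a stopping gate.
import torch
import torch.nn as nn

class ProgressiveInference(nn.Module):
    def __init__(self, dim, max_steps=4):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)
        self.gate = nn.Linear(dim, 1)          # "is the query unambiguous?"
        self.max_steps = max_steps

    def attend(self, q, ctx):                  # q: (D,), ctx: (N, D)
        w = torch.softmax(ctx @ q / q.shape[0] ** 0.5, dim=0)
        return w @ ctx

    def forward(self, query, history, video):
        for _ in range(self.max_steps):
            h = self.attend(query, history)    # context from dialog history
            v = self.attend(query, video)      # context from video
            query = torch.tanh(self.fuse(torch.cat([query, h, v])))
            if torch.sigmoid(self.gate(query)) > 0.5:  # sufficient: stop early
                break
        return query

q = ProgressiveInference(128)(torch.randn(128),
                              torch.randn(6, 128),   # 6 history turns
                              torch.randn(30, 128))  # 30 video segments
```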

6 citations


Proceedings Article•DOI•
Yutong Wang, Jiyuan Zheng, Qijiong Liu, Zhou Zhao, Jun Xiao, Yueting Zhuang
01 Jul 2019
TL;DR: The Weak Supervision Enhanced Generative Network (WeGen) is proposed which automatically discovers relevant features of the passage given the answer span in a weakly supervised manner to improve the quality of generated questions.
Abstract: Automatic question generation according to an answer within a given passage is useful for many applications, such as question answering systems and dialogue systems. Current neural methods mostly take two steps: they extract several important sentences based on the candidate answer, through manual rules or supervised neural networks, and then use an encoder-decoder framework to generate questions about these sentences. These approaches neglect the semantic relations between the answer and the context of the whole passage, which are sometimes necessary for answering the question. To address this problem, we propose the Weak Supervision Enhanced Generative Network (WeGen), which automatically discovers relevant features of the passage given the answer span in a weakly supervised manner to improve the quality of generated questions. More specifically, we devise a discriminator, the Relation Guider, to capture the relations between the whole passage and the associated answer; the Multi-Interaction mechanism is then deployed to transfer the knowledge dynamically for our question generation system. Experiments show the effectiveness of our method in both automatic and human evaluations.
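A minimal sketch of the weak-supervision idea: a discriminator scores how related each passage sentence is to the answer span, and the soft weights condition generation instead of hard sentence extraction. The bilinear scorer and all sizes are illustrative assumptions.

```python
# Sketch: a discriminator ("Relation Guider"-style) produces soft sentence
# relevance given the answer, replacing hard sentence extraction.
import torch
import torch.nn as nn

dim = 128
relation_guider = nn.Bilinear(dim, dim, 1)       # sentence-answer affinity

def weak_supervision_weights(sent_vecs, answer_vec):
    """Soft relevance over sentences given the answer, (S,)."""
    S = sent_vecs.shape[0]
    scores = relation_guider(sent_vecs, answer_vec.expand(S, -1)).squeeze(-1)
    return torch.softmax(scores, dim=0)

sents = torch.randn(8, dim)                      # 8 encoded passage sentences
answer = torch.randn(dim)                        # encoded answer span
w = weak_supervision_weights(sents, answer)
context = w @ sents                              # weighted passage context
# `context` then conditions the encoder-decoder that emits the question.
```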

Journal Article•DOI•
Yimeng Li, Jun Xiao, Di Xie, Jian Shao, Jinlong Wang
TL;DR: An adversarial learning framework is proposed, which can learn an invariant human pose latent representation from 3D-annotated datasets to optimize estimation on monocular images with only 2D annotations, and adds a viewpoint-invariant module to automatically regulate observation viewpoints for the generated 3D pose.

Posted Content•
Chao Wu, Jun Xiao, Gang Huang, Fei Wu
TL;DR: A novel model training paradigm based on blockchain, named Galaxy Learning, is proposed, which aims to train a model with distributed data while preserving data ownership for the data owners.
Abstract: The recent rapid development of artificial intelligence (AI), mainly driven by machine learning research, especially deep learning, has achieved phenomenal success in various applications. However, to further apply AI technologies in real-world contexts, several significant issues regarding the AI ecosystem must be addressed. We identify the main issues as data privacy, ownership, and exchange, which are difficult to solve with the current centralized paradigm of machine learning training. As a result, we propose a novel model training paradigm based on blockchain, named Galaxy Learning, which aims to train a model with distributed data while preserving data ownership for the data owners. In this new paradigm, encrypted models are moved around instead of data, and are federated once trained. Model training, as well as communication, is achieved with blockchain and its smart contracts. Training data is priced according to its contribution, rather than through an exchange of data ownership. In this position paper, we describe the motivation, paradigm, design, and challenges as well as opportunities of Galaxy Learning.
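A highly simplified sketch of one round under this paradigm: the model travels to each data owner, trains locally on data that never moves, and each owner's contribution is recorded for pricing. Encryption and the blockchain/smart-contract layer are elided here, and the sequential visiting order is an assumption.

```python
# Toy sketch of a "model moves, data stays" round with contribution tracking.
import copy
import torch
import torch.nn as nn

def galaxy_round(model, owners):
    """One round: model visits each owner; a ledger records contributions."""
    ledger = {}                                  # stand-in for on-chain record
    for name, (x, y) in owners.items():
        local = copy.deepcopy(model)             # model moves, data stays put
        opt = torch.optim.SGD(local.parameters(), lr=0.1)
        before = nn.functional.mse_loss(local(x), y).item()
        for _ in range(5):
            opt.zero_grad()
            nn.functional.mse_loss(local(x), y).backward()
            opt.step()
        after = nn.functional.mse_loss(local(x), y).item()
        ledger[name] = before - after            # contribution ~ improvement
        model = local                            # pass model to next owner
    return model, ledger

owners = {f"owner{i}": (torch.randn(32, 4), torch.randn(32, 1)) for i in range(3)}
model, ledger = galaxy_round(nn.Linear(4, 1), owners)
print(ledger)                                    # input to contribution pricing
```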

Proceedings Article•DOI•
Lifeng Liu, Yifan Hu, Jiawei Yu, Fengda Zhang, Gang Huang, Jun Xiao, Chao Wu
26 Aug 2019
TL;DR: Based on FL, a novel and decentralized approach to training encrypted models with privacy-preserved data on blockchain is presented, and experiments demonstrate that the approach is practical in real-world applications.
Abstract: Currently, training neural networks often requires a large corpus of data from multiple parties. However, in many cases data owners are reluctant to share their sensitive data with third parties for modelling. Therefore, Federated Learning (FL) has arisen as an alternative that enables collaborative training of models without sharing raw data, by distributing modelling tasks to multiple data owners. Based on FL, we present a novel and decentralized approach to training encrypted models with privacy-preserved data on blockchain. In our approach, the blockchain is adopted as the machine learning environment where different actors (i.e., model providers and data providers) collaborate on the training task. During the training process, an encryption algorithm is used to protect the privacy of the data and the trained model. Our experiments demonstrate that our approach is practical in real-world applications.
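Since the approach builds on FL, a plain federated-averaging round gives the flavor: each data provider trains locally, and only model weights are aggregated. In the paper these weights are encrypted; encryption is elided in this sketch, and all shapes are illustrative.

```python
# Minimal federated-averaging sketch: local updates, then parameter averaging.
import copy
import torch
import torch.nn as nn

def local_update(model, x, y, steps=5, lr=0.1):
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(local(x), y).backward()
        opt.step()
    return local.state_dict()                   # only weights leave the owner

def fed_avg(states):
    """Average each parameter over all providers' trained weights."""
    return {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}

global_model = nn.Linear(4, 1)
providers = [(torch.randn(32, 4), torch.randn(32, 1)) for _ in range(3)]
for _ in range(10):                             # training rounds
    states = [local_update(global_model, x, y) for x, y in providers]
    global_model.load_state_dict(fed_avg(states))
```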

Journal Article•DOI•
TL;DR: A new paradigm is proposed that does not require feature selection, so that data can speak for itself without manually picked features, and the deep network is used as a methodology to explore previously unknown relationships and capture the complexity and non-linearity between target variables and a large number of input features in big social data.
Abstract: Exploratory analysis is an important way to gain understanding and find unknown relationships from various data sources, especially in the era of big data. Traditional paradigms of social science data analysis follow the steps of feature selection, modeling, and prediction. In this paper, we propose a new paradigm that does not require feature selection, so that data can speak for itself without manually picked features. Besides, we propose using the deep network as a methodology to explore previously unknown relationships and to capture the complexity and non-linearity between target variables and a large number of input features in big social data. The new paradigm is intended as a relatively generic approach that can be widely used in different scenarios. In order to validate the feasibility of the paradigm, we use country-level indicator forecasting as a case study. The process includes: 1) data collection and preparation, and 2) modeling and experiments. The data collection and preparation part builds a data warehouse and conducts the extract-transform-load process to eliminate data format inconsistencies. The modeling and experiment part includes model setup and changes to the model structure to achieve relatively high prediction accuracy at both the model level and the case level. We find some patterns concerning network capacity modification and the influence of time-interval differences on the test results, both of which deserve further research.
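The "no feature selection" paradigm amounts to feeding every available indicator into a deep network and letting it find the non-linear relationships; a minimal sketch, with illustrative layer sizes, follows.

```python
# Sketch: all indicators go in raw; the network does the feature discovery.
import torch
import torch.nn as nn

n_features = 500                        # every indicator, none hand-picked
model = nn.Sequential(
    nn.Linear(n_features, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1),                   # target country-level indicator
)
x = torch.randn(128, n_features)        # 128 country-year rows after ETL
y = torch.randn(128, 1)
loss = nn.functional.mse_loss(model(x), y)
```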

Posted Content•
Yutong Wang, Jiyuan Zheng, Qijiong Liu, Zhou Zhao, Jun Xiao, Yueting Zhuang
TL;DR: This article proposes the Weak Supervision Enhanced Generative Network (WeGen), which automatically discovers relevant features of the passage given the answer span in a weakly supervised manner to improve the quality of generated questions.
Abstract: Automatic question generation according to an answer within a given passage is useful for many applications, such as question answering systems and dialogue systems. Current neural methods mostly take two steps: they extract several important sentences based on the candidate answer, through manual rules or supervised neural networks, and then use an encoder-decoder framework to generate questions about these sentences. These approaches neglect the semantic relations between the answer and the context of the whole passage, which are sometimes necessary for answering the question. To address this problem, we propose the Weak Supervision Enhanced Generative Network (WeGen), which automatically discovers relevant features of the passage given the answer span in a weakly supervised manner to improve the quality of generated questions. More specifically, we devise a discriminator, the Relation Guider, to capture the relations between the whole passage and the associated answer; the Multi-Interaction mechanism is then deployed to transfer the knowledge dynamically for our question generation system. Experiments show the effectiveness of our method in both automatic and human evaluations.