
Showing papers by "Long Chen" published in 2020


Proceedings ArticleDOI
14 Jun 2020
TL;DR: A model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme that significantly improves both the visual-explainable and question-sensitive abilities of VQA models and, in return, further boosts their performance.
Abstract: Although Visual Question Answering (VQA) has achieved impressive progress over the last few years, today's VQA models tend to capture superficial linguistic correlations in the train set and fail to generalize to test sets with different QA distributions. To reduce these language biases, several recent works introduce an auxiliary question-only model to regularize the training of the targeted VQA model, and achieve dominating performance on VQA-CP. However, due to the complexity of design, current methods are unable to equip the ensemble-based models with two indispensable characteristics of an ideal VQA model: 1) visual-explainable: the model should rely on the right visual regions when making decisions; 2) question-sensitive: the model should be sensitive to linguistic variations in the question. To this end, we propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme. CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions, and assigning different ground-truth answers. After training with the complementary samples (i.e., the original and generated samples), the VQA models are forced to focus on all critical objects and words, which significantly improves both visual-explainable and question-sensitive abilities. In return, the performance of these models is further boosted. Extensive ablations have shown the effectiveness of CSS. In particular, by building on top of the model LMH, we achieve a record-breaking performance of 58.95% on VQA-CP v2, a 6.5% gain.
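
As a rough illustration of the synthesizing step described above, here is a minimal Python sketch that masks critical objects or question words and reassigns the answer. The sample layout, the critical-index inputs, and the placeholder answer reassignment are all assumptions for illustration, not the authors' released implementation (which assigns answers dynamically).

```python
import copy

MASK_TOKEN = "[MASK]"

def synthesize_counterfactuals(sample, critical_objects, critical_words):
    """Create CSS-style counterfactual samples from one (image, question,
    answer) triple by masking critical evidence and changing the answer.

    Assumed sample layout (hypothetical):
      sample["objects"]  - list of detected-object features
      sample["question"] - list of question tokens
      sample["answer"]   - ground-truth answer string
    critical_objects / critical_words are index lists of the evidence the
    model should rely on; how they are selected is model-specific.
    """
    counterfactuals = []

    # V-CSS: remove the critical visual objects, so the original answer
    # is no longer supported by the image.
    v_sample = copy.deepcopy(sample)
    v_sample["objects"] = [f for i, f in enumerate(v_sample["objects"])
                           if i not in set(critical_objects)]
    v_sample["answer"] = "<different-answer>"  # placeholder reassignment
    counterfactuals.append(v_sample)

    # Q-CSS: mask the critical question words instead.
    q_sample = copy.deepcopy(sample)
    q_sample["question"] = [MASK_TOKEN if i in set(critical_words) else tok
                            for i, tok in enumerate(q_sample["question"])]
    q_sample["answer"] = "<different-answer>"
    counterfactuals.append(q_sample)

    return counterfactuals
```

Training then interleaves each original sample with its counterfactuals, so the model is penalized if masking the critical evidence does not change its prediction.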

231 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: It is argued that the performance of the bottom-up framework is severely underestimated by current unreasonable designs of both the backbone and head network, and a novel bottom-up model is designed: Graph-FPN with Dense Predictions (GDP).
Abstract: In this paper, we focus on the task of query-based video localization, i.e., localizing a query in a long and untrimmed video. The prevailing solutions for this problem can be grouped into two categories: i) Top-down approach: it pre-cuts the video into a set of moment candidates, then performs classification and regression for each candidate; ii) Bottom-up approach: it injects the whole query content into each video frame, then predicts the probability of each frame being a ground-truth segment boundary (i.e., start or end). Both frameworks have their respective shortcomings: top-down models suffer from heavy computation and are sensitive to heuristic rules, while the performance of bottom-up models has thus far lagged behind that of their top-down counterparts. However, we argue that the performance of the bottom-up framework is severely underestimated by current unreasonable designs of both the backbone and head network. To this end, we design a novel bottom-up model: Graph-FPN with Dense Predictions (GDP). For the backbone, GDP first generates a frame feature pyramid to capture multi-level semantics, then utilizes graph convolution to encode the plentiful scene relationships, which incidentally mitigates the semantic gaps in the multi-scale feature pyramid. For the head network, GDP regards all frames falling in the ground-truth segment as foreground, and each foreground frame regresses the unique distances from its location to the bi-directional boundaries. Extensive experiments on two challenging query-based video localization tasks (natural language video localization and video relocalization), involving four challenging benchmarks (TACoS, Charades-STA, ActivityNet Captions, and Activity-VRL), have shown that GDP surpasses the state-of-the-art top-down models.
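
The dense-prediction head lends itself to a compact sketch: every frame inside the ground-truth segment is foreground and regresses its distances to the two boundaries. The following is a minimal NumPy rendering of that idea, with function names and the decoding rule chosen for illustration rather than taken from GDP's code.

```python
import numpy as np

def dense_targets(num_frames, gt_start, gt_end):
    """Per-frame targets for a GDP-style dense head: frames inside
    [gt_start, gt_end] are foreground (label 1) and regress the distances
    from their location to the bi-directional segment boundaries."""
    labels = np.zeros(num_frames, dtype=np.float32)
    regression = np.zeros((num_frames, 2), dtype=np.float32)  # (to-start, to-end)
    for t in range(num_frames):
        if gt_start <= t <= gt_end:
            labels[t] = 1.0
            regression[t] = (t - gt_start, gt_end - t)
    return labels, regression

def decode_segment(fg_scores, regression):
    """A simple decoding rule: take the most confident foreground frame
    and recover the segment from its predicted boundary distances."""
    t = int(np.argmax(fg_scores))
    d_start, d_end = regression[t]
    return t - d_start, t + d_end
```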

122 citations


Proceedings ArticleDOI
25 Jul 2020
TL;DR: A hierarchical fashion graph network (HFGN) is proposed to model relationships among users, items, and outfits simultaneously; embedding propagation on this hierarchical graph aggregates item information into an outfit representation and then refines a user's representation via his/her historical outfits.
Abstract: Fashion outfit recommendation has attracted increasing attention from online shopping services and fashion communities. Distinct from other scenarios (e.g., social networking or content sharing) which recommend a single item (e.g., a friend or picture) to a user, outfit recommendation predicts user preference on a set of well-matched fashion items. Hence, high-quality personalized outfit recommendation should satisfy two requirements: 1) good compatibility among fashion items and 2) consistency with user preference. However, existing works focus mainly on one of the two requirements and consider only either user-outfit or outfit-item relationships, thereby easily leading to suboptimal representations and limiting the performance. In this work, we unify two tasks, fashion compatibility modeling and personalized outfit recommendation. Towards this end, we develop a new framework, Hierarchical Fashion Graph Network (HFGN), to model relationships among users, items, and outfits simultaneously. In particular, we construct a hierarchical structure upon user-outfit interactions and outfit-item mappings. We then draw inspiration from recent graph neural networks and employ embedding propagation on this hierarchical graph, so as to aggregate item information into an outfit representation, and then refine a user's representation via his/her historical outfits. Furthermore, we jointly train these two tasks to optimize these representations. To demonstrate the effectiveness of HFGN, we conduct extensive experiments on a benchmark dataset, on which HFGN achieves significant improvements over state-of-the-art compatibility matching models like NGNN and outfit recommenders like FHN.
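
To make the hierarchical propagation concrete, here is a minimal sketch of one bottom-up round (item to outfit to user) using plain mean pooling as the aggregator; HFGN's actual propagation rule and training objective are more elaborate, so treat this only as an illustration of the data flow.

```python
import numpy as np

def propagate_hierarchy(item_emb, outfit_items, user_outfits):
    """One bottom-up propagation round over the user-outfit-item hierarchy.

    item_emb:     (num_items, d) array of item embeddings
    outfit_items: dict outfit_id -> list of item ids composing the outfit
    user_outfits: dict user_id -> list of outfit ids in the user's history
    """
    # Aggregate item information into each outfit representation.
    outfit_emb = {o: item_emb[ids].mean(axis=0)
                  for o, ids in outfit_items.items()}
    # Refine each user's representation via his/her historical outfits.
    user_emb = {u: np.mean([outfit_emb[o] for o in outfits], axis=0)
                for u, outfits in user_outfits.items()}
    return outfit_emb, user_emb

# A user's preference on a candidate outfit can then be scored with a
# dot product: score = user_emb[u] @ outfit_emb[o]
```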

39 citations


Posted Content
TL;DR: A new framework, Hierarchical Fashion Graph Network (HFGN), is developed to model relationships among users, items, and outfits simultaneously, and achieves significant improvements over state-of-the-art compatibility matching models like NGNN and outfit recommenders like FHN.
Abstract: Fashion outfit recommendation has attracted increasing attention from online shopping services and fashion communities. Distinct from other scenarios (e.g., social networking or content sharing) which recommend a single item (e.g., a friend or picture) to a user, outfit recommendation predicts user preference on a set of well-matched fashion items. Hence, high-quality personalized outfit recommendation should satisfy two requirements: 1) good compatibility among fashion items and 2) consistency with user preference. However, existing works focus mainly on one of the two requirements and consider only either user-outfit or outfit-item relationships, thereby easily leading to suboptimal representations and limiting the performance. In this work, we unify two tasks, fashion compatibility modeling and personalized outfit recommendation. Towards this end, we develop a new framework, Hierarchical Fashion Graph Network (HFGN), to model relationships among users, items, and outfits simultaneously. In particular, we construct a hierarchical structure upon user-outfit interactions and outfit-item mappings. We then draw inspiration from recent graph neural networks and employ embedding propagation on this hierarchical graph, so as to aggregate item information into an outfit representation, and then refine a user's representation via his/her historical outfits. Furthermore, we jointly train these two tasks to optimize these representations. To demonstrate the effectiveness of HFGN, we conduct extensive experiments on a benchmark dataset, on which HFGN achieves significant improvements over state-of-the-art compatibility matching models like NGNN and outfit recommenders like FHN.

38 citations


Posted Content
TL;DR: Ref-NMS is proposed, the first method to yield expression-aware proposals in the first stage; it introduces a lightweight module to predict a score for aligning each box with a critical object, resulting in significantly improved grounding performance.
Abstract: The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expressions with the proposals. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: proposals are generated solely based on detection confidence (i.e., expression-agnostic), in the hope that they contain all the right instances in the expression (i.e., expression-aware). Due to this mismatch, current two-stage methods suffer from a severe performance drop between detected and ground-truth proposals. To this end, we propose Ref-NMS, the first method to yield expression-aware proposals in the first stage. Ref-NMS regards all nouns in the expression as critical objects, and introduces a lightweight module to predict a score for aligning each box with a critical object. These scores guide the NMS operation to filter out boxes irrelevant to the expression, increasing the recall of critical objects and resulting in significantly improved grounding performance. Since Ref-NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method. Extensive ablation studies on several backbones, benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS. Code is available at: https://github.com/ChopinSharp/ref-nms.
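
A minimal sketch of how expression-relatedness scores could steer the NMS step is given below; the product fusion of detection confidence and relatedness score is an assumption made for brevity, not necessarily the exact rule used by Ref-NMS.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def ref_nms(boxes, det_scores, rel_scores, iou_thresh=0.5):
    """Greedy NMS ranked by a fused score, so boxes irrelevant to the
    expression (low rel_scores) are suppressed even when their detection
    confidence is high."""
    fused = det_scores * rel_scores  # illustrative fusion rule
    order = np.argsort(-fused)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```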

28 citations


Posted Content
TL;DR: This paper proposes a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme to improve the visual-explainable and question-sensitive abilities of VQA models, achieving state-of-the-art performance.
Abstract: Although Visual Question Answering (VQA) has achieved impressive progress over the last few years, today's VQA models tend to capture superficial linguistic correlations in the train set and fail to generalize to test sets with different QA distributions. To reduce these language biases, several recent works introduce an auxiliary question-only model to regularize the training of the targeted VQA model, and achieve dominating performance on VQA-CP. However, due to the complexity of design, current methods are unable to equip the ensemble-based models with two indispensable characteristics of an ideal VQA model: 1) visual-explainable: the model should rely on the right visual regions when making decisions; 2) question-sensitive: the model should be sensitive to linguistic variations in the question. To this end, we propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme. CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions, and assigning different ground-truth answers. After training with the complementary samples (i.e., the original and generated samples), the VQA models are forced to focus on all critical objects and words, which significantly improves both visual-explainable and question-sensitive abilities. In return, the performance of these models is further boosted. Extensive ablations have shown the effectiveness of CSS. In particular, by building on top of the model LMH, we achieve a record-breaking performance of 58.95% on VQA-CP v2, a 6.5% gain.

19 citations


Proceedings Article
03 Sep 2020
TL;DR: Ref-NMS as mentioned in this paper proposes a lightweight module to predict a score for aligning each box with a critical object, which can guide the NMS operation to filter out the boxes irrelevant to the expression, increasing the recall of critical objects.
Abstract: The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expressions with the proposals. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: proposals are generated solely based on detection confidence (i.e., expression-agnostic), in the hope that they contain all the right instances in the expression (i.e., expression-aware). Due to this mismatch, current two-stage methods suffer from a severe performance drop between detected and ground-truth proposals. To this end, we propose Ref-NMS, the first method to yield expression-aware proposals in the first stage. Ref-NMS regards all nouns in the expression as critical objects, and introduces a lightweight module to predict a score for aligning each box with a critical object. These scores guide the NMS operation to filter out boxes irrelevant to the expression, increasing the recall of critical objects and resulting in significantly improved grounding performance. Since Ref-NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method. Extensive ablation studies on several backbones, benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS. Code is available at: https://github.com/ChopinSharp/ref-nms.

18 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: A novel Question-Driven Attentive Neural Network is proposed to assess the instant demands of questioners and the eligibility of products based on user-generated reviews, and make recommendations accordingly; the results show the method's efficacy and its superiority over baseline methods.
Abstract: Merchants of e-commerce websites expect recommender systems to entice more consumption, which is highly correlated with customers' purchasing propensity. However, most existing recommender systems focus on customers' general preference rather than purchasing propensity, which is often governed by instant demands that we deem to be well conveyed by the questions customers ask. A typical recommendation scenario is: Bob wants to buy a cell phone that can play the game PUBG. He is interested in the HUAWEI P20 and asks "can PUBG run smoothly on this phone?" under it. Our system is then triggered to recommend the most eligible cell phones to him. Intuitively, diverse user questions can probably be addressed in reviews written by other users who have similar concerns. To address this recommendation problem, we propose a novel Question-Driven Attentive Neural Network (QDANN) to assess the instant demands of questioners and the eligibility of products based on user-generated reviews, and make recommendations accordingly. Without supervision, QDANN can effectively exploit reviews to achieve this goal. Its attention mechanisms can also be used to provide explanations for recommendations. We evaluate QDANN in three domains of Taobao. The results show the efficacy of our method and its superiority over baseline methods.
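
As a loose illustration of the question-driven attention idea, the sketch below scores a product by attending over its review embeddings with the question embedding; the embedding inputs and scoring rule are stand-ins, not QDANN's actual architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def question_driven_score(question_emb, review_embs):
    """Score one product for one question.

    question_emb: (d,) embedding of the customer's question
    review_embs:  (num_reviews, d) embeddings of the product's reviews
    Reviews that address the question's instant demand receive higher
    attention weights; the weights double as an explanation.
    """
    weights = softmax(review_embs @ question_emb)   # attention over reviews
    attended = weights @ review_embs                # demand-focused evidence
    return float(attended @ question_emb), weights

def recommend(question_emb, products):
    """Rank candidate products (dict product_id -> review embeddings)."""
    scored = {p: question_driven_score(question_emb, r)[0]
              for p, r in products.items()}
    return sorted(scored, key=scored.get, reverse=True)
```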

7 citations


Journal ArticleDOI
Jing Jiang, Yue Li, Long Chen, Jianbo Du, Chunguo Li
TL;DR: Simulation results show that the proposed MTDL-based multiuser hybrid beamforming scheme achieves better performance than traditional algorithms and the multiple serial single-task deep learning scheme.
Abstract: Multiuser hybrid beamforming of a wideband millimeter-wave (mm-wave) system is a complex combinatorial optimization problem. It not only requires large amounts of training data but also tends to overfit and incur long run-times when multiple serial deep learning network models are used to solve the problem directly. In contrast, a multitask deep learning (MTDL) model can jointly learn multiple related tasks and share knowledge among them, which has been demonstrated to improve performance compared to learning the tasks individually. Therefore, this work presents a first attempt to exploit MTDL for multiuser hybrid beamforming in mm-wave massive multiple-input multiple-output orthogonal frequency division multiple access systems. The MTDL model includes a multitask network architecture consisting of two tasks: user scheduling and multiuser analog beamforming. First, we use the low-dimensional effective channel as input data for the two parallel tasks to reduce the computational complexity of the deep neural networks. In a shallow shared layer of the MTDL model, we utilize hard parameter sharing, in which the knowledge of the multiuser analog beamforming task is shared with the user scheduling task to mitigate multiuser interference. Second, in the training process of the MTDL model, we use an exhaustive search algorithm to generate training data to ensure optimal performance. Finally, we choose the weight coefficient of each task by traversing all weight coefficient combinations in the training phase. Simulation results show that our proposed MTDL-based multiuser hybrid beamforming scheme achieves better performance than traditional algorithms and the multiple serial single-task deep learning scheme.
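
The hard-parameter-sharing architecture can be sketched as a shared trunk feeding two parallel heads, trained with a weighted joint loss. The PyTorch snippet below is a structural illustration only; layer sizes, head types, and the loss form are assumptions, not the paper's exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTDLBeamformer(nn.Module):
    """Shared trunk over the low-dimensional effective channel, with two
    parallel task heads: user scheduling and multiuser analog beamforming."""

    def __init__(self, in_dim, num_users, codebook_size, hidden=256):
        super().__init__()
        # Shallow shared layers implement hard parameter sharing, letting
        # beamforming knowledge inform the scheduling task.
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.scheduling_head = nn.Linear(hidden, num_users)       # which users
        self.beamforming_head = nn.Linear(hidden, codebook_size)  # which beam

    def forward(self, effective_channel):
        h = self.shared(effective_channel)
        return self.scheduling_head(h), self.beamforming_head(h)

def joint_loss(sched_logits, beam_logits, sched_target, beam_target, w=0.5):
    """Weighted sum of the two task losses; the paper selects the task
    weight by traversing all weight combinations during training."""
    return (w * F.cross_entropy(sched_logits, sched_target)
            + (1 - w) * F.cross_entropy(beam_logits, beam_target))
```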

7 citations


Journal ArticleDOI
Shaoning Xiao, Yimeng Li, Yunan Ye, Long Chen, Shiliang Pu, Zhou Zhao, Jian Shao, Jun Xiao
TL;DR: This work proposes a multi-granularity temporal attention network that can search for the specific frames in a video that are holistically and locally related to the answer.
Abstract: This work aims to address the problem of video question answering (VideoQA) with a novel model and a new open-ended VideoQA dataset. VideoQA is a challenging field in visual information retrieval that aims to generate an answer according to the video content and the question. Ultimately, VideoQA is a video understanding task, and efficiently combining multi-grained representations is the key to understanding a video. Existing works mostly focus on overall frame-level visual understanding, which neglects finer-grained and temporal information inside the video, or simply combine the multi-grained representations by concatenation or addition. Thus, we propose a multi-granularity temporal attention network that can search for the specific frames in a video that are holistically and locally related to the answer. We first learn mutual attention representations of the multi-grained visual content and the question. The mutually attended features are then combined hierarchically using a double-layer LSTM to generate the answer. Furthermore, we evaluate several different multi-grained fusion configurations to demonstrate the advantage of this hierarchical architecture. The effectiveness of our model is demonstrated on a large-scale video question answering dataset built on the ActivityNet dataset.
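
A minimal PyTorch sketch of the fusion described above: each visual granularity attends to the question, and the mutually attended features are combined by a two-layer LSTM. Dimensions, head counts, and the answer decoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiGranularityFusion(nn.Module):
    """Frame-level and finer-grained (local) video features each attend to
    the question; a double-layer LSTM then fuses the two granularities
    hierarchically before answer classification."""

    def __init__(self, d, num_answers, heads=4):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fuse = nn.LSTM(d, d, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(d, num_answers)

    def forward(self, frame_feats, local_feats, question_feats):
        # Each visual granularity queries the question tokens.
        f, _ = self.frame_attn(frame_feats, question_feats, question_feats)
        l, _ = self.local_attn(local_feats, question_feats, question_feats)
        # Concatenate granularities along time and fuse hierarchically.
        fused, _ = self.fuse(torch.cat([l, f], dim=1))
        return self.classifier(fused[:, -1])  # answer logits
```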

5 citations


Posted Content
TL;DR: A simple yet effective pruning framework is proposed that comprehensively considers three dimensions of a deep CNN, taking depth, width, and image resolution as variables and the model's accuracy as the optimization objective.
Abstract: Most neural network pruning methods, such as filter-level and layer-level pruning, prune the network model along a single dimension (depth, width, or resolution) to meet a computational budget. However, such a pruning policy often leads to excessive reduction of that dimension, inducing a large accuracy loss. To alleviate this issue, we argue that pruning should be conducted along all three dimensions comprehensively. For this purpose, our pruning framework formulates pruning as an optimization problem. Specifically, it first casts the relationship between a model's accuracy and its depth/width/resolution into a polynomial regression, and then maximizes the polynomial to acquire the optimal values for the three dimensions. Finally, the model is pruned along the three optimal dimensions accordingly. In this framework, since collecting enough data for training the regression is very time-consuming, we propose two approaches to lower the cost: 1) specializing the polynomial to ensure an accurate regression even with less training data; 2) employing iterative pruning and fine-tuning to collect the data faster. Extensive experiments show that our proposed algorithm surpasses state-of-the-art pruning algorithms and even neural architecture search-based algorithms.
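
The optimization view translates directly into code: fit a small polynomial to a few measured (depth, width, resolution, accuracy) points, then maximize it under a budget. The sketch below uses a generic degree-2 polynomial with cross terms and a grid search, both of which are assumptions for illustration rather than the paper's exact specialization.

```python
import numpy as np
from itertools import product

def fit_accuracy_polynomial(samples):
    """Fit accuracy ~ f(depth, width, resolution) from measured points.
    `samples` is a list of (d, w, r, accuracy) tuples, e.g., collected by
    iterative pruning and fine-tuning."""
    def features(d, w, r):
        return [1, d, w, r, d * w, d * r, w * r, d * d, w * w, r * r]
    X = np.array([features(d, w, r) for d, w, r, _ in samples])
    y = np.array([acc for *_, acc in samples])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda d, w, r: float(np.dot(features(d, w, r), coef))

def best_dimensions(predict_acc, flops, budget, grid):
    """Maximize predicted accuracy over (depth, width, resolution) scaling
    factors subject to a computational budget, via exhaustive grid search;
    `flops(d, w, r)` is a user-supplied cost model."""
    best, best_acc = None, -np.inf
    for d, w, r in product(grid, repeat=3):
        if flops(d, w, r) <= budget:
            acc = predict_acc(d, w, r)
            if acc > best_acc:
                best, best_acc = (d, w, r), acc
    return best, best_acc
```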