
Showing papers by "Aixin Sun published in 2021"


Journal ArticleDOI
TL;DR: A video span localizing network (VSLNet) is proposed to solve the NLVL problem from a span-based question answering (QA) perspective by treating the input video as a text passage.
Abstract: Natural Language Video Localization (NLVL) aims to locate a target moment from an untrimmed video that semantically corresponds to a text query. Existing approaches mainly solve the NLVL problem from the perspective of computer vision by formulating it as ranking, anchor, or regression tasks. These methods suffer from large performance degradation when localizing on long videos. In this work, we address NLVL from a new perspective, i.e., span-based question answering (QA), by treating the input video as a text passage. We propose a video span localizing network (VSLNet), on top of the standard span-based QA framework (named VSLBase), to address NLVL. VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the matching video span within a highlighted region. To address the performance degradation on long videos, we further extend VSLNet to VSLNet-L by applying a multi-scale split-and-concatenation strategy to locate the target moment accurately. Extensive experiments show that the proposed methods outperform the state-of-the-art methods; VSLNet-L addresses the issue of performance degradation on long videos. Our study suggests that the span-based QA framework is an effective strategy to solve the NLVL problem.
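The span-based formulation can be sketched as follows: treat each video clip as a token of a "passage" and decode the answer span exactly as a QA model would, picking the start/end pair with the highest joint score. This is a minimal illustration of the decoding rule, not the authors' implementation; the logits are toy values.

```python
import numpy as np

def locate_span(start_logits, end_logits):
    """Return the (start, end) pair with the highest joint score,
    subject to start <= end -- standard span-QA decoding."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(start_logits)):
        for e in range(s, len(end_logits)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy logits over six video "tokens" (clips): the decoded span is (1, 3).
start = np.array([0.1, 2.0, 0.3, 0.1, 0.0, 0.2])
end = np.array([0.0, 0.1, 0.4, 1.8, 0.2, 0.1])
print(locate_span(start, end))  # (1, 3)
```

In VSLNet the QGH strategy would additionally restrict this search to a highlighted region rather than the whole clip sequence.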

42 citations


Proceedings ArticleDOI
11 Jul 2021
TL;DR: In this article, the authors propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for video corpus moment retrieval, which is based on two contrastive learning objectives to refine video and text representations separately but with better alignment for VCMR.
Abstract: Given a collection of untrimmed and unsegmented videos, video corpus moment retrieval (VCMR) is to retrieve a temporal moment (i.e., a fraction of a video) that semantically corresponds to a given text query. As video and text come from two distinct feature spaces, there are two general approaches to address VCMR: (i) to separately encode each modality's representations, then align the two modality representations for query processing, and (ii) to adopt fine-grained cross-modal interaction to learn multi-modal representations for query processing. While the second approach often leads to better retrieval accuracy, the first is far more efficient. In this paper, we propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. We adopt the first approach and introduce two contrastive learning objectives that refine the video encoder and text encoder to learn video and text representations separately but with better alignment for VCMR. The video contrastive learning (VideoCL) objective maximizes mutual information between the query and the candidate video at the video level. The frame contrastive learning (FrameCL) objective highlights, at the frame level, the moment region within a video that corresponds to the query. Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
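A contrastive objective of the VideoCL kind can be sketched with a standard InfoNCE-style loss: the query embedding should score its own video above the other candidates in the batch. This is a generic sketch under assumed cosine-similarity scoring, not the paper's exact formulation; the temperature `tau` is a conventional default.

```python
import numpy as np

def video_cl_loss(query, videos, tau=0.07):
    """InfoNCE-style loss: the query should score its matching video
    (assumed to sit at index 0) above the other candidates."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = np.array([cos(query, v) for v in videos]) / tau
    sims -= sims.max()                       # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum()
    return float(-np.log(probs[0]))          # cross-entropy on the positive
```

The loss falls as the query and its matching video embed closer together, which is what drives the separately trained encoders toward a shared, aligned space.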

35 citations


Proceedings ArticleDOI
17 Oct 2021
TL;DR: In this article, the authors propose a pre-training strategy to learn item representations by considering both item side information and their relationships, e.g., co-purchase, and construct a homogeneous item graph, which provides a unified view of item relations and their associated side information in multimodality.
Abstract: Side information of items, e.g., images and text description, has shown to be effective in contributing to accurate recommendations. Inspired by the recent success of pre-training models on natural language and images, we propose a pre-training strategy to learn item representations by considering both item side information and their relationships. We relate items by common user activities, e.g., co-purchase, and construct a homogeneous item graph. This graph provides a unified view of item relations and their associated side information in multimodality. We develop a novel sampling algorithm named MCNSampling to select contextual neighbors for each item. The proposed Pre-trained Multimodal Graph Transformer (PMGT) learns item representations with two objectives: 1) graph structure reconstruction, and 2) masked node feature reconstruction. Experimental results on real datasets demonstrate that the proposed PMGT model effectively exploits the multimodal side information to achieve better accuracies in downstream tasks including item recommendation and click-through rate prediction. In addition, we also report a case study of testing PMGT in an online setting with 600 thousand users.
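The masked-node-feature-reconstruction objective can be sketched in a few lines: hide the features of a random subset of nodes and score a reconstructor by its error on exactly those nodes. This is a simplified stand-in for PMGT's transformer-based objective; the function names and the mean-squared-error choice are illustrative assumptions.

```python
import numpy as np

def masked_node_loss(features, reconstruct, mask_ratio=0.3, seed=0):
    """Zero out a random subset of node feature rows and score how well
    `reconstruct` recovers them (MSE on the masked nodes only)."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    masked = rng.choice(n, size=max(1, int(n * mask_ratio)), replace=False)
    corrupted = features.copy()
    corrupted[masked] = 0.0                  # hide the selected nodes
    recon = reconstruct(corrupted)
    return float(((recon[masked] - features[masked]) ** 2).mean())

feats = np.arange(12, dtype=float).reshape(4, 3)  # 4 nodes, 3-dim features
oracle = lambda x: feats        # a perfect reconstructor -> zero loss
print(masked_node_loss(feats, oracle))  # 0.0
```

A model that simply echoes the corrupted input incurs a positive loss, so minimizing this objective forces it to infer the hidden features from graph context.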

27 citations


Journal ArticleDOI
TL;DR: This paper proposes AUC-MF to address the POI recommendation problem by maximizing Area Under the ROC curve (AUC), and defines a new lambda for AUC to utilize the LambdaMF model, which combines the lambda-based method and matrix factorization model in collaborative filtering.
Abstract: The task of point of interest (POI) recommendation aims to recommend unvisited places to users based on their check-in history. A major challenge in POI recommendation is data sparsity, because a user typically visits only a very small number of POIs among all available POIs. In this paper, we propose AUC-MF to address the POI recommendation problem by maximizing Area Under the ROC curve (AUC). AUC has been widely used for measuring classification performance with imbalanced data distributions. To optimize AUC, we transform the recommendation task into a classification problem, where the visited locations are positive examples and the unvisited ones are negative examples. We define a new lambda for AUC to utilize the LambdaMF model, which combines the lambda-based method and the matrix factorization model in collaborative filtering. Many studies have shown that geographic information plays an important role in POI recommendation. In this study, we focus on two levels of geographic information: local similarity and global similarity. We further show that AUC-MF can be easily extended to incorporate geographical contextual information for POI recommendation.
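The quantity being maximized has a simple pairwise reading: AUC is the fraction of (visited, unvisited) pairs that the model ranks correctly. A minimal sketch of that definition, with toy scores (not the AUC-MF optimization itself, which works on a smoothed lambda surrogate):

```python
import numpy as np

def auc(scores, labels):
    """Exact AUC: the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half-correct."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

# Visited POIs (label 1) should outscore unvisited ones (label 0).
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```

Because the indicator `p > n` is not differentiable, lambda-style methods such as LambdaMF replace it with a smooth surrogate and push gradients through the matrix factorization scores.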

22 citations


Journal ArticleDOI
TL;DR: BdryBot, a recurrent neural network encoder-decoder framework with a pointer network to detect entity boundaries from a given sentence, achieves state-of-the-art performance against five baselines and can be further enhanced when incorporating contextualized language embeddings into token representations.
Abstract: In this paper, we focus on named entity boundary detection , which is to detect the start and end boundaries of an entity mention in text, without predicting its type. The detected entities are input to entity linking or fine-grained typing systems for semantic enrichment. We propose BdryBot , a recurrent neural network encoder-decoder framework with a pointer network to detect entity boundaries from a given sentence. The encoder considers both character-level representations and word-level embeddings to represent the input words. In this way, BdryBot does not require any hand-crafted features. Because of the pointer network, BdryBot overcomes the problem of variable size output vocabulary and the issue of sparse boundary tags. We conduct two sets of experiments, in-domain detection and cross-domain detection, on six datasets. Our results show that BdryBot achieves state-of-the-art performance against five baselines. In addition, our proposed approach can be further enhanced when incorporating contextualized language embeddings into token representations.
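The pointer mechanism that lets BdryBot sidestep a variable-size output vocabulary can be sketched in one step: score every input position against the current decoder state and "point" at the best one, so the output space is just the input positions. This is a generic pointer-network sketch with toy vectors, not the paper's architecture.

```python
import numpy as np

def point(decoder_state, encoder_states):
    """One pointer-network step: score each input position against the
    decoder state and return the position with the highest score, so the
    output vocabulary is just the input positions (no fixed tag set)."""
    scores = encoder_states @ decoder_state   # dot-product attention scores
    return int(np.argmax(scores))

# A decoder state most similar to the encoding of token 2 points at 2.
enc = np.eye(4)   # stand-in encoder states, one row per input token
print(point(np.array([0.0, 0.2, 1.0, 0.1]), enc))  # 2
```

Decoding a boundary pair then amounts to two such pointer steps, one for the entity start and one for its end.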

19 citations


Proceedings ArticleDOI
26 Oct 2021
TL;DR: This article proposed a generative inverse reinforcement learning approach that avoids the need to define an elaborate reward function by first generating policies based on observed users' preferences and then evaluating the learned policy by a measurement based on a discriminative actor-critic network.
Abstract: Deep reinforcement learning enables an agent to capture users' interests through dynamic interactions with the environment. It uses a reward function to learn user interests and to control the learning process, attracting great interest in recommendation research. However, most reward functions are manually designed; they are either too unrealistic or too imprecise to reflect the variety, dimensionality, and non-linearity of the recommendation problem. This impedes the agent from learning an optimal policy in highly dynamic online recommendation scenarios. To address the above issue, we propose a generative inverse reinforcement learning approach that avoids the need to define an elaborate reward function. In particular, we model the recommendation problem as an automatic policy learning problem. We first generate policies based on observed users' preferences and then evaluate the learned policy by a measurement based on a discriminative actor-critic network. We conduct experiments on an online platform, VirtualTB, and demonstrate the feasibility and effectiveness of our proposed approach via comparisons with several state-of-the-art methods.
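The core trick of replacing a hand-designed reward with a learned discriminator can be sketched in GAIL style: the policy is rewarded in proportion to how convincingly its (state, action) pairs pass for real user behaviour. This is a generic adversarial-imitation sketch, not the paper's discriminative actor-critic formulation.

```python
import math

def imitation_reward(d_prob, eps=1e-8):
    """GAIL-style reward derived from a discriminator's probability that
    a (state, action) pair came from real user behaviour; the policy is
    rewarded for producing recommendations the discriminator cannot
    distinguish from observed preferences."""
    return -math.log(1.0 - d_prob + eps)

# The more 'expert-like' the action looks, the larger the reward.
print(imitation_reward(0.9) > imitation_reward(0.1))  # True
```

Since the discriminator is retrained as the policy improves, the reward signal adapts to the environment rather than being fixed up front, which is exactly what the manual designs above fail to do.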

5 citations



Journal ArticleDOI
TL;DR: A detailed analysis of the stability of concept embeddings in the medical domain, particularly its relation with concept frequency, reveals the surprisingly high stability of low-frequency concepts: low-frequency (<100) concepts have the same high stability as high-frequency (>1,000) concepts.
Abstract: Frequency is one of the major factors for training quality word embeddings. Several recent works have discussed the stability of word embeddings in the general domain and suggested factors influencing the stability. In this work, we conduct a detailed analysis of the stability of concept embeddings in the medical domain, particularly its relation with concept frequency. The analysis reveals the surprisingly high stability of low-frequency concepts: low-frequency (<100) concepts are as stable as high-frequency (>1,000) concepts. To develop a deeper understanding of this finding, we propose a new factor, the noisiness of context words, which influences the stability of medical concept embeddings regardless of frequency. We evaluate the proposed factor by showing its linear correlation with the stability of medical concept embeddings. The correlations are clear and consistent across various groups of medical concepts. Based on the linear relations, we make suggestions on ways to adjust the noisiness of context words to improve stability. Finally, we demonstrate that the proposed factor extends to word embedding stability in the general domain.
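A common way to operationalize "stability" in this literature is the overlap of a concept's nearest-neighbour lists across independent training runs; a sketch of that measure follows. The Jaccard-overlap choice and the toy medical terms are illustrative assumptions, not necessarily the exact metric used in this study.

```python
def stability(neighbors_run1, neighbors_run2):
    """Embedding stability of one concept across two training runs,
    measured as the Jaccard overlap of its top-k nearest-neighbour
    lists; 1.0 means identical neighbourhoods, 0.0 means disjoint."""
    a, b = set(neighbors_run1), set(neighbors_run2)
    return len(a & b) / len(a | b)

# Two of four distinct neighbours agree across runs -> stability 0.5.
print(stability(["fever", "cough", "rash"], ["fever", "cough", "nausea"]))
```

Correlating such per-concept scores against frequency bands, or against a context-word noisiness score, is what yields the linear relations described above.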

2 citations


Journal ArticleDOI
TL;DR: The comparative results showed that papers would obtain a sharp rise in citation counts shortly after they were cited by citation promoters, and papers that received citation promoters at an early age outperformed other papers in long-term citation counts.
Abstract: Researchers have investigated numerous factors influencing the citation counts of cited papers. One factor investigated has been the number of gained citations, as this could increase the visibility of cited papers and subsequently induce further citations. In this paper, aiming to identify a particular kind of citation that could trigger rapid growth in the citation counts of cited papers, we proposed the concept of a "citation promoter". We defined citation promoters based on the annual citation rates of the cited papers and the co-citation counts received by the pair of cited and citing papers. The comparative results showed that papers would obtain a sharp rise in citation counts shortly after they were cited by citation promoters. Papers that received citation promoters at an early age outperformed other papers in long-term citation counts. In addition, we developed a classification model for predicting whether a citing paper would be a citation promoter for its cited paper. Since this was a class-imbalanced problem (4 percent positive instances), and our dataset lacked content and author features, our preliminary models achieved moderate performance with an F1 score slightly higher than 0.5, while the F1 score obtained by random guessing was 0.07.
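The 0.07 chance baseline follows directly from the class balance, and a two-line computation reproduces it. This assumes the usual reading of "random guessing" as a fair coin flip per instance, so expected precision equals the 4% base rate and expected recall is 0.5.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Random guessing on a 4%-positive dataset: expected precision equals the
# base rate (0.04) and expected recall is 0.5 (a fair coin keeps half the
# positives), reproducing the ~0.07 chance-level F1 quoted above.
print(round(f1(0.04, 0.5), 2))  # 0.07
```

Against that baseline, the reported F1 slightly above 0.5 represents a substantial, if still moderate, improvement.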

1 citation


Posted Content
TL;DR: In this article, a parallel attention network with sequence matching (SeqPAN) is proposed to address the challenges of multi-modal representation learning, and target moment boundary prediction in video grounding.
Abstract: Given a video, video grounding aims to retrieve a temporal moment that semantically corresponds to a language query. In this work, we propose a Parallel Attention Network with Sequence matching (SeqPAN) to address the challenges in this task: multi-modal representation learning, and target moment boundary prediction. We design a self-guided parallel attention module to effectively capture self-modal contexts and cross-modal attentive information between video and text. Inspired by sequence labeling tasks in natural language processing, we split the ground truth moment into begin, inside, and end regions. We then propose a sequence matching strategy to guide start/end boundary predictions using region labels. Experimental results on three datasets show that SeqPAN is superior to state-of-the-art methods. Furthermore, the effectiveness of the self-guided parallel attention module and the sequence matching module is verified.
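The begin/inside/end split borrowed from sequence labelling can be sketched directly: each video clip in the ground-truth moment gets a region tag, and clips outside it get O. The tag names and the per-clip granularity here are illustrative assumptions in the style of NLP BIO/BIOE schemes.

```python
def region_labels(num_clips, start, end):
    """Tag each video clip as Begin, Inside, or End of the ground-truth
    moment (O = outside), mirroring sequence labelling in NLP."""
    labels = ["O"] * num_clips
    labels[start] = "B"
    labels[end] = "E"
    for i in range(start + 1, end):   # clips strictly between the bounds
        labels[i] = "I"
    return labels

print(region_labels(6, 1, 3))  # ['O', 'B', 'I', 'E', 'O', 'O']
```

The sequence matching strategy then supervises the start/end boundary predictions against these region labels rather than against the raw timestamps alone.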


Posted Content
TL;DR: DocIE, a novel document-level context-aware OpenIE model, is proposed, along with DocOIE, a manually annotated evaluation dataset; OpenIE extracts structured relational tuples (subject, relation, object) from sentences and plays a critical role in many downstream NLP applications.
Abstract: Open Information Extraction (OpenIE) aims to extract structured relational tuples (subject, relation, object) from sentences and plays a critical role in many downstream NLP applications. Existing solutions perform extraction at the sentence level, without referring to any additional contextual information. In reality, however, a sentence typically exists as part of a document rather than standalone; we often need to access relevant contextual information around the sentence before we can accurately interpret it. As there is no document-level context-aware OpenIE dataset available, we manually annotate 800 sentences from 80 documents in two domains (Healthcare and Transportation) to form the DocOIE dataset for evaluation. In addition, we propose DocIE, a novel document-level context-aware OpenIE model. Our experimental results based on DocIE demonstrate that incorporating document-level context is helpful in improving OpenIE performance. Both the DocOIE dataset and the DocIE model are released for public use.