
Showing papers by "Jiawei Han published in 2021"



Proceedings ArticleDOI
01 Jun 2021
TL;DR: A document-level neural event argument extraction model is proposed by formulating the task as conditional generation following event templates; the same formulation also enables the first end-to-end zero-shot event extraction framework.
Abstract: Event extraction has long been treated as a sentence-level task in the IE community. We argue that this setting does not match human information-seeking behavior and leads to incomplete and uninformative extraction results. We propose a document-level neural event argument extraction model by formulating the task as conditional generation following event templates. We also compile a new document-level event extraction benchmark dataset WikiEvents which includes complete event and coreference annotation. On the task of argument extraction, we achieve an absolute gain of 7.6% F1 and 5.7% F1 over the next best model on the RAMS and WikiEvents datasets, respectively. On the more challenging task of informative argument extraction, which requires implicit coreference reasoning, we achieve a 9.3% F1 gain over the best baseline. To demonstrate the portability of our model, we also create the first end-to-end zero-shot event extraction framework and achieve 97% of the fully supervised model's trigger extraction performance and 82% of the argument extraction performance given access to only 10 of the 33 event types on ACE.

111 citations
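As an illustration of the template-as-generation formulation above, here is a minimal hedged sketch; the model choice, template string, and input format are assumptions for illustration, not the authors' released setup, and an untrained BART will not produce meaningful fills.

```python
# Illustrative sketch (not the authors' code): frame document-level argument
# extraction as conditional generation. A seq2seq model is fed an unfilled
# event template plus the document, and generates the filled template, from
# which argument spans can be parsed.
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

template = "<arg1> attacked <arg2> using <arg3> at <arg4> place"
document = ("Elliott testified that on April 15, McVeigh came into the body "
            "shop and rented a truck, which he later detonated downtown.")

# Condition generation on the template followed by the document context.
inputs = tokenizer(template + " </s> " + document,
                   return_tensors="pt", truncation=True, max_length=1024)
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
filled = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(filled)  # after fine-tuning: a filled template naming the arguments
```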


Proceedings ArticleDOI
01 Jun 2021
TL;DR: This paper proposes a novel HMTC framework, named TaxoClass, which calculates document-class similarities using a textual entailment model, identifies a document’s core classes and utilizes confident core classes to train a taxonomy-enhanced classifier, and generalizes the classifier via multi-label self-training.
Abstract: Hierarchical multi-label text classification (HMTC) aims to tag each document with a set of classes from a taxonomic class hierarchy. Most existing HMTC methods train classifiers using massive human-labeled documents, which are often too costly to obtain in real-world applications. In this paper, we explore conducting HMTC with only class surface names as supervision signals. We observe that to perform HMTC, human experts typically first pinpoint a few most essential classes for the document as its “core classes”, and then check the core classes’ ancestor classes to ensure coverage. To mimic human experts, we propose a novel HMTC framework, named TaxoClass. Specifically, TaxoClass (1) calculates document-class similarities using a textual entailment model, (2) identifies a document’s core classes and utilizes confident core classes to train a taxonomy-enhanced classifier, and (3) generalizes the classifier via multi-label self-training. Our experiments on two challenging datasets show that TaxoClass can achieve around 0.71 Example-F1 using only class names, outperforming the best previous method by 25%.

43 citations
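A minimal sketch of step (1) and the ancestor-coverage step, under assumptions: the taxonomy, thresholds, and the use of an off-the-shelf NLI model via the zero-shot pipeline are illustrative stand-ins for TaxoClass's actual components.

```python
# Score document-class similarity with a textual entailment model, take
# top-scoring classes as candidate "core classes", then add ancestors.
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

taxonomy_parent = {"neural networks": "machine learning",
                   "machine learning": "computer science",
                   "databases": "computer science"}
classes = list(taxonomy_parent) + ["computer science"]

doc = "We train a convolutional network with dropout for image recognition."
result = nli(doc, candidate_labels=classes, multi_label=True)
core = [l for l, s in zip(result["labels"], result["scores"]) if s > 0.5][:2]

# Ensure coverage by walking up to ancestor classes, mimicking human experts.
labels = set(core)
for c in core:
    while c in taxonomy_parent:
        c = taxonomy_parent[c]
        labels.add(c)
print(labels)
```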


Posted Content
TL;DR: This article proposed COCO-LM, a self-supervised learning framework that pretrains language models by correcting challenging errors and contrasting text sequences; it employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Abstract: We present COCO-LM, a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences. COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences. It creates more challenging pretraining inputs, where noises are sampled based on their likelihood in the auxiliary language model. COCO-LM then pretrains with two tasks: The first task, corrective language modeling, learns to correct the auxiliary model's corruptions by recovering the original tokens. The second task, sequence contrastive learning, ensures that the language model generates sequence representations that are invariant to noises and transformations. In our experiments on the GLUE and SQuAD benchmarks, COCO-LM outperforms recent pretraining approaches in various pretraining settings and few-shot evaluations, with higher pretraining efficiency. Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.

36 citations
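The two objectives above can be summarized in a short hedged sketch; the tensors below are random stand-ins for real model outputs, and all dimensions, the temperature, and loss weighting are assumptions rather than the paper's configuration.

```python
# Hedged sketch of COCO-LM's two pretraining objectives with toy tensors.
import torch
import torch.nn.functional as F

vocab, hidden, seq_len, batch = 1000, 64, 16, 8

# (1) Corrective language modeling: predict the ORIGINAL token at every
# position of the corrupted sequence (copy when unchanged, correct when not).
logits = torch.randn(batch, seq_len, vocab)          # main model outputs
original_ids = torch.randint(0, vocab, (batch, seq_len))
clm_loss = F.cross_entropy(logits.reshape(-1, vocab), original_ids.reshape(-1))

# (2) Sequence contrastive learning: a sequence and its corrupted/cropped view
# should align; other sequences in the batch serve as negatives.
z1 = F.normalize(torch.randn(batch, hidden), dim=-1)  # original-view vectors
z2 = F.normalize(torch.randn(batch, hidden), dim=-1)  # augmented-view vectors
sim = z1 @ z2.t() / 0.07                              # temperature-scaled
scl_loss = F.cross_entropy(sim, torch.arange(batch))  # match i-th to i-th

loss = clm_loss + scl_loss
print(float(loss))
```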


Proceedings ArticleDOI
18 Mar 2021
TL;DR: In this paper, a knowledge discovery framework, COVID-KG, was developed to extract fine-grained multimedia knowledge elements (entities, relations and events) from scientific literature.
Abstract: To combat COVID-19, both clinicians and scientists need to digest the vast amount of relevant biomedical knowledge in the literature to understand the disease mechanism and the related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG, to extract fine-grained multimedia knowledge elements (entities, relations, and events) from scientific literature. We then exploit the constructed multimedia knowledge graphs (KGs) for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures, and knowledge subgraphs as evidence. All of the data, KGs, and reports are publicly available.

22 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: It is found that similar to network architecture selection, Transformer growth also favors compound scaling, and the proposed method CompoundGrow speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively while achieving comparable performances.
Abstract: As the excessive pre-training cost arouses the need to improve efficiency, considerable efforts have been made to train BERT progressively: starting from an inferior but low-cost model and gradually increasing the computational complexity. Our objective is to help advance the understanding of such Transformer growth and discover principles that guide progressive training. First, we find that, similar to network architecture selection, Transformer growth also favors compound scaling. Specifically, while existing methods only conduct network growth in a single dimension, we observe that it is beneficial to use compound growth operators and balance multiple dimensions (e.g., depth, width, and input length of the model). Moreover, we explore alternative growth operators in each dimension via controlled comparison to give practical guidance for operator selection. In light of our analyses, the proposed method CompoundGrow speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively while achieving comparable performance.

22 citations
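A toy schedule makes the compound-growth idea concrete; all stage sizes and step counts below are invented for illustration, and a real implementation would also copy weights into the enlarged network.

```python
# Illustrative sketch of compound growth (not the CompoundGrow release):
# instead of growing one dimension, a staged schedule enlarges depth,
# width, and input length together.
stages = [
    # (layers, hidden width, input length, training steps)
    (3, 256, 128, 50_000),    # cheap early stage
    (6, 512, 256, 50_000),    # grow all dimensions, not just one
    (12, 768, 512, 100_000),  # final BERT-base-like configuration
]

def grow(model_cfg, layers, width, length):
    """Return the next-stage config; a real system would also duplicate
    or interpolate the existing weights into the enlarged network."""
    return {**model_cfg, "num_layers": layers,
            "hidden_size": width, "max_seq_len": length}

cfg = {"vocab_size": 30_522}
for layers, width, length, steps in stages:
    cfg = grow(cfg, layers, width, length)
    print(f"train {steps} steps with {cfg}")
```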


Proceedings ArticleDOI
14 Aug 2021
TL;DR: The authors induce high-quality phrase spans as silver labels from consistently co-occurring word sequences within each document, then pair them with attention maps to train a span prediction model that recognizes (unseen) quality phrases in new input regardless of their surface names or frequency.
Abstract: Identifying and understanding quality phrases from context is a fundamental task in text mining. The most challenging part of this task arguably lies in uncommon, emerging, and domain-specific phrases. The infrequent nature of these phrases significantly hurts the performance of phrase mining methods that rely on sufficient phrase occurrences in the input corpus. Context-aware tagging models, though not restricted by frequency, heavily rely on domain experts for either massive sentence-level gold labels or handcrafted gazetteers. In this work, we propose UCPhrase, a novel unsupervised context-aware quality phrase tagger. Specifically, we induce high-quality phrase spans as silver labels from consistently co-occurring word sequences within each document. Compared with typical context-agnostic distant supervision based on existing knowledge bases (KBs), our silver labels root deeply in the input domain and context, thus having unique advantages in preserving contextual completeness and capturing emerging, out-of-KB phrases. Training a conventional neural tagger based on silver labels usually faces the risk of overfitting phrase surface names. Alternatively, we observe that the contextualized attention maps generated from a transformer-based neural language model effectively reveal the connections between words in a surface-agnostic way. Therefore, we pair such attention maps with the silver labels to train a lightweight span prediction model, which can be applied to new input to recognize (unseen) quality phrases regardless of their surface names or frequency. Thorough experiments on various tasks and datasets, including corpus-level phrase ranking, document-level keyphrase extraction, and sentence-level phrase tagging, demonstrate the superiority of our design over state-of-the-art pre-trained, unsupervised, and distantly supervised methods.

21 citations
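A simplified version of the silver-labeling step can be sketched in a few lines; the real method mines maximal consistently co-occurring sequences, whereas this toy version just counts repeated n-grams within one document, which is an assumption.

```python
# UCPhrase-style silver labeling, simplified: word sequences that recur
# within ONE document (here: n-grams repeated >= min_count times) are
# taken as silver phrase spans.
from collections import Counter

def silver_phrases(sentences, max_n=3, min_count=2):
    counts = Counter()
    for sent in sentences:
        words = sent.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return [" ".join(gram) for gram, c in counts.items() if c >= min_count]

doc = ["Quality phrase mining is a fundamental task.",
       "Unsupervised quality phrase mining avoids gold labels.",
       "We tag each quality phrase in context."]
print(silver_phrases(doc))  # e.g. ['quality phrase', 'phrase mining', ...]
```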




Proceedings ArticleDOI
01 Jun 2021
TL;DR: The problem of grounding events into a precise timeline is first formulated with a 4-tuple temporal representation used in entity slot filling, which represents fuzzy time spans more conveniently, and a graph attention network-based approach then propagates temporal information over document-level event graphs constructed from shared entity arguments and temporal relations.
Abstract: Grounding events into a precise timeline is important for natural language understanding but has received limited attention in recent work. This problem is challenging due to the inherent ambiguity of language and the requirement for information propagation over inter-related events. This paper first formulates this problem based on a 4-tuple temporal representation used in entity slot filling, which allows us to represent fuzzy time spans more conveniently. We then propose a graph attention network-based approach to propagate temporal information over document-level event graphs constructed by shared entity arguments and temporal relations. To better evaluate our approach, we present a challenging new benchmark on the ACE2005 corpus, where more than 78% of events do not have time spans mentioned explicitly in their local contexts. The proposed approach yields an absolute gain of 7.0% in match rate over contextualized embedding approaches, and 16.3% higher match rate compared to sentence-level manual event time argument annotation.

14 citations
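The 4-tuple representation lends itself to a small data-structure sketch; the field names and the intersection rule below are a plausible reading of the description above, not the paper's exact formalization.

```python
# Sketch: an event's time bounded by four dates, so fuzzy spans like
# "before April 20, 2005" become constraints that can be tightened by
# information propagated from related events.
from dataclasses import dataclass
from datetime import date

@dataclass
class EventTime4Tuple:
    earliest_start: date
    latest_start: date
    earliest_end: date
    latest_end: date

    def intersect(self, other: "EventTime4Tuple") -> "EventTime4Tuple":
        """Tighten bounds with information from a related event."""
        return EventTime4Tuple(
            max(self.earliest_start, other.earliest_start),
            min(self.latest_start, other.latest_start),
            max(self.earliest_end, other.earliest_end),
            min(self.latest_end, other.latest_end))

# "happened in April 2005" intersected with "before April 20, 2005"
a = EventTime4Tuple(date(2005, 4, 1), date(2005, 4, 30),
                    date(2005, 4, 1), date(2005, 4, 30))
b = EventTime4Tuple(date.min, date(2005, 4, 20), date.min, date(2005, 4, 20))
print(a.intersect(b))
```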


Proceedings ArticleDOI
19 Apr 2021
TL;DR: In this paper, the authors propose an end-to-end framework that leverages both metadata and hierarchy information for multi-label text classification in a large label hierarchy (e.g., with tens of thousands of labels).
Abstract: Multi-label text classification refers to the problem of assigning each given document its most relevant labels from a label set. Commonly, the metadata of the given documents and the hierarchy of the labels are available in real-world applications. However, most existing studies focus on only modeling the text information, with a few attempts to utilize either metadata or hierarchy signals, but not both of them. In this paper, we bridge the gap by formalizing the problem of metadata-aware text classification in a large label hierarchy (e.g., with tens of thousands of labels). To address this problem, we present MATCH, an end-to-end framework that leverages both metadata and hierarchy information. To incorporate metadata, we pre-train the embeddings of text and metadata in the same space and also leverage fully-connected attentions to capture the interrelations between them. To leverage the label hierarchy, we propose different ways to regularize the parameters and output probability of each child label by its parents. Extensive experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH over state-of-the-art deep learning baselines.

13 citations
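One way to regularize a child label's output by its parent can be sketched directly; this is an illustrative variant under assumptions (a squared hinge on probability violations), since MATCH proposes several regularizers and this need not match any of them exactly.

```python
# Hedged sketch of hierarchy regularization: discourage a child label's
# predicted probability from exceeding its parent's.
import torch

parent_of = {2: 0, 3: 0, 4: 1}             # label hierarchy: child -> parent
probs = torch.sigmoid(torch.randn(8, 5))   # batch of 8 docs, 5 labels

child = torch.tensor(list(parent_of.keys()))
parent = torch.tensor(list(parent_of.values()))

# Penalize p(child) > p(parent); zero loss when the hierarchy is respected.
violation = torch.clamp(probs[:, child] - probs[:, parent], min=0.0)
hier_reg = violation.pow(2).mean()
print(float(hier_reg))
```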


Proceedings ArticleDOI
01 Aug 2021
TL;DR: Generation-Augmented Retrieval (GAR) augments a query through text generation of heuristically discovered relevant contexts without external resources as supervision, achieving state-of-the-art performance on the Natural Questions and TriviaQA datasets.
Abstract: We propose Generation-Augmented Retrieval (GAR) for answering open-domain questions, which augments a query through text generation of heuristically discovered relevant contexts without external resources as supervision. We demonstrate that the generated contexts substantially enrich the semantics of the queries and GAR with sparse representations (BM25) achieves comparable or better performance than state-of-the-art dense retrieval methods such as DPR. We show that generating diverse contexts for a query is beneficial as fusing their results consistently yields better retrieval accuracy. Moreover, as sparse and dense representations are often complementary, GAR can be easily combined with DPR to achieve even better performance. GAR achieves state-of-the-art performance on Natural Questions and TriviaQA datasets under the extractive QA setup when equipped with an extractive reader, and consistently outperforms other retrieval methods when the same generative reader is used.
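The retrieval-and-fusion step can be sketched with a standard BM25 package; the generated contexts below are hand-written stand-ins for model outputs, and the score-sum fusion is a simplifying assumption rather than GAR's exact fusion scheme.

```python
# Minimal GAR-style sketch (assumes the rank_bm25 package): expand the
# query with each generated context, retrieve with BM25, fuse the scores.
from rank_bm25 import BM25Okapi

passages = ["Anne Frank wrote her diary while hiding in Amsterdam.",
            "The Great Wall of China is visible from low orbit.",
            "The diary of a young girl was published in 1947."]
bm25 = BM25Okapi([p.lower().split() for p in passages])

query = "who wrote the diary of a young girl"
generated_contexts = ["Anne Frank wrote The Diary of a Young Girl.",
                      "The diary was published after World War II in 1947."]

# Fuse retrieval results from the query expanded by each generated context.
fused = [0.0] * len(passages)
for ctx in generated_contexts:
    expanded = (query + " " + ctx).lower().split()
    for i, s in enumerate(bm25.get_scores(expanded)):
        fused[i] += s

best = max(range(len(passages)), key=fused.__getitem__)
print(passages[best])
```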

Proceedings Article
01 Nov 2021
TL;DR: This paper proposed three orthogonal schemes to improve model generalization ability in few-shot settings: meta-learning to construct prototypes for different entity types, task-specific supervised pre-training on noisy web data to extract entity-related representations, and self-training to leverage unlabeled in-domain data.
Abstract: This paper presents an empirical study on efficiently building named entity recognition (NER) systems when a small amount of in-domain labeled data is available. Building upon recent Transformer-based self-supervised pre-trained language models (PLMs), we investigate three orthogonal schemes to improve model generalization ability in few-shot settings: (1) meta-learning to construct prototypes for different entity types, (2) task-specific supervised pre-training on noisy web data to extract entity-related representations, and (3) self-training to leverage unlabeled in-domain data. On 10 public NER datasets, we perform extensive empirical comparisons over the proposed schemes and their combinations with various proportions of labeled data. Our experiments show that (i) in the few-shot learning setting, the proposed NER schemes significantly improve or outperform the commonly used baseline, a PLM-based linear classifier fine-tuned using domain labels, and (ii) we create new state-of-the-art results on both few-shot and training-free settings compared with existing methods.
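Scheme (1) reduces to nearest-prototype classification, which is easy to sketch; the random vectors below stand in for PLM token embeddings, and the Euclidean-distance rule is one common prototype-network choice, assumed here rather than taken from the paper.

```python
# Prototype construction for few-shot NER: average each entity type's few
# support-token embeddings, label new tokens by the nearest prototype.
import numpy as np

rng = np.random.default_rng(0)
dim = 32
support = {"PER": rng.normal(size=(5, dim)),   # few labeled tokens per type
           "LOC": rng.normal(size=(5, dim)),
           "O":   rng.normal(size=(20, dim))}

prototypes = {t: v.mean(axis=0) for t, v in support.items()}

def classify(token_vec):
    return min(prototypes,
               key=lambda t: np.linalg.norm(token_vec - prototypes[t]))

print(classify(support["PER"][0]))  # likely "PER"
```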

Proceedings ArticleDOI
08 Mar 2021
TL;DR: HiMeCat, an embedding-based generative framework, is proposed for document categorization under weak supervision: a joint representation learning module allows simultaneous modeling of category dependencies, metadata information, and textual semantics, and a data augmentation module hierarchically synthesizes training documents to complement the original, small-scale training set.
Abstract: Categorizing documents into a given label hierarchy is intuitively appealing due to the ubiquity of hierarchical topic structures in massive text corpora. Although related studies have achieved satisfying performance in fully supervised hierarchical document classification, they usually require massive human-annotated training data and only utilize text information. However, in many domains, (1) annotations are quite expensive where very few training samples can be acquired; (2) documents are accompanied by metadata information. Hence, this paper studies how to integrate the label hierarchy, metadata, and text signals for document categorization under weak supervision. We develop HiMeCat, an embedding-based generative framework for our task. Specifically, we propose a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata information and textual semantics, and we introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set. Our experiments demonstrate a consistent improvement of HiMeCat over competitive baselines and validate the contribution of our representation learning and data augmentation modules.
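A toy rendering of hierarchical synthesis follows; the actual HiMeCat module is embedding-based, so the word lists, weights, and sampling scheme here are all assumptions made purely to illustrate the idea of generating pseudo documents along the label hierarchy.

```python
# Toy hierarchical document synthesis: sample words from a pool built from
# the target category and its ancestors, weighting specific levels higher.
import random

word_dist = {"science":  ["study", "method", "result", "data"],
             "biology":  ["cell", "gene", "protein"],
             "genomics": ["genome", "sequencing", "variant"]}
parent = {"genomics": "biology", "biology": "science"}

def synthesize(category, length=8):
    pool, c, weight = [], category, 4
    while c:                        # more specific levels get higher weight
        pool += word_dist[c] * weight
        c = parent.get(c)
        weight = max(1, weight // 2)
    return " ".join(random.choice(pool) for _ in range(length))

random.seed(0)
print(synthesize("genomics"))
```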

Posted Content
TL;DR: This article proposed Reader-guIDEd Reranker (RIDER), which does not involve training and reranks the retrieved passages solely based on the top predictions of the reader before reranking.
Abstract: Current open-domain question answering systems often follow a Retriever-Reader architecture, where the retriever first retrieves relevant passages and the reader then reads the retrieved passages to form an answer. In this paper, we propose a simple and effective passage reranking method, named Reader-guIDEd Reranker (RIDER), which does not involve training and reranks the retrieved passages solely based on the top predictions of the reader before reranking. We show that RIDER, despite its simplicity, achieves 10 to 20 absolute gains in top-1 retrieval accuracy and 1 to 4 Exact Match (EM) gains without refining the retriever or reader. In addition, RIDER, without any training, outperforms state-of-the-art transformer-based supervised rerankers. Remarkably, RIDER achieves 48.3 EM on the Natural Questions dataset and 66.4 EM on the TriviaQA dataset when only 1,024 tokens (7.8 passages on average) are used as the reader input after passage reranking.
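Because RIDER needs no training, its core can be sketched in a few lines; the exact scoring is an approximation (a binary contains-prediction test with a stable sort), assumed here for illustration.

```python
# RIDER-style reranking sketch: move retrieved passages that contain one of
# the reader's current top answer predictions to the front; no training.
def rider_rerank(passages, reader_top_predictions):
    preds = [p.lower() for p in reader_top_predictions]
    def hit(passage):                      # 1 if any prediction appears
        text = passage.lower()
        return any(pred in text for pred in preds)
    # Stable sort: hits move forward, original retrieval order is kept
    # within each group.
    return sorted(passages, key=lambda p: 0 if hit(p) else 1)

retrieved = ["The Eiffel Tower is in Paris.",
             "Gustave Eiffel designed a famous tower.",
             "The Louvre is a museum in Paris."]
print(rider_rerank(retrieved, ["Gustave Eiffel"]))
```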

Proceedings ArticleDOI
19 Apr 2021
TL;DR: In this paper, the authors propose a novel framework for minimally supervised text categorization by learning from a text-rich network, which jointly trains two modules with different inductive biases: a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Abstract: Text categorization is an essential task in Web content analysis. Considering the ever-evolving Web data and new emerging categories, instead of the laborious supervised setting, in this paper, we focus on the minimally-supervised setting that aims to categorize documents effectively, with a couple of seed documents annotated per category. We recognize that texts collected from the Web are often structure-rich, i.e., accompanied by various metadata. One can easily organize the corpus into a text-rich network, joining raw text documents with document attributes, high-quality phrases, label surface names as nodes, and their associations as edges. Such a network provides a holistic view of the corpus’ heterogeneous data sources and enables a joint optimization for network-based analysis and deep textual model training. We therefore propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases – a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Each module generates pseudo training labels from the unlabeled document set, and both modules mutually enhance each other by co-training using pooled pseudo labels. We test our model on two real-world datasets. On the challenging e-commerce product categorization dataset with 683 categories, our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%, significantly outperforming all compared methods; our accuracy is only less than 2% away from the supervised BERT model trained on about 50K labeled documents.
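The co-training loop described above can be sketched with placeholder learners; the MajorityModule below and the 0.9 confidence threshold are invented stand-ins for the paper's text and network modules, included only to make the loop runnable.

```python
# Co-training with pooled pseudo labels: two modules label the unlabeled
# pool, confident labels are pooled, and both modules retrain on the union.
import random

class MajorityModule:
    """Placeholder learner: predicts the most common category it has seen."""
    def fit(self, labeled):
        cats = list(labeled.values())
        self.guess = max(set(cats), key=cats.count) if cats else None
    def predict(self, doc):
        return self.guess, random.random()   # (category, confidence)

def cotrain(mod_a, mod_b, seeds, unlabeled, rounds=3, threshold=0.9):
    labeled = dict(seeds)                    # doc_id -> category
    for _ in range(rounds):
        mod_a.fit(labeled); mod_b.fit(labeled)
        pooled = {}
        for doc in unlabeled:
            for module in (mod_a, mod_b):
                cat, conf = module.predict(doc)
                if conf > threshold:         # keep only confident labels
                    pooled[doc] = cat
        labeled.update(pooled)               # both modules see pooled labels
    return labeled

random.seed(0)
print(cotrain(MajorityModule(), MajorityModule(),
              {"d1": "shoes", "d2": "shoes", "d3": "toys"},
              ["d4", "d5", "d6"]))
```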

Posted Content
TL;DR: This paper proposed a self-training method that uses contextualized augmentations created by pre-trained language models to improve the generalization ability of the NER model, achieving superior performance on three benchmark datasets.
Abstract: We study the problem of training named entity recognition (NER) models using only distantly-labeled data, which can be automatically obtained by matching entity mentions in the raw text with entity types in a knowledge base. The biggest challenge of distantly-supervised NER is that the distant supervision may induce incomplete and noisy labels, rendering the straightforward application of supervised learning ineffective. In this paper, we propose (1) a noise-robust learning scheme comprised of a new loss function and a noisy label removal step, for training NER models on distantly-labeled data, and (2) a self-training method that uses contextualized augmentations created by pre-trained language models to improve the generalization ability of the NER model. On three benchmark datasets, our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.
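The noise-robust pattern can be sketched with a bounded loss plus label removal; generalized cross-entropy and the confidence threshold below illustrate the general idea and should not be read as the paper's exact loss function or removal criterion.

```python
# Noise-robust sketch for distantly-supervised NER: a bounded loss
# (generalized cross-entropy) plus removal of labels the model itself
# finds unlikely.
import torch
import torch.nn.functional as F

def gce_loss(logits, labels, q=0.7):
    """Generalized cross-entropy: (1 - p_y^q) / q, less noise-sensitive."""
    p_y = F.softmax(logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.clamp_min(1e-8) ** q) / q).mean()

logits = torch.randn(16, 5)            # token logits over 5 entity types
labels = torch.randint(0, 5, (16,))    # distant (possibly wrong) labels

# Noisy-label removal: drop tokens whose label probability is very low.
with torch.no_grad():
    p_y = F.softmax(logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
keep = p_y > 0.05
print(float(gce_loss(logits[keep], labels[keep])))
```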

Proceedings ArticleDOI
01 Jun 2021
TL;DR: This article presented a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish for their experiment), and multiple data modalities (speech, text, image and video).
Abstract: We present a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish in our experiment), and multiple data modalities (speech, text, image, and video). The system advances the state of the art in two aspects: (1) extending from sentence-level event extraction to cross-document, cross-lingual, cross-media event extraction, coreference resolution, and temporal event tracking; (2) using a human-curated event schema library to match and enhance the extraction output. We have made the dockerized system publicly available for research purposes on GitHub, with a demo video.

Proceedings ArticleDOI
08 Mar 2021
TL;DR: BiTe-GCN is a novel GCN architecture that performs bidirectional convolution of both topology and features on text-rich networks, capturing both the global document-level information and the local text-sequence information from texts.
Abstract: Graph convolutional networks (GCNs), which aim to obtain node embeddings by integrating high-order neighborhood information through stacked graph convolution layers, have demonstrated great power in many network analysis tasks such as node classification and link prediction. However, fundamental topological limitations of GCNs, including over-smoothing and local homophily of topology, limit their ability to represent networks. Existing studies for addressing these topological limitations typically focus only on the convolution of features on network topology, which inevitably relies heavily on network structure. Moreover, most networks are text-rich, so it is important to integrate not only document-level information but also the local text-sequence information, which is particularly significant yet often ignored by existing methods. To overcome these limitations, we propose BiTe-GCN, a novel GCN architecture modeled via bidirectional convolution of topology and features on text-rich networks. Specifically, we first transform the original text-rich network into an augmented bi-typed heterogeneous network, capturing both the global document-level information and the local text-sequence information from texts. We then introduce discriminative convolution mechanisms, which perform convolution on this augmented bi-typed network, realizing the convolutions of topology and features altogether in the same system and automatically learning the different contributions of these two parts (i.e., the network part and the text part) for the given learning objectives. Extensive experiments on text-rich networks demonstrate that our new architecture outperforms the state of the art by a clear margin. This architecture can also be applied to several e-commerce search scenarios such as JD search, and experiments on the JD dataset show the superiority of the proposed architecture over related methods.
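The bi-typed network transformation can be shown with a toy construction; the edge-building rule below (plain word-to-document links with no weighting) is a deliberate simplification, and the discriminative convolutions themselves are omitted.

```python
# Toy construction of the augmented bi-typed heterogeneous network:
# document nodes plus word nodes, with document-word edges alongside
# the original document-document topology.
from collections import defaultdict

doc_edges = [("d1", "d2"), ("d2", "d3")]             # original topology
doc_text = {"d1": "graph neural network",
            "d2": "graph convolution model",
            "d3": "text rich network"}

adj = defaultdict(set)
for u, v in doc_edges:                               # document-document edges
    adj[u].add(v); adj[v].add(u)
for doc, text in doc_text.items():                   # document-word edges
    for w in text.split():
        adj[doc].add("w:" + w); adj["w:" + w].add(doc)

print(sorted(adj["w:graph"]))  # word node links d1 and d2 beyond topology
```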


Posted Content
TL;DR: MotifClass models the relationships between documents and metadata via a heterogeneous information network and uses motifs to describe metadata combinations, capturing higher-order structures that help weakly supervised text classification.
Abstract: We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only and without any annotated training document provided. Most existing approaches leverage textual information in each document. However, in many domains, documents are accompanied by various types of metadata (e.g., authors, venue, and year of a research paper). These metadata and their combinations may serve as strong category indicators in addition to textual contents. In this paper, we explore the potential of using metadata to help weakly supervised text classification. To be specific, we model the relationships between documents and metadata via a heterogeneous information network. To effectively capture higher-order structures in the network, we use motifs to describe metadata combinations. We propose a novel framework, named MotifClass, which (1) selects category-indicative motif instances, (2) retrieves and generates pseudo-labeled training samples based on category names and indicative motif instances, and (3) trains a text classifier using the pseudo training data. Extensive experiments on real-world datasets demonstrate the superior performance of MotifClass to existing weakly supervised text classification approaches. Further analysis shows the benefit of considering higher-order metadata information in our framework.
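Step (1) can be illustrated with a tiny motif-selection sketch; the (author, venue) motif, the toy records, and the 0.9 dominance threshold are all assumptions standing in for the paper's motif enumeration and selection criteria.

```python
# Hedged sketch of category-indicative motif selection: enumerate instances
# of a simple metadata motif, here (author, venue) on the same paper, and
# keep instances whose papers concentrate in one category.
from collections import Counter, defaultdict

papers = [{"id": 1, "author": "A. Smith", "venue": "SIGIR"},
          {"id": 2, "author": "A. Smith", "venue": "SIGIR"},
          {"id": 3, "author": "B. Jones", "venue": "ICML"}]
seed_category = {1: "IR", 2: "IR", 3: "ML"}    # from category-name matching

motif_to_cats = defaultdict(Counter)
for p in papers:
    motif_instance = (p["author"], p["venue"])  # higher-order than one field
    motif_to_cats[motif_instance][seed_category[p["id"]]] += 1

# A motif instance is category-indicative if one category dominates it.
for inst, cats in motif_to_cats.items():
    cat, cnt = cats.most_common(1)[0]
    if cnt / sum(cats.values()) >= 0.9:
        print(inst, "->", cat)
```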

Proceedings Article
01 Nov 2021
TL;DR: ChemNER, an ontology-guided, distantly-supervised method for fine-grained chemistry NER, is proposed to tackle the challenges of incomplete annotation and noisy annotation.
Abstract: Scientific literature analysis needs fine-grained named entity recognition (NER) to provide a wide range of information for scientific discovery. For example, chemistry research needs to study dozens to hundreds of distinct, fine-grained entity types, making consistent and accurate annotation difficult even for crowds of domain experts. On the other hand, domain-specific ontologies and knowledge bases (KBs) can be easily accessed, constructed, or integrated, which makes distant supervision realistic for fine-grained chemistry NER. In distant supervision, training labels are generated by matching mentions in a document with the concepts in the KBs. However, this kind of KB-matching suffers from two major challenges: incomplete annotation and noisy annotation. We propose ChemNER, an ontology-guided, distantly-supervised method for fine-grained chemistry NER to tackle these challenges. It leverages the chemistry type ontology structure to generate distant labels with novel methods of flexible KB-matching and ontology-guided multi-type disambiguation, significantly improving distant label generation for the subsequent sequence labeling model training. We also provide an expert-labeled chemistry NER dataset with 62 fine-grained chemistry types (e.g., chemical compounds and chemical reactions). Experimental results show that ChemNER is highly effective, substantially outperforming state-of-the-art NER methods (with a 0.25 absolute F1 score improvement).
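The KB-matching step that produces distant labels can be sketched simply; greedy longest-first matching against a toy dictionary is an illustration of the general pattern, not the paper's flexible matching or its disambiguation method.

```python
# Simplified distant labeling by dictionary matching: greedy longest match
# of spans against an ontology dictionary of typed chemistry terms.
ontology = {"sodium chloride": "inorganic_compound",
            "chloride": "ion",
            "hydrogenation": "chemical_reaction"}

def distant_label(sentence, max_len=4):
    words, i, spans = sentence.lower().split(), 0, []
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):  # longest first
            cand = " ".join(words[i:i + n])
            if cand in ontology:
                spans.append((cand, ontology[cand]))
                i += n
                break
        else:
            i += 1
    return spans

print(distant_label("Sodium chloride forms during hydrogenation"))
# [('sodium chloride', 'inorganic_compound'),
#  ('hydrogenation', 'chemical_reaction')]
```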

Proceedings Article
01 Nov 2021
TL;DR: Temporal Complex Event Schema is a graph-based schema representation that encompasses events, arguments, temporal connections, and argument relations; a companion Temporal Event Graph Model predicts event instances following the schema.
Abstract: Event schemas encode knowledge of stereotypical structures of events and their connections. As events unfold, schemas are crucial to act as a scaffolding. Previous work on event schema induction focuses either on atomic events or linear temporal event sequences, ignoring the interplay between events via arguments and argument relations. We introduce a new concept, the Temporal Complex Event Schema: a graph-based schema representation that encompasses events, arguments, temporal connections, and argument relations. In addition, we propose a Temporal Event Graph Model that predicts event instances following the temporal complex event schema. To build and evaluate such schemas, we release a new schema learning corpus containing 6,399 documents accompanied with event graphs, and we have manually constructed gold-standard schemas. Intrinsic evaluations by schema matching and instance graph perplexity prove the superior quality of our probabilistic graph schema library compared to linear representations. Extrinsic evaluation on schema-guided future event prediction further demonstrates the predictive power of our event graph model, significantly outperforming human schemas and baselines by more than 17.8% on HITS@1.
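The schema representation itself can be shown as a small data structure; the event types, roles, and relations below are illustrative inventions, not examples from the released corpus.

```python
# Sketch of a temporal complex event schema as a graph: event nodes with
# argument slots, temporal edges, and argument-relation edges. Shared
# variables (X, Y) encode the interplay between events via arguments.
from dataclasses import dataclass, field

@dataclass
class SchemaEvent:
    event_type: str
    args: dict = field(default_factory=dict)   # role -> entity variable

schema_events = {
    "e1": SchemaEvent("Transport", {"artifact": "X", "destination": "Y"}),
    "e2": SchemaEvent("Attack", {"instrument": "X", "place": "Y"}),
}
temporal_edges = [("e1", "e2")]                # e1 happens before e2
arg_relations = [("X", "located_in", "Y")]     # relation between arguments

for before, after in temporal_edges:
    shared = set(schema_events[before].args.values()) & \
             set(schema_events[after].args.values())
    print(f"{before} -> {after}, shared arguments: {shared}")
```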


Proceedings Article
03 May 2021
TL;DR: CoDA, a data augmentation framework that organically integrates multiple transformations, introduces a contrastive regularization to capture the global relationship among all the data samples, and further leverages a momentum encoder along with a memory bank to better estimate the contrastive loss.
Abstract: Data augmentation has been demonstrated as an effective strategy for improving model generalization and data efficiency. However, due to the discrete nature of natural language, designing label-preserving transformations for text data tends to be more challenging. In this paper, we propose a novel data augmentation framework dubbed CoDA, which synthesizes diverse and informative augmented examples by integrating multiple transformations organically. Moreover, a contrastive regularization is introduced to capture the global relationship among all the data samples. A momentum encoder along with a memory bank is further leveraged to better estimate the contrastive loss. To verify the effectiveness of the proposed framework, we apply CoDA to Transformer-based models on a wide range of natural language understanding tasks. On the GLUE benchmark, CoDA gives rise to an average improvement of 2.2% when applied to the RoBERTa-large model. More importantly, it consistently exhibits stronger results relative to several competitive data augmentation and adversarial training baselines, including in low-resource settings. Extensive experiments show that the proposed contrastive objective can be flexibly combined with various data augmentation approaches to further boost their performance, highlighting the wide applicability of the CoDA framework.
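The momentum-encoder-plus-memory-bank piece follows the MoCo-style pattern and can be sketched compactly; the linear encoders, sizes, temperature, and momentum value below are assumptions made to keep the sketch self-contained.

```python
# Hedged sketch of the contrastive regularization: InfoNCE whose negatives
# come from a memory bank, with keys produced by a momentum encoder.
import torch
import torch.nn.functional as F

dim, bank_size, batch = 32, 256, 8
encoder = torch.nn.Linear(dim, dim)                 # toy stand-in encoders
momentum_encoder = torch.nn.Linear(dim, dim)
momentum_encoder.load_state_dict(encoder.state_dict())
bank = F.normalize(torch.randn(bank_size, dim), dim=-1)   # memory bank

x, x_aug = torch.randn(batch, dim), torch.randn(batch, dim)  # two views
q = F.normalize(encoder(x), dim=-1)
with torch.no_grad():                        # keys from the momentum encoder
    k = F.normalize(momentum_encoder(x_aug), dim=-1)

# Positive logit first, then bank negatives; index 0 is the correct class.
logits = torch.cat([(q * k).sum(-1, keepdim=True), q @ bank.t()], dim=1) / 0.07
loss = F.cross_entropy(logits, torch.zeros(batch, dtype=torch.long))
loss.backward()

with torch.no_grad():                        # momentum update of key encoder
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.mul_(0.999).add_(p, alpha=0.001)
print(float(loss))                           # the bank would be refreshed with k
```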


Journal ArticleDOI
TL;DR: TELHIN, an unsupervised iterative clustering framework, links multiple similar tweets with a heterogeneous information network jointly, taking three dimensions of tweet similarity into consideration: content similarity, temporal similarity, and user similarity.
Abstract: Twitter, a microblogging platform, has developed into an increasingly invaluable information source, where millions of users post a great quantity of tweets with various topics per day. Heterogeneous information networks consisting of multi-type objects and relations are becoming more and more prevalent as an organization form of knowledge and information. The task of linking an entity mention in a tweet with its corresponding entity in a heterogeneous information network is of great importance, for the purpose of enriching heterogeneous information networks with the abundant and fresh knowledge embedded in tweets. However, the entity mention is ambiguous. Additionally, tweets are short and informal, making it difficult to mine enough information from a single tweet for entity linking. In this paper, we propose an unsupervised iterative clustering framework TELHIN to link multiple similar tweets with a heterogeneous information network jointly. Our framework takes three dimensions of tweet similarity into consideration: (1) content similarity, (2) temporal similarity, and (3) user similarity. The appropriate weights of different similarity dimensions for each entity mention are learned iteratively based on the metric learning algorithm by leveraging the pairwise constraints generated automatically. Experiments on real data demonstrate the effectiveness of our framework in comparison with the baselines.
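The three-dimensional similarity can be sketched as a weighted combination; in TELHIN the per-mention weights are learned by metric learning, so the fixed numbers and the specific similarity functions below are purely illustrative.

```python
# Toy three-dimensional tweet similarity: content (Jaccard), temporal
# (exponential decay), and user (identity), combined with fixed weights.
import math

def content_sim(a, b):
    wa, wb = set(a["text"].lower().split()), set(b["text"].lower().split())
    return len(wa & wb) / len(wa | wb)

def temporal_sim(a, b, scale_hours=24.0):
    return math.exp(-abs(a["hours"] - b["hours"]) / scale_hours)

def user_sim(a, b):
    return 1.0 if a["user"] == b["user"] else 0.0

def tweet_sim(a, b, w=(0.6, 0.25, 0.15)):   # weights learned in the paper
    return (w[0] * content_sim(a, b) + w[1] * temporal_sim(a, b)
            + w[2] * user_sim(a, b))

t1 = {"text": "Jordan wins the game", "hours": 1.0, "user": "u1"}
t2 = {"text": "what a game by Jordan", "hours": 3.0, "user": "u2"}
print(round(tweet_sim(t1, t2), 3))
```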

Proceedings Article
04 May 2021
TL;DR: A theoretically grounded framework for the transfer learning of GNNs is proposed: it identifies the essential graph information and advocates capturing it as the goal of transferable GNN training, which motivates the design of EGI (ego-graph information maximization) to analytically achieve this goal.
Abstract: Graph neural networks (GNNs) have shown superior performance in various applications, but training dedicated GNNs can be costly for large-scale graphs. Some recent work has started to study the pre-training of GNNs. However, none of it provides theoretical insights into the design of their frameworks, or clear requirements and guarantees towards the transferability of GNNs. In this work, we establish a theoretically grounded and practically useful framework for the transfer learning of GNNs. Firstly, we propose a novel view of the essential graph information and advocate capturing it as the goal of transferable GNN training, which motivates the design of EGI (ego-graph information maximization) to analytically achieve this goal. Secondly, we specify the requirement of structure-respecting node features as the GNN input, and conduct a rigorous analysis of GNN transferability based on the difference between the local graph Laplacians of the source and target graphs. Finally, we conduct controlled synthetic experiments to directly justify our theoretical conclusions. Extensive experiments on real-world networks towards role identification show consistent results in the rigorously analyzed setting of direct transferring (freezing parameters), while those towards large-scale relation prediction show promising results in the more generalized and practical setting of transferring with fine-tuning.
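The Laplacian-difference signal can be illustrated numerically; comparing sorted eigenvalues of same-size toy ego-graphs is a simplification of the paper's analysis, assumed here only to show the shape of the computation.

```python
# Sketch of the transferability signal: compare spectra of local (ego-graph)
# normalized Laplacians of source vs. target graphs; a small difference
# suggests better direct transfer.
import numpy as np

def normalized_laplacian(adj):
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt

def spectral_distance(adj_src, adj_tgt):
    ev_s = np.sort(np.linalg.eigvalsh(normalized_laplacian(adj_src)))
    ev_t = np.sort(np.linalg.eigvalsh(normalized_laplacian(adj_tgt)))
    return float(np.abs(ev_s - ev_t).mean())    # same-size ego-graphs

triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(spectral_distance(triangle, path))
```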


Posted Content
TL;DR: A three-stage evidence-enhanced DocRE framework is proposed, consisting of joint relation and evidence extraction, evidence-centered relation extraction (RE), and fusion of extraction results.
Abstract: Document-level relation extraction (DocRE) aims at extracting the semantic relations among entity pairs in a document. In DocRE, a subset of the sentences in a document, called the evidence sentences, might be sufficient for predicting the relation between a specific entity pair. To make better use of the evidence sentences, in this paper, we propose a three-stage evidence-enhanced DocRE framework consisting of joint relation and evidence extraction, evidence-centered relation extraction (RE), and fusion of extraction results. We first jointly train an RE model with a simple and memory-efficient evidence extraction model. Then, we construct pseudo documents based on the extracted evidence sentences and run the RE model again. Finally, we fuse the extraction results of the first two stages using a blending layer and make a final prediction. Extensive experiments show that our proposed framework achieves state-of-the-art performance on the DocRED dataset, outperforming the second-best method by 0.76/0.82 Ign F1/F1. In particular, our method significantly improves the performance on inter-sentence relations by 1.23 Inter F1.
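The stage-3 fusion can be sketched as score blending; in the paper the blend is a learned layer, so the fixed weight and threshold below are assumptions for illustration only.

```python
# Sketch of blending-based fusion: combine relation scores predicted from
# the full document with those from the evidence-only pseudo document.
def fuse_scores(full_doc_scores, evidence_scores, alpha=0.6, threshold=0.5):
    relations = set(full_doc_scores) | set(evidence_scores)
    fused = {r: alpha * full_doc_scores.get(r, 0.0)
                + (1 - alpha) * evidence_scores.get(r, 0.0)
             for r in relations}
    return {r: s for r, s in fused.items() if s > threshold}

full_doc = {("Bill", "founded", "Microsoft"): 0.7,
            ("Bill", "born_in", "Seattle"): 0.4}
evidence = {("Bill", "founded", "Microsoft"): 0.9,
            ("Bill", "born_in", "Seattle"): 0.8}
print(fuse_scores(full_doc, evidence))
```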
