Author

Xiaofei Sun

Other affiliations: Harbin Institute of Technology
Bio: Xiaofei Sun is an academic researcher from Stony Brook University. The author has contributed to research in topics: Language model & Computer science. The author has an h-index of 10 and has co-authored 44 publications receiving 483 citations. Previous affiliations of Xiaofei Sun include Harbin Institute of Technology.

Papers
Proceedings ArticleDOI
01 Jul 2020
TL;DR: This paper proposes to use dice loss in place of the standard cross-entropy objective for data-imbalanced NLP tasks. Dice loss is based on the Sørensen–Dice coefficient or the Tversky index, attaches similar importance to false positives and false negatives, and is more immune to the data-imbalance issue.
Abstract: Many NLP tasks such as tagging and machine reading comprehension face a severe data-imbalance issue: negative examples significantly outnumber positive examples, and the huge number of easy-negative examples overwhelms training. The most commonly used cross-entropy (CE) criterion is actually an accuracy-oriented objective, and thus creates a discrepancy between training and test: at training time, each training instance contributes equally to the objective function, while at test time the F1 score is concerned more with positive examples. In this paper, we propose to use dice loss in place of the standard cross-entropy objective for data-imbalanced NLP tasks. Dice loss is based on the Sørensen–Dice coefficient or the Tversky index, which attaches similar importance to false positives and false negatives and is more immune to the data-imbalance issue. To further alleviate the dominating influence of easy-negative examples in training, we propose to associate training examples with dynamically adjusted weights that de-emphasize easy-negative examples. Theoretical analysis shows that this strategy narrows the gap between the F1 score in evaluation and the dice loss in training. With the proposed training objective, we observe significant performance boosts on a wide range of data-imbalanced NLP tasks. Notably, we achieve SOTA results on CTB5, CTB6 and UD1.4 for part-of-speech tagging; SOTA results on CoNLL03, OntoNotes5.0, MSRA and OntoNotes4.0 for named entity recognition; along with competitive results on machine reading comprehension and paraphrase identification.
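As a rough illustration of the idea, the sketch below implements a dice-style loss with an example-level down-weighting factor in PyTorch. The exact weighting factor and smoothing constant are illustrative assumptions, not the paper's precise formulation.

```python
# A minimal sketch of a dice-style loss for imbalanced binary token
# classification. The (1 - p)**alpha factor down-weighting easy examples
# and the smoothing constant are assumptions for illustration.
import torch


def self_adjusting_dice_loss(logits, targets, alpha=1.0, smooth=1.0):
    """logits: (N,) raw scores; targets: (N,) 0/1 labels."""
    probs = torch.sigmoid(logits)
    # Confident (easy) predictions contribute less to the numerator/denominator.
    weighted = ((1.0 - probs) ** alpha) * probs
    numerator = 2.0 * weighted * targets + smooth
    denominator = weighted + targets + smooth
    return 1.0 - (numerator / denominator).mean()


if __name__ == "__main__":
    logits = torch.randn(8, requires_grad=True)
    targets = torch.tensor([0., 0., 0., 0., 0., 0., 1., 1.])  # imbalanced labels
    loss = self_adjusting_dice_loss(logits, targets)
    loss.backward()
    print(float(loss))
```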

195 citations

Proceedings Article
29 Jan 2019
TL;DR: Glyce, glyph vectors for Chinese character representations, is presented, and it is shown that glyph-based models consistently outperform word/char ID-based models across a wide range of Chinese NLP tasks.
Abstract: It is intuitive that NLP tasks for logographic languages like Chinese should benefit from the use of the glyph information in those languages. However, due to the lack of rich pictographic evidence in glyphs and the weak generalization ability of standard computer vision models on character data, an effective way to utilize the glyph information remains to be found. In this paper, we address this gap by presenting Glyce, glyph vectors for Chinese character representations. We make three major innovations: (1) we use historical Chinese scripts (e.g., bronzeware script, seal script, traditional Chinese, etc.) to enrich the pictographic evidence in characters; (2) we design CNN structures (called tianzege-CNN) tailored to Chinese character image processing; and (3) we use image classification as an auxiliary task in a multi-task learning setup to increase the model's ability to generalize. We show that glyph-based models are able to consistently outperform word/char ID-based models in a wide range of Chinese NLP tasks. When combined with BERT, we are able to set new state-of-the-art results for a variety of Chinese NLP tasks, including language modeling, tagging (NER, CWS, POS), sentence pair classification (BQ, LCQMC, XNLI, NLPCC-DBQA), single sentence classification (ChnSentiCorp, the Fudan corpus, iFeng), dependency parsing, and semantic role labeling. For example, the proposed model achieves an F1 score of 81.6 on the OntoNotes NER dataset, +1.5 over BERT; it achieves an almost perfect accuracy of 99.8% on the Fudan corpus for text classification.
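The sketch below gives a rough sense of what a glyph encoder in this spirit might look like: a small CNN over character bitmaps whose feature map is pooled to a 2x2 grid echoing the tianzege layout of a Chinese character. Layer sizes and the pooling choice are illustrative assumptions, not the published architecture.

```python
# A rough sketch of a glyph encoder: CNN over character bitmaps with a
# 2x2 "tianzege"-style pooling. All hyperparameters are assumptions.
import torch
import torch.nn as nn


class GlyphEncoder(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, padding=2),   # grayscale glyph -> feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.tianzege_pool = nn.AdaptiveMaxPool2d(2)       # pool to a 2x2 grid
        self.proj = nn.Linear(128 * 2 * 2, out_dim)

    def forward(self, glyphs):                              # (B, 1, H, W) character bitmaps
        feats = self.tianzege_pool(self.conv(glyphs))
        return self.proj(feats.flatten(1))                  # (B, out_dim) glyph vectors


if __name__ == "__main__":
    imgs = torch.rand(4, 1, 24, 24)      # four fake 24x24 character images
    print(GlyphEncoder()(imgs).shape)    # torch.Size([4, 256])
```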

123 citations

Proceedings ArticleDOI
01 Aug 2021
TL;DR: By jointly training the BERT and GCN modules within BertGCN, the proposed model is able to leverage the advantages of both worlds: large-scale pretraining, which exploits the massive amount of raw data, and transductive learning.
Abstract: In this work, we propose BertGCN, a model that combines large-scale pretraining and transductive learning for text classification. BertGCN constructs a heterogeneous graph over the dataset and represents documents as nodes using BERT representations. By jointly training the BERT and GCN modules within BertGCN, the proposed model is able to leverage the advantages of both worlds: large-scale pretraining, which takes advantage of the massive amount of raw data, and transductive learning, which jointly learns representations for both training data and unlabeled test data by propagating label influence through graph convolution. Experiments show that BertGCN achieves SOTA performance on a wide range of text classification datasets.
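A minimal sketch of the joint setup, assuming PyTorch: BERT features for document nodes feed both a one-layer graph convolution over the document graph and a plain linear classifier, and the two predictions are interpolated. The dense adjacency, single GCN layer, and the interpolation weight `lam` are simplifications, not the paper's exact training procedure.

```python
# A simplified BertGCN-style model: interpolate a GCN classifier over
# the document graph with a plain classifier on BERT features.
import torch
import torch.nn as nn


class BertGCNSketch(nn.Module):
    def __init__(self, hidden=768, num_classes=5, lam=0.7):
        super().__init__()
        self.gcn = nn.Linear(hidden, num_classes)   # one-layer GCN: A_hat @ X @ W
        self.cls = nn.Linear(hidden, num_classes)   # plain classifier on BERT features
        self.lam = lam

    def forward(self, x, a_hat):
        # x: (N, hidden) BERT features for N document nodes
        # a_hat: (N, N) normalized adjacency of the document graph
        gcn_logits = a_hat @ self.gcn(x)
        bert_logits = self.cls(x)
        probs = self.lam * gcn_logits.softmax(-1) + (1 - self.lam) * bert_logits.softmax(-1)
        return probs.log()                           # interpolated prediction


if __name__ == "__main__":
    x = torch.randn(10, 768)               # stand-in for BERT [CLS] vectors
    a = torch.eye(10)                       # trivial graph for the demo
    print(BertGCNSketch()(x, a).shape)      # torch.Size([10, 5])
```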

102 citations

Proceedings ArticleDOI
01 Jul 2019
TL;DR: The authors show that word-based models are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, and thus more prone to overfitting.
Abstract: Segmenting a chunk of text into words is usually the first step of processing Chinese text, but its necessity has rarely been explored. In this paper, we ask the fundamental question of whether Chinese word segmentation (CWS) is necessary for deep learning-based Chinese Natural Language Processing. We benchmark neural word-based models which rely on word segmentation against neural char-based models which do not involve word segmentation in four end-to-end NLP benchmark tasks: language modeling, machine translation, sentence matching/paraphrase and text classification. Through direct comparisons between these two types of models, we find that char-based models consistently outperform word-based models. Based on these observations, we conduct comprehensive experiments to study why word-based models underperform char-based models in these deep learning-based NLP tasks. We show that it is because word-based models are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, and thus more prone to overfitting. We hope this paper could encourage researchers in the community to rethink the necessity of word segmentation in deep learning-based Chinese Natural Language Processing.
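A toy illustration of the OOV argument above: a compound never seen as a single word during training is out-of-vocabulary for a word-based model, while every character it contains may already be known to a char-based model. The two-sentence "corpus" below is fabricated for the example; real vocabulary-size gaps are far larger.

```python
# Toy comparison of word-level vs character-level OOV behavior.
train_words = [["我们", "喜欢", "自然", "语言", "处理"],
               ["他们", "研究", "机器", "翻译"]]
test_words = [["我们", "研究", "语言处理"]]       # "语言处理" never seen as one word

word_vocab = {w for sent in train_words for w in sent}
char_vocab = {c for sent in train_words for w in sent for c in w}

oov_words = [w for s in test_words for w in s if w not in word_vocab]
oov_chars = [c for s in test_words for w in s for c in w if c not in char_vocab]

print(oov_words)   # ['语言处理'] -- the word-based model must back off to <UNK>
print(oov_chars)   # []           -- the char-based model has seen every character
```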

91 citations

Posted Content
TL;DR: This paper proposes content-enhanced network embedding (CENE), which is capable of jointly leveraging the network structure and the content information, and shows that its models outperform all existing network embedding methods, demonstrating the merits of content information and joint learning.
Abstract: This paper investigates the problem of network embedding, which aims at learning low-dimensional vector representations of nodes in networks. Most existing network embedding methods rely solely on the network structure, i.e., the linkage relationships between nodes, but ignore the rich content information associated with nodes, which is common in real-world networks and beneficial to describing the characteristics of a node. In this paper, we propose content-enhanced network embedding (CENE), which is capable of jointly leveraging the network structure and the content information. Our approach integrates text modeling and structure modeling in a general framework by treating the content information as a special kind of node. Experiments on several real-world networks with application to node classification show that our models outperform all existing network embedding methods, demonstrating the merits of content information and joint learning.
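A minimal sketch of the joint objective, assuming PyTorch: structural (node-node) edges and content (node-text) edges are scored in one shared embedding space. The bag-of-words text encoder and the logistic edge loss are simplifying assumptions rather than CENE's exact model.

```python
# Sketch: content treated as a special kind of node, with node-node and
# node-text edges sharing one embedding space and one loss form.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CENESketch(nn.Module):
    def __init__(self, num_nodes, vocab_size, dim=128):
        super().__init__()
        self.node_emb = nn.Embedding(num_nodes, dim)       # structural node vectors
        self.word_emb = nn.EmbeddingBag(vocab_size, dim)   # averaged words = content node

    def edge_loss(self, u, v, label):
        score = (self.node_emb(u) * self.node_emb(v)).sum(-1)
        return F.binary_cross_entropy_with_logits(score, label)

    def content_loss(self, u, text, label):
        score = (self.node_emb(u) * self.word_emb(text)).sum(-1)
        return F.binary_cross_entropy_with_logits(score, label)


if __name__ == "__main__":
    model = CENESketch(num_nodes=100, vocab_size=5000)
    u, v = torch.tensor([3]), torch.tensor([7])
    text = torch.tensor([[11, 42, 7, 99]])                 # word ids of node 3's text
    pos = torch.ones(1)
    loss = model.edge_loss(u, v, pos) + model.content_loss(u, text, pos)
    loss.backward()
    print(float(loss))
```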

83 citations


Cited by
Journal ArticleDOI
TL;DR: Network embedding assigns nodes in a network to low-dimensional representations and effectively preserves the network structure as discussed by the authors, and a significant amount of progress has been made toward this emerging network analysis paradigm.
Abstract: Network embedding assigns nodes in a network to low-dimensional representations and effectively preserves the network structure. Recently, significant progress has been made toward this emerging network analysis paradigm. In this survey, we focus on categorizing and then reviewing the current development of network embedding methods, and point out future research directions. We first summarize the motivation of network embedding. We discuss the classical graph embedding algorithms and their relationship with network embedding. Afterwards, and primarily, we provide a comprehensive overview of a large number of network embedding methods in a systematic manner, covering structure- and property-preserving network embedding methods, network embedding methods with side information, and advanced information-preserving network embedding methods. Moreover, several evaluation approaches for network embedding and some useful online resources, including network data sets and software, are also reviewed. Finally, we discuss the framework of exploiting these network embedding methods to build an effective system and point out some potential future directions.

929 citations

Journal ArticleDOI
TL;DR: A comprehensive review on existing deep learning techniques for NER is provided in this paper, where the authors systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder.
Abstract: Named entity recognition (NER) is the task of identifying text spans that mention named entities and classifying them into predefined categories such as person, location, organization, etc. NER serves as the basis for a variety of natural language applications such as question answering, text summarization, and machine translation. Although early NER systems were successful in producing decent recognition accuracy, they often require much human effort in carefully designing rules or features. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding state-of-the-art performance. In this paper, we provide a comprehensive review of existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods for recent applied techniques of deep learning in new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area.

474 citations

Posted Content
TL;DR: This paper provides a comprehensive review of existing deep learning techniques for NER, introduces NER resources including tagged NER corpora and off-the-shelf NER tools, and systematically categorizes existing works based on a taxonomy along three axes.
Abstract: Named entity recognition (NER) is the task of identifying mentions of rigid designators in text belonging to predefined semantic types such as person, location, organization, etc. NER serves as the foundation for many natural language applications such as question answering, text summarization, and machine translation. Early NER systems achieved considerable success at the cost of human engineering in designing domain-specific features and rules. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding state-of-the-art performance. In this paper, we provide a comprehensive review of existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works based on a taxonomy along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods for recent applied techniques of deep learning in new NER problem settings and applications. Finally, we present readers with the challenges faced by NER systems and outline future directions in this area.

381 citations

Posted Content
Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, Jiwei Li
TL;DR: This paper proposes a unified framework capable of handling both flat and nested NER tasks: instead of treating NER as a sequence labeling problem, the authors formulate it as a machine reading comprehension (MRC) task.
Abstract: The task of named entity recognition (NER) is normally divided into nested NER and flat NER depending on whether named entities are nested or not. Models are usually developed separately for the two tasks, since sequence labeling models, the most widely used backbone for flat NER, are only able to assign a single label to a particular token, which is unsuitable for nested NER where a token may be assigned several labels. In this paper, we propose a unified framework that is capable of handling both flat and nested NER tasks. Instead of treating the task of NER as a sequence labeling problem, we propose to formulate it as a machine reading comprehension (MRC) task. For example, extracting entities with the PER label is formalized as extracting answer spans to the question "which person is mentioned in the text?". This formulation naturally tackles the entity overlapping issue in nested NER: extracting two overlapping entities of different categories requires answering two independent questions. Additionally, since the query encodes informative prior knowledge, this strategy facilitates the process of entity extraction, leading to better performance not only for nested NER but also for flat NER. We conduct experiments on both nested and flat NER datasets. Experimental results demonstrate the effectiveness of the proposed formulation. We achieve large performance gains over current SOTA models on nested NER datasets, i.e., +1.28, +2.55, +5.44, +6.37 respectively on ACE04, ACE05, GENIA and KBP17, along with SOTA results on flat NER datasets, i.e., +0.24, +1.95, +0.21, +1.49 respectively on English CoNLL 2003, English OntoNotes 5.0, Chinese MSRA, Chinese OntoNotes 4.0.
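A minimal sketch of the MRC formulation: a natural-language query per entity type is paired with the sentence, and start/end classifiers over the encoded tokens extract answer spans. The random tensor below stands in for BERT encodings of "[CLS] query [SEP] text"; the actual model's span-matching component is omitted, and the query wordings beyond PER are illustrative.

```python
# Sketch: NER as machine reading comprehension with one query per
# entity type and start/end span classifiers over token encodings.
import torch
import torch.nn as nn

QUERIES = {"PER": "which person is mentioned in the text?",
           "LOC": "which location is mentioned in the text?"}   # LOC wording assumed


class MRCSpanHead(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.start = nn.Linear(hidden, 1)   # logit that a token starts a span
        self.end = nn.Linear(hidden, 1)     # logit that a token ends a span

    def forward(self, token_reprs):         # (B, T, hidden), e.g. BERT([query; text])
        return self.start(token_reprs).squeeze(-1), self.end(token_reprs).squeeze(-1)


if __name__ == "__main__":
    reprs = torch.randn(1, 32, 768)         # stand-in for encoded "[CLS] query [SEP] text"
    start_logits, end_logits = MRCSpanHead()(reprs)
    # One query per entity type: overlapping entities of different types
    # are simply answers to different, independent questions.
    print(QUERIES["PER"], start_logits.shape, end_logits.shape)
```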

282 citations

Proceedings ArticleDOI
01 Jul 2017
TL;DR: Context-Aware Network Embedding (CANE), a novel NE model that learns context-aware embeddings for vertices with a mutual attention mechanism and is expected to model the semantic relationships between vertices more precisely, is presented.
Abstract: Network embedding (NE) is playing a critical role in network analysis, due to its ability to represent vertices with efficient low-dimensional embedding vectors. However, existing NE models aim to learn a fixed, context-free embedding for each vertex and neglect the diverse roles a vertex plays when interacting with other vertices. In this paper, we assume that one vertex usually shows different aspects when interacting with different neighbor vertices, and should own different embeddings respectively. Therefore, we present Context-Aware Network Embedding (CANE), a novel NE model to address this issue. CANE learns context-aware embeddings for vertices with a mutual attention mechanism and is expected to model the semantic relationships between vertices more precisely. In experiments, we compare our model with existing NE models on three real-world datasets. Experimental results show that CANE achieves significant improvement over state-of-the-art methods on link prediction and comparable performance on vertex classification. The source code and datasets can be obtained from https://github.com/thunlp/CANE.
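A rough sketch of the mutual attention step, assuming PyTorch: a correlation matrix between the word vectors of two linked vertices' texts decides which words to emphasize on each side, producing edge-specific (context-aware) text embeddings. The matrix name and pooling choices follow a generic attentive-pooling pattern and are assumptions, not CANE verbatim.

```python
# Sketch of mutual attention between the texts of two linked vertices.
import torch
import torch.nn as nn


class MutualAttention(nn.Module):
    def __init__(self, dim=100):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, dim) * 0.01)  # attentive matrix (assumed name)

    def forward(self, p, q):
        # p: (m, dim) word vectors of vertex u's text; q: (n, dim) of vertex v's text
        f = torch.tanh(p @ self.A @ q.t())           # (m, n) word-word correlation scores
        alpha = f.mean(dim=1).softmax(dim=0)         # importance of u's words given v
        beta = f.mean(dim=0).softmax(dim=0)          # importance of v's words given u
        return alpha @ p, beta @ q                   # context-aware text embeddings


if __name__ == "__main__":
    att = MutualAttention()
    u_text, v_text = torch.randn(12, 100), torch.randn(9, 100)
    u_ctx, v_ctx = att(u_text, v_text)
    print(u_ctx.shape, v_ctx.shape)                  # torch.Size([100]) torch.Size([100])
```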

257 citations