
Showing papers by "Hiroyuki Shindo published in 2018"


Proceedings ArticleDOI
01 Jul 2018
TL;DR: This paper restores interpretability to adversarial training methods by restricting the directions of perturbations toward the existing words in the input embedding space; each input with perturbations can then be straightforwardly reconstructed as an actual text by considering the perturbations to be the replacement of words in the sentence, while maintaining or even improving the task performance.
Abstract: Following great success in the image processing field, the idea of adversarial training has been applied to tasks in the natural language processing (NLP) field. One promising approach directly applies adversarial training developed in the image processing field to the input word embedding space instead of the discrete input space of texts. However, this approach abandons such interpretability as generating adversarial texts to significantly improve the performance of NLP tasks. This paper restores interpretability to such methods by restricting the directions of perturbations toward the existing words in the input embedding space. As a result, we can straightforwardly reconstruct each input with perturbations to an actual text by considering the perturbations to be the replacement of words in the sentence while maintaining or even improving the task performance.
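
As a rough illustration of the restricted-perturbation idea, the sketch below (our own construction, not the authors' released code; the array names and the softmax weighting are assumptions) builds a perturbation as a gradient-aligned weighted sum of unit directions toward existing vocabulary embeddings, so the perturbed vector can be read back as a word replacement.

```python
# Minimal sketch: direction-restricted adversarial perturbation in the word
# embedding space, assuming we already have the gradient of the task loss
# with respect to one word's embedding.
import numpy as np

def directional_perturbation(word_vec, grad, embedding_matrix, epsilon=0.5):
    """word_vec: (d,), grad: (d,), embedding_matrix: (V, d)."""
    # Unit direction vectors from the current word toward every vocabulary word.
    diffs = embedding_matrix - word_vec                        # (V, d)
    norms = np.linalg.norm(diffs, axis=1, keepdims=True) + 1e-12
    directions = diffs / norms                                 # (V, d)

    # Weight each direction by its alignment with the loss gradient, so the
    # combined perturbation still increases the loss (i.e., stays adversarial).
    alignment = directions @ grad                              # (V,)
    weights = np.exp(alignment - alignment.max())
    weights /= weights.sum()                                   # softmax over vocabulary

    perturbation = epsilon * (weights @ directions)            # (d,)

    # Interpretation step: the perturbed vector maps back to the nearest
    # existing word, i.e., it can be read as a word replacement.
    perturbed = word_vec + perturbation
    nearest = int(np.argmin(np.linalg.norm(embedding_matrix - perturbed, axis=1)))
    return perturbation, nearest
```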

155 citations


Proceedings ArticleDOI
01 Jan 2018
TL;DR: In this article, a simple and accurate span-based model for semantic role labeling (SRL) is presented, which directly takes into account all possible argument spans and scores them for each label.
Abstract: We present a simple and accurate span-based model for semantic role labeling (SRL). Our model directly takes into account all possible argument spans and scores them for each label. At decoding time, we greedily select higher scoring labeled spans. One advantage of our model is to allow us to design and use span-level features, that are difficult to use in token-based BIO tagging approaches. Experimental results demonstrate that our ensemble model achieves the state-of-the-art results, 87.4 F1 and 87.0 F1 on the CoNLL-2005 and 2012 datasets, respectively.
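
The following sketch (with a hypothetical `score` function and span width, not the paper's actual model) illustrates the two steps described above: enumerating all candidate argument spans and greedily selecting the highest-scoring, non-overlapping labeled spans at decoding time.

```python
# Minimal sketch of span enumeration plus greedy decoding of labeled spans.
def enumerate_spans(n_tokens, max_width=30):
    # All (start, end) token spans up to a maximum width.
    return [(i, j) for i in range(n_tokens)
            for j in range(i, min(n_tokens, i + max_width))]

def greedy_decode(spans, labels, score):
    # Score every (span, label) pair, then greedily keep the best-scoring
    # spans whose token ranges do not overlap with already selected spans.
    scored = sorted(((score(s, l), s, l) for s in spans for l in labels),
                    reverse=True)
    selected, used = [], set()
    for sc, (i, j), label in scored:
        if sc <= 0:                                  # model declines to label this span
            continue
        if any(t in used for t in range(i, j + 1)):  # overlap with a chosen span
            continue
        selected.append(((i, j), label, sc))
        used.update(range(i, j + 1))
    return selected
```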

88 citations


Posted Content
15 Dec 2018
TL;DR: This tool enables users to easily obtain high-quality embeddings of words and entities from a Wikipedia dump with a single command and can be used as features in downstream natural language processing (NLP) models.
Abstract: We present Wikipedia2Vec, an open source tool for learning embeddings of words and entities from Wikipedia. This tool enables users to easily obtain high-quality embeddings of words and entities from a Wikipedia dump with a single command. The learned embeddings can be used as features in downstream natural language processing (NLP) models. The tool can be installed via PyPI. The source code, documentation, and pretrained embeddings for 12 major languages can be obtained at this http URL.
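
A typical usage pattern, assuming the PyPI package and its documented interface (command and method names may differ across versions), looks roughly like this:

```python
# Sketch of typical usage:
#   pip install wikipedia2vec
#   wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 enwiki_model
from wikipedia2vec import Wikipedia2Vec

model = Wikipedia2Vec.load('enwiki_model')        # file produced by the train command
word_vec = model.get_word_vector('tokyo')         # embedding of a word
entity_vec = model.get_entity_vector('Tokyo')     # embedding of a Wikipedia entity
```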

56 citations


Posted Content
TL;DR: The authors restore interpretability to adversarial training by restricting the directions of perturbations toward the existing words in the input embedding space, so that each input with perturbations can be straightforwardly reconstructed as an actual text by considering the perturbations to be the replacement of words in the sentence, while maintaining or even improving the task performance.
Abstract: Following great success in the image processing field, the idea of adversarial training has been applied to tasks in the natural language processing (NLP) field. One promising approach directly applies adversarial training developed in the image processing field to the input word embedding space instead of the discrete input space of texts. However, this approach abandons such interpretability as generating adversarial texts to significantly improve the performance of NLP tasks. This paper restores interpretability to such methods by restricting the directions of perturbations toward the existing words in the input embedding space. As a result, we can straightforwardly reconstruct each input with perturbations to an actual text by considering the perturbations to be the replacement of words in the sentence while maintaining or even improving the task performance.

24 citations


Book ChapterDOI
TL;DR: This chapter describes the question answering system, which was the winning system at the Human–Computer Question Answering (HCQA) Competition at the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS).
Abstract: In this chapter, we describe our question answering system, which was the winning system at the Human–Computer Question Answering (HCQA) Competition at the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS). The competition requires participants to address a factoid question answering task referred to as quiz bowl. To address this task, we use two novel neural network models and combine these models with conventional information retrieval models using a supervised machine learning model. Our system achieved the best performance among the systems submitted in the competition and won a match against six top human quiz experts by a wide margin.

15 citations


Proceedings Article
01 May 2018
TL;DR: The authors present PDFAnno, a web-based linguistic annotation tool for PDF documents, which offers functions for various types of linguistic annotations directly on PDF, including named entity, dependency relation, and coreference chain.
Abstract: We present PDFAnno, a web-based linguistic annotation tool for PDF documents. PDF has become a widespread standard for various types of publications; however, current tools for linguistic annotation mostly focus on plain-text documents. PDFAnno offers functions for various types of linguistic annotations directly on PDF, including named entity, dependency relation, and coreference chain. Furthermore, for multi-user support, it allows simultaneous visualization of multiple users' annotations on a single PDF, which is useful for checking inter-annotator agreement and resolving annotation conflicts. PDFAnno is freely available under an open-source license at https://github.com/paperai/pdfanno.

15 citations


Posted Content
TL;DR: The authors presented Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia, which enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument.
Abstract: The embeddings of entities in a large knowledge base (e.g., Wikipedia) are highly beneficial for solving various natural language tasks that involve real world knowledge. In this paper, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia. The proposed tool enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument. We also introduce a web-based demonstration of our tool that allows users to visualize and explore the learned embeddings. In our experiments, our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and competitive results on various standard benchmark datasets. Furthermore, our tool has been used as a key component in various recent studies. We publicize the source code, demonstration, and the pretrained embeddings for 12 languages at this https URL.

10 citations


Proceedings Article
07 Jun 2018
TL;DR: This article proposed TextEnt, a neural network model that learns distributed representations of entities and documents directly from a knowledge base (KB) and achieved state-of-the-art performance on fine-grained entity typing and multiclass text classification.
Abstract: In this paper, we describe TextEnt, a neural network model that learns distributed representations of entities and documents directly from a knowledge base (KB). Given a document in a KB consisting of words and entity annotations, we train our model to predict the entity that the document describes and map the document and its target entity close to each other in a continuous vector space. Our model is trained using a large number of documents extracted from Wikipedia. The performance of the proposed model is evaluated using two tasks, namely fine-grained entity typing and multiclass text classification. The results demonstrate that our model achieves state-of-the-art performance on both tasks. The code and the trained representations are made available online for further academic research.
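
The sketch below is our reading of the stated objective rather than the released code: a document is encoded as the mean of its word and entity-annotation embeddings and trained, via a softmax over all entities, to land near the embedding of the entity it describes. Vocabulary sizes and the toy inputs are placeholders.

```python
# Minimal sketch of the described training objective (PyTorch).
import torch
import torch.nn as nn

class TextEntSketch(nn.Module):
    def __init__(self, n_words, n_entities, dim=300):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)
        self.entity_emb = nn.Embedding(n_entities, dim)

    def forward(self, word_ids, entity_ids):
        # Document representation: mean of word and annotated-entity embeddings.
        doc = torch.cat([self.word_emb(word_ids),
                         self.entity_emb(entity_ids)], dim=0).mean(dim=0)
        # Score the document against every entity embedding.
        return self.entity_emb.weight @ doc

model = TextEntSketch(n_words=10_000, n_entities=5_000)
loss_fn = nn.CrossEntropyLoss()
word_ids = torch.randint(0, 10_000, (50,))   # toy document words
entity_ids = torch.randint(0, 5_000, (5,))   # toy entity annotations
target = torch.tensor(42)                    # entity the document describes
logits = model(word_ids, entity_ids)
loss = loss_fn(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()
```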

9 citations


Proceedings Article
01 May 2018
TL;DR: This work conducts large-scale annotations of VMWEs on the Wall Street Journal portion of English Ontonotes by a combination of automatic annotations and crowdsourcing, and formalizes VMWE annotations as a multiword sense disambiguation problem to exploit crowdsourcing.
Abstract: Multiword expressions (MWEs) consist of groups of tokens, which should be treated as a single syntactic or semantic unit. In this work, we focus on verbal MWEs (VMWEs), whose accurate recognition is challenging because they could be discontinuous (e.g., take .. off). Since previous English VMWE annotations are relatively small-scale in terms of VMWE occurrences and types, we conduct large-scale annotations of VMWEs on the Wall Street Journal portion of English Ontonotes by a combination of automatic annotations and crowdsourcing. Concretely, we first construct a VMWE dictionary based on the English-language Wiktionary. After that, we collect possible VMWE occurrences in Ontonotes and filter candidates with the help of gold dependency trees, then we formalize VMWE annotations as a multiword sense disambiguation problem to exploit crowdsourcing. As a result, we annotate 7,833 VMWE instances belonging to various categories, such as phrasal verbs, light verb constructions, and semi-fixed VMWEs. We hope this large-scale VMWE-annotated resource helps to develop models for MWE recognition and dependency parsing that are aware of English MWEs. Our resource is publicly available.
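
As a rough illustration of the candidate-collection step, the helper below (hypothetical, not the authors' pipeline) finds possibly discontinuous occurrences of a verbal MWE such as "take .. off" in a tokenized sentence, allowing a small gap between its parts; in the actual resource, candidates were further filtered using gold dependency trees.

```python
# Minimal sketch: collect candidate occurrences of a possibly discontinuous
# verbal MWE in a tokenized sentence, allowing a small gap between parts.
def find_vmwe_candidates(tokens, vmwe_parts, max_gap=3):
    candidates = []
    first, rest = vmwe_parts[0], vmwe_parts[1:]
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        positions, j = [i], i
        for part in rest:
            window = tokens[j + 1: j + 2 + max_gap]
            if part in window:
                j = j + 1 + window.index(part)
                positions.append(j)
            else:
                break
        else:
            candidates.append(positions)
    return candidates

# e.g. find_vmwe_candidates("she took her coat off".split(), ["took", "off"])
# -> [[1, 4]]
```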

9 citations


Posted Content
TL;DR: A novel approach to action-graph extraction from materials science papers, Text2Quest, is proposed, in which procedural text is interpreted as instructions for an interactive game; the framework can complement existing approaches and enables richer forms of learning than static texts.
Abstract: Understanding procedural text requires tracking entities, actions and effects as the narrative unfolds. We focus on the challenging real-world problem of action-graph extraction from material science papers, where language is highly specialized and data annotation is expensive and scarce. We propose a novel approach, Text2Quest, where procedural text is interpreted as instructions for an interactive game. A learning agent completes the game by executing the procedure correctly in a text-based simulated lab environment. The framework can complement existing approaches and enables richer forms of learning compared to static texts. We discuss potential limitations and advantages of the approach, and release a prototype proof-of-concept, hoping to encourage research in this direction.

7 citations


Proceedings ArticleDOI
28 Mar 2018
TL;DR: In this article, the effects of new charge control methods (Special Scan and Faster Scan), which are implemented in the latest Hitachi CD-SEM (CG6300), were examined with EUV resist hole-patterns.
Abstract: Accurate EPE (edge placement error) characterization is important for the process control of high-volume manufacturing at N5 BEOL and beyond. In CD-SEM metrology, accurate edge-to-edge measurements among multiple layers and/or SEM-Contour extraction are required for accurate EPE characterization. One of the technical challenges in CD-SEM metrology is to control charging effects caused by EB-irradiation during SEM image acquisition. In this paper, the effects of new charge control methods (Special Scan and Faster Scan), which are implemented in the latest Hitachi CD-SEM (CG6300), were examined with EUV resist hole-patterns. It was confirmed that Special Scan showed a profound effect on the suppression of charge-induced errors. We also demonstrated the effects of Special Scan for CD measurements and Contour Extraction for the EPE characterization of block on SAQP (SAQP lines + EUV block) patterns on the imec iN7 platform. Consequently, Special Scan is expected to be the solution for accurate EPE measurements by CD-SEM.

Posted Content
Abstract: We present a simple and accurate span-based model for semantic role labeling (SRL). Our model directly takes into account all possible argument spans and scores them for each label. At decoding time, we greedily select higher scoring labeled spans. One advantage of our model is to allow us to design and use span-level features, that are difficult to use in token-based BIO tagging approaches. Experimental results demonstrate that our ensemble model achieves the state-of-the-art results, 87.4 F1 and 87.0 F1 on the CoNLL-2005 and 2012 datasets, respectively.

Proceedings ArticleDOI
01 Jul 2018
TL;DR: A computer-assisted learning system, Jastudy, is presented, which is particularly designed for Chinese-speaking learners of Japanese as a second language (JSL) to learn Japanese functional expressions with suggestions of appropriate example sentences.
Abstract: We present a computer-assisted learning system, Jastudy, which is particularly designed for Chinese-speaking learners of Japanese as a second language (JSL) to learn Japanese functional expressions with suggestion of appropriate example sentences. The system automatically recognizes Japanese functional expressions using a free Japanese morphological analyzer MeCab, which is retrained on a new Conditional Random Fields (CRF) model. In order to select appropriate example sentences, we apply a pairwise-based machine learning tool, Support Vector Machine for Ranking (SVMrank) to estimate the complexity of the example sentences using Japanese–Chinese homographs as an important feature. In addition, we cluster the example sentences that contain Japanese functional expressions with two or more meanings and usages, based on part-of-speech, conjugation forms of verbs and semantic attributes, using the K-means clustering algorithm in Scikit-Learn. Experimental results demonstrate the effectiveness of our approach.
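
The clustering step could look roughly like the sketch below, which applies scikit-learn's KMeans to toy feature dictionaries standing in for the part-of-speech, conjugation-form, and semantic-attribute features; the feature names and values are our placeholders, not the system's actual features.

```python
# Minimal sketch of clustering example sentences by simple symbolic features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

# Toy feature dicts for example sentences containing the same functional expression.
sentence_features = [
    {"pos_prev": "noun", "conj": "te-form", "sem": "motion"},
    {"pos_prev": "verb", "conj": "plain",   "sem": "motion"},
    {"pos_prev": "noun", "conj": "te-form", "sem": "state"},
    {"pos_prev": "verb", "conj": "plain",   "sem": "state"},
]

X = DictVectorizer(sparse=False).fit_transform(sentence_features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # cluster id assigned to each example sentence
```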

Posted Content
15 Dec 2018
TL;DR: This paper presented Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia, which enables users to learn the embedding efficiently by issuing a single command with a Wikipedia dump file as an argument.
Abstract: The embeddings of entities in a large knowledge base (e.g., Wikipedia) are highly beneficial for solving various natural language tasks that involve real world knowledge. In this paper, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia. The proposed tool enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument. We also introduce a web-based demonstration of our tool that allows users to visualize and explore the learned embeddings. In our experiments, our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and competitive results on various standard benchmark datasets. Furthermore, our tool has been used as a key component in various recent studies. We publicize the source code, demonstration, and the pretrained embeddings for 12 languages at this https URL.

Posted Content
TL;DR: The authors proposed TextEnt, a neural network model that learns distributed representations of entities and documents directly from a knowledge base (KB) and achieved state-of-the-art performance on fine-grained entity typing and multiclass text classification.
Abstract: In this paper, we describe TextEnt, a neural network model that learns distributed representations of entities and documents directly from a knowledge base (KB). Given a document in a KB consisting of words and entity annotations, we train our model to predict the entity that the document describes and map the document and its target entity close to each other in a continuous vector space. Our model is trained using a large number of documents extracted from Wikipedia. The performance of the proposed model is evaluated using two tasks, namely fine-grained entity typing and multiclass text classification. The results demonstrate that our model achieves state-of-the-art performance on both tasks. The code and the trained representations are made available online for further academic research.

Proceedings Article
01 May 2018
TL;DR: The system uses Natural Language Processing technologies to extract information on chemical compounds from text and to store the extracted results as Linked Data (LD).
Abstract: This paper proposes a visualization system for chemical compounds. New chemical compounds are being produced every moment, and registration of chemical compounds to databases strongly depends on human labor. Our system uses Natural Language Processing technologies for extracting information on chemical compounds from text and for storing the extracted results as Linked Data (LD). By combining the extracted results with LD-based existing chemical compound knowledge, our system provides visualization of chemical compound information, such as an integrated view of several databases and of chemical compounds that have similar structures.

Proceedings Article
01 Jan 2018
TL;DR: A framework is proposed to correct the spelling and grammatical errors of Japanese functional expressions and to address the error data collection problem; experimental results indicate that the character-based method outperforms the word-based method on both artificial and real error data.
Abstract: Correcting spelling and grammatical errors of Japanese functional expressions shows practical usefulness for Japanese Second Language (JSL) learners. However, the collection of these types of error data is difficult because it relies on detecting Japanese functional expressions first. In this paper, we propose a framework to correct the spelling and grammatical errors of Japanese functional expressions as well as the error data collection problem. Firstly, we apply a bidirectional Long Short-Term Memory with a Conditional Random Field (BiLSTM-CRF) model to detect Japanese functional expressions. Secondly, we extract phrases which include Japanese functional expressions as well as their neighboring words from native Japanese and learners’ corpora. Then we generate a large scale of artificial error data via substitution, injection and deletion operations. Finally, we utilize the generated artificial error data to train a sequence-to-sequence neural machine translation model for correcting Japanese functional expression errors. We also compare the character-based method with the word-based method. The experimental results indicate that the character-based method outperforms the wordbased method both on artificial error data and real error data.
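
A minimal sketch of the artificial-error generation step is given below; the operations follow the substitution, injection, and deletion description above, but the character inventory, sampling choices, and function names are our assumptions rather than the paper's settings.

```python
# Minimal sketch: corrupt a phrase containing a functional expression via
# character-level substitution, injection, and deletion.
import random

KANA = list("あいうえおかきくけこにでをはが")   # placeholder character inventory

def corrupt(phrase, n_ops=1, seed=None):
    rng = random.Random(seed)
    chars = list(phrase)
    for _ in range(n_ops):
        op = rng.choice(["substitute", "inject", "delete"])
        pos = rng.randrange(len(chars))
        if op == "substitute":
            chars[pos] = rng.choice(KANA)
        elif op == "inject":
            chars.insert(pos, rng.choice(KANA))
        elif op == "delete" and len(chars) > 1:
            del chars[pos]
    return "".join(chars)

# e.g. corrupt("なければならない", n_ops=1, seed=0) -> one artificially corrupted variant
```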

Posted Content
10 Nov 2018
TL;DR: This work proposes an approach, Text2Quest, in which procedural text is interpreted as instructions for an interactive game, and a reinforcement-learning agent completes the game by understanding and executing the procedure correctly in a text-based simulated lab environment.
Abstract: Understanding procedural text requires tracking entities, actions and effects as the narrative unfolds (often implicitly). We focus on the challenging real-world problem of structured narrative extraction in the materials science domain, where language is highly specialized and suitable annotated data is not publicly available. We propose an approach, Text2Quest, where procedural text is interpreted as instructions for an interactive game. A reinforcement-learning agent completes the game by understanding and executing the procedure correctly, in a text-based simulated lab environment. The framework is intended to be more broadly applicable to other domain-specific and data-scarce settings. We conclude with a discussion of challenges and interesting potential extensions enabled by the agent-based perspective.

Proceedings Article
01 Aug 2018
TL;DR: The authors present two cooperating tools for lexicon and corpus management: Cradle, which stores words and expressions (with part-of-speech information and internal syntactic structures for multi-word expressions), and ChaKi, which manages text corpora annotated with part-of-speech and syntactic dependency structures.
Abstract: We present tools for lexicon and corpus management that offer cooperating functionality in corpus annotation. The former, named Cradle, stores a set of words and expressions where multi-word expressions are defined with their own part-of-speech information and internal syntactic structures. The latter, named ChaKi, manages text corpora with part-of-speech (POS) and syntactic dependency structure annotations. Those two tools cooperate so that the words and multi-word expressions stored in Cradle are directly referred to by ChaKi in conducting corpus annotation, and the words and expressions annotated in ChaKi can be output as a list of lexical entities that are to be stored in Cradle.