
Showing papers by "Hiroyuki Shindo published in 2018"


Proceedings ArticleDOI
01 Jul 2018
TL;DR: This paper restores interpretability to adversarial training methods by restricting the directions of perturbations toward the existing words in the input embedding space; each input with perturbations can then be straightforwardly reconstructed as an actual text by considering the perturbations to be the replacement of words in the sentence, while maintaining or even improving the task performance.
Abstract: Following great success in the image processing field, the idea of adversarial training has been applied to tasks in the natural language processing (NLP) field. One promising approach directly applies adversarial training developed in the image processing field to the input word embedding space instead of the discrete input space of texts. However, this approach abandons such interpretability as generating adversarial texts to significantly improve the performance of NLP tasks. This paper restores interpretability to such methods by restricting the directions of perturbations toward the existing words in the input embedding space. As a result, we can straightforwardly reconstruct each input with perturbations to an actual text by considering the perturbations to be the replacement of words in the sentence while maintaining or even improving the task performance.
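
As a rough illustration of the restricted-perturbation idea, the sketch below (our own construction, not the authors' released code; the array names and the softmax weighting are assumptions) builds a perturbation as a gradient-aligned weighted sum of unit directions toward existing vocabulary embeddings, so the perturbed vector can be read back as a word replacement.

```python
# Minimal sketch: direction-restricted adversarial perturbation in the word
# embedding space, assuming we already have the gradient of the task loss
# with respect to one word's embedding.
import numpy as np

def directional_perturbation(word_vec, grad, embedding_matrix, epsilon=0.5):
    """word_vec: (d,), grad: (d,), embedding_matrix: (V, d)."""
    # Unit direction vectors from the current word toward every vocabulary word.
    diffs = embedding_matrix - word_vec                        # (V, d)
    norms = np.linalg.norm(diffs, axis=1, keepdims=True) + 1e-12
    directions = diffs / norms                                 # (V, d)

    # Weight each direction by its alignment with the loss gradient, so the
    # combined perturbation still increases the loss (i.e., stays adversarial).
    alignment = directions @ grad                              # (V,)
    weights = np.exp(alignment - alignment.max())
    weights /= weights.sum()                                   # softmax over vocabulary

    perturbation = epsilon * (weights @ directions)            # (d,)

    # Interpretation step: the perturbed vector maps back to the nearest
    # existing word, i.e., it can be read as a word replacement.
    perturbed = word_vec + perturbation
    nearest = int(np.argmin(np.linalg.norm(embedding_matrix - perturbed, axis=1)))
    return perturbation, nearest
```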

155 citations


Proceedings ArticleDOI
01 Jan 2018
TL;DR: In this article, a simple and accurate span-based model for semantic role labeling (SRL) is presented, which directly takes into account all possible argument spans and scores them for each label.
Abstract: We present a simple and accurate span-based model for semantic role labeling (SRL). Our model directly takes into account all possible argument spans and scores them for each label. At decoding time, we greedily select higher scoring labeled spans. One advantage of our model is to allow us to design and use span-level features, that are difficult to use in token-based BIO tagging approaches. Experimental results demonstrate that our ensemble model achieves the state-of-the-art results, 87.4 F1 and 87.0 F1 on the CoNLL-2005 and 2012 datasets, respectively.
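
The following sketch (with a hypothetical `score` function and span width, not the paper's actual model) illustrates the two steps described above: enumerating all candidate argument spans and greedily selecting the highest-scoring, non-overlapping labeled spans at decoding time.

```python
# Minimal sketch of span enumeration plus greedy decoding of labeled spans.
def enumerate_spans(n_tokens, max_width=30):
    # All (start, end) token spans up to a maximum width.
    return [(i, j) for i in range(n_tokens)
            for j in range(i, min(n_tokens, i + max_width))]

def greedy_decode(spans, labels, score):
    # Score every (span, label) pair, then greedily keep the best-scoring
    # spans whose token ranges do not overlap with already selected spans.
    scored = sorted(((score(s, l), s, l) for s in spans for l in labels),
                    reverse=True)
    selected, used = [], set()
    for sc, (i, j), label in scored:
        if sc <= 0:                                  # model declines to label this span
            continue
        if any(t in used for t in range(i, j + 1)):  # overlap with a chosen span
            continue
        selected.append(((i, j), label, sc))
        used.update(range(i, j + 1))
    return selected
```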

88 citations


Posted Content
15 Dec 2018
TL;DR: This tool enables users to easily obtain high-quality embeddings of words and entities from a Wikipedia dump with a single command and can be used as features in downstream natural language processing (NLP) models.
Abstract: We present Wikipedia2Vec, an open source tool for learning embeddings of words and entities from Wikipedia. This tool enables users to easily obtain high-quality embeddings of words and entities from a Wikipedia dump with a single command. The learned embeddings can be used as features in downstream natural language processing (NLP) models. The tool can be installed via PyPI. The source code, documentation, and pretrained embeddings for 12 major languages can be obtained at this http URL.
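
A typical usage pattern, assuming the PyPI package and its documented interface (command and method names may differ across versions), looks roughly like this:

```python
# Sketch of typical usage:
#   pip install wikipedia2vec
#   wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 enwiki_model
from wikipedia2vec import Wikipedia2Vec

model = Wikipedia2Vec.load('enwiki_model')        # file produced by the train command
word_vec = model.get_word_vector('tokyo')         # embedding of a word
entity_vec = model.get_entity_vector('Tokyo')     # embedding of a Wikipedia entity
```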

56 citations


Posted Content
TL;DR: The authors restore interpretability to adversarial training by restricting the directions of perturbations toward the existing words in the input embedding space, so that each input with perturbations can be straightforwardly reconstructed as an actual text by considering the perturbations to be the replacement of words in the sentence, while maintaining or even improving the task performance.
Abstract: Following great success in the image processing field, the idea of adversarial training has been applied to tasks in the natural language processing (NLP) field. One promising approach directly applies adversarial training developed in the image processing field to the input word embedding space instead of the discrete input space of texts. However, this approach abandons such interpretability as generating adversarial texts to significantly improve the performance of NLP tasks. This paper restores interpretability to such methods by restricting the directions of perturbations toward the existing words in the input embedding space. As a result, we can straightforwardly reconstruct each input with perturbations to an actual text by considering the perturbations to be the replacement of words in the sentence while maintaining or even improving the task performance.

24 citations


Book ChapterDOI
TL;DR: This chapter describes the question answering system, which was the winning system at the Human–Computer Question Answering (HCQA) Competition at the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS).
Abstract: In this chapter, we describe our question answering system, which was the winning system at the Human–Computer Question Answering (HCQA) Competition at the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS). The competition requires participants to address a factoid question answering task referred to as quiz bowl. To address this task, we use two novel neural network models and combine these models with conventional information retrieval models using a supervised machine learning model. Our system achieved the best performance among the systems submitted in the competition and won a match against six top human quiz experts by a wide margin.

15 citations


Proceedings Article
01 May 2018
TL;DR: The authors present PDFAnno, a web-based linguistic annotation tool for PDF documents, which offers functions for various types of linguistic annotations directly on PDF, including named entity, dependency relation, and coreference chain.
Abstract: We present PDFAnno, a web-based linguistic annotation tool for PDF documents. PDF has become a widespread standard for various types of publications; however, current tools for linguistic annotation mostly focus on plain-text documents. PDFAnno offers functions for various types of linguistic annotations directly on PDF, including named entity, dependency relation, and coreference chain. Furthermore, for multi-user support, it allows simultaneous visualization of multiple users' annotations on a single PDF, which is useful for checking inter-annotator agreement and resolving annotation conflicts. PDFAnno is freely available under an open-source license at https://github.com/paperai/pdfanno.

15 citations


Posted Content
TL;DR: The authors presented Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia, which enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument.
Abstract: The embeddings of entities in a large knowledge base (e.g., Wikipedia) are highly beneficial for solving various natural language tasks that involve real world knowledge. In this paper, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia. The proposed tool enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument. We also introduce a web-based demonstration of our tool that allows users to visualize and explore the learned embeddings. In our experiments, our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and competitive results on various standard benchmark datasets. Furthermore, our tool has been used as a key component in various recent studies. We publicize the source code, demonstration, and the pretrained embeddings for 12 languages at this https URL.

10 citations


Proceedings Article
07 Jun 2018
TL;DR: This article proposed TextEnt, a neural network model that learns distributed representations of entities and documents directly from a knowledge base (KB) and achieved state-of-the-art performance on fine-grained entity typing and multiclass text classification.
Abstract: In this paper, we describe TextEnt, a neural network model that learns distributed representations of entities and documents directly from a knowledge base (KB). Given a document in a KB consisting of words and entity annotations, we train our model to predict the entity that the document describes and map the document and its target entity close to each other in a continuous vector space. Our model is trained using a large number of documents extracted from Wikipedia. The performance of the proposed model is evaluated using two tasks, namely fine-grained entity typing and multiclass text classification. The results demonstrate that our model achieves state-of-the-art performance on both tasks. The code and the trained representations are made available online for further academic research.
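
The sketch below is our reading of the stated objective rather than the released code: a document is encoded as the mean of its word and entity-annotation embeddings and trained, via a softmax over all entities, to land near the embedding of the entity it describes. Vocabulary sizes and the toy inputs are placeholders.

```python
# Minimal sketch of the described training objective (PyTorch).
import torch
import torch.nn as nn

class TextEntSketch(nn.Module):
    def __init__(self, n_words, n_entities, dim=300):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, dim)
        self.entity_emb = nn.Embedding(n_entities, dim)

    def forward(self, word_ids, entity_ids):
        # Document representation: mean of word and annotated-entity embeddings.
        doc = torch.cat([self.word_emb(word_ids),
                         self.entity_emb(entity_ids)], dim=0).mean(dim=0)
        # Score the document against every entity embedding.
        return self.entity_emb.weight @ doc

model = TextEntSketch(n_words=10_000, n_entities=5_000)
loss_fn = nn.CrossEntropyLoss()
word_ids = torch.randint(0, 10_000, (50,))   # toy document words
entity_ids = torch.randint(0, 5_000, (5,))   # toy entity annotations
target = torch.tensor(42)                    # entity the document describes
logits = model(word_ids, entity_ids)
loss = loss_fn(logits.unsqueeze(0), target.unsqueeze(0))
loss.backward()
```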

9 citations


Proceedings Article
01 May 2018
TL;DR: This work conducts large-scale annotations of VMWEs on the Wall Street Journal portion of English Ontonotes by a combination of automatic annotations and crowdsourcing, and formalizes VMWE annotations as a multiword sense disambiguation problem to exploit crowdsourcing.
Abstract: Multiword expressions (MWEs) consist of groups of tokens, which should be treated as a single syntactic or semantic unit. In this work, we focus on verbal MWEs (VMWEs), whose accurate recognition is challenging because they could be discontinuous (e.g., take .. off). Since previous English VMWE annotations are relatively small-scale in terms of VMWE occurrences and types, we conduct large-scale annotations of VMWEs on the Wall Street Journal portion of English Ontonotes by a combination of automatic annotations and crowdsourcing. Concretely, we first construct a VMWE dictionary based on the English-language Wiktionary. After that, we collect possible VMWE occurrences in Ontonotes and filter candidates with the help of gold dependency trees, then we formalize VMWE annotations as a multiword sense disambiguation problem to exploit crowdsourcing. As a result, we annotate 7,833 VMWE instances belonging to various categories, such as phrasal verbs, light verb constructions, and semi-fixed VMWEs. We hope this large-scale VMWE-annotated resource helps to develop models for MWE recognition and dependency parsing that are aware of English MWEs. Our resource is publicly available.
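
As a rough illustration of the candidate-collection step, the helper below (hypothetical, not the authors' pipeline) finds possibly discontinuous occurrences of a verbal MWE such as "take .. off" in a tokenized sentence, allowing a small gap between its parts; in the actual resource, candidates were further filtered using gold dependency trees.

```python
# Minimal sketch: collect candidate occurrences of a possibly discontinuous
# verbal MWE in a tokenized sentence, allowing a small gap between parts.
def find_vmwe_candidates(tokens, vmwe_parts, max_gap=3):
    candidates = []
    first, rest = vmwe_parts[0], vmwe_parts[1:]
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        positions, j = [i], i
        for part in rest:
            window = tokens[j + 1: j + 2 + max_gap]
            if part in window:
                j = j + 1 + window.index(part)
                positions.append(j)
            else:
                break
        else:
            candidates.append(positions)
    return candidates

# e.g. find_vmwe_candidates("she took her coat off".split(), ["took", "off"])
# -> [[1, 4]]
```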

9 citations


Posted Content
TL;DR: A novel approach to action-graph extraction from materials science papers, Text2Quest, is proposed, in which procedural text is interpreted as instructions for an interactive game; the framework can complement existing approaches and enables richer forms of learning than static texts.
Abstract: Understanding procedural text requires tracking entities, actions and effects as the narrative unfolds. We focus on the challenging real-world problem of action-graph extraction from material science papers, where language is highly specialized and data annotation is expensive and scarce. We propose a novel approach, Text2Quest, where procedural text is interpreted as instructions for an interactive game. A learning agent completes the game by executing the procedure correctly in a text-based simulated lab environment. The framework can complement existing approaches and enables richer forms of learning compared to static texts. We discuss potential limitations and advantages of the approach, and release a prototype proof-of-concept, hoping to encourage research in this direction.

7 citations


Proceedings ArticleDOI
28 Mar 2018
TL;DR: In this article, the effects of new charge control methods (Special Scan and Faster Scan), which are implemented in the latest Hitachi CD-SEM (CG6300), were examined with EUV resist hole-patterns.
Abstract: Accurate EPE (edge placement error) characterization is important for the process control of high-volume manufacturing at N5 BEOL and beyond. In CD-SEM metrology, accurate edge-to-edge measurements among multiple layers and/or SEM-Contour extraction are required for accurate EPE characterization. One of the technical challenges in CD-SEM metrology is to control charging effects caused by EB-irradiation during SEM image acquisition. In this paper, the effects of new charge control methods (Special Scan and Faster Scan), which are implemented in the latest Hitachi CD-SEM (CG6300), were examined with EUV resist hole-patterns. It was confirmed that Special Scan showed a profound effect on the suppression of charge-induced errors. We also demonstrated the effects of Special Scan for CD measurements and Contour Extraction for the EPE characterization of block on SAQP (SAQP lines + EUV block) patterns on the imec iN7 platform. Consequently, Special Scan is expected to be the solution for accurate EPE measurements by CD-SEM.

Posted Content
Abstract: We present a simple and accurate span-based model for semantic role labeling (SRL). Our model directly takes into account all possible argument spans and scores them for each label. At decoding time, we greedily select higher scoring labeled spans. One advantage of our model is to allow us to design and use span-level features, that are difficult to use in token-based BIO tagging approaches. Experimental results demonstrate that our ensemble model achieves the state-of-the-art results, 87.4 F1 and 87.0 F1 on the CoNLL-2005 and 2012 datasets, respectively.

Proceedings ArticleDOI
01 Jul 2018
TL;DR: A computer-assisted learning system, Jastudy, is presented, which is particularly designed for Chinese-speaking learners of Japanese as a second language (JSL) to learn Japanese functional expressions with suggestions of appropriate example sentences.
Abstract: We present a computer-assisted learning system, Jastudy, which is particularly designed for Chinese-speaking learners of Japanese as a second language (JSL) to learn Japanese functional expressions with suggestion of appropriate example sentences. The system automatically recognizes Japanese functional expressions using a free Japanese morphological analyzer MeCab, which is retrained on a new Conditional Random Fields (CRF) model. In order to select appropriate example sentences, we apply a pairwise-based machine learning tool, Support Vector Machine for Ranking (SVMrank) to estimate the complexity of the example sentences using Japanese–Chinese homographs as an important feature. In addition, we cluster the example sentences that contain Japanese functional expressions with two or more meanings and usages, based on part-of-speech, conjugation forms of verbs and semantic attributes, using the K-means clustering algorithm in Scikit-Learn. Experimental results demonstrate the effectiveness of our approach.
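
The clustering step could look roughly like the sketch below, which applies scikit-learn's KMeans to toy feature dictionaries standing in for the part-of-speech, conjugation-form, and semantic-attribute features; the feature names and values are our placeholders, not the system's actual features.

```python
# Minimal sketch of clustering example sentences by simple symbolic features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans

# Toy feature dicts for example sentences containing the same functional expression.
sentence_features = [
    {"pos_prev": "noun", "conj": "te-form", "sem": "motion"},
    {"pos_prev": "verb", "conj": "plain",   "sem": "motion"},
    {"pos_prev": "noun", "conj": "te-form", "sem": "state"},
    {"pos_prev": "verb", "conj": "plain",   "sem": "state"},
]

X = DictVectorizer(sparse=False).fit_transform(sentence_features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # cluster id assigned to each example sentence
```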

Posted Content
15 Dec 2018
TL;DR: This paper presented Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia, which enables users to learn the embedding efficiently by issuing a single command with a Wikipedia dump file as an argument.
Abstract: The embeddings of entities in a large knowledge base (e.g., Wikipedia) are highly beneficial for solving various natural language tasks that involve real world knowledge. In this paper, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia. The proposed tool enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument. We also introduce a web-based demonstration of our tool that allows users to visualize and explore the learned embeddings. In our experiments, our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and competitive results on various standard benchmark datasets. Furthermore, our tool has been used as a key component in various recent studies. We publicize the source code, demonstration, and the pretrained embeddings for 12 languages at this https URL.

Posted Content
TL;DR: The authors proposed TextEnt, a neural network model that learns distributed representations of entities and documents directly from a knowledge base (KB) and achieved state-of-the-art performance on fine-grained entity typing and multiclass text classification.
Abstract: In this paper, we describe TextEnt, a neural network model that learns distributed representations of entities and documents directly from a knowledge base (KB). Given a document in a KB consisting of words and entity annotations, we train our model to predict the entity that the document describes and map the document and its target entity close to each other in a continuous vector space. Our model is trained using a large number of documents extracted from Wikipedia. The performance of the proposed model is evaluated using two tasks, namely fine-grained entity typing and multiclass text classification. The results demonstrate that our model achieves state-of-the-art performance on both tasks. The code and the trained representations are made available online for further academic research.

Proceedings Article
01 May 2018
TL;DR: The system uses Natural Language Processing technologies to extract information on chemical compounds from text and to store the extracted results as Linked Data (LD).
Abstract: This paper proposes a visualization system for chemical compounds. New chemical compounds are being produced every moment, and registration of chemical compounds to databases strongly depends on human labor. Our system uses Natural Language Processing technologies for extracting information on chemical compounds from text and for storing the extracted results as Linked Data (LD). By combining the extracted results with LD-based existing chemical compound knowledge, our system provides visualization of chemical compound information, such as an integrated view of several databases and of chemical compounds that have similar structures.

Proceedings Article
01 Jan 2018
TL;DR: A framework is proposed to correct the spelling and grammatical errors of Japanese functional expressions and to address the error data collection problem; experimental results indicate that the character-based method outperforms the word-based method on both artificial and real error data.
Abstract: Correcting spelling and grammatical errors of Japanese functional expressions shows practical usefulness for Japanese Second Language (JSL) learners. However, the collection of these types of error data is difficult because it relies on detecting Japanese functional expressions first. In this paper, we propose a framework to correct the spelling and grammatical errors of Japanese functional expressions as well as the error data collection problem. Firstly, we apply a bidirectional Long Short-Term Memory with a Conditional Random Field (BiLSTM-CRF) model to detect Japanese functional expressions. Secondly, we extract phrases which include Japanese functional expressions as well as their neighboring words from native Japanese and learners’ corpora. Then we generate a large scale of artificial error data via substitution, injection and deletion operations. Finally, we utilize the generated artificial error data to train a sequence-to-sequence neural machine translation model for correcting Japanese functional expression errors. We also compare the character-based method with the word-based method. The experimental results indicate that the character-based method outperforms the wordbased method both on artificial error data and real error data.
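
A minimal sketch of the artificial-error generation step is given below; the operations follow the substitution, injection, and deletion description above, but the character inventory, sampling choices, and function names are our assumptions rather than the paper's settings.

```python
# Minimal sketch: corrupt a phrase containing a functional expression via
# character-level substitution, injection, and deletion.
import random

KANA = list("あいうえおかきくけこにでをはが")   # placeholder character inventory

def corrupt(phrase, n_ops=1, seed=None):
    rng = random.Random(seed)
    chars = list(phrase)
    for _ in range(n_ops):
        op = rng.choice(["substitute", "inject", "delete"])
        pos = rng.randrange(len(chars))
        if op == "substitute":
            chars[pos] = rng.choice(KANA)
        elif op == "inject":
            chars.insert(pos, rng.choice(KANA))
        elif op == "delete" and len(chars) > 1:
            del chars[pos]
    return "".join(chars)

# e.g. corrupt("なければならない", n_ops=1, seed=0) -> one artificially corrupted variant
```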

Posted Content
10 Nov 2018
TL;DR: This work proposes an approach, Text2Quest, in which procedural text is interpreted as instructions for an interactive game, and a reinforcement-learning agent completes the game by understanding and executing the procedure correctly in a text-based simulated lab environment.
Abstract: Understanding procedural text requires tracking entities, actions and effects as the narrative unfolds (often implicitly). We focus on the challenging real-world problem of structured narrative extraction in the materials science domain, where language is highly specialized and suitable annotated data is not publicly available. We propose an approach, Text2Quest, where procedural text is interpreted as instructions for an interactive game. A reinforcement-learning agent completes the game by understanding and executing the procedure correctly, in a text-based simulated lab environment. The framework is intended to be more broadly applicable to other domain-specific and data-scarce settings. We conclude with a discussion of challenges and interesting potential extensions enabled by the agent-based perspective.

Proceedings Article
01 Aug 2018
TL;DR: The authors present two cooperating tools for lexicon and corpus management: Cradle, which stores words and expressions (with part-of-speech information and internal syntactic structures for multi-word expressions), and ChaKi, which manages text corpora annotated with part-of-speech and syntactic dependency structures.
Abstract: We present tools for lexicon and corpus management that offer cooperating functionality in corpus annotation. The former, named Cradle, stores a set of words and expressions where multi-word expressions are defined with their own part-of-speech information and internal syntactic structures. The latter, named ChaKi, manages text corpora with part-of-speech (POS) and syntactic dependency structure annotations. Those two tools cooperate so that the words and multi-word expressions stored in Cradle are directly referred to by ChaKi in conducting corpus annotation, and the words and expressions annotated in ChaKi can be output as a list of lexical entities that are to be stored in Cradle.