Proceedings Article

Open Language Learning for Information Extraction

TL;DR: OLLIE improves on ReVerb by extracting relations mediated by nouns, adjectives, and more, and by adding contextual information from the sentence to its extractions to increase precision.
Abstract: Open Information Extraction (IE) systems extract relational tuples from text, without requiring a pre-specified vocabulary, by identifying relation phrases and associated arguments in arbitrary sentences. However, state-of-the-art Open IE systems such as ReVerb and WOE share two important weaknesses -- (1) they extract only relations that are mediated by verbs, and (2) they ignore context, thus extracting tuples that are not asserted as factual. This paper presents OLLIE, a substantially improved Open IE system that addresses both these limitations. First, OLLIE achieves high yield by extracting relations mediated by nouns, adjectives, and more. Second, a context-analysis step increases precision by including contextual information from the sentence in the extractions. OLLIE obtains 2.7 times the area under the precision-yield curve (AUC) compared to ReVerb and 1.9 times the AUC of WOE-parse.
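
To make the output concrete, here is a minimal sketch (plain Python, not the authors' code) of the kind of context-augmented tuple the abstract describes; the sentence, class, and field names are illustrative assumptions, not OLLIE's actual data structures.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Extraction:
        """An Open IE style (arg1, relation, arg2) tuple, optionally wrapped
        with the contextual information that a context-analysis step recovers."""
        arg1: str
        rel: str
        arg2: str
        attribution: Optional[str] = None       # who asserted the claim, if anyone
        clausal_modifier: Optional[str] = None  # e.g. a condition the claim depends on

    # "Early astronomers believed that the earth is the center of the universe."
    # A context-blind extractor would emit (the earth; is the center of; the universe)
    # as asserted fact; keeping the attribution preserves the intended meaning.
    ex = Extraction(
        arg1="the earth",
        rel="is the center of",
        arg2="the universe",
        attribution="Early astronomers believed",
    )
    print(ex)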


Citations
Proceedings Article
01 Oct 2013
TL;DR: This paper trains a semantic parser that scales up to Freebase and, despite not having annotated logical forms, outperforms the state-of-the-art parser of Cai and Yates (2013) on their dataset.
Abstract: In this paper, we train a semantic parser that scales up to Freebase. Instead of relying on annotated logical forms, which are especially expensive to obtain at large scale, we learn from question-answer pairs. The main challenge in this setting is narrowing down the huge number of possible logical predicates for a given question. We tackle this problem in two ways: First, we build a coarse mapping from phrases to predicates using a knowledge base and a large text corpus. Second, we use a bridging operation to generate additional predicates based on neighboring predicates. On the dataset of Cai and Yates (2013), despite not having annotated logical forms, our system outperforms their state-of-the-art parser. Additionally, we collected a more realistic and challenging dataset of question-answer pairs, on which our system improves over a natural baseline.
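
As a caricature of learning from question-answer pairs rather than logical forms, the toy below enumerates candidate logical forms, executes them against a miniature knowledge base, and keeps only those whose denotation matches the given answer. The KB, lexicon, and predicate names are invented for illustration; this is not Freebase or the paper's actual system.

    # Toy knowledge base: predicate -> {subject: object}
    KB = {
        "place_of_birth": {"BarackObama": "Honolulu"},
        "place_of_death": {"BarackObama": None},
    }

    # Coarse lexicon from question words to candidate predicates
    # (the paper builds such a mapping from a KB and a large text corpus).
    LEXICON = {"born": ["place_of_birth", "place_of_death"]}

    def candidate_logical_forms(question, entity):
        """Enumerate (predicate, entity) candidates triggered by question words."""
        for word in question.lower().split():
            for pred in LEXICON.get(word.strip("?,."), []):
                yield (pred, entity)

    def execute(lf):
        predicate, entity = lf
        return KB.get(predicate, {}).get(entity)

    def consistent_forms(question, entity, answer):
        """Keep only logical forms whose denotation matches the answer --
        the only supervision available when logical forms are not annotated."""
        return [lf for lf in candidate_logical_forms(question, entity)
                if execute(lf) == answer]

    print(consistent_forms("Where was Obama born?", "BarackObama", "Honolulu"))
    # -> [('place_of_birth', 'BarackObama')]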

1,738 citations


Cites background from "Open Language Learning for Information Extraction"

  • ...Finally, although Freebase has thousands of properties, open information extraction (Banko et al., 2007; Fader et al., 2011; Mausam et al., 2012) and associated question answering systems (Fader et al....

Proceedings ArticleDOI
24 Aug 2014
TL;DR: The Knowledge Vault is a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories, and computes calibrated probabilities of fact correctness.
Abstract: Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft's Satori, and Google's Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.
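
The fusion step can be pictured as a small supervised combiner over per-extractor confidences. The sketch below (scikit-learn logistic regression with toy numbers) only illustrates that idea, not Google's pipeline; the sources, scores, and labels are made up.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row: confidence that a candidate triple is true, as reported by
    # (text extractor, table extractor, prior from an existing KB); 0.0 = no signal.
    # Labels: whether the triple is actually correct, from a small gold set.
    X = np.array([
        [0.9, 0.8, 0.7],
        [0.6, 0.0, 0.9],
        [0.2, 0.1, 0.0],
        [0.8, 0.0, 0.1],
        [0.1, 0.0, 0.6],
        [0.7, 0.9, 0.0],
    ])
    y = np.array([1, 1, 0, 0, 0, 1])

    # The fuser learns how much to trust each source and outputs a probability
    # of fact correctness (a real system would add an explicit calibration step).
    fuser = LogisticRegression().fit(X, y)
    candidate = np.array([[0.85, 0.0, 0.8]])
    print(fuser.predict_proba(candidate)[0, 1])

In practice the feature vector is much richer (per-extractor counts, provenance, entity types), but the shape of the computation is the same: many noisy signals in, one fused probability out.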

1,657 citations


Cites methods from "Open Language Learning for Information Extraction"

  • ...This literature can be clustered into 4 main groups: (1) approaches such as YAGO [37], YAGO2 [18], DBpedia [3], and Freebase [4], which are built on Wikipedia infoboxes and other structured data sources; (2) approaches such as Reverb [11], OLLIE [24], and PRISMATIC [12], which use open information (schema-less) extraction techniques applied to the entire web; (3) approaches such as NELL/ReadTheWeb [8], PROSPERA [28], and DeepDive/Elementary [30], which extract information from the entire web, but use a fixed ontology/schema; and (4) approaches such as Probase [44], which construct taxonomies (is-a hierarchies), as opposed to general KBs with multiple types of predicates....

Journal ArticleDOI
01 Jan 2016
TL;DR: This paper provides a review of how statistical models can be “trained” on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph) and how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web.
Abstract: Relational machine learning studies methods for the statistical analysis of relational, or graph-structured, data. In this paper, we provide a review of how such statistical models can be “trained” on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph). In particular, we discuss two fundamentally different kinds of statistical relational models, both of which can scale to massive data sets. The first is based on latent feature models such as tensor factorization and multiway neural networks. The second is based on mining observable patterns in the graph. We also show how to combine these latent and observable models to get improved modeling power at decreased computational cost. Finally, we discuss how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web. To this end, we also discuss Google's knowledge vault project as an example of such combination.
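
As a concrete instance of the latent-feature family the review discusses, here is a minimal RESCAL-style bilinear scorer in NumPy. The entity and relation embeddings are random stand-ins rather than trained values, so the printed scores only demonstrate the computation, not real predictions.

    import numpy as np

    rng = np.random.default_rng(0)
    DIM = 16

    # Random stand-in embeddings; a real system learns these by fitting
    # the observed edges of the knowledge graph.
    entities = {name: rng.normal(size=DIM) for name in ["Rome", "Italy", "Paris"]}
    relations = {"capital_of": rng.normal(size=(DIM, DIM))}

    def score(head, rel, tail):
        """Bilinear score  e_head^T W_rel e_tail ; higher means the model
        considers the corresponding edge more plausible."""
        return float(entities[head] @ relations[rel] @ entities[tail])

    for tail in ["Italy", "Paris"]:
        print(f"score(Rome, capital_of, {tail}) = {score('Rome', 'capital_of', tail):.2f}")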

1,452 citations


Cites background from "Open Language Learning for Information Extraction"

  • ...Unstructured No ReVerb [32], OLLIE [33], PRISMATIC [34]...

Proceedings ArticleDOI
01 Jul 2015
TL;DR: This work replaces the large pattern sets used by prior Open IE systems with a few patterns for canonically structured sentences, and shifts the focus to a classifier that learns to extract self-contained clauses from longer sentences, over which natural logic inference determines the maximally specific arguments for each candidate triple.
Abstract: Relation triples produced by open domain information extraction (open IE) systems are useful for question answering, inference, and other IE tasks. Traditionally these are extracted using a large set of patterns; however, this approach is brittle on out-of-domain text and long-range dependencies, and gives no insight into the substructure of the arguments. We replace this large pattern set with a few patterns for canonically structured sentences, and shift the focus to a classifier which learns to extract self-contained clauses from longer sentences. We then run natural logic inference over these short clauses to determine the maximally specific arguments for each candidate triple. We show that our approach outperforms a state-of-the-art open IE system on the end-to-end TAC-KBP 2013 Slot Filling task.

704 citations


Cites background or methods from "Open Language Learning for Information Extraction"

  • ...Systems like Ollie (Mausam et al., 2012) approach this problem by using a bootstrapping method to create a large corpus of broad-coverage partially lexicalized patterns....

  • ...With the introduction of fast dependency parsers, Ollie (Mausam et al., 2012) continues in the same spirit but with learned dependency patterns, improving on the earlier WOE system (Wu and Weld, 2010)....

  • ...We ran the Ollie open IE system (Mausam et al., 2012) in an identical framework to ours, and report accuracy in Table 5....

Proceedings ArticleDOI
01 Sep 2017
TL;DR: An effective new model is proposed, which combines an LSTM sequence model with a form of entity position-aware attention that is better suited to relation extraction; the authors also build TACRED, a large supervised relation extraction dataset obtained via crowdsourcing and targeted towards TAC KBP relations.
Abstract: Organized relational knowledge in the form of “knowledge graphs” is important for many applications. However, the ability to populate knowledge bases with facts automatically extracted from documents has improved frustratingly slowly. This paper simultaneously addresses two issues that have held back prior work. We first propose an effective new model, which combines an LSTM sequence model with a form of entity position-aware attention that is better suited to relation extraction. Then we build TACRED, a large (119,474 examples) supervised relation extraction dataset obtained via crowdsourcing and targeted towards TAC KBP relations. The combination of better supervised data and a more appropriate high-capacity model enables much better relation extraction performance. When the model trained on this new dataset replaces the previous relation extraction component of the best TAC KBP 2015 slot filling system, its F1 score increases markedly from 22.2% to 26.7%.
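
A rough schematic of the position-aware attention idea, not the authors' released implementation: each token's attention logit mixes its LSTM hidden state, a sentence summary vector, and embeddings of its relative position to the subject and object entities. All dimensions, weight names, and values below are arbitrary stand-ins.

    import numpy as np

    rng = np.random.default_rng(1)
    seq_len, hidden, attn, pos_dim = 6, 8, 5, 4

    h = rng.normal(size=(seq_len, hidden))        # LSTM hidden states, one per token
    q = h[-1]                                     # summary vector (e.g. final state)
    p_subj = rng.normal(size=(seq_len, pos_dim))  # embeddings of distance to subject entity
    p_obj = rng.normal(size=(seq_len, pos_dim))   # embeddings of distance to object entity

    W_h = rng.normal(size=(attn, hidden))
    W_q = rng.normal(size=(attn, hidden))
    W_s = rng.normal(size=(attn, pos_dim))
    W_o = rng.normal(size=(attn, pos_dim))
    v = rng.normal(size=attn)

    # Attention logit per token combines content (h_i, q) with position information.
    logits = np.array([
        v @ np.tanh(W_h @ h[i] + W_q @ q + W_s @ p_subj[i] + W_o @ p_obj[i])
        for i in range(seq_len)
    ])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Weighted sentence representation fed to the relation classifier.
    z = weights @ h
    print(weights.round(3), z.shape)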

697 citations


Cites background from "Open Language Learning for Information Extraction"

  • ..., 2012), where a training set is formed by projecting the relations in an existing knowledge base onto textual instances that contain the entities that the relation connects; and third, Open IE (Fader et al., 2011; Mausam et al., 2012), which views its goal as producing subject-relation-object triples and expressing the relation in text....

References
Proceedings ArticleDOI
02 Aug 2009
TL;DR: This work investigates an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size.
Abstract: Modern models of relation extraction for tasks like ACE are based on supervised learning of relations from small hand-labeled corpora. We investigate an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size. Our experiments use Freebase, a large semantic database of several thousand relations, to provide distant supervision. For each pair of entities that appears in some Freebase relation, we find all sentences containing those entities in a large unlabeled corpus and extract textual features to train a relation classifier. Our algorithm combines the advantages of supervised IE (combining 400,000 noisy pattern features in a probabilistic classifier) and unsupervised IE (extracting large numbers of relations from large corpora of any domain). Our model is able to extract 10,000 instances of 102 relations at a precision of 67.6%. We also analyze feature performance, showing that syntactic parse features are particularly helpful for relations that are ambiguous or lexically distant in their expression.
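
The distant-supervision recipe reduces to a simple labeling loop: any sentence mentioning an entity pair that stands in some knowledge-base relation is treated as a (noisy) positive example of that relation. The mini-KB and corpus below are invented, and a real system would add feature extraction and a classifier on top.

    # Mini knowledge base of (entity1, entity2) -> relation facts.
    KB = {
        ("Barack Obama", "Honolulu"): "place_of_birth",
        ("Google", "Larry Page"): "founded_by",
    }

    corpus = [
        "Barack Obama was born in Honolulu , Hawaii .",
        "Larry Page co-founded Google in 1998 .",
        "Barack Obama visited Honolulu last year .",   # noisy match: not a birth statement
    ]

    def distant_labels(corpus, kb):
        """Label every sentence containing a known entity pair with the KB relation.
        The third sentence shows why this form of supervision is noisy."""
        examples = []
        for sent in corpus:
            for (e1, e2), rel in kb.items():
                if e1 in sent and e2 in sent:
                    examples.append((sent, e1, e2, rel))
        return examples

    for example in distant_labels(corpus, KB):
        print(example)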

2,965 citations


"Open Language Learning for Informat..." refers background or methods in this paper

  • ...While traditional Information Extraction (IE) (ARPA, 1991; ARPA, 1998) focused on identifying and extracting specific relations of interest, there has been great interest in scaling IE to a broader set of relations and to far larger corpora (Banko et al., 2007; Hoffmann et al., 2010; Mintz et al., 2009; Carlson et al., 2010; Fader et al., 2011)....

  • ...Bootstrapped data has been previously used to generate positive training data for IE (Hoffmann et al., 2010; Mintz et al., 2009)....

Proceedings Article
01 May 2006
TL;DR: A system is described for extracting typed dependency parses of English sentences from phrase structure parses, capturing inherent relations occurring in corpus texts that can be critical in real-world applications.
Abstract: This paper describes a system for extracting typed dependency parses of English sentences from phrase structure parses. In order to capture inherent relations occurring in corpus texts that can be critical in real-world applications, many NP relations are included in the set of grammatical relations used. We provide a comparison of our system with Minipar and the Link parser. The typed dependency extraction facility described here is integrated in the Stanford Parser, available for download.

2,503 citations


"Open Language Learning for Informat..." refers methods in this paper

  • ...We post-process the parses using Stanford’s CCprocessed algorithm, which compacts the parse structure for easier extraction (de Marneffe et al., 2006)....

Proceedings Article
11 Jul 2010
TL;DR: This work proposes an approach and a set of design principles for an intelligent computer agent that runs forever and describes a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs.
Abstract: We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and a set of design principles for such an agent, describe a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs with an estimated precision of 74% after running for 67 days, and discuss lessons learned from this preliminary attempt to build a never-ending learning agent.

2,010 citations

Proceedings Article
06 Jan 2007
TL;DR: Open Information Extraction (OIE) is introduced as a new extraction paradigm in which the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input.
Abstract: Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.

1,574 citations

Proceedings ArticleDOI
01 Jun 2000
TL;DR: This paper develops a scalable evaluation methodology and metrics for the task, and presents a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
Abstract: Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, which in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
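
The bootstrapping loop behind Snowball can be caricatured in a few lines: seed tuples induce textual patterns, and the patterns extract new tuples from the corpus; the real system additionally scores patterns and tuples and keeps only the most reliable ones at each iteration, which this toy omits. The corpus, patterns, and regex-based matching are illustrative stand-ins, not Snowball's actual pattern representation.

    import re

    corpus = [
        "Microsoft is headquartered in Redmond.",
        "Apple is headquartered in Cupertino.",
        "Apple, based in Cupertino, announced a new product.",
        "Boeing, based in Chicago, builds aircraft.",
    ]

    seeds = {("Microsoft", "Redmond"), ("Apple", "Cupertino")}

    def induce_patterns(tuples, corpus):
        """Turn each seed occurrence into a pattern: the text between the two arguments."""
        patterns = set()
        for org, loc in tuples:
            for sent in corpus:
                if org in sent and loc in sent:
                    middle = sent.split(org, 1)[1].split(loc, 1)[0]
                    patterns.add(middle)
        return patterns

    def apply_patterns(patterns, corpus):
        """Extract new (organization, location) tuples wherever a learned pattern occurs."""
        found = set()
        for sent in corpus:
            for pat in patterns:
                m = re.search(r"(\w[\w ]*?)" + re.escape(pat) + r"(\w[\w ]*)", sent)
                if m:
                    found.add((m.group(1).strip(), m.group(2).strip()))
        return found

    patterns = induce_patterns(seeds, corpus)   # e.g. " is headquartered in ", ", based in "
    print(apply_patterns(patterns, corpus))     # now also picks up (Boeing, Chicago)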

1,399 citations


Additional excerpts

  • ...Another extractor, StatSnowBall (Zhu et al., 2009), has an Open IE version, which learns general but shallow patterns....

  • ...There is a long history of bootstrapping and pattern learning approaches in traditional information extraction, e.g., DIPRE (Brin, 1998), SnowBall (Agichtein and Gravano, 2000), Espresso (Pantel and Pennacchiotti, 2006), PORE (Wang et al., 2007), SOFIE (Suchanek et al., 2009), NELL (Carlson et al., 2010), and PROSPERA (Nakashole et al., 2011)....
