Large-scale Semantic Parsing via Schema Matching and Lexicon Extension

Proceedings Article•

Large-scale Semantic Parsing via Schema Matching and Lexicon Extension

Qingqing Cai¹, Alexander Yates¹•Institutions (1)

01 Aug 2013-pp 423-433

TL;DR: A semantic parser for Freebase is developed based on a reduction to standard supervised training algorithms, schema matching, and pattern learning that is capable of parsing questions with an F1 that improves by 0.42 over a purely-supervised learning algorithm.

Abstract: Supervised training procedures for semantic parsers produce high-quality semantic parsers, but they have difficulty scaling to large databases because of the sheer number of logical constants for which they must see labeled training data. We present a technique for developing semantic parsers for large databases based on a reduction to standard supervised training algorithms, schema matching, and pattern learning. Leveraging techniques from each of these areas, we develop a semantic parser for Freebase that is capable of parsing questions with an F1 that improves by 0.42 over a purely-supervised learning algorithm.

Citations

Proceedings Article•

Semantic Parsing on Freebase from Question-Answer Pairs

Jonathan Berant¹, Andrew Chia Chen Chou¹, Roy Frostig¹, Percy Liang¹•Institutions (1)

Stanford University¹

01 Oct 2013

TL;DR: This paper trains a semantic parser that scales up to Freebase and, despite not using annotated logical forms, outperforms the state-of-the-art parser of Cai and Yates (2013) on their dataset.

Abstract: In this paper, we train a semantic parser that scales up to Freebase. Instead of relying on annotated logical forms, which are especially expensive to obtain at large scale, we learn from question-answer pairs. The main challenge in this setting is narrowing down the huge number of possible logical predicates for a given question. We tackle this problem in two ways: First, we build a coarse mapping from phrases to predicates using a knowledge base and a large text corpus. Second, we use a bridging operation to generate additional predicates based on neighboring predicates. On the dataset of Cai and Yates (2013), despite not having annotated logical forms, our system outperforms their state-of-the-art parser. Additionally, we collected a more realistic and challenging dataset of question-answer pairs and improve over a natural baseline.

1,738 citations

Cites background or methods from "Large-scale Semantic Parsing via Sc..."

..., 2011), at large scale they have inadequate coverage (Cai and Yates, 2013)....
[...]
...Rather than using head-modifier information from dependency trees (Branavan et al., 2012; Krishnamurthy and Mitchell, 2012; Cai and Yates, 2013; Poon, 2013), we can learn the appropriate relationships tailored for downstream accuracy....
[...]
...Previous work based on CCG requires manually specifying combination rules (Krishnamurthy and Mitchell, 2012) or inducing the rules from annotated logical forms (Kwiatkowski et al., 2010; Cai and Yates, 2013)....
[...]
...On the question answering side, recent methods have made progress in building semantic parsers for the open domain, but still require a fair amount of manual effort (Yahya et al., 2012; Unger et al., 2012; Cai and Yates, 2013)....
[...]

Proceedings Article•DOI•

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi¹, Eunsol Choi¹, Daniel S. Weld¹, Luke Zettlemoyer²•Institutions (2)

University of Washington¹, Allen Institute for Artificial Intelligence²

09 May 2017

TL;DR: It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers.

Abstract: We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross-sentence reasoning to find answers. We also present two baseline algorithms: a feature-based classifier and a state-of-the-art neural network that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that TriviaQA is a challenging testbed that is worth significant future study.

1,266 citations

Cites background from "Large-scale Semantic Parsing via Sc..."

...Proposed datasets (Cai and Yates, 2013; Berant et al., 2013; Bordes et al., 2015) are either limited in scale or in the complexity of questions, and can only retrieve facts covered by the KB....
[...]

Posted Content•

Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning

Victor Zhong, Caiming Xiong, Richard Socher

31 Aug 2017-arXiv: Computation and Language

TL;DR: This work proposes Seq2SQL, a deep neural network for translating natural language questions to corresponding SQL queries, and releases WikiSQL, a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia that is an order of magnitude larger than comparable datasets.

Abstract: A significant amount of the world's knowledge is stored in relational databases. However, the ability for users to retrieve facts from a database is limited due to a lack of understanding of query languages such as SQL. We propose Seq2SQL, a deep neural network for translating natural language questions to corresponding SQL queries. Our model leverages the structure of SQL queries to significantly reduce the output space of generated queries. Moreover, we use rewards from in-the-loop query execution over the database to learn a policy to generate unordered parts of the query, which we show are less suitable for optimization via cross entropy loss. In addition, we will publish WikiSQL, a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia. This dataset is required to train our model and is an order of magnitude larger than comparable datasets. By applying policy-based reinforcement learning with a query execution environment to WikiSQL, our model Seq2SQL outperforms attentional sequence to sequence models, improving execution accuracy from 35.9% to 59.4% and logical form accuracy from 23.4% to 48.3%.

830 citations

Cites background from "Large-scale Semantic Parsing via Sc..."

...parsing focus on learning parsers without relying on annotated logical forms by leveraging conversational logs (Artzi & Zettlemoyer, 2011), demonstrations (Artzi & Zettlemoyer, 2013), distant supervision (Cai & Yates, 2013; Reddy et al., 2014), and question-answer pairs (Liang et al., 2011)....
[...]
...Researchers also investigated QA over subsets of large-scale knowledge graphs such as DBPedia (Starc & Mladenic, 2017) and Freebase (Cai & Yates, 2013; Berant et al., 2013)....
[...]

Proceedings Article•DOI•

Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base

Wen-tau Yih¹, Ming-Wei Chang¹, Xiaodong He¹, Jianfeng Gao¹•Institutions (1)

Microsoft¹

28 Jul 2015

TL;DR: This work proposes a novel semantic parsing framework for question answering using a knowledge base that leverages the knowledge base in an early stage to prune the search space and thus simplifies the semantic matching problem.

Abstract: We propose a novel semantic parsing framework for question answering using a knowledge base. We define a query graph that resembles subgraphs of the knowledge base and can be directly mapped to a logical form. Semantic parsing is reduced to query graph generation, formulated as a staged search problem. Unlike traditional approaches, our method leverages the knowledge base in an early stage to prune the search space and thus simplifies the semantic matching problem. By applying an advanced entity linking system and a deep convolutional neural network model that matches questions and predicate sequences, our system outperforms previous methods substantially, and achieves an F1 measure of 52.5% on the WEBQUESTIONS dataset.

806 citations

Cites methods from "Large-scale Semantic Parsing via Sc..."

...Several semantic parsing methods use a domain-independent meaning representation derived from the combinatory categorial grammar (CCG) parses (e.g., (Cai and Yates, 2013; Kwiatkowski et al., 2013; Reddy et al., 2014))....
[...]

Posted Content•

Large-scale Simple Question Answering with Memory Networks

Antoine Bordes, Nicolas Usunier, Sumit Chopra, Jason Weston

05 Jun 2015-arXiv: Learning

TL;DR: This paper studies the impact of multitask and transfer learning for simple question answering; a setting for which the reasoning required to answer is quite easy, as long as one can retrieve the correct evidence given a question, which can be difficult in large-scale conditions.

Abstract: Training large-scale question answering systems is complicated because training sources usually cover a small portion of the range of possible questions. This paper studies the impact of multitask and transfer learning for simple question answering; a setting for which the reasoning required to answer is quite easy, as long as one can retrieve the correct evidence given a question, which can be difficult in large-scale conditions. To this end, we introduce a new dataset of 100k questions that we use in conjunction with existing benchmarks. We conduct our study within the framework of Memory Networks (Weston et al., 2015) because this perspective allows us to eventually scale up to more complex reasoning, and show that Memory Networks can be successfully trained to achieve excellent performance.

634 citations

References

Proceedings Article•DOI•

Freebase: a collaboratively created graph database for structuring human knowledge

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, Jamie Taylor

09 Jun 2008

TL;DR: MQL provides an easy-to-use object-oriented interface to the tuple data in Freebase and is designed to facilitate the creation of collaborative, Web-based data-oriented applications.

Abstract: Freebase is a practical, scalable tuple database used to structure general human knowledge. The data in Freebase is collaboratively created, structured, and maintained. Freebase currently contains more than 125,000,000 tuples, more than 4000 types, and more than 7000 properties. Public read/write access to Freebase is allowed through an HTTP-based graph-query API using the Metaweb Query Language (MQL) as a data query and manipulation language. MQL provides an easy-to-use object-oriented interface to the tuple data in Freebase and is designed to facilitate the creation of collaborative, Web-based data-oriented applications.

4,813 citations

"Large-scale Semantic Parsing via Sc..." refers background in this paper

...Freebase (Bollacker et al., 2008) is a free, online, user-contributed, relational database (www.freebase.com) covering many different domains of knowledge....
[...]
...Examples of such schemas include Freebase (Bollacker et al., 2008) and Yago2 (Hoffart et al., 2013)....
[...]

Journal Article•DOI•

A survey of approaches to automatic schema matching

Erhard Rahm¹, Philip A. Bernstein²•Institutions (2)

Leipzig University¹, Microsoft²

01 Dec 2001

TL;DR: A taxonomy is presented that distinguishes between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers and is intended to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.

Abstract: Schema matching is a basic problem in many database application domains, such as data integration, E-business, data warehousing, and semantic query processing. In current implementations, schema matching is typically performed manually, which has significant limitations. On the other hand, previous research papers have proposed many techniques to achieve a partial automation of the match operation for specific application domains. We present a taxonomy that covers many of these existing approaches, and we describe the approaches in some detail. In particular, we distinguish between schema-level and instance-level, element-level and structure-level, and language-based and constraint-based matchers. Based on our classification we review some previous match implementations thereby indicating which part of the solution space they cover. We intend our taxonomy and review of past work to be useful when comparing different approaches to schema matching, when developing a new match algorithm, and when implementing a schema matching component.

3,693 citations

"Large-scale Semantic Parsing via Sc..." refers background in this paper

...Schema matching (Rahm and Bernstein, 2001; Ehrig et al., 2004; Giunchiglia et al., 2005) is a task from the database and knowledge representation community in which systems attempt to identify a “common schema” that covers the relations defined in a set of databases or ontologies, and the mapping between each individual database and the common schema....
[...]

Proceedings Article•

Toward an architecture for never-ending language learning

Andrew Carlson¹, Justin Betteridge¹, Bryan Kisiel¹, Burr Settles¹, Estevam R. Hruschka², Tom M. Mitchell¹•Institutions (2)

Carnegie Mellon University¹, Federal University of São Carlos²

11 Jul 2010

TL;DR: This work proposes an approach and a set of design principles for an intelligent computer agent that runs forever and describes a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs.

Abstract: We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and a set of design principles for such an agent, describe a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs with an estimated precision of 74% after running for 67 days, and discuss lessons learned from this preliminary attempt to build a never-ending learning agent.

2,010 citations

"Large-scale Semantic Parsing via Sc..." refers methods in this paper

...We say a schema is a textual schema if it has been extracted from free text, such as the Nell (Carlson et al., 2010) and ReVerb (Fader et al., 2011) extracted databases....
[...]

Proceedings Article•

Open information extraction from the web

Michele Banko¹, Michael Cafarella¹, Stephen Soderland¹, Matt Broadhead¹, Oren Etzioni¹•Institutions (1)

University of Washington¹

06 Jan 2007

TL;DR: This paper introduces Open Information Extraction (OIE), a new extraction paradigm in which the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input.

Abstract: Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extract the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples. This manual labor scales linearly with the number of target relations. This paper introduces Open IE (OIE), a new extraction paradigm where the system makes a single data-driven pass over its corpus and extracts a large set of relational tuples without requiring any human input. The paper also introduces TEXTRUNNER, a fully implemented, highly scalable OIE system where the tuples are assigned a probability and indexed to support efficient extraction and exploration via user queries. We report on experiments over a 9,000,000 Web page corpus that compare TEXTRUNNER with KNOWITALL, a state-of-the-art Web IE system. TEXTRUNNER achieves an error reduction of 33% on a comparable set of extractions. Furthermore, in the amount of time it takes KNOWITALL to perform extraction for a handful of pre-specified relations, TEXTRUNNER extracts a far broader set of facts reflecting orders of magnitude more relations, discovered on the fly. We report statistics on TEXTRUNNER's 11,000,000 highest probability tuples, and show that they contain over 1,000,000 concrete facts and over 6,500,000 more abstract assertions.

1,574 citations

Proceedings Article•

Identifying Relations for Open Information Extraction

Anthony Fader¹, Stephen Soderland¹, Oren Etzioni¹•Institutions (1)

University of Washington¹

27 Jul 2011

TL;DR: Two simple syntactic and lexical constraints on binary relations expressed by verbs are introduced in the ReVerb Open IE system, which more than doubles the area under the precision-recall curve relative to previous extractors such as TextRunner and WOE^pos.

Abstract: Open Information Extraction (IE) is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary. This paper shows that the output of state-of-the-art Open IE systems is rife with uninformative and incoherent extractions. To overcome these problems, we introduce two simple syntactic and lexical constraints on binary relations expressed by verbs. We implemented the constraints in the ReVerb Open IE system, which more than doubles the area under the precision-recall curve relative to previous extractors such as TextRunner and WOE^pos. More than 30% of ReVerb's extractions are at precision 0.8 or higher, compared to virtually none for earlier systems. The paper concludes with a detailed analysis of ReVerb's errors, suggesting directions for future work.

1,326 citations

"Large-scale Semantic Parsing via Sc..." refers methods in this paper

...To avoid overwhelming the ReVerb servers, for our experiments we limited MATCHER to queries for the top 80 r_T ∈ C(r_D), when they are ranked according to frequency during the candidate identification process. (Footnote 1: http://openie.cs.washington.edu/)...
[...]
...The API for ReVerb allows for relational queries in which some subset of the entity strings, entity categories, and relation string are specified....
[...]
...MATCHER uses an API for the ReVerb Open IE system¹ (Fader et al., 2011) to collect I(r_T), for each r_T....
[...]
...We define precision and recall as: P = |M ∩ G| / |M|, R = |M ∩ G| / |G|. Figure 3 shows a Precision-Recall (PR) curve for MATCHER and three baselines: a “Frequency” model that ranks candidate matches for r_D by their frequency during the candidate identification step; a “Pattern” model that uses MATCHER’s linear regression model for ranking, but is restricted to only the pattern-based features; and an “Extractions” model that similarly restricts the ranking model to ReVerb features. (Footnote 2: The data is available from the second author’s website.)...
[...]
...MATCHER queries ReVerb with three different types of queries for each r_T, specifying the types for both arguments, or just the type of the first argument, or just the second argument....
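
The excerpts above describe two steps: keeping only the 80 most frequent candidate textual relations, and issuing three query shapes per candidate. A minimal Python sketch of that bookkeeping follows. It makes no network calls and does not reproduce the actual ReVerb API; the dictionary keys, the function names `top_candidates` and `query_variants`, and the sample relations and types are all hypothetical.

```python
from collections import Counter

MAX_CANDIDATES = 80  # cutoff quoted in the excerpt


def top_candidates(candidate_counts, limit=MAX_CANDIDATES):
    """Keep the `limit` most frequent candidate textual relations."""
    return [rel for rel, _ in Counter(candidate_counts).most_common(limit)]


def query_variants(relation, arg1_type, arg2_type):
    """Build the three query shapes: both types, first only, second only."""
    return [
        {"relation": relation, "arg1_type": arg1_type, "arg2_type": arg2_type},
        {"relation": relation, "arg1_type": arg1_type, "arg2_type": None},
        {"relation": relation, "arg1_type": None, "arg2_type": arg2_type},
    ]


# Hypothetical candidate frequencies from a candidate identification step.
counts = {"was born in": 12, "is a native of": 3, "visited": 7}
queries = [
    q
    for rel in top_candidates(counts, limit=2)  # small limit for the demo
    for q in query_variants(rel, "person", "location")
]
```

Here `None` stands for an unspecified argument type, so each retained candidate yields three query specifications.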
[...]
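
The precision and recall definitions quoted in the excerpts above, P = |M ∩ G| / |M| and R = |M ∩ G| / |G|, can be sketched in Python as follows. M is the set of predicted matches and G the gold set. The sample relation pairs and scores are invented for illustration, returning 0.0 for an empty set is an assumed convention, and `pr_curve` is only one plausible way to trace a curve from a ranked list, not necessarily the procedure used in the paper.

```python
def precision_recall(predicted, gold):
    """Return (P, R) for predicted matches M and gold matches G."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    # Assumed convention: an empty set yields 0.0 rather than an error.
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


def pr_curve(scored_matches, gold):
    """Sweep a cutoff down a ranked list; one (P, R) point per prefix."""
    ranked = sorted(scored_matches, key=lambda pair: pair[1], reverse=True)
    points = []
    for k in range(1, len(ranked) + 1):
        top_k = [match for match, _ in ranked[:k]]
        points.append(precision_recall(top_k, gold))
    return points


# Hypothetical (database relation, textual relation) matches with scores.
gold = {("capital_of", "is the capital of"), ("author_of", "wrote")}
scored = [
    (("capital_of", "is the capital of"), 0.9),
    (("capital_of", "is located in"), 0.6),
    (("author_of", "wrote"), 0.4),
]
curve = pr_curve(scored, gold)
```

Each point on `curve` corresponds to accepting the top k ranked matches, which is how a ranking model and its restricted-feature baselines can be compared on the same gold set.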