Proceedings Article

Smatch: an Evaluation Metric for Semantic Feature Structures

01 Aug 2013, pp. 748-752
TL;DR: This paper presents smatch, a metric that calculates the degree of overlap between two semantic feature structures, and gives an efficient algorithm to compute the metric and shows the results of an inter-annotator agreement study.
Abstract: The evaluation of whole-sentence semantic structures plays an important role in semantic parsing and large-scale semantic structure annotation. However, there is no widely-used metric to evaluate whole-sentence semantic structures. In this paper, we present smatch, a metric that calculates the degree of overlap between two semantic feature structures. We give an efficient algorithm to compute the metric and show the results of an inter-annotator agreement study.
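As a rough illustration of the metric described above: smatch reduces each semantic feature structure to a set of triples over variables and searches for the one-to-one variable mapping that maximizes the number of matching triples, from which precision, recall and F1 are computed. The sketch below brute-forces that search for tiny graphs; the paper's algorithm instead uses hill-climbing with random restarts, and all names here are illustrative rather than taken from the released smatch tool.

```python
from itertools import permutations

def smatch_like_f1(triples_a, vars_a, triples_b, vars_b):
    """Brute-force sketch of the smatch idea for tiny graphs.

    triples_* are sets of (relation, arg1, arg2) tuples whose arguments may be
    variables (e.g. 'x') or constants (e.g. 'want-01').  Assumes the first
    structure has no more variables than the second; the real smatch tool
    hill-climbs over mappings instead of enumerating them all.
    """
    best_matches = 0
    for perm in permutations(vars_b, len(vars_a)):
        mapping = dict(zip(vars_a, perm))          # one-to-one variable mapping
        mapped = {(rel, mapping.get(x, x), mapping.get(y, y))
                  for rel, x, y in triples_a}
        best_matches = max(best_matches, len(mapped & set(triples_b)))

    precision = best_matches / len(triples_a) if triples_a else 0.0
    recall = best_matches / len(triples_b) if triples_b else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: two structures for "the boy wants ..." that differ only in variable names.
gold = {("instance", "a", "want-01"), ("instance", "b", "boy"), ("ARG0", "a", "b")}
pred = {("instance", "x", "want-01"), ("instance", "y", "boy"), ("ARG0", "x", "y")}
print(smatch_like_f1(pred, ["x", "y"], gold, ["a", "b"]))  # 1.0
```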


Citations
Book Chapter
08 Oct 2016
TL;DR: This paper proposes SPICE, a new automated caption evaluation metric defined over scene graphs, and shows that it captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR).
Abstract: There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as which caption-generator best understands colors? and can caption-generators count?
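At its core (setting aside the scene-graph parsing and the WordNet synonym matching the released SPICE implementation uses), the metric is an F-score over scene-graph tuples of objects, attributes and relations; a minimal sketch with illustrative names:

```python
def scene_graph_tuple_f1(candidate_tuples, reference_tuples):
    """F1 over scene-graph tuples such as ('girl',), ('girl', 'young') and
    ('girl', 'ride', 'horse').  Exact match only; SPICE itself also counts
    WordNet-synonym matches between tuple elements."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    precision = matched / len(cand) if cand else 0.0
    recall = matched / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```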

1,053 citations

Posted Content
TL;DR: It is hypothesized that semantic propositional content is an important component of human caption evaluation, and a new automated caption evaluation metric defined over scene graphs, coined SPICE, is proposed; it can answer questions such as 'which caption-generator best understands colors?' and 'can caption-generators count?'
Abstract: There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as `which caption-generator best understands colors?' and `can caption-generators count?'

922 citations


Cites background from "Smatch: an Evaluation Metric for Se..."

  • ...Unlike Smatch [27], a recently proposed metric for evaluating AMR parsers that considers multiple alignments of AMR graphs, we make no allowance for partial credit when only one element of a tuple is incorrect....


  • ...Recent work has proposed a common framework for semantic graphs called an abstract meaning representation (AMR) [24], for which a number of parsers [25,26,17] and the Smatch evaluation metric [27] have been developed....


  • ...However, in initial experiments, we found that AMR representations using Smatch similarity performed poorly as image caption representations....


Proceedings Article
01 Jun 2014
TL;DR: This paper introduces the first approach to parse sentences into Abstract Meaning Representation (AMR), a semantic formalism for which a growing set of annotated examples is available, providing a strong baseline for future improvement.
Abstract: Abstract Meaning Representation (AMR) is a semantic formalism for which a growing set of annotated examples is available. We introduce the first approach to parse sentences into this representation, providing a strong baseline for future improvement. The method is based on a novel algorithm for finding a maximum spanning, connected subgraph, embedded within a Lagrangian relaxation of an optimization problem that imposes linguistically inspired constraints. Our approach is described in the general framework of structured prediction, allowing future incorporation of additional features and constraints, and may extend to other formalisms as well. Our open-source system, JAMR, is available at: http://github.com/jflanigan/jamr
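The "maximum spanning, connected subgraph" step can be pictured with a simple greedy sketch: keep every positive-weight edge, then connect the remaining components with the best available edges (a maximum-spanning-tree pass over components). This omits the Lagrangian-relaxation constraints the paper adds, and the names are illustrative, not JAMR's code.

```python
def max_spanning_connected_subgraph(nodes, weighted_edges):
    """nodes: iterable of node ids covering every edge endpoint.
    weighted_edges: list of (u, v, weight).  Returns a connected spanning
    edge set: all positive-weight edges plus the highest-weight edges needed
    to join the remaining components."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
        return True

    chosen = set()
    for u, v, w in weighted_edges:           # 1) keep every positive-weight edge
        if w > 0:
            chosen.add((u, v, w))
            union(u, v)
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        if union(u, v):                       # 2) join remaining components, best edges first
            chosen.add((u, v, w))
    return chosen
```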

342 citations


Cites background or methods from "Smatch: an Evaluation Metric for Se..."

  • ..., 2013) under the Smatch score (Cai and Knight, 2013), presenting the first published results on automatic AMR parsing....


  • ...We evaluate using the Smatch score (Cai and Knight, 2013), establishing a baseline for future work....


  • ...0 (Cai and Knight, 2013), which counts the precision, recall and F1 of the concepts and relations together....


Proceedings Article
01 Sep 2015
TL;DR: It is shown that scene graphs can be effectively created automatically from a natural language scene description and that using the output of the parsers is almost as effective as using human-constructed scene graphs.
Abstract: Semantically complex queries which include attributes of objects and relations between objects still pose a major challenge to image retrieval systems. Recent work in computer vision has shown that a graph-based semantic representation called a scene graph is an effective representation for very detailed image descriptions and for complex queries for retrieval. In this paper, we show that scene graphs can be effectively created automatically from a natural language scene description. We present a rule-based and a classifier-based scene graph parser whose output can be used for image retrieval. We show that including relations and attributes in the query graph outperforms a model that only considers objects and that using the output of our parsers is almost as effective as using human-constructed scene graphs (Recall@10 of 27.1% vs. 33.4%). Additionally, we demonstrate the general usefulness of parsing to scene graphs by showing that the output can also be used to generate 3D scenes.
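The retrieval figures quoted above are Recall@K; a minimal sketch of that evaluation, with illustrative names rather than the paper's code:

```python
def recall_at_k(rankings, gold_image_ids, k=10):
    """rankings: one ranked list of image ids per query.
    gold_image_ids: the ground-truth image id for each query.
    Returns the fraction of queries whose gold image is in the top k."""
    hits = sum(1 for ranking, gold in zip(rankings, gold_image_ids)
               if gold in ranking[:k])
    return hits / len(gold_image_ids)
```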

314 citations


Cites methods from "Smatch: an Evaluation Metric for Se..."

  • ..., 2013) graphs, we use Smatch F1 (Cai and Knight, 2013) as an additional intrinsic metric....


Proceedings Article
01 Jan 2015
TL;DR: This work focuses on the graph-to-graph transformation that reduces the source semantic graph into a summary graph, making use of an existing AMR parser and assuming the eventual availability of an AMR-to-text generator.
Abstract: We present a novel abstractive summarization framework that draws on the recent development of a treebank for the Abstract Meaning Representation (AMR). In this framework, the source text is parsed to a set of AMR graphs, the graphs are transformed into a summary graph, and then text is generated from the summary graph. We focus on the graph-to-graph transformation that reduces the source semantic graph into a summary graph, making use of an existing AMR parser and assuming the eventual availability of an AMR-to-text generator. The framework is data-driven, trainable, and not specifically designed for a particular domain. Experiments on gold-standard AMR annotations and system parses show promising results. Code is available at: https://github.com/summarization
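As a loose, illustrative stand-in for the graph-to-graph step (the paper's transformation is trained and solved as a constrained optimization; this greedy frequency-based selection only conveys the shape of the problem, and all names are made up):

```python
from collections import Counter

def greedy_summary_graph(sentence_graphs, max_concepts=10):
    """sentence_graphs: list of AMR-like graphs, each a set of
    (concept1, relation, concept2) edges.  Keeps the most frequent
    concepts across the document and the edges among them."""
    concept_counts = Counter()
    for graph in sentence_graphs:
        for c1, _, c2 in graph:
            concept_counts[c1] += 1
            concept_counts[c2] += 1
    kept = {c for c, _ in concept_counts.most_common(max_concepts)}
    summary_edges = {(c1, rel, c2)
                     for graph in sentence_graphs
                     for c1, rel, c2 in graph
                     if c1 in kept and c2 in kept}
    return kept, summary_edges
```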

225 citations


Cites methods from "Smatch: an Evaluation Metric for Se..."

  • ...AMR parse quality is evaluated using smatch (Cai and Knight, 2013), which measures the accuracy of concept and relation predictions.... (AMR guidelines: http://www.isi.edu/~ulf/amr/help/amr-guidelines.pdf)


References
Proceedings Article
06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Abstract: Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.
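The method is the familiar BLEU score: clipped n-gram precisions combined with a brevity penalty. A sentence-level, single-reference sketch (real BLEU is computed at the corpus level, supports multiple references, and is usually smoothed):

```python
import math
from collections import Counter

def bleu_sketch(candidate, reference, max_n=4):
    """candidate, reference: token lists.  Geometric mean of clipped
    n-gram precisions times a brevity penalty; no smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        clipped = sum(min(count, ref_ngrams[gram])
                      for gram, count in cand_ngrams.items())
        precisions.append(clipped / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0                      # any zero precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_avg)
```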

21,126 citations

Proceedings Article
08 Aug 2006
TL;DR: This paper defines Translation Edit Rate (TER), a new, intuitive measure for evaluating machine translation output that avoids the knowledge intensiveness of more meaning-based approaches and the labor-intensiveness of human judgments.
Abstract: We examine a new, intuitive measure for evaluating machine-translation output that avoids the knowledge intensiveness of more meaning-based approaches, and the labor-intensiveness of human judgments. Translation Edit Rate (TER) measures the amount of editing that a human would have to perform to change a system output so it exactly matches a reference translation. We show that the single-reference variant of TER correlates as well with human judgments of MT quality as the four-reference variant of BLEU. We also define a human-targeted TER (or HTER) and show that it yields higher correlations with human judgments than BLEU—even when BLEU is given human-targeted references. Our results indicate that HTER correlates with human judgments better than HMETEOR and that the four-reference variants of TER and HTER correlate with human judgments as well as—or better than—a second human judgment does.
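TER itself is the number of edits divided by the reference length; the sketch below keeps only the insertion/deletion/substitution part and omits the block shifts that full TER also allows, so treat it as an approximation with illustrative names:

```python
def ter_sketch(hypothesis, reference):
    """hypothesis, reference: token lists.  Word-level edit distance divided
    by the reference length (no phrase shifts, single reference)."""
    h, r = len(hypothesis), len(reference)
    dist = [[0] * (r + 1) for _ in range(h + 1)]
    for i in range(h + 1):
        dist[i][0] = i
    for j in range(r + 1):
        dist[0][j] = j
    for i in range(1, h + 1):
        for j in range(1, r + 1):
            sub = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[h][r] / max(r, 1)
```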

2,210 citations


"Smatch: an Evaluation Metric for Se..." refers methods in this paper

  • ...We note that other widely-used metrics, such as TER (Snover et al., 2006), are also NP-complete....


Book
01 Jan 1990
TL;DR: Focusing on the structure of meaning in English sentences at a "subatomic" level - that is, a level below the one most theories accept as basic or "atomic" - Parsons asserts that the semantics of simple English sentences require logical forms somewhat more complex than is normally assumed in natural language semantics.
Abstract: This extended investigation of the semantics of event (and state) sentences in their various forms is a major contribution to the semantics of natural language, simultaneously encompassing important issues in linguistics, philosophy, and logic. It develops the view that the logical forms of simple English sentences typically contain quantification over events or states and shows how this view can account for a wide variety of semantic phenomena. Focusing on the structure of meaning in English sentences at a "subatomic" level - that is, a level below the one most theories accept as basic or "atomic" - Parsons asserts that the semantics of simple English sentences require logical forms somewhat more complex than is normally assumed in natural language semantics. His articulation of underlying event theory explains a wide variety of apparently diverse semantic characteristics of natural language, and his development of the theory shows the importance of seeing the distinction between events and states. Parsons demonstrates that verbs, also, indicate kinds of actions rather than specific, individual actions. Verb phrases, too, he argues, depend on modifiers to make their function and meaning in a sentence specific. An appendix gives many of the details needed to formalize the theory discussed in the body of the text and provides a series of templates that permit the generation of atomic formulas of English. Terence Parsons is Professor of Philosophy and Dean of Humanities at the University of California, Irvine.
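The event-quantification claim is easiest to see on the standard textbook example (not one drawn from this listing): an action sentence introduces an event variable, and participants and modifiers become predicates of that event, e.g. for "Brutus stabbed Caesar violently":

```latex
\exists e\,[\,\mathit{stabbing}(e) \wedge \mathit{Agent}(e,\mathit{Brutus})
          \wedge \mathit{Theme}(e,\mathit{Caesar}) \wedge \mathit{violent}(e)\,]
```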

1,437 citations


"Smatch: an Evaluation Metric for Se..." refers methods in this paper

  • ...We work on a semantic feature structure representation in a standard neo-Davidsonian (Davidson, 1969; Parsons, 1990) framework....


Proceedings Article
26 Jul 2005
TL;DR: A learning algorithm is described that takes as input a training set of sentences labeled with expressions in the lambda calculus and induces a grammar for the problem, along with a log-linear model that represents a distribution over syntactic and semantic analyses conditioned on the input sentence.
Abstract: This paper addresses the problem of mapping natural language sentences to lambda-calculus encodings of their meaning. We describe a learning algorithm that takes as input a training set of sentences labeled with expressions in the lambda calculus. The algorithm induces a grammar for the problem, along with a log-linear model that represents a distribution over syntactic and semantic analyses conditioned on the input sentence. We apply the method to the task of learning natural language interfaces to databases and show that the learned parsers outperform previous methods in two benchmark database domains.
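For a concrete sense of the target representation, a typical GeoQuery-style pairing looks like the following (the exact notation in the paper may differ slightly):

```latex
\text{``What states border Texas?''} \;\mapsto\; \lambda x.\,\mathit{state}(x) \wedge \mathit{borders}(x,\mathit{texas})
```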

865 citations


"Smatch: an Evaluation Metric for Se..." refers background in this paper

  • ...whole-sentence accuracy (Zettlemoyer and Collins, 2005), which counts the number of sentences parsed completely correctly....


Proceedings Article
01 May 2002
TL;DR: This paper describes an approach to the development of a Proposition Bank, which involves the addition of semantic information to the Penn English Treebank, and introduces metaframes as a technique for handling similar frames among near-synonymous verbs.
Abstract: This paper describes our approach to the development of a Proposition Bank, which involves the addition of semantic information to the Penn English Treebank. Our primary goal is the labeling of syntactic nodes with specific argument labels that preserve the similarity of roles such as the window in John broke the window and the window broke. After motivating the need for explicit predicate argument structure labels, we briefly discuss the theoretical considerations of predicate argument structure and the need to maintain consistency across syntactic alternations. The issues of consistency of argument structure across both polysemous and synonymous verbs are also discussed and we present our actual guidelines for these types of phenomena, along with numerous examples of tagged sentences and verb frames. Metaframes are introduced as a technique for handling similar frames among near-synonymous verbs. We conclude with a summary of the current status of the annotation process.
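To make the role-preservation point concrete, a schematic rendering of the frameset in question (paraphrased, not copied from the PropBank guidelines): in break.01 the breaker is Arg0 and the thing broken is Arg1, so the window carries the same label in both sentences from the abstract.

  break.01: Arg0 = breaker, Arg1 = thing broken
  "John broke the window."  ->  Arg0: John,  Arg1: the window
  "The window broke."       ->  Arg1: the window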

686 citations


"Smatch: an Evaluation Metric for Se..." refers background in this paper

  • ...Both want-01 and go-01 are frames from PropBank framesets (Kingsbury and Palmer, 2002)....
