Proceedings Article

Smatch: an Evaluation Metric for Semantic Feature Structures

01 Aug 2013, pp. 748-752
TL;DR: This paper presents smatch, a metric that calculates the degree of overlap between two semantic feature structures, and gives an efficient algorithm to compute the metric and shows the results of an inter-annotator agreement study.
Abstract: The evaluation of whole-sentence semantic structures plays an important role in semantic parsing and large-scale semantic structure annotation. However, there is no widely-used metric to evaluate whole-sentence semantic structures. In this paper, we present smatch, a metric that calculates the degree of overlap between two semantic feature structures. We give an efficient algorithm to compute the metric and show the results of an inter-annotator agreement study.
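As a rough illustration of the metric described above: smatch reduces each semantic feature structure to a set of triples over variables and searches for the one-to-one variable mapping that maximizes the number of matching triples, from which precision, recall and F1 are computed. The sketch below brute-forces that search for tiny graphs; the paper's algorithm instead uses hill-climbing with random restarts, and all names here are illustrative rather than taken from the released smatch tool.

```python
from itertools import permutations

def smatch_like_f1(triples_a, vars_a, triples_b, vars_b):
    """Brute-force sketch of the smatch idea for tiny graphs.

    triples_* are sets of (relation, arg1, arg2) tuples whose arguments may be
    variables (e.g. 'x') or constants (e.g. 'want-01').  Assumes the first
    structure has no more variables than the second; the real smatch tool
    hill-climbs over mappings instead of enumerating them all.
    """
    best_matches = 0
    for perm in permutations(vars_b, len(vars_a)):
        mapping = dict(zip(vars_a, perm))          # one-to-one variable mapping
        mapped = {(rel, mapping.get(x, x), mapping.get(y, y))
                  for rel, x, y in triples_a}
        best_matches = max(best_matches, len(mapped & set(triples_b)))

    precision = best_matches / len(triples_a) if triples_a else 0.0
    recall = best_matches / len(triples_b) if triples_b else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: two structures for "the boy wants ..." that differ only in variable names.
gold = {("instance", "a", "want-01"), ("instance", "b", "boy"), ("ARG0", "a", "b")}
pred = {("instance", "x", "want-01"), ("instance", "y", "boy"), ("ARG0", "x", "y")}
print(smatch_like_f1(pred, ["x", "y"], gold, ["a", "b"]))  # 1.0
```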


Citations
Book Chapter
08 Oct 2016
TL;DR: This paper proposes SPICE, a new automated caption evaluation metric defined over scene graphs, and shows that it captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR).
Abstract: There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as which caption-generator best understands colors? and can caption-generators count?
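At its core (setting aside the scene-graph parsing and the WordNet synonym matching the released SPICE implementation uses), the metric is an F-score over scene-graph tuples of objects, attributes and relations; a minimal sketch with illustrative names:

```python
def scene_graph_tuple_f1(candidate_tuples, reference_tuples):
    """F1 over scene-graph tuples such as ('girl',), ('girl', 'young') and
    ('girl', 'ride', 'horse').  Exact match only; SPICE itself also counts
    WordNet-synonym matches between tuple elements."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    precision = matched / len(cand) if cand else 0.0
    recall = matched / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```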

1,053 citations

Posted Content
TL;DR: It is hypothesized that semantic propositional content is an important component of human caption evaluation, and a new automated caption evaluation metric defined over scene graphs, coined SPICE, is proposed; it can answer questions such as 'which caption-generator best understands colors?' and 'can caption-generators count?'
Abstract: There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as `which caption-generator best understands colors?' and `can caption-generators count?'

922 citations


Cites background from "Smatch: an Evaluation Metric for Se..."

  • ...Unlike Smatch [27], a recently proposed metric for evaluating AMR parsers that considers multiple alignments of AMR graphs, we make no allowance for partial credit when only one element of a tuple is incorrect....


  • ...Recent work has proposed a common framework for semantic graphs called an abstract meaning representation (AMR) [24], for which a number of parsers [25,26,17] and the Smatch evaluation metric [27] have been developed....


  • ...However, in initial experiments, we found that AMR representations using Smatch similarity performed poorly as image caption representations....


Proceedings Article
01 Jun 2014
TL;DR: This paper introduces the first approach to parse sentences into Abstract Meaning Representation (AMR), a semantic formalism for which a growing set of annotated examples is available, providing a strong baseline for future improvement.
Abstract: Abstract Meaning Representation (AMR) is a semantic formalism for which a growing set of annotated examples is available. We introduce the first approach to parse sentences into this representation, providing a strong baseline for future improvement. The method is based on a novel algorithm for finding a maximum spanning, connected subgraph, embedded within a Lagrangian relaxation of an optimization problem that imposes linguistically inspired constraints. Our approach is described in the general framework of structured prediction, allowing future incorporation of additional features and constraints, and may extend to other formalisms as well. Our open-source system, JAMR, is available at: http://github.com/jflanigan/jamr
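The "maximum spanning, connected subgraph" step can be pictured with a simple greedy sketch: keep every positive-weight edge, then connect the remaining components with the best available edges (a maximum-spanning-tree pass over components). This omits the Lagrangian-relaxation constraints the paper adds, and the names are illustrative, not JAMR's code.

```python
def max_spanning_connected_subgraph(nodes, weighted_edges):
    """nodes: iterable of node ids covering every edge endpoint.
    weighted_edges: list of (u, v, weight).  Returns a connected spanning
    edge set: all positive-weight edges plus the highest-weight edges needed
    to join the remaining components."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
        return True

    chosen = set()
    for u, v, w in weighted_edges:           # 1) keep every positive-weight edge
        if w > 0:
            chosen.add((u, v, w))
            union(u, v)
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        if union(u, v):                       # 2) join remaining components, best edges first
            chosen.add((u, v, w))
    return chosen
```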

342 citations


Cites background or methods from "Smatch: an Evaluation Metric for Se..."

  • ..., 2013) under the Smatch score (Cai and Knight, 2013), presenting the first published results on automatic AMR parsing....


  • ...We evaluate using the Smatch score (Cai and Knight, 2013), establishing a baseline for future work....


  • ...0 (Cai and Knight, 2013), which counts the precision, recall and F1 of the concepts and relations together....


Proceedings Article
01 Sep 2015
TL;DR: It is shown that scene graphs can be effectively created automatically from a natural language scene description and that using the output of the parsers is almost as effective as using human-constructed scene graphs.
Abstract: Semantically complex queries which include attributes of objects and relations between objects still pose a major challenge to image retrieval systems. Recent work in computer vision has shown that a graph-based semantic representation called a scene graph is an effective representation for very detailed image descriptions and for complex queries for retrieval. In this paper, we show that scene graphs can be effectively created automatically from a natural language scene description. We present a rule-based and a classifier-based scene graph parser whose output can be used for image retrieval. We show that including relations and attributes in the query graph outperforms a model that only considers objects and that using the output of our parsers is almost as effective as using human-constructed scene graphs (Recall@10 of 27.1% vs. 33.4%). Additionally, we demonstrate the general usefulness of parsing to scene graphs by showing that the output can also be used to generate 3D scenes.
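The retrieval figures quoted above are Recall@K; a minimal sketch of that evaluation, with illustrative names rather than the paper's code:

```python
def recall_at_k(rankings, gold_image_ids, k=10):
    """rankings: one ranked list of image ids per query.
    gold_image_ids: the ground-truth image id for each query.
    Returns the fraction of queries whose gold image is in the top k."""
    hits = sum(1 for ranking, gold in zip(rankings, gold_image_ids)
               if gold in ranking[:k])
    return hits / len(gold_image_ids)
```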

314 citations


Cites methods from "Smatch: an Evaluation Metric for Se..."

  • ..., 2013) graphs, we use Smatch F1 (Cai and Knight, 2013) as an additional intrinsic metric....


Proceedings Article
01 Jan 2015
TL;DR: This work focuses on the graph-to-graph transformation that reduces the source semantic graph into a summary graph, making use of an existing AMR parser and assuming the eventual availability of an AMR-to-text generator.
Abstract: We present a novel abstractive summarization framework that draws on the recent development of a treebank for the Abstract Meaning Representation (AMR). In this framework, the source text is parsed to a set of AMR graphs, the graphs are transformed into a summary graph, and then text is generated from the summary graph. We focus on the graph-to-graph transformation that reduces the source semantic graph into a summary graph, making use of an existing AMR parser and assuming the eventual availability of an AMR-to-text generator. The framework is data-driven, trainable, and not specifically designed for a particular domain. Experiments on gold-standard AMR annotations and system parses show promising results. Code is available at: https://github.com/summarization
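As a loose, illustrative stand-in for the graph-to-graph step (the paper's transformation is trained and solved as a constrained optimization; this greedy frequency-based selection only conveys the shape of the problem, and all names are made up):

```python
from collections import Counter

def greedy_summary_graph(sentence_graphs, max_concepts=10):
    """sentence_graphs: list of AMR-like graphs, each a set of
    (concept1, relation, concept2) edges.  Keeps the most frequent
    concepts across the document and the edges among them."""
    concept_counts = Counter()
    for graph in sentence_graphs:
        for c1, _, c2 in graph:
            concept_counts[c1] += 1
            concept_counts[c2] += 1
    kept = {c for c, _ in concept_counts.most_common(max_concepts)}
    summary_edges = {(c1, rel, c2)
                     for graph in sentence_graphs
                     for c1, rel, c2 in graph
                     if c1 in kept and c2 in kept}
    return kept, summary_edges
```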

225 citations


Cites methods from "Smatch: an Evaluation Metric for Se..."

  • ...AMR parse quality is evaluated using smatch (Cai and Knight, 2013), which measures the accuracy of concept and relation predictions.... (AMR guidelines: http://www.isi.edu/~ulf/amr/help/amr-guidelines.pdf)


References
Proceedings Article
06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Abstract: Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.
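The method is the familiar BLEU score: clipped n-gram precisions combined with a brevity penalty. A sentence-level, single-reference sketch (real BLEU is computed at the corpus level, supports multiple references, and is usually smoothed):

```python
import math
from collections import Counter

def bleu_sketch(candidate, reference, max_n=4):
    """candidate, reference: token lists.  Geometric mean of clipped
    n-gram precisions times a brevity penalty; no smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        clipped = sum(min(count, ref_ngrams[gram])
                      for gram, count in cand_ngrams.items())
        precisions.append(clipped / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0                      # any zero precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_avg)
```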

21,126 citations

Proceedings Article
08 Aug 2006
TL;DR: This paper defines Translation Edit Rate (TER), a new, intuitive measure for evaluating machine translation output that avoids the knowledge intensiveness of more meaning-based approaches and the labor-intensiveness of human judgments.
Abstract: We examine a new, intuitive measure for evaluating machine-translation output that avoids the knowledge intensiveness of more meaning-based approaches, and the labor-intensiveness of human judgments. Translation Edit Rate (TER) measures the amount of editing that a human would have to perform to change a system output so it exactly matches a reference translation. We show that the single-reference variant of TER correlates as well with human judgments of MT quality as the four-reference variant of BLEU. We also define a human-targeted TER (or HTER) and show that it yields higher correlations with human judgments than BLEU—even when BLEU is given human-targeted references. Our results indicate that HTER correlates with human judgments better than HMETEOR and that the four-reference variants of TER and HTER correlate with human judgments as well as—or better than—a second human judgment does.
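TER itself is the number of edits divided by the reference length; the sketch below keeps only the insertion/deletion/substitution part and omits the block shifts that full TER also allows, so treat it as an approximation with illustrative names:

```python
def ter_sketch(hypothesis, reference):
    """hypothesis, reference: token lists.  Word-level edit distance divided
    by the reference length (no phrase shifts, single reference)."""
    h, r = len(hypothesis), len(reference)
    dist = [[0] * (r + 1) for _ in range(h + 1)]
    for i in range(h + 1):
        dist[i][0] = i
    for j in range(r + 1):
        dist[0][j] = j
    for i in range(1, h + 1):
        for j in range(1, r + 1):
            sub = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[h][r] / max(r, 1)
```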

2,210 citations


"Smatch: an Evaluation Metric for Se..." refers methods in this paper

  • ...We note that other widely-used metrics, such as TER (Snover et al., 2006), are also NP-complete....


Book
01 Jan 1990
TL;DR: Focusing on the structure of meaning in English sentences at a "subatomic" level - that is, a level below the one most theories accept as basic or "atomic" - Parsons asserts that the semantics of simple English sentences require logical forms somewhat more complex than is normally assumed in natural language semantics.
Abstract: This extended investigation of the semantics of event (and state) sentences in their various forms is a major contribution to the semantics of natural language, simultaneously encompassing important issues in linguistics, philosophy, and logic. It develops the view that the logical forms of simple English sentences typically contain quantification over events or states and shows how this view can account for a wide variety of semantic phenomena. Focusing on the structure of meaning in English sentences at a "subatomic" level - that is, a level below the one most theories accept as basic or "atomic" - Parsons asserts that the semantics of simple English sentences require logical forms somewhat more complex than is normally assumed in natural language semantics. His articulation of underlying event theory explains a wide variety of apparently diverse semantic characteristics of natural language, and his development of the theory shows the importance of seeing the distinction between events and states. Parsons demonstrates that verbs, also, indicate kinds of actions rather than specific, individual actions. Verb phrases, too, he argues, depend on modifiers to make their function and meaning in a sentence specific. An appendix gives many of the details needed to formalize the theory discussed in the body of the text and provides a series of templates that permit the generation of atomic formulas of English. Terence Parsons is Professor of Philosophy and Dean of Humanities at the University of California, Irvine.
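The event-quantification claim is easiest to see on the standard textbook example (not one drawn from this listing): an action sentence introduces an event variable, and participants and modifiers become predicates of that event, e.g. for "Brutus stabbed Caesar violently":

```latex
\exists e\,[\,\mathit{stabbing}(e) \wedge \mathit{Agent}(e,\mathit{Brutus})
          \wedge \mathit{Theme}(e,\mathit{Caesar}) \wedge \mathit{violent}(e)\,]
```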

1,437 citations


"Smatch: an Evaluation Metric for Se..." refers methods in this paper

  • ...We work on a semantic feature structure representation in a standard neo-Davidsonian (Davidson, 1969; Parsons, 1990) framework....


Proceedings Article
26 Jul 2005
TL;DR: A learning algorithm is described that takes as input a training set of sentences labeled with expressions in the lambda calculus and induces a grammar for the problem, along with a log-linear model that represents a distribution over syntactic and semantic analyses conditioned on the input sentence.
Abstract: This paper addresses the problem of mapping natural language sentences to lambda-calculus encodings of their meaning. We describe a learning algorithm that takes as input a training set of sentences labeled with expressions in the lambda calculus. The algorithm induces a grammar for the problem, along with a log-linear model that represents a distribution over syntactic and semantic analyses conditioned on the input sentence. We apply the method to the task of learning natural language interfaces to databases and show that the learned parsers outperform previous methods in two benchmark database domains.
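For a concrete sense of the target representation, a typical GeoQuery-style pairing looks like the following (the exact notation in the paper may differ slightly):

```latex
\text{``What states border Texas?''} \;\mapsto\; \lambda x.\,\mathit{state}(x) \wedge \mathit{borders}(x,\mathit{texas})
```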

865 citations


"Smatch: an Evaluation Metric for Se..." refers background in this paper

  • ...whole-sentence accuracy (Zettlemoyer and Collins, 2005), which counts the number of sentences parsed completely correctly....


Proceedings Article
01 May 2002
TL;DR: This paper describes an approach to the development of a Proposition Bank, which involves the addition of semantic information to the Penn English Treebank, and introduces metaframes as a technique for handling similar frames among near-synonymous verbs.
Abstract: This paper describes our approach to the development of a Proposition Bank, which involves the addition of semantic information to the Penn English Treebank. Our primary goal is the labeling of syntactic nodes with specific argument labels that preserve the similarity of roles such as the window in John broke the window and the window broke. After motivating the need for explicit predicate argument structure labels, we briefly discuss the theoretical considerations of predicate argument structure and the need to maintain consistency across syntactic alternations. The issues of consistency of argument structure across both polysemous and synonymous verbs are also discussed and we present our actual guidelines for these types of phenomena, along with numerous examples of tagged sentences and verb frames. Metaframes are introduced as a technique for handling similar frames among near-synonymous verbs. We conclude with a summary of the current status of the annotation process.
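To make the role-preservation point concrete, a schematic rendering of the frameset in question (paraphrased, not copied from the PropBank guidelines): in break.01 the breaker is Arg0 and the thing broken is Arg1, so the window carries the same label in both sentences from the abstract.

  break.01: Arg0 = breaker, Arg1 = thing broken
  "John broke the window."  ->  Arg0: John,  Arg1: the window
  "The window broke."       ->  Arg1: the window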

686 citations


"Smatch: an Evaluation Metric for Se..." refers background in this paper

  • ...Both want-01 and go-01 are frames from PropBank framesets (Kingsbury and Palmer, 2002)....
