Author

Beatrice Santorini

Other affiliations: Northwestern University
Bio: Beatrice Santorini is an academic researcher from the University of Pennsylvania. The author has contributed to research in the topics of Treebank and Parsing. The author has an h-index of 14 and has co-authored 25 publications receiving 9,823 citations. Previous affiliations of Beatrice Santorini include Northwestern University.

Papers
ReportDOI
TL;DR: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown Corpus.
Abstract: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, with over 3 million words of that material assigned skeletal grammatical structure. This material now includes a fully hand-parsed version of the classic Brown Corpus. About one half of the papers at the ACL Workshop on Using Large Text Corpora this past summer were based on the materials generated by this grant.

8,377 citations

01 Jan 1990
TL;DR: This manual addresses the linguistic issues that arise in connection with annotating texts by part of speech ("tagging"); it discusses parts of speech that are easily confused and gives guidelines on how to tag such cases.
Abstract: This manual addresses the linguistic issues that arise in connection with annotating texts by part of speech ("tagging"). Section 2 is an alphabetical list of the parts of speech encoded in the annotation systems of the Penn Treebank Project, along with their corresponding abbreviations ("tags") and some information concerning their definition. This section allows you to find an unfamiliar tag by looking up a familiar part of speech. Section 3 recapitulates the information in Section 2, but this time the information is alphabetically ordered by tags. This is the section to consult in order to find out what an unfamiliar tag means. Since the parts of speech are probably familiar to you from high school English, you should have little difficulty in assimilating the tags themselves. However, it is often quite difficult to decide which tag is appropriate in a particular context. Sections 4 and 5 therefore include examples and guidelines on how to tag problematic cases. If you are uncertain about whether a given tag is correct, refer to these sections in order to ensure a consistently annotated text. Section 4 discusses parts of speech that are easily confused and gives guidelines on how to tag such cases, while Section 5 contains an alphabetical list of specific problematic words and collocations. Finally, Section 6 discusses some general tagging conventions. One general rule, however, is so important that we state it here: many texts are not models of good prose, and some contain outright errors and slips of the pen. Do not be tempted to correct a tag to what it would be if the text were correct; rather, it is the incorrect word that should be tagged correctly. (Part-of-Speech Tagging Guidelines for the Penn Treebank Project, 3rd Revision. University of Pennsylvania Department of Computer and Information Science Technical Report MS-CIS-90-47, LINC LAB 178. Available at ScholarlyCommons: http://repository.upenn.edu/cis_reports/570.)
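To make the manual's two lookup directions concrete, here is a minimal Python sketch: a small excerpt of the tagset keyed both by tag (the Section 3 direction) and by part of speech (the Section 2 direction). The entries shown are a genuine subset of the Penn Treebank tagset; the variable names are illustrative.

```python
# A few entries from the Penn Treebank tagset, keyed both ways: look up an
# unfamiliar tag (Section 3 direction) or find the tag for a familiar part
# of speech (Section 2 direction).
TAG_TO_POS = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
}
POS_TO_TAG = {pos: tag for tag, pos in TAG_TO_POS.items()}

print(TAG_TO_POS["VBD"])         # verb, past tense
print(POS_TO_TAG["determiner"])  # DT
```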

507 citations

Proceedings ArticleDOI
19 Feb 1991
TL;DR: The problem of quantitatively comparing the performance of different broad-coverage grammars of English has to date resisted solution, because known English grammars appear to disagree strongly with each other as to the elements of even the simplest sentences.
Abstract: The problem of quantitatively comparing the performance of different broad-coverage grammars of English has to date resisted solution. Prima facie, known English grammars appear to disagree strongly with each other as to the elements of even the simplest sentences. For instance, the grammars of Steve Abney (Bellcore), Ezra Black (IBM), Dan Flickinger (Hewlett Packard), Claudia Gdaniec (Logos), Ralph Grishman and Tomek Strzalkowski (NYU), Phil Harrison (Boeing), Don Hindle (AT&T), Bob Ingria (BBN), and Mitch Marcus (U. of Pennsylvania) recognize in common only the following constituents, when each grammarian provides the single parse which he/she would ideally want his/her grammar to specify for three sample Brown Corpus sentences: The famed Yankee Clipper, now retired, has been assisting (as (a batting coach)). One of those capital-gains ventures, in fact, has saddled him (with Gore Court). He said this constituted a (very serious) misuse (of the (Criminal court) processes).
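The shared-constituent comparison described above can be sketched by treating each grammarian's ideal parse as a set of constituent spans over the same token sequence and intersecting the sets. The spans and helper below are hypothetical, purely for illustration; this is not the paper's evaluation procedure.

```python
# Illustrative bracket-overlap comparison: each parse is a set of
# (start, end) constituent spans over the same token sequence.
def bracket_overlap(parse_a, parse_b):
    shared = parse_a & parse_b
    precision = len(shared) / len(parse_a)
    recall = len(shared) / len(parse_b)
    return shared, precision, recall

# Two hypothetical grammars' ideal parses of the same sentence,
# agreeing on only some constituents.
grammar_1 = {(0, 4), (9, 13), (10, 13), (11, 13)}
grammar_2 = {(0, 4), (5, 8), (9, 13), (11, 13)}

shared, p, r = bracket_overlap(grammar_1, grammar_2)
print(sorted(shared))          # constituents both grammars posit
print(f"P={p:.2f} R={r:.2f}")
```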

434 citations

Book ChapterDOI
01 Jan 2003
TL;DR: The design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) is described, along with the methodology employed in production.
Abstract: The Penn Treebank, in its eight years of operation (1989–1996), produced approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate-argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. This paper describes the design of the three annotation schemes used by the Treebank (POS tagging, syntactic bracketing, and disfluency annotation) and the methodology employed in production. All available Penn Treebank materials are distributed by the Linguistic Data Consortium (http://www.ldc.upenn.edu).
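To make the bracketing format tangible, here is a hedged sketch of a reader for Penn Treebank-style bracketed parses; the function name and toy sentence are our own, and real Treebank files carry additional detail this sketch ignores.

```python
import re

# Minimal reader for Penn Treebank-style bracketed parses, e.g.
# "(S (NP (DT the) (NN cat)) (VP (VBD sat)))".
def read_tree(s):
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                       # consume "("
        label = tokens[pos]; pos += 1  # constituent or POS label
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos]); pos += 1  # leaf word
        pos += 1                       # consume ")"
        return (label, children)

    return parse()

print(read_tree("(S (NP (DT the) (NN cat)) (VP (VBD sat)))"))
# ('S', [('NP', [('DT', ['the']), ('NN', ['cat'])]), ('VP', [('VBD', ['sat'])])])
```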

393 citations

Journal ArticleDOI
TL;DR: In this paper, it was shown that the position of inflected verbs in early Yiddish varies between second position and positions later in the clause, and that the phrase structure changed from infl-final to infl-medial.
Abstract: The position of inflected verbs in early Yiddish varies between second position and positions later in the clause. Standard distributional tests establish that this reflects variation in the underlying position of infl, and that Yiddish phrase structure changed from infl-final to infl-medial. Based on clauses containing the relevant structural diagnostics, we can estimate the rate of this change. We cannot, however, determine the phrase structure of structurally ambiguous clauses (i.e., those superficially consistent with either of the phrase structures) with certainty. Nevertheless, we can use quantitative methods to estimate the likelihood of such clauses being infl-medial, and we can then use these likelihoods to provide an additional estimate of the rate of the change. Comparing both estimates reveals that they do not differ significantly. The implications of this result are briefly examined in conclusion.
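A numerical sketch of the two estimates the abstract contrasts: unambiguous clauses give a direct proportion of infl-medial structure per period, and ambiguous clauses can be folded in, weighted by their estimated likelihood of being infl-medial. All counts, periods, and probabilities below are invented for illustration; they are not the paper's data.

```python
# Hypothetical counts per period: unambiguously infl-medial and infl-final
# clauses, plus ambiguous clauses with an estimated probability p_medial
# of being infl-medial.
periods = {
    "1400-1500": {"medial": 12, "final": 40, "ambiguous": 30, "p_medial": 0.25},
    "1500-1600": {"medial": 45, "final": 25, "ambiguous": 28, "p_medial": 0.60},
    "1600-1700": {"medial": 70, "final": 8,  "ambiguous": 20, "p_medial": 0.90},
}

for period, d in periods.items():
    # Estimate 1: unambiguous clauses only.
    direct = d["medial"] / (d["medial"] + d["final"])
    # Estimate 2: ambiguous clauses folded in by expected count.
    expected_medial = d["medial"] + d["p_medial"] * d["ambiguous"]
    total = d["medial"] + d["final"] + d["ambiguous"]
    combined = expected_medial / total
    print(f"{period}: direct={direct:.2f}  combined={combined:.2f}")
```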

94 citations


Cited by
Book
28 May 1999
TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear; it provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Abstract: Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.
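As one concrete taste of the book's topic list, a standard approach to collocation finding scores adjacent word pairs by pointwise mutual information. The toy corpus below is invented, and this is only an illustrative sketch of the idea, not the book's code.

```python
import math
from collections import Counter

# Toy corpus of tokenized sentences (invented for illustration).
corpus = [
    "strong tea is strong".split(),
    "powerful computers run strong tea shops".split(),
    "strong tea and powerful computers".split(),
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def pmi(w1, w2):
    # PMI(w1, w2) = log2( P(w1, w2) / (P(w1) P(w2)) )
    p_joint = bigrams[(w1, w2)] / n_bi
    return math.log2(p_joint / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

print(f"PMI(strong, tea) = {pmi('strong', 'tea'):.2f}")
```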

9,295 citations

ReportDOI
TL;DR: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown Corpus.

Abstract: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, with over 3 million words of that material assigned skeletal grammatical structure. This material now includes a fully hand-parsed version of the classic Brown Corpus. About one half of the papers at the ACL Workshop on Using Large Text Corpora this past summer were based on the materials generated by this grant.

8,377 citations

Proceedings ArticleDOI
15 Feb 2018
TL;DR: This paper introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts (i.e., to model polysemy).
Abstract: We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
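The "learned functions of the internal states" can be sketched with the paper's weighting scheme: a softmax-normalized, task-specific combination of the biLM's layer representations, scaled by a learned factor. The layer count, dimensions, and weight values below are assumed purely for illustration.

```python
import numpy as np

# Hypothetical biLM states for one token: 3 layers, each 8-dimensional.
rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 8))

w = np.array([0.1, 0.5, 2.0])    # learned per-layer scalars (assumed values)
gamma = 1.3                      # learned task-specific scale (assumed value)

s = np.exp(w) / np.exp(w).sum()  # softmax-normalized layer weights
elmo = gamma * (s[:, None] * layers).sum(axis=0)
print(elmo.shape)                # (8,) -- one contextualized vector per token
```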

7,412 citations

Posted Content
TL;DR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
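At the core of the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch, with shapes chosen arbitrarily for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 16))   # 4 query positions, d_k = 16
K = rng.normal(size=(6, 16))   # 6 key positions
V = rng.normal(size=(6, 32))   # values, d_v = 32
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 32)
```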

7,019 citations

Posted Content
TL;DR: A simple unsupervised learning algorithm classifies a review as recommended (thumbs up) or not recommended (thumbs down) according to whether the average semantic orientation of its phrases is positive.
Abstract: This paper presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). The classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. A phrase has a positive semantic orientation when it has good associations (e.g., "subtle nuances") and a negative semantic orientation when it has bad associations (e.g., "very cavalier"). In this paper, the semantic orientation of a phrase is calculated as the mutual information between the given phrase and the word "excellent" minus the mutual information between the given phrase and the word "poor". A review is classified as recommended if the average semantic orientation of its phrases is positive. The algorithm achieves an average accuracy of 74% when evaluated on 410 reviews from Epinions, sampled from four different domains (reviews of automobiles, banks, movies, and travel destinations). The accuracy ranges from 84% for automobile reviews to 66% for movie reviews.
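With PMI estimated from co-occurrence counts, the semantic orientation described in the abstract reduces to a log-ratio of counts (the paper estimates them with search-engine hit counts). The numbers below are made up for illustration.

```python
import math

# SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor"),
# which with count-based PMI estimates simplifies to the log-ratio below.
def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor):
    return math.log2((hits_near_excellent * hits_poor) /
                     (hits_near_poor * hits_excellent))

# Invented hit counts for a candidate phrase such as "subtle nuances".
so = semantic_orientation(hits_near_excellent=1200, hits_near_poor=300,
                          hits_excellent=500_000, hits_poor=400_000)
print(f"SO = {so:.2f} -> {'recommended' if so > 0 else 'not recommended'}")
```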

4,526 citations