Author

Salah Ait-Mokhtar

Bio: Salah Ait-Mokhtar is an academic researcher from Xerox. The author has contributed to research in topics: Computer science & Parser combinator. The author has an h-index of 12 and has co-authored 18 publications receiving 872 citations.

Papers
Journal ArticleDOI
TL;DR: This work argues that with a systematic incremental methodology one can go beyond shallow parsing to deeper language analysis, while preserving robustness, and describes a generic system based on such a methodology and designed for building robust analyzers that tackle deeper linguistic phenomena than those traditionally handled by the now widespread shallow parsers.
Abstract: Robustness is a key issue for natural language processing in general and parsing in particular, and many approaches have been explored in the last decade for the design of robust parsing systems. Among those approaches is shallow or partial parsing, which produces minimal and incomplete syntactic structures, often in an incremental way. We argue that with a systematic incremental methodology one can go beyond shallow parsing to deeper language analysis, while preserving robustness. We describe a generic system based on such a methodology and designed for building robust analyzers that tackle deeper linguistic phenomena than those traditionally handled by the now widespread shallow parsers. The rule formalism allows the recognition of n-ary linguistic relations between words or constituents on the basis of global or local structural, topological and/or lexical conditions. It offers the advantage of accepting various types of inputs, ranging from raw to chunked or constituent-marked texts, so for instance it can be used to process existing annotated corpora, or to perform a deeper analysis on the output of an existing shallow parser. It has been successfully used to build a deep functional dependency parser, as well as for the task of co-reference resolution, in a modular way.
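The incremental methodology described here lends itself to a simple illustration: rules fire in a fixed sequence, and each stage only adds relations to the analysis built by earlier stages. The sketch below is a loose, hypothetical rendering of that idea in Python, not the paper's actual rule formalism; the token shapes, rule conditions, and relation names are invented for illustration.

```python
# A minimal sketch of the incremental methodology: rules apply in a fixed
# sequence, each adding relations without discarding earlier results.
# Data shapes and rule conditions are illustrative only.

from dataclasses import dataclass, field

@dataclass
class Token:
    text: str
    pos: str  # part-of-speech tag, assumed given (raw or chunked input)

@dataclass
class Analysis:
    tokens: list
    relations: list = field(default_factory=list)  # (name, head, dependent)

def subject_rule(a: Analysis) -> None:
    # Local topological condition: a noun immediately before a verb.
    for i in range(len(a.tokens) - 1):
        if a.tokens[i].pos == "NOUN" and a.tokens[i + 1].pos == "VERB":
            a.relations.append(("SUBJ", a.tokens[i + 1].text, a.tokens[i].text))

def object_rule(a: Analysis) -> None:
    # A noun immediately after a verb is taken as its object.
    for i in range(1, len(a.tokens)):
        if a.tokens[i].pos == "NOUN" and a.tokens[i - 1].pos == "VERB":
            a.relations.append(("OBJ", a.tokens[i - 1].text, a.tokens[i].text))

def parse(tokens, rules):
    a = Analysis(tokens)
    for rule in rules:      # incremental: each stage only adds information
        rule(a)
    return a

sent = [Token("Mary", "NOUN"), Token("reads", "VERB"), Token("books", "NOUN")]
print(parse(sent, [subject_rule, object_rule]).relations)
# [('SUBJ', 'reads', 'Mary'), ('OBJ', 'reads', 'books')]
```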

321 citations

Proceedings ArticleDOI
31 Mar 1997
TL;DR: This paper describes a new finite-state shallow parser that overcomes the inefficiency of previous fully reductionist constraint-based systems, while maintaining broad coverage and linguistic granularity.
Abstract: This paper describes a new finite-state shallow parser. It merges constructive and reductionist approaches within a highly modular architecture. Syntactic information is added at the sentence level in an incremental way, depending on the contextual information available at a given stage. This approach overcomes the inefficiency of previous fully reductionist constraint-based systems, while maintaining broad coverage and linguistic granularity. The implementation relies on a sequence of networks built with the replace operator. Given the high level of modularity, the core grammar is easily augmented with corpus-specific sub-grammars. The current system is implemented for French and is being expanded to new languages.
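The replace-operator cascade can be mimicked, very roughly, with ordinary regular-expression substitutions: each stage rewrites a POS-tagged string, adding bracketing that later stages rely on. A toy sketch, with invented tag conventions standing in for the paper's compiled finite-state networks:

```python
# A rough illustration of a cascade of replace stages, using Python regexes
# in place of compiled finite-state networks. Each stage rewrites the tagged
# string incrementally; later stages depend on earlier markup.

import re

def np_stage(s: str) -> str:
    # Mark a determiner-(adjective)*-noun sequence as a noun phrase.
    return re.sub(r"(\S+/DET(?: \S+/ADJ)* \S+/NOUN)", r"[NP \1 ]", s)

def vp_stage(s: str) -> str:
    # Mark a verb followed by a noun phrase found by the previous stage.
    return re.sub(r"(\S+/VERB \[NP .+? \])", r"[VP \1 ]", s)

def parse(tagged: str) -> str:
    for stage in (np_stage, vp_stage):   # modular sequence of stages
        tagged = stage(tagged)
    return tagged

print(parse("le/DET chat/NOUN mange/VERB la/DET souris/NOUN"))
# [NP le/DET chat/NOUN ] [VP mange/VERB [NP la/DET souris/NOUN ] ]
```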

174 citations

Patent
22 Apr 2008
TL;DR: In this paper, a system and a method for providing a factuality assessment of a retrieved information source's statement are disclosed. The method includes receiving a user's query which identifies an information source whose statements are to be retrieved, retrieving documents which refer to the information source, mapping statements in the retrieved documents to their authors, identifying as information source statements those mapped statements whose author is compatible with the information source, and, for at least one of the information source's statements, assessing the factuality of the statement with respect to the information source.
Abstract: A system and method for providing a factuality assessment of a retrieved information source's statement are disclosed. The method includes receiving a user's query which identifies an information source whose statements are to be retrieved, retrieving documents which refer to the information source, mapping statements in the retrieved documents to their authors, identifying as information source statements, the mapped statements that are mapped to an author which is compatible with the information source, and for at least one of the information source's statements, assessing the factuality of the information source's statement according to the information source.
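Read as a pipeline, the claimed method has four stages: retrieve documents mentioning the source, map statements to authors, keep the statements whose author is compatible with the source, and assess each kept statement's factuality. The sketch below wires those stages together with placeholder components; none of the retrieval, compatibility, or assessment logic reflects the patented implementation.

```python
# A schematic of the claimed pipeline with stubbed-out components; the
# retrieval, author mapping, and scoring logic are placeholders only.

def retrieve_documents(source: str) -> list:
    # Placeholder: a real system would search a document collection.
    return [
        {"statement": "X will rise.", "author": "J. Smith"},
        {"statement": "Y is false.", "author": "Jane Smith"},
        {"statement": "Z happened.", "author": "Other Person"},
    ]

def author_compatible(author: str, source: str) -> bool:
    # Placeholder compatibility test: surname overlap with the queried source.
    return author.split()[-1].lower() in source.lower()

def assess_factuality(statement: str) -> str:
    # Placeholder: real assessment would analyse modality, tense, polarity.
    return "speculative" if "will" in statement else "asserted"

def factuality_report(source: str) -> list:
    docs = retrieve_documents(source)
    own = [d for d in docs if author_compatible(d["author"], source)]
    return [(d["statement"], assess_factuality(d["statement"])) for d in own]

print(factuality_report("Jane Smith"))
# [('X will rise.', 'speculative'), ('Y is false.', 'asserted')]
```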

82 citations

01 Jan 1997
TL;DR: An approach for fast automatic recognition and extraction of subject and object dependency relations from large French corpora, using a sequence of finite-state transducers, is described, and the impact of POS tagging errors on subject/object dependency extraction is evaluated.
Abstract: We describe and evaluate an approach for fast automatic recognition and extraction of subject and object dependency relations from large French corpora, using a sequence of finite-state transducers. The extraction is performed in two major steps: incremental finite-state parsing and extraction of subject/verb and object/verb relations. Our incremental and cautious approach during the first phase allows the system to deal successfully with complex phenomena such as embeddings, coordination of VPs and NPs or non-standard word order. The extraction requires no subcategorisation information. It relies on POS information only. After describing the two steps, we give the results of an evaluation on various types of unrestricted corpora. Precision is around 90-97% for subjects (84-88% for objects) and recall around 86-92% for subjects (80-90% for objects). We also provide some error analysis; in particular, we evaluate the impact of POS tagging errors on subject/object dependency extraction.
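The reported figures are standard precision and recall over extracted relation tuples. For concreteness, a minimal sketch of that computation on toy gold and system sets (the relations themselves are made up):

```python
# Precision/recall over extracted relation tuples, on invented toy data.

def precision_recall(system: set, gold: set):
    tp = len(system & gold)                     # correctly extracted relations
    precision = tp / len(system) if system else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {("SUBJ", "mange", "chat"), ("OBJ", "mange", "souris"),
        ("SUBJ", "dort", "chien")}
system = {("SUBJ", "mange", "chat"), ("OBJ", "mange", "souris"),
          ("OBJ", "dort", "os")}               # one wrong, one missed
print(precision_recall(system, gold))          # (0.666..., 0.666...)
```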

67 citations

Proceedings Article
01 Oct 2001

58 citations


Cited by
Journal ArticleDOI
01 May 2006
TL;DR: It is shown that extending the term‐counting method with contextual valence shifters improves the accuracy of the classification, and combining the two methods achieves better results than either method alone.
Abstract: We present two methods for determining the sentiment expressed by a movie review. The semantic orientation of a review can be positive, negative, or neutral. We examine the effect of valence shifters on classifying the reviews. We examine three types of valence shifters: negations, intensifiers, and diminishers. Negations are used to reverse the semantic polarity of a particular term, while intensifiers and diminishers are used to increase and decrease, respectively, the degree to which a term is positive or negative. The first method classifies reviews based on the number of positive and negative terms they contain. We use the General Inquirer to identify positive and negative terms, as well as negation terms, intensifiers, and diminishers. We also use positive and negative terms from other sources, including a dictionary of synonym differences and a very large Web corpus. To compute corpus-based semantic orientation values of terms, we use their association scores with a small group of positive and negative terms. We show that extending the term-counting method with contextual valence shifters improves the accuracy of the classification. The second method uses a Machine Learning algorithm, Support Vector Machines. We start with unigram features and then add bigrams that consist of a valence shifter and another word. The accuracy of classification is very high, and the valence shifter bigrams slightly improve it. The features that contribute to the high accuracy are the words in the lists of positive and negative terms. Previous work focused on either the term-counting method or the Machine Learning method. We show that combining the two methods achieves better results than either method alone.
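The first (term-counting) method is easy to make concrete. A toy sketch, where the short hand-written word lists stand in for the General Inquirer lexicon and only the immediately preceding word is checked for a valence shifter:

```python
# Toy term-counting sentiment classifier with contextual valence shifters.
# The word lists are illustrative stand-ins for the General Inquirer lexicon.

POSITIVE = {"good", "great", "enjoyable"}
NEGATIVE = {"bad", "boring", "awful"}
NEGATIONS = {"not", "never"}
INTENSIFIERS = {"very", "extremely"}   # scale a term's polarity up
DIMINISHERS = {"barely", "somewhat"}   # scale it down

def review_orientation(tokens):
    score = 0.0
    for i, tok in enumerate(tokens):
        polarity = 1.0 if tok in POSITIVE else -1.0 if tok in NEGATIVE else 0.0
        if polarity == 0.0:
            continue
        prev = tokens[i - 1] if i > 0 else ""
        if prev in NEGATIONS:        # reverse semantic polarity
            polarity = -polarity
        elif prev in INTENSIFIERS:   # strengthen
            polarity *= 2.0
        elif prev in DIMINISHERS:    # weaken
            polarity *= 0.5
        score += polarity
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(review_orientation("the film was not boring and very enjoyable".split()))
# positive: "not boring" flips to +1, "very enjoyable" scores +2
```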

735 citations

Book
01 Jan 1997
TL;DR: This paper discusses attempts to derive templates directly from corpora and to derive knowledge structures and lexicons directly from corpora, including discussion of the recent LE project ECRAN, which attempted to tune existing lexicons to new corpora.
Abstract: It seems widely agreed that IE (Information Extraction) is now a tested language technology that has reached precision+recall values that put it in about the same position as Information Retrieval and Machine Translation, both of which are widely used commercially. There is also a clear range of practical applications that would be eased by the sort of template-style data that IE provides. The problem for wider deployment of the technology is adaptability: the ability to customize IE rapidly to new domains. In this paper we discuss some methods that have been tried to ease this problem, and to create something more rapid than the bench-mark one-month figure, which was roughly what ARPA teams in IE needed to adapt an existing system by hand to a new domain of corpora and templates. An important distinction in discussing the issue is the degree to which a user can be assumed to know what is wanted, to have preexisting templates ready to hand, as opposed to a user who has a vague idea of what is needed from a corpus. We shall discuss attempts to derive templates directly from corpora; to derive knowledge structures and lexicons directly from corpora, including discussion of the recent LE project ECRAN which attempted to tune existing lexicons to new corpora. An important issue is how far established methods in Information Retrieval of tuning to a user’s needs with feedback at an interface can be transferred to IE.

716 citations

Book ChapterDOI
01 Jan 2003
TL;DR: A treebank project for French has annotated a newspaper corpus of 1 million words with part of speech, inflection, compounds, lemmas, and constituency; the paper also presents some uses of the corpus.
Abstract: We present a treebank project for French. We have annotated a newspaper corpus of 1 million words with part of speech, inflection, compounds, lemmas and constituency. We describe the tagging and parsing phases of the project, and for each, the automatic tools, the guidelines and the validation process. We then present some uses of the corpus as well as some directions for future work.

509 citations

Journal ArticleDOI
TL;DR: This survey discusses related issues and main approaches to these problems, namely, subjectivity classification, word sentiment classification, document sentiment classification and opinion extraction.
Abstract: The sentiment detection of texts has witnessed a booming interest in recent years, due to the increased availability of online reviews in digital form and the ensuing need to organize them. To date, there are mainly four different problems predominating in this research community, namely, subjectivity classification, word sentiment classification, document sentiment classification and opinion extraction. In fact, there are inherent relations between them. Subjectivity classification can prevent the sentiment classifier from considering irrelevant or even potentially misleading text. Document sentiment classification and opinion extraction have often involved word sentiment classification techniques. This survey discusses related issues and main approaches to these problems.

447 citations

Journal ArticleDOI
TL;DR: Text mining is used to transform patent documents into structured data from which keyword vectors are identified, and principal component analysis is employed to reduce the number of keyword vectors so that they are suitable for use on a two-dimensional map.
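The pipeline this summary sketches, keyword vectors reduced to two dimensions for plotting, can be approximated in a few lines of scikit-learn; this is an analogue under assumed tooling, not the paper's own implementation:

```python
# TF-IDF keyword vectors reduced by PCA to two components for a patent map.
# scikit-learn stands in for whatever tooling the paper actually used.

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

patents = [
    "battery cell electrode lithium charge",
    "lithium electrode anode battery capacity",
    "image sensor pixel lens optical",
    "optical lens focus sensor camera",
]

tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(patents).toarray()      # documents x keywords

coords = PCA(n_components=2).fit_transform(vectors)   # 2-D map positions
for doc, (x, y) in zip(patents, coords):
    print(f"({x:+.2f}, {y:+.2f})  {doc[:30]}")
```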

386 citations