scispace - formally typeset
Author

Ludovic Tanguy

Bio: Ludovic Tanguy is an academic researcher from the University of Toulouse. The author has contributed to research in topics: Distributional semantics & Query expansion. The author has an h-index of 16, has co-authored 92 publications receiving 930 citations. Previous affiliations of Ludovic Tanguy include École Normale Supérieure & Centre national de la recherche scientifique.


Papers
Journal ArticleDOI
TL;DR: This paper describes the NLP techniques designed and used, in a collaboration between the CLLE-ERSS research laboratory and the CFH/Safety Data company, to manage and analyse aviation incident reports.

105 citations

Proceedings Article
01 Jan 2005
TL;DR: Using 16 different linguistic features automatically computed on TREC queries, three of these features are shown to have a significant impact on either recall or precision scores for previous ad hoc TREC campaigns.
Abstract: Query difficulty can be linked to a number of causes. Some of these causes can be related to the query expression itself, and can therefore be detected through a linguistic analysis of the query text. Using 16 different linguistic features, automatically computed on TREC queries, we looked for significant correlations between these features and the average recall and precision scores obtained by systems. Three of these features are shown to have a significant impact on either recall or precision scores for previous ad hoc TREC campaigns. Each of these features can be viewed as a clue to a linguistically specific characteristic, either morphological, syntactic or semantic. These results also open the way for a more enlightened use of linguistic processing in IR systems.
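The correlation analysis this abstract describes (correlating per-query feature values with average effectiveness scores and testing for significance) can be sketched in plain Python. This is a minimal illustration with made-up toy values, not the authors' code or data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def t_statistic(r, n):
    """t value for testing H0: r = 0 with n paired observations
    (compare against a Student t distribution, n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Toy values: one linguistic feature per query (e.g. syntactic link span)
# and the precision averaged over all submitted runs for that query.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
avg_precision = [0.50, 0.42, 0.40, 0.33, 0.30]

r = pearson_r(feature, avg_precision)
t = t_statistic(r, len(feature))
```

In the study itself, the samples would be the 16 feature vectors over the 200 TREC queries, correlated against the recall and precision averages computed from participants' runs.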

100 citations

Proceedings Article
23 May 2012
TL;DR: Two first analyses taking advantage of this new annotated corpus are presented: one that tested hypotheses on constraints governing discourse structure, and another that studied the variations in composition and signalling of multi-level discourse structures.
Abstract: This paper describes the ANNODIS resource, a discourse-level annotated corpus for French. The corpus combines two perspectives on discourse: a bottom-up approach and a top-down approach. The bottom-up view incrementally builds a structure from elementary discourse units, while the top-down view focuses on the selective annotation of multi-level discourse structures. The corpus is composed of texts that are diversified with respect to genre, length and type of discursive organisation. The methodology followed here involves an iterative design of annotation guidelines in order to reach satisfactory inter-annotator agreement levels. This allows us to raise a few issues relevant to the comparison of such complex objects as discourse structures. The corpus also serves as a source of empirical evidence for discourse theories. We present here two first analyses taking advantage of this new annotated corpus: one that tested hypotheses on constraints governing discourse structure, and another that studied the variations in composition and signalling of multi-level discourse structures.

68 citations

01 Jan 2005
TL;DR: In the context of the ARIEL research project, which investigates the impact of linguistic processing in IR systems, linguistics is considered as a way to predict query difficulty rather than as a means to model IR.
Abstract: Query difficulty can be linked to a number of causes. Some of these causes can be related to the query expression itself, and can therefore be detected through a linguistic analysis of the query text. Using 16 different linguistic features, automatically computed on TREC queries, we looked for significant correlations between these features and the average recall and precision scores obtained by systems. Each of these features can be viewed as a clue to a linguistically specific characteristic, either morphological, syntactic or semantic. Two of these features (syntactic link span and polysemy value) are shown to have a significant impact on either recall or precision scores for previous ad hoc TREC campaigns. Although the correlation values are not very high, they indicate a promising link between some linguistic characteristics and query difficulty.

1. CONTEXT
This study has been conducted in the context of the ARIEL research project, in which we investigate the impact of linguistic processing in IR systems. The ultimate objective is to build an adaptive IR system in which several natural language processing (NLP) techniques are available but are selectively used for a given query, depending on the predicted efficiency of each technique.

2. OBJECTIVE
Although linguistics and NLP have been viewed as natural solutions for IR, the overall efficiency of the techniques used in IR systems is doubtful at best. From fine-grained morphological analysis to query expansion based on semantic word classes, linguistically sound techniques and resources have often proven no more efficient than cruder techniques [5] [8]. In this paper, we consider linguistics as a way to predict query difficulty rather than as a means to model IR.

3. RELATED WORK
A closely related approach is the analysis performed by [7] on the CLEF topics.
Their intent was to discover whether some query features could be correlated with system performance, and thus indicate a kind of bias in this evaluation campaign, and further to build a fusion-based IR engine. The linguistic features they used to describe each topic mostly concerned syntax and word forms, and were calculated by hand. They measured the correlation between these features and average precision, but the only significant result was a correlation of 0.4 between the number of proper nouns and average precision. Further studies led the authors to named entities as a useful feature, and they were able to propose a fusion-based model that improved overall precision after classifying topics according to the number of named entities. The precision increase using this feature varied from 0 to 10% across several tasks (mono- and multi-lingual). Our study deals with more linguistic features, especially in order to address syntactic complexity. In addition, we only used automatic analysis methods based on NLP techniques.

Focusing on documents instead of queries, [6] also used linguistic features to characterize documents in IR collections. His main point was to study the notion of relevance, to test whether it could be related to stylistic features, and to see whether the genre of a document could be useful for relevant document selection. [3] also used documents to predict query difficulty, using a clarity score that depends on both the query and the target collection. Both of these studies therefore need exhaustive information on the collection, while we decided to focus on queries only, in order to cover a wider range of IR situations. In [2], several classes of topic failures were drawn up manually, but no indication was given of how to assign a topic to a category automatically.

4. METHOD
We selected the following data: the TREC 3, 5, 6 and 7 results for the ad hoc task, corresponding to a total of 200 queries (50 per year).
Each query in these collections was automatically analysed and described with 16 variables, each corresponding to a specific linguistic feature. We considered only the title part of the query, as its length and format are closest to a real user's query. Because the TREC web site makes participants' runs available (i.e. the lists of retrieved documents for each query), it was possible to compute the average recall and precision scores for each run and each query (using the trec_eval utility). We then computed the average recall and precision values over runs for each query. Finally, we computed the correlation between these scores and the linguistic feature variables, and tested the correlation values for statistical significance.

As a first result, while simple features dealing with the number or size of words in a query, or with the presence of certain parts of speech, have no clear effect on a query's difficulty, the more sophisticated variables led to interesting results. Globally, the syntactic complexity of a query has a negative impact on precision scores, and the semantic ambiguity of the query words has a negative impact on recall scores. A little less significantly, the morphological complexity of words also has a negative effect on recall.

4.1. Linguistic Features
Using linguistic features to study a document is a well-known technique; it has been used thoroughly in NLP tasks ranging from classification to genre analysis. The principle is quite simple: the text (the query, in our case) is first analysed using generic parsing techniques (e.g. part-of-speech tagging, chunking, and parsing), and simple programs then compute the corresponding information from the tagged text data. We used:

- TreeTagger (by H. Schmid, available at www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/) for part-of-speech tagging and lemmatisation: this tool assigns a single morphosyntactic category to each word in the input text, based on a general lexicon and a language model;
- Syntex [4] for shallow parsing (syntactic link detection): this analyser identifies syntactic relations between words in a sentence, based on grammatical rules.

In addition, we used the following resources:

- the WordNet 1.6 semantic network to compute semantic ambiguity: this database provides, among other information, the possible meanings of a given word;
- the CELEX database for derivational morphology: this resource gives the morphological decomposition of a given word.

In keeping with the final objective, an automatic classification of queries, all the features considered are computed without any human intervention, and are as such prone to processing errors. The 16 linguistic features we computed are listed in Table 1, grouped into three classes according to their level of linguistic analysis:

Table 1: List of linguistic features

Morphological features:
- NBWORDS: number (#) of words
- LENGTH: average word length
- MORPH: average # of morphemes per word
- SUFFIX: average # of suffixed tokens per word
- PN: average # of proper nouns
- ACRO: average # of acronyms
- NUM: average # of numeral values (dates, quantities, etc.)
- UNKNOWN: average # of unknown tokens

Syntactic features:
- CONJ: average # of conjunctions
- PREP: average # of prepositions
- PP: average # of personal pronouns
- SYNTDEPTH: average syntactic depth
- SYNTDIST: average syntactic link span

(The semantic features are cut off in this excerpt.)
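The simplest of the surface features in Table 1 can be computed directly from a raw query string. The sketch below is a toy illustration of that kind of feature extraction; the whitespace tokenisation and the acronym/numeral heuristics are assumptions, and the deeper features (MORPH, SYNTDEPTH, polysemy) require a POS tagger, a parser, WordNet and CELEX as described above:

```python
import re

def query_features(query: str) -> dict:
    """Compute a few of the surface features from Table 1 on a raw query
    string (simplified sketch; real tokenisation would come from a tagger)."""
    tokens = query.split()
    n = len(tokens)
    return {
        "NBWORDS": n,                                # number of words
        "LENGTH": sum(len(t) for t in tokens) / n,   # average word length
        # crude acronym heuristic: all-capital tokens of length >= 2
        "ACRO": sum(t.isupper() and len(t) >= 2 for t in tokens) / n,
        # numeral values (dates, quantities, ...)
        "NUM": sum(bool(re.fullmatch(r"\d+([.,]\d+)?", t)) for t in tokens) / n,
    }

feats = query_features("NASA budget cuts in 1995")
```

Each query's feature vector would then be paired with its averaged recall/precision scores for the correlation study.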

57 citations

01 Jan 2000
TL;DR: This article proposes an approach that extends from a linguistic study of the phenomenon to the construction of patterns enabling automatic detection of the statements.
Abstract: For the specific case of defining statements, this article proposes an approach that extends from a linguistic study of the phenomenon to the construction of patterns enabling automatic detection of such statements. Particular attention is paid to corpus analysis technologies, and emphasis is placed above all on the different practices at work in this approach: a linguistic practice, a practice of detection tools, and a practice specific to the study of corpora.
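The pattern-based detection the abstract describes can be sketched with lexico-syntactic regular expressions. The French patterns below are invented toy examples, not the patterns established in the article:

```python
import re

# Hypothetical lexico-syntactic patterns for French defining statements;
# the article derives its actual patterns from a corpus-based linguistic study.
DEFINITION_PATTERNS = [
    re.compile(r"\best défini(e|s|es)? comme\b", re.IGNORECASE),
    re.compile(r"\bon appelle\b", re.IGNORECASE),
    re.compile(r"\bdésigne\b", re.IGNORECASE),
]

def is_defining(sentence: str) -> bool:
    """Return True if the sentence matches any definitional pattern."""
    return any(p.search(sentence) for p in DEFINITION_PATTERNS)

print(is_defining("Un corpus est défini comme un ensemble de textes."))  # True
```

A real system would apply such patterns after tagging, so that they can constrain parts of speech as well as surface forms.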

52 citations


Cited by
Book
01 Jan 1972
Invisible Colleges: Diffusion of Knowledge in Scientific Communities.

1,262 citations

01 Jan 2001
TL;DR: This book presents models of word frequency distributions, covering non-parametric and parametric models, mixture distributions, and the randomness assumption.
Abstract: 1. Word Frequencies. 2. Non-parametric models. 3. Parametric models. 4. Mixture distributions. 5. The Randomness Assumption. 6. Examples of Applications. A. List of Symbols. B. Solutions of the exercises. C. Software. D. Data sets. Bibliography. Index.

422 citations