
Showing papers by "Patrick Paroubek published in 2018"


Book
14 May 2018
TL;DR: This article presents an information extraction method that collects additional information from the web so as to enrich existing information and populate a knowledge base, using lexical and syntactic patterns.
Abstract: Relation pattern extraction and information extraction from the web. This article presents an information extraction method that collects additional information from the web so as to enrich existing information and populate a knowledge base. Our method is based on lexical and syntactic patterns, used both as search queries and as extraction patterns, allowing the analysis of unstructured documents. To that end, we first defined relevant criteria drawn from the analysis phase so as to ease the discovery of new values. KEYWORDS: pattern construction, information extraction, named-entity extraction, dependency syntax, extraction-pattern learning, web as corpus.
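As a minimal illustration of the kind of lexical extraction pattern the abstract describes, the sketch below applies a single hand-written pattern to free text. The pattern, relation and sample text are invented for illustration; the paper's actual patterns, including the dependency-syntax ones, are not reproduced here.

```python
import re

# Hypothetical lexical pattern: "<Person> was born in <Place>"
# extracts a birthplace relation from unstructured text.
BORN_IN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (?P<place>[A-Z][a-z]+)"
)

text = "Mark Twain was born in Florida. He later moved to Hartford."
relations = [(m.group("person"), m.group("place")) for m in BORN_IN.finditer(text)]
print(relations)  # [('Mark Twain', 'Florida')]
```

In the paper's setting such a pattern would double as a search query (the literal part, "was born in") and as an extraction pattern (the capture groups), which is what lets the same resource drive both web retrieval and slot filling.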

28 citations


15 May 2018
TL;DR: Four tasks were proposed: identify tweets on the transport theme; among those, identify the polarity (negative, neutral, positive, mixed); identify the sentiment markers and their target; and finally, fully annotate each tweet with the source and target of the expressed sentiments.
Abstract: This article presents the 2018 edition of the DEFT evaluation campaign (Défi Fouille de Textes). From a corpus of tweets, four tasks were proposed: identify tweets on the transport theme; among those, identify the polarity (negative, neutral, positive, mixed); identify the sentiment markers and their target; and finally, fully annotate each tweet with the source and target of the expressed sentiments. Twelve teams participated, mostly in the first two tasks. On transport-theme identification, the micro F-measure ranges from 0.827 to 0.908; on global polarity identification, it ranges from 0.381 to 0.823.
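The micro F-measure used to rank systems can be sketched as follows. With exactly one predicted label per tweet, micro precision, recall and F1 all reduce to plain accuracy, but the general computation is kept for clarity; the polarity labels below are invented for illustration.

```python
def micro_f1(gold, pred):
    """Micro-averaged F-measure over single-label predictions.

    True positives are exact label matches; every mismatch counts
    once as a false positive and once as a false negative, so with
    one label per item micro P, R and F1 all equal accuracy.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == p)
    fp = len(pred) - tp
    fn = len(gold) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented polarity labels for five tweets (cf. task 2).
gold = ["negative", "neutral", "positive", "mixed", "positive"]
pred = ["negative", "positive", "positive", "mixed", "neutral"]
print(round(micro_f1(gold, pred), 3))  # 3 of 5 correct -> 0.6
```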

8 citations


Proceedings Article
01 May 2018
TL;DR: A measure of innovativeness for authors and publications is proposed based on the NLP4NLP corpus, which contains the articles published in major conferences and journals related to speech and language processing over 50 years.
Abstract: The goal of this paper is to propose measures of innovation through the study of publications in the field of speech and language processing. It is based on the NLP4NLP corpus, which contains the articles published in major conferences and journals related to speech and language processing over 50 years (1965-2015). It represents 65,003 documents from 34 different sources (conferences and journals), published by 48,894 different authors in 558 events, for a total of more than 270 million words and 324,422 bibliographical references. The data was obtained in textual form or as images that had to be converted into text, which resulted in lower quality for the oldest papers, measured by computing an unknown-word ratio. Multi-word technical terms were automatically extracted after parsing, using a set of general-language text corpora. The occurrences, frequencies, existences and presences of the terms were then computed overall, for each year and for each document, yielding a list of 3.5 million different terms and 24 million term occurrences. The evolution of research topics over the years, as reflected by the terms' presence, was then computed, and we propose a measure of topic popularity based on this computation. The author(s) who introduced each term were identified, together with the year and the publication in which the term was first introduced. We then studied the global and evolving contributions of authors to a given topic, and likewise those of the various publications. We finally propose a measure of innovativeness for authors and publications.
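The unknown-word ratio used to gauge the quality of the OCR-converted papers can be sketched as follows. The reference lexicon and the sample tokens are invented; the paper does not specify the exact lexicon or tokenisation it used.

```python
def unknown_word_ratio(tokens, lexicon):
    """Fraction of tokens absent from a reference lexicon; a higher
    ratio suggests poorer conversion quality for a scanned paper."""
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t.lower() not in lexicon)
    return unknown / len(tokens)

# Toy lexicon and OCR output with two garbled tokens.
lexicon = {"the", "corpus", "contains", "speech", "and", "language"}
tokens = "the corpvs contains speeeh and language".split()
print(unknown_word_ratio(tokens, lexicon))  # 2 unknown out of 6 tokens
```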

2 citations


Proceedings Article
01 May 2018
TL;DR: The manual annotation model and its annotation guidelines are presented and the planned machine learning experiments and evaluations are described.
Abstract: In this paper we report on the collection, in the context of the MIROR project, of a corpus of biomedical articles for the task of automatic detection of inadequate claims (spin), a task which, to our knowledge, has never been addressed before. We present the manual annotation model and its annotation guidelines, and describe the planned machine learning experiments and evaluations.

1 citation


Proceedings Article
08 May 2018
TL;DR: This work identified and collected 29 translations of Mark Twain's Adventures of Huckleberry Finn published in 23 languages including less-resourced languages, and evaluated the correctness of chapter alignment by computing the percentage of common words between the English version and the translated ones.
Abstract: In this paper, we present our ongoing research work to create a massively parallel corpus of translated literary texts, useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. Using a crowdsourcing approach, we identified and collected 29 translations of Mark Twain's Adventures of Huckleberry Finn published in 23 languages, including less-resourced languages. We report on the current status of the corpus, with 5 chapter-aligned translations (English-Dutch, two English-Hungarian, English-Polish and English-Russian). We evaluated the correctness of chapter alignment by computing the percentage of common words between the English version and each translation; the resulting percentages, varying between 43% and 64%, support the correctness of the chapter alignment.
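The common-word check for chapter alignment can be sketched as follows. The paper does not specify its exact tokenisation, so this is a minimal word-overlap version, and the English/Dutch fragments are invented; across languages the overlap comes mostly from proper nouns and cognates, so aligned chapters should score markedly higher than misaligned ones.

```python
def common_word_percentage(chapter_a, chapter_b):
    """Percentage of distinct words of chapter_a that also occur in
    chapter_b (after lowercasing)."""
    words_a = set(chapter_a.lower().split())
    words_b = set(chapter_b.lower().split())
    if not words_a:
        return 0.0
    return 100.0 * len(words_a & words_b) / len(words_a)

# Invented English/Dutch fragments sharing proper nouns.
en = "Huckleberry Finn and Tom Sawyer met by the Mississippi"
nl = "Huckleberry Finn en Tom Sawyer ontmoetten elkaar bij de Mississippi"
print(round(common_word_percentage(en, nl), 1))  # shared names dominate the overlap
```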

1 citation