
Showing papers by "Patrick Paroubek published in 2018"


Book
14 May 2018
TL;DR: This article presents an information extraction method that collects additional information from the web so as to enrich existing information and populate a knowledge base, using lexical and syntactic patterns.
Abstract: Relation pattern extraction and information extraction from the web. This article presents an information extraction method that collects additional information from the web so as to enrich existing information and populate a knowledge base. Our method is based on lexical and syntactic patterns, used both as search queries and as extraction patterns, allowing the analysis of unstructured documents. To that end, we first defined relevant criteria drawn from the analysis phase so as to ease the discovery of new values. KEYWORDS: pattern construction, information extraction, named-entity extraction, dependency syntax, extraction-pattern learning, web as corpus.
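As a minimal illustration of the kind of lexical extraction pattern the abstract describes, the sketch below applies a single hand-written pattern to free text. The pattern, relation and sample text are invented for illustration; the paper's actual patterns, including the dependency-syntax ones, are not reproduced here.

```python
import re

# Hypothetical lexical pattern: "<Person> was born in <Place>"
# extracts a birthplace relation from unstructured text.
BORN_IN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (?P<place>[A-Z][a-z]+)"
)

text = "Mark Twain was born in Florida. He later moved to Hartford."
relations = [(m.group("person"), m.group("place")) for m in BORN_IN.finditer(text)]
print(relations)  # [('Mark Twain', 'Florida')]
```

In the paper's setting such a pattern would double as a search query (the literal part, "was born in") and as an extraction pattern (the capture groups), which is what lets the same resource drive both web retrieval and slot filling.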

28 citations


15 May 2018
TL;DR: Four tasks were proposed: identify tweets on the transport theme; among those, identify the polarity (negative, neutral, positive, mixed); identify the sentiment markers and their target; and finally, fully annotate each tweet with the source and target of the expressed sentiments.
Abstract: This article presents the 2018 edition of the DEFT evaluation campaign (Défi Fouille de Textes). From a corpus of tweets, four tasks were proposed: identify tweets on the transport theme; among those, identify the polarity (negative, neutral, positive, mixed); identify the sentiment markers and their target; and finally, fully annotate each tweet with the source and target of the expressed sentiments. Twelve teams participated, mostly in the first two tasks. On transport-theme identification, the micro F-measure ranges from 0.827 to 0.908; on global polarity identification, it ranges from 0.381 to 0.823.
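The micro F-measure used to rank systems can be sketched as follows. With exactly one predicted label per tweet, micro precision, recall and F1 all reduce to plain accuracy, but the general computation is kept for clarity; the polarity labels below are invented for illustration.

```python
def micro_f1(gold, pred):
    """Micro-averaged F-measure over single-label predictions.

    True positives are exact label matches; every mismatch counts
    once as a false positive and once as a false negative, so with
    one label per item micro P, R and F1 all equal accuracy.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == p)
    fp = len(pred) - tp
    fn = len(gold) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented polarity labels for five tweets (cf. task 2).
gold = ["negative", "neutral", "positive", "mixed", "positive"]
pred = ["negative", "positive", "positive", "mixed", "neutral"]
print(round(micro_f1(gold, pred), 3))  # 3 of 5 correct -> 0.6
```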

8 citations


Proceedings Article
01 May 2018
TL;DR: A measure of innovativeness for authors and publications is proposed based on the NLP4NLP corpus, which contains the articles published in major conferences and journals related to speech and language processing over 50 years.
Abstract: The goal of this paper is to propose measures of innovation through the study of publications in the field of speech and language processing. It is based on the NLP4NLP corpus, which contains the articles published in major conferences and journals related to speech and language processing over 50 years (1965-2015). It represents 65,003 documents from 34 different sources (conferences and journals), published by 48,894 different authors in 558 events, for a total of more than 270 million words and 324,422 bibliographical references. The data was obtained in textual form or as images that had to be converted into text, which resulted in lower quality for the oldest papers, measured by computing an unknown-word ratio. Multi-word technical terms were automatically extracted after parsing, using a set of general-language text corpora. The occurrences, frequencies, existences and presences of the terms were then computed overall, for each year and for each document, yielding a list of 3.5 million different terms and 24 million term occurrences. The evolution of research topics over the years, as reflected by the terms' presence, was then computed, and we propose a measure of topic popularity based on this computation. The author(s) who introduced each term were identified, together with the year and the publication in which the term was first introduced. We then studied the global and evolving contributions of authors to a given topic, and likewise those of the various publications. We finally propose a measure of innovativeness for authors and publications.
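The unknown-word ratio used to gauge the quality of the OCR-converted papers can be sketched as follows. The reference lexicon and the sample tokens are invented; the paper does not specify the exact lexicon or tokenisation it used.

```python
def unknown_word_ratio(tokens, lexicon):
    """Fraction of tokens absent from a reference lexicon; a higher
    ratio suggests poorer conversion quality for a scanned paper."""
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t.lower() not in lexicon)
    return unknown / len(tokens)

# Toy lexicon and OCR output with two garbled tokens.
lexicon = {"the", "corpus", "contains", "speech", "and", "language"}
tokens = "the corpvs contains speeeh and language".split()
print(unknown_word_ratio(tokens, lexicon))  # 2 unknown out of 6 tokens
```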

2 citations


Proceedings Article
01 May 2018
TL;DR: The manual annotation model and its annotation guidelines are presented and the planned machine learning experiments and evaluations are described.
Abstract: In this paper we report on the collection, in the context of the MIROR project, of a corpus of biomedical articles for the task of automatic detection of inadequate claims (spin), a task which, to our knowledge, has never been addressed before. We present the manual annotation model and its annotation guidelines, and describe the planned machine learning experiments and evaluations.

1 citation


Proceedings Article
08 May 2018
TL;DR: This work identified and collected 29 translations of Mark Twain's Adventures of Huckleberry Finn published in 23 languages including less-resourced languages, and evaluated the correctness of chapter alignment by computing the percentage of common words between the English version and the translated ones.
Abstract: In this paper, we present our ongoing research work to create a massively parallel corpus of translated literary texts, useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. Using a crowdsourcing approach, we identified and collected 29 translations of Mark Twain's Adventures of Huckleberry Finn published in 23 languages, including less-resourced languages. We report on the current status of the corpus, with 5 chapter-aligned translations (English-Dutch, two English-Hungarian, English-Polish and English-Russian). We evaluated the correctness of chapter alignment by computing the percentage of common words between the English version and each translation; the resulting percentages, varying between 43% and 64%, support the correctness of the chapter alignment.
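The common-word check for chapter alignment can be sketched as follows. The paper does not specify its exact tokenisation, so this is a minimal word-overlap version, and the English/Dutch fragments are invented; across languages the overlap comes mostly from proper nouns and cognates, so aligned chapters should score markedly higher than misaligned ones.

```python
def common_word_percentage(chapter_a, chapter_b):
    """Percentage of distinct words of chapter_a that also occur in
    chapter_b (after lowercasing)."""
    words_a = set(chapter_a.lower().split())
    words_b = set(chapter_b.lower().split())
    if not words_a:
        return 0.0
    return 100.0 * len(words_a & words_b) / len(words_a)

# Invented English/Dutch fragments sharing proper nouns.
en = "Huckleberry Finn and Tom Sawyer met by the Mississippi"
nl = "Huckleberry Finn en Tom Sawyer ontmoetten elkaar bij de Mississippi"
print(round(common_word_percentage(en, nl), 1))  # shared names dominate the overlap
```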

1 citation