Data Mining: Practical Machine Learning Tools and Techniques

How to Do Things With Words

We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. We conclude by examining factors that make the sentiment classification problem more challenging.

/pdf/thumbs-up-sentiment-classiflcation-using-machine-learning-1m4vdmh1b0.pdf

Thumbs up? Sentiment Classification using Machine Learning Techniques

http://cfile28.uf.tistory.com/attach/23159E3D57117DF7107077

Twitter mood predicts the stock market.

The measurement of meaning

Providing a comparative framework for parsers is a task that has already been attempted, e.g. (Abeille, 1991), (Atwell and Sutcliffe, 1997), (Black et al., 1991), and studied in the literature (Black, 1993), (Black, 1994), (Carroll et al., 1998), (Gaizauskas et al., 1998), (WEPS-98, ), (Mengel and Lezius, 2000), but mainly for English. In this paper, we present PEAS: a Protocol for Evaluating Analyzers of Syntax (in French: Protocole d’Evaluation pour les Analyseurs Syntaxiques), based on an ongoing experiment at LIMSI which aims at developing and testing a generic quantitative black-box evaluation protocol for parsers of French. Two fully operational parsers will be used to test the evaluation protocol: the parser (Giguet and Vergne, 1997) developed at GREYC (Caen University) and the latest version of the parser developed at Rank Xerox Research Center in Grenoble (Ait-Mokhtar and Chanod, 1997).

/pdf/a-protocol-for-evaluating-analyzers-of-syntax-peas-318v6piak2.pdf

A Protocol for Evaluating Analyzers of Syntax (PEAS)

The present study focuses on automatic processing of sibling resources of audio and written documents, such as those available in audio archives or for parliamentary debates, where the written texts are close to, but not exact, transcripts of the audio. Such resources deserve attention for several reasons: they represent an interesting testbed for studying differences between written and spoken material, and they yield low-cost resources for acoustic model training. When the audio data is transcribed automatically, regions of agreement between the automatic transcripts and the written sources allow time-codes to be transferred to the written documents, which may be helpful in an audio archive or audio information retrieval environment. Regions of disagreement can be automatically selected for further correction by human transcribers. This study makes use of 10 hours of French radio interview archives with corresponding press-oriented transcripts. The audio corpus was transcribed using the LIMSI speech recognizer, resulting in automatic transcripts with an average word error rate of 12%. 80% of the text corpus (with word chunks of at least five words) can be exactly aligned with the automatic transcripts of the audio data. The residual word error rate on these 80% is less than 1%.

/pdf/automatic-audio-and-manual-transcripts-alignment-time-code-547nq671su.pdf

Automatic Audio and Manual Transcripts Alignment, Time-code Transfer and Selection of Exact Transcripts
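
The agreement-region idea in the abstract above can be illustrated with a minimal sketch. The abstract does not say which alignment algorithm was used, so this example swaps in Python's standard `difflib.SequenceMatcher` as a stand-in. The function names, the sample sentences, and the per-word start times are all hypothetical; only the five-word minimum chunk length comes from the abstract.

```python
from difflib import SequenceMatcher


def exact_regions(written, automatic, min_words=5):
    """Return (written_start, automatic_start, length) for each run of at
    least `min_words` identical words shared by the two word lists."""
    matcher = SequenceMatcher(a=written, b=automatic, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]


def transfer_time_codes(written, automatic, times, min_words=5):
    """Map written-word index -> start time (seconds), using only the words
    that fall inside a region of agreement."""
    codes = {}
    for a_start, b_start, size in exact_regions(written, automatic, min_words):
        for k in range(size):
            codes[a_start + k] = times[b_start + k]
    return codes


# Hypothetical data: a written transcript, a recognizer output with one
# inserted hesitation and one misrecognized word, and recognizer word times.
written = "the minister said that the budget will be voted next week".split()
automatic = "the minister said that uh the budget will be voted next weak".split()
times = [0.0, 0.3, 0.8, 1.1, 1.3, 1.6, 1.8, 2.2, 2.5, 2.7, 3.1, 3.4]

regions = exact_regions(written, automatic)
codes = transfer_time_codes(written, automatic, times)
coverage = len(codes) / len(written)  # share of written words given a time-code
```

In this toy example only the six-word run "the budget will be voted next" passes the five-word threshold; the shorter matching run before the hesitation and the mismatched final word are left for manual review, which mirrors the selection of regions of disagreement described in the abstract.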

We describe the MAPA project, funded under the Connecting Europe Facility programme, whose goal is the development of an open-source de-identification toolkit for all official European Union languages. The project runs from January 2020 to December 2021.

https://hal.archives-ouvertes.fr/hal-03103205/document

The Multilingual Anonymisation Toolkit for Public Administrations (MAPA) Project

Recent advances in neural computing and word embeddings for semantic processing open up many new application areas that had been left unaddressed so far because of inadequate language understanding capacity. But these new approaches rely even more on training data to be operational. Corpora for financial applications exist, but most of them concern stock market prediction and are in English. To address this need for the French language and for regulation-oriented applications, which require a deeper understanding of the text content, we hereby present “DoRe”, a French and dialectal French Corpus for NLP analytics in Finance, Regulation and Investment. This corpus is composed of: (a) 1769 Annual Reports from 336 companies among the most capitalized companies in France (Euronext Paris) and Belgium (Euronext Brussels), covering a time frame from 2009 to 2019, and (b) related metadata containing information for each company about its ISIN code, capitalization and sector. This corpus is designed to be as modular as possible in order to allow for maximum reuse in different tasks pertaining to Economics, Finance and Regulation. After presenting existing resources, we relate the construction of the DoRe corpus and the rationale behind our choices, concluding with the spectrum of possible uses of this new resource for NLP applications.

NLP Analytics in Finance with DoRe: A French 250M Tokens Corpus of Corporate Annual Reports.

We have created the NLP4NLP corpus to study the content of scientific publications in the field of speech and natural language processing. It contains articles published in 34 major conferences and journals in that field over a period of 50 years (1965-2015), comprising 65,000 documents, gathering 50,000 authors, including 325,000 references and representing approximately 270 million words. Most of these publications are in English; some are in French, German or Russian. Some are open access; others have been provided by the publishers. In order to constitute and analyze this corpus, several tools have been used or developed. Some of them use Natural Language Processing methods that have been published in the corpus, hence its name. Numerous manual corrections were necessary, which demonstrated the importance of establishing standards for uniquely identifying authors, publications or resources. We have conducted various studies: evolution over time of the number of articles and authors, collaborations between authors, citations between papers and authors, evolution of research themes and identification of the authors who introduced them, measurement of innovation and detection of epistemological ruptures, use of language resources, and reuse of articles and plagiarism in the context of a global or comparative analysis between sources.

Patrick Paroubek

Papers

A Protocol for Evaluating Analyzers of Syntax (PEAS)

Automatic Audio and Manual Transcripts Alignment, Time-code Transfer and Selection of Exact Transcripts

The Multilingual Anonymisation Toolkit for Public Administrations (MAPA) Project

NLP Analytics in Finance with DoRe: A French 250M Tokens Corpus of Corporate Annual Reports.

Rediscovering 50 years of discoveries in speech and language processing: A survey