
Showing papers by "Georgios Paltoglou" published in 2015


Journal ArticleDOI
TL;DR: The analyses show that validity and reliability vary with the formality and diversity of the text; ready-to-use methods leave much room for improvement when analyzing domain-specific content, and a machine-learning approach offers more accurate predictions across communication domains.
Abstract: This study offers a systematic comparison of automated content analysis tools. The ability of different lexicons to correctly identify affective tone (e.g., positive vs. negative) is assessed in different social media environments. Our comparisons examine the reliability and validity of publicly available, off-the-shelf classifiers. We use datasets from a range of online sources that vary in the diversity and formality of the language used, and we apply different classifiers to extract information about the affective tone in these datasets. We first measure agreement (reliability test) and then compare their classifications with the benchmark of human coding (validity test). Our analyses show that validity and reliability vary with the formality and diversity of the text; we also show that ready-to-use methods leave much room for improvement when analyzing domain-specific content and that a machine-learning approach offers more accurate predictions across communication domains.
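
A minimal sketch of the two tests described above: the classifier names, outputs, and labels below are hypothetical placeholders, not the paper's tools or data.

```python
# Minimal sketch of the reliability and validity tests described above.
# All labels and classifier outputs are hypothetical placeholders.
from itertools import combinations

# Hypothetical sentiment labels (-1 = negative, 0 = neutral, 1 = positive)
# assigned by three off-the-shelf classifiers to the same five texts.
classifier_outputs = {
    "lexicon_a": [1, -1, 0, 1, -1],
    "lexicon_b": [1, -1, 1, 1, -1],
    "ml_model":  [1, -1, 0, 1,  0],
}
human_labels = [1, -1, 0, 1, 0]  # benchmark from manual human coding

def agreement(a, b):
    """Proportion of items on which two label sequences coincide."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Reliability test: pairwise agreement between the classifiers.
for (n1, out1), (n2, out2) in combinations(classifier_outputs.items(), 2):
    print(f"{n1} vs {n2}: agreement = {agreement(out1, out2):.2f}")

# Validity test: each classifier against the human-coded benchmark.
for name, out in classifier_outputs.items():
    print(f"{name} vs human coding: accuracy = {agreement(out, human_labels):.2f}")
```

In practice a reliability analysis would use a chance-corrected statistic such as Krippendorff's alpha; raw percentage agreement is used here only to keep the sketch self-contained.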

78 citations


Journal ArticleDOI
TL;DR: A new multilayer collection selection method that effectively suggests classification codes by exploiting hierarchical classification schemes such as IPC/CPC; it outperforms state-of-the-art methods from the DIR domain not only in identifying relevant IPC codes but also in retrieving relevant patent documents given a patent query.
Abstract: In this paper we present a method that can be used to attain specific objectives in a typical prior art search process. The objectives are first to assist patent searchers in understanding the underlying technical concepts of a patent by identifying relevant international patent classification (IPC) codes, and second to help them conduct a filtered search based on automatically selected IPCs. We view the automated selection of IPCs as a collection selection problem from the domain of distributed information retrieval (DIR) that can be addressed using existing DIR methods, which we extend and adapt for the patent domain. Our work exploits the intellectually assigned classification codes that are used to categorize patents and to facilitate patent searches. In our method, manually assigned IPC codes of patent documents are used to cluster, distribute and index patents across hundreds or thousands of sub-collections. We propose a new multilayer collection selection method that effectively suggests classification codes by exploiting hierarchical classification schemes such as IPC/CPC. In addition to utilizing the topical relevance of IPCs at a particular level of interest, the new method exploits the topical relevance of their ancestors in the IPC hierarchy and aggregates those multiple relevance estimates into a single estimate. Experimental results on the CLEF-IP 2011 dataset show that the proposed approach outperforms state-of-the-art methods from the DIR domain, not only in identifying relevant IPC codes but also in retrieving relevant patent documents given a patent query.
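
A minimal sketch of the multilayer aggregation idea: a code's own relevance score is combined with discounted scores of its ancestors in the IPC hierarchy. The codes, per-level scores, and geometric-decay weighting below are illustrative assumptions, not the paper's actual model.

```python
# Minimal sketch of multilayer collection selection: the relevance of an
# IPC code is aggregated with the relevance of its ancestors. Codes,
# scores, and the decay weighting are illustrative assumptions only.

def ancestors(ipc_code):
    """Return simplified ancestor codes, e.g. 'G06F17' -> ['G06F', 'G06', 'G']."""
    prefixes = [ipc_code[:4], ipc_code[:3], ipc_code[:1]]
    return [p for p in prefixes if p and p != ipc_code]

# Hypothetical per-level relevance scores for one query, e.g. produced by
# a DIR collection-selection method run against each level's sub-collections.
level_scores = {
    "G06F17": 0.62, "G06F": 0.55, "G06": 0.40, "G": 0.20,
    "H04L29": 0.30, "H04L": 0.35, "H04": 0.25, "H": 0.10,
}

def multilayer_score(code, decay=0.5):
    """Weighted average of a code's own score and its ancestors' scores,
    with geometrically decaying ancestor weights."""
    score = level_scores.get(code, 0.0)
    weight = 1.0
    total_weight = 1.0
    for anc in ancestors(code):
        weight *= decay  # each level up contributes with reduced weight
        score += weight * level_scores.get(anc, 0.0)
        total_weight += weight
    return score / total_weight

for code in ("G06F17", "H04L29"):
    print(code, round(multilayer_score(code), 3))
```

In this sketch, ranking candidate codes by the aggregated score would drive both the code suggestions shown to the searcher and the choice of sub-collections for the filtered search.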

10 citations


Posted Content
TL;DR: In this paper, the authors compare automated content analysis tools by assessing their ability to correctly identify affective tone (e.g., positive vs. negative) in different data contexts and social media environments.
Abstract: This study offers a systematic comparison of automated content analysis tools by assessing their ability to correctly identify affective tone (e.g., positive vs. negative) in different data contexts and social media environments. Our comparisons assess the reliability and validity of publicly available, off-the-shelf classifiers. We use datasets from a range of online sources that vary in the diversity and formality of the language used, and we apply different classifiers to extract information about the affective tone in these datasets. We first measure agreement (reliability test) and then compare their classifications with the benchmark of human coding (validity test). Our analyses show that validity and reliability vary with the formality and diversity of the text; we also show that ready-to-use methods leave much room for improvement in domain-specific content and that a machine-learning approach offers more accurate predictions.

2 citations