Open Access

Detection of Duplicate Content on Twitter

TLDR
A scale that can be used to measure the extent of semantic overlap in a pair of tweets and a prediction model that can predict the duplicity order of a tweet-pair with an average accuracy of over 80% are developed.
Abstract
Depending on the needs of a user, searching on Twitter streams has proven to be very useful for a wide range of applications. Twitter, however, has large volumes of information available in the form of tweets, shared by millions of users. We have observed that many of these tweets contain mutually redundant information, which consequently affects the search and retrieval process on Twitter streams. Redundancy in the information presented to a user hinders the objective of a search engine, which is to limit a user's effort by not presenting content that has already been seen. Detecting redundant information in tweets is all the more challenging due to the colloquial nature of the language used. In addition, the presence of tweet-specific constructs like hash-tags, shortened URLs and @-mentions further complicates the task of detecting mutually redundant tweets, or duplicate tweets. Although it is relatively easy to automatically detect duplicate tweets that are syntactically identical (i.e., the text in the tweets is identical), it is challenging to detect tweets that are semantically equivalent (i.e., tweets that have the same underlying meaning) but syntactically different. By analyzing a large Twitter benchmark dataset (provided as part of the TREC microblog search challenge) used for the microblog search task in 2011, we observed that there is a varying extent to which a pair of tweets can be duplicates. We developed a scale that can be used to measure the extent of semantic overlap in a pair of tweets. Through our analysis, we identified the key aspects around which strategies for the detection of duplicate tweets can be designed, namely terminological, semantic and linguistic aspects. We designed and implemented strategies for the detection of duplicate tweets from that perspective. Our main contributions are four-fold.
Firstly, we developed a scale that can be used to measure the degree to which a pair of tweets are duplicates. Secondly, we analyzed the top-k search results on Twitter and observed the extent to which duplicates appear in top-k retrieval. Thirdly, we designed and implemented strategies for the detection of duplicates; our best strategy combination results in a commendable F-measure of just over 80% (0.82). Finally, we developed a prediction model that can predict the duplicity order of a tweet-pair with an average accuracy of over 80%.
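To make the terminological aspect concrete, the sketch below shows one simple way such a duplicate check could look: strip tweet-specific constructs (shortened URLs, @-mentions, hash-tags), then compare the remaining token sets with Jaccard similarity. This is an illustrative example only, not the authors' implementation; the function names, the regular expressions and the sample tweets are all assumptions made for the sketch.

```python
# Illustrative sketch of a "terminological" duplicate check for tweet-pairs.
# NOT the paper's method: normalization rules and names are hypothetical.
import re

def normalize(tweet: str) -> set:
    """Remove tweet-specific constructs and return a lowercase token set."""
    tweet = re.sub(r"https?://\S+", "", tweet)  # shortened URLs
    tweet = re.sub(r"[@#]\w+", "", tweet)       # @-mentions and hash-tags
    return set(re.findall(r"[a-z']+", tweet.lower()))

def jaccard(a: str, b: str) -> float:
    """Token-set overlap as a rough measure of terminological duplication."""
    ta, tb = normalize(a), normalize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

t1 = "Big quake hits Japan http://t.co/abc #earthquake"
t2 = "@cnn Big quake hits Japan today"
print(jaccard(t1, t2))  # 4 shared tokens out of 5 distinct -> 0.8
```

A real system along the lines the abstract describes would combine a score like this with semantic evidence (e.g., synonym matching) and linguistic evidence before deciding where a tweet-pair falls on the duplicity scale.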


