Open Access

Detection of Duplicate Content on Twitter

TLDR
A scale that can be used to measure the extent of semantic overlap in a pair of tweets and a prediction model that can predict the duplicity order of a tweet-pair with an average accuracy of over 80% are developed.
Abstract
Depending on the needs of a user, searching on Twitter streams has proven to be very useful for a wide range of applications. Twitter, however, has large volumes of information available in the form of tweets, shared by millions of users. We have observed that many of these tweets contain mutually redundant information, which consequently affects the search and retrieval process on Twitter streams. Redundancy in the information presented to a user hinders the objective of a search engine, which is to limit a user's effort by not presenting content that has already been seen. Detecting redundant information in tweets is all the more challenging due to the colloquial nature of the language used. In addition, the presence of tweet-specific constructs like hash-tags, shortened URLs and @-mentions further complicates the task of detecting mutually redundant tweets, or duplicate tweets. Although it is relatively easy to automatically detect duplicate tweets that are syntactically identical (i.e., the text in the tweets is identical), it is challenging to detect tweets that are semantically equivalent (i.e., tweets that have the same underlying meaning) but syntactically different. By analyzing a large Twitter benchmark dataset (provided as part of the TREC microblog search challenge) used for the microblog search task in 2011, we observed that there is a varying extent to which a pair of tweets can be duplicates. We developed a scale that can be used to measure the extent of semantic overlap in a pair of tweets. Through our analysis, we identified the key aspects around which strategies for the detection of duplicate tweets can be designed, namely terminological, semantic and linguistic aspects. We designed and implemented strategies for the detection of duplicate tweets from that perspective. Our main contributions are four-fold.
Firstly, we developed a scale that can be used to measure the degree to which a pair of tweets are duplicates. Secondly, we analyzed the top-k search results on Twitter and observed the extent to which duplicates appear in top-k retrieval. Thirdly, we designed and implemented strategies for the detection of duplicates; our best strategy combination results in a commendable F-measure of just over 80% (0.82). Finally, we developed a prediction model that can predict the duplicity order of a tweet-pair with an average accuracy of over 80%.
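To make the terminological aspect concrete, the sketch below shows one simple way such a duplicate check could look: strip tweet-specific constructs (shortened URLs, @-mentions, hash-tags), then compare the remaining token sets with Jaccard similarity. This is an illustrative example only, not the authors' implementation; the function names, the regular expressions and the sample tweets are all assumptions made for the sketch.

```python
# Illustrative sketch of a "terminological" duplicate check for tweet-pairs.
# NOT the paper's method: normalization rules and names are hypothetical.
import re

def normalize(tweet: str) -> set:
    """Remove tweet-specific constructs and return a lowercase token set."""
    tweet = re.sub(r"https?://\S+", "", tweet)  # shortened URLs
    tweet = re.sub(r"[@#]\w+", "", tweet)       # @-mentions and hash-tags
    return set(re.findall(r"[a-z']+", tweet.lower()))

def jaccard(a: str, b: str) -> float:
    """Token-set overlap as a rough measure of terminological duplication."""
    ta, tb = normalize(a), normalize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

t1 = "Big quake hits Japan http://t.co/abc #earthquake"
t2 = "@cnn Big quake hits Japan today"
print(jaccard(t1, t2))  # 4 shared tokens out of 5 distinct -> 0.8
```

A real system along the lines the abstract describes would combine a score like this with semantic evidence (e.g., synonym matching) and linguistic evidence before deciding where a tweet-pair falls on the duplicity scale.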


