Journal ArticleDOI

Microblogs data management: a survey

01 Jan 2020-Vol. 29, Iss: 1, pp 177-216
TL;DR: This paper reviews core components that enable large-scale querying and indexing of microblogs data, and discusses system-level issues and ongoing efforts to support microblogs through the rising wave of big data systems.
Abstract: Microblogs data is the micro-length user-generated data that is posted on the web, e.g., tweets, online reviews, and comments on news and social media. It has gained considerable attention in recent years due to its widespread popularity, rich content, and value in several societal applications. Nowadays, microblogs applications span a wide spectrum of interests including targeted advertising, market reports, news delivery, political campaigns, rescue services, and public health. Consequently, major research efforts have been spent to manage, analyze, and visualize microblogs to support different applications. This paper gives a comprehensive review of major research and system work in microblogs data management. The paper reviews core components that enable large-scale querying and indexing of microblogs data. A dedicated part focuses on system-level issues and ongoing efforts to support microblogs through the rising wave of big data systems. In addition, we review the major research topics that exploit these core data management components to provide innovative and effective analysis and visualization for microblogs, such as event detection, recommendations, automatic geotagging, and user queries. Throughout the different parts, we highlight the challenges, innovations, and future opportunities in microblogs data research.


Citations
Posted Content
TL;DR: This paper is the first complete description of the resulting open source AsterixDB system, covering the system's data model, its query language, and its software architecture.
Abstract: AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem. Its features make it well-suited to applications like web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a query language that supports a wide range of queries; a scalable runtime; partitioned, LSM-based data storage and indexing (including B+-tree, R-tree, and text indexes); support for external as well as natively stored data; a rich set of built-in types; support for fuzzy, spatial, and temporal types and queries; a built-in notion of data feeds for ingestion of data; and transaction support akin to that of a NoSQL store. Development of AsterixDB began in 2009 and led to a mid-2013 initial open source release. This paper is the first complete description of the resulting open source AsterixDB system. Covered herein are the system's data model, its query language, and its software architecture. Also included are a summary of the current status of the project and a first glimpse into how AsterixDB performs when compared to alternative technologies, including a parallel relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data analytics platform, on tasks that the compared technologies have in common. Also included is a brief description of some initial trials that the system has undergone and the lessons learned (and plans laid) based on those early "customer" engagements.

168 citations

Proceedings ArticleDOI
TL;DR: This study demonstrates the potential of multitask models on this type of problem and improves the state-of-the-art results on the fine-grained sentiment classification problem.
Abstract: Traditional sentiment analysis approaches tackle problems like ternary (3-category) and fine-grained (5-category) classification by learning the tasks separately. We argue that such classification tasks are correlated, and we propose a multitask approach based on a recurrent neural network that benefits from jointly learning them. Our study demonstrates the potential of multitask models on this type of problem and improves the state-of-the-art results on the fine-grained sentiment classification problem.

53 citations

Proceedings ArticleDOI
20 Apr 2020
TL;DR: An efficient divide-and-conquer algorithm is proposed to derive bounds on the spatial similarity and textual similarity between two semantic trajectories, which enables pruning dissimilar trajectory pairs without computing the exact value of the spatio-textual similarity.
Abstract: Matching similar pairs of trajectories, called trajectory similarity join, is a fundamental functionality in spatial data management. We consider the problem of semantic trajectory similarity join (STS-Join). Each semantic trajectory is a sequence of Points-of-interest (POIs) with both location and text information. Thus, given two sets of semantic trajectories and a threshold θ, the STS-Join returns all pairs of semantic trajectories from the two sets with spatio-textual similarity no less than θ. This join targets applications such as term-based trajectory near-duplicate detection, geo-text data cleaning, personalized ridesharing recommendation, keyword-aware route planning, and travel itinerary recommendation. With these applications in mind, we provide a purposeful definition of spatio-textual similarity. To enable efficient STS-Join processing on large sets of semantic trajectories, we develop trajectory pair filtering techniques and consider the parallel processing capabilities of modern processors. Specifically, we present a two-phase parallel search algorithm. We first group semantic trajectories based on their text information. The algorithm's per-group searches are independent of each other and thus can be performed in parallel. For each group, the trajectories are further partitioned based on the spatial domain. We generate spatial and textual summaries for each trajectory batch, based on which we develop batch filtering and trajectory-batch filtering techniques to prune unqualified trajectory pairs in a batch mode. Additionally, we propose an efficient divide-and-conquer algorithm to derive bounds on the spatial similarity and textual similarity between two semantic trajectories, which enable us to prune dissimilar trajectory pairs without computing the exact value of the spatio-textual similarity.
An experimental study with large semantic trajectory data confirms that our algorithm for processing semantic trajectory joins outperforms our well-designed baseline by a factor of 8–12.
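The bound-based filtering above depends on how spatio-textual similarity is defined. As an illustration only, a naive all-pairs join under one common style of definition (a weighted sum of a distance-decay spatial score and a Jaccard textual score; the weight alpha, the decay, and all names here are assumptions, not the paper's actual formulation):

```python
import math

def spatial_sim(p, q, d_max=1.0):
    """Distance-decay similarity between two (x, y) points, in [0, 1]."""
    d = math.dist(p, q)
    return max(0.0, 1.0 - d / d_max)

def textual_sim(terms_a, terms_b):
    """Jaccard similarity between two term sets."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def st_sim(traj_a, traj_b, alpha=0.5, d_max=1.0):
    """Average pointwise spatio-textual similarity of two aligned
    semantic trajectories, given as lists of ((x, y), terms) POIs."""
    scores = [alpha * spatial_sim(pa, pb, d_max) +
              (1 - alpha) * textual_sim(ta, tb)
              for (pa, ta), (pb, tb) in zip(traj_a, traj_b)]
    return sum(scores) / len(scores) if scores else 0.0

def sts_join(set_a, set_b, theta, alpha=0.5, d_max=1.0):
    """Naive STS-Join: all index pairs with similarity >= theta.
    (The paper avoids exactly this quadratic scan via filtering.)"""
    return [(i, j) for i, a in enumerate(set_a)
                   for j, b in enumerate(set_b)
                   if st_sim(a, b, alpha, d_max) >= theta]
```

The quadratic scan is what the paper's batch filtering and similarity bounds are designed to avoid; the sketch only fixes the semantics of one candidate join predicate.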

32 citations


Cites background from "Microblogs data management: a survey"

  • ...Additionally, spatial keyword search is extensively investigated by existing studies [4], [7], [23], [24]....

Journal ArticleDOI
TL;DR: Web of Science core collection was taken as the data source, and traditional statistical methods and CiteSpace software were used to carry out the scientometrics analysis of SMBD, which showed the research status, hotspots and trends in this field.
Abstract: Social Media Big Data (SMBD) is widely used to serve the economic and social development of human beings. However, as a young research and practice field, the understanding of SMBD in academia is not enough and needs to be supplemented. This paper took the Web of Science (WoS) core collection as the data source, and used traditional statistical methods and CiteSpace software to carry out a scientometric analysis of SMBD, which showed the research status, hotspots, and trends in this field. The results showed that: (1) More and more attention has been paid to SMBD research in academia, and the number of publications has increased in recent years, mainly in subjects such as Computer Science, Engineering, and Telecommunications. The results were published primarily in IEEE Access, Sustainability, Future Generation Computer Systems: The International Journal of eScience, and so on; (2) In terms of contributions, China, the United States, the United Kingdom, and other countries (regions) have published the most papers in SMBD, and high-yield institutions also mainly come from these countries (regions). There were already some excellent teams in the field, such as the Wanggen Wan team at Shanghai University and the Haoran Xie team at City University of Hong Kong; (3) We studied the hotspots of SMBD in recent years and summarized the frontier of SMBD based on keywords and co-cited literature, including the deep mining and construction of social media technology, the reflections and concerns about the rapid development of social media, and the role of SMBD in solving human social development problems. These studies could provide value and references for SMBD researchers seeking to understand the research status, hotspots, and trends in this field.

29 citations

Journal ArticleDOI
09 Mar 2020
TL;DR: This work proposes solutions that are capable of supporting real-life location-based publish/subscribe applications that process large numbers of SST and RST subscriptions over a realistic stream of spatio-temporal documents.
Abstract: Massive amounts of data that contain spatial, textual, and temporal information are being generated at a rapid pace. With streams of such data, which includes check-ins and geo-tagged tweets, available, users may be interested in being kept up-to-date on which terms are popular in the streams in a particular region of space. To enable this functionality, we aim at efficiently processing two types of general top-k term subscriptions over streams of spatio-temporal documents: region-based top-k spatial-temporal term (RST) subscriptions and similarity-based top-k spatio-temporal term (SST) subscriptions. RST subscriptions continuously maintain the top-k most popular trending terms within a user-defined region. SST subscriptions free users from defining a region and maintain top-k locally popular terms based on a ranking function that combines term frequency, term recency, and term proximity. To solve the problem, we propose solutions that are capable of supporting real-life location-based publish/subscribe applications that process large numbers of SST and RST subscriptions over a realistic stream of spatio-temporal documents. The performance of our proposed solutions is studied in extensive experiments using two spatio-temporal datasets.
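As an illustration of the RST case, a minimal region-based top-k term subscription over a document stream can be sketched as below (rectangular region and plain frequency counts are simplifying assumptions; the paper's SST ranking additionally combines term frequency, recency, and proximity, and all names here are hypothetical):

```python
import heapq
from collections import Counter

class RSTSubscription:
    """Region-based top-k term subscription: counts terms from stream
    documents whose location falls inside a rectangular region."""

    def __init__(self, region, k):
        self.region = region  # (min_x, min_y, max_x, max_y)
        self.k = k
        self.counts = Counter()

    def _contains(self, loc):
        x, y = loc
        min_x, min_y, max_x, max_y = self.region
        return min_x <= x <= max_x and min_y <= y <= max_y

    def feed(self, loc, terms):
        """Ingest one spatio-temporal document (location + term list)."""
        if self._contains(loc):
            self.counts.update(terms)

    def top_k(self):
        """Current top-k terms in the region, by frequency."""
        return heapq.nlargest(self.k, self.counts.items(),
                              key=lambda kv: kv[1])
```

A real publish/subscribe system would share index structures across many such subscriptions rather than scanning each one per document, which is the scalability problem the paper addresses.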

29 citations


Cites background from "Microblogs data management: a survey"

  • ...[36] offer a comprehensive tutorial and survey, respectively, on this topic....

References
Proceedings ArticleDOI
26 Apr 2010
TL;DR: This paper investigates the real-time interaction of events such as earthquakes in Twitter and proposes an algorithm to monitor tweets and to detect a target event and produces a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location.
Abstract: Twitter, a popular microblogging service, has received much attention recently. An important characteristic of Twitter is its real-time nature. For example, when an earthquake occurs, people make many Twitter posts (tweets) related to the earthquake, which enables detection of earthquake occurrence promptly, simply by observing the tweets. As described in this paper, we investigate the real-time interaction of events such as earthquakes in Twitter and propose an algorithm to monitor tweets and to detect a target event. To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event that can find the center and the trajectory of the event location. We consider each Twitter user as a sensor and apply Kalman filtering and particle filtering, which are widely used for location estimation in ubiquitous/pervasive computing. The particle filter works better than other comparable methods for estimating the centers of earthquakes and the trajectories of typhoons. As an application, we construct an earthquake reporting system in Japan. Because of the numerous earthquakes and the large number of Twitter users throughout the country, we can detect an earthquake with high probability (96% of earthquakes of Japan Meteorological Agency (JMA) seismic intensity scale 3 or more are detected) merely by monitoring tweets. Our system detects earthquakes promptly and sends e-mails to registered users. Notification is delivered much faster than the announcements that are broadcast by the JMA.
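The users-as-sensors idea admits a compact probabilistic reading: if each positive "sensor" report is independently a false alarm with some rate, the probability that an event actually occurred grows quickly with the number of reports. A minimal sketch (the rate value is illustrative, and the weighted average is only a stand-in for the paper's Kalman/particle filtering of the event location):

```python
def event_probability(n_positive, false_positive_rate=0.35):
    """Probability that an event occurred given n independent positive
    sensor reports, each a false alarm with the given rate:
    P = 1 - rate ** n."""
    return 1.0 - false_positive_rate ** n_positive

def weighted_center(locations, weights):
    """Weighted-average estimate of the event center from tweet
    locations (a crude stand-in for particle filtering)."""
    total = sum(weights)
    x = sum(w * loc[0] for loc, w in zip(locations, weights)) / total
    y = sum(w * loc[1] for loc, w in zip(locations, weights)) / total
    return (x, y)
```

With even a pessimistic per-sensor false-alarm rate, a handful of concurrent reports drives the event probability close to 1, which is why monitoring tweets can beat official announcements on latency.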

3,976 citations

Proceedings Article
01 May 2010
TL;DR: This paper shows how to automatically collect a corpus for sentiment analysis and opinion mining purposes and builds a sentiment classifier, that is able to determine positive, negative and neutral sentiments for a document.
Abstract: Microblogging today has become a very popular communication tool among Internet users. Millions of users share opinions on different aspects of life every day. Therefore, microblogging websites are rich sources of data for opinion mining and sentiment analysis. Because microblogging appeared relatively recently, there are few research works devoted to this topic. In our paper, we focus on using Twitter, the most popular microblogging platform, for the task of sentiment analysis. We show how to automatically collect a corpus for sentiment analysis and opinion mining purposes. We perform linguistic analysis of the collected corpus and explain discovered phenomena. Using the corpus, we build a sentiment classifier that is able to determine positive, negative, and neutral sentiments for a document. Experimental evaluations show that our proposed techniques are efficient and perform better than previously proposed methods. In our research we worked with English; however, the proposed technique can be used with any other language.
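The collect-then-train pipeline can be sketched end to end: emoticons act as noisy distant labels, and a classifier is trained on the emoticon-stripped text. This is a minimal illustration only (the emoticon heuristics and the pure-Python multinomial naive Bayes below are assumptions; the paper's own classifier and feature set differ):

```python
import math
from collections import Counter

def emoticon_label(tweet):
    """Distant supervision: label a tweet by its emoticons
    (illustrative markers, not the paper's exact collection rules)."""
    if ":)" in tweet or ":-)" in tweet:
        return "positive"
    if ":(" in tweet or ":-(" in tweet:
        return "negative"
    return None

class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, doc):
        def log_score(c):
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            s = math.log(self.priors[c])
            for w in doc.split():
                s += math.log((self.word_counts[c][w] + 1) / total)
            return s
        return max(self.classes, key=log_score)
```

Usage: label raw tweets with `emoticon_label`, strip the emoticons so the labels do not leak into the features, and fit the classifier on the remaining text.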

2,570 citations


"Microblogs data management: a survey" refers to methods in this paper

  • ...The used models are both aggregation and SVM classification models....

  • ...ADSEM [242] uses SVM and naive Bayes classifiers to enhance the precision of mapping tweets to Wikipedia-based concepts....

  • ...It uses SVM classifiers and labeled training earthquake data to classify earthquake-related tweets....

  • ...The used features include different types of language-based features such as unigrams [5,35,120,164,247,262], bigrams [35,120,262], trigrams [262], n-grams [82,180,185,248], and POS tags [5,39,120,180,185,247,248,262], microblog-specific features [185,248] such as retweets [39], hashtags [35,39,164], emoticons [35,39,164,180], links [35,39], and other features such as punctuation-based [5,82,180,248], pattern-based [5,35,82,164,180,248] and semantic-based [180,247]....

  • ...The framework first applies transfer learning and label propagation to automatically generate labeled data, then learns an SVM text classifier based on tweet mini-clusters obtained by graph partitioning....

23 Jun 2011
TL;DR: This article introduced POS-specific prior polarity features and explored the use of a tree kernel to obviate the need for tedious feature engineering for sentiment analysis on Twitter data, which outperformed the state-of-the-art baseline.
Abstract: We examine sentiment analysis on Twitter data. The contributions of this paper are: (1) We introduce POS-specific prior polarity features. (2) We explore the use of a tree kernel to obviate the need for tedious feature engineering. The new features (in conjunction with previously proposed features) and the tree kernel perform approximately at the same level, both outperforming the state-of-the-art baseline.

1,652 citations


"Microblogs data management: a survey" refers to methods in this paper

  • ...The used models are both aggregation and SVM classification models....

  • ...ADSEM [242] uses SVM and naive Bayes classifiers to enhance the precision of mapping tweets to Wikipedia-based concepts....

  • ...It uses SVM classifiers and labeled training earthquake data to classify earthquake-related tweets....

  • ...The used features include different types of language-based features such as unigrams [5,35,120,164,247,262], bigrams [35,120,262], trigrams [262], n-grams [82,180,185,248], and POS tags [5,39,120,180,185,247,248,262], microblog-specific features [185,248] such as retweets [39], hashtags [35,39,164], emoticons [35,39,164,180], links [35,39], and other features such as punctuation-based [5,82,180,248], pattern-based [5,35,82,164,180,248] and semantic-based [180,247]....

  • ...The framework first applies transfer learning and label propagation to automatically generate labeled data, then learns an SVM text classifier based on tweet mini-clusters obtained by graph partitioning....

Proceedings Article
05 Jul 2011
TL;DR: This paper evaluates the usefulness of existing lexical resources as well as features that capture information about the informal and creative language used in microblogging, and uses existing hashtags in the Twitter data for building training data.
Abstract: In this paper, we investigate the utility of linguistic features for detecting the sentiment of Twitter messages. We evaluate the usefulness of existing lexical resources as well as features that capture information about the informal and creative language used in microblogging. We take a supervised approach to the problem, but leverage existing hashtags in the Twitter data for building training data.

1,261 citations


"Microblogs data management: a survey" refers to methods in this paper

  • ...The used features include different types of language-based features such as unigrams [5,35,120,164,247,262], bigrams [35,120,262], trigrams [262], n-grams [82,180,185,248], and POS tags [5,39,120,180,185,247,248,262], microblog-specific features [185,248] such as retweets [39], hashtags [35,39,164], emoticons [35,39,164,180], links [35,39], and other features such as punctuation-based [5,82,180,248], pattern-based [5,35,82,164,180,248] and semantic-based [180,247]....

  • ...In specific, SVM [79,136,266], NB [70,136], MNB [79], logistic regression [79,136,208], and AdaBoost [185] are still used, while new classifiers are also introduced such as neural models [86,136,363] and Bayes network [136]....

  • ...The major used classifiers are support vector machines (SVM) [4,5,31,35,39,83,87,120,131,160,164,172,180,248,262,275,306], (multinomial) naive Bayes (MNB and NB) [31,35,120,131,247,262], k-nearest neighbor (kNN) [31,82], MaxEnt [92,120,180], random forest (RF) [89,319], logistic regression [36,202,393], and AdaBoost [185]....

  • ...To enhance the classification accuracy, techniques of the second sub-category ensemble multiple classifiers [61,70,75,79,86,136,172,185,194,208,244,266,317,363]....

Proceedings ArticleDOI
01 Jun 2014
TL;DR: Three neural networks are developed to effectively incorporate the supervision from sentiment polarity of text (e.g. sentences or tweets) in their loss functions and the performance of SSWE is improved by concatenating SSWE with existing feature set.
Abstract: We present a method that learns word embeddings for Twitter sentiment classification in this paper. Most existing algorithms for learning continuous word representations typically only model the syntactic context of words but ignore the sentiment of text. This is problematic for sentiment analysis as they usually map words with similar syntactic context but opposite sentiment polarity, such as good and bad, to neighboring word vectors. We address this issue by learning sentiment-specific word embedding (SSWE), which encodes sentiment information in the continuous representation of words. Specifically, we develop three neural networks to effectively incorporate the supervision from sentiment polarity of text (e.g. sentences or tweets) in their loss functions. To obtain large scale training corpora, we learn the sentiment-specific word embedding from massive distant-supervised tweets collected by positive and negative emoticons. Experiments on applying SSWE to a benchmark Twitter sentiment classification dataset in SemEval 2013 show that (1) the SSWE feature performs comparably with hand-crafted features in the top-performing system; (2) the performance is further improved by concatenating SSWE with the existing feature set.
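The two supervision signals can be pictured as a pair of hinge losses: one ranks a true n-gram above a corrupted one (syntactic context), the other ranks the gold polarity score above the opposite polarity (sentiment), combined with a weight. A toy sketch (scores are given as plain numbers; the function name, signature, and weighting are illustrative assumptions, not the paper's exact objective):

```python
def sswe_loss(score_true, score_corrupt,
              sent_score_gold, sent_score_other, alpha=0.5):
    """Illustrative combined hinge loss in the spirit of SSWE.

    score_true / score_corrupt: model scores for a real n-gram and a
    corrupted one (a word replaced at random).
    sent_score_gold / sent_score_other: model scores for the gold
    polarity and the opposite polarity of the containing text.
    alpha weighs the syntactic term against the sentiment term.
    """
    loss_syntax = max(0.0, 1.0 - score_true + score_corrupt)
    loss_sentiment = max(0.0, 1.0 - sent_score_gold + sent_score_other)
    return alpha * loss_syntax + (1 - alpha) * loss_sentiment
```

The sentiment term is what separates words like good and bad that share syntactic contexts: minimizing it pushes their embeddings apart even when the syntactic term alone would place them together.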

1,157 citations


"Microblogs data management: a survey" refers to background in this paper

  • ...Existing deep learning techniques are exploited in short textual contexts in a two-step fashion [37,62,71,86,90,134,147,148,165,265,280,288,321,322,337,342,359]....