Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose
Citations
1,571 citations
Cites background from "Is the Sample Good Enough? Comparin..."
...The differences and biases in the networks can be a result of many factors, such as network sampling, as shown in [51, 91], which can change the network measures and cause different types of problems....
[...]
767 citations
634 citations
Cites background or methods from "Is the Sample Good Enough? Comparin..."
...The stream was gathered through the Twitter Streaming API. Although the service sets a limit on how much data can be accessed to less than 1% of the total Twitter stream, the total geo-located content was found not to exceed this restriction (Morstatter et al. 2013)....
[...]
...These geo-located tweets account for around 1% of the total feed (Morstatter et al. 2013)....
[...]
550 citations
Cites background from "Is the Sample Good Enough? Comparin..."
...[203] studied whether Twitter’s heavily sampled Streaming API, a free service for social media data, accurately portrays the true activity on Twitter....
[...]
495 citations
Cites result from "Is the Sample Good Enough? Comparin..."
...Although all of these approaches might lead to comparable and stable results, there are only a few studies systematically testing whether these approaches indeed produce identical data sets (e.g., Morstatter et al., 2013)....
[...]
References
45,034 citations
"Is the Sample Good Enough? Comparin..." refers background or methods in this paper
...Treating each topic as a probability distribution, we compute this as follows: $JS(T^S_i \,\|\, T^F_j) = \frac{1}{2}\left[KL(T^S_i \,\|\, M) + KL(T^F_j \,\|\, M)\right]$ (3), where $M = \frac{1}{2}(T^S_i + T^F_j)$ and KL is the Kullback-Leibler divergence (Cover and Thomas 2006)....
[...]
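The Jensen-Shannon divergence in the excerpt above can be illustrated with a minimal Python sketch. The function names and the example distributions are hypothetical, and the base-2 logarithm is an assumption, since the excerpt does not state which base is used.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits (log base 2 is an assumption).

    Terms where p[i] == 0 contribute nothing to the sum.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: 0.5 * [KL(p || M) + KL(q || M)], M = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))


# Hypothetical word distributions for one Streaming topic and one Firehose topic.
streaming_topic = [0.5, 0.3, 0.2]
firehose_topic = [0.4, 0.4, 0.2]

print(js_divergence(streaming_topic, firehose_topic))
```

Because M is nonzero wherever either input is nonzero, the divergence stays finite even when the two topics share no words, which plain KL divergence does not guarantee.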
39,297 citations
"Is the Sample Good Enough? Comparin..." refers methods in this paper
...To describe the structure of the retweet networks we calculate the clustering coefficient, a measure for local density (Watts and Strogatz 1998)....
[...]
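As a rough illustration of the clustering coefficient mentioned in the excerpt above, the sketch below computes the local coefficient of Watts and Strogatz on a small undirected graph. The toy graph and function names are hypothetical, and treating the network as undirected is an assumption; the excerpt does not say how the retweet networks were handled.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of a node's neighbors that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention: undefined cases count as zero
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Hypothetical undirected graph stored as adjacency sets:
# a triangle (a, b, c) with one pendant node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(average_clustering(graph))
```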
30,570 citations
"Is the Sample Good Enough? Comparin..." refers methods in this paper
...Since LDA’s topics have no implicit orderings we first must match them based upon the similarity of the words in the distribution....
[...]
...In the case of LDA we find a significant increase in the accuracy of LDA with the randomly sampled data over the data from the Streaming API....
[...]
...We compare the topics drawn from the Streaming data with those drawn from the Firehose data using a widely-used topic modeling algorithm, latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan 2003)....
[...]
...To get a sense of how the topics found in the Streaming data compare with those found with random samples, we compare with topics found by running LDA on random subsamples of the Firehose data....
[...]
...We also employed LDA to extract topics from the text....
[...]
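The excerpts above note that LDA topics carry no inherent order, so topics from the two sources must be matched by the similarity of their word distributions before they can be compared. One possible way to do this is sketched below: each Streaming topic is paired with the Firehose topic at the smallest Jensen-Shannon divergence. The nearest-match rule, the function names, and the example distributions are all assumptions for illustration, not the procedure confirmed by the excerpts.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two word distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl_to_m(x):
        return sum(a * math.log2(a / b) for a, b in zip(x, m) if a > 0)

    return 0.5 * (kl_to_m(p) + kl_to_m(q))


def match_topics(streaming_topics, firehose_topics):
    """Pair each Streaming topic index with its closest Firehose topic index.

    This is a simple nearest-match rule (an assumption); it does not
    enforce a one-to-one assignment.
    """
    matches = []
    for i, s_topic in enumerate(streaming_topics):
        best_j = min(
            range(len(firehose_topics)),
            key=lambda j: js_divergence(s_topic, firehose_topics[j]),
        )
        matches.append((i, best_j))
    return matches


# Hypothetical topics over a shared three-word vocabulary.
streaming_topics = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
firehose_topics = [[0.1, 0.3, 0.6], [0.6, 0.3, 0.1]]

print(match_topics(streaming_topics, firehose_topics))
```

A one-to-one assignment (for example via the Hungarian algorithm) would be a stricter alternative when two Streaming topics compete for the same Firehose topic.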
25,546 citations
17,104 citations