Top 20 papers published by Michael L. Nelson from Old Dominion University in 2018

Journal Article•DOI•

Challenges in the assessment of total fluid intake in children and adolescents: a discussion paper.

[...]

Janet Warren, Isabelle Guelinckx, Barbara Livingstone¹, Nancy Potischman², Michael L. Nelson³, Emma Foster⁴, Bridget A. Holmes - Show less +3 more•Institutions (4)

Ulster University¹, National Institutes of Health², King's College London³, Newcastle University⁴

12 Jun 2018-European Journal of Nutrition

TL;DR: The aim of this paper is to review the current status of the literature and highlight the challenges of assessing total fluid intake in children and adolescents, and indicates that if the research focus is to assess only fluid intake, a fluid-specific method appears to be a feasible approach to provide an accurate estimate of intakes.

...read moreread less

Abstract: In recent years, evidence has emerged about the importance of healthy fluid intake in children for physical and mental performance and health, and in the prevention of obesity. Accurate data on water intake are needed to inform researchers and policymakers and for setting dietary reference values. However, to date, there are few published data on fluid or water intakes in children. This is due partly to the fact that drinking water is not always reported in dietary surveys. The aim of this paper is to review the current status of the literature and highlight the challenges of assessing total fluid intake in children and adolescents. From the dietary assessment literature it is apparent that children present unique challenges to assessing intake due to ongoing cognitive capacity development, limited literacy skills, difficulties in estimating portion sizes and multiple caregivers during any 1 day making it difficult to track intakes. As such, many issues should be considered when assessing total fluid intakes in children or adolescents. Various methods to assess fluid intakes exist, each with its own strengths and weaknesses; the ultimate choice of method depends on the research question and resources available. Based on the literature review, it is apparent that if the research focus is to assess only fluid intake, a fluid-specific method, such as a diary or record, appears to be a feasible approach to provide an accurate estimate of intakes.

...read moreread less

21 citations

Proceedings Article•DOI•

Bootstrapping Web Archive Collections from Social Media

[...]

Alexander C. Nwala¹, Michele C. Weigle¹, Michael L. Nelson¹•Institutions (1)

Old Dominion University¹

03 Jul 2018

TL;DR: The results showed that social media sources such as Reddit, Storify, Twitter, and Wikipedia produce collections that are similar to Archive-It collections, and curators may consider extracting URIs from these sources in order to begin or augment collections about various news topics.

...read moreread less

Abstract: Human-generated collections of archived web pages are expensive to create, but provide a critical source of information for researchers studying historical events Hand-selected collections of web pages about events shared by users on social media offer the opportunity for bootstrapping archived collections We investigated if collections generated automatically and semi-automatically from social media sources such as Storify, Reddit, Twitter, and Wikipedia are similar to Archive-It human-generated collections This is a challenging task because it requires comparing collections that may cater to different needs It is also challenging to compare collections since there are many possible measures to use as a baseline for collection comparison: how does one narrow down this list to metrics that reflect if two collections are similar or dissimilar? We identified social media sources that may provide similar collections to Archive-It human-generated collections in two main steps First, we explored the state of the art in collection comparison and defined a suite of seven measures (Collection Characterizing Suite - CCS) to describe the individual collections Second, we calculated the distances between the CCS vectors of Archive-It collections and the CCS vectors of collections generated automatically and semi-automatically from social media sources, to identify social media collections most similar to Archive-It collections The CCS distance comparison was done for three topics: "Ebola Virus," "Hurricane Harvey," and "2016 Pulse Nightclub Shooting" Our results showed that social media sources such as Reddit, Storify, Twitter, and Wikipedia produce collections that are similar to Archive-It collections Consequently, curators may consider extracting URIs from these sources in order to begin or augment collections about various news topics

...read moreread less

14 citations

Proceedings Article•DOI•

ArchiveNow: Simplified, Extensible, Multi-Archive Preservation

[...]

Mohamed Aturban¹, Mat Kelly¹, Sawood Alam¹, John A. Berlin¹, Michael L. Nelson¹, Michele C. Weigle¹ - Show less +2 more•Institutions (1)

Old Dominion University¹

23 May 2018

TL;DR: This module allows a user to submit a URI of a web page for archiving at several configured web archives, and provides the user with links to the archived copies of the web page.

...read moreread less

Abstract: ArchiveNow is a Python module for preserving web pages in on-demand web archives. This module allows a user to submit a URI of a web page for archiving at several configured web archives. Once the web page is captured, ArchiveNow provides the user with links to the archived copies of the web page. ArchiveNow is initially configured to use four archives but is easily configurable to add or remove other archives. In addition to pushing web pages to public archives, ArchiveNow, through the use of Wget and Squidwarc, allows users to generate local WARC files, enabling them to create their own personal and private archives.

...read moreread less

13 citations

Proceedings Article•DOI•

Scraping SERPs for Archival Seeds: It Matters When You Start

[...]

Alexander C. Nwala¹, Michele C. Weigle¹, Michael L. Nelson¹•Institutions (1)

Old Dominion University¹

23 May 2018

TL;DR: The findings suggest that due to the difficulty in retrieving the URIs of news stories from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in orderto capture the evolution of the events.

...read moreread less

Abstract: Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21 - 0.54, and a weekly rate of 0.39 - 0.79, suggesting the fast replacement of older stories by newer stories. The probability of finding the same URI of a news story after one day from the initial appearance on the SERP ranged from 0.34 - 0.44. After a week, the probability of finding the same news stories diminishes rapidly to 0.01 - 0.11. In addition to the reporting of these probabilities, we also provide two predictive models for estimating the probability of finding the URI of an arbitrary news story on SERPs as a function of time. The web archiving community considers link rot and content drift important reasons for collection building. Similarly, our findings suggest that due to the difficulty in retrieving the URIs of news stories from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture the evolution of the events, because it becomes more difficult to find the same news stories with the same queries on Google, as time progresses.

...read moreread less

13 citations

Journal Article•DOI•

Dietary patterns and risk of prostate cancer: a factor analysis study in a sample of Iranian men.

[...]

Amir Bagheri¹, Seyed Mostafa Nachvak¹, Mansour Rezaei¹, Mozhgan Moravridzade¹, Mahmoudreza Moradi¹, Michael L. Nelson² - Show less +2 more•Institutions (2)

Kermanshah University of Medical Sciences¹, King's College London²

18 Apr 2018-health promotion perspectives

TL;DR: This study shows that an unhealthy dietary pattern was associated with increased risk of prostate cancer, however, a healthy dietary patternwas associated with decreased risk of Prostate cancer.

...read moreread less

Abstract: Background: Prostate cancer is one of the most common types of cancer with a high mortality rate. The current study was conducted to investigate the relationship between dietary patterns and prostate cancer risk among Iranian men. Methods: This case-control study was conducted in Kermanshah province in western Iran in November 2016. Fifty patients with prostate cancer were selected as cases and 150 healthy men matched for age and body mass index (BMI) were selected as controls. Dietary intake data were collected by a semi-quantitative food frequency questionnaire (FFQ). Food items were grouped according to the similarity of nutrient profiles. The main dietary patterns were identified by factor analysis. Results: After adjustment for potential confounders, a healthy dietary pattern was associated with decreased risk of prostate cancer (highest versus lowest tertile OR:0.24; 95% CI: 0.07-0.81;trend p: 0.025). An unhealthy dietary pattern was related to increased risk of prostate cancer(highest versus lowest tertile OR:3.4; 95% CI: 1.09-10.32; trend p: 0.037). Conclusion: This study shows that an unhealthy dietary pattern was associated with increased risk of prostate cancer. However, a healthy dietary pattern was associated with decreased risk of prostate cancer.

...read moreread less

12 citations

The Many Shapes of Archive-It..

[...]

Shawn M. Jones, Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

01 Jan 2018

TL;DR: This work focuses on the collections within Archive-It, a subscription service started by the Internet Archive in 2005 for the purpose of allowing organizations to create their own collections of archived web pages, or mementos, and proposes using structural metadata as an additional way to understand these collections.

...read moreread less

Abstract: Web archives, a key area of digital preservation, meet the needs of journalists, social scientists, historians, and government organizations. The use cases for these groups often require that they guide the archiving process themselves, selecting their own original resources, or seeds, and creating their own web archive collections. We focus on the collections within Archive-It, a subscription service started by the Internet Archive in 2005 for the purpose of allowing organizations to create their own collections of archived web pages, or mementos. Understanding these collections could be done via their user-supplied metadata or via text analysis, but the metadata is applied inconsistently between collections and some Archive-It collections consist of hundreds of thousands of seeds, making it costly in terms of time to download each memento. Our work proposes using structural metadata as an additional way to understand these collections. We explore structural features currently existing in these collections that can unveil curation and crawling behaviors. We adapt the concept of the collection growth curve for understanding Archive-It collection curation and crawling behavior. We also introduce several seed features and come to an understanding of the diversity of resources that make up a collection. Finally, we use the descriptions of each collection to identify four semantic categories of Archive-It collections. Using the identified structural features, we reviewed the results of runs with 20 classifiers and are able to predict the semantic category of a collection using a Random Forest classifier with a weighted average F1 score of 0.720, thus bridging the structural to the descriptive. Our method is useful because it saves the researcher time and bandwidth. Identifying collections by their semantic category allows further downstream processing to be tailored to these categories.

...read moreread less

10 citations

Proceedings Article•DOI•

A Framework for Aggregating Private and Public Web Archives

[...]

Mat Kelly¹, Michael L. Nelson¹, Michele C. Weigle¹•Institutions (1)

Old Dominion University¹

23 May 2018

TL;DR: A framework to mitigate issues of aggregation in private, personal, and public Web archives without compromising potential sensitive information contained in private captures is introduced and Memento syntax and semantics are amended to allow additional attributes to be expressed inclusive of the requirements for dereferencing private Web archive captures.

...read moreread less

Abstract: Personal and private Web archives are proliferating due to the increase in the tools to create them and the realization that Internet Archive and other public Web archives are unable to capture personalized (e.g., Facebook) and private (e.g., banking) Web pages. We introduce a framework to mitigate issues of aggregation in private, personal, and public Web archives without compromising potential sensitive information contained in private captures. We amend Memento syntax and semantics to allow TimeMap enrichment to account for additional attributes to be expressed inclusive of the requirements for dereferencing private Web archive captures. We provide a method to involve the user further in the negotiation of archival captures in dimensions beyond time. We introduce a model for archival querying precedence and short-circuiting, as needed when aggregating private and personal Web archive captures with those from public Web archives through Memento. Negotiation of this sort is novel to Web archiving and allows for the more seamless aggregation of various types of Web archives to convey a more accurate picture of the past Web.

...read moreread less

9 citations

Journal Article•DOI•

Mining the Web to approximate university rankings

[...]

Corren G. McCoy, Michael L. Nelson, Michele C. Weigle

13 Nov 2018

TL;DR: An alternative to university ranking lists published in U.S. News & World Report, Times Higher Education, Academic Ranking of World Universities and Money Magazine is presented and UTE could be a viable proxy for ranking atypical institutions normally excluded from traditional lists.

...read moreread less

Abstract: Purpose The purpose of this study is to present an alternative to university ranking lists published in US News & World Report, Times Higher Education, Academic Ranking of World Universities and Money Magazine A strategy is proposed to mine a collection of university data obtained from Twitter and publicly available online academic sources to compute social media metrics that approximate typical academic rankings of US universities Design/methodology/approach The Twitter application programming interface (API) is used to rank 264 universities using two easily collected measurements The University Twitter Engagement (UTE) score is the total number of primary and secondary followers affiliated with the university The authors mine other public data sources related to endowment funds, athletic expenditures and student enrollment to compute a ranking based on the endowment, expenditures and enrollment (EEE) score Findings In rank-to-rank comparisons, the authors observed a significant, positive rank correlation (τ = 06018) between UTE and an aggregate reputation ranking, which indicates UTE could be a viable proxy for ranking atypical institutions normally excluded from traditional lists Originality/value The UTE and EEE metrics offer distinct advantages because they can be calculated on-demand rather than relying on an annual publication and they promote diversity in the ranking lists, as any university with a Twitter account can be ranked by UTE and any university with online information about enrollment, expenditures and endowment can be given an EEE rank The authors also propose a unique approach for discovering official university accounts by mining and correlating the profile information of Twitter friends

...read moreread less

9 citations

Journal Article•DOI•

Avoiding spoilers: wiki time travel with Sheldon Cooper

[...]

Shawn M. Jones¹, Michael L. Nelson², Herbert Van de Sompel¹•Institutions (2)

Los Alamos National Laboratory¹, Old Dominion University²

01 Mar 2018-International Journal on Digital Libraries

TL;DR: The results indicate that the Internet Archive is not safe for avoiding spoilers, and therefore the inherent capability of fan wikis to address the spoiler problem internally using existing, off-the-shelf technology is highlighted.

...read moreread less

Abstract: A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if fans are behind in their viewing they run the risk of encountering “spoilers”—information that gives away key plot points before the intended time of the show’s writers. Because the wiki history is indexed by revisions, finding specific dates can be tedious, especially for pages with hundreds or thousands of edits. A wiki’s history interface does not permit browsing across historic pages without visiting current ones, thus revealing spoilers in the current page. Enterprising fans can resort to web archives and navigate there across wiki pages that were live prior to a specific episode date. In this paper, we explore the use of Memento with the Internet Archive as a means of avoiding spoilers in fan wikis. We conduct two experiments: one to determine the probability of encountering a spoiler when using Memento with the Internet Archive for a given wiki page, and a second to determine which date prior to an episode to choose when trying to avoid spoilers for that specific episode. Our results indicate that the Internet Archive is not safe for avoiding spoilers, and therefore we highlight the inherent capability of fan wikis to address the spoiler problem internally using existing, off-the-shelf technology. We use the spoiler use case to define and analyze different ways of discovering the best past version of a resource to avoid spoilers. We propose Memento as a structural solution to the problem, distinguishing it from prior content-based solutions to the spoiler problem. This research promotes the idea that content management systems can benefit from exposing their version information in the standardized Memento way used by other archives. We support the idea that there are use cases for which specific prior versions of web resources are invaluable.

...read moreread less

8 citations

Posted Content•DOI•

The Many Shapes of Archive-It

[...]

Shawn M. Jones, Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

18 Jun 2018-arXiv: Digital Libraries

TL;DR: In this article, structural features are used to identify four semantic categories of Archive-It collections using a Random Forest classifier with a weighted average F1 score of 0.720, thus bridging the structural to the descriptive.

...read moreread less

Abstract: Web archives, a key area of digital preservation, meet the needs of journalists, social scientists, historians, and government organizations. The use cases for these groups often require that they guide the archiving process themselves, selecting their own original resources, or seeds, and creating their own web archive collections. We focus on the collections within Archive-It, a subscription service started by the Internet Archive in 2005 for the purpose of allowing organizations to create their own collections of archived web pages, or mementos. Understanding these collections could be done via their user-supplied metadata or via text analysis, but the metadata is applied inconsistently between collections and some Archive-It collections consist of hundreds of thousands of seeds, making it costly in terms of time to download each memento. Our work proposes using structural metadata as an additional way to understand these collections. We explore structural features currently existing in these collections that can unveil curation and crawling behaviors. We adapt the concept of the collection growth curve for understanding Archive-It collection curation and crawling behavior. We also introduce several seed features and come to an understanding of the diversity of resources that make up a collection. Finally, we use the descriptions of each collection to identify four semantic categories of Archive-It collections. Using the identified structural features, we reviewed the results of runs with 20 classifiers and are able to predict the semantic category of a collection using a Random Forest classifier with a weighted average F1 score of 0.720, thus bridging the structural to the descriptive. Our method is useful because it saves the researcher time and bandwidth. Identifying collections by their semantic category allows further downstream processing to be tailored to these categories.

...read moreread less

7 citations

Posted Content•DOI•

The Off-Topic Memento Toolkit.

[...]

Shawn M. Jones, Michele C. Weigle, Michael L. Nelson

18 Jun 2018-arXiv: Digital Libraries

TL;DR: The Off-Topic Memento Toolkit as discussed by the authors allows users to detect off-topic mementos within web archive collections, which can then be separately removed from a collection or merely excluded from downstream analysis.

...read moreread less

Abstract: Web archive collections are created with a particular purpose in mind. A curator selects seeds, or original resources, which are then captured by an archiving system and stored as archived web pages, or mementos. The systems that build web archive collections are often configured to revisit the same original resource multiple times. This is incredibly useful for understanding an unfolding news story or the evolution of an organization. Unfortunately, over time, some of these original resources can go off-topic and no longer suit the purpose for which the collection was originally created. They can go off-topic due to web site redesigns, changes in domain ownership, financial issues, hacking, technical problems, or because their content has moved on from the original topic. Even though they are off-topic, the archiving system will still capture them, thus it becomes imperative to anyone performing research on these collections to identify these off-topic mementos. Hence, we present the Off-Topic Memento Toolkit, which allows users to detect off-topic mementos within web archive collections. The mementos identified by this toolkit can then be separately removed from a collection or merely excluded from downstream analysis. The following similarity measures are available: byte count, word count, cosine similarity, Jaccard distance, Sorensen-Dice distance, Simhash using raw text content, Simhash using term frequency, and Latent Semantic Indexing via the gensim library. We document the implementation of each of these similarity measures. We possess a gold standard dataset generated by manual analysis, which contains both off-topic and on-topic mementos. Using this gold standard dataset, we establish a default threshold corresponding to the best F1 score for each measure. We also provide an overview of potential future directions that the toolkit may take.

...read moreread less

Proceedings Article•DOI•

Unobtrusive and Extensible Archival Replay Banners Using Custom Elements

[...]

Sawood Alam¹, Mat Kelly¹, Michele C. Weigle¹, Michael L. Nelson¹•Institutions (1)

Old Dominion University¹

23 May 2018

TL;DR: This work proposes an implementation that utilizes Custom Elements and adds some unique behaviors, not common in existing archival replay systems, to enhance the user experience and has a minimal user interface footprint and resource overhead.

...read moreread less

Abstract: We compare and contrast three different ways to implement an archival replay banner. We propose an implementation that utilizes Custom Elements and adds some unique behaviors, not common in existing archival replay systems, to enhance the user experience. Our approach has a minimal user interface footprint and resource overhead while still providing rich interactivity and extended on-demand provenance information about the archived resources.

...read moreread less

Proceedings Article•DOI•

A Framework for Aggregating Private and Public Web Archives

[...]

Mat Kelly¹, Michael L. Nelson¹, Michele C. Weigle¹•Institutions (1)

Old Dominion University¹

03 Jun 2018-arXiv: Digital Libraries

TL;DR: In this article, the authors introduce a framework to mitigate issues of aggregation in private, personal, and public Web archives without compromising potential sensitive information contained in private captures, and propose a model for archival querying precedence and short-circuiting, as needed when aggregating private and personal Web archive captures with those from public web archives through Memento.

...read moreread less

Abstract: Personal and private Web archives are proliferating due to the increase in the tools to create them and the realization that Internet Archive and other public Web archives are unable to capture personalized (e.g., Facebook) and private (e.g., banking) Web pages. We introduce a framework to mitigate issues of aggregation in private, personal, and public Web archives without compromising potential sensitive information contained in private captures. We amend Memento syntax and semantics to allow TimeMap enrichment to account for additional attributes to be expressed inclusive of the requirements for dereferencing private Web archive captures. We provide a method to involve the user further in the negotiation of archival captures in dimensions beyond time. We introduce a model for archival querying precedence and short-circuiting, as needed when aggregating private and personal Web archive captures with those from public Web archives through Memento. Negotiation of this sort is novel to Web archiving and allows for the more seamless aggregation of various types of Web archives to convey a more accurate picture of the past Web.

...read moreread less

Off-Topic Memento Toolkit.

[...]

Shawn M. Jones, Michael L. Nelson, Michele C. Weigle

01 Jan 2018

TL;DR: The Off-Topic Memento Toolkit is presented, which allows users to detect off-topic mementos within web archive collections and establishes a default threshold corresponding to the best F1 score for each measure.

...read moreread less

Posted Content•

Measuring News Similarity Across Ten U.S. News Sites

[...]

Grant C. Atkins, Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

24 Jun 2018-arXiv: Digital Libraries

TL;DR: In this article, a method for identifying the top news story for a select set of U.S.-based news websites and then quantifying the similarity across them is presented. But the method is limited to a subset of news websites.

...read moreread less

Abstract: News websites make editorial decisions about what stories to include on their website homepages and what stories to emphasize (e.g., large font size for main story). The emphasized stories on a news website are often highly similar to many other news websites (e.g, a terrorist event story). The selective emphasis of a top news story and the similarity of news across different news organizations are well-known phenomena but not well-measured. We provide a method for identifying the top news story for a select set of U.S.-based news websites and then quantify the similarity across them. To achieve this, we first developed a headline and link extractor that parses select websites, and then examined ten United States based news website homepages during a three month period, November 2016 to January 2017. Using archived copies, retrieved from the Internet Archive (IA), we discuss the methods and difficulties for parsing these websites, and how events such as a presidential election can lead news websites to alter their document representation just for these events. We use our parser to extract k = 1, 3, 10 maximum number of stories for each news site. Second, we used the cosine similarity measure to calculate news similarity at 8PM Eastern Time for each day in the three months. The similarity scores show a buildup (0.335) before Election Day, with a declining value (0.328) on Election Day, and an increase (0.354) after Election Day. Our method shows that we can effectively identity top stories and quantify news similarity.

...read moreread less

Measuring News Similarity Across Ten U.S. News Sites

[...]

Grant C. Atkins, Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson

01 Jun 2018

TL;DR: In this article, a method for identifying the top news story for a select set of U.S.-based news websites and then quantifying the similarity across them is presented. But the method is limited to a subset of news websites.

...read moreread less

Abstract: News websites make editorial decisions about what stories to include on their website homepages and what stories to emphasize (e.g., large font size for main story). The emphasized stories on a news website are often highly similar to many other news websites (e.g, a terrorist event story). The selective emphasis of a top news story and the similarity of news across different news organizations are well-known phenomena but not well-measured. We provide a method for identifying the top news story for a select set of U.S.-based news websites and then quantify the similarity across them. To achieve this, we first developed a headline and link extractor that parses select websites, and then examined ten United States based news website homepages during a three month period, November 2016 to January 2017. Using archived copies, retrieved from the Internet Archive (IA), we discuss the methods and difficulties for parsing these websites, and how events such as a presidential election can lead news websites to alter their document representation just for these events. We use our parser to extract k = 1, 3, 10 maximum number of stories for each news site. Second, we used the cosine similarity measure to calculate news similarity at 8PM Eastern Time for each day in the three months. The similarity scores show a buildup (0.335) before Election Day, with a declining value (0.328) on Election Day, and an increase (0.354) after Election Day. Our method shows that we can effectively identity top stories and quantify news similarity.

...read moreread less

Proceedings Article•DOI•

Queryable Compression on Time-Evolving Social Networks with Streaming

[...]

Michael L. Nelson¹, Sridhar Radhakrishnan¹, Chandra N. Sekharan²•Institutions (2)

University of Oklahoma¹, Loyola University Chicago²

01 Dec 2018

TL;DR: This work adapts its strategy to compress time-evolving graphs, rather than static ones, and achieves the smallest representation of 4.9GB on the largest dataset which only spans three days yet occupies 21.5GB of space.

...read moreread less

Abstract: Time-evolving graphs represent a set of individuals (nodes) and their edges (relationships) over time. How these graphs are represented in data structures determines what information is easy to obtain from them. Now that we have such massive social networks with dynamic lifetimes, even basic data structures are too large to fit into main memory. Clearly, this poses a problem to areas such as time-evolving graph pattern analysis. Therefore, it is an interesting field of study to design time-evolving graph compressions that can efficiently answer certain queries about the graph at any given point in time.If a single snapshot of a graph at a moment in time can be considered a 2D matrix, then can we visualize these time-evolving graphs as 3D matrices and then use a novel technique to compress the entire graph over time. Our technique is based on our previous work using compressed binary trees. In this work, we adapt our strategy to compress time-evolving graphs, rather than static ones. We manage to maintain our minimal main memory overhead by not requiring an intermediate structure (e.g. adjacency list) to compress. This compression is queryable, meaning that the data can be read without decompression. It is also streaming, meaning that the data can be changed without decompression. This includes adding/removing edges in individual frames. We test our algorithms on public, anonymized, massive, time-evolving graphs such as Flickr, Yahoo!, and Wikipedia. Our empirical evaluation is based on several parameters including time to compress, size of compressed graph, and time to execute queries. Our compression rates are highly competitive, as we achieve the smallest representation of 4.9GB on our largest dataset which only spans three days yet occupies 21.5GB of space.

...read moreread less

Proceedings Article•DOI•

Scraping SERPs for Archival Seeds: It Matters When You Start

[...]

Alexander C. Nwala¹, Michele C. Weigle¹, Michael L. Nelson¹•Institutions (1)

Old Dominion University¹

25 May 2018-arXiv: Digital Libraries

TL;DR: In this article, the authors studied the retrievability of URIs of news stories found on Google and found that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21 -0.54, and a weekly rate of 0.39 - 0.44.

...read moreread less

Abstract: Event-based collections are often started with a web search, but the search results you find on Day 1 may not be the same as those you find on Day 7. In this paper, we consider collections that originate from extracting URIs (Uniform Resource Identifiers) from Search Engine Result Pages (SERPs). Specifically, we seek to provide insight about the retrievability of URIs of news stories found on Google, and to answer two main questions: first, can one "refind" the same URI of a news story (for the same query) from Google after a given time? Second, what is the probability of finding a story on Google over a given period of time? To answer these questions, we issued seven queries to Google every day for over seven months (2017-05-25 to 2018-01-12) and collected links from the first five SERPs to generate seven collections for each query. The queries represent public interest stories: "healthcare bill," "manchester bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey," and "hurricane irma." We tracked each URI in all collections over time to estimate the discoverability of URIs from the first five SERPs. Our results showed that the daily average rate at which stories were replaced on the default Google SERP ranged from 0.21 -0.54, and a weekly rate of 0.39 - 0.79, suggesting the fast replacement of older stories by newer stories. The probability of finding the same URI of a news story after one day from the initial appearance on the SERP ranged from 0.34 - 0.44. After a week, the probability of finding the same news stories diminishes rapidly to 0.01 - 0.11. Our findings suggest that due to the difficulty in retrieving the URIs of news stories from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture the evolution of the events...

...read moreread less