
Showing papers by "Michael L. Nelson published in 2009"


Posted Content
TL;DR: The Memento solution is a framework in which archived resources can seamlessly be reached via the URI of their original: protocol-based time travel for the Web.
Abstract: The Web is ephemeral. Many resources have representations that change over time, and many of those representations are lost forever. A lucky few manage to reappear as archived resources that carry their own URIs. For example, some content management systems maintain version pages that reflect a frozen prior state of their changing resources. Archives recurrently crawl the web to obtain the actual representation of resources, and subsequently make those available via special-purpose archived resources. In both cases, the archival copies have URIs that are protocol-wise disconnected from the URI of the resource of which they represent a prior state. Indeed, the lack of temporal capabilities in the most common Web protocol, HTTP, prevents getting to an archived resource on the basis of the URI of its original. This turns accessing archived resources into a significant discovery challenge for both human and software agents, which typically involves following a multitude of links from the original to the archival resource, or searching archives for the original URI. This paper proposes the protocol-based Memento solution to address this problem, and describes a proof-of-concept experiment that includes major servers of archival content, including Wikipedia and the Internet Archive. The Memento solution is based on existing HTTP capabilities applied in a novel way to add the temporal dimension. The result is a framework in which archived resources can seamlessly be reached via the URI of their original: protocol-based time travel for the Web.
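The abstract describes the idea at the protocol level; below is a minimal client-side sketch of what such a datetime-negotiated request might look like, assuming the Accept-Datetime header and an aggregator TimeGate as later standardized in RFC 7089. The resource URI and datetime are illustrative, not taken from the paper.

```python
# Minimal client-side sketch of protocol-based time travel, assuming the
# Accept-Datetime header and an aggregator TimeGate as later standardized
# in RFC 7089; the original resource URI and datetime are illustrative.
import requests

TIMEGATE = "https://timetravel.mementoweb.org/timegate/"  # assumed TimeGate service

def get_memento(original_uri: str, accept_datetime: str) -> tuple[int, str]:
    """Ask the TimeGate for the archived copy closest to accept_datetime."""
    response = requests.get(
        TIMEGATE + original_uri,
        headers={"Accept-Datetime": accept_datetime},
        allow_redirects=True,   # the TimeGate redirects to the chosen memento
    )
    return response.status_code, response.url

status, memento_uri = get_memento("http://example.com/",
                                  "Thu, 01 Jan 2009 00:00:00 GMT")
print(status, memento_uri)
```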

149 citations


Proceedings ArticleDOI
15 Jun 2009
TL;DR: This work examines a variety of techniques for extracting users' activities from Facebook (and by extension, other social networking systems) for the personal archive and for the third-party archiver.
Abstract: Web users are spending more of their time and creative energies within online social networking systems. While many of these networks allow users to export their personal data or expose themselves to third-party web archiving, some do not. Facebook, one of the most popular social networking websites, is one example of a "walled garden" where users' activities are trapped. We examine a variety of techniques for extracting users' activities from Facebook (and by extension, other social networking systems) for the personal archive and for the third-party archiver. Our framework could be applied to any walled garden where personal user data is being locked.
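The extraction techniques themselves are not spelled out in the abstract; the following is only a generic sketch of the personal-archive side of the idea (periodically saving timestamped snapshots of one's own pages while logged in). The URL and session cookie are placeholders, not a real Facebook endpoint or API.

```python
# Generic personal-archiving sketch only: fetch one of your own pages with
# your existing session and store a timestamped snapshot. The URL and the
# cookie name/value are placeholders, not a real Facebook endpoint or API.
import datetime
import pathlib
import requests

PROFILE_URL = "https://social-network.example.com/me"          # placeholder
SESSION_COOKIES = {"session_id": "paste-your-own-session-id"}  # placeholder

def snapshot(url: str, out_dir: str = "my_archive") -> pathlib.Path:
    """Save a timestamped HTML copy of `url` fetched with the user's session."""
    html = requests.get(url, cookies=SESSION_COOKIES).text
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    path = out / f"snapshot-{stamp}.html"
    path.write_text(html, encoding="utf-8")
    return path

print(snapshot(PROFILE_URL))
```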

44 citations


Posted Content
TL;DR: This work presents a mechanism to identify and describe aggregations of Web resources that has resulted from the Open Archives Initiative - Object Reuse and Exchange (OAI-ORE) project, and ensures the integration of the products of scholarly research into the Data Web.
Abstract: Aggregations of Web resources are increasingly important in scholarship as it adopts new methods that are data-centric, collaborative, and network-based. The same notion of aggregations of resources is common to the mashed-up, socially networked information environment of Web 2.0. We present a mechanism to identify and describe aggregations of Web resources that has resulted from the Open Archives Initiative - Object Reuse and Exchange (OAI-ORE) project. The OAI-ORE specifications are based on the principles of the Architecture of the World Wide Web, the Semantic Web, and the Linked Data effort. Therefore, their incorporation into the cyberinfrastructure that supports eScholarship will ensure the integration of the products of scholarly research into the Data Web.
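As a concrete illustration of such an aggregation description, here is a small sketch that builds an ORE Resource Map with rdflib. The ORE and Dublin Core vocabulary URIs are real; the Resource Map, Aggregation, and member URIs are invented for the example.

```python
# Sketch of an OAI-ORE Resource Map built with rdflib. The ORE and Dublin
# Core vocabulary URIs are real; the Resource Map, Aggregation, and member
# URIs below are invented for the example.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

rem = URIRef("http://example.org/rem/article-42")  # the Resource Map itself
agg = URIRef("http://example.org/agg/article-42")  # the aggregation it describes

g = Graph()
g.bind("ore", ORE)
g.bind("dcterms", DCTERMS)

g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, DCTERMS.title, Literal("A compound scholarly object")))

# The aggregated resources: e.g. the paper, its dataset, and the slides.
for part in ("paper.pdf", "dataset.csv", "slides.pdf"):
    g.add((agg, ORE.aggregates, URIRef("http://example.org/objects/" + part)))

print(g.serialize(format="turtle"))
```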

39 citations


Journal ArticleDOI
TL;DR: A Web-repository crawler named Warrick is created that restores lost resources from the holdings of four Web repositories, referred to collectively as the Web Infrastructure (WI), and a survey is constructed to explore this after-loss recovery, termed Lazy Preservation.
Abstract: The web is in constant flux: new pages and Web sites appear daily, and old pages and sites disappear almost as quickly. One study estimates that about two percent of the Web disappears from its current location every week [2]. Although Web users have become accustomed to seeing the infamous "404 Not Found" page, they are more taken aback when they own, are responsible for, or have come to rely on the missing material.

Web archivists like those at the Internet Archive have responded to the Web's transience by archiving as much of it as possible, hoping to preserve snapshots of the Web for future generations [3]. Search engines have also responded by offering pages that have been cached as a result of the indexing process. These straightforward archiving and caching efforts have been used by the public in unintended ways: individuals and organizations have used them to restore their own lost Web sites [5].

To automate recovering lost Web sites, we created a Web-repository crawler named Warrick that restores lost resources from the holdings of four Web repositories: Internet Archive, Google, Live Search (now Bing), and Yahoo [6]; we refer to these Web repositories collectively as the Web Infrastructure (WI). We call this after-loss recovery Lazy Preservation (see the sidebar for more information). Warrick can only recover what is accessible to the WI, namely the crawlable Web. There are numerous resources that cannot be found in the WI: password-protected content, pages without incoming links or protected by the robots exclusion protocol, and content hidden behind Flash or JavaScript interfaces. Most importantly, WI crawlers do not have access to the server-side components (for example, scripts, configuration files, and databases) of a Web site.

Nevertheless, upon Warrick's public release in 2005, we received many inquiries about its usage and collected a handful of anecdotes about the Web sites individuals and organizations had lost and wanted to recover. Were these Web sites representative? What types of Web resources were people losing? Given the inherent limitations of the WI, were Warrick users recovering enough material to reconstruct the site? Were these losses changing their behavior, or was the availability of cached material reinforcing a "lazy" approach to preservation?

We constructed an online survey to explore these questions and conducted a set of in-depth interviews with survey respondents to clarify the results. Potential participants were solicited by us or the Internet Archive, or they found a link to the survey from the Warrick Web site. A total of 52 participants completed the survey regarding 55 lost Web sites, and seven of the participants allowed us to follow up with telephone or instant messaging interviews. Participants were divided into two groups:

1. Personal loss: Those who had lost (and tried to recover) a Web site that they had personally created, maintained, or owned (34 participants who lost 37 Web sites).
2. Third party: Those who had recovered someone else's lost Web site (18 participants who recovered 18 Web sites).
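Warrick itself queried search-engine caches as well as the Internet Archive; the sketch below shows only a single present-day web-infrastructure lookup using the Wayback Machine's public "availability" JSON endpoint, and is not the paper's tool.

```python
# Not Warrick itself: a single web-infrastructure lookup using the Internet
# Archive's present-day Wayback "availability" JSON endpoint. Warrick also
# queried search-engine caches, which is not shown here.
import requests

def closest_archived_copy(url: str):
    """Return the URI of the closest archived snapshot of `url`, or None."""
    resp = requests.get("https://archive.org/wayback/available", params={"url": url})
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

print(closest_archived_copy("http://example.com/"))
```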

28 citations


01 Jan 2009
TL;DR: The response rate and coverage are both sufficiently high to be confident that the findings presented in this report are representative of local authority organised school meal provision in England.
Abstract: Summary. All 152 local authorities (LAs) in England were approached for information regarding school catering services. Of these, 99 (65%) responded, providing information relating to both LA organised catering services (whether provided directly or contracted on behalf of schools by the LA) and non-LA catering services.
- The response rate and coverage are both sufficiently high to be confident that the findings presented in this report are representative of local authority organised school meal provision in England. The coverage nationally relating to take up of school lunches is 61% in the primary sector, down from 78% in 2010-2011, and 38% in the secondary sector, down from 54% in 2010-2011.
- LA catered or contracted provision accounted for 84%, 40% and 75% of primary, secondary and special school lunch provision, respectively. Percentages for non-LA catering provision were 16%, 60% and 25%, respectively.
- Take up of school lunches was 46.3% in primary schools and 39.8% in secondary schools. This represents an increase over 2010-2011 of 2.2 percentage points in both the primary and secondary sectors. This equates to about 167,000 more pupils taking school lunch in 2011-2012.
- Average school lunch prices were £1.93 in the LA catered primary sector and £2.03 in the LA catered secondary sector, an increase of 3% for primary and 2% for secondary on the preceding year.
- In the primary sector, in the LAs who provided information, 77% of schools had a full production kitchen, 5% had facilities for regeneration or a mini-kitchen, 17% had hot food transported from another school or venue, and 0.3% had cold food only provision. In the secondary sector, 99% of schools had a full production kitchen; less than 1% had cold food only provision.
- 99% of primary and 95% of secondary LA catered …

20 citations


Book ChapterDOI
18 Apr 2009
TL;DR: This work investigates the relationship between TC and DF values of terms occurring in the Web as Corpus (WaC) and also the similarity between TC values obtained from the WaC and the Google N-gram dataset.
Abstract: For bounded datasets such as the TREC Web Track (WT10g) the computation of term frequency (TF) and inverse document frequency (IDF) is not difficult. However, when the corpus is the entire web, direct IDF calculation is impossible and values must instead be estimated. Most available datasets provide values for term count (TC), meaning the number of times a certain term occurs in the entire corpus. Intuitively this value is different from document frequency (DF), the number of documents (e.g., web pages) a certain term occurs in. We investigate the relationship between TC and DF values of terms occurring in the Web as Corpus (WaC) and also the similarity between TC values obtained from the WaC and the Google N-gram dataset. A strong correlation between the two would give us confidence in using the Google N-grams to estimate accurate IDF values, which, for example, is the foundation for generating well-performing lexical signatures based on the TF-IDF scheme. Our results show a very strong correlation between TC and DF within the WaC with Spearman's ρ ≈ 0.8 (p ≤ 2.2×10^-16) and a high similarity between TC values from the WaC and the Google N-grams.
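To make the two quantities concrete, here is a toy sketch that computes TC and DF over a tiny invented corpus, checks their Spearman correlation, and derives the classic IDF from DF; none of the numbers relate to the WaC or Google N-gram data.

```python
# Toy illustration of term count (TC) vs. document frequency (DF) and the
# IDF that a strong TC/DF correlation would let one estimate from TC alone.
# The three "documents" are invented; nothing here is WaC or N-gram data.
import math
from scipy.stats import spearmanr

corpus = [
    "the web is ephemeral",
    "the web archive preserves the web",
    "lexical signatures help rediscover missing web pages",
]

terms = sorted({t for doc in corpus for t in doc.split()})
tc = [sum(doc.split().count(t) for doc in corpus) for t in terms]  # total occurrences
df = [sum(t in doc.split() for doc in corpus) for t in terms]      # documents containing t

rho, p = spearmanr(tc, df)
print(f"Spearman rho between TC and DF: {rho:.2f} (p = {p:.3f})")

# Classic IDF uses DF; with only TC available, a DF estimate derived from TC
# would be substituted here, which is what the observed correlation justifies.
N = len(corpus)
idf = {t: math.log(N / d) for t, d in zip(terms, df)}
print(f"idf('web') = {idf['web']:.2f}, idf('ephemeral') = {idf['ephemeral']:.2f}")
```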

14 citations


Journal ArticleDOI
TL;DR: The evidence presented here and elsewhere suggests that the 24 h recall is the method best suited for dietary assessment in low-income households, followed by the weighed inventory, food checklist and lastly the semi-weighed method.
Abstract: In 1998, a review for the Ministry of Agriculture Fisheries and Food (the predecessor of the Food Standards Agency) was published evaluating the relative merits of different dietary assessment methods against a series of factors likely to affect compliance or accuracy in low-income households. The review informed the design of a method comparison study carried out in London, UK, in 2001, in which the validity and acceptability of 4 d dietary assessment methods based on 24 h recalls, food checklists and a semi-weighed method were compared with 4 d weighed inventories and other reference measures. Results were based on observations in 384 respondents (159 males, 225 females) aged 2-90 years in 240 households. Outcomes of the comparison study included evaluations of each method made by respondents, interviewers and researchers. These findings were used in the present paper to update and extend the 1998 review. Additional factors not included in the 1998 review have been considered. This updated and extended review provides the basis for discussion of the relative merits of approaches to dietary assessment in low-income households in developed economies. The evidence presented here and elsewhere suggests that the 24 h recall is the method best suited for dietary assessment in low-income households, followed by the weighed inventory, food checklist and lastly the semi-weighed method.

11 citations


Proceedings ArticleDOI
15 Jun 2009
TL;DR: The primary research question is whether objects can be created that preserve themselves more effectively than repositories or web infrastructure can.
Abstract: The prevailing model for digital preservation is that archives should be similar to a "fortress": a large, protective infrastructure built to defend a relatively small collection of data from attack by external forces. Such projects are a luxury, suitable only for limited collections of known importance and requiring significant institutional commitment for sustainability. In previous research, we have shown that the web infrastructure (i.e., search engine caches, web archives) refreshes and migrates web content in bulk as side effects of its user services, and that these results can be mined as a useful, but passive, preservation service. Our current research involves a number of questions resulting from removing the implicit assumption that web-based data objects must passively await curatorial services: What if data objects were not tethered to repositories? What are the implications if the content were actively seeking out and injecting itself into the web infrastructure (i.e., search engine caches, web archives)? All of this leads to our primary research question: Can we create objects that preserve themselves more effectively than repositories or web infrastructure can?

9 citations


Proceedings ArticleDOI
15 Jun 2009
TL;DR: This paper presents a framework for describing web repositories and the status of web resources in them, including an abstract API for web repository interaction, the concepts of deep vs. flat and light/dark/grey repositories, and terminology for describing the recoverability of a web resource.
Abstract: In prior work we have demonstrated that search engine caches and archiving projects like the Internet Archive's Wayback Machine can be used to "lazily preserve" websites and reconstruct them when they are lost. We use the term "web repositories" for collections of automatically refreshed and migrated content, and collectively we refer to these repositories as the "web infrastructure". In this paper we present a framework for describing web repositories and the status of web resources in them. This includes an abstract API for web repository interaction, the concepts of deep vs. flat and light/dark/grey repositories, and terminology for describing the recoverability of a web resource. Our API may serve as a foundation for future web repository interfaces.
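The paper's actual API is not reproduced in the abstract; the sketch below is only a hypothetical illustration of what an abstract web-repository interface could look like, with invented class and method names.

```python
# Hypothetical sketch of an abstract web-repository API; the class and
# method names are invented for illustration and are not the paper's API.
from __future__ import annotations
from abc import ABC, abstractmethod
from datetime import datetime

class WebRepository(ABC):
    """One member of the web infrastructure (a search engine cache or an archive)."""

    @abstractmethod
    def list_versions(self, uri: str) -> list[datetime]:
        """Datetimes at which this repository holds a copy of `uri`."""

    @abstractmethod
    def get_version(self, uri: str, when: datetime) -> bytes | None:
        """The stored representation of `uri` nearest to `when`, or None."""

# A "flat" repository (e.g. a search engine cache) would typically return at
# most one datetime from list_versions, while a "deep" repository (e.g. a web
# archive) can return many.
```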

8 citations


Proceedings ArticleDOI
29 Jun 2009
TL;DR: The correlation between the polls and the SEs steadily decreased as the season went on; the authors attribute this to the rankings in the web graph having "inertia" and not fluctuating as rapidly as the teams' on-the-field fortunes.
Abstract: In previous research it has been shown that link-based web page metrics can be used to predict experts' assessment of quality. We are interested in a related question: do expert rankings of real-world entities correlate with search engine (SE) rankings of corresponding web resources? To answer this question we compared rankings of college football teams in the US with rankings of their associated web resources. We looked at the weekly polls released by the Associated Press (AP) and the USA Today Coaches Poll. Both rank the top 25 teams according to the aggregated expertise of sports writers and college football coaches. For the entire 2008 season (8/2008 – 1/2009), we compared the ranking of teams (top 10 and top 25) according to the polls with the rankings of one to eight URLs associated with each team in Google, Live Search and Yahoo. We found moderate to high correlations between the final rankings of 2007 and the SE rankings in mid-2008, but the correlation between the polls and the SEs steadily decreased as the season went on. We believe this is because the rankings in the web graph (as reported via SEs) have "inertia" and do not fluctuate as rapidly as the teams' on-the-field fortunes.
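The comparison itself boils down to a rank correlation computed week by week; here is a toy sketch with invented ranks, not real poll or search-engine data.

```python
# Toy week-by-week comparison: Spearman correlation between an expert poll
# ranking and a search-engine ranking of the same teams. All ranks are
# invented; no real poll or SE data is used.
from scipy.stats import spearmanr

se_ranks = [1, 2, 3, 5, 4]          # rank of each team's URL in the search engine
poll_ranks_by_week = {
    "week 1": [1, 2, 3, 4, 5],      # early-season poll
    "week 8": [2, 1, 5, 3, 4],      # late-season poll has drifted
}

for week, poll_ranks in poll_ranks_by_week.items():
    rho, p = spearmanr(poll_ranks, se_ranks)
    print(f"{week}: rho = {rho:.2f} (p = {p:.2f})")
```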

6 citations


Proceedings ArticleDOI
15 Jun 2009
TL;DR: This work generates lexical signatures from web pages, acquires the mandatory document frequency values from three different search engine (SE) indexes, and cross-queries the LSs against the two SEs they were not generated from to compare the retrieval performance.
Abstract: We generate lexical signatures (LSs) from web pages and acquire the mandatory document frequency values from three different search engine (SE) indexes. We cross-query the LSs against the two SEs they were not generated from and compare the retrieval performance by parsing the result set and analyzing the rank of the source URL.
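A minimal sketch of TF-IDF lexical-signature generation follows; here the document-frequency values come from a tiny invented local corpus rather than from a search engine index, and the pages and URIs are placeholders.

```python
# Sketch of TF-IDF lexical-signature generation. Document frequencies here
# come from a tiny invented corpus instead of a search engine index, and the
# URIs are placeholders.
import math
import re
from collections import Counter

pages = {
    "http://example.org/a": "Memento adds protocol-based time travel for the web",
    "http://example.org/b": "Lexical signatures help rediscover missing web pages",
    "http://example.org/c": "Search engine caches preserve web content in bulk",
}

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

N = len(pages)
df = Counter(t for text in pages.values() for t in set(tokens(text)))

def lexical_signature(text: str, k: int = 5) -> list[str]:
    """Top-k terms of the page ranked by TF * IDF."""
    tf = Counter(tokens(text))
    score = {t: tf[t] * math.log(N / df[t]) for t in tf}
    return sorted(score, key=score.get, reverse=True)[:k]

for uri, text in pages.items():
    print(uri, lexical_signature(text))
```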

Posted Content
TL;DR: Inspired by Web 2.0 services such as digg, del.icio.us, and Yahoo! Buzz, a lightweight system called ReMember is developed that attempts to harness the collective abilities of the web community for preservation purposes instead of solely placing the burden of curatorial responsibilities on a small number of experts.
Abstract: The Open Archives Initiative (OAI) has recently created the Object Reuse and Exchange (ORE) project that defines Resource Maps (ReMs) for describing aggregations of web resources. These aggregations are susceptible to many of the same preservation challenges that face other web resources. In this paper, we investigate how the aggregations of web resources can be preserved outside of the typical repository environment and instead rely on the thousands of interactive users in the web community and the Web Infrastructure (the collection of web archives, search engines, and personal archiving services) to facilitate preservation. Inspired by Web 2.0 services such as digg, del.icio.us, and Yahoo! Buzz, we have developed a lightweight system called ReMember that attempts to harness the collective abilities of the web community for preservation purposes instead of solely placing the burden of curatorial responsibilities on a small number of experts.


Proceedings ArticleDOI
15 Jun 2009
TL;DR: The Timed-Locked Embargo Framework (TLEF) is implemented, and a quantitative analysis is provided of its successful data harvest of time-locked embargoed data with minimum time overhead and without compromising data security and integrity.
Abstract: Due to temporary access restrictions, embargoed data cannot be refreshed to unlimited parties during the embargo time interval. A solution to mitigate the risk of data loss has been developed that uses a data dissemination framework, the Timed-Locked Embargo Framework (TLEF), that allows data refreshing of encrypted instances of embargoed content in an open, unrestricted scholarly community. TLEF exploits implementations of existing technologies to "time-lock" data using timed-release cryptology so that TLEF can be deployed as digital resources encoded in a complex object format suitable for metadata harvesting. The framework successfully demonstrates dynamic record identification, time-lock puzzle encryption, encapsulation and dissemination as XML documents. We implement TLEF and provide a quantitative analysis of its successful data harvest of time-locked embargoed data with minimum time overhead and without compromising data security and integrity.
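The abstract names time-lock puzzle encryption as the mechanism; the sketch below shows the classic Rivest-Shamir-Wagner time-lock puzzle on which such timed-release schemes are typically built, with toy primes far too small for real security and a bare integer standing in for the embargoed key. It is an illustration of the underlying primitive, not TLEF's implementation.

```python
# Classic Rivest-Shamir-Wagner time-lock puzzle, the standard construction
# behind timed-release schemes of this kind. Toy primes, far too small for
# real security; a bare integer plays the role of the embargoed key.
import math
import random

def make_puzzle(t: int, p: int = 1_000_000_007, q: int = 998_244_353):
    """Creator side: fast, because phi(n) is known."""
    n, phi = p * q, (p - 1) * (q - 1)
    a = random.randrange(2, n)
    while math.gcd(a, n) != 1:
        a = random.randrange(2, n)
    key = pow(a, pow(2, t, phi), n)   # a^(2^t) mod n via the phi shortcut
    return (n, a, t), key             # publish (n, a, t); keep the key secret

def solve_puzzle(puzzle):
    """Solver side: forced to perform t sequential modular squarings."""
    n, a, t = puzzle
    x = a
    for _ in range(t):
        x = pow(x, 2, n)
    return x

puzzle, key = make_puzzle(t=100_000)
assert solve_puzzle(puzzle) == key    # key becomes recoverable only after the work
```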

Posted Content
TL;DR: The study finds very low frequencies of change and high Levenshtein scores, indicating that titles, on average, change little from their original, first-observed values (rooted comparison) and even less from the values of their previous observation (sliding).
Abstract: Inaccessible web pages are part of the browsing experience. The content of these pages however is often not completely lost but rather missing. Lexical signatures (LS) generated from the web pages’ textual content have been shown to be suitable as search engine queries when trying to discover a (missing) web page. Since LSs are expensive to generate, we investigate the potential of web pages’ titles as they are available at a lower cost. We present the results from studying the change of titles over time. We take titles from copies provided by the Internet Archive of randomly sampled web pages and show the frequency of change as well as the degree of change in terms of the Levenshtein score. We found very low frequencies of change and high Levenshtein scores indicating that titles, on average, change little from their original, first observed values (rooted comparison) and even less from the values of their previous observation (sliding).
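For reference, a minimal sketch of the comparison behind such a score: edit distance between two observations of a title, normalized here as 1 - distance / max(length) so that 1.0 means identical. The paper's exact normalization is not reproduced, and the example titles are invented.

```python
# Edit-distance comparison behind a title "Levenshtein score", normalized so
# that 1.0 means identical. The exact normalization used in the paper is not
# reproduced; the example titles are invented.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def title_score(old: str, new: str) -> float:
    return 1.0 - levenshtein(old, new) / max(len(old), len(new), 1)

print(title_score("Old Dominion University - Home",
                  "Old Dominion University | Home"))
```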

Proceedings ArticleDOI
15 Jun 2009
TL;DR: This work compares Billboard's "Hot 100 Airplay" music charts with SE rankings of associated web resources to investigate whether expert rankings of real-world entities correlate with search engine (SE) rankings of corresponding web resources.
Abstract: We investigate the question of whether expert rankings of real-world entities correlate with search engine (SE) rankings of corresponding web resources. We compare Billboard's "Hot 100 Airplay" music charts with SE rankings of associated web resources. Out of nine comparisons we found two strong, two moderate, two weak and one negative correlation. The remaining two comparisons were inconclusive.


Posted Content
TL;DR: In this paper, the authors compare four automated methods for rediscovering missing web pages (pages that return the 404 "Page Not Found" error), noting that the manual use of search engines to rediscover missing pages can be frustrating and unsuccessful.
Abstract: Missing web pages (pages that return the 404 "Page Not Found" error) are part of the browsing experience. The manual use of search engines to rediscover missing pages can be frustrating and unsuccessful. We compare four automated methods for rediscovering web pages. We extract the page's title, generate the page's lexical signature (LS), obtain the page's tags from the bookmarking website delicious.com and generate a LS from the page's link neighborhood. We use the output of all methods to query Internet search engines and analyze their retrieval performance. Our results show that both LSs and titles perform fairly well, with over 60% of URIs returned top-ranked from Yahoo!. However, the combination of methods improves the retrieval performance. Considering the complexity of LS generation, querying the title first and, in case of insufficient results, querying the LS second is the preferable setup. This combination accounts for more than 75% top-ranked URIs.
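A sketch of the "title first, lexical signature second" strategy that these results favour is shown below; search() is only a stub standing in for whichever search-engine API is available, and the example inputs are invented.

```python
# "Title first, lexical signature second" rediscovery strategy. search() is
# a stub standing in for whichever search-engine API is available; the
# example inputs below are invented.
from __future__ import annotations

def search(query: str) -> list[str]:
    # Replace with a real search-engine API call returning ranked URIs.
    return []

def rediscover(title: str, lexical_signature: list[str]) -> str | None:
    """Candidate URI for a missing page: the cheap title query first, the
    costlier lexical-signature query only if the title yields nothing."""
    for query in (title, " ".join(lexical_signature)):
        results = search(query)
        if results:
            return results[0]
    return None

print(rediscover("Old Dominion University Computer Science",
                 ["warrick", "memento", "lexical", "signatures", "preservation"]))
```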