
Showing papers in "arXiv: Digital Libraries in 2005"


Posted Content
TL;DR: A dynamical model of collaborative tagging is presented that predicts regularities in user activity, tag frequencies, kinds of tags used, bursts of popularity in bookmarking and a remarkable stability in the relative proportions of tags within a given URL.
Abstract: Collaborative tagging describes the process by which many users add metadata in the form of keywords to shared content. Recently, collaborative tagging has grown in popularity on the web, on sites that allow users to tag bookmarks, photographs and other content. In this paper we analyze the structure of collaborative tagging systems as well as their dynamical aspects. Specifically, we discovered regularities in user activity, tag frequencies, kinds of tags used, bursts of popularity in bookmarking and a remarkable stability in the relative proportions of tags within a given URL. We also present a dynamical model of collaborative tagging that predicts these stable patterns and relates them to imitation and shared knowledge.

997 citations
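
The abstract above does not spell out its dynamical model. As an illustration only, the sketch below assumes a Pólya urn, a standard imitation process in which each new tag is drawn with probability proportional to how often that tag has already been used. The function name and the starting counts are hypothetical.

```python
import random


def polya_urn_proportions(initial_counts, steps, seed=0):
    """Simulate an imitation-driven tagging process (Polya urn).

    initial_counts: dict mapping tag -> positive starting count (assumed).
    steps: number of additional tag assignments to simulate.
    Returns the final proportion of each tag.
    """
    rng = random.Random(seed)
    counts = dict(initial_counts)
    for _ in range(steps):
        tags = list(counts)
        weights = [counts[t] for t in tags]
        # A new user imitates earlier users: popular tags are more likely.
        chosen = rng.choices(tags, weights=weights, k=1)[0]
        counts[chosen] += 1
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, polya_urn_proportions({"web": 1, "design": 1, "css": 1}, n))
```

In a process of this kind the proportions fluctuate early on and then settle, which is one way to picture the stability the abstract reports.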


Posted Content
TL;DR: In this paper, the authors generated networks of journal relationships from citation and download data, and determined journal impact rankings from these networks using a set of social network centrality metrics, which were compared to the ISI IF.
Abstract: We generated networks of journal relationships from citation and download data, and determined journal impact rankings from these networks using a set of social network centrality metrics. The resulting journal impact rankings were compared to the ISI IF. Results indicate that, although social network metrics and ISI IF rankings deviate moderately for citation-based journal networks, they differ considerably for journal networks derived from download data. We believe the results represent a unique aspect of general journal impact that is not captured by the ISI IF. These results furthermore raise questions regarding the validity of the ISI IF as the sole assessment of journal impact, and suggest the possibility of devising impact metrics based on usage information in general.

201 citations


Posted Content
TL;DR: The half-life of a URL referenced in a D-Lib Magazine article is approximately 10 years, and URLs were more likely to be unavailable if they pointed to resources in the .net, .edu or country-specific top-level domains, used non-standard ports, or pointed to resources with uncommon or deprecated extensions.
Abstract: We explore the availability and persistence of URLs cited in articles published in D-Lib Magazine. We extracted 4387 unique URLs referenced in 453 articles published from July 1995 to August 2004. The availability was checked three times a week for 25 weeks from September 2004 to February 2005. We found that approximately 28% of those URLs failed to resolve initially, and 30% failed to resolve at the last check. A majority of the unresolved URLs were due to 404 (page not found) and 500 (internal server error) errors. The content pointed to by the URLs was relatively stable; only 16% of the content registered more than a 1 KB change during the testing period. We explore possible factors which may cause a URL to fail by examining its age, path depth, top-level domain and file extension. Based on the data collected, we found the half-life of a URL referenced in a D-Lib Magazine article is approximately 10 years. We also found that URLs were more likely to be unavailable if they pointed to resources in the .net, .edu or country-specific top-level domain, used non-standard ports (i.e., not port 80), or pointed to resources with uncommon or deprecated extensions (e.g., .shtml, .ps, .txt).

66 citations
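
To make the half-life figure concrete, the sketch below assumes simple exponential decay, which is only one possible reading of "half-life" and is not necessarily how the study computed it. The function name is hypothetical; the 10-year default comes from the abstract.

```python
def surviving_fraction(age_years, half_life_years=10.0):
    """Expected fraction of URLs still resolving after age_years.

    Assumes exponential decay: the surviving fraction halves every
    half_life_years. This is an illustrative model, not the paper's method.
    """
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (age_years / half_life_years)


if __name__ == "__main__":
    for age in (0, 5, 10, 20):
        print(age, round(surviving_fraction(age), 3))
```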


Posted Content
TL;DR: The state of the DL domain after a decade of activity is examined by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences, and clear advantages of PageRank and AuthorRank are shown over degree, closeness and betweenness centrality metrics.
Abstract: The field of digital libraries (DLs) coalesced in 1994: the first digital library conferences were held that year, awareness of the World Wide Web was accelerating, and the National Science Foundation awarded $24 Million (U.S.) for the Digital Library Initiative (DLI). In this paper we examine the state of the DL domain after a decade of activity by applying social network analysis to the co-authorship network of the past ACM, IEEE, and joint ACM/IEEE digital library conferences. We base our analysis on a common binary undirectional network model to represent the co-authorship network, and from it we extract several established network measures. We also introduce a weighted directional network model to represent the co-authorship network, for which we define AuthorRank as an indicator of the impact of an individual author in the network. The results are validated against conference program committee members in the same period. The results show clear advantages of PageRank and AuthorRank over degree, closeness and betweenness centrality metrics. We also investigate the amount and nature of international participation in Joint Conference on Digital Libraries (JCDL).

55 citations
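
The abstract above ranks authors on a weighted directional co-authorship network. The sketch below is a generic weighted PageRank by power iteration, not the paper's AuthorRank definition (the abstract does not give its weighting scheme). The edge weights, damping factor and function name are assumptions for illustration.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Generic PageRank on a weighted directed graph.

    edges: dict mapping (source, target) -> positive weight (hypothetical
    co-authorship weights). Returns a dict mapping node -> rank; ranks sum to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (src, dst), weight in edges.items():
            # Each node passes rank along its edges in proportion to weight.
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


if __name__ == "__main__":
    example = {("a", "c"): 2.0, ("b", "c"): 1.0, ("c", "a"): 1.0}
    print(weighted_pagerank(example))
```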


Posted Content
TL;DR: The MPEG-21 Digital Item Declaration (MPEG-21 DID) is an XML-based standard for representing compound digital assets that has so far received little attention in the digital library community.
Abstract: Various XML-based approaches aimed at representing compound digital assets have emerged over the last several years. Approaches that are of specific relevance to the digital library community include the Metadata Encoding and Transmission Standard (METS), the IMS Content Packaging XML Binding, and the XML Formatted Data Units (XFDU) developed by CCSDS Panel 2. The MPEG-21 Digital Item Declaration (MPEG-21 DID) is another standard specifying the representation of digital assets in XML that, so far, has received little attention in the digital library community. This article gives a brief insight into the MPEG-21 standardization effort, highlights the major characteristics of the MPEG-21 DID Abstract Model, and describes the MPEG-21 Digital Item Declaration Language (MPEG-21 DIDL), an XML syntax for the representation of digital assets based on the MPEG-21 DID Abstract Model. Also, it briefly demonstrates the potential relevance of MPEG-21 DID to the digital library community by describing its use in the aDORe repository environment at the Research Library of the Los Alamos National Laboratory (LANL) for the representation of digital assets.

28 citations


Posted Content
TL;DR: Common, standardized access interfaces for diverse digital repository and archival systems are proposed, based on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and the NISO OpenURL Framework for Context-Sensitive Services (OpenURL Standard).
Abstract: In recent years, a variety of digital repository and archival systems have been developed and adopted. All of these systems aim at hosting a variety of compound digital assets and at providing tools for storing, managing and accessing those assets. This paper will focus on the definition of common and standardized access interfaces that could be deployed across such diverse digital repository and archival systems. The proposed interfaces are based on the two formal specifications that have recently emerged from the Digital Library community: the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and the NISO OpenURL Framework for Context-Sensitive Services (OpenURL Standard). As will be described, the former allows for the retrieval of batches of XML-based representations of digital assets, while the latter facilitates the retrieval of disseminations of a specific digital asset or of one or more of its constituents. The core properties of the proposed interfaces are explained in terms of the Reference Model for an Open Archival Information System (OAIS).

25 citations


Posted Content
TL;DR: This work develops a parallel set of requirements based on observations of how existing systems handle this task, and on an analysis of the threats to achieving the goal, and suggests disclosures that systems should provide as to how they satisfy their goals.
Abstract: The field of digital preservation is being defined by a set of standards developed top-down, starting with an abstract reference model (OAIS) and gradually adding more specific detail. Systems claiming conformance to these standards are entering production use. Work is underway to certify that systems conform to requirements derived from OAIS. We complement these requirements derived top-down by presenting an alternate, bottom-up view of the field. The fundamental goal of these systems is to ensure that the information they contain remains accessible for the long term. We develop a parallel set of requirements based on observations of how existing systems handle this task, and on an analysis of the threats to achieving the goal. On this basis we suggest disclosures that systems should provide as to how they satisfy their goals.

19 citations


Posted Content
TL;DR: mod_oai is an Apache 2.0 module that implements the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), the de facto standard for metadata exchange in digital libraries.
Abstract: We describe mod_oai, an Apache 2.0 module that implements the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH is the de facto standard for metadata exchange in digital libraries and allows repositories to expose their contents in a structured, application-neutral format with semantics optimized for accurate incremental harvesting. Current implementations of OAI-PMH are either separate applications that access an existing repository, or are built into repository software packages. mod_oai is different in that it optimizes harvesting web content by building OAI-PMH capability into the Apache server. We discuss the implications of adding harvesting capability to an Apache server and describe our initial experimental results accessing a departmental web site using both web crawling and OAI-PMH harvesting techniques.

17 citations
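
As a rough illustration of the incremental harvesting mentioned above, the sketch below builds an OAI-PMH `ListRecords` request URL. The `verb`, `metadataPrefix`, `from`, `until` and `set` arguments are standard OAI-PMH request parameters; the base URL and the function name are hypothetical and do not describe how mod_oai exposes its endpoint.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url is a placeholder for a repository's OAI-PMH endpoint.
    Passing from_date limits the harvest to records changed since that
    date, which is what makes harvesting incremental.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical endpoint, used only to show the shape of the request.
    print(list_records_url("http://example.org/oai", from_date="2005-01-01"))
```

A harvester would issue this request periodically, each time setting `from_date` to the date of its previous harvest.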


Posted Content
TL;DR: Fedora is an extensible framework for the storage, management, and dissemination of complex objects and the relationships among them, which accommodates the aggregation of local and distributed content into digital objects and the association of services with objects.
Abstract: The Fedora architecture is an extensible framework for the storage, management, and dissemination of complex objects and the relationships among them. Fedora accommodates the aggregation of local and distributed content into digital objects and the association of services with objects. This allows an object to have several accessible representations, some of them dynamically produced. The architecture includes a generic RDF-based relationship model that represents relationships among objects and their components. Queries against these relationships are supported by an RDF triple store. The architecture is implemented as a web service, with all aspects of the complex object architecture and related management functions exposed through REST and SOAP interfaces. The implementation is available as open-source software, providing the foundation for a variety of end-user applications for digital libraries, archives, institutional repositories, and learning object systems.

16 citations


Posted Content
TL;DR: Grace, an HTTP proxy server that transparently converts browser-incompatible and obsolete web content into web content that a browser is able to display without the use of plug-in software, is introduced.
Abstract: Web accessible content stored in obscure, unpopular or obsolete formats represents a significant problem for digital preservation. The file formats that encode web content represent the implicit and explicit choices of web site maintainers at a particular point in time. Older file formats that have fallen out of favor are obviously a problem, but so are new file formats that have not yet been fully supported by browsers. Often browsers use plug-in software for displaying old and new formats, but plug-ins can be difficult to find, install and replicate across all environments that one may use. We introduce Grace, an HTTP proxy server that transparently converts browser-incompatible and obsolete web content into web content that a browser is able to display without the use of plug-ins. Grace is configurable on a per user basis and can be expanded to provide an array of conversion services. We illustrate how the Grace prototype transforms several image formats (XBM, PNG with various alpha channels, and JPEG 2000) so they are viewable in Internet Explorer.

12 citations


Posted Content
TL;DR: The National Science Digital Library (NSDL) has developed a model of service interaction that enables loosely-coupled third party services to provide metadata enhancements to a central repository, with interactions orchestrated by a centralized software application.
Abstract: Harvested metadata often suffers from uneven quality to the point that utility is compromised. Although some aggregators have developed methods for evaluating and repairing specific metadata problems, it has been unclear how these methods might be scaled into services that can be used within an automated production environment. The National Science Digital Library (NSDL), as part of its work with INFOMINE, has developed a model of service interaction that enables loosely-coupled third party services to provide metadata enhancements to a central repository, with interactions orchestrated by a centralized software application.

Posted Content
TL;DR: It is argued that the value of CBPP platforms can be leveraged by adapting them for pedagogical purposes, and a recent adaptation of the Noosphere system was used to host a graduate-level mathematics course at Dalhousie University.
Abstract: Commons-based peer-production (CBPP) is the de-centralized, net-based approach to the creation and dissemination of information resources. Underlying every CBPP system is a virtual community brought together by an internet tool (such as a web site) and structured by a specific collaboration protocol. In this talk we will argue that the value of such platforms can be leveraged by adapting them for pedagogical purposes. We report on one such recent adaptation. The Noosphere system is a web-based collaboration environment that underlies the popular PlanetMath website, a collaboratively written encyclopedia of mathematics licensed under the GNU Free Documentation License (FDL). Recently, the system was used to host a graduate-level mathematics course at Dalhousie University, in Halifax, Canada. The course consisted of regular lectures and assignment problems. The students in the course collaborated on a set of course notes, encapsulating the lecture content and giving solutions of assigned problems. The successful outcome of this experiment demonstrated that a dedicated Noosphere system is well suited for classroom applications. We argue that this "proof of concept" experience also strongly suggests that every successful CBPP platform possesses latent pedagogical value. Supported by Dalhousie's Centre for Learning and Teaching, and N.S.E.R.C., Canada.

Posted Content
TL;DR: In this article, the authors explore the potential of the MPEG-21 Digital Item Declaration (MPEG-21 DID) in a Digital Preservation context, by looking at the core building blocks of the OAIS Information Model and the way in which they map to the MPEG-21 DID abstract model and the MPEG-21 DIDL XML syntax.
Abstract: Various efforts aimed at representing digital assets have emerged from several communities over the last years, including the Metadata Encoding and Transmission Standard (METS), the IMS Content Packaging (IMS-CP) XML Binding and the XML Formatted Data Units (XFDU). The MPEG-21 Digital Item Declaration (MPEG-21 DID) is another approach that can be used for the representation of digital assets in XML. This paper will explore the potential of the MPEG-21 DID in a Digital Preservation context, by looking at the core building blocks of the OAIS Information Model and the way in which they map to the MPEG-21 DID abstract model and the MPEG-21 DIDL XML syntax.

Posted Content
TL;DR: In this article, the authors introduce the write-once/read-many XMLtape/ARC storage approach for Digital Objects and their constituent datastreams, which combines two interconnected file-based storage mechanisms that are made accessible in a protocol-based manner.
Abstract: This paper introduces the write-once/read-many XMLtape/ARC storage approach for Digital Objects and their constituent datastreams. The approach combines two interconnected file-based storage mechanisms that are made accessible in a protocol-based manner. First, XML-based representations of multiple Digital Objects are concatenated into a single file named an XMLtape. An XMLtape is a valid XML file; its format definition is independent of the choice of the XML-based complex object format by which Digital Objects are represented. The creation of indexes for both the identifier and the creation datetime of the XML-based representation of the Digital Objects facilitates OAI-PMH-based access to Digital Objects stored in an XMLtape. Second, ARC files, as introduced by the Internet Archive, are used to contain the constituent datastreams of the Digital Objects in a concatenated manner. An index for the identifier of the datastream facilitates OpenURL-based access to an ARC file. The interconnection between XMLtapes and ARC files is provided by conveying the identifiers of ARC files associated with an XMLtape as administrative information in the XMLtape, and by including OpenURL references to constituent datastreams of a Digital Object in the XML-based representation of that Digital Object.

Posted Content
TL;DR: This model shows that the most important strategies for increasing the reliability of long-term storage are detecting latent faults quickly, automating fault repair to make it cheaper and faster, and increasing the independence of data replicas.
Abstract: Many emerging Web services, such as email, photo sharing, and web site archives, need to preserve large amounts of quickly-accessible data indefinitely into the future. In this paper, we make the case that these applications' demands on large scale storage systems over long time horizons require us to re-evaluate traditional storage system designs. We examine threats to long-lived data from an end-to-end perspective, taking into account not just hardware and software faults but also faults due to humans and organizations. We present a simple model of long-term storage failures that helps us reason about the various strategies for addressing these threats in a cost-effective manner. Using this model we show that the most important strategies for increasing the reliability of long-term storage are detecting latent faults quickly, automating fault repair to make it faster and cheaper, and increasing the independence of data replicas.
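
The point about replica independence can be illustrated with a deliberately simplified calculation. This is not the paper's failure model: it assumes every replica is lost independently with the same probability during some time window, and the function name and numbers are hypothetical.

```python
def prob_all_replicas_lost(p_single, replicas):
    """Probability that every replica is lost in the same window.

    Assumes fully independent replicas, each lost with probability
    p_single. Correlated faults (shared software, operators, or sites)
    would make the true probability higher than this value.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return p_single ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, prob_all_replicas_lost(0.1, n))
```

Under this assumption each added replica multiplies the loss probability by `p_single`; the benefit shrinks as replicas become more correlated, which is why independence matters.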

Posted Content
TL;DR: In this paper, an original way to add new data to a reference dictionary from several other lexical resources, without losing consistency, is presented; the operation is carried out in order to get lexical information classified by the sense of the entry.
Abstract: This paper presents an original way to add new data to a reference dictionary from several other lexical resources, without losing consistency. This operation is carried out in order to get lexical information classified by the sense of the entry. This classification makes it possible to enrich utterances (in QA: the queries) according to their meaning, and to reduce noise. An analysis of the problems encountered shows the interest of this method, and highlights the points that have to be tackled.

Posted Content
TL;DR: This paper describes the aDORe repository architecture designed and implemented for ingesting, storing, and accessing a vast collection of Digital Objects at the Research Library of the Los Alamos National Laboratory, which is highly modular and standards-based.
Abstract: This paper describes the aDORe repository architecture, designed and implemented for ingesting, storing, and accessing a vast collection of Digital Objects at the Research Library of the Los Alamos National Laboratory. The aDORe architecture is highly modular and standards-based. In the architecture, the MPEG-21 Digital Item Declaration Language is used as the XML-based format to represent Digital Objects that can consist of multiple datastreams as Open Archival Information System Archival Information Packages (OAIS AIPs). Through an ingestion process, these OAIS AIPs are stored in a multitude of autonomous repositories. A Repository Index keeps track of the creation and location of all the autonomous repositories, whereas an Identifier Locator registers in which autonomous repository a given Digital Object or OAIS AIP resides. A front-end to the complete environment, the OAI-PMH Federator, is introduced for requesting OAIS Dissemination Information Packages (OAIS DIPs). These OAIS DIPs can be the stored OAIS AIPs themselves, or transformations thereof. This front-end allows OAI-PMH harvesters to recurrently and selectively collect batches of OAIS DIPs from aDORe, and hence to create multiple, parallel services using the collected objects. Another front-end, the OpenURL Resolver, is introduced for requesting OAIS Result Sets. An OAIS Result Set is a dissemination of an individual Digital Object or of its constituent datastreams. Both front-ends make use of an MPEG-21 Digital Item Processing Engine to apply services to OAIS AIPs, Digital Objects, or constituent datastreams that were specified in a dissemination request.

Posted Content
Simeon Warner
TL;DR: The validation logs are examined to produce a breakdown of reasons why repositories fail validation, which highlights some common problems and will be used to guide work to improve the validation service.
Abstract: I present a summary of recent use of the Open Archives Initiative (OAI) registration and validation services for data-providers. The registration service has seen a steady stream of registrations since its launch in 2002, and there are now over 220 registered repositories. I examine the validation logs to produce a breakdown of reasons why repositories fail validation. This breakdown highlights some common problems and will be used to guide work to improve the validation service.

Posted Content
TL;DR: This paper argues that a dynamic narrative flow is enabled by effective management of complex content and communications in a decentralized web-based education digital library, making publishing objects such as aggregations of resources or selected parts of objects accessible through a Content and Communications System.
Abstract: Education digital libraries contain cataloged resources as well as contextual information about innovations in the use of educational technology, exemplar stories about community activities, and news from various user communities that include teachers, students, scholars, and developers. Long-standing library traditions of service, preservation, democratization of knowledge, rich discourse, equal access, and fair use are evident in library communications models that both pull in and push out contextual information from multiple sources integrated with editorial production processes. This paper argues that a dynamic narrative flow [1] is enabled by effective management of complex content and communications in a decentralized web-based education digital library, making publishing objects such as aggregations of resources or selected parts of objects [4] accessible through a Content and Communications System. Providing services that encourage patrons to reuse, reflect out, and contribute resources back [5] to the Library increases the reach and impact of the National Science Digital Library (NSDL). This system is a model for distributed content development and effective communications for education digital libraries in general.

Posted Content
TL;DR: In this article, the authors describe the underlying data model and implementation of a new architecture for the National Science Digital Library (NSDL) by the Core Integration Team (CI), based on the notion of an information network overlay.
Abstract: We describe the underlying data model and implementation of a new architecture for the National Science Digital Library (NSDL) by the Core Integration Team (CI). The architecture is based on the notion of an information network overlay. This network, implemented as a graph of digital objects in a Fedora repository, allows the representation of multiple information entities and their relationships. The architecture provides the framework for contextualization and reuse of resources, which we argue is essential for the utility of the NSDL as a tool for teaching and learning.

Posted Content
TL;DR: This paper presents a lexical disambiguation system, initially developed for English and now adapted to French, that associates a word with its meaning in a given context using electronic dictionaries as semantically annotated corpora in order to extract semantic disambiguation rules.
Abstract: This paper presents a lexical disambiguation system, initially developed for English and now adapted to French. This system associates a word with its meaning in a given context using electronic dictionaries as semantically annotated corpora in order to extract semantic disambiguation rules. We describe the rule extraction and application process as well as the evaluation of the system. The results for French give us insight into some possible improvements to the nature and content of lexical resources adapted for disambiguation in this framework.

Posted Content
TL;DR: The EqRank algorithm, designed to cluster vertexes of directed graphs, is outlined, and the results of applying EqRank to the SPIRES citation graph are presented.
Abstract: SEUS, Bol’shoi Trekhsvyatitel’skii per. 2, Moscow, 109028 Russia (Dated: January 8, 2005). SPIRES is the largest database of scientific papers in the subject field of high energy and nuclear physics. It contains information on the citation graph of more than half a million papers (vertexes of the citation graph). We outline the EqRank algorithm designed to cluster vertexes of directed graphs, and present the results of EqRank application to the SPIRES citation graph. The hierarchical clustering of SPIRES yielded by EqRank is used to set up a web service, which is also outlined.

Journal Article
TL;DR: In this paper, a deconstructed publication model is presented in which the peer review process is mediated by an OAI-PMH peer-review service, which uses a social-network algorithm to determine potential reviewers for a submitted manuscript and for weighting the relative influence of each participating reviewer's evaluations.
Abstract: Pre-print repositories have seen a significant increase in use over the past fifteen years across multiple research domains. Researchers are beginning to develop applications capable of using these repositories to assist the scientific community above and beyond the pure dissemination of information. The contribution set forth by this paper emphasizes a deconstructed publication model in which the peer-review process is mediated by an OAI-PMH peer-review service. This peer-review service uses a social-network algorithm to determine potential reviewers for a submitted manuscript and for weighting the relative influence of each participating reviewer's evaluations. This paper also suggests a set of peer-review specific metadata tags that can accompany a pre-print's existing metadata record. The combinations of these contributions provide a unique repository-centric peer-review model that fits within the widely deployed OAI-PMH framework.