Proceedings ArticleDOI

The open archives initiative: building a low-barrier interoperability framework

01 Jan 2001-pp 54-62
TL;DR: The recent history of the OAI is described - its origins in promoting E-Prints, the broadening of its focus, the details of its technical standard for metadata harvesting, the applications of this standard, and future plans.
Abstract: The Open Archives Initiative (OAI) develops and promotes interoperability solutions that aim to facilitate the efficient dissemination of content. The roots of the OAI lie in the E-Print community. Over the last year its focus has been extended to include all content providers. This paper describes the recent history of the OAI - its origins in promoting E-Prints, the broadening of its focus, the details of its technical standard for metadata harvesting, the applications of this standard, and future plans.
Citations
01 Aug 2002
TL;DR: Excerpts from a SPARC position paper (footnotes omitted) introduce the concept of an institutional repository and describe its essential elements.
Abstract: Editor's Note: In July, the Scholarly Publishing and Academic Resources Coalition (SPARC) released a major position paper that examines the strategic roles institutional repositories serve for colleges and universities. What follows are excerpts from the paper (footnotes omitted) that introduce the concept and describe the essential elements of an institutional repository. The full paper is available on the SPARC Web site.

630 citations

Book ChapterDOI
30 May 2003
TL;DR: The paper argues the case for creating new types of digital libraries for scientific data with the same sort of management services as conventional digital libraries in addition to other data-specific services.
Abstract: This paper previews the imminent flood of scientific data expected from the next generation of experiments, simulations, sensors and satellites. In order to be exploited by search engines and data mining software tools, such experimental data needs to be annotated with relevant metadata giving information as to provenance, content, conditions and so on. The case for automating the process of going from raw data to information to knowledge is briefly discussed. The paper argues the case for creating new types of digital libraries for scientific data with the same sort of management services as conventional digital libraries in addition to other data-specific services. Some likely implications of both the Open Archives Initiative and e-Science data for the future role for university libraries are briefly mentioned. A substantial subset of this e-Science data needs to be archived and curated for long-term preservation. Some of the issues involved in the digital preservation of both scientific data and of the programs needed to interpret the data are reviewed. Finally, the implications of this wealth of e-Science data for the Grid middleware infrastructure are highlighted. * Postal address: EPSRC, Polaris House, North Star Avenue, Swindon SN2 1ET, UK + On secondment from the Department of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK

545 citations


Cites methods from "The open archives initiative: build..."

  • ...The Open Archive Initiative [44] – which provides software and tools for self-archiving of their research papers by scientists - addresses this issue to some extent, but this is clearly a large issue with profound implications for the whole future of university libraries....

    [...]

Journal ArticleDOI
TL;DR: The fundamental abstractions of Streams, Structures, Spaces, Scenarios, and Societies (5S), which allow us to define digital libraries rigorously and usefully, are proposed.
Abstract: Digital libraries (DLs) are complex information systems and therefore demand formal foundations lest development efforts diverge and interoperability suffers. In this article, we propose the fundamental abstractions of Streams, Structures, Spaces, Scenarios, and Societies (5S), which allow us to define digital libraries rigorously and usefully. Streams are sequences of arbitrary items used to describe both static and dynamic (e.g., video) content. Structures can be viewed as labeled directed graphs, which impose organization. Spaces are sets with operations on those sets that obey certain constraints. Scenarios consist of sequences of events or actions that modify states of a computation in order to accomplish a functional requirement. Societies are sets of entities and activities and the relationships among them. Together these abstractions provide a formal foundation to define, relate, and unify concepts---among others, of digital objects, metadata, collections, and services---required to formalize and elucidate "digital libraries". The applicability, versatility, and unifying power of the 5S model are demonstrated through its use in three distinct applications: building and interpretation of a DL taxonomy, informal and formal analysis of case studies of digital libraries (NDLTD and OAI), and utilization as a formal basis for a DL description language.

328 citations

Journal ArticleDOI
17 Aug 2012
TL;DR: A taxonomy for characterizing the current author name disambiguation methods described in the literature is proposed, a brief survey of the most representative ones is presented and several open challenges are discussed.
Abstract: Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. The challenges of dealing with author name ambiguity have led to a myriad of disambiguation methods. Generally speaking, the proposed methods usually attempt to group citation records of a same author by finding some similarity among them or try to directly assign them to their respective authors. Both approaches may either exploit supervised or unsupervised techniques. In this article, we propose a taxonomy for characterizing the current author name disambiguation methods described in the literature, present a brief survey of the most representative ones and discuss several open challenges.

265 citations


Additional excerpts

  • ..., by means of automatic harvesting [30])....

    [...]

Proceedings ArticleDOI
07 Jun 2005
TL;DR: This paper provides a theoretical framework to investigate the query generation problem for the hidden Web and proposes effective policies for generating queries automatically and experimentally evaluates the effectiveness of these policies on 4 real hidden Web sites.
Abstract: An ever-increasing amount of information on the Web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain Web sites. These pages are often referred to as the Hidden Web or the Deep Web. Since there are no static links to the Hidden Web pages, search engines cannot discover and index such pages and thus do not return them in the results. However, according to recent studies, the content provided by many Hidden Web sites is often of very high quality and can be extremely valuable to many users. In this paper, we study how we can build an effective Hidden Web crawler that can autonomously discover and download pages from the Hidden Web. Since the only "entry point" to a Hidden Web site is a query interface, the main challenge that a Hidden Web crawler has to face is how to automatically generate meaningful queries to issue to the site. Here, we provide a theoretical framework to investigate the query generation problem for the Hidden Web and we propose effective policies for generating queries automatically. Our policies proceed iteratively, issuing a different query in every iteration. We experimentally evaluate the effectiveness of these policies on 4 real Hidden Web sites and our results are very promising. For instance, in one experiment, one of our policies downloaded more than 90% of a Hidden Web site (that contains 14 million documents) after issuing fewer than 100 queries.

225 citations

References
Journal ArticleDOI
TL;DR: This report outlines IBM’s perspective on key supporting technologies and on the unique challenges highlighted by the emergence of digital libraries.
Abstract: Abstracting; Accessibility; Agents; Annotation; Archive; Billing, charging; Browsing; Catalog; Classification; Clustering; Commercial service; Content conversion; Copyright clearance; Courseware; Database; Diagrams (e.g., CAD); Digital video; Discipline-level library; Distributed processing; Document analysis; Document model; Economic study; Education-support; Electronic publishing; Ethnographic study; Filtering; Geographic information system; Hypermedia; Hypertext; Image processing; Indexing; Information retrieval; Intellectual property rights; Interactive; Knowledge base; Knowbot; Library science; Mediator; Multilingual; Multimedia stream playback; Multimedia systems; Multimodal; National library; Navigation; Object-oriented; OCR; OODB support; Personalization; Preservation; Privacy; Publisher library; Repository; Scalability; Searching; Security; Sociological study; Storage; Standard; Subscription; Sustainability; Training support; Usability; Virtual (integration); Visualization; World-Wide Web.
...its characterization of digital libraries. Many important projects and perspectives have been omitted. Here we give some pointers to aid further exploration, and of course we encourage interested readers to attend the numerous conferences and workshops scheduled in this field, many sponsored by or in cooperation with ACM and its SIGs. One early journal special issue is introduced in [6]. It includes articles on copyright and intellectual property rights, a subscription model for handling funds transfer related to digital libraries, a description of the evolution of the WAIS search system in general and its interfaces in particular, an overview of the Right Pages system and its use of OCR and document analysis algorithms, and an early overview of the Envision system [7]. We note that to many, intellectual property rights issues and ways to obtain revenue streams to sustain digital libraries are the most important open problems. The largest digital library conference makes its proceedings available over the WWW [9]. These contain many insightful discussions, proposals of new research ideas, descriptions of base technologies, and explanations of how the broad concept of a digital library fits in with the needs of specific user communities and the information they require. Readers can find a variety of works on agents, architectures, catalogs, collaboration, compression, document analysis from OCR and page images, document structure, electronic journals, heterogeneous sources, knowledge-based approaches, library science, numerical data collections, object stores, and organizational usability. For more details on the origins of the Digital Library Initiative, and for a variety of perspectives on open research problems, we refer the reader to [5]. This work also has numerous pointers to people, projects, institutions, and other reference works in the area. For a perspective on the role the computer industry should have in this field, see [10]. This report outlines IBM’s perspective on key supporting technologies and on the unique challenges highlighted by the emergence of digital libraries. We expect considerable interest from the corporate sector as well as from government agencies in this important area of information technology. For lack of space, we have had to omit many publications on networking and storage technologies, sociological and ethnographic studies, library and information science, OCR and document analysis or conversion, and rights management. These and other works are needed to round out the discussion of digital libraries. However, we encourage you to read the rest of this issue as a good starting point for your future studies of this important field. We invite you to not only use but also help in the creation of a future World Digital Library System!

654 citations

Journal ArticleDOI
TL;DR: The convention presents a simple technical and organizational framework to support basic interoperability among e-print archives and participants have expressed the intention of implementing this framework to allow for interoperability experiments in the course of the year 2000.
Abstract: Welcome to the Santa Fe Convention. This convention is the result of a meeting of the Open Archives Initiative which was held in Santa Fe, New Mexico, on October 21-22 1999. This convention has been endorsed unanimously by all the participants at the meeting, who represented organizations maintaining or planning e-print archives intended for open access and organizations interested in providing services, such as search interfaces or citation-linking, based on the data in those archives. The convention presents a simple technical and organizational framework to support basic interoperability among e-print archives. Participants have expressed the intention of implementing this framework to allow for interoperability experiments in the course of the year 2000. Maintainers of existing or forthcoming e-print archives that were not represented at the meeting are strongly encouraged to join this effort by implementing the framework for their archives.

249 citations

Journal ArticleDOI
TL;DR: The authors describe a set of automated archives for electronic communication of research information that have been operational in many fields of physics, and some related and unrelated disciplines, starting from 1991, and now serve over 35,000 users worldwide from over 70 countries, and process more than 70,000 electronic transactions per day.
Abstract: I describe a set of automated archives for electronic communication of research information that have been operational in many fields of physics, and some related and unrelated disciplines, starting from 1991. These archives now serve over 35,000 users worldwide from over 70 countries, and process more than 70,000 electronic transactions per day. In some fields of physics, they have already supplanted traditional research journals as conveyers of both topical and archival research information. Many of the lessons learned from these systems should carry over to other fields of scholarly publication, i.e., those wherein authors are writing not for direct financial remuneration in the form of royalties, but rather primarily to communicate information (for the advancement of knowledge, with attendant benefits to their careers and professional reputations). These archives have in addition proven equally indispensable to researchers in less-developed countries.

163 citations


"The open archives initiative: build..." refers background in this paper

  • ...The well-known physics archive run by Paul Ginsparg at Los Alamos National Laboratory has already radically changed the publishing paradigm in its respective field....

    [...]

  • ...Perhaps the best known of these is the Physics archive(1) run by Paul Ginsparg [2] at Los Alamos National Laboratory....

    [...]

  • ...Appendix A OAI STEERING COMMITTEE Names are followed by affiliations: Caroline Arms (Library of Congress) Lorcan Dempsey (Joint Information Systems Committee, UK) Dale Flecker (Harvard University) Ed Fox (Virginia Tech) Paul Ginsparg (Los Alamos National Laboratory) Daniel Greenstein (DLF) Carl Lagoze (Cornell University) Clifford Lynch (CNI) John Ober (California Digital Library) Diann Rusch-Feja (Max Planck Institute for Human Development) Herbert Van de Sompel (Cornell University) Don Waters (The Andrew W. Mellon Foundation) 8....

    [...]

Journal ArticleDOI
TL;DR: The history and current directions of interoperability in different parts of computing systems relevant to Digital Libraries are discussed.
Abstract: Discusses the history and current directions of interoperability in different parts of computing systems relevant to Digital Libraries.

155 citations

21 Jun 1996
TL;DR: The Warwick Framework is a container architecture for aggregating logically, and perhaps physically, distinct packages of metadata that promotes interoperability and extensibility by allowing tools and agents to selectively access and manipulate individual packages and ignore others.
Abstract: We describe a result of the June 1996 Warwick Metadata II Workshop. This Warwick Framework is a container architecture for aggregating logically, and perhaps physically, distinct packages of metadata. This architecture allows separate administration and access to metadata packages, provides for varying syntax in each package in conformance with semantic requirements, and it promotes interoperability and extensibility by allowing tools and agents to selectively access and manipulate individual packages and ignore others. At the conclusion of the paper we propose implementations of the Framework in HTML, MIME, SGML, and distributed objects.
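The container idea in this abstract can be illustrated with a minimal sketch. It is not the Warwick Framework's actual specification or any of its proposed HTML, MIME, or SGML implementations; the class names, fields, and scheme labels below are hypothetical and chosen only to show packages being aggregated and then selectively accessed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetadataPackage:
    """One self-contained metadata package (hypothetical type)."""
    scheme: str   # illustrative label, e.g. "dublin-core"
    syntax: str   # each package may use its own syntax
    payload: str  # raw content, opaque to the container


@dataclass
class Container:
    """Aggregates packages without interpreting their contents."""
    packages: List[MetadataPackage] = field(default_factory=list)

    def add(self, package: MetadataPackage) -> None:
        self.packages.append(package)

    def select(self, scheme: str) -> List[MetadataPackage]:
        # A tool picks the packages it understands and ignores the rest.
        return [p for p in self.packages if p.scheme == scheme]


container = Container()
container.add(MetadataPackage("dublin-core", "xml", "<title>Example</title>"))
container.add(MetadataPackage("terms-and-conditions", "text", "Open access"))

# An agent that only understands the "dublin-core" label sees one package.
dc_only = container.select("dublin-core")
```

Keeping the payload opaque is the design choice that matters here: the container never needs to parse a package, so packages with new schemes or syntaxes can be added without changing it.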

145 citations


"The open archives initiative: build..." refers background in this paper

  • ...These issues have been a subject of considerable discussion in the metadata community [12, 13] – the OAI attempts to answer this in a simple and deployable manner....

    [...]