
Showing papers in "arXiv: Digital Libraries in 1999"


Posted Content
TL;DR: The authors proposed a content-based book recommendation system that utilizes information extraction and a machine-learning algorithm for text categorization, which has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations.
Abstract: Recommender systems improve access to relevant products and information by making personalized suggestions based on previous examples of a user's likes and dislikes. Most existing recommender systems use social filtering methods that base recommendations on other users' preferences. By contrast, content-based methods use information about an item itself to make suggestions. This approach has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations. We describe a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization. Initial experimental results demonstrate that this approach can produce accurate recommendations.

1,268 citations
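The content-based approach in the abstract above can be illustrated with a small text categorizer. The following is a minimal sketch, assuming a bag-of-words naive Bayes classifier with Laplace smoothing; the abstract does not name the learning algorithm or the features, so the function names `train` and `like_score` and the two labels are hypothetical.

```python
import math
from collections import Counter

LABELS = ("like", "dislike")  # assumed rating classes, not from the paper


def train(examples):
    """Build word counts per label from (description_text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set().union(*word_counts.values())
    return word_counts, doc_counts, vocab


def like_score(model, text):
    """Return the log-odds that the user likes an unseen item (> 0 means like)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_prob = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        # Smoothed class prior, so a label with no examples does not break log().
        lp = math.log((doc_counts[label] + 1) / (total_docs + len(LABELS)))
        for word in text.lower().split():
            if word in vocab:  # words never seen in training carry no evidence
                lp += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        log_prob[label] = lp
    return log_prob["like"] - log_prob["dislike"]
```

Because the score depends only on an item's own description, an item that no user has rated yet can still be scored, which is the property the abstract highlights.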


Posted Content
TL;DR: This paper uses a large test corpus to evaluate Kea’s effectiveness in terms of how many author-assigned keyphrases are correctly identified, and describes the system, which is simple, robust, and publicly available.
Abstract: Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.

898 citations
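The "calculates feature values for each candidate" step in the abstract above can be sketched as follows. This is an illustration, not Kea's code: it assumes the two features are TF x IDF and the relative position of the first occurrence, and it uses a smoothed IDF of my own choosing. The function name and parameters are hypothetical.

```python
import math


def candidate_features(doc_tokens, candidate, doc_freq, num_docs):
    """Return (tfidf, first_occurrence) for one candidate phrase, or None.

    doc_tokens: the document as a list of lowercase tokens
    candidate:  the phrase as a sequence of tokens
    doc_freq:   number of training documents containing the phrase (assumed input)
    num_docs:   size of the training corpus (assumed input)
    """
    candidate = list(candidate)
    n = len(candidate)
    positions = [
        i for i in range(len(doc_tokens) - n + 1)
        if doc_tokens[i:i + n] == candidate
    ]
    if not positions:
        return None
    tf = len(positions) / len(doc_tokens)
    # Add-one smoothing is an assumption made here to avoid log of zero.
    idf = math.log2((num_docs + 1) / (doc_freq + 1))
    first_occurrence = positions[0] / len(doc_tokens)
    return tf * idf, first_occurrence
```

A learned model would then take these feature pairs, computed on training documents with known keyphrases, and predict which candidates in a new document are good keyphrases.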


Posted Content
TL;DR: Three types of video surrogates, visual (keyframes), verbal (keywords/phrases), and a combination of the two, were designed and studied in a qualitative investigation of user cognitive processes to inform interface design and video representation for video retrieval and browsing.
Abstract: Three types of video surrogates - visual (keyframes), verbal (keywords/phrases), and a combination of the two - were designed and studied in a qualitative investigation of user cognitive processes. The results favor the combined surrogates in which verbal information and images reinforce each other, lead to better comprehension, and may actually require less processing time. The results also highlight image features users found most helpful. These findings will inform the interface design and video representation for video retrieval and browsing.

57 citations


Posted Content
TL;DR: In this paper, the authors describe a knowledge-based Web environment to support the emergence of a community-constructed semantic hypertext, and the services it could provide to assist the interpretation of an idea or document in the context of its literature.
Abstract: This paper is concerned with tracking and interpreting scholarly documents in distributed research communities. We argue that current approaches to document description, and current technological infrastructures, particularly over the World Wide Web, provide poor support for these tasks. We describe the design of a digital library server which will enable authors to submit a summary of the contributions they claim their document makes, and its relations to the literature. We describe a knowledge-based Web environment to support the emergence of such a community-constructed semantic hypertext, and the services it could provide to assist the interpretation of an idea or document in the context of its literature. The discussion considers in detail how the approach addresses usability issues associated with knowledge structuring environments.

25 citations


Posted Content
TL;DR: This paper details the evolution of one important digital library component as it has grown in functionality and usefulness over several years of use by a live, unrestricted community.
Abstract: Digital libraries must reach out to users from all walks of life, serving information needs at all levels. To do this, they must attain high standards of usability over an extremely broad audience. This paper details the evolution of one important digital library component as it has grown in functionality and usefulness over several years of use by a live, unrestricted community. Central to its evolution have been user studies, analysis of use patterns, and formative usability evaluation. We extrapolate that all three components are necessary in the production of successful digital library systems.

19 citations


Posted Content
TL;DR: This paper proposes the Multimedia Description Framework (MDF), which is designed to accommodate multiple description (meta-data) schemes, both MPEG-7 and non-MPEG-7, in an integrated architecture, and uses examples to show how an MDF description draws on the combined strengths of different description schemes to enhance its expressive power and flexibility.
Abstract: MPEG is undertaking a new initiative to standardize content description of audio and video data/documents. When it is finalized in 2001, MPEG-7 is expected to provide standardized description schemes for concise and unambiguous content description of data/documents of complex media types. Meanwhile, other meta-data or description schemes, such as Dublin Core, XML, etc., are becoming popular in different application domains. In this paper, we propose the Multimedia Description Framework (MDF), which is designed to accommodate multiple description (meta-data) schemes, both MPEG-7 and non-MPEG-7, in an integrated architecture. We will use examples to show how an MDF description draws on the combined strengths of different description schemes to enhance its expressive power and flexibility. We conclude the paper with a discussion of using the MDF description of a movie video to search/retrieve required scene clips from the movie, on the MDF prototype system we have implemented.

17 citations


Posted Content
TL;DR: This paper investigates OCR of degraded text images, drawn from the archive at Los Alamos National Laboratory, using a standard OCR engine (Adobe Capture), and shows that a simple noise model can give a good prediction of the number of OCR errors.
Abstract: Commercial OCR packages work best with high-quality scanned images. They often produce poor results when the image is degraded, either because the original itself was of poor quality, or because of excessive photocopying. The ability to predict the word failure rate of OCR from a statistical analysis of the image can help in making decisions in the trade-off between the success rate of OCR and the cost of human correction of errors. This paper describes an investigation of OCR of degraded text images using a standard OCR engine (Adobe Capture). The documents were selected from those in the archive at Los Alamos National Laboratory. By introducing noise in a controlled manner into perfect documents, we show how the quality of OCR can be predicted from the nature of the noise. The preliminary results show that a simple noise model can give a good prediction of the number of OCR errors.

11 citations
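The idea of "introducing noise in a controlled manner into perfect documents" from the abstract above can be sketched in a few lines. The abstract does not specify the noise model, so this example assumes salt-and-pepper noise on a binary page image (each pixel flipped independently with a fixed probability); the function names are hypothetical, and no OCR engine is called.

```python
import random


def add_salt_and_pepper(image, flip_prob, seed=0):
    """Return a copy of a binary image (rows of 0/1) with pixels flipped at random.

    flip_prob is the per-pixel flip probability; seed makes the run repeatable.
    """
    rng = random.Random(seed)
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in image
    ]


def flipped_fraction(original, noisy):
    """Fraction of pixels that differ between two images of the same shape."""
    total = sum(len(row) for row in original)
    changed = sum(
        a != b
        for row_o, row_n in zip(original, noisy)
        for a, b in zip(row_o, row_n)
    )
    return changed / total
```

In an experiment of the kind the abstract describes, one would run the OCR engine on pages degraded at several values of `flip_prob` and relate the measured error counts to the noise level.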


Posted Content
TL;DR: This model, called MyLibrary, integrates the principles of librarianship with globally networked computing resources to create a dynamic, customer-driven front-end to any library's set of materials, giving libraries a framework for providing enhanced access to local and remote sets of data, information, and knowledge.
Abstract: The paper describes an extensible model for implementing a user-centered, customizable interface to a library's collection of information resources. This model, called MyLibrary, integrates the principles of librarianship (collection, organization, dissemination, and evaluation) with globally networked computing resources, creating a dynamic, customer-driven front-end to any library's set of materials. The model supports a framework for libraries to provide enhanced access to local and remote sets of data, information, and knowledge. At the same time, the model does not overwhelm its users with too much information because the users control exactly how much information is displayed to them at any given time. The model is active and not passive; direct human interaction, computer mediated guidance and communication technologies, as well as current awareness services all play indispensable roles in this system.

10 citations


Posted Content
TL;DR: This paper considers four factors, 1) word importance, 2) word frequency, 3) word co-occurrence, and 4) word distance, proposes a model to identify subjects for textual documents, and shows that its performance is close to that of human beings.
Abstract: The number of electronic documents on the Internet is growing very quickly. How to effectively identify subjects for documents has become an important issue. Past research has focused on the behavior of nouns in documents. Although subjects are composed of nouns, the constituents that determine which nouns are subjects are not only nouns. Based on the assumption that texts are well-organized and event-driven, nouns and verbs together contribute to the process of subject identification. This paper considers four factors: 1) word importance, 2) word frequency, 3) word co-occurrence, and 4) word distance, and proposes a model to identify subjects for textual documents. The preliminary experiments show that the performance of the proposed model is close to that of human beings.

7 citations


Posted Content
TL;DR: ZBroker is a query routing broker developed for bibliographic database servers that support the Z39.50 protocol; a query routing broker is a software agent that determines, from a large set of accessible information sources, the ones most relevant to a user's information need.
Abstract: A query routing broker is a software agent that determines, from a large set of accessible information sources, the ones most relevant to a user's information need. As the number of information sources on the Internet increases dramatically, future users will have to rely on query routing brokers to select a small number of information sources to query without incurring too much query processing overhead. In this paper, we describe a query routing broker known as ZBroker developed for bibliographic database servers that support the Z39.50 protocol. ZBroker samples the content of each bibliographic database by using training queries and their results, and summarizes the bibliographic database content into a knowledge base. We present the design and implementation of ZBroker and describe its Web-based user interface.

1 citation
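The sample-and-summarize routing described in the abstract above can be sketched as follows. This is a toy illustration under stated assumptions: each database summary is a plain term-frequency table built from sampled records, and databases are ranked by the summed relative frequency of the query terms. ZBroker's actual knowledge base and ranking are not given in the abstract, no Z39.50 calls are made, and the names `summarize` and `route` are hypothetical.

```python
from collections import Counter


def summarize(sampled_records):
    """Summarize a database as term counts over records returned by training queries."""
    counts = Counter()
    for record in sampled_records:
        counts.update(record.lower().split())
    return counts


def route(query, summaries, top_k=2):
    """Return the names of the top_k databases whose summaries best match the query."""
    terms = query.lower().split()
    scores = {}
    for name, counts in summaries.items():
        total = sum(counts.values()) or 1  # guard against an empty summary
        scores[name] = sum(counts[term] / total for term in terms)
    # Sort by descending score, breaking ties by name for a stable order.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:top_k]
```

Only the databases returned by `route` would then be queried, which is how a broker keeps query processing overhead low.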


Posted Content
TL;DR: The authors developed a presentation server that serves as an intermediary between retrieval servers and clients equipped with a visualization interface, together with their own visual interface through which users can view a set of documents from different perspectives through layers of document maps.
Abstract: In any search-based digital library (DL) system dealing with a non-trivial number of documents, users are often required to go through a long list of short document descriptions in order to identify what they are looking for. To tackle the problem, a variety of document organization algorithms and/or visualization techniques have been used to guide users in selecting relevant documents. Since these techniques require heavy computations, however, we developed a presentation server designed to serve as an intermediary between retrieval servers and clients equipped with a visualization interface. In addition, we designed our own visual interface by which users can view a set of documents from different perspectives through layers of document maps. We finally ran experiments to show that the visual interface, in conjunction with the presentation server, indeed helps users in selecting relevant documents from the retrieval results.

Posted Content
TL;DR: The architecture and underpinning platform of the system are described, with particular emphasis placed on the structure and the integration of the distributed database.
Abstract: Trilogy is a collaborative project whose key aim is the development of an integrated virtual laboratory to support research training within each institution and collaborative projects between the partners. In this paper, the architecture and underpinning platform of the system are described, with particular emphasis placed on the structure and the integration of the distributed database. A key element is the ontology that provides the multi-agent system with a conceptualisation specification of the domain; this ontology is explained, accompanied by a discussion of how such a system is integrated and used within the virtual laboratory. Although Telecommunications, and in particular Broadband networks, are used as exemplars in this paper, the underlying system principles are applicable to any domain where a combination of experimental and literature-based resources is required.

Posted Content
TL;DR: In this article, the authors introduce the notion of a query mediator as a digital library service responsible for selecting among available search engines, routing queries to those search engines and aggregating results.
Abstract: We describe an architecture and investigate the characteristics of distributed searching in federated digital libraries. We introduce the notion of a query mediator as a digital library service responsible for selecting among available search engines, routing queries to those search engines, and aggregating results. We examine operational data from the NCSTRL distributed digital library that reveals a number of characteristics of distributed resource discovery. These include availability and response time of indexers and the distinction between the query mediator view of these characteristics and the indexer view.
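The three duties of a query mediator named in the abstract above (selecting engines, routing queries, aggregating results) can be sketched as follows. This is a minimal sketch, not the NCSTRL implementation: engines are modelled as plain Python callables, relevance scores are assumed to be comparable across engines, and the `mediate` function and its timeout parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def mediate(query, engines, timeout=2.0):
    """Send a query to every engine in parallel and merge the results.

    engines: dict mapping an engine name to a callable(query) that returns
             a list of (doc_id, score) pairs.
    Returns (ranked_results, failed_engine_names).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(engines)))
    futures = {pool.submit(fn, query): name for name, fn in engines.items()}
    done, not_done = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on engines that are still running

    failed = [futures[f] for f in not_done]  # too slow: treated as unavailable
    merged = {}
    for future in done:
        if future.exception() is not None:
            failed.append(futures[future])  # engine raised an error
            continue
        for doc_id, score in future.result():
            # Keep the best score when several engines return the same document.
            merged[doc_id] = max(score, merged.get(doc_id, score))
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked, sorted(failed)
```

The list of failed engines reflects the mediator's view of availability and response time, which the abstract distinguishes from the indexer's own view.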

Posted Content
Andrew Odlyzko
TL;DR: This paper examines publishers' strategies, how they are likely to evolve, and how they will affect libraries, arguing that the "journal crisis" is more of a library cost crisis than a publisher pricing problem, with internal library costs much higher than the amount spent on purchasing books and journals.
Abstract: The conversion of scholarly journals to digital format is proceeding rapidly, especially for those from large commercial and learned society publishers. This conversion offers the best hope for survival for such publishers. The infamous "journal crisis" is more of a library cost crisis than a publisher pricing problem, with internal library costs much higher than the amount spent on purchasing books and journals. Therefore publishers may be able to retain or even increase their revenues and profits, while at the same time providing a superior service. To do this, they will have to take over many of the functions of libraries, and they can do that only in the digital domain. This paper examines publishers' strategies, how they are likely to evolve, and how they will affect libraries.

Posted Content
TL;DR: The system demonstrates this technology for real scientific data from astronomy, which has required extension of the standard WWW, and also the extension of metadata standards far beyond the Dublin Core.
Abstract: In this paper we describe our efforts to bring scientific data into the digital library. This has required extension of the standard WWW, and also the extension of metadata standards far beyond the Dublin Core. Our system demonstrates this technology for real scientific data from astronomy.

Posted Content
TL;DR: This work presents a method for semi-automatic indexing of electronic documents and construction of a multilingual thesaurus, which can be used for query formulation and information retrieval.
Abstract: With the growing significance of digital libraries and the Internet, more and more electronic texts become accessible to a wide and geographically dispersed public. This requires adequate tools to facilitate indexing, storage, and retrieval of documents written in different languages. We present a method for semi-automatic indexing of electronic documents and construction of a multilingual thesaurus, which can be used for query formulation and information retrieval. We use special dictionaries and user interaction in order to resolve ambiguities and find adequate canonical terms in the language and adequate abstract language-independent terms. The abstract thesaurus is updated incrementally by newly indexed documents and is used to search the document base for documents that match the terms in a query.

Posted Content
TL;DR: The Alex Catalogue of Electronic Texts is described, the only Internet-accessible collection of digital documents allowing the user to dynamically create customized, typographically readable documents on demand, and create sets of documents from the collection for review and annotation.
Abstract: This paper describes the Alex Catalogue of Electronic Texts, the only Internet-accessible collection of digital documents allowing the user to 1) dynamically create customized, typographically readable documents on demand, 2) search the content of one or more documents from the collection simultaneously, 3) create sets of documents from the collection for review and annotation, and 4) publish these sets of annotated documents, in turn fostering a sense of community around the Catalogue. More than just a collection of links that will break over time, Alex is an archive of electronic texts providing unprecedented access to its content and features allowing it to meet the needs of a wide variety of users and settings. Furthermore, the process of maintaining the Catalogue is streamlined with tools for automatic acquisition and cataloging, making it possible to sustain the service with a minimum of personnel.