Showing papers in "arXiv: Digital Libraries in 1999"

Posted Content•

Content-Based Book Recommending Using Learning for Text Categorization

Raymond J. Mooney¹, Loriene Roy¹•Institutions (1)

07 Feb 1999-arXiv: Digital Libraries

TL;DR: The authors proposed a content-based book recommendation system that utilizes information extraction and a machine-learning algorithm for text categorization, which has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations.

Abstract: Recommender systems improve access to relevant products and information by making personalized suggestions based on previous examples of a user's likes and dislikes. Most existing recommender systems use social filtering methods that base recommendations on other users' preferences. By contrast, content-based methods use information about an item itself to make suggestions. This approach has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations. We describe a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization. Initial experimental results demonstrate that this approach can produce accurate recommendations.

1,268 citations
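
A minimal sketch of how a content-based recommender might treat the task as text categorization. The abstract does not name the learning algorithm or the features, so the bag-of-words naive Bayes classifier below, the "like"/"dislike" labels, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

LABELS = ("like", "dislike")  # hypothetical rating classes


def train(examples):
    """Count words per class from (description_text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["like"]) | set(word_counts["dislike"])
    return word_counts, doc_counts, vocab


def score(model, text, label):
    """Log-probability of the label given the text, with add-one smoothing.

    Assumes every label has at least one training example.
    """
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    total_words = sum(word_counts[label].values())
    log_p = math.log(doc_counts[label] / total_docs)
    for word in text.lower().split():
        if word in vocab:  # ignore words never seen in training
            log_p += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
    return log_p


def recommend(model, text):
    """Recommend an item when 'like' is the more probable class."""
    return score(model, text, "like") > score(model, text, "dislike")
```

Because the score depends only on the item's own description, an item nobody has rated yet can still be scored, and the highest-weighted words give a simple explanation for a recommendation.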

Posted Content•

KEA: Practical Automatic Keyphrase Extraction

Ian H. Witten¹, Gordon W. Paynter¹, Eibe Frank¹, Carl Gutwin², Craig G. Nevill-Manning³•Institutions (3)

University of Waikato¹, University of Saskatchewan², Rutgers University³

05 Feb 1999-arXiv: Digital Libraries

TL;DR: This paper uses a large test corpus to evaluate Kea’s effectiveness in terms of how many author-assigned keyphrases are correctly identified, and describes the system, which is simple, robust, and publicly available.

Abstract: Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea's effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.

898 citations
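
The first two steps in the abstract (lexical candidate identification, then per-candidate feature values) can be sketched roughly as follows. The abstract does not list the features or the lexical rules, so the stopword filter, the TF×IDF and first-occurrence features, and every name below are illustrative assumptions; the learning step that consumes these features is omitted.

```python
import math
import re

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "from"}


def candidates(text, max_len=3):
    """Map each candidate phrase to [count, index of first occurrence].

    Candidates are word n-grams (up to max_len words) that neither
    start nor end with a stopword.
    """
    words = re.findall(r"[a-z]+", text.lower())
    stats = {}
    for i in range(len(words)):
        for n in range(1, min(max_len, len(words) - i) + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            entry = stats.setdefault(" ".join(gram), [0, i])
            entry[0] += 1
    return stats, len(words)


def features(text, doc_freq, n_docs):
    """Return {phrase: (tf_idf, relative_first_position)}.

    doc_freq maps a phrase to the number of training documents that
    contain it; n_docs is the size of that training collection.
    """
    stats, n_words = candidates(text)
    feats = {}
    for phrase, (count, first) in stats.items():
        tf = count / n_words
        idf = math.log((n_docs + 1) / (doc_freq.get(phrase, 0) + 1))
        feats[phrase] = (tf * idf, first / n_words)
    return feats
```

A classifier trained on documents with known keyphrases would then take these feature tuples as input and rank the candidates of a new document.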

Posted Content•

Multimodal Surrogates for Video Browsing

Wei Ding¹, Gary Marchionini, Dagobert Soergel¹•Institutions (1)

University of Maryland, College Park¹

09 Feb 1999-arXiv: Digital Libraries

TL;DR: Three types of video surrogates, visual (keyframes), verbal (keywords/phrases), and a combination of the two, were designed and studied in a qualitative investigation of user cognitive processes to inform the interface design and video representation for video retrieval and browsing.

Abstract: Three types of video surrogates - visual (keyframes), verbal (keywords/phrases), and a combination of the two - were designed and studied in a qualitative investigation of user cognitive processes. The results favor the combined surrogates, in which verbal information and images reinforce each other, lead to better comprehension, and may actually require less processing time. The results also highlight the image features users found most helpful. These findings will inform the interface design and video representation for video retrieval and browsing.

57 citations

Posted Content•

Representing Scholarly Claims in Internet Digital Libraries: A Knowledge Modelling Approach

Simon Buckingham Shum¹, Enrico Motta¹, John Domingue¹•Institutions (1)

Open University¹

19 Aug 1999-arXiv: Digital Libraries

TL;DR: In this paper, the authors describe a knowledge-based Web environment to support the emergence of such a community-constructed semantic hypertext, and the services it could provide to assist the interpretation of an idea or document in the context of its literature.

Abstract: This paper is concerned with tracking and interpreting scholarly documents in distributed research communities. We argue that current approaches to document description, and current technological infrastructures, particularly over the World Wide Web, provide poor support for these tasks. We describe the design of a digital library server which will enable authors to submit a summary of the contributions they claim their document makes, and its relations to the literature. We describe a knowledge-based Web environment to support the emergence of such a community-constructed semantic hypertext, and the services it could provide to assist the interpretation of an idea or document in the context of its literature. The discussion considers in detail how the approach addresses usability issues associated with knowledge structuring environments.

25 citations

Posted Content•

Use and Usability in a Digital Library Search System

Lucy T. Nowell¹, Edward A. Fox¹, Rani A. Saad¹, Jianxin Zhao²•Institutions (2)

Virginia Tech¹, Pacific Northwest National Laboratory²

08 Feb 1999-arXiv: Digital Libraries

TL;DR: This paper details the evolution of one important digital library component as it has grown in functionality and usefulness over several years of use by a live, unrestricted community.

Abstract: Digital libraries must reach out to users from all walks of life, serving information needs at all levels. To do this, they must attain high standards of usability over an extremely broad audience. This paper details the evolution of one important digital library component as it has grown in functionality and usefulness over several years of use by a live, unrestricted community. Central to its evolution have been user studies, analysis of use patterns, and formative usability evaluation. We extrapolate that all three components are necessary in the production of successful digital library systems.

19 citations

Posted Content•

Multimedia Description Framework (MDF) for Content Description of Audio/Video Documents

Michael J. Hu¹, Ye Jian¹•Institutions (1)

Nanyang Technological University¹

09 Feb 1999-arXiv: Digital Libraries

TL;DR: This paper proposes the Multimedia Description Framework (MDF), which is designed to accommodate multiple description (meta-data) schemes, both MPEG-7 and non-MPEG-7, in an integrated architecture, and uses examples to show how an MDF description draws on the combined strengths of different description schemes to enhance its expressive power and flexibility.

Abstract: MPEG is undertaking a new initiative to standardize content description of audio and video data/documents. When it is finalized in 2001, MPEG-7 is expected to provide standardized description schemes for concise and unambiguous content description of data/documents of complex media types. Meanwhile, other meta-data or description schemes, such as Dublin Core, XML, etc., are becoming popular in different application domains. In this paper, we propose the Multimedia Description Framework (MDF), which is designed to accommodate multiple description (meta-data) schemes, both MPEG-7 and non-MPEG-7, in an integrated architecture. We use examples to show how an MDF description draws on the combined strengths of different description schemes to enhance its expressive power and flexibility. We conclude the paper with a discussion of using the MDF description of a movie video to search for and retrieve required scene clips from the movie, on the MDF prototype system we have implemented.

17 citations

Posted Content•

Quality of OCR for Degraded Text Images

Roger T. Hartley¹, Kathleen Marie Crumpton¹•Institutions (1)

New Mexico State University¹

05 Feb 1999-arXiv: Digital Libraries

TL;DR: In this article, OCR of degraded text images was investigated using a standard OCR engine (Adobe Capture) on documents selected from the archive at Los Alamos National Laboratory, and a simple noise model was shown to give good prediction of the number of OCR errors.

Abstract: Commercial OCR packages work best with high-quality scanned images. They often produce poor results when the image is degraded, either because the original itself was poor quality, or because of excessive photocopying. The ability to predict the word failure rate of OCR from a statistical analysis of the image can help in making decisions in the trade-off between the success rate of OCR and the cost of human correction of errors. This paper describes an investigation of OCR of degraded text images using a standard OCR engine (Adobe Capture). The documents were selected from those in the archive at Los Alamos National Laboratory. By introducing noise in a controlled manner into perfect documents, we show how the quality of OCR can be predicted from the nature of the noise. The preliminary results show that a simple noise model can give good prediction of the number of OCR errors.

11 citations
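
One way to picture "introducing noise in a controlled manner into perfect documents" is sketched below. The abstract does not specify the noise model, so the independent pixel-flip model, the list-of-rows image representation, and both function names are assumptions for illustration; the OCR step itself is not shown.

```python
import random


def add_noise(image, flip_prob, seed=0):
    """Flip each pixel of a binary image with probability flip_prob.

    image is a list of rows, each a list of 0/1 pixel values. A fixed
    seed keeps the degradation reproducible across experiments.
    """
    rng = random.Random(seed)
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in image
    ]


def flipped_fraction(clean, noisy):
    """Fraction of pixels that differ between the clean and noisy images."""
    total = sum(len(row) for row in clean)
    changed = sum(
        a != b
        for clean_row, noisy_row in zip(clean, noisy)
        for a, b in zip(clean_row, noisy_row)
    )
    return changed / total
```

Sweeping `flip_prob` over a range of values and recording the OCR word failure rate at each level would give the data points needed to fit a predictive model of the kind the abstract describes.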

Posted Content•

MyLibrary: A Model for Implementing a User-centered, Customizable Interface to a Library's Collection of Information Resources

Eric Lease Morgan¹•Institutions (1)

North Carolina State University¹

02 Feb 1999-arXiv: Digital Libraries

TL;DR: This model, called MyLibrary, integrates the principles of librarianship with globally networked computing resources, creating a dynamic, customer-driven front-end to any library's set of materials and a framework for libraries to provide enhanced access to local and remote sets of data, information, and knowledge.

Abstract: The paper describes an extensible model for implementing a user-centered, customizable interface to a library's collection of information resources. This model, called MyLibrary, integrates the principles of librarianship (collection, organization, dissemination, and evaluation) with globally networked computing resources, creating a dynamic, customer-driven front-end to any library's set of materials. The model supports a framework for libraries to provide enhanced access to local and remote sets of data, information, and knowledge. At the same time, the model does not overwhelm its users with too much information, because the users control exactly how much information is displayed to them at any given time. The model is active and not passive; direct human interaction, computer-mediated guidance and communication technologies, as well as current awareness services all play indispensable roles in this system.

10 citations

Posted Content•

Automatic Identification of Subjects for Textual Documents in Digital Libraries

Kuang-hua Chen

01 Feb 1999-arXiv: Digital Libraries

TL;DR: This paper considers four factors: 1) word importance, 2) word frequency, 3) word co-occurrence, and 4) word distance and proposes a model to identify subjects for textual documents and shows that the performance is close to that of human beings.

Abstract: The number of electronic documents on the Internet is growing very quickly. How to identify subjects for documents effectively has become an important issue. In the past, research has focused on the behavior of nouns in documents. Although subjects are composed of nouns, the constituents that determine which nouns are subjects are not only nouns. Based on the assumption that texts are well-organized and event-driven, nouns and verbs together contribute to the process of subject identification. This paper considers four factors: 1) word importance, 2) word frequency, 3) word co-occurrence, and 4) word distance, and proposes a model to identify subjects for textual documents. The preliminary experiments show that the performance of the proposed model is close to that of human beings.

7 citations

Posted Content•

ZBroker: A Query Routing Broker for Z39.50 Databases

Yong Lin¹, Jian Xu¹, Ee-Peng Lim¹, Wee Keong Ng¹•Institutions (1)

Nanyang Technological University¹

09 Feb 1999-arXiv: Digital Libraries

TL;DR: ZBroker is a query routing broker developed for bibliographic database servers that support the Z39.50 protocol; a query routing broker is a software agent that determines, from a large set of accessible information sources, the ones most relevant to a user's information need.

Abstract: A query routing broker is a software agent that determines, from a large set of accessible information sources, the ones most relevant to a user's information need. As the number of information sources on the Internet increases dramatically, future users will have to rely on query routing brokers to decide on a small number of information sources to query without incurring too much query processing overhead. In this paper, we describe a query routing broker known as ZBroker, developed for bibliographic database servers that support the Z39.50 protocol. ZBroker samples the content of each bibliographic database by using training queries and their results, and summarizes the bibliographic database content into a knowledge base. We present the design and implementation of ZBroker and describe its Web-based user interface.

1 citation
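
The sample-summarize-route idea in the abstract can be sketched as below. This is not ZBroker's actual design: the term-frequency summary, the length-normalized scoring, and the names `summarize` and `rank_databases` are assumptions chosen for brevity, and no Z39.50 client calls are shown (the sampled records are passed in as plain strings).

```python
from collections import Counter


def summarize(sampled_records):
    """Build a term-frequency summary of one database.

    sampled_records are the texts of records returned by training
    queries sent to that database.
    """
    summary = Counter()
    for record in sampled_records:
        summary.update(record.lower().split())
    return summary


def rank_databases(knowledge_base, query, top_k=2):
    """Return the names of the top_k databases most relevant to the query.

    knowledge_base maps a database name to its summary. The score is
    the share of the summary's term occurrences that match query terms.
    """
    terms = query.lower().split()
    scores = {}
    for name, summary in knowledge_base.items():
        total = sum(summary.values()) or 1  # guard against empty summaries
        scores[name] = sum(summary[term] for term in terms) / total
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Only the databases returned by `rank_databases` would then receive the user's query, which is how a broker keeps query processing overhead low.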

Posted Content•

Visualization of Retrieved Documents using a Presentation Server

Sa-Kwang Song, Sung-Hyon Myaeng

10 Feb 1999-arXiv: Digital Libraries

TL;DR: A presentation server is developed to serve as an intermediary between retrieval servers and clients equipped with a visualization interface, along with a visual interface by which users can view a set of documents from different perspectives through layers of document maps.

Abstract: In any search-based digital library (DL) system dealing with a non-trivial number of documents, users are often required to go through a long list of short document descriptions in order to identify what they are looking for. To tackle the problem, a variety of document organization algorithms and/or visualization techniques have been used to guide users in selecting relevant documents. Since these techniques require heavy computations, however, we developed a presentation server designed to serve as an intermediary between retrieval servers and clients equipped with a visualization interface. In addition, we designed our own visual interface by which users can view a set of documents from different perspectives through layers of document maps. Finally, we ran experiments to show that the visual interface, in conjunction with the presentation server, indeed helps users in selecting relevant documents from the retrieval results.

Posted Content•

Resource Discovery in Trilogy

Franck Chevalier, David Harle, D. Geoffrey Smith

08 Feb 1999-arXiv: Digital Libraries

TL;DR: The architecture and underpinning platform of the system are described, with particular emphasis placed on the structure and the integration of the distributed database.

Abstract: Trilogy is a collaborative project whose key aim is the development of an integrated virtual laboratory to support research training within each institution and collaborative projects between the partners. In this paper, the architecture and underpinning platform of the system are described, with particular emphasis placed on the structure and the integration of the distributed database. A key element is the ontology that provides the multi-agent system with a conceptualisation specification of the domain; this ontology is explained, accompanied by a discussion of how such a system is integrated and used within the virtual laboratory. Although Telecommunications, and in particular Broadband networks, are used as exemplars in this paper, the underlying system principles are applicable to any domain where a combination of experimental and literature-based resources is required.

Posted Content•

Using Query Mediators for Distributed Searching in Federated Digital Libraries

Naomi Dushay¹, James C. French², Carl Lagoze¹•Institutions (2)

Cornell University¹, University of Virginia²

09 Feb 1999-arXiv: Digital Libraries

TL;DR: In this article, the authors introduce the notion of a query mediator as a digital library service responsible for selecting among available search engines, routing queries to those search engines and aggregating results.

Abstract: We describe an architecture and investigate the characteristics of distributed searching in federated digital libraries. We introduce the notion of a query mediator as a digital library service responsible for selecting among available search engines, routing queries to those search engines, and aggregating results. We examine operational data from the NCSTRL distributed digital library that reveals a number of characteristics of distributed resource discovery. These include availability and response time of indexers and the distinction between the query mediator view of these characteristics and the indexer view.
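
The three responsibilities of a query mediator named in the abstract (selecting engines, routing the query, aggregating results) can be sketched as follows. The interface is hypothetical: the `mediate` function, the `select` callback, the engines-as-callables convention, and the max-score merge rule are all assumptions, not the NCSTRL implementation.

```python
def mediate(query, engines, select):
    """Select engines, route the query to them, and aggregate results.

    engines maps an engine name to a callable that takes the query and
    returns a list of (doc_id, score) pairs. select is a callable that
    takes (query, engines) and returns the engine names to use.
    Returns (ranked_results, unavailable_engine_names).
    """
    merged = {}
    unavailable = []
    for name in select(query, engines):
        try:
            results = engines[name](query)
        except Exception:
            # An engine that fails to answer is recorded, not fatal.
            unavailable.append(name)
            continue
        for doc_id, score in results:
            # Keep the best score when several engines return one document.
            merged[doc_id] = max(score, merged.get(doc_id, float("-inf")))
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked, unavailable
```

Tracking unavailable engines separately reflects the abstract's point that indexer availability looks different from the mediator's side than from the indexer's own side.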

Posted Content•

Competition and cooperation: Libraries and publishers in the transition to electronic scholarly journals

Andrew Odlyzko¹•Institutions (1)

AT&T¹

20 Jan 1999-arXiv: Digital Libraries

TL;DR: The author examines publishers' strategies, how they are likely to evolve, and how they will affect libraries, and argues that the "journal crisis" is more of a library cost crisis than a publisher pricing problem, with internal library costs much higher than the amount spent on purchasing books and journals.

Abstract: The conversion of scholarly journals to digital format is proceeding rapidly, especially for those from large commercial and learned society publishers. This conversion offers the best hope for survival for such publishers. The infamous "journal crisis" is more of a library cost crisis than a publisher pricing problem, with internal library costs much higher than the amount spent on purchasing books and journals. Therefore publishers may be able to retain or even increase their revenues and profits, while at the same time providing a superior service. To do this, they will have to take over many of the functions of libraries, and they can do that only in the digital domain. This paper examines publishers' strategies, how they are likely to evolve, and how they will affect libraries.

Posted Content•

Digital Library Technology for Locating and Accessing Scientific Data

Robert E. McGrath¹, Joe Futrelle¹, Ray Plante¹, Damien Guillaume•Institutions (1)

National Center for Supercomputing Applications¹

07 Feb 1999-arXiv: Digital Libraries

TL;DR: Bringing scientific data into the digital library has required extension of the standard WWW and of metadata standards far beyond the Dublin Core; the system demonstrates this technology for real scientific data from astronomy.

Abstract: In this paper we describe our efforts to bring scientific data into the digital library. This has required extension of the standard WWW, and also the extension of metadata standards far beyond the Dublin Core. Our system demonstrates this technology for real scientific data from astronomy.

Posted Content•

Semi-Automatic Indexing of Multilingual Documents

Ulrich Schiel¹, Ianna M. S. F. de Sousa¹, Edberto Ferneda•Institutions (1)

Federal University of Paraíba¹

11 Feb 1999-arXiv: Digital Libraries

TL;DR: This work presents a method for semi-automatic indexing of electronic documents and construction of a multilingual thesaurus, which can be used for query formulation and information retrieval.

Abstract: With the growing significance of digital libraries and the Internet, more and more electronic texts become accessible to a wide and geographically dispersed public. This requires adequate tools to facilitate indexing, storage, and retrieval of documents written in different languages. We present a method for semi-automatic indexing of electronic documents and construction of a multilingual thesaurus, which can be used for query formulation and information retrieval. We use special dictionaries and user interaction in order to resolve ambiguities and find adequate canonical terms in the language and adequate abstract language-independent terms. The abstract thesaurus is updated incrementally with newly indexed documents and is used to search the document base for documents concerning the terms in a query.

Posted Content•

The Alex Catalogue, A Collection of Digital Texts with Automatic Methods for Acquisition and Cataloging, User-Defined Typography, Cross-searching of Indexed Content, and a Sense of Community

Eric Lease Morgan

02 Feb 1999-arXiv: Digital Libraries

TL;DR: The Alex Catalogue of Electronic Texts is described, the only Internet-accessible collection of digital documents allowing the user to dynamically create customized, typographically readable documents on demand, and create sets of documents from the collection for review and annotation.

Abstract: This paper describes the Alex Catalogue of Electronic Texts, the only Internet-accessible collection of digital documents allowing the user to 1) dynamically create customized, typographically readable documents on demand, 2) search the content of one or more documents from the collection simultaneously, 3) create sets of documents from the collection for review and annotation, and 4) publish these sets of annotated documents, in turn fostering a sense of community around the Catalogue. More than just a collection of links that will break over time, Alex is an archive of electronic texts providing unprecedented access to its content, with features allowing it to meet the needs of a wide variety of users and settings. Furthermore, the process of maintaining the Catalogue is streamlined with tools for automatic acquisition and cataloging, making it possible to sustain the service with a minimum of personnel.
