
Showing papers presented at "ACM international conference on Digital libraries in 1999"


Proceedings ArticleDOI
01 Aug 1999
TL;DR: Kea as mentioned in this paper identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases.
Abstract: Keyphrases provide semantic metadata that summarize and characterize documents. This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea’s effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.
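As a rough illustration of the pipeline the abstract describes, the sketch below extracts candidate phrases lexically and scores them with TF×IDF and first-occurrence features. The stopword list is truncated for brevity, the additive scoring stands in for Kea's actual Naive Bayes model, and all function names are illustrative assumptions.

```python
import math
import re
from collections import Counter

def candidate_phrases(text, max_len=3):
    """Extract candidate keyphrases: contiguous word n-grams (1..max_len)
    that neither start nor end with a stopword. Returns (phrase, position)
    pairs, with position normalized to [0, 1)."""
    stop = {"the", "a", "an", "of", "in", "and", "to", "for", "is", "we"}
    words = re.findall(r"[a-z]+", text.lower())
    cands = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] not in stop and gram[-1] not in stop:
                cands.append((" ".join(gram), i / max(1, len(words))))
    return cands

def score_candidates(doc, corpus):
    """Score each candidate by TF x IDF and first-occurrence position,
    the two features Kea computes (here combined additively rather than
    with Kea's trained Naive Bayes model)."""
    n_docs = len(corpus)
    df = Counter()
    for d in corpus:
        df.update({p for p, _ in candidate_phrases(d)})
    cands = candidate_phrases(doc)
    tf = Counter(p for p, _ in cands)
    first = {}
    for p, pos in cands:
        first.setdefault(p, pos)  # earliest occurrence wins
    scores = {}
    for p in tf:
        idf = math.log((n_docs + 1) / (df[p] + 1))
        scores[p] = tf[p] * idf * (1.0 - first[p])  # early phrases score higher
    return sorted(scores, key=scores.get, reverse=True)
```

In Kea proper, the feature values would instead be fed to a Naive Bayes classifier trained on documents with known author-assigned keyphrases.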

912 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: A system for searching and classifying U.S. patent documents, based on Inquery, which includes a unique “phrase help” facility, which helps users find and add phrases and terms related to those in their query.
Abstract: We present a system for searching and classifying U.S. patent documents, based on Inquery. Patents are distributed through hundreds of collections, divided up by general area. The system selects the best collections for the query. Users can search for patents or classify patent text. The user interface helps users search in fields without requiring the knowledge of Inquery query operators. The system includes a unique “phrase help” facility, which helps users find and add phrases and terms related to those in their query.

229 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: In this article, the authors describe a prototype implementation of Intermemory, including an overall system architecture and implementations of key system components; the result is a working Intermemory that tolerates up to 17 simultaneous node failures and includes a Web gateway for browser-based access to data.
Abstract: An Archival Intermemory solves the problem of highly survivable digital data storage in the spirit of the Internet. In this paper we describe a prototype implementation of Intermemory, including an overall system architecture and implementations of key system components. The result is a working Intermemory that tolerates up to 17 simultaneous node failures, and includes a Web gateway for browser-based access to data. Our work demonstrates the basic feasibility of Intermemory and represents significant progress towards a deployable system.

135 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: An image retrieval system based on a Zoomable User Interface (ZUI) and a controlled experiment on the browsing aspects of the system, which resulted in a statistically significant difference in the interaction between number of images and style of browser.
Abstract: We describe an image retrieval system we built based on a Zoomable User Interface (ZUI). We also discuss the design, results and analysis of a controlled experiment we performed on the browsing aspects of the system. The experiment resulted in a statistically significant difference in the interaction between number of images (25, 75, 225) and style of browser (2D, ZUI, 3D). The 2D and ZUI browser systems performed equally, and both performed better than the 3D systems. The image browsers tested during the experiment include Cerious Software’s Thumbs Plus, TriVista Technology’s Simple LandScape and Photo GoRound, and our Zoomable Image Browser based on Pad++.

128 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: CiteSeer as mentioned in this paper is a system for automatic tracking of scientific literature that is relevant to a user's research interests, which is made possible through the use of a heterogeneous profile to represent user interests.
Abstract: We introduce a system as part of the CiteSeer digital library project for automatic tracking of scientific literature that is relevant to a user’s research interests. Unlike previous systems that use simple keyword matching, CiteSeer is able to track and recommend topically relevant papers even when keyword based query profiles fail. This is made possible through the use of a heterogeneous profile to represent user interests. These profiles include several representations, including content based relatedness measures. The CiteSeer tracking system is well integrated into the search and browsing facilities of CiteSeer, and provides the user with great flexibility in tuning a profile to better match his or her interests. The software for this system is available, and a sample database is online as a public service.

127 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: A comprehensive suite of tools that gather musical material, convert between many of these representations, allow searching based on combined musical and textual criteria, and help present the results of searching and browsing are described.
Abstract: Digital libraries of music have the potential to capture popular imagination in ways that more scholarly libraries cannot. We are working towards a comprehensive digital library of musical material, including popular music. We have developed new ways of collecting musical material, accessing it through searching and browsing, and presenting the results to the user. We work with different representations of music: facsimile images of scores, the internal representation of a music editing program, page images typeset by a music editor, MIDI files, audio files representing sung user input, and textual metadata such as title, composer and arranger, and lyrics. This paper describes a comprehensive suite of tools that we have built for this project. These tools gather musical material, convert between many of these representations, allow searching based on combined musical and textual criteria, and help present the results of searching and browsing. Although we do not yet have a single full-blown digital music library, we have built several exploratory prototype collections of music, some of them very large (100,000 tunes), and critical components of the system have been evaluated.

105 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: A document viewer is built incorporating a visualization centered around a novel content-displaying scrollbar and color term highlighting, and whether the visualization is helpful to non-expert searchers is studied.
Abstract: We are interested in questions of improving user control in best-match text-retrieval systems, specifically questions as to whether simple visualizations that nonetheless go beyond the minimal ones generally available can significantly help users. Recently, we have been investigating ways to help users decide—given a set of documents retrieved by a query—which documents and passages are worth closer examination. We built a document viewer incorporating a visualization centered around a novel content-displaying scrollbar and color term highlighting, and studied whether the visualization is helpful to non-expert searchers. Participants’ reaction to the visualization was very positive, while the objective results were inconclusive.

90 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: The reading practices of an on-going reading group are described, and how these practices changed when XLibris, a digital library reading appliance that uses a pen tablet computer to provide a paper-like interface, was introduced.
Abstract: How will we read digital library materials? This paper describes the reading practices of an on-going reading group, and how these practices changed when we introduced XLibris, a digital library reading appliance that uses a pen tablet computer to provide a paper-like interface. We interviewed group members about their reading practices, observed their meetings, and analyzed their annotations, both when they read a paper document and when they read using XLibris. We use these data to characterize their analytic reading, reference use, and annotation practices. We also describe the use of the Reader’s Notebook, a list of clippings that XLibris computes from a reader’s annotations. Implications for digital libraries stem from our findings on reading and mobility, the complexity of analytic reading, the social nature of reference following, and the unselfconscious nature of readers’ annotations.

87 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: In this article, the authors describe their efforts to bring scientific data into the digital library, which has required extension of the standard WWW, and also the extension of metadata standards far beyond the Dublin Core.
Abstract: In this paper we describe our efforts to bring scientific data into the digital library. This has required extension of the standard WWW, and also the extension of metadata standards far beyond the Dublin Core. Our system demonstrates this technology for real scientific data from astronomy.

77 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: A method for automatically introducing topic-based links into documents to support browsing in digital libraries and an evaluation shows that keyphrase-based similarity measures work as well as a popular full-text retrieval system for finding relevant destination documents.
Abstract: Many digital libraries are comprised of documents from disparate sources that are independent of the rest of the collection in which they reside. A user’s ability to explore is severely curtailed when each document stands in isolation; there is no way to navigate to other, related, documents, or even to tell if such documents exist. We describe a method for automatically introducing topic-based links into documents to support browsing in digital libraries. Automatic keyphrase extraction is exploited to identify link anchors, and keyphrase-based similarity measures are used to select and rank destinations. Two implementations are described: one that applies these techniques to existing WWW-based digital library collections using standard HTML, and one that uses a wider range of interface techniques to provide more sophisticated linking capabilities. An evaluation shows that keyphrase-based similarity measures work as well as a popular full-text retrieval system for finding relevant destination documents.
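The destination-ranking step the abstract describes can be approximated with a simple keyphrase-overlap measure, sketched below. The Jaccard weighting and the dictionary input shape are illustrative assumptions, not the paper's exact formula.

```python
def keyphrase_similarity(kps_a, kps_b):
    """Similarity between two documents, each represented by its set of
    extracted keyphrases (plain Jaccard overlap; illustrative only)."""
    a, b = set(kps_a), set(kps_b)
    return len(a & b) / len(a | b) if a or b else 0.0

def rank_destinations(anchor_kps, collection):
    """Rank candidate link destinations (a dict of name -> keyphrase list)
    by keyphrase overlap with the anchor document's keyphrases."""
    return sorted(collection,
                  key=lambda name: -keyphrase_similarity(anchor_kps, collection[name]))
```

A real implementation would first run automatic keyphrase extraction (e.g. Kea-style) over every document to produce the keyphrase lists assumed here.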

64 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: The motivations for the creation of the VARIATIONS digital library project, an overview of its operation and implementation, user reactions to the system, and future plans for development are covered.
Abstract: The field of music provides an interesting context for the development of digital library systems due to the variety of information formats used by music students and scholars. The VARIATIONS digital library project at Indiana University currently delivers online access to sound recordings from the collections of IU’s William and Gayle Cook Music Library and is developing access to musical score images and other formats. This paper covers the motivations for the creation of VARIATIONS, an overview of its operation and implementation, user reactions to the system, and future plans for development.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: The feasibility of the Pharos architecture is demonstrated using 2500 Usenet newsgroups as separate collections and it is shown that using Pharos as an intermediate retrieval mechanism provides acceptable accuracy of source selection compared to selecting sources using complete classification information, while maintaining good scalability.
Abstract: Information retrieval over the Internet increasingly requires the filtering of thousands of information sources. As the number and variety of sources increases, new ways of automatically summarizing, discovering, and selecting sources relevant to a user's query are needed. Pharos is a highly scalable distributed architecture for locating heterogeneous information sources. Its design is hierarchical, thus allowing it to scale well as the number of information sources increases. We demonstrate the feasibility of the Pharos architecture using 2500 Usenet newsgroups as separate collections. Each newsgroup is summarized via automated Library of Congress classification. We show that using Pharos as an intermediate retrieval mechanism provides acceptable accuracy of source selection compared to selecting sources using complete classification information, while maintaining good scalability. This implies that hierarchical distributed metadata and automated classification are potentially useful paradigms to address scalability problems in large-scale distributed information retrieval applications.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: In this article, three types of video surrogates, visual (keyframes), verbal (keywords/phrases), and combined visual and verbal, were designed and studied in a qualitative investigation of user cognitive processes.
Abstract: Three types of video surrogates, visual (keyframes), verbal (keywords/phrases), and combined visual and verbal, were designed and studied in a qualitative investigation of user cognitive processes. The results favor the combined surrogates, in which verbal information and images reinforce each other, lead to better comprehension, and may actually require less processing time. The results also highlight the image features users found most helpful. These findings will inform the interface design and video representation for video retrieval and browsing.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: A prototype distributed architecture for a digital library is demonstrated, based on XML-based modeling of metadata; use of an XML query language, and associated mediator middleware, to query distributed metadata sources; and the use of a storage system middleware to access distributed, archived data sets.
Abstract: We demonstrate a prototype distributed architecture for a digital library, using technology being developed under the MIX Project at the San Diego Supercomputer Center (SDSC) and the University of California, San Diego. The architecture is based on XML-based modeling of metadata; use of an XML query language, and associated mediator middleware, to query distributed metadata sources; and the use of a storage system middleware to access distributed, archived data sets.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: The paper presents a technique to automatically detect musical phrases to be used as content descriptors and to conflate musical phrase variants by extracting a common stem; indexing and retrieval are evaluated using the vector-space model.
Abstract: The automatic best-match and content-based retrieval of musical documents against musical queries is addressed in this paper. By "musical documents" we mean scores or performances, while musical queries are supposed to be inserted by final users using a musical interface (GUI or MIDI keyboard). Musical documents lack the separators necessary to detect "lexical units" like text words. Moreover, there are many variants of a musical phrase between different works. The paper presents a technique to automatically detect musical phrases to be used as content descriptors, and to conflate musical phrase variants by extracting a common stem. An experimental study reports on the results of indexing and retrieval tests using the vector-space model. The technique can complement catalogue-based access whenever the user is unable to use fixed values, or wants to find performances or scores "similar" in content to known ones.
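Once phrases have been detected and conflated to stems, the retrieval step is standard vector-space matching. A minimal sketch, in which plain strings stand in for the extracted phrase stems (an assumption; the paper's stemming of melodic variants is not reproduced here):

```python
import math
from collections import Counter

def cosine(q, d):
    """Cosine similarity between two sparse term-frequency vectors
    (Counters mapping phrase stem -> count)."""
    num = sum(q[t] * d[t] for t in set(q) & set(d))
    den = (math.sqrt(sum(v * v for v in q.values()))
           * math.sqrt(sum(v * v for v in d.values())))
    return num / den if den else 0.0

def retrieve(query_phrases, docs):
    """Rank musical documents (name -> list of phrase stems) against a
    query, vector-space style. Phrase detection/conflation is assumed
    to have happened upstream."""
    qv = Counter(query_phrases)
    ranked = sorted(docs.items(), key=lambda kv: -cosine(qv, Counter(kv[1])))
    return [name for name, _ in ranked]
```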

Proceedings ArticleDOI
01 Aug 1999
TL;DR: In this article, the Multimedia Description Framework (MDF) is proposed to accommodate multiple description (meta-data) schemes, both MPEG-7 and non-MPEG-7, in an integrated architecture.
Abstract: MPEG is undertaking a new initiative to standardize content description of audio and video data/documents. When it is finalized in 2001, MPEG-7 is expected to provide standardized description schemes for concise and unambiguous content description of data/documents of complex media types. Meanwhile, other meta-data or description schemes, such as Dublin Core, XML, etc., are becoming popular in different application domains. In this paper, we propose the Multimedia Description Framework (MDF), which is designed to accommodate multiple description (meta-data) schemes, both MPEG-7 and non-MPEG-7, in an integrated architecture. We will use examples to show how MDF description makes use of the combined strengths of different description schemes to enhance its expression power and flexibility. We conclude the paper with a discussion of using MDF description of a movie video to search/retrieve required scene clips from the movie, on the MDF prototype system we have implemented.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: Findings from my research on people's encounters with DLs in two different arenas: academia and low-income neighborhoods are integrated in this paper.
Abstract: KEYWORDS: User studies, electronic journals, community networks, scientific and technical information, low-income neighborhoods. 1. INTRODUCTION A new federal initiative called Information Technology for the Twenty-First Century (IT2) recognizes the need to bridge research across domains in order to bring computing benefits to society at large [40]. The program will support research leading to improvements in "how we live" as well as how we work, learn, and conduct research. Technology applications targeted for funding include, for example, those geared to helping people with disabilities lead more independent lives, as well as those devoted to healthcare, electronic commerce, and accelerating the pace of scientific and technical discoveries. The new initiative also specifically calls for research on the social implications of the Information Revolution. Certainly one issue that must be dealt with if the benefits of computing are to accrue to all segments of society is the "digital divide" that currently bifurcates information technology use across socioeconomic fault lines [6, 21]. One implication for digital library (DL) research is that we should start looking at initiatives that span the spectrum from basic computer science to the implementation of working systems and consider links among findings on information system use from a variety of arenas in life.
In addition to studying DL use in academic and corporate settings, we should include research into how DLs transform home life, the workings of local communities, and the activities of small community organizations, all of which are important in determining how we stitch together our days. At present, competing visions and agendas in both research and practice are hindering progress toward synthesizing and communicating DL research results [12]. Nonetheless, in looking at conceptions of DLs offered over the past several years--most of which define DLs as some combination of a collection, technology, and services--it is apparent that a range of system genres and intended audiences are candidates for inclusion. These include the web at large, online educational archives, digitized document collections mounted by libraries, virtual museums, and computer-based community information systems. In order to gain a fuller understanding of the social implications of DL use, DL research should continue to explore how use is situated in social practices. In this paper, I integrate findings from my research on people's encounters with DLs in two different arenas: academia and low-income neighborhoods. The point is to see how concepts and conclusions related to use do,

Proceedings ArticleDOI
01 Aug 1999
TL;DR: The SOMLib project creates a digital library system that uses a neural network-based core for library representation and query processing, and uses the self-organizing map, a popular unsupervised neural network model, to automatically structure a document collection.
Abstract: Digital Libraries have gained tremendous interest with numerous research projects addressing the wealth of challenges in this field. While computational intelligence systems are being used for specific tasks in this arena, the majority of projects relies on conventional techniques for the basic structure of the library itself. With the SOMLib project we create a digital library system that uses a neural network-based core for library representation and query processing. The self-organizing map, a popular unsupervised neural network model, is used to automatically structure a document collection. Based on this core, additional modules integrate distributed libraries and create an intuitive representation of the library, automatically labeling the various topical sections in the document collection.
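The core of the SOMLib idea, training a self-organizing map over document vectors, can be sketched as follows. The grid size, learning-rate schedule, and neighborhood function are illustrative choices rather than the project's actual parameters, and real document vectors would come from term weighting, not the toy inputs shown in the usage note.

```python
import random

def train_som(docs, grid=(4, 4), epochs=50, lr=0.5, seed=0):
    """Train a tiny self-organizing map over document vectors (lists of
    floats). Returns the grid of unit weight vectors."""
    rng = random.Random(seed)
    rows, cols = grid
    dim = len(docs[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        # neighborhood radius and learning rate both decay over time
        radius = max(1.0, (rows + cols) / 2 * (1 - epoch / epochs))
        h = lr * (1 - epoch / epochs)
        for v in docs:
            br, bc = bmu(weights, v)  # best-matching unit for this document
            for r in range(rows):
                for c in range(cols):
                    if ((r - br) ** 2 + (c - bc) ** 2) ** 0.5 <= radius:
                        for k in range(dim):
                            weights[r][c][k] += h * (v[k] - weights[r][c][k])
    return weights

def bmu(weights, v):
    """Map a document vector to its best-matching grid unit (row, col)."""
    dim = len(v)
    return min(((r, c) for r in range(len(weights))
                for c in range(len(weights[0]))),
               key=lambda rc: sum((weights[rc[0]][rc[1]][k] - v[k]) ** 2
                                  for k in range(dim)))
```

After training, documents whose vectors map to the same or nearby units form the "topical sections" that SOMLib would then label automatically.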

Proceedings ArticleDOI
01 Aug 1999
TL;DR: In this article, the authors focus on profile acquisition by a filtering system that provides general health information and consider how different approaches to acquiring profiles can influence filtering effectiveness, a question few studies have addressed.
Abstract: INTRODUCTION To make digital libraries attractive and encourage use, new and value-added services are needed beyond conventional distribution and access mechanisms. An exciting area of development is information personalization services that route, recommend, sort and prune documents (henceforth collectively called filtering) based on users’ interest profiles. Significant advances have been made in filtering systems. However, few studies have considered how different approaches of acquiring profiles can influence filtering effectiveness. Profiles are at the center of our research and one of the issues we are focusing on is profile acquisition by a filtering system that provides general health information.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: This paper proposes a notion of visual keywords for similarity matching between visual contents, and describes an evaluation experiment that classifies professional nature scenery photographs to demonstrate the effectiveness and efficiency of visual keywords for automatic categorization of images in digital libraries.
Abstract: Automatic categorization of multimedia documents is an important function for a digital library system. While text categorization has received much attention from IR researchers, classification of visual data is in its infancy. In this paper, we propose a notion of visual keywords for similarity matching between visual contents. Visual keywords can be constructed automatically from samples of visual data through supervised/unsupervised learning. Given a visual content, the occurrences of visual keywords are detected, summarized spatially, and coded via singular value decomposition to arrive at a concise coded description. The methods to create, detect, summarize, select, and code visual keywords will be detailed. Last but not least, we describe an evaluation experiment that classifies professional nature scenery photographs to demonstrate the effectiveness and efficiency of visual keywords for automatic categorization of images in digital libraries.
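The "detect, then summarize spatially" step can be sketched as below. The tiling, the input shape, and the function name are assumptions for illustration, and the paper's final SVD coding step is omitted; the returned vector is what would be fed to the SVD.

```python
from collections import Counter

def spatial_summary(detections, grid=(2, 2)):
    """Summarize visual-keyword detections spatially: the image is split
    into grid tiles and keyword occurrences are counted per tile. The
    concatenated per-tile counts form the (pre-SVD) description vector.
    `detections` is a list of (keyword, x, y) with x, y in [0, 1)."""
    rows, cols = grid
    vocab = sorted({kw for kw, _, _ in detections})
    tiles = [Counter() for _ in range(rows * cols)]
    for kw, x, y in detections:
        r = min(rows - 1, int(y * rows))
        c = min(cols - 1, int(x * cols))
        tiles[r * cols + c][kw] += 1
    # one count per (tile, vocabulary keyword), in a fixed order
    return [tile[kw] for tile in tiles for kw in vocab]
```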

Proceedings ArticleDOI
01 Aug 1999
TL;DR: The authors developed “scalable semantics” technologies able to compute semantic indexes for an entire subject discipline, demonstrating the feasibility of scalable semantics techniques for large collections.
Abstract: As part of the Illinois Digital Library Initiative (DLI) project we developed “scalable semantics” technologies. These statistical techniques enabled us to index large collections for deeper search than word matching. Through the auspices of the DARPA Information Management program, we are developing an integrated analysis environment, the Interspace Prototype, that uses “semantic indexing” as the foundation for supporting concept navigation. These semantic indexes record the contextual correlation of noun phrases, and are computed generically, independent of subject domain. Using this technology, we were able to compute semantic indexes for a subject discipline. In particular, in the summer of 1998, we computed concept spaces for 9.3M MEDLINE bibliographic records from the National Library of Medicine (NLM) which extensively covered the biomedical literature for the period from 1966 to 1997. In this experiment, we first partitioned the collection into smaller collections (repositories) by subject, extracted noun phrases from titles and abstracts, then performed semantic indexing on these subcollections by creating a concept space for each repository. The computation required 2 days on a 128-node SGI/CRAY Origin 2000 at the National Center for Supercomputer Applications (NCSA). This experiment demonstrated the feasibility of scalable semantics techniques for large collections. With the rapid increase in computing power, we believe this indexing technology will shortly be feasible on personal computers.
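The "contextual correlation of noun phrases" underlying a concept space can be sketched as a document-level co-occurrence computation. The inputs here are pre-extracted noun-phrase lists, and the raw shared-document count stands in for the project's actual weighting scheme; both are assumptions for illustration.

```python
from collections import defaultdict

def concept_space(docs):
    """Build a co-occurrence 'concept space': for each noun phrase, the
    other phrases it co-occurs with, weighted by the number of documents
    they share. `docs` is a list of noun-phrase lists."""
    co = defaultdict(lambda: defaultdict(int))
    for phrases in docs:
        uniq = sorted(set(phrases))
        for i, a in enumerate(uniq):
            for b in uniq[i + 1:]:
                co[a][b] += 1  # symmetric co-occurrence counts
                co[b][a] += 1
    return co

def related(co, phrase, k=5):
    """Top-k contextually correlated phrases for a given phrase, for
    concept-navigation-style lookups."""
    nbrs = co.get(phrase, {})
    return sorted(nbrs, key=nbrs.get, reverse=True)[:k]
```

At the MEDLINE scale described in the abstract, the collection would first be partitioned by subject, with one such concept space computed per repository.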

Proceedings ArticleDOI
01 Aug 1999
TL;DR: Two visualization tools, fisheye view and fractal view, are presented; they assist users in visualizing a large-scale self-organizing map geographically and semantically.
Abstract: Various statistical and pattern recognition techniques, such as concept spaces and category maps in the Illinois Digital Library project, have been explored to solve the semantic interoperability problem in DLI-1. The self-organizing category map has been identified as a powerful tool for information summarization. However, visualizing a large-scale self-organizing map in a window of restricted size is difficult. For smaller regions, displaying labels is infeasible. In this paper, two visualization tools, fisheye view and fractal view, are presented. They assist users in visualizing a large-scale self-organizing map geographically and semantically.
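A fisheye view distorts coordinates so that detail near a focus point is magnified while distant regions are compressed. A one-dimensional sketch in the style of the classic graphical fisheye magnification curve; the distortion factor d and the exact function are illustrative assumptions, not the paper's implementation:

```python
def fisheye(x, focus, d=4.0):
    """Fisheye transform of coordinate x in [0, 1] around a focus point
    in [0, 1]: positions near the focus spread out, distant ones compress.
    d is the distortion factor (d=0 gives the identity)."""
    dx = x - focus
    span = (1.0 - focus) if dx >= 0 else focus  # distance to the boundary
    if span == 0:
        return x
    nd = abs(dx) / span                 # normalized distance in [0, 1]
    mag = (d + 1) * nd / (d * nd + 1)   # magnification curve
    return focus + (1 if dx >= 0 else -1) * mag * span
```

The endpoints and the focus stay fixed, so the whole map remains visible in the window while the region under inspection is enlarged.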

Proceedings ArticleDOI
01 Aug 1999
TL;DR: In the present work, the shortcomings of current recommendation systems for distributed information systems are discussed, and it is proposed how TalkMine can address these shortcomings.
Abstract: TalkMine is an adaptive recommendation system which is both content-based and collaborative, and further allows the crossover of information among multiple databases searched by users. In this way, different databases learn new keywords and adapt existing ones to the categories recognized by their communities of users. TalkMine is based on several theories of uncertainty, as well as on biologically inspired adaptationist ideas. This system is currently being implemented for the research library of the Los Alamos National Laboratory under the Adaptive Recommendation Project. In the present work we discuss the shortcomings of current recommendation systems for distributed information systems and propose how TalkMine can address these shortcomings.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: The Computing Research Repository is described, a new electronic archive for rapid dissemination and archiving of computer science research results that combines the open and extensible architecture of NCSTRL with the reliable access and well-established management practices of the LANL XXX e-Print repository.
Abstract: We describe the Computing Research Repository (CoRR), a new electronic archive for rapid dissemination and archiving of computer science research results. CoRR was initiated in September 1998 through the cooperation of ACM, the LANL (Los Alamos National Laboratory) e-Print archive, and NCSTRL (Networked Computer Science Technical Reference Library). Through its implementation of the Dienst protocol, CoRR combines the open and extensible architecture of NCSTRL with the reliable access and well-established management practices of the LANL XXX e-Print repository. This architecture will allow integration with other e-Print archives and provides a foundation for a future broad-based scholarly digital library. We describe the decisions that were made in creating CoRR, the architecture of the CoRR/NCSTRL interoperation, and issues that have arisen during the operation of CoRR.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: This work focuses on an agent-based solution to the problem of providing access to multiple taxonomic perspectives for a large collection of data and instantiates this approach in a digital library in which agents serve as guides through multiple plant taxonomies for a distributed community of users interested in botany.
Abstract: INTRODUCTION Taxonomy is the study of the general principles of scientific classification. A taxonomy provides a particular way to organize entities in a given realm according to specific criteria. Taxonomies are valuable resources in identifying and understanding newly discovered objects or concepts. Frequently, however, multiple taxonomies exist for the same entities. These arise because different viewpoints serve different purposes or simply because scientists do not always agree on how knowledge can best be organized. We focus on an agent-based solution to the problem of providing access to multiple taxonomic perspectives for a large collection of data. We instantiate this approach in a digital library in which agents serve as guides through multiple plant taxonomies for a distributed community of users interested in botany.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: NDLTD activities include: applying automation methods to simplify submission of ETDs over the WWW; specifying the application of the Dublin Core to guarantee that metadata can satisfy needs of searching and browsing; selecting open standards and procedures to facilitate interoperability and preservation; and demonstrating a variety of interfaces, both 2D and 3D, along with exploring their usability.
Abstract: The Networked Digital Library of Theses and Dissertations (NDLTD) is more than an online collection of Electronic Theses and Dissertations (ETDs). It is a scalable project that has impact on thousands of graduate students in many countries as well as diverse researchers worldwide. By May 1999 it had 59 official members representing 13 countries and integrated some of the world’s newest research works, including ETD collections at Virginia Tech and West Virginia University, where ETD submission is now required. The number of accesses to the Virginia Tech collection has grown by more than half in the last year. NDLTD is committed to authors, aiming to improve graduate education for the over 100,000 students that prepare a thesis or dissertation each year. It encourages them to be more expressive by facilitating incorporation of multimedia components into their theses. NDLTD activities include: applying automation methods to simplify submission of ETDs over the WWW; specifying the application of the Dublin Core to guarantee that metadata can satisfy needs of searching and browsing; selecting open standards and procedures to facilitate interoperability and preservation; and demonstrating a variety of interfaces, both 2D and 3D, along with exploring their usability.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: This paper is primarily concerned with the extraction of author and title information and the autonomous citation indexing components of ResearchIndex, a scientific literature digital library developed at the NEC Research Institute.
Abstract: We propose distributed error correction for digital libraries, where individual users can correct information in a database in real time. Distributed error correction is used in the ResearchIndex (formerly CiteSeer) scientific literature digital library developed at NEC Research Institute. We discuss issues including motivation to contribute corrections, barriers to participation, trust, recovery, detecting malicious changes, and the use of correction information to improve automated algorithms or predict the probability of errors. We also detail our implementation of distributed error correction in ResearchIndex.

Introduction

Many online databases contain errors. In many cases, it is impractical for database maintainers to correct all of the errors in their databases. Many of these databases are created using automated or partially automated means; for example, some search engines classify pages into predefined categories, while other services maintain databases of automatically extracted information: the HPSearch [7] service maintains a database of researcher homepages, and the WebKB project at CMU automatically extracts information from Web pages [3]. We propose the use of distributed error correction to increase the accuracy of online databases, by harnessing the knowledge of all users and allowing individual users to correct errors. Examples might include users reporting incorrectly classified pages to search engines, or correcting responses from services such as homepage location services. In this work we focus on the correction of automatically extracted information in ResearchIndex [10, 11], a scientific literature digital library developed at the NEC Research Institute. The next section provides brief background information on the ResearchIndex system and the Autonomous Citation Indexing (ACI) performed by ResearchIndex.

ResearchIndex

ResearchIndex is a scientific literature digital library project at NEC Research Institute.
Areas of focus include the effective use of the capabilities of the web, and the use of machine learning. The ResearchIndex project encompasses many areas including the efficient location of articles, full-text indexing, autonomous citation indexing, information extraction, computation of related documents, and user profiling. ResearchIndex operates completely autonomously and performs a number of tasks including: location of research articles on the web, conversion of PostScript and PDF files to text, extraction of title and author information from article headers, extraction of the list of citations made in an article, autonomous citation indexing, and the extraction of citation context within articles. This paper is primarily concerned with the extraction of author and title information and the autonomous citation indexing components of ResearchIndex.

Citation indexing is the indexing of the citations made in research articles, linking the citing papers with the cited works [4]. Citation indices allow, for example, the location of subsequent papers that cite a given paper. The most well-known citation indices are the commercial indices created by the Institute for Scientific Information (ISI) (http://www.isinet.com/), for example the Science Citation Index (SCI). The ISI citation databases are created using manual effort, and are known to contain errors. Autonomous Citation Indexing (ACI) automates the task of creating citation indices. Details of the autonomous citation indexing performed by ResearchIndex can be found in [11].

There are several sources of potential errors in ResearchIndex; for example, errors can be made extracting the title and author information from citations and documents, and errors can be made matching citations to the same article (citations can be written in many different formats, and ResearchIndex attempts to group together citations to the same paper).
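The citation-grouping problem just described (the same cited paper appearing under many surface forms) can be illustrated with a deliberately crude sketch. The punctuation-and-case-stripping key below is a placeholder for illustration only, not ResearchIndex's actual matching algorithm:

```python
import re
from collections import defaultdict

def citation_key(citation):
    """Very crude normalization: lowercase, replace punctuation with
    spaces, collapse whitespace. A stand-in for real citation matching."""
    text = re.sub(r"[^a-z0-9 ]", " ", citation.lower())
    return " ".join(text.split())

def group_citations(citations):
    """Group citation strings that normalize to the same key."""
    groups = defaultdict(list)
    for c in citations:
        groups[citation_key(c)].append(c)
    return groups
```

Under this scheme, "Foo, B. Some Paper. 1999." and "foo b some paper 1999" fall into one group; citations that reorder fields or abbreviate differently would not, which is exactly why the real matching task is hard and error-prone.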
This paper focuses on the use of distributed error correction to correct these errors. We note that there are techniques that could potentially reduce the error rate of autonomous systems such as ResearchIndex. For example, Cameron [2] proposed a universal bibliographic and citation database that would link every scholarly work ever written. Cameron's proposal includes the requirement that authors or institutions provide citation information in a standardized format, which removes the difficulty involved in parsing free-form citations. However, such a method imposes a substantial overhead on the authors or institutions, and has not gained widespread acceptance. Another possibility is the use of universal identifiers [1], such as those used in the Los Alamos e-Print Archive (http://xxx.lanl.gov). However, this also requires effort for the authors to look up the identifiers, and the use of identifiers for citations in the Los Alamos archive varies significantly by discipline [6]. Even with improved algorithms for the tasks performed by ResearchIndex, it is very unlikely that perfect algorithms could be created for most tasks. For example, perfect algorithms for title/author extraction in citations would have to be able to correct for errors made by the article authors and errors made in the conversion from PostScript/PDF to text (PostScript programs can be written in many different ways; the conversion task is relatively simple to do with high accuracy but very difficult to do perfectly [14]).

Distributed Error Correction

In distributed error correction, individual users are able to correct errors that they find while using an online system. The following sections discuss issues involved in using distributed error correction for online databases, with specific focus on the application to ResearchIndex.

Trust

In distributed error correction, individual users can correct errors in a database.
An important and immediate question is: how do we prevent malicious users from corrupting the database? Various schemes could be used to validate and assign degrees of trust to users (for example, techniques similar to those used with PGP [15]). However, they all involve some overhead, which would limit the fraction of users providing corrections. We therefore focus on detection rather than prevention of malicious users. However, as is common with many web sites, we can optionally require a validated email address, i.e., we request an email address and immediately send a message to that address asking for confirmation before allowing a user to make any changes. If malicious changes were consistently made from free email addresses (Hotmail, Yahoo!, etc.) these could be disallowed, since most legitimate users are likely to have email addresses at universities or research labs.

Recovering

The first observation is that no matter what methods are used, there is always the possibility of malicious or accidentally incorrect changes being made to the database. Therefore, we keep a transaction log of all changes, which allows changes to be rolled back. This also allows easy application of corrections to new databases that may contain the same documents. Since ResearchIndex is freely available, multiple organizations may be obtaining correction information from users, which may be distributed to each organization for correction of identical documents or citations.

Detecting Malicious Users

Malicious changes to the ResearchIndex database may not be a significant problem, due to the target audience of scientific researchers. The Los Alamos e-Print archive (http://xxx.lanl.gov) has not had any difficulty with malicious users [5]. However, various methods for detecting malicious changes are possible. For example, consider changes to title and author information for indexed articles and citations.
The new title and author information should exist in the article header or citation, although there may be errors. Edit distance [12, 8] or similar algorithms could be used to analyze changes from each user. If similar strings to the new information are not contained in the original citation or article header, then this may indicate a malicious (or accidentally incorrect) change.

Motivation

Users are known for not wanting to spend time providing explicit feedback. On the web, most attempts to use relevance feedback have resulted in very small amounts of participation. Therefore, an important question is: how do we motivate users to correct the database? For scientific literature, one strong motivation is for authors to correct information relating to their own publications, which improves the accessibility of their research. Another possibility is to provide users with alternative incentives to correct errors, for example payments, displaying credits for corrections, or increased status within the system. One technique used in ResearchIndex is to highlight the advantages of making corrections immediately when they are made. In particular, correcting the title and author information on a document in ResearchIndex can enable the system to link the document with corresponding citations in other articles. When this is possible, we immediately perform the linking, notify the user, and provide a link to the respective citations on the correction response page.

A related issue is the complexity of the correction process. In general, overhead limits usage, and excessive overhead can effectively prevent usage. An interesting analogy is the web itself, arguably a large, ad hoc, poorly organized information resource, full of dead links, and lacking built-in support for features such as content indexing and access payments.
These deficiencies are in principle solvable, and indeed proposals for hypertext systems without these deficiencies existed long before the web (e.g., Xanadu [13]). However, the reality of designing, implementing, and participating in more idealized hypertext systems, namely greater overhead for designers and participants, has prevented the widespread success of such systems. On the other hand, a
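The edit-distance screen proposed under "Detecting Malicious Users" can be sketched as follows. The sliding-window scan, the 0.3 relative-distance threshold, and the function names are illustrative assumptions, not details of ResearchIndex itself:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correction_is_plausible(new_title, header_text, threshold=0.3):
    """Accept a user-submitted title only if some window of the
    article header is within `threshold` relative edit distance of it.
    The threshold value is an illustrative choice, not the system's."""
    new_title = new_title.lower()
    header = header_text.lower()
    n = len(new_title)
    best = min(edit_distance(new_title, header[i:i + n])
               for i in range(max(1, len(header) - n + 1)))
    return best <= threshold * n
```

A correction that fails this check is not necessarily malicious (it may fix an OCR error the header itself contains), so a system would more likely flag it for review than reject it outright.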

Proceedings ArticleDOI
01 Aug 1999
TL;DR: In this article, the authors present the design and the current prototype implementation of an interactive vocal information retrieval system that can be used to access articles of a large newspaper archive using a telephone.
Abstract: This paper presents the design and the current prototype implementation of an interactive vocal information retrieval system that can be used to access articles of a large newspaper archive using a telephone. The results of a preliminary investigation into the feasibility of such a system are also presented.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: InfoWeaver, an information integration system which can dynamically construct various views on structured documents, the Web, and databases according to users' visual specification for data manipulation, is explained.
Abstract: Digital libraries have to provide the user with seamless access to digital information resources in different formats. This paper explains InfoWeaver, an information integration system which can dynamically construct various views on structured documents, the Web, and databases according to users' visual specification for data manipulation. We have implemented a prototype system which accommodates the OpenText index server, the Web, and Oracle8.

Proceedings ArticleDOI
01 Aug 1999
TL;DR: This workshop is intended to foster the development of the needed technology by providing a forum in which researchers from several communities can share their perspectives, describe work in progress and present their results.
Abstract: Discovery of and access to multilingual information poses a number of important challenges, and communities of interest have formed around several key issues. This workshop is intended to foster the development of the needed technology by providing a forum in which researchers from several communities can share their perspectives, describe work in progress and present their results. The workshop will include researchers from the information retrieval, computational linguistics, multilingual metadata and World Wide Web internationalization communities. We aim to include a mix of junior and senior researchers as well as individuals responsible for research investment policy from around the world.