
Showing papers presented at "ACM international conference on Digital libraries in 2000"


Proceedings ArticleDOI
01 Jun 2000
TL;DR: This paper develops a scalable evaluation methodology and metrics for the task, and presents a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
Abstract: Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, which in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
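Snowball's bootstrapping loop can be illustrated with a minimal sketch. The toy corpus, the organization/location relation, and the simple middle-context patterns below are illustrative assumptions; the real system uses much richer pattern representations and confidence scoring over 300,000 documents.

```python
import re

# Toy corpus; the real Snowball ran over 300,000 newspaper documents.
docs = [
    "Microsoft's headquarters in Redmond is expanding.",
    "Exxon's headquarters in Irving employs thousands.",
    "Boeing's headquarters in Seattle moved later.",
]

# A handful of user-supplied seed tuples starts the process.
seeds = {("Microsoft", "Redmond")}

def learn_patterns(docs, tuples):
    """Turn each seed occurrence into a middle-context pattern string."""
    patterns = set()
    for org, loc in tuples:
        for d in docs:
            m = re.search(re.escape(org) + r"(.*?)" + re.escape(loc), d)
            if m:
                patterns.add(m.group(1))  # e.g. "'s headquarters in "
    return patterns

def extract_tuples(docs, patterns):
    """Apply each learned pattern to harvest new (org, location) pairs."""
    found = set()
    for p in patterns:
        for d in docs:
            for m in re.finditer(r"(\w+)" + re.escape(p) + r"(\w+)", d):
                found.add((m.group(1), m.group(2)))
    return found

# One bootstrapping iteration: seeds -> patterns -> new tuples.
patterns = learn_patterns(docs, seeds)
tuples = extract_tuples(docs, patterns)
```

In the full system each iteration would also score the new patterns and tuples and keep only the reliable ones before looping again.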

1,399 citations


Proceedings ArticleDOI
01 Jun 2000
TL;DR: This work describes a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization; initial experimental results demonstrate that this approach can produce accurate recommendations.
Abstract: Recommender systems improve access to relevant products and information by making personalized suggestions based on previous examples of a user's likes and dislikes. Most existing recommender systems use collaborative filtering methods that base recommendations on other users' preferences. By contrast, content-based methods use information about an item itself to make suggestions. This approach has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations. We describe a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization. Initial experimental results demonstrate that this approach can produce accurate recommendations.

1,330 citations


Proceedings ArticleDOI
01 Jun 2000
TL;DR: A simplified two-dimensional display that uses categorical and hierarchical axes, called hieraxes, applied to a digital video library of science topics used by middle school teachers, a legal information system, and a technical library using the ACM Computing Classification System.
Abstract: Digital library search results are usually shown as a textual list, with 10-20 items per page. Viewing several thousand search results at once on a two-dimensional display with continuous variables is a promising alternative. Since these displays can overwhelm some users, we created a simplified two-dimensional display that uses categorical and hierarchical axes, called hieraxes. Users appreciate the meaningful and limited number of terms on each hieraxis. At each grid point of the display we show a cluster of color-coded dots or a bar chart. Users see the entire result set and can then click on labels to move down a level in the hierarchy. Handling broad hierarchies and arranging for imposed hierarchies led to additional design innovations. We applied hieraxes to a digital video library of science topics used by middle school teachers, a legal information system, and a technical library using the ACM Computing Classification System. Feedback from usability testing with 32 subjects revealed strengths and weaknesses.

183 citations


Proceedings ArticleDOI
01 Jun 2000
TL;DR: A web server for acronym and abbreviation lookup, containing a collection of acronyms and their expansions gathered from a large number of web pages by a heuristic extraction process, which has the potential to be much more inclusive as data from more web pages are processed.
Abstract: We implemented a web server for acronym and abbreviation lookup, containing a collection of acronyms and their expansions gathered from a large number of web pages by a heuristic extraction process. Several different extraction algorithms were evaluated and compared. The corpus resulting from the best algorithm is comparable to a high-quality hand-crafted site, but has the potential to be much more inclusive as data from more web pages are processed.
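The kind of heuristic extraction described above can be sketched as a simple initial-matching rule: a parenthesized uppercase token whose letters match the initials of the preceding words. This is a hypothetical simplification; the paper evaluates several more elaborate algorithms.

```python
import re

def find_acronyms(text):
    """Heuristic: a parenthesized uppercase token whose letters match
    the initials of the words immediately preceding it."""
    pairs = {}
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        acro = m.group(1)
        # Take as many preceding words as the acronym has letters.
        words = text[:m.start()].split()[-len(acro):]
        if len(words) == len(acro) and all(
            w[0].upper() == c for w, c in zip(words, acro)
        ):
            pairs[acro] = " ".join(words)
    return pairs

sample = "The World Wide Web (WWW) and Digital Object Identifier (DOI) ..."
print(find_acronyms(sample))
# → {'WWW': 'World Wide Web', 'DOI': 'Digital Object Identifier'}
```

Running such a rule over many crawled pages yields a lookup corpus that grows as more pages are processed, which is the inclusiveness argument the abstract makes.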

151 citations


Proceedings ArticleDOI
01 Jun 2000
TL;DR: The Greenstone digital library software is described, a comprehensive, open-source system for the construction and presentation of information collections that offers effective full-text searching and metadata-based browsing facilities that are attractive and easy to use.
Abstract: This paper describes the Greenstone digital library software, a comprehensive, open-source system for the construction and presentation of information collections. Collections built with Greenstone offer effective full-text searching and metadata-based browsing facilities that are attractive and easy to use. Moreover, they are easily maintainable and can be augmented and rebuilt entirely automatically. The system is extensible: software "plugins" accommodate different document and metadata types.

137 citations


Proceedings ArticleDOI
01 Jun 2000
TL;DR: A technique for capturing an accurate 3D representation of library materials which can be integrated directly into current digitization setups will allow digitization efforts to provide patrons with more realistic digital facsimiles of library materials.
Abstract: Significant efforts are being made to digitize rare and valuable library materials, with the goal of providing patrons and historians digital facsimiles that capture the "look and feel" of the original materials. This is often done by digitally photographing the materials and making high resolution 2D images available. The underlying assumption is that the objects are flat. However, older materials may not be flat in practice, being warped and crinkled due to decay, neglect, accident and the passing of time. In such cases, 2D imaging is insufficient to capture the "look and feel" of the original. For these materials, 3D acquisition is necessary to create a realistic facsimile. This paper outlines a technique for capturing an accurate 3D representation of library materials which can be integrated directly into current digitization setups. This will allow digitization efforts to provide patrons with more realistic digital facsimiles of library materials.

125 citations


Proceedings ArticleDOI
H. Kawano1
13 Nov 2000
TL;DR: This work describes the Japanese Web search engine "Mondou (RCAAU)", one of the first generation of Web search engines, and introduces the concept of an integrated query mechanism for different search engines based on KQML agents.
Abstract: As the volume of Web pages on the Internet increases rapidly, it is becoming hard for users to discover valuable Web resources. It is especially difficult for naive users to discover informative pages with popular Web search engines, since they lack background and domain knowledge about the status of Web systems. Therefore, many kinds of Web search engines have been developed to support the processes of Web information retrieval. We are developing the Japanese Web search engine "Mondou (RCAAU)". Though our engine is one of the first generation of Web search engines, we have tried to implement rapidly emerging data mining technologies in our search engine since 1995. We are also implementing Java applets based on information visualization. The author presents technical overviews of the Mondou Web search engine. One of the most important techniques is the text mining algorithm based on primitive association rules. Mondou provides highly relevant feedback keywords to users in order to support search steps. Using the associative keywords, users can modify the combination of keywords in the initial query. We also introduce the concept of an integrated query mechanism for different search engines based on KQML agents. Furthermore, in order to visualize the characteristics of search results, we are developing Java applets to display the ROC graph and clusters of specific documents. We are also trying to improve the Web robots for the Mondou system from the viewpoint of data cleaning. Finally, we discuss the effectiveness and performance of our Web search engine.

95 citations


Proceedings ArticleDOI
01 Jun 2000
TL;DR: A system, based on a novel spatial/visual knowledge principle, for extracting metadata from scientific papers stored as PostScript files that embeds the general knowledge about the graphical layout of a scientific paper to guide the metadata extraction process.
Abstract: Automatic document metadata extraction is an important task in a world where thousands of documents are just one "click" away; powerful indices are necessary to support effective retrieval. The upcoming XML standard represents an important step in this direction, as its semistructured representation conveys document metadata together with the text of the document. For example, retrieval of scientific papers by authors or affiliations would be a straightforward task if papers were stored in XML. Unfortunately, today, the large majority of documents on the web are available in forms that do not carry additional semantics. Converting existing documents to a semistructured representation is time consuming and no automatic process can be easily applied. In this paper we discuss a system, based on a novel spatial/visual knowledge principle, for extracting metadata from scientific papers stored as PostScript files. Our system embeds general knowledge about the graphical layout of a scientific paper to guide the metadata extraction process, and can effectively assist automatic index creation for digital libraries.

91 citations


Proceedings ArticleDOI
01 Jun 2000
TL;DR: A flexible and dynamic mediator infrastructure that allows mediators to be composed from a set of modules ("blades"), each of which implements a particular mediation function, such as protocol translation, query translation, or result merging.
Abstract: Digital library mediators allow interoperation between diverse information services. In this paper we describe a flexible and dynamic mediator infrastructure that allows mediators to be composed from a set of modules ("blades"). Each module implements a particular mediation function, such as protocol translation, query translation, or result merging. All the information used by the mediator, including the mediator logic itself, is represented by an RDF graph. We illustrate our approach using a mediation scenario involving a Dienst and a Z39.50 server, and we discuss the potential advantages and weaknesses of our framework.

82 citations


Proceedings ArticleDOI
01 Jun 2000
TL;DR: Results indicate that annotations improve recall of emphasized items, influence how specific arguments in the source materials are perceived, and decrease students' tendencies to unnecessarily summarize; implications for the design and implementation of digitally annotated materials are discussed.
Abstract: Recent research on annotations has focused on how readers annotate texts, ignoring the question of how reading annotations might affect subsequent readers of a text. This paper reports on a study of persuasive essays written by 123 undergraduates receiving primary source materials annotated in various ways. Findings indicate that annotations improve recall of emphasized items, influence how specific arguments in the source materials are perceived, and decrease students' tendencies to unnecessarily summarize. Of particular interest is that students' perceptions of the annotator appeared to greatly influence how they responded to the annotated material. Using this study as a basis, I discuss implications for the design and implementation of digitally annotated materials.

81 citations


Proceedings ArticleDOI
01 Jun 2000
TL;DR: This paper describes the preservation approach adopted in the Victorian Electronic Record Strategy (VERS) which is currently being trialed within the Victorian government, one of the states of Australia.
Abstract: Well within our lifetime we can expect to see most information being created, stored and used digitally. Despite the growing importance of digital data, the wider community pays almost no attention to the problems of preserving this digital information for the future. Even within the archival and library communities most work on digital preservation has been theoretical, not practical, and highlights the problems rather than giving solutions. Physical libraries have to preserve information for long periods and this is no less true of their digital equivalents. This paper describes the preservation approach adopted in the Victorian Electronic Record Strategy (VERS) which is currently being trialed within the Victorian government, one of the states of Australia. We review the various preservation approaches that have been suggested and describe in detail encapsulation, the approach which underlies the VERS format. A key difference between the VERS project and previous digital preservation projects is the focus within VERS on the construction of actual systems to test and implement the proposed technology. VERS is not a theoretical study in preservation.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: The Open Citation project is described, which will focus on linking papers held in freely accessible eprint archives such as the Los Alamos physics archives and other distributed archives, and which will build on the work of the Open Archives initiative to make the data held in such archives available to compliant services.
Abstract: The rapid growth of scholarly information resources available in electronic form and their organisation by digital libraries is proving fertile ground for the development of sophisticated new services, of which citation linking will be one indispensable example. Many new projects, partnerships and commercial agreements have been announced to build citation linking applications. This paper describes the Open Citation (OpCit) project, which will focus on linking papers held in freely accessible eprint archives such as the Los Alamos physics archives and other distributed archives, and which will build on the work of the Open Archives initiative to make the data held in such archives available to compliant services. The paper emphasises the work of the project in the context of emerging digital library information environments, explores how a range of new linking tools might be combined and identifies ways in which different linking applications might converge. Some early results of linked pages from the OpCit project are reported.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: A case study that uses an automatically constructed phrase hierarchy to facilitate browsing of an ordinary large Web site and the ultimate goal is to amalgamate hierarchical phrase browsing and hierarchical thesaurus browsing.
Abstract: Phrase browsing techniques use phrases extracted automatically from a large information collection as a basis for browsing and accessing it. This paper describes a case study that uses an automatically constructed phrase hierarchy to facilitate browsing of an ordinary large Web site. Phrases are extracted from the full text using a novel combination of rudimentary syntactic processing and sequential grammar induction techniques. The interface is simple, robust and easy to use. To convey a feeling for the quality of the phrases that are generated automatically, a thesaurus used by the organization responsible for the Web site is studied and its degree of overlap with the phrases in the hierarchy is analyzed. Our ultimate goal is to amalgamate hierarchical phrase browsing and hierarchical thesaurus browsing: the latter provides an authoritative domain vocabulary and the former augments coverage in areas the thesaurus does not reach.
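A crude stand-in for the phrase extraction step is counting recurring word n-grams; the sample text below is hypothetical, and the paper's actual method combines rudimentary syntactic processing with sequential grammar induction rather than raw counting.

```python
from collections import Counter
import re

def repeated_phrases(text, n=2, min_count=2):
    """Count word n-grams that recur; recurring phrases can seed a
    browsing hierarchy (the paper's grammar-induction step is far
    more sophisticated than this)."""
    words = re.findall(r"[a-z]+", text.lower())
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {" ".join(g): c for g, c in grams.items() if c >= min_count}

text = ("forest management plan; forest management practice; "
        "management plan review")
print(repeated_phrases(text))
# → {'forest management': 2, 'management plan': 2}
```

Nesting such phrases by shared prefixes ("forest management" above "forest management plan") then yields the hierarchy the interface browses.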

Proceedings ArticleDOI
01 Jun 2000
TL;DR: The Compus visualization system that assists in the exploration and analysis of structured document corpora encoded in XML, providing a synoptic visualization of a corpus and allowing for dynamic queries and structural transformations, assists researchers in finding regularities or discrepancies leading to a higher level analysis of historic source.
Abstract: This article describes the Compus visualization system that assists in the exploration and analysis of structured document corpora encoded in XML. Compus has been developed for and applied to a corpus of 100 French manuscript letters of the 16th century, transcribed and encoded for scholarly analysis using the recommendations of the Text Encoding Initiative. By providing a synoptic visualization of a corpus and allowing for dynamic queries and structural transformations, Compus assists researchers in finding regularities or discrepancies, leading to a higher-level analysis of historic sources. Compus can be used with other richly encoded text corpora as well.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: The MatchDetectReveal (MDR) system is capable of identifying overlapping and plagiarised documents; its matching engine uses a modified suffix tree representation to identify exact overlapping chunks, and its performance is also presented.
Abstract: In this paper we introduce the MatchDetectReveal (MDR) system, which is capable of identifying overlapping and plagiarised documents. Each component of the system is briefly described. The matching-engine component uses a modified suffix tree representation that identifies exact overlapping chunks; its performance is also presented.
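The core matching task, finding the longest exact overlapping chunk between two documents, can be sketched with Python's standard library. Here difflib stands in for MDR's modified suffix tree, which solves the same problem with better scaling; the two sample documents are hypothetical.

```python
from difflib import SequenceMatcher

def longest_overlap(a, b):
    """Return the longest exact overlapping chunk between two texts.
    SequenceMatcher is a stand-in for a suffix-tree matcher: it finds
    the same chunk, just with worse asymptotic behaviour."""
    m = SequenceMatcher(None, a, b, autojunk=False)
    match = m.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "he saw the quick brown fox jump away"
print(longest_overlap(doc1, doc2))  # → "the quick brown fox jump"
```

A plagiarism detector would repeat this over all document pairs and flag pairs whose overlapping chunks exceed some length threshold.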

Proceedings ArticleDOI
13 Nov 2000
TL;DR: A framework of requirements, covering the design space of customizable Web applications, is suggested, and existing approaches for developing customizable Web applications are surveyed; general shortcomings are identified, pointing the way to next-generation modeling methods.
Abstract: The Web is more and more used as a platform for full-fledged increasingly complex applications, where a huge amount of change-intensive data is managed by underlying database systems. From a software engineering point of view, the development of Web applications requires proper modeling methods in order to ensure architectural soundness and maintainability. Existing modeling methods for Web applications, however, fall short on considering a major requirement posed on today's Web applications, namely customization. Web applications should be customizable with respect to various context factors comprising different user preferences, device capabilities and locations in mobile scenarios, to mention just a few. The goal of this paper is twofold. First, a framework of requirements, covering the design space of customizable Web applications is suggested. Second, on the basis of this framework, existing approaches for developing customizable Web applications are surveyed and general shortcomings are identified pointing the way to next-generation modeling methods.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: A general framework for reverse engineering the underlying structures, i.e. the DTD, from a collection of similarly structured XML documents that share some common but unknown DTD.
Abstract: To realize a wide range of applications (including digital libraries) on the Web, a more structured way of accessing the Web is required, and such a requirement can be facilitated by the use of the XML standard. In this paper, we propose a general framework for reverse engineering (or re-engineering) the underlying structures, i.e., the DTD, from a collection of similarly structured XML documents that share some common but unknown DTDs. The essential data structures and algorithms for DTD generation have been developed, and experiments on real Web collections have been conducted to demonstrate their feasibility. In addition, we also propose a method of imposing a constraint on the repetitiveness of elements in a DTD rule to further simplify the generated DTD without compromising its correctness.
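A toy version of DTD inference can be sketched as follows. The element names, the two-document collection, and the naive quantifier rules are illustrative assumptions; the paper's algorithms handle far more general structures.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical mini-collection sharing an unknown DTD.
docs = [
    "<paper><title>A</title><author>X</author><author>Y</author></paper>",
    "<paper><title>B</title><author>Z</author></paper>",
]

def infer_rules(docs):
    """Collect each element's observed child sequences, then emit a
    naive rule: a child seen more than once gets '+', a child missing
    from some document gets '?' (or '*' if both apply)."""
    children = defaultdict(list)
    for d in docs:
        root = ET.fromstring(d)
        for elem in root.iter():
            children[elem.tag].append([c.tag for c in elem])
    rules = {}
    for tag, seqs in children.items():
        names = []
        for seq in seqs:
            for c in seq:
                if c not in names:
                    names.append(c)
        parts = []
        for name in names:
            counts = [s.count(name) for s in seqs]
            suffix = "+" if max(counts) > 1 else ""
            if min(counts) == 0:
                suffix = "*" if suffix else "?"
            parts.append(name + suffix)
        rules[tag] = (f"<!ELEMENT {tag} ({', '.join(parts)})>"
                      if parts else f"<!ELEMENT {tag} (#PCDATA)>")
    return rules

print(infer_rules(docs)["paper"])  # → <!ELEMENT paper (title, author+)>
```

The paper's repetitiveness constraint corresponds to collapsing long observed repetitions into a single quantified element, as the `+` does here.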

Proceedings ArticleDOI
01 Jun 2000
TL;DR: A comprehensive digital collection of Taiwan's butterflies to provide a modern research environment on butterflies for academic institutions, as well as an interactive butterfly educational environment for the general public.
Abstract: Taiwan is renowned for its great variety of butterflies. There are about 400 species, a number of which are unique to Taiwan, across its 36,500 sq km of land. Last year we built a comprehensive digital collection of Taiwan's butterflies to provide a modern research environment on butterflies for academic institutions, as well as an interactive butterfly educational environment for the general public. Our digital museum emphasizes ease of use, and provides a number of innovative features to help the user fully utilize the information provided by the system. The digital museum is accessible through the Web at http://digimuse.nmns.edu.tw.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: Findings from a library technologies user survey and on-site mobile library access prototype testing are outlined, and future research directions derived from the results of these two studies are presented.
Abstract: Digital library research is made more robust and effective when end-user opinions and viewpoints inform the research, design and development process. A rich understanding of user tasks and contexts is especially necessary when investigating the use of mobile computers in traditional and digital library environments, since the nature and scope of the research questions at hand remain relatively undefined. This paper outlines findings from a library technologies user survey and on-site mobile library access prototype testing, and presents future research directions that can be derived from the results of these two studies.

Proceedings ArticleDOI
Katy Börner1
01 Jun 2000
TL;DR: An approach that organizes retrieval results semantically and displays them spatially for browsing is introduced, implemented to visualize retrieval results from two different databases: the Science Citation Index Expanded and theDido Image Bank.
Abstract: The paper introduces an approach that organizes retrieval results semantically and displays them spatially for browsing. Latent Semantic Analysis as well as cluster techniques are applied for semantic data analysis. A modified Boltzmann algorithm is used to lay out documents in a two-dimensional space for interactive exploration. The approach was implemented to visualize retrieval results from two different databases: the Science Citation Index Expanded and the Dido Image Bank.

Proceedings ArticleDOI
13 Nov 2000
TL;DR: This work presents a methodology, architecture, and proof-of-concept prototype for query construction and results analysis that provides the user with a ranking of choices based on the user's determination of importance.
Abstract: The World Wide Web provides access to a great deal of information on a vast array of subjects. In a typical Web search a vast amount of information is retrieved. The quantity can be overwhelming, and much of the information may be marginally relevant or completely irrelevant to the user's request. We present a methodology, architecture, and proof-of-concept prototype for query construction and results analysis that provides the user with a ranking of choices based on the user's determination of importance. The user initially designs the query with assistance from the user's profile, a thesaurus, and previously constructed queries acting as a taxonomy of the information requirements. After the query has returned its results, decision analytic methods and information source reliability information are used in conjunction with the expanded taxonomy to rank the solution candidates.


Proceedings ArticleDOI
01 Jun 2000
TL;DR: MiBiblio allows users to create virtual places the authors term personal spaces; as users find useful items in the repositories, they organize these items and keep them handy in their personal spaces for future use.
Abstract: This paper describes MiBiblio, a highly personalizable interface to large collections in digital libraries. MiBiblio allows users to create virtual places we term personal spaces. As users find useful items in the repositories, they organize these items and keep them handy in their personal spaces for future use. Personal spaces may also be updated by user agents.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: A browser that provides a framework for interactive summaries of video narratives, applied to Corduroy, a children's short feature which was analyzed in detail.
Abstract: Stories may be analyzed as sequences of causally-related events and reactions to those events by the characters. We employ a notation of plot elements, similar to one developed by Lehnert, and we extend it by forming higher-level "story threads". We apply the browser to Corduroy, a children's short feature which was analyzed in detail. We provide additional illustrations with an analysis of Kiss of Death, a film noir classic. Effectively, the browser provides a framework for interactive summaries of the video narrative.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: A case is presented for digital scholarship in which patrons perform all scholarly work electronically, and a proposal is made for patron-augmented digital libraries (PADLs), a class of digital libraries that supports the digital scholarship of its patrons.
Abstract: Digital library research is mostly focused on the generation of large collections of multimedia resources and state-of-the-art tools for their indexing and retrieval. However, digital libraries should provide more than advanced collection maintenance and retrieval services since the ultimate goal of any (academic) library is to serve the scholarly needs of its users. This paper begins by presenting a case for digital scholarship in which patrons perform all scholarly work electronically. A proposal is then made for patron-augmented digital libraries (PADLs), a class of digital libraries that supports the digital scholarship of its patrons. Finally, a prototype PADL (called Synchrony) providing access to video segments and associated textual transcripts is described. Synchrony allows patrons to search the library for artifacts, create annotations/original compositions, integrate these artifacts to form synchronized mixed text and video presentations and, after suitable review, publish these presentations into the digital library if desired. A study to evaluate the PADL concept and the usability of Synchrony is also discussed. The study revealed that participants were able to use Synchrony for the authoring and publishing of presentations and that attitudes toward PADLs were generally positive.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: A preliminary study was conducted to help understand the purpose of digital libraries and to investigate whether meaningful results could be obtained from small user studies of digital libraries.
Abstract: A preliminary study was conducted to help understand the purpose of digital libraries (DLs) and to investigate whether meaningful results could be obtained from small user studies of digital libraries. Results stress the importance of mental models, and of "traditional" library support.

Proceedings ArticleDOI
13 Nov 2000
TL;DR: While efforts have focused quite a bit on short-term access that occurs over the duration of a course, it is clear that significant value is added to the archive as it is tuned for long-term use.
Abstract: Since 1995, we have been researching the application of ubiquitous computing technology to support the automated capture of live university lectures so that students and teachers may later access them. With virtually no additional effort beyond that which lecturers already expend on preparing and delivering a lecture, we are able to create a repository, or digital library, of rich educational experiences that is constantly growing. The resulting archive includes a heterogeneous mix of materials presented in lectures. We discuss access issues for this digital library that cover short-term and long-term use of the repository. While our efforts have focused quite a bit on short-term access that occurs over the duration of a course, it is clear that significant value is added to the archive as it is tuned for long-term use. These long-term access issues for an experiential digital library have not yet been addressed, and we highlight some of those challenges in this paper.

Proceedings ArticleDOI
13 Nov 2000
TL;DR: The paper proposes a technique for automatically generating Qualified Dublin Core metadata (Weibel, 2000) on a Web server using a Java Servlet, structured using the Resource Description Framework (RDF) and expressed in eXtensible Markup Language (XML).
Abstract: The paper proposes a technique for automatically generating Qualified Dublin Core metadata (Weibel, 2000) on a Web server using a Java Servlet. The metadata is structured using the Resource Description Framework (RDF) and expressed in eXtensible Markup Language (XML). The descriptions cover ten of the fifteen standard Dublin Core metadata elements, and semantic precision is increased by element refinement and encoding scheme qualifiers. The servlet produces rich but interoperable metadata encompassing data from all three of the main element groups: content, instantiation and intellectual property. The generated descriptions could most obviously be used by tools for resource discovery but also by local data management applications. Sites wishing to submit content to a portal or specialised digital library could be encouraged to run such an application in order to automate resource description.
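The shape of the generated output can be sketched with Python's standard library instead of a Java Servlet. The page URL and field values below are hypothetical; the actual servlet derives them from server and document data, and adds the refinement and encoding-scheme qualifiers the abstract mentions.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core and RDF namespace URIs.
DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ET.register_namespace("dc", DC)
ET.register_namespace("rdf", RDF)

def describe(url, fields):
    """Wrap extracted page properties in an RDF/XML Dublin Core record."""
    rdf = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description",
                         {f"{{{RDF}}}about": url})
    for name, value in fields.items():
        ET.SubElement(desc, f"{{{DC}}}{name}").text = value
    return ET.tostring(rdf, encoding="unicode")

record = describe("http://example.org/page", {
    "title": "Sample Page",       # content group
    "creator": "A. Author",       # intellectual property group
    "language": "en",             # instantiation group
})
```

Because the record uses standard namespaces, any RDF-aware harvester or portal can consume it without site-specific parsing, which is the interoperability point the abstract makes.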

Proceedings ArticleDOI
01 Jun 2000
TL;DR: User interaction and an automatically created thesaurus that maps text concepts and internal image concept representations, generated by various feature extraction algorithms, improve the query formulation process of the image retrieval system.
Abstract: Multimedia information retrieval in digital libraries is a difficult task for computers in general. Humans, on the other hand, are experts in perception, concept representation, knowledge organization and memory retrieval. Cognitive psychology and science describe how cognition works in humans, but can offer valuable clues to information retrieval researchers as well. Cognitive psychologists view the human mind as a general-purpose symbol-processing system that interacts with the world. A multimedia information retrieval system can also be regarded as a symbol-processing system that interacts with its environment, and its underlying information retrieval model can be seen as a cognitive framework. We describe the design and implementation of a combined text/image retrieval system (as an example of a multimedia retrieval system) that is inspired by cognitive theories such as Paivio's dual coding theory and Marr's theory of perception. User interaction and an automatically created thesaurus that maps text concepts to internal image concept representations, generated by various feature extraction algorithms, improve the query formulation process of the image retrieval system. Unlike most "multimedia databases" found in the literature, this image retrieval system uses the functionality provided by an extensible multimedia DBMS that is itself part of an open distributed environment.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: The focus of this study was the effect of clustering techniques and query highlighting on search strategy users develop in the virtual environment, and whether position or spatial arrangement influenced user behavior.
Abstract: In this paper we present a 2x3 factorial design study evaluating the limits and differences in the behavior of 10 users when searching in a virtual reality representation that mimics the arrangement of a traditional library. The focus of this study was the effect of clustering techniques and query highlighting on the search strategies users develop in the virtual environment, and whether position or spatial arrangement influenced user behavior. We found several particularities that can be attributed to the differences in the VR environment. This study's results identify: 1) the need to co-design both spatial arrangement and interaction method; 2) a difficulty novice users faced when using clusters to identify common topics; 3) the influence of position and distance on users' selection of collection items to inspect; and 4) that users did not search until they found the best match, but only until they found a satisfactory match.