
Showing papers presented at "ACM international conference on Digital libraries in 1996"


Proceedings ArticleDOI
01 Apr 1996
TL;DR: An interface that transcribes acoustic input into standard music notation is described, together with a prototype system for retrieving tunes from acoustic input.
Abstract: Music is traditionally retrieved by title, composer or subject classification. It is possible, with current technology, to retrieve music from a database on the basis of a few notes sung or hummed into a microphone. This paper describes the implementation of such a system, and discusses several issues pertaining to music retrieval. We first describe an interface that transcribes acoustic input into standard music notation. We then analyze string matching requirements for ranked retrieval of music and present the results of an experiment which tests how accurately people sing well-known melodies. The performance of several string matching criteria is analyzed using two folk song databases. Finally, we describe a prototype system which has been developed for retrieval of tunes from acoustic input.
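The ranked-retrieval idea above can be illustrated with a small sketch: encode each melody as an up/down/same pitch contour (a common representation for sung queries, since it tolerates transposition) and rank database tunes by edit distance to the query. This is a hedged simplification, not the paper's implementation; the function names and the tiny two-tune database are invented for illustration.

```python
def edit_distance(a, b):
    # Classic dynamic-programming edit distance between two sequences.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def contour(pitches):
    # Reduce absolute pitches (e.g. MIDI numbers) to an Up/Down/Same
    # contour, so a transposed or key-drifting query still matches.
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        out.append("U" if cur > prev else "D" if cur < prev else "S")
    return "".join(out)

def rank_tunes(query_pitches, database):
    # Rank every tune in the database by contour edit distance to the query.
    q = contour(query_pitches)
    scored = [(edit_distance(q, contour(p)), name) for name, p in database.items()]
    return sorted(scored)
```

A contour representation makes the match transposition-invariant, which matters because singers rarely reproduce absolute pitch accurately.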

346 citations


Proceedings ArticleDOI
01 Apr 1996
TL;DR: This paper studies the performance of various copy detection mechanisms, including disk storage requirements, main memory requirements, and response times for registration and querying, and contrasts performance with the accuracy of the mechanisms (how well they detect partial copies).
Abstract: Often, publishers are reluctant to offer valuable digital documents on the Internet for fear that they will be re-transmitted or copied widely. A Copy Detection Mechanism can help identify such copying. For example, publishers may register their documents with a copy detection server, and the server can then automatically check public sources such as UseNet articles and Web sites for potential illegal copies. The server can search for exact copies, and also for cases where significant portions of documents have been copied. In this paper we study, for the first time, the performance of various copy detection mechanisms, including the disk storage requirements, main memory requirements, response times for registration, and response time for querying. We also contrast performance to the accuracy of the mechanisms (how well they detect partial copies). The results are obtained using SCAM, an experimental server we have implemented, and a collection of 50,000 netnews articles.
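The registration-and-query idea can be sketched with word-level shingling, a standard technique for detecting partial overlap between documents. Note that SCAM's actual mechanism is frequency-based rather than shingle-based, so this is an illustrative stand-in with invented names:

```python
def shingles(text, n=4):
    # Represent a document as its set of overlapping n-word sequences.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(candidate, registered, n=4):
    # Fraction of the candidate's shingles that also appear in the
    # registered document: 1.0 for an exact copy, high for partial copies.
    a, b = shingles(candidate, n), shingles(registered, n)
    if not a:
        return 0.0
    return len(a & b) / len(a)
```

A copy detection server would precompute shingle sets for registered documents and flag any public document whose overlap exceeds a chosen threshold.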

198 citations


Proceedings ArticleDOI
01 Apr 1996
TL;DR: This paper proposes various kinds of new inverted indexing schemes and signature file schemes for efficient structure query processing and evaluates the storage requirements and disk access time of these schemes and presents the analytical and experimental results.
Abstract: Much research has been carried out in order to manage structured documents such as SGML documents and to provide powerful query facilities which exploit document structures as well as document contents. In order to perform structure queries efficiently in a structured document management system, an index structure which supports fast document element access must be provided. However, there has been little research on the index structures for structured documents. In this paper, we propose various kinds of new inverted indexing schemes and signature file schemes for efficient structure query processing. We evaluate the storage requirements and disk access time of our schemes and present the analytical and experimental results.
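One way to picture an inverted index that supports structure queries is to store, with each posting, the path of the document element the term occurs in; a query can then be restricted to a given element type. This is a generic sketch with invented names, not one of the paper's proposed schemes:

```python
from collections import defaultdict

class StructuredIndex:
    def __init__(self):
        # term -> list of (doc_id, element_path) postings
        self.postings = defaultdict(list)

    def add(self, doc_id, element_path, text):
        # Index each term together with the element it appears in,
        # e.g. element_path = "/article/title".
        for term in text.lower().split():
            self.postings[term].append((doc_id, element_path))

    def search(self, term, element=None):
        # Plain content query, or a structure query restricted to
        # postings whose element path ends with the given element name.
        hits = self.postings.get(term.lower(), [])
        if element is None:
            return [d for d, _ in hits]
        return [d for d, path in hits if path.endswith(element)]
```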

123 citations


Proceedings ArticleDOI
01 Apr 1996
TL;DR: This paper describes the design of a retrieval system prototype that allows users to simultaneously combine terms offered by different suggestion techniques; it does not compare the merits of each technique in a systematic, controlled way and offers no experimental results.
Abstract: The basic problem in information retrieval is that large-scale searches can only match terms specified by the user to terms appearing in documents in the digital library collection. Intermediate sources that support term suggestion can thus enhance retrieval by providing alternative search terms for the user. Term suggestion increases recall, while interaction enables the user to try not to decrease precision. We are building a prototype user interface that will become the Web interface for the University of Illinois Digital Library Initiative (DLI) testbed. It supports the principle of multiple views, where different kinds of term suggestors can be used to complement search and each other. This paper discusses its operation with two complementary term suggestors, subject thesauri and co-occurrence lists, and compares their utility. Thesauri are generated by human indexers and place selected terms in a subject hierarchy. Co-occurrence lists are generated by computer and place all terms in frequency order of occurrence together. The paper concludes with a discussion of how multiple views can help provide good-quality search for the Net. This is a paper about the design of a retrieval system prototype that allows users to simultaneously combine terms offered by different suggestion techniques, not about comparing the merits of each in a systematic and controlled way. It offers no experimental results.
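A co-occurrence list of the kind described can be sketched as follows. This is a minimal, assumption-laden illustration: real systems compute co-occurrence over large collections with windowing and statistical weighting, not whole-document term sets, and the function names are invented.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(documents):
    # Count, for every pair of terms, how many documents contain both.
    co = Counter()
    for doc in documents:
        terms = set(doc.lower().split())
        for a, b in combinations(sorted(terms), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    return co

def suggest(term, co, k=3):
    # Suggest the k terms that co-occur most often with the query term,
    # in frequency order -- the "co-occurrence list" view.
    term = term.lower()
    cands = [(n, other) for (t, other), n in co.items() if t == term]
    return [w for n, w in sorted(cands, reverse=True)[:k]]
```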

113 citations


Proceedings ArticleDOI
01 Apr 1996
TL;DR: The goal of the VISION (Video Indexing for SearchIng Over Networks) project is to establish a comprehensive, online digital videolibrary by developing automatic mechanisms to populate the library and provide content-based search and retrieval overcomputer networks.
Abstract: The goal of the VISION (Video Indexing for SearchIng Over Networks) project is to establish a comprehensive, online digital video library. We are developing automatic mechanisms to populate the library and provide content-based search and retrieval over computer networks. The salient feature of our approach is the integrated application of mature image/video processing, information retrieval, speech feature extraction and word-spotting technologies for efficient creation and exploration of the library materials. First, full-motion video is captured in real-time with flexible qualities to meet the requirements of library patrons connected via a wide range of network bandwidths. Then, the videos are automatically segmented into a number of logically meaningful video clips by our novel two-step algorithm based on video and audio contents. A closed caption decoder and/or word-spotter is being incorporated into the system to extract textual information to index the video clips by their contents. Finally, all information is stored in a full-text information retrieval system for content-based exploration of the library over networks of varying bandwidths.
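The paper's segmentation uses a novel two-step algorithm over both video and audio content; as a hedged illustration of the video half only, here is the standard color-histogram cut detector, with frames simplified to flat lists of grayscale pixel values and an arbitrary threshold:

```python
def histogram(frame, bins=8, max_val=256):
    # Bin the pixel intensities of one frame into a coarse histogram.
    h = [0] * bins
    for px in frame:
        h[px * bins // max_val] += 1
    return h

def shot_boundaries(frames, threshold=0.5):
    # Declare a cut wherever consecutive frames' histograms differ
    # by more than the threshold (normalized to the range 0..1).
    cuts = []
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / (2 * len(frames[i]))
        if diff > threshold:
            cuts.append(i)
        prev = cur
    return cuts
```

Each detected boundary would start a new candidate clip, which the full system then refines using audio cues and indexes via closed captions or word-spotting.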

83 citations


Proceedings ArticleDOI
01 Apr 1996
TL;DR: A document management infrastructure built around a multivalent perspective can provide an extensible, networked system that supports incremental addition of content, incremental addition of interaction with the user and with other components, reuse of content across behaviors, reuse of behaviors across types of documents, and efficient use of network bandwidth.
Abstract: Rich varieties of online digital documents are possible, documents which do not merely imitate the capabilities of other media. A true digital document provides an interface to potentially complex content. Since this content is infinitely varied and specialized, we must provide means to interact with it in arbitrarily specialized ways. Furthermore, since relevant content may be found in distinct documents, we must draw from multiple sources, yet provide a coherent presentation to the user. Finally, it is essential to be able to conveniently author new content, define new means of manipulation, and seamlessly mesh both with existing materials. We present a new general paradigm that regards documents with complex content as “multivalent documents”, comprising multiple “layers” of distinct but intimately related content. Small, dynamically-loaded program objects, or “behaviors”, activate the content and work in concert with each other and layers of content to support arbitrarily specialized document types. Behaviors bind together the disparate pieces of a multivalent document to present the user with a single unified conceptual document. As implemented in Java in the context of the World Wide Web, multivalent documents in effect create a customizable virtual Web, drawing together diverse content and functionality into coherent document-based interfaces to content.
Examples of the diverse functionality in multivalent documents include: “OCR select and paste”, where the user describes a geometric region on the scanned image of a printed page and the corresponding text characters are copied out; video subtitling, which aligns a video clip with the script and language translations so that, e.g., the playing video can be presented simultaneously in multiple languages, and the video can be searched with text-based techniques; geographic information system (GIS) visualizations that compose several types of data from multiple datasets; and distributed user annotations that augment and may transform the content of other documents. In general, a document management infrastructure built around a multivalent perspective can provide an extensible, networked system that supports incremental addition of content, incremental addition of interaction with the user and with other components, reuse of content across behaviors, reuse of behaviors across types of documents, and efficient use of network bandwidth. The work reported here was supported in part by National Science Foundation grant IRI-9411334 as part of the NSF/NASA/ARPA Digital Libraries Initiative.

79 citations


Proceedings ArticleDOI
01 Apr 1996
TL;DR: Due to variations in even a single person’s handwriting, it is expected that the matching will be the most difficult step in the whole process.
Abstract: There are many historical manuscripts written in a single hand which it would be useful to index. Examples include the W. E. B. Du Bois collection at the University of Massachusetts and the early Presidential libraries at the Library of Congress. The standard technique for indexing documents is to scan them in, convert them to machine readable form (ASCII) using Optical Character Recognition (OCR) and then index them using a text retrieval engine. However, OCR does not work well on handwriting. Here an alternative scheme is proposed for indexing such texts. Each page of the document is segmented into words. The images of the words are then matched against each other to create equivalence classes (each equivalence class contains multiple instances of the same word). The user then provides ASCII equivalents for, say, the top 2000 equivalence classes. The current paper deals with the matching aspects of this process. Due to variations in even a single person’s handwriting, it is expected that the matching will be the most difficult step in the whole process. A matching technique based on Euclidean distance mapping is discussed. Experiments are shown demonstrating the feasibility of the approach.
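The equivalence-class construction can be sketched with a much cruder dissimilarity than the paper's Euclidean distance mapping: a plain pixel-mismatch rate between same-size binary word images, clustered greedily. The names and the threshold here are invented for illustration only.

```python
def pixel_mismatch(img_a, img_b):
    # Fraction of pixels that differ between two same-size binary
    # word images (each image is a list of rows of 0/1 values).
    total = mismatches = 0
    for row_a, row_b in zip(img_a, img_b):
        for pa, pb in zip(row_a, row_b):
            total += 1
            mismatches += pa != pb
    return mismatches / total

def cluster_words(images, threshold=0.1):
    # Greedily group word images into equivalence classes: join the
    # first class whose representative is close enough, else start a
    # new class. Real systems would also normalize size and slant.
    classes = []
    for img in images:
        for cls in classes:
            if pixel_mismatch(cls[0], img) <= threshold:
                cls.append(img)
                break
        else:
            classes.append([img])
    return classes
```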

76 citations


Proceedings ArticleDOI
01 Apr 1996
TL;DR: In this article, a user interface for remote access of the National Library of Medicine's VISible Human digital image library is presented, where users can visualize the library, browse contents, locate data of interest, and retrieve desired images.
Abstract: This paper proposes a user interface for remote access of the National Library of Medicine's Visible Human digital image library. Users can visualize the library, browse contents, locate data of interest, and retrieve desired images. The interface presents a pair of tightly coupled views into the library data. The overview image provides a global view of the overall search space, and the preview image provides details about high resolution images available for retrieval. To explore, the user sweeps the views through the search space and receives smooth, rapid, visual feedback of contents. Desired images are automatically downloaded over the internet from the library. Library contents are indexed by meta-data consisting of automatically generated miniature visuals. The interface software is completely functional and freely available for public use, at: http://www.nlm.nih.gov/ .

64 citations


Proceedings ArticleDOI
Gregory Crane1
01 Apr 1996
TL;DR: This paper outlines some of the preliminary findings in the Perseus project, an on-going digital library on ancient Greek culture that has been under development since 1987.
Abstract: This paper outlines some of our preliminary findings in the Perseus Project, an on-going digital library on ancient Greek culture that has been under development since 1987.

42 citations


Proceedings ArticleDOI
01 Apr 1996
TL;DR: A multilingual document browsing tool for users with no multilingual fonts on their terminal is presented, and a browser which sends a text string with the font glyphs required to display the text is proposed.
Abstract: Since a library is inherently multi-lingual, a multi-lingual document environment is crucial for a digital library. In the near future, worldwide information sharing through digital libraries will be common. Currently, multi-lingual documents are poorly facilitated on computers and the Internet. It is impractical to consider installing fonts for all character sets in every user's terminal. This paper presents a multilingual document browsing tool for a user with no multilingual fonts on his or her terminal. It discusses several methods for browsing multi-lingual documents and proposes a browser which sends a text string with the font glyphs required to display the text. It also gives evaluation results for the browser.

36 citations



Proceedings ArticleDOI
01 Apr 1996
TL;DR: Today's libraries aim to provide not only access to and delivery of information but have increasingly incorporated proactive services aimed at assisting in the interpretation and application of information to fulfill user information requirements.
Abstract: The conception of a library has evolved over the past 200 years from a place that houses a collection of information resources to a process of facilitating knowledge transfer from source to user. The facilitator role of the library encompasses the concept of a change agent, where the library acts as a proactive participant in the diffusion of appropriate knowledge to users. Today's libraries aim to provide not only access to and delivery of information but have increasingly incorporated proactive services aimed at assisting in the interpretation and application of information to fulfill user information requirements.

Proceedings Article
01 Apr 1996
TL;DR: This document contains papers presented at the First ACM International Conference on Digital Libraries; topics covered include information retrieval and index structures for structured documents.
Abstract: This document contains papers which were presented at the First ACM International Conference on Digital Libraries. Topics processed for this document included information retrieval and index structures for structured documents. Individual papers have been processed separately for the United States Department of Energy databases.

Proceedings ArticleDOI
01 Apr 1996
TL;DR: It appears from this cursory examination of the data that although there are fewer large bibliographic families than expected, the characteristics of bibliographic families are as Smiraglia predicted, and Leazer's proposed model for the control of bibliographic works appears to be accurate.

Proceedings ArticleDOI
Xia Lin1
01 Apr 1996
TL;DR: This paper presents a GTOC prototype based on Kohonen's self-organizing feature map algorithm; the GTOC can be generated automatically from the text of documents, and it visualizes document contents and relationships to allow easy access to the underlying documents.
Abstract: This paper proposes a graphical table of contents (GTOC) that is functionally analogous to a table of contents. The proposed GTOC can be generated automatically from the text of documents. It visualizes document contents and relationships to allow easy access to the underlying documents. It also provides various interactive tools to let the user explore the documents. Issues in generating such a GTOC include how documents are indexed and organized, how the organized documents are visualized, and what interactive means are needed to provide the necessary functionality. These issues are discussed in this paper with a GTOC prototype based on Kohonen's self-organizing feature map algorithm.
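For concreteness, a toy version of the map underlying such a GTOC can be sketched as a one-dimensional self-organizing map over term vectors. This is only a sketch: the prototype presumably uses a two-dimensional map, and the grid size, learning rate, and neighborhood schedule below are arbitrary assumptions.

```python
import random

def train_som(vectors, grid_size=4, epochs=50, lr=0.5, seed=0):
    # Train a 1-D self-organizing map: for each input, find the
    # best-matching unit (BMU) and pull it and its grid neighbors
    # toward the input, with shrinking neighborhood and learning rate.
    rng = random.Random(seed)
    dim = len(vectors[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(grid_size)]
    for epoch in range(epochs):
        for v in vectors:
            bmu = min(range(grid_size),
                      key=lambda i: sum((nodes[i][d] - v[d]) ** 2 for d in range(dim)))
            radius = max(1, grid_size // 2 - epoch * grid_size // (2 * epochs))
            influence = lr * (1 - epoch / epochs)
            for i in range(grid_size):
                if abs(i - bmu) <= radius:
                    for d in range(dim):
                        nodes[i][d] += influence * (v[d] - nodes[i][d])
    return nodes

def place(v, nodes):
    # Map a document's term vector to its best-matching grid cell,
    # which determines where the document appears in the GTOC layout.
    return min(range(len(nodes)),
               key=lambda i: sum((nodes[i][d] - v[d]) ** 2 for d in range(len(v))))
```

After training, similar documents land in nearby cells, which is what lets the map serve as a browsable table of contents.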

Proceedings ArticleDOI
01 Apr 1996
TL;DR: Initial link quality user studies indicate that the cluster-based hypertext link generation approach is promising; the issues of link completeness and link quality are also addressed.
Abstract: Automatic hypertext generation remains an extremely challenging endeavor in the digital library world. In this paper we present a solution for automatically connecting relevant information in dynamic textual digital libraries. This textual information is generally unconnected and often unexplored due to the large flow of information entering from remote and local sources. Often, full-text indexes exist for this information but embedded links to related information are conspicuously absent. Links that do exist are usually generated in an arduous and time-consuming manual process. That is why the ability to automatically generate links has a potentially high payoff. Our solution for the automatic generation of hypertext links relies on the techniques of document segmentation and document clustering. Hypertext links are automatically generated during the document clustering process using the incremental cover-coefficient-based clustering algorithm. The issues of link completeness and link quality are also addressed in this paper. Link completeness is studied by comparing the cluster-based approach of link generation to the exhaustive link generation approach. Results indicate that links are more complete in the higher similarity range than in the lower similarity range. Initial link quality user studies indicate that the cluster-based hypertext link generation approach is promising. In the future, we plan to conduct further studies on link quality and investigate ways to increase the effectiveness of our approach.
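The exhaustive link generation approach that the paper uses as its completeness baseline can be sketched directly: compute pairwise cosine similarity between document segments and emit a link for every pair above a threshold. The cluster-based cover-coefficient method itself is more involved and is not shown; the threshold and names here are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two text segments using raw term counts.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def generate_links(segments, threshold=0.3):
    # Exhaustive link generation: one hypertext link per segment pair
    # whose similarity clears the threshold (O(n^2) comparisons, which
    # is exactly why cluster-based generation is attractive at scale).
    links = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            sim = cosine(segments[i], segments[j])
            if sim >= threshold:
                links.append((i, j, sim))
    return links
```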

Proceedings ArticleDOI
01 Apr 1996
TL;DR: This work considers issues arising from the creation of digital libraries based on physical objects, focusing particularly on the characteristics of botanical herbaria and their users.
Abstract: Physical objects are the foundation for many of today's areas of scholarship, research, and education. Because physical objects are tangible, any digital representation of one is an approximation of the object. Knowing how to approximate requires an understanding of the work practices and needs of the library's constituencies. We consider issues arising from the creation of digital libraries based on physical objects, focusing particularly on the characteristics of botanical herbaria and their users.

Proceedings ArticleDOI
01 Apr 1996
TL;DR: A randomized index-splitting mechanism has been installed which allows the system to create a number of smaller indexes that can be independently and efficiently searched.
Abstract: In this paper we report on some recent developments in the joint NYU and GE natural language information retrieval system. The main characteristic of this system is the use of advanced natural language processing to enhance the effectiveness of term-based document retrieval. The system is designed around a traditional statistical backbone consisting of the indexer module, which builds inverted index files from pre-processed documents, and a retrieval engine which searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process users' natural language requests into effective search queries. This system has been used in NIST-sponsored Text Retrieval Conferences (TREC), where we worked with approximately 3.3 GBytes of text articles including material from the Wall Street Journal, the Associated Press newswire, the Federal Register, Ziff Communications' Computer Library, Department of Energy abstracts, U.S. Patents and the San Jose Mercury News, totaling more than 500 million words of English. The system has been designed to facilitate its scalability to deal with ever increasing amounts of data. In particular, a randomized index-splitting mechanism has been installed which allows the system to create a number of smaller indexes that can be independently and efficiently searched.
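The randomized index-splitting idea can be sketched as hash-based assignment of documents to independent sub-indexes, each of which can be built and searched on its own (and in parallel). The class and its parameters are invented for illustration, not the NYU/GE system's actual design:

```python
import hashlib
from collections import defaultdict

class SplitIndex:
    def __init__(self, num_shards=4):
        # Each shard is an independent inverted index: term -> doc-id set.
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def _shard(self, doc_id):
        # Randomized (hash-based) assignment of a document to a sub-index,
        # spreading documents roughly evenly across shards.
        h = hashlib.md5(str(doc_id).encode()).hexdigest()
        return int(h, 16) % len(self.shards)

    def add(self, doc_id, text):
        shard = self.shards[self._shard(doc_id)]
        for term in text.lower().split():
            shard[term].add(doc_id)

    def search(self, term):
        # Each smaller index is searched independently; in practice the
        # per-shard lookups could run in parallel before merging.
        result = set()
        for shard in self.shards:
            result |= shard.get(term.lower(), set())
        return result
```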

Proceedings ArticleDOI
01 Apr 1996
TL;DR: The system, SDLS, is a hypermedia-based digital library browser, authoring system, and document viewer in which users navigate using geographical displays to locate and retrieve information.
Abstract: Spatial Approach to Organizing and Locating Digital Libraries and Their Content. Jason Orendorf and Charles Kacmar, Department of Computer Science, Florida State University, Tallahassee, Florida. Explosive growth of World Wide Web (WWW) sites combined with the lack of an overall and consistent organizational structure is making it increasingly difficult for researchers and users to locate relevant materials. This paper proposes a spatial method of structuring digital libraries and their content in which users navigate geographically to locate and access information. A prototype based on a spatial methodology was implemented to further study this organizational structure. The system, SDLS, is a hypermedia-based digital library browser, authoring system, and document viewer in which users navigate using geographical (map) displays to locate and retrieve information. This method of access provides a natural means of information retrieval for geographically-based repositories and reference materials.

Proceedings ArticleDOI
01 Apr 1996
TL;DR: This paper presents how a classification system is built to serve the visualization purposes and discusses presentation and interaction strategies for visual relevance analysis followed by implementation issues and system overview.
Abstract: In order to access relevant information in digital libraries, most traditional systems feature topic search. In this paper we present visual relevance analysis to extend the notion of topic search by relying on visualization and interaction techniques to help users rapidly browse through potentially relevant documents. Visual relevance analysis offers a better repartition of control between the user and the system for topic search. The interaction paradigm uses a library metaphor, implemented through a classification system. In this paper we first present how a classification system is built to serve visualization purposes. We further discuss presentation and interaction strategies for visual relevance analysis, followed by implementation issues and a system overview. Finally we briefly review related work and compare it with our approach.



Proceedings ArticleDOI
01 Apr 1996
TL;DR: General propositions about the task and context of information product evaluation are proposed and used to develop a new model (Information Product Evaluation Model) incorporating aspects of the user’s context, meta-information availability, and accessibility.
Abstract: Knowledge workers are routinely engaged in information search and retrieval (ISR) tasks where they make evaluations of complex information products such as electronic documents or multi-media items. Information Systems (IS) organizations in business support the creation of these complex information products as well as providing tools and support for their acquisition and use. Some ISR assumptions, such as an information need exists independently of the ability of the repository to satisfy it, or an information need can be specified by objective terms, can be problematic for knowledge workers. An alternative approach considers information products as elements of an asynchronous communication; it explicitly considers evaluation after retrieval and the types of support provided by IS groups. General propositions about the task and context of information product evaluation are proposed and used to develop a new model (Information Product Evaluation Model) incorporating aspects of the user’s context, meta-information availability, and accessibility.