Author

Edward A. Fox

Bio: Edward A. Fox is an academic researcher from Virginia Tech. The author has contributed to research on topics including Digital library & Metadata. The author has an h-index of 53 and has co-authored 522 publications receiving 13,862 citations. Previous affiliations of Edward A. Fox include University of Maryland, College Park and Cornell University.


Papers
Proceedings Article
01 Jan 1994
TL;DR: This paper describes one method that has been shown to increase performance by combining the similarity values from five different retrieval runs using both vector space and P-norm extended boolean retrieval methods.
Abstract: The TREC-2 project at Virginia Tech focused on methods for combining the evidence from multiple retrieval runs to improve performance over any single retrieval method. This paper describes one such method that has been shown to increase performance by combining the similarity values from five different retrieval runs using both vector space and P-norm extended Boolean retrieval methods.
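The combination idea can be sketched as score-level fusion: sum each document's similarity values across runs and re-rank by the combined score. This is a minimal CombSUM-style sketch under assumed comparable score scales, not the paper's exact method; the run contents are illustrative.

```python
def fuse_runs(runs):
    """Combine similarity values from multiple retrieval runs by
    summing each document's scores across runs (a CombSUM-style
    fusion; the paper's exact weighting is not specified here)."""
    fused = {}
    for run in runs:                      # each run: {doc_id: similarity}
        for doc_id, score in run.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + score
    # rank documents by the combined similarity, best first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# toy example: a vector-space run and a P-norm run scoring three documents
vector_run = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
pnorm_run = {"d1": 0.5, "d2": 0.7}
ranking = fuse_runs([vector_run, pnorm_run])
```

A document retrieved highly by several independent runs rises in the fused ranking even if no single run put it first.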

1,106 citations

Journal ArticleDOI
TL;DR: A new, extended Boolean information retrieval system is introduced which is intermediate between the Boolean system of query processing and the vector processing model, and laboratory tests indicate that the extended system produces better retrieval output than either the Boolean or the vector processing systems.
Abstract: In conventional information retrieval, Boolean combinations of index terms are used to formulate the users' information requests. While any document is in principle retrievable by a Boolean query, the amount of output obtainable by Boolean processing is difficult to control, and the retrieved items are not ranked in any presumed order of importance to the user population. In the vector processing model of retrieval, the retrieved items are easily ranked in decreasing order of the query-record similarity, but the queries themselves are unstructured and expressed as simple sets of weighted index terms. A new, extended Boolean information retrieval system is introduced which is intermediate between the Boolean system of query processing and the vector processing model. The query structure inherent in the Boolean system is preserved, while at the same time weighted terms may be incorporated into both queries and stored documents; the retrieved output can also be ranked in strict similarity order with the user queries. A conventional retrieval system can be modified to make use of the extended system. Laboratory tests indicate that the extended system produces better retrieval output than either the Boolean or the vector processing systems.
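The intermediate model the abstract describes is the P-norm model, whose standard similarity formulas can be sketched as follows (assuming term weights normalized to [0, 1]). With p = 1 the operators reduce to a vector-like average; as p grows they approach strict Boolean behavior.

```python
def pnorm_or(weights, p):
    """Extended Boolean OR: ((w1^p + ... + wn^p) / n)^(1/p)."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1 / p)

def pnorm_and(weights, p):
    """Extended Boolean AND: 1 - (((1-w1)^p + ... + (1-wn)^p) / n)^(1/p)."""
    n = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / n) ** (1 / p)

# A document matching one of two OR'ed terms scores between the strict
# Boolean extremes 0 and 1: with p = 2 it gets sqrt(1/2) ~ 0.707.
score = pnorm_or([1.0, 0.0], p=2)
```

Because the score varies smoothly with the term weights, retrieved documents can be ranked in strict similarity order, which plain Boolean processing cannot do.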

909 citations

Journal ArticleDOI
TL;DR: This report outlines IBM’s perspective on key supporting technologies and on the unique challenges highlighted by the emergence of digital libraries.
Abstract: [The article opens with a table of digital-library topic areas, flattened during extraction; its entries are: Abstracting, Accessibility, Agents, Annotation, Archive, Billing/charging, Browsing, Catalog, Classification, Clustering, Commercial service, Content conversion, Copyright clearance, Courseware, Database, Diagrams (e.g., CAD), Digital video, Discipline-level library, Distributed processing, Document analysis, Document model, Economic study, Education support, Electronic publishing, Ethnographic study, Filtering, Geographic information system, Hypermedia, Hypertext, Image processing, Indexing, Information retrieval, Intellectual property rights, Interactive, Knowbot, Knowledge base, Library science, Mediator, Multilingual, Multimedia stream playback, Multimedia systems, Multimodal, National library, Navigation, Object-oriented, OCR, OODB support, Personalization, Preservation, Privacy, Publisher library, Repository, Scalability, Searching, Security, Sociological study, Standard, Storage, Subscription, Sustainability, Training support, Usability, Virtual (integration), Visualization, World-Wide Web.] … its characterization of digital libraries. Many important projects and perspectives have been omitted. Here we give some pointers to aid further exploration, and of course we encourage interested readers to attend the numerous conferences and workshops scheduled in this field, many sponsored by or in cooperation with ACM and its SIGs. One early journal special issue is introduced in [6]. It includes articles on copyright and intellectual property rights, a subscription model for handling funds transfer related to digital libraries, a description of the evolution of the WAIS search system in general and its interfaces in particular, an overview of the Right Pages system and its use of OCR and document analysis algorithms, and an early overview of the Envision system [7]. We note that to many, intellectual property rights issues and ways to obtain revenue streams to sustain digital libraries are the most important open problems. The largest digital library conference makes its proceedings available over the WWW [9].
These contain many insightful discussions, proposals of new research ideas, descriptions of base technologies, and explanations of how the broad concept of a digital library fits in with the needs of specific user communities and the information they require. Readers can find a variety of works on agents, architectures, catalogs, collaboration, compression, document analysis from OCR and page images, document structure, electronic journals, heterogeneous sources, knowledge-based approaches, library science, numerical data collections, object stores, and organizational usability. For more details on the origins of the Digital Library Initiative, and for a variety of perspectives on open research problems, we refer the reader to [5]. This work also has numerous pointers to people, projects, institutions, and other reference works in the area. For a perspective on the role the computer industry should have in this field, see [10]. This report outlines IBM’s perspective on key supporting technologies and on the unique challenges highlighted by the emergence of digital libraries. We expect considerable interest from the corporate sector as well as from government agencies in this important area of information technology. For lack of space, we have had to omit many publications on networking and storage technologies, sociological and ethnographic studies, library and information science, OCR and document analysis or conversion, and rights management. These and other works are needed to round out the discussion of digital libraries. However, we encourage you to read the rest of this issue as a good starting point for your future studies of this important field. We invite you to not only use but also help in the creation of a future World Digital Library System!

654 citations

Journal ArticleDOI
TL;DR: Findings from an exploratory study conducted with government officials in Arlington, VA between June and December 2010 are presented, with the broad goal of understanding social media use by government officials as well as community organizations, businesses, and the public at large.

497 citations

18 Jul 1995
TL;DR: This work assesses the potential of proxy servers to cache documents retrieved with the HTTP protocol, and finds that a proxy server really functions as a second level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users.
Abstract: As the number of World-Wide Web users grows, so does the number of connections made to servers. This increases both network load and server load. Caching can reduce both loads by migrating copies of server files closer to the clients that use those files. Caching can either be done at a client or in the network (by a proxy server or gateway). We assess the potential of proxy servers to cache documents retrieved with the HTTP protocol. We monitored traffic corresponding to three types of educational workloads over a one-semester period, and used this as input to a cache simulation. Our main findings are (1) that with our workloads a proxy has a 30-50% maximum possible hit rate no matter how it is designed; (2) that when the cache is full and a document is replaced, least recently used (LRU) is a poor policy, but simple variations can dramatically improve hit rate and reduce cache size; (3) that a proxy server really functions as a second-level cache, and its hit rate may tend to decline with time after initial loading given a more or less constant set of users; and (4) that certain tuning configuration parameters for a cache may have little benefit.
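The cache simulation described above can be reproduced in miniature. The sketch below replays a request trace through a fixed-size LRU cache and reports the hit rate; the trace and capacity are toy values, and a real proxy cache would be sized in bytes rather than documents.

```python
from collections import OrderedDict

def simulate_lru(requests, capacity):
    """Replay a trace of document requests through an LRU cache of
    fixed capacity (counted in documents for simplicity) and
    return the fraction of requests served from the cache."""
    cache = OrderedDict()
    hits = 0
    for doc in requests:
        if doc in cache:
            hits += 1
            cache.move_to_end(doc)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[doc] = True
    return hits / len(requests)

# toy trace: repeated requests to a small set of documents
rate = simulate_lru(["a", "b", "a", "c", "a", "b"], capacity=2)
```

Swapping in a different eviction policy (e.g. evicting the largest document first) only changes the `popitem` line, which is what makes trace-driven simulation convenient for comparing replacement policies.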

495 citations


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, handwriting recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules.
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).
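The mail-filtering example can be made concrete with a bag-of-words sketch: count word occurrences in messages the user kept versus rejected, and flag new mail whose words are more typical of rejected messages. This is an illustrative toy with made-up messages, not a production filter; a real system would use something like naive Bayes with smoothing.

```python
from collections import Counter

def train_filter(kept, rejected):
    """Learn per-word counts from messages the user kept vs rejected
    (a bag-of-words sketch of the mail-filtering idea in the text)."""
    kept_counts = Counter(w for m in kept for w in m.split())
    rejected_counts = Counter(w for m in rejected for w in m.split())
    return kept_counts, rejected_counts

def looks_rejected(message, kept_counts, rejected_counts):
    """Flag a message whose words occur more often in rejected mail."""
    kept_score = sum(kept_counts[w] for w in message.split())
    rejected_score = sum(rejected_counts[w] for w in message.split())
    return rejected_score > kept_score

# hypothetical training messages for illustration
kc, rc = train_filter(
    kept=["meeting at noon", "project report draft"],
    rejected=["free money offer", "free prize claim"],
)
```

As the user keeps rejecting or keeping new messages, retraining on the growing history updates the rules automatically, which is the maintenance burden the passage says learning removes from the user.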

13,246 citations

Journal ArticleDOI
TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
Abstract: The experimental evidence accumulated over the past 20 years indicates that text-indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term weighting systems. This paper summarizes the insights gained in automatic term weighting, and provides baseline single-term indexing models with which other more elaborate content analysis procedures can be compared.
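A standard instance of weighted single-term indexing is tf-idf. The sketch below computes raw term frequency times log inverse document frequency; this specific variant is illustrative, since the paper compares several weighting schemes rather than prescribing one.

```python
import math

def tfidf_weights(docs):
    """Assign tf-idf weights to the single terms of each document:
    term frequency within the document times log(N / document
    frequency) across the collection."""
    n = len(docs)
    df = {}                               # in how many documents each term appears
    for doc in docs:
        for term in set(doc.split()):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in docs:
        tokens = doc.split()
        tf = {t: tokens.count(t) for t in set(tokens)}
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted

# toy three-document collection
w = tfidf_weights(["cat sat mat", "cat ran", "dog ran far"])
```

Terms that occur in many documents get low weights (here "cat" appears in two of three documents), while terms confined to one document get the full log(N) factor, which is exactly the discrimination behavior effective weighting schemes aim for.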

9,460 citations

Book
01 Jan 2009

8,216 citations

Journal ArticleDOI
01 Jun 2010
TL;DR: A brief overview of clustering is provided, well known clustering methods are summarized, the major challenges and key issues in designing clustering algorithms are discussed, and some of the emerging and useful research directions are pointed out.
Abstract: Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data and is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty in designing a general purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.
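The K-means algorithm mentioned above is short enough to sketch in full: Lloyd's iteration, alternating between assigning each point to its nearest centroid and recomputing each centroid as its cluster's mean. The points, k, and iteration count below are toy values.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm for K-means on 2-D points, initialized by
    sampling k distinct points from the data."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                          + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # update step: move each centroid to its cluster's mean
        for i, cl in enumerate(clusters):
            if cl:                        # keep old centroid if cluster empties
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

# two well-separated toy clusters
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents = kmeans(pts, k=2)
```

The ill-posedness the abstract mentions shows up even here: a different random seed can yield a different local optimum, which is one reason K-means is usually run from several initializations.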

6,601 citations