
Showing papers on "Semantic Web published in 1999"


Proceedings Article
07 Sep 1999
TL;DR: This paper develops novel algorithms for enumerating and organizing all web occurrences of certain subgraphs that are signatures of web phenomena such as tightly-focused topic communities, webrings, taxonomy trees, keiretsus, etc., and argues that these algorithms run efficiently under a proposed model of web graph evolution.
Abstract: The subject of this paper is the creation of knowledge bases by enumerating and organizing all web occurrences of certain subgraphs. We focus on subgraphs that are signatures of web phenomena such as tightly-focused topic communities, webrings, taxonomy trees, keiretsus, etc. For instance, the signature of a webring is a central page with bidirectional links to a number of other pages. We develop novel algorithms for such enumeration problems. A key technical contribution is the development of a model for the evolution of the web graph, based on experimental observations derived from a snapshot of the web. We argue that our algorithms run efficiently in this model, and use the model to explain some statistical phenomena on the web that emerged during our experiments. Finally, we describe the design and implementation of Campfire, a knowledge base of over one hundred thousand web communities.

282 citations
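
The webring signature mentioned above is concrete enough to sketch: a center page with bidirectional links to several member pages can be found by scanning a link graph. The toy Python sketch below illustrates that enumeration step only; the adjacency representation and the threshold k are assumptions, not details of the Campfire system.

```python
# Minimal sketch: enumerate "webring-like" signatures in a directed link graph.
# A signature here is a center page with bidirectional links to >= k members.
# The graph representation and threshold are illustrative assumptions, not
# details of the Campfire knowledge base described above.

def find_webring_centers(links, k=3):
    """links: dict mapping page -> set of pages it links to."""
    centers = []
    for page, outgoing in links.items():
        # members are pages linked in both directions (page <-> member)
        members = {m for m in outgoing if page in links.get(m, set())}
        if len(members) >= k:
            centers.append((page, members))
    return centers

if __name__ == "__main__":
    toy_graph = {
        "ring/home": {"a", "b", "c"},
        "a": {"ring/home"},
        "b": {"ring/home"},
        "c": {"ring/home"},
        "d": {"a"},
    }
    print(find_webring_centers(toy_graph, k=3))
```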


01 Jan 1999
TL;DR: A wide range of heuristics for adjusting document rankings based on the special HTML structure of Web documents are described, including a novel one inspired by reinforcement learning techniques for propagating rewards through a graph which can be used to improve a search engine's rankings.
Abstract: Indexing systems for the World Wide Web, such as Lycos and Alta Vista, play an essential role in making the Web useful and usable. These systems are based on Information Retrieval methods for indexing plain text documents, but also include heuristics for adjusting their document rankings based on the special HTML structure of Web documents. In this paper, we describe a wide range of such heuristics, including a novel one inspired by reinforcement learning techniques for propagating rewards through a graph, which can be used to affect a search engine's rankings. We then demonstrate a system which learns to combine these heuristics automatically, based on feedback collected unintrusively from users, resulting in much improved rankings.

239 citations
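
The reward-propagation heuristic can be pictured as repeatedly feeding a discounted share of a linked page's relevance score back to the pages that link to it. The sketch below is a simplified illustration under assumed data structures and parameters, not the ranking system described in the paper.

```python
# Simplified sketch of propagating relevance "rewards" backwards through
# hyperlinks with a discount factor, loosely in the spirit of the heuristic
# described above. Graph shape, scores, and parameters are assumptions.

def propagate_rewards(links, base_scores, gamma=0.5, iterations=3):
    """links: dict page -> list of pages it links to.
    base_scores: dict page -> initial relevance from plain-text retrieval."""
    scores = dict(base_scores)
    for _ in range(iterations):
        updated = dict(base_scores)
        for page, targets in links.items():
            # a page earns a discounted share of the reward of pages it points to
            updated[page] = base_scores.get(page, 0.0) + gamma * sum(
                scores.get(t, 0.0) for t in targets
            )
        scores = updated
    return scores

if __name__ == "__main__":
    links = {"hub": ["doc1", "doc2"], "doc1": [], "doc2": ["doc1"]}
    base = {"doc1": 1.0, "doc2": 0.2, "hub": 0.0}
    print(propagate_rewards(links, base))
```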


Book ChapterDOI
01 Sep 1999
TL;DR: This paper focuses on web data mining research in the context of the authors' web warehousing project called WHOWEDA (Warehouse of Web Data), and categorizes web data mining into three areas: web content mining, web structure mining and web usage mining.
Abstract: In this paper, we discuss mining with respect to web data, referred to here as web data mining. In particular, our focus is on web data mining research in the context of our web warehousing project called WHOWEDA (Warehouse of Web Data). We have categorized web data mining into three areas: web content mining, web structure mining and web usage mining. We have highlighted and discussed various research issues involved in each of these web data mining categories. We believe that web data mining will be the topic of exploratory research in the near future.

203 citations


01 Jan 1999
TL;DR: The paper describes the novel technique of categorization by context, which instead extracts useful information for classifying a document from the context where a URL referring to it appears, and presents the results of experimenting with Theseus, a classifier that exploits this technique.
Abstract: Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organize documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult, due to the sheer amount of material on the Web; it is thus becoming necessary to resort to techniques for the automatic classification of documents. Automatic classification is traditionally performed by extracting the information for representing a document (“indexing”) from the document itself. The paper describes the novel technique of categorization by context, which instead extracts useful information for classifying a document from the context where a URL referring to it appears. We present the results of experimenting with Theseus, a classifier that exploits this technique.

192 citations
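
Categorization by context classifies a target document using the words around the URL that refers to it in other pages, rather than the target's own content. A minimal sketch of that idea follows; the keyword-based scoring and category sets are invented for illustration and are not the Theseus classifier.

```python
# Minimal sketch of "categorization by context": classify a target URL using
# the anchor text found in the *referring* page. Categories, keywords, and the
# overlap-based scoring are illustrative assumptions, not Theseus itself.
from html.parser import HTMLParser

class AnchorContextParser(HTMLParser):
    """Collect anchor text for each href found in a referring page."""
    def __init__(self):
        super().__init__()
        self.contexts = {}          # href -> list of surrounding words
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._current_href:
            self.contexts.setdefault(self._current_href, []).extend(data.lower().split())

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

def classify_by_context(words, categories):
    """Pick the category whose keyword set overlaps most with the context words."""
    return max(categories, key=lambda c: len(categories[c] & set(words)))

if __name__ == "__main__":
    page = '<p>Great <a href="http://example.org/lisp">Lisp programming tutorial</a></p>'
    parser = AnchorContextParser()
    parser.feed(page)
    cats = {"programming": {"lisp", "programming", "code"}, "sports": {"football", "match"}}
    for href, words in parser.contexts.items():
        print(href, "->", classify_by_context(words, cats))
```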


01 Jan 1999
TL;DR: This article illustrates this by showing that an Example-Based approach to lexical choice for machine translation can use the Web as an adequate and free resource.
Abstract: The WWW is two orders of magnitude larger than the largest corpora. Although noisy, web text presents language as it is used, and statistics derived from the Web can have practical uses in many NLP applications. For this reason, the WWW should be seen and studied as any other computationally available linguistic resource. In this article, we illustrate this by showing that an Example-Based approach to lexical choice for machine translation can use the Web as an adequate and free resource.

175 citations
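
The example-based idea amounts to preferring the target-language collocation that is most frequent on the Web. A small sketch, assuming collocation frequencies have already been obtained (for example from search-engine hit counts), could look like this; the counts and word lists below are made up.

```python
# Sketch of example-based lexical choice: among candidate translations of an
# ambiguous word, pick the collocation with the highest Web/corpus frequency.
# The hit counts below are invented stand-ins for counts a real system would
# obtain by querying the Web.

def choose_translation(context_word, candidates, counts):
    """Return the candidate whose collocation with context_word is most frequent."""
    return max(candidates, key=lambda c: counts.get((context_word, c), 0))

if __name__ == "__main__":
    # e.g. choosing among "big", "large", "tall" to modify "house"
    hit_counts = {("house", "big"): 120_000, ("house", "large"): 310_000,
                  ("house", "tall"): 9_000}
    print(choose_translation("house", ["big", "large", "tall"], hit_counts))
```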


Journal ArticleDOI
TL;DR: The authors investigated the intersection between corporate World Wide Web pages and the publics they serve and found that while the typical corporate Web page is used to service news media, customers, and the financial community, it is not being used to its fullest potential to communicate simultaneously with other audiences.
Abstract: Against the backdrop of the rapid growth of the Internet, this research study investigates the intersection between corporate World Wide Web pages and the publics they serve. Content analysis revealed that, while the typical corporate Web page is used to service news media, customers, and the financial community, it is not being used to its fullest potential to communicate simultaneously with other audiences. Through a cluster analysis procedure, the researchers found about one-third of corporate Web sites are assertively used to communicate with a multiplicity of audiences in a variety of information formats.

148 citations


Journal ArticleDOI
TL;DR: This study is first a preliminary exploration into Web page and Web site mortality rates, then considers two types of change: content and structural, and explores the “short memory” and “mind changing” of the World Wide Web.
Abstract: We recognize that documents on the World Wide Web are ephemeral and changing. We also recognize that Web documents can be categorized along a number of dimensions, including “publisher,” size, object mix, as well as purpose, meaning, and content. This study is first a preliminary exploration into Web page and Web site mortality rates. It then considers two types of change: content and structural. Finally, the study is concerned with understanding those constancy and permanence phenomena for different Web document classes. It is suggested that, from the perspective of information maintenance and retrieval, the WWW does not represent revolutionary change. In fact, in some ways the Web is a less sophisticated form than traditional publication practices. Finally, this study explores the “short memory” and “mind changing” of the World Wide Web.

146 citations


31 Jul 1999
TL;DR: The general architecture and main components of On2broker are discussed and the use of ontologies to make explicit the semantics of web pages is provided.
Abstract: On2broker provides brokering services to improve access to heterogeneous, distributed and semistructured information sources as they are presented in the World Wide Web. It relies on the use of ontologies to make explicit the semantics of web pages. In the paper we will discuss the general architecture and main components of On2broker and provide some application scenarios.

131 citations


Proceedings ArticleDOI
31 Jul 1999
TL;DR: A survey and analysis of traditional, new, and arising Web standards and show how they can be used to represent machine-processable semantics of Web sources to help AI researchers and practitioners to apply their results to real Web documents.
Abstract: The lack of semantic markup is a major barrier to the development of more intelligent document processing on the Web. Current HTML markup is used only to indicate the structure and lay-out of documents, but not the document semantics. Unfortunately, proposals from the AI community for Web-based knowledge-representation languages can hardly expect wide acceptance on the Web. Even if unpalatable for the AI community, the question should instead be how well AI concepts can be fitted into the markup languages that are widely supported on the Web, either now or in the foreseeable future. We provide a survey and analysis of traditional, new, and arising Web standards and show how they can be used to represent machine-processable semantics of Web sources. The results of this paper should help AI researchers and practitioners to apply their results to real Web documents, instead of basing themselves on AI-specific representations that have no chance of becoming widely used on the Web.

76 citations
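
Of the standards surveyed above, RDF is the most direct way to attach machine-processable semantics to a Web resource. The snippet below is a sketch using the rdflib Python library (which must be installed separately) with an invented vocabulary; it shows the flavor of such annotations rather than anything proposed in the paper.

```python
# Sketch: describing a Web page with RDF triples so that its semantics are
# machine-processable. Uses the rdflib library; the vocabulary URI and the
# statements themselves are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/vocab#")

g = Graph()
page = URIRef("http://example.org/people/alice.html")
g.add((page, RDF.type, EX.Homepage))
g.add((page, EX.topic, Literal("machine learning")))
g.add((page, EX.maintainedBy, Literal("Alice")))

# emit the triples in Turtle syntax
print(g.serialize(format="turtle"))
```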


Proceedings ArticleDOI
12 Oct 1999
TL;DR: This paper presents a preliminary discussion about Web mining, including its definition, the relationship between information mining and information retrieval on the Web, and the taxonomy and the function of Web mining.
Abstract: With the flood of information on the World Wide Web, Web mining is a new research issue which is drawing great interest from many communities. Currently, there is no agreement about Web mining; it needs more discussion among researchers in order to define exactly what it is. Meanwhile, the development of Web mining systems will in turn promote its research. In this paper, we present a preliminary discussion about Web mining, including its definition, the relationship between information mining and information retrieval on the Web, and the taxonomy and the function of Web mining. In addition, a prototype system called WebTMS (Web Text Mining System) has been designed. WebTMS is a multi-agent system which combines text mining and multi-dimensional document analysis to help users mine HTML documents on the Web effectively.

68 citations


Journal Article
TL;DR: In this paper, the authors discuss the use of the personal ontology and propose an organization scheme based on a model of an office and its information, an ontology, coupled with the proper tools for using it.
Abstract: Corporations can suffer from too much information, and it is often inaccessible, inconsistent, and incomprehensible. The corporate solution entails knowledge management techniques and data warehouses. The paper discusses the use of the personal ontology. A promising approach is an organization scheme based on a model of an office and its information (an ontology), coupled with the proper tools for using it.

Proceedings ArticleDOI
01 Sep 1999
TL;DR: An approach is presented that uses alternative simple visualizations grouped around the traditional result list, for use with a local meta web search engine.
Abstract: The idea of Information Visualization is to get insights into great amounts of abstract data. Document sets found by searching the World Wide Web are a special challenge. The paper gives a short overview of the variety of possible visualizations for this application area. The presented ideas are grouped using the four-phase framework of information seeking. Crucial factors for the success of visualizations are discussed. An approach is presented that uses alternative simple visualizations grouped around the traditional result list, for use with a local meta web search engine.

Proceedings ArticleDOI
05 Jan 1999
TL;DR: The experiences with client- and proxy-server based implementations of an annotation system architecture are described, pointing to missing elements in the current Web infrastructure that make any implementation of annotation systems less than completely satisfactory.
Abstract: Annotations are a broadly useful mechanism that can support a number of useful document management applications (third-party commentary, design rationale, information filtering and semantic labelling of document content, to name just a few). The ubiquity of World Wide Web content motivates the need for Web annotation systems that are lightweight, efficient, non-intrusive (preferably transparent), platform-independent and scalable. Building such a system using open and standard Web infrastructures (as opposed to proprietary ones) facilitates widespread applicability and deployment. In practice, there are a number of ways to do this, all of which instantiate a common abstract architecture based on intermediaries. This paper describes our experiences with client- and proxy-server based implementations of an annotation system architecture. The implementations point to missing elements in the current Web infrastructure that make any implementation of annotation systems less than completely satisfactory. This paper discusses these elements of current Web infrastructure, and potential changes to the Web architecture that might make the implementation of annotation systems more complete.

Book
01 Jan 1999
TL;DR: This work proposes a generalisation of joins from the relational database model to enable joins on arbitrarily complex structured data in a higher-order representation, and extends this model to support approximate joins of heterogeneous data.
Abstract: Integrating heterogeneous data from sources as diverse as web pages, digital libraries, knowledge bases, the Semantic Web and databases is an open problem. The ultimate aim of our work is to be able to query such heterogeneous data sources as if their data were conveniently held in a single relational database. Pursuant to this aim, we propose a generalisation of joins from the relational database model to enable joins on arbitrarily complex structured data in a higher-order representation. By incorporating kernels and distances for structured data, we further extend this model to support approximate joins of heterogeneous data. We demonstrate the flexibility of our approach in the publications domain by evaluating example approximate queries on the CORA data sets, joining on types ranging from sets of co-authors through to entire publications.
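
An approximate join replaces equality on the join attribute with a similarity test, which is what makes joining noisy bibliographic records possible. The toy sketch below uses a plain string-similarity threshold from Python's difflib instead of the kernels and distances for structured data that the work actually proposes.

```python
# Toy sketch of an approximate join: pair up records from two sources whenever
# their join attributes are sufficiently similar, instead of exactly equal.
# Uses difflib string similarity; the threshold and data are assumptions and
# stand in for the kernel/distance machinery described above.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def approximate_join(left, right, key_left, key_right, threshold=0.8):
    """Yield pairs of records whose join keys approximately match."""
    for l in left:
        for r in right:
            if similar(l[key_left], r[key_right], threshold):
                yield (l, r)

if __name__ == "__main__":
    papers = [{"title": "Learning to Extract Symbolic Knowledge from the Web"}]
    citations = [{"cited": "Learning to extract symbolic knowledge from the WWW"}]
    for pair in approximate_join(papers, citations, "title", "cited", 0.7):
        print(pair)
```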

Proceedings ArticleDOI
01 Sep 1999
TL;DR: This work regards the Web and its contents as a unit, represented in an object-oriented data model: the Web structure, given by its hyperlinks, the parse-trees of Web pages (intra-document level), and their contents, and the model is complemented by a rule-based object- oriented language.
Abstract: For accessing and processing the information provided on the Web, there is a need for extraction, restructuring, and integration of semistructured data from autonomous, heterogeneous sources. We regard the Web and its contents as a unit, represented in an object-oriented data model: the Web structure (inter-document level), given by its hyperlinks, the parse-trees of Web pages (intra-document level), and their contents. The model is complemented by a rule-based object-oriented language which is extended by Web access capabilities and allows for navigation in the unified model. We show the practicability of our approach by using the FLORID system.

Journal ArticleDOI
17 May 1999
TL;DR: The Mimicry system is introduced that allows authors and readers to link to and from temporal media (video and audio) on the Web and is integrated with the Arakne Environment, an open hypermedia integration aimed at Web augmentation.
Abstract: The World Wide Web has since its beginning provided linking to and from text documents encoded in HTML. The Web has evolved and most Web browsers now support a rich set of media types either by default or by the use of specialised content handlers, known as plug-ins. The limitations of the Web linking model are well known and they also extend into the realm of the other media types currently supported by Web browsers. This paper introduces the Mimicry system that allows authors and readers to link to and from temporal media (video and audio) on the Web. The system is integrated with the Arakne Environment, an open hypermedia integration aimed at Web augmentation. The links created are stored externally, allowing for links to and from resources not owned by the (link) author. Based on the experiences a critique is raised of the limited APIs supported by plug-ins.

Journal ArticleDOI
TL;DR: The W3C provides a broad array of information, organized into general categories ranging from a general history of the Web to an archive of released technical reports and specifications, which is free to all.
Abstract: Just what direction is the Web headed? For answers to this and other general Internet or markup-language questions, tune your browser to http://www.w3.org/ for a vast and varied selection of information. Because the W3C establishes recommendations concerning the Web, this site offers interesting possibilities as to future directions for the Web. While the W3C does not exercise the influence of an official standards-setting organization, it has been influential in bringing together industry members and other interested parties to develop solutions and circulate recommendations to the public and members, to provide the means for establishing guidelines for future Web development, and to serve as a forum for members to meet and discuss common problems. Membership is not free, but the website serves as a central location to disseminate the technical specifications written by the consortium, as well as other related information, which is free to all. The website does provide a very comprehensive assortment of information about the Web, and is well-structured and organized. The structure has a good balance between depth and breadth of pages. The website provides a broad array of information, organized into general categories ranging from a general history of the Web to an archive of released technical reports and specifications. The technical reports are also noted in the press releases that are available at the website, and include a chronological listing of drafts.

BookDOI
01 Jan 1999
TL;DR: This work argues that cache performance can be improved by integrating cache replacement and consistency algorithms, and presents a unified algorithm, LNC-R-W3-U, which achieves performance comparable (and often superior) to most of the published cache replacement algorithms while significantly reducing the staleness of the cached documents.
Abstract: Caching of Web documents improves the response time perceived by the clients. Cache replacement algorithms play a central role in the response time reduction by selecting a subset of documents for caching so that an appropriate performance metric is maximized. At the same time, the cache must take extra steps to guarantee some form of consistency of the cached data. Cache consistency algorithms enforce appropriate guarantees about the staleness of the documents the cache stores. Most of the published work on Web cache design either considers cache consistency algorithms separately from cache replacement algorithms or concentrates only on studying one of the two. We argue that cache performance can be improved by integrating cache replacement and consistency algorithms. We present a unified algorithm, LNC-R-W3-U. Using trace-based experiments, we demonstrate that LNC-R-W3-U achieves performance comparable (and often superior) to most of the published cache replacement algorithms and at the same time significantly reduces the staleness of the cached documents.
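
The integration argument is that eviction decisions and staleness decisions can share one set of bookkeeping. The sketch below combines an LRU-style replacement policy with a simple TTL staleness check in a single structure; it only illustrates that integration idea and is not the LNC-R-W3-U algorithm.

```python
# Much-simplified sketch of a Web cache that integrates replacement (LRU-style
# eviction) with consistency (a TTL-based staleness check) in one structure.
# This is an illustration of the integration idea only, not LNC-R-W3-U.
import time
from collections import OrderedDict

class SimpleWebCache:
    def __init__(self, capacity=100, ttl_seconds=300.0):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self._store = OrderedDict()        # url -> (document, fetch_time)

    def get(self, url):
        entry = self._store.get(url)
        if entry is None:
            return None
        document, fetched_at = entry
        if time.time() - fetched_at > self.ttl:
            # consistency: the cached copy is considered stale, drop it
            del self._store[url]
            return None
        self._store.move_to_end(url)       # replacement: mark as recently used
        return document

    def put(self, url, document):
        self._store[url] = (document, time.time())
        self._store.move_to_end(url)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry

if __name__ == "__main__":
    cache = SimpleWebCache(capacity=2, ttl_seconds=60)
    cache.put("http://example.org/a", "<html>A</html>")
    print(cache.get("http://example.org/a"))
```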

Journal ArticleDOI
TL;DR: The history of hypertext in the World Wide Web is overviewed to bring more sophisticated hypertext into the Web, and the new XML proposals are making many of these into mainstream functions.
Abstract: In this short paper, we briefly overview the history of hypertext in the World Wide Web. The Web started with hypertext functions that have disappeared from the early popular browsers, and some are still not present in today's dominant browsers. The hypertext community has proposed ways to bring more sophisticated hypertext into the Web, and the new XML proposals are making many of these into mainstream functions.

Proceedings ArticleDOI
06 Jun 1999
TL;DR: The paper presents an overview of the Web mining agent system, then gives the motivations for the conversion into XML, and discusses in detail the transformation process performed on the Web documents.
Abstract: The work presented is part of a Web mining agent (WMA) system under development at our Multimedia and Mobile Agent Research Laboratory. The purpose of this system is to automatically extract specific information from Web pages and appropriately format the extracted information for further use. This requires resolving problems related to the disorganized nature of the Web that may result from ill-formatted HTML-based Web pages. The desired information is extracted from the Web documents by applying a sequence of filters to these documents. Each of the filters has a specific role. We discuss the filter that is used to convert Web documents into well-formed XML documents. This conversion involves the following operations: (i) syntactic mapping of HTML to XML, (ii) resolving ambiguity introduced by HTML tagging rules, and (iii) handling errors that may occur due to improper usage of HTML by the authors. The paper presents an overview of the Web mining agent system, then gives the motivations for the conversion into XML and finally, discusses in detail the transformation process performed on the Web documents.
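
The heart of such a conversion filter is making tag structure explicit: closing elements the author left open, emitting void elements in self-closed form, and escaping character data so the output parses as XML. The following simplified sketch, built on Python's html.parser, stands in for the WMA filter under those assumptions.

```python
# Simplified sketch of converting loosely written HTML into well-formed
# XML-like output: keep a stack of open tags, close anything left open, and
# escape character data. This is an illustration, not the WMA filter itself.
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

class HtmlToXml(HTMLParser):
    VOID = {"br", "hr", "img", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        if tag in self.VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # close any intermediate tags the author forgot to close
        while self.stack and self.stack[-1] != tag:
            self.out.append(f"</{self.stack.pop()}>")
        if self.stack:
            self.out.append(f"</{self.stack.pop()}>")

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()
        while self.stack:                 # close anything still open at the end
            self.out.append(f"</{self.stack.pop()}>")

    def result(self):
        return "".join(self.out)

if __name__ == "__main__":
    converter = HtmlToXml()
    converter.feed("<p>Fish & chips<br>")
    converter.close()
    print(converter.result())             # <p>Fish &amp; chips<br/></p>
```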

01 Jan 1999
TL;DR: This paper sketches Ontobroker and discusses its main shortcomings, then shows how On2broker overcomes these limitations and the integration of new web standards like XML and RDF.
Abstract: Ontobroker applies Artificial Intelligence techniques to improve access to heterogeneous, distributed and semistructured information sources as they are presented in the World Wide Web or organization-wide intranets. It relies on the use of ontologies to annotate web pages, formulate queries and derive answers. In the paper we will briefly sketch Ontobroker. Then we will discuss its main shortcomings, i.e. we will share the lessons we learned from our exercise. We will also show how On2broker overcomes these limitations. Most important is the separation of the query and inference engines and the integration of new web standards like XML and RDF.

Journal ArticleDOI
TL;DR: It is argued that new standards would let users apply the power of their technology to make the most of the Web.
Abstract: It is argued that new standards would let users apply the power of their technology to make the most of the Web.

Proceedings ArticleDOI
01 Jan 1999
TL;DR: This paper presents an approach for retrieving information from forms on the World Wide Web using natural language input, and proposes a statistical disambiguation method based on n-gram statistics.
Abstract: This paper presents an approach for retrieving information from forms on the World Wide Web using natural language input. The structured nature of the form can be utilized to process natural language input for querying data sources on the web that provide form interfaces. Since the valid values for each field can be determined from the form itself or by a user of the form, the form can be filled out by looking for these values in the natural language user input. Since it is possible for a particular value to be valid for more than one field, the surrounding context must be used to determine the correct field for an ambiguous value. A statistical disambiguation method based on n-gram statistics is proposed. It was shown that this method works better than using single context words for disambiguation when the domain is limited.
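
When a value such as "boston" is valid for both a departure and an arrival field, the method scores each candidate field by how well the n-grams around the value in the user's input match n-grams associated with that field. A rough sketch under assumed bigram counts follows; the fields, counts, and window size are invented for illustration.

```python
# Rough sketch of n-gram-based field disambiguation for form filling: a value
# that fits several fields is assigned to the field whose associated context
# bigrams best match the words around the value in the user's input.
# The bigram statistics and field names are illustrative assumptions.

def context_bigrams(tokens, index, window=2):
    """Bigrams taken from a window of tokens around position `index`."""
    lo, hi = max(0, index - window), min(len(tokens), index + window + 1)
    span = tokens[lo:hi]
    return {(span[i], span[i + 1]) for i in range(len(span) - 1)}

def disambiguate(tokens, value_index, field_bigram_counts):
    """Pick the field whose context-bigram counts best cover the observed bigrams."""
    observed = context_bigrams(tokens, value_index)
    return max(field_bigram_counts,
               key=lambda f: sum(field_bigram_counts[f].get(bg, 0) for bg in observed))

if __name__ == "__main__":
    utterance = "i want to fly from boston to denver".split()
    counts = {
        "departure_city": {("from", "boston"): 40, ("fly", "from"): 35},
        "arrival_city": {("to", "boston"): 25, ("boston", "to"): 5},
    }
    print(disambiguate(utterance, utterance.index("boston"), counts))
```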

Book ChapterDOI
13 Dec 1999
TL;DR: In this framework, a hybrid (partially materialized) approach and extended ontologies are used to achieve Web data integration, making it possible to integrate DW data with Web-based information resources as they are needed.
Abstract: This paper presents a framework for warehousing selected Web contents. In this framework, a hybrid (partially materialized) approach and extended ontologies are used to achieve Web data integration. This hybrid approach makes it possible to integrate DW data with Web-based information resources as they are needed. The ontologies are used to represent domain knowledge related to Web sources and the logical model of the data warehouse. Moreover, we define mapping rules between Web data and attributes of the data warehouse in the ontologies to facilitate the construction and maintenance of data warehouses.
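
The mapping rules can be read as declarative associations from ontology terms describing a Web source to attributes of the warehouse's logical model. A minimal sketch of applying such rules to one extracted record follows; all term and attribute names are invented for illustration.

```python
# Minimal sketch of ontology-style mapping rules: each rule maps a term used
# to describe extracted Web data onto an attribute of the data warehouse's
# logical model, optionally with a type conversion. All names are invented.

MAPPING_RULES = {
    "ex:productName": ("dw_product.name", str),
    "ex:listPrice":   ("dw_product.price_usd", float),
    "ex:lastUpdated": ("dw_product.load_date", str),
}

def apply_mapping(web_record, rules=MAPPING_RULES):
    """Translate a {ontology term: value} record into warehouse attributes."""
    warehouse_row = {}
    for term, value in web_record.items():
        if term in rules:
            attribute, convert = rules[term]
            warehouse_row[attribute] = convert(value)
    return warehouse_row

if __name__ == "__main__":
    extracted = {"ex:productName": "Laptop X", "ex:listPrice": "999.00",
                 "ex:lastUpdated": "1999-12-13"}
    print(apply_mapping(extracted))
```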

Proceedings ArticleDOI
23 Sep 1999
TL;DR: A system architecture and implementation relying on commercial WWW technology is presented; the temporal aspects of hypermedia features for continuous media like audio and video resemble those of all other kinds of multimedia applications.
Abstract: Multimedia applications within the World Wide Web (WWW) have to deal with difficulties like executing within Web pages and being transferred via the Internet. However, the temporal aspects of hypermedia features for continuous media like audio and video resemble those of all other kinds of multimedia applications. These temporal aspects are discussed in consideration of presentation and authoring facilities. A system architecture and implementation relying on commercial WWW technology is presented.




Journal ArticleDOI
TL;DR: This framework provides an easy-to-use and well-formalized method for automatic generation of wrappers extracting data from Web documents and an associated SQL-like query language.