Showing papers on "Web page published in 2002"

PDF

Open Access

Journal Article•DOI•

[...]

01 Sep 2002

TL;DR: This taxonomy of web searches is explored and how global search engines evolved to deal with web-specific needs is discussed.

...read moreread less

Abstract: Classic IR (information retrieval) is inherently predicated on users searching for information, the so-called "information need". But the need behind a web search is often not informational -- it might be navigational (give me the url of the site I want to reach) or transactional (show me sites where I can perform a certain transaction, e.g. shop, download a file, or find a map). We explore this taxonomy of web searches and discuss how global search engines evolved to deal with web-specific needs.

...read moreread less

2,094 citations

Journal Article•DOI•

Empirical Studies Assessing the Quality of Health Information for Consumers on the World Wide Web: A Systematic Review

[...]

Gunther Eysenbach¹, John Powell, Oliver Kuss, Eun-Ryoung Sa•Institutions (1)

Toronto General Hospital¹

22 May 2002-JAMA

TL;DR: A methodological framework on how quality on the Web is evaluated in practice is established, to determine the heterogeneity of the results and conclusions, and to compare the methodological rigor of these studies to determine to what extent the conclusions depend on the methodology used.

...read moreread less

Abstract: ContextThe quality of consumer health information on the World Wide Web is an important issue for medicine, but to date no systematic and comprehensive synthesis of the methods and evidence has been performed.ObjectivesTo establish a methodological framework on how quality on the Web is evaluated in practice, to determine the heterogeneity of the results and conclusions, and to compare the methodological rigor of these studies, to determine to what extent the conclusions depend on the methodology used, and to suggest future directions for research.Data SourcesWe searched MEDLINE and PREMEDLINE (1966 through September 2001), Science Citation Index (1997 through September 2001), Social Sciences Citation Index (1997 through September 2001), Arts and Humanities Citation Index (1997 through September 2001), LISA (1969 through July 2001), CINAHL (1982 through July 2001), PsychINFO (1988 through September 2001), EMBASE (1988 through June 2001), and SIGLE (1980 through June 2001). We also conducted hand searches, general Internet searches, and a personal bibliographic database search.Study SelectionWe included published and unpublished empirical studies in any language in which investigators searched the Web systematically for specific health information, evaluated the quality of Web sites or pages, and reported quantitative results. We screened 7830 citations and retrieved 170 potentially eligible full articles. A total of 79 distinct studies met the inclusion criteria, evaluating 5941 health Web sites and 1329 Web pages, and reporting 408 evaluation results for 86 different quality criteria.Data ExtractionTwo reviewers independently extracted study characteristics, medical domains, search strategies used, methods and criteria of quality assessment, results (percentage of sites or pages rated as inadequate pertaining to a quality criterion), and quality and rigor of study methods and reporting.Data SynthesisMost frequently used quality criteria used include accuracy, completeness, readability, design, disclosures, and references provided. Fifty-five studies (70%) concluded that quality is a problem on the Web, 17 (22%) remained neutral, and 7 studies (9%) came to a positive conclusion. Positive studies scored significantly lower in search (P = .02) and evaluation (P = .04) methods.ConclusionsDue to differences in study methods and rigor, quality criteria, study population, and topic chosen, study results and conclusions on health-related Web sites vary widely. Operational definitions of quality criteria are needed.

...read moreread less

1,781 citations

Proceedings Article•DOI•

Topic-sensitive PageRank

[...]

Taher H. Haveliwala¹•Institutions (1)

Stanford University¹

07 May 2002

TL;DR: A set of PageRank vectors are proposed, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic, and are shown to generate more accurate rankings than with a single, generic PageRank vector.

...read moreread less

Abstract: In the original PageRank algorithm for improving the ranking of search-query results, a single PageRank vector is computed, using the link structure of the Web, to capture the relative "importance" of Web pages, independent of any particular search query. To yield more accurate search results, we propose computing a set of PageRank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. By using these (precomputed) biased PageRank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic PageRank vector. For ordinary keyword search queries, we compute the topic-sensitive PageRank scores for pages satisfying the query using the topic of the query keywords. For searches done in context (e.g., when the search query is performed by highlighting words in a Web page), we compute the topic-sensitive PageRank scores using the topic of the context in which the query appeared.

...read moreread less

1,765 citations

Patent•

Enhanced video programming system and method for incorporating and displaying retrieved integrated Internet information segments

[...]

Craig Ullman, Jack D. Hidary, Nova T. Spivack

18 Nov 2002

TL;DR: In this paper, a system for integrating video programming with the vast information resources of the Internet is presented, where the web pages are synchronized to the video content for display in conjunction with a television program being broadcast to the user at that time.

...read moreread less

Abstract: A system for integrating video programming with the vast information resources of the Internet. A computer-based system receives a video program with embedded uniform resource locators (URLs). The URLs, the effective addresses of locations or Web sites on the Internet, are interpreted by the system and direct the system to the Web site locations to retrieve related Web pages. Upon receipt of the Web pages by the system, the Web pages are synchronized to the video content for display. The video program signal can be displayed on a video window on a conventional personal computer screen. The actual retrieved Web pages are time stamped to also be displayed, on another portion of the display screen, when predetermined related video content is displayed in the video window. As an alternative, the computer-based system receives the URLs directly through an Internet connection, at times specified by TV broadcasters in advance. The system interprets the URLs and retrieves the appropriate Web pages. The Web pages are synchronized to the video content for display in conjunction with a television program being broadcast to the user at that time. This alternative system allows the URLs to be entered for live transmission to the user.

...read moreread less

1,504 citations

Journal Article•DOI•

Unraveling the Web services web: an introduction to SOAP, WSDL, and UDDI

[...]

Francisco Curbera¹, Matthew J. Duftler¹, Rania Khalaf¹, William A. Nagy¹, Nirmal K. Mukhi¹, Sanjiva Weerawarana¹ - Show less +2 more•Institutions (1)

IBM¹

01 Mar 2002-IEEE Internet Computing

TL;DR: This tutorial explores the most salient and stable specifications in each of the three major areas of the emerging Web services framework, which are the simple object access protocol, the Web Services Description Language and the Universal Description, Discovery, and Integration directory.

...read moreread less

Abstract: This tutorial explores the most salient and stable specifications in each of the three major areas of the emerging Web services framework. They are the simple object access protocol, the Web Services Description Language and the Universal Description, Discovery, and Integration directory, which is a registry of Web services descriptions.

...read moreread less

1,470 citations

Patent•

Asynchronous message push to web browser

[...]

Mingte Chen, Sing Yip, Yan Ma, Gilberto Arnaiz, Srikant Krishnapuram Tirumalai, David Tchankotadze, Kuang Huang, Anil Kumar Annadata - Show less +4 more

30 Sep 2002

TL;DR: In this article, a method and system for controlling a user interface presented by a web browser (104A) is presented, where the web browser is not blocked waiting for the asynchronous message.

...read moreread less

Abstract: A method and system for controlling a user interface presented by a web browser (104A). The web browser (104A) is not blocked waiting for the asynchronous message. The web browser (104A) presents a user interface and presents a user interface change in response to receiving the asynchronous message.

...read moreread less

1,151 citations

Journal Article•DOI•

Self-organization and identification of Web communities

[...]

Gary W. Flake, Steve Lawrence¹, C.L. Giles², Frans M. Coetzee•Institutions (2)

Gas Technology Institute¹, Pennsylvania State University²

01 Mar 2002-IEEE Computer

TL;DR: This work shows that the Web self-organizes and its link structure allows efficient identification of communities and is significant because no central authority or process governs the formation and structure of hyperlinks.

...read moreread less

Abstract: The vast improvement in information access is not the only advantage resulting from the increasing percentage of hyperlinked human knowledge available on the Web. Additionally, much potential exists for analyzing interests and relationships within science and society. However, the Web's decentralized and unorganized nature hampers content analysis. Millions of individuals operating independently and having a variety of backgrounds, knowledge, goals and cultures author the information on the Web. Despite the Web's decentralized, unorganized, and heterogeneous nature, our work shows that the Web self-organizes and its link structure allows efficient identification of communities. This self-organization is significant because no central authority or process governs the formation and structure of hyperlinks.

...read moreread less

1,033 citations

Patent•

Method for providing search-specific web pages in a network computing environment

[...]

Rodger Rio

12 Jul 2002

TL;DR: In this article, a method and article for providing search-specific page sets and query-results listings is provided, and a method for defining the custom search page and the custom results page without the need for line-by-line computer coding is presented.

...read moreread less

Abstract: A method and article for providing search-specific page sets and query-results listings is provided. The method and article provides end-users with customized, search-specific pages upon which to initiate a query. A method is also provided for defining the custom search page and the custom results page without the need for line-by-line computer coding. The present invention provides product and service information to end-users in an initiative format.

...read moreread less

960 citations

Proceedings Article•DOI•

Simulation, verification and automated composition of web services

[...]

Srini Narayanan¹, Sheila A. McIlraith²•Institutions (2)

SRI International¹, Stanford University²

07 May 2002

TL;DR: This paper defines the semantics for a relevant subset of DAML-S in terms of a first-order logical language and provides decision procedures for Web service simulation, verification and composition.

...read moreread less

Abstract: Web services -- Web-accessible programs and devices - are a key application area for the Semantic Web. With the proliferation of Web services and the evolution towards the Semantic Web comes the opportunity to automate various Web services tasks. Our objective is to enable markup and automated reasoning technology to describe, simulate, compose, test, and verify compositions of Web services. We take as our starting point the DAML-S DAML+OIL ontology for describing the capabilities of Web services. We define the semantics for a relevant subset of DAML-S in terms of a first-order logical language. With the semantics in hand, we encode our service descriptions in a Petri Net formalism and provide decision procedures for Web service simulation, verification and composition. We also provide an analysis of the complexity of these tasks under different restrictions to the DAML-S composite services we can describe. Finally, we present an implementation of our analysis techniques. This implementation takes as input a DAML-S description of a Web service, automatically generates a Petri Net and performs the desired analysis. Such a tool has broad applicability both as a back end to existing manual Web service composition tools, and as a stand-alone tool for Web service developers.

...read moreread less

953 citations

Proceedings Article•

Adapting Golog for Composition of Semantic Web Services

[...]

Sheila A. McIlraith¹, Tran Cao Son•Institutions (1)

Stanford University¹

01 Jan 2002

TL;DR: It is argued that an augmented version of the logic programming language Golog provides a natural formalism for automatically composing services on the Semantic Web and logical criteria for these generic procedures that define when they are knowledge self-sufficient and physically selfsufficient are proposed.

...read moreread less

Abstract: Motivated by the problem of automatically composing network accessible services, such as those on the World Wide Web, this paper proposes an approach to building agent technology based on the notion of generic procedures and customizing user constraint. We argue that an augmented version of the logic programming language Golog provides a natural formalism for automatically composing services on the Semantic Web. To this end, we adapt and extend the Golog language to enable programs that are generic, customizable and usable in the context of the Web. Further, we propose logical criteria for these generic procedures that define when they are knowledge self-sufficient and physically selfsufficient. To support information gathering combined with search, we propose a middle-ground Golog interpreter that operates under an assumption of reasonable persistence of certain information. These contributions are realized in our augmentation of a ConGolog interpreter that combines online execution of information-providing Web services with offline simulation of worldaltering Web services, to determine a sequence of Web Services for subsequent execution. Our implemented system is currently interacting with services on the Web.

...read moreread less

939 citations

Journal Article•DOI•

A brief survey of web data extraction tools

[...]

Alberto H. F. Laender¹, Berthier Ribeiro-Neto¹, Altigran Soares da Silva¹, Juliana Silveira Teixeira¹•Institutions (1)

Universidade Federal de Minas Gerais¹

01 Jun 2002

TL;DR: A taxonomy for characterizing Web data extraction fools is proposed, a survey of major web data extraction tools described in the literature is briefly surveyed, and a qualitative analysis of them is provided.

...read moreread less

Abstract: In the last few years, several works in the literature have addressed the problem of data extraction from Web pages. The importance of this problem derives from the fact that, once extracted, the data can be handled in a way similar to instances of a traditional database. The approaches proposed in the literature to address the problem of Web data extraction use techniques borrowed from areas such as natural language processing, languages and grammars, machine learning, information retrieval, databases, and ontologies. As a consequence, they present very distinct features and capabilities which make a direct comparison difficult to be done. In this paper, we propose a taxonomy for characterizing Web data extraction fools, briefly survey major Web data extraction tools described in the literature, and provide a qualitative analysis of them. Hopefully, this work will stimulate other studies aimed at a more comprehensive analysis of data extraction approaches and tools for Web data.

...read moreread less

Journal Article•DOI•

A fine-grained access control system for XML documents

[...]

Ernesto Damiani¹, Sabrina De Capitani di Vimercati², Stefano Paraboschi³, Pierangela Samarati¹•Institutions (3)

University of Milan¹, University of Brescia², Polytechnic University of Milan³

01 May 2002-ACM Transactions on Information and System Security

TL;DR: This work presents an access control model to protect information distributed on the Web that, by exploiting XML's own capabilities, allows the definition and enforcement of access restrictions directly on the structure and content of the documents.

...read moreread less

Abstract: Web-based applications greatly increase information availability and ease of access, which is optimal for public information. The distribution and sharing of information via the Web that must be accessed in a selective way, such as electronic commerce transactions, require the definition and enforcement of security controls, ensuring that information will be accessible only to authorized entities. Different approaches have been proposed that address the problem of protecting information in a Web system. However, these approaches typically operate at the file-system level, independently of the data that have to be protected from unauthorized accesses. Part of this problem is due to the limitations of HTML, historically used to design Web documents. The extensible markup language (XML), a markup language promoted by the World Wide Web Consortium (W3C), is de facto the standard language for the exchange of information on the Internet and represents an important opportunity to provide fine-grained access control. We present an access control model to protect information distributed on the Web that, by exploiting XML's own capabilities, allows the definition and enforcement of access restrictions directly on the structure and content of the documents. We present a language for the specification of access restrictions, which uses standard notations and concepts, together with a description of a system architecture for access control enforcement based on existing technology. The result is a flexible and powerful security system offering a simple integration with current solutions.

...read moreread less

Patent•

Technique for implementing browser-initiated user-transparent network-distributed advertising and for interstitially displaying an advertisement, so distributed, through a web browser in response to a user click-stream

[...]

Rick W. Landsman, Wei-Yeh Lee

31 May 2002

TL;DR: In this paper, a technique for implementing in a networked client-server environment, e.g., the Internet, network-distributed advertising in which advertisements are downloaded, from an advertising server to a browser executing at a client computer, in a manner transparent to a user situated at the browser, and subsequently displayed, by that browser and on an interstitial basis, in response to a click-stream generated by the user to move from one web page to the next.

...read moreread less

Abstract: A technique for implementing in a networked client-server environment, e.g., the Internet, network-distributed advertising in which advertisements are downloaded, from an advertising server to a browser executing at a client computer, in a manner transparent to a user situated at the browser, and subsequently displayed, by that browser and on an interstitial basis, in response to a click-stream generated by the user to move from one web page to the next. Specifically, an HTML advertising tag is embedded into a referring web page. This tag contains two components. One component effectively downloads, from an distribution web server and to an extent necessary, and then persistently instantiates an agent at the client browser. This agent “politely” and transparently downloads advertising files (media and where necessary player files), originating from an ad management system residing on a third-party advertising web server, for a given advertisement into browser cache and subsequently plays those media files through the browser on an interstitial basis and in response to a user click-stream. The other component is a reference, in terms of a web address, of the advertising management system. This latter reference totally “decouples” advertising content from a web page such that a web page, rather than embedding actual advertising content within the page itself, merely includes an advertising tag that refers, via a URL, to a specific ad management system rather than to a particular advertisement or its content. The ad management system selects the given advertisement that is to be downloaded, rather than having that selection or its content being embedded in the web content page.

...read moreread less

Journal Article•DOI•

Introduction to web services architecture

[...]

K. D. Gottschalk¹, S. Graham¹, H. Kreger¹, J. Snell²•Institutions (2)

Research Triangle Park¹, IBM²

01 Apr 2002-Ibm Systems Journal

TL;DR: The architectural elements of Web services are related to a real-world business scenario in order to illustrate how the Web services approach helps solve real business problems.

...read moreread less

Abstract: This paper introduces the major components of, and standards associated with, the Web services architecture. The different roles associated with the Web services architecture and the programming stack for Web services are described. The architectural elements of Web services are then related to a real-world business scenario in order to illustrate how the Web services approach helps solve real business problems.

...read moreread less

Journal Article•DOI•

Winners don't take all: Characterizing the competition for links on the web

[...]

David M. Pennock¹, Gary W. Flake, Steve Lawrence, Eric Glover, C. Lee Giles² - Show less +1 more•Institutions (2)

Pennsylvania State University¹, Penn State College of Information Sciences and Technology²

16 Apr 2002-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: A simple generative model quantifies the degree to which the rich nodes grow richer, and how new (and poorly connected) nodes can compete, and accurately accounts for the true connectivity distributions of category-specific web pages, the web as a whole, and other social networks.

...read moreread less

Abstract: As a whole, the World Wide Web displays a striking “rich get richer” behavior, with a relatively small number of sites receiving a disproportionately large share of hyperlink references and traffic. However, hidden in this skewed global distribution, we discover a qualitatively different and considerably less biased link distribution among subcategories of pages—for example, among all university homepages or all newspaper homepages. Although the connectivity distribution over the entire web is close to a pure power law, we find that the distribution within specific categories is typically unimodal on a log scale, with the location of the mode, and thus the extent of the rich get richer phenomenon, varying across different categories. Similar distributions occur in many other naturally occurring networks, including research paper citations, movie actor collaborations, and United States power grid connections. A simple generative model, incorporating a mixture of preferential and uniform attachment, quantifies the degree to which the rich nodes grow richer, and how new (and poorly connected) nodes can compete. The model accurately accounts for the true connectivity distributions of category-specific web pages, the web as a whole, and other social networks.

...read moreread less

Proceedings Article•DOI•

Mining product reputations on the Web

[...]

Satoshi Morinaga¹, Kenji Yamanishi¹, Kenji Tateishi¹, Toshikazu Fukushima¹•Institutions (1)

NEC¹

23 Jul 2002

TL;DR: A new framework for mining product reputations on the Internet is presented, which offers a drastic reduction in the overall cost of reputation analysis over that of conventional survey approaches and supports the discovery of knowledge from the pool of opinions on the web.

...read moreread less

Abstract: Knowing the reputations of your own and/or competitors' products is important for marketing and customer relationship management. It is, however, very costly to collect and analyze survey data manually. This paper presents a new framework for mining product reputations on the Internet. It automatically collects people's opinions about target products from Web pages, and it uses text mining techniques to obtain the reputations of those products.On the basis of human-test samples, we generate in advance syntactic and linguistic rules to determine whether any given statement is an opinion or not, as well as whether such any opinion is positive or negative in nature. We first collect statements regarding target products using a general search engine, and then, using the rules, extract opinions from among them and attach three labels to each opinion, labels indicating the positive/negative determination, the product name itself, and an numerical value expressing the degree of system confidence that the statement is, in fact, an opinion. The labeled opinions are then input into an opinion database.The mining of reputations, i.e., the finding of statistically meaningful information included in the database, is then conducted. We specify target categories using label values (such as positive opinions of product A) and perform four types of text mining: extraction of 1) characteristic words, 2) co-occurrence words, 3) typical sentences, for individual target categories, and 4) correspondence analysis among multiple target categories.Actual marketing data is used to demonstrate the validity and effectiveness of the framework, which offers a drastic reduction in the overall cost of reputation analysis over that of conventional survey approaches and supports the discovery of knowledge from the pool of opinions on the web.

...read moreread less

Book•

Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential

[...]

Dieter Fensel, James A. Hendler, Henry Lieberman, Wolfgang Wahlster

01 Oct 2002

TL;DR: The Semantic Web as discussed by the authors is a new type of hierarchy and standardization that will replace the current Web of links with a web of meaning using a flexible set of languages and tools.

...read moreread less

Abstract: From the Publisher: As the World Wide Web continues to expand, it becomes increasingly difficult for users to obtain information efficiently. Because most search engines read format languages such as HTML or SGML, search results reflect formatting tags more than actual page content, which is expressed in natural language. Spinning the Semantic Web describes an exciting new type of hierarchy and standardization that will replace the current "web of links" with a "web of meaning." Using a flexible set of languages and tools, the Semantic Web will make all available information--display elements, metadata, services, images, and especially content--accessible. The result will be an immense repository of information accessible for a wide range of new applications. This first handbook for the Semantic Web covers, among other topics, software agents that can negotiate and collect information, markup languages that can tag many more types of information in a document, and knowledge systems that enable machines to read Web pages and determine their reliability. The truly interdisciplinary Semantic Web combines aspects of artificial intelligence, markup languages, natural language processing, information retrieval, knowledge representation, intelligent agents, and databases.

...read moreread less

Proceedings Article•DOI•

Eye tracking in web search tasks: design implications

[...]

Joseph H. Goldberg¹, Mark J. Stimson¹, Marion Lewenstein², Neil G. Scott², Anna M. Wichansky¹ - Show less +1 more•Institutions (2)

Oracle Corporation¹, Stanford University²

25 Mar 2002

TL;DR: Based on analysis of screen sequences, there was little evidence that search became more directed as screen sequence increased, and navigation among portlets, when at least two columns exist, was biased towards horizontal search (across columns) as opposed to vertical search (within column).

...read moreread less

Abstract: An eye tracking study was conducted to evaluate specific design features for a prototype web portal application. This software serves independent web content through separate, rectangular, user-modifiable portlets on a web page. Each of seven participants navigated across multiple web pages while conducting six specific tasks, such as removing a link from a portlet. Specific experimental questions included (1) whether eye tracking-derived parameters were related to page sequence or user actions preceding page visits, (2) whether users were biased to traveling vertically or horizontally while viewing a web page, and (3) whether specific sub-features of portlets were visited in any particular order. Participants required 2-15 screens, and from 7-360+ seconds to complete each task. Based on analysis of screen sequences, there was little evidence that search became more directed as screen sequence increased. Navigation among portlets, when at least two columns exist, was biased towards horizontal search (across columns) as opposed to vertical search (within column). Within a portlet, the header bar was not reliably visited prior to the portlet's body, evidence that header bars are not reliably used for navigation cues. Initial design recommendations emphasized the need to place critical portlets on the left and top of the web portal area, and that related portlets do not need to appear in the same column. Further experimental replications are recommended to generalize these results to other applications.

...read moreread less

Proceedings Article•DOI•

Building a recommender agent for e-learning systems

[...]

Osmar R. Zaïane

03 Dec 2002

TL;DR: The use of web mining techniques are suggested to build such an agent that could recommend on-line learning activities or shortcuts in a course web site based on learners' access history to improve course material navigation as well as assist the online learning process.

...read moreread less

Abstract: A recommender system in an e-learning context is a software agent that tries to "intelligently" recommend actions to a learner based on the actions of previous learners. This recommendation could be an on-line activity such as doing an exercise, reading posted messages on a conferencing system, or running an on-line simulation, or could be simply a web resource. These recommendation systems have been tried in e-commerce to entice purchasing of goods, but haven't been tried in e-learning. This paper suggests the use of web mining techniques to build such an agent that could recommend on-line learning activities or shortcuts in a course web site based on learners' access history to improve course material navigation as well as assist the online learning process. These techniques are considered integrated web mining as opposed to off-line web mining used by expert users to discover on-line access patterns.

...read moreread less

Proceedings Article•DOI•

Statistical identification of encrypted Web browsing traffic

[...]

Qixiang Sun¹, Daniel R. Simon², Yi-Min Wang², W. Russell², Venkata N. Padmanabhan², Lili Qiu² - Show less +2 more•Institutions (2)

Stanford University¹, Microsoft²

12 May 2002

TL;DR: This work investigates the identifiability of World Wide Web traffic based on this unconcealed information in a large sample of Web pages, and shows that it suffices to identify a significant fraction of them quite reliably.

...read moreread less

Abstract: Encryption is often proposed as a tool for protecting the privacy of World Wide Web browsing. However, encryption-particularly as typically implemented in, or in concert with popular Web browsers-does not hide all information about the encrypted plaintext. Specifically, HTTP object count and sizes are often revealed (or at least incompletely concealed). We investigate the identifiability of World Wide Web traffic based on this unconcealed information in a large sample of Web pages, and show that it suffices to identify a significant fraction of them quite reliably. We also suggest some possible countermeasures against the exposure of this kind of information and experimentally evaluate their effectiveness.

...read moreread less

Proceedings Article•DOI•

Discovering informative content blocks from Web documents

[...]

Shian-Hua Lin¹, Jan-Ming Ho¹•Institutions (1)

Academia Sinica¹

23 Jul 2002

TL;DR: By adopting InfoDiscoverer as the preprocessor of information retrieval and extraction applications, the retrieval and extracting precision will be increased, and the indexing size and extracting complexity will also be reduced.

...read moreread less

Abstract: In this paper, we propose a new approach to discover informative contents from a set of tabular documents (or Web pages) of a Web site. Our system, InfoDiscoverer, first partitions a page into several content blocks according to HTML tag in a Web page. Based on the occurrence of the features (terms) in the set of pages, it calculates entropy value of each feature. According to the entropy value of each feature in a content block, the entropy value of the block is defined. By analyzing the information measure, we propose a method to dynamically select the entropy-threshold that partitions blocks into either informative or redundant. Informative content blocks are distinguished parts of the page, whereas redundant content blocks are common parts. Based on the answer set generated from 13 manually tagged news Web sites with a total of 26,518 Web pages, experiments show that both recall and precision rates are greater than 0.956. That is, using the approach, informative blocks (news articles) of these sites can be automatically separated from semantically redundant contents such as advertisements, banners, navigation panels, news categories, etc. By adopting InfoDiscoverer as the preprocessor of information retrieval and extraction applications, the retrieval and extracting precision will be increased, and the indexing size and extracting complexity will also be reduced.

...read moreread less

Proceedings Article•DOI•

Design and implementation of a high-performance distributed Web crawler

[...]

V. Shkapenyuk, Torsten Suel¹•Institutions (1)

New York University¹

07 Aug 2002

TL;DR: This paper describes the design and implementation of a distributed Web crawler that runs on a network of workstations that scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications.

...read moreread less

Abstract: Broad Web search engines as well as many more specialized search tools rely on Web crawlers to acquire large collections of pages for indexing and analysis. Such a Web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost. In this paper, we describe the design and implementation of a distributed Web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the, performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.

...read moreread less

Book Chapter•DOI•

S-CREAM - Semi-automatic CREAtion of Metadata

[...]

Siegfried Handschuh¹, Steffen Staab¹, Fabio Ciravegna²•Institutions (2)

Karlsruhe Institute of Technology¹, University of Sheffield²

01 Oct 2002

TL;DR: OntoMat-Annotizer extract with the help of Amilcare knowledge structure from web pages through the use of knowledge extraction rules, the result of a learning-cycle based on already annotated pages.

...read moreread less

Abstract: Richly interlinked, machine-understandable data constitute the basis for the Semantic Web. We provide a framework, S-CREAM, that allows for creation of metadata and is trainable for a specific domain. Annotating web documents is one of the major techniques for creating metadata on the web. The implementation of S-CREAM, OntoMat-Annotizer supports now the semi-automatic annotation of web pages. This semi-automatic annotation is based on the information extraction component Amilcare. OntoMat-Annotizer extract with the help of Amilcare knowledge structure from web pages through the use of knowledge extraction rules. These rules are the result of a learning-cycle based on already annotated pages.

...read moreread less

Book Chapter•DOI•

MnM: Ontology Driven Semi-automatic and Automatic Support for Semantic Markup

[...]

Maria Vargas-Vera¹, Enrico Motta¹, John Domingue¹, Mattia Lanzoni¹, Arthur Stutt¹, Fabio Ciravegna² - Show less +2 more•Institutions (2)

Open University¹, University of Sheffield²

01 Oct 2002

TL;DR: M is presented, an annotation tool which provides both automated and semi-automated support for annotating web pages with semantic contents and integrates a web browser with an ontology editor and provides open APIs to link to ontology servers and for integrating information extraction tools.

...read moreread less

Abstract: An important precondition for realizing the goal of a semantic web is the ability to annotate web resources with semantic information. In order to carry out this task, users need appropriate representation languages, ontologies, and support tools. In this paper we present MnM, an annotation tool which provides both automated and semi-automated support for annotating web pages with semantic contents. MnM integrates a web browser with an ontology editor and provides open APIs to link to ontology servers and for integrating information extraction tools. MnM can be seen as an early example of the next generation of ontology editors, being web-based, oriented to semantic markup and providing mechanisms for large-scale automatic markup of web pages.

...read moreread less

Proceedings Article•DOI•

Using web structure for classifying and describing web pages

[...]

Eric Glover¹, Kostas Tsioutsiouliklis¹, Steve Lawrence¹, David M. Pennock¹, Gary W. Flake¹ - Show less +1 more•Institutions (1)

Princeton University¹

07 May 2002

TL;DR: By ranking words and phrases in the citing documents according to expected entropy loss, this work is able to accurately name clusters of web pages, even with very few positive examples.

...read moreread less

Abstract: The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages.

...read moreread less

Patent•

Method and system for web page personalization

[...]

Devin F. Hosea, Richard S. Zimmerman, Arthur P. Rascon, Anthony Scott Oddo, Nathaniel Thurston - Show less +1 more

23 Jan 2002

TL;DR: In this paper, a method and system for personalizing displays of published Web pages provided by Web content providers to meet the interests of Web users accessing the pages, based on profiles of the users.

...read moreread less

Abstract: The invention includes a method and system for personalizing displays of published Web pages provided by Web content providers to meet the interests of Web users accessing the pages, based on profiles of the users. The system preferably provides to the requesting user, through a proxy server, an edited version of the HTML file for the original published Web page that is served by a host Web server. The system uses user profiles that may include demographic and psychographic data to edit the requested Web page. The content of a Web page as published by a host Web server may be coded to correlate components of the Web page with demographic and psychographic data. The user profiles may then be used to filter the content of a coded Web page for delivery to a requesting user. The system may rearrange content on a published Web page so that content determined to be of higher interest to a user is more prominently featured or more easily or quickly accessible. The system may also delete content on a published Web page that is determined to be of low interest to a user. In embodiments of the invention, a single proxy server or proxy server system personalizes Web pages from multiple Web servers, using a single user profile for a user.

...read moreread less

Journal Article•DOI•

A Study of Approaches to Hypertext Categorization

[...]

Yiming Yang¹, Seán Slattery¹, Rayid Ghani¹•Institutions (1)

Carnegie Mellon University¹

01 Mar 2002

TL;DR: This paper examines five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier.

...read moreread less

Abstract: Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and meta data extracted from related Web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an open question. This paper seeks a principled approach to providing the answers. Specifically, we define five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier. Using three hypertext datasets and three well-known learning algorithms (Naive Bayes, Nearest Neighbor, and First Order Inductive Learner), we examine these regularities in different domains, and compare alternative ways to exploit them. Our results show that the identification of hypertext regularities in the data and the selection of appropriate representations for hypertext in particular domains are crucial, but seldom obvious, in real-world problems. We find that adding the words in the linked neighborhood to the page having those links (both inlinks and outlinks) were helpful for all our classifiers on one data set, but more harmful than helpful for two out of the three classifiers on the remaining datasets. We also observed that extracting meta data from related Web sites was extremely useful for improving classification accuracy in some of those domains. Finally, the relative performance of the classifiers being tested provided insights into their strengths and limitations for solving classification problems involving diverse and often noisy Web pages.

...read moreread less

Proceedings Article•DOI•

Minimizing energy for wireless web access with bounded slowdown

[...]

Ronny Krashinsky¹, Hari Balakrishnan¹•Institutions (1)

Massachusetts Institute of Technology¹

23 Sep 2002

TL;DR: The Bounded Slowdown protocol is presented, a PSM that dynamically adapts to network activity that reduces average Web page retrieval times by 5--64%, while simultaneously reducing energy consumption by 1--14% and by 13X compared to no power management.

...read moreread less

Abstract: On many battery-powered mobile computing devices, the wireless network is a significant contributor to the total energy consumption. In this paper, we investigate the interaction between energy-saving protocols and TCP performance for Web like transfers. We show that the popular IEEE 802.11 power-saving mode (PSM), a "static" protocol, can harm performance by increasing fast round trip times (RTTs) to 100 ms; and that under typical Web browsing workloads, current implementations will unnecessarily spend energy waking up during long idle periods.To overcome these problems, we present the Bounded-Slowdown (BSD) protocol, a PSM that dynamically adapts to network activity. BSD is an optimal solution to the problem of minimizing energy consumption while guaranteeing that a connection's RTT does not increase by more than a factor p over its base RTT, where p is a protocol parameter that exposes the trade-off between minimizing energy and reducing latency. works by staying awake for a short period of time after the link idle. We present several trace-driven simulation results that show that, compared to a static PSM, the Bounded Slowdown protocol reduces average Web page retrieval times by 5--64%, while simultaneously reducing energy consumption by 1--14% (and by 13X compared to no power management).

...read moreread less

Proceedings Article•DOI•

PEBL: positive example based learning for Web page classification using SVM

[...]

Hwanjo Yu¹, Jiawei Han¹, Kevin Chen-Chuan Chang¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

23 Jul 2002

TL;DR: The Positive Example Based Learning (PEBL) framework for Web page classification is introduced which eliminates the need for manually collecting negative training examples in pre-processing and an algorithm called Mapping-Convergence (M-C) is presented that achieves classification accuracy as high as that of traditional SVM (with positive and negative data).

...read moreread less

Abstract: Web page classification is one of the essential techniques for Web mining. Specifically, classifying Web pages of a user-interesting class is the first step of mining interesting information from the Web. However, constructing a classifier for an interesting class requires laborious pre-processing such as collecting positive and negative training examples. For instance, in order to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of non-homepages (negative examples). In particular, collecting negative training examples requires arduous work and special caution to avoid biasing them. We introduce in this paper the Positive Example Based Learning (PEBL) framework for Web page classification which eliminates the need for manually collecting negative training examples in pre-processing. We present an algorithm called Mapping-Convergence (M-C) that achieves classification accuracy (with positive and unlabeled data) as high as that of traditional SVM (with positive and negative data). Our experiments show that when the M-C algorithm uses the same amount of positive examples as that of traditional SVM, the M-C algorithm performs as well as traditional SVM.

...read moreread less

Journal Article•DOI•

Discovery of Web Robot Sessions Based on their Navigational Patterns

[...]

Pang-Ning Tan¹, Vipin Kumar¹•Institutions (1)

University of Minnesota¹

01 Jan 2002-Data Mining and Knowledge Discovery

TL;DR: Experimental results on the Computer Science department Web server logs show that highly accurate classification models can be built using the navigational patterns in the click-stream data to determine if it is due to a robot.

...read moreread less

Abstract: Web robots are software programs that automatically traverse the hyperlink structure of the World Wide Web in order to locate and retrieve information. There are many reasons why it is important to identify visits by Web robots and distinguish them from other users. First of all, e-commerce retailers are particularly concerned about the unauthorized deployment of robots for gathering business intelligence at their Web sites. In addition, Web robots tend to consume considerable network bandwidth at the expense of other users. Sessions due to Web robots also make it more difficult to perform clickstream analysis effectively on the Web data. Conventional techniques for detecting Web robots are often based on identifying the IP address and user agent of the Web clients. While these techniques are applicable to many well-known robots, they may not be sufficient to detect camouflaged and previously unknown robots. In this paper, we propose an alternative approach that uses the navigational patterns in the click-stream data to determine if it is due to a robot. Experimental results on our Computer Science department Web server logs show that highly accurate classification models can be built using this approach. We also show that these models are able to discover many camouflaged and previously unidentified robots.

...read moreread less

Collapse