
Showing papers on "Web modeling" published in 1999


Journal ArticleDOI
TL;DR: This paper presents several data preparation techniques for identifying unique users and user sessions; transactions identified by the proposed methods are used to discover association rules from real-world data using the WEBMINER system.
Abstract: The World Wide Web (WWW) continues to grow at an astounding rate in both the sheer volume of traffic and the size and complexity of Web sites. The complexity of tasks such as Web site design, Web server design, and of simply navigating through a Web site have increased along with this growth. An important input to these design tasks is the analysis of how a Web site is being used. Usage analysis includes straightforward statistics, such as page access frequency, as well as more sophisticated forms of analysis, such as finding the common traversal paths through a Web site. Web Usage Mining is the application of data mining techniques to usage logs of large Web data repositories in order to produce results that can be used in the design tasks mentioned above. However, there are several preprocessing tasks that must be performed prior to applying data mining algorithms to the data collected from server logs. This paper presents several data preparation techniques in order to identify unique users and user sessions. Also, a method to divide user sessions into semantically meaningful transactions is defined and successfully tested against two other methods. Transactions identified by the proposed methods are used to discover association rules from real world data using the WEBMINER system [15].
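The abstract does not spell out the preprocessing heuristics, but a common approach consistent with it groups log entries by an approximate user identity and splits each user's activity into sessions at an inactivity timeout. A minimal Python sketch, assuming already-parsed log records with hypothetical 'ip', 'agent', 'time', and 'url' fields and a 30-minute timeout:

```python
from collections import defaultdict
from datetime import timedelta

def sessionize(records, timeout=timedelta(minutes=30)):
    """Group parsed log records into user sessions.

    `records` is an iterable of dicts with 'ip', 'agent', 'time'
    (a datetime), and 'url' keys -- hypothetical field names; the
    paper's own preprocessing is more involved (e.g. path completion).
    Users are approximated by the (IP, user-agent) pair, and a new
    session starts after `timeout` of inactivity.
    """
    by_user = defaultdict(list)
    for r in sorted(records, key=lambda r: r["time"]):
        by_user[(r["ip"], r["agent"])].append(r)

    sessions = []
    for user, visits in by_user.items():
        current = [visits[0]]
        for prev, cur in zip(visits, visits[1:]):
            if cur["time"] - prev["time"] > timeout:
                sessions.append((user, current))
                current = []
            current.append(cur)
        sessions.append((user, current))
    return sessions
```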

1,616 citations


Book
01 Nov 1999
TL;DR: The author concludes that simplicity is the most important quality in Web design and that international use should be considered separately.
Abstract: Preface. 1. Introduction: Why Web Usability? 2. Page Design. 3. Content Design. 4. Site Design. 5. Intranet Design. 6. Accessibility for Users with Disabilities. 7. International Use: Serving A Global Audience. 8. Future Predictions: The Only Web Constant Is Change. 9. Conclusion: Simplicity in Web Design. Recommended Readings. Index.

1,600 citations


Journal ArticleDOI
17 May 1999
TL;DR: This paper systematically enumerates over 100,000 emerging communities from a Web crawl, motivates a graph-theoretic approach to locating such communities, and describes the algorithms and algorithmic engineering needed to find structures that fit this notion.
Abstract: The Web harbors a large number of communities — groups of content-creators sharing a common interest — each of which manifests itself as a set of interlinked Web pages. Newsgroups and commercial Web directories together contain on the order of 20,000 such communities; our particular interest here is in emerging communities — those that have little or no representation in such fora. The subject of this paper is the systematic enumeration of over 100,000 such emerging communities from a Web crawl: we call our process trawling. We motivate a graph-theoretic approach to locating such communities, and describe the algorithms, the algorithmic engineering necessary to find structures that subscribe to this notion, the challenges in handling such a huge data set, and the results of our experiment.
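The paper's trawling process searches the crawl for dense bipartite cores, pruning away pages that cannot belong to any (i, j) core before enumeration. The following is a rough in-memory sketch of such a pruning pass, not the disk-based, Web-scale procedure the authors describe; the data structures are assumptions:

```python
def prune_for_cores(out_links, i, j):
    """Iteratively discard pages that cannot participate in an (i, j)
    bipartite core: potential 'fans' need at least j distinct out-links
    and potential 'centers' need at least i distinct in-links.
    `out_links` maps page -> set of pages it links to. This in-memory
    sketch ignores the disk-based passes needed at Web scale.
    """
    links = {u: set(vs) for u, vs in out_links.items()}
    changed = True
    while changed:
        changed = False
        in_degree = {}
        for u, vs in links.items():
            for v in vs:
                in_degree[v] = in_degree.get(v, 0) + 1
        # Pages with too few out-links cannot act as fans.
        weak_fans = [u for u, vs in links.items() if 0 < len(vs) < j]
        # Pages with too few in-links cannot act as centers.
        weak_centers = {v for v, d in in_degree.items() if d < i}
        if weak_fans or weak_centers:
            changed = True
            for u in weak_fans:
                links[u] = set()
            for u in links:
                links[u] -= weak_centers
    return {u: vs for u, vs in links.items() if vs}
```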

1,126 citations


Journal ArticleDOI
TL;DR: This paper describes Mercator, a scalable, extensible Web crawler written entirely in Java, and comments on Mercator's performance, which is found to be comparable to that of other crawlers for which performance numbers have been published.
Abstract: This paper describes Mercator, a scalable, extensible Web crawler written entirely in Java. Scalable Web crawlers are an important component of many Web services, but their design is not well-documented in the literature. We enumerate the major components of any scalable Web crawler, comment on alternatives and tradeoffs in their design, and describe the particular components used in Mercator. We also describe Mercator’s support for extensibility and customizability. Finally, we comment on Mercator’s performance, which we have found to be comparable to that of other crawlers for which performance numbers have been published.
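The abstract lists the usual crawler components (a URL frontier, a URL-seen test, fetching, link extraction) without detail. A toy single-threaded Python skeleton of those pieces follows; it is not Mercator's multi-threaded Java design and omits politeness, robots.txt handling, and checkpointing:

```python
import re
import urllib.parse
import urllib.request
from collections import deque

def crawl(seeds, max_pages=100):
    """Toy crawler illustrating the standard components: a URL frontier,
    a seen-URL set, a fetcher, and a link extractor. Real crawlers add
    politeness delays, robots.txt checks, checkpointing, and a far more
    careful HTML parser than this regex.
    """
    frontier = deque(seeds)          # URL frontier
    seen = set(seeds)                # URL-seen test
    pages = {}
    href = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                 # skip unreachable pages
        pages[url] = html
        for link in href.findall(html):
            absolute = urllib.parse.urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```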

672 citations


Book
01 Jan 1999
TL;DR: This book examines the unique aspects of modeling web applications with the Web Application Extension (WAE) for the Unified Modeling Language (UML), enabling developers to model web-specific architectural elements using the Rational Unified Process or an alternative methodology.
Abstract: Building Web Applications with UML is a guide to building robust, scalable, and feature-rich web applications using proven object-oriented techniques. Written for the project manager, architect, analyst, designer, and programmer of web applications, this book examines the unique aspects of modeling web applications with the Web Application Extension (WAE) for the Unified Modeling Language (UML). The UML has been widely accepted as the standard modeling language for software systems, and as a result is often the best option for modeling web application designs. The WAE extends the UML notation with semantics and constraints, enabling developers to model web-specific architectural elements using the Rational Unified Process or an alternative methodology. Using UML allows developers to model their web applications as a part of the complete system and the business logic that must be reflected in the application. Readers will gain not only an understanding of the modeling process, but also the ability to map models directly into code. Key topics include: a basic introduction to web servers, browsers, HTTP, and HTML; gathering requirements and defining the system's use cases; transforming requirements into a model and then a design that maps directly into components of the system; defining the architecture of a web application, with an examination of three architectural patterns describing thin web client, thick web client, and web delivery designs; modeling, at the appropriate level of abstraction and detail, artifacts such as web application pages, page relationships, navigation routes, client-side scripts, and server-side page generation; creating code from UML models using ASP and VBScript; client-side scripting using DHTML, JavaScript, VBScript, applets, ActiveX controls, and the DOM; using client/server protocols including DCOM, CORBA/IIOP, and Java's RMI; and securing a web application with SET, SSL, PGP, certificates, and certificate authorities.

662 citations


Journal ArticleDOI
TL;DR: The concept of flow is concluded to be a fruitful area for research aimed at improving Web design practice, and it is suggested that additional research under more rigorous methodological conditions can further specify the factors and conditions associated with flow experiences on the Web.

467 citations


Journal ArticleDOI
TL;DR: This paper investigates the current situation of Web development tools, in both the commercial and research fields, by identifying and characterizing different categories of solutions, evaluating their adequacy to the requirements of Web application development, highlighting open problems, and outlining possible future trends.
Abstract: The exponential growth and capillary diffusion of the Web are nurturing a novel generation of applications, characterized by a direct business-to-customer relationship. The development of such applications is a hybrid between traditional IS development and Hypermedia authoring, and challenges the existing tools and approaches for software production. This paper investigates the current situation of Web development tools, in both the commercial and research fields, by identifying and characterizing different categories of solutions, evaluating their adequacy to the requirements of Web application development, highlighting open problems, and outlining possible future trends.

397 citations


Journal ArticleDOI
17 May 1999
TL;DR: The PageGather algorithm, which automatically identifies candidate link sets to include in index pages based on user access logs, is presented, and it is demonstrated experimentally that PageGather outperforms the Apriori data mining algorithm on this task.
Abstract: The creation of a complex Web site is a thorny problem in user interface design. In this paper we explore the notion of adaptive Web sites: sites that semi-automatically improve their organization and presentation by learning from visitor access patterns. It is easy to imagine and implement Web sites that offer shortcuts to popular pages. Are more sophisticated adaptive Web sites feasible? What degree of automation can we achieve? To address the questions above, we describe the design space of adaptive Web sites and consider a case study: the problem of synthesizing new index pages that facilitate navigation of a Web site. We present the PageGather algorithm, which automatically identifies candidate link sets to include in index pages based on user access logs. We demonstrate experimentally that PageGather outperforms the Apriori data mining algorithm on this task. In addition, we compare PageGather's link sets to pre-existing, human-authored index pages.
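PageGather works from page co-occurrence statistics computed over user sessions. The sketch below illustrates the general idea with a simplified variant that links pages exceeding a co-occurrence threshold and returns connected components as candidate link sets; the thresholds and the use of plain connected components (rather than the paper's conditional frequencies and clique option) are illustrative choices:

```python
from collections import defaultdict
from itertools import combinations

def candidate_link_sets(sessions, min_cooccurrence=0.1, max_size=10):
    """Simplified PageGather-style clustering: connect two pages when
    they co-occur in at least `min_cooccurrence` of the sessions, then
    return connected components as candidate index-page link sets.
    `sessions` is a list of lists of page URLs.
    """
    n = len(sessions)
    counts = defaultdict(int)
    for pages in sessions:
        for a, b in combinations(sorted(set(pages)), 2):
            counts[(a, b)] += 1

    graph = defaultdict(set)
    for (a, b), c in counts.items():
        if c / n >= min_cooccurrence:
            graph[a].add(b)
            graph[b].add(a)

    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        if 1 < len(component) <= max_size:
            clusters.append(sorted(component))
    return clusters
```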

372 citations


Book ChapterDOI
01 Jun 1999
TL;DR: Analyses are derived from a large corpus of Web search queries extracted from server logs recorded by a popular Internet search service, and Bayesian networks that predict search behavior are constructed, with a focus on the progression of queries over time.
Abstract: We discuss the construction of probabilistic models centering on temporal patterns of query refinement. Our analyses are derived from a large corpus of Web search queries extracted from server logs recorded by a popular Internet search service. We frame the modeling task in terms of pursuing an understanding of probabilistic relationships among temporal patterns of activity, informational goals, and classes of query refinement. We construct Bayesian networks that predict search behavior, with a focus on the progression of queries over time. We review a methodology for abstracting and tagging user queries. After presenting key statistics on query length, query frequency, and informational goals, we describe user models that capture the dynamics of query refinement.
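The paper builds Bayesian networks over temporal patterns, informational goals, and refinement classes. As a much simpler stand-in, the sketch below estimates a first-order transition table between refinement classes from pre-tagged query sessions; the class labels used in the example are illustrative, not the paper's exact tag set:

```python
from collections import Counter, defaultdict

def transition_probabilities(tagged_sessions):
    """Estimate P(next refinement class | current class) from query
    sessions whose queries have already been tagged with refinement
    classes. This first-order transition table is a far simpler
    stand-in for the Bayesian networks over time intervals and
    informational goals that the paper constructs.
    """
    counts = defaultdict(Counter)
    for session in tagged_sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for current, nexts in counts.items()
    }

# Example with illustrative refinement-class tags.
sessions = [["new_query", "specialization", "specialization"],
            ["new_query", "generalization", "reformulation"]]
print(transition_probabilities(sessions)["specialization"])
```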

314 citations


Proceedings ArticleDOI
07 Nov 1999
TL;DR: An effective technique for capturing common user profiles based on association rule discovery and usage based clustering is proposed and techniques for combining this knowledge with the current status of an ongoing Web activity to perform real time personalization are proposed.
Abstract: We describe an approach to usage based Web personalization taking into account both the offline tasks related to the mining of usage data, and the online process of automatic Web page customization based on the mined knowledge. Specifically, we propose an effective technique for capturing common user profiles based on association rule discovery and usage based clustering. We also propose techniques for combining this knowledge with the current status of an ongoing Web activity to perform real time personalization. Finally, we provide an experimental evaluation of the proposed techniques using real Web usage data.
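A plausible shape for the online step, given frequent itemsets mined offline: match the active session window against each itemset and score the pages it would recommend by the itemset's support. This is a hedged sketch of the general technique, not the paper's exact matching and weighting scheme:

```python
def recommend(active_session, frequent_itemsets, top_n=5):
    """Online personalization step, sketched: compare the pages in the
    active session window with previously mined frequent itemsets and
    score candidate pages by the support of the itemsets recommending
    them. `frequent_itemsets` is a list of (set_of_pages, support)
    pairs produced offline by an association-rule miner.
    """
    window = set(active_session)
    scores = {}
    for itemset, support in frequent_itemsets:
        overlap = itemset & window
        if overlap and itemset - window:
            for page in itemset - window:
                scores[page] = max(scores.get(page, 0.0), support)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [page for page, _ in ranked[:top_n]]

# Hypothetical itemsets and an active session of one page.
itemsets = [({"/products", "/cart"}, 0.12), ({"/products", "/specs"}, 0.08)]
print(recommend(["/products"], itemsets))
```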

313 citations


Proceedings Article
07 Sep 1999
TL;DR: This paper develops novel algorithms for enumerating and organizing all web occurrences of certain subgraphs that are signatures of web phenomena such as tightly-focused topic communities, webrings, taxonomy trees, keiretsus, etc., and argues that these algorithms run efficiently in a model of web graph evolution derived from experimental observations.
Abstract: The subject of this paper is the creation of knowledge bases by enumerating and organizing all web occurrences of certain subgraphs. We focus on subgraphs that are signatures of web phenomena such as tightly-focused topic communities, webrings, taxonomy trees, keiretsus, etc. For instance, the signature of a webring is a central page with bidirectional links to a number of other pages. We develop novel algorithms for such enumeration problems. A key technical contribution is the development of a model for the evolution of the web graph, based on experimental observations derived from a snapshot of the web. We argue that our algorithms run efficiently in this model, and use the model to explain some statistical phenomena on the web that emerged during our experiments. Finally, we describe the design and implementation of Campfire, a knowledge base of over one hundred thousand web communities.
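As a small illustration of signature matching, the sketch below looks for the webring signature described in the abstract (a central page with bidirectional links to several members) in an in-memory link graph; the threshold and graph representation are assumptions:

```python
def webring_centers(out_links, min_members=5):
    """Find pages matching a webring-like signature: a central page
    with bidirectional links to several member pages. `out_links`
    maps page -> set of linked pages; `min_members` is an illustrative
    cutoff, not the paper's.
    """
    centers = {}
    for page, targets in out_links.items():
        members = {t for t in targets if page in out_links.get(t, set())}
        if len(members) >= min_members:
            centers[page] = members
    return centers
```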

Book
03 Mar 1999
TL;DR: Web Wisdom: How to Evaluate and Create Information Quality on the Web, as discussed by the authors, addresses the key concerns of Web users and Web page authors regarding reliable and useful information on the Internet.
Abstract: From the Publisher: Here is the essential reference for anyone needing to evaluate or establish information quality on the World Wide Web. Web Wisdom: How to Evaluate and Create Information Quality on the Web addresses the key concerns of Web users and Web page authors regarding reliable and useful information on the Internet. Authors Janet E. Alexander and Marsha Ann Tate introduce critical Web evaluation principles and present the theoretical background necessary to evaluate and create quality information on the Web. They include easy-to-use checklists for step-by-step quality evaluations of virtually any Web site. Alexander and Tate also address important issues related to information on the Web, such as understanding the ways that advertising and sponsorship may affect the quality of information found on the Web.

01 Jan 1999
TL;DR: A wide range of heuristics for adjusting document rankings based on the special HTML structure of Web documents are described, including a novel one inspired by reinforcement learning techniques for propagating rewards through a graph which can be used to improve a search engine's rankings.
Abstract: Indexing systems for the World Wide Web, such as Lycos and Alta Vista, play an essential role in making the Web useful and usable. These systems are based on Information Retrieval methods for indexing plain text documents, but also include heuristics for adjusting their document rankings based on the special HTML structure of Web documents. In this paper, we describe a wide range of such heuristics, including a novel one inspired by reinforcement learning techniques for propagating rewards through a graph, which can be used to affect a search engine's rankings. We then demonstrate a system which learns to combine these heuristics automatically, based on feedback collected unintrusively from users, resulting in much improved rankings.
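The combination step can be pictured as a weighted sum of per-document heuristic scores whose weights are tuned from user feedback. The feature names and weights below are placeholders for illustration, not the trained model from the paper:

```python
def combined_score(features, weights):
    """Linear combination of per-document heuristic scores, the kind of
    rule whose weights could be learned from user feedback. `features`
    maps heuristic names (e.g. term-in-title, term-in-anchor-text,
    propagated-reward score) to values for one document; the names and
    weights here are illustrative.
    """
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())

weights = {"title_match": 2.0, "anchor_match": 1.5, "body_tf": 1.0}
docs = {
    "pageA": {"title_match": 1.0, "body_tf": 0.3},
    "pageB": {"anchor_match": 1.0, "body_tf": 0.7},
}
ranking = sorted(docs, key=lambda d: combined_score(docs[d], weights),
                 reverse=True)
print(ranking)
```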

Proceedings ArticleDOI
31 Oct 1999
TL;DR: A Web traffic model designed to assist in the evaluation and engineering of shared communication networks is presented; because the model is behavioral, it can be extrapolated to assess the effect of changes in protocols, the network, or user behavior.
Abstract: The growing importance of Web traffic on the Internet makes it important that we have accurate traffic models in order to plan and provision. In this paper we present a Web traffic model designed to assist in the evaluation and engineering of shared communication networks. Because the model is behavioral we can extrapolate the model to assess the effect of changes in protocols, the network or user behavior. The increasing complexity of Web traffic has required that we base our model on the notion of a Web-request, rather than a Web page. A Web-request results in the retrieval of information that might consist of one or more Web pages. The parameters of our model are derived from an extensive trace of Web traffic. Web-requests are identified by analyzing not just the TCP header in the trace but also the HTTP headers. The effect of Web caching is incorporated into the model. The model is evaluated by comparing independent statistics from the model and from the trace. The reasons for differences between the model and the traces are given.
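A Web-request, as defined above, bundles the retrievals triggered by one user action. A simple way to approximate such grouping from a per-object HTTP trace is to split a client's transfers at idle gaps; the one-second gap below is an illustrative threshold, not the boundary rule the paper derives from TCP and HTTP header analysis:

```python
from datetime import timedelta

def group_web_requests(http_records, idle_gap=timedelta(seconds=1)):
    """Group one client's per-object HTTP transfers into "Web-requests":
    bursts of retrievals separated by idle gaps. Each record is a
    (timestamp, url) pair, already sorted by time.
    """
    requests, current = [], []
    for record in http_records:
        if current and record[0] - current[-1][0] > idle_gap:
            requests.append(current)
            current = []
        current.append(record)
    if current:
        requests.append(current)
    return requests
```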

Patent
28 Jan 1999
TL;DR: In this paper, a focussed web crawler learns to recognize Web pages that are relevant to the interest of one or more users, from a set of examples provided by the users, and explores the Web starting from the example set, using the statistics collected from the examples and other analysis on the link graph of the growing crawl database, to guide itself towards relevant, valuable resources and away from irrelevant and/or low quality material on the Web.
Abstract: A focussed Web crawler learns to recognize Web pages that are relevant to the interest of one or more users, from a set of examples provided by the users. It then explores the Web starting from the example set, using the statistics collected from the examples and other analysis on the link graph of the growing crawl database, to guide itself towards relevant, valuable resources and away from irrelevant and/or low quality material on the Web. Thereby, the Web crawler builds a comprehensive topic-specific library for the benefit of specific users.
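The patent's crawler can be pictured as a best-first search over the link graph, ordered by a learned relevance estimate. A minimal sketch, assuming the caller supplies hypothetical relevance() and fetch_links() callables (a trained recognizer and a fetcher/parser), neither of which comes from the patent itself:

```python
import heapq

def focused_crawl(seed_urls, relevance, fetch_links, budget=1000):
    """Best-first crawl sketch: URLs sit in a priority queue ordered by
    a relevance estimate, so the crawl budget is spent on promising
    regions of the Web. `relevance(url)` returns a score in [0, 1];
    `fetch_links(url)` returns the out-links of a fetched page.
    """
    frontier = [(-relevance(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    collected = []
    while frontier and len(collected) < budget:
        score, url = heapq.heappop(frontier)
        collected.append((url, -score))
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return collected
```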

Patent
Meyer Sheri L.
09 Nov 1999
TL;DR: In this article, a dynamic Web page construction based on the number of items of data supplied by the facilities management system (10) to the Web server (33) is presented.
Abstract: Information about a facilities management system (10) for a building can be obtained by a remote computer accessing a Web server (33) at the facilities management system via the Internet (50). The Web server (33) employs a generic Web page layout and an active server pages program to create a Web page that displays information from the facilities management system (10). This mechanism does not require that the Web pages be custom developed for each particular building and its unique system configuration. Instead, the present invention provides dynamic Web page construction based on the number of items of data supplied by the facilities management system (10) to the Web server (33). An authoring tool for custom Web page layout also is provided.

Book ChapterDOI
01 Sep 1999
TL;DR: This paper focuses on web data mining research in the context of the authors' web warehousing project called WHOWEDA (Warehouse of Web Data), and categorizes web data mining into three areas: web content mining, web structure mining, and web usage mining.
Abstract: In this paper, we discuss mining with respect to web data, referred to here as web data mining. In particular, our focus is on web data mining research in the context of our web warehousing project called WHOWEDA (Warehouse of Web Data). We have categorized web data mining into three areas: web content mining, web structure mining and web usage mining. We have highlighted and discussed various research issues involved in each of these web data mining categories. We believe that web data mining will be a topic of exploratory research in the near future.

Journal ArticleDOI
TL;DR: The opportunities and obstacles inherent with business-to-business Web sites are examined and the process for devising, overseeing, and evaluating such sites is discussed.

Journal ArticleDOI
TL;DR: WebOQL, as mentioned in this paper, is a query language for Web data restructuring that synthesizes ideas from query languages for the Web, for semistructured data, and for website restructuring.
Abstract: The widespread use of the Web has originated several new data management problems, such as extracting data from Web pages and making databases accessible from Web browsers, and has renewed the interest in problems that had appeared before in other contexts, such as querying graphs, semistructured data and structured documents. Several systems and languages have been proposed for solving each of these Web data management problems, but none of these systems addresses all the problems from a unified perspective. Many of these problems essentially amount to data restructuring: we have information represented according to a certain structure and we want to construct another representation of (part of) it using a different structure. We present the WebOQL system, which supports a general class of data restructuring operations in the context of the Web. WebOQL synthesizes ideas from query languages for the Web, for semistructured data and for Website restructuring.

03 Jun 1999
TL;DR: This paper makes the case for identifying and exploiting the geographical location information of web sites so that web search engines can rank resources in a geographically sensitive fashion, in addition to using more traditional information-retrieval strategies.
Abstract: Many information resources on the web are relevant primarily to limited geographical communities. For instance, web sites containing information on restaurants, theaters, and apartment rentals are relevant primarily to web users in geographical proximity to these locations. In contrast, other information resources are relevant to a broader geographical community. For instance, an on-line newspaper may be relevant to users across the United States. Unfortunately, the geographical scope of web resources is largely ignored by web search engines. We make the case for identifying and exploiting the geographical location information of web sites so that web search engines can rank resources in a geographically sensitive fashion, in addition to using more traditional information-retrieval strategies. In this paper, we first consider how to compute the geographical location of web pages. Subsequently, we consider how to exploit such information in one specific "proof-of-concept" application we implemented in Java, and discuss other examples as well.
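One crude way to estimate a page's geographic scope, in the spirit of the paper, is to count mentions of known place names against a gazetteer; the paper also exploits site metadata and link information, which this sketch ignores, and the gazetteer entries below are illustrative:

```python
def geographic_scope(page_text, gazetteer):
    """Rough estimate of a page's geographic focus: count mentions of
    known place names and normalize into a distribution over regions.
    `gazetteer` maps place name -> region.
    """
    text = page_text.lower()
    counts = {}
    for place, region in gazetteer.items():
        hits = text.count(place.lower())
        if hits:
            counts[region] = counts.get(region, 0) + hits
    total = sum(counts.values())
    return {region: c / total for region, c in counts.items()} if counts else {}

gazetteer = {"Palo Alto": "SF Bay Area", "San Jose": "SF Bay Area",
             "Boston": "New England"}
print(geographic_scope("Apartment rentals in Palo Alto and San Jose", gazetteer))
```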

Proceedings ArticleDOI
07 Nov 1999
TL;DR: WEST, a WEb browser for Small Terminals, is described, that aims to solve some of the problems associated with accessing web pages on hand-held devices through a novel combination of text reduction and focus+context visualization.
Abstract: We describe WEST, a WEb browser for Small Terminals, that aims to solve some of the problems associated with accessing web pages on hand-held devices. Through a novel combination of text reduction and focus+context visualization, users can access web pages from a very limited display environment, since the system will provide an overview of the contents of a web page even when it is too large to be displayed in its entirety. To make maximum use of the limited resources available on a typical hand-held terminal, much of the most demanding work is done by a proxy server, allowing the terminal to concentrate on the task of providing responsive user interaction. The system makes use of some interaction concepts reminiscent of those defined in the Wireless Application Protocol (WAP), making it possible to utilize the techniques described here for WAP-compliant devices and services that may become available in the near future.
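Text reduction of this kind can be approximated on the proxy by keeping only a short leading fragment of each paragraph as a compact "card" for the small display. The sketch below shows only that simplest reduction; WEST combines several reduction methods (such as keyword extraction) with a focus+context view that this does not attempt:

```python
import re

def reduce_text(html_text, keep_chars=40):
    """Proxy-side text reduction sketch: strip tags naively, split the
    page into paragraph-like chunks, and keep a short leading fragment
    of each as a card for a small display.
    """
    text = re.sub(r"<[^>]+>", " ", html_text)
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    cards = []
    for p in paragraphs:
        snippet = " ".join(p.split())
        suffix = "..." if len(snippet) > keep_chars else ""
        cards.append(snippet[:keep_chars] + suffix)
    return cards
```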

Proceedings ArticleDOI
01 May 1999
TL;DR: A taxonomy of tasks undertaken on the World-Wide Web, based on naturally-collected verbal protocol data, reveals that several previous claims about browsing behavior are questionable, and suggests that widget-centered approaches to interface design and evaluation may be incomplete with respect to good user interfaces for the Web.
Abstract: A prerequisite to the effective design of user interfaces is an understanding of the tasks for which that interface will actually be used. Surprisingly little task analysis has appeared for one of the most discussed and fastest-growing computer applications, browsing the World-Wide Web (WWW). Based on naturally-collected verbal protocol data, we present a taxonomy of tasks undertaken on the WWW. The data reveal that several previous claims about browsing behavior are questionable, and suggest that widget-centered approaches to interface design and evaluation may be incomplete with respect to good user interfaces for the Web.

01 Jan 1999
TL;DR: This article illustrates the Web's value as a linguistic resource by showing that an Example-Based approach to lexical choice for machine translation can use the Web as an adequate and free resource.
Abstract: The WWW is two orders of magnitude larger than the largest corpora. Although noisy, web text presents language as it is used, and statistics derived from the Web can have practical uses in many NLP applications. For this reason, the WWW should be seen and studied as any other computationally available linguistic resource. In this article, we illustrate this by showing that an Example-Based approach to lexical choice for machine translation can use the Web as an adequate and free resource.

Book ChapterDOI
TL;DR: A quantitative model based on support logic for determining the interestingness of discovered patterns is developed and incorporated into the Web Site Information Filter (WebSIFT) system, and examples of interesting frequent itemsets automatically discovered from real Web data are presented.
Abstract: Web Usage Mining is the application of data mining techniques to large Web data repositories in order to extract usage patterns. As with many data mining application domains, the identification of patterns that are considered interesting is a problem that must be solved in addition to simply generating them. A necessary step in identifying interesting results is quantifying what is considered uninteresting in order to form a basis for comparison. Several research efforts have relied on manually generated sets of uninteresting rules. However, manual generation of a comprehensive set of evidence about beliefs for a particular domain is impractical in many cases. Generally, domain knowledge can be used to automatically create evidence for or against a set of beliefs. This paper develops a quantitative model based on support logic for determining the interestingness of discovered patterns. For Web Usage Mining, there are three types of domain information available: usage, content, and structure. This paper also describes algorithms for using these three types of information to automatically identify interesting knowledge. These algorithms have been incorporated into the Web Site Information Filter (WebSIFT) system and examples of interesting frequent itemsets automatically discovered from real Web data are presented.
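One concrete use of structure information as evidence, in the spirit of WebSIFT: a frequent itemset whose pages are not directly hyperlinked is harder to explain by the site structure alone and is therefore a better candidate for being interesting. The sketch below applies only that single test and omits the support-logic combination of evidence described in the paper:

```python
def interesting_itemsets(frequent_itemsets, site_links):
    """Flag frequent itemsets whose pages are not directly connected by
    hyperlinks, since their co-occurrence is then not explained by the
    site structure alone. `frequent_itemsets` is a list of
    (set_of_pages, support) pairs; `site_links` maps page -> set of
    pages it links to.
    """
    def directly_linked(a, b):
        return b in site_links.get(a, set()) or a in site_links.get(b, set())

    interesting = []
    for itemset, support in frequent_itemsets:
        pages = sorted(itemset)
        linked = any(directly_linked(a, b)
                     for i, a in enumerate(pages) for b in pages[i + 1:])
        if not linked:
            interesting.append((pages, support))
    return interesting
```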

Journal ArticleDOI
TL;DR: The WebComposition Markup Language (WCML), an XML-based language that implements the WebComposition model for Web application development, is introduced; WCML embodies object-oriented principles such as modularity, abstraction and encapsulation.
Abstract: Most Web applications are still developed ad hoc. One reason is the gap between established software design concepts and the low-level Web implementation model. We summarize work on WebComposition, a model for Web application development, then introduce the WebComposition Markup Language, an XML-based language that implements the model. WCML embodies object-oriented principles such as modularity, abstraction and encapsulation.

Journal ArticleDOI
TL;DR: This paper develops a general methodology for characterizing the access patterns of Web server requests based on a time‐series analysis of finite collections of observed data from real systems and applies an instance of this method to analyze aspects of large‐scale Web server performance.
Abstract: In this paper we develop a general methodology for characterizing the access patterns of Web server requests based on a time-series analysis of finite collections of observed data from real systems. Our approach is used together with the access logs from the IBM Web site for the Olympic Games to demonstrate some of its advantages over previous methods and to construct a particular class of benchmarks for large-scale heavily-accessed Web server environments. We then apply an instance of this class of benchmarks to analyze aspects of large-scale Web server performance, demonstrating some additional problems with methods commonly used to evaluate Web server performance at different request traffic intensities.
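The starting point for such a characterization is turning raw access-log timestamps into a fixed-interval request-rate series and examining its correlation structure. A minimal sketch with an illustrative bucket size, not the paper's benchmark construction:

```python
def request_rate_series(timestamps, bucket_seconds=60):
    """Aggregate request timestamps (seconds since some epoch) into a
    fixed-interval count series, the raw material for time-series
    characterization of access patterns.
    """
    if not timestamps:
        return []
    start = min(timestamps)
    buckets = {}
    for t in timestamps:
        idx = int((t - start) // bucket_seconds)
        buckets[idx] = buckets.get(idx, 0) + 1
    return [buckets.get(i, 0) for i in range(max(buckets) + 1)]

def autocorrelation(series, lag):
    """Sample autocorrelation at a given lag, useful for spotting
    periodic access patterns (e.g. daily cycles) in the rate series."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var
```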

Journal ArticleDOI
TL;DR: Experienced Web users and developers may use some of the more sophisticated models, identify what it means to manage a course with a Web site, improve their own design, and hear some tips on the hurdles to avoid.
Abstract: This paper gives an overview on the topic of control education on the Web. Accompanied by a Web page, the material is partly tutorial, enabling readers to step in at their current levels and move forward in their Web usage. For those readers who have not yet made use of the Web in their courses, we will demonstrate models of Web sites for consideration to suggest what can be done, and offer introductory steps for implementation. Experienced Web users and developers may use some of the more sophisticated models, identify what it means to manage a course with a Web site, improve their own design, and hear some tips on the hurdles to avoid. Specific applications to the control field are discussed, including software demonstrations and virtual and remote labs. In the end, it is hoped that readers will find information to move them a step forward from their current level.


Proceedings ArticleDOI
05 Jan 1999
TL;DR: The objective of this paper is to provide a conceptual framework and foundation for systematically investigating features in the Web environment that contribute to user satisfaction with a Web interface and uses F. Herzberg's motivation-hygiene theory to guide the identification of these features.
Abstract: With the fast development and increasing use of the World Wide Web as both an information seeking and an electronic commerce tool, Web usability studies are growing in importance. While Web designers have largely focused on the functional aspects of Web sites, there has been little systematic attention to (1) the motivational issues of Web user interface design or (2) a theoretically-driven approach to Web user satisfaction studies. The objective of this paper is to provide a conceptual framework and foundation for systematically investigating features in the Web environment that contribute to user satisfaction with a Web interface. This research uses F. Herzberg's (1966) motivation-hygiene theory to guide the identification of these features. Among the implications and contributions of this research are the identification of Web design features that may maximize the likelihood of user satisfaction and return visits to the Web site.

Journal ArticleDOI
TL;DR: The World Wide Web has revolutionized the way that people access information, and has opened up new possibilities in areas such as digital libraries, general and scientific information dissemination and retrieval, education, commerce, entertainment, government and health care.
Abstract: The World Wide Web has revolutionized the way that people access information, and has opened up new possibilities in areas such as digital libraries, general and scientific information dissemination and retrieval, education, commerce, entertainment, government and health care. There are many avenues for improvement of the Web, for example in the areas of locating and organizing information. Current techniques for access to both general and scientific information on the Web provide much room for improvement: search engines do not provide comprehensive indices of the Web and have difficulty in accurately ranking the relevance of results, and scientific information on the Web is very disorganized. We discuss the effectiveness of Web search engines, including results that show that the major Web search engines cover only a fraction of the "publicly indexable Web". Current research into improved searching of the Web is discussed, including new techniques for ranking the relevance of results, and new techniques in metasearch that can improve the efficiency and effectiveness of Web search. The creation of digital libraries incorporating autonomous citation indexing is discussed for improved access to scientific information on the Web.