Showing papers by "Qiang Yang published in 2001"


Proceedings ArticleDOI
26 Aug 2001
TL;DR: This paper presents an application of web log mining to obtain web-document access patterns and uses these patterns to extend the well-known GDSF caching and prefetching policies.
Abstract: Web caching and prefetching are well-known strategies for improving the performance of Internet systems. When combined with web log mining, these strategies can decide which web documents to cache and prefetch with higher accuracy. In this paper, we present an application of web log mining to obtain web-document access patterns and use these patterns to extend the well-known GDSF caching and prefetching policies. Using real web logs, we show that this application of data mining achieves dramatic improvements in web-access performance.
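
The abstract does not spell out the GDSF formula, so as context: the standard GDSF (Greedy-Dual-Size-Frequency) policy ranks each cached object p by key(p) = L + F(p) * C(p) / S(p), where L is an aging term, F the access frequency, C the fetch cost, and S the size, and evicts the smallest-key object. A minimal Python sketch under that assumption, with a `mined_freq` hook (our name, not the paper's) standing in for the frequency estimate obtained from web-log mining:

```python
import heapq

class GDSFCache:
    """Minimal sketch of a GDSF cache.

    key(p) = L + freq(p) * cost(p) / size(p); the object with the
    smallest key is evicted, and L is inflated to that key so that
    long-resident objects age out over time. Oversized objects
    (size > capacity) are not handled in this sketch.
    """

    def __init__(self, capacity):
        self.capacity = capacity      # total size budget
        self.used = 0
        self.L = 0.0                  # inflation (aging) factor
        self.entries = {}             # url -> (key, freq, size, cost)
        self.heap = []                # (key, url) min-heap; may hold stale items

    def _key(self, freq, size, cost):
        return self.L + freq * cost / size

    def access(self, url, size, cost=1.0, mined_freq=None):
        """Hit: bump frequency. Miss: insert, evicting smallest-key objects.
        `mined_freq` lets a web-log-mining step seed the frequency estimate
        (the paper's extension point)."""
        if url in self.entries:
            _, freq, size, cost = self.entries[url]
            freq += 1
        else:
            freq = mined_freq if mined_freq is not None else 1
            while self.used + size > self.capacity and self.heap:
                k, victim = heapq.heappop(self.heap)
                # skip stale heap entries whose key no longer matches
                if victim in self.entries and self.entries[victim][0] == k:
                    self.L = k            # age the cache up to the evicted key
                    self.used -= self.entries[victim][2]
                    del self.entries[victim]
            self.used += size
        k = self._key(freq, size, cost)
        self.entries[url] = (k, freq, size, cost)
        heapq.heappush(self.heap, (k, url))
```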

213 citations


Proceedings ArticleDOI
03 Jan 2001
TL;DR: This paper presents a recursive density-based clustering algorithm that can adaptively change its parameters, and shows both analytically and experimentally that the method yields clustering results superior to those of DBSCAN.
Abstract: A problem facing information retrieval on the web is how to effectively cluster large amounts of web documents. One approach is to cluster the documents based only on information provided by users' usage logs, not by the content of the documents. A major advantage of this approach is that relevancy information is objectively reflected by the usage logs: frequent simultaneous visits to two seemingly unrelated documents indicate that they are in fact closely related. In this paper, we present a recursive density-based clustering algorithm that can adaptively change its parameters. Our clustering algorithm, RDBC (Recursive Density Based Clustering), is based on DBSCAN, a density-based algorithm proven in its ability to process very large datasets. The facts that DBSCAN does not require the number of clusters to be determined in advance and that it is linear in time complexity make it particularly attractive for web page clustering. It can be shown that RDBC requires the same time complexity as DBSCAN. In addition, we show both analytically and experimentally that our method yields clustering results superior to those of DBSCAN.
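
The abstract describes RDBC only at a high level; the sketch below captures the recursive idea: run DBSCAN, recurse on the core points with a tightened radius, then assign every point to the cluster of its nearest core point. The eps-shrinking schedule and stopping rule are assumptions, not the paper's exact parameter-adaptation method:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def rdbc(X, eps=0.5, min_pts=5, shrink=0.7, max_depth=5):
    """Recursive density-based clustering in the spirit of RDBC."""
    core_idx = np.arange(len(X))
    core_labels = None
    for _ in range(max_depth):
        db = DBSCAN(eps=eps, min_samples=min_pts).fit(X[core_idx])
        core = db.core_sample_indices_
        if len(core) < min_pts:
            break
        core_idx = core_idx[core]           # recurse on core points only
        core_labels = db.labels_[core]      # their cluster ids at this level
        eps *= shrink                       # tighten the density requirement
    if core_labels is None:                 # degenerate case: plain DBSCAN
        return DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    # every point joins the cluster of its nearest core point
    nn = NearestNeighbors(n_neighbors=1).fit(X[core_idx])
    _, nearest = nn.kneighbors(X)
    return core_labels[nearest[:, 0]]
```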

67 citations


Journal ArticleDOI
TL;DR: This paper presents the system architecture, algorithms, and empirical evaluations of the interactive user-interface component of the CaseAdvisor system, which helps compress a large case base into several small ones.
Abstract: In interactive case-based reasoning, it is important to present a small number of important cases and problem features to the user at one time. This goal is difficult to achieve when large case bases are commonplace in industrial practice. In this paper we present our solution to the problem by highlighting the interactive user-interface component of the CaseAdvisor system. In CaseAdvisor, decision forests are created in real time to help compress a large case base into several small ones. This is done by merging similar cases together through a clustering algorithm. An important side effect of this operation is that it allows up-to-date maintenance operations to be performed for case-base management. During the retrieval process, an information-guided subsystem can then generate decision forests based on the user's current answers, obtained through an interactive process. Possible questions to the user are carefully analyzed through information theory. An important feature of the system is that case-base maintenance and reasoning are integrated into a seamless whole. In this article we present the system architecture and algorithms, as well as empirical evaluations.
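
As an illustration of the information-theoretic question analysis the abstract mentions, here is a hedged sketch: each candidate question (feature) is scored by its expected entropy reduction over the remaining candidate cases, and the highest-gain question is asked next. The case and feature representation is assumed, not taken from CaseAdvisor:

```python
import math
from collections import Counter

def entropy(cases):
    """Shannon entropy of the candidate solutions among the given cases."""
    counts = Counter(c["solution"] for c in cases)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def best_question(cases, features):
    """Pick the feature whose answer maximizes expected entropy reduction,
    i.e. the most informative question to ask the user next."""
    base = entropy(cases)
    best, best_gain = None, -1.0
    for f in features:
        groups = {}
        for c in cases:
            groups.setdefault(c["features"].get(f), []).append(c)
        remainder = sum(len(g) / len(cases) * entropy(g)
                        for g in groups.values())
        gain = base - remainder
        if gain > best_gain:
            best, best_gain = f, gain
    return best, best_gain

# usage with toy cases of the assumed shape:
cases = [{"solution": "replace fuse", "features": {"power": "off"}},
         {"solution": "reboot",       "features": {"power": "on"}}]
print(best_question(cases, ["power"]))   # -> ('power', 1.0)
```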

53 citations


Journal ArticleDOI
01 May 2001
TL;DR: This special issue brings together mature work focusing on maintaining the essential underlying knowledge of case-based reasoning systems, and provides a snapshot of the state of the art, presenting twelve articles examining core issues, methods, and lessons learned from research and applications.
Abstract: Case-based reasoning (CBR) is the process of reasoning and learning by storing prior cases—records of specific prior reasoning episodes—and retrieving and adapting them to aid new problem solving or interpretation in similar situations (Kolodner 1993, Leake 1996, Watson 1997). Case-based reasoning systems rely on the knowledge contained in multiple “knowledge containers” (Richter 1998), such as the case base, case adaptation knowledge, and similarity criteria. The contents of each of these knowledge containers may affect system efficiency and the quality of results. Over time, the knowledge containers may need to be updated in order to maintain or improve performance in response to changes in the system’s knowledge, task, environment, or user base. This gives rise to the need for strategies to address the problem of maintenance in case-based reasoning systems. Experience with the growing number of deployed case-based reasoning systems has led to increasing recognition of the value of maintenance to the success of practical CBR systems, as well as the importance of maintenance research. Maintenance issues arise in designing and building CBR systems and support tools that monitor system state and effectiveness in order to determine whether, when, and how to update CBR system knowledge to better serve performance goals. Understanding the issues that underlie the maintenance problem and using that understanding to develop good practical maintenance strategies are crucial to sustaining and improving the efficiency and solution quality of CBR systems as their case bases grow and as their tasks or environments change over long-term use. Maintaining CBR systems is an active research area that has been well represented at recent conferences. This special issue brings together mature work focusing on maintaining the essential underlying knowledge of case-based reasoning systems. It provides a snapshot of the state of the art, presenting twelve articles examining core issues, methods, and lessons learned from research and applications. Topics include the foundations of CBR system maintenance—the components of the maintenance process and maintenance goals—as well as proposals for specific maintenance strategies, theoretical and empirical analyses of their performance, and lessons on maintenance arising from practical experience. In order to understand the issues involved in developing maintenance strategies, as well as to understand maintenance practice and identify opportunities for new research, it is useful to understand the nature of the maintenance process and its relationship to the overall CBR process. The first article in this issue, Wilson and Leake’s “Maintaining Case-Based Reasoners: Dimensions and Directions,” provides a characterization of what maintenance is, the components of maintenance policies, and the dimensions along which alternative maintenance policies may differ. It then uses that characterization to examine the state of the art and identify opportunities for future research. Of course, the success of maintenance depends not only on the maintenance policies themselves but also on how maintenance is integrated with the overall case-based reasoning process. Reinartz, Iglezakis, and Roth-Berghofer’s article, “On Quality Measures for Case Base Maintenance,” describes an extended, six-step CBR cycle that incorporates two explicit maintenance steps, review and restore.

41 citations


Proceedings ArticleDOI
03 Jan 2001
TL;DR: This paper extends the well-known GDSF caching policy to include not only access-trend information but also the dynamics of the trend itself (trends on access trends), yielding more accurate predictions of future access trends when access patterns vary greatly.
Abstract: Caching is one of the most effective techniques for improving the performance of Internet systems. The heart of a caching system is its page replacement policy, which decides which page in a cache to replace with a new one. Different caching policies have dramatically different effects on system performance. In this paper, we extend the well-known GDSF caching policy to include not only access-trend information but also the dynamics of the access trend itself, that is, the trends on access trends. The new policy that we propose, called the Taylor Series Prediction (TSP) policy, provides more accurate predictions of future access trends when access patterns vary greatly. We back up our claims through a series of experiments using web access traces.
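
The abstract names the TSP policy but gives no formula. A natural reading, and an assumption here, is that an object's future access rate is extrapolated from a truncated Taylor expansion, f(t+h) ≈ f(t) + f'(t)h + ½f''(t)h², with the derivatives approximated by backward finite differences over per-interval access counts:

```python
def tsp_predict(counts, horizon=1.0):
    """Extrapolate the next-interval access rate of one object from its
    recent per-interval access counts via a 2nd-order Taylor expansion.
    `counts` is a list of access counts, most recent last."""
    f = counts[-1]
    d1 = counts[-1] - counts[-2] if len(counts) >= 2 else 0.0   # ~ f'(t)
    d2 = (counts[-1] - 2 * counts[-2] + counts[-3]
          if len(counts) >= 3 else 0.0)                         # ~ f''(t)
    return max(0.0, f + d1 * horizon + 0.5 * d2 * horizon ** 2)

# The predicted rate would then replace the raw frequency F(p) in the
# GDSF key:  key(p) = L + tsp_predict(history[p]) * cost(p) / size(p)
```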

38 citations


Journal ArticleDOI
01 Mar 2001
TL;DR: A semi-automatic introspective learning method is introduced to address, in part, the dynamic maintenance of feature weights in a case base through the integration of a learning network into case-based reasoning.
Abstract: A key issue in case-based reasoning is how to maintain the domain knowledge in the face of a changing environment. During the case retrieval process in case-based reasoning, feature-value pairs are used to compute the ranking scores of the cases in a case base, and different feature-value pairs may have different importance measures, represented as weight values, in this computation. How to maintain a set of appropriate feature weights so that they can be used to solve future problems effectively and efficiently is a key factor in determining the success of case-based reasoning applications. Our focus in this paper is on the dynamic maintenance of feature weights in a case base. We address a particular problem related to the feature-weight maintenance issue. In current practice, feature weights are assigned and revised manually, which not only makes them informal and inaccurate but also requires intensive labor. We introduce a semi-automatic introspective learning method to partially address this issue. Our approach is to construct a network architecture on the case base that supports introspective learning. Weight learning and weight evolution are accomplished in the background by integrating a learning network into case-based reasoning: the reasoning part remains case-based, while the learning part is handled by a layered network. The computation in the network follows well-known neural network algorithms with well-known properties. We demonstrate the effectiveness of our approach through experiments.
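
The paper's network architecture and update rule are not given in the abstract; the following is a minimal delta-rule sketch of the introspective idea, in which a wrong retrieval shifts weight toward features where the correct case matches the query and away from features where the wrongly retrieved case matches. All names and the exact update form are assumptions:

```python
import numpy as np

def retrieve(case_features, query, weights):
    """Weighted similarity ranking: score each case (one row of binary or
    numeric feature matches) against the query under the current weights."""
    scores = case_features @ (weights * query)
    return int(np.argmax(scores))

def introspective_update(case_features, query, weights, correct_idx, lr=0.05):
    """Delta-rule style introspective learning step, run in the background
    after each solved problem: nudge weights only when retrieval erred."""
    got = retrieve(case_features, query, weights)
    if got != correct_idx:
        weights += lr * query * (case_features[correct_idx] - case_features[got])
        np.clip(weights, 0.0, None, out=weights)   # keep weights non-negative
    return weights
```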

32 citations


Proceedings Article
01 Jan 2001
TL;DR: This paper proposes several rule-pruning methods that enable us to build efficient, compact, and high-quality classifiers for web-request prediction.
Abstract: N-gram and repeating-pattern-based prediction rules have been used for next-web-request prediction. However, there has been no rigorous study of how to select the best rule for a given observation. The longest pattern is not necessarily the best, because longer patterns are also rarer. In this paper, we propose several rule-pruning methods that enable us to build efficient, compact, and high-quality classifiers for web-request prediction.
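
As a sketch of the pruning idea (the paper's specific methods are not detailed in the abstract), the rules below are mined with a minimum-support cutoff, and prediction picks the matching rule of highest confidence rather than simply the longest match:

```python
from collections import defaultdict

def mine_ngram_rules(sessions, max_n=3, min_support=5):
    """Mine rules (last n requests) -> next request from web sessions,
    keeping counts so support and confidence can be computed."""
    lhs_count = defaultdict(int)
    rule_count = defaultdict(int)
    for s in sessions:
        for n in range(1, max_n + 1):
            for i in range(len(s) - n):
                lhs = tuple(s[i:i + n])
                lhs_count[lhs] += 1
                rule_count[(lhs, s[i + n])] += 1
    rules = {}
    for (lhs, nxt), c in rule_count.items():
        if lhs_count[lhs] >= min_support:      # prune rare patterns outright
            conf = c / lhs_count[lhs]
            if lhs not in rules or conf > rules[lhs][1]:
                rules[lhs] = (nxt, conf)       # keep best rule per pattern
    return rules

def predict(rules, recent, max_n=3):
    """Among all rules matching a suffix of the recent requests, prefer
    the highest-confidence one, not blindly the longest."""
    candidates = [rules[tuple(recent[-n:])] for n in range(1, max_n + 1)
                  if tuple(recent[-n:]) in rules]
    return max(candidates, key=lambda r: r[1])[0] if candidates else None
```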

28 citations


Journal ArticleDOI
01 May 2001
TL;DR: This article presents a case-addition maintenance policy that is guaranteed to return a concise case base with good coverage quality, analytically derives the well-known coverage convergence curves commonly displayed in CBR experiments, and shows that benefit reduction can be used as a predictor of convergence speed.
Abstract: A major problem in many practical applications of case-based reasoning (CBR) and knowledge reuse is how to keep case bases concise and complete. Solving this problem requires repeated maintenance operations to be applied to case bases, and different maintenance policies may result in case bases of very different quality. In this article, we present a case-addition maintenance policy that is guaranteed to return a concise case base with good coverage quality. We demonstrate that the coverage of the case base computed by the case-addition algorithm is guaranteed to be within a fixed bound of the optimal case-base coverage. We also show that the algorithm implementing the case-addition policy is efficient. Our result also highlights benefit reduction as a key factor influencing the convergence of case-base coverage as cases are added to a case base. Through our theoretical analysis, we analytically derive the well-known coverage convergence curves commonly displayed in CBR experiments and show that benefit reduction can be used as a predictor of convergence speed.
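
The abstract does not state the algorithm, but the coverage guarantee it describes is characteristic of greedy set-cover-style selection, so the sketch below adds cases by largest marginal coverage gain; the (1 − 1/e) bound in the comment is the classic greedy guarantee, offered here as an assumption about the paper's analysis:

```python
def case_addition(candidates, coverage, budget):
    """Greedy case-addition: repeatedly add the case with the largest
    marginal coverage gain. `coverage[c]` is the set of problems case c
    can solve. Classic greedy set-cover analysis gives a (1 - 1/e)
    approximation of the optimal coverage for a fixed-size case base."""
    case_base, covered = [], set()
    while len(case_base) < budget:
        best = max((c for c in candidates if c not in case_base),
                   key=lambda c: len(coverage[c] - covered),
                   default=None)
        if best is None or not (coverage[best] - covered):
            break                  # no remaining case adds any coverage
        case_base.append(best)
        covered |= coverage[best]
    return case_base, covered
```

The shrinking marginal gain of each added case is the "benefit reduction" effect: plotting len(covered) against the number of added cases produces the familiar convergence curve, and a fast drop in marginal benefit predicts fast convergence.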

25 citations


Proceedings ArticleDOI
03 Sep 2001
TL;DR: This work proposes an integrated web-caching and web-prefetching model in which prefetching aggressiveness, the replacement policy, and increased network traffic are addressed together in a single framework, using a prediction model based on statistical correlation between web objects.
Abstract: Web caching and web prefetching are two effective techniques for latency reduction. However, most previous research has addressed only one of these two techniques in isolation. In this work, we propose an integrated web-caching and web-prefetching model, in which the issues of prefetching aggressiveness, replacement policy, and increased network traffic are addressed together in a single framework. The core of our integrated solution is a prediction model based on statistical correlation between web objects. The model is trained on realistic web server logs. By utilizing the predictive power of the model, we develop an integrated prefetching and caching algorithm, Pre-GDSF. We conduct simulations to examine the effectiveness of our algorithm. We show the tradeoff between latency reduction and increased network traffic achieved by Pre-GDSF. We also show why prefetching is more effective for smaller caches than for larger ones.
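
Pre-GDSF's exact correlation statistic is not given in the abstract; in the sketch below, pairwise co-occurrence counts mined from server logs stand in for it, and a confidence threshold plus a prefetch cap model the aggressiveness-versus-traffic tradeoff:

```python
def prefetch_candidates(current_url, pair_counts, url_counts,
                        threshold=0.3, max_prefetch=2):
    """Rank objects by their statistical correlation with the object just
    served, P(q | p) = count(p then q) / count(p), and prefetch the top
    few that clear a confidence threshold. Raising `threshold` or
    lowering `max_prefetch` makes prefetching less aggressive, trading
    latency reduction for less extra network traffic."""
    if url_counts.get(current_url, 0) == 0:
        return []
    scored = [(cnt / url_counts[current_url], q)
              for (p, q), cnt in pair_counts.items() if p == current_url]
    scored.sort(reverse=True)
    return [q for prob, q in scored[:max_prefetch] if prob >= threshold]
```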

18 citations


Book ChapterDOI
23 Oct 2001
TL;DR: This paper mines a prediction model of document access patterns from web logs, uses the model to extend the well-known GDSF caching policy, and presents a new method to integrate this caching algorithm with a prediction-based prefetching algorithm.
Abstract: Caching and prefetching are well-known strategies for improving the performance of Internet systems. The heart of a caching system is its page replacement policy, which selects the pages to be replaced in a proxy cache when a request arrives. By the same token, the essence of a prefetching algorithm lies in its ability to accurately predict future requests. In this paper, we present a method for caching variable-sized web objects using an n-gram-based prediction of future web requests. Our method mines a prediction model of document access patterns from the web logs and uses the model to extend the well-known GDSF caching policy. In addition, we present a new method to integrate this caching algorithm with a prediction-based prefetching algorithm. We empirically show that system performance is greatly improved by the integrated approach.
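
To illustrate how the two pieces could mesh, here is one request cycle of an integrated loop. It reuses the GDSFCache and n-gram `predict` sketches from the earlier entries; `sizes` is a hypothetical url-to-size lookup, and seeding the mined frequency of a prefetched object is our assumption about how the prediction model feeds the replacement policy:

```python
def integrated_step(cache, rules, recent, url, sizes):
    """One request cycle of an integrated caching/prefetching loop:
    serve the request through the GDSF cache, then use the n-gram
    predictor both to prefetch the likely next object and to bias the
    replacement key so the prefetched object is not evicted at once."""
    cache.access(url, sizes[url])                 # serve + GDSF accounting
    guess = predict(rules, recent + [url])        # n-gram next-request guess
    if guess is not None and guess in sizes:
        cache.access(guess, sizes[guess], mined_freq=2)   # prefetch
```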

8 citations


Journal ArticleDOI
TL;DR: An integrated agent system that couples CBR with an active database system, providing real-time response, highly flexible knowledge management, and autonomous reaction to events that a passive CBR system cannot handle.
Abstract: Case-based reasoning (CBR) is an artificial intelligence (AI) technique for problem solving that uses previous similar examples to solve a current problem. Despite its success, most current CBR systems are passive: they require human users to activate them manually and to provide information about the incoming problem explicitly. In this paper, we present an integrated agent system that couples CBR systems with an active database system. Active databases, with the support of active rules, can perform event detection, condition monitoring, and event handling (action execution) in an automatic manner. The integrated ActiveCBR system consists of two layers. In the lower layer, the active database is rule-driven; in the higher layer, the result of action execution of active rules is transformed into feature-value pairs required by the CBR subsystem. The layered architecture separates CBR from sophisticated rule-based reasoning and improves the traditional passive CBR system with the active property. The system provides real-time response, is highly flexible in knowledge management, and responds autonomously to events that a passive CBR system cannot handle. We demonstrate the system's efficiency and effectiveness through empirical tests.
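
A two-layer sketch of the architecture as described: event-condition-action rules in the lower (active database) layer fire on events and emit feature-value pairs, which the upper layer feeds into ordinary CBR retrieval. The rule format and case schema here are assumptions:

```python
class ActiveCBRSketch:
    """Lower layer: active rules map raw events to feature-value pairs.
    Upper layer: passive CBR retrieval, activated automatically."""

    def __init__(self, rules, case_base):
        self.rules = rules            # list of (condition, action) pairs
        self.case_base = case_base    # list of (features: dict, solution)

    def on_event(self, event):
        features = {}
        for condition, action in self.rules:
            if condition(event):                 # event detection + condition check
                features.update(action(event))   # action emits feature-value pairs
        # retrieval: the case matching the most feature-value pairs wins
        return max(self.case_base,
                   key=lambda c: sum(c[0].get(k) == v
                                     for k, v in features.items()),
                   default=None)

# usage: a temperature event is mapped to a symbolic feature, then matched
rules = [(lambda e: e["temp"] > 90, lambda e: {"overheat": True})]
cases = [({"overheat": True}, "throttle CPU"), ({"overheat": False}, "no action")]
agent = ActiveCBRSketch(rules, cases)
print(agent.on_event({"temp": 95}))   # -> ({'overheat': True}, 'throttle CPU')
```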

Book ChapterDOI
02 Aug 2001
TL;DR: This work presents an application of CBR in the domain of web-document prediction and retrieval, whereby a server-side application can predict, with high accuracy and coverage, a user's next request for hypertext documents based on past requests.
Abstract: Case-based reasoning aims to use past experience to solve new problems. A strong requirement for its application is that an extensive experience base exists to provide statistically significant justification for new applications. Such extensive experience bases have been rare, confining most CBR applications to small-scale problems involving one or a few users, or even toy problems. In this work, we present an application of CBR in the domain of web-document prediction and retrieval, whereby a server-side application can predict, with high accuracy and coverage, a user's next request for hypertext documents based on past requests. An application program can then use the prediction knowledge to prefetch or presend web objects to reduce latency and network load. Through this application, we demonstrate the feasibility of CBR in the web-document retrieval context, exposing the vast possibility of using web-log files that contain document-retrieval experiences from millions of users. In this framework, a CBR system is embedded within an overall web-server application. A novelty of the work is that data mining and case-based reasoning are combined in a seamless manner, allowing cases to be mined efficiently. In addition, we develop techniques that allow different case bases to be combined to yield an overall case base of higher quality than each individual one. We validate our work through experiments using realistic, large-scale web logs.
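
A hedged sketch of mining cases from web logs as the abstract describes: each case pairs a window of preceding requests (the problem part) with the next request (the solution part), and retrieval matches on the longest suffix overlap with the current session, breaking ties by support. The window size and tie-breaking rule are assumptions:

```python
from collections import Counter

def mine_cases(sessions, window=3):
    """Mine web-log sessions into CBR cases: problem = the window of
    preceding requests, solution = the next request. Duplicate episodes
    are merged, keeping a support count per case."""
    counter = Counter()
    for s in sessions:
        for i in range(window, len(s)):
            counter[(tuple(s[i - window:i]), s[i])] += 1
    return counter

def retrieve_case(case_counts, recent, window=3):
    """Retrieve the best-matching case by longest suffix overlap with the
    current request sequence, breaking ties by support count."""
    best, best_score = None, (-1, 0)
    probe = tuple(recent[-window:])
    for (problem, solution), support in case_counts.items():
        overlap = 0
        for a, b in zip(reversed(problem), reversed(probe)):
            if a != b:
                break
            overlap += 1
        if overlap > 0 and (overlap, support) > best_score:
            best, best_score = solution, (overlap, support)
    return best   # predicted next request, usable for prefetch/presend
```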