Chapter VIII
Cache Management for
Web-Powered Databases
Dimitrios Katsaros and Yannis Manolopoulos
Aristotle University of Thessaloniki, Greece
Copyright © 2003, Idea Group Publishing.
ABSTRACT
The Web has become the primary means for information dissemination. It is
ideal for publishing data residing in a variety of repositories, such as
databases. In such a multi-tier system (client - Web server - underlying
database), where the Web page content is dynamically derived from the
database (Web-powered database), cache management is very important in
making efficient distribution of the published information. Issues related to
cache management are the cache admission/replacement policy, the cache
coherency and the prefetching, which acts complementary to caching. The
present chapter discusses the issues, which make the Web cache management
radically different than the cache management in databases or operating
systems. We present a taxonomy and the main algorithms proposed for cache
replacement and coherence maintenance. We present three families of predictive
prefetching algorithms for the Web and characterize them as Markov predictors.
Finally, we give examples of how some popular commercial products deal with
the issues regarding the cache management for Web-powered databases.
INTRODUCTION
In recent years the World Wide Web, or simply the Web (Berners-Lee, Cailliau, Luotonen, Nielsen & Livny, 1994), has become the primary means for information dissemination. It is a hypertext-based application and uses the HTTP protocol for file transfers. What started as a medium to serve the needs of a specific

scientific community (that of Particle Physics), has now become the most popular application running on the Internet. Today it is used for many purposes, ranging from purely educational to entertainment and, lately, conducting business. Applications such as digital libraries, video-on-demand, distance learning and virtual stores that allow for buying cars, books, computers, etc., are some of the services running on the Web. The advent of the XML language and its adoption by the World Wide Web Consortium as a standard for document exchange has revitalized many old applications and fueled new ones.
During its first years, the Web consisted of static HTML pages stored on the file system of the connected machines. When new needs arose, such as e-commerce or the need to publish data residing in other systems, e.g., databases, it was realized that we could not afford, in terms of storage, to replicate the original data on the Web server’s disk in the form of HTML pages. Moreover, it would make no sense to replicate data that would never be requested. So, instead of static pages, an application program should run on the Web server to receive the requests from clients, retrieve the relevant data from the source and then pack them into HTML or XML format. Even the emerging “semistructured” XML databases, which store data directly in XML format, need an application program that connects to the DBMS and retrieves the XML file (or fragment). Thus, a new kind of page, dynamically generated, and a new architecture were born. We no longer have the traditional pair of a Web client and a Web server; a third part is added: the application program, running on the Web server and serving data from an underlying repository, in most cases a database. This scheme is sometimes referred
to as Web-powered database and the Web site, which provides access to a large
number of pages whose content is extracted from databases, is called data intensive
Web site (Atzeni, Mecca & Merialdo, 1998; Yagoub, Florescu, Issarny & Valduriez,
2000). The typical architecture for such a scenario is depicted in Figure 1. In this
scheme there are three tiers, the database back-end, the Web/application server and
the Web client. In order to generate dynamic content, Web servers must execute a
program (e.g., server-side scripting mechanism). This program (script) connects
to the DBMS, executes the client query, gets the results and packs them in HTML/
XML form in order to return them to the user. Quite a lot of server-side scripting
mechanisms have been proposed in the literature (Greenspun, 1999; Malaika, 1998).
An alternative to having a program that generates HTML is offered by the several forms of annotated HTML. Annotated HTML, such as PHP, Active Server Pages and Java Server Pages, embeds scripting commands in an HTML document.
The popularity of the Web has resulted in heavy traffic on the Internet and heavy load on the Web servers. For Web-powered databases the situation is worsened by the
fact that the application program must interact with the underlying database to
retrieve the data. So, the net effect of this situation is network congestion, high client
perceived latency, Web server overload and slow response times for Web servers.
Fortunately the situation is not incurable due to the existence of reference locality
in Web request streams. The principle of locality (Denning & Schwartz, 1972)

asserts that: (a) correlation between immediate past and immediate future refer-
ences tends to be high, and (b) correlation between disjoint reference patterns tends
to zero as the distance between them tends to infinity. Existence of reference locality
is indicated by several studies (Almeida et al., 1996; Breslau, Cao, Fan, Phillips &
Shenker, 1999).
There are two types of reference locality, namely temporal and spatial locality
(Almeida et al., 1996). Temporal locality can be described using the stack distance
model, as introduced in (Mattson, Gecsei, Slutz & Traiger, 1970). Existence of high
temporal locality in a request stream results in a relatively small average stack
distance and implies that recently accessed data are more likely to be referenced in
the near future. Consider for example the following reference streams: AABCBCD
and ABCDABC. They both have the same popularity profile¹ for each item.
Evidently, the stack distance for the first stream is smaller than for the second stream.
This can be deduced from the fact that the number of intervening references
between any two references for the same item in the first stream is smaller than for
the second stream. Thus, the first stream exhibits higher temporal locality than the
second. Spatial locality on the other hand, characterizes correlated references for
different data. Spatial locality in a stream can be established by comparing the total
number of unique subsequences of the stream with the total number of subsequences
that would be found in a random permutation of that stream. Existence of spatial
locality in a stream implies that the number of such unique subsequences is smaller than the respective number of subsequences in a random permutation of the stream.
Consider for example the following reference streams: ABCABC and ACBCAB.
They both have the same popularity profile for each item. We can observe in the
first stream that a reference to item B always follows a reference to item A and is
followed by a reference to item C. This is not the case in the second stream and we
cannot observe a similar rule for any other sequence of items.
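To make the two notions of locality concrete, here is a small sketch in Python using the example streams from the text. The stack-distance function counts, for each re-reference, the distinct items seen since the previous reference to the same item; the bigram count is one simple proxy (our choice, not the chapter's exact procedure) for comparing unique subsequences:

```python
def stack_distances(stream):
    """For each re-reference, count the distinct items seen since the
    previous reference to the same item; first references are skipped."""
    last, dists = {}, []
    for i, x in enumerate(stream):
        if x in last:
            dists.append(len(set(stream[last[x] + 1:i])))
        last[x] = i
    return dists

def unique_bigrams(stream):
    """Number of distinct length-2 subsequences; fewer unique bigrams
    than in a random permutation suggests spatial locality."""
    return len(set(zip(stream, stream[1:])))

# Temporal locality: AABCBCD has smaller stack distances than ABCDABC.
print(stack_distances("AABCBCD"), stack_distances("ABCDABC"))  # [0, 1, 1] [3, 3, 3]
# Spatial locality: ABCABC repeats the pattern A->B->C, ACBCAB does not.
print(unique_bigrams("ABCABC"), unique_bigrams("ACBCAB"))  # 3 5
```

Both streams in each pair reference each item equally often, yet the measures separate them, which is exactly why locality metrics carry information beyond popularity profiles.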
Due to the existence of temporal locality we can exploit the technique of
caching, that is, temporal storage of data closer to the consumer. Caching can save
resources, i.e., network bandwidth, since fewer packets travel in the network, and
[Figure omitted: a Web client and a proxy server, each with its own cache, exchange requests and responses across the Internet with the Web server/application server, which maintains main-memory and disk caches and performs prefetching; the application server in turn sends requests to the back-end database, which keeps its own cache. Together these tiers form the Web-powered database.]
Figure 1: Architecture of a typical Web-powered database

time, since we have faster response times. Caching can be implemented at various
points along the path² of the flow of data from the repository to the final consumer.
So, we may have caching at the DBMS itself, the Web server’s memory or disk, at
various points in the network (proxy caches (Luotonen & Altis, 1994)) or at the
consumer’s endpoint. Web proxies may cooperate so that several proxies serve each other’s misses (Rodriguez, Spanner & Biersack, 2001). All the caches
present at various points comprise a memory hierarchy. The most important part of
a cache is the mechanism that determines which data will be accommodated in the
cache space and is referred to as the cache admission/replacement policy.
Caching introduces a complication: how to maintain cache contents fresh, that
is, consistent with the original data residing in the repository. The issue of cache
consistency is of particular interest for Web-powered databases, because their data
are frequently updated by other applications running on top of the DBMS and thus
the cached copies must be invalidated or refreshed.
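As a toy illustration of invalidation-based coherence, the sketch below maps base tables to the cached pages derived from them and evicts those pages when an update arrives. The table names, URLs and dependency map are hypothetical, invented here for illustration; they are not from the chapter:

```python
# Hypothetical cache of generated pages, keyed by URL.
page_cache = {"/catalog": "<html>catalog</html>", "/quote": "<html>quote</html>"}

# Hypothetical dependency map: which cached pages derive from which table.
derived_from = {
    "products": {"/catalog", "/quote"},
    "prices": {"/quote"},
}

def invalidate(table):
    """Evict every cached page whose content depends on the updated table."""
    for url in derived_from.get(table, ()):
        page_cache.pop(url, None)

invalidate("prices")       # an update to `prices` arrives from the DBMS
print(sorted(page_cache))  # ['/catalog'] -- only /quote was invalidated
```

The design choice here is invalidation (drop the stale copy and regenerate on the next miss) rather than refresh (push the new content proactively); the chapter's coherence section discusses both options.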
Obviously, requests for “first-time accessed” data and “non-cacheable” data
(containing personalized, authentication information, etc.) cannot benefit from
caching. In these cases, due to the existence of spatial locality in request streams,
we can exploit the technique of preloading or prefetching, which acts complementary to caching. Prefetching deduces future requests for data and brings those data into the cache before an explicit request is made for them. Prefetching may increase the
amount of traveling data, but on the other hand can significantly reduce the latency
associated with every request.
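The abstract characterizes predictive prefetching algorithms as Markov predictors; a minimal first-order sketch is shown below. The session logs are made-up data for illustration, and real predictors (as the chapter's prefetching section describes) use higher orders and confidence thresholds:

```python
from collections import Counter, defaultdict

def build_model(sessions):
    """First-order Markov predictor: count page-to-page transitions."""
    model = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            model[cur][nxt] += 1
    return model

def prefetch_candidate(model, page):
    """Most likely next page after `page`, i.e., what to prefetch."""
    nexts = model.get(page)
    return nexts.most_common(1)[0][0] if nexts else None

# Hypothetical access logs: three past sessions of page requests.
logs = [["home", "news", "sports"], ["home", "news", "weather"],
        ["home", "news", "sports"]]
model = build_model(logs)
print(prefetch_candidate(model, "news"))  # sports
```

After a client fetches "news", the cache would speculatively fetch "sports"; a wrong guess costs the extra transfer, which is the traffic/latency tradeoff the paragraph above describes.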
Contributions
This chapter will provide information concerning the management of Web
caches. It by no means intends to provide a survey of Web caching. Such a survey,
although from a very different perspective, can be found in (Wang, 1999). The target
of the chapter is twofold. Firstly, it intends to clarify the particularities of the Web
environment that call for different solutions regarding the replacement policies,
cache coherence and prefetching in the context of the Web-powered databases. It
will demonstrate how these particularities made the old solutions (adopted in
traditional database and operating systems) inadequate for the Web and how they
motivated the evolution of new methods. Examples of this evolution regard the
replacement, coherence and prefetching techniques for the Web. The second
objective of the chapter is to present a taxonomy of the techniques proposed so far
and to sketch the most important algorithms belonging to each category. Through this
taxonomy, which goes from the simplest to the most sophisticated algorithms, the
chapter intends to clearly demonstrate the tradeoffs involved and show how each
category deals with them. The demonstration of some popular representative
algorithms of each category intends to show how the tradeoffs affect the implementation complexity of the algorithms and how ease of implementation must be balanced against performance. Finally, another target of the present chapter is
to present the practical issues of these topics through a description of how some
popular commercial products deal with them.

The rest of the chapter is organized as follows. The Section “Background”
provides some necessary background on the aforementioned topics and presents the
peculiarities of the Web that make Web cache management vastly different from
cache management in operating systems and centralized databases. Moreover, it will
give a formulation for the Web caching problem as a combinatorial optimization
problem and will define the performance measures used to characterize the cache
efficiency in the Web. The Section “Replacement Policies” will present a taxonomy
along with the most popular and efficient cache replacement policies proposed in the
literature. The Section “Cache Coherence” will elaborate on the topic of how to
maintain the cache contents consistent with the original data in the source and the
Section “Prefetching” will deal with the issue of how to improve cache performance
through the mechanism of prefetching. For the above topics, we will not elaborate
on details of how these algorithms can be implemented in a real system, since each
system provides its own interface and extensibility mechanisms. Our concern is to
provide only a description of the algorithms and the tradeoffs involved in their
operation. The Section “Web Caches in Commercial Products” will describe how
two commercial products, a proxy cache and a Web-powered database, cope with
cache replacement and coherency. Finally, the Section “Emerging and Future
Trends” will provide a general description of the emerging Content Distribution
Networks and will highlight some directions for future research.
BACKGROUND
The presence of caches in specific positions of a three-tier (or multi-tier, in
general) architecture, like that presented earlier, can significantly improve the
performance of the whole system. For example, a cache in the application server,
which stores the “hot” data of the DBMS, can avoid the costly interaction with it.
Similarly, data that do not change very frequently can be stored closer to the
consumer e.g., in a proxy cache. But in order for a cache to be effective, it must be
tuned so that it meets the requirements imposed by the specialized characteristics of
the application it serves. The primary means for tuning a cache is the admission/
replacement policy. This mechanism decides which data will enter the cache when
a client requests them and which data already in cache will be purged out in order
to make space for the incoming data when the available space is not sufficient.
Sometimes these two policies are integrated and are simply called the replacement policy. The identification of the cached pages is based on the URL of the cached page (with any additional data following it, e.g., the query string or the POST body of the request)³.
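As a concrete, deliberately simplified sketch of a replacement policy keyed by URL, here is a plain LRU cache in Python. This is a generic textbook policy, not one of the chapter's Web-specific algorithms, which additionally weigh object size, fetch cost and freshness:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU replacement policy keyed by URL (plus query string)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # URL -> page content

    def get(self, url):
        if url not in self.entries:
            return None                    # miss: caller fetches, then put()s
        self.entries.move_to_end(url)      # mark as most recently used
        return self.entries[url]

    def put(self, url, page):
        if url in self.entries:
            self.entries.move_to_end(url)
        self.entries[url] = page
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

For example, with capacity 2, inserting /a and /b, touching /a, then inserting /c evicts /b, the least recently used entry; this recency-only criterion is precisely what Web-aware policies refine.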
For database-backed Web applications, the issue of cache consistency is of
crucial importance, especially for applications that must always serve fresh data
(e.g., providers of stock prices, sports scores). Due to the requirements of data
freshness we would expect that all dynamically generated pages (or at least, all
pages with frequently changing data) not be cached at all. Indeed, this is the most
