Chapter VIII
Cache Management for
Web-Powered Databases
Dimitrios Katsaros and Yannis Manolopoulos
Aristotle University of Thessaloniki, Greece
Copyright © 2003, Idea Group Publishing.
ABSTRACT
The Web has become the primary means for information dissemination. It is
ideal for publishing data residing in a variety of repositories, such as
databases. In such a multi-tier system (client - Web server - underlying
database), where the Web page content is dynamically derived from the
database (Web-powered database), cache management is very important in
making efficient distribution of the published information. Issues related to
cache management are the cache admission/replacement policy, the cache
coherency and the prefetching, which acts complementary to caching. The
present chapter discusses the issues, which make the Web cache management
radically different than the cache management in databases or operating
systems. We present a taxonomy and the main algorithms proposed for cache
replacement and coherence maintenance. We present three families of predictive
prefetching algorithms for the Web and characterize them as Markov predictors.
Finally, we give examples of how some popular commercial products deal with
the issues regarding the cache management for Web-powered databases.
INTRODUCTION
In recent years the World Wide Web, or simply the Web (Berners-Lee, Cailliau, Luotonen, Nielsen & Livny, 1994), has become the primary means for information dissemination. It is a hypertext-based application and uses the HTTP protocol for file transfers. What started as a medium to serve the needs of a specific

scientific community (that of Particle Physics), has now become the most popular application running on the Internet. Today it is used for many purposes, ranging from purely educational to entertainment and, lately, conducting business. Applications such as digital libraries, video-on-demand, distance learning and virtual stores that allow for buying cars, books, computers, etc., are some of the services running on the Web. The advent of the XML language and its adoption by the World Wide Web Consortium as a standard for document exchange has revitalized many old applications and fueled new ones.
During its first years, the Web consisted of static HTML pages stored on the file system of the connected machines. When new needs arose, such as e-commerce or the need to publish data residing in other systems, e.g., databases, it was realized that we could not afford, in terms of storage, to replicate the original data on the Web server’s disk in the form of HTML pages. Moreover, it would make no sense to replicate data that would never be requested. So, instead of static pages, an application program should run on the Web server to receive the requests from clients, retrieve the relevant data from the source and then pack them into HTML or XML format. Even the emerging “semistructured” XML databases, which store data directly in XML format, need an application program that connects to the DBMS and retrieves the XML file (or fragment). Thus, a new kind of page, dynamically generated, and a new architecture were born. We no longer have the traditional pair of a Web client and a Web server; a third part is added: the application program, running on the Web server and serving data from an underlying repository, in most cases a database. This scheme is sometimes referred
to as Web-powered database and the Web site, which provides access to a large
number of pages whose content is extracted from databases, is called data intensive
Web site (Atzeni, Mecca & Merialdo, 1998; Yagoub, Florescu, Issarny & Valduriez,
2000). The typical architecture for such a scenario is depicted in Figure 1. In this
scheme there are three tiers, the database back-end, the Web/application server and
the Web client. In order to generate dynamic content, Web servers must execute a
program (e.g., server-side scripting mechanism). This program (script) connects
to the DBMS, executes the client query, gets the results and packs them in HTML/
XML form in order to return them to the user. Quite a lot of server-side scripting
mechanisms have been proposed in the literature (Greenspun, 1999; Malaika, 1998).
An alternative to having a program that generates HTML is offered by the several forms of annotated HTML. Annotated HTML, such as PHP, Active Server Pages and Java Server Pages, embeds scripting commands in an HTML document.
The popularity of the Web has resulted in heavy traffic on the Internet and heavy load on the Web servers. For Web-powered databases the situation is worsened by the
fact that the application program must interact with the underlying database to
retrieve the data. So, the net effect of this situation is network congestion, high client
perceived latency, Web server overload and slow response times for Web servers.
Fortunately the situation is not incurable due to the existence of reference locality
in Web request streams. The principle of locality (Denning & Schwartz, 1972)

asserts that: (a) correlation between immediate past and immediate future refer-
ences tends to be high, and (b) correlation between disjoint reference patterns tends
to zero as the distance between them tends to infinity. Existence of reference locality
is indicated by several studies (Almeida et al., 1996; Breslau, Cao, Fan, Phillips &
Shenker, 1999).
There are two types of reference locality, namely temporal and spatial locality
(Almeida et al., 1996). Temporal locality can be described using the stack distance
model, as introduced in (Mattson, Gecsei, Slutz & Traiger, 1970). Existence of high
temporal locality in a request stream results in a relatively small average stack
distance and implies that recently accessed data are more likely to be referenced in
the near future. Consider for example the following reference streams: AABCBCD
and ABCDABC. They both have the same popularity profile¹ for each item.
Evidently, the stack distance for the first stream is smaller than for the second stream.
This can be deduced from the fact that the number of intervening references
between any two references for the same item in the first stream is smaller than for
the second stream. Thus, the first stream exhibits higher temporal locality than the
second. Spatial locality on the other hand, characterizes correlated references for
different data. Spatial locality in a stream can be established by comparing the total
number of unique subsequences of the stream with the total number of subsequences
that would be found in a random permutation of that stream. Existence of spatial
locality in a stream implies that the number of such unique subsequences is smaller than the respective number of subsequences in a random permutation of the stream.
Consider for example the following reference streams: ABCABC and ACBCAB.
They both have the same popularity profile for each item. We can observe in the
first stream that a reference to item B always follows a reference to item A and is
followed by a reference to item C. This is not the case in the second stream and we
cannot observe a similar rule for any other sequence of items.
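To make the two notions of locality concrete, here is a small sketch in Python using the example streams from the text. The stack-distance function counts, for each re-reference, the distinct items seen since the previous reference to the same item; the bigram count is one simple proxy (our choice, not the chapter's exact procedure) for comparing unique subsequences:

```python
def stack_distances(stream):
    """For each re-reference, count the distinct items seen since the
    previous reference to the same item; first references are skipped."""
    last, dists = {}, []
    for i, x in enumerate(stream):
        if x in last:
            dists.append(len(set(stream[last[x] + 1:i])))
        last[x] = i
    return dists

def unique_bigrams(stream):
    """Number of distinct length-2 subsequences; fewer unique bigrams
    than in a random permutation suggests spatial locality."""
    return len(set(zip(stream, stream[1:])))

# Temporal locality: AABCBCD has smaller stack distances than ABCDABC.
print(stack_distances("AABCBCD"), stack_distances("ABCDABC"))  # [0, 1, 1] [3, 3, 3]
# Spatial locality: ABCABC repeats the pattern A->B->C, ACBCAB does not.
print(unique_bigrams("ABCABC"), unique_bigrams("ACBCAB"))  # 3 5
```

Both streams in each pair reference each item equally often, yet the measures separate them, which is exactly why locality metrics carry information beyond popularity profiles.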
Due to the existence of temporal locality we can exploit the technique of
caching, that is, temporal storage of data closer to the consumer. Caching can save
resources, i.e., network bandwidth, since fewer packets travel in the network, and
[Figure omitted: a Web client and a proxy server, each with its own cache, exchange requests and responses across the Internet with the Web server/application server, which maintains main-memory and disk caches and performs prefetching; the application server in turn sends requests to the back-end database, which keeps its own cache. Together these tiers form the Web-powered database.]
Figure 1: Architecture of a typical Web-powered database

time, since we have faster response times. Caching can be implemented at various
points along the path² of the flow of data from the repository to the final consumer.
So, we may have caching at the DBMS itself, the Web server’s memory or disk, at
various points in the network (proxy caches (Luotonen & Altis, 1994)) or at the
consumer’s endpoint. Web proxies may cooperate so that several proxies serve each other’s misses (Rodriguez, Spanner & Biersack, 2001). All the caches
present at various points comprise a memory hierarchy. The most important part of
a cache is the mechanism that determines which data will be accommodated in the
cache space and is referred to as the cache admission/replacement policy.
Caching introduces a complication: how to maintain cache contents fresh, that
is, consistent with the original data residing in the repository. The issue of cache
consistency is of particular interest for Web-powered databases, because their data
are frequently updated by other applications running on top of the DBMS and thus
the cached copies must be invalidated or refreshed.
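As a toy illustration of invalidation-based coherence, the sketch below maps base tables to the cached pages derived from them and evicts those pages when an update arrives. The table names, URLs and dependency map are hypothetical, invented here for illustration; they are not from the chapter:

```python
# Hypothetical cache of generated pages, keyed by URL.
page_cache = {"/catalog": "<html>catalog</html>", "/quote": "<html>quote</html>"}

# Hypothetical dependency map: which cached pages derive from which table.
derived_from = {
    "products": {"/catalog", "/quote"},
    "prices": {"/quote"},
}

def invalidate(table):
    """Evict every cached page whose content depends on the updated table."""
    for url in derived_from.get(table, ()):
        page_cache.pop(url, None)

invalidate("prices")       # an update to `prices` arrives from the DBMS
print(sorted(page_cache))  # ['/catalog'] -- only /quote was invalidated
```

The design choice here is invalidation (drop the stale copy and regenerate on the next miss) rather than refresh (push the new content proactively); the chapter's coherence section discusses both options.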
Obviously, requests for “first-time accessed” data and “non-cacheable” data
(containing personalized, authentication information, etc.) cannot benefit from
caching. In these cases, due to the existence of spatial locality in request streams,
we can exploit the technique of preloading or prefetching, which acts complementary to caching. Prefetching deduces future requests for data and brings those data into the cache before an explicit request is made for them. Prefetching may increase the
amount of traveling data, but on the other hand can significantly reduce the latency
associated with every request.
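The abstract characterizes predictive prefetching algorithms as Markov predictors; a minimal first-order sketch is shown below. The session logs are made-up data for illustration, and real predictors (as the chapter's prefetching section describes) use higher orders and confidence thresholds:

```python
from collections import Counter, defaultdict

def build_model(sessions):
    """First-order Markov predictor: count page-to-page transitions."""
    model = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            model[cur][nxt] += 1
    return model

def prefetch_candidate(model, page):
    """Most likely next page after `page`, i.e., what to prefetch."""
    nexts = model.get(page)
    return nexts.most_common(1)[0][0] if nexts else None

# Hypothetical access logs: three past sessions of page requests.
logs = [["home", "news", "sports"], ["home", "news", "weather"],
        ["home", "news", "sports"]]
model = build_model(logs)
print(prefetch_candidate(model, "news"))  # sports
```

After a client fetches "news", the cache would speculatively fetch "sports"; a wrong guess costs the extra transfer, which is the traffic/latency tradeoff the paragraph above describes.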
Contributions
This chapter will provide information concerning the management of Web
caches. It by no means intends to provide a survey of Web caching. Such a survey,
although from a very different perspective, can be found in (Wang, 1999). The target
of the chapter is twofold. Firstly, it intends to clarify the particularities of the Web
environment that call for different solutions regarding the replacement policies,
cache coherence and prefetching in the context of the Web-powered databases. It
will demonstrate how these particularities made the old solutions (adopted in
traditional database and operating systems) inadequate for the Web and how they
motivated the evolution of new methods. Examples of this evolution regard the
replacement, coherence and prefetching techniques for the Web. The second
objective of the chapter is to present a taxonomy of the techniques proposed so far
and to sketch the most important algorithms belonging to each category. Through this
taxonomy, which goes from the simplest to the most sophisticated algorithms, the
chapter intends to clearly demonstrate the tradeoffs involved and show how each
category deals with them. The demonstration of some popular representative
algorithms of each category intends to show how the tradeoffs affect the implementation complexity of the algorithms and how ease of implementation must be balanced against performance. Finally, another target of the present chapter is
to present the practical issues of these topics through a description of how some
popular commercial products deal with them.

The rest of the chapter is organized as follows. The Section “Background”
provides some necessary background on the aforementioned topics and presents the
peculiarities of the Web that make Web cache management vastly different from
cache management in operating systems and centralized databases. Moreover, it will
give a formulation for the Web caching problem as a combinatorial optimization
problem and will define the performance measures used to characterize the cache
efficiency in the Web. The Section “Replacement Policies” will present a taxonomy
along with the most popular and efficient cache replacement policies proposed in the
literature. The Section “Cache Coherence” will elaborate on the topic of how to
maintain the cache contents consistent with the original data in the source and the
Section “Prefetching” will deal with the issue of how to improve cache performance
through the mechanism of prefetching. For the above topics, we will not elaborate
on details of how these algorithms can be implemented in a real system, since each
system provides its own interface and extensibility mechanisms. Our concern is to
provide only a description of the algorithms and the tradeoffs involved in their
operation. The Section “Web Caches in Commercial Products” will describe how
two commercial products, a proxy cache and a Web-powered database, cope with
cache replacement and coherency. Finally, the Section “Emerging and Future
Trends” will provide a general description of the emerging Content Distribution
Networks and will highlight some directions for future research.
BACKGROUND
The presence of caches in specific positions of a three-tier (or multi-tier, in
general) architecture, like that presented earlier, can significantly improve the
performance of the whole system. For example, a cache in the application server,
which stores the “hot” data of the DBMS, can avoid the costly interaction with it.
Similarly, data that do not change very frequently can be stored closer to the
consumer e.g., in a proxy cache. But in order for a cache to be effective, it must be
tuned so that it meets the requirements imposed by the specialized characteristics of
the application it serves. The primary means for tuning a cache is the admission/
replacement policy. This mechanism decides which data will enter the cache when
a client requests them and which data already in cache will be purged out in order
to make space for the incoming data when the available space is not sufficient.
Sometimes these two policies are integrated and are simply called the replacement policy. The identification of the cached pages is based on the URL of the cached page (with any additional data following it, e.g., the query string or the POST body of the request)³.
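As a concrete, deliberately simplified sketch of a replacement policy keyed by URL, here is a plain LRU cache in Python. This is a generic textbook policy, not one of the chapter's Web-specific algorithms, which additionally weigh object size, fetch cost and freshness:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU replacement policy keyed by URL (plus query string)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # URL -> page content

    def get(self, url):
        if url not in self.entries:
            return None                    # miss: caller fetches, then put()s
        self.entries.move_to_end(url)      # mark as most recently used
        return self.entries[url]

    def put(self, url, page):
        if url in self.entries:
            self.entries.move_to_end(url)
        self.entries[url] = page
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

For example, with capacity 2, inserting /a and /b, touching /a, then inserting /c evicts /b, the least recently used entry; this recency-only criterion is precisely what Web-aware policies refine.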
For database-backed Web applications, the issue of cache consistency is of
crucial importance, especially for applications that must always serve fresh data
(e.g., providers of stock prices, sports scores). Due to the requirements of data
freshness we would expect that all dynamically generated pages (or at least, all
pages with frequently changing data) not be cached at all. Indeed, this is the most
