Mining preferences from OLAP query logs for proactive personalization

doi:10.1007/978-3-642-23737-9_7

Mining Preferences from OLAP Query Logs for Proactive Personalization

Julien Aligon (1), Matteo Golfarelli (2), Patrick Marcel (1), Stefano Rizzi (2), and Elisa Turricchia (2)

(1) Laboratoire d'Informatique – Université François Rabelais Tours, France
{julien.aligon,patrick.marcel}@univ-tours.fr

(2) DEIS – University of Bologna, Italy
{matteo.golfarelli,stefano.rizzi,elisa.turricchia2}@unibo.it

Abstract. The goal of personalization is to deliver information that is relevant to an individual or a group of individuals in the most appropriate format and layout. In the OLAP context personalization is quite beneficial, because queries can be very complex and they may return huge amounts of data. Aimed at making the user's experience with OLAP as plain as possible, in this paper we propose a proactive approach that couples an MDX-based language for expressing OLAP preferences to a mining technique for automatically deriving preferences. First, the log of past MDX queries issued by that user is mined to extract a set of association rules that relate sets of frequent query fragments; then, given a specific query, a subset of pertinent and effective rules is selected; finally, the selected rules are translated into a preference that is used to annotate the user's query. A set of experimental results proves the effectiveness and efficiency of our approach.

1 Introduction and Motivation

Personalization has attracted a lot of attention in the database community during the last few years, and also raised plenty of interest in the OLAP area. The goal of personalization is to deliver information that is relevant to an individual or a group of individuals in the most appropriate format and layout, and in the OLAP area it has been pursued using different approaches:

– Query recommendation: Based on the current query and on the past sessions, the system suggests further queries to help users navigate the cube [1].
– Personalized visualization: Users specify a set of constraints that are used to determine a preferred visualization [2].
– Result ranking: Query results are organized in a total or partial order so that the user visualizes the most relevant data first [3].
– Query contextualization: The query is enhanced by adding preference predicates that depend on the query context [4].

These approaches differ in several respects, in particular:

J. Eder, M. Bielikova, and A.M. Tjoa (Eds.): ADBIS 2011, LNCS 6909, pp. 84–97, 2011.

© Springer-Verlag Berlin Heidelberg 2011


– Formulation effort: personalization criteria for queries may be either manually specified by users, or transparently inferred from the context and from the user profile.
– Prescriptiveness: personalization criteria may either be used as "hard" constraints that are added to queries, or be meant as "soft" constraints, i.e., preferences.
– Proactiveness: some approaches propose new queries to the user based on the query log and on the context, while others change the current query or post-process its results before returning them to the user.

With reference to the above, the user's experience with OLAP can be made as plain as possible by decreasing the formulation effort (i.e., having query personalization criteria inferred), providing low prescriptiveness (i.e., annotating queries with preferences rather than constraints), and enhancing proactiveness (i.e., transparently changing the current query). The result ranking approach we propose in this paper goes in this direction by coupling an MDX-based language for expressing OLAP preferences to a mining technique for automatically deriving a set of preferences for a user's query from the log of past MDX queries issued by that user. This is done in four steps:

1. The user's query log is mined off-line to extract a set of association rules that relate sets of frequent query fragments (such as group-by attributes, returned measures, selection predicates).
2. When the user formulates a query q, among the rules whose antecedent matches q, a subset of rules is selected whose cardinality depends on a parameter set by the user to express the desired personalization degree, i.e., the complexity of the preference that will be formulated.
3. The selected rules are translated into an OLAP preference p concerning the group-by set for aggregating data, the measures to be returned, and the values of levels or measures.
4. Query q is annotated with p and executed. The results returned are ranked according to p, so that the user can more effectively explore them by focusing on the most relevant data first.

Remarkably, as in the other result ranking approaches, the overall set of tuples returned by q annotated with p is the same set of tuples that would be returned by q without annotation, because p expresses a soft constraint. This guarantees that the user's intentions are preserved, and makes our approach non-invasive.
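To make steps 1 and 2 more concrete, the following is a minimal Python sketch, not the mining technique detailed later in the paper. It assumes that each logged query is a set of fragment strings, that rules have a single-fragment consequent, that an antecedent "matches" q when it is contained in q, and that the personalization degree is a hypothetical parameter k selecting the k most confident rules. The function names, thresholds, and the toy log are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def mine_rules(log, min_support=0.5, min_confidence=0.8, max_size=3):
    """Brute-force mining of rules (antecedent -> one fragment) from a log of qf-sets."""
    n = len(log)
    counts = Counter()
    for query in log:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(query), size):
                counts[frozenset(itemset)] += 1
    # Keep only the fragment sets that occur in a large enough share of the log.
    frequent = {s: c for s, c in counts.items() if c / n >= min_support}
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            # Every subset of a frequent set is frequent, so this lookup is safe.
            confidence = count / frequent[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, count / n, confidence))
    return rules


def select_rules(rules, query, k):
    """Keep rules whose antecedent is contained in the query; return the k most confident."""
    matching = [r for r in rules if r[0] <= query]
    matching.sort(key=lambda r: (-r[3], -r[2]))
    return matching[:k]


# Toy log (assumed data): each query is a set of query fragments.
log = [
    {"Region", "Year", "AvgIncome", "Sex = 'Male'"},
    {"Region", "Year", "AvgIncome"},
    {"Region", "Year", "AvgIncome", "Sex = 'Male'"},
    {"City", "Year", "AvgCostGas"},
]
rules = mine_rules(log)
q = {"Region", "Year", "AvgCostGas"}
selected = select_rules(rules, q, k=10)
for antecedent, consequent, support, confidence in selected:
    print(sorted(antecedent), "->", consequent, support, confidence)
```

Exhaustive enumeration is only viable for tiny logs; a real implementation would rely on a dedicated frequent-itemset algorithm. The consequents of the selected rules are the raw material that step 3 would turn into a preference.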

The paper outline is as follows. After summarizing the related work in Section 2, we introduce a formal setting to manipulate multidimensional data in Section 3. In Section 4 we describe the main features of the myMDX language we adopt to express OLAP preferences, while Section 5 describes our approach in detail. Section 6 shows an implementation and reports the results of some experimental tests we performed to evaluate the effectiveness and efficiency of our approach.

2 Related Work

Several approaches to personalization were devised in the OLAP context.


In the field of profile-based personalization, we mention [2], which presents a framework for providing personalized visualization of OLAP results based on user profiles in the form of constraints, and [4], which achieves OLAP personalization by dynamically enhancing queries with context-aware user preferences. Both approaches are proactive and demand low formulation effort, but in both cases the user profile is given, nothing being said on its construction. A recommendation framework for OLAP systems is presented in [5]; new queries are suggested to users based on the current analysis context and on the user's profile. Though the authors mention that the profile could be mined from the user's previous behavior, no specific suggestion is given to this end. A non-prescriptive approach is presented in [3,6], where the myOLAP algebra for formulating and evaluating OLAP preferences is introduced; the proposed algebra is very expressive, but at the cost of a substantial formulation effort.

The term history-based personalization is borrowed from [7], and refers to approaches that suggest a new database query based on the past actions recorded in a log file. The following approaches fall into this category and do not rely on a user profile; like our approach, they are proactive and demand no formulation effort, but they are prescriptive. The approaches in [1,8] are aimed at suggesting OLAP queries based on a comparison between the current session and former sessions stored in a query log. Also [9] has a similar goal in the context of SPJ queries; here, recommendations are computed based on the presence of tuples in sessions. This approach is further improved in [10] by relying on query fragments instead of tuples. A query log is exploited in [11] to support users in writing new SQL queries; the log is transformed into a graph of query fragments, where edges are labelled with the conditional probability of having one fragment given another fragment. Notably, all these works generally assume that history is taken from a query log shared by all users.

To the best of our knowledge, our work is the first that proposes to extract preferences from database query logs. However, the same idea has been used in other contexts. In the context of information retrieval, [12] presents algorithms to extract association rules at query time from a set of documents. These rules are used to associate the documents retrieved by a query to a relevance class and eventually to rank them. In the context of the web, [13] introduces algorithms for preference extraction from web logs, with a targeted preference language. Extraction is based on the frequency of the terms appearing in the log, and clustering is used for identifying preference constructs. A comprehensive overview of the techniques using data mining for personalization can be found in [14].

3 Preliminaries

3.1 Schemata and Instances

Our datacube formalization involves hierarchies; however, to keep the formalism simpler, and without actually restricting the validity of our approach, we will consider hierarchies without branches, i.e., consisting of chains of levels.


[Figure 1: one chain of levels per hierarchy. RESIDENCE: City, State, Region, AllCities. RACE: Race, RaceGroup, Mrn, AllRaces. TIME: Year, AllYears. OCCUPATION: Occ, AllOccs. SEX: Sex, AllSexes.]

Fig. 1. Roll-up orders for the five hierarchies in the CENSUS schema (Mrn stands for MajorRacesNumber)

Definition 1 (Multidimensional Schema). A multidimensional schema (or, briefly, a schema) is a triple $\mathcal{M} = \langle A, H, M \rangle$ where:

– $A$ is a finite set of levels, each defined on a categorical domain $Dom(a)$;
– $H = \{h_1, \ldots, h_n\}$ is a finite set of hierarchies, each characterized by (1) a subset $Lev(h_i) \subseteq A$ of levels (such that the $Lev(h_i)$'s for $i = 1, \ldots, n$ define a partition of $A$); (2) a roll-up total order $\succeq_{h_i}$ of $Lev(h_i)$;
– $M$ is a finite set of measures, each defined on a numerical domain $Dom(m)$.

For each hierarchy $h_i$, the top level of the order determines the finest aggregation level for the hierarchy. Conversely, the bottom level has a single possible value and determines the coarsest aggregation level.

A group-by set includes one level for each hierarchy, and defines a possible way to aggregate data. A coordinate of a group-by set is a point in the n-dimensional space defined by the levels in that group-by set.

Definition 2 (Group-by Set). Given schema $\mathcal{M} = \langle A, H, M \rangle$, let $Dom(H) = Lev(h_1) \times \ldots \times Lev(h_n)$; each $G \in Dom(H)$ is called a group-by set of $\mathcal{M}$. Let $G = \langle a_{k_1}, \ldots, a_{k_n} \rangle$ and $Dom(G) = Dom(a_{k_1}) \times \ldots \times Dom(a_{k_n})$; each $g \in Dom(G)$ is called a coordinate of $G$.

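Example 1. The CENSUS schema includes the five hierarchies whose roll-up orders are shown in Figure 1, and measures AvgIncome, AvgCostGas, and AvgCostElect. For instance, $\mathsf{City} \succeq_{\mathsf{RESIDENCE}} \mathsf{State}$; examples of group-by sets are:

$G_0 = \langle \mathsf{City}, \mathsf{Race}, \mathsf{Year}, \mathsf{Occ}, \mathsf{Sex} \rangle$
$G_1 = \langle \mathsf{Region}, \mathsf{Mrn}, \mathsf{Year}, \mathsf{Occ}, \mathsf{Sex} \rangle$
$G_2 = \langle \mathsf{AllCities}, \mathsf{AllRaces}, \mathsf{AllYears}, \mathsf{AllOccs}, \mathsf{AllSexes} \rangle$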
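Definitions 1 and 2 can be illustrated with a small Python sketch that encodes the CENSUS hierarchies as chains of levels and enumerates Dom(H) as a cross product. The level names are taken from Figure 1; the finest-to-coarsest ordering inside each chain (in particular for RACE) and the helper names are assumptions made for illustration.

```python
from itertools import product

# One chain of levels per hierarchy, listed from finest to coarsest.
# Level names come from Figure 1; the ordering within RACE is assumed.
HIERARCHIES = {
    "RESIDENCE": ["City", "State", "Region", "AllCities"],
    "RACE": ["Race", "RaceGroup", "Mrn", "AllRaces"],
    "TIME": ["Year", "AllYears"],
    "OCCUPATION": ["Occ", "AllOccs"],
    "SEX": ["Sex", "AllSexes"],
}


def group_by_sets(hierarchies):
    """Enumerate Dom(H) = Lev(h_1) x ... x Lev(h_n)."""
    return list(product(*hierarchies.values()))


def rolls_up_to(hierarchies, hierarchy, finer, coarser):
    """True if `finer` precedes or equals `coarser` in the roll-up order of `hierarchy`."""
    chain = hierarchies[hierarchy]
    return chain.index(finer) <= chain.index(coarser)


all_sets = group_by_sets(HIERARCHIES)
G0 = ("City", "Race", "Year", "Occ", "Sex")
G1 = ("Region", "Mrn", "Year", "Occ", "Sex")
G2 = ("AllCities", "AllRaces", "AllYears", "AllOccs", "AllSexes")

print(len(all_sets))  # 4 * 4 * 2 * 2 * 2 group-by sets
print(all(g in all_sets for g in (G0, G1, G2)))
print(rolls_up_to(HIERARCHIES, "RESIDENCE", "City", "State"))
```

Representing each hierarchy as an ordered list is enough here because the paper restricts itself to hierarchies without branches.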

A schema is populated with facts, each recording useful information for the decision-making process. A fact is characterized by a group-by set G that defines its aggregation level, by a coordinate of G, and by a value for one measure.


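Definition 3 (Fact). Given schema $\mathcal{M} = \langle A, H, M \rangle$, a group-by set $G \in Dom(H)$, and a measure $m \in M$, a fact is a couple $f_{G,m} = \langle g, v \rangle$, where $g \in Dom(G)$ and $v \in Dom(m)$. The space of all facts for $\mathcal{M}$ is

$$F_{\mathcal{M}} = \bigcup_{G \in Dom(H),\, m \in M} \bigl( Dom(G) \times Dom(m) \bigr)$$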

Example 2. An example of fact is $f_{G_1,\mathsf{AvgIncome}} = \langle \langle \text{'Pacific'}, \text{'White'}, \text{'2008'}, \text{'Dentist'}, \text{'Male'} \rangle, 600 \rangle$.

Finally, an instance of a schema (datacube) is a set of facts $D \subseteq F_{\mathcal{M}}$ such that no two facts characterized by the same coordinate and measure exist in $D$.
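The uniqueness condition on datacube instances can be checked mechanically. The sketch below is illustrative only: it assumes a fact is stored as a tuple (group-by set, measure, coordinate, value) and reads "same coordinate" as a coordinate taken together with its group-by set; the function name and the sample values other than those of Example 2 are hypothetical.

```python
def is_datacube(facts):
    """True if no two facts share group-by set, measure, and coordinate."""
    seen = set()
    for group_by, measure, coordinate, _value in facts:
        key = (group_by, measure, coordinate)
        if key in seen:
            return False
        seen.add(key)
    return True


G1 = ("Region", "Mrn", "Year", "Occ", "Sex")
coordinate = ("Pacific", "White", "2008", "Dentist", "Male")

# The fact of Example 2, plus a second fact on a different measure (assumed value).
valid = [
    (G1, "AvgIncome", coordinate, 600),
    (G1, "AvgCostGas", coordinate, 50),
]
# Two values for the same coordinate and measure violate the definition.
invalid = valid + [(G1, "AvgIncome", coordinate, 700)]

print(is_datacube(valid), is_datacube(invalid))
```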

3.2 Queries

The MDX (MultiDimensional eXpressions) language is a de-facto standard for querying multidimensional databases [15]. Some of its distinguishing features are the possibility of returning query results that contain data with different aggregation levels and the possibility of specifying how the results should be visually arranged into a multidimensional representation. In this paper we consider MDX queries that aggregate data at one or more group-by sets, optionally select them using a predicate in CNF, and return one or more measures. The semantics of such an MDX query is that of a union of GPSJ queries¹ whose group-by sets are the cross product of n sets of levels, one for each hierarchy. This semantics corresponds to the following subset of MDX:

– Clauses SELECT, FROM, WHERE are supported.
– All functions for navigating hierarchies are supported: AllMembers, Ancestor, Ascendants, Children, etc.
– All functions for manipulating sets of members or tuples are supported (Crossjoin, Except, Exists, Extract, Filter, Intersect, etc.) except the union.
– All functions for manipulating members/tuples are supported.
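The "union of GPSJ queries" semantics described before the list can be sketched in Python as follows. This is an illustration under stated assumptions, not an MDX evaluator: the star join is assumed to be already materialized as a list of dictionaries, a single hypothetical measure named Income is aggregated with the mean, and the function names are invented for the example.

```python
from collections import defaultdict
from itertools import product
from statistics import mean


def gpsj(rows, group_by, measure, predicate=lambda row: True, aggregate=mean):
    """One GPSJ query: select rows, group them by the given levels, aggregate the measure."""
    groups = defaultdict(list)
    for row in rows:
        if predicate(row):
            groups[tuple(row[level] for level in group_by)].append(row[measure])
    return {coordinate: aggregate(values) for coordinate, values in groups.items()}


def union_of_gpsj(rows, level_sets, measure, predicate=lambda row: True):
    """One GPSJ result per group-by set in the cross product of the level sets."""
    return {g: gpsj(rows, g, measure, predicate) for g in product(*level_sets)}


# Assumed denormalized rows (result of the star join), restricted to two hierarchies.
rows = [
    {"State": "CA", "Region": "Pacific", "Year": 2008, "Income": 600},
    {"State": "OR", "Region": "Pacific", "Year": 2008, "Income": 400},
    {"State": "CA", "Region": "Pacific", "Year": 2009, "Income": 500},
]
# One set of levels per hierarchy: {State, Region} for RESIDENCE, {Year} for TIME.
level_sets = [["State", "Region"], ["Year"]]
result = union_of_gpsj(rows, level_sets, "Income")
for group_by, cells in result.items():
    print(group_by, cells)
```

The outer dictionary is keyed by group-by set, which reflects the fact that a single MDX query may return data at several aggregation levels.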

To effectively use association rules for modeling frequent portions of queries, we formally split MDX queries into fragments as explained below.

Definition 4 (Query Fragment, Query, Log). Given schema $\mathcal{M} = \langle A, H, M \rangle$, a query fragment is either a level in $A$, a measure in $M$, or a simple Boolean predicate involving a level and/or a measure. A qf-set is a set of query fragments. A multidimensional query (briefly, query) is represented by a qf-set that includes at least one level for each hierarchy in $H$ and at least one measure in $M$. A log is a set of multidimensional queries.
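As an illustration of Definition 4, the sketch below represents a qf-set as a Python set of strings and checks the two conditions that make it a query. Encoding fragments as plain strings, and predicates as strings such as "Year = 2008", is an assumption made for brevity; the function name is hypothetical.

```python
# Levels per hierarchy and measures of the CENSUS schema (names from Example 1 and Figure 1).
HIERARCHIES = {
    "RESIDENCE": {"City", "State", "Region", "AllCities"},
    "RACE": {"Race", "RaceGroup", "Mrn", "AllRaces"},
    "TIME": {"Year", "AllYears"},
    "OCCUPATION": {"Occ", "AllOccs"},
    "SEX": {"Sex", "AllSexes"},
}
MEASURES = {"AvgIncome", "AvgCostGas", "AvgCostElect"}


def is_query(qf_set, hierarchies, measures):
    """True if the qf-set has at least one level per hierarchy and at least one measure."""
    covers_every_hierarchy = all(qf_set & levels for levels in hierarchies.values())
    has_measure = bool(qf_set & measures)
    return covers_every_hierarchy and has_measure


# A qf-set with two RESIDENCE levels, one measure, and one predicate fragment.
q = {"State", "Region", "AllRaces", "Year", "AllOccs", "Sex", "AvgIncome", "Year = 2008"}
# Missing a level for the SEX hierarchy, hence not a query.
not_q = {"State", "AllRaces", "Year", "AllOccs", "AvgIncome"}

log = [q]  # a log is simply a collection of queries
print(is_query(q, HIERARCHIES, MEASURES), is_query(not_q, HIERARCHIES, MEASURES))
```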

¹ A GPSJ query takes form $\pi_{a_{k_1}, \ldots, a_{k_n}, Aggr}\, \sigma_p(\chi)$ where, in our context: $\chi$ is the star join between the fact table and the n dimension tables; $p$ is a selection formula in CNF; $\{a_{k_1}, \ldots, a_{k_n}\}$ is a group-by set; and $Aggr$ is a list of aggregations of the form $\alpha_j(m_j)$, where $m_j$ is a measure and $\alpha_j$ is an aggregation operator.
