Showing papers in "arXiv: Databases in 2005"

Posted Content•

Scientific Data Management in the Coming Decade

Jim Gray¹, David T. Liu², Maria Nieto-Santisteban³, Alexander S. Szalay³, David J. DeWitt, Gerd Heber⁴•Institutions (4)

Microsoft¹, University of California, Berkeley², Johns Hopkins University³, Cornell University⁴

02 Feb 2005-arXiv: Databases

TL;DR: Analyzing this data to find the subtle effects missed by previous studies requires algorithms that can simultaneously deal with huge datasets and that can find very subtle effects --- finding both needles in the haystack and finding very small haystacks that were undetected in previous measurements.

Abstract: This is a thought piece on data-intensive science requirements for databases and science centers. It argues that peta-scale datasets will be housed by science centers that provide substantial storage and processing for scientists who access the data via smart notebooks. Next-generation science instruments and simulations will generate these peta-scale datasets. The need to publish and share data and the need for generic analysis and visualization tools will finally create a convergence on common metadata standards. Database systems will be judged by their support of these metadata standards and by their ability to manage and access peta-scale datasets. The procedural stream-of-bytes-file-centric approach to data analysis is both too cumbersome and too serial for such large datasets. Non-procedural query and analysis of schematized self-describing data is both easier to use and allows much more parallelism.

476 citations

Posted Content•

Conjunctive Query Containment and Answering under Description Logics Constraints

Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini

28 Jul 2005-arXiv: Databases

TL;DR: This article deals with unions of conjunctive queries, and addresses query containment and query answering under description logic constraints, showing that the problem is decidable, and analyzing its computational complexity.

Abstract: Query containment and query answering are two important computational tasks in databases. While query answering amounts to computing the result of a query over a database, query containment is the problem of checking whether, for every database, the result of one query is a subset of the result of another query. In this paper, we deal with unions of conjunctive queries, and we address query containment and query answering under Description Logic constraints. Every such constraint is essentially an inclusion dependency between concepts and relations, and their expressive power is due to the possibility of using complex expressions, e.g., intersection and difference of relations, special forms of quantification, and regular expressions over binary relations, in the specification of the dependencies. These types of constraints capture a great variety of data models, including the relational, the entity-relationship, and the object-oriented model, all extended with various forms of constraints, and also the basic features of the ontology languages used in the context of the Semantic Web. We present the following results on both query containment and query answering. We provide a method for query containment under Description Logic constraints, thus showing that the problem is decidable, and analyze its computational complexity. We prove that query containment is undecidable in the case where we allow inequalities in the right-hand side query, even for very simple constraints and queries. We show that query answering under Description Logic constraints can be reduced to query containment, and illustrate how such a reduction provides upper bound results with respect to both combined and data complexity.

91 citations

Posted Content•

A Fast Greedy Algorithm for Outlier Mining

Zengyou He, Xiaofei Xu, Shengchun Deng

27 Jul 2005-arXiv: Databases

TL;DR: In this article, a fast greedy algorithm for mining outliers under the same optimization model is presented, which has comparable performance with respect to those state-of-the-art outlier detection algorithms on identifying true outliers.

Abstract: The task of outlier detection is to find small groups of data objects that are exceptional when compared with the rest of the data. In [38], the problem of outlier detection in categorical data is defined as an optimization problem and a local-search heuristic based algorithm (LSA) is presented. However, as is the case with most iterative algorithms, LSA is still very time-consuming on very large datasets. In this paper, we present a very fast greedy algorithm for mining outliers under the same optimization model. Experimental results on real datasets and large synthetic datasets show that: (1) our algorithm has comparable performance with respect to state-of-the-art outlier detection algorithms on identifying true outliers and (2) our algorithm can be an order of magnitude faster than the LSA algorithm.

86 citations

Posted Content•

Complexity and Approximation of Fixing Numerical Attributes in Databases Under Integrity Constraints

Leopoldo Bertossi¹, Loreto Bravo¹, Enrico Franconi², Andrei Lopatenko¹•Institutions (2)

Carleton University¹, Free University of Bozen-Bolzano²

15 Mar 2005-arXiv: Databases

TL;DR: A quantitative definition of database fix is introduced, and the complexity of several decision and optimization problems is investigated. Results include MAXSNP-hardness of DFP, with a good approximation algorithm for a relevant special case, and intractability but good approximation for CQA for aggregate queries for one database atom denials (plus built-ins).

Abstract: Consistent query answering is the problem of computing the answers from a database that are consistent with respect to certain integrity constraints that the database as a whole may fail to satisfy. Those answers are characterized as those that are invariant under minimal forms of restoring the consistency of the database. In this context, we study the problem of repairing databases by fixing integer numerical values at the attribute level with respect to denial and aggregation constraints. We introduce a quantitative definition of database fix, and investigate the complexity of several decision and optimization problems, including DFP, i.e. the existence of fixes within a given distance from the original instance, and CQA, i.e. deciding consistency of answers to aggregate conjunctive queries under different semantics. We provide sharp complexity bounds, identify relevant tractable cases, and introduce approximation algorithms for some of those that are intractable. More specifically, we obtain results like undecidability of existence of fixes for aggregation constraints; MAXSNP-hardness of DFP, but a good approximation algorithm for a relevant special case; and intractability but good approximation for CQA for aggregate queries for one database atom denials (plus built-ins).

66 citations

Posted Content•

An Optimization Model for Outlier Detection in Categorical Data

Zengyou He, Xiaofei Xu, Shengchun Deng

29 Mar 2005-arXiv: Databases

TL;DR: The problem of outlier detection in categorical data as an optimization problem from a global viewpoint is formally defined and a local-search heuristic based algorithm for efficiently finding feasible solutions is presented.

Abstract: The task of outlier detection is to find small groups of data objects that are exceptional when compared with the rest of the data. Detection of such outliers is important for many applications such as fraud detection and customer migration. Most existing methods are designed for numeric data. They will encounter problems with real-life applications that contain categorical data. In this paper, we formally define the problem of outlier detection in categorical data as an optimization problem from a global viewpoint. Moreover, we present a local-search heuristic based algorithm for efficiently finding feasible solutions. Experimental results on real datasets and large synthetic datasets demonstrate the superiority of our model and algorithm.

60 citations

Posted Content•

On the Complexity of Nonrecursive XQuery and Functional Query Languages on Complex Values

Christoph Koch¹•Institutions (1)

Saarland University¹

23 Mar 2005-arXiv: Databases

TL;DR: It is shown that Core XQuery is just as hard as monad algebra w.r.t. combined complexity, and that it is in TC0 if the query is assumed fixed.

Abstract: This paper studies the complexity of evaluating functional query languages for complex values such as monad algebra and the recursion-free fragment of XQuery. We show that monad algebra with equality restricted to atomic values is complete for the class TA[2^{O(n)}, O(n)] of problems solvable in linear exponential time with a linear number of alternations. The monotone fragment of monad algebra with atomic value equality but without negation is complete for nondeterministic exponential time. For monad algebra with deep equality, we establish TA[2^{O(n)}, O(n)] lower and exponential-space upper bounds. Then we study a fragment of XQuery, Core XQuery, that seems to incorporate all the features of a query language on complex values that are traditionally deemed essential. A close connection between monad algebra on lists and Core XQuery (with "child" as the only axis) is exhibited, and it is shown that these languages are expressively equivalent up to representation issues. We show that Core XQuery is just as hard as monad algebra w.r.t. combined complexity, and that it is in TC0 if the query is assumed fixed.

51 citations

Posted Content•

Consistent query answers on numerical databases under aggregate constraints

Sergio Flesca¹, Filippo Furfaro¹, Francesco Parisi¹•Institutions (1)

University of Calabria¹

23 May 2005-arXiv: Databases

TL;DR: The problem of extracting consistent information from relational databases violating integrity constraints on numerical data is addressed and the characterization of several data-complexity issues related to repairing data and computing consistent query answers is provided.

Abstract: The problem of extracting consistent information from relational databases violating integrity constraints on numerical data is addressed. In particular, aggregate constraints defined as linear inequalities on aggregate-sum queries on input data are considered. The notion of repair as a consistent set of updates at the attribute-value level is exploited, and the characterization of several complexity issues related to repairing data and computing consistent query answers is provided.

41 citations

Posted Content•

Analyzing Large Collections of Electronic Text Using OLAP

Steven Keith¹, Owen Kaser¹, Daniel Lemire•Institutions (1)

University of New Brunswick¹

01 Jan 2005-arXiv: Databases

TL;DR: On-Line Analytical Processing (OLAP) allows quick analysis of multidimensional data. By storing text-analysis information in an OLAP system, queries may be solved in seconds instead of minutes or hours.

Abstract: Computer-assisted reading and analysis of text has applications in the humanities and social sciences. Ever-larger electronic text archives have the advantage of allowing a more complete analysis but the disadvantage of forcing longer waits for results. On-Line Analytical Processing (OLAP) allows quick analysis of multidimensional data. By storing text-analysis information in an OLAP system, queries may be solved in seconds instead of minutes or hours. This analysis is user-driven, allowing users the freedom to pursue their own directions of research.

36 citations

Posted Content•

Data Mining for Actionable Knowledge: A Survey

Zengyou He, Xiaofei Xu, Shengchun Deng

27 Jan 2005-arXiv: Databases

TL;DR: This paper presents two frameworks for mining actionable knowledge that are implicitly adopted by existing research methods and situates some of the research on this topic from two different viewpoints.

Abstract: The data mining process consists of a series of steps ranging from data cleaning, data selection and transformation, to pattern evaluation and visualization. One of the central problems in data mining is to make the mined patterns or knowledge actionable. Here, the term actionable means that the mined patterns suggest concrete and profitable actions to the decision-maker. That is, the user can do something to bring direct benefits (increase in profits, reduction in cost, improvement in efficiency, etc.) to the organization's advantage. However, no comprehensive survey is available on this topic. The goal of this paper is to fill the void. In this paper, we first present two frameworks for mining actionable knowledge that are implicitly adopted by existing research methods. Then we try to situate some of the research on this topic from two different viewpoints: 1) data mining tasks and 2) adopted framework. Finally, we specify issues that are either not addressed or insufficiently studied yet and conclude the paper.

33 citations

Posted Content•

Semantic Optimization Techniques for Preference Queries

Jan Chomicki¹•Institutions (1)

University at Buffalo¹

14 Oct 2005-arXiv: Databases

TL;DR: It is demonstrated that the problems of containment and satisfaction of order axioms can be captured as specific instances of constraint-generating dependency entailment, making it possible to formulate necessary and sufficient conditions for the applicability of techniques as constraint validity problems.

Abstract: Preference queries are relational algebra or SQL queries that contain occurrences of the winnow operator ("find the most preferred tuples in a given relation"). Such queries are parameterized by specific preference relations. Semantic optimization techniques make use of integrity constraints holding in the database. In the context of semantic optimization of preference queries, we identify two fundamental properties: containment of preference relations relative to integrity constraints and satisfaction of order axioms relative to integrity constraints. We show numerous applications of those notions to preference query evaluation and optimization. As integrity constraints, we consider constraint-generating dependencies, a class generalizing functional dependencies. We demonstrate that the problems of containment and satisfaction of order axioms can be captured as specific instances of constraint-generating dependency entailment. This makes it possible to formulate necessary and sufficient conditions for the applicability of our techniques as constraint validity problems. We characterize the computational complexity of such problems.
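
As an illustration of the winnow operator described in this abstract, the following minimal Python sketch returns the tuples of a relation to which no other tuple is preferred. The function name `winnow`, the sample `cars` relation, and the `cheaper_same_make` preference are hypothetical and not taken from the paper; the quadratic scan is only the naive definition, not an optimized evaluation strategy.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def winnow(relation: Iterable[T], prefers: Callable[[T, T], bool]) -> List[T]:
    """Return the tuples t such that no other tuple in the relation is preferred to t.

    prefers(a, b) is True when a is strictly preferred to b.
    """
    rows = list(relation)
    return [t for t in rows if not any(prefers(other, t) for other in rows)]


# Hypothetical relation: (make, year, price).
cars = [("vw", 2002, 9000), ("vw", 2001, 7000), ("kia", 2001, 6000)]


def cheaper_same_make(a, b):
    # Assumed preference: among cars of the same make, prefer the cheaper one.
    return a[0] == b[0] and a[2] < b[2]


best = winnow(cars, cheaper_same_make)
```

Here `best` keeps the cheaper "vw" tuple and the only "kia" tuple, because the preference never compares cars of different makes.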

24 citations

Posted Content•

HepToX: Heterogeneous Peer to Peer XML Databases

Angela Bonifati¹, Elaine Qing Chang, Terence Ho, Laks V. S. Lakshmanan•Institutions (1)

Indian Council of Agricultural Research¹

01 Jun 2005-arXiv: Databases

TL;DR: A novel algorithm is developed that infers a set of precise mapping rules between the schemas from visual annotations. A semantics of query translation given such mapping rules is pinned down, and a novel query translation algorithm for a simple but expressive fragment of XQuery is presented.

Abstract: We study a collection of heterogeneous XML databases maintaining similar and related information, exchanging data via a peer to peer overlay network. In this setting, a mediated global schema is unrealistic. Yet, users/applications wish to query the databases via one peer using its schema. We have recently developed HepToX, a P2P Heterogeneous XML database system. A key idea is that whenever a peer enters the system, it establishes an acquaintance with a small number of peer databases, possibly with different schema. The peer administrator provides correspondences between the local schema and the acquaintance schema using an informal and intuitive notation of arrows and boxes. We develop a novel algorithm that infers a set of precise mapping rules between the schemas from these visual annotations. We pin down a semantics of query translation given such mapping rules, and present a novel query translation algorithm for a simple but expressive fragment of XQuery, that employs the mapping rules in either direction. We show the translation algorithm is correct. Finally, we demonstrate the utility and scalability of our ideas and algorithms with a detailed set of experiments on top of the Emulab, a large scale P2P network emulation testbed.

Posted Content•

Summarization Techniques for Pattern Collections in Data Mining

Taneli Mielikäinen

26 May 2005-arXiv: Databases

TL;DR: This dissertation describes methods for summarizing pattern collections in order to make them also more understandable and focuses on the following themes: 1) Quality value simplifications and 2) Pattern orderings.

Abstract: Discovering patterns from data is an important task in data mining. There exist techniques to find large collections of many kinds of patterns from data very efficiently. A collection of patterns can be regarded as a summary of the data. A major difficulty with patterns is that pattern collections summarizing the data well are often very large. In this dissertation we describe methods for summarizing pattern collections in order to make them also more understandable. More specifically, we focus on the following themes: 1) Quality value simplifications. 2) Pattern orderings. 3) Pattern chains and antichains. 4) Change profiles. 5) Inverse pattern discovery.

Posted Content•

Relational Algebra as non-Distributive Lattice

Vadim Tropashko

21 Jan 2005-arXiv: Databases

TL;DR: The set of classic relational algebra operators is reduced to two binary operations: natural join and generalized union and it is demonstrated that this set of operators is relationally complete and honors lattice axioms.

Abstract: We reduce the set of classic relational algebra operators to two binary operations: natural join and generalized union. We further demonstrate that this set of operators is relationally complete and honors lattice axioms.

Posted Content•

Priority-Based Conflict Resolution in Inconsistent Relational Databases

Slawomir Staworko¹, Jan Chomicki¹•Institutions (1)

University at Buffalo¹

14 Jun 2005-arXiv: Databases

TL;DR: The impact of priorities on conflict resolution in inconsistent relational databases is studied. A set of postulates that an extended framework should satisfy is proposed, and two instantiations of the framework are considered: (locally preferred) l-repairs and (globally preferred) g-repairs.

Abstract: We study here the impact of priorities on conflict resolution in inconsistent relational databases. We extend the framework of [1], which is based on the notions of repair and consistent query answer. We propose a set of postulates that an extended framework should satisfy and consider two instantiations of the framework: (locally preferred) l-repairs and (globally preferred) g-repairs. We study the relationships between them and the impact each notion of repair has on the computational complexity of repair checking and consistent query answers.

Posted Content•

Estimating Range Queries using Aggregate Data with Integrity Constraints: a Probabilistic Approach

Francesco Buccafurri, Filippo Furfaro¹, Domenico Saccà•Institutions (1)

University of Calabria¹

14 Jan 2005-arXiv: Databases

TL;DR: In this article, the problem of recovering (count and sum) range queries over multidimensional data only on the basis of aggregate information on such data is addressed. The problem is formalized as follows: given a data set D, a summary S = T(D), and a range query r on D, study r by modelling it as a random variable defined over the sample space of all the data sets D' such that T(D') = S.

Abstract: The problem of recovering (count and sum) range queries over multidimensional data only on the basis of aggregate information on such data is addressed. This problem can be formalized as follows. Suppose that a transformation T producing a summary from a multidimensional data set is used. Now, given a data set D, a summary S = T(D), and a range query r on D, the problem consists of studying r by modelling it as a random variable defined over the sample space of all the data sets D' such that T(D') = S. The study of such a random variable, done by the definition of its probability distribution and the computation of its mean value and variance, represents a well-founded, theoretical probabilistic approach for estimating the query only on the basis of the available information (that is, the summary S) without assumptions on the original data.

Posted Content•

Monotonic and Nonmonotonic Preference Revision

Jan Chomicki, Joyce Song

31 Mar 2005-arXiv: Databases

TL;DR: This work uses a relational framework in which preferences are represented using binary relations and identifies several classes of revisions that preserve order axioms, for example the axioms of strict partial or weak orders.

Abstract: We study here preference revision, considering both the monotonic case where the original preferences are preserved and the nonmonotonic case where the new preferences may override the original ones. We use a relational framework in which preferences are represented using binary relations (not necessarily finite). We identify several classes of revisions that preserve order axioms, for example the axioms of strict partial or weak orders. We consider applications of our results to preference querying in relational databases.

Posted Content•

Business intelligence systems and user's parameters: an application to a documents' database

Babajide Afolabi¹, Odile Thiery¹•Institutions (1)

SITE Santa Fe¹

28 Sep 2005-arXiv: Databases

TL;DR: A user model based on cognitive user evolution is presented. When used together with a good definition of the information needs of the user (decision maker), it will accelerate the user's decision-making process.

Abstract: This article presents earlier results of our research work in the area of modeling Business Intelligence Systems. The basic idea of this research area is presented first. We then show the necessity of including certain user parameters in the information systems that are used in Business Intelligence systems in order to obtain a better response from such systems. We identify two main types of attributes that can be missing from a base and show why they need to be included. A user model based on cognitive user evolution is presented. This model, when used together with a good definition of the information needs of the user (decision maker), will accelerate the user's decision-making process.

Posted Content•

Efficient Management of Short-Lived Data

Albrecht Schmidt, Christian S. Jensen

16 May 2005-arXiv: Databases

TL;DR: This work presents data structures and algorithms for online management of data tagged with expiration times based on fully functional, persistent treaps, which are a combination of binary search trees with respect to a primary attribute and heaps with respect to a secondary attribute.

Abstract: Motivated by the increasing prominence of loosely-coupled systems, such as mobile and sensor networks, which are characterised by intermittent connectivity and volatile data, we study the tagging of data with so-called expiration times. More specifically, when data are inserted into a database, they may be tagged with time values indicating when they expire, i.e., when they are regarded as stale or invalid and thus are no longer considered part of the database. In a number of applications, expiration times are known and can be assigned at insertion time. We present data structures and algorithms for online management of data tagged with expiration times. The algorithms are based on fully functional, persistent treaps, which are a combination of binary search trees with respect to a primary attribute and heaps with respect to a secondary attribute. The primary attribute implements primary keys, and the secondary attribute stores expiration times in a minimum heap, thus keeping a priority queue of tuples to expire. A detailed and comprehensive experimental study demonstrates the well-behavedness and scalability of the approach as well as its efficiency with respect to a number of competitors.
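
The treap idea in this abstract can be sketched in a few lines of Python: a search tree ordered by key whose nodes also form a minimum heap on expiration time, built from immutable nodes so that earlier versions stay valid. This is only an illustrative sketch, not the authors' implementation. The names `Node`, `insert`, `merge`, `expire`, and `keys` are hypothetical, keys are assumed to be distinct, and no claim is made about balance or performance.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int                          # primary attribute (search-tree order)
    exp: float                        # expiration time (min-heap order)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(t, key, exp):
    """Return a new treap that also contains (key, exp); assumes distinct keys."""
    if t is None:
        return Node(key, exp)
    if key < t.key:
        left = insert(t.left, key, exp)
        if left.exp < t.exp:          # heap order violated: rotate right
            return Node(left.key, left.exp, left.left, t._replace(left=left.right))
        return t._replace(left=left)
    right = insert(t.right, key, exp)
    if right.exp < t.exp:             # heap order violated: rotate left
        return Node(right.key, right.exp, t._replace(right=right.left), right.right)
    return t._replace(right=right)


def merge(a, b):
    """Merge two treaps where every key in a is smaller than every key in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.exp <= b.exp:
        return a._replace(right=merge(a.right, b))
    return b._replace(left=merge(a, b.left))


def expire(t, now):
    """Return a treap without the tuples whose expiration time is <= now."""
    while t is not None and t.exp <= now:
        t = merge(t.left, t.right)    # the root always expires first
    return t


def keys(t):
    """List the keys in sorted order (in-order traversal)."""
    return [] if t is None else keys(t.left) + [t.key] + keys(t.right)


t = None
for k, e in [(5, 30.0), (2, 10.0), (8, 20.0)]:
    t = insert(t, k, e)
later = expire(t, 20.0)
```

Because nodes are never modified in place, `t` still holds all three keys after `later` has been computed, which is the sense in which the structure is persistent.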

Posted Content•

First-order Complete and Computationally Complete Query Languages for Spatio-Temporal Databases

Floris Geerts¹, Sofie Haesevoets², Bart Kuijpers²•Institutions (2)

University of Edinburgh¹, University of Hasselt²

04 Mar 2005-arXiv: Databases

TL;DR: This work defines spatio-temporal queries to be computable mappings that are also generic, meaning that the result of a query may only depend to a limited extent on the actual internal representation of the spatio-temporal data.

Abstract: We address a fundamental question concerning spatio-temporal database systems: "What are exactly spatio-temporal queries?" We define spatio-temporal queries to be computable mappings that are also generic, meaning that the result of a query may only depend to a limited extent on the actual internal representation of the spatio-temporal data. Genericity is defined as invariance under groups of geometric transformations that preserve certain characteristics of spatio-temporal data (e.g., collinearity, distance, velocity, acceleration, ...). These groups depend on the notions that are relevant in particular spatio-temporal database applications. These transformations also have the distinctive property that they respect the monotone and unidirectional nature of time. We investigate different genericity classes with respect to the constraint database model for spatio-temporal databases and we identify sound and complete languages for the first-order and the computable queries in these genericity classes. We distinguish between genericity determined by time-invariant transformations, genericity notions concerning physical quantities and genericity determined by time-dependent transformations.

Posted Content•

A Unified Subspace Outlier Ensemble Framework for Outlier Detection in High Dimensional Spaces

Zengyou He, Xiaofei Xu, Shengchun Deng

24 May 2005-arXiv: Databases

TL;DR: A unified framework for outlier detection in high dimensional spaces from an ensemble-learning viewpoint is proposed, and a very simple and fast algorithm, namely SOE1, in which only subspaces with one dimension are used for mining outliers from large categorical datasets, is developed.

Abstract: The task of outlier detection is to find small groups of data objects that are exceptional when compared with the rest of the data. Detection of such outliers is important for many applications such as fraud detection and customer migration. Most such applications are high dimensional domains in which the data may contain hundreds of dimensions. However, the outlier detection problem itself is not well defined and none of the existing definitions are widely accepted, especially in high dimensional space. In this paper, our first contribution is to propose a unified framework for outlier detection in high dimensional spaces from an ensemble-learning viewpoint. In our new framework, the outlying-ness of each data object is measured by fusing outlier factors in different subspaces using a combination function. Accordingly, we show that all existing research on outlier detection can be regarded as special cases in the unified framework with respect to the set of subspaces considered and the type of combination function used. In addition, to demonstrate the usefulness of the ensemble-learning based outlier detection framework, we developed a very simple and fast algorithm, namely SOE1 (Subspace Outlier Ensemble using 1-dimensional Subspaces), in which only subspaces with one dimension are used for mining outliers from large categorical datasets. The SOE1 algorithm needs only two scans over the dataset and hence is very appealing in real data mining applications. Experimental results on real datasets and large synthetic datasets show that: (1) SOE1 has comparable performance with respect to state-of-the-art outlier detection algorithms on identifying true outliers and (2) SOE1 can be an order of magnitude faster than one of the fastest outlier detection algorithms known so far.

Posted Content•

Database Reformulation with Integrity Constraints (extended abstract)

Rada Chirkova, Michael R. Genesereth

08 Jun 2005-arXiv: Databases

TL;DR: An unchase technique is introduced, which reduces the problem of query equivalence under constraints to equivalence in the absence of constraints without increasing query size.

Abstract: In this paper we study the problem of reducing the evaluation costs of queries on finite databases in the presence of integrity constraints, by designing and materializing views. Given a database schema, a set of queries defined on the schema, a set of integrity constraints, and a storage limit, to find a solution to this problem means to find a set of views that satisfies the storage limit, provides equivalent rewritings of the queries under the constraints (this requirement is weaker than equivalence in the absence of constraints), and reduces the total costs of evaluating the queries. This problem, database reformulation, is important for many applications, including data warehousing and query optimization. We give complexity results and algorithms for database reformulation in the presence of constraints, for conjunctive queries, views, and rewritings and for several types of constraints, including functional and inclusion dependencies. To obtain better complexity results, we introduce an unchase technique, which reduces the problem of query equivalence under constraints to equivalence in the absence of constraints without increasing query size.

Posted Content•

Treillis de concepts et ontologies pour l'interrogation d'un annuaire de sources de données biologiques (BioRegistry)

Nizar Messai¹, Marie-Dominique Devignes¹, Malika Smaïl-Tabbone¹, Amedeo Napoli¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

06 Jun 2005-arXiv: Databases

TL;DR: An approach based on formal concept analysis to classify and search relevant bioinformatic data sources for a given query and an improvement of the approach consists in automatic query refinement thanks to domain ontologies.

...read moreread less

Abstract: Bioinformatic data sources available on the web are numerous and heterogeneous. Because of the lack of documentation and the difficulty of interacting with these data sources, users need competence in both informatics and biology to make optimal use of the sources' contents, which remain rather under-exploited. In this paper we present an approach based on formal concept analysis to classify and search relevant bioinformatic data sources for a given query. It consists of building the concept lattice from the binary relation between bioinformatic data sources and their associated metadata. The concept built from a given query is then merged into the concept lattice. The result is obtained by extracting the set of sources belonging to the extents of the query concept's subsumers in the resulting concept lattice. The sources are ranked by the concept specificity order in the concept lattice. An improvement of the approach consists of automatic query refinement using domain ontologies. Two forms of refinement are possible: by generalisation and by specialisation.
-----
Biological data sources available on the web are numerous and heterogeneous. Making optimal use of these resources currently requires users to be competent in both computer science and biology, because of the lack of documentation and the difficulty of interacting with the data sources. As a result, the contents of these resources often remain under-exploited. We present here an approach based on formal concept analysis to organize and search for biological data sources relevant to a given query. The work consists of building a concept lattice from the metadata associated with the sources. The concept built from a given query is then integrated into the lattice. The answer to the query is then given by extracting the data sources that belong to the extents of the concepts subsuming the query concept in the lattice. The sources returned can be sorted according to the specificity order of the concepts in the lattice. A query refinement procedure, based on domain ontologies, improves recall by generalisation or by specialisation.
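
The retrieval step described in this abstract can be illustrated with a much simplified sketch. It does not build a concept lattice. Instead it ranks sources by how many query attributes their metadata share, which only approximates the specificity order of the query concept's subsumers. The function name `rank_sources`, the source names, and the metadata attributes are all hypothetical.

```python
def rank_sources(context, query):
    """Rank sources by overlap with the query attributes, largest overlap first.

    `context` maps each source name to its set of metadata attributes
    (the binary relation between sources and metadata).
    """
    scored = []
    for source, attrs in context.items():
        shared = attrs & query
        if shared:  # sources sharing no query attribute are not returned
            scored.append((len(shared), source))
    # Sort by descending overlap, then by name to get a deterministic order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [source for _, source in scored]


# Hypothetical sources and metadata, for illustration only.
context = {
    "SourceA": {"protein", "sequence", "human"},
    "SourceB": {"protein", "structure"},
    "SourceC": {"gene", "expression"},
}
print(rank_sources(context, {"protein", "sequence"}))
```

Here the source matching both query attributes is listed before the source matching only one, and the source matching none is left out.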

Posted Content•

Theory and Practice of Transactional Method Caching

Daniel Pfeifer¹, Peter C. Lockemann¹•Institutions (1)

Karlsruhe Institute of Technology¹

09 Mar 2005-arXiv: Databases

TL;DR: This paper extends the concept of method caching by addressing the case where clients wrap related method invocations in ACID transactions, and extends a classical transaction formalism to create a protocol for scheduling cached method results.

Abstract: Tiered architectures are now widely accepted for constructing large-scale information systems. In this context, application servers often form the bottleneck for a system's efficiency. An application server exposes an object-oriented interface consisting of a set of methods that are accessed by potentially remote clients. The idea of method caching is to store, on the client side, the results of read-only method invocations on the application server's interface. If the client invokes the same method with the same arguments again, the corresponding result can be taken from the cache without contacting the server. It has been shown that this approach can considerably improve a real-world system's efficiency. This paper extends the concept of method caching by addressing the case where clients wrap related method invocations in ACID transactions. Demarcating sequences of method calls in this way is supported by many important application server standards. In this context, the paper presents an architecture, a theory, and an efficient protocol for maintaining full transactional consistency, and in particular serializability, when using a method cache on the client side. To create a protocol for scheduling cached method results, the paper extends a classical transaction formalism. Based on this extension, a recovery protocol and an optimistic serializability protocol are derived. The latter differs from traditional transactional cache protocols in many essential ways. An efficiency experiment validates the approach: using the cache considerably improves a system's performance and scalability.
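
The basic idea of method caching, before any transactional extension, can be sketched as follows. This is a minimal illustration that assumes results are keyed by method name and arguments. It does not implement the paper's recovery or optimistic serializability protocols, and `MethodCache`, `PriceServer`, and `get_price` are hypothetical names.

```python
class PriceServer:
    """Stand-in for an application server with one read-only method."""

    def __init__(self):
        self.calls = 0  # counts invocations that actually reach the server

    def get_price(self, item):
        self.calls += 1
        return {"apple": 3, "pear": 5}[item]


class MethodCache:
    """Client-side cache of read-only method results."""

    def __init__(self, server):
        self._server = server
        self._results = {}
        self.hits = 0

    def call(self, method, *args):
        key = (method, args)  # same method and same arguments: same entry
        if key in self._results:
            self.hits += 1
            return self._results[key]
        result = getattr(self._server, method)(*args)
        self._results[key] = result
        return result

    def invalidate(self):
        """Drop all cached results, e.g. after the server state changes."""
        self._results.clear()


server = PriceServer()
cache = MethodCache(server)
cache.call("get_price", "apple")  # first call contacts the server
cache.call("get_price", "apple")  # second call is answered from the cache
print(server.calls, cache.hits)
```

Deciding when cached results may be used inside a transaction without violating serializability is the part the paper addresses, and this sketch leaves it out.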

Posted Content•

Iterative Algorithm for Finding Frequent Patterns in Transactional Databases

Gennady P. Berman, Vyacheslav N. Gorshkov, Edward P. MacKerrow¹, Xidi Wang²•Institutions (2)

Los Alamos National Laboratory¹, Citigroup²

26 Aug 2005-arXiv: Databases

TL;DR: Presents a high-performance algorithm that searches for frequent patterns (FPs) in transactional databases using an iterative sieve algorithm that computes a set of enclosed cycles.

Abstract: A high-performance algorithm for searching for frequent patterns (FPs) in transactional databases is presented. The search for FPs is carried out with an iterative sieve algorithm that computes a set of enclosed cycles. In each inner cycle of level m, FPs composed of m elements are generated. The assigned number of enclosed cycles (the parameter of the problem) defines the maximum length of the desired FPs. The efficiency of the algorithm comes from (i) the extremely simple logical searching scheme, (ii) the avoidance of recursive procedures, and (iii) the use of only one-dimensional arrays of integers.
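
To make the idea of a non-recursive, level-by-level search concrete, here is a generic level-wise frequent-pattern sketch. It is not the authors' sieve algorithm and does not use one-dimensional integer arrays; it only shows how each level can generate patterns one element longer than the previous level, up to a maximum length. The names `frequent_patterns`, `min_support`, and `max_length` are assumptions for illustration.

```python
from itertools import combinations


def frequent_patterns(transactions, min_support, max_length):
    """Return {pattern: support} for frequent patterns of up to max_length items."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    # Level 1 candidates are the single items.
    candidates = [frozenset([item]) for item in items]
    for level in range(1, max_length + 1):
        # Count the support of every candidate at this level.
        survivors = {}
        for candidate in candidates:
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                survivors[candidate] = support
        if not survivors:
            break
        frequent.update(survivors)
        # Join surviving patterns into candidates one element longer.
        joined = {a | b for a, b in combinations(survivors, 2)
                  if len(a | b) == level + 1}
        # Keep a candidate only if all its subsets of the current size survived.
        candidates = [c for c in joined
                      if all(frozenset(s) in survivors
                             for s in combinations(c, level))]
    return frequent


baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
for pattern, support in frequent_patterns(baskets, 2, 3).items():
    print(sorted(pattern), support)
```

The loop over `level` plays the role of the maximum pattern length parameter mentioned in the abstract.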

Posted Content•

Mining Top-k Approximate Frequent Patterns

Zengyou He

17 Mar 2005-arXiv: Databases

TL;DR: This paper addresses the overwhelmingly large output size of frequent pattern mining by introducing the problem of mining top-k approximate frequent patterns, deriving upper bounds on the objective function, and presenting an approximate branch-and-bound method for finding a feasible solution.

Abstract: Frequent pattern (itemset) mining in transactional databases is one of the most well-studied problems in data mining. One obstacle that limits the practical use of frequent pattern mining is the extremely large number of patterns generated. Such a large output collection is difficult for users to understand and use in practice. Even restricting the output to the border of the frequent itemset collection does not help much in alleviating the problem. In this paper we address the issue of overwhelmingly large output size by introducing and studying the following problem: mining top-k approximate frequent patterns. The union of the power sets of these k sets should satisfy the following conditions: (1) it includes as many itemsets with larger support as possible, and (2) it includes as few itemsets with smaller support as possible. An integrated objective function is designed to combine these two objectives. We then derive upper bounds on the objective function and present an approximate branch-and-bound method for finding a feasible solution. We give empirical evidence showing that our formulation and approximation methods work well in practice.

Posted Content•

Benefits of InterSite Pre-Processing and Clustering Methods in E-Commerce Domain

Sergiu Chelcea¹, Alzennyr Da Silva¹, Yves Lechevallier¹, Doru Tanasa¹, Brigitte Trousse¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

30 Nov 2005-arXiv: Databases

TL;DR: The authors present a preprocessing and clustering analysis of the clickstream dataset proposed for the ECML/PKDD 2005 Discovery Challenge and show how they build a rich data warehouse based on advanced preprocessing.

Abstract: This paper presents our preprocessing and clustering analysis of the clickstream dataset proposed for the ECML/PKDD 2005 Discovery Challenge. The main contributions of this article are twofold. First, after presenting the clickstream dataset, we show how we build a rich data warehouse based on advanced preprocessing. We take into account the intersite aspects of the given e-commerce domain, which offer an interesting data structure. A preliminary statistical analysis based on time-period clickstreams is given, emphasizing the importance of intersite user visits in such a context. Second, we describe our crossed-clustering method, which is applied to data generated from our data warehouse. Our preliminary results are promising and illustrate the benefits of our web usage mining (WUM) methods, although more investigation is needed on the same dataset.

Posted Content•

Where the Rubber Meets the Sky: Bridging the Gap between Databases and Science

Jim Gray, Alexander S. Szalay

02 Feb 2005-arXiv: Databases

TL;DR: This article summarizes the authors' experiences over the last seven years trying to bridge the gap between database technology and the needs of the astronomy community in building the World-Wide Telescope.

Abstract: Scientists in all domains face a data avalanche, both from better instruments and from improved simulations. We believe that computer science tools and computer scientists are in a position to help all the sciences by building tools and developing techniques to manage, analyze, and visualize peta-scale scientific information. This article summarizes our experiences over the last seven years trying to bridge the gap between database technology and the needs of the astronomy community in building the World-Wide Telescope.

Posted Content•

Instance-Independent View Serializability for Semistructured Databases

Stijn Dekeyser¹, Jan Hidders, Jan Paredaens, Roel Vercammen•Institutions (1)

University of Southern Queensland¹

26 May 2005-arXiv: Databases

TL;DR: Proves that the equivalence of two given schedules in this framework is decidable in polynomial time, which also solves the view serializability problem for semistructured schedules in time polynomial in the size of the schedule and exponential in the number of transactions.

Abstract: Semistructured databases require tailor-made concurrency control mechanisms, since traditional solutions for the relational model have been shown to be inadequate. Such mechanisms need to take full advantage of the hierarchical structure of semistructured data, for instance by allowing concurrent updates of subtrees of, or even individual elements in, XML documents. We present an approach to concurrency control that is document-independent in the sense that two schedules of semistructured transactions are considered equivalent if they are equivalent on all possible documents. We prove that it is decidable in polynomial time whether two given schedules in this framework are equivalent. This also solves the view serializability problem for semistructured schedules in time polynomial in the size of the schedule and exponential in the number of transactions.
