Showing papers on "Tuple published in 2009"

PDF

Open Access

Proceedings Article•DOI•

Keyword Search in Spatial Databases: Towards Searching by Document

[...]

Dongxiang Zhang¹, Yeow Meng Chee², Anirban Mondal³, Anthony K. H. Tung¹, Masaru Kitsuregawa³ - Show less +1 more•Institutions (3)

National University of Singapore¹, Nanyang Technological University², University of Tokyo³

29 Mar 2009

TL;DR: This work addresses a novel spatial keyword query called the m-closest keywords (mCK) query, which aims to find the spatially closest tuples which match m user-specified keywords, and introduces a new index called the bR*-tree, which is an extension of the R-tree.

...read moreread less

Abstract: This work addresses a novel spatial keyword query called the m-closest keywords (mCK) query. Given a database of spatial objects, each tuple is associated with some descriptive information represented in the form of keywords. The mCK query aims to find the spatially closest tuples which match m user-specified keywords. Given a set of keywords from a document, mCK query can be very useful in geotagging the document by comparing the keywords to other geotagged documents in a database. To answer mCK queries efficiently, we introduce a new index called the bR*-tree, which is an extension of the R*-tree. Based on bR*-tree, we exploit a priori-based search strategies to effectively reduce the search space. We also propose two monotone constraints, namely the distance mutex and keyword mutex, as our a priori properties to facilitate effective pruning. Our performance study demonstrates that our search strategy is indeed efficient in reducing query response time and demonstrates remarkable scalability in terms of the number of query keywords which is essential for our main application of searching by document.

...read moreread less

298 citations

Proceedings Article•DOI•

Group-by skyline query processing in relational engines

[...]

Ming-Hay Luk¹, Man Lung Yiu¹, Eric Lo¹•Institutions (1)

Hong Kong Polytechnic University¹

02 Nov 2009

TL;DR: In this article, the authors present a comprehensive study on processing group-by skyline queries in the context of relational engines, and examine the composition of a query plan for a groupby skyline query and develop the missing cost model for the BBS algorithm.

...read moreread less

Abstract: The skyline operator was first proposed in 2001 for retrieving interesting tuples from a dataset. Since then, 100+ skyline-related papers have been published; however, we discovered that one of the most intuitive and practical type of skyline queries, namely, group-by skyline queries remains unaddressed. Group-by skyline queries find the skyline for each group of tuples. In this paper, we present a comprehensive study on processing group-by skyline queries in the context of relational engines. Specifically, we examine the composition of a query plan for a group-by skyline query and develop the missing cost model for the BBS algorithm. Experimental results show that our techniques are able to devise the best query plans for a variety of group-by skyline queries. Our focus is on algorithms that can be directly implemented in today's commercial database systems without the addition of new access methods (which would require addressing the associated challenges of maintenance with updates, concurrency control, etc.).

...read moreread less

281 citations

Journal Article•DOI•

Computing the Numerical Scale of the Linguistic Term Set for the 2-Tuple Fuzzy Linguistic Representation Model

[...]

Yucheng Dong¹, Yinfeng Xu¹, Shui Yu²•Institutions (2)

Xi'an Jiaotong University¹, Deakin University²

01 Dec 2009-IEEE Transactions on Fuzzy Systems

TL;DR: By defining the concept of the transitive calibration matrix and its consistent index, this paper develops an optimization model to compute the numerical scale of the linguistic term set and the desired properties of the optimization model are presented.

...read moreread less

Abstract: When using linguistic approaches to solve decision problems, we need the techniques for computing with words (CW). Together with the 2-tuple fuzzy linguistic representation models (i.e., the Herrera and Martinez model and the Wang and Hao model), some computational techniques for CW are also developed. In this paper, we define the concept of numerical scale and extend the 2-tuple fuzzy linguistic representation models under the numerical scale. We find that the key of computational techniques based on linguistic 2-tuples is to set suitable numerical scale with the purpose of making transformations between linguistic 2-tuples and numerical values. By defining the concept of the transitive calibration matrix and its consistent index, this paper develops an optimization model to compute the numerical scale of the linguistic term set. The desired properties of the optimization model are also presented. Furthermore, we discuss how to construct the transitive calibration matrix for decision problems using linguistic preference relations and analyze the linkage between the consistent index of the transitive calibration matrix and one of the linguistic preference relations. The results in this paper are pretty helpful to complete the fuzzy 2-tuple representation models for CW.

...read moreread less

265 citations

Proceedings Article•DOI•

Automatic verification of data-centric business processes

[...]

Alin Deutsch¹, Richard Hull², Fabio Patrizi³, Victor Vianu¹•Institutions (3)

University of California, San Diego¹, IBM², Sapienza University of Rome³

23 Mar 2009

TL;DR: This investigation builds upon previous work on verification of data-driven Web services and ASM transducers, while addressing significant new technical challenges raised by the artifact model.

...read moreread less

Abstract: We formalize and study business process systems that are centered around "business artifacts", or simply "artifacts". Artifacts are used to represent (real or conceptual) key business entities, including both their data schema and lifecycles. The lifecycle of an artifact type specifies the possible sequencings of services that can be applied to an artifact of this type as it progresses through the business process. The artifact-centric approach was introduced by IBM, and has been used to achieve substantial savings when performing business transformations.In this paper, artifacts carry attribute records and internal state relations (holding sets of tuples) that services can consult and update. In addition, services can access an underlying database and can introduce new values from an infinite domain, thus modeling external inputs or partially specified processes described by pre-and-post conditions. The lifecycles associate services to the artifacts using declarative, condition-action style rules.We consider the problem of statically verifying whether all runs of an artifact system satisfy desirable correctness properties expressed in a first-order extension of linear-time temporal logic. We map the boundaries of decidability for the verification problem and provide its complexity. The technical challenge to static verification stems from the presence of data from an infinite domain, yielding an infinite-state system. While much work has been done lately in the verification community on model checking specialized classes of infinite-state systems, the available results do not transfer to our framework, and this remains a difficult problem. We identify an expressive class of artifact systems for which verification is nonetheless decidable. The complexity of verification is PSPACE-complete, which is no worse than classical finite-state model checking.This investigation builds upon previous work on verification of data-driven Web services and ASM transducers, while addressing significant new technical challenges raised by the artifact model.

...read moreread less

264 citations

Journal Article•DOI•

Programming pervasive and mobile computing applications: The TOTA approach

[...]

Marco Mamei, Franco Zambonelli

30 Jul 2009-ACM Transactions on Software Engineering and Methodology

TL;DR: TOTA promotes a simple way of programming that facilitates access to distributed information, navigation in complex environments, and the achievement of complex coordination tasks in a fully distributed and adaptive way, mostly freeing programmers and system managers from the need to take care of low-level issues related to network dynamics.

...read moreread less

Abstract: Pervasive and mobile computing call for suitable middleware and programming models to support the activities of complex software systems in dynamic network environments. In this article we present TOTA (“Tuples On The Air”), a novel middleware and programming approach for supporting adaptive context-aware activities in pervasive and mobile computing scenarios. The key idea in TOTA is to rely on spatially distributed tuples, adaptively propagated across a network on the basis of application-specific rules, for both representing contextual information and supporting uncoupled interactions between application components. TOTA promotes a simple way of programming that facilitates access to distributed information, navigation in complex environments, and the achievement of complex coordination tasks in a fully distributed and adaptive way, mostly freeing programmers and system managers from the need to take care of low-level issues related to network dynamics. This article includes both application examples to clarify concepts and performance figures to show the feasibility of the approach

...read moreread less

220 citations

Proceedings Article•DOI•

Semantics of Ranking Queries for Probabilistic Data and Expected Ranks

[...]

Graham Cormode¹, Feifei Li², Ke Yi³•Institutions (3)

AT&T Labs¹, Florida State University², Hong Kong University of Science and Technology³

29 Mar 2009

TL;DR: This work is able to prove that, in contrast to all existing approaches, the expected rank satisfies all the required properties for a ranking query, and provides efficient solutions to compute this ranking across the major models of uncertain data, such as attribute-level and tuple-level uncertainty.

...read moreread less

Abstract: When dealing with massive quantities of data, top-k queries are a powerful technique for returning only the k most relevant tuples for inspection, based on a scoring function. The problem of efficiently answering such ranking queries has been studied and analyzed extensively within traditional database settings. The importance of the top-k is perhaps even greater in probabilistic databases, where a relation can encode exponentially many possible worlds. There have been several recent attempts to propose definitions and algorithms for ranking queries over probabilistic data. However, these all lack many of the intuitive properties of a top-k over deterministic data. Specifically, we define a number of fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are all satisfied by ranking queries on certain data. We argue that all these conditions should also be fulfilled by any reasonable definition for ranking uncertain data. Unfortunately, none of the existing definitions is able to achieve this. To remedy this shortcoming, this work proposes an intuitive new approach of expected rank. This uses the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of the ranking. We are able to prove that, in contrast to all existing approaches, the expected rank satisfies all the required properties for a ranking query. We provide efficient solutions to compute this ranking across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. For an uncertain relation of N tuples, the processing cost is O(N logN)—no worse than simply sorting the relation. In settings where there is a high cost for generating each tuple in turn, we provide pruning techniques based on probabilistic tail bounds that can terminate the search early and guarantee that the top-k has been found. Finally, a comprehensive experimental study confirms the effectiveness of our approach.

...read moreread less

214 citations

Journal Article•DOI•

Reasoning about record matching rules

[...]

Wenfei Fan¹, Xibei Jia², Jianzhong Li¹, Shuai Ma²•Institutions (2)

Harbin Institute of Technology¹, University of Edinburgh²

01 Aug 2009

TL;DR: A class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations is introduced, defined in terms of similarity metrics and a dynamic semantics, and a mechanism for inferring MDs is proposed, a departure from traditional implication analysis.

...read moreread less

Abstract: To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n2) time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.

...read moreread less

213 citations

Journal Article•DOI•

Streams on wires: a query compiler for FPGAs

[...]

Rene Mueller¹, Jens Teubner¹, Gustavo Alonso¹•Institutions (1)

ETH Zurich¹

01 Aug 2009

TL;DR: Glasgow, a component library and compositional compiler that transforms continuous queries into logic circuits by composing library components on an operator-level basis is presented.

...read moreread less

Abstract: Taking advantage of many-core, heterogeneous hardware for data processing tasks is a difficult problem. In this paper, we consider the use of FPGAs for data stream processing as coprocessors in many-core architectures. We present Glacier, a component library and compositional compiler that transforms continuous queries into logic circuits by composing library components on an operator-level basis. In the paper we consider selection, aggregation, grouping, as well as windowing operators, and discuss their design as modular elements.We also show how significant performance improvements can be achieved by inserting the FPGA into the system's data path (e.g., between the network interface and the host CPU). Our experiments show that queries on the FPGA can process streams at more than one million tuples per second and that they can do this directly from the network, removing much of the overhead of transferring the data to a conventional CPU.

...read moreread less

167 citations

Proceedings Article•DOI•

Self-organizing tuple reconstruction in column-stores

[...]

Stratos Idreos, Martin L. Kersten, Stefan Manegold

29 Jun 2009

TL;DR: A novel design, partial sideways cracking, is proposed that achieves performance similar to using presorted data, but without requiring the heavy initial presorting step itself, and brings significant performance benefits for multi-attribute queries.

...read moreread less

Abstract: Column-stores gained popularity as a promising physical design alternative. Each attribute of a relation is physically stored as a separate column allowing queries to load only the required attributes. The overhead incurred is on-the-fly tuple reconstruction for multi-attribute queries. Each tuple reconstruction is a join of two columns based on tuple IDs, making it a significant cost component. The ultimate physical design is to have multiple presorted copies of each base table such that tuples are already appropriately organized in multiple different orders across the various columns. This requires the ability to predict the workload, idle time to prepare, and infrequent updates. In this paper, we propose a novel design, partial sideways cracking, that minimizes the tuple reconstruction cost in a self-organizing way. It achieves performance similar to using presorted data, but without requiring the heavy initial presorting step itself. Instead, it handles dynamic, unpredictable workloads with no idle time and frequent updates. Auxiliary dynamic data structures, called cracker maps, provide a direct mapping between pairs of attributes used together in queries for tuple reconstruction. A map is continuously physically reorganized as an integral part of query evaluation, providing faster and reduced data access for future queries. To enable flexible and self-organizing behavior in storage-limited environments, maps are materialized only partially as demanded by the workload. Each map is a collection of separate chunks that are individually reorganized, dropped or recreated as needed. We implemented partial sideways cracking in an open-source column-store. A detailed experimental analysis demonstrates that it brings significant performance benefits for multi-attribute queries.

...read moreread less

166 citations

Proceedings Article•DOI•

Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting

[...]

Boris Glavic¹, Gustavo Alonso²•Institutions (2)

University of Zurich¹, ETH Zurich²

29 Mar 2009

TL;DR: This paper presents an alternative approach that uses query rewriting to annotate result tuples with provenance information and formalizes the query rewriting procedures, proves their correctness, and evaluates a first implementation of the ideas using PostgreSQL.

...read moreread less

Abstract: Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of a relational databases, the source and intermediate dataitems are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can, thus, be queried, stored and optimized using standard relational database techniques. In the paper we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information inducing only a small overhead on normal operations.

...read moreread less

161 citations

Proceedings Article•DOI•

Keyword search in databases: the power of RDBMS

[...]

Lu Qin¹, Jeffrey Xu Yu¹, Lijun Chang¹•Institutions (1)

The Chinese University of Hong Kong¹

29 Jun 2009

TL;DR: The main idea behind the approach is tuple reduction, which uses SQL to compute all the interconnected tuple structures for a given keyword query, and uses three types of interconnected tuples to achieve that and controls the size of the structures.

...read moreread less

Abstract: Keyword search in relational databases (RDBs) has been extensively studied recently. A keyword search (or a keyword query) in RDBs is specified by a set of keywords to explore the interconnected tuple structures in an RDB that cannot be easily identified using SQL on RDBMS. In brief, it finds how the tuples containing the given keywords are connected via sequences of connections (foreign key references) among tuples in an RDB. Such interconnected tuple structures can be found as connected trees up to a certain size, sets of tuples that are reachable from a root tuple within a radius, or even multi-center subgraphs within a radius. In the literature, there are two main approaches. One is to generate a set of relational algebra expressions and evaluate every such expression using SQL on an RDBMS directly or in a middleware on top of an RDBMS indirectly. Due to a large number of relational algebra expressions needed to process, most of the existing works take a middleware approach without fully utilizing RDBMSs. The other is to materialize an RDB as a graph and find the interconnected tuple structures using graph-based algorithms in memory. In this paper we focus on using SQL to compute all the interconnected tuple structures for a given keyword query. We use three types of interconnected tuple structures to achieve that and we control the size of the structures. We show that the current commercial RDBMSs are powerful enough to support such keyword queries in RDBs efficiently without any additional new indexing to be built and maintained. The main idea behind our approach is tuple reduction. In our approach, in the first reduction step, we prune tuples that do not participate in any results using SQL, and in the second join step, we process the relational algebra expressions using SQL over the reduced relations. We conducted extensive experimental studies using two commercial RDBMSs and two large real datasets, and we report the efficiency of our approaches in this paper.

...read moreread less

Journal Article•DOI•

PrDB: managing and exploiting rich correlations in probabilistic databases

[...]

Prithviraj Sen¹, Amol Deshpande¹, Lise Getoor¹•Institutions (1)

University of Maryland, College Park¹

01 Oct 2009

TL;DR: This work defines a Probabilistic database model, PrDB, that uses graphical models, a state-of-the-art probabilistic modeling technique developed within the statistics and machine learning community, to model uncertain data and shows how the use of shared correlations, together with a novel inference algorithm based on bisimulation, can speed query processing significantly.

...read moreread less

Abstract: Due to numerous applications producing noisy data, e.g., sensor data, experimental data, data from uncurated sources, information extraction, etc., there has been a surge of interest in the development of probabilistic databases. Most probabilistic database models proposed to date, however, fail to meet the challenges of real-world applications on two counts: (1) they often restrict the kinds of uncertainty that the user can represent; and (2) the query processing algorithms often cannot scale up to the needs of the application. In this work, we define a probabilistic database model, PrDB, that uses graphical models, a state-of-the-art probabilistic modeling technique developed within the statistics and machine learning community, to model uncertain data. We show how this results in a rich, complex yet compact probabilistic database model, which can capture the commonly occurring uncertainty models (tuple uncertainty, attribute uncertainty), more complex models (correlated tuples and attributes) and allows compact representation (shared and schema-level correlations). In addition, we show how query evaluation in PrDB translates into inference in an appropriately augmented graphical model. This allows us to easily use any of a myriad of exact and approximate inference algorithms developed within the graphical modeling community. While probabilistic inference provides a generic approach to solving queries, we show how the use of shared correlations, together with a novel inference algorithm that we developed based on bisimulation, can speed query processing significantly. We present a comprehensive experimental evaluation of the proposed techniques and show that even with a few shared correlations, significant speedups are possible.

...read moreread less

Proceedings Article•DOI•

Querying Communities in Relational Databases

[...]

Lu Qin¹, Jeffrey Xu Yu¹, Lijun Chang¹, Yufei Tao¹•Institutions (1)

The Chinese University of Hong Kong¹

29 Mar 2009

TL;DR: This paper proposes new efficient algorithms to find all/top-k communities which consume small memory, for an l-keyword query, and conducts extensive performance studies using two large real datasets to confirm the efficiency of the algorithms.

...read moreread less

Abstract: Keyword search on relational databases provides users with insights that they can not easily observe using the traditional RDBMS techniques. Here, an l-keyword query is specified by a set of l keywords, {k1, k2, · · · , kl}. It finds how the tuples that contain the keywords are connected in a relational database via the possible foreign key references. Conceptually, it is to find some structural information in a database graph, where nodes are tuples and edges are foreign key references. The existing work studied how to find connected trees for an l-keyword query. However, a tree may only show partial information about how those tuples that contain the keywords are connected. In this paper, we focus on finding communities for an l-keyword query. A community is an induced subgraph that contains all the l-keywords within a given distance. We propose new efficient algorithms to find all/top-k communities which consume small memory, for an l-keyword query. For top kl-keyword queries, our algorithm allows users to interactively enlarge k at run time. We conducted extensive performance studies using two large real datasets to confirm the efficiency of our algorithms.

...read moreread less

Proceedings Article•DOI•

SPROUT: Lazy vs. Eager Query Plans for Tuple-Independent Probabilistic Databases

[...]

Dan Olteanu¹, Jiewen Huang¹, Christoph Koch²•Institutions (2)

University of Oxford¹, Cornell University²

29 Mar 2009

TL;DR: An efficient secondary-storage operator for exact computation of queries on tuple-independent probabilistic databases, which is semantically equivalent to a sequence of aggregations and can be naturally integrated into existing relational query plans.

...read moreread less

Abstract: A paramount challenge in probabilistic databases is the scalable computation of confidences of tuples in query results. This paper introduces an efficient secondary-storage operator for exact computation of queries on tuple-independent probabilistic databases. We consider the conjunctive queries without self-joins that are known to be tractable on any tuple-independent database, and queries that are not tractable in general but become tractable on probabilistic databases restricted by functional dependencies. Our operator is semantically equivalent to a sequence of aggregations and can be naturally integrated into existing relational query plans. As a proof of concept, we developed an extension of the PostgreSQL 8.3.3 query engine called SPROUT. We study optimizations that push or pull our operator or parts thereof past joins. The operator employs static information, such as the query structure and functional dependencies, to decide which constituent aggregations can be evaluated together in one scan and how many scans are needed for the overall confidence computation task. A case study on the TPC-H benchmark reveals that most TPC-H queries obtained by removing aggregations can be evaluated efficiently using our operator. Experimental evaluation on probabilistic TPC-H data shows substantial efficiency improvements when compared to the state of the art.

...read moreread less

Proceedings Article•DOI•

Efficient type-ahead search on relational data: a TASTIER approach

[...]

Guoliang Li¹, Shengyue Ji², Chen Li², Jianhua Feng¹•Institutions (2)

Tsinghua University¹, University of California, Irvine²

29 Jun 2009

TL;DR: A novel approach to keyword search in the relational world, called Tastier, which proposes efficient index structures and algorithms for finding relevant answers on-the-fly by joining tuples in the database and devise a partition-based method to improve query performance.

...read moreread less

Abstract: Existing keyword-search systems in relational databases require users to submit a complete query to compute answers. Often users feel "left in the dark" when they have limited knowledge about the data, and have to use a try-and-see approach for modifying queries and finding answers. In this paper we propose a novel approach to keyword search in the relational world, called Tastier. A Tastier system can bring instant gratification to users by supporting type-ahead search, which finds answers "on the fly" as the user types in query keywords. A main challenge is how to achieve a high interactive speed for large amounts of data in multiple tables, so that a query can be answered efficiently within milliseconds. We propose efficient index structures and algorithms for finding relevant answers on-the-fly by joining tuples in the database. We devise a partition-based method to improve query performance by grouping highly relevant tuples and pruning irrelevant tuples efficiently. We also develop a technique to answer a query efficiently by predicting the highly relevant complete queries for the user. We have conducted a thorough experimental evaluation of the proposed techniques on real data sets to demonstrate the efficiency and practicality of this new search paradigm.

...read moreread less

Proceedings Article•DOI•

Combining keyword search and forms for ad hoc querying of databases

[...]

Eric Chu¹, Akanksha Baid¹, Xiaoyong Chai¹, AnHai Doan¹, Jeffrey F. Naughton¹ - Show less +1 more•Institutions (1)

University of Wisconsin-Madison¹

29 Jun 2009

TL;DR: This paper proposes to take as input a target database and then generate and index a set of query forms offline, to address challenges that arise in form generation, keyword search over forms, and ranking and displaying these forms.

...read moreread less

Abstract: A common criticism of database systems is that they are hard to query for users uncomfortable with a formal query language. To address this problem, form-based interfaces and keyword search have been proposed; while both have benefits, both also have limitations. In this paper, we investigate combining the two with the hopes of creating an approach that provides the best of both. Specifically, we propose to take as input a target database and then generate and index a set of query forms offline. At query time, a user with a question to be answered issues standard keyword search queries; but instead of returning tuples, the system returns forms relevant to the question. The user may then build a structured query with one of these forms and submit it back to the system for evaluation. In this paper, we address challenges that arise in form generation, keyword search over forms, and ranking and displaying these forms. We explore techniques to tackle these challenges, and present experimental results suggesting that the approach of combining keyword search and form-based interfaces is promising.

...read moreread less

Journal Article•DOI•

A scalable, predictable join operator for highly concurrent data warehouses

[...]

George Candea¹, Neoklis Polyzotis², Radek Vingralek¹•Institutions (2)

Aster¹, University of California, Santa Cruz²

01 Aug 2009

TL;DR: This work describes an augmentation of traditional query engines that improves join throughput in large-scale concurrent data warehouses by using an "always-on" pipeline of non-blocking operators, coupled with a controller that continuously examines the current query mix and performs run-time optimizations.

...read moreread less

Abstract: Conventional data warehouses employ the query-at-a-time model, which maps each query to a distinct physical plan. When several queries execute concurrently, this model introduces contention, because the physical plans---unaware of each other---compete for access to the underlying I/O and computation resources. As a result, while modern systems can efficiently optimize and evaluate a single complex data analysis query, their performance suffers significantly when multiple complex queries run at the same time.We describe an augmentation of traditional query engines that improves join throughput in large-scale concurrent data warehouses. In contrast to the conventional query-at-a-time model, our approach employs a single physical plan that can share I/O, computation, and tuple storage across all in-flight join queries. We use an "always-on" pipeline of non-blocking operators, coupled with a controller that continuously examines the current query mix and performs run-time optimizations. Our design allows the query engine to scale gracefully to large data sets, provide predictable execution times, and reduce contention. In our empirical evaluation, we found that our prototype outperforms conventional commercial systems by an order of magnitude for tens to hundreds of concurrent queries.

...read moreread less

Proceedings Article•DOI•

Top-k queries on uncertain data: on score distribution and typical answers

[...]

Tingjian Ge¹, Stan Zdonik¹, Samuel Madden²•Institutions (2)

Brown University¹, Massachusetts Institute of Technology²

29 Jun 2009

TL;DR: The need to present the score distribution of top-k vectors to allow the user to choose between results along this score-probability dimensions is demonstrated and a number of typical vectors that effectively sample this distribution are proposed.

...read moreread less

Abstract: Uncertain data arises in a number of domains, including data integration and sensor networks. Top-k queries that rank results according to some user-defined score are an important tool for exploring large uncertain data sets. As several recent papers have observed, the semantics of top-k queries on uncertain data can be ambiguous due to tradeoffs between reporting high-scoring tuples and tuples with a high probability of being in the resulting data set. In this paper, we demonstrate the need to present the score distribution of top-k vectors to allow the user to choose between results along this score-probability dimensions. One option would be to display the complete distribution of all potential top-k tuple vectors, but this set is too large to compute. Instead, we propose to provide a number of typical vectors that effectively sample this distribution. We propose efficient algorithms to compute these vectors. We also extend the semantics and algorithms to the scenario of score ties, which is not dealt with in the previous work in the area. Our work includes a systematic empirical study on both real dataset and synthetic datasets.

...read moreread less

Patent•

Unified window support for event stream data management

[...]

Jin Zhang, Ying Yan, Ming-Chien Shan

16 Jul 2009

TL;DR: In this paper, a data stream query mediator is configured to provide the translated query to the data stream management system (DSMS) for processing therewith, and a query translator may be configured to translate the query including mapping the range attribute, the synchronization attribute, and the evaluation attribute to a stream processing language of a DSMS to obtain a translated query.

...read moreread less

Abstract: Data stream query mediation may utilize a query handler configured to receive a query from a stream application to be applied against a stream of data including multiple tuples representing events. A stream window manager may be configured to express the query in a specification which defines a window including a subset of the tuples, the specification defining content of the window as a range of the tuples having a range attribute over which the content is specified, defining when to update the window using a synchronization attribute specifying a movement of the window over time with respect to the content, and defining an evaluation of the content of the window using an evaluation attribute specifying when to perform the evaluation. A query translator may be configured to translate the query including mapping the range attribute, the synchronization attribute, and the evaluation attribute to a stream processing language of a data stream management system (DSMS), to thereby obtain a translated query. A DSMS mediator may be configured to provide the translated query to the DSMS for processing therewith.

...read moreread less

Proceedings Article•DOI•

Supporting Database Applications as a Service

[...]

Mei Hui¹, Dawei Jiang¹, Guoliang Li², Yuan Zhou¹•Institutions (2)

National University of Singapore¹, Tsinghua University²

29 Mar 2009

TL;DR: The experimental results show that the proposed approach is a promising multi-tenancy storage and indexing scheme which can be easily integrated into existing DBMS and extended MySQL based on the proposed design and conducted extensive experiments.

...read moreread less

Abstract: Multi-tenant data management is a form of Software as a Service (SaaS), whereby a third party service provider hosts databases as a service and provides its customers with seamless mechanisms to create, store and access their databases at the host site. One of the main problems in such a system, as we shall discuss in this paper, is scalability, namely the ability to serve an increasing number of tenants without too much query performance degradation. A promising way to handle the scalability issue is to consolidate tuples from different tenants into the same shared tables. However, this approach introduces two problems: 1) The shared tables are too sparse. 2)Indexing on shared tables is not effective. To resolve the problems, we propose a multi-tenant database system called M-Store, which provides storage and indexing services for multi-tenants. To improve the scalability of the system, we develop two techniques in M-Store: Bitmap Interpreted Tuple(BIT) and Multi-Separated Index (MSI). BIT is efficient in that it does not store NULLs from unused attributes in the shared tables and MSI provides flexibility since it only indexes each tenant's own data on frequently accessed attributes. We extended MySQL based on our proposed design and conducted extensive experiments. The experimental results show that our proposed approach is a promising multi-tenancy storage and indexing scheme which can be easily integrated into existing DBMS.

...read moreread less

Book Chapter•DOI•

Biochemical Tuple Spaces for Self-organising Coordination

[...]

Mirko Viroli¹, Matteo Casadei¹•Institutions (1)

University of Bologna¹

11 Jun 2009

TL;DR: A mechanism to leverage exact computational modelling of chemical reactions for achieving self-organisation in system coordination by formalising as a process algebra with stochastic semantics is proposed.

...read moreread less

Abstract: Inspired by recent works in computational systems biology and existing literature proposing nature-inspired approaches for the coordination of today complex distributed systems, this paper proposes a mechanism to leverage exact computational modelling of chemical reactions for achieving self-organisation in system coordination. We conceive the notion of biochemical tuple spaces. In this model: a tuple resembles a chemical substance, a notion of activity/pertinency value for tuples is used to model chemical concentration, coordination rules are structured as chemical reactions evolving tuple concentration over time, a tuple space resembles a single-compartment solution, and finally a network of tuple spaces resembles a tissue-like biological system. The proposed model is formalised as a process algebra with stochastic semantics, and several examples are described up to an ecology-inspired scenario of system coordination, which emphasises the self-organisation features of the proposed model.

...read moreread less

Journal Article•DOI•

Semantics and evaluation of top-k queries in probabilistic databases

[...]

Xi Zhang¹, Jan Chomicki¹•Institutions (1)

University at Buffalo¹

01 Aug 2009-Distributed and Parallel Databases

TL;DR: A new semantics, Global-Topk, is introduced that satisfies three intuitive postulates for the semantics of top-k queries in probabilistic databases, and introduces a new semantics that satisfies those postulates to a large degree.

...read moreread less

Abstract: We study here fundamental issues involved in top-k query evaluation in probabilistic databases. We consider simple probabilistic databases in which probabilities are associated with individual tuples, and general probabilistic databases in which, additionally, exclusivity relationships between tuples can be represented. In contrast to other recent research in this area, we do not limit ourselves to injective scoring functions. We formulate three intuitive postulates for the semantics of top-k queries in probabilistic databases, and introduce a new semantics, Global-Topk, that satisfies those postulates to a large degree. We also show how to evaluate queries under the Global-Topk semantics. For simple databases we design dynamic-programming based algorithms. For general databases we show polynomial-time reductions to the simple cases, and provide effective heuristics to speed up the computation in practice. For example, we demonstrate that for a fixed k the time complexity of top-k query evaluation is as low as linear, under the assumption that probabilistic databases are simple and scoring functions are injective.

...read moreread less

Proceedings Article•DOI•

A tensor-based algorithm for high-order graph matching

[...]

Olivier Duchenne¹, Francis Bach¹, In So Kweon², Jean Ponce¹•Institutions (2)

École Normale Supérieure¹, KAIST²

20 Jun 2009

TL;DR: The proposed approach to establishing correspondences between two sets of visual features using higher-order constraints instead of the unary or pairwise ones used in classical methods is compared to state-of-the-art algorithms on both synthetic and real data.

...read moreread less

Abstract: This paper addresses the problem of establishing correspondences between two sets of visual features using higher-order constraints instead of the unary or pairwise ones used in classical methods. Concretely, the corresponding hypergraph matching problem is formulated as the maximization of a multilinear objective function over all permutations of the features. This function is defined by a tensor representing the affinity between feature tuples. It is maximized using a generalization of spectral techniques where a relaxed problem is first solved by a multi-dimensional power method, and the solution is then projected onto the closest assignment matrix. The proposed approach has been implemented, and it is compared to state-of-the-art algorithms on both synthetic and real data.

...read moreread less

Journal Article•DOI•

Representing uncertain data: models, properties, and algorithms

[...]

Anish Das Sarma¹, Omar Benjelloun², Alon Halevy², Shubha U. Nabar³, Jennifer Widom¹ - Show less +1 more•Institutions (3)

Stanford University¹, Google², Microsoft³

01 Oct 2009

TL;DR: It is shown that minimization is intractable in general and study the more restricted problem of maintaining minimality incrementally when performing operations, and several results on the problem of approximating uncertain data in an insufficiently expressive model are presented.

...read moreread less

Abstract: In general terms, an uncertain relation encodes a set of possible certain relations. There are many ways to represent uncertainty, ranging from alternative values for attributes to rich constraint languages. Among the possible models for uncertain data, there is a tension between simple and intuitive models, which tend to be incomplete, and complete models, which tend to be nonintuitive and more complex than necessary for many applications. We present a space of models for representing uncertain data based on a variety of uncertainty constructs and tuple-existence constraints. We explore a number of properties and results for these models. We study completeness of the models, as well as closure under relational operations, and we give results relating closure and completeness. We then examine whether different models guarantee unique representations of uncertain data, and for those models that do not, we provide complexity results and algorithms for testing equivalence of representations. The next problem we consider is that of minimizing the size of representation of models, showing that minimizing the number of tuples also minimizes the size of constraints. We show that minimization is intractable in general and study the more restricted problem of maintaining minimality incrementally when performing operations. Finally, we present several results on the problem of approximating uncertain data in an insufficiently expressive model.

...read moreread less

Patent•

Framework for dynamically generating tuple and page classes

[...]

Hoyong Park¹, Namit Jain¹, Anand Srinivasan¹, Shailendra Mishra¹•Institutions (1)

Business International Corporation¹

02 Mar 2009

TL;DR: In this article, a method for processing a data stream includes receiving a tuple and determining a tuple specification that defines a layout of the tuple, which identifies one or more data types that are included in the tuple.

...read moreread less

Abstract: Techniques for reducing the memory used for processing events received in a data stream are provided. This may be achieved by reducing the memory required for storing tuples. A method for processing a data stream includes receiving a tuple and determining a tuple specification that defines a layout of the tuple. The layout identifies one or more data types that are included in the tuple. A tuple class corresponding to the tuple specification may be determined. A tuple object based on the tuple class is instantiated, and during runtime of the processing system. The tuple object is stored in a memory.

...read moreread less

Patent•

Extracting semantics from data

[...]

Erik Thomsen

15 Sep 2009

TL;DR: The authors convert data from atomic tuples found in data sources such as spreadsheets (e.g., raw numbers, words, and formatted dates) into semantically enriched schemas and associated tuples.

...read moreread less

Abstract: Embodiments of the invention convert data from atomic tuples found in data sources such as spreadsheets (e.g., raw numbers, words, and formatted dates) into semantically enriched schemas and associated tuples. In addition to the data content, visual content, such as font and background color, is also analyzed as a part of the interpretation process. Embodiments of the invention also provide methods of interacting with the raw data via the semantically enriched schema tuples.

...read moreread less

Proceedings Article•DOI•

Ranking distributed probabilistic data

[...]

Feifei Li¹, Ke Yi², Jeffrey Jestes¹•Institutions (2)

Florida State University¹, Hong Kong University of Science and Technology²

29 Jun 2009

TL;DR: This work designs both communication- and computation-efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites with minimum communication cost.

...read moreread less

Abstract: Ranking queries are essential tools to process large amounts of probabilistic data that encode exponentially many possible deterministic instances. In many applications where uncertainty and fuzzy information arise, data are collected from multiple sources in distributed, networked locations, e.g., distributed sensor fields with imprecise measurements, multiple scientific institutes with inconsistency in their scientific data. Due to the network delay and the economic cost associated with communicating large amounts of data over a network, a fundamental problem in these scenarios is to retrieve the global top-k tuples from all distributed sites with minimum communication cost. Using the well founded notion of the expected rank of each tuple across all possible worlds as the basis of ranking, this work designs both communication- and computation-efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites. Extensive experiments using both synthetic and real data sets confirm the efficiency and superiority of our algorithms over the straightforward approach of forwarding all data to the server.

...read moreread less

Proceedings Article•

A distortion free watermark framework for relational databases

[...]

Sukriti Bhattacharya¹, Agostino Cortesi¹•Institutions (1)

Ca' Foscari University of Venice¹

01 Jan 2009

TL;DR: A distortion free invisible watermarking technique for relational databases to build the watermark after partitioning tuples with actual attribute values and getting a watermark as a permutation of tuples in the original table.

...read moreread less

Abstract: In this paper we introduce a distortion free invisible watermarking technique for relational databases. The main idea is to build the watermark after partitioning tuples with actual attribute values. Then, we build hash functions on top of this grouping and get a watermark as a permutation of tuples in the original table. As the ordering of tuples does not affect the original database, this technique is distortion free. Our contribution can be seen as an application to relational databases of software watermarking ideas developed within the Abstract Interpretation framework.

...read moreread less

Book Chapter•DOI•

Symbolic Query Exploration

[...]

Margus Veanes¹, Pavel Grigorenko², Peli de Halleux¹, Nikolai Tillmann¹•Institutions (2)

Microsoft¹, Tallinn University of Technology²

17 Nov 2009

TL;DR: This work describes an application of model generation in the context of the database unit testing framework of Visual Studio and uses the satisfiability modulo theories (SMT) solver Z3 in the concrete implementation.

...read moreread less

Abstract: We study the problem of generating a database and parameters for a given parameterized SQL query satisfying a given test condition We introduce a formal background theory that includes arithmetic, tuples, and sets, and translate the generation problem into a satisfiability or model generation problem modulo the background theory We use the satisfiability modulo theories (SMT) solver Z3 in the concrete implementation We describe an application of model generation in the context of the database unit testing framework of Visual Studio

...read moreread less

Proceedings Article•DOI•

Core schema mappings

[...]

Giansalvatore Mecca, Paolo Papotti¹, Salvatore Raunich•Institutions (1)

Roma Tre University¹

29 Jun 2009

TL;DR: This paper shows how, given a mapping scenario, it is possible to generate an executable script that computes core solutions for the corresponding data exchange problem, and introduces several new algorithms that contribute to bridge the gap between the practice of mapping generation and the theory of data exchange.

...read moreread less

Abstract: Research has investigated mappings among data sources under two perspectives. On one side, there are studies of practical tools for schema mapping generation; these focus on algorithms to generate mappings based on visual specifications provided by users. On the other side, we have theoretical researches about data exchange. These study how to generate a solution - i.e., a target instance - given a set of mappings usually specified as tuple generating dependencies. However, despite the fact that the notion of a core of a data exchange solution has been formally identified as an optimal solution, there are yet no mapping systems that support core computations. In this paper we introduce several new algorithms that contribute to bridge the gap between the practice of mapping generation and the theory of data exchange. We show how, given a mapping scenario, it is possible to generate an executable script that computes core solutions for the corresponding data exchange problem. The algorithms have been implemented and tested using common runtime engines to show that they guarantee very good performances, orders of magnitudes better than those of known algorithms that compute the core as a post-processing step.

...read moreread less

Collapse