Showing papers by "Gao Cong" published in 2007

Open Access

Proceedings Article•

Improving data quality: consistency and accuracy

Gao Cong¹, Wenfei Fan², Floris Geerts³, Xibei Jia², Shuai Ma²•Institutions (3)

Microsoft¹, University of Edinburgh², Transnational University Limburg³

23 Sep 2007

TL;DR: This paper proposes two algorithms: one for automatically computing a repair D' that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database.

Abstract: Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D' that satisfies the constraints and "minimally" differs from D. Equally important is to ensure that the automatically-generated repair D' is accurate, or makes sense, i.e., D' differs from the "correct" data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy. We employ a class of conditional functional dependencies (CFDs) proposed in [6] to specify the consistency of the data, which are able to capture inconsistencies and errors beyond what their traditional counterparts can catch. To improve the consistency of the data, we propose two algorithms: one for automatically computing a repair D' that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database. We show that both problems are intractable. Although our algorithms are necessarily heuristic, we experimentally verify that the methods are effective and efficient. Moreover, we develop a statistical method that guarantees that the repairs found by the algorithms are accurate above a predefined rate without incurring excessive user interaction.
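
The abstract uses CFDs to specify consistency. As a rough illustration only, the sketch below detects CFD violations; it does not implement the paper's repair algorithms. It assumes a CFD is written as an embedded functional dependency plus a single pattern row, where `_` is a wildcard. The attribute names (`CC`, `AC`, `ZIP`, `STR`, `CITY`), the sample rows, and the function names are all hypothetical.

```python
from itertools import combinations

WILDCARD = "_"  # assumed notation for an unconstrained pattern cell


def matches(value, pattern_cell):
    """A value matches a pattern cell if the cell is a wildcard or an equal constant."""
    return pattern_cell == WILDCARD or value == pattern_cell


def cfd_violations(rows, lhs, rhs, pattern):
    """Return violations of the CFD (lhs -> rhs, pattern) as tuples of row indices.

    A 1-tuple means a single row contradicts a constant in the pattern's RHS cell;
    a 2-tuple means two in-scope rows agree on lhs but differ on rhs.
    """
    # The CFD applies only to rows whose LHS values match the pattern.
    scoped = [i for i, r in enumerate(rows)
              if all(matches(r[a], pattern[a]) for a in lhs)]
    violations = []
    for i in scoped:
        if not matches(rows[i][rhs], pattern[rhs]):
            violations.append((i,))
    for i, j in combinations(scoped, 2):
        same_lhs = all(rows[i][a] == rows[j][a] for a in lhs)
        if same_lhs and rows[i][rhs] != rows[j][rhs]:
            violations.append((i, j))
    return violations


# Hypothetical customer data, invented for this example.
rows = [
    {"CC": "44", "AC": "131", "ZIP": "EH4 1DT", "STR": "Mayfield", "CITY": "EDI"},
    {"CC": "44", "AC": "131", "ZIP": "EH4 1DT", "STR": "Crichton", "CITY": "EDI"},
    {"CC": "01", "AC": "908", "ZIP": "07974", "STR": "Tree Ave.", "CITY": "NYC"},
    {"CC": "01", "AC": "908", "ZIP": "07974", "STR": "Elm Str.", "CITY": "MH"},
]

# "Where CC is 44, ZIP determines STR": holds conditionally, not for all rows.
zip_cfd = cfd_violations(rows, ["CC", "ZIP"], "STR",
                         {"CC": "44", "ZIP": "_", "STR": "_"})

# "Where CC is 01 and AC is 908, CITY must be MH": a constant on the RHS.
city_cfd = cfd_violations(rows, ["CC", "AC"], "CITY",
                          {"CC": "01", "AC": "908", "CITY": "MH"})

print(zip_cfd)   # [(0, 1)]
print(city_cfd)  # [(2,), (2, 3)]
```

The first CFD ignores the two `CC = 01` rows even though they share a ZIP and differ on street. A traditional functional dependency could not express this kind of conditional scope.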

354 citations

Proceedings Article•

Detecting Erroneous Sentences using Automatically Mined Sequential Patterns

Guihua Sun, Xiaohua Liu, Gao Cong, Ming Zhou, Zhongyang Xiong, John Lee, Chin-Yew Lin

01 Jun 2007

TL;DR: A new approach to detecting erroneous sentences by integrating pattern discovery with supervised learning models is proposed, and experimental results show that the techniques are promising.

Abstract: This paper studies the problem of identifying erroneous/correct sentences. The problem has important applications, e.g., providing feedback for writers of English as a Second Language, controlling the quality of parallel bilingual sentences mined from the Web, and evaluating machine translation results. In this paper, we propose a new approach to detecting erroneous sentences by integrating pattern discovery with supervised learning models. Experimental results show that our techniques are promising.

61 citations

Proceedings Article•DOI•

Distributed query evaluation with performance guarantees

Gao Cong¹, Wenfei Fan², Anastasios Kementsietsidis²•Institutions (2)

Microsoft¹, University of Edinburgh²

11 Jun 2007

TL;DR: This paper provides evaluation algorithms and optimizations for generic XPath queries in the same distributed and fragmented setting that explore parallelism and retain the performance guarantees of their counterpart for Boolean queries, regardless of how the tree is fragmented and distributed.

Abstract: Partial evaluation has recently proven an effective technique for evaluating Boolean XPath queries over a fragmented tree that is distributed over a number of sites. What is left open is whether or not the technique is applicable to generic data-selecting XPath queries. In contrast to Boolean queries that return a single truth value, a generic XPath query returns a set of elements, and its evaluation introduces difficulties in avoiding excessive data shipping. This paper settles this question in the positive by providing evaluation algorithms and optimizations for generic XPath queries in the same distributed and fragmented setting. These algorithms explore parallelism and retain the performance guarantees of their counterpart for Boolean queries, regardless of how the tree is fragmented and distributed. First, each site is visited at most three times, and down to at most twice when optimizations are in place. Second, the network traffic is determined by the final answer of the query, rather than the size of the tree, without incurring unnecessary data shipping. Third, the total computation is comparable to that of centralized algorithms on the tree stored in a single site. We show both analytically and experimentally that our algorithms and optimizations are scalable and efficient on large trees and complex XPath queries.

59 citations

Proceedings Article•DOI•

Querying XML with update syntax

Wenfei Fan¹, Gao Cong², Philip Bohannon³•Institutions (3)

University of Edinburgh¹, Microsoft², Yahoo!³

11 Jun 2007

TL;DR: This paper provides automaton-based techniques for efficiently evaluating transform queries and for computing their compositions with user queries in standard XQuery and presents experimental results comparing the efficiency of the evaluation and composition algorithms for transform queries.

Abstract: This paper investigates a class of transform queries proposed by XQuery Update [6]. A transform query is defined in terms of XML update syntax. When posed on an XML tree T, it returns another XML tree that would be produced by executing its embedded update on T, without destructive impact on T. Transform queries support a variety of applications including XML hypothetical queries, the simulation of updates on virtual views, and the enforcement of XML access control. In light of the wide range of applications for transform queries, we develop automaton-based techniques for efficiently evaluating transform queries and for computing their compositions with user queries in standard XQuery. We provide (a) three algorithms to implement transform queries without change to existing XQuery processors, (b) a linear-time algorithm, based on a seamless integration of automaton execution and SAX parsing, to evaluate transform queries on large XML documents that are difficult to handle by existing XQuery engines, and (c) an algorithm to rewrite the composition of user queries and transform queries into a single efficient query in standard XQuery. We also present experimental results comparing the efficiency of our evaluation and composition algorithms for transform queries.
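
To make the non-destructive semantics of a transform query concrete, here is a naive sketch: copy the tree, apply the update to the copy, and return the copy. This is not the automaton-based or SAX-based technique the abstract describes, only the behaviour those techniques must reproduce. The function name `transform_delete`, the choice of a delete-by-tag update, and the sample document are assumptions made for illustration.

```python
import copy
import xml.etree.ElementTree as ET


def transform_delete(root, tag):
    """Return a new tree equal to `root` with every `tag` element deleted.

    The input tree is left untouched, as transform-query semantics require.
    """
    result = copy.deepcopy(root)
    # Snapshot the elements first so that removal does not disturb iteration.
    for parent in list(result.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return result


# Hypothetical document, invented for this example.
doc = ET.fromstring(
    "<db>"
    "<part><name>kbd</name><price>12</price></part>"
    "<part><name>mouse</name><price>7</price></part>"
    "</db>"
)

# A hypothetical access-control view that hides prices.
view = transform_delete(doc, "price")

print(ET.tostring(view, encoding="unicode"))
# <db><part><name>kbd</name></part><part><name>mouse</name></part></db>
print(len(doc.findall(".//price")))  # 2: the original tree is unchanged
```

Copying the whole tree costs time and memory proportional to the document size on every query, whether or not the update touches much of it. That cost is one reason to evaluate transform queries without materializing a full copy.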

20 citations

Proceedings Article•

Mining sequential patterns and tree patterns to detect erroneous sentences

Guihua Sun¹, Gao Cong², Xiaohua Liu², Chin-Yew Lin², Ming Zhou²•Institutions (2)

Chongqing University¹, Microsoft²

22 Jul 2007

TL;DR: A novel approach to identifying erroneous sentences is proposed: labeled tree patterns and sequential patterns are first mined to characterize both erroneous and correct sentences, and the discovered patterns are then utilized in two ways to distinguish correct sentences from erroneous ones.

Abstract: An important application area of detecting erroneous sentences is to provide feedback for writers of English as a Second Language. This problem is difficult since both erroneous and correct sentences are diversified. In this paper, we propose a novel approach to identifying erroneous sentences. We first mine labeled tree patterns and sequential patterns to characterize both erroneous and correct sentences. Then the discovered patterns are utilized in two ways to distinguish correct sentences from erroneous sentences: (1) the patterns are transformed into sentence features for existing classification models, e.g., SVM; (2) the patterns are used to build a rule-based classification model. Experimental results show that both techniques are promising while the second technique outperforms the first approach. Moreover, the classification model in the second proposal is easy to understand, and we can provide intuitive explanations for classification results.
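
The two uses of patterns described in the abstract can be sketched as follows. The mining step is omitted, and the patterns below are hand-written and hypothetical, not mined from any corpus. The sketch also assumes a sequential pattern matches a sentence when its tokens appear in order with gaps allowed, which may differ from the paper's exact definition. Tree patterns are not covered.

```python
def contains_pattern(tokens, pattern):
    """True if `pattern` occurs in `tokens` as an in-order, possibly gapped subsequence."""
    remaining = iter(tokens)
    # `tok in remaining` advances the iterator past the first match, preserving order.
    return all(tok in remaining for tok in pattern)


def pattern_features(tokens, patterns):
    """Use (1): one binary feature per pattern, usable as input to a classifier."""
    return [1 if contains_pattern(tokens, p) else 0 for p in patterns]


def rule_classify(tokens, error_patterns):
    """Use (2): a toy rule-based model that also reports the pattern that fired."""
    for p in error_patterns:
        if contains_pattern(tokens, p):
            return "erroneous", p
    return "correct", None


# Hypothetical patterns assumed to characterize erroneous sentences.
error_patterns = [("this", "books"), ("more", "better")]

bad = "this is more better".split()
good = "this is better".split()

print(pattern_features(bad, error_patterns))   # [0, 1]
print(rule_classify(bad, error_patterns))      # ('erroneous', ('more', 'better'))
print(rule_classify(good, error_patterns))     # ('correct', None)
```

Returning the pattern that fired gives a simple form of the explanation the abstract mentions for rule-based results.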

19 citations

Proceedings Article•DOI•

Updating Recursive XML Views of Relations

Byron Choi¹, Gao Cong², Wenfei Fan³, Stratis D. Viglas⁴•Institutions (4)

Nanyang Technological University¹, Microsoft², Bell Labs³, University of Edinburgh⁴

15 Apr 2007

TL;DR: This paper proposes a mild condition on SPJ views, and shows that under this condition the analysis of deletions on relational views becomes PTIME while the insertion analysis is NP-complete, and presents efficient algorithms to translate XML updates to relational view updates.

Abstract: This paper investigates the view update problem for XML views published from relational data. We consider (possibly) recursively defined XML views, compressed into DAGs and stored in relations. We provide new techniques to efficiently support XML view updates specified in terms of XPath expressions with recursion and complex filters. The interaction between XPath recursion and DAG compression of XML views makes the analysis of XML view updates intriguing. Furthermore, many issues are still open even for relational view updates, and need to be explored. In response to these, we revise the update semantics to accommodate XML side effects based on the semantics of XML views, and present efficient algorithms to translate XML updates to relational view updates. Moreover, we propose a mild condition on SPJ views, and show that under this condition the analysis of deletions on relational views becomes PTIME while the insertion analysis is NP-complete. Finally, we present an experimental study to verify the effectiveness of our techniques.

17 citations

Book Chapter•DOI•

Query and update through XML views

Gao Cong¹•Institutions (1)

Microsoft¹

17 Oct 2007

TL;DR: The work on building XML views defined in terms of update syntax and updating XML views of relations is outlined, and some related work is discussed.

Abstract: XML has become a standard medium for data exchange, and XML views are frequently used as an interface to relational databases and XML data. There have been a considerable number of studies on building and querying XML views, while update-related topics for XML views have not received much attention. In this paper, we outline our work on building XML views defined in terms of update syntax and updating XML views of relations, and discuss some related work.

11 citations