Showing papers by "Gao Cong" published in 2007


Proceedings Article
23 Sep 2007
TL;DR: This paper proposes two algorithms: one for automatically computing a repair D' that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database.
Abstract: Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D' that satisfies the constraints and "minimally" differs from D. Equally important is to ensure that the automatically-generated repair D' is accurate, or makes sense, i.e., D' differs from the "correct" data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy. We employ a class of conditional functional dependencies (CFDs) proposed in [6] to specify the consistency of the data, which are able to capture inconsistencies and errors beyond what their traditional counterparts can catch. To improve the consistency of the data, we propose two algorithms: one for automatically computing a repair D' that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database. We show that both problems are intractable. Although our algorithms are necessarily heuristic, we experimentally verify that the methods are effective and efficient. Moreover, we develop a statistical method that guarantees that the repairs found by the algorithms are accurate above a predefined rate without incurring excessive user interaction.

354 citations
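
To make the notion of a CFD violation in the abstract above concrete, here is a minimal sketch of violation detection only. It is not the repair algorithm from the paper. The relation, the attribute names (`cc`, `zip`, `street`, `city`), the sample rows, and the function names are all hypothetical; a CFD is assumed to be a functional dependency `lhs -> rhs` plus a pattern of constants and wildcards (`_`).

```python
from collections import defaultdict

WILDCARD = "_"

def matches(row, pattern, attrs):
    """True if the row agrees with every constant the pattern sets on attrs."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)

def cfd_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows violating the CFD (lhs -> rhs, pattern)."""
    violating = set()
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if not matches(row, pattern, lhs):
            continue  # the CFD only constrains rows matching its pattern
        # Constant right-hand side: a single row can violate on its own.
        if pattern[rhs] != WILDCARD and row[rhs] != pattern[rhs]:
            violating.add(i)
        groups[tuple(row[a] for a in lhs)].append(i)
    # Rows that agree on lhs must also agree on rhs (the ordinary FD check).
    for members in groups.values():
        if len({rows[i][rhs] for i in members}) > 1:
            violating.update(members)
    return sorted(violating)

# Hypothetical customer data.
rows = [
    {"cc": "44", "zip": "EH4 1DT", "street": "Mayfield", "city": "EDI"},
    {"cc": "44", "zip": "EH4 1DT", "street": "Crichton", "city": "EDI"},
    {"cc": "01", "zip": "07974", "street": "Mtn Ave", "city": "NYC"},
    {"cc": "01", "zip": "07974", "street": "Main St", "city": "NYC"},
]

# "Within country code 44, zip determines street."
uk_zip = cfd_violations(rows, ["cc", "zip"], "street",
                        {"cc": "44", "zip": WILDCARD, "street": WILDCARD})
# "Country code 01 implies city MH" (a constant pattern).
us_city = cfd_violations(rows, ["cc"], "city", {"cc": "01", "city": "MH"})
print(uk_zip, us_city)
```

The first dependency would not hold as a traditional functional dependency over the whole table (the two `01` rows share a zip but not a street), which illustrates how a CFD restricts a constraint to the subset of tuples matching its pattern.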


Proceedings Article
01 Jun 2007
TL;DR: A new approach to detecting erroneous sentences by integrating pattern discovery with supervised learning models is proposed, and experimental results show that the techniques are promising.
Abstract: This paper studies the problem of identifying erroneous/correct sentences. The problem has important applications, e.g., providing feedback for writers of English as a Second Language, controlling the quality of parallel bilingual sentences mined from the Web, and evaluating machine translation results. In this paper, we propose a new approach to detecting erroneous sentences by integrating pattern discovery with supervised learning models. Experimental results show that our techniques are promising.

61 citations


Proceedings Article
11 Jun 2007
TL;DR: This paper provides evaluation algorithms and optimizations for generic XPath queries in the same distributed and fragmented setting that explore parallelism and retain the performance guarantees of their counterpart for Boolean queries, regardless of how the tree is fragmented and distributed.
Abstract: Partial evaluation has recently proven an effective technique for evaluating Boolean XPath queries over a fragmented tree that is distributed over a number of sites. What was left open is whether or not the technique is applicable to generic data-selecting XPath queries. In contrast to Boolean queries that return a single truth value, a generic XPath query returns a set of elements, and its evaluation makes it difficult to avoid excessive data shipping. This paper settles this question in the positive by providing evaluation algorithms and optimizations for generic XPath queries in the same distributed and fragmented setting. These algorithms explore parallelism and retain the performance guarantees of their counterpart for Boolean queries, regardless of how the tree is fragmented and distributed. First, each site is visited at most three times, and down to at most twice when optimizations are in place. Second, the network traffic is determined by the final answer of the query, rather than the size of the tree, without incurring unnecessary data shipping. Third, the total computation is comparable to that of centralized algorithms on the tree stored in a single site. We show both analytically and experimentally that our algorithms and optimizations are scalable and efficient on large trees and complex XPath queries.

59 citations


Proceedings Article
11 Jun 2007
TL;DR: This paper provides automaton-based techniques for efficiently evaluating transform queries and for computing their compositions with user queries in standard XQuery and presents experimental results comparing the efficiency of the evaluation and composition algorithms for transform queries.
Abstract: This paper investigates a class of transform queries proposed by XQuery Update [6]. A transform query is defined in terms of XML update syntax. When posed on an XML tree T, it returns another XML tree that would be produced by executing its embedded update on T, without destructive impact on T. Transform queries support a variety of applications including XML hypothetical queries, the simulation of updates on virtual views, and the enforcement of XML access control. In light of the wide range of applications for transform queries, we develop automaton-based techniques for efficiently evaluating transform queries and for computing their compositions with user queries in standard XQuery. We provide (a) three algorithms to implement transform queries without change to existing XQuery processors, (b) a linear-time algorithm, based on a seamless integration of automaton execution and SAX parsing, to evaluate transform queries on large XML documents that are difficult to handle by existing XQuery engines, and (c) an algorithm to rewrite the composition of user queries and transform queries into a single efficient query in standard XQuery. We also present experimental results comparing the efficiency of our evaluation and composition algorithms for transform queries.

20 citations
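
The semantics of a transform query described in the abstract above (return the tree that an update would produce, leaving T untouched) can be sketched naively as copy-then-update. This is only an illustration of the semantics, assuming a simple delete update; it is not the automaton-based or SAX-based evaluation the paper develops, and the function name, paths, and sample document are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

def transform_delete(root, parent_path, child_tag):
    """Return a new tree equal to `root` with every `child_tag` child
    removed from the elements selected by `parent_path`.
    The input tree is never modified."""
    result = copy.deepcopy(root)  # work on a copy: no destructive impact on T
    for parent in result.findall(parent_path):
        # Materialize the list first so removal does not disturb iteration.
        for child in list(parent.findall(child_tag)):
            parent.remove(child)
    return result

# Hypothetical document: hide prices, as an access-control view might.
doc = ET.fromstring(
    "<db>"
    "<part><name>keyboard</name><price>10</price></part>"
    "<part><name>mouse</name><price>5</price></part>"
    "</db>"
)
view = transform_delete(doc, "part", "price")
print(ET.tostring(view, encoding="unicode"))
```

Copying the whole tree costs time and memory proportional to the document size for every query, which is one reason a streaming or rewriting approach is attractive on large documents.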


Proceedings Article
Guihua Sun, Gao Cong, Xiaohua Liu, Chin-Yew Lin, Ming Zhou
22 Jul 2007
TL;DR: A novel approach to identifying erroneous sentences is proposed: labeled tree patterns and sequential patterns are first mined to characterize both erroneous and correct sentences, and the discovered patterns are then used in two ways to distinguish correct sentences from erroneous ones.
Abstract: An important application area of detecting erroneous sentences is to provide feedback for writers of English as a Second Language. This problem is difficult since both erroneous and correct sentences are diverse. In this paper, we propose a novel approach to identifying erroneous sentences. We first mine labeled tree patterns and sequential patterns to characterize both erroneous and correct sentences. Then the discovered patterns are utilized in two ways to distinguish correct sentences from erroneous sentences: (1) the patterns are transformed into sentence features for existing classification models, e.g., SVM; (2) the patterns are used to build a rule-based classification model. Experimental results show that both techniques are promising, while the second technique outperforms the first. Moreover, the classification model in the second proposal is easy to understand, and we can provide an intuitive explanation for classification results.

19 citations
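
The two uses of mined patterns named in the abstract above can be sketched as follows, for sequential patterns only. Everything here is an assumption for illustration: the patterns, their labels and confidence scores, the rule of taking the highest-confidence matching pattern, and the representation of a sentence as `(word, POS tag)` pairs. The mining step itself is not shown.

```python
def item_matches(item, token):
    """A pattern item matches a token by its word or by its POS tag."""
    word, pos = token
    return item == word or item == pos

def contains(pattern, sentence):
    """True if the pattern occurs in the sentence as a subsequence (gaps allowed)."""
    k = 0
    for token in sentence:
        if k < len(pattern) and item_matches(pattern[k], token):
            k += 1
    return k == len(pattern)

def to_features(sentence, labeled_patterns):
    """Use (1): one binary feature per pattern, for a classifier such as an SVM."""
    return [1 if contains(p, sentence) else 0 for p, _, _ in labeled_patterns]

def classify(sentence, labeled_patterns, default="correct"):
    """Use (2): a rule-based model; the strongest matching pattern decides."""
    hits = [(conf, label) for p, label, conf in labeled_patterns
            if contains(p, sentence)]
    return max(hits)[1] if hits else default

# Hypothetical labeled patterns: (pattern, label, confidence).
labeled_patterns = [
    (("this", "NNS"), "error", 0.9),    # singular determiner + plural noun
    (("DT", "NN"), "correct", 0.6),
]

bad = [("this", "DT"), ("books", "NNS"), ("are", "VBP"), ("good", "JJ")]
good = [("this", "DT"), ("book", "NN"), ("is", "VBZ"), ("good", "JJ")]

print(to_features(bad, labeled_patterns), classify(bad, labeled_patterns))
print(to_features(good, labeled_patterns), classify(good, labeled_patterns))
```

In the rule-based variant, the matching pattern itself serves as the explanation for a prediction, which is the interpretability advantage the abstract mentions.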


Proceedings Article
15 Apr 2007
TL;DR: This paper proposes a mild condition on SPJ views, and shows that under this condition the analysis of deletions on relational views becomes PTIME while the insertion analysis is NP-complete, and presents efficient algorithms to translate XML updates to relational view updates.
Abstract: This paper investigates the view update problem for XML views published from relational data. We consider (possibly) recursively defined XML views, compressed into DAGs and stored in relations. We provide new techniques to efficiently support XML view updates specified in terms of XPath expressions with recursion and complex filters. The interaction between XPath recursion and DAG compression of XML views makes the analysis of XML view updates intriguing. Furthermore, many issues are still open even for relational view updates, and need to be explored. In response to these, we revise the update semantics to accommodate XML side effects based on the semantics of XML views, and present efficient algorithms to translate XML updates to relational view updates. Moreover, we propose a mild condition on SPJ views, and show that under this condition the analysis of deletions on relational views becomes PTIME while the insertion analysis is NP-complete. Finally, we present an experimental study to verify the effectiveness of our techniques.

17 citations


Book Chapter
Gao Cong
17 Oct 2007
TL;DR: The work on building XML views defined in terms of update syntax and on updating XML views of relations is outlined, and some related work is discussed.
Abstract: XML has become a standard medium for data exchange, and XML views are frequently used as an interface to relational databases and XML data. There have been a considerable number of studies on building and querying XML views, while update-related topics for XML views have not received much attention. In this paper, we outline our work on building XML views defined in terms of update syntax and updating XML views of relations, and discuss some related work.

11 citations