
Showing papers by "Jiuyong Li" published in 2009


Journal ArticleDOI
TL;DR: Results on EMT data sets show that the proposed method uncovers many known miRNA targets as well as new potentially promising miRNA-mRNA interactions that could not be achieved by the normal Bayesian network structure learning.
Abstract: microRNAs (miRNAs) regulate target gene expression by controlling their mRNAs post-transcriptionally. Increasing evidence demonstrates that miRNAs play important roles in various biological processes. However, the functions and precise regulatory mechanisms of most miRNAs remain elusive. Current research suggests that miRNA regulatory modules are complicated, including up-, down-, and mix-regulation for different physiological conditions. Previous computational approaches for discovering miRNA-mRNA interactions focus only on down-regulatory modules. In this work, we present a method to capture complex miRNA-mRNA interactions, including all regulatory types between miRNAs and mRNAs, using Bayesian network structure learning with a splitting-averaging strategy. It is designed to explore all possible miRNA-mRNA interactions by integrating miRNA-targeting information, expression profiles of miRNAs and mRNAs, and sample categories. We also present an analysis of data sets for epithelial and mesenchymal transition (EMT). Our results show that the proposed method identified all possible types of miRNA-mRNA interactions from the data. Many interactions are of tremendous biological significance. Some discoveries have been validated by previous research, for example, that the miR-200 family negatively regulates ZEB1 and ZEB2 for EMT. Some are consistent with the literature, such as the wide interactions of LOX with the miR-200 family members for EMT. Furthermore, many novel interactions are statistically significant and worthy of validation in the near future. Results on EMT data sets show that the proposed method uncovers many known miRNA targets as well as new, potentially promising miRNA-mRNA interactions that could not be found by normal Bayesian network structure learning.

70 citations


Journal ArticleDOI
TL;DR: It is proved that mining optimal risk pattern sets conforms to an anti-monotone property that supports an efficient mining algorithm and is useful for exploratory study on large medical data to generate and refine hypotheses.

58 citations


Journal ArticleDOI
TL;DR: A computational method is proposed to discover the functional miRNA-mRNA regulatory modules (FMRMs), that is, groups of miRNAs and their target mRNAs that are believed to participate cooperatively in post-transcriptional gene regulation under specific conditions.

53 citations


Journal ArticleDOI
TL;DR: Novel approaches that combine clustering, association rules and Markov models are presented to improve prediction accuracy and state space complexity.
Abstract: Accurate next web page prediction benefits many applications, e-business in particular. The most widely used techniques for this purpose are Markov models, association rules and clustering. However, each of these techniques has its own limitations, especially when it comes to accuracy and space complexity. This paper improves prediction accuracy and state space complexity by using novel approaches that combine clustering, association rules and Markov models. The three techniques are integrated to maximise their strengths. The integration model has been shown to achieve better prediction accuracy than individual and other integrated models.

43 citations
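
The Markov-model component named in the abstract above can be illustrated with a minimal first-order sketch. This is not the paper's integrated model (it omits the clustering and association-rule stages), and the function names `train_markov` and `predict_next` and the sample sessions are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_markov(sessions):
    """Count page-to-page transitions over sessions (lists of page ids)."""
    transitions = defaultdict(Counter)
    for session in sessions:
        # Pair each page with the page visited immediately after it.
        for current_page, next_page in zip(session, session[1:]):
            transitions[current_page][next_page] += 1
    return transitions


def predict_next(transitions, current_page):
    """Return the most frequently observed successor, or None if unseen."""
    successors = transitions.get(current_page)
    if not successors:
        return None
    return successors.most_common(1)[0][0]


# Hypothetical browsing sessions, for illustration only.
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "reviews"],
    ["home", "products", "cart", "checkout"],
]
model = train_markov(sessions)
print(predict_next(model, "products"))  # cart (seen twice, reviews once)
```

The state space of such a model grows with the number of distinct pages (and much faster for higher-order models), which is the space-complexity concern the abstract raises.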


Journal ArticleDOI
01 Oct 2009
TL;DR: It is proved that the optimal (α, k)-anonymity problem is NP-hard, and two scalable local-recoding algorithms are proposed which are both more scalable and result in less data distortion.
Abstract: Privacy preservation is an important issue in the release of data for mining purposes. The k-anonymity model has been introduced for protecting individual identification. Recent studies show that a more sophisticated model is necessary to protect the association of individuals to sensitive information. In this paper, we propose an (α, k)-anonymity model to protect both identifications and relationships to sensitive information in data. We discuss the properties of the (α, k)-anonymity model. We prove that the optimal (α, k)-anonymity problem is NP-hard. We first present an optimal global-recoding method for the (α, k)-anonymity problem. Next we propose two local-recoding algorithms which are both more scalable and result in less data distortion. The effectiveness and efficiency are shown by experiments. We also describe how the model can be extended to more general cases.

41 citations
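
The abstract above does not spell out the formal definition, so the following sketch assumes the usual reading of (α, k)-anonymity: every group of records sharing the same quasi-identifier values has at least k records, and no sensitive value accounts for more than a fraction α of any group. It only checks a table against that reading and does not implement the recoding algorithms; the function name and sample records are hypothetical.

```python
from collections import Counter, defaultdict


def is_alpha_k_anonymous(records, quasi_identifiers, sensitive_attr, alpha, k):
    """Check the assumed (alpha, k) condition on a list of dict records."""
    # Group sensitive values by the combination of quasi-identifier values.
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[attr] for attr in quasi_identifiers)
        groups[key].append(record[sensitive_attr])

    for sensitive_values in groups.values():
        size = len(sensitive_values)
        if size < k:  # k-anonymity part: group too small
            return False
        most_common_count = Counter(sensitive_values).most_common(1)[0][1]
        if most_common_count / size > alpha:  # alpha part: value too frequent
            return False
    return True


# Hypothetical, already generalized records for illustration only.
records = [
    {"age": "30-39", "zip": "50**", "disease": "flu"},
    {"age": "30-39", "zip": "50**", "disease": "HIV"},
    {"age": "40-49", "zip": "51**", "disease": "flu"},
    {"age": "40-49", "zip": "51**", "disease": "cold"},
]
print(is_alpha_k_anonymous(records, ["age", "zip"], "disease", alpha=0.5, k=2))
```

Lowering α below 0.5 or raising k above 2 makes this sample table fail the check, since each group holds exactly two records with two distinct sensitive values.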


Proceedings ArticleDOI
02 Nov 2009
TL;DR: This paper proposes a much finer-level anonymisation scheme with regard to the data requester's trust value and specific application purpose, and prioritizes the attributes for anonymisation based on how important and critical they are to the specified application purposes.
Abstract: Most existing work on data anonymisation targets the optimization of anonymisation metrics to balance data utility and privacy, but ignores the effects of a requester's trust level and application purposes during data anonymisation. The aim of this paper is to propose a much finer-level anonymisation scheme with regard to the data requester's trust value and specific application purpose. We prioritize the attributes for anonymisation based on how important and critical they are to the specified application purposes, and propose a trust evaluation strategy to quantify the data requester's reliability. We further build a projection between the trust value and the degree of data anonymisation, which determines to what extent the data should be anonymised. A decomposition algorithm is developed to find the desired anonymous solution, which guarantees uniqueness and correctness.

34 citations


01 Jan 2009
TL;DR: Experimental results show that the proposed microaggregation technique is efficient and effective in terms of running time and information loss.
Abstract: Microdata protection is a hot topic in the field of Statistical Disclosure Control, which has gained special interest after the disclosure of 658000 queries by the America Online (AOL) search engine in August 2006. Many algorithms, methods and properties have been proposed to deal with microdata disclosure. One of the emerging concepts in microdata protection is k-anonymity, introduced by Samarati and Sweeney. k-anonymity provides a simple and efficient approach to protect private individual information and is gaining increasing popularity. k-anonymity requires that every record in the released microdata table be indistinguishably related to no fewer than k respondents. In this paper, we apply the concept of entropy to propose a distance metric to evaluate the amount of mutual information among records in microdata, and propose a method of constructing a dependency tree to find the key attributes, which we then use to perform approximate microaggregation. Further, we adopt this new microaggregation technique to study the k-anonymity problem, and an efficient algorithm is developed. Experimental results show that the proposed microaggregation technique is efficient and effective in terms of running time and information loss.

16 citations
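
The abstract above builds its distance metric on entropy and mutual information but does not give the formula. As a hedged illustration, the sketch below computes the standard Shannon mutual information between two attribute columns, I(X;Y) = H(X) + H(Y) - H(X,Y). It is not the paper's metric or its dependency-tree construction, and the column values are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two equal-length columns."""
    joint = list(zip(xs, ys))
    return entropy(xs) + entropy(ys) - entropy(joint)


# Hypothetical attribute columns for illustration only.
education = ["BSc", "BSc", "PhD", "PhD"]
job = ["analyst", "analyst", "researcher", "researcher"]  # determined by education
city = ["Adelaide", "Sydney", "Adelaide", "Sydney"]       # independent of education

print(mutual_information(education, job))   # fully dependent columns
print(mutual_information(education, city))  # independent columns
```

Attribute pairs with high mutual information are natural candidates to link in a dependency structure, which is one plausible way to read the "key attributes" step described in the abstract.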


Proceedings ArticleDOI
21 Oct 2009
TL;DR: A novel permutation-based approach called anatomy is proposed to release the quasi-identifier and sensitive values directly in two separate tables, which protects privacy but still captures a large amount of correlation in the microdata.
Abstract: Privacy-preserving data publishing aims to protect sensitive information of individuals in published data while minimizing the distortion ratio of the data. One well-studied approach is the k-anonymity model. Recently, several authors have recognized that k-anonymity cannot prevent attribute disclosure. To address this privacy threat, one solution would be to employ p-sensitive k-anonymity, a novel paradigm in relational data privacy, which prevents sensitive attribute disclosure. p-sensitive k-anonymity partitions the data into groups of records such that each group has at least p distinct sensitive values. Existing approaches for achieving p-sensitive k-anonymity are mostly generalization-based. In this paper, we propose a novel permutation-based approach called anatomy to release the quasi-identifier and sensitive values directly in two separate tables. Combined with a grouping mechanism, this approach not only protects privacy, but also captures a large amount of correlation in the microdata. We develop a top-down algorithm for computing anatomized tables that obey the p-sensitive k-anonymity privacy requirement and minimize the error of reconstructing the microdata. Extensive experiments confirm that anatomy allows significantly more effective data analysis than the conventional publication methods based on generalization.

14 citations
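
The two-table release described in the abstract above can be sketched as follows. The sketch assumes the grouping is already given (the paper's top-down grouping algorithm is not reproduced here) and that p-sensitive k-anonymity means each group has at least k records and at least p distinct sensitive values. Function names, the `group` and `count` column names, and the sample records are hypothetical.

```python
from collections import Counter


def satisfies_p_sensitive_k(groups, sensitive_attr, p, k):
    """Each group needs >= k records and >= p distinct sensitive values."""
    return all(
        len(group) >= k and len({r[sensitive_attr] for r in group}) >= p
        for group in groups
    )


def anatomize(groups, quasi_identifiers, sensitive_attr):
    """Split grouped records into a quasi-identifier table and a sensitive table."""
    qi_table, sensitive_table = [], []
    for group_id, group in enumerate(groups, start=1):
        for record in group:
            # Quasi-identifier values are kept exact, linked only by group id.
            row = {attr: record[attr] for attr in quasi_identifiers}
            row["group"] = group_id
            qi_table.append(row)
        # Sensitive values are published per group with their counts.
        counts = Counter(r[sensitive_attr] for r in group)
        for value, count in sorted(counts.items()):
            sensitive_table.append(
                {"group": group_id, sensitive_attr: value, "count": count}
            )
    return qi_table, sensitive_table


# Hypothetical pre-grouped microdata for illustration only.
groups = [
    [{"age": 23, "zip": "5000", "disease": "flu"},
     {"age": 27, "zip": "5001", "disease": "asthma"}],
    [{"age": 41, "zip": "5100", "disease": "flu"},
     {"age": 45, "zip": "5102", "disease": "diabetes"},
     {"age": 48, "zip": "5103", "disease": "flu"}],
]
qi_table, sensitive_table = anatomize(groups, ["age", "zip"], "disease")
```

Because the quasi-identifier values are released without generalization, aggregate queries over them stay exact; only the link between an individual row and its sensitive value is blurred to the group level.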


Book ChapterDOI
TL;DR: This paper shows that the criteria for identifying redundant useful embeddings given in a previous work are neither sufficient nor necessary, and presents a heuristic algorithm for removing redundant useful embeddings.
Abstract: Contained rewriting and maximal contained rewriting of tree pattern queries using views have been studied recently for the class of tree patterns involving /, //, and []. Given query Q and view V, it has been shown that a contained rewriting of Q using V can be obtained by finding a useful embedding of Q in V. However, for the same Q and V, there may be many useful embeddings and thus many contained rewritings. Some of the useful embeddings may be redundant in that the rewritings obtained from them are contained in those obtained from other useful embeddings. Redundant useful embeddings are useless and waste computational resources. Thus it becomes important to identify and remove them. In this paper, we show that the criteria for identifying redundant useful embeddings given in a previous work are neither sufficient nor necessary. We then present some useful observations on the containment relationship of rewritings and, based on these, a heuristic algorithm for removing redundant useful embeddings. We demonstrate the efficiency of our algorithm using examples.

6 citations


Proceedings Article
01 Dec 2009
TL;DR: This research paper shows how k-anonymised medical and genomic data is subject to the genotype-phenotype attack.
Abstract: Personal data of patients is collected on a large scale at hospitals, clinics, labs, etc. This data consists of medical and genomic records. Such patient data is shared for various health and research purposes. The utility of such sharing is worthwhile and its benefits are now well documented. They include early diagnosis of diseases such as phenylketonuria, which can raise the chances of recovery. Population health analysis, derived from collaborative sharing of patient data, helps government agencies draft proper policies to raise people's standard of living. On the other hand, many patients fear the misuse of their personal data. This fear has created a social (and sometimes legal) requirement to properly safeguard personal data before sharing. Various generalization techniques have been suggested to anonymize both types of patient data, i.e. medical and genomic. k-anonymity, a generalization-based privacy protection technique, is considered one of the important practices for anonymizing patient data. Due to rapid technological advancements, the medical and genomic data of the same patient(s) may be publicly available from different sources. Such a scenario has created new privacy threats to patient data. The genotype-phenotype attack is one of these threats. This research paper shows how k-anonymised medical and genomic data is subject to the genotype-phenotype attack.

4 citations


01 Jan 2009
TL;DR: In this article, a diversified multiple decision tree (DMDT) is proposed to handle the high dimensionality problem of microarray data classification, which is used primarily to predict unseen data using a model built on categorized existing Microarray data.
Abstract: Microarray data classification is used primarily to predict unseen data using a model built on categorized existing microarray data. One of the major challenges is that microarray data contains a large number of genes with a small number of samples. This high dimensionality problem has prevented many existing classification methods from directly dealing with this type of data. Moreover, the small number of samples increases the overfitting problem of classification, leading to lower classification accuracy. Another major challenge is the uncertainty of microarray data quality. Microarray data contains various levels of noise, quite often high levels, and these data lead to unreliable and low-accuracy analysis in addition to the high dimensionality problem. Most current classification methods are not robust enough to handle this type of data properly. Our research focuses on accuracy and on noise resistance, or robustness. Our approach is to design a robust classification method for microarray data classification. An algorithm, called diversified multiple decision trees (DMDT) is pro-

Proceedings Article
01 Dec 2009
TL;DR: A large number of small molecule structures, obtained from publicly available databases, was used to generate a set of molecular descriptors that can be used with machine learning to predict drug activity.
Abstract: The ability to predict drug activity from molecular structure is an important field of research both in academia and in the pharmaceutical industry. Raw 3D structure data is not in a form suitable for identifying properties using machine learning, so it must be reconfigured into descriptor sets that continue to encapsulate important structural properties of the molecule. In this study, a large number of small molecule structures, obtained from publicly available databases, was used to generate a set of molecular descriptors that can be used with machine learning to predict drug activity. The descriptors were for the most part simple graph strings representing chains of connected atoms. With a dataset of just over one million molecules averaging seventy atoms each, this produced a very large set of simple graph strings with lengths of two to twelve atoms. Duplicates and reverse strings were eliminated and feature reduction techniques were applied to reduce the path count to about three thousand, which was viable for machine learning. Training data from twenty-six data sets was used to build decision tree classifiers using J48 and Random Forest. Forty-three thousand molecules from the NCI HIV dataset were used with the descriptor set to generate decision tree models with good accuracy. A similar algorithm was used to extract ring structures in the molecules. Inclusion of thirteen ring structure descriptors increased the accuracy of prediction.
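
The "simple graph strings" with duplicate and reverse-string elimination described in the abstract above can be sketched as a path enumeration over a molecular graph. The abstract does not specify the string encoding, so this sketch assumes element symbols joined by hyphens and ignores bond types; the function name `path_descriptors`, its parameters, and the example molecule are hypothetical.

```python
def path_descriptors(atoms, bonds, min_len=2, max_len=4):
    """Enumerate chains of connected atoms as canonical strings.

    atoms: list of element symbols, indexed by atom number.
    bonds: list of (i, j) index pairs; bond order is ignored here.
    """
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    found = set()

    def extend(path):
        if len(path) >= min_len:
            symbols = tuple(atoms[i] for i in path)
            # A chain and its reverse describe the same structure, so keep
            # only the lexicographically smaller of the two orderings.
            found.add(min(symbols, symbols[::-1]))
        if len(path) == max_len:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:  # simple paths only, no revisited atoms
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return sorted("-".join(symbols) for symbols in found)


# Heavy atoms of ethanol (C-C-O), hydrogens omitted, for illustration only.
ethanol = path_descriptors(["C", "C", "O"], [(0, 1), (1, 2)])
print(ethanol)  # ['C-C', 'C-C-O', 'C-O']
```

Reversing the tuple of symbols rather than the joined string keeps two-letter elements such as Cl intact. Each distinct string can then serve as a binary or count feature for a decision tree learner.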