
Showing papers by "Qiang Yang" published in 2010


Journal ArticleDOI
TL;DR: The relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift are discussed.
Abstract: A major assumption in many machine learning and data mining algorithms is that the training and future data must be in the same feature space and have the same distribution. However, in many real-world applications, this assumption may not hold. For example, we sometimes have a classification task in one domain of interest, but we only have sufficient training data in another domain of interest, where the latter data may be in a different feature space or follow a different data distribution. In such cases, knowledge transfer, if done successfully, would greatly improve the performance of learning by avoiding much of the expensive data-labeling effort. In recent years, transfer learning has emerged as a new learning framework to address this problem. This survey focuses on categorizing and reviewing the current progress on transfer learning for classification, regression, and clustering problems. In this survey, we discuss the relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift. We also explore some potential future issues in transfer learning research.

18,616 citations
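
To make the covariate-shift setting discussed in the survey concrete, here is a minimal, hypothetical sketch (not an algorithm from the survey itself) of instance reweighting: a domain classifier is trained to separate source from target inputs, and its predicted odds are used as importance weights when fitting the source-domain model. All data, names, and parameters below are illustrative.

```python
# Minimal covariate-shift illustration (hypothetical, not from the survey):
# weight source instances by an estimated density ratio p_target(x) / p_source(x).
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_src, X_tgt):
    """Estimate p_tgt(x)/p_src(x) with a domain classifier (source=0, target=1)."""
    X = np.vstack([X_src, X_tgt])
    d = np.r_[np.zeros(len(X_src)), np.ones(len(X_tgt))]
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p = clf.predict_proba(X_src)[:, 1]                 # P(domain = target | x)
    return (p / (1.0 - p)) * (len(X_src) / len(X_tgt))

# Toy usage: the target inputs are shifted, so source instances that look
# "target-like" receive larger weights when training the final classifier.
rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, size=(500, 2))
y_src = (X_src[:, 0] + X_src[:, 1] > 0).astype(int)
X_tgt = rng.normal(1.0, 1.0, size=(300, 2))            # shifted input distribution

w = importance_weights(X_src, X_tgt)
model = LogisticRegression(max_iter=1000).fit(X_src, y_src, sample_weight=w)
```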


Proceedings ArticleDOI
26 Apr 2010
TL;DR: This work develops a general solution to sentiment classification when there are no labels in a target domain but some labeled data exist in a different domain, regarded as the source domain, and proposes a spectral feature alignment (SFA) algorithm to align domain-specific words from different domains into unified clusters, with the help of domain-independent words as a bridge.
Abstract: Sentiment classification aims to automatically predict sentiment polarity (e.g., positive or negative) of users publishing sentiment data (e.g., reviews, blogs). Although traditional classification algorithms can be used to train sentiment classifiers from manually labeled text data, the labeling work can be time-consuming and expensive. Meanwhile, users often use some different words when they express sentiment in different domains. If we directly apply a classifier trained in one domain to other domains, the performance will be very low due to the differences between these domains. In this work, we develop a general solution to sentiment classification when we do not have any labels in a target domain but have some labeled data in a different domain, regarded as source domain. In this cross-domain sentiment classification setting, to bridge the gap between the domains, we propose a spectral feature alignment (SFA) algorithm to align domain-specific words from different domains into unified clusters, with the help of domain-independent words as a bridge. In this way, the clusters can be used to reduce the gap between domain-specific words of the two domains, which can be used to train sentiment classifiers in the target domain accurately. Compared to previous approaches, SFA can discover a robust representation for cross-domain data by fully exploiting the relationship between the domain-specific and domain-independent words via simultaneously co-clustering them in a common latent space. We perform extensive experiments on two real world datasets, and demonstrate that SFA significantly outperforms previous approaches to cross-domain sentiment classification.

778 citations
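
The bridging idea described above can be illustrated with a rough sketch: co-occurrence between domain-independent ("pivot") words and domain-specific words defines a bipartite graph, and a spectral embedding of that graph groups domain-specific words from both domains into shared clusters that serve as extra features. This is a simplified stand-in for the general idea, not the paper's exact SFA algorithm; the function and parameter names are hypothetical.

```python
# Rough sketch of spectral alignment over a pivot/domain-specific co-occurrence
# graph (illustrative; not the paper's exact SFA algorithm).
import numpy as np

def spectral_alignment_features(X, pivot_idx, specific_idx, k=50):
    """X: (n_docs, n_words) binary bag-of-words from both domains stacked together.
    Returns the original features augmented with k aligned cluster features.
    k must not exceed min(len(pivot_idx), len(specific_idx))."""
    M = X[:, pivot_idx].T @ X[:, specific_idx]       # pivot x specific co-occurrence
    d1 = np.maximum(M.sum(axis=1, keepdims=True), 1e-12)
    d2 = np.maximum(M.sum(axis=0, keepdims=True), 1e-12)
    Mn = M / np.sqrt(d1) / np.sqrt(d2)               # normalized bipartite affinity
    _, _, Vt = np.linalg.svd(Mn, full_matrices=False)
    W = Vt[:k].T                                     # domain-specific word -> cluster map
    Z = X[:, specific_idx] @ W                       # documents in the aligned space
    return np.hstack([X, Z])                         # augmented representation
```

A classifier trained on the augmented source-domain documents can then be applied to augmented target-domain documents, since the appended cluster features are shared across domains.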


Proceedings ArticleDOI
26 Apr 2010
TL;DR: This paper shows that, by using GPS-based location data and users' comments at various locations, interesting locations and possible activities that can be performed there can be discovered for recommendation, and the system is extensively evaluated on a large real-world GPS dataset.
Abstract: With the increasing popularity of location-based services, such as tour guides and location-based social networks, we have now accumulated a large amount of location data on the Web. In this paper, we show that, by using the location data based on GPS and users' comments at various locations, we can discover interesting locations and possible activities that can be performed there for recommendations. Our research is motivated by the following location-related queries in our daily life: 1) if we want to do something such as sightseeing or food-hunting in a large city such as Beijing, where should we go? 2) If we have already visited some places such as the Bird's Nest building in Beijing's Olympic park, what else can we do there? Using our system, for the first question, we can recommend a list of interesting locations to visit, such as Tiananmen Square, the Bird's Nest, etc. For the second question, if the user visits the Bird's Nest, we can recommend not only sightseeing but also experiencing its outdoor exercise facilities or trying some nice food nearby. To achieve this goal, we first model the users' location and activity histories that we take as input. We then mine knowledge, such as the location features and activity-activity correlations, from the geographical databases and the Web, to gather additional inputs. Finally, we apply a collective matrix factorization method to mine interesting locations and activities, and use them to recommend to the users where they can visit if they want to perform some specific activities and what they can do if they visit some specific places. We empirically evaluated our system using a large GPS dataset collected by 162 users over a period of 2.5 years in the real world. We extensively evaluated our system and showed that it can outperform several state-of-the-art baselines.

703 citations
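
The collective matrix factorization step can be pictured with a small, simplified sketch: a location-activity matrix and a location-feature matrix are factorized jointly while sharing the location factors, so that knowledge mined from location features helps fill in missing location-activity entries. This is an illustrative stand-in, not the paper's exact model; for simplicity it treats missing entries as zeros and uses plain gradient descent.

```python
# Simplified collective matrix factorization (illustrative, not the paper's model):
# X (location x activity) and Y (location x feature) share the location factors U.
import numpy as np

def collective_mf(X, Y, k=10, lr=0.01, reg=0.1, alpha=0.5, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    n_loc, n_act = X.shape
    n_feat = Y.shape[1]
    U = 0.1 * rng.standard_normal((n_loc, k))     # shared location factors
    A = 0.1 * rng.standard_normal((n_act, k))     # activity factors
    F = 0.1 * rng.standard_normal((n_feat, k))    # location-feature factors
    for _ in range(epochs):
        Ex = X - U @ A.T                          # reconstruction errors (missing = 0)
        Ey = Y - U @ F.T
        U += lr * (alpha * Ex @ A + (1 - alpha) * Ey @ F - reg * U)
        A += lr * (alpha * Ex.T @ U - reg * A)
        F += lr * ((1 - alpha) * Ey.T @ U - reg * F)
    return U, A, F                                # U @ A.T scores location-activity pairs
```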


Journal ArticleDOI
TL;DR: BOOST has identified some disease-associated interactions between genes in the major histocompatibility complex region in the type 1 diabetes data set and can serve as a computationally and statistically useful tool in the coming era of large-scale interaction mapping in genome-wide case-control studies.
Abstract: Gene-gene interactions have long been recognized to be fundamentally important for understanding genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named "BOolean Operation-based Screening and Testing" (BOOST). For the discovery of unknown gene-gene interactions that underlie complex diseases, BOOST allows examination of all pairwise interactions in genome-wide case-control studies in a remarkably fast manner. We have carried out interaction analyses on seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). Each analysis took less than 60 hours to completely evaluate all pairs of roughly 360,000 SNPs on a standard 3.0 GHz desktop with 4 GB of memory running Windows XP. The interaction patterns identified from the type 1 diabetes data set display significant differences from those identified from the rheumatoid arthritis data set, although both data sets share a very similar hit region in the WTCCC report. BOOST has also identified some disease-associated interactions between genes in the major histocompatibility complex region in the type 1 diabetes data set. We believe that our method can serve as a computationally and statistically useful tool in the coming era of large-scale interaction mapping in genome-wide case-control studies.

453 citations
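
The "Boolean operation" idea behind BOOST's speed can be illustrated with a small sketch of the general bitwise-counting trick (not the authors' implementation): genotypes are encoded as bitsets, so the full contingency table for any SNP pair is obtained with bitwise ANDs and popcounts instead of looping over samples.

```python
# Illustration of bitwise contingency-table counting, the general trick behind
# Boolean-operation screening (not the authors' BOOST implementation).
import numpy as np

def encode_bitsets(genotypes):
    """genotypes: iterable of 0/1/2 values -> three Python-int bitsets, one per genotype."""
    bits = [0, 0, 0]
    for i, g in enumerate(genotypes):
        bits[int(g)] |= 1 << i
    return bits

def pairwise_table(bits_a, bits_b, case_mask, n_samples):
    """3x3x2 genotype contingency counts for one SNP pair, split by control/case.
    case_mask is an int bitset marking case samples."""
    ctrl_mask = ((1 << n_samples) - 1) ^ case_mask
    table = np.zeros((3, 3, 2), dtype=np.int64)
    for i in range(3):
        for j in range(3):
            cell = bits_a[i] & bits_b[j]
            table[i, j, 0] = bin(cell & ctrl_mask).count("1")   # controls
            table[i, j, 1] = bin(cell & case_mask).count("1")   # cases
    return table
```

A screening or testing statistic would then be computed from each table; the table construction above is the step the Boolean encoding accelerates.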


Posted Content
TL;DR: In this paper, a simple but powerful method, named "BOolean Operation based Screening and Testing" (BOOST), is introduced to discover unknown gene-gene interactions that underlie complex diseases.
Abstract: Gene-gene interactions have long been recognized to be fundamentally important for understanding genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named "BOolean Operation based Screening and Testing" (BOOST). To discover unknown gene-gene interactions that underlie complex diseases, BOOST allows examining all pairwise interactions in genome-wide case-control studies in a remarkably fast manner. We have carried out interaction analyses on seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). Each analysis took less than 60 hours on a standard 3.0 GHz desktop with 4 GB of memory running Windows XP. The interaction patterns identified from the type 1 diabetes data set display significant differences from those identified from the rheumatoid arthritis data set, while both data sets share a very similar hit region in the WTCCC report. BOOST has also identified many undiscovered interactions between genes in the major histocompatibility complex (MHC) region in the type 1 diabetes data set. In the coming era of large-scale interaction mapping in genome-wide case-control studies, our method can serve as a computationally and statistically useful tool.

397 citations


Proceedings Article
11 Jul 2010
TL;DR: This paper discovers the principal coordinates of both users and items in the auxiliary data matrices, and transfers them to the target domain in order to reduce the effect of data sparsity.
Abstract: Data sparsity is a major problem for collaborative filtering (CF) techniques in recommender systems, especially for new users and items. We observe that, while our target data are sparse for CF systems, related and relatively dense auxiliary data may already exist in some other more mature application domains. In this paper, we address the data sparsity problem in a target domain by transferring knowledge about both users and items from auxiliary data sources. We observe that in different domains the user feedback is often heterogeneous, such as ratings vs. clicks. Our solution is to integrate both user and item knowledge in auxiliary data sources through a principled matrix-based transfer learning framework that takes into account the data heterogeneity. In particular, we discover the principal coordinates of both users and items in the auxiliary data matrices, and transfer them to the target domain in order to reduce the effect of data sparsity. We describe our method, which is known as coordinate system transfer or CST, and demonstrate its effectiveness in alleviating the data sparsity problem in collaborative filtering. We show that our proposed method can significantly outperform several state-of-the-art solutions for this problem.

359 citations
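
A much-simplified sketch of the coordinate-system-transfer idea follows: principal coordinates (top singular vectors) of a dense auxiliary matrix are used to initialize and regularize the factors of the sparse target rating matrix. It is illustrative only, not the paper's CST algorithm, and it assumes for simplicity that the auxiliary and target matrices cover the same users and items.

```python
# Simplified coordinate-system-transfer sketch (illustrative, not the paper's CST).
# Assumes R_aux and R_tgt share the same user/item index sets.
import numpy as np

def cst_sketch(R_aux, R_tgt, mask_tgt, k=10, lam=1.0, lr=0.01, epochs=300, seed=0):
    """R_aux: dense auxiliary matrix; R_tgt with 0/1 mask_tgt: sparse target ratings."""
    U0, _, V0t = np.linalg.svd(R_aux, full_matrices=False)
    U0, V0 = U0[:, :k], V0t[:k].T                 # auxiliary principal coordinates
    rng = np.random.default_rng(seed)
    U = U0 + 0.01 * rng.standard_normal(U0.shape)
    V = V0 + 0.01 * rng.standard_normal(V0.shape)
    for _ in range(epochs):
        E = mask_tgt * (R_tgt - U @ V.T)          # error on observed target entries only
        U += lr * (E @ V - lam * (U - U0))        # pull factors toward auxiliary coordinates
        V += lr * (E.T @ U - lam * (V - V0))
    return U, V                                   # U @ V.T predicts missing target ratings
```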


Proceedings Article
11 Jul 2010
TL;DR: A new approach, known as user-centered collaborative location and activity filtering (UCLAF), to pull many users' data together and apply collaborative filtering to find like-minded users and like-patterned activities at different locations is presented.
Abstract: With the increasing popularity of location tracking services such as GPS, more and more mobile data are being accumulated. Based on such data, a potentially useful service is to make timely and targeted recommendations for users on places where they might be interested to go and activities that they are likely to conduct. For example, a user arriving in Beijing might wonder where to visit and what she can do around the Forbidden City. A key challenge for such recommendation problems is that the data we have on each individual user might be very limited, while to make useful and accurate recommendations, we need extensive annotated location and activity information from user trace data. In this paper, we present a new approach, known as user-centered collaborative location and activity filtering (UCLAF), to pull many users' data together and apply collaborative filtering to find like-minded users and like-patterned activities at different locations. We model the user-location-activity relations with a tensor representation, and propose a regularized tensor and matrix decomposition solution which can better address the sparse data problem in mobile information retrieval. We empirically evaluate UCLAF using a real-world GPS dataset collected from 164 users over 2.5 years, and showed that our system can outperform several state-of-the-art solutions to the problem.

348 citations
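
The tensor part of the model can be pictured with a simplified CP (PARAFAC-style) factorization of the user-location-activity tensor, fit by gradient descent on the observed entries. This is an illustration of the general technique only, not the paper's regularized tensor-and-matrix decomposition, which additionally couples the tensor with side-information matrices.

```python
# Simplified CP factorization of a user-location-activity tensor on observed entries
# (illustrative; not the paper's regularized tensor-and-matrix decomposition).
import numpy as np

def cp_factorize(T, mask, k=8, lr=0.01, reg=0.05, epochs=300, seed=0):
    """T, mask: (users, locations, activities) value tensor and 0/1 observation mask."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((T.shape[0], k))   # user factors
    L = 0.1 * rng.standard_normal((T.shape[1], k))   # location factors
    A = 0.1 * rng.standard_normal((T.shape[2], k))   # activity factors
    for _ in range(epochs):
        approx = np.einsum("ur,lr,ar->ula", U, L, A)
        E = mask * (T - approx)                      # errors on observed entries only
        U += lr * (np.einsum("ula,lr,ar->ur", E, L, A) - reg * U)
        L += lr * (np.einsum("ula,ur,ar->lr", E, U, A) - reg * L)
        A += lr * (np.einsum("ula,ur,lr->ar", E, U, L) - reg * A)
    return U, L, A   # np.einsum("ur,lr,ar->ula", U, L, A) scores unseen triples
```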


Journal ArticleDOI
TL;DR: To our knowledge, SNPRuler is the first method that guarantees to find epistatic interactions in GWAS without exhaustive search, indicating that the problem is computationally attainable in practice.
Abstract: Motivation: Under the current era of genome-wide association study (GWAS), finding epistatic interactions in the large volume of SNP data is a challenging and unsolved issue. Few previous studies could handle genome-wide data due to the difficulties in searching the combinatorially explosive search space and statistically evaluating high-order epistatic interactions given the limited number of samples. In this work, we propose a novel learning approach (SNPRuler) based on predictive rule inference to find disease-associated epistatic interactions. Results: Our extensive experiments on both simulated data and real genome-wide data from the Wellcome Trust Case Control Consortium (WTCCC) show that SNPRuler significantly outperforms its recent competitor. To our knowledge, SNPRuler is the first method that guarantees to find the epistatic interactions without exhaustive search. Our results indicate that finding epistatic interactions in GWAS is computationally attainable in practice. Availability: http://bioinformatics.ust.hk/SNPRuler.zip Contact: [email protected], [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

162 citations


Proceedings Article
11 Jul 2010
TL;DR: An Adaptive Transfer learning algorithm based on Gaussian Processes (AT-GP), which can be used to adapt the transfer learning schemes by automatically estimating the similarity between a source and a target task, is proposed.
Abstract: Transfer learning aims at reusing the knowledge in some source tasks to improve the learning of a target task. Many transfer learning methods assume that the source tasks and the target task are related, even though many tasks are not related in reality. However, when two tasks are unrelated, the knowledge extracted from a source task may not help, and may even hurt, the performance on the target task. Thus, how to avoid negative transfer and ensure a "safe transfer" of knowledge is crucial in transfer learning. In this paper, we propose an Adaptive Transfer learning algorithm based on Gaussian Processes (AT-GP), which can be used to adapt the transfer learning schemes by automatically estimating the similarity between a source and a target task. The main contribution of our work is that we propose a new semi-parametric transfer kernel for transfer learning from a Bayesian perspective, and propose to learn the model with respect to the target task, rather than all tasks as in multi-task learning. We can formulate the transfer learning problem as a unified Gaussian Process (GP) model. The adaptive transfer ability of our approach is verified on both synthetic and real-world datasets.

151 citations
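
The "transfer kernel" idea can be made concrete with a small Gaussian-process sketch in which cross-task covariance is shrunk by a task-similarity factor lam. In AT-GP this similarity is estimated from data and the kernel is semi-parametric; here lam is simply fixed, so the code below is an illustration of the general construction rather than the paper's algorithm.

```python
# GP regression with a simple transfer kernel: covariance between points from
# different tasks is scaled by lam in [0, 1]. Illustrative only; AT-GP learns the
# task similarity and uses a semi-parametric kernel.
import numpy as np

def rbf(Xa, Xb, ell=1.0):
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def transfer_gp_predict(Xs, ys, Xt, yt, Xq, lam=0.7, noise=0.1):
    """Posterior mean at target-task query points Xq, borrowing from source (Xs, ys)."""
    X = np.vstack([Xs, Xt])
    y = np.r_[ys, yt]
    task = np.r_[np.zeros(len(Xs)), np.ones(len(Xt))]
    same = (task[:, None] == task[None, :]).astype(float)
    K = rbf(X, X) * (same + lam * (1.0 - same))        # shrink cross-task covariance
    K += noise**2 * np.eye(len(X))
    k_q = rbf(Xq, X) * np.where(task[None, :] == 1.0, 1.0, lam)
    return k_q @ np.linalg.solve(K, y)                 # standard GP posterior mean
```

With lam = 0 the source data are ignored (no transfer); with lam = 1 source and target are pooled as one task; intermediate values interpolate, which is exactly the knob an adaptive scheme would tune.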


Book ChapterDOI
20 Sep 2010
TL;DR: A new criterion to overcome "double" distribution shift is formulated and a practical approach "Transfer Cross Validation" (TrCV) is presented to select both models and data in a cross validation framework, optimized for transfer learning.
Abstract: One solution to the lack of labeled data is to exploit transfer learning, whereby one acquires knowledge from source domains to improve the learning performance in the target domain. The main challenge is that the source and target domains may have different distributions. An open problem is how to select the available models (including algorithms and parameters) and, importantly, the abundant source-domain data, through statistically reliable methods, thus making transfer learning practical and easy to use for real-world applications. To address this challenge, one needs to take into account the difference in both marginal and conditional distributions at the same time, not just one of them. In this paper, we formulate a new criterion to overcome "double" distribution shift and present a practical approach, "Transfer Cross Validation" (TrCV), to select both models and data in a cross validation framework optimized for transfer learning. The idea is to use density ratio weighting to overcome the difference in marginal distributions and to propose a "reverse validation" procedure to quantify how well a model approximates the true conditional distribution of the target domain. The usefulness of TrCV is demonstrated on different cross-domain tasks, including wine quality evaluation, web-user ranking and text categorization. The experiment results show that the proposed method outperforms both traditional cross-validation and one state-of-the-art method which only considers marginal distribution shift. The software and datasets are available from the authors.

139 citations
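
One ingredient of TrCV, the density-ratio weighting of validation error, can be sketched as follows, using the same domain-classifier reweighting trick sketched earlier for covariate shift. The paper's reverse-validation procedure for the conditional distribution is omitted, so this is only a partial, hypothetical illustration.

```python
# Sketch of density-ratio-weighted validation error (one ingredient of TrCV; the
# reverse-validation step is not shown). Weights approximate p_tgt(x)/p_src(x).
import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_validation_error(model, X_val, y_val, X_src, X_tgt):
    """Held-out source error, reweighted toward the target marginal distribution."""
    X = np.vstack([X_src, X_tgt])
    d = np.r_[np.zeros(len(X_src)), np.ones(len(X_tgt))]
    dom = LogisticRegression(max_iter=1000).fit(X, d)
    p = dom.predict_proba(X_val)[:, 1]
    w = (p / (1.0 - p)) * (len(X_src) / len(X_tgt))    # estimated density ratio
    errors = (model.predict(X_val) != y_val).astype(float)
    return np.average(errors, weights=w)               # compare candidate models on this
```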


Proceedings Article
21 Jun 2010
TL;DR: This paper proposes a nonparametric Bayesian framework for solving the CLP problem, which allows knowledge to be adaptively transferred across heterogeneous tasks while taking into account the similarities between tasks.
Abstract: Link prediction is a key technique in many applications such as recommender systems, where potential links between users and items need to be predicted. A challenge in link prediction is the data sparsity problem. In this paper, we address this problem by jointly considering multiple heterogeneous link prediction tasks such as predicting links between users and different types of items including books, movies and songs, which we refer to as the collective link prediction (CLP) problem. We propose a nonparametric Bayesian framework for solving the CLP problem, which allows knowledge to be adaptively transferred across heterogeneous tasks while taking into account the similarities between tasks. We learn the inter-task similarity automatically. We also introduce link functions for different tasks to correct their biases and skewness of distributions in their link data. We conduct experiments on several real world datasets and demonstrate significant improvements over several existing state-of-the-art methods.

Journal ArticleDOI
TL;DR: This special issue gathers the state-of-the-art research in social learning and is devoted to exhibiting some of the best representative works in this area.
Abstract: In recent years, social behavioral data have been exponentially expanding due to the tremendous success of various outlets on the social Web (aka Web 2.0) such as Facebook, Digg, Twitter, Wikipedia, and Delicious. As a result, there's a need for social learning to support the discovery, analysis, and modeling of human social behavioral data. The goal is to discover social intelligence, which encompasses a spectrum of knowledge that characterizes human interaction, communication, and collaborations. The social Web has thus become a fertile ground for machine learning and data mining research. This special issue gathers the state-of-the-art research in social learning and is devoted to exhibiting some of the best representative works in this area.

Proceedings ArticleDOI
26 Sep 2010
TL;DR: This paper extends the widely used neighborhood-based algorithms by incorporating temporal information and develops an incremental algorithm for updating neighborhood similarities with new data.
Abstract: Collaborative filtering algorithms attempt to predict a user's interests based on his past feedback. In real world applications, a user's feedback is often continuously collected over a long period of time. It is very common for a user's interests or an item's popularity to change over a long period of time. Therefore, the underlying recommendation algorithm should be able to adapt to such changes accordingly. However, most existing algorithms do not distinguish current and historical data when predicting the users' current interests. In this paper, we consider a new problem - online evolutionary collaborative filtering, which tracks user interests over time in order to make timely recommendations. We extended the widely used neighborhood based algorithms by incorporating temporal information and developed an incremental algorithm for updating neighborhood similarities with new data. Experiments on two real world datasets demonstrated both improved effectiveness and efficiency of the proposed approach.
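
The incremental, time-aware flavor of the approach can be illustrated with a hypothetical sketch in which item co-rating statistics decay exponentially over time and are updated one feedback event at a time; cosine-style similarities are then read off the decayed counts. This shows the general idea only, not the paper's exact update rules.

```python
# Hypothetical time-decayed incremental co-rating statistics for a neighborhood
# model (illustrates the general idea, not the paper's exact algorithm).
from collections import defaultdict
import math

class DecayedItemSimilarity:
    def __init__(self, decay=0.01):
        self.decay = decay                       # exponential decay rate per time unit
        self.t_last = 0.0
        self.co = defaultdict(float)             # decayed co-occurrence counts, key (i, j)
        self.cnt = defaultdict(float)            # decayed per-item counts
        self.user_items = defaultdict(set)       # items each user has interacted with

    def update(self, user, item, t):
        """Fold one (user, item, time) event into the statistics (t must be non-decreasing)."""
        w = math.exp(-self.decay * (t - self.t_last))
        if w < 1.0:                              # lazily decay everything accumulated so far
            for k in self.co:
                self.co[k] *= w
            for k in self.cnt:
                self.cnt[k] *= w
            self.t_last = t
        for other in self.user_items[user]:
            self.co[tuple(sorted((item, other)))] += 1.0
        self.cnt[item] += 1.0
        self.user_items[user].add(item)

    def similarity(self, i, j):
        denom = math.sqrt(self.cnt[i] * self.cnt[j])
        return self.co[tuple(sorted((i, j)))] / denom if denom > 0 else 0.0
```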

Journal ArticleDOI
TL;DR: This article presents a novel algorithm called LAMP (Learning Action Models from Plan traces), to learn action models with quantifiers and logical implications from a set of observed plan traces with only partially observed intermediate state information.

Journal ArticleDOI
TL;DR: A comprehensive survey of videoblogging (vlogging for short) as a new technological trend is presented, and several multimedia technologies are introduced to empower vlogging technology with better scalability, interactivity, searchability, and accessibility.
Abstract: In recent years, blogging has become an exploding passion among Internet communities. By combining the grassroots blogging with the richness of expression available in video, videoblogs (vlogs for short) will be a powerful new media adjunct to our existing televised news sources. Vlogs have gained much attention worldwide, especially with Google's acquisition of YouTube. This article presents a comprehensive survey of videoblogging (vlogging for short) as a new technological trend. We first summarize the technological challenges for vlogging as four key issues that need to be answered. Along with their respective possibilities, we give a review of the currently available techniques and tools supporting vlogging, and envision emerging technological directions for future vlogging. Several multimedia technologies are introduced to empower vlogging technology with better scalability, interactivity, searchability, and accessibility, and to potentially reduce the legal, economic, and moral risks of vlogging applications. We also make an in-depth investigation of various vlog mining topics from a research perspective and present several incentive applications such as user-targeted video advertising and collective intelligence gaming. We believe that vlogging and its applications will bring new opportunities and drives to the research in related fields.

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This work developed matrix factorization models that can be trained from explicit and implicit feedback simultaneously and showed that the algorithm could effectively combine these two forms of heterogeneous user feedback to improve recommendation quality.
Abstract: Most collaborative filtering algorithms are based on certain statistical models of user interests built from either explicit feedback (e.g., ratings, votes) or implicit feedback (e.g., clicks, purchases). Explicit feedback is more precise but more difficult to collect from users, while implicit feedback is much easier to collect though less accurate in reflecting user preferences. In the existing literature, separate models have been developed for each of these two forms of user feedback due to their heterogeneous representation. However, in most real-world recommender systems both explicit and implicit user feedback are abundant and could potentially complement each other. It is desirable to be able to unify these two heterogeneous forms of user feedback in order to generate more accurate recommendations. In this work, we developed matrix factorization models that can be trained from explicit and implicit feedback simultaneously. Experimental results on multiple datasets showed that our algorithm could effectively combine these two forms of heterogeneous user feedback to improve recommendation quality.
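
A much-simplified version of the idea is a single factorization with shared user and item factors and a weighted sum of two losses, one over explicit ratings and one over implicit positives. The sketch below is an illustrative stand-in (plain SGD, squared losses), not the paper's exact models.

```python
# Simplified joint factorization over explicit ratings and implicit clicks with
# shared factors (illustrative; not the paper's exact models).
import numpy as np

def joint_mf(ratings, clicks, n_users, n_items, k=16, lr=0.02, reg=0.05,
             alpha=0.5, epochs=20, seed=0):
    """ratings: list of (u, i, r); clicks: list of (u, i) implicit positives."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:                   # explicit term: fit the rating value
            e = r - U[u] @ V[i]
            Uu = U[u].copy()
            U[u] += lr * (alpha * e * V[i] - reg * U[u])
            V[i] += lr * (alpha * e * Uu - reg * V[i])
        for u, i in clicks:                       # implicit term: push the score toward 1
            e = 1.0 - U[u] @ V[i]
            Uu = U[u].copy()
            U[u] += lr * ((1 - alpha) * e * V[i] - reg * U[u])
            V[i] += lr * ((1 - alpha) * e * Uu - reg * V[i])
    return U, V                                   # U @ V.T produces recommendation scores
```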

Journal ArticleDOI
TL;DR: The hypothesis is validated that siRNA binding efficacy on different messenger RNAs (mRNAs) has different conditional distributions, so multi-task learning can be conducted by viewing tasks at the "mRNA" level rather than at the "experiment" level; this provides useful insights on how to analyze various cross-platform RNAi data to uncover their complex mechanisms.
Abstract: Background Gene silencing using exogenous small interfering RNAs (siRNAs) is now a widespread molecular tool for gene functional study and new-drug target identification. The key mechanism in this technique is to design efficient siRNAs that are incorporated into the RNA-induced silencing complexes (RISC) to bind and interact with the mRNA targets to repress their translation into proteins. Although considerable progress has been made in the computational analysis of siRNA binding efficacy, few joint analyses of different RNAi experiments conducted under different experimental scenarios have been carried out so far, even though such joint analysis is an important issue in cross-platform siRNA efficacy prediction. A collective analysis of RNAi mechanisms for different datasets and experimental conditions can often provide new clues on the design of potent siRNAs.

Journal ArticleDOI
TL;DR: This paper designs a novel transfer learning approach, called BIG (Bridging Information Gap), to effectively extract useful knowledge in a worldwide knowledge base, which is then used to link the source and target domains for improving the classification performance.
Abstract: A major problem of classification learning is the lack of ground-truth labeled data. It is usually expensive to label new data instances for training a model. To solve this problem, domain adaptation in transfer learning has been proposed to classify target domain data by using some other source domain data, even when the data may have different distributions. However, domain adaptation may not work well when the differences between the source and target domains are large. In this paper, we design a novel transfer learning approach, called BIG (Bridging Information Gap), to effectively extract useful knowledge in a worldwide knowledge base, which is then used to link the source and target domains for improving the classification performance. BIG works when the source and target domains share the same feature space but different underlying data distributions. Using the auxiliary source data, we can extract a "bridge" that allows cross-domain text classification problems to be solved using standard semisupervised learning algorithms. A major contribution of our work is that with BIG, a large amount of worldwide knowledge can be easily adapted and used for learning in the target domain. We conduct experiments on several real-world cross-domain text classification tasks and demonstrate that our proposed approach can outperform several existing domain adaptation approaches significantly.

Proceedings ArticleDOI
20 May 2010
TL;DR: This paper studies how to reduce this calibration effort by only collecting the labeled data on one floor, while collecting unlabeled data on other floors, inspired by the observation that, although the wireless signals can be quite different, the floor-plans in a building are similar.
Abstract: In pervasive computing, localizing a user in wireless indoor environments is an important yet challenging task. Among the state-of-the-art localization methods, fingerprinting is shown to be quite successful by statistically learning the signal-to-location relations. However, a major drawback of fingerprinting is that it usually requires a lot of labeled data to train an accurate localization model. To establish a fingerprinting-based localization model in a building with many floors, we have to collect sufficient labeled data on each floor. This effort can be very burdensome. In this paper, we study how to reduce this calibration effort by only collecting the labeled data on one floor, while collecting unlabeled data on other floors. Our idea is inspired by the observation that, although the wireless signals can be quite different, the floor plans in a building are similar. Therefore, if we co-embed these different floors' data in some common low-dimensional manifold, we are able to align the unlabeled data with the labeled data well so that we can then propagate the labels to the unlabeled data. We conduct empirical evaluations on real-world multi-floor data sets to validate our proposed method.

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper proposes a novel approach for behavior pattern mining which takes context logs as time ordered sequences of context records and takes into account the co-occurrences of contexts and interaction records in the whole time ranges of contexts.
Abstract: The user interaction with the mobile device plays an important role in user habit understanding, which is crucial for improving context-aware services. In this paper, we propose to mine the associations between user interactions and contexts captured by mobile devices, or behavior patterns for short, from context logs to characterize the habits of mobile users. Though several state-of-the-art studies have been reported for association mining, they cannot be applied to behavior pattern mining due to the unbalanced occurrences of contexts and user interaction records. To this end, we propose a novel approach for behavior pattern mining which takes context logs as time-ordered sequences of context records and takes into account the co-occurrences of contexts and interaction records in the whole time ranges of contexts. Moreover, we develop an Apriori-like algorithm for behavior pattern mining and improve the original algorithm's efficiency by introducing a context hash tree. Finally, we build a data collection system and collect rich context data and interaction records from 50 recruited volunteers through their mobile devices. The extensive experiments on the collected real-life data clearly validate the ability of our approach to mine effective behavior patterns.

Proceedings ArticleDOI
30 Sep 2010
TL;DR: Experimental results show that the use of temporal information can lead to significant improvement on the weekly recommendation track, whereas for the social network track, NRMF could lead to minor improvement by combining the social network with the rating data.
Abstract: In this paper, we describe our solutions to the weekly recommendation track and the social network track of the CAMRA 2010 challenge. The key challenge in the weekly recommendation track is designing models that can cope with time-dependent user or item characteristics. Toward this goal, we compared two general approaches: one is a data weighting approach, the other is a time-aware modeling approach. Both approaches can be implemented by extending either the well-known neighborhood model or matrix factorization. For the social network track, we developed and compared two extensions of the matrix factorization models for incorporating the social network structure, namely collective matrix factorization (CMF) and network-regularized matrix factorization (NRMF). Experimental results show that the use of temporal information can lead to significant improvement on the weekly recommendation track, whereas for the social network track, NRMF could lead to minor improvement by combining the social network with the rating data.

Journal ArticleDOI
TL;DR: This paper proposes an Adaptive Group Lasso (AGL) model for large-scale association studies that enables us to analyze SNPs and their interactions simultaneously and introduces a sparsity constraint in this model based on the fact that only a small fraction of SNPs is disease-associated.
Abstract: Single nucleotide polymorphism (SNP) based association studies aim at identifying SNPs associated with phenotypes, for example, complex diseases. The associated SNPs may influence the disease risk individually (main effects) or behave jointly (epistatic interactions). For the analysis of high-throughput data, the main difficulty is that the number of SNPs far exceeds the number of samples. This difficulty is amplified when identifying interactions. In this paper, we propose an Adaptive Group Lasso (AGL) model for large-scale association studies. Our model enables us to analyze SNPs and their interactions simultaneously. We achieve this by introducing a sparsity constraint in our model based on the fact that only a small fraction of SNPs is disease-associated. In order to reduce the number of false positive findings, we develop an adaptive reweighting scheme to enhance sparsity. In addition, our method treats SNPs and their interactions as factors, and identifies them in a grouped manner. Thus, it is flexible enough to analyze various disease models, especially for interaction detection. However, due to the intensive computation when millions of interaction terms need to be searched during model fitting, our method needs to be combined with filtering methods when applied to genome-wide data for detecting interactions. By using a wide range of simulated datasets and a real dataset from WTCCC, we demonstrate the advantages of our method.
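
The grouped, adaptively reweighted penalty described above can be written schematically as follows; the notation is illustrative (w_g denotes the adaptive weight of group g, and a group collects the coded main-effect or interaction terms of a SNP or SNP pair), not necessarily the paper's exact formulation.

```latex
% Schematic adaptive group lasso over main-effect and interaction groups
% (illustrative notation; w_g are the adaptive reweighting terms).
\min_{\beta}\; \ell(\beta)
  \;+\; \lambda \sum_{g \in \mathcal{G}} w_{g}\,\sqrt{|g|}\;\lVert \beta_{g} \rVert_{2},
\qquad \ell(\beta) = \text{(e.g., logistic loss on case/control labels)}
```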

Journal ArticleDOI
TL;DR: The new frontier of mobile information retrieval will combine context awareness and content adaptation.
Abstract: The new frontier of mobile information retrieval will combine context awareness and content adaptation.

Journal ArticleDOI
TL;DR: A novel algorithm called Topic-Sensitive pLSA is proposed, which extends the original probabilistic latent semantic analysis (pLSA), a purely unsupervised framework, by injecting a small amount of supervision from the user, encoded as constraints with corresponding penalty terms.
Abstract: It is often difficult and time-consuming to provide a large amount of positive and negative examples for training a classification system in many applications such as information retrieval. Instead, users often find it easier to indicate just a few positive examples of what they like, and thus, these are the only labeled examples available for the learning system. A large amount of unlabeled data is easier to obtain. How to make use of the positive and unlabeled data for learning is a critical problem in machine learning and information retrieval. Several approaches for solving this problem have been proposed in the past, but most of these methods do not work well when only a small amount of labeled positive data is available. In this paper, we propose a novel algorithm called Topic-Sensitive pLSA to solve this problem. This algorithm extends the original probabilistic latent semantic analysis (pLSA), which is a purely unsupervised framework, by injecting a small amount of supervision information from the user. The supervision from users is in the form of indicating which documents fit the users' interests. The supervision is encoded into a set of constraints. By introducing penalty terms for these constraints, we propose an objective function that trades off the likelihood of the observed data against the enforcement of the constraints. We develop an iterative algorithm that can obtain a local optimum of the objective function. Experimental evaluation on three data corpora shows that the proposed method can improve performance, especially when only a small amount of labeled positive data is available.
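
The trade-off described above can be written schematically as below; this is a generic form for illustration (with a penalty term Omega_c for each user constraint c), not necessarily the paper's exact objective.

```latex
% Schematic objective: pLSA likelihood traded off against constraint penalties
% (generic form for illustration; \Omega_c penalizes violation of constraint c).
\max_{\theta}\;
\underbrace{\sum_{d}\sum_{w} n(d,w)\,\log \sum_{z} p(w \mid z)\,p(z \mid d)}_{\text{pLSA log-likelihood}}
\;-\;\lambda \underbrace{\sum_{c \in \mathcal{C}} \Omega_{c}(\theta)}_{\text{penalties for violated constraints}}
```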

Journal ArticleDOI
TL;DR: This inaugural issue of the ACM Transactions on Intelligent Systems and Technology (ACM TIST) is presented, a new scholarly journal that publishes the highest quality articles on intelligent systems, applicable algorithms, and technology with a multidisciplinary perspective.
Abstract: It is our great pleasure to present this inaugural issue of the ACM Transactions on Intelligent Systems and Technology (ACM TIST). In today’s world, systems empowered with artificial intelligence (AI) technology have truly touched on every aspect of our lives, ranging from Web search to smart phones, from social networking and media systems to computational sustainability. Looking back, the field of AI has undergone tremendous changes throughout the years since its inception in the 1950s. Today’s AI application has grown from the standalone, single systems in the early days to ones that are more pervasive, integrated, embedded, and multidisciplinary. AI systems are becoming more integrated, with more than one technology operating and interacting therein as well as becoming more embedded by acting as key components of a larger, overall system or systems. Increasingly, systems and technologies are data driven, which complements the traditional top-down design methodologies in AI. Intelligent systems are also stepping out of the traditional computer science realm as evidenced by explosive research in areas such as bioinformatics and biomedicine, intelligent education systems, and intelligent transportation systems. In light of the technology and societal changes just mentioned, we see a tremendous demand for opening up a new archival journal at a top venue to document high-impact works in the area of intelligent systems and technology. As we state in our editorial charter published at http://tist.acm.org, ACM Transactions on Intelligent Systems and Technology is a new scholarly journal that publishes the highest quality articles on intelligent systems, applicable algorithms, and technology with a multidisciplinary perspective. An intelligent system is one that uses artificial intelligence techniques to offer important services (e.g., as a component of a larger system) that allow integrated systems to perceive, reason, learn, and act intelligently in the real world. The journal welcomes articles that report on the integration of artificial intelligence technology with various subareas of computer science as well as with other branches of science and engineering. The journal welcomes innovative high-impact articles on deployed or emerging intelligent systems and technology with solid evaluation or evidence of success on a variety of topics. In a field where there are already many top conferences every year disseminating volumes of good papers, do we still need to have another journal? In his Communications of ACM article “Conferences vs. journals in computing research”, Moshe Y. Vardi [2009] ponders whether computer science as a field has now reached a maturity such that we should see more journal publications as a way to disseminate our results. He asks, “Why are we the only discipline driving on the conference side of the ‘publication road?”’ Many authors have had the experience with a typical computer science journal submission incurring much

Proceedings ArticleDOI
01 Dec 2010
TL;DR: This paper introduces the well-known Collective Matrix Factorization (CMF) technique to 'transfer' usable linkage knowledge from a relatively dense interaction network to a sparse target network, and establishes the correspondence between a source and a target network via network similarities.
Abstract: Protein-protein interactions (PPI) play an important role in cellular processes and metabolic processes within a cell. An important task is to determine the existence of interactions among proteins. Unfortunately, existing biological experimental techniques are expensive, time-consuming and labor-intensive. The structures of many such networks are sparse, incomplete and noisy, containing many false positives and false negatives. Thus, state-of-the-art methods for link prediction in these networks often cannot give satisfactory prediction results, especially when some networks are extremely sparse. Noticing that we typically have more than one PPI network available, we naturally wonder whether it is possible to 'transfer' the linkage knowledge from some existing, relatively dense networks to a sparse network, to improve the prediction performance. Noticing that a network structure can be modeled using a matrix model, in this paper, we introduce the well-known Collective Matrix Factorization (CMF) technique to 'transfer' usable linkage knowledge from a relatively dense interaction network to a sparse target network. Our approach is to establish the correspondence between a source and a target network via network similarities. We test this method on two real protein-protein interaction networks, Helicobacter pylori (as a target network) and Human (as a source network). Our experimental results show that our method can achieve higher and more robust performance as compared to some baseline methods.

Proceedings ArticleDOI
01 Dec 2010
TL;DR: A novel and efficient heuristic algorithm to tackle the VN mapping problem in the context of virtual multicast service-oriented network subject to delay and delay variation constraints (VMNDDVC) based on a sliding window approach is presented.
Abstract: As a key issue in building a virtual network (VN), the VN mapping problem can be addressed by various state-of-the-art algorithms. While these algorithms are efficient for the construction of unicast service-oriented VNs, they are generally not suitable for multicast cases. In this paper, we investigate the mapping problem in the context of virtual multicast service-oriented networks subject to delay and delay variation constraints (VMNDDVC). We present a novel and efficient heuristic algorithm to tackle this problem based on a sliding window approach. The primary objective of this algorithm is two-fold: to minimize the cost of VMNDDVC request mapping, and to achieve load balancing so as to increase the acceptance ratio of virtual multicast network (VMN) requests. The numerical results obtained from extensive simulation experiments demonstrate the effectiveness of the proposed approach and its superiority over existing solutions in terms of VN mapping acceptance ratio, total revenue and cost in the long term.

Dissertation
01 Jan 2010
TL;DR: A novel dimensionality reduction framework for transfer learning is proposed, which tries to reduce the distance between different domains while preserving data properties as much as possible, and is general for many transfer learning problems when domain knowledge is unavailable.
Abstract: Transfer learning is a new machine learning and data mining framework that allows the training and test data to come from different distributions and/or feature spaces. We can find many novel applications of machine learning and data mining where transfer learning is helpful, especially when we have limited labeled data in our domain of interest. In this thesis, we first survey different settings and approaches of transfer learning and give a big picture of the field. We focus on latent space learning for transfer learning, which aims at discovering a "good" common feature space across domains, such that knowledge transfer becomes possible. In our study, we propose a novel dimensionality reduction framework for transfer learning, which tries to reduce the distance between different domains while preserving data properties as much as possible. This framework is general for many transfer learning problems when domain knowledge is unavailable. Based on this framework, we propose three effective solutions to learn the latent space for transfer learning. We apply these methods to two diverse applications: cross-domain WiFi localization and cross-domain text classification, and achieve promising results. Furthermore, for a specific application area, such as sentiment classification, where domain knowledge is available for encoding into transfer learning methods, we propose a spectral feature alignment algorithm for cross-domain learning. In this algorithm, we try to align domain-specific features from different domains by using some domain-independent features as a bridge. Experimental results show that this method outperforms a state-of-the-art algorithm on two real-world datasets for cross-domain sentiment classification.
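
The phrase "reduce the distance between different domains" is commonly instantiated with the Maximum Mean Discrepancy (MMD) between source and target samples. Below is a minimal empirical MMD estimate with an RBF kernel, shown as an illustration of the kind of domain distance such a dimensionality reduction framework might minimize; it is not claimed to be the thesis's exact criterion.

```python
# Minimal biased estimate of squared MMD between source and target samples with an
# RBF kernel (an illustration of a domain-distance measure; not necessarily the
# thesis's exact criterion).
import numpy as np

def rbf_kernel(Xa, Xb, gamma=1.0):
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(X_src, X_tgt, gamma=1.0):
    """Squared maximum mean discrepancy between two samples (biased estimate)."""
    Kss = rbf_kernel(X_src, X_src, gamma).mean()
    Ktt = rbf_kernel(X_tgt, X_tgt, gamma).mean()
    Kst = rbf_kernel(X_src, X_tgt, gamma).mean()
    return Kss + Ktt - 2.0 * Kst
```

A latent-space method would search for a projection of both domains that drives this quantity down while retaining enough structure (e.g., variance) to keep the data useful for the final task.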

Journal ArticleDOI
08 Mar 2010-PLOS ONE
TL;DR: The present findings suggest that genome regions characterized by the co-occurrence of positive selection and hotspot recombination, two interacting factors both affecting genetic diversity, merit close scrutiny with respect to the etiology of common complex disorders.
Abstract: Background: Schizophrenia is a major disorder with complex genetic mechanisms. Earlier, population genetic studies revealed the occurrence of strong positive selection in the GABRB2 gene encoding the β2 subunit of GABAA receptors, within a segment of 3,551 bp harboring twenty-nine single nucleotide polymorphisms (SNPs) and containing schizophrenia-associated SNPs and haplotypes. Methodology/Principal Findings:In the present study, the possible occurrence of recombination in this 'S1-S29' segment was assessed. The occurrence of hotspot recombination was indicated by high resolution recombination rate estimation, haplotype diversity, abundance of rare haplotypes, recurrent mutations and torsos in haplotype networks, and experimental haplotyping of somatic and sperm DNA. The sub-segment distribution of relative recombination strength, measured by the ratio of haplotype diversity (Hd) over mutation rate (θ), was indicative of a human specific Alu-Yi6 insertion serving as a central recombining sequence facilitating homologous recombination. Local anomalous DNA conformation attributable to the Alu-Yi6 element, as suggested by enhanced DNase I sensitivity and obstruction to DNA sequencing, could be a contributing factor of the increased sequence diversity. Linkage disequilibrium (LD) analysis yielded prominent low LD points that supported ongoing recombination. LD contrast revealed significant dissimilarity between control and schizophrenic cohorts. Among the large array of inferred haplotypes, H26 and H73 were identified to be protective, and H19 and H81 risk-conferring, toward the development of schizophrenia. Conclusions/Significance: The co-occurrence of hotspot recombination and positive selection in the S1-S29 segment of GABRB2 has provided a plausible contribution to the molecular genetics mechanisms for schizophrenia. The present findings therefore suggest that genome regions characterized by the co-occurrence of positive selection and hotspot recombination, two interacting factors both affecting genetic diversity, merit close scrutiny with respect to the etiology of common complex disorders. © 2010 Ng et al.

Proceedings Article
01 Dec 2010
TL;DR: A generalization bound is derived that specifically considers the distribution difference, and the model is evaluated on a number of applications; for an siRNA efficacy prediction problem, examples extracted from heterogeneous source tasks are used to learn the target model, achieving an average improvement of 30% in accuracy.
Abstract: Lack of labeled training examples is a common problem for many applications. At the same time, there is usually an abundance of labeled data from related tasks, but they have different distributions and outputs (e.g., different class labels, and different scales of regression values). Consider, for example, that there may be only a limited number of vaccine efficacy examples against the new epidemic swine flu H1N1, whereas there exists a large amount of labeled vaccine data against previous years' flu. However, it is difficult to directly apply the older flu vaccine data as training examples because of the difference in data distribution and efficacy output criteria between different viruses. To increase the sources of labeled data, we propose a method to utilize these examples whose marginal distribution and output criteria can be different. The idea is to first select a subset of source examples similar in distribution to the target data; all the selected instances are then "re-scaled" and assigned new output values from the labeled space of the target task. A new predictive model is built on the enlarged training set. We derive a generalization bound that specifically considers the distribution difference and further evaluate the model on a number of applications. For an siRNA efficacy prediction problem, we extract examples from 4 heterogeneous regression tasks and 2 classification tasks to learn the target model, and achieve an average improvement of 30% in accuracy.