Author

Tianyu Cao

Other affiliations: University of Vermont
Bio: Tianyu Cao is an academic researcher from Amazon.com. The author has contributed to research in topics: Computer science & Maximization. The author has an h-index of 6 and has co-authored 14 publications receiving 170 citations. Previous affiliations of Tianyu Cao include University of Vermont.

Papers
Journal ArticleDOI
TL;DR: An integrated framework on autodetection of subkilometer craters with boosting and transfer learning that can achieve an F1 score above 0.85, a significant improvement over the other crater detection algorithms.
Abstract: Counting craters in remotely sensed images is the only tool that provides relative dating of remote planetary surfaces. Surveying craters requires counting a large number of small subkilometer craters, which calls for highly efficient automatic crater detection. In this article, we present an integrated framework on autodetection of subkilometer craters with boosting and transfer learning. The framework contains three key components. First, we utilize mathematical morphology to efficiently identify crater candidates, the regions of an image that can potentially contain craters. Only those regions, which occupy relatively small portions of the original image, are subject to further processing. Second, we extract and select image texture features, in combination with supervised boosting ensemble learning algorithms, to accurately classify crater candidates into craters and noncraters. Third, we integrate transfer learning into boosting, to enhance detection performance in the regions where surface morphology differs from what is characterized by the training set. Our framework is evaluated on a large test image of 37,500 × 56,250 m² on Mars, which exhibits a heavily cratered Martian terrain characterized by nonuniform surface morphology. Empirical studies demonstrate that the proposed crater detection framework can achieve an F1 score above 0.85, a significant improvement over the other crater detection algorithms.

62 citations
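
For readers unfamiliar with the F1 score quoted in the abstract above, the following is a minimal sketch of how it is computed from detection counts. The function name and the counts in the usage note are hypothetical and are not taken from the paper.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when there are no true positives."""
    if true_positives == 0:
        return 0.0
    # Precision: fraction of detected candidates that are real craters.
    precision = true_positives / (true_positives + false_positives)
    # Recall: fraction of real craters that were detected.
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

With hypothetical counts of 90 correct detections, 10 false detections, and 20 missed craters, `f1_score(90, 10, 20)` evaluates to 180/210, roughly 0.857.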

Proceedings ArticleDOI
22 Mar 2010
TL;DR: It is proved that finding an optimal allocation in a modular social network is NP-hard and a new optimal dynamic programming algorithm is proposed to solve the problem, which is named OASNET (Optimal Allocation in a Social NETwork).
Abstract: Influence maximization in a social network is to target a given number of nodes in the network such that the expected number of activated nodes from these nodes is maximized. A social network usually exhibits some degree of modularity. Previous research efforts that made use of this topological property are restricted to random networks with two communities. In this paper, we firstly transform the influence maximization problem in a modular social network to an optimal resource allocation problem in the same network. We assume that the communities of the social network are disconnected. We then propose a recursive relation for finding such an optimal allocation. We prove that finding an optimal allocation in a modular social network is NP-hard and propose a new optimal dynamic programming algorithm to solve the problem. We name our new algorithm OASNET (Optimal Allocation in a Social NETwork). We compare OASNET with equal allocation, proportional allocation, random allocation and selecting top degree nodes without any allocation strategy on both synthetic and real world datasets. Experimental results show that OASNET outperforms these four heuristics.

44 citations
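
The allocation view described in the abstract above can be illustrated with a generic resource-allocation dynamic program. This is a sketch under stated assumptions, not the authors' OASNET: it assumes the communities are disconnected (as the abstract does) and that a hypothetical table `gains[c][j]`, giving the expected spread in community `c` when `j` seeds are placed there, is already available. How such a table would be estimated is not covered here.

```python
from typing import List, Tuple


def allocate_seeds(gains: List[List[float]], budget: int) -> Tuple[float, List[int]]:
    """Split `budget` seeds across disconnected communities to maximize total spread.

    gains[c][j] is the (assumed known) expected spread in community c with j seeds;
    gains[c][0] should be 0. Returns (best total spread, seeds per community).
    """
    n = len(gains)
    neg_inf = float("-inf")
    # best[c][b]: best spread using the first c communities and at most b seeds.
    best = [[0.0] * (budget + 1)] + [[neg_inf] * (budget + 1) for _ in range(n)]
    choice = [[0] * (budget + 1) for _ in range(n + 1)]
    for c in range(1, n + 1):
        table = gains[c - 1]
        for b in range(budget + 1):
            # Try giving j seeds to community c and the rest to earlier communities.
            for j in range(min(b, len(table) - 1) + 1):
                value = best[c - 1][b - j] + table[j]
                if value > best[c][b]:
                    best[c][b] = value
                    choice[c][b] = j
    # Walk back through the recorded choices to recover the allocation.
    allocation = [0] * n
    b = budget
    for c in range(n, 0, -1):
        allocation[c - 1] = choice[c][b]
        b -= choice[c][b]
    return best[n][budget], allocation
```

The recursion runs in time proportional to the number of communities times the square of the budget, which is what makes the table-based formulation attractive once per-community spreads are in hand.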

Journal ArticleDOI
TL;DR: It is proved that finding an optimal allocation in a modular social network is NP-hard and a new dynamic programming algorithm is proposed to solve the problem, which is named OASNET (Optimal Allocation in a Social NETwork).
Abstract: Influence maximization in a social network is to target a given number of nodes in the network such that the expected number of activated nodes from these nodes is maximized. A social network usually exhibits some degree of modularity. Previous research efforts that made use of this topological property are restricted to random networks with two communities. In this paper, we firstly transform the influence maximization problem in a modular social network to an optimal resource allocation problem. We assume that the communities of the social network are disconnected. We then propose a recursive relation for finding such an optimal allocation. We prove that finding an optimal allocation in a modular social network is NP-hard and propose a new dynamic programming algorithm to solve the problem. We name our new algorithm OASNET (Optimal Allocation in a Social NETwork). We compare OASNET with the high degree heuristics, the single degree discount heuristics, and the degree discount heuristics on three real world datasets. Experimental results show that OASNET outperforms comparison heuristics significantly on the independent cascade model when the diffusion probability is greater than a certain threshold.

34 citations

Proceedings ArticleDOI
28 Mar 2022
TL;DR: This paper explores multilingual KG completion, which leverages limited seed alignment as a bridge, to embrace the collective knowledge from multiple languages and proposes a novel self-supervised adaptive graph alignment (SS-AGA) method, which fuses all KGs as a whole graph by regarding alignment as a new edge type.
Abstract: Predicting missing facts in a knowledge graph (KG) is crucial as modern KGs are far from complete. Due to labor-intensive human labeling, this phenomenon deteriorates when handling knowledge represented in various languages. In this paper, we explore multilingual KG completion, which leverages limited seed alignment as a bridge, to embrace the collective knowledge from multiple languages. However, language alignment used in prior works is still not fully exploited: (1) alignment pairs are treated equally to maximally push parallel entities to be close, which ignores KG capacity inconsistency; (2) seed alignment is scarce and new alignment identification is usually in a noisily unsupervised manner. To tackle these issues, we propose a novel self-supervised adaptive graph alignment (SS-AGA) method. Specifically, SS-AGA fuses all KGs as a whole graph by regarding alignment as a new edge type. As such, information propagation and noise influence across KGs can be adaptively controlled via relation-aware attention weights. Meanwhile, SS-AGA features a new pair generator that dynamically captures potential alignment pairs in a self-supervised paradigm. Extensive experiments on both the public multilingual DBPedia KG and a newly created industrial multilingual E-commerce KG empirically demonstrate the effectiveness of SS-AGA.

20 citations

Book ChapterDOI
05 Sep 2011
TL;DR: Extensive experimental evaluations on five popular network datasets demonstrate that the proposed weighted sampling algorithm outperforms pure random sampling in terms of both model accuracy and the proposed objective function.
Abstract: Previous research efforts on the influence maximization problem assume that the network model parameters are known beforehand. However, this is rarely true in real world networks. This paper deals with the situation when the network information diffusion parameters are unknown. To this end, we firstly examine the parameter sensitivity of a popular diffusion model in influence maximization, i.e., the linear threshold model, to motivate the necessity of learning the unknown model parameters. Experiments show that the influence maximization problem is sensitive to the model parameters under the linear threshold model. In the sequel, we formally define the problem of finding the model parameters for influence maximization as an active learning problem under the linear threshold model. We then propose a weighted sampling algorithm to solve this active learning problem. Extensive experimental evaluations on five popular network datasets demonstrate that the proposed weighted sampling algorithm outperforms pure random sampling in terms of both model accuracy and the proposed objective function.

18 citations
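
The linear threshold model mentioned in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's code: the function name and data layout are hypothetical, and the thresholds are passed in as fixed values, whereas in the usual formulation they are drawn at random. A node becomes active once the summed weight of its already active in-neighbors reaches its threshold.

```python
from typing import Dict, Set


def linear_threshold_spread(
    in_weights: Dict[str, Dict[str, float]],
    thresholds: Dict[str, float],
    seeds: Set[str],
) -> Set[str]:
    """Return the set of nodes active once the diffusion stops.

    in_weights[v][u] is the influence weight of edge u -> v;
    thresholds[v] is the activation threshold of node v (assumed given).
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, threshold in thresholds.items():
            if node in active:
                continue
            # Total weight arriving from neighbors that are already active.
            incoming = sum(
                weight
                for neighbor, weight in in_weights.get(node, {}).items()
                if neighbor in active
            )
            if incoming >= threshold:
                active.add(node)
                changed = True
    return active
```

Because activation is monotone (nodes never deactivate), the final active set does not depend on the order in which nodes are scanned. Changing an edge weight or threshold slightly can flip whether a node activates, which is one way to see why the outcome may be sensitive to the model parameters.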


Cited by
Journal ArticleDOI
TL;DR: A survey of discretization methods, whose main goal is to transform a set of continuous attributes into discrete ones by associating categorical values with intervals, thus transforming quantitative data into qualitative data.
Abstract: Discretization is an essential preprocessing technique used in many knowledge discovery and data mining tasks. Its main goal is to transform a set of continuous attributes into discrete ones, by associating categorical values to intervals and thus transforming quantitative data into qualitative data. In this manner, symbolic data mining algorithms can be applied over continuous data and the representation of information is simplified, making it more concise and specific. The literature provides numerous proposals of discretization and some attempts to categorize them into a taxonomy can be found. However, in previous papers, there is a lack of consensus in the definition of the properties and no formal categorization has been established yet, which may be confusing for practitioners. Furthermore, only a small set of discretizers have been widely considered, while many other methods have gone unnoticed. With the intention of alleviating these problems, this paper provides a survey of discretization methods proposed in the literature from a theoretical and empirical perspective. From the theoretical perspective, we develop a taxonomy based on the main properties pointed out in previous research, unifying the notation and including all the known methods up to date. Empirically, we conduct an experimental study in supervised classification involving the most representative and newest discretizers, different types of classifiers, and a large number of data sets. The results of their performances measured in terms of accuracy, number of intervals, and inconsistency have been verified by means of nonparametric statistical tests. Additionally, a set of discretizers are highlighted as the best performing ones.

419 citations
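
As a concrete picture of what "associating categorical values to intervals" means, here is a minimal sketch of equal-width binning, one of the simplest unsupervised discretization schemes. It is offered only as an illustration; the function name is hypothetical, and the abstract above does not say which discretizers the survey evaluates or how.

```python
from typing import List


def equal_width_bins(values: List[float], n_bins: int) -> List[int]:
    """Map each continuous value to the index of an equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into a single interval.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For example, `equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], 4)` splits the range 0 to 10 into four intervals of width 2.5 and returns `[0, 1, 2, 3, 3]`.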

Journal ArticleDOI
TL;DR: A novel Online Streaming Feature Selection method to select strongly relevant and nonredundant features on the fly and an efficient Fast-OSFS algorithm is proposed to improve feature selection performance.
Abstract: We propose a new online feature selection framework for applications with streaming features where the knowledge of the full feature space is unknown in advance. We define streaming features as features that flow in one by one over time whereas the number of training examples remains fixed. This is in contrast with traditional online learning methods that only deal with sequentially added observations, with little attention being paid to streaming features. The critical challenges for Online Streaming Feature Selection (OSFS) include 1) the continuous growth of feature volumes over time, 2) a large feature space, possibly of unknown or infinite size, and 3) the unavailability of the entire feature set before learning starts. In the paper, we present a novel Online Streaming Feature Selection method to select strongly relevant and nonredundant features on the fly. An efficient Fast-OSFS algorithm is proposed to improve feature selection performance. The proposed algorithms are evaluated extensively on high-dimensional datasets and also with a real-world case study on impact crater detection. Experimental results demonstrate that the algorithms achieve better compactness and higher prediction accuracy than existing streaming feature selection algorithms.

240 citations

Journal ArticleDOI
TL;DR: Proposes CoFIM, a community-based framework for influence maximization on large-scale networks, from which the authors derive a simple, submodular, and efficiently computable evaluation form of the total influence spread, together with a fast algorithm to select the seed nodes.
Abstract: Influence maximization is a classic optimization problem studied in the area of social network analysis and viral marketing. Given a network, it is defined as the problem of finding k seed nodes so that the influence spread of the network can be optimized. Kempe et al. have proved that this problem is NP hard and the objective function is submodular, based on which a greedy algorithm was proposed to give a near-optimal solution. However, this simple greedy algorithm is time consuming, which limits its application on large-scale networks. Heuristic algorithms generally cannot provide any performance guarantee. To solve this problem, in this paper we propose CoFIM, a community-based framework for influence maximization on large-scale networks. In our framework the influence propagation process is divided into two phases: (i) seeds expansion; and (ii) intra-community propagation. The first phase is the expansion of seed nodes among different communities at the beginning of diffusion. The second phase is the influence propagation within communities which are independent of each other. Based on the framework, we derive a simple evaluation form of the total influence spread which is submodular and can be efficiently computed. Then we further propose a fast algorithm to select the seed nodes. Experimental results on synthetic and nine real-world large datasets including networks with millions of nodes and hundreds of millions of edges show that our algorithm achieves competitive results in influence spread as compared with state-of-the-art algorithms and it is much more efficient in terms of both time and memory usage.

139 citations

Journal ArticleDOI
TL;DR: This paper introduces fuzzy mutual information to evaluate the quality of features in multilabel learning, and designs efficient algorithms to conduct multilabel feature selection when the feature space is completely known or partially known in advance.
Abstract: Due to complex semantics, a sample may be associated with multiple labels in various classification and recognition tasks. Multilabel learning generates training models to map feature vectors to multiple labels. There are several significant challenges in multilabel learning. Samples in multilabel learning are usually described with high-dimensional features and some features may be sequentially extracted. Thus, we do not know the full feature set at the beginning of learning, referred to as streaming features. In this paper, we introduce fuzzy mutual information to evaluate the quality of features in multilabel learning, and design efficient algorithms to conduct multilabel feature selection when the feature space is completely known or partially known in advance. These algorithms are called multilabel feature selection with label correlation (MUCO) and multilabel streaming feature selection (MSFS), respectively. MSFS consists of two key steps: online relevance analysis and online redundancy analysis. In addition, we design a metric to measure the correlation between the label sets, and both MUCO and MSFS take label correlation into consideration. The proposed algorithms are not only able to select features from streaming features, but also able to select features for ordinal multilabel learning. However, streaming feature selection is more efficient. The proposed algorithms are tested with a collection of multilabel learning tasks. The experimental results illustrate the effectiveness of the proposed algorithms.

121 citations

Journal ArticleDOI
TL;DR: This work formalizes the problem of online streaming feature selection for class imbalanced data, and presents an efficient online feature selection framework regarding the dependency between condition features and decision classes, and proposes a new algorithm of Online Feature Selection based on the Dependency in K nearest neighbors, called K-OFSD.
Abstract: When tackling high dimensionality in data mining, online feature selection which deals with features flowing in one by one over time, presents more advantages than traditional feature selection methods. However, in real-world applications, such as fraud detection and medical diagnosis, the data is high-dimensional and highly class imbalanced, namely there are many more instances of some classes than others. In such cases of class imbalance, existing online feature selection algorithms usually ignore the small classes which can be important in these applications. It is hence a challenge to learn from high-dimensional and class imbalanced data in an online manner. Motivated by this, we first formalize the problem of online streaming feature selection for class imbalanced data, and then present an efficient online feature selection framework regarding the dependency between condition features and decision classes. Meanwhile, we propose a new algorithm of Online Feature Selection based on the Dependency in K nearest neighbors, called K-OFSD. In terms of Neighborhood Rough Set theory, K-OFSD uses the information of nearest neighbors to select relevant features which can get higher separability between the majority class and the minority class. Finally, experimental studies on seven high-dimensional and class imbalanced data sets show that our algorithm can achieve better performance than traditional feature selection methods with the same numbers of features and state-of-the-art online streaming feature selection algorithms in an online manner.

113 citations