Author

Yilun Jin

Other affiliations: Peking University
Bio: Yilun Jin is an academic researcher from Hong Kong University of Science and Technology. The author has contributed to research in topics: Graph (abstract data type) & Computer science. The author has an h-index of 9 and has co-authored 22 publications receiving 343 citations. Previous affiliations of Yilun Jin include Peking University.

Papers
Journal ArticleDOI
TL;DR: The SecureBoost framework is shown to be as accurate as other non-federated gradient tree-boosting algorithms that require centralized data, while remaining highly scalable and practical for industrial applications such as credit risk analysis.
Abstract: The protection of user privacy is an important concern in machine learning, as evidenced by the rollout of the General Data Protection Regulation (GDPR) in the European Union (EU) in May 2018. The GDPR is designed to give users more control over their personal data, which motivates us to explore machine learning frameworks for data sharing that do not violate user privacy. To meet this goal, in this paper, we propose a novel lossless privacy-preserving tree-boosting system known as SecureBoost in the setting of federated learning. This federated-learning system allows the learning process to be jointly conducted over multiple parties with partially common user samples but different feature sets, which corresponds to a vertically partitioned data set. An advantage of SecureBoost is that it provides the same level of accuracy as the non-privacy-preserving approach while revealing no information about any private data provider. We formally prove that the SecureBoost framework is as accurate as other non-federated gradient tree-boosting algorithms that concentrate data in one place. In addition, we describe information leakage during the protocol execution and propose ways to provably reduce it.
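To make the vertically partitioned setting concrete, here is a minimal sketch of SecureBoost-style split finding, assuming an additively homomorphic scheme such as Paillier. Encryption is simulated with identity placeholders, and all function names, shapes, and bucket counts are illustrative rather than the paper's actual implementation.

```python
import numpy as np

# Illustrative sketch of split finding on vertically partitioned data.
# Encryption is simulated with identity functions; in the actual protocol
# the active party would use additively homomorphic encryption (e.g.,
# Paillier) so passive parties never see raw gradients.

def encrypt(x):  # placeholder for homomorphic encryption
    return x

def decrypt(x):  # placeholder for decryption by the active party
    return x

def split_gain(gl, hl, g, h, lam=1.0):
    # Standard XGBoost-style gain for a candidate split.
    gr, hr = g - gl, h - hl
    return gl**2 / (hl + lam) + gr**2 / (hr + lam) - g**2 / (h + lam)

# Active party: holds the labels, computes per-sample gradients.
rng = np.random.default_rng(0)
g = rng.normal(size=100)          # first-order gradients
h = np.ones(100)                  # second-order gradients (squared loss)
enc_g, enc_h = encrypt(g), encrypt(h)   # sent to passive parties

# Passive party: holds a private feature; buckets samples and aggregates
# the *encrypted* gradient sums per bucket (homomorphic addition only).
feature = rng.uniform(size=100)
bins = np.digitize(feature, np.quantile(feature, [0.25, 0.5, 0.75]))
bucket_g = np.array([enc_g[bins == b].sum() for b in range(4)])
bucket_h = np.array([enc_h[bins == b].sum() for b in range(4)])

# Active party: decrypts the aggregates and evaluates candidate splits.
G, H = g.sum(), h.sum()
gains = [split_gain(decrypt(bucket_g[:k + 1].sum()),
                    decrypt(bucket_h[:k + 1].sum()), G, H)
         for k in range(3)]
print("best split threshold index:", int(np.argmax(gains)))
```

The key point is that the passive party only performs homomorphic additions of ciphertexts per feature bucket, while the active party, which holds the labels, decrypts the aggregates and selects the best split, so raw gradients never leave the active party in the clear.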

321 citations

Posted Content
TL;DR: SecureBoost as mentioned in this paper is a lossless privacy-preserving tree-boosting system for federated learning, which allows the learning process to be jointly conducted over multiple parties with common user samples but different feature sets, corresponding to a vertically partitioned data set.
Abstract: The protection of user privacy is an important concern in machine learning, as evidenced by the rollout of the General Data Protection Regulation (GDPR) in the European Union (EU) in May 2018. The GDPR is designed to give users more control over their personal data, which motivates us to explore machine learning frameworks for data sharing that do not violate user privacy. To meet this goal, in this paper, we propose a novel lossless privacy-preserving tree-boosting system known as SecureBoost in the setting of federated learning. SecureBoost first conducts entity alignment under a privacy-preserving protocol and then constructs boosting trees across multiple parties with a carefully designed encryption strategy. This federated learning system allows the learning process to be jointly conducted over multiple parties with common user samples but different feature sets, which corresponds to a vertically partitioned data set. An advantage of SecureBoost is that it provides the same level of accuracy as the non-privacy-preserving approach while revealing no information about any private data provider. We show that the SecureBoost framework is as accurate as other non-federated gradient tree-boosting algorithms that require centralized data, and thus it is highly scalable and practical for industrial applications such as credit risk analysis. In addition, we discuss information leakage during the protocol execution and propose ways to provably reduce it.
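The entity-alignment step mentioned above can be sketched as follows. This toy version uses salted hashing purely to illustrate the data flow; an actual deployment would use a cryptographic private set intersection protocol (e.g., blind-RSA based), and every identifier and the salt below are hypothetical.

```python
import hashlib

# NOT secure against brute force on low-entropy IDs; this only sketches
# how two parties align their common user samples without exchanging
# their full ID lists in the clear.

def blind(ids, salt):
    # Map each raw ID to a salted hash the other party cannot invert easily.
    return {hashlib.sha256((salt + i).encode()).hexdigest(): i for i in ids}

salt = "shared-secret-salt"              # assumed pre-agreed between parties
party_a = blind(["u1", "u2", "u3", "u7"], salt)
party_b = blind(["u2", "u3", "u9"], salt)

# Each party only reveals hashed identifiers; the intersection of hashes
# determines the aligned sample set used for vertical training.
common_hashes = party_a.keys() & party_b.keys()
aligned_ids = sorted(party_a[h] for h in common_hashes)
print("aligned samples:", aligned_ids)   # ['u2', 'u3']
```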

128 citations

Posted Content
TL;DR: The paper identifies the need to exploit unlabeled data in FL and surveys possible research fields that can contribute to this goal, a potentially promising research topic.
Abstract: Federated Learning (FL), proposed in recent years, has received significant attention from researchers in that it can bring separate data sources together and build machine learning models in a collaborative but private manner. Yet, in most applications of FL, such as keyboard prediction, labeling data requires virtually no additional effort, which is not generally the case. In reality, acquiring large-scale labeled datasets can be extremely costly, which motivates research works that exploit unlabeled data to help build machine learning models. However, to the best of our knowledge, few existing works aim to utilize unlabeled data to enhance federated learning, which leaves a potentially promising research topic. In this paper, we identify the need to exploit unlabeled data in FL, and survey possible research fields that can contribute to the goal.
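As one concrete example of what exploiting unlabeled data in FL can look like, below is a hypothetical pseudo-labeling client update. It is a generic semi-supervised recipe, not a method proposed by the paper, and all names and hyperparameters are illustrative.

```python
import numpy as np

# A client exploits its unlabeled data by pseudo-labeling with the current
# global model, then runs a few local SGD steps before sending the update
# back to the server. Logistic regression stands in for the global model.

def predict_proba(w, X):
    return 1.0 / (1.0 + np.exp(-X @ w))

def local_update(w_global, X_unlabeled, threshold=0.9, lr=0.1, steps=10):
    w = w_global.copy()
    p = predict_proba(w, X_unlabeled)
    # Keep only confident predictions as pseudo-labels.
    mask = (p > threshold) | (p < 1 - threshold)
    X, y = X_unlabeled[mask], (p[mask] > 0.5).astype(float)
    for _ in range(steps):  # local SGD on the pseudo-labeled subset
        grad = X.T @ (predict_proba(w, X) - y) / max(len(y), 1)
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
w_global = rng.normal(size=5)
X_client = rng.normal(size=(200, 5))   # the client's private unlabeled data
w_client = local_update(w_global, X_client)
print("update norm:", np.linalg.norm(w_client - w_global))
```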

38 citations

Proceedings ArticleDOI
Qingqing Long, Yiming Wang, Lun Du, Guojie Song, Yilun Jin, Wei Lin
03 Nov 2019
TL;DR: This paper proposes SpaceNE, a network embedding framework that preserves hierarchies formed by communities through subspaces, manifolds with flexible dimensionalities that are inherently hierarchical, and shows that subspaces can further address problems in representing hierarchical communities, including sparsity and space warps.
Abstract: To depict ubiquitous relational data in the real world, network data have been widely applied in modeling complex relationships. Projecting vertices to low-dimensional spaces, known as Network Embedding, would thus be applicable to diverse real-world predictive tasks. However, while numerous works exploit pairwise proximities, one characteristic of real networks, the clustering property, namely that vertices are inclined to form communities of various ranges and hence form a hierarchy of communities, has barely received attention from researchers. In this paper, we propose our network embedding framework, abbreviated SpaceNE, which preserves hierarchies formed by communities through subspaces, manifolds with flexible dimensionalities that are inherently hierarchical. Moreover, we propose that subspaces are able to address further problems in representing hierarchical communities, including sparsity and space warps. Last but not least, we propose constraints on the dimensions of subspaces to denoise, which are further approximated by differentiable functions such that joint optimization is enabled, along with a layer-wise scheme to alleviate the overhead caused by the vast number of parameters. We conduct various experiments with results demonstrating our model's effectiveness in addressing community hierarchies.
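The subspace idea can be illustrated with a small numerical sketch: each community is modeled as a linear subspace of its parent's space, so embeddings are obtained by composing projection matrices down the community hierarchy. The bases and dimensions below are illustrative stand-ins for SpaceNE's learned parameters.

```python
import numpy as np

# Each community is a linear subspace of its parent's space; child
# coordinates are obtained by composing projections down the hierarchy.

rng = np.random.default_rng(2)
d_root, d_parent, d_child = 16, 8, 4

# Orthonormal bases: parent subspace inside the root space, child inside parent.
B_parent, _ = np.linalg.qr(rng.normal(size=(d_root, d_parent)))
B_child, _ = np.linalg.qr(rng.normal(size=(d_parent, d_child)))

x = rng.normal(size=d_root)      # a node embedding in the root space
x_parent = B_parent.T @ x        # coordinates within the parent community
x_child = B_child.T @ x_parent   # coordinates within the child community

# Nesting means the composed map B_parent @ B_child is again an orthonormal
# basis of a subspace of the root space, which is how the hierarchy is kept.
B_composed = B_parent @ B_child
print(np.allclose(B_composed.T @ B_composed, np.eye(d_child)))  # True
```

Letting d_child shrink relative to d_parent is one way to read the abstract's dimension constraints: deeper, smaller communities get lower-dimensional representations.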

35 citations

Proceedings ArticleDOI
Yizhou Zhang, Guojie Song, Lun Du, Shuwen Yang, Yilun Jin
01 Jul 2019
TL;DR: Domain Adaptive Network Embedding (DANE) as mentioned in this paper applies graph convolutional networks to learn transferable embeddings by encoding nodes from multiple networks via a shared set of learnable parameters, so that the vectors share an aligned embedding space.
Abstract: Recent works reveal that network embedding techniques enable many machine learning models to handle diverse downstream tasks on graph structured data. However, as previous methods usually focus on learning embeddings for a single network, they cannot learn representations transferable across multiple networks. Hence, it is important to design a network embedding algorithm that supports downstream model transferring on different networks, known as domain adaptation. In this paper, we propose a novel Domain Adaptive Network Embedding framework, which applies graph convolutional networks to learn transferable embeddings. In DANE, nodes from multiple networks are encoded to vectors via a shared set of learnable parameters so that the vectors share an aligned embedding space. The distributions of embeddings on different networks are further aligned by adversarial learning regularization. In addition, DANE's advantage in learning transferable network embedding can be guaranteed theoretically. Extensive experiments reflect that the proposed framework outperforms other state-of-the-art network embedding baselines in cross-network domain adaptation tasks.
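A minimal sketch of DANE's two ingredients, a GCN encoder shared across networks and an adversarial alignment term, follows. The adversarial discriminator is reduced to a first-moment proxy here, and all shapes and names are illustrative rather than the paper's implementation.

```python
import numpy as np

# (1) One GCN layer with weights shared by both networks, so embeddings
# live in a common space; (2) a stand-in for the adversarial term that
# pushes the two embedding distributions together.

def normalize_adj(A):
    # Symmetric normalization D^{-1/2} (A + I) D^{-1/2}, as in GCNs.
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def gcn_encode(A, X, W):
    # Shared-parameter GCN layer: ReLU(normalized_A @ X @ W).
    return np.maximum(normalize_adj(A) @ X @ W, 0.0)

rng = np.random.default_rng(3)
W = rng.normal(size=(6, 4)) * 0.1        # shared across both networks

def random_graph(n, p):
    A = (rng.uniform(size=(n, n)) < p).astype(float)
    A = np.triu(A, 1)
    return A + A.T

Z_src = gcn_encode(random_graph(20, 0.2), rng.normal(size=(20, 6)), W)
Z_tgt = gcn_encode(random_graph(30, 0.1), rng.normal(size=(30, 6)), W)

# Adversarial alignment (conceptual): a discriminator tries to tell source
# from target embeddings while the encoder learns to fool it. A cheap proxy
# for the aligned-distribution objective is matching first moments:
gap = np.linalg.norm(Z_src.mean(0) - Z_tgt.mean(0))
print("embedding-mean gap before training:", round(float(gap), 4))
```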

30 citations


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, handwriting recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Posted Content
TL;DR: Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
Abstract: Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
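For reference, the canonical recipe underlying the setting described above is federated averaging: clients run local updates on private data and the server averages the resulting models. The sketch below uses a toy least-squares model; it is a generic FedAvg illustration, not code from the paper.

```python
import numpy as np

# One FedAvg-style training loop: each round, clients train locally on
# data that never leaves them, and the server takes a size-weighted
# average of the returned model weights.

def local_sgd(w, X, y, lr=0.05, steps=20):
    w = w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(4)
w_true = np.array([1.0, -2.0, 0.5])
w_global = np.zeros(3)

for rnd in range(5):                      # communication rounds
    client_weights, client_sizes = [], []
    for _ in range(4):                    # each client keeps its data local
        n = int(rng.integers(30, 60))
        X = rng.normal(size=(n, 3))
        y = X @ w_true + 0.1 * rng.normal(size=n)
        client_weights.append(local_sgd(w_global, X, y))
        client_sizes.append(n)
    # Server: weighted average of client models, no raw data exchanged.
    w_global = np.average(client_weights, axis=0,
                          weights=np.array(client_sizes, dtype=float))

print("recovered weights:", np.round(w_global, 2))
```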

1,107 citations

Journal ArticleDOI
TL;DR: This paper aims to provide a comprehensive study concerning FL’s security and privacy aspects that can help bridge the gap between the current state of federated AI and a future in which mass adoption is possible.

565 citations

Journal ArticleDOI
01 Mar 2021
TL;DR: In this article, the authors provide a review of federated learning in the biomedical space, summarize the general solutions to the statistical challenges, system challenges, and privacy issues in federated learning, and point out the implications and potentials in healthcare.
Abstract: With the rapid development of computer software and hardware technologies, more and more healthcare data are becoming readily available from clinical institutions, patients, insurance companies, and pharmaceutical industries, among others. This access provides an unprecedented opportunity for data science technologies to derive data-driven insights and improve the quality of care delivery. Healthcare data, however, are usually fragmented and private, making it difficult to generate robust results across populations. For example, different hospitals own the electronic health records (EHR) of different patient populations, and these records are difficult to share across hospitals because of their sensitive nature. This creates a big barrier for developing effective analytical approaches that are generalizable, which need diverse "big data." Federated learning, a mechanism of training a shared global model with a central server while keeping all the sensitive data in the local institutions where the data belong, provides great promise to connect fragmented healthcare data sources with privacy preservation. The goal of this survey is to provide a review of federated learning technologies, particularly within the biomedical space. In particular, we summarize the general solutions to the statistical challenges, system challenges, and privacy issues in federated learning, and point out the implications and potentials in healthcare.

396 citations

Proceedings ArticleDOI
TL;DR: Graph Contrastive Coding (GCC), a self-supervised graph neural network pre-training framework, is designed to capture the universal network topological properties across multiple networks and leverages contrastive learning to empower graph neural networks to learn intrinsic and transferable structural representations.
Abstract: Graph representation learning has emerged as a powerful technique for addressing real-world problems. Various downstream graph learning tasks have benefited from its recent developments, such as node classification, similarity search, and graph classification. However, prior art on graph representation learning focuses on domain-specific problems and trains a dedicated model for each graph dataset, which is usually non-transferable to out-of-domain data. Inspired by the recent advances in pre-training from natural language processing and computer vision, we design Graph Contrastive Coding (GCC), a self-supervised graph neural network pre-training framework, to capture the universal network topological properties across multiple networks. We design GCC's pre-training task as subgraph instance discrimination in and across networks and leverage contrastive learning to empower graph neural networks to learn the intrinsic and transferable structural representations. We conduct extensive experiments on three graph learning tasks and ten graph datasets. The results show that GCC pre-trained on a collection of diverse datasets can achieve competitive or better performance than its task-specific and trained-from-scratch counterparts. This suggests that the pre-training and fine-tuning paradigm presents great potential for graph representation learning.
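The contrastive objective behind subgraph instance discrimination can be sketched with an InfoNCE loss: two augmented views of the same subgraph should embed close together, and views of different subgraphs far apart. The encoder is omitted below and embeddings are random stand-ins, so this illustrates the loss only, under assumed names and a typical temperature value.

```python
import numpy as np

# InfoNCE: the query should match its positive key (another view of the
# same subgraph instance) against a set of negative keys (other subgraphs).

def info_nce(q, keys, pos_idx, tau=0.07):
    # q: query embedding; keys: candidate embeddings, one of which
    # (at pos_idx) comes from the same subgraph instance as q.
    q = q / np.linalg.norm(q)
    keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = keys @ q / tau                       # cosine similarities / tau
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[pos_idx]                    # cross-entropy on positive

rng = np.random.default_rng(5)
anchor = rng.normal(size=32)
positive = anchor + 0.1 * rng.normal(size=32)   # another view, same subgraph
negatives = rng.normal(size=(15, 32))           # views of other subgraphs
keys = np.vstack([positive[None, :], negatives])

print("InfoNCE loss:", float(info_nce(anchor, keys, pos_idx=0)))
```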

325 citations