scispace - formally typeset
Author

Jinyang Gao

Bio: Jinyang Gao is an academic researcher from Alibaba Group. The author has contributed to research topics including deep learning and recommender systems. The author has an h-index of 14 and has co-authored 45 publications receiving 730 citations. Previous affiliations of Jinyang Gao include the National University of Singapore.

Papers
Proceedings ArticleDOI
01 May 2022
TL;DR: A novel multi-task framework called Contrastive Learning for Sequential Recommendation (CL4SRec) is proposed, which not only takes advantage of the traditional next item prediction task but also utilizes the contrastive learning framework to derive self-supervision signals from the original user behavior sequences.
Abstract: Sequential recommendation methods play a crucial role in modern recommender systems because of their ability to capture a user's dynamic interests from her/his historical interactions. Despite their success, we argue that these approaches usually rely on the sequential prediction task alone to optimize their huge number of parameters. They usually suffer from the data sparsity problem, which makes it difficult for them to learn high-quality user representations. To tackle this, inspired by recent advances in contrastive learning in computer vision, we propose a novel multi-task framework called Contrastive Learning for Sequential Recommendation (CL4SRec). CL4SRec not only takes advantage of the traditional next-item prediction task but also utilizes the contrastive learning framework to derive self-supervision signals from the original user behavior sequences. Therefore, it can extract more meaningful user patterns and further encode user representations effectively. In addition, we propose three data augmentation approaches to construct self-supervision signals. Extensive experiments on four public datasets demonstrate that CL4SRec achieves state-of-the-art performance over existing baselines by inferring better user representations.
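As a sketch of the kind of sequence augmentations the abstract mentions, two independently augmented views of one user's behavior sequence can serve as a positive pair for the contrastive task. The three operators below (crop, mask, reorder) and their ratios are illustrative assumptions, not code from the paper:

```python
import random

def crop(seq, ratio=0.6):
    """Keep a random contiguous sub-sequence."""
    n = max(1, int(len(seq) * ratio))
    start = random.randint(0, len(seq) - n)
    return seq[start:start + n]

def mask(seq, ratio=0.3, mask_token=0):
    """Replace a random subset of items with a mask token."""
    idx = set(random.sample(range(len(seq)), int(len(seq) * ratio)))
    return [mask_token if i in idx else x for i, x in enumerate(seq)]

def reorder(seq, ratio=0.3):
    """Shuffle a random contiguous sub-sequence."""
    n = max(1, int(len(seq) * ratio))
    start = random.randint(0, len(seq) - n)
    sub = seq[start:start + n]
    random.shuffle(sub)
    return seq[:start] + sub + seq[start + n:]

def two_views(seq):
    """Two independently augmented views of one sequence,
    used as a positive pair in the contrastive objective."""
    aug1, aug2 = random.sample([crop, mask, reorder], 2)
    return aug1(list(seq)), aug2(list(seq))
```

In a full model, both views would be encoded and pulled together in embedding space while views from other users are pushed apart.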

134 citations

Proceedings ArticleDOI
13 Oct 2015
TL;DR: A distributed deep learning system, called SINGA, for training big models over large datasets, which supports a variety of popular deep learning models and provides different neural net partitioning schemes for training large models.
Abstract: Deep learning has shown outstanding performance in various machine learning tasks. However, deep models' complex structures and massive training data make them expensive to train. In this paper, we present a distributed deep learning system, called SINGA, for training big models over large datasets. An intuitive programming model based on the layer abstraction is provided, which supports a variety of popular deep learning models. SINGA's architecture supports both synchronous and asynchronous training frameworks; hybrid training frameworks can also be customized to achieve good scalability. SINGA provides different neural-net partitioning schemes for training large models. SINGA is an Apache Incubator project released under the Apache License 2.0.
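A minimal sketch of the synchronous data-parallel step such a training framework performs: every worker computes a gradient on its data shard, the gradients are averaged, and one update is applied. The function name and shapes are hypothetical; SINGA's actual implementation is far more involved:

```python
def synchronous_step(worker_grads, params, lr=0.1):
    """One synchronous data-parallel update: average the gradients
    reported by all workers, then apply a single SGD step."""
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n
           for i in range(len(params))]
    return [p - lr * a for p, a in zip(params, avg)]
```

Asynchronous training would instead apply each worker's gradient as soon as it arrives, trading consistency for throughput.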

132 citations

Proceedings ArticleDOI
18 Jun 2014
TL;DR: Data Sensitive Hashing improves the hashing functions and hashing family, and is orthogonal to most of the recent state-of-the-art approaches which mainly focus on indexing and querying strategies.
Abstract: The need to locate the k-nearest data points with respect to a given query point in a multi- and high-dimensional space is common in many applications. Therefore, it is essential to provide efficient support for such a search. Locality Sensitive Hashing (LSH) has been widely accepted as an effective hash method for high-dimensional similarity search. However, data sets are typically not distributed uniformly over the space, and as a result, the buckets of LSH are unbalanced, causing the performance of LSH to degrade. In this paper, we propose a new and efficient method called Data Sensitive Hashing (DSH) to address this drawback. DSH improves the hashing functions and hashing family, and is orthogonal to most of the recent state-of-the-art approaches which mainly focus on indexing and querying strategies. DSH leverages data distributions and is capable of directly preserving the nearest neighbor relations. We show the theoretical guarantee of DSH, and demonstrate its efficiency experimentally.
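For context, here is a minimal sketch of the classical random-hyperplane LSH baseline that DSH improves upon: points whose sign patterns under random projections agree land in the same bucket and become candidate near neighbours. The class and its parameters are illustrative assumptions, not DSH itself:

```python
import random

class RandomProjectionLSH:
    """Minimal random-hyperplane LSH index."""
    def __init__(self, dim, n_planes=8, seed=42):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(n_planes)]
        self.buckets = {}

    def _signature(self, v):
        # Sign of the dot product with each random hyperplane.
        return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0)
                     for p in self.planes)

    def insert(self, key, v):
        self.buckets.setdefault(self._signature(v), []).append(key)

    def candidates(self, v):
        return self.buckets.get(self._signature(v), [])
```

Because the hyperplanes are data-independent, skewed data concentrates in a few buckets; DSH's contribution is to learn the hashing family from the data distribution instead.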

72 citations

Journal ArticleDOI
01 Oct 2018
TL;DR: Rafiki is developed and presented to provide the training and inference service of machine learning models, and facilitate complex analytics on top of cloud platforms, and provides distributed hyper-parameter tuning for the training service, and online ensemble modeling for the inference service which trades off between latency and accuracy.
Abstract: Big data analytics has gained massive momentum in the last few years. Applying machine learning models to big data has become an implicit requirement or expectation for most analysis tasks, especially in high-stakes applications. Typical applications include sentiment analysis of reviews for analyzing online products, image classification in food-logging applications for monitoring users' daily intake, and stock movement prediction. Extending traditional database systems to support such analysis is intriguing but challenging. First, it is almost impossible to implement all machine learning models in the database engine. Second, expert knowledge is required to optimize the training and inference procedures in terms of efficiency and effectiveness, which imposes a heavy burden on system users. In this paper, we develop and present a system, called Rafiki, that provides training and inference services for machine learning models and facilitates complex analytics on top of cloud platforms. Rafiki provides distributed hyper-parameter tuning for the training service, and online ensemble modeling for the inference service, which trades off between latency and accuracy. Experimental results confirm the efficiency, effectiveness, scalability, and usability of Rafiki.
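A toy sketch of the idea behind such a hyper-parameter tuning service, using plain random search: sample configurations from a search space, evaluate each, keep the best. All names, the search space, and the objective are hypothetical; Rafiki's distributed tuner is more sophisticated:

```python
import random

def random_search(train_eval, space, n_trials=20, seed=0):
    """Sample configs from `space`, score each with `train_eval`,
    and return the best config and its score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(choices) for name, choices in space.items()}
        score = train_eval(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical objective: peaks at lr=0.1, batch=32 (always <= 0).
def fake_train_eval(cfg):
    return -abs(cfg["lr"] - 0.1) - abs(cfg["batch"] - 32) / 100

space = {"lr": [0.001, 0.01, 0.1, 1.0], "batch": [16, 32, 64, 128]}
best, score = random_search(fake_train_eval, space, n_trials=50)
```

A distributed service would run the trial loop across workers and share the best-so-far state, but the control flow is the same.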

63 citations

Proceedings ArticleDOI
01 Jul 2018
TL;DR: In this article, a continuous bag-of-words model is employed to learn a "soft" time-aware context window for each medical concept in Electronic Medical Records (EMRs).
Abstract: Embeddings of medical concepts such as medication, procedure, and diagnosis codes in Electronic Medical Records (EMRs) are central to healthcare analytics. Previous work on medical concept embedding treats medical concepts and EMRs as words and documents, respectively. Nevertheless, such models miss the temporal nature of EMR data. On the one hand, the fact that two medical concepts appear consecutively does not mean they are temporally close; the correlation between them can instead be revealed by the time gap. On the other hand, the temporal scopes of medical concepts often vary greatly (e.g., common cold versus diabetes). In this paper, we propose to incorporate temporal information when embedding medical codes. Based on the Continuous Bag-of-Words model, we employ the attention mechanism to learn a "soft" time-aware context window for each medical concept. Experiments on public and proprietary datasets, using clustering and nearest-neighbour search tasks, demonstrate the effectiveness of our model, showing that it outperforms five state-of-the-art baselines.
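A toy illustration of a "soft" time-aware context window: weight each context concept by a softmax over its time gap to the target, so nearby events count more than distant ones. Here a fixed exponential decay stands in for the learned attention parameters of the paper:

```python
import math

def time_aware_weights(time_gaps, scale=30.0):
    """Softmax over negative absolute time gaps (e.g., in days):
    concepts closer in time to the target get larger weights."""
    scores = [-abs(g) / scale for g in time_gaps]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def context_vector(embeddings, time_gaps):
    """CBOW-style input: time-weighted average of context embeddings."""
    w = time_aware_weights(time_gaps)
    dim = len(embeddings[0])
    return [sum(w[i] * embeddings[i][d] for i in range(len(embeddings)))
            for d in range(dim)]
```

A fixed window of +/-k codes would give every neighbour weight 1/(2k); the soft window lets a chronic condition influence distant codes while an acute one fades quickly.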

62 citations


Cited by
Journal ArticleDOI
TL;DR: A review of the studies that developed data-driven building energy consumption prediction models, with a particular focus on reviewing the scopes of prediction, the data properties and the data preprocessing methods used, the machine learning algorithms utilized for prediction, and the performance measures used for evaluation is provided in this paper.
Abstract: Energy is the lifeblood of modern societies. In the past decades, the world's energy consumption and associated CO2 emissions increased rapidly due to the increases in population and comfort demands of people. Building energy consumption prediction is essential for energy planning, management, and conservation. Data-driven models provide a practical approach to energy consumption prediction. This paper offers a review of the studies that developed data-driven building energy consumption prediction models, with a particular focus on reviewing the scopes of prediction, the data properties and the data preprocessing methods used, the machine learning algorithms utilized for prediction, and the performance measures used for evaluation. Based on this review, existing research gaps are identified and future research directions in the area of data-driven building energy consumption prediction are highlighted.

1,015 citations

Journal ArticleDOI
TL;DR: The benefits and challenges of big data and machine learning in health care are discussed, which include flexibility and scalability compared with traditional biostatistical methods, which makes it deployable for many tasks, such as risk stratification, diagnosis and classification, and survival predictions.
Abstract: Analysis of big data by machine learning offers considerable advantages for assimilation and evaluation of large amounts of complex health-care data. However, to effectively use machine learning tools in health care, several limitations must be addressed and key issues considered, such as its clinical implementation and ethics in health-care delivery. Advantages of machine learning include flexibility and scalability compared with traditional biostatistical methods, which makes it deployable for many tasks, such as risk stratification, diagnosis and classification, and survival predictions. Another advantage of machine learning algorithms is the ability to analyse diverse data types (eg, demographic data, laboratory findings, imaging data, and doctors' free-text notes) and incorporate them into predictions for disease risk, diagnosis, prognosis, and appropriate treatments. Despite these advantages, the application of machine learning in health-care delivery also presents unique challenges that require data pre-processing, model training, and refinement of the system with respect to the actual clinical problem. Also crucial are ethical considerations, which include medico-legal implications, doctors' understanding of machine learning tools, and data privacy and security. In this Review, we discuss some of the benefits and challenges of big data and machine learning in health care.

569 citations

Posted Content
TL;DR: This article provides a taxonomy of GNN-based recommendation models according to the types of information used and recommendation tasks and systematically analyze the challenges of applying GNN on different types of data.
Abstract: Owing to the superiority of GNNs in learning on graph data and their efficacy in capturing collaborative signals and sequential patterns, utilizing GNN techniques in recommender systems has gained increasing interest in academia and industry. In this survey, we provide a comprehensive review of the most recent work on GNN-based recommender systems. We propose a classification scheme for organizing existing works. For each category, we briefly clarify the main issues and detail the corresponding strategies adopted by representative models. We also discuss the advantages and limitations of the existing strategies. Furthermore, we suggest several promising directions for future research. We hope this survey provides readers with a general understanding of recent progress in this field and sheds some light on future developments.
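A minimal sketch of the message-passing idea underlying GNN-based recommenders: on the user-item interaction graph, each node aggregates (here, averages) its neighbours' vectors, which is how collaborative signals propagate. This is a single pure-Python round for illustration, not any surveyed model:

```python
def propagate(user_vecs, item_vecs, edges):
    """One round of mean-aggregation message passing on a bipartite
    user-item graph. `edges` is a list of (user, item) interactions."""
    def mean(vs, dim):
        if not vs:
            return [0.0] * dim
        return [sum(v[d] for v in vs) / len(vs) for d in range(dim)]

    dim = len(next(iter(user_vecs.values())))
    new_users = {u: mean([item_vecs[i] for uu, i in edges if uu == u], dim)
                 for u in user_vecs}
    new_items = {i: mean([user_vecs[u] for u, ii in edges if ii == i], dim)
                 for i in item_vecs}
    return new_users, new_items
```

Stacking several such rounds lets a user's representation absorb signals from users two or more hops away, which plain matrix factorization cannot express directly.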

314 citations

Proceedings Article
01 Jan 2018
TL;DR: Experiments show that the proposed model outperformed the baselines in terms of accuracy of imputation, and a simple model on the imputed data can achieve state-of-the-art results on the prediction tasks, demonstrating the benefits of the model in downstream applications.
Abstract: Multivariate time series usually contain a large number of missing values, which hinders the application of advanced analysis methods to multivariate time series data. Conventional approaches to addressing the challenge of missing values, including mean/zero imputation, case deletion, and matrix factorization-based imputation, are all incapable of modeling the temporal dependencies and the complex distributions in multivariate time series. In this paper, we treat the problem of missing value imputation as data generation. Inspired by the success of Generative Adversarial Networks (GANs) in image generation, we propose to learn the overall distribution of a multivariate time series dataset with a GAN, which is further used to generate the missing values for each sample. Unlike image data, time series data are usually incomplete due to the nature of the data-recording process. A modified Gated Recurrent Unit is employed in the GAN to model the temporal irregularity of the incomplete time series. Experiments on two multivariate time series datasets show that the proposed model outperforms the baselines in terms of imputation accuracy. Experimental results also show that a simple model trained on the imputed data can achieve state-of-the-art results on the prediction tasks, demonstrating the benefits of our model in downstream applications.
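A toy stand-in for the temporal-decay intuition behind such a modified GRU: across a run of missing values, confidence in the last observation decays toward the series mean, so short gaps trust recent data while long gaps fall back to the global statistic. The update rule below is an illustrative simplification, not the paper's model:

```python
def decay_impute(series, mean, decay=0.9):
    """Fill None entries: blend the last observation with the series
    mean, with the observation's weight decaying per missing step."""
    filled, last, weight = [], mean, 0.0
    for x in series:
        if x is None:
            weight *= decay
            filled.append(weight * last + (1 - weight) * mean)
        else:
            last, weight = x, 1.0
            filled.append(x)
    return filled
```

The GAN-based approach replaces this hand-crafted rule with a learned generator, but the decaying trust in stale observations is the same mechanism the decay term implements.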

245 citations

Journal ArticleDOI
TL;DR: This paper surveys and synthesizes a wide spectrum of existing studies on crowdsourced data management and outlines key factors that need to be considered to improve crowdsourcing data management.
Abstract: Many important data management and analytics tasks cannot be completely addressed by automated processes. These tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Crowdsourcing platforms are an effective way to harness the capabilities of people (i.e., the crowd) to apply human computation to such tasks. Thus, crowdsourced data management has become an area of increasing interest in research and industry. We identify three important problems in crowdsourced data management. (1) Quality control: workers may return noisy or incorrect results, so effective techniques are required to achieve high quality. (2) Cost control: the crowd is not free, and cost control aims to reduce the monetary cost. (3) Latency control: human workers can be slow, particularly compared to automated computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors in designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans consisting of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies on crowdsourced data management. Based on this analysis, we then outline key factors that need to be considered to improve crowdsourced data management.
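A minimal sketch of the simplest quality-control technique such surveys cover: assign each task to several workers redundantly and aggregate by majority vote:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among redundant worker labels."""
    return Counter(answers).most_common(1)[0][0]

def aggregate(task_answers):
    """Aggregate a mapping of task -> list of worker answers."""
    return {task: majority_vote(ans) for task, ans in task_answers.items()}
```

Majority voting trades cost for quality (more workers per task); more advanced schemes weight votes by estimated worker reliability to get the same quality at lower cost.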

240 citations