Author

Yuzhou Zhang

Bio: Yuzhou Zhang is an academic researcher from Google. The author has contributed to research in topics: Load balancing (computing) & Bottleneck. The author has an hindex of 2, co-authored 2 publications receiving 198 citations.

Papers

PDF

Open Access

More filters

Journal Article•DOI•

PLDA+: Parallel latent dirichlet allocation with data placement and pipeline processing

[...]

Zhiyuan Liu¹, Yuzhou Zhang¹, Edward Y. Chang¹, Maosong Sun²•Institutions (2)

Google¹, Tsinghua University²

06 May 2011-ACM Transactions on Intelligent Systems and Technology

TL;DR: Data placement, pipeline processing, word bundling, and priority-based scheduling are proposed to improve scalability of LDA and significantly reduce the unparallelizable communication bottleneck and achieve good load balancing.

...read moreread less

Abstract: Previous methods of distributed Gibbs sampling for LDA run into either memory or communication bottlenecks. To improve scalability, we propose four strategies: data placement, pipeline processing, word bundling, and priority-based scheduling. Experiments show that our strategies significantly reduce the unparallelizable communication bottleneck and achieve good load balancing, and hence improve scalability of LDA.

...read moreread less

190 citations

Patent•

Parallel generation of topics from documents

[...]

Zhiyuan Liu¹, Yuzhou Zhang¹, Edward Y. Chang¹•Institutions (1)

Google¹

11 May 2011

11 citations

Cited by

PDF

Open Access

More filters

Book Chapter•DOI•

Mining Text Data

[...]

Charu C. Aggarwal¹, ChengXiang Zhai²•Institutions (2)

IBM¹, University of Illinois at Urbana–Champaign²

03 Feb 2012

TL;DR: Mining Text Data introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining.

...read moreread less

Abstract: Text mining applications have experienced tremendous advances because of web 2.0 and social networking applications. Recent advances in hardware and software technology have lead to a number of unique scenarios where text mining algorithms are learned. Mining Text Data introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining. This book contains a wide swath in topics across social networks & data mining. Each chapter contains a comprehensive survey including the key research content on the topic, and the future directions of research in the field. There is a special focus on Text Embedded with Heterogeneous and Multimedia Data which makes the mining process much more challenging. A number of methods have been designed such as transfer learning and cross-lingual mining for such cases. Mining Text Data simplifies the content, so that advanced-level students, practitioners and researchers in computer science can benefit from this book. Academic and corporate libraries, as well as ACM, IEEE, and Management Science focused on information security, electronic commerce, databases, data mining, machine learning, and statistics are the primary buyers for this reference book.

...read moreread less

732 citations

Journal Article•DOI•

Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey

[...]

Hamed Jelodar¹, Yongli Wang¹, Chi Yuan¹, Xia Feng¹, Xiahui Jiang¹, Yanchao Li¹, Liang Zhao¹ - Show less +3 more•Institutions (1)

Nanjing University of Science and Technology¹

01 Jun 2019-Multimedia Tools and Applications

TL;DR: In this article, the authors investigated highly scholarly articles (between 2003 to 2016) related to topic modeling based on LDA to discover the research development, current trends and intellectual structure of topic modeling.

...read moreread less

Abstract: Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data and text documents. Researchers have published many articles in the field of topic modeling and applied in various fields such as software engineering, political science, medical and linguistic science, etc. There are various methods for topic modelling; Latent Dirichlet Allocation (LDA) is one of the most popular in this field. Researchers have proposed various models based on the LDA in topic modeling. According to previous work, this paper will be very useful and valuable for introducing LDA approaches in topic modeling. In this paper, we investigated highly scholarly articles (between 2003 to 2016) related to topic modeling based on LDA to discover the research development, current trends and intellectual structure of topic modeling. In addition, we summarize challenges and introduce famous tools and datasets in topic modeling based on LDA.

...read moreread less

608 citations

Posted Content•

Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey

[...]

Hamed Jelodar¹, Yongli Wang¹, Chi Yuan¹, Xia Feng¹, Xiahui Jiang¹, Yanchao Li¹, Liang Zhao¹ - Show less +3 more•Institutions (1)

Nanjing University of Science and Technology¹

12 Nov 2017-arXiv: Information Retrieval

TL;DR: In this article, the authors investigated the research development, current trends and intellectual structure of topic modeling based on Latent Dirichlet Allocation (LDA), and summarized challenges and introduced famous tools and datasets in topic modelling based on LDA.

...read moreread less

Abstract: Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data, text documents. Researchers have published many articles in the field of topic modeling and applied in various fields such as software engineering, political science, medical and linguistic science, etc. There are various methods for topic modeling, which Latent Dirichlet allocation (LDA) is one of the most popular methods in this field. Researchers have proposed various models based on the LDA in topic modeling. According to previous work, this paper can be very useful and valuable for introducing LDA approaches in topic modeling. In this paper, we investigated scholarly articles highly (between 2003 to 2016) related to Topic Modeling based on LDA to discover the research development, current trends and intellectual structure of topic modeling. Also, we summarize challenges and introduce famous tools and datasets in topic modeling based on LDA.

...read moreread less

546 citations

Proceedings Article•DOI•

LightLDA: Big Topic Models on Modest Computer Clusters

[...]

Jinhui Yuan¹, Fei Gao¹, Qirong Ho², Wei Dai³, Jinliang Wei³, Xun Zheng³, Eric P. Xing³, Tie-Yan Liu¹, Wei-Ying Ma¹ - Show less +5 more•Institutions (3)

Microsoft¹, Institute for Infocomm Research Singapore², Carnegie Mellon University³

18 May 2015

TL;DR: In this article, the authors show that with a modest cluster of as few as 8 machines, they can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters) on a document collection with 200 billion tokens.

...read moreread less

Abstract: When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners and academic researchers. We consider this challenge in the context of topic modeling on web-scale corpora, and show that with a modest cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens --- a scale not yet reported even with thousands of machines. Our major contributions include: 1) a new, highly-efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers; 2) a model-scheduling scheme to handle the big model challenge, where each worker machine schedules the fetch/use of sub-models as needed, resulting in a frugal use of limited memory capacity and network bandwidth; 3) a differential data-structure for model storage, which uses separate data structures for high- and low-frequency words to allow extremely large models to fit in memory, while maintaining high inference speed. These contributions are built on top of the Petuum open-source distributed ML framework, and we provide experimental evidence showing how this development puts massive data and models within reach on a small cluster, while still enjoying proportional time cost reductions with increasing cluster size.

...read moreread less

175 citations

Proceedings Article•

You Are What You Like! Information Leakage Through Users' Interests

[...]

Abdelberi Chaabane, Gergely Acs, Mohamed Ali Kaafar

05 Feb 2012

TL;DR: This paper shows how seemingly harmless interests of users can leak privacy sensitive information about users, and for the first time that user interests are used for profiling, and semantics-driven inference of private data is addressed.

...read moreread less

Abstract: Suppose that a Facebook user, whose age is hidden or missing, likes Britney Spears. Can you guess his/her age? Knowing that most Britney fans are teenagers, it is fairly easy for humans to answer this question. Interests (or "likes") of users is one of the highly-available on-line information. In this paper, we show how these seemingly harmless interests (e.g., music interests) can leak privacy sensitive information about users. In particular, we infer their undisclosed (private) attributes using the public attributes of other users sharing similar interests. In order to compare user-defined interest names, we extract their semantics using an ontologized version of Wikipedia and measure their similarity by applying a statistical learning method. Besides self-declared interests in music, our technique does not rely on any further information about users such as friend relationships or group belongings. Our experiments, based on more than 104K public profiles collected from Facebook and more than 2000 private profiles provided by volunteers, show that our inference technique efficiently predicts attributes that are very often hidden by users. To the best of our knowledge, this is the first time that user interests are used for profiling, and more generally, semantics-driven inference of private data is addressed.

...read moreread less

151 citations

Collapse