PPLSA: Parallel probabilistic latent semantic analysis based on MapReduce
Ning Li, Fuzhen Zhuang, Qing He, Zhongzhi Shi
- pp 40-49
TL;DR
A parallel PLSA algorithm called PPLSA is proposed to accommodate large corpus collections in the MapReduce framework; it distributes computation efficiently and is relatively simple to implement.
Abstract:
PLSA (Probabilistic Latent Semantic Analysis) is a popular topic modeling technique for exploring document collections. Due to the increasing prevalence of large datasets, there is a need to improve the scalability of computation in PLSA. In this paper, we propose a parallel PLSA algorithm called PPLSA to accommodate large corpus collections in the MapReduce framework. Our solution efficiently distributes computation and is relatively simple to implement.
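The abstract does not spell out how the computation is distributed, so the following is only a minimal single-process sketch of one common way to phrase PLSA's EM updates as a map step and a reduce step. It is not the PPLSA algorithm itself. All names (`map_document`, `reduce_counts`, `run_plsa`) and the choice to keep P(z|d) with each document while aggregating P(w|z) in the reducer are assumptions made for illustration.

```python
import random
from collections import defaultdict


def map_document(counts, p_z_d, p_w_z, num_topics):
    """Map step (hypothetical): E-step for one document.

    Emits ((topic, word), weighted count) pairs for the reducer and
    returns the document's updated P(z|d).
    """
    emitted = []
    new_p_z_d = [0.0] * num_topics
    for word, n in counts.items():
        # Posterior P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = [p_z_d[z] * p_w_z[z][word] for z in range(num_topics)]
        total = sum(joint)
        if total == 0.0:
            continue
        for z in range(num_topics):
            weight = n * joint[z] / total
            emitted.append(((z, word), weight))
            new_p_z_d[z] += weight
    doc_total = sum(new_p_z_d)
    if doc_total > 0.0:
        new_p_z_d = [v / doc_total for v in new_p_z_d]
    else:
        new_p_z_d = list(p_z_d)
    return emitted, new_p_z_d


def reduce_counts(emitted, vocab, num_topics):
    """Reduce step (hypothetical): sum counts per key, then normalize P(w|z)."""
    sums = defaultdict(float)
    for key, value in emitted:
        sums[key] += value
    p_w_z = []
    for z in range(num_topics):
        topic_total = sum(sums.get((z, w), 0.0) for w in vocab)
        if topic_total > 0.0:
            p_w_z.append({w: sums.get((z, w), 0.0) / topic_total for w in vocab})
        else:
            p_w_z.append({w: 1.0 / len(vocab) for w in vocab})
    return p_w_z


def normalized(values):
    total = sum(values)
    return [v / total for v in values]


def run_plsa(corpus, num_topics, iterations, seed=0):
    """Run EM, with each iteration written as one map pass and one reduce pass."""
    rng = random.Random(seed)
    vocab = sorted({w for counts in corpus.values() for w in counts})
    # Strictly positive random initialization so no posterior is all zeros.
    p_w_z = []
    for _ in range(num_topics):
        probs = normalized([0.1 + rng.random() for _ in vocab])
        p_w_z.append(dict(zip(vocab, probs)))
    p_z_d = {
        d: normalized([0.1 + rng.random() for _ in range(num_topics)])
        for d in corpus
    }
    for _ in range(iterations):
        all_emitted = []
        for d, counts in corpus.items():  # each call could run on a separate mapper
            emitted, p_z_d[d] = map_document(counts, p_z_d[d], p_w_z, num_topics)
            all_emitted.extend(emitted)
        p_w_z = reduce_counts(all_emitted, vocab, num_topics)
    return p_w_z, p_z_d
```

In this sketch each mapper only needs its own documents plus the current P(w|z) table, which is the property that lets the corpus be split across machines. A real MapReduce job would also need to broadcast P(w|z) to the mappers and persist P(z|d) between iterations.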
Citations
Book
Intelligent Data Engineering and Automated Learning -- IDEAL 2011: 12th International Conference, Norwich, UK, September 7-9, 2011. Proceedings ... Applications, incl. Internet/Web, and HCI)
TL;DR: This book constitutes the refereed proceedings of the 12th International Conference on Intelligent Data Engineering and Automated Learning, IDEAL 2011, held in Norwich, UK, in September 2011.
Journal Article
Self-organizing weighted incremental probabilistic latent semantic analysis
TL;DR: A novel Weighted Incremental PLSA algorithm called WIPLSA is proposed to dynamically discover topics and incrementally learn them from new documents, making it applicable to big data mining.
Dissertation
Learning features for text classification
Mari Ostendorf, Bin Zhang
TL;DR: Phrase patterns, a type of feature, and an efficient algorithm for learning them from labeled training data are proposed; the patterns are particularly useful for tasks that involve modeling long-range, complex behaviors such as those seen in social media data, and they are more flexible than n-gram features.
Proceedings Article
Big Data Processing with Probabilistic Latent Semantic Analysis on MapReduce
TL;DR: A parallel method to train PLSA is proposed by adapting the traditional EM algorithm to MapReduce, a computing framework for processing vast amounts of data in parallel on clusters, so that the main memory of each computer needs to load only part of the dataset.
References
Book
Data Mining: Concepts and Techniques
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Journal Article
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Journal Article
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Journal Article
The Google file system
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.