Open Access · Book Chapter

PPLSA: Parallel probabilistic latent semantic analysis based on MapReduce

TLDR
A parallel PLSA algorithm called PPLSA is proposed to accommodate large corpus collections in the MapReduce framework; it distributes computation efficiently and is relatively simple to implement.
Abstract
Probabilistic Latent Semantic Analysis (PLSA) is a popular topic modeling technique for exploring document collections. As large datasets become more prevalent, the computation in PLSA needs to scale better. In this paper, we propose a parallel PLSA algorithm called PPLSA to accommodate large corpus collections in the MapReduce framework. Our solution distributes computation efficiently and is relatively simple to implement.
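
The abstract does not say how PPLSA partitions the work, so the following is only a minimal sketch of one plausible way to fit a PLSA EM iteration into a map/reduce shape, not the authors' algorithm. It assumes the standard PLSA model P(w|d) = Σ_z P(w|z) P(z|d), runs in a single process, and uses hypothetical names throughout (`map_doc`, `reduce_sum`, `em_iteration`). The map step performs the E-step for one document and emits expected counts; the reduce step sums them by key; a final normalization completes the M-step.

```python
import math
import random
from collections import defaultdict


def init_params(docs, vocab, n_topics, seed=0):
    """Random, strictly positive, normalized starting parameters (an assumption)."""
    rng = random.Random(seed)

    def normalized(n):
        xs = [rng.random() + 0.1 for _ in range(n)]
        total = sum(xs)
        return [x / total for x in xs]

    p_w_z = [dict(zip(vocab, normalized(len(vocab)))) for _ in range(n_topics)]
    p_z_d = {d: normalized(n_topics) for d in docs}
    return p_w_z, p_z_d


def map_doc(doc_id, counts, p_w_z, p_z_d):
    """Map: E-step for one document, emitting (key, expected count) pairs."""
    for word, n in counts.items():
        joint = [p_w_z[z][word] * p_z_d[doc_id][z] for z in range(len(p_w_z))]
        total = sum(joint)
        for z, j in enumerate(joint):
            expected = n * j / total  # n(d, w) * P(z | d, w)
            yield ("word", z, word), expected
            yield ("doc", doc_id, z), expected


def reduce_sum(pairs):
    """Reduce: sum the emitted values for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return totals


def em_iteration(docs, vocab, p_w_z, p_z_d):
    """One EM iteration: map over documents, reduce, then normalize (M-step)."""
    n_topics = len(p_w_z)
    emitted = (kv for d, c in docs.items() for kv in map_doc(d, c, p_w_z, p_z_d))
    totals = reduce_sum(emitted)

    new_p_w_z = []
    for z in range(n_topics):
        row = {w: totals.get(("word", z, w), 0.0) for w in vocab}
        s = sum(row.values())
        new_p_w_z.append({w: v / s for w, v in row.items()})

    new_p_z_d = {}
    for d in docs:
        row = [totals.get(("doc", d, z), 0.0) for z in range(n_topics)]
        s = sum(row)
        new_p_z_d[d] = [v / s for v in row]
    return new_p_w_z, new_p_z_d


def log_likelihood(docs, p_w_z, p_z_d):
    """Corpus log-likelihood under the current parameters."""
    ll = 0.0
    for d, counts in docs.items():
        for w, n in counts.items():
            p = sum(p_w_z[z][w] * p_z_d[d][z] for z in range(len(p_w_z)))
            ll += n * math.log(p)
    return ll
```

In a real MapReduce job the documents would be split across mappers and the current P(w|z) table would have to be made available to each of them; how PPLSA handles that distribution is not described in the abstract. The sketch also assumes every document contains at least one word.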
