PPLSA: Parallel probabilistic latent semantic analysis based on MapReduce

doi:10.1007/978-3-642-32891-6_8

Open Access Book Chapter

Ning Li, +4 more

- pp 40-49

TLDR

A parallel PLSA algorithm called PPLSA is proposed to accommodate large corpus collections in the MapReduce framework; it distributes computation efficiently and is relatively simple to implement.

Abstract:

PLSA (Probabilistic Latent Semantic Analysis) is a popular topic modeling technique for exploring document collections. Due to the increasing prevalence of large datasets, there is a need to improve the scalability of computation in PLSA. In this paper, we propose a parallel PLSA algorithm called PPLSA to accommodate large corpus collections in the MapReduce framework. Our solution efficiently distributes computation and is relatively simple to implement.
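The abstract does not spell out how PPLSA splits the work, so the following is only a minimal sketch of one common way to phrase a PLSA EM iteration as a map step and a reduce step. It is not the authors' implementation. The function names (`initialize`, `map_document`, `reduce_counts`, `em_iteration`), the key layout, and the in-memory "shuffle" are all assumptions made for illustration; a real job would run on a MapReduce runtime rather than in a single Python process.

```python
import random
from collections import defaultdict


def initialize(corpus, num_topics, seed=0):
    """Random starting point for P(w|z) and P(z|d) (hypothetical helper)."""
    rng = random.Random(seed)
    vocab = sorted({w for counts in corpus.values() for w in counts})
    p_w_z = []
    for _ in range(num_topics):
        raw = {w: rng.random() + 0.01 for w in vocab}
        norm = sum(raw.values())
        p_w_z.append({w: v / norm for w, v in raw.items()})
    p_z_d = {}
    for doc_id in corpus:
        raw = [rng.random() + 0.01 for _ in range(num_topics)]
        norm = sum(raw)
        p_z_d[doc_id] = [v / norm for v in raw]
    return p_w_z, p_z_d


def map_document(doc_id, word_counts, p_w_z, p_z_d, num_topics):
    """Map step: E-step for one document, emitting keyed expected counts."""
    for word, count in word_counts.items():
        # Posterior P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = [p_z_d[doc_id][z] * p_w_z[z].get(word, 0.0)
                 for z in range(num_topics)]
        total = sum(joint)
        if total == 0.0:
            continue
        for z in range(num_topics):
            expected = count * joint[z] / total
            yield ("w|z", z, word), expected    # feeds the P(w|z) update
            yield ("z|d", doc_id, z), expected  # feeds the P(z|d) update


def reduce_counts(pairs):
    """Reduce step: sum the expected counts that share a key."""
    sums = defaultdict(float)
    for key, value in pairs:
        sums[key] += value
    return sums


def em_iteration(corpus, p_w_z, p_z_d, num_topics):
    """One EM iteration: map over documents, reduce, then normalize (M-step)."""
    pairs = []
    for doc_id, word_counts in corpus.items():
        pairs.extend(map_document(doc_id, word_counts, p_w_z, p_z_d, num_topics))
    sums = reduce_counts(pairs)

    new_p_w_z = [dict() for _ in range(num_topics)]
    new_p_z_d = {doc_id: [0.0] * num_topics for doc_id in corpus}
    for (kind, first, second), value in sums.items():
        if kind == "w|z":
            new_p_w_z[first][second] = value
        else:
            new_p_z_d[first][second] = value

    # Normalize so each P(.|z) and each P(.|d) sums to one.
    for z in range(num_topics):
        norm = sum(new_p_w_z[z].values())
        if norm > 0.0:
            new_p_w_z[z] = {w: v / norm for w, v in new_p_w_z[z].items()}
    for doc_id, row in new_p_z_d.items():
        norm = sum(row)
        if norm > 0.0:
            new_p_z_d[doc_id] = [v / norm for v in row]
    return new_p_w_z, new_p_z_d
```

In this sketch the corpus is a dictionary from document id to word counts, and `em_iteration` is called repeatedly until the parameters stop changing. Because each document is processed independently in the map step, a worker only needs its own share of the documents plus the current parameters.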

Citations

Data Mining - Concepts and Techniques.

Petra Perner

Book

Intelligent Data Engineering and Automated Learning -- IDEAL 2011: 12th International Conference, Norwich, UK, September 7-9, 2011. Proceedings ... Applications, incl. Internet/Web, and HCI)

Hujun Yin, +2 more

TL;DR: This book constitutes the refereed proceedings of the 12th International Conference on Intelligent Data Engineering and Automated Learning, IDEAL 2011, held in Norwich, UK, in September 2011.

Journal ArticleDOI

Self-organizing weighted incremental probabilistic latent semantic analysis

Ning Li, +5 more

- 01 Dec 2018 -

International Journal of Machine Learnin...

TL;DR: A novel Weighted Incremental PLSA algorithm called WIPLSA is proposed to dynamically discover topics and incrementally learn the topics from new documents to be applicable for big data mining.

Dissertation

Learning features for text classification

Mari Ostendorf, +1 more

TL;DR: Phrase patterns, a type of feature, and an efficient algorithm to learn them from labeled training data are proposed; they are particularly useful for tasks that involve modeling long-range complex behaviors, as seen in social media data, and are more flexible than n-gram features.

Proceedings ArticleDOI

Big Data Processing with Probabilistic Latent Semantic Analysis on MapReduce

Yong Zhao, +4 more

TL;DR: A parallel method to train PLSA is proposed by adapting the traditional EM algorithm to MapReduce, a computing framework for processing vast amounts of data in parallel on clusters, so that the main memory of each computer only needs to load part of the dataset.

References

Book

Data Mining: Concepts and Techniques

Jiawei Han, +2 more

TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

Journal ArticleDOI

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.

Journal ArticleDOI

MapReduce: simplified data processing on large clusters

Jeffrey Dean, +1 more

- 01 Jan 2008 -

Communications of The ACM

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

Data Mining - Concepts and Techniques.

Petra Perner

Journal ArticleDOI

The Google file system

Sanjay Ghemawat, +2 more

TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.

Journal of The Institution of Engineers ...

Related Papers (5)

A parallel Probabilistic Latent Semantic Analysis method on MapReduce platform

Topic selection in latent dirichlet allocation

Parallel K-Means Clustering Based on MapReduce

Web-Scale Image Annotation

An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm