Book Chapter

Text Document Analysis Using Map-Reduce Framework

TLDR
The results show the advantage of the proposed map-reduce algorithm: it detects clusters of document features in less computation time and provides a premier solution for increasing the precision of retrieval in information extraction.
Abstract
With the advance of the Internet and increasing globalization, electronic forms of information are growing rapidly. Extracting useful hidden information from these many documents is a current challenge. Hence, an efficient, automated clustering algorithm that is effective at identifying topics plays a central role in information retrieval. In this paper, we analyze a large corpus of unstructured text documents using our proposed map-reduce algorithm. The results show the advantage of the proposed method: it detects clusters of document features in less computation time and provides a premier solution for increasing the precision of retrieval in information extraction.

