Book Chapter
Text Document Analysis Using Map-Reduce Framework
K. V. Kanimozhi, P. Prabhavathy, M. Venkatesan +2 more
- pp 585-594
TLDR
The results show the advantage of the proposed map-reduce algorithm, which detects clusters of document features in less computation time and provides a solution for increasing the precision rate of retrieval in information extraction.
Abstract:
With the advance of the Internet and increasing globalization, electronic forms of information are growing rapidly. Extracting useful hidden information from multiple documents is a current challenge. Hence, an efficient, automated clustering algorithm that is effective at identifying topics plays a central role in information retrieval. In this paper, a large unstructured text document corpus is analyzed using the proposed map-reduce algorithm. The results show the advantage of the proposed method, which detects clusters of document features in less computation time and provides a solution for increasing the precision rate of retrieval in information extraction.
References
Journal Article
Latent dirichlet allocation
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article
Latent Dirichlet Allocation
TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Book Chapter
A Survey of Text Clustering Algorithms
TL;DR: This chapter will study the key challenges of the clustering problem, as it applies to the text domain, and discuss the key methods used for text clustering, and their relative advantages.
Journal Article
A framework for understanding Latent Semantic Indexing (LSI) performance
TL;DR: This work presents a theoretical model for understanding the performance of Latent Semantic Indexing (LSI) search and retrieval applications, and shows a strong correlation between second-order term co-occurrence and the values produced by the SVD algorithm that forms the foundation for LSI.
Journal Article
TopCat: data mining for topic identification in a text corpus
TL;DR: An evaluation against a manually categorized ground truth news corpus shows the TopCat technique is effective in identifying topics in collections of news articles.