Journal Article

Text Mining Methods and Techniques

16 Jan 2014-International Journal of Computer Applications (Foundation of Computer Science (FCS))-Vol. 85, Iss: 17, pp 42-45
TL;DR: This survey paper discusses successful text mining techniques and methods that improve the effectiveness of information retrieval, along with the types of situations in which each technique may be useful to users.
Abstract: In recent years, the volume of digital data has grown rapidly, and knowledge discovery and data mining have attracted great attention because of the need to turn such data into useful information and knowledge. The information and knowledge extracted from large amounts of data benefit many applications, such as market analysis and business management. In many applications, databases store information in text form, so text mining is one of the most recent areas of research. Extracting the information a user requires is a challenging issue. Text mining is an important step in the knowledge discovery process. It extracts hidden information from unstructured to semi-structured data. Text mining is the discovery, by computer, of new, previously unknown information through automatic extraction from different written resources. This survey paper tries to cover the text mining techniques and methods that address these challenges. In this survey paper we discuss successful techniques and methods that improve the effectiveness of information retrieval in text mining. The types of situations in which each technology may be useful to users are also discussed.


Citations
Journal Article
TL;DR: This survey focused on analyzing the text mining studies related to Facebook and Twitter, the two dominant social media in the world, to describe how studies in social media have used text analytics and text mining techniques for the purpose of identifying the key themes in the data.
Abstract: Text mining has become one of the trendy fields that has been incorporated in several research fields such as computational linguistics, Information Retrieval (IR), and data mining. Natural Language Processing (NLP) techniques were used to extract knowledge from the textual text that is written by human beings. Text mining reads an unstructured form of data to provide meaningful information patterns in the shortest time period. Social networking sites are a great source of communication, as most people in today’s world use these sites in their daily lives to keep connected to each other. It has become a common practice to not write a sentence with correct grammar and spelling. This practice may lead to different kinds of ambiguities, such as lexical, syntactic, and semantic, and due to this type of unclear data, it is hard to find out the actual data order. Accordingly, we are conducting an investigation with the aim of looking for different text mining methods to get various textual orders on social media websites. This survey aims to describe how studies in social media have used text analytics and text mining techniques for the purpose of identifying the key themes in the data. This survey focused on analyzing the text mining studies related to Facebook and Twitter, the two dominant social media in the world. Results of this survey can serve as the baselines for future text mining research.

158 citations


Cites background from "Text Mining Methods and Techniques"

  • ...A management information system is capable of incorporating the resulting information, and as a result, significant knowledge is produced for the user of that information system [57]....


Book Chapter
01 Jan 2018
TL;DR: A comprehensive overview about text mining and its current research status is demonstrated and experimental results indicated that Springer database represents the main source for research articles in the field of mobile education for the medical domain.
Abstract: Nowadays, research in text mining has become one of the widespread fields in analyzing natural language documents. The present study demonstrates a comprehensive overview about text mining and its current research status. As indicated in the literature, there is a limitation in addressing Information Extraction from research articles using Data Mining techniques. The synergy between them helps to discover different interesting text patterns in the retrieved articles. In our study, we collected, and textually analyzed through various text mining techniques, three hundred refereed journal articles in the field of mobile learning from six scientific databases, namely: Springer, Wiley, Science Direct, SAGE, IEEE, and Cambridge. The selection of the collected articles was based on the criteria that all these articles should incorporate mobile learning as the main component in the higher educational context. Experimental results indicated that Springer database represents the main source for research articles in the field of mobile education for the medical domain. Moreover, results where the similarity among topics could not be detected were due to either their interrelations or ambiguity in their meaning. Furthermore, findings showed that there was a booming increase in the number of published articles during the years 2015 through 2016. In addition, other implications and future perspectives are presented in the study.

125 citations

Journal Article
TL;DR: In this paper, the authors investigated what are the key attributes and the structural relationship of those key attributes in hotel reviews and applied semantic network analysis, factor analysis and regression analysis to understand the experience and satisfaction of the hotel customer.
Abstract: With the development of social media, customers are sharing their experiences, and it is rapidly spreading as a form of online review. That is why the online review has become a significant information source affecting customers’ purchase intention and behavior. Therefore, it is important to understand the customer’s experience shown in the online review in order to maintain sustainable customer satisfaction and loyalty. The purpose of this study is to investigate what are the key attributes and the structural relationship of those key attributes. To accomplish this purpose, a total of 6596 hotel reviews were collected from Google (google.com). A frequency analysis using text mining was performed to figure out the most frequently mentioned attributes. In addition, semantic network analysis, factor analysis, and regression analysis were applied to understand the experience and satisfaction of the hotel customer. As a result, the top 99 keywords were divided into four groups such as “Intangible Service”, “Physical Environment”, “Purpose”, and “Location”. The factor analysis reduced the dimension of the original 64 keywords to 22 keywords, and grouped them into five factors, which are “Access”, “F&B (Food and Beverage)”, “Purpose”, “Tangibles”, and “Empathy”. Based on these results, theoretical and practical implications for sustainable hotel marketing strategies are suggested.

50 citations

Journal Article
TL;DR: A comprehensive review of text analytics finds that the ontology- and rule-based approach has been dominant; at the same time, recent research has attempted to apply state-of-the-art machine learning methods.

30 citations

Journal Article
TL;DR: This paper presents a comparison study on the classification of scientific unstructured text documents (e-books) based on the full text, applying the most popular topic modeling approaches (LDA, LSA) to cluster the words into a set of topics that serve as important keywords for classification.
Abstract: With the rapid growth of information technology, the amount of unstructured text data in digital libraries is increasing rapidly, and analyzing, organizing, and automatically classifying text in e-research repositories so as to benefit from it has become a major challenge. Manual categorization of text documents requires substantial financial and human resources. To address this, topic modeling is used to classify documents. This paper presents a comparison study on the classification of scientific unstructured text documents (e-books) based on the full text, applying the most popular topic modeling approaches (LDA, LSA) to cluster the words into a set of topics that serve as important keywords for classification. Our dataset consists of 300 books containing about 23 million words of full text. In the topic models used (LSA, LDA), each word in the corpus vocabulary is connected with one or more topics with a probability, as estimated by the model. Many LDA and LSA models were built, yielding different coherence values, and the one that produced the highest coherence value was selected. The results showed that LDA performs better than LSA: the best LDA result was a coherence value of 0.592179 with 20 topics, while the best LSA coherence value was 0.5773026 with 10 topics.

30 citations


Additional excerpts

  • ...The text exploration process includes many functions, such as cleaning up unstructured data to be available for text analytics, text classification, text clustering, keyword extraction, document summarization, and entity relationship model in [22]....


References
Journal Article
TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
Abstract: The experimental evidence accumulated over the past 20 years indicates that text-indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term weighting systems. This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
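
As an illustration of weighted single-term indexing, the sketch below computes tf-idf weights, one widely used weighting scheme. The abstract above does not give a formula, so the choice of tf-idf, the function name `tfidf_weights`, and the toy documents are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed scheme: weight = term frequency * log(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical toy collection of two already-tokenized documents.
toy_docs = [["text", "mining", "text"], ["data", "mining"]]
toy_weights = tfidf_weights(toy_docs)
```

Under this scheme a term that occurs in every document (here "mining") gets weight zero, while a term concentrated in few documents is weighted up, which is the intuition behind discriminating index terms.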

9,460 citations


"Text Mining Methods and Techniques" refers to methods in this paper

  • ...Term based methods suffer from the problems of polysemy and synonymy[1]....


Proceedings Article
01 Jan 2000
TL;DR: In this article, an inner product in the feature space consisting of all subsequences of length k was introduced for comparing two text documents, where a subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously.
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results.
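
The kernel described above can be sketched with a memoized dynamic-programming recursion. This is a minimal, unnormalized illustration under stated assumptions: the recursion is the commonly cited formulation of the subsequence kernel, and the function name `subsequence_kernel`, the parameter name `decay`, and the example strings are hypothetical choices, not code from the paper.

```python
from functools import lru_cache


def subsequence_kernel(s, t, k, decay):
    """Unnormalized kernel over all (possibly gapped) subsequences of length k.

    Each shared subsequence contributes decay ** (span in s + span in t),
    so occurrences that are close to contiguous weigh more.
    """

    @lru_cache(maxsize=None)
    def k_prime(i, m, n):
        # Auxiliary kernel on the prefixes s[:m] and t[:n]; it charges the
        # decay from the start of each match up to the END of both prefixes.
        if i == 0:
            return 1.0
        if min(m, n) < i:
            return 0.0
        last = s[m - 1]
        total = decay * k_prime(i, m - 1, n)
        for j in range(n):
            if t[j] == last:
                total += k_prime(i - 1, m - 1, j) * decay ** (n - j + 1)
        return total

    @lru_cache(maxsize=None)
    def k_full(m, n):
        # Kernel of order k on the prefixes s[:m] and t[:n].
        if min(m, n) < k:
            return 0.0
        last = s[m - 1]
        total = k_full(m - 1, n)
        for j in range(n):
            if t[j] == last:
                total += k_prime(k - 1, m - 1, j) * decay ** 2
        return total

    return k_full(len(s), len(t))


# "cat" and "car" share only the length-2 subsequence "ca".
example_value = subsequence_kernel("cat", "car", 2, 0.5)
```

The memoization avoids enumerating the feature space explicitly, which is the point the abstract makes about exponential growth in k; in practice the value is usually normalized by the self-kernels of the two strings, a step omitted here.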

1,464 citations

Journal Article
TL;DR: A novel kernel is introduced for comparing two text documents; it is an inner product in the feature space consisting of all subsequences of length k, and it can be efficiently evaluated by a dynamic programming technique.
Abstract: We propose a novel approach for categorizing text documents based on the use of a special kernel. The kernel is an inner product in the feature space generated by all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences that are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. Experimental comparisons of the performance of the kernel compared with a standard word feature space kernel (Joachims, 1998) show positive results on modestly sized datasets. The case of contiguous subsequences is also considered for comparison with the subsequences kernel with different decay factors. For larger documents and datasets the paper introduces an approximation technique that is shown to deliver good approximations efficiently for large datasets.

1,281 citations


"Text Mining Methods and Techniques" refers to methods in this paper

  • ...The typical text categorization process consists of pre-processing, indexing, dimensionality reduction, and classification[3][4]....


Journal Article
TL;DR: This work develops an automatic text categorization approach and investigates its application to text retrieval, demonstrating the effectiveness of the approach and showing that retrieval using automatic categorization achieves the same quality as retrieval using manual categorization.
Abstract: We develop an automatic text categorization approach and investigate its application to text retrieval. The categorization approach is derived from a combination of a learning paradigm known as instance-based learning and an advanced document retrieval technique known as retrieval feedback. We demonstrate the effectiveness of our categorization approach using two real-world document collections from the MEDLINE database. Next, we investigate the application of automatic categorization to text retrieval. Our experiments clearly indicate that automatic categorization improves the retrieval performance compared with no categorization. We also demonstrate that the retrieval performance using automatic categorization achieves the same retrieval quality as the performance using manual categorization. Furthermore, detailed analysis of the retrieval performance on each individual test query is provided.

177 citations


"Text Mining Methods and Techniques" refers to methods in this paper

  • ...The typical text categorization process consists of pre-processing, indexing, dimensionality reduction, and classification[3][4]....


Proceedings Article
18 Dec 2006
TL;DR: The performance of the pattern deploying algorithms for text mining is investigated on the Reuters dataset RCV1, and the results show that the effectiveness is improved by using the proposed pattern refinement approaches.
Abstract: Text mining is the technique that helps users find useful information from a large amount of digital text documents on the Web or databases. Instead of the keyword-based approach which is typically used in this field, the pattern-based model containing frequent sequential patterns is employed to perform the same concept of tasks. However, how to effectively use these discovered patterns is still a big challenge. In this study, we propose two approaches based on the use of pattern deploying strategies. The performance of the pattern deploying algorithms for text mining is investigated on the Reuters dataset RCV1 and the results show that the effectiveness is improved by using our proposed pattern refinement approaches.

131 citations


"Text Mining Methods and Techniques" refers to methods in this paper

  • ...The pattern based technique uses two processes: pattern deploying and pattern evolving[6]....
