Proceedings Article

Dissimilarity algorithm on conceptual graphs to mine text outliers

TL;DR: A dissimilarity algorithm is presented to detect outliers in a collection of text represented in Conceptual Graph Interchange Format (CGIF); results indicate that the method outperforms the existing method and correlates better with human judgements.
Abstract: Graphical text representation methods such as Conceptual Graphs (CGs) attempt to capture the structure and semantics of documents. As such, they are the preferred text representation approach for a wide range of problems, notably in natural language processing, information retrieval and text mining. In a number of these applications, it is necessary to measure the dissimilarity (or similarity) between knowledge represented in the CGs. In this paper, we present a dissimilarity algorithm to detect outliers in a collection of text represented in Conceptual Graph Interchange Format (CGIF). To avoid the NP-complete problem of graph matching, we introduce the use of a standard CG in the dissimilarity computation. We evaluate our method in the context of analyzing real-world financial statements to identify outlying performance indicators. For evaluation purposes, we compare the proposed dissimilarity function with a Dice-coefficient similarity function used in related previous work. Experimental results indicate that our method outperforms the existing method and correlates better with human judgements. In comparison to other text outlier detection methods, this approach captures the semantics of documents through the use of CGs and makes it convenient to detect outliers through a simple dissimilarity function. Furthermore, our proposed algorithm retains linear complexity as the number of CGs increases.
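The idea of comparing every graph against one standard CG, rather than matching graphs pairwise, can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the abstract does not give the paper's dissimilarity function, so the sketch uses 1 minus the Dice coefficient (the baseline similarity the abstract mentions) as a stand-in. The representation of a CG as a set of (concept, relation, concept) triples, the names `dice_similarity`, `dissimilarity` and `find_outliers`, the threshold of 0.5, and the example triples are all hypothetical.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Assumption: a conceptual graph is flattened to a set of
# (concept, relation, concept) triples. Real CGIF is richer than this.
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def dice_similarity(a: Graph, b: Graph) -> float:
    """Dice coefficient over triples: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # two empty graphs are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def dissimilarity(graph: Graph, standard: Graph) -> float:
    """Stand-in dissimilarity (NOT the paper's function): 1 - Dice."""
    return 1.0 - dice_similarity(graph, standard)


def find_outliers(graphs: Sequence[Graph], standard: Graph,
                  threshold: float = 0.5) -> List[int]:
    """Return indices of graphs whose dissimilarity exceeds the threshold.

    Each graph is compared once against the standard graph, so the number
    of comparisons grows linearly with the number of graphs.
    """
    return [i for i, g in enumerate(graphs)
            if dissimilarity(g, standard) > threshold]


# Hypothetical example data, loosely in the style of financial statements.
standard = frozenset({("Profit", "ATTR", "Increase"),
                      ("Revenue", "ATTR", "Increase")})
g0 = standard
g1 = frozenset({("Profit", "ATTR", "Increase"),
                ("Revenue", "ATTR", "Decrease")})
g2 = frozenset({("Loss", "ATTR", "Increase"),
                ("Revenue", "ATTR", "Decrease")})

outliers = find_outliers([g0, g1, g2], standard)
```

With this toy data only the third graph shares no triple with the standard graph, so it is the only one flagged. Comparing against a single reference graph is what keeps the cost linear in the number of graphs; pairwise comparison would grow quadratically.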
Citations
01 Jan 2006
TL;DR: A method for measuring the similarity of FCA concepts is presented, refining a previous proposal of the author; it yields a higher correlation with human judgement than other proposals in the literature for evaluating concept similarity in a taxonomy.
Abstract: Formal Concept Analysis (FCA) is proving useful in supporting difficult activities that are becoming fundamental in the development of the Semantic Web. Assessing concept similarity is one such activity, since it allows the identification of different concepts that are semantically close. In this paper, a method for measuring the similarity of FCA concepts is presented, which is a refinement of a previous proposal of the author. The refinement consists in determining the similarity of concept descriptors (attributes) by using the information content approach, rather than relying on human domain expertise. The adopted information content approach yields a higher correlation with human judgement than other proposals in the literature for evaluating concept similarity in a taxonomy.

124 citations

Journal Article
TL;DR: Six main directions are identified where research in text mining is heading: Deep Learning, Topic Models, Graphical Modeling, Summarization, Sentiment Analysis, and Learning from Unlabeled Text. These data-centric directions are also likely to influence future research in Natural Language Processing.
Abstract: In recent years, Text Mining has seen a tremendous spurt of growth as data scientists focus their attention on analyzing unstructured data. The main drivers for this growth have been big data as well as complex applications where the information in the text is often combined with other kinds of information in building predictive models. These applications require highly efficient and scalable algorithms to meet the overall performance demands. In this context, six main directions are identified where research in text mining is heading: Deep Learning, Topic Models, Graphical Modeling, Summarization, Sentiment Analysis, and Learning from Unlabeled Text. Each direction has its own motivations and goals. There is some overlap of concepts because of the common themes of text and prediction. The predictive models involved are typically ones that involve meta-information or tags that could be added to the text. These tags can then be used in other text processing tasks such as information extraction. While the boundary between the fields of Text Mining and Natural Language Processing is becoming increasingly blurry, the importance of predictive models for various applications involving text means there is still substantial growth potential within the traditional sub-fields of text mining. These data-centric directions are also likely to influence future research in Natural Language Processing, especially in resource-poor languages and in multilingual texts. WIREs Data Mining Knowl Discov 2015, 5:155-164. doi: 10.1002/widm.1154

20 citations

Proceedings Article
28 Jun 2011
TL;DR: An exploratory study of opinion mining on online movie reviews, collected from several forums and blogs and written by Malaysian reviewers, shows that the performance of machine learning techniques without any preprocessing of the micro-texts or feature selection is quite low.
Abstract: Advances in information technology, especially the Internet, have changed the way we communicate and express opinions or sentiments on the services or products that we consume. Opinion mining aims to automate the process of classifying opinions as positive or negative views. It benefits both customers and sellers in identifying the best product or service. Although researchers have explored new techniques for identifying sentiment polarity, little work has been done on opinion mining of text written by Malaysian reviewers. The same holds for micro-text. Therefore, in this study, we conduct exploratory research on opinion mining of online movie reviews collected from several forums and blogs written by Malaysians. The experimental data are tested using machine learning classifiers, i.e. Support Vector Machine, Naive Bayes and k-Nearest Neighbor. The results show that the performance of these machine learning techniques without any preprocessing of the micro-texts or feature selection is quite low. Additional steps are therefore required in order to mine the opinions from these data.

13 citations

Journal Article
01 May 2012
TL;DR: This paper focuses on a graph-based approach for text representation and presents a novel error tolerance dissimilarity algorithm for deviation detection, which has managed to identify deviating sentences and it strongly correlates with expert judgments.
Abstract: The rapid increase in the amount of textual data has brought forward a growing research interest in mining text to detect deviations. Specialized methods for specific domains have emerged to satisfy various needs in discovering rare patterns in text. This paper focuses on a graph-based approach for text representation and presents a novel error tolerance dissimilarity algorithm for deviation detection. We resolve two non-trivial problems, i.e. semantic representation of text and the complexity of graph matching. We employ the Conceptual Graph Interchange Format (CGIF), a knowledge representation formalism, to capture the structure and semantics of sentences. We propose a novel error tolerance dissimilarity algorithm to detect deviations in the CGIFs. We evaluate our method in the context of analyzing real-world financial statements to identify deviating performance indicators. We show that our method performs better than two related text-based graph similarity measuring methods. Our proposed method identifies deviating sentences and correlates strongly with expert judgments. Furthermore, it offers error tolerance matching of CGIFs and retains linear complexity as the number of CGIFs increases.

6 citations


Cites background from "Dissimilarity algorithm on conceptu..."

  • ...Among these methods, CG has gained considerable attention due to various reasons: i.e. firstly, it simplifies the representation of relations of any arity compared to other network language that uses labelled arc....

  • ...Thirdly, they are adequate to represent accurate and highly structured information beyond the keyword approach [12] and fourthly, both semantic and episodic association between words can be represented using CGs [13]....

Journal Article
01 Jan 2015
TL;DR: The conceptual graph method is introduced first and then applied to expressing knowledge extracted from experience accumulated in maintenance actions, so that maintenance staff can find the most similar case when a new fault appears.
Abstract: Experience knowledge is extremely important in the maintenance domain. However, it is difficult to express and extract this kind of knowledge. The Conceptual Graph is a new and powerful visual knowledge representation method. This paper proposes a technique based on conceptual graphs to extract knowledge from experience accumulated in maintenance actions. This technique introduces conceptual graphs to the maintenance domain. With a VE distribution pump as an example, the conceptual graph method is introduced first, and then it is applied to expressing knowledge extracted from experience accumulated in maintenance actions. Finally, the similarity between the graph of a new fault case and the graphs of base cases is computed, through which maintenance staff can find the most similar case when a new fault appears. The causes and solutions of the most similar case can assist maintenance staff in resolving new faults.

2 citations

References
Book
01 Jan 1983
Introduction to modern information retrieval

12,059 citations


"Dissimilarity algorithm on conceptu..." refers background in this paper

  • ...As such, they are the preferred text representation approach for a wide range of problems namely in natural language processing, information retrieval and text mining....

Proceedings Article
03 Nov 1997
TL;DR: This paper defines Web mining and presents an overview of the various research issues, techniques, and development efforts, and briefly describes WEBMINER, a system for Web usage mining, and concludes the paper by listing research issues.
Abstract: Application of data mining techniques to the World Wide Web, referred to as Web mining, has been the focus of several recent research projects and papers. However, there is no established vocabulary, leading to confusion when comparing research efforts. The term Web mining has been used in two distinct ways. The first, called Web content mining in this paper, is the process of information discovery from sources across the World Wide Web. The second, called Web usage mining, is the process of mining for user browsing and access patterns. We define Web mining and present an overview of the various research issues, techniques, and development efforts. We briefly describe WEBMINER, a system for Web usage mining, and conclude the paper by listing research issues.

1,365 citations


"Dissimilarity algorithm on conceptu..." refers methods in this paper

  • ...To overcome this limitation many researchers divert their attention on clustering based method such as Expectation Maximization algorithm as explored in [22, 23], According to this method, outliers are data items that do not belong to any clusters....

Journal Article
TL;DR: The SVM approach as represented by Schoelkopf was superior to all the methods except the neural network one, to which it was essentially comparable, although occasionally worse.
Abstract: We implemented versions of the SVM appropriate for one-class classification in the context of information retrieval. The experiments were conducted on the standard Reuters data set. For the SVM implementation we used both a version of Schoelkopf et al. and a somewhat different version of one-class SVM based on identifying "outlier" data as representative of the second class. We report on experiments with different kernels for both of these implementations and with different representations of the data, including binary vectors, tf-idf representation and a modification called "Hadamard" representation. Then we compared it with one-class versions of the algorithms prototype (Rocchio), nearest neighbor, naive Bayes, and finally a natural one-class neural network classification method based on "bottleneck" compression generated filters. The SVM approach as represented by Schoelkopf was superior to all the methods except the neural network one, to which it was essentially comparable, although occasionally worse. However, the SVM methods turned out to be quite sensitive to the choice of representation and kernel in ways which are not well understood; therefore, for the time being, the neural network approach remains the most robust.

1,293 citations


"Dissimilarity algorithm on conceptu..." refers methods in this paper

  • ...Classification based methods such as Neural Network, Naïve Bayes and Support Vector Machine are explored in [21]....

Posted Content
TL;DR: This article develops a formal grammatical system called a link grammar, shows how English grammar can be encoded in such a system, and gives algorithms for efficiently parsing with a link grammar.
Abstract: We develop a formal grammatical system called a link grammar, show how English grammar can be encoded in such a system, and give algorithms for efficiently parsing with a link grammar. Although the expressive power of link grammars is equivalent to that of context free grammars, encoding natural language grammars appears to be much easier with the new system. We have written a program for general link parsing and written a link grammar for the English language. The performance of this preliminary system -- both in the breadth of English phenomena that it captures and in the computational resources used -- indicates that the approach may have practical uses as well as linguistic significance. Our program is written in C and may be obtained through the internet.

839 citations

01 Aug 1995
TL;DR: The authors develop a formal grammatical system called a link grammar, show how English grammar can be encoded in such a system, and give algorithms for efficiently parsing with a link grammar.
Abstract: We develop a formal grammatical system called a link grammar, show how English grammar can be encoded in such a system, and give algorithms for efficiently parsing with a link grammar. Although the expressive power of link grammars is equivalent to that of context free grammars, encoding natural language grammars appears to be much easier with the new system. We have written a program for general link parsing and written a link grammar for the English language. The performance of this preliminary system ‐ both in the breadth of English phenomena that it captures and in the computational resources used ‐ indicates that the approach may have practical uses as well as linguistic significance. Our program is written in C and may be obtained through the internet.

726 citations