Journal ArticleDOI

Deviation detection in text using conceptual graph interchange format and error tolerance dissimilarity function

TL;DR: This paper focuses on a graph-based approach to text representation and presents a novel error-tolerance dissimilarity algorithm for deviation detection; the algorithm identifies deviating sentences and correlates strongly with expert judgments.
Abstract: The rapid increase in the amount of textual data has brought forward a growing research interest in mining text to detect deviations. Specialized methods for specific domains have emerged to satisfy various needs in discovering rare patterns in text. This paper focuses on a graph-based approach to text representation and presents a novel error-tolerance dissimilarity algorithm for deviation detection. We resolve two non-trivial problems, i.e. the semantic representation of text and the complexity of graph matching. We employ the conceptual graph interchange format (CGIF), a knowledge representation formalism, to capture the structure and semantics of sentences. We propose a novel error-tolerance dissimilarity algorithm to detect deviations in the CGIFs. We evaluate our method in the context of analyzing real-world financial statements to identify deviating performance indicators. We show that our method performs better than two related text-based graph similarity measures. The proposed method identifies deviating sentences and correlates strongly with expert judgments. Furthermore, it offers error-tolerant matching of CGIFs and retains linear complexity as the number of CGIFs increases.
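The paper's actual algorithm operates on CGIF parses and is not reproduced here; as a rough, hypothetical sketch of error-tolerant graph dissimilarity, a sentence graph can be reduced to (concept, relation, concept) triples and matched greedily, with mismatching labels incurring a partial cost instead of failing the match outright:

```python
# Hypothetical sketch, not the paper's CGIF algorithm: a sentence graph is
# reduced to (concept, relation, concept) triples, and label mismatches
# incur a partial cost rather than failing the match outright.

def triple_cost(t1, t2):
    """Fraction of differing slots between two triples (0 = identical)."""
    return sum(a != b for a, b in zip(t1, t2)) / 3.0

def dissimilarity(g1, g2):
    """Greedy error-tolerant matching; unmatched triples cost 1 each."""
    remaining = list(g2)
    n = max(len(g1), len(remaining), 1)
    total = 0.0
    for t1 in g1:
        if not remaining:
            total += 1.0          # nothing left to match against
            continue
        best = min(remaining, key=lambda t: triple_cost(t1, t))
        total += triple_cost(t1, best)
        remaining.remove(best)
    total += len(remaining)       # leftover triples in g2
    return total / n
```

Identical graphs score 0 and disjoint graphs approach 1, while a single changed label (e.g. profit vs. loss) contributes only a fractional penalty, which is the error-tolerance idea in miniature.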


Citations
01 Jan 2006
TL;DR: A method for measuring the similarity of FCA concepts is presented, refining a previous proposal of the author; it achieves a higher correlation with human judgement than other proposals in the literature for evaluating concept similarity in a taxonomy.
Abstract: Formal Concept Analysis (FCA) is proving useful in supporting difficult activities that are becoming fundamental to the development of the Semantic Web. Assessing concept similarity is one such activity, since it allows the identification of different concepts that are semantically close. In this paper, a method for measuring the similarity of FCA concepts is presented, which is a refinement of a previous proposal of the author. The refinement consists of determining the similarity of concept descriptors (attributes) by using the information content approach, rather than relying on human domain expertise. The adopted information content approach achieves a higher correlation with human judgement than other proposals in the literature for evaluating concept similarity in a taxonomy.
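The information content idea can be illustrated with a small sketch; the toy attribute counts and the IC-weighted overlap below are illustrative assumptions, not the author's exact formulation:

```python
import math

# Illustrative assumptions: attribute frequencies from a toy corpus stand in
# for the paper's information content source; concept similarity is taken as
# the IC-weighted overlap of attribute sets.

def information_content(attr, counts, total):
    """IC(a) = -log p(a): rarer attributes carry more information."""
    return -math.log(counts.get(attr, 1) / total)

def ic_similarity(attrs1, attrs2, counts, total):
    """Shared information divided by the total information of both concepts."""
    shared = sum(information_content(a, counts, total) for a in attrs1 & attrs2)
    union = sum(information_content(a, counts, total) for a in attrs1 | attrs2)
    return shared / union if union else 1.0
```

Under this weighting, two concepts that share a rare (informative) attribute come out more similar than two that share only a common one, which is the key difference from a plain set overlap.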

124 citations

Proceedings ArticleDOI
15 Oct 2012
TL;DR: An associative classification approach is examined for the Arabic language to mine knowledge from an Arabic text data set; it achieves good classification performance in terms of classification time and classification accuracy.
Abstract: The text classification problem has received a great deal of research based on machine learning, statistical, and information retrieval techniques. In the last decade, associative classification (AC) algorithms, which depend on pure data mining techniques, have emerged as an effective method for classification. In this paper, we examine the associative classification approach on the Arabic language to mine knowledge from an Arabic text data set. Two AC classification methods are applied in this study: single rule prediction and multiple rule prediction. The experimental results against different classes of the Arabic data set show that the multiple rule prediction method outperforms the single rule prediction method with regard to accuracy. In general, the associative classification approach is a suitable method for classifying an Arabic text data set, and is able to achieve good classification performance in terms of classification time and classification accuracy.
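The two prediction modes can be sketched as follows; the rule format (antecedent term set, class label, confidence, pre-sorted by descending confidence) is a hypothetical CBA-style simplification, not the paper's exact representation:

```python
# Hypothetical CBA-style rule format: (antecedent term set, class, confidence),
# with `rules` pre-sorted by descending confidence.

def single_rule_predict(doc_terms, rules, default="other"):
    """Fire the first (highest-confidence) rule fully contained in the doc."""
    for terms, label, conf in rules:
        if terms <= doc_terms:
            return label
    return default

def multiple_rule_predict(doc_terms, rules, default="other"):
    """Sum the confidence of every matching rule per class; pick the max."""
    scores = {}
    for terms, label, conf in rules:
        if terms <= doc_terms:
            scores[label] = scores.get(label, 0.0) + conf
    return max(scores, key=scores.get) if scores else default
```

On a document matching several rules, the single-rule method returns the class of the highest-confidence matching rule, while the multiple-rule method can overturn it when several lower-confidence rules agree on another class.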

11 citations

Proceedings ArticleDOI
15 Oct 2012
TL;DR: This paper reports the use of the artificial immune system (AIS) technique in identifying sentiment in Malaysian online movie reviews using three string similarity functions, namely Cosine Similarity, the Jaccard Coefficient and the Sorensen Coefficient.
Abstract: Opinion mining automates the process of identifying whether an opinion expresses a positive or negative view. The majority of previous work in this field uses natural language processing techniques to identify the sentiment. This paper reports the use of the artificial immune system (AIS) technique in identifying sentiment in Malaysian online movie reviews. The opinion mining process uses three string similarity functions, namely Cosine Similarity, the Jaccard Coefficient and the Sorensen Coefficient. In addition, AIS performance was compared with other traditional machine learning techniques, namely Support Vector Machine, Naive Bayes and k-Nearest Neighbour. The findings are analyzed and discussed in this paper.
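Treating each review as a set of tokens, the three similarity functions named above have standard definitions, sketched here (the set-based formulation is an assumption; the paper may operate on character n-grams or weighted vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity over term sets: |A∩B| / sqrt(|A|·|B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def jaccard(a, b):
    """Jaccard coefficient: |A∩B| / |A∪B|."""
    return len(a & b) / len(a | b) if a or b else 0.0

def sorensen(a, b):
    """Sorensen (Dice) coefficient: 2·|A∩B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0
```

All three range over [0, 1] and agree on the extremes (identical sets score 1, disjoint sets 0); they differ in how heavily partial overlap is rewarded.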

8 citations

Journal ArticleDOI
01 Jan 2015
TL;DR: A text mining system that is able to detect sentence deviations from a collection of financial documents is described and a dissimilarity function to compare sentences represented as graphs is implemented.
Abstract: Attempts to mine text documents to discover deviations or anomalies have increased in recent years due to the elevated amount of textual data in today's data repositories. Text mining assists in uncovering hidden information content across multiple documents. Although various text mining tools are available, their focus is mainly to assist in data summarization or document classification. These tasks have proved to be helpful; however, they do not provide semantic analysis and rigorous textual comparison to detect abnormal sentences that exist in the documents. In this paper, we describe a text mining system that is able to detect sentence deviations from a collection of financial documents. The system implements a dissimilarity function to compare sentences represented as graphs. Our evaluation of the proposed system revolves around experiments using financial statements of a bank. The findings provide valid evidence that the proposed system is able to identify deviating sentences occurring in the documents. The detected deviations can be beneficial for the authorities in order to improve their business decisions.

5 citations


Cites background or methods from "Deviation detection in text using c..."

  • ...Third, the resulting graph-based sentence structures are compared with one another using a noise-tolerant dissimilarity metric as proposed in [23] to compute the dissimilarity between the produced sentence graphs....

  • ...A detailed discussion of the dissimilarity function can be found in [23]....

  • ...In this component, a dissimilarity function as proposed in [23] is adopted....

  • ...For the dissimilarity calculation component, a noise-tolerant dissimilarity function as proposed in [23] is adopted....

  • ...The subjective nature of real-world sentences is addressed by deploying a noise-tolerant dissimilarity calculation [23]....

Journal ArticleDOI
TL;DR: This work introduces a simple and straightforward method to capture the canonical form of sentences using Word Sense Disambiguation techniques and then applies the First Order Predicate Logic (FOPL) scheme to represent the identified canonical forms.
Abstract: Canonical form is the notion that related ideas should have the same meaning representation. It greatly simplifies tasks by dealing with a single meaning representation for a wide range of expressions. The issue in text representation is to devise a formal approach to capturing the meaning, or semantics, of sentences. These issues include heterogeneity and inconsistency in text. Polysemous, synonymous, morphologically related and homonymous words pose serious drawbacks when trying to capture senses in sentences. This calls for a need to capture and represent senses in order to resolve vagueness and improve the understanding of senses in documents for knowledge creation purposes. We introduce a simple and straightforward method to capture the canonical form of sentences. The proposed method first identifies the canonical forms using Word Sense Disambiguation (WSD) techniques and later applies the First Order Predicate Logic (FOPL) scheme to represent the identified canonical forms. We adopted two WSD algorithms, Lesk and Selectional Preference Restriction, which concentrate mainly on disambiguating senses in words, phrases and sentences. We also adopted the FOPL scheme to analyse predicate arguments in sentences, employing the consequence logic theorem to test for the satisfiability, validity and completeness of information in sentences.
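A minimal sketch of the simplified Lesk algorithm, one of the two WSD approaches named above (the gloss dictionary and its format are hypothetical stand-ins for a real sense inventory such as WordNet):

```python
# Minimal simplified-Lesk sketch; the gloss dictionary passed in is a
# hypothetical stand-in for a real sense inventory.

def simplified_lesk(word, context, glosses):
    """Choose the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses[word].items():
        overlap = len(set(gloss.lower().split()) & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

For example, with two glosses for "bank", the context "he deposited money at the bank" overlaps the financial gloss and selects that sense.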

2 citations

References
Book
01 Jan 1983
Introduction to Modern Information Retrieval

12,059 citations

Journal ArticleDOI
TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.
Abstract: Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.

9,627 citations


"Deviation detection in text using c..." refers methods in this paper

  • ...CG is proven to be competitive and more expressive than the logic-based method [18]....


Journal ArticleDOI
TL;DR: A survey of contemporary techniques for outlier detection is introduced; their respective motivations are identified and their advantages and disadvantages distinguished in a comparative review.
Abstract: Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.

3,235 citations


"Deviation detection in text using c..." refers background in this paper

  • ...The concept nodes represent concepts such as entities, attributes, states and events while the relation nodes represent relations to show how the concepts are interrelated....


Proceedings ArticleDOI
01 Aug 1999
TL;DR: The results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small, and that all the methods perform comparably when categories have over 300 instances.
Abstract: This paper reports a controlled study with statistical significance tests on five text categorization methods: the Support Vector Machine (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Least-squares Fit (LLSF) mapping and a Naive Bayes (NB) classifier. We focus on the robustness of these methods in dealing with a skewed category distribution, and on their performance as a function of the training-set category frequency. Our results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small (fewer than ten), and that all the methods perform comparably when the categories are sufficiently common (over 300 instances).
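A toy sketch of the kNN text classifier evaluated in the study, using cosine similarity over raw term counts; the study's actual setup used weighted document vectors and benchmark corpora, so this only illustrates the neighbor-voting scheme:

```python
import math
from collections import Counter

# Toy kNN text classifier: cosine similarity over raw term counts and a
# majority vote among the k nearest training documents.

def cosine_sim(c1, c2):
    """Cosine similarity between two term-count Counters."""
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn_classify(text, train, k=3):
    """train is a list of (text, label) pairs; return the majority label."""
    query = Counter(text.split())
    ranked = sorted(train,
                    key=lambda d: cosine_sim(query, Counter(d[0].split())),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the prediction depends only on the k most similar training documents, categories with very few positive instances can still attract neighbors, which is one reason kNN held up well on rare categories in the study.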

2,877 citations


"Deviation detection in text using c..." refers background in this paper

  • ...A number of operations such as projection (graph matching), unification (join), simplification, restriction and copying can be performed on the produced CG. Additional information such as descriptions and the organization of the graphs into hierarchies of abstraction can help to reduce the search…...


Journal ArticleDOI
TL;DR: There are a multitude of applications where novelty detection is extremely important including signal processing, computer vision, pattern recognition, data mining, and robotics.

1,457 citations


"Deviation detection in text using c..." refers background in this paper

  • ...The concept nodes represent concepts such as entities, attributes, states and events while the relation nodes represent relations to show how the concepts are interrelated....
