Journal ArticleDOI

Deviation detection in text using conceptual graph interchange format and error tolerance dissimilarity function

TL;DR: This paper focuses on a graph-based approach to text representation and presents a novel error-tolerance dissimilarity algorithm for deviation detection; the algorithm identifies deviating sentences and correlates strongly with expert judgments.
Abstract: The rapid increase in the amount of textual data has brought forward a growing research interest in mining text to detect deviations. Specialized methods for specific domains have emerged to satisfy various needs in discovering rare patterns in text. This paper focuses on a graph-based approach to text representation and presents a novel error-tolerance dissimilarity algorithm for deviation detection. We resolve two non-trivial problems, i.e. the semantic representation of text and the complexity of graph matching. We employ the conceptual graph interchange format (CGIF), a knowledge representation formalism, to capture the structure and semantics of sentences. We propose a novel error-tolerance dissimilarity algorithm to detect deviations in the CGIFs. We evaluate our method in the context of analyzing real-world financial statements to identify deviating performance indicators. We show that our method performs better than two related text-based graph similarity measures. The proposed method identifies deviating sentences and correlates strongly with expert judgments. Furthermore, it offers error-tolerant matching of CGIFs and retains linear complexity as the number of CGIFs increases.
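The paper's actual algorithm operates on CGIF parses and is not reproduced here; as a rough, hypothetical sketch of error-tolerant graph dissimilarity, a sentence graph can be reduced to (concept, relation, concept) triples and matched greedily, with mismatching labels incurring a partial cost instead of failing the match outright:

```python
# Hypothetical sketch, not the paper's CGIF algorithm: a sentence graph is
# reduced to (concept, relation, concept) triples, and label mismatches
# incur a partial cost rather than failing the match outright.

def triple_cost(t1, t2):
    """Fraction of differing slots between two triples (0 = identical)."""
    return sum(a != b for a, b in zip(t1, t2)) / 3.0

def dissimilarity(g1, g2):
    """Greedy error-tolerant matching; unmatched triples cost 1 each."""
    remaining = list(g2)
    n = max(len(g1), len(remaining), 1)
    total = 0.0
    for t1 in g1:
        if not remaining:
            total += 1.0          # nothing left to match against
            continue
        best = min(remaining, key=lambda t: triple_cost(t1, t))
        total += triple_cost(t1, best)
        remaining.remove(best)
    total += len(remaining)       # leftover triples in g2
    return total / n
```

Identical graphs score 0 and disjoint graphs approach 1, while a single changed label (e.g. profit vs. loss) contributes only a fractional penalty, which is the error-tolerance idea in miniature.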


Citations
01 Jan 2006
TL;DR: A method for measuring the similarity of FCA concepts is presented, refining a previous proposal of the author; it achieves a higher correlation with human judgement than other proposals in the literature for evaluating concept similarity in a taxonomy.
Abstract: Formal Concept Analysis (FCA) is proving useful in supporting difficult activities that are becoming fundamental to the development of the Semantic Web. Assessing concept similarity is one such activity, since it allows the identification of different concepts that are semantically close. In this paper, a method for measuring the similarity of FCA concepts is presented, which is a refinement of a previous proposal of the author. The refinement consists of determining the similarity of concept descriptors (attributes) by using the information content approach, rather than relying on human domain expertise. The adopted information content approach achieves a higher correlation with human judgement than other proposals in the literature for evaluating concept similarity in a taxonomy.
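The information content idea can be illustrated with a small sketch; the toy attribute counts and the IC-weighted overlap below are illustrative assumptions, not the author's exact formulation:

```python
import math

# Illustrative assumptions: attribute frequencies from a toy corpus stand in
# for the paper's information content source; concept similarity is taken as
# the IC-weighted overlap of attribute sets.

def information_content(attr, counts, total):
    """IC(a) = -log p(a): rarer attributes carry more information."""
    return -math.log(counts.get(attr, 1) / total)

def ic_similarity(attrs1, attrs2, counts, total):
    """Shared information divided by the total information of both concepts."""
    shared = sum(information_content(a, counts, total) for a in attrs1 & attrs2)
    union = sum(information_content(a, counts, total) for a in attrs1 | attrs2)
    return shared / union if union else 1.0
```

Under this weighting, two concepts that share a rare (informative) attribute come out more similar than two that share only a common one, which is the key difference from a plain set overlap.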

124 citations

Proceedings ArticleDOI
15 Oct 2012
TL;DR: An associative classification approach is examined for the Arabic language to mine knowledge from an Arabic text data set; it achieves good classification performance in terms of classification time and classification accuracy.
Abstract: The text classification problem has received a great deal of research based on machine learning, statistical, and information retrieval techniques. In the last decade, associative classification (AC) algorithms, which depend on pure data mining techniques, have emerged as an effective method for classification. In this paper, we examine the associative classification approach on the Arabic language to mine knowledge from an Arabic text data set. Two AC classification methods are applied in this study: single rule prediction and multiple rule prediction. The experimental results against different classes of the Arabic data set show that the multiple rule prediction method outperforms the single rule prediction method with regard to accuracy. In general, the associative classification approach is a suitable method for classifying an Arabic text data set, and is able to achieve good classification performance in terms of classification time and classification accuracy.
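The two prediction modes can be sketched as follows; the rule format (antecedent term set, class label, confidence, pre-sorted by descending confidence) is a hypothetical CBA-style simplification, not the paper's exact representation:

```python
# Hypothetical CBA-style rule format: (antecedent term set, class, confidence),
# with `rules` pre-sorted by descending confidence.

def single_rule_predict(doc_terms, rules, default="other"):
    """Fire the first (highest-confidence) rule fully contained in the doc."""
    for terms, label, conf in rules:
        if terms <= doc_terms:
            return label
    return default

def multiple_rule_predict(doc_terms, rules, default="other"):
    """Sum the confidence of every matching rule per class; pick the max."""
    scores = {}
    for terms, label, conf in rules:
        if terms <= doc_terms:
            scores[label] = scores.get(label, 0.0) + conf
    return max(scores, key=scores.get) if scores else default
```

On a document matching several rules, the single-rule method returns the class of the highest-confidence matching rule, while the multiple-rule method can overturn it when several lower-confidence rules agree on another class.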

11 citations

Proceedings ArticleDOI
15 Oct 2012
TL;DR: This paper reports the use of the artificial immune system (AIS) technique in identifying sentiment in Malaysian online movie reviews using three string similarity functions, namely Cosine Similarity, the Jaccard Coefficient and the Sorensen Coefficient.
Abstract: Opinion mining automates the process of identifying whether an opinion expresses a positive or negative view. The majority of previous work in this field uses natural language processing techniques to identify the sentiment. This paper reports the use of the artificial immune system (AIS) technique in identifying sentiment in Malaysian online movie reviews. The opinion mining process uses three string similarity functions, namely Cosine Similarity, the Jaccard Coefficient and the Sorensen Coefficient. In addition, AIS performance was compared with other traditional machine learning techniques, namely Support Vector Machine, Naive Bayes and k-Nearest Neighbour. The findings are analyzed and discussed in this paper.
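Treating each review as a set of tokens, the three similarity functions named above have standard definitions, sketched here (the set-based formulation is an assumption; the paper may operate on character n-grams or weighted vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity over term sets: |A∩B| / sqrt(|A|·|B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def jaccard(a, b):
    """Jaccard coefficient: |A∩B| / |A∪B|."""
    return len(a & b) / len(a | b) if a or b else 0.0

def sorensen(a, b):
    """Sorensen (Dice) coefficient: 2·|A∩B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0
```

All three range over [0, 1] and agree on the extremes (identical sets score 1, disjoint sets 0); they differ in how heavily partial overlap is rewarded.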

8 citations

Journal ArticleDOI
01 Jan 2015
TL;DR: A text mining system that is able to detect sentence deviations from a collection of financial documents is described and a dissimilarity function to compare sentences represented as graphs is implemented.
Abstract: Attempts to mine text documents to discover deviations or anomalies have increased in recent years due to the elevated amount of textual data in today's data repositories. Text mining assists in uncovering hidden information content across multiple documents. Although various text mining tools are available, their focus is mainly to assist in data summarization or document classification. These tasks have proved to be helpful; however, they do not provide semantic analysis and rigorous textual comparison to detect abnormal sentences that exist in the documents. In this paper, we describe a text mining system that is able to detect sentence deviations from a collection of financial documents. The system implements a dissimilarity function to compare sentences represented as graphs. Our evaluation of the proposed system revolves around experiments using financial statements of a bank. The findings provide valid evidence that the proposed system is able to identify deviating sentences occurring in the documents. The detected deviations can be beneficial for the authorities in order to improve their business decisions.

5 citations


Cites background or methods from "Deviation detection in text using c..."

  • ...Third, the resulting graph-based sentence structures are compared with one another using a noise-tolerant dissimilarity metric as proposed in [23] to compute the dissimilarity between the produced sentence graphs....

  • ...A detailed discussion of the dissimilarity function can be found in [23]....

  • ...In this component, a dissimilarity function as proposed in [23] is adopted....

  • ...For the dissimilarity calculation component, a noise-tolerant dissimilarity function as proposed in [23] is adopted....

  • ...The subjective nature of real-world sentences is addressed by deploying a noise-tolerant dissimilarity calculation [23]....

Journal ArticleDOI
TL;DR: This work introduces a simple and straightforward method to capture the canonical form of sentences using Word Sense Disambiguation techniques and then applies the First Order Predicate Logic (FOPL) scheme to represent the identified canonical forms.
Abstract: Canonical form is the notion that related ideas should have the same meaning representation. It greatly simplifies tasks by dealing with a single meaning representation for a wide range of expressions. The issue in text representation is to devise a formal approach to capturing the meaning, or semantics, of sentences. These issues include heterogeneity and inconsistency in text. Polysemous, synonymous, morphologically related and homonymous words pose serious drawbacks when trying to capture senses in sentences. This calls for a need to capture and represent senses in order to resolve vagueness and improve the understanding of senses in documents for knowledge creation purposes. We introduce a simple and straightforward method to capture the canonical form of sentences. The proposed method first identifies the canonical forms using Word Sense Disambiguation (WSD) techniques and later applies the First Order Predicate Logic (FOPL) scheme to represent the identified canonical forms. We adopted two WSD algorithms, Lesk and Selectional Preference Restriction, which concentrate mainly on disambiguating senses in words, phrases and sentences. We also adopted the FOPL scheme to analyse predicate arguments in sentences, employing the consequence logic theorem to test for the satisfiability, validity and completeness of information in sentences.
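A minimal sketch of the simplified Lesk algorithm, one of the two WSD approaches named above (the gloss dictionary and its format are hypothetical stand-ins for a real sense inventory such as WordNet):

```python
# Minimal simplified-Lesk sketch; the gloss dictionary passed in is a
# hypothetical stand-in for a real sense inventory.

def simplified_lesk(word, context, glosses):
    """Choose the sense whose gloss shares the most words with the context."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses[word].items():
        overlap = len(set(gloss.lower().split()) & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

For example, with two glosses for "bank", the context "he deposited money at the bank" overlaps the financial gloss and selects that sense.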

2 citations

References
Book
01 Jan 1983
Introduction to Modern Information Retrieval

12,059 citations

Journal ArticleDOI
TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.
Abstract: Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.

9,627 citations


"Deviation detection in text using c..." refers methods in this paper

  • ...CG is proven to be competitive and more expressive than the logic-based method [18]....


Journal ArticleDOI
TL;DR: A survey of contemporary techniques for outlier detection is introduced; their respective motivations are identified and their advantages and disadvantages distinguished in a comparative review.
Abstract: Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.

3,235 citations


"Deviation detection in text using c..." refers background in this paper

  • ...The concept nodes represent concepts such as entities, attributes, states and events while the relation nodes represent relations to show how the concepts are interrelated....


Proceedings ArticleDOI
01 Aug 1999
TL;DR: The results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small, and that all the methods perform comparably when categories have over 300 instances.
Abstract: This paper reports a controlled study with statistical significance tests on five text categorization methods: the Support Vector Machine (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Least-squares Fit (LLSF) mapping and a Naive Bayes (NB) classifier. We focus on the robustness of these methods in dealing with a skewed category distribution, and on their performance as a function of the training-set category frequency. Our results show that SVM, kNN and LLSF significantly outperform NNet and NB when the number of positive training instances per category is small (fewer than ten), and that all the methods perform comparably when the categories are sufficiently common (over 300 instances).
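A toy sketch of the kNN text classifier evaluated in the study, using cosine similarity over raw term counts; the study's actual setup used weighted document vectors and benchmark corpora, so this only illustrates the neighbor-voting scheme:

```python
import math
from collections import Counter

# Toy kNN text classifier: cosine similarity over raw term counts and a
# majority vote among the k nearest training documents.

def cosine_sim(c1, c2):
    """Cosine similarity between two term-count Counters."""
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn_classify(text, train, k=3):
    """train is a list of (text, label) pairs; return the majority label."""
    query = Counter(text.split())
    ranked = sorted(train,
                    key=lambda d: cosine_sim(query, Counter(d[0].split())),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the prediction depends only on the k most similar training documents, categories with very few positive instances can still attract neighbors, which is one reason kNN held up well on rare categories in the study.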

2,877 citations


"Deviation detection in text using c..." refers background in this paper

  • ...A number of operations such as projection (graph matching), unification (join), simplification, restriction and copying can be performed on the produced CG. Additional information such as descriptions and the organization of the graphs into hierarchies of abstraction can help to reduce the search…...


Journal ArticleDOI
TL;DR: There are a multitude of applications where novelty detection is extremely important including signal processing, computer vision, pattern recognition, data mining, and robotics.

1,457 citations


"Deviation detection in text using c..." refers background in this paper

  • ...The concept nodes represent concepts such as entities, attributes, states and events while the relation nodes represent relations to show how the concepts are interrelated....
