Proceedings Article

Knowledge Discovery from Semi-Structured Data for Conceptual Organization

18 Dec 2006, pp. 291-294
TL;DR: The paper proposes a knowledge-discovery mechanism that extracts noun phrases from documents and arranges them into concept maps based on their co-occurrence; the resulting maps can be used for automatic grouping and conceptual categorization of documents.
Abstract: Conceptual organization of semi-structured documents can help in effective retrieval from collections of emails, product complaints, video descriptions, etc. In this paper, we propose a conceptual organization scheme for grouping and categorizing semi-structured text data using natural language processing techniques. We propose a knowledge-discovery mechanism that extracts noun phrases from documents and arranges them into concept maps based on their co-occurrence. The emerging concept maps can be used for automatic grouping and conceptual categorization of documents. Further, phrase structure grammar is employed to extract relationships among these noun-phrase entities from documents and to index the document collection with these relations.
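The abstract does not spell out how the concept maps are built, so the following is only a minimal sketch of the general idea of linking noun phrases by co-occurrence. It assumes that noun phrases have already been extracted for each document by some upstream step (not shown), and the names `cooccurrence_edges`, `concept_groups`, the `min_count` threshold, and the grouping by connected components are all illustrative assumptions rather than the authors' method.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(doc_phrases, min_count=2):
    """Count phrase pairs that appear in the same document.

    doc_phrases: list of lists of noun phrases, one list per document
    (assumed to come from an upstream noun-phrase extractor).
    min_count: hypothetical threshold; pairs seen in fewer documents are dropped.
    """
    pair_counts = Counter()
    for phrases in doc_phrases:
        # Sort the unique phrases so each pair has one canonical order.
        unique = sorted(set(phrases))
        pair_counts.update(combinations(unique, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


def concept_groups(edges):
    """Group phrases into connected components of the co-occurrence graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())


# Toy input: noun phrases per document (made-up illustration data).
docs = [
    ["battery life", "laptop", "charger"],
    ["battery life", "charger"],
    ["video quality", "camera"],
    ["camera", "video quality", "laptop"],
]
edges = cooccurrence_edges(docs, min_count=2)
groups = concept_groups(edges)
print(edges)
print(groups)
```

In this toy input only two pairs co-occur in at least two documents, so the sketch yields two small phrase groups; a document could then be assigned to whichever group shares the most phrases with it. The relation extraction with phrase structure grammar that the abstract mentions is not covered here.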
Citations
Proceedings Article
18 Oct 2008
TL;DR: A graphic ontology representation schema for virtual enterprise models, called the ontology structure graph (OSG), is described, and a knowledge discovery process for the virtual enterprise model base, built on applications of semantic annotation, is given.
Abstract: Discovering knowledge from virtual enterprise models is becoming increasingly important, because the numerical models established for virtual enterprises make it difficult to support interoperability across the virtual enterprise. To solve this problem, this paper puts forward a knowledge discovery method based on semantic annotation. A graphic ontology representation schema for virtual enterprise models, called the ontology structure graph (OSG), is described. Based on applications of semantic annotation, the process of knowledge discovery in the virtual enterprise model base is given, and its activities (model selection, semantic annotation, data transformation, knowledge extraction, and ontology interoperation) are illustrated in detail. Several critical issues influencing knowledge discovery are then explained, including the organization of the domain vocabulary, the definition of semantic annotation rules and the semantic affinity function, and the formulation of the reference ontology. Finally, an instance is given to demonstrate the knowledge discovery method, and the results of knowledge discovery are presented.

5 citations


Cites background from "Knowledge Discovery from Semi-Struc..."

  • ...[6] proposed a knowledge discovery schema for conceptual organization of semi-structured data like emails, bibliographic databases, customer complaints, video descriptions etc....


References
Journal Article
TL;DR: This final installment of the paper considers the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now.
Abstract: In this final installment of the paper we consider the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now. To a considerable extent the continuous case can be obtained through a limiting process from the discrete case by dividing the continuum of messages and signals into a large but finite number of small regions and calculating the various parameters involved on a discrete basis. As the size of the regions is decreased these parameters in general approach as limits the proper values for the continuous case. There are, however, a few new effects that appear and also a general change of emphasis in the direction of specialization of the general results to particular cases.

65,425 citations

Journal Article
TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Abstract: The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

7,539 citations

Journal Article
TL;DR: The Photobook system is described, which is a set of interactive tools for browsing and searching images and image sequences that make direct use of the image content rather than relying on text annotations to provide a sophisticated browsing and search capability.
Abstract: We describe the Photobook system, which is a set of interactive tools for browsing and searching images and image sequences. These query tools differ from those used in standard image databases in that they make direct use of the image content rather than relying on text annotations. Direct search on image content is made possible by use of semantics-preserving image compression, which reduces images to a small set of perceptually-significant coefficients. We discuss three types of Photobook descriptions in detail: one that allows search based on appearance, one that uses 2-D shape, and a third that allows search based on textural properties. These image content descriptions can be combined with each other and with text-based descriptions to provide a sophisticated browsing and search capability. In this paper we demonstrate Photobook on databases containing images of people, video keyframes, hand tools, fish, texture swatches, and 3-D medical data.

1,748 citations

Journal Article
TL;DR: The concept vectors produced by the spherical k-means algorithm constitute a powerful “basis” for text data sets: they are sparse, localized in the word space, and tend towards orthonormality.
Abstract: Unlabeled document collections are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. Using words as features, text documents are often represented as high-dimensional and sparse vectors; a few thousand dimensions and a sparsity of 95 to 99% is typical. In this paper, we study a certain spherical k-means algorithm for clustering such document vectors. The algorithm outputs k disjoint clusters each with a concept vector that is the centroid of the cluster normalized to have unit Euclidean norm. As our first contribution, we empirically demonstrate that, owing to the high-dimensionality and sparsity of the text data, the clusters produced by the algorithm have a certain “fractal-like” and “self-similar” behavior. As our second contribution, we introduce concept decompositions to approximate the matrix of document vectors; these decompositions are obtained by taking the least-squares approximation onto the linear subspace spanned by all the concept vectors. We empirically establish that the approximation errors of the concept decompositions are close to the best possible, namely, to truncated singular value decompositions. As our third contribution, we show that the concept vectors are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors are global in the word space and are dense. Nonetheless, we observe the surprising fact that the linear subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them. In conclusion, the concept vectors produced by the spherical k-means algorithm constitute a powerful sparse and localized “basis” for text data sets.

1,398 citations

Journal Article
TL;DR: The authors describe the genesis and development of concept mapping as a useful tool for science education, offer an overview of the contents of this special issue, and comment on the current state of knowledge representation.
Abstract: This article describes the genesis and development of concept mapping as a useful tool for science education. It also offers an overview of the contents of this special issue and comments on the current state of knowledge representation. Suggestions for further research are made throughout the article.

995 citations