
Showing papers on "Semantic similarity published in 2003"


Journal ArticleDOI
TL;DR: This article introduces a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words, based on two different statistical measures of word association.
Abstract: The evaluative character of a word is called its semantic orientation. Positive semantic orientation indicates praise (e.g., "honest", "intrepid") and negative semantic orientation indicates criticism (e.g., "disturbing", "superfluous"). Semantic orientation varies in both direction (positive or negative) and degree (mild to strong). An automated system for measuring semantic orientation would have application in text classification, text filtering, tracking opinions in online discussions, analysis of survey responses, and automated chat systems (chatbots). This article introduces a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words. Two instances of this approach are evaluated, based on two different statistical measures of word association: pointwise mutual information (PMI) and latent semantic analysis (LSA). The method is experimentally tested with 3,596 words (including adjectives, adverbs, nouns, and verbs) that have been manually labeled positive (1,614 words) and negative (1,982 words). The method attains an accuracy of 82.8% on the full test set, but the accuracy rises above 95% when the algorithm is allowed to abstain from classifying mild words.
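
The PMI variant reduces to a difference of association scores between the word and the two paradigm sets. A minimal sketch, assuming a hypothetical corpus-statistics object exposing count, cooc, and total accessors (none of these names come from the paper):

```python
import math

def pmi(cooc, count_a, count_b, total):
    """Pointwise mutual information from raw co-occurrence counts:
    log2 of the observed joint frequency over the frequency expected
    under independence."""
    if cooc == 0:
        return 0.0  # simplistic smoothing: treat unseen pairs as independent
    return math.log2((cooc * total) / (count_a * count_b))

def semantic_orientation(word, stats, pos_words, neg_words):
    """SO as association with positive paradigm words minus association
    with negative ones, in the spirit of the SO-PMI method above.
    `stats` is an assumed object with count(w), cooc(w1, w2), and a
    total window count; any real corpus backend would work."""
    score = 0.0
    for p in pos_words:
        score += pmi(stats.cooc(word, p), stats.count(word),
                     stats.count(p), stats.total)
    for n in neg_words:
        score -= pmi(stats.cooc(word, n), stats.count(word),
                     stats.count(n), stats.total)
    return score  # > 0 suggests praise, < 0 suggests criticism
```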

1,651 citations


Journal ArticleDOI
TL;DR: This paper explores the determination of semantic similarity by a number of information sources, which consist of structural semantic information from a lexical taxonomy and information content from a corpus.
Abstract: Semantic similarity between words is becoming a generic problem for many applications of computational linguistics and artificial intelligence. This paper explores the determination of semantic similarity by a number of information sources, which consist of structural semantic information from a lexical taxonomy and information content from a corpus. To investigate how information sources could be used effectively, a variety of strategies for using various possible information sources are implemented. A new measure is then proposed which combines information sources nonlinearly. Experimental evaluation against a benchmark set of human similarity ratings demonstrates that the proposed measure significantly outperforms traditional similarity measures.
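
The nonlinear combination this line of work is usually associated with multiplies a decaying function of taxonomy path length by a saturating function of subsumer depth. A sketch under that assumption (the parameter values are the ones commonly cited for WordNet, not taken from this abstract):

```python
import math

def combined_similarity(path_length, subsumer_depth, alpha=0.2, beta=0.45):
    """Similarity from the shortest taxonomy path length l and the depth h
    of the deepest common subsumer: e^(-alpha*l) * tanh(beta*h).
    Longer paths shrink similarity; deeper (more specific) subsumers
    raise it, saturating toward 1."""
    return math.exp(-alpha * path_length) * math.tanh(beta * subsumer_depth)
```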

1,138 citations


Journal ArticleDOI
TL;DR: This work presents an approach to computing semantic similarity that relaxes the requirement of a single ontology and accounts for differences in the levels of explicitness and formalization of the different ontology specifications.
Abstract: Semantic similarity measures play an important role in information retrieval and information integration. Traditional approaches to modeling semantic similarity compute the semantic distance between definitions within a single ontology. This single ontology is either a domain-independent ontology or the result of the integration of existing ontologies. We present an approach to computing semantic similarity that relaxes the requirement of a single ontology and accounts for differences in the levels of explicitness and formalization of the different ontology specifications. A similarity function determines similar entity classes by using a matching process over synonym sets, semantic neighborhoods, and distinguishing features that are classified into parts, functions, and attributes. Experimental results with different ontologies indicate that the model gives good results when ontologies have complete and detailed representations of entity classes. While the combination of word matching and semantic neighborhood matching is adequate for detecting equivalent entity classes, feature matching allows us to discriminate among similar, but not necessarily equivalent entity classes.
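
The per-feature-type matching described here is in the style of Tversky's ratio model. A hedged sketch, with illustrative (not paper-supplied) weights across the parts/functions/attributes components:

```python
def tversky_match(a_feats, b_feats, alpha=0.5):
    """Tversky-style ratio model over one feature type: common features
    count toward similarity, non-common features count against it, with
    alpha weighting the two asymmetric differences."""
    a, b = set(a_feats), set(b_feats)
    common = len(a & b)
    denom = common + alpha * len(a - b) + (1 - alpha) * len(b - a)
    return common / denom if denom else 0.0

def entity_class_similarity(a, b,
                            weights={"parts": 0.4, "functions": 0.3,
                                     "attributes": 0.3}):
    """Weighted combination over the three distinguishing-feature types;
    a and b map each type to a feature list. The weights here are an
    assumption for illustration, not values from the paper."""
    return sum(w * tversky_match(a[t], b[t]) for t, w in weights.items())
```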

948 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigate the use of ontological annotation to measure the similarities in knowledge content or "semantic similarity" between entries in a data resource, and present a simple extension that enables a semantic search of the knowledge held within sequence databases.
Abstract: Motivation: Many bioinformatics data resources not only hold data in the form of sequences, but also as annotation. In the majority of cases, annotation is written as scientific natural language: this is suitable for humans, but not particularly useful for machine processing. Ontologies offer a mechanism by which knowledge can be represented in a form capable of such processing. In this paper we investigate the use of ontological annotation to measure the similarities in knowledge content or ‘semantic similarity’ between entries in a data resource. These allow a bioinformatician to perform a similarity measure over annotation in an analogous manner to those performed over sequences. A measure of semantic similarity for the knowledge component of bioinformatics resources should afford a biologist a new tool in their repertoire of analyses. Results: We present the results from experiments that investigate the validity of using semantic similarity by comparison with sequence similarity. We show a simple extension that enables a semantic search of the knowledge held within sequence databases. Availability: Software available from http://www.russet.
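
Work in this vein commonly scores two ontology terms by the information content of their most informative shared ancestor (Resnik's measure is one of those evaluated in this literature). A sketch, where `ancestors` is an assumed precomputed map from a term to itself plus all of its ancestors:

```python
import math

def information_content(term, annotation_count, total_annotations):
    """IC(c) = -log p(c), where p(c) is the fraction of gene products
    annotated with term c or any of its descendants."""
    return -math.log(annotation_count[term] / total_annotations)

def resnik_similarity(t1, t2, ancestors, annotation_count, total):
    """Similarity of two ontology terms as the IC of their most
    informative common ancestor; 0 if they share no ancestor."""
    shared = ancestors[t1] & ancestors[t2]
    if not shared:
        return 0.0
    return max(information_content(c, annotation_count, total)
               for c in shared)
```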

903 citations


Proceedings Article
01 Jan 2003
TL;DR: A new measure of semantic relatedness between concepts that is based on the number of shared words (overlaps) in their definitions (glosses) and reasonably correlates to human judgments is presented.

753 citations


Proceedings Article
09 Aug 2003
TL;DR: This paper presents a new measure of semantic relatedness between concepts based on the number of shared words (overlaps) in their definitions (glosses), which is unique in that it extends the glosses of the concepts under consideration to include the glosses of other concepts to which they are related according to a given concept hierarchy.
Abstract: This paper presents a new measure of semantic relatedness between concepts that is based on the number of shared words (overlaps) in their definitions (glosses). This measure is unique in that it extends the glosses of the concepts under consideration to include the glosses of other concepts to which they are related according to a given concept hierarchy. We show that this new measure reasonably correlates to human judgments. We introduce a new method of word sense disambiguation based on extended gloss overlaps, and demonstrate that it fares well on the SENSEVAL-2 lexical sample data.
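
A simplified sketch of the extended-overlap idea follows. The published measure scores each shared multi-word phrase as the square of its length; here a plain shared-word count stands in, and `gloss` / `related` are hypothetical accessors into a concept hierarchy such as WordNet:

```python
def overlap(gloss_a, gloss_b):
    """Simplified overlap: number of distinct shared words. (The
    published measure instead finds shared phrases and scores each
    as the square of its length in words.)"""
    return len(set(gloss_a.split()) & set(gloss_b.split()))

def extended_gloss_overlap(sense_a, sense_b, gloss, related,
                           relations=("self", "hypernym", "hyponym")):
    """Extend each concept's gloss with the glosses of related concepts
    and sum overlaps over all relation pairs. related(s, "self") is
    assumed to return s itself."""
    return sum(overlap(gloss(related(sense_a, r1)),
                       gloss(related(sense_b, r2)))
               for r1 in relations for r2 in relations)
```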

720 citations


Journal ArticleDOI
01 Nov 2003
TL;DR: GLUE is described, a system that employs machine learning techniques to find semantic mappings between ontologies and is distinguished in that it works with a variety of well-defined similarity notions and that it efficiently incorporates multiple types of knowledge.
Abstract: On the Semantic Web, data will inevitably come from many different ontologies, and information processing across ontologies is not possible without knowing the semantic mappings between them. Manually finding such mappings is tedious, error-prone, and clearly not possible on the Web scale. Hence the development of tools to assist in the ontology mapping process is crucial to the success of the Semantic Web. We describe GLUE, a system that employs machine learning techniques to find such mappings. Given two ontologies, for each concept in one ontology GLUE finds the most similar concept in the other ontology. We give well-founded probabilistic definitions to several practical similarity measures and show that GLUE can work with all of them. Another key feature of GLUE is that it uses multiple learning strategies, each of which exploits well a different type of information either in the data instances or in the taxonomic structure of the ontologies. To further improve matching accuracy, we extend GLUE to incorporate commonsense knowledge and domain constraints into the matching process. Our approach is thus distinguished in that it works with a variety of well-defined similarity notions and that it efficiently incorporates multiple types of knowledge. We describe a set of experiments on several real-world domains and show that GLUE proposes highly accurate semantic mappings. Finally, we extend GLUE to find complex mappings between ontologies and describe experiments that show the promise of the approach.
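
One of the "well-founded probabilistic" similarity notions mentioned is the Jaccard coefficient P(A and B) / P(A or B), which can be estimated once learned classifiers predict membership of shared instances in each concept. A minimal sketch (the classifier interfaces are assumptions, not GLUE's API):

```python
def jaccard_concept_similarity(instances, in_a, in_b):
    """Estimate P(A and B) / P(A or B) from a sample of instances,
    where in_a and in_b are (in GLUE's setting, machine-learned)
    predicates deciding membership in each ontology's concept."""
    both = sum(1 for x in instances if in_a(x) and in_b(x))
    either = sum(1 for x in instances if in_a(x) or in_b(x))
    return both / either if either else 0.0
```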

533 citations


01 Dec 2003
TL;DR: This paper generalizes the Adapted Lesk Algorithm to a method of word sense disambiguation based on semantic relatedness and finds that the gloss overlaps of Adapted Lesk and the semantic distance measure of Jiang and Conrath (1997) result in the highest accuracy.
Abstract: This paper generalizes the Adapted Lesk Algorithm of Banerjee and Pedersen (2002) to a method of word sense disambiguation based on semantic relatedness. This is possible since Lesk's original algorithm (1986) is based on gloss overlaps which can be viewed as a measure of semantic relatedness. We evaluate a variety of measures of semantic relatedness when applied to word sense disambiguation by carrying out experiments using the English lexical sample data of SENSEVAL-2. We find that the gloss overlaps of Adapted Lesk and the semantic distance measure of Jiang and Conrath (1997) result in the highest accuracy.
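
The generalization can be phrased as: choose the sense of the target word that maximizes total relatedness to the surrounding words, under any pluggable relatedness measure. A sketch of that formulation (details such as window handling are simplifications, not the paper's exact procedure):

```python
def disambiguate(target_senses, context_sense_lists, relatedness):
    """Pick the target sense with the highest summed relatedness to the
    context. For each context word we take its best-matching sense, so
    an unrelated sense of a neighbor does not drag the score down.
    `relatedness` can be gloss overlap, a Jiang-Conrath-style distance
    turned into a similarity, or any other pairwise measure."""
    def score(sense):
        return sum(max(relatedness(sense, c) for c in senses)
                   for senses in context_sense_lists if senses)
    return max(target_senses, key=score)
```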

510 citations


Book ChapterDOI
16 Feb 2003
TL;DR: The authors generalize the Adapted Lesk algorithm to a method of word sense disambiguation based on semantic relatedness, which is possible since Lesk's original algorithm (1986) is based on gloss overlaps which can be viewed as a measure of semantic relatedness.
Abstract: This paper generalizes the Adapted Lesk Algorithm of Banerjee and Pedersen (2002) to a method of word sense disambiguation based on semantic relatedness. This is possible since Lesk's original algorithm (1986) is based on gloss overlaps which can be viewed as a measure of semantic relatedness. We evaluate a variety of measures of semantic relatedness when applied to word sense disambiguation by carrying out experiments using the English lexical sample data of SENSEVAL-2. We find that the gloss overlaps of Adapted Lesk and the semantic distance measure of Jiang and Conrath (1997) result in the highest accuracy.

494 citations


Journal ArticleDOI
TL;DR: This article focuses on methods for similarity search that make the general assumption that similarity is represented with a distance metric d, and presents algorithms for common types of queries that operate on an arbitrary "search hierarchy."
Abstract: Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distance-based indexing), while the second is based on mapping to a vector space (mapping-based approach). The main part of this article is dedicated to a survey of distance-based indexing methods, but we also briefly outline how search occurs in mapping-based methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary "search hierarchy." These algorithms can be applied on each of the methods presented, provided a suitable search hierarchy is defined.
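
The key mechanic behind distance-based indexing is that the triangle inequality gives a cheap lower bound on d(q, o) from precomputed pivot distances, so many candidates are discarded without any distance computation. A minimal single-pivot sketch (real indexes organize many pivots into trees):

```python
def range_search(objects, pivot, dist_to_pivot, query, radius, d):
    """Metric range query with one pivot. By the triangle inequality,
    |d(q, pivot) - d(pivot, o)| <= d(q, o), so any object whose bound
    already exceeds the radius cannot be an answer.
    dist_to_pivot[o] holds the precomputed d(pivot, o)."""
    dqp = d(query, pivot)
    results = []
    for o in objects:
        if abs(dqp - dist_to_pivot[o]) > radius:
            continue  # pruned: lower bound on d(query, o) exceeds radius
        if d(query, o) <= radius:  # verify with an actual distance
            results.append(o)
    return results
```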

480 citations


Book ChapterDOI
30 Jan 2003
TL;DR: Grammatical morphemes have more abstract and general meanings and are thus more apt to be used in multiple ways than content words, as discussed by the authors. Moreover, many linguists regard the study of grammar as more interesting and prestigious, so grams have tended to occupy center stage in linguistic theory.
Abstract: A recurrent problem in linguistic analysis is the existence of multiple senses or uses of a linguistic unit. While this affects all meaningful elements of language alike, content words as well as function words (such as prepositions and auxiliaries) and affixal categories (such as tense and case), it is particularly prominent with the latter two. Function words and affixes, which I will group together as "grammatical morphemes" (or "grams" for short), have more abstract and general meanings and are thus more apt to be used in multiple ways than content words. Moreover, many linguists regard the study of grammar as more interesting and prestigious, so the grams have tended to occupy center stage in linguistic theory. A few examples of grammatical morphemes with multiple senses/uses are given in (1)-(3). Each English example is followed by a short label describing the use or sense. The specific item whose uses/senses are exemplified is highlighted by boldface.

Journal ArticleDOI
TL;DR: This work proposes new measures that exploit a hierarchical domain structure in order to produce more intuitive similarity scores, and provides an experimental comparison of the measures against traditional similarity measures, and reports on a user study that evaluated how well the measures match human intuition.
Abstract: The notion of similarity between objects finds use in many contexts, for example, in search engines, collaborative filtering, and clustering. Objects being compared often are modeled as sets, with their similarity traditionally determined based on set intersection. Intersection-based measures do not accurately capture similarity in certain domains, such as when the data is sparse or when there are known relationships between items within sets. We propose new measures that exploit a hierarchical domain structure in order to produce more intuitive similarity scores. We extend our similarity measures to provide appropriate results in the presence of multisets (also handled unsatisfactorily by traditional measures), for example, to correctly compute the similarity between customers who buy several instances of the same product (say milk), or who buy several products in the same category (say dairy products). We also provide an experimental comparison of our measures against traditional similarity measures, and report on a user study that evaluated how well our measures match human intuition.
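
One way to realize the idea (a stand-in for the paper's measures, not their exact definitions) is to expand every item into itself plus its category ancestors and compute a multiset Jaccard over the expanded bags, so two baskets sharing only a "dairy" category still register nonzero similarity:

```python
from collections import Counter

def expand(items, ancestors):
    """Replace each item by itself plus its hierarchy ancestors, so
    distinct products in the same category still contribute shared
    mass. `ancestors` maps an item to its path up the hierarchy."""
    bag = Counter()
    for item in items:
        for node in [item] + list(ancestors[item]):
            bag[node] += 1
    return bag

def hierarchical_similarity(a_items, b_items, ancestors):
    """Generalized (multiset) Jaccard over the expanded bags, which
    also behaves sensibly when the same product is bought repeatedly."""
    a, b = expand(a_items, ancestors), expand(b_items, ancestors)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0
```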

Journal ArticleDOI
01 Jan 2003-Language
TL;DR: The authors compare the semantics of spatial adpositions in nine unrelated languages, with the help of a standard elicitation procedure, thus producing a preliminary semantic typology of space adpositional systems.
Abstract: Most approaches to spatial language have assumed that the simplest spatial notions are (after Piaget) topological and universal (containment, contiguity, proximity, support, represented as semantic primitives such as IN, ON, UNDER, etc.). These concepts would be coded directly in language, above all in small closed classes such as adpositions, thus providing a striking example of semantic categories as language-specific projections of universal conceptual notions. This idea, if correct, should have as a consequence that the semantic categories instantiated in spatial adpositions should be essentially uniform crosslinguistically. This article attempts to verify this possibility by comparing the semantics of spatial adpositions in nine unrelated languages, with the help of a standard elicitation procedure, thus producing a preliminary semantic typology of spatial adpositional systems. The differences between the languages turn out to be so significant as to be incompatible with stronger versions of the UNIVERSAL CONCEPTUAL CATEGORIES hypothesis. Rather, the language-specific spatial adposition meanings seem to emerge as compact subsets of an underlying semantic space, with certain areas being statistical ATTRACTORS or FOCI. Moreover, a comparison of systems with different degrees of complexity suggests the possibility of positing implicational hierarchies for spatial adpositions. But such hierarchies need to be treated as successive divisions of semantic space, as in recent treatments of basic color terms. This type of analysis appears to be a promising approach for future work in semantic typology.

Proceedings ArticleDOI
27 May 2003
TL;DR: This paper presents a method and its results for learning semantic constraints to detect part-whole relations and the targeted part-Whole relations were detected with an accuracy of 83%.
Abstract: The discovery of semantic relations from text becomes increasingly important for applications such as Question Answering, Information Extraction, Text Summarization, Text Understanding, and others. The semantic relations are detected by checking selectional constraints. This paper presents a method and its results for learning semantic constraints to detect part-whole relations. Twenty constraints were found. Their validity was tested on a 10,000 sentence corpus, and the targeted part-whole relations were detected with an accuracy of 83%.

Book ChapterDOI
20 Oct 2003
TL;DR: A new algorithm for discovering semantic mappings across hierarchical classifications is proposed, based on a new approach to semantic coordination that shifts the problem to deducing relations between sets of logical formulae representing the meaning of concepts belonging to different models.
Abstract: Semantic coordination, namely the problem of finding an agreement on the meaning of heterogeneous semantic models, is one of the key issues in the development of the Semantic Web. In this paper, we propose a new algorithm for discovering semantic mappings across hierarchical classifications based on a new approach to semantic coordination. This approach shifts the problem of semantic coordination from the problem of computing linguistic or structural similarities (what most other proposed approaches do) to the problem of deducing relations between sets of logical formulae that represent the meaning of concepts belonging to different models. We show how to apply the approach and the algorithm to an interesting family of semantic models, namely hierarchical classifications, and present the results of preliminary tests on two types of hierarchical classifications, web directories and catalogs. Finally, we argue why this is a significant improvement on previous approaches.

Proceedings ArticleDOI
12 Jul 2003
TL;DR: A construction-inspecific model of multiword expression decomposability based on latent semantic analysis is presented, and evidence is furnished for the calculated similarities being correlated with the semantic relational content of WordNet.
Abstract: This paper presents a construction-inspecific model of multiword expression decomposability based on latent semantic analysis. We use latent semantic analysis to determine the similarity between a multiword expression and its constituent words, and claim that higher similarities indicate greater decomposability. We test the model over English noun-noun compounds and verb-particles, and evaluate its correlation with similarities and hyponymy values in WordNet. Based on mean hyponymy over partitions of data ranked on similarity, we furnish evidence for the calculated similarities being correlated with the semantic relational content of WordNet.

Proceedings Article
01 Jan 2003
TL;DR: This paper implemented a system that measures semantic similarity using a computerized 1987 Roget's Thesaurus, and evaluated it by performing a few typical tests, including 80 TOEFL, 50 ESL and 300 Reader's Digest questions.
Abstract: We have implemented a system that measures semantic similarity using a computerized 1987 Roget's Thesaurus, and evaluated it by performing a few typical tests. We compare the results of these tests with those produced by WordNet-based similarity measures. One of the benchmarks is Miller and Charles’ list of 30 noun pairs to which human judges had assigned similarity measures. We correlate these measures with those computed by several NLP systems. The 30 pairs can be traced back to Rubenstein and Goodenough’s 65 pairs, which we have also studied. Our Roget’s-based system gets correlations of .878 for the smaller and .818 for the larger list of noun pairs; this is quite close to the .885 that Resnik obtained when he employed humans to replicate the Miller and Charles experiment. We further evaluate our measure by using Roget’s and WordNet to answer 80 TOEFL, 50 ESL and 300 Reader’s Digest questions: the correct synonym must be selected amongst a group of four words. Our system gets 78.75%, 82.00% and 74.33% of the questions respectively.
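
The benchmark protocol itself is simple: for each question, score every candidate against the stem word and return the highest-scoring one. A sketch, parameterized by any word-similarity function (Roget's-based, WordNet-based, or otherwise):

```python
def answer_synonym_question(stem, options, similarity):
    """TOEFL/ESL-style synonym selection: return the candidate whose
    similarity to the question word is highest."""
    return max(options, key=lambda option: similarity(stem, option))

# Hypothetical usage with made-up candidates:
# answer_synonym_question("intrepid",
#                         ["fearless", "tedious", "gradual", "common"],
#                         similarity=roget_similarity)
```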

Patent
23 Jun 2003
TL;DR: In this article, the semantics of one or more XML language inquiries across relational and non-relational data sources are generated using a semantic intermediate language representation, which is a graph structure with nodes which describe the operations of the original query.
Abstract: A computer system and method generate a semantic representation of one or more XML language inquiries across relational and non-relational data sources. A semantic intermediate language representation explicitly describes the meaning of the one or more XML language inquiries. The semantic intermediate language may be a graph structure with nodes which describe the operations of the original query. Operators assigned to the nodes in the semantic graph allow an unambiguous definition of the original XML query. The semantic intermediate language may be used to perform XML queries over single or multiple data sources. A method includes receiving at least one inquiry, defining at least one node object for every operation within the received inquiry, translating each node object using operators, and generating a semantic representation from the operators.

Patent
14 Feb 2003
TL;DR: The Similarity Search Engine as discussed by the authors is a system and method for defining a schema and sending a query to a similarity search engine to determine a quantitative assessment of the similarity of attributes between an anchor record and one or more target records.
Abstract: The invention provides a system and method for defining a schema and sending a query to a Similarity Search Engine to determine a quantitative assessment of the similarity of attributes between an anchor record and one or more target records. The Similarity Search Engine makes a similarity assessment in a single pass through the target records having multiple relationship characteristics. The Similarity Search Engine is a server configuration that comprises a Gateway for command and response routing, a Virtual Document Manager for document generation, a Search Manager for document scoring, and a Relational Database Management System for providing data persistence, data retrieval and access to User Defined Functions. The Similarity Search Engine uses a unique command syntax based on the Extensible Markup Language to implement functions necessary for similarity searching and scoring.

Journal ArticleDOI
TL;DR: This work presents three experiments indicating that similarity is strongly influenced by transformation distance, and introduces 'Representational Distortion', a family of transformation-based accounts of similarity.

Proceedings ArticleDOI
20 May 2003
TL;DR: This paper describes a novel approach for obtaining semantic interoperability among data sources in a bottom-up, semi-automatic manner without relying on pre-existing, global semantic models and develops a formal framework that takes into account both syntactic and semantic criteria.
Abstract: This paper describes a novel approach for obtaining semantic interoperability among data sources in a bottom-up, semi-automatic manner without relying on pre-existing, global semantic models. We assume that large amounts of data exist that have been organized and annotated according to local schemas. Seeing semantics as a form of agreement, our approach enables the participating data sources to incrementally develop global agreement in an evolutionary and completely decentralized process that solely relies on pair-wise, local interactions: Participants provide translations between schemas they are interested in and can learn about other translations by routing queries (gossiping). To support the participants in assessing the semantic quality of the achieved agreements we develop a formal framework that takes into account both syntactic and semantic criteria. The assessment process is incremental and the quality ratings are adjusted along with the operation of the system. Ultimately, this process results in global agreement, i.e., the semantics that all participants understand. We discuss strategies to efficiently find translations and provide results from a case study to justify our claims. Our approach applies to any system which provides a communication infrastructure (existing websites or databases, decentralized systems, P2P systems) and offers the opportunity to study semantic interoperability as a global phenomenon in a network of information sharing parties.

Patent
Mark E. Epstein1, Hakan Erdogan1, Yuqing Gao1, Michael Picheny1, Ruhi Sarikaya1 
05 Sep 2003
TL;DR: In this paper, a system and method for speech recognition includes generating a set of likely hypotheses in recognizing speech, rescoring the likely hypotheses by using semantic content by employing semantic structured language models, and scoring parse trees to identify a best sentence according to the sentence's parse tree.
Abstract: A system and method for speech recognition includes generating a set of likely hypotheses in recognizing speech, rescoring the likely hypotheses by using semantic content by employing semantic structured language models, and scoring parse trees to identify a best sentence according to the sentence's parse tree by employing the semantic structured language models to clarify the recognized speech.

Journal ArticleDOI
TL;DR: This work uses correspondence analysis to visualize groupings resulting from the association between semantic types and the relationships, and shows that the relationships are organized around a limited number of pivot groups and that partitions created at random do not exhibit this property.

Proceedings Article
07 Sep 2003
TL;DR: This work discusses a framework where ranking techniques can be used to identify more interesting and more relevant Semantic Associations, and utilizes alternative ways of specifying the context using ontology to capture users' interests more precisely and better quality results in relevance ranking.
Abstract: Discovering complex and meaningful relationships, which we call Semantic Associations, is an important challenge. Just as ranking of documents is a critical component of today's search engines, ranking of relationships will be essential in tomorrow's semantic search engines that would support discovery and mining of the Semantic Web. Building upon our recent work on specifying types of Semantic Associations in RDF graphs, which are possible to create through semantic metadata extraction and annotation, we discuss a framework where ranking techniques can be used to identify more interesting and more relevant Semantic Associations. Our techniques utilize alternative ways of specifying the context using ontology. This enables capturing users' interests more precisely and better quality results in relevance ranking.

Journal ArticleDOI
TL;DR: The authors used Latent Semantic Analysis (LSA) to estimate the semantic similarity of readers' think-aloud protocols to focal sentences and to the story sentences that provided direct causal antecedents to those focal sentences.
Abstract: The viability of assessing reading strategies is studied based on think-aloud protocols combined with Latent Semantic Analysis (LSA). Readers in two studies thought aloud after reading specific focal sentences embedded in two stories. LSA was used to estimate the semantic similarity of readers' think-aloud protocols to the focal sentences and to the sentences in the stories that provided direct causal antecedents to the focal sentences. Study 1 demonstrated that according to human- and LSA-based assessments of the protocols, the responses of less-skilled readers semantically overlapped more with the focal sentences than with the causal antecedent sentences, whereas the responses of skilled readers overlapped with these sentences equally. In addition, the extent that the semantic overlap with causal antecedents was greater than the overlap with the focal sentences predicted performance on comprehension test questions and the Nelson-Denny test of reading skill. Study 2 replicated these findings and also demon...
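
Mechanically, LSA overlap of this kind is a cosine between text vectors in a truncated-SVD space. A generic sketch (dimensionality and weighting choices are illustrative, not the study's exact settings):

```python
import numpy as np

def lsa_document_vectors(term_doc_matrix, k=300):
    """Truncated SVD of a term-by-document matrix; each document
    becomes a k-dimensional row vector."""
    u, s, vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k]).T

def semantic_overlap(vec_a, vec_b):
    """Cosine similarity between two texts (e.g., a think-aloud
    protocol and a story sentence) in the LSA space."""
    return float(vec_a @ vec_b /
                 (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
```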

01 Jan 2003
TL;DR: Semantic Relations and the Lexicon explores the many paradigmatic semantic relations between words, such as synonymy, antonymy and hyponymy as mentioned in this paper, and their relevance to the mental organization of our vocabularies.
Abstract: Semantic Relations and the Lexicon explores the many paradigmatic semantic relations between words, such as synonymy, antonymy and hyponymy, and their relevance to the mental organization of our vocabularies. Drawing on a century’s research in linguistics, psychology, philosophy, anthropology and computer science, Lynne Murphy proposes a new, pragmatic approach to these relations. Whereas traditional approaches to the lexicon have claimed that paradigmatic relations are part of our lexical knowledge, Dr Murphy argues that they constitute metalinguistic knowledge, which can be derived through a single relational principle, and may also be stored as part of our conceptual representation of a word. Part I shows how this approach can account for the properties of lexical relations in ways that traditional approaches cannot, and Part II examines particular relations in detail. This book will serve as an informative handbook for all linguists and cognitive scientists interested in the mental representation of vocabulary.

Book Chapter
01 Jan 2003
TL;DR: In this article, the authors propose two methods for inferring semantic similarity between terms from a corpus, one based on word-similarity and the other based on document similarity, giving rise to a system of equations whose equilibrium point they use to obtain a semantic similarity measure.
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and vice versa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results.
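
For the diffusion method, one standard construction (a sketch of the general idea, which may differ in detail from the chapter's) exponentiates the co-occurrence graph so that similarity flows along longer lexical paths with factorially decaying weight:

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(cooccurrence, lam=0.1):
    """Exponential diffusion over a symmetric term co-occurrence
    matrix G: K = exp(lam * G) = sum_k (lam^k / k!) G^k, so terms
    linked by chains of co-occurrence become similar even if they
    never co-occur directly. lam plays the role of the real-valued
    kernel parameter mentioned in the abstract."""
    return expm(lam * cooccurrence)
```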

01 Jan 2003
TL;DR: The role that semantic structures play in establishing communication between different agents in general is discussed, and a number of intelligent means that make semantic web sites accessible from the outside are elaborated.
Abstract: The core idea of the Semantic Web is to make information accessible to human and software agents on a semantic basis. Hence, web sites may feed directly from the Semantic Web exploiting the underlying structures for human and machine access. We have developed a generic approach for developing semantic portals, viz. SEAL (SEmantic portAL), that exploits semantics for providing and accessing information at a portal as well as constructing and maintaining the portal. In this paper, we discuss the role that semantic structures play in establishing communication between different agents in general. We elaborate on a number of intelligent means that make semantic web sites accessible from the outside, viz. semantics-based browsing, semantic querying and querying with semantic similarity, semantic personalization, and machine access to semantic information at a semantic portal. As a case study we refer to the AIFB web site — a place that is increasingly driven by Semantic Web technologies.

Proceedings ArticleDOI
12 Jan 2003
TL;DR: In this paper, a new class of metrics appropriate for measuring effective similarity relations between sequences, say one type of similarity per metric, is studied, and a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, is proposed.
Abstract: A new class of metrics appropriate for measuring effective similarity relations between sequences, say one type of similarity per metric, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it minorizes every metric in the class (that is, it is universal in that it discovers all effective similarities). We demonstrate that it too is a metric and takes values in [0, 1]; hence it may be called the similarity metric. This is a theory foundation for a new general practical tool. We give two distinctive applications in widely divergent areas (the experiments by necessity use just computable approximations to the target notions). First, we computationally compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatically computed whole mitochondrial phylogeny tree. Secondly, we give a fully automatically computed language tree of 52 different languages, based on translated versions of the "Universal Declaration of Human Rights".
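
The "computable approximation" in practice is the normalized compression distance, with a real compressor standing in for Kolmogorov complexity. A minimal sketch using zlib:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, an effective stand-in for the
    normalized information distance: C is compressed length, and the
    noncomputable Kolmogorov complexity is approximated by zlib.
    Values near 0 mean very similar; near 1, unrelated."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# e.g. ncd(genome_a_bytes, genome_b_bytes), or pairs of translated texts
```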

Book ChapterDOI
15 Dec 2003
TL;DR: A suite of methods that utilizes both the semantics of the identifiers of WSDL descriptions and the structure of their operations, messages and data types to assess the similarity of two WSDL files is described and experimentally evaluated.
Abstract: The web-services stack of standards is designed to support the reuse and interoperation of software components on the web. A critical step in the process of developing applications based on web services is service discovery, i.e., the identification of existing web services that can potentially be used in the context of a new web application. UDDI, the standard API for publishing web-services specifications, provides a simple browsing-by-business-category mechanism for developers to review and select published services. To support programmatic service discovery, we have developed a suite of methods that utilizes both the semantics of the identifiers of WSDL descriptions and the structure of their operations, messages and data types to assess the similarity of two WSDL files. Given only a textual description of the desired service, a semantic information-retrieval method can be used to identify and order the most similar service-description files. This step assesses the similarity of the provided description of the desired service with the available services. If a (potentially partial) specification of the desired service behavior is also available, this set of likely candidates can be further refined by a semantic structure-matching step assessing the structural similarity of the desired vs. the retrieved services and the semantic similarity of their identifier. In this paper, we describe and experimentally evaluate our suite of service-similarity assessment methods.