Open Access, Posted Content

Automatic Meaning Discovery Using Google

TLDR
A new theory of relative semantics between objects is presented, based on information distance and Kolmogorov complexity, which is then applied to construct a method to automatically extract the meaning of words and phrases from the world-wide-web using Google page counts.
Abstract
Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of 'society' is 'database,' and the equivalent of 'use' is 'way to search the database.' We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts we use the world-wide-web as database, and Google as search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the world-wide-web using Google page counts. The world-wide-web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies, and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert-crafted WordNet categories.
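The abstract does not spell out the distance itself. As a minimal sketch, assuming the normalized form usually given for the Google similarity distance, NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}), the computation from page counts could look like the following. The function name `ngd`, the constant `N_PAGES`, and the page counts are illustrative assumptions, not live query results.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google distance from page counts (sketch).

    fx, fy : number of pages containing term x, term y (assumed inputs)
    fxy    : number of pages containing both terms
    n      : total number of indexed pages (a normalizing constant)
    """
    # Terms that never occur, or never co-occur, are treated as
    # infinitely far apart; this convention is an assumption here.
    if min(fx, fy, fxy) <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Illustrative counts for the terms "horse" and "rider" (assumed values).
N_PAGES = 8_058_044_651
d = ngd(46_700_000, 12_200_000, 2_630_000, N_PAGES)
print(f"NGD(horse, rider) = {d:.3f}")  # prints NGD(horse, rider) = 0.443
```

Smaller values indicate terms that tend to appear on the same pages. Because the ratio uses only differences of logarithms, the choice of logarithm base does not affect the result.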


Citations
Journal Article

The Google Similarity Distance

TL;DR: A new theory of similarity between words and phrases based on information distance and Kolmogorov complexity is presented, which is applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts.
Book

Ontology Engineering in a Networked World

TL;DR: This book by Suárez-Figueroa et al. provides the necessary methodological and technological support for the development and use of ontology networks, which ontology developers need in this distributed environment.
Patent

System and method for manipulating content in a hierarchical data-driven search and navigation system

TL;DR: A data-driven, hierarchical information search and navigation system and method enable searching and navigation of sets of materials by common attributes that characterize the materials, and a rules engine allows the content displayed to the user to be manipulated based on the query entered by the user.
Journal Article

Learning non-taxonomic relationships from web documents for domain ontology construction

TL;DR: An automatic and unsupervised methodology is presented that is able to discover domain-related verbs, extract non-taxonomically related concepts, and label relationships, using the Web as corpus; encouraging results are reported for several domains.
Patent

Method and system for information retrieval with clustering

TL;DR: A subset of materials is selected from a collection, and relevant properties associated with the subset of materials are clustered into property clusters. Each property cluster generally contains properties that are more similar to each other than to properties in different property clusters.
References
Book

Elements of information theory

TL;DR: The author examines the role of entropy, inequality, and randomness in the design of codes and the construction of codes in the rapidly changing environment.
Journal Article

A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge.

TL;DR: A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena.

An Introduction to Kolmogorov Complexity and Its Applications

TL;DR: The book presents a thorough treatment of the central ideas and their applications of Kolmogorov complexity with a wide range of illustrative applications, and will be ideal for advanced undergraduate students, graduate students, and researchers in computer science, mathematics, cognitive sciences, philosophy, artificial intelligence, statistics, and physics.
Journal Article

Three approaches to the quantitative definition of information

TL;DR: Three approaches to the quantitative definition of information are presented: the combinatorial, the probabilistic, and the algorithmic.