
Showing papers on "Pointwise mutual information" published in 2002


Posted Content
TL;DR: This article presents a simple unsupervised learning algorithm, called PMI-IR, for recognizing synonyms based on statistical data acquired by querying a web search engine; it uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words.
Abstract: This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).

1,303 citations
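
The core of the PMI-IR approach is a pointwise mutual information score estimated from search-engine hit counts. Below is a minimal sketch of that scoring idea, assuming a hypothetical hit_count function standing in for the search-engine query interface; the paper's actual query operators (such as NEAR) and score variants are not reproduced here.

```python
import math

def hit_count(query: str) -> int:
    """Hypothetical stand-in for the number of documents a search engine
    returns for `query`; a real implementation would call a search API."""
    raise NotImplementedError

def pmi_score(word: str, choice: str, total_docs: float = 1e9) -> float:
    """Estimate pointwise mutual information between two words from
    document co-occurrence counts:
        PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) )
    where each probability is approximated by hit count / total_docs.
    """
    p_joint = hit_count(f"{word} AND {choice}") / total_docs
    p_word = hit_count(word) / total_docs
    p_choice = hit_count(choice) / total_docs
    if p_joint == 0 or p_word == 0 or p_choice == 0:
        return float("-inf")
    return math.log2(p_joint / (p_word * p_choice))

def best_synonym(problem_word: str, choices: list[str]) -> str:
    """Pick the choice with the highest PMI score, as in a TOEFL-style
    synonym question."""
    return max(choices, key=lambda c: pmi_score(problem_word, c))
```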


Journal ArticleDOI
01 Oct 2002
TL;DR: The findings show that the algorithms used so far may be quite substantially improved upon; in particular, when dealing with small datasets, finite sample effects and other sources of potentially misleading results have to be taken into account.
Abstract: Motivation: Clustering co-expressed genes usually requires the definition of ‘distance’ or ‘similarity’ between measured datasets, the most common choices being Pearson correlation or Euclidean distance. With the size of available datasets steadily increasing, it has become feasible to consider other, more general, definitions as well. One alternative, based on information theory, is the mutual information, providing a general measure of dependencies between variables. While the use of mutual information in cluster analysis and visualization of large-scale gene expression data has been suggested previously, the earlier studies did not focus on comparing different algorithms to estimate the mutual information from finite data. Results: Here we describe and review several approaches to estimate the mutual information from finite datasets. Our findings show that the algorithms used so far may be quite substantially improved upon. In particular when dealing with small datasets, finite sample effects and other sources of potentially misleading results have to be taken into account.

764 citations
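
For context, the simplest estimator in this setting is the naive histogram ("plug-in") estimate obtained by binning the two expression profiles and applying the discrete mutual information formula. A minimal sketch is below, with an optional Miller-Madow bias correction shown as one example of a finite-sample adjustment; which estimators and corrections the paper ultimately favours is not restated here.

```python
import numpy as np

def mutual_information_hist(x, y, bins=10, miller_madow=False):
    """Naive plug-in estimate of I(X;Y) in nats from two 1-D samples.

    Both samples are discretized into equal-width bins, and the discrete
    formula I = H(X) + H(Y) - H(X,Y) is applied to the empirical
    histogram.  With miller_madow=True, each entropy term receives the
    first-order Miller-Madow bias correction (K - 1) / (2N), where K is
    the number of occupied bins and N the sample size.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / n
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    mi = entropy(p_x) + entropy(p_y) - entropy(p_xy)
    if miller_madow:
        k_x = np.count_nonzero(p_x)
        k_y = np.count_nonzero(p_y)
        k_xy = np.count_nonzero(p_xy)
        mi += (k_x - 1) / (2 * n) + (k_y - 1) / (2 * n) - (k_xy - 1) / (2 * n)
    return max(mi, 0.0)

# Example: two correlated profiles give a clearly positive MI estimate.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)
print(mutual_information_hist(x, y, bins=8, miller_madow=True))
```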


Posted Content
TL;DR: This paper introduces a simple algorithm for unsupervised learning of semantic orientation from extremely large corpora by issuing queries to a Web search engine and using pointwise mutual information to analyse the results.
Abstract: The evaluative character of a word is called its semantic orientation. A positive semantic orientation implies desirability (e.g., "honest", "intrepid") and a negative semantic orientation implies undesirability (e.g., "disturbing", "superfluous"). This paper introduces a simple algorithm for unsupervised learning of semantic orientation from extremely large corpora. The method involves issuing queries to a Web search engine and using pointwise mutual information to analyse the results. The algorithm is empirically evaluated using a training corpus of approximately one hundred billion words — the subset of the Web that is indexed by the chosen search engine. Tested with 3,596 words (1,614 positive and 1,982 negative), the algorithm attains an accuracy of 80%. The 3,596 test words include adjectives, adverbs, nouns, and verbs. The accuracy is comparable with the results achieved by Hatzivassiloglou and McKeown (1997), using a complex four-stage supervised learning algorithm that is restricted to determining the semantic orientation of adjectives.

375 citations
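
The semantic-orientation score contrasts a word's association with positive versus negative reference words, with each association measured by PMI over search-engine counts. The sketch below uses "excellent" and "poor" as illustrative seed words and a hypothetical hit_count function; the paper's actual query operators and seed choices are not reproduced here.

```python
import math

def hit_count(query: str) -> int:
    """Hypothetical stand-in for a search engine's document count for `query`."""
    raise NotImplementedError

def pmi(a: str, b: str) -> float:
    """log2 of hits(a NEAR b) * N / (hits(a) * hits(b)).  The corpus size N
    cancels out of the orientation difference below, so a rough constant
    is sufficient here."""
    n = 1e11  # rough corpus size; cancels in semantic_orientation()
    joint = hit_count(f"{a} NEAR {b}")
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n / (hit_count(a) * hit_count(b)))

def semantic_orientation(word: str,
                         pos_seed: str = "excellent",
                         neg_seed: str = "poor") -> float:
    """Positive values suggest positive orientation, negative values
    negative orientation.  The seed words are illustrative choices."""
    return pmi(word, pos_seed) - pmi(word, neg_seed)
```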


Proceedings Article
01 Aug 2002
TL;DR: In this article, the distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution, is analyzed for the problem of selecting features for incremental learning and classification of the naive Bayes classifier.
Abstract: Mutual information is widely used in artificial intelligence, in a descriptive way, to measure the stochastic dependence of discrete random variables. In order to address questions such as the reliability of the empirical value, one must consider sample-to-population inferential approaches. This paper deals with the distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution. The exact analytical expression for the mean and an analytical approximation of the variance are reported. Asymptotic approximations of the distribution are proposed. The results are applied to the problem of selecting features for incremental learning and classification of the naive Bayes classifier. A fast, newly defined method is shown to outperform the traditional approach based on empirical mutual information on a number of real data sets. Finally, a theoretical development is reported that allows one to efficiently extend the above methods to incomplete samples in an easy and effective way.

116 citations
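
The paper derives analytical expressions and approximations for the moments of this distribution. As a concrete (if brute-force) point of comparison, the same distribution can be approximated by Monte Carlo: sample joint cell probabilities from the Dirichlet posterior of a contingency table and evaluate the mutual information of each draw. The sketch below uses a symmetric Dirichlet prior as an assumption and is not the authors' analytical method.

```python
import numpy as np

def mi_of_joint(p):
    """Mutual information (in nats) of a 2-D joint probability table."""
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

def mi_posterior_samples(counts, prior=1.0, n_samples=5000, rng=None):
    """Draw samples of I(X;Y) under a Dirichlet(counts + prior) posterior
    over the joint cell probabilities of the contingency table `counts`."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    alpha = (counts + prior).ravel()
    draws = rng.dirichlet(alpha, size=n_samples)      # shape (n_samples, r*c)
    return np.array([mi_of_joint(d.reshape(counts.shape)) for d in draws])

# Example: posterior mean, standard deviation and a 95% credible interval
# for the MI of a small contingency table.
counts = np.array([[30, 5], [4, 25]])
samples = mi_posterior_samples(counts)
print(samples.mean(), samples.std(), np.percentile(samples, [2.5, 97.5]))
```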


Proceedings Article
01 May 2002
TL;DR: A frequency-biased version of mutual dependency is found to perform best, followed closely by the log-likelihood ratio; the paper also points out some implications of using available electronic dictionaries such as WordNet for the evaluation of collocation extraction.
Abstract: Corpus-based automatic extraction of collocations is typically carried out employing some statistic indicating co-occurrence in order to identify words that co-occur more often than expected by chance. In this paper we are concerned with some typical measures such as the t-score, Pearson’s χ-square test, log-likelihood ratio, pointwise mutual information and a novel information-theoretic measure, namely mutual dependency. Apart from some theoretical discussion about their correlation, we perform comparative evaluation experiments, judging performance by the measures' ability to identify lexically associated bigrams. We use two different gold standards: WordNet and lists of named entities. Besides discovering that a frequency-biased version of mutual dependency performs best, followed closely by the likelihood ratio, we point out some implications of using available electronic dictionaries such as WordNet for the evaluation of collocation extraction.

76 citations
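
For reference, the two information-theoretic scores compared above can be written as follows for a bigram (x, y). The exact form of the frequency-biased variant used in the paper is not restated here; a common log-frequency weighting is shown as an assumption.

```latex
% Pointwise mutual information of a bigram (x, y)
\[
  \mathrm{PMI}(x,y) \;=\; \log_2 \frac{P(x,y)}{P(x)\,P(y)}
\]
% Mutual dependency: PMI penalized by the bigram's own self-information
\[
  \mathrm{MD}(x,y) \;=\; \log_2 \frac{P(x,y)^2}{P(x)\,P(y)}
              \;=\; \mathrm{PMI}(x,y) + \log_2 P(x,y)
\]
% An assumed log-frequency-biased variant: add the bigram's log-probability
% once more so that frequent pairs are favoured
\[
  \mathrm{LFMD}(x,y) \;=\; \mathrm{MD}(x,y) + \log_2 P(x,y)
\]
```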


Proceedings ArticleDOI
01 Nov 2002
TL;DR: The focus of this article is simple networks in which the nodal random variables are time series; it includes an example from hydrology.
Abstract: A statistical network is a collection of nodes representing random variables and a set of edges that connect the nodes. A probabilistic model for such a network is called a statistical graphical model. These models, graphs and networks are particularly useful for examining statistical dependencies amongst quantities via conditioning. In this article the nodal random variables are time series. Basic to the study of statistical networks is some measure of the strength of (possibly directed) connections between the nodes. The use of the ordinary and partial coherences and of mutual information is considered as a means of inference concerning statistical graphical models. The focus of this article is simple networks. The article includes an example from hydrology.

25 citations
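
In this time-series setting, the strength of an undirected edge between two nodes can be scored, for example, by the magnitude-squared coherence of the two series. A minimal sketch using scipy is below; the paper also considers partial coherence and mutual information, which are not shown.

```python
import numpy as np
from scipy.signal import coherence

def edge_strength(x, y, fs=1.0, nperseg=256):
    """Score the connection between two nodal time series by the peak
    magnitude-squared coherence across frequencies (a value in [0, 1])."""
    freqs, cxy = coherence(x, y, fs=fs, nperseg=nperseg)
    return float(np.max(cxy))

# Example: two series driven by a common component show high coherence.
rng = np.random.default_rng(1)
common = rng.normal(size=2048)
x = common + 0.5 * rng.normal(size=2048)
y = common + 0.5 * rng.normal(size=2048)
print(edge_strength(x, y))
```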


Proceedings ArticleDOI
23 Oct 2002
TL;DR: Mutual information similarity metrics computed from fractional order Renyi entropy and entropy kind t are presented as novel similarity metrics for ultrasound/MRI registration and are shown to be more accurate than Shannon mutual information in many cases.
Abstract: Mutual information has been widely used as a similarity metric for biomedical image registration. Although usually based on the Shannon definition of entropy, mutual information may be computed from other entropy definitions. Mutual information similarity metrics computed from fractional order Renyi entropy and entropy kind t are presented as novel similarity metrics for ultrasound/MRI registration. These metrics are shown to be more accurate than Shannon mutual information in many cases, and frequently facilitate faster convergence to the optimum. They are particularly effective for local optimization, but some measures may potentially be exploited for global searches.

11 citations
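
One common way to build such a generalized similarity metric is to replace each Shannon entropy in the usual mutual-information decomposition with a Rényi entropy of order α. That construction is sketched below as an assumption; the paper's exact definitions, including the "entropy kind t" variant, are not restated here.

```latex
% Renyi entropy of order alpha (Shannon entropy is the limit alpha -> 1)
\[
  H_\alpha(A) \;=\; \frac{1}{1-\alpha}\,\log \sum_i p_i^{\alpha},
  \qquad \alpha > 0,\ \alpha \neq 1
\]
% An MI-style similarity metric assembled from marginal and joint
% Renyi entropies of the two images (assumed construction)
\[
  I_\alpha(A,B) \;=\; H_\alpha(A) + H_\alpha(B) - H_\alpha(A,B)
\]
```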


Posted Content
Don H. Johnson
TL;DR: To measure mutual information, the experimenter defines a stimulus set and, from the measured response, estimates the probability distribution of the response under each stimulus condition.
Abstract: Mutual information between stimulus and response has been advocated as an information theoretic measure of a neural system’s capability to process information. Once calculated, the result is a single number that supposedly captures the system’s information characteristics over the range of stimulus conditions used to measure it. I show that mutual information is a flawed measure, the standard approach to measuring it has theoretical difficulties, and that relating capacity to information processing capability is quite complicated.

11 citations


Proceedings ArticleDOI
08 May 2002
TL;DR: The expectation-maximization algorithm is introduced for Gaussian clustering of MI estimates, and a set of rules is specified for intelligently determining the binning interval of the input and target spaces.
Abstract: The mutual information-radial basis function network (MI-RBFN) is an efficient, general, and integrated method of approximating complex, continuous, deterministic systems from incomplete information. The nodes of the MI-RBFN are located by clustering local mutual information estimates, thereby yielding a mapping that inherently generalizes better than one formulated by seeking solely to minimize residuals. The expectation-maximization algorithm is introduced for Gaussian clustering of MI estimates. A further improvement in the methodology is marked by the specification of a set of rules for intelligently determining the binning interval of the input and target spaces.

5 citations
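
The abstract is terse about how the clustering is set up; one way to read it is sketched below: assign each input-target sample a local (pointwise) mutual information value from a binned joint histogram, then fit a Gaussian mixture by EM to the inputs with the highest local MI and use the mixture means as candidate RBF centres. The binning rule, the thresholding, and the use of scikit-learn's GaussianMixture are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def local_mi(x, y, bins=16):
    """Pointwise mutual information log p(x,y)/(p(x)p(y)) assigned to each
    (x, y) sample via an equal-width 2-D histogram (illustrative binning)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    joint, xe, ye = np.histogram2d(x, y, bins=bins)
    p_xy = joint / len(x)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    ix = np.clip(np.digitize(x, xe[1:-1]), 0, bins - 1)
    iy = np.clip(np.digitize(y, ye[1:-1]), 0, bins - 1)
    return np.log(p_xy[ix, iy] / (p_x[ix] * p_y[iy]))

def rbf_centres(x, y, n_centres=5, keep_frac=0.5):
    """Fit a Gaussian mixture (EM) to the input locations whose local MI is
    highest; the mixture means serve as candidate RBF centres."""
    x = np.asarray(x, dtype=float)
    pmi = local_mi(x, y)
    thresh = np.quantile(pmi, 1.0 - keep_frac)
    pts = x[pmi >= thresh].reshape(-1, 1)
    gm = GaussianMixture(n_components=n_centres, random_state=0).fit(pts)
    return gm.means_.ravel()
```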


15 May 2002
TL;DR: A fast, newly defined filter is shown to outperform the traditional approach based on empirical mutual information on a number of real data sets, and a theoretical development allows the above methods to be extended to incomplete samples in an easy and effective way.
Abstract: Mutual information is widely used in artificial intelligence, in a descriptive way, to measure the stochastic dependence of discrete random variables. In order to address questions such as the reliability of the empirical value, one must consider sample-to-population inferential approaches. This paper deals with the distribution of mutual information, as obtained in a Bayesian framework by using second-order Dirichlet prior distributions. We derive reliable and quickly computable analytical approximations for the distribution of mutual information. We concentrate on the mean, variance, skewness, and kurtosis. For the mean we also provide an exact expression. The results are applied to the problem of selecting features for incremental learning and classification of the naive Bayes classifier. A fast, newly defined filter is shown to outperform the traditional approach based on empirical mutual information on a number of real data sets. A theoretical development allows the above methods to be extended to incomplete samples in an easy and effective way. Further experiments on incomplete data sets support the extension of the proposed filter to the case of missing data.

4 citations
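
As a usage illustration of such a filter, the decision rule can be phrased as: keep a feature only if, under the posterior distribution of its mutual information with the class, the probability that the MI exceeds a small threshold is high. The sketch below approximates that posterior by Dirichlet Monte Carlo rather than by the paper's analytical approximations; the threshold and credibility level are arbitrary illustrative values.

```python
import numpy as np

def keep_feature(counts, eps=0.02, credibility=0.95, prior=1.0,
                 n_samples=4000, rng=None):
    """Bayesian filter decision for one feature.

    `counts` is the feature-by-class contingency table.  The feature is
    kept if P( I(feature; class) > eps ) >= credibility under a
    Dirichlet(counts + prior) posterior over the joint cell probabilities
    (posterior approximated here by Monte Carlo, not analytically).
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    draws = rng.dirichlet((counts + prior).ravel(), size=n_samples)
    mis = []
    for d in draws:
        p = d.reshape(counts.shape)       # strictly positive almost surely
        px = p.sum(axis=1, keepdims=True)
        py = p.sum(axis=0, keepdims=True)
        mis.append(np.sum(p * np.log(p / (px @ py))))
    return float(np.mean(np.array(mis) > eps)) >= credibility
```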


Journal ArticleDOI
TL;DR: This work uses a new method for calculating mutual information based on empirical classification to show how mutual information and the Kullback–Leibler distance summarize coding efficacy, and suggests that knowledge gained through mutual information methods could be more easily obtained and interpreted using the Kullback–Leibler distance.
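
The link between the two quantities in this summary is that mutual information is itself a Kullback–Leibler distance, namely the divergence of the joint distribution from the product of its marginals; the specific KL distance between response distributions used in the paper is not restated here.

```latex
% Mutual information as a Kullback-Leibler distance between the joint
% distribution and the product of its marginals
\[
  I(X;Y) \;=\; D_{\mathrm{KL}}\!\bigl( p(x,y) \,\big\|\, p(x)\,p(y) \bigr)
         \;=\; \sum_{x,y} p(x,y)\,\log \frac{p(x,y)}{p(x)\,p(y)}
\]
```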

Proceedings Article
08 Jul 2002
TL;DR: In this article, the authors proposed a new family of Hidden Markov Models (MMIHMMs) which have the same graphical structure as HMMs, but the cost function being optimized is not the joint likelihood of the observations and the hidden states.
Abstract: This paper proposes a new family of Hidden Markov Models named Maximum Mutual Information Hidden Markov Models (MMIHMMs). MMIHMMs have the same graphical structure as HMMs. However, the cost function being optimized is not the joint likelihood of the observations and the hidden states. It consists of the weighted linear combination of the mutual information between the hidden states and the observations and the likelihood of the observations and the states. We present both theoretical and practical motivations for having such a cost function. Next, we derive the parameter estimation (learning) equations for both the discrete and continuous observation cases. Finally we illustrate the superiority of our approach in different classification tasks by comparing the classification performance of our proposed Maximum Mutual Information HMMs (MMIHMMs) with standard Maximum Likelihood HMMs (HMMs), in the case of synthetic and real, discrete and continuous, supervised and unsupervised data. We believe that MMIHMMs are a powerful tool to solve many of the problems associated with HMMs when used for classification and/or clustering.
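
As described, the MMIHMM objective is a weighted combination of the joint likelihood of the observations and states with the mutual information between hidden states and observations. One way to write such an objective, with the weighting parameterized by α as an assumption (the paper's exact weighting scheme is not restated here), is:

```latex
% An assumed parameterization of the MMIHMM objective: a convex combination
% of the joint likelihood of observations O and hidden states Q (model
% parameters lambda) with the mutual information between states and
% observations
\[
  \mathcal{F}_{\alpha} \;=\; (1-\alpha)\,\log P(O, Q \mid \lambda)
                     \;+\; \alpha\, I(Q;\,O),
  \qquad \alpha \in [0,1]
\]
```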

Journal ArticleDOI
TL;DR: The standard quantum information theory of block messages with fixed block length is generalized to variable block length, and it is shown that the states belonging to a sufficiently large Hilbert space are highly distinguishable states.
Abstract: By making use of the theoretical framework presented by Bostroem (K. J. Bostroem, LANL quant-ph/0009052), we generalize the standard quantum information theory of block messages with fixed block length to the variable-length case. We show that the states belonging to a sufficiently large Hilbert space are highly distinguishable states. We also consider the collection states (product states of more than one qubit state) and seek a "pretty good measurement" (PGM) with measurement vectors that improve the mutual information. The average mutual information over random block-message ensembles with variable block length n is discussed in detail.
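
For reference, the "pretty good measurement" (also called the square-root measurement) for an ensemble of states is usually defined by the POVM elements below; whether the paper uses exactly this form for its measurement vectors is not restated here.

```latex
% Square-root ("pretty good") measurement for an ensemble {p_i, rho_i};
% the POVM elements sum to the identity on the support of rho
\[
  \rho \;=\; \sum_i p_i\,\rho_i, \qquad
  E_i \;=\; \rho^{-1/2}\, p_i\,\rho_i\, \rho^{-1/2}, \qquad
  \sum_i E_i \;=\; \mathbb{1}_{\mathrm{supp}(\rho)}
\]
```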