
Showing papers on "Metric (mathematics)" published in 2007


Proceedings ArticleDOI
20 Jun 2007
TL;DR: An information-theoretic approach to learning a Mahalanobis distance function that can handle a wide variety of constraints and can optionally incorporate a prior on the distance function; an online version is also presented, with regret bounds for the resulting algorithm.
Abstract: In this paper, we present an information-theoretic approach to learning a Mahalanobis distance function. We formulate the problem as that of minimizing the differential relative entropy between two multivariate Gaussians under constraints on the distance function. We express this problem as a particular Bregman optimization problem---that of minimizing the LogDet divergence subject to linear constraints. Our resulting algorithm has several advantages over existing methods. First, our method can handle a wide variety of constraints and can optionally incorporate a prior on the distance function. Second, it is fast and scalable. Unlike most existing methods, no eigenvalue computations or semi-definite programming are required. We also present an online version and derive regret bounds for the resulting algorithm. Finally, we evaluate our method on a recent error reporting system for software called Clarify, in the context of metric learning for nearest neighbor classification, as well as on standard data sets.

2,058 citations
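As a hedged illustration of the quantities the ITML paper above works with, the sketch below computes the LogDet (Burg) divergence between two positive-definite matrices and the corresponding Mahalanobis distance; the function names and toy data are illustrative, and the paper's actual Bregman-projection updates for the pairwise constraints are not reproduced.

```python
import numpy as np

def logdet_divergence(A, A0):
    """LogDet (Burg matrix) divergence D_ld(A, A0) = tr(A A0^-1) - log det(A A0^-1) - d
    between two positive-definite d x d matrices."""
    d = A.shape[0]
    M = A @ np.linalg.inv(A0)
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - d

def mahalanobis_sq(x, y, A):
    """Squared Mahalanobis distance d_A(x, y) = (x - y)^T A (x - y)."""
    diff = x - y
    return float(diff @ A @ diff)

# Tiny usage example: identity prior A0 and a slightly perturbed learned matrix A.
rng = np.random.default_rng(0)
A0 = np.eye(3)
A = A0 + 0.1 * np.diag(rng.random(3))
x, y = rng.normal(size=3), rng.normal(size=3)
print(logdet_divergence(A, A0), mahalanobis_sq(x, y, A))
```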


Journal ArticleDOI
TL;DR: METRIC uses as its foundation the pioneering SEBAL energy balance process developed in The Netherlands by Bastiaanssen, where the near-surface temperature gradients are an indexed function of radiometric surface temperature, thereby eliminating the need for absolutely accurate surface temperature and the need for air-temperature measurements.
Abstract: Mapping evapotranspiration at high resolution with internalized calibration (METRIC) is a satellite-based image-processing model for calculating evapotranspiration (ET) as a residual of the surface energy balance. METRIC uses as its foundation the pioneering SEBAL energy balance process developed in The Netherlands by Bastiaanssen, where the near-surface temperature gradients are an indexed function of radiometric surface temperature, thereby eliminating the need for absolutely accurate surface temperature and the need for air-temperature measurements. The surface energy balance is internally calibrated using ground-based reference ET to reduce computational biases inherent to remote sensing-based energy balance and to provide congruency with traditional methods for ET. Slope and aspect functions and temperature lapsing are used in applications in mountainous terrain. METRIC algorithms are designed for relatively routine application by trained engineers and other technical professionals who possess a fami...

1,570 citations


Journal ArticleDOI
TL;DR: In this paper, the authors introduce cone metric spaces and prove fixed point theorems for contractive mappings on these spaces.

1,171 citations


Journal ArticleDOI
01 Apr 2007
TL;DR: A programming-by-demonstration framework for generically extracting the relevant features of a given task and for addressing the problem of generalizing the acquired knowledge to different contexts is presented.
Abstract: We present a programming-by-demonstration framework for generically extracting the relevant features of a given task and for addressing the problem of generalizing the acquired knowledge to different contexts. We validate the architecture through a series of experiments, in which a human demonstrator teaches a humanoid robot simple manipulatory tasks. A probability-based estimation of the relevance is suggested by first projecting the motion data onto a generic latent space using principal component analysis. The resulting signals are encoded using a mixture of Gaussian/Bernoulli distributions (Gaussian mixture model/Bernoulli mixture model). This provides a measure of the spatio-temporal correlations across the different modalities collected from the robot, which can be used to determine a metric of the imitation performance. The trajectories are then generalized using Gaussian mixture regression. Finally, we analytically compute the trajectory which optimizes the imitation metric and use this to generalize the skill to different contexts

1,089 citations


Proceedings ArticleDOI
23 Jun 2007
TL;DR: The technical details underlying the Meteor metric are recapped; the latest release includes improved metric parameters and extends the metric to support evaluation of MT output in Spanish, French, and German, in addition to English.
Abstract: Meteor is an automatic metric for Machine Translation evaluation which has been demonstrated to have high levels of correlation with human judgments of translation quality, significantly outperforming the more commonly used Bleu metric. It is one of several automatic metrics used in this year's shared task within the ACL WMT-07 workshop. This paper recaps the technical details underlying the metric and describes recent improvements in the metric. The latest release includes improved metric parameters and extends the metric to support evaluation of MT output in Spanish, French and German, in addition to English.

1,045 citations
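For intuition only, here is a toy sketch of the exact-match unigram component of a Meteor-style score, assuming a recall-weighted harmonic mean of precision and recall; the stemming and synonym matching stages, the fragmentation penalty, and the tuned parameters of the released Meteor tool are deliberately omitted.

```python
from collections import Counter

def unigram_fmean(candidate, reference, alpha=0.9):
    """Exact-match unigram precision/recall and a recall-weighted harmonic mean
    (alpha close to 1 emphasizes recall); illustrative only, not the full Meteor score."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return p, r, p * r / (alpha * p + (1 - alpha) * r)

print(unigram_fmean("the cat sat on the mat", "the cat is on the mat"))
```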


Proceedings ArticleDOI
24 May 2007
TL;DR: This paper presents an efficient algorithm for identifying similar subtrees and applies it to tree representations of source code; the algorithm is implemented as a clone detection tool called DECKARD and evaluated on large code bases written in C and Java, including the Linux kernel and the JDK.
Abstract: Detecting code clones has many software engineering applications. Existing approaches either do not scale to large code bases or are not robust against minor code modifications. In this paper, we present an efficient algorithm for identifying similar subtrees and apply it to tree representations of source code. Our algorithm is based on a novel characterization of subtrees with numerical vectors in the Euclidean space R^n and an efficient algorithm to cluster these vectors w.r.t. the Euclidean distance metric. Subtrees with vectors in one cluster are considered similar. We have implemented our tree similarity algorithm as a clone detection tool called DECKARD and evaluated it on large code bases written in C and Java including the Linux kernel and JDK. Our experiments show that DECKARD is both scalable and accurate. It is also language independent, applicable to any language with a formally specified grammar.

1,008 citations
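The sketch below mimics the idea of characteristic vectors in a hedged way, using Python's own ast module instead of a C/Java grammar: each snippet is mapped to a vector of node-type counts and compared with the Euclidean distance. DECKARD's actual vector definition, its locality-sensitive-hashing-based clustering, and its language front ends are not reproduced.

```python
import ast
import math
from collections import Counter

# Node kinds to count; a real characteristic vector would cover the full grammar.
NODE_TYPES = ["FunctionDef", "For", "While", "If", "Call", "Assign",
              "BinOp", "Return", "Name", "Constant"]

def characteristic_vector(source: str):
    """Count occurrences of selected AST node kinds in the parsed snippet."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    return [counts.get(t, 0) for t in NODE_TYPES]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

snippet_a = "def total(xs):\n    s = 0\n    for x in xs:\n        s = s + x\n    return s\n"
snippet_b = "def acc(values):\n    out = 0\n    for v in values:\n        out = out + v\n    return out\n"

va, vb = characteristic_vector(snippet_a), characteristic_vector(snippet_b)
print(va, vb, euclidean(va, vb))  # near-identical vectors, so a small distance: likely clones
```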


Journal ArticleDOI
TL;DR: The paper addresses the problems of choice of fitness metric, characterization of landscape modality, and determination of the most suitable search technique to apply, and sheds light on the nature of the regression testing search space, indicating that it is multimodal.
Abstract: Regression testing is an expensive, but important, process. Unfortunately, there may be insufficient resources to allow for the reexecution of all test cases during regression testing. In this situation, test case prioritization techniques aim to improve the effectiveness of regression testing by ordering the test cases so that the most beneficial are executed first. Previous work on regression test case prioritization has focused on greedy algorithms. However, it is known that these algorithms may produce suboptimal results because they may construct results that denote only local minima within the search space. By contrast, metaheuristic and evolutionary search algorithms aim to avoid such problems. This paper presents results from an empirical study of the application of several greedy, metaheuristic, and evolutionary search algorithms to six programs, ranging from 374 to 11,148 lines of code for three choices of fitness metric. The paper addresses the problems of choice of fitness metric, characterization of landscape modality, and determination of the most suitable search technique to apply. The empirical results replicate previous results concerning greedy algorithms. They shed light on the nature of the regression testing search space, indicating that it is multimodal. The results also show that genetic algorithms perform well, although greedy approaches are surprisingly effective, given the multimodal nature of the landscape

690 citations
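As a point of reference for the greedy baselines discussed above, here is a minimal sketch of "additional greedy" coverage-based prioritization; the coverage map and test identifiers are made up, and the metaheuristic and genetic algorithms studied in the paper are not shown.

```python
def greedy_additional_prioritization(coverage):
    """Order test cases so that each pick covers the most not-yet-covered elements.
    `coverage` maps a test id to the set of statements/branches it covers (toy data)."""
    remaining = dict(coverage)
    covered, order = set(), []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not (remaining[best] - covered):        # nothing new left: append the rest
            order.extend(sorted(remaining))
            break
        order.append(best)
        covered |= remaining.pop(best)
    return order

coverage = {"t1": {1, 2, 3}, "t2": {3, 4}, "t3": {5}, "t4": {1, 2}}
print(greedy_additional_prioritization(coverage))  # ['t1', 't2', 't3', 't4']
```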


Journal ArticleDOI
TL;DR: This work formally generalizes the ROC metric to the early recognition problem and proposes a novel metric called Boltzmann-enhanced discrimination of receiver operating characteristic that turns out to contain the discrimination power of the RIE metric but incorporates the statistical significance from ROC and its well-behaved boundaries.
Abstract: Many metrics are currently used to evaluate the performance of ranking methods in virtual screening (VS), for instance, the area under the receiver operating characteristic curve (ROC), the area under the accumulation curve (AUAC), the average rank of actives, the enrichment factor (EF), and the robust initial enhancement (RIE) proposed by Sheridan et al. In this work, we show that the ROC, the AUAC, and the average rank metrics have the same inappropriate behaviors that make them poor metrics for comparing VS methods whose purpose is to rank actives early in an ordered list (the "early recognition problem"). In doing so, we derive mathematical formulas that relate those metrics together. Moreover, we show that the EF metric is not sensitive to ranking performance before and after the cutoff. Instead, we formally generalize the ROC metric to the early recognition problem which leads us to propose a novel metric called the Boltzmann-enhanced discrimination of receiver operating characteristic that turns out to contain the discrimination power of the RIE metric but incorporates the statistical significance from ROC and its well-behaved boundaries. Finally, two major sources of errors, namely, the statistical error and the "saturation effects", are examined. This leads to practical recommendations for the number of actives, the number of inactives, and the "early recognition" importance parameter that one should use when comparing ranking methods. Although this work is applied specifically to VS, it is general and can be used to analyze any method that needs to segregate actives toward the front of a rank-ordered list.

676 citations
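The sketch below conveys the idea behind RIE and BEDROC in a hedged form: RIE is an exponentially weighted sum over the ranks of the actives, normalized by its expectation under a random ordering, and BEDROC is obtained here by numerically rescaling RIE between its worst and best achievable values. The exact closed-form constants of the published BEDROC definition are not reproduced, so treat this as an assumption and cross-check against a reference implementation before relying on the numbers.

```python
import math

def rie(active_ranks, n_total, alpha=20.0):
    """Robust Initial Enhancement: exponentially weighted sum over the 1-based ranks
    of the actives, normalized by its expectation under a uniformly random ordering."""
    ranks = list(active_ranks)
    observed = sum(math.exp(-alpha * r / n_total) for r in ranks)
    expected = (len(ranks) / n_total) * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    return observed / expected

def bedroc_like(active_ranks, n_total, alpha=20.0):
    """RIE rescaled to [0, 1] between its worst (actives last) and best (actives first)
    achievable values; a sketch of the idea, not the paper's closed-form expression."""
    n = len(list(active_ranks))
    best = rie(range(1, n + 1), n_total, alpha)
    worst = rie(range(n_total - n + 1, n_total + 1), n_total, alpha)
    return (rie(active_ranks, n_total, alpha) - worst) / (best - worst)

# 10 actives among 1,000 ranked compounds, mostly recognized early:
print(bedroc_like([1, 3, 5, 8, 12, 20, 40, 100, 400, 900], 1000, alpha=20.0))
```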


Journal ArticleDOI
TL;DR: Experiments using the AESA algorithm in handwritten digit recognition show that the new normalized edit distance between X and Y can generally provide similar results to some other normalized edit distances and may perform slightly better if the triangle inequality is violated in a particular data set.
Abstract: Although a number of normalized edit distances presented so far may offer good performance in some applications, none of them can be regarded as a genuine metric between strings because they do not satisfy the triangle inequality. Given two strings X and Y over a finite alphabet, this paper defines a new normalized edit distance between X and Y as a simple function of their lengths (|X| and |Y|) and the Generalized Levenshtein Distance (GLD) between them. The new distance can be easily computed through GLD with a complexity of O(|X|·|Y|) and it is a metric valued in [0, 1] under the condition that the weight function is a metric over the set of elementary edit operations with all costs of insertions/deletions having the same weight. Experiments using the AESA algorithm in handwritten digit recognition show that the new distance can generally provide similar results to some other normalized edit distances and may perform slightly better if the triangle inequality is violated in a particular data set.

624 citations
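Assuming unit costs for insertions, deletions, and substitutions, the normalization commonly cited for this paper takes the form 2·GLD(X, Y) / (|X| + |Y| + GLD(X, Y)); the general weighted case is handled in the paper itself. A minimal sketch under that assumption:

```python
def levenshtein(x: str, y: str) -> int:
    """Classic dynamic-programming edit distance with unit insert/delete/substitute costs."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cx != cy)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(x: str, y: str) -> float:
    """Normalization of the form 2*GLD / (|X| + |Y| + GLD), valued in [0, 1]."""
    if not x and not y:
        return 0.0
    d = levenshtein(x, y)
    return 2 * d / (len(x) + len(y) + d)

print(normalized_edit_distance("kitten", "sitting"))  # 2*3 / (6 + 7 + 3) = 0.375
```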


Journal ArticleDOI
01 Feb 2007-Neuron
TL;DR: An alternate model in which cortical networks are inherently able to tell time as a result of time-dependent changes in network state is examined, showing that within this framework, there is no linear metric of time, and that a given interval is encoded in the context of preceding events.

513 citations


Journal ArticleDOI
TL;DR: A new worst-case metric is proposed for predicting practical system performance in the absence of matching failures, and the worst-case theoretical equal error rate (EER) is predicted to be as low as 2.59 × 10^-1 on the available data sets.
Abstract: This paper presents a novel iris coding method based on differences of discrete cosine transform (DCT) coefficients of overlapped angular patches from normalized iris images. The feature extraction capabilities of the DCT are optimized on the two largest publicly available iris image data sets, 2,156 images of 308 eyes from the CASIA database and 2,955 images of 150 eyes from the Bath database. On this data, we achieve 100 percent correct recognition rate (CRR) and perfect receiver-operating characteristic (ROC) curves with no registered false accepts or rejects. Individual feature bit and patch position parameters are optimized for matching through a product-of-sum approach to Hamming distance calculation. For verification, a variable threshold is applied to the distance metric and the false acceptance rate (FAR) and false rejection rate (FRR) are recorded. A new worst-case metric is proposed for predicting practical system performance in the absence of matching failures, and the worst-case theoretical equal error rate (EER) is predicted to be as low as 2.59 × 10^-1 on the available data sets.
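For context, the quantity that a verification threshold is applied to in iris systems of this kind is typically a fractional Hamming distance between binary codes; the sketch below shows only that plain distance, with random toy codes, and not the paper's product-of-sum matching or its optimized bit and patch parameters.

```python
import numpy as np

def fractional_hamming(code_a, code_b, mask_a=None, mask_b=None):
    """Fraction of disagreeing bits over the mutually valid bits; varying the accept
    threshold on this value trades false accepts (FAR) against false rejects (FRR)."""
    a, b = np.asarray(code_a, bool), np.asarray(code_b, bool)
    valid = np.ones_like(a)
    if mask_a is not None:
        valid &= np.asarray(mask_a, bool)
    if mask_b is not None:
        valid &= np.asarray(mask_b, bool)
    return np.count_nonzero((a ^ b) & valid) / np.count_nonzero(valid)

rng = np.random.default_rng(0)
enrolled = rng.integers(0, 2, 2048).astype(bool)
probe = enrolled.copy()
probe[rng.choice(2048, 100, replace=False)] ^= True   # flip ~5% of the bits
print(fractional_hamming(enrolled, probe))            # about 0.05
```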

Journal ArticleDOI
TL;DR: The proposed EMD-L1 significantly simplifies the original linear programming formulation of EMD; empirically, the new algorithm has an average time complexity of O(N^2), which significantly improves the best reported supercubic complexity of the original EMD.
Abstract: We propose EMD-L1: a fast and exact algorithm for computing the earth mover's distance (EMD) between a pair of histograms. The efficiency of the new algorithm enables its application to problems that were previously prohibitive due to high time complexities. The proposed EMD-L1 significantly simplifies the original linear programming formulation of EMD. Exploiting the L1 metric structure, the number of unknown variables in EMD-L1 is reduced to O(N) from O(N^2) of the original EMD for a histogram with N bins. In addition, the number of constraints is reduced by half and the objective function of the linear program is simplified. Formally, without any approximation, we prove that the EMD-L1 formulation is equivalent to the original EMD with an L1 ground distance. To perform the EMD-L1 computation, we propose an efficient tree-based algorithm, Tree-EMD. Tree-EMD exploits the fact that a basic feasible solution of the simplex algorithm-based solver forms a spanning tree when we interpret EMD-L1 as a network flow optimization problem. We empirically show that this new algorithm has an average time complexity of O(N^2), which significantly improves the best reported supercubic complexity of the original EMD. The accuracy of the proposed methods is evaluated by experiments for two computation-intensive problems: shape recognition and interest point matching using multidimensional histogram-based local features. For shape recognition, EMD-L1 is applied to compare shape contexts on the widely tested MPEG7 shape data set, as well as an articulated shape data set. For interest point matching, SIFT, shape context and spin image are tested on both synthetic and real image pairs with large geometrical deformation, illumination change, and heavy intensity noise. The results demonstrate that our EMD-L1-based solutions outperform previously reported state-of-the-art features and distance measures in solving the two tasks.
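For intuition about what EMD measures, the one-dimensional special case has a simple closed form: with unit ground distance, the EMD between two equal-mass histograms equals the L1 distance between their cumulative sums. This is only a hedged illustration; the paper's EMD-L1 and Tree-EMD algorithms target general multidimensional histograms and are not reproduced here.

```python
import numpy as np

def emd_1d(h1, h2):
    """EMD between two 1-D histograms with equal total mass and unit bin spacing,
    computed as the sum of absolute differences of the cumulative histograms."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    h1, h2 = h1 / h1.sum(), h2 / h2.sum()          # normalize to equal total mass
    return float(np.abs(np.cumsum(h1 - h2)).sum())

print(emd_1d([0, 1, 0, 0], [0, 0, 0, 1]))  # all mass must move 2 bins -> 2.0
```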

Journal ArticleDOI
TL;DR: The proposed Stochastic Response Surface (SRS) Method iteratively utilizes a response surface model to approximate the expensive function and identifies a promising point for function evaluation from a set of randomly generated points, called candidate points; under mild technical conditions, SRS converges to the global minimum in a probabilistic sense.
Abstract: We introduce a new framework for the global optimization of computationally expensive multimodal functions when derivatives are unavailable. The proposed Stochastic Response Surface (SRS) Method iteratively utilizes a response surface model to approximate the expensive function and identifies a promising point for function evaluation from a set of randomly generated points, called candidate points. Assuming some mild technical conditions, SRS converges to the global minimum in a probabilistic sense. We also propose Metric SRS (MSRS), which is a special case of SRS where the function evaluation point in each iteration is chosen to be the best candidate point according to two criteria: the estimated function value obtained from the response surface model, and the minimum distance from previously evaluated points. We develop a global optimization version and a multistart local optimization version of MSRS. In the numerical experiments, we used a radial basis function (RBF) model for MSRS and the resulting algorithms, Global MSRBF and Multistart Local MSRBF, were compared to 6 alternative global optimization methods, including a multistart derivative-based local optimization method. Multiple trials of all algorithms were compared on 17 multimodal test problems and on a 12-dimensional groundwater bioremediation application involving partial differential equations. The results indicate that Multistart Local MSRBF is the best on most of the higher dimensional problems, including the groundwater problem. It is also at least as good as the other algorithms on most of the lower dimensional problems. Global MSRBF is competitive with the other alternatives on most of the lower dimensional test problems and also on the groundwater problem. These results suggest that MSRBF is a promising approach for the global optimization of expensive functions.

Journal ArticleDOI
TL;DR: This paper details supervised training algorithms that directly maximize the evaluation metric under consideration, such as mean average precision, and shows that linear feature-based models can consistently and significantly outperform current state of the art retrieval models with the correct choice of features.
Abstract: There have been a number of linear, feature-based models proposed by the information retrieval community recently. Although each model is presented differently, they all share a common underlying framework. In this paper, we explore and discuss the theoretical issues of this framework, including a novel look at the parameter space. We then detail supervised training algorithms that directly maximize the evaluation metric under consideration, such as mean average precision. We present results that show training models in this way can lead to significantly better test set performance compared to other training methods that do not directly maximize the metric. Finally, we show that linear feature-based models can consistently and significantly outperform current state of the art retrieval models with the correct choice of features.
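Since the abstract describes training that directly maximizes an evaluation metric such as mean average precision, it may help to see that objective written out; the sketch below computes average precision for ranked lists of 0/1 relevance labels (toy data), while the paper's actual training procedure is not reproduced.

```python
def average_precision(relevance):
    """Average precision of one ranked list; `relevance` is a 0/1 list in rank order.
    Mean average precision (MAP) is the mean of this value over queries."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

queries = [[1, 0, 1, 0, 0], [0, 1, 1, 0, 1]]       # two toy ranked result lists
aps = [average_precision(q) for q in queries]
print(aps, sum(aps) / len(aps))                    # per-query AP and MAP
```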

Journal ArticleDOI
TL;DR: This work proposes that a recently introduced method of analysis, angular segment analysis, can marry axial and road-centre line representations, and in doing so reflect a cognitive model of how route choice decisions may be made. The results imply that there is no reason why space syntax inspired measures cannot be combined with transportation network analysis representations in order to create a new, cognitively coherent, model of movement in the city.
Abstract: Axial analysis is one of the fundamental components of space syntax. The space syntax community has suggested that it picks up qualities of configurational relationships between spaces not illuminated by other representations. However, critics have questioned the absolute necessity of axial lines to space syntax, as well as the exact definition of axial lines. Why not another representation? In particular, why not road-centre lines, which are easily available in many countries for use within geographical information systems? Here I propose that a recently introduced method of analysis, angular segment analysis, can marry axial and road-centre line representations, and in doing so reflect a cognitive model of how route choice decisions may be made. I show that angular segment analysis can be applied generally to road-centre line segments or axial segments, through a simple length- weighted normalisation procedure that makes values between the two maps comparable. I make comparative quantitative assessments for a real urban system, not just investigating angular analysis between axial and road-centre line networks, but also including more intuitive measures based on metric (or block) distances between locations. I show that the new angular segment analysis algorithm produces better correlation with observed vehicular flow than both standard axial analysis and metric distance measures. The results imply that there is no reason why space syntax inspired measures cannot be combined with transportation network analysis representations in order to create a new, cognitively coherent, model of movement in the city.

01 Jan 2007
TL;DR: A simple and effective model of visual between-coefficient contrast masking of DCT basis functions, based on the human visual system (HVS), is proposed; the resulting metric, PSNR-HVS-M, outperforms other well-known reference-based quality metrics and shows high correlation with the results of subjective experiments.
Abstract: In this paper we propose a simple and effective model of visual between-coefficient contrast masking of DCT basis functions based on a human visual system (HVS). The model operates with the values of DCT coefficients of an 8x8 pixel block of an image. For each DCT coefficient of the block, the model allows one to calculate its maximal distortion that is not visible due to the between-coefficient masking. A modification of the PSNR is also described in this paper. The proposed metric, PSNR-HVS-M, takes into account the proposed model and the contrast sensitivity function (CSF). For efficiency analysis of the proposed model, a set of 18 test images with different effects of noise masking has been used. During experiments, 155 observers have sorted this set of test images in the order of their visual appearance, comparing them to the undistorted original. The new metric, PSNR-HVS-M, has outperformed other well-known reference based quality metrics and demonstrated high correlation with the results of subjective experiments (Spearman correlation is 0.984, Kendall correlation is 0.948).

Proceedings ArticleDOI
26 Dec 2007
TL;DR: Non-metric similarities between pairs of images are derived by matching SIFT features, and affinity propagation successfully identifies meaningful categories, which provide a natural summarization of the training images and can be used to classify new input images.
Abstract: Unsupervised categorization of images or image parts is often needed for image and video summarization or as a preprocessing step in supervised methods for classification, tracking and segmentation. While many metric-based techniques have been applied to this problem in the vision community, often, the most natural measures of similarity (e.g., number of matching SIFT features) between pairs of images or image parts is non-metric. Unsupervised categorization by identifying a subset of representative exemplars can be efficiently performed with the recently-proposed 'affinity propagation' algorithm. In contrast to k-centers clustering, which iteratively refines an initial randomly-chosen set of exemplars, affinity propagation simultaneously considers all data points as potential exemplars and iteratively exchanges messages between data points until a good solution emerges. When applied to the Olivetti face data set using a translation-invariant non-metric similarity, affinity propagation achieves a much lower reconstruction error and nearly halves the classification error rate, compared to state-of-the-art techniques. For the more challenging problem of unsupervised categorization of images from the Caltech101 data set, we derived non-metric similarities between pairs of images by matching SIFT features. Affinity propagation successfully identifies meaningful categories, which provide a natural summarization of the training images and can be used to classify new input images.
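Affinity propagation with an arbitrary (not necessarily metric) similarity matrix can be run directly from scikit-learn, assuming that library is available; the toy similarities below are negative squared distances rather than the SIFT-matching similarities used in the paper, so this is only a shape-of-the-API sketch.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy similarity matrix (higher = more similar); it need not satisfy metric axioms.
# In the paper, entries would come from SIFT feature matching between image pairs.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])
S = -np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)

ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(S)
print("exemplars:", ap.cluster_centers_indices_, "labels:", labels)
```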

Journal ArticleDOI
TL;DR: The object of this paper is to illustrate the utility of the data-driven approach to damage identification by means of a number of case studies.
Abstract: In broad terms, there are two approaches to damage identification. Model-driven methods establish a high-fidelity physical model of the structure, usually by finite element analysis, and then establish a comparison metric between the model and the measured data from the real structure. If the model is for a system or structure in normal (i.e. undamaged) condition, any departures indicate that the structure has deviated from normal condition and damage is inferred. Data-driven approaches also establish a model, but this is usually a statistical representation of the system, e.g. a probability density function of the normal condition. Departures from normality are then signalled by measured data appearing in regions of very low density. The algorithms that have been developed over the years for data-driven approaches are mainly drawn from the discipline of pattern recognition, or more broadly, machine learning. The object of this paper is to illustrate the utility of the data-driven approach to damage identification by means of a number of case studies.
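To make the data-driven idea concrete, here is a hedged toy version of novelty detection against a statistical model of the normal condition: fit a single Gaussian to features from the undamaged state and flag observations that fall in low-density regions (large Mahalanobis distance). The paper surveys a much broader range of pattern-recognition methods; the features and threshold below are invented for illustration.

```python
import numpy as np

def fit_normal_condition(features):
    """Fit a Gaussian density to features measured on the undamaged structure."""
    mu = features.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(features, rowvar=False))
    return mu, cov_inv

def novelty_index(x, mu, cov_inv):
    """Squared Mahalanobis distance from the normal-condition model; large values
    correspond to the low-density regions in which damage is signalled."""
    d = x - mu
    return float(d @ cov_inv @ d)

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 3))                 # undamaged-state training data
mu, cov_inv = fit_normal_condition(normal)
threshold = np.quantile([novelty_index(x, mu, cov_inv) for x in normal], 0.99)

healthy_obs = rng.normal(0.0, 1.0, size=3)
damaged_obs = np.array([4.0, -4.0, 4.0])                     # hypothetical shifted features
print(novelty_index(healthy_obs, mu, cov_inv) > threshold)   # usually False
print(novelty_index(damaged_obs, mu, cov_inv) > threshold)   # True
```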

Journal ArticleDOI
TL;DR: This paper justifies the use of alternative distances to fight concentration by showing that the concentration is indeed an intrinsic property of the distances and not an artifact of a finite sample, and by giving an estimation of the concentration as a function of the exponent of the distance and of the distribution of the data.
Abstract: Nearest neighbor search and many other numerical data analysis tools most often rely on the use of the euclidean distance. When data are high dimensional, however, the euclidean distances seem to concentrate; all distances between pairs of data elements seem to be very similar. Therefore, the relevance of the euclidean distance has been questioned in the past, and fractional norms (Minkowski-like norms with an exponent less than one) were introduced to fight the concentration phenomenon. This paper justifies the use of alternative distances to fight concentration by showing that the concentration is indeed an intrinsic property of the distances and not an artifact from a finite sample. Furthermore, an estimation of the concentration as a function of the exponent of the distance and of the distribution of the data is given. It leads to the conclusion that, contrary to what is generally admitted, fractional norms are not always less concentrated than the euclidean norm; a counterexample is given to prove this claim. Theoretical arguments are presented, which show that the concentration phenomenon can appear for real data that do not match the hypotheses of the theorems, in particular, the assumption of independent and identically distributed variables. Finally, some insights about how to choose an optimal metric are given.
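A quick way to see the concentration phenomenon the abstract discusses is to compare the relative contrast of norms as the dimension grows; the probe below (uniform data, norms measured from the origin) is only an illustration of the effect, not of the paper's more nuanced conclusion that fractional norms are not always less concentrated.

```python
import numpy as np

def relative_contrast(X, p):
    """(max - min) / min of Minkowski-type norms with exponent p (fractional when p < 1),
    measured from the origin; shrinking contrast indicates concentration of distances."""
    norms = np.sum(np.abs(X) ** p, axis=1) ** (1.0 / p)
    return (norms.max() - norms.min()) / norms.min()

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, dim))
    print(dim, {p: round(relative_contrast(X, p), 3) for p in (0.5, 1, 2)})
```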

Journal ArticleDOI
TL;DR: In this article, a new quantitative metric called the ratio of spatial frequency error (rSFe) is proposed to objectively evaluate the quality of fused imagery, where the measured value of the proposed metric is used as feedback to a fusion algorithm such that the image quality of the fused image can potentially be improved.

Journal ArticleDOI
TL;DR: The KKT conditions in an optimization problem with an interval-valued objective function are derived by considering two partial orderings on the set of all closed intervals and by invoking the Hausdorff metric and the Hukuhara difference.

Proceedings ArticleDOI
01 Apr 2007
TL;DR: This work defines a similarity metric, proposes an efficient approximation method to reduce its calculation cost, and develops novel metrics and heuristics to support k-most-similar-trajectory search in spatiotemporal databases, exploiting existing R-tree-like structures that are already found there to support more traditional queries.
Abstract: The problem of trajectory similarity in moving object databases is a relatively new topic in the spatial and spatiotemporal database literature. Existing work focuses on the spatial notion of similarity, ignoring the temporal dimension of trajectories and disregarding the presence of a general-purpose spatiotemporal index. In this work, we address the issue of spatiotemporal trajectory similarity search by defining a similarity metric, proposing an efficient approximation method to reduce its calculation cost, and developing novel metrics and heuristics to support k-most-similar-trajectory search in spatiotemporal databases, exploiting existing R-tree-like structures that are already found there to support more traditional queries. Our experimental study, based on real and synthetic datasets, verifies that the proposed similarity metric efficiently retrieves spatiotemporally similar trajectories in cases where related work fails, while at the same time the proposed algorithm is shown to be efficient and highly scalable.
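To fix ideas, here is a deliberately simple stand-in for spatiotemporal trajectory comparison: trajectories sampled at the same timestamps are compared by the average pointwise Euclidean distance, and k-most-similar search is done by brute force. The paper's actual similarity metric, its approximation, and the R-tree-based search are not reproduced; all data below are invented.

```python
import numpy as np

def spatiotemporal_distance(traj_a, traj_b):
    """Average Euclidean distance between two trajectories sampled at the same timestamps."""
    a, b = np.asarray(traj_a, float), np.asarray(traj_b, float)
    return float(np.linalg.norm(a - b, axis=1).mean())

def k_most_similar(query, trajectories, k=2):
    """Brute-force k-most-similar-trajectory search (no index; illustration only)."""
    scored = sorted(trajectories, key=lambda t: spatiotemporal_distance(query, t["points"]))
    return [t["id"] for t in scored[:k]]

query = [(0, 0), (1, 1), (2, 2), (3, 3)]
db = [
    {"id": "T1", "points": [(0, 0.1), (1, 1.2), (2, 2.1), (3, 2.9)]},
    {"id": "T2", "points": [(5, 0), (5, 1), (5, 2), (5, 3)]},
    {"id": "T3", "points": [(0, 1), (1, 2), (2, 3), (3, 4)]},
]
print(k_most_similar(query, db, k=2))  # ['T1', 'T3']
```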

Book ChapterDOI
17 Sep 2007
TL;DR: A new metric that measures the informativeness of objects to be classified is introduced; it can be applied as a query-based distance metric to measure the closeness between objects, and two novel KNN procedures are proposed.
Abstract: The K-nearest neighbor (KNN) decision rule has been a ubiquitous classification tool with good scalability. Past experience has shown that the optimal choice of K depends upon the data, making it laborious to tune the parameter for different applications. We introduce a new metric that measures the informativeness of objects to be classified. When applied as a query-based distance metric to measure the closeness between objects, two novel KNN procedures, Locally Informative-KNN (LI-KNN) and Globally Informative-KNN (GI-KNN), are proposed. By selecting a subset of most informative objects from neighborhoods, our methods exhibit stability to the change of input parameters, number of neighbors (K) and informative points (I). Experiments on UCI benchmark data and diverse real-world data sets indicate that our approaches are application-independent and can generally outperform several popular KNN extensions, as well as SVM and Boosting methods.

Journal ArticleDOI
TL;DR: A parametric generalization of the two different multiplicative update rules for nonnegative matrix factorization by Lee and Seung (2001) is shown to lead to locally optimal solutions of the nonnegative matrix factorization problem with this new cost function.
Abstract: This letter presents a general parametric divergence measure. The metric includes as special cases quadratic error and Kullback-Leibler divergence. A parametric generalization of the two different multiplicative update rules for nonnegative matrix factorization by Lee and Seung (2001) is shown to lead to locally optimal solutions of the nonnegative matrix factorization problem with this new cost function. Numeric simulations demonstrate that the new update rule may improve the quadratic distance convergence speed. A proof of convergence is given that, as in Lee and Seung, uses an auxiliary function known from the expectation-maximization theoretical framework.
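For reference, the two Lee-Seung multiplicative update rules that the letter generalizes are, in the quadratic (Frobenius) case, the ones sketched below; the parametric divergence family and the generalized updates introduced in the letter itself are not reproduced.

```python
import numpy as np

def nmf_multiplicative(V, rank, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for the quadratic (Frobenius) NMF cost:
    H <- H * (W^T V) / (W^T W H),  W <- W * (V H^T) / (W H H^T)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.random.default_rng(1).random((20, 15))      # nonnegative toy data
W, H = nmf_multiplicative(V, rank=4)
print(np.linalg.norm(V - W @ H))                   # reconstruction error is non-increasing
```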

Journal ArticleDOI
TL;DR: Several reasonable measures for calculating the degree of similarity between IFSs are proposed; the proposed measures are induced by the L_p metric and are applied to analyze the behavior of decision making.

Journal ArticleDOI
TL;DR: In this article, conditions are provided for the convergence of polyhedral surfaces, and of their discrete geometric properties, to smooth surfaces embedded in Euclidean 3-space.
Abstract: We provide conditions for convergence of polyhedral surfaces and their discrete geometric properties to smooth surfaces embedded in Euclidean 3-space. Under the assumption of convergence of surfaces in Hausdorff distance, we show that convergence of the following properties are equivalent: surface normals, surface area, metric tensors, and Laplace–Beltrami operators. Additionally, we derive convergence of minimizing geodesics, mean curvature vectors, and solutions to the Dirichlet problem.

Journal ArticleDOI
01 Jun 2007
TL;DR: This paper presents the first approach for estimating the selectivity of tf.idf based cosine similarity predicates and shows that this method often produces estimates that are within 40% of the actual selectivity.
Abstract: An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
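A minimal sketch of the kind of predicate whose selectivity the paper estimates, assuming scikit-learn is available: strings are embedded as tf.idf vectors and a threshold is applied to their cosine similarity with a query. The selectivity-estimation technique itself, which is the paper's contribution, is not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

strings = ["International Business Machines", "Intl. Business Machines Corp.",
           "Apple Inc.", "Apple Incorporated"]
query = "International Business Machines Corporation"

vec = TfidfVectorizer().fit(strings + [query])
sims = cosine_similarity(vec.transform([query]), vec.transform(strings))[0]
threshold = 0.5
print(sims)
print([s for s, v in zip(strings, sims) if v >= threshold])  # rows satisfying the predicate
```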

Proceedings ArticleDOI
26 Dec 2007
TL;DR: This paper extends Nadaraya-Watson kernel regression by recasting the regression problem in terms of Frechet expectation, and uses the infinite dimensional manifold of diffeomorphic transformations, with an associated metric, to study the small scale changes in anatomy.
Abstract: Regression analysis is a powerful tool for the study of changes in a dependent variable as a function of an independent regressor variable, and in particular it is applicable to the study of anatomical growth and shape change. When the underlying process can be modeled by parameters in a Euclidean space, classical regression techniques are applicable and have been studied extensively. However, recent work suggests that attempts to describe anatomical shapes using flat Euclidean spaces undermine our ability to represent natural biological variability. In this paper we develop a method for regression analysis of general, manifold-valued data. Specifically, we extend Nadaraya-Watson kernel regression by recasting the regression problem in terms of Frechet expectation. Although this method is quite general, our driving problem is the study of anatomical shape change as a function of age from random design image data. We demonstrate our method by analyzing shape change in the brain from a random design dataset of MR images of 89 healthy adults ranging in age from 22 to 79 years. To study the small scale changes in anatomy, we use the infinite dimensional manifold of diffeomorphic transformations, with an associated metric. We regress a representative anatomical shape, as a function of age, from this population.
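For orientation, classical Nadaraya-Watson kernel regression in the Euclidean case is just a kernel-weighted average of the observations, as sketched below; the paper's contribution is to replace that weighted average with a weighted Frechet mean so the regressed quantity can live on a manifold of diffeomorphic transformations, which is not reproduced here. The scalar "shape summaries" below are invented.

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, bandwidth=5.0):
    """Gaussian-kernel-weighted average of the observations, evaluated at x_query."""
    w = np.exp(-0.5 * ((x_train - x_query) / bandwidth) ** 2)
    return np.sum(w * y_train) / np.sum(w)

ages = np.array([22.0, 30.0, 41.0, 55.0, 63.0, 79.0])
volumes = np.array([1.00, 0.99, 0.97, 0.93, 0.90, 0.85])   # made-up scalar shape summaries
print([round(nadaraya_watson(a, ages, volumes), 3) for a in (25, 50, 75)])
```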

Journal ArticleDOI
TL;DR: In this article, a decision problem is defined in terms of an outcome space, an action space and a loss function, which allows generalisation of many standard statistical concepts and properties, including generalised exponential families.
Abstract: A decision problem is defined in terms of an outcome space, an action space and a loss function. Starting from these simple ingredients, we can construct: Proper Scoring Rule; Entropy Function; Divergence Function; Riemannian Metric; and Unbiased Estimating Equation. From an abstract viewpoint, the loss function defines a duality between the outcome and action spaces, while the correspondence between a distribution and its Bayes act induces a self-duality. Together these determine a “decision geometry” for the family of distributions on outcome space. This allows generalisation of many standard statistical concepts and properties. In particular we define and study generalised exponential families. Several examples are analysed, including a general Bregman geometry.
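Since the abstract invokes Bregman geometry, the standard definition may be a useful reference; the sketch below implements the generic Bregman divergence and instantiates it with two classic generators. This is a textbook construction, not the paper's decision-geometric framework itself.

```python
import numpy as np

def bregman_divergence(F, grad_F, x, y):
    """D_F(x, y) = F(x) - F(y) - <grad F(y), x - y> for a convex generator F."""
    return F(x) - F(y) - np.dot(grad_F(y), x - y)

# Generator 0.5*||v||^2 gives half the squared Euclidean distance.
F_sq = lambda v: 0.5 * np.dot(v, v)
grad_sq = lambda v: v

# Generator sum(v*log v) (negative Shannon entropy) gives the generalized KL divergence.
F_ent = lambda v: np.sum(v * np.log(v))
grad_ent = lambda v: np.log(v) + 1.0

x, y = np.array([0.2, 0.3, 0.5]), np.array([0.3, 0.3, 0.4])
print(bregman_divergence(F_sq, grad_sq, x, y))    # 0.5 * ||x - y||^2
print(bregman_divergence(F_ent, grad_ent, x, y))  # KL(x || y) for probability vectors
```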

Journal ArticleDOI
TL;DR: A method is presented which relates the Isomap to Mercer kernel machines, so that the generalization property naturally emerges, through kernel principal component analysis.