Showing papers on "Metric (mathematics)" published in 2020


Proceedings ArticleDOI
01 Jul 2020
TL;DR: This work explores and compares the plethora of metrics for the performance evaluation of object-detection algorithms and proposes a standard implementation that can be used as a benchmark among different datasets with minimum adaptation on the annotation files.
Abstract: This work explores and compares the plethora of metrics for the performance evaluation of object-detection algorithms. Average precision (AP), for instance, is a popular metric for evaluating the accuracy of object detectors by estimating the area under the curve (AUC) of the precision × recall relationship. Depending on the point interpolation used in the plot, two different AP variants can be defined and, therefore, different results are generated. AP has six additional variants, increasing the possibilities of benchmarking. The lack of consensus in different works and AP implementations is a problem faced by the academic and scientific communities. Metric implementations written in different computational languages and platforms are usually distributed with corresponding datasets sharing a given bounding-box description. Such projects indeed help the community with evaluation tools, but demand extra work to be adapted for other datasets and bounding-box formats. This work reviews the most used metrics for object detection, detailing their differences, applications, and main concepts. It also proposes a standard implementation that can be used as a benchmark among different datasets with minimum adaptation on the annotation files.

451 citations
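
To illustrate how the choice of point interpolation changes the AP value, here is a minimal Python sketch. It assumes the two variants are the commonly used 11-point and all-point interpolations; the function names and the toy detections are hypothetical and are not taken from the paper's proposed implementation.

```python
def precision_recall_points(detections, num_ground_truth):
    """detections: list of (confidence, is_true_positive) pairs (hypothetical format)."""
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ordered:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    return precisions, recalls


def ap_11_point(precisions, recalls):
    """Mean of the interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for i in range(11):
        level = i / 10
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reachable) if reachable else 0.0
    return total / 11


def ap_all_point(precisions, recalls):
    """Area under the monotonically decreasing envelope of the precision x recall curve."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision found at any higher recall.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i] - mrec[i - 1]) * mpre[i] for i in range(1, len(mrec)))


# Toy example: three detections, two ground-truth boxes.
detections = [(0.9, True), (0.8, False), (0.7, True)]
precisions, recalls = precision_recall_points(detections, num_ground_truth=2)
print(ap_11_point(precisions, recalls))   # about 0.848
print(ap_all_point(precisions, recalls))  # about 0.833
```

On the same detections the two interpolations disagree, which is the kind of discrepancy the abstract describes.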


Proceedings Article
30 Apr 2020
TL;DR: This work performs extensive studies on benchmark datasets to propose a metric that quantifies the "hardness" of a few-shot episode and finds that using a large number of meta-training classes results in high few-shot accuracies even for a large number of few-shot classes.
Abstract: Fine-tuning a deep network trained with the standard cross-entropy loss is a strong baseline for few-shot learning. When fine-tuned transductively, this outperforms the current state-of-the-art on standard datasets such as Mini-Imagenet, Tiered-Imagenet, CIFAR-FS and FC-100 with the same hyper-parameters. The simplicity of this approach enables us to demonstrate the first few-shot learning results on the Imagenet-21k dataset. We find that using a large number of meta-training classes results in high few-shot accuracies even for a large number of few-shot classes. We do not advocate our approach as the solution for few-shot learning, but simply use the results to highlight limitations of current benchmarks and few-shot protocols. We perform extensive studies on benchmark datasets to propose a metric that quantifies the "hardness" of a few-shot episode. This metric can be used to report the performance of few-shot algorithms in a more systematic way.

355 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: Zhang et al. adopt the Earth Mover's Distance (EMD) as a metric to compute a structural distance between dense image representations to determine image relevance; the minimum matching cost is used to represent the image distance for classification.
Abstract: In this paper, we address the few-shot classification task from a new perspective of optimal matching between image regions. We adopt the Earth Mover's Distance (EMD) as a metric to compute a structural distance between dense image representations to determine image relevance. The EMD generates the optimal matching flows between structural elements that have the minimum matching cost, which is used to represent the image distance for classification. To generate the important weights of elements in the EMD formulation, we design a cross-reference mechanism, which can effectively minimize the impact caused by the cluttered background and large intra-class appearance variations. To handle k-shot classification, we propose to learn a structured fully connected layer that can directly classify dense image representations with the EMD. Based on the implicit function theorem, the EMD can be inserted as a layer into the network for end-to-end training. We conduct comprehensive experiments to validate our algorithm and we set new state-of-the-art performance on four popular few-shot classification benchmarks, namely miniImageNet, tieredImageNet, Fewshot-CIFAR100 (FC100) and Caltech-UCSD Birds-200-2011 (CUB).

354 citations


Posted Content
15 Mar 2020
TL;DR: This paper adopts the Earth Mover's Distance (EMD) as a metric to compute a structural distance between dense image representations to determine image relevance and designs a cross-reference mechanism that can effectively minimize the impact caused by the cluttered background and large intra-class appearance variations.
Abstract: In this paper, we address the few-shot classification task from a new perspective of optimal matching between image regions. We adopt the Earth Mover's Distance (EMD) as a metric to compute a structural distance between dense image representations to determine image relevance. The EMD generates the optimal matching flows between structural elements that have the minimum matching cost, which is used to represent the image distance for classification. To generate the important weights of elements in the EMD formulation, we design a cross-reference mechanism, which can effectively minimize the impact caused by the cluttered background and large intra-class appearance variations. To handle k-shot classification, we propose to learn a structured fully connected layer that can directly classify dense image representations with the EMD. Based on the implicit function theorem, the EMD can be inserted as a layer into the network for end-to-end training. We conduct comprehensive experiments to validate our algorithm and we set new state-of-the-art performance on four popular few-shot classification benchmarks, namely miniImageNet, tieredImageNet, Fewshot-CIFAR100 (FC100) and Caltech-UCSD Birds-200-2011 (CUB).

271 citations


Proceedings ArticleDOI
Walid Krichene, Steffen Rendle
23 Aug 2020
TL;DR: It is shown that sampled metrics are inconsistent with their exact version, in the sense that they do not persist relative statements. The work suggests that sampling should be avoided for metric calculation; however, if an experimental study needs to sample, the proposed corrections can improve the quality of the estimate.
Abstract: The task of item recommendation requires ranking a large catalogue of items given a context. Item recommendation algorithms are evaluated using ranking metrics that depend on the positions of relevant items. To speed up the computation of metrics, recent work often uses sampled metrics where only a smaller set of random items and the relevant items are ranked. This paper investigates sampled metrics in more detail and shows that they are inconsistent with their exact version, in the sense that they do not persist relative statements, e.g., recommender A is better than B, not even in expectation. Moreover, the smaller the sampling size, the less difference there is between metrics, and for very small sampling size, all metrics collapse to the AUC metric. We show that it is possible to improve the quality of the sampled metrics by applying a correction, obtained by minimizing different criteria such as bias or mean squared error. We conclude with an empirical evaluation of the naive sampled metrics and their corrected variants. To summarize, our work suggests that sampling should be avoided for metric calculation, however if an experimental study needs to sample, the proposed corrections can improve the quality of the estimate.

264 citations
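
The inconsistency described in the abstract can be reproduced with a small calculation. The sketch below is only an illustration under stated assumptions: negatives are sampled uniformly with replacement, the metric is Recall@10, and the ranks assigned to the two hypothetical recommenders A and B are invented for the example.

```python
from math import comb


def recall_at_k(rank, k=10):
    """1.0 if the relevant item is ranked within the top k, else 0.0."""
    return 1.0 if rank <= k else 0.0


def expected_sampled_metric(rank, n_items, m, metric):
    """Expected metric value when the relevant item is ranked against m sampled negatives.

    Assumption: negatives are drawn uniformly with replacement, so the number of
    sampled negatives ranked above the relevant item is Binomial(m, p).
    """
    p = (rank - 1) / (n_items - 1)
    return sum(
        comb(m, j) * p**j * (1 - p) ** (m - j) * metric(1 + j)
        for j in range(m + 1)
    )


def mean(values):
    values = list(values)
    return sum(values) / len(values)


n_items, m = 10_000, 99
ranks_a = [5, 5000]  # hypothetical recommender A: one very good, one very bad prediction
ranks_b = [50, 50]   # hypothetical recommender B: two mediocre predictions

exact_a = mean(recall_at_k(r) for r in ranks_a)
exact_b = mean(recall_at_k(r) for r in ranks_b)
sampled_a = mean(expected_sampled_metric(r, n_items, m, recall_at_k) for r in ranks_a)
sampled_b = mean(expected_sampled_metric(r, n_items, m, recall_at_k) for r in ranks_b)

print(exact_a, exact_b)      # A is better under the exact metric
print(sampled_a, sampled_b)  # B looks better under the sampled metric
```

With these toy ranks the exact Recall@10 prefers A, while the expected sampled Recall@10 prefers B, so the relative statement "A is better than B" is not preserved even in expectation.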


Posted Content
TL;DR: Flaws in the experimental methodology of numerous metric learning papers are found, and it is shown that the actual improvements over time have been marginal at best.
Abstract: Deep metric learning papers from the past four years have consistently claimed great advances in accuracy, often more than doubling the performance of decade-old methods. In this paper, we take a closer look at the field to see if this is actually true. We find flaws in the experimental methodology of numerous metric learning papers, and show that the actual improvements over time have been marginal at best.

244 citations


Journal ArticleDOI
TL;DR: This work presents a novel MOT evaluation metric, higher order tracking accuracy (HOTA), which explicitly balances the effect of performing accurate detection, association and localization into a single unified metric for comparing trackers.
Abstract: Multi-Object Tracking (MOT) has been notoriously difficult to evaluate. Previous metrics overemphasize the importance of either detection or association. To address this, we present a novel MOT evaluation metric, HOTA (Higher Order Tracking Accuracy), which explicitly balances the effect of performing accurate detection, association and localization into a single unified metric for comparing trackers. HOTA decomposes into a family of sub-metrics which are able to evaluate each of five basic error types separately, which enables clear analysis of tracking performance. We evaluate the effectiveness of HOTA on the MOTChallenge benchmark, and show that it is able to capture important aspects of MOT performance not previously taken into account by established metrics. Furthermore, we show HOTA scores better align with human visual evaluation of tracking performance.

216 citations
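
As a rough sketch of how a single score can balance detection and association, the snippet below computes a HOTA-style value at one localization threshold as the geometric mean of a detection accuracy and an association accuracy. It assumes the matching between predicted and ground-truth detections has already been done; the function name and input format are hypothetical, and the full metric additionally averages over a range of localization thresholds.

```python
from math import sqrt


def hota_at_threshold(assoc_scores, num_fn, num_fp):
    """HOTA-style score at one localization threshold.

    assoc_scores: one association score in [0, 1] per true-positive match
                  (hypothetical input; how it is obtained is not shown here).
    num_fn, num_fp: counts of missed and spurious detections.
    """
    num_tp = len(assoc_scores)
    total = num_tp + num_fn + num_fp
    if total == 0 or num_tp == 0:
        return 0.0
    detection_accuracy = num_tp / total
    association_accuracy = sum(assoc_scores) / num_tp
    return sqrt(detection_accuracy * association_accuracy)


print(hota_at_threshold([1.0, 1.0, 1.0, 1.0], num_fn=0, num_fp=0))  # 1.0
print(hota_at_threshold([0.5, 0.5], num_fn=1, num_fp=1))            # 0.5
```

Because the two factors enter symmetrically, a tracker cannot obtain a high score by excelling at detection alone or at association alone.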


Book ChapterDOI
18 Mar 2020
TL;DR: This work examines whether the large accuracy gains claimed by deep metric learning papers hold up, finds flaws in the experimental methodology of numerous metric learning papers, and shows that the actual improvements over time have been marginal at best.
Abstract: Deep metric learning papers from the past four years have consistently claimed great advances in accuracy, often more than doubling the performance of decade-old methods. In this paper, we take a closer look at the field to see if this is actually true. We find flaws in the experimental methodology of numerous metric learning papers, and show that the actual improvements over time have been marginal at best. Code is available at github.com/KevinMusgrave/powerful-benchmarker.

200 citations


Proceedings ArticleDOI
26 Mar 2020
TL;DR: This paper proposes a metric learning objective for open-set speaker recognition, where ideal embeddings should condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance.
Abstract: The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-speaker and large inter-speaker distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper, we present an extensive evaluation of most popular loss functions for speaker recognition on the VoxCeleb dataset. We demonstrate that the vanilla triplet loss shows competitive performance compared to classification-based losses, and those trained with our proposed metric learning objective outperform state-of-the-art methods.

199 citations
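
For readers unfamiliar with the "vanilla triplet loss" mentioned in the abstract, a minimal sketch follows. It assumes squared Euclidean distance between embeddings and a margin of 0.2; both are illustrative choices, not values taken from the paper, and the paper's own proposed objective is not reproduced here.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pushes the same-speaker pair closer than the different-speaker pair.

    The margin value is an assumption made for this example.
    """
    return max(
        0.0,
        squared_distance(anchor, positive) - squared_distance(anchor, negative) + margin,
    )


anchor = [0.0, 0.0]
same_speaker = [0.1, 0.0]
far_other_speaker = [1.0, 0.0]
near_other_speaker = [0.2, 0.0]

print(triplet_loss(anchor, same_speaker, far_other_speaker))   # 0.0, the margin is already satisfied
print(triplet_loss(anchor, same_speaker, near_other_speaker))  # positive, the negative is too close
```

The loss is zero once the inter-speaker distance exceeds the intra-speaker distance by the margin, which matches the "small intra-speaker and large inter-speaker distance" goal.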


Posted Content
TL;DR: This work proposes to train the first part of the circuit with the objective of maximally separating data classes in Hilbert space, a strategy it calls quantum metric learning, which provides a powerful analytic framework for quantum machine learning.
Abstract: Quantum classifiers are trainable quantum circuits used as machine learning models. The first part of the circuit implements a quantum feature map that encodes classical inputs into quantum states, embedding the data in a high-dimensional Hilbert space; the second part of the circuit executes a quantum measurement interpreted as the output of the model. Usually, the measurement is trained to distinguish quantum-embedded data. We propose to instead train the first part of the circuit (the embedding) with the objective of maximally separating data classes in Hilbert space, a strategy we call quantum metric learning. As a result, the measurement minimizing a linear classification loss is already known and depends on the metric used: for embeddings separating data using the l1 or trace distance, this is the Helstrom measurement, while for the l2 or Hilbert-Schmidt distance, it is a simple overlap measurement. This approach provides a powerful analytic framework for quantum machine learning and eliminates a major component in current models, freeing up more precious resources to best leverage the capabilities of near-term quantum information processors.

189 citations


Journal ArticleDOI
Dimitrios Psaltis, Lia Medeiros, Pierre Christian, Feryal Özel, +212 more (53 institutions)
TL;DR: It is shown analytically that spacetimes that deviate from the Kerr metric but satisfy weak-field tests can lead to large deviations in the predicted black-hole shadows that are inconsistent with even the current EHT measurements.
Abstract: The 2017 Event Horizon Telescope (EHT) observations of the central source in M87 have led to the first measurement of the size of a black-hole shadow. This observation offers a new and clean gravitational test of the black-hole metric in the strong-field regime. We show analytically that spacetimes that deviate from the Kerr metric but satisfy weak-field tests can lead to large deviations in the predicted black-hole shadows that are inconsistent with even the current EHT measurements. We use numerical calculations of regular, parametric, non-Kerr metrics to identify the common characteristic among these different parametrizations that control the predicted shadow size. We show that the shadow-size measurements place significant constraints on deviation parameters that control the second post-Newtonian and higher orders of each metric and are, therefore, inaccessible to weak-field tests. The new constraints are complementary to those imposed by observations of gravitational waves from stellar-mass sources.

Journal ArticleDOI
TL;DR: It is shown that the recently introduced four-dimensional Einstein–Gauss–Bonnet theory does not admit an intrinsically four-dimensional definition in terms of the metric only and, as such, does not exist in four dimensions.
Abstract: No! We show that the field equations of Einstein–Gauss–Bonnet theory defined in generic $D>4$ dimensions split into two parts, one of which always remains higher dimensional, and hence the theory does not have a non-trivial limit to $D=4$. Therefore, the recently introduced four-dimensional, novel, Einstein–Gauss–Bonnet theory does not admit an intrinsically four-dimensional definition in terms of the metric only; as such, it does not exist in four dimensions. The solutions (the spacetime, the metric) always remain $D>4$ dimensional. As there is no canonical choice of 4 spacetime dimensions out of D dimensions for generic metrics, the theory is not well defined in four dimensions.

Posted Content
TL;DR: The core idea is to use feature-wise transformation layers for augmenting the image features using affine transforms to simulate various feature distributions under different domains in the training stage, and to apply a learning-to-learn approach to search for the hyper-parameters of the feature-wise transformation layers.
Abstract: Few-shot classification aims to recognize novel categories with only few labeled images in each class. Existing metric-based few-shot classification algorithms predict categories by comparing the feature embeddings of query images with those from a few labeled images (support examples) using a learned metric function. While promising performance has been demonstrated, these methods often fail to generalize to unseen domains due to large discrepancy of the feature distribution across domains. In this work, we address the problem of few-shot classification under domain shifts for metric-based methods. Our core idea is to use feature-wise transformation layers for augmenting the image features using affine transforms to simulate various feature distributions under different domains in the training stage. To capture variations of the feature distributions under different domains, we further apply a learning-to-learn approach to search for the hyper-parameters of the feature-wise transformation layers. We conduct extensive experiments and ablation studies under the domain generalization setting using five few-shot classification datasets: mini-ImageNet, CUB, Cars, Places, and Plantae. Experimental results demonstrate that the proposed feature-wise transformation layer is applicable to various metric-based models, and provides consistent improvements on the few-shot classification performance under domain shift.

Journal ArticleDOI
TL;DR: A novel unsupervised deep framework named DEep Clustering-based Asymmetric MEtric Learning (DECAMEL) is developed, which learns a compact cross-view cluster structure of Re-ID data to help alleviate the view-specific bias and facilitate mining the potential cross-view discriminative information for unsupervised Re-ID.
Abstract: Person re-identification (Re-ID) aims to match identities across non-overlapping camera views. Researchers have proposed many supervised Re-ID models which require quantities of cross-view pairwise labelled data. This limits their scalability to many applications where a large amount of data from multiple disjoint camera views is available but unlabelled. Although some unsupervised Re-ID models have been proposed to address the scalability problem, they often suffer from the view-specific bias problem which is caused by dramatic variances across different camera views, e.g., different illumination, viewpoints and occlusion. The dramatic variances induce specific feature distortions in different camera views, which can be very disturbing in finding cross-view discriminative information for Re-ID in the unsupervised scenarios, since no label information is available to help alleviate the bias. We propose to explicitly address this problem by learning an unsupervised asymmetric distance metric based on cross-view clustering. The asymmetric distance metric allows specific feature transformations for each camera view to tackle the specific feature distortions. We then design a novel unsupervised loss function to embed the asymmetric metric into a deep neural network, and therefore develop a novel unsupervised deep framework named the DEep Clustering-based Asymmetric MEtric Learning (DECAMEL). In such a way, DECAMEL jointly learns the feature representation and the unsupervised asymmetric metric. DECAMEL learns a compact cross-view cluster structure of Re-ID data, and thus helps alleviate the view-specific bias and facilitates mining the potential cross-view discriminative information for unsupervised Re-ID. Extensive experiments on seven benchmark datasets whose sizes span several orders show the effectiveness of our framework.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: A novel, efficient unsupervised symmetric image registration method which maximizes the similarity between images within the space of diffeomorphic maps and estimates both forward and inverse transformations simultaneously.
Abstract: Diffeomorphic deformable image registration is crucial in many medical image studies, as it offers unique, special features including topology preservation and invertibility of the transformation. Recent deep learning-based deformable image registration methods achieve fast image registration by leveraging a convolutional neural network (CNN) to learn the spatial transformation from the synthetic ground truth or the similarity metric. However, these approaches often ignore the topology preservation of the transformation and the smoothness of the transformation which is enforced by a global smoothing energy function alone. Moreover, deep learning-based approaches often estimate the displacement field directly, which cannot guarantee the existence of the inverse transformation. In this paper, we present a novel, efficient unsupervised symmetric image registration method which maximizes the similarity between images within the space of diffeomorphic maps and estimates both forward and inverse transformations simultaneously. We evaluate our method on 3D image registration with a large scale brain image dataset. Our method achieves state-of-the-art registration accuracy and running time while maintaining desirable diffeomorphic properties.

Journal ArticleDOI
Bin Yang, Yaguo Lei, Feng Jia, Naipeng Li, Du Zhaojun
TL;DR: A distance metric named polynomial kernel induced MMD (PK-MMD) is proposed and combined with a diagnosis model to reuse diagnosis knowledge from one machine to another; the PK-MMD-based diagnosis model presents better transfer results than other methods.
Abstract: Deep transfer-learning-based diagnosis models are promising for applying diagnosis knowledge across related machines whose collected data follow different distributions. To reduce the distribution discrepancy, Gaussian kernel induced maximum mean discrepancy (GK-MMD) is a widely used distance metric to impose constraints on the training of diagnosis models. However, the models using GK-MMD have three weaknesses: 1) GK-MMD may not accurately estimate distribution discrepancy because it ignores the high-order moment distances of data; 2) the time complexity of GK-MMD is high, requiring much computation cost; 3) the transfer performance of GK-MMD-based diagnosis models is sensitive to the selected kernel parameters. To overcome these weaknesses, a distance metric named polynomial kernel induced MMD (PK-MMD) is proposed in this article. Combined with PK-MMD, a diagnosis model is constructed to reuse diagnosis knowledge from one machine to the other. The proposed methods are verified by two transfer learning cases, in which the health states of locomotive bearings are identified with the help of data from motor bearings and gearbox bearings, respectively, in laboratories. The results show that PK-MMD improves on the inefficient computation of GK-MMD, and the PK-MMD-based diagnosis model presents better transfer results than other methods.
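
A maximum mean discrepancy with a polynomial kernel can be sketched in a few lines. The example below uses the standard biased estimate of the squared MMD; the kernel parameters (`scale`, `offset`, `degree`) and function names are assumptions made for illustration and are not the settings or the exact formulation used in the paper.

```python
def polynomial_kernel(x, y, scale=1.0, offset=1.0, degree=2):
    """k(x, y) = (scale * <x, y> + offset) ** degree; parameter values are illustrative."""
    dot = sum(a * b for a, b in zip(x, y))
    return (scale * dot + offset) ** degree


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(source, target, kernel=polynomial_kernel):
    """Biased estimate of the squared MMD between two samples."""
    return (
        mean_kernel(source, source, kernel)
        + mean_kernel(target, target, kernel)
        - 2.0 * mean_kernel(source, target, kernel)
    )


source = [[0.0], [2.0]]
target = [[3.0], [5.0]]


def linear_kernel(x, y):
    # Degree-1 special case: the squared MMD reduces to the squared distance between sample means.
    return polynomial_kernel(x, y, scale=1.0, offset=0.0, degree=1)


print(mmd_squared(source, source))                 # 0.0 for identical samples
print(mmd_squared(source, target, linear_kernel))  # 9.0, i.e. (4 - 1) ** 2
print(mmd_squared(source, target))                 # degree 2 also compares second-order moments
```

Raising the kernel degree makes the estimate sensitive to higher-order moments of the two samples, which is the property the abstract says the Gaussian-kernel variant may not capture accurately.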

Proceedings Article
25 Jun 2020
TL;DR: Despite its simplicity, the method outperforms the self-supervised state-of-the-art on a variety of label propagation tasks involving objects, semantic parts, and pose.
Abstract: This paper proposes a simple self-supervised approach for learning a representation for visual correspondence from raw video. We cast correspondence as prediction of links in a space-time graph constructed from video. In this graph, the nodes are patches sampled from each frame, and nodes adjacent in time can share a directed edge. We learn a representation in which pairwise similarity defines transition probability of a random walk, so that long-range correspondence is computed as a walk along the graph. We optimize the representation to place high probability along paths of similarity. Targets for learning are formed without supervision, by cycle-consistency: the objective is to maximize the likelihood of returning to the initial node when walking along a graph constructed from a palindrome of frames. Thus, a single path-level constraint implicitly supervises chains of intermediate comparisons. When used as a similarity metric without adaptation, the learned representation outperforms the self-supervised state-of-the-art on label propagation tasks involving objects, semantic parts, and pose. Moreover, we demonstrate that a technique we call edge dropout, as well as self-supervised adaptation at test-time, further improve transfer for object-centric correspondence.

Journal ArticleDOI
TL;DR: A large-scale multi-object tracker based on the generalised labeled multi-Bernoulli (GLMB) filter is proposed and a new method of applying the optimal sub-pattern assignment (OSPA) metric to determine a meaningful distance between two sets of tracks is introduced.
Abstract: A large-scale multi-object tracker based on the generalised labeled multi-Bernoulli (GLMB) filter is proposed. The algorithm is capable of tracking a very large, unknown and time-varying number of objects simultaneously, in the presence of a high number of false alarms, as well as missed detections and measurement origin uncertainty due to closely spaced objects. The algorithm is demonstrated on a simulated tracking scenario, where the peak number of objects appearing simultaneously exceeds one million. Additionally, we introduce a new method of applying the optimal sub-pattern assignment (OSPA) metric to determine a meaningful distance between two sets of tracks. We also develop an efficient strategy for its exact computation in large-scale scenarios to evaluate the performance of the proposed tracker.
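
For orientation, the sketch below computes the standard OSPA distance between two small sets of points. It is not the track-level method introduced in the paper, and it finds the optimal assignment by brute force, which is only practical for tiny sets; the cut-off `c` and order `p` defaults are arbitrary choices for the example.

```python
from itertools import permutations
from math import dist


def ospa(xs, ys, c=10.0, p=1):
    """Standard OSPA distance between two finite point sets.

    c: cut-off that caps each localization error and prices each unmatched point.
    p: order of the metric.
    Brute-force assignment search, for illustration only.
    """
    if len(xs) > len(ys):
        xs, ys = ys, xs
    m, n = len(xs), len(ys)
    if n == 0:
        return 0.0
    # Best one-to-one assignment of the smaller set into the larger set.
    best = min(
        sum(min(c, dist(x, ys[j])) ** p for x, j in zip(xs, assignment))
        for assignment in permutations(range(n), m)
    )
    # Each of the (n - m) unmatched points is charged the cut-off c.
    return ((best + (c**p) * (n - m)) / n) ** (1.0 / p)


print(ospa([(0.0, 0.0)], [(3.0, 4.0)]))                    # 5.0, pure localization error
print(ospa([(0.0, 0.0)], [(0.0, 0.0), (100.0, 100.0)]))    # 5.0, one unmatched point costs c / n
```

The two examples give the same value for different reasons, which shows how the metric folds localization and cardinality errors into one number.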

Book ChapterDOI
23 Aug 2020
TL;DR: A new domain generalization framework that learns how to generalize across domains simultaneously from extrinsic relationship supervision and intrinsic self-supervision for images from multi-source domains is presented.
Abstract: The generalization capability of neural networks across domains is crucial for real-world applications. We argue that a generalized object recognition system should well understand the relationships among different images and also the images themselves at the same time. To this end, we present a new domain generalization framework (called EISNet) that learns how to generalize across domains simultaneously from extrinsic relationship supervision and intrinsic self-supervision for images from multi-source domains. To be specific, we formulate our framework with feature embedding using a multi-task learning paradigm. Besides conducting the common supervised recognition task, we seamlessly integrate a momentum metric learning task and a self-supervised auxiliary task to collectively integrate the extrinsic and intrinsic supervisions. Also, we develop an effective momentum metric learning scheme with the K-hard negative mining to boost the network generalization ability. We demonstrate the effectiveness of our approach on two standard object recognition benchmarks VLCS and PACS, and show that our EISNet achieves state-of-the-art performance.

Journal ArticleDOI
TL;DR: A novel representation learning-based domain adaptation method is proposed to transfer information from the source domain to a target domain where labeled data is scarce; it outperforms several state-of-the-art domain adaptation methods, and its progressive learning strategy is promising.
Abstract: Domain adaptation aims to exploit the supervision knowledge in a source domain for learning prediction models in a target domain. In this article, we propose a novel representation learning-based domain adaptation method, i.e., neural embedding matching (NEM) method, to transfer information from the source domain to the target domain where labeled data is scarce. The proposed approach induces an intermediate common representation space for both domains with a neural network model while matching the embedding of data from the two domains in this common representation space. The embedding matching is based on the fundamental assumptions that a cross-domain pair of instances will be close to each other in the embedding space if they belong to the same class category, and the local geometry property of the data can be maintained in the embedding space. The assumptions are encoded via objectives of metric learning and graph embedding techniques to regularize and learn the semisupervised neural embedding model. We also provide a generalization bound analysis for the proposed domain adaptation method. Meanwhile, a progressive learning strategy is proposed and it improves the generalization ability of the neural network gradually. Experiments are conducted on a number of benchmark data sets and the results demonstrate that the proposed method outperforms several state-of-the-art domain adaptation methods and the progressive learning strategy is promising.

Proceedings ArticleDOI
01 May 2020
TL;DR: This paper presents USR, an unsupervised and reference-free evaluation metric for dialog that trains unsupervised models to measure several desirable qualities of dialog.
Abstract: The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR is a reference-free metric that trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48 and system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.

Proceedings Article
12 Jul 2020
TL;DR: A simple, yet effective, training regularization is proposed to reliably boost the performance of ranking-based DML models on various standard benchmark datasets.
Abstract: Deep Metric Learning (DML) is arguably one of the most influential lines of research for learning visual similarities, with many proposed approaches every year. Although the field benefits from the rapid progress, the divergence in training protocols, architectures, and parameter choices makes an unbiased comparison difficult. To provide a consistent reference point, we revisit the most widely used DML objective functions and conduct a study of the crucial parameter choices as well as the commonly neglected mini-batch sampling process. Under consistent comparison, DML objectives show much higher saturation than indicated by literature. Further, based on our analysis, we uncover a correlation between the embedding space density and compression and the generalization performance of DML models. Exploiting these insights, we propose a simple, yet effective, training regularization to reliably boost the performance of ranking-based DML models on various standard benchmark datasets. Code and a publicly accessible WandB-repo are available at this https URL.

Journal ArticleDOI
TL;DR: An unsupervised knowledge transfer theorem that guarantees the correctness of transferring knowledge and a principal angle-based metric to measure the distance between two pairs of domains are presented.
Abstract: Domain adaptation leverages the knowledge in one domain, the source domain, to improve learning efficiency in another domain, the target domain. Existing heterogeneous domain adaptation research is relatively well-progressed but only in situations where the target domain contains at least a few labeled instances. In contrast, heterogeneous domain adaptation with an unlabeled target domain has not been well-studied. To contribute to the research in this emerging field, this article presents: 1) an unsupervised knowledge transfer theorem that guarantees the correctness of transferring knowledge and 2) a principal angle-based metric to measure the distance between two pairs of domains: one pair comprises the original source and target domains and the other pair comprises two homogeneous representations of two domains. The theorem and the metric have been implemented in an innovative transfer model, called a Grassmann–linear monotonic maps–geodesic flow kernel (GLG), which is specifically designed for heterogeneous unsupervised domain adaptation (HeUDA). The linear monotonic maps (LMMs) meet the conditions of the theorem and are used to construct homogeneous representations of the heterogeneous domains. The metric shows the extent to which the homogeneous representations have preserved the information in the original source and target domains. By minimizing the proposed metric, the GLG model learns the homogeneous representations of heterogeneous domains and transfers knowledge through these learned representations via a geodesic flow kernel (GFK). To evaluate the model, five public data sets were reorganized into ten HeUDA tasks across three applications: cancer detection, credit assessment, and text classification. The experiments demonstrate that the proposed model delivers superior performance over the existing baselines.

Journal Article
TL;DR: This paper introduces a new model that explicitly models individual objects within an image and a new evaluation metric, Semantic Object Accuracy (SOA), that evaluates generated images against their captions; models that explicitly model objects outperform models that model only global image characteristics.
Abstract: Generative adversarial networks conditioned on textual image descriptions are capable of generating realistic-looking images. However, current methods still struggle to generate images based on complex image captions from a heterogeneous domain. Furthermore, quantitatively evaluating these text-to-image models is challenging, as most evaluation metrics only judge image quality but not the conformity between the image and its caption. To address these challenges we introduce a new model that explicitly models individual objects within an image and a new evaluation metric called Semantic Object Accuracy (SOA) that specifically evaluates images given an image caption. The SOA uses a pre-trained object detector to evaluate if a generated image contains objects that are mentioned in the image caption, e.g. whether an image generated from "a car driving down the street" contains a car. We perform a user study comparing several text-to-image models and show that our SOA metric ranks the models the same way as humans, whereas other metrics such as the Inception Score do not. Our evaluation also shows that models which explicitly model objects outperform models which only model global image characteristics.
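
The SOA idea described above (checking whether a detector finds the captioned object in the generated image) can be sketched as follows. This is an illustrative sketch only: `detect` is a hypothetical callable standing in for a pre-trained object detector, and the two averaging choices (per class and per image) are assumptions rather than something the abstract specifies.

```python
from collections import defaultdict

def semantic_object_accuracy(samples, detect):
    """Return (class-averaged, image-averaged) detection rates.

    samples: iterable of (image, label) pairs, where label is the object
             named in the caption the image was generated from.
    detect:  hypothetical callable mapping an image to a set of detected labels.
    """
    per_class = defaultdict(list)
    for image, label in samples:
        # A hit means the detector found the captioned object in the image.
        per_class[label].append(label in detect(image))
    hits = [h for results in per_class.values() for h in results]
    image_avg = sum(hits) / len(hits)
    class_avg = sum(sum(r) / len(r) for r in per_class.values()) / len(per_class)
    return class_avg, image_avg

# Toy stand-in: each "image" is just the set of objects it visibly contains.
toy_samples = [({"car"}, "car"), ({"dog"}, "car"), ({"dog"}, "dog")]
class_avg, image_avg = semantic_object_accuracy(toy_samples, detect=lambda img: img)
```

In the toy data, one of the two "car" images fails to contain a car, so the class-averaged score weights that failure more heavily than the image-averaged one.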

Posted Content
TL;DR: It is shown that even the latest versions of the precision and recall metrics are not yet reliable, and that density and coverage metrics provide more interpretable and reliable signals for practitioners than the existing metrics.
Abstract: Devising indicative evaluation metrics for the image generation task remains an open problem. The most widely used metric for measuring the similarity between real and generated images has been the Fréchet Inception Distance (FID) score. Because it does not differentiate the fidelity and diversity aspects of the generated images, recent papers have introduced variants of precision and recall metrics to diagnose those properties separately. In this paper, we show that even the latest versions of the precision and recall metrics are not yet reliable. For example, they fail to detect the match between two identical distributions, they are not robust against outliers, and the evaluation hyperparameters are selected arbitrarily. We propose density and coverage metrics that solve the above issues. We analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing metrics.
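
The abstract names the density and coverage metrics without defining them. The sketch below assumes the k-nearest-neighbour formulation commonly associated with these metrics: each real sample gets a ball whose radius is the distance to its k-th nearest real neighbour; density counts how many such balls contain each generated sample (normalized by k), and coverage is the fraction of real samples whose ball contains at least one generated sample. The function names and the default `k` are illustrative choices, and `k` must be smaller than the number of real samples.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between rows of a (N, D) and rows of b (M, D)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def density_coverage(real, fake, k=5):
    """Assumed k-NN formulation of density and coverage (see lead-in)."""
    d_rr = pairwise_distances(real, real)
    # After sorting, column 0 is the point itself (distance 0),
    # so column k is the distance to the k-th nearest other real sample.
    radii = np.sort(d_rr, axis=1)[:, k]
    d_rf = pairwise_distances(real, fake)      # shape (N, M)
    inside = d_rf < radii[:, None]             # fake j lies in the ball of real i
    density = inside.sum(axis=0).mean() / k    # balls per fake sample, over k
    coverage = inside.any(axis=1).mean()       # real balls hit by any fake sample
    return float(density), float(coverage)

rng = np.random.default_rng(0)
real = rng.normal(size=(50, 3))
same_density, same_coverage = density_coverage(real, real.copy(), k=5)
far_density, far_coverage = density_coverage(real, real + 1000.0, k=5)
```

Under this formulation, two identical sample sets score 1 on both metrics, while a generated set far from the real data scores 0 on both, which is the behaviour the abstract says earlier precision and recall variants fail to guarantee.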

Book Chapter
23 Aug 2020
TL;DR: Smooth-AP is a plug-and-play objective function that allows for end-to-end training of deep networks with a simple and elegant implementation and improves the performance over the state-of-the-art, especially for larger-scale datasets, thus demonstrating the effectiveness and scalability of Smooth-AP to real-world scenarios.
Abstract: Optimising a ranking-based metric, such as Average Precision (AP), is notoriously challenging due to the fact that it is non-differentiable, and hence cannot be optimised directly using gradient-descent methods. To this end, we introduce an objective that optimises instead a smoothed approximation of AP, coined Smooth-AP. Smooth-AP is a plug-and-play objective function that allows for end-to-end training of deep networks with a simple and elegant implementation. We also present an analysis for why directly optimising the ranking based metric of AP offers benefits over other deep metric learning losses.
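
To illustrate why smoothing makes AP trainable, the sketch below replaces the hard "item j outranks item i" indicator with a sigmoid of the score difference, which is one common way to build such a relaxation. It is a NumPy illustration under that assumption, not the authors' implementation; the temperature `tau` and the function name are illustrative.

```python
import numpy as np

def smooth_ap(scores, labels, tau=0.01):
    """Sigmoid-relaxed Average Precision for one query.

    scores: similarity score of each retrieved item.
    labels: 1 for relevant (positive) items, 0 otherwise.
    tau:    temperature; smaller values approach the exact, step-like AP.
    """
    scores = np.asarray(scores, dtype=float)
    pos = np.asarray(labels, dtype=bool)
    # diff[i, j] = s_j - s_i; a positive value means j is ranked above i.
    diff = scores[None, :] - scores[:, None]
    # Soft indicator of "j outranks i"; clipping avoids overflow in exp.
    soft = 1.0 / (1.0 + np.exp(-np.clip(diff / tau, -60.0, 60.0)))
    np.fill_diagonal(soft, 0.0)  # an item does not outrank itself
    rank_among_all = 1.0 + soft.sum(axis=1)
    rank_among_pos = 1.0 + soft[:, pos].sum(axis=1)
    # AP is the mean, over positives, of (rank among positives) / (rank among all).
    return float((rank_among_pos[pos] / rank_among_all[pos]).mean())

# Positives sit at ranks 1 and 3, so the exact AP is (1/1 + 2/3) / 2 = 5/6.
approx = smooth_ap([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0], tau=0.01)
```

Because every operation is differentiable in the scores, the same expression written in an autodiff framework can serve as a training loss (for example, one minus this value).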

Proceedings Article
Xia Li, Yibo Yang, Qijie Zhao, Tiancheng Shen, Zhouchen Lin, Hong Liu
14 Jun 2020
TL;DR: This paper applies graph convolution to the semantic segmentation task and proposes an improved, data-dependent Laplacian that introduces an attention diagonal matrix to learn a better distance metric.
Abstract: The convolution operation suffers from a limited receptive field, while global modeling is fundamental to dense prediction tasks, such as semantic segmentation. In this paper, we apply graph convolution to the semantic segmentation task and propose an improved Laplacian. The graph reasoning is directly performed in the original feature space organized as a spatial pyramid. Different from existing methods, our Laplacian is data-dependent and we introduce an attention diagonal matrix to learn a better distance metric. It gets rid of projecting and re-projecting processes, which makes our proposed method a light-weight module that can be easily plugged into current computer vision architectures. More importantly, performing graph reasoning directly in the feature space retains spatial relationships and enables the spatial pyramid to explore multiple long-range contextual patterns at different scales. Experiments on Cityscapes, COCO Stuff, PASCAL Context and PASCAL VOC demonstrate the effectiveness of our proposed methods on semantic segmentation. We achieve comparable performance with advantages in computational and memory overhead.

Posted Content
TL;DR: This paper presents the on-line tracking method that took first place in the NuScenes Tracking Challenge and outperforms the AB3DMOT baseline method by a large margin in the Average Multi-Object Tracking Accuracy (AMOTA) metric.
Abstract: 3D multi-object tracking is a key module in autonomous driving applications that provides a reliable dynamic representation of the world to the planning module. In this paper, we present our on-line tracking method, which took first place in the NuScenes Tracking Challenge, held at the AI Driving Olympics Workshop at NeurIPS 2019. Our method estimates the object states by adopting a Kalman Filter. We initialize the state covariance as well as the process and observation noise covariance with statistics from the training set. We also use the stochastic information from the Kalman Filter in the data association step by measuring the Mahalanobis distance between the predicted object states and current object detections. Our experimental results on the NuScenes validation and test sets show that our method outperforms the AB3DMOT baseline method by a large margin in the Average Multi-Object Tracking Accuracy (AMOTA) metric.
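
The data association step described above can be sketched as follows. The Mahalanobis distance uses the standard Kalman innovation covariance S = H P H^T + R; the greedy matching rule, the gating `threshold`, and all function names are assumptions for illustration, since the abstract does not say how pairs are selected once distances are known.

```python
import numpy as np

def mahalanobis_distance(z, x_pred, P_pred, H, R):
    """Distance between detection z and a track's predicted measurement.

    x_pred, P_pred: predicted state mean and covariance from the Kalman filter.
    H: observation matrix; R: observation noise covariance.
    """
    innovation = z - H @ x_pred
    S = H @ P_pred @ H.T + R  # innovation covariance
    return float(np.sqrt(innovation @ np.linalg.solve(S, innovation)))

def greedy_associate(detections, tracks, H, R, threshold):
    """Greedily match tracks to detections by ascending distance (assumed rule).

    tracks: list of (x_pred, P_pred) tuples. Returns (track_idx, det_idx) pairs.
    """
    candidates = []
    for ti, (x_pred, P_pred) in enumerate(tracks):
        for di, z in enumerate(detections):
            d = mahalanobis_distance(z, x_pred, P_pred, H, R)
            if d < threshold:  # gate out implausible pairs
                candidates.append((d, ti, di))
    candidates.sort()
    used_tracks, used_dets, matches = set(), set(), []
    for d, ti, di in candidates:
        if ti not in used_tracks and di not in used_dets:
            matches.append((ti, di))
            used_tracks.add(ti)
            used_dets.add(di)
    return matches

H = np.eye(2)
R = np.eye(2)
tracks = [(np.array([0.0, 0.0]), np.eye(2)), (np.array([10.0, 10.0]), np.eye(2))]
detections = [np.array([10.1, 9.9]), np.array([0.2, -0.1])]
matches = greedy_associate(detections, tracks, H, R, threshold=3.0)
```

Unlike a plain Euclidean or overlap-based cost, this distance shrinks when the filter is uncertain about a track (large P), so uncertain tracks tolerate larger prediction errors before being gated out.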

Proceedings Article
06 Jul 2020
TL;DR: A family of statistical dispersion measurements for the prediction of perceptual degradations is proposed and assessed, and best-performing attributes and features are revealed, under different neighborhood sizes.
Abstract: A point cloud is a 3D image representation that has recently emerged as a viable approach for advanced content modality in modern communication systems. In view of its wide adoption, quality evaluation metrics are essential. In this paper, we propose and assess a family of statistical dispersion measurements for the prediction of perceptual degradations. The employed features characterize local distributions of point cloud attributes reflecting topology and color. After associating local regions between a reference and a distorted model, the corresponding feature values are compared. The visual quality of a distorted model is then predicted by error pooling across individual quality scores obtained per region. The extracted features aim at capturing local changes, similarly to the well-known Structural Similarity Index. Benchmarking results using available datasets reveal the best-performing attributes and features under different neighborhood sizes. Finally, point cloud voxelization is examined as part of the process, improving the prediction accuracy under certain conditions.

Proceedings Article
26 May 2020
TL;DR: This paper introduces PCQM, a full-reference objective metric for visual quality assessment of 3D point clouds, an optimally-weighted linear combination of geometry-based and color-based features that outperforms all previous metrics in terms of correlation with mean opinion scores.
Abstract: 3D point clouds constitute an emerging multimedia content, now used in a wide range of applications. The main drawback of this representation is the size of the data since typical point clouds may contain millions of points, usually associated with both geometry and color information. Consequently, a significant amount of work has been devoted to the efficient compression of this representation. Lossy compression leads to a degradation of the data and thus impacts the visual quality of the displayed content. In that context, predicting perceived visual quality computationally is essential for the optimization and evaluation of compression algorithms. In this paper, we introduce PCQM, a full-reference objective metric for visual quality assessment of 3D point clouds. The metric is an optimally-weighted linear combination of geometry-based and color-based features. We evaluate its performance on an open subjective dataset of colored point clouds compressed by several algorithms; the proposed quality assessment approach outperforms all previous metrics in terms of correlation with mean opinion scores.