Open Access Proceedings Article

Learning with Average Top-k Loss

TLDR
In this paper, the average top-k loss is introduced as a new ensemble loss for supervised learning. It is shown that the average top-k loss leads to convex optimization problems that can be solved effectively with conventional sub-gradient based methods.
Abstract
In this work, we introduce the average top-$k$ (\atk) loss as a new ensemble loss for supervised learning. The \atk loss provides a natural generalization of the two widely used ensemble losses, namely the average loss and the maximum loss. Furthermore, the \atk loss combines their advantages and can alleviate their corresponding drawbacks to better adapt to different data distributions. We show that the \atk loss affords an intuitive interpretation that reduces the penalty of continuous and convex individual losses on correctly classified data. The \atk loss can lead to convex optimization problems that can be solved effectively with conventional sub-gradient based methods. We further study the statistical learning theory of \matk by establishing its classification calibration and statistical consistency, which provide useful insights into the practical choice of the parameter $k$. We demonstrate the applicability of \matk learning combined with different individual loss functions for binary and multi-class classification and regression using synthetic and real datasets.
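As a concrete illustration of the ensemble loss, the minimal sketch below computes the \atk loss over a batch of per-sample losses, both directly as the mean of the $k$ largest losses and via the equivalent form $\frac{1}{k}\min_{\lambda}\{k\lambda + \sum_i[\ell_i - \lambda]_+\}$ that underlies the convexity argument. The hinge individual loss and the function names are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def individual_hinge_losses(w, b, X, y):
    """Per-sample hinge losses for a linear scorer with labels in {-1, +1};
    an illustrative choice of individual loss, not mandated by the AT_k framework."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

def average_top_k_loss(losses, k):
    """Average of the k largest individual losses (the AT_k ensemble loss).
    k = 1 recovers the maximum loss; k = n recovers the average loss."""
    top_k = np.sort(losses)[-k:]      # the k largest per-sample losses
    return top_k.mean()

def average_top_k_via_lambda(losses, k):
    """Equivalent reformulation AT_k = (1/k) * min_lambda {k*lambda + sum_i [l_i - lambda]_+},
    whose minimum is attained at lambda equal to the k-th largest loss."""
    lam = np.sort(losses)[-k]         # k-th largest loss
    return (k * lam + np.maximum(losses - lam, 0.0).sum()) / k
```

For any vector of individual losses the two routines agree, and setting k = 1 or k = n recovers the maximum and average losses, matching the two special cases noted in the abstract.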



Citations
Posted Content

Long-tail learning via logit adjustment

TL;DR: This work revisits the classic idea of logit adjustment based on label frequencies, either applied post-hoc to a trained model or enforced in the loss during training, to encourage a large relative margin between the logits of rare and dominant labels.
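Assuming that "logit adjustment based on label frequencies" refers to offsetting each class logit by a scaled log-prior, the sketch below illustrates both the post-hoc and the training-time variants described in the TL;DR; the function names and the temperature parameter tau are hypothetical.

```python
import numpy as np

def posthoc_logit_adjustment(logits, class_priors, tau=1.0):
    """Post-hoc variant: subtract a scaled log-prior from each class logit
    before taking the argmax, enlarging the effective margin for rare classes."""
    return logits - tau * np.log(class_priors)

def logit_adjusted_cross_entropy(logits, label, class_priors, tau=1.0):
    """Training-time variant: add the scaled log-prior to the logits inside
    the softmax cross-entropy, which has the same margin-enlarging effect."""
    adjusted = logits + tau * np.log(class_priors)
    m = adjusted.max()                                   # stabilize the softmax
    log_probs = adjusted - (m + np.log(np.exp(adjusted - m).sum()))
    return -log_probs[label]
```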
Posted Content

Large-Scale Methods for Distributionally Robust Optimization

TL;DR: This work proposes and analyzes algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and $\chi^2$ divergence uncertainty sets, and proves that they require a number of gradient evaluations independent of the training set size and the number of parameters, making them suitable for large-scale applications.
Posted Content

When Do Curricula Work

TL;DR: The experiments demonstrate that curriculum, but not anti-curriculum, ordering can indeed improve performance with a limited training time budget or in the presence of noisy data, suggesting that any benefit is entirely due to the dynamic training set size.
Posted Content

Coping with Label Shift via Distributionally Robust Optimisation

TL;DR: This paper proposes a model that minimises an objective based on distributionally robust optimisation (DRO), designs and analyses a gradient descent-proximal mirror ascent algorithm tailored to large-scale problems to optimise the proposed objective, and establishes its convergence.
Posted Content

Optimal Epoch Stochastic Gradient Descent Ascent Methods for Min-Max Optimization

TL;DR: This is the first result showing that Epoch-GDA can achieve the optimal rate of O(1/T) for the duality gap of general strongly-convex-strongly-concave (SCSC) min-max problems, leading to a nearly optimal complexity without resorting to smoothness or other structural conditions.