Journal ArticleDOI

LowCon: A Design-based Subsampling Approach in a Misspecified Linear Model

TL;DR: A novel subsampling method, called "LowCon", is developed that approximately minimizes the so-called "worst-case" bias with respect to many possible misspecification terms and outperforms competing methods when the working linear model is misspecified.
Abstract: We consider a measurement-constrained supervised learning problem, that is, (i) the full sample of predictors is given; (ii) the response observations are unavailable and expensive to measure. Thu...
Citations
Journal ArticleDOI
01 Jul 2020
TL;DR: This review presents some cutting-edge subsampling methods based on large-scale least squares estimation, which aim either to develop a more effective data-dependent sampling probability or to select a deterministic subsample in accordance with certain optimality criteria.
Abstract: Subsampling methods aim to select a subsample as a surrogate for the observed sample. As a powerful technique for large-scale data analysis, various subsampling methods have been developed for more effective coefficient estimation and model prediction. This review presents some cutting-edge subsampling methods based on large-scale least squares estimation. Two major families of subsampling methods are introduced: the randomized subsampling approach and the optimal subsampling approach. The former aims to develop a more effective data-dependent sampling probability, while the latter aims to select a deterministic subsample in accordance with certain optimality criteria. Real data examples are provided to compare these methods empirically, with respect to both estimation accuracy and computing time.
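For concreteness, here is a minimal, hypothetical sketch of the randomized subsampling approach described above: rows are drawn with probabilities proportional to their statistical leverage scores, and the coefficients are estimated by reweighted least squares on the subsample. The function names and the inverse-probability reweighting details are illustrative assumptions, not taken from the review.

```python
import numpy as np

def leverage_scores(X):
    """Exact statistical leverage scores: diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    Q, _ = np.linalg.qr(X)          # thin QR; columns of Q span the column space of X
    return np.sum(Q**2, axis=1)     # h_ii = ||Q_i||^2

def leverage_subsample_ols(X, y, r, rng=None):
    """Draw r rows with probability proportional to leverage, then
    solve the reweighted least-squares problem on the subsample."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    p = leverage_scores(X)
    p = p / p.sum()                            # data-dependent sampling probabilities
    idx = rng.choice(n, size=r, replace=True, p=p)
    w = 1.0 / np.sqrt(r * p[idx])              # inverse-probability reweighting
    Xs, ys = X[idx] * w[:, None], y[idx] * w
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

# toy usage: n = 100,000 rows, subsample r = 500
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.standard_normal(100_000)
print(leverage_subsample_ols(X, y, r=500, rng=1))
```

Computing exact leverage scores via QR costs O(np^2); much of the randomized literature the review surveys replaces this step with fast approximations.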

12 citations


Cites background or methods from "LowCon: A Design-based Subsampling ..."

  • ...Measurement constrained supervised learning is an emerging problem in machine learning (Settles, 2012; Wang et al., 2017; Dereziński et al., 2018; Clarkson et al., 2019; Meng et al., 2020)....

  • ...To combat the obstacle, Meng et al. (2020) aims to select a subsample, which balances the trade-off between bias and variance, in order to yield a robust estimation of coefficients....

  • ...To achieve the goal, Meng et al. (2020) considered the setting that the linear regression model is a postulated model, and the true model contains both a linear part and unknown misspecification....

  • ...…al., 2019), large-scale matrix approximation (Williams and Seeger, 2001; Wang and Zhang, 2013; Alaoui and Mahoney, 2015; Altschuler et al., 2019; Wang et al., 2019), nonparametric regression (Gu and Kim, 2002; Ma et al., 2015; Zhang et al., 2018; Meng et al., 2020; Sun et al., 2020), among others....

Journal ArticleDOI
TL;DR: This work proposes a subdata selection method based on leverage scores, which enables the selection task to be conducted on a small subdata set and not only improves the probability of selecting the best model but also enhances the estimation efficiency.
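The deterministic flavor of this idea can be sketched in a few lines, assuming the simplest possible rule (keep the r rows with the largest leverage scores); the paper's actual selection rule and model-selection criteria may differ.

```python
import numpy as np

def top_leverage_subdata(X, r):
    """Hypothetical deterministic rule: keep the r rows with the largest
    statistical leverage scores h_ii = ||Q_i||^2, where X = QR."""
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    return np.argsort(h)[-r:]        # indices of the selected subdata

# usage: fit OLS on the selected subdata only
rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 8))
y = X @ rng.standard_normal(8) + rng.standard_normal(50_000)
idx = top_leverage_subdata(X, r=400)
beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
```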

11 citations

Journal ArticleDOI
13 May 2022
TL;DR: Optimal transport (OT) methods seek a transformation map (or plan) between two probability measures such that the transformation has the minimum transportation cost; this minimum cost, with a certain power transform, is called the Wasserstein distance.
Abstract: Optimal transport (OT) methods seek a transformation map (or plan) between two probability measures, such that the transformation has the minimum transportation cost. Such a minimum transport cost, with a certain power transform, is called the Wasserstein distance. Recently, OT methods have drawn great attention in statistics, machine learning, and computer science, especially in deep generative neural networks. Despite its broad applications, the estimation of high-dimensional Wasserstein distances is a well-known challenging problem owing to the curse of dimensionality. There are some cutting-edge projection-based techniques that tackle high-dimensional OT problems. Three major families of such techniques are introduced: the slicing approach, the iterative projection approach, and the projection-robust OT approach. Open challenges are discussed at the end of the review.
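As a concrete illustration of the slicing approach mentioned above, the following sketch estimates the sliced Wasserstein distance: each random direction reduces the d-dimensional problem to one dimension, where W_p between equal-size samples has a closed form via matched order statistics. The Monte Carlo details and parameter choices are illustrative assumptions, not prescriptions from the review.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=200, p=2, rng=None):
    """Monte Carlo estimate of the sliced Wasserstein-p distance between
    two empirical measures with equal sample sizes.

    Each random direction theta reduces the d-dimensional OT problem to 1-D,
    where W_p is computed by matching sorted projections (order statistics)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)          # uniform direction on the sphere
        xp, yp = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean(np.abs(xp - yp) ** p)  # 1-D W_p^p via sorted samples
    return (total / n_proj) ** (1.0 / p)

# toy usage: two Gaussian samples in d = 10
rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 10))
Y = rng.standard_normal((1_000, 10)) + 0.5
print(sliced_wasserstein(X, Y, rng=1))
```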

10 citations

Journal ArticleDOI
TL;DR: In this paper, the authors developed an efficient algorithm that is adaptive to the unknown probability density function of the predictors, which has the same convergence rate as the full-basis estimator when $q$ is roughly of the order $O[n^{2d/\{(pr+1)(d+2)\}}]$, where $p\in[1,2]$ and $r\approx 4$ are constants that depend on the type of spline.
Abstract: Smoothing splines have been used pervasively in nonparametric regression. However, the computational burden of smoothing splines is significant when the sample size $n$ is large. When the number of predictors $d\geq2$, the computational cost for smoothing splines is of the order $O(n^3)$ using the standard approach. Many methods have been developed to approximate smoothing spline estimators by using $q$ basis functions instead of $n$, resulting in a computational cost of the order $O(nq^2)$. These methods are called basis selection methods. Despite their algorithmic benefits, most basis selection methods require the assumption that the sample is uniformly distributed on a hypercube, and they may perform poorly when this assumption is not met. To overcome this obstacle, we develop an efficient algorithm that is adaptive to the unknown probability density function of the predictors. Theoretically, we show the proposed estimator has the same convergence rate as the full-basis estimator when $q$ is roughly of the order $O[n^{2d/\{(pr+1)(d+2)\}}]$, where $p\in[1,2]$ and $r\approx 4$ are constants that depend on the type of spline. Numerical studies on various synthetic datasets demonstrate the superior performance of the proposed estimator in comparison with mainstream competitors.
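The following sketch illustrates the generic basis-selection idea described above, under stated assumptions: it picks the $q$ basis centers uniformly at random (whereas the paper's estimator adapts the selection to the predictors' density), and a Gaussian kernel stands in for the spline reproducing kernel. It is a sketch of the general technique, not the authors' algorithm.

```python
import numpy as np

def basis_selection_fit(X, y, q, lam=1e-3, bandwidth=1.0, rng=None):
    """Generic basis-selection sketch: represent the estimator with q kernel
    basis functions centered at a subset of the data and solve the reduced
    penalized least-squares problem at O(n q^2) cost."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = X[rng.choice(n, size=q, replace=False)]   # q of the n basis centers

    def K(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth**2))         # illustrative kernel choice

    Knq = K(X, centers)                 # n x q design in the reduced basis
    Kqq = K(centers, centers)           # q x q penalty block
    # minimize ||y - Knq c||^2 / n + lam * c' Kqq c
    c = np.linalg.solve(Knq.T @ Knq / n + lam * Kqq, Knq.T @ y / n)
    return lambda Xnew: K(Xnew, centers) @ c

# toy usage: n = 5,000 points, q = 50 basis functions
rng = np.random.default_rng(0)
X = rng.uniform(size=(5_000, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(5_000)
fhat = basis_selection_fit(X, y, q=50, rng=1)
print(fhat(X[:5]), y[:5])
```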

8 citations

References
Journal ArticleDOI
TL;DR: In this paper, two sampling plans are examined as alternatives to simple random sampling in Monte Carlo studies and are shown to be improvements over simple random sampling with respect to variance for a class of estimators which includes the sample mean and the empirical distribution function.
Abstract: Two types of sampling plans are examined as alternatives to simple random sampling in Monte Carlo studies. These plans are shown to be improvements over simple random sampling with respect to variance for a class of estimators which includes the sample mean and the empirical distribution function.
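A minimal sketch of basic Latin hypercube sampling, assuming points in $[0,1]^d$, may help make the variance-reduction claim concrete: each margin is split into $n$ equal strata, each stratum is hit exactly once, and the per-dimension permutations are independent. The toy comparison at the end is illustrative, not from the paper.

```python
import numpy as np

def latin_hypercube(n, d, rng=None):
    """Basic Latin hypercube sample of n points in [0, 1]^d: each margin is
    split into n equal strata and each stratum is hit exactly once."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(size=(n, d))                              # position within each stratum
    perms = np.column_stack([rng.permutation(n) for _ in range(d)])
    return (perms + u) / n

# illustration: estimate E[g(U)] for a function with a strong additive part;
# LHS typically has much smaller variance than simple random sampling here
g = lambda x: np.sum(np.sin(2 * np.pi * x), axis=1) + x.prod(axis=1)
rng = np.random.default_rng(0)
reps, n, d = 500, 64, 3
est_srs = [g(rng.uniform(size=(n, d))).mean() for _ in range(reps)]
est_lhs = [g(latin_hypercube(n, d, rng)).mean() for _ in range(reps)]
print(np.var(est_srs), np.var(est_lhs))
```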

8,328 citations

Journal ArticleDOI
TL;DR: In this paper, the asymptotic variance of a Latin hypercube estimate is obtained and the estimate is shown to be asymptotically normal; a method for producing Latin hypercube samples when the components of the input variables are statistically dependent is also described.
Abstract: Latin hypercube sampling (McKay, Conover, and Beckman 1979) is a method of sampling that can be used to produce input values for estimation of expectations of functions of output variables. The asymptotic variance of such an estimate is obtained. The estimate is also shown to be asymptotically normal. Asymptotically, the variance is less than that obtained using simple random sampling, with the degree of variance reduction depending on the degree of additivity in the function being integrated. A method for producing Latin hypercube samples when the components of the input variables are statistically dependent is also described. These techniques are applied to a simulation of the performance of a printer actuator.

1,750 citations

Journal ArticleDOI
TL;DR: In this article, the authors developed a scheme for the development and use of soil spectral libraries for rapid nondestructive estimation of soil properties based on analysis of diffuse reflectance spectroscopy, opening up new possibilities for modeling, assessment, and management of risk in soil evaluations for agricultural, environmental, and engineering applications.
Abstract: Methods for rapid estimation of soil properties are needed for quantitative assessments of land management problems. We developed a scheme for development and use of soil spectral libraries for rapid nondestructive estimation of soil properties based on analysis of diffuse reflectance spectroscopy. A diverse library of over 1000 archived topsoils from eastern and southern Africa was used to test the approach. Air-dried soils were scanned using a portable spectrometer (0.35-2.5 μm) with an artificial light source. Soil properties were calibrated to soil reflectance using multivariate adaptive regression splines (MARS), and screening tests were developed for various soil fertility constraints using classification trees. A random sample of one-third of the soils was withheld for validation purposes. Validation r^2 values for regressions were: exchangeable Ca, 0.88; effective cation-exchange capacity (ECEC), 0.88; exchangeable Mg, 0.81; organic C concentration, 0.80; clay content, 0.80; sand content, 0.76; and soil pH, 0.70. Validation likelihood ratios for diagnostic screening tests were: ECEC 4.1 mg kg^-1 d^-1, 2.9; extractable P <7 mg kg^-1, 2.9; exchangeable K <0.2 cmol_c kg^-1, 2.6. We show the response of prediction accuracy to sample size and demonstrate how the predictive value of spectral libraries can be iteratively increased through detection of spectral outliers among new samples. The spectral library approach opens up new possibilities for modeling, assessment and management of risk in soil evaluations in agricultural, environmental, and engineering applications. Further research should test the use of soil reflectance in pedotransfer functions for prediction of soil functional attributes.

936 citations

Journal ArticleDOI
Boxin Tang1
TL;DR: It is proved that when used for integration, the sampling scheme with OA-based Latin hypercubes offers a substantial improvement over Latin hypercube sampling.
Abstract: In this article, we use orthogonal arrays (OA's) to construct Latin hypercubes. Besides preserving the univariate stratification properties of Latin hypercubes, these strength r OA-based Latin hypercubes also stratify each r-dimensional margin. Therefore, such OA-based Latin hypercubes provide more suitable designs for computer experiments and numerical integration than do general Latin hypercubes. We prove that when used for integration, the sampling scheme with OA-based Latin hypercubes offers a substantial improvement over Latin hypercube sampling.
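To make the construction concrete, here is a hedged sketch assuming a prime number of levels $s$ and the classical strength-2 array over $\mathbb{Z}_s$; Tang's paper covers general orthogonal arrays, so this is one special case, not the full method. Within each column, each OA level is expanded into $s$ distinct fine strata in random order, giving a Latin hypercube whose two-dimensional margins are also stratified on the $s \times s$ grid.

```python
import numpy as np

def bush_oa(s, k):
    """Strength-2 orthogonal array OA(s^2, k, s, 2) for prime s and k <= s + 1,
    via the classical construction over Z_s."""
    assert k <= s + 1, "this construction needs k <= s + 1 columns"
    rows = [[(a * j + b) % s for j in range(k - 1)] + [a]
            for a in range(s) for b in range(s)]
    return np.array(rows)

def oa_based_lh(s, d, rng=None):
    """OA-based Latin hypercube of n = s^2 points in [0, 1]^d (d <= s + 1):
    each column is a permutation of {0, ..., n-1} (a valid Latin hypercube),
    and each 2-D margin inherits the s x s stratification of the OA."""
    rng = np.random.default_rng(rng)
    A = bush_oa(s, d)
    n = s * s
    X = np.empty((n, d))
    for j in range(d):
        col = np.empty(n)
        for level in range(s):
            pos = np.where(A[:, j] == level)[0]          # s positions per level
            col[pos] = level * s + rng.permutation(s)    # distinct fine strata
        X[:, j] = (col + rng.uniform(size=n)) / n
    return X

# toy usage: 49 points in [0, 1]^3 with all 7 x 7 bivariate strata covered
X = oa_based_lh(s=7, d=3, rng=0)
print(X.shape)
```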

768 citations

Journal ArticleDOI
TL;DR: The conditioned Latin hypercube sampling (cLHS) method is presented with a search algorithm based on heuristic rules combined with an annealing schedule, and is illustrated with a simple 3-D example and an application to digital soil mapping of part of the Hunter Valley of New South Wales, Australia.
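A minimal, hypothetical sketch of the conditioned-Latin-hypercube idea: choose n rows of an existing data matrix so that, in every dimension, the subsample spreads as evenly as possible over n quantile strata, searching by simulated annealing over swap moves. The published cLHS objective also includes correlation and categorical terms, which are omitted here.

```python
import numpy as np

def clhs(X, n, n_iter=20_000, t0=1.0, cooling=0.999, rng=None):
    """Select n rows of X whose marginal strata counts are as close as
    possible to one observation per quantile stratum in every dimension."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    edges = np.quantile(X, np.linspace(0, 1, n + 1), axis=0)   # n strata per variable
    # stratum index of every observation in every dimension
    strata = np.stack([np.clip(np.searchsorted(edges[1:-1, j], X[:, j]), 0, n - 1)
                       for j in range(d)], axis=1)

    def cost(idx):
        # sum over dimensions of |count per stratum - 1|
        return sum(np.abs(np.bincount(strata[idx, j], minlength=n) - 1).sum()
                   for j in range(d))

    idx = rng.choice(N, size=n, replace=False)
    rest = np.setdiff1d(np.arange(N), idx)
    c, t = cost(idx), t0
    for _ in range(n_iter):
        i, j = rng.integers(n), rng.integers(N - n)
        idx[i], rest[j] = rest[j], idx[i]          # propose a swap
        c_new = cost(idx)
        if c_new <= c or rng.uniform() < np.exp((c - c_new) / t):
            c = c_new                              # accept the move
        else:
            idx[i], rest[j] = rest[j], idx[i]      # undo the swap
        t *= cooling                               # annealing schedule
    return idx

# toy usage: select 20 design points from 10,000 correlated predictors
rng = np.random.default_rng(0)
Z = rng.standard_normal((10_000, 2))
X = np.column_stack([Z[:, 0], 0.8 * Z[:, 0] + 0.6 * Z[:, 1]])
print(sorted(clhs(X, n=20, rng=1)))
```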

744 citations