Journal ArticleDOI

LowCon: A Design-based Subsampling Approach in a Misspecified Linear Model

TL;DR: A novel subsampling method, called "LowCon", is developed that approximately minimizes the so-called "worst-case" bias with respect to many possible misspecification terms and outperforms competing methods when the working linear model is misspecified.
Abstract: We consider a measurement-constrained supervised learning problem, that is, (i) the full sample of predictors is given; (ii) the response observations are unavailable and expensive to measure. Thu...
Citations
Journal ArticleDOI
01 Jul 2020
TL;DR: This review presents some cutting-edge subsampling methods based on large-scale least squares estimation, which aim either to develop a more effective data-dependent sampling probability or to select a deterministic subsample in accordance with certain optimality criteria.
Abstract: Subsampling methods aim to select a subsample as a surrogate for the observed sample. As a powerful technique for large-scale data analysis, various subsampling methods have been developed for more effective coefficient estimation and model prediction. This review presents some cutting-edge subsampling methods based on large-scale least squares estimation. Two major families of subsampling methods are introduced: the randomized subsampling approach and the optimal subsampling approach. The former aims to develop a more effective data-dependent sampling probability, while the latter aims to select a deterministic subsample in accordance with certain optimality criteria. Real data examples are provided to compare these methods empirically, with respect to both estimation accuracy and computing time.
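For concreteness, here is a minimal, hypothetical sketch of the randomized subsampling approach described above: rows are drawn with probabilities proportional to their statistical leverage scores, and the coefficients are estimated by reweighted least squares on the subsample. The function names and the inverse-probability reweighting details are illustrative assumptions, not taken from the review.

```python
import numpy as np

def leverage_scores(X):
    """Exact statistical leverage scores: diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    Q, _ = np.linalg.qr(X)          # thin QR; columns of Q span the column space of X
    return np.sum(Q**2, axis=1)     # h_ii = ||Q_i||^2

def leverage_subsample_ols(X, y, r, rng=None):
    """Draw r rows with probability proportional to leverage, then
    solve the reweighted least-squares problem on the subsample."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    p = leverage_scores(X)
    p = p / p.sum()                            # data-dependent sampling probabilities
    idx = rng.choice(n, size=r, replace=True, p=p)
    w = 1.0 / np.sqrt(r * p[idx])              # inverse-probability reweighting
    Xs, ys = X[idx] * w[:, None], y[idx] * w
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

# toy usage: n = 100,000 rows, subsample r = 500
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.standard_normal(100_000)
print(leverage_subsample_ols(X, y, r=500, rng=1))
```

Computing exact leverage scores via QR costs O(np^2); much of the randomized literature the review surveys replaces this step with fast approximations.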

12 citations


Cites background or methods from "LowCon: A Design-based Subsampling ..."

  • ...Measurement constrained supervised learning is an emerging problem in machine learning (Settles, 2012; Wang et al., 2017; Dereziński et al., 2018; Clarkson et al., 2019; Meng et al., 2020)....

  • ...To combat the obstacle, Meng et al. (2020) aims to select a subsample, which balances the trade-off between bias and variance, in order to yield a robust estimation of coefficients....

  • ...To achieve the goal, Meng et al. (2020) considered the setting that the linear regression model is a postulated model, and the true model contains both a linear part and unknown misspecification....

  • ...…al., 2019), large-scale matrix approximation (Williams and Seeger, 2001; Wang and Zhang, 2013; Alaoui and Mahoney, 2015; Altschuler et al., 2019; Wang et al., 2019), nonparametric regression (Gu and Kim, 2002; Ma et al., 2015; Zhang et al., 2018; Meng et al., 2020; Sun et al., 2020), among others....

Journal ArticleDOI
TL;DR: This work proposes a subdata selection method based on leverage scores, which enables the selection task to be conducted on a small subdata set and not only improves the probability of selecting the best model but also enhances the estimation efficiency.
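The deterministic flavor of this idea can be sketched in a few lines, assuming the simplest possible rule (keep the r rows with the largest leverage scores); the paper's actual selection rule and model-selection criteria may differ.

```python
import numpy as np

def top_leverage_subdata(X, r):
    """Hypothetical deterministic rule: keep the r rows with the largest
    statistical leverage scores h_ii = ||Q_i||^2, where X = QR."""
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    return np.argsort(h)[-r:]        # indices of the selected subdata

# usage: fit OLS on the selected subdata only
rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 8))
y = X @ rng.standard_normal(8) + rng.standard_normal(50_000)
idx = top_leverage_subdata(X, r=400)
beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
```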

11 citations

Journal ArticleDOI
13 May 2022
TL;DR: Optimal transport (OT) methods seek a transformation map (or plan) between two probability measures such that the transformation has the minimum transportation cost; this minimum cost, with a certain power transform, is called the Wasserstein distance.
Abstract: Optimal transport (OT) methods seek a transformation map (or plan) between two probability measures, such that the transformation has the minimum transportation cost. Such a minimum transport cost, with a certain power transform, is called the Wasserstein distance. Recently, OT methods have drawn great attention in statistics, machine learning, and computer science, especially in deep generative neural networks. Despite its broad applications, the estimation of high-dimensional Wasserstein distances is a well-known challenging problem owing to the curse of dimensionality. There are some cutting-edge projection-based techniques that tackle high-dimensional OT problems. Three major families of such techniques are introduced: the slicing approach, the iterative projection approach, and the projection-robust OT approach. Open challenges are discussed at the end of the review.
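As a concrete illustration of the slicing approach mentioned above, the following sketch estimates the sliced Wasserstein distance: each random direction reduces the d-dimensional problem to one dimension, where W_p between equal-size samples has a closed form via matched order statistics. The Monte Carlo details and parameter choices are illustrative assumptions, not prescriptions from the review.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=200, p=2, rng=None):
    """Monte Carlo estimate of the sliced Wasserstein-p distance between
    two empirical measures with equal sample sizes.

    Each random direction theta reduces the d-dimensional OT problem to 1-D,
    where W_p is computed by matching sorted projections (order statistics)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)          # uniform direction on the sphere
        xp, yp = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean(np.abs(xp - yp) ** p)  # 1-D W_p^p via sorted samples
    return (total / n_proj) ** (1.0 / p)

# toy usage: two Gaussian samples in d = 10
rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 10))
Y = rng.standard_normal((1_000, 10)) + 0.5
print(sliced_wasserstein(X, Y, rng=1))
```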

10 citations

Journal ArticleDOI
TL;DR: In this paper, the authors developed an efficient algorithm that is adaptive to the unknown probability density function of the predictors, which has the same convergence rate as the full-basis estimator when $q$ is roughly of the order $O[n^{2d/\{(pr+1)(d+2)\}}]$, where $p\in[1,2]$ and $r\approx 4$ are constants that depend on the type of spline.
Abstract: Smoothing splines have been used pervasively in nonparametric regression. However, the computational burden of smoothing splines is significant when the sample size $n$ is large. When the number of predictors $d\geq2$, the computational cost for smoothing splines is of the order $O(n^3)$ using the standard approach. Many methods have been developed to approximate smoothing spline estimators by using $q$ basis functions instead of $n$, resulting in a computational cost of the order $O(nq^2)$. These methods are called basis selection methods. Despite their algorithmic benefits, most basis selection methods require the assumption that the sample is uniformly distributed on a hypercube, and they may perform poorly when this assumption is not met. To overcome this obstacle, we develop an efficient algorithm that is adaptive to the unknown probability density function of the predictors. Theoretically, we show the proposed estimator has the same convergence rate as the full-basis estimator when $q$ is roughly of the order $O[n^{2d/\{(pr+1)(d+2)\}}]$, where $p\in[1,2]$ and $r\approx 4$ are constants that depend on the type of spline. Numerical studies on various synthetic datasets demonstrate the superior performance of the proposed estimator in comparison with mainstream competitors.
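The following sketch illustrates the generic basis-selection idea described above, under stated assumptions: it picks the $q$ basis centers uniformly at random (whereas the paper's estimator adapts the selection to the predictors' density), and a Gaussian kernel stands in for the spline reproducing kernel. It is a sketch of the general technique, not the authors' algorithm.

```python
import numpy as np

def basis_selection_fit(X, y, q, lam=1e-3, bandwidth=1.0, rng=None):
    """Generic basis-selection sketch: represent the estimator with q kernel
    basis functions centered at a subset of the data and solve the reduced
    penalized least-squares problem at O(n q^2) cost."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = X[rng.choice(n, size=q, replace=False)]   # q of the n basis centers

    def K(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth**2))         # illustrative kernel choice

    Knq = K(X, centers)                 # n x q design in the reduced basis
    Kqq = K(centers, centers)           # q x q penalty block
    # minimize ||y - Knq c||^2 / n + lam * c' Kqq c
    c = np.linalg.solve(Knq.T @ Knq / n + lam * Kqq, Knq.T @ y / n)
    return lambda Xnew: K(Xnew, centers) @ c

# toy usage: n = 5,000 points, q = 50 basis functions
rng = np.random.default_rng(0)
X = rng.uniform(size=(5_000, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(5_000)
fhat = basis_selection_fit(X, y, q=50, rng=1)
print(fhat(X[:5]), y[:5])
```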

8 citations

References
Journal ArticleDOI
TL;DR: In this paper, two sampling plans are examined as alternatives to simple random sampling in Monte Carlo studies and are shown to be improvements over simple random sampling with respect to variance for a class of estimators which includes the sample mean and the empirical distribution function.
Abstract: Two types of sampling plans are examined as alternatives to simple random sampling in Monte Carlo studies. These plans are shown to be improvements over simple random sampling with respect to variance for a class of estimators which includes the sample mean and the empirical distribution function.
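A minimal sketch of basic Latin hypercube sampling, assuming points in $[0,1]^d$, may help make the variance-reduction claim concrete: each margin is split into $n$ equal strata, each stratum is hit exactly once, and the per-dimension permutations are independent. The toy comparison at the end is illustrative, not from the paper.

```python
import numpy as np

def latin_hypercube(n, d, rng=None):
    """Basic Latin hypercube sample of n points in [0, 1]^d: each margin is
    split into n equal strata and each stratum is hit exactly once."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(size=(n, d))                              # position within each stratum
    perms = np.column_stack([rng.permutation(n) for _ in range(d)])
    return (perms + u) / n

# illustration: estimate E[g(U)] for a function with a strong additive part;
# LHS typically has much smaller variance than simple random sampling here
g = lambda x: np.sum(np.sin(2 * np.pi * x), axis=1) + x.prod(axis=1)
rng = np.random.default_rng(0)
reps, n, d = 500, 64, 3
est_srs = [g(rng.uniform(size=(n, d))).mean() for _ in range(reps)]
est_lhs = [g(latin_hypercube(n, d, rng)).mean() for _ in range(reps)]
print(np.var(est_srs), np.var(est_lhs))
```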

8,328 citations

Journal ArticleDOI
TL;DR: In this paper, the asymptotic variance of a Latin hypercube estimate is obtained and the estimate is shown to be asymptotically normal; a method for producing Latin hypercube samples when the components of the input variables are statistically dependent is also described.
Abstract: Latin hypercube sampling (McKay, Conover, and Beckman 1979) is a method of sampling that can be used to produce input values for estimation of expectations of functions of output variables. The asymptotic variance of such an estimate is obtained. The estimate is also shown to be asymptotically normal. Asymptotically, the variance is less than that obtained using simple random sampling, with the degree of variance reduction depending on the degree of additivity in the function being integrated. A method for producing Latin hypercube samples when the components of the input variables are statistically dependent is also described. These techniques are applied to a simulation of the performance of a printer actuator.

1,750 citations

Journal ArticleDOI
TL;DR: In this article, the authors developed a scheme for the development and use of soil spectral libraries for rapid nondestructive estimation of soil properties based on analysis of diffuse reflectance spectroscopy, opening up new possibilities for modeling, assessment, and management of risk in soil evaluations for agricultural, environmental, and engineering applications.
Abstract: Methods for rapid estimation of soil properties are needed for quantitative assessments of land management problems. We developed a scheme for development and use of soil spectral libraries for rapid nondestructive estimation of soil properties based on analysis of diffuse reflectance spectroscopy. A diverse library of over 1000 archived topsoils from eastern and southern Africa was used to test the approach. Air-dried soils were scanned using a portable spectrometer (0.35-2.5 μm) with an artificial light source. Soil properties were calibrated to soil reflectance using multivariate adaptive regression splines (MARS), and screening tests were developed for various soil fertility constraints using classification trees. A random sample of one-third of the soils was withheld for validation purposes. Validation r^2 values for regressions were: exchangeable Ca, 0.88; effective cation-exchange capacity (ECEC), 0.88; exchangeable Mg, 0.81; organic C concentration, 0.80; clay content, 0.80; sand content, 0.76; and soil pH, 0.70. Validation likelihood ratios for diagnostic screening tests were: ECEC 4.1 mg kg^-1 d^-1, 2.9; extractable P <7 mg kg^-1, 2.9; exchangeable K <0.2 cmol_c kg^-1, 2.6. We show the response of prediction accuracy to sample size and demonstrate how the predictive value of spectral libraries can be iteratively increased through detection of spectral outliers among new samples. The spectral library approach opens up new possibilities for modeling, assessment and management of risk in soil evaluations in agricultural, environmental, and engineering applications. Further research should test the use of soil reflectance in pedotransfer functions for prediction of soil functional attributes.

936 citations

Journal ArticleDOI
Boxin Tang1
TL;DR: It is proved that when used for integration, the sampling scheme with OA-based Latin hypercubes offers a substantial improvement over Latin hypercube sampling.
Abstract: In this article, we use orthogonal arrays (OA's) to construct Latin hypercubes. Besides preserving the univariate stratification properties of Latin hypercubes, these strength r OA-based Latin hypercubes also stratify each r-dimensional margin. Therefore, such OA-based Latin hypercubes provide more suitable designs for computer experiments and numerical integration than do general Latin hypercubes. We prove that when used for integration, the sampling scheme with OA-based Latin hypercubes offers a substantial improvement over Latin hypercube sampling.
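To make the construction concrete, here is a hedged sketch assuming a prime number of levels $s$ and the classical strength-2 array over $\mathbb{Z}_s$; Tang's paper covers general orthogonal arrays, so this is one special case, not the full method. Within each column, each OA level is expanded into $s$ distinct fine strata in random order, giving a Latin hypercube whose two-dimensional margins are also stratified on the $s \times s$ grid.

```python
import numpy as np

def bush_oa(s, k):
    """Strength-2 orthogonal array OA(s^2, k, s, 2) for prime s and k <= s + 1,
    via the classical construction over Z_s."""
    assert k <= s + 1, "this construction needs k <= s + 1 columns"
    rows = [[(a * j + b) % s for j in range(k - 1)] + [a]
            for a in range(s) for b in range(s)]
    return np.array(rows)

def oa_based_lh(s, d, rng=None):
    """OA-based Latin hypercube of n = s^2 points in [0, 1]^d (d <= s + 1):
    each column is a permutation of {0, ..., n-1} (a valid Latin hypercube),
    and each 2-D margin inherits the s x s stratification of the OA."""
    rng = np.random.default_rng(rng)
    A = bush_oa(s, d)
    n = s * s
    X = np.empty((n, d))
    for j in range(d):
        col = np.empty(n)
        for level in range(s):
            pos = np.where(A[:, j] == level)[0]          # s positions per level
            col[pos] = level * s + rng.permutation(s)    # distinct fine strata
        X[:, j] = (col + rng.uniform(size=n)) / n
    return X

# toy usage: 49 points in [0, 1]^3 with all 7 x 7 bivariate strata covered
X = oa_based_lh(s=7, d=3, rng=0)
print(X.shape)
```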

768 citations

Journal ArticleDOI
TL;DR: The conditioned Latin hypercube sampling (cLHS) method is presented with a search algorithm based on heuristic rules combined with an annealing schedule, and is illustrated with a simple 3-D example and an application to digital soil mapping of part of the Hunter Valley of New South Wales, Australia.
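A minimal, hypothetical sketch of the conditioned-Latin-hypercube idea: choose n rows of an existing data matrix so that, in every dimension, the subsample spreads as evenly as possible over n quantile strata, searching by simulated annealing over swap moves. The published cLHS objective also includes correlation and categorical terms, which are omitted here.

```python
import numpy as np

def clhs(X, n, n_iter=20_000, t0=1.0, cooling=0.999, rng=None):
    """Select n rows of X whose marginal strata counts are as close as
    possible to one observation per quantile stratum in every dimension."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    edges = np.quantile(X, np.linspace(0, 1, n + 1), axis=0)   # n strata per variable
    # stratum index of every observation in every dimension
    strata = np.stack([np.clip(np.searchsorted(edges[1:-1, j], X[:, j]), 0, n - 1)
                       for j in range(d)], axis=1)

    def cost(idx):
        # sum over dimensions of |count per stratum - 1|
        return sum(np.abs(np.bincount(strata[idx, j], minlength=n) - 1).sum()
                   for j in range(d))

    idx = rng.choice(N, size=n, replace=False)
    rest = np.setdiff1d(np.arange(N), idx)
    c, t = cost(idx), t0
    for _ in range(n_iter):
        i, j = rng.integers(n), rng.integers(N - n)
        idx[i], rest[j] = rest[j], idx[i]          # propose a swap
        c_new = cost(idx)
        if c_new <= c or rng.uniform() < np.exp((c - c_new) / t):
            c = c_new                              # accept the move
        else:
            idx[i], rest[j] = rest[j], idx[i]      # undo the swap
        t *= cooling                               # annealing schedule
    return idx

# toy usage: select 20 design points from 10,000 correlated predictors
rng = np.random.default_rng(0)
Z = rng.standard_normal((10_000, 2))
X = np.column_stack([Z[:, 0], 0.8 * Z[:, 0] + 0.6 * Z[:, 1]])
print(sorted(clhs(X, n=20, rng=1)))
```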

744 citations