Posted Content

Optimal subsampling for quantile regression in big data

TL;DR: In this article, optimal subsampling for quantile regression is investigated; algorithms based on the optimal subsampling probabilities are proposed, and the asymptotic distributions and optimality of the resulting estimators are established.
Abstract: We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling probabilities. One version minimizes the trace of the asymptotic variance-covariance matrix for a linearly transformed parameter estimator and the other minimizes that of the original parameter estimator. The former does not depend on the densities of the responses given covariates and is easy to implement. Algorithms based on optimal subsampling probabilities are proposed and asymptotic distributions and asymptotic optimality of the resulting estimators are established. Furthermore, we propose an iterative subsampling procedure based on the optimal subsampling probabilities in the linearly transformed parameter estimation which has great scalability to utilize available computational resources. In addition, this procedure yields standard errors for parameter estimators without estimating the densities of the responses given the covariates. We provide numerical examples based on both simulated and real data to illustrate the proposed method.
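The optimal probabilities described above depend on unknown residual signs, so a common way to operationalize them is a pilot estimate from a small uniform subsample. The sketch below illustrates that two-step workflow with a density-free choice of probabilities of the form pi_i ∝ |tau - I(eps_i < 0)| * ||x_i||, in the spirit of the linearly transformed (L-optimality) criterion the abstract mentions. All names are illustrative, the linear-programming solver is only one of several ways to fit weighted quantile regression, and this is a sketch rather than the authors' reference implementation.

# Hypothetical sketch, not the authors' implementation: two-step subsampling
# for quantile regression with density-free probabilities
# pi_i ∝ |tau - I(eps_i < 0)| * ||x_i||.
import numpy as np
from scipy.optimize import linprog


def quantile_regression(X, y, tau, weights=None):
    """Weighted quantile regression via the standard LP formulation:
    minimize sum_i w_i * rho_tau(y_i - x_i' beta), splitting residuals into
    nonnegative parts u, v so that y_i - x_i' beta = u_i - v_i."""
    n, p = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    c = np.concatenate([np.zeros(p), tau * w, (1.0 - tau) * w])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])   # X beta + u - v = y
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]


def two_step_subsample_qr(X, y, tau, n_pilot, n_sub, seed=0):
    """Pilot fit on a uniform subsample, then a reweighted fit on a
    subsample drawn with residual-sign-based probabilities."""
    rng = np.random.default_rng(seed)
    N = len(y)
    idx0 = rng.choice(N, size=n_pilot, replace=True)          # step 1: pilot
    beta0 = quantile_regression(X[idx0], y[idx0], tau)
    eps = y - X @ beta0                                        # pilot residuals
    score = np.abs(tau - (eps < 0)) * np.linalg.norm(X, axis=1)
    pi = score / score.sum()                                   # subsampling probabilities
    idx = rng.choice(N, size=n_sub, replace=True, p=pi)        # step 2: sample and reweight
    return quantile_regression(X[idx], y[idx], tau, weights=1.0 / pi[idx])

In practice a sparse constraint matrix or a dedicated quantile-regression solver would replace the dense LP, and the subsample sizes n_pilot and n_sub are tuning choices; the structure above only mirrors the two-step idea.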
Citations
Journal ArticleDOI
TL;DR: An efficient subsampling method is developed for large-scale quantile regression via a Poisson sampling framework, which can solve the memory-constraint problem imposed by big data.

36 citations
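The Poisson-sampling framework mentioned in the summary above can be sketched generically: each observation is retained independently with probability min(1, n*pi_i) and carries an inverse-probability weight, so the data can be screened in a single pass without holding a size-n index in memory. The snippet below is an illustrative sketch under those assumptions, not the cited paper's algorithm.

# Generic Poisson-subsampling sketch (illustrative, not the cited algorithm):
# observation i is kept independently with probability min(1, n * pi_i) and
# carries an inverse-probability weight for downstream weighted estimation.
import numpy as np

def poisson_subsample(pi, n, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p_incl = np.minimum(1.0, n * np.asarray(pi, dtype=float))  # inclusion probabilities
    keep = rng.random(p_incl.shape[0]) < p_incl                # independent Bernoulli draws
    return np.flatnonzero(keep), 1.0 / p_incl[keep]            # kept indices and weights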

Journal ArticleDOI
TL;DR: Optimal subsampling methods have been investigated for logistic regression models, softmax regression models, generalized linear models, quantile regression models, and quasi-likelihood estimation.
Abstract: Subsampling is an effective way to deal with big data problems and many subsampling approaches have been proposed for different models, such as leverage sampling for linear regression models and local case-control sampling for logistic regression models. In this article, we focus on optimal subsampling methods, which draw samples according to optimal subsampling probabilities formulated by minimizing some function of the asymptotic distribution. The optimal subsampling methods have been investigated to include logistic regression models, softmax regression models, generalized linear models, quantile regression models, and quasi-likelihood estimation. Real data examples are provided to show how optimal subsampling methods are applied.

21 citations

Journal ArticleDOI
TL;DR: For massive survival data, a subsampling algorithm is proposed to efficiently approximate the estimates of regression parameters in the additive hazards model and establishes consistency and asymptotic normality of the subsample‐based estimator given the full data.
Abstract: For massive survival data, we propose a subsampling algorithm to efficiently approximate the estimates of regression parameters in the additive hazards model. We establish consistency and asymptotic normality of the subsample-based estimator given the full data. The optimal subsampling probabilities are obtained via minimizing asymptotic variance of the resulting estimator. The subsample-based procedure can largely reduce the computational cost compared with the full data method. In numerical simulations, our method has low bias and satisfactory coverage probabilities. We provide an illustrative example on the survival analysis of patients with lymphoma cancer from the Surveillance, Epidemiology, and End Results Program.

18 citations


Cites background from "Optimal subsampling for quantile re..."

  • ...The asymptotic mean squared error (AMSE) of the subsample estimator β̃ is equal to the trace of Σ, which is given by AMSE(β̃) = tr(Σ)...


Journal ArticleDOI
TL;DR: A distributed subdata selection method for the big data linear regression model is proposed, and a two-step subsampling strategy with optimal subsampling probabilities and optimal allocation sizes is developed, which effectively approximates the ordinary least squares estimator from the full data.

18 citations


Cites background from "Optimal subsampling for quantile re..."

  • ...Wang & Ma (2020) considered the optimal subsampling for quantile regression in big data. For the above-mentioned subsampling methods, one common assumption is that the data is stored in one location....


Journal ArticleDOI

9 citations


Cites result from "Optimal subsampling for quantile re..."

  • ...In the process of revising our manuscript, we have noticed that Wang & Ma (2020) and Ai et al. (2020a) obtain similar results on optimal subsampling for QR in the context of big data....


References
Book ChapterDOI
TL;DR: In this article, a new representation learning approach for domain adaptation is proposed, in which data at training and test time come from similar but different distributions; the approach promotes the emergence of features that are discriminative for the main learning task on the source domain and indiscriminate with respect to the shift between the domains.
Abstract: We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.

4,862 citations
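The gradient reversal layer described in this abstract acts as the identity in the forward pass and multiplies the gradient by a negative constant in the backward pass. The following is a minimal sketch of that idea; PyTorch is an assumed framework choice for illustration, since the abstract only states that the layer can be added to any feed-forward model trained with backpropagation.

# Minimal sketch of a gradient reversal layer: identity in the forward pass,
# gradient multiplied by -lam in the backward pass. PyTorch is an assumed
# framework choice for illustration.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                       # identity transformation

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None       # reverse (and scale) the gradient

def grad_reverse(x, lam=1.0):
    return GradientReversal.apply(x, lam)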

MonographDOI
01 Jan 2005

2,686 citations


"Optimal subsampling for quantile re..." refers background or methods in this paper

  • ...However, the interior point algorithm still needs polynomial time for optimization; its worst-case time complexity is $O(N^{5/2}p^{3})$, where N is the sample size and p is the dimension of the regression coefficient (Sec. 6.4.4 of Koenker, 2005)....


  • ...Whilst for linear median regression, under some conditions, the overall time complexity is $O(N^{1+a}p^{3}\log n)$, where $0 \le a \le 0.5$ (Theorem 6.3 of Koenker, 2005)....


  • ...For the first term on the right hand side of (S.6), following an approach similar to that in Section 4.2 of Koenker (2005) under the conditions in Assumption 1, we have $\frac{n}{N}\sum_{i=1}^{N} E(Z_{ni}^{2}) = \frac{n}{N}\sum_{i=1}^{N}\int_{0}^{v_i}\{F_{\varepsilon|X}(s, x_i)- F_{\varepsilon|X}(0, x_i)\}\,ds = \frac{\sqrt{n}}{N}\sum_{i=1}^{N}\int_{0}^{\lambda^{T}x_i}\{F_{\varepsilon|X}(t/\sqrt{n}, x_i)- F_{\varepsilon|X}(0, x_i)\}\,dt = \dots$...


  • ...As shown in Theorem 4.1 of Koenker (2005), under Assumption 1, the full data estimator $\hat\beta$ satisfies $\{\tau(1-\tau)D_N^{-1}D_{N0}D_N^{-1}\}^{-1/2}\sqrt{N}(\hat\beta - \beta_t) \longrightarrow N(0, I)$ in distribution, where $N(0, I)$ represents a multivariate standard normal distribution, and $\beta_t$ stands for the true value of $\beta$....


  • ...Here we adopt the set of regularity conditions used in Koenker (2005) and list them below as Assumption 1 for completeness....

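For readability, the limiting-distribution excerpt above (Theorem 4.1) can be written out in display form. The definitions of $D_{N0}$ and $D_N$ below are not given in the excerpts; they follow Koenker's standard quantile-regression notation and should be read as an assumption.

% Restatement of the excerpt; the definitions of D_{N0} and D_N are assumed
% from Koenker's standard notation, not quoted from the excerpts.
\[
  D_{N0} = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^{\top}, \qquad
  D_{N}  = \frac{1}{N}\sum_{i=1}^{N} f_{\varepsilon|X}(0, x_i)\, x_i x_i^{\top},
\]
\[
  \bigl\{\tau(1-\tau)\, D_N^{-1} D_{N0}\, D_N^{-1}\bigr\}^{-1/2} \sqrt{N}\,\bigl(\hat\beta - \beta_t\bigr)
  \;\longrightarrow\; N(0, I) \quad \text{in distribution.}
\]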

Book
01 Jan 2007
TL;DR: This book presents the theory and methods of optimum experimental design, making them available through the use of SAS programs, and stresses the importance of models in the analysis of data and introduces least squares fitting and simple optimum experimental designs.
Abstract: Experiments on patients, processes or plants all have random error, making statistical methods essential for their efficient design and analysis. This book presents the theory and methods of optimum experimental design, making them available through the use of SAS programs. Little previous statistical knowledge is assumed. The first part of the book stresses the importance of models in the analysis of data and introduces least squares fitting and simple optimum experimental designs. The second part presents a more detailed discussion of the general theory and of a wide variety of experiments. The book stresses the use of SAS to provide hands-on solutions for the construction of designs in both standard and non-standard situations. The mathematical theory of the designs is developed in parallel with their construction in SAS, so providing motivation for the development of the subject. Many chapters cover self-contained topics drawn from science, engineering and pharmaceutical investigations, such as response surface designs, blocking of experiments, designs for mixture experiments and for nonlinear and generalized linear models. Understanding is aided by the provision of "SAS tasks" after most chapters as well as by more traditional exercises and a fully supported website. The authors are leading experts in key fields and this book is ideal for statisticians and scientists in academia, research and the process and pharmaceutical industries.

1,076 citations


"Optimal subsampling for quantile re..." refers background or methods in this paper

  • ...This choice also has an optimality interpretation in terms of optimal experimental design; it is termed the L-optimality criterion, where “L” stands for “linear transformation” of the estimator (see Atkinson et al., 2007). Using this criterion we are able to obtain the explicit expression of optimal subsampling probabilities in the following theorem. Theorem 2 (L-optimality): If the sampling probabilities $\pi_i$, $i = 1, \ldots$...



  • ...sampling probabilities that minimize the asymptotic MSE of the subsampling estimator, that is, the $\pi_i$'s that minimize the trace of $n^{-1}D_N^{-1}V_\pi D_N^{-1}$. This is called the A-optimality criterion in optimal experimental design (see Atkinson et al., 2007). Theorem 3 (A-optimality): If the sampling probabilities $\pi_i$, $i = 1, \ldots, N$ are chosen as $\pi_i^{\mathrm{Aopt}} = \frac{|\tau - I(\varepsilon_i < 0)|\,\|D_N^{-1}x_i\|}{\sum_{j=1}^{N}|\tau - I(\varepsilon_j < 0)|\,\|D_N^{-1}x_j\|}, \; i = 1, 2, \ldots, N$, then the total asy...
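Read literally, the A-optimality excerpt above specifies probabilities proportional to $|\tau - I(\varepsilon_i < 0)|\,\|D_N^{-1}x_i\|$. The snippet below is a minimal, hypothetical illustration of that normalization; D_hat is an assumed plug-in estimate of $D_N$ (which involves the conditional densities that the L-optimal variant avoids), and none of the names come from the paper.

# Illustrative computation of A-optimal-style probabilities from the excerpt:
# pi_i ∝ |tau - I(eps_i < 0)| * ||D_N^{-1} x_i||.  D_hat is an assumed
# plug-in estimate of D_N; eps holds pilot residuals.
import numpy as np

def a_optimal_probabilities(X, eps, tau, D_hat):
    lever = np.linalg.norm(np.linalg.solve(D_hat, X.T), axis=0)  # ||D_hat^{-1} x_i||
    score = np.abs(tau - (eps < 0)) * lever
    return score / score.sum()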



Journal ArticleDOI
TL;DR: In this paper, the authors provide an overview of the salient features of Big Data, show how these features drive paradigm change in statistical and computational methods as well as computing architectures, and offer various new perspectives on Big Data analysis and computation.
Abstract: Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper gives an overview of the salient features of Big Data and how these features impact paradigm change in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in high-confidence sets and point out that exogenous assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity, which can lead to wrong statistical inferences and consequently wrong scientific conclusions.

897 citations


"Optimal subsampling for quantile re..." refers background in this paper

  • ...In big data problems, because data are often collected from different sources with different times and locations, the homoscedasticity assumption is often not valid (Fan et al., 2014), which makes quantile regression a natural candidate as an analysis tool....

