Posted Content

Optimal subsampling for quantile regression in big data

TL;DR: In this article, optimal subsampling for quantile regression is investigated; algorithms based on the optimal subsampling probabilities are proposed, and the asymptotic distributions and optimality of the resulting estimators are established.
Abstract: We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling probabilities. One version minimizes the trace of the asymptotic variance-covariance matrix for a linearly transformed parameter estimator and the other minimizes that of the original parameter estimator. The former does not depend on the densities of the responses given covariates and is easy to implement. Algorithms based on optimal subsampling probabilities are proposed and asymptotic distributions and asymptotic optimality of the resulting estimators are established. Furthermore, we propose an iterative subsampling procedure based on the optimal subsampling probabilities in the linearly transformed parameter estimation which has great scalability to utilize available computational resources. In addition, this procedure yields standard errors for parameter estimators without estimating the densities of the responses given the covariates. We provide numerical examples based on both simulated and real data to illustrate the proposed method.
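The optimal probabilities described above depend on unknown residual signs, so a common way to operationalize them is a pilot estimate from a small uniform subsample. The sketch below illustrates that two-step workflow with a density-free choice of probabilities of the form pi_i ∝ |tau - I(eps_i < 0)| * ||x_i||, in the spirit of the linearly transformed (L-optimality) criterion the abstract mentions. All names are illustrative, the linear-programming solver is only one of several ways to fit weighted quantile regression, and this is a sketch rather than the authors' reference implementation.

# Hypothetical sketch, not the authors' implementation: two-step subsampling
# for quantile regression with density-free probabilities
# pi_i ∝ |tau - I(eps_i < 0)| * ||x_i||.
import numpy as np
from scipy.optimize import linprog


def quantile_regression(X, y, tau, weights=None):
    """Weighted quantile regression via the standard LP formulation:
    minimize sum_i w_i * rho_tau(y_i - x_i' beta), splitting residuals into
    nonnegative parts u, v so that y_i - x_i' beta = u_i - v_i."""
    n, p = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    c = np.concatenate([np.zeros(p), tau * w, (1.0 - tau) * w])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])   # X beta + u - v = y
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]


def two_step_subsample_qr(X, y, tau, n_pilot, n_sub, seed=0):
    """Pilot fit on a uniform subsample, then a reweighted fit on a
    subsample drawn with residual-sign-based probabilities."""
    rng = np.random.default_rng(seed)
    N = len(y)
    idx0 = rng.choice(N, size=n_pilot, replace=True)          # step 1: pilot
    beta0 = quantile_regression(X[idx0], y[idx0], tau)
    eps = y - X @ beta0                                        # pilot residuals
    score = np.abs(tau - (eps < 0)) * np.linalg.norm(X, axis=1)
    pi = score / score.sum()                                   # subsampling probabilities
    idx = rng.choice(N, size=n_sub, replace=True, p=pi)        # step 2: sample and reweight
    return quantile_regression(X[idx], y[idx], tau, weights=1.0 / pi[idx])

In practice a sparse constraint matrix or a dedicated quantile-regression solver would replace the dense LP, and the subsample sizes n_pilot and n_sub are tuning choices; the structure above only mirrors the two-step idea.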
Citations
Journal ArticleDOI
TL;DR: An efficient subsampling method is developed for large-scale quantile regression via a Poisson sampling framework, which can solve the memory-constraint problem imposed by big data.

36 citations
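The Poisson-sampling framework mentioned in the summary above can be sketched generically: each observation is retained independently with probability min(1, n*pi_i) and carries an inverse-probability weight, so the data can be screened in a single pass without holding a size-n index in memory. The snippet below is an illustrative sketch under those assumptions, not the cited paper's algorithm.

# Generic Poisson-subsampling sketch (illustrative, not the cited algorithm):
# observation i is kept independently with probability min(1, n * pi_i) and
# carries an inverse-probability weight for downstream weighted estimation.
import numpy as np

def poisson_subsample(pi, n, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p_incl = np.minimum(1.0, n * np.asarray(pi, dtype=float))  # inclusion probabilities
    keep = rng.random(p_incl.shape[0]) < p_incl                # independent Bernoulli draws
    return np.flatnonzero(keep), 1.0 / p_incl[keep]            # kept indices and weights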

Journal ArticleDOI
TL;DR: Optimal subsampling methods have been investigated for logistic regression models, softmax regression models, generalized linear models, quantile regression models, and quasi-likelihood estimation.
Abstract: Subsampling is an effective way to deal with big data problems and many subsampling approaches have been proposed for different models, such as leverage sampling for linear regression models and local case-control sampling for logistic regression models. In this article, we focus on optimal subsampling methods, which draw samples according to optimal subsampling probabilities formulated by minimizing some function of the asymptotic distribution. The optimal subsampling methods have been investigated to include logistic regression models, softmax regression models, generalized linear models, quantile regression models, and quasi-likelihood estimation. Real data examples are provided to show how optimal subsampling methods are applied.

21 citations

Journal ArticleDOI
TL;DR: For massive survival data, a subsampling algorithm is proposed to efficiently approximate the estimates of regression parameters in the additive hazards model and establishes consistency and asymptotic normality of the subsample‐based estimator given the full data.
Abstract: For massive survival data, we propose a subsampling algorithm to efficiently approximate the estimates of regression parameters in the additive hazards model. We establish consistency and asymptotic normality of the subsample-based estimator given the full data. The optimal subsampling probabilities are obtained via minimizing asymptotic variance of the resulting estimator. The subsample-based procedure can largely reduce the computational cost compared with the full data method. In numerical simulations, our method has low bias and satisfactory coverage probabilities. We provide an illustrative example on the survival analysis of patients with lymphoma cancer from the Surveillance, Epidemiology, and End Results Program.

18 citations


Cites background from "Optimal subsampling for quantile re..."

  • ...The asymptotic mean squared error (AMSE) of the subsample estimator β̃ is equal to the trace of Σ, which is given by AMSE(β̃) = tr(Σ)...


Journal ArticleDOI
TL;DR: A distributed subdata selection method for the big data linear regression model is proposed, and a two-step subsampling strategy with optimal subsampling probabilities and optimal allocation sizes is developed, which effectively approximates the ordinary least squares estimator from the full data.

18 citations


Cites background from "Optimal subsampling for quantile re..."

  • ...Wang & Ma (2020) considered the optimal subsampling for quantile regression in big data. For the above-mentioned subsampling methods, one common assumption is that the data is stored in one location....


Journal ArticleDOI

9 citations


Cites result from "Optimal subsampling for quantile re..."

  • ...In the process of revising our manuscript, we have noticed that Wang & Ma (2020) and Ai et al. (2020a) obtain similar results on optimal subsampling for QR in the context of big data....


References
Book ChapterDOI
TL;DR: In this article, a new representation learning approach for domain adaptation is proposed, in which data at training and test time come from similar but different distributions; the approach promotes the emergence of features that are discriminative for the main learning task on the source domain and indiscriminate with respect to the shift between the domains.
Abstract: We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for descriptor learning task in the context of person re-identification application.

4,862 citations
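The gradient reversal layer described in this abstract acts as the identity in the forward pass and multiplies the gradient by a negative constant in the backward pass. The following is a minimal sketch of that idea; PyTorch is an assumed framework choice for illustration, since the abstract only states that the layer can be added to any feed-forward model trained with backpropagation.

# Minimal sketch of a gradient reversal layer: identity in the forward pass,
# gradient multiplied by -lam in the backward pass. PyTorch is an assumed
# framework choice for illustration.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                       # identity transformation

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None       # reverse (and scale) the gradient

def grad_reverse(x, lam=1.0):
    return GradientReversal.apply(x, lam)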

MonographDOI
01 Jan 2005

2,686 citations


"Optimal subsampling for quantile re..." refers background or methods in this paper

  • ...However, the interior point algorithm still needs polynomial time for optimization; its worst-case time complexity is $O(N^{5/2}p^{3})$, where N is the sample size and p is the dimension of the regression coefficient (Sec. 6.4.4 of Koenker, 2005)....


  • ...Whilst for linear median regression, under some conditions, the overall time complexity is $O(N^{1+a}p^{3}\log n)$, where $0 \le a \le 0.5$ (Theorem 6.3 of Koenker, 2005)....


  • ...For the first term on the right hand side of (S.6), following an approach similar to that in Section 4.2 of Koenker (2005) under the conditions in Assumption 1, we have $\frac{n}{N}\sum_{i=1}^{N} E(Z_{ni}^{2}) = \frac{n}{N}\sum_{i=1}^{N}\int_{0}^{v_i}\{F_{\varepsilon|X}(s, x_i)- F_{\varepsilon|X}(0, x_i)\}\,ds = \frac{\sqrt{n}}{N}\sum_{i=1}^{N}\int_{0}^{\lambda^{T}x_i}\{F_{\varepsilon|X}(t/\sqrt{n}, x_i)- F_{\varepsilon|X}(0, x_i)\}\,dt = \dots$...


  • ...As shown in Theorem 4.1 of Koenker (2005), under Assumption 1, the full data estimator $\hat\beta$ satisfies $\{\tau(1-\tau)D_N^{-1}D_{N0}D_N^{-1}\}^{-1/2}\sqrt{N}(\hat\beta - \beta_t) \longrightarrow N(0, I)$ in distribution, where $N(0, I)$ represents a multivariate standard normal distribution, and $\beta_t$ stands for the true value of $\beta$....


  • ...Here we adopt the set of regularity conditions used in Koenker (2005) and list them below as Assumption 1 for completeness....

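For readability, the limiting-distribution excerpt above (Theorem 4.1) can be written out in display form. The definitions of $D_{N0}$ and $D_N$ below are not given in the excerpts; they follow Koenker's standard quantile-regression notation and should be read as an assumption.

% Restatement of the excerpt; the definitions of D_{N0} and D_N are assumed
% from Koenker's standard notation, not quoted from the excerpts.
\[
  D_{N0} = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^{\top}, \qquad
  D_{N}  = \frac{1}{N}\sum_{i=1}^{N} f_{\varepsilon|X}(0, x_i)\, x_i x_i^{\top},
\]
\[
  \bigl\{\tau(1-\tau)\, D_N^{-1} D_{N0}\, D_N^{-1}\bigr\}^{-1/2} \sqrt{N}\,\bigl(\hat\beta - \beta_t\bigr)
  \;\longrightarrow\; N(0, I) \quad \text{in distribution.}
\]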

Book
01 Jan 2007
TL;DR: This book presents the theory and methods of optimum experimental design, making them available through the use of SAS programs, and stresses the importance of models in the analysis of data and introduces least squares fitting and simple optimum experimental designs.
Abstract: Experiments on patients, processes or plants all have random error, making statistical methods essential for their efficient design and analysis. This book presents the theory and methods of optimum experimental design, making them available through the use of SAS programs. Little previous statistical knowledge is assumed. The first part of the book stresses the importance of models in the analysis of data and introduces least squares fitting and simple optimum experimental designs. The second part presents a more detailed discussion of the general theory and of a wide variety of experiments. The book stresses the use of SAS to provide hands-on solutions for the construction of designs in both standard and non-standard situations. The mathematical theory of the designs is developed in parallel with their construction in SAS, so providing motivation for the development of the subject. Many chapters cover self-contained topics drawn from science, engineering and pharmaceutical investigations, such as response surface designs, blocking of experiments, designs for mixture experiments and for nonlinear and generalized linear models. Understanding is aided by the provision of "SAS tasks" after most chapters as well as by more traditional exercises and a fully supported website. The authors are leading experts in key fields and this book is ideal for statisticians and scientists in academia, research and the process and pharmaceutical industries.

1,076 citations


"Optimal subsampling for quantile re..." refers background or methods in this paper

  • ...This choice also has an optimality interpretation in terms of optimal experimental design; it is termed the L-optimality criterion, where “L” stands for “linear transformation” of the estimator (see Atkinson et al., 2007). Using this criterion we are able to obtain the explicit expression of optimal subsampling probabilities in the following theorem. Theorem 2 (L-optimality): If the sampling probabilities $\pi_i$, $i = 1, \ldots$...



  • ...sampling probabilities that minimize the asymptotic MSE of the subsampling estimator, that is, the $\pi_i$'s that minimize the trace of $n^{-1}D_N^{-1}V_\pi D_N^{-1}$. This is called the A-optimality criterion in optimal experimental design (see Atkinson et al., 2007). Theorem 3 (A-optimality): If the sampling probabilities $\pi_i$, $i = 1, \ldots, N$ are chosen as $\pi_i^{\mathrm{Aopt}} = \frac{|\tau - I(\varepsilon_i < 0)|\,\|D_N^{-1}x_i\|}{\sum_{j=1}^{N}|\tau - I(\varepsilon_j < 0)|\,\|D_N^{-1}x_j\|}, \; i = 1, 2, \ldots, N$, then the total asy...
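Read literally, the A-optimality excerpt above specifies probabilities proportional to $|\tau - I(\varepsilon_i < 0)|\,\|D_N^{-1}x_i\|$. The snippet below is a minimal, hypothetical illustration of that normalization; D_hat is an assumed plug-in estimate of $D_N$ (which involves the conditional densities that the L-optimal variant avoids), and none of the names come from the paper.

# Illustrative computation of A-optimal-style probabilities from the excerpt:
# pi_i ∝ |tau - I(eps_i < 0)| * ||D_N^{-1} x_i||.  D_hat is an assumed
# plug-in estimate of D_N; eps holds pilot residuals.
import numpy as np

def a_optimal_probabilities(X, eps, tau, D_hat):
    lever = np.linalg.norm(np.linalg.solve(D_hat, X.T), axis=0)  # ||D_hat^{-1} x_i||
    score = np.abs(tau - (eps < 0)) * lever
    return score / score.sum()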



Journal ArticleDOI
TL;DR: In this paper, the authors provide an overview of the salient features of Big Data, show how these features drive paradigm change in statistical and computational methods as well as computing architectures, and offer various new perspectives on Big Data analysis and computation.
Abstract: Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper gives an overview of the salient features of Big Data and how these features impact paradigm change in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in high-confidence sets and point out that exogenous assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity, which can lead to wrong statistical inferences and consequently wrong scientific conclusions.

897 citations


"Optimal subsampling for quantile re..." refers background in this paper

  • ...In big data problems, because data are often collected from different sources with different times and locations, the homoscedasticity assumption is often not valid (Fan et al., 2014), which makes quantile regression a natural candidate as an analysis tool....

