SciSpace
Open access · Journal Article · DOI: 10.1093/BIOMET/ASAA043

Optimal subsampling for quantile regression in big data

02 Mar 2021 · Biometrika (Oxford University Press) · Vol. 108, Iss. 1, pp. 99-112
Abstract: We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling probabilities. One version minimizes the trace of the asymptotic variance-covariance matrix for a linearly transformed parameter estimator and the other minimizes that of the original parameter estimator. The former does not depend on the densities of the responses given covariates and is easy to implement. Algorithms based on optimal subsampling probabilities are proposed, and asymptotic distributions and asymptotic optimality of the resulting estimators are established. Furthermore, we propose an iterative subsampling procedure based on the optimal subsampling probabilities for the linearly transformed parameter estimator, which scales well with available computational resources. In addition, this procedure yields standard errors for parameter estimators without estimating the densities of the responses given the covariates. We provide numerical examples based on both simulated and real data to illustrate the proposed method.


Topics: Estimator (57%), Asymptotic distribution (56%), Estimation theory (53%)
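The pilot-then-reweight idea behind such subsampling schemes can be sketched as follows. This is an illustration, not the paper's exact algorithm: the specific probability form pi_i proportional to |tau - 1{y_i <= x_i'beta}| * ||x_i|| (a density-free, L-optimality-style choice), the crude subgradient solver, and all sample sizes are assumptions made for the sketch.

```python
import numpy as np

def fit_qr(X, y, tau, w=None, lr=0.5, iters=600):
    """Crude subgradient descent for (weighted) quantile regression."""
    n, d = X.shape
    w = np.ones(n) if w is None else w
    beta = np.zeros(d)
    for t in range(1, iters + 1):
        r = y - X @ beta
        # subgradient of sum_i w_i * rho_tau(r_i), normalized by total weight
        g = -X.T @ (w * (tau - (r < 0))) / w.sum()
        beta -= lr / np.sqrt(t) * g
    return beta

rng = np.random.default_rng(0)
n, tau = 20_000, 0.75
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(size=n)

# Step 1: pilot estimate from a small uniform subsample.
i0 = rng.choice(n, 500, replace=False)
beta_pilot = fit_qr(X[i0], y[i0], tau)

# Step 2: subsampling probabilities pi_i ~ |tau - 1{y_i <= x_i'beta}| * ||x_i||
# (the density-free version assumed here for illustration).
r = y - X @ beta_pilot
pi = np.abs(tau - (r <= 0)) * np.linalg.norm(X, axis=1)
pi /= pi.sum()

# Step 3: inverse-probability-weighted fit on the optimal subsample.
idx = rng.choice(n, 2_000, replace=True, p=pi)
beta_sub = fit_qr(X[idx], y[idx], tau, w=1.0 / pi[idx])
```

The slope estimates recover the true slopes; the intercept estimate targets the conditional tau-quantile, i.e. it shifts by the standard-normal 0.75-quantile (about 0.67) relative to the mean intercept.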

13 results found

Open access · Journal Article · DOI: 10.1080/10618600.2020.1844215
Cheng Meng, Rui Xie, Abhyuday Mandal, Xinlian Zhang, +2 more (4 institutions)
Abstract: We consider a measurement-constrained supervised learning problem, that is, (i) the full sample of predictors is given; (ii) the response observations are unavailable and expensive to measure. Thu...


5 Citations

Journal Article · DOI: 10.1016/J.CSDA.2020.107072
Haixiang Zhang, HaiYing Wang (2 institutions)
Abstract: With the development of modern technologies, it is possible to gather an extraordinarily large number of observations. Due to the storage or transmission burden, big data are usually scattered across multiple locations, and it is difficult to transfer all of the data to a central server for analysis. A distributed subdata selection method for the big data linear regression model is proposed. In particular, a two-step subsampling strategy with optimal subsampling probabilities and optimal allocation sizes is developed. The subsample-based estimator effectively approximates the ordinary least squares estimator from the full data. The convergence rate and asymptotic normality of the proposed estimator are established. Simulation studies and an illustrative example about airline data are provided to assess the performance of the proposed method.


Topics: Ordinary least squares (56%), Estimator (55%), Regression diagnostic (51%)

5 Citations
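On a single machine, the subsample-then-reweight step behind such methods can be sketched with a closed-form weighted least squares fit. The two-step distributed allocation is omitted, and the residual-times-norm form of the probabilities is an A-optimality-style assumption made for illustration, not the paper's exact formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ beta_true + rng.normal(size=n)

# Step 1: pilot OLS on a small uniform subsample.
i0 = rng.choice(n, 1_000, replace=False)
beta_pilot, *_ = np.linalg.lstsq(X[i0], y[i0], rcond=None)

# Step 2: probabilities proportional to |pilot residual| * ||x_i||
# (an optimality-motivated choice assumed here for illustration).
e = np.abs(y - X @ beta_pilot)
pi = e * np.linalg.norm(X, axis=1)
pi /= pi.sum()

# Step 3: inverse-probability-weighted least squares on the subsample;
# the sqrt of each weight 1/pi_i is folded into the design and response.
idx = rng.choice(n, 3_000, replace=True, p=pi)
sw = 1.0 / np.sqrt(pi[idx])
beta_sub, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)
```

The reweighting makes the subsample estimator consistent for the full-data OLS target even though informative points are oversampled.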

Open access · Journal Article · DOI: 10.1016/J.FMRE.2021.08.012
Shanshan Song, Yuanyuan Lin, Yong Zhou (2 institutions)
01 Sep 2021
Abstract: In this paper, we study the large-scale inference for a linear expectile regression model. To mitigate the computational challenges in the classical asymmetric least squares (ALS) estimation under massive data, we propose a communication-efficient divide and conquer algorithm to combine the information from sub-machines through confidence distributions. The resulting pooled estimator has a closed-form expression, and its consistency and asymptotic normality are established under mild conditions. Moreover, we derive the Bahadur representation of the ALS estimator, which serves as an important tool to study the relationship between the number of sub-machines K and the sample size. Numerical studies including both synthetic and real data examples are presented to illustrate the finite-sample performance of our method and support the theoretical results.


Topics: Estimator (55%), Regression analysis (52%), Asymptotic distribution (50%)

2 Citations
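The divide-and-conquer scheme can be sketched as follows. The asymmetric least squares (expectile) fit via iterative reweighting is standard, but the pooling step here is plain averaging across sub-machines, a simplification standing in for the paper's confidence-distribution combination; all sizes and the value of tau are illustrative.

```python
import numpy as np

def als_expectile(X, y, tau, iters=30):
    """Asymmetric least squares (expectile regression) via iterative reweighting."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # OLS starting value
    for _ in range(iters):
        # weight tau on positive residuals, 1 - tau on negative ones
        w = np.where(y - X @ beta >= 0, tau, 1 - tau)
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

rng = np.random.default_rng(2)
n, K, tau = 40_000, 8, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Divide: fit ALS independently on each of K sub-machines.
# Conquer: average the K estimates (simple averaging as a stand-in
# for the confidence-distribution pooling used in the paper).
parts = np.array_split(np.arange(n), K)
beta_pool = np.mean([als_expectile(X[p], y[p], tau) for p in parts], axis=0)
```

Since each sub-machine only communicates a fixed-length coefficient vector, the combination step is communication-efficient regardless of n.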

Journal Article · DOI: 10.1007/S13198-021-01220-W
Jiadi Yang, Jinjin Wang (1 institution)
Abstract: The purpose is to study how to innovate TV programme production and broadcasting against the background of big data. Shot boundary detection is adopted to retrieve the content of TV programme video, covering both the decompressed domain and the compressed domain. For the decompressed domain, a new abrupt-shot-change detection algorithm is adopted to analyse the whole search process over decompressed-domain shot boundaries; for the compressed domain, a video shot boundary detection algorithm operating on the H.264/AVC code stream is used. Experimental results show that the shot detection algorithm detects not only abrupt shot changes but also gradual ones. In the experiments, the combined detector achieves 94% recall and 93.2% accuracy across the test video sequences; the abrupt-shot-change detector reaches 94.5% recall and 97.6% accuracy on the experimental data, outperforming existing abrupt shot detection methods and showing practical application value. A fast similar-video retrieval algorithm is also compared with the MinHash and LSH (locality-sensitive hashing) algorithms: it clusters similar videos faster and retrieves them effectively, completing fast retrieval over large-scale video data. Applying the new abrupt-shot-change detection algorithm and shot boundary detection to TV programmes largely optimizes the management of TV advertising and the manual scheduling of broadcasts, saving manpower and broadcast costs, which constitutes a reform of traditional TV programming. Future research may further optimize the boundary detection technology to better present high-quality TV pictures.


Topics: MinHash (50%)

1 Citation
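The MinHash baseline mentioned in the abstract estimates Jaccard similarity between token sets (e.g. shingled video fingerprints) from short signatures. A minimal sketch, in which the universal hash family, the integer token IDs, and the signature length are all illustrative choices:

```python
import random

P = 2_147_483_647  # a Mersenne prime for the universal hash family h(x) = (a*x + b) mod P

def make_hashes(k, seed=0):
    """Draw k random (a, b) pairs defining the hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def minhash(tokens, hashes):
    # Signature: the minimum of each random hash over the token set.
    return [min((a * t + b) % P for t in tokens) for a, b in hashes]

def est_jaccard(sig1, sig2):
    # The fraction of matching signature slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

hashes = make_hashes(256)
A = set(range(0, 120))      # hypothetical fingerprint token IDs of video A
B = set(range(40, 160))     # video B shares tokens 40..119 with A
true_j = len(A & B) / len(A | B)    # 80 / 160 = 0.5
est_j = est_jaccard(minhash(A, hashes), minhash(B, hashes))
```

Comparing fixed-length signatures instead of full token sets is what makes large-scale candidate screening cheap; LSH then buckets the signatures so that only likely matches are compared at all.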


24 results found

Open access · Journal Article
01 Jan 2014 · MSOR Connections
Abstract: Copyright (©) 1999–2012 R Foundation for Statistical Computing. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the R Core Team.


Topics: R Programming Language (78%)

229,202 Citations

Open access · Book Chapter · DOI: 10.1007/978-3-319-58347-1_10
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, +4 more (3 institutions)
Abstract: We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.


Topics: Semi-supervised learning (60%), Domain (software engineering) (58%), Feature learning (56%)

4,760 Citations
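The gradient reversal layer at the heart of this approach has a very small contract: identity in the forward pass, sign-flipped (and scaled) gradient in the backward pass. A minimal stand-alone sketch of that contract, in which the class name, the scale lambda, and the toy gradients are illustrative:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; negates and scales gradients in backward."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                        # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # flip the domain-classifier gradient

grl = GradientReversal(lam=0.5)
features = np.array([0.3, -1.2, 0.7])

# During backprop, the feature extractor receives
#     grad_task - lam * grad_domain,
# so gradient descent on the task loss simultaneously *ascends* the domain
# loss, pushing features toward domain-indiscriminate representations.
grad_task = np.array([0.1, 0.2, -0.1])
grad_domain = np.array([0.4, -0.2, 0.0])
grad_features = grad_task + grl.backward(grad_domain)
```

Because the layer is a no-op at inference time, it adds nothing to the deployed model; its only effect is on the training signal.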

Monograph · DOI: 10.1017/CBO9780511754098
01 Jan 2005

2,321 Citations

Open access · Book
01 Jan 2007
Abstract: Experiments on patients, processes or plants all have random error, making statistical methods essential for their efficient design and analysis. This book presents the theory and methods of optimum experimental design, making them available through the use of SAS programs. Little previous statistical knowledge is assumed. The first part of the book stresses the importance of models in the analysis of data and introduces least squares fitting and simple optimum experimental designs. The second part presents a more detailed discussion of the general theory and of a wide variety of experiments. The book stresses the use of SAS to provide hands-on solutions for the construction of designs in both standard and non-standard situations. The mathematical theory of the designs is developed in parallel with their construction in SAS, so providing motivation for the development of the subject. Many chapters cover self-contained topics drawn from science, engineering and pharmaceutical investigations, such as response surface designs, blocking of experiments, designs for mixture experiments and for nonlinear and generalized linear models. Understanding is aided by the provision of "SAS tasks" after most chapters as well as by more traditional exercises and a fully supported website. The authors are leading experts in key fields and this book is ideal for statisticians and scientists in academia, research and the process and pharmaceutical industries.


Topics: Mathematical theory (51%)

1,017 Citations

Open access · Journal Article · DOI: 10.1093/NSR/NWT032
Jianqing Fan, Fang Han, Han Liu (2 institutions)
Abstract: Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. These challenges are distinctive and require a new computational and statistical paradigm. This paper gives an overview of the salient features of Big Data and how these features drive changes in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in high-confidence sets and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity; they can lead to wrong statistical inferences and consequently wrong scientific conclusions.


Topics: Big data (56%), Population (51%)

704 Citations
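The spurious-correlation challenge this survey highlights is easy to reproduce in a tiny simulation (the dimensions and seed here are arbitrary illustrative choices): when the number of pure-noise features far exceeds the number of observations, the best-correlated feature looks impressively predictive purely by chance.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5_000                 # few observations, many features
y = rng.normal(size=n)            # target, independent of every feature
X = rng.normal(size=(n, p))       # pure noise features

# Sample correlation of each feature with y; every true correlation is zero.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n

# The maximum absolute correlation grows roughly like sqrt(2 * log(p) / n),
# here about 0.4 despite full independence.
max_abs_corr = np.abs(corr).max()
```

Any selection procedure that screens features by marginal correlation must therefore account for this maximum-over-p inflation, or it will routinely report noise as signal.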

[Chart: number of citations received by the paper in previous years]