Optimal Subsampling Methods for Quantile Regression Model

Draw subsample from full dataset and fit quantile regression model. For a quick start, refer to the vignette.

Usage

ssp.quantreg(
  formula,
  data,
  subset = NULL,
  tau = 0.5,
  n.plt,
  n.ssp,
  B = 5,
  boot = TRUE,
  criterion = "optL",
  sampling.method = "withReplacement",
  likelihood = c("weighted"),
  control = list(...),
  contrasts = NULL,
  ...
)

Arguments

formula

A model formula object of class "formula" that describes the model to be fitted.

data

A data frame containing the variables in the model. Denote \(N\) as the number of observations in data.

subset

An optional vector specifying a subset of observations from data to use for the analysis. This subset will be viewed as the full data.

tau

The interested quantile.

n.plt

The pilot subsample size (first-step subsample size). This subsample is used to compute the pilot estimator and estimate the optimal subsampling probabilities.

n.ssp

The expected size of the optimal subsample (second-step subsample). For sampling.method = 'withReplacement', The exact subsample size is n.ssp. For sampling.method = 'poisson', n.ssp is the expected subsample size.

B

The number of subsamples for the iterative sampling algorithm. Each subsample contains n.ssp observations. This allows us to estimate the covariance matrix.

boot

If TRUE then perform iterative sampling algorithm and estimate the covariance matrix. If FALSE then only one subsample with size B*n.ssp is returned.

criterion

It determines how subsampling probabilities are computed. Choices include optL(default) and uniform.

optL Minimizes the trace of a transformation of the asymptotic covariance matrix of the subsample estimator.
uniform Assigns equal subsampling probability \(\frac{1}{N}\) to each observation, serving as a baseline subsampling strategy.

sampling.method

The sampling method for drawing the optimal subsample. Choices include withReplacement and poisson(default). withReplacement draws exactly n.ssp subsamples from size \(N\) full dataset with replacement, using the specified subsampling probabilities. poisson draws observations independently by comparing each subsampling probability with a realization of uniform random variable \(U(0,1)\).

likelihood

The type of the maximum likelihood function used to calculate the optimal subsampling estimator. Currently weighted is implemented which applies a weighted likelihood function where each observation is weighted by the inverse of its subsampling probability.

control

The argument control contains two tuning parameters alpha and b.

alpha \(\in [0,1]\) is the mixture weight of the user-assigned subsampling probability and uniform subsampling probability. The actual subsample probability is \(\pi = (1-\alpha)\pi^{opt} + \alpha \pi^{uni}\). This protects the estimator from extreme small subsampling probability. The default value is 0.
b is a positive number which is used to constaint the poisson subsampling probability. b close to 0 results in subsampling probabilities closer to uniform probability \(\frac{1}{N}\). b=2 is the default value. See relevant references for further details.

contrasts

An optional list. It specifies how categorical variables are represented in the design matrix. For example, contrasts = list(v1 = 'contr.treatment', v2 = 'contr.sum').

...

A list of parameters which will be passed to quantreg::rq().

Value

ssp.quantreg returns an object of class "ssp.quantreg" containing the following components (some are optional):

model.call: The original function call.
coef.plt: The pilot estimator. See Details for more information.
coef: The estimator obtained from the optimal subsample.
cov: The covariance matrix of coef
index.plt: Row indices of pilot subsample in the full dataset.
index.ssp: Row indices of of optimal subsample in the full dataset.
N: The number of observations in the full dataset.
subsample.size.expect: The expected subsample size
terms: The terms object for the fitted model.

Details

Most of the arguments and returned variables have the same meaning with ssp.glm. Refer to vignette

A pilot estimator for the unknown parameter \(\beta\) is required because optL subsampling probabilities depend on \(\beta\). There is no "free lunch" when determining optimal subsampling probabilities. For quantile regression, this is achieved by drawing a size n.plt subsample with replacement from full dataset, using uniform sampling probability.

If boot=TRUE, the returned value subsample.size.expect equals to B*n.ssp, and the covariance matrix for coef would be calculated. If boot=FALSE, the returned value subsample.size.expect equals to B*n.ssp, but the covariance matrix won't be estimated.

References

Wang, H., & Ma, Y. (2021). Optimal subsampling for quantile regression in big data. Biometrika, 108(1), 99-112.

Examples

#quantile regression
set.seed(1)
N <- 1e4
B <- 5
tau <- 0.75
beta.true <- rep(1, 7)
d <- length(beta.true) - 1
corr  <- 0.5
sigmax  <- matrix(0, d, d)
for (i in 1:d) for (j in 1:d) sigmax[i, j] <- corr^(abs(i-j))
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
err <- rnorm(N, 0, 1) - qnorm(tau)
Y <- beta.true[1] + X %*% beta.true[-1] + 
err * rowMeans(abs(X))
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
formula <- Y ~ .
n.plt <- 200
n.ssp <- 100
optL.results <- ssp.quantreg(formula,data,tau = tau,n.plt = n.plt,
n.ssp = n.ssp,B = B,boot = TRUE,criterion = 'optL',
sampling.method = 'withReplacement',likelihood = 'weighted')
summary(optL.results)
#> Model Summary
#> 
#> 
#> Call:
#> 
#> ssp.quantreg(formula = formula, data = data, tau = tau, n.plt = n.plt, 
#>     n.ssp = n.ssp, B = B, boot = TRUE, criterion = "optL", sampling.method = "withReplacement", 
#>     likelihood = "weighted")
#> 
#> Subsample Size:
#> [1] 500
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.9554     0.0318 30.0568  <0.0001
#> V1          0.9926     0.0567 17.5146  <0.0001
#> V2          0.9761     0.0641 15.2257  <0.0001
#> V3          0.9800     0.0494 19.8297  <0.0001
#> V4          1.0749     0.0361 29.8075  <0.0001
#> V5          0.9890     0.0167 59.3399  <0.0001
#> V6          0.9877     0.0715 13.8179  <0.0001
uni.results <- ssp.quantreg(formula,data,tau = tau,n.plt = n.plt,
n.ssp = n.ssp,B = B,boot = TRUE,criterion = 'uniform',
sampling.method = 'withReplacement', likelihood = 'weighted')
summary(uni.results)
#> Model Summary
#> 
#> 
#> Call:
#> 
#> ssp.quantreg(formula = formula, data = data, tau = tau, n.plt = n.plt, 
#>     n.ssp = n.ssp, B = B, boot = TRUE, criterion = "uniform", 
#>     sampling.method = "withReplacement", likelihood = "weighted")
#> 
#> Subsample Size:
#> [1] 500
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   1.0404     0.0157 66.4117  <0.0001
#> V1          1.0048     0.0806 12.4606  <0.0001
#> V2          1.0664     0.0687 15.5237  <0.0001
#> V3          0.9842     0.0715 13.7698  <0.0001
#> V4          0.9684     0.0638 15.1784  <0.0001
#> V5          1.0856     0.0378 28.7252  <0.0001
#> V6          0.9097     0.0455 19.9756  <0.0001