`ssp.quantreg`: Subsampling for Quantile Regression
Source:vignettes/ssp-quantreg.Rmd
ssp-quantreg.Rmd
This vignette introduces the usage of ssp.quantreg
. The
statistical theory and algorithms behind this implementation can be
found in the relevant reference papers.
Quantile regression aims to estimate conditional quantiles by minimizing the following loss function:
where is the quantile of interest, is the response variable, is covariates vector and is the number of observations in full dataset.
The idea of subsampling methods is as follows: instead of fitting the model on the size full dataset, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator with reduced computational cost.
Terminology
Full dataset: The whole dataset used as input.
Full data estimator: The estimator obtained by fitting the model on the full dataset.
Subsample: A subset of observations drawn from the full dataset.
Subsample estimator: The estimator obtained by fitting the model on the subsample.
Subsampling probability (): The probability assigned to each observation for inclusion in the subsample.
Example
We introduce ssp.quantreg
with simulated data.
contains
covariates drawn from multinormal distribution and
is the response variable. The full data size is
.
The interested quantile
.
set.seed(1)
N <- 1e4
tau <- 0.75
beta.true <- rep(1, 7)
d <- length(beta.true) - 1
corr <- 0.5
sigmax <- matrix(0, d, d)
for (i in 1:d) for (j in 1:d) sigmax[i, j] <- corr^(abs(i-j))
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
err <- rnorm(N, 0, 1) - qnorm(tau)
Y <- beta.true[1] + X %*% beta.true[-1] + err * rowMeans(abs(X))
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
formula <- Y ~ .
head(data)
#> Y V1 V2 V3 V4 V5
#> 1 3.0813580 -0.1825325 -0.01613791 -0.01852406 1.0672454 0.9353870
#> 2 0.1114953 -0.3829652 -1.20674035 -0.33354934 0.3818526 0.6610612
#> 3 4.0233475 -0.1384141 0.35758454 -0.08962728 0.8591475 0.7554356
#> 4 -7.0116774 -0.7668158 -1.07028901 -2.57374497 -1.4283868 -0.4782146
#> 5 -1.2551700 -0.9557206 -0.82219260 0.47905721 0.1096016 -0.3116279
#> 6 2.6764218 0.8646208 -0.32527175 0.23441106 0.5800169 1.8153229
#> V6
#> 1 0.44382164
#> 2 0.12626628
#> 3 1.63208199
#> 4 1.10717085
#> 5 -0.08180055
#> 6 -0.03612645
Key Arguments
The function usage is
ssp.quantreg(
formula,
data,
subset = NULL,
tau = 0.5,
n.plt,
n.ssp,
B = 5,
boot = TRUE,
criterion = "optL",
sampling.method = "withReplacement",
likelihood = c("weighted"),
control = list(...),
contrasts = NULL,
...
)
The core functionality of ssp.quantreg
revolves around
three key questions:
How are subsampling probabilities computed? (Controlled by the
criterion
argument)How is the subsample drawn? (Controlled by the
sampling.method
argument)How is the likelihood adjusted to correct for bias? (Controlled by the
likelihood
argument)
criterion
criterion
stands for the criterion we choose to compute
the sampling probability for each observation. The choices of
criterion
include optL
(default) and
uniform
. In optL
, the optimal subsampling
probability is by minimizing a transformation of the asymptotic variance
of subsample estimator. uniform
is a baseline method.
sampling.method
The options for the sampling.method
argument include
withReplacement
(default) and poisson
.
withReplacement
stands for drawing
subsamples from full dataset with replacement, using the specified
subsampling probabilities. poisson
stands for drawing
subsamples one by one by comparing the subsampling probability with a
realization of uniform random variable
.
The expected number of drawn samples are
.
likelihood
The available choice for likelihood
in
ssp.quantreg
is weighted
. It takes the inverse
of sampling probabblity as the weights in likelihood function to correct
the bias introduced by unequal subsampling probabilities.
boot
and B
An option for drawing
subsamples (each with expected size n.ssp
) and deriving
subsample estimator and asymptotic covariance matrix based on them.
After getting
on the
-th
subsample,
,
it calculates
as the final subsample estimator and where is a correction term for effective subsample size since the observations in each subsample can be replicated. For more details, see Wang and Ma (2021).
Outputs
After drawing subsample(s), ssp.quantreg
utilizes
quantreg::rq
to fit the model on the subsample(s).
Arguments accepted by quantreg::rq
can be passed through
...
in ssp.quantreg
.
Below are two examples demonstrating the use of
ssp.quantreg
with different configurations.
B <- 5
n.plt <- 200
n.ssp <- 200
ssp.results1 <- ssp.quantreg(formula,
data,
tau = tau,
n.plt = n.plt,
n.ssp = n.ssp,
B = B,
boot = TRUE,
criterion = 'optL',
sampling.method = 'withReplacement',
likelihood = 'weighted'
)
ssp.results2 <- ssp.quantreg(formula,
data,
tau = tau,
n.plt = n.plt,
n.ssp = n.ssp,
B = B,
boot = FALSE,
criterion = 'optL',
sampling.method = 'withReplacement',
likelihood = 'weighted'
)
Returned object
The returned object contains estimation results and index of drawn subsample in the full dataset.
names(ssp.results1)
#> [1] "model.call" "coef.plt" "coef"
#> [4] "cov" "index.plt" "index.ssp"
#> [7] "N" "subsample.size.expect" "terms"
summary(ssp.results1)
#> Model Summary
#>
#>
#> Call:
#>
#> ssp.quantreg(formula = formula, data = data, tau = tau, n.plt = n.plt,
#> n.ssp = n.ssp, B = B, boot = TRUE, criterion = "optL", sampling.method = "withReplacement",
#> likelihood = "weighted")
#>
#> Subsample Size:
#> [1] 1000
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.9753 0.0324 30.0654 <0.0001
#> V1 0.9701 0.0220 44.1763 <0.0001
#> V2 1.0295 0.0394 26.1369 <0.0001
#> V3 0.9980 0.0209 47.8506 <0.0001
#> V4 0.9834 0.0609 16.1529 <0.0001
#> V5 1.0508 0.0301 34.8848 <0.0001
#> V6 0.9441 0.0327 28.8878 <0.0001
summary(ssp.results2)
#> Model Summary
#>
#>
#> Call:
#>
#> ssp.quantreg(formula = formula, data = data, tau = tau, n.plt = n.plt,
#> n.ssp = n.ssp, B = B, boot = FALSE, criterion = "optL", sampling.method = "withReplacement",
#> likelihood = "weighted")
#>
#> Subsample Size:
#> [1] 1000
#>
#> Coefficients:
#>
#> Estimate
#> Intercept 0.9510
#> V1 1.0234
#> V2 1.0562
#> V3 1.0221
#> V4 0.9676
#> V5 0.9698
#> V6 1.0445
Some key returned variables:
index.plt
andindex
are the row indices of drawn pilot subsamples and optimal subsamples in the full data.coef.ssp
is the subsample estimator for andcoef
is the linear combination ofcoef.plt
(pilot estimator) andcoef.ssp
.cov.ssp
andcov
are estimated covariance matrices ofcoef.ssp
andcoef
. Ifboot=FALSE
, covariance matrix would not be estimated and a sizen.ssp * B
subsample would be drawn.
See the help documentation of ssp.quantreg
for
details.