This vignette introduces the usage of ssp.glm, using logistic regression as an example of a generalized linear model (GLM).
The statistical theory and algorithms in this implementation can be
found in the relevant reference papers.
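The log-likelihood function for a GLM is
$$
L(\beta) = \frac{1}{N} \sum_{i=1}^{N} \left\{ y_i u(\beta^{\top} x_i) - \psi\left[ u(\beta^{\top} x_i) \right] \right\},
$$
where $u$ and $\psi$ are known functions that depend on the distribution from the exponential family. For the binomial distribution, the log-likelihood function becomes
$$
L(\beta) = \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \beta^{\top} x_i - \log\left( 1 + e^{\beta^{\top} x_i} \right) \right].
$$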
The idea of subsampling methods is as follows: instead of fitting the model on the full dataset of size $N$, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn according to these probabilities. The model is then fitted on the subsample to obtain an estimator at reduced computational cost.
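As a minimal sketch of this idea, the base R code below draws a uniform subsample from simulated data and compares the subsample estimator with the full data estimator. The data-generating values and the object names (N_demo, n_sub, fit_full, fit_sub) are illustrative assumptions, and uniform sampling stands in for the informative probabilities that ssp.glm computes.

```r
set.seed(123)
N_demo <- 5000                      # full data size (illustrative)
x <- rnorm(N_demo)
y <- rbinom(N_demo, 1, plogis(-0.5 + 0.5 * x))
full <- data.frame(y = y, x = x)

# Full data estimator: fit on all N_demo rows.
fit_full <- glm(y ~ x, family = binomial(), data = full)

# Uniform subsampling: every row is equally likely to be selected.
n_sub <- 500
idx <- sample(N_demo, n_sub, replace = FALSE)
fit_sub <- glm(y ~ x, family = binomial(), data = full[idx, ])

# Compare the two sets of coefficients and their standard errors.
rbind(full = coef(fit_full), subsample = coef(fit_sub))
rbind(full = sqrt(diag(vcov(fit_full))), subsample = sqrt(diag(vcov(fit_sub))))
```

The subsample fit touches only a tenth of the rows, at the price of larger standard errors. Informative subsampling probabilities aim to shrink that loss for a given subsample size.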
Installation
You can install the development version of subsampling from GitHub with:
# install.packages("devtools")
devtools::install_github("dqksnow/Subsampling")
Terminology
Full dataset: The whole dataset used as input.
Full data estimator: The estimator obtained by fitting the model on the full dataset.
Subsample: A subset of observations drawn from the full dataset.
Subsample estimator: The estimator obtained by fitting the model on the subsample.
Subsampling probability ($\pi$): The probability assigned to each observation for inclusion in the subsample.
Example: Logistic Regression with Simulated Data
We introduce the usage of ssp.glm with simulated data. $X$ contains $d = 6$ covariates drawn from a multivariate normal distribution, and $Y$ is the binary response variable. The full dataset size is $N = 10^4$.
set.seed(1)
N <- 1e4
beta0 <- rep(-0.5, 7)
d <- length(beta0) - 1
corr <- 0.5
sigmax <- matrix(corr, d, d) + diag(1-corr, d)
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
colnames(X) <- paste("V", 1:ncol(X), sep = "")
P <- 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1]))
Y <- rbinom(N, 1, P)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
head(data)
#> Y V1 V2 V3 V4 V5 V6
#> 1 1 -1.0918680 -0.4462684 -0.02250989 -0.19626329 -0.67460551 -0.4392570
#> 2 0 -0.1591053 -0.4748068 0.46515238 0.88370061 -0.05910325 0.1857218
#> 3 1 -1.6260754 -0.3394421 -0.68490712 -0.55721107 0.01024563 -0.6319413
#> 4 0 0.1251949 1.5113247 1.38931519 1.24287417 2.48829727 0.5534888
#> 5 0 0.1931921 -0.1478401 -0.14788926 0.46973556 0.05205022 1.0907459
#> 6 0 -0.2560258 -1.6065024 0.32710042 -0.04590727 -0.94748664 -1.2310368
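For later comparison, the full data estimator can be obtained with stats::glm. The sketch below repeats the simulation above so that it runs on its own; the object name fit.full is an assumption introduced here, and the MASS package is assumed to be installed.

```r
set.seed(1)
N <- 1e4
beta0 <- rep(-0.5, 7)
d <- length(beta0) - 1
corr <- 0.5
sigmax <- matrix(corr, d, d) + diag(1 - corr, d)
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
colnames(X) <- paste("V", 1:ncol(X), sep = "")
P <- 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1]))
Y <- rbinom(N, 1, P)
data <- as.data.frame(cbind(Y, X))

# Full data estimator: ordinary logistic regression on all N rows.
fit.full <- glm(Y ~ ., family = binomial(), data = data)
round(cbind(true = beta0, full.data = coef(fit.full)), 3)
```

The subsample estimators reported later can be judged against this fit as well as against the true value of -0.5 for every coefficient.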
Key Arguments
The function usage is
ssp.glm(
  formula,
  data,
  subset = NULL,
  n.plt,
  n.ssp,
  family = "quasibinomial",
  criterion = "optL",
  sampling.method = "poisson",
  likelihood = "weighted",
  control = list(...),
  contrasts = NULL,
  ...
)
The core functionality of ssp.glm revolves around three key questions:
How are subsampling probabilities computed? (Controlled by the criterion argument)
How is the subsample drawn? (Controlled by the sampling.method argument)
How is the likelihood adjusted to correct for bias? (Controlled by the likelihood argument)
criterion
The choices of criterion include optA, optL (default), LCC and uniform.
The optimal subsampling criteria optA and optL are derived by minimizing the asymptotic covariance of the subsample estimator (its trace for optA, the trace of a linear transformation of it for optL), as proposed by Wang, Zhu, and Ma (2018). LCC and uniform are baseline methods.
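To make the criteria concrete, the sketch below computes normalized subsampling probabilities for logistic regression using the forms given in Wang, Zhu, and Ma (2018): optL weights each observation by |y_i - p_i| times the norm of x_i, optA by |y_i - p_i| times the norm of M^{-1} x_i, where M is the average information matrix, and LCC by |y_i - p_i| alone. This is a base R illustration under simplifying assumptions (toy data, a uniform pilot sample, M evaluated on the full data, helper names invented here). It is not the internal code of ssp.glm, which may differ in details.

```r
set.seed(2)
N <- 2000
X <- cbind(1, matrix(rnorm(N * 2), N, 2))        # design matrix with intercept
y <- rbinom(N, 1, plogis(X %*% c(-0.5, 0.5, 0.5)))

# Pilot estimate from a small uniform subsample.
plt <- sample(N, 200)
beta.plt <- coef(glm(y[plt] ~ X[plt, ] - 1, family = binomial()))
p <- as.vector(plogis(X %*% beta.plt))           # fitted probabilities

resid.abs <- abs(y - p)
# Average information matrix evaluated at the pilot estimate.
M <- crossprod(X * sqrt(p * (1 - p))) / N

normalize <- function(w) w / sum(w)
pi.optL <- normalize(resid.abs * sqrt(rowSums(X^2)))
pi.optA <- normalize(resid.abs * sqrt(rowSums((X %*% solve(M))^2)))
pi.LCC  <- normalize(resid.abs)
pi.unif <- rep(1 / N, N)

# Observations that the pilot model predicts poorly receive more mass.
summary(pi.optL / pi.unif)
```

All three non-uniform criteria favor observations with large absolute residuals, which is why a pilot estimate (controlled by n.plt) is needed before the main subsample is drawn.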
sampling.method
The options for the sampling.method argument include withReplacement and poisson (default). withReplacement draws n.ssp observations from the full dataset with replacement, using the specified subsampling probabilities. poisson decides for each observation independently whether to include it, by comparing its subsampling probability with a realization of a uniform random variable $U(0, 1)$. The actual subsample size is therefore random, and its expected value is n.ssp. For more details, see Wang (2019).
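The two schemes can be sketched in base R as follows. The probabilities here are arbitrary stand-ins rather than output of ssp.glm, the names w, ssp.prob, idx.wr and idx.poi are assumptions made for this illustration, and the truncation of n.ssp * pi_i at 1 is one common convention that the package may handle differently.

```r
set.seed(3)
N <- 10000
n.ssp <- 600
w <- runif(N)               # stand-in importance scores
ssp.prob <- w / sum(w)      # subsampling probabilities, summing to 1

# Sampling with replacement: exactly n.ssp draws, duplicates possible.
idx.wr <- sample(N, n.ssp, replace = TRUE, prob = ssp.prob)

# Poisson sampling: keep row i when u_i <= min(n.ssp * pi_i, 1).
u <- runif(N)
idx.poi <- which(u <= pmin(n.ssp * ssp.prob, 1))

c(with.replacement = length(idx.wr),
  unique = length(unique(idx.wr)),
  poisson = length(idx.poi))
```

Sampling with replacement fixes the subsample size but may select a row more than once, whereas Poisson sampling never repeats a row but returns a random number of rows centered on n.ssp.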
likelihood
The available choices for likelihood include weighted (default) and logOddsCorrection. Both of these likelihood functions yield an unbiased estimator. Theoretical results indicate that logOddsCorrection is more efficient than weighted in the context of logistic regression. See Wang and Kim (2022).
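For orientation, the two corrections can be written out for logistic regression as below. This is a sketch in generic notation, assuming a Poisson-type subsample $S$ in which observation $i$ is included with probability $\pi(x_i, y_i)$. The symbols $S$, $\pi(\cdot,\cdot)$ and $o_i$ are introduced here for illustration, and the exact expressions used by the package are those of the cited papers.

```latex
% Weighted likelihood: each sampled term is scaled by its inverse
% inclusion probability.
\[
  \hat\beta_{\mathrm{w}}
  = \arg\max_{\beta} \sum_{i \in S} \frac{1}{\pi(x_i, y_i)}
    \left[ y_i \beta^{\top} x_i - \log\left(1 + e^{\beta^{\top} x_i}\right) \right].
\]

% Log odds correction: within the subsample, the response follows a
% logistic model with a known offset o_i.
\[
  P(y_i = 1 \mid x_i,\, i \in S)
  = \frac{e^{\beta^{\top} x_i + o_i}}{1 + e^{\beta^{\top} x_i + o_i}},
  \qquad
  o_i = \log \frac{\pi(x_i, 1)}{\pi(x_i, 0)}.
\]
```

The offset follows from Bayes' rule: conditioning on inclusion multiplies the odds of $y_i = 1$ by $\pi(x_i, 1) / \pi(x_i, 0)$, so the subsample can be fitted by an unweighted logistic regression with $o_i$ as an offset.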
Outputs
After drawing the subsample, ssp.glm uses survey::svyglm to fit the model on the subsample, which in turn calls glm. Arguments accepted by svyglm can be passed through ... in ssp.glm.
Below are two examples demonstrating the use of ssp.glm with different configurations.
n.plt <- 200
n.ssp <- 600
ssp.results <- ssp.glm(formula = formula,
                       data = data,
                       n.plt = n.plt,
                       n.ssp = n.ssp,
                       family = "quasibinomial",
                       criterion = "optL",
                       sampling.method = "withReplacement",
                       likelihood = "weighted"
                       )
summary(ssp.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "optL", sampling.method = "withReplacement",
#> likelihood = "weighted")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 600
#> 4 Unique Subsample Size 561
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6%
#> 7 Unique Subample Rate 5.61%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -0.5876 0.0867 -6.7749 <0.0001
#> V1 -0.4725 0.1053 -4.4865 <0.0001
#> V2 -0.5252 0.1109 -4.7357 <0.0001
#> V3 -0.4789 0.1037 -4.6193 <0.0001
#> V4 -0.6400 0.1090 -5.8705 <0.0001
#> V5 -0.4937 0.1155 -4.2737 <0.0001
#> V6 -0.6226 0.1125 -5.5368 <0.0001
ssp.results <- ssp.glm(formula = formula,
                       data = data,
                       n.plt = n.plt,
                       n.ssp = n.ssp,
                       family = "quasibinomial",
                       criterion = "optA",
                       sampling.method = "poisson",
                       likelihood = "logOddsCorrection"
                       )
summary(ssp.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "optA", sampling.method = "poisson",
#> likelihood = "logOddsCorrection")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 634
#> 4 Unique Subsample Size 634
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6.34%
#> 7 Unique Subample Rate 6.34%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -0.5052 0.0775 -6.5171 <0.0001
#> V1 -0.4788 0.0992 -4.8288 <0.0001
#> V2 -0.4893 0.0985 -4.9662 <0.0001
#> V3 -0.4037 0.0997 -4.0485 <0.0001
#> V4 -0.6308 0.0988 -6.3844 <0.0001
#> V5 -0.5781 0.1029 -5.6208 <0.0001
#> V6 -0.4679 0.1001 -4.6747 <0.0001
As recommended by survey::svyglm, when working with binomial models it is advisable to use family = quasibinomial() to avoid a warning issued by glm. Refer to the Details section of the svyglm() help documentation. The 'quasi' versions of the family objects give the same point estimates.
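The behavior described here can be checked with plain glm. The sketch below uses toy data and non-integer weights of the kind that arise from inverse subsampling probabilities. The object names are assumptions for this illustration, and it exercises stats::glm directly rather than svyglm or ssp.glm.

```r
set.seed(4)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 * x))
w <- runif(n, 0.5, 2)     # non-integer weights, as arise with 1 / pi_i

# binomial() warns when weighted successes are non-integer; capture the message.
warn <- tryCatch(
  { glm(y ~ x, family = binomial(), weights = w); NA_character_ },
  warning = function(cnd) conditionMessage(cnd)
)

fit.bin <- suppressWarnings(glm(y ~ x, family = binomial(), weights = w))
fit.quasi <- glm(y ~ x, family = quasibinomial(), weights = w)

warn                                              # the warning text, if any
all.equal(coef(fit.bin), coef(fit.quasi))         # point estimates agree
```

Only the dispersion handling differs between the two families, so the coefficients are unchanged while the warning disappears.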
Returned object
The object returned by ssp.glm
contains estimation
results and indices of the drawn subsample in the full dataset.
names(ssp.results)
#> [1] "model.call" "coef.plt" "coef.ssp"
#> [4] "coef" "cov.ssp" "cov"
#> [7] "index.plt" "index" "N"
#> [10] "subsample.size.expect" "terms"
Some key returned variables:
index.plt and index are the row indices of the drawn pilot subsample and optimal subsample in the full data.
coef.ssp is the subsample estimator for $\beta$, and coef is the linear combination of coef.plt (pilot estimator) and coef.ssp.
cov.ssp and cov are the estimated covariance matrices of coef.ssp and coef.
The coefficients and standard errors printed by summary() are coef and the square root of diag(cov). See the help documentation of ssp.glm for details.
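The arithmetic behind such a coefficient table can be sketched with a small helper. wald_table is a hypothetical function written for this illustration, not part of the package, and it is demonstrated on an ordinary glm fit so that the block runs without subsampling installed.

```r
# Hypothetical helper: coefficient table from an estimate and its covariance.
wald_table <- function(est, covmat) {
  se <- sqrt(diag(covmat))          # standard errors
  z <- est / se                     # Wald z statistics
  cbind(Estimate = est, `Std. Error` = se, `z value` = z,
        `Pr(>|z|)` = 2 * pnorm(-abs(z)))
}

# Demonstration on a plain logistic regression fit.
set.seed(5)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial())
tab <- wald_table(coef(fit), vcov(fit))
round(tab, 4)
```

Assuming coef is a numeric vector and cov a matching square matrix, wald_table(ssp.results$coef, ssp.results$cov) should give the corresponding table for an ssp.glm result.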
Other Families
We also provide examples for Poisson regression and gamma regression in the help documentation of ssp.glm. Note that likelihood = "logOddsCorrection" is currently implemented only for logistic regression (family = "binomial" or "quasibinomial").