`ssp.relogit`: Subsampling for Logistic Regression Model with Rare Events
Source: vignettes/ssp-relogit.Rmd
This vignette introduces the usage of `ssp.relogit`. The statistical theory and algorithms behind this implementation can be found in the relevant reference papers.
The logistic regression log-likelihood function is
$$\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \log p_i(\beta) + (1 - y_i) \log\{1 - p_i(\beta)\} \right], \quad p_i(\beta) = \frac{e^{x_i^\top \beta}}{1 + e^{x_i^\top \beta}},$$
where $y_i \in \{0, 1\}$ is the response and $x_i$ is the covariate vector of the $i$-th observation.
Terminology
Full dataset: The whole dataset used as input.
Full data estimator: The estimator obtained by fitting the model on the full dataset.
Subsample: A subset of observations drawn from the full dataset.
Subsample estimator: The estimator obtained by fitting the model on the subsample.
Subsampling probability ($\pi_i$): The probability assigned to each observation for inclusion in the subsample.
Rare events: Observations with $Y = 1$ (positive instances).
Non-rare events: Observations with $Y = 0$ (negative instances).
The idea of subsampling methods is as follows: instead of fitting the model on the full dataset of size $N$, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator with reduced computational cost.
For logistic regression with rare events, Wang, Zhang, and Wang (2021) show that the available information is tied to the number of positive instances rather than the full data size. Based on this insight, one can keep all the rare instances and subsample only the non-rare instances to reduce the computational cost.
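As a toy illustration of this strategy (a hypothetical helper, not the package's implementation), one could keep every positive instance and draw the negatives uniformly:

```r
# Sketch of rare-event subsampling: keep every positive instance and draw
# a uniform Poisson subsample of the negatives.
rare_event_subsample <- function(y, n.neg) {
  idx.pos <- which(y == 1)                     # all rare events are kept
  idx.neg <- which(y == 0)                     # non-rare events are subsampled
  p.neg <- min(1, n.neg / length(idx.neg))     # uniform inclusion probability
  keep.neg <- idx.neg[runif(length(idx.neg)) < p.neg]
  sort(c(idx.pos, keep.neg))
}

set.seed(1)
y <- rbinom(1e4, 1, 0.01)
idx <- rare_event_subsample(y, n.neg = 500)
all(which(y == 1) %in% idx)  # TRUE: every rare event survives subsampling
```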
Example
We introduce the basic usage of `ssp.relogit` with simulated data. $X$ contains $d = 6$ covariates drawn from a multivariate normal distribution and $Y$ is the binary response variable. The full data size is $N = 2 \times 10^4$. Denote $N_1$ as the count of rare observations and $N_0$ as the count of non-rare observations.
set.seed(2)
N <- 2 * 1e4
beta0 <- c(-6, -rep(0.5, 6))
d <- length(beta0) - 1
X <- matrix(0, N, d)
corr <- 0.5
sigmax <- corr ^ abs(outer(1:d, 1:d, "-"))
X <- MASS::mvrnorm(n = N, mu = rep(0, d), Sigma = sigmax)
Y <- rbinom(N, 1, 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1])))
print(paste('N: ', N))
#> [1] "N: 20000"
print(paste('sum(Y): ', sum(Y)))
#> [1] "sum(Y): 266"
n.plt <- 200
n.ssp <- 1000
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
formula <- Y ~ .
Key Arguments
The function usage is
ssp.relogit(
formula,
data,
subset = NULL,
n.plt,
n.ssp,
criterion = "optL",
likelihood = "logOddsCorrection",
control = list(...),
contrasts = NULL,
...
)
The core functionality of `ssp.relogit` revolves around three key questions:

How are subsampling probabilities computed? (Controlled by the `criterion` argument.)
How is the subsample drawn?
How is the likelihood adjusted to correct for bias? (Controlled by the `likelihood` argument.)
Unlike `ssp.glm`, which offers both `withReplacement` and `poisson` as options for `sampling.method`, `ssp.relogit` uses `poisson` as the default sampling method. `poisson` stands for drawing subsamples one by one, comparing each subsampling probability $\pi_i$ with a realization $u_i$ of a $U(0, 1)$ random variable. The actual size of the drawn subsample is random, but the expected size of the negative subsample is `n.ssp` (all rare events are always included, so the expected total subsample size is $N_1 +$ `n.ssp`).
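Under this scheme, Poisson sampling reduces to one independent Bernoulli trial per observation. A minimal sketch with toy probabilities (not the package's internals):

```r
# Sketch of Poisson (independent) subsampling: observation i is included
# when a U(0, 1) draw falls below its inclusion probability.
set.seed(123)
n <- 1e4
p.incl <- rep(0.05, n)            # toy inclusion probabilities; sum(p.incl) = 500
take <- which(runif(n) < p.incl)  # one independent Bernoulli trial per row
length(take)                      # random, but close to the expected size sum(p.incl)
```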
criterion
The choices of `criterion` include `optA`, `optL` (default), `LCC`, and `uniform`.
The optimal subsampling criteria `optA` and `optL` are derived by minimizing the asymptotic covariance of the subsample estimator, as proposed by Wang, Zhu, and Ma (2018). `LCC` and `uniform` are baseline methods.
Note that for rare observations ($Y = 1$) in the full data, the sampling probabilities are $\pi_i = 1$, that is, every rare observation is included in the subsample. For non-rare observations, the sampling probabilities depend on the choice of `criterion`.
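As a rough sketch of the idea behind `optL` (following Wang, Zhu, and Ma (2018); the package's exact computation may differ), the probabilities for non-rare observations are proportional to $|y_i - p_i| \lVert x_i \rVert$, where $p_i$ comes from a pilot estimate:

```r
# Hedged sketch of optL-style probabilities for the negative instances:
# pi_i proportional to |y_i - p_i| * ||x_i||, scaled so the expected
# number of sampled negatives is n.take. Illustrative only.
set.seed(42)
n0 <- 5000; d0 <- 6; n.take <- 500
X0 <- matrix(rnorm(n0 * d0), n0, d0)                # covariates of Y = 0 rows
p0 <- as.vector(plogis(-6 + X0 %*% rep(-0.5, d0)))  # pilot response probabilities
score <- abs(0 - p0) * sqrt(rowSums(X0^2))          # |y_i - p_i| * ||x_i|| with y_i = 0
pi0 <- pmin(1, n.take * score / sum(score))         # scale; cap at 1
```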
likelihood
The available choices for `likelihood` include `weighted` and `logOddsCorrection` (default). Both of these likelihood functions yield unbiased estimators. Theoretical results indicate that `logOddsCorrection` is more efficient than `weighted` in the context of rare events logistic regression; see Wang, Zhang, and Wang (2021).
Results
After drawing the subsample, `ssp.relogit` uses `survey::svyglm` to fit the model on the subsample, which eventually calls `glm`. Arguments accepted by `svyglm` can be passed through `...` in `ssp.relogit`.

Below is an example demonstrating the use of `ssp.relogit`.
n.plt <- 200
n.ssp <- 600
ssp.results <- ssp.relogit(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
criterion = 'optA',
likelihood = 'logOddsCorrection'
)
Outputs
The returned object contains estimation results and indices of drawn subsample in the full dataset.
names(ssp.results)
#> [1] "model.call" "coef.plt" "coef.ssp"
#> [4] "coef" "cov.ssp" "cov"
#> [7] "index.plt" "index" "N"
#> [10] "subsample.size.expect" "terms"
Some key returned variables:

`index.plt` and `index` are the row indices of the drawn pilot subsample and the optimal subsample in the full data.
`coef.ssp` is the subsample estimator of $\beta$ and `coef` is a linear combination of `coef.plt` (the pilot estimator) and `coef.ssp`.
`cov.ssp` and `cov` are the estimated covariance matrices of `coef.ssp` and `coef`.
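Since the data here are simulated and small enough to fit directly, the combined estimator can be compared with the full data estimator (this refits `glm` on all $N$ rows, which is exactly the cost subsampling is meant to avoid):

```r
# Full data fit for reference; compare with the subsample-based estimator.
fit.full <- glm(formula, data = data, family = binomial)
cbind(full.data = coef(fit.full), subsample = ssp.results$coef)
```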
summary(ssp.results)
#> Model Summary
#>
#>
#> Call:
#>
#> ssp.relogit(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> criterion = "optA", likelihood = "logOddsCorrection")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 20000
#> 2 Expected Subsample Size 866
#> 3 Actual Subsample Size 886
#> 4 Unique Subsample Size 886
#> 5 Expected Subample Rate 4.33%
#> 6 Actual Subample Rate 4.43%
#> 7 Unique Subample Rate 4.43%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -5.7329 0.1410 -40.6603 <0.0001
#> V1 -0.4249 0.0895 -4.7460 <0.0001
#> V2 -0.5991 0.1014 -5.9098 <0.0001
#> V3 -0.3641 0.1030 -3.5336 0.0004
#> V4 -0.5157 0.0972 -5.3076 <0.0001
#> V5 -0.4974 0.1032 -4.8182 <0.0001
#> V6 -0.5034 0.0848 -5.9376 <0.0001
In the printed results, `Expected Subsample Size` is the sum of the rare event count ($N_1$) and the expected size of the negative subsample drawn from the non-rare observations. `Actual Subsample Size` is the sum of $N_1$ and the actual size of the negative subsample drawn from the non-rare observations.
The coefficients and standard errors printed by `summary()` are `coef` and the square roots of `diag(cov)`.
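For instance, the reported standard errors can be recovered directly from the returned object:

```r
# Standard errors are the square roots of the diagonal of the combined
# estimator's covariance matrix.
se <- sqrt(diag(ssp.results$cov))
data.frame(estimate = ssp.results$coef, std.error = se)
```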