Optimal Subsampling for Logistic Regression Model with Rare Events Data
Source: R/rare_events_logistic_main_function.R (ssp.relogit.Rd)
Draws a subsample from the full dataset and fits a logistic regression model on that subsample. For a quick start, refer to the vignette.
Usage
ssp.relogit(
  formula,
  data,
  subset = NULL,
  n.plt,
  n.ssp,
  criterion = "optL",
  likelihood = "logOddsCorrection",
  control = list(...),
  contrasts = NULL,
  ...
)
Arguments
- formula
A model formula object of class "formula" that describes the model to be fitted.
- data
A data frame containing the variables in the model. Denote \(N\) as the number of observations in data.
- subset
An optional vector specifying a subset of observations from data to use for the analysis. This subset will be viewed as the full data.
- n.plt
The pilot subsample size (first-step subsample size). This subsample is used to compute the pilot estimator and estimate the optimal subsampling probabilities.
- n.ssp
The expected subsample size (the second-step subsample size) drawn from those samples with Y=0. All rare events (Y=1) are included in the optimal subsample automatically.
- criterion
The choices include optA, optL (default), LCC and uniform.
  - optA: Minimizes the trace of the asymptotic covariance matrix of the subsample estimator.
  - optL: Minimizes the trace of a transformation of the asymptotic covariance matrix. The computational complexity of optA is \(O(N d^2)\) while that of optL is \(O(N d)\).
  - LCC: Local Case-Control sampling probability, used as a baseline subsampling strategy.
  - uniform: Assigns equal subsampling probability \(\frac{1}{N}\) to each observation, serving as a baseline subsampling strategy.
- likelihood
The likelihood function to use. Options include weighted and logOddsCorrection (default). A bias-correcting likelihood function is required for the subsample because unequal subsampling probabilities introduce bias.
  - weighted: Applies a weighted likelihood function where each observation is weighted by the inverse of its subsampling probability.
  - logOddsCorrection: This likelihood is available only for the logistic regression model (i.e., when family is binomial or quasibinomial). It uses a conditional likelihood, where each element of the likelihood represents the probability of \(Y=1\), given that this subsample was drawn.
- control
The argument control contains two tuning parameters, alpha and b.
  - alpha: \(\in [0,1]\) is the mixture weight of the user-assigned subsampling probability and the uniform subsampling probability. The actual subsampling probability is \(\pi = (1-\alpha)\pi^{opt} + \alpha \pi^{uni}\). This protects the estimator from extremely small subsampling probabilities. The default value is 0.
  - b: A positive number used to constrain the Poisson subsampling probability. A value of b close to 0 results in subsampling probabilities closer to the uniform probability \(\frac{1}{N}\). The default value is b=2. See relevant references for further details.
- contrasts
An optional list. It specifies how categorical variables are represented in the design matrix. For example, contrasts = list(v1 = 'contr.treatment', v2 = 'contr.sum').
- ...
A list of parameters which will be passed to svyglm().
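The effect of alpha can be illustrated with a small sketch. The helper mix_probs and the four probabilities in pi.opt are hypothetical values chosen for illustration; they are not part of the package and not output from ssp.relogit.

```r
# Hypothetical helper (not a package function): mix a vector of
# subsampling probabilities with the uniform probability 1/N.
mix_probs <- function(pi.opt, alpha) {
  stopifnot(alpha >= 0, alpha <= 1)
  pi.uni <- rep(1 / length(pi.opt), length(pi.opt))
  (1 - alpha) * pi.opt + alpha * pi.uni
}

# Made-up probabilities that sum to 1; the last one is very small.
pi.opt <- c(0.70, 0.25, 0.049, 0.001)
pi.mix <- mix_probs(pi.opt, alpha = 0.1)

# The mixture still sums to 1, and no probability falls below alpha / N.
sum(pi.mix)
min(pi.mix)

# Under the weighted likelihood each observation is weighted by the
# inverse of its probability, so the mixture caps the largest weight.
max(1 / pi.opt)
max(1 / pi.mix)
```

With alpha = 0 (the default) the probabilities are left unchanged, so the protection against very small probabilities only applies when alpha is set above 0.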
Value
ssp.relogit returns an object of class "ssp.relogit" containing the following components (some are optional):
- model.call
The original function call.
- coef.plt
The pilot estimator. See Details for more information.
- coef.ssp
The estimator obtained from the optimal subsample.
- coef
The weighted linear combination of coef.plt and coef.ssp. The combination weights depend on the relative size of n.plt and n.ssp and the estimated covariance matrices of coef.plt and coef.ssp. The pilot subsample information is blended into the optimal subsample estimator since the pilot subsample has already been drawn. The coefficients and standard errors reported by summary are coef and the square root of diag(cov).
- cov.ssp
The covariance matrix of coef.ssp.
- cov
The covariance matrix of the combined estimator coef (beta.cmb).
- index.plt
Row indices of the pilot subsample in the full dataset.
- index.ssp
Row indices of the optimal subsample in the full dataset.
- N
The number of observations in the full dataset.
- subsample.size.expect
The expected subsample size.
- terms
The terms object for the fitted model.
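As a rough sketch of how the coef and cov components relate to the table printed by summary, the snippet below rebuilds standard errors, z values and p-values from a mock list. The list fit is a stand-in, not a real "ssp.relogit" object: its numbers are the rounded Intercept and V3 rows from the Examples output, and the off-diagonal covariances are set to zero as an assumption because they are not shown on this page. The two-sided normal p-value is also an assumption, suggested by the "z value" and "Pr(>|z|)" column names.

```r
# Mock object with the same component names as the returned value.
# Only the diagonal of cov is used below; off-diagonals are assumed zero.
fit <- list(
  coef = c(Intercept = -4.9647, V3 = -0.3470),
  cov  = diag(c(0.0980, 0.1587)^2)
)

se <- sqrt(diag(fit$cov))   # standard errors: square root of diag(cov)
z  <- fit$coef / se         # z values
p  <- 2 * pnorm(-abs(z))    # two-sided p-values under a normal reference

round(cbind(Estimate = fit$coef, `Std. Error` = se, `z value` = z, `Pr(>|z|)` = p), 4)
```

Because the inputs are rounded to four decimals, the recomputed z values agree with the printed summary only to about two decimal places.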
Details
'Rare events' refers to data in which the number of observations with \(Y=1\) is small compared with the number of observations with \(Y=0\) in the full data. For logistic regression with rare events, Wang, Zhang, and Wang (2021) show that the available information is tied to the number of positive instances rather than the full data size. Based on this insight, one can keep all the rare instances and subsample only the non-rare instances to reduce the computational cost. When criterion is optA, optL or LCC, all observations with \(Y=1\) are preserved and n.ssp subsamples are drawn from observations with Y=0. When criterion is uniform, (n.plt+n.ssp) subsamples are drawn from the full sample with equal sampling probability.
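The idea of keeping every rare observation and subsampling only the non-rare ones can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the negatives are drawn by Poisson sampling with a single uniform inclusion probability p0 (rather than optA, optL or LCC probabilities), the simulated data and all variable names are made up, and the bias correction is a constant offset of -log(p0), which follows from conditioning on an observation having been kept.

```r
set.seed(1)
N <- 20000
x <- rnorm(N)
y <- rbinom(N, 1, plogis(-5 + 0.7 * x))   # simulated rare-event response

n.ssp <- 1000
N0 <- sum(y == 0)
p0 <- n.ssp / N0                          # assumed uniform inclusion probability for Y=0

# Keep every Y=1 row; include each Y=0 row independently with probability p0.
keep <- (y == 1) | (runif(N) < p0)
sub <- data.frame(y = y[keep], x = x[keep])

# The expected size is fixed, the actual size is random under Poisson sampling.
expected.size <- sum(y == 1) + n.ssp
actual.size <- nrow(sub)

# Log odds correction: given that a row was kept, the log odds of Y=1 are
# shifted by -log(p0), so that shift enters the model as an offset.
sub$off <- -log(p0)
fit <- glm(y ~ x + offset(off), family = binomial, data = sub)
coef(fit)
```

The gap between expected.size and actual.size mirrors the difference between the "Expected Subsample Size" and "Actual Subsample Size" rows that summary reports.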
A pilot estimator for the unknown parameter \(\beta\) is required because both the optA and optL subsampling probabilities depend on \(\beta\). It is obtained by drawing half of the pilot subsample from the rare observations and half from the non-rare observations.
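A minimal sketch of such a balanced pilot draw is given below. It only shows how the row indices could be selected; the variable names are hypothetical, the response is simulated, and the handling of the case with fewer than n.plt/2 rare observations (taking all of them) is an assumption rather than documented package behaviour.

```r
set.seed(1)
N <- 20000
y <- rbinom(N, 1, 0.01)        # simulated response with roughly 1% rare events

n.plt <- 200
half <- n.plt / 2
idx1 <- which(y == 1)          # rare observations
idx0 <- which(y == 0)          # non-rare observations

# sample.int() on positions avoids the sample(x) pitfall when x has length 1.
plt1 <- idx1[sample.int(length(idx1), min(half, length(idx1)))]
plt0 <- idx0[sample.int(length(idx0), half)]
index.plt <- c(plt1, plt0)

# The two groups are sampled at very different rates, which is why a pilot
# fit on these rows needs a bias-correcting likelihood.
p1 <- length(plt1) / length(idx1)
p0 <- length(plt0) / length(idx0)
c(p1 = p1, p0 = p0)
```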
Most of the arguments and returned variables have meanings similar to those in ssp.glm. Refer to the vignette for details.
References
Wang, H., Zhang, A., & Wang, C. (2021). Nonuniform negative sampling and log odds correction with rare events data. Advances in Neural Information Processing Systems, 34, 19847-19859.
Examples
set.seed(1)
N <- 2 * 1e4
beta0 <- c(-5, -rep(0.7, 6))
d <- length(beta0) - 1
X <- matrix(0, N, d)
corr <- 0.5
sigmax <- corr ^ abs(outer(1:d, 1:d, "-"))
sigmax <- sigmax / 4
X <- MASS::mvrnorm(n = N, mu = rep(0, d), Sigma = sigmax)
Y <- rbinom(N, 1, 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1])))
print(paste('N: ', N))
#> [1] "N: 20000"
print(paste('sum(Y): ', sum(Y)))
#> [1] "sum(Y): 277"
n.plt <- 200
n.ssp <- 1000
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
formula <- Y ~ .
subsampling.results <- ssp.relogit(formula = formula,
                                   data = data,
                                   n.plt = n.plt,
                                   n.ssp = n.ssp,
                                   criterion = 'optA',
                                   likelihood = 'logOddsCorrection')
summary(subsampling.results)
#> Model Summary
#>
#>
#> Call:
#>
#> ssp.relogit(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> criterion = "optA", likelihood = "logOddsCorrection")
#>
#> Subsample Size:
#>
#> 1       Total Sample Size  20000
#> 2 Expected Subsample Size   1277
#> 3   Actual Subsample Size   1302
#> 4   Unique Subsample Size   1302
#> 5  Expected Subample Rate 6.385%
#> 6    Actual Subample Rate  6.51%
#> 7    Unique Subample Rate  6.51%
#>
#> Coefficients:
#>
#>           Estimate Std. Error  z value Pr(>|z|)
#> Intercept  -4.9647     0.0980 -50.6818  <0.0001
#> V1         -0.7250     0.1513  -4.7933  <0.0001
#> V2         -0.9808     0.1633  -6.0076  <0.0001
#> V3         -0.3470     0.1587  -2.1863   0.0288
#> V4         -0.5822     0.1638  -3.5544   0.0004
#> V5         -0.8724     0.1625  -5.3679  <0.0001
#> V6         -0.5414     0.1416  -3.8235   0.0001
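The size rows in the summary can be checked by hand. The short calculation below uses only numbers printed on this page; the variable names are ad hoc. It shows that the expected subsample size is the number of rare events plus n.ssp, which is consistent with all Y=1 observations being kept under criterion = 'optA'.

```r
N       <- 20000   # total sample size
n.rare  <- 277     # sum(Y) printed above
n.ssp   <- 1000    # expected number of Y=0 observations drawn
n.drawn <- 1302    # actual subsample size reported by summary

n.expected <- n.rare + n.ssp     # expected subsample size
100 * n.expected / N             # expected subsample rate, in percent
100 * n.drawn / N                # actual subsample rate, in percent
```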