Optimal Subsampling Methods for Generalized Linear Models
Source:R/general_glm_main_function.R
ssp.glm.Rd
Draw subsample from full dataset and fit a generalized linear model (GLM) on the subsample. For a quick start, refer to the vignette.
Usage
ssp.glm(
formula,
data,
subset = NULL,
n.plt,
n.ssp,
family = "binomial",
criterion = "optL",
sampling.method = "poisson",
likelihood = "weighted",
control = list(...),
contrasts = NULL,
...
)
Arguments
- formula
A model formula object of class "formula" that describes the model to be fitted.
- data
A data frame containing the variables in the model. Denote \(N\) as the number of observations in
data
.- subset
An optional vector specifying a subset of observations from
data
to use for the analysis. This subset will be viewed as the full data.- n.plt
The pilot subsample size (first-step subsample size). This subsample is used to compute the pilot estimator and estimate the optimal subsampling probabilities.
- n.ssp
The expected size of the optimal subsample (second-step subsample). For
sampling.method = 'withReplacement'
, The exact subsample size isn.ssp
. Forsampling.method = 'poisson'
,n.ssp
is the expected subsample size.- family
family
can be a character string naming a family function, a family function or the result of a call to a family function.- criterion
The choices include
optA
,optL
(default),LCC
anduniform.
optA
Minimizes the trace of the asymptotic covariance matrix of the subsample estimator.optL
Minimizes the trace of a transformation of the asymptotic covariance matrix. The computational complexity of optA is \(O(N d^2)\) while that of optL is \(O(N d)\).LCC
Local Case-Control sampling probability, used as a baseline subsampling strategy.uniform
Assigns equal subsampling probability \(\frac{1}{N}\) to each observation, serving as a baseline subsampling strategy.
- sampling.method
The sampling method to use. Options include
withReplacement
andpoisson
(default).withReplacement
draws exactlyn.ssp
subsamples from size \(N\) full dataset with replacement, using the specified subsampling probabilities.poisson
draws observations independently by comparing each subsampling probability with a realization of uniform random variable \(U(0,1)\).Differences between methods:
Sample size:
withReplacement
draws exactlyn.ssp
subsamples whilepoisson
draws subsamples with expected sizen.ssp
, meaning the actual size may vary.Memory usage:
withReplacement
requires the entire dataset to be loaded at once, whilepoisson
allows for processing observations sequentially (will be implemented in future version).Estimator performance: Theoretical results show that the
poisson
tends to get a subsample estimator with lower asymptotic variance compared to thewithReplacement
- likelihood
The likelihood function to use. Options include
weighted
(default) andlogOddsCorrection
. A bias-correction likelihood function is required for subsample since unequal subsampling probabilities introduce bias.weighted
Applies a weighted likelihood function where each observation is weighted by the inverse of its subsampling probability.logOddsCorrection
This lieklihood is available only for logistic regression model (i.e., when family is binomial or quasibinomial). It uses a conditional likelihood, where each element of the likelihood represents the probability of \(Y=1\), given that this subsample was drawn.
- control
The argument
control
contains two tuning parametersalpha
andb
.alpha
\(\in [0,1]\) is the mixture weight of the user-assigned subsampling probability and uniform subsampling probability. The actual subsample probability is \(\pi = (1-\alpha)\pi^{opt} + \alpha \pi^{uni}\). This protects the estimator from extreme small subsampling probability. The default value is 0.b
is a positive number which is used to constaint the poisson subsampling probability.b
close to 0 results in subsampling probabilities closer to uniform probability \(\frac{1}{N}\).b=2
is the default value. See relevant references for further details.
- contrasts
An optional list. It specifies how categorical variables are represented in the design matrix. For example,
contrasts = list(v1 = 'contr.treatment', v2 = 'contr.sum')
.- ...
A list of parameters which will be passed to
svyglm()
.
Value
ssp.glm
returns an object of class "ssp.glm" containing the following components (some are optional):
- model.call
The original function call.
- coef.plt
The pilot estimator. See Details for more information.
- coef.ssp
The estimator obtained from the optimal subsample.
- coef
The weighted linear combination of
coef.plt
andcoef.ssp
. The combination weights depend on the relative size ofn.plt
andn.ssp
and the estimated covariance matrices ofcoef.plt
andcoef.ssp.
We blend the pilot subsample information into optimal subsample estimator since the pilot subsample has already been drawn. The coefficients and standard errors reported by summary arecoef
and the square root ofdiag(cov)
.- cov.ssp
The covariance matrix of
coef.ssp
.- cov
The covariance matrix of
coef
.- index.plt
Row indices of pilot subsample in the full dataset.
- index.ssp
Row indices of of optimal subsample in the full dataset.
- N
The number of observations in the full dataset.
- subsample.size.expect
The expected subsample size, equals to
n.ssp
forssp.glm.
Note that for other functions, such as ssp.relogit, this value may differ.- terms
The terms object for the fitted model.
Details
A pilot estimator for the unknown parameter \(\beta\) is required because both optA and
optL subsampling probabilities depend on \(\beta\). There is no "free lunch" when determining optimal subsampling probabilities. Fortunately the
pilot estimator only needs to satisfy mild conditions. For logistic regression, this
is achieved by drawing a size n.plt
subsample with replacement from full
dataset. The case-control subsample probability is applied, that is, \(\pi_i =
\frac{1}{2N_1}\) for \(Y_i=1\) and \(\pi_i = \frac{1}{2N_0}\) for \(Y_i=0\),
\(i=1,...,N\), where\(N_0\) and \(N_1\) are the counts of observations with \(Y = 0\) and \(Y = 1\), respectively. For other
families, uniform subsampling probabilities are applied. Typically, n.plt
is
relatively small compared to n.ssp
.
When criterion = 'uniform'
, there is no need to compute the pilot estimator. In this case, a size n.plt + n.ssp
subsample will be drawn with uniform sampling probability and coef
is the corresponding estimator.
As suggested by survey::svyglm()
, for binomial and poisson families, use family=quasibinomial()
and family=quasipoisson()
to avoid a warning "In eval(family$initialize) : non-integer #successes in a binomial glm!". The quasi versions of the family objects give the same point estimates and suppress the warning. Since subsampling methods only rely on point estimates from svyglm() for further computation, using the quasi families does not introduce any issues.
For Gamma family, ssp.glm
returns only the estimation of coefficients, as the dispersion parameter is not estimated.
References
Wang, H. (2019). More efficient estimation for logistic regression with optimal subsamples. Journal of machine learning research, 20(132), 1-59.
Ai, M., Yu, J., Zhang, H., & Wang, H. (2021). Optimal subsampling algorithms for big data regressions. Statistica Sinica, 31(2), 749-772.
Wang, H., & Kim, J. K. (2022). Maximum sampled conditional likelihood for informative subsampling. Journal of machine learning research, 23(332), 1-50.
Examples
# logistic regression
set.seed(2)
N <- 1e4
beta0 <- rep(-0.5, 7)
d <- length(beta0) - 1
corr <- 0.5
sigmax <- matrix(corr, d, d) + diag(1-corr, d)
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
Y <- rbinom(N, 1, 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1])))
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
n.plt <- 500
n.ssp <- 1000
subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = "optL",
sampling.method = 'poisson',
likelihood = "logOddsCorrection")
summary(subsampling.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "optL", sampling.method = "poisson",
#> likelihood = "logOddsCorrection")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 1000
#> 3 Actual Subsample Size 1034
#> 4 Unique Subsample Size 1034
#> 5 Expected Subample Rate 10%
#> 6 Actual Subample Rate 10.34%
#> 7 Unique Subample Rate 10.34%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -0.5430 0.0603 -9.0098 <0.0001
#> V2 -0.5562 0.0743 -7.4864 <0.0001
#> V3 -0.4597 0.0726 -6.3341 <0.0001
#> V4 -0.4561 0.0761 -5.9945 <0.0001
#> V5 -0.5714 0.0755 -7.5644 <0.0001
#> V6 -0.4958 0.0721 -6.8725 <0.0001
#> V7 -0.4962 0.0736 -6.7385 <0.0001
subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = "optL",
sampling.method = 'withReplacement',
likelihood = "weighted")
summary(subsampling.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "optL", sampling.method = "withReplacement",
#> likelihood = "weighted")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 1000
#> 3 Actual Subsample Size 1000
#> 4 Unique Subsample Size 901
#> 5 Expected Subample Rate 10%
#> 6 Actual Subample Rate 10%
#> 7 Unique Subample Rate 9.01%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -0.5201 0.0649 -8.0143 <0.0001
#> V2 -0.6526 0.0853 -7.6500 <0.0001
#> V3 -0.4247 0.0864 -4.9176 <0.0001
#> V4 -0.2804 0.0841 -3.3329 0.0009
#> V5 -0.5186 0.0792 -6.5508 <0.0001
#> V6 -0.6281 0.0837 -7.5026 <0.0001
#> V7 -0.5878 0.0836 -7.0348 <0.0001
Uni.subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = 'uniform')
summary(Uni.subsampling.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "uniform")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 1500
#> 3 Actual Subsample Size 1462
#> 4 Unique Subsample Size 1462
#> 5 Expected Subample Rate 15%
#> 6 Actual Subample Rate 14.62%
#> 7 Unique Subample Rate 14.62%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -0.5165 0.0720 -7.1700 <0.0001
#> V2 -0.7625 0.1000 -7.6253 <0.0001
#> V3 -0.3873 0.0919 -4.2152 <0.0001
#> V4 -0.3826 0.0976 -3.9188 <0.0001
#> V5 -0.5533 0.0914 -6.0567 <0.0001
#> V6 -0.3779 0.0991 -3.8143 0.0001
#> V7 -0.5427 0.0923 -5.8806 <0.0001
####################
# poisson regression
set.seed(1)
N <- 1e4
beta0 <- rep(0.5, 7)
d <- length(beta0) - 1
X <- matrix(runif(N * d), N, d)
epsilon <- runif(N)
lambda <- exp(beta0[1] + X %*% beta0[-1])
Y <- rpois(N, lambda)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
n.plt <- 200
n.ssp <- 600
subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'poisson',
criterion = "optL",
sampling.method = 'poisson',
likelihood = "weighted")
summary(subsampling.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "poisson", criterion = "optL", sampling.method = "poisson",
#> likelihood = "weighted")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 681
#> 4 Unique Subsample Size 681
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6.81%
#> 7 Unique Subample Rate 6.81%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.5662 0.0568 9.9711 <0.0001
#> V2 0.4893 0.0357 13.7219 <0.0001
#> V3 0.5126 0.0370 13.8583 <0.0001
#> V4 0.4253 0.0351 12.1181 <0.0001
#> V5 0.4972 0.0397 12.5128 <0.0001
#> V6 0.4930 0.0383 12.8792 <0.0001
#> V7 0.4871 0.0377 12.9209 <0.0001
subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'poisson',
criterion = "optL",
sampling.method = 'withReplacement',
likelihood = "weighted")
summary(subsampling.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "poisson", criterion = "optL", sampling.method = "withReplacement",
#> likelihood = "weighted")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 600
#> 4 Unique Subsample Size 568
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6%
#> 7 Unique Subample Rate 5.68%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.5543 0.0590 9.3997 <0.0001
#> V2 0.4505 0.0402 11.2074 <0.0001
#> V3 0.4643 0.0423 10.9781 <0.0001
#> V4 0.5392 0.0420 12.8471 <0.0001
#> V5 0.4497 0.0425 10.5933 <0.0001
#> V6 0.4529 0.0415 10.9180 <0.0001
#> V7 0.5495 0.0395 13.9048 <0.0001
Uni.subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'poisson',
criterion = 'uniform')
summary(Uni.subsampling.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "poisson", criterion = "uniform")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 800
#> 3 Actual Subsample Size 820
#> 4 Unique Subsample Size 820
#> 5 Expected Subample Rate 8%
#> 6 Actual Subample Rate 8.2%
#> 7 Unique Subample Rate 8.2%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.4226 0.0591 7.1479 <0.0001
#> V2 0.5789 0.0431 13.4381 <0.0001
#> V3 0.4852 0.0425 11.4087 <0.0001
#> V4 0.5401 0.0427 12.6439 <0.0001
#> V5 0.4419 0.0432 10.2351 <0.0001
#> V6 0.5156 0.0436 11.8327 <0.0001
#> V7 0.5311 0.0418 12.6998 <0.0001
##################
# gamma regression
set.seed(1)
N <- 1e4
p <- 3
beta0 <- rep(0.5, p + 1)
d <- length(beta0) - 1
shape <- 2
X <- matrix(runif(N * d), N, d)
link_function <- function(X, beta0) 1 / (beta0[1] + X %*% beta0[-1])
scale <- link_function(X, beta0) / shape
Y <- rgamma(N, shape = shape, scale = scale)
data <- as.data.frame(cbind(Y, X))
formula <- Y ~ .
n.plt <- 200
n.ssp <- 1000
subsampling.results <- ssp.glm(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'Gamma',
criterion = "optL",
sampling.method = 'poisson',
likelihood = "weighted")
summary(subsampling.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "Gamma", criterion = "optL", sampling.method = "poisson",
#> likelihood = "weighted")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 1000
#> 3 Actual Subsample Size 989
#> 4 Unique Subsample Size 989
#> 5 Expected Subample Rate 10%
#> 6 Actual Subample Rate 9.89%
#> 7 Unique Subample Rate 9.89%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.5099 0.0510 9.9893 <0.0001
#> V2 0.4732 0.0603 7.8469 <0.0001
#> V3 0.4040 0.0621 6.5035 <0.0001
#> V4 0.4945 0.0681 7.2599 <0.0001