`ssp.softmax`: Subsampling for Softmax (Multinomial) Regression Model
Source:vignettes/ssp-softmax.Rmd
ssp-softmax.Rmd
This vignette introduces the usage of ssp.softmax
, which
draws optimal subsample from full data and fit softmax (multinomial)
regression on the subsample. The statistical theory and algorithms in
this implementation can be found in the relevant reference papers.
Denote as multi-category response variable and is the number of categories. is the number of observations in the full dataset. is the covariates matrix. Softmax regression model assumes that for and , where βs are -dimensional unknown coefficients.
The log-likelihood function of softmax regression is
The idea of subsampling methods is as follows: instead of fitting the model on the size full dataset, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator with reduced computational cost.
Terminology
Full dataset: The whole dataset used as input.
Full data estimator: The estimator obtained by fitting the model on the full dataset.
Subsample: A subset of observations drawn from the full dataset.
Subsample estimator: The estimator obtained by fitting the model on the subsample.
Subsampling probability (): The probability assigned to each observation for inclusion in the subsample.
Example
We introduce the usage of ssp.softmax
with simulated
data.
contains
covariates drawn from multinormal distribution and
is the multicategory response variable with
categories. The full data size is
.
set.seed(1)
d <- 3
K <- 2
G <- rbind(rep(-1/(K+1), K), diag(K) - 1/(K+1)) %x% diag(d)
N <- 1e4
beta.true.baseline <- cbind(rep(0, d), matrix(-1.5, d, K))
beta.true.summation <- cbind(rep(1, d), 0.5 * matrix(-1, d, K))
mu <- rep(0, d)
sigma <- matrix(0.5, nrow = d, ncol = d)
diag(sigma) <- rep(1, d)
X <- MASS::mvrnorm(N, mu, sigma)
prob <- exp(X %*% beta.true.summation)
prob <- prob / rowSums(prob)
Y <- apply(prob, 1, function(row) sample(0:K, size = 1, prob = row))
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
head(data)
#> Y V1 V2 V3
#> 1 2 -0.3756189 -0.17727086 -0.9816025
#> 2 2 0.2912939 0.60753208 -0.4489936
#> 3 2 -1.0530547 0.02079337 -1.0146024
#> 4 0 0.1854791 2.45385260 1.2682922
#> 5 0 0.8687332 0.21941612 -0.2810234
#> 6 1 -0.8336174 -0.32556141 -0.8505501
Key Arguments
The function usage is
ssp.softmax(
formula,
data,
subset,
n.plt,
n.ssp,
criterion = "MSPE",
sampling.method = "poisson",
likelihood = "MSCLE",
constraint = "summation",
control = list(...),
contrasts = NULL,
...
)
The core functionality of ssp.softmax
revolves around
three key questions:
How are subsampling probabilities computed? (Controlled by the
criterion
argument)How is the subsample drawn? (Controlled by the
sampling.method
argument)How is the likelihood adjusted to correct for bias? (Controlled by the
likelihood
argument)
criterion
The choices of criterion
include optA
,
optL
, ,MSPE
(default), LUC
and
uniform
. The default criterion MSPE
minimizes
the mean squared prediction error between subsample estimator and full
data estimator. Criterion optA
and optL
are
derived by minimizing the asymptotic covariance of subsample estimator.
LUC
and uniform
are baseline methods. See
Yao, Zou, and Wang (2023) and Wang and Kim (2022) for details.
sampling.method
The options for sampling.method
include
withReplacement
and poisson
(default).
withReplacement.
stands for drawing
subsamples from full dataset with replacement, using the specified
subsampling probability. poisson
stands for drawing
subsamples one by one by comparing the subsampling probability with a
realization of uniform random variable
.
The expected number of drawed samples are
.
likelihood
The available choices for likelihood
include
weighted
and MSCLE
(default).
MSCLE
stands for maximum sampled conditional likelihood.
Both of these likelihood functions can derive an unbiased optimal
subsample estimator. See Wang and Kim
(2022) for details about MSCLE
.
constraint
Softmax model needs constraint on unknown coefficients for
identifiability. The options for constraint
include
summation
and baseline
(default). The baseline
constraint assumes the coefficient for the baseline category are
.
Without loss of generality, ssp.softmax
sets the category
as the baseline category so that
.
The summation constraint
can also used in the subsampling method for the purpose of calculating
optimal subsampling probability. These two constraints lead to different
interpretation of coefficients but are equal for computing
.
The estimation of coefficients returned by ssp.softmax()
is
under baseline constraint.
Outputs
After drawing subsample, ssp.softmax
utilizes
nnet::multinom
to fit the model on the subsample. Arguments
accepted by nnet::multinom
can be passed through
...
in ssp.softmax
.
Below are two examples demonstrating the use of
ssp.softmax
with different configurations.
n.plt <- 200
n.ssp <- 600
formula <- Y ~ . -1
ssp.results1 <- ssp.softmax(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
criterion = 'MSPE',
sampling.method = 'withReplacement',
likelihood = 'weighted',
constraint = 'baseline'
)
#> [1] "Message from nnet::multinom: "
#> # weights: 12 (6 variable)
#> initial value 219.722458
#> iter 10 value 143.304799
#> final value 143.304778
#> converged
#> [1] "Message from nnet::multinom: "
#> # weights: 12 (6 variable)
#> initial value 5789394.068697
#> iter 10 value 4219770.981211
#> final value 4219003.340996
#> converged
summary(ssp.results1)
#> Model Summary
#>
#>
#> Call:
#>
#> ssp.softmax(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> criterion = "MSPE", sampling.method = "withReplacement",
#> likelihood = "weighted", constraint = "baseline")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 600
#> 4 Unique Subsample Size 563
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6%
#> 7 Unique Subample Rate 5.63%
#>
#> Coefficients:
#>
#> [,1] [,2]
#> V1 -1.175907 -1.281935
#> V2 -1.619482 -1.502412
#> V3 -1.231662 -1.414723
#>
#> Std. Errors:
#>
#> [,1] [,2]
#> V1 0.1457377 0.1459830
#> V2 0.1442445 0.1461619
#> V3 0.1498036 0.1492071
summary(ssp.results1)
shows that it draws 600
observations out of 10000, where the number of unique indices is less
than 600 since we use sampling.method = 'withReplacement'
.
After fitting softmax model on subsample using the choosen
weighted
likelihood function, we get coefficients
estimation and standard errors as above.
ssp.results2 <- ssp.softmax(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
criterion = 'MSPE',
sampling.method = 'poisson',
likelihood = 'MSCLE',
constraint = 'baseline'
)
#> [1] "Message from nnet::multinom: "
#> # weights: 12 (6 variable)
#> initial value 219.722458
#> iter 10 value 126.651711
#> final value 126.637171
#> converged
#> [1] "Message from nnet::multinom: "
#> # weights: 21 (6 variable)
#> initial value 908.207717
#> iter 10 value 651.150361
#> final value 650.943777
#> converged
summary(ssp.results2)
#> Model Summary
#>
#>
#> Call:
#>
#> ssp.softmax(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> criterion = "MSPE", sampling.method = "poisson", likelihood = "MSCLE",
#> constraint = "baseline")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 702
#> 4 Unique Subsample Size 702
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 7.02%
#> 7 Unique Subample Rate 7.02%
#>
#> Coefficients:
#>
#> [,1] [,2]
#> V1 -1.450790 -1.434135
#> V2 -1.764387 -1.726621
#> V3 -1.448671 -1.607003
#>
#> Std. Errors:
#>
#> [,1] [,2]
#> V1 0.1314360 0.1353568
#> V2 0.1441155 0.1413029
#> V3 0.1389344 0.1411973
The returned object contains estimation results and index of drawn subsamples in the full dataset.
names(ssp.results1)
#> [1] "model.call" "coef.plt" "coef.ssp"
#> [4] "coef" "coef.plt.sum" "coef.ssp.sum"
#> [7] "coef.sum" "cov.plt" "cov.ssp"
#> [10] "cov" "cov.plt.sum" "cov.sum"
#> [13] "cov.ssp.sum" "index.plt" "index.ssp"
#> [16] "N" "subsample.size.expect" "terms"
Some key returned variables:
index.plt
andindex
are the row indices of drawn pilot subsamples and optimal subsamples in the full data.coef.ssp
is the subsample estimator for andcoef
is the linear combination ofcoef.plt
(pilot estimator) andcoef.ssp
.cov.ssp
andcov
are estimated covariance matrices ofcoef.ssp
andcoef
.