`ssp.softmax`: Subsampling for Softmax (Multinomial) Regression Model • subsampling

This vignette introduces the usage of ssp.softmax, which draws optimal subsample from full data and fit softmax (multinomial) regression on the subsample. The statistical theory and algorithms in this implementation can be found in the relevant reference papers.

Denote $y$ as multi-category response variable and $K+1$ is the number of categories. $N$ is the number of observations in the full dataset. $X$ is the $N \times d$ covariates matrix. Softmax regression model assumes that $P(y_{i,k} = 1 \mid \mathbf{x}_i) = \frac{\exp(\mathbf{x}_i^\top \boldsymbol{\beta}_k)}{\sum_{l=0}^{K} \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_l)}$ for $i = 1, \ldots, N$ and $k = 0, 1, \ldots, K$ , where $\boldsymbol{\beta}_k$ ’s are $d$ -dimensional unknown coefficients.

The log-likelihood function of softmax regression is

$\max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{k=0}^{K} y_{i,k} \mathbf{x}_i^\top \boldsymbol{\beta}_k - \ln \left\{ \sum_{l=0}^{K} \exp(\mathbf{x}_i^\top \boldsymbol{\beta}_l) \right\} \right].$

The idea of subsampling methods is as follows: instead of fitting the model on the size $N$ full dataset, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator with reduced computational cost.

Terminology

Full dataset: The whole dataset used as input.
Full data estimator: The estimator obtained by fitting the model on the full dataset.
Subsample: A subset of observations drawn from the full dataset.
Subsample estimator: The estimator obtained by fitting the model on the subsample.
Subsampling probability ( $\pi$ ): The probability assigned to each observation for inclusion in the subsample.

Example

We introduce the usage of ssp.softmax with simulated data. $X$ contains $d=3$ covariates drawn from multinormal distribution and $Y$ is the multicategory response variable with $K+1=3$ categories. The full data size is $N = 1 \times 10^4$ .

library(subsampling)

set.seed(1)
d <- 3
K <- 2
G <- rbind(rep(-1/(K+1), K), diag(K) - 1/(K+1)) %x% diag(d)
N <- 1e4
beta.true.baseline <- cbind(rep(0, d), matrix(-1.5, d, K))
beta.true.summation <- cbind(rep(1, d), 0.5 * matrix(-1, d, K))
mu <- rep(0, d)
sigma <- matrix(0.5, nrow = d, ncol = d)
diag(sigma) <- rep(1, d)
X <- MASS::mvrnorm(N, mu, sigma)
prob <- exp(X %*% beta.true.summation)
prob <- prob / rowSums(prob)
Y <- apply(prob, 1, function(row) sample(0:K, size = 1, prob = row))
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste("V", 1:ncol(X), sep=""))
head(data)
#>   Y         V1          V2         V3
#> 1 2 -0.3756189 -0.17727086 -0.9816025
#> 2 2  0.2912939  0.60753208 -0.4489936
#> 3 2 -1.0530547  0.02079337 -1.0146024
#> 4 0  0.1854791  2.45385260  1.2682922
#> 5 0  0.8687332  0.21941612 -0.2810234
#> 6 1 -0.8336174 -0.32556141 -0.8505501

Key Arguments

The function usage is

ssp.softmax(
  formula,
  data,
  subset,
  n.plt,
  n.ssp,
  criterion = "MSPE",
  sampling.method = "poisson",
  likelihood = "MSCLE",
  constraint = "summation",
  control = list(...),
  contrasts = NULL,
  ...
)

The core functionality of ssp.softmax revolves around three key questions:

How are subsampling probabilities computed? (Controlled by the criterion argument)
How is the subsample drawn? (Controlled by the sampling.method argument)
How is the likelihood adjusted to correct for bias? (Controlled by the likelihood argument)

`criterion`

The choices of criterion include optA, optL, ,MSPE(default), LUC and uniform. The default criterion MSPE minimizes the mean squared prediction error between subsample estimator and full data estimator. Criterion optA and optL are derived by minimizing the asymptotic covariance of subsample estimator. LUC and uniform are baseline methods. See Yao, Zou, and Wang (2023) and Wang and Kim (2022) for details.

`sampling.method`

The options for sampling.method include withReplacement and poisson (default). withReplacement. stands for drawing $n.ssp$ subsamples from full dataset with replacement, using the specified subsampling probability. poisson stands for drawing subsamples one by one by comparing the subsampling probability with a realization of uniform random variable $U(0,1)$ . The expected number of drawed samples are $n.ssp$ .

`likelihood`

The available choices for likelihood include weighted and MSCLE(default). MSCLE stands for maximum sampled conditional likelihood. Both of these likelihood functions can derive an unbiased optimal subsample estimator. See Wang and Kim (2022) for details about MSCLE.

`constraint`

Softmax model needs constraint on unknown coefficients for identifiability. The options for constraint include summation and baseline (default). The baseline constraint assumes the coefficient for the baseline category are $0$ . Without loss of generality, ssp.softmax sets the category $Y=0$ as the baseline category so that $\boldsymbol{\beta}_0=0$ . The summation constraint $\sum_{k=0}^{K} \boldsymbol{\beta}_k$ can also used in the subsampling method for the purpose of calculating optimal subsampling probability. These two constraints lead to different interpretation of coefficients but are equal for computing $P(y_{i,k} = 1 \mid \mathbf{x}_i)$ . The estimation of coefficients returned by ssp.softmax() is under baseline constraint.

Outputs

After drawing subsample, ssp.softmax utilizes nnet::multinom to fit the model on the subsample. Arguments accepted by nnet::multinom can be passed through ... in ssp.softmax.

Below are two examples demonstrating the use of ssp.softmax with different configurations.

n.plt <- 200
n.ssp <- 600
formula <- Y ~ . -1
ssp.results1 <- ssp.softmax(formula = formula,
                            data = data,
                            n.plt = n.plt,
                            n.ssp = n.ssp,
                            criterion = 'MSPE',
                            sampling.method = 'withReplacement',
                            likelihood = 'weighted',
                            constraint = 'baseline'
                            )
#> [1] "Message from nnet::multinom: "
#> # weights:  12 (6 variable)
#> initial  value 219.722458 
#> iter  10 value 143.304799
#> final  value 143.304778 
#> converged
#> [1] "Message from nnet::multinom: "
#> # weights:  12 (6 variable)
#> initial  value 5789394.068697 
#> iter  10 value 4219770.981211
#> final  value 4219003.340996 
#> converged
summary(ssp.results1)
#> Model Summary
#> 
#> 
#> Call:
#> 
#> ssp.softmax(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     criterion = "MSPE", sampling.method = "withReplacement", 
#>     likelihood = "weighted", constraint = "baseline")
#> 
#> Subsample Size:
#>                                
#> 1       Total Sample Size 10000
#> 2 Expected Subsample Size   600
#> 3   Actual Subsample Size   600
#> 4   Unique Subsample Size   563
#> 5  Expected Subample Rate    6%
#> 6    Actual Subample Rate    6%
#> 7    Unique Subample Rate 5.63%
#> 
#> Coefficients:
#> 
#>         [,1]      [,2]
#> V1 -1.175907 -1.281935
#> V2 -1.619482 -1.502412
#> V3 -1.231662 -1.414723
#> 
#> Std. Errors:
#> 
#>         [,1]      [,2]
#> V1 0.1457377 0.1459830
#> V2 0.1442445 0.1461619
#> V3 0.1498036 0.1492071

summary(ssp.results1) shows that it draws 600 observations out of 10000, where the number of unique indices is less than 600 since we use sampling.method = 'withReplacement'. After fitting softmax model on subsample using the choosen weighted likelihood function, we get coefficients estimation and standard errors as above.

ssp.results2 <- ssp.softmax(formula = formula,
                            data = data,
                            n.plt = n.plt,
                            n.ssp = n.ssp,
                            criterion = 'MSPE',
                            sampling.method = 'poisson',
                            likelihood = 'MSCLE',
                            constraint = 'baseline'
                            )
#> [1] "Message from nnet::multinom: "
#> # weights:  12 (6 variable)
#> initial  value 219.722458 
#> iter  10 value 126.651711
#> final  value 126.637171 
#> converged
#> [1] "Message from nnet::multinom: "
#> # weights:  21 (6 variable)
#> initial  value 908.207717 
#> iter  10 value 651.150361
#> final  value 650.943777 
#> converged
summary(ssp.results2)
#> Model Summary
#> 
#> 
#> Call:
#> 
#> ssp.softmax(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     criterion = "MSPE", sampling.method = "poisson", likelihood = "MSCLE", 
#>     constraint = "baseline")
#> 
#> Subsample Size:
#>                                
#> 1       Total Sample Size 10000
#> 2 Expected Subsample Size   600
#> 3   Actual Subsample Size   702
#> 4   Unique Subsample Size   702
#> 5  Expected Subample Rate    6%
#> 6    Actual Subample Rate 7.02%
#> 7    Unique Subample Rate 7.02%
#> 
#> Coefficients:
#> 
#>         [,1]      [,2]
#> V1 -1.450790 -1.434135
#> V2 -1.764387 -1.726621
#> V3 -1.448671 -1.607003
#> 
#> Std. Errors:
#> 
#>         [,1]      [,2]
#> V1 0.1314360 0.1353568
#> V2 0.1441155 0.1413029
#> V3 0.1389344 0.1411973

The returned object contains estimation results and index of drawn subsamples in the full dataset.

names(ssp.results1)
#>  [1] "model.call"            "coef.plt"              "coef.ssp"             
#>  [4] "coef"                  "coef.plt.sum"          "coef.ssp.sum"         
#>  [7] "coef.sum"              "cov.plt"               "cov.ssp"              
#> [10] "cov"                   "cov.plt.sum"           "cov.sum"              
#> [13] "cov.ssp.sum"           "index.plt"             "index.ssp"            
#> [16] "N"                     "subsample.size.expect" "terms"

Some key returned variables:

index.plt and index are the row indices of drawn pilot subsamples and optimal subsamples in the full data.
coef.ssp is the subsample estimator for $\beta$ and coef is the linear combination of coef.plt (pilot estimator) and coef.ssp.
cov.ssp and cov are estimated covariance matrices of coef.ssp and coef.

References

Wang, HaiYing, and Jae Kwang Kim. 2022. “Maximum Sampled Conditional Likelihood for Informative Subsampling.” Journal of Machine Learning Research 23 (332): 1–50.

Yao, Yaqiong, Jiahui Zou, and HaiYing Wang. 2023. “Model Constraints Independent Optimal Subsampling Probabilities for Softmax Regression.” Journal of Statistical Planning and Inference 225: 188–201.