Select parameters from a single (sampling) iteration of the Markov Chain
Use the selected parameters and the model to simulate a data set of the same size (same number of observations/variables)
From the simulated data set, calculate selected summary statistics (e.g., mean, sd)
Repeat steps 1-3 for a fixed number of iterations (perhaps across the whole chain)
When done, compare the position of the observed summary statistics to the distribution of summary statistics from the simulated data sets (the predictive distribution)
Step 1: Assemble the posterior draws
For our linear model, let’s denote our observed dependent variable as \(Y\)
Note that the independent variables are not modeled (they are not explained by the statistical formula), so we cannot examine them.
First, let’s assemble the posterior draws (\(1000 \times 4 \times 7\)):
posteriorSample = fit_full_new$draws(variables = c('beta', 'sigma'),
                                     format = 'draws_matrix')
posteriorSample
Note that we do not want to directly compare simulated data to the observed data.
Instead, we extract some characteristics of simulated/observed data for comparison using summary statistics.
There are some advantages:
We may only be interested in certain characteristics of the data (e.g., whether our model predicts the mean of the dependent variable)
We can quality-check specific aspects of the model-fitting process
In this case, for example, we may be interested in whether the model captures the “location” or “scale” of WeightLB
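One simulation pass (steps 2-3) for a single posterior draw can be sketched as follows; the design matrix X, betaVector, and sigma below are hypothetical stand-ins (in the lecture they come from one row of posteriorSample):

```r
# One PPMC iteration with made-up stand-ins for the lecture's objects
set.seed(1)
N <- 30
X <- cbind(1, matrix(rnorm(N * 5), nrow = N))             # hypothetical N x 6 design matrix
betaVector <- matrix(c(170, 2, -1, 0.5, 0, 1), ncol = 1)  # one "draw" of beta
sigma <- 45                                               # one "draw" of sigma

# step 2: simulate a data set of the same size as the observed data
simY <- rnorm(n = N, mean = X %*% betaVector, sd = sigma)

# step 3: summary statistics of the simulated data
c(mean = mean(simY), sd = sd(simY))
```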
mean(dat$WeightLB)
[1] 171
(simMean =mean(simY))
[1] 174.6217
sd(dat$WeightLB)
[1] 49.51907
(simSD =sd(simY))
[1] 46.67779
Step 5: Looping across all posterior samples
We can repeat steps 2-4 for a set number of samples
In practice, we can use all the iterations we have from the Markov chains (\(I = 4000\))
I = nrow(posteriorSample)

## create empty simSD and simMean as placeholders
simSD = simMean = rep(NA, I)

for (i in 1:I) {
  # beta
  betaVector = matrix(posteriorSample[i, 1:6], ncol = 1)
  # sigma
  sigma = posteriorSample[i, 7]
  # X
  simY = rnorm(n = N, mean = X %*% betaVector, sd = sigma)
  simMean[i] = mean(simY)
  simSD[i] = sd(simY)
}

par(mfrow = 1:2)
hist(simMean)
hist(simSD)
Compare to the observed mean
We can now compare our observed mean and standard deviation with the distributions of the simulated values.
Blue line: the average value of predicted WeightLB
Red line: the observed mean value of WeightLB
The distribution of the predicted values of a summary statistic of WeightLB is called the posterior predictive distribution
PPMC can be checked using visual inspection:
library(ggplot2)
ppp <- data.frame(simMean = simMean, simSD = simSD)
ggplot(ppp) +
  geom_histogram(aes(x = simMean), fill = "grey", col = "black") +
  # red line: location of the observed mean of WeightLB
  geom_vline(xintercept = mean(dat$WeightLB), col = "red", size = 1.4) +
  # blue line: location of the average predicted mean of WeightLB
  geom_vline(xintercept = mean(simMean), col = "blue", size = 1.4, alpha = 0.5)
Compare to the observed SD
Similarly, let’s compare SD to the posterior predictive distribution of SD of WeightLB
The observed SD is located near the center of the posterior predictive distribution (PPD)
The average of the PPD is slightly higher than the observed SD
ggplot(ppp) +
  geom_histogram(aes(x = simSD), fill = "grey", col = "black") +
  # red line: location of the observed SD of WeightLB
  geom_vline(xintercept = sd(dat$WeightLB), col = "red", size = 1.4) +
  # blue line: location of the average predicted SD of WeightLB
  geom_vline(xintercept = mean(simSD), col = "blue", size = 1.4, alpha = 0.5)
PPMC Characteristics
PPMC methods are very useful
They provide a visual way to determine if the model fits the observed data
They are the main method of assessing absolute fit in Bayesian models
Absolute fit assesses if a model fits the data instead of comparing to another model
But, there are some drawbacks to PPMC methods
Almost any statistic can be used
Some are better than others (mean and SD of outcomes are nice choices for linear regression)
There is no standard for determining how much misfit is too much
Can be computationally demanding, depending on your model
Question: Can PPMC be used for models with maximum likelihood estimation or ordinary least squares?
Posterior Predictive P-values
We can summarize the PPMC using a type of “p-value”
Personally, I don’t like the name “p-value”; it sounds like we are trying to justify our results using significance testing
It differs from the frequentist “p-value” (the probability, if the null hypothesis is true, of observing data at least as extreme as the data we have)
The PPP-value: the proportion of times the statistic from the simulated data exceeds that of the observed data
Useful to determine how far off a statistic is from its posterior predictive distribution
If these p-values were:
near 0 or 1, your model is far off from your data
near .5, your model fits your data in terms of the statistics you examined
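A quick simulated illustration (toy numbers, not from the lecture's model) of why values near .5 indicate good fit for a statistic while values near 0 or 1 indicate misfit:

```r
set.seed(1)
obsStat  <- 5                               # "observed" summary statistic
sim_good <- rnorm(4000, mean = 5, sd = 1)   # predictive distribution centered on it
sim_bad  <- rnorm(4000, mean = 9, sd = 1)   # predictive distribution far from it

mean(sim_good > obsStat)   # near .5: statistic well captured
mean(sim_bad  > obsStat)   # near 1: statistic badly missed
```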
The PPP-value for mean:
mean(simMean >mean(dat$WeightLB))
[1] 0.4985
The PPP-value for SD:
mean(simSD >sd(dat$WeightLB))
[1] 0.52075
Compute PPP-values within Stan
We can use the generated quantities block of Stan to compute PPP-values for us:
generated quantities {
  // simulated data
  array[N] real weightLB_rep = normal_rng(X * beta, sigma);
  // posterior predictive distribution for mean and SD
  real mean_weightLB = mean(weightLB);
  real sd_weightLB = sd(weightLB);
  real mean_weightLB_rep = mean(to_vector(weightLB_rep));
  real<lower=0> sd_weightLB_rep = sd(to_vector(weightLB_rep));
  // ppp-values for mean and sd
  int<lower=0, upper=1> ppp_mean = (mean_weightLB_rep > mean_weightLB);
  int<lower=0, upper=1> ppp_sd = (sd_weightLB_rep > sd_weightLB);
}
Relative model fit: used to compare two or more competing models in terms of their model fit. Sometimes it is also called model selection.
In non-Bayesian models, Information Criteria are often used to make comparisons
AIC, BIC, DIC etc.
Typically, an IC is a function of the log-likelihood and a penalty term
The model with the lowest IC is the model that fits best
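As a reminder of this log-likelihood-plus-penalty structure, AIC can be computed by hand and checked against R's built-in AIC() (using the built-in mtcars data, not the lecture's data):

```r
fit <- lm(mpg ~ wt, data = mtcars)
k <- length(coef(fit)) + 1                           # coefficients plus residual variance
aic_manual <- -2 * as.numeric(logLik(fit)) + 2 * k   # -2 log-likelihood + penalty
all.equal(aic_manual, AIC(fit))                      # matches the built-in value
```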
Bayesian model fit is similar
Uses an index value
The model with the lowest index is the model that fits best
Recent advances in Bayesian model fit use indices tied to cross-validation predictions (inspired by machine learning):
Fit model leaving one observation out (LOO)
Calculate statistics related to prediction (for instance, log-likelihood of that observation conditional on model parameters)
Do for all observations
New Bayesian indices try to mirror these leave-one-out predictions (but approximate them, since refitting the model for every observation is too costly)
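For intuition, exact LOO-CV can be done in closed form for a toy conjugate normal-mean model with known sigma (all numbers below are hypothetical; realistic models need the approximations discussed next):

```r
set.seed(1)
n <- 30; sigma <- 2
y <- rnorm(n, mean = 10, sd = sigma)
m0 <- 0; s0 <- 10                        # prior: mu ~ N(m0, s0^2)

elpd_i <- numeric(n)
for (i in 1:n) {
  y_rest <- y[-i]
  # conjugate posterior of mu given all data except y[i]
  post_prec <- 1 / s0^2 + length(y_rest) / sigma^2
  post_mean <- (m0 / s0^2 + sum(y_rest) / sigma^2) / post_prec
  # log predictive density of the held-out observation
  elpd_i[i] <- dnorm(y[i], mean = post_mean,
                     sd = sqrt(sigma^2 + 1 / post_prec), log = TRUE)
}
(elpd_loo_exact <- sum(elpd_i))
```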
Deviance Information Indices
In the late 1990s and early 2000s, the Deviance Information Criterion (DIC) was popular for relative Bayesian model fit comparisons. It has since been shown to perform worse than LOO or WAIC, but let’s take a look:
\[
\text{DIC} = p_D + \overline{D(\theta)}
\]
where \(p_D\) is the estimated number of parameters as follows:
\[
p_D = \overline{D(\theta)} - D(\bar\theta)
\]
and where
\[
D(\theta) = -2 \log(p(y \mid \theta)) + C
\]
C is a constant that cancels out when model comparisons are made
Here,
\(\overline{D(\theta)}\) is the deviance of the data (y) given the parameters (\(\theta\)), averaged across all posterior samples
\(D(\bar\theta)\) is the deviance of the data (y) evaluated at the average of the parameters (\(\theta\)) across all samples
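The pieces of the formula can be made concrete with a toy normal-mean model with known sigma (the posterior draws below are simulated stand-ins, not from the lecture's fit):

```r
set.seed(1)
n <- 30; sigma <- 2
y <- rnorm(n, mean = 10, sd = sigma)
mu_draws <- rnorm(4000, mean = mean(y), sd = sigma / sqrt(n))  # stand-in posterior draws

D <- function(mu) -2 * sum(dnorm(y, mu, sigma, log = TRUE))  # deviance, taking C = 0
D_bar <- mean(sapply(mu_draws, D))   # average deviance across draws
p_D   <- D_bar - D(mean(mu_draws))   # effective number of parameters (about 1 here)
DIC   <- D_bar + p_D
```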
DIC in Stan
Some programs, like JAGS, provide DIC directly, but Stan does not provide a direct way of calculating it:
We can manually calculate an approximate DIC by replacing \(\bar\theta\) with the posterior median \(\tilde\theta\), provided the posterior distributions are approximately normal.
DIC for Full Model
# extract log-likelihood and parameters
posterior_draws <- fit_full_ppp$draws(c('log_lik', 'beta'), format = 'draws_matrix')
log_like_draws <- rowSums(posterior_draws[, colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")])

# find draws of log-likelihood whose parameters equal the median values
theta_draws <- posterior_draws[, colnames(posterior_draws) %in% paste0("beta[", 1:6, "]")]
theta_ave_iteration = apply(round(theta_draws, 3), 2, \(x) which(x == median(x))) |>
  unlist() |>
  as.numeric()

log_like_draws_meantheta = (-2) * mean(log_like_draws[theta_ave_iteration])
mean_log_like_draws = (-2) * mean(log_like_draws)

# compute DIC
DIC_full = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws
DIC_full
[1] 211.6029
DIC for Empty Model
# extract log-likelihood and parameters
posterior_draws <- fit_empty_ppp$draws(c('log_lik', 'beta0'), format = 'draws_matrix')
log_like_draws <- rowSums(posterior_draws[, colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")])

# find draws of log-likelihood whose parameters equal the median values
theta_draws <- posterior_draws[, colnames(posterior_draws) == "beta0"]
theta_ave_iteration = which(round(theta_draws, 2) == median(round(theta_draws, 2)))

log_like_draws_meantheta = (-2) * mean(log_like_draws[theta_ave_iteration])
mean_log_like_draws = (-2) * mean(log_like_draws)

# compute DIC
DIC_empty = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws
DIC_empty
[1] 322.1444
Issues with DIC
The DIC has fallen out of favor recently (not used in Stan)
Has issues when parameters are discrete
Not fully Bayesian (relies on a point estimate: the average of the parameter values)
Can give negative values for estimated numbers of parameters in a model
WAIC (Widely applicable or Watanabe-Akaike information criterion, Watanabe, 2010) corrects some of the problems with DIC:
It is a more frequently used model comparison index for Bayesian analysis
Using the loo R package, we can calculate WAIC with the waic() function:
WAIC for the full model
loo::waic(fit_full_ppp$draws('log_lik'))
Computed from 4000 by 30 log-likelihood matrix
Estimate SE
elpd_waic -109.4 6.9
p_waic 6.7 2.6
waic 218.8 13.8
5 (16.7%) p_waic estimates greater than 0.4. We recommend trying loo instead.
WAIC for the empty model
loo::waic(fit_empty_ppp$draws('log_lik'))
Computed from 4000 by 30 log-likelihood matrix
Estimate SE
elpd_waic -161.0 2.7
p_waic 1.4 0.3
waic 322.0 5.3
Here:
elpd_waic is the expected log pointwise predictive density for WAIC
p_waic is the WAIC estimate of the effective number of parameters, similar to \(p_D\) in DIC; it penalizes the likelihood for additional parameters
waic is the WAIC index used for model comparisons (\(-2 \times\) elpd_waic; the lowest value fits best)
Note that WAIC needs a log_lik variable computed in the model (typically in the generated quantities block) to be calculated correctly.
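The WAIC pieces can be reproduced by hand from any S-by-n log-likelihood matrix; the matrix below is simulated for a toy normal model (not the lecture's fit):

```r
set.seed(1)
S <- 4000; n <- 30; sigma <- 2
y <- rnorm(n, mean = 10, sd = sigma)
mu_draws <- rnorm(S, mean = mean(y), sd = sigma / sqrt(n))   # stand-in posterior draws
log_lik <- sapply(y, function(yi) dnorm(yi, mu_draws, sigma, log = TRUE))  # S x n matrix

lppd   <- sum(log(colMeans(exp(log_lik))))  # log pointwise predictive density
p_waic <- sum(apply(log_lik, 2, var))       # penalty: effective number of parameters
elpd_waic <- lppd - p_waic
waic <- -2 * elpd_waic
```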
LOO: Approximation to Leave-one-out
Big picture:
Besides WAIC, other comparative fit indices include LOO via Pareto-smoothed importance sampling (PSIS), available through the loo R package
WAIC/LOO can be used for model comparison, with the lowest value suggesting the better-fitting model
Unlike DIC, LOO via PSIS attempts to “approximate” the process of leave-one-out cross-validation (LOO-CV) using a sampling-based approach
Gives a finite-sample approximation
Implemented in Stan
Can quickly compare models
Gives warnings when it may be less reliable to use
The details of computation of LOO are very technical, but are nicely compiled in Vehtari, Gelman, and Gabry (2017).
LOO: Leave-one-out in Stan
Using the loo package, we can calculate an efficient approximate leave-one-out cross-validation (LOO-CV)
elpd_loo is the expected log pointwise predictive density for LOO (recall that posterior predictive distribution has some uncertainty around the mean value…)
p_loo is the LOO calculation of number of model parameters (a penalty to the likelihood for more parameters)
looic is the LOO index used for model comparisons (\(-2 \times\) elpd_loo; the lowest value suggests the best fit)
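The idea behind importance-sampling LOO can be sketched without the smoothing step: each leave-one-out predictive density is approximated by a harmonic mean of the pointwise likelihoods across draws. PSIS (what the loo package actually uses) stabilizes the weights that make this raw version unreliable. The log-likelihood matrix below is simulated for a toy normal model:

```r
set.seed(1)
S <- 4000; n <- 30; sigma <- 2
y <- rnorm(n, mean = 10, sd = sigma)
mu_draws <- rnorm(S, mean = mean(y), sd = sigma / sqrt(n))   # stand-in posterior draws
log_lik <- sapply(y, function(yi) dnorm(yi, mu_draws, sigma, log = TRUE))  # S x n matrix

# p(y_i | y_-i) approximated by 1 / mean_s( 1 / p(y_i | theta_s) )
elpd_i   <- -log(colMeans(exp(-log_lik)))  # raw weights are unstable; PSIS fixes this
elpd_loo <- sum(elpd_i)
looic    <- -2 * elpd_loo
```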
LOO: Comparing Model with LOO
Alternatively, you can use the built-in loo_compare() function from the loo package to compare alternative models:
loo::loo_compare(list(empty = fit_empty_ppp$loo(), full = fit_full_ppp$loo()))
elpd_diff se_diff
full 0.0 0.0
empty -51.3 7.9
This function calculates the standard error of the difference in elpd (expected log pointwise predictive density) between models
The SE gives an indication of the uncertainty in the estimated difference (relative to its size)
Can use this to downweight the choice of models when the standard error is high (Open Question: how high is high?)
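elpd_diff and se_diff can be reproduced from pointwise elpd values; the two vectors below are hypothetical pointwise elpds for two models over the same n observations:

```r
set.seed(1)
n <- 30
elpd_full  <- rnorm(n, mean = -3.6, sd = 1)   # hypothetical pointwise elpd, model A
elpd_empty <- rnorm(n, mean = -5.4, sd = 1)   # hypothetical pointwise elpd, model B

d <- elpd_full - elpd_empty          # pointwise differences
elpd_diff <- sum(d)                  # total difference in elpd
se_diff   <- sqrt(n) * sd(d)         # SE of the difference, as reported by loo_compare
```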
Sometimes the normal approximation is quite poor (funnels, multimodal distributions, long tails).
Sometimes the joint mode is useless or does not even exist (funnels, etc.), in which case the model itself would need to be altered in some way to get a stable mode.
Code
# Stan's L-BFGS algorithm
fit_full_optim <- mod_full_ppp$optimize(data = data_full_new, seed = 1234, jacobian = TRUE)
fit_full_laplace <- mod_full_ppp$laplace(data = data_full_new, mode = fit_full_optim, draws = 4000)

# Run 'variational' method to use ADVI to approximate posterior
fit_full_vb <- mod_full_ppp$variational(data = data_full_new, seed = 1234, draws = 4000)

# Run 'pathfinder' method, a new alternative to the variational method
fit_pf <- mod_full_ppp$pathfinder(data = data_full_new, seed = 1234, draws = 4000)
Approximation Algorithm vs. Full MCMC
Though they converge much faster, the accuracy of these approximation algorithms depends largely on the complexity of the model (number of parameters, prior distributions, latent variables, etc.)
The three lectures using linear models were built to show nearly all the parts needed in a Bayesian analysis:
MCMC specifications
Prior specifications
Assessing MCMC convergence
Reporting MCMC results
Determining if a model fits the data (absolute fit)
Determining which model fits the data better (relative fit)
All of these topics will stay with us when we move to more complicated models in future lectures.
Next Class
Generalized measurement models
References
Dhaka, Akash Kumar, Alejandro Catalina, Michael Riis Andersen, Måns Magnusson, Jonathan H. Huggins, and Aki Vehtari. 2020. “Robust, Accurate Stochastic Optimization for Variational Inference.” arXiv. https://doi.org/10.48550/ARXIV.2009.00666.
Gelman, Andrew, Xiao-Li Meng, and Hal Stern. 1996. “Posterior Predictive Assessment of Model Fitness Via Realized Discrepancies.” Statistica Sinica.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.
Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Yes, but Did It Work?: Evaluating Variational Inference.” arXiv. https://doi.org/10.48550/ARXIV.1802.02538.
Zhang, Lu, Bob Carpenter, Andrew Gelman, and Aki Vehtari. 2021. “Pathfinder: Parallel Quasi-Newton Variational Inference.” arXiv. https://doi.org/10.48550/ARXIV.2108.03782.