Lecture 05 – Jihong Zhang, Ph.D.

1 Today’s Lecture Objectives

Bayesian methods for determining how well a model fits the data (absolute fit)
Bayesian methods for determining which model fits better (relative model fit)
In previous class…
1. We estimated the empty model and the full model (you can try other constrained model between the empty and full model)
  - Empty model: a model without any covariate (predictors)
  - Full model: a model with all possible covariates and up-to highest interaction effects
2. We also make the Stan code more efficient by vectorizing the parameters and the data
3. The next question is how to determine the model is “good” enough
  - “good” has multiple meanings
  - a typical criteria is to what degree the model can generate a “simulated” data that is similar to the original data
  - or, the likelihood of original data given the parameters of model

2 Types of Model fit

Absolute model fit
- PPMC
- In SEM, RMSEA, chi-square, SRMR, and GFI
Relative model fit
- information criterion
Incremental model fit (not frequently used other than SEM)
- A special type of absolute model fit - how a model fits to a saturated model
- In SEM, comparative fit index (CFI), Tucker-Lewis index (TLI)

3 Absolute Model Fit: PPMC

Posterior predictive model checking (Gelman et al., 1996) is a Bayesian method evaluation technique for determining if a model fits the data.

Absolute model fit: “Does my model fit my data well?”
- Recall that “model is a simplified version of the true data-generation process”
- Thus, the model should be able to reproduce the “data” that is similar to observed data
- In machine learning, this is also called “validation”, typically used as a separate “validation data sets”
Overall idea: if a model fits the data well, then simulated data based on the model will resemble the observed data

3.1 Ingredients in PPMC:

Original data
- Typically, characterized by some set of statistics (i.e., sample mean, standard deviation, covariance) applied to data
Simulated data
- Generated from partial/all posterior draws in the Markov Chain
- Summarized by the same set of statistics

4 PPMC Example: Linear Models

The linear model from our example was:

\text{WeightLB}_p = \beta_0 + \beta_1 \text{HeightIN}_p + \beta_2 \text{Group2}_p + \beta_3\text{Group3}_p \\ +\beta_4 \text{HeightIN}_p\text{Group2}_p \\ +\beta_5 \text{HeightIN}_p\text{Group3}_p \\ + e_p

with:

\text{Group2}_p the binary indicator of person p being in group 2
\text{Group}3_p the binary indicator of person p being in group 3
e_p \sim N(0, \sigma_e)

FullModel_New.stan

data{
  int<lower=0> N;         // number of observations
  int<lower=0> P;         // number of predictors (plus column for intercept)
  matrix[N, P] X;         // model.matrix() from R 
  vector[N] weightLB;     // outcome
  real sigmaRate;         // hyperparameter: prior rate parameter for residual standard deviation
}
parameters {
  vector[P] beta;         // vector of coefficients for Beta
  real<lower=0> sigma;    // residual standard deviation
}
model {
  sigma ~ exponential(sigmaRate);         // prior for sigma
  weightLB ~ normal(X*beta, sigma);       // linear model
}

fit_full_new$summary(variables = c('beta', 'sigma'))

# A tibble: 7 × 10
  variable    mean  median    sd   mad     q5     q95  rhat ess_bulk ess_tail
  <chr>      <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>    <dbl>    <dbl>
1 beta[1]  148.    148.    3.27  3.23  142.   153.     1.00    1402.    1838.
2 beta[2]   -0.374  -0.372 0.483 0.479  -1.16   0.414  1.00    1522.    2019.
3 beta[3]  -24.1   -24.2   4.59  4.45  -31.6  -16.5    1.00    1746.    2190.
4 beta[4]   81.3    81.3   4.44  4.40   74.1   88.5    1.00    1538.    2091.
5 beta[5]    2.47    2.48  0.683 0.676   1.32   3.56   1.00    1805.    2118.
6 beta[6]    3.57    3.57  0.646 0.643   2.52   4.62   1.00    1678.    2218.
7 sigma      8.25    8.10  1.26  1.20    6.50  10.5    1.00    2265.    2212.

MLE Method:

dat <- read.csv(here::here("teaching", "2024-01-12-syllabus-adv-multivariate-esrm-6553", "Data", "DietData.csv"))
dat$DietGroup <- factor(dat$DietGroup, levels = 1:3)
dat$HeightIN60 <- dat$HeightIN - 60
res <- summary(lm(WeightLB ~ DietGroup*HeightIN60, data=dat))
round(res$coefficients, 3)

                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            147.557      3.008  49.061    0.000
DietGroup2             -24.202      4.253  -5.690    0.000
DietGroup3              81.326      4.021  20.227    0.000
HeightIN60              -0.377      0.459  -0.822    0.419
DietGroup2:HeightIN60    2.473      0.649   3.812    0.001
DietGroup3:HeightIN60    3.567      0.613   5.817    0.000

sd(resid(res)) # residual error

[1] 7.231744

4.1 PPMC Process

The PPMC Process can be summarized as follows:

Select parameters from a single (sampling) iteration of the Markov Chain
Using the selected parameters and the model to simulate a data set with the sample size (with same number of observations/variables)
From the simulated data set, calculate selected summary statistics (e.g., mean, sd)
Repeat steps 1-3 for a fixed number of iterations (perhaps across the whole chain)
When done, compare position of observed summary statistics to that of the distribution of summary statistics from simulated data sets (predictive distribution)

4.2 Step 1: Assemble the posterior draws

For our linear model, let’s denote our observed dependent variable as Y

Note that independent variables are not modeled (not explained by statistical formula), so we cannot examine them.

First, let’s assemble the posterior draws (1000 \times 4 \times 7):

posteriorSample = fit_full_new$draws(variables = c('beta','sigma'), format = 'draws_matrix')
posteriorSample

# A draws_matrix: 1000 iterations, 4 chains, and 7 variables
    variable
draw beta[1] beta[2] beta[3] beta[4] beta[5] beta[6] sigma
  1      148   0.215     -21      86     2.0     2.8   8.8
  2      145  -0.332     -20      84     1.6     3.0   7.1
  3      149  -0.705     -25      78     2.4     3.3   9.0
  4      148  -0.615     -22      84     2.9     3.7   8.7
  5      151  -0.598     -28      79     2.8     3.6   6.8
  6      151  -0.574     -27      77     2.8     3.7   6.3
  7      152  -1.174     -27      79     3.5     3.9   7.9
  8      151  -0.983     -30      81     3.3     4.2   7.8
  9      148  -0.278     -27      83     2.7     3.5   8.2
  10     147  -0.041     -25      77     2.4     3.4   7.6
# ... with 3990 more draws

4.3 Step 2: One draw from posterior

Next, let’s draw one set of posterior values of parameters at random with replacement:

set.seed(1234)
sampleIteration = sample(x = 1:nrow(posteriorSample), size = 1, replace = TRUE)
sampleIteration

[1] 3356

posteriorSample[sampleIteration,]

# A draws_matrix: 1 iterations, 1 chains, and 7 variables
      variable
draw   beta[1] beta[2] beta[3] beta[4] beta[5] beta[6] sigma
  3356     156   -0.62     -27      70     2.5     3.5   9.3

Those draws of \boldsymbol{\beta} and \sigma can be used to generate a predicted \hat{y}:

\hat{y}_i \sim N(\mathbf{X}\boldsymbol{\beta}_i, \sigma_i)

i represents the i th draw

4.4 Step 3: One simulated data

We then generate data based on this sampled iteration and our model distributional assumption:

# Sample Size
N = nrow(dat)
# beta
betaVector = matrix(posteriorSample[sampleIteration, 1:6], ncol = 1)
betaVector

           [,1]
[1,] 156.097000
[2,]  -0.615446
[3,] -27.399200
[4,]  69.629900
[5,]   2.508580
[6,]   3.450710

# sigma
sigma = posteriorSample[sampleIteration, 7]
# X
FullModelFormula = as.formula("WeightLB ~ HeightIN60 + DietGroup + HeightIN60*DietGroup")
X = model.matrix(FullModelFormula, data = dat)

simY = rnorm(n = N, mean = X %*% betaVector, sd = sigma)
head(simY)

[1] 161.4679 159.0327 156.9904 144.3520 149.0453 160.8447

4.5 Step 4: Summary Statistics of Simulated Data

Note that we do not want to directly compare simulated data to the observed data.

Instead, we extract some characteristics of simulated/observed data for comparison using summary statistics.

There are some advantages:

We may have research interests in only some characteristics of data (whether our model predict the mean of dependent variables)
We can QC detailed aspects of the fitting process of model

In this case, for example, we may be interested in whether the model captures the “location” or “scale” of WeightLB

mean(dat$WeightLB)

[1] 171

(simMean = mean(simY))

[1] 174.6217

sd(dat$WeightLB)

[1] 49.51907

(simSD = sd(simY))

[1] 46.67779

4.6 Step 5: Looping across all posterior samples

We can repeat step 2-4 for a set number of samples

Optionally, we can choose to use up all iterations we have for Markov Chains (I = 4000) in practice

I = nrow(posteriorSample)
## create empty simSD and simMean as placeholders
simSD = simMean = rep(NA, I)
for (i in 1:I) {
  # beta
  betaVector = matrix(posteriorSample[i, 1:6], ncol = 1)
  # sigma
  sigma = posteriorSample[i, 7]
  # X
  simY = rnorm(n = N, mean = X %*% betaVector, sd = sigma)
  simMean[i] = mean(simY)
  simSD[i] = sd(simY)
}
par(mfrow = 1:2)
hist(simMean)
hist(simSD)

4.7 Compare to the observed mean

We can now compare our observed mean and standard deviation with that of the sample values.

Blue line: the average value of predicted WeightLB
Red line: the observed mean value of WeightLB
The PDF of predictive values of summary statistics of WeightLB is called posterior predictive distribution

PPMC can be checked using visual inspection:

library(ggplot2)
ppp <- data.frame(
  simMean = simMean, 
  simSD = simSD
)
ggplot(ppp) +
  geom_histogram(aes(x = simMean), fill = "grey", col = "black") +
  geom_vline(xintercept = mean(dat$WeightLB), col = "red", size = 1.4) + # red line: location of mean of predicted WeightLB by model
  geom_vline(xintercept = mean(simMean), col = "blue", size = 1.4, alpha = 0.5) # blue line: location of mean of WeightLB

4.8 Compare to the observed SD

Similarly, let’s compare SD to the posterior predictive distribution of SD of WeightLB

the observed SD is located as the center of posterior predictive distribution (PPD)
the average mean of PPD is slightly higher than the observed SD

ggplot(ppp) +
  geom_histogram(aes(x = simSD), fill = "grey", col = "black") +
  geom_vline(xintercept = sd(dat$WeightLB), col = "red", size = 1.4) + # red line: location of mean of predicted WeightLB by model
  geom_vline(xintercept = mean(simSD), col = "blue", size = 1.4, alpha = 0.5) # blue line: location of mean of WeightLB

5 PPMC Characteristics

PPMC methods are very useful

They provide a visual way to determine if the model fits the observed data
They are the main method of assessing absolute fit in Bayesian models
Absolute fit assesses if a model fits the data instead of comparing to another model

But, there are some drawbacks to PPMC methods

Almost any statistic can be used
- Some are better than others (mean and SD of outcomes are nice choices for linear regression)
No standard determining how much misfit is too much
May be overwhelming to compute depending on your model

Question: Can PPMC be used for models with maximum likelihood estimation or ordinary least squares?

6 Posterior Predictive P-values

We can summarize the PPMC using a type of “p-value”

Personally, I don’t like the name “p-value”, sounds like we are trying to justify our results using significance testing

Different from the frequentist “p-value” (if the null hypothesis is true, the probability of the observed data existing)

The PPP-value: the proportion of times the statistic from the simulated data exceeds that of the observed data
Useful to determine how far off a statistic is from its posterior predictive distribution

If these p-values were:

near 0 or 1, indicating your model is far off your data
near .5, indicating your model fits your data in terms of the statistics you examined

The PPP-value for mean:

mean(simMean > mean(dat$WeightLB))

[1] 0.4985

The PPP-value for SD:

mean(simSD > sd(dat$WeightLB))

[1] 0.52075

7 Compute PPP-values within Stan

We can use the generated quantities block of Stan to compute PPP-values for us:

generated quantities{
  // simulated data
  array[N] real weightLB_rep = normal_rng(X*beta, sigma);
  // posterior predictive distribution for mean and SD
  real mean_weightLB = mean(weightLB);
  real sd_weightLB = sd(weightLB);
  real mean_weightLB_rep = mean(to_vector(weightLB_rep));
  real<lower=0> sd_weightLB_rep = sd(to_vector(weightLB_rep));
  // ppp-values for mean and sd
  int<lower=0, upper=1> ppp_mean = (mean_weightLB_rep > mean_weightLB);
  int<lower=0, upper=1> ppp_sd = (sd_weightLB_rep > sd_weightLB);
}

It will give us:

fit_full_ppp$summary(variables = c('mean_weightLB_rep', 'sd_weightLB_rep', 'ppp_mean', 'ppp_sd'))

# A tibble: 4 × 10
  variable           mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>             <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 mean_weightLB_… 171.     171.  2.17   2.13 167.  175.  1.000    4468.    3736.
2 sd_weightLB_rep  49.7     49.7 2.18   2.12  46.2  53.3 1.00     4177.    3795.
3 ppp_mean          0.503    1   0.500  0      0     1   1.00     4204.      NA 
4 ppp_sd            0.529    1   0.499  0      0     1   1.00     4280.      NA

8 Advantages or disadvantages of Computing PPP within Stan

Pros:

Built-in functions of Stan to generate simulated data for example normal_rng(), making PPP-values estimated much faster
Nice visual inspection tools existed - bayesplot

Cons:

Not allowed to debug each step in PPMC if something wrong
Cannot adjust the statistics and need to re-run the whole MCMC sampling, which is time-consuming

bayesplot::mcmc_dens_chains(fit_full_ppp$draws('mean_weightLB_rep'))

Posterior predictive distribution for mean of weight by chains

9 Relative Model Fit

Relative model fit: used to compare 2 or more competing models in terms of their mode fit. Sometime, it is also called model selection.

In non-Bayesian models, Information Criteria are often used to make comparisons
- AIC, BIC, DIC etc.
- Typically IC is a function of log-likelihood and penalty
- The model with the lowest IC is the model that fits best
Bayesian model fit is similar
- Uses an index value
- The model with the lowest index is the model that fits best
Recent advances in Bayesian model fit use indices that are tied to make cross-validation predictions (inspired by machine learning):
- Fit model leaving one observation out (LOO)
- Calculate statistics related to prediction (for instance, log-likelihood of that observation conditional on model parameters)
- Do for all observations
New Bayesian indices try to mirror these leave-one-out predictions (but approximate these due to time constraints)

10 Deviance Information Indices (DIC)

When late 1990s and early 2000s, the Deviance Information Criterion was popular for relative Bayesian model fit comparisons. It is proved not as good as LOO or WAIC. But let’s have a look at:

\text{DIC} = p_D + \overline{D(\theta)}

where p_D is the estimated number of parameters as follows:

p_D = \overline{D(\theta)} - D(\bar\theta)

and where

D(\theta) = -2 \log(p(y \mid \theta)) + C

C is a constant that cancels out when model comparisons are made

Here.

\overline{D(\theta)} is the average log likelihood of the data (y) given the parameters (\theta) computed across all samples
D(\bar\theta) is the log likelihood of the data (y) computed at the average of the parameters (\theta) computed across all samples

10.1 DIC in Stan

Some program like JAGS will have provided DIC directly, but Stan does not provides direct ways of calculating DIC:

We can manually calculate approximated DIC by replacing \bar\theta as median of posteriors \dot\theta if the posterior distributions are almost normal.

DIC for Full Model

# extract log-likelihood and parameters
posterior_draws <- fit_full_ppp$draws(c('log_lik', 'beta'), format = 'draws_matrix')
log_like_draws <- rowSums(posterior_draws[,colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")]) 

# find draws of loglikelihood that have parameters equaling to median values
theta_draws <- posterior_draws[, colnames(posterior_draws) %in% paste0("beta[", 1:6, "]")]
theta_ave_iteration = apply(round(theta_draws, 3), 2, \(x) which(x == median(x))) |> unlist() |> as.numeric()
log_like_draws_meantheta = (-2)*mean(log_like_draws[theta_ave_iteration])
mean_log_like_draws = (-2)*mean(log_like_draws)

# compute DIC
DIC_full = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws
DIC_full

[1] 213.6502

DIC for Empty Model

# extract log-likelihood and parameters
posterior_draws <- fit_empty_ppp$draws(c('log_lik', 'beta0'), format = 'draws_matrix')
log_like_draws <- rowSums(posterior_draws[,colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")]) 

# find draws of loglikelihood that have parameters equaling to median values
theta_draws <- posterior_draws[, colnames(posterior_draws) == "beta0"]
theta_ave_iteration = which(round(theta_draws,2) == round(median(theta_draws,2)))
log_like_draws_meantheta = (-2)*mean(log_like_draws[theta_ave_iteration])
mean_log_like_draws = (-2)*mean(log_like_draws)

# compute DIC
DIC_empty = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws
DIC_empty

[1] 355.4397

10.2 Issues to DIC

The DIC has fallen out of favor recently (not used in Stan)

Has issues when parameters are discrete
Not fully Bayesian (point estimate of average of parameter values)
Can give negative values for estimated numbers of parameters in a model

WAIC (Widely applicable or Watanabe-Akaike information criterion, Watanabe, 2010) corrects some of the problems with DIC:

Fully Bayesian (uses entire posterior distribution)
Asymptotically equal to Bayesian cross-validation
Invariant to parameterization

11 Watanabe-Akaike information criterion (WAIC)

A more frequently used model comparison indices for Bayesian analysis
Using the loo R package, we can calculate WAIC with the waic() function:

WAIC for the full model

loo::waic(fit_full_ppp$draws('log_lik'))


Computed from 4000 by 30 log-likelihood matrix.

          Estimate   SE
elpd_waic   -109.4  7.0
p_waic         6.7  2.7
waic         218.9 13.9

5 (16.7%) p_waic estimates greater than 0.4. We recommend trying loo instead.

WAIC for the empty model

loo::waic(fit_empty_ppp$draws('log_lik'))


Computed from 4000 by 30 log-likelihood matrix.

          Estimate   SE
elpd_waic   -207.9  6.5
p_waic        44.8  5.2
waic         415.8 13.0

30 (100.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.

Here:
- elpd_waic is the expected log pointwise predictive density for WAIC
- p_waic is the WAIC-version estimated number of parameter, similar to p(D) in DIC, which is a penalty to the likelihood for more parameters
- waic is the WAIC index used for model comparisons (lowest value is best fitting; -2*elpd_waic)
Note that WAIC needs a log_lik variable in the model analysis to be calculated correctly. cmdstan will automate calculate this variable.

12 LOO: Approximation to Leave-one-out

Big picture:

Besides WAIC, other comparative fit indices include LOO via Pareto Smoothed Important Sampling (PSIS) via Stan’s LOO package
WAIC/LOO can be used for model comparison with lowest value suggesting better model fit
Different from DIC, LOO via PSIS attempts to “approximate” the process of leave-one-out cross-validation (LOO-CV) using a sampling based-approach
- Gives a finite-sample approximation
- Implemented in Stan
- Can quickly compare models
- Gives warnings when it may be less reliable to use
The details of computation of LOO are very technical, but are nicely compiled in Vehtari, Gelman, and Gabry (2017).

12.1 LOO: Leave-one-out in Stan

Using loo package, we can calculate efficient approximate leave-one-out cross-validation (LOO-CV)

Full model’s LOO:

full_loo_res <- fit_full_ppp$loo('log_lik', save_psis = TRUE, cores = 4)
full_loo_res$estimates

            Estimate        SE
elpd_loo -109.813621  7.113334
p_loo       7.080317  2.844954
looic     219.627242 14.226667

Empty model’s LOO:

empty_loo_res <- fit_empty_ppp$loo('log_lik', save_psis = TRUE, cores = 4)
empty_loo_res$estimates

           Estimate        SE
elpd_loo -244.25051 10.081720
p_loo      81.16836  8.858813
looic     488.50102 20.163441

Here:

elpd_loo is the expected log pointwise predictive density for LOO (recall that posterior predictive distribution has some uncertainty around the mean value…)
p_loo is the LOO calculation of number of model parameters (a penalty to the likelihood for more parameters)
looic is the LOO index used for model comparisons — lowest value suggests best fitting -2elpd_loo

12.2 LOO: Comparing Model with LOO

Alternative, you can use the built-in function loo_compare in loo package to compare alternative models:

loo::loo_compare(list(empty = fit_empty_ppp$loo(), full = fit_full_ppp$loo()))

      elpd_diff se_diff
full     0.0       0.0 
empty -134.4      13.5

This function calculats the standard error of the difference in elpd (expected log pointwise predictive density) between models

The SE gives an indication of the standard error in the estimate (relative to the size)
Can use this to downweight the choice of models when the standard error is high (Open Question: how high is high?)
Note: elpd_diff is looic divided by -2 (on the log-likelihood scale, not the deviance scale)
- Here, we interpret the results as the model with full predictors is preferred to the model with empty predictors.
- The “confidence interval” of the elpd_diff does not include 0, indicating we can be fairly certain of this result

0 \in \mathrm{elpd\_diff} \pm 1.96 \cdot \mathrm{se\_diff}

12.3 LOO: Pareto smoothed importance sampling (PSIS)

Estimated a vector of Pareto shape parameters k for individuals which represents the reliability of sampling:
- k < .5 (good) suggests the estimate converges quickly
- .5 < k < .7 (ok) suggests the estimate converges slowly
- .7 < k < 1 (bad) suggests bad performance
- k > 1 (very bad)
PSIS screens all cases that have bad diagnostic values. The percentage may tells some information regarding reliability of Bayesian estimation.
Criteria: Estimated tail shape parameter
This is new area, please see more references (Vehtari et al., 2016)

pareto_k_table(full_loo_res)

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     29    96.7%   259     
   (0.7, 1]   (bad)       1     3.3%   <NA>    
   (1, Inf)   (very bad)  0     0.0%   <NA>

13 Compile the relative model fit indices

Below shows the all three model comparison fit indices for the empty model and the full model.

data.frame(
  Model = c("Full Model", "Empty Model"),
  DIC = c(DIC_full, DIC_empty),
  WAIC = c(WAIC_full, WAIC_empty),
  LOOIC = c(LOO_full, LOO_empty)
)

        Model      DIC     WAIC    LOOIC
1  Full Model 213.6502 218.8798 219.6272
2 Empty Model 355.4397 415.8137 488.5010

All fit indices suggest the full model is better than the empty model
For simple models, the results of DIC/WAIC/LOOIC should be consistent
In some situations that they are inconsistent, please use WAIC/LOO as standards.

14 General points about Bayesian Model Comparison

Note, WAIC and LOO will converge as sample size increases (WAIC is asymptotic value of LOO)
Latent variable models present challenges (could be your dissertation project)
- Need log likelihood with latent variable being integrated out
Missing data present challenges
- Need log likelihood with missing data integrated out
Generally, using LOO is recommended (but providing both is appropriate)

15 Cutting-Edge Research Field: Approximation Algorithm

MCMC sampling could be very computational time-consuming.
For number of parameters up to thousands or even millions, it could be a challenge to get the estimation.
Thus, approximation algorithms has been developed, such as Laplace Approximation, Variational Inference (Dhaka et al., 2020; Yao et al., 2018), Pathfinder methods (Zhang et al., 2021).

Quoted from Gelman’s blog:

Advantages of Laplace:

Relatively cheap to compute, given that we already have a mode-finder in Stan.

Easy to understand, use, and communicate.

Works reasonably well in many examples (wherever the posterior can be well approximated by a normal distribution).

Easy to take draws from the normal approximation, also easy to compute importance ratios and use Pareto-smoothed importance sampling.

Limitations of Laplace:

Sometimes the normal approx is pretty bad (funnels, multimodal distributions, long tails).

Sometimes the joint mode is useless or does not even exist (funnels, etc.), in which case the model itself would need to be altered in some way to get a stable mode.

# Stan's LBFGS algorithm
fit_full_optim <- mod_full_ppp$optimize(data = data_full_new, seed = 1234, jacobian = TRUE, show_messages = FALSE)
fit_full_laplace <- mod_full_ppp$laplace(data = data_full_new, mode = fit_full_optim, draws = 4000, show_messages = FALSE)

# Run 'variational' method to use ADVI to approximate posterior
fit_full_vb <- mod_full_ppp$variational(data = data_full_new, seed = 1234, draws = 4000, show_messages = FALSE)
# Run 'pathfinder' method, a new alternative to the variational method
fit_pf <- mod_full_ppp$pathfinder(data = data_full_new, seed = 1234, draws = 4000, show_messages = FALSE)

output_compare <- data.frame(
  Bayesian = unlist(fit_full_ppp$summary(c('beta', 'sigma'))[, 2]),
  Optim = unlist(fit_full_optim$summary(c('beta', 'sigma'))[, 2]),
  Laplace = unlist(fit_full_laplace$summary(c('beta', 'sigma'))[, 2]),
  VB = unlist(fit_full_vb$summary(c('beta', 'sigma'))[, 2]),
  pathfinder = unlist(fit_pf$summary(c('beta', 'sigma'))[, 2] )
)
rownames(output_compare) <- c(paste0("beta", 0:5), "sigma")
output_compare

         Bayesian      Optim     Laplace          VB  pathfinder
beta0 147.6108432 147.557000 147.5510808 147.9058510 147.7262355
beta1  -0.3815513  -0.376992  -0.3702885  -0.3473592  -0.4478386
beta2 -24.2371296 -24.205200 -24.1710838 -23.9152932 -24.3770032
beta3  81.2571781  81.329800  81.3088230  81.6694042  81.4513564
beta4   2.4754819   2.473100   2.4625638   2.4865032   2.5144901
beta5   3.5758706   3.566510   3.5696170   3.5923468   3.6153040
sigma   8.2202957   7.144090   7.2311770   7.9185233   8.0978742

15.1 Approximation Algorithm vs. Full MCMC

Though they are very quicker to converge, the accuracy of these approximation algorithms largely depend on the complexity of modeling (number of parameters, prior distributions, latent variables etc.)

Click here to see R code

summ_list <- lapply(c(fit_full_ppp, fit_full_laplace, fit_full_vb, fit_pf), \(x) 
                    x$summary(c("beta", "sigma"))[c('variable', 'mean',"q5", "q95")])
summ_list[[1]]$Algorithm = "MCMC"
summ_list[[2]]$Algorithm = "Laplace Approx."
summ_list[[3]]$Algorithm = "Variational Inference"
summ_list[[4]]$Algorithm = "Pathfinder"
summ_forplot <- Reduce(rbind, summ_list)  
summ_forplot$Algorithm = factor(summ_forplot$Algorithm, levels = c("MCMC", "Laplace Approx.", "Pathfinder", "Variational Inference"))

ggplot(summ_forplot) +
  geom_col(aes(x = variable, y = mean, fill = Algorithm), position = position_dodge()) +
  geom_errorbar(aes(x = variable, ymax = q95, ymin = q5, fill = Algorithm),position = position_dodge2(padding = 0.6)) +
  labs(x = '', y = '') +
  theme_classic()

Mean and 95% Credit Interval of aproximation algorithms and MCMC

library(tidyverse)
rbind(
  cbind(data.frame(Algorithm = "Laplace Approx."), fit_full_laplace$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6]),
  cbind(data.frame(Algorithm = "Variational Inference"), fit_full_vb$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6]),
  cbind(data.frame(Algorithm = "Pathfinder"), fit_pf$draws(paste0("beta[", 1:6, "]"))[, 1:6]),
  cbind(data.frame(Algorithm = "MCMC"), fit_full_ppp$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6])
) |>
  mutate(Algorithm = factor(Algorithm, levels = c("MCMC", "Laplace Approx.", "Pathfinder", "Variational Inference"))) |> 
  pivot_longer(starts_with("beta"), names_to = "Parameter", values_to = "Draws") |>
  ggplot() +
  geom_density(aes(x = Draws, fill = Algorithm), alpha = 0.4) +
  facet_wrap(~Parameter, scales = "free")

16 Wrapping up

We used three lectures using linear models was built to show nearly all parts needed in a Bayesian analysis

MCMC specifications
Prior specifications
Assessing MCMC convergence
Reporting MCMC results
Determining if a model fits the data (absolute fit)
Determining which model fits the data better (relative fit)

All of these topics will be with us when we start model complicated models in our future lecture.

17 Next Class

Generalized measurement models

18 Reference

Dhaka, A. K., Catalina, A., Andersen, M. R., Magnusson, M., Huggins, J. H., & Vehtari, A. (2020). Robust, accurate stochastic optimization for variational inference. arXiv. https://doi.org/10.48550/ARXIV.2009.00666

Gelman, A., Meng, X.-L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica.

Vehtari, A., Gelman, A., & Gabry, J. (2016). Practical bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432. https://doi.org/10.1007/s11222-016-9696-4

Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Yes, but did it work?: Evaluating variational inference. arXiv. https://doi.org/10.48550/ARXIV.1802.02538

Zhang, L., Carpenter, B., Gelman, A., & Vehtari, A. (2021). Pathfinder: Parallel quasi-newton variational inference. https://doi.org/10.48550/ARXIV.2108.03782

--- title: "Lecture 05" subtitle: "Bayesian Model fit and comparison" author: "Jihong Zhang" institute: "Educational Statistics and Research Methods" title-slide-attributes: data-background-image: ../Images/title_image.png data-background-size: contain data-background-opacity: "0.9" execute: echo: true eval: true warning: false format: html: code-tools: true code-line-numbers: false code-fold: false code-summary: 'Click here to see R code' number-offset: 1 fig.width: 10 fig-align: center message: false revealjs: output-file: slides-index.html logo: ../Images/UA_Logo_Horizontal.png incremental: true # choose "false "if want to show all together theme: [serif, ../pp.scss] footer: <https://jihongzhang.org/posts/2024-01-12-syllabus-adv-multivariate-esrm-6553> transition: slide background-transition: fade slide-number: true chalkboard: true number-sections: false code-line-numbers: true code-link: true code-annotations: hover code-copy: true highlight-style: arrow code-block-border-left: true code-block-background: "#b22222" bibliography: references.bib --- ## Today's Lecture Objectives 1. Bayesian methods for determining how well a model fits the data (absolute fit) 2. Bayesian methods for determining which model fits better (relative model fit) 3. In previous class... 1. We estimated the empty model and the full model (you can try other constrained model between the empty and full model) - Empty model: a model without any covariate (predictors) - Full model: a model with all possible covariates and up-to highest interaction effects 2. We also make the Stan code more efficient by vectorizing the parameters and the data 3. The next question is **how to determine the model is "good" enough** - "good" has multiple meanings - a typical criteria is to what degree the model can generate a "simulated" data that is similar to the original data - or, the likelihood of original data given the parameters of model ## Types of Model fit - Absolute model fit - PPMC - In SEM, RMSEA, chi-square, SRMR, and GFI - Relative model fit - information criterion - Incremental model fit (not frequently used other than SEM) - A special type of absolute model fit - how a model fits to a saturated model - In SEM, comparative fit index (CFI), Tucker-Lewis index (TLI) ## Absolute Model Fit: PPMC **Posterior predictive model checking** [@gelman1996] is a Bayesian method evaluation technique for determining if a model fits the data. - Absolute model fit: "Does my model fit my data well?" - Recall that "model is a simplified version of the true data-generation process" - Thus, the model should be able to reproduce the "data" that is similar to observed data - In machine learning, this is also called "validation", typically used as a separate "validation data sets" - Overall idea: if a model fits the data well, then simulated data based on the model will resemble the observed data ------------------------------------------------------------------------ ### Ingredients in PPMC: ::: nonincremental - Original data - Typically, characterized by some set of statistics (i.e., sample mean, standard deviation, covariance) applied to data - Simulated data ::: nonincremental - Generated from partial/all posterior draws in the Markov Chain - Summarized by the same set of statistics ::: ::: ![](../Images/Lecture5_PPMC_DataSets.png){fig-align="center" width="600"} ------------------------------------------------------------------------ ## PPMC Example: Linear Models The linear model from our example was: $$ \text{WeightLB}_p = \beta_0 + \beta_1 \text{HeightIN}_p + \beta_2 \text{Group2}_p + \beta_3\text{Group3}_p \\ +\beta_4 \text{HeightIN}_p\text{Group2}_p \\ +\beta_5 \text{HeightIN}_p\text{Group3}_p \\ + e_p $$ with: ::: nonincremental - $\text{Group2}_p$ the binary indicator of person $p$ being in group 2 - $\text{Group}3_p$ the binary indicator of person $p$ being in group 3 - $e_p \sim N(0, \sigma_e)$ ::: ```{r} #| echo: false load(here::here("teaching/2024-01-12-syllabus-adv-multivariate-esrm-6553/Lecture05/Code", "Lecture05.RData")) ``` ```{stan filename = "FullModel_New.stan", output.var="display", eval=FALSE, tidy=FALSE} data{ int<lower=0> N; // number of observations int<lower=0> P; // number of predictors (plus column for intercept) matrix[N, P] X; // model.matrix() from R vector[N] weightLB; // outcome real sigmaRate; // hyperparameter: prior rate parameter for residual standard deviation } parameters { vector[P] beta; // vector of coefficients for Beta real<lower=0> sigma; // residual standard deviation } model { sigma ~ exponential(sigmaRate); // prior for sigma weightLB ~ normal(X*beta, sigma); // linear model } ``` ```{r} #| eval: true fit_full_new$summary(variables = c('beta', 'sigma')) ``` __MLE Method:__ ```{r} dat <- read.csv(here::here("teaching", "2024-01-12-syllabus-adv-multivariate-esrm-6553", "Data", "DietData.csv")) dat$DietGroup <- factor(dat$DietGroup, levels = 1:3) dat$HeightIN60 <- dat$HeightIN - 60 res <- summary(lm(WeightLB ~ DietGroup*HeightIN60, data=dat)) round(res$coefficients, 3) sd(resid(res)) # residual error ``` ------------------------------------------------------------------------ ### PPMC Process The PPMC Process can be summarized as follows: 1. Select parameters from a single (sampling) iteration of the Markov Chain 2. Using the selected parameters and the model to simulate a data set with the sample size (with same number of observations/variables) 3. From the simulated data set, calculate selected summary statistics (e.g., mean, sd) 4. Repeat steps 1-3 for a fixed number of iterations (perhaps across the whole chain) 5. When done, compare position of observed summary statistics to that of the distribution of summary statistics from simulated data sets (predictive distribution) ------------------------------------------------------------------------ ### Step 1: Assemble the posterior draws For our linear model, let's denote our observed dependent variable as $Y$ - Note that independent variables are not modeled (not explained by statistical formula), so we cannot examine them. First, let's assemble the posterior draws ($1000 \times 4 \times 7$): ```{r} posteriorSample = fit_full_new$draws(variables = c('beta','sigma'), format = 'draws_matrix') posteriorSample ``` ------------------------------------------------------------------------ ### Step 2: One draw from posterior Next, let's draw one set of posterior values of parameters at random with replacement: ```{r} set.seed(1234) sampleIteration = sample(x = 1:nrow(posteriorSample), size = 1, replace = TRUE) sampleIteration posteriorSample[sampleIteration,] ``` Those draws of $\boldsymbol{\beta}$ and $\sigma$ can be used to generate a predicted $\hat{y}$: $$ \hat{y}_i \sim N(\mathbf{X}\boldsymbol{\beta}_i, \sigma_i) $$ - $i$ represents the $i$ th draw ------------------------------------------------------------------------ ### Step 3: One simulated data We then generate data based on this sampled iteration and our model distributional assumption: ```{r} # Sample Size N = nrow(dat) # beta betaVector = matrix(posteriorSample[sampleIteration, 1:6], ncol = 1) betaVector # sigma sigma = posteriorSample[sampleIteration, 7] # X FullModelFormula = as.formula("WeightLB ~ HeightIN60 + DietGroup + HeightIN60*DietGroup") X = model.matrix(FullModelFormula, data = dat) simY = rnorm(n = N, mean = X %*% betaVector, sd = sigma) head(simY) ``` ------------------------------------------------------------------------ ### Step 4: Summary Statistics of Simulated Data Note that we do not want to directly compare simulated data to the observed data. Instead, we extract some characteristics of simulated/observed data for comparison using summary statistics. There are some advantages: 1. We may have research interests in only some characteristics of data (whether our model predict the mean of dependent variables) 2. We can QC detailed aspects of the fitting process of model In this case, for example, we may be interested in whether the model captures the "location" or "scale" of WeightLB ```{r} mean(dat$WeightLB) (simMean = mean(simY)) sd(dat$WeightLB) (simSD = sd(simY)) ``` ------------------------------------------------------------------------ ### Step 5: Looping across all posterior samples We can repeat step 2-4 for a set number of samples Optionally, we can choose to use up all iterations we have for Markov Chains ($I = 4000$) in practice ```{r} #| output-location: slide I = nrow(posteriorSample) ## create empty simSD and simMean as placeholders simSD = simMean = rep(NA, I) for (i in 1:I) { # beta betaVector = matrix(posteriorSample[i, 1:6], ncol = 1) # sigma sigma = posteriorSample[i, 7] # X simY = rnorm(n = N, mean = X %*% betaVector, sd = sigma) simMean[i] = mean(simY) simSD[i] = sd(simY) } par(mfrow = 1:2) hist(simMean) hist(simSD) ``` ------------------------------------------------------------------------ ### Compare to the observed mean ::: columns ::: {.column width="50%"} We can now compare our observed mean and standard deviation with that of the sample values. - Blue line: the average value of predicted WeightLB - Red line: the observed mean value of WeightLB - The PDF of predictive values of summary statistics of WeightLB is called `posterior predictive distribution` ::: ::: {.column width="50%"} PPMC can be checked using visual inspection: ```{r} library(ggplot2) ppp <- data.frame( simMean = simMean, simSD = simSD ) ggplot(ppp) + geom_histogram(aes(x = simMean), fill = "grey", col = "black") + geom_vline(xintercept = mean(dat$WeightLB), col = "red", size = 1.4) + # red line: location of mean of predicted WeightLB by model geom_vline(xintercept = mean(simMean), col = "blue", size = 1.4, alpha = 0.5) # blue line: location of mean of WeightLB ``` ::: ::: ------------------------------------------------------------------------ ### Compare to the observed SD ::: columns ::: {.column width="50%"} Similarly, let's compare SD to the posterior predictive distribution of SD of WeightLB - the observed SD is located as the center of posterior predictive distribution (PPD) - the average mean of PPD is slightly higher than the observed SD ::: ::: {.column width="50%"} ```{r} ggplot(ppp) + geom_histogram(aes(x = simSD), fill = "grey", col = "black") + geom_vline(xintercept = sd(dat$WeightLB), col = "red", size = 1.4) + # red line: location of mean of predicted WeightLB by model geom_vline(xintercept = mean(simSD), col = "blue", size = 1.4, alpha = 0.5) # blue line: location of mean of WeightLB ``` ::: ::: ------------------------------------------------------------------------ ## PPMC Characteristics PPMC methods are very useful ::: nonincremental - They provide a visual way to determine if the model fits the observed data - They are the main method of assessing absolute fit in Bayesian models - Absolute fit assesses if a model fits the data instead of comparing to another model ::: But, there are some drawbacks to PPMC methods ::: nonincremental - Almost any statistic can be used - Some are better than others (mean and SD of outcomes are nice choices for linear regression) - No standard determining how much misfit is too much - May be overwhelming to compute depending on your model ::: **Question**: Can PPMC be used for models with maximum likelihood estimation or ordinary least squares? ------------------------------------------------------------------------ ## Posterior Predictive P-values ::: columns ::: {.column width="50%"} We can summarize the PPMC using a type of "p-value" > Personally, I don't like the name "p-value", sounds like we are trying to justify our results using significance testing Different from the frequentist "p-value" (if the null hypothesis is true, the probability of the observed data existing) - The PPP-value: the proportion of times the statistic from the simulated data exceeds that of the observed data - Useful to determine how far off a statistic is from its posterior predictive distribution ::: ::: {.column width="50%"} If these p-values were: 1. near 0 or 1, indicating your model is far off your data 2. near .5, indicating your model fits your data in terms of the statistics you examined The PPP-value for mean: ```{r} mean(simMean > mean(dat$WeightLB)) ``` The PPP-value for SD: ```{r} mean(simSD > sd(dat$WeightLB)) ``` ::: ::: ------------------------------------------------------------------------ ## Compute PPP-values within Stan We can use the `generated quantities` block of Stan to compute PPP-values for us: ```{stan output.var='display', eval = FALSE, tidy = FALSE} generated quantities{ // simulated data array[N] real weightLB_rep = normal_rng(X*beta, sigma); // posterior predictive distribution for mean and SD real mean_weightLB = mean(weightLB); real sd_weightLB = sd(weightLB); real mean_weightLB_rep = mean(to_vector(weightLB_rep)); real<lower=0> sd_weightLB_rep = sd(to_vector(weightLB_rep)); // ppp-values for mean and sd int<lower=0, upper=1> ppp_mean = (mean_weightLB_rep > mean_weightLB); int<lower=0, upper=1> ppp_sd = (sd_weightLB_rep > sd_weightLB); } ``` It will give us: ```{r} fit_full_ppp$summary(variables = c('mean_weightLB_rep', 'sd_weightLB_rep', 'ppp_mean', 'ppp_sd')) ``` ------------------------------------------------------------------------ ## Advantages or disadvantages of Computing PPP within Stan ::: columns ::: {.column width="60%"} Pros: 1. Built-in functions of Stan to generate simulated data for example `normal_rng()`, making PPP-values estimated much faster 2. Nice visual inspection tools existed - `bayesplot` Cons: 1. Not allowed to debug each step in PPMC if something wrong 2. Cannot adjust the statistics and need to re-run the whole MCMC sampling, which is time-consuming ::: ::: {.column width="40%"} ```{r} #| fig-cap: 'Posterior predictive distribution for mean of weight by chains' bayesplot::mcmc_dens_chains(fit_full_ppp$draws('mean_weightLB_rep')) ``` ::: ::: ------------------------------------------------------------------------ ## Relative Model Fit Relative model fit: used to compare 2 or more competing models in terms of their mode fit. Sometime, it is also called model selection. - In non-Bayesian models, Information Criteria are often used to make comparisons - AIC, BIC, DIC etc. - Typically IC is a function of log-likelihood and penalty - The model with the lowest IC is the model that fits best - Bayesian model fit is similar - Uses an index value - The model with the lowest index is the model that fits best - Recent advances in Bayesian model fit use indices that are tied to make cross-validation predictions (inspired by machine learning): - Fit model leaving one observation out (LOO) - Calculate statistics related to prediction (for instance, log-likelihood of that observation conditional on model parameters) - Do for all observations - New Bayesian indices try to mirror these leave-one-out predictions (but approximate these due to time constraints) ------------------------------------------------------------------------ ## Deviance Information Indices (DIC) When late 1990s and early 2000s, the **Deviance Information Criterion** was popular for relative Bayesian model fit comparisons. It is proved not as good as LOO or WAIC. But let's have a look at: $$ \text{DIC} = p_D + \overline{D(\theta)} $$ where $p_D$ is the estimated number of parameters as follows: $$ p_D = \overline{D(\theta)} - D(\bar\theta) $$ and where $$ D(\theta) = -2 \log(p(y \mid \theta)) + C $$ C is a constant that cancels out when model comparisons are made Here. - $\overline{D(\theta)}$ is the average log likelihood of the data (y) given the parameters ($\theta$) computed across all samples - $D(\bar\theta)$ is the log likelihood of the data (y) computed at the average of the parameters ($\theta$) computed across all samples ------------------------------------------------------------------------ ### DIC in Stan Some program like JAGS will have provided DIC directly, but Stan does not provides direct ways of calculating DIC: We can manually calculate approximated DIC by replacing $\bar\theta$ as median of posteriors $\dot\theta$ if the posterior distributions are almost normal. ::: columns ::: {.column width="50%"} **DIC for Full Model** ```{r} # extract log-likelihood and parameters posterior_draws <- fit_full_ppp$draws(c('log_lik', 'beta'), format = 'draws_matrix') log_like_draws <- rowSums(posterior_draws[,colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")]) # find draws of loglikelihood that have parameters equaling to median values theta_draws <- posterior_draws[, colnames(posterior_draws) %in% paste0("beta[", 1:6, "]")] theta_ave_iteration = apply(round(theta_draws, 3), 2, \(x) which(x == median(x))) |> unlist() |> as.numeric() log_like_draws_meantheta = (-2)*mean(log_like_draws[theta_ave_iteration]) mean_log_like_draws = (-2)*mean(log_like_draws) # compute DIC DIC_full = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws DIC_full ``` ::: ::: {.column width="50%"} **DIC for Empty Model** ```{r} # extract log-likelihood and parameters posterior_draws <- fit_empty_ppp$draws(c('log_lik', 'beta0'), format = 'draws_matrix') log_like_draws <- rowSums(posterior_draws[,colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")]) # find draws of loglikelihood that have parameters equaling to median values theta_draws <- posterior_draws[, colnames(posterior_draws) == "beta0"] theta_ave_iteration = which(round(theta_draws,2) == round(median(theta_draws,2))) log_like_draws_meantheta = (-2)*mean(log_like_draws[theta_ave_iteration]) mean_log_like_draws = (-2)*mean(log_like_draws) # compute DIC DIC_empty = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws DIC_empty ``` ::: ::: ------------------------------------------------------------------------ ### Issues to DIC The DIC has fallen out of favor recently (not used in Stan) ::: nonincremental - Has issues when parameters are discrete - Not fully Bayesian (point estimate of average of parameter values) - Can give negative values for estimated numbers of parameters in a model ::: **WAIC** (Widely applicable or Watanabe-Akaike information criterion, Watanabe, 2010) corrects some of the problems with DIC: ::: nonincremental - Fully Bayesian (uses entire posterior distribution) - Asymptotically equal to Bayesian cross-validation - Invariant to parameterization ::: ------------------------------------------------------------------------ ## Watanabe-Akaike information criterion (WAIC) ::: nonincremental - A more frequently used model comparison indices for Bayesian analysis - Using the `loo` R package, we can calculate WAIC with the `waic()` function: ::: ::: columns ::: {.column width="50%"} **WAIC for the full model** ```{r} loo::waic(fit_full_ppp$draws('log_lik')) ``` ::: ::: {.column width="50%"} **WAIC for the empty model** ```{r} loo::waic(fit_empty_ppp$draws('log_lik')) ``` ::: ::: ::: nonincremental - Here: - `elpd_waic` is the expected log pointwise predictive density for WAIC - `p_waic` is the WAIC-version estimated number of parameter, similar to $p(D)$ in DIC, which is a penalty to the likelihood for more parameters - `waic` is the WAIC index used for model comparisons (lowest value is best fitting; -2\*`elpd_waic`) - Note that WAIC needs a `log_lik` variable in the model analysis to be calculated correctly. `cmdstan` will automate calculate this variable. ::: ------------------------------------------------------------------------ ## LOO: Approximation to Leave-one-out Big picture: ::: nonincremental 1. Besides WAIC, other comparative fit indices include LOO via Pareto Smoothed Important Sampling (PSIS) via Stan's `LOO` package 2. WAIC/LOO can be used for model comparison with lowest value suggesting better model fit 3. Different from DIC, LOO via PSIS attempts to "approximate" the process of leave-one-out cross-validation (LOO-CV) using a sampling based-approach ::: nonincremental - Gives a finite-sample approximation - Implemented in Stan - Can quickly compare models - Gives warnings when it may be less reliable to use ::: 4. The details of computation of LOO are very technical, but are nicely compiled in Vehtari, Gelman, and Gabry (2017). ::: ------------------------------------------------------------------------ ### LOO: Leave-one-out in Stan Using `loo` package, we can calculate efficient approximate leave-one-out cross-validation (LOO-CV) ::: columns ::: {.column width="50%"} Full model's LOO: ```{r} #| echo: false library(loo) ``` ```{r} #| warning: false full_loo_res <- fit_full_ppp$loo('log_lik', save_psis = TRUE, cores = 4) full_loo_res$estimates ``` ::: ::: {.column width="50%"} Empty model's LOO: ```{r} empty_loo_res <- fit_empty_ppp$loo('log_lik', save_psis = TRUE, cores = 4) empty_loo_res$estimates ``` ::: ::: Here: - `elpd_loo` is the expected log pointwise predictive density for LOO (recall that posterior predictive distribution has some uncertainty around the mean value...) - `p_loo` is the LOO calculation of number of model parameters (a penalty to the likelihood for more parameters) - `looic` is the LOO index used for model comparisons — lowest value suggests best fitting -2`elpd_loo` ------------------------------------------------------------------------ ### LOO: Comparing Model with LOO Alternative, you can use the built-in function `loo_compare` in `loo` package to compare alternative models: ```{r} loo::loo_compare(list(empty = fit_empty_ppp$loo(), full = fit_full_ppp$loo())) ``` This function calculats the standard error of the difference in `elpd` (expected log pointwise predictive density) between models ::: nonincremental - The SE gives an indication of the standard error in the estimate (relative to the size) - Can use this to downweight the choice of models when the standard error is high (Open Question: how high is high?) - Note: `elpd_diff` is `looic` divided by -2 (on the log-likelihood scale, not the deviance scale) - Here, we interpret the results as the model with full predictors is preferred to the model with empty predictors. - The "confidence interval" of the `elpd_diff` does not include 0, indicating we can be fairly certain of this result ::: $$ 0 \in \mathrm{elpd\_diff} \pm 1.96 \cdot \mathrm{se\_diff} $$ ------------------------------------------------------------------------ ### LOO: Pareto smoothed importance sampling (PSIS) ::: nonincremental - Estimated a vector of Pareto shape parameters $k$ for individuals which represents the reliability of sampling: - $k < .5$ (good) suggests the estimate converges quickly - $.5 < k < .7$ (ok) suggests the estimate converges slowly - $.7 < k < 1$ (bad) suggests bad performance - $k > 1$ (very bad) - PSIS screens all cases that have bad diagnostic values. The percentage may tells some information regarding reliability of Bayesian estimation. - Criteria: Estimated tail shape parameter - This is new area, please see more references [@vehtari2016] ::: ```{r} pareto_k_table(full_loo_res) ``` ------------------------------------------------------------------------ ## Compile the relative model fit indices Below shows the all three model comparison fit indices for the empty model and the full model. ```{r} data.frame( Model = c("Full Model", "Empty Model"), DIC = c(DIC_full, DIC_empty), WAIC = c(WAIC_full, WAIC_empty), LOOIC = c(LOO_full, LOO_empty) ) ``` ::: nonincremental - All fit indices suggest the full model is better than the empty model - For simple models, the results of DIC/WAIC/LOOIC should be consistent - In some situations that they are inconsistent, please use WAIC/LOO as standards. ::: ------------------------------------------------------------------------ ## General points about Bayesian Model Comparison ::: nonincremental - Note, WAIC and LOO will converge as sample size increases (WAIC is asymptotic value of LOO) - Latent variable models present challenges (could be your dissertation project) - Need log likelihood with latent variable being integrated out - Missing data present challenges - Need log likelihood with missing data integrated out - Generally, using LOO is recommended (but providing both is appropriate) ::: ------------------------------------------------------------------------ ## Cutting-Edge Research Field: Approximation Algorithm ::: nonincremental - MCMC sampling could be very computational time-consuming. - For number of parameters up to thousands or even millions, it could be a challenge to get the estimation. - Thus, approximation algorithms has been developed, such as Laplace Approximation, Variational Inference [@dhaka2020; @yao2018], [Pathfinder methods](https://statmodeling.stat.columbia.edu/2021/08/10/pathfinder-a-parallel-quasi-newton-algorithm-for-reaching-regions-of-high-probability-mass/) [@zhang2021]. ::: Quoted from [Gelman's blog](https://statmodeling.stat.columbia.edu/2023/02/08/implementing-laplace-approximation-in-stan-whats-happening-under-the-hood/): > ::: columns > ::: {.column width="50%"} > **Advantages of Laplace:** > > ::: nonincremental > - Relatively cheap to compute, given that we already have a mode-finder in Stan. > - Easy to understand, use, and communicate. > - Works reasonably well in many examples (wherever the posterior can be well approximated by a normal distribution). > - Easy to take draws from the normal approximation, also easy to compute importance ratios and use [Pareto-smoothed importance sampling](http://www.stat.columbia.edu/~gelman/research/unpublished/psis4.pdf). > ::: > ::: > > ::: {.column width="50%"} > **Limitations of Laplace:** > > ::: nonincremental > - Sometimes the normal approx is pretty bad (funnels, multimodal distributions, long tails). > > - Sometimes the joint mode is useless or does not even exist (funnels, etc.), in which case the model itself would need to be altered in some way to get a stable mode. > ::: > ::: > ::: ```{r} #| message: false #| warning: false #| code-fold: false #| error: false #| eval: true # Stan's LBFGS algorithm fit_full_optim <- mod_full_ppp$optimize(data = data_full_new, seed = 1234, jacobian = TRUE, show_messages = FALSE) fit_full_laplace <- mod_full_ppp$laplace(data = data_full_new, mode = fit_full_optim, draws = 4000, show_messages = FALSE) # Run 'variational' method to use ADVI to approximate posterior fit_full_vb <- mod_full_ppp$variational(data = data_full_new, seed = 1234, draws = 4000, show_messages = FALSE) # Run 'pathfinder' method, a new alternative to the variational method fit_pf <- mod_full_ppp$pathfinder(data = data_full_new, seed = 1234, draws = 4000, show_messages = FALSE) ``` ```{r} output_compare <- data.frame( Bayesian = unlist(fit_full_ppp$summary(c('beta', 'sigma'))[, 2]), Optim = unlist(fit_full_optim$summary(c('beta', 'sigma'))[, 2]), Laplace = unlist(fit_full_laplace$summary(c('beta', 'sigma'))[, 2]), VB = unlist(fit_full_vb$summary(c('beta', 'sigma'))[, 2]), pathfinder = unlist(fit_pf$summary(c('beta', 'sigma'))[, 2] ) ) rownames(output_compare) <- c(paste0("beta", 0:5), "sigma") output_compare ``` ------------------------------------------------------------------------ ### Approximation Algorithm vs. Full MCMC Though they are very quicker to converge, the accuracy of these approximation algorithms largely depend on the complexity of modeling (number of parameters, prior distributions, latent variables etc.) ```{r} #| code-fold: true #| fig-cap: "Mean and 95% Credit Interval of aproximation algorithms and MCMC" summ_list <- lapply(c(fit_full_ppp, fit_full_laplace, fit_full_vb, fit_pf), \(x) x$summary(c("beta", "sigma"))[c('variable', 'mean',"q5", "q95")]) summ_list[[1]]$Algorithm = "MCMC" summ_list[[2]]$Algorithm = "Laplace Approx." summ_list[[3]]$Algorithm = "Variational Inference" summ_list[[4]]$Algorithm = "Pathfinder" summ_forplot <- Reduce(rbind, summ_list) summ_forplot$Algorithm = factor(summ_forplot$Algorithm, levels = c("MCMC", "Laplace Approx.", "Pathfinder", "Variational Inference")) ggplot(summ_forplot) + geom_col(aes(x = variable, y = mean, fill = Algorithm), position = position_dodge()) + geom_errorbar(aes(x = variable, ymax = q95, ymin = q5, fill = Algorithm),position = position_dodge2(padding = 0.6)) + labs(x = '', y = '') + theme_classic() ``` ------------------------------------------------------------------------ ```{r} library(tidyverse) rbind( cbind(data.frame(Algorithm = "Laplace Approx."), fit_full_laplace$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6]), cbind(data.frame(Algorithm = "Variational Inference"), fit_full_vb$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6]), cbind(data.frame(Algorithm = "Pathfinder"), fit_pf$draws(paste0("beta[", 1:6, "]"))[, 1:6]), cbind(data.frame(Algorithm = "MCMC"), fit_full_ppp$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6]) ) |> mutate(Algorithm = factor(Algorithm, levels = c("MCMC", "Laplace Approx.", "Pathfinder", "Variational Inference"))) |> pivot_longer(starts_with("beta"), names_to = "Parameter", values_to = "Draws") |> ggplot() + geom_density(aes(x = Draws, fill = Algorithm), alpha = 0.4) + facet_wrap(~Parameter, scales = "free") ``` ------------------------------------------------------------------------ ## Wrapping up ![](lec5_wrapup.svg) We used three lectures using linear models was built to show nearly all parts needed in a Bayesian analysis - MCMC specifications - Prior specifications - Assessing MCMC convergence - Reporting MCMC results - Determining if a model fits the data (absolute fit) - Determining which model fits the data better (relative fit) All of these topics will be with us when we start model complicated models in our future lecture. ------------------------------------------------------------------------ ## Next Class 1. Generalized measurement models ## Reference