Lecture 04: Linear Regression Model with Stan II – Jihong Zhang, Ph.D.

1 Today’s Lecture Objectives

Explain Stan syntax for the linear regression model
Compute functions of model parameters

Download R file DietDataExample2.R

2 In previous class…

R code for data read in

library(cmdstanr)
library(bayesplot)
library(tidyr)
library(dplyr)
library(kableExtra)
color_scheme_set('brightblue')
dat <- read.csv(here::here("teaching", "2024-01-12-syllabus-adv-multivariate-esrm-6553", "Lecture03", "Code", "DietData.csv"))
dat$DietGroup <- factor(dat$DietGroup, levels = 1:3)
dat$HeightIN60 <- dat$HeightIN - 60
kable( rbind(head(dat), tail(dat)) ) |>
  kable_classic_2() |>
  kable_styling(full_width = F, font_size = 15)

	Respondent	DietGroup	HeightIN	WeightLB	HeightIN60
1	1	1	56	140	-4
2	2	1	60	155	0
3	3	1	64	143	4
4	4	1	68	161	8
5	5	1	72	139	12
6	6	1	54	159	-6
25	25	3	70	259	10
26	26	3	52	201	-8
27	27	3	59	228	-1
28	28	3	64	245	4
29	29	3	65	241	5
30	30	3	72	269	12

Introduce the empty model
Example: Post-Diet Weights
- WeightLB (Dependent Variable): The respondents’ weight in pounds
- HeightIN: The respondents’ height in inches
- DietGroup: 1, 2, 3 representing the group to which a respondent was assigned
The empty model has two parameters to be estimated: (1) \beta_0, (2) \sigma_e
The posterior mean/median of \beta_0 should be close to the sample mean of WeightLB
The posterior mean/median of \sigma_e should be close to the sample standard deviation of WeightLB

3 Making `Stan` Code Short and Efficient

The Stan syntax from our previous model was lengthy:

A declared variable for each parameter
The linear combination of coefficients by multiplying predictors

Stan has built-in features to shorten syntax:

Matrices/Vectors
Matrix products
Multivariate distribution (initially for prior distributions)
Built-in Functions (sum() better than +=)

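Note: if you are interested in efficiency tuning in Stan, see the Efficiency Tuning chapter of the Stan User's Guide (https://mc-stan.org/docs/stan-users-guide/efficiency-tuning.html) for more details.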

4 Linear Models without Matrices

The linear model from our example was:

\text{WeightLB}_p = \beta_0 + \beta_1 \text{HeightIN}_p + \beta_2 \text{Group2}_p + \beta_3\text{Group3}_p \\ +\beta_4 \text{HeightIN}_p\text{Group2}_p \\ +\beta_5 \text{HeightIN}_p\text{Group3}_p \\ + e_p

with:

\text{Group2}_p the binary indicator of person p being in group 2
\text{Group3}_p the binary indicator of person p being in group 3
e_p \sim N(0, \sigma_e)

4.1 Path Diagram of the Full Model

graph LR;
  HeightIN60 --> WeightLB;
  DietGroup2 --> WeightLB;
  DietGroup3 --> WeightLB;
  HeightIN60xDietGroup2 --> WeightLB;
  HeightIN60xDietGroup3 --> WeightLB;

Figure 1

5 Linear Models with Matrices

Model (predictor) matrix with the size 30 (rows) \times 6 (columns)

\mathbf{X} = \begin{bmatrix}1 & -4 & 0 & 0 & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & 12 & 0 & 1 & 0 & 12 \end{bmatrix}

Coefficient vector with the size 6 (rows) \times 1 (column):

\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \end{bmatrix}

model.matrix creates a design (or model) matrix (X), e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly.

FullModelFormula = as.formula("WeightLB ~ HeightIN60 + DietGroup + HeightIN60*DietGroup")
model.matrix(FullModelFormula, data = dat) |> unique()

   (Intercept) HeightIN60 DietGroup2 DietGroup3 HeightIN60:DietGroup2
1            1         -4          0          0                     0
2            1          0          0          0                     0
3            1          4          0          0                     0
4            1          8          0          0                     0
5            1         12          0          0                     0
6            1         -6          0          0                     0
7            1          2          0          0                     0
8            1          5          0          0                     0
10           1         10          0          0                     0
11           1         -4          1          0                    -4
12           1          0          1          0                     0
13           1          4          1          0                     4
14           1          8          1          0                     8
15           1         12          1          0                    12
16           1         -6          1          0                    -6
17           1          2          1          0                     2
18           1          5          1          0                     5
20           1         10          1          0                    10
21           1         -6          0          1                     0
22           1         -2          0          1                     0
23           1          2          0          1                     0
24           1          6          0          1                     0
25           1         10          0          1                     0
26           1         -8          0          1                     0
27           1         -1          0          1                     0
28           1          4          0          1                     0
29           1          5          0          1                     0
30           1         12          0          1                     0
   HeightIN60:DietGroup3
1                      0
2                      0
3                      0
4                      0
5                      0
6                      0
7                      0
8                      0
10                     0
11                     0
12                     0
13                     0
14                     0
15                     0
16                     0
17                     0
18                     0
20                     0
21                    -6
22                    -2
23                     2
24                     6
25                    10
26                    -8
27                    -1
28                     4
29                     5
30                    12

6 Linear Models with Matrices (Cont.)

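We then rewrite the equation from

\text{WeightLB}_p = \beta_0 + \beta_1 \text{HeightIN}_p + \beta_2 \text{Group2}_p + \beta_3\text{Group3}_p + \beta_4 \text{HeightIN}_p\text{Group2}_p + \beta_5 \text{HeightIN}_p\text{Group3}_p + e_p

to: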

\mathbf{WeightLB} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}

Where:

\mathbf{WeightLB} is the vector of outcome (N \times 1)
\mathbf{X} is the model (predictor) matrix (N \times P for P - 1 predictors)
\boldsymbol{\beta} is the coefficients vector (P \times 1)
\mathbf{e} is the vector for residuals (N \times 1)
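As a quick check of the matrix form, the sketch below builds a tiny model matrix by hand and confirms that the matrix product reproduces the scalar equation row by row. The names toy_X and toy_beta and all values in them are made up for illustration; they are not taken from the diet data.

```r
# Hypothetical toy example: 3 people, intercept + one predictor + one dummy
toy_X <- cbind(
  intercept = 1,
  height60  = c(-4, 0, 12),
  group2    = c(0, 1, 1)
)
toy_beta <- c(150, 2, -20)  # made-up coefficients

# Matrix form: one product gives all predicted values
pred_matrix <- as.vector(toy_X %*% toy_beta)

# Scalar form: the same sum written out term by term
pred_scalar <- toy_beta[1] +
  toy_beta[2] * toy_X[, "height60"] +
  toy_beta[3] * toy_X[, "group2"]

all.equal(pred_matrix, unname(pred_scalar))
```

Both forms give the same predicted values; the matrix form simply avoids writing one term per predictor.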

6.1 Example: Predicted Values and \text{R}^2

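Similar to a Monte Carlo simulation, given the model matrix \mathbf{X} and the coefficient vector \boldsymbol{\beta}, we can compute the predicted values \mathbf{X}\boldsymbol{\beta} and, from them, \text{R}^2.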

Click here to see R code

set.seed(1234)
fit_lm <- lm(formula = FullModelFormula, data = dat)

beta = coefficients(fit_lm)
P = length(beta)
X = model.matrix(FullModelFormula, data = dat)
head(X%*%beta)

Calculating R^2 and adjusted R^2:

rss = crossprod(dat$WeightLB - X%*%beta) # residual sum of squares
tss = crossprod(dat$WeightLB - mean(dat$WeightLB)) # total sum of squares
R2 = 1 - rss / tss
R2.adjust = 1 - (rss/(nrow(dat)-P)) / (tss/((nrow(dat)-1)))
data.frame(
  R2, # r-square
  R2.adjust # adjusted. r-square
)

         R2 R2.adjust
1 0.9786724 0.9742292

lm function: R^2 and adjusted R^2:

summary_lm <- summary(fit_lm)
summary_lm$r.squared

[1] 0.9786724

summary_lm$adj.r.squared

[1] 0.9742292

7 Vectorize prior distributions

Previously, we defined a normal prior distribution for each regression coefficient:

\beta_0 \sim normal(0, 1) \\ \vdots \\ \beta_p \sim normal(0, 1)

They are all univariate normal distributions
Issue: each parameter has a prior that is independent of the other parameters, so the prior correlation between betas is fixed at zero and cannot be changed.

For example, the code below shows that two betas drawn from independent univariate normal distributions have a sample correlation near zero (r = -0.025)

set.seed(1234)
beta0 = rnorm(100, 0, 1)
beta1 = rnorm(100, 0, 1)
cor(beta0, beta1)

[1] -0.02538285

8 Vectorize prior distributions (Cont.)

When combining all parameters into a vector, a natural extension is a multivariate normal distribution, so that the betas have a pre-defined correlation strength

The syntax shows the two betas generated by the multivariate normal distribution with correlation of .5

set.seed(1234)
sigma_of_betas = matrix(c(1, 0.5, 0.5, 1), ncol = 2)
betas = mvtnorm::rmvnorm(100, mean = c(0, 0), sigma = sigma_of_betas)
beta0 = betas[,1]
beta1 = betas[,2]
cor(beta0, beta1)

[1] 0.5453899

Back to the Stan code: if we want more control over the prior distribution of the betas, we need to specify:

Mean vector of betas (meanBeta; size P \times 1)
- Put all prior means for those coefficients into a vector
Covariance matrix for betas (covBeta; size P \times P)
- Put all prior variances on the diagonal and zeros off the diagonal, because we are unsure about the potential correlation between betas
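A minimal sketch of how these two objects could be built in R before being passed to Stan. The prior mean of 0 and prior SD of 100 are assumptions chosen for illustration. On the Stan side, they would presumably be declared in the data block and used in a statement such as beta ~ multi_normal(meanBeta, covBeta), which is not part of the model shown in this lecture.

```r
P <- 6                           # number of columns in the model matrix
meanBeta <- rep(0, P)            # prior means, all zero (assumed)
prior_sd <- 100                  # assumed prior SD for every coefficient
covBeta  <- diag(prior_sd^2, P)  # variances on the diagonal, zeros elsewhere

# The implied prior correlation matrix is the identity: no correlation assumed
cov2cor(covBeta)
```

Placing a nonzero value in an off-diagonal cell of covBeta is how a prior correlation between two coefficients would be introduced.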

9 Syntax Changes: Data Section

Old syntax without matrix:

data{
    int<lower=0> N;
    vector[N] weightLB;
    vector[N] height60IN;
    vector[N] group2;
    vector[N] group3;
    vector[N] heightXgroup2;
    vector[N] heightXgroup3;
}

New syntax with matrix:

data{
  int<lower=0> N;         // number of observations
  int<lower=0> P;         // number of predictors (plus column for intercept)
  matrix[N, P] X;         // model.matrix() from R
  vector[N] weightLB;     // outcome
  real sigmaRate;         // hyperparameter: rate parameter for residual standard deviation
}

10 Syntax Changes: Parameters Section

Old syntax without matrix:

parameters {
  real beta0;
  real betaHeight;
  real betaGroup2;
  real betaGroup3;
  real betaHxG2;
  real betaHxG3;
  real<lower=0> sigma;
}

New syntax with matrix:

parameters {
  vector[P] beta;         // vector of coefficients for Beta
  real<lower=0> sigma;    // residual standard deviation
}

11 Syntax Changes: Prior Distributions Definition

Old syntax without matrix:

model {
  beta0 ~ normal(0,100);
  betaHeight ~ normal(0,100);
  betaGroup2 ~ normal(0,100);
  betaGroup3 ~ normal(0,100);
  betaHxG2 ~ normal(0,100);
  betaHxG3 ~ normal(0,100);
  sigma ~ exponential(.1); // prior for sigma
  weightLB ~ normal(
    beta0 + betaHeight * height60IN + betaGroup2 * group2 +
    betaGroup3*group3 + betaHxG2*heightXgroup2 +
    betaHxG3*heightXgroup3, sigma);
}

New syntax with matrix:

multi_normal() is the multivariate normal distribution in Stan, similar to rmvnorm() in R; for an uninformative (uniform) prior on beta, we do not need to specify it, so it does not appear in the model block below
exponential() is the exponential distribution in Stan, similar to rexp() in R

model {
  sigma ~ exponential(sigmaRate);         // prior for sigma
  weightLB ~ normal(X*beta, sigma);       // linear model
}

11.1 A little more about exponential distribution

The mean of the exponential distribution is \frac{1}{\lambda}, where \lambda is called the rate parameter
The variance of the exponential distribution is \frac{1}{\lambda^2}
It is always positively skewed (the skewness is 2 for any \lambda)
Question: which value of the rate hyperparameter \lambda is most informative, and which is least informative?
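The mean and variance formulas can be checked numerically, which also hints at the answer to the question. The sketch below integrates the density for a few example rates; the rate values are assumptions chosen for illustration.

```r
rates <- c(0.1, 0.5, 1)  # example rate values (assumed for illustration)

exp_moments <- t(sapply(rates, function(lambda) {
  # First and second moments by numerical integration of the density
  m1 <- integrate(function(x) x   * dexp(x, rate = lambda), 0, Inf)$value
  m2 <- integrate(function(x) x^2 * dexp(x, rate = lambda), 0, Inf)$value
  c(rate = lambda,
    mean = m1, variance = m2 - m1^2,
    mean_formula = 1 / lambda, variance_formula = 1 / lambda^2)
}))
exp_moments
```

A smaller rate gives a larger mean and variance, so the prior is spread over a wider range of sigma values and is less informative; a larger rate concentrates the prior near zero.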

Click here to see R code

library(tidyr)
library(dplyr)
library(ggplot2)
rate_list = seq(0.1, 1, 0.2)
pdf_points = sapply(rate_list, \(x) dexp(seq(0, 20, 0.01), x)) |> as.data.frame()
colnames(pdf_points) <- rate_list
pdf_points$x = seq(0, 20, 0.01)
pdf_points %>%
  pivot_longer(-x, values_to = 'y') %>%
  mutate(
    sigmaRate = factor(name, levels = rate_list)
    ) %>%
  ggplot() +
  geom_path(aes(x = x, y = y, color = sigmaRate, group = sigmaRate), size = 1.2) +
  scale_x_continuous(limits = c(0, 20)) +
  labs(x = "Sigma")

Figure 2: PDF for the exponential distribution by varied rate parameters

11.2 Since we talked about Exponential distribution…

Let’s dive deeper into the Laplace distribution, sometimes called the double-exponential distribution. The exponential distribution has the same shape as the positive half of a Laplace distribution centered at zero.

\text{PDF}_{exp.} = \lambda e^{-\lambda x}

\text{PDF}_{laplace} = \frac{1}{2b} e^{-\frac{|x - u|}{b}}

Thus, for x > 0, the exponential density has the same shape as a Laplace density with scale parameter b = \frac{1}{\lambda} and location parameter u = 0; it is exactly twice that Laplace density, because the Laplace distribution spreads its probability over both sides of zero.
Laplace-based, Cauchy, and horseshoe distributions all belong to the so-called “shrinkage” priors.

Shrinkage priors will be very useful for high-dimensional data (say P = 1000) and variable selection
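The relation between the two densities can be verified directly. This sketch uses a hand-written Laplace density (dlaplace_manual is a hypothetical helper defined here so that no extra package is needed) and an arbitrary rate of 0.5.

```r
# Hand-written Laplace density: (1 / (2b)) * exp(-|x - u| / b)
dlaplace_manual <- function(x, location = 0, scale = 1) {
  exp(-abs(x - location) / scale) / (2 * scale)
}

lambda <- 0.5                        # arbitrary rate for illustration
x_pos  <- seq(0.01, 20, by = 0.01)   # positive values only

# On x > 0, the exponential density equals twice the Laplace density
# with location 0 and scale b = 1 / lambda
all.equal(
  dexp(x_pos, rate = lambda),
  2 * dlaplace_manual(x_pos, location = 0, scale = 1 / lambda)
)
```

The factor of 2 appears because the Laplace density puts half of its probability on each side of the location parameter.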

Click here to see R code

library(LaplacesDemon)
b_list = 1 / rate_list * 2
pdf_points = sapply(b_list, \(x) dlaplace(seq(-20, 20, 0.01), scale = x, location = 0)) |> as.data.frame()
colnames(pdf_points) <- round(b_list, 2)
pdf_points$x = seq(-20, 20, 0.01)
pdf_points %>%
  pivot_longer(-x, values_to = 'y') %>%
  mutate(
    scale = factor(name, levels = round(b_list, 2))
    ) %>%
  ggplot() +
  geom_path(aes(x = x, y = y, color = scale, group = scale), size = 1.2) +
  scale_x_continuous(limits = c(-20, 20)) +
  labs(x = "")

12 Compare results and computational time

Click here to see R code

code_path <- here::here("teaching", "2024-01-12-syllabus-adv-multivariate-esrm-6553", "Lecture04", "Code")
mod_full_old <- cmdstan_model(paste0(code_path, "/FullModel_Old.stan"))
data_full_old <- list(
  N = nrow(dat),
  weightLB = dat$WeightLB,
  height60IN = dat$HeightIN60,
  group2 = as.numeric(dat$DietGroup == 2),
  group3 = as.numeric(dat$DietGroup == 3),
  heightXgroup2 = as.numeric(dat$DietGroup == 2) * dat$HeightIN60,
  heightXgroup3 = as.numeric(dat$DietGroup == 3) * dat$HeightIN60
)
fit_full_old <- mod_full_old$sample(
  data = data_full_old,
  seed = 1234,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)

fit_full_old$summary()[, -c(9, 10)]

# A tibble: 8 × 8
  variable      mean  median    sd   mad     q5     q95  rhat
  <chr>        <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>
1 lp__       -76.7   -76.3   2.17  2.00  -80.8  -73.8    1.00
2 beta0      148.    148.    3.26  3.20  142.   153.     1.00
3 betaHeight  -0.370  -0.378 0.495 0.488  -1.17   0.437  1.00
4 betaGroup2 -24.2   -24.1   4.59  4.52  -31.6  -16.4    1.00
5 betaGroup3  81.2    81.1   4.31  4.36   74.6   88.5    1.00
6 betaHxG2     2.46    2.45  0.690 0.678   1.34   3.59   1.00
7 betaHxG3     3.56    3.55  0.658 0.648   2.48   4.66   1.00
8 sigma        8.26    8.10  1.23  1.13    6.52  10.5    1.00

Click here to see R code

mod_full_new <- cmdstan_model("Code/FullModel_New.stan")
FullModelFormula = as.formula("WeightLB ~ HeightIN60 + DietGroup + HeightIN60*DietGroup")
X = model.matrix(FullModelFormula, data = dat)
data_full_new <- list(
  N = nrow(dat),
  P = ncol(X),
  X = X,
  weightLB = dat$WeightLB,
  sigmaRate = 0.1
)
fit_full_new <- mod_full_new$sample(
  data = data_full_new,
  seed = 1234,
  chains = 4,
  parallel_chains = 4
)

fit_full_new$summary()[, -c(9, 10)]

# A tibble: 8 × 8
  variable    mean  median    sd   mad     q5     q95  rhat
  <chr>      <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>
1 lp__     -76.6   -76.3   2.15  2.00  -80.6  -73.9    1.00
2 beta[1]  148.    148.    3.27  3.23  142.   153.     1.00
3 beta[2]   -0.374  -0.372 0.483 0.479  -1.16   0.414  1.00
4 beta[3]  -24.1   -24.2   4.59  4.45  -31.6  -16.5    1.00
5 beta[4]   81.3    81.3   4.44  4.40   74.1   88.5    1.00
6 beta[5]    2.47    2.48  0.683 0.676   1.32   3.56   1.00
7 beta[6]    3.57    3.57  0.646 0.643   2.52   4.62   1.00
8 sigma      8.25    8.10  1.26  1.20    6.50  10.5    1.00

Note that if you omit an explicit prior for a parameter it implies a uniform prior across that parameter’s domain.

mod_full_new$print(line_numbers = TRUE)

 1: data{
 2:   int<lower=0> N;         // number of observations
 3:   int<lower=0> P;         // number of predictors (plus column for intercept)
 4:   matrix[N, P] X;         // model.matrix() from R 
 5:   vector[N] weightLB;     // outcome
 6:   real sigmaRate;         // hyperparameter: prior rate parameter for residual standard deviation
 7: }
 8: parameters {
 9:   vector[P] beta;         // vector of coefficients for Beta
10:   real<lower=0> sigma;    // residual standard deviation
11: }
12: model {
13:   sigma ~ exponential(sigmaRate);         // prior for sigma
14:   weightLB ~ normal(X*beta, sigma);       // linear model
15: }
16:

mod_full_old$print(line_numbers = TRUE)

 1: data{
 2:     int<lower=0> N;
 3:     vector[N] weightLB;
 4:     vector[N] height60IN;
 5:     vector[N] group2;
 6:     vector[N] group3;
 7:     vector[N] heightXgroup2;
 8:     vector[N] heightXgroup3;
 9: }
10: parameters {
11:   real beta0;
12:   real betaHeight;
13:   real betaGroup2;
14:   real betaGroup3;
15:   real betaHxG2;
16:   real betaHxG3;
17:   real<lower=0> sigma;
18: }
19: model {
20:   sigma ~ exponential(.1); // prior for sigma
21:   weightLB ~ normal(
22:     beta0 + betaHeight * height60IN + betaGroup2 * group2 + 
23:     betaGroup3*group3 + betaHxG2*heightXgroup2 +
24:     betaHxG3*heightXgroup3, sigma);
25: }
26:

The differences between the two methods:

betaGroup3 shows the largest differences between the two methods

cbind(fit_full_old$summary()[,1], fit_full_old$summary()[, -c(1, 9, 10)] - fit_full_new$summary()[, -c(1, 9, 10)])

    variable         mean     median            sd          mad        q5
1       lp__ -0.007672400  0.0063000  0.0242926098  0.007931910 -0.115880
2      beta0  0.071524500  0.1155000 -0.0135811270 -0.029652000 -0.102200
3 betaHeight  0.004062205 -0.0067305  0.0121231572  0.008502340 -0.018165
4 betaGroup2 -0.013670300  0.0513000 -0.0005405792  0.069904590 -0.037665
5 betaGroup3 -0.103896750 -0.2107500 -0.1281290904 -0.041068020  0.468720
6   betaHxG2 -0.019823355 -0.0341300  0.0072874047  0.001823598  0.020645
7   betaHxG3 -0.010911025 -0.0213900  0.0116900063  0.005040840 -0.041349
8      sigma  0.011585872  0.0005500 -0.0326167204 -0.068377512  0.021184
          q95          rhat
1  0.03527000  2.381442e-03
2  0.00740000  1.487790e-04
3  0.02306265  1.225408e-03
4  0.07863000  4.347532e-05
5 -0.04479000 -4.122054e-05
6  0.02261500  5.967715e-04
7  0.03879350  7.265786e-05
8 -0.00486500  2.854078e-04

13 Compare computational time

In this run, the Stan code with matrices sampled slightly faster:

fit_full_old$time()

$total
[1] 0.2250798

$chains
  chain_id warmup sampling total
1        1  0.053    0.041 0.094
2        2  0.061    0.048 0.109
3        3  0.055    0.051 0.106
4        4  0.054    0.042 0.096

fit_full_new$time()

$total
[1] 0.204052

$chains
  chain_id warmup sampling total
1        1  0.031    0.027 0.058
2        2  0.036    0.032 0.068
3        3  0.037    0.034 0.071
4        4  0.032    0.031 0.063

Pros: With matrices, there is less syntax to write
- Model is equivalent
- More efficient for sampling (the linear predictor is computed as a single matrix product X*beta)
- More flexible: modify the model matrix in R instead of individual parameters in Stan
Cons: Output, however, is not labeled with respect to parameters
- May have to label output yourself
14 Computing Functions of Parameters

Often, we need to compute some linear or non-linear function of parameters in a linear model
- Missing effects: the betas (e.g., slopes) for diet groups 2 and 3
- Model fit indices: R^2
- Transformed effects: residual variance \sigma^2
In non-Bayesian (frequentist) analyses, these are often formed from the point estimates of parameters (with standard errors based on the second derivative of the likelihood function)
For Bayesian analyses, however, we seek to build the posterior distribution for any function of parameters
- This means applying the function to all posterior samples
- It is especially useful when you want to propose a new statistic of your own
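To illustrate applying a function to every posterior sample, the sketch below transforms draws of the residual standard deviation into draws of the residual variance. The draws are simulated stand-ins (an arbitrary gamma distribution), not output from the fitted model; in practice they would come from the Stan fit object.

```r
set.seed(1234)
# Stand-in for 4000 posterior draws of sigma (simulated for illustration)
sigma_draws <- rgamma(4000, shape = 45, rate = 5.5)

# Apply the function to every draw to obtain draws of sigma^2
sigma2_draws <- sigma_draws^2

# Summaries of the transformed draws describe the posterior of sigma^2
c(mean_of_squares = mean(sigma2_draws),
  square_of_mean  = mean(sigma_draws)^2)
quantile(sigma2_draws, probs = c(0.05, 0.5, 0.95))
```

The mean of the squared draws is larger than the square of the mean, which is why a non-linear function should be applied draw by draw rather than to a point estimate.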

14.1 Example: Need Slope for Diet Group 2

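Recall our model:

\text{WeightLB}_p = \beta_0 + \beta_1 \text{HeightIN}_p + \beta_2 \text{Group2}_p + \beta_3\text{Group3}_p + \beta_4 \text{HeightIN}_p\text{Group2}_p + \beta_5 \text{HeightIN}_p\text{Group3}_p + e_p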

Here, \beta_1 denotes the average change in \text{WeightLB} with a one-unit increase in \text{HeightIN} for members of the reference group, Diet Group 1.

Question: What about the slope for members of Diet Group 2?

Typically, we can calculate it by hand by setting \text{Group2} to 1 and \text{Group3} to 0, then collecting all effects that involve \text{HeightIN}:

\beta_{\text{group2}}*\text{HeightIN} = (\beta_1 + \beta_4*1 + \beta_5*0)*\text{HeightIN}

\beta_{\text{group2}}= \beta_1 +\beta_4

Similarly, the intercept for Group 2 (the expected \text{WeightLB} for a Group 2 member when the height predictor is 0) is \beta_0 + \beta_2.

14.2 Computing slope for Diet Group 2

Our task: Create the posterior distribution of the slope for Diet Group 2

We must do so for each iteration we’ve kept from our MCMC chains
A somewhat tedious way to do this is to combine the posterior draws in R after running Stan

fit_full_new$summary()

# A tibble: 8 × 10
  variable    mean  median    sd   mad     q5     q95  rhat ess_bulk ess_tail
  <chr>      <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>    <dbl>    <dbl>
1 lp__     -76.6   -76.3   2.15  2.00  -80.6  -73.9    1.00    1401.    1886.
2 beta[1]  148.    148.    3.27  3.23  142.   153.     1.00    1402.    1838.
3 beta[2]   -0.374  -0.372 0.483 0.479  -1.16   0.414  1.00    1522.    2019.
4 beta[3]  -24.1   -24.2   4.59  4.45  -31.6  -16.5    1.00    1746.    2190.
5 beta[4]   81.3    81.3   4.44  4.40   74.1   88.5    1.00    1538.    2091.
6 beta[5]    2.47    2.48  0.683 0.676   1.32   3.56   1.00    1805.    2118.
7 beta[6]    3.57    3.57  0.646 0.643   2.52   4.62   1.00    1678.    2218.
8 sigma      8.25    8.10  1.26  1.20    6.50  10.5    1.00    2265.    2212.

beta_group2 <- fit_full_new$draws("beta[2]")  + fit_full_new$draws("beta[5]")
summary(beta_group2)

# A tibble: 1 × 10
  variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 beta[2]   2.10   2.09 0.478 0.478  1.32  2.88  1.00    4032.    3211.

14.3 Computing slope within Stan

Stan can compute these values for us with the “generated quantities” section of the syntax

Stan code

data{
  int<lower=0> N;         // number of observations
  int<lower=0> P;         // number of predictors (plus column for intercept)
  matrix[N, P] X;         // model.matrix() from R
  vector[N] weightLB;     // outcome
  real sigmaRate;         // hyperparameter: prior rate parameter for residual standard deviation
}
parameters {
  vector[P] beta;         // vector of coefficients for Beta
  real<lower=0> sigma;    // residual standard deviation
}
model {
  sigma ~ exponential(sigmaRate);         // prior for sigma
  weightLB ~ normal(X*beta, sigma);       // linear model
}
generated quantities{
  real slopeG2;
  slopeG2 = beta[2] + beta[5];
}

The generated quantities block computes values that do not affect the posterior distributions of the parameters; they are computed after the sampling step of each iteration

The values are then added to the Stan object and can be seen in the summary
- They can also be plotted using the bayesplot package

mod_full_compute <- cmdstan_model("Code/FullModel_compute.stan")
fit_full_compute <- mod_full_compute$sample(
  data = data_full_new,
  seed = 1234,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)

fit_full_compute$summary('slopeG2')

# A tibble: 1 × 10
  variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 slopeG2   2.10   2.09 0.478 0.478  1.32  2.88  1.00    4032.    3211.

bayesplot::mcmc_dens_chains(fit_full_compute$draws('slopeG2'))

14.4 Alternative way of computing the slope with Matrix

This is a little more complicated but more flexible method.

That is, we can make use of matrix operation and form a contrast matrix

Contrasts are linear combinations of parameters
- You may have used these in R with the glht() function from the multcomp package

For us, we form a contrast matrix of size C \times P, where C is the number of contrasts:

The entries of this matrix are the values that multiply the coefficients
- For (\beta_1 + \beta_4) this would be:
  - a “1” in the corresponding entry for \beta_1
  - a “1” in the corresponding entry for \beta_4
  - “0”s elsewhere
- \mathbf{C} = \begin{bmatrix} 0\ \mathbf{1}\ 0\ 0\ \mathbf{1}\ 0 \end{bmatrix}
Then, the contrast matrix is multiplied by the coefficient vector to form the values:
- \mathbf{C}\boldsymbol{\beta}
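The contrast product can be tried in plain R first. The sketch below applies a two-row contrast matrix to the rounded posterior means from the summary table of the matrix model shown earlier; the object names beta_hat and C_mat are chosen here for illustration.

```r
# Posterior means copied (rounded) from the earlier summary table
beta_hat <- c(148, -0.374, -24.1, 81.3, 2.47, 3.57)

# Each row is one contrast; columns line up with beta[1], ..., beta[6]
C_mat <- matrix(
  c(0, 1, 0, 0, 1, 0,   # slope for diet group 2: beta_1 + beta_4
    1, 0, 1, 0, 0, 0),  # intercept for diet group 2: beta_0 + beta_2
  nrow = 2, byrow = TRUE
)

effects_hat <- C_mat %*% beta_hat
effects_hat
```

Because a contrast is linear, applying it to posterior means gives the posterior mean of the contrast (up to rounding of the inputs); the full posterior distribution still requires applying it to every draw.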

14.5 Contrasts in Stan

Stan code

data{
  int<lower=0> N;         // number of observations
  int<lower=0> P;         // number of predictors (plus column for intercept)
  matrix[N, P] X;         // model.matrix() from R
  vector[N] weightLB;     // outcome
  real sigmaRate;         // hyperparameter: prior rate parameter for residual standard deviation
  int<lower=0> nContrasts;
  matrix[nContrasts, P] contrast; // C matrix
}
parameters {
  vector[P] beta;         // vector of coefficients for Beta
  real<lower=0> sigma;    // residual standard deviation
}
model {
  sigma ~ exponential(sigmaRate);         // prior for sigma
  weightLB ~ normal(X*beta, sigma);       // linear model
}
generated quantities{
  vector[nContrasts] computedEffects;
  computedEffects = contrast*beta;
}

R code

mod_full_contrast <- cmdstan_model("Code/FullModel_contrast.stan")
contrast_dat <- list(
  nContrasts = 2,
  contrast = matrix(
    c(0,1,0,0,1,0, # slope for diet group2
      1,0,1,0,0,0),# intercept for diet group 2
    nrow = 2, byrow = TRUE
  )
)
fit_full_contrast <- mod_full_contrast$sample(
  data = c(data_full_new, contrast_dat),
  seed = 1234,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)

fit_full_contrast$summary('computedEffects')[, -c(9, 10)]

# A tibble: 2 × 8
  variable             mean median    sd   mad     q5    q95  rhat
  <chr>               <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
1 computedEffects[1]   2.10   2.09 0.478 0.478   1.32   2.88  1.00
2 computedEffects[2] 123.   123.   3.16  3.02  118.   129.    1.00

bayesplot::mcmc_hist(fit_full_contrast$draws('computedEffects'))

14.6 Computing \text{R}^2

We can use the generated quantities section to build a posterior distribution for \text{R}^2

There are several formulas for \text{R}^2, we will use the following:

\text{R}^2 = 1-\frac{RSS}{TSS} = 1- \frac{\Sigma_{p=1}^{N}(y_p -\hat{y}_p)^2}{\Sigma_{p=1}^{N}(y_p -\bar{y})^2}

Where:

RSS is the residual sum of squares
TSS is the total sum of squares of the dependent variable
\hat{y}_p is the predicted value for person p, the p-th element of \mathbf{X}\boldsymbol{\beta}
\bar{y} is the mean value of the dependent variable: \bar{y} = \frac{\Sigma_{p=1}^{N}y_p}{N}

Notice: RSS depends on sampled parameters, so we will use this to build our posterior distribution for \text{R}^2

For adjusted \text{R}^2, we use the following:

\text{adj.R}^2 = 1-\frac{RSS/(N-P)}{TSS/(N-1)} = 1- \frac{\Sigma_{p=1}^{N}(y_p -\hat{y}_p)^2}{\Sigma_{p=1}^{N}(y_p -\bar{y})^2}*\frac{N-1}{N-P}

Then, we can express \text{adj.R}^2 in terms of \text{R}^2:

\text{adj.R}^2 = 1-(1-\text{R}^2)*\frac{N-1}{N-P} = \frac{(N-1)\text{R}^2-(P-1)}{N-P}
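A quick numerical check of the relation between \text{R}^2 and adjusted \text{R}^2, using the form 1 - (1 - R^2)(N - 1)/(N - P) and the values reported by lm() earlier in the lecture (N = 30 observations, P = 6 columns in the model matrix).

```r
N  <- 30         # observations in the diet data
P  <- 6          # columns of the model matrix (intercept included)
R2 <- 0.9786724  # R^2 reported by lm() earlier in the lecture

# Adjusted R^2 written in terms of R^2
R2_adj <- 1 - (1 - R2) * (N - 1) / (N - P)

# Equivalent single-fraction form
R2_adj_alt <- ((N - 1) * R2 - (P - 1)) / (N - P)

c(R2_adj = R2_adj, R2_adj_alt = R2_adj_alt)
```

Both forms should agree with the adjusted \text{R}^2 of about 0.9742 from summary(fit_lm).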

14.7 `Stan` code for Computing \text{R}^2

Stan code

generated quantities{
  vector[nContrasts] computedEffects;
  computedEffects = contrast*beta;
  // compute R2
  real rss;
  real tss;
  real R2;
  real R2adj;
  {// anything in these brackets will not appear in summary table
    vector[N] pred = X*beta;
    rss = dot_self(weightLB-pred); // dot_self(v) is the Stan function for the sum of squared elements of v
    tss = dot_self(weightLB-mean(weightLB));
  }
  R2 = 1-rss/tss;
  R2adj = 1-(rss/(N-P))/(tss/(N-1));
}

Recall that our lm function provides \text{R}^2 as 0.9787 and adjusted \text{R}^2 as 0.9742

fit_full_contrast$summary(c('rss', 'tss', 'R2','R2adj'))[, -c(9, 10)]

# A tibble: 4 × 8
  variable      mean    median        sd       mad        q5       q95  rhat
  <chr>        <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl> <dbl>
1 rss       1941.     1877.    285.      226.       1616.     2499.     1.00
2 tss      71112     71112       0         0       71112     71112     NA   
3 R2           0.973     0.974   0.00401   0.00317     0.965     0.977  1.00
4 R2adj        0.967     0.968   0.00484   0.00384     0.958     0.973  1.00

bayesplot::mcmc_hist(fit_full_contrast$draws(c('R2', 'R2adj')))

14.8 Get posterior mode

# Create the function.
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Calculate the mode using the user function.
getmode(fit_full_contrast$draws('R2'))

[1] 0.974917

getmode(fit_full_contrast$draws('R2adj'))

[1] 0.96674
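For continuous posterior draws, almost every value is unique, so a mode based on tabulating values can depend on which draw happens to repeat or come first. One common alternative, sketched here as an assumption rather than as the method used in this lecture, is to take the peak of a kernel density estimate. The draws below are simulated stand-ins, and get_density_mode is a hypothetical helper.

```r
set.seed(1234)
# Stand-in for posterior draws of R2 (simulated; values are illustrative only)
r2_draws <- rnorm(4000, mean = 0.974, sd = 0.004)

# Estimate the density, then return the location of its highest point
get_density_mode <- function(v) {
  d <- density(as.numeric(v))
  d$x[which.max(d$y)]
}

get_density_mode(r2_draws)
```

The same helper could be applied to the draws extracted from a Stan fit after converting them to a numeric vector; the result depends somewhat on the bandwidth chosen by density().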

15 Wrapping up

Today we further added generated quantities into our Bayesian toolset:

How to make Stan use less syntax using matrices
How to form posterior distributions for functions of parameters

We will use both of these features in psychometric models.

16 Exercise for today’s class

The simulated data below contains one continuous outcome (y), one continuous predictor (x), and one group variable with two levels (group). The true data-generating parameters are:

Intercept: \beta_0 = 10
Slope for x: \beta_1 = 2
Group 2 effect: \beta_2 = 5
Residual SD: \sigma = 3

Run the following code to generate the simulated data:

set.seed(42)
N <- 100
x <- rnorm(N, mean = 5, sd = 2)
group <- factor(sample(c(1, 2), N, replace = TRUE))
y <- 10 + 2 * x + 5 * (as.numeric(group) == 2) + rnorm(N, 0, 3)
sim_dat <- data.frame(y = y, x = x, group = group)
head(sim_dat)

         y        x group
1 30.36174 7.741917     2
2 18.08657 3.870604     2
3 24.95402 5.726257     1
4 26.71051 6.265725     2
5 20.21354 5.808537     1
6 15.86074 4.787751     1

Questions

Model 1 (without matrix): Write a Stan model string using individual beta parameters (old syntax). Use default (uniform) priors for all regression coefficients and an exponential(0.1) prior for \sigma. Compile with cmdstan_model(write_stan_file(Model1_String)) and sample from the posterior.
Model 2 (with matrix): Rewrite Model 1 using matrix notation (new syntax) with model.matrix(). Compile with cmdstan_model(write_stan_file(Model2_String)) and sample from the posterior.
Compare the posterior means from both models. Do they recover the true parameters (\beta_0=10, \beta_1=2, \beta_2=5, \sigma=3)?

16.1 Answer: Model 1 (without matrix)

Click to see answer: Model 1 Stan string and sampling

Model1_String <- "
data {
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
  vector[N] group2;
}
parameters {
  real beta0;
  real betaX;
  real betaGroup2;
  real<lower=0> sigma;
}
model {
  // default (uniform) priors for regression coefficients
  sigma ~ exponential(0.1);
  y ~ normal(beta0 + betaX * x + betaGroup2 * group2, sigma);
}
"

mod1 <- cmdstan_model(write_stan_file(Model1_String))
data_mod1 <- list(
  N    = nrow(sim_dat),
  y      = sim_dat$y,
  x      = sim_dat$x,
  group2 = as.numeric(sim_dat$group == 2)
)
fit_mod1 <- mod1$sample(
  data             = data_mod1,
  seed             = 42,
  chains           = 4,
  parallel_chains  = 4,
  refresh          = 0
)

Click to see answer: Model 1 summary

fit_mod1$summary(c("beta0", "betaX", "betaGroup2", "sigma"))[, -c(9, 10)]

# A tibble: 4 × 8
  variable    mean median    sd   mad    q5   q95  rhat
  <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 beta0       9.73   9.74 0.810 0.817  8.40 11.1   1.00
2 betaX       2.10   2.10 0.138 0.139  1.87  2.33  1.00
3 betaGroup2  4.43   4.41 0.570 0.565  3.51  5.38  1.00
4 sigma       2.84   2.84 0.206 0.204  2.53  3.20  1.00

16.2 Answer: Model 2 (with matrix)

Click to see answer: Model 2 Stan string and sampling

Model2_String <- "
data {
  int<lower=0> N;
  int<lower=0> P;
  matrix[N, P] X;
  vector[N] y;
  real sigmaRate;
}
parameters {
  vector[P] beta;
  real<lower=0> sigma;
}
model {
  // default (uniform) priors for regression coefficients
  sigma ~ exponential(sigmaRate);
  y ~ normal(X * beta, sigma);
}
"

mod2 <- cmdstan_model(write_stan_file(Model2_String))
ExFormula <- as.formula("y ~ x + group")
X_sim <- model.matrix(ExFormula, data = sim_dat)
data_mod2 <- list(
  N         = nrow(sim_dat),
  P         = ncol(X_sim),
  X         = X_sim,
  y         = sim_dat$y,
  sigmaRate = 0.1
)
fit_mod2 <- mod2$sample(
  data            = data_mod2,
  seed            = 42,
  chains          = 4,
  parallel_chains = 4,
  refresh         = 0
)

Click to see answer: Model 2 summary

fit_mod2$summary(c("beta", "sigma"))[, -c(9, 10)]

# A tibble: 4 × 8
  variable  mean median    sd   mad    q5   q95  rhat
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 beta[1]   9.73   9.74 0.810 0.812  8.42 11.0   1.00
2 beta[2]   2.10   2.10 0.138 0.136  1.87  2.33  1.00
3 beta[3]   4.42   4.42 0.575 0.573  3.47  5.36  1.00
4 sigma     2.83   2.82 0.207 0.205  2.52  3.19  1.00
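
For the third exercise question, the posterior means of the two models can be placed side by side with the true values. The sketch below uses the means printed in the two summary tables (copied by hand, so no refitting is needed); the data frame layout and the name `comparison` are illustrative assumptions.

```r
# Posterior means copied from the Model 1 and Model 2 summary tables.
comparison <- data.frame(
  parameter = c("beta0", "betaX", "betaGroup2", "sigma"),
  truth     = c(10, 2, 5, 3),
  model1    = c(9.73, 2.10, 4.43, 2.84),
  model2    = c(9.73, 2.10, 4.42, 2.83)
)

# Difference between the two parameterizations of the same model.
comparison$model1_minus_model2 <- comparison$model1 - comparison$model2
print(comparison)
```

The two models are the same model written in two ways, so the remaining differences are at the level of Monte Carlo error.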

17 Next Class

Bayesian Model Fit
Bayesian Model Comparison

