Experimental Design in Education
Educational Statistics and Research Methods (ESRM) Program
University of Arkansas
2025-02-17
Class Outline
Pre-defined:
Why have planned contrasts?
Weights assigned:
\[ D = \text{weights} \times \text{means} \]
Imagine a study comparing the effects of three different study methods (A, B, C) on test scores.
One planned contrast might be to compare the average score of method A (considered the “experimental” method) against the combined average of methods B and C (considered the “control” conditions), testing the hypothesis that method A leads to significantly higher scores than the traditional methods.
\(H_0: \mu_{A} = \frac{\mu_B+\mu_C}{2}\); we also call this a complex contrast
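As a minimal sketch of the idea that a contrast is "weights times means", the complex contrast above could be computed in R as follows. The group means are hypothetical values chosen only for illustration; they are not from the lecture data.

```r
# Hypothetical group means for study methods A, B, C (illustration only)
means <- c(A = 80, B = 74, C = 70)

# Contrast weights for H0: mu_A = (mu_B + mu_C) / 2
weights <- c(A = 1, B = -1/2, C = -1/2)

# A valid contrast has weights that sum to zero
stopifnot(isTRUE(all.equal(sum(weights), 0)))

# Contrast value: D = sum of weights * means
D <- sum(weights * means)
D
```

A positive D here would mean method A scores above the average of B and C; whether that difference is significant still requires a standard error and a test.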
When to use planned contrasts:
Note
We should not test all possible combinations of groups. Instead, justify your comparison plan before performing statistical analysis.
We performed omnibus tests in the last lecture, which provide all pairwise group comparisons (simple contrasts)
Today we focus more on complex contrasts.
By default, R uses treatment contrasts: each group compared to the reference group
Orthogonal Contrasts: Independent of each other; the sum of the products of corresponding weights equals zero.
Non-Orthogonal Contrasts: Not independent; they test overlapping information and can inflate the Type I error rate.
Note
Orthogonal contrasts allow clear interpretation without redundancy.
Orthogonal contrasts form a series of group comparisons whose explained variances do not overlap.
Helmert contrast example
| Group | Contrast 1 | Contrast 2 | Product |
|---|---|---|---|
| G1 | +1 | -1 | -1 |
| G2 | +1 | +1 | +1 |
| G3 | -2 | 0 | 0 |
| Sum | 0 | 0 | 0 |
## Contrasts and Coding
[,1] [,2]
[1,] 1 -1
[2,] 1 1
[3,] -2 0
To understand if these contrasts are orthogonal, one could compute the dot product between the contrast vectors. The dot product of first and second contrast is:
\((1 * -1) + (1 * 1) + (-2 * 0) = -1 + 1 + 0 = 0\)
Since the dot product of the contrasts is equal to zero, these contrasts are indeed orthogonal.
Orthogonal contrasts have the advantage that the comparison of the means of one contrast does not affect the comparison of the means of any other contrast.
## the dot-product of two contrasts should be zero
[,1]
[1,] 0
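The orthogonality check above can be sketched in base R. The object name `helmert3` is made up for this illustration; the weights are the ones shown in the Helmert example.

```r
# Helmert-style contrast weights for three groups (columns = contrasts)
helmert3 <- cbind(c(1, 1, -2), c(-1, 1, 0))

# Each contrast's weights should sum to zero
col_sums <- colSums(helmert3)

# Dot product of contrast 1 and contrast 2; zero means orthogonal
dot12 <- t(helmert3[, 1]) %*% helmert3[, 2]

# crossprod() returns all pairwise dot products at once:
# off-diagonal entries are the dot products between different contrasts
cp <- crossprod(helmert3)
cp
```

With more than two contrasts, inspecting the off-diagonal entries of `crossprod()` is usually more convenient than checking every pair by hand.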
Assume we have 5 groups and we want to compare each group to the grand mean.
A one-way ANOVA might use the following contrast matrix:
Check whether the contrasts are orthogonal by calculating the cross-product of the contrast matrix.
new_code <- matrix(c(
0, 0, 0, 0,
1, 0, 0, 0,
0, 1, 0, 0,
0, 0, 1, 0,
0, 0, 0, 1
), byrow = TRUE, nrow = 5)
cat("## the cross-product of two contrasts should be zero\n")
t(new_code[,1]) %*% new_code[,2]
t(new_code[,1]) %*% new_code[,3]
t(new_code[,1]) %*% new_code[,4]
t(new_code[,2]) %*% new_code[,3]
The relationship between planned contrasts in ANOVA and coding in regression lies in how categorical variables are represented and interpreted in statistical models.
Both approaches aim to test specific hypotheses about group differences, but their implementation varies based on the framework:
General Linear Hypothesis Testing: Planned contrast can be done using linear regression + contrasts
In R, the multcomp package provides a convenient way to perform general linear hypothesis testing
The contrast matrix can be defined manually or using built-in contrast functions in R, such as contr.helmert, contr.sum, and contr.treatment.
Let’s look at the default contrasts plan: treatment contrasts == dummy coding
(Intercept) groupg2 groupg3 groupg4 groupg5
1 1 0 0 0 0
29 1 1 0 0 0
57 1 0 1 0 0
85 1 0 0 1 0
113 1 0 0 0 1
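A design matrix like the one above could be produced along these lines. This is only a sketch: `dt_toy` is a hypothetical stand-in for the lecture data, assuming five groups of 28 observations each, and it relies on R's default treatment coding for unordered factors.

```r
# Toy stand-in for the lecture data: 5 groups with 28 observations each
# (only the grouping factor matters for the design matrix)
dt_toy <- data.frame(group = factor(rep(paste0("g", 1:5), each = 28)))

# Default coding for an unordered factor is treatment (dummy) coding
contrasts(dt_toy$group)

# Design matrix implied by the formula ~ group; keep one row per group
X <- model.matrix(~ group, data = dt_toy)
X_unique <- unique(X)
X_unique
```

The row names of `X_unique` are the first observation of each group, which is why rows 1, 29, 57, 85, and 113 appear.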
contrasts(): function to set the contrast coding for a factor variable in R. It allows you to specify how the levels of the factor should be coded for regression analysis.
levels(dt$group): retrieves the levels of the factor variable group in the dataset dt, which are used to define the contrast coding scheme.
model.matrix(): function that generates the design (or model) matrix for a linear model, based on the specified formula and data. It shows how the categorical variable is represented in the regression model according to the chosen contrast coding.
# Set seed for reproducibility
options(digits = 5)
summary_tbl <- dt |>
group_by(group) |>
summarise(
N = n(),
Mean = mean(score),
SD = sd(score),
shapiro.test.p.values = shapiro.test(score)$p.value
) |>
mutate(department = c("Engineering", "Education", "Chemistry", "Political", "Psychology")) |>
relocate(group, department)
summary_tbl
| group | department | N | Mean | SD | shapiro.test.p.values |
|---|---|---|---|---|---|
| g1 | Engineering | 28 | 4.2500 | 3.15054 | 0.07759 |
| g2 | Education | 28 | 2.7589 | 2.19478 | 0.07605 |
| g3 | Chemistry | 28 | 3.5446 | 2.86506 | 0.00623 |
| g4 | Political | 28 | 3.8568 | 0.58325 | 0.03023 |
| g5 | Psychology | 28 | 2.0243 | 1.30911 | 0.06147 |
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 13 5.8e-09 ***
135
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Even though the assumption checks did not pass using the original categorical levels, we may still be interested in different group contrasts.
(Intercept) group1 group2 group3 group4
3.286929 -0.745536 0.013393 0.084732 -0.315661
\[ [\beta_0 + (\beta_0 + \beta_2)] / 2 - [(\beta_0 + \beta_1) + (\beta_0 + \beta_3) + (\beta_0 + \beta_4)] / 3 \\ = \beta_2 / 2 - (\beta_1 + \beta_3 + \beta_4) /3 \]
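The algebra above can be checked numerically. The sketch below uses simulated scores (not the lecture data) and assumes treatment coding with g1 as the reference level, so that the weights apply to the coefficients in the order (Intercept, g2, g3, g4, g5).

```r
set.seed(1)  # simulated data, for illustration only

# Five groups of 28 simulated scores (hypothetical values)
sim <- data.frame(
  group = factor(rep(paste0("g", 1:5), each = 28)),
  score = rnorm(140, mean = rep(c(4, 3, 3.5, 4, 2), each = 28))
)

# Treatment-coded model: intercept = g1 mean, slopes = differences from g1
fit <- lm(score ~ group, data = sim,
          contrasts = list(group = "contr.treatment"))

# Weights on (Intercept, g2, g3, g4, g5) from the derivation above
k <- c(0, -1/3, 1/2, -1/3, -1/3)
from_coefs <- sum(k * coef(fit))

# Same quantity computed directly from the group means
m <- tapply(sim$score, sim$group, mean)
from_means <- mean(m[c("g1", "g3")]) - mean(m[c("g2", "g4", "g5")])

all.equal(from_coefs, from_means)
```

The intercept terms cancel, which is why the weight on the intercept is zero.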
Method 1: Create a new variable indicating STEM vs. Non-STEM and fit a simple linear model to test the contrast.
Call:
lm(formula = score ~ Stem, data = dt)
Residuals:
Min 1Q Median 3Q Max
-3.897 -1.797 0.175 1.163 6.103
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.880 0.251 11.48 <2e-16 ***
StemYes 1.017 0.397 2.56 0.011 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.3 on 138 degrees of freedom
Multiple R-squared: 0.0455, Adjusted R-squared: 0.0386
F-statistic: 6.58 on 1 and 138 DF, p-value: 0.0114
Method 2: Use the multcomp package to specify the contrast directly on the fitted ANOVA model.
Simultaneous Tests for General Linear Hypotheses
Fit: aov(formula = score ~ group, data = dt)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
Stem vs. Non-Stem == 0 0.332 0.141 2.35 0.02 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)
glht: general linear hypothesis test, which allows us to specify custom contrasts and test them against the fitted model.
contrast: A matrix specifying the contrast of interest. In this case, we are comparing the average of the STEM groups (Group 1 and Group 3) against the average of the non-STEM groups (Group 2, Group 4, and Group 5). The weights for the STEM groups are +1/2, and the weights for the non-STEM groups are -1/3.
❓ Why do Method 1 and Method 2 give different p-values for the t-test?
The two methods are not testing the same hypothesis, even though they’re both labeled “STEM vs Non-STEM”. That’s why the p-values differ.
Regrouping the data uses the sample sizes to weight the group means, while the contrast method uses fixed weights that do not account for sample size. This leads to different estimates of the mean difference and its standard error, resulting in different t-statistics and p-values.
Let’s break it down carefully.
🔎 What Method 1 Is Testing:
Here what we’re doing is:
Collapse the 5 original groups into two categories (STEM vs. Non-STEM).
Fit a simple two-group linear model.
This tests:
\[ H_0:\ \mu_{\text{STEM}} = \mu_{\text{Non-STEM}} \]
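Crucially, collapsing pools every observation within a category, so each group contributes in proportion to its sample size. The STEM mean is: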
\[ \frac{n_{g1}\mu_{g1} + n_{g3}\mu_{g3}}{n_{g1} + n_{g3}} \]
and similarly for Non-STEM.
👉 This is a sample-size weighted comparison.
🔎 What Method 2 Is Testing
Here:
The contrast matrix:
(0, -1/3, 1/2, -1/3, -1/3)
applies to the treatment-coded coefficients (intercept, g2, g3, g4, g5), so in terms of group means we’re testing:
\[ \frac{1}{2}(\mu_{g1} + \mu_{g3}) - \frac{1}{3}(\mu_{g2} + \mu_{g4} + \mu_{g5}) \]
👉 This is an equal-weight contrast, not a sample-size weighted one.
⚠️ Why the p-values differ
There are two major reasons:
1️⃣ Different definitions of the STEM mean
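Method 1: the sample-size weighted mean of the STEM groups.
Method 2: the simple (equal-weight) average of the STEM group means.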
If group sizes are unequal, these are not the same number.
2️⃣ Different standard errors
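Method 1: uses the residual variance of the two-group model (138 residual degrees of freedom in the output above).
Method 2: uses the residual variance of the five-group model: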
\[ SE = \sqrt{\sigma^2 \cdot c^T (X^TX)^{-1} c} \]
So even if the estimated difference were identical, the SE may differ → different t → different p.
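To make the two reasons concrete, here is a sketch with simulated data in which group sizes are deliberately unequal. The data, sizes, and object names are all hypothetical, and the contrast is evaluated in base R rather than through multcomp.

```r
set.seed(2)  # simulated data with unequal group sizes (hypothetical)

n <- c(g1 = 10, g2 = 30, g3 = 40, g4 = 20, g5 = 25)
sim <- data.frame(
  group = factor(rep(names(n), times = n)),
  score = rnorm(sum(n), mean = rep(c(4, 3, 3.5, 4, 2), times = n))
)
sim$Stem <- factor(ifelse(sim$group %in% c("g1", "g3"), "Yes", "No"))

# Method 1: collapsed two-group model (sample-size weighted means)
fit1 <- lm(score ~ Stem, data = sim)
est1 <- unname(coef(fit1)["StemYes"])

# Method 2: equal-weight contrast on the five-group model
fit2 <- lm(score ~ group, data = sim)
k <- c(0, -1/3, 1/2, -1/3, -1/3)  # weights on treatment-coded coefficients
est2 <- sum(k * coef(fit2))

# SE from the formula: sqrt(sigma^2 * c' (X'X)^{-1} c)
X <- model.matrix(fit2)
sigma2 <- sum(residuals(fit2)^2) / df.residual(fit2)
se2 <- sqrt(sigma2 * drop(t(k) %*% solve(crossprod(X)) %*% k))

# Same SE from the fitted model's coefficient covariance matrix
se2_vcov <- sqrt(drop(t(k) %*% vcov(fit2) %*% k))

c(method1 = est1, method2 = est2, se_method2 = se2)
```

When all groups have the same size, the two estimates coincide and only the error term (two-group versus five-group residual variance) separates the two tests.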
For example, Helmert: four contrasts
Summary Statistics:
group department N Mean SD shapiro.test.p.values Ctras1 Ctras2
g1 g1 Engineering 28 4.2500 3.15054 0.0775874 -1 -1
g2 g2 Education 28 2.7589 2.19478 0.0760542 1 -1
g3 g3 Chemistry 28 3.5446 2.86506 0.0062253 0 2
g4 g4 Political 28 3.8568 0.58325 0.0302312 0 0
g5 g5 Psychology 28 2.0243 1.30911 0.0614743 0 0
Ctras3 Ctras4
g1 -1 -1
g2 -1 -1
g3 -1 -1
g4 3 -1
g5 0 4
Ctras1 Ctras2 Ctras3 Ctras4
0 0 0 0
Ctras1 Ctras2 Ctras3 Ctras4
Ctras1 2 0 0 0
Ctras2 0 6 0 0
Ctras3 0 0 12 0
Ctras4 0 0 0 20
\(t = \frac{C}{\sqrt{MSE \sum \frac{c_i^2}{n_i}}}\)
# A tibble: 4 × 2
t_value p_value
<dbl> <dbl>
1 -2.50 0.00690
2 0.0776 0.531
3 0.695 0.756
4 -3.34 0.000541
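The t values above can be reproduced from summary statistics alone. The sketch below takes the group means and sizes from the summary table, and the MSE (5.0011) with its 135 degrees of freedom from the ANOVA table reported in these notes. It reports two-sided p-values, which is an assumption about the intended test; the p_value column above may have been computed differently.

```r
# Summary statistics taken from the tables in these notes
means <- c(g1 = 4.2500, g2 = 2.7589, g3 = 3.5446, g4 = 3.8568, g5 = 2.0243)
n     <- rep(28, 5)
mse   <- 5.0011   # residual mean square of the five-group model
df_e  <- 135      # residual degrees of freedom

# Helmert contrast weights (columns = contrasts), as in the table above
H <- cbind(Ctras1 = c(-1,  1,  0,  0, 0),
           Ctras2 = c(-1, -1,  2,  0, 0),
           Ctras3 = c(-1, -1, -1,  3, 0),
           Ctras4 = c(-1, -1, -1, -1, 4))

C_hat   <- drop(t(H) %*% means)          # contrast estimates
se      <- sqrt(mse * colSums(H^2 / n))  # sqrt(MSE * sum(c_i^2 / n_i))
t_value <- C_hat / se
p_two_sided <- 2 * pt(-abs(t_value), df = df_e)

round(cbind(t_value, p_two_sided), 4)
```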
g1 vs. g2: We reject the null and determine that the mean of Education is different from the mean of Engineering in their growth mindset scores (p = 0.0069).
\(\frac{g1+g2}{2}\) vs. g3: We retain the null and conclude that the mean of Chemistry is not significantly different from the average of Education and Engineering in their growth mindset scores (p = 0.531).
Remember the planned contrast: g1 vs. g2 from the Helmert contrast:
[,1] [,2] [,3] [,4]
g1 -1 -1 -1 -1
g2 1 -1 -1 -1
g3 0 2 -1 -1
g4 0 0 3 -1
g5 0 0 0 4
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.286929 0.189003 17.390874 2.8188e-36
group1 -0.745536 0.298840 -2.494765 1.3810e-02
group2 0.013393 0.172535 0.077624 9.3824e-01
group3 0.084732 0.122001 0.694520 4.8855e-01
group4 -0.315661 0.094502 -3.340271 1.0825e-03
[1] 3.2869
(Intercept) group1 group2 group3 group4 group
1 1 -1 -1 -1 -1 1
29 1 1 -1 -1 -1 2
57 1 0 2 -1 -1 3
85 1 0 0 3 -1 4
113 1 0 0 0 4 5
[,1] [,2] [,3] [,4]
[1,] -0.74554 0.013393 0.084732 -0.31566
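With equal group sizes, the Helmert-coded coefficients can be recovered directly from the group means: the intercept is the mean of the group means, and each slope is the contrast estimate divided by the sum of its squared weights. The sketch below assumes the balanced design (n = 28 per group) and uses the rounded means from the summary table, so results agree with the regression output only up to rounding.

```r
# Group means from the summary table
means <- c(g1 = 4.2500, g2 = 2.7589, g3 = 3.5446, g4 = 3.8568, g5 = 2.0243)

# Built-in Helmert contrast matrix for five levels
H <- contr.helmert(5)

# Intercept: unweighted mean of the group means
b0 <- mean(means)

# Slopes: contrast estimate divided by the sum of squared weights
b <- drop(t(H) %*% means) / colSums(H^2)

round(c(b0, b), 4)
```

For example, the first slope is half the difference between the g2 and g1 means, which is why it is not itself the g1 vs. g2 mean difference even though its t test is the same.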
For treatment contrasts, four dummy variables are created to compare:
Intercept: G1’s mean
group2: G2 vs. G1
group3: G3 vs. G1
group4: G4 vs. G1
group5: G5 vs. G1
 (Intercept) groupg2 groupg3 groupg4 groupg5 group
1 1 0 0 0 0 1
29 1 1 0 0 0 2
57 1 0 1 0 0 3
85 1 0 0 1 0 4
113 1 0 0 0 1 5
Another type of coding is effect coding. In R, the corresponding contrast type is the so-called sum contrasts.
A detailed post about sum contrasts can be found here
With sum contrasts, the reference level is the grand mean.
[,1] [,2] [,3] [,4]
g1 1 0 0 0
g2 0 1 0 0
g3 0 0 1 0
g4 0 0 0 1
g5 -1 -1 -1 -1
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.28693 0.18900 17.39087 2.8188e-36
group1 0.96307 0.37801 2.54777 1.1962e-02
group2 -0.52800 0.37801 -1.39680 1.6476e-01
group3 0.25771 0.37801 0.68177 4.9655e-01
group4 0.56986 0.37801 1.50753 1.3401e-01
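As a rough check of this interpretation, the estimates above can be compared with the group means in the summary table (values agree up to rounding). The intercept is the mean of the five group means, and each slope is that group's deviation from it:

\[ \hat\beta_0 = \frac{4.2500 + 2.7589 + 3.5446 + 3.8568 + 2.0243}{5} = 3.2869 \]

\[ \hat\beta_1 = \bar{Y}_{g1} - \hat\beta_0 = 4.2500 - 3.2869 = 0.9631 \]

The last group has no coefficient of its own; its deviation is the negative sum of the others:

\[ \bar{Y}_{g5} - \hat\beta_0 = -(0.9631 - 0.5280 + 0.2577 + 0.5699) = -1.2627 \approx 2.0243 - 3.2869 \]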
Note
Effect coding is a method of encoding categorical variables in regression models, similar to dummy coding, but with a different interpretation of the resulting coefficients. It is particularly useful when researchers want to compare each level of a categorical variable to the overall mean rather than to a specific reference category.
In effect coding, categorical variables are transformed into numerical variables, typically using values of -1, 0, and 1. The key difference from dummy coding is that the reference category is represented by -1 instead of 0, and the coefficients indicate deviations from the grand mean.
For a categorical variable with k levels, effect coding requires k-1 coded variables. If we have a categorical variable X with three levels: \(A, B, C\), the effect coding scheme could be:
| Category | \(X_1\) | \(X_2\) |
|---|---|---|
| A | 1 | 0 |
| B | 0 | 1 |
| C (reference) | -1 | -1 |
The last category (\(C\)) is the reference group, coded as -1 on all indicator variables.
Here’s how you could effect code a categorical variable with three levels (e.g., groups):
[,1] [,2]
g1 1 0
g2 0 1
g3 -1 -1
This coding scheme shows that:
The first group (g1) is compared to the overall effect by coding it as 1 in the first column and 0 in the second; its coefficient shows whether it is above or below the overall mean.
The second group (g2) is also contrasted with the overall effect.
The third group (g3), coded as -1 in both columns, serves as the reference category; its deviation from the overall mean is the negative sum of the other two coefficients.
In your regression output, the coefficients for the first two groups will show how the mean of these groups differs from the overall mean. The intercept will represent the overall mean across all groups.
When effect coding is used in a regression model:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \]
library(ggplot2)
# Create a data frame for text labels
text_data <- data.frame(
x = rep(0.25, 3), # Repeating the same x-coordinate
y = c(0.3, 0.7, 0.9), # Different y-coordinates
label = c("C: beta[0] - beta[1] - beta[2]",
"A: beta[0] + 1*'×'*beta[1] + 0*'×'*beta[2]",
"B: beta[0] + 0*'×'*beta[1] + 1*'×'*beta[2]") # Labels
)
# Create an empty ggplot with defined limits
ggplot() +
geom_text(data = text_data, aes(x = x, y = y, label = label), parse = TRUE, size = 11) +
# Add a vertical line at x = 0.5
# geom_vline(xintercept = 0.5, color = "blue", linetype = "dashed", linewidth = 1) +
# Add three horizontal lines marking the group means (y = 0.35, 0.75, 0.95)
geom_hline(yintercept = c(0.35, 0.75, 0.95), color = "red", linetype = "solid", linewidth = 1) +
geom_hline(yintercept = 0.5, color = "grey", linetype = "solid", linewidth = 1) +
geom_text(aes(x = .25, y = .45, label = "grand mean of Y"), color = "grey", size = 11) +
# Set axis limits
xlim(0, 1) + ylim(0, 1) +
labs(y = "Y", x = "") +
# Theme adjustments
theme_minimal() +
theme(text = element_text(size = 20))
| Category | \(X_1\) | \(X_2\) |
|---|---|---|
| A | 1 | 0 |
| B | 0 | 1 |
| C (reference) | 0 | 0 |
Effect coding is beneficial when:
Effect coding can be set in R using the contr.sum function:
party scores
1 Democrat 4
2 Democrat 3
3 Democrat 5
4 Democrat 4
5 Democrat 4
6 Republican 6
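A minimal sketch of setting effect coding with `contr.sum` follows. It uses a small hypothetical data frame (`toy`, with groups A, B, C), not the party data shown above, and equal group sizes.

```r
# Toy data (hypothetical): three groups with known scores
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  score = c(4, 5, 6, 5,   7, 8, 9, 8,   1, 2, 3, 2)
)

# Switch the factor from the default treatment coding to effect (sum) coding
contrasts(toy$group) <- contr.sum(3)
contrasts(toy$group)

fit <- lm(score ~ group, data = toy)
b <- unname(coef(fit))

group_means <- tapply(toy$score, toy$group, mean)  # A = 5, B = 8, C = 2
grand_mean  <- mean(group_means)                   # 5

# Intercept = grand mean; slopes = deviations of A and B from it;
# the effect of the last level C is recovered as -(b1 + b2)
effects <- c(A = b[2], B = b[3], C = -(b[2] + b[3]))
effects
```

Because the three effects must sum to zero, only two of them need to be estimated; the third follows from the other two.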
(g1, g3) vs (g2, g4, g5)
group department N Mean SD shapiro.test.p.values Contrasts
1 g1 Engineering 28 4.2500 3.15054 0.0775874 0.50000
2 g2 Education 28 2.7589 2.19478 0.0760542 -0.33333
3 g3 Chemistry 28 3.5446 2.86506 0.0062253 0.50000
4 g4 Political 28 3.8568 0.58325 0.0302312 -0.33333
5 g5 Psychology 28 2.0243 1.30911 0.0614743 -0.33333
\[ H_0: \frac{\mu_{Engineering}+\mu_{Chemistry}}{2} = \frac{\mu_{Education}+\mu_{PoliSci}+\mu_{Psychology}}{3} \]
Weighted mean difference:
\[ C = c_1\mu_{Eng}+c_2\mu_{Edu}+c_3\mu_{Chem}+c_4\mu_{PoliSci}+c_5\mu_{Psych}\\ = \frac{1}{2}(4.2500)+\left(-\frac13\right)(2.7589)+\frac12(3.5446)+\left(-\frac13\right)(3.8568)+\left(-\frac13\right)(2.0243)\\ = 1.0173 \]
\[ \sum\frac{c^2}{n} = \frac{(\frac12)^2}{28}+\frac{(-\frac13)^2}{28}+\frac{(\frac12)^2}{28}+\frac{(-\frac13)^2}{28}+\frac{(-\frac13)^2}{28} = 0.029762 \]
t p.value
1 2.6369 0.0093476
\[ t = \frac{C}{\sqrt{MSE*\sum\frac{c^2}{n} }} = \frac{1.0173}{\sqrt{5.0011*0.029762}}=2.6368 \]
(Intercept) group1 group2 group3 group4
1 1 0.50000 -0.692619 -0.10070 -0.10070
29 1 -0.33333 0.164436 -0.56552 -0.56552
57 1 0.50000 0.692619 0.10070 0.10070
85 1 -0.33333 -0.082218 0.78276 -0.21724
113 1 -0.33333 -0.082218 -0.21724 0.78276
(g1, g2, g3) vs (g4, g5)
(Intercept) group1 group2 group3 group4
1 1 -0.33333 -4.3068e-01 -0.490501 -0.490501
29 1 -0.33333 -3.8540e-01 0.508987 0.508987
57 1 -0.33333 8.1608e-01 -0.018486 -0.018486
85 1 0.50000 -2.6485e-17 0.500000 -0.500000
113 1 0.50000 -2.6485e-17 -0.500000 0.500000
Estimate Std. Error t value Pr(>|t|)
-0.693 0.463 -1.496 0.137
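The estimate above is not the raw mean difference between the two sets of groups. When the contrast weights are used directly as a coding column, with the remaining columns orthogonal to it and equal group sizes, the regression coefficient is the contrast value divided by the sum of squared weights. The sketch below illustrates this from the summary statistics in these notes; it assumes that setup and uses rounded means, so it matches the output only up to rounding.

```r
# Group means, common group size, MSE, and residual df from these notes
means <- c(g1 = 4.2500, g2 = 2.7589, g3 = 3.5446, g4 = 3.8568, g5 = 2.0243)
n <- 28
mse <- 5.0011
df_e <- 135

# (g4, g5) average minus (g1, g2, g3) average
w <- c(-1/3, -1/3, -1/3, 1/2, 1/2)

C_hat <- sum(w * means)            # contrast in mean-difference units
se_C  <- sqrt(mse * sum(w^2) / n)  # its standard error (equal n)

# Regression coefficient and SE are both rescaled by sum(w^2),
# so the t statistic is the same on either scale
b    <- C_hat / sum(w^2)
se_b <- se_C / sum(w^2)

t_stat <- C_hat / se_C
p_val  <- 2 * pt(-abs(t_stat), df = df_e)
round(c(C = C_hat, b = b, se_b = se_b, t = t_stat, p = p_val), 3)
```

To report the difference in the original score units, use `C_hat` rather than the coefficient; the test result is unaffected by the rescaling.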
Note
Many psychology journals require the reporting of effect sizes
Df Sum Sq Mean Sq F value Pr(>F)
group 4 89.368 22.3420 4.4674 0.0020173
Residuals 135 675.149 5.0011 NA NA
Interpretation: 11.69% of variance in the DV is due to group differences.
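The 11.69% figure is consistent with eta-squared, the ratio of the between-group sum of squares to the total sum of squares. A minimal sketch using the sums of squares from the ANOVA table above:

```r
# Sums of squares from the ANOVA table above
ss_group <- 89.368    # between-group (effect) sum of squares
ss_resid <- 675.149   # residual (error) sum of squares

# Eta-squared: proportion of total variance attributable to group
eta_sq <- ss_group / (ss_group + ss_resid)
round(eta_sq, 4)
```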
Df Sum Sq Mean Sq F value Pr(>F)
group 4 89.368 22.3420 4.4674 0.0020173
Residuals 135 675.149 5.0011 NA NA
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.28692857 0.18900 17.3908736 2.8188e-36
group1 -0.69278571 0.46296 -1.4964232 1.3688e-01
group2 -0.00096962 0.42262 -0.0022943 9.9817e-01
group3 0.17035380 0.42262 0.4030862 6.8752e-01
group4 -1.66214620 0.42262 -3.9329222 1.3359e-04
[1] 0.831 0.128 0.000 0.035 0.321

ESRM 64503