AERA 2026 Annual Meeting
Department of Counseling, Leadership, and Research Methods
University of Arkansas
2026-04-01
AI-driven automation can improve the efficiency, consistency, and scalability of Q-matrix construction
| | \(\alpha_1\) | \(\alpha_2\) | \(\alpha_3\) |
|---|---|---|---|
| Item 1 | 1 | 0 | 0 |
| Item 2 | 0 | 1 | 0 |
| Item 3 | 1 | 1 | 0 |
| Item 4 | 0 | 1 | 1 |
RQ1: What is the stability of Q-matrices generated and validated by the multi-agent framework across repeated runs?
RQ2: What are the overlap rates between the multi-agent-produced Q-matrices and the reference Q-matrix?
RQ3: How do different finalization strategies within the multi-agent framework affect the overlap rate between the final Q-matrix and the reference Q-matrix?

| Phase | Agent | Output |
|---|---|---|
| Generation | Domain Expert | \(Q_0\) from item content |
| Validation | Psychometrician | \(Q_1\) via GDINA/Qval |
| Finalization | Researcher | \(Q_{final}\) integrating \(Q_0\) + \(Q_1\) |
| Plan | Phase 2 | Phase 3 | Phase 4 | Binarization |
|---|---|---|---|---|
| A | 1 \(Q_0\) | 1 \(Q_1\) | 1 \(Q_{final}\) | AI agent reviews \(Q_0\) and \(Q_1\) |
| B | 100 \(Q_0\) | 100 \(Q_1\) | 1 \(Q_{final}\) | AI agent reviews \(\hat{P}_{Q_0}\) and \(\hat{P}_{Q_1}\) |
| C | 100 \(Q_0\) | 100 \(Q_1\) | 1 \(Q_{final}\) | Researcher cutoff on \(\hat{P}_{Q_1}\) |
| D | 100 \(Q_0\) | 100 \(Q_1\) | 100 \(Q_{final}\) | AI agent reviews \(\hat{P}_{Q_{final}}\) |
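The aggregation behind Plans B to D can be sketched as follows. This is a minimal illustration, not the framework's actual code: it assumes \(\hat{P}\) is the entry-wise proportion of runs in which a q-entry equals 1, and the function names `proportion_matrix` and `binarize` and the 0.5 cutoff are hypothetical choices made only for this example.

```python
def proportion_matrix(q_runs):
    """Entry-wise proportion of runs with q-entry == 1 (assumed meaning of P-hat)."""
    n_runs = len(q_runs)
    n_items = len(q_runs[0])
    n_attrs = len(q_runs[0][0])
    return [
        [sum(q[j][k] for q in q_runs) / n_runs for k in range(n_attrs)]
        for j in range(n_items)
    ]


def binarize(p_hat, cutoff=0.5):
    """Researcher-style cutoff rule; the 0.5 default is an assumption."""
    return [[1 if p >= cutoff else 0 for p in row] for row in p_hat]


# Four toy runs of a 2-item, 2-attribute Q-matrix (made-up values).
runs = [
    [[1, 0], [0, 1]],
    [[1, 0], [1, 1]],
    [[1, 1], [0, 1]],
    [[1, 0], [0, 1]],
]
p_hat = proportion_matrix(runs)
q_final = binarize(p_hat)
```

Under Plan C a rule like `binarize` would be applied to \(\hat{P}_{Q_1}\); under Plans B and D the proportion matrices are instead handed to an AI agent for review.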
{"q_matrix": {"A1": [items], ..}}
GDINA() \(\rightarrow\) fit DCM; Qval() \(\rightarrow\) PVAF validation. Output: “Old_Matrix” or “New_Matrix”
{"q_matrix": {"A1": [items], ..}}
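A small sketch of how an agent's JSON output might be turned into a binary Q-matrix. It assumes each attribute key maps to a list of 1-based integer item numbers; that encoding, the helper name `q_matrix_from_json`, and the example payload are illustrative assumptions, since the slide elides the actual contents.

```python
import json


def q_matrix_from_json(payload, n_items, attributes):
    """Build an items-by-attributes 0/1 matrix from the agent's JSON string.

    Assumes item identifiers are 1-based integers (hypothetical encoding).
    """
    mapping = json.loads(payload)["q_matrix"]
    q = [[0] * len(attributes) for _ in range(n_items)]
    for k, attr in enumerate(attributes):
        for item in mapping.get(attr, []):
            q[item - 1][k] = 1
    return q


# Hypothetical payload that reproduces the 4-item example Q-matrix.
payload = '{"q_matrix": {"A1": [1, 3], "A2": [2, 3, 4], "A3": [4]}}'
q = q_matrix_from_json(payload, n_items=4, attributes=["A1", "A2", "A3"])
```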
Evaluation metric: Overlap Rate (OR) = proportion of matched q-entries between two Q-matrices, computed at overall, attribute, and item levels.
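The overlap rate defined above can be computed as in the sketch below. The function name `overlap_rates` and the toy matrices are illustrative only, and the actual analysis may have been run in other software.

```python
def overlap_rates(q_a, q_b):
    """Proportion of matching q-entries at overall, attribute, and item levels."""
    if len(q_a) != len(q_b) or any(len(ra) != len(rb) for ra, rb in zip(q_a, q_b)):
        raise ValueError("Q-matrices must have the same shape")
    # 1 where the two matrices agree on an entry, 0 where they differ.
    match = [[int(a == b) for a, b in zip(ra, rb)] for ra, rb in zip(q_a, q_b)]
    n_items = len(match)
    n_attrs = len(match[0])
    return {
        "overall": sum(map(sum, match)) / (n_items * n_attrs),
        "attribute": [sum(row[k] for row in match) / n_items for k in range(n_attrs)],
        "item": [sum(row) / n_attrs for row in match],
    }


# Reference matrix from the 4-item example; the second matrix differs in one entry.
q_ref = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1]]
q_gen = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
rates = overlap_rates(q_ref, q_gen)
```

Here the two matrices disagree only on Item 3, attribute 2, so 11 of the 12 entries match.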


Initial Q-matrix vs. \(Q_1^*\): Overall OR = 0.90 (95% CI [0.89, 0.92])
Validated Q-matrix vs. \(Q_2^*\): Overall OR = 0.77 (95% CI [0.76, 0.78])



Initial Q-matrix vs. \(Q_{true}\): Overall OR = 0.71 (95% CI [0.71, 0.72])
Validated Q-matrix vs. \(Q_{true}\): Overall OR = 0.65 (95% CI [0.64, 0.66])

| | Study 1 (SAD) | Study 2 (Fraction) |
|---|---|---|
| Initial OR | 0.90 | 0.71 |
| Validated OR | 0.77 | 0.65 |
| Best Plan | B/C/D (0.77) | A (0.77) |
| Challenging Attribute | A2 (close scrutiny) | A2 (simplifying) |
Questions?
Contact: jzhang@uark.edu
