Human-AI K-12
Evidence Project
A systematic, AI-assisted evidence synthesis of causal research across ten core domains of K-12 education policy — continuously updated as new research emerges. 69 papers cited, 124 in the full bibliography — all rigorously verified against primary source PDFs.
An Ongoing, Living Research Project
The empirical literature on K-12 education policy is vast, politically salient, and methodologically heterogeneous. This project applies a systematic, multi-agent AI-assisted workflow to synthesize the causal evidence across ten core research clusters — prioritizing randomized controlled trials, natural experiments, and quasi-experimental designs over observational correlations. Every quantitative claim has been verified against primary source PDFs using direct API-based text extraction.
This is not a finished product. The first literature review (10 clusters, 120 papers) and first replication study (Jackson, Johnson & Persico 2016) represent the opening phase of a longer-term project. New clusters, updated syntheses, additional replication studies, and practitioner summaries will be added as the work progresses. Suggestions for papers or corrections to existing claims are welcome via the contact page.
10 Research Clusters
Click any cluster to read the full evidence summary
Matters enormously — but measuring it precisely enough for high-stakes decisions remains contested.
Boutique programs work; scaled-up programs often fade — quality of implementation is everything. Boston Pre-K shows sustained gains are achievable with high program quality.
Strong causal evidence — but ranks poorly on cost-effectiveness vs. other interventions.
Money matters — especially for low-income students. The debate has shifted to how it is spent.
Urban charters work broadly (not just 'No Excuses'); vouchers and virtual charters often harm students.
The Reading Wars have a clear winner for early grades: systematic phonics. Evidence is unambiguous for K–2.
One of the most cost-effective interventions available — large effects in RCTs, though scale-up effects are smaller (d ≈ 0.10–0.20 at district scale).
Comprehensive SEL programs work; brief mindset interventions (grit, growth mindset) do not at scale.
Schools alone cannot close gaps driven by poverty — broader social policy is required.
Useful as existence proofs — not as direct policy templates for the US context.
Evidence at a Glance
Effect sizes across the major K-12 interventions reviewed in this synthesis. Toggle to the cost-effectiveness view to see approximate cost per 0.1 SD gain — the most policy-relevant comparison. Systematic phonics stands out as by far the most cost-effective intervention: large effect at very low cost.
Effect sizes are illustrative summaries from meta-analytic and quasi-experimental literature. Hover each bar for the intervention unit and source. Confidence intervals and context-dependence matter — do not treat these as precise point estimates.
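The "cost per 0.1 SD gain" metric behind the toggle is simple arithmetic: divide per-student cost by how many 0.1 SD increments the intervention buys. A minimal sketch, with hypothetical placeholder figures rather than values from this synthesis, and assuming a linear dose-response (a strong simplification):

```python
def cost_per_01_sd(cost_per_student: float, effect_size_sd: float) -> float:
    """Approximate cost to move one student 0.1 SD.
    Assumes a linear dose-response, which real interventions rarely obey."""
    if effect_size_sd <= 0:
        raise ValueError("effect size must be positive")
    return cost_per_student / (effect_size_sd / 0.1)

# Hypothetical illustration only -- not figures from this review:
# a $100/student phonics curriculum at d = 0.30 -> ~$33 per 0.1 SD;
# a $4,000/student tutoring program at d = 0.35 -> ~$1,143 per 0.1 SD.
phonics_cost = cost_per_01_sd(100, 0.30)
tutoring_cost = cost_per_01_sd(4000, 0.35)
```

This is why a modest-cost intervention with a large effect can dominate a high-cost one with a slightly larger effect on the cost-effectiveness view.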
Cross-Cutting Themes
Patterns that emerge consistently across all ten research clusters
The Persistence of Selection Bias
In almost every domain — from teacher value-added to charter schools to early childhood education — observational estimates are routinely found to be biased upward when subjected to rigorous quasi-experimental or experimental stress tests.
The Fadeout Phenomenon
Interventions that produce large short-term gains frequently see those gains fade over time. However, fadeout in test scores does not always preclude long-term benefits in adult outcomes, suggesting that non-cognitive skills may act as a crucial, unmeasured mediator.
Implementation Trumps Intervention
The efficacy of many interventions is highly dependent on implementation fidelity and context. Interventions that scale well — like high-dosage tutoring — typically have highly structured, standardized delivery mechanisms.
Money Matters, But How It Is Spent Matters More
The debate over school funding has shifted from whether resources matter to how they are deployed. Targeted funding for high-need students and evidence-based programs yields the highest returns.
Citation Drift and Policy Amplification
Findings are routinely stripped of their caveats as they pass through the citation chain. The Chetty et al. (2014) $250,000 per-classroom earnings figure (approximately $39,000 per child) was used to justify aggressive termination policies despite the authors’ own warnings against it. Heckman’s Perry Preschool IRR was applied to modern universal pre-K programs that lack the quality controls of the original 123-child boutique program (58 treatment, 65 control). Duckworth’s grit construct prompted schools to grade students on grit before replication studies found near-zero incremental validity. Greenberg (2009) formalized these mechanisms as citation bias, amplification, and invention; Sims et al. (2023) found early small-scale RCTs exaggerate true effect sizes by at least 52% on average.
Methodology: Human-AI Collaboration
This synthesis was produced through a structured six-stage workflow combining human judgment with AI assistance. All final results, interpretations, and editorial decisions were reviewed and approved by the human author.
Manus AI served as the primary orchestrator — directing the overall workflow, executing literature retrieval, initial drafting, and cross-cluster synthesis under continuous human direction.
Elicit screened the literature to identify candidate papers across 10 research clusters, prioritizing RCTs, natural experiments, and quasi-experimental designs.
Semantic Scholar API provided programmatic citation network analysis to identify seminal papers and trace citation lineages across clusters — informing which papers to prioritize.
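Citation-network pulls of this kind can be scripted against the public Semantic Scholar Graph API. A minimal sketch, not the project's actual pipeline; the DOI in the usage comment is illustrative, and unauthenticated requests are rate-limited:

```python
import json
import urllib.request

API = "https://api.semanticscholar.org/graph/v1"

def citation_url(paper_id: str,
                 fields: str = "title,year,citationCount",
                 limit: int = 100) -> str:
    """Build a Graph API URL listing papers that cite `paper_id`."""
    return f"{API}/paper/{paper_id}/citations?fields={fields}&limit={limit}"

def fetch_citations(paper_id: str) -> list:
    """Fetch one page of citing papers (live network call; the free
    tier is rate-limited, so batch requests politely)."""
    with urllib.request.urlopen(citation_url(paper_id)) as resp:
        return json.load(resp)["data"]

# Usage (network call): fetch_citations("DOI:10.1257/aer.104.9.2633")
# returns dicts of the form {"citingPaper": {"title": ..., "year": ...}},
# which can seed a reading list ranked by citationCount.
```

Walking these edges recursively from a handful of seed papers surfaces the heavily cited "hub" studies in each cluster.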
Every effect size, sample size, and p-value was verified against primary source PDFs using Claude (claude-opus-4) and Gemini (Google DeepMind) with direct text extraction via PyMuPDF.
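The extraction step can be sketched as follows. `fitz` is PyMuPDF's import name; the claim-matching helper is a simplified, hypothetical stand-in for the actual verification pipeline (it only normalizes whitespace and checks for a literal substring, whereas the real workflow pairs extraction with LLM review):

```python
import re

def normalize(text: str) -> str:
    """Collapse whitespace so PDF line breaks don't block matches."""
    return re.sub(r"\s+", " ", text).strip()

def claim_appears(claim: str, extracted_text: str) -> bool:
    """Crude check: does the quoted figure appear verbatim in the
    extracted PDF text? A first-pass filter, not full verification."""
    return normalize(claim) in normalize(extracted_text)

def extract_pdf_text(path: str) -> str:
    """Pull raw text from every page with PyMuPDF (pip install pymupdf)."""
    import fitz  # PyMuPDF's import name
    with fitz.open(path) as doc:
        return " ".join(page.get_text() for page in doc)

# Usage: claim_appears("d = 0.23", extract_pdf_text("paper.pdf"))
# flags claims whose quoted statistic cannot be found in the source.
```

Substring matching catches transcription errors cheaply; anything it cannot locate is escalated to model-assisted and then human review.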
Perplexity AI independently fact-checked quantitative claims and methodological descriptions, flagging discrepancies for human review.
AI tools were used as assistants and auditors, not autonomous decision-makers; final editorial responsibility rests with the human author.
Stay Updated
Get notified when new research clusters, replication studies, or major revisions are published. No spam — research updates only.
Start Exploring the Evidence
Browse the 10 research clusters, read the JJP replication note, or download the full literature review PDF.