Round 35: paper count reconciled (69 cited / 124 reviewed). Rounds 33–34 added 7 new cited papers and applied accuracy corrections across 6 claims.
Living Research Project · Updated May 2026

Human-AI K-12
Evidence Project

A systematic, AI-assisted evidence synthesis of causal research across ten core domains of K-12 education policy, continuously updated as new research emerges. 69 papers are directly cited and 124 appear in the full bibliography, with every quantitative claim verified against primary source PDFs.

124
Papers Reviewed
69
Directly Cited
10
Research Clusters
1
Replication Study

An Ongoing, Living Research Project

The empirical literature on K-12 education policy is vast, politically salient, and methodologically heterogeneous. This project applies a systematic, multi-agent AI-assisted workflow to synthesize the causal evidence across ten core research clusters — prioritizing randomized controlled trials, natural experiments, and quasi-experimental designs over observational correlations. Every quantitative claim has been verified against primary source PDFs using direct API-based text extraction.

This is not a finished product. The first literature review (10 clusters, 120 papers) and first replication study (Jackson, Johnson & Persico 2016) represent the opening phase of a longer-term project. New clusters, updated syntheses, additional replication studies, and practitioner summaries will be added as the work progresses. Suggestions for papers or corrections to existing claims are welcome via the contact page.

10 Research Clusters

Click any cluster to read the full evidence summary

Cluster 1
New
Teacher Quality
Strong

Matters enormously — but measuring it precisely enough for high-stakes decisions remains contested.

Small–moderate effect: d ≈ 0.10–0.15 per SD of teacher quality
Cluster 2
New
Early Childhood
Mixed

Boutique programs work; scaled-up programs often fade — quality of implementation is everything. Boston Pre-K shows sustained gains are achievable with high program quality.

High long-run return: 7–10% annual IRR (Perry Preschool)
Cluster 3
New
Class Size
Moderate

Strong causal evidence — but ranks poorly on cost-effectiveness vs. other interventions.

Small effect: d ≈ 0.22 (STAR, early grades)
Cluster 4
New
School Funding
Strong

Money matters — especially for low-income students. The debate has shifted to how it is spent.

Large long-run effect: +7.25% wages per 10% spending increase (JJP 2016)
Cluster 5
New
School Choice
Mixed

Urban charters work broadly (not just 'No Excuses'); vouchers and virtual charters often harm students.

Large effect (top networks): d ≈ 0.40/year (Boston charters, math)
Cluster 6
New
Reading Instruction
Strong

The Reading Wars have a clear winner for early grades: systematic phonics. Evidence is unambiguous for K–2.

Large effect: d ≈ 0.41–0.43 (systematic phonics vs. whole-language)
Cluster 7
New
High-Dosage Tutoring
Strong

One of the most cost-effective interventions available — large effects in RCTs, though scale-up effects are smaller (d ≈ 0.10–0.20 at district scale).

Large effect: d = 0.37 (pooled average, Nickow et al. 2020)
Cluster 8
New
SEL & Non-Cognitive
Mixed

Comprehensive SEL programs work; brief mindset interventions (grit, growth mindset) do not at scale.

Moderate effect: d = 0.27 (universal SEL, Durlak et al. 2011)
Cluster 9
New
Out-of-School Factors
Strong

Schools alone cannot close gaps driven by poverty — broader social policy is required.

Large structural gap: 30–40% larger income-achievement gap for children born in 2001 vs. 1975 (Reardon 2011)
Cluster 10
New
International Systems
Moderate

Useful as existence proofs — not as direct policy templates for the US context.

Comparative (no d)

Evidence at a Glance

Effect sizes across the major K-12 interventions reviewed in this synthesis. Toggle to the cost-effectiveness view to see approximate cost per 0.1 SD gain — the most policy-relevant comparison. Systematic phonics stands out as by far the most cost-effective intervention: large effect at very low cost.
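
To make the cost-effectiveness view concrete, here is a minimal sketch of the underlying arithmetic. The per-pupil dollar figures in the example calls are placeholders for illustration only, not the estimates behind the chart.

    # Illustrative arithmetic for the cost-effectiveness view: dollars per 0.1 SD
    # of achievement gain, assuming the effect scales linearly with spending.
    def cost_per_tenth_sd(cost_per_pupil: float, effect_d: float) -> float:
        """Cost to purchase 0.1 SD of achievement at the observed effect size."""
        return cost_per_pupil / (effect_d / 0.1)

    # Placeholder inputs (not this project's estimates):
    print(cost_per_tenth_sd(cost_per_pupil=100.0, effect_d=0.40))   # -> 25.0
    print(cost_per_tenth_sd(cost_per_pupil=1500.0, effect_d=0.37))  # -> ~405.4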

What is Cohen's d? Cohen's d measures the standardized difference between two groups — how many standard deviations apart their average outcomes are. Conventionally: 0.2 = small, 0.5 = medium, 0.8 = large. In education research, effects above 0.2 are considered meaningful. Hover each bar for details and source.
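
For reference, the conventional formula is the standardized mean difference between treatment and control groups (a textbook definition, not tied to any particular paper cited here):

    d = \frac{\bar{x}_T - \bar{x}_C}{s_{\mathrm{pooled}}},
    \qquad
    s_{\mathrm{pooled}} = \sqrt{\frac{(n_T - 1)\,s_T^2 + (n_C - 1)\,s_C^2}{n_T + n_C - 2}}

where \bar{x}_T and \bar{x}_C are the group means and s_{\mathrm{pooled}} is the pooled standard deviation.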
[Effect-size chart: Cohen's d on a scale from 0.00 to roughly 0.55 for Systematic Phonics, Urban Charter Schools, High-Dosage Tutoring, Universal SEL Programs, Class Size Reduction, Teacher Quality (1 SD), Summer Reading Programs, Growth Mindset, and Grit Interventions, with the small (0.2) and medium (0.5) benchmarks marked.]

Effect sizes are illustrative summaries from meta-analytic and quasi-experimental literature. Hover each bar for the intervention unit and source. Confidence intervals and context-dependence matter — do not treat these as precise point estimates.

Cross-Cutting Themes

Patterns that emerge consistently across all ten research clusters

The Persistence of Selection Bias

In almost every domain — from teacher value-added to charter schools to early childhood education — observational estimates are routinely found to be biased upward when subjected to rigorous quasi-experimental or experimental stress tests.

The Fadeout Phenomenon

Interventions that produce large short-term gains frequently see those gains fade over time. However, fadeout in test scores does not always preclude long-term benefits in adult outcomes, suggesting that non-cognitive skills may act as a crucial, unmeasured mediator.

Implementation Trumps Intervention

The efficacy of many interventions is highly dependent on implementation fidelity and context. Interventions that scale well — like high-dosage tutoring — typically have highly structured, standardized delivery mechanisms.

Money Matters, But How It Is Spent Matters More

The debate over school funding has shifted from whether resources matter to how they are deployed. Targeted funding for high-need students and evidence-based programs yields the highest returns.

Citation Drift and Policy Amplification

Findings are routinely stripped of their caveats as they pass through the citation chain. The Chetty et al. (2014) figure of roughly $250,000 in present-value lifetime earnings per classroom (and the related figure of approximately $39,000 per child) was used to justify aggressive termination policies despite the authors' own warnings against it. Heckman's Perry Preschool IRR was applied to modern universal pre-K programs that lack the quality controls of the original 123-child boutique program (58 treatment, 65 control). Duckworth's grit construct prompted schools to grade students on grit before replication studies found near-zero incremental validity. Greenberg (2009) formalized these mechanisms as citation bias, amplification, and invention; Sims et al. (2023) found that early small-scale RCTs exaggerate true effect sizes by at least 52% on average.

Methodology: Human-AI Collaboration

This synthesis was produced through a structured six-stage workflow combining human judgment with AI assistance. All final results, interpretations, and editorial decisions were reviewed and approved by the human author.

01
Research Orchestration & Drafting · Manus AI

Manus AI served as the primary orchestrator — directing the overall workflow, executing literature retrieval, initial drafting, and cross-cluster synthesis under continuous human direction.

02
Systematic Literature Screening · Elicit

Elicit screened the literature to identify candidate papers across 10 research clusters, prioritizing RCTs, natural experiments, and quasi-experimental designs.

03
Citation Network Analysis · Semantic Scholar

Semantic Scholar API provided programmatic citation network analysis to identify seminal papers and trace citation lineages across clusters — informing which papers to prioritize.
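
As an illustration of this step, the sketch below queries the Semantic Scholar Graph API for the papers citing a single work. The paper identifier is a hypothetical placeholder and the requested fields are only examples of what such an analysis can pull, not the project's actual pipeline.

    # Minimal sketch: list the papers citing one work via the Semantic Scholar
    # Graph API. The paper ID below is a placeholder, not one used in this project.
    import requests

    PAPER_ID = "arXiv:1503.01215"  # hypothetical identifier for illustration
    url = f"https://api.semanticscholar.org/graph/v1/paper/{PAPER_ID}/citations"
    resp = requests.get(url, params={"fields": "title,year,citationCount", "limit": 100})
    resp.raise_for_status()

    for item in resp.json().get("data", []):
        citing = item["citingPaper"]
        print(citing.get("year"), citing.get("title"))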

04
PDF-Based Quantitative Verification · Claude + Gemini

Every effect size, sample size, and p-value was verified against primary source PDFs using Claude (claude-opus-4) and Gemini (Google DeepMind) with direct text extraction via PyMuPDF.
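
A minimal sketch of what the extraction side of this step looks like with PyMuPDF is shown below; the file name and the effect-size pattern are illustrative stand-ins, not the project's actual verification script.

    # Sketch: pull the full text of a primary-source PDF with PyMuPDF and scan it
    # for reported effect sizes to check against the synthesis.
    import re
    import fitz  # PyMuPDF

    doc = fitz.open("primary_source.pdf")  # placeholder file name
    text = "\n".join(page.get_text() for page in doc)

    # Match statements such as "d = 0.37" or "d ≈ 0.41" for manual comparison.
    for match in re.finditer(r"d\s*[=≈]\s*-?\d+\.\d+", text):
        print(match.group(0))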

05
Independent Fact-Checking · Perplexity

Perplexity AI independently fact-checked quantitative claims and methodological descriptions, flagging discrepancies for human review.

06
Final Review & Approval · Human Author

All results, interpretations, and editorial decisions were reviewed and approved by the human author. AI tools were used as assistants and auditors, not autonomous decision-makers.

Latest Update

Round 32: Accuracy Corrections + New Citation

May 13, 2026

Stay Updated

Get notified when new research clusters, replication studies, or major revisions are published. No spam — research updates only.

Start Exploring the Evidence

Browse the 10 research clusters, read the JJP replication note, or download the full literature review PDF.