A Curriculum-Aligned Knowledge Graph for Benchmarking & Training Educational LLMs
Built from the official People's Education Press (PEP) Chinese K–12 textbooks, K12-KGraph aligns the same scientific concept across definition, formula, experiment, exercise, structural location, and relational neighborhood.
Modern LLMs can answer "what is the Pythagorean theorem?" but struggle with curriculum cognition, the structured understanding of:
- What are the prerequisites of a concept?
- Which experiment verifies it?
- Which exercises test it?
- Where does it live in the textbook?
- What are its taxonomic and relational neighbors?
K12-KGraph is the first open, multi-subject, official-textbook-grounded knowledge graph that explicitly aligns all five dimensions around each STEM concept, yielding two ready-to-use AI assets:
| | K12-Bench | K12-Train |
|---|---|---|
| Size | 23,640 multi-select questions | 2,267 instruction–response pairs |
| Purpose | Evaluate structural curriculum cognition | Teach it via KG-guided SFT |
| Task families / sources | Ground · Prereq · Neighbor · Evidence · Locate | Node-grounded + Edge-grounded + Deterministic templates |
| Headline result | Gemini-3-Flash reaches only 57.1% EM | Beats 8 mainstream SFT corpora on GaokaoBench & EduEval under a strict 2,300-sample budget |
Instance-level macro F1 and exact match, in %.
| Model | Overall EM | Overall F1 |
|---|---|---|
| Random guess baseline | 6.7 | 36.4 |
| Meta-LLaMA-3-8B-Instruct | 7.2 | 52.6 |
| GLM-4.7-Flash | 31.7 | 63.9 |
| GPT-4o | 31.1 | 65.9 |
| Qwen3-32B | 42.6 | 69.5 |
| Gemma-4-31B-IT | 46.4 | 69.5 |
| GPT-5.2 | 42.8 | 68.0 |
| Gemini-2.5-Flash | 48.3 | 66.7 |
| Gemini-3-Flash | 57.1 | 73.0 |
Even the strongest proprietary model leaves more than 40% of items unsolved on Prereq and Neighbor, the tasks that require directed, structural reasoning. See the project page for the full 5-task breakdown.
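The headline metrics can be reproduced for any answer format that reduces to option sets. Below is a minimal sketch of per-item exact match and instance-level (macro-averaged) F1, assuming predictions and gold answers are represented as sets of option letters; the repository's concrete answer encoding may differ.

```python
def score_item(pred: set, gold: set) -> tuple:
    """Return (exact_match, F1) for one multi-select item."""
    em = 1.0 if pred == gold else 0.0
    if not pred or not gold:
        # Degenerate case: empty prediction or gold set
        return em, em
    tp = len(pred & gold)
    if tp == 0:
        return em, 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return em, 2 * precision * recall / (precision + recall)

def score_dataset(items) -> tuple:
    """Instance-level macro averages: mean EM and mean F1 over items."""
    ems, f1s = zip(*(score_item(p, g) for p, g in items))
    return sum(ems) / len(ems), sum(f1s) / len(f1s)

# Toy example with hypothetical option letters
items = [
    ({"A", "C"}, {"A", "C"}),  # exact match -> EM 1, F1 1
    ({"A"}, {"A", "B"}),       # partial     -> EM 0, F1 2/3
]
em, f1 = score_dataset(items)
```

Because a partially correct selection still earns partial F1 credit, F1 sits well above EM for every model in the table, which matches the reported gap.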
```
K12-Dataset/
├── src/
│   ├── kg/          # Knowledge-graph construction pipeline
│   ├── benchmark/   # K12-Bench generation from graph queries
│   └── sft_qa/      # K12-Train synthesis (node & edge grounded)
│   └── utils/       # Shared config / LLM client / IO
├── eval/            # Multiple-choice evaluation runner (OpenAI / vLLM)
├── config/          # Default pipeline configuration
├── demo/            # Trimmed JSON/JSONL samples
├── books.yaml       # Book registry
├── docs/img/        # README figures
└── requirements.txt
```
Pipeline flow:

```
PDF textbooks → MinerU parsing → Section split → GPT-5.2 schema-constrained extraction
  → Hierarchical merge (book → subject → global) → DAG validation + expert review
  → K12-KGraph → K12-Bench (queries) + K12-Train (QA synthesis)
```
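The DAG-validation step above can be sketched with Kahn's algorithm: repeatedly strip zero-in-degree nodes; anything left over sits on (or is fed by) a cycle. This is an illustrative reimplementation under that assumption, not the repository's `src/kg` code, and the edge tuples are hypothetical.

```python
from collections import defaultdict

def dag_violations(edges):
    """Return the set of nodes that never reach in-degree zero,
    i.e. nodes on or downstream of a cycle. Empty set => valid DAG."""
    indeg = defaultdict(int)
    succ = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
        nodes.update((src, dst))
    # Start from all roots (in-degree zero) and peel the graph
    queue = [n for n in nodes if indeg[n] == 0]
    done = set()
    while queue:
        n = queue.pop()
        done.add(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return nodes - done

# A valid prerequisite chain yields no violations;
# a cycle flags every node on it.
ok = dag_violations([("fractions", "ratios"), ("ratios", "percentages")])
bad = dag_violations([("a", "b"), ("b", "c"), ("c", "a")])
```

Running this check per relation type (on the `is_a` and `prerequisites_for` subgraphs separately) mirrors the validation described in the quality-control section.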
```bash
git clone https://github.com/haolpku/K12-Dataset.git
cd K12-Dataset
pip install -r requirements.txt
```

If you plan to run the graph pipeline from PDFs, also install MinerU and make `magic-pdf` callable from the shell (the command name is configurable via `config/default.yaml`).
```python
from datasets import load_dataset

kg = load_dataset("lhpku20010120/K12-KGraph", split="train")
bench = load_dataset("lhpku20010120/K12-KGraph", name="bench", split="test")
train = load_dataset("lhpku20010120/K12-KGraph", name="train", split="train")
```

```bash
cp config/.env.example config/.env   # add your OPENAI_API_KEY etc.
```
```bash
python src/kg/run_pipeline.py \
    --config config/default.yaml \
    --filter-prefix <YourBookPrefix>   # e.g. math_7a_rjb
python src/benchmark/run_pipeline.py --help
python src/sft_qa/run_pipeline.py --help
```

```bash
cp eval/configs/.env.example eval/configs/.env
chmod +x eval/run.sh
./eval/run.sh <model-config-stem>   # eval/configs/models/<stem>.yaml
```

7 node types: Book · Chapter · Section · Concept · Skill · Experiment · Exercise

9 edge types: `is_a` · `prerequisites_for` · `relates_to` · `verifies` · `tests_concept` · `tests_skill` · `appears_in` · `leads_to` · `is_part_of`
Every Concept carries `name`, `definition`, `importance`, and optional `formula`, `aliases`, `examples`. Every Experiment carries `instruments`, `is_student`, `process`, `phenomena`, `conclusion`. Full schema and attribute specification in `docs/schema.md` (coming soon) or on the project page.
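As a usage sketch, node records like those described above can be filtered by type and attribute. The dict layout here is an assumption for illustration and not the released schema; check `docs/schema.md` for the authoritative field list.

```python
# Toy node records mirroring the attribute list above (hypothetical layout)
nodes = [
    {"type": "Concept", "name": "Pythagorean theorem",
     "definition": "In a right triangle, a^2 + b^2 = c^2.",
     "importance": "high", "formula": "a^2 + b^2 = c^2"},
    {"type": "Concept", "name": "Photosynthesis",
     "definition": "Conversion of light energy into chemical energy.",
     "importance": "high", "formula": None},
    {"type": "Experiment", "name": "Measuring g with a pendulum",
     "instruments": ["pendulum", "stopwatch"], "is_student": True},
]

# Select Concept nodes that carry a formula attribute
formula_concepts = [
    n["name"] for n in nodes
    if n.get("type") == "Concept" and n.get("formula")
]
```

The same pattern applies to rows loaded via `load_dataset`, since each row behaves like a dict keyed by column name.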
| Subject | Books | Concepts | Skills | Experiments | Exercises |
|---|---|---|---|---|---|
| Mathematics | 23 | 1,475 | 428 | 0 | 471 |
| Physics | 9 | 1,154 | 197 | 220 | 186 |
| Chemistry | 7 | 2,302 | 451 | 309 | 270 |
| Biology | 9 | 1,648 | 288 | 123 | 244 |
| Total | 48 | 6,579 | 1,364 | 652 | 1,171 |
- Fleiss' κ = 0.84 overall, from 12 subject-qualified expert annotators (κ by relation: `is_a` 0.91, `prerequisites_for` 0.82, `relates_to` 0.69, `verifies` 0.88)
- Automatic DAG validation on the `is_a` and `prerequisites_for` subgraphs
- Per-edge `evidence` field linking back to textbook source text for auditability
- 98.4% of stratified K12-Bench items verified as "fully correct" in a 3-expert spot-check
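The reported Fleiss' κ can be recomputed from raw annotation counts with the standard formula; a minimal sketch follows, using toy counts rather than the actual annotation data.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item category counts.
    Each inner list sums to the same number of raters n."""
    N = len(ratings)
    n = sum(ratings[0])
    k = len(ratings[0])
    # Marginal proportion of each category across all items
    p_j = [sum(item[j] for item in ratings) / (N * n) for j in range(k)]
    # Mean observed per-item agreement
    P_bar = sum((sum(c * c for c in item) - n) / (n * (n - 1))
                for item in ratings) / N
    # Chance agreement
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# 3 raters, 2 categories: perfect agreement on both items gives kappa = 1
perfect = fleiss_kappa([[3, 0], [0, 3]])
```

Values in the 0.8+ range, as reported for `is_a` and `verifies`, are conventionally read as near-perfect agreement.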
Want to browse nodes, sample bench items, or inspect the training data without cloning the repo? The companion project page offers a rich interactive view:
Contributions are welcome! We particularly appreciate:
- Adding support for other textbook publishers (BNU, Jiangsu, etc.)
- New task families that extend beyond the current 5
- Bug reports and quality issues on existing graph edges (please cite the specific edge ID)
- Translation of the schema/documentation into additional languages

Open an issue or pull request; GitHub Issues are monitored within 48 hours.
If you find K12-KGraph useful in your research, please cite:
```bibtex
@misc{k12kgraph2026,
  title        = {K12-KGraph: A Curriculum-Aligned Knowledge Graph for
                  Benchmarking and Training Educational LLMs},
  author       = {Hao Liang and others},
  year         = {2026},
  howpublished = {Submitted to NeurIPS 2026 Evaluations and Datasets Track},
  url          = {https://github.com/haolpku/K12-Dataset}
}
```

- Dataset (graph, benchmark, training data): CC BY-NC-SA 4.0
- Code (this repository): MIT

