# SuperIgor: Self-Guided Plan Extraction for Instruction-Following Tasks with Goal-Conditional Reinforcement Learning
SuperIgor is a framework for instruction-following tasks that combines large language models (LLMs) with goal-conditional reinforcement learning (RL). Unlike prior approaches that depend on predefined subtasks, SuperIgor enables an LLM to generate and refine high-level plans through self-learning. These plans guide an RL agent, which in turn provides feedback to improve future plan generation.
This forms an iterative feedback loop:
- 📋 Plan Generation – The LLM proposes multiple structured action plans for each instruction.
- 🎮 Policy Learning – The RL agent (trained with PPO) learns to execute these plans.
- ✅ Plan Validation – Candidate plans are evaluated based on execution success.
- 🔁 LLM Fine-Tuning – The LLM is refined with Direct Preference Optimization (DPO), aligning plan scores with the agent’s actual performance.
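The loop above can be sketched in a few lines of Python. Every function here is an illustrative stand-in, not the actual SuperIgor API: the real plan generator, PPO agent, and DPO trainer are far more involved.

```python
# Minimal sketch of the SuperIgor feedback loop.
# All names are hypothetical stand-ins, not SuperIgor's real API.
import random

def llm_generate_plans(instruction, n=4):
    """Stand-in for the LLM proposing n candidate plans per instruction."""
    return [f"{instruction} :: plan-{i}" for i in range(n)]

def rl_execute(plan):
    """Stand-in for the PPO agent executing a plan; returns a success score."""
    return random.random()

def dpo_update(preference_pairs):
    """Stand-in for one DPO step on (preferred, rejected) plan pairs."""
    return len(preference_pairs)

def feedback_iteration(instructions):
    pairs = []
    for instr in instructions:
        plans = llm_generate_plans(instr)             # 1. plan generation
        scored = [(rl_execute(p), p) for p in plans]  # 2-3. execution + validation
        scored.sort(reverse=True)
        # 4. best vs. worst executed plan forms a DPO preference pair
        pairs.append((scored[0][1], scored[-1][1]))
    return dpo_update(pairs)
```

One design point this makes concrete: preferences come from the agent's measured execution success, so the LLM is aligned to plans that actually work rather than to human labels.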
Key features:
- 🔹 Self-supervised plan generation — no manual annotation required
- 🔹 Curriculum training to overcome sparse rewards
- 🔹 Strong generalization to unseen or paraphrased instructions
- 🔹 Evaluated on the CrafText benchmark (Minecraft-like environment)
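The curriculum idea can be illustrated with a small stage scheduler. The stage names and the 0.8 promotion threshold below are assumptions for illustration, not SuperIgor's actual configuration:

```python
# Hypothetical curriculum schedule for sparse-reward training:
# advance to a harder task set once the agent is reliable on the current one.
def next_stage(stages, current, success_rate, threshold=0.8):
    """Return the next stage if success_rate clears the threshold, else stay."""
    i = stages.index(current)
    if success_rate >= threshold and i + 1 < len(stages):
        return stages[i + 1]
    return current

# Illustrative stage names, easiest to hardest.
stages = ["single-step", "two-step", "full-instruction"]
```

Early stages give the agent dense-enough success signals to bootstrap a policy before the full sparse-reward instructions are introduced.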
| Trajectory #1 | Trajectory #2 | Trajectory #3 |
|---|---|---|
| ![]() | ![]() | ![]() |
Installation:

```bash
# 1) Create conda environment
conda env create -f environment.yml
conda activate super-igor

# 2) Install CrafText from source
git clone https://github.com/AIRI-Institute/CrafText.git
pip install -e ./CrafText

# 3) Install SuperIgor package
pip install -e .
```

Follow these steps to run experiments with SuperIgor:
- Generate candidate plans:

  ```bash
  bash ./super_scripts/sh_scripts/train/generate_skills.sh
  ```

- Run curriculum RL training:

  ```bash
  bash ./super_scripts/sh_scripts/train/run_curriculum.sh
  ```

- Evaluate generated plans (one instruction → one plan):

  ```bash
  bash ./super_scripts/sh_scripts/train/run_validation.sh
  ```

- Fine-tune the LLM:
  - For DPO training: add `--optimizer_type dpo_external`
  - For SFT training: add `--optimizer_type sft`

  ```bash
  bash ./super_scripts/sh_scripts/train/run_llm_training.sh
  ```

- For DPO dataset generation:

  ```bash
  bash ./super_scripts/sh_scripts/train/make_dpo_dataset.sh
  ```

- Run NLL evaluation:

  ```bash
  bash ./super_scripts/sh_scripts/train/run_nll_eval.sh
  ```
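To make the DPO dataset step concrete, here is a sketch of how execution scores can be turned into preference records. The record format (`prompt`/`chosen`/`rejected`) and the pairing rule are assumptions for illustration; they are not necessarily what `make_dpo_dataset.sh` produces.

```python
# Hypothetical construction of DPO preference records from plan scores.
def make_dpo_records(instruction, scored_plans):
    """Pair every strictly higher-scoring plan with every lower-scoring one.

    scored_plans: list of (execution_score, plan_text) tuples.
    """
    records = []
    ranked = sorted(scored_plans, key=lambda sp: sp[0], reverse=True)
    for i, (hi, chosen) in enumerate(ranked):
        for lo, rejected in ranked[i + 1:]:
            if hi > lo:  # keep only strict preferences
                records.append({"prompt": instruction,
                                "chosen": chosen,
                                "rejected": rejected})
    return records
```

The key property is that preference direction is decided entirely by the RL agent's execution success, which is what lets the pipeline skip manual annotation.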


