# SuperIgor: Self-Guided Plan Extraction for Instruction-Following Tasks with Goal-Conditional Reinforcement Learning
SuperIgor is a framework for instruction-following tasks that combines large language models (LLMs) with goal-conditional reinforcement learning (RL). Unlike prior approaches that depend on predefined subtasks, SuperIgor enables an LLM to generate and refine high-level plans through self-learning. These plans guide an RL agent, which in turn provides feedback to improve future plan generation.
This forms an iterative feedback loop:
- 📋 Plan Generation – The LLM proposes multiple structured action plans for each instruction.
- 🎮 Policy Learning – The RL agent (trained with PPO) learns to execute these plans.
- ✅ Plan Validation – Candidate plans are evaluated based on execution success.
- 🔁 LLM Fine-Tuning – The LLM is refined with Direct Preference Optimization (DPO), aligning plan scores with the agent’s actual performance.
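The loop above can be sketched in a few lines of Python. Every function here is an illustrative stand-in, not the actual SuperIgor API: the real plan generator, PPO agent, and DPO trainer are far more involved.

```python
# Minimal sketch of the SuperIgor feedback loop.
# All names are hypothetical stand-ins, not SuperIgor's real API.
import random

def llm_generate_plans(instruction, n=4):
    """Stand-in for the LLM proposing n candidate plans per instruction."""
    return [f"{instruction} :: plan-{i}" for i in range(n)]

def rl_execute(plan):
    """Stand-in for the PPO agent executing a plan; returns a success score."""
    return random.random()

def dpo_update(preference_pairs):
    """Stand-in for one DPO step on (preferred, rejected) plan pairs."""
    return len(preference_pairs)

def feedback_iteration(instructions):
    pairs = []
    for instr in instructions:
        plans = llm_generate_plans(instr)             # 1. plan generation
        scored = [(rl_execute(p), p) for p in plans]  # 2-3. execution + validation
        scored.sort(reverse=True)
        # 4. best vs. worst executed plan forms a DPO preference pair
        pairs.append((scored[0][1], scored[-1][1]))
    return dpo_update(pairs)
```

One design point this makes concrete: preferences come from the agent's measured execution success, so the LLM is aligned to plans that actually work rather than to human labels.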
Key features:
- 🔹 Self-supervised plan generation — no manual annotation required
- 🔹 Curriculum training to overcome sparse rewards
- 🔹 Strong generalization to unseen or paraphrased instructions
- 🔹 Evaluated on the CrafText benchmark (Minecraft-like environment)
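The curriculum idea can be illustrated with a small stage scheduler. The stage names and the 0.8 promotion threshold below are assumptions for illustration, not SuperIgor's actual configuration:

```python
# Hypothetical curriculum schedule for sparse-reward training:
# advance to a harder task set once the agent is reliable on the current one.
def next_stage(stages, current, success_rate, threshold=0.8):
    """Return the next stage if success_rate clears the threshold, else stay."""
    i = stages.index(current)
    if success_rate >= threshold and i + 1 < len(stages):
        return stages[i + 1]
    return current

# Illustrative stage names, easiest to hardest.
stages = ["single-step", "two-step", "full-instruction"]
```

Early stages give the agent dense-enough success signals to bootstrap a policy before the full sparse-reward instructions are introduced.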
| Trajectory #1 | Trajectory #2 | Trajectory #3 |
|---|---|---|
| ![]() | ![]() | ![]() |
Installation:

```bash
# 1) Create conda environment
conda env create -f environment.yml
conda activate super-igor

# 2) Install CrafText from source
git clone https://github.com/AIRI-Institute/CrafText.git
pip install -e ./CrafText

# 3) Install SuperIgor package
pip install -e .
```

Follow these steps to run experiments with SuperIgor:
- Generate candidate plans:

  ```bash
  bash ./super_scripts/sh_scripts/train/generate_skills.sh
  ```

- Run curriculum RL training:

  ```bash
  bash ./super_scripts/sh_scripts/train/run_curriculum.sh
  ```

- Evaluate generated plans (one instruction → one plan):

  ```bash
  bash ./super_scripts/sh_scripts/train/run_validation.sh
  ```

- Fine-tune the LLM:
  - For DPO training: add `--optimizer_type dpo_external`
  - For SFT training: add `--optimizer_type sft`

  ```bash
  bash ./super_scripts/sh_scripts/train/run_llm_training.sh
  ```

- For DPO dataset generation:

  ```bash
  bash ./super_scripts/sh_scripts/train/make_dpo_dataset.sh
  ```

- Run NLL evaluation:

  ```bash
  bash ./super_scripts/sh_scripts/train/run_nll_eval.sh
  ```
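To make the DPO dataset step concrete, here is a sketch of how execution scores can be turned into preference records. The record format (`prompt`/`chosen`/`rejected`) and the pairing rule are assumptions for illustration; they are not necessarily what `make_dpo_dataset.sh` produces.

```python
# Hypothetical construction of DPO preference records from plan scores.
def make_dpo_records(instruction, scored_plans):
    """Pair every strictly higher-scoring plan with every lower-scoring one.

    scored_plans: list of (execution_score, plan_text) tuples.
    """
    records = []
    ranked = sorted(scored_plans, key=lambda sp: sp[0], reverse=True)
    for i, (hi, chosen) in enumerate(ranked):
        for lo, rejected in ranked[i + 1:]:
            if hi > lo:  # keep only strict preferences
                records.append({"prompt": instruction,
                                "chosen": chosen,
                                "rejected": rejected})
    return records
```

The key property is that preference direction is decided entirely by the RL agent's execution success, which is what lets the pipeline skip manual annotation.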


