AI: Add catastrophic forgetting mitigation to MMSD fine-tuning dataset #1324

@anchapin

Description

Summary

Supplement the 1400-pair MMSD Java→Bedrock dataset with a general programming data mix (5–15% of training tokens) to prevent catastrophic forgetting during fine-tuning. The goal is to preserve the base model's general code reasoning while adding Minecraft-specific specialization.

Problem

Fine-tuning Qwen2.5-Coder-7B exclusively on MMSD's 1400 domain-specific pairs risks catastrophic forgetting: the model overwrites general Java and JavaScript knowledge with Minecraft-specific patterns, degrading its ability to handle:

  • Standard Java class hierarchies and design patterns
  • General JSON schema writing
  • JavaScript ES6+ module syntax unrelated to Bedrock

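This is especially risky with QLoRA at r=64: the base weights stay frozen, but a higher rank gives the adapter more trainable parameters and therefore more capacity to pull the model away from its base behavior. The standard mitigation is to mix general coding data into the training set at a ~5–15% token ratio.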
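To make the rank concern concrete, the sketch below counts the trainable parameters a LoRA adapter adds to a single linear layer, using the standard LoRA factorization into two low-rank matrices. The function name and the 4096 x 4096 layer size are illustrative assumptions, not values taken from this issue or from Qwen2.5-Coder-7B.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    LoRA freezes the base weight and learns A (d_in x rank) and B (rank x d_out).
    """
    return rank * (d_in + d_out)


# ASSUMPTION: a hypothetical 4096 x 4096 projection, chosen only for illustration.
D_IN, D_OUT = 4096, 4096

for rank in (8, 16, 64):
    added = lora_trainable_params(D_IN, D_OUT, rank)
    share = added / (D_IN * D_OUT)
    print(f"r={rank:>2}: {added:,} trainable params ({share:.2%} of the frozen layer)")
```

Adapter size grows linearly with rank, so r=64 carries eight times the trainable parameters of r=8 for the same set of target layers.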

What to do

  1. Download a general code dataset from HuggingFace. Recommended options (in order of relevance to PortKit):

    • codeparrot/github-code filtered to java + javascript files (high quality, code-only)
    • bigcode/the-stack-dedup filtered to Java + JavaScript
    • m-a-p/CodeFeedback-Filtered-Instruction (instruction-tuned pairs, Java/JS heavy)
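  2. Sample ~150–200 general Java/JS instruction pairs from the chosen dataset (the two raw-code options must first be converted into instruction pairs). Target ratio: ~12% general / ~88% MMSD-specific by token count. Note that 200 of 1,600 pairs is 12.5% by pair count; the token ratio matches only if general and MMSD pairs have similar average lengths, so verify it with the model tokenizer.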

  3. Mix the datasets before training:

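    from datasets import concatenate_datasets, load_dataset
    
    mmsd = load_dataset("json", data_files="validated_pairs.jsonl")["train"]
    general = load_dataset("m-a-p/CodeFeedback-Filtered-Instruction", split="train")
    general_sample = general.filter(lambda x: x["lang"] in ["java", "javascript"]).shuffle(seed=42).select(range(200))
    
    # Format general data to match Stage A prompt template before concatenating.
    # ASSUMPTION: MMSD rows use "prompt"/"completion" columns; adjust to the real schema.
    def format_general(example):
        return {"prompt": example["query"], "completion": example["answer"]}
    
    # concatenate_datasets requires both datasets to have identical columns.
    general_formatted = general_sample.map(format_general, remove_columns=general_sample.column_names)
    mixed = concatenate_datasets([mmsd, general_formatted]).shuffle(seed=42)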
  4. Update docs/ml_intern_finetuning_prompt.md to include this mixing step in Section 4 (Training Recipe).

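  5. Evaluate the effect: train with and without the mix, then compare perplexity on both the MMSD eval set and a held-out general Java/JS set. A well-calibrated mix should show ≤ 2% perplexity degradation on general code tasks (relative to the base model) while significantly improving the consistency of output quality.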
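Step 2 states its target by token count but samples a fixed number of pairs, so one option is to size the general sample from a token budget instead. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the 880,000-token MMSD total is invented for the demo, and whitespace splitting stands in for the real model tokenizer.

```python
import random
from typing import Callable, Sequence


def general_token_budget(mmsd_tokens: int, general_share: float) -> int:
    """Tokens of general data needed so it makes up `general_share` of the mix.

    Solves g / (g + mmsd_tokens) = general_share for g.
    """
    if not 0.0 < general_share < 1.0:
        raise ValueError("general_share must be strictly between 0 and 1")
    return round(mmsd_tokens * general_share / (1.0 - general_share))


def sample_to_budget(
    examples: Sequence[str],
    budget: int,
    count_tokens: Callable[[str], int],
    seed: int = 42,
) -> list[str]:
    """Shuffle `examples`, then take them in order until the budget is reached."""
    order = list(examples)
    random.Random(seed).shuffle(order)
    picked, used = [], 0
    for text in order:
        if used >= budget:
            break
        picked.append(text)
        used += count_tokens(text)
    return picked


def whitespace_token_count(text: str) -> int:
    # ASSUMPTION: crude stand-in; replace with the fine-tuning tokenizer's count.
    return len(text.split())


# ASSUMPTION: hypothetical MMSD token total, for illustration only.
budget = general_token_budget(mmsd_tokens=880_000, general_share=0.12)
print(f"general-data token budget: {budget:,}")
```

Sampling to a budget can overshoot by at most one example, which is usually negligible next to the 5–15% band given in the summary.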

Expected outcome

  • Preserved general Java/JS reasoning (no regression on non-Minecraft code tasks)
  • Better output for edge cases not covered in the 1400 MMSD pairs
  • More robust handling of Java code patterns the MMSD pairs don't cover (abstract classes, generics, lambda expressions)
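The "no regression" outcome can be checked mechanically. The sketch below assumes per-token negative log-likelihoods (natural log) have already been collected on a held-out general Java/JS set for both a baseline and the fine-tuned model. The function names and the choice of baseline are assumptions; the 2% threshold is the figure from step 5.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_degradation(baseline_ppl, candidate_ppl):
    """Fractional perplexity increase of the candidate over the baseline."""
    return (candidate_ppl - baseline_ppl) / baseline_ppl


def passes_regression_check(baseline_ppl, candidate_ppl, max_degradation=0.02):
    """True if the candidate is no more than `max_degradation` worse than baseline."""
    return relative_degradation(baseline_ppl, candidate_ppl) <= max_degradation
```

Running the same check for the model trained without the general mix gives the comparison the issue asks for: the mixed run should pass where the MMSD-only run may not.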

References

  • Source document: AI for Game and Mod Coding — Section 5 (Training and Fine-Tuning), "Risk Mitigation"
  • Catastrophic forgetting in LLM fine-tuning: https://arxiv.org/abs/2308.08747
  • MMSD dataset: ai_engine/mmsd/synthesis_pairs.jsonl
  • ml-intern prompt: docs/ml_intern_finetuning_prompt.md
