AI: Add catastrophic forgetting mitigation to MMSD fine-tuning dataset #1324

## Summary

Supplement the 1400-pair MMSD Java→Bedrock dataset with a **general programming data mix** (5–15% of training tokens) to prevent catastrophic forgetting during fine-tuning. The goal is to preserve the base model's general code reasoning while adding Minecraft-specific specialization.

## Problem

Fine-tuning `Qwen2.5-Coder-7B` exclusively on MMSD's 1400 domain-specific pairs risks **catastrophic forgetting**: the model overwrites general Java and JavaScript knowledge with Minecraft-specific patterns, degrading its ability to handle:
- Standard Java class hierarchies and design patterns
- General JSON schema writing
- JavaScript ES6+ module syntax unrelated to Bedrock

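This is especially risky with QLoRA at `r=64`: the base weights stay frozen, but a higher rank means more trainable adapter parameters and therefore more capacity to drift away from the base model's behavior. The fix is well understood: mix general coding data into the training set at a ~5–15% token ratio.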
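To make the rank argument concrete, here is a rough sketch of how the number of trainable LoRA parameters scales with `r`. It assumes the standard LoRA factorization (two matrices of shape `d_in × r` and `r × d_out` per adapted linear layer). The layer size in the example is an illustrative placeholder, not a value read from the Qwen2.5-Coder-7B config.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape (d_in, d_out)."""
    # LoRA learns A (d_in x r) and B (r x d_out); the base weight stays frozen.
    return r * (d_in + d_out)


# Illustrative layer size only (hypothetical, not taken from the model config).
D_IN, D_OUT = 4096, 4096

for rank in (8, 16, 64):
    added = lora_trainable_params(D_IN, D_OUT, rank)
    share = added / (D_IN * D_OUT)
    print(f"r={rank:>2}: {added:,} adapter params ({share:.1%} of the frozen layer)")
```

Adapter size grows linearly with rank, so `r=64` trains eight times as many parameters per layer as `r=8`. That is extra capacity to fit the 1400 pairs, and also extra room to overwrite general behavior.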

## What to do

1. **Download a general code dataset** from Hugging Face. Recommended options (in order of relevance to PortKit):
   - `codeparrot/github-code` filtered to `java` + `javascript` files (code-only; these are raw source files, so they must be converted to instruction pairs before mixing)
   - `bigcode/the-stack-dedup` filtered to Java + JavaScript (also raw source files; access requires accepting the dataset terms)
   - `m-a-p/CodeFeedback-Filtered-Instruction` (instruction pairs with a `lang` field; filter to Java/JS)

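2. **Sample ~150–200 general Java/JS instruction pairs** from the chosen dataset. Target ratio: ~12% general / ~88% MMSD-specific by token count. Against 1400 MMSD pairs, 150–200 pairs is roughly 10–12.5% by pair count, so it matches the token target only if general and MMSD pairs have similar average length. Check with the model tokenizer and adjust the sample size accordingly.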

3. **Mix the datasets before training**:
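   ```python
   from datasets import concatenate_datasets, load_dataset

   mmsd = load_dataset("json", data_files="validated_pairs.jsonl")["train"]
   general = load_dataset("m-a-p/CodeFeedback-Filtered-Instruction", split="train")
   general = general.filter(lambda x: x["lang"] in ["java", "javascript"]).shuffle(seed=42)
   # Guard against the filter leaving fewer than 200 rows
   general_sample = general.select(range(min(200, len(general))))

   # Format general data to match Stage A prompt template before concatenating.
   # ASSUMPTION: MMSD rows use "prompt"/"completion" columns; rename to the real schema.
   def to_stage_a(example):
       return {"prompt": example["query"], "completion": example["answer"]}

   general_formatted = general_sample.map(to_stage_a, remove_columns=general_sample.column_names)

   # concatenate_datasets requires both datasets to share the same columns and types
   mixed = concatenate_datasets([mmsd, general_formatted]).shuffle(seed=42)
   ```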

4. **Update `docs/ml_intern_finetuning_prompt.md`** to include this mixing step in Section 4 (Training Recipe).

5. **Evaluate the effect**: train once with and once without the mix, then compare perplexity on two held-out eval sets, one general Java/JS and one MMSD-specific. A well-calibrated mix should keep degradation on general code tasks to ≤ 2% while still showing a significant improvement in output quality consistency.
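The 150–200 figure in step 2 is a pair count, while the target ratio is stated in tokens. A minimal sketch of sampling to a token budget instead is shown below. `count_tokens` is a hypothetical callable (in practice, the length of the model tokenizer's output), and the whitespace counter is only a stand-in for illustration.

```python
import random
from typing import Callable, List, Sequence


def sample_to_token_ratio(
    domain_texts: Sequence[str],
    general_texts: Sequence[str],
    target_ratio: float,
    count_tokens: Callable[[str], int],
    seed: int = 42,
) -> List[str]:
    """Pick general examples until they make up ~target_ratio of all training tokens."""
    if not 0.0 < target_ratio < 1.0:
        raise ValueError("target_ratio must be between 0 and 1")
    domain_tokens = sum(count_tokens(t) for t in domain_texts)
    # general / (general + domain) = ratio  =>  general = domain * ratio / (1 - ratio)
    budget = domain_tokens * target_ratio / (1.0 - target_ratio)

    pool = list(general_texts)
    random.Random(seed).shuffle(pool)

    picked: List[str] = []
    used = 0
    for text in pool:
        if used >= budget:
            break
        picked.append(text)
        used += count_tokens(text)
    return picked


def whitespace_tokens(text: str) -> int:
    # Stand-in tokenizer for illustration only; swap in the real tokenizer.
    return len(text.split())


# Toy data: 88 domain examples and a pool of 100 general examples, 4 "tokens" each.
domain = ["a b c d"] * 88
general_pool = ["x y z w"] * 100
picked = sample_to_token_ratio(domain, general_pool, 0.12, whitespace_tokens)
print(len(picked), "general examples selected")
```

Sampling by token budget keeps the mix near the intended ratio even when general pairs are much longer or shorter than MMSD pairs.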

## Expected outcome

- Preserved general Java/JS reasoning (no regression on non-Minecraft code tasks)
- Better output for edge cases not covered in the 1400 MMSD pairs
- More robust handling of Java code patterns the MMSD pairs don't cover (abstract classes, generics, lambda expressions)
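One way to make "no regression" checkable is to compare perplexity on a held-out general-code set before and after fine-tuning. The sketch below assumes the mean per-token negative log-likelihood (in nats) is already available for each run. The function names are illustrative, and the 2% default threshold is taken from step 5.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from mean per-token negative log-likelihood (natural log)."""
    return math.exp(mean_nll)


def relative_degradation(baseline_nll: float, finetuned_nll: float) -> float:
    """Fractional perplexity increase of the fine-tuned model over the baseline."""
    base = perplexity(baseline_nll)
    return (perplexity(finetuned_nll) - base) / base


def passes_regression_check(
    baseline_nll: float, finetuned_nll: float, max_degradation: float = 0.02
) -> bool:
    """True if general-code perplexity rose by no more than max_degradation."""
    return relative_degradation(baseline_nll, finetuned_nll) <= max_degradation
```

Perplexity is only a proxy for output quality, so a passing check is best paired with a spot check of generated code on non-Minecraft prompts.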

## References

- Source document: *AI for Game and Mod Coding* — Section 5 (Training and Fine-Tuning), "Risk Mitigation"
- Catastrophic forgetting in LLM fine-tuning: https://arxiv.org/abs/2308.08747
- MMSD dataset: `ai_engine/mmsd/synthesis_pairs.jsonl`
- ml-intern prompt: `docs/ml_intern_finetuning_prompt.md`
