Current evaluation covers perplexity on multi-topic passages plus 5 custom QA tasks. For paper and production credibility we need the full RULER suite (needle-in-a-haystack, multi-hop) and LongBench (16 tasks spanning summarization, QA, few-shot, and code completion). Preliminary LongBench results show quality is task-dependent: narrative QA scores degrade by 8-11% at the 10x setting, while factual trivia actually improves.
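
A minimal sketch of how a per-task LongBench run could be wired up, assuming a HuggingFace `datasets` environment and the published `THUDM/LongBench` configs; `generate_fn`, the task subset, and `max_samples` are illustrative placeholders, not this project's actual harness:

```python
# Hypothetical per-task LongBench runner; generate_fn is whatever inference
# entry point the system under test exposes (baseline vs. compressed cache).
from datasets import load_dataset

# Subset of English LongBench configs covering QA, summarization, and code.
TASKS = ["narrativeqa", "qasper", "triviaqa", "gov_report", "lcc"]

def run_task(task: str, generate_fn, max_samples: int = 50) -> list[str]:
    """Run one LongBench task and return raw model outputs for scoring."""
    ds = load_dataset("THUDM/LongBench", task, split="test")
    outputs = []
    for ex in ds.select(range(min(max_samples, len(ds)))):
        # Each example carries the long document in `context`, the task
        # prompt in `input`, and reference answers in `answers`.
        outputs.append(generate_fn(ex["context"], ex["input"]))
    return outputs
```

Running the same `generate_fn` for the baseline and the 10x configuration, then scoring both with LongBench's official per-task metrics (F1, Rouge-L, edit similarity), would make the narrative-QA vs. trivia split directly measurable.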