FACTS is a great paper from @GoogleDeepMind on measuring the factuality of LLM outputs. You can now download their prompt templates from @huggingface to improve LLM-based fact-checking yourself!
📄 The paper introduces the FACTS Grounding benchmark, which evaluates whether LLM responses stay grounded in a provided source document.
🤖 Fact-checking is automated by an ensemble of LLM judges that verify whether a response is fully grounded in a factual reference document.
🧪 The authors tested different prompt templates on held-out data to ensure they generalize.
📖 Reading these templates is highly educational: they show how a frontier lab designs prompts and where those prompts hit their limits.
💾 You can now download and reuse these prompt templates via the prompt-templates library!
🔗 The library simplifies sharing prompt templates on the HF Hub or locally via standardized YAML files; a loading sketch is below. Let's make LLM work more transparent and reproducible by sharing more templates like this!
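Here is a minimal sketch of what loading one of these shared YAML templates can look like. It uses plain huggingface_hub and PyYAML rather than the prompt-templates loader itself (which ships its own API), and the repo id, filename, YAML keys, and placeholder names are all illustrative assumptions, not the actual artifacts:

```python
# Illustrative sketch: fetch a YAML prompt template from the HF Hub and fill it.
# NOTE: repo_id, filename, and the YAML layout are hypothetical; the
# prompt-templates library also provides its own dedicated loader.
import yaml
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="some-user/facts-grounding-prompts",  # hypothetical repo id
    filename="judge_prompt.yaml",                 # hypothetical filename
)
with open(path) as f:
    spec = yaml.safe_load(f)

# Standardized template files store the prompt text with named placeholders
# that are filled in before calling the LLM judge.
prompt = spec["template"].format(
    context_document="France is a country in Europe. Its capital is Paris.",
    user_request="What is the capital of France?",
    response="The capital of France is Paris.",
)
print(prompt)
```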
The TRL v0.13 release is 🔥! My highlights are the new process reward trainer, for training models similar to o1, and tool call support:
🧠 Process reward trainer: Enables training of process-supervised reward models (PRMs), which reward the quality of intermediate steps rather than only the final answer, promoting structured reasoning. Perfect for stepwise reasoning tasks (first sketch after this list).
🔀 Model merging: A new callback leverages mergekit to merge models during training, improving performance by blending the reference and policy models, with the option to push merged models to the Hugging Face Hub (second sketch below).
🛠️ Tool call support: TRL preprocessing now supports tool integration, laying the groundwork for agent fine-tuning, with examples like dynamic temperature fetching in prompts (third sketch below).
⚖️ Mixture of judges: The new AllTrueJudge combines the decisions of multiple binary judges for more nuanced evaluation (fourth sketch below).
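First, a minimal PRM training sketch following the pattern from the TRL docs; the model and dataset choices are illustrative, and argument names may differ slightly across TRL versions:

```python
# Sketch: train a process-supervised reward model with TRL's PRMTrainer.
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

model_id = "Qwen/Qwen2-0.5B-Instruct"  # any small backbone works
tokenizer = AutoTokenizer.from_pretrained(model_id)
# A PRM scores each intermediate step, so it is framed as token
# classification (label 1 = correct step, 0 = incorrect step).
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=2)

# math_shepherd provides prompts with stepwise completions and per-step labels.
dataset = load_dataset("trl-lib/math_shepherd", split="train")

trainer = PRMTrainer(
    model=model,
    args=PRMConfig(output_dir="qwen2-0.5b-prm"),
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```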
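Second, a sketch of wiring the merging callback into a trainer, based on the release notes; the MergeConfig defaults and the callback's exact keyword arguments are assumptions, and mergekit must be installed:

```python
# Sketch: merge reference and policy models during DPO training via mergekit.
# Assumes mergekit is installed; callback kwargs here are illustrative.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer, MergeModelCallback
from trl.mergekit_utils import MergeConfig

model_id = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="qwen2-dpo-merged"),
    train_dataset=dataset,
    processing_class=tokenizer,
    # Merge at checkpoints; set push_to_hub=True to upload merged models.
    callbacks=[MergeModelCallback(MergeConfig(), push_to_hub=False)],
)
trainer.train()
```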
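Third, the temperature example follows the tool-use pattern in transformers chat templates; here is a minimal sketch of the data side, where a Python function's signature and docstring become the tool schema (the model choice and dummy return value are mine):

```python
# Sketch: render a tool schema into a prompt via apply_chat_template.
from transformers import AutoTokenizer

def get_current_temperature(location: str) -> float:
    """Get the current temperature for a location.

    Args:
        location: The city to get the temperature for.
    """
    return 22.0  # dummy value for illustration

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
messages = [{"role": "user", "content": "How warm is it in Paris right now?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_temperature],  # schema is rendered into the prompt
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)
```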
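Finally, a toy sketch of the mixture-of-judges idea; the two rule-based judges below are hypothetical stand-ins for real LLM judges, and the BaseBinaryJudge method signature is assumed from the TRL judges interface:

```python
# Sketch: AllTrueJudge returns 1 only when every sub-judge approves.
from trl import AllTrueJudge, BaseBinaryJudge

class NotEmptyJudge(BaseBinaryJudge):
    """Toy judge: passes completions that are non-empty."""
    def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
        return [1 if c.strip() else 0 for c in completions]

class ShortEnoughJudge(BaseBinaryJudge):
    """Toy judge: passes completions under 200 characters."""
    def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
        return [1 if len(c) < 200 else 0 for c in completions]

judge = AllTrueJudge([NotEmptyJudge(), ShortEnoughJudge()])
verdicts = judge.judge(
    prompts=["What is 2 + 2?"],
    completions=["4"],
)
print(verdicts)  # [1] only if every binary judge returned 1
```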