FACTS is a great paper from @GoogleDeepMind on measuring the factuality of LLM outputs. You can now download their prompt templates from @huggingface to improve LLM-based fact-checking yourself!
- The paper introduces the FACTS Grounding benchmark for evaluating the factuality of LLM outputs.
- Fact-checking is automated by an ensemble of LLM judges that verify whether a response is fully grounded in a factual reference document.
- The authors tested different prompt templates on held-out data to check that they generalize.
- It's highly educational to read these templates to learn how frontier labs design prompts and understand their limitations.
- You can now download and reuse these prompt templates via the prompt-templates library!
- The library simplifies sharing prompt templates on the HF hub or locally via standardized YAML files.
Let's make LLM work more transparent and reproducible by sharing more templates like this!
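To make the judge-ensemble idea concrete, here is a minimal sketch. It is not the FACTS implementation: the template text, the `grounding_score` helper, and the averaging rule are all assumptions for illustration, and the judges are plain stand-in functions rather than real LLM calls.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical judge prompt for illustration; NOT one of the actual FACTS templates.
JUDGE_TEMPLATE = (
    "Document:\n{document}\n\n"
    "Response:\n{response}\n\n"
    "Is every claim in the response supported by the document? Answer yes or no."
)

# A judge takes the filled prompt and returns True if it deems the response grounded.
Judge = Callable[[str], bool]


def grounding_score(document: str, response: str, judges: Sequence[Judge]) -> float:
    """Return the fraction of judges that consider the response fully grounded."""
    prompt = JUDGE_TEMPLATE.format(document=document, response=response)
    return mean(1.0 if judge(prompt) else 0.0 for judge in judges)


if __name__ == "__main__":
    # Stand-in judges; in practice each would call a different LLM.
    strict = lambda prompt: "Paris" in prompt and "1889" in prompt
    lenient = lambda prompt: "Paris" in prompt
    score = grounding_score(
        document="The Eiffel Tower in Paris was completed in 1889.",
        response="The Eiffel Tower is in Paris.",
        judges=[strict, lenient],
    )
    print(f"grounding score: {score}")
```

Using several judges and averaging their verdicts is one simple way to reduce the bias of any single judge model; other aggregation rules (for example, a majority vote) would fit the same structure.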
Multimodal 🖼️
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released multimodal textbook: 22k hours' worth of samples from instructional videos 🤯
> Dataset: SciCap, a benchmark dataset for captioning on scientific documents, is released along with the challenge!
Embeddings
> @MoritzLaurer released a zero-shot version of ModernBERT large
> KaLM is a new family of performant multilingual embedding models with an MIT license, built using Qwen2-0.5B
Image/Video Generation ⏯️
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts 🔥
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m, a Cosmos-tokenized set of samples from OpenVid-1M
Others
> Prior Labs released TabPFNv2: the best tabular transformer for classification and regression is out
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
The TRL v0.13 release is 🔥! My highlights are the new process reward trainer, for training models similar to o1, and tool call support:
- Process reward trainer: Enables training of Process-supervised Reward Models (PRMs), which reward the quality of intermediate steps rather than only the final answer. A good fit for stepwise reasoning tasks.
- Model merging: A new callback leverages mergekit to merge models during training, improving performance by blending reference and policy models. It can optionally push merged models to the Hugging Face Hub.
- Tool call support: TRL preprocessing now supports tool integration, laying the groundwork for agent fine-tuning with examples like dynamic temperature fetching in prompts.
- Mixture of judges: The new AllTrueJudge combines decisions from multiple binary judges for more nuanced evaluation.
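The "mixture of judges" idea can be sketched in a few lines. This is an illustration of the combination rule only, not TRL's code: the `all_true` function and the stand-in judges are hypothetical, and the return convention (1 = pass, 0 = fail, -1 = judge error) reflects my understanding of TRL's binary judges and should be checked against the TRL documentation.

```python
from typing import Callable, Sequence

# A binary judge maps (prompt, completion) to 1 (pass), 0 (fail) or -1 (judge error).
# This convention is an assumption based on my reading of the TRL docs.
BinaryJudge = Callable[[str, str], int]


def all_true(prompt: str, completion: str, judges: Sequence[BinaryJudge]) -> int:
    """Combine several binary judges: pass only if every judge passes."""
    verdicts = [judge(prompt, completion) for judge in judges]
    if -1 in verdicts:  # propagate a failure of any individual judge
        return -1
    return 1 if all(v == 1 for v in verdicts) else 0


if __name__ == "__main__":
    # Hypothetical rule-based judges standing in for LLM-backed ones.
    is_concise = lambda prompt, completion: int(len(completion.split()) <= 20)
    is_polite = lambda prompt, completion: int("stupid" not in completion.lower())
    verdict = all_true("Explain PRMs.", "PRMs score each reasoning step.", [is_concise, is_polite])
    print(f"combined verdict: {verdict}")
```

Requiring every judge to agree makes the combined signal stricter than any single judge, which is useful when each judge checks a different property of the completion.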
Supercharge your LLM apps with Langfuse on Hugging Face Spaces!
Langfuse brings end-to-end observability and tooling to accelerate your dev workflow from experiments through production.
Now available as a Docker Space directly on the HF Hub!
- Trace everything: Monitor LLM calls, retrieval, and agent actions with popular frameworks
- One-click deployment: Runs on Spaces with persistent storage and integrated OAuth
- Simple Prompt Management: Version, edit, and update without redeployment
- Intuitive Evals: Collect user feedback, run model/prompt evaluations, and improve quality
- Dataset Creation: Build datasets directly from production data to enhance future performance
Kudos to the Langfuse team for this collab and the awesome, open-first product they're building! @marcklingen @Clemo @MJannik
OpenAI is losing money on the $200/month subscription 🤯. It's crazy how expensive it is to run the largest LLMs:
- ChatGPT Pro costs $200/month ($2,400/year) and is still unprofitable for OpenAI due to higher-than-expected usage.
- OpenAI reportedly expected losses of about $5 billion on revenue of $3.7 billion last year, with ChatGPT alone once costing an estimated $700,000 per day to operate. 💸🔥
- They build strong models and do great research. Whether this business model will work in the long run is one of the biggest questions in the AI economy today.
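For a sense of scale, the figures above can be put side by side with some back-of-the-envelope arithmetic. The inputs are the reported numbers from the post; the derived values rest on simplifying assumptions (a 365-day year, and spend taken as revenue plus loss) and are illustrative only, not OpenAI's actual accounting.

```python
# Reported figures quoted in the post.
PRO_PRICE_PER_MONTH_USD = 200
CHATGPT_COST_PER_DAY_USD = 700_000  # older estimate for ChatGPT alone
REVENUE_USD = 3_700_000_000
EXPECTED_LOSS_USD = 5_000_000_000

# Derived values (assumptions: 365-day year; spend = revenue + loss).
pro_price_per_year = PRO_PRICE_PER_MONTH_USD * 12
chatgpt_cost_per_year = CHATGPT_COST_PER_DAY_USD * 365
implied_spend = REVENUE_USD + EXPECTED_LOSS_USD
# Pro subscriptions whose fees would equal the old daily-cost estimate over a year.
equivalent_pro_subscriptions = chatgpt_cost_per_year / pro_price_per_year

print(f"Pro price per year:        ${pro_price_per_year:,}")
print(f"ChatGPT cost per year:     ${chatgpt_cost_per_year:,}")
print(f"Implied total spend:       ${implied_spend:,}")
print(f"Equivalent Pro subscribers: {equivalent_pro_subscriptions:,.0f}")
```

The last number mixes an older cost estimate with the current Pro price, so it is only a rough yardstick rather than a break-even point.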
Releasing a new zeroshot-classifier based on ModernBERT! Some key takeaways:
- Speed & efficiency: It's multiple times faster and uses significantly less memory than DeBERTav3. You can use larger batch sizes, and enabling bf16 (instead of fp16) gave me a further ~2x speed boost.
- Performance tradeoff: It performs slightly worse than DeBERTav3 on average across my zeroshot classification task collection.
- Use cases: I recommend using it for scenarios requiring speed and a larger context window (8k).
- What's next? I'm preparing a newer version trained on better + longer synthetic data to fully leverage the 8k context window and improve upon the training mix of my older zeroshot-v2.0 models. I also hope that there will be a multilingual variant in the future.
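A minimal usage sketch with the Transformers zero-shot pipeline, with bf16 enabled as suggested above. The model ID and the hypothesis template are assumptions on my part (check the model card for the exact repository name and recommended template), and `build_hypotheses` is a hypothetical helper that only illustrates how each label is turned into an NLI hypothesis.

```python
from typing import List, Sequence

# Assumed model ID; verify the exact repository name on the Hugging Face Hub.
MODEL_ID = "MoritzLaurer/ModernBERT-large-zeroshot-v2.0"
HYPOTHESIS_TEMPLATE = "This text is about {}"


def build_hypotheses(labels: Sequence[str], template: str = HYPOTHESIS_TEMPLATE) -> List[str]:
    """Show how zero-shot NLI classification phrases each label as a hypothesis."""
    return [template.format(label) for label in labels]


def classify(text: str, labels: Sequence[str], model_id: str = MODEL_ID) -> dict:
    """Run zero-shot classification in bf16; downloads the model on first use."""
    # Imported lazily so the helper above works without torch/transformers installed.
    import torch
    from transformers import pipeline

    classifier = pipeline(
        "zero-shot-classification",
        model=model_id,
        torch_dtype=torch.bfloat16,  # bf16 instead of fp16, as in the post
    )
    return classifier(
        text,
        candidate_labels=list(labels),
        hypothesis_template=HYPOTHESIS_TEMPLATE,
        multi_label=False,
    )


if __name__ == "__main__":
    print(build_hypotheses(["politics", "economy", "sports"]))
    # Uncomment to run the model (requires torch, transformers and a download):
    # print(classify("The central bank raised interest rates.", ["politics", "economy", "sports"]))
```

The pipeline scores each hypothesis against the input text and returns the labels ranked by score, so adding or changing classes needs no retraining.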