Caleb Fahlgren's picture

Caleb Fahlgren PRO

cfahlgren1

AI & ML interests

None yet

Recent Activity

updated a dataset 21 minutes ago
cfahlgren1/react-code-instructions
updated a dataset about 14 hours ago
cfahlgren1/hub-stats
updated a dataset about 16 hours ago
duckdb-nsql-hub/duckdb-nsql-scores
View all activity

Articles

Organizations

Hugging Face's profile picture Datasets Maintainers's profile picture Hugging Face OSS Metrics's profile picture Hugging Face TB Research's profile picture ChatDB's profile picture Cognitive Computations's profile picture nltpt-q's profile picture DuckDB Text-2-SQL Bench's profile picture open/ acc's profile picture Bluesky Community's profile picture

cfahlgren1's activity

reacted to merve's post with ❀️ about 17 hours ago
view post
Post
1280
What a beginning to this year in open ML 🀠
Let's unwrap! merve/jan-10-releases-677fe34177759de0edfc9714

Multimodal πŸ–ΌοΈ
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released multimodal textbook β€” 22k hours worth of samples from instruction videos 🀯
> Dataset: SciCap captioning on scientific documents benchmark dataset is released along with the challenge!

LLMs πŸ’¬
> Microsoft released Phi-4, sota open-source 14B language model πŸ”₯
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B 🐬🐬
> Prime-RL released Eurus-2-7B-PRIME a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Owen2.5-3B-Instruct πŸ’­
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview πŸ“•
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of code instruction-code pairs πŸ“•
> Dataset: Qwen team is on the roll, they just released CodeElo, a dataset of code preferences πŸ‘©πŸ»β€πŸ’»

Embeddings πŸ”–
> @MoritzLaurer released zero-shot version of ModernBERT large πŸ‘
> KaLM is a new family of performant multilingual embedding models with MIT license built using Qwen2-0.5B

Image/Video Generation ⏯️
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts πŸ”₯
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m Cosmos-tokenized OpenVid-1M with samples from OpenVid-1M

Others
> Prior Labs released TabPFNv2, the best tabular transformer is out for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
posted an update 1 day ago
view post
Post
949
Wow, I just added Langfuse tracing to the Deepseek Artifacts app and it's really nice πŸ”₯

It allows me to visualize and track more things along with the cfahlgren1/react-code-instructions dataset.

It was just added as a one click Docker Space template, so it's super easy to self host πŸ’ͺ
posted an update 8 days ago
view post
Post
1929
You'll notice the AI in the SQL Console is much better at working with chatml conversations:

Here's example of unnesting the cfahlgren1/react-code-instructions in less than 10 seconds by asking it. Check it out here: cfahlgren1/react-code-instructions

- "show me the average assistant response length"
- "extract user, system, and assistant messages into separate columns"

It's super easy to work with conversational datasets now with natural language πŸ—£οΈ





reacted to clem's post with πŸš€πŸ€—β€οΈ 8 days ago
view post
Post
3830
Cool to see @ylecun joining the top 10 of most followed on HF!

(and leaderboard by @mvaloatto is here: mvaloatto/TCTF)
  • 2 replies
Β·
posted an update 12 days ago
reacted to davidberenstein1957's post with πŸ§ πŸ‘€ 12 days ago
reacted to lorraine2's post with πŸš€ 25 days ago
view post
Post
1997
πŸ¦™New NVIDIA paper: LLaMA-Mesh πŸ¦™

We enable large language models to generate and understand 3D meshes by representing them as text and fine-tuning. This unifies the 3D and text modalities in a single model and preserves language abilities, unlocking conversational 3D creation with mesh understanding.

πŸ”Ž Project Page: https://research.nvidia.com/labs/toronto-ai/LLaMA-Mesh/
πŸ•ΉοΈ Interactive Demo: Zhengyi/LLaMA-Mesh (courtesy of HuggingFace and Gradio)
πŸ“– Full Paper: https://arxiv.org/abs/2411.09595
πŸ‘¨β€πŸ’»Code: https://github.com/nv-tlabs/LLaMa-Mesh
πŸ’Ύ Model Checkpoint: Zhengyi/LLaMA-Mesh
🧩 Blender Addon: https://github.com/huggingface/meshgen (courtesy of Dylan Ebert)
πŸŽ₯ 5-min Overview Video: https://youtu.be/eZNazN-1lPo?si=-idQa5aaceVw0Bbj (courtesy of AI Papers Academy)
reacted to julien-c's post with πŸ‘πŸ€—β€οΈπŸ”₯ about 1 month ago
view post
Post
8203
After some heated discussion πŸ”₯, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community πŸ”₯

cc: @reach-vb @pierric @victor and the HF team
Β·
posted an update about 1 month ago
view post
Post
1933
You can just ask things πŸ—£οΈ

"show me messages in the coding category that are in the top 10% of reward model scores"

Download really high quality instructions from the Llama3.1 405B synthetic dataset πŸ”₯

argilla/magpie-ultra-v1.0

posted an update about 1 month ago
view post
Post
3019
We just dropped an LLM inside the SQL Console 🀯

The amazing, new Qwen/Qwen2.5-Coder-32B-Instruct model can now write SQL for any Hugging Face dataset ✨

It's 2025, you shouldn't be hand writing SQL! This is a big step in making it where anyone can do in depth analysis on a dataset. Let us know what you think πŸ€—
posted an update about 2 months ago
view post
Post
919
observers πŸ”­ - automatically log all OpenAI compatible requests to a datasetπŸ’½

β€’ supports any OpenAI compatible endpoint πŸ’ͺ
β€’ supports DuckDB, Hugging Face Datasets, and Argilla as stores

> pip install observers

No complex framework. Just a few lines of code to start sending your traces somewhere. Let us know what you think! @davidberenstein1957 and I will continue iterating!

Here's an example dataset that was logged to Hugging Face from Ollama: cfahlgren1/llama-3.1-awesome-chatgpt-prompts
replied to their post about 2 months ago
posted an update about 2 months ago
view post
Post
876
You can create charts, leaderboards, and filters on top of any Hugging Face dataset in less than a minute

β€’ ASCII Bar Charts πŸ“Š
β€’ Powered by DuckDB WASM ⚑
β€’ Download results to Parquet πŸ’½
β€’ Embed and Share results with friends πŸ“¬

Do you have any interesting queries?
reacted to davanstrien's post with ❀️ about 2 months ago