Adam Molnar

lunarflu

AI & ML interests

join the Hugging Face discord! hf.co/discord/join

lunarflu's activity

reacted to nyuuzyou's post with 🤗 1 day ago

🗂️ I don't think the collections feature of Hugging Face is widely used, even though it's an excellent way to organize and discover interesting resources. To do my bit to change that, I've created two carefully curated collections that combine both my original work and other valuable datasets:

Educational Datasets
- Mostly English-Russian, but other languages are also included
- Extended by my new Begemot.ai dataset (2.7M+ Russian education records) https://huggingface.co/datasets/nyuuzyou/begemot

Link: https://huggingface.co/collections/nyuuzyou/educational-datasets-677c268978ac1cec96cc3605

Anime & Art
- Extensive art-focused collection, including my new datasets:
  - Buzzly.art (2K artworks) https://huggingface.co/datasets/nyuuzyou/buzzlyart
  - Paintberri (60K+ pieces) https://huggingface.co/datasets/nyuuzyou/paintberri
  - Itaku.ee (924K+ items) https://huggingface.co/datasets/nyuuzyou/itaku
- Extended with other amazing datasets from the community

Link: https://huggingface.co/collections/nyuuzyou/anime-and-art-677ae996682a389fccd892c3

Collections should become a more common feature - hopefully this will encourage others to create and share their own curated collections. By organizing related datasets into these themed collections, I hope to make it easier for researchers and developers to discover and use these valuable resources.

1 reply

reacted to merve's post with ❤️ 1 day ago

What a beginning to this year in open ML 🤠
Let's unwrap! https://huggingface.co/collections/merve/jan-10-releases-677fe34177759de0edfc9714

Multimodal 🖼️
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released multimodal textbook — 22k hours worth of samples from instruction videos 🤯
> Dataset: SciCap captioning on scientific documents benchmark dataset is released along with the challenge!

LLMs 💬
> Microsoft released Phi-4, sota open-source 14B language model 🔥
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B 🐬🐬
> Prime-RL released Eurus-2-7B-PRIME, a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Qwen2.5-3B-Instruct 💭
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview 📕
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of code instruction-code pairs 📕
> Dataset: Qwen team is on a roll, they just released CodeElo, a dataset of code preferences 👩🏻‍💻

Embeddings 🔖
> @MoritzLaurer released zero-shot version of ModernBERT large 👏
> KaLM is a new family of performant multilingual embedding models with MIT license built using Qwen2-0.5B

Image/Video Generation ⏯️
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts 🔥
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m Cosmos-tokenized OpenVid-1M with samples from OpenVid-1M

Others
> Prior Labs released TabPFNv2, the best tabular transformer for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding

reacted to as-cle-bert's post with 🧠 1 day ago

Hi HuggingFace community!🤗

I recently released PrAIvateSearch v2.0-beta.0 (https://github.com/AstraBert/PrAIvateSearch), my privacy-first, AI-powered, user-centered and data-safe application aimed at providing a local and open-source alternative to big AI search engines such as SearchGPT or Perplexity AI.

We have several key changes:

- New chat UI built with NextJS
- DuckDuckGo API used for web search instead of Google
- https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct as a language model served on API (by FastAPI)
- Crawl4AI crawler used for web scraping
- Optimizations in the data workflow inside the application

Read more in my blog post 👉 https://huggingface.co/blog/as-cle-bert/search-the-web-with-ai

Have fun and feel free to leave feedback about how to improve the application!✨

3 replies

replied to as-cle-bert's post 1 day ago

btw, a bit offtopic, but you might be interested in applying for this!

https://x.com/FlavienC/status/1877651464127926589

reacted to as-cle-bert's post with 🔥 1 day ago

reacted to AkimfromParis's post with 👀 2 days ago

💵 Polymarket is leveraging “Chatbot Arena LLM Leaderboard” on HuggingFace for online gambling on the “Top AI model on January 31?”. 🤗

As of January 3rd, 2025:
1. Gemini (83%)
2. ChatGPT (13%)
3. Other (2%)
4. Claude (2%)
5. Grok (1%)
6. Llama (<1%)

🇺🇸 The market opinion is following historical data. It's clearly biased towards historical US AI giants, yet Polymarket is forbidden in the USA and for US citizens.

🇨🇳 Under “Other”, you might have Chinese AI labs that are probably the future AI leaders (Qwen, DeepSeek, Yi).

⚖️ In the market resolution, if two models are tied in the evaluation, the tie is broken by alphabetical order. (e.g. if both were tied, “Google” would resolve to “Yes”, and “xAI” would resolve to “No”). 🙃
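The tie-break rule described above can be sketched in a few lines of Python. This is only an illustration of "highest score wins, ties go to the alphabetically first name", not Polymarket's actual resolution logic. The function name `resolve_winner` and the scores are hypothetical, and comparing names case-insensitively is an assumption.

```python
from typing import Mapping


def resolve_winner(scores: Mapping[str, float]) -> str:
    """Return the name with the highest score.

    If several names share the top score, pick the alphabetically first one
    (case-insensitive comparison is an assumption made for this sketch).
    """
    top_score = max(scores.values())
    tied = [name for name, score in scores.items() if score == top_score]
    return min(tied, key=str.casefold)


# Hypothetical leaderboard scores, invented purely for illustration.
example_scores = {"xAI": 1370.0, "Google": 1370.0, "OpenAI": 1365.0}
winner = resolve_winner(example_scores)
print(winner)
```

With these made-up scores, "Google" and "xAI" are tied at the top, so the alphabetical rule decides the outcome regardless of the order in which the entries are listed.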

That might be a violation of the Chatbot Arena policy? And maybe HuggingFace's? @clem
Or maybe authors and contributors should get a cut each month as “market makers”. @weichiang @angelopoulos

1 reply

replied to AkimfromParis's post 2 days ago

Thanks! We're taking a look 🤗

reacted to alielfilali01's post with 👍 4 days ago

3C3H AraGen Leaderboard welcomes today deepseek-ai/DeepSeek-V3 and 12 other models (including the late gpt-3.5 💀) to the ranking of best LLMs in Arabic !

Observations:
- DeepSeek-v3 ranked 3rd and is the only open model among the top 5!

- A 14B open model (Qwen/Qwen2.5-14B-Instruct) outperforms gpt-3.5-turbo-0125 (from last year). This shows how far we have come in advancing and supporting the Arabic presence within the LLM ecosystem!

- Contrary to what is observed in likelihood-acc leaderboards (like OALL/Open-Arabic-LLM-Leaderboard), further finetuned models like maldv/Qwentile2.5-32B-Instruct actually perform worse than the original model Qwen/Qwen2.5-32B-Instruct.
It's worth noting that the decrease is statistically insignificant, which implies that, at best, out-of-domain finetuning does not really hurt the model's original capabilities acquired during pretraining.
Previous work addressed this (finetuning vs. pretraining), but more investigation is required (any PhDs here? This could be your question...)

Check out the latest rankings: inceptionai/AraGen-Leaderboard

reacted to albertvillanova's post with 👀 4 days ago

Discover all the improvements in the new version of Lighteval: https://huggingface.co/docs/lighteval/

reacted to Jaward's post with 🔥🧠 4 days ago

damn I love nvidia's bullish stance on taking AI to the edge - from being the overlord of compute to cutting-edge physical AI with SOTA multiverse simulation engines that bring the scaling laws under your control!!

My favorite: Cosmos - a fully open-source, open-weight, physics-based video gen platform, what an incredible way to start off the year✨

Code: https://github.com/NVIDIA/Cosmos
Models: nvidia/cosmos-6751e884dc10e013a0a0d8e6
Paper: https://d1qx31qr3h6wln.cloudfront.net/publications/NVIDIA%20Cosmos_2.pdf

reacted to MoritzLaurer's post with 🤯👀 4 days ago

OpenAI is losing money on the $200/month subscription 🤯. It's crazy how expensive it is to run these largest LLMs:

- ChatGPT Pro costs $200/month ($2,400/year) and is still unprofitable for OpenAI due to higher-than-expected usage.
- OpenAI reportedly expected losses of about $5 billion on revenue of $3.7 billion last year, with ChatGPT alone once costing an estimated $700,000 per day to operate. 💸🔥
- They build strong models and do great research. Whether this business model will work in the long run is one of the biggest questions in the AI economy today.
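The numbers in the list above can be sanity-checked with simple arithmetic. The sketch below only restates the figures quoted in the post (it does not verify them); the helper names are hypothetical, and annualizing the daily cost assumes a constant cost over 365 days.

```python
# Figures as quoted in the post; they are not independently verified here.
PRO_PRICE_PER_MONTH_USD = 200
EXPECTED_LOSS_USD = 5_000_000_000
REVENUE_USD = 3_700_000_000
CHATGPT_COST_PER_DAY_USD = 700_000


def yearly_price(monthly_price: int) -> int:
    """Subscription price over 12 months."""
    return monthly_price * 12


def loss_per_revenue_dollar(loss: int, revenue: int) -> float:
    """Dollars lost for every dollar of revenue."""
    return loss / revenue


def annualized(daily_cost: int, days: int = 365) -> int:
    """Daily cost scaled to a year, assuming the cost stays constant."""
    return daily_cost * days


if __name__ == "__main__":
    print(yearly_price(PRO_PRICE_PER_MONTH_USD))
    print(round(loss_per_revenue_dollar(EXPECTED_LOSS_USD, REVENUE_USD), 2))
    print(annualized(CHATGPT_COST_PER_DAY_USD))
```

Under these figures, the reported loss exceeds the revenue itself, which is the point the post is making about how expensive the largest models are to run.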

Source with the numbers 👇
https://techcrunch.com/2025/01/05/openai-is-losing-money-on-its-pricey-chatgpt-pro-plan-ceo-sam-altman-says/

3 replies

reacted to m-ric's post with 🚀🔥 4 days ago

Since I published it on GitHub a few days ago,
Hugging Face's new agentic library 𝘀𝗺𝗼𝗹𝗮𝗴𝗲𝗻𝘁𝘀 has gathered nearly 4k stars 🤯

➡️ But we are just getting started on agents: so we are hiring an ML Engineer to join me and double down on this effort!

The plan is to build GUI agents: agents that can act on your computer with mouse & keyboard, like Claude Computer Use.

We will make it work better, and fully open. ✨

Sounds like something you'd like to do? Apply here 👉 https://apply.workable.com/huggingface/j/AF1D4E3FEB/

3 replies

reacted to hba123's post with 🚀 4 days ago

I have some New Year presents for you, #MachineLearning and #AI community! We just opened our code for new state-of-the-art results that beat EAGLE-2 and Medusa at #LLM inference.

We also shared the model checkpoint on @huggingface! @MatthieuZ

Check the blog out: https://huggingface.co/blog/hba123/sotaspeculativedecoding

replied to Xenova's post 15 days ago

waiting for moonshine-distilled next :)

reacted to Xenova's post with 🚀🔥❤️ 15 days ago

Introducing Moonshine Web: real-time speech recognition running 100% locally in your browser!
🚀 Faster and more accurate than Whisper
🔒 Privacy-focused (no data leaves your device)
⚡️ WebGPU accelerated (w/ WASM fallback)
🔥 Powered by ONNX Runtime Web and Transformers.js

Demo: webml-community/moonshine-web
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/moonshine-web

5 replies