AI & ML interests

None defined yet.

Recent Activity

akhaliq updated a Space over 2 years ago
tensorflow/efficientnet_b2
akhaliq updated a Space over 2 years ago
tensorflow/efficientnet_b1
akhaliq updated a Space almost 3 years ago
tensorflow/efficientnet_b0

tensorflow's activity

merve
posted an update 1 day ago
What a beginning to this year in open ML 🀠
Let's unwrap! merve/jan-10-releases-677fe34177759de0edfc9714

Multimodal πŸ–ΌοΈ
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released a multimodal textbook: 22k hours' worth of samples from instruction videos 🤯
> Dataset: SciCap, a benchmark dataset for captioning figures in scientific documents, is released along with its challenge!

LLMs πŸ’¬
> Microsoft released Phi-4, sota open-source 14B language model πŸ”₯
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B 🐬🐬
> Prime-RL released Eurus-2-7B-PRIME, a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Qwen2.5-3B-Instruct 💭
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview πŸ“•
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of code instruction-code pairs πŸ“•
> Dataset: the Qwen team is on a roll; they just released CodeElo, a dataset of code preferences 👩🏻‍💻

Embeddings πŸ”–
> @MoritzLaurer released zero-shot version of ModernBERT large πŸ‘
> KaLM is a new family of performant multilingual embedding models with MIT license built using Qwen2-0.5B

Image/Video Generation ⏯️
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and texts πŸ”₯
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m, a Cosmos-tokenized version of samples from OpenVid-1M

Others
> Prior Labs released TabPFNv2, the best-performing tabular transformer, for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
merve
posted an update 2 days ago
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license πŸ’— ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093

> The models are capable of tasks involving vision-language understanding and visual referring (referring segmentation), both for images and videos ⏯️

> The models come in 1B, 4B and 8B sizes, based on InternVL2.5 for the base architecture and Qwen2, Qwen2.5 or InternLM2 for the language model part (depending on the checkpoint)

> The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image and video), then concatenates the encodings to feed into the LLM 💬

> the output segmentation tokens are passed to SAM2 to match text (captions or semantic classes) to masks

> Their annotation pipeline is also interesting: they seem to use two open large vision LMs to refine the annotations, with different levels of description to provide consistency.
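
To make that data flow concrete, here's a toy PyTorch sketch of the design (every module name and dimension below is invented for illustration; this is not SA2VA's actual code):

import torch
import torch.nn as nn

D = 64  # toy LLM hidden size

class Sa2VAToy(nn.Module):
    """Toy illustration: one encoder per modality, concatenated into a single
    token sequence for the LLM; [SEG] token states go to a mask decoder."""
    def __init__(self):
        super().__init__()
        self.image_enc = nn.Linear(32, D)     # stands in for the vision tower
        self.prompt_enc = nn.Linear(16, D)    # stands in for the visual-prompt encoder
        self.text_enc = nn.Embedding(100, D)  # stands in for the text embedding layer
        layer = nn.TransformerEncoderLayer(D, 4, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)
        self.to_sam2 = nn.Linear(D, 256)      # stands in for SAM2's prompt interface

    def forward(self, image_feats, prompt_feats, text_ids, seg_positions):
        tokens = torch.cat([                  # concatenate all modality encodings
            self.image_enc(image_feats),
            self.prompt_enc(prompt_feats),
            self.text_enc(text_ids),
        ], dim=1)
        hidden = self.llm(tokens)
        seg_states = hidden[:, seg_positions]   # hidden states at [SEG] tokens
        return self.to_sam2(seg_states)         # SAM2 would decode these into masks

model = Sa2VAToy()
out = model(torch.randn(1, 8, 32), torch.randn(1, 2, 16),
            torch.randint(0, 100, (1, 5)), torch.tensor([12, 13]))
print(out.shape)  # (1, 2, 256): one mask-prompt embedding per [SEG] token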
merve
posted an update 11 days ago
supercharge your LLM apps with smolagents πŸ”₯

however cool your LLM is, without being agentic it can only go so far

enter smolagents: a new agent library by Hugging Face to make the LLM write code, do analysis and automate boring stuff!

Here's our blog to get you started: https://huggingface.co/blog/smolagents
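
A minimal sketch in the spirit of the examples in that blog (assumes pip install smolagents and a logged-in Hugging Face token; HfApiModel and DuckDuckGoSearchTool are the names from the initial release and may have changed since):

from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# a code-writing agent with a web-search tool, backed by the default Inference API model
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())

agent.run(
    "How many seconds would it take for a leopard at full speed "
    "to run through Pont des Arts?"
)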
merve
posted an update 18 days ago
akhaliq
posted an update 23 days ago
Google drops Gemini 2.0 Flash Thinking

a new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. The model plans (with thoughts visible), can solve complex problems with Flash speeds, and more

now available in anychat, try it out: akhaliq/anychat
merve
posted an update 24 days ago
Aya by Cohere For AI can now see! πŸ‘€

The C4AI community has built Maya 8B, a new open-source multilingual VLM built on SigLIP and Aya 8B 🌱 works in 8 languages! 🗣️

The authors extend the LLaVA dataset using Aya's translation capabilities, with 558k examples!
Try it here: kkr5155/maya_demo

Dataset maya-multimodal/pretrain

Model maya-multimodal/maya πŸ‘
kudos @nahidalam and team
merve
posted an update 25 days ago
Apollo is a new family of open-source video language models by Meta, where 3B model outperforms most 7B models and 7B outperforms most 30B models 🧢

✨ the models come in 1.5B https://huggingface.co/Apollo-LMMs/Apollo-1_5B-t32, 3B https://huggingface.co/Apollo-LMMs/Apollo-3B-t32 and 7B https://huggingface.co/Apollo-LMMs/Apollo-7B-t32 with Apache 2.0 license, based on Qwen1.5 & Qwen2
✨ the authors also release a benchmark dataset https://huggingface.co/spaces/Apollo-LMMs/ApolloBench

The paper has a lot of experiments (they trained 84 models!) about what makes the video LMs work ⏯️

Try the demo for best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
they evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that design decisions made on small models also scale properly when models and datasets are scaled up 📈 scaling the dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies, and find that FPS sampling is better than uniform sampling, and they find 8-32 tokens per frame optimal
> They also compare image encoders, trying a variety of models from shape-optimized SigLIP to DINOv2
they find google/siglip-so400m-patch14-384 to be the most powerful 🔥
> they also compare freezing different parts of the models; training all stages with some parts frozen gives the best results

They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo 7B outperforms 30B models πŸ”₯
Β·
merve
posted an update about 1 month ago
A complete RAG pipeline includes a reranker, which re-ranks the retrieved documents to find the best one 📓
The same goes for multimodal RAG: there are multimodal rerankers we can integrate into multimodal RAG pipelines!
Learn how to build a complete multimodal RAG pipeline with vidore/colqwen2-v1.0 as the retriever, lightonai/MonoQwen2-VL-v0.1 as the reranker and Qwen/Qwen2-VL-7B-Instruct as the VLM in this notebook, which runs on a GPU as small as an L4 🔥 https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_reranker_and_vlms
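
For the retrieval stage, here's a minimal sketch with the colpali-engine package, following the usage pattern from the vidore/colqwen2-v1.0 model card (the reranker and VLM stages are covered in the notebook; the page image paths are placeholders):

import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0", torch_dtype=torch.bfloat16, device_map="cuda:0"
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

images = [Image.open("page_1.png"), Image.open("page_2.png")]  # document pages
queries = ["What does the chart about revenue show?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)   # multi-vector page embeddings
    query_embeddings = model(**batch_queries)  # multi-vector query embeddings

# late-interaction (MaxSim) scoring: higher score = more relevant page
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores.argmax(dim=1))  # index of the best-matching page per query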
merve
posted an update about 1 month ago
This week in open-source AI was insane 🀠 A small recapπŸ•ΊπŸ» merve/dec-6-releases-67545caebe9fc4776faac0a3

Multimodal πŸ–ΌοΈ
> Google shipped PaliGemma 2, a new iteration of PaliGemma with more sizes: 3B, 10B and 28B, with pre-trained and captioning variants 👏
> OpenGVLab released InternVL2.5, seven new vision LMs in different sizes, with a sota checkpoint with MIT license ✨
> Qwen team at Alibaba released the base models of Qwen2VL in 2B, 7B and 72B ckpts

LLMs πŸ’¬
> Meta released Llama 3.3 70B, a further-trained iteration of Llama 70B
> EuroLLM-9B-Instruct is a new multilingual LLM for European languages with Apache 2.0 license πŸ”₯
> Dataset: CohereForAI released GlobalMMLU, multilingual version of MMLU with 42 languages with Apache 2.0 license
> Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models
> Dataset: FineWeb2 just landed with a multilinguality update! 🔥 nearly 8TB of pretraining data in many languages!

Image/Video Generation πŸ–ΌοΈ
> Tencent released HunyuanVideo, a new photorealistic video generation model
> OminiControl is a new editing/control framework for image generation models like Flux

Audio πŸ”Š
> Indic-Parler-TTS is a new text2speech model made by the community
merve
posted an update about 1 month ago
New InternVL drop with a state-of-the-art 78B vision language model with MIT license πŸ”₯ https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c
The release comes with seven new vision LMs in different sizes, based on InternViT 300M/6B with Qwen2.5 (0.5B, 3B, 32B, 72B) and InternLM2.5 (1.8B, 7B, 20B)
The 78B model combines InternViT 6B with Qwen2.5-72B Instruct and can accomplish a variety of tasks 👏 Try it here: OpenGVLab/InternVL
merve
posted an update about 1 month ago
small but mighty πŸ”₯
you can fine-tune SmolVLM on an L4 with a batch size of 4 and it will only take 16.4 GB VRAM 🫰🏻 with gradient accumulation, the simulated batch size is 16 ✨
I made a notebook that includes all the goodies: QLoRA, gradient accumulation, gradient checkpointing with explanations on how they work πŸ’ https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
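
For reference, a hedged sketch of that kind of QLoRA setup with gradient checkpointing and accumulation (the target modules and hyperparameters here are illustrative assumptions; the notebook is the authoritative recipe):

import torch
from transformers import BitsAndBytesConfig, Idefics3ForConditionalGeneration, TrainingArguments
from peft import LoraConfig, get_peft_model

model_id = "HuggingFaceTB/SmolVLM-Instruct"

# QLoRA: load the base model in 4-bit (NF4) to keep VRAM low
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model.gradient_checkpointing_enable()  # recompute activations to save memory

# LoRA adapters on the attention projections (target modules are an assumption)
lora_config = LoraConfig(
    r=8, lora_alpha=8, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# batch size 4 x 4 accumulation steps = simulated batch size 16
args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    bf16=True,
    output_dir="./smolvlm-ft",
)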
merve
posted an update about 1 month ago
Last week we were blessed with open-source models! A recap πŸ’
merve/nov-29-releases-674ccc255a57baf97b1e2d31

πŸ–ΌοΈ Multimodal
> At Hugging Face we released SmolVLM, a performant and efficient smol vision language model πŸ’—
> Show Lab released ShowUI-2B: new vision-language-action model to build GUI/web automation agents πŸ€–
> Rhymes AI has released the base models of Aria: Aria-Base-64K and Aria-Base-8K, with their respective context lengths
> ViDoRe team released ColSmolVLM: A new ColPali-like retrieval model based on SmolVLM
> Dataset: Llava-CoT-o1-Instruct: a new dataset labelled using the Llava-CoT multimodal reasoning model 📖
> Dataset: LLaVA-CoT-100k, the dataset used to train Llava-CoT, released by its creators 📕

πŸ’¬ LLMs
> Qwen team released QwQ-32B-Preview, state-of-the-art open-source reasoning model, broke the internet πŸ”₯
> Alibaba has released Marco-o1, a new open-source reasoning model 💥
> NVIDIA released Hymba 1.5B Base and Instruct, the new state-of-the-art SLMs with hybrid architecture (Mamba + transformer)

⏯️ Image/Video Generation
> Qwen2VL-Flux: new image generation model based on Qwen2VL image encoder, T5 and Flux for generation
> Lightricks released LTX-Video, a new DiT-based video generation model that can generate 24 FPS videos at 768x512 res ⏯️
> Dataset: Image Preferences is a new image generation preference dataset made with DIBT community effort of Argilla 🏷️

Audio
> OuteAI released OuteTTS-0.2-500M, a new multilingual text-to-speech model based on Qwen-2.5-0.5B trained on 5B audio prompt tokens
akhaliq
posted an update about 1 month ago
QwQ-32B-Preview is now available in anychat

A reasoning model that is competitive with OpenAI o1-mini and o1-preview

try it out: akhaliq/anychat
akhaliq
posted an update about 1 month ago
New model drop in anychat

allenai/Llama-3.1-Tulu-3-8B is now available

try it here: akhaliq/anychat
merve
posted an update about 2 months ago
The authors of ColPali trained a retrieval model based on SmolVLM 🀠 vidore/colsmolvlm-alpha
TL;DR:

- ColSmolVLM performs better than ColPali and DSE-Qwen2 on all English tasks

- ColSmolVLM is more memory efficient than ColQwen2 πŸ’—
merve
posted an update about 2 months ago
Small yet mighty! πŸ’«

We are releasing SmolVLM: a new 2B small vision language model made for on-device use, fine-tunable on consumer GPUs, immensely memory efficient 🤠

We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base HuggingFaceTB/smolvlm-6740bd584b2dcbf51ecb1f39

Learn more from our blog here: huggingface.co/blog/smolvlm
This release comes with a demo, fine-tuning code, MLX integration and TRL integration for DPO 💝
Try the demo: HuggingFaceTB/SmolVLM
Fine-tuning recipe: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
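
A quick inference sketch following the pattern on the SmolVLM-Instruct model card (example.jpg is a placeholder path):

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])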
akhaliq
posted an update about 2 months ago
anychat

supports ChatGPT, Gemini, Perplexity, Claude, Meta Llama and Grok, all in one app

try it out there: akhaliq/anychat
merve
posted an update about 2 months ago
What a week! A recap for everything you missed ❄️
merve/nov-22-releases-673fbbcfc1c97c4f411def07
Multimodal ✨
> Mistral AI released Pixtral 124B, a gigantic open vision language model
> Llava-CoT (formerly known as Llava-o1) was released: a multimodal reproduction of the o1 model, by PKU
> OpenGVLab released MMPR: a new multimodal reasoning dataset
> Jina has released Jina-CLIP-v2 0.98B multilingual multimodal embeddings
> Apple released new SotA vision encoders AIMv2

LLMs πŸ¦™
> AllenAI dropped a huge release of models, datasets and scripts for Tülu, a family of models based on Llama 3.1 aligned with SFT, DPO and a new technique they have developed called RLVR
> Jina has released embeddings-v3: new multilingual embeddings with longer context
> Hugging Face released SmolTalk: synthetic dataset used to align SmolLM2 using supervised fine-tuning
> Microsoft released orca-agentinstruct-1M-v1: a gigantic instruction dataset of 1M synthetic instruction pairs

Image Generation πŸ–ΌοΈ
> Black Forest Labs released FLUX.1 Tools: four new models for different image modifications and two LoRAs for image conditioning and better steering of generations

Lastly, Hugging Face released a new library, Observers: a lightweight SDK for monitoring interactions with AI APIs and easily storing and browsing them 📚
$ pip install observers
merve
posted an update about 2 months ago
Apple released AIMv2 🍏 a family of state-of-the-art open-set vision encoders
apple/aimv2-6720fe1558d94c7805f7688c
> like CLIP, but with a decoder added and trained autoregressively 🤯
> 19 open models come in 300M, 600M, 1.2B, 2.7B with resolutions of 224, 336, 448
> Load and use with πŸ€— transformers
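
For example, a minimal loading sketch following the pattern on the AIMv2 model cards (the checkpoint name is one of the 19; trust_remote_code was required at release time, before native transformers support landed):

import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "apple/aimv2-large-patch14-224"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
features = outputs.last_hidden_state  # patch-level image features (output type may vary)
print(features.shape)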
merve
posted an update about 2 months ago
your hugging face profile now has your recent activities πŸ€—