Thanks for opensourcing this!
I found that the FLOPs estimation function used for all models is a text-only LLM estimate:
https://github.com/EvolvingLMMs-Lab/lmms-engine/blob/main/src/lmms_engine/models/utils.py#L61-L81
The official documentation (https://lmms-engine.readthedocs.io/en/latest/reference/mfu.html#qwen3-vl-8b-with-sequence-parallel) reports an MFU of around 0.2 to 0.25. I believe that number should not come from the text-only calculation above, but should also account for the ViT forward FLOPs.
Also, I would love to know what kind of dataset was used for the MFU estimation. Qwen-series multimodal models (e.g., Qwen3-VL) support native resolution, so the image sizes matter, and the FPS configuration of a video dataset would also change the MFU a lot.
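
To make the concern concrete, here is a rough sketch of how I would expect the ViT term and the FPS setting to enter the estimate. It is not the lmms-engine implementation: every function name is hypothetical, the model sizes are placeholders rather than the real Qwen3-VL config, and it uses the common approximation of 2 FLOPs per parameter per token per pass, with the backward pass counted as two forward passes. It also assumes the ViT runs forward-only and attends within each frame independently.

```python
def matmul_flops(num_params, num_tokens, passes=3):
    """Dense-layer FLOPs: 2 per parameter per token per pass.
    passes=3 approximates forward + backward, passes=1 is forward only."""
    return 2.0 * passes * num_params * num_tokens


def attention_flops(num_layers, hidden_size, seq_len, passes=3):
    """Full-attention FLOPs: QK^T and AV each cost 2 * seq_len^2 * hidden per layer per pass."""
    return 4.0 * passes * num_layers * hidden_size * seq_len ** 2


def visual_patches(height, width, patch_size):
    """Patches per frame at native resolution (no resizing assumed)."""
    return (height // patch_size) * (width // patch_size)


def step_flops(text_tokens, patches_per_frame, num_frames, merge_factor,
               llm, vit, vit_passes=1):
    """Return (llm_flops, vit_flops) for one sample.
    merge_factor is the assumed number of ViT patches merged into one LLM token."""
    patches = patches_per_frame * num_frames
    seq_len = text_tokens + patches // merge_factor
    llm_total = (matmul_flops(llm["params"], seq_len)
                 + attention_flops(llm["layers"], llm["hidden"], seq_len))
    # Assumption: ViT attention is computed per frame, not across the whole clip.
    vit_total = (matmul_flops(vit["params"], patches, vit_passes)
                 + num_frames * attention_flops(vit["layers"], vit["hidden"],
                                                patches_per_frame, vit_passes))
    return llm_total, vit_total


def mfu(total_flops, step_seconds, num_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved FLOPs/s divided by aggregate peak FLOPs/s."""
    return total_flops / (step_seconds * num_gpus * peak_flops_per_gpu)


# Placeholder sizes for illustration only, NOT the actual Qwen3-VL configuration.
LLM = {"params": 8e9, "layers": 36, "hidden": 4096}
VIT = {"params": 0.5e9, "layers": 24, "hidden": 1024}

# A hypothetical 30-second clip at 448x448 with 14x14 patches, sampled at different FPS.
per_frame = visual_patches(448, 448, 14)
for fps in (1, 2, 4):
    llm_f, vit_f = step_flops(text_tokens=256, patches_per_frame=per_frame,
                              num_frames=30 * fps, merge_factor=4, llm=LLM, vit=VIT)
    share = vit_f / (llm_f + vit_f)
    print(f"fps={fps}: total={llm_f + vit_f:.3e} FLOPs, ViT share={share:.1%}")
```

With a text-only estimate the ViT term is dropped entirely, so the reported MFU would depend on how many visual tokens each sample contains, which is why the dataset, resolution, and FPS settings matter for interpreting the number.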