
Delayed sampling #720

Draft · wants to merge 32 commits into base: habana_main

Conversation


@mfylcek mfylcek commented Jan 22, 2025

No description provided.

mfylcek and others added 30 commits January 16, 2025 11:54
remove expert_max hard code (#47)
vLLM-Ext: Full enabling of ALiBi (#34)
Add version inference via setuptools-scm (#58)
Revert "vLLM-Ext: Full enabling of ALiBi (#34)" (#59)
Remove punica_hpu.py from vllm_hpu_extension (#66)
Removed previous (not-pipelined) pa implementation (#72)
Add flag to enable running softmax in fp32 (#71)
Update calibration readme link (#73)
allow lm_head quantization in calibration process (#65)
Pad to bmin if value is less (#67)
Update pyproject.toml (#75)

---------

Co-authored-by: Michał Kuligowski <[email protected]>
This reverts commit c445fe7.
Fix high-level profiling.
Workaround for recompilations: reduces their number from 2 per decode to 1, and the total time from 11 ms to 2.6 ms.

Also, torch.tensor is preferable to index_select. Citing Marceli:
"torch.index_select returns a new tensor which copies the indexed fields into a new memory location.
torch.Tensor.select or slicing returns a view of the original tensor."
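The copy-vs-view distinction the quote describes can be demonstrated directly. This is a minimal standalone sketch, not code from this PR:

```python
import torch

t = torch.arange(6).reshape(2, 3)

# torch.index_select copies the selected rows into new memory.
copied = torch.index_select(t, 0, torch.tensor([0]))

# Slicing (or torch.Tensor.select) returns a view sharing the original storage.
view = t[0]

copied[0, 0] = 99  # writes to the copy only; t is untouched
view[1] = 99       # writes through the view into t

assert copied.data_ptr() != t.data_ptr()  # separate storage
assert view.data_ptr() == t.data_ptr()    # shared storage
assert t[0, 0].item() == 0                # unaffected by the copy's write
assert t[0, 1].item() == 99               # modified through the view
```

Because index_select materializes a new tensor, it can force extra device work, whereas a view is free; this is the rationale for preferring slicing here.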
The sync there causes a 4-6 ms gap because the executor waits for the HPU to finish, although with delayed sampling that wait is not needed.
Instead of removing the sync outright, let's guard it with an if: we don't want to change too much (the removal would also affect MSS without delayed sampling) and thereby widen the testing scope.
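The guarded-sync approach described above might be sketched as follows. The name maybe_sync, the flag, and sync_fn are hypothetical illustrations of the pattern, not the PR's actual code:

```python
def maybe_sync(delayed_sampling_enabled: bool, sync_fn) -> bool:
    """Run the device sync only when delayed sampling is off.

    Returns True if a sync was performed, False if it was skipped.
    """
    if delayed_sampling_enabled:
        # Skip the 4-6 ms wait: sampling is deferred, so the executor
        # does not need the HPU result at this point.
        return False
    sync_fn()
    return True


calls = []
assert maybe_sync(True, lambda: calls.append("sync")) is False
assert calls == []  # no device wait when delayed sampling is on
assert maybe_sync(False, lambda: calls.append("sync")) is True
assert calls == ["sync"]  # unchanged behavior for the non-delayed path
```

Guarding rather than deleting keeps the non-delayed path (including MSS) byte-for-byte identical, which is the testing-scope argument made above.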