forked from pytorch/executorch
testing: track the llama export times #1
Open: mroreo wants to merge 239 commits into main from dev/issue_10761
Conversation
### Summary Pulls in the AOTI change for lowering that lets you use the MinGW POSIX flavor.
Fast path was broken for negative indices (see pytorch#15285). Because of this, pytorch#15366 disabled the fast path when the index tensor had negative indices. In this PR we fix the bug and re-enable the fast path for negative indices. Fixes pytorch#15285 Differential Revision: D86351194
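A minimal sketch of the normalization such a fast path needs (`normalize_indices` is a hypothetical helper for illustration, not the kernel from this PR): a negative index -k refers to element dim_size - k, so negative entries must be shifted before gathering.

```python
import torch

def normalize_indices(index: torch.Tensor, dim_size: int) -> torch.Tensor:
    # Keep non-negative indices; shift negative ones by the dimension size,
    # so -1 maps to the last element and -dim_size to the first.
    return torch.where(index < 0, index + dim_size, index)

x = torch.arange(10)
idx = torch.tensor([-1, 0, -10])
assert torch.equal(x[normalize_indices(idx, x.numel())], x[idx])
```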
Differential Revision: D85817305 Pull Request resolved: pytorch#15471
Arm tests logged too much because comparisons used logger.level instead of logger.getEffectiveLevel(). logger.level is always logging.NOTSET unless explicitly set with logger.setLevel(), which we want to avoid. Instead, we should use logger.getEffectiveLevel(), which inherits the level from the parent logger. Signed-off-by: Oscar Andersson <[email protected]>
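The difference is easy to demonstrate with the standard library alone:

```python
import logging

logging.basicConfig(level=logging.WARNING)  # sets the root logger's level
child = logging.getLogger("executorch.backends.arm.test")

# logger.level stays NOTSET (0) unless setLevel() is called on this logger,
# so comparisons like `child.level <= logging.DEBUG` are always true.
assert child.level == logging.NOTSET

# getEffectiveLevel() walks up the logger hierarchy and returns the
# inherited level, which is what the comparisons should use.
assert child.getEffectiveLevel() == logging.WARNING
```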
The pass assumed that if all repeat multiples are one, the op is a no-op. However, it can still change the rank. Signed-off-by: Erik Lundell <[email protected]>
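For example (plain PyTorch, just to show the rank change the pass must account for):

```python
import torch

x = torch.randn(3, 4)        # rank 2
y = x.repeat(1, 1, 1)        # all multiples are one...
assert y.shape == (1, 3, 4)  # ...but a leading dimension is added: rank 3
```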
pytorch#15596) Signed-off-by: Sebastian Larsson <[email protected]>
…ytorch#15595) Signed-off-by: Yufeng Shi <[email protected]>
Signed-off-by: Sebastian Larsson <[email protected]>
Add ignores for third-party modules and change the import of common to use the correct path. Signed-off-by: [email protected]
### Summary A minor refactor of the HF LLM model unit tests so they are easier to maintain. ### Test plan Unit tests pass.
…ytorch#15555) Signed-off-by: Sebastian Larsson <[email protected]>
…orch#15590) A number of ops only handle shape/meta-data without changing the dynamic range. In these cases, no rescaling needs to be performed and the int8 portable_ops kernel can be used directly. A new test is added to ensure this behaviour, as well as a test showing how operators which do change the dynamic range (SUB) are not supported. To support quantization of graphs with no-rescale ops at the beginning/end of the graph, two new quantizers, InputQuantizer and OutputQuantizer, are introduced. By explicitly stating the dtype of the input/output, no-rescale ops inherit dtypes from them as with any other op. Signed-off-by: Adrian Lundell <[email protected]>
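The underlying arithmetic, shown with plain NumPy (this illustrates why the distinction holds in general, not the backend code itself): shape/meta-data ops only rearrange values, so the input's (scale, zero-point) stays valid for the output, while an op like SUB can double the dynamic range.

```python
import numpy as np

a = np.array([-128, 127], dtype=np.int8).astype(np.int32)
b = np.array([127, -128], dtype=np.int8).astype(np.int32)

transposed = a[::-1]  # shape op: same values, still representable in int8
diff = a - b          # SUB: results span [-255, 255], outside int8 -> rescale needed
assert diff.min() < -128 and diff.max() > 127
```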
…ytorch#15630) Fix mypy warnings in test_insert_int32_casts_after_int64_placeholders_pass.py about using Tensor instead of LongTensor. Signed-off-by: [email protected]
Reuses the FoldAndAnnotateQParamsPass from the Arm backend to greatly simplify the logic for fusing the ops. Additionally updates the linear kernel to be numerically correct and computes the kernel_sum ahead of time (AOT) in the quantized_linear_fusion pass. Note that since this replaces the bias node, it typically causes no extra memory usage. Updates the Linear tests to mirror this, including removing the various matmul tests; since linear is handled as a separate op rather than as a particular type of matmul, these tests are no longer relevant. Removes unnecessary stub definitions in operators.py, operators.yaml and op_quantized_linear.cpp. Leaving a few TODOs since the patch is large already. Signed-off-by: Adrian Lundell <[email protected]>
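A sketch of the kernel_sum algebra as I read it (simplified, not the backend implementation): for a quantized linear with input zero-point `x_zp`, the accumulator is `sum_k q_x[k]*q_w[j,k] - x_zp * sum_k q_w[j,k]`, and the second term depends only on the weights, so `sum_k q_w[j,k]` can be computed at export time and folded into the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
q_w = rng.integers(-128, 128, size=(8, 16), dtype=np.int8).astype(np.int32)
q_x = rng.integers(-128, 128, size=16, dtype=np.int8).astype(np.int32)
x_zp = 3

kernel_sum = q_w.sum(axis=1)          # computed ahead of time, folded into the bias
fast = q_w @ q_x - x_zp * kernel_sum  # runtime: integer matmul plus one correction
reference = q_w @ (q_x - x_zp)        # naive: subtract the zero-point per element
assert np.array_equal(fast, reference)
```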
Add a `return None` if elf_path is not set. Signed-off-by: [email protected]
…h#15635) - Add (0,3,1,2) and (0,2,3,1) as permutations supported for large shapes. - Lower permutations expressible as views ('singleton permutations') to views to allow them to run on the Ethos-U55. All unit tests added were previously not lowered, which leads, for example, to 19 permutes being delegated on the convnext_tiny model from torchvision. Signed-off-by: Adrian Lundell <[email protected]>
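The idea behind "singleton permutations", shown with plain PyTorch (illustrative only, not the pass itself): when every moved dimension has size 1 the memory order is unchanged, so the permute is equivalent to a reshape/view.

```python
import torch

x = torch.randn(1, 3, 1, 4)
permuted = x.permute(1, 0, 3, 2)  # shape (3, 1, 4, 1); only size-1 dims move
viewed = x.reshape(3, 1, 4, 1)    # same element order, no data movement
assert torch.equal(permuted, viewed)
```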
…ch#15632) ### Summary Delay compile-spec creation in the backend test flow to prevent sharing the temp directory between tests. Previously, using a shared compile spec implied a shared temp directory. After we began cleaning the temp directory after each test, this sharing caused conflicts. ### Test plan This is tested by the Backend test flow Signed-off-by: Zingo Andersen <[email protected]>
) ### Summary This PR replaces the optimization in `move_relu_before_concat.py` with the `MoveActivationBeforeConcat` aten pass. The pass moves selected activations that are supported for fusion on Neutron (Relu, Relu6, Sigmoid, Tanh) before the `concat` node if the concat input nodes are either Conv 2D or Linear 2D. The node logic is determined by target specs, now supporting Neutron-C. Tests updated. ### Test plan Unit tests provided (test_move_activation_before_concatenation.py). cc @robert-kalmar
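The rewrite is sound because the supported activations are elementwise, so applying them before or after the concatenation produces the same values:

```python
import torch
import torch.nn.functional as F

a, b = torch.randn(2, 8), torch.randn(3, 8)
after = F.relu(torch.cat([a, b], dim=0))           # activation after concat
before = torch.cat([F.relu(a), F.relu(b)], dim=0)  # activation moved before concat
assert torch.allclose(after, before)
```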
…#15633) ### Summary Fix the filename in the log to match the file. ### Test plan Tested by hand cc @freddan80 @per @oscarandersson8218 @digantdesai Signed-off-by: Zingo Andersen <[email protected]>
…p debug mode usage (pytorch#15643) Title says it all! Differential Revision: [D86340342](https://our.internmc.facebook.com/intern/diff/D86340342/)
## Context The SDPA custom op accepts the `input_pos` (i.e. cache position) argument as a symbolic integer. The value of the symbolic integer is obtained by selecting the first element of a cache position input tensor and converting it to symint via local_scalar_dense. Currently, ET-VK handles this in a hacky manner. 1. the select + local_scalar_dense op pattern is removed, and the cache pos tensor is passed directly into the custom sdpa ops 2. Single-element tensors whose users are all select + local_scalar_dense will be interpreted as symints instead of tensors Unfortunately, this technique will not work for the huggingface implementation of transformer models, since the cache pos input tensor has not just a single element but is expected to be a vector of integer cache positions corresponding to all cache positions that will be updated. ## Changes Introduce a custom op to capture the select + local_scalar_dense op pattern, which is the proper way to handle the op pattern. Note that a custom op is needed because this op needs to access the staging buffer data of the input tensor, whereas `select` would typically be executed via a compute shader. The reason is that the `input_pos` value is needed to configure the sizes of attention weight tensors participating in the custom SDPA op, so the value must be set before any command buffers are dispatched. As a consequence of this change, the previous handling of select + local scalar dense can also be removed. Differential Revision: [D86340340](https://our.internmc.facebook.com/intern/diff/D86340340/)
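At the source level the pattern looks roughly like this (a hypothetical illustration of how select + local_scalar_dense arises during export, not code from this diff):

```python
import torch

def use_input_pos(cache_pos: torch.Tensor) -> int:
    first = cache_pos[0]  # traces to aten.select on the cache-position tensor
    return first.item()   # traces to aten._local_scalar_dense, yielding a symint
```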
…ytorch#15645) SDPA used to be handled by a custom op `sdpa_with_kv_cache`, but it was eventually split (D62301837) into update_cache and custom_sdpa ops. However, having a single fused op is useful for Vulkan since it allows more control over how the cache tensors are stored and represented. Essentially, it makes it easier to manage the cache tensors and opens up opportunities for future optimizations. This diff introduces a fusion pass that does 2 things: 1. Combine update_cache and custom_sdpa back into sdpa_with_kv_cache 2. Ensure all references to the cache_pos symint use the same node - this prevents the select_at_dim_as_symint op from being called every time it is used. Differential Revision: [D86340339](https://our.internmc.facebook.com/intern/diff/D86340339/)
…grid_sampler 2D and 3D (pytorch#15371) ### Summary Enable operators adaptive_max_pool2d and grid_sampler 2D and 3D ### Test plan ```bash python backends/qualcomm/tests/test_qnn_delegate.py TestQNNFloatingPointOperator.test_qnn_backend_adaptive_max_pool2d -b build-android -H $HOST -s $SN -m $CHIPID python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_adaptive_max_pool2d -b build-android -H $HOST -s $SN -m $CHIPID python backends/qualcomm/tests/test_qnn_delegate.py TestQNNFloatingPointOperator.test_qnn_backend_grid_sampler -b build-android -H $HOST -s $SN -m $CHIPID python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_grid_sampler -b build-android -H $HOST -s $SN -m $CHIPID ```
- Update the theme version to pull the wheel from PyPI.
- Change how we obtain the version in the CI:
  - Updated to properly parse the `RELEASE` variable
  - Fixed `Makefile` to use `RELEASE=true` instead of `RELEASE=1` for consistency
  - Workflow sets `RELEASE=true` only for tagged releases (e.g., `v1.1.0`)
  - Main branch builds with the `<meta name="robots" content="noindex">` tag
  - Release builds remain indexable by search engines

cc @mergennachin @byjlw
Bump torch pin
…est-aten-div-out-mode (pytorch#15568) This PR was created by the merge bot to help merge the original PR into the main branch. ghstack PR number: pytorch#15494 by @zonglinpeng ^ Please use this as the source of truth for the PR details, comments, and reviews ghstack PR base: https://github.com/pytorch/executorch/tree/gh/zonglinpeng/6/base ghstack PR head: https://github.com/pytorch/executorch/tree/gh/zonglinpeng/6/head Merge bot PR base: https://github.com/pytorch/executorch/tree/main Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/zonglinpeng/6/orig Differential Revision: [D85364551](https://our.internmc.facebook.com/intern/diff/D85364551/) @diff-train-skip-merge --------- Co-authored-by: Zonglin Peng <[email protected]>
…inear ops fundamentally changes the way we decompose the Ops and match them (pytorch#15665) Summary: ^^^ Note that there are new dedicated CortexM tests to rely on for the new flow Differential Revision: D86469035
…#15551) A new setup option is added: --install-mlsdk-deps-with-pip. For Linux/Windows x86 machines, PyPI packages of the MLSDK repository for the VGF backend may be used. This will eventually be the default; it is not yet the default because of a limitation in model-converter's handling of large models. The new option decreases setup time, which can enable VGF backend testing in GitHub. cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai Signed-off-by: Måns Nilsson <[email protected]> Co-authored-by: Per Held <[email protected]> Co-authored-by: Ryan O'Shea <[email protected]>
Executor runner supports models both with and without bundled IO in the same path. To enable bundled IO, EXECUTORCH_BUILD_DEVTOOLS and EXECUTORCH_ENABLE_BUNDLE_IO are required. Adds tests in the Arm backend that exercise and depend on this. Besides enabling bundled IO for the VGF backend where applicable, some additional ResNet model tests are enabled as well. Avoids narrowing-conversion errors in the pte_to_header script by switching char to unsigned char. Signed-off-by: Måns Nilsson <[email protected]> Co-authored-by: Jacob Szwejbka <[email protected]>
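The two CMake options named above would be enabled along these lines (the flag names come from the commit message; the ON values and the rest of the command line are my assumption):

```bash
cmake -DEXECUTORCH_BUILD_DEVTOOLS=ON \
      -DEXECUTORCH_ENABLE_BUNDLE_IO=ON \
      ..
```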
Summary: Suspect the failure https://github.com/pytorch/pytorch/actions/runs/19547462483/job/55989739476 is due to using a different QnnBackend implementation. Rename this demo backend to a distinct demo backend name. Differential Revision: D87586567
### Summary LoraLinears contain: 1. base weight (nn.Linear) 2. lora_a (nn.Linear) 3. lora_b (nn.Linear) (2) and (3) are caught by the filter, but (1) is not, as the weight and bias are pulled out of the nn.Linear and placed into nn.Parameters, and the linear is performed manually. This is for checkpoint compatibility - otherwise we'd have to map the weights for any lora model. See: https://github.com/pytorch/executorch/blob/b4d72f1e271915e9c0e1d313753a1eec840fbdee/examples/models/llama/lora.py#L31-L37 This PR adds lora linears into the quantization filter. ### Test plan ``` python -m extension.llm.export.export_llm \ base.checkpoint="${DOWNLOADED_PATH}/consolidated.00.pth" \ base.params="${DOWNLOADED_PATH}/params.json" \ base.adapter_checkpoint="../et_docs_7_epoch/adapter_model.safetensors" \ base.adapter_config="../et_docs_7_epoch/adapter_config.json" \ base.tokenizer_path="../et_docs_7_epoch/" \ model.use_kv_cache=true \ model.use_sdpa_with_kv_cache=true \ ``` Confirm output model size is ~1.7GB instead of 5.1GB. ``` (executorch) [[email protected] /data/users/lfq/executorch (lfq.quantize-lora-linears)]$ ls -la *.pte -rw-r--r-- 1 lfq users 5106135168 Nov 20 15:59 et_lora.pte -rw-r--r-- 1 lfq users 1733835776 Nov 20 17:07 et_lora_fix.pte ```
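A simplified sketch of the module structure described above, based on the linked lora.py (not the exact code): the base weight is a bare nn.Parameter applied via F.linear, so a filter that matches nn.Linear modules sees (2) and (3) but misses (1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim))  # (1) not an nn.Linear
        self.lora_a = nn.Linear(in_dim, rank, bias=False)          # (2) caught by filter
        self.lora_b = nn.Linear(rank, out_dim, bias=False)         # (3) caught by filter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base linear applied manually for checkpoint compatibility,
        # plus the low-rank adapter path.
        return F.linear(x, self.weight) + self.lora_b(self.lora_a(x))
```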
Add an import-untyped ignore for snakeviz in case it is installed. Signed-off-by: [email protected] Change-Id: Ia951a0013d09e06c0d29a32bdb6b49ae11561d7d cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai Co-authored-by: Zingo Andersen <[email protected]>
Differential Revision: D87579688 Pull Request resolved: pytorch#15925
Differential Revision: D87280747 Pull Request resolved: pytorch#15862
Before: When running CUDA benchmarks on multiple models, any model export failure would halt the entire benchmark job. After: With the new configuration, the benchmark job will continue for models that export successfully, even if some models fail to export.
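A minimal sketch of the continue-on-failure pattern (`export_model` and the model list are hypothetical stand-ins, not the benchmark configuration itself):

```python
def export_model(name: str) -> None:
    # Hypothetical stand-in for the real per-model export step.
    if name == "bad-model":
        raise RuntimeError("export failed")

models = ["good-model", "bad-model", "other-model"]
failures = []
for model in models:
    try:
        export_model(model)
    except Exception as exc:
        # Record the failure and keep going instead of halting the job.
        failures.append((model, exc))
print(f"{len(models) - len(failures)} exported, {len(failures)} failed")
```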
It's not compatible with pytorch#15933, which causes 2^70+ byte counters like ``` Downloaded: 839890544179019776 / 1354151797 bytes (62023367397.93%) Downloaded: 841813590016000000 / 1354151797 bytes (62165378496.04%) ```
Summary:
This PR fixes two issues affecting the build and installation process:
1. **pyproject.toml configuration**: Fixed invalid `license` and
`license-files` fields that were causing build failures with newer
versions of `setuptools` and `pip` build isolation. The `license` field
now uses the table format `{text = ...}` and `license-files` was moved
to `[tool.setuptools]` (see the sketch after this list).
2. **Editable install version.py**: Fixed an issue where `version.py`
was being written to the project root instead of the package directory
(`src/executorch`) during editable installs. This was causing
`ImportError: cannot import name 'version'` when importing `executorch`.
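A minimal sketch of the pyproject.toml change from item 1 above (the license text and file glob here are placeholders, not the project's actual values):

```toml
[project]
license = {text = "BSD"}       # table form instead of a bare string

[tool.setuptools]
license-files = ["LICENSE"]    # moved here out of [project]
```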
Test Plan:
- Verified `pip install . --no-build-isolation` works (metadata
generation succeeds).
- Verified `pip install -e . --no-build-isolation` works and `from
executorch import version` succeeds.
Currently we download everything created during export and benchmarking (ptd, pte, benchmarking results, etc.) when uploading benchmarking results to the PyTorch hub. The ptd and pte files are large and unnecessary at this stage, and when benchmarking many models such large files cause out-of-disk-space errors. This PR prevents those large, unnecessary files from being downloaded to avoid the out-of-disk-space errors.
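A hypothetical sketch of the filtering step (the suffix list and function name are assumptions, not the PR's code):

```python
SKIP_SUFFIXES = (".pte", ".ptd")  # large compiled-model artifacts

def should_download(artifact_name: str) -> bool:
    # Benchmark results are small and needed for the upload; compiled
    # model files are large and unused at this stage, so skip them.
    return not artifact_name.endswith(SKIP_SUFFIXES)

assert should_download("benchmark_results.json")
assert not should_download("llama3.pte")
```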
### Summary Take down the AWS device farm benchmarking jobs. We are dropping them due to performance data being unreliable on non-rooted devices.
The QNN SDK download seems to be very unreliable. Trying to fix it and then re-enable it.
Differential Revision: D87510750 Pull Request resolved: pytorch#15944
Differential Revision: D87576772 Pull Request resolved: pytorch#15932
Explain how to prune a neural network and the associated performance uplift when running on the Ethos-U NPU.
Bias range was [-2147483648, 2147483646] which isn't really symmetric. This patch changes the range to [-2147483647, 2147483647]. Signed-off-by: Oscar Andersson <[email protected]>
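The arithmetic, spelled out:

```python
old_min, old_max = -2147483648, 2147483646   # the previous range
assert old_min != -old_max                   # not symmetric around zero
new_min, new_max = -(2**31 - 1), 2**31 - 1   # [-2147483647, 2147483647]
assert new_min == -new_max                   # negation maps the range onto itself
```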
- Chenweng-quic
- MatthiasHertel80 (Arm)
- Michaelmaitland (Meta internal)
- RahulC7 (Meta internal)
- can update author jorgep31415 to Juniper Pineda: https://github.com/junpi3
- Young Han (Meta): https://github.com/seyeong-han
- Mitch Bailey (Arm): https://github.com/jmahbs
- Alex Tawse: https://github.com/AlexTawseArm
- Tanvir Islam (Meta): https://github.com/tanvirislam-meta
Summary: Forward fix for pytorch#15368 Reviewed By: metascroy Differential Revision: D87712225
### Summary Fix eval_llama_qnn: retrieve custom annotation from quantization recipe ### Test plan ``` bash python -m executorch.examples.qualcomm.oss_scripts.llama.eval_llama_qnn --decoder_model qwen2_5-0_5b --quant_linear_only --max_seq_length 1024 --ptq 16a4w ```
PyTorch has nightly wheels for this
…execute right after compilation to create command buffers. Differential Revision: D87781471 Pull Request resolved: pytorch#15962
Differential Revision: D87749871 Pull Request resolved: pytorch#15955
Differential Revision: D87122487 Pull Request resolved: pytorch#15934
### Summary GLM Enablement `python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 --temperature 0 --model_mode kv --max_seq_len 128 --decoder_model glm-1_5b --prompt "Could you tell me about Facebook?"` ### Test plan `python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_static_glm1_5b --model SM8750 --build_folder build-android/ --executorch_root . -s $DEVICE --artifact ./glm1_5b`
Differential Revision: D87752226 Pull Request resolved: pytorch#15961
Implements a new pass which fuses activation ops with preceding cortex-m ops if possible. Removes quantization of conv1d and conv3d as they are not tested, and moves the Conv+relu test to test_activations. Propagates qmin/qmax to the conv kernel. Signed-off-by: Adrian Lundell <[email protected]>
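The fusion works because a quantized activation is just a tighter clamp on the requantized accumulator: a quantized ReLU clamps at the zero point (where real 0.0 lands), so raising qmin to that bound performs conv+activation in one op. This is illustrative of the general int8 idea, not the Cortex-M kernels:

```python
import torch

zero_point, qmin, qmax = -10, -128, 127   # example int8 output quantization
acc = torch.tensor([-50, -10, 0, 90])     # requantized conv output (int8 domain)

separate = torch.clamp(torch.clamp(acc, qmin, qmax), min=zero_point)  # conv, then ReLU
fused = torch.clamp(acc, zero_point, qmax)                            # conv with qmin raised
assert torch.equal(separate, fused)
```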
Summary
Track the llama export times. Fixes #10761.