forked from pytorch/executorch
testing: track the llama export times #1
Open: mroreo wants to merge 239 commits into main from dev/issue_10761
Conversation
### Summary Pulls in the AOTI change for lowering that lets you use the MinGW POSIX flavor.
Fast path was broken for negative indices (see pytorch#15285). Because of this, pytorch#15366 disabled the fast path when the index tensor had negative indices. In this PR we fix the bug and re-enable the fast path for negative indices. Fixes pytorch#15285 Differential Revision: D86351194
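A minimal sketch of the normalization such a fast path needs (`normalize_indices` is a hypothetical helper for illustration, not the kernel from this PR): a negative index -k refers to element dim_size - k, so negative entries must be shifted before gathering.

```python
import torch

def normalize_indices(index: torch.Tensor, dim_size: int) -> torch.Tensor:
    # Keep non-negative indices; shift negative ones by the dimension size,
    # so -1 maps to the last element and -dim_size to the first.
    return torch.where(index < 0, index + dim_size, index)

x = torch.arange(10)
idx = torch.tensor([-1, 0, -10])
assert torch.equal(x[normalize_indices(idx, x.numel())], x[idx])
```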
Differential Revision: D85817305 Pull Request resolved: pytorch#15471
Arm tests logged too much because comparisons used logger.level instead of logger.getEffectiveLevel(). logger.level is always logging.NOTSET unless explicitly set with logger.setLevel(), which we want to avoid. Instead, we should use logger.getEffectiveLevel(), which inherits the level from the parent logger. Signed-off-by: Oscar Andersson <[email protected]>
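The difference is easy to demonstrate with the standard library alone:

```python
import logging

logging.basicConfig(level=logging.WARNING)  # sets the root logger's level
child = logging.getLogger("executorch.backends.arm.test")

# logger.level stays NOTSET (0) unless setLevel() is called on this logger,
# so comparisons like `child.level <= logging.DEBUG` are always true.
assert child.level == logging.NOTSET

# getEffectiveLevel() walks up the logger hierarchy and returns the
# inherited level, which is what the comparisons should use.
assert child.getEffectiveLevel() == logging.WARNING
```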
The pass assumed that if all repeat multiples are one, the op is a no-op. However, it can still change the rank. Signed-off-by: Erik Lundell <[email protected]>
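For example (plain PyTorch, just to show the rank change the pass must account for):

```python
import torch

x = torch.randn(3, 4)        # rank 2
y = x.repeat(1, 1, 1)        # all multiples are one...
assert y.shape == (1, 3, 4)  # ...but a leading dimension is added: rank 3
```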
pytorch#15596) Signed-off-by: Sebastian Larsson <[email protected]>
…ytorch#15595) Signed-off-by: Yufeng Shi <[email protected]>
Signed-off-by: Sebastian Larsson <[email protected]>
Add ignores for third-party modules and change the import of common to use the correct path. Signed-off-by: [email protected]
### Summary A minor refactor of the HF LLM model unit tests so they are easier to maintain. ### Test plan Unit tests pass.
…ytorch#15555) Signed-off-by: Sebastian Larsson <[email protected]>
…orch#15590) A number of ops only handle shape/meta-data without changing the dynamic range. In these cases, no rescaling needs to be performed and the int8 portable_ops kernel can be used directly. A new test is added to ensure this behaviour, as well as a test showing how operators which do change the dynamic range (SUB) are not supported. To support quantization of graphs with no-rescale ops at the beginning/end of the graph, two new quantizers, InputQuantizer and OutputQuantizer, are introduced. By explicitly stating the dtype of the input/output, no-rescale ops inherit dtypes from them as with any other op. Signed-off-by: Adrian Lundell <[email protected]>
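The underlying arithmetic, shown with plain NumPy (this illustrates why the distinction holds in general, not the backend code itself): shape/meta-data ops only rearrange values, so the input's (scale, zero-point) stays valid for the output, while an op like SUB can double the dynamic range.

```python
import numpy as np

a = np.array([-128, 127], dtype=np.int8).astype(np.int32)
b = np.array([127, -128], dtype=np.int8).astype(np.int32)

transposed = a[::-1]  # shape op: same values, still representable in int8
diff = a - b          # SUB: results span [-255, 255], outside int8 -> rescale needed
assert diff.min() < -128 and diff.max() > 127
```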
…ytorch#15630) Fix mypy warnings in test_insert_int32_casts_after_int64_placeholders_pass.py about using Tensor instead of LongTensor. Signed-off-by: [email protected]
Reuses the FoldAndAnnotateQParamsPass from the Arm backend to greatly simplify the logic for fusing the ops. Additionally updates the linear kernel to be numerically correct and computes the kernel_sum ahead of time (AOT) in the quantized_linear_fusion pass. Note that since this replaces the bias node, it typically causes no extra memory usage. Updates the Linear tests to mirror this, including removing the various matmul tests; since linear is handled as a separate op rather than as a particular type of matmul, these tests are no longer relevant. Removes unnecessary stub definitions in operators.py, operators.yaml and op_quantized_linear.cpp. Leaving a few TODOs since the patch is large already. Signed-off-by: Adrian Lundell <[email protected]>
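A sketch of the kernel_sum algebra as I read it (simplified, not the backend implementation): for a quantized linear with input zero-point `x_zp`, the accumulator is `sum_k q_x[k]*q_w[j,k] - x_zp * sum_k q_w[j,k]`, and the second term depends only on the weights, so `sum_k q_w[j,k]` can be computed at export time and folded into the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
q_w = rng.integers(-128, 128, size=(8, 16), dtype=np.int8).astype(np.int32)
q_x = rng.integers(-128, 128, size=16, dtype=np.int8).astype(np.int32)
x_zp = 3

kernel_sum = q_w.sum(axis=1)          # computed ahead of time, folded into the bias
fast = q_w @ q_x - x_zp * kernel_sum  # runtime: integer matmul plus one correction
reference = q_w @ (q_x - x_zp)        # naive: subtract the zero-point per element
assert np.array_equal(fast, reference)
```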
Add a `return None` if elf_path is not set. Signed-off-by: [email protected]
…h#15635) - Add (0,3,1,2) and (0,2,3,1) as permutations supported for large shapes. - Lower permutations expressible as views ('singleton permutations') to views to allow them to run on the Ethos-U55. All unit tests added were previously not lowered, which leads, for example, to 19 permutes being delegated on the convnext_tiny model from torchvision. Signed-off-by: Adrian Lundell <[email protected]>
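The idea behind "singleton permutations", shown with plain PyTorch (illustrative only, not the pass itself): when every moved dimension has size 1 the memory order is unchanged, so the permute is equivalent to a reshape/view.

```python
import torch

x = torch.randn(1, 3, 1, 4)
permuted = x.permute(1, 0, 3, 2)  # shape (3, 1, 4, 1); only size-1 dims move
viewed = x.reshape(3, 1, 4, 1)    # same element order, no data movement
assert torch.equal(permuted, viewed)
```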
…ch#15632) ### Summary Delay compile-spec creation in the backend test flow to prevent sharing the temp directory between tests. Previously, using a shared compile spec implied a shared temp directory. After we began cleaning the temp directory after each test, this sharing caused conflicts. ### Test plan This is tested by the Backend test flow Signed-off-by: Zingo Andersen <[email protected]>
) ### Summary This PR replaces the optimization in `move_relu_before_concat.py` with the `MoveActivationBeforeConcat` aten pass. The pass moves selected activations that are supported for fusion on Neutron (Relu, Relu6, Sigmoid, Tanh) before the `concat` node if the concat input nodes are either Conv 2D or Linear 2D. The node logic is determined by target specs, now supporting Neutron-C. Tests updated. ### Test plan Unit tests provided (test_move_activation_before_concatenation.py). cc @robert-kalmar
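The rewrite is sound because the supported activations are elementwise, so applying them before or after the concatenation produces the same values:

```python
import torch
import torch.nn.functional as F

a, b = torch.randn(2, 8), torch.randn(3, 8)
after = F.relu(torch.cat([a, b], dim=0))           # activation after concat
before = torch.cat([F.relu(a), F.relu(b)], dim=0)  # activation moved before concat
assert torch.allclose(after, before)
```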
…#15633) ### Summary Fix the filename in the log to match the file. ### Test plan Tested by hand cc @freddan80 @per @oscarandersson8218 @digantdesai Signed-off-by: Zingo Andersen <[email protected]>
…p debug mode usage (pytorch#15643) Title says it all! Differential Revision: [D86340342](https://our.internmc.facebook.com/intern/diff/D86340342/)
## Context The SDPA custom op accepts the `input_pos` (i.e. cache position) argument as a symbolic integer. The value of the symbolic integer is obtained by selecting the first element of a cache position input tensor and converting it to symint via local_scalar_dense. Currently, ET-VK handles this in a hacky manner. 1. the select + local_scalar_dense op pattern is removed, and the cache pos tensor is passed directly into the custom sdpa ops 2. Single-element tensors whose users are all select + local_scalar_dense will be interpreted as symints instead of tensors Unfortunately, this technique will not work for the huggingface implementation of transformer models, since the cache pos input tensor has not just a single element but is expected to be a vector of integer cache positions corresponding to all cache positions that will be updated. ## Changes Introduce a custom op to capture the select + local_scalar_dense op pattern, which is the proper way to handle the op pattern. Note that a custom op is needed because this op needs to access the staging buffer data of the input tensor, whereas `select` would typically be executed via a compute shader. The reason is that the `input_pos` value is needed to configure the sizes of attention weight tensors participating in the custom SDPA op, so the value must be set before any command buffers are dispatched. As a consequence of this change, the previous handling of select + local scalar dense can also be removed. Differential Revision: [D86340340](https://our.internmc.facebook.com/intern/diff/D86340340/)
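At the source level the pattern looks roughly like this (a hypothetical illustration of how select + local_scalar_dense arises during export, not code from this diff):

```python
import torch

def use_input_pos(cache_pos: torch.Tensor) -> int:
    first = cache_pos[0]  # traces to aten.select on the cache-position tensor
    return first.item()   # traces to aten._local_scalar_dense, yielding a symint
```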
…ytorch#15645) SDPA used to be handled by a custom op `sdpa_with_kv_cache`, but it was eventually split (D62301837) into update_cache and custom_sdpa ops. However, having a single fused op is useful for Vulkan since it allows more control over how the cache tensors are stored and represented. Essentially, it makes it easier to manage the cache tensors and opens up opportunities for future optimizations. This diff introduces a fusion pass that does 2 things: 1. Combine update_cache and custom_sdpa back into sdpa_with_kv_cache 2. Ensure all references to the cache_pos symint use the same node - this prevents the select_at_dim_as_symint op from being called every time it is used. Differential Revision: [D86340339](https://our.internmc.facebook.com/intern/diff/D86340339/)
…grid_sampler 2D and 3D (pytorch#15371) ### Summary Enable operators adaptive_max_pool2d and grid_sampler 2D and 3D ### Test plan ```bash python backends/qualcomm/tests/test_qnn_delegate.py TestQNNFloatingPointOperator.test_qnn_backend_adaptive_max_pool2d -b build-android -H $HOST -s $SN -m $CHIPID python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_adaptive_max_pool2d -b build-android -H $HOST -s $SN -m $CHIPID python backends/qualcomm/tests/test_qnn_delegate.py TestQNNFloatingPointOperator.test_qnn_backend_grid_sampler -b build-android -H $HOST -s $SN -m $CHIPID python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_grid_sampler -b build-android -H $HOST -s $SN -m $CHIPID ```
- Update the theme version to pull the wheel from PyPI.
- Change how we obtain the version in the CI:
  - Updated to properly parse the `RELEASE` variable
  - Fixed `Makefile` to use `RELEASE=true` instead of `RELEASE=1` for consistency
  - Workflow sets `RELEASE=true` only for tagged releases (e.g., `v1.1.0`)
  - Main branch builds with the `<meta name="robots" content="noindex">` tag
  - Release builds remain indexable by search engines

cc @mergennachin @byjlw
Bump torch pin
…est-aten-div-out-mode (pytorch#15568) This PR was created by the merge bot to help merge the original PR into the main branch. ghstack PR number: pytorch#15494 by @zonglinpeng ^ Please use this as the source of truth for the PR details, comments, and reviews ghstack PR base: https://github.com/pytorch/executorch/tree/gh/zonglinpeng/6/base ghstack PR head: https://github.com/pytorch/executorch/tree/gh/zonglinpeng/6/head Merge bot PR base: https://github.com/pytorch/executorch/tree/main Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/zonglinpeng/6/orig Differential Revision: [D85364551](https://our.internmc.facebook.com/intern/diff/D85364551/) @diff-train-skip-merge --------- Co-authored-by: Zonglin Peng <[email protected]>
…inear ops fundamentally changes the way we decompose the Ops and match them (pytorch#15665) Summary: ^^^ Note that there are new dedicated CortexM tests to rely on for the new flow Differential Revision: D86469035
…#15551) A new setup option is added: --install-mlsdk-deps-with-pip. For Linux/Windows x86 machines, PyPI packages of the MLSDK repository for the VGF backend may be used. This will eventually be the default; it is not yet the default because of a limitation in model-converter's handling of large models. The new option decreases setup time, which can enable VGF backend testing in GitHub. cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai Signed-off-by: Måns Nilsson <[email protected]> Co-authored-by: Per Held <[email protected]> Co-authored-by: Ryan O'Shea <[email protected]>
Executor runner supports models both with and without bundled IO in the same path. To enable bundled IO, EXECUTORCH_BUILD_DEVTOOLS and EXECUTORCH_ENABLE_BUNDLE_IO are required. Adds tests in the Arm backend that exercise and depend on this. Besides enabling bundled IO for the VGF backend where applicable, some additional ResNet model tests are enabled as well. Avoids narrowing-conversion errors in the pte_to_header script by switching char to unsigned char. Signed-off-by: Måns Nilsson <[email protected]> Co-authored-by: Jacob Szwejbka <[email protected]>
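The two CMake options named above would be enabled along these lines (the flag names come from the commit message; the ON values and the rest of the command line are my assumption):

```bash
cmake -DEXECUTORCH_BUILD_DEVTOOLS=ON \
      -DEXECUTORCH_ENABLE_BUNDLE_IO=ON \
      ..
```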
Summary: Suspect the failure https://github.com/pytorch/pytorch/actions/runs/19547462483/job/55989739476 is due to using a different QnnBackend implementation. Rename this demo backend to a distinct demo backend name. Differential Revision: D87586567
### Summary LoraLinears contain: 1. base weight (nn.Linear) 2. lora_a (nn.Linear) 3. lora_b (nn.Linear) (2) and (3) are caught by the filter, but (1) is not, as the weight and bias are pulled out of the nn.Linear and placed into nn.Parameters, and the linear is performed manually. This is for checkpoint compatibility - otherwise we'd have to map the weights for any lora model. See: https://github.com/pytorch/executorch/blob/b4d72f1e271915e9c0e1d313753a1eec840fbdee/examples/models/llama/lora.py#L31-L37 This PR adds lora linears into the quantization filter. ### Test plan ``` python -m extension.llm.export.export_llm \ base.checkpoint="${DOWNLOADED_PATH}/consolidated.00.pth" \ base.params="${DOWNLOADED_PATH}/params.json" \ base.adapter_checkpoint="../et_docs_7_epoch/adapter_model.safetensors" \ base.adapter_config="../et_docs_7_epoch/adapter_config.json" \ base.tokenizer_path="../et_docs_7_epoch/" \ model.use_kv_cache=true \ model.use_sdpa_with_kv_cache=true \ ``` Confirm output model size is ~1.7GB instead of 5.1GB. ``` (executorch) [[email protected] /data/users/lfq/executorch (lfq.quantize-lora-linears)]$ ls -la *.pte -rw-r--r-- 1 lfq users 5106135168 Nov 20 15:59 et_lora.pte -rw-r--r-- 1 lfq users 1733835776 Nov 20 17:07 et_lora_fix.pte ```
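A simplified sketch of the module structure described above, based on the linked lora.py (not the exact code): the base weight is a bare nn.Parameter applied via F.linear, so a filter that matches nn.Linear modules sees (2) and (3) but misses (1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim))  # (1) not an nn.Linear
        self.lora_a = nn.Linear(in_dim, rank, bias=False)          # (2) caught by filter
        self.lora_b = nn.Linear(rank, out_dim, bias=False)         # (3) caught by filter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base linear applied manually for checkpoint compatibility,
        # plus the low-rank adapter path.
        return F.linear(x, self.weight) + self.lora_b(self.lora_a(x))
```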
Add an import-untyped ignore for snakeviz in case it is installed. Signed-off-by: [email protected] Change-Id: Ia951a0013d09e06c0d29a32bdb6b49ae11561d7d cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai Co-authored-by: Zingo Andersen <[email protected]>
Differential Revision: D87579688 Pull Request resolved: pytorch#15925
Differential Revision: D87280747 Pull Request resolved: pytorch#15862
Before: When running CUDA benchmarks on multiple models, any model export failure would halt the entire benchmark job. After: With the new configuration, the benchmark job will continue for models that export successfully, even if some models fail to export.
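A minimal sketch of the continue-on-failure pattern (`export_model` and the model list are hypothetical stand-ins, not the benchmark configuration itself):

```python
def export_model(name: str) -> None:
    # Hypothetical stand-in for the real per-model export step.
    if name == "bad-model":
        raise RuntimeError("export failed")

models = ["good-model", "bad-model", "other-model"]
failures = []
for model in models:
    try:
        export_model(model)
    except Exception as exc:
        # Record the failure and keep going instead of halting the job.
        failures.append((model, exc))
print(f"{len(models) - len(failures)} exported, {len(failures)} failed")
```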
It's not compatible with pytorch#15933, which causes 2^70+ byte counters like ``` Downloaded: 839890544179019776 / 1354151797 bytes (62023367397.93%) Downloaded: 841813590016000000 / 1354151797 bytes (62165378496.04%) ```
Summary:
This PR fixes two issues affecting the build and installation process:
1. **pyproject.toml configuration**: Fixed invalid `license` and
`license-files` fields that were causing build failures with newer
versions of `setuptools` and `pip` build isolation. The `license` field
now uses the table format `{text = ...}` and `license-files` was moved
to `[tool.setuptools]` (see the sketch after this list).
2. **Editable install version.py**: Fixed an issue where `version.py`
was being written to the project root instead of the package directory
(`src/executorch`) during editable installs. This was causing
`ImportError: cannot import name 'version'` when importing `executorch`.
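A minimal sketch of the pyproject.toml change from item 1 above (the license text and file glob here are placeholders, not the project's actual values):

```toml
[project]
license = {text = "BSD"}       # table form instead of a bare string

[tool.setuptools]
license-files = ["LICENSE"]    # moved here out of [project]
```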
Test Plan:
- Verified `pip install . --no-build-isolation` works (metadata
generation succeeds).
- Verified `pip install -e . --no-build-isolation` works and `from
executorch import version` succeeds.
Currently we download everything created during export and benchmarking (ptd, pte, benchmarking results, etc.) when uploading benchmarking results to the PyTorch hub. The ptd and pte files are large and unnecessary at this stage, and when benchmarking many models such large files cause out-of-disk-space errors. This PR prevents those large, unnecessary files from being downloaded to avoid the out-of-disk-space errors.
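A hypothetical sketch of the filtering step (the suffix list and function name are assumptions, not the PR's code):

```python
SKIP_SUFFIXES = (".pte", ".ptd")  # large compiled-model artifacts

def should_download(artifact_name: str) -> bool:
    # Benchmark results are small and needed for the upload; compiled
    # model files are large and unused at this stage, so skip them.
    return not artifact_name.endswith(SKIP_SUFFIXES)

assert should_download("benchmark_results.json")
assert not should_download("llama3.pte")
```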
### Summary Take down the AWS device farm benchmarking jobs. We are dropping them due to performance data being unreliable on non-rooted devices.
The QNN SDK download seems to be very unreliable. Trying to fix it and then re-enable it.
Differential Revision: D87510750 Pull Request resolved: pytorch#15944
Differential Revision: D87576772 Pull Request resolved: pytorch#15932
Explain how to prune a neural network and the associated performance uplift when running on the Ethos-U NPU.
Bias range was [-2147483648, 2147483646] which isn't really symmetric. This patch changes the range to [-2147483647, 2147483647]. Signed-off-by: Oscar Andersson <[email protected]>
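The arithmetic, spelled out:

```python
old_min, old_max = -2147483648, 2147483646   # the previous range
assert old_min != -old_max                   # not symmetric around zero
new_min, new_max = -(2**31 - 1), 2**31 - 1   # [-2147483647, 2147483647]
assert new_min == -new_max                   # negation maps the range onto itself
```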
- Chenweng-quic
- MatthiasHertel80 (Arm)
- Michaelmaitland (Meta internal)
- RahulC7 (Meta internal)
- can update author jorgep31415 to Juniper Pineda: https://github.com/junpi3
- Young Han (Meta): https://github.com/seyeong-han
- Mitch Bailey (Arm): https://github.com/jmahbs
- Alex Tawse: https://github.com/AlexTawseArm
- Tanvir Islam (Meta): https://github.com/tanvirislam-meta
Summary: Forward fix for pytorch#15368 Reviewed By: metascroy Differential Revision: D87712225
### Summary Fix eval_llama_qnn: retrieve custom annotation from quantization recipe ### Test plan ``` bash python -m executorch.examples.qualcomm.oss_scripts.llama.eval_llama_qnn --decoder_model qwen2_5-0_5b --quant_linear_only --max_seq_length 1024 --ptq 16a4w ```
PyTorch has nightly wheels for this
…execute right after compilation to create command buffers. Differential Revision: D87781471 Pull Request resolved: pytorch#15962
Differential Revision: D87749871 Pull Request resolved: pytorch#15955
Differential Revision: D87122487 Pull Request resolved: pytorch#15934
### Summary GLM Enablement `python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8750 --temperature 0 --model_mode kv --max_seq_len 128 --decoder_model glm-1_5b --prompt "Could you tell me about Facebook?"` ### Test plan `python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_static_glm1_5b --model SM8750 --build_folder build-android/ --executorch_root . -s $DEVICE --artifact ./glm1_5b`
Differential Revision: D87752226 Pull Request resolved: pytorch#15961
Implements a new pass which fuses activation ops with preceding cortex-m ops if possible. Removes quantization of conv1d and conv3d as they are not tested, and moves the Conv+relu test to test_activations. Propagates qmin/qmax to the conv kernel. Signed-off-by: Adrian Lundell <[email protected]>
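The fusion works because a quantized activation is just a tighter clamp on the requantized accumulator: a quantized ReLU clamps at the zero point (where real 0.0 lands), so raising qmin to that bound performs conv+activation in one op. This is illustrative of the general int8 idea, not the Cortex-M kernels:

```python
import torch

zero_point, qmin, qmax = -10, -128, 127   # example int8 output quantization
acc = torch.tensor([-50, -10, 0, 90])     # requantized conv output (int8 domain)

separate = torch.clamp(torch.clamp(acc, qmin, qmax), min=zero_point)  # conv, then ReLU
fused = torch.clamp(acc, zero_point, qmax)                            # conv with qmin raised
assert torch.equal(separate, fused)
```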
Summary
Track the llama export times. Fixes #10761.