Conversation

@mroreo mroreo commented Nov 3, 2025

Summary

Fixes #10761


@mroreo mroreo changed the title from "debug: add some timestamping to see how the timing would be called" to "regression testing: track the llama export times" Nov 3, 2025
@mroreo mroreo changed the title from "regression testing: track the llama export times" to "testing: track the llama export times" Nov 3, 2025
JacobSzwejbka and others added 28 commits November 5, 2025 15:33
### Summary
Pulling in the aoti change for lowering that lets you use the mingw posix flavor.
The fast path was broken for negative indices (see pytorch#15285). Because of this, pytorch#15366 disabled the fast path when the index tensor had negative indices.
In this PR we fix the bug, and re-enable the fast path for negative
indices.

Fixes pytorch#15285

Differential Revision: D86351194
Differential Revision: D85817305

Pull Request resolved: pytorch#15471
Arm tests logged too much because comparisons with logger.level were used instead of logger.getEffectiveLevel(). logger.level will always be logging.NOTSET unless explicitly set with logger.setLevel(), which we want to avoid. Instead, we should use logger.getEffectiveLevel(), which inherits the level from its parent.
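
A minimal sketch of the difference, using an illustrative logger name rather than one from the patch:

```python
import logging

# A child logger that never calls setLevel() reports logging.NOTSET via
# .level, but getEffectiveLevel() walks up and returns the inherited level.
logging.basicConfig(level=logging.WARNING)
child = logging.getLogger("executorch.backends.arm.example")

print(child.level)                 # 0 (logging.NOTSET) -- misleading in comparisons
print(child.getEffectiveLevel())   # 30 (logging.WARNING), inherited from the root logger

# Gating verbose work on the effective level avoids the over-logging described above.
if child.getEffectiveLevel() <= logging.DEBUG:
    child.debug("expensive debug dump ...")
```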

Signed-off-by: Oscar Andersson <[email protected]>
The pass assumed that if all repeat multiples are one, the op is a
no-op. However, it can still change the rank.
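
A small sketch of the corner case:

```python
import torch

x = torch.randn(3, 4)        # rank 2
y = x.repeat(1, 1, 1, 1)     # all multiples are one, so no data is repeated ...
print(x.shape, y.shape)      # ... but the result is rank 4: torch.Size([1, 1, 3, 4])
```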

Signed-off-by: Erik Lundell <[email protected]>
Add ignores for third-party modules and change `import common` to use the correct path.

Signed-off-by: [email protected]
### Summary
A minor refactor of the HF LLM model UT, so it is easier to maintain.
### Test plan
UT pass
…orch#15590)

A number of ops only handle shape/meta-data without changing the dynamic range. In these cases, no rescaling needs to be performed and the int8 portable_ops kernel can be used directly.

A new test is added to ensure this behaviour, as well as a test showing how operators which do change the dynamic range (SUB) are not supported.

To support quantization of graphs with no-rescale ops at the beginning/end of the graph, two new quantizers, InputQuantizer and OutputQuantizer, are introduced. By explicitly stating the dtype of the input/output, no-rescale ops inherit dtypes from them as with any other op.
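
An illustrative sketch of the distinction (not the backend code): shape/meta-data ops leave the int8 values and their range untouched, while an arithmetic op such as SUB produces values that no longer fit the original quantization parameters.

```python
import numpy as np

scale, zero_point = 0.02, 5
q = np.array([[-128, 0], [17, 127]], dtype=np.int8)

# Reshape/transpose only move values around: same range, same (scale, zero_point),
# so the int8 portable kernel can be used without rescaling.
reshaped = q.reshape(4)
assert reshaped.min() == q.min() and reshaped.max() == q.max()

# SUB changes the dynamic range: the difference can fall outside the int8 range
# described by the original quantization parameters, so a rescale would be needed.
diff = q.astype(np.int32) - 100
print(diff.min(), diff.max())   # roughly [-228, 27], outside [-128, 127]
```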

Signed-off-by: Adrian Lundell <[email protected]>
…ytorch#15630)

Fix mypy warnings in
test_insert_int32_casts_after_int64_placeholders_pass.py about using
Tensor instead of LongTensor.

Signed-off-by: [email protected]
Reuses the FoldAndAnnotateQParamsPass from the Arm backend to greatly
simplify the logic for fusing the ops.

Additionally updates the linear kernel to be numerically correct and computes the kernel_sum ahead of time (AOT) in the quantized_linear_fusion pass. Note that since this replaces the bias node, it typically causes no extra memory usage.
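
A rough sketch of the folding, based on my reading of the description above rather than the actual pass:

```python
import numpy as np

def fold_kernel_sum(weight_int8: np.ndarray, bias_int32: np.ndarray, in_zp: int) -> np.ndarray:
    # The per-output-channel term sum_k(W[o, k]) * in_zp is constant, so it can be
    # computed ahead of time and folded into the bias the node already stores.
    kernel_sum = weight_int8.astype(np.int32).sum(axis=1)
    return bias_int32 - kernel_sum * in_zp

weight = np.random.randint(-128, 128, size=(8, 16), dtype=np.int8)
bias = np.random.randint(-1000, 1000, size=(8,), dtype=np.int32)
print(fold_kernel_sum(weight, bias, in_zp=3).shape)   # (8,), same shape as the bias it replaces
```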

Updates the Linear tests to mirror this, including removing the various matmul tests. Since linear is handled as a separate op rather than as a particular type of matmul, these tests are no longer relevant.

Removes unnecessary stub definitions in operators.py, operators.yaml and
op_quantized_linear.cpp

Leaving a few TODOs since the patch is already large.


Signed-off-by: Adrian Lundell <[email protected]>
Add a return None if elf_path is not set.

Signed-off-by: [email protected]
…h#15635)

- Add (0,3,1,2) and (0,2,3,1) as permutations supported for large
shapes.
- Lower permutations expressible as views ('singleton permutations') to views to allow them to run on the Ethos-U55.

All added unit tests were previously not lowered, which leads, for example, to 19 permutes being delegated on the convnext_tiny model from torchvision.
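
A sketch of what makes a permutation a 'singleton permutation': only size-1 dimensions move, so the element order in memory is unchanged and the permute can be lowered to a view.

```python
import torch

x = torch.arange(12).reshape(1, 3, 4)
p = x.permute(1, 2, 0)       # shape (3, 4, 1): only the size-1 dim moved
v = x.reshape(3, 4, 1)       # the same result expressed as a view

print(torch.equal(p.reshape(-1), v.reshape(-1)))   # True -- identical element order
```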

Signed-off-by: Adrian Lundell <[email protected]>
…ch#15632)

### Summary
Delay compile-spec creation in the backend test flow to prevent sharing
the temp directory between tests.

Previously, using a shared compile spec implied a shared temp directory.
After we began cleaning the temp directory after each test, this sharing
caused conflicts.

### Test plan
This is tested by the Backend test flow

Signed-off-by: Zingo Andersen <[email protected]>
)

### Summary
This PR replaces the optimization in `move_relu_before_concat.py` with the `MoveActivationBeforeConcat` aten pass. The pass moves selected activations that are supported for fusion on Neutron (Relu, Relu6, Sigmoid, Tanh) before the `concat` node if the concat input nodes are either Conv 2D or Linear 2D. The node logic is determined by target specs, now supporting Neutron-C. Tests updated.
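
An illustrative sketch of the rewrite (descriptive names, not the pass API): element-wise activations commute with concat, which is what lets the activation be moved in front of the concat and fused into the preceding Conv/Linear.

```python
import torch

a, b = torch.randn(1, 8, 4, 4), torch.randn(1, 8, 4, 4)

before = torch.relu(torch.cat([a, b], dim=1))                # activation after concat
after = torch.cat([torch.relu(a), torch.relu(b)], dim=1)     # activation moved before concat

print(torch.allclose(before, after))   # True
```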

### Test plan
Unit tests provided (test_move_activation_before_concatenation.py).

cc @robert-kalmar
…#15633)

### Summary
Fix the filename in the log to match the file.

### Test plan
Tested by hand


cc @freddan80 @per @oscarandersson8218 @digantdesai

Signed-off-by: Zingo Andersen <[email protected]>
## Context

The SDPA custom op accepts the `input_pos` (i.e. cache position) argument as a symbolic integer. The value of the symbolic integer is obtained by selecting the first element of a cache position input tensor and converting it to symint via local_scalar_dense.

Currently, ET-VK handles this in a hacky manner.

1. the select + local_scalar_dense op pattern is removed, and the cache pos tensor is passed directly into the custom sdpa ops
2. Single element tensors that have users that are all select + local_scalar_dense will be interpreted as symints instead of tensors

Unfortunately, this technique will not work for the huggingface implementation of transformer models, since the cache pos input tensor has not just a single element but is expected to be a vector of integer cache positions corresponding to all cache positions that will be updated.

## Changes

Introduce a custom op to capture the select + local_scalar_dense op pattern, which is the proper way to handle the op pattern.

Note that a custom op is needed because this op needs to access the staging buffer data of the input tensor, whereas `select` would typically be executed via a compute shader. The reason for this is that the `input_pos` value is needed to configure the sizes of attention weight tensors participating in the custom SDPA op, so the value must be set before any command buffers are dispatched.

As a consequence of this change, the previous handling of select + local scalar dense can also be removed.
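
For reference, a minimal sketch of the eager pattern involved (illustrative function, not the ET-VK code); it is this select-then-scalar extraction that the new custom op captures so the value is available before any command buffers are dispatched:

```python
import torch

def input_pos_from_cache_pos(cache_pos: torch.Tensor) -> int:
    first = cache_pos[0]       # lowers to aten.select.int in the exported graph
    return int(first.item())   # lowers to local_scalar_dense in the exported graph

print(input_pos_from_cache_pos(torch.tensor([7, 8, 9], dtype=torch.long)))   # 7
```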

Differential Revision: [D86340340](https://our.internmc.facebook.com/intern/diff/D86340340/)
…ytorch#15645)

SDPA used to be handled by a custom op `sdpa_with_kv_cache`, but it was eventually split (D62301837) into update_cache and custom_sdpa ops.

However, having a single fused op is useful for Vulkan since it allows more control over how the cache tensors are stored and represented. Essentially, it makes it easier to manage the cache tensors and opens up opportunities for future optimizations. This diff introduces a fusion pass that does 2 things:

1. Combine update_cache and custom_sdpa back into sdpa_with_kv_cache
2. Ensure all references to the cache_pos symint use the same node - this prevents the select_at_dim_as_symint op from being called every time it is used.

Differential Revision: [D86340339](https://our.internmc.facebook.com/intern/diff/D86340339/)
…grid_sampler 2D and 3D (pytorch#15371)

### Summary
Enable operators adaptive_max_pool2d and grid_sampler 2D and 3D

### Test plan
```bash
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNFloatingPointOperator.test_qnn_backend_adaptive_max_pool2d -b build-android -H $HOST -s $SN -m $CHIPID
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_adaptive_max_pool2d -b build-android -H $HOST -s $SN -m $CHIPID
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNFloatingPointOperator.test_qnn_backend_grid_sampler -b build-android -H $HOST -s $SN -m $CHIPID
python backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_grid_sampler -b build-android -H $HOST -s $SN -m $CHIPID
```
- Update the theme version to pull the wheel from PyPI
- Change how we obtain the version in the CI.
- Updated to properly parse the `RELEASE` variable
- Fixed `Makefile` to use `RELEASE=true` instead of `RELEASE=1` for
consistency
- Workflow sets `RELEASE=true` only for tagged releases (e.g., `v1.1.0`)
- Main branch builds with `<meta name="robots" content="noindex">` tag
- Release builds remain indexable by search engines

cc @mergennachin @byjlw
…est-aten-div-out-mode (pytorch#15568)

This PR was created by the merge bot to help merge the original PR into
the main branch.
ghstack PR number: pytorch#15494 by
@zonglinpeng
^ Please use this as the source of truth for the PR details, comments,
and reviews
ghstack PR base:
https://github.com/pytorch/executorch/tree/gh/zonglinpeng/6/base
ghstack PR head:
https://github.com/pytorch/executorch/tree/gh/zonglinpeng/6/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head:
https://github.com/pytorch/executorch/tree/gh/zonglinpeng/6/orig
Differential Revision:
[D85364551](https://our.internmc.facebook.com/intern/diff/D85364551/)
@diff-train-skip-merge

---------

Co-authored-by: Zonglin Peng <[email protected]>
…inear ops fundamentally changes the way we decompose the Ops and match them (pytorch#15665)

Summary:
^^^

Note that there are new dedicated CortexM tests to rely on for the new
flow

Differential Revision: D86469035
…#15551)

A new setup option is added: --install-mlsdk-deps-with-pip
For Linux/Windows x86 machines, PyPI packages of the MLSDK repository may be used for the VGF backend.
This will eventually be the default; it is not yet the default because of a limitation in model-converter when handling large models. The new option will decrease setup time, which can enable VGF backend testing in GitHub.

Co-authored-by: Per Held <[[email protected]](mailto:[email protected])>,
Ryan O'Shea <[[email protected]](mailto:[email protected])>


cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai

Signed-off-by: Måns Nilsson <[email protected]>
Co-authored-by: Per Held <[email protected]>
Co-authored-by: Per Held <[email protected]>, Ryan O'Shea <[email protected]>
mansnils and others added 29 commits November 21, 2025 09:09
The executor runner supports models with/without bundled IO in the same path. To enable bundled IO, EXECUTORCH_BUILD_DEVTOOLS and EXECUTORCH_ENABLE_BUNDLE_IO are required.

Adds tests in the Arm backend that exercise/depend on this. Besides enabling bundle-io for the VGF backend where applicable, some additional ResNet model tests are enabled as well.

Avoids narrowing conversion errors in the pte_to_header script by switching char to unsigned char.

Signed-off-by: Måns Nilsson <[email protected]>
Co-authored-by: Jacob Szwejbka <[email protected]>
Summary: Suspect the failure
https://github.com/pytorch/pytorch/actions/runs/19547462483/job/55989739476
is due to using a different QnnBackend implementation. Rename this demo backend to a dedicated demo backend name.

Differential Revision: D87586567
### Summary
LoraLinears contain:
1. base weight (nn.Linear)
2. lora_a (nn.Linear)
3. lora_b (nn.Linear) 

(2) and (3) are caught by the filter, but (1) is not, as the weight and
bias are pulled out of the nn.Linear and placed into nn.Parameters, and
the linear is performed manually. This is for checkpoint compatibility -
otherwise we'd have to map the weights for any lora model.

See:

https://github.com/pytorch/executorch/blob/b4d72f1e271915e9c0e1d313753a1eec840fbdee/examples/models/llama/lora.py#L31-L37

This PR adds lora linears into the quantization filter.
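
A hypothetical sketch of the structure described above, showing why a filter that only matches nn.Linear instances misses the base weight:

```python
import torch
from torch import nn

class LoraLinearSketch(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim))   # (1) base weight, not an nn.Linear
        self.lora_a = nn.Linear(in_dim, rank, bias=False)          # (2) caught by the filter
        self.lora_b = nn.Linear(rank, out_dim, bias=False)         # (3) caught by the filter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = nn.functional.linear(x, self.weight)                # manual linear for (1)
        return base + self.lora_b(self.lora_a(x))

m = LoraLinearSketch(16, 32, rank=4)
print([name for name, mod in m.named_modules() if isinstance(mod, nn.Linear)])
# ['lora_a', 'lora_b'] -- the base weight never shows up as an nn.Linear
```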

### Test plan
```
python -m extension.llm.export.export_llm \
    base.checkpoint="${DOWNLOADED_PATH}/consolidated.00.pth" \
    base.params="${DOWNLOADED_PATH}/params.json" \
    base.adapter_checkpoint="../et_docs_7_epoch/adapter_model.safetensors" \
    base.adapter_config="../et_docs_7_epoch/adapter_config.json" \
    base.tokenizer_path="../et_docs_7_epoch/" \
    model.use_kv_cache=true \
    model.use_sdpa_with_kv_cache=true \
```

Confirm output model size is ~1.7GB instead of 5.1GB. 
```
(executorch) [[email protected] /data/users/lfq/executorch (lfq.quantize-lora-linears)]$ ls -la *.pte
-rw-r--r-- 1 lfq users 5106135168 Nov 20 15:59 et_lora.pte
-rw-r--r-- 1 lfq users 1733835776 Nov 20 17:07 et_lora_fix.pte
```
Add an import-untyped ignore for snakeviz in case it is installed.

Signed-off-by: [email protected]
Change-Id: Ia951a0013d09e06c0d29a32bdb6b49ae11561d7d


cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai

Signed-off-by: [email protected]
Co-authored-by: Zingo Andersen <[email protected]>
Differential Revision: D87579688

Pull Request resolved: pytorch#15925
Differential Revision: D87280747

Pull Request resolved: pytorch#15862
Before:
When running CUDA benchmarks on multiple models, any model export
failure would halt the entire benchmark job.

After:
With the new configuration, the benchmark job will continue for models
that export successfully, even if some models fail to export.
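
A hedged sketch of the behaviour (not the actual CI configuration): export each model independently and collect failures instead of aborting on the first one.

```python
def export_all(models, export_fn):
    succeeded, failed = [], {}
    for name in models:
        try:
            succeeded.append((name, export_fn(name)))
        except Exception as err:   # record the failure and keep benchmarking the rest
            failed[name] = err
    return succeeded, failed

ok, bad = export_all(["good_model", "broken_model"],
                     lambda name: 1 / 0 if "broken" in name else name + ".pte")
print([n for n, _ in ok], list(bad))   # ['good_model'] ['broken_model']
```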
It's not compatible with pytorch#15933, which causes runaway byte counters like
```
Downloaded: 839890544179019776 / 1354151797 bytes (62023367397.93%)
Downloaded: 841813590016000000 / 1354151797 bytes (62165378496.04%)
```
Summary:
This PR fixes two issues affecting the build and installation process:

1. **pyproject.toml configuration**: Fixed invalid `license` and
`license-files` fields that were causing build failures with newer
versions of `setuptools` and `pip` build isolation. The `license` field
now uses the table format `{text = ...}` and `license-files` was moved
to `[tool.setuptools]`.

2. **Editable install version.py**: Fixed an issue where `version.py`
was being written to the project root instead of the package directory
(`src/executorch`) during editable installs. This was causing
`ImportError: cannot import name 'version'` when importing `executorch`.

Test Plan:
- Verified `pip install . --no-build-isolation` works (metadata
generation succeeds).
- Verified `pip install -e . --no-build-isolation` works and `from
executorch import version` succeeds.

Currently we download everything created during export and benchmarking, including ptd, pte, benchmarking results, etc., when trying to upload benchmarking results to pytorch hub. ptd and pte files are large and unnecessary for this stage, and when we benchmark lots of models, such large files will cause an out-of-disk-space error.

This PR prevents those large and unnecessary files from being downloaded, to avoid the out-of-disk-space error.
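
A hedged sketch of the filtering (hypothetical helper, not the actual benchmark tooling): skip the large model files and only pull what the upload step needs.

```python
LARGE_MODEL_SUFFIXES = (".pte", ".ptd")

def artifacts_to_download(artifact_names):
    # Keep only the small result files; the model binaries are not needed at this stage.
    return [name for name in artifact_names if not name.endswith(LARGE_MODEL_SUFFIXES)]

print(artifacts_to_download(["llama.pte", "llama.ptd", "benchmark_results.json"]))
# ['benchmark_results.json']
```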
### Summary
Take down the AWS device farm benchmarking jobs. We are dropping them
due to performance data being unreliable on non-rooted devices.
It seems the QNN SDK download is very unreliable. Trying to fix it and then re-enable it.
Differential Revision: D87510750

Pull Request resolved: pytorch#15944
Differential Revision: D87576772

Pull Request resolved: pytorch#15932
Explain how to prune a neural network and the associated uplift in performance when running on the Ethos-U NPU.
Bias range was [-2147483648, 2147483646] which isn't really symmetric.
This patch changes the range to [-2147483647, 2147483647].
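
The change in numbers, spelled out (values from the commit message):

```python
old_range = (-(2**31), 2**31 - 2)        # (-2147483648, 2147483646): not symmetric around zero
new_range = (-(2**31 - 1), 2**31 - 1)    # (-2147483647, 2147483647): symmetric around zero

print(old_range, new_range)
assert -new_range[0] == new_range[1]
```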

Signed-off-by: Oscar Andersson <[email protected]>
Chenweng-quic
MatthiasHertel80 (Arm)
Michaelmaitland (Meta internal)
RahulC7 (Meta internal)
Can update author jorgep31415 to Juniper Pineda: https://github.com/junpi3
Young Han - Meta - https://github.com/seyeong-han
Mitch Bailey - https://github.com/jmahbs (Arm)
Alex Tawse - https://github.com/AlexTawseArm
Tanvir Islam - https://github.com/tanvirislam-meta (Meta)
Summary: Forward fix for
pytorch#15368

Reviewed By: metascroy

Differential Revision: D87712225
### Summary
Fix eval_llama_qnn: retrieve custom annotation from quantization recipe

### Test plan
``` bash
python -m executorch.examples.qualcomm.oss_scripts.llama.eval_llama_qnn --decoder_model qwen2_5-0_5b --quant_linear_only --max_seq_length 1024 --ptq 16a4w
```
PyTorch has nightly wheels for this
…execute right after compilation to create command buffers.

Differential Revision: D87781471

Pull Request resolved: pytorch#15962
Differential Revision: D87749871

Pull Request resolved: pytorch#15955
Differential Revision: D87122487

Pull Request resolved: pytorch#15934
### Summary

GLM Enablement
`python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s
$DEVICE -m SM8750 --temperature 0 --model_mode kv --max_seq_len 128
--decoder_model glm-1_5b --prompt "Could you tell me about Facebook?"`

### Test plan
`python backends/qualcomm/tests/test_qnn_delegate.py -k
TestExampleLLMScript.test_static_glm1_5b --model SM8750 --build_folder
build-android/ --executorch_root . -s $DEVICE --artifact ./glm1_5b`
Differential Revision: D87752226

Pull Request resolved: pytorch#15961
Implements a new pass which fuses activations with preceding Cortex-M ops where possible.

Removes quantization of conv1d and conv3d as they are not tested, and moves the Conv+relu test to test_activations.

Propagates qmin, qmax to the conv kernel.
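
A hedged sketch of the fusion idea (illustrative numbers, not the pass itself): a ReLU6 after a quantized conv can be folded into the conv by narrowing the qmin/qmax the kernel already clamps its output to.

```python
def fuse_relu6_into_clamp(qmin: int, qmax: int, scale: float, zero_point: int):
    relu6_qmin = zero_point                          # quantized value of 0.0
    relu6_qmax = round(6.0 / scale) + zero_point     # quantized value of 6.0
    return max(qmin, relu6_qmin), min(qmax, relu6_qmax)

print(fuse_relu6_into_clamp(-128, 127, scale=0.05, zero_point=-128))
# (-128, -8): the conv output clamp now implements the ReLU6, no separate op needed
```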

Signed-off-by: Adrian Lundell <[email protected]>
@mroreo mroreo had a problem deploying to upload-benchmark-results December 2, 2025 03:42 — with GitHub Actions Failure