
Commit fe35988

trivedivivek authored and facebook-github-bot committed
Using buffer for weight tensors for quantized mat mul op. (pytorch#15990)
Summary: This change affects the performance and memory usage of the quantized matrix multiplication operation in the ExecuTorch Vulkan backend. By using a buffer instead of a 2D texture for weight tensors, the operation may become more efficient and use less memory, especially for large matrices.

Reviewed By: yipjustin

Differential Revision: D87911255
1 parent b6342c6 commit fe35988

File tree

1 file changed (+1, -1 lines)


backends/vulkan/runtime/graph/ops/impl/Staging.cpp

Lines changed: 1 addition & 1 deletion
```diff
@@ -285,7 +285,7 @@ ValueRef prepack_int4_linear_weight_transposed_interleaved(
   const int64_t N = qmat2_orig_sizes.at(ndim - 2);
   const int64_t N_div2 = N / int64_t(2);

-  utils::StorageType storage_type = utils::kTexture2D;
+  utils::StorageType storage_type = utils::kBuffer;
   uint32_t max_extent = graph.context()->adapter_ptr()->max_texture2d_dim();
   if (N_div2 > max_extent * 4 || K > max_extent) {
     storage_type = utils::kBuffer;
```
