Open
Labels: bug (Confirmed bugs)
🐛 Bug
When using the mlc-llm Swift package to chat with vision language models, specifically the Phi-3-vision-instruct model, errors occur when inputting an image for the second time or when inputting multiple images simultaneously. The first image input processes correctly, but subsequent attempts result in one of the following errors:
- NDArray size mismatch:
libc++abi: terminating due to uncaught exception of type tvm::runtime::InternalError: [23:07:43] /Users/neet/code/mlc-llm/3rdparty/tvm/src/runtime/ndarray.cc:213: Check failed: relative_byte_offset + view_size <= curr_size (11046600 vs. 1017846) : ValueError: View with shape [1, 1700, 2166, 3] and datatype uint8 would have a size of 11046600 bytes. This would occupy bytes 0 <= i_byte < 11046600 within the backing array. However, the NDArray being viewed only contains 1017846 bytes (shape = [1, 618, 549, 3], dtype= uint8).
- Embedding shape mismatch:
libc++abi: terminating due to uncaught exception of type tvm::runtime::InternalError: [23:22:38] /Users/neet/code/mlc-llm/cpp/serve/model.cc:1023: InternalError: Check failed: embedding->shape[0] + offset <= dst->shape[0] (2535 vs. 2048) :
To Reproduce
Steps to reproduce the behavior:
- Send an image from the iOS app through the Swift mlc-llm SDK via MLCEngine.chatCompletion() as a base64-encoded URL. This image enters the pipeline and tokens decode correctly.
- Send a second image; this time it throws one of the two errors above.

or,
- Send multiple images simultaneously from the iOS app through the Swift mlc-llm SDK via MLCEngine.chatCompletion() as base64-encoded URLs.
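For reference, the two-turn reproduction can be sketched roughly as below. This is a minimal sketch that assumes the Swift SDK follows an OpenAI-style chat-completion flow (an async token stream per request, with the image passed inside a user message as a base64 data URL); everything other than MLCEngine.chatCompletion(), which the steps above name, is an assumption and the concrete message types in the real package may differ.

```swift
// Hypothetical reproduction sketch, not the SDK's exact API.
// Message construction is left as comments to avoid guessing field names.
import Foundation

func reproduce(engine: MLCEngine, firstImage: Data, secondImage: Data) async {
    // Base64-encode an image as a data URL, as described in the steps above.
    func dataURL(_ image: Data) -> String {
        "data:image/jpeg;base64," + image.base64EncodedString()
    }

    // Turn 1: the first image is processed correctly and tokens stream back.
    for await _ in await engine.chatCompletion(
        messages: [/* user message embedding dataURL(firstImage) */]
    ) { /* tokens decode as expected */ }

    // Turn 2: the second image triggers one of the crashes quoted above
    // (NDArray view size mismatch or embedding-shape check failure).
    for await _ in await engine.chatCompletion(
        messages: [/* user message embedding dataURL(secondImage) */]
    ) { /* aborts with a tvm::runtime::InternalError */ }
}
```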
Expected behavior
The model should process multiple image inputs without throwing exceptions.
Environment
- Platform: iOS
- Operating system: macOS
- Device: macOS
- How you installed MLC-LLM: conda & pip
- How you installed TVM-Unity: Source