Internal error for batch inference: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead. #3064

Open

chenjiangtao opened this issue on Mar 13, 2025 · 1 comment
chenjiangtao commented Mar 13, 2025

System Info

OS: macOS (Apple M3 Pro)
pip --version
pip 25.0.1 from /opt/homebrew/lib/python3.12/site-packages/pip (python 3.12)

Running Xinference with Docker?

  • [ ] docker
  • [x] pip install
  • [ ] installation from source

Version info

version 1.3.1.post1

The command used to start Xinference

install scripts:

```
pip install xinference
pip install transformers
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```

start scripts:

```
XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997
```
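
Before launching on Apple silicon it can help to confirm that the installed PyTorch build actually exposes the Metal (MPS) backend; a quick sanity check (mine, not part of the original report):

```python
# Sanity check for the Apple-silicon MPS backend in this PyTorch build.
import torch

print("MPS built:    ", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

if torch.backends.mps.is_available():
    # Tensors created on MPS default to float32; half-precision weights
    # must be cast consistently, which is what fails in the report below.
    x = torch.ones(2, 2, device="mps")
    print(x.device, x.dtype)  # mps:0 torch.float32
```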

Reproduction

  1. Download the language model deepseek-chat with this config:
    • Model Engine: transformers
    • Model Format: pytorch
    • Model Size: 7
    • Quantization: 8-bit
    • GPU count: auto

  2. Start and run: go to http://localhost:9997/deepseek-chat/ and say "hello" (a scripted equivalent is sketched below).
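
For a scriptable reproduction (a sketch of mine, not from the report; the keyword arguments mirror the web-UI fields above and may differ between Xinference versions):

```python
# Hedged sketch: launch deepseek-chat through the Xinference Python
# client with the same settings as the web UI, then send "hello".
from xinference.client import Client

client = Client("http://localhost:9997")
model_uid = client.launch_model(
    model_name="deepseek-chat",
    model_engine="transformers",
    model_format="pytorch",
    model_size_in_billions=7,
    quantization="8-bit",
)
model = client.get_model(model_uid)
# The exact chat signature varies by version; 1.x accepts a message list.
print(model.chat([{"role": "user", "content": "hello"}]))
```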

Error info:

```
2025-03-13 21:13:34,397 xinference.model.llm.transformers.utils 30309 ERROR Internal error for batch inference: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.12/site-packages/xinference/model/llm/transformers/utils.py", line 502, in batch_inference_one_step
_batch_inference_one_step_internal(
File "/opt/homebrew/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xinference/model/llm/transformers/utils.py", line 280, in _batch_inference_one_step_internal
out = model(**prefill_kws, use_cache=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 834, in forward
outputs = self.model(
^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 592, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 335, in forward
hidden_states, self_attn_weights = self.self_attn(
^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 291, in forward
attn_output, attn_weights = attention_interface(
^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/transformers/integrations/sdpa_attention.py", line 53, in sdpa_attention_forward
attn_output = torch.nn.functional.scaled_dot_product_attention(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.
Destroy generator f7563d26000c11f0aac0a6e0c79d02ed due to an error encountered.
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.12/site-packages/xoscar/api.py", line 419, in xoscar_next
r = await asyncio.create_task(_async_wrapper(gen))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xoscar/api.py", line 409, in _async_wrapper
return await _gen.__anext__() # noqa: F821
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xinference/core/model.py", line 569, in _to_async_gen
async for v in gen:
File "/opt/homebrew/lib/python3.12/site-packages/xinference/core/model.py", line 762, in _queue_consumer
raise RuntimeError(res[len(XINFERENCE_STREAMING_ERROR_FLAG) :])
RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.
2025-03-13 21:13:43,314 xinference.api.restful_api 30161 ERROR Chat completion stream got an error: [address=0.0.0.0:59945, pid=30309] Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.12/site-packages/xinference/api/restful_api.py", line 2048, in stream_results
async for item in iterator:
File "/opt/homebrew/lib/python3.12/site-packages/xoscar/api.py", line 340, in anext
return await self._actor_ref.xoscar_next(self._uid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xoscar/backends/context.py", line 231, in send
return self._process_result_message(result)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/opt/homebrew/lib/python3.12/site-packages/xoscar/backends/pool.py", line 667, in send
result = await self._run_coro(message.message_id, coro)
^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/opt/homebrew/lib/python3.12/site-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
^^^^^^^^^^^^^^^^^
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
^^^^^^^^^^^^^^^^^
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
^^^^^^^^^^^^^^^^^
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive
result = await result
^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xoscar/api.py", line 431, in xoscar_next
raise e
File "/opt/homebrew/lib/python3.12/site-packages/xoscar/api.py", line 419, in xoscar_next
r = await asyncio.create_task(_async_wrapper(gen))
^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xoscar/api.py", line 409, in _async_wrapper
return await _gen.__anext__() # noqa: F821
^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xinference/core/model.py", line 569, in _to_async_gen
async for v in gen:
^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xinference/core/model.py", line 762, in _queue_consumer
raise RuntimeError(res[len(XINFERENCE_STREAMING_ERROR_FLAG) :])
^^^^^^^^^^^^^^^^^
RuntimeError: [address=0.0.0.0:59945, pid=30309] Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.12/site-packages/gradio/queueing.py", line 527, in process_events
response = await route_utils.call_process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/gradio/route_utils.py", line 261, in call_process_api
output = await app.get_blocks().process_api(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/gradio/blocks.py", line 1786, in process_api
result = await self.call_function(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/gradio/blocks.py", line 1350, in call_function
prediction = await utils.async_iteration(iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/gradio/utils.py", line 583, in async_iteration
return await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/gradio/utils.py", line 709, in asyncgen_wrapper
response = await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/gradio/chat_interface.py", line 545, in _stream_fn
first_response = await async_iteration(generator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/gradio/utils.py", line 583, in async_iteration
return await iterator.__anext__()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/gradio/utils.py", line 576, in anext
return await anyio.to_thread.run_sync(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2461, in run_sync_in_worker_thread
return await future
^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 962, in run
result = context.run(func, *args)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/gradio/utils.py", line 559, in run_sync_iterator_async
return next(iterator)
^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xinference/core/chat_interface.py", line 129, in generate_wrapper
for chunk in model.chat(
^^^^^^^^^^^
File "/opt/homebrew/lib/python3.12/site-packages/xinference/client/common.py", line 51, in streaming_response_iterator
raise Exception(str(error))
Exception: [address=0.0.0.0:59945, pid=30309] Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.
```
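
The failing call is torch.nn.functional.scaled_dot_product_attention, which requires query, key, and value to share a single dtype. A minimal standalone sketch (mine, not from the report; shapes are arbitrary) reproduces the same RuntimeError when the value tensor is half precision while query and key are float32, which is exactly the mix in the log above:

```python
# Minimal repro sketch of the underlying SDPA dtype check.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 16, 64)                       # float32 query
k = torch.randn(1, 8, 16, 64)                       # float32 key
v = torch.randn(1, 8, 16, 64, dtype=torch.float16)  # half-precision value (c10::Half)

try:
    F.scaled_dot_product_attention(q, k, v)
except RuntimeError as e:
    print(e)  # Expected query, key, and value to have the same dtype ...
```

Casting all three tensors to one dtype (for example v.float()) makes the call succeed, which suggests the 8-bit quantization path is leaving some attention projections in half precision on this backend.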

Expected behavior

The model runs successfully with no error.

@XprobeBot added the gpu label on Mar 13, 2025
@XprobeBot added this to the v1.x milestone on Mar 13, 2025
qinxuye (Contributor) commented Mar 17, 2025

For Mac, llama.cpp or MLX are better options.
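
For illustration (a sketch of mine, not from the comment; the format and quantization names are assumptions that may differ by Xinference version and model build), relaunching with the llama.cpp engine might look like:

```python
# Hedged sketch: relaunch deepseek-chat with a Mac-friendly engine.
# model_format "ggufv2" and quantization "Q4_K_M" are assumptions;
# check the Xinference docs for the options your version supports.
from xinference.client import Client

client = Client("http://localhost:9997")
model_uid = client.launch_model(
    model_name="deepseek-chat",
    model_engine="llama.cpp",   # or "MLX" with an MLX-format model
    model_format="ggufv2",
    model_size_in_billions=7,
    quantization="Q4_K_M",
)
```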
