The stateless design of the OAI v1 chat endpoints does not play very well with the way llama.cpp works. The current context cache is a nice workaround to make the default case of turn-by-turn chat fast, but a better solution would be to offer a way to address chat sessions explicitly and to add the concept of a (stateful, persistable) thread.
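To make the contrast concrete, here's a rough sketch (illustration only; the base URL, model name and thread routes are made up, not existing endpoints): with the stateless OAI-style endpoint the client resends the whole message array every turn and the server has to figure out which cached context, if any, still matches, while a thread endpoint would let the client address the session explicitly.

```ts
// Illustration only -- hypothetical URLs and payloads.
const base = "http://localhost:3000"; // placeholder base URL

// Stateless OAI-style chat: the full history goes over the wire on every turn.
const history = [
  { role: "user", content: "Hello" },
  { role: "assistant", content: "Hi! How can I help?" },
  { role: "user", content: "Summarize our chat." },
];
await fetch(`${base}/v1/chat/completions`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ model: "some-model", messages: history }),
});

// Hypothetical stateful alternative: the session is addressed explicitly,
// so the server knows exactly which llama.cpp context to keep and reuse.
const thread = await (await fetch(`${base}/threads`, { method: "POST" })).json();
await fetch(`${base}/threads/${thread.id}/messages`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ role: "user", content: "Summarize our chat." }),
});
```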
Technically it would be possible to improve the context cache for stateless requests as well (i.e., remember the last N states, allow context to be reused even when the message arrays differ more substantially, and support editing of messages without having to re-ingest everything). I'm afraid, though, that this would add a lot of complexity while still being suboptimal in some cases. That's why I don't plan to prioritize the stateless chat context cache for now. If we are going to add complexity, it seems more valuable to put it into thread state and into a "native" HTTP API that exposes functionality in a way that is modelled more closely after how the internals work, while keeping that API as simple and transparent as possible and exposing as much functionality as we can.
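What I mean by "remember the last N states", as a minimal sketch (not existing code): keep several cached token sequences around and, for a new stateless request, pick the one sharing the longest token prefix, so that only the non-matching suffix has to be re-ingested.

```ts
// Sketch only: cached contexts identified by the token sequence they contain.
interface CachedState {
  id: string;
  tokens: number[]; // tokens already evaluated into this context
}

function commonPrefixLength(a: number[], b: number[]): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

// Pick the cached state that lets us skip the most work for `promptTokens`.
// Everything after `reuseLength` still has to be evaluated from scratch.
function pickBestState(
  cache: CachedState[],
  promptTokens: number[]
): { state: CachedState | null; reuseLength: number } {
  let best: CachedState | null = null;
  let bestLen = 0;
  for (const state of cache) {
    const len = commonPrefixLength(state.tokens, promptTokens);
    if (len > bestLen) {
      best = state;
      bestLen = len;
    }
  }
  return { state: best, reuseLength: bestLen };
}
```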
What I had in mind for such an API:
- Allow creation of all task types and try to keep inputs as close as possible to their respective interfaces in node
- Support monitoring tasks and cancelling them
- Include both stateful (`/threads`?) and stateless (`/tasks/chat`?) chat endpoints (Some drafts)
- Package an in-memory thread store that can be swapped out for a persistent one (see the sketch after this list)
- Allow mutation of these threads (make requests for generation of new assistant messages explicit)
- The instance locking process (and the existence of model instances, or engines) should probably not be part of this API. I can't yet think of a way it would be useful, other than including the instance id in responses for debugging.
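To make this a bit more tangible, here's a rough TypeScript sketch of what the thread store and route layout could look like. All names, routes and types are placeholders, not a committed design:

```ts
import { randomUUID } from "node:crypto";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface Thread {
  id: string;
  messages: ChatMessage[];
}

// The store sits behind an interface so the packaged in-memory default
// can be swapped out for a persistent implementation.
interface ThreadStore {
  create(): Promise<Thread>;
  get(id: string): Promise<Thread | undefined>;
  appendMessage(id: string, message: ChatMessage): Promise<void>;
  delete(id: string): Promise<void>;
}

class InMemoryThreadStore implements ThreadStore {
  private threads = new Map<string, Thread>();

  async create(): Promise<Thread> {
    const thread: Thread = { id: randomUUID(), messages: [] };
    this.threads.set(thread.id, thread);
    return thread;
  }
  async get(id: string): Promise<Thread | undefined> {
    return this.threads.get(id);
  }
  async appendMessage(id: string, message: ChatMessage): Promise<void> {
    this.threads.get(id)?.messages.push(message);
  }
  async delete(id: string): Promise<void> {
    this.threads.delete(id);
  }
}

// Possible route layout (again, just a draft):
//   POST   /threads                  create a thread
//   POST   /threads/:id/messages     append a message (no generation triggered)
//   POST   /threads/:id/assistant    explicitly request a new assistant message
//   POST   /tasks/chat               stateless one-off chat task
//   GET    /tasks/:id                monitor a running task
//   DELETE /tasks/:id                cancel it
```

Keeping the generation request (`POST /threads/:id/assistant` in this draft) separate from appending messages is what I mean by making mutation explicit: adding a user message alone never kicks off inference.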
Everything is up for debate. I have no time to work on this yet, but I thought I should post early about the rough direction I'd like it to take.
Contributions welcome :)