This is more of a niche feature, but I'm attempting to serve nnsight as an API using FastAPI. I'm accessing the model and invoking the trace using `loop.run_in_executor()`. This mostly works. However, when I run multiple requests at the same time I receive the error: `RuntimeError: trying to pop from empty mode stack`. I assume this is because nnsight uses some sort of global tracing state when building the compute graph, which, judging by the error, is not thread-safe? Not sure if that's the case — would love some insight on this.
I'm fairly certain `loop.run_in_executor()` should work fine, as this pattern works normally with plain Hugging Face models, and it's how vLLM handles async requests in their `AsyncLLMEngine`. I'm wondering if other people have noticed this error and whether there are any ways to circumvent it.
@lithafnium Hey, I'd love to know more about this if you could get me a small reproducible example. I don't think nnsight is anywhere close to thread-safe, although maybe there are some features in nnsight you could disable to get it working. On another note, 0.4 is going to have vLLM support, so potentially you can just use that?