Batched Inference to Improve GPU Utilisation #493
Comments
I feel like this is very important. If they don't implement batch inferencing, I can't really consider it over llama.cpp's GBNF grammars.
+1. @drachs, does GBNF in ggml support batched inference with different grammar constraints per generation in the batch? Is that even possible? I would love some guidance, if you please.
I'm not very strong on the theory, but llama.cpp does support continuous batch inference with a grammar file. It has had grammar support and continuous batching support for a while, but my understanding is that the two didn't work together until this PR, so there may be some clues in there: ggml-org/llama.cpp#3624. You can try it out yourself; here are some instructions from my notes on how to use this with Docker. Note that the version in the public Docker images doesn't work; I assume they were published prior to the fix in October.
git clone https://github.com/ggerganov/llama.cpp.git
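To build on that, here is a rough Python client sketch for exercising the server's continuous batching with per-request grammars. It assumes a llama.cpp server built from that checkout is already running locally with continuous batching enabled (e.g. started with `--parallel` and `--cont-batching`); the port, endpoint, and response fields follow the llama.cpp server example and are assumptions rather than part of the original notes.

```python
# Sketch: fire several grammar-constrained requests at a locally running
# llama.cpp server and let its continuous batching interleave them on the GPU.
# Assumes the server was started with something like:
#   ./server -m zephyr-7b.gguf --parallel 4 --cont-batching
import concurrent.futures
import requests

SERVER = "http://localhost:8080/completion"  # default llama.cpp server port (assumption)

# A trivial GBNF grammar that only allows "yes" or "no"; each request could
# carry a different grammar string.
YES_NO_GRAMMAR = 'root ::= "yes" | "no"'

def generate(prompt: str, grammar: str) -> str:
    payload = {"prompt": prompt, "n_predict": 8, "grammar": grammar}
    response = requests.post(SERVER, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["content"]

prompts = [f"Is {n} an even number? Answer yes or no: " for n in range(8)]

# Concurrent client-side requests; the server batches them across slots.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda p: generate(p, YES_NO_GRAMMAR), prompts))

for prompt, result in zip(prompts, results):
    print(prompt, "->", result.strip())
```

Since each request carries its own GBNF string, different generations in the batch can, in principle, be constrained differently, which was the question above.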
FWIW I get like 95+% utilization when running inference on Mac Metal (specifically using Mistral-7b, used via repeated ollama REST API queries). Speculating, I feel like this has something to do with memory bandwidth on the 3090 setup. Not sure though.
Thanks, that's interesting to know. I think it's unlikely to be a memory bandwidth issue: the 3090 is 90-100% utilised when using the same model via Hugging Face transformers, with much better throughput. To speculate myself, I expect the issue is that much of the processing done in this library is CPU bound, so when running in a loop the GPU waits while the CPU-bound work is performed, and then the CPU waits while GPU inference runs. This is why it would be great to see an implementation of batch inference: while the CPU is processing the output of the first item, the GPU can begin running inference on the next. That way they aren't waiting for each other to finish and can work at the same time.
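To make that overlap concrete, here is a toy two-stage pipeline, not guidance code: gpu_generate and cpu_postprocess are hypothetical stand-ins for the GPU-bound and CPU-bound halves of the loop, and a background thread lets the next generation start while the previous output is still being processed.

```python
# Toy illustration of overlapping CPU post-processing with GPU inference.
import queue
import threading
import time

def gpu_generate(prompt: str) -> str:
    # Placeholder for the GPU-bound step (e.g. model.generate on a prompt).
    time.sleep(0.5)  # simulate inference latency
    return f"raw output for: {prompt}"

def cpu_postprocess(raw_output: str) -> str:
    # Placeholder for the CPU-bound step (parsing, constraint handling, etc.).
    time.sleep(0.5)  # simulate CPU work
    return raw_output.upper()

def run_pipelined(prompts):
    q: queue.Queue = queue.Queue(maxsize=2)
    _SENTINEL = object()

    def producer():
        for prompt in prompts:
            q.put(gpu_generate(prompt))   # GPU starts on the next prompt...
        q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (item := q.get()) is not _SENTINEL:
        results.append(cpu_postprocess(item))  # ...while the CPU handles this one
    return results

if __name__ == "__main__":
    start = time.time()
    print(run_pipelined([f"prompt {i}" for i in range(4)]))
    print(f"{time.time() - start:.1f}s")  # roughly 2.5s instead of ~4s sequentially
```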
👍 Batch inference would greatly unlock synthetic data. edit: in the meantime,
Any idea on how to perform batch inference? This matters especially in the context of applying guidance to many inputs in parallel.
Is your feature request related to a problem? Please describe.
When using this library in a loop, I am getting poor GPU utilisation running zephyr-7b.
Describe the solution you'd like
It would be fantastic to be able to pass a list of prompts to a function of the Transformers class and define a batch size, as you can for a Hugging Face pipeline. This significantly improves speed and GPU utilisation.
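For comparison, this is roughly the existing Hugging Face pipeline behaviour being referred to; the model name and batch size below are illustrative, and batching a text-generation pipeline typically requires a pad token on the tokenizer.

```python
# Sketch of batched generation with the existing Hugging Face pipeline API.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",  # illustrative model choice
    device=0,          # run on the first GPU
    batch_size=8,      # batch prompts together to keep the GPU busy
)

# Batching usually requires a pad token; reuse EOS if the tokenizer has none.
if generator.tokenizer.pad_token_id is None:
    generator.tokenizer.pad_token_id = generator.tokenizer.eos_token_id

prompts = [f"Write a one-sentence summary of topic {i}." for i in range(32)]

# Passing a list lets the pipeline batch the prompts internally instead of
# looping over them one at a time.
outputs = generator(prompts, max_new_tokens=64)
for out in outputs:
    print(out[0]["generated_text"])
```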
Additional context

GPU utilisation for reference: