
Batched Inference to Improve GPU Utilisation #493

Open
lachlancahill opened this issue Dec 2, 2023 · 7 comments

Comments

@lachlancahill

Is your feature request related to a problem? Please describe.
When using this library in a loop, I am getting poor GPU utilisation running zephyr-7b.

Describe the solution you'd like
It would be fantastic to be able to pass a list of prompts to a function of the Transformers class, and define a batch size like you can for a huggingface pipeline. This significantly improves speed and GPU utilisation.
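For reference, this is roughly the batched interface I have in mind, sketched with a Hugging Face pipeline (the model name and batch size are only illustrative):

from transformers import pipeline

# Illustrative only: batched generation with a Hugging Face pipeline.
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",  # example model; any causal LM works
    device=0,
)
# Causal LMs usually need a pad token before prompts can be batched.
pipe.tokenizer.pad_token_id = pipe.tokenizer.eos_token_id

prompts = [f"Summarise item {i}:" for i in range(32)]

# batch_size makes the pipeline run several prompts per forward pass,
# which keeps the GPU busy instead of handling prompts one at a time.
outputs = pipe(prompts, batch_size=8, max_new_tokens=64)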

Additional context
GPU utilisation for reference:
[screenshot of GPU utilisation]

@drachs

drachs commented Dec 10, 2023

I feel like this is very important. If they don't implement batch inferencing, I can't really consider it over llama.cpp's GBNF grammars.

@darrenangle

+1

@drachs does GBNF in ggml support batched inference with different grammar constraints per generation in the batch? Is that even possible? Would love some guidance, if you please.

@drachs

drachs commented Dec 14, 2023

I'm not very strong on the theory, but llama.cpp does support continuous batch inference with a grammar file. It has had grammar support and continuous batching support for a while, but my understanding is the two didn't start working together until this PR; maybe there are some clues in there: ggml-org/llama.cpp#3624

You can try it out yourself; here are some instructions from my notes on how to use this with Docker. Note that the version in the public Docker images doesn't work; I assume they were published prior to the fix in October.

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
docker run --gpus all -v .:/models -it --entrypoint /bin/bash local/llama.cpp:full-cuda
./parallel -m /models/ --grammar-file grammars/json.gbnf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 10 -ns 128 -n 100 -cb
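(For reference, as I understand the flags: -np is the number of parallel sequences, -ns the total number of sequences to run, -cb enables continuous batching, -ngl the number of layers offloaded to the GPU, and -c/-b the context and batch sizes; check ./parallel --help in case they have changed.)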

@Jbollenbacher

FWIW I get like 95+% utilization when running inference on Mac Metal (specifically using Mistral-7b, used via repeated ollama REST API queries).

Speculating, I feel like this has something to do with memory bandwidth on the 3090 setup. Not sure though.

@lachlancahill
Author

> FWIW I get like 95+% utilization when running inference on Mac Metal (specifically using Mistral-7b, used via repeated ollama REST API queries).
>
> Speculating, I feel like this has something to do with memory bandwidth on the 3090 setup. Not sure though.

Thanks, that's interesting to know.

I think it's unlikely to be a memory bandwidth issue. The 3090 is 90-100% utilised when using the same model via huggingface transformers (with much better throughput).

To speculate myself, I suspect the issue is that much of the processing this library does is CPU-bound: when running in a loop, the GPU sits idle while the CPU-bound work runs, and then the CPU sits idle while GPU inference runs. This is why it would be great to see batch inference implemented: while the CPU is processing the output of one item, the GPU could already be running inference on the next, so neither is waiting on the other.
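To illustrate, this is roughly the batched path that keeps the GPU fed when I call transformers directly (model name, sizes and prompts are only examples):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example only: run several prompts through one generate() call so the GPU
# works on the whole batch instead of alternating with CPU-side processing.
name = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.padding_side = "left"            # pad on the left for decoder-only models
tokenizer.pad_token = tokenizer.eos_token  # causal LMs need a pad token to batch

model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")

prompts = ["Write a haiku about GPUs.", "Explain continuous batching in one sentence."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

# The GPU processes the whole batch; the CPU is free to post-process
# earlier results rather than the two taking turns.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.batch_decode(out, skip_special_tokens=True))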

@freckletonj

freckletonj commented Jan 7, 2024

👍 Batch inference would greatly unlock synthetic data.

edit: in the meantime, outlines offers constrained gen and batch inference.
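From memory of the outlines docs around that time, batched constrained generation looked roughly like this (the exact API may have changed since):

import outlines

# Rough sketch; model name and regex are only examples.
model = outlines.models.transformers("mistralai/Mistral-7B-v0.1")
generator = outlines.generate.regex(model, r"(yes|no)")

# Passing a list of prompts runs the constrained generations as a batch.
answers = generator(["Is the sky blue?", "Is water dry?"])
print(answers)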

@CarloNicolini

Any idea how to perform batch inference? This matters especially when applying guidance to many inputs in parallel.
