Batched Inference to Improve GPU Utilisation #493
Comments
I feel like this is very important. If they don't implement batch inferencing, I can't really consider it over llama.cpp's GBNF grammars.
+1. @drachs, does GBNF in ggml support batched inference with different grammar constraints per generation in the batch? Is that even possible? I would love some guidance, if you please.
I'm not very strong on the theory, but llama.cpp does support continuous batch inference with a grammar file. It has had grammar support and continuous batching support for a while, but my understanding is that the two didn't work together until this PR, so there may be some clues in there: ggml-org/llama.cpp#3624. You can try it out yourself; here are some instructions from my notes on how to use this with Docker. Note that the version in the public Docker images doesn't work; I assume they were published prior to the fix in October.
git clone https://github.com/ggerganov/llama.cpp.git
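To build on that, here is a rough Python client sketch for exercising the server's continuous batching with per-request grammars. It assumes a llama.cpp server built from that checkout is already running locally with continuous batching enabled (e.g. started with `--parallel` and `--cont-batching`); the port, endpoint, and response fields follow the llama.cpp server example and are assumptions rather than part of the original notes.

```python
# Sketch: fire several grammar-constrained requests at a locally running
# llama.cpp server and let its continuous batching interleave them on the GPU.
# Assumes the server was started with something like:
#   ./server -m zephyr-7b.gguf --parallel 4 --cont-batching
import concurrent.futures
import requests

SERVER = "http://localhost:8080/completion"  # default llama.cpp server port (assumption)

# A trivial GBNF grammar that only allows "yes" or "no"; each request could
# carry a different grammar string.
YES_NO_GRAMMAR = 'root ::= "yes" | "no"'

def generate(prompt: str, grammar: str) -> str:
    payload = {"prompt": prompt, "n_predict": 8, "grammar": grammar}
    response = requests.post(SERVER, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["content"]

prompts = [f"Is {n} an even number? Answer yes or no: " for n in range(8)]

# Concurrent client-side requests; the server batches them across slots.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda p: generate(p, YES_NO_GRAMMAR), prompts))

for prompt, result in zip(prompts, results):
    print(prompt, "->", result.strip())
```

Since each request carries its own GBNF string, different generations in the batch can, in principle, be constrained differently, which was the question above.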
FWIW I get like 95+% utilization when running inference on Mac Metal (specifically using Mistral-7b, used via repeated ollama REST API queries). Speculating, I feel like this has something to do with memory bandwidth on the 3090 setup. Not sure though.
Thanks, that's interesting to know. I think it's unlikely to be a memory bandwidth issue: the 3090 is 90-100% utilised when using the same model via Hugging Face transformers, with much better throughput. To speculate myself, I expect the issue is that much of the processing done in this library is CPU bound, so when running in a loop the GPU waits while the CPU-bound work is performed, and then the CPU waits while GPU inference runs. This is why it would be great to see an implementation of batch inference: while the CPU is processing the output of the first item, the GPU can begin running inference on the next. That way they aren't waiting for each other to finish and can work at the same time.
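To make that overlap concrete, here is a toy two-stage pipeline, not guidance code: gpu_generate and cpu_postprocess are hypothetical stand-ins for the GPU-bound and CPU-bound halves of the loop, and a background thread lets the next generation start while the previous output is still being processed.

```python
# Toy illustration of overlapping CPU post-processing with GPU inference.
import queue
import threading
import time

def gpu_generate(prompt: str) -> str:
    # Placeholder for the GPU-bound step (e.g. model.generate on a prompt).
    time.sleep(0.5)  # simulate inference latency
    return f"raw output for: {prompt}"

def cpu_postprocess(raw_output: str) -> str:
    # Placeholder for the CPU-bound step (parsing, constraint handling, etc.).
    time.sleep(0.5)  # simulate CPU work
    return raw_output.upper()

def run_pipelined(prompts):
    q: queue.Queue = queue.Queue(maxsize=2)
    _SENTINEL = object()

    def producer():
        for prompt in prompts:
            q.put(gpu_generate(prompt))   # GPU starts on the next prompt...
        q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (item := q.get()) is not _SENTINEL:
        results.append(cpu_postprocess(item))  # ...while the CPU handles this one
    return results

if __name__ == "__main__":
    start = time.time()
    print(run_pipelined([f"prompt {i}" for i in range(4)]))
    print(f"{time.time() - start:.1f}s")  # roughly 2.5s instead of ~4s sequentially
```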
👍 Batch inference would greatly unlock synthetic data. edit: in the meantime,
Any idea on how to perform batch inference? This matters especially in the context of applying guidance to many inputs in parallel.
Is your feature request related to a problem? Please describe.
When using this library in a loop, I am getting poor GPU utilisation running zephyr-7b.
Describe the solution you'd like
It would be fantastic to be able to pass a list of prompts to a function of the Transformers class and define a batch size, as you can for a Hugging Face pipeline. This significantly improves speed and GPU utilisation.
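For comparison, this is roughly the existing Hugging Face pipeline behaviour being referred to; the model name and batch size below are illustrative, and batching a text-generation pipeline typically requires a pad token on the tokenizer.

```python
# Sketch of batched generation with the existing Hugging Face pipeline API.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",  # illustrative model choice
    device=0,          # run on the first GPU
    batch_size=8,      # batch prompts together to keep the GPU busy
)

# Batching usually requires a pad token; reuse EOS if the tokenizer has none.
if generator.tokenizer.pad_token_id is None:
    generator.tokenizer.pad_token_id = generator.tokenizer.eos_token_id

prompts = [f"Write a one-sentence summary of topic {i}." for i in range(32)]

# Passing a list lets the pipeline batch the prompts internally instead of
# looping over them one at a time.
outputs = generator(prompts, max_new_tokens=64)
for out in outputs:
    print(out[0]["generated_text"])
```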
Additional context

GPU utilisation for reference: