Only 25t/s at 4080s with 14B model 5.5bits #739
Comments
Have you tried without ngram, and also maybe installing flash attention? Based on your shared command, it looks like you have tabby installed; you can probably activate its venv and try running the same command from there, since it should have flash attention installed. Flash attention isn't required, but without it generation runs slower and keeps slowing down considerably as the context gets longer.
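A quick way to confirm whether the environment you are launching from actually has flash attention is to try importing it. This is a minimal check, assuming exllamav2 picks up the `flash_attn` package when it is present:

```python
# Sanity check: is flash-attn importable in the *current* environment?
try:
    import flash_attn
    print("flash-attn available:", flash_attn.__version__)
except ImportError:
    print("flash-attn NOT installed; exllamav2 falls back to a slower attention path")
```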
16 GB VRAM, right? I reckon you're offloading to CPU / system memory. I tested the same model / similar prompt on my RTX 3090.

This is on a headless Linux server with nothing else loaded onto the GPU:

```
$ nvidia-smi | grep MiB
|  0%   55C    P2   388W / 390W |  16264MiB / 24576MiB |     96%      Default |
```

On Windows you'd have even less VRAM available.
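For a rough sense of why 16 GB is tight here, a back-of-envelope estimate. All architecture figures below are assumptions about Qwen2.5-14B (roughly 14.8B parameters, 48 layers, 8 KV heads, head dim 128, FP16 KV cache), not measured values:

```python
# Weights: parameter count times bits-per-weight, converted to GiB.
params = 14.8e9        # approximate Qwen2.5-14B parameter count (assumption)
bpw = 5.0              # exl2 bits per weight for this quant
weights_gib = params * bpw / 8 / 1024**3
print(f"weights: {weights_gib:.1f} GiB")             # ~8.6 GiB

# KV cache: 2 tensors (K and V) per layer, FP16, at full context length.
layers, kv_heads, head_dim, bytes_fp16 = 48, 8, 128, 2
ctx = 32768
kv_gib = 2 * layers * kv_heads * head_dim * ctx * bytes_fp16 / 1024**3
print(f"KV cache @ {ctx} tokens: {kv_gib:.1f} GiB")  # ~6.0 GiB
```

Weights plus a full-length FP16 cache already approach the 16 GB limit before activations and desktop compositor overhead, which is consistent with the offloading theory above.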
OS
Windows
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
2.6.0+cu124
Model
https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-exl2
Describe the bug
I am running the Qwen2.5 14B 5.0bpw exl2 quantized model on my personal PC with an RTX 4080S (16 GB), but the output speed is only 25 tokens/s, which seems much lower than expected.
I installed exllamav2 via pip, and I'm not sure whether I missed enabling some setting or am missing some package that could be causing this. Does anyone know what the possible reasons might be?
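One quick diagnostic, as a sketch assuming a CUDA build of PyTorch, is to check how much VRAM is actually free before loading the model, since on Windows other processes typically hold a chunk of it:

```python
import torch

# Report the CUDA device and free/total VRAM before loading anything.
assert torch.cuda.is_available(), "CUDA build of PyTorch not found"
print(torch.cuda.get_device_name(0))
free_b, total_b = torch.cuda.mem_get_info(0)
print(f"free: {free_b / 1024**3:.1f} GiB / total: {total_b / 1024**3:.1f} GiB")
```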
Reproduction steps

```
pip install exllamav2
git clone https://github.com/turboderp/exllamav2
cd exllamav2
python examples/chat.py -m "xxxx\Qwen2.5" -mode raw -pt -ncf -ngram
```
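To measure raw generation speed outside chat.py, here is a minimal sketch using the exllamav2 dynamic generator. The model path, prompt, and token count are placeholders, and `paged=False` is an assumption meant to avoid the flash-attn requirement of paged mode, at some speed cost:

```python
import time
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = r"xxxx\Qwen2.5"      # placeholder path from the report above
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)      # load fully onto the GPU, splitting if needed
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model=model, cache=cache, tokenizer=tokenizer, paged=False
)

new_tokens = 256
t0 = time.time()
generator.generate(prompt="Write a short story about a lighthouse.",
                   max_new_tokens=new_tokens)
dt = time.time() - t0
# Rough figure: assumes generation ran to max_new_tokens (no early EOS).
print(f"~{new_tokens / dt:.1f} tokens/s")
```

If this number is also far below 60 t/s, watching `nvidia-smi` during the run should show whether VRAM is exhausted and the driver is spilling to system memory.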
Expected behavior
The output speed should reach 60t/s or higher.
Logs
No response
Additional context
No response