Only 25t/s at 4080s with 14B model 5.5bits #739
Comments
Have you tried without ngram, and also maybe installing flash attention? Based on your shared command, it looks like you have tabby installed; you can probably activate its venv and try running the same command from there, since it should have flash attention installed. Flash attention isn't required, but without it generation runs slower and keeps slowing down considerably as the context gets longer.
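A quick way to confirm whether the environment you are launching from actually has flash attention is to try importing it. This is a minimal check, assuming exllamav2 picks up the `flash_attn` package when it is present:

```python
# Sanity check: is flash-attn importable in the *current* environment?
try:
    import flash_attn
    print("flash-attn available:", flash_attn.__version__)
except ImportError:
    print("flash-attn NOT installed; exllamav2 falls back to a slower attention path")
```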
16 GB VRAM, right? I reckon you're offloading to CPU / system memory. I tested the same model / similar prompt on my RTX 3090.

This is on a headless Linux server with nothing else loaded onto the GPU:

```
$ nvidia-smi | grep MiB
|  0%   55C    P2   388W / 390W |  16264MiB / 24576MiB |     96%      Default |
```

On Windows you'd have even less VRAM available.
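For a rough sense of why 16 GB is tight here, a back-of-envelope estimate. All architecture figures below are assumptions about Qwen2.5-14B (roughly 14.8B parameters, 48 layers, 8 KV heads, head dim 128, FP16 KV cache), not measured values:

```python
# Weights: parameter count times bits-per-weight, converted to GiB.
params = 14.8e9        # approximate Qwen2.5-14B parameter count (assumption)
bpw = 5.0              # exl2 bits per weight for this quant
weights_gib = params * bpw / 8 / 1024**3
print(f"weights: {weights_gib:.1f} GiB")             # ~8.6 GiB

# KV cache: 2 tensors (K and V) per layer, FP16, at full context length.
layers, kv_heads, head_dim, bytes_fp16 = 48, 8, 128, 2
ctx = 32768
kv_gib = 2 * layers * kv_heads * head_dim * ctx * bytes_fp16 / 1024**3
print(f"KV cache @ {ctx} tokens: {kv_gib:.1f} GiB")  # ~6.0 GiB
```

Weights plus a full-length FP16 cache already approach the 16 GB limit before activations and desktop compositor overhead, which is consistent with the offloading theory above.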
OS
Windows
GPU Library
CUDA 12.x
Python version
3.12
Pytorch version
2.6.0+cu124
Model
https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-exl2
Describe the bug
I am running the Qwen2.5 14B 5.0bpw exl2 quantized model on my personal PC with an RTX 4080S (16 GB), but the output speed is only 25 tokens/s, which seems much lower than expected.
I installed exllamav2 via pip, and I'm not sure whether I missed enabling some setting or am missing some package that could be causing this. Does anyone know what the possible reasons might be?
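One quick diagnostic, as a sketch assuming a CUDA build of PyTorch, is to check how much VRAM is actually free before loading the model, since on Windows other processes typically hold a chunk of it:

```python
import torch

# Report the CUDA device and free/total VRAM before loading anything.
assert torch.cuda.is_available(), "CUDA build of PyTorch not found"
print(torch.cuda.get_device_name(0))
free_b, total_b = torch.cuda.mem_get_info(0)
print(f"free: {free_b / 1024**3:.1f} GiB / total: {total_b / 1024**3:.1f} GiB")
```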
Reproduction steps

```
pip install exllamav2
git clone https://github.com/turboderp/exllamav2
cd exllamav2
python examples/chat.py -m "xxxx\Qwen2.5" -mode raw -pt -ncf -ngram
```
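To measure raw generation speed outside chat.py, here is a minimal sketch using the exllamav2 dynamic generator. The model path, prompt, and token count are placeholders, and `paged=False` is an assumption meant to avoid the flash-attn requirement of paged mode, at some speed cost:

```python
import time
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = r"xxxx\Qwen2.5"      # placeholder path from the report above
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)      # load fully onto the GPU, splitting if needed
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model=model, cache=cache, tokenizer=tokenizer, paged=False
)

new_tokens = 256
t0 = time.time()
generator.generate(prompt="Write a short story about a lighthouse.",
                   max_new_tokens=new_tokens)
dt = time.time() - t0
# Rough figure: assumes generation ran to max_new_tokens (no early EOS).
print(f"~{new_tokens / dt:.1f} tokens/s")
```

If this number is also far below 60 t/s, watching `nvidia-smi` during the run should show whether VRAM is exhausted and the driver is spilling to system memory.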
Expected behavior
The output speed should reach 60t/s or higher.
Logs
No response
Additional context
No response