Only 25t/s at 4080s with 14B model 5.5bits #739

Open
3 tasks done
CharlinChen opened this issue Feb 22, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@CharlinChen

OS

Windows

GPU Library

CUDA 12.x

Python version

3.12

Pytorch version

2.6.0+cu124

Model

https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-exl2

Describe the bug

I am running the Qwen2.5 14B 5.0bit exl2 quantized model on my personal PC with a 4080S 16GB GPU, but the output speed is only 25 tokens/s, which seems much lower than expected.
I installed exllamav2 via pip install, and I'm not sure if I missed enabling some settings or if I'm missing some packages that could be causing this issue. Does anyone know what the possible reasons might be?

(exl2) PS D:\xxxx\LLM\exllamav2> python examples/chat.py -m "D:\xxxx\LLM\tabbyAPI\models\Qwen2.5" -mode raw -pt -ncf -ngram
 -- Model: D:\xxxx\LLM\tabbyAPI\models\Qwen2.5
 -- Options: []
 -- Loading tokenizer...
 -- Loading model...
 -- Loading model...
 -- Prompt format: raw
 -- System prompt:

Reproduction steps

pip install exllamav2
git clone https://github.com/turboderp/exllamav2
cd exllamav2
python examples/chat.py -m "xxxx\Qwen2.5" -mode raw -pt -ncf -ngram

Expected behavior

The output speed should reach 60t/s or higher.

Logs

No response

Additional context

No response

Acknowledgements

  • I have looked for similar issues before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will ask my questions politely.
@CharlinChen CharlinChen added the bug Something isn't working label Feb 22, 2025
@Anthonyg5005

Have you tried running without ngram, and maybe installing flash attention? Based on the command you shared, it looks like you have tabby installed, so you can probably activate its venv and run the same command there, since that venv should already have flash attention installed. Flash attention isn't required, but without it generation runs slower and keeps slowing down by a lot as the context gets longer.
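A quick way to see which of the two venvs actually has flash attention is to check whether the package can be imported. This is only a minimal sketch: it assumes the packages are importable as `flash_attn` and `exllamav2` (their usual import names), and the helper `has_module` is a hypothetical name used for illustration. It does not tell you whether exllamav2 ends up using flash attention at runtime.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Assumed import names: "flash_attn" for flash attention, "exllamav2" for the library.
    for module_name in ("exllamav2", "flash_attn"):
        status = "found" if has_module(module_name) else "NOT found"
        print(f"{module_name}: {status}")
```

Run it once in the venv used for `examples/chat.py` and once in the tabby venv, then compare the two results.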

@gapeleon

16GB VRAM, right? I reckon you're offloading to CPU / system memory.
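As a rough sanity check on that guess, the size of the quantized weights alone can be estimated from the parameter count and the bits per weight. This is a back-of-the-envelope sketch: the 14.7 billion parameter figure is an assumption, not something stated in this thread, and the estimate ignores the KV cache, activations, any draft model, and layers stored at a different precision, all of which add to the real footprint.

```python
def weights_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GiB (weights only)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # ASSUMPTION: roughly 14.7e9 parameters for a "14B" model.
    # 5.0 is the bits-per-weight value mentioned in the issue description.
    print(f"{weights_gib(14.7e9, 5.0):.2f} GiB for weights alone")
```

Whatever remains between this figure and the card's 16GB has to hold the cache and everything else, which is why the total can end up much closer to the limit than the weights suggest.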

I tested the same model / similar prompt on my RTX 3090:

python examples/chat.py -m "/models3/exl2/Qwen2.5-14B-Instruct-exl2" -mode raw -pt -ncf -ngram
This is a conversation between a helpful AI assistant named Chatbort and a user.

User: Give me a list of 100 movies I should check out


Chatbort: Sure, here's a diverse list of 100 movies that you might enjoy from various genres and eras. This list includes some classics, recent releases, and hidden gems:

1. The Godfather (1972)

...
70. Deadpool (2016)
71. The Intouchables (20
 !! Response exceeded 1000 tokens and was cut short.

(Context: 39 tokens, response: 1000 tokens, 76.81 tokens/second, SD eff. 29.20%, SD acc. 31.13%)

User: Haven't watched a movie for 10 years, what should I start with?

Chatbort: If you haven't watched a movie in a decade, you might want to start with something classic or widely enjoyed that's easy to find and accessible. Here are a few suggestions that are generally well-received and might serve as a good reintroduction to cinema:

1. **The Shawshank Redemption (1994)** - A deeply moving film about hope and friendship in the face of injustice.
2. **The Dark Knight (2008)** - A thrilling crime drama and a favorite among superhero films.
3. **Pulp Fiction (1994)** - A unique and influential story filled with memorable characters and dialogue.
4. **Toy Story (1995)** - An animated classic that’s beloved by people of all ages.
5. **Forrest Gump (1994)** - A heartwarming film that spans several decades and covers a range of historical events.
6. **Star Wars: Episode IV - A New Hope (1977)** - The beginning of a beloved franchise that's easy to get into.
7. **Jurassic Park (1993)** - A thrilling adventure with groundbreaking special effects.
8. **E.T. the Extra-Terrestrial (1982)** - A timeless and heartwarming family film about friendship.
9. **The Lion King (1994)** - A classic animated film about coming of age and family.
10. **Back to the Future (1985)** - A fun and nostalgic adventure that combines time travel with humor.

These films offer a mix of genres and themes, so you can choose based on your personal preference. Enjoy your return to the world of movies!
(Context: 1064 tokens, response: 348 tokens, 63.94 tokens/second, SD eff. 21.84%, SD acc. 29.01%)

This is on a headless Linux server with nothing else loaded onto the GPU:

nvidia-smi |grep MiB

0% 55C P2 388W / 390W | 16264MiB / 24576MiB | 96% Default |

On Windows you'd have even less VRAM available.
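To put a number on that, the `nvidia-smi` line above can be parsed and compared against a 16GB card. This is a small sketch with hypothetical helper names (`parse_memory_mib`, `headroom_mib`). It assumes the 16GB card exposes a nominal 16 * 1024 MiB, whereas the real usable figure is typically a little lower, and that the same load would need about the same amount of memory on the other GPU.

```python
import re

# Matches the "<used>MiB / <total>MiB" pair in an nvidia-smi line.
MIB_PATTERN = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def parse_memory_mib(line: str) -> tuple[int, int]:
    """Extract (used_mib, total_mib) from one line of nvidia-smi output."""
    match = MIB_PATTERN.search(line)
    if match is None:
        raise ValueError("no '<used>MiB / <total>MiB' pair found")
    return int(match.group(1)), int(match.group(2))


def headroom_mib(used_mib: int, card_total_mib: int) -> int:
    """Free memory left if the same load ran on a card of the given size."""
    return card_total_mib - used_mib


if __name__ == "__main__":
    line = "0% 55C P2 388W / 390W | 16264MiB / 24576MiB | 96% Default |"
    used, total = parse_memory_mib(line)
    # ASSUMPTION: nominal 16 GiB card, nothing else resident on the GPU.
    print(f"used {used} MiB of {total} MiB")
    print(f"headroom on a 16 GiB card: {headroom_mib(used, 16 * 1024)} MiB")
```

With the figures from the 3090 run, only 120 MiB would be left on a nominal 16 GiB card before the desktop or any other process takes its share.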
