Bug: Weird output from llama-speculative #8499
You can run the same command, replacing the binary with `llama-cli`:

```
./build/bin/llama-cli -m ./llama-7b/ggml-model-f16.gguf -md ./llama-1.1b/ggml-model-f16.gguf -p "Making cake is like" -e -ngl 100 -ngld 100 -t 4 --temp 1.0 -n 128 -c 4096 -s 20 --top-k 0 --top-p 1 --repeat-last-n 0 --repeat-penalty 1.0 --draft 5
```
@ggerganov, thank you for the quick response! When I measured the speed of each model using `llama-bench`, I believe we can approximate the cost coefficient from the reported tokens per second.

However, when measuring the tokens per second of each model during the generation phase using `llama-cli`, the numbers come out different.

🤔 We can see that the two measurements disagree, and I do not have the capability or time to analyze the cause inside llama.cpp in depth myself.
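As a rough illustration of that approximation, here is a minimal sketch with made-up tokens-per-second numbers (it assumes the usual definition of the cost coefficient as draft-step time divided by target-step time; the values below are placeholders, not our measurements):

```bash
# Placeholder speeds; substitute the tokens/s reported by llama-bench or llama-cli.
draft_tps=250.0    # hypothetical draft model generation speed (tokens/s)
target_tps=35.0    # hypothetical target model generation speed (tokens/s)

# Time per token is 1 / (tokens per second), so
#   c = t_draft / t_target = target_tps / draft_tps
echo "scale=4; $target_tps / $draft_tps" | bc   # ~0.14 for these numbers
```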
I apologize for asking such a complex question. However, I find the llama.cpp system truly amazing, and seeing it utilized in papers like OSD made me want to examine its robustness. 👍👍
When using the following command:

```
./build/bin/llama-cli \
    -m ./llama-68m/ggml-model-f16.gguf \
    -e -ngl 100 -t 4 -n 512 -c 2048 \
    -p "What can we do with llama llm model?" > result.txt
```

Unfortunately, it still shows similar results.
Not sure - with my RTX 2060, these are the results from:

```
GGML_CUDA=1 make -j && ./llama-bench -m models/llama-68m/ggml-model-f16.gguf -p 0 -n 128,256,512
```

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
...
build: 1666f92 (3404)
```

and from:

```
GGML_CUDA=1 make -j && ./llama-cli -m models/llama-68m/ggml-model-f16.gguf -e -ngl 100 -t 4 -n 512 -c 2048 -p "What can we do with llama llm model?"
```
This is so weird... I could not reproduce your results on my system (2 TB RAM, 16-core CPU, more than enough of everything).

Let me check whether this issue is related to the Ampere GPU architecture (since your GPU arch is Turing). After finishing my work (🤣), I will install and run the same commands on my personal desktop (RTX 3060) to check. I'm curious whether other users have had similar experiences. Thank you for keeping an eye on this!
Hm, yes that is weird - not sure why that is. Let us know the results with the RTX 3060 when you get the chance.
Thank you for your interest and patience!
The results from my personal desktop (RTX 3060) are as follows.
🤔 Based on these results, I plan to proceed as follows.
The tokens-per-second results still differ across the setups above. Therefore, I intend to conclude this issue by reporting the results of the above plan. If you want to see the results and the follow-up analysis, I will keep this issue open. If you believe the comparison so far is already sufficient, please feel free to close this issue. Thank you!
We conducted experiments on an RTX 4090 (Ada architecture, compute capability 8.9) and on an A100 server after upgrading the Docker image to the latest one.

With the upgraded image, the results were as follows.
We were able to obtain similar results from the RTX 4090. However, the results measured on the current A100 server (which has better computing resources such as CPU, RAM, etc.) were different. While writing this comment, I was also able to test on an H100 server (Hopper architecture, compute capability 9.0).
The experimental results from the H100 are as follows.
🤔 The best next approach is to check the current llama.cpp setup in more detail.
Huh.. Could you run the following on the H100:

```
make clean
LLAMA_DISABLE_LOGS=1 GGML_CUDA=1 make -j
./llama-cli -m ./llama-68m/ggml-model-f16.gguf -e -ngl 100 -t 4 -n 512 -c 2048 -p "What can we do with llama llm model?"
```

and also:

```
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m ./llama-68m/ggml-model-f16.gguf -e -ngl 100 -t 4 -n 512 -c 2048 -p "What can we do with llama llm model?"
```
Including `LLAMA_DISABLE_LOGS=1` in the build, the results of running the first command are as follows.
The results of running the second command are as follows. Here, we forced the run onto a single GPU with `CUDA_VISIBLE_DEVICES=0`.
😲 In both cases, over 2000 tokens per second were recorded. What a surprise! Additionally, we ran the same process to see whether we could obtain similar results on the A100 server.
Now, the remaining question is why we were able to obtain consistent results from the local desktop even without disabling the logs, while the GPU servers needed `LLAMA_DISABLE_LOGS=1`.
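For anyone who wants to repeat the with/without-logging comparison, here is a minimal sketch of the procedure (not exactly what we ran; it assumes the 68M model path from the earlier commands and that the timing summary containing the "eval time" line is printed to stderr by your llama.cpp build):

```bash
# Default build: logging enabled
make clean && GGML_CUDA=1 make -j
./llama-cli -m ./llama-68m/ggml-model-f16.gguf -e -ngl 100 -t 4 -n 512 -c 2048 \
    -p "What can we do with llama llm model?" 2> with_logs.txt

# Rebuild with the logs compiled out and run the same command
make clean && LLAMA_DISABLE_LOGS=1 GGML_CUDA=1 make -j
./llama-cli -m ./llama-68m/ggml-model-f16.gguf -e -ngl 100 -t 4 -n 512 -c 2048 \
    -p "What can we do with llama llm model?" 2> without_logs.txt

# Compare the generation-phase speed reported in the timing summary of each run
grep "eval time" with_logs.txt without_logs.txt
```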
In conclusion, it was your thoughtful help that enabled us to solve the problem. Thank you so much! 👍👍👍
Maybe the difference is because your local desktop has more CPU resources available for logging than the GPU servers? I think this information is worth mentioning in the docs or in this issue as it seems to have a significant impact. Also, thanks for sharing your benchmarks! 😄 🙇♂️
Thank you @mscheong01 for checking this issue! As you suggested, I have reported this issue in #6398. 👍 |
The logging does incur some overhead as it is synchronous, and some of the stuff that we log (e.g. batch contents) involves some heavy ops like detokenization. For very small models such as the 68M one used in the tests earlier, this can have a noticeable impact, though it was still surprising to see such a big difference in your tests. For bigger models (i.e. above 1B) I expect that the logging overhead will have a much smaller impact - maybe close to insignificant.

In any case, all of this will be resolved when we implement asynchronous logging and add compile-time verbosity levels (#8566).
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
Hello, llama.cpp experts! Thank you for creating such an amazing LLM inference system. 😁

However, while using this system, I encountered unusual results when checking the speculative decoding output. I believe the observed behavior is a bug, so I am reporting it here on this GitHub project.
First of all, I want to provide the configuration of my system.
Next, I will explain the steps I took to download and run the model until the bug occurred.
It was somewhat challenging to use the llama.cpp system.
And the printed result is as follows:
Here, unlike #3649, I got an `inf` eval time for the target model.

I am currently comparing the generation-phase latency of the draft model and the target model in speculative decoding. So far, I have used `llama-bench` and `llama-cli` to measure tokens per second for each model, and the results have been different (e.g. the latency ratio measured with `llama-bench` was significantly larger than that measured with `llama-cli`). Therefore, I attempted additional measurements with `llama-speculative`, but I obtained an unusual value of `inf`. I would like to request confirmation on whether this measurement result is a bug or expected behavior of llama.cpp. 🙏

Name and Version
What operating system are you seeing the problem on?
Linux
Relevant log output
No response