async/parallel speculative execution #6853
Replies: 4 comments 9 replies
-
Interesting experiment. I always thought that the memory bandwidth available on the M chips is shared between the CPU and the GPU, so I figured there would be contention if we try to compute in parallel. Can you describe in more detail the reconciliation process between the draft and target sequences? How does it differ from standard speculative sampling?

You can run tests with:

```sh
make -j speculative && time ./speculative \
  -m ./models/llama-70b-v3-instruct/ggml-model-q8_0.gguf \
  -md ./models/llama-8b-v3-instruct/ggml-model-q5_k.gguf \
  -f in.txt -e -ngl 99 -n 4096 -c 4096 -s 20 -np 1 --draft 12 --color -s 1 --temp 0.0 -n 1024
```

You can vary the `--draft` parameter. The main issue with speculative approaches on Mac currently is the inefficient quantized batched matrix multiplications, as we discussed in #6777.
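For readers unfamiliar with the baseline being compared against, here is a minimal sketch of the standard greedy speculative-decoding acceptance loop that the question about "reconciliation" refers to. It is illustrative pseudocode only; `draft_model` / `target_model` and their methods are hypothetical helpers, not the llama.cpp API.

```python
# Minimal sketch of standard greedy speculative decoding (illustrative
# pseudocode; `draft_model` / `target_model` and their methods are
# hypothetical helpers, not the llama.cpp API).

def speculative_step(prefix, draft_model, target_model, n_draft=12):
    # 1. The draft model proposes n_draft tokens autoregressively (cheap).
    draft = []
    ctx = list(prefix)
    for _ in range(n_draft):
        t = draft_model.greedy_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model scores prefix + draft in a single batched pass.
    #    target_next[i] is the target's greedy choice after prefix + draft[:i],
    #    so it has n_draft + 1 entries.
    target_next = target_model.greedy_batch(prefix, draft)

    # 3. Reconciliation: accept drafted tokens up to the first disagreement,
    #    then substitute the target's own token at that position.
    accepted = []
    for i, t in enumerate(draft):
        if t == target_next[i]:
            accepted.append(t)
        else:
            accepted.append(target_next[i])
            break
    else:
        # Every drafted token matched; keep the target's "bonus" token too.
        accepted.append(target_next[n_draft])
    return accepted
```

The key property is that the target model performs only one batched forward pass per step, regardless of how many drafted tokens end up being accepted.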
-
This is awesome, thank you!
-
Ok, cool, let me try and see if I can make it work with some good tree-based speculation + the current llama-70B. I don't have a second powerful machine to distribute the main model though, only one M2 Ultra and an M2 laptop. Once I figure that out, I will update here.
-
@ggerganov Sorry, got a little busy, so I didn't follow up on this before. After some experimentation with RPC + async speculation, I think there's an interesting use-case for the hardware situation described below.

More details in the readme here: https://github.com/okuvshynov/llama_duo. But briefly, consider the following hardware: a 30-core CPU machine with 100+ GB RAM and an A10 GPU with 24 GB VRAM. I guess a similar hardware profile is reasonably common in consumer setups (e.g. a 16-core/32-thread AMD Ryzen + 64 GB system RAM + a 3090 with 24 GB VRAM), but I don't have access to such a machine and just tested it on a rented server-like instance. I think with async speculation there are more options to mix and match these devices, including over RPC, e.g. maybe the draft model should still run on the CPU. The current speculation version does not make any changes to the llama.cpp code itself (though if we want to schedule smarter, we'll probably need to).
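To make the "async" idea more concrete, here is a rough sketch of how a draft model on one device can keep speculating ahead while the target model verifies on another device. The helper names (`generate`, `verify`) are hypothetical and the scheduling is deliberately simplistic; this is not the llama_duo or llama.cpp RPC API.

```python
# Rough sketch of async speculation across two devices. `generate` and
# `verify` are hypothetical helpers, not the llama_duo / llama.cpp RPC API.
import queue
import threading

def draft_worker(draft_model, proposals, lock, shared):
    # Runs on the "cheap" device: keep speculating ahead of the last
    # verified position and push proposals to the queue.
    while not shared["done"]:
        with lock:
            base = list(shared["verified"])          # tokens confirmed so far
        chunk = draft_model.generate(base, n=8)      # speculative continuation
        proposals.put((len(base), chunk))            # tag with the base length

def target_worker(target_model, proposals, lock, shared, max_tokens):
    # Runs on the "strong" device: verify proposals and commit accepted tokens.
    while len(shared["verified"]) < max_tokens:
        base_len, chunk = proposals.get()
        with lock:
            if base_len != len(shared["verified"]):
                continue                             # stale proposal, drop it
        # verify() is assumed to return the accepted prefix of `chunk` plus
        # the target's own correction token at the first mismatch.
        accepted = target_model.verify(shared["verified"], chunk)
        with lock:
            shared["verified"].extend(accepted)
    shared["done"] = True

def run(prompt_tokens, draft_model, target_model, max_tokens=256):
    shared = {"verified": list(prompt_tokens), "done": False}
    proposals, lock = queue.Queue(maxsize=2), threading.Lock()
    threading.Thread(target=draft_worker,
                     args=(draft_model, proposals, lock, shared),
                     daemon=True).start()
    target_worker(target_model, proposals, lock, shared, max_tokens)
    return shared["verified"]
```

Stale proposals (drafted against an outdated verified prefix) are simply dropped here; a smarter scheduler could instead reuse the still-matching prefix of the draft.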
-
Was there an attempt to run the draft model in parallel with the main model on a different compute device of the same machine?
Here's a small illustration I made: c8d446d
Experiment setup/observations:
- Observe ~83s to process the prompt and produce the output.
- Observe ~64s to process the same prompt and produce the same output. Not dramatic, but fairly noticeable.
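For reference, going from ~83s to ~64s is roughly an 83/64 ≈ 1.3x end-to-end speedup, i.e. about a 23% reduction in wall-clock time for the same prompt and output.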