async/parallel speculative execution #6853
Replies: 4 comments 9 replies
-
Interesting experiment. I always thought that the memory bandwidth available on the M chips is shared between the CPU and the GPU, so I figured there would be contention if we try to compute in parallel. Can you describe in more detail the reconciliation process between the draft and target sequences? How does it differ from standard speculative sampling?

You can run tests with:

```sh
make -j speculative && time ./speculative \
  -m ./models/llama-70b-v3-instruct/ggml-model-q8_0.gguf \
  -md ./models/llama-8b-v3-instruct/ggml-model-q5_k.gguf \
  -f in.txt -e -ngl 99 -n 4096 -c 4096 -s 20 -np 1 --draft 12 --color -s 1 --temp 0.0 -n 1024
```

You can vary the `--draft` parameter. The main issue with speculative approaches on Mac currently is the inefficient quantized batched matrix multiplications, as we discussed in #6777.
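For readers unfamiliar with the baseline being compared against, here is a minimal sketch of the standard greedy speculative-decoding acceptance loop that the question about "reconciliation" refers to. It is illustrative pseudocode only; `draft_model` / `target_model` and their methods are hypothetical helpers, not the llama.cpp API.

```python
# Minimal sketch of standard greedy speculative decoding (illustrative
# pseudocode; `draft_model` / `target_model` and their methods are
# hypothetical helpers, not the llama.cpp API).

def speculative_step(prefix, draft_model, target_model, n_draft=12):
    # 1. The draft model proposes n_draft tokens autoregressively (cheap).
    draft = []
    ctx = list(prefix)
    for _ in range(n_draft):
        t = draft_model.greedy_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model scores prefix + draft in a single batched pass.
    #    target_next[i] is the target's greedy choice after prefix + draft[:i],
    #    so it has n_draft + 1 entries.
    target_next = target_model.greedy_batch(prefix, draft)

    # 3. Reconciliation: accept drafted tokens up to the first disagreement,
    #    then substitute the target's own token at that position.
    accepted = []
    for i, t in enumerate(draft):
        if t == target_next[i]:
            accepted.append(t)
        else:
            accepted.append(target_next[i])
            break
    else:
        # Every drafted token matched; keep the target's "bonus" token too.
        accepted.append(target_next[n_draft])
    return accepted
```

The key property is that the target model performs only one batched forward pass per step, regardless of how many drafted tokens end up being accepted.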
-
This is awesome, thank you!
-
Ok, cool, let me try and see if I can make it work with some good tree-based speculation + the current llama-70B. I don't have a second powerful machine to distribute the main model though, only one M2 Ultra and an M2 laptop. Once I figure that out, I will update here.
-
@ggerganov Sorry, got a little busy, so I didn't follow up on this before. After some experimentation with RPC + async speculation, I think there's an interesting use-case for the hardware situation described below.

More details in the readme here: https://github.com/okuvshynov/llama_duo. But briefly, consider the following hardware: a 30-core CPU machine with 100+ GB RAM and an A10 GPU with 24 GB VRAM. I guess a similar hardware profile is reasonably common in consumer setups (e.g. a 16-core/32-thread AMD Ryzen + 64 GB system RAM + a 3090 with 24 GB VRAM), but I don't have access to such a machine and just tested it on a rented server-like instance. I think with async speculation there are more options to mix and match these devices, including over RPC, e.g. maybe the draft model should still run on the CPU. The current speculation version does not make any changes to the llama.cpp code itself (though if we want to schedule smarter, we'll probably need to).
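To make the "async" idea more concrete, here is a rough sketch of how a draft model on one device can keep speculating ahead while the target model verifies on another device. The helper names (`generate`, `verify`) are hypothetical and the scheduling is deliberately simplistic; this is not the llama_duo or llama.cpp RPC API.

```python
# Rough sketch of async speculation across two devices. `generate` and
# `verify` are hypothetical helpers, not the llama_duo / llama.cpp RPC API.
import queue
import threading

def draft_worker(draft_model, proposals, lock, shared):
    # Runs on the "cheap" device: keep speculating ahead of the last
    # verified position and push proposals to the queue.
    while not shared["done"]:
        with lock:
            base = list(shared["verified"])          # tokens confirmed so far
        chunk = draft_model.generate(base, n=8)      # speculative continuation
        proposals.put((len(base), chunk))            # tag with the base length

def target_worker(target_model, proposals, lock, shared, max_tokens):
    # Runs on the "strong" device: verify proposals and commit accepted tokens.
    while len(shared["verified"]) < max_tokens:
        base_len, chunk = proposals.get()
        with lock:
            if base_len != len(shared["verified"]):
                continue                             # stale proposal, drop it
        # verify() is assumed to return the accepted prefix of `chunk` plus
        # the target's own correction token at the first mismatch.
        accepted = target_model.verify(shared["verified"], chunk)
        with lock:
            shared["verified"].extend(accepted)
    shared["done"] = True

def run(prompt_tokens, draft_model, target_model, max_tokens=256):
    shared = {"verified": list(prompt_tokens), "done": False}
    proposals, lock = queue.Queue(maxsize=2), threading.Lock()
    threading.Thread(target=draft_worker,
                     args=(draft_model, proposals, lock, shared),
                     daemon=True).start()
    target_worker(target_model, proposals, lock, shared, max_tokens)
    return shared["verified"]
```

Stale proposals (drafted against an outdated verified prefix) are simply dropped here; a smarter scheduler could instead reuse the still-matching prefix of the draft.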
-
Was there an attempt to run the draft model in parallel with the main model on a different compute device of the same machine?
Here's a small illustration I made: c8d446d
Experiment setup/observations:
- Observe ~83s to process the prompt and produce the output.
- Observe ~64s to process the same prompt and produce the same output. Not dramatic, but fairly noticeable.
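For reference, going from ~83s to ~64s is roughly an 83/64 ≈ 1.3x end-to-end speedup, i.e. about a 23% reduction in wall-clock time for the same prompt and output.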