
whisper : implement batched decoding #1048

Closed
ggerganov opened this issue Jun 25, 2023 · 3 comments
Labels: decoding (Decoding related issues), performance (CPU and memory usage - results and comparisons)
@ggerganov (Owner)

When using beam search, we currently run the decoders sequentially:

whisper.cpp/whisper.cpp, lines 4416 to 4444 in f1c9df5:

// obtain logits for the next token
for (int j = 0; j < n_decoders_cur; ++j) {
    auto & decoder = state->decoders[j];

    if (decoder.failed || decoder.completed) {
        continue;
    }

    decoder.tokens_tmp.resize(1);
    decoder.tokens_tmp[0] = decoder.sequence.tokens.back().id;

    //WHISPER_PRINT_DEBUG("%s: decoder %d: token %d, kv_self.n %d, seek_delta %d\n", __func__, j, decoder.tokens_tmp[0], decoder.kv_self.n, decoder.seek_delta);

    if (!whisper_decode_internal(*ctx, *state, decoder, decoder.tokens_tmp.data(), decoder.tokens_tmp.size(), decoder.kv_self.n, params.n_threads)) {
        fprintf(stderr, "%s: failed to decode\n", __func__);
        return -8;
    }

    {
        const int64_t t_start_sample_us = ggml_time_us();

        whisper_process_logits(*ctx, *state, params, decoder, t_cur);

        ++decoder.kv_self.n;

        state->t_sample_us += ggml_time_us() - t_start_sample_us;
    }
}

Running the decoders one at a time like this is multiple times slower than a single batched evaluation. This inefficiency is the main factor preventing efficient use of beam search in whisper.cpp, which often results in poor transcription quality.
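
To make the idea concrete, here is a rough sketch of how the loop above could collapse into a single batched evaluation. It reuses the variables from the snippet above; whisper_decode_batch_internal() is hypothetical (no such function exists in the current API) and stands in for a decode that builds one graph with one row per active decoder:

// gather the last sampled token of every decoder that is still active
std::vector<whisper_token> batch_tokens;
std::vector<int>           batch_idx; // indices into state->decoders

for (int j = 0; j < n_decoders_cur; ++j) {
    auto & decoder = state->decoders[j];
    if (decoder.failed || decoder.completed) {
        continue;
    }
    batch_tokens.push_back(decoder.sequence.tokens.back().id);
    batch_idx.push_back(j);
}

// one evaluation computes the logits for all rows at once;
// each row shares the encoder output but attends to its own kv_self cache
// (whisper_decode_batch_internal is hypothetical, not part of the current API)
if (!whisper_decode_batch_internal(*ctx, *state, batch_tokens.data(), batch_idx.data(), (int) batch_tokens.size(), params.n_threads)) {
    fprintf(stderr, "%s: failed to decode batch\n", __func__);
    return -8;
}

// logits processing stays per-decoder, as before
for (size_t i = 0; i < batch_idx.size(); ++i) {
    auto & decoder = state->decoders[batch_idx[i]];
    whisper_process_logits(*ctx, *state, params, decoder, t_cur);
    ++decoder.kv_self.n;
}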

Batched inference has been demonstrated in llama.cpp:

https://github.com/ggerganov/llama.cpp/blob/bd34cdde38f8fd661890ddd5f57ca30bf279877b/examples/baby-llama/baby-llama.cpp#L768-L777

This can serve as a starting point for doing the same in whisper.cpp and achieving an efficient beam search implementation.
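
The core reason batching pays off can be shown with ggml directly: a single ggml_mul_mat over an input with n_batch columns reads the weight matrix from memory once for all rows, whereas n_batch separate single-row evaluations read it n_batch times. A toy sketch (the function and variable names here are illustrative, not whisper.cpp internals):

#include "ggml.h"

// logits = W^T * x for a whole batch in one operation
struct ggml_tensor * batched_logits(
        struct ggml_context * ctx0,
        struct ggml_tensor  * W,   // [n_embd, n_vocab] projection weights
        int n_embd,
        int n_batch) {
    // one column per decoder in the batch
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, n_batch);

    // a single mul_mat evaluates every column: the result is [n_vocab, n_batch],
    // i.e. one row of logits per decoder, while W is read from memory only once
    return ggml_mul_mat(ctx0, W, x);
}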

@ggerganov added the performance (CPU and memory usage - results and comparisons) and decoding (Decoding related issues) labels on Jun 25, 2023
@ggerganov moved this to Todo in ggml : roadmap on Jun 25, 2023
@fire commented Jul 28, 2023

What is a good way to check that this is working? I'd like a way to test it while implementing this.

@bobqianic (Collaborator)

I think we need some documentation on how to use ggml, since its API is quite hard to understand. That way, more people could get started quickly, just like with PyTorch. @ggerganov

@ggerganov (Owner, Author)

I agree. Actually, simple example programs would be even better, as they are easier to maintain long-term.

I just need to find the time...
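
As an illustration of the kind of simple example being discussed, a minimal self-contained ggml program might look like the following (written against the ggml API as of mid-2023; the graph-compute interface changed in later versions):

#include "ggml.h"

#include <stdio.h>

int main(void) {
    // allocate a small context that holds both the tensor data and the graph
    struct ggml_init_params params = { 16*1024*1024, NULL, false };
    struct ggml_context * ctx = ggml_init(params);

    // define the computation f = a*b symbolically
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * f = ggml_mul(ctx, a, b);

    // build the forward graph, set the inputs, then evaluate
    struct ggml_cgraph gf = ggml_build_forward(f);
    gf.n_threads = 1;

    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);

    ggml_graph_compute(ctx, &gf);

    printf("f = %f\n", ggml_get_f32_1d(f, 0)); // prints: f = 12.000000

    ggml_free(ctx);
    return 0;
}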
