
whisper : implement batched decoding #1048

Closed
ggerganov opened this issue Jun 25, 2023 · 3 comments
Labels: decoding (Decoding related issues), performance (CPU and memory usage - results and comparisons)
@ggerganov (Owner)

When using beam search, we currently run the decoders sequentially:

whisper.cpp/whisper.cpp, lines 4416 to 4444 in f1c9df5:

// obtain logits for the next token
for (int j = 0; j < n_decoders_cur; ++j) {
    auto & decoder = state->decoders[j];

    if (decoder.failed || decoder.completed) {
        continue;
    }

    decoder.tokens_tmp.resize(1);
    decoder.tokens_tmp[0] = decoder.sequence.tokens.back().id;

    //WHISPER_PRINT_DEBUG("%s: decoder %d: token %d, kv_self.n %d, seek_delta %d\n", __func__, j, decoder.tokens_tmp[0], decoder.kv_self.n, decoder.seek_delta);

    if (!whisper_decode_internal(*ctx, *state, decoder, decoder.tokens_tmp.data(), decoder.tokens_tmp.size(), decoder.kv_self.n, params.n_threads)) {
        fprintf(stderr, "%s: failed to decode\n", __func__);
        return -8;
    }

    {
        const int64_t t_start_sample_us = ggml_time_us();

        whisper_process_logits(*ctx, *state, params, decoder, t_cur);

        ++decoder.kv_self.n;

        state->t_sample_us += ggml_time_us() - t_start_sample_us;
    }
}

Running the decoders one at a time like this is multiple times slower than a single batched evaluation. This inefficiency is the main factor preventing efficient use of beam search in whisper.cpp, which often results in poor transcription quality.
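
To make the idea concrete, here is a rough sketch of how the loop above could collapse into a single batched evaluation. It reuses the variables from the snippet above; whisper_decode_batch_internal() is hypothetical (no such function exists in the current API) and stands in for a decode that builds one graph with one row per active decoder:

// gather the last sampled token of every decoder that is still active
std::vector<whisper_token> batch_tokens;
std::vector<int>           batch_idx; // indices into state->decoders

for (int j = 0; j < n_decoders_cur; ++j) {
    auto & decoder = state->decoders[j];
    if (decoder.failed || decoder.completed) {
        continue;
    }
    batch_tokens.push_back(decoder.sequence.tokens.back().id);
    batch_idx.push_back(j);
}

// one evaluation computes the logits for all rows at once;
// each row shares the encoder output but attends to its own kv_self cache
// (whisper_decode_batch_internal is hypothetical, not part of the current API)
if (!whisper_decode_batch_internal(*ctx, *state, batch_tokens.data(), batch_idx.data(), (int) batch_tokens.size(), params.n_threads)) {
    fprintf(stderr, "%s: failed to decode batch\n", __func__);
    return -8;
}

// logits processing stays per-decoder, as before
for (size_t i = 0; i < batch_idx.size(); ++i) {
    auto & decoder = state->decoders[batch_idx[i]];
    whisper_process_logits(*ctx, *state, params, decoder, t_cur);
    ++decoder.kv_self.n;
}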

Batched inference has been demonstrated in llama.cpp:

https://github.com/ggerganov/llama.cpp/blob/bd34cdde38f8fd661890ddd5f57ca30bf279877b/examples/baby-llama/baby-llama.cpp#L768-L777

This can serve as a starting point for doing the same in whisper.cpp and achieving an efficient beam search implementation.
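
The core reason batching pays off can be shown with ggml directly: a single ggml_mul_mat over an input with n_batch columns reads the weight matrix from memory once for all rows, whereas n_batch separate single-row evaluations read it n_batch times. A toy sketch (the function and variable names here are illustrative, not whisper.cpp internals):

#include "ggml.h"

// logits = W^T * x for a whole batch in one operation
struct ggml_tensor * batched_logits(
        struct ggml_context * ctx0,
        struct ggml_tensor  * W,   // [n_embd, n_vocab] projection weights
        int n_embd,
        int n_batch) {
    // one column per decoder in the batch
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, n_batch);

    // a single mul_mat evaluates every column: the result is [n_vocab, n_batch],
    // i.e. one row of logits per decoder, while W is read from memory only once
    return ggml_mul_mat(ctx0, W, x);
}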

@ggerganov added the performance (CPU and memory usage - results and comparisons) and decoding (Decoding related issues) labels on Jun 25, 2023
@ggerganov moved this to Todo in ggml : roadmap on Jun 25, 2023
@fire commented Jul 28, 2023

What is a good way to check that this is working? I'd like a way to test it while implementing this.

@bobqianic (Collaborator)

I think we need some documentation on how to use ggml, since its API is quite hard to understand. That way, more people could get started quickly, just like with PyTorch. @ggerganov

@ggerganov (Owner, Author)

I agree. Actually, simple example programs would be even better, as they are easier to maintain long-term.

I just need to find the time...
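
As an illustration of the kind of simple example being discussed, a minimal self-contained ggml program might look like the following (written against the ggml API as of mid-2023; the graph-compute interface changed in later versions):

#include "ggml.h"

#include <stdio.h>

int main(void) {
    // allocate a small context that holds both the tensor data and the graph
    struct ggml_init_params params = { 16*1024*1024, NULL, false };
    struct ggml_context * ctx = ggml_init(params);

    // define the computation f = a*b symbolically
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * f = ggml_mul(ctx, a, b);

    // build the forward graph, set the inputs, then evaluate
    struct ggml_cgraph gf = ggml_build_forward(f);
    gf.n_threads = 1;

    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);

    ggml_graph_compute(ctx, &gf);

    printf("f = %f\n", ggml_get_f32_1d(f, 0)); // prints: f = 12.000000

    ggml_free(ctx);
    return 0;
}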
