I think we need some documentation on how to use ggml, as ggml's API is quite hard to understand. This way, more people can get started quickly, just like with PyTorch. @ggerganov
When using beam search, we currently run the decoders sequentially:
whisper.cpp/whisper.cpp, lines 4416 to 4444 at f1c9df5
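The referenced lines are not reproduced on this page, but the shape of the current approach is a plain loop that runs one full forward pass per active decoder. Below is a self-contained toy sketch of that pattern; the Decoder struct, decoder_forward and all values are made up for illustration and are not the actual whisper.cpp symbols:

```cpp
#include <cstdio>
#include <vector>

// Toy stand-in for one text-decoder forward pass over the model weights.
// In whisper.cpp each such call is a full ggml graph evaluation.
struct Decoder {
    int  last_token = 0;
    bool completed  = false;
};

static float decoder_forward(const std::vector<float> & weights, int token) {
    // pretend computation that touches every weight once per call
    float acc = 0.0f;
    for (float w : weights) {
        acc += w * token;
    }
    return acc;
}

int main() {
    const int n_decoders = 5;                    // beam-search candidates
    std::vector<float> weights(1 << 20, 0.001f); // stand-in for model weights
    std::vector<Decoder> decoders(n_decoders);
    for (int j = 0; j < n_decoders; ++j) {
        decoders[j].last_token = j + 1;
    }

    // Sequential evaluation: the weights are read once per decoder, so the
    // cost grows linearly with the number of candidates even though each
    // call does essentially the same kind of work.
    for (int j = 0; j < n_decoders; ++j) {
        if (decoders[j].completed) {
            continue;
        }
        const float logit = decoder_forward(weights, decoders[j].last_token);
        std::printf("decoder %d -> %.3f\n", j, logit);
    }
    return 0;
}
```

With n_decoders candidates, the model weights are streamed from memory n_decoders times per generated token, even though the per-candidate work is nearly identical.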
This is multiple times slower compared to a batched evaluation. This inefficiency is the major factor preventing efficient use of beam search in whisper.cpp and thus often results in bad transcription quality.

Batched inference has been demonstrated in llama.cpp:

https://github.com/ggerganov/llama.cpp/blob/bd34cdde38f8fd661890ddd5f57ca30bf279877b/examples/baby-llama/baby-llama.cpp#L768-L777
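For contrast, here is a toy sketch of the batched shape this issue asks for: the last token of every candidate goes into one batch, and the weights are swept once per step. The batched_forward function and the data are hypothetical stand-ins, not the llama.cpp or ggml API:

```cpp
#include <cstdio>
#include <vector>

// Toy stand-in for one *batched* forward pass: the weights are swept once and
// produce one output per batch entry, instead of one full pass per candidate.
static std::vector<float> batched_forward(const std::vector<float> & weights,
                                          const std::vector<int>   & tokens) {
    std::vector<float> logits(tokens.size(), 0.0f);
    for (float w : weights) {                 // single sweep over the weights
        for (size_t b = 0; b < tokens.size(); ++b) {
            logits[b] += w * tokens[b];       // accumulate for every candidate
        }
    }
    return logits;
}

int main() {
    const int n_decoders = 5;
    std::vector<float> weights(1 << 20, 0.001f);

    // Gather the last token of every active beam candidate into one batch.
    std::vector<int> batch_tokens;
    for (int j = 0; j < n_decoders; ++j) {
        batch_tokens.push_back(j + 1);
    }

    // One evaluation covers all candidates at once.
    const std::vector<float> logits = batched_forward(weights, batch_tokens);
    for (int j = 0; j < n_decoders; ++j) {
        std::printf("decoder %d -> %.3f\n", j, logits[j]);
    }
    return 0;
}
```

In ggml terms this presumably corresponds to building a single decoder graph whose input tensor carries a batch dimension of size n_decoders; the per-candidate KV caches then need a layout that one graph can index, which is likely the non-trivial part of porting this to whisper.cpp.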
This can be a starting point for doing the same in whisper.cpp and achieving an efficient beam search implementation.
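To make concrete how the batched output would feed the rest of beam search, here is a small self-contained sketch of one beam step over per-candidate logits produced by a single evaluation. Candidate, beam_step and the numbers are illustrative only and do not mirror whisper.cpp's decoder state:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Toy beam-search step on top of a batched evaluation: logits[b] holds the
// vocabulary scores for candidate b, all produced by one forward pass.
struct Candidate {
    std::vector<int> tokens;
    double           score = 0.0; // accumulated log-probability
};

static std::vector<double> log_softmax(const std::vector<float> & row) {
    const float mx = *std::max_element(row.begin(), row.end());
    double sum = 0.0;
    for (float v : row) sum += std::exp(double(v) - mx);
    std::vector<double> out(row.size());
    for (size_t i = 0; i < row.size(); ++i) {
        out[i] = double(row[i]) - mx - std::log(sum);
    }
    return out;
}

// Expand every candidate by every token, then keep the best n_beams overall.
static std::vector<Candidate> beam_step(const std::vector<Candidate> & beams,
                                        const std::vector<std::vector<float>> & logits,
                                        size_t n_beams) {
    std::vector<Candidate> expanded;
    for (size_t b = 0; b < beams.size(); ++b) {
        const std::vector<double> lp = log_softmax(logits[b]);
        for (size_t tok = 0; tok < lp.size(); ++tok) {
            Candidate c = beams[b];
            c.tokens.push_back(int(tok));
            c.score += lp[tok];
            expanded.push_back(std::move(c));
        }
    }
    std::sort(expanded.begin(), expanded.end(),
              [](const Candidate & a, const Candidate & b) { return a.score > b.score; });
    if (expanded.size() > n_beams) expanded.resize(n_beams);
    return expanded;
}

int main() {
    const size_t n_beams = 2;
    std::vector<Candidate> beams(n_beams);

    // Pretend output of one batched decoder evaluation: 2 candidates x 4 tokens.
    std::vector<std::vector<float>> logits = {
        {0.1f, 2.0f, 0.3f, 0.0f},
        {1.5f, 0.2f, 0.1f, 0.4f},
    };

    beams = beam_step(beams, logits, n_beams);
    for (const Candidate & c : beams) {
        std::printf("score %.3f, last token %d\n", c.score, c.tokens.back());
    }
    return 0;
}
```

The point is only that the expand-and-prune logic stays the same; what changes is that the logits for all candidates come out of one graph evaluation instead of n_beams separate ones.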