Conversation
I'm sure this is still in progress, but I tried the branch out and got some weird behavior I thought worth mentioning. Loading any number of layers onto the GPU causes the completions to be mostly nonsense. I tried running stablecode with --ngl 24, 12, and even 1. My prompt was main.py.
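For reference, the runs described above would look roughly like this. This is a sketch only: the binary name, model filename, and prompt-file flag are illustrative, and --ngl is assumed to be this project's flag for the number of layers offloaded to the GPU, as used above.

```shell
# Hypothetical repro commands (binary and model names are assumptions):
./main -m stablecode.bin --ngl 24 -f main.py   # layers on GPU: output was nonsense
./main -m stablecode.bin --ngl 12 -f main.py   # same behavior
./main -m stablecode.bin --ngl 1  -f main.py   # even one offloaded layer reproduced it
./main -m stablecode.bin -f main.py            # no offloading: behaved normally
```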
With no layers on the GPU the output was normal; with any layers on the GPU it was generally nonsense.
Yeah, this is currently WIP. The CLBlast version of the code works fine, but the Nvidia/CUDA implementation produces crazy output, so I need to work out what's going on before I merge it. Alternatively, I could merge just the CLBlast build and disable pure CUDA offloading support for now (basically leave the current mainline CUDA implementation turned on without any change).
I was able to get CUDA working again by bringing the ggml submodule up to date with the current upstream main branch. I've also made some changes to the Docker launch script that allow you to use GPU offloading. I'm getting sub-10s responses for non-trivial prompts on my NVIDIA 4070 using stablecode.
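The two fixes mentioned above, updating the submodule and launching the container with GPU access, can be sketched as below. The submodule path `ggml` and the image name are assumptions; the `--gpus all` flag requires the NVIDIA Container Toolkit to be installed on the host.

```shell
# Bring the ggml submodule up to date with its upstream main branch
# (assumes the submodule lives at ./ggml and its remote is "origin"):
git -C ggml fetch origin
git -C ggml checkout origin/main
git add ggml
git commit -m "Update ggml submodule to upstream main"

# Launch the container with GPU access so layer offloading works
# (image name and offload flag are illustrative):
docker run --gpus all my-llm-image --ngl 24
```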
Implement support for offloading inference to a GPU