-
Hey @olad32 you should be able to do this. I believe you'd need to change the db connection pool limit for a single instance to do this well (we currently set it to 100, which could be the max for some systems). We have a db table for the LLM config that we were planning to use for this. A problem I was trying to figure out was:
-
If you have time today, I'd love to do a quick call and talk through this: https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat
-
@olad32 just pushed a fix to let you control the db connection pool size and timeouts for better scalability: https://docs.litellm.ai/docs/proxy/configs#configure-db-pool-limits--connection-timeouts. It should be out in the next release.

Would love to do a quick call and talk through the config-file reload issue: https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat. Let me know if any time this or next week works!
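If my reading of that docs page is right, the new options live under `general_settings` in `config.yaml`; a minimal sketch (the values here are illustrative, check the docs for the actual defaults):

```yaml
general_settings:
  # max concurrent connections this instance opens to the database
  database_connection_pool_limit: 10
  # seconds to wait when acquiring a connection before failing
  database_connection_timeout: 60
```

Lowering the per-instance pool limit matters when you scale out, since total connections = instances × pool size, and that sum has to stay under your database's connection cap.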
-
Thanks for the configurable pool options. For now, another option exists via the Kubernetes rolling-update strategy, which recreates each pod (i.e. each LiteLLM proxy instance) one by one with the new static config.yaml. It generates a bit of noise on the cluster (every pod is recreated), so it's not ideal, but it's manageable. One thing is mandatory for this option to work: the LiteLLM proxy must handle graceful shutdown, i.e. handle the SIGTERM signal sent by Kubernetes and wait for in-flight requests to finish before actually shutting down. This is especially useful for long streaming responses. In fact, graceful shutdown is always a good thing to handle.
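To make the rolling-update option concrete, here is a generic Kubernetes sketch (names, image tag, and values are placeholders, not LiteLLM-specific guidance):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: litellm-proxy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # keep full capacity while pods are replaced
      maxSurge: 1        # roll pods one at a time
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      # time Kubernetes waits after SIGTERM before sending SIGKILL; graceful
      # shutdown only helps if this covers your longest streaming response
      terminationGracePeriodSeconds: 120
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest  # placeholder tag
```

After updating the ConfigMap holding config.yaml, `kubectl rollout restart deployment/litellm-proxy` replaces pods one by one with no capacity loss, provided the proxy exits cleanly on SIGTERM.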
-
Hi @olad32, just wanted to follow up.
-
Hi, I am trying to run multiple instances of LiteLLM, but I noticed the following issue when testing a budget and a per-key rate limit.
I tried the following setup: for example, I set the tps limit to 1 on the key and run 3 instances in our k8s cluster.
However, when I call the endpoint 8 times within the same minute, I receive 3 successful responses. I can see that keys with the proper limits are in my Redis (Valkey) store; however, my tests show that the instances are not using those limits. Is there anything else to configure so that multiple instances enforce the same rate limits?
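In case it helps anyone hitting the same thing: my understanding from the docs is that the instances only share usage counters when both the cache and the router point at the same Redis in `config.yaml`, roughly like this (key names are from my reading of the docs; verify them against your version):

```yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis  # usage/limit counters are shared through this cache

router_settings:
  # every instance must point at the same Redis
  redis_host: os.environ/REDIS_HOST
  redis_port: os.environ/REDIS_PORT
  redis_password: os.environ/REDIS_PASSWORD
```

Even then, enforcement may be eventually consistent (counters sync between instances asynchronously), so a brief overshoot right at the limit boundary can still occur.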
-
Is there a way to configure LiteLLM to apply the correct rpm/tpm limits when multiple instances are running? When I configure 3 instances synchronized through Redis and set a key's rpm limit to 1, I can perform 3 completions in a minute (1 per instance) instead of 1 completion in a minute. Thanks.
-
Hi, is it possible to run multiple instances in parallel to scale horizontally, even with the database features in use?
Is there anything to know to be able to roll out a new LiteLLM config (e.g. an updated model config) on multiple instances without downtime?
Thanks