Multiple LiteLLM instances #2032

olad32 · 2024-02-17T13:59:39Z

olad32
Feb 17, 2024

Hi, is this possible to run multiple instances in parallel to be able to scale horizontaly ? Even with the database features in use ?
Is there anything to know to be able to rollout a new LiteLLM config on multiple instances without downtime ? Eg update model config
Thanks

krrishdholakia · 2024-02-17T16:06:35Z

krrishdholakia
Feb 17, 2024
Maintainer

Hey @olad32 you should be able to do this. I believe you'd need to change the db connection pool limit for a single instance, to do this well (we currently set it to 100, which could be the max for some systems).

We have a db table for the LLM config that we were planning on using for this.

A problem I was trying to figure out was:

models need keys. Should the table store the keys?

0 replies

krrishdholakia · 2024-02-17T16:07:25Z

krrishdholakia
Feb 17, 2024
Maintainer

If you have time today, would love to do a quick call and talk through this:

https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat

0 replies

krrishdholakia · 2024-02-18T03:06:36Z

krrishdholakia
Feb 18, 2024
Maintainer

@olad32 just pushed a fix to let you control db connection pool + timeouts for better scalability - https://docs.litellm.ai/docs/proxy/configs#configure-db-pool-limits--connection-timeouts

Should be out in the next release.

Would love to do a quick call and talk through the reload config file issue - https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat

Let me know if any time this / next week works!

0 replies

olad32 · 2024-02-19T16:31:02Z

olad32
Feb 19, 2024
Author

Thanks for the configurable pool options.
Regarding the config reload topic, the static config.yaml is the simplest and most performant option for sure, but it would add flexibility to have an option in litellm proxy to be able to update config at runtime. Persisting the config in a database is one solution, litellm proxy could periodically check for updated config in the database, this would not alter performance too much.
As a side note I just saw that the proxy API has a new model endpoint, if this only update the static config in memory, it will not work in multi litellm proxy instance environnement behind a Load Balancer.

For now another option exists via the rolling update kuberntes strategy which can recreate each pod (ie litellm proxy instance) one by one with the new static config.yaml, but it generates a bit of noise on the cluster (recreate each pod), not ideal but manageable. One thing is mandatory for this option to work, litellm proxy must handle gracefull shutdown, ie handle sigterm signal sent by kubernetes and wait for current requests to end before effectively shutting down, especially usefull for long streaming response. In fact gracefulll shutdown is always a good thing to handle.

4 replies

olad32 Feb 19, 2024
Author

(i can't find a schedule to make a call as we are obviously at the earth opposite ;) )

ishaan-jaff Feb 19, 2024
Maintainer

@olad32 we're on this call if you're free right now: https://meet.google.com/zav-wexx-tbe

krrishdholakia Feb 26, 2024
Maintainer

Hi @olad32 i guess the question was:

if you do /config/update and add a new model to the config, would that be stored in the db?

Curious - what're you trying to update about litellm proxy? (is it just adding new models?)

olad32 Feb 28, 2024
Author

Hi @krrishdholakia, exactly, i want to be able to update existing models parameters (like rpm per model) and add new models, and the new configuration must be applied by all litellm proxy instances that runs in parrallel. This is needed to be able to auto scale LiteLLM itself and LLMs instances upstream (add a new llm instance).

ishaan-jaff · 2024-02-20T18:04:24Z

ishaan-jaff
Feb 20, 2024
Maintainer

Hi @olad32 just wanted to follow up

If we add the ability to add new models through the Admin UI, would you be able to try it and give us feedback ?
I'd love to set up a direct 1:1 support channel for you, feel free to pick the channel that works best for you

Linkedin https://www.linkedin.com/in/reffajnaahsi/
Twitter: https://twitter.com/ishaan_jaff
Discord: https://discord.com/invite/wuPM9dRgDw
Other - please let me know if you'd prefer another channel

0 replies

vvidovic · 2025-02-19T13:36:47Z

vvidovic
Feb 19, 2025

Hi, I am trying to run multiple instances of LiteLLM but I noticed the following issues when testing a budget and the rate limit per key:

when I remove a budget or the rate limit (request/minute and token/minute), only one of the instances is updated with the removal of the restriction
when I update a budget or the rate limit (request/minute and token/minute), only one of the instances is updated with the new value
when I call the completion API and a budget or the rate limit is active, the limits are applied on the each instance independently

I tried to use the following info for this setup:

https://docs.litellm.ai/docs/proxy/caching#set-cache-for-proxy-but-not-on-the-actual-llm-api-call

For example, I set the tps limit to 1 on the key and run 3 instances in our k8s cluster.
A LiteLLM cache is configured as follows:

    litellm_settings:
      cache: True
      cache_params:
        type: redis
        supported_call_types: []
        host: valkey.litellm.svc.cluster.local
        port: 6379
        namespace: litellm.caching.caching

However, when I call the endpoint 8 times within the same minute, I receive 3 successful responses.

I can see that keys with the proper limits are in my Redis (Valkey) store, however, tests show that instances are not using those limits.

Is there anything else to configure to enable multiple instances using the same rate limits?

5 replies

krrishdholakia Feb 19, 2025
Maintainer

Hey @vvidovic is the update just taking a bit to sync across instances?

vvidovic Feb 19, 2025

It is not that it takes some time but that it doesn't happen at all (until I restart instances).
At least not after a few minutes.
I will perform a longer test and let you know about the results.

vvidovic Feb 19, 2025

@krrishdholakia - after more testing, I am not sure when instances sync rpm limits. In the initial test, after more than 10 minutes, instances had rpm limits in sync (without any completion calls in between). However, after that, I started additional tests in which instances didn't sync even after a much longer time (but I performed completion calls every 30 seconds).

Test 1 with 3 LiteLLM instances running:

rpm_limit on a key set to 2
completion calls show that each instance allows 2 rpm: 6 rpm allowed in total
rpm_limit on a key set to 1
6 completion calls performed every 30 seconds
after more than 20 minutes we see that there are still 2 instances using rpm limit 2 (5 rpm allowed in total)

Test 1 with 3 LiteLLM instances running:

instance restart performed - all 3 instances were at 1 rpm limit
6 completion calls performed every 30 seconds
completion calls show that each instance allows 1 rpm: 3 rpm allowed in total
rpm_limit on a key set to 2
completion calls show that only 1 instance allows 2 rpm: 4 rpm allowed in total
after more than 45 minutes we see that there are still 2 instances using rpm limit 1 (4 rpm allowed in total)

krrishdholakia Feb 19, 2025
Maintainer

what happens if the calls happen every ~2 mins.

I wonder if the local cache resets the 'last_refreshed_at' value every call which is causing the update to not occur

for context our default in-memory cache is 60s

vvidovic Feb 20, 2025

I increased sleep between completion invocations from 30 to 120 seconds and increased the number of calls from 6 to 12 invocations in each turn.

Test 1 with 3 LiteLLM instances running:

initial state - all 3 instances were at 2 rpm limit
12 completion calls performed every 120 seconds
completion calls show that each instance allows 2 rpm: 6 rpm allowed in total
rpm_limit on a key set to 1
after more than 15 minutes, completion calls show that only 1 instance allows 1 rpm: 5 rpm allowed in total
after 20 minutes, all instances were using 1 rpm: 3 rmp allowed in total

I will investigate how different ttl values affect this behaviour.

vvidovic · 2025-02-20T07:09:48Z

vvidovic
Feb 20, 2025

Is there a way to configure LiteLLM to apply the correct rpm/tpm limits when multiple instances are running?

When I configure 3 instances synchronized through Redis and set a key rpm limit to 1, I can perform 3 completions (1 for each instance) in a minute instead of 1 completion in a minute.

Thanks.

1 reply

vvidovic Feb 20, 2025

The first successful configuration was done using the global_max_parallel_requests parameter.

If the global_max_parallel_requests is set to 1 - all requests are rejected. However, when I set this parameter to any value above 1 (2 or 20), it enforces that all LiteLLM instances respect the rpm limit set for a test key.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple LiteLLM instances #2032

{{title}}

Replies: 7 comments 10 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Multiple LiteLLM instances #2032

olad32 Feb 17, 2024

Replies: 7 comments · 10 replies

krrishdholakia Feb 17, 2024 Maintainer

krrishdholakia Feb 17, 2024 Maintainer

krrishdholakia Feb 18, 2024 Maintainer

olad32 Feb 19, 2024 Author

olad32 Feb 19, 2024 Author

ishaan-jaff Feb 19, 2024 Maintainer

krrishdholakia Feb 26, 2024 Maintainer

olad32 Feb 28, 2024 Author

ishaan-jaff Feb 20, 2024 Maintainer

vvidovic Feb 19, 2025

krrishdholakia Feb 19, 2025 Maintainer

vvidovic Feb 19, 2025

vvidovic Feb 19, 2025

krrishdholakia Feb 19, 2025 Maintainer

vvidovic Feb 20, 2025

vvidovic Feb 20, 2025

vvidovic Feb 20, 2025

olad32
Feb 17, 2024

Replies: 7 comments 10 replies

krrishdholakia
Feb 17, 2024
Maintainer

krrishdholakia
Feb 17, 2024
Maintainer

krrishdholakia
Feb 18, 2024
Maintainer

olad32
Feb 19, 2024
Author

olad32 Feb 19, 2024
Author

ishaan-jaff Feb 19, 2024
Maintainer

krrishdholakia Feb 26, 2024
Maintainer

olad32 Feb 28, 2024
Author

ishaan-jaff
Feb 20, 2024
Maintainer

vvidovic
Feb 19, 2025

krrishdholakia Feb 19, 2025
Maintainer

krrishdholakia Feb 19, 2025
Maintainer

vvidovic
Feb 20, 2025