I believe this is similar to how vLLM reserves actual resources at startup. According to the paper, no actual resources should be allocated at startup; the CUDA interface should only be invoked to allocate resources when processing inference requests.
The reserve_physical_memory API currently only allocates physical memory pages from the driver during initialization; it does not map them into the virtual tensors of the KV cache. Mapping happens while processing inference requests, as you mentioned. You are also right that pre-allocation is not fundamentally required: we do it simply so that vAttention uses the same amount of memory for the KV cache as vLLM, which enables a fair performance comparison between the two systems (otherwise the batch sizes of vLLM and vAttention could differ). Caching physical pages also helps optimize latency: if we allocated physical pages on demand, it would double the latency of growing the KV cache.
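To make the distinction concrete, here is a minimal Python sketch (not vAttention's actual code; the class and method names are hypothetical) modeling the two-phase scheme described above: physical page handles are acquired from the driver into a pool at startup, but a page is only mapped into a request's virtual KV-cache range when the sequence actually grows.

```python
class PhysicalPagePool:
    """Models reserving N physical page handles up front
    (analogous to allocating pages from the driver at init)."""
    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # handles cached at startup

    def acquire(self):
        if not self.free:
            raise MemoryError("physical page pool exhausted")
        return self.free.pop()

    def release(self, handle):
        self.free.append(handle)


class KVCache:
    """Models a virtual KV-cache region; pages are mapped into it
    only on demand, while serving inference requests."""
    def __init__(self, pool):
        self.pool = pool
        self.mapped = []  # handles mapped into this cache's virtual range

    def grow(self):
        # Mapping a cached page is cheap; allocating *and* mapping a page
        # on demand would roughly double the latency of growing the cache.
        self.mapped.append(self.pool.acquire())

    def free_all(self):
        for h in self.mapped:
            self.pool.release(h)
        self.mapped = []
```

The point of the pool is that `grow()` never calls into the (slow) allocation path at request time, mirroring the latency argument above: pre-allocated pages only need to be mapped, not allocated and mapped.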