Why does init_kvcache need vattention.reserve_physical_pages(GPU_MEM_RESERVE)? #13

Open
dingzhiqiang opened this issue Aug 2, 2024 · 1 comment

Comments

@dingzhiqiang

This looks similar to how vLLM reserves physical GPU memory at startup. According to the paper, though, no physical memory should be allocated at startup; the CUDA APIs should be invoked to allocate memory only while processing inference requests.

@ramyaprabhu-alt
Collaborator

The reserve_physical_pages API currently only allocates physical memory pages from the driver during initialization. It does not map those pages into the virtual tensors of the KV cache; mapping happens while processing inference requests, as you mentioned. You are also right that this reservation is not fundamentally required. We do it so that vAttention uses the same amount of memory for the KV cache as vLLM, which lets us make a fair performance comparison between the two systems (otherwise the batch sizes of vLLM and vAttention could differ). Caching physical pages also helps with latency: if we allocated physical pages on demand, it would double the latency of growing the KV cache.
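To make the distinction between allocating and mapping concrete, here is a minimal toy sketch in Python. It is not vAttention's implementation and makes no CUDA calls: `PagePool`, `grow`, and the step counter are hypothetical names invented for illustration, and it assumes that one driver allocation and one mapping each cost a single "step" on the critical path.

```python
class PagePool:
    """Toy model of a physical page pool (hypothetical, no real GPU calls)."""

    def __init__(self):
        self.free_pages = []    # allocated from the "driver" but not yet mapped
        self.mapped = {}        # request id -> list of mapped page ids
        self.driver_allocs = 0  # number of simulated driver allocations
        self._next_page = 0

    def _alloc_from_driver(self):
        # Stand-in for a physical allocation; assumed to cost one step.
        page = self._next_page
        self._next_page += 1
        self.driver_allocs += 1
        return page

    def reserve_physical_pages(self, num_pages):
        # Allocate pages up front, off the critical path. Nothing is mapped here.
        for _ in range(num_pages):
            self.free_pages.append(self._alloc_from_driver())

    def grow(self, req_id, num_pages):
        """Map pages into a request's KV cache; return critical-path steps."""
        steps = 0
        pages = self.mapped.setdefault(req_id, [])
        for _ in range(num_pages):
            if self.free_pages:
                page = self.free_pages.pop()      # reuse a cached page
            else:
                page = self._alloc_from_driver()  # on-demand allocation
                steps += 1
            pages.append(page)
            steps += 1                            # the mapping itself
        return steps


# With a reservation, growing the cache only pays for mapping.
reserved = PagePool()
reserved.reserve_physical_pages(4)
steps_reserved = reserved.grow("req-0", 4)

# Without one, every page pays for allocation and mapping.
on_demand = PagePool()
steps_on_demand = on_demand.grow("req-0", 4)

print(steps_reserved, steps_on_demand)
```

Under the equal-cost assumption, the on-demand path takes twice as many steps to grow the cache by the same number of pages. Both paths end up performing the same number of driver allocations; the reservation only moves them to initialization.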
