I believe this is similar to how vLLM reserves actual resources at startup. According to the paper, no actual resources should be allocated at startup; the CUDA interface should only be invoked to allocate resources when processing inference requests.
The reserve_physical_memory API currently only allocates physical memory pages from the driver during initialization; it does not map them into the virtual tensors of the KV cache. Mapping happens while processing inference requests, as you mentioned. You are also right that pre-allocation is not fundamentally required: we do it simply so that vAttention uses the same amount of memory for the KV cache as vLLM, which enables a fair performance comparison between the two systems (otherwise the batch sizes of vLLM and vAttention could differ). Caching physical pages also helps optimize latency: if we allocated physical pages on demand, it would double the latency of growing the KV cache.
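To make the distinction concrete, here is a minimal Python sketch (not vAttention's actual code; the class and method names are hypothetical) modeling the two-phase scheme described above: physical page handles are acquired from the driver into a pool at startup, but a page is only mapped into a request's virtual KV-cache range when the sequence actually grows.

```python
class PhysicalPagePool:
    """Models reserving N physical page handles up front
    (analogous to allocating pages from the driver at init)."""
    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # handles cached at startup

    def acquire(self):
        if not self.free:
            raise MemoryError("physical page pool exhausted")
        return self.free.pop()

    def release(self, handle):
        self.free.append(handle)


class KVCache:
    """Models a virtual KV-cache region; pages are mapped into it
    only on demand, while serving inference requests."""
    def __init__(self, pool):
        self.pool = pool
        self.mapped = []  # handles mapped into this cache's virtual range

    def grow(self):
        # Mapping a cached page is cheap; allocating *and* mapping a page
        # on demand would roughly double the latency of growing the cache.
        self.mapped.append(self.pool.acquire())

    def free_all(self):
        for h in self.mapped:
            self.pool.release(h)
        self.mapped = []
```

The point of the pool is that `grow()` never calls into the (slow) allocation path at request time, mirroring the latency argument above: pre-allocated pages only need to be mapped, not allocated and mapped.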