What is the Paged Attention algorithm and how does it benefit vLLM?
Question
Answers (1)
The Paged Attention algorithm is a memory-management mechanism inspired by paged virtual memory in operating systems. It divides each request's key-value (KV) cache into fixed-size blocks and uses a per-request block table to map logical block indices to physical blocks in GPU memory. Because blocks no longer need to be contiguous, memory fragmentation is greatly reduced, and blocks holding a shared prefix can be reused across requests. Together these properties let vLLM pack more concurrent requests into the same memory, significantly improving serving throughput for large language models.
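To make the idea concrete, here is a minimal, hypothetical sketch of block-table bookkeeping in Python. The names (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are illustrative and are not vLLM's actual API; the sketch only shows the logical-to-physical mapping and reference-counted block sharing described above.

```python
# Illustrative sketch of PagedAttention-style KV-cache bookkeeping.
# All names here are hypothetical, not vLLM's real implementation.

BLOCK_SIZE = 4  # tokens per KV-cache block (fixed size, like an OS page)


class BlockAllocator:
    """Hands out fixed-size physical blocks with reference counting,
    so requests with an identical prefix can share the same blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # Another sequence starts referencing an existing block.
        self.ref_count[block] += 1

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)


class Sequence:
    """Maps a request's logical token positions to physical blocks
    through a per-request block table."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full,
        # so memory grows on demand instead of being preallocated.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        # Translate a logical token position into (physical block, offset).
        return (self.block_table[token_idx // BLOCK_SIZE],
                token_idx % BLOCK_SIZE)


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=8)
    seq = Sequence(alloc)
    for _ in range(6):          # 6 tokens with BLOCK_SIZE=4 -> 2 blocks
        seq.append_token()
    print(len(seq.block_table))           # 2
    print(seq.physical_slot(5))           # second block, offset 1

    # Prefix sharing: a second sequence reuses the first full block.
    seq2 = Sequence(alloc)
    seq2.block_table.append(seq.block_table[0])
    alloc.share(seq.block_table[0])
    print(alloc.ref_count[seq.block_table[0]])  # 2
```

The key point of the sketch is that the attention kernel never needs a contiguous KV region: it looks up each token's block through the table, which is what enables on-demand growth and cross-request sharing.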