vLLM

Creator

Seonglae Cho

Created

2023 Jun 22 15:37

Editor

Seonglae Cho

Edited

2025 Mar 14 2:15

Refs

vllm-project • Updated 2025 Mar 15 12:26

Efficient management of attention key and value memory with PagedAttention
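To make the idea of paged key/value memory concrete, here is a minimal toy sketch in plain Python. It is not vLLM's implementation: `ToyBlockAllocator`, `ToySequence`, and `BLOCK_SIZE = 4` are hypothetical names and values chosen for illustration. It only shows the bookkeeping idea, where each sequence keeps a block table that maps logical token positions to fixed-size physical blocks allocated on demand.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block; illustrative value, not vLLM's default


@dataclass
class ToyBlockAllocator:
    """Hypothetical allocator: hands out fixed-size physical blocks from a free list."""
    num_blocks: int
    free: list = field(init=False)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class ToySequence:
    """Hypothetical per-request state: a logical-to-physical block table."""
    allocator: ToyBlockAllocator
    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_idx: int) -> tuple:
        # Translate a logical token position to (physical block id, offset in block).
        if not 0 <= token_idx < self.num_tokens:
            raise IndexError("token index out of range")
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def free_all(self) -> None:
        # Return every block to the allocator once the request finishes.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


if __name__ == "__main__":
    allocator = ToyBlockAllocator(num_blocks=8)
    seq = ToySequence(allocator)
    for _ in range(9):
        seq.append_token()
    print(len(seq.block_table), seq.physical_slot(5))
```

In this sketch, memory is reserved one block at a time as the sequence grows, so at most one partially filled block per sequence is wasted, and the physical blocks need not be contiguous.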

Log probs

[Bug]: Cannot request more than 5 logprobs

Updated 2025 Feb 28 1:24

vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction

TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on the Llama 8B model, and 1.8x higher throughput and 2x lower TPOT on the Llama 70B model.

https://blog.vllm.ai/2024/09/05/perf-update.html
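As a rough illustration of the TPOT metric mentioned in the TL;DR, the sketch below uses one common definition: decode time (total latency minus time to first token) divided by the number of output tokens after the first. This definition and the helper name `tpot_seconds` are assumptions; the blog post may measure TPOT differently, and the numbers are made up for the example, not benchmark results.

```python
def tpot_seconds(total_latency_s: float, ttft_s: float, num_output_tokens: int) -> float:
    """Time per output token under an assumed definition:
    (total latency - time to first token) / (output tokens - 1)."""
    if num_output_tokens < 2:
        raise ValueError("need at least two output tokens to measure TPOT")
    return (total_latency_s - ttft_s) / (num_output_tokens - 1)


if __name__ == "__main__":
    # Illustrative numbers only: same request, before and after an optimization.
    before = tpot_seconds(total_latency_s=2.1, ttft_s=0.1, num_output_tokens=101)
    after = tpot_seconds(total_latency_s=0.5, ttft_s=0.1, num_output_tokens=101)
    print(f"TPOT before: {before * 1000:.1f} ms, after: {after * 1000:.1f} ms")
    print(f"TPOT speedup: {before / after:.1f}x")
```

Note that a "5x faster TPOT" in this sense means the per-token decode time shrinks to one fifth; it says nothing by itself about time to first token.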

Supported Models — vLLM

https://vllm.readthedocs.io/en/latest/models/supported_models.html

Backlinks

FlashInfer TGI Fine Tuning
