Texonom / Engineering / Data Engineering / Artificial Intelligence / AI Development / AI Inference Tool / vLLM

vLLM

Creator: Seonglae Cho
Created: 2023 Jun 22 15:37
Editor: Seonglae Cho
Edited: 2025 Mar 14 2:15
Refs
vllm (vllm-project) • Updated 2025 Mar 15 12:26
AWQ
NCCL
Efficient management of attention key and value memory with PagedAttention

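vLLM serves batched requests over a PagedAttention-managed KV cache through a simple offline API. A minimal sketch (the model name and sampling values are illustrative, not from this note; quantization="awq" assumes AWQ-quantized weights, per the AWQ ref above):

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The future of AI is"]
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# quantization="awq" loads AWQ-quantized weights; the model name is
# illustrative, not taken from this note.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

# Requests are continuously batched over the PagedAttention KV cache
for out in llm.generate(prompts, params):
    print(out.prompt, "->", out.outputs[0].text)
```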
Log probs

[Bug]: Cannot request more than 5 logprobs • Updated 2025 Feb 28 1:24
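The linked issue concerns the cap on how many top logprobs can be requested; the OpenAI-compatible server defaults to 5 (raiseable via the engine's max-logprobs setting in recent versions, an assumption worth checking against your vLLM version). A minimal sketch of requesting per-token log probabilities through the offline API; the model name is illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # illustrative model
params = SamplingParams(max_tokens=16, logprobs=5)  # top-5 logprobs per token

out = llm.generate(["The capital of France is"], params)[0]
for step in out.outputs[0].logprobs:
    # each step maps token id -> Logprob (logprob value, rank, decoded token)
    print(step)
```
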
vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction
TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on the Llama 8B model, and 1.8x higher throughput and 2x lower TPOT on the Llama 70B model.
https://blog.vllm.ai/2024/09/05/perf-update.html
Supported Models — vLLM
https://vllm.readthedocs.io/en/latest/models/supported_models.html

Backlinks

FlashInfer · TGI · Fine Tuning

Copyright Seonglae Cho