vLLM

Creator: Seonglae Cho
Created: 2023 Jun 22 15:37
Edited: 2026 Mar 24 14:25
Efficient management of attention key and value memory with PagedAttention
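As I understand it, PagedAttention stores the key/value cache in fixed-size blocks that need not be contiguous, with a per-sequence block table mapping logical token positions to physical blocks. The toy sketch below illustrates only that bookkeeping idea. It is not vLLM's implementation, and every name in it (`BLOCK_SIZE`, `BlockAllocator`, `Sequence`) is hypothetical.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (illustrative value, not a vLLM default)


class BlockAllocator:
    """Hands out physical block ids from a fixed pool (hypothetical helper)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_ids: list) -> None:
        # Return a finished sequence's blocks to the pool for reuse.
        self.free.extend(block_ids)


@dataclass
class Sequence:
    """Tracks which physical blocks hold one sequence's KV entries."""

    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new block is needed only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots per sequence sit unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_index: int) -> tuple:
        # Map a logical token position to (physical block id, offset in block).
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE


allocator = BlockAllocator(num_blocks=8)
seq = Sequence()
for _ in range(6):
    seq.append_token(allocator)
print(seq.block_table, seq.locate(5))
```

Because blocks are allocated on demand rather than reserved up front for the maximum sequence length, memory that would otherwise be held idle stays available for other sequences.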
vLLM Usages
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="TheBloke/Mistral-7B-OpenOrca-AWQ", quantization="awq", dtype="half")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 
 

Log probs

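from vllm import LLM, SamplingParams

# model_name, tokenizer, and texts are assumed to be defined earlier.
# max_logprobs caps how many logprobs a request may ask for, so it is
# raised to the vocabulary size to allow a full distribution per token.
llm = LLM(
    model=model_name,
    max_logprobs=tokenizer.vocab_size,
)
outputs = llm.generate(
    texts,
    SamplingParams(temperature=0.0, logprobs=tokenizer.vocab_size),
)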
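Once logprobs are requested, each generated step carries a set of candidate tokens. The sketch below shows one way to pull out the top candidates for a step. It assumes each step is a dict mapping token id to an object with `logprob` and `decoded_token` attributes, which is my understanding of what `output.outputs[0].logprobs` contains. `FakeLogprob` and `top_k_tokens` are hypothetical names introduced only so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeLogprob:
    """Stand-in for a per-token logprob entry (assumed attribute names)."""

    logprob: float
    decoded_token: Optional[str] = None


def top_k_tokens(step_logprobs: dict, k: int = 3) -> list:
    """Return the k most likely (token_id, decoded_token, logprob) for one step."""
    ranked = sorted(
        step_logprobs.items(),
        key=lambda item: item[1].logprob,
        reverse=True,  # higher logprob means more likely
    )
    return [(tid, entry.decoded_token, entry.logprob) for tid, entry in ranked[:k]]


# Stubbed data in the assumed shape of one generation step.
step = {
    11: FakeLogprob(-0.1, " Paris"),
    22: FakeLogprob(-2.0, " Lyon"),
    33: FakeLogprob(-1.0, " the"),
}
print(top_k_tokens(step, k=2))

# With real vLLM outputs, the assumed usage would be:
# for output in outputs:
#     for step in output.outputs[0].logprobs:
#         print(top_k_tokens(step))
```

Requesting the full vocabulary per token makes each step's dict large, so trimming to the top few entries early keeps downstream processing manageable.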
 
 
 
Streaming Requests & Realtime API in vLLM
Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns
vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction
TL;DR: vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on Llama 8B model, and 1.8x higher throughput and 2x less TPOT on Llama 70B model.
Supported Models — vLLM
 
 
