vLLM Streaming Requests

Creator
Seonglae Cho
Created
2026 Mar 24 14:21
Edited
2026 Mar 24 14:24
Refs
Streams input tokens into a single request via an AsyncGenerator while output generation starts concurrently (an intra-request optimization).
Traditional flow: complete prompt submission → prefill → decode. With streaming requests, prefill begins on the tokens that have already arrived, before the entire input is received, and decode starts immediately once the input completes. This overlap is the key to reducing latency.
Works orthogonally with In-Flight Batching
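The overlap can be sketched with plain asyncio. This is a minimal illustration, not vLLM's actual API: the prompt source, the chunk sizes, and the prefill/decode stand-ins are all hypothetical; the point is only that per-chunk prefill work happens while later chunks are still in flight, and decode starts the moment the input stream ends.

```python
import asyncio
from typing import AsyncGenerator

async def stream_prompt_tokens() -> AsyncGenerator[list[int], None]:
    # Hypothetical client side: prompt token chunks arrive over time
    for chunk in ([1, 2, 3], [4, 5], [6]):
        await asyncio.sleep(0.01)  # simulate network latency between chunks
        yield chunk

async def handle_request(prompt_stream: AsyncGenerator[list[int], None]) -> list[int]:
    kv_state: list[int] = []  # stands in for the prefilled KV cache
    # Prefill each chunk as soon as it arrives, overlapping with upload,
    # instead of waiting for the full prompt
    async for chunk in prompt_stream:
        kv_state.extend(chunk)  # hypothetical per-chunk prefill step
    # Decode starts immediately once the input stream completes
    return kv_state + [101, 102]  # hypothetical decoded output tokens

tokens = asyncio.run(handle_request(stream_prompt_tokens()))
print(tokens)
```

In a real engine the per-chunk step would run attention over the new tokens and extend the request's KV cache, so the prefill cost is amortized over the upload time rather than paid after it.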