vLLM Streaming Requests

Creator
Seonglae Cho
Created
2026 Mar 24 14:21
Edited
2026 Mar 24 14:24
Refs
Streams input tokens into a single request via an AsyncGenerator while output generation starts concurrently (an intra-request optimization).
Traditional flow: complete prompt submission → prefill → decode. With streaming requests, prefill begins on the tokens that have already arrived, before the entire input is received, and decode starts immediately once the input completes. This overlap is the key to reducing latency.
Works orthogonally with In-Flight Batching
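The overlap can be sketched with plain asyncio. This is a minimal illustration, not vLLM's actual API: the prompt source, the chunk sizes, and the prefill/decode stand-ins are all hypothetical; the point is only that per-chunk prefill work happens while later chunks are still in flight, and decode starts the moment the input stream ends.

```python
import asyncio
from typing import AsyncGenerator

async def stream_prompt_tokens() -> AsyncGenerator[list[int], None]:
    # Hypothetical client side: prompt token chunks arrive over time
    for chunk in ([1, 2, 3], [4, 5], [6]):
        await asyncio.sleep(0.01)  # simulate network latency between chunks
        yield chunk

async def handle_request(prompt_stream: AsyncGenerator[list[int], None]) -> list[int]:
    kv_state: list[int] = []  # stands in for the prefilled KV cache
    # Prefill each chunk as soon as it arrives, overlapping with upload,
    # instead of waiting for the full prompt
    async for chunk in prompt_stream:
        kv_state.extend(chunk)  # hypothetical per-chunk prefill step
    # Decode starts immediately once the input stream completes
    return kv_state + [101, 102]  # hypothetical decoded output tokens

tokens = asyncio.run(handle_request(stream_prompt_tokens()))
print(tokens)
```

In a real engine the per-chunk step would run attention over the new tokens and extend the request's KV cache, so the prefill cost is amortized over the upload time rather than paid after it.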