A method that dynamically merges newly incoming requests into the batch of in-flight requests, so they are processed together instead of waiting for the current batch to finish.
- Complex implementation
- Different sequence lengths
- KV cache management required
- Scheduling challenges
- Fairness vs throughput tradeoff
- Memory fragmentation possible
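
The scheduling idea can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not how any particular serving engine implements it: `Request`, `run_continuous_batching`, `tokens_to_generate`, and `max_batch_size` are hypothetical names, admission is plain FIFO, every request produces one token per step, and KV cache allocation and prefill are not modeled at all.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    tokens_to_generate: int  # hypothetical: decode steps this request needs
    generated: int = 0

    @property
    def done(self) -> bool:
        return self.generated >= self.tokens_to_generate


def run_continuous_batching(arrivals, max_batch_size):
    """Simulate iteration-level scheduling.

    arrivals maps a step index to the list of Requests arriving at that step.
    Returns a dict mapping req_id to the step at which the request finished.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")

    waiting = deque()   # requests that have arrived but hold no batch slot
    running = []        # in-flight requests in the current batch
    finished_at = {}
    pending = sum(len(reqs) for reqs in arrivals.values())
    step = 0

    while pending > 0:
        # New requests join the queue as they arrive.
        waiting.extend(arrivals.get(step, []))

        # Fill free slots immediately (FIFO) instead of waiting for the
        # whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One decode iteration: every in-flight request emits one token.
        for req in running:
            req.generated += 1

        # Finished requests leave the batch right away, freeing their slots.
        still_running = []
        for req in running:
            if req.done:
                finished_at[req.req_id] = step
                pending -= 1
            else:
                still_running.append(req)
        running = still_running
        step += 1

    return finished_at


arrivals = {
    0: [Request("A", 3), Request("B", 1)],
    1: [Request("C", 2)],
}
result = run_continuous_batching(arrivals, max_batch_size=2)
print(result)
```

In this toy run, `B` finishes at step 0, so `C` takes its slot at step 1 and finishes at step 2 alongside `A`. With a static batch of `A` and `B`, `C` could not start until step 3. The FIFO admission used here is the simplest choice; the fairness vs throughput tradeoff listed above comes from replacing it with other policies.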
Used by
- …

Seonglae Cho