Even speculative decoding is constrained by the sequential dependency between speculation and verification. The key idea of SSD is that, while verification is running, the draft model predicts possible verification outcomes and pre-speculates for each outcome, storing them in a speculation cache.
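The speculation-cache idea can be sketched as follows. This is a minimal illustration in my own notation (the function names and the stand-in "draft model" are not from the paper): pre-speculate one continuation per plausible verification outcome, then look the actual outcome up.

```python
# Sketch of the speculation-cache idea (illustrative names, not the paper's API).
# While the target model verifies the current draft, the draft model
# pre-speculates a continuation for each plausible verification outcome.

def build_speculation_cache(draft_fn, possible_outcomes):
    """Pre-speculate one continuation per candidate verification outcome."""
    return {outcome: draft_fn(outcome) for outcome in possible_outcomes}

def next_speculation(cache, actual_outcome, draft_fn):
    """Cache hit: the next speculation is already prepared, so drafting cost
    is hidden behind verification. Cache miss: draft from scratch."""
    if actual_outcome in cache:
        return cache[actual_outcome], True
    return draft_fn(actual_outcome), False

# Toy usage: a stand-in "draft model" that appends one token to a prefix.
draft_fn = lambda prefix: prefix + ("next",)
cache = build_speculation_cache(draft_fn, [("a",), ("b",)])
spec, hit = next_speculation(cache, ("a",), draft_fn)
```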
If the actual verification outcome is in the cache (a cache hit), SSD can immediately return the next speculation, completely removing drafting overhead. The paper derives SSD's expected speedup and its upper bound over standard speculative decoding (SD) in closed form. Based on the empirical observation that the cache-miss probability follows a power law in the fan-out F, it proves that the optimal fan-out allocation follows a capped geometric series.
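The exact allocation formula is not reproduced in this note; a plausible sketch of a capped geometric fan-out schedule, assuming an initial fan-out, a decay ratio, and a per-depth cap (all my own parameter names):

```python
def capped_geometric_fanout(f1, r, cap, depth):
    """Allocate fan-out per speculation depth as a geometric series
    F_d = f1 * r**d, capped at `cap` and floored at 1.
    Illustrative only; not the paper's exact formula."""
    return [min(cap, max(1, round(f1 * r**d))) for d in range(depth)]
```

With a decay ratio below one this spends most of the branch budget near the root, where cache hits matter most, while the cap keeps any single depth from dominating.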
The optimized implementation, Saguaro, introduces three core techniques: (1) It constructs the speculation cache with a geometric fan-out strategy, achieving up to 90% bonus-token prediction accuracy. (2) It proposes Saguaro sampling, which downscales the probabilities of the top-F tokens under the draft distribution by a constant C, encouraging the residual distribution to concentrate its mass on tokens that are already in the cache. (3) It shows that the optimal fallback strategy depends on batch size and adopts an adaptive policy: a neural backup for small batches and a fast random-token backup for large batches.
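Technique (2) can be illustrated with a small sketch, assuming C is a multiplier in (0, 1) and that the modified distribution is simply renormalized (the paper's exact condition on C is not reproduced in this note):

```python
def saguaro_like_dist(q, top_f, c):
    """Downscale the top-F probabilities of a draft distribution q by a
    constant c in (0, 1) and renormalize. Sketch of the Saguaro-sampling
    idea: mass shifts off the top-F (cached) tokens, so the residual
    distribution used after a rejection favors cached tokens."""
    top = set(sorted(range(len(q)), key=q.__getitem__, reverse=True)[:top_f])
    scaled = [p * c if i in top else p for i, p in enumerate(q)]
    z = sum(scaled)
    return [p / z for p in scaled]
```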
https://arxiv.org/pdf/2603.03251

Seonglae Cho