With an RVQ (Residual Vector Quantization) tokenizer, each audio frame is encoded as N codebook stages, and these N stages need to be predicted sequentially.
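To illustrate why the stages are sequential, here is a minimal sketch of residual quantization in plain Python. The codebook values, the two-stage setup, and the function names (`rvq_encode`, `rvq_decode`, `nearest`) are assumptions made for illustration and are not taken from any particular tokenizer.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # What this stage failed to capture is handed to the next stage.
        residual = [r - e for r, e in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + e for o, e in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: N = 2 stages, 3 entries each, 2-dim vectors.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],    # stage 1: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],  # stage 2: refines the residual
]

frame_codes = rvq_encode([1.25, 1.0], toy_codebooks)
reconstructed = rvq_decode(frame_codes, toy_codebooks)
```

Because stage k quantizes the residual left over by stages 1 to k-1, its code only has meaning once the earlier codes are known, which is why the stages are ordered.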
A type of positional embedding used to distinguish individual channels. Audio is represented and generated as multiple channels (9 by default) of code sequences rather than as a single stream, and this embedding defines how each channel references temporal information when generating its next code. For example, if the first channel generates information at time t, the second channel generates at t - delay[1].
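The per-channel offset in the example above can be sketched as follows. This is only an illustration under stated assumptions: the delay values `[0, 1, 2]`, the `PAD` placeholder, the three-channel toy input (instead of the default 9), and the helper names `apply_delay` and `undo_delay` are all hypothetical.

```python
PAD = -1  # hypothetical placeholder for steps where a channel has no code


def apply_delay(codes, delays, pad=PAD):
    """Shift each channel so that at step t, channel c holds frame t - delays[c]."""
    num_frames = len(codes[0])
    total_steps = num_frames + max(delays)
    grid = []
    for channel, delay in zip(codes, delays):
        row = []
        for t in range(total_steps):
            frame = t - delay
            row.append(channel[frame] if 0 <= frame < num_frames else pad)
        grid.append(row)
    return grid


def undo_delay(grid, delays):
    """Realign the shifted channels back to a common time axis."""
    num_frames = len(grid[0]) - max(delays)
    return [row[d:d + num_frames] for row, d in zip(grid, delays)]


# Toy input: 3 channels, 3 frames each.
toy_codes = [
    [10, 11, 12],
    [20, 21, 22],
    [30, 31, 32],
]
toy_delays = [0, 1, 2]  # assumed values; delay[0] = 0 for the first channel

delayed = apply_delay(toy_codes, toy_delays)
# Row c, column t holds the code of frame t - toy_delays[c]:
# channel 0: [10, 11, 12, -1, -1]
# channel 1: [-1, 20, 21, 22, -1]
# channel 2: [-1, -1, 30, 31, 32]

restored = undo_delay(delayed, toy_delays)
```

Reading one column of `delayed` gives what all channels produce at the same generation step: the first channel is at frame t while the second is at frame t - delay[1].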