For batch processing, `attention_mask` is an optional binary tensor used when batching sequences together. It indicates to the model which tokens should be attended to and which should not. When two sequences have different lengths, the shorter sequence is padded to match the length of the longer one; the attention mask marks the positions of the padded indices so that the model does not attend to them, with real tokens marked as 1 and padding tokens as 0.
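A minimal sketch of how this mask is produced, assuming the Hugging Face `transformers` tokenizer API and `"bert-base-uncased"` as an example checkpoint; any tokenizer with padding support behaves similarly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Two sequences of different lengths; padding=True pads the shorter
# one to the length of the longer one within the batch.
batch = tokenizer(
    ["A short sequence.", "A noticeably longer sequence that forces padding."],
    padding=True,
    return_tensors="pt",  # return PyTorch tensors
)

print(batch["input_ids"])       # padded token ids
print(batch["attention_mask"])  # 1 = real token, 0 = padding
```

Because the tokenizer builds `attention_mask` alongside `input_ids`, the whole batch can be passed to the model directly (e.g. `model(**batch)`), and the padded positions are ignored during attention.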