create token embedding table
- number of embeddings
- embedding dimension
padding_idx- index of padding token in the input indices- The main purpose of this parameter is to provide a way to ignore certain tokens during the embedding lookup, which is particularly useful for batch processing of sequences of varying lengths
- the embedding vector for the padding index will not be updated during training.
- This is the token id value not vector index
Increasing the vocabulary size to a multiple of 64 means that the data can be more easily divided into equally sized batches that align with the way memory is managed and computations are performed on GPUs.
# GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency vocab_size: int = 50304
return embedding tensor for input index tensor

Seonglae Cho