Attention Mechanism Optimization, Grouped-query Attention, Activation Function, SwiGLU, Relative Positional Encoding, RoPE, Transformer Training, Ring Attention, Multimodal AI, Model Merging, Depth Up-Scaling, Text Tokenizer