General Matrix Multiplications (GEMM)
Fused multiply-add (FMA) + Inner product
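
As a rough illustration of how these two ideas combine, the sketch below computes each output element of a matrix product as an inner product built from repeated multiply-add steps. The function name `gemm` and the list-of-lists layout are assumptions made for this example only. Plain Python rounds after the multiply and again after the add, whereas a hardware FMA would perform both with a single rounding.

```python
def gemm(a, b):
    """Naive C = A @ B for matrices stored as lists of rows (illustrative only)."""
    m, k = len(a), len(a[0])
    k_b, n = len(b), len(b[0])
    if k != k_b:
        raise ValueError("inner dimensions must match")

    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            # Inner product of row i of A with column j of B.
            for p in range(k):
                # One multiply-add step; an FMA instruction would fuse
                # this multiply and add into a single rounded operation.
                acc = a[i][p] * b[p][j] + acc
            c[i][j] = acc
    return c
```

Optimized GEMM kernels reorder and tile these loops heavily, but the innermost operation is still this accumulate-a-product step.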
Batched GEMM restriction
A batched GEMM kernel, which bundles multiple matrix multiplications together and runs them very quickly, only works properly when all batch elements have the same shape.
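
The same-shape requirement can be sketched in plain Python as a check performed before the batch is processed. The names `matmul`, `shape`, and `batched_gemm` are hypothetical helpers for this example and do not correspond to any particular library's API. A real kernel would run the multiplications in parallel rather than in a loop.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def shape(m):
    """Return (rows, cols) of a list-of-rows matrix."""
    return (len(m), len(m[0]))


def batched_gemm(a_batch, b_batch):
    """Multiply a_batch[i] @ b_batch[i] for every i, enforcing uniform shapes."""
    if len(a_batch) != len(b_batch):
        raise ValueError("both batches must have the same length")

    a_shapes = {shape(a) for a in a_batch}
    b_shapes = {shape(b) for b in b_batch}
    # The restriction from the text: one shared shape per operand.
    if len(a_shapes) > 1 or len(b_shapes) > 1:
        raise ValueError("all batch elements must have the same shape")
    # Inner dimensions must also agree for the products to be defined.
    if a_shapes and next(iter(a_shapes))[1] != next(iter(b_shapes))[0]:
        raise ValueError("inner dimensions must match")

    return [matmul(a, b) for a, b in zip(a_batch, b_batch)]
```

Under this sketch, inputs whose shapes differ would have to be padded to a common shape or grouped by shape before being batched.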

Seonglae Cho