Core linear algebra operation that dominates transformer compute; optimized with tiling and tensor cores.
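
The entry does not name the operation, so the sketch below assumes it is general matrix multiplication. It illustrates only the tiling idea: the output is computed one small block at a time so that each block of the inputs is reused while it is still in fast memory. The function name `tiled_matmul` and the `tile` parameter are hypothetical, and pure Python is used for clarity rather than speed; real kernels apply the same loop structure on GPU shared memory and tensor cores.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), visiting the work in tile-sized blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # block of output rows
        for j0 in range(0, m, tile):          # block of output columns
            for p0 in range(0, k, tile):      # block of the shared dimension
                # Accumulate this block's partial products into the output tile.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = tiled_matmul(a, b)  # [[58.0, 64.0], [139.0, 154.0]]
```

The tile size changes only the order in which partial products are accumulated, not the result for these integer inputs, which is why it can be tuned freely to fit the hardware's cache or shared memory.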