Memory‑Efficient Attention
Attention kernels that reduce memory use and runtime by computing attention in tiles (e.g., FlashAttention) or with linearized variants, avoiding materialization of the full n × n score matrix.
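A minimal NumPy sketch of the tiling idea: attention is computed one query/key block at a time with an online softmax, so the full score matrix is never built. The function name, block size, and single-head unbatched shapes are illustrative assumptions, not the FlashAttention kernel itself.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Tiled softmax(Q K^T / sqrt(d)) V: process key/value blocks one at a time
    per query block, keeping a running max and sum (online softmax) so the
    (n x n) score matrix is never materialized."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    scale = 1.0 / np.sqrt(d)
    for qs in range(0, n, block):
        q = Q[qs:qs + block] * scale                      # (bq, d)
        m = np.full(q.shape[0], -np.inf)                  # running row max
        l = np.zeros(q.shape[0])                          # running softmax denominator
        acc = np.zeros((q.shape[0], V.shape[1]))          # running weighted-V numerator
        for ks in range(0, n, block):
            s = q @ K[ks:ks + block].T                    # (bq, bk) scores for this tile
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])
            correction = np.exp(m - m_new)                # rescale previous partial sums
            l = l * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ V[ks:ks + block]
            m = m_new
        out[qs:qs + block] = acc / l[:, None]
    return out

# Matches dense attention up to floating-point error.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
dense = np.exp(S - S.max(axis=1, keepdims=True))
dense = (dense / dense.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), dense)
```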
Monte Carlo Tree Search (MCTS)
Planning algorithm that balances exploration and exploitation by sampling trajectories from the current state; widely used in agents.
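A compact sketch of one such planner, Monte Carlo Tree Search with UCT selection, on a hypothetical toy environment: random rollouts sample trajectories, and their returns are backed up the tree. The environment, hyperparameters, and all names are illustrative.

```python
import math
import random

# Hypothetical toy environment: start at position 0, actions move -1/+1,
# reward 1 for reaching +3 within 6 steps.
ACTIONS = (-1, +1)

def step(state, action):
    pos, t = state
    return (pos + action, t + 1)

def terminal(state):
    pos, t = state
    return pos == 3 or t == 6

def reward(state):
    return 1.0 if state[0] == 3 else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # action -> child Node
        self.visits, self.value = 0, 0.0

def uct_select(node, c=1.4):
    # UCT score: exploitation (mean return) plus an exploration bonus.
    return max(node.children.values(),
               key=lambda ch: ch.value / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(state):
    # Sample a random trajectory to estimate the value of a leaf.
    while not terminal(state):
        state = step(state, random.choice(ACTIONS))
    return reward(state)

def mcts(root_state, n_sims=500):
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        # 1. Selection: descend through fully expanded, non-terminal nodes.
        while not terminal(node.state) and len(node.children) == len(ACTIONS):
            node = uct_select(node)
        # 2. Expansion: add one untried action.
        if not terminal(node.state):
            action = random.choice([a for a in ACTIONS if a not in node.children])
            node.children[action] = Node(step(node.state, action), parent=node)
            node = node.children[action]
        # 3. Simulation and 4. Backpropagation.
        value = rollout(node.state)
        while node is not None:
            node.visits += 1
            node.value += value
            node = node.parent
    # Recommend the most-visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)

print(mcts((0, 0)))   # expected to prefer +1, moving toward the goal
```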
Momentum
Technique that accumulates a velocity (exponentially decayed running sum) of past gradients to smooth updates; the basis for SGD with momentum and a component of Adam.
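A minimal sketch of the classic momentum update on a toy quadratic objective; the learning rate, decay factor, and objective are illustrative choices.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """Heavy-ball momentum: keep an exponentially decayed velocity of past
    gradients and step along it, which smooths noisy updates."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, grad=w, velocity=v)
print(w)   # very close to the optimum at the origin
```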
Feed‑Forward Network (MLP)
Multi‑layer perceptron applied inside each transformer block for position‑wise (token‑wise) transformations.
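A NumPy sketch of the position-wise feed-forward block, assuming the common 4× hidden expansion and a GELU activation; the shapes and initialization scale are illustrative.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, a common activation in transformer MLPs
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_mlp(x, W1, b1, W2, b2):
    """Position-wise feed-forward block: each token (row of x) is transformed
    independently by expand -> nonlinearity -> project."""
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_ff, n_tokens = 64, 256, 10    # 4x expansion is a common choice
rng = np.random.default_rng(0)
x = rng.standard_normal((n_tokens, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
print(transformer_mlp(x, W1, b1, W2, b2).shape)  # (10, 64): per-token shape preserved
```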
Matrix Multiplication (GEMM)
Core linear algebra operation that dominates transformer compute; optimized with tiling and tensor cores.
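A NumPy sketch of blocked (tiled) matrix multiplication, the same idea GPU kernels use to keep working tiles in fast memory; the tile size is an illustrative assumption, and NumPy's `@` already dispatches to an optimized BLAS GEMM, so the loop is purely didactic.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked GEMM: accumulate C = A @ B tile by tile so each working set
    (one tile of A, B, and C) stays small, mirroring shared-memory tiling."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((96, 80)), rng.standard_normal((80, 64))
assert np.allclose(tiled_matmul(A, B), A @ B)
```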
Mixed‑Precision Training
Training with lower‑precision formats (e.g., FP16) to save memory and increase throughput, using loss scaling to avoid gradient underflow.
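A small NumPy illustration of why loss scaling matters: gradient values below FP16's smallest subnormal round to zero unless scaled up before the cast and unscaled back in FP32 for the weight update. The scale factor and gradient values are illustrative.

```python
import numpy as np

# Sketch of loss scaling: keep FP32 "master" weights, run the backward pass in
# FP16, and scale the loss (hence the gradients) so tiny gradients survive.
grads_fp32 = np.array([1e-8, 2.5e-8, 3e-4], dtype=np.float32)
scale = np.float32(2.0 ** 16)

unscaled_cast = grads_fp32.astype(np.float16)           # entries below ~6e-8 round to 0
scaled_cast = (grads_fp32 * scale).astype(np.float16)   # survive the cast after scaling
recovered = scaled_cast.astype(np.float32) / scale      # unscale back in FP32

master_w = np.zeros(3, dtype=np.float32)
master_w -= np.float32(0.1) * recovered                 # optimizer step on FP32 weights

print(unscaled_cast)   # the small gradients are flushed to zero
print(recovered)       # close to the original FP32 gradients
```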
Quantization
Reducing numerical precision (e.g., FP16 → INT8/INT4) to shrink model size and speed up inference.
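A minimal sketch of symmetric per-tensor INT8 post-training quantization; real schemes add per-channel scales, zero points, or calibration data, all omitted here.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]
    with a single scale factor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.dtype, f"max abs error {err:.4f}")   # int8 storage, small reconstruction error
```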
Alignment
Training methods and controls that ensure models follow human intent and avoid harmful behavior.
Mixture of Experts (MoE)
Architecture in which a router activates a small subset of expert MLPs per token, increasing model capacity without a proportional increase in compute.
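A NumPy sketch of top-k routing, assuming tiny ReLU MLP experts, a softmax router, and gate weights renormalized over the chosen experts; dimensions and top_k are illustrative, and load balancing is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, W_router, experts, top_k=2):
    """Top-k MoE: the router scores every expert per token, but only the k
    best experts actually run for that token."""
    probs = softmax(x @ W_router)                 # (tokens, n_experts)
    top = np.argsort(-probs, axis=-1)[:, :top_k]  # chosen expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top[t]:
            gate = probs[t, e] / probs[t, top[t]].sum()       # renormalized gate
            W1, W2 = experts[e]
            out[t] += gate * (np.maximum(x[t] @ W1, 0) @ W2)  # expert = small ReLU MLP
    return out

d, d_ff, n_experts, n_tokens = 32, 64, 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((n_tokens, d))
W_router = rng.standard_normal((d, n_experts)) * 0.02
experts = [(rng.standard_normal((d, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d)) * 0.02) for _ in range(n_experts)]
print(moe_layer(x, W_router, experts).shape)  # (16, 32): same shape, sparse compute
```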
Masked Language Modeling (MLM)
Pretraining objective in which a fraction of tokens is masked and predicted from context; used in BERT‑style models.
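A simplified NumPy sketch of the objective: select roughly 15% of positions, replace them with a [MASK] id, and compute cross-entropy only at those positions. The toy random "logits", vocabulary ids, ignore-label convention, and the omission of BERT's 80/10/10 corruption split are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, mask_id, seq_len = 100, 0, 12

tokens = rng.integers(1, vocab_size, size=seq_len)   # original token ids
mask = rng.random(seq_len) < 0.15                    # ~15% of positions selected
inputs = np.where(mask, mask_id, tokens)             # corrupted input sequence
labels = np.where(mask, tokens, -100)                # -100: ignore position in the loss

# Toy stand-in for model output; a real encoder would produce these logits.
logits = rng.standard_normal((seq_len, vocab_size))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

# Cross-entropy over masked positions only.
masked = labels != -100
loss = -log_probs[masked, labels[masked]].mean() if masked.any() else 0.0
print(inputs, labels, loss, sep="\n")
```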