https://fleuret.org
- Speculative decoding: a cheaper model generates tokens, and a rejection process corrects this generation to match the full-model distribution. (sketch below)
- FlashAttention: computes the attention on the fly, avoiding a memory footprint O(T^2) (+ optimizes very carefully for the GPU!) (sketch below)
- Cosine schedule: the learning rate varies less at the beginning and end of the schedule (sketch below)
- AdamW: decouples weight decay from Adam's gradient update (sketch below)
- MoE (Mixture of Experts): The FFN block is implemented with multiple MLPs and a gating mechanism selects which ones process each token. (sketch below)
- MLA (Multi-head Latent Attention): stores a low-rank projection of the attention block input and computes the K and V from it (sketch below)
- SwiGLU: non-linearity for the FFN block with per-component gating (sketch below)
- GQA (Grouped Query Attention): more Q heads than (K, V) heads, so several query heads share each KV head (sketch below)
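
For the speculative decoding bullet, a minimal sketch of the accept/reject correction, assuming the draft tokens and both models' logits over the drafted positions are already available (the function name and shapes are mine, not from any particular implementation):

```python
import torch

def speculative_accept(target_logits, draft_logits, draft_tokens):
    # target_logits, draft_logits: (K, V) logits of the full and cheap models
    # over the K drafted positions; draft_tokens: list of K drafted token ids.
    p = torch.softmax(target_logits, dim=-1)  # full-model distribution
    q = torch.softmax(draft_logits, dim=-1)   # draft-model distribution
    accepted = []
    for k, tok in enumerate(draft_tokens):
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if torch.rand(()) < (p[k, tok] / q[k, tok]).clamp(max=1.0):
            accepted.append(tok)
        else:
            # On rejection, resample from the residual max(p - q, 0), renormalized,
            # which makes the overall output follow the full-model distribution.
            residual = (p[k] - q[k]).clamp(min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    # (If every drafted token is accepted, the full algorithm also samples one
    # extra token from the full model's next-position distribution.)
    return accepted
```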
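
For the FlashAttention bullet, a didactic sketch of the online-softmax idea only: stream over key/value blocks, keep a running max and normalizer per query, and never materialize the (T, T) score matrix. The real thing also tiles over queries and is a carefully fused GPU kernel; none of that is shown here.

```python
import torch

def streaming_attention(q, k, v, block=128):
    # q, k, v: (T, d). Non-causal, single head, pure PyTorch: just the math.
    T, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((T, 1), float("-inf"))
    row_sum = torch.zeros(T, 1)
    for start in range(0, T, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                               # scores for this KV block
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - new_max)                           # unnormalized block probabilities
        correction = torch.exp(row_max - new_max)            # rescale what was accumulated so far
        out = out * correction + p @ vb
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max
    return out / row_sum
```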
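
For the cosine schedule bullet, a sketch of the usual warmup + cosine decay; since the derivative of the cosine is near zero at both ends, the learning rate changes slowly at the beginning and end of the decay (parameter names are mine):

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0, warmup=0):
    # Linear warmup, then cosine decay from lr_max to lr_min.
    if step < warmup:
        return lr_max * step / max(warmup, 1)
    t = (step - warmup) / max(total_steps - warmup, 1)   # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```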
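
For the AdamW bullet, a sketch of a single update on one parameter tensor (not the official optimizer): the decay is applied directly to the weights, scaled by the learning rate, and never enters the gradient moments, which is what "decoupled" means here.

```python
import torch

def adamw_update(p, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=1e-2):
    # p: parameter, grad: its gradient, m, v: running moments, step >= 1.
    p.mul_(1 - lr * weight_decay)                              # decoupled weight decay
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])            # 1st moment
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])  # 2nd moment
    m_hat = m / (1 - betas[0] ** step)                         # bias corrections
    v_hat = v / (1 - betas[1] ** step)
    p.sub_(lr * m_hat / (v_hat.sqrt() + eps))                  # Adam step on the gradient
```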
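
For the MoE bullet, a naive sketch (class and parameter names are mine): a router scores the experts per token, the top-k experts process that token, and their outputs are combined with the routing weights. Real implementations batch tokens per expert instead of looping.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    def __init__(self, d_model, d_hidden, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                         # x: (B, T, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # routing weights of the selected experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[..., slot] == e        # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```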
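
For the MLA bullet, a rough sketch of the caching idea only (names are mine, and the real design also handles rotary position embeddings separately, which is omitted here): a low-rank latent per token is what gets cached, and K and V are expanded from it when attention is computed.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    def __init__(self, d_model, d_latent, n_heads, d_head):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress the block input
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> V

    def compress(self, x):    # x: (B, T, d_model) -> (B, T, d_latent), this is what is cached
        return self.down(x)

    def expand(self, c):      # c: (B, T, d_latent) -> K, V of shape (B, T, n_heads * d_head)
        return self.up_k(c), self.up_v(c)
```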
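
For the SwiGLU bullet, a minimal sketch (layer names are mine): one projection produces the values, another the gate, and each hidden component is multiplied by silu of its gate before the down-projection.

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_value = nn.Linear(d_model, d_hidden, bias=False)
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # Per-component gating: silu(gate) * value, then project back to d_model.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_value(x))
```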
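
For the GQA bullet, a sketch assuming n_q_heads = n_kv_heads × n_groups (function name is mine): each (K, V) head is shared by a group of query heads, which shrinks the KV cache by the group factor.

```python
import torch.nn.functional as F

def gqa_attention(q, k, v, n_groups):
    # q: (B, n_q_heads, T, d); k, v: (B, n_kv_heads, T, d), n_q_heads = n_kv_heads * n_groups.
    k = k.repeat_interleave(n_groups, dim=1)   # broadcast each KV head to its group of Q heads
    v = v.repeat_interleave(n_groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```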
All this being said, putting both normalized and non-normalized cannot hurt methinks.
Someone linked this paper which is exactly the sort of thing I was looking for:
arxiv.org/abs/2502.12102
Magic!
3/3
So it's cool for causal masks, but it also allows an amazing trick to deal with batches of sequences of various lengths *without padding*!
2/3