Also Ashwinee Panda for coining "zero-sum learning", which is honestly a pretty great name.
Code: github.com/mirandrom/zsl
Checkpoints: huggingface.co/mirandrom/zs...
Wandb logs: wandb.ai/amr-amr/zsl/...
What’s cool is that these could potentially be mitigated independent of scaling (Step 2).
Exactly how to do this remains an open question.
In our paper, we show how SGO (i.e. destructive interference in per-example gradients approaching 1) fundamentally results in ZSL, and confirm that it occurs alongside deceleration and explains it.
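To make the interference idea concrete, here's a rough sketch of measuring destructive interference across per-example gradients on a toy model. The elementwise metric D = 1 − |Σᵢ gᵢ| / Σᵢ |gᵢ| is an assumption for illustration, not necessarily the exact quantity used in the paper.

```python
# Minimal sketch (not the paper's implementation): estimate destructive
# interference across per-example gradients for a toy model and batch.
# Assumed elementwise definition: D = 1 - |sum_i g_i| / sum_i |g_i|,
# where g_i is example i's gradient; D -> 1 means gradients mostly cancel.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)            # stand-in for an LLM
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(8, 16)              # toy batch of 8 examples
y = torch.randint(0, 4, (8,))

per_example_grads = []
for i in range(len(x)):
    model.zero_grad()
    loss_fn(model(x[i : i + 1]), y[i : i + 1]).backward()
    g = torch.cat([p.grad.flatten() for p in model.parameters()])
    per_example_grads.append(g)

G = torch.stack(per_example_grads)                            # (batch, n_params)
eps = 1e-12
interference = 1 - G.sum(0).abs() / (G.abs().sum(0) + eps)    # per-parameter D
print(f"mean destructive interference: {interference.mean().item():.3f}")
```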
In other words, by explaining loss deceleration (and the mitigating effect of scale) we can explain scaling improvements. We propose the zero-sum learning (ZSL) hypothesis as an explanation for deceleration.
Scaling improvements can be expressed in terms of mitigating “loss deceleration”: an abrupt slowdown in the rate of loss improvement, characterized by piecewise-linear log-log loss curves.
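For intuition, here's a minimal sketch (not the paper's methodology) of what "piecewise linear in log-log space" looks like: fit two line segments to log(loss) vs log(step) and compare the slopes before and after the best breakpoint. The synthetic curve below is made up purely to illustrate the kink.

```python
# Minimal sketch (illustrative only): detect loss deceleration by fitting a
# two-segment linear model to the loss curve in log-log space and comparing
# the slopes before/after the best breakpoint.
import numpy as np

def two_piece_fit(log_t, log_loss):
    """Brute-force the breakpoint that minimizes total squared error."""
    best = None
    for k in range(2, len(log_t) - 2):
        left = np.polyfit(log_t[:k], log_loss[:k], 1)
        right = np.polyfit(log_t[k:], log_loss[k:], 1)
        err = (np.sum((np.polyval(left, log_t[:k]) - log_loss[:k]) ** 2)
               + np.sum((np.polyval(right, log_t[k:]) - log_loss[k:]) ** 2))
        if best is None or err < best[0]:
            best = (err, k, left[0], right[0])
    return best  # (error, breakpoint index, slope before, slope after)

# Synthetic loss curve that decelerates at step 1000 (toy numbers).
steps = np.arange(100, 10_000, 10)
loss = np.where(steps < 1000,
                6.0 * steps ** -0.30,
                6.0 * 1000 ** -0.25 * steps ** -0.05)
_, k, s1, s2 = two_piece_fit(np.log(steps), np.log(loss))
print(f"breakpoint ~ step {steps[k]}, log-log slope {s1:.2f} -> {s2:.2f}")
```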
LLM scaling laws predict *that* scaling model size improves loss, but they do not explain *how*.
By identifying a mechanism underlying scaling improvements, we could target it directly and potentially improve LLMs independent of scale.
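For context, scaling laws are typically parametric fits of final loss against model size, e.g. a Chinchilla-style form L(N) ≈ E + A·N^(−α). The sketch below uses made-up coefficients purely to show that such a law predicts lower loss at larger N while saying nothing about the training dynamics behind it.

```python
# Minimal sketch: a Chinchilla-style parametric scaling law predicts final loss
# from model size N, but says nothing about the mechanism of the improvement.
# The coefficients below are illustrative placeholders, not fitted values.
def predicted_loss(n_params, E=1.7, A=400.0, alpha=0.34):
    return E + A * n_params ** -alpha

for n in (1e8, 1e9, 1e10):
    print(f"N = {n:.0e}: predicted loss ~ {predicted_loss(n):.2f}")
```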