https://www.xtxmarkets.com/ 🏦 XTX Markets Research Director (NYC AI Lab)
Superpower is trying everything 🪅
Newest focus: training next-generation super intelligence - Preview above 👶
www.vita-group.space/team
One can clearly see how this group evolves its own tastes!
… and deeper in my heart: long live optimization!! ❤️
arXiv is filled with papers that treat symptoms (or not even that!) without ever diagnosing the disease
🔹 Classic Planning: LLMs ace simpler puzzles but struggle badly as complexity grows, losing track of longer-term decisions
🔎 Competitive Games: Top chess engines swept every LLM clean. Even simple tactical awareness quickly fades when facing deeper strategic branches
We all love seeing how smart LLMs can be: solving complex math, crafting beautiful text, and coding effortlessly. But how well do they handle real-world strategic complexity, cooperation, & social negotiation? Can they play well when things get tricky?
Not quite!
The insights were drawn from good old compressive sensing — RIP!! (Optimization folks shall get my joke!) 😆
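(For readers outside the field, my gloss on the pun, not part of the original post: RIP here is the Restricted Isometry Property, the standard condition on a compressive-sensing measurement matrix A.)

```latex
% Restricted Isometry Property (RIP) of order s with constant \delta_s:
% the matrix A nearly preserves the norm of every s-sparse vector x.
(1 - \delta_s)\,\|x\|_2^2 \;\le\; \|Ax\|_2^2 \;\le\; (1 + \delta_s)\,\|x\|_2^2
\quad \text{for all } x \text{ with } \|x\|_0 \le s .
```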
👊Memory❗
For the first time, it enables pre-training LLaMA-13B with naive DDP on A100-80GB without any other system-level optimizations
👊Throughput❗
For LLaMA-7B pre-training on 8×A100-80GB, it supports 4× larger batch sizes and 3× training throughput, while maintaining the best reported perplexity
APOLLO approximates channel/tensor-wise learning-rate scaling with a low-rank optimizer state built via pure random projection - no SVD!
It is highly tolerant to extremely low rank (even rank-1)
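A minimal sketch of what that idea could look like for a single 2-D weight, written by me and not taken from the official repo: the function name, state layout, and hyperparameters are illustrative assumptions, bias correction and norm clipping are omitted, and the real implementation is at github.com/zhuhanqing/A...

```python
import torch

def apollo_like_step(param, grad, state, rank=1, lr=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative update: channel-wise gradient scaling estimated
    from a randomly projected (low-rank) optimizer state."""
    m, n = grad.shape

    if "proj" not in state:
        # Fixed Gaussian projection -- no SVD, no periodic refactorization.
        state["proj"] = torch.randn(rank, m, device=grad.device) / m ** 0.5
        state["exp_avg"] = torch.zeros(rank, n, device=grad.device)
        state["exp_avg_sq"] = torch.zeros(rank, n, device=grad.device)

    # Compress the gradient: moments live only in this tiny (rank x n) space.
    r = state["proj"] @ grad
    state["exp_avg"].mul_(beta1).add_(r, alpha=1 - beta1)
    state["exp_avg_sq"].mul_(beta2).addcmul_(r, r, value=1 - beta2)
    adam_dir = state["exp_avg"] / (state["exp_avg_sq"].sqrt() + eps)

    # Per-channel scaling: ratio of the Adam-style update norm to the
    # projected gradient norm, applied back to the full-rank gradient.
    scale = adam_dir.norm(dim=0) / (r.norm(dim=0) + eps)      # shape (n,)
    param.data.add_(grad * scale, alpha=-lr)                  # SGD-like step
```

With rank=1 the extra state per weight is just two length-n vectors plus the projection, which is what makes the "SGD-like memory" claim plausible.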
📢 Introducing APOLLO! 🚀: SGD-like memory cost, yet AdamW-level performance (or better!).
❓ How much memory do we need for optimization states in LLM training? 🧐
Almost zero.
📜 Paper: arxiv.org/abs/2412.05270
🔗 GitHub: github.com/zhuhanqing/A...
We’re thrilled to present 9 main conference papers, 5 workshop papers and a keynote talk:
Wanna call it Edge of Stability?