Satoki Ishikawa
@satoki-ishikawa.bsky.social
Institute of Science Tokyo / R. Yokota lab / Neural Network / Optimization
Looking for great research collaborations
https://riverstone496.github.io/
On the question that has been blowing up on X lately of why people use Muon rather than approximations of NGD: I keep thinking the answer is probably somewhere around here, but I rarely see papers that actually address it. (A rough sketch of the Muon update itself is included below.)
www.arxiv.org/abs/2505.24333
Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-a...
www.arxiv.org
November 15, 2025 at 11:12 AM
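For context, a minimal sketch of the Muon-style update being discussed. The Newton–Schulz coefficients and step shape below follow common public implementations; they are assumptions for illustration, not taken from the post or the linked paper.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately map G to its nearest semi-orthogonal factor U V^T using an
    # odd polynomial iteration. The quintic coefficients below follow common
    # public Muon implementations (an assumption, not a verified reference).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)                 # normalize so the iteration stays stable
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_like_step(W, grad, momentum, lr=0.02, beta=0.95):
    # Heavy-ball momentum on the raw gradient, then orthogonalize the update.
    # Unlike NGD approximations, there is no explicit curvature matrix:
    # the update direction is simply projected toward an orthogonal matrix.
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    W.add_(update, alpha=-lr)
    return W, momentum
```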
The release date of 『確率的機械学習』 falling two days before IBIS starts looks like it was timed on purpose 👀
October 23, 2025 at 4:25 PM
My favorite pianist Eric Lu won first prize at the Chopin Piano Competition 😀
I’m watching the Chopin Competition.
I’m deeply impressed by his performance, and at the same time surprised that he’s playing on an office chair.

m.youtube.com/watch?v=fDsg...
ERIC LU – first round (19th Chopin Competition, Warsaw)
YouTube video by Chopin Institute
m.youtube.com
October 21, 2025 at 12:52 AM
I missed the deadline to present at IBIS, but I’ve decided to attend anyway 😀 I was of course also too late for the official banquet, so I signed up for the unofficial one. Looking forward to seeing everyone 🙇
October 20, 2025 at 6:28 AM
Once you get used to one library (torch, trl), the psychological barrier to migrating is very high, even when another library (jax, verl) would be a better fit. What I want most is an LLM specialized in translating between programming languages and frameworks…
October 19, 2025 at 6:27 AM
A practical upper bound is an interesting concept. What other kinds of practical upper bounds would be interesting besides this one? (A toy full Gauss-Newton step is sketched below.)
arxiv.org/abs/2510.09378
The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much...
arxiv.org
October 15, 2025 at 1:04 AM
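To make the "practical upper bound" concrete, here is a toy full Gauss-Newton step on a tiny tanh regressor with squared loss. Everything here (model, data, damping) is an illustrative assumption, not taken from the paper, which works at LLM scale.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))        # toy inputs
y = rng.normal(size=64)             # toy targets
w = np.zeros(8)

def gauss_newton_step(w, damping=1e-3):
    f = np.tanh(X @ w)
    r = f - y                                  # residuals
    J = X * (1.0 - f**2)[:, None]              # Jacobian df/dw, shape (n, d)
    G = J.T @ J + damping * np.eye(w.size)     # full Gauss-Newton matrix
    g = J.T @ r                                # gradient of 0.5 * ||r||^2
    return w - np.linalg.solve(G, g)           # exact solve, no approximation

for _ in range(10):
    w = gauss_newton_step(w)
```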
If I’m going to turn the Muon work into a paper, I need to write it up soon or risk overlapping with someone else again; at the same time, I can’t quite produce that one last push of originality…
October 13, 2025 at 11:52 AM
While the focus for generalization and implicit bias has been on robustness to sample-wise noise, the rise of large-scale models suggests that robustness to parameter-wise noise (e.g., from quantization) might now be just as important? (A minimal parameter-noise probe is sketched below.)

x.com/deepcohen/st...
Jeremy Cohen on X: "This nice, thorough paper on LLM pretraining shows that quantization error rises sharply when the learning rate is decayed. But, why would that be? The answer is likely related to curvature dynamics. https://t.co/cdkt3DU1iw" / X
x.com
October 13, 2025 at 5:40 AM
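As a hypothetical illustration of the parameter-wise-noise view, here is a minimal probe that compares the loss at the trained weights with the average loss under small random weight perturbations, a crude stand-in for quantization error. The model and numbers are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))
y = rng.normal(size=256)
w = np.linalg.lstsq(X, y, rcond=None)[0]       # "trained" weights

def loss(w):
    r = X @ w - y
    return 0.5 * np.mean(r**2)

def noise_sensitivity(w, sigma=0.01, trials=32):
    # Average loss increase under w -> w + sigma * noise:
    # a simple proxy for how much quantization-like noise would hurt.
    base = loss(w)
    bumped = [loss(w + sigma * rng.normal(size=w.shape)) for _ in range(trials)]
    return float(np.mean(bumped) - base)

print(noise_sensitivity(w))
```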
I’ve been challenging myself to read a lot of NeurIPS 2025 papers, but maybe I should switch soon to reading ICLR 2025 submissions instead.
October 10, 2025 at 1:11 PM
This paper is really interesting.
NGD builds curvature from the function gradient df/dw, while optimizers like Adam and Shampoo use the loss gradient dL/dw.
I’ve always wondered which is better, since using the loss gradient with an EMA might cause loss spikes later in training. (A toy contrast between the two is sketched below.)
October 10, 2025 at 12:46 PM
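A tiny numerical contrast between the two notions of curvature mentioned above, on a linear least-squares model. This is purely illustrative; in particular, Adam's actual second moment is an EMA over steps, simplified here to a single squared gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))
y = rng.normal(size=128)
w = rng.normal(size=4)

J = X                                   # df/dw: per-example function Jacobian
r = X @ w - y
grad = J.T @ r / len(y)                 # dL/dw: loss gradient

# NGD/Gauss-Newton curvature: built from the function Jacobian alone,
# independent of the residuals and targets.
F = J.T @ J / len(y)

# Adam/Shampoo-style statistic: built from the loss gradient itself.
v = grad**2

ngd_update  = np.linalg.solve(F + 1e-3 * np.eye(4), grad)
adam_update = grad / (np.sqrt(v) + 1e-8)
```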
Reposted by Satoki Ishikawa
This paper studies why Adam occasionally causes loss spikes, attributing them to the edge-of-stability phenomenon. As the attached figure shows, once training hits the edge of stability (panel b), a loss spike is triggered. An interesting experimental report! (The basic stability threshold is sketched on a toy quadratic below.)

arxiv.org/abs/2506.04805
October 10, 2025 at 7:55 AM
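A minimal edge-of-stability picture, using plain GD on a 1-D quadratic; the paper above studies the analogous, preconditioned threshold for Adam, and the numbers here are only for illustration.

```python
# GD on f(w) = 0.5 * lam * w^2 contracts iff |1 - lr * lam| < 1, i.e. lr < 2 / lam.
# Past that threshold the sharpest direction stops contracting and the loss spikes.
lam = 10.0
for lr in (0.15, 0.21):                  # threshold is 2 / lam = 0.2
    w = 1.0
    for _ in range(20):
        w -= lr * lam * w                # gradient step
    print(f"lr={lr}: final loss = {0.5 * lam * w**2:.3g}")   # stable vs. divergent
```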
I'm looking at ICLR submissions and I've noticed a significant number of papers related to Muon.
October 10, 2025 at 3:42 AM
How much overlap is there between the sparse structures that are favorable from the viewpoint of learning dynamics and implicit bias, the sparse structures that are favorable for GPU matrix multiplication, and the sparse structures the brain has? 🤔 Several GPU-friendly sparse-matrix patterns are known, but I rarely hear about their connections to learning theory or neuroscience. HPC people are interested in other fields too, so they do cite related literature, but my impression is that things stall one step short of a real bridge. (One GPU-friendly pattern, 2:4 sparsity, is sketched below.)
October 8, 2025 at 7:29 AM
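One of the GPU-friendly patterns alluded to above is 2:4 semi-structured sparsity, where each group of four consecutive weights keeps only its two largest-magnitude entries. A sketch of the masking step only; real kernels rely on dedicated sparse tensor-core instructions, which this does not use.

```python
import numpy as np

def two_four_mask(W: np.ndarray) -> np.ndarray:
    # Keep the top-2 entries (by magnitude) in every group of 4 along each row.
    rows, cols = W.shape
    assert cols % 4 == 0, "column count must be a multiple of 4"
    groups = np.abs(W).reshape(rows, cols // 4, 4)
    order = np.argsort(groups, axis=-1)            # ascending within each group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., -2:], True, axis=-1)
    return mask.reshape(rows, cols)

W = np.random.default_rng(0).normal(size=(4, 8))
W_sparse = W * two_four_mask(W)                    # exactly 50% zeros, 2 kept per group of 4
```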
I’m watching the Chopin Competition.
I’m deeply impressed by his performance, and at the same time surprised that he’s playing on an office chair.

m.youtube.com/watch?v=fDsg...
ERIC LU – first round (19th Chopin Competition, Warsaw)
YouTube video by Chopin Institute
m.youtube.com
October 6, 2025 at 7:27 PM
When research appears that overlaps heavily with both of two themes I had been pursuing in completely different places, the psychological damage is considerable; maybe my topic choices were too easy.
Two papers extremely close to the two themes I had been thinking about for the past two or three months came out at the same time yesterday, and my research topics have evaporated... I need to switch gears quickly and come up with good research topics from scratch again...
October 4, 2025 at 1:21 AM
Two papers extremely close to the two themes I had been thinking about for the past two or three months came out at the same time yesterday, and my research topics have evaporated... I need to switch gears quickly and come up with good research topics from scratch again...
October 2, 2025 at 6:06 PM
Reposted by Satoki Ishikawa
Not all scaling laws are nice power laws. This month’s blog post: Zipf’s law in next-token prediction and why Adam (ok, sign descent) scales better to large vocab sizes than gradient descent: francisbach.com/scaling-laws...
September 27, 2025 at 2:57 PM
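A toy version of the contrast described in the linked blog post: with Zipf-distributed token frequencies, the cross-entropy gradient is dominated by a few head tokens, whereas sign descent gives every vocabulary coordinate an equally sized step. All numbers here are illustrative assumptions.

```python
import numpy as np

V = 10_000                                     # vocabulary size
ranks = np.arange(1, V + 1)
p = 1.0 / ranks
p /= p.sum()                                   # Zipf(1) next-token frequencies

logits = np.zeros(V)                           # a single softmax "model"
q = np.exp(logits) / np.exp(logits).sum()
grad = q - p                                   # gradient of cross-entropy wrt logits

print(np.abs(grad[:10]).mean())                # head tokens: large gradient entries
print(np.abs(grad[-1000:]).mean())             # tail tokens: tiny gradient entries

gd_step   = -0.1 * grad                        # tail coordinates barely move
sign_step = -0.001 * np.sign(grad)             # every coordinate moves equally
```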
I've made some small updates to the 'awesome list' for second-order optimization that I made two years ago. It looks like Muon-related work and applications to PINNs have really taken off in the last couple of years.
github.com/riverstone49...
September 26, 2025 at 12:18 PM
I don’t know anything about fluid dynamics, but I came across a paper that seemed to say that second-order optimization is key when using the power of neural networks to solve the Navier–Stokes equations. If so, there’s something romantic about that.
arxiv.org/abs/2509.14185
Discovery of Unstable Singularities
Whether singularities can form in fluids remains a foundational unanswered question in mathematics. This phenomenon occurs when solutions to governing equations, such as the 3D Euler equations, develo...
arxiv.org
September 23, 2025 at 2:13 AM
Reposted by Satoki Ishikawa
This is not OK.

I don't submit to NeurIPS often, but I have reviewed papers for this conference almost every year. As a reviewer, why would I spend time trying to give papers a fair assessment if this is what happens in the end???
This is the metareview
September 20, 2025 at 6:10 AM
I've been selected for ACT-X. I'll keep working toward a deeper understanding of neural network optimization 😁
www.jst.go.jp/kisoken/act-...
FY2025 Strategic Basic Research Programs (ACT-X): newly selected research projects and evaluators | ACT-X
www.jst.go.jp
September 18, 2025 at 5:43 AM
When a paper has more than 40 figures, I can really feel the author’s dedication just by looking at it - it’s energizing
arxiv.org/abs/2509.01440
Benchmarking Optimizers for Large Language Model Pretraining
The recent development of Large Language Models (LLMs) has been accompanied by an effervescence of novel ideas and methods to better optimize the loss of deep learning models. Claims from those method...
arxiv.org
September 3, 2025 at 5:05 PM
The biggest reason I chose the computer science path was wanting to experimentally verify Prof. Amari's 『神経回路網の数理』 with modern models and large-scale computing. When I opened the book again for the first time in a while, I was surprised to find that it describes essentially the same model reduction and analysis I happen to be working on right now.
August 31, 2025 at 5:54 AM
Reposted by Satoki Ishikawa
Today I learned that the continuous-time limit of Nesterov's accelerated gradient reduces (for quadratics) to Bessel's differential equation, which can be solved analytically. That's an unexpectedly beautiful result to me... (The quadratic case is written out below.)

web.stanford.edu/~boyd/papers...
August 28, 2025 at 2:20 PM
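The result as sketched from memory; see the linked paper (Su, Boyd & Candès) for the precise statement and conditions.

```latex
\[
  \ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\bigl(X(t)\bigr) = 0,
  \qquad X(0) = x_0, \quad \dot{X}(0) = 0 .
\]
For the quadratic $f(x) = \tfrac{\lambda}{2}x^{2}$ this reads
$t\ddot{X} + 3\dot{X} + \lambda t X = 0$; substituting $X(t) = u(t)/t$ turns it
into Bessel's equation of order one in $s = \sqrt{\lambda}\,t$, so that
\[
  X(t) = \frac{2\,x_0}{\sqrt{\lambda}\,t}\, J_{1}\!\bigl(\sqrt{\lambda}\,t\bigr).
\]
```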