Adeel Razi
@adeelrazi.bsky.social
Computational Neuroscientist, NeuroAI, Causality. Monash, UCL, CIFAR. Lab: https://comp-neuro.github.io/
I am so sorry to hear about his passing. Very shocking and unexpected news; his work and memory live on.
November 16, 2025 at 9:01 AM
Congratulations and looking forward to seeing what you do there!
September 24, 2025 at 6:48 AM
That's really interesting and relevant, will read closely and cite it in the related work. We currently cite this one for binary NNs: arxiv.org/abs/2002.10778
Training Binary Neural Networks using the Bayesian Learning Rule
Neural networks with binary weights are computation-efficient and hardware-friendly, but their training is challenging because it involves a discrete optimization problem. Surprisingly, ignoring the d...
arxiv.org
May 27, 2025 at 10:40 AM
Re batchnorm: it's effective in many settings, but can be brittle in others, such as with small batch sizes, non-i.i.d. data, or models with stochasticity in the forward pass. In these cases, the running estimates of mean/variance can drift or misalign with test-time behaviour.

2/2
May 27, 2025 at 7:49 AM
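To make the batch-norm point above concrete, here is a minimal PyTorch sketch (my own illustration, not from the thread): with tiny batches, the running mean/variance are exponential moving averages of very noisy per-batch estimates, so the statistics used at test time can sit well away from the true data statistics.

```python
# Minimal sketch of batch norm's running statistics drifting with tiny batches.
# Assumes PyTorch; the numbers are illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(num_features=4, momentum=0.1)

bn.train()
for _ in range(100):
    # Batch size 2: each batch's mean/variance is a very noisy estimate,
    # and the running buffers are an EMA of these noisy estimates.
    x = 3.0 * torch.randn(2, 4) + 1.0
    bn(x)

bn.eval()
# At eval time the stored running estimates are used, not the batch statistics.
print("running mean:", bn.running_mean)  # scattered around the true mean of 1.0
print("running var: ", bn.running_var)   # scattered around the true variance of 9.0
```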
Yes, absolutely, "noisy" was shorthand & it does depend on the surrogate. What I meant is that common surrogates can have high gradient variance, especially when their outputs saturate. That variance can hurt learning, particularly in deeper networks or those with binary/stochastic activations.
1/2
May 27, 2025 at 7:47 AM
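For readers unfamiliar with surrogate gradients, a minimal PyTorch sketch of one common choice (a steep-sigmoid derivative standing in for the Heaviside's; my illustration, not the estimator from the paper). Once a unit sits far from threshold the surrogate derivative is essentially zero, so the gradient signal is carried by the few units near threshold, which is where the noisiness bites in deep or stochastic networks.

```python
# Minimal sketch of a sigmoid-style surrogate gradient for a hard spike.
# Assumes PyTorch; not the paper's method.
import torch

class SurrogateSpike(torch.autograd.Function):
    """Forward: hard threshold. Backward: derivative of a steep sigmoid."""
    beta = 10.0  # steepness of the surrogate; an illustrative choice

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        sig = torch.sigmoid(SurrogateSpike.beta * v)
        surrogate_grad = SurrogateSpike.beta * sig * (1 - sig)  # peaks at v = 0
        return grad_output * surrogate_grad

v = torch.linspace(-2.0, 2.0, 9, requires_grad=True)
SurrogateSpike.apply(v).sum().backward()
print(v.grad)  # sizeable only near the threshold, vanishingly small once |v| grows
```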
Of course, whenever you can!
May 26, 2025 at 7:43 AM
Why does KL divergence show up everywhere in machine learning?

Because it's not just a distance, it's the cost of believing your own model too much.

Minimizing KL = reducing surprise = optimizing variational free energy.

A silent principle behind robust inference.

5/6
May 26, 2025 at 4:04 AM
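For reference, the textbook identity behind that line (standard variational inference, not specific to this thread): the variational free energy equals the KL to the true posterior plus the surprise, so driving the free energy down drives down both.

```latex
% Textbook identity: free energy = KL to the true posterior + surprise.
F[q] = \mathbb{E}_{q(z)}\big[\log q(z) - \log p(x, z)\big]
     = \mathrm{KL}\big[q(z)\,\|\,p(z \mid x)\big] \;-\; \log p(x)
```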
Our key innovation:

- A family of importance-weighted straight-through estimators (IW-ST), which unify and generalize previous methods.
- No need for backprop-through-noise tricks.
- No batch norm.

Just clean, effective training.

4/6
May 26, 2025 at 4:04 AM
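For context, here is a plain (non-importance-weighted) straight-through estimator in PyTorch, the baseline that IW-ST-style estimators unify and generalise. This is my own minimal sketch, not the estimator proposed in the paper.

```python
# Minimal sketch of a vanilla straight-through estimator for binary activations.
# Assumes PyTorch; NOT the paper's IW-ST.
import torch

def binary_ste(logits: torch.Tensor) -> torch.Tensor:
    """Forward: hard 0/1 Bernoulli sample. Backward: gradient flows as if the
    output were the underlying probabilities ('straight through' the sampling)."""
    probs = torch.sigmoid(logits)
    hard = torch.bernoulli(probs)
    # Detach trick: forward value is `hard`, gradient path goes through `probs`.
    return hard + probs - probs.detach()

logits = torch.zeros(4, requires_grad=True)
binary_ste(logits).sum().backward()
print(logits.grad)  # sigmoid'(0) = 0.25 for each element
```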
We view training as Bayesian inference, minimizing KL divergence between a posterior and an amortized prior.

This lets us derive a principled loss from first principles—grounded in variational free energy, not heuristics.

3/6
May 26, 2025 at 4:04 AM
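In generic form, and only as a rough guide (the paper's amortised-prior variant may differ in detail), a variational free-energy loss over weights w given data D reads:

```latex
% Generic variational free-energy objective over weights w and data D;
% a rough guide only -- the paper's amortised-prior variant may differ.
\mathcal{F}[q] = \mathrm{KL}\big[q(w)\,\|\,p(w)\big]
               \;-\; \mathbb{E}_{q(w)}\big[\log p(\mathcal{D} \mid w)\big]
```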
Binary/spiking neural networks are efficient and brain-inspired—but notoriously difficult to train.

Why? Discrete activations → non-differentiable.

Most current methods either approximate gradients or add noisy surrogates.

We do something different.

2/6
May 26, 2025 at 4:04 AM
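To see the non-differentiability concretely, a tiny PyTorch sketch (mine, for illustration): a hard threshold is piecewise constant, so autograd gets no gradient path at all.

```python
# Minimal sketch: a hard threshold gives autograd nothing to work with.
import torch

v = torch.linspace(-2.0, 2.0, 5, requires_grad=True)
spikes = (v > 0).float()     # Heaviside-style binary activation
print(spikes.requires_grad)  # False: the comparison is non-differentiable,
                             # so spikes.sum().backward() would raise an error
```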