Gokul Swamy
Gokul Swamy
@gokul.dev
PhD student at @cmurobotics.bsky.social working on efficient algorithms for interactive learning (e.g. imitation / RL / RLHF). no model is an island. prefers email. https://gokul.dev/. on the job market!
Woooaaaah :O
October 22, 2025 at 5:51 PM
I've been really enjoying the new Ninajirachi album -- it's very Boiler Room-core :)
August 23, 2025 at 6:19 PM
Thanks for the shout-out and I hope the lectures were at least somewhat understandable! Yeah, once things settle down a bit for me, I'd like to more deeply understand the connection between Rust's structural estimation and IRL as I conceive of it.
August 23, 2025 at 6:16 PM
We therefore advocate for caution when making or evaluating claims about LLM reasoning and beyond with GRPO and PPO, ideally using algorithms like RLOO or REBEL instead. Check out our blog post for links to our code and W&B logs if you'd like to reproduce our experiments.
July 15, 2025 at 5:46 PM
While this worked out for the better on some seeds, it doesn't have to in general. After all, an algorithm that behaves unexpectedly *well* in one setting can perform unexpectedly *poorly* in another, perhaps more important, setting.
July 15, 2025 at 5:46 PM
We see similar results on a didactic bandit problem -- i.e. a problem that has nothing to do with LLMs or reasoning! This implies that PPO / GRPO are fundamentally *not* following the true policy gradient.
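(For intuition, here's a minimal sketch of the *kind* of diagnostic I mean -- not our actual experimental setup, and all hyperparameters here are made up: a softmax bandit where rewards are drawn independently of the chosen arm, so the true policy gradient is exactly zero and any systematic drift away from the initial policy reveals bias.)

```python
# Minimal sketch (illustrative, not the paper's code): a K-armed bandit with a
# softmax policy and rewards drawn independently of the chosen arm. The true
# policy gradient is exactly zero, so a vanilla REINFORCE update should keep
# the policy (roughly) uniform in expectation.
import numpy as np

rng = np.random.default_rng(0)
K = 10                       # number of arms (arbitrary)
theta = np.zeros(K)          # softmax logits; uniform initial policy

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def vanilla_pg_update(theta, lr=0.1, batch=256):
    """One REINFORCE step with rewards that ignore the arm entirely."""
    pi = policy(theta)
    arms = rng.choice(K, size=batch, p=pi)
    rewards = rng.normal(size=batch)         # random rewards, independent of arm
    grad = np.zeros(K)
    for a, r in zip(arms, rewards):
        score = -pi
        score[a] += 1.0                      # d log pi(a) / d theta for softmax
        grad += r * score
    return theta + lr * grad / batch

for _ in range(100):
    theta = vanilla_pg_update(theta)

# An algorithm whose heuristics (e.g. clipping) systematically move this policy
# away from uniform is not following the true policy gradient.
print(policy(theta))
```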
July 15, 2025 at 5:46 PM
We find that RLOO (an unbiased estimate of the vanilla PG) and REBEL (a regression-based approximation of online mirror descent) preserve performance as expected. In contrast, algorithms like PPO / GRPO that include heuristics (e.g. clipping) show a marked and unexpected change in performance.
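(To make the contrast concrete, here's a minimal sketch -- my own simplified rendering, not our actual code -- of the two styles of advantage estimates plus the clipping heuristic; variable names and values are illustrative.)

```python
# Sketch of the contrast above: a leave-one-out baseline (RLOO-style) vs. a
# group-normalized advantage plus PPO-style ratio clipping (GRPO-style).
# Illustrative only; not the paper's implementation.
import numpy as np

def rloo_advantages(rewards):
    """Leave-one-out baseline: each sample's reward minus the mean reward of
    the *other* samples in its group (an unbiased baseline)."""
    rewards = np.asarray(rewards, dtype=float)
    k = len(rewards)
    loo_mean = (rewards.sum() - rewards) / (k - 1)
    return rewards - loo_mean

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: standardize rewards within the group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(ratio, adv, clip_eps=0.2):
    """PPO/GRPO clipped objective for one sample; the clipping heuristic is
    what makes the resulting update deviate from the true policy gradient."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return np.minimum(unclipped, clipped)

rewards = [1.0, 0.0, 0.0, 1.0]      # e.g. 0/1 verifier rewards for 4 samples
print(rloo_advantages(rewards))      # unbiased, leave-one-out baseline
print(grpo_advantages(rewards))      # group-standardized
```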
July 15, 2025 at 5:46 PM
With a truly random reward function, all policies look equally good, so the *true* policy gradient is zero: the initial policy is already optimal by construction. We'd therefore expect performance to flatline. We use random rewards as a *diagnostic task* to compare different RL algs.
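(For concreteness, here's the one-line calculation behind that claim, written out in my own notation: if the reward r is drawn independently of the sampled completion y, then

\[
\nabla_\theta\, \mathbb{E}_{y \sim \pi_\theta,\, r}\!\left[r\right]
 = \mathbb{E}_{y \sim \pi_\theta,\, r}\!\big[r\, \nabla_\theta \log \pi_\theta(y)\big]
 = \mathbb{E}[r]\; \mathbb{E}_{y \sim \pi_\theta}\!\big[\nabla_\theta \log \pi_\theta(y)\big]
 = \mathbb{E}[r] \cdot 0
 = 0,
\]

since the score function has mean zero under the policy: \(\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y)] = \nabla_\theta \sum_y \pi_\theta(y) = \nabla_\theta 1 = 0\).)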
July 15, 2025 at 5:46 PM
Led by Owen Oertell & Wenhao Zhan, joint w/ Steven Wu, Kiante Brantley, Jason Lee, and Wen Sun. If a project has got Wen, Owen, Wenhao, and Qwen on it, you know it's gotta be good 😛.
July 15, 2025 at 5:46 PM
While I can't promise everything will be crystal-clear after going through the lectures (especially because of my handwriting :p), I hope that if nothing else, you can tell how beautiful we all find these ideas. If that feeling comes across, I'll feel like I have succeeded! :)
June 20, 2025 at 3:53 AM
The second was being able to teach this course with my amazing advisors, Drew Bagnell and Steven Wu -- the folks I learned all of this stuff from. Fun fact: because of parking fees, Drew actually *paid* to lecture. And I'm always grateful to ZSW for pushing me out of the nest.
June 20, 2025 at 3:53 AM
Two other things made this course particularly special. The first was the students and their *incredible* questions -- there were so many times where I was like wow, it took me *YEARS* before I realized that was the right question to be asking.
June 20, 2025 at 3:53 AM
We also had wonderful guest lectures from Yuda Song on hybrid RL (youtu.be/1B2XGXQ2hfA), Sanjiban Choudhury on scaling imitation (youtu.be/KnXSeTuCgFI), and Wen Sun on RLHF algorithms (youtu.be/qdkBZJywi_4).
June 20, 2025 at 3:53 AM
My favorite lectures to give were on the value of interaction in imitation / RLHF! youtu.be/uESAXg-CXFs, youtu.be/N8-Nh_iTmps, youtu.be/qHvB30J5gyo, youtu.be/ZzFjoH47GIg. It took 5 years, but I finally have an answer that at least I find compelling :p.
June 20, 2025 at 3:53 AM
To do so, we worked backwards from things like ChatGPT and RMA and "backed out" a "dependency graph". We then did a "forward pass" over the semester, going from online learning, to game solving, to core RL, to imitation learning / robot learning, to RLHF / LLM fine-tuning.
June 20, 2025 at 3:53 AM
I think in a field as fast-paced as machine learning, a good course gives students a conceptual framework for quickly understanding new developments + a sense of what is actually "new" vs. classical algorithms. We also wanted to explain *when* scale isn't "all you need."
June 20, 2025 at 3:53 AM
You can access all the content here:
Course Website: interactive-learning-algos.github.io
Lecture Playlist: youtube.com/playlist?lis...
Scribe Notes "Book": interactive-learning-algos.github.io/assets/pdfs/....
Homeworks / class competition material are also public!
June 20, 2025 at 3:53 AM