Fixating on multi-agent RL, Neuro-AI and decisions
Ēka ē-akimiht
https://danemalenfant.com/
A policy-gradient agent's performance suffers with more agents, but self-correction still stabilizes learning: arxiv.org/abs/2505.20579
To communicate this to a general audience and the #art community, I built a minimal task: two Gaussian bandits. One agent optimizes with entropy; the other doesn’t. Mid-training, the reward distribution jumps.
I proposed a reinforcement-learning (RL) demo: add a maximum-entropy term to increase the longevity of systems in a non-stationary environment. This is well known to the RL research community: openreview.net/forum?id=PtS...
(photo by Félix Bonne-Vie)
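The demo above can be sketched in code. This is a minimal illustration, not the actual demo: it assumes a REINFORCE-style softmax policy over two Gaussian arms, with an optional entropy bonus, and a mid-training swap of the arm means to model the non-stationary jump. All function names and hyperparameters here are illustrative.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def train(entropy_coef, steps=4000, lr=0.1, seed=0):
    """REINFORCE on a two-armed Gaussian bandit whose means swap mid-training."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(2)
    means = np.array([1.0, 0.0])      # arm 0 starts out better
    rewards = []
    for t in range(steps):
        if t == steps // 2:           # non-stationary jump: the best arm swaps
            means = means[::-1].copy()
        p = softmax(logits)
        a = rng.choice(2, p=p)
        r = rng.normal(means[a], 1.0)
        rewards.append(r)
        # Gradient of log pi(a) w.r.t. the logits of a softmax policy: onehot(a) - p
        pg = -p.copy()
        pg[a] += 1.0
        # Gradient of the entropy H = -sum p log p w.r.t. the logits: -p * (log p + H)
        logp = np.log(p + 1e-12)
        H = -(p * logp).sum()
        ent = -p * (logp + H)
        logits += lr * (r * pg + entropy_coef * ent)
    return float(np.mean(rewards[-500:]))  # average reward after the jump
```

Comparing the post-jump average of `train(0.0)` against `train(0.05)` is one way to show the effect: without the entropy term the policy can collapse onto the pre-jump arm, while the bonus keeps enough exploration alive to adapt.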
Rather than "if x then y," this tested "if not x, then not y."
This inhibits learning of the sub-policy for maximizing collective reward. Agents compete even with a larger reward signal not to