Research @ Sony AI
AI should learn from its experiences, not copy your data.
My website for answering RL questions: https://www.decisionsanddragons.com/
Views and posts are my own.
* The task isn't truly sparse.
* Pre-training highly correlates different ways of answering (like policies), so you get very good generalization.
* Inference-time search means you only need to slightly increase the probabilities to see major changes.
If the pre-trained model had fairly uniform probabilities over the different modes at the start, you don't need to modify the model much to get a major change in behavior.
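A minimal sketch of this effect, with made-up numbers: if a model puts near-uniform probability over a few answer "modes," a small logit shift produces a large change in behavior.

```python
import math

def softmax(logits):
    """Convert logits to probabilities."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Four hypothetical answer modes with uniform logits.
before = softmax([0.0, 0.0, 0.0, 0.0])  # each mode gets 0.25

# Nudge one mode's logit by just +1.5 (a small update).
after = softmax([1.5, 0.0, 0.0, 0.0])

print(round(before[0], 2))  # 0.25
print(round(after[0], 2))   # 0.6 -- a small shift more than doubles it
```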
For that I might suspect that the way inference works is a factor.
The observables were states & actions, but the agent had an understanding of reward functions, options, etc.
Really dumb methods (like GRPO) are more effective uses of compute than any "smart" exploration method, because we sadly still suck at exploration.
On-policy methods can have worse sample complexity than off-policy methods, but for reasons orthogonal to exploration. It has more to do with data reuse than exploration.
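To illustrate the data-reuse point with a toy sketch (numbers are arbitrary): an off-policy replay buffer lets each transition contribute to many gradient updates, while a strictly on-policy learner discards data after one update.

```python
import random
from collections import deque

# Hypothetical replay buffer holding 100 collected transitions.
buffer = deque(maxlen=1000)
for t in range(100):
    buffer.append(("state", "action", 0.0, "next_state"))

# Off-policy: 1000 gradient steps, each sampling a minibatch of
# 10 old transitions -- the same data is reused over and over.
for step in range(1000):
    batch = random.sample(buffer, 10)

updates_per_transition = 1000 * 10 / len(buffer)
print(updates_per_transition)  # 100.0: each transition used ~100 times
# An on-policy method would have used each transition exactly once.
```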
From a pretrained model, many rollouts hit a positive signal and many hit a negative signal.
That's actually a rich signal to learn from and is far more important than the UCT exploration.
That is, in Go, your reward is 0 for most time steps and only +1/-1 at the end. That sounds sparse, but not from an algorithmic perspective.
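A quick sketch of why the terminal-only reward still isn't sparse algorithmically: the Monte Carlo return propagates the final +1/-1 back through the whole episode, so every step gets a nonzero learning target. (The episode length and discount here are illustrative, not real Go.)

```python
gamma = 0.99
rewards = [0.0] * 199 + [1.0]  # 200-step game, +1 only at the end

# Compute discounted returns backward from the terminal reward.
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()

print(returns[-1])           # 1.0 at the final step
print(round(returns[0], 3))  # 0.135: even move 1 has a nonzero target
```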
While it might seem strange, GenAI research is important to building those things. But the tech industry has glommed onto the wrong bits.
By Wikipedia's definition of social science, I would be inclined to agree that "health care" is a social science, but "medicine" is not.