PhD at Tübingen. Working on post-training diffusion and multimodal models. Previously a research intern at Snapchat and Naver Labs. https://sgk98.github.io/
My guess is evaluating multimodal models turns out to be tricky because the language model is much larger/stronger than the other modalities, leading to a skewed approach by these models. In the long run, I'm hopeful that long-form streaming video holds solutions for a lot of the problems we face.
November 25, 2025 at 7:48 AM
Earlier this year, we spent a lot of time pushing the limits of blind baselines for vision-language compositionality benchmarks and found that they're surprisingly close to state-of-the-art on several benchmarks, and that filtering samples wasn't a great solution. Link: arxiv.org/abs/2506.08227
November 25, 2025 at 7:48 AM
Very nice! Am I going crazy or do you use "Pick", "Pick Score", "PickScore", and "PickAScore" to refer to the same reward (i.e., github.com/yuvalkirstain/PickScore)?
November 1, 2025 at 4:32 AM
Oh yes, nobody tells you in a game that there's a tactic in the position. And in a game you need to calculate a sacrifice fully, not play one move at a time. So it's not too hard to overfit, but doing online tactics well is a necessary but not sufficient condition to play chess well.
April 23, 2025 at 9:50 AM
Maybe if people tried to overfit to online tactics ratings, sure. But having good calculation skills and awareness of tactical patterns is essential to being a good chess player, while "leetcode" is not essential to being a good programmer?
April 23, 2025 at 8:48 AM