We found that some models, such as Gemini 2.5 and Claude 3.7, could detect the change and adapt their strategy when aided by summarization, recovering performance. 10/13
The act of summarizing forced them to consolidate their knowledge, enabling them to form and execute better strategies in later trials. 8/13
So, we prompted them to write a summary of their findings after each trial. The effect was dramatic. 8/13
They gathered data but failed to integrate it into a better strategy. Meta-learning did not occur naturally. 7/13
This shows the challenge isn't basic, single-turn reasoning. They can select informative actions in the moment. 6/13
This isolates different facets of exploration from Feature World. 5/13
1️⃣ Feature World (both text-based and 3D in Construction Lab): A stateless setting to test raw information-gathering efficiency. 4/13
Paper👇
arxiv.org/abs/2412.06438
🧵1/13