Davide Paglieri
@dpaglieri.bsky.social
PhD Student at UCL.
Previously AI Research Engineer at Bending Spoons
🚨This week's new entry on balrogai.com is Microsoft Phi-4 (14B model)

While Phi-4 excels on static benchmarks such as math competitions, BALROG reveals that it falls short as an agent. More research is needed on how to improve agentic performance.
January 16, 2025 at 11:30 AM
🚨BALROG leaderboard update

This week's new entries on balrogai.com are:

Llama 3.3 70B Instruct 🫤
Claude 3.5 Haiku✨
Mistral-Nemo-it (12B) 🆗

GitHub: github.com/balrog-ai/BA...
December 12, 2024 at 11:30 AM
Reposted by Davide Paglieri
I'm excited to share a new paper: "Mastering Board Games by External and Internal Planning with Language Models"

storage.googleapis.com/deepmind-med...

(also soon to be up on arXiv, once it's been processed there)
December 5, 2024 at 7:49 AM
Reposted by Davide Paglieri
Introducing 🧞Genie 2 🧞 - our most capable large-scale foundation world model, which can generate a diverse array of consistent worlds, playable for up to a minute. We believe Genie 2 could unlock the next wave of capabilities for embodied agents 🧠.
December 4, 2024 at 4:01 PM
It's great to see BALROG featured on Jack Clark's Import AI newsletter!

Check out what he had to say about it here:
jack-clark.net

And check out BALROG's leaderboard on balrogai.com
December 4, 2024 at 9:37 AM
Reposted by Davide Paglieri
Do you know what rating you’ll give after reading the intro? Are your confidence scores 4 or higher? Do you not respond in rebuttal phases? Are you worried how it will look if your rating is the only 8 among 3’s? This thread is for you.
November 27, 2024 at 5:25 PM
Reposted by Davide Paglieri
Excited to announce "BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games", led by UCL DARK's @dpaglieri.bsky.social! Douwe Kiela's plot below is maybe the scariest for AI progress: LLM benchmarks are saturating at an accelerating rate. BALROG to the rescue. This will keep us busy for years.
November 22, 2024 at 11:27 AM
Reposted by Davide Paglieri
This may sound odd, but game-based benchmarks are some of the most useful for AI, since we have human scores and they require reasoning, planning & vision

The hardest of all is NetHack. No AI is close, and I suspect that an AI that can fairly win/ascend would need to be AGI-ish. Paper: balrogai.com
November 23, 2024 at 4:32 AM
Reposted by Davide Paglieri
Your LLM shall not pass! 🧙‍♂️

... unless it's really good in reasoning and games!

Check out this new amazing benchmark BALROG 👾 from @dpaglieri.bsky.social and team 👇
November 21, 2024 at 4:47 PM
Tired of saturated benchmarks? Want scope for a significant leap in capabilities?

🔥 Introducing BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games!

BALROG is a challenging benchmark for LLM agentic capabilities, designed to stay relevant for years to come.

1/🧵
November 21, 2024 at 4:24 PM