Xing Han Lu
@xhluca.bsky.social
👨🍳 Web Agents @mila-quebec.bsky.social
🎒 @mcgill-nlp.bsky.social
Without 🐦 and 🦋, are we left with LinkedIn?
May 10, 2025 at 8:55 PM
Daily Paper: huggingface.co/papers/2504....
Data: huggingface.co/datasets/McG...
Demo: huggingface.co/spaces/McGil...
Leaderboard: huggingface.co/spaces/McGil...
Arxiv: arxiv.org/abs/2504.08942
Paper page - AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Join the discussion on this paper page
huggingface.co
April 15, 2025 at 7:10 PM
An amazing team effort with: @a-kazemnejad.bsky.social Nick @arkil.bsky.social Dongchan Alejandra @karstanczak.bsky.social @ptshaw.bsky.social @chrisjpal.bsky.social @sivareddyg.bsky.social
April 15, 2025 at 7:10 PM
We find that rule-based evals underreport success rates, and no single LLM judge excels across all benchmarks.
We collect trajectories from web agents built on four LLMs (Claude 3.7, GPT-4o, Llama 3.3, Qwen2.5-VL) across popular web benchmarks (AssistantBench, WebArena, VWA, WorkArena, WorkArena++)
April 15, 2025 at 7:10 PM
Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks, investigating its capabilities, limitations, and behaviour.
🔗: mcgill-nlp.github.io/thoughtology/
April 12, 2025 at 4:12 PM
WebArena by Zhou et al.; AgentLab and BrowserGym by @servicenow.bsky.social allowed us to explore the latest agents; @gradio-hf.bsky.social enabled us to design UIs for implementing our ARIA framework, whereas @hf.co provided a hosting platform for 100GB+ artifacts.
bsky.app/profile/xhlu...
Agents like OpenAI Operator can solve complex computer tasks, but what happens when users use them to cause harm, e.g. spread misinformation?
To find out, we introduce SafeArena (safearena.github.io), a benchmark to assess the capabilities of web agents to complete harmful web tasks. A thread 👇
March 10, 2025 at 5:45 PM
This work was done by an awesome team of authors: @adadtur.bsky.social, Nick, @arkil.bsky.social, @karstanczak.bsky.social, Esin, @spandanagella.bsky.social, and @sivareddyg.bsky.social.
It's also important to recognize the incredible works that helped us build SafeArena:
March 10, 2025 at 5:45 PM