Lightnews — Scholar-powered news

Xing Han Lu

@xhluca.bsky.social

👨‍🍳 Web Agents @mila-quebec.bsky.social

🎒 @mcgill-nlp.bsky.social

Xing Han Lu

@xhluca.bsky.social

Without 🐦 and 🦋, are we left with LinkedIn?

May 10, 2025 at 8:55 PM

Xing Han Lu

@xhluca.bsky.social

Daily Paper: huggingface.co/papers/2504....
Data: huggingface.co/datasets/McG...
Demo: huggingface.co/spaces/McGil...
Leaderboard: huggingface.co/spaces/McGil...
Arxiv: arxiv.org/abs/2504.08942

Paper page - AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

April 15, 2025 at 7:10 PM

Xing Han Lu

@xhluca.bsky.social

An amazing team effort with: @a-kazemnejad.bsky.social Nick @arkil.bsky.social Dongchan Alejandra @karstanczak.bsky.social @ptshaw.bsky.social @chrisjpal.bsky.social @sivareddyg.bsky.social

April 15, 2025 at 7:10 PM

Xing Han Lu

@xhluca.bsky.social

We find that rule-based evals underreport success rates, and no single LLM judge excels across all benchmarks.
We collect trajectories from web agents built on four LLMs (Claude 3.7, GPT-4o, Llama 3.3, Qwen2.5-VL) across popular web benchmarks (AssistantBench, WebArena, VWA, WorkArena, WorkArena++).

April 15, 2025 at 7:10 PM

Xing Han Lu

@xhluca.bsky.social

bsky.app/profile/sara...

Sara Vera Marjanovic @saravera.bsky.social · Apr 1

Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks; investigating its capabilities, limitations, and behaviour.
🔗: mcgill-nlp.github.io/thoughtology/

A circular diagram with a blue whale icon at the center. The diagram shows 8 interconnected research areas around LLM reasoning represented as colored rectangular boxes arranged in a circular pattern. The areas include: §3 Analysis of Reasoning Chains (central cloud), §4 Scaling of Thoughts (discussing thought length and performance metrics), §5 Long Context Evaluation (focusing on information recall), §6 Faithfulness to Context (examining question answering accuracy), §7 Safety Evaluation (assessing harmful content generation and jailbreak resistance), §8 Language & Culture (exploring moral reasoning and language effects), §9 Relation to Human Processing (comparing cognitive processes), §10 Visual Reasoning (covering ASCII generation capabilities), and §11 Following Token Budget (investigating direct prompting techniques). Arrows connect the sections in a clockwise flow, suggesting an iterative research methodology.

April 12, 2025 at 4:12 PM

Xing Han Lu

@xhluca.bsky.social

WebArena by Zhou et al; AgentLab and Browsergym by @servicenow.bsky.social allowed us to explore the latest agents; @gradio-hf.bsky.social enabled us to design UIs for implementing our ARIA framework, whereas @hf.co provided a hosting platform for 100GB+ artifacts.

bsky.app/profile/xhlu...

Xing Han Lu @xhluca.bsky.social · Mar 10

Agents like OpenAI Operator can solve complex computer tasks, but what happens when users use them to cause harm, e.g. spread misinformation?

To find out, we introduce SafeArena (safearena.github.io), a benchmark to assess the capabilities of web agents to complete harmful web tasks. A thread 👇

March 10, 2025 at 5:45 PM

Xing Han Lu

@xhluca.bsky.social

This work was done by an awesome team of authors: @adadtur.bsky.social, Nick, @arkil.bsky.social, @karstanczak.bsky.social, Esin, @spandanagella.bsky.social, and @sivareddyg.bsky.social.

It's also important to recognize the incredible works that helped us build SafeArena:

March 10, 2025 at 5:45 PM
