Parishad BehnamGhader
@parishadbehnam.bsky.social
PhD student at McGill University and Mila — Quebec AI Institute
✨ RAG-based Exploitation
In a RAG-based setup, even LLMs optimized for safety comply with malicious requests when harmful passages are provided in-context to ground their generation (e.g., with retrieval, Llama3 generates harmful responses to 67.12% of queries). 😬
March 12, 2025 at 4:17 PM
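For context, a minimal sketch of the standard RAG prompting pattern at play here, shown with a benign query. The prompt template is illustrative, not the paper's actual pipeline:

```python
def grounded_prompt(query: str, passages: list[str]) -> str:
    """Splice retrieved passages into the model's context (standard RAG prompting)."""
    context = "\n\n".join(f"Passage [{i}]: {p}" for i, p in enumerate(passages, 1))
    return (
        "Use the passages below to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# The finding above: when the retrieved passages themselves contain harmful
# instructions, the model grounds its answer on them, sidestepping safety tuning.
print(grounded_prompt(
    "How do I harden my home network?",
    ["Change the router's default admin credentials.",
     "Disable WPS and switch to WPA3 encryption."],
))
```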
✨ Exploiting Instruction-Following Ability
Using fine-grained queries, a malicious user can steer the retriever to select specific passages that precisely match their malicious intent (e.g., constructing an explosive device with specific materials). 😈
March 12, 2025 at 4:16 PM
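A toy sketch of that steering effect, using benign stand-in queries and a feature-hashing embedder in place of the paper's LLM-based retrievers. Everything here is illustrative, not the paper's code; the point is only that extra query detail pulls the top-ranked passage toward the one matching those details:

```python
import re
import numpy as np

def embed(text: str, dim: int = 1024) -> np.ndarray:
    # Toy feature-hashing embedding; the paper's retrievers are LLM-based
    # (LLM2Vec, NV-Embed). This stand-in only captures token overlap.
    v = np.zeros(dim)
    for tok in re.findall(r"[a-z0-9-]+", text.lower()):
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

corpus = [
    "Overview of common bread-baking mistakes.",
    "Step-by-step recipe using rye flour and a sourdough starter.",
    "Basic bread recipe with all-purpose flour.",
]

# A coarse query vs. a fine-grained one: the added details steer the
# retriever toward the passage that matches them exactly.
for query in ["bread recipe",
              "bread recipe using rye flour and a sourdough starter"]:
    best = max(corpus, key=lambda p: embed(query) @ embed(p))
    print(f"{query!r} -> {best!r}")
```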
✨ Direct Malicious Retrieval
LLM-based retrievers correctly select malicious passages for more than 78% of AdvBench-IR queries (top-5 accuracy), a concerning level of capability. We also find that LLM alignment transfers poorly to retrieval. ⚠️
March 12, 2025 at 4:16 PM
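A minimal sketch of how a top-5 metric like this is typically computed, assuming precomputed L2-normalized query and passage embeddings. The function and toy data are illustrative, not the paper's evaluation code:

```python
import numpy as np

def top_k_accuracy(query_embs, corpus_embs, gold_ids, k=5):
    """Fraction of queries whose gold passage appears among the top-k
    cosine-similarity neighbours (embeddings assumed L2-normalized)."""
    sims = query_embs @ corpus_embs.T           # (num_queries, num_passages)
    top_k = np.argsort(-sims, axis=1)[:, :k]    # best k passage indices per query
    return float(np.mean([g in row for g, row in zip(gold_ids, top_k)]))

# Toy usage with random unit vectors; gold passages are planted so the
# example scores 1.0. The paper reports >78% top-5 on AdvBench-IR.
rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 64));  Q /= np.linalg.norm(Q, axis=1, keepdims=True)
C = rng.normal(size=(100, 64)); C /= np.linalg.norm(C, axis=1, keepdims=True)
C[:10] = Q
print(top_k_accuracy(Q, C, gold_ids=list(range(10))))  # -> 1.0
```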
✨ AdvBench-IR
We create AdvBench-IR to evaluate whether retrievers, such as LLM2Vec and NV-Embed, can select relevant harmful text from large corpora for a diverse range of malicious requests.
March 12, 2025 at 4:15 PM