for the fun collaboration!
@unccs.bsky.social
Paper: arxiv.org/abs/2504.13079
Data and Code: github.com/HanNight/RAM...
➡️ As document support becomes imbalanced, baselines ignore under-supported correct answers, but MADAM-RAG maintains stable performance
➡️ As misinformation increases 📈, baselines degrade sharply (−46%) while MADAM-RAG remains more robust
Increasing debate rounds in MADAM-RAG improves performance by letting agents refine their answers through debate.
The aggregator provides even greater gains, especially in early rounds, reconciling conflicting views & suppressing misinfo.
MADAM-RAG consistently outperforms concatenated-prompt and Astute RAG baselines across all three datasets and model backbones.
1️⃣ Independent LLM agents - each generates an intermediate response conditioned on a single doc
2️⃣ Centralized aggregator - reconciles the agents' answers across docs
3️⃣ Iterative multi-round debate - agents refine their answers over rounds (rough sketch below)
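A minimal Python sketch of that loop, under my own assumptions: `llm` is a stand-in for any chat-completion call, and the prompts/loop structure are illustrative rather than the paper's released code.

```python
# Illustrative MADAM-RAG-style loop: per-document agents + aggregator + debate rounds.

def llm(prompt: str) -> str:
    """Placeholder for a call to whatever LLM backbone you use."""
    raise NotImplementedError

def madam_rag(query: str, docs: list[str], num_rounds: int = 3) -> str:
    summary = ""                     # aggregator's running summary of the debate
    responses = ["" for _ in docs]
    for _ in range(num_rounds):
        # 1) Each agent sees only its own document (plus last round's summary)
        #    and produces an intermediate answer.
        for i, doc in enumerate(docs):
            responses[i] = llm(
                f"Question: {query}\nDocument: {doc}\n"
                f"Aggregator summary so far: {summary}\n"
                "Answer based on your document; revise only if the summary gives good reason."
            )
        # 2) A centralized aggregator reconciles the per-document answers,
        #    keeping well-supported answers and dropping misinformation/noise.
        summary = llm(
            f"Question: {query}\nAgent answers: {responses}\n"
            "Aggregate: keep every well-supported answer, explain conflicts, "
            "and discard claims that look like misinformation or irrelevant noise."
        )
    return summary
```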
➡️ Ambiguous queries w/ multiple valid answers
➡️ Imbalanced document support (some answers backed by many sources, others by fewer)
➡️ Docs w/ misinformation (plausible but wrong claims) or noisy/irrelevant content
@unccs.bsky.social
Paper: arxiv.org/abs/2502.01619
Code+Datasets: github.com/archiki/UTGe...
For partially correct code with subtle errors (our MBPP+Fix hard split), debugging with UTGen improves over baselines by >12.35% on Qwen 2.5!
1️⃣ Test-time scaling (self-consistency over multiple samples) to increase output accuracy
2️⃣ Validation & backtracking: generate multiple UTs for validation, accept an edit only when the overall pass rate increases & backtrack otherwise (sketch below)
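Here's a hedged Python sketch of that validation-&-backtracking idea. Helper names (`generate_unit_tests`, `propose_fix`) and the pass-rate check are hypothetical stand-ins for illustration, not the released UTGen API.

```python
# Accept a code edit only when it raises the pass rate on generated unit tests;
# otherwise backtrack to the previous best version.

def pass_rate(code: str, unit_tests: list[str]) -> float:
    """Fraction of generated unit tests the candidate code passes."""
    passed = 0
    for test in unit_tests:
        try:
            exec(code + "\n" + test, {})  # run code + test in a fresh namespace
            passed += 1
        except Exception:
            pass
    return passed / max(len(unit_tests), 1)

def debug_with_backtracking(code: str, task: str, generate_unit_tests, propose_fix,
                            num_tests: int = 5, num_rounds: int = 3) -> str:
    tests = generate_unit_tests(task, num_tests)        # multiple UTs for validation
    best_code, best_score = code, pass_rate(code, tests)
    for _ in range(num_rounds):
        candidate = propose_fix(task, best_code, tests)  # LLM-proposed edit
        score = pass_rate(candidate, tests)
        if score > best_score:                           # accept only if pass rate improves
            best_code, best_score = candidate, score
        # otherwise backtrack: keep the previous best version
    return best_code
```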