- SOTA LLMs achieve only 40-60% performance on CHASE
- CHASE separates models well, in contrast to the near-identical scores they get on standard benchmarks like GSM8K
- While today's LLMs advertise 128k-1M context windows, CHASE shows they struggle to reason reliably even at ~50k tokens of context
1. Bottom-up creation of complex context by “hiding” the components of the reasoning process within it
2. Decomposing the generation pipeline into simpler, “soft-verifiable” sub-tasks (see the sketch after this list)
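
These two ideas can be pictured as a small generation loop. Below is a minimal Python sketch, not the paper's actual pipeline: `llm` stands in for any prompt-to-completion callable, and all prompts, prefixes (e.g. `FACT:`), and helper names are illustrative assumptions.

```python
from typing import Callable

# A minimal sketch of the two ideas above, assuming `llm` is any
# prompt -> completion callable. Prompts, prefixes, and field names
# are illustrative; this is not the paper's actual pipeline.

def generate_example(llm: Callable[[str], str]) -> dict:
    # Idea 1 (bottom-up): first generate the reasoning components, then "hide"
    # each one inside a longer, distracting document, so the assembled context
    # is hard to reason over even though every piece was simple to produce.
    seed = llm(
        "Write a question, its answer, and the 3-5 facts needed to derive it. "
        "Prefix each fact with 'FACT:'."
    )
    facts = [line for line in seed.splitlines() if line.startswith("FACT:")]

    documents = [
        llm(
            "Write a ~500-word document that states the following fact exactly once, "
            f"surrounded by plausible but irrelevant details:\n{fact}"
        )
        for fact in facts
    ]

    # Idea 2 (soft verification): each sub-task output is checked with a cheap
    # yes/no LLM query, which is far easier than verifying the final
    # long-context problem end to end.
    def still_contains(doc: str, fact: str) -> bool:
        verdict = llm(
            f"Does the document below still state the fact '{fact}'? Answer yes or no.\n\n{doc}"
        )
        return verdict.strip().lower().startswith("yes")

    context = "\n\n".join(
        doc for doc, fact in zip(documents, facts) if still_contains(doc, fact)
    )
    return {"context": context, "seed": seed}
```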
1. CHASE-QA: Long-context question answering
2. CHASE-Code: Repository-level code generation
3. CHASE-Math: Math reasoning (a hedged evaluation sketch follows below)
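
To make the split-specific evaluation concrete, here is a hypothetical harness. The file format, field names (`context`, `question`, `answer`, `test_file`), and metrics are assumptions for illustration; the released data and official scoring scripts may differ (for instance, a free-form QA answer would normally need a stronger judge than the containment check used here).

```python
import json
import subprocess

# A hypothetical harness for the three splits. File format, field names,
# and the metrics below are illustrative assumptions, not the official scripts.

def score(task: str, prediction: str, example: dict) -> bool:
    if task == "chase-qa":
        # Long-context QA: a naive containment check stands in for a proper judge.
        return example["answer"].lower() in prediction.lower()
    if task == "chase-code":
        # Repo-level code generation: write the prediction and run the tests.
        with open("candidate.py", "w") as f:
            f.write(prediction)
        result = subprocess.run(
            ["python", "-m", "pytest", example["test_file"]], capture_output=True
        )
        return result.returncode == 0
    if task == "chase-math":
        # Math reasoning: compare the final token of the output to the answer.
        return prediction.strip().split()[-1] == str(example["answer"])
    raise ValueError(f"unknown task: {task}")

def evaluate(model, task: str, path: str) -> float:
    """`model` maps a prompt string to a completion; `path` is a JSONL file."""
    with open(path) as f:
        examples = [json.loads(line) for line in f]
    correct = sum(
        score(task, model(ex["context"] + "\n\n" + ex["question"]), ex)
        for ex in examples
    )
    return correct / len(examples)
```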
- Creating “hard” problems with human annotators is expensive (and may soon hit a ceiling!)
- It is impractical for humans to annotate long-context data
- Other benefits: synthetic generation is scalable, renewable, and mitigates contamination concerns