
Models often fail to:
1. Respect ownership rules
2. Infer type information
3. Follow idiomatic Rust interfaces
4. Preserve correct lifetimes
In the paper, we provide a taxonomy of common LLM mistakes.
🧵[5/6]

April 23, 2025 at 5:00 PM
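
To make the failure categories above concrete, here is a minimal Rust sketch of what getting them right can look like. It is an illustration only: the function names `longest`, `sum`, and `find` are hypothetical and are not taken from the paper or from CRUST-Bench.

```rust
// Hypothetical illustration; none of these functions come from CRUST-Bench.

/// Lifetimes: the result borrows from one of the two inputs, so both
/// inputs and the output share the lifetime 'a.
pub fn longest<'a>(a: &'a str, b: &'a str) -> &'a str {
    if a.len() >= b.len() { a } else { b }
}

/// Ownership: taking a borrowed slice (not a Vec by value) leaves the
/// caller's data usable after the call.
pub fn sum(values: &[i32]) -> i32 {
    values.iter().sum()
}

/// Idiomatic interface and types: Option<usize> replaces the C habit of
/// returning -1 when the target is not found.
pub fn find(values: &[i32], target: i32) -> Option<usize> {
    values.iter().position(|&v| v == target)
}
```

A literal translation from C would more likely use raw pointers, a signed sentinel return value, or a by-value `Vec` that the caller can no longer use. Those choices are the sort of thing the four categories describe.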

Anirudh Khatry

@anirudhkhatry.bsky.social

We evaluate state-of-the-art closed-source LLMs (such as o1, Claude-3.7, and Gemini-1.5-Pro), open-source models (such as QwQ-32B and virtuoso-32B), and the SWE-Agent on CRUST-Bench.
Even the best model, OpenAI's o1, passes only 15/100 tasks in a single-shot setting.
🧵[4/6]

April 23, 2025 at 5:00 PM

Anirudh Khatry

@anirudhkhatry.bsky.social

Our benchmark is the first to provide:
1. Rust tests
2. Rust interfaces, which are necessary for the transpiled code to work with the tests
3. A sizable number of real-scale transpilation problems
🧵[3/6]

April 23, 2025 at 5:00 PM
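
As a rough sketch of how a Rust interface and a Rust test fit together, consider a small stack type. Everything here is hypothetical (the `Stack` type, its methods, and the C signature in the comment) and is not drawn from the benchmark itself. The interface fixes the signatures the transpiled code must provide, and the test exercises only those signatures.

```rust
// Hypothetical example; not taken from CRUST-Bench.
// Assumed C original, for reference:
//   int stack_pop(stack_t *s, int *out); /* 0 on success, -1 if empty */

/// Interface: the public type and method signatures the transpiled code
/// is expected to provide. The bodies are one possible safe implementation.
#[derive(Debug, Default)]
pub struct Stack {
    items: Vec<i32>,
}

impl Stack {
    pub fn new() -> Self {
        Stack { items: Vec::new() }
    }

    pub fn push(&mut self, value: i32) {
        self.items.push(value);
    }

    /// Returns None on an empty stack instead of a C-style error code.
    pub fn pop(&mut self) -> Option<i32> {
        self.items.pop()
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }

    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // The test relies only on the interface, so any implementation with
    // the same signatures and behavior passes.
    #[test]
    fn pop_returns_items_in_lifo_order() {
        let mut s = Stack::new();
        s.push(1);
        s.push(2);
        assert_eq!(s.pop(), Some(2));
        assert_eq!(s.pop(), Some(1));
        assert_eq!(s.pop(), None);
    }
}
```

Running `cargo test` against a candidate translation then checks behavior, while the signatures themselves (no raw pointers, `Option` for the empty case) push the translation toward safe, idiomatic Rust.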

Anirudh Khatry

@anirudhkhatry.bsky.social

Transpiling C to Rust helps modernize legacy code with memory safety guarantees. CRUST-Bench evaluates whether transpilation methods yield safe, idiomatic Rust, using handcrafted interfaces and tests to ensure safety and validate correctness.
🧵[2/6]

April 23, 2025 at 5:00 PM
