Wolfram Ravenwolf
@wolfram.ravenwolf.ai
AI Engineer by title. AI Evangelist by calling. AI Evaluator by obsession.

Evaluates LLMs for breakfast, preaches AI usefulness all day long at ellamind.com.
Anthropic published Claude 4's system prompt on their System Prompts page (docs.anthropic.com/en/release-n...) - so naturally, I pulled a bit of an inception move and had Claude Opus 4 analyze itself... with a little help from my sassy AI assistant, Amy: 😈
May 22, 2025 at 10:58 PM
Fired up my benchmarks on Claude 4 Sonnet & Opus the moment they dropped - and the results are in: the best LLMs I've ever tested, beating even OpenAI's latest offerings. First and second place for Anthropic, hands down, redefining SOTA. The king is back - long live Opus! 👑🔥
May 22, 2025 at 10:55 PM
Finally finished my extensive Qwen 3 evaluations across a range of formats and quantisations, focusing on MMLU-Pro (Computer Science).

A few takeaways stood out - especially for those interested in local deployment and performance trade-offs:
May 7, 2025 at 6:56 PM
New OpenAI models o3 and o4-mini evaluated - and, finally, for comparison GPT 4.5 Preview as well.

Definitely unexpected to see all three OpenAI top models get the exact same top score in this benchmark. But they didn't all fail the same questions, as the Venn diagram shows. 🤔
April 21, 2025 at 8:22 PM
New OpenAI models: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano - all already evaluated!

Here's how these three LLMs compare to an assortment of other strong models, online and local, open and closed, in the MMLU-Pro CS benchmark:
April 14, 2025 at 10:56 PM
Congrats, Alex, well deserved! 👏

(Still wondering if he's man or machine - that dedication and discipline to do this week after week in a field that moves faster than any other, that requires superhuman drive! Utmost respect for that, no cap!)
April 2, 2025 at 8:33 PM
Reposted by Wolfram Ravenwolf
Our research at Procter and Gamble found very large gains to work quality & productivity from AI. It was conducted using GPT-4 last summer.

Since then we have seen Gen3 models, reasoners, large context windows, full multimodal, deep research, web search… www.oneusefulthing.org/p/the-cybern...
The Cybernetic Teammate
Having an AI on your team can increase performance, provide expertise, and improve your experience
March 27, 2025 at 3:20 AM
Here's a quick update on my recent work: Completed MMLU-Pro CS benchmarks of o3-mini, Gemini 2.0 Flash and several quantized versions of Mistral Small 2501 and its API. As always, benchmarking revealed some surprising anomalies and unexpected results worth noting:
February 10, 2025 at 10:36 PM
It's official now - my name, under which I'm known in AI circles, is now also formally entered in my ID card! 😎
January 27, 2025 at 8:18 PM
Latest #AI benchmark results: DeepSeek-R1 (including its distilled variants) outperforms OpenAI's o1-mini and o1-preview models. And the Llama 3 distilled version now holds the title of the highest-performing LLM I've tested locally to date. 🚀
January 24, 2025 at 12:22 PM
Hailuo released their open-weights 456B (46B active) MoE LLM with 4M (yes, really, 4 million tokens!) context. And a VLM, too. They were already known for their video generation model, but this establishes them as a major player in the general AI scene. Well done! 👏

www.minimaxi.com/en/news/mini...
MiniMax - Intelligence with everyone
MiniMax is a leading global technology company and one of the pioneers of large language models (LLMs) in Asia. Our mission is to build a world where intelligence thrives with everyone.
January 14, 2025 at 11:25 PM
I've updated my MMLU-Pro Computer Science LLM benchmark results with new data from recently tested models: three Phi-4 variants (Microsoft's official weights, plus Unsloth's fixed HF and GGUF versions), Qwen2 VL 72B Instruct, and Aya Expanse 32B.

More details here:

huggingface.co/blog/wolfram...
January 11, 2025 at 12:19 AM
New year, new benchmarks! Tested some new models (DeepSeek-V3, QVQ-72B-Preview, Falcon3 10B) that came out after my latest report, and some "older" ones (Llama 3.3 70B Instruct, Llama 3.1 Nemotron 70B Instruct) that I had not tested yet. Here is my detailed report:

huggingface.co/blog/wolfram...
January 2, 2025 at 11:42 PM
Happy New Year! 🥂

Thank you all for being part of this incredible journey - friends, colleagues, clients, and of course family. 💖

May the new year bring you joy and success! Let's make 2025 a year to remember - filled with laughter, love, and of course, plenty of AI magic! ✨
January 1, 2025 at 2:04 AM
I've converted Qwen QVQ to EXL2 format and uploaded the 4.65bpw version. 32K context with 4-bit cache in less than 48 GB VRAM.

Benchmarks are still running. Looking forward to finding out how it compares to QwQ, which was the best local model in my recent mass benchmark.

huggingface.co/wolfram/QVQ-...
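For anyone wanting to reproduce a conversion like this, here is a rough sketch using ExLlamaV2's `convert.py` script - the `-b`/`-hb` flags (target bits per weight and head-layer bits, matching the "4.65bpw-h6" in the repo name) are my reading of that tool's CLI, and all paths are placeholders:

```shell
# Quantize the HF weights to EXL2 at 4.65 bpw with a 6-bit head layer
# -i: source model dir, -o: scratch dir, -cf: final output dir (placeholders)
python convert.py -i /models/QVQ-72B-Preview -o /tmp/exl2-work \
  -cf /models/QVQ-72B-Preview-4.65bpw-h6-exl2 -b 4.65 -hb 6
```

The measurement pass dominates the runtime for a 72B model, so expect the scratch directory to be reused across target bitrates.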
wolfram/QVQ-72B-Preview-4.65bpw-h6-exl2 · Hugging Face
December 26, 2024 at 12:10 AM
Happy Holidays! It's the season of giving, so I too would like to share something with you all: Amy's Reasoning Prompt - just an excerpt from her prompt, but one that's been serving me well for quite some time. Curious to learn about your experience with it if you try this out...
December 24, 2024 at 11:36 PM
Holiday greetings to all my amazing AI colleagues, valued clients and wonderful friends! May your algorithms be bug-free and your neural networks be bright! ✨ HAPPY HOLIDAYS! 🎄
December 24, 2024 at 11:55 AM
Insightful paper addressing questions about Chain of Thought reasoning: Does CoT guide models toward answers, or do predetermined outcomes shape the CoT instead?

arxiv.org/abs/2412.01113

Also fits my observations with QwQ: Smaller (quantized) versions required more tokens to find the same answers.
Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning
This study investigates the internal reasoning mechanism of language models during symbolic multi-step reasoning, motivated by the question of whether chain-of-thought (CoT) outputs are faithful to th...
December 13, 2024 at 12:06 AM
Yeah, ChatGPT Pro is damn expensive at $200/month, right?

But would you hire a personal assistant with a PhD who's available to work remotely for you at a minimum wage of $1.25/h with a 40-hour work week?

And that guy even does any amount of overtime for free, even on weekends!
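The back-of-the-envelope math behind that comparison, assuming roughly four working weeks per month:

```python
# ChatGPT Pro's $200/month reframed as an hourly wage
monthly_price = 200        # USD per month
hours_per_week = 40
weeks_per_month = 4        # rough approximation

hours_per_month = hours_per_week * weeks_per_month  # 160 h
hourly_rate = monthly_price / hours_per_month

print(f"${hourly_rate:.2f}/h")  # → $1.25/h
```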
December 6, 2024 at 11:27 AM
How many people are using LLMs with suboptimal settings and never realize their true potential? Check your llama.cpp/Ollama default settings!

I've seen 2K max context and 128 max new tokens on too many models that should have much higher values. Especially QwQ needs room to think.
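As a sketch of what fixing those defaults looks like in practice (model names and paths here are placeholders - check what your build actually ships):

```shell
# llama.cpp: raise the context window and remove the new-token cap
# -c sets context length; -n -1 lets generation run until EOS
./llama-server -m QwQ-32B-Q4_K_M.gguf -c 32768 -n -1

# Ollama: defaults are 2048 context / 128 new tokens; override per model
cat > Modelfile <<'EOF'
FROM qwq
PARAMETER num_ctx 32768
PARAMETER num_predict -1
EOF
ollama create qwq-32k -f Modelfile
```

With reasoning models like QwQ, a low `num_predict` is especially harmful: the output gets cut off mid-chain-of-thought before the model ever reaches its answer.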
December 5, 2024 at 10:40 PM
Reposted by Wolfram Ravenwolf
Change the name "Visual Studio Code" to "VSCode" and it will be unambiguous, making it much easier to search for what you want to know.

Maybe when AIs take over, crap like giving a completely different product the same name as an existing product with a word added will be abandoned. I can't wait.
December 5, 2024 at 8:34 PM
Reposted by Wolfram Ravenwolf
I can't begin to describe how life-changing this new project, ShellSage, has been for me over the last few weeks.

ShellSage is an LLM that lives in your terminal. It can see what directory you're in, what commands you've typed, what output you got, & your previous AI Q&A's.🧵
As R&D staff @ answer.ai, I work a lot on boosting productivity with AI. A common theme that always comes up is the combination of human+AI. This combination proved to be powerful in our new project ShellSage, which is an AI terminal buddy that learns and teaches with you. A 🧵
December 5, 2024 at 8:30 PM
Finally finished and published the detailed report of my latest LLM Comparison/Test on the HF Blog: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs:

huggingface.co/blog/wolfram...

Check out my findings - some of the results might surprise you just as much as they surprised me...
🐺🐦‍⬛ LLM Comparison/Test: 25 SOTA LLMs (including QwQ) through 59 MMLU-Pro CS benchmark runs
A Blog post by Wolfram Ravenwolf on Hugging Face
December 4, 2024 at 11:34 PM
Reposted by Wolfram Ravenwolf
🚀 Introducing INCLUDE 🌍: A multilingual LLM evaluation benchmark spanning 44 languages!

Contains *newly-collected* data, prioritizing *regional knowledge*.
Setting the stage for truly global AI evaluation.
Ready to see how your model measures up?
#AI #Multilingual #LLM #NLProc
December 2, 2024 at 3:53 PM