desunit
desunit.bsky.social
Entrepreneur

http://rwiz.ai - Handling reviews with AI
🎹 http://pianocompanion.info - Chords dictionary app with 1M+ downloads.
🕹️ http://chordiq.info - Learn chords.
📝 desunit.com - my blog
MathArena.ai
MathArena: Evaluating LLMs on Uncontaminated Math Benchmarks
matharena.ai
February 11, 2026 at 7:11 PM
If a system can correctly answer half of brand-new research math questions, sourced from papers published weeks ago, the bar has moved. A lot.

What happens when reasoning keeps improving, but humans keep arguing using 2022 mental models?

... just saying.
February 11, 2026 at 7:11 PM
Producing a final answer is much easier than proving it rigorously.

But the old argument - "just a parrot, repeating old stuff on loop" -
is getting weaker every month.
February 11, 2026 at 7:11 PM
> require understanding new results, not recalling textbooks

Yet people still say: "AI can’t handle unknown equations..." "AI isn’t creative..."

This is basically checkmate.

This does not mean AI can write 60% of math papers.
February 11, 2026 at 7:11 PM
> final answers only (no "almost right" reasoning)

The results?

‼️ Top models get ~50–60% correct answers ‼️

GPT-5.2 - 60%.
Gemini-3-Pro is right behind.

These are problems that:

> an average human cannot solve at all
> many math grads would struggle with
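The "final answers only" rule above can be sketched as a tiny grader. This is a hypothetical illustration, not the benchmark's actual harness: the helper names are mine, and I'm assuming answers arrive in LaTeX \boxed{} form, which the posts don't specify.

```python
# Sketch of "final answers only" grading: a response is correct only if
# its final boxed answer matches the reference after normalization --
# no partial credit for "almost right" reasoning.
import re

def extract_final_answer(output: str) -> str:
    """Pull the last \\boxed{...} answer out of a model's response."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", output)
    return matches[-1].strip() if matches else ""

def normalize(ans: str) -> str:
    """Crude normalization: drop spaces and a leading '+'."""
    return ans.replace(" ", "").lstrip("+")

def grade(output: str, reference: str) -> bool:
    return normalize(extract_final_answer(output)) == normalize(reference)

# Correct final answer passes; a wrong one fails no matter how
# plausible the reasoning before it looked.
print(grade("Long derivation... so \\boxed{ 42 }", "42"))  # True
print(grade("Long derivation... so \\boxed{41}", "42"))    # False
```

Real harnesses need much smarter answer normalization (fractions, equivalent forms), but the all-or-nothing scoring idea is the same.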
February 11, 2026 at 7:11 PM
I just stumbled on ArXivMath - a fresh benchmark that evaluates LLMs on research-level mathematical problems taken from recent arXiv papers (from the last month, you could say). That means:

> minimal training contamination
> no memorization of a static benchmark
February 11, 2026 at 7:11 PM
Like it or not, the future isn’t won by perfect demos/clean decks.
It’s won by whoever ships early, floods the market, and improves in public.

Waiting for v1.0 is how you end up losing to someone who ships v0.1 at scale... a really large scale.
February 9, 2026 at 7:12 PM
Robots today are awkward/limited/silly

But who cares?!

Volume creates learning loops →
Learning loops create cost drops →
Cost drops create adoption →
Adoption creates dominance

Exactly what we've seen with EVs. Same logic shows up in AI adoption.
February 9, 2026 at 7:12 PM
Talking to an LLM is often a better experience than learning another complex tool.
February 6, 2026 at 7:16 PM
- Interface-based moats are dying
- Proprietary data still matters
- Whoever owns the chat interface becomes the new aggregator

Yes, it’s painful, especially if you’ve spent years building beautiful UX, but the reality is simple:
February 6, 2026 at 7:16 PM
That’s the scary part. No brand visibility, no UX differentiation, no workflow lock-in. Pricing power collapses unless the data is truly proprietary. If your data can be licensed, scraped, or replicated, there’s no moat left - just commodity competition.

Takeaways:
February 6, 2026 at 7:16 PM
You don’t open tools, learn workflows, or even know which vendor is used. You just ask: Give me XXX, analyze YYY, run ZZZ

When the interface disappears, all that’s left is API vs API.
February 6, 2026 at 7:16 PM
In Web 2.0, aggregators like Google commoditized discovery. At the same time, suppliers still owned two things:
- interface
- data

That’s why vertical software could charge premium prices.

But it looks like LLMs change that.

The LLM chat becomes the interface.
February 6, 2026 at 7:16 PM
For years, software companies didn’t win because of data - they won because of interfaces: complex workflows/plugins/exports/shortcuts. I know of several examples where that friction created massive switching costs. Basically, their interface was the moat.
February 6, 2026 at 7:16 PM
The video of the app:
February 4, 2026 at 7:18 PM
5/ But now it’s a skill you actually need; otherwise, you’ll just waste time watching the LLM think and craft code.

LLMs are slow. Humans shouldn’t be idle while they think.
If you have experience and can keep context in your head, AI turns you into a force multiplier.
February 4, 2026 at 7:18 PM
4/ The interesting part is that while the LLM was implementing features we agreed on and planned, I was switching between several other projects. Reviewing. Thinking. Deciding what’s next.

I always believed context switching is bad. And it probably still is.
February 4, 2026 at 7:18 PM
3/ > Ingest incoming house invoices and split them across apartments
> Manage parking spots
…and a lot more

Could this have been done this fast a couple of years ago? I doubt it.
> Not with this scope.
> Not just me.
> And definitely not while juggling other projects.
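The invoice-splitting step above could look something like this. Purely hypothetical - the posts don't describe the actual allocation rule, so this sketch assumes costs are split proportionally to apartment floor area, with rounding handled in cents:

```python
# Hypothetical sketch: allocate a shared house invoice across
# apartments proportionally to floor area, in integer cents, with the
# last apartment absorbing any rounding remainder so totals balance.

def split_invoice(total_cents: int, areas: dict[str, float]) -> dict[str, int]:
    total_area = sum(areas.values())
    shares: dict[str, int] = {}
    allocated = 0
    units = list(areas)
    for unit in units[:-1]:
        share = round(total_cents * areas[unit] / total_area)
        shares[unit] = share
        allocated += share
    shares[units[-1]] = total_cents - allocated  # remainder absorbs rounding
    return shares

# A 100.00 invoice split across three apartments by area:
print(split_invoice(10_000, {"apt1": 50.0, "apt2": 30.0, "apt3": 20.0}))
# {'apt1': 5000, 'apt2': 3000, 'apt3': 2000}
```

Giving the remainder to one unit keeps the shares summing exactly to the invoice total, which matters once you reconcile against the bank statement.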
February 4, 2026 at 7:18 PM
2/ To better understand the amount of work, here's what the system does:
> Send invoices
> Collect cold water meter readings
> Ping tenants who forgot to submit them
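The "ping tenants who forgot" step boils down to a set difference. A minimal sketch, with made-up names and contacts (the real system's data model isn't described in the thread):

```python
# Hedged sketch of the reminder step: compare the full tenant list
# against the readings received this period and return the contacts
# that still need a nudge.

def missing_readings(tenants: dict[str, str],
                     readings: dict[str, float]) -> list[str]:
    """tenants maps apartment -> contact; readings maps apartment -> value."""
    return [contact for apt, contact in tenants.items() if apt not in readings]

tenants = {"apt1": "alice@example.com", "apt2": "bob@example.com"}
readings = {"apt1": 123.4}  # apt2 never submitted this month
print(missing_readings(tenants, readings))  # ['bob@example.com']
```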
February 4, 2026 at 7:18 PM
1/ I love my wife, and I couldn’t watch her waste time on things that can be easily automated. The math is simple - 2 days per house turns into 24 days a year.

You know how it works - happy wife, happy family.
February 4, 2026 at 7:18 PM