johnsteill.bsky.social
@johnsteill.bsky.social
Future work idea: build my own tool — maybe a poker-solver API.

ChatGPT can follow hand logic well, but sometimes drops real clunkers. A ReAct-style loop with solver calls mid-conversation could hold its own deep in the weeds of EV calcs.
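
A minimal sketch of what that loop could look like. The solver_ev endpoint is made up, and a stub stands in for the actual LLM call; the point is just the Thought/Action/Observation plumbing:

```python
import re

def solver_ev(hand: str, action: str) -> float:
    """Stand-in for a real poker-solver API; returns a canned EV in big blinds."""
    return {"fold": 0.0, "call": 0.4, "raise": 1.1}.get(action, 0.0)

def llm(prompt: str) -> str:
    """Stub for the actual chat-completion call."""
    return "Thought: compare EVs before acting\nAction: solver_ev[AhKh vs 3bet, raise]"

def react_step(prompt: str) -> str:
    """One Thought/Action/Observation turn: parse the model's action,
    call the solver, and append the observation for the next turn."""
    reply = llm(prompt)
    m = re.search(r"Action: solver_ev\[(.+?),\s*(\w+)\]", reply)
    if m:
        hand, action = m.groups()
        ev = solver_ev(hand, action)
        return prompt + reply + f"\nObservation: EV({action}) = {ev:+.2f} bb\n"
    return prompt + reply

print(react_step("Hand: AhKh facing a 3-bet. Decide.\n"))
```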
September 8, 2025 at 9:09 PM
Another observation: I’ve noticed “expert AI trainer” jobs in bioinformatics, data science, chemistry, etc. The work is to solve problems and explain your thinking step by step.

That’s basically ReAct training data. Important work, but undervalued when treated as annotation.
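
Roughly what one of those work items looks like once serialized. The schema and field names here are hypothetical, just to make the ReAct shape concrete:

```python
import json

# Hypothetical schema: one "expert trainer" work item, written as a
# Thought/Action/Observation trace that could feed ReAct-style fine-tuning.
example = {
    "question": "Which gene in this VCF is most enriched for rare variants?",
    "trace": [
        {"thought": "Filter to variants with allele frequency < 0.1%."},
        {"action": "run_filter", "args": {"max_af": 0.001}},
        {"observation": "412 rare variants remain."},
        {"thought": "Count hits per gene and rank."},
        {"action": "rank_genes", "args": {"by": "variant_count"}},
        {"observation": "Top hit: TTN (37 variants)."},
    ],
    "answer": "TTN",
}
print(json.dumps(example, indent=2))
```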
September 8, 2025 at 9:09 PM
There’s also the eyebrow-raising effect of seeing “Acknowledge user progress” in a reasoning trace right before the AI tells you “Good job!”

The illusion cracks a little.

Yoav Farbey (@yoavf.bsky.social) has a hilarious repo on this theme:
github.com/yoavf/absolu...
September 8, 2025 at 9:08 PM
For an agent–LLM team, though, ReAct’s trick of marking up explicit Thought/Action/Observation tuples is a whole different story.

The output space grows combinatorially, but because nonsense actions are harder to justify in language, effective entropy may shrink.
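
Toy numbers to make the entropy claim concrete. The distributions are invented; the point is just that conditioning actions on a Thought concentrates probability mass:

```python
import math

def entropy_bits(p):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# 16 syntactically valid actions. Unconstrained, the model could spread
# mass uniformly; forced to justify the action in a Thought first, mass
# piles onto the few defensible ones.
uniform = [1 / 16] * 16
justified = [0.55, 0.25, 0.10] + [0.10 / 13] * 13  # sums to 1.0

print(f"unconstrained: {entropy_bits(uniform):.2f} bits")   # 4.00
print(f"with Thought:  {entropy_bits(justified):.2f} bits")  # ~2.01
```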
September 8, 2025 at 9:06 PM
The more I study AI, the more I watch myself think, and the weirder both human and computer brains seem.

For me, reasoning with vs. without verbalizing feels about the same — like the same neurons fire either way.
September 8, 2025 at 9:05 PM
If you’re curious, the paper is “Agentic Reinforced Policy Optimization”: arxiv.org/abs/2507.19849

Would love to hear what others think about ARPO — or if you’ve seen clever crossover ideas like using bioinformatics for GenAI training. 👀
August 22, 2025 at 2:44 PM
For me, ARPO was a chance to stretch math & stats muscles and think harder about how GenAI works under the hood. The more we understand the mechanics, the more effective we’ll be at putting these systems to work.
August 22, 2025 at 2:44 PM
Bioinformatics déjà vu.

ARPO’s “soft advantage attribution” compares rollouts token by token, but once sequences diverge, position-by-position comparison gets brittle. Reminds me of sequence alignment in genomics; maybe Smith–Waterman could help here.
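
A bare-bones Smith–Waterman over token sequences, to show the shape of the idea (not ARPO's actual mechanism): locally align two diverged rollouts, then compare per-token advantages only at aligned positions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of two token sequences, returning aligned index pairs."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell, collecting aligned token positions.
    pairs, (i, j) = [], best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# Two rollouts that agree, diverge mid-sequence, then re-converge.
r1 = "the EV of calling is positive".split()
r2 = "the EV here of folding is positive".split()
print(smith_waterman(r1, r2))  # aligned (i, j) pairs; compare advantages only there
```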
August 22, 2025 at 2:43 PM
Tool calls are magic.

They massively extend LLM capabilities, cut down on hallucinations, and give researchers a natural breakpoint mid-calculation. You can actually peek into the stream of tokens and see what’s happening.
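
A sketch of the breakpoint idea, with a fake token stream and a made-up <tool_call> marker; real agent frameworks expose this differently, but the inspection hook has the same shape:

```python
# Hypothetical streaming loop that pauses whenever the model emits a tool
# call, so the partial reasoning is inspectable before the observation lands.
def stream_with_breakpoints(tokens, on_tool_call):
    buffer = []
    for tok in tokens:
        buffer.append(tok)
        if tok == "<tool_call>":
            # Mid-calculation breakpoint: the transcript so far is visible.
            on_tool_call("".join(buffer))
    return "".join(buffer)

fake_stream = ["Pot odds: ", "4.5:1. ", "<tool_call>", " solver says +0.8bb. ", "Call."]
stream_with_breakpoints(fake_stream, on_tool_call=lambda t: print("[BREAK]", t))
```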
August 22, 2025 at 2:43 PM
Reflections that stuck with me:

What does it mean to be “agentic”? The line between an algorithm and an agent feels blurry. ARPO suggests it’s about adaptability: agents curate their own tools and reasoning paths.
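
If I'm reading the paper right, the adaptability shows up as entropy-gated branching: spawn extra rollouts when the model gets uncertain right after a tool call. A toy version, where the threshold and branch count are mine, not the paper's:

```python
import math

def token_entropy(probs):
    """Entropy (nats) of the model's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def maybe_branch(probs_after_tool_call, threshold=1.0, extra_branches=3):
    """If uncertainty spikes after a tool call, spawn extra partial rollouts
    from that point; otherwise continue the single trajectory."""
    h = token_entropy(probs_after_tool_call)
    return extra_branches if h > threshold else 0

print(maybe_branch([0.4, 0.3, 0.2, 0.1]))      # high entropy -> 3 branches
print(maybe_branch([0.97, 0.01, 0.01, 0.01]))  # confident -> 0 branches
```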
August 22, 2025 at 2:42 PM