Avik Dey
avikdey.bsky.social
Mostly Data, ML, OSS & Society • Stop chasing Approximately Generated Illusions; focus on Specialized Small LMs • To understand it well enough, learn to explain it simply • Shadow self of https://linkedin.com/in/avik-dey, have a beard now
Pinned
Alignment isn't the only thing LLMs are faking. Reasoning is another one they're good at faking. Reading a paper on LLM performance on doctors' reasoning tasks. Just started reading, but it's going to be either:
1. Memorization, or
2. Priming, or
3. Confirmation prompting

www.anthropic.com/research/ali...
Alignment faking in large language models
A paper from Anthropic's Alignment Science team on Alignment Faking in AI large language models
www.anthropic.com
Still hasn’t read Ilya’s memo …
December 3, 2025 at 2:12 AM
OS-level automation is brittle even at its best. Deterministic, frozen workflows break under even small changes to the environment, and because the workflow is pre-frozen there is no chance of recovery at runtime, requiring extensive human intervention.

en.wikipedia.org/wiki/Robotic...
December 2, 2025 at 5:10 PM
Restated: IBM CEO says AI has no sustainable path to pay off infrastructure costs, forget making a profit
December 2, 2025 at 4:30 PM
Pathetic mimicry of the brain was not enough, they had to go for the soul too?
Amanda Askell has confirmed the soul document is indeed a real thing they trained Claude on. x.com/AmandaAskell...
December 2, 2025 at 12:03 AM
Crucial distinction: in LLMs, training data doesn't just inform the model - it is the model. The model densely encodes pattern probabilities specific to each token path - it's like training trillions of ML models, where each pattern is instantiated as its own tiny ML model.

And that’s also why it fails.
"Data are no longer things to be accounted for by a theoretical model...but rather inputs to the process of creating models". Many in LLM-ML don't care about the problems they are actually building models of: "the nature of languages...how we work with language...and the specific contexts [of use]."
What makes something data? Some thoughts on that question, and how answers to it help us understand AI hype:

medium.com/@emilymenonb...
December 1, 2025 at 7:13 PM
Reposted by Avik Dey
Three years into the generative-AI wave, demand for the technology seems surprisingly flimsy
Investors expect AI use to soar. That’s not happening
Recent surveys point to flatlining business adoption
econ.st
November 29, 2025 at 9:20 PM
Hey Apple,

Please stop reminding me to “Take a moment to reflect in your journal.”

My memory’s great, thank you. And when I am eventually demented, reading journal entries ain’t going to help - even if reading is still possible.
November 29, 2025 at 6:25 PM
Ilya finally answers the question: What did Ilya see?

“this disconnect between eval performance and actual real-world performance,”

Next time someone goes - LLMs beat ‘So & So’ Olympiad - just quote Ilya.
November 27, 2025 at 5:42 PM
At tiny scale - a fun experiment. At data center scale - silicon swiss cheese.

Also, see en.wikipedia.org/wiki/Project...
November 27, 2025 at 3:28 PM
Proxying the Apple byte - are we?

Amateur move guys.
November 26, 2025 at 1:37 AM
Having faced this exact same repetitive issue since 2023, I would have laughed at this - if we didn’t have 1% of the GDP invested in this caricature of an “AI”.

www.dwarkesh.com/p/ilya-sutsk...
November 25, 2025 at 9:40 PM
Ilya appears to be progressively approaching the right conclusion. I remain confident that, in time, he will consolidate his insights from the first 5 minutes and recognize that complex explanations are unnecessary when simpler ones suffice.

(screenshots not chronological)

www.dwarkesh.com/p/ilya-sutsk...
November 25, 2025 at 8:17 PM
Good to see research on what the math always said - a low-to-average performer, that's your LLM “employee”:

> This supports our assertion that the ceiling on LLM creativity (0.25) corresponds to the boundary between little-c and Pro-c human creative performance (Figure 6).

www.academia.edu/144621465/_T...
November 25, 2025 at 5:19 PM
Any PhD who endorses the claim that an LLM constitutes “PhD level” intelligence is, at minimum, engaging in a questionable use of their academic authority. These endorsements function less as rigorous assessments and more as a signal that the symbolism conferred by their credential is - available for rent.
Deeply absurd. This Google PDF published on a blog (arxiv, not peer reviewed) claims an LLM is "PhD level" but in most cases the MAJORITY of reference URLs were invalid or inaccessible.

A PhD sitting down and just fabricating >50% of sources = career ending

arxiv.org/abs/2511.11597
November 24, 2025 at 9:39 PM
They were convinced “AI“ would rewrite it all in a week and ship by end of that month, the ‘year or two’ estimate was just sandbagging so they could pose as 100x devs.
November 24, 2025 at 5:09 AM
“warm-up”: Under the guidance of an expert human the model was finally able to get the answer right when nudged towards it.

Not the model, not the prompt - still the human.

The amount of shilling these guys do, no wonder they can’t get anything serious built.

cdn.openai.com/pdf/4a25f921...
November 23, 2025 at 5:33 PM
Think they might have answered their own question … ?

bsky.app/profile/slas...
November 22, 2025 at 4:04 AM
The problem with most financial analysis of Nvidia’s quarterly performance is that these folks don’t seem to understand data center hardware lead times and the revenue recognition cycle.
November 20, 2025 at 6:36 AM
Great article with learned insights - the best kind.

Unfortunately, this is a societal failure. Tech didn’t invent loneliness, it offered a new way to cope with it - in an empathetic echo chamber.

We are failing the kids. Others too, but mostly it’s the kids that I worry about.
I agree that emotional addiction to chatbots is the number one risk of AI today. Here is a gift link to an important OpEd in the NYTimes:
www.nytimes.com/2025/11/17/o...
Opinion | The Sad and Dangerous Reality Behind ‘Her’
www.nytimes.com
November 20, 2025 at 6:10 AM
You watch a video of a professor from a random internet post and are filled with regret because you didn’t have the opportunity to learn from him in person:

en.wikipedia.org/wiki/Ramamur...
19. Quantum Mechanics I: The key experiments and wave-particle duality
YouTube video by YaleCourses
youtu.be
November 19, 2025 at 6:16 AM
Smaller bag, same toss.
Nvidia and Microsoft will invest up to $15 billion in OpenAI competitor Anthropic. Anthropic, in turn, said it would buy $30 billion of compute capacity from Microsoft Azure and use advanced AI chips supplied by Nvidia.
Nvidia, Microsoft Pour $15 Billion Into Anthropic for New AI Alliance
Anthropic also commits to purchase $30 billion from Microsoft’s cloud computing business Azure.
on.wsj.com
November 18, 2025 at 7:40 PM
For ancillary text-based foo-foo services, or core financial services? I have a hard time believing that their engineers, a few of whom I know, would sign off on this integration - but leadership prevailed?
November 18, 2025 at 7:38 PM
Don’t worry about it this quarter - they have enough to prop it up.

But next quarter you should be terrified.
November 18, 2025 at 7:22 PM
If these Gemini 3 Pro benchmarks are accurate, time for OpenAI to sell to Microsoft. Microsoft won’t want their management team or their prolifically tweeting engineers, but I am sure most engineers would thrive if led by seasoned engineering management.

storage.googleapis.com/deepmind-med...
November 18, 2025 at 4:51 PM
I too would like my taxpayer backed trillion dollar fantasy fund. Why should Sama have all the fun?
Anthropic CEO Dario Amodei thinks AI could help find cures for most cancers, prevent Alzheimer’s, and even double the human lifespan. cbsn.ws/4oRZ8Nm
November 18, 2025 at 6:50 AM