Lightnews — Scholar-powered news

Jessy Li @jessyjli.bsky.social · 9h

Test your models and see if they just memorize or truly understand!

PLSemanticsBench - where formal meets informal!

arxiv.org/abs/2510.03415

Team: Aditya Thimmaiah, Jiyang Zhang, Jayanth Srinivasa, Milos Gligoric

PLSemanticsBench: Large Language Models As Programming Language Interpreters

As large language models (LLMs) excel at code reasoning, a natural question arises: can an LLM execute programs (i.e., act as an interpreter) purely based on a programming language's formal semantics?...

Jessy Li @jessyjli.bsky.social · 9h

So what's really happening⁉️
LLMs aren't interpreting rules -- they're recalling patterns.
Their "understanding" is promising... but shallow.

💡It's time to test semantics, not just syntax.💡
To move from surface-level memorization → true symbolic reasoning.

Jessy Li @jessyjli.bsky.social · 9h

Change the rules -- swap operators (+ with -) or replace them with novel symbols -- and accuracy collapses.
Models that were "near-perfect" drop to single digits. 😬
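
To make the operator swap concrete, here is a minimal sketch. It is an illustration only, not the benchmark's actual harness: the tuple-based program format, the `STANDARD` and `SWAPPED` tables, and the `evaluate` helper are all assumptions made for this example. The same program yields different results depending on which semantics the interpreter is told to follow, so a model that recalls the usual meaning of `+` gets the swapped case wrong.

```python
import operator

# Standard semantics: each symbol keeps its usual meaning.
STANDARD = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Hypothetical perturbed semantics: "+" and "-" trade meanings.
SWAPPED = {**STANDARD, "+": operator.sub, "-": operator.add}


def evaluate(expr, semantics):
    """Evaluate a nested (op, left, right) tuple under the given operator table."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return semantics[op](evaluate(left, semantics), evaluate(right, semantics))


# The program "(7 + 3) - 2" written as a nested tuple.
program = ("-", ("+", 7, 3), 2)

print(evaluate(program, STANDARD))  # (7 + 3) - 2 = 8
print(evaluate(program, SWAPPED))   # (7 - 3) + 2 = 6
```

An interpreter that follows the stated rules returns 6 under the swapped table; one that pattern-matches on familiar syntax returns 8.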

Jessy Li @jessyjli.bsky.social · 9h

🚨 Does your LLM really understand code -- or is it just really good at remembering it?
We built **PLSemanticsBench** to find out.
The results: a wild mix.

✅The Brilliant:
Top reasoning models can execute complex, fuzzer-generated programs -- even with 5+ levels of nested loops! 🤯

❌The Brittle: 🧵

Reposted by Jessy Li

Greg Durrett @gregdnlp.bsky.social · 6d

Find my students and collaborators at COLM this week!

Tuesday morning: @juand-r.bsky.social and @ramyanamuduri.bsky.social 's papers (find them if you missed it!)

Wednesday pm: @manyawadhwa.bsky.social 's EvalAgent

Thursday am: @anirudhkhatry.bsky.social 's CRUST-Bench oral spotlight + poster

Jessy Li @jessyjli.bsky.social · 6d

We’re hiring faculty as well! Happy to talk about it at COLM!

Kyle Mahowald @kmahowald.bsky.social · 6d

UT Austin Linguistics is hiring in computational linguistics!

Asst or Assoc.

We have a thriving group sites.utexas.edu/compling/ and a long proud history in the space. (For instance, fun fact, Jeff Elman was a UT Austin Linguistics Ph.D.)

faculty.utexas.edu/career/170793

🤘

UT Austin Computational Linguistics Research Group – Humans processing computers processing humans processing language

Reposted by Jessy Li

Byron Wallace @byron.bsky.social · 19d

Can we quantify what makes some text read like AI "slop"? We tried 👇

Chantal @chantalsh.bsky.social · 19d

"AI slop" seems to be everywhere, but what exactly makes text feel like "slop"?

In our new work (w/ @tuhinchakr.bsky.social, Diego Garcia-Olano, @byron.bsky.social ) we provide a systematic attempt at measuring AI "slop" in text!

arxiv.org/abs/2509.19163

🧵 (1/7)

Reposted by Jessy Li

Kyle Mahowald @kmahowald.bsky.social · 7d

I’m at #COLM2025 from Wed with:

@siyuansong.bsky.social Tue am introspection arxiv.org/abs/2503.07513

@qyao.bsky.social Wed am controlled rearing: arxiv.org/abs/2503.20850

@sashaboguraev.bsky.social INTERPLAY ling interp: arxiv.org/abs/2505.16002

I’ll talk at INTERPLAY too. Come say hi!

Language Models Fail to Introspect About Their Knowledge of Language

There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of s...

Jessy Li @jessyjli.bsky.social · 7d

On my way to #COLM2025 🍁

Check out jessyli.com/colm2025

QUDsim: Discourse templates in LLM stories arxiv.org/abs/2504.09373

EvalAgent: retrieval-based eval targeting implicit criteria arxiv.org/abs/2504.15219

RoboInstruct: code generation for robotics with simulators arxiv.org/abs/2405.20179

Reposted by Jessy Li

Kanishka Misra 🌊 @kanishka.bsky.social · 7d

Traveling to my first @colmweb.org🍁

Not presenting anything but here are two posters you should visit:

1. @qyao.bsky.social on Controlled rearing for direct and indirect evidence for datives (w/ me, @weissweiler.bsky.social and @kmahowald.bsky.social), W morning

Paper: arxiv.org/abs/2503.20850

Both Direct and Indirect Evidence Contribute to Dative Alternation Preferences in Language Models

Language models (LMs) tend to show human-like preferences on a number of syntactic phenomena, but the extent to which these are attributable to direct exposure to the phenomena or more general propert...

Jessy Li @jessyjli.bsky.social · 11d

Here is a genuine one :) CosmicAI’s AstroVisBench, to appear at #NeurIPS

bsky.app/profile/nsfs...

NSF-Simons AI Institute for Cosmic Origins (CosmicAI) @nsfsimonscosmicai.bsky.social · 18d

Exciting news! Introducing AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy!

A new benchmark developed by researchers at the NSF-Simons AI Institute for Cosmic Origins is testing how well LLMs implement scientific workflows in astronomy and visualize results.

Jessy Li @jessyjli.bsky.social · 13d

All of us (@kanishka.bsky.social @kmahowald.bsky.social and me) are looking for PhD students this cycle! If computational linguistics/NLP is your passion, join us at UT Austin!

For my areas see jessyli.com

Jessy Li @jessyjli.bsky.social · 18d

Can AI aid scientists amidst their own workflows, when they do not know step-by-step workflows and may not know, in advance, the kinds of scientific utility a visualization would bring?

Check out @sebajoe.bsky.social’s feature on ✨AstroVisBench:

NSF-Simons AI Institute for Cosmic Origins (CosmicAI) @nsfsimonscosmicai.bsky.social · 18d

Exciting news! Introducing AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy!

A new benchmark developed by researchers at the NSF-Simons AI Institute for Cosmic Origins is testing how well LLMs implement scientific workflows in astronomy and visualize results.

Reposted by Jessy Li

UT Center for Health Communication @uthealthcomm.org · Sep 4

📣 NEW HCTS course developed in collaboration with @tephi-tx.bsky.social: AI in Health Communication 📣

Explore responsible applications and best practices for maximizing impact and building trust with @utaustin.bsky.social experts @jessyjli.bsky.social & @mackert.bsky.social.

💻: rebrand.ly/HCTS_AI

Jessy Li @jessyjli.bsky.social · Aug 16

Would be great to chat at COLM!

Reposted by Jessy Li

Kyle Lo @kylelo.bsky.social · Aug 15

long range narrative understanding, even basic fact checking that humans easily get near perfect on, has barely improved in LMs over years novelchallenge.github.io

NoCha leaderboard

Reposted by Jessy Li

Tom McCoy @rtommccoy.bsky.social · Aug 15

🤖 🧠 NEW PAPER ON COGSCI & AI 🧠 🤖

Recent neural networks capture properties long thought to require symbols: compositionality, productivity, rapid learning

So what role should symbols play in theories of the mind? For our answer...read on!

Paper: arxiv.org/abs/2508.05776

1/n

The top shows the title and authors of the paper: "Whither symbols in the era of advanced neural networks?" by Tom Griffiths, Brenden Lake, Tom McCoy, Ellie Pavlick, and Taylor Webb.

At the bottom is text saying "Modern neural networks display capacities traditionally believed to require symbolic systems. This motivates a re-assessment of the role of symbols in cognitive theories."

In the middle is a graphic illustrating this text by showing three capacities: compositionality, productivity, and inductive biases. For each one, there is an illustration of a neural network displaying it. For compositionality, the illustration is DALL-E 3 creating an image of a teddy bear skateboarding in Times Square. For productivity, the illustration is novel words produced by GPT-2: "IKEA-ness", "nonneotropical", "Brazilianisms", "quackdom", "Smurfverse". For inductive biases, the illustration is a graph showing that a meta-learned neural network can learn formal languages from a small number of examples.

Jessy Li @jessyjli.bsky.social · Aug 15

Yes, at least need other data (like Echoes in AI) and a quality measure (LitBench). What we also did in QUDsim was to make sure the stories come from pre-LLM posts, to exclude AI-written stories. Further, the way they measure style + semantic diversity doesn't align with how they define it (it only captures lexical diversity).

Reposted by Jessy Li

Adina Williams @adinawilliams.bsky.social · Aug 15

I agree this thread's headline claim seems premature. Let me add our recent ACL Findings paper, with Dexter Ju and @hagenblix.bsky.social, which found syntactic simplification in at least some LMs, in a novel domain regeneration setting: aclanthology.org/2025.finding...

Jessy Li @jessyjli.bsky.social · Aug 15

Nice, reading level, syntactic complexity, and sentence structures are great angles to study this!!

Jessy Li @jessyjli.bsky.social · Aug 12

Thanks :) Yes will be there, let's catch up!

Jessy Li @jessyjli.bsky.social · Aug 12

Paper links:
Echoes in AI: arxiv.org/abs/2501.00273
Syntactic templates (EMNLP'24): aclanthology.org/2024.emnlp-m...
Discourse similarity (COLM'25 to appear): arxiv.org/abs/2504.09373

Jessy Li @jessyjli.bsky.social · Aug 12

The Echoes in AI paper showed quite the opposite, also with a story continuation setup.
Additionally, we present evidence that both *syntactic* and *discourse* diversity measures show strong homogenization that the lexical and cosine measures used in this paper do not capture.

Jessy Li @jessyjli.bsky.social · Aug 12

Ah yes and this at the discourse level, explicitly contrasting with the type of metrics in this work: arxiv.org/abs/2504.09373

QUDsim: Quantifying Discourse Similarities in LLM-Generated Text

As large language models become increasingly capable at various writing tasks, their weakness at generating unique and creative content becomes a major liability. Although LLMs have the ability to gen...

Jessy Li @jessyjli.bsky.social · Jul 28

Tuesday at #ACL2025: Jan will be presenting this from 4-5:30pm in x4/x5!
Turns out content selection is highly consistent across LLMs, but not so much with their own notion of importance or with humans'…

Jessy Li @jessyjli.bsky.social · Feb 21

Do you want to know what information LLMs prioritize in text synthesis tasks? Here's a short 🧵 about our new paper, led by Jan Trienes: an interpretable framework for salience analysis in LLMs.

First of all, information salience is a fuzzy concept. So how can we even measure it? (1/6)
