Harvey Fu
@harveyfu.bsky.social
Thank you! We hypothesized LLMs are bad at finding omissions because they cannot attend to what’s not there (i.e. the omitted lines). We marked the omissions with placeholders (giving attention something to land on), and the placeholders turned out to be very helpful. We plan to study this further in future work.
We hope AbsenceBench can serve as a starting point for building more robust and trustworthy architectures that are absence-aware: it’s a necessary but far from sufficient bar for making sure LLMs can reason about absence.

[6/n]
Implication: If LLMs can’t tell what’s missing, this might pose challenges for using LLMs as judges, graders, or assistants in any domain where absence matters.

[5/n]
Does chain-of-thought style reasoning solve this? It helps but does not completely solve the problem, and it usually requires generating more than 3x as many thinking tokens as there were in the original document.

[4/n]
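For a rough sense of that overhead, here is a small Python sketch (my own illustration, not the paper’s measurement code) that compares the length of a reasoning trace to the length of the source document with a generic tokenizer; the tokenizer choice and function name are assumptions.

import tiktoken

def thinking_overhead(document: str, thinking_trace: str) -> float:
    """Ratio of reasoning-trace tokens to document tokens (hypothetical metric)."""
    enc = tiktoken.get_encoding("cl100k_base")  # stand-in tokenizer
    return len(enc.encode(thinking_trace)) / len(enc.encode(document))

# A ratio above 3.0 would match the ">3x thinking tokens" pattern described above.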
Why do models struggle to identify omissions? We find that using placeholders, such as “<missing line>”, to explicitly mark omissions boosts models’ performance by 35.7%. This suggests an inherent weakness of Transformer-style self-attention: models cannot attend to omitted information.

[3/n]
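As an illustration of what that intervention looks like (a minimal sketch, not the benchmark’s released code; the function name is mine), the modified document keeps a “<missing line>” marker at each omission site instead of silently deleting the line:

def add_placeholders(original: str, omitted_idx: set[int]) -> str:
    """Replace each omitted line with an explicit '<missing line>' marker."""
    lines = original.splitlines()
    return "\n".join(
        "<missing line>" if i in omitted_idx else line
        for i, line in enumerate(lines)
    )

With the marker present there is a concrete token position for self-attention to land on, which is the intuition behind the 35.7% gain.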
AbsenceBench conditions models on two versions of a document: the original and a modified version that deliberately omits certain parts. It then asks models to generate what was left out.

Although the setup is similar to the needle-in-a-haystack (NIAH) task, LLMs perform much worse on AbsenceBench!

[2/n]
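To make the setup concrete, here is a minimal Python sketch of how an AbsenceBench-style example could be constructed and scored; the function names, prompt wording, and line-level recall metric are my own illustration, not the paper’s pipeline.

import random

def make_example(original: str, omit_frac: float = 0.1, seed: int = 0):
    """Drop a random fraction of lines to create the modified document."""
    lines = original.splitlines()
    rng = random.Random(seed)
    k = max(1, int(len(lines) * omit_frac))
    omitted_idx = set(rng.sample(range(len(lines)), k))
    modified = "\n".join(line for i, line in enumerate(lines) if i not in omitted_idx)
    omitted = [lines[i] for i in sorted(omitted_idx)]
    return modified, omitted

def build_prompt(original: str, modified: str) -> str:
    return (
        "Original document:\n" + original +
        "\n\nModified document (some lines removed):\n" + modified +
        "\n\nList every removed line, one per line."
    )

def omission_recall(model_output: str, omitted: list[str]) -> float:
    """Fraction of truly omitted lines that the model reproduced."""
    predicted = {l.strip() for l in model_output.splitlines()}
    return sum(1 for l in omitted if l.strip() in predicted) / len(omitted)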
LLMs excel at finding surprising “needles” in very long documents, but can they detect when information is conspicuously missing?

🫥AbsenceBench🫥 shows that even SoTA LLMs struggle on this task, suggesting that LLMs have trouble perceiving “negative spaces”.
Paper: arxiv.org/abs/2506.11440

🧵[1/n]