David Smith
@dasmiq.bsky.social
Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.
Thanks, @ianmilligan1.bsky.social! I liked that piece when it came out. I don't mean that test contamination is the only reason for high performance; rather, better evaluation that rules out contamination would make clearer the kinds of documents where this is "solved".
November 26, 2025 at 4:46 PM
If this were the newspaper directories, I could imagine boilerplate text in a then-in-copyright Ayer's that also appeared in an out-of-copyright Ayer's. (This conceptual problem with using probing to audit copyright leakage hasn't shown up in the literature on LLM memorization, AFAIK.)
November 26, 2025 at 4:37 PM
Reposted by David Smith
Interesting about Claude. A few years ago I tried to use Gemini 1.5 to OCR a few pages of 19th-century texts from the HTDL and got a RECITATION error. Apparently I was asking the model to output "passages that are 'recited' from copyrighted material in the foundational LLM's training data."
November 26, 2025 at 4:27 PM
Oh, definitely, the models are getting better. But that's why we need to understand how we're benchmarking them!
November 26, 2025 at 4:04 PM
Reposted by David Smith
FWIW, I've been asking Claude to transcribe both printed text and handwriting that it almost certainly hasn't seen (e.g. photos from pretty obscure docs in archives) for a couple of years, and it has gotten massively better this year. Still has difficulty with some stuff (esp. very period-specific
November 26, 2025 at 4:01 PM
It seems likely. One could try probing Gemini, but that wouldn't be dispositive. It'd be much easier to check an open model!
November 26, 2025 at 3:26 PM
Not yet. Olmo also comes with open pretraining and instruction-tuning sets and has some nice tools for exploring them. On the other hand, it's much more English-centered than Apertus, of course. allenai.org/olmo
Olmo from Ai2
Our fully open language model and complete model flow.
allenai.org
November 26, 2025 at 3:24 PM
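This is the kind of check that open pretraining data makes possible: flag benchmark pages whose text already appears, near-verbatim, in the training dump. A minimal sketch, assuming you have a benchmark transcription and a candidate training document on disk; the file paths, n-gram length, and threshold are illustrative assumptions, not Olmo's actual tooling.

```python
# Naive train-test contamination check: does a benchmark transcription's text
# already appear (near-verbatim) in a document from an open pretraining dump?
# File paths, n-gram length, and the threshold are illustrative assumptions.

def char_ngrams(text: str, n: int = 50) -> set[str]:
    """Return the set of character n-grams after collapsing whitespace."""
    text = " ".join(text.split()).lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def overlap(test_text: str, corpus_text: str, n: int = 50) -> float:
    """Fraction of the test page's n-grams that also occur in the corpus doc."""
    test = char_ngrams(test_text, n)
    if not test:
        return 0.0
    return len(test & char_ngrams(corpus_text, n)) / len(test)

if __name__ == "__main__":
    with open("benchmark_page.txt", encoding="utf-8") as f:   # hypothetical path
        page = f.read()
    with open("pretraining_doc.txt", encoding="utf-8") as f:  # hypothetical path
        doc = f.read()
    score = overlap(page, doc)
    print(f"character n-gram overlap: {score:.1%}")
    if score > 0.2:  # threshold chosen for illustration only
        print("possible train-test contamination")
```

Exact-match n-grams only catch verbatim leakage; OCR-style corruption of the training copy would need approximate matching.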
Don't pronounce a eulogy on paleographers yet. We'll want them around to understand the data we do have, build more open data, work on languages big companies don't care about, and evaluate when systems go wrong. bsky.app/profile/giul...
November 26, 2025 at 3:15 PM
which I know from personal inspection. What it had was the biggest (n-gram) language model anyone had yet built. @nsaphra.bsky.social et al. have a nice paper on this analogy. arxiv.org/abs/2311.05020
First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models
Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to o...
arxiv.org
November 26, 2025 at 3:15 PM
As some of the replies to @dancohen.org have pointed out, these OCR capabilities are especially impressive for more recent English, where the language model is strongest. Probably the best analogy for this is early Google Translate, which had a pretty weak translation model, ...
November 26, 2025 at 3:15 PM
The pile of data that made Gemini's OCR possible was produced by past research! We know of specific OCR/HTR training sets that Google certainly used, so funding them was clearly helpful. bsky.app/profile/scot...
I know this is the funding/research game, and we put a lot of money/time into soon-curtailed paths because one payoff is sometimes all we need, but: it's sobering thinking of all the clever technologies and methodologies that were swept away when fundamentally stupid LLMs came on the scene.
November 26, 2025 at 3:15 PM
It's still useful that Gemini can incorporate this knowledge. But this is also why open models whose data you can check for train-test contamination are important. allenai.org/olmo
November 26, 2025 at 3:15 PM
Thanks so much, @palaeofuturist.bsky.social! This is very helpful. I agree it’d be good to aggregate these resources. We’ll see how far we get!
November 26, 2025 at 1:14 PM
Reposted by David Smith
Back to EpiDoc: A query to the www.jiscmail.ac.uk/EPIDOC-MARKUP list would surely return pointers to dozens of repos of EpiDoc files with Greek inscriptions. I'd love to see that list!
JISCMail - EPIDOC-MARKUP List at WWW.JISCMAIL.AC.UK
www.jiscmail.ac.uk
November 26, 2025 at 11:29 AM
Reposted by David Smith
I believe Inscriptions of the Northern Black Sea also have EpiDoc files available (github.com/kingsdigital...), but I haven't checked how up to date these are.

Many EpiDoc projects are listed at: wiki.digitalclassicist.org/Category:Epi... . At least some I haven't thought of will have downloadable XML.
GitHub - kingsdigitallab/iospe
github.com
November 26, 2025 at 11:22 AM
Thanks! I don’t mind the format—I’ve parsed the binary TLG disks before—it’s the PHI license I worry about.
November 26, 2025 at 12:46 AM
Thanks! At least it’s Greek we’re looking for, so it could be worse.
November 26, 2025 at 12:44 AM
Thanks, @hcayless.bsky.social! A few pointers to big ones would be great! We're not looking for an exhaustive list—just some nice, open data.
November 25, 2025 at 11:20 PM
Yeah, I'd assume they don't have a dump of their scraped data? (And I wouldn't know about publishing with it anyway.)
November 25, 2025 at 10:45 PM
Oh, I found the US Epigraphy Project GitHub repo, which looks great, but other suggestions are welcome! github.com/Brown-Univer...
GitHub - Brown-University-Library/usep-data: inscriptions and related data files for 'http://library.brown.edu/projects/usep/'
github.com
November 25, 2025 at 10:40 PM