David Smith
@dasmiq.bsky.social
Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.
Thanks, @ianmilligan1.bsky.social! I liked that piece when it came out. I don't mean that test contamination is the only reason for high performance; rather, better evaluation that rules out contamination would make clearer the kinds of documents where this is "solved".
November 26, 2025 at 4:46 PM
If this were the newspaper directories, I could imagine boilerplate text in a then-in-copyright Ayer's that also appeared in an out-of-copyright Ayer's. (This conceptual problem with using probing to audit copyright leakage hasn't shown up in the literature on LLM memorization, AFAIK.)
November 26, 2025 at 4:37 PM
Reposted by David Smith
Interesting about Claude. A few years ago I tried to use Gemini 1.5 to OCR a few pages of 19th-century texts from the HTDL and got a RECITATION error. Apparently I was asking the model to output "passages that are 'recited' from copyrighted material in the foundational LLM's training data."
November 26, 2025 at 4:27 PM
Oh, definitely, the models are getting better. But that's why we need to understand how we're benchmarking them!
November 26, 2025 at 4:04 PM
Reposted by David Smith
FWIW, I've been asking Claude to transcribe both printed text and handwriting that it almost certainly hasn't seen (e.g. photos from pretty obscure docs in archives) for a couple of years, and it has gotten massively better this year. Still has difficulty with some stuff (esp. very period-specific
November 26, 2025 at 4:01 PM
It seems likely. One could try probing Gemini, but that wouldn't be dispositive. It'd be much easier to check an open model!
November 26, 2025 at 3:26 PM
Not yet. Olmo also comes with open pretraining and instruction-tuning sets and has some nice tools for exploring them. On the other hand, it's much more English-centered than Apertus, of course. allenai.org/olmo
Olmo from Ai2
Our fully open language model and complete model flow.
allenai.org
November 26, 2025 at 3:24 PM
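This is the kind of check that open pretraining data makes possible: flag benchmark pages whose text already appears, near-verbatim, in the training dump. A minimal sketch, assuming you have a benchmark transcription and a candidate training document on disk; the file paths, n-gram length, and threshold are illustrative assumptions, not Olmo's actual tooling.

```python
# Naive train-test contamination check: does a benchmark transcription's text
# already appear (near-verbatim) in a document from an open pretraining dump?
# File paths, n-gram length, and the threshold are illustrative assumptions.

def char_ngrams(text: str, n: int = 50) -> set[str]:
    """Return the set of character n-grams after collapsing whitespace."""
    text = " ".join(text.split()).lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def overlap(test_text: str, corpus_text: str, n: int = 50) -> float:
    """Fraction of the test page's n-grams that also occur in the corpus doc."""
    test = char_ngrams(test_text, n)
    if not test:
        return 0.0
    return len(test & char_ngrams(corpus_text, n)) / len(test)

if __name__ == "__main__":
    with open("benchmark_page.txt", encoding="utf-8") as f:   # hypothetical path
        page = f.read()
    with open("pretraining_doc.txt", encoding="utf-8") as f:  # hypothetical path
        doc = f.read()
    score = overlap(page, doc)
    print(f"character n-gram overlap: {score:.1%}")
    if score > 0.2:  # threshold chosen for illustration only
        print("possible train-test contamination")
```

Exact-match n-grams only catch verbatim leakage; OCR-style corruption of the training copy would need approximate matching.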
Don't pronounce a eulogy on paleographers yet. We'll want them around to understand the data we do have, build more open data, work on languages big companies don't care about, and evaluate when systems go wrong. bsky.app/profile/giul...
November 26, 2025 at 3:15 PM
which I know from personal inspection. What it had was the biggest (n-gram) language model anyone had yet built. @nsaphra.bsky.social et al. have a nice paper on this analogy. arxiv.org/abs/2311.05020
First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models
Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to o...
arxiv.org
November 26, 2025 at 3:15 PM
As some of the replies to @dancohen.org have pointed out, these OCR capabilities are especially impressive for more recent English, where the language model is strongest. Probably the best analogy for this is early Google Translate, which had a pretty weak translation model, ...
November 26, 2025 at 3:15 PM
The pile of data that made Gemini's OCR possible was produced by past research! We know of specific OCR/HTR training sets that Google certainly used, so funding them was clearly helpful. bsky.app/profile/scot...
I know this is the funding/research game, and we put a lot of money/time into soon-curtailed paths because one payoff is sometimes all we need, but: it's sobering thinking of all the clever technologies and methodologies that were swept away when fundamentally stupid LLMs came on the scene.
November 26, 2025 at 3:15 PM
It's still useful that Gemini can incorporate this knowledge. But this is also why open models whose data you can check for train-test contamination are important. allenai.org/olmo
November 26, 2025 at 3:15 PM
Thanks so much, @palaeofuturist.bsky.social! This is very helpful. I agree it’d be good to aggregate these resources. We’ll see how far we get!
November 26, 2025 at 1:14 PM
Reposted by David Smith
Back to EpiDoc: A query to the www.jiscmail.ac.uk/EPIDOC-MARKUP list would surely return pointers to dozens of repos of EpiDoc files with Greek inscriptions. I'd love to see that list!
JISCMail - EPIDOC-MARKUP List at WWW.JISCMAIL.AC.UK
www.jiscmail.ac.uk
November 26, 2025 at 11:29 AM
Reposted by David Smith
I believe Inscriptions of the Northern Black Sea also have EpiDoc files available (github.com/kingsdigital...), but I haven't checked how up to date these are.

Many EpiDoc projects are listed at: wiki.digitalclassicist.org/Category:Epi... . At least some I haven't thought of will have downloadable XML.
GitHub - kingsdigitallab/iospe
github.com
November 26, 2025 at 11:22 AM
Thanks! I don’t mind the format—I’ve parsed the binary TLG disks before—it’s the PHI license I worry about.
November 26, 2025 at 12:46 AM
Thanks! At least it’s Greek we’re looking for, so it could be worse.
November 26, 2025 at 12:44 AM
Thanks, @hcayless.bsky.social! A few pointers to big ones would be great! We're not looking for an exhaustive list—just some nice, open data.
November 25, 2025 at 11:20 PM
Yeah, I'd assume they don't have a dump of their scraped data? (And I wouldn't know about publishing with it anyway.)
November 25, 2025 at 10:45 PM
Oh, I found the US Epigraphy Project GitHub repo, which looks great, but other suggestions are welcome! github.com/Brown-Univer...
GitHub - Brown-University-Library/usep-data: inscriptions and related data files for 'http://library.brown.edu/projects/usep/'
github.com
November 25, 2025 at 10:40 PM