Lightnews — Scholar-powered news

Reposted by David Smith

Daniel Evans

@djevans.bsky.social

Interesting about Claude. A few years ago I tried to use Gemini 1.5 to OCR a few pages of 19th C texts from the HTDL and got a RECITATION error. Apparently I was asking the model to output "passages that are 'recited' from copyrighted material in the foundational LLM's training data."

November 26, 2025 at 4:27 PM

Reposted by David Smith

Sarah Bull

@sarahebull.bsky.social

FWIW, I've been asking Claude to transcribe both printed text and handwriting that it almost certainly hasn't seen (e.g. photos from pretty obscure docs in archives) for a couple of years, and it has gotten massively better this year. Still has difficulty with some stuff (esp. very period-specific

November 26, 2025 at 4:01 PM

Reposted by David Smith

Martin Paul Eve

@eve.gd

I was wondering this! Namely whether it had already seen the plaintext in training.

David Smith @dasmiq.bsky.social · 5h

Dan's post is great, as usual! It's exciting to move to full-page models that allow us to engage with the text and avoid mucking around with image processing. But we don't need to pronounce a eulogy on paleographers yet. The War Department and Jane Austen examples are likely known to the LM. 1/

Dan Cohen @dancohen.org · 1d

New issue of my newsletter: "The Writing Is on the Wall for Handwriting Recognition" — One of the hardest problems in digital humanities has finally been solved, and it's a good use of AI newsletter.dancohen.org/archive/the-...

November 26, 2025 at 3:20 PM

David Smith

@dasmiq.bsky.social

Dan's post is great, as usual! It's exciting to move to full-page models that allow us to engage with the text and avoid mucking around with image processing. But we don't need to pronounce a eulogy on paleographers yet. The War Department and Jane Austen examples are likely known to the LM. 1/

Dan Cohen @dancohen.org · 1d

New issue of my newsletter: "The Writing Is on the Wall for Handwriting Recognition" — One of the hardest problems in digital humanities has finally been solved, and it's a good use of AI newsletter.dancohen.org/archive/the-...

The Writing Is on the Wall for Handwriting Recognition

One of the hardest problems in digital humanities has finally been solved

newsletter.dancohen.org

November 26, 2025 at 3:15 PM

Reposted by David Smith

brendan o’connor

@brenocon.bsky.social

Here at UMass Amherst CICS, we’re searching for TT faculty in NLP – see the link from
www.cics.umass.edu/about/employ...

I’m happy to answer questions of course, too!

Faculty Positions

Open tenure-track and teaching faculty positions in computer science and informatics at the Manning College of Information and Computer Sciences

www.cics.umass.edu

November 26, 2025 at 4:21 AM

Reposted by David Smith

Gabriel Bodard

@palaeofuturist.bsky.social

Back to EpiDoc: A query to the www.jiscmail.ac.uk/EPIDOC-MARKUP list would surely return pointers to dozens of repos of EpiDoc files with Greek inscriptions. I'd love to see that list!

JISCMail - EPIDOC-MARKUP List at WWW.JISCMAIL.AC.UK

www.jiscmail.ac.uk

November 26, 2025 at 11:29 AM

Reposted by David Smith

Gabriel Bodard

@palaeofuturist.bsky.social

I believe Inscriptions of the Northern Black Sea also have EpiDoc files available (github.com/kingsdigital...), but I haven't checked how up to date these are.

Many EpiDoc projects listed at: wiki.digitalclassicist.org/Category:Epi... . At least some I haven't thought of will have downloadable XML.

GitHub - kingsdigitallab/iospe

Contribute to kingsdigitallab/iospe development by creating an account on GitHub.

github.com

November 26, 2025 at 11:22 AM

Reposted by David Smith

Gabriel Bodard

@palaeofuturist.bsky.social

EpiDoc: There are repos for I.Aphrodisias (github.com/Aphrodisias/...), IRCyr (github.com/kingsdigital...), IGCyr (doi.org/10.6092/unib...), ICretaInst (github.com/IreneVagiona...) with c. 100s – 1000s of EpiDoc files in each.

GitHub - Aphrodisias/IAph_EFES: Preparation of the future publication built on the EFES platform of Inscriptions of Aphrodisias, a second edition of the 2007 publication at https://insaph.kcl.ac.uk/ia...

Preparation of the future publication built on the EFES platform of Inscriptions of Aphrodisias, a second edition of the 2007 publication at https://insaph.kcl.ac.uk/iaph2007/ - Aphrodisias/IAph_EFES

github.com

November 26, 2025 at 11:15 AM

David Smith

@dasmiq.bsky.social

A student is working on detecting gaps in text transcriptions. Does anyone in #DigiClass or beyond know of open transcriptions of ancient Greek inscriptions, in EpiDoc or otherwise, similar to papyri.info ? I've worked with literary texts and papyri, but I'm stumped on this.

papyri.info

November 25, 2025 at 10:20 PM
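
For readers unfamiliar with how EpiDoc records missing text: editors typically mark a lacuna with the TEI `<gap>` element, so gaps already encoded in a transcription can be listed with the standard library alone. The sketch below is only an illustration under stated assumptions. The sample inscription is invented, `list_gaps` is a hypothetical helper, and nothing here is claimed to be the student's actual method.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by EpiDoc files.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample for illustration; not taken from any real corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="edition">
      <ab>
        <lb n="1"/>θεοῖς <gap reason="lost" quantity="3" unit="character"/>
        <lb n="2"/><gap reason="illegible" extent="unknown" unit="line"/>
      </ab>
    </div>
  </body></text>
</TEI>"""


def list_gaps(xml_text):
    """Return one dict per <gap> element, in document order."""
    root = ET.fromstring(xml_text)
    gaps = []
    for gap in root.iterfind(".//tei:gap", TEI_NS):
        gaps.append({
            "reason": gap.get("reason"),      # e.g. "lost" or "illegible"
            "unit": gap.get("unit"),          # e.g. "character" or "line"
            "quantity": gap.get("quantity"),  # None when the size is unknown
            "extent": gap.get("extent"),      # e.g. "unknown"
        })
    return gaps


gaps = list_gaps(SAMPLE)
for gap in gaps:
    print(gap)
```

Gaps extracted this way could serve as reference labels when evaluating a detector on transcriptions where the markup has been stripped.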

Reposted by David Smith

Sciences Po médialab

@medialab-scpo.bsky.social

What remains of medieval texts? Why do some survive while others fade away?

👉 A seminar session devoted to the issues of textual transmission, as part of the ERC LostMA project, with @jbcamps.bsky.social

Join us tomorrow from 2 PM

Registration 🔽

Modelling the transmission and survival of texts and manuscripts | médialab Sciences Po

The médialab seminar welcomes Jean-Baptiste Camps for a session on the LostMA project on the transmission and survival of ancient texts. He will analyze how and why certain manusc...

medialab.sciencespo.fr

November 24, 2025 at 10:35 AM

Reposted by David Smith

Hanna Wallach

@hannawallach.bsky.social

Three exciting opportunities at
@msftresearch.bsky.social in NYC!!! 🎉

Internship w/ FATE: apply.careers.microsoft.com/careers/job?...

Internship w/ STAC on AI evaluation and measurement: apply.careers.microsoft.com/careers/job?...

Postdoc w/ FATE: apply.careers.microsoft.com/careers/job?...

Jenn Wortman Vaughan @jennwv.bsky.social · 6d

FATE internships: apply.careers.microsoft.com/careers/job?...

FATE postdocs: apply.careers.microsoft.com/careers/job?...

And internships with our close collaborators at STAC: apply.careers.microsoft.com/careers/job?...

November 24, 2025 at 3:33 PM

Reposted by David Smith

Daniel van Strien

@danielvanstrien.bsky.social

Building datasets to train smaller, task-focused models used to be incredibly time-consuming.

Very excited to see SAM3 massively lower that barrier. Describe the class you want to detect and get annotated datasets automatically!

Try it yourself: huggingface.co/datasets/uv-...!

Screenshot of a simple app showing bounding boxes for photographs detected in historic newspaper images.

hf jobs uv run \
  --flavor a100-large \
  -s HF_TOKEN=HF_TOKEN \
  https://huggingface.co/datasets/uv-scripts/sam3/raw/main/detect-objects.py \
  -- davanstrien/newspapers-with-images-after-photography-big \
  davanstrien/newspapers-photo-predictions \
  --class-name "photograph" \
  --confidence-threshold 0.4

November 21, 2025 at 1:30 PM

Reposted by David Smith

Jenn Wortman Vaughan

@jennwv.bsky.social

Spread the word! 📢 The FATE (Fairness, Accountability, Transparency, and Ethics) group at @msftresearch.bsky.social in NYC is hiring interns and postdocs to start in summer 2026! 🎉

Apply by *December 15* for full consideration.

November 20, 2025 at 8:11 PM

Reposted by David Smith

Nathan Lambert

@natolambert.bsky.social

We present Olmo 3, our next family of fully open, leading language models.
This family of 7B and 32B models represents:

1. The best 32B base model.
2. The best 7B Western thinking & instruct models.
3. The first 32B (or larger) fully open reasoning model.

November 20, 2025 at 2:32 PM

Reposted by David Smith

Computational Humanities Research

@comphumresearch.bsky.social

📢 The #CHR2025 proceedings are out!

97 papers, ~1600 pages of computational humanities 🔥 Now published via the new Anthology of Computers and the Humanities, with DOIs for every paper.

🔗 anthology.ach.org/volumes/vol0...

And don’t forget: registration closes tomorrow (20 Nov)!

Edited by Taylor Arnold, Margherita Fantoli, and Ruben Ros

anthology.ach.org

November 19, 2025 at 12:53 PM

Reposted by David Smith

Ben Lee

@bcgl.bsky.social

2/ GovScape is built on top of the End of Term Web Archive (eotarchive.org) and currently contains all renderable PDFs (50 pages or fewer) from the 2020 crawl, documenting the first Trump administration. An overview of GovScape’s search functionality can be found in this diagram.

A diagram showing the three central query methods within GovScape: semantic text search, visual search, and keyword search.

November 18, 2025 at 8:19 PM

Reposted by David Smith

Ben Lee

@bcgl.bsky.social

1/ Announcing GovScape – a public search system for 10 million U.S. government PDFs (70 million pages)! GovScape offers visual search, semantic text search, and keyword search. Explore below:

Website: www.govscape.net
ArXiv link: arxiv.org/abs/2511.11010

www.govscape.net

November 18, 2025 at 8:19 PM

Reposted by David Smith

Ryan Cordell

@ryancordell.org

The new features are still in alpha—but @djevans.bsky.social's contextual tools for Viral Texts data are live clusters.viraltexts.org

Click "View witness in context" to see a given reprint on the newspaper page, alongside other reprints on that page—click a cluster ID here to see its other reprints

A digital archive interface showing a newspaper text. On the left side is the OCR data for the text, while on the right side is an image clipping of the text from the original archive. At the top is metadata about the newspaper title, date, and location, with an arrow pointing to the text "View witness in context"

An image of a historical newspaper page with a yellow highlight box surrounding one small text. An arrow points to a sidebar with metadata about this reprint, with a hyperlink to 37 reprints of this text

November 17, 2025 at 5:34 PM

Reposted by David Smith

Matthew Kirschenbaum

@mkirschenbaum.bsky.social

THE STEAM MAN OF THE PRAIRIES, an “Edisonade” (story about a boy wonder inventor) by Edward S. Ellis published as a literal dime novel in 1868. Not “AI” but a ten-foot steam-powered mechanical man used as transport and surrogate labor. Cover image here is from the Rosenbach Library, via Wikipedia.

Cover image of the novel showing an engraved line drawing of the rotund “man” pulling a carriage with riders.

November 17, 2025 at 1:41 AM

Reposted by David Smith

Dallas Card

@dallascard.bsky.social

Trying an experiment in good old-fashioned blogging about papers: dallascard.github.io/granular-mat...

Language Model Hacking - Granular Material

dallascard.github.io

November 16, 2025 at 7:51 PM

Reposted by David Smith

Will Lowe

@conjugateprior.org

For example, one really helpful thing about a delegation framing is that it's natural to think about any technology as an agent with at least two principals - say, you and Sam Altman. That seems mostly invisible from a fantasy HR perspective.

November 15, 2025 at 10:29 AM

Reposted by David Smith

Will Lowe

@conjugateprior.org

Ironically, and now slightly painfully, I've been pushing formal models of delegation for thinking clearly about technology and tech ethics for years in class. In my partial defence none of that demands turning what is abstractly a delegation problem into an exercise in fantasy HR.

November 15, 2025 at 9:59 AM

Reposted by David Smith

Marisa Hudspeth

@marisahudspeth.bsky.social

(2/2) Morphology-aware tokenization improves Latin LM performance on four downstream tasks, including gains for out-of-domain texts and rare words.

📄 arxiv.org/abs/2511.09709

Contextual morphologically-guided tokenization for Latin encoder models

Tokenization is a critical component of language model pretraining, yet standard tokenization methods often prioritize information-theoretical goals like high compression and low fertility rather than...

arxiv.org

November 14, 2025 at 8:02 PM

Reposted by David Smith

Marisa Hudspeth

@marisahudspeth.bsky.social

(1/2) 🎉 New preprint: "Contextual Morphologically-Guided Tokenization for Latin Encoder Models"
w/ @diyclassics.bsky.social @brenocon.bsky.social

November 14, 2025 at 8:02 PM

Reposted by David Smith

William Timkey

@wtimkey.bsky.social

(2) The prediction view: the cost of processing each word in a sentence can be fully reduced to the word’s contextual predictability (i.e. surprisal). Predicting the next word is exactly what LLMs are trained to do, so they’re a great tool for evaluating this view. (3/n)

November 14, 2025 at 7:19 PM
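
As a rough illustration of the surprisal quantity mentioned in the post above: surprisal is the negative log probability of a word given its context, so less predictable words get higher values. The sketch below assumes base-2 logs (bits) and uses made-up next-word probabilities rather than real LLM outputs. `surprisal_bits` is a hypothetical helper, not something from the thread.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Surprisal in bits: -log2 of the word's probability in context."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Invented probabilities for the word after "The cat sat on the ...".
next_word_probs = {"mat": 0.5, "sofa": 0.25, "moon": 0.125}

surprisals = {word: surprisal_bits(p) for word, p in next_word_probs.items()}
for word, bits in surprisals.items():
    print(f"{word}: {bits:.1f} bits")
```

Under the prediction view described in the post, these values would be the sole predictor of per-word processing cost. In practice the probabilities would come from a language model's next-token distribution.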
