Jörg Lehmann
jrglmn.bsky.social
Digital humanism | machine learning | digital cultural heritage | Berlin State Library |
„Name a bias – we have it!“
Nikhil Kandpal et al.: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text, June 2025
doi.org/10.48550/arX...

Stefan Baack et al.: Towards Best Practices for Open Datasets for LLM Training, Jan 2025
doi.org/10.48550/arX...
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concern...
doi.org
December 8, 2025 at 10:48 AM
Lukas Gienapp et al.: The German Commons – 154 Billion Tokens of Openly Licensed Text for German Language Models, Oct 2025
doi.org/10.48550/arX...

Pierre-Carl Langlais et al.: Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training, June 2025
doi.org/10.48550/arX...
The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models
Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated f...
doi.org
December 8, 2025 at 10:48 AM
Thomas Padilla et al.: Public Interest Corpus Principles and Goals, Dec 2025
www.authorsalliance.org/2025/12/03/r...

Paul Keller & Europeana Foundation: Outline for a European Books Data Commons, Nov 2025
openfuture.eu/publication/...
Releasing The Public Interest Corpus Principles and Goals
Today, we are pleased to release The Public Interest Corpus Principles and Goals. This release builds on the recap of our final planning workshop and anticipates release of our final deliverable la…
www.authorsalliance.org
December 8, 2025 at 10:48 AM
Written together with @amsichani.bsky.social
April 2, 2025 at 6:54 PM
Two more blog posts on #openness #GLAMs and #opensource

Openness & its shades (of grey)
mmk.sbb.berlin/2024/06/21/o...

Openness & closed systems
mmk.sbb.berlin/2024/06/25/o...

Thus forming a trio of reflections on redefining openness in the 21st century
Openness, and Some of its Shades – Mensch.Maschine.Kultur
mmk.sbb.berlin
July 2, 2024 at 12:55 AM
Brewster Kahle vs. HF reminded me of WorldCat vs. Anna's Archive, one month ago:

www.infodocket.com/2024/02/07/r...
Mass scraping of bibliographic metadata from WorldCat ...

... obviously, we (again) have to become clearer about what "open", "public domain", CC0, etc. mean.
Report: “Lawsuit Accuses Anna’s Archive of Hacking WorldCat, Stealing 2.2 TB Data”
From Torrent Freak: The complaint accuses Washington citizen Maria Dolores Anasztasia Matienzo and several “John Does” of operating the search engine and scraping WorldCat data. The scraping is equate...
www.infodocket.com
March 15, 2024 at 2:05 PM