Lightnews — Scholar-powered news

Reposted by IDI

Greg Leppert

@leppert.me

Even if you're not a partner library, you might be curious about what it's like to work with GRIN. Our technical report has a wealth of details. arxiv.org/abs/2511.11447

GRIN Transfer: A production-ready tool for libraries to retrieve digital copies from Google Books

Publicly launched in 2004, the Google Books project has scanned tens of millions of items in partnership with libraries around the world. As part of this project, Google created the Google Return Inte...

arxiv.org

November 20, 2025 at 4:42 PM

Reposted by IDI

Greg Leppert

@leppert.me

We're also sharing the pipeline we developed for Institutional Books that seamlessly dedupes, classifies, and enhances the data once GRIN Transfer brings it down. www.institutional.org/tools

Institutional Books | Institutional Data Initiative

Institutional Books 1.0 is our first release of public domain books. This set was originally digitized through Harvard Library’s participation in the Google Books project.

www.institutional.org

November 20, 2025 at 4:42 PM

IDI

@institutional.org

Register to join the talk virtually: harvard.zoom.us/meeting/regi...

Welcome! You are invited to join a meeting: IDI Talk with Michele Dolfi & Peter Staar on SmolDocling. After registering, you will receive a confirmation email about joining the meeting.

Please join us for a talk with Michele Dolfi & Peter Staar from IBM Research in Zurich to discuss their work on SmolDocling, an “ultra-compact” model for diverse OCR tasks.

harvard.zoom.us

September 9, 2025 at 4:06 PM

Reposted by IDI

Greg Leppert

@leppert.me

Cohosted by @institutionaldatainitiative.org and The Berkman Klein Center. harvard.zoom.us/webinar/regi...

Welcome! You are invited to join a webinar: Open AI Development. After registering, you will receive a confirmation email about joining the webinar.

For AI to truly benefit society, it must be built on foundations of transparency, fairness, and accountability—starting with the most foundational building block that powers it: data. Not long ago, ...

harvard.zoom.us

June 16, 2025 at 7:48 PM

IDI

@institutional.org

We hope Institutional Books will be the beginning of a process that makes millions more books accessible to the public for a variety of uses.

We welcome feedback as we continue to expand this dataset, refine its contents, and sharpen our process.
www.institutionaldatainitiative.org/institutiona...

Institutional Books | Institutional Data Initiative

Institutional Books 1.0 is our first release of public domain books. This set was originally digitized through Harvard Library’s participation in the Google Books project.

www.institutionaldatainitiative.org

June 12, 2025 at 9:12 PM

IDI

@institutional.org

We look forward to growing Institutional Books through community. We welcome collaboration from researchers and model makers as we:
- Evaluate the dataset’s impact on model outputs
- Continue to refine our OCR pipelines

View the dataset on Hugging Face: huggingface.co/datasets/ins...

institutional/institutional-books-1.0 · Datasets at Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

June 12, 2025 at 9:12 PM

IDI

@institutional.org

As part of our refinement work, we supplemented the original OCR-extracted text with a post-processed version that utilizes line detection to reassemble the text according to the line type.

June 12, 2025 at 9:12 PM

IDI

@institutional.org

We included extensive volume-level metadata with both original and generated components, such as results from text-level language detection.

June 12, 2025 at 9:12 PM

IDI

@institutional.org

We analyzed the dataset’s coverage across time, topic, and language and found:
- 40% of English text + long tail of 254 languages
- 20 clear topical tranches
- Largely published in the 19th and 20th centuries

Technical report here: arxiv.org/abs/2506.08300

June 12, 2025 at 9:12 PM
