We welcome feedback as we continue to expand this dataset, refine its contents, and sharpen our process.
www.institutionaldatainitiative.org/institutiona...
- Evaluate the dataset’s impact on model outputs
- Continue to refine our OCR pipelines
View the dataset on Hugging Face: huggingface.co/datasets/ins...
- 40% of English text + long tail of 254 languages
- 20 clear topical tranches
- Largely published in the 19th and 20th centuries
Technical report here: arxiv.org/abs/2506.08300