iseeaswell.bsky.social
@iseeaswell.bsky.social
It could also have been short for 蒙古篆 (Mongolian seal script).
November 9, 2025 at 12:33 PM
All are welcome. Please make this space your own, and add channels at will.
June 17, 2025 at 5:46 PM
Our first task is to massively expand SMOL through community contribution. Anyone who contributes significant volunteer translations or post-edits will get on the arXiv paper in the next refresh!
June 17, 2025 at 5:46 PM
This is a space for grassroots collaboration. It doubles as a directory of speakers of such languages, so you can directly talk with and collaborate with community members.
June 17, 2025 at 5:46 PM
February 19, 2025 at 5:36 PM
By the way, GATITOS has now officially moved to the SMOL Hugging Face repo.
February 19, 2025 at 5:36 PM
Finally, if you are a speaker of any SMOL languages, please take a look at the data and tell me what you think. Despite the quality checks, I am sure that some of the deliveries have quality issues, and I would love to understand and/or fix any affected sources. We are in this together!
February 19, 2025 at 5:36 PM
I would also like to thank FAIR for being an academic leader in open-sourcing work with low-resource languages, including NLLB and Flores. Thank you for helping make the academic community feel collaborative!
February 19, 2025 at 5:36 PM
I would like to thank our native-language consultants and translators -- too numerous to name -- for their invaluable help along the way. Several entire languages in SMOL only exist because of volunteer contributions!
February 19, 2025 at 5:36 PM
SMOL also provides factuality ratings for 671 documents, with well-researched justifications.
February 19, 2025 at 5:36 PM
SMOL has two sub-sources: SMOL-Doc, a document-level set, and SMOL-Sent, a sentence-level set. They join the token-level GATITOS to cover three levels of granularity!
February 19, 2025 at 5:36 PM
And that's just out-of-the-box finetuning; we know that the community can think of cleverer ways to train on SMOL. Multiway parallel data is tricky to deal with without overfitting.
February 19, 2025 at 5:36 PM
Finetuning of Gemini 2.0 Flash on SMOL yields average improvements of about +4.0 ChrF, with some languages -- including Ewe, Kokborok, Manipuri, Ga, and Dombe -- seeing gains of over +20 ChrF.
February 19, 2025 at 5:36 PM
SMOL comprises sentences and documents carefully selected for the biggest “Bang for Buck” ratio. It includes 6.1M translated tokens—and if you’ve been in this field a while you know that’s a lot!
February 19, 2025 at 5:36 PM
Google Translate gives “在这里,说得很多,但听到的却很少”, which has the suspicious property that the passive is marked with 得 in the first clause and 的 in the second. More credence to the theory that they are cognate with all Irish, specifically the Cork dialect, which is the oldest and purest form.
December 27, 2024 at 6:14 PM
Interesting, my brain didn’t consider that because that construction feels like an adjective rather than a verb, but it does seem to have more or less the same meaning!
December 27, 2024 at 6:11 PM