Workshop on Multilingual Data Quality Signals
@wmdqs.bsky.social
The first iteration of our workshop will be co-located with @colmweb.org 2025 in Montreal. https://wmdqs.org/
wmdqs.bsky.social
If you were able to join us, let us know about your experience: docs.google.com/forms/d/e/1F...
wmdqs.bsky.social
Thank you everyone for coming to WMDQS (pronounced "whim ducks")!
wmdqs.bsky.social
Then we had our second poster session for our paper submissions. The full papers are available on our website!
wmdqs.bsky.social
David Adelani gave a keynote about text quality for low-resource languages.
wmdqs.bsky.social
We had our first poster session, hearing from some of our shared task participants!
wmdqs.bsky.social
We presented the results of our shared task! We received annotations for over 30,000 documents representing over 60 languages. We also showed the results of our LangID dataset and system shared task tracks. Thank you to everyone who participated!
wmdqs.bsky.social
We started with a keynote from @juliakreutzer.bsky.social about multilingual fine-tuning data!
wmdqs.bsky.social
WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025
Reposted by Workshop on Multilingual Data Quality Signals
juliakreutzer.bsky.social
Looking forward to tomorrow's #COLM2025 workshop on multilingual data quality! 🤩
wmdqs.bsky.social
In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.
wmdqs.bsky.social
We will also have a session on our shared task, which was about improving language identification models. Participants of the shared task contributed annotations to create a new LangID dataset and also submitted new LangID systems.
wmdqs.bsky.social
Our third and final keynote will be from @sebnagel.bsky.social about the data in Common Crawl.
wmdqs.bsky.social
Our second keynote will be by David Adelani about text quality for low-resource languages.
wmdqs.bsky.social
Our first keynote will be from @juliakreutzer.bsky.social about data for multilingual fine-tuning.
Reposted by Workshop on Multilingual Data Quality Signals
pjox.bsky.social
If you want to help us improve language and cultural coverage, and build an open source LangID system, please register for our shared task on Language Identification! 💬

Registering is easy! All the details are on the shared task webpage: wmdqs.org/shared-task/

Deadline: July 23, 2025 (AoE) ⏰
Reposted by Workshop on Multilingual Data Quality Signals
wmdqs.bsky.social
We've added lots more documents/languages and extended the deadline for the first round of annotations until July 23rd. Check out the details below 👇
Reposted by Workshop on Multilingual Data Quality Signals
catherinearnett.bsky.social
One of the biggest obstacles to improving language technologies for low-resource languages is the lack of data. To address this, we need better language identification tools. So, we're organizing a shared task on Language Identification for Web Data! #NLP #NLProc