Rachel Hong
@rachelhong.bsky.social
PhD student at University of Washington
machine learning fairness, algorithmic bias, dataset audits, data privacy, tech policy. she/her
💡 What can ML researchers do instead? Prior work has explored various alternatives for web datasets, including limiting license terms to prevent commercial use, evaluating automated sanitization tools, attributing training data, and creating explicit consent mechanisms [13/N]
June 30, 2025 at 9:15 PM
While privacy laws carve out publicly available data, being web-accessible isn’t the same as being legally “public.” We call for enforcing reasonable basis standards for web-scraped data and modernizing the “publicly available” exception in consumer privacy and data protection laws [12/N]
June 30, 2025 at 9:15 PM
Legal findings: Web scraping doesn’t consider the context or intent behind personal information; it vacuums up all of the web. We show that certain requirements of data protection laws are not met: there’s a lack of reasonable basis, purpose specification, and data minimization [11/N]
June 30, 2025 at 9:15 PM
With web-scale data, it’s hard for people to be aware of, find, and take down their images, as data replicates across sites even when the original is taken down. Opt-out doesn’t address dataset monoculture, as many models may have already trained on central datasets like CommonPool [10/N]
June 30, 2025 at 9:15 PM
Using the Wayback Machine, we track the earliest recorded timestamp of a subset of images with non-blurred faces. We find a significant portion existed before 2020, raising the question of how anyone could have consented to the use of their personal data before the rise of large AI systems [9/N]
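For a concrete picture of this kind of lookup, here is a minimal Python sketch against the Wayback Machine’s public CDX API; the helper name and the pre-2020 cutoff are assumptions for illustration, not the audit’s actual pipeline:

```python
import requests

CDX_API = "https://web.archive.org/cdx/search/cdx"

def earliest_capture(url: str) -> str | None:
    """Return the timestamp (YYYYMMDDhhmmss) of the earliest Wayback
    Machine capture of `url`, or None if it was never archived."""
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp",
        "filter": "statuscode:200",
        "limit": 1,  # CDX results are ordered oldest-first by default
    }
    resp = requests.get(CDX_API, params=params, timeout=30)
    rows = resp.json() if resp.text.strip() else []
    # rows[0] is the JSON header row; rows[1] is the oldest capture.
    return rows[1][0] if len(rows) > 1 else None

# Flag images whose earliest capture predates large web-scale AI training:
# ts = earliest_capture(image_url)
# if ts and ts < "20200101":
#     ...  # archived before 2020
```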
June 30, 2025 at 9:15 PM
DataComp (like other datasets) optionally includes automatic face blurring as a way to preserve privacy. However, the face-blurring algorithm fails to catch an estimated 102 million samples containing real human faces, some of which reveal children or people’s names [8/N]
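One way such misses can be estimated (a sketch under assumptions, not DataComp’s actual blurring model): re-run an off-the-shelf face detector over supposedly blurred images and flag any that still fire. OpenCV’s bundled Haar cascade stands in for the real detector here:

```python
import cv2

# Stand-in detector: OpenCV's bundled Haar cascade. The audit and
# DataComp's own blurring pipeline use their own detectors; this is
# only illustrative.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def has_unblurred_face(image_path: str) -> bool:
    """Heuristic: a detector hit on a 'face-blurred' image suggests
    the blurring pass missed a real face."""
    img = cv2.imread(image_path)
    if img is None:
        return False
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0
```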
June 30, 2025 at 9:15 PM
We link resumes to online profiles (like LinkedIn) and estimate that at least 142K samples (out of 12.8B) depict resumes of individuals with a public online presence. We annotate the presence of personal data in resumes (with online profiles), split by geographic region below [7/N]
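As a rough illustration of this kind of linking (the cue list, patterns, and function names are assumptions, not the audit’s method), a caption- or OCR-level heuristic might look like:

```python
import re

# Illustrative cues only; a real pipeline would run OCR on the image
# and use far more careful matching.
RESUME_CUES = re.compile(
    r"\b(curriculum vitae|resume|work experience|education|"
    r"references available)\b",
    re.IGNORECASE,
)
LINKEDIN_URL = re.compile(r"linkedin\.com/in/[\w\-]+", re.IGNORECASE)

def looks_like_resume(text: str) -> bool:
    """Keyword heuristic for resume-like caption or OCR text."""
    return bool(RESUME_CUES.search(text))

def linked_profile(text: str) -> str | None:
    """Return an embedded LinkedIn profile URL, if any."""
    m = LINKEDIN_URL.search(text)
    return m.group(0) if m else None
```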
June 30, 2025 at 9:15 PM
Several common websites in DataComp no longer have images available to download, though the images did exist at the time of curation. Inspecting the download errors, we find that some are “Forbidden” errors due to a lack of permission to access the image [6/N]
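For illustration, re-requesting a sample’s URL and bucketing the HTTP status separates “Forbidden” (403) from pages that are simply gone; a minimal sketch, assuming the URLs come from the dataset metadata:

```python
import requests

def classify_download_error(url: str) -> str:
    """Re-try an image URL and bucket the failure mode."""
    try:
        # HEAD keeps this cheap; some hosts reject HEAD, in which case
        # a streamed GET would be the fallback.
        r = requests.head(url, timeout=10, allow_redirects=True)
    except requests.RequestException:
        return "unreachable"
    if r.status_code == 403:
        return "forbidden"  # host is up but denies permission
    if r.status_code == 404:
        return "gone"       # image no longer hosted
    return "ok" if r.ok else f"http_{r.status_code}"
```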
June 30, 2025 at 9:15 PM
Some samples reveal names and faces linked to demographic and children’s information (see paraphrased examples below). Many come from news sites, where someone may have disclosed the information for an article rather than consenting to their data being used to train a model [5/N]
June 30, 2025 at 9:15 PM
🌳 DataComp CommonPool is an image dataset crawled from the web, following LAION-5B (taken down in Dec 2023 for illegal material). DataComp has been downloaded ≥2M times (!), meaning a huge number of downstream dataset users and model users (i.e., the leaves) rely on one source [4/N]
June 30, 2025 at 9:15 PM
🚀 Empirically, we find:
1. Examples of credit card numbers, passport/ID numbers, resumes, faces, and children’s data (see the detection sketch after this list)
2. Attempts at data sanitization (such as face blurring) aren’t perfect
3. Data on the web isn’t always “publicly available” according to legal frameworks
[3/N]
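To make item 1 concrete, here is a minimal sketch of one standard way to detect credit card numbers in text: a loose digit-run regex filtered by the Luhn checksum. This is illustrative only, not the detector used in the audit:

```python
import re

# Loose candidate pattern: 13-19 digits with optional space/dash separators.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum separates plausible card numbers from random digits."""
    digits = [int(d) for d in number]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    hits = []
    for m in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", m.group(0))
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            hits.append(digits)
    return hits
```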
June 30, 2025 at 9:15 PM
🔒 Our dataset audit findings inform our legal analysis with regard to existing consumer privacy and data protection laws, like the CCPA and GDPR. We surface various privacy risks of current data curation practices built on the indiscriminate scraping of the web. [2/N]
June 30, 2025 at 9:15 PM