AI’s Dirty Data Secret: Massive Training Dataset Found to Contain Millions of Personal Records

A major investigation has revealed that DataComp CommonPool, one of the largest open-source datasets used for training AI image-generation models, contains millions of examples of personal data. Researchers found thousands of images of passports, credit cards, birth certificates, driver’s licenses, and other official identity documents within just 0.1% of the dataset, leading to the estimate that the full dataset likely holds hundreds of millions of such images, including identifiable faces and sensitive documents.
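The step from a 0.1% sample to a full-dataset estimate is simple scaling. The sketch below illustrates only that arithmetic, assuming the audited slice is a uniform random sample. The function name and the example count are hypothetical and are not figures from the audit.

```python
def extrapolate_count(sample_hits: int, sample_fraction: float) -> int:
    """Scale a count observed in a sample up to the whole dataset.

    Assumes the sample was drawn uniformly at random, so the hit rate
    in the sample matches the hit rate in the full dataset.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    return round(sample_hits / sample_fraction)


# Hypothetical count for illustration only, not a number from the study.
hypothetical_hits = 5_000
print(extrapolate_count(hypothetical_hits, 0.001))
```

Because a 0.1% sample corresponds to a scale factor of 1,000, every thousand items found in the sample implies roughly a million in the full dataset, which is why even modest sample counts translate into very large totals.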

The dataset has 12.8 billion samples and has been downloaded over 2 million times since its 2023 release. Although it was intended for academic research, its license does not restrict commercial use, allowing the data to be deployed in corporate AI systems. The dataset includes:

Thousands of verified identity documents and over 800 job application documents (resumes, cover letters) traced to real individuals through online searches.
Some resumes included disability status, background check results, racial information, and even contact details and government identifiers linked to real people.
Children’s personal information was present, including birth certificates and health records that appear to have been scraped from places where the data was not intended to be broadly public.

The research highlights how indiscriminate web scraping for AI datasets leads to the capture of data never meant for such use, and experts are now urging the AI community to abandon these practices and adopt better protections for personal data.

The findings challenge the assumption that all internet-accessible information can be freely used for AI, stressing the need for updated legal and ethical standards.

Legal Uncertainties: A rigorous audit and legal analysis found that the dataset raises complex questions about whether so-called “publicly available” data is actually fair game under privacy laws. Current international privacy regulations impose obligations like data minimization, consent, and breach notification, but enforcement is inconsistent. Notably, public accessibility does not automatically make information public under privacy law; data posted on the internet for a limited audience may still be protected, and indiscriminate scraping often fails to meet reasonable legal standards.
Failure of Automated Protections: DataComp attempted privacy safeguards such as automated face blurring, but audits revealed that these tools were often ineffective—missing millions of faces and failing to detect sensitive information in both images and captions. As a result, huge numbers of unblurred faces and unredacted identification documents remain accessible within the dataset.
Ethical Debate: The case is fueling broader debate about AI data curation practices. It underscores that, in AI development, privacy risks can propagate through the technology stack—from data sourcing to model deployment—making the need for transparent, robust curation measures even more urgent.
Research Community Call to Action: Experts now call for a fundamental rethink of how AI training data is sourced, urging stronger privacy screening, improved consent mechanisms, and explicit ethical guidelines to ensure the rights and dignity of individuals whose data may be inadvertently swept into the digital training pipeline.
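To make the idea of privacy screening for captions more concrete, here is a minimal sketch of a text-only check that flags captions containing an email address or a number that passes the Luhn checksum used by payment cards. It is not the method used by DataComp or by the auditors. The patterns and the `flag_caption` helper are hypothetical, and a real pipeline would need far broader coverage (names, addresses, ID numbers, and the image content itself).

```python
import re

# Hypothetical, deliberately narrow patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def flag_caption(caption: str) -> list[str]:
    """Return the kinds of sensitive data this sketch detects in a caption."""
    flags = []
    if EMAIL_RE.search(caption):
        flags.append("email")
    for match in CARD_RE.finditer(caption):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            flags.append("card_number")
            break
    return flags


print(flag_caption("Contact jane.doe@example.com for prints"))
print(flag_caption("a sunset over the bay"))
```

A pattern-based check like this catches only what it was written to look for, which is one reason automated safeguards can miss large amounts of sensitive material and why the call for stronger, layered screening matters.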

In summary, DataComp CommonPool demonstrates how AI’s hunger for large web datasets can lead to widespread privacy breaches, legal minefields, and a crisis of public trust—all while existing technical and regulatory safeguards struggle to keep up.

Resources:

Hong, R., Hutson, J., Agnew, W., Huda, I., Kohno, T., & Morgenstern, J. (2025). A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset. arXiv preprint arXiv:2506.17185. https://arxiv.org/abs/2506.17185

Privacy Professionals
