Data used to build AI models is fast becoming inaccessible
- Poulomi
- Aug 5, 2024
- 6 min read
Last week I came across the latest publication from the Data Provenance Initiative, an MIT-led research group. Titled Consent in Crisis, the piece was riveting. Limited as my understanding of AI is, it was eye-opening to learn that the availability of the data on which AI models are developed could itself become a concern.
With approval from the good folks at the Data Provenance Initiative, I summarize my understanding of the paper below (and quote from it where needed). As always, the topic is a complex one and the research painstakingly nuanced. The onus for any misrepresentation, while unintended, is squarely on me. The link to the original publication is at the end of this post.
Summary of findings
A proliferation of restrictions on AI data commons [a]
Consent asymmetries & inconsistencies
A divergence in content characteristics between the head and tail of public web-crawled training corpora [b]
A mismatch between web data and common uses of conversational AI
Background
AI models are built on the data commons – public internet data – and are enabled by the sheer scale and heterogeneity of that data.
However, the use of this web data for AI poses ethical and legal challenges to data consent, attribution, and copyright.
According to the research published by the Data Provenance Initiative, an MIT-led research group, this data is fast becoming inaccessible. The paper quantifies the access issues and points to the potential impact this would have on future AI models.
Data Sources Used
The study looked at three data sources that have high download rates and have been used in most foundation models [c]. These sources are:
1. C4
a) English-language text sourced from the public Common Crawl web archive
b) Extracts natural language from web pages and is used to train NLP models
c) Has about 750 GB of cleaned text data from ~365 million webpages
d) Has 1.0 trillion tokens [d]
2. RefinedWeb
a) Developed by Technology Innovation Institute
b) A high-quality web dataset used to train LLMs such as the Falcon series
c) Data is extensively filtered and deduplicated
d) Has 5.0 trillion tokens with public extract of 600 billion tokens available for use
e) The hosting platform Hugging Face claims that models trained on RefinedWeb achieve performance in line with or better than models trained on curated datasets
3. Dolma
a) Developed by Allen Institute for AI
b) Used primarily to train language models
c) Has 3.0 trillion tokens from a diverse mix of web content, academic publications, code, books and encyclopedic materials
Head Sample and Random Sample
The study was conducted on roughly 14,000 web domains: a head sample of about 3.95K domains taken from the top of the three data sources when ranked by number of tokens, and another ~10K domains selected at random from the intersection of the three corpora (10,135,147 domains).
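To make the sampling concrete, here is a minimal sketch of how such a head-plus-random split could be assembled. This is my own illustration, not the authors' code: it assumes each corpus is already available as a mapping from domain to token count, and the exact selection rule in the paper may differ in detail.

```python
import random
from collections import Counter

def build_sample(corpora, head_size=3_950, random_size=10_000, seed=0):
    """Split domains into a 'head' sample (largest token counts) and a
    'random' sample drawn from domains present in every corpus.

    corpora: dict mapping corpus name -> {domain: token_count}
    """
    # Head: domains with the largest combined token count across corpora.
    combined = Counter()
    for counts in corpora.values():
        combined.update(counts)
    head = {domain for domain, _ in combined.most_common(head_size)}

    # Random: sampled from the intersection of all corpora, excluding the head.
    shared = set.intersection(*(set(c) for c in corpora.values())) - head
    rng = random.Random(seed)
    rand = rng.sample(sorted(shared), min(random_size, len(shared)))
    return sorted(head), rand
```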
Human annotators were trained to manually label websites for content type, website purpose, presence of paywalls, embedded advertisements, text of terms of service and other metadata.
Findings
1. Decline of consent to open web data – this is happening at multiple levels
a) More web domains are adopting robots.txt [e] and Terms of Service pages.
b) Crawling restrictions in robots.txt have risen since the introduction of the GPTBot and Google-Extended crawler agents (see the robots.txt sketch after this finding)
c) Terms of Service pages have added more anti-crawling and anti-AI restrictions
d) News websites impose especially high restrictions
Given the rapid increase in restrictions – in both robots.txt files and terms-of-service policies – the pool of open and freely available data on the web is likely to decline significantly, which in turn will affect the quality of future AI models.
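As an illustration of what these robots.txt restrictions look like in practice, below is a hypothetical file that singles out AI crawler agents while leaving general crawling open, checked with Python's standard urllib.robotparser. The domain, rules and agent list are made up for illustration; they are not taken from the paper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind the study counts as restrictive:
# AI-related crawler agents are blocked while other crawlers stay allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ["GPTBot", "Google-Extended", "CCBot", "Googlebot"]:
    ok = parser.can_fetch(agent, "https://example.com/articles/some-story")
    print(f"{agent:16s} allowed: {ok}")
# GPTBot, Google-Extended and CCBot are denied; Googlebot is still allowed.
```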
2. Inconsistent and ineffective communication on AI consent
a) AI developers face differing degrees of restriction: OpenAI's crawlers have been found to face greater restrictions than those of Anthropic, Cohere and Meta. These asymmetric restrictions tend to advantage the lesser-known AI developers.
b) Contradictions between robots.txt and ToS – web crawlers follow the Robots Exclusion Protocol [f], while Terms of Service is a legal agreement between the site and its users. The REP, created in 1995, has a rigid structure and limits on what it can convey, whereas a ToS can express a nuanced policy in natural language. The researchers found that in many cases a website's robots.txt implementation fails to capture the intentions specified in its ToS. The paper's figure, a cross-tab of robots.txt restrictions against terms-of-service restrictions measured in percent of tokens, shows the inherent contradiction (a sketch of how such a cross-tab might be computed follows).
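As a rough sketch of how such a token-weighted cross-tab could be computed, assume the per-domain robots.txt and ToS verdicts and token counts have already been collected. The domains and numbers below are invented for illustration; this is not the paper's code.

```python
from collections import defaultdict

# Hypothetical per-domain annotations:
# (domain, robots.txt restricts AI crawlers?, ToS prohibits AI use?, tokens)
domains = [
    ("news-site.example", True,  True,  9_000_000),
    ("blog.example",      False, True,  1_500_000),  # ToS forbids, robots.txt silent
    ("forum.example",     True,  False,   800_000),  # robots.txt blocks, ToS silent
    ("shop.example",      False, False, 2_200_000),
]

total_tokens = sum(tokens for *_, tokens in domains)
crosstab = defaultdict(int)
for _, robots_restricts, tos_prohibits, tokens in domains:
    crosstab[(robots_restricts, tos_prohibits)] += tokens

for (robots_restricts, tos_prohibits), tokens in sorted(crosstab.items()):
    label = (f"robots.txt {'restricts' if robots_restricts else 'allows   '} | "
             f"ToS {'prohibits' if tos_prohibits else 'allows   '}")
    print(f"{label}: {100 * tokens / total_tokens:5.1f}% of tokens")
# Cells where the two signals disagree are the contradictions the paper highlights.
```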
3. Correlating features of web data
a) There are several statistically significant differences between the head and tail distributions of web domains. The head is heavy on news, social media and encyclopedias, while the tail is dominated by personal and organizational websites, blogs and e-commerce sites. This discrepancy highlights how much curation choices matter when creating training datasets free of bias, undesirable text and images, and quality discrepancies across languages.
b) The head distribution is more multimodal and more heavily monetized. Because news sites, periodicals and the like are concentrated in the head, restrictions from robots.txt and Terms of Service are higher there, and most of the content is monetized via advertising and paywalls. Crawlers therefore face greater challenges in reaching up-to-date content sources.
4. Misalignment between real-world AI usage and web data – this finding is based primarily on a sample of ChatGPT conversations.
a) Evidence suggests that models trained on unstructured web data are not aligned with the ways users actually use generative AI. In one sample, researchers found that over 30% of ChatGPT requests were for story writing, role-playing or poetry – a category that is poorly represented in the web data used to develop models.
b) Sexual role-play appears to be a prevalent use of ChatGPT despite such material being removed from common public datasets. This finding is surprising since OpenAI filters harmful content from its training data (according to the GPT-4 technical report) and its Usage Policies prohibit such content for minors. However, there is ambiguity: the model's refusal instructions distinguish between erotic and non-erotic sexual content, and sexual content contextualized for medical purposes is treated differently.
c) While news websites have the highest number of tokens in the datasets used for model development, fewer than 1% of ChatGPT queries are related to news.
Impact of decline in consent
a) A skew in the availability of data, reducing its recency and diversity and thereby limiting model capabilities
b) The new wave of robots.txt and Terms of Service restrictions currently has no way to distinguish among the various uses of a site's data. As a result, blanket prohibitions disproportionately affect non-commercial users such as academic researchers and non-profits.
c) More websites may lock their data behind logins or paywalls, further shrinking the open web. The researchers consider this a real possibility since the content on the internet was not created to be used for training AI models.
View from Research Lab
AI spending hit a frenzy after OpenAI introduced ChatGPT in late November 2022. Per a report from The New York Times, big technology companies have sunk about $50 billion into AI this year. What is not clear is when all this money will start paying off.
Wall Street has started to question the spending, while a few tech bosses (Alphabet's Pichai and senior executives from Microsoft) preached patience. Goldman Sachs' Jim Covello spoke for a lot of people when he wrote in a June report, “What $1 trillion problem will A.I. solve?”
However, despite challenges around accessing free, high-frequency data and questions about investment returns, AI will continue to be a game changer for the foreseeable future. Progress is incremental for now because the bulk of the CapEx is going into building infrastructure, but the benefits will cascade. According to a March 2024 report from Statista, the market size across all AI applications is expected to cross $800 billion in 2030, a CAGR of 28% between now and then.
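As a back-of-envelope check of that projection (my own arithmetic, assuming the quoted CAGR runs over the six years from 2024 to 2030; the 2024 base below is implied by the numbers, not stated in the report):

```python
# Implied 2024 market size if $800B in 2030 corresponds to a 28% CAGR
end_value, cagr, years = 800.0, 0.28, 6            # $B, growth rate, 2024 -> 2030
start_value = end_value / (1 + cagr) ** years
print(f"Implied 2024 base: ~${start_value:.0f}B")  # roughly $182B

# Growing that base at 28% a year recovers the 2030 figure
implied_cagr = (end_value / start_value) ** (1 / years) - 1
print(f"Implied CAGR: {implied_cagr:.0%}")         # 28%
```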
Will robots.txt and Terms of Service stand in the way? Unlikely, though lawsuits may be a cause for concern. Last December, The New York Times sued OpenAI and Microsoft over copyright infringement, alleging that articles published by the Times were used to train automated chatbots; in April this year, eight other dailies sued OpenAI and Microsoft over the same issue.
This story is still evolving.
Appendix – definitions (source: ChatGPT-4o)
[a] Data commons – in AI, a shared resource or infrastructure where data can be collected, managed, and accessed collectively by a community or organization to foster collaboration, innovation, and transparency in AI development
[b] Training corpora (singular: corpus) – collections of large datasets used to train AI models; the data may be text, images, audio or video
[c] Foundation models – language models (GPT, BERT) and vision models (CLIP, DALL-E)
[d] Tokens – fundamental unit of data processed by algorithms, particularly in NLP and machine learning
[e] robots.txt file – tells search engine crawlers which URLs the crawler can access on the target website; the purpose is to avoid overloading the site with requests
[f] Robots Exclusion Protocol – a simple and powerful mechanism that webmasters and SEOs can use to instruct automated web crawlers, such as search engine bots, which parts of their websites not to crawl
About Data Provenance Initiative
Volunteer collective of AI researchers from around the world
Most recent work – analysis of 14,000 web domains to understand the evolving provenance and consent signals behind AI data.
Original Paper