getting scraped

The Washington Post looked at what information feeds Google’s chatbots, particularly the C4 data set, which was scraped from 15 million English-language websites. This is the ‘artificial intelligence’ that feeds the chatbot: stuff that people have written and posted online. All of this is taken without authorization: “The copyright symbol — which denotes a work registered as intellectual property — appears more than 200 million times in the C4 data set.”

“Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.

This text is the AI’s main source of information about the world as it is being built, and it influences how it responds to users. If it aces the bar exam, for example, it’s probably because its training data included thousands of LSAT practice sites.

Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.” —WaPo 2023-04-19

This website is in that data set, among about 500K blogs. It was added without my permission and in contravention of my copyright licence, CC BY-NC-SA 4.0.

Search for jarche.com in Google’s C4 data set, showing 350K tokens

I doubt there is much I can do about this, as I cannot afford a lawyer to go up against Google. This situation is another one where giant information technology companies move fast and break things, knowing that it will take legislators, regulators, and the courts some time to catch up to the current state of technology. Meanwhile big corporations make big money.

Consider that GPT-3 consumed 700,000 litres of water in its creation, or that each question to ChatGPT could consume 500 ml of water: that is each and every question asked by each person. For now, I am watching these generative large language multi-modal models, though I will not use them. But they are using us.
