Web dev at the end of the world, from Hveragerði, Iceland

GDPR and American AIs


Italian regulators turn their attention towards OpenAI and ChatGPT

If you’ve been following AI discussions on social media, you might have heard that the Italian privacy regulator banned ChatGPT in Italy. This has led to jokes such as it being retaliation for ChatGPT recommending you snap spaghetti in two to make boiling it easier or that you put pineapple on your pizza.

It’s also led to the usual accusations of regulators being against progress and hating cool things. (This is the sort of attitude that’s the reason why the US still hasn’t banned asbestos.)

Instead of just reading yet another pundit spout an opinion based on somebody’s response to what somebody else thinks the ban might actually be about, I decided to go and read the complaint itself. (Scroll for English).

It’s relatively straightforward, if you have some familiarity with the GDPR.

(So, not straightforward at all, honestly.)

The GDPR is the EU’s data privacy regulation. It harmonises data privacy laws across the EU, and it has pretty stiff fines for violations. From the Wikipedia page: “€20 million or up to 4% of the annual worldwide turnover of the preceding financial year in case of an enterprise, whichever is greater.” (Emphasis mine.)
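To make the “whichever is greater” part concrete, here is a minimal sketch of the ceiling that quote describes. The function name and the turnover figures are made up for illustration, and this only computes the upper bound from the quoted rule; what a regulator actually fines is decided case by case.

```python
# Hypothetical helper: upper bound on a fine under the rule quoted above,
# "EUR 20 million or up to 4% of annual worldwide turnover, whichever is greater".
FLAT_CAP_EUR = 20_000_000
TURNOVER_PERCENT = 4


def max_gdpr_fine_eur(annual_worldwide_turnover_eur: int) -> int:
    """Return the ceiling on the fine, in whole euros."""
    # Integer arithmetic avoids floating-point rounding on large amounts.
    turnover_based_cap = annual_worldwide_turnover_eur * TURNOVER_PERCENT // 100
    return max(FLAT_CAP_EUR, turnover_based_cap)


# Illustrative (invented) turnover figures, not real company data.
small_company = max_gdpr_fine_eur(100_000_000)    # 4% is EUR 4 million, so the flat cap applies
large_company = max_gdpr_fine_eur(1_000_000_000)  # 4% is EUR 40 million, which exceeds the flat cap
```

The percentage only starts to matter above €500 million in turnover, since 4% of that is exactly €20 million. Below that, the flat figure is the ceiling.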

One way the GDPR limits the abuses of data collection is that data can only be collected with consent and used for a specific legitimate purpose. This is to prevent companies from just hoovering up data and then using it as a general-purpose building block for products, analytics, and surveillance. If you collect data, it must be for a specific purpose.

This post has more details on the implications this has for language models. It’s from a couple of months ago, so any competent tech co should have known this was coming. The primary complaint is that OpenAI is collecting personal data and using that to train the model. To be more specific, OpenAI is collecting pretty much all the internet, which inevitably is going to contain personal data, and training on that.

Since the model is a general-purpose language model, there is no way for it to enforce the purpose restriction required by the GDPR. It is specifically a general-purpose language model intended to be the foundation other tools build on.

Even if OpenAI somehow did get around the purpose-restriction, they don’t have consent from the owners of the private data.

Even if they did get that, for example by arguing that by publicly posting the data the owner has given implicit consent, OpenAI doesn’t support the right to erasure: you can’t ask it to delete all your personal data. This is essential for the implicit permission defense because you need to be able to erase data that was accidentally or maliciously made public. Machine ‘unlearning’ hasn’t caught up with regular machine learning and, as far as I can tell, nobody’s got it working properly on a system the size of GPT-3 or GPT-4. So, even without any other issues, just judging from the inclusion of crawled websites in the training data, it looks like OpenAI is indeed breaking GDPR regulations.

But, additionally, by their own admission, OpenAI was training on user data by default until a month ago, and they have said that deleting that user data is impossible. Both are admissions of clear violations of the GDPR.

It certainly looks like OpenAI is pretty unambiguously in the wrong here. But, more importantly for tech’s AI aspirations, it looks like the same applies to every other foundational model out there.

If the Italian regulator is right, and it looks like they are, then this generation of large language models just might not be compatible with the GDPR.

Almost as important as the primary complaint are the data breaches. Not properly reporting breaches to both the regulator and affected users is a serious GDPR violation for a service of that size. Companies have been fined for that in the past. OpenAI has had a bunch of data breaches lately that it handled poorly, which the regulator cites in its notice.

The thing about having 100 million users is you get the regulation that comes with it. I have no idea how OpenAI are going to handle this, but the complaint seems valid enough.

And because, unlike the big tech cos, OpenAI doesn’t have a presence in the EU and hasn’t picked a lead supervisory data protection authority, all of the regulators have jurisdiction. “Controllers without any establishment in the EU must deal with local supervisory authorities in every Member State they are active in.”

Italy might just be the first and, unfortunately for OpenAI, every single regulatory body has the power to fine. They could be facing multiple fines from multiple countries.

People have been wondering how on earth these models were supposed to comply with the GDPR for months now.

The answer seems to be that they aren’t.

The AI is an American

Years ago, I had a go at explaining to somebody that AI colourisation inevitably erased variation and minorities out of history.

AI-generated images are that x1000. Everything becomes American.
