NO! My data cannot be used to train AI models!

There has been rapid development in machine learning models recently. The scientific achievements are in many ways impressive and valuable (Gupta et al., 2022; Ardila et al., 2019). However, some of the development has focused more on commerce, i.e. selling products based on these trained models. Lensa AI, for instance, is a smartphone app that creates AI images based on your selfie. It sells "magic avatar packs" for 5€ and also requires a subscription. Other similar image-alteration or text-to-image products ask for similar payments, and some stock image banks have started selling AI-generated images. Even though a few euros might not be much for many people, it feels questionable when the products use models trained on our data: who has the right to profit from such data? Yet many products still remain free, looking for a business model in this new market. Before this new market establishes itself, we should really think about what such AI means, who gets to profit from it, and in what way.

These machine learning algorithms use data to train themselves, i.e. to produce the machine learning model that can then be used to do the thing it does¹. This data does not just materialize from a void but is collected from various sources. There are large databases that can be used to train these models, such as LAION-5B, which offers image/text pairs for training models on tasks requiring some sort of text/image parsing. LAION-5B contains 5.8 billion pairs², but there are other databases too, some openly accessible and others not. There are many debates and questions concerning how these images are gathered and treated, but in general, any data found on the internet can be used to train these models. This may include copyrighted materials (Sharp, 2022; Baio, 2022), medical records (Edwards, 2022) or other sensitive material that has leaked onto the internet for some reason.

Moreover, much of the data may not be copyrighted per se, meaning it is not only allowed but also meant to be consumed by others. However, this does not mean that the creator is allowing the material to be used to train machine-learning models. One could liken it to an artist's portfolio: the fact that it is public does not mean a gallery can copy and print the images against the artist's will and mount an exhibition without the artist's consent and commission. Or to a research paper: the paper cannot be used to create modified copies of itself; research is there for others to develop or falsify, not for making endless modified copies³.

To put it briefly, there is lots of material on the internet that is meant to be experienced, enjoyed, hated — consumed, but not exploited. However, as asking permission for 5.8 billion images, for example, is a bit time-consuming, the tech companies do what they always do: act without thinking or asking⁴.

Some sites, like "Have I Been Trained", try to make it easy for creators to get their data and materials out of the training sets. However, as John Siracusa pointed out in a recent episode of the Accidental Tech Podcast (episode 515), the situation is similar to (I am paraphrasing here) first robbing a bank and only afterwards saying, "Oh, OK, if you do not want me to rob the bank, I will not rob your bank again." In other words, countless models have already been trained with that data.

Furthermore, there is the question of the data itself. As an artist, art educator and researcher, I find it absolutely disgusting that any image can be thought to be equivalent to a text⁵, especially not to just a few words. A painting of a house is not the same as writing "painting, house"⁶. Text is not an image; an image is not a text. The two always escape each other: a text can produce many vivid images within our heads, which vary from person to person (and with background, context, status, time of day/month/year/century, etc.), and images always escape the definitions of text. There are lots of issues with data and classification, too many to go into detail here, but Kate Crawford's book (2021) is an excellent place to start.

Still, there are many more questions about machine learning⁷. For me, using the word learning is just as confusing as using the word intelligence⁸ in artificial intelligence. What is learning? Do the algorithms learn? Or do they just do math? Or what? I like Dan McQuillan's description of AI as just optimised data (2018). There are no images, articles, novels, or oil paintings for machine learning models, just a humongous set of binary data that needs to be optimised. Is that learning? Again, more issues that would require a lot of thought and research.

One might go even further and ask, like Alexander R. Galloway (2022), what the end result of machine-learning production would be: what do we get when optimised data is spun and spun, again and again?

However, I am pissed off. I find it irritating that my creations might be stolen in order to train some model that can then be packaged into an app or whatnot and sold for lots of money. Therefore, I made this image that says NO!

No, you little pigs; no, you little gluttons; no, I am not OK with you using my data without me even knowing about it.

I do not know if that helps, but it makes me feel a bit better while we search for ways to handle this disaster.

NO, you cannot use my creations to train any machine learning models that can be employed for commercial use.

Feel free to download and use it.

Footnotes

1 Which can be many things: generating images from other images or from text, creating text based on input, recognising people, feelings, or objects in a photo, detecting cancer cells, and so on. Many, many things.

2 LAION-5B says they use the text found in the ALT metadata, i.e. the text put there to describe the image to the internet, which is questionable, to say the least. Other models try to describe images with categories.

3 I know it may sometimes seem there is no difference between the two, as scientific publishing is what it is, but there is a substantial difference.

4 Who would think that creating a website to rank the women on campus by their looks is a good idea? Yet that is how Facebook was born. And there are countless other examples with devastating outcomes (Morozov, 2014; Rushkoff, 2016).

5 See footnote 2 on how LAION-5B pairs images with text.

6 If you don’t believe me, ask Wittgenstein.

7 I am not saying that these models are useless or without benefit. But the whole concept is misleading, and this leads to lots of misinterpretations, for instance when asking who created the product, and so on.

8 The whole concept of artificial intelligence tends to lead to many challenges and misinterpretations, such as equating it with human intelligence or limiting it to specific domains of human capabilities.

References

Baio, A. (2022). Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator. Waxy.org. https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/

Crawford, K. (2021). Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence. Yale University Press.

Edwards, B. (2022). Artist finds private medical record photos in popular AI training data set. Ars Technica. https://arstechnica.com/information-technology/2022/09/artist-finds-private-medical-record-photos-in-popular-ai-training-data-set/

Galloway, A. R. (2022). Normal Science. http://cultureandcommunication.org/galloway/normal-science

Gupta, M. D., Kunal, S., Girish, M. P., Gupta, A., & Yadav, R. (2022). Artificial intelligence in cardiology: The past, present and future. Indian Heart J, 74(4), 265-269. https://doi.org/10.1016/j.ihj.2022.07.004

McQuillan, D. (2018). Rethinking AI through the politics of 1968. openDemocracy. https://www.opendemocracy.net/en/rethinking-ai-through-politics-of-1968/

Morozov, E. (2014). To Save Everything, Click Here. PublicAffairs.

Rushkoff, D. (2016). Throwing Rocks at the Google Bus. Penguin UK.

Sharp, S. R. (2022). He’s Bigger Than Picasso on AI Platforms, and He Hates It. Hyperallergic. https://hyperallergic.com/766241/hes-bigger-than-picasso-on-ai-platforms-and-he-hates-it/