27 votes

Largest dataset powering AI images removed after discovery of Child Sexual Abuse Materials

30 comments

  1. [15]
    winther
    Link

    Not really surprised, given how the current tech culture is "move fast and break things", without a care in the world for potentially damaging consequences. Those are for the legal department to deal with afterwards, it seems.

    We have allowed these models to be trained on data without consent with the argument that if it is on the internet, it is fair game. We really need better regulation on what data is allowed to feed these models.

    31 votes
    1. [11]
      Comment deleted by author
      Link Parent
      1. [9]
        winther
        Link Parent

        Consent isn’t given simply because it is on the internet. These images are likely not on the internet with consent from those on the images. AI models have also been trained on pirated material.

        There is also a big difference between a search engine that indexes websites and links to the source, and these scrapers that take the content, create something new from it, and remove any reference to the original source. Stuff that is public on the internet is still copyrighted, and the owner of the website still has control over the content. If they remove the website from the internet, the search engine will remove the link as well. All of this is gone with AI scraping. The legal issues here are still in limbo, and it is definitely not the same.

        17 votes
        1. [9]
          Comment deleted by author
          Link Parent
          1. [5]
            winther
            Link Parent

            It is if it's publicly available. That's what most courts have ruled, and all web scraping is. If you make something available, you make it available.

            There is a difference between making something available to view and read, and making it available to be processed and basically resold in a new commercial product. It is clear that AI model training is something the current copyright laws aren't fit to handle, and it is at best a grey area without clear legal precedent globally. But that is just the regular stuff online; the models have also scraped pirated material and, in the case of this article, images of sexual abuse that were clearly not made with consent.

            There's not. That's how search engines work. They take a source and create something new from it. Most search engines even use the same AI tools.

            Search engines direct users back to the website. If the website is removed, the search index link no longer works. There is also transparency, with tools and the right to be forgotten even from Google. With AI models, tons of personal data is just scraped and potentially used forever. Not to mention possible GDPR issues: we have the right to have our personal data removed if a company holds it in their systems, which is currently not something many of these AI companies seem to care much about.

            Sure, but if you host a site, you've made it freely available. It's like putting a box of fruit on your lawn saying "here take one". If you don't want people to take one, or have some control over what they do with it you need to lock it up somehow or control access.

            That something is on a website gives everyone the ability to view it. They don't automatically have a right to, for example, print it out and resell it as a book. That would be a clear infringement. And currently, there is no clear path for regular people to have their stuff removed from existing AI models, even if those models contain personal data that could be a GDPR violation, or illegally obtained data from pirate sites, or, like in this case, pictures of abuse.

            It's exactly the same, it's just web scraping. You're taking data that is offered freely to any computer that can access it, and using it in a way that is allowed via fair use, either via research exceptions or transformative action. While new laws might be passed, like in the EU, as it stands now, at least in the US, it's being shown to be legal.

            At least it is not legal globally yet, even though they operate globally and scrape data from all over the world without any consideration for existing laws on handling private data. The limitations of fair use don't have a totally clear global legal precedent yet. And you still somehow ignore the illegal or pirated data these scrapers also include in their models and seemingly have little control over, since some models have been shown to leak their raw training data, which is clearly not within fair use.

            8 votes
            1. [5]
              Comment deleted by author
              Link Parent
              1. [4]
                winther
                (edited )
                Link Parent

                I would start off by saying that the legalities of all this are still not nearly settled. There are still cases and lawsuits happening, as this is a whole new area that stretches the limits of the traditional understanding of copyright and fair use. What I think is generally fair to criticize here is how the role of consent has been turned on its head, in my opinion. Putting something up on a website gives users of the internet the right to read and see that content, sure. But that I have allowed you to do X with it isn't the same as having allowed Y. Most people would agree that a company simply can't take my personal pictures from my website and put them on a billboard. But are we somehow supposed to allow a company to feed my pictures into models that sell new pictures, perhaps see our family pictures being used to create new AI porn? While the legalities are still unsettled, I still strongly think there is a big ethical difference between having allowed Google to find and link people to your site, and a company using your data to create a new product and sell it without clearly crediting the source. Search indexing at least leads people directly to the source of the content. AI models completely remove that.

                On that I agree with you. This content shouldn't have been allowed in. They should do a better job inspecting the data, but with that said, it's literally 1 out of 2,000,000 images, and it's data that's likely already been indexed by Google, Apple and others. My point is, this isn't an issue with AI or the training of AI; it's much broader and deals with things outside of AI systems.

                That is just this example, which is of course the worst thinkable offense. Then there is all the rest with unclear sources. I think it is fair to demand that these companies be better at vetting their sources upfront, instead of trying to clean things up afterwards. For example, authors have found that illegally obtained copies of their books have been used to train these models. Sure, those particular books might be removed from future models, but the damage has already been done, and we now see scammers trying to sell AI-written stories, based on material scraped without permission, back to publishers. That is my main criticism of how these companies "move fast and break things": they just let the damage happen in the wake of their growth goals instead of being a little bit cautious.

                There's still an index of it though. Including often summaries and LDA vectors, etc. While not enough to fully rebuild it, it is a significant amount of data, equivalent to what's stored in an AI, possibly even greater.

                The usage is different. Search engines provide people with links back to the sources. There is transparency, and a website owner has control over whether to be indexed by a search engine or not. The AI models just absorb the content and remove the original source completely. It is one thing to allow your site to be indexed by a search engine, because you want people to come to your site; figuratively, there is some sort of transaction going on there. That is gone with these models.

                If the data is aggregated into the model, that person's data is thermalized and no longer exists independently. That is, it's not there any more and has been blurred out by all the other data.

                That is clearly not always the case. They don't have full control over their own models, and to everyone else it is just a black box, so it is hard to trust that the models don't still contain personal data or can't leak it somehow.

                EU laws only apply in the EU. That's why they're probably going to lose AI development business and see long-term GDP stagnation within this market segment. It's possible others might make up for it though; other countries will leapfrog them.

                They still sell and operate within the EU. And given how they can't control their sources, they likely still process data on EU citizens and we have no clear way of getting out of having our data sold and used.

                Of course it would, but we're not talking about that. We're talking about a fair use exception in regards to research and transformative work. Which AI training fits under.

                Again, I don't think there is global agreement on the legality of that part yet. All the lawsuits aren't done. Also, this is new territory for everyone. We have just been through a decade of personal data privacy going down the drain to boost the market value of tech companies. Maybe now is a good time to at least have a democratic debate on what kind of consent and control we have over our own data and creations? I find it weird and unethical that our default setting seems to be that we allow everything - like everyone is opted in by default, instead of the other way around. It is not unreasonable to think that most people who put stuff up on a website intended for it to be read and consumed, not necessarily processed and resold by big tech companies. This data is clearly valuable to them, but they harvest all the profit.

                Beyond CSA though, general copyrighted data doesn't exist in the AI model after it's been trained, any more than it would exist in my "number of houses" dataset. If you take a hash of an image, that hash has been transformed enough that the original image no longer exists. Properly trained, an AI model would similarly retain no more detail of the input data. That is, if hashes are legal, a properly trained AI should be as well.

                Yet these models are able to output text "in the style of" some author, or a painting in the style of an artist, which at least comes tangentially close to plagiarism. And as shown in another link above, the data is maybe not always completely gone.

                No I haven't. The pirated data is a complex gray area, but in the US there are avenues to deal with that via the DMCA. That would attack the dataset itself, but the model would not contain copyrighted data, any more than a hash would.

                Even if the original data is gone, the model will still have increased in value due to the input from illegally obtained data. The damage is already done and far more complex to reverse - likely practically unfeasible even if a good DMCA-style system were created where everyone could have their stuff removed from the model. Which leads me back to my original point: we should demand that these companies have better control over their training sets, instead of just plowing through everything and leaving the damage in their wake for others to clean up.

                4 votes
                1. [2]
                  Comment deleted by author
                  Link Parent
                  1. winther
                    Link Parent

                    The law is never a settled matter. That said, the only way you'll change this is by killing a large part of fair use. At least in the US.

                    I think this goes somewhat beyond the original idea behind fair use. The current copyright laws are ancient, the internet is still fairly new and the laws have barely caught up with that, and this new thing goes even further. I think it is reasonable to have a democratic debate on how to deal with these things, and on how people's personal websites and whatnot are being used in a totally new manner that most people didn't anticipate.

                    If you don't want people to use your data, don't open it up and give it away for free. People want to close off a big part of what makes the internet useful because they don't want to understand how data and information works.

                    There are very different levels of use. Some are already handled by current copyright laws, but I hope we can at least agree that these AI models are something new that most people who put stuff online didn't anticipate. Putting stuff online and having it indexed by Google involved at least some level of transaction, by leading people to your own site. The AI models just take valuable data and keep all the profit, and the original source is lost in the process. If anything, these models have the potential to basically ruin the current internet, because if they contain all the information, no one needs to go to the original source websites anymore. It's not gonna happen tomorrow, but if everything on the internet can just be repackaged and resold by AI companies under "fair use", then what is left for those who actually make original data in the first place? There is a massive jump from what search indexes did to what these AI companies are doing with data on websites.

                    No, you can choose not to put your data on the internet, or lock it up. It's analogous to arguing that people can't take pictures of your house and use them in artwork.

                    I don't think you generally have copyright on just your house. But I find it weird that you argue that just because I write something publicly online, I should cede all control of it. Currently we actually still have copyright even on stuff that is publicly available on the internet. It is weird to me to say that if you want other people to read your website, you should also allow companies to harvest your data and resell it for a profit without you getting anything. There is a massive difference between naturally wanting visitors to your site, maybe even someone quoting you and citing the source, and having everything processed and resold. I don't see the argument that it should be all or nothing.

                    But your data isn't there any more. It's been abstracted away in the same way a hash is. It's like you're arguing that you want the source available and hidden at the same time. Also, search indexing doesn't direct people to the source of content, just to the highest-ranked resource, which oftentimes isn't the source but a re-write.

                    It is more than just a hash, I would argue. A hash of some text makes it possible to verify whether the original text is the same; a single character will change the entire hash, and a hash in itself isn't able to produce anything new. I am not saying the source should be hidden, quite the opposite. The ideal would be for the models to have better control of their sources, including consent from the creators whose work is used. The best example I can think of is how Kagi's FastGPT works. In the case of websites, that is of course more a problem with the website using other people's work uncredited, and not so much with the search index itself. Of course it would be fair to do a better job at those sorts of things, just like I think we should demand a bit more from the AI companies in how they use everyone else's data.

                    Fair use allows for much of this. You can do things with other people's works even if they don't want you to, provided it fits under the guidelines of fair use. You'd have to kill fair use to do this and make copyright even more authoritarian than it is.

                    Fair use is already a vague and fairly old concept that at least needs to be rethought with AI models in mind. The sheer scale of this goes quite beyond the intent of fair use, I would say, which was for example that you could cite a work for review purposes and whatnot. That is different from processing all of Stephen King's novels to create a new "Stephen King"-like novel from an AI, which seems close to possible soon.

                    Maybe, maybe not. Depends if fruit of the poisonous tree applies (this is an area I'd agree has not been properly litigated). The data set in question would be in violation of copyright, but the model outputs may or may not be. The problem is, you are allowed to do transformative functions on data that was not legally acquired, again like hashing. There are AI algorithms which are fundamentally and functionally no different from a hashing algorithm (some are even generative); to try and make a law around that is insane. It becomes a distinction without a difference.

                    I don't quite follow you on the hashing algorithm comparison. A hash of one novel is indistinguishable from a hash of another novel. The AI models retain plenty more than that, or else they wouldn't be able to do the "in the style of ..." type of things. There is also a difference between what is legal to do in a purely mathematical, machine learning sense, and what kinds of uses you should be allowed to sell your models for. I am not talking about making machine learning algorithms illegal, more about regulating how these companies should be allowed to profit from other people's data.
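
                    To make the comparison concrete, here is a minimal sketch in Python (only the standard hashlib module; the two snippets are made-up placeholders, not real novel text) of why a hash can only verify content, never compare it or produce anything new:

                        import hashlib

                        original = "The quick brown fox jumps over the lazy dog."  # placeholder text standing in for a novel
                        altered = "The quick brown fox jumps over the lazy cat."   # one word changed

                        # The two digests share no visible structure: a hash can only confirm
                        # that text is byte-for-byte identical. It cannot measure similarity,
                        # and it cannot generate anything "in the style of" its input.
                        print(hashlib.sha256(original.encode()).hexdigest())
                        print(hashlib.sha256(altered.encode()).hexdigest())

                    Model weights, on the other hand, are trained to reproduce statistical patterns of the input, which is exactly what makes the "in the style of" prompts possible.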

                    A company that's cautious will lose to a company that isn't. A country that's cautious will lose to a country that isn't, and then that country will make the laws. This isn't like an ecological disaster; it's just information, entropy.

                    Which is why we have regulations on many industries. The pharmaceutical industry could probably move a whole lot faster if it didn't have to care about medical ethics or regulations, but we have agreed as a society that not everything is allowed. There is a middle ground between a total ban and doing whatever you want.

                    Beyond which, that's not how search engines work, they frequently don't push people back to the source. They push people to the highest scoring resource, which often times is a re-write of the original source, because they can be better optimized for the search algorithm. There is no back transaction, just data.

                    I think we are mixing things up here. In that example it is the websites that are in the wrong for reusing other people's work, and as I said before, search engines could also do better in that regard. That doesn't change the fact that website owners allowing their sites to be search indexed is not the same as them intending to have their stuff used for every other imaginable purpose.

                    Beyond which, if it removes the original source completely then the issues of personal information no longer matter, and your points above become wrong.

                    Personal data can still be present, which we have seen examples of. The problem is that it is very much a black box, with no clear way of controlling what your personal data is being used for.

                    Which is why they won't operate in the EU. They'll have subsidiaries in other countries that are not the same company, allowing them to operate under different rules and laws. I mean, it's more complicated than that, but functionally that's how it works.

                    Even if they don't sell their product to EU citizens, given that they have scraped the entire internet, they still use data without consent from EU citizens. I think that is fair to criticize them for.

                    The problem is, in many ways you've already lost it. The question is whether everyone will have access to the data or, more and more likely, just the big businesses who can afford it. At least in the US, everyone can use the data, for now. I actually expect fair use to die because of all this. It's one of the reasons I fight so hard against this.

                    So we should just give up instead of trying to correct things, because the damage has already happened? That was my problem in the first place: that we allow companies to do damage in the name of progress and just give up afterwards because it is "too late". It might be an uphill, even impossible, battle, but that doesn't mean we should accept it.

                    I don't hope or think fair use is going away, but it is probably time to at least revisit its limitations and update the concept for this new world. Right now there are no lines drawn. It took over a decade to clear things up with personal data online, and GDPR is mostly a poor bandaid. Would be nice if this time we could do a little bit better from the start.

                    That's how information works. That's its default, natural setting. It's just entropy (I mean that in a literal sense). Physically that's all it is, and entropy spreads. If you want to stop it, you put up thermal barriers, which will always be leaky anyway. This is my biggest problem with all of this: there is no physical way to do this without further breaking the free flow of information. The internet, and everything else.

                    That sounds like an argument to abandon copyright altogether, because it will just be broken anyway. I think we can all agree it is close to impossible to make technical or physical boundaries that are guaranteed to work in all cases. But that doesn't mean we shouldn't at least have proper regulation for the legally operating industry. Some shadier companies in shadier countries might not care, but we can still have control over what kinds of businesses we allow to operate. We do that in most other industries, but the tech industry seems to be allowed to do mostly whatever it wants.

                    You can't copyright style, so it's not like plagiarism.

                    Of course that will be a case-by-case thing - if the "style" of an author becomes awfully close to an existing novel, for example. But my point was mainly that there is enough data in these models to go beyond what I would consider the original intent of fair use, because it allows for something close to replication.

                    Anyway, I'm done, I'm tired. This shit is depressing; you very well might win, and everyone will suffer for it, either in a world with an ever more broken legal landscape, or in the loss of creative work to larger companies (ironically what you want to stop).

                    Not sure what you are getting at here. Currently this new development seems to mostly benefit big companies, and smaller artists in particular are left behind. I don't see how more people will suffer if we demand that these companies have better control over their data - especially in a case like this, where victims of abuse could see pictures of themselves being used in a new fashion.

                    2 votes
                2. [2]
                  V17
                  Link Parent

                  That is clearly not always the case. They don't have full control over their own models, and to everyone else it is just a black box, so it is hard to trust that the models don't still contain personal data or can't leak it somehow.

                  You should probably clarify whether you're talking about LLMs or image models. The article was about image models, and in that case there has afaik been only one single case that demonstrated approximate reproduction of training data, which was based on an early unpublished version of Stable Diffusion that did not filter the training dataset for duplicates, so some images were present in it in 100+ copies, causing overtraining - a well-known issue. And despite that it took I think over a million attempts of using the exact (known in advance) prompt to generate one copy of the input data.

                  I have seen no other evidence ever, with the exception of widely photographed and reproduced subjects like the Mona Lisa painting, and when you look at the (byte) size of the training dataset and the size of the resulting neural net, it's obvious that, aside from overtraining, reproduction is not possible because you often have just a few bits per image.
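
                  As a rough back-of-the-envelope illustration (the ~2 GB checkpoint size is my assumption for a typical Stable Diffusion model; the 5.85 billion entries figure comes from the article):

                      # Rough upper bound on how much information per training image the
                      # final weights could possibly retain (numbers are illustrative assumptions).
                      checkpoint_bytes = 2e9       # assumed ~2 GB model checkpoint
                      dataset_images = 5.85e9      # LAION-5B entries, per the article

                      bits_per_image = checkpoint_bytes * 8 / dataset_images
                      print(f"~{bits_per_image:.1f} bits per training image")  # prints ~2.7

                  A couple of bits per image is nowhere near enough to store a copy, which is why reproduction only shows up for heavily duplicated, overtrained images.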

                  4 votes
                  1. winther
                    Link Parent

                    Yes, the case I linked to was with text, not images. I just think we should at least err on the side of caution with these things, because it doesn't seem to me that these companies have full control over their models - sometimes they even admit they don't understand everything that is happening. So even if there isn't currently a known case for images, I would consider it at least a realistic risk that a model could be "prompt hacked" into creating problematic images possibly resembling the victims. All of this is still very new, and as the models grow in size and use, these kinds of things will happen more, which is why I want to advocate for better control and more caution about what kind of data is fed into these models, instead of just going full throttle and dealing with the damage later, if it's discovered.

          2. psi
            Link Parent

            The report is specifically talking about CSAM. Obviously no one involved had consent to include these images in the dataset.

            5 votes
          3. [2]
            RoyalHenOil
            Link Parent

            Consent isn’t given simply because it is on the internet. These images are likely not on the internet with consent from those on the images. AI models have also been trained on pirated material.

            It is if it's publicly available. That's what most courts have ruled, and all web scraping is. If you make something available, you make it available.

            What courts are you thinking of? It is illegal to distribute these images in very nearly every single nation on Earth.

            1 vote
            1. [2]
              Comment deleted by author
              Link Parent
              1. RoyalHenOil
                Link Parent

                Ah, that's good. I think we were confused because the person you were replying to was talking specifically about CSAM and pirated images (i.e., those that were uploaded in violation of the owner's consent) in the part of the comment that you quoted:

                Consent isn’t given simply because it is on the internet. These images are likely not on the internet with consent from those on the images. AI models have also been trained on pirated material.

      2. Apocalypto
        Link Parent
        "Move fast and break things" is a very specific mindset, not a fundamental aspect of progress.

        That's how all progress goes. Both in capitalist and socialist systems.

        "Move fast and break things" is a very specific mindset, not a fundamental aspect of progress.

        6 votes
    2. [4]
      Japeth
      Link Parent

      Has there ever been a time under capitalism where the people with means were not trying to move as fast and break as many things as the available technology allowed?

      3 votes
      1. zipf_slaw
        Link Parent

        We're in the middle of the 3rd or 4th wave of "oh shit, this stuff we've been making for these several decades has been killing us!"

        Lead in tableware/gasoline/paint, asbestos, PFAS, BPA, car tires, micro-plastics... Can't wait to see what the next thing I've been slowly poisoned by turns out to be, once it's too late!

        8 votes
      2. RoyalHenOil
        Link Parent

        Yes, some industries self-regulate (e.g., the American Bar Association) or historically self-regulated (e.g., the Hays Code). We also see numerous individual companies and professionals voluntarily pursue accreditation or certification schemes that are not legislated (e.g., LEED certification).

        2 votes
      3. winther
        Link Parent

        Maybe not, but the tech industry seems to have it as a mantra more than most. Also, many other industries are under heavier regulation to limit these things.

  2. [10]
    chocobean
    (edited )
    Link

    So Stanford researchers used technology to find 3226 suspected instances of CSAM and forwarded them to authorities, of which 1008 were verified. Several thousand out of "five billion links to images scraped from the open web" (elsewhere in the article, "more than 5.85 billion entries") doesn't sound like a lot, but that's several thousand found "so far", and that's "not including all of the intimate imagery published and gathered non-consensually".

    I feel pretty conflicted about this. Do we not use auto-scraped images from the internet then? There's going to be far fewer than 6 billion. But look at the way they're responding to alarms: "nah we have algorithms to detect it" --> "it's not that much" --> "it's unavoidable" --> "but think of what we'd lose".

    We don't NEED AI and large data sets, and none of this makes humanity survive better yet. There's a lot of potential, yeah. But maybe this is exactly the point where we stop and start fresh. Don't use auto-scraped stuff from the darkest corners of the universe. Start fresh. Pay the quickly obsoleting humans real wages to willingly upload their photos specifically for use in training AI: don't steal the sum of human input just because it's free and because they clicked the "consent" box that one time when they didn't read it or understand that their pictures would be used for eternity to generate content.

    It's funny because we all want ethically sourced food and textiles and whatnot, but how come we're unwilling to use ethically sourced content to train this futuristic business that's supposed to be all good and wholesome and wonderful?

    6 billion images is a lot to give up on. And it feels like a tight race where everybody wants to get there first. Do unto others before they do unto you.

    But what is humanity being asked to give up if we don't?

    Think about all those images we all put up on the internet in the early days of our youth and optimism. Why should our naivete be exploited like this?

    CSAM is illegal. But unethically scraped intimate photos aren't good either. Maybe I don't want to use generated images from a "brain" that's seen all the scariest horribleness out there, and I don't want writing generated from racist garbage -- is there truly no market for a database that only contains good and true and beautiful things?

    In 2022, Facebook made more than 21 million reports of CSAM to the National Center for Missing and Exploited Children (NCMEC) tipline, while Instagram made 5 million reports, and Twitter made 98,050.

    Instead of pulling out visible dead flies from this soup made from floor sweepings, why don't we try to make soup from good clean ingredients to begin with?

    15 votes
    1. [7]
      Minori
      Link Parent

      At a certain scale, it's just not realistic to get enough images or data without some automation. Gathering 6 billion unique images would take a hell of a long time if humans uploaded each one independently. You're right that this is a weakness of algorithms.

      11 votes
      1. [6]
        chocobean
        Link Parent

        I understand that 6 billion is not a trivial number.

        But if it can't be done with "clean" materials, then this tech is not yet mature enough to exist. The end of the article mentioned some precedent for algorithm disgorgement. Just because something is hard doesn't mean it's impossible.

        Suppose we do it the other way: countries and companies contribute money and pay human beings to be recorded in ultra high resolution doing regular human things in specific places and interacting with specific objects and on different backgrounds and lighting and camera settings etc etc, from many different angles all at the same time so the algorithm can be trained to "think in 4D" natively, with honest data. This is what I look like eating pie from this angle, and here's 300 angles of me eating that same pie. Wouldn't that database be amazingly useful?

        "room 1 : man woman person camera tv - apple orange banana - laughing crying jumping falling drinking eating"

        We'd get way better tags on the stuff, wouldn't we? We wouldn't have to rely on an exploited third-world army of humans to tag things they have no way of meaningfully participating in. Maybe we don't even have to pay humans: we all get a license to use this database freely when we contribute to it.

        That does not change the fact that those 6B images are being used in ways none of us consented to. Start fresh; let humans know "hey, we're trying to gather 6 billion well-tagged images of humans doing human things, come help." It can be done. It needs to be done.

        9 votes
        1. [5]
          tauon
          Link Parent

          "room 1 : man woman person camera tv - apple orange banana - laughing crying jumping falling drinking eating"

          Also, in this scenario, there would (probably) be fewer inherent biases in the image generation, since we'd have had (idealistically speaking) a panel of diverse members who put some actual thought into what data is needed and should be depicted, in all those angles, settings and other variations.

          It’s a neat idea, but unfortunately likely already too late to enforce now.

          4 votes
          1. [4]
            chocobean
            Link Parent

            Surely there's somebody doing this... why wouldn't companies want to buy a clean data set? Especially if it's only asking for minuscule amounts of money, right? They're already paying for humans to annotate data.

            And especially if, as this article points out, there's illegal stuff in the current ones.

            1. wervenyt
              Link Parent

              Well, basically, because the methods we're using to create these neural networks only work with ungodly quantities of training material. To go from 6 billion photos to 6 million may well be the difference between the uncanny valley with innumerable fingers, and the average person having no way to distinguish a real photo from a generated one. And then, someone has to spend the time doing the clerical work of ensuring rights for every one of those 6 million photos.

              7 votes
            2. tauon
              Link Parent

              Cost. Cost is probably the sole factor preventing this, realistically.

              A bot scraping the web, using alt text image descriptions, and humans to fill in the rest, will be vastly cheaper than creating any sort of set/photoshoot, always.

              2 votes
            3. V17
              Link Parent
              Because it doesn't seem to matter for the end result enough to be worth the money.

              why wouldn't companies want to buy a clean data set?

              Because it doesn't seem to matter for the end result enough to be worth the money.

              2 votes
    2. cdb
      Link Parent

      I wonder if the use of these large databases for AI training creates more of an incentive to scrutinize them, which might lead to more policing of this kind of content. As in, would anyone have checked for these examples of illegal material if companies weren't aggregating these links and using them for high-profile projects? The article says the LAION data sets are derived from Common Crawl, which is not new. So, just based on the article, it seems we were able to get some more attention on this issue due to the use of these sets of links for AI training.

      5 votes
    3. DingusMaximus
      Link Parent

      this futuristic business that's supposed to be all good and wholesome and wonderful

      I'm honestly not sure if you're using hyperbole, being facetious, acting intentionally naïve, or if someone sold you AI in general as, "all good and wholesome and wonderful."

      No form of AI is inherently any of those things.

      Furthermore, if you want it to be those things, who gets to be the arbiter of what good, wholesome, and wonderful is defined as?

  3. Eji1700
    Link

    Yep. I was wondering when something like this would come up. Coming to a headline near you next month will probably be "oops it might also contain poorly secured military secrets/information"

    10 votes
  4. V17
    Link

    I've been thinking about this in the last day or so and the more I do, the less I'm convinced that this is a big enough issue that would warrant the removal of LAION or other drastic steps - after, of course, removing the identified images.

    The first reason is that it certainly feels terrible, but I struggle to see real damage caused by using the dataset to train a model, as, considering the number of images included and how the training works, it seems unlikely that it would significantly affect the model itself. It took a ton of work to make the original v1.5 models produce decent quality legal porn with adult people, despite the fact that such material was included in much higher numbers in the dataset.

    Redistribution by someone downloading the dataset seems like more of a legal issue than a factual one - unless they know the particular images talked about (which I assume are now removed), searching through the whole dataset and blindly trying to find unidentified child porn doesn't seem easier than just searching through the internet, due to the sheer size of the dataset.

    And the second reason why I dislike this is that the majority of similar complaints go against open solutions like LAION and Stable Diffusion - obviously, because they're the only ones easy to investigate. What is the end result? The corporations running commercial models are fine, and likely will be fine because they have more money for legal defense and lobbying, while the only models that are actually accessible to everyone for free to use and modify are going to be non-existent.

    I don't trust corporations and see this as a clear change for the worse.

    1 vote
  5. [3]
    updawg
    Link

    I understand the reasons for using the "CSA" terminology instead of calling it something like "child porn," but the fact that I was wondering why Confederate States of America material in an image dataset would get it removed shows why a different term is needed... although I'm not entirely sure a new term is needed at all. CSAM feels very sterilized to me, whereas CP sounds evil. I'll admit that the abbreviation CP is problematic for people with cerebral palsy, but I don't think the argument that "porn refers to consensual images" holds any water. For example, it's still called "revenge porn" and not "vengeful sexual abuse material."

    2 votes
    1. crud_lover
      Link Parent

      CSA/CSAM is a bit more specific and less traumatizing to those who have experienced it. This isn't the internet circa 2005 anymore.

      6 votes
    2. V17
      Link Parent

      It's just the euphemism treadmill going on. Sometimes it's reasonable, often times it's stupid, in almost all cases there's pretty much nothing an individual can do to stop it.

      2 votes