Honestly, I'd much rather a platform hosting public information keep an open API and permanent open access to that information, which will always come with people scraping the data for one purpose or another, than have a corporation gatekeep access and sell user data to some rich AI or advertising company in a backroom deal, while everyone else is left dumbfounded and unable to do anything requiring API access. Just look at the Twitter or Reddit API disasters. There are so many downsides to that, such as shutting out indie developers from building third-party clients (see e.g. 'reddit is fun'), or keeping useful information from being found via search engines (see e.g. the Reddit+Google deal).
The emphasis here lies on 'public information', meaning anything posted for the general public to view without restrictions. As soon as the information is restricted to some well-defined, small group of people, such as my friends or the members of a company, that's a whole different story. In fact, I think it's much more questionable that a lot of companies sell even that sort of data to advertising companies and other third parties.
As for regulation, one can still require that large datasets be stripped of identifying information in one way or another, in cases where no legitimate public interest in accountability exists. But that shouldn't be enforced by making a platform unscrapable; it should be enforced by scrutinizing and fining those who scrape and process data irresponsibly.
In any case, people should be aware that whatever is posted publicly on Reddit/Twitter/Bluesky/Tildes is publicly available information that anyone will be able to work with, whether you want it or not. Don't post personal or restricted information to a platform without ensuring that posts are actually restricted to a specific group of people!
And honestly, specifically with platforms like Reddit or Tildes, I actively want the ability to back up posts to keep them accessible. I don't care who posted what, so usernames could easily be stripped or pseudonymized, but there's so much helpful and interesting information on Reddit that remains useful years after being posted; it hurts so many people and destroys so much if Reddit suddenly makes access go away. It's sometimes absurd how much control a platform has over user-created information that the platform itself had nothing to do with; it's just the medium the user chose to publish the information.
I've honestly been thinking about a model that resolves that problem, and I'm not opposed to the idea of a platform whose license agreement ensures that the content of any post is immediately and automatically released into the public domain or under some open license. That way, users are well aware that any information posted to the platform is in fact public and usable by anyone, and should act accordingly, and the platform itself cannot enforce unreasonable restrictions on the information it hosts. That's of course not suitable for features intended for private or semi-private exchange, but for platforms such as Tildes, Reddit, Stack Overflow, etc. it would offer a lot of advantages to the general public in the long term.
Plus, it severely restricts the ability of small companies, academia, and the FOSS community to provide LLMs and other AI products. The genie is out of the bottle; at this point it's a question of who gets access. I'd rather have the option of using models from non-monopolists, and I'd rather everyone with a GPU be able to run the state of the art than be reliant on Big Tech for models trained on proprietary (and thus potentially unethically or illegally sourced) data.
The more data is out there for everyone to use, the less power the big corporations have.
Many Bluesky users were under the false impression that Bluesky would be the saviour of their data and protect them from AI. Bluesky was always clear that the data is publicly available and that they themselves won't use it for AI training, but somehow users got it wrong and are now disappointed. To rebel, they made a moderation list targeting Hugging Face employees and users and started blocking them on Bluesky.
That's not really "rebelling"
Blocking the AI company people makes sense to me. Obviously it won't catch them all. I didn't see anyone on my feeds who thought Bluesky was "safer" from the same crawling that hits the rest of the public net, just pissed at yet another AI company committing mass copyright infringement for their own profit.
And they just got added to the existing lists, including one of the ones I follow. They block spam, MAGA assholes, TERFs and AI companies. I'm good with it.
Edit: Update, they've pulled the Bluesky data.
Are there shareable block lists? I only just made a bluesky account.
Apparently so. They would appear under Moderation Tools > Moderation Lists in settings. But I haven't used them.
There are also "labelers" which can "warn" or "hide" posts. I use a US politics labeler, set to "warn," so it's much like content warnings on Mastodon, but automatically applied by someone's algorithm.
Is there any sort of repo for these lists? I feel like just getting US politics out of my face will be great, but there are some other communities that have made a home there that I would rather not see.
https://blue.mackuba.eu/labellers/
Searches can help too, but they can be a bit more uneven in terms of which lists you happen to find. I can recommend some useful people to follow for spam/follow-bot sorts of lists.
Give me those recs.
Rahaeli: If you look at the lists tab on her account (she's the cofounder of Dreamwidth and worked in T&S for LiveJournal), she's got quite a few.
She has both excellent instincts and deep knowledge and her scammer/bot account lists are great. She can write with a lot of authority on the difficulties of preventing CSAM, preventing or banning spammers, etc. I like her account a lot.
This is a right-wing account list by skysentry. It's been solid so far, and I haven't felt like I'm missing anything or that it's over-moderating.
And this is the Very British Bigotry list, specifically targeting the British style of transphobia, where, you know, it's not that there's anything wrong with trans people, they're just appropriating the role of women and probably are predators.
I have a few other right-wing/asshole lists, mostly because I'm seeing which work best. But these have been the most helpful.
Yep! Quite a few. The first one I was recommended is @skywatch.blue - you'll see Lists in the middle of the page, and from there you can select any list you want to block. There'll be a Subscribe button at the top right, and when you select that, it offers you the option to mute or block the entire list (which will apply to all new entries to the list as well).
Some bad actors have created misleading lists, so you do have to make sure you choose ones that seem solid and come well-recommended.
Yes, though where @skybrian directed you is where your own lists live, not where you find others.
Searching for block or moderation lists is probably the easiest way, or I can share a link if you're looking for something similar. I similarly use mine to mute rather than block 99% of the time, though I think I have one set up to block. I follow someone who's done Trust and Safety for another website or two, so I use their lists more comfortably.
I've not yet wanted to unblock anyone. Some folks on the British "TERF" wave list aren't always talking about TERFs, but I trust the mute (and have found the concerns to be valid when I've gone digging).
If you put speech out in the world, don't expect it to be respected by anyone.
Ironic considering the lawsuit by Sony/RIAA over copyright infringement.
If they win, I get to demand the AI sites lose internet access forever too, right?
lol as if those rules would apply to the likes of us.
I've continuously been confused or at a loss reading arguments against allowing AI to train on publicly available data. Recently on Bluesky, I've noticed people protesting once again simply because they dislike the concept. While I empathize with those concerned about AI-related job displacement (which I believe is inevitable), I think that publicly available information should be fair game for AI training. Hell, I barely give a shit about the copyright infringement that could be happening, if I am completely honest.
My perspective is admittedly influenced by growing up with the 'Hacker Ethic' of the early internet, but it's been fascinating to witness AI's capabilities from all the data it has been trained on. Despite the protesting, I think that these AI tools have been a positive force overall.
I can't tell what people want. Do they want laws restricting access? Do they want everything to be a walled garden? I think that would just destroy the internet, and also restricting access to training data could have serious unintended consequences. It would likely consolidate AI development among well-funded corporations, effectively shutting out smaller developers and open-source initiatives. What do people actually want?
The 'hacker ethic' is also very much built on "giving something back", and given the current situation, it is understandable that people are somewhat frustrated with being mere data-feeders for multi-billion-dollar for-profit companies. OpenAI also started with something resembling that open-to-the-public ethic, but they have abandoned it to become a for-profit company like everyone else. I think that plays a huge part in many people's negative reactions to what is going on.
But look at the open source LLMs available now, built on publicly available content.
You do get back, just from good people, not shareholder led corporations.
Those shareholder-owned corporations (or venture-capital-funded, whatever) should have to pay for data then.
I guess this then depends on how we define giving something back. I have received a ton of value from services such as ChatGPT already, which I feel would qualify.
Would I prefer all of these tools to be open source? Probably. Those do still exist, though, and I fear that people acting out against tools held by corporate interests are going to hurt smaller and open-source models as well.
What people want depends on the person, but I think being able to set the terms for how your posts may be used is a reasonable request. This is what we do when we release source code under a particular license. You can choose whatever license you want.
From a practical point of view, having machine-readable ways to declare these terms (like with robots.txt) makes it easy to automate. The easier it is, the more likely that scrapers will do it. Standard protocols often get built into software, which can make it easier to follow the standard than not.
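As a concrete sketch of what those machine-readable terms look like on the web today, here's a minimal robots.txt opting a site out of two real AI-adjacent crawlers (GPTBot is OpenAI's, CCBot is Common Crawl's); a per-account Bluesky setting would play the same role, and like any robots.txt it is only a request:

```
# Opt out of OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Opt out of Common Crawl's crawler (a common source of training data)
User-agent: CCBot
Disallow: /
```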
Like copyright law itself, this is difficult for computer systems to enforce on people who don’t want to cooperate, which makes it ineffective against piracy, where people ignore and often actively defeat whatever restrictions you place on them. We shouldn’t be naive about it. But that doesn’t make it useless. There are many people and businesses that are careful to follow copyright law, at least some of the time. We don’t abandon open source licenses just because they don’t stop pirates.
This is what we do when we release source code under a particular license.
That's why my "high-profile" open source projects are released under GPLv3. Free to use but not to profit from. Or at least much harder.
That should be the default for everything freely published on the internet: do whatever you want with it, as long as you don't try to make money from it.
A nit: technically, you can always use a GPLed application yourself, including for business. So that’s a pretty rough summary of the license terms.
Looks like BlueSky is on it:
Bluesky is an open and public social network, much like websites on the Internet itself. Websites can specify whether they consent to outside companies crawling their data with a robots.txt file, and we’re investigating a similar practice here.
For example, this might look like a setting that allows Bluesky users to specify whether they consent to outside developers using their content in AI training datasets.
Bluesky won’t be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings.
Is what it is. The only way to prevent this would be to make instances whitelist-first, but obviously that would kill adoption.
I think that’s a little pessimistic. It seems like there should be a robots.txt lexicon that’s based on a setting? Mastodon has a setting for whether to allow search engines to index your profile, which is similar.
Maybe a little late, but still, better now than later.
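For a rough idea of what such a lexicon could look like: a record in the user's repo that cooperating crawlers check before ingesting posts. This is purely a hypothetical sketch; app.example.aiConsent is an invented NSID, not a real or proposed atproto lexicon:

```json
{
  "$type": "app.example.aiConsent",
  "allowAiTraining": false,
  "allowSearchIndexing": true,
  "createdAt": "2024-11-27T00:00:00Z"
}
```

A cooperating crawler would fetch this record alongside the posts and drop opted-out accounts.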
Robots.txt only works if people respect it. And with the way Bluesky works, any instance also has a perfectly justifiable excuse for why it needs a copy of everything, since that's actually part of bootstrapping.
Sometimes they do and sometimes they don’t, but I think it would be good for there to be a protocol that at least allows people to get/give consent? I don’t think they need every profile for AI training, and it’s easy enough to skip some of them.
It’s certainly not something to rely on, though. The posts are all cryptographically signed and anyone who really wants to can keep “receipts.”
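To unpack the "receipts" point: because records are signed, an archived copy can be proven authentic later even if the author deletes the post. A minimal sketch of the idea, assuming PyNaCl and an Ed25519 key purely for illustration (atproto's actual repo signing uses different key types and a Merkle commit structure):

```python
from nacl.signing import VerifyKey
from nacl.exceptions import BadSignatureError

def verify_receipt(record_bytes: bytes, signature: bytes, author_pubkey: bytes) -> bool:
    """Return True if the archived record really was signed by the author's key."""
    try:
        VerifyKey(author_pubkey).verify(record_bytes, signature)
        return True
    except BadSignatureError:
        return False
```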
So what happens if everyone starts signing their Bluesky posts "David Mayer"?
https://www.newsweek.com/chatgpt-openai-david-mayer-error-ai-1994100
Yes, I know ChatGPT wasn't the one harvesting Bluesky data, but they might be.
Nothing. That would seem to be a filter placed on the text generation, not an issue with the LLM.
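A minimal sketch of what such a post-hoc output filter might look like; the blocklist contents and streaming interface here are invented for illustration, not OpenAI's actual implementation:

```python
BLOCKED_NAMES = {"David Mayer"}  # hypothetical hard-coded blocklist entry

def stream_with_filter(token_stream):
    """Pass tokens through until the accumulated text contains a blocked name, then abort."""
    text = ""
    for token in token_stream:
        text += token
        if any(name in text for name in BLOCKED_NAMES):
            # Matches the observed behavior: generation cuts off mid-response
            raise RuntimeError("Unable to produce a response.")
        yield token
```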
There's no way this could happen in reality, but if the training data became flooded with that kind of signal, it'd cause random breakages from the downstream filter as the underlying LLM started rarely but randomly reproducing that signal. The best place to stash it would be in the same overly formal-sounding text that the model tries to reproduce, but people writing formal text probably aren't easily swayed (at least as a whole) by these kinds of shenanigans.