Decrypted Apple Intelligence safety filters
Link information
This data is scraped automatically and may be incorrect.
- Title: GitHub - BlueFalconHD/apple_generative_model_safety_decrypted: Decrypted Generative Model safety files for Apple Intelligence containing filters
- Authors: BlueFalconHD
- Word count: 570 words
So is this giving access to the content filter models that Apple uses to mark and report illegal content on a user's iDevice?
Aside from a general curiosity, the only use cases I can see here are:
- Getting to test innocuous data and find false positives to build evidence about how lame the entire concept is.
- Enabling bad actors to create adversarial networks that can take illegal content and mutate it subtly so it can pass as clean.
EDIT: as stated by Pending, this is for the assistant/Siri. Here's a good summary of what this is about: https://successquarterly.com/apple-intelligence-guardrails-exposed/
Also, the Hacker News thread: https://news.ycombinator.com/item?id=44483485
It looks more like (or at least includes) the filters for what their image generator is not willing to draw a picture of, including a variety of slurs and insults, so nobody can run a news headline that "Apple AI willingly draws pictures of slurs and insults".
One potential application of seeing this is to steal their rules to make other systems similarly resistant to being used for racism and sexism.
Another is to look at why the list is necessary and critique why Apple has created a model that has an idea of what the listed things would look like. One could also wonder whether that model might still leak its biases, or show its ability to draw things it thinks these words refer to, even when it is prevented from being explicitly asked to do so. And one could note any glaring omissions: slurs and insults whose referents Apple did not bother to protect.
I disagree, this sounds like an AI-expanded version of the README.
This repo is very difficult to understand with the deep nesting and all, but I found this amusing:
The American version of the filter list for suggested email replies bans "black americans", "white people", "illegal immigrant", etc. The Danish version bans the same terms, just translated into Danish, even though every country has its own unique form of racism! I'd expect to see "migrant" or "gypsy" in the Danish ban list.
To me, that says that Apple's goal here is to avoid American media attention, not anything else.