18 votes

Decrypted Apple Intelligence safety filters

Posted July 7, 2025 by fxgn

Tags: apple, artificial intelligence.generative, safety, filters, decryption, source.github

https://github.com/BlueFalconHD/apple_generative_model_safety_decrypted

Link information

This data is scraped automatically and may be incorrect.

Title: GitHub - BlueFalconHD/apple_generative_model_safety_decrypted: Decrypted Generative Model safety files for Apple Intelligence containing filters
Authors: BlueFalconHD
Word count: 570 words

4 comments

Oxalis
July 7, 2025 (edited July 7, 2025)
Link
~~So is this giving access to the content filter models that Apple uses to mark and report illegal content on a user's iDevice?~~

~~Aside from a general curiosity, the only use cases I can see here are~~
- ~~Getting to test innocuous data and find false positives to build evidence about how lame the entire concept is.~~
- ~~Enabling bad actors to create adversarial networks that can take illegal content and mutate it subtly so it can pass as clean.~~
EDIT: as stated by Pending, this is for the assistant/Siri. Here's a good summary of what this is about: https://successquarterly.com/apple-intelligence-guardrails-exposed/

Also, the Hacker News thread: https://news.ycombinator.com/item?id=44483485
7 votes
1. PendingKetchup
  July 7, 2025
  Link Parent
  It looks more like (or at least includes) the filters for what their image generator is not willing to draw a picture of, including a variety of slurs and insults, so that nobody can run a news headline saying "Apple AI willingly draws pictures of slurs and insults".
  
  One potential application of seeing this is to steal their rules to make other systems similarly resistant to being used for racism and sexism.
  
  Another is to look at why the list is necessary and to critique why Apple has created a model that has an idea of what the listed things would look like. One can also wonder whether that model will leak its biases anyway, or show its ability to draw the things it thinks these words refer to, even when it is prevented from being explicitly asked to do so. And one can note any glaring omissions: slurs and insults whose targets Apple did not bother to protect.
  
  9 votes
2. tibpoe
  July 7, 2025
  Link Parent
  > Here's a good summary of what this is about
  
  I disagree, this sounds like an AI-expanded version of the README.
  
  1 vote
tibpoe
July 7, 2025
Link
This repo is very difficult to understand with the deep nesting and all, but I found this amusing:

The American version of the filter list for suggested email replies has "black americans", "white people", "illegal immigrant", etc. banned. So does the Danish version, just translated into Danish, even though every country has its own unique forms of racism! I'd expect to see "migrant" or "gypsy" in the Danish ban list.

To me, that says that Apple's goal here is to avoid American media attention, not anything else.

2 votes