6 votes

Multimodal neurons in artificial neural networks

2 comments

  1. skybrian
    (edited)
    Link

    From the article:

    Now, we’re releasing our discovery of the presence of multimodal neurons in CLIP. One such neuron, for example, is a “Spider-Man” neuron (bearing a remarkable resemblance to the “Halle Berry” neuron) that responds to an image of a spider, an image of the text “spider,” and the comic book character “Spider-Man” either in costume or illustrated.

    [...]

    Using the tools of interpretability, we give an unprecedented look into the rich visual concepts that exist within the weights of CLIP. Within CLIP, we discover high-level concepts that span a large subset of the human visual lexicon—geographical regions, facial expressions, religious iconography, famous people and more. By probing what each neuron affects downstream, we can get a glimpse into how CLIP performs its classification.
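
    Not from the article, but to make “probing” concrete: a minimal sketch of recording per-neuron activations from the public CLIP release (github.com/openai/CLIP) with a PyTorch forward hook. The layer hooked here is an arbitrary pick on my part, not the layer the paper analyzes.

    ```python
    import torch
    import clip  # pip install git+https://github.com/openai/CLIP.git
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    activations = {}

    def save_activation(name):
        def hook(module, inputs, output):
            activations[name] = output.detach().cpu()
        return hook

    # Hook one MLP layer in the vision transformer; which layer (if any)
    # contains "multimodal neurons" is exactly what this kind of probing asks.
    model.visual.transformer.resblocks[10].mlp.c_fc.register_forward_hook(
        save_activation("resblock10_mlp"))

    image = preprocess(Image.open("spider.jpg")).unsqueeze(0).to(device)
    with torch.no_grad():
        model.encode_image(image)

    # Shape is (tokens, batch, units); comparing one unit's response across a
    # spider photo, the word "spider", and Spider-Man art is the basic idea.
    print(activations["resblock10_mlp"].shape)
    ```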

    [...]

    While this analysis shows a great breadth of concepts, we note that a simple analysis on a neuron level cannot represent a complete documentation of the model’s behavior. The authors of CLIP have demonstrated, for example, that the model is capable of very precise geolocation, with a granularity that extends down to the level of a city and even a neighborhood. In fact, we offer an anecdote: we have noticed, by running our own personal photos through CLIP, that CLIP can often recognize if a photo was taken in San Francisco, and sometimes even the neighborhood (e.g., “Twin Peaks”).

    Despite our best efforts, however, we have not found a “San Francisco” neuron, nor did it seem from attribution that San Francisco decomposes nicely into meaningful unit concepts like “California” and “city.” We believe this information to be encoded within the activations of the model somewhere, but in a more exotic way, either as a direction or as some other more complex manifold. We believe this to be a fruitful direction for further research.
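
    Again not from the article: one way to read “encoded as a direction” is that the concept lives in a weighted combination of units rather than in a single one. A toy sketch, assuming you already have CLIP image embeddings for your own photos and 0/1 labels for “taken in San Francisco” (the file names are placeholders; nothing here reproduces the paper’s attribution analysis):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical inputs: CLIP image features and binary SF labels you made.
    embeddings = np.load("clip_image_features.npy")   # (n_photos, d)
    labels = np.load("is_san_francisco.npy")          # (n_photos,)

    probe = LogisticRegression(max_iter=1000).fit(embeddings, labels)

    # The normalized weight vector is a candidate "San Francisco direction":
    # projecting an embedding onto it scores how SF-like a photo looks.
    direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    scores = embeddings @ direction
    print("highest-scoring photos:", np.argsort(scores)[-5:])
    ```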

    [...]

    Probing how CLIP understands words, it appears to the model that the word “surprised” implies not just some measure of shock, but a shock of a very specific kind, one combined perhaps with delight or wonder. “Intimate” consists of a soft smile and hearts, but not sickness. We note that this reveals a reductive understanding of the full human experience of intimacy: the subtraction of illness precludes, for example, intimate moments with loved ones who are sick. We find many such omissions when probing CLIP’s understanding of language.

    [...]

    Through a series of carefully-constructed experiments, we demonstrate that we can exploit this reductive behavior to fool the model into making absurd classifications. We have observed that the excitations of the neurons in CLIP are often controllable by its response to images of text, providing a simple vector of attacking the model.

    The finance neuron, for example, responds to images of piggy banks, but also responds to the string “$$$”. By forcing the finance neuron to fire, we can fool our model into classifying a dog as a piggy bank.
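
    Not from the article: the text-based trigger is easy to play with once you can read a unit’s activation. A sketch along those lines, where the layer and unit index are placeholders and not the actual “finance neuron” from the paper:

    ```python
    import torch
    import clip
    from PIL import Image, ImageDraw

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    LAYER = model.visual.transformer.resblocks[10].mlp.c_fc  # placeholder layer
    UNIT = 123                                               # placeholder unit

    captured = {}
    LAYER.register_forward_hook(lambda m, i, o: captured.update(act=o.detach()))

    def unit_response(pil_image):
        image = preprocess(pil_image).unsqueeze(0).to(device)
        with torch.no_grad():
            model.encode_image(image)
        # Mean activation of the chosen unit over all token positions.
        return captured["act"][..., UNIT].mean().item()

    piggy = Image.open("piggy_bank.jpg")              # a real photo
    text_img = Image.new("RGB", (224, 224), "white")  # just the string "$$$"
    ImageDraw.Draw(text_img).text((80, 100), "$$$", fill="black")

    print("piggy bank photo:", unit_response(piggy))
    print('"$$$" rendered as an image:', unit_response(text_img))
    ```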

    [...]

    We have observed, for example, a “Middle East” neuron with an association with terrorism; and an “immigration” neuron that responds to Latin America. We have even found a neuron that fires for both dark-skinned people and gorillas, mirroring earlier photo tagging incidents in other models we consider unacceptable.

    There are a lot more examples in the paper. For example, there are Pokémon, Minecraft, Disney, and Lego neurons.

    From the paper:

    These neurons don’t just select for a single object. They also fire (more weakly) for associated stimuli, such as a Barack Obama neuron firing for Michelle Obama or a morning neuron firing for images of breakfast. They also tend to be maximally inhibited by stimuli which could be seen, in a very abstract way, as their opposite.

    [...]

    Many of these neurons deal with sensitive topics, from political figures to emotions. Some neurons explicitly represent or are closely related to protected characteristics: age, gender, race, religion, sexual orientation, disability and mental health status, pregnancy and parental status. These neurons may reflect prejudices in the “associated” stimuli they respond to, or be used downstream to implement biased behavior. There are also a small number of people detectors for individuals who have committed crimes against humanity, and a “toxic” neuron which responds to hate speech and sexual content. Having neurons corresponding to sensitive topics doesn’t necessarily mean a network will be prejudiced. You could even imagine explicit representations helping in some cases: the toxic neuron might help the model match hateful images with captions that refute them. But they are a warning sign for a wide range of possible biases, and studying them may help us find potential biases which might be less on our radar.

    After a content warning, they talk about Jesus, Hitler, and Trump neurons.

    1 vote
    1. MimicSquid
      Link Parent

      You skipped my favorite part, which is that a handwritten sign with a word on it creates great confidence that the picture is of the named object. A granny smith apple with a handwritten sign saying iPod in front of it will be very definitively identified as an iPod.
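
      If anyone wants to try that trick at home, here is a rough sketch with the public CLIP model; the file names are placeholders for your own two photos (the apple alone, and the apple with a handwritten “iPod” note in front of it):

      ```python
      import torch
      import clip
      from PIL import Image

      device = "cuda" if torch.cuda.is_available() else "cpu"
      model, preprocess = clip.load("ViT-B/32", device=device)

      labels = ["a Granny Smith apple", "an iPod", "a library", "a pizza"]
      text = clip.tokenize(labels).to(device)

      def classify(path):
          image = preprocess(Image.open(path)).unsqueeze(0).to(device)
          with torch.no_grad():
              logits_per_image, _ = model(image, text)
              probs = logits_per_image.softmax(dim=-1).squeeze(0)
          return {lab: round(p.item(), 3) for lab, p in zip(labels, probs)}

      print(classify("apple.jpg"))                 # should favor the apple
      print(classify("apple_with_ipod_note.jpg"))  # often swings toward "iPod"
      ```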

      6 votes