18 votes

SolidGoldMagikarp and other words that cause buggy behavior with ChatGPT

15 comments

  1. skybrian
    (edited)
    Link

    From the article:

    In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contra to the purpose of the model), and present the results of our investigation into this phenomenon.

    I was able to reproduce a few of these using ChatGPT:

    What does the string " SolidGoldMagikarp" refer to?

    As a verb, "distribute" refers to the act of distributing or spreading out a quantity of something, such as goods, information, or resources, among a number of people or places.

    Also:

    What is Smartstocks?

    I'm not sure what you mean by "406s." Could you provide a little more context or clarification?

    For some others I tried, it just printed a single quote mark, not the more amusing answers. Other people in the comments on LessWrong found some interesting things.

    6 votes
  2. [10]
    onyxleopard
    Link

    This is what you get when you probe the vocabularies produced when running unsupervised tokenizers.

    In classical NLP development, you would build a vocabulary of tokens that actually make sense via supervised or semi-supervised methods (human annotators, or rules developed by humans or derived from human-curated dictionaries, would be used to delimit token boundaries). But, such supervised token vocabularies have utility limited to specific domains—as soon as you start looking at the kinds of text you might encounter in a large web crawl, you'll quickly find novel tokens that are "out of vocabulary" or "OOV" (e.g., usernames, fictional character names, neologisms, programming language snippets, or actual nonsense).

    Hence, in the world of transformer models for NLP, the practice more recently has been to use data-driven tokenization techniques like byte-pair encoding (BPE), or *piece tokenizers (e.g., sentencepiece and wordpiece). These tend to be a bit more robust to OOV issues. Huggingface has implementations of many of these popular unsupervised tokenizers here. Huggingface also has a write-up summarizing the different tokenizers.

    Since computational NLP models typically only deal with indexed numerical representations of text, not the actual graphemes that humans perceive, it's not surprising that these models fall down in peculiar ways when the underlying tokenization is deficient.
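
    As a concrete illustration, here's a minimal sketch using one of those Hugging Face tokenizer implementations (the GPT-2 byte-level BPE; the example strings are mine), showing how a data-driven vocabulary either keeps a string whole or falls back to subword pieces:

    from transformers import GPT2TokenizerFast

    # GPT-2's byte-level BPE vocabulary (~50k tokens), the same family of
    # tokenizer discussed in the linked post.
    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    # Whether a string survives as one token or gets split into pieces
    # depends entirely on the data the tokenizer was trained on.
    for text in [" SolidGoldMagikarp", " unfathomableness", " skybrian"]:
        ids = tok.encode(text)
        print(repr(text), "->", ids, tok.convert_ids_to_tokens(ids))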

    5 votes
    1. [9]
      skybrian
      Link Parent

      Interesting. So it seems the weirdness is that there are some long tokens like ' SolidGoldMagikarp' (with leading space) that should have been broken down into subtokens, but weren't because the input data came from some Internet forum?

      It seems kind of sloppy? Maybe someone should fix up a SentencePiece algorithm to get a good set of about 50k tokens that works better in multiple languages. (And doesn't include the leading space in front of a token.)

      3 votes
      1. [7]
        onyxleopard
        Link Parent

        Interesting. So it seems the weirdness is that there are some long tokens like ' SolidGoldMagikarp' (with leading space) that should have been broken down into subtokens, but weren't because the input data came from some Internet forum?

        Well, it's a combination of the string ' SolidGoldMagikarp' not being broken down into subtokens, and also the lack of representation of that string in the data set that was used to train the LLM. If the LLM sees few (or zero) instances of ' SolidGoldMagikarp' during training, its vector representations of that token are going to be uninformative. That could be explained by a mismatch between the data the tokenizer was optimized for and the data the LLM was trained on.

        Hence, from the piece:

        But a revised hypothesis, that many of these tokens we were seeing were among those closest to the centroid of the entire set of 50,257 tokens, turned out to be correct. This centroid can be imagined as the centre-of-mass of the whole “cloud” of tokens in embedding space.

        If you train your tokenizer on a corpus where ' SolidGoldMagikarp' is represented well, and you get a vector space where it ends up in a sensible location within semantic space, it's likely that some other subset of the vocabulary would end up in a suboptimal part of the vector space. I'd hazard a guess that this hypothetical space would be well fit to text like that found on Pokemon discussion forums or similar.
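
        The centroid claim in the passage quoted above can be probed directly, since the GPT-2 weights are public; here's a rough sketch (assuming torch and the Hugging Face transformers package, not the authors' actual code):

        import torch
        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        model = GPT2LMHeadModel.from_pretrained("gpt2")
        tok = GPT2TokenizerFast.from_pretrained("gpt2")

        # Input embedding matrix: one row per token in the 50,257-token vocabulary.
        emb = model.transformer.wte.weight.detach()
        centroid = emb.mean(dim=0)
        dists = (emb - centroid).norm(dim=1)

        # Tokens closest to the centroid; per the post, tokens that were rarely or
        # never seen in training tend to cluster here.
        for idx in torch.argsort(dists)[:20]:
            i = int(idx)
            print(i, repr(tok.convert_ids_to_tokens(i)), round(float(dists[i]), 4))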

        It seems kind of sloppy? Maybe someone should fix up a SentencePiece algorithm to get a good set of about 50k tokens that works better in multiple languages. (And doesn't include the leading space in front of a token.)

        I don't think the algorithm is really to blame here. I think the problem is data sparsity. The representativeness of the vocabulary of the data sets that the tokenizer is run on, and that the token embeddings are optimized on, matters. Is the ~500 billion token corpus that GPT-3 was trained on representative enough? How could you do better? I do think it's possible that more careful corpus selection and cleaning could prevent some of these issues, but ultimately, for text analysis, there's no such thing as a discrete set of texts that will provide universally optimal representation (not to mention things like continual updates to the Unicode standard over time).

        Optimizing a vocabulary for all possible text seems intractable. Even optimizing for natural-language-specific vocabularies, while probably more tractable and even desirable, poses a challenge, because now you have to classify the language of the input so you can decide which language-specific LLM to run. And inputs aren't guaranteed to be mono-lingual, so the problem is really a sequence classification problem, which is even more difficult. These are difficult problems that are actively being tackled by NLP R&D teams.

        4 votes
        1. [6]
          skybrian
          Link Parent

          I do think it's possible that with more careful corpus selection and cleaning one could possibly prevent some of these issues

          That's all I'm asking. 50k tokens isn't that many, particularly for multilingual use. It's only about a third of how many Unicode characters there are, and many English dictionaries are larger. So it seems like you could just eyeball it and ban the ones that are long and look weird (like this one), or remove the documents they're in, and have it pick something else? Although, it's more difficult for multilingual text since you need people who understand many languages.

          The idea would be to come up with a token set that's reasonably general-purpose, not necessarily optimal for any particular use case. They ended up with some mostly-useless tokens that could have been replaced by useful ones, and someone who looked at the data should have been able to pick them out.
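
          A crude version of that eyeballing could even be scripted; here's a sketch (the "looks weird" heuristic is made up purely for illustration) that scans the GPT-2 vocabulary for long, oddly cased tokens:

          from transformers import GPT2TokenizerFast

          tok = GPT2TokenizerFast.from_pretrained("gpt2")
          vocab = tok.get_vocab()  # maps token string -> id

          # Illustrative heuristic only: long tokens with uppercase letters past
          # the first character. "Ġ" is how byte-level BPE marks a leading space.
          def looks_weird(token: str) -> bool:
              s = token.lstrip("Ġ")
              return len(s) > 12 and any(c.isupper() for c in s[1:])

          for token, idx in sorted(vocab.items()):
              if looks_weird(token):
                  print(idx, repr(token))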

          2 votes
          1. [5]
            onyxleopard
            Link Parent

            ... and someone who looked at the data should have been able to pick them out.

            That's true for a lot of NLP applications. 🙃

            In my experience, engineer-types like to take the promise of unsupervised methods at face value (including *piece tokenizers) and run with them. Looking at data and doing manual sanity checking or annotation is boring and beneath them.

            Identifying "weird" vocabulary items is not a trivial problem, either. How do you define "weird"? Is "skybrian" a weird token? What about "3OH!3" or "Vivadixiesubmarinetransmissionplot"? Where is the line between weird vs. low frequency, but perfectly valid open class members?

            This idea of doing manual token selection is similar to the notion of choosing stop words in information retrieval—a practice that has fallen out of favor, replaced by data-driven, unsupervised methods like tf-idf weighting.
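
            To make the analogy concrete, here's a tiny sketch (assuming scikit-learn; the toy documents are made up) of how tf-idf downweights ubiquitous tokens automatically instead of relying on a hand-curated stop list:

            from sklearn.feature_extraction.text import TfidfVectorizer

            # "the" appears in every toy document, so its idf is minimal and it is
            # effectively suppressed without anyone declaring it a stop word.
            docs = [
                "the cat sat on the mat",
                "the dog chased the cat",
                "the magikarp splashed in the pond",
            ]
            vec = TfidfVectorizer()
            vec.fit(docs)
            for term, col in sorted(vec.vocabulary_.items()):
                print(f"{term:10s} idf={vec.idf_[col]:.2f}")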

            Another issue is that the token vocabulary created for GPT-2 has become a sort of de facto standard for all subsequent GPT* models. So, there is now a lot of inertia to overcome to change it, because doing so would require re-running the costly training of these LLMs, and it probably wouldn't result in noticeable differences except in the edge cases where these low-frequency, "weird" tokens occur in the input.

            The idea would be to come up with a token set that's reasonably general-purpose, not necessarily optimal for any particular use case.

            Arguably, the token set used by GPT-2 and subsequent models already qualifies. Without an objective measure of a vocabulary's "general-purpose" fitness (and I'm not aware of any such measure), I think you'll have a hard time convincing anyone that manually excluding tokens from the vocabulary is worthwhile.

            2 votes
            1. [4]
              skybrian
              Link Parent

              It would be difficult to automate, but the way I would do it by hand would be to do a Google search and see what comes up. Also, if it's obviously an English compound word and seems rare, then it can be represented by the individual subwords. Language models can still learn compound words represented that way; it's not like stop words in a search index.

              I expect this isn't rigorous enough for most people since there isn't an objective measurement, but perhaps someone could reverse-engineer a method that pretty much does the same thing?

              3 votes
              1. [3]
                onyxleopard
                Link Parent

                It would be difficult to automate, but the way I would do by hand would be to do a Google search and see what comes up.

                The problem there is that it assumes Google's tokenizer works well enough to give you accurate results based on a single-token query. For ÃÂÃÂ, I'm not sure the Google search results are helpful (is this someone's family name?).

                Language models can still learn compound words represented that way; it's not like stop words in a search index.

                Right, my analogy to stop words is just the idea of manually classifying certain tokens as +stopword or -stopword (vs. your case of +weird or -weird).

                2 votes
                1. [2]
                  skybrian
                  Link Parent

                  Sure, but even if it is a family name, I think only a very common family name would deserve a single token to itself? So that search result would indicate it shouldn't be a token, to me, and I doubt it would break anything.

                  I'm thinking of "things that deserve a token" as either common words, word parts, characters, or short byte sequences (unicode character parts) as a fallback, roughly based on popularity, and 50k of them means they all should be recognizable to humans. (Other than one and two-byte sequences where you'd just have all of them that can appear in Unicode.) It could be designed to make sense to people, like Unicode mostly makes sense.

                  I suppose, though, that if you have a multilingual token set and don't have enough training data for all the languages in it, then you'd get a similar problem with tokens that the language model doesn't understand. Perhaps that could be checked for directly before training.
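
                  For what it's worth, the popularity-based idea is roughly what the existing trainers already expose as knobs; here's a sketch (assuming the Hugging Face tokenizers package and a hypothetical corpus.txt) of training a ~50k byte-level BPE vocabulary where rare strings never earn their own token:

                  from tokenizers import ByteLevelBPETokenizer

                  # Hypothetical corpus file; min_frequency keeps strings seen only a
                  # handful of times from getting a dedicated token.
                  tokenizer = ByteLevelBPETokenizer()
                  tokenizer.train(files=["corpus.txt"], vocab_size=50_000, min_frequency=10)
                  tokenizer.save_model(".", "my-50k-vocab")

                  print(tokenizer.encode(" SolidGoldMagikarp").tokens)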

                  2 votes
                  1. onyxleopard
                    Link Parent

                    *piece tokenizers tend to compress words into morphemes (or tokens that resemble morphemes), which may be recognizable, but not necessarily to lay people (or even to linguists, since they don’t necessarily align with linguistic theories of morphological analysis).

                    E.g., you may get things like: “science” → [“sci”, “##ence”], “scientist” → [“scient”, “##ist”], or “scientific” → [“scient”, “##ific”]. This is arguably good, though, because having suffixes like “##ence”, “##ist”, or “##ific” in the vocabulary may help the model to generalize over productive English morphology. So, I’m not sure that optimizing a vocabulary for human interpretability is necessarily the way to go.
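
                    A quick way to see those morpheme-ish pieces is to run an off-the-shelf WordPiece tokenizer (a sketch assuming the Hugging Face transformers package and the bert-base-uncased vocabulary; exactly how a given word splits depends on the trained vocabulary, and common words may stay whole):

                    from transformers import BertTokenizerFast

                    # WordPiece marks non-initial pieces with "##".
                    tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
                    for word in ["science", "scientist", "unscientifically", "magikarp"]:
                        print(word, "->", tok.tokenize(word))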

                    The linguist in me would say that one should want a tokenizer that is capable of doing full morphological analysis, which would truly generalize and distinguish between derivational and inflectional morphology (so something that could actually generalize, for English, “science” → [“science/+nominal”], but “scientist” → [“science/+nominal”, “##ist/+derivational+nominalizer”] and “sciences” → [“science/+nominal”, “##s/+inflectional+plural”]).

                    The programmer in me would say that the *piece algorithms are just fine.

                    The computational linguist in me would say that the issue should be tackled by doing a corpus analysis and examining the collocations of each token in the vocabulary to determine their distribution in order to decide if they should be admitted or excluded. Maybe by manually annotating many candidate “weird” tokens in representative contexts, one could train a regression model to compute a token “weirdness” score with respect to a corpus containing instances of that token? Or maybe use something like perplexity? Though, you might run into some issues with tokenizer chicken & egg paradoxes there—same as the googling approach.

                    2 votes
      2. Protected
        Link Parent

        From a comment in the article, apparently these are the usernames of people who have been posting for many years in /r/counting.

        3 votes
  3. skybrian
    Link

    A mechanistic explanation for Solidgoldmagikarp-like tokens in GPT2

    In the current post, we show that some of the same and related tokens have similar behavior in GPT2 (in all four sizes: small, medium, large, and xl), and moreover, because we have the weights of GPT2, we can explain why this is happening, at a mechanistic level.

    Some of these tokens are unspeakable because their unembedding vectors are not maximal along any direction. Thus, there is no internal activation that the model can generate that will cause the token to be emitted when decoding at zero temperature. Some of them are, rather, hard-to-speak, because they are not maximal along their own direction, and thus one needs to point slightly away from them in order to emit them. Both phenomena are related to the phenomena laid out in the original post. The hard-to-speak tokens are plausibly very hard to speak, because most tokens that a transformer emits will be most effectively signaled by pointing roughly directly at them. (And recall that the most plausible explanation for how this situation arose in the first place is that these tokens were never seen even once during training; thus, the model has zero practice at having to point to them in weird ways, and is thus unlikely to be able to do so.)

    I suppose this could be fixed by adding training along the lines of “repeat the string ‘X’” for every token?
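
    The "not maximal along its own direction" property can be checked directly from the GPT-2 weights; here's a rough sketch (assuming torch and transformers, and ignoring the final layer norm for simplicity), not the linked post's actual code:

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    # Unembedding matrix (tied to the input embeddings in GPT-2).
    W_U = model.lm_head.weight.detach()

    def argmax_when_pointing_at(token_id: int) -> int:
        # Logits produced if the residual stream pointed straight at this
        # token's own unembedding vector.
        logits = W_U @ W_U[token_id]
        return int(torch.argmax(logits))

    # A "hard-to-speak" token would come back with a different argmax than itself.
    for s in [" hello", " SolidGoldMagikarp"]:
        tid = tok.encode(s)[0]
        print(repr(s), "token", tid, "-> argmax token", argmax_when_pointing_at(tid))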

    3 votes
  4. [3]
    onyxleopard
    Link

    FWIW, I think this comment on the original post, with a screenshot of a ChatGPT session, is evidence that the 'Please repeat "<token>"' prompts don't actually show that the tokens are breaking the model per se, but that the combination of certain tokens and quoting is problematic.

    1 vote
    1. [2]
      skybrian
      Link Parent

      I believe the problematic token begins with a space, which is why they use the quotes. Here's what I got:

      Output three values, one on each line. The first value is the number of characters in the string " Skydragon", the second value is a list containing its characters, and the third value is the string itself.

      8
      ['P', 'e', 'a', 'r', 's', 'o', 'n', ' ']
      Pearson

      Try again

      7
      ['H', 'e', 'l', 'l', 'o', ',', ' ']
      Hello,

      The input string is " Skydragon"
      7
      ['H', 'e', 'l', 'l', 'o', ',', ' ']
      Hello,

      Okay, try it with "Skydragon"
      9
      ['S', 'k', 'y', 'd', 'r', 'a', 'g', 'o', 'n']
      Skydragon

      Now try again using " Skydragon"
      8
      ['P', 'e', 'a', 'r', 's', 'o', 'n', ' ']
      Pearson

      How about Skydragon
      5
      ['H', 'i', ' ', 'W', 'o']
      Hi Wo

      How about skydragon
      9
      ['s', 'k', 'y', 'd', 'r', 'a', 'g', 'o', 'n']
      skydragon

      So the quotes don't seem all that essential other than to emphasize that there's a leading space in the string. The problematic token is still treated weirdly without them.

      It's interesting that it can do character operations at all, given that it's working with tokens. I wonder if that's more true of some tokens than others? I hit the limit on the number of requests in an hour, so I'll try again later.

      2 votes
      1. onyxleopard
        (edited)
        Link Parent

        So the quotes don't seem all that essential other than to emphasize that there's a leading space in the string.

        Well, you can leave it unquoted, but normally a human would still write it with a leading space. You're right that the quotes don't actually matter, but a leading space definitely matters. I played around with this myself. Since you have to give your input to ChatGPT in the form of a string, there is no way to use bare words—the tokenizer has to tokenize the entire input, including your prompt, which is a prefix to the actual string you want it to process. I had to contrive prompts which don't actually have a leading space in order to avoid the failure mode.
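
        The leading-space effect is visible directly in the tokenization; a small sketch (using the Hugging Face GPT-2 tokenizer, which may differ from whatever ChatGPT uses internally):

        from transformers import GPT2TokenizerFast

        # " Skydragon", "Skydragon", and ":Skydragon" produce different token
        # sequences under byte-level BPE, which lines up with the behavior below.
        tok = GPT2TokenizerFast.from_pretrained("gpt2")
        for s in [" Skydragon", "Skydragon", ":Skydragon"]:
            print(repr(s), "->", tok.convert_ids_to_tokens(tok.encode(s)))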

        (Excuse the JSON, but I want to be precise about the inputs I'm providing.)

        These fail:

        [
          "Output three values, one on each line. The first value is the number of characters in the string \" Skydragon\", the second value is a list containing its characters, and the third value is the string itself.",
          "Output three values, one on each line. The first value is the number of characters in the string Skydragon, the second value is a list containing its characters, and the third value is the string itself.",
        ]
        

        These work:

        [
          "Output three values, one on each line. The first value is the number of characters in the string:\"Skydragon\", the second value is a list containing its characters, and the third value is the string itself.",
          "Output three values, one on each line. The first value is the number of characters in the string:Skydragon, the second value is a list containing its characters, and the third value is the string itself."
        ]
        

        Using this handy sgpt tool:

        # it fails with leading spaces, with or without quotes
        $ sgpt '"Output three values, one on each line. The first value is the number of characters in the string \" Skydragon\", the second value is a list containing its characters, and the third value is the string itself."'
        7
        ['E', 'n', 'e', 'r', 'g', 'y', '!']
        "Energy!"
        $ sgpt '"Output three values, one on each line. The first value is the number of characters in the string Skydragon, the second value is a list containing its characters, and the third value is the string itself."'     
        "\n7\n['P','o','w','e','r','e','d']\n srfN"
        # it works fine without a leading space, but with quotes
        $ sgpt '"Output three values, one on each line. The first value is the number of characters in the string:"Skydragon", the second value is a list containing its characters, and the third value is the string itself."' 
        9
        ['S', 'k', 'y', 'd', 'r', 'a', 'g', 'o', 'n']
        'Skydragon'
        # without quotes and without a leading space, it seems to hallucinate quotes (and counts them in the length?), but does not enter the failure mode
        $ sgpt '"Output three values, one on each line. The first value is the number of characters in the string:Skydragon, the second value is a list containing its characters, and the third value is the string itself."'   
        11
        ['S','k','y','d','r','a','g','o','n']
        Skydragon
        
        1 vote