40 votes

A jargon-free explanation of how AI large language models work

12 comments

  1. [9]
    stu2b50

    It's actually not a bad article.

    That being said, I think they needed to talk more about the types of LLMs. For GPT, it's specifically an autoregressive model - that's what the article is getting at when it talks about how it doesn't need supervised data. But that's not all LLMs are used for - in fact, the supervised form of transformer networks is much more common in practice, and is used for things like sentiment analysis, which are boring but more important to business.
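
    A minimal sketch of that split, assuming PyTorch (the layer sizes and names here are illustrative, not from the article): the same transformer trunk can carry either a GPT-style next-token head or a supervised classification head.

    ```python
    import torch
    import torch.nn as nn

    hidden, vocab, n_classes = 256, 50_000, 2

    # Stand-in for the shared transformer body
    trunk = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
        num_layers=2,
    )

    lm_head = nn.Linear(hidden, vocab)       # autoregressive: logits over the next token
    cls_head = nn.Linear(hidden, n_classes)  # supervised: logits over class labels

    x = torch.randn(1, 16, hidden)           # a batch of 16 already-embedded tokens
    h = trunk(x)

    next_token_logits = lm_head(h[:, -1])    # GPT-style: predict token 17
    sentiment_logits = cls_head(h[:, 0])     # classifier-style: label the sequence
    ```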

    The original transformer paper architecture involved an encoder and a decoder. BERT is perhaps the most used transformer network, and it has both encoder and decoder and is trained in a supervised manner. It becomes a GPT-like autoregressive model when you chop the encoder off and just use the decoder.

    Secondly, it skips over RLHF. RLHF is what makes ChatGPT work so well. It turns GPT from a network optimized to predict the next word into a network optimized to please humans with the next word. GPT-2 and raw GPT-3 did not make a splash because an autoregressive model that merely predicts the most likely next word is not very useful without a lot of coaxing.
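
    A hedged sketch of the objective RLHF optimizes, PPO-style (the function and its inputs are hypothetical stand-ins, not OpenAI's implementation): the policy is rewarded for pleasing a learned reward model, while a KL penalty keeps it close to the original next-word predictor.

    ```python
    import torch

    def rlhf_objective(policy_logprob, ref_logprob, reward, kl_coef=0.1):
        # reward: scalar from a reward model trained on human preference rankings
        # KL term: penalize drifting too far from the pretrained language model
        kl = policy_logprob - ref_logprob
        return reward - kl_coef * kl

    # e.g. a response the reward model likes, with modest drift from the base LM
    score = rlhf_objective(policy_logprob=torch.tensor(-12.3),
                           ref_logprob=torch.tensor(-13.0),
                           reward=torch.tensor(2.1))
    ```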

    11 votes
    1. [3]
      unkz

      RLHF isn’t actually all that critical for good results — in fact, it dumbs things down a fair bit. Its main use is for alignment. The most important part for getting high-quality responses is SIFT (supervised instruction fine-tuning), and the most recent research seems to be trending towards DPO (direct preference optimization) for alignment.
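
      For reference, a minimal sketch of the DPO loss (as in the Rafailov et al. paper), assuming you already have summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model; the tensor values below are made up:

      ```python
      import torch
      import torch.nn.functional as F

      def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
          # log-ratio of the chosen response minus log-ratio of the rejected one
          logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
          return -F.logsigmoid(logits).mean()

      loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-14.0]),
                      torch.tensor([-11.0]), torch.tensor([-13.0]))
      ```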

      2 votes
      1. [2]
        skybrian

        Any recommended reading?

        1 vote
    2. [4]
      skybrian

      What sort of tasks does having an encoder help with? I was under the impression it was useful for machine translation, but decoder-only models seem to be pretty good at that?

      1 vote
      1. unkz

        Encoder-only models like BERT are basically for classification, e.g. sentiment analysis, part-of-speech tagging, spam filtering, etc.
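
        As a concrete illustration (assuming the Hugging Face transformers library; the default checkpoint this downloads is a DistilBERT, i.e. a BERT-family encoder):

        ```python
        from transformers import pipeline

        # Loads an encoder-only classifier fine-tuned for sentiment analysis
        classifier = pipeline("sentiment-analysis")
        print(classifier("This article explains LLMs surprisingly well."))
        # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
        ```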

        You are right that decoder-only models can do a good job with translation because the original language is available to the model as part of its context window. Encoder-decoder models have advantages though, particularly in multi-language models with separate encoders and decoders. Basically, you can train separate encoders and decoders to use a shared common embedding which acts as a sort of universal language encoder for human text. Every encoder then converts its language into this universal language, and every decoder can convert that universal language into its particular language. This means you can get substantial crossover learning, e.g. learning an English-French translation and a German-Spanish translation on similar topics will improve both.
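
        A toy sketch of that shared-latent idea in PyTorch (GRUs stand in for real encoder/decoder stacks; every name and size here is illustrative):

        ```python
        import torch
        import torch.nn as nn

        hidden = 256
        langs = ["en", "fr", "de", "es"]
        encoders = nn.ModuleDict({l: nn.GRU(hidden, hidden, batch_first=True) for l in langs})
        decoders = nn.ModuleDict({l: nn.GRU(hidden, hidden, batch_first=True) for l in langs})

        def translate(src_emb, tgt_prev_emb, src, tgt):
            # Encode the source language into the shared "universal" representation
            _, shared = encoders[src](src_emb)
            # Any decoder can read that representation out in its own language,
            # here teacher-forced on the previous target-side embeddings
            out, _ = decoders[tgt](tgt_prev_emb, shared)
            return out

        en_fr = translate(torch.randn(1, 7, hidden), torch.randn(1, 9, hidden), "en", "fr")
        ```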

        Another useful function of encoder networks is that you can often take the encoded information and use it for other purposes outside of the network, e.g. using the encoded form as an input to a random forest or support vector machine, or even comparing encodings as vectors using cosine similarity or other algorithms to find related documents.
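
        A small sketch of that reuse, assuming PyTorch (the 768-dim vectors below are random stand-ins for real document embeddings):

        ```python
        import torch
        import torch.nn.functional as F

        doc_a = torch.randn(768)  # pretend: encoder embedding of document A
        doc_b = torch.randn(768)  # pretend: encoder embedding of document B

        # Values near 1.0 suggest the documents are related
        similarity = F.cosine_similarity(doc_a, doc_b, dim=0)
        ```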

        You can extract similar types of encodings from decoder networks (GPT itself uses them for many purposes) but they are often less useful than specialized encoders due to how the masking works — autoregressive networks only have the left context whereas models like BERT have both left and right context.
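
        The masking difference is easy to see as code (a generic sketch, not any particular model's implementation):

        ```python
        import torch

        T = 5
        causal = torch.tril(torch.ones(T, T))  # GPT-like: token i sees positions <= i
        bidirectional = torch.ones(T, T)       # BERT-like: every token sees both directions
        print(causal)
        # Row i has ones only up to column i: no right context for autoregressive models
        ```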

        5 votes
      2. [2]
        stu2b50

        Decoder-only models are not good at translation. They're good at writing comprehensible English, and can be coaxed to write comprehensible English inspired by a passage in a different language. Decoder-only models perform horribly on translation benchmarks, especially if the translation isn't between something and English.

        1. skybrian

          This isn't something I can judge for myself, but I've seen reports that GPT-4 is pretty good. Maybe they're overly impressed?

          A paper that did some benchmarks: Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine

          This is just a single comparison by someone on Reddit: Wow! GPT-4 beats all the other translators (including Google and DeepL!)

          Is there a better source for this?

          1 vote
    3. unkz

      I think you are a little mistaken about BERT. It is the canonical encoder-only model. Are you thinking of BART?

      1 vote
  2. EgoEimi

    Okay, so it's not a totally beginner-friendly guide, but if you have some adjacent technical background then it's a nice one! I took some courses in Natural Language Processing and AI back in university, but not to a very high level, and I found this to be quite an enlightening and friendly overview of how LLMs work.

    3 votes
  3. [2]
    squalex

    Still reading through the article (about 2/3 through), but I found this really useful in my understanding of AI and the risks associated with it.

    My sense is that one of the concerns over AI is the risk associated with losing out on our ability to reason or gain insights better than the technology can do itself. But after reading this so far, I think the risk is somewhat the opposite. If the only thing this technology is doing is associating words with other similar words (reductive, I know, but I'm trying to be as concise as I can), then that's not really all that critical.

    I think the real risk is trusting this technology to read through material and provide us critical insights. Instead of taking on the task ourselves of reading material and developing critical insights, we're willing to trust a technology that we don't even fully understand (as the article points out). That's a dangerous scenario - especially as uncritical AI-generated material feeds other AI-generated material and therefore becomes less critical and/or self-reinforcing, regressive, superficial pablum. From my own experience, I can't say I've been impressed with any response I've ever received from ChatGPT - it's all filler, no killer.

    There's a wealth of ideas out there that discuss this phenomenon of the negative dialectic of reason, but I'll leave that aside. I'll just say that this AI tech is the latest manifestation of those ideas.

    2 votes
    1. lelio

      You're not wrong. But I would add that using our own brains is also trusting a technology that we don't fully understand.

      We shouldn't trust either. I don't think we fully understand intelligence or sentience in general.

      Building and using AI may be a necessary step if we want to understand how we ourselves think and how intelligence works.

      I think we should approach these LLMs and other neural networks the same way we would if we found a new organism in the jungle. It may be dangerous, poisonous, infectious, etc. It may have amazing medicinal applications. Its structure, or DNA, or behavior may provide answers to unsolved scientific questions.

      I also think it's hubristic to assume something's intelligence is limited just because it was created by humans or because "it only predicts the next word". Ant colonies, slime mold, octopi, humans, etc. are all examples of some level of intelligence arising from simple building blocks.

      We have learned how to create alternative intelligences that work differently than our own. We should absolutely be using them, learning from them, respecting them, and fearing them.

      2 votes