11 votes

Toolformer: Language models can teach themselves to use tools

3 comments

  1. DataWraith
    Link
    It is fascinating to me that you can use a Large Language Model to generate new training data for said language model, which then improves performance on the tasks you care about after re-training.

    In this case, they're teaching the model to use tools by:

    1. Giving a demonstration of tool use to the LLM, which only requires a few hand-written examples
    2. Asking the model to come up with many more examples similar to the demonstrations
    3. Filling in the generated examples with the output of the requested tool/API
    4. Adding the example to the training set only if the tool's output improves perplexity. In other words, if "knowing" the answer from, say, a calculator API makes the correct continuation more probable, the tool use is shown to be beneficial and the example is kept.
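    The filtering criterion in step 4 can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `loss` here is a hypothetical stand-in for the LM's weighted cross-entropy over the tokens following the call, and the threshold value is arbitrary.

    ```python
    # Toy sketch of Toolformer's filtering criterion (step 4 above).
    # `loss` is a fake stand-in for the LM's cross-entropy on the
    # continuation; the real model scores tokens, not substrings.

    def loss(prefix: str, continuation: str) -> float:
        """Fake LM loss: lower when the prefix already 'contains' the answer."""
        return 0.1 if continuation in prefix else 1.0

    def keep_example(text_before: str, api_call: str, api_result: str,
                     continuation: str, threshold: float = 0.5) -> bool:
        """Keep the annotated example only if seeing the tool result helps."""
        # Loss when both the call and its result precede the continuation.
        with_result = loss(f"{text_before}[{api_call} -> {api_result}] ",
                           continuation)
        # Baseline: the better of (a) no call at all, (b) call without result.
        baseline = min(loss(text_before, continuation),
                       loss(f"{text_before}[{api_call}] ", continuation))
        return baseline - with_result >= threshold

    # A calculator result that predicts the continuation passes the filter;
    # an irrelevant call does not.
    print(keep_example("19 * 23 = ", "Calculator(19*23)", "437", "437"))
    print(keep_example("Paris is lovely. ", "Calculator(19*23)", "437",
                       "Everyone agrees."))
    ```

    The key design choice is that the baseline includes the call *without* its result, so the filter rewards the tool's answer specifically, not just the presence of an API annotation.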

    The most impressive version of this bootstrapping procedure I've seen so far is probably Anthropic's Constitutional AI (PDF), which takes in a "Constitution" of rules ("don't be racist", etc.) and has the model critique its own past outputs for adherence to those rules. The model is then asked to rewrite each response to remove any violations of the constitution, and the rewritten responses become part of a new training set, completing one step of the bootstrapping procedure.
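    That critique-and-revise loop can be sketched as follows. `model` is a hypothetical text-in/text-out LM call, and the prompt wording is illustrative, not Anthropic's actual templates.

    ```python
    # Minimal sketch of one Constitutional AI self-revision step.
    # `model` is a hypothetical callable (prompt in, completion out);
    # the prompts are illustrative, not the paper's templates.

    def constitutional_revision(model, response: str, principles: list[str]) -> str:
        """Critique a response against each principle, then rewrite it."""
        for principle in principles:
            critique = model(
                f"Identify ways the response violates the rule: {principle}\n\n"
                f"Response: {response}"
            )
            response = model(
                f"Rewrite the response to address the critique.\n\n"
                f"Critique: {critique}\n\nResponse: {response}"
            )
        # The revised response becomes a new supervised training example.
        return response
    ```

    In the paper this runs over many past responses, and the revised outputs form the fine-tuning set for the next round of training.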

    8 votes
  2. skybrian
    Link
    Here's the abstract:

    Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, two different search engines, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.

    5 votes
  3. skybrian
    Link
    A survey of related work:

    Augmented Language Models: a Survey

    This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. The former is defined as decomposing a potentially complex task into simpler subtasks while the latter consists in calling external modules such as a code interpreter. LMs can leverage these augmentations separately or in combination via heuristics, or learn to do so from demonstrations. While adhering to a standard missing tokens prediction objective, such augmented LMs can use various, possibly non-parametric external modules to expand their context processing ability, thus departing from the pure language modeling paradigm. We therefore refer to them as Augmented Language Models (ALMs). The missing token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks and even outperforming most regular LMs on several benchmarks. In this work, after reviewing current advance in ALMs, we conclude that this new research direction has the potential to address common limitations of traditional LMs such as interpretability, consistency, and scalability issues.

    3 votes