Karpathy posted this and I thought it was interesting. “AI”, or LLMs specifically, are often treated as magic, and some companies reinforce that with how they write about it.
But ultimately the technology is simpler than you might think. This is all it takes to write a decoder-only text model with nothing beyond the Python standard library. Of course, doing unoptimized matrix multiplication on the CPU is not going to run particularly quickly.
---
In conjunction, here’s my attempt to explain the underlying theory:
First, you can create arbitrary function estimators by summing linear functions passed through non-linear maps. As an example, consider f(x) = x^2, the classic parabola. You can’t estimate this with a single straight line, for obvious reasons.
But what if you used two lines? First, define relu(x) = x if x > 0, and 0 otherwise. Then relu(x) + relu(-x) estimates x^2 far more closely. Not perfect, but at least you get the two lines forming a V, and you can see how you get closer and closer the more pieces you add.
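To make that concrete, here’s a minimal sketch in plain Python (nothing beyond the standard library), comparing the two-line estimate against the parabola:

    # Two ReLUs summed give a V shape (it's exactly |x|) that roughly
    # tracks the parabola near the origin; more pieces would track it better.
    def relu(x):
        return x if x > 0 else 0.0

    def v(x):
        return relu(x) + relu(-x)

    for x in [-2, -1, -0.5, 0, 0.5, 1, 2]:
        print(x, v(x), x ** 2)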
Next, as long as you can calculate a derivative, you can use something called gradient descent to nudge those piecewise lines toward a target function.
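Here’s a toy version of that, fitting the weights of a*relu(x) + b*relu(-x) + c to x^2, with the gradients of the squared error worked out by hand (real frameworks compute these automatically; this is just a sketch):

    def relu(x):
        return x if x > 0 else 0.0

    xs = [i / 10 - 1 for i in range(21)]   # sample points in [-1, 1]
    ys = [x ** 2 for x in xs]              # target: the parabola

    a, b, c = 0.1, 0.1, 0.0                # small starting weights
    lr = 0.1                               # learning rate (step size)
    for _ in range(2000):
        for x, y in zip(xs, ys):
            err = a * relu(x) + b * relu(-x) + c - y
            # hand-derived gradients of the squared error err**2
            a -= lr * 2 * err * relu(x)
            b -= lr * 2 * err * relu(-x)
            c -= lr * 2 * err

    print(a, b, c)  # a and b settle near 1, c near -1/6: the best-fit V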
All that is to get you to a point where you understand two things:

1. Matrices combined with what are called “activation” functions (often as simple as relu) can represent functions of arbitrary complexity (see the sketch after this list).
2. For inputs and outputs where the error is differentiable, you can use gradient descent to optimize the matrices to approximate any function you want.
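To make point 1 concrete: the V-shaped estimate from earlier is exactly a one-hidden-layer network, i.e. a matrix multiply, a relu, and another matrix multiply. The weights below are chosen by hand to reproduce relu(x) + relu(-x):

    def relu(x):
        return x if x > 0 else 0.0

    def matvec(W, v):
        # plain-Python matrix-vector product
        return [sum(w * x for w, x in zip(row, v)) for row in W]

    W1 = [[1.0], [-1.0]]   # 1 input -> 2 hidden units: [x, -x]
    W2 = [[1.0, 1.0]]      # 2 hidden units -> 1 output: their sum

    def net(x):
        h = [relu(z) for z in matvec(W1, [x])]
        return matvec(W2, h)[0]

    print(net(-0.5), net(0.5))  # both 0.5, i.e. |x|

Training just means letting gradient descent pick W1 and W2 instead of writing them by hand.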
Then the problem is just framing the task so that it’s differentiable, and playing around with the specifics so that it produces useful results.
Now, let’s define the function we want to estimate as: the probability that the next token is “a”, “b”, and so forth (so, a 26-length vector that sums to 1), given that the previous characters are X.
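The usual trick for getting a vector that sums to 1 is a softmax over the model’s raw output scores. A sketch, with made-up scores:

    import math

    def softmax(scores):
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    scores = [0.0] * 26
    scores[0] = 2.0              # pretend the model scored "a" highest
    probs = softmax(scores)
    print(probs[0], sum(probs))  # "a" gets the most mass; total is 1.0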
Once you train that model, you can generate text by randomly picking a letter based on the probabilities it outputs, and doing that in a loop.
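That loop looks roughly like this, assuming a hypothetical next_letter_probs(context) standing in for the trained model:

    import random
    import string

    def generate(next_letter_probs, n=100):
        context = ""
        for _ in range(n):
            probs = next_letter_probs(context)  # 26 probabilities
            # sample one letter according to the model's distribution
            letter = random.choices(string.ascii_lowercase, weights=probs)[0]
            context += letter
        return context

    # demo with a stand-in "model" that returns uniform probabilities
    print(generate(lambda context: [1 / 26] * 26, n=20))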