Pen & Paper LM 160
Pen & Paper LM 160 is a microscopic language model designed to be run by hand.
It does not generate normal prose. It predicts the next token in a small workflow language:
QUESTION
TASK
IDEA
FACT
PROBLEM
UNKNOWN
PLAN
DO
CHECK
ASK
ANSWER
DONE
The model is trained to produce simple, useful loops:
QUESTION → CHECK → ANSWER → DONE
TASK → PLAN → DO → CHECK → DONE
IDEA → CHECK → PLAN → DO → CHECK → DONE
FACT → ANSWER → DONE
PROBLEM → CHECK → PLAN → DO → CHECK → DONE
UNKNOWN → ASK → DONE
DONE → DONE
It is small enough to infer on paper, but it still has the core parts of a neural language model: context, weights, hidden activations, ReLU, logits, and next-token decoding.
Model shape
Pen & Paper LM 160 uses the previous token and the current token as context.
previous token + current token
→ hidden layer
→ ReLU
→ output logits
→ next token
It has:
12 tokens
2-token context
4 hidden neurons
12 output logits
The first layer is split into two tables:
W1_prev: weights for the previous token
W1_curr: weights for the current token
This is equivalent to a single 24 × 4 input matrix, but two 12 × 4 tables are easier to look up by hand.
Parameter count
W1_prev: 12 × 4 = 48
W1_curr: 12 × 4 = 48
b1: 4
W2: 4 × 12 = 48
b2: 12
Total = 160 parameters
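The tally above can be double-checked in a few lines (nothing model-specific here, just the arithmetic):

```python
tokens, hidden = 12, 4
total = (tokens * hidden      # W1_prev: 48
         + tokens * hidden    # W1_curr: 48
         + hidden             # b1: 4
         + hidden * tokens    # W2: 48
         + tokens)            # b2: 12
print(total)  # 160
```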
Token order
Use this order for all output scores:
| Index | Token |
|---|---|
| 0 | QUESTION |
| 1 | TASK |
| 2 | IDEA |
| 3 | FACT |
| 4 | PROBLEM |
| 5 | UNKNOWN |
| 6 | PLAN |
| 7 | DO |
| 8 | CHECK |
| 9 | ASK |
| 10 | ANSWER |
| 11 | DONE |
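For hand inference the table is all you need; on a computer the same ordering becomes a lookup (the variable names here are mine):

```python
# Token order for Pen & Paper LM 160; index = position in the logit vector.
TOKENS = ["QUESTION", "TASK", "IDEA", "FACT", "PROBLEM", "UNKNOWN",
          "PLAN", "DO", "CHECK", "ASK", "ANSWER", "DONE"]
INDEX = {tok: i for i, tok in enumerate(TOKENS)}  # e.g. INDEX["PLAN"] == 6
```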
Inference rule
If the prompt has one token, duplicate it.
TASK
becomes:
previous = TASK
current = TASK
Then calculate:
h_raw = W1_prev[previous] + W1_curr[current] + b1
h = ReLU(h_raw)
logits = h × W2 + b2
ReLU is simple:
negative numbers become 0
zero and positive numbers stay unchanged
The predicted next token is the token with the highest logit.
If there is a tie, choose the tied token that comes first in the token order.
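The whole inference rule fits in a short sketch. This version works for any weight tables stored as one list of hidden weights per token (the function names are my own):

```python
def relu(v):
    # Negative entries become 0; zero and positive entries are unchanged.
    return [max(0, x) for x in v]

def predict(prev, curr, W1_prev, W1_curr, b1, W2, b2, tokens):
    # h_raw = W1_prev[previous] + W1_curr[current] + b1
    h = relu([W1_prev[prev][j] + W1_curr[curr][j] + b1[j]
              for j in range(len(b1))])
    # logits = h × W2 + b2, with W2 stored as one row per hidden neuron.
    logits = [sum(h[j] * W2[j][i] for j in range(len(h))) + b2[i]
              for i in range(len(b2))]
    # max() returns the first maximum, which matches the tie-break rule.
    return tokens[max(range(len(logits)), key=logits.__getitem__)]
```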
Hidden bias
b1 = [1, 1, 1, 2]
W1 previous-token table
| Previous token | H1 | H2 | H3 | H4 |
|---|---|---|---|---|
| QUESTION | 0 | 4 | -2 | 0 |
| TASK | -1 | -2 | 0 | 0 |
| IDEA | 0 | 0 | 2 | 3 |
| FACT | 0 | 2 | 1 | -1 |
| PROBLEM | 0 | 0 | 2 | 3 |
| UNKNOWN | -1 | 0 | 2 | -1 |
| PLAN | 1 | 1 | -1 | 2 |
| DO | 2 | 1 | 2 | -1 |
| CHECK | -1 | 0 | 0 | -1 |
| ASK | 0 | 0 | 0 | 0 |
| ANSWER | 0 | 0 | 0 | 0 |
| DONE | 2 | 2 | 2 | -1 |
W1 current-token table
| Current token | H1 | H2 | H3 | H4 |
|---|---|---|---|---|
| QUESTION | 0 | 1 | -1 | 3 |
| TASK | 0 | -1 | 1 | 2 |
| IDEA | 3 | 1 | -2 | 2 |
| FACT | -1 | 3 | -2 | -1 |
| PROBLEM | 3 | 1 | -2 | 2 |
| UNKNOWN | 0 | -1 | 3 | -1 |
| PLAN | -1 | -1 | -2 | -2 |
| DO | 2 | 1 | -1 | 2 |
| CHECK | -1 | 2 | 2 | -1 |
| ASK | 0 | 3 | 2 | -1 |
| ANSWER | 2 | 2 | 3 | -1 |
| DONE | 2 | 1 | 2 | -1 |
W2 output matrix
Columns use the token order:
QUESTION TASK IDEA FACT PROBLEM UNKNOWN PLAN DO CHECK ASK ANSWER DONE
| Hidden row | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H1 | -1 | -1 | -1 | -2 | -2 | -1 | -2 | -2 | 2 | -1 | -2 | 1 |
| H2 | -2 | -2 | -2 | -1 | -2 | -2 | -1 | -1 | 1 | -4 | 3 | 1 |
| H3 | -2 | -2 | -1 | -1 | -1 | -1 | 1 | -1 | -3 | 3 | -2 | 1 |
| H4 | -2 | -1 | -1 | -1 | -1 | -2 | 3 | -1 | 2 | -2 | -2 | -2 |
Output bias
Again, this uses the same token order:
QUESTION TASK IDEA FACT PROBLEM UNKNOWN PLAN DO CHECK ASK ANSWER DONE
b2 = [-2, -3, -3, -2, -2, -2, -2, 7, -1, -2, -2, -1]
Worked example
Prompt:
TASK
Because there is only one token, duplicate it:
previous = TASK
current = TASK
Find the two rows:
W1_prev[TASK] = [-1, -2, 0, 0]
W1_curr[TASK] = [ 0, -1, 1, 2]
b1 = [ 1, 1, 1, 2]
Add them:
h_raw = [-1, -2, 0, 0]
+ [ 0, -1, 1, 2]
+ [ 1, 1, 1, 2]
h_raw = [0, -2, 2, 4]
Apply ReLU:
h = [0, 0, 2, 4]
Now calculate the output logits:
logits = h × W2 + b2
Result:
| Token | Logit |
|---|---|
| QUESTION | -14 |
| TASK | -11 |
| IDEA | -9 |
| FACT | -8 |
| PROBLEM | -8 |
| UNKNOWN | -12 |
| PLAN | 12 |
| DO | 1 |
| CHECK | 1 |
| ASK | -4 |
| ANSWER | -14 |
| DONE | -7 |
The highest logit is:
PLAN = 12
So the model predicts:
TASK → PLAN
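Since h = [0, 0, 2, 4], only hidden rows H3 and H4 contribute, so the whole logit table reduces to three rows of arithmetic (rows copied from W2 and b2 above):

```python
# Reproduce the logit table for prev = curr = TASK, where h = [0, 0, 2, 4].
H3 = [-2, -2, -1, -1, -1, -1, 1, -1, -3, 3, -2, 1]
H4 = [-2, -1, -1, -1, -1, -2, 3, -1, 2, -2, -2, -2]
b2 = [-2, -3, -3, -2, -2, -2, -2, 7, -1, -2, -2, -1]
logits = [2 * r3 + 4 * r4 + b for r3, r4, b in zip(H3, H4, b2)]
print(logits)  # [-14, -11, -9, -8, -8, -12, 12, 1, 1, -4, -14, -7]
```

The maximum is 12 at index 6, which is PLAN in the token order.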
Continue the same way:
TASK PLAN → DO
PLAN DO → CHECK
DO CHECK → DONE
Full output:
TASK → PLAN → DO → CHECK → DONE
Verified behavior
With greedy decoding, the model produces:
QUESTION → CHECK → ANSWER → DONE
TASK → PLAN → DO → CHECK → DONE
IDEA → CHECK → PLAN → DO → CHECK → DONE
FACT → ANSWER → DONE
PROBLEM → CHECK → PLAN → DO → CHECK → DONE
UNKNOWN → ASK → DONE
DONE → DONE
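All seven loops can be checked mechanically. The sketch below is a straight transcription of the weight tables above into Python; the function names (`next_token`, `generate`) and the dict layout are my own choices:

```python
# Pen & Paper LM 160, transcribed from the weight tables above.
TOKENS = ["QUESTION", "TASK", "IDEA", "FACT", "PROBLEM", "UNKNOWN",
          "PLAN", "DO", "CHECK", "ASK", "ANSWER", "DONE"]

W1_PREV = {  # previous-token table: 4 hidden weights per token
    "QUESTION": [0, 4, -2, 0],   "TASK":    [-1, -2, 0, 0],
    "IDEA":     [0, 0, 2, 3],    "FACT":    [0, 2, 1, -1],
    "PROBLEM":  [0, 0, 2, 3],    "UNKNOWN": [-1, 0, 2, -1],
    "PLAN":     [1, 1, -1, 2],   "DO":      [2, 1, 2, -1],
    "CHECK":    [-1, 0, 0, -1],  "ASK":     [0, 0, 0, 0],
    "ANSWER":   [0, 0, 0, 0],    "DONE":    [2, 2, 2, -1],
}
W1_CURR = {  # current-token table
    "QUESTION": [0, 1, -1, 3],    "TASK":    [0, -1, 1, 2],
    "IDEA":     [3, 1, -2, 2],    "FACT":    [-1, 3, -2, -1],
    "PROBLEM":  [3, 1, -2, 2],    "UNKNOWN": [0, -1, 3, -1],
    "PLAN":     [-1, -1, -2, -2], "DO":      [2, 1, -1, 2],
    "CHECK":    [-1, 2, 2, -1],   "ASK":     [0, 3, 2, -1],
    "ANSWER":   [2, 2, 3, -1],    "DONE":    [2, 1, 2, -1],
}
B1 = [1, 1, 1, 2]
W2 = [  # one row of 12 output weights per hidden neuron (H1..H4)
    [-1, -1, -1, -2, -2, -1, -2, -2, 2, -1, -2, 1],
    [-2, -2, -2, -1, -2, -2, -1, -1, 1, -4, 3, 1],
    [-2, -2, -1, -1, -1, -1, 1, -1, -3, 3, -2, 1],
    [-2, -1, -1, -1, -1, -2, 3, -1, 2, -2, -2, -2],
]
B2 = [-2, -3, -3, -2, -2, -2, -2, 7, -1, -2, -2, -1]

def next_token(prev, curr):
    h = [max(0, W1_PREV[prev][j] + W1_CURR[curr][j] + B1[j]) for j in range(4)]
    logits = [sum(h[j] * W2[j][i] for j in range(4)) + B2[i] for i in range(12)]
    # max() keeps the first highest logit, matching the tie-break rule.
    return TOKENS[max(range(12), key=logits.__getitem__)]

def generate(start, max_steps=8):
    seq = [start]
    prev = curr = start  # a one-token prompt is duplicated
    for _ in range(max_steps):
        nxt = next_token(prev, curr)
        seq.append(nxt)
        if nxt == "DONE":
            break
        prev, curr = curr, nxt
    return seq

for t in TOKENS[:6] + ["DONE"]:
    print(" → ".join(generate(t)))
```

Running it prints exactly the seven loops listed above, one per starting token.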
Pen & Paper LM 160 is not useful because it is powerful. It is useful because the whole model is visible.
You can inspect every weight, run every step by hand, and change the model’s behavior directly. Raising weights toward CHECK makes it more cautious. Raising weights toward DO makes it more action-oriented. Raising weights toward DONE makes it finish sooner.
It is a complete language model small enough to fit in a notebook.
P.S. Here's a little playground where you can fiddle with the model.