Pen & Paper LM 160

Pen & Paper LM 160 is a microscopic language model designed to be run by hand.

It does not generate normal prose. It predicts the next token in a small workflow language:

QUESTION
TASK
IDEA
FACT
PROBLEM
UNKNOWN
PLAN
DO
CHECK
ASK
ANSWER
DONE

The model is trained to produce simple, useful loops:

QUESTION → CHECK → ANSWER → DONE
TASK → PLAN → DO → CHECK → DONE
IDEA → CHECK → PLAN → DO → CHECK → DONE
FACT → ANSWER → DONE
PROBLEM → CHECK → PLAN → DO → CHECK → DONE
UNKNOWN → ASK → DONE
DONE → DONE

It is small enough to run inference on paper, but it still has the core parts of a neural language model: context, weights, hidden activations, ReLU, logits, and next-token decoding.

Model shape

Pen & Paper LM 160 uses the previous token and the current token as context.

previous token + current token
→ hidden layer
→ ReLU
→ output logits
→ next token

It has:

a 12-token vocabulary
a 2-token context
4 hidden neurons
12 output logits

The first layer is split into two tables:

W1_prev: weights for the previous token
W1_curr: weights for the current token

This is equivalent to a 24 × 4 input matrix, but easier to use by hand.
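In code, the split-table trick is one line. A minimal sketch in Python, assuming the tables are stored as nested lists named W1_prev, W1_curr, and b1 (my names, not part of the spec), where prev and curr are the integer indices of the previous and current token:

# one row from each table, plus the bias, gives the raw hidden vector;
# same arithmetic as a two-hot 24-vector times a 24 x 4 matrix
h_raw = [W1_prev[prev][i] + W1_curr[curr][i] + b1[i] for i in range(4)]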

Parameter count

W1_prev: 12 × 4 = 48
W1_curr: 12 × 4 = 48
b1:       4

W2:       4 × 12 = 48
b2:       12

Total = 160 parameters

Token order

Use this order for all output scores:

Index Token
0 QUESTION
1 TASK
2 IDEA
3 FACT
4 PROBLEM
5 UNKNOWN
6 PLAN
7 DO
8 CHECK
9 ASK
10 ANSWER
11 DONE
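For the code sketches in this post, that order can live in a plain Python list (the name VOCAB is mine):

VOCAB = ["QUESTION", "TASK", "IDEA", "FACT", "PROBLEM", "UNKNOWN",
         "PLAN", "DO", "CHECK", "ASK", "ANSWER", "DONE"]

So VOCAB[6] is "PLAN" and VOCAB.index("PLAN") is 6.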

Inference rule

If the prompt has one token, duplicate it.

TASK

becomes:

previous = TASK
current  = TASK

Then calculate:

h_raw = W1_prev[previous] + W1_curr[current] + b1
h = ReLU(h_raw)
logits = h × W2 + b2

ReLU is simple:

negative numbers become 0
zero and positive numbers stay unchanged
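In Python, on a plain list:

def relu(v):
    # negatives become 0; zero and positives pass through
    return [x if x > 0 else 0 for x in v]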

The predicted next token is the token with the highest logit.

If there is a tie, pick the tied token that comes earliest in the token order.
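The whole inference rule fits in a few lines. A sketch (predict_next is my name), using relu from above and the weight tables from the next sections stored as nested lists:

def predict_next(prev, curr):
    # prev, curr: token indices into the vocabulary
    h = relu([W1_prev[prev][i] + W1_curr[curr][i] + b1[i] for i in range(4)])
    # h (1x4) times W2 (4x12), plus b2
    logits = [sum(h[i] * W2[i][j] for i in range(4)) + b2[j] for j in range(12)]
    return logits.index(max(logits))  # argmax; ties go to the earliest index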

Hidden bias

b1 = [1, 1, 1, 2]

W1 previous-token table

Previous token H1 H2 H3 H4
QUESTION 0 4 -2 0
TASK -1 -2 0 0
IDEA 0 0 2 3
FACT 0 2 1 -1
PROBLEM 0 0 2 3
UNKNOWN -1 0 2 -1
PLAN 1 1 -1 2
DO 2 1 2 -1
CHECK -1 0 0 -1
ASK 0 0 0 0
ANSWER 0 0 0 0
DONE 2 2 2 -1

W1 current-token table

Current token H1 H2 H3 H4
QUESTION 0 1 -1 3
TASK 0 -1 1 2
IDEA 3 1 -2 2
FACT -1 3 -2 -1
PROBLEM 3 1 -2 2
UNKNOWN 0 -1 3 -1
PLAN -1 -1 -2 -2
DO 2 1 -1 2
CHECK -1 2 2 -1
ASK 0 3 2 -1
ANSWER 2 2 3 -1
DONE 2 1 2 -1

W2 output matrix

Columns use the token order (index 0 through 11); each row is one hidden neuron:

QUESTION TASK IDEA FACT PROBLEM UNKNOWN PLAN DO CHECK ASK ANSWER DONE
H1 -1 -1 -1 -2 -2 -1 -2 -2 2 -1 -2 1
H2 -2 -2 -2 -1 -2 -2 -1 -1 1 -4 3 1
H3 -2 -2 -1 -1 -1 -1 1 -1 -3 3 -2 1
H4 -2 -1 -1 -1 -1 -2 3 -1 2 -2 -2 -2

Output bias

Again, this uses the same token order:

QUESTION TASK IDEA FACT PROBLEM UNKNOWN PLAN DO CHECK ASK ANSWER DONE
b2 = [-2, -3, -3, -2, -2, -2, -2, 7, -1, -2, -2, -1]

Worked example

Prompt:

TASK

Because there is only one token, duplicate it:

previous = TASK
current  = TASK

Find the two rows:

W1_prev[TASK] = [-1, -2, 0, 0]
W1_curr[TASK] = [ 0, -1, 1, 2]
b1            = [ 1,  1, 1, 2]

Add them:

h_raw = [-1, -2, 0, 0]
      + [ 0, -1, 1, 2]
      + [ 1,  1, 1, 2]

h_raw = [0, -2, 2, 4]

Apply ReLU:

h = [0, 0, 2, 4]

Now calculate the output logits:

logits = h × W2 + b2
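To see one column worked out: the PLAN column of W2 (index 6, read top to bottom) is [-2, -1, 1, 3], and b2[PLAN] = -2, so

logit(PLAN) = 0 × (-2) + 0 × (-1) + 2 × 1 + 4 × 3 + (-2) = 12

The other eleven logits are computed the same way.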

Result:

Token Logit
QUESTION -14
TASK -11
IDEA -9
FACT -8
PROBLEM -8
UNKNOWN -12
PLAN 12
DO 1
CHECK 1
ASK -4
ANSWER -14
DONE -7

The highest logit is:

PLAN = 12

So the model predicts:

TASK → PLAN

Continue the same way:

TASK PLAN → DO
PLAN DO → CHECK
DO CHECK → DONE

Full output:

TASK → PLAN → DO → CHECK → DONE
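If you want a machine to check your pencil work, the sketch below copies the five tables verbatim and reuses VOCAB, relu, and predict_next from the snippets above; generate and max_steps are my additions:

W1_prev = [
    [ 0,  4, -2,  0],  # QUESTION
    [-1, -2,  0,  0],  # TASK
    [ 0,  0,  2,  3],  # IDEA
    [ 0,  2,  1, -1],  # FACT
    [ 0,  0,  2,  3],  # PROBLEM
    [-1,  0,  2, -1],  # UNKNOWN
    [ 1,  1, -1,  2],  # PLAN
    [ 2,  1,  2, -1],  # DO
    [-1,  0,  0, -1],  # CHECK
    [ 0,  0,  0,  0],  # ASK
    [ 0,  0,  0,  0],  # ANSWER
    [ 2,  2,  2, -1],  # DONE
]

W1_curr = [
    [ 0,  1, -1,  3],  # QUESTION
    [ 0, -1,  1,  2],  # TASK
    [ 3,  1, -2,  2],  # IDEA
    [-1,  3, -2, -1],  # FACT
    [ 3,  1, -2,  2],  # PROBLEM
    [ 0, -1,  3, -1],  # UNKNOWN
    [-1, -1, -2, -2],  # PLAN
    [ 2,  1, -1,  2],  # DO
    [-1,  2,  2, -1],  # CHECK
    [ 0,  3,  2, -1],  # ASK
    [ 2,  2,  3, -1],  # ANSWER
    [ 2,  1,  2, -1],  # DONE
]

b1 = [1, 1, 1, 2]

W2 = [  # rows H1..H4, columns in the token order
    [-1, -1, -1, -2, -2, -1, -2, -2,  2, -1, -2,  1],
    [-2, -2, -2, -1, -2, -2, -1, -1,  1, -4,  3,  1],
    [-2, -2, -1, -1, -1, -1,  1, -1, -3,  3, -2,  1],
    [-2, -1, -1, -1, -1, -2,  3, -1,  2, -2, -2, -2],
]

b2 = [-2, -3, -3, -2, -2, -2, -2, 7, -1, -2, -2, -1]

def generate(start, max_steps=10):
    tokens = [start, start]  # a one-token prompt is duplicated
    for _ in range(max_steps):
        tokens.append(predict_next(tokens[-2], tokens[-1]))
        if VOCAB[tokens[-1]] == "DONE":
            break
    return " → ".join(VOCAB[t] for t in tokens[1:])

print(generate(VOCAB.index("TASK")))  # TASK → PLAN → DO → CHECK → DONE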

Verified behavior

With greedy decoding, the model produces:

QUESTION → CHECK → ANSWER → DONE
TASK → PLAN → DO → CHECK → DONE
IDEA → CHECK → PLAN → DO → CHECK → DONE
FACT → ANSWER → DONE
PROBLEM → CHECK → PLAN → DO → CHECK → DONE
UNKNOWN → ASK → DONE
DONE → DONE
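Looping the script above over every starting token prints exactly this list:

for name in ["QUESTION", "TASK", "IDEA", "FACT",
             "PROBLEM", "UNKNOWN", "DONE"]:
    print(generate(VOCAB.index(name)))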

Pen & Paper LM 160 is not useful because it is powerful. It is useful because the whole model is visible.

You can inspect every weight, run every step by hand, and change the model’s behavior directly. Raising weights toward CHECK makes it more cautious. Raising weights toward DO makes it more action-oriented. Raising weights toward DONE makes it finish sooner.

It is a complete language model small enough to fit in a notebook.

P.S. Here's a little playground where you can fiddle with the model.