Chapter 9: Single-Head Attention - Tokens Looking at Each Other

What You'll Build

The attention mechanism: the only place in a transformer where a token at position t gets to look at tokens at positions 0..t-1. This is specifically self-attention, where the token attends to other tokens in the same sequence. (You might encounter "cross-attention" in other materials, which is used in encoder-decoder models where tokens attend to a different sequence. We don't use cross-attention here.)

Depends On

Chapters 1-2, 5 (Value, Helpers).

The Core Idea

Until now, each token has been processed independently. The token at position 3 has no idea what's at positions 0, 1, or 2. Attention fixes this by letting each token ask: "what earlier tokens are relevant to me?" Because each token can only look backward (at positions before it, never ahead), this is called causal attention. The past can influence the future, but not the other way around.

It works through three separate projections of the same input:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I offer if selected?"

Where the Names Come From

If these descriptions feel a bit hand-wavy, that's because they are. The Query/Key/Value names are borrowed from database lookup, which pre-dates transformers by decades. In that world you have a query (what you want to find), the database has keys to match against, and each key has an associated value that gets returned when the key matches. Attention works the same way: Q and K dot-product together to measure "match", and V is the payload that flows through for the matches that win.

What's actually load-bearing isn't the names, it's the math. The attention formula needs two vectors to dot-product together and one to weight-and-sum, so three projections are required. You could rename them Alice, Bob, and Carol and the arithmetic would be identical. The query/key/value descriptions are a database metaphor humans use to reason about what the projections are for. The model doesn't know or care. That's why those "what am I looking for?" descriptions feel a little forced: the projections don't have to encode anything human-interpretable, they just have to play their mathematical roles.

Why Three Separate Projections?

You might wonder why we need three separate projections rather than just using the embedding directly as Q, K, and V. Each projection lets the model learn a different aspect of the same token. The query might learn to represent "what kind of character should come next", while the key learns "what kind of character am I", and the value learns "what information should I pass forward if selected". Three separate learned projections give the model the flexibility to use the same input in three different ways.

Why the Dot Product Measures Matching

The dot product multiplies matching elements of two vectors and sums them. When two vectors are aligned (point the same way), matching elements tend to share signs and contribute positive values, so the sum gets large. When they're perpendicular (unrelated), the dimensions where one vector is active are dimensions where the other is near zero, so the contributions cancel out. When they're opposite, the signs disagree everywhere and the sum is large and negative.
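
(Geometrically, a · b = |a| |b| cos(theta): the dot product tracks the angle between the vectors, positive when they're aligned, zero at 90 degrees, and negative when they're opposed.)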

Concrete examples with a 4-dimensional Q asking a question, and three different K candidates:

Q = [3, 2, -1, 0]                       (the query)

K_similar   = [2, 1, -1, 0]     Q·K = 3*2 + 2*1 + (-1)*(-1) + 0*0      =   9   large positive - matches
K_unrelated = [0, 0, 0, 5]      Q·K = 3*0 + 2*0 + (-1)*0 + 0*5         =   0   zero - no relationship
K_opposite  = [-3, -2, 1, 0]    Q·K = 3*(-3) + 2*(-2) + (-1)*1 + 0*0   = -14   large negative - anti-match

For attention, only the first case is what the model wants: a Q asking a question, and an earlier token's K offering an answer that aligns. The higher the dot product, the more relevant that earlier token is.
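
If you want to check these by hand, the arithmetic is easy to replicate with plain doubles (a standalone snippet for verification only, not part of MicroGPT):

// Dot product over plain double arrays, just to verify the table above.
double Dot(double[] a, double[] b) => a.Zip(b, (x, y) => x * y).Sum();

double[] q = { 3, 2, -1, 0 };
Console.WriteLine(Dot(q, new double[] { 2, 1, -1, 0 }));  // 9
Console.WriteLine(Dot(q, new double[] { 0, 0, 0, 5 }));   // 0
Console.WriteLine(Dot(q, new double[] { -3, -2, 1, 0 })); // -14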

The Attention Pipeline

The math flows in stages:

Q · K[t]        → score for position t  (how well Q matches K[t])
scores / sqrt(d)  → scaled logits (keeps the numbers in a softmax-friendly range)
softmax(logits) → attention weights that sum to 1
weights × V[t]  → weighted contribution from position t
sum of those    → final output

Q and K never touch V directly. Q·K decides who's relevant, and V is what the relevant ones pass along. Each ingredient has one job:

  • K answers "do I match?" (used in the dot product)
  • V answers "if I'm relevant, here's what I have to share"
  • Q is the question being asked

The original transformer paper ("Attention Is All You Need") calls this scaled dot-product attention because of the / sqrt(d) scaling step.
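
In the paper's matrix notation, with d_k as the key dimension, the whole pipeline collapses to one line:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

MicroGPT computes the same thing, one query row at a time.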

A Concrete Example

Suppose we're processing the name "emma" and we're at the second 'm' (position 2), trying to predict what comes next.

The model might learn a query like "what vowels appeared recently?". The earlier 'e' at position 0 would have a key that matches this query well, giving it a high attention weight. Its value (encoding information about being a vowel) flows into the current position. The first 'm' at position 1 might get a lower weight because its key doesn't match the query as well.

This is how the model learns long-range patterns like "after two consonants, a vowel is likely".

The KV Cache

A key design detail: during both training and inference, we process one token at a time. After computing K and V for the current token, we append them to a cache. When computing attention for position 5, we already have cached K and V from positions 0-4, and we compute dot products between the current Q and all cached Ks.

"Doesn't the KV cache make training different from inference?" Not algorithmically. In production systems, the KV cache is usually only used at inference, because during training all positions are processed in parallel using matrix operations. But the math is identical. MicroGPT processes one token at a time during both training and inference, making the KV cache explicit in both cases.

"Where is the causal mask?" If you've read nanoGPT or other batched transformer code, you've probably seen a lower-triangular tril matrix multiplied into the attention scores to zero out future positions. MicroGPT has no such mask, and it doesn't need one. Because we build the sequence token-by-token and append to cachedKeys and cachedValues as we go, the only keys and values in scope when computing attention at position t are the ones from positions 0 through t. The future tokens physically aren't in the cache yet, so there is nothing to mask. Sequential KV caching replaces matrix masking - same causality, different shape.

A subtle but important point: the cached keys and values are not frozen numbers. They're live Value objects that are part of the computation graph. When Backward runs, gradients flow through the cached values just like any other Value. That's what makes attention learnable. The model adjusts the weight projections (queryWeights, keyWeights, valueWeights) based on how the cached keys and values contributed to the final loss.
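
You can watch this happen with a two-node example (this assumes the Value class from Chapters 1-2 exposes a Grad field alongside Backward(); adjust the names to your implementation):

var k = new Value(2.0);    // stand-in for one element of a cached key
var q = new Value(3.0);    // stand-in for one element of the current query
Value score = q * k;       // one term of an attention dot product

score.Backward();
Console.WriteLine(k.Grad); // 3.0 - the gradient flowed back into the cached value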

Code

Here's the shape of scaled dot-product attention, extracted from the full GptModel.Forward you'll build in Chapter 11. This is exactly the computation the runnable exercise below walks through, and what Chapter 10 generalises to multiple heads.

The four weight matrices. queryWeights, keyWeights, valueWeights, and outputWeights below are the learned matrices for this attention layer. Three of them turn the input into Q, K, and V projections. The fourth (outputWeights) is applied at the very end to mix the attention result back into the model's internal representation. This becomes important in Chapter 10 when multi-head attention needs to mix information across heads. In Chapter 11's GptModel you'll see these stored under GPT-2's state dict keys (attn_wq, attn_wk, attn_wv, attn_wo) so PyTorch checkpoints could map directly, but the descriptive parameter names used here make the roles clearer.

// Shape reference - Chapter 11 integrates this into GptModel.Forward.
// embeddingSize is the embedding dimension (16 in our model, set in Chapter 6)

List<Value> SingleHeadAttention(
    List<Value> x,
    List<List<Value>> cachedKeys,
    List<List<Value>> cachedValues,
    List<List<Value>> queryWeights,
    List<List<Value>> keyWeights,
    List<List<Value>> valueWeights,
    List<List<Value>> outputWeights
)
{
    List<Value> query = Helpers.Linear(x, queryWeights);
    List<Value> key = Helpers.Linear(x, keyWeights);
    List<Value> value = Helpers.Linear(x, valueWeights);

    cachedKeys.Add(key);
    cachedValues.Add(value);

    var attentionLogits = new List<Value>();
    for (int t = 0; t < cachedKeys.Count; t++)
    {
        var dot = new Value(0);
        for (int j = 0; j < embeddingSize; j++)
        {
            dot += query[j] * cachedKeys[t][j];
        }

        // Scale by sqrt(embeddingSize) to keep the dot products in a reasonable range.
        // Without this, larger embedding dimensions produce larger dot products,
        // which push Softmax toward extreme values (all weight on one token).
        attentionLogits.Add(dot / Math.Sqrt(embeddingSize));
    }

    List<Value> attentionWeights = Helpers.Softmax(attentionLogits);

    var output = new List<Value>();
    for (int j = 0; j < embeddingSize; j++)
    {
        output.Add(new Value(0));
    }

    for (int t = 0; t < cachedValues.Count; t++)
    {
        Value w = attentionWeights[t];
        for (int j = 0; j < embeddingSize; j++)
        {
            output[j] += w * cachedValues[t][j];
        }
    }

    return Helpers.Linear(output, outputWeights);
}
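
To see what the sqrt(embeddingSize) division in the middle is protecting against, here's a standalone illustration with plain doubles (not MicroGPT's Value-based Softmax). Imagine an embedding dimension of 256, where raw dot products of unit-scale vectors can easily reach 16:

// Plain-double softmax, for illustration only.
double[] SoftmaxD(double[] z)
{
    double max = z.Max(); // subtract the max for numerical stability
    double[] exps = z.Select(v => Math.Exp(v - max)).ToArray();
    double sum = exps.Sum();
    return exps.Select(v => v / sum).ToArray();
}

// Unscaled: a single dot product of 16 starves every other position.
Console.WriteLine(string.Join(", ", SoftmaxD(new[] { 16.0, 0.0, 0.0 })));
// ~ 1.0000, 0.0000, 0.0000

// Scaled by sqrt(256) = 16: the distribution stays soft.
Console.WriteLine(string.Join(", ", SoftmaxD(new[] { 1.0, 0.0, 0.0 })));
// ~ 0.5761, 0.2119, 0.2119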

Exercise: Hand-Crafted Single-Head Attention

The code above is what GptModel will use, but it takes learned projections (queryWeights, keyWeights, valueWeights, outputWeights), which means you can't run it meaningfully until training starts in Chapter 11. To actually see attention working, the exercise below skips the projections and constructs Q, K, and V by hand, picking directions on purpose so you can predict which position should win.

The setup: three cached positions with embeddingSize = 4, where each K[t] points in a different basis direction, each V[t] carries a distinct payload, and Q is aligned with K[1]. If the math is right, position 1 should receive nearly all the attention weight, and the output vector should look mostly like V[1].
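
You can predict the numbers before running anything: the raw dot products Q·K[t] are [0, 5, 0]; dividing by sqrt(4) = 2 gives logits [0, 2.5, 0]; and softmax of those is approximately [0.07, 0.86, 0.07], so position 1 should win decisively.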

Create Chapter9Exercise.cs:

// --- Chapter9Exercise.cs ---

using static MicroGPT.Helpers;

namespace MicroGPT;

public static class Chapter9Exercise
{
    public static void Run()
    {
        // Hand-crafted single-head attention on a 3-position sequence with embeddingSize=4.
        // We skip the Q/K/V projections and just build K, V, and Q directly so we can
        // see exactly which position the query attends to.
        const int EmbeddingSize = 4;

        // Three cached positions. Each key points in a different basis direction.
        var cachedKeys = new List<List<Value>>
        {
            new() { new(1), new(0), new(0), new(0) }, // K[0]
            new() { new(0), new(1), new(0), new(0) }, // K[1]
            new() { new(0), new(0), new(1), new(0) }, // K[2]
        };

        // Each value carries a distinct "payload" so we can see whose information
        // flows through to the output.
        var cachedValues = new List<List<Value>>
        {
            new() { new(10), new(0), new(0), new(0) }, // V[0] = 10 in slot 0
            new() { new(0), new(20), new(0), new(0) }, // V[1] = 20 in slot 1
            new() { new(0), new(0), new(30), new(0) }, // V[2] = 30 in slot 2
        };

        // The current query points strongly in the same direction as K[1].
        // Expectation: position 1 gets the highest attention weight, and
        // the output should look mostly like V[1] = [0, 20, 0, 0].
        var query = new List<Value> { new(0), new(5), new(0), new(0) };

        List<Value> attentionLogits = ComputeAttentionLogits(query, cachedKeys, EmbeddingSize);
        List<Value> attentionWeights = Softmax(attentionLogits);
        List<Value> output = ComputeAttentionOutput(attentionWeights, cachedValues, EmbeddingSize);

        Console.WriteLine("--- Single-Head Attention ---");
        Console.WriteLine($"Q = [{string.Join(", ", query.Select(v => v.Data))}]");
        Console.WriteLine();

        Console.WriteLine("Attention weights (should peak at position 1):");
        for (int t = 0; t < attentionWeights.Count; t++)
        {
            Console.WriteLine($"  position {t}: {attentionWeights[t].Data:F4}");
        }

        Console.WriteLine();
        Console.WriteLine("Output vector (should look mostly like V[1] = [0, 20, 0, 0]):");
        Console.WriteLine($"  [{string.Join(", ", output.Select(v => v.Data.ToString("F3")))}]");

        // Sanity check: position 1 should have the highest weight.
        int topPosition = 0;
        for (int t = 1; t < attentionWeights.Count; t++)
        {
            if (attentionWeights[t].Data > attentionWeights[topPosition].Data)
            {
                topPosition = t;
            }
        }

        Console.WriteLine();
        Console.WriteLine(
            $"Top-attended position: {topPosition} ({(topPosition == 1 ? "PASS" : "FAIL")})"
        );
    }

    // Scaled dot-product attention scores: score[t] = (query . keys[t]) / sqrt(embeddingSize)
    private static List<Value> ComputeAttentionLogits(
        List<Value> query,
        List<List<Value>> cachedKeys,
        int embeddingSize
    )
    {
        var attentionLogits = new List<Value>();
        for (int t = 0; t < cachedKeys.Count; t++)
        {
            var dot = new Value(0);
            for (int j = 0; j < embeddingSize; j++)
            {
                dot += query[j] * cachedKeys[t][j];
            }

            attentionLogits.Add(dot / Math.Sqrt(embeddingSize));
        }
        return attentionLogits;
    }

    // Weighted sum of value vectors, using the attention weights.
    private static List<Value> ComputeAttentionOutput(
        List<Value> attentionWeights,
        List<List<Value>> cachedValues,
        int embeddingSize
    )
    {
        var output = new List<Value>();
        for (int j = 0; j < embeddingSize; j++)
        {
            output.Add(new Value(0));
        }

        for (int t = 0; t < cachedValues.Count; t++)
        {
            Value w = attentionWeights[t];
            for (int j = 0; j < embeddingSize; j++)
            {
                output[j] += w * cachedValues[t][j];
            }
        }
        return output;
    }
}

Uncomment the Chapter 9 case in the dispatcher in Program.cs:

case "ch9":
    Chapter9Exercise.Run();
    break;

Then run it:

dotnet run -- ch9

You should see attention weights of approximately [0.07, 0.86, 0.07] and an output vector dominated by the 20 from slot 1. If you change query to point at K[0] or K[2], the peak moves accordingly. Try it. That's the whole attention mechanism in ~40 lines of arithmetic.
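
For example, pointing the query at K[0] instead:

var query = new List<Value> { new(5), new(0), new(0), new(0) }; // aligned with K[0]

should give weights of approximately [0.86, 0.07, 0.07], with the output dominated by V[0]'s 10.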

Source

This article was originally published by DEV Community and written by Gary Jackson.